Linear Regression Machine Learning (tutorial)

November 22, 2019 · By Stanley Isaacs


[BLANK_AUDIO] Let’s see, let’s see everybody. I’ll minimize myself. I don’t need to see myself. I need to see you guys. I need to see you guys. Hello world, it’s Siraj! I am hyped for this live session. Yo, class is in session everybody. We are about to do some math this
live session, so I’m super excited. But first of all,
let me take some roll call, all right? So, let’s see, Collin,
Brandon, Nil, David, Dakosh, Sebastian, Raj, Spencer,
Naresh, Niko, Clement, hi, guys! Michael, Benjamin, all right. So, that was roll call. Welcome to this live session for
the deep learning course, the Intro to Deep Learning course. Okay. This is going to be so awesome. Because I have been
waiting to do some math. Guess what guys. Guess what. I bought this pad to write some math on. Okay. I’ve never used this before so,
I’m super excited for this. I’m going to show you guys the math. Behind linear regression. By the end of this video, you guys are going to know like the back of your
hand, how to do linear regression. That includes gradient descent. And guess what? We use gradient descent
all over the place in machine learning. Don’t worry if you don’t know what
that is, I’m going to show it to you. Okay? So, we’re going to deep dive into this. So, we’re going to start
off with a five minute Q&A, like always, and I think we’ve got some
Udacity peeps in the house as well. Drew, Nico, and Max, who’s the other instructor for the course. So, if you’re here, shout out, say something so people know who you are. And so, I’m going to
do my five minute Q&A, like always, and I’m going to answer all the questions
related to me, and my everything, but if you have any Udacity specific
questions, they will answer those, okay? So, let’s start up with
a five minute Q&A, and then we’re going to get right
into the code and that, okay? Do I have to know about
partial derivatives? We are going to do a partial derivative,
but I’ll show you how that works. [BLANK_AUDIO] I had to cut off cutie
pie to catch this. Wow I’m honored, I’m honored. Hey baby girl,
let me see that regression. All right, so that’s not a question. Let’s get some real questions in there,
some quality questions. All right. [BLANK_AUDIO] Would you want to check out
my Vive AI Assistant demo? Sure, yes, post a GitHub link in
the comments of one of my videos, I read all my comments,
I answer all my comments. See, I’m not fake,
you know what I’m saying? I answer all my comments. I’m here for you guys, have I
enrolled in- Is calculus required for linear regression? Yes, a little bit of calculus, but I’m going to go through that. Don’t be scared off by the word calculus. This is actually very intuitive. Can you mention some details about the upcoming- looking to predict the genre from- All right. What basic maths will be needed? You’ll need to know basic algebra, okay? And then we’re going to learn the calculus necessary to do this in this video, okay? Are GANs the future? Yes. I mean, the ideas behind generative models in general are really exciting, because you can generate things that don’t exist. And that has a lot of potential for
art and culture. GANs can change culture, right? We can generate music. We can generate art. We can generate paintings in
ways that humans couldn’t. Best book to understand the math behind ML? Machine Learning: A Probabilistic Perspective. That’s a pretty good one. Just math, or coding too? Both. Mostly coding. Linear regression versus
other classifiers like SGDClassifier? Linear regression is definitely easier. What is no free lunch? It’s a theorem, the no free lunch theorem. At a very high level it’s like, well, you can’t make universal assumptions; no single model can be assumed to work best for every problem. When will you do NLP? Yo, I’m going to do so
much NLP in this course. I can’t wait for NLP, it’s coming up. Will you cover GANs? I kind of want to just do GANs
right now, you know what I mean? I’m super excited for
GANs, I will do GANs. Will I give an intuition for why we do gradient descent? Yes, I will explain that. Linear algebra is the way to go? Yes. What’s the difference between scikit-learn and TFLearn? scikit-learn and TFLearn, great question. So, TFLearn is a high-level wrapper on top of TensorFlow. It looks very similar to scikit-learn. scikit-learn uses support vector machines and all sorts of other machine learning models, whereas TFLearn has the same brevity but focuses only on deep neural networks. Do you prefer Weka? No. No. When will you start working with Anaconda? I mean, I’ll most likely start using Docker to contain those things. All right, rap for 50k subs, let me rap for 50k subs. This time,
I’m going to play an instrumental. I’m not going to just rap without an instrumental, you know what I mean? Don’t be discouraged. Rap hip hop instrumental on YouTube, whatever it starts playing. Someone say a keyword and then we’re going to get started. Triumph hip hop instrumental, what is this about? Let’s go, play it. All right, let me just unplug my mic, so you guys can hear this,
where’s the music. I’m going to say something,
you know what I’m saying? [MUSIC] 50k subs. I got 50k subs, man,
my mind is so fresh. I’m looking at this coffee mess,
looks like the best. I got caffeine on my mind,
it takes me so high through the sky. I got a USB 4, my my. I’m going to be writing math today,
like it’s all mine. Online, I see you man, it’s all fine. It’s all writing piece
of equations online. I see you coming back like threw
me your progression, wait. So that was it for the rap. Okay, so, that’s it for the rap. So, now we’re going to get
started with the code, okay? So, let’s go ahead and do this. I’m going to start screen sharing,
and then we’re going to get started. All right, here we go. [SOUND] Here we go, Google Hangouts. All right, and
what does Hangouts want to do? Hangouts wants to screen share. Hangouts wants to screen share. Your entire screen. Share. All right, so I’ll minimize this,
and minimize, and then I’ll move this out of the way,
so I can see what you guys are doing. Okay, and we’re going to code this baby. Okay, I am in the corner here, let me make sure that what you guys are seeing is what I want you to see. Yes, what you guys are seeing is exactly
what I want you to see, perfect. All right. [BLANK_AUDIO] Okay, so here we go. [BLANK_AUDIO] Here’s what we’re going to do guys,
let me make this bigger [SOUND]. This is big enough, right? So in this lesson,
we’re going to do linear regression. And what is linear regression, right? So linear regression, in this case, and
let me make sure everything’s working. Everybody’s here, live chat’s working,
live video is not working. Okay, so here’s how it goes. So we’re going to do this, okay? So this is going to be
called linear regression. This is linear regression and
let me just show you guys. The best way to explain it is
to show it through visuals, so I’ll show it through visuals,
what exactly we’re going to be doing. And to show you visually,
I will give you a link to this, and I will just show it right here. This is what’s happening. So we have a set of points, and these
points are the test scores of students, and the amount of hours studied, okay? So this is what it looks like. So this, right on the right, this graph
here, this set of points is our data set. The x values are the amount
of hours they studied and the y values are the test
scores they got. Okay. And intuitively, to us, there must be some kind of
correlation between these two values. But we want to prove this
programmatically, we want to prove this, I’m sorry, mathematically, we want to
prove that there is a relationship. And how do we prove that
there is a relationship? We draw a line of best fit. So how do we know what that line of best
fit is, or that linear regression is? Well we don’t know, we don’t know. We have to find that, and the way we’re going to find the line
of best fit is using gradient descent. And that process,
that training process looks like this. We’re going to draw a random line,
compute the error for that line. And I’ll talk about how we’re
going to compute that error. And that error value is going to say how well this line fits the data. And then based on that error,
it’s going to act as a compass. It’s going to tell us, well, how best should you redraw the line to be closer to the line of best fit. And we’ll keep doing that. So, it’ll be like draw a line, compute
error, draw a line, compute error, until eventually the line that we draw is the
optimal line that we should draw, okay? So, that’s at a very high level. But now I’m going to
go into the code and we’re going to talk
about this in detail. All right, so
let’s go ahead and start it. So to start off, to start off, I’m
going to write my main function, okay? So let me move all this stuff out of the
way, so I’ll get right into the code, all right? I’ll get right into the code. And guys, if people have questions and
I’m not able to answer them because I’m busy doing something,
please help me answer questions. I very much appreciate it. I very much appreciate it. Okay? So let me just start off by
writing the main function. What does the main function do? That’s where the meat of the code goes. Right, okay so in the main function,
we’ll write a run function, which is where we’re going to
store all of our logic. Okay, so let’s write up a run function. So the run function is a chance for us
to show what we’re doing at high levels, at a high level. So step one, is collect our data, right? Always in machine learning,
we want to collect our data. So we’ll get our data points. And what we’re going to do, how are we
going to collect our data, right? Well, to collect our data, we have to
import the one library that we’re using. I know guys,
we’re using a single library. And that library is NumPy, all right? And we’re going to use this little
symbol that means we don’t have to continually say NumPy whenever we
call its method or its functions. Okay, so what is the function
we’re going to use for NumPy? So the function we’re going to use for
NumPy, I’m sorry, right, main,
thank you, main, good call. So, the function we’re going to use for
NumPy is genfromtxt(). And what this is going to do, is it’s going to get the data
point from our data file. And let me show you guys
the data file as well. But basically we’re going to
separate it by the commas. Okay, and
we’re going to get those points. So what does this,
what does this data look like? Well, let me pull up terminal, and show you guys exactly what
this data looks like. So it looks like this. Okay? So let me zoom in on this. Zoom, way more. 200 percent zoom. So these are just the hours studied,
on the left side, and then the test scores for a bunch of students, for
an intro to computer science class. Okay? The hours studied and
the test scores they got. Okay, so
that’s what we’re going to pull. That’s our data set. That’s what we’re going to
pull into our points variable. So, points is going to contain
a bunch of xy value pairs. Where x is the amount of hours
studied and y is the test score. Okay? And it’s separated by the comma. Okay, so that’s step one. We’ve done that, and genfromtxt is essentially running two main loops. The first loop converts each line of the file to a sequence of strings, and the second one converts each string to the appropriate data type.
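In code, step one looks something like this (a minimal sketch; I’m assuming the data file is named data.csv, since the file name isn’t spoken in the transcript):

```python
from numpy import *

# Step 1: collect our data (each row is "hours studied, test score")
points = genfromtxt('data.csv', delimiter=',')
```

Okay, so that’s step one. Now, step two is to define our hyperparameters. Okay, in machine learning, we have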
what are called hyperparameters. These are tuning knobs for our model. They are basically the parameters that define how our model analyzes the data. How fast it’s spinning through the data. What operations it’s performing on the data. There’s a whole bunch
of hyper-parameters. Thank you for the feedback. There’s a whole bunch
of hyper parameters and what we’re going to use
is the learning rates. Now the learning rate is used
a lot in machine learning, and it basically defines how
fast should our model converge? Convergence means when you
get the optimal result, the optimal model,
the line of best fit, in our case. That is convergence. So how fast should we converge? You might be thinking, well, shouldn’t
the learning rate just be a million, if you want to converge super fast? Well, no. Like all hyper-parameters,
it’s a balance, okay? So if the learning rate is too small,
we’re going to get slow convergence. But if it’s too big, then our error
function might not decrease, okay? So it might not converge. So, that’s our first hyper-parameter. Our next hyper-parameter is going
to be the initial value for b, and the initial value for m. And what is b and m? Well what we’re going to do, is we’re
going to calculate the slope, right? So this looks like a y equals mx plus b,
and so this is why I said we only
need to know basic algebra. This is the formula,
this is the slope formula, okay? All lines follow this formula, where m is the slope, b is the y-intercept, and x and y are the points. Okay, so that’s the line, okay? So, these are our initial values: b, our initial y-intercept, and m, our initial slope. They’re going to start off as 0, okay? And then the last hyperparameter
is going to be the number of iterations. How much do we want to train this model? Well, we have a very,
very small data set. There are only 100 points, okay. And for that, we’re not going to need to iterate a million times or 100,000 times. We’re just going to iterate 1,000 times. Okay? So those are our hyperparameters.
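Step two in code is just a handful of assignments (a sketch, using the values from the video):

```python
# Step 2: define our hyperparameters
learning_rate = 0.0001   # controls how fast our model converges
initial_b = 0            # initial guess for the y-intercept
initial_m = 0            # initial guess for the slope
num_iterations = 1000    # small data set, so 1,000 iterations is enough
```

Now step three is going to be to fit, to train our model. Okay, so the first step is going to be to show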
the starting gradient descent, okay? At b equals,
what is the starting gradient descent? It’s going to be zero, right? And then m is going to be the starting
point, for that we’ll say one. And this is just for
us to see the difference here, okay? All right, .format(initial_b, initial_m). And so, what’s happening here? compute_error_for_line_given_points. So, all right, let me just write
this out and I’ll explain. initial_b, initial_m,
and then the points. Okay, so what’s happening here? Let’s go over what I just wrote here. So, in this line, we’re going to show
the starting b value, the starting m value, so what is our starting
y-intercept, what is our starting slope. And what is our starting error? And I’m going to show you how we’re
going to calculate that error. And to get that error,
given our b and m values, we have this function here called
compute_error_for_line_given_points. It’s going to take the b,
m and the points, and it’s going to compute the error for
that, and it’s going to output that. So, that’s going to be
our starting point, okay? And then, now, we’re going to actually
perform our gradient descent, and it’s going to give us
the optimal b and the optimal slope. I’m sorry, it’s going to give us the optimal slope and the optimal y-intercept. So, for gradient descent,
we’re going to call this method the gradient_descent_runner,
so given the points, given an initial b value, an initial m value, given our learning rate, so this is where we’re going to use all those hyperparameters, right? Because this is where we’re training our model. And the number of iterations. Those are all the things we need for this, okay? And we’re going to define this
function in a second. We’re going to go deep dive and
define these functions. Okay, so then after we print our model,
well now we can just print it out, right? So, let me just copy and paste this. So, now, this is not our starting point, this is now our ending point. So, it prints our ending point, where b is placeholder {1}, m is {2} and then the error is {3}. And these numbers just define what we’re going to see at the end: placeholder {0} is the number of iterations, then b, then m, and then compute_error_for_line_given_points given the final b value, the final m value, and then our points. Okay, so that is the high level of what’s happening here. All I did was print out the initial b and m values, which are zero, and then the error; then I ran gradient descent; and then I printed out the final values.
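Put together, the run function we’ve built looks something like this (a sketch of what’s on screen, assuming the data file is called data.csv; the two helper functions are written next):

```python
from numpy import *

def run():
    # Step 1: collect our data
    points = genfromtxt('data.csv', delimiter=',')
    # Step 2: define our hyperparameters
    learning_rate = 0.0001
    initial_b = 0
    initial_m = 0
    num_iterations = 1000
    # Step 3: train our model
    print('Starting gradient descent at b = {0}, m = {1}, error = {2}'.format(
        initial_b, initial_m,
        compute_error_for_line_given_points(initial_b, initial_m, points)))
    [b, m] = gradient_descent_runner(points, initial_b, initial_m,
                                     learning_rate, num_iterations)
    print('After {0} iterations b = {1}, m = {2}, error = {3}'.format(
        num_iterations, b, m,
        compute_error_for_line_given_points(b, m, points)))

if __name__ == '__main__':
    run()
```

So, I’m about to do this now. Okay, so we haven’t actually done this,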
now we’re going to do it. So, the first thing I’m
going to talk about is how we’re going to compute that error. Let’s write out that first function. What was that first function called? It was called
compute_error_for_line_given_points. Okay, and the data set, I’m going to provide that as well. But let’s go ahead and write up this method, okay? So, this is the first step. We’re going to write up this method: compute_error_for_line_given_points. Okay, I’m so
excited to show you guys this, because I get to use my math pad for
a second. Okay, so let me write this out,
okay?, hold on. Okay, here we go. So, let me write this out. Okay, so we’ve got a line here. Man, what a great line that is. Okay, so this is our plot, okay? And, so
we’ve got a bunch of data points here. We’ve got a bunch of data points, right, all over the place. And what we are going to do is draw a random line through the data. We don’t know the line of best fit, so we are going to draw a random line through the data. And then, we are going to compute
the error of that line, so that error will tell us
how good our line is. Okay, so
how do we know how good our line is? Well, what we’re going to do is, for every single y value on that line, we’re going to calculate the distance from each point in our data to the line. Okay, so all of these distances,
all of these distances, distance one, distance two, distance three, distance
four, distance five, distance six and then you probably have more data
points down here, these distances, the distance to this line. And, so we’re going to take all those
distances and we want to sum them. And, so let me show you the equation for
that, okay? So, rather than actually writing out this equation really sloppily, I’m going to show it to you on screen, okay? So, this is the equation.
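(The equation itself is shown on screen; since the transcript doesn’t capture it, here it is reconstructed from the explanation that follows. It’s the average of the squared distances, the mean squared error:)

$$
\mathrm{Error}_{(m,b)} = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - (m x_i + b)\bigr)^2
$$

So, let me explain what this is. So, we got all those distances,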
right? We got all those distances. We’re going to sum those distances together and then get the average of that. But guess what, we’re not just
going to sum those values alone, we’re going to square those values. And why are we squaring those values? Because, we’re squaring those values,
because we want it first of all to be positive, and it doesn’t really
matter what the actual value is. It’s more about the magnitude
of those values, right? And we want to minimize
that magnitude over time. So, this is the equation for that. Okay, so let me explain what
the hell this is, okay? So, we’re computing the error. We are computing the error
of our line given m and b. So, given m and b we are going to
compute the error of our line. M is our slope and b is our y-intercept. So, this E-looking thing is called sigma notation. It’s a little weird,
giving you guys a little refresher here. This E thing, we’re going to see it a lot in machine
learning, it’s called sigma notation. And basically it’s a way of describing, calculating the sum of a set of values,
all right? So, the sum of a set of values,
which is what we’re doing. We’re calculating the sum of a set
of points, so the starting point is where i equals 1 and the ending point is N, so it covers every point. Okay, so for every point, you want to
calculate the difference in y values. So, it’s y-(mx+b). And why do we say (mx+b)? Because in the slope equation, y equals (mx+b), right? So, it’s y-(mx+b), where (mx+b) is essentially just the y value of the line. So, it’s y minus y, squared. And then we’re doing that for
every single point. And, so we’re going to add
all of those points together. Okay, and then get the average. And, so that why 1/N. Because we’re going to
get the average of that. And that value is the error. Okay? So at a high level, that is what that is. So, now let’s programmatically
write this out, okay? So, we’re going to start by initializing
the error, initialize it at zero. Okay?, so our total error at
the start is just going to be zero. There’s not anything that’s- [BLANK_AUDIO] We don’t have an error yet, okay? So, then for every point, so for
i in range of starting at zero, and then going for
the length of the points, right? So all of our data points, so for
every data point that we have. We’re going to say, let’s get the x value, so x = points[i, 0]. And then we’re going to
get that y value, right? So, get the y value, right? So, I’m just basically
programmatically showing what I just talked about mathematically. Right? So, we’ve got the x value,
we’ve got the y value. And we want to compute that distance,
right? We’re going to do this
every single time. [BLANK_AUDIO] Then get the difference. [BLANK_AUDIO] Square it, and then add it to the total. Okay, so
here’s the actual equation, right? So, we’re going to do plus equal,
because it’s a summation, and we’re going to programmatically show what I
just talked about right here, right? y-(mx+b) squared. Okay? And we’re going to get the sum of that. So, y-(m * x + b) squared, okay? And we’re going to do that for
every point, so this whole iteration loop right here, is that equation,
okay? Minus the average part. So, that’s going to give us the total value. The last part is to average it. So, we’ll take totalError / float(len(points)), so we get a float value. And that is the equation. That is the equation right there. Okay, so then get the average.
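Here’s that function in full, as typed on screen (a sketch; it assumes points is the two-column NumPy array we loaded with genfromtxt):

```python
def compute_error_for_line_given_points(b, m, points):
    # Sum the squared distance from each data point to the line y = mx + b
    totalError = 0
    for i in range(0, len(points)):
        x = points[i, 0]  # hours studied
        y = points[i, 1]  # test score
        totalError += (y - (m * x + b)) ** 2
    # Average the squared distances to get the error
    return totalError / float(len(points))
```

So, this ten-line function just described what I talked about right here in this math equation, okay? We sum all the distances between all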
those points, as I showed right here. We summed them all up, we squared
them and then we got the average. And that is our error. Okay? And we’re calculating that,
because we want a way for us, a measure of us,
something to minimize over time. Right? Something to minimize every
time we redraw our line, we want to minimize this error. Because this error basically is
a signal, it’s a compass for us. It’s telling us,
this is how bad your line is. It needs to get better. You need to make me smaller. I’m really big right now,
make me smaller. And that’s what gradient descent does. That’s what gradient descent does. And I’m going to explain how
gradient descent works in a second. But that’s that curves function, right? Okay?, what was the second
function we wrote? It was called gradient descent runner. So, this is our actual
brain descent function. So, now let’s write this out. Okay?, this is our second of
three methods, before we’re done. So, gradient_descent_runner. So, given a set of points,
given a starting value for b, given a starting value for m, given our learning rates and
given our number of iterations. We’re going to use all of these things
to calculate gradient descents. We’re going to use every single thing. Okay? [BLANK_AUDIO] Okay. So, let’s get that starting b and
m value, okay? So, the starting value for b, we’re going to assign to b. And the starting value for m, we’re going to assign to m. Okay? Simple enough. And now,
we’re going to perform gradient descent. What is gradient descent? I cannot wait to explain
gradient descent, guys. I found the perfect analogy for gradient
descent, and I’m really excited. Okay, before I explain that, let’s just perform it. I can erase this, because the actual math is going to start in the last function that I’m about to write. So, for
every single iteration that we define, we’re going to perform what’s
called gradient descent. So, we’re going to update b and
m with a new, more accurate b and m by performing this gradient step, okay? So, we’re going to get the returned b and m by performing this gradient step. I can already explain, this is where the math is happening. Given our current b, our current m, given the array of points that we have, and then finally given the learning rate, we’re going to calculate that final value of b and m. And guess what? Once this gradient descent is done, we’re going to return that optimal b and m, right? And so, that’s what we talked about at the start, right? We returned that optimal b and m value from the gradient descent, and then we printed it out, because that optimal b and m value gave us the line of best fit. We plug them into the y=(mx+b) formula, and it gave us the line of best fit.
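Here’s the runner we just described, in full (a sketch; step_gradient is the function we’re about to write, and it returns the updated pair of b and m):

```python
def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    # Start from our initial guesses for the y-intercept and slope
    b = starting_b
    m = starting_m
    # Take one gradient step per iteration; each step returns a more accurate b and m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, points, learning_rate)
    return [b, m]
```

So, now we’re going to write out the gradient step. And this is gradient mother f-ing descent. Okay, so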
this is how it’s going to go down, okay? Here’s how it’s going to go down,
step_gradient. So, I’m just going to say, it’s time for
the magic, the magic, the greatest, the greatest, okay? That’s how excited I am, I just wrote “the greatest” twice. Okay, [LAUGH]. So, given our current b and m values, the points, and the learningRate. And this actually isn’t going to help with that, so I’ll delete that. So, here’s the learningRate, okay? Let’s perform gradient descent. So, okay, what is gradient descent? Okay, so let me show you guys this. [SOUND]
How best do I describe this? So, we have. [BLANK_AUDIO] Let me just show you this image. This is going to help a lot. [BLANK_AUDIO] Okay, so this is a graph. So, let’s just look at the graph,
I mean it’s the same graph. It’s looking at it from
two different angles. It’s the same graph, okay? So, let’s look at the one on the left,
just to pick one. It’s the same graph though. We have a bunch of y values,
sorry a bunch of b values, and a bunch of m values. And then we have that error, right? That error that I just talked about,
right? So, the graph shows: given every single y-intercept we could have, and given every single m value we could have, what is the error? Okay, so for every y-intercept and slope pair, what is the error? And so, you will find, this is a three dimensional graph. This is a three dimensional graph. Because the error value, it’s kind
of like, it starts up high, and then it approaches what’s called the local minimum, in our case. A local minimum, which is the small point at the very bottom, that is where we’re trying to get to. Okay so. Given a set of y-intercepts,
and given a set of slopes. Possible y-intercepts and possible slopes, we want to compute
the error for those three things. And if we were to graph the relationship
between these three things, it would look like this. Now, it tends to always
look very similar to this. In more complex cases we’d have many minima, we’d have many little valleys. But what we’re trying to do is get that
point, where the error is smallest. And, so how do we get that point
where the error is smallest? Well, we’re going to perform
what’s called gradient descent to get that smallest point. That value, smallest point. And a great analogy for this is a bowl. So, let me just search bowl, okay? It’s kind of like a bowl. It’s like we drop a ball into a bowl,
and we want to find that point, where the ball stops,
that endpoint, the lowest point. That b, m value is our optimal line of best fit value. Okay? And the way we’re going to
get that is gradient descent. We’re going to descend, right?,
we’re descending down the bowl using the gradient, and
gradient is another word for slope. We’re going to descend down that bowl
until we get, through iteration, that lowest point. And gradient descent is used. Everywhere in machine learning. Okay? It is like the optimization method for
deep neural networks. It’s not that apparent right now. But know this. Know and understand gradient descent
like the back of your hands, because it is going to be very
useful in the future, okay? So. I don’t know why I’m showing that equation. That was unnecessary. That was the equation for the sum of squared errors that we just talked about, the sum of squared distances. So, how are we going to
calculate that gradient descent? Well, now let’s actually do it. So, for our step_gradient function, we’ll start off with an initial gradient value for b. So, the b gradient is going to be zero and the m gradient is going to be zero as well, okay? These are the starting points for
our gradients and gradient means slope. And, so the gradient is going
to act like a compass, and it’s going to always point down hill,
so this is what I mean by, once we calculate that error,
it’s going to act as a compass for us. It’s going to tell us. Where we should be going? What direction we should be going? How we should next redraw our lines. So for- [BLANK_AUDIO] Okay, someone asked why is
the lowest point the best? The lowest point is the best, because
it is where our error is the smallest. And when our error is the smallest, that’s when we have
the line of best fit. When the error is smallest,
that b and m value, those two, what we plug into our slope equation, is
going to give us the line of best fit. So, that’s why we’re
calculating the error, okay? So. [BLANK_AUDIO] So, for i in range[0, len[points]). [BLANK_AUDIO] Okay, so what we’re going to do is we’re
going to iterate through every single point on our scatter plot. Okay, so every single data point that
we have, we’re going to collect it. Okay, so we’re going to say, okay, let’s grab our first point, right? First point, which gives us an x value and a y value. X value and y value. So, let me also write out
a little comment for this. Starting points for our gradients, okay? [BLANK_AUDIO] Now, we’re going to get the direction
with respect to b and m. Now, this is the last part, but
it’s a very, very important part. And this is where calculus
comes into play, okay? So, I’m going to talk about
how we’re doing this. Okay, so let me talk about
what we’re about to do. So, what we’re going to do, is so,
given for every single point, for every single point that we have, we’re going to calculate what’s
called the partial derivative, okay? It’s called the partial
derivative with respect to b and with respect to m, okay? And what that’s going to do, is it’s
going to give us a direction to go for both the b value and the m value, right? So, remember, in this graph,
we want a direction, right? We want to be going down the gradient. And, so on this left hand side
you see this gradient search. The m values and the b values are
increasing in the direction that they should be, because gradient descent is essentially a search policy. It’s a search policy. We’re trying to find
that minimum error value. Okay? And what we’re going to do to get that, is we’re going to compute the partial
derivative with respect to b and m. Okay, let me show you the equation for
the partial derivative, okay? The partial derivative is
going to be right here. [BLANK_AUDIO] So, this is what the partial
derivative does. The partial derivative, we call it partial, because it’s not
telling us the whole story, right? We say, it’s partial, because we’re
calculating it for both b and m. There are two different derivatives. And so, it’s going to give us the tangent line. So, it’s going to give us this line as you see right here, right? See this line, that line is our direction. And we’re going to use it to update our b and m values. Okay? So, that’s what that is. And let me also show you the equations
for the partial derivatives, because we’re about to write them out. So, here’s what the equations for the partial derivatives with respect to m and b look like. Okay? They’re two different equations, right?
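(These are also shown on screen rather than spoken; reconstructed from the error function above, they are:)

$$
\frac{\partial E}{\partial m} = -\frac{2}{N}\sum_{i=1}^{N} x_i\bigl(y_i - (m x_i + b)\bigr)
$$

$$
\frac{\partial E}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N} \bigl(y_i - (m x_i + b)\bigr)
$$

So, let’s talk about the one on top. So, this little curvy thing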
that you see up here, that just signifies that this
is a partial derivative. That’s that signifier that
this is a partial derivative. Now, we talked about sigma notation,
right?, because it’s a summation of values,
right? And that’s what we’re doing. We’re summing the partial derivative for
all of our points, okay? For all of them to compute
that gradient value, okay? And the partial derivative with respect to m and b is going to look like this. So, let’s write this out, okay? So, it’s going to give us two values. So, the b gradient is going to be plus-equals, and then what was it? Let me look at the equation again. 2 over N, so negative 2 over N, all right? Thanks, good vibes. And then it was y minus, right? And these are the equations, they are laws. They are beautiful laws that always stay the same. And they give us a way of understanding the direction that we want to move in. Okay, so: (m_current * x) + b_current. Okay so. All right, so then we’ll do the same thing, and what was the second equation? It looked pretty much the same, except this one does have this x, right? The first one doesn’t have the x, right? So, we’ll say, it does have this 2 over N. It does have this 2 over N, and then it does have this x, and it does have (y - ((m_current * x) + b_current)), okay? Okay, so now, we’ve computed
our partial derivatives, right? So, let me one more time show you guys. It’s giving us directions to go for
both b and m. And remember, they’re partial. It’s not telling us the whole story,
it’s telling us what direction should we go for b, and
what direction should we go for m? And it’s going to tell us the direction, remember the bowl, to get to that bottom point, where that error is the smallest, right here, okay? So, right here where my mouse is,
that point is what we want to get to, and that’s what the partial
derivative is going to help us with. So, once we’ve computed
the partial derivatives, the sum of them with respect to b and m, now we’re going to update our b and
m values, right? So, we’re going to use that
to update our b and m values. And guess what? This is our last step. This is our last step using
this partial derivative. Using our partial derivatives, right, plural? There’s two of them. So, that’s going to give us a new value for b and m, our updated b and m value. So, we have our current value for b, whatever it is, that we fed into the step_gradient function, which keeps updating every time. And this is where our learning_rate comes into play, okay? This is why our learning rate is so important, because it defines the rate at which we’re updating our b and m values, right? So, remember that 0.0001, right? And then also our m_current: that’s minus the learning_rate times the m gradient. Okay, and then it’ll return those values. And we’re doing this every time, right? This is new_b and new_m, they’re our final b and m. It’s a step function, where we’re
doing this every iteration, right? We’re doing this for
the number of iterations we had 1000. But it’s going to return a new b and
m value every time. And guess what guys? That’s it for our code.
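Here’s the step_gradient function in full, matching the walkthrough above (a sketch of what’s on screen):

```python
def step_gradient(b_current, m_current, points, learningRate):
    # Starting points for our gradients
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        # Partial derivatives of the error with respect to b and m (see equations above)
        b_gradient += -(2 / N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2 / N) * x * (y - ((m_current * x) + b_current))
    # Step downhill: update b and m, scaled by the learning rate
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]
```

That was it, so let’s go over what we’ve done. Okay, but actually let me check for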
errors, right? [BLANK_AUDIO] Let me check for errors, and
then I’m going to answer more questions, because I really want to make sure you
guys understand how this works, okay? So, let me demo this. So, python demo.py. ‘N’ is not defined. Okay, right, guess what, I didn’t define N. N is the number of points, the length of points. Okay? So, let’s go. ‘learning_rate’ is not defined. Where? Where is learning rate not defined? Learning rate is not defined. Wait a second. Yeah, right. Learning rate, right. Okay, what else is bad? I’ve got an overflow for double scalars. 14, y minus... uh-huh, uh-huh, uh-huh. [INAUDIBLE] Okay, so. What’s going on here? Okay, let’s save this. So yeah, it printed out the final, okay
so it got our final value right here. And if we wanted to,
let’s see, hold on a second. If we wanted to,
we got our backup here just in case. So right? So let me blow this up. Like way, way up. Let me just separate it. So this is what our outputs
going to look like. Right. So boom! Just like that. That’s how fast it trained, in milliseconds. Why? Because our data set is so small. Okay, the data set was so small. Alright, so. That’s what happened, and
after a thousand iterations, we got the optimal b and m values. So, right as we started, with b and m at 0, we calculated the error for our random line that we drew, and it was huge. But eventually, after running gradient descent, we got the optimal b, the optimal m, and the lowest error point, which is at the smallest point in the bowl. And to do that we used gradient descent with respect to b and m. Okay, so let me go over one last time
every single thing that we just did, to really go over it, and then we’ll do my last five minute Q&A, okay. So we started out by collecting our data set, right. Our data set was a collection of test scores and the amount of hours studied, right. The x-y values: the test scores and the amount of hours studied, a two-variable data set. Then we defined our hyperparameters for our linear regression. Our learning rate, which talks
about how fast we should learn; our initial b and m values for the slope equation, y=mx+b; the number of iterations, 1,000,
because our data set is pretty small. And then we ran gradient descent. So, what did gradient descent look like? Well for every iteration, for a thousand iterations, we computed the
gradients with respect to both b and m. And we did that constantly,
until we got that optimal b and m value. That gives us that line of best fit. Now, how did we compute the gradients? To do that, we said, okay, we’ll have a starting point
of 0 for both of those gradients. Remember, gradient is just
another word for slope. And then we said, okay so for every single point in our scatter plot,
for our data, we’ll compute the partial derivative with respect to both b and m. And those two values are going
to give us a direction, a sense of direction of
where we want to go. How do we get to that lowest point in that bowl, right? That three dimensional graph, that lowest point. And we use the learning rate to determine how fast we want to update our b and m values; we got the difference between the current value and what we had before, and we returned that. So for every point, we did that for
a thousand iterations, okay? And that’s what gave us the output and it looks like, visually,
it looks like this. [SOUND] Right? It’s like up, up, up, up, up,
up, up, up, up, up, up, up. It’s kind of like Wheel of Fortune,
right? It starts off fast, and it gets slower
and slower as it approaches convergence, the word we use when we have the optimal
line of best fit, convergence. See, let me do it one more time. Up, just like that, okay? So that was that, and now I’m going to screen share and
do a last five minute Q and A. Alright, stop screen share. Hi everybody, okay,
let me bring you guys back on screen, do my last five minute Q and
A, ask me anything and yeah. How’s it going everybody? [BLANK_AUDIO] Any questions? I’m open to questions. [BLANK_AUDIO] Where did I use NumPy? It’s at the very top. So, right, what’s the practical
use of linear regression? Great question. Any time we want to find
the relationship between two different variables. And then in more complex
cases there could be more. But, we want to prove mathematically. Right? Math is all about proving things
in a way that is unfalsifiable, that no one can say,
hey, that’s not true. Well I can prove it mathematically. So it’s a way to show the relationship
between two value pairs. So maybe housing prices,
and the time of year right? What is the real estate
market going to look like? Any time intuitively you think there
is a relationship, you can prove it with linear regression. But really, I did this to show gradient descent. That optimization process is very popular in deep learning and we’re going to use it in our deep neural networks for the rest of the course, okay? And why TensorFlow, by Google? Because it is the best deep learning library that is out there right now. That’s why. And of course it would be,
because Google knows what they’re doing. They handle billions and
billions of queries every day. They have to be able to do
machine learning at scale. And, problems, they solve problems that
no one else has even thought of solving. And all of those solutions
are found in TensorFlow. For machine learning, think of the AI doctor. You can create a classifier to classify between different types of disorders that you see in an x-ray. That’s going to augment doctors at first, but eventually replace them. How about fitting a quadratic
curve inside of a linear line? We could do that as well. [BLANK_AUDIO] I’m going to provide the data set and
the code. I can talk slower, sure. How to find the optimal learning rate? That’s a great question. There are several methods of doing that, but that’s great intuition. Sometimes we can use machine learning to
find the optimal hyper-parameters, so it’s kind of like machine learning for
machine learning, but we’ll talk about that later. Is this the first course with calculus? I’ll do more of that in the future, I’m going to keep doing calculus, okay? Two more questions then
we’re good to go, two more. How would you recommend me
to start machine learning? Watch this series. And watch my Learn Python for
Data Science series, watch my Intro to TensorFlow series, watch my
Machine Learning for Hackers series. Watch my videos. Why is your Udacity course so expensive? I didn’t decide the price, guys. I tried to get it low. It’s whatever. You get paid graders for that, okay. And grading is not cheap, okay, human graders. But look, all the videos are going to be
released here on my channel all right. So I’m here for you guys, okay? I’m trying to grow my brand. I’m trying to grow myself,
Siraj Raval, okay? This is the end, okay? So that’s it for the questions. And all right, so for now, I’ve gotta go shoot an ending scene for my next video. What? Yeah, so, thanks for watching. [SOUND] Love you guys. I’ll post the link in the comments
right when I’m done alright? The video description. I’ll post the GitHub link, and
then the data set, everything. So check the description within the hour, okay? Bye! Okay.