An intuitive overview of Gradient Descent

Understanding the optimizer function for linear regression

The final step in the linear regression model is creating an optimizer function to improve our weights and bias. I'm going to explain how gradient descent works in this article, and also give you a quick explanation of what a derivative and a partial derivative are, so you can follow the process. So if you haven't studied calculus yet, don't worry. You won't become a calculus expert by reading this article (I'm certainly not one), but I think you'll be able to follow the process of gradient descent a little bit better.

I will also link an article which helped me understand the math behind partial derivatives, which you can look at to fill in some details I won't be covering here.

Why do we need gradient descent?

The goal of gradient descent is to minimize the loss. In an ideal world we want our loss to be 0 (but keep in mind that this isn't realistically possible). We minimize the loss by improving the parameters of the model, which are the weight w and the bias b in linear regression. We improve those parameters either by making them larger, or smaller - whichever makes the loss go down.

If you need to get up to date:

Read my article on Linear Regression

Read my article on Mean Squared Error from last week

How long does gradient descent...descend?

Gradient descent is an iterative process - this is just a fancy way to say that the process repeats over and over again until you reach some condition for ending it.

The condition for ending could be:

  1. We are tired of waiting: i.e. we let gradient descent run for a certain number of iterations and then tell it to stop
  2. The loss is minimized as much as we need for our problem: i.e. the loss is equal to or less than a certain number that we decide on
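
Here is a rough sketch of those two ending conditions. The names `max_iterations` and `target_loss` are made up for illustration, and the "loss" simply halves each round as a stand-in for real training.

```python
def run_until_done(max_iterations, target_loss):
    """Toy loop: the loss just halves each round, standing in for real training."""
    loss = 100.0
    for iteration in range(max_iterations):
        # Condition 2: the loss is already low enough for our problem
        if loss <= target_loss:
            return iteration, loss
        loss = loss / 2  # pretend one round of gradient descent happened
    # Condition 1: we are tired of waiting, so we stop after max_iterations rounds
    return max_iterations, loss


print(run_until_done(max_iterations=3, target_loss=1.0))    # stops because of the iteration limit
print(run_until_done(max_iterations=100, target_loss=1.0))  # stops because the loss is low enough
```

In practice you can use either condition on its own, or both together as shown here.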

How can we improve the loss?

This is where derivatives come in.

What does a derivative do?

For a given function:

  • It tells us how much a small change in the input will change the output of the function
  • For example, for the MSE loss function: how much will changing w a little bit change the loss?
  • Basically, the derivative tells us the slope of the function's curve at a given point
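
To make the "slope at a point" idea concrete, here is a minimal sketch that estimates a derivative by nudging the input a tiny bit. The function `toy_loss` and the helper `slope_at` are made up for this example; they are not part of the linear regression code.

```python
def slope_at(f, w, step=1e-6):
    """Estimate the slope of f at w by nudging w a tiny bit in both directions."""
    return (f(w + step) - f(w - step)) / (2 * step)


def toy_loss(w):
    # a made-up loss that is smallest at w = 3
    return (w - 3) ** 2


print(slope_at(toy_loss, 5.0))  # positive slope: making w larger makes the loss larger
print(slope_at(toy_loss, 1.0))  # negative slope: making w larger makes the loss smaller
```

The sign of the slope tells us which direction to move w, and its size tells us how steep the loss is at that point.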

But, what is a partial derivative?

A partial derivative is the derivative of a function that has more than one variable, taken with respect to just one of those variables. In the linear regression equation we have w and b which both can change, so there are two variables that can affect the loss. We want to isolate each of those variables so that we can figure out how much w affects the loss and how much b affects the loss separately.

  • So we measure the derivative, or the slope, one variable at a time
  • Whichever variable we are not measuring, we treat as a constant: we hold it fixed at its current value while we nudge the other one
  • First we calculate the derivative of the loss with respect to w
  • And then we calculate the derivative of the loss with respect to b
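
As a rough illustration of "one variable at a time", the sketch below estimates both partial derivatives numerically. The helpers `partial_w` and `partial_b`, the data, and the starting values w = 1.0 and b = 0.5 are all made up for this example, and the loss is assumed to be the mean of (prediction - target) squared.

```python
import numpy as np


def predict(X, w, b):
    return X * w + b


def mse(X, Y, w, b):
    return np.average((predict(X, w, b) - Y) ** 2)


def partial_w(X, Y, w, b, step=1e-6):
    # nudge w only; b keeps whatever value it currently has
    return (mse(X, Y, w + step, b) - mse(X, Y, w - step, b)) / (2 * step)


def partial_b(X, Y, w, b, step=1e-6):
    # nudge b only; w keeps whatever value it currently has
    return (mse(X, Y, w, b + step) - mse(X, Y, w, b - step)) / (2 * step)


X = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 5.0, 7.0])  # made-up data following y = 2x + 1

print(partial_w(X, Y, 1.0, 0.5))  # how much the loss changes when only w changes
print(partial_b(X, Y, 1.0, 0.5))  # how much the loss changes when only b changes
```

Both results come out negative here, which says that increasing w or b from these starting values would lower the loss.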

Rather than illustrating the formula for partial derivatives of MSE here (which I am still learning to understand myself), I am going to include a link to a very helpful article that goes through the mathematical formula step by step for finding the partial derivatives of mean squared error. The author basically does what I was hoping to do in this article before I became a little overwhelmed by the amount of background I would need to provide.

Once you know the derivatives, how big of a step do you take when updating w and b?

Now that we have calculated the derivatives we need to actually use them to update the parameters w and b.

We will use something called the Learning Rate to tell us how big of a step to take in our gradient descent. It is called the learning rate, because it affects how quickly our model will learn the patterns in the data. What do we do with it? We use it to multiply the derivative with respect to w and b when we update w and b in each iteration of training our model.

So, in short, it's a number that controls how quickly our parameters w and b change. A lower learning rate will cause w and b to change slowly (the model learns slower), and a higher learning rate will cause w and b to change more quickly (the model learns faster).
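
Here is a small sketch of how the learning rate changes the size of each step. It uses a made-up one-parameter loss, (w - 3) ** 2, rather than the real regression loss, and the function name `descend` is just for illustration. One thing it also shows, which goes a little beyond the text above: a learning rate that is too high can overshoot the minimum instead of learning faster.

```python
def descend(lr, steps=5, w=0.0):
    """Run a few updates on the toy loss (w - 3) ** 2, whose derivative is 2 * (w - 3)."""
    for _ in range(steps):
        derivative = 2 * (w - 3)
        w -= lr * derivative  # step size = learning rate times derivative
    return w


print(descend(lr=0.01))  # small steps: w has barely moved toward 3
print(descend(lr=0.1))   # larger steps: w gets noticeably closer to 3
print(descend(lr=1.1))   # too large: every step jumps past 3 and w ends up farther away
```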

How does gradient descent work along with linear regression?

Remember in my overview of linear regression article I discussed how after we find the loss we'll need to use that information to update our weight and bias to minimize the loss? Well we're finally ready for that step.

A quick summary before we get started with the code. We have a forward pass, where we calculate our predictions and our current loss, based on those predictions. Then we have a backward pass, where we calculate the partial derivative of the loss with respect to each of our parameters (w and b). Then, using those gradients that we gained through calculating the derivatives, we train the model by updating our parameters in the direction that reduces the loss. We use the learning rate to control how much those parameters are changed at a time in each iteration of training.

Here are the steps for one iteration of gradient descent in linear regression:

A bit of review - Moving forwards

This is called the forward pass:

  1. So we initialize our parameters

    • we can start them off at 0
    • or we can start them off at random numbers (but I've decided to start them at 0, to simplify the code)
  2. We calculate linear regression with our current weight and bias

  3. We calculate the current loss, based on the current values for w and b
# import the Python library used for scientific computing 
import numpy as np 

# predict function, based on y = wx + b equation
def predict(X, w, b):
    return X * w + b

# loss function, based on MSE equation
# (square the difference between prediction and target, then average)
def mse(X, Y, w, b):
    return np.average((predict(X, w, b) - Y) ** 2)

And now the new stuff - Moving backwards

This part is called the backward pass:

  1. Using the current loss we calculate the derivative of the loss with respect to w...
  2. ...and with respect to b
# calculate the gradients 
def gradients(X, Y, w, b): 
    # differentiating (prediction - Y) ** 2 gives a positive 2 * (prediction - Y)
    w_gradient = 2 * np.average(X * (predict(X, w, b) - Y))
    b_gradient = 2 * np.average(predict(X, w, b) - Y)
    return (w_gradient, b_gradient)
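
For reference, these are the standard partial derivatives of mean squared error for the line y = wx + b, written in LaTeX. The symbols n (the number of examples) and the index i are notation I am assuming here; they do not appear in the code. Note that the factor of 2 is positive when the difference is written as prediction minus target.

```latex
L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( (w x_i + b) - y_i \bigr)^2

\frac{\partial L}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} x_i \bigl( (w x_i + b) - y_i \bigr)

\frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} \bigl( (w x_i + b) - y_i \bigr)
```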

And using the gradients to train the model

  1. Then we update the weight and bias by subtracting each derivative multiplied by the learning rate, which moves them in the direction that reduces the loss
  2. Then we repeat that process as long as we want (set by the number of iterations) to reduce the loss as much as we want
# train the model 
# lr stands for learning rate 
def train(X, Y, iterations, lr): 
    # initialize w and b to 0 
    w = 0
    b = 0

    # empty lists to keep track of the parameters' current values and the loss
    # (the loss list is named losses so it doesn't shadow the mse function)
    log = []
    losses = []

    # the training loop
    for i in range(iterations):
        w_gradient, b_gradient = gradients(X, Y, w, b)
        # update w and b
        w -= w_gradient * lr
        b -= b_gradient * lr

        # record the new parameters and recalculate the loss to see how we are doing
        log.append((w, b))
        losses.append(mse(X, Y, w, b))
    return w, b, log, losses
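
To see the whole process run end to end, here is a condensed, self-contained sketch (without the logging lists) trained on made-up data. The data, the 5000 iterations, and the learning rate of 0.01 are my own choices for illustration, and the gradients assume the loss is the mean of (prediction - target) squared.

```python
import numpy as np


def predict(X, w, b):
    return X * w + b


def mse(X, Y, w, b):
    return np.average((predict(X, w, b) - Y) ** 2)


def gradients(X, Y, w, b):
    error = predict(X, w, b) - Y
    return 2 * np.average(X * error), 2 * np.average(error)


def train(X, Y, iterations, lr):
    w, b = 0.0, 0.0
    for _ in range(iterations):
        w_gradient, b_gradient = gradients(X, Y, w, b)
        w -= w_gradient * lr
        b -= b_gradient * lr
    return w, b


X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 2 * X + 1  # made-up data generated from a known line

w, b = train(X, Y, iterations=5000, lr=0.01)
print(w, b)             # should land close to 2 and 1
print(mse(X, Y, w, b))  # should be close to 0
```

Because the data was generated from y = 2x + 1, we know what answer to expect, which makes this a handy sanity check for the gradient code.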

A parting note: There are tricks to avoid using explicit loops in your code, so that the code will run faster when we start to train on very large datasets. But to give an idea of what is going on, I thought it made sense to visualize the train function as a loop.

I hope you enjoyed this overview of Gradient Descent. My code might not be very elegant, but hopefully it gives you an idea of what's going on here.

If you like this style of building up the functions used in machine learning models a little bit at a time, you may enjoy this book, Programming Machine Learning, whose code I relied on in preparing this article.