Loss Function + Backpropagation + Gradient Descent

Joshua Wood · 7 min read

In the previous lesson we discussed the math behind deep learning. In this blog post we will cover some of the foundations of every machine learning model: the loss function, then backpropagation, and finally gradient descent.

So, let's first go through the loss function.

Loss Functions

We have seen how we get to an output! Now, what do we do with it? When the network produces an output, we calculate its error using a loss function: our predicted values are compared with the actual values in the training data. Two loss formulas are commonly used:

  • Mean squared error, which is most likely familiar to you if you have come across linear regression. This gif below shows how mean squared error is calculated for a line of best fit in linear regression.
  [Animation: mean squared error calculated for a line of best fit in linear regression]
  • Cross-entropy loss, which is used for classification learning models rather than regression.

You will learn more about this as you use loss functions in your deep learning models.
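
To make these two losses concrete, here is a minimal NumPy sketch (the function names and example values are illustrative assumptions, not part of any particular library):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between targets and predictions
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true holds one-hot class labels, y_pred holds predicted probabilities;
    # eps keeps us from taking log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Regression example
print(mean_squared_error(np.array([3.0, -0.5, 2.0, 7.0]),
                         np.array([2.5, 0.0, 2.0, 8.0])))   # 0.375

# Classification example: two samples, three classes
labels = np.array([[1, 0, 0], [0, 1, 0]])
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(labels, probs))   # ~0.29
```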

Backpropagation

This all seems fine and dandy so far. However, what if our output values are inaccurate? Do we cry? Try harder next time? Well, we can do that, but the good news is that there is more to our deep learning models.

This is where backpropagation and gradient descent come into play. Forward propagation feeds the input values through the hidden layers to the final output layer. Backpropagation then works backwards from the loss, computing the gradient of the loss with respect to each weight. Those gradients are fed into an algorithm known as gradient descent, which continuously updates and refines the weights between neurons to minimize our loss function.

By gradient, we mean the rate of change of our loss function with respect to its parameters. From this, backpropagation determines how much each weight contributes to the error in our loss function, and gradient descent updates the weight values accordingly to decrease this error.
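
As a rough sketch of what "computing gradients" means, here is the chain rule written out by hand for a single sigmoid neuron with a squared-error loss (the numbers and variable names are made up for illustration, not a general-purpose implementation):

```python
import numpy as np

# One training example for a single neuron: y_pred = sigmoid(w * x + b)
x, y_true = 2.0, 1.0
w, b = 0.5, 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass
z = w * x + b
y_pred = sigmoid(z)
loss = (y_pred - y_true) ** 2

# Backward pass: chain rule from the loss back to each parameter
dloss_dypred = 2.0 * (y_pred - y_true)
dypred_dz = y_pred * (1.0 - y_pred)   # derivative of the sigmoid
dz_dw, dz_db = x, 1.0

grad_w = dloss_dypred * dypred_dz * dz_dw
grad_b = dloss_dypred * dypred_dz * dz_db
print(grad_w, grad_b)  # how much each parameter contributes to the error
```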

Gradient Descent

We have the overall process of backpropagation down! Now, let’s zoom in on what is happening during gradient descent.

If we think about the concept graphically, we want to find the minimum point of our loss function, because this is where the error is smallest. If we start at a random point on our loss function, gradient descent takes “steps” in the “downhill” direction, i.e., along the negative gradient. The size of each “step” depends on our learning rate. Choosing a good learning rate is important because it affects both the efficiency and the accuracy of our results.

The formula used with learning rate to update our weight parameters is the following:

parameter_new = parameter_old − learning_rate ⋅ gradient(loss_function(parameter_old))

The learning rate we choose affects how large the “steps” are when we try to optimize our error function. Initial intuition might suggest choosing a large learning rate; however, as illustrated in the figure below, this can make us overshoot the value we are looking for and cause a divergent search.

Now you might think that you should choose an incredibly small learning rate; however, if it is too small, it could cause your model to be unbearably inefficient or get stuck in a local minimum and never find the optimum value. It is a tricky game of finding the correct combination of efficiency and accuracy.
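
To see this trade-off in action, here is a tiny sketch of gradient descent on a one-parameter toy loss (the quadratic loss and the specific learning rates are arbitrary choices for illustration):

```python
def gradient_descent(learning_rate, steps=20, start=10.0):
    # Toy loss: L(p) = p**2, whose gradient is 2 * p and whose minimum is at p = 0
    p = start
    for _ in range(steps):
        grad = 2.0 * p
        p = p - learning_rate * grad   # the update rule from above
    return p

print(gradient_descent(0.01))  # too small: still far from 0 after 20 steps
print(gradient_descent(0.1))   # reasonable: close to the minimum
print(gradient_descent(1.1))   # too large: overshoots and diverges
```

With the too-large rate, each step flips the sign of the parameter while growing its magnitude, which is exactly the divergent behaviour described above.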

[Figure: effect of learning rate size on gradient descent convergence]

Stochastic Gradient Descent

This leads us to the final point about gradient descent. In deep learning models, we are often dealing with extremely large datasets. Because of this, performing backpropagation and gradient descent calculations on all of our data may be inefficient and computationally exhaustive no matter what learning rate we choose.

To solve this problem, a variation of gradient descent known as Stochastic Gradient Descent (SGD) was developed. Let’s say we have 100,000 data points and 5 parameters. If we did 1,000 iterations (also known as epochs in deep learning), we would end up with 100,000 ⋅ 5 ⋅ 1,000 = 500,000,000 computations. We do not want our computer to do that many computations on top of the rest of the learning model; it would take forever.

This is where SGD comes into play. Instead of performing gradient descent on our entire dataset, we pick out a random data point to use at each iteration. This cuts back on computation time immensely while still yielding accurate results.
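
A minimal sketch of the idea, using a one-parameter linear model and synthetic data as stand-ins (the data, learning rate, and variable names are assumptions for illustration only):

```python
import numpy as np

# Synthetic data: y = 3x + noise; we want to learn the slope w
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100_000)
y = 3.0 * x + rng.normal(scale=0.1, size=x.shape)

w = 0.0
learning_rate = 0.1
for _ in range(1000):
    i = rng.integers(len(x))               # pick ONE random data point
    grad = 2.0 * (w * x[i] - y[i]) * x[i]  # gradient of (w*x - y)**2 at that point
    w -= learning_rate * grad

print(w)  # noisy, but close to the true slope of 3
```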

The diagram below shows the performance differences between SGD and GD. You may notice that the SGD curve is a bit more erratic. There is a reason for this, and we will address it in the next exercise.

The main point is that both will reach the ideal parameters, and SGD will be easier and more efficient for your computer processor. Because of this, SGD is almost universally used in favor of normal GD.

[Figure: convergence paths of SGD vs. standard gradient descent]

However, as we will see next, there are even more variants of gradient descent.

More Variants of Gradient Descent

Just when you thought SGD solved all our problems, even more options come into the picture!


There are also other variants of gradient descent, such as the Adam optimization algorithm and mini-batch gradient descent. Adam is an adaptive algorithm that maintains an individual learning rate for each parameter. Mini-batch gradient descent is similar to SGD, except that instead of iterating on one data point at a time, we iterate on small batches of fixed size.

The Adam optimizer’s adaptive learning rates have made it a popular variant of gradient descent, and it is commonly used in deep learning models. Mini-batch gradient descent was developed as a trade-off between GD and SGD: since each update does not depend on just one training sample, the loss curve is much smoother and less affected by outliers and noisy data, which often makes it a better choice than plain SGD.
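
For comparison, here is the same toy problem from the SGD sketch above, updated with a mini-batch of 32 points per step (the batch size and other values remain illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100_000)
y = 3.0 * x + rng.normal(scale=0.1, size=x.shape)

w = 0.0
learning_rate, batch_size = 0.1, 32
for _ in range(1000):
    idx = rng.integers(len(x), size=batch_size)            # random mini-batch
    grad = np.mean(2.0 * (w * x[idx] - y[idx]) * x[idx])   # gradient averaged over the batch
    w -= learning_rate * grad

print(w)  # smoother path to roughly 3 than single-sample SGD
```

Averaging the gradient over the batch smooths out the noise from any single point, which is why the mini-batch curve looks steadier than the SGD one.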

These are just some quick notes! You can read more about Adam here and more about mini-batch here. Experts in deep learning are constantly coming up with ways to improve these algorithms to make them more efficient and accurate, so the ability to adapt and build upon what you learn as you dive into this domain will be key!

In the next lesson, we will implement neural networks.


About Joshua Wood

Joshua is a Microsoft Azure Certified Cloud Professional and a Google Certified Associate Cloud Engineer. He works in Data Analytics at Acme, specializing in the use of cloud infrastructure for Machine Learning and Deep Learning operations at scale.
