Gradient descent has a really simple and intuitive explanation.
The algorithm is easy to understand once you realize that it is essentially hill climbing with a very simple strategy.
Let's see how it works!
🧵 👇🏽
For functions of one variable, the gradient is simply the derivative of the function.
The derivative expresses the slope of the function's tangent line, but it can also be viewed as a one-dimensional vector!
When the function is increasing, the derivative is positive. When decreasing, it is negative.
Translating this to the language of vectors, it means that the "gradient" points to the direction of the increase!
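Here is a tiny sanity check of that claim (my own toy example, not from the thread): for f(x) = x², stepping a small amount in the direction the derivative points always increases f.

```python
import math

def f(x):
    return x ** 2

def df(x):
    return 2 * x  # derivative of x^2

for x in (-1.0, 2.0):
    direction = math.copysign(1.0, df(x))  # unit "vector" along the derivative
    assert f(x + 0.1 * direction) > f(x)   # stepping that way increases f
```

At x = -1 the derivative is negative, so the step goes left, where x² grows; at x = 2 it goes right. Either way, f increases.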
This is the key to understanding gradient descent.
Although the concept of the gradient gets a bit more complicated for multivariate functions, the above observation still holds.
Instead of two directions, we have infinitely many. So here, the gradient points in the direction of the steepest increase!
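To see the "steepest increase" claim in action, here is a toy check (my own example): for f(x, y) = x² + y², a small step along the normalized gradient increases f more than an equal-length step along the axis directions.

```python
import math

def f(x, y):
    return x * x + y * y

x, y = 1.0, 2.0
gx, gy = 2 * x, 2 * y            # gradient of f at (x, y)
norm = math.hypot(gx, gy)
eps = 1e-3                       # step length

# Gain from stepping along the (unit-length) gradient direction.
gain_grad = f(x + eps * gx / norm, y + eps * gy / norm) - f(x, y)

# It beats the same-length step in each axis direction.
for dx, dy in [(1, 0), (0, 1), (-1, 0), (0, -1)]:
    assert gain_grad > f(x + eps * dx, y + eps * dy) - f(x, y)
```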
When we want to maximize a function, we simply take small steps in the direction of the steepest increase. (This variant is called gradient ascent.)
Take a look at the update formula, and you'll spot it immediately.
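The image with the formula isn't preserved here; one common way to write the ascent update (my notation: x for the current point, γ for the step size) is:

```latex
x_{k+1} = x_k + \gamma \, \nabla f(x_k)
```

Each step moves the point x a little way along the gradient, i.e. uphill.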
This is why the algorithm can be viewed as hill climbing.
The optimum value is the peak, and the plan to reach it is simply to go towards where the slope is the steepest.
Minimizing the function is the same as maximizing its negative.
This is the reason why we step in the opposite direction of the gradient while we minimize the training loss!
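A minimal sketch of that minimization loop (my own toy example, not from the thread): minimize f(x) = (x − 3)², whose gradient is 2(x − 3), by repeatedly stepping against the gradient.

```python
def grad(x):
    return 2 * (x - 3)  # gradient of f(x) = (x - 3)^2

x = 0.0                 # starting point
lr = 0.1                # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)   # step *against* the gradient to minimize

# x converges to the minimizer 3
assert abs(x - 3) < 1e-6
```

The minus sign in the update is exactly the "maximize the negative" trick from above.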
Now that you understand how gradient descent works, you can also see its downsides.
For instance, it can get stuck in a local optimum. Or, the gradient can be computationally hard to calculate when the function has millions of variables. (Like when training a neural network.)
This is just the tip of the iceberg.
Gradient descent has been improved several times. By understanding how the base algorithm works, you are now ready to tackle stochastic gradient descent, adaptive methods, and many more!