Gradient descent sounds good on paper, but there is a big issue in practice.
For complex functions like training losses for neural networks, calculating the gradient is computationally very expensive.
So what makes training possible in practice? For one, stochastic gradient descent!
🧵 👇🏽
When you have a lot of data, calculating the gradient of the loss involves the computation of a large sum.
Think about it: if 𝑥ᵢ denotes the data points and 𝑤 denotes the weights, the loss takes the form L(𝑤) = (1/N) Σᵢ ℓ(𝑥ᵢ, 𝑤): one loss term per data point, averaged together.
Not only do we have to add a bunch of numbers together, but we also have to compute the gradient of each loss term.
For example, if the model contains 10 000 parameters and 1 000 000 data points, we need to compute 10 000 × 1 000 000 = 10¹⁰ derivatives.
This can take a LOT of time.
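To get a feel for the cost, here is a rough sketch with made-up sizes, using a linear model with squared loss instead of a real network: the full-batch gradient has to touch every entry of the data matrix on every single step.

```python
# Rough cost sketch (hypothetical sizes): full-batch gradient of a
# mean-squared-error loss for a linear model x_i . w.
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 100                 # data points, parameters
X = rng.normal(size=(N, d))        # data matrix
y = rng.normal(size=N)             # targets
w = np.zeros(d)

# Gradient of L(w) = (1/N) * sum_i (x_i . w - y_i)^2.
# Computing it reads all N x d = 1,000,000 entries of X -- every step.
grad = (2.0 / N) * X.T @ (X @ w - y)
print(grad.shape)  # (100,)
```

Scale N and d up to the tweet's numbers and this one line becomes the bottleneck of the whole training loop.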
Do we need to calculate all the derivatives, though?
Not really.
Note that the total loss is, in fact, the average of individual losses. For lots of terms, a smaller subset approximates the true average with good precision! (Close data points have similar losses, etc.)
This is related to the law of large numbers. To approximate the average, you don't need to have all the information!
(If you are not familiar with the law of large numbers, here is a refresher below!)
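A quick toy demo of the idea (exponential random numbers standing in for per-example losses): the mean over a 1% random subset lands very close to the mean over all one million values.

```python
# Toy law-of-large-numbers demo: a 1% random subset's mean
# is close to the mean over all one million "losses".
import numpy as np

rng = np.random.default_rng(0)
losses = rng.exponential(scale=2.0, size=1_000_000)  # stand-in per-example losses

full_mean = losses.mean()
subset_mean = rng.choice(losses, size=10_000, replace=False).mean()  # 1% sample

print(full_mean, subset_mean)  # both close to 2.0
```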
Stochastic gradient descent simply selects a small subset of the data (might be a single data point) and only calculates the loss and gradient for that subset.
If we use only 1% of the data, each gradient computation becomes about 100 times faster!
This neat little trick not only makes training faster, it quite literally makes training possible at all for large datasets and complex models!
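Put together, a minimal mini-batch SGD loop might look like this (a sketch on a made-up linear-regression problem, not a real network):

```python
# Minimal mini-batch SGD sketch on a synthetic linear-regression task.
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
w_true = rng.normal(size=d)                # weights we are trying to recover
X = rng.normal(size=(N, d))
y = X @ w_true + 0.01 * rng.normal(size=N)  # noisy targets

w = np.zeros(d)
lr, batch_size = 0.1, 64
for step in range(500):
    idx = rng.integers(0, N, size=batch_size)         # small random subset
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)  # gradient on the batch only
    w -= lr * grad

print(np.max(np.abs(w - w_true)))  # small: w has converged near w_true
```

Each step looks at just 64 points instead of all 10 000, yet 𝑤 still converges to the true weights.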
Think about this the next time you are training a deep convolutional network on your laptop :)
(Also, GPU computing.)