Gradient descent has a really simple and intuitive explanation.

The algorithm is easy to understand once you realize that it is essentially hill climbing with a very simple strategy.

Let's see how it works!

🧵 👇🏽
For functions of one variable, the gradient is simply the derivative of the function.

The derivative expresses the slope of the function's tangent line, but it can also be viewed as a one-dimensional vector!
When the function is increasing, the derivative is positive. When decreasing, it is negative.

Translating this to the language of vectors, it means that the "gradient" points in the direction of increase!
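To make this concrete, here is a tiny Python sketch. (The function 𝑥² and the finite-difference helper are my own illustrative choices, not from the thread.)

```python
def derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2  # increasing for x > 0, decreasing for x < 0

# The sign of the derivative points toward the increase:
print(derivative(f, 2.0))   # positive: f increases to the right of x = 2
print(derivative(f, -2.0))  # negative: f increases to the LEFT of x = -2
```

Viewed as a one-dimensional vector, a positive derivative points right, a negative one points left — always toward the increase.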

This is the key to understanding gradient descent.
Although the concept of the gradient gets a bit more complicated for multivariate functions, the above observation still holds.

Instead of two directions, we have infinitely many. So here, the gradient shows the direction of the largest increase!
When we want to maximize a function this way, we simply take small steps in the direction of the largest increase. (This variant is called gradient ascent.)

Take a look at the update formula 𝑥ₙ₊₁ = 𝑥ₙ + γ∇f(𝑥ₙ), and you'll spot it immediately.
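The update step can be sketched in a few lines of Python. (The target function, learning rate, and step count below are toy choices of mine for illustration.)

```python
def gradient_ascent(grad, x, learning_rate=0.1, steps=100):
    """Repeatedly step in the direction of the largest increase."""
    for _ in range(steps):
        x = x + learning_rate * grad(x)  # the update formula
    return x

# Maximize f(x) = -(x - 3)^2, whose gradient is -2(x - 3); the peak is at x = 3.
x_max = gradient_ascent(lambda x: -2 * (x - 3), x=0.0)
print(x_max)  # close to 3
```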
This is why the algorithm can be viewed as hill climbing.

The optimum value is the peak, and the plan to reach it is simply to go in the direction where the slope is steepest.
Minimizing the function is the same as maximizing its negative.

This is the reason why we step in the opposite direction of the gradient when minimizing the training loss!
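Here is the same idea for minimization, in two variables. (The quadratic function and step size are illustrative assumptions of mine.)

```python
# Minimize f(x, y) = (x - 1)^2 + (y + 2)^2 by stepping AGAINST the gradient.

def grad(p):
    x, y = p
    return (2 * (x - 1), 2 * (y + 2))

p = (0.0, 0.0)
for _ in range(200):
    gx, gy = grad(p)
    p = (p[0] - 0.1 * gx, p[1] - 0.1 * gy)  # minus sign: downhill step
print(p)  # close to the minimum at (1, -2)
```

Note the only change from the ascent version is the sign of the step.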
Now that you understand how gradient descent works, you can also see its downsides.

For instance, it can get stuck in a local optimum. Or, the gradient can be computationally hard to calculate when the function has millions of variables. (Like when training a neural network.)
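The local-optimum trap is easy to reproduce. The double-well function below is my own toy example: the same algorithm ends in a different valley depending on where it starts.

```python
def f(x):
    return x**4 - 2 * x**2 + 0.5 * x  # two valleys, one deeper than the other

def grad(x):
    return 4 * x**3 - 4 * x + 0.5

def descend(x, lr=0.05, steps=500):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-1.0)   # ends in the global minimum (near x ≈ -1.06)
right = descend(1.0)   # stuck in the shallower local minimum (near x ≈ 0.93)
print(f(left) < f(right))  # True: same algorithm, worse result on the right
```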
This is just the tip of the iceberg.

Gradient descent has been improved in many ways. By understanding how the base algorithm works, you are now ready to tackle stochastic gradient descent, adaptive methods, and many more!

Thread by Tivadar Danka

