Machine Learning Weekly Highlights 💡

◆3 things from me
◆2 things from other people and
◆2 from the community

🧵🧵
This week, I wrote about what to consider when choosing a machine learning model for a particular problem, early stopping (one of the most powerful regularization techniques), and what to know about the learning rate.

Here are the corresponding threads:

1. What to know about the model selection process...

2. Early stopping, one of the simplest and most effective regularization techniques...

3. What you should know about learning rates, their different curves, and the techniques used to schedule them.

Two things from other people:

1. @rasbt shared 170 deep learning videos that he recorded in 2021. They are not from this week, but I learned about them this week, thanks to @alfcnz retweeting them...

Check those 170 videos out!

2. @Whats_AI created a great list of the best ML papers of 2021.

github.com/louisfb01/best…
Two things from the community:

1. @Nvidia GTC 2021: lots of updates from NVIDIA, which is on a mission to design powerful deep learning accelerators.
I watched the keynote, and it's great. There is lots of exciting news, from Omniverse and NVIDIA Drive, to broader access to pre-trained vision and language models, to Jarvis, an accurate conversational AI bot...

Give it a watch! It's a great event!!

2. Gradients are not all you need

This paper from @Luke_Metz discusses the potential chaos of using gradient-based optimization algorithms. Most optimizers compute the gradients of the loss with respect to the weights in order to minimize that loss.
That usually works (but there is no theoretical guarantee that it always will).

The paper highlights the issues with gradients and argues that sometimes they are not all you need. Thanks to @rasbt for sharing this.

This is the end of this week's highlights. I plan to keep doing them.

Also, I am thinking a lot about going deep into some particular topics/concepts in my newsletter.

While I haven't put it together yet, you can sign up right away here:

getrevue.co/profile/deepre…
Thanks for reading!

For more ideas about machine learning, follow @Jeande_d!

Until the next week, stay safe!

• • •


More from @Jeande_d

12 Nov
The learning rate is one of the most important hyperparameters to tune when training an ML model.

A high learning rate can speed up training, but it can also cause the model to diverge. A low rate slows training down.

Here are different learning rate curves: [image]
A low learning rate can also give poor results.

A good practice is to start with a relatively high rate and then reduce it gradually.

There are many techniques that can be used to achieve that. They are called learning rate schedulers.
Examples of learning rate scheduling techniques:

◆Power scheduler
◆Exponential scheduler
◆Piecewise constant (multi-factor) scheduler
◆Performance scheduler
◆Cosine scheduler
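As a minimal sketch (my own illustrative implementations, not code from the thread), three of these schedulers can be written as plain functions of the step number:

```python
import math

def exponential_schedule(initial_lr, decay_rate, step):
    """Exponential scheduler: shrink the lr by a constant factor each step."""
    return initial_lr * decay_rate ** step

def piecewise_schedule(initial_lr, boundaries, factors, step):
    """Piecewise-constant (multi-factor) scheduler: drop the lr at fixed steps."""
    lr = initial_lr
    for boundary, factor in zip(boundaries, factors):
        if step >= boundary:
            lr = initial_lr * factor
    return lr

def cosine_schedule(initial_lr, total_steps, step):
    """Cosine scheduler: smooth decay from initial_lr toward 0."""
    return initial_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))
```

In practice you rarely write these by hand: frameworks ship them as utilities, e.g. Keras's `LearningRateScheduler` callback or PyTorch's `torch.optim.lr_scheduler` module.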
11 Nov
The initial loss value you should expect when using a softmax activation in the last layer of a neural network:

Initial loss = ln(number_of_classes), where ln is the natural logarithm.
Example (using the Keras API):

last_layer = keras.layers.Dense(10, activation='softmax')

# number of classes = 10
initial_loss = math.log(10)  # ≈ 2.302

Understanding this is important when debugging the network. If you see an initial loss of 4.5 when you have 10 classes, something is wrong.
Also, the loss reported for the first training epoch is the average over all the batches in that epoch.

So you may get an initial loss slightly below ln(number_of_classes), because the model is already improving during those first batches. And that is a good thing.
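A quick sanity check of this rule (a toy computation, not framework code): a freshly initialized softmax layer outputs roughly uniform probabilities, and the cross-entropy of a uniform prediction is exactly ln(number_of_classes):

```python
import math

def cross_entropy(probs, true_index):
    """Cross-entropy loss for one example: -log of the prob given to the true class."""
    return -math.log(probs[true_index])

num_classes = 10
# At initialization, a softmax layer outputs roughly uniform probabilities.
uniform = [1.0 / num_classes] * num_classes

loss = cross_entropy(uniform, true_index=3)
print(loss)                    # ≈ 2.302
print(math.log(num_classes))   # the same value: ln(10)
```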
10 Nov
The illustration below shows early stopping, one of the simplest and most effective regularization techniques used in training neural networks.

A thread on the idea behind early stopping, why it works, and why you should always use it...🧵 [image]
Usually, during training, the training loss decreases gradually, and if everything goes well on the validation side, the validation loss decreases too.

Once the validation loss hits a local minimum, it starts to increase again, which is a signal of overfitting. [image]
How can we stop training right before the validation loss rises again, or before the validation accuracy starts decreasing?

That is the motivation for early stopping.

With early stopping, we stop training when the validation metrics stop improving. [image]
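As a sketch of that logic (the function name and patience value are my own, not a specific framework's implementation):

```python
def train_with_early_stopping(val_losses, patience=3):
    """Stop when the validation loss hasn't improved for `patience` epochs.

    `val_losses` stands in for one validation-loss value per epoch;
    returns the epoch index and value of the best (lowest) validation loss.
    """
    best_loss = float('inf')
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training early
    return best_epoch, best_loss

# Validation loss bottoms out at epoch 2, then rises: training stops early.
val_losses = [1.0, 0.8, 0.6, 0.65, 0.7, 0.75, 0.9]
print(train_with_early_stopping(val_losses))  # (2, 0.6)
```

In practice, frameworks provide this as a ready-made callback, e.g. `keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)`.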
7 Nov
Machine Learning weekly highlights 💡

◆3 threads from me
◆3 threads from others
◆2 news items from the ML community
3 POSTS FROM ME

This week, I explained Tom Mitchell's classical definition of machine learning, why it is hard to train neural networks, and shared some recipes for training and debugging neural nets.
Here is the meaning of Tom's definition of machine learning

5 Nov
One of the things that makes training neural networks hard is the number of choices we have to make before and during training.

Here is a training guideline covering:

◆Architectural choice
◆Activations
◆Losses
◆Optimizers
◆Batch size
◆Training & debugging recipes

🧵🧵
1. ARCHITECTURAL CHOICE

The choice of neural network architecture is primarily guided by the data and the problem at hand.
Unless you are researching a new architecture, here are the popular conventions:

◆Tabular data: Feedforward networks (or Multi-layer perceptrons)
◆Images: 2D Convolutional neural networks (Convnets), Vision transformers (ongoing research)
2 Nov
Why is it so hard to train neural networks?

Neural networks are hard to train. The deeper they are, the more likely they are to suffer from unstable gradients.

A thread 🧵🧵
Gradients can either explode or vanish, and neither of those is a good thing for the training of our network.
The vanishing gradient problem makes the network take too long to train (learning becomes very slow), while the exploding gradient problem makes the weight updates so large that training becomes unstable.
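To see why depth makes this worse, here is a toy calculation (mine, not from the thread): backpropagation multiplies one local derivative per layer, and the sigmoid's derivative never exceeds 0.25, so with sigmoid activations the product shrinks exponentially with depth:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # maximum value is 0.25, reached at x = 0

# Backprop multiplies one local derivative per layer; even taking the
# largest possible factor (0.25) every time, the product collapses.
depth = 20
grad = 1.0
for _ in range(depth):
    grad *= sigmoid_grad(0.0)  # best case for a sigmoid layer

print(grad)  # 0.25**20 ≈ 9.1e-13: vanishingly small
```

This is one reason activations like ReLU (whose derivative is 1 on the positive side) are preferred in deep networks.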
