◆Texts: Recurrent Neural Networks (RNNs), Transformers, or 1D ConvNets
◆Time-series: RNNs or 1D ConvNets
◆Videos & volumetric images: 3D ConvNets, or 2D ConvNets (with the video divided into frames)
◆Sound: 1D ConvNets or RNNs
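To make the mapping above concrete, here is a minimal sketch of a 1D ConvNet for a time-series (or other sequence) classification task in Keras. The input shape, layer sizes, and binary output are made-up assumptions for illustration, not a recommended architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A small 1D ConvNet for sequence data (shapes are illustrative only).
model = tf.keras.Sequential([
    layers.Input(shape=(128, 1)),           # 128 timesteps, 1 feature (assumed)
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),  # binary output (assumed task)
])
model.summary()
```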
2. ACTIVATION FUNCTIONS
Activation functions are non-linear mathematical functions used to introduce non-linearity into the network.
Why do we need non-linearities? Well, real-world datasets are rarely linear. To model them, we have to use non-linear activations.
There are many activation functions, such as Sigmoid, Tanh, ReLU, Leaky ReLU, GeLU, SELU, etc.
Here are some important notes about choosing an activation function:
◆Avoid sigmoid and tanh in hidden layers: they saturate and can kill gradients (the vanishing gradient problem), which slows down learning.
◆Always try ReLU first. It works well most of the time. If you feel you need a boost in accuracy, try Leaky ReLU, SELU, or ELU. These are variants of ReLU that can work just as well or even better, but there is no guarantee.
◆The difference the choice of activation function makes in the results is usually small. Don't stress over which one is best!
◆When choosing the activation function for the last output layer, match it to the task: sigmoid for binary classification, softmax for multi-class classification, and no activation (linear) for regression.
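As a rough illustration of these notes in Keras, here is how activations are usually specified per layer. The layer sizes, the 20-feature input, and the 10-class softmax output are assumptions made just for this sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(20,)),              # 20 input features (assumed)
    layers.Dense(64, activation="relu"),    # start with ReLU in hidden layers
    layers.Dense(64),
    layers.LeakyReLU(),                     # a ReLU variant worth trying for a boost
    layers.Dense(10, activation="softmax"), # softmax: assumed 10-class classifier
])
```

For a binary classifier the last layer would be `Dense(1, activation="sigmoid")`, and for regression a plain `Dense(1)` with no activation.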
3. LOSS FUNCTIONS
Loss functions are used to measure the distance between the predictions and the actual outputs during training.
The commonly used loss function in classification problems is cross-entropy.
3 types of cross-entropy:
◆Binary cross-entropy: For binary classification, and when the activation of the last layer is sigmoid.
◆Categorical cross-entropy: Mostly for multi-class classification problems and used when labels are given in one-hot format (0's and 1's).
◆Sparse categorical cross-entropy: Mostly for multi-class classification problems and used when labels are given in integer format.
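A quick sketch of how the three variants are selected in Keras, assuming a `model` like the ones sketched earlier with the matching output activation:

```python
# Binary classification, sigmoid output, labels are 0/1:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Multi-class, softmax output, labels one-hot encoded (e.g. [0, 0, 1, 0]):
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Multi-class, softmax output, labels as plain integers (e.g. 2):
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```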
4. OPTIMIZERS
Optimization functions are used to minimize the loss during training.
The most popular optimizers are Stochastic Gradient Descent (SGD), AdaGrad, RMSprop, Adam, Nadam, and AdaMax.
Unlike SGD, all the other optimizers mentioned use adaptive learning rates, which means they can converge faster.
Try Adam first. If it doesn't converge well, try Nadam, AdaMax, or RMSprop. They can work well too.
There is no right or wrong optimizer. Try many of them, starting with Adam, and then the others...
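A minimal sketch of swapping optimizers in Keras, again assuming an existing `model`; the learning rate shown is just the common default, not a recommendation:

```python
import tensorflow as tf

# Start with Adam...
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# ...and if it doesn't converge well, swap in one of the others:
# tf.keras.optimizers.Nadam(), tf.keras.optimizers.Adamax(),
# tf.keras.optimizers.RMSprop(), tf.keras.optimizers.SGD(), tf.keras.optimizers.Adagrad()
```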
5. BATCH SIZE
Surprisingly, the batch size is a critical factor in neural network training. It can speed up or slow down training and influence the performance of the network.
A large batch size can speed up training because you are feeding many samples to the model at once, but there is also a risk of running into instabilities.
A small batch size can slow down training, but it can result in better generalization.
What can we conclude about the batch size?
◆Use a small batch size. The network can generalize better and the training is stable as well. Try something like 32 or below.
Looking at the paper above and others that tried to compete with it, batch size, number of GPUs, and training time are closely related.
You too can train ImageNet in 15 minutes if you multiply the batch size of 8192 by 4 and the 256 GPUs by 4. I personally can't afford that!
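For most of us, the batch size is simply the `batch_size` argument of `fit()`. A minimal sketch, assuming `x_train`/`y_train` arrays already exist:

```python
# 32 is a common, small default that usually keeps training stable.
history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=10,
                    validation_split=0.2)
```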
6. SOME TRAINING AND DEBUGGING IDEAS
It's super hard to train neural networks. There is no clear framework for most of the choices you are supposed to make.
Randomly feeding images to ConvNets does not produce magical results.
Here are some training recipes and debugging ideas:
◆Become one with data
◆Overfit a tiny dataset
◆Visualize the model (TensorBoard is pretty good at this)
◆Visualize the samples that the model got wrong. Find why the model failed them. Fix their labels if that's the cause.
◆If the training error is low, and the validation error is high, the model is overfitting.
Regularize the model with techniques like dropout or early stopping, add more data, or augment the training data (see the sketch after this list).
◆If the training error is high and the validation error is also high, the model is underfitting: train longer or try a larger model.
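Here is a minimal sketch of the dropout + early stopping combination mentioned in the list, with made-up layer sizes and assumed `x_train`/`y_train` data:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Dropout between layers to regularize, plus an EarlyStopping callback that
# stops training and restores the best weights once the validation loss
# stops improving.
model = tf.keras.Sequential([
    layers.Input(shape=(20,)),             # 20 features (assumed)
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(x_train, y_train, epochs=100, batch_size=32,
          validation_split=0.2, callbacks=[early_stop])
```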
For more about recipes and ideas for training neural networks, I highly recommend you read this blog by @karpathy.
Neural networks are hard to train. The deeper they get, the more likely they are to suffer from unstable gradients.
Gradients can either explode or vanish, and neither of those is a good thing for the training of our network.
The vanishing gradient problem makes the network take too long to train (learning becomes very slow), while exploding gradients produce very large updates that can make training unstable.
Precision: What is the percentage of positive predictions that are actually positive?
Recall: What is the percentage of actual positives that were predicted correctly?
The fewer false positives, the higher the precision, and vice versa.
The fewer false negatives, the higher the recall, and vice versa.
How do you increase precision? Reduce false positives.
It can depend on the problem, but generally, that might mean fixing the labels of those negative samples (the ones being predicted as positive) or adding more of them to the training data.
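A tiny, made-up example with Scikit-Learn that ties precision and recall back to false positives and false negatives:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (toy data)

# Precision = TP / (TP + FP): fewer false positives -> higher precision.
# Recall    = TP / (TP + FN): fewer false negatives -> higher recall.
print(precision_score(y_true, y_pred))  # 0.75: 3 of the 4 predicted positives are correct
print(recall_score(y_true, y_pred))     # 0.75: 3 of the 4 actual positives were found
```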
◆Data visualization with Matplotlib & Seaborn
◆Data preprocessing with Pandas
◆Classical machine learning with Scikit-Learn: from linear models, trees, and ensemble models to PCA
◆Neural networks with TensorFlow & Keras: ConvNets, RNNs, BERT, etc.