One of the things that makes training neural networks hard is the number of choices that we have to make before & during training.

Here is a training guideline covering:

◆Architectural choice
◆Activations
◆Losses
◆Optimizers
◆Batch size
◆Training & debugging recipes

🧵🧵
1. ARCHITECTURAL CHOICE

The choice of neural network architecture is primarily guided by data and the problem at hand.
Unless you are researching a new architecture, here are the popular conventions (a small sketch follows the list):

◆Tabular data: Feedforward networks (or multi-layer perceptrons)
◆Images: 2D convolutional neural networks (convnets), vision transformers (ongoing research)
◆Text: Recurrent neural networks (RNNs), transformers, or 1D convnets
◆Time series: RNNs or 1D convnets
◆Videos & volumetric images: 3D convnets, or 2D convnets (with the video divided into frames)
◆Sound: 1D convnets or RNNs
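
To make the first two conventions concrete, here is a minimal sketch in Keras (the layer sizes and input shapes are illustrative assumptions, not recommendations):

# A minimal sketch of the first two conventions in Keras.
import tensorflow as tf
from tensorflow.keras import layers

# Tabular data: a small feedforward network (multi-layer perceptron).
mlp = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),  # 20 numeric features
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                   # binary target
])

# Images: a small 2D convnet.
convnet = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),  # 28x28 grayscale
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),                  # 10 classes
])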
2. ACTIVATION FUNCTIONS

Activation functions are non-linear mathematical functions applied to a layer's outputs to introduce non-linearity into the network.

Why do we need non-linearities? Well, real-world datasets are rarely linear. To model them, we have to use non-linear activations. Without them, a stack of layers collapses into a single linear transformation, as the small example below shows.
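
A quick numerical check of that claim (pure NumPy, with made-up shapes):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                    # a batch of 5 inputs with 4 features
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

# Two "layers" with no activation in between...
two_layers = (x @ W1 + b1) @ W2 + b2

# ...are exactly equivalent to one linear layer with merged weights.
one_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)

print(np.allclose(two_layers, one_layer))      # True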
There are many activation functions such as sigmoid, tanh, ReLU, Leaky ReLU, GELU, SELU, etc...

Here are important notes about choosing an activation function:

◆Avoid sigmoid or tanh in hidden layers: they saturate and can cause gradients to vanish, which slows down learning.
◆Try ReLU first. It works well most of the time. If you need a boost in accuracy, try Leaky ReLU, SELU, or ELU. These ReLU variants can work just as well or even better, but there is no guarantee.
◆The difference the choice of activation function makes in the results is usually small. Don't stress over which one is best!
◆When choosing activation functions for the last (output) layer (a small sketch follows the list):

...Sigmoid: Binary classification, multi-label classification
...Softmax: Multiclass classification
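
A minimal sketch of these output-layer conventions in Keras (the unit counts are placeholder assumptions):

from tensorflow.keras import layers

hidden_layer      = layers.Dense(128, activation="relu")    # ReLU as the default hidden activation
binary_output     = layers.Dense(1, activation="sigmoid")   # binary classification
multilabel_output = layers.Dense(5, activation="sigmoid")   # 5 independent labels
multiclass_output = layers.Dense(10, activation="softmax")  # 10 mutually exclusive classes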
3. LOSSES

Loss functions measure the distance between the predictions and the actual outputs during training.

The most commonly used loss function in classification problems is cross-entropy.
There are 3 types of cross-entropy (a small sketch follows the list):

◆Binary cross-entropy: For binary classification, when the activation of the last layer is sigmoid.

◆Categorical cross-entropy: Mostly for multi-class classification problems, used when labels are given in one-hot format (vectors of 0s and 1s).
◆Sparse categorical cross-entropy: Mostly for multi-class classification problems, used when labels are given as integers.
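
A hedged sketch of how these pair up in Keras (the tiny model and label arrays are made up for illustration):

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

num_classes = 3

# A softmax classifier: pair it with categorical or sparse categorical cross-entropy.
model = tf.keras.Sequential([
    layers.Dense(16, activation="relu", input_shape=(4,)),
    layers.Dense(num_classes, activation="softmax"),
])

# One-hot labels -> categorical cross-entropy.
model.compile(optimizer="adam", loss="categorical_crossentropy")
one_hot_labels = np.array([[1, 0, 0], [0, 0, 1]])

# Integer labels -> sparse categorical cross-entropy.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
integer_labels = np.array([0, 2])

# For binary classification, the last layer would be Dense(1, activation="sigmoid"),
# compiled with loss="binary_crossentropy".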
4. OPTIMIZERS

Optimizers are algorithms used to minimize the loss during training.

The most popular optimizers are Stochastic Gradient Descent (SGD), AdaGrad, RMSprop, Adam, Nadam, and AdaMax.
Unlike plain SGD, all the other optimizers listed adapt their learning rates during training, which often lets them converge faster.

Try Adam first. If it doesn't converge fast enough, try Nadam, AdaMax, or RMSprop. They can work well too.
There is no universally right or wrong optimizer. Try several, starting with Adam (a minimal sketch follows).
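
A minimal sketch of swapping optimizers in Keras (the learning rates are common defaults, not tuned values, and the model is assumed from the earlier sections):

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)      # first choice
# optimizer = tf.keras.optimizers.Nadam(learning_rate=1e-3)   # alternatives to try
# optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-3)
# optimizer = tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9)

# model.compile(optimizer=optimizer,
#               loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])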
5. BATCH SIZE

Surprisingly, the batch size is a critical choice in neural network training. It can speed up or slow down training and influence how well the network performs.
A large batch size can speed up training because you feed many samples to the model at once, but it also carries the risk of running into instabilities.
A small batch size can slow down training, but it can result in better generalization.
What can we conclude about the batch size?

◆Use a small batch size. The network tends to generalize better and training stays stable. Try something like 32 or below (see the sketch at the end of this section).

Find more about the benefits of a small batch size in this paper:
arxiv.org/abs/1804.07612
◆Try a large batch size if you have a very big dataset like ImageNet :) and you want to train in 1 hour or less.
But it's computationally expensive. This paper trained on ImageNet with a large batch size of 8192 in 1 hour, but it took 256 GPUs:

arxiv.org/abs/1706.02677
Looking at the above paper and others that tried to compete with it, batch size, GPU count, and training time are closely related.

You too could train ImageNet in 15 minutes if you multiplied both the batch size of 8192 and the 256 GPUs by 4. I personally can't afford that!
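
For everyday training, the batch size is just an argument to the training loop. A runnable sketch in Keras (with synthetic stand-in data, purely for illustration):

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Synthetic stand-in data, just to make the sketch runnable.
x_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(20,)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Small batch (e.g. 32): slower per epoch, often more stable with better generalization.
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_split=0.2)

# Large batch (e.g. 512+): faster per epoch on big datasets and big hardware,
# but can be less stable and may generalize worse without extra tuning.
# model.fit(x_train, y_train, batch_size=512, epochs=2, validation_split=0.2)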
6. SOME TRAINING AND DEBUGGING IDEAS

Training neural networks is hard. There is no clear framework for most of the choices you are supposed to make.

Randomly feeding images to ConvNets does not produce magical results.
Here are some training recipes and debugging ideas:

◆Become one with the data
◆Overfit a tiny dataset
◆Visualize the model (TensorBoard is pretty good at this)
◆Visualize the samples that the model got wrong. Find out why the model failed on them. Fix their labels if that's the cause.
◆If the training error is low, and the validation error is high, the model is overfitting.

Regularize the model with techniques like dropout or early stopping, add more training data, or augment the existing data (a small sketch of these follows below).
◆If the training error is high and the validation error is also high, train longer or increase the model size.
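
A minimal sketch of a few of these recipes in Keras: dropout for regularization, plus EarlyStopping and TensorBoard callbacks (all sizes, paths, and variable names here are placeholder assumptions):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dropout(0.3),                         # dropout to fight overfitting
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    tf.keras.callbacks.TensorBoard(log_dir="logs"),  # visualize training in TensorBoard
]

# Sanity check first: overfit a tiny subset before training on the full dataset.
# model.fit(x_small, y_small, epochs=200)   # training accuracy should approach 100%

# Then train for real with regularization and callbacks:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=callbacks)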
For more recipes and ideas for training neural networks, I highly recommend reading this blog post by @karpathy.

A Recipe for Training Neural Networks:

karpathy.github.io/2019/04/25/rec…
Thanks for reading.

If you found the thread helpful, retweet and share it with your friends. That's certainly the best way to support me.

Follow @Jeande_d for more machine learning ideas!
