Real-world datasets are often imbalanced - some classes appear much more often in your data than others.
The problem? Your ML model will likely learn to predict only the dominant classes.
What can you do about it? 🤔
Thread 👇
Example 🚦
We will be dealing with an ML model to detect traffic lights for a self-driving car 🤖🚗
Traffic lights are small, so most of the image will consist of parts that are not traffic lights.
Furthermore, yellow lights 🟡 are much rarer than green 🟢 or red 🔴.
The problem ⚡
Imagine we train a model to classify the color of the traffic light. A typical distribution will be:
🔴 - 56%
🟡 - 3%
🟢 - 41%
So, your model can get to 97% accuracy (56% + 41%) just by learning to distinguish red from green and ignoring yellow completely.
How can we deal with this? 🤔
Evaluation measures 👇
First, you need to start using evaluation measures other than accuracy:
- Precision per class
- Recall per class
- F1 score per class
I also like to look at the confusion matrix to get an overview. Always look at examples from the data as well!
In the traffic lights example above, we will see very poor recall for 🟡 (most real yellow lights are not recognized), while its precision will likely be high.
At the same time, the precision for 🟢 and 🔴 will be lower (🟡 lights will be classified as 🟢 or 🔴).
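Here is a minimal sketch of how this could look with scikit-learn (the y_true / y_pred lists are made up just to illustrate the effect):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up ground truth labels and model predictions for the traffic light example
y_true = ["red", "red", "yellow", "green", "green", "red", "yellow", "green"]
y_pred = ["red", "red", "yellow", "green", "green", "red", "red",    "green"]

# Precision, recall and F1 score computed per class
print(classification_report(y_true, y_pred))

# Rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_true, y_pred, labels=["red", "yellow", "green"]))
```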
Get more data 📢
The best thing you can do is to collect more data for the underrepresented classes. This may be hard or even impossible...
You can imagine ways to record more yellow lights, but what if you want to detect a very rare disease in CT images?
Balance your data 👇
The idea is to resample your dataset so it is better balanced.
▪️ Undersampling - throw away some examples of the dominant classes
▪️ Oversampling - create more samples of the underrepresented classes
Undersampling ⏬
The easiest way is to just randomly throw away samples from the dominant class.
Even better, you can use some unsupervised clustering method and throw out only samples from the big clusters.
The problem of course is that you are throwing out valuable data...
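A quick sketch of random undersampling with the imbalanced-learn package (one possible tool for this; the dataset here is synthetic):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced dataset: 3 classes with roughly 56% / 3% / 41% of the samples
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4,
                           weights=[0.56, 0.03, 0.41], random_state=42)

# Randomly drop samples from the dominant classes until all classes have
# as many samples as the smallest one
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print(Counter(y))      # heavily imbalanced class counts
print(Counter(y_res))  # balanced, but most of the data was thrown away
```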
Oversampling ⏫
This is more difficult. You can simply repeat samples, but that usually doesn't work very well.
You can use methods like SMOTE (Synthetic Minority Oversampling Technique) to generate new samples interpolating between existing ones. This may not be easy for complex images.
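A sketch of SMOTE with imbalanced-learn on the same kind of synthetic data:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced dataset with 3 classes
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4,
                           weights=[0.56, 0.03, 0.41], random_state=42)

# SMOTE generates new minority samples by interpolating between existing ones
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print(Counter(y))      # imbalanced
print(Counter(y_res))  # minority classes upsampled to the size of the majority class
```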
Oversampling ⏫
If you are dealing with images, you can use data augmentation techniques to create new samples by modifying the existing ones (rotation, flipping, skewing, color filters...)
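A possible augmentation pipeline with torchvision (the transforms and parameters below are just an example, assuming you work with PIL images):

```python
import torchvision.transforms as T

# Random modifications applied on the fly to every training image
augment = T.Compose([
    T.RandomRotation(degrees=10),                                  # small rotations
    T.RandomHorizontalFlip(p=0.5),                                 # flipping
    T.RandomAffine(degrees=0, shear=10),                           # skewing
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),   # color filters
    T.ToTensor(),
])

# For example, as the transform of an image dataset:
# dataset = torchvision.datasets.ImageFolder("traffic_lights/", transform=augment)
```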
You can also use GANs or simulation to synthesize completely new images.
Adapting your loss 👇
Another strategy is to modify your loss function to penalize misclassification of the underrepresented classes more than the dominant ones.
In the 🚦 example, we can set the class weights like this (inversely proportional to the class frequencies, e.g. 100 / 56 ≈ 1.8 for 🔴):
🔴 - 1.8
🟡 - 33.3
🟢 - 2.4
If you are training a neural network with TensorFlow or PyTorch, you can do this very easily:
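For example, in PyTorch many classification losses accept per-class weights directly (the logits and targets below are dummy data):

```python
import torch
import torch.nn as nn

# Class weights for red, yellow, green - inversely proportional to the class frequencies
class_weights = torch.tensor([1.8, 33.3, 2.4])

# Misclassifying a yellow light now costs ~33x more than misclassifying a red one
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 3)              # model outputs for a batch of 4 images
targets = torch.tensor([0, 2, 1, 0])    # ground truth class indices
loss = loss_fn(logits, targets)
```

In Keras you can achieve the same by passing class_weight={0: 1.8, 1: 33.3, 2: 2.4} to model.fit.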
This is the formula for Mean Squared Error (MSE) as defined on Wikipedia. It represents a very simple concept, but it may not be easy to read if you are just starting with ML.
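For reference, this is the standard definition:

MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

where n is the number of samples, Y_i are the ground truth values and Ŷ_i are the model's predictions.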
Read below and it will be a piece of cake! 🍰
Thread 👇
The core ⚫
Let's unpack it from the inside out. MSE measures how close your model's predictions Ŷ are to the ground truth labels Y. You want the error to go to 0.
If you are predicting house prices, the error could be the difference between the predicted and the actual price.
Why squared? 2️⃣
Simply subtracting the prediction from the label won't work. The error may be negative or positive, so errors from different samples can cancel out when summed.
You can take either the absolute value or the square of the error. The square has the property that it punishes bigger errors more: an error of 3 contributes 9 to the squared loss, but only 3 to the absolute loss.
What are the typical challenges when training deep neural networks? ⚙️
▪️ Overfitting
▪️ Underfitting
▪️ Lack of training data
▪️ Vanishing gradients
▪️ Exploding gradients
▪️ Dead ReLUs
▪️ Network architecture design
▪️ Hyperparameter tuning
How to solve them 👇
Overfitting 👇
Your model performs well during training, but poorly during test.
Possible solutions (a code sketch follows the list):
- Reduce the size of your model
- Add more data
- Increase dropout
- Stop the training early
- Add regularization to your loss
- Decrease batch size
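A minimal PyTorch sketch of some of these knobs (the architecture is arbitrary, and train_one_epoch / evaluate are hypothetical placeholders):

```python
import torch
import torch.nn as nn

# A small model with dropout between the layers
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # increase dropout to fight overfitting
    nn.Linear(64, 3),
)

# weight_decay adds L2 regularization to the loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Early stopping: watch the validation loss and stop when it no longer improves
best_val_loss, bad_epochs, patience = float("inf"), 0, 5
# for epoch in range(max_epochs):
#     train_one_epoch(model, optimizer)        # hypothetical training helper
#     val_loss = evaluate(model)               # hypothetical validation helper
#     if val_loss < best_val_loss:
#         best_val_loss, bad_epochs = val_loss, 0
#     else:
#         bad_epochs += 1
#         if bad_epochs >= patience:
#             break                            # stop the training early
```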
Underfitting 👇
Your model performs poorly both during training and test.
Possible solutions (see the sketch after the list):
- Increase the size of your model
- Add more data
- Train for a longer time
- Start with a pre-trained network
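For example, with a recent torchvision you can start from a network pre-trained on ImageNet and only replace the final layer (the 3 output classes are just an example):

```python
import torch.nn as nn
import torchvision.models as models

# Load a ResNet-18 with pre-trained ImageNet weights instead of training from scratch
model = models.resnet18(weights="DEFAULT")

# Replace the classification head to match your own task
model.fc = nn.Linear(model.fc.in_features, 3)
```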