We will be dealing with an ML model to detect traffic lights for a self-driving car 🤖🚗
Traffic lights are small, so most of the image will not contain traffic lights.
Furthermore, yellow lights 🟡 are much rarer than green 🟢 or red 🔴.
The problem ⁉️
Imagine we train a model to classify the color of the traffic light. A typical distribution will be:
🔴 - 56%
🟡 - 3%
🟢 - 41%
So, your model can get to 97% accuracy just by learning to distinguish red from green.
How can we deal with this?
Evaluation measures 📏
First, you need to start using a different evaluation measure than accuracy:
- Precision per class
- Recall per class
- F1 score per class
I also like to look at the confusion matrix to get an overview. Always look at examples from the data as well!
In the traffic lights example above, we will see very poor recall for 🟡 (most real examples were not recognized), while precision will likely be high.
At the same time, the precision of 🟢 and 🔴 will be lower (🟡 will be classified as 🟢 or 🔴).
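For example, with scikit-learn you can get all of these metrics in two calls. A minimal sketch, with made-up labels and toy predictions:

```python
# Per-class precision/recall/F1 and a confusion matrix with scikit-learn.
# The label names and the tiny example arrays are just for illustration.
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["red", "green", "red", "yellow", "green", "red"]
y_pred = ["red", "green", "red", "green", "green", "red"]  # the yellow one was missed

labels = ["red", "yellow", "green"]

# Precision, recall and F1 for each class separately
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))

# Rows = true class, columns = predicted class
print(confusion_matrix(y_true, y_pred, labels=labels))
```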
Get more data 📢
The best thing you can do is to collect more data for the underrepresented classes. This may be hard or even impossible...
You can imagine ways to record more yellow lights, but what if you want to detect a very rare disease in CT images?
Balance your data ⚖️
The idea is to resample your dataset so it is better balanced.
▪️ Undersampling - throw away some examples of the dominant classes
▪️ Oversampling - get more samples of the underrepresented class
Undersampling ⬇️
The easiest way is to just randomly throw away samples from the dominant class.
Even better, you can use some unsupervised clustering method and throw out only samples from the big clusters.
The problem of course is that you are throwing out valuable data...
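A minimal NumPy sketch of random undersampling (the helper function and the class counts below are made up for illustration; libraries like imbalanced-learn offer a ready-made RandomUnderSampler for this):

```python
# Randomly undersample the dominant classes so that no class exceeds a given count.
import numpy as np

rng = np.random.default_rng(42)

def undersample(X, y, target_count):
    """Keep at most `target_count` randomly chosen samples per class."""
    keep = []
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        if len(idx) > target_count:
            idx = rng.choice(idx, size=target_count, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    rng.shuffle(keep)
    return X[keep], y[keep]

# Example: 5600 red, 300 yellow, 4100 green -> cap every class at 300 samples
X = rng.normal(size=(10000, 16))
y = np.array([0] * 5600 + [1] * 300 + [2] * 4100)
X_bal, y_bal = undersample(X, y, target_count=300)
print(np.bincount(y_bal))  # [300 300 300]
```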
Oversampling ⬆️
This is more difficult. You can simply repeat samples, but that won't work very well.
You can use methods like SMOTE (Synthetic Minority Oversampling Technique) to generate new samples interpolating between existing ones. This may not be easy for complex images.
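A minimal sketch using the SMOTE implementation from the imbalanced-learn package, assuming tabular feature vectors rather than raw images (the synthetic data is just for illustration):

```python
# Oversample the minority classes with SMOTE (pip install imbalanced-learn).
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = np.array([0] * 560 + [1] * 30 + [2] * 410)  # imbalanced labels

# SMOTE generates new minority samples by interpolating between neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))  # every class now has 560 samples
```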
Oversampling ⬆️
If you are dealing with images, you can use data augmentation techniques to create new samples by modifying the existing ones (rotation, flipping, skewing, color filters...)
You can also use GANs or simulation to synthesize completely new images.
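A minimal augmentation sketch with torchvision transforms (the specific transforms and parameters are just an example; pick augmentations that don't change the label):

```python
# Create new training samples by randomly modifying existing images.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),         # small rotations
    transforms.RandomHorizontalFlip(p=0.5),        # mirroring keeps the light color
    transforms.RandomAffine(degrees=0, shear=10),  # skewing
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

# augmented_img = augment(img)  # apply to a PIL image or an image tensor
```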
Adapting your loss 📉
Another strategy is to modify your loss function to penalize misclassification of the underrepresented classes more than the dominant ones.
In the 🚦 example we can set the class weights inversely proportional to the class frequencies (e.g. 1 / 0.56 ≈ 1.8 for red):
🔴 - 1.8
🟡 - 33.3
🟢 - 2.4
If you are training a neural network with TensorFlow or PyTorch you can do this very easily:
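For example, a minimal sketch assuming the class order red/yellow/green and the weights from above:

```python
# Class-weighted loss: misclassifying a rare class costs more.

# PyTorch: pass the weights directly to the loss function
import torch
weights = torch.tensor([1.8, 33.3, 2.4])
criterion = torch.nn.CrossEntropyLoss(weight=weights)
# loss = criterion(logits, targets)

# TensorFlow/Keras: pass class weights when fitting the model
# model.fit(x_train, y_train, class_weight={0: 1.8, 1: 33.3, 2: 2.4})
```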
In practice, you will likely need to combine all of the strategies above to achieve good performance.
Look at different evaluation metrics and start playing with the parameters to find a good balance (pun intended)
Every Friday I repost one of my old threads so more people get the chance to see them. During the rest of the week, I post new content on machine learning and web3.
If you are interested in seeing more, follow me @haltakov
Principal Component Analysis is a commonly used method for dimensionality reduction.
It's a good example of how fairly complex math can have an intuitive explanation and be easy to use in practice.
Let's start with the application of PCA 👇
Dimensionality Reduction
This is one of the common uses of PCA in machine learning.
Imagine you want to predict house prices. You get a large table of many houses and different features for them like size, number of rooms, location, age, etc.
Some features seem correlated 👇
Correlated features
For example, the size of the house is correlated with the number of rooms. Bigger houses tend to have more rooms.
Another example could be the age and the year the house was built - they give us pretty much the same information.
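A minimal sketch of PCA for dimensionality reduction with scikit-learn, using a made-up house-feature matrix (in practice you would also standardize the features first):

```python
# Reduce correlated house features to fewer uncorrelated components with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
size = rng.uniform(50, 250, size=200)                    # house size in m^2
rooms = (size / 30 + rng.normal(0, 0.5, 200)).round()    # correlated with size
age = rng.uniform(0, 100, size=200)                      # independent feature

X = np.column_stack([size, rooms, age])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # 3 features -> 2 principal components
print(pca.explained_variance_ratio_)      # how much variance each component keeps
```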
For regression problems you can use one of several loss functions:
▪️ MSE
▪️ MAE
▪️ Huber loss
But which one is best? When should you prefer one instead of the other?
Thread 🧵
Let's first quickly recap what each of the loss functions does. After that, we can compare them and see the differences based on some examples.
👇
Mean Square Error (MSE)
For every sample, MSE takes the difference between the ground truth and the model's prediction and computes its square. Then, the average over all samples is computed.
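As a formula, with y_i the ground truth, ŷ_i the model's prediction and N the number of samples:

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
```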
The Cross-Entropy Loss function is one of the most used losses for classification problems. It tells us how well a machine learning model classifies a dataset compared to the ground truth labels.
The Binary Cross-Entropy Loss is a special case when we have only 2 classes.
👇
The most important part to understand is this one - this is the core of the whole formula!
Here, Y denotes the ground-truth label, while Ŷ is the predicted probability of the classifier.
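For reference, this is the standard binary cross-entropy for a single sample, with Y the ground-truth label and Ŷ the predicted probability:

```latex
\mathrm{BCE} = -\left( Y \log \hat{Y} + (1 - Y) \log\left(1 - \hat{Y}\right) \right)
```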
Let's look at a simple example before we talk about the logarithm... 👇