And if we sample only 🚦 we won't detect 🚥 🤷‍♂️
👇
Data Cleaning 🧹
Now we need to clean all corrupted and irrelevant samples. We need to remove:
⚪️ Overexposed or underexposed images
⚪️ Images in irrelevant situations
⚪️ Faulty images
Leaving them in the dataset will hurt our model's performance!
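A minimal sketch of how the exposure check could look, assuming 8-bit images as NumPy arrays; the brightness thresholds are illustrative values I picked, not from the thread:

    import numpy as np

    def is_badly_exposed(image: np.ndarray,
                         low: float = 30.0, high: float = 225.0) -> bool:
        # Mean pixel intensity of an 8-bit image: very dark frames are
        # likely underexposed, very bright ones likely overexposed.
        brightness = float(image.mean())
        return brightness < low or brightness > high

    dark = np.full((64, 64), 5, dtype=np.uint8)  # almost black frame
    print(is_badly_exposed(dark))                # True -> drop it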
👇
Preprocess Data ⚙️
Most ML models like their data nicely normalized and properly scaled. Bad normalization can also lead to worse performance (I have a nice story for another time...)
⚪️ Crop and resize all images
⚪️ Normalize all values (usually 0 mean and 1 std. dev.)
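A minimal sketch of these two steps in NumPy; the crop size and the dataset statistics are placeholder values (in practice the mean and std. dev. would be computed over the training set):

    import numpy as np

    def preprocess(image: np.ndarray, size: int = 224,
                   mean: float = 0.45, std: float = 0.22) -> np.ndarray:
        # Center-crop to a square of side `size` (assumes the image is big enough)
        h, w = image.shape[:2]
        top, left = (h - size) // 2, (w - size) // 2
        crop = image[top:top + size, left:left + size].astype(np.float32)
        # Scale to [0, 1], then shift/scale toward 0 mean and 1 std. dev.
        return (crop / 255.0 - mean) / std

    image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
    print(preprocess(image).shape)  # (224, 224, 3)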
👇
Label Data 🏷️
Manual labeling is expensive. Try to be clever and automate as much as possible:
⚪️ Generate labels from the input data
⚪️ Use slow but accurate algorithms offline
⚪️ Pre-label data during collection
⚪️ Develop good labeling tools
⚪️ Use synthetic data?
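As a sketch of combining two of the ideas above (pre-labeling with a slow but accurate offline model): offline_model and its sklearn-style predict_proba method are hypothetical stand-ins, and the confidence threshold is my choice:

    import numpy as np

    def pre_label(samples, offline_model, threshold: float = 0.9):
        # Keep confident predictions as pre-labels; route the rest to humans
        pre_labels = []
        for sample in samples:
            probs = offline_model.predict_proba([sample])[0]  # hypothetical API
            label = int(np.argmax(probs))
            confident = probs[label] >= threshold
            pre_labels.append((sample, label if confident else None))
        return pre_labels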
👇
Label Correction ✅
You will always have errors in the labels - humans make mistakes. Review and iterate!
⚪️ Spot checks to find systematic problems
⚪️ Improve labeling guidelines and tools
⚪️ Review test results and fix labels
⚪️ Label samples multiple times
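A minimal sketch of the last idea - consolidating repeated labels by majority vote; the agreement threshold for flagging samples for review is my assumption:

    from collections import Counter

    def consolidate(labels_per_sample, min_agreement: float = 0.6):
        # Take the majority label per sample and flag low-agreement samples
        # for human review (they are likely ambiguous or mislabeled).
        results = []
        for labels in labels_per_sample:
            (label, votes), = Counter(labels).most_common(1)
            needs_review = votes / len(labels) < min_agreement
            results.append((label, needs_review))
        return results

    # Three annotations per sample; the last sample has no clear majority
    print(consolidate([[1, 1, 0], [0, 0, 0], [2, 1, 0]]))
    # -> [(1, False), (0, False), (2, True)]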
👇
The danger of label errors 🧑‍🏫
A recent study by MIT found that 10 of the most popular public datasets had 3.4% label errors on average (ImageNet had 5.8%).
This even led authors to choose the wrong (and more complex) model as their best one!
Train Model 🏋️
This is the part that is usually covered by ML courses. Now is the time to try out different features, network architectures, fine-tune hyperparameters, etc.
But we are not done yet... 👇
Iterative Process 🔄
In most real-world applications the bottleneck is not the model itself, but the data. After having a first model, we need to review where it has problems and go back to:
⚪️ Collecting and labeling more data
⚪️ Correcting labels
⚪️ Balancing the data
👇
Deploy Model 🚢
Deploying the model in production poses some additional constraints, for example inference speed, memory footprint, and compute cost.
We have to find a good trade-off between these factors and accuracy.
Now we are done, right? No... 👇
Monitoring 🖥️
The performance of the model will start degrading over time because the world keeps changing:
⚪️ Concept drift - the relationship between the inputs and the target changes (the real world itself changes)
⚪️ Data drift - the statistical properties of the input data change
We need to detect this, retrain, and deploy again.
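A minimal sketch of one way to detect data drift, assuming we log some numeric feature (e.g., image brightness) for both training and production data; the two-sample Kolmogorov-Smirnov test and the 0.05 threshold are my choices for illustration:

    import numpy as np
    from scipy.stats import ks_2samp

    def data_drift_detected(train_feature, prod_feature,
                            alpha: float = 0.05) -> bool:
        # Two-sample KS test: a small p-value means the production
        # distribution no longer matches the training distribution.
        _, p_value = ks_2samp(train_feature, prod_feature)
        return p_value < alpha

    rng = np.random.default_rng(0)
    train = rng.normal(0.0, 1.0, 1000)  # feature at training time
    prod = rng.normal(0.5, 1.0, 1000)   # shifted distribution in production
    print(data_drift_detected(train, prod))  # True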
Example 👇
Drift ⚡️
We now have a trained model to recognize 🚦, but people keep inventing new variants - see what some creative people in Munich came up with 👇
We need a way to detect that we have a problem, collect data, label, and retrain our model.
👇
Summary 📝
This is what a typical ML pipeline for real-world applications looks like. Please remember this:
⚪️ Curating a good dataset is the most important thing
⚪️ Dataset curation is an iterative process
⚪️ Monitoring is critical to ensure good performance over time
Every Friday I repost one of my old threads so more people get the chance to see them. During the rest of the week, I post new content on machine learning and web3.
If you are interested in seeing more, follow me @haltakov
The Cross-Entropy Loss function is one of the most used losses for classification problems. It tells us how well a machine learning model classifies a dataset compared to the ground truth labels.
The Binary Cross-Entropy Loss is a special case when we have only 2 classes.
👇
The most important part to understand is the core of the whole formula!
Here, Y denotes the ground-truth label, while Ŷ is the predicted probability of the classifier.
Let's look at a simple example before we talk about the logarithm... 👇
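The formula itself was an image in the original thread; reconstructing it from the definitions above, the per-sample loss is -(Y·log(Ŷ) + (1 - Y)·log(1 - Ŷ)). A tiny sketch with illustrative numbers (natural log):

    import numpy as np

    def binary_cross_entropy(y: float, y_hat: float) -> float:
        # Only one term is active: -log(y_hat) if y = 1, -log(1 - y_hat) if y = 0
        return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    print(binary_cross_entropy(1, 0.9))  # confident and right -> ~0.105
    print(binary_cross_entropy(1, 0.1))  # confident but wrong -> ~2.303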
When machine learning met crypto art... they fell in love ❤️
The Decentralized Autonomous Artist (DAA) is a concept that is uniquely enabled by these technologies.
Meet my favorite DAA - Botto.
Let me tell you how it works 👇
Botto uses a popular technique to create images - VQGAN+CLIP
In simple terms, it uses a neural network model that generates images (VQGAN), guided by the powerful CLIP model, which can relate images to text.
This method can create stunning visuals from a simple text prompt!
👇
Creating amazing images, though, requires finding the right text prompt
Botto is programmed by its creator - artist Mario Klingemann (@quasimondo), but it creates all art itself. There is no human intervention in the creation of the images!
ROC curves measure the True Positive Rate (also known as Sensitivity or Recall). So, if you have an imbalanced dataset, the ROC curve will not tell you if your classifier completely ignores the underrepresented class.
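A small sketch of this effect with scikit-learn on synthetic imbalanced data (the 100:1 ratio and the score distributions are made up for illustration):

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(42)
    y_true = np.concatenate([np.zeros(10_000), np.ones(100)])  # rare positives
    y_score = np.concatenate([rng.normal(0.0, 1.0, 10_000),    # negatives
                              rng.normal(1.0, 1.0, 100)])      # positives

    # ROC AUC can look respectable while the precision-oriented metric
    # exposes how badly the rare class is handled.
    print("ROC AUC:", roc_auc_score(y_true, y_score))
    print("Average precision:", average_precision_score(y_true, y_score))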
Is your machine learning model performing well? What about in 6 months? 🤔
If you are wondering why I'm asking this, you need to learn about concept drift and data drift.
Let me explain this to you using two real world examples.
Thread 👇
Imagine you are developing a model for a self-driving car to detect other vehicles at night.
Well, this is not too difficult, since vehicles have two red tail lights and it is easy to get a lot of data. Your model works great!
But then... 👇
Car companies decide to experiment with red horizontal bars instead of two individual lights.
Now your model fails to detect these cars because it has never seen this kind of tail light.
Your model is suffering from concept drift
Math is not very important when you are using a machine learning method to solve your problem.
Everybody who disagrees should study the 92-page appendix of the Self-Normalizing Networks (SNN) paper before using torch.nn.SELU.
And the core idea of SNN is actually simple 👇
SNNs use an activation function called Scaled Exponential Linear Unit (SELU) that is pretty simple to define.
It has the advantage that the activations converge to zero mean and unit variance, which allows training of deeper networks and employing strong regularization.
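A sketch of SELU in plain NumPy; the two constants are the fixed values derived in the paper (not tunable hyperparameters):

    import numpy as np

    ALPHA = 1.6732632423543772  # alpha from the SNN paper
    SCALE = 1.0507009873554805  # lambda from the SNN paper

    def selu(x: np.ndarray) -> np.ndarray:
        # scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise
        return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

    print(selu(np.array([-2.0, -0.5, 0.0, 1.0])))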
👇
There are implementations both in PyTorch (torch.nn.SELU) and TensorFlow (tf.keras.activations.selu).
You need to be careful to use the correct initialization function and dropout, but this is well documented.
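A minimal PyTorch sketch of such a block; LeCun-normal initialization (via kaiming_normal_ with a linear gain) and nn.AlphaDropout are the documented companions of SELU, while the layer sizes and dropout rate here are placeholders:

    import torch
    import torch.nn as nn

    class SNNBlock(nn.Module):
        def __init__(self, in_features: int, out_features: int, p: float = 0.05):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features)
            self.activation = nn.SELU()
            self.dropout = nn.AlphaDropout(p)  # keeps the self-normalizing property
            # LeCun normal init: std = 1 / sqrt(fan_in)
            nn.init.kaiming_normal_(self.linear.weight, nonlinearity='linear')
            nn.init.zeros_(self.linear.bias)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.dropout(self.activation(self.linear(x)))

    x = torch.randn(8, 32)
    print(SNNBlock(32, 64)(x).shape)  # torch.Size([8, 64])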