Here is an underrated machine learning technique that will give you important information about your data and model.
Let's talk about learning curves.
Grab your ☕️ and let's do this thing!
🧵👇
Start by creating a model. Something simple. You are still exploring what works and what doesn't, so don't get fancy yet.
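Something like this is enough to start (a minimal sketch, assuming scikit-learn and a classification problem; logistic regression is just an example baseline):

```python
# A deliberately simple baseline model. Any basic estimator
# works at this stage; don't reach for anything fancy yet.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
```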
We are now going to plot the loss (model error) vs. the training dataset size — there's a sketch of how right after this list. This will help us answer the following questions:
▫️ Do we need more data?
▫️ Do we have a bias problem?
▫️ Do we have a variance problem?
▫️ What's the ideal picture?
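Here's one way to get these curves (a minimal sketch, assuming scikit-learn and matplotlib; the synthetic dataset is just a stand-in for your real one):

```python
# Plot loss vs. training set size using scikit-learn's learning_curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data as a placeholder for your dataset.
X, y = make_classification(n_samples=2000, random_state=42)

# Score the model at increasing training set sizes, with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="neg_log_loss",
)

# scikit-learn returns negated losses; flip the sign so lower is better.
train_loss = -train_scores.mean(axis=1)
val_loss = -val_scores.mean(axis=1)

plt.plot(train_sizes, train_loss, label="Training loss")
plt.plot(train_sizes, val_loss, label="Validation loss")
plt.xlabel("Training set size")
plt.ylabel("Loss")
plt.legend()
plt.show()
```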
▫️ Do we need more data?
As you increase the training size, if both curves converge towards each other and stop improving, you don't need more data.
If there's room for them to continue closing the gap, then more data should help.
This one should be self-explanatory: if the errors stop improving as we add more data, it's unlikely that even more will do any good.
But if we still see the loss improving, more data should help push it even lower.
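To make that check concrete, here's a hypothetical helper (still_improving is a made-up name, and the tolerance is an assumption) that looks at the tail of the validation curve from the sketch above:

```python
# Hypothetical helper: is the validation loss still dropping?
def still_improving(val_loss, tol=1e-3):
    # If the last step still reduced the loss by more than tol,
    # more data will probably keep pushing it lower.
    return (val_loss[-2] - val_loss[-1]) > tol
```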
▫️ Do we have a bias problem?
If the training error is too high, we have a high bias problem.
Also, if the validation error is too high, bias is part of the story: it could be either low or high bias.
A high bias indicates that our model is not powerful enough to learn the data. This is why our training error is high.
If the training error is low, that's a good thing: our model can fit the data.
High validation error indicates that our model is not performing well on the validation data. We probably have a bias problem.
To know in which direction, look at the training error (see the sketch after these bullets):
▫️ Low training error: low bias
▫️ High training error: high bias
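As a rough sketch of that rule (diagnose_bias and the 0.1 threshold are assumptions; pick whatever "good enough" means for your task):

```python
# Hypothetical rule of thumb for reading the training curve.
def diagnose_bias(train_loss, acceptable_loss=0.1):
    if train_loss[-1] > acceptable_loss:
        return "high bias: the model can't even fit the training data"
    return "low bias: the model fits the training data"
```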
▫️ Do we have a variance problem?
If there's a big gap between the training error and the validation error, we have high variance.
A very low training error sitting next to a much higher validation error is the telltale combination.
High variance indicates that the model fits the training data too well (probably memorizing it).
When we evaluate on the validation set, that big gap shows up: the model did great on the training data but sucked on the validation data.
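Same idea, sketched as a hypothetical check (diagnose_variance and the gap threshold are made up; tune them to your loss scale):

```python
# Hypothetical rule of thumb for reading the train/validation gap.
def diagnose_variance(train_loss, val_loss, max_gap=0.05):
    gap = val_loss[-1] - train_loss[-1]
    if gap > max_gap:
        return "high variance: great on training data, poor on validation"
    return "low variance: the two curves have converged"
```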
A couple more important points:
▫️ High bias + low variance: we are underfitting.
▫️ High variance + low bias: we are overfitting.
▫️ What's the ideal picture?
These are the curves you want to see.
Training and validation error both converge to a low value.
Here is another chart that does an excellent job of explaining bias and variance.
You want low bias + low variance, but keep in mind there's always a tradeoff between them: you need to find a good enough balance for your specific use case.
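One way to see the tradeoff yourself (a sketch, assuming scikit-learn; tree depth is just one convenient complexity knob):

```python
# Sweep model complexity and watch bias and variance trade off.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=42)

depths = np.arange(1, 15)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

# Shallow trees: both errors high (underfitting, high bias).
# Deep trees: training error near zero while validation error
# creeps back up (overfitting, high variance).
plt.plot(depths, 1 - train_scores.mean(axis=1), label="Training error")
plt.plot(depths, 1 - val_scores.mean(axis=1), label="Validation error")
plt.xlabel("Tree depth (model complexity)")
plt.ylabel("Error")
plt.legend()
plt.show()
```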
If these threads help, then make sure to follow me, and you won't be disappointed.
And for even more in-depth machine learning stories, make sure you head over to digest.underfitted.io. The first issue is coming this Friday!
Here is a quick guide that will help you deal with overfitting and underfitting:
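A minimal sketch of the usual levers (assuming scikit-learn; C is logistic regression's inverse regularization strength, and both values here are illustrative):

```python
from sklearn.linear_model import LogisticRegression

# Underfitting (high bias): give the model more capacity,
# e.g. weaker regularization (higher C) or richer features.
bigger_model = LogisticRegression(C=10.0, max_iter=1000)

# Overfitting (high variance): constrain the model with stronger
# regularization (lower C), or collect more training data.
smaller_model = LogisticRegression(C=0.1, max_iter=1000)
```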