What can you do when your machine learning model stops improving?
There's always a point where you hit a ceiling and your model's performance stalls.
Thread: A couple of tricks to improve your model.
Here is something that's keeping you from making progress:
You are using all of your data.
It turns out that more data is not always a good thing.
What would it look like to only focus on some of the data? Would that be helpful?
Here is the plan:
1. Find whether there's a portion of the data that's holding you back. Get rid of it.
2. Find whether a portion of the data is better suited for a different model.
Let's break these two apart to understand what to do.
The data might be noisy, and noise messes up your predictions.
The first strategy is about identifying those bad samples and getting rid of them.
Remember: f(🗑) → 🗑 (garbage in → garbage out)
The cleaner your training data is, the better your model performance will be.
There are many different ways to identify noise in your data.
Here is one using k-fold cross-validation:
1. Train and evaluate your model.
2. Pick the worst-performing fold.
3. Focus on the worst-performing samples.
4. Identify patterns.
5. Get rid of those samples.
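Here is a minimal sketch of that loop, assuming a scikit-learn regressor and a tabular dataset as NumPy arrays. The model choice and the `n_splits` value are placeholders, not prescriptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def worst_samples_by_fold(X, y, n_splits=5):
    """Rank folds by validation error and surface their worst samples."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    results = []
    for train_idx, val_idx in kfold.split(X):
        model = RandomForestRegressor(random_state=42)
        model.fit(X[train_idx], y[train_idx])
        # Per-sample squared error on the validation fold.
        errors = (model.predict(X[val_idx]) - y[val_idx]) ** 2
        worst_first = val_idx[np.argsort(errors)[::-1]]
        results.append((errors.mean(), worst_first))
    # Worst fold first: inspect its top samples and look for patterns.
    return sorted(results, key=lambda r: r[0], reverse=True)
```

The samples that keep showing up at the top of that list are your candidates for removal.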
Depending on your data and the model you are building, there are many other ways to identify bad samples.
Here is the important takeaway so far:
• More data is not always a good thing.
The cleaner your data, the better performance you should expect.
The second strategy is less common.
What would it look like to have different models working on the dataset? I know you have heard about "ensembles" before, but this one is a little bit different:
Different models working on different sections of the data.
Here is what you can do:
Slice your data into cohesive sections and try your model on each one of them.
For example, slice the dataset by any categorical column, and train a model on each slice separately.
Does your model perform better on one particular slice?
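Here is one way to try that idea, sketched with hypothetical names: a pandas DataFrame `df`, a target column `target`, and a categorical column `segment` to slice by.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rmse_per_slice(df, target="target", by="segment"):
    """Train and score a separate model on each slice of a categorical column."""
    scores = {}
    for value, slice_df in df.groupby(by):
        X = slice_df.drop(columns=[target, by])
        y = slice_df[target]
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        model = RandomForestRegressor(random_state=42)
        model.fit(X_train, y_train)
        scores[value] = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    return scores
```

Compare those per-slice numbers against your single-model baseline.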
Here is a common scenario:
• You have a model with RMSE of 0.9.
• You slice the data into 3 sets.
• Train a model on each set.
• Performance on each set is now 0.99, 0.9, 0.55.
We can do something with this!
First, there's a set where the model outperforms our single-model baseline, with an RMSE of 0.55. We definitely want to keep that.
But there's a set where it underperforms by a lot, at 0.99!
Can you build a different model that works better on that slice of the data?
Hopefully, the strategy is clear by now:
1. Start with a baseline model.
2. Slice the data.
3. Train a model on each subset.
4. Identify underperformers.
5. Train a different model on them.
6. Combine the results in an ensemble.
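A rough sketch of steps 2 through 6, reusing the hypothetical `segment` and `target` names from the snippet above: one model per slice, an optional override for the slices where the default model underperforms, and predictions routed by slice at inference time.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

class SlicedEnsemble:
    """Train one model per slice and route each row to its slice's model."""

    def __init__(self, by="segment", target="target", overrides=None):
        self.by = by
        self.target = target
        # Optional mapping of slice value -> estimator for underperforming slices.
        self.overrides = overrides or {}
        self.models = {}

    def fit(self, df):
        for value, slice_df in df.groupby(self.by):
            X = slice_df.drop(columns=[self.target, self.by])
            y = slice_df[self.target]
            model = self.overrides.get(value, RandomForestRegressor(random_state=42))
            model.fit(X, y)
            self.models[value] = model
        return self

    def predict(self, df):
        parts = []
        for value, slice_df in df.groupby(self.by):
            X = slice_df.drop(columns=[self.by])
            preds = pd.Series(self.models[value].predict(X), index=slice_df.index)
            parts.append(preds)
        # Reassemble predictions in the original row order.
        return pd.concat(parts).sort_index()
```

Once you know which slice is dragging you down, you could pass something like `overrides={"some_slice": GradientBoostingRegressor()}` to give it a different model.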
One thing to remember: there's no free lunch.
Sometimes, one model with 0.90 RMSE is much better than 3 models with 0.89 RMSE.
Unless you are only optimizing for the model's performance, complexity always comes at a cost.
Let's recap the four main ideas of this thread:
1. More data is not always better.
2. Finding and removing noise pays off.
3. Slicing the data and building a different model for each slice is a way to squeeze out better performance.
4. Complexity comes at a cost.
I post threads like this every Tuesday and Friday.
Follow me @svpino for practical tips and stories about my experience with machine learning.