When we start with machine learning, we learn to split our datasets into training and testing sets by taking a percentage of the data.
Unfortunately, this practice can lead to overestimating your model's performance.
↓ 1/7
Imagine a dataset of pictures of people making signs with their hands.
As we were told, we take 70% of the images for training and the remaining 30% for testing. We are careful to maintain the original ratio between classes.
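Here's a rough sketch of that 70/30 stratified split in plain Python (the filenames and class names are made up for illustration):

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_ratio=0.3, seed=42):
    """Split items while preserving each class's ratio in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Hypothetical dataset: 20 images, two hand signs, 10 of each.
images = [f"img_{i}.jpg" for i in range(20)]
signs = ["thumbs_up"] * 10 + ["peace"] * 10
train, test = stratified_split(images, signs)
print(len(train), len(test))  # 14 6
```

In practice you'd use `train_test_split(..., stratify=labels)` from scikit-learn, which does the same thing.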
How could this be a problem?
↓ 2/7
There are a lot of pictures of Mary in the dataset, showing different signs with her hands.
The same goes for Joe. He was another model who participated in creating the dataset.
↓ 3/7
By splitting the data without taking this into account, we will get Mary and Joe in both the training and testing sets.
Unfortunately, this gives our model an unfair advantage during testing.
↓ 4/7
In a real-life situation, our model will never see Mary or Joe. However, we both trained and tested with pictures of them.
Ideally, you want to test with a distribution that closely resembles real life.
↓ 5/7
The solution for this? Train with Mary. Test with Joe.
You need to understand your data. Random splits are dangerous.
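Here's a minimal sketch of that idea, a group-aware split in plain Python (the filenames and person tags are hypothetical):

```python
import random

def group_split(items, groups, test_ratio=0.3, seed=42):
    """Split so each group (here, a person) ends up entirely in
    the training set or entirely in the testing set."""
    rng = random.Random(seed)
    uniq = sorted(set(groups))
    rng.shuffle(uniq)
    n_test = max(1, round(len(uniq) * test_ratio))
    test_groups = set(uniq[:n_test])
    train = [i for i, g in zip(items, groups) if g not in test_groups]
    test = [i for i, g in zip(items, groups) if g in test_groups]
    return train, test

# Hypothetical filenames tagged with the person in each picture.
pics = ["mary_1", "mary_2", "mary_3", "joe_1", "joe_2", "ann_1"]
people = ["mary", "mary", "mary", "joe", "joe", "ann"]
train, test = group_split(pics, people)
# Whoever lands in `test` never appears in `train`.
```

scikit-learn ships this ready-made as `GroupShuffleSplit` (and `GroupKFold`) in `sklearn.model_selection`, which is what you'd reach for in practice.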
↓ 6/7
Follow me for practical tips about machine learning. The kind schools don't teach, and that you only find out after beating your head against the wall.
Coming soon, in Python 🐍 3.10: "Pattern Matching."
Looks sick!
No, this is not a switch statement. Pattern matching is very different.
With patterns, you get a small language to describe the structure of the values you want to match. Look at one of the examples to see how you can match an element of a tuple.
You can use patterns to match even more complex structures. You can nest them. You can add guards to make a match conditional.
Pattern matching is a feature you'll find in functional languages.
It's excellent that Python decided to add it! I'm really excited about it.