When we start with machine learning, we learn to split our datasets into training and testing sets by taking a percentage of the data.
Unfortunately, this practice can lead us to overestimate the performance of our model.
1/7
Imagine a dataset of pictures of people making hand signals.
As we were told, we take 70% of the images for training and the remaining 30% for testing. We are careful to maintain the original ratio between classes.
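For reference, here is a minimal sketch of that kind of split in plain Python. The toy labels, the class names, and the `stratified_split` helper are made up for illustration; a real project would more likely use a library routine.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), keeping each class's share roughly equal in both."""
    rng = random.Random(seed)

    # Group image indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for training.
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx


# Hypothetical toy dataset: 100 images across three hand-signal classes.
labels = ["thumbs_up"] * 50 + ["peace"] * 30 + ["stop"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Notice that the split only looks at the class label of each image. It knows nothing about who appears in the picture.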
How could this be a problem?
2/7
There are a lot of pictures of Mary in the dataset. She is making different hand signals in them.
Joe is in there too. He was another model who took part in creating the dataset.