Santiago
2 Apr, 10 tweets, 3 min read
When we start with machine learning, we learn to split our datasets into training and testing sets by taking a percentage of the data.

Unfortunately, this practice could lead to overestimating the performance of your model.

1/7
Imagine a dataset of pictures of people making signals with their hands.

As we were told, we take 70% of the images for training and the remaining 30% for testing. We are careful to maintain the original ratio between classes.

How could this be a problem?

2/7
There are a lot of pictures of Mary in the dataset. She is showing different signals with her hands.

There are also many pictures of Joe, another model who took part in creating the dataset.

3/7
By splitting the data without taking this into account, we will get Mary and Joe in both the training and testing sets.

Unfortunately, this gives our model an unfair advantage during testing.

4/7
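Here is a minimal sketch of how that leak happens. The dataset, the names, and the `random_split` helper are all made up for illustration, and it skips the class-ratio balancing to stay short. It only shows that a plain random split puts the same people on both sides.

```python
import random

# Hypothetical toy dataset: (image_id, person, hand_signal).
# The names and counts are invented for illustration only.
people = ["Mary", "Joe", "Ana", "Li"]
signals = ["stop", "ok", "peace"]
dataset = [
    (f"{person}_{signal}_{i}", person, signal)
    for person in people
    for signal in signals
    for i in range(10)
]


def random_split(samples, test_fraction=0.3, seed=0):
    """Shuffle the samples and cut off a fraction of them for testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = random_split(dataset)

train_people = {person for _, person, _ in train}
test_people = {person for _, person, _ in test}

# Anyone who appears on both sides of the split leaks into the test set.
leaked = train_people & test_people
print("People in both train and test:", sorted(leaked))
```

With only a handful of people and many pictures of each, the overlap is never empty, so the test score partly measures how well the model remembers faces and hands it has already seen.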
In a real-life situation, our model will not see Mary or Joe. However, we both trained and tested with pictures of them.

Ideally, you want to test with a distribution that closely resembles real life.

5/7
The solution for this? Train with Mary. Test with Joe.

You need to understand your data. Random splits are dangerous.

6/7
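One way to sketch "train with Mary, test with Joe" is to split by person instead of by picture. The `split_by_person` helper and the toy data below are assumptions for illustration, not a reference implementation. Libraries such as scikit-learn ship group-aware splitters (for example `GroupShuffleSplit`) that do this more carefully.

```python
import random

# Hypothetical toy dataset: (image_id, person) pairs.
dataset = [
    (f"{person}_{i}", person)
    for person in ["Mary", "Joe", "Ana", "Li", "Sam"]
    for i in range(10)
]


def split_by_person(samples, test_fraction=0.3, seed=0):
    """Assign whole people, not single images, to train or test."""
    people = sorted({person for _, person in samples})
    random.Random(seed).shuffle(people)
    # The fraction applies to people, so the image ratio is only approximate.
    n_test = max(1, round(len(people) * test_fraction))
    held_out = set(people[:n_test])
    train = [s for s in samples if s[1] not in held_out]
    test = [s for s in samples if s[1] in held_out]
    return train, test


train, test = split_by_person(dataset)

train_people = {person for _, person in train}
test_people = {person for _, person in test}
print("Train on:", sorted(train_people))
print("Test on:", sorted(test_people))
print("Overlap:", sorted(train_people & test_people))
```

The trade-off is that the test fraction is counted in people, so if one person has far more pictures than the others, the image ratio can drift away from 70/30.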
Follow me for practical tips about machine learning: the ones that schools don't teach and that you only find out after beating your head against the wall.

Check my feed at @svpino for more.

7/7
K-Fold Cross-validation will give you a better idea of performance than using a single test set.

One problem with this method is that it becomes prohibitively expensive for large datasets, because the model has to be trained once per fold.
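K-fold has the same leak if the folds are cut at random, so each person should land in only one fold. Below is a rough sketch of that idea. The `group_k_fold` helper and its greedy balancing are assumptions made for illustration. scikit-learn's `GroupKFold` is a ready-made alternative.

```python
from collections import defaultdict


def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs where no group spans both sides."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    if k > len(by_group):
        raise ValueError("k cannot exceed the number of distinct groups")

    # Greedy balancing: place the largest groups first, each one into
    # whichever fold currently holds the fewest samples.
    folds = [[] for _ in range(k)]
    ordered = sorted(by_group, key=lambda g: (-len(by_group[g]), str(g)))
    for group in ordered:
        min(folds, key=len).extend(by_group[group])

    for i, test_idx in enumerate(folds):
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, sorted(test_idx)


# Hypothetical example: one person label per picture.
groups = ["Mary"] * 4 + ["Joe"] * 3 + ["Ana"] * 2 + ["Li"]
splits = list(group_k_fold(groups, k=2))

for train_idx, test_idx in splits:
    held_out = sorted({groups[i] for i in test_idx})
    print("Test fold holds out:", held_out)
```

Every picture is tested exactly once, and the people in a test fold never appear in the matching training data.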

Think about it this way:

What information are you using to test your model that you can't realistically expect to see in real life?

In this example, we are using both Mary and Joe in our dataset to both train and test the model, but we really need different people.

If you are expecting to see loan applications from customers that have no record, then yes.

You want your test set to represent reality as closely as possible.
