Santiago
11 Feb, 21 tweets, 4 min read
Today let's talk about why we keep "splitting the data" into different sets.

Besides machine learning people being quirky, what else is going on here?

Grab your coffee ☕️, and let's do it!

🧵👇
Imagine you are teaching a class.

Your students are getting ready for the exam, and you give them 100 answered questions so they can prepare.

You now need to design the exam.

What's the best way to evaluate the students?

(2 / 19)
If you evaluate the students on the same questions you gave them to prepare, you'll reward those who just memorized the questions.

That won't give you a good measure of how much they learned.

😑

(3 / 19)
Instead, you decide to use different questions.

Only students who learned the material will be able to get a good score. Those who merely memorized the initial set of questions will be out of luck.

🤓

(4 / 19)
When building machine learning models, we follow a similar strategy.

We take a portion of the data and use it to train our model (the student.)

We call this portion the "train set."

(5 / 19)
But we don't use all of the data for training!

Instead, we leave a portion of it to evaluate how much our model learned after training.

We call this portion of the data the "validation set."

(6 / 19)
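In code, this split is often a one-liner. Here's a minimal sketch using scikit-learn (the library, the toy dataset, and the 80/20 ratio are just illustrative choices on my part):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Toy dataset standing in for "the data."
X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples as the validation set; the rest is the train set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_val))  # 120 training samples, 30 validation samples
```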
What do you think would happen if we evaluate the model on the same data we used to train it?

Just like in our analogy, the score of the model will probably be very high.

Even if it just memorized the data, it will still score well!

This is not good.

(7 / 19)
Machine learning people usually talk about "training and validation accuracy."

Which one do you think would be higher?

The training accuracy, most likely: it comes from evaluating the model on the same data it was trained on!

(8 / 19)
Sometimes, the training accuracy is excellent, while the validation accuracy is not.

When this happens, we say the model "overfit."

This means the model memorized the training data, and when presented with the real exam (the validation set), it failed miserably.

(9 / 19)
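One way to see this in practice is to compare the two scores directly. A rough sketch (the decision tree and the iris dataset are placeholders I picked; any model capable of memorizing the data would do):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained tree is deep enough to memorize the training data.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # data the model has seen
val_acc = model.score(X_val, y_val)        # data the model has never seen

# A training accuracy far above the validation accuracy is the classic sign of overfitting.
print(f"train: {train_acc:.2f}  validation: {val_acc:.2f}")
```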
There's more.

We use the results of evaluating our models to improve them.

This is no different than a teacher pointing their students in the right direction after analyzing the exam results.

(10 / 19)
We do this over and over again:

โ–ซ๏ธ Train
โ–ซ๏ธ Evaluate
โ–ซ๏ธ Tweak
โ–ซ๏ธ Repeat

What do you think will happen after we repeat this cycle too many times?

(11 / 19)
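As a sketch, one run through that cycle might look like this (tweaking the depth of a decision tree is just an illustrative choice of "tweak"):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

best_depth, best_acc = None, 0.0
for depth in range(1, 11):                                  # tweak, repeat
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)                             # train
    acc = model.score(X_val, y_val)                         # evaluate
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Every pass leaks a little information about the validation set into our choices.
print(best_depth, best_acc)
```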
Repeat the cycle too many times, and the model will get really good at acing the evaluation.

Slowly, it will start "overfitting" to the validation set.

At some point, we will get excellent scores that don't truly represent the model's actual performance.

(12 / 19)
You can probably imagine the solution: we need a new validation set.

In practice, we add the old validation set to the training data, and we get a new, fresh validation set.

Remember the teacher giving you the previous year's tests for practice? Same thing.

(13 / 19)
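A minimal sketch of that "refresh," assuming we already have an old train/validation split lying around (the array shapes here are made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Pretend these are the splits from earlier iterations.
rng = np.random.default_rng(0)
X_old_train, y_old_train = rng.random((80, 4)), rng.integers(0, 2, 80)
X_old_val, y_old_val = rng.random((20, 4)), rng.integers(0, 2, 20)

# Fold the old validation set back into the training data...
X_pool = np.concatenate([X_old_train, X_old_val])
y_pool = np.concatenate([y_old_train, y_old_val])

# ...and hold out a fresh validation set from the combined pool.
X_train, X_val, y_train, y_val = train_test_split(X_pool, y_pool, test_size=0.2, random_state=7)
```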
There's something else we do.

We take another portion of the data and set it aside. We call this the "test set," and we never look at it during training.

Then we go and train and validate our model until we are happy with it.

(14 / 19)
When we finish, we use the test set for a final, proper evaluation of the model's performance.

The advantage is that the model has never seen this data, neither directly (during training) nor indirectly (during validation).

(15 / 19)
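Put together, the three-way split can be as simple as two calls. A sketch (the 60/20/20 ratios match the rough numbers mentioned at the end of the thread; the rest is my own illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Carve out the test set first and put it aside.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the remainder into train and validation (0.25 of 80% = 20% of the total).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Train and validate as much as you like with X_train / X_val...
# ...and only at the very end, evaluate once on X_test / y_test.
```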
This is the best evaluation to understand the true capabilities of our model.

Once we use the test set, we never use it again for testing. We fold it back into the train set and find fresh data to test the model in future iterations.

(16 / 19)
I always felt that splitting the original data into multiple parts was arbitrary until I understood its importance.

Hopefully, this thread helps it click for you too.

To finish, here are a few more notes.

(17 / 19)
1. In practice, the sizes of the train, validation, and test sets vary. A 60% / 20% / 20% split is a reasonable default to keep in mind.

2. There are multiple ways to validate a model. Here I explained a simple split, but there are other techniques, like k-fold cross-validation (see the sketch right after this tweet).

(18 / 19)
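Here is the sketch promised above for point 2: k-fold cross-validation with scikit-learn (the model and dataset are placeholders I chose for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# With cv=5, each fold takes a turn as the validation set while the
# other four folds are used for training.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # an average that is less sensitive to any single split
```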
3. The size of the dataset influences the split and the techniques that I presented here. Some may not be possible without enough data.

4. This thread is not a paper or a scientific presentation. I'm aiming to build intuition among those who are learning this stuff.

(19 / 19)
If you enjoy these attempts to make machine learning a little more intuitive, stay tuned and check out @svpino for more of these threads.

I'm really enjoying hearing from those who tell me that these explanations hit home for them. Thanks for the feedback!
Thanks to @gusthema for the inspiration to write this thread.
