The ABSOLUTE ESSENTIALS of Splitting Data for Machine Learning

(Explanation + Scikit-learn Implementation)

🧵 Long Thread 👇👇
#MachineLearning #DataScience
📜Introduction
Why do we need to split our data?
After training the model, you want to test its performance on new data before putting it in production. In other words, you want to measure the generalization error (how well does the model generalize to new data?).
The data is commonly split into 3 different sets:
1. Training Set
2. Development Set (Holdout Validation Set)
3. Test Set
1⃣ Training Set
- The part of the data that is used to fit (train) the model.
- The quality and quantity of the training data will have a remarkable impact on the performance of the model.
2⃣ Development Set (Holdout Validation Set)
- The part of the data that is used to provide an unbiased evaluation of the performance of several candidate models.
- I know this sounds a bit vague, but this will become clearer after the following explanation. 👇👇
What is the purpose of the development set?
- When you start a machine learning project, it is really difficult to guess which model will have the best performance, or what the optimal values for the hyperparameters (learning rate, number of layers, etc.) are.
One approach is to train multiple candidate models with different architectures, and with different values for each hyperparameter.
Then you can test the performance of each of the candidate models on the test set, and choose the model with the best performance for deployment.
However, the above approach is flawed. Why?
The problem is that you have measured the generalization error multiple times on the test set, and you have tuned the model and hyperparameters to produce the best performance on that particular set (Biased Evaluation).
The solution is to create a separate set, called the development set, on which you can tune your hyperparameters and test the performance of the candidate models.
Then you choose the model with the best performance on that set as the final model.
3⃣ Test Set
- The part of the data that is used to provide an unbiased evaluation of the final model's performance.
📜The Size of each Set

🔴Traditional Machine Learning Algorithms
- You usually have a relatively low quantity of data (maybe 100,000 training instances).
Training Set: 60%
Development Set: 20%
Test Set: 20%
🔴 Deep Learning
- You usually have a huge quantity of data (maybe millions of training instances)
Training Set: 98%
Development Set: 1%
Test Set: 1%
📜Code Implementation
Assume we have a dataset called "input_data" and we are training a model named "model".😁

🔴The First Approach
Step 1
Import scikit-learn's train_test_split function.
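The original screenshot is missing, but the import would look like this (assuming scikit-learn is installed):

```python
# train_test_split lives in scikit-learn's model_selection module
from sklearn.model_selection import train_test_split
```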

continue 👇👇
Step 2
Split the input data into two sets:
training_set: 60% of the total data
dev_test_set (dev_set + test_set): 40% of the total data

Step 3
Split the dev_test_set into two equal halves:
dev_set: 20% of the total data
test_set: 20% of the total data
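Steps 2 and 3 can be sketched as follows. The arrays `X` and `y` here are placeholders standing in for "input_data" (features and labels):

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Placeholder data: 100 instances with 2 features each, plus labels
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 2: 60% training set, 40% held out for dev + test
X_train, X_dev_test, y_train, y_dev_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Step 3: split the held-out 40% into two equal halves
# -> dev_set: 20% of the total, test_set: 20% of the total
X_dev, X_test, y_dev, y_test = train_test_split(
    X_dev_test, y_dev_test, test_size=0.5, random_state=42
)
```

Setting `random_state` makes the split reproducible, so every candidate model is evaluated on the same dev set.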
Then you can train multiple models on the training_set, and evaluate their performances on the dev_set.
Finally, choose the best candidate model and evaluate it one more time on the test_set.
🔴 The Second approach (Cross Validation)
Step 1
Import scikit-learn's train_test_split and cross_val_score functions.

Step 2
Split the input data into only two parts:
training_set: 80%
test_set: 20%
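With placeholder data again standing in for "input_data", the 80/20 split would look like this:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Placeholder data: 100 instances with 2 features each, plus labels
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# training_set: 80%, test_set: 20%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```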

continue 👇👇
Step 3
Use the cross_val_score function to train and evaluate the candidate models on the same run.
- By setting the number of cross-validation folds to 5 (cv=5), the function will split the training_set into 5 equal parts, let's say A, B, C, D, E.
- Then it will train the model on four parts, say A, B, C, D.
- Next, it will test its accuracy (scoring='accuracy') on the fifth part E, and return the score.
- This is repeated 5 times (cv=5), with the model trained on 4 parts and tested on a different fifth part each time.
- Eventually, the function will return 5 different scores, one for each fold.
You can measure the mean and the standard deviation of the 5 scores, which gives you a better estimate of the model's performance than a single dev-set evaluation.
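A minimal sketch of Step 3. The classifier and dataset here are placeholders (substitute your own training_set and candidate model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder training data and candidate model
X_train, y_train = make_classification(n_samples=100, random_state=42)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, score accuracy
# on the remaining fold, repeated 5 times
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# Mean and standard deviation summarize the 5 scores
print(scores.mean(), scores.std())
```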
You can finally choose the candidate model with the best performance, and test it one more time on the test set before deployment.

Note
These were the basics of data splitting for machine learning, and there are other variations for specific use cases (will be discussed later).
That's it for this thread.
If you found it useful, kindly consider supporting my content by retweeting the first tweet, and for more content about #MachineLearning and #DataScience, follow me @ammaryh92 .
