The ABSOLUTE ESSENTIALS of Data Mismatch in Machine Learning

🧵This thread will cover the following concepts:
a. Data Distribution (explanation + examples)
b. Data Mismatch Problem

In a previous thread, we discussed that we should split our data into three parts:
- Training Set
- Development Set
- Test Set
👇👇
📜Ideally, all 3 sets (train, dev, test) should be representative of the new instances you want to generalize to.
Otherwise, good performance on these sets will not translate into good performance in your application.
⚠️However, getting representative data for training is sometimes very difficult.

👉Example
Assume you want to build an app that recognizes the species of flowers in images taken by the users.

For this app, you will need to train an ML algorithm on, let's say, 200,000 images.
You can collect the required images by scraping flower images from the web, where millions are available.

⚡️The problem is that the images you download will probably be high-resolution, professionally shot images.
On the other hand, the images taken by the app users will probably be low-quality, blurry images.

In that case, the images used to train and evaluate the model will not be representative of the ones it will see in production.

💡Solutions
1⃣Always make sure that the development set and the test set are as representative as possible of the data you expect to see in production.
2⃣Always make sure that the development set and the test set come from the same distribution.
This ensures that good performance on the development/test sets will correspond to good performance on production data.

🔴Practical Example
When building your flower app, you collected 200,000 images from the web (non-representative), and 10,000 images taken by a phone camera (representative).
How should you distribute them?
👇👇
📒Remember that the dev/test sets should be made up exclusively of the representative images (10,000).

A possible solution:
1⃣Total (210,000 images)
200,000 web images (non-representative) + 10,000 phone images (representative)

2⃣Training set (205,000 images)
200,000 web images (non-representative) + 5,000 phone images (representative)

3⃣Development set (2,500 images)
2,500 phone images (representative)

4⃣Test set (2,500 images)
2,500 phone images (representative)
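
The split above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `split_flower_data`, the placeholder file names, and the fixed random seed are hypothetical and not part of the original thread.

```python
import random

def split_flower_data(web_images, phone_images, n_phone_train=5_000, seed=42):
    """Return (train, dev, test) where dev and test hold only phone images."""
    rng = random.Random(seed)

    # Shuffle the representative (phone) images before dividing them up.
    phone = list(phone_images)
    rng.shuffle(phone)

    phone_train = phone[:n_phone_train]   # phone images that go to training
    held_out = phone[n_phone_train:]      # phone images reserved for dev/test
    half = len(held_out) // 2

    # Training set: all web images plus a slice of the phone images.
    train = list(web_images) + phone_train
    rng.shuffle(train)

    dev = held_out[:half]
    test = held_out[half:]
    return train, dev, test

# Placeholder file names standing in for real image files.
web = [f"web_{i}.jpg" for i in range(200_000)]
phone = [f"phone_{i}.jpg" for i in range(10_000)]

train, dev, test = split_flower_data(web, phone)
print(len(train), len(dev), len(test))  # 205000 2500 2500
```

Shuffling the phone images before slicing keeps the dev and test sets drawn from the same distribution.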

📜The Problem of Data Mismatch
This problem arises when the training data comes from a different distribution than the dev/test data (as in the example above).
After training, if the model performs well on the training set but poorly on the dev set, there are two possible causes:
👇👇
a. Overfitting (high variance)
b. Data mismatch (training data has a different distribution than dev set data)

🤔How do we determine the actual cause?
We hold out some of the training images (web images) in a separate set called the train-dev set.
We then train the model on the training set only (not on the train-dev set).
Next, we evaluate the model's performance on the train-dev set.
If the model performs poorly on the train-dev set, then the model is overfitting the training set.
However, if the model performs well on the train-dev set, but then performs poorly on the development set, then the poor performance is caused by data mismatch.
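
The train-dev procedure above can be sketched as follows. This is a rough illustration, not a fixed recipe: the helper names `carve_train_dev` and `diagnose`, the 5% hold-out fraction, and the 0.05 gap threshold are all assumptions, and the error rates passed in would come from your own model evaluation.

```python
import random

def carve_train_dev(train, fraction=0.05, seed=0):
    """Hold out a random slice of the training set as the train-dev set."""
    rng = random.Random(seed)
    shuffled = list(train)
    rng.shuffle(shuffled)
    n = int(len(shuffled) * fraction)
    # Remaining training data first, held-out train-dev data second.
    return shuffled[n:], shuffled[:n]

def diagnose(train_error, train_dev_error, dev_error, gap_threshold=0.05):
    """Compare error rates (0.0 to 1.0) measured on the three sets.

    gap_threshold is a hypothetical cutoff for what counts as a large gap.
    """
    findings = []
    # Same distribution as training, but unseen: a large gap means overfitting.
    if train_dev_error - train_error > gap_threshold:
        findings.append("overfitting (high variance)")
    # Different distribution: a large gap here points to data mismatch.
    if dev_error - train_dev_error > gap_threshold:
        findings.append("data mismatch")
    return findings or ["no large gap found"]

print(diagnose(0.02, 0.03, 0.15))  # ['data mismatch']
print(diagnose(0.02, 0.12, 0.13))  # ['overfitting (high variance)']
```

Both problems can show up at once, which is why the function returns a list instead of a single label.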
We can solve the data mismatch problem using preprocessing and data augmentation techniques to make the training data (from the web) look more like the development data, and then retrain the model.
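
As one possible illustration of such augmentation, the NumPy sketch below degrades a sharp image by averaging pixel blocks (losing detail) and adding sensor-like noise. The function name `degrade` and its parameters are hypothetical, and whether this kind of degradation matches real phone photos is an assumption you would need to check against your own dev images.

```python
import numpy as np

def degrade(image, factor=2, noise_std=5.0, seed=0):
    """Make a sharp H x W x C uint8 image look lower quality.

    The output is cropped so that height and width are multiples of `factor`.
    """
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    h2, w2 = h - h % factor, w - w % factor
    img = image[:h2, :w2].astype(np.float32)

    # Downsample: average every factor x factor block of pixels.
    small = img.reshape(h2 // factor, factor, w2 // factor, factor, c).mean(axis=(1, 3))

    # Upsample back by repeating pixels, which leaves a soft, blocky image.
    blurry = small.repeat(factor, axis=0).repeat(factor, axis=1)

    # Add Gaussian noise and return to the valid 0-255 range.
    noisy = blurry + rng.normal(0.0, noise_std, size=blurry.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# A random array standing in for a high-resolution web image.
web_image = np.random.default_rng(1).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
phone_like = degrade(web_image, factor=4)
print(phone_like.shape, phone_like.dtype)  # (64, 64, 3) uint8
```

After applying a transformation like this to the web images, you would retrain and check whether the gap between train-dev and dev error shrinks.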
That's it for this thread.

For additional information, I highly recommend reading this article:
yashuseth.blog/2018/03/20/wha…
If you found this thread to be useful, kindly consider supporting me by retweeting the first tweet.

For more Data Science and Machine Learning content like this, follow me @ammaryh92.
