📜Ideally, all 3 sets (train, dev, test) should be perfectly representative of the new instances you want to generalize to.
Otherwise, a good performance on the training data will not correspond to doing well on your application.
⚠️However, getting representative data for training is sometimes very difficult.
👉Example
Assume you want to build an app that recognizes the species of flowers in images taken by the users.
For this app, you will need to train an ML algorithm on, let's say, 200,000 images.
You can download millions of flower images from the internet using web scraping to collect the required images.
⚡️The problem is that the images you download will probably be high-resolution, professionally shot images.
On the other hand, the images taken by the app users will probably be low-quality, blurry images.
In that case, the images used to train and evaluate the model's performance will not be representative of the ones used in production.
💡Solutions
1⃣Always make sure that the development set and the test set are perfectly representative of the data you expect to use in production.
2⃣Always make sure that the development set and the test set come from the same distribution.
This ensures that a good performance on the development/test set will correspond to a good performance on production data.
🔴Practical Example
When building your flower app, you collected 200,000 images from the web (non-representative), and 10,000 images taken by a phone camera (representative).
How should you distribute them?
👇👇
📒Remember that the dev/test sets should be made up exclusively from the representative images (10,000).
A possible solution
1⃣Total images (210,000 images)
200,000 web images (non-representative) + 10,000 phone images (representative)
2⃣Training set (205,000 images)
200,000 web images (non-representative) + 5,000 phone images (representative)
3⃣Development set
2,500 phone images (representative)
4⃣Test set
2,500 phone images (representative)
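The split above can be sketched in a few lines of plain Python. This is only an illustration: the `split_flower_images` helper and the placeholder image names are made up for the example, and it assumes the phone images are shuffled before being divided.

```python
import random


def split_flower_images(web_images, phone_images, seed=0):
    """Split as in the example: dev and test sets use phone images only."""
    rng = random.Random(seed)
    phone = list(phone_images)
    rng.shuffle(phone)  # shuffle so each subset is a random sample

    half = len(phone) // 2     # 5,000 of the 10,000 phone images go to training
    quarter = len(phone) // 4  # 2,500 each for the dev set and the test set

    train = list(web_images) + phone[:half]
    dev = phone[half:half + quarter]
    test = phone[half + quarter:half + 2 * quarter]
    rng.shuffle(train)  # mix web and phone images inside the training set
    return train, dev, test


# Placeholder identifiers stand in for real image files.
web = [f"web_{i}" for i in range(200_000)]
phone = [f"phone_{i}" for i in range(10_000)]
train_set, dev_set, test_set = split_flower_images(web, phone)
print(len(train_set), len(dev_set), len(test_set))  # 205000 2500 2500
```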
📜The Problem of Data Mismatch
This problem arises when the training data comes from a different distribution than the dev/test data (like in the above example).
After training the model, if the model performs poorly on the dev set, then there are two possible causes:
👇👇
a. Overfitting (high variance)
b. Data mismatch (training data has a different distribution than dev set data)
🤔How to determine the correct cause?
We hold out some of the training images (web images) in another set called the train-dev set.
We then train the model on the training set (not the train-dev set).
Next, we evaluate the model performance on the train-dev set.
If the model performs poorly on the train-dev set, then we do have an overfitting problem.
However, if the model performs well on the train-dev set, but then performs poorly on the development set, then the poor performance is caused by data mismatch.
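The decision rule above can be written as a small helper. This is a rough sketch, not a standard recipe: the `diagnose` function and its `gap` threshold are hypothetical, and it assumes "performs poorly" means "error is noticeably higher than on the previous set".

```python
def diagnose(train_error, train_dev_error, dev_error, gap=0.05):
    """Return a rough label for why the dev-set error is high.

    gap is a hypothetical threshold for "noticeably worse".
    """
    if train_dev_error - train_error > gap:
        # Same distribution as the training set, yet much worse.
        return "overfitting (high variance)"
    if dev_error - train_dev_error > gap:
        # Fine on held-out web images, much worse on phone images.
        return "data mismatch"
    return "no large gap between the sets"


print(diagnose(train_error=0.02, train_dev_error=0.12, dev_error=0.13))
print(diagnose(train_error=0.02, train_dev_error=0.03, dev_error=0.15))
```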
We can solve the data mismatch problem using preprocessing and data augmentation techniques to make the training data (from the web) look more like the development data, and then retrain the model.
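One simple way to make sharp web images look more like phone photos is to lower their resolution. Here is a minimal sketch in plain Python, assuming a grayscale image stored as a list of rows. The `downsample` helper is hypothetical, and a real project would likely use an image library instead.

```python
def downsample(image, factor=2):
    """Average each factor x factor block of a grayscale image (list of rows)."""
    # Ignore trailing rows and columns that do not fill a whole block.
    height = len(image) - len(image) % factor
    width = len(image[0]) - len(image[0]) % factor
    small = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [image[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


sharp = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
blurry = downsample(sharp)
print(blurry)  # [[0.0, 255.0], [255.0, 0.0]]
```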
📜Introduction
- After training an ML model, it is important to assess its performance before putting it into production.
- We start by measuring the model performance on the training set to evaluate how well the model fits the training data.
- Then we measure the model performance on the test set to evaluate the generalization error.
To measure the model performance on the training set, we need a reference value against which we can compare the model performance.
This reference value is called "Bayes Error". 👇
📜Introduction
Why do we need to split our data?
After training the model, you want to test its performance on new data before putting it in production. In other words, you want to measure the generalization error (how well does the model generalize to new data?).
The data is commonly split into 3 different sets:
1. Training Set
2. Development Set (Holdout Validation Set)
3. Test Set
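A three-way split can be sketched with the standard library alone. The 80/10/10 proportions and the `three_way_split` name are assumptions for this example. The right sizes depend on how much data you have.

```python
import random


def three_way_split(samples, dev_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle the samples and cut them into training, dev and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)

    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]  # everything left over is for training
    return train, dev, test


train, dev, test = three_way_split(range(1000))
print(len(train), len(dev), len(test))  # 800 100 100
```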
✍️Introduction
- scikit-learn is one of the most famous Python libraries for machine learning.
- scikit-learn allows you to easily build and train machine learning models through its simple and well-designed API.
- Here, I will try to simplify the API for beginners.
1⃣ Estimators
- The process of learning parameters from input data is called "Estimation", and therefore any object that learns some parameters from data is called an "Estimator".
- The estimation process itself is performed by calling the fit( ) method of any estimator object.
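To make the idea concrete, here is a toy estimator written in plain Python. It is not one of scikit-learn's own classes: `MeanEstimator` is a made-up name, and the only things borrowed from scikit-learn are its conventions that `fit()` returns the object itself and that learned parameters end with an underscore.

```python
class MeanEstimator:
    """Toy estimator: learns the mean of the training values."""

    def fit(self, X):
        # Estimation step: learn a parameter from the input data.
        self.mean_ = sum(X) / len(X)
        return self  # returning self allows calls to be chained


est = MeanEstimator().fit([1.0, 2.0, 3.0, 6.0])
print(est.mean_)  # 3.0
```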
Pandas
- Pandas is probably one of the most powerful and flexible open source data analysis and manipulation tools available in any language.
- It provides a wide range of functions for data wrangling and cleaning.
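As a small taste of data cleaning with Pandas, the sketch below drops rows with a missing label and fills a missing number with the column mean. The table and its column names are invented for the example.

```python
import pandas as pd

# A tiny made-up table with missing values.
df = pd.DataFrame({
    "species": ["rose", "tulip", None, "daisy"],
    "petal_cm": [4.0, None, 3.0, 2.0],
})

cleaned = (
    df.dropna(subset=["species"])                   # drop rows with no species label
      .fillna({"petal_cm": df["petal_cm"].mean()})  # fill missing lengths with the mean
      .reset_index(drop=True)
)
print(cleaned)
```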
resources:
1⃣ youtube.com/playlist?list=…
2⃣
NumPy (Numerical Python)
- NumPy is an open source project aiming to enable numerical computing with Python.
- It provides functions and methods for performing high-level mathematical operations on multi-dimensional arrays and matrices.
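A quick sketch of what that looks like in practice, using a small 2x2 array chosen just for illustration:

```python
import numpy as np

matrix = np.array([[1.0, 2.0], [3.0, 4.0]])

print(matrix.shape)         # (2, 2)
print(np.sqrt(matrix))      # element-wise square root
print(matrix @ matrix)      # matrix product
print(matrix.mean(axis=0))  # mean of each column
```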
The term Machine Learning sounds mysterious and confusing to a lot of people, especially beginners.
In this thread, I will try to explain how a machine learns and why we even need machine learning.
🧵👇
In the pre-machine-learning era, we had what are called "rule-based systems".
This basically means that we provide a machine with a bunch of instructions on how to perform a certain task.
For example, suppose we need to write a function that returns the square of a number.
With a rule-based system, this is very easy.
1. First we define a function called Square, for example.
2. Square function takes X as an input, where X can be any number.
3. Square function multiplies X by itself (X ** 2).
4. Square function returns the result to the user.
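In Python, those four steps fit in a couple of lines. The function is named `square` in lowercase here only to follow the usual Python naming style.

```python
def square(x):
    """Return x multiplied by itself."""
    return x ** 2


print(square(4))     # 16
print(square(-1.5))  # 2.25
```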