The train set will have the most instances and will be used to teach the model.
The validation set usually contains a smaller portion of the instances, and it's used to validate the model as it's training. We use the results on this set to further improve the model.
↓
The test set is never used during training. Instead, we reserve it until the very end of the process, after we have finished updating the model.
The results on this set give us an idea of the final performance of the model.
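As a sketch, here's how such a three-way split might look with NumPy. The function name and the 70/15/15 ratios are my own illustrative choices, not a prescription:

```python
import numpy as np

def train_val_test_split(X, y, val_ratio=0.15, test_ratio=0.15, seed=42):
    """Shuffle a dataset and split it into train, validation, and test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(len(X) * test_ratio)
    n_val = int(len(X) * val_ratio)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]  # the remainder goes to training
    return (X[train_idx], y[train_idx],
            X[val_idx], y[val_idx],
            X[test_idx], y[test_idx])
```

Shuffling before splitting matters: if the data is ordered (say, by date), a naive slice would give the three sets different distributions.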
↓
Training and validating the model go hand in hand, and together they may lead to more data engineering.
For example, you may discover that it'd be a good idea to include a "year" feature from the original data.
↓
At this point, you go back, update your dataset, split it again, and restart the training process.
You may also decide to modify the architecture of your model or its hyperparameters.
↓
The architecture of the model refers to its structure. For example, the number of layers in a neural network or their size.
Hyperparameters refer to the configuration settings that control training. For example, the learning rate, which determines how quickly or slowly the model learns.
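A tiny, hand-rolled illustration of one such hyperparameter, the learning rate. This minimizes f(x) = x² with plain gradient descent; the numbers are arbitrary and only show that the learning rate controls how fast the parameters move:

```python
def gradient_descent(lr, steps=50, start=10.0):
    """Minimize f(x) = x^2 with plain gradient descent.
    The learning rate `lr` (a hyperparameter) scales every update."""
    x = start
    for _ in range(steps):
        grad = 2 * x      # derivative of x^2
        x -= lr * grad    # the step size is controlled by the learning rate
    return x

# With a larger learning rate, x gets closer to the minimum (0) in the
# same number of steps.
slow = gradient_descent(lr=0.01)
fast = gradient_descent(lr=0.1)
```

In practice, a learning rate that is too large can also overshoot and diverge, which is exactly why it's a setting you tune rather than a value you learn.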
↓
As soon as you are happy with the model's performance, you evaluate it with the test set to ensure everything is good to go, and then deploy it.
Deploying a model refers to making it available to its users. This is also referred to as "serving" the model.
↓
There are multiple ways to make a model available, but the most common one is putting it behind a REST API.
For example, you could build a skinny layer that accepts a JSON input, uses the model, and returns its result as a JSON output.
Let's break this down.
↓
The "Inference" or "Prediction" process consists of taking input from the user, running it through the model, and returning the output.
Usually, there are several steps involved in this process.
↓
The 5 steps to run a prediction:
• Receive input
• Transform it into an instance
• Run it through the model
• Transform the model's output
• Return the final output
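The steps above can be sketched as a single function. Everything here is illustrative: `fake_model` stands in for a real trained model, and the JSON field names are assumptions, not a standard:

```python
import json

def fake_model(instance):
    # Placeholder for a real trained model: returns a "score" for the instance.
    return sum(instance)

def predict(request_body: str) -> str:
    # 1. Receive the input (a JSON string, e.g. the body of a REST request).
    payload = json.loads(request_body)
    # 2. Transform it into the instance format the model expects.
    instance = [float(v) for v in payload["features"]]
    # 3. Run it through the model.
    raw_output = fake_model(instance)
    # 4. Transform the model's output into something the caller understands.
    result = {"prediction": raw_output}
    # 5. Return the final output as JSON.
    return json.dumps(result)
```

The "skinny layer" mentioned earlier is essentially this function wrapped in a web framework's route handler.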
↓
A REST API will help us receive the input from the user and return the final output. I illustrated this above using JSON to transfer the data.
↓
The model is expecting an instance in a specific format. We usually need to transform the input JSON into that format before feeding it to the model.
We also need to run the reverse process and transform the model's output into the JSON that we will send back.
↓
Keep in mind that very often, models aren't used in isolation.
For example, we may need to query a database, call an external service, and put everything together with the model's output before returning an answer.
The API might handle all of that.
↓
Although we are close to the end, two more processes are part of the backbone of a machine learning system:
• Monitoring
• Retraining pipeline
↓
Data may change over time. For example, on average, your customers may get older.
This is referred to as "data drift."
↓
The context of your predictions may also change over time. For example, an update to a marketing campaign may slowly increase a specific product's sale ratio.
This is referred to as "concept drift."
↓
Both data and concept drift will degrade the performance of your model.
Remember that your model is a static snapshot of the relationship between the input data and the target, as it existed at training time.
Monitoring will help you detect drift.
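One very simple data-drift check, sketched here: compare a feature's mean in recent inputs against its mean at training time, and raise a flag when the shift exceeds a threshold. Real systems use proper statistical tests; this deliberately naive version only conveys the idea:

```python
def mean_shift_alert(training_values, recent_values, threshold=0.25):
    """Flag drift when a feature's mean moves by more than `threshold`
    relative to its training-time mean (a deliberately naive check)."""
    train_mean = sum(training_values) / len(training_values)
    recent_mean = sum(recent_values) / len(recent_values)
    # Relative shift; guard against a zero training mean.
    shift = abs(recent_mean - train_mean) / (abs(train_mean) or 1.0)
    return shift > threshold
```

For the "customers getting older" example, `training_values` would be the ages seen during training and `recent_values` the ages arriving in production.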
↓
Monitoring is about comparing the model's predictions with actual values over time.
There are multiple ways you can accomplish this. One of them is by routing some of the inputs to humans so they determine what the actual values should be.
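Sketching that idea: keep pairs of (prediction, human-provided actual) and watch accuracy over a rolling window. The class name, window size, and alert threshold are all arbitrary choices for illustration:

```python
from collections import deque

class AccuracyMonitor:
    """Track how often predictions match human-labeled actuals
    over the most recent `window` observations."""
    def __init__(self, window=100):
        self.outcomes = deque(maxlen=window)  # True/False match results

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self, threshold=0.9):
        acc = self.accuracy()
        return acc is not None and acc < threshold
```

When `degraded()` starts returning True, that's your signal to investigate drift and consider retraining.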
↓
Finally, we get to the process I referred to before as the "Retraining pipeline."
Two main components here:
• Collecting additional data
• Producing a new version of the model
This is the only way you can keep your model fresh: updating it frequently.
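A sketch of what one iteration of that loop might look like. The function names (`collect_new_data`, `train_model`) are placeholders for whatever your stack actually provides:

```python
def retrain(model_version, training_data, collect_new_data, train_model):
    """One iteration of a retraining pipeline:
    gather fresh data, then produce a new version of the model."""
    # 1. Collect additional data and fold it into the training set.
    new_examples = collect_new_data()
    updated_data = training_data + new_examples
    # 2. Produce a new version of the model on the updated dataset.
    new_model = train_model(updated_data)
    return new_model, model_version + 1, updated_data
```

In an automated setup, something like this would run on a schedule or be triggered by the monitoring alerts described above.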
↓
Depending on how sophisticated your machine learning setup is, you'll run this process manually or automatically.
Ideally, everything works without human involvement. In practice, there's a lot of work to make this happen.
↓
This was a long thread. If you stumble upon this tweet and want to read from the beginning, it all starts here.
If you enjoy this content, follow me @svpino for threads like this focused on machine learning. I post them multiple times per week.