"The best way to improve a Machine Learning model is to add more features to the input data."
You have read this sentence a hundred times.
But is it true for real-world projects? 🤔
✅ It is definitely true if you are trying to win a Kaggle competition. Adding more features can only help you in this context.
❌ However, if you are working on a real-world ML project, adding features is no "free lunch".
One of the hardest problems in real-world ML projects is preparing and serving the input data the model needs to make predictions once it is deployed.
aka "How do I serve the input features the model needs to work well?"
And the thing is, not all features are the same.
Some are easier to serve than others.
Why?
Because there is already infrastructure in place (thanks to your friend the data engineer ❤️) that makes it possible to deliver them fast enough to your model.
If you use hard-to-fetch features, your ML models will look great at training time.
However, they will be almost impossible to use in production.
Let's go through an example to make things clearer:
#Example: Let's imagine you 👨‍🔬 work at a car-sharing startup that competes with Uber.
They hired you to build the new "trip-duration prediction model".
The goal of this model is to predict the duration of a trip and show it to the user when she books a ride.
One feature you can use to train your ML model is whether it rains or not when the user requests the ride.
You have plenty of historical data on weather conditions, so this sounds like a great feature to add to your model.
You add a boolean feature called "is_raining" and re-train the model...
... and the validation metrics look slightly better!
Hooorraaayyy!
This feature is the cherry on top of your cake.
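That experiment might look something like this minimal sketch. Everything here is made up for illustration: the data is synthetic, and the feature and target names (`trip_distance_km`, `trip_duration_min`, etc.) are hypothetical.

```python
# Minimal sketch: compare validation error with and without the new
# "is_raining" feature. Synthetic data, hypothetical column names.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "trip_distance_km": rng.uniform(1, 30, n),
    "hour_of_day": rng.integers(0, 24, n),
    "is_raining": rng.integers(0, 2, n),
})
# Synthetic target: rain slows trips down a bit, plus noise
df["trip_duration_min"] = (
    2.5 * df["trip_distance_km"] + 5 * df["is_raining"] + rng.normal(0, 3, n)
)

def validation_mae(feature_cols):
    """Train on a split and return the validation mean absolute error."""
    X_train, X_val, y_train, y_val = train_test_split(
        df[feature_cols], df["trip_duration_min"], random_state=0
    )
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    return mean_absolute_error(y_val, model.predict(X_val))

mae_without = validation_mae(["trip_distance_km", "hour_of_day"])
mae_with = validation_mae(["trip_distance_km", "hour_of_day", "is_raining"])
print(f"MAE without is_raining: {mae_without:.2f}")
print(f"MAE with is_raining:    {mae_with:.2f}")  # expected to be lower
```

Offline, the comparison is this easy, which is exactly why it is tempting to keep every feature that moves the metric.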
You feel super-confident about the model, so it is time to deploy it...
One of the backend developers in the startup 👨🏽‍💻, who needs to call your model, asks you:
👨🏽‍💻: "What are the features I need to pass to the model to use it?"
👨‍🔬: "Trip origin, destination, the hour of the day... and whether or not it is raining."
👨🏽‍💻: "We do not have real-time data on weather conditions. I can't generate this 'is_raining' feature."
👨‍🔬: "How can that be? We have historical data on weather conditions, what is the problem?"
👨🏽‍💻: "Yes, we fetch this data once a day from an external API. But NOT in real time."
Sadly for you, this developer is damn right.
The data you used to train your model is not available at inference time.
So you cannot use this feature.
You have 2 options:
1 → Build a data pipeline to fetch real-time weather data, so you can keep this feature. This sounds appealing, but the amount of work required (especially at a startup) is way more than the benefit this feature adds.
2 → Drop this feature.
In most cases, option #2 is the preferred one.
And you feel sad.
And you ask yourself
👨‍🔬: "Why did I waste time adding this feature to my model, instead of asking right away what features are available in production?"
And this is the lesson you learn here.
→ Ask the team (data engineers, developers, DevOps) what pieces of data (aka features) are easy to fetch and send in API requests.
→ Use only these features to build your first model, so it makes it to production.
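In code, this can be as simple as intersecting your candidate features with what the team confirms it can send at inference time. All feature names below are hypothetical:

```python
# Minimal sketch: keep only the features the team confirmed are available
# in the production API request. All names are hypothetical.
candidate_features = [
    "trip_origin",
    "trip_destination",
    "hour_of_day",
    "is_raining",  # great offline, but not available in real time
]

# What the backend team says it can actually send at inference time
available_in_production = {"trip_origin", "trip_destination", "hour_of_day"}

# Features for the first model version, and the ones you drop (for now)
features_for_v1 = [f for f in candidate_features if f in available_in_production]
dropped = [f for f in candidate_features if f not in available_in_production]

print("Training v1 with:", features_for_v1)
print("Dropped (not servable):", dropped)
```

The dropped list is not lost work: it is your backlog of features worth building pipelines for later, once the model has proven its value in production.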
And YOU make an impact.
To sum up:
→ When you train a model, all historical features look the same. When you deploy it, NOT all features are the same: some cannot be served in real time.
→ Before adding a feature to your ML model, ask the team whether it will be available in production once the model is deployed.
Wanna build a PRO project to stand out from the crowd and land a job?