"The best way to improve a Machine Learning model is to add more features to the input data."
You have read this sentence a hundred times.
But is it true for real-world projects? 🤔
✅ It is definitely true if you are trying to win a Kaggle competition. Adding more features can only help you in this context.
❌ However, if you are working on a real-world ML project, adding features is no "free lunch".
One of the hardest problems in real-world ML projects is preparing and serving the input data the model needs to make predictions once it is deployed.
aka "How do I serve the input features the model needs to work well?"
And the thing is, not all features are the same.
Some are easier to serve than others.
Why?
Because there is already infrastructure in place (thanks to your friend the data engineer ❤️) that makes it possible to deliver them fast enough to your model.
If you use hard-to-fetch features, your ML models will look great at training time.
However, they will be almost impossible to use in production.
Let's go through an example to make things clearer:
#Example: Let's imagine you 👨‍🔬 work at a car-sharing startup that competes with Uber.
They hired you to build the new "trip-duration prediction model".
The goal of this model is to predict the duration of a trip and show it to the user when she books a ride.
One feature you can use to train your ML model is whether it rains or not when the user requests the ride.
You have plenty of historical data on weather conditions, so this sounds like a great feature to add to your model.
You add a boolean feature called "is_raining" and re-train the model...
... and the validation metrics look slightly better!
Hooorraaayyy!
This feature is the cherry on top of your cake.
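That experiment might look something like this minimal sketch. Everything here is made up for illustration: the data is synthetic, and the feature and target names (`trip_distance_km`, `trip_duration_min`, etc.) are hypothetical.

```python
# Minimal sketch: compare validation error with and without the new
# "is_raining" feature. Synthetic data, hypothetical column names.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "trip_distance_km": rng.uniform(1, 30, n),
    "hour_of_day": rng.integers(0, 24, n),
    "is_raining": rng.integers(0, 2, n),
})
# Synthetic target: rain slows trips down a bit, plus noise
df["trip_duration_min"] = (
    2.5 * df["trip_distance_km"] + 5 * df["is_raining"] + rng.normal(0, 3, n)
)

def validation_mae(feature_cols):
    """Train on a split and return the validation mean absolute error."""
    X_train, X_val, y_train, y_val = train_test_split(
        df[feature_cols], df["trip_duration_min"], random_state=0
    )
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    return mean_absolute_error(y_val, model.predict(X_val))

mae_without = validation_mae(["trip_distance_km", "hour_of_day"])
mae_with = validation_mae(["trip_distance_km", "hour_of_day", "is_raining"])
print(f"MAE without is_raining: {mae_without:.2f}")
print(f"MAE with is_raining:    {mae_with:.2f}")  # expected to be lower
```

Offline, the comparison is this easy, which is exactly why it is tempting to keep every feature that moves the metric.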
You feel super-confident about the model, so it is time to deploy it...
One of the backend developers in the startup 👨🏽‍💻, who needs to call your model, asks you:
👨🏽‍💻: "What are the features I need to pass to the model to use it?"
👨‍🔬: "Trip origin, destination, the hour of the day... and whether or not it is raining."
👨🏽‍💻: "We do not have real-time data on weather conditions. I can't generate this 'is_raining' feature."
👨‍🔬: "How can that be? We have historical data on weather conditions, what is the problem?"
👨🏽‍💻: "Yes, we fetch this data once a day from an external API. But NOT in real time."
Sadly for you, this developer is damn right.
The data you used to train your model is not available at inference time.
So you cannot use this feature.
You have 2 options:
1 → Build a data pipeline to fetch real-time weather data, so you can keep this feature. This sounds appealing, but the amount of work required (especially at a startup) is way more than the benefit this feature adds.
2 → Drop this feature.
In most cases, option #2 is the preferred one.
And you feel sad.
And you ask yourself
👨‍🔬: "Why did I waste time adding this feature to my model, instead of asking right away what features are available in production?"
And this is the lesson you learn here.
→ Ask the team (data engineers, developers, DevOps) what pieces of data (aka features) are easy to fetch and send in API requests.
→ Use only these features to build your first model, so it makes it to production.
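In code, this can be as simple as intersecting your candidate features with what the team confirms it can send at inference time. All feature names below are hypothetical:

```python
# Minimal sketch: keep only the features the team confirmed are available
# in the production API request. All names are hypothetical.
candidate_features = [
    "trip_origin",
    "trip_destination",
    "hour_of_day",
    "is_raining",  # great offline, but not available in real time
]

# What the backend team says it can actually send at inference time
available_in_production = {"trip_origin", "trip_destination", "hour_of_day"}

# Features for the first model version, and the ones you drop (for now)
features_for_v1 = [f for f in candidate_features if f in available_in_production]
dropped = [f for f in candidate_features if f not in available_in_production]

print("Training v1 with:", features_for_v1)
print("Dropped (not servable):", dropped)
```

The dropped list is not lost work: it is your backlog of features worth building pipelines for later, once the model has proven its value in production.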
And YOU make an impact.
To sum up:
→ When you train a model, all historical features look the same. When you deploy it, NOT all features are the same: some cannot be served in real time.
→ Before adding a feature to your ML model, ask the team whether it will be available in production once the model is deployed.
Wanna build a PRO project to stand out from the crowd and land a job?