Pau Labarta Bajo
Oct 4 • 17 tweets • 4 min read
"The best way to improve a Machine Learning model is to add more features to the input data."

You have read this sentence 100 times.

But is it true for real-world projects? 🤔 ↓↓↓
→ It is definitely true if you are trying to win a Kaggle competition. Adding more features can only help you in this context.

→ However, if you are working on a real-world ML project, adding features is no "free lunch" 🍜
One of the hardest problems in real-world ML projects is preparing and serving the input data the model needs to make predictions once it is deployed.

aka "How do I serve the input features the model needs to work well?"
And the thing is, not all features are the same.

Some are easier to serve than others.

Why?

Because there is already infrastructure in place (thanks to your friend the data engineer ❤️) that makes it possible to deliver them fast enough to your model.
If you use hard-to-fetch features, your ML models will look great at training time.

However, they will be almost impossible to use in production.

Let's go through an example to make things clearer:
#Example: Let's imagine you 👨‍🔬 work at a car-sharing startup that competes with Uber.

And they hire you to build the new "trip-duration prediction model".

The goal of this model is to predict the duration of a trip and show it to the user when she books a ride.
One feature you can use to train your ML model is whether or not it is raining when the user requests the ride.

You have plenty of historical data on weather conditions, so this sounds like a great feature to add to your model.
You add a boolean feature called "is_raining" and re-train the model...

... and the validation metrics look slightly better!

Hooorraaayyy!

This feature is the cherry on top of your cake.

You feel super-confident about the model, so it is time to deploy it...
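In code, adding the feature was as simple as one extra column. A minimal sketch with pandas + scikit-learn (column names are hypothetical, and zones are assumed to be already encoded as integers):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical training set: one row per historical trip,
# including the new boolean weather column.
trips = pd.read_parquet("historical_trips.parquet")

FEATURES = ["origin_zone", "destination_zone", "hour_of_day", "is_raining"]
TARGET = "trip_duration_minutes"

X_train, X_val, y_train, y_val = train_test_split(
    trips[FEATURES], trips[TARGET], test_size=0.2, random_state=42
)

model = GradientBoostingRegressor().fit(X_train, y_train)
print("Validation MAE:", mean_absolute_error(y_val, model.predict(X_val)))
```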
One of the backend developers at the startup 👨🏽‍💻, who needs to call your model, asks you:

👨🏽‍💻: "What are the features I need to pass to the model to use it?"

👨‍🔬: "Trip origin, destination, the hour of the day... and whether or not it is raining".
πŸ‘¨πŸ½β€πŸ’»: "We do not have real-time data on weather conditions. I can't generate this "is_raining" feature"

πŸ‘¨β€πŸ”¬: "How can that be? We have historical data on weather conditions, what is the problem?"

πŸ‘¨πŸ½β€πŸ’»: "Yes, we fetch this data once a day, from an external API. But NOT in real-time".
Sadly for you, this developer is damn right.

The data you used to train your model is not available at inference.

So you cannot use it.
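Concretely, this is the mismatch (a sketch; the feature names and payload are hypothetical):

```python
# The features the model was trained on...
FEATURES = ["origin_zone", "destination_zone", "hour_of_day", "is_raining"]

# ...versus what the backend can actually send when the user books a ride:
request_payload = {
    "origin_zone": 12,
    "destination_zone": 45,
    "hour_of_day": 18,
    # no "is_raining": there is no real-time weather source
}

# Building the model input blows up at request time.
model_input = [request_payload[f] for f in FEATURES]  # KeyError: 'is_raining'
```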
You have 2 options:

1 → Build a data pipeline to fetch real-time data about the weather, so you can use this feature. This sounds appealing, but the amount of work required (especially at a startup) far outweighs the benefit of adding this feature (see the sketch right after this list).

2 → Dump this feature.
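For scale, option 1 means building and operating something like the function below, and this sketch (against a hypothetical weather API) is only the happy path. The real work is the caching, retries, monitoring, and the fallback value for when the API is down:

```python
import requests


def is_raining_now(lat: float, lon: float) -> bool:
    """Fetch the weather at the trip origin, at request time."""
    response = requests.get(
        "https://api.example-weather.com/v1/now",  # hypothetical endpoint
        params={"lat": lat, "lon": lon},
        timeout=1.0,  # the booking flow cannot wait longer
    )
    response.raise_for_status()
    return response.json()["precipitation_mm"] > 0.0
```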
In most cases, option #2 is the preferred one.

And you feel sad.

And you ask yourself:

👨‍🔬: "Why did I waste time adding this feature to my model, instead of asking right away what features are available in production?"
And this is the lesson you learn here.

→ Ask the team (data engineers, developers, DevOps) which pieces of data (aka features) are easy to fetch and send in API requests.

→ Use only these features to build your first model, so it goes to production (a minimal guard for this is sketched below).

And YOU make an impact.
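One cheap way to bake this lesson into your workflow (a sketch; the feature names are hypothetical):

```python
# Agree on this list with the data engineers / backend team FIRST.
SERVABLE_IN_PRODUCTION = {"origin_zone", "destination_zone", "hour_of_day"}

candidate_features = ["origin_zone", "destination_zone", "hour_of_day", "is_raining"]

# Fail fast at training time instead of at deployment time.
not_servable = set(candidate_features) - SERVABLE_IN_PRODUCTION
assert not not_servable, f"Drop these before training: {not_servable}"
```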
To sum up:

→ When you train a model, all historical features look the same. When you deploy it, NOT all features are equally easy to serve.

→ Before adding a feature to your ML model, ask the team if the feature will be available in production once the model is deployed.
Wanna build a PRO project to stand out from the crowd and land a job?

Join my mentorship program ↓
datamachines.xyz/data-science-m…
Wanna learn more about real-world ML?
→ Follow me @paulabartabajo_
→ Join my e-mail list datamachines.xyz/subscribe/

Wanna help?
Like/Retweet the first tweet below to spread the wisdom 🙏↓
