Pau Labarta Bajo
Jan 3 • 15 tweets • 4 min read
There is one skill every professional data scientist must have that no online course talks about 🤔

🧵↓
Every aspiring data scientist I talk to thinks their job starts when someone else gives them

→ a dataset, and
→ a clearly defined metric to optimize for, e.g. accuracy

They are wrong.

Things are slightly more complex in the real world.
In the real world, data science projects start from a business problem.

They are born to move a key business metric (KPI).

The data scientist's job is to translate a business problem into the *right* data science problem.

Then solve it.
To translate a business problem into *the right* data science problem you do 2 things:

1 → ask questions
2 → explore the data to find clues.

There is nothing more frustrating than building a great data science solution to the wrong business problem.
Imagine you are a data scientist 🧑🏽‍🔬 at Uber.

And your product lead tells you:

👩‍💼: "We want to decrease user churn by 5% this quarter"
There are different reasons why a user would stop using Uber.

For example:

→ "Lyft is offering better prices for that geo" (pricing problem)
→ "Car waiting times are too long" (supply problem)
→ "The Android version of the app is very slow" (client-app performance problem)
You build this list ↑ by asking the rest of the team the right questions.

You need to understand the user's experience using the app, from HER point of view.
Typically there is no single reason behind churn, but a combination of a few of these.

The question is: which one should you focus on?

This is when you pull out your great data science skills and EXPLORE THE DATA 🔎
You explore the data to understand how plausible each of the above explanations is.

The output from this analysis is a single hypothesis you should consider further.
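
One way to make "how plausible is each explanation" concrete is to compare churn among users exposed to each candidate cause with churn among everyone else. This is a minimal sketch in plain Python. The user records, the three flag names, and the numbers are all made up for illustration, and a ratio like this only shows association, not causation.

```python
# Hypothetical exposure flags, one per candidate explanation.
FLAGS = ("cheaper_competitor", "long_wait", "slow_app")

# Made-up rows: (churned, cheaper_competitor, long_wait, slow_app)
raw = [
    (True,  False, True,  False),
    (True,  True,  True,  False),
    (True,  False, True,  True),
    (False, True,  False, False),
    (False, False, False, True),
    (False, True,  False, False),
    (False, False, True,  False),
    (True,  True,  False, False),
]
users = [dict(zip(("churned",) + FLAGS, row)) for row in raw]


def churn_rate(rows):
    """Share of churned users in rows (0.0 for an empty list)."""
    return sum(r["churned"] for r in rows) / len(rows) if rows else 0.0


def rank_hypotheses(rows, flags):
    """Return (flag, lift) pairs, highest lift first.

    lift = churn rate among exposed users / churn rate among the rest.
    """
    ranked = []
    for flag in flags:
        exposed = [r for r in rows if r[flag]]
        others = [r for r in rows if not r[flag]]
        base = churn_rate(others)
        lift = churn_rate(exposed) / base if base > 0 else float("inf")
        ranked.append((flag, lift))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


ranking = rank_hypotheses(users, FLAGS)
for flag, lift in ranking:
    print(f"{flag}: lift {lift:.2f}")
```

In this toy data the flag at the top of the ranking is the hypothesis to dig into first. On real data you would also want to check sample sizes before trusting any single ratio.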

Depending on the hypothesis, you will solve the data science problem differently.

For example...
#Example 1: "Lyft is offering better prices for that geo" (pricing problem)

Solution: Detect the segment of users who are likely to churn (possibly using an ML Model) and send personalized discounts via push notifications.
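
As a rough illustration of "detect the segment", here is a sketch that uses a simple activity-drop rule as a stand-in for an ML model. It is not the thread's method: the `RiderActivity` fields, the 30-day windows, and both thresholds are assumptions picked for the example.

```python
from dataclasses import dataclass


@dataclass
class RiderActivity:
    user_id: str
    trips_prev_30d: int  # trips in the earlier 30-day window
    trips_last_30d: int  # trips in the most recent 30-day window


def at_risk(riders, max_ratio=0.5, min_prev_trips=4):
    """Return ids of riders whose recent trips fell to max_ratio
    (or less) of their earlier level."""
    flagged = []
    for r in riders:
        if r.trips_prev_30d < min_prev_trips:
            continue  # too little history to judge a drop
        if r.trips_last_30d <= max_ratio * r.trips_prev_30d:
            flagged.append(r.user_id)
    return flagged


# Made-up riders for illustration.
riders = [
    RiderActivity("u1", 12, 3),
    RiderActivity("u2", 8, 9),
    RiderActivity("u3", 2, 0),
    RiderActivity("u4", 10, 5),
]
discount_targets = at_risk(riders)
print(discount_targets)
```

A trained classifier would replace `at_risk` once you have labeled churn data. The output stays the same shape: a list of user ids to target with the discount.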
#Example 2: "Car waiting times are too long" (supply problem)

Solution: Identify the location and time where supply is too low, and offer a price incentive for drivers to cover these slots.
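
A minimal sketch of how "where and when is supply too low" could be computed, assuming you have a log of ride requests with a zone, an hour of day, and the wait in minutes. The zone names, the 10-minute threshold, and the minimum request count are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up ride requests: (zone, hour_of_day, wait_minutes)
requests = [
    ("downtown", 8, 4.0), ("downtown", 8, 6.0),
    ("airport", 23, 14.0), ("airport", 23, 18.0),
    ("suburbs", 18, 11.0), ("suburbs", 18, 13.0),
    ("downtown", 18, 5.0),
]


def undersupplied_slots(rows, max_avg_wait=10.0, min_requests=2):
    """Return (zone, hour, avg_wait) for slots whose average wait
    exceeds max_avg_wait, longest wait first."""
    waits = defaultdict(list)
    for zone, hour, wait in rows:
        waits[(zone, hour)].append(wait)
    slots = [
        (zone, hour, mean(w))
        for (zone, hour), w in waits.items()
        # skip slots with too few requests to average reliably
        if len(w) >= min_requests and mean(w) > max_avg_wait
    ]
    return sorted(slots, key=lambda s: s[2], reverse=True)


incentive_slots = undersupplied_slots(requests)
for zone, hour, avg_wait in incentive_slots:
    print(f"{zone} at {hour}:00, average wait {avg_wait:.1f} min")
```

The slots returned here are the candidates for a driver incentive. The size of the incentive is a separate pricing decision this sketch does not cover.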
#Example 3: "The Android version of the app is very slow" (client-app performance problem)

Solution: Go to the frontend devs, show them the breakdown of user churn by app version, and convince them they should release a new version of the app with better performance.
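
The breakdown itself can be very small. Here is a sketch assuming one `(app_version, churned)` pair per Android user; the version numbers and counts are invented.

```python
from collections import Counter

# Made-up (app_version, churned) pairs, one per Android user.
android_users = [
    ("5.1", False), ("5.1", False), ("5.1", True), ("5.1", False),
    ("5.2", True), ("5.2", True), ("5.2", False), ("5.2", True),
]


def churn_by_version(rows):
    """Return {version: (users, churned_users, churn_rate)}."""
    total, churned = Counter(), Counter()
    for version, did_churn in rows:
        total[version] += 1
        churned[version] += int(did_churn)
    return {
        v: (total[v], churned[v], churned[v] / total[v])
        for v in sorted(total)
    }


breakdown = churn_by_version(android_users)
for version, (n, lost, rate) in breakdown.items():
    print(f"{version}: {lost}/{n} users churned ({rate:.0%})")
```

Showing the user count next to each rate helps the conversation with the devs, since a high churn rate on a version with a handful of users is much weaker evidence.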
In conclusion,

→ Translating business problems into *the right* data science problem is what separates a senior from a junior data scientist.

→ Ask the right questions, list possible solutions, and explore the data to narrow down the list to one.

→ Solve this one problem.
Wanna design, build and deploy a real-world ML service?

I am preparing a hands-on course (including videos, code, and slides) to help you build an end-2-end ML service.

Join my e-mail list to be notified when the course is out ↓
datamachines.xyz/subscribe/
Wanna get more real-world ML content?

Follow me @paulabartabajo_ so you do not miss what's coming next.


