Santiago

@svpino

22 Oct, 8 tweets, 2 min read

What's a machine learning pipeline?

Well, it turns out that many different things qualify as "machine learning pipelines."

Here are five of the different "pipelines" you should be aware of: ↓

Our first pipeline: "Data pipeline."

This covers everything from ingesting the data at its sources to delivering it to the final destination where we will consume it.

Sometimes, the data pipeline includes transformations of that data. Sometimes it doesn't.
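Here is a minimal sketch of that idea, assuming a CSV source and a JSON-lines destination. The function name `run_data_pipeline` and the in-memory files are illustrative stand-ins, not any specific tool. Note that the transformation step is optional.

```python
import csv
import io
import json

def run_data_pipeline(source, destination, transform=None):
    """Move rows from a CSV source to a JSON-lines destination."""
    count = 0
    for row in csv.DictReader(source):
        if transform is not None:  # transformations are optional
            row = transform(row)
        destination.write(json.dumps(row) + "\n")
        count += 1
    return count

# In-memory stand-ins for a real source and destination (assumption).
source = io.StringIO("id,temp_f\n1,68\n2,86\n")
destination = io.StringIO()
rows_moved = run_data_pipeline(source, destination)
```

In a real system the source and destination would more likely be databases, queues, or object storage; the shape of the loop stays the same.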

This leads me to the second pipeline.

The second pipeline: "Data transformation pipeline."

"Wait, I thought this was part of the data pipeline?" You are right; sometimes it is. Sometimes it isn't.

Sometimes, you need to separate "general" transformations from use case-specific transformations.
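One way to picture that separation, as a rough sketch: keep the general steps and the use case-specific steps as separate functions and compose them as needed. Everything here (`general_clean`, `churn_features`, the churn use case, the field names) is hypothetical and only for illustration.

```python
def general_clean(record):
    """General step: normalize keys and strip whitespace from string values."""
    return {
        key.strip().lower(): value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }

def churn_features(record):
    """Use case-specific step (hypothetical churn model): derive one feature."""
    out = dict(record)
    out["is_long_tenure"] = int(out["tenure_months"]) >= 12
    return out

def compose(*steps):
    """Chain steps into one callable, applied left to right."""
    def pipeline(record):
        for step in steps:
            record = step(record)
        return record
    return pipeline

shared = compose(general_clean)                        # reusable by every use case
churn_pipeline = compose(general_clean, churn_features)

raw = {" Tenure_Months ": " 18 ", "Plan": " pro "}
cleaned = shared(raw)
features = churn_pipeline(raw)
```

The general steps can live inside the data pipeline, while each use case adds its own steps on top.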

The third pipeline: "Training and evaluation pipeline."

Here is where we split the data, then train, evaluate, and deploy a machine learning model.

Sometimes we can join this one with data transformation and make a single pipeline.
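A toy sketch of those four stages, using only the standard library. The "model" is just a threshold halfway between the two class means, and "deploy" is only a flag; both are assumptions made to keep the example small, not a recommendation.

```python
import random
from statistics import mean

def split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def train(rows):
    """Toy model: a threshold halfway between the two class means."""
    pos = mean(x for x, y in rows if y == 1)
    neg = mean(x for x, y in rows if y == 0)
    return {"threshold": (pos + neg) / 2}

def evaluate(model, rows):
    """Fraction of rows where the thresholded prediction matches the label."""
    hits = sum((x > model["threshold"]) == bool(y) for x, y in rows)
    return hits / len(rows)

def run_training_pipeline(rows, min_accuracy=0.9):
    train_rows, test_rows = split(rows)
    model = train(train_rows)
    accuracy = evaluate(model, test_rows)
    # "Deploy" is only a flag here; a real pipeline would publish the model.
    return {"model": model, "accuracy": accuracy, "deployed": accuracy >= min_accuracy}

data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(20, 30)]
result = run_training_pipeline(data)
```

Gating deployment on the evaluation result is one common design choice; the threshold of 0.9 is arbitrary.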

The fourth pipeline: "Inference pipeline."

Here is where we transform production data, run it through the model, and process the results.

(Good practice here: use the same data transformation pipeline that you built before.)
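As a sketch of that flow: transform the incoming record, run the model, and process the result. `ThresholdModel`, the `amount` field, and the labels are hypothetical stand-ins. The point is that `transform` is assumed to be the very same function used at training time.

```python
def transform(record):
    """Shared transformation; training is assumed to have used this same function."""
    return [float(record["amount"]) / 100.0]

class ThresholdModel:
    """Hypothetical stand-in for a trained model."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, features):
        return features[0] > self.threshold

def postprocess(prediction):
    """Turn the raw prediction into something the caller can consume."""
    return {"label": "flag" if prediction else "ok"}

def run_inference(record, model):
    features = transform(record)          # 1. transform production data
    prediction = model.predict(features)  # 2. run it through the model
    return postprocess(prediction)        # 3. process the results

model = ThresholdModel(threshold=5.0)
result = run_inference({"amount": "750"}, model)
```

Reusing one `transform` function avoids the training and serving code drifting apart.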

The fifth pipeline: "Monitoring and maintenance pipeline."

This one is hard to find because it's indicative of mature machine learning systems (and there aren't too many of those out there).

Goal: monitoring, retraining, and redeploying the model.
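A minimal sketch of the monitoring half, assuming ground-truth labels eventually arrive for production predictions. The class `AccuracyMonitor`, the window size, and the accuracy bar are all illustrative choices; what happens after the retraining signal fires is left out.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks accuracy over the most recent labeled predictions."""
    def __init__(self, window=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically
        self.min_accuracy = min_accuracy

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_retraining(self):
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy

monitor = AccuracyMonitor(window=4, min_accuracy=0.75)
for prediction, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(prediction, actual)

retrain = monitor.needs_retraining()
```

In this run, two of the four predictions are wrong, so the monitor asks for retraining.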

We are still figuring things out.

Sometimes, we use the same term to refer to different processes that happen at different stages. It's confusing as hell!

But that's why I'm here: to break some of these things down in a way you can understand.

@svpino ← This is me. Stay tuned.

https://twitter.com/NickXiotis/status/1451538733925380100?s=20

Great point.

Validating the data is really important whenever you can't trust the source. So yes, this will be part of your pipeline.
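A small sketch of what such a validation step could look like. The schema (an integer `user_id`, an `age` between 0 and 120) is made up for illustration.

```python
def validate(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    if not isinstance(record.get("user_id"), int):
        problems.append("user_id must be an integer")
    age = record.get("age")
    if not isinstance(age, (int, float)) or not 0 <= age <= 120:
        problems.append("age must be a number between 0 and 120")
    return problems

records = [{"user_id": 1, "age": 34}, {"user_id": "x", "age": -5}]
valid = [r for r in records if not validate(r)]
rejected = [(r, validate(r)) for r in records if validate(r)]
```

Keeping the rejected records together with their reasons makes it easier to go back to the untrusted source and fix the problem there.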

More from @svpino

Santiago

@svpino

19 Oct

One of the most useful things you can learn:

Greedy algorithms, how they work, and how to solve problems using them.

Here is why they are fundamental: ↓

Greedy algorithms:

• Pretty intuitive to understand
• Easy to come up with
• A great way to solve many problems

Premature optimization is the root of all evil. Many times, a greedy solution is all you need to solve a problem.

At each step, a greedy algorithm makes the choice that looks best at that moment (the locally optimal choice).

(Unfortunately, this approach is not guaranteed to reach the globally optimal solution. More about this later.)
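To make both points concrete, here is a sketch using coin change, a classic illustration chosen for this note rather than taken from the thread. The function `greedy_change` always grabs the largest coin that still fits.

```python
def greedy_change(amount, coins):
    """Pick the largest coin that fits and repeat. Returns None if it gets stuck."""
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            picked.append(coin)
    return picked if amount == 0 else None

us_coins = greedy_change(63, [25, 10, 5, 1])  # [25, 25, 10, 1, 1, 1]
odd_coins = greedy_change(6, [4, 3, 1])       # [4, 1, 1], but [3, 3] uses fewer coins
```

With the first coin set, the greedy answer uses as few coins as possible. With the second, the locally best first pick (4) leads to three coins when two would do.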

Here is an example problem where you could use a greedy algorithm:

Read 7 tweets

Santiago

@svpino

15 Oct

If you haven't looked into machine learning yet, you'd better start now.

I started looking seriously into machine learning around spring of 2015.

The field was very different back then.

Just to give you an idea, the most popular deep learning frameworks didn't exist:

• TensorFlow was released at the end of 2015
• PyTorch in 2016

In just 5-6 years we have gone from "read my paper... it's cool" to "holy shit, look what my phone is doing!"

Machine learning has turned the industry upside down.

We have gone from "that's impossible" to "of course we can!" in record time.

Read 23 tweets

Santiago

@svpino

12 Oct

A big part of my work is to build computer vision models to recognize things.

It's usually ordinary stuff: An antenna, a fire extinguisher, a bag, a ladder.

Here is a trick I use to solve some of these problems.

The good news about having to recognize everyday objects:

There are a ton of pre-trained models that help with that. You can start with one of these models and get decent results out of the box.

This is important. I'll come back to it in a second.

Many of the use cases that I tackle are about using machine learning to "augment" the people doing the work.

Let's say you have a team looking at drone footage to find squirrels. Eight hours every day looking at images.

This sucks. I can help with that.

Read 19 tweets

Santiago

@svpino

11 Oct

Last week I trained a machine learning model using 100% of the data.

Then I used the model to predict the labels on the same dataset I used to train it.

I'm not kidding. Hear me out: ↓

Does this sound crazy?

Yes.

Would I be losing my shit if I heard that somebody did this?

Yes.

So what's going on?

I have a dataset with a single numerical feature and a binary target.

I need to know the threshold that best separates the positive samples from the negative ones.

I don't want a model to make predictions; I just need to know the threshold.
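For a sense of why using all the data is reasonable here, below is one possible way to find such a threshold. This is a plain exhaustive search, not necessarily the model the thread goes on to describe. It assumes positives sit above the threshold and that there are at least two distinct feature values; the sample numbers are made up.

```python
def best_threshold(values, labels):
    """Try midpoints between sorted unique values; keep the most accurate one."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]

    def accuracy(t):
        return sum((v > t) == bool(y) for v, y in zip(values, labels)) / len(values)

    return max(candidates, key=accuracy)

values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
threshold = best_threshold(values, labels)  # 5.0
```

Since the goal is to describe this dataset rather than to predict on new data, there is nothing to hold out.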

Read 10 tweets

Santiago

@svpino

8 Oct

I get asked about machine learning all the time.

Here are my answers to some of these questions: ↓

Q: Where do I start?

Start by learning how to program.

Take your time. Usually, a solid year of Python experience will set you up for success.

Kaggle has a great introductory tutorial to get you started with Python.

Q: I already have plenty of Python experience. Now what?

For most people, I recommend the "Machine Learning Crash Course" created by Google or the "Intro to Machine Learning" from Kaggle.

If you are feeling adventurous, take "Machine Learning" from @AndrewYNg on Coursera.

Read 15 tweets

Santiago

@svpino

7 Oct

More data is usually not the way to turn around a mediocre machine learning model.

I've heard too many times that deep learning's silver bullet is throwing more data at a problem.

That hasn't been my experience.

Good Data is better than Big Data.

https://twitter.com/jeande_d/status/1446069996409470980

More data with even a moderate amount of mislabeled examples can hurt your model.

https://twitter.com/coo_ooi/status/1446071108734705667

Assuming the data is good, then more data is probably not going to be a problem.

Unfortunately, the quality of data is usually inversely proportional to the amount of it. More data is often mediocre data.

But if your data is good, no harm.

Read 4 tweets
