Santiago (@svpino)
7 May, 11 tweets, 3 min read
Do you know what scares me? Data labeling in machine learning.

We don't talk about it enough, and yet we can't do anything until we solve it. Labeling enough data is expensive, and sometimes outright impossible.

Here are some ideas to tackle this problem.

Let's start with an example:

You have terrain and weather information for different locations. Your goal is to build a model that predicts where to drill to find oil.

How do you label this data? You drill to find out where the oil is.

This is ridiculously expensive.
To get around this problem, you need to minimize the number of labeled examples required to build a good model.

1. Take the data
2. Select as few examples as possible
3. Drill those holes to come up with the labels
4. Train the model

How can you achieve #2?
Here's a possible solution:

Start by drilling a few holes to label some of the examples. Not too many, just enough to get a mediocre model started.

Use this model to label the rest of the dataset automatically.

The results of that model will be bad but useful.
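
Here's a rough sketch of that bootstrap step in Python, assuming scikit-learn and hypothetical arrays: X_labeled and y_labeled are the few locations you already drilled, and X_unlabeled is everything else.

```python
from sklearn.ensemble import RandomForestClassifier

# X_labeled / y_labeled: the few locations we already drilled (hypothetical arrays).
# X_unlabeled: every other location, described by its terrain and weather features.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_labeled, y_labeled)

# Class probabilities for every unlabeled location (one row per location).
# The predicted labels will be noisy, but the confidence scores are what we need next.
probabilities = model.predict_proba(X_unlabeled)
```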
Sort the model's predictions by their confidence score and take the least confident ones.

For example:

• Sample A: 85% positive | 15% negative
• Sample B: 55% positive | 45% negative

We'd take Sample B: the model is barely more confident than a coin flip there, so its label will teach the model the most.
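
In code, one simple way to score this is "least confidence": subtract the probability of the model's top prediction from 1, and label the samples with the highest score first. A quick sketch with the two samples above (plain NumPy, numbers taken from the example):

```python
import numpy as np

# Rows are samples, columns are [positive, negative] probabilities.
probabilities = np.array([
    [0.85, 0.15],  # Sample A: the model is already fairly sure
    [0.55, 0.45],  # Sample B: the model is almost guessing
])

# Least-confidence score: 1 - probability of the predicted class.
uncertainty = 1 - probabilities.max(axis=1)  # -> [0.15, 0.45]

# Label the most uncertain samples first (here, Sample B).
most_uncertain = np.argsort(uncertainty)[::-1]
print(most_uncertain)  # [1 0] -> Sample B first
```

Margin sampling (the gap between the top two probabilities) or entropy would work just as well here; they all favor Sample B.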

Drill new holes at the locations represented by the samples you picked.
You drill, you get new labels, and your training data is now larger.

Retrain the model with the new dataset and repeat the whole process.

Stop when the model doesn't get any better.
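
Putting it all together, the loop could look something like this. It's a sketch, not the exact implementation from the thread: drill() is a hypothetical stand-in for the expensive labeling step, the held-out set (X_val, y_val) and the batch_size/patience values are assumptions, and the model is a generic scikit-learn classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def active_learning_loop(X_labeled, y_labeled, X_pool, X_val, y_val,
                         batch_size=10, patience=2):
    """Label the least confident samples in batches until the model stops improving."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_labeled, y_labeled)

    best_score = accuracy_score(y_val, model.predict(X_val))
    rounds_without_improvement = 0

    while len(X_pool) > 0 and rounds_without_improvement < patience:
        # Score the unlabeled pool and pick the least confident locations.
        probabilities = model.predict_proba(X_pool)
        uncertainty = 1 - probabilities.max(axis=1)
        picked = np.argsort(uncertainty)[::-1][:batch_size]

        # drill() is hypothetical: the expensive step that returns the true labels.
        new_labels = drill(X_pool[picked])

        # Grow the training set and shrink the unlabeled pool.
        X_labeled = np.vstack([X_labeled, X_pool[picked]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_pool = np.delete(X_pool, picked, axis=0)

        # Retrain and check whether the model is still getting better.
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X_labeled, y_labeled)

        score = accuracy_score(y_val, model.predict(X_val))
        if score > best_score:
            best_score, rounds_without_improvement = score, 0
        else:
            rounds_without_improvement += 1

    return model
```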
This is a really cool approach.

It's called "Active Learning," and I just described one of the ways you can implement it.

It helps you be really strategic about what data you label to train a model.
I go into more detail about this approach in the latest issue of underfitted.io. Here is the link:

digest.underfitted.io/archive/594438

If you aren't subscribed yet, you are missing out.

It's free, and it puts one machine learning story right in your inbox every week.
And if you enjoy a practical perspective on machine learning, follow me @svpino, and I'll make sure you get a constant stream of good content that will help us both get better at this thing.

Every week, one tweet at a time.

