Santiago (@svpino)
7 May, 11 tweets, 3 min read
Do you know what scares me? Data labeling in machine learning.

We don't talk about it enough, and yet we can't do anything until we solve it. Labeling enough data is expensive, and sometimes outright impossible.

Here are some ideas to tackle this problem.

Let's start with an example:

You have terrain and weather information for different locations. Your goal is to build a model that predicts where to drill to find oil.

How do you label this data? You drill to find out where the oil is.

This is ridiculously expensive.
To get around this problem, you need to minimize the number of labeled examples required to build a good model.

1. Take the data
2. Select as few examples as possible
3. Drill those holes to come up with the labels
4. Train the model

How can you achieve #2?
Here's a possible solution:

Start by drilling a few holes to label some of the examples. Not too many, just enough to get a mediocre model started.

Use this model to label the rest of the dataset automatically.

The results of that model will be bad but useful.
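
Here's a rough sketch of that bootstrap step in Python, assuming scikit-learn and hypothetical arrays: X_labeled and y_labeled are the few locations you already drilled, and X_unlabeled is everything else.

```python
from sklearn.ensemble import RandomForestClassifier

# X_labeled / y_labeled: the few locations we already drilled (hypothetical arrays).
# X_unlabeled: every other location, described by its terrain and weather features.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_labeled, y_labeled)

# Class probabilities for every unlabeled location (one row per location).
# The predicted labels will be noisy, but the confidence scores are what we need next.
probabilities = model.predict_proba(X_unlabeled)
```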
Sort the model's predictions by their confidence score and take the least confident ones.

For example:

• Sample A: 85% positive | 15% negative
• Sample B: 55% positive | 45% negative

We'd take Sample B: the model is barely more confident than a coin flip there, so its label will teach the model the most.
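
In code, one simple way to score this is "least confidence": subtract the probability of the model's top prediction from 1, and label the samples with the highest score first. A quick sketch with the two samples above (plain NumPy, numbers taken from the example):

```python
import numpy as np

# Rows are samples, columns are [positive, negative] probabilities.
probabilities = np.array([
    [0.85, 0.15],  # Sample A: the model is already fairly sure
    [0.55, 0.45],  # Sample B: the model is almost guessing
])

# Least-confidence score: 1 - probability of the predicted class.
uncertainty = 1 - probabilities.max(axis=1)  # -> [0.15, 0.45]

# Label the most uncertain samples first (here, Sample B).
most_uncertain = np.argsort(uncertainty)[::-1]
print(most_uncertain)  # [1 0] -> Sample B first
```

Margin sampling (the gap between the top two probabilities) or entropy would work just as well here; they all favor Sample B.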

Drill new holes at the locations represented by the samples you picked.
You drill, you get new labels, and your training data is now larger.

Retrain the model with the new dataset and repeat the whole process.

Stop when the model doesn't get any better.
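
Putting it all together, the loop could look something like this. It's a sketch, not the exact implementation from the thread: drill() is a hypothetical stand-in for the expensive labeling step, the held-out set (X_val, y_val) and the batch_size/patience values are assumptions, and the model is a generic scikit-learn classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def active_learning_loop(X_labeled, y_labeled, X_pool, X_val, y_val,
                         batch_size=10, patience=2):
    """Label the least confident samples in batches until the model stops improving."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_labeled, y_labeled)

    best_score = accuracy_score(y_val, model.predict(X_val))
    rounds_without_improvement = 0

    while len(X_pool) > 0 and rounds_without_improvement < patience:
        # Score the unlabeled pool and pick the least confident locations.
        probabilities = model.predict_proba(X_pool)
        uncertainty = 1 - probabilities.max(axis=1)
        picked = np.argsort(uncertainty)[::-1][:batch_size]

        # drill() is hypothetical: the expensive step that returns the true labels.
        new_labels = drill(X_pool[picked])

        # Grow the training set and shrink the unlabeled pool.
        X_labeled = np.vstack([X_labeled, X_pool[picked]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_pool = np.delete(X_pool, picked, axis=0)

        # Retrain and check whether the model is still getting better.
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X_labeled, y_labeled)

        score = accuracy_score(y_val, model.predict(X_val))
        if score > best_score:
            best_score, rounds_without_improvement = score, 0
        else:
            rounds_without_improvement += 1

    return model
```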
This is a really cool approach.

It's called "Active Learning," and I just described one of the ways you can implement it.

It helps you be really strategic about what data you label to train a model.
I go into more detail about this approach in the latest issue of underfitted.io. Here is the link:

digest.underfitted.io/archive/594438

If you aren't subscribed yet, you are missing out.

It's free, and it puts one machine learning story right in your inbox every week.
And if you enjoy a practical perspective on machine learning, follow me @svpino, and I'll make sure you get a constant stream of good content that will help us both get better at this thing.

Every week, one tweet at a time.

