Santiago
1 Oct, 12 tweets, 3 min read
A team led by MIT examined 10 of the most-cited datasets used to test machine learning systems.

They found that around 3.4% of the data was inaccurate or mislabeled.

Those are very popular datasets. How about yours?

I've worked with many datasets for image classification.

Unfortunately, mislabeled data is a common problem.

It is hard for people to consistently label visual concepts, especially when the answer is not apparent.
This is a big problem.

Basically, we are evaluating models with images of elephants, expecting them to get classified as "lions."

Your model can't perform well this way.
A nice trick to improve your labels:

1. Train a model
2. Evaluate it
3. Review the model mistakes
4. Repeat

Let's dive into this process a little bit more.
When I finish training a model, I use the validation set and print out the following for each sample:

• Target (what I expect to get)
• Prediction (what the model returned)
• Confidence (the softmax value)
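Here is a minimal sketch of that printout, using made-up softmax outputs. In practice, `probs` would come from running your model on the validation set (e.g., `model.predict(X_val)`); the numbers below are placeholders.

```python
import numpy as np

# Hypothetical softmax outputs for 4 validation samples over 3 classes.
# In a real run, these come from your model on the validation set.
probs = np.array([
    [0.05, 0.90, 0.05],
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.95, 0.03, 0.02],
])
targets = np.array([1, 0, 0, 2])  # ground-truth labels

predictions = probs.argmax(axis=1)  # what the model returned
confidences = probs.max(axis=1)     # the softmax value of the prediction

for t, p, c in zip(targets, predictions, confidences):
    print(f"target={t} prediction={p} confidence={c:.2f}")
```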
I then filter the list down to the model's mistakes (where target and prediction differ) and sort them by confidence, highest first.

This gives you the mistakes where the model was really confident. ← These usually reveal mislabeled data.
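The filter-and-sort step might look like this. The arrays are hypothetical stand-ins for the targets, predictions, and confidences collected from your validation set.

```python
import numpy as np

# Hypothetical validation results (stand-ins for real model output):
targets = np.array([1, 0, 0, 2])
predictions = np.array([1, 0, 1, 0])
confidences = np.array([0.90, 0.80, 0.70, 0.95])

# Keep only the mistakes (target and prediction differ)...
mistakes = np.where(targets != predictions)[0]

# ...sorted by confidence, highest first. These are the samples
# most likely to be mislabeled.
order = mistakes[np.argsort(-confidences[mistakes])]

for i in order:
    print(f"sample {i}: expected {targets[i]}, got {predictions[i]} "
          f"(confidence {confidences[i]:.2f})")
```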
You can also rank the labels by their number of mistakes and start your research from that point.

You want to look into labels with many mistakes because the model is clearly not learning them correctly.
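A quick way to rank labels by mistake count, again with made-up (target, prediction) pairs standing in for a real validation run:

```python
from collections import Counter

# Hypothetical (target, prediction) pairs from a validation run:
pairs = [(0, 0), (0, 1), (0, 1), (1, 1), (2, 0), (2, 1), (2, 2), (2, 0)]

# Count mistakes per true label.
mistakes_per_label = Counter(t for t, p in pairs if t != p)

# Labels with the most mistakes are the best place to start reviewing.
for label, count in mistakes_per_label.most_common():
    print(f"label {label}: {count} mistakes")
```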
After each round of sanitizing the dataset, you can repeat the entire process:

1. Train another model
2. Evaluate it
3. Review the mistakes

You want to stop as soon as there aren't any obvious mislabeled examples among the model's most confident mistakes.
Although this process can help improve the quality of your dataset, there's something important you can't forget:

Standardizing the labeling process to avoid mistakes in the first place is one of the best investments you can make.
Every week, I post 2 or 3 threads like this, breaking down machine learning concepts and giving you ideas on how to apply them in real-life situations.

You can find more of these at @svpino.

If you find this helpful, stay tuned: a lot more is coming.
An example of a confident mistake: if the model is looking at a picture that is supposed to show a dog (target value = "dog") and it returns, with high confidence, that it is a "cat," you should look into it:

1. Maybe the model is wrong.
2. Most likely, the picture was mislabeled.

You will do this process with your validation data.

