Santiago
11 Oct, 10 tweets, 2 min read
Last week I trained a machine learning model using 100% of the data.

Then I used the model to predict the labels on the same dataset I used to train it.

I'm not kidding. Hear me out: ↓
Does this sound crazy?

Yes.

Would I be losing my shit if I heard that somebody did this?

Yes.

So what's going on?
I have a dataset with a single numerical feature and a binary target.

I need to know the threshold that best separates the positive samples from the negative ones.

I don't want a model to make predictions; I just need to know the threshold.
There are a bazillion ways to find this threshold. Attached you can see one of them.

Fit a DecisionTreeClassifier on the data, and print the threshold.

That's it.
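The original code was attached as an image, so here is a minimal sketch of the idea, with my own toy data and variable names. A depth-1 decision tree makes exactly one split on the single feature, and that split point is the threshold:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: one numerical feature, binary target.
X = np.array([[0.1], [0.3], [0.4], [0.6], [0.7], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

# A depth-1 tree makes exactly one split: that split is the threshold.
tree = DecisionTreeClassifier(max_depth=1)
tree.fit(X, y)

threshold = tree.tree_.threshold[0]
print(threshold)  # the split point between negatives and positives
```

scikit-learn places the split at the midpoint between adjacent feature values, so for this toy data the threshold lands between 0.4 and 0.6.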
Remember, I don't need this model to do anything else. I don't care about validating it or anything like that.

Of course, I used 100% of the data for this.

And then I had to answer an important question:
After having the threshold, I needed to determine its precision and recall.

How good was that threshold at separating the positive from the negative examples?

Let's compute that.
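A sketch of that computation, again with my own toy data (the threshold value here is an assumed stand-in for whatever the tree found):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

X = np.array([[0.1], [0.3], [0.4], [0.6], [0.7], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
threshold = 0.5  # assumed: the split point found by the decision tree

# "Predict" by applying the threshold directly -- no model needed.
y_pred = (X.ravel() > threshold).astype(int)

precision = precision_score(y, y_pred)
recall = recall_score(y, y_pred)
print(precision, recall)
```

On this toy data the threshold separates the classes perfectly, so both scores come out to 1.0; on real data they tell you how much the two classes overlap around the split.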
So yeah, after training with 100% of the data, I used the model to predict the targets of the same data.

Sounds egregious, but it isn't.
This experience taught me something important:

Whatever preconceptions I have, the best approach is always to put them aside, be ready to be wrong, and be open to learning something new.
I had many numerical values. Some of them correspond to positive examples, some of them to negative examples.

I want to find the value (threshold) that best separates the positive from the negative values.

That was the problem.

Here is the process that should give a little bit more context around this application:

1. Bootstrap the threshold ← This is what I explained in this thread.

2. Collect more data

3. Label some of that data (~20%)

4. Recompute the threshold using new labeled data

5. Repeat
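The loop above can be sketched in a few lines. The data-collection and labeling steps here are stand-ins (hypothetical arrays) for the real pipeline:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_threshold(X, y):
    """Refit a depth-1 tree and return its split point."""
    tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
    return tree.tree_.threshold[0]

# 1. Bootstrap the threshold from the initial labeled data.
X = np.array([[0.1], [0.3], [0.6], [0.9]])
y = np.array([0, 0, 1, 1])
threshold = fit_threshold(X, y)

# 2-4. Each round: collect data, label ~20% of it, recompute.
new_X = np.array([[0.45], [0.55]])  # stand-in for newly collected data
new_y = np.array([0, 1])            # stand-in for the new labels
X = np.vstack([X, new_X])
y = np.concatenate([y, new_y])
threshold = fit_threshold(X, y)     # 5. Repeat as more labels arrive.
```

Each round of new labels tightens the estimate: here the second fit narrows the split to the midpoint between the two newly labeled points.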

More from @svpino

12 Oct
A big part of my work is to build computer vision models to recognize things.

It's usually ordinary stuff: An antenna, a fire extinguisher, a bag, a ladder.

Here is a trick I use to solve some of these problems.
The good news about having to recognize everyday objects:

There are a ton of pre-trained models that help with that. You can start with one of these models and get decent results out of the box.

This is important. I'll come back to it in a second.
Many of the use cases that I tackle are about "augmenting" the people who are working with machine learning.

Let's say you have a team looking at drone footage to find squirrels. Eight hours every day looking at images.

This sucks. I can help with that.
8 Oct
I get asked about machine learning all the time.

Here are my answers to some of these questions: ↓
Q: Where do I start?

Start by learning how to program.

Take your time. Usually, a solid year of Python experience will set you up for success.

Kaggle has a great introductory tutorial to get you started with Python.
Q: I already have plenty of Python experience. Now what?

For most people, I recommend the "Machine Learning Crash Course" created by Google or the "Intro to Machine Learning" from Kaggle.

If you are feeling adventurous, take "Machine Learning" from @AndrewYNg on Coursera.
7 Oct
More data is usually not the way to turn around a mediocre machine learning model.

I've heard too many times that deep learning's silver bullet is throwing more data at a problem.

That hasn't been my experience.

Good Data is better than Big Data.
More data, even with a moderate amount of mislabeled examples, will hurt your model.
Assuming the data is good, more data is probably not going to be a problem.

Unfortunately, the quality of data is usually inversely proportional to the amount of it. More data is often mediocre data.

But if your data is good, no harm.
6 Oct
Which one do you prefer? The code on the left, or the code on the right?

I'd love to hear why.
I was always a “left” kind of programmer.

For quite some time now I’ve been forcing myself to use the right style.

Look at “EAFP vs LBYL”. Pretty interesting arguments.

- LBYL - Look Before You Leap. (Left)

- EAFP - Easier to Ask for Forgiveness than Permission. (Right)
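The code in the original tweet was an image, so here is my own minimal reconstruction of the two styles being contrasted, using a dictionary lookup:

```python
config = {"timeout": 30}

# LBYL: Look Before You Leap -- check first, then act.
if "timeout" in config:
    timeout_lbyl = config["timeout"]
else:
    timeout_lbyl = 10  # fallback default

# EAFP: Easier to Ask for Forgiveness than Permission --
# just act, and handle the failure if it happens.
try:
    retries_eafp = config["retries"]
except KeyError:
    retries_eafp = 3  # fallback default

print(timeout_lbyl, retries_eafp)
```

EAFP is generally considered the more idiomatic Python style: it avoids the race between the check and the access, and keeps the happy path unindented.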
Also, I love all of you, but it’s usually a good practice to answer the question using one of the two options instead of going with a third, imaginary option that you feel is better for your imaginary problem.

😋
1 Oct
A team led by MIT examined 10 of the most-cited datasets used to test machine learning systems.

They found that around 3.4% of the data was inaccurate or mislabeled.

Those are very popular datasets. How about yours?

I've worked with many datasets for image classification.

Unfortunately, mislabeled data is a common problem.

It is hard for people to consistently label visual concepts, especially when the answer is not apparent.
This is a big problem.

Basically, we are evaluating models with images of elephants, expecting them to get classified as "lions."

Your model can't perform well this way.
30 Sep
Do you want more people to read your Twitter threads?

Here is something you can do.

I'm glad threads have become popular, and more people are publishing their content that way.

I've been experimenting with threads for a while. I've learned a ton about what works and what doesn't.

Keep in mind that this advice is based on my experience. It may not work for you.
The advice is simple: Don't scare the reader on your first tweet.

If you announce that your thread is "huge" or that it is a "mega-thread," people will tend to shy away from it.

The same happens if you start your thread with something like "1/25."

25 tweets???!!!