Santiago
Computer scientist. I teach hard-core AI/ML Engineering at https://t.co/THCAAZcBMu. YouTube: https://t.co/pROi08OZYJ

Feb 26, 2021, 12 tweets

Imagine you have a ton of data, but most of it isn't labeled. Even worse: labeling is very expensive. 😑

How can we get past this problem?

Let's talk about a different—and pretty cool—way to train a machine learning model.

☕️👇

Let's say we want to classify videos in terms of maturity level. We have millions of them, but only a few have labels.

Labeling a video takes a long time (you have to watch it in full!). We also don't know how many labeled videos we need to build a good model.

[2 / 9]

In a traditional supervised approach, we don't have a choice: we need to spend the time and come up with a large dataset of labeled videos to train our model.

But this isn't always an option.

In some cases, this may be the end of the project. 😟

[3 / 9]

Here is a different approach: Active Learning.

Using Active Learning, we can have our algorithm start training with the data it has and interactively ask for new labeled data as it needs it.

Active Learning is usually grouped with semi-supervised learning because it works with both labeled and unlabeled data.

[4 / 9]

Here is the most important part of "Active Learning":

The algorithm will look at all the unlabeled data and will pick the most informative examples. Then, it will ask humans to label those examples and use the answers as part of the training process.
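Here is a minimal sketch of that loop, assuming a toy one-dimensional problem instead of videos. Everything in it is hypothetical and only for illustration: `oracle` stands in for the human labeler, the "model" is a single threshold, and `fit_threshold`, `most_informative`, and `active_learning` are made-up names, not a library API.

```python
def fit_threshold(labeled):
    """Toy model: the boundary is the midpoint between the two classes.
    Assumes the labeled set already holds at least one example of each class."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    return (max(negatives) + min(positives)) / 2


def most_informative(pool, threshold):
    """The example closest to the boundary is the one the model is least sure about."""
    return min(pool, key=lambda x: abs(x - threshold))


def active_learning(seed, pool, oracle, budget):
    """Train on what we have, ask for one label at a time, and retrain."""
    labeled = list(seed)
    pool = list(pool)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labeled)         # train with the current labels
        query = most_informative(pool, threshold)  # pick the example to ask about
        pool.remove(query)
        labeled.append((query, oracle(query)))     # a human provides the label
    return fit_threshold(labeled), labeled


def oracle(x):
    """Hypothetical annotator: the true boundary sits just above 0.62."""
    return int(x > 0.62)


pool = [i / 100 for i in range(1, 100)]  # 99 unlabeled examples
seed = [(0.0, 0), (1.0, 1)]              # the few labels we start with
threshold, labeled = active_learning(seed, pool, oracle, budget=8)
```

With a budget of 8 requests, the learned threshold lands next to the true boundary even though most of the 99 pool examples were never labeled.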

[5 / 9]

Determining which examples are the most informative is the problematic part.

Worst case, we can select unlabeled examples randomly, but that wouldn't be smart.

The better the selection process is, the less data you'll need to build a model.
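To make that concrete, here is a rough sketch that counts how many label requests it takes to pin down a class boundary on a toy one-dimensional pool, once picking examples at random and once picking the example closest to the current boundary estimate. The pool, the `oracle`, the stopping rule, and all function names are assumptions made up for this illustration.

```python
import random


def labels_needed(pool, oracle, choose, max_gap=0.011):
    """Count label requests until the boundary is pinned down to max_gap."""
    pool = list(pool)
    low, high = 0.0, 1.0  # largest known negative, smallest known positive
    requests = 0
    while high - low > max_gap and pool:
        x = choose(pool, (low + high) / 2)
        pool.remove(x)
        requests += 1
        if oracle(x):
            high = min(high, x)
        else:
            low = max(low, x)
    return requests


def closest_to_boundary(pool, estimate):
    """Uncertainty-style selection: ask about the example nearest the estimate."""
    return min(pool, key=lambda x: abs(x - estimate))


rng = random.Random(0)  # fixed seed so the run is repeatable


def pick_random(pool, estimate):
    """Baseline: ignore the model and pick any unlabeled example."""
    return rng.choice(pool)


def oracle(x):
    """Hypothetical annotator: the true boundary sits just above 0.62."""
    return x > 0.62


pool = [i / 100 for i in range(1, 100)]
uncertain = labels_needed(pool, oracle, closest_to_boundary)
trials = [labels_needed(pool, oracle, pick_random) for _ in range(200)]
average_random = sum(trials) / len(trials)
print(uncertain, average_random)
```

On this toy setup the targeted strategy needs only a handful of labels, while random selection needs many more on average. Real data is noisier, so take the gap as an illustration and not a benchmark.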

[6 / 9]

When deciding, we want the algorithm to pick the most challenging examples for the model.

Here are some existing methods that you can research further:

- Least Confidence Uncertainty
- Smallest Margin Uncertainty
- Entropy Reduction
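
As a rough sketch, here is how each score could be computed from a model's predicted class probabilities. I'm reading "Entropy Reduction" as ranking examples by the entropy of the prediction, which is an assumption on my part. The video ids and probabilities below are made up for illustration.

```python
import math


def least_confidence(probs):
    """1 minus the top probability. Higher means more uncertain."""
    return 1.0 - max(probs)


def smallest_margin(probs):
    """Gap between the two most likely classes. Smaller means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def entropy(probs):
    """Shannon entropy of the prediction. Higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical predictions over three maturity levels.
predictions = {
    "video_a": [0.90, 0.05, 0.05],
    "video_b": [0.50, 0.45, 0.05],
    "video_c": [0.40, 0.30, 0.30],
}

most_uncertain = {
    "least_confidence": max(predictions, key=lambda v: least_confidence(predictions[v])),
    "smallest_margin": min(predictions, key=lambda v: smallest_margin(predictions[v])),
    "entropy": max(predictions, key=lambda v: entropy(predictions[v])),
}
print(most_uncertain)
```

Note that the methods don't always agree: here least confidence and entropy pick video_c, while smallest margin picks video_b, where the top two classes are almost tied.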

[7 / 9]

In summary, Active Learning trains a model iteratively while minimizing the amount of labeled data it needs.

This translates into significant savings, and sometimes, it's the difference that makes a solution viable.

[8 / 9]

Do you enjoy these threads about machine learning? Are they informative?

If I were to make a change to improve them, what would you like that to be?

[9 / 9]

🦕

You can choose any batch size.

You could decide to update the model after each request, or you could build up a batch before updating the model.
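
A small sketch of that trade-off, assuming we already have an uncertainty score per unlabeled video. The scores, the video ids, and the helper names `select_batch` and `retrain_count` are hypothetical.

```python
import heapq
import math


def select_batch(uncertainty, batch_size):
    """Return the ids of the batch_size most uncertain examples, most uncertain first."""
    return heapq.nlargest(batch_size, uncertainty, key=uncertainty.get)


def retrain_count(label_budget, batch_size):
    """How many times the model gets updated for a given labeling budget."""
    return math.ceil(label_budget / batch_size)


# Hypothetical uncertainty scores for four unlabeled videos.
uncertainty = {"video_a": 0.10, "video_b": 0.55, "video_c": 0.60, "video_d": 0.30}

one_at_a_time = select_batch(uncertainty, 1)   # update after each request
batch_of_three = select_batch(uncertainty, 3)  # build up a batch first
print(one_at_a_time, batch_of_three)
print(retrain_count(100, 1), retrain_count(100, 25))
```

A batch size of 1 keeps every request as informed as possible but retrains the model once per label. Larger batches retrain less often, at the cost of picking several examples with the same, slightly stale model.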

There are multiple ideas that you could follow here. Here are some examples:

▫️ Automatically identifying nudity is not a hard problem.

▫️ You could also identify profanity either with speech-to-text or through captions.

Other signals you could follow:

▫️ What people who watch R-rated movies also watch could be a signal for finding other R-rated movies.

▫️ Movie directors and actors/actresses could be a link too.

▫️ Genre is important as well.
