Building a good training dataset is harder than you think.

For example, you might have millions of unlabeled data points, but only the resources to label a thousand.

This is a story about a case that I used to encounter almost every day in my work.

🧵 👇🏽
Do you know how new drugs are developed?

Essentially, thousands of candidate molecules are tested to see if they have the targeted effect. First, the testing is done on cell cultures.

Sometimes, there is no better option than scanning through libraries of molecules.
After cells are treated with a given molecule (or molecules in some cases), the effects are studied by screening them with microscopy.

The treated cells can exhibit hundreds of different phenotypes (= classes), some of which might be very rare.
As a result, we obtain images like the one below.

The goal is to build a model that takes a cell as input and classifies the effect of the treatment.

Here is the catch: there are billions of cells in a single screen.

(Image source: BBBC021 dataset, bbbc.broadinstitute.org/BBBC021)
How do you prepare a training dataset for your classifier?

Each cell has to be labeled by a domain expert, a cell biologist in our case.

Unfortunately, human time and attention span are limited.

What to do then?

There are a few methods that can help.
1️⃣ Unsupervised learning.

This is a large field, covering much more than this problem. However, it can be used for our purposes.

Essentially, labeling with unsupervised learning boils down to clustering the data, then selecting representative samples for expert annotation.
Clustering can range from simple methods like k-means clustering to complicated ones, like learning an embedding with autoencoders.
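
Here is a minimal sketch of the "cluster, then pick representatives" idea, using plain k-means written in NumPy. Everything named here is an assumption for illustration: `select_representatives` is a hypothetical helper, and the random `features` array stands in for per-cell feature vectors (or autoencoder embeddings), which the thread does not specify.

```python
import numpy as np

def select_representatives(features, k, n_iter=50, seed=0):
    """Cluster with plain k-means and return one sample index per non-empty cluster."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every sample to its nearest centroid.
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Move each centroid to the mean of its members; empty clusters stay put.
        updated = np.array([
            features[assignment == j].mean(axis=0) if np.any(assignment == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Final assignment against the final centroids.
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    representatives = []
    for j in range(k):
        members = np.flatnonzero(assignment == j)
        if members.size > 0:
            # The member closest to the centroid stands in for the whole cluster.
            representatives.append(int(members[dists[members, j].argmin()]))
    return representatives

# Hypothetical stand-in for cell features: three blobs in 2D.
rng = np.random.default_rng(42)
features = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0.0, 5.0, 10.0)])
to_label = select_representatives(features, k=3)
print(to_label)  # indices of the samples to hand to the expert
```

In practice you would probably pick several samples per cluster, and choose k generously so that rare phenotypes have a chance to get their own cluster.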
2️⃣ Semi-supervised learning.

First, annotate a small but representative portion of the dataset and train a model.

Then, use the unlabeled data to improve the accuracy of the model. One extreme case is to use predictions as labels, with manual corrections if needed.
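
The "predictions as labels" case is often called pseudo-labeling or self-training. Below is a rough sketch under loud assumptions: the classifier is a toy nearest-centroid model (not what you would use on real microscopy images), and `fit_centroids`, `predict_proba`, `pseudo_label`, and the `threshold` value are hypothetical names and choices made up for this example. It also assumes every class has at least one labeled sample.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    """Toy classifier: one mean vector per class."""
    return np.array([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, X):
    """Softmax over negative distances to the class centroids."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    logits = -dists
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def pseudo_label(X_lab, y_lab, X_unlab, n_classes, threshold=0.9, max_rounds=10):
    """Repeatedly add confidently predicted unlabeled samples to the training set."""
    X_train, y_train = X_lab.copy(), y_lab.copy()
    remaining = X_unlab.copy()
    for _ in range(max_rounds):
        if len(remaining) == 0:
            break
        centroids = fit_centroids(X_train, y_train, n_classes)
        proba = predict_proba(centroids, remaining)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # Treat confident predictions as if they were labels.
        X_train = np.vstack([X_train, remaining[confident]])
        y_train = np.concatenate([y_train, proba[confident].argmax(axis=1)])
        remaining = remaining[~confident]
    n_pseudo = len(X_train) - len(X_lab)
    return fit_centroids(X_train, y_train, n_classes), n_pseudo

# Hypothetical data: 5 labeled and 100 unlabeled samples per class.
rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(0.0, 0.5, size=(5, 2)), rng.normal(5.0, 0.5, size=(5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
X_unlab = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)), rng.normal(5.0, 0.5, size=(100, 2))])
centroids, n_pseudo = pseudo_label(X_lab, y_lab, X_unlab, n_classes=2)
print(n_pseudo, "unlabeled samples received a pseudo-label")
```

The manual corrections mentioned above would slot in right before the pseudo-labels are added to the training set: an expert reviews (a sample of) the confident predictions and fixes the wrong ones.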
3️⃣ Active learning.

As with semi-supervised learning, first prepare a small dataset and train a model.

When the predictions are run on unlabeled data, the "informativeness" of each instance is measured.

The most informative points are presented to the expert for labeling.
There are several ways to measure informativeness, each of which shines in different circumstances.

The simplest one is prediction uncertainty. However, this can reinforce the bias of the model: if a sample is confidently misclassified, it will never be queried for labeling.
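
Uncertainty-based querying can be sketched in a few lines. This version uses the entropy of the predicted class probabilities as the uncertainty score, which is one common choice among several; `uncertainty_sampling` is a hypothetical helper and the probabilities are made up for illustration.

```python
import numpy as np

def uncertainty_sampling(proba, n_queries):
    """Return the indices of the n_queries samples with the highest predictive entropy."""
    eps = 1e-12  # avoids log(0) for fully confident predictions
    entropy = -np.sum(proba * np.log(proba + eps), axis=1)
    # Sort by descending entropy and keep the top n_queries.
    return np.argsort(-entropy)[:n_queries]

# Made-up model outputs for four unlabeled samples (rows) and two classes (columns).
proba = np.array([
    [0.98, 0.02],  # very confident: never queried, even if the prediction is wrong
    [0.55, 0.45],
    [0.10, 0.90],
    [0.50, 0.50],  # maximally uncertain
])
query = uncertainty_sampling(proba, n_queries=2)
print(query)  # [3 1]
```

The first row illustrates the caveat: the score only looks at how confident the model is, not at whether it is right.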
So, what works well for a given problem? Unfortunately, there is no good answer to that.

None of these methods are perfect, and they are far from full maturity.

However, these fields have been receiving a new surge of interest lately.
If you are interested, there is no better time to dig deep into this problem than now!

Creating a good dataset is often a huge bottleneck nowadays. We know much more about training a model than about collecting data.
