How do you prepare a training dataset for your classifier?
Each cell has to be labeled by a domain expert, a cell biologist in our case.
Unfortunately, human time and attention span are limited.
What to do then?
There are a few methods that can help.
1️⃣ Unsupervised learning.
This is a large field, covering much more than this problem. However, it can be used for our purposes.
Essentially, labeling with unsupervised learning boils down to clustering the data, then selecting representative samples for expert annotation.
Clustering can range from simple methods like k-means to complex ones, such as clustering a learned autoencoder embedding.
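To make this concrete, here is a toy sketch of the idea (a minimal pure-Python Lloyd's k-means on made-up 2-D points; any real pipeline would use a proper library and a learned embedding): cluster the data, then hand the expert the point nearest each centroid as the representative to label.

```python
import math
import random

def dist(a, b):
    return math.dist(a, b)

def mean(pts):
    return tuple(sum(x) / len(pts) for x in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[i].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two obvious blobs of unlabeled samples (toy data).
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids = kmeans(points, k=2)
# Representative sample = the real data point nearest each centroid;
# these are the ones shown to the expert for annotation.
representatives = [min(points, key=lambda p: dist(p, c)) for c in centroids]
print(representatives)
```

One expert label per cluster can then be propagated to (or at least sanity-checked against) the rest of that cluster.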
2️⃣ Semi-supervised learning.
First, annotate a small but representative portion of the dataset and train a model.
Then, use the unlabeled data to improve the accuracy of the model. One extreme case is to use predictions as labels, with manual corrections if needed.
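A toy sketch of that extreme case, pseudo-labeling (the nearest-centroid "model", the 1-D data, and the margin threshold are all illustrative assumptions): train on the small labeled set, keep only confident predictions as new labels, then retrain.

```python
# point -> expert label (the small annotated portion)
labeled = {0.0: "a", 0.2: "a", 5.0: "b", 5.1: "b"}
unlabeled = [0.1, 0.3, 4.9, 5.3, 2.6]

def centroids(data):
    sums, counts = {}, {}
    for x, y in data.items():
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(c, x):
    return min(c, key=lambda y: abs(x - c[y]))

# Step 1: "train" on the small labeled set.
c = centroids(labeled)
# Step 2: accept predictions on unlabeled data as pseudo-labels only when
# the point is clearly closer to one centroid (a crude confidence filter).
for x in unlabeled:
    dists = sorted(abs(x - v) for v in c.values())
    if dists[1] - dists[0] > 1.0:   # margin threshold chosen arbitrarily
        labeled[x] = predict(c, x)
# Step 3: retrain on the enlarged dataset.
c = centroids(labeled)
print(sorted(labeled.items()))
```

Note that the ambiguous point (2.6, halfway between the groups) is left unlabeled; that is exactly the kind of sample that goes back to the expert for manual correction.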
3️⃣ Active learning.
As with semi-supervised learning, first prepare a small dataset and train a model.
When predictions are run on unlabeled data, the "informativeness" of each instance is measured.
The most informative points are presented to the expert for labeling.
There are several ways to measure informativeness, each with situations where it shines.
The simplest one is prediction uncertainty. However, this can amplify the bias of the model: if a sample is confidently misclassified, it will never be queried for labeling.
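Here is what uncertainty sampling looks like in its simplest, least-confidence form (the sample names and probability vectors are made up for illustration): score each unlabeled sample by how unsure the model is, and send the top-scoring ones to the expert.

```python
# Model's class-probability vector for each unlabeled sample (toy values).
probs = {
    "cell_1": [0.98, 0.01, 0.01],   # confident -> not queried
    "cell_2": [0.40, 0.35, 0.25],   # very uncertain -> queried
    "cell_3": [0.55, 0.44, 0.01],   # torn between two classes -> queried
    "cell_4": [0.90, 0.05, 0.05],
}

def uncertainty(p):
    # Least-confidence score: 1 - max probability.
    # Entropy or margin between the top two classes are common alternatives.
    return 1.0 - max(p)

to_label = sorted(probs, key=lambda s: uncertainty(probs[s]), reverse=True)[:2]
print(to_label)   # -> ['cell_2', 'cell_3']
```

The failure mode from above is visible here too: a confidently *wrong* prediction looks just like cell_1 and is never queried.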
So, what works well for a given problem? Unfortunately, there is no good answer to that.
None of these methods are perfect, and they are far from full maturity.
However, these fields have been receiving a new surge of interest lately.
If you are interested, there is no better time to dig deep into this problem than now!
Creating a good dataset is often a huge bottleneck nowadays. We know much more about training models than about collecting data.
• • •
You ask me so often for free online resources about deep learning that I decided to collect my favorite courses!
These topics interest you the most:
🟩 practical deep learning,
🟩 deep learning theory,
🟩 math resources to understand the two above.
Let's see them!
🧵 👇🏽
1️⃣ Practical deep learning.
If you want to take a deep dive straight into the field and want to start training your models right away, hands down the best course for you out there is Practical Deep Learning for Coders by fast.ai. (course.fast.ai)
To move beyond training models and learn about tooling and infrastructure, IMO the best course for you is the Full Stack Deep Learning course by @full_stack_dl.
Have you ever thought about why neural networks are so powerful?
Why is it that no matter the task, you can find an architecture that knocks the problem out of the park?
One answer is that they can approximate any function with arbitrary precision!
Let's see how!
🧵 👇🏽
From a mathematical viewpoint, machine learning is function approximation.
If you are given data points 𝑥 with observations 𝑦, learning essentially means finding a function 𝑓 such that 𝑓(𝑥) approximates the observed 𝑦 values as accurately as possible.
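The simplest instance of this idea needs no neural network at all: fit 𝑓(𝑥) = 𝑎𝑥 + 𝑏 to noisy observations by minimizing the squared error. A minimal sketch with made-up data (closed-form least squares, no libraries):

```python
# Toy observations, roughly following y = 2x + 1 plus noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.0, 8.8]

# Closed-form least-squares solution for the slope a and intercept b.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

def f(x):
    return a * x + b   # the learned approximation of the data

print(round(a, 2), round(b, 2))   # close to the true slope 2 and intercept 1
```

A neural network does the same thing in spirit, only with a far more flexible family of candidate functions 𝑓.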
Approximation is a very natural idea in mathematics.
Let's see a simple example!
You probably know the exponential function well. Do you also know how to calculate it?
The definition itself doesn't really help you: raising 𝑒 to a power 𝑥 that is not an integer is hard to compute directly.
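One standard remedy is to approximate 𝑒ˣ with partial sums of its Taylor series, 1 + 𝑥 + 𝑥²/2! + 𝑥³/3! + …, which needs only multiplication and addition. A quick sketch:

```python
import math

def exp_approx(x, terms=20):
    """Partial sum of the Taylor series of e^x around 0."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= x / (n + 1)   # next term: x^(n+1) / (n+1)!
    return total

# The partial sum agrees closely with the library implementation.
print(exp_approx(1.0), math.exp(1.0))
```

Each extra term refines the approximation, exactly the "arbitrary precision" idea from above, just for one specific function.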