3 Automated Data Labeling Approaches

Manual labeling is expensive: it takes the time of engineers and subject matter experts (SMEs), or the cost of custom labeling platforms and crowdsourcing.
Unlabeled data, in contrast, is usually cheap and easy to obtain. Reducing the amount of manual labeling cuts costs and speeds up the iterative ML development workflow.
1. One approach to reducing manual labeling is semi-supervised learning (SSL)
If you have a small amount of human-labeled data and a large amount of unlabeled data, you can apply SSL algorithms. One example is label propagation, which spreads labels from labeled to unlabeled data points based on their similarity or local graph structure.
2. Active learning selects examples that will best help the model learn

What if you could label only the most informative samples of data?
Active learning is a family of "smart" sampling algorithms. These algorithms reduce labeling cost and can also be a better overall sampling strategy.

Two of the more popular active learning strategies are margin sampling and query by committee.
Margin sampling selects the data points the current model is most uncertain about, measured by the margin between its top two predicted class probabilities.
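Margin sampling can be sketched in a few lines, assuming a scikit-learn classifier with `predict_proba`; the dataset, seed-set size, and batch size of 10 are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Pool-based setup: a small labeled seed set and a large unlabeled pool.
X, y = make_classification(n_samples=500, n_informative=5, random_state=0)
labeled = np.arange(40)
pool = np.arange(40, 500)

# Fit the current model on whatever labels we have so far.
clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

# Margin = top predicted probability minus the runner-up; a small margin
# means the model is torn between classes, so those points get queried.
proba = clf.predict_proba(X[pool])
sorted_proba = np.sort(proba, axis=1)
margin = sorted_proba[:, -1] - sorted_proba[:, -2]
query_idx = pool[np.argsort(margin)[:10]]  # 10 most uncertain points to label next
```

The queried points go to a human labeler, the model is refit, and the loop repeats.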

Query by committee trains an ensemble of models and samples the data points that generate the most disagreement among them.
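A hedged sketch of query by committee, assuming scikit-learn: here the committee is built by bootstrapping the labeled set, and disagreement is measured as low agreement with the majority vote; both choices are illustrative, not the only way to do it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_informative=5, random_state=1)
labeled = np.arange(40)
pool = np.arange(40, 400)

# Committee: trees fit on bootstrap resamples of the labeled seed set.
rng = np.random.default_rng(0)
committee = []
for _ in range(5):
    boot = rng.choice(labeled, size=labeled.size, replace=True)
    committee.append(DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot]))

# Each member votes on every pool point; low agreement = high disagreement.
votes = np.stack([m.predict(X[pool]) for m in committee])  # shape (5, 360)
majority = np.round(votes.mean(axis=0))                    # binary majority vote
agreement = (votes == majority).mean(axis=0)
query_idx = pool[np.argsort(agreement)[:10]]  # 10 most contested points
```

Points where the committee splits 3-2 carry more information for the next round of labeling than points where all five models agree.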
3. Weak Supervision relies on algorithmic labeling
Algorithmic labeling is guaranteed to be noisy, but it's highly scalable. Weak supervision leverages noisy labels produced by rules defined by SMEs.

The method works with or without ground-truth labels and with one or more supervision sources.
These supervision sources are often heuristics that automate labeling. They can also be based on models or third-party data sources (known as distant supervision).
Snorkel is a popular open-source framework for weak supervision.

It works by expressing each supervision source as a labeling function (LF). Users write LFs to generate the noisy labels.
A generative model then weighs and combines the noisy labels into training labels.

A discriminative model is then trained on those labels; the resulting model can classify new, unlabeled data.
If you need to reduce labeling costs, check out these strategies:

1. Semi-Supervised Learning
2. Active Learning
3. Weak Supervision
