Manual labeling is expensive. It often requires the time of engineers and subject matter experts (SMEs), or the cost of custom labeling platforms and crowdsourcing.
Unlabeled data, in contrast, is usually cheap and easy to obtain. Reducing the amount of manual labeling can cut costs and speed up the iterative ML development workflow.
1. One approach to reducing manual labeling is semi-supervised learning
If you have a small amount of human-labeled data and a large amount of unlabeled data, you can apply semi-supervised learning (SSL) algorithms. An example is label propagation, which relies on the similarity between labeled and unlabeled data points, or on their local graph structure.
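As a minimal sketch of the idea, here is label propagation via scikit-learn's `LabelSpreading`, where unlabeled points are marked with `-1` (the dataset and the choice of 10 labeled points are illustrative):

```python
# Semi-supervised learning sketch: propagate 10 hand-labels across a
# similarity graph built over all 200 points.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Pretend we could only afford to label 10 points; mark the rest as -1,
# scikit-learn's convention for "unlabeled".
y_partial = np.full_like(y, -1)
labeled_idx = np.random.RandomState(0).choice(len(y), size=10, replace=False)
y_partial[labeled_idx] = y[labeled_idx]

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

# transduction_ holds the propagated labels for every point, including
# the ones we never labeled by hand.
accuracy = (model.transduction_ == y).mean()
print(f"accuracy of propagated labels: {accuracy:.2f}")
```

The knn kernel works well here because neighboring points on each moon share a class; for other data, the rbf kernel or a different neighborhood size may fit better.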
2. Active learning selects examples that will best help the model learn
What if you could label only the most informative samples of data?
Active learning is a family of "smart" sampling algorithms. These algorithms reduce the cost of labeling and can also be an overall better sampling strategy.
Two of the more popular active learning strategies are margin sampling and query by committee.
Margin sampling selects the data points about which the current model is most uncertain: those where the gap between the top two predicted class probabilities is smallest.
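A minimal sketch of margin sampling (the function and parameter names are illustrative, not from a library):

```python
# Margin sampling: query the points where the gap between the model's
# top two predicted class probabilities is smallest.
import numpy as np

def pick_queries(probs: np.ndarray, n_queries: int) -> np.ndarray:
    """probs: (n_samples, n_classes) predicted probabilities."""
    ordered = np.sort(probs, axis=1)           # ascending per row
    margins = ordered[:, -1] - ordered[:, -2]  # top-1 minus top-2 probability
    return np.argsort(margins)[:n_queries]     # smallest margins first

probs = np.array([
    [0.90, 0.05, 0.05],  # confident prediction -> large margin
    [0.40, 0.35, 0.25],  # uncertain -> small margin
    [0.50, 0.48, 0.02],  # most uncertain -> smallest margin
])
print(pick_queries(probs, n_queries=2))  # -> [2 1]
```

In practice `probs` would come from your current model's `predict_proba` on the unlabeled pool, and the returned indices are the examples you send to human labelers.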
Query by committee uses a trained ensemble of models and samples the data points that generate the most disagreement among them.
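A sketch of query by committee using bootstrap-trained trees and vote entropy as the disagreement measure (the dataset, committee size, and helper names are illustrative):

```python
# Query-by-committee: train an ensemble on bootstrap resamples, then
# query the points with the highest vote entropy (most disagreement).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.RandomState(0)

# Committee: the same model class trained on different bootstrap resamples.
committee = []
for _ in range(5):
    idx = rng.choice(len(X), size=len(X), replace=True)
    committee.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

votes = np.stack([m.predict(X) for m in committee])  # (n_models, n_samples)

def vote_entropy(votes: np.ndarray, n_classes: int) -> np.ndarray:
    """Entropy of the committee's vote distribution per sample."""
    ent = np.zeros(votes.shape[1])
    for c in range(n_classes):
        p = (votes == c).mean(axis=0)       # fraction of models voting c
        nz = p > 0
        ent[nz] -= p[nz] * np.log(p[nz])
    return ent

disagreement = vote_entropy(votes, n_classes=2)
to_label = np.argsort(disagreement)[::-1][:10]  # 10 most contested points
print(to_label)
```

Unanimous points get entropy 0 and are never queried; a 3-vs-2 split maximizes entropy for a two-class, five-model committee.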
3. Weak supervision relies on algorithmic labeling
Algorithmic labeling is guaranteed to be noisy, but it's highly scalable. Weak supervision leverages noisy labels produced by rules defined by SMEs.
The method works with or without ground truth labels and with one or more supervision sources.
These weak supervision sources are often heuristics that can automate labeling, but they can also be based on models or third-party data sources (distant supervision).
Snorkel is a popular open-source framework for weak supervision.
It works by expressing the supervision sources as labeling functions (LFs). Users write the LFs to generate the noisy labels.
A generative model is then used to weigh and combine the noisy labels.
Finally, a discriminative model is trained on the combined labels. That model can classify or label new unlabeled data.
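To make the labeling-function idea concrete, here is a plain-Python sketch of a toy spam task. The heuristics and a simple majority vote stand in for the generative label model a framework like Snorkel would fit; all names are illustrative.

```python
# Weak supervision sketch: labeling functions emit a noisy vote or
# abstain; a majority vote combines them (a stand-in for Snorkel's
# generative label model).
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):  # heuristic rule an SME might write
    return SPAM if "http" in text else ABSTAIN

def lf_all_caps(text):
    return SPAM if text.isupper() else ABSTAIN

def lf_short_greeting(text):
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

LFS = [lf_contains_link, lf_all_caps, lf_short_greeting]

def weak_label(text):
    votes = [lf(text) for lf in LFS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # no rule fired
    return Counter(votes).most_common(1)[0][0]

print(weak_label("hello, lunch tomorrow?"))        # -> 0 (HAM)
print(weak_label("CLICK http://win.example.com"))  # -> 1 (SPAM)
```

The weak labels produced this way (minus the abstains) would then serve as the training set for the downstream discriminative model.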
If you need to reduce labeling costs, check out these strategies:
1. Semi-supervised learning
2. Active learning
3. Weak supervision
Here's how to create a simple property graph in Supabase
Property graphs are a great way to model highly relational data. They make relationships easy to understand and, therefore, easy to query. Down the road, you can use the data for graph-powered machine learning.
For example, you could query the data and load it into Python libraries like networkx and stellargraph. Or, if the dataset is larger, use graph databases like neo4j or memgraph.
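As a minimal sketch of the property-graph model in networkx, nodes and edges carry arbitrary key/value properties (the people, companies, and values here are made up):

```python
# A tiny property graph: typed nodes and a typed, property-carrying edge.
import networkx as nx

g = nx.DiGraph()

# Nodes with properties
g.add_node("alice", label="Person", age=34)
g.add_node("acme", label="Company", industry="software")

# A typed relationship with its own properties
g.add_edge("alice", "acme", type="WORKS_AT", since=2019)

# Query: who works where, and since when?
for src, dst, props in g.edges(data=True):
    if props["type"] == "WORKS_AT":
        print(f"{src} -> {dst} since {props['since']}")
```

In a relational backend like Supabase, the same shape typically maps to a nodes table and an edges table, which you would query and load into a graph like this one.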
Although you can use a graph database natively as a standalone backend, graph databases can be expensive, and for many day-to-day query operations they are overkill.