Simple trick to improve weak supervision: prune your training data! Our embedding-aware pruning method can boost the accuracy of weak supervision pipelines by up to 19%, and it takes only a few lines of code!
Come by #NeurIPS22 poster 611 today at 4pm to hear more, or read on 🧵
Most existing weak supervision setups (Snorkel, etc.) use all the weakly-labeled data to train a classifier. But there's an intuitive tradeoff between coverage and accuracy of the weak labels. If we cover *less* training data w/ higher accuracy, do we get a more accurate model?
Our theory and experiments say yes! We use a pruning method based on the "cut statistic" (Muhlenbach et al. 2004), which groups examples by their embeddings and picks examples w/ the least noisy neighborhoods. Intuitively, homogeneous regions of embedding space are more likely to be correctly labeled.
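Here's a minimal, illustrative sketch of a cut-statistic-style ranking (not the exact code from our repo; the unit-weight kNN graph, the binomial null model, and names like `X_emb`/`y_weak` are simplifying assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cut_statistic_scores(X_emb, y_weak, k=20):
    """Lower score => a point's neighborhood agrees with its weak label more than chance."""
    y_weak = np.asarray(y_weak)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_emb)
    _, idx = nn.kneighbors(X_emb)        # idx[:, 0] is (usually) the point itself
    neighbors = idx[:, 1:]               # (n, k) neighbor indices

    # Count "cut" edges: neighbors whose weak label disagrees with this point's label.
    disagree = y_weak[neighbors] != y_weak[:, None]   # (n, k) boolean
    J = disagree.sum(axis=1).astype(float)

    # Null model: each neighbor's label drawn i.i.d. from the empirical class marginals.
    classes, counts = np.unique(y_weak, return_counts=True)
    p = counts / counts.sum()
    p_same = p[np.searchsorted(classes, y_weak)]      # P(random neighbor agrees with y_i)
    mu = k * (1.0 - p_same)
    sigma = np.sqrt(k * p_same * (1.0 - p_same)) + 1e-12

    # Standardized cut statistic: rank ascending and keep the lowest-scoring points.
    return (J - mu) / sigma
```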
Selecting a good subset of the weakly-labeled training data robustly improves performance. This method works with *any* label model---Snorkel, FlyingSquid, majority vote, etc.---and the effect is consistent across a bunch of datasets, classifiers, and label models.
Adding this method to your existing weak supervision pipeline is easy! Check out our code here: github.com/hunterlang/wea…
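Roughly where pruning slots into a pipeline (toy data and names are illustrative, and `cut_statistic_scores` is the sketch above, not the repo's actual API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_emb = rng.normal(size=(1000, 32))       # toy embeddings (placeholder for real ones)
X_feat = X_emb                            # toy: reuse embeddings as classifier features
y_weak = rng.integers(0, 2, size=1000)    # toy weak labels from *any* label model

scores = cut_statistic_scores(X_emb, y_weak, k=20)   # helper sketched above
keep = np.argsort(scores)[: len(scores) // 2]        # keep the "cleanest" half

# Train the end classifier only on the pruned subset of weakly-labeled data.
clf = LogisticRegression(max_iter=1000).fit(X_feat[keep], y_weak[keep])
```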
Theoretically, we extend a classic analysis of Blum and Mitchell (1998) to characterize the coverage/precision tradeoff for a special case where the features used to create weak labels and the features used for the classifier are conditionally independent given the true label.
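In symbols, that co-training-style assumption (Blum & Mitchell 1998) reads:

```latex
% View x_1 (used to produce the weak labels) and view x_2 (used by the classifier)
% are conditionally independent given the true label y.
\[
  P(x_1, x_2 \mid y) \;=\; P(x_1 \mid y)\, P(x_2 \mid y)
\]
```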
This is a pretty standard bound---the main novelty is a uniform convergence result for *balanced error.* More work on these theoretical questions with more general assumptions is coming from us soon! Joint work w/ my advisor @david_sontag and Aravindan Vijayaraghavan.