hunter · Nov 29 · 7 tweets · 3 min read
Simple trick to improve weak supervision: prune your training data! Our embedding-aware pruning method can boost the accuracy of weak supervision pipelines by up to 19%, and it takes only a few lines of code!

Come by #NeurIPS22 poster 611 today at 4pm to hear more, or read on 🧵
Most existing weak supervision setups (Snorkel, etc.) use all the weakly-labeled data to train a classifier. But there's an intuitive tradeoff between coverage and accuracy of the weak labels. If we cover *less* training data w/ higher accuracy, do we get a more accurate model?
Our theory and experiments say yes! We use a pruning method based on the "cut statistic" (Muhlenbach et al. 2004), which clusters examples by their embedding and picks examples w/ the least noisy neighborhoods. Intuitively, homogeneous regions of embedding space are more likely to be correctly labeled.
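As a rough illustration of the neighborhood idea (not our exact implementation; see the repo for the real cut statistic), a pruning step along these lines could look like:

```python
import numpy as np

def prune_by_neighborhood_agreement(embeddings, weak_labels, k=10, keep_frac=0.5):
    # Keep the examples whose k nearest neighbors in embedding space
    # most often share their weak label (least-noisy neighborhoods).
    # `keep_frac` is a hypothetical knob, not a parameter from the paper.
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(weak_labels)
    n = len(y)
    # Brute-force pairwise squared distances; fine for a sketch.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # never count a point as its own neighbor
    knn = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest neighbors
    # Fraction of neighbors whose weak label disagrees with this example's.
    disagreement = (y[knn] != y[:, None]).mean(axis=1)
    n_keep = max(1, int(keep_frac * n))
    return np.sort(np.argsort(disagreement)[:n_keep])
```

The plain disagreement fraction here is a simplification: the actual cut statistic works on a weighted neighbor graph and standardizes the observed cut weight against what random labeling would produce, so treat this as the intuition rather than the method.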
Selecting a good subset of the weakly-labeled training data robustly improves performance. This method works with *any* label model---Snorkel, FlyingSquid, majority vote, etc---and the effect is consistent across a bunch of datasets, classifiers, and label models.
Adding this method to your existing weak supervision pipeline is easy! Check out our code here: github.com/hunterlang/wea…
Theoretically, we extend a classic analysis of Blum and Mitchell (1998) to characterize the coverage/precision tradeoff for a special case where the features used to create weak labels and the features used for the classifier are conditionally independent given the true label.
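In symbols (notation mine, with $x_1$ the features behind the weak labels and $x_2$ the classifier's features), the special case assumes the two views factor given the true label $y$:

```latex
% x_1: features used to create the weak labels
% x_2: features used by the downstream classifier
P(x_1, x_2 \mid y) = P(x_1 \mid y)\, P(x_2 \mid y)
% i.e., x_1 and x_2 are conditionally independent given y
```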
This is a pretty standard bound---the main novelty is a uniform convergence result for *balanced error.* More work on these theoretical questions with more general assumptions is coming from us soon! Joint work w/ my advisor @david_sontag and Aravindan Vijayaraghavan.
