Simple trick to improve weak supervision: prune your training data! Our embedding-aware pruning method can boost the accuracy of weak supervision pipelines by up to 19%, and it takes only a few lines of code!
Come by #NeurIPS22 poster 611 today at 4pm to hear more, or read on 🧵
Most existing weak supervision setups (Snorkel, etc.) use all the weakly-labeled data to train a classifier. But there's an intuitive tradeoff between coverage and accuracy of the weak labels. If we cover *less* training data w/ higher accuracy, do we get a more accurate model?
Our theory and experiments say yes! We use a pruning method based on the "cut statistic" (Muhlenbach et al. 2004), which groups examples by their embeddings and picks examples w/ the least noisy neighborhoods. Intuitively, homogeneous regions of embedding space are more likely to be correctly labeled.
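Here's a minimal, illustrative sketch of a cut-statistic-style ranking (not the exact code from our repo; the unit-weight kNN graph, the binomial null model, and names like `X_emb`/`y_weak` are simplifying assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cut_statistic_scores(X_emb, y_weak, k=20):
    """Lower score => a point's neighborhood agrees with its weak label more than chance."""
    y_weak = np.asarray(y_weak)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_emb)
    _, idx = nn.kneighbors(X_emb)        # idx[:, 0] is (usually) the point itself
    neighbors = idx[:, 1:]               # (n, k) neighbor indices

    # Count "cut" edges: neighbors whose weak label disagrees with this point's label.
    disagree = y_weak[neighbors] != y_weak[:, None]   # (n, k) boolean
    J = disagree.sum(axis=1).astype(float)

    # Null model: each neighbor's label drawn i.i.d. from the empirical class marginals.
    classes, counts = np.unique(y_weak, return_counts=True)
    p = counts / counts.sum()
    p_same = p[np.searchsorted(classes, y_weak)]      # P(random neighbor agrees with y_i)
    mu = k * (1.0 - p_same)
    sigma = np.sqrt(k * p_same * (1.0 - p_same)) + 1e-12

    # Standardized cut statistic: rank ascending and keep the lowest-scoring points.
    return (J - mu) / sigma
```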
Selecting a good subset of the weakly-labeled training data robustly improves performance. This method works with *any* label model---Snorkel, FlyingSquid, majority vote, etc.---and the effect is consistent across a bunch of datasets, classifiers, and label models.
Adding this method to your existing weak supervision pipeline is easy! Check out our code here: github.com/hunterlang/wea…
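Roughly where pruning slots into a pipeline (toy data and names are illustrative, and `cut_statistic_scores` is the sketch above, not the repo's actual API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_emb = rng.normal(size=(1000, 32))       # toy embeddings (placeholder for real ones)
X_feat = X_emb                            # toy: reuse embeddings as classifier features
y_weak = rng.integers(0, 2, size=1000)    # toy weak labels from *any* label model

scores = cut_statistic_scores(X_emb, y_weak, k=20)   # helper sketched above
keep = np.argsort(scores)[: len(scores) // 2]        # keep the "cleanest" half

# Train the end classifier only on the pruned subset of weakly-labeled data.
clf = LogisticRegression(max_iter=1000).fit(X_feat[keep], y_weak[keep])
```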
Theoretically, we extend a classic analysis of Blum and Mitchell (1998) to characterize the coverage/precision tradeoff for a special case where the features used to create weak labels and the features used for the classifier are conditionally independent given the true label.
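In symbols, that co-training-style assumption (Blum & Mitchell 1998) reads:

```latex
% View x_1 (used to produce the weak labels) and view x_2 (used by the classifier)
% are conditionally independent given the true label y.
\[
  P(x_1, x_2 \mid y) \;=\; P(x_1 \mid y)\, P(x_2 \mid y)
\]
```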
This is a pretty standard bound---the main novelty is a uniform convergence result for *balanced error.* More work on these theoretical questions with more general assumptions is coming from us soon! Joint work w/ my advisor @david_sontag and Aravindan Vijayaraghavan.