Eventually, if you train repeatedly on synthetic data generated by a single model, you end up generating gibberish.
This is because of repeated sampling of the mode of the distribution: you lose the long tail. It is also why sampling synthetic data can amplify bias.
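To make this concrete, here is a toy simulation (my own illustration, not from any of the papers below): repeatedly fit a "model" to its own slightly mode-seeking samples, and the spread of the distribution collapses generation by generation.

```python
# Toy sketch of distribution collapse from training on your own synthetic data.
# Assumes a 1-D Gaussian "world model"; temperature < 1 stands in for mode-seeking sampling.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0          # the "real" data distribution
temperature = 0.9             # < 1.0 means generations favor the mode

for generation in range(15):
    # the current model generates synthetic data, slightly sharpened toward the mode
    synthetic = rng.normal(mu, temperature * sigma, size=10_000)
    # the next model is fit purely on that synthetic data
    mu, sigma = synthetic.mean(), synthetic.std()
    print(f"gen {generation:2d}: sigma = {sigma:.3f}")  # shrinks every round: the long tail disappears
```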
But what if you sample strategically, either by constraining the criteria for sampling or by expanding to multiple teachers? Two of our recent works look at synthetic data through this lens.
The results suggest the road ahead for synthetic data is far more promising.
In "LLM see, LLM do" we call this active inheritance and use it to optimize in the data space toward non-differentiable objectives.
Arbitrage 📈 will likely benefit any specialized domain where we don't expect a single model to perform well across all parts of the distribution we care about.
This work starts to directly address the question: "Can you avoid model collapse when you rely on synthetic data?"
By sampling strategically -- you avoid overfitting to the limitations of any single teacher, and we dramatically outperform single teachers.
This also suggests model collapse can be avoided outside of the narrow setting where you repeatedly train on the outputs of a single teacher.
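A toy sketch of the general idea (not the actual pipeline from either paper): pool candidate generations from several teachers and keep, per prompt, whichever one best serves a target metric -- the metric itself can be non-differentiable. The `teachers` and `score` here are hypothetical stand-ins for real models and a real objective.

```python
# Sketch: build a training set by selecting, per prompt, the best teacher completion
# under some scoring function. Everything here is a placeholder for illustration.
from typing import Callable, Dict, List

def build_dataset(prompts: List[str],
                  teachers: Dict[str, Callable[[str], str]],
                  score: Callable[[str], float]) -> List[dict]:
    """Select the highest-scoring teacher completion for each prompt."""
    dataset = []
    for prompt in prompts:
        candidates = {name: generate(prompt) for name, generate in teachers.items()}
        best = max(candidates, key=lambda name: score(candidates[name]))
        dataset.append({"prompt": prompt,
                        "completion": candidates[best],
                        "teacher": best})
    return dataset

# Toy usage: two fake "teachers" and a deliberately crude metric (longer = better).
toy_teachers = {"teacher_a": lambda p: p + " short answer.",
                "teacher_b": lambda p: p + " a longer, more detailed answer."}
print(build_dataset(["Explain overfitting:"], toy_teachers, score=len))
```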
Overall -- I am most excited that we are now moving as a field to optimizing in the data space.
Historically, high-quality data has been costly to curate, which has precluded adapting training sets “on-the-fly” to target new properties. Now, we can steer in the data space.
If you made it this far, read the excellent article by @alisonmsnyder for @axios covering these different perspectives on whether there is a ceiling to progress using synthetic data.
How do you distinguish between sources of uncertainty?
This is important because the downstream remedies for atypical and noisy examples are very different.
Two of our workshop papers explore this from different perspectives.
At the Subset ML workshop tomorrow, Neil Hu and Xinyu Hu explore where simply prioritizing challenging examples fails -- motivating a more nuanced distinction between sources of uncertainty.
Work on memorization and variance of gradients (VoG) shows that hard examples are learnt later in training, and that learning rates impact what is learnt.
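For intuition, here is a simplified sketch of a VoG-style score (my paraphrase of the idea, not the reference implementation): take the gradient of the true-class logit with respect to the input at several training checkpoints, and measure how much it varies across training.

```python
# Simplified variance-of-gradients (VoG) style score, assuming you have a list of
# saved model checkpoints from different stages of training. Placeholder setup.
import torch

def vog_score(checkpoints, x, label):
    """Variance across checkpoints of d(logit_label)/d(input), averaged over input dims."""
    grads = []
    for model in checkpoints:
        model.eval()
        x_req = x.clone().detach().requires_grad_(True)
        logit = model(x_req.unsqueeze(0))[0, label]      # pre-softmax score of the true class
        grads.append(torch.autograd.grad(logit, x_req)[0])
    stacked = torch.stack(grads)                          # [num_checkpoints, *input_shape]
    return stacked.var(dim=0, unbiased=False).mean().item()
```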
At face value, deep neural network pruning appears to promise you can (almost) have it all -- remove the majority of weights with minimal degradation to top-1 accuracy. In this work, we explore this trade-off by asking whether certain classes are disproportionately impacted.
We find that pruning is better described as "selective brain damage" -- performance on a tiny subset of classes and images is cannibalized in order to preserve overall performance. The interesting part is what makes certain images more likely to be forgotten...
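The kind of audit behind this finding can be sketched in a few lines (placeholder model, loader, and sparsity level, not the paper's exact setup): compare per-class accuracy of a dense network against a magnitude-pruned copy, and look at which classes absorb the damage.

```python
# Sketch: measure which classes pay for global magnitude pruning.
import torch
import torch.nn.utils.prune as prune
from collections import defaultdict

def per_class_accuracy(model, loader, device="cpu"):
    correct, total = defaultdict(int), defaultdict(int)
    model.eval().to(device)
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            for y, p in zip(labels.tolist(), preds.tolist()):
                total[y] += 1
                correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in total}

def magnitude_prune(model, amount=0.9):
    # globally remove the smallest-magnitude weights from all conv / linear layers
    params = [(m, "weight") for m in model.modules()
              if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=amount)
    return model

# Usage idea: with your own dense_model and test_loader, prune a deep copy, compute
# per_class_accuracy for both, and sort classes by the size of the accuracy drop.
```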