I lead @CohereForAI. Formerly Research @Google Brain @GoogleDeepmind. ML Efficiency at scale, LLMs, @trustworthy_ml. Changing spaces where breakthroughs happen.
Oct 4, 2024 • 11 tweets • 3 min read
One of the biggest open questions is what is the limit of synthetic data.
Does training on synthetic data lead to mode collapse?
Or is there a path forward that could outperform current models?
What is missing from this conversation is that the success of synthetic data hinges on how you optimize in the data space.
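A toy sketch of the mode-collapse concern (illustrative only, not from any of the papers mentioned): if each generation of a model is fit to samples drawn from the previous generation, estimation error compounds and the learned distribution tends to lose spread over time.

```python
import numpy as np

# Hypothetical illustration: repeatedly refit a Gaussian "model" on
# synthetic samples drawn from the previous generation's fit. Finite-sample
# estimation biases the spread downward each round, so the distribution
# gradually collapses -- a toy analogue of model collapse from training
# on synthetic data.
rng = np.random.default_rng(0)

mean, std = 0.0, 1.0  # the "real" data distribution
stds = [std]
for generation in range(100):
    synthetic = rng.normal(mean, std, size=25)       # sample from current model
    mean, std = synthetic.mean(), synthetic.std()    # refit on synthetic data
    stds.append(std)

print(f"initial std: {stds[0]:.2f}, after 100 generations: {stds[-1]:.4f}")
```

The shrinkage per generation is small, but it compounds; whether real pipelines escape this depends on how much fresh (or carefully curated) data re-enters the loop, which is exactly the optimization-in-data-space question above.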
A few recent papers highlight this tension well. On the dangers of synthetic data, there is an excellent paper recently released in Nature.
How do you distinguish between sources of uncertainty?
This is important because the downstream remedies for atypical and noisy examples are very different.
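One standard way to make that distinction concrete (a generic ensemble-based sketch, not the specific method of the workshop papers): decompose predictive uncertainty into an aleatoric part (expected entropy of each ensemble member, high for noisy examples) and an epistemic part (disagreement between members, high for atypical examples).

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats, with a small epsilon for numerical safety."""
    return -np.sum(p * np.log(p + 1e-12), axis=axis)

def decompose_uncertainty(ensemble_probs):
    """ensemble_probs: (n_members, n_examples, n_classes).

    total      = entropy of the ensemble-mean prediction
    aleatoric  = mean entropy of individual members (irreducible noise)
    epistemic  = total - aleatoric (member disagreement / mutual information)
    """
    mean_p = ensemble_probs.mean(axis=0)
    total = entropy(mean_p)
    aleatoric = entropy(ensemble_probs).mean(axis=0)
    return aleatoric, total - aleatoric

# A "noisy" example: all 4 members agree on a flat distribution.
noisy = np.tile([[0.5, 0.5]], (4, 1, 1))
# An "atypical" example: members are individually confident but disagree.
atypical = np.array([[[0.95, 0.05]], [[0.05, 0.95]],
                     [[0.95, 0.05]], [[0.05, 0.95]]])

alea_n, epis_n = decompose_uncertainty(noisy)
alea_a, epis_a = decompose_uncertainty(atypical)
print(f"noisy:    aleatoric={alea_n[0]:.2f}, epistemic={epis_n[0]:.2f}")
print(f"atypical: aleatoric={alea_a[0]:.2f}, epistemic={epis_a[0]:.2f}")
```

The two examples have the same total uncertainty, but the decomposition separates them, which matters because the remedies differ: noisy examples call for label cleaning or loss reweighting, atypical ones for more data or capacity.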
Two of our workshop papers explore this from different perspectives.
In the subset ML workshop tomorrow, Neil Hu and Xinyu Hu explore where simply prioritizing challenging examples fails -- motivating a more nuanced distinction between sources of uncertainty.
Very excited to share our recent work with Aaron Courville, Yann Dauphin and @DreFrome:
weightpruningdamage.github.io
At face value, deep neural network pruning appears to promise you can (almost) have it all — remove the majority of weights with minimal degradation to top-1 accuracy. In this work, we explore this trade-off by asking whether certain classes are disproportionately impacted.
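A minimal sketch of that framing (toy numbers chosen for illustration, not results from the paper): global magnitude pruning keeps the largest-magnitude weights, so aggregate top-1 accuracy can look nearly unchanged while a minority class whose signal lives in small weights is wiped out.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights."""
    k = int(round(w.size * sparsity))
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

# Toy linear classifier: class 0 relies on a large weight, class 1 on a
# small one; class 0 makes up 90% of the evaluation set.
W = np.array([[1.0, 0.0],
              [0.0, 0.1]])
X = np.array([[1.0, 0.0]] * 90 + [[0.0, 1.0]] * 10)
y = np.array([0] * 90 + [1] * 10)

for sparsity in (0.0, 0.75):
    pred = np.argmax(X @ magnitude_prune(W, sparsity).T, axis=1)
    top1 = (pred == y).mean()
    per_class = [(pred[y == c] == c).mean() for c in (0, 1)]
    print(f"sparsity={sparsity}: top-1={top1:.2f}, per-class={per_class}")
```

At 75% sparsity the toy model's top-1 accuracy only drops from 1.00 to 0.90, yet class 1 accuracy falls to zero: the average hides a disproportionate impact on one class, which is exactly the question the paper asks about real networks.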