Eventually, if you train repeatedly on synthetic data generated by a single model, you end up generating gibberish.
This is because of repeated sampling of the mode of the distribution: you lose the long tail. It is also why sampling synthetic data can amplify bias.
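To make this concrete, here is a toy simulation (my own illustration, not from any of the papers below): repeatedly fit a "model" to its own slightly mode-seeking samples, and the spread of the distribution collapses generation by generation.

```python
# Toy sketch of distribution collapse from training on your own synthetic data.
# Assumes a 1-D Gaussian "world model"; temperature < 1 stands in for mode-seeking sampling.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0          # the "real" data distribution
temperature = 0.9             # < 1.0 means generations favor the mode

for generation in range(15):
    # the current model generates synthetic data, slightly sharpened toward the mode
    synthetic = rng.normal(mu, temperature * sigma, size=10_000)
    # the next model is fit purely on that synthetic data
    mu, sigma = synthetic.mean(), synthetic.std()
    print(f"gen {generation:2d}: sigma = {sigma:.3f}")  # shrinks every round: the long tail disappears
```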
But what if you sample strategically, either by constraining the criteria for sampling or by expanding to multiple teachers? Two of our recent works look at synthetic data through this lens.
The results suggest the road ahead for synthetic data is far more promising.
In "LLM see, LLM do" we call this active inheritance and use it to optimize in the data space toward non-differentiable objectives.
Arbitrage 📈 will likely benefit any specialized domain where we don't expect a single model to perform well across all parts of the distribution we care about.
This work starts to directly address the question: "Can you avoid model collapse when you rely on synthetic data?"
By sampling strategically -- you avoid overfitting to the limitations of any single teacher, and we dramatically outperform single teachers.
This also suggests model collapse can be avoided outside of the narrow setting where you repeatedly train on the outputs of a single teacher.
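A toy sketch of the general idea (not the actual pipeline from either paper): pool candidate generations from several teachers and keep, per prompt, whichever one best serves a target metric -- the metric itself can be non-differentiable. The `teachers` and `score` here are hypothetical stand-ins for real models and a real objective.

```python
# Sketch: build a training set by selecting, per prompt, the best teacher completion
# under some scoring function. Everything here is a placeholder for illustration.
from typing import Callable, Dict, List

def build_dataset(prompts: List[str],
                  teachers: Dict[str, Callable[[str], str]],
                  score: Callable[[str], float]) -> List[dict]:
    """Select the highest-scoring teacher completion for each prompt."""
    dataset = []
    for prompt in prompts:
        candidates = {name: generate(prompt) for name, generate in teachers.items()}
        best = max(candidates, key=lambda name: score(candidates[name]))
        dataset.append({"prompt": prompt,
                        "completion": candidates[best],
                        "teacher": best})
    return dataset

# Toy usage: two fake "teachers" and a deliberately crude metric (longer = better).
toy_teachers = {"teacher_a": lambda p: p + " short answer.",
                "teacher_b": lambda p: p + " a longer, more detailed answer."}
print(build_dataset(["Explain overfitting:"], toy_teachers, score=len))
```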
Overall -- I am most excited that we are now moving as a field to optimizing in the data space.
Historically, high-quality data has been costly to curate, which has precluded adapting training sets “on-the-fly” to target new properties. Now, we can steer in the data space.
If you made it this far, read the excellent article by @alisonmsnyder for @axios covering these different perspectives on whether there is a ceiling to progress using synthetic data.
How do you distinguish between sources of uncertainty?
This is important because the downstream remedies for atypical and noisy examples are very different.
Two of our workshop papers explore this from different perspectives.
At the Subset ML workshop tomorrow, Neil Hu and Xinyu Hu explore where simply prioritizing challenging examples fails -- motivating a more nuanced distinction between sources of uncertainty.
Work on memorization and variance of gradients (VoG) shows that hard examples are learnt later in training, and that learning rates impact what is learnt.
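For intuition, here is a simplified sketch of a VoG-style score (my paraphrase of the idea, not the reference implementation): take the gradient of the true-class logit with respect to the input at several training checkpoints, and measure how much it varies across training.

```python
# Simplified variance-of-gradients (VoG) style score, assuming you have a list of
# saved model checkpoints from different stages of training. Placeholder setup.
import torch

def vog_score(checkpoints, x, label):
    """Variance across checkpoints of d(logit_label)/d(input), averaged over input dims."""
    grads = []
    for model in checkpoints:
        model.eval()
        x_req = x.clone().detach().requires_grad_(True)
        logit = model(x_req.unsqueeze(0))[0, label]      # pre-softmax score of the true class
        grads.append(torch.autograd.grad(logit, x_req)[0])
    stacked = torch.stack(grads)                          # [num_checkpoints, *input_shape]
    return stacked.var(dim=0, unbiased=False).mean().item()
```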
At face value, deep neural network pruning appears to promise you can (almost) have it all -- remove the majority of weights with minimal degradation to top-1 accuracy. In this work, we explore this trade-off by asking whether certain classes are disproportionately impacted.
We find that pruning is better described as "selective brain damage" -- performance on a tiny subset of classes and images is cannibalized in order to preserve overall performance. The interesting part is what makes certain images more likely to be forgotten...
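The kind of audit behind this finding can be sketched in a few lines (placeholder model, loader, and sparsity level, not the paper's exact setup): compare per-class accuracy of a dense network against a magnitude-pruned copy, and look at which classes absorb the damage.

```python
# Sketch: measure which classes pay for global magnitude pruning.
import torch
import torch.nn.utils.prune as prune
from collections import defaultdict

def per_class_accuracy(model, loader, device="cpu"):
    correct, total = defaultdict(int), defaultdict(int)
    model.eval().to(device)
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            for y, p in zip(labels.tolist(), preds.tolist()):
                total[y] += 1
                correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in total}

def magnitude_prune(model, amount=0.9):
    # globally remove the smallest-magnitude weights from all conv / linear layers
    params = [(m, "weight") for m in model.modules()
              if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=amount)
    return model

# Usage idea: with your own dense_model and test_loader, prune a deep copy, compute
# per_class_accuracy for both, and sort classes by the size of the accuracy drop.
```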