Sara Hooker
Apr 30 · 13 tweets
It is critical for scientific integrity that we trust our measure of progress.

@lmarena_ai has become the go-to evaluation for AI progress.

Our release today demonstrates how difficult it is to maintain fair evaluations on @lmarena_ai, despite the best of intentions.
We spent 5 months analyzing 2.8M battles on the Arena, covering 238 models across 43 providers.

We show that preferential policies, taken advantage of by a handful of providers, lead to overfitting to Arena-specific metrics rather than genuine AI progress.
@lmarena_ai has an unspoken policy of hidden testing that benefits a small subset of providers.

Providers can choose which score to disclose and retract all others.

At the extreme, we see up to 27 models tested in the lead-up to a single release.
There is no reasonable scientific justification for this practice.

Being able to choose the best score to disclose enables systematic gaming of the Arena score.

This advantage grows with the number of variants tested, and grows further when other providers don't know they can also test privately.
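To see why this matters statistically, here is a toy simulation (mine, not from the paper): N private variants with identical true skill, where only the best measured score is disclosed. The noise level is an assumption purely for illustration; the inflation is pure selection bias.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_SCORE = 1200.0   # every variant has identical underlying skill
NOISE_STD = 15.0      # assumed std of an Arena score estimate from finite battles
N_TRIALS = 10_000     # number of simulated release cycles

for n_variants in (1, 3, 10, 27):
    # Each private variant's measured score = true skill + estimation noise.
    measured = TRUE_SCORE + NOISE_STD * rng.standard_normal((N_TRIALS, n_variants))
    # The provider discloses only the best-scoring variant and retracts the rest.
    disclosed = measured.max(axis=1)
    lift = disclosed.mean() - TRUE_SCORE
    print(f"{n_variants:>2} variants tested -> disclosed score inflated by ~{lift:.1f} points")
```

With 27 variants, the expected maximum of the noise draws sits roughly two standard deviations above the mean, so the disclosed score is inflated by about 30 points under this noise assumption even though no variant is actually better.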
This has to be very explicit -- continuing with the current practice of:

1) allowing only some providers unlimited private tests, and

2) allowing them to retract scores

amounts to our community accepting a practice that we learn in our intro to ML classes is unacceptable.

We must do better.
We also observe large differences in Arena data access.

@lmarena_ai is an open community resource that provides free feedback, yet 61.3% of all data goes to proprietary model providers.
These data differences stem from a few key policies that benefit a handful of providers:

1) proprietary models are sampled at higher rates, so they appear in more battles 📶
2) open-weights and open-source models are removed from the Arena more often
3) some providers are allowed to test many private variants
The differences in sampling rates are actually what started this project.

Aya Expanse is an open-weights model we released last year, and last November we couldn't figure out why it was sampled far less than other models.
Our recommendation here is simple.

The organizers themselves proposed a well-motivated active sampling rule that directs battles to where votes are most needed.

We found this was not implemented in practice. One of our core recommendations is to return to it.
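For readers who have not seen active sampling before, here is a minimal sketch of the general idea: route new battles to the model pairs whose win rate is still most uncertain. The specific weighting below (posterior standard deviation of the pairwise win rate under a uniform prior) is an illustrative choice of mine, not necessarily the rule the organizers proposed.

```python
import numpy as np

rng = np.random.default_rng(0)

def pick_battle_pair(wins, battles):
    """Choose the next model pair to show to voters.

    wins[i, j]    : observed wins of model i over model j
    battles[i, j] : total battles played between i and j (symmetric)

    Pairs whose win rate is still poorly estimated are sampled more often,
    so votes go where they reduce uncertainty the most.
    """
    n = battles.shape[0]
    pairs, weights = [], []
    for i in range(n):
        for j in range(i + 1, n):
            b = battles[i, j]
            # Posterior mean/std of the win rate under a uniform prior,
            # which keeps never-seen pairs at maximum priority.
            p = (wins[i, j] + 1) / (b + 2)
            weights.append(np.sqrt(p * (1 - p) / (b + 3)))
            pairs.append((i, j))
    weights = np.asarray(weights)
    weights /= weights.sum()
    return pairs[rng.choice(len(pairs), p=weights)]

# Toy usage with 3 models: the under-sampled pair (0, 2) is picked ~70% of the time.
battles = np.array([[0, 200, 4], [200, 0, 150], [4, 150, 0]])
wins    = np.array([[0, 110, 3], [ 90, 0,  70], [1,  80, 0]])
print(pick_battle_pair(wins, battles))
```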
Overall, our work suggests that engagement from a handful of providers, and preferential policies from @lmarena_ai towards that same small group, have created conditions that favor overfitting to Arena-specific dynamics rather than improving general model quality.

I remain optimistic this can be fixed.
This was an uncomfortable paper to work on because it asks us to look in the mirror as a community.

As scientists, we must do better.

As a community, I hope we can demand better.
I also do not want to detract from everything @lmarena_ai has achieved. They have democratized access to models and empowered an open community.

I believe the organizers can continue to restore trust by revising their policies.

We make the five changes needed very clear.
Very proud of this cross-institutional collaboration @Cohere_Labs @UWaterloo @stai_research @PrincetonCITP @uwnlp @MIT

Led by @singhshiviii @mziizm, with @YiyangNan @W4ngatang @mrdanieldsouza @sayashk @ahmetustun89 @sanmikoyejo @yuntiandeng @ShayneRedford @nlpnoah @beyzaermis

More from @sarahookr

Oct 4, 2024
One of the biggest open questions is what is the limit of synthetic data.

Does training on synthetic data lead to mode collapse?

Or is there a path forward that could outperform current models?
What is missing from this conversation is that the success of synthetic data hinges on how you optimize in the data space.

A few recent papers highlight this tension well. On the dangers of synthetic data, there is an excellent paper released in Nature.

📜nature.com/articles/s4158…
The Nature paper finds that:

If you train repeatedly on synthetic data generated by a single model, you eventually produce gibberish.

This is due to repeated sampling of the mode of the distribution: you lose the long tail. It is also why synthetic sampling can amplify bias.
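A toy simulation of my own (not the Nature paper's setup) makes the mechanism visible: each generation fits a simple model to the previous generation's outputs and then samples preferentially near the mode, and the tail of the original distribution steadily disappears.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: heavy-tailed "real" data.
data = rng.standard_t(df=3, size=20_000)

for gen in range(8):
    tail = np.quantile(np.abs(data), 0.999)      # how much long tail survives
    print(f"gen {gen}: std={data.std():5.2f}  |x| 99.9th pct={tail:6.2f}")

    # Fit a simple model to the current data (here: Gaussian MLE) ...
    mu, sigma = data.mean(), data.std()
    samples = rng.normal(mu, sigma, size=20_000)

    # ... and let generation favour high-likelihood outputs: keep the most
    # probable 90% of samples, a crude stand-in for sampling near the mode.
    keep = np.abs(samples - mu) <= np.quantile(np.abs(samples - mu), 0.90)
    data = samples[keep]
```

Both the printed std and the tail quantile shrink every generation. Bias amplification follows the same logic: whatever is rare in the data is the first thing this resampling loop discards.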
Jul 23, 2021
How do you distinguish between sources of uncertainty?

This is important because the downstream remedies for atypical and noisy examples are very different.

Two of our workshop papers explore this from different perspectives.
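As a purely illustrative heuristic (not the method in either workshop paper), one way to see why the distinction matters is to combine when an example is learned with whether an ensemble agrees with its label. The signals and threshold below are assumptions for the sketch.

```python
import numpy as np

def split_hard_examples(first_learned_epoch, given_labels, ensemble_preds,
                        late_epoch=20):
    """Separate two kinds of 'hard' training examples.

    first_learned_epoch[i] : first epoch example i was classified correctly
                             (np.inf if it never was)
    given_labels[i]        : label stored in the dataset
    ensemble_preds[i]      : consensus prediction of independently trained models
    late_epoch             : assumed cutoff for "learned late"

    Returns boolean masks for label-noise-like vs atypical-like examples.
    The remedies differ: relabel or filter the former, up-weight or collect
    more data for the latter.
    """
    hard = first_learned_epoch >= late_epoch
    label_disputed = given_labels != ensemble_preds
    noisy_like = hard & label_disputed       # models agree the label looks wrong
    atypical_like = hard & ~label_disputed   # genuinely rare but correctly labelled
    return noisy_like, atypical_like
```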
At the Subset ML workshop tomorrow, Neil Hu and Xinyu Hu explore where simply prioritizing challenging examples fails -- motivating a more nuanced distinction between sources of uncertainty.

w @jasonyo, @savvyRL

Workshop: bit.ly/3wXnrNT

Paper 📜: bit.ly/36ZIhlj
In the UDL Workshop today, @mrdanieldsouza and Zach Nussbaum will present our workshop paper "A Tale of Two Long Tails."

w @_cagarwal.

Workshop: bit.ly/3zurMdh

Paper 📜: bit.ly/3rsdhni

Session: bit.ly/3rqLmEp
9:45-10:45am EST
Feb 15, 2021
Yesterday, I ended up in a debate where the position was "algorithmic bias is a data problem".

I thought this had already been well refuted within our research community but clearly not.

So, to say it yet again -- it is not just the data. The model matters.

1/n
We show this in our work on compression.

Pruning and quantizing deep neural networks amplify algorithmic bias.

arxiv.org/abs/2010.03058 and arxiv.org/abs/1911.05248
Work on memorization and variance of gradients (VoG) shows that hard examples are learnt later in training, and that learning rates impact what is learnt.

bit.ly/2N9mW2r, arxiv.org/abs/2008.11600

So, early stopping disproportionately impacts certain examples.
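For readers new to VoG, here is a minimal sketch of the quantity involved: the variance, across training checkpoints, of the per-example input gradient. I use the loss gradient here for simplicity; the paper's exact definition differs in details (which output is differentiated, normalization), so treat this as a shape-of-the-computation sketch.

```python
import torch

def variance_of_gradients(checkpoints, inputs, labels, loss_fn):
    """Per-example variance of input gradients across training checkpoints.

    checkpoints : list of model snapshots saved at different training stages
    inputs      : (N, ...) batch of examples
    labels      : (N,) targets
    loss_fn     : e.g. torch.nn.CrossEntropyLoss(reduction='sum')

    Returns a (N,) tensor; higher values flag examples the network keeps
    changing its mind about, i.e. ones learned late in training.
    """
    per_ckpt_grads = []
    for model in checkpoints:
        model.eval()
        x = inputs.clone().requires_grad_(True)
        loss = loss_fn(model(x), labels)
        (grad,) = torch.autograd.grad(loss, x)
        per_ckpt_grads.append(grad.flatten(start_dim=1))   # (N, D) per example
    stacked = torch.stack(per_ckpt_grads)                  # (C, N, D) over C checkpoints
    return stacked.var(dim=0).mean(dim=1)                  # variance across checkpoints
```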
Nov 21, 2019
What does a pruned deep neural network "forget"?

Very excited to share our recent work w Aaron Courville, Yann Dauphin and @DreFrome

weightpruningdamage.github.io
At face value, deep neural network pruning appears to promise you can (almost) have it all — remove the majority of weights with minimal degradation to top-1 accuracy. In this work, we explore this trade-off by asking whether certain classes are disproportionately impacted.
We find that pruning is better described as "selective brain damage" -- performance on a tiny subset of classes and images is cannibalized in order to preserve overall performance. The interesting part is what makes certain images more likely to be forgotten...
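A small sketch of the kind of audit this argues for: compare class-level accuracy of the dense and pruned models instead of only overall top-1. The function and variable names are placeholders of mine, not the paper's code.

```python
import numpy as np

def per_class_accuracy_shift(labels, preds_dense, preds_pruned, num_classes):
    """Rank classes by how much accuracy they lose after pruning.

    labels       : (N,) ground-truth class ids on a held-out set
    preds_dense  : (N,) predictions of the unpruned network
    preds_pruned : (N,) predictions of the pruned network

    Returns (class_id, accuracy_change) pairs, most-harmed classes first.
    Overall top-1 can look flat while a few classes absorb all the damage.
    """
    shifts = []
    for c in range(num_classes):
        mask = labels == c
        if not mask.any():
            continue
        acc_dense = (preds_dense[mask] == c).mean()
        acc_pruned = (preds_pruned[mask] == c).mean()
        shifts.append((c, float(acc_pruned - acc_dense)))
    return sorted(shifts, key=lambda pair: pair[1])
```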