Hugh Zhang
May 2 · 14 tweets · 3 min read
Data contamination is a huge problem for LLM evals right now. At Scale, we created a new test set for GSM8k *from scratch* to measure overfitting and found evidence that some models (most notably Mistral and Phi) do substantially worse on this new test set compared to GSM8k.
Stepping back for a moment, LLM evals are really hard because LLMs themselves are trained on basically the entire Internet at this point, so any public benchmark you make will inevitably just end up in the LLM training set.
Ideally, you would solve this by creating custom, private evals and never releasing them. But that’s typically pretty hard to do … unless
We did this at Scale AI! After all, that is what we do! We created a new version of GSM8k from scratch called GSM1k and took great care to make sure it matches GSM8k on metrics like difficulty, number of steps in solution, answer magnitude, etc.
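As a rough illustration of what matching on those metrics can look like, here is a minimal Python sketch comparing a candidate test set against a reference set on solution length and answer magnitude. The data layout (the num_solution_steps and answer fields) is a hypothetical stand-in; the paper describes the actual construction process.

```python
# Sketch: check that a new test set matches a reference set on the
# metrics mentioned above. Field names are hypothetical assumptions;
# this is not Scale's actual construction pipeline.
import statistics

def summarize(problems):
    """Summary stats for a list of {'num_solution_steps': int, 'answer': float} dicts."""
    steps = [p["num_solution_steps"] for p in problems]
    magnitudes = [abs(p["answer"]) for p in problems]
    return {
        "mean_steps": statistics.mean(steps),
        "median_steps": statistics.median(steps),
        "median_answer_magnitude": statistics.median(magnitudes),
    }

def compare(reference, candidate):
    """Print the two sets' summary stats side by side."""
    ref, cand = summarize(reference), summarize(candidate)
    for key in ref:
        print(f"{key:>28}: reference={ref[key]:8.2f}  candidate={cand[key]:8.2f}")

# Toy usage with made-up problems:
gsm8k_like = [{"num_solution_steps": 3, "answer": 72}, {"num_solution_steps": 5, "answer": 10}]
gsm1k_draft = [{"num_solution_steps": 4, "answer": 55}, {"num_solution_steps": 4, "answer": 18}]
compare(gsm8k_like, gsm1k_draft)
```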
We find a huge range in how much models are overfit. On one end, models like Mistral or Phi do up to 10% worse on GSM1k compared to GSM8k. On the other end, models like Gemini, Claude, or GPT show basically no signs of being overfit.
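For clarity, the headline quantity per model is just the accuracy gap between the public benchmark and the fresh test set. A tiny sketch, with placeholder model names and scores rather than the paper’s exact figures:

```python
# Sketch: the overfitting signal is the accuracy gap between the public
# GSM8k and the fresh GSM1k. Names and numbers below are placeholders.
def overfit_gap(acc_gsm8k: float, acc_gsm1k: float) -> float:
    """Positive values mean the model does worse on the unseen GSM1k."""
    return acc_gsm8k - acc_gsm1k

scores = {
    "model_a": (0.80, 0.70),  # ~10 point drop: likely overfit
    "model_b": (0.95, 0.94),  # ~1 point drop: little sign of overfitting
}
for name, (public_acc, fresh_acc) in scores.items():
    print(f"{name}: gap = {overfit_gap(public_acc, fresh_acc):+.2%}")
```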
I also want to commend Llama2 in particular, which (at least on GSM8k) did a great job at avoiding overfitting, especially relative to most other open source models.
So how much of this overfitting is data contamination? In one experiment, we compare a model’s perplexity on GSM8k to its performance gap between GSM8k and GSM1k. We find that models which are more likely to output GSM8k problems are indeed more likely to be overfit (obviously).
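Here is a sketch of what such a likelihood probe can look like (not Scale’s actual evaluation code): score the average per-token negative log-likelihood of GSM8k items under a candidate model with Hugging Face transformers. The gpt2 stand-in and the single example question are assumptions for illustration.

```python
# Sketch of a contamination probe: measure how "expected" GSM8k text is
# under a model. Low NLL / low perplexity on the public test set hints
# that it may have appeared in training data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_nll(model, tokenizer, texts):
    """Average per-token negative log-likelihood of `texts` under `model`."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids
            # With labels=input_ids, HF returns mean cross-entropy over
            # the n-1 shifted prediction positions.
            loss = model(input_ids=ids, labels=ids).loss
            n_preds = ids.shape[1] - 1
            total_nll += loss.item() * n_preds
            total_tokens += n_preds
    return total_nll / total_tokens

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in for the models tested
lm = AutoModelForCausalLM.from_pretrained("gpt2")

# One GSM8k-style question as a placeholder for the full test set.
gsm8k_items = [
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?"
]
print(f"mean per-token NLL: {mean_nll(lm, tok, gsm8k_items):.3f}")
```

An unusually low NLL on the public test set, relative to comparable text, is consistent with that set having leaked into training data; the experiment described above compares exactly this kind of likelihood signal against each model’s GSM8k-to-GSM1k accuracy gap.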
But it’s not the only factor! Some models aren’t very likely to generate GSM8k problems, yet are extremely overfit (Math Shepherd). For other models it’s the opposite (Llemma)! Our guess is that other factors, like choosing the final model checkpoint based on benchmark performance, also play a role.
Finally, I did want to mention: overfitting is not great, and it’s important that benchmarks credibly reflect a model’s performance. But overfit models != bad models. Even the most overfit models are still *pretty* good at reasoning.
Phi-3 has a 10% drop on GSM1k compared to GSM8k — but it still solves 68% of GSM1k problems, which it definitely hasn’t seen before. That’s way more than any other similarly sized model and almost the same performance as dbrx-instruct, which has 132B params!
GSM1k will remain private for now. We include 50 examples (+1 in Table 1) in the paper, but the rest is hidden to prevent the same issue of data contamination that we are trying to solve. We pre-commit to releasing the data at a later point though (details in paper).
I also wanted to give a shout-out to the “Do ImageNet Classifiers Generalize to ImageNet?” paper by @beenwrekt, @BeccaRoelofs, @lschmidt3 and @Vaishaal, which was a huge inspiration for this work and a longtime favorite of mine. We learned a lot of lessons from them.
More details in the paper!

arxiv.org/abs/2405.00332
Joint work with @_jeffda, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, @dylanslack20, @qin_lyu, @SeanHendryx, @russelljkaplan, @mikelunati, @summeryue0.

And of course, sponsored by the wonderful @alexandr_wang!
