Alexandr Wang
May 2
How overfit are popular LLMs on public benchmarks?

New research out of @scale_ai's SEAL team to answer this:

- produced a new eval, GSM1k, which mirrors GSM8k
- evaluated public LLMs for overfitting on GSM8k

VERDICT: Mistral and Phi are overfitting the benchmark, while GPT, Claude, Gemini, and Llama are not.
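
A rough sketch of the measurement in Python (my illustration, not the paper's code): overfitting is read off as the accuracy gap between the public benchmark and the freshly written mirror set. The `evaluate` helper is a hypothetical stand-in for a full eval harness.

```python
# Hypothetical sketch: quantify benchmark overfitting as the accuracy gap
# between a public benchmark (possibly leaked into training data) and a
# held-out mirror set drawn from the same distribution.

def overfit_gap(model, gsm8k_problems, gsm1k_problems, evaluate):
    """Return public-minus-private accuracy; a large positive gap suggests
    the model has memorized the public set rather than learned the task."""
    acc_public = evaluate(model, gsm8k_problems)    # GSM8k: public, possibly seen
    acc_private = evaluate(model, gsm1k_problems)   # GSM1k: private mirror set
    return acc_public - acc_private
```

A gap near zero (what the paper reports for GPT, Claude, Gemini, and Llama) is consistent with genuine reasoning ability; a large positive gap points to memorization.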
h/t to our incredible team for this research:
@hughbzhang, @summeryue0, @_jeffda, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, @seanh, Russell Kaplan, @mikelunati

paper link: arxiv.org/abs/2405.00332

More from @alexandr_wang

Sep 16
As LLMs get smarter, evals need to get harder.
OpenAI’s o1 has already maxed out most major benchmarks.

Scale is partnering with CAIS to launch Humanity’s Last Exam: the toughest open-source benchmark for LLMs.

We're putting up $500K in prizes for the best questions.

(read on)
We need tough questions from human experts to push AI models to their limits. If you submit one of the best questions, we’ll give you co-authorship and a share of the prize pot.

The top 50 questions will earn $5,000 each, and the next 500 will earn $500 each (50 × $5,000 + 500 × $500 = the full $500K pool). All selected questions grant optional co-authorship on the resulting paper.

We're seeking questions that go beyond undergraduate level and aren't easily answerable via quick online searches.
If you have 5+ years in a technical field or hold/are pursuing a PhD, we want your insights! We're seeking questions that would truly impress you if an AI could solve them. Help us evaluate how close we are to achieving expert-level AI across diverse domains.

Submit here: agi.safe.ai/submit

The deadline is November 1, 2024.
Aug 1
1/ Gemini 1.5 Pro 0801 is the new best model (tops LMSYS; SEAL evals incoming)

Key considerations:
1—OpenAI, Google, Anthropic, & Meta are all right ON the frontier
2—Google has a long-term compute edge w/TPUs
3—Data & post-training are becoming key competitive drivers of performance

🧵
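
(Context for "tops LMSYS" in 1/: the LMSYS arena ranks models from pairwise human votes using Elo-style ratings. Below is a minimal sketch of the classic Elo update; LMSYS actually fits Bradley-Terry scores over all votes, so treat this as the idea rather than their pipeline.)

```python
# Classic Elo update for one pairwise comparison between models A and B.
# score_a: 1.0 if A's answer wins the vote, 0.5 for a tie, 0.0 for a loss.

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))  # A's win probability
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b - k * (score_a - expected_a)               # zero-sum update
    return r_a_new, r_b_new

print(elo_update(1200.0, 1200.0, score_a=1.0))  # -> (1216.0, 1184.0)
```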
2/ We've seen 7 major models from top labs in the last 3 months:

May:
- GPT-4o
- Gemini 1.5 Pro

June:
- Claude 3.5 Sonnet

July:
- Llama 3.1
- Mistral Large 2
- GPT-4o Mini

August:
- Gemini 1.5 0801

Each of these models has been incredibly competitive—each world-class in some way.
3/ The reason these releases are all so close together is that every lab got their H100s at roughly the same time.

They each struggled with early H100 issues last fall, and the big H100 clusters all started training this spring.

Voila, 5-6 months later, big models!
Jul 25
1/ New paper in Nature shows model collapse as successive model generations are recursively trained on synthetic data.

This is an important result. While many researchers today view synthetic data as an AI philosopher's stone, there is no free lunch.

Read more 👇
2/ Training on pure synthetic data provides no information gain, so there is little reason the model *should* improve.

Oftentimes when evals go up from "self-distillation", it comes from some invisible tradeoff, e.g. mode collapse in exchange for individual eval improvements.
3/ This core idea is very important to pay attention to:

Synthetic data can create a short-term boost in eval results, but you will pay for it later with model collapse!

You accumulate debt by mangling the model in ways that start out invisible and are very hard to repay.
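
To see the mechanism in miniature, here is a toy sketch (mine, not the paper's experiments): fit a Gaussian to data, sample a synthetic dataset from the fit, refit on the samples, repeat. Each refit is a finite-sample estimate, so tail information is lost and the fitted distribution degrades over generations.

```python
# Toy model collapse: recursively train a Gaussian estimator on its own samples.
# The fitted sigma follows a downward-biased random walk (the sample std
# underestimates the true sigma), so run long enough and the distribution
# collapses; the mean also drifts away from its true value.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100)   # generation 0: "real" data

for gen in range(1, 31):
    mu, sigma = data.mean(), data.std()           # "train" on current data
    data = rng.normal(mu, sigma, size=100)        # next gen: pure synthetic data
    if gen % 5 == 0:
        print(f"gen {gen:2d}: mu={mu:+.3f}  sigma={sigma:.3f}")
```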
Jun 9
1/ one of the biggest questions in AI today is:

since GPT-4 was trained in fall 2022, we've collectively spent ~$100B on NVIDIA GPUs

will the next generation of AI models' capabilities live up to that aggregate investment level?

[Chart: NVIDIA quarterly datacenter revenue, by @Thomas_Woodside]
2/ there are 2 schools of thought:

1) compute is the only real bottleneck to AI progress. the more we spend, the closer we get to AGI

2) we are hitting a data wall which will slow progress regardless of how much compute we have

3/ the original scaling laws require scaling data alongside compute, and while you can still improve loss with more compute alone, it is much less efficient than if you scaled data as well
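
To make that concrete, the Chinchilla-style parametric loss (Hoffmann et al., 2022) is one standard formalization; the exponents are their empirical fit, but the functional form shows why compute without data hits a floor:

```latex
% Chinchilla-style parametric loss:
% N = parameters, D = training tokens, training compute C \approx 6ND.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Minimizing L at fixed compute C gives
N_{\text{opt}} \propto C^{a}, \qquad D_{\text{opt}} \propto C^{b}, \qquad a \approx b \approx 0.5
% i.e. compute-optimal training grows data in step with parameters.
% Scaling N alone shrinks only the A / N^{\alpha} term; the B / D^{\beta}
% term remains as a loss floor: the "data wall" from 2/ above.
```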
May 29
1/ We are launching SEAL Leaderboards—private, expert evaluations of leading frontier models.

Our design principles:
🔒Private + Unexploitable. No overfitting on evals!
🎓Domain Expert Evals
🏆Continuously Updated w/new Data and Models

Read more in 🧵

scale.com/leaderboard


2/ Evaluations are a critical component of the AI ecosystem.

Evals create incentives for researchers, and they set the goals for how we aim to improve our models.

Trusted 3rd party evals are a missing part of the whole ecosystem, which is why @scale_AI built these.
3/ We eval'd many of the leading models:

- GPT-4o
- GPT-4 Turbo
- Claude 3 Opus
- Gemini 1.5 Pro
- Gemini 1.5 Flash
- Llama 3
- Mistral Large

On Coding, Math, Instruction Following, and Multilinguality (Spanish).

See leaderboard results at scale.com/leaderboard.
May 28
1/ Today is the 4th anniversary of the original GPT-3 paper—"Language Models are Few-Shot Learners"

Some reflections on how the last 4 years have played out, and thoughts about the next 4 years
2/ GPT-3 was when the potential of scaling language models first became clear.

The efficacy of GPT-3 took the AI community by surprise for the most part—the capabilities were staggering compared to everything that came before in NLP.
3/ @scale_AI had started working on language models the year before, on the very first RLHF experiments on GPT-2. But GPT-2 still felt very much like a toy at that time.

GPT-3 was the first moment where it was obvious this would be the major theme of AI.
