How overfit are popular LLMs on public benchmarks?
New research out of @scale_ai SEAL answers this:
- produced a new eval, GSM1k
- evaluated public LLMs for overfitting on GSM8k
VERDICT: Mistral & Phi are overfitting to benchmarks, while GPT, Claude, Gemini, and Llama are not.
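Rough sketch of the signal being measured (my own illustration with a hypothetical `run_eval` helper, not the paper's code): overfitting shows up as the gap between accuracy on the public GSM8k and on the freshly written GSM1k.

```python
# Sketch: quantify benchmark overfitting as the GSM8k-to-GSM1k accuracy gap.
# `run_eval(model, dataset)` is a hypothetical helper returning accuracy in [0, 1].

def overfit_gap(model: str, run_eval) -> float:
    acc_gsm8k = run_eval(model, dataset="gsm8k")  # public benchmark the model may have memorized
    acc_gsm1k = run_eval(model, dataset="gsm1k")  # freshly written, matched-difficulty mirror
    return acc_gsm8k - acc_gsm1k  # large positive gap suggests overfitting; near zero suggests real ability
```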
h/t to our incredible team for this research:
@hughbzhang, @summeryue0, @_jeffda, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, @seanh, Russell Kaplan, @mikelunati
Since ChatGPT dropped in 2022, AI progress has been dramatic.
But it's also been predictable—new models, bigger chip clusters, more chatbots.
Not in 2025.
Here are the three big changes to watch for over the next 12 months 🧵
1/8
#1 Geopolitical Swing States.
The conversation is going to expand from “Who is leading, the US or China?” to “Which country’s AI is most exportable worldwide?”
AI-curious countries around the world—“geopolitical swing states”—are going to decide which side they go with.
2/8
The US must win here. Supplying the AI technology of the world is the tech equivalent of being the global reserve currency. It's a 100+ year investment.
AI cannot be China’s next international expansion expedition like the Belt and Road Initiative.
Scale AI is proud to announce Defense Llama 🇺🇸: the LLM purpose-built for American national security.
This is the product of collaboration between @Meta, Scale, and defense experts, and is available now for integration into US defense systems.
Read more below👇
With the National Security Memorandum coming out of the White House recently, it is clear we need to move fast on AI in national security.
From the NSM:
"If the United States Government does not act with responsible speed and in partnership with industry, civil society, and academia to make use of AI capabilities in service of the national security mission — and to ensure the safety, security, and trustworthiness of American AI innovation writ large — it risks losing ground to strategic competitors."
By leveraging the best commercial models for national security, and fine-tuning them specifically for defense and intelligence use cases, we are empowering the US to succeed against our strategic competitors.
There is nothing else more important for the future of freedom.
As LLMs get smarter, evals need to get harder.
OpenAI’s o1 has already maxed out most major benchmarks.
Scale is partnering with CAIS to launch Humanity’s Last Exam: the toughest open-source benchmark for LLMs.
We're putting up $500K in prizes for the best questions.
(read on)
We need tough questions from human experts to push AI models to their limits. If you submit one of the best questions, we’ll give you co-authorship and a share of the prize pot.
The top 50 questions will earn $5,000 each, and the next 500 will earn $500 each. All selected questions grant optional co-authorship on the resulting paper.
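(That breakdown matches the $500K pot: 50 × $5,000 + 500 × $500 = $250,000 + $250,000 = $500,000.)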
We're seeking questions that go beyond undergraduate level and aren't easily answerable via quick online searches.
If you have 5+ years in a technical field or hold/are pursuing a PhD, we want your insights! We're seeking questions that would truly impress you if an AI could solve them. Help us evaluate how close we are to achieving expert-level AI across diverse domains.
1/Gemini 1.5 Pro 0801 is the new best model (tops LMSYS, SEAL evals incoming)
Key considerations
1—OpenAI, Google, Anthropic, & Meta all right ON the frontier
2—Google has a long-term compute edge w/TPUs
3—Data & post-training becoming key competitive drivers in performance
🧵
2/We've seen 7 major models from top labs in the last 3mo:
May:
- GPT-4o
- Gemini 1.5 Pro
June:
- Claude 3.5 Sonnet
July:
- Llama 3.1
- Mistral Large 2
- GPT-4o Mini
August:
- Gemini 1.5 0801
Each of these models has been incredibly competitive—each world-class in some way.
3/The reason these are all so close together timing-wise is that every lab got their H100s at roughly the same time.
They each hit early issues with the H100s last fall, and the big H100 clusters all started training this spring.
1/ New paper in Nature shows model collapse as successive model generations are recursively trained on synthetic data.
This is an important result. While many researchers today view synthetic data as the AI philosopher’s stone, there is no free lunch.
Read more 👇
2/ Training on pure synthetic data has no information gain, so there is little reason the model *should* improve.
Oftentimes when evals go up from “self-distillation”, it may be due to a less visible tradeoff, e.g. mode collapse in exchange for a gain on an individual eval.
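A minimal toy sketch of the mechanism (my own illustration, not the paper's setup): repeatedly fit a distribution to samples drawn from the previous generation's fit, with no real data ever added, and watch the fit drift.

```python
import random
import statistics

# Toy model-collapse sketch: each "generation" fits a Gaussian to synthetic
# samples drawn from the previous generation's fit, with no fresh real data.
mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
for gen in range(1, 21):
    synthetic = [random.gauss(mu, sigma) for _ in range(100)]  # train only on model output
    mu, sigma = statistics.mean(synthetic), statistics.stdev(synthetic)
    print(f"gen {gen:2d}: mu={mu:+.3f}, sigma={sigma:.3f}")

# Because every generation only sees its predecessor's samples, estimation error
# compounds and the fit random-walks away from the original (0, 1); nothing in the
# loop can pull it back, which is the "no information gain" point above.
```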
3/ This core idea is very important to pay attention to:
Synthetic data can create a short-term boost in eval results, but you will pay for it later with model collapse!
You accumulate debt by mangling the model; the debt starts out invisible and is very hard to repay.
3/ The original scaling laws require scaling data alongside compute. You can still improve loss with more compute alone, but it is much less efficient than if you scaled data as well.
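Using the standard Chinchilla-style parameterization (Hoffmann et al.; not from this thread, and the constants A, B, E, α, β are empirically fitted), the point is visible in the loss floor: with data D fixed, growing parameters/compute N only attacks one term.

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
\lim_{N \to \infty} L(N, D) = E + \frac{B}{D^{\beta}}
```

Here N is parameter count and D is training tokens; compute alone cannot push loss below the E + B/D^β floor set by the data you have.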