How overfit are popular LLMs on public benchmarks?
New research out of @scale_ai's SEAL lab answers this:
- produced a new eval, GSM1k
- evaluated public LLMs for overfitting on GSM8k
VERDICT: Mistral & Phi show signs of overfitting to GSM8k, while GPT, Claude, Gemini, and Llama do not.
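Roughly, the overfitting signal is the accuracy drop from public GSM8k to the freshly written GSM1k. A minimal sketch (the accuracies and the 5-point threshold below are my illustrative assumptions, not the paper's numbers):

```python
# Hedged sketch: a model that memorized leaked GSM8k problems should score
# noticeably worse on the held-out GSM1k. Accuracies here are placeholders.
ILLUSTRATIVE_ACCURACIES = {
    # model      (gsm8k, gsm1k)
    "model_a": (0.88, 0.86),  # small gap: little evidence of overfitting
    "model_b": (0.82, 0.71),  # large gap: consistent with overfitting
}

for model, (gsm8k_acc, gsm1k_acc) in ILLUSTRATIVE_ACCURACIES.items():
    gap = gsm8k_acc - gsm1k_acc
    # 5-point cutoff is an arbitrary choice for illustration
    flag = "possible overfit" if gap > 0.05 else "looks clean"
    print(f"{model}: GSM8k={gsm8k_acc:.0%} GSM1k={gsm1k_acc:.0%} gap={gap:+.0%} -> {flag}")
```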
h/t to our incredible team for this research:
@hughbzhang @summeryue0, @_jeffda, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, @seanh, Russell Kaplan, @mikelunati
As LLMs get smarter, evals need to get harder.
OpenAI’s o1 has already maxed out most major benchmarks.
Scale is partnering with CAIS to launch Humanity’s Last Exam: the toughest open-source benchmark for LLMs.
We're putting up $500K in prizes for the best questions.
(read on)
We need tough questions from human experts to push AI models to their limits. If you submit one of the best questions, we’ll give you co-authorship and a share of the prize pot.
The top 50 questions will earn $5,000 each, and the next 500 will earn $500 each. All selected questions grant optional co-authorship on the resulting paper.
We're seeking questions that go beyond undergraduate level and aren't easily answerable via quick online searches.
If you have 5+ years of experience in a technical field or hold/are pursuing a PhD, we want your insights! Submit questions that would truly impress you if an AI could solve them. Help us evaluate how close we are to achieving expert-level AI across diverse domains.
1/Gemini 1.5 Pro 0801 is the new best model (tops LMSYS, SEAL evals incoming)
Key considerations
1—OpenAI, Google, Anthropic, & Meta all right ON the frontier
2—Google has a long-term compute edge w/TPUs
3—Data & post-training becoming key competitive drivers in performance
🧵
2/We've seen 7 major models from top labs in the last 3mo:
May:
- GPT-4o
- Gemini 1.5 Pro
June:
- Claude 3.5 Sonnet
July:
- Llama 3.1
- Mistral Large 2
- GPT-4o Mini
August:
- Gemini 1.5 0801
Each of these models has been incredibly competitive—each world-class in some way.
3/The reason these are all so close together timing-wise is that every lab got their H100s at roughly the same time.
Each lab worked through early H100 issues last fall, and the big H100 clusters all started training this spring.
1/ New paper in Nature shows model collapse as successive model generations are recursively trained on synthetic data.
This is an important result. While many researchers today view synthetic data as the AI philosopher's stone, there is no free lunch.
Read more 👇
2/ Training on pure synthetic data adds no new information, so there is little reason the model *should* improve.
Oftentimes when evals go up from "self-distillation", it comes from a less visible tradeoff, e.g. mode collapse in exchange for gains on individual evals.
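A toy sketch of the collapse mechanism (my illustration, not the paper's LLM setup): fit a Gaussian to samples drawn from the previous generation's fit, then repeat. With no fresh data entering the loop, the fitted variance drifts toward zero and the distribution's tails vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0   # generation 0: the "real" data distribution
n = 20                 # small sample size makes the downward drift visible

for gen in range(1, 101):
    synthetic = rng.normal(mu, sigma, n)           # sample from current model
    mu, sigma = synthetic.mean(), synthetic.std()  # refit on synthetic data only
    if gen % 10 == 0:
        print(f"gen {gen:3d}: mu={mu:+.3f}  sigma={sigma:.4f}")
```

Each refit loses a little variance in expectation, and the losses compound across generations: exactly the quiet degradation that eval numbers alone can hide.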
3/ This core idea deserves close attention:
Synthetic data can create a short-term boost in eval results, but you will pay for it later with model collapse!
You accumulate debt by mangling the model, debt that starts out invisible and is very hard to repay.
4/ The original scaling laws require scaling data alongside compute. You can still improve loss with more compute alone, but it is much less efficient than scaling data as well.
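To make that concrete, a rough sketch using the Chinchilla-style parametric loss from Hoffmann et al. 2022, L(N, D) = E + A/N^a + B/D^b, with their fitted constants (the 10x/10x split is roughly compute-optimal under that fit; compute C ~ 6ND):

```python
# Hedged sketch: compare spending 100x more compute on parameters alone
# vs. splitting it across parameters and data. Constants from the
# Chinchilla paper; the starting point is a Chinchilla-scale model.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

N0, D0 = 70e9, 1.4e12              # 70B params, 1.4T tokens
print(f"baseline        : {loss(N0, D0):.3f}")
print(f"100x N, fixed D : {loss(100 * N0, D0):.3f}")        # compute into params only
print(f"10x N and 10x D : {loss(10 * N0, 10 * D0):.3f}")    # same compute, balanced
```

Under this fit, the balanced split reaches a lower loss than pouring all 100x of the extra compute into parameters with data held fixed.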
1/ Today is the 4th anniversary of the original GPT-3 paper—"Language Models are Few-Shot Learners"
Some reflections on how the last 4 years have played out, and thoughts about the next 4 years
2/ GPT-3 was the moment it first became clear just how much potential there was in scaling language models.
The efficacy of GPT-3 largely took the AI community by surprise: the capabilities were staggering compared to everything that came before in NLP.
3/ @scale_AI had started working on language models the year before, running the very first RLHF experiments on GPT-2. But GPT-2 still felt very much like a toy at that time.
GPT-3 was the first moment when it was obvious this would be the major theme of AI.