How big of a paradigm shift was the rise of reasoning models? We dug into the data and found that at least on some benchmarks, reasoning models were likely as large an algorithmic advance as the Transformer.
When OpenAI released o1, it blew its predecessor GPT-4o out of the water on some math and science benchmarks. The difference was reasoning training and test-time scaling: o1 was trained to optimize its chain-of-thought, allowing extensive thinking before responding to users.
This represented a huge algorithmic improvement. To reach o1-high’s GPQA diamond performance with a non-reasoning model, you’d need 9x more pre-training compute than GPT-4o. That’s larger than the gain from switching from Kaplan to Chinchilla scaling laws!
The results are similarly striking with different models and benchmarks. On MATH, Mock AIME, and GPQA diamond, different versions of o1 are often the “equivalent” of a GPT-4o trained with over 10x its pre-training compute.
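For intuition, here is a minimal sketch of how a compute-equivalent gain can be estimated: fit benchmark score against log pre-training compute for non-reasoning models, then invert the fit to find how much compute a non-reasoning model would need to match the reasoning model's score. The logistic form and the data points below are illustrative assumptions, not the actual fits behind these numbers.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative sketch: fit benchmark score vs. log10 pre-training compute for
# non-reasoning models, then invert the fit to find the compute a non-reasoning
# model would need to match a reasoning model's score.
def logistic(log_c, mid, slope):
    return 1.0 / (1.0 + np.exp(-slope * (log_c - mid)))

# Hypothetical (log10 FLOP, accuracy) points for non-reasoning models, not real data.
log_compute = np.array([24.5, 25.0, 25.5, 25.8])
scores = np.array([0.30, 0.38, 0.48, 0.53])
(mid, slope), _ = curve_fit(logistic, log_compute, scores, p0=[26.0, 2.0])

def compute_equivalent_gain(reasoning_score, base_log_compute):
    """How many times more pre-training compute a non-reasoning model would need
    to match reasoning_score, relative to a base model at base_log_compute."""
    # Invert the logistic: score -> required log10 compute.
    required_log_c = mid - np.log((1.0 - reasoning_score) / reasoning_score) / slope
    return 10 ** (required_log_c - base_log_compute)

# e.g. a reasoning model scoring 0.70 vs. its base model trained with 10^25.8 FLOP
print(f"{compute_equivalent_gain(0.70, 25.8):.1f}x")
```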
We see a similar pattern with Anthropic’s Claude family. Assuming that Claude 3.7 Sonnet is a reasoning-trained version of Claude 3.5 Sonnet, we again often see “compute-equivalent gains” of a similar scale.
Of course, these estimates are highly uncertain. We have little data to work with, we may have incorrectly identified pairs of reasoning/non-reasoning models, and our estimates depend on the amount of test-time scaling performed.
Some benchmarks are also far more amenable to reasoning than others. For example, reasoning models show almost no improvement on GeoBench, but clearly exceed the trend in non-reasoning model performance on Mock AIME.
But benchmarks like GeoBench are exceptions to the norm. The average score of reasoning models was higher than that of non-reasoning models on ~90% of tested benchmarks, though this is in part because reasoning models tend to be more recent.
Overall, our results suggest that the shift to reasoning models was a huge algorithmic improvement. On benchmarks amenable to reasoning, we find compute-equivalent gains of around 10x, roughly as large as the gain from the Transformer.
This post was written by @ansonwhho and @ardenaberg. You can read the full post here: epoch.ai/gradient-updat…
A fourth problem on FrontierMath Tier 4 has been solved by AI! Written by Dan Romik, it had won our prize for the best submission in the number theory category.
The evaluation was done internally by OpenAI on an early checkpoint of o3 using a “high reasoning setting.” The model made 32 attempts on the problem and solved it only once. OpenAI shared the reasoning trace so that Dan could analyze the model’s solution and provide commentary.
Dan said the model had some false starts but eventually solved the problem “by combining an excellent intuition about asymptotic phenomena with its ability to code and run computationally intensive numerical calculations to test hypotheses.”
Should you start your training run early, so you can train for longer, or wait for the next generation of chips and algorithms? Our latest estimate suggests that it’s not effective to train for more than ~9 months. On current trends, frontier labs will hit that limit by 2027. 🧵
Why 9 months? Model developers face a tradeoff: wait before starting a run to take advantage of better hardware and algorithms, or start sooner with what’s available. Waiting lets you train faster once you start, so there’s an optimal run length for any given deadline.
Our previous work estimated that hardware + algorithmic progress would lead to a 15-month maximum training run. That work assumed algorithms were improving at 1.7x per year, but we now believe they are improving at a much faster 3x per year!
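As a sketch of why faster progress shortens the optimal run: assume hardware and algorithmic gains compound to a factor g per year, so a run starting t years from now and ending at a fixed deadline T delivers effective compute proportional to g^t * (T - t). Maximizing over t gives an optimal run length of 1/ln(g) years, independent of the deadline. The 1.35x/year hardware figure below is an illustrative assumption; the 3x/year algorithmic figure is from this thread.

```python
import numpy as np

# Illustrative model of the start-now-vs-wait tradeoff (not our exact methodology).
# Effective compute for a run starting at time t with deadline T: g**t * (T - t).
# Setting the derivative to zero gives an optimal run length of 1 / ln(g).
hardware_gain_per_year = 1.35    # assumed hardware price-performance trend (illustrative)
algorithmic_gain_per_year = 3.0  # algorithmic progress figure from this thread
g = hardware_gain_per_year * algorithmic_gain_per_year

optimal_run_length_years = 1 / np.log(g)
print(f"Optimal run length: {optimal_run_length_years * 12:.1f} months")  # ~8.6 months
```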
Back in 2022, ChatGPT arguably became the fastest-growing consumer app ever, hitting 100M users in just 2 months. But the field of AI has transformed since then, and it’s time to take a new look at the numbers. 🧵
Historically, technology adoption took decades. For example, telephones took 60 years to reach 70% of US households. But tech diffuses faster and faster over time, and we should expect AI to continue this trend.
But even if we account for this trend, AI adoption seems incredibly fast. ~10% of the US used ChatGPT weekly within just 2 years, and ~30% in under 2.5 years.
We have graded the results of @OpenAI's evaluation on FrontierMath Tier 1–3 questions, and found a performance of 27% (±3%). ChatGPT agent is a new model fine-tuned for agentic tasks, equipped with text/GUI browser tools and native terminal access. 🧵
This evaluation is not directly comparable to those on Epoch AI’s benchmarking hub, as it uses a different scaffold. First, we did not run the model ourselves—we only graded the outputs provided by OpenAI and don’t have access to their code to run the model. Second, ChatGPT agent has access to tools not available to other models we've assessed—most notably browser tools, which may have helped on questions related to recent research papers. Finally, the evaluation allowed up to 128K tokens per question, compared to our standard 100K; this difference is unlikely to have significantly affected results.
@OpenAI OpenAI has exclusive access to all FrontierMath problem statements and 237 of the 290 Tier 1–3 solutions. Epoch AI holds out the remaining solutions. We found no statistically significant performance difference between the held-out and non-held-out sets.
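For context, a difference like this can be checked with a standard two-proportion z-test. The sketch below uses the group sizes above (237 vs. 53 problems) but hypothetical per-group solve counts, so it illustrates the shape of the test rather than the actual comparison.

```python
from statistics import NormalDist

# Illustrative two-proportion z-test for a held-out vs. non-held-out comparison.
# Group sizes come from the thread; the solve counts below are hypothetical.
n1, x1 = 237, 66   # non-held-out: problems, solved (hypothetical)
n2, x2 = 53, 13    # held-out: problems, solved (hypothetical)

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
z = (p1 - p2) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
print(f"z = {z:.2f}, p = {p_value:.2f}")
```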
The IMO is next week. What will it tell us about AI?
@GregHBurnham argues that an AI gold medal could be a non-event or could be an important breakthrough—it depends on whether the AI system exhibits creative problem-solving. How to tell the difference? Read on!
@GregHBurnham It will be tempting to focus on whether an AI system gets a gold medal. Formal proof systems like Google’s AlphaProof are quite close to this, and even general-purpose LLMs have a fighting chance. But that's not the outcome to pay the most attention to.
@GregHBurnham Rather, the big thing to watch for is qualitative: can AI systems solve problems that require a lot of creativity?
@ansonwhho and @ardenaberg argue that an AI Manhattan Project at the scale of previous national projects could result in a ~1000x compute scaleup by 2027.
@ansonwhho @ardenaberg A national AI project has become more and more of a possibility in the last year, with such a project being the top recommendation of a US-China congressional commission.
@ansonwhho @ardenaberg At their peaks, previous national projects spent a fraction of GDP equivalent to $120B-$250B today. The authors find that such a budget could centralize most NVIDIA compute in the US.
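As a rough back-of-the-envelope of what a budget on that scale could buy in accelerators, here is a sketch in which every unit cost and throughput figure is an illustrative assumption rather than the authors' estimate.

```python
# Back-of-the-envelope only: all unit costs and FLOP/s figures are illustrative assumptions.
budget_low, budget_high = 120e9, 250e9  # $120B-$250B budget range from the thread
cost_per_gpu = 40_000                   # assumed all-in cost per H100-class GPU, incl. systems
flop_per_gpu = 1e15                     # assumed effective FLOP/s per GPU (rounded)

for budget in (budget_low, budget_high):
    gpus = budget / cost_per_gpu
    print(f"${budget / 1e9:.0f}B -> ~{gpus / 1e6:.1f}M GPUs, ~{gpus * flop_per_gpu:.1e} FLOP/s")
```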