How big of a paradigm shift was the rise of reasoning models? We dug into the data and found that at least on some benchmarks, reasoning models were likely as large of an algorithmic advance as the Transformer.
When OpenAI released o1, it blew its predecessor GPT-4o out of the water on some math and science benchmarks. The difference was reasoning training and test-time scaling: o1 was trained to optimize its chain-of-thought, allowing extensive thinking before responding to users.
This represented a huge algorithmic improvement. To reach o1-high’s GPQA diamond performance with a non-reasoning model, you’d need 9x more pre-training compute than GPT-4o. That’s larger than the gain from switching from Kaplan to Chinchilla scaling laws!
The results are similarly striking with different models and benchmarks. On MATH, Mock AIME, and GPQA diamond, different versions of o1 are often the “equivalent” of GPT-4o trained with over 10x its pre-training compute.
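To make the method concrete, here is a minimal sketch of how a compute-equivalent gain can be estimated: fit a curve of non-reasoning benchmark accuracy against log pre-training compute, then invert it at the reasoning model’s score. The logistic form and all numbers below are illustrative placeholders, not the actual fits behind the post.

```python
import numpy as np
from scipy.optimize import curve_fit

def accuracy_vs_compute(log10_compute, midpoint, slope):
    """Logistic curve: benchmark accuracy as a function of log10 pre-training FLOP."""
    return 1.0 / (1.0 + np.exp(-slope * (log10_compute - midpoint)))

# Hypothetical (log10 FLOP, accuracy) pairs for non-reasoning models on one benchmark.
log10_flop = np.array([24.0, 24.5, 25.0, 25.5])
accuracy   = np.array([0.30, 0.38, 0.48, 0.57])

(midpoint, slope), _ = curve_fit(accuracy_vs_compute, log10_flop, accuracy, p0=[25.5, 1.0])

def compute_needed(target_accuracy):
    """Invert the fit: log10 FLOP a non-reasoning model would need to hit a target accuracy."""
    return midpoint + np.log(target_accuracy / (1 - target_accuracy)) / slope

# Compute-equivalent gain of a reasoning model relative to its non-reasoning base model.
base_log10_flop = 25.5      # pre-training compute of the base model (hypothetical)
reasoning_accuracy = 0.75   # the reasoning model's score on the same benchmark (hypothetical)
gain = 10 ** (compute_needed(reasoning_accuracy) - base_log10_flop)
print(f"compute-equivalent gain: {gain:.1f}x")
```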
We see a similar pattern with Anthropic’s Claude family. Assuming that Claude 3.7 Sonnet is a reasoning-trained version of Claude 3.5 Sonnet, we again often see “compute-equivalent gains” of a similar scale.
Of course, these estimates are highly uncertain. We have little data to work with, we may have incorrectly identified pairs of reasoning/non-reasoning models, and our estimates depend on the amount of test-time scaling performed.
Some benchmarks are also much more amenable to reasoning than others. For example, reasoning models show almost no improvement on GeoBench, but clearly exceed the trend in non-reasoning model performance on Mock AIME.
But benchmarks like GeoBench are exceptions to the norm. The average score of reasoning models was higher than that of non-reasoning models on ~90% of tested benchmarks, though this is in part because reasoning models tend to be more recent.
Overall, our results suggest that the shift to reasoning models was a huge algorithmic improvement. On benchmarks amenable to reasoning, we find compute-equivalent gains of around 10x, about as large as the gain from the Transformer.
This post was written by @ansonwhho and @ardenaberg. You can read the full post here: epoch.ai/gradient-updat…
Does OpenAI make money serving its models? According to @Jsevillamol, @exponentialview’s Hannah Petrovic, and @ansonwhho, it depends. Gross margins were around 45%, making inference look profitable.
But after accounting for the cost of operations, OpenAI likely incurred a loss.🧵
Even the gross profits from running models weren’t enough to recoup R&D costs.
Gross profits from running GPT-5 were less than OpenAI's R&D costs in the four months before launch. And the true R&D cost was likely higher than that.
The core problem: AI R&D is expensive, and model lifecycles are too short to bring in enough revenue to cover it.
So even if it’s profitable to run models, the full lifecycle is likely loss-making, assuming GPT-5 is representative of other models.
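As a toy illustration of that lifecycle argument: only the ~45% gross margin comes from above; every other number below is a hypothetical placeholder, not an Epoch estimate.

```python
# Toy model of a single model generation's economics.
inference_revenue = 2.0e9   # lifetime revenue from serving the model (hypothetical)
gross_margin      = 0.45    # revenue minus serving compute, as a fraction of revenue (from the thread)
operating_costs   = 0.4e9   # staff, sales, overhead attributed to the model (hypothetical)
r_and_d_costs     = 1.5e9   # training runs, experiments, data before launch (hypothetical)

gross_profit = inference_revenue * gross_margin
net = gross_profit - operating_costs - r_and_d_costs
print(f"gross profit: ${gross_profit/1e9:.2f}B, lifecycle net: ${net/1e9:.2f}B")
# Positive gross profit but negative lifecycle net: running the model pays,
# while a short lifecycle leaves too little revenue to recoup R&D.
```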
Global AI compute capacity now totals over 15 million H100-equivalents.
Our new AI Chip Sales data explorer tracks where this compute comes from across Nvidia, Google, Amazon, AMD, and Huawei, making it the most comprehensive public dataset available.
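As a rough sketch of what “H100-equivalent” accounting looks like, chips can be weighted by performance relative to an H100 and summed. The per-chip ratios and shipment counts below are purely illustrative placeholders, not figures or methodology from the dataset.

```python
# Hypothetical performance ratios relative to an H100. A real H100e conversion
# would pin these to a specific precision and utilization assumption.
relative_performance = {
    "H100": 1.0,
    "B300": 4.0,      # placeholder ratio, not an official spec
    "TPU_v5p": 1.0,   # placeholder ratio
}

# Hypothetical cumulative shipments per chip type.
shipments = {"H100": 2_000_000, "B300": 1_500_000, "TPU_v5p": 1_000_000}

h100_equivalents = sum(relative_performance[chip] * count for chip, count in shipments.items())
print(f"{h100_equivalents/1e6:.1f}M H100e")
```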
Nvidia’s B300 GPU now accounts for the majority of its revenue from AI chips, while H100s make up under 10%.
We estimate chip-level spending using earnings reports, company disclosures, and analyst and media coverage.
These chips present massive resource demands.
Even before the power overheads of servers and data centers, this many chips would draw over 10 GW of power, around twice the average power consumption of New York City.
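That figure is simple arithmetic on chip counts, sketched below assuming roughly 700 W per H100-equivalent at the chip level:

```python
h100_equivalents = 15_000_000   # global stock from the thread, in H100e
watts_per_h100e  = 700          # approximate H100 SXM TDP; a chip-level assumption

chip_power_gw = h100_equivalents * watts_per_h100e / 1e9
print(f"~{chip_power_gw:.1f} GW at the chip level")  # ~10.5 GW, before server and datacenter overheads
```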
GPT-5.2 scores 152 on the Epoch Capabilities Index (ECI), our tool for aggregating benchmark scores. This puts it second only to Gemini 3 Pro.
🧵 with individual scores.
GPT-5.2 ranks first or second on most of the benchmarks we run ourselves, including a top score on FrontierMath Tiers 1–3 and our new chess puzzles benchmark. The exception is SimpleQA Verified, where it scores notably worse than even previous GPT-5 series models.
Our AIME variant, OTIS Mock AIME 2024-2025, is nearly saturated. There remains a single problem no model has solved, shown below. The diagram is given to the model in the Asymptote vector graphics language.
The Epoch Capabilities Index is a useful way to measure model capabilities, but what does a score of 150 actually mean?
One way to read our new capabilities index is to plot the benchmark performance you would expect to see across a range of ECI scores 🧵
Three important takeaways:
1. Benchmarks vary in overall difficulty, and in slope. Steeper slopes imply a narrower range of difficulties at the question level and mean the benchmark saturates quickly once some progress is made (a sketch of this curve follows below).
2. While a model with a score of 140 is expected to get 45% on SWE-Bench Verified, this is just an expectation. Individual models perform better or worse on specific tasks.
For instance, GPT-5 underperforms in GPQA Diamond but overperforms in VPCT.
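Here is a minimal sketch of the kind of curve being plotted: expected benchmark accuracy as a logistic function of ECI, with a per-benchmark difficulty (horizontal offset) and slope. The benchmark names and parameter values are illustrative, not our fitted ones.

```python
import numpy as np

def expected_score(eci, difficulty, slope):
    """Expected benchmark accuracy as a logistic function of a model's ECI score."""
    return 1.0 / (1.0 + np.exp(-slope * (eci - difficulty)))

# Hypothetical per-benchmark parameters: higher difficulty shifts the curve right,
# a steeper slope means the benchmark saturates quickly once models reach its level.
benchmarks = {
    "EasyBench": {"difficulty": 110, "slope": 0.10},
    "HardBench": {"difficulty": 150, "slope": 0.04},
}

for eci in (130, 140, 150):
    row = ", ".join(f"{name}: {expected_score(eci, **p):.0%}" for name, p in benchmarks.items())
    print(f"ECI {eci} -> {row}")
```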
The world is about to see multiple 1 GW+ AI data centers.
We mapped their construction using satellite imagery, permits & public sources — releasing everything for free, including commissioned satellite images.
Highlights in thread!
Several data centers will soon demand 1 GW of power, starting early next year:
- Anthropic–Amazon New Carlisle (January)
- xAI Colossus 2 (February)
- Microsoft Fayetteville (March, borderline 1 GW)
- Meta Prometheus (May)
- OpenAI Stargate Abilene (July)
The largest 2026 facility (xAI Colossus 2) will have the compute equivalent of 1.4M H100 GPUs.
Even larger data centers are coming: Meta Hyperion and Microsoft Fairwater will each have 5M H100e when they reach full capacity in late 2027 to early 2028.