Epoch AI
Aug 2, 2025 · 10 tweets · 3 min read
How big of a paradigm shift was the rise of reasoning models? We dug into the data and found that, at least on some benchmarks, reasoning models were likely as large an algorithmic advance as the Transformer.
When OpenAI released o1, it blew its predecessor GPT-4o out of the water on some math and science benchmarks. The difference was reasoning training and test-time scaling: o1 was trained to optimize its chain-of-thought, allowing extensive thinking before responding to users.
This represented a huge algorithmic improvement. To reach o1-high’s GPQA diamond performance with a non-reasoning model, you’d need 9x more pre-training compute than GPT-4o. That’s larger than the gain from switching from Kaplan to Chinchilla scaling laws!
The results are similarly striking with different models and benchmarks. On MATH, Mock AIME, and GPQA diamond, different versions of o1 are often the “equivalent” of GPT-4o trained with over 10x its pre-training compute.
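A minimal sketch of how a compute-equivalent gain could be backed out: fit a curve of benchmark accuracy against log pre-training compute for non-reasoning models, then invert it at the reasoning model's score. This assumes a logistic functional form and uses entirely made-up numbers; it illustrates the idea, not Epoch's actual fitting procedure.

```python
# Toy compute-equivalent gain (CEG) calculation. Assumptions: accuracy is
# logistic in log10(pre-training FLOP); all data points below are invented.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_compute, midpoint, slope):
    """Expected benchmark accuracy as a function of log10 compute (FLOP)."""
    return 1.0 / (1.0 + np.exp(-slope * (log_compute - midpoint)))

# Hypothetical non-reasoning models: (log10 pre-training FLOP, accuracy)
log_flop = np.array([23.0, 24.0, 25.0, 25.6])  # 25.6 ~ the baseline's run
accuracy = np.array([0.18, 0.30, 0.45, 0.53])

(midpoint, slope), _ = curve_fit(logistic, log_flop, accuracy, p0=[25.0, 1.0])

# Invert the fitted curve at the reasoning model's score (made up: 0.68)
reasoning_score = 0.68
equiv_log_flop = midpoint - np.log(1.0 / reasoning_score - 1.0) / slope

base_log_flop = 25.6  # the non-reasoning baseline's compute
ceg = 10 ** (equiv_log_flop - base_log_flop)
print(f"Compute-equivalent gain: {ceg:.1f}x")  # ~10x with these toy numbers
```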
We see a similar pattern with Anthropic’s Claude family. Assuming that Claude 3.7 Sonnet is a reasoning-trained version of Claude 3.5 Sonnet, we again often see “compute-equivalent gains” of a similar scale. Image
Of course, these estimates are highly uncertain. We have little data to work with, we may have incorrectly identified pairs of reasoning/non-reasoning models, and our estimates depend on the amount of test-time scaling performed.
Some benchmarks are also much more amenable to reasoning than others. For example, reasoning models show almost no improvement on GeoBench, but clearly exceed the trend in non-reasoning model performance on Mock AIME.
But benchmarks like GeoBench are exceptions to the norm. The average score of reasoning models was higher than that of non-reasoning models on ~90% of tested benchmarks, though this is in part because reasoning models tend to be more recent.
Overall, our results suggest that the shift to reasoning models was a huge algorithmic improvement. On benchmarks amenable to reasoning, we find compute-equivalent gains of around 10x, roughly as large as the gain from the Transformer.
This post was written by @ansonwhho and @ardenaberg. You can read the full post here: epoch.ai/gradient-updat…

More from @EpochAIResearch

Jan 28, 2026
Was serving GPT-5 profitable?

According to @Jsevillamol, @exponentialview’s Hannah Petrovic, and @ansonwhho, it depends. Gross margins were around 45%, making inference look profitable.

But after accounting for the cost of operations, OpenAI likely incurred a loss. 🧵
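To make the gross-vs-net distinction concrete, here is a toy calculation. Only the ~45% gross margin comes from the thread; every other number is hypothetical.

```python
# Hypothetical numbers illustrating why a positive gross margin can still
# mean a net loss; these are not the figures behind Epoch's estimate.
inference_revenue = 100.0  # $ per unit time, made up
compute_cost = 55.0        # GPU/serving cost of that inference, made up

gross_profit = inference_revenue - compute_cost
gross_margin = gross_profit / inference_revenue
print(f"Gross margin: {gross_margin:.0%}")  # 45%: inference looks profitable

# Operating costs (staff, sales, overhead) sit below the gross-margin line.
operating_costs = 60.0     # made up
net = gross_profit - operating_costs
print(f"Net: {net:+.1f}")  # negative: a loss once operations are included
```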
Even the gross profits from running models weren’t enough to recoup R&D costs.

Gross profits from running GPT-5 were less than OpenAI's R&D costs in the four months before launch. And the true R&D cost was likely higher than that.
The core problem: AI R&D is expensive, and model lifecycles are too short to generate enough revenue.

So even if running models is profitable, the full lifecycle is likely loss-making, assuming GPT-5 is representative of other models.
Jan 8, 2026
Global AI compute capacity now totals over 15 million H100-equivalents.

Our new AI Chip Sales data explorer tracks where this compute comes from across Nvidia, Google, Amazon, AMD, and Huawei, making it the most comprehensive public dataset available.
Nvidia’s B300 GPU now accounts for the majority of its revenue from AI chips, while H100s make up under 10%.

We estimate chip-level spending using earnings reports, company disclosures, and analyst and media coverage.
These chips present massive resource demands.

Even before the power overheads of servers and data centers, this many chips would draw over 10 GW of power, around twice the average power consumption of New York City.
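The 10 GW figure follows from chip-level arithmetic. A quick check, assuming each H100-equivalent draws the H100's ~700 W chip-level TDP (the thread likewise excludes server and data-center overheads):

```python
# Back-of-the-envelope check of the 10 GW figure.
chips = 15_000_000    # H100-equivalents, from the thread
watts_per_chip = 700  # H100 SXM TDP, chip level only

total_gw = chips * watts_per_chip / 1e9
print(f"{total_gw:.1f} GW")  # ~10.5 GW, roughly 2x NYC's ~5 GW average load
```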
Dec 12, 2025
GPT-5.2 scores 152 on the Epoch Capabilities Index (ECI), our tool for aggregating benchmark scores. This puts it second only to Gemini 3 Pro.

🧵 with individual scores.
GPT-5.2 ranks first or second on most of the benchmarks we run ourselves, including a top score on FrontierMath Tiers 1–3 and our new chess puzzles benchmark. The exception is SimpleQA Verified, where it scores notably worse than even previous GPT-5 series models.
Our AIME variant, OTIS Mock AIME 2024-2025, is nearly saturated. There remains a single problem that no model has solved; its diagram is given to the model in the Asymptote vector graphics language.
Nov 10, 2025
AI data center buildouts already rival the Manhattan Project in scale, but there’s little public info about them.

So we spent the last few months reading legal permits, staring at satellite images, and scouring news sources.

Here’s what you need to know. 🧵
AI data centers will be some of the biggest infrastructure projects in history.

e.g. OpenAI’s Stargate Abilene will need:
- As much power as Seattle (1 GW)
- >250× the compute of the GPT-4 cluster
- 450 soccer fields of land
- $32B
- Thousands of workers
- 2 years to build
By the end of the year, AI data centers could collectively see >$300 billion in investment, around 1% of US GDP.

That’s bigger than the Apollo Program (0.8%) and Manhattan Project (0.4%) at their peaks.
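A quick sanity check of the GDP comparison, assuming US GDP of roughly $30 trillion; the thread doesn't state the exact denominator Epoch used.

```python
# Investment as a share of GDP; the GDP figure is an assumption.
investment = 300e9  # >$300B in AI data center investment, from the thread
us_gdp = 30e12      # assumed: ~$30T US GDP
print(f"{investment / us_gdp:.1%}")  # ~1.0%, vs Apollo 0.8%, Manhattan 0.4%
```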
Nov 7, 2025
The Epoch Capabilities Index is a useful way to measure model capabilities, but what does a score of 150 actually mean?

One way to read our new capability index is to plot the benchmark performance you’d expect to see for a range of ECI scores 🧵
Three important takeaways:

1. Benchmarks vary in overall difficulty, and in slope. Steeper slopes imply a narrower range of difficulties at the question level and mean the benchmark saturates quickly once some progress is made.
2. While a model with a score of 140 is expected to get 45% on SWE-Bench Verified, this is just an expectation. Individual models perform better or worse on specific tasks.

For instance, GPT-5 underperforms on GPQA Diamond but overperforms on VPCT.
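As a rough illustration of reading an ECI score as expected benchmark performance, here is a sketch assuming a logistic curve per benchmark with a difficulty midpoint and a slope, loosely in the spirit of item-response-theory fits. The functional form and all parameter values are assumptions for illustration, not Epoch's actual fit.

```python
# Toy mapping from ECI score to expected benchmark accuracy.
import math

def expected_score(eci, difficulty, slope):
    """Expected accuracy for a model with a given ECI, logistic link."""
    return 1.0 / (1.0 + math.exp(-slope * (eci - difficulty)))

# Hypothetical benchmarks: a steeper slope saturates faster past its midpoint
benchmarks = {
    "easy, steep": (120, 0.20),    # (difficulty midpoint, slope), made up
    "hard, shallow": (155, 0.05),
}
for name, (difficulty, slope) in benchmarks.items():
    for eci in (130, 140, 150):
        print(f"{name}: ECI {eci} -> "
              f"{expected_score(eci, difficulty, slope):.0%}")
```

With these made-up parameters, the "easy, steep" benchmark is already near saturation at ECI 140, while the "hard, shallow" one improves only gradually, which is the slope effect described above.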
Nov 4, 2025
Announcing our Frontier Data Centers Hub!

The world is about to see multiple 1 GW+ AI data centers.

We mapped their construction using satellite imagery, permits & public sources — releasing everything for free, including commissioned satellite images.

Highlights in thread!
Several data centers will soon demand 1 GW of power, starting early next year:
- Anthropic–Amazon New Carlisle (January)
- xAI Colossus 2 (February)
- Microsoft Fayetteville (March, borderline 1 GW)
- Meta Prometheus (May)
- OpenAI Stargate Abilene (July)
The largest 2026 facility (xAI Colossus 2) will have the compute equivalent of 1.4M H100 GPUs.

Even larger data centers are coming: Meta Hyperion and Microsoft Fairwater will each have 5M H100e when they reach full capacity in late 2027 to early 2028.
