Epoch AI
Aug 2, 2025 · 10 tweets · 3 min read
How big of a paradigm shift was the rise of reasoning models? We dug into the data and found that, at least on some benchmarks, reasoning models were likely as large an algorithmic advance as the Transformer.
When OpenAI released o1, it blew its predecessor GPT-4o out of the water on some math and science benchmarks. The difference was reasoning training and test-time scaling: o1 was trained to optimize its chain-of-thought, allowing extensive thinking before responding to users.
This represented a huge algorithmic improvement. To reach o1-high’s GPQA diamond performance with a non-reasoning model, you’d need 9x more pre-training compute than GPT-4o. That’s larger than the gain from switching from Kaplan to Chinchilla scaling laws!
The results are similarly striking with different models and benchmarks. On MATH, Mock AIME, and GPQA diamond, different versions of o1 are often the “equivalent” of GPT-4o trained with over 10x its pre-training compute.
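A minimal sketch of how a compute-equivalent gain could be backed out: fit a curve of benchmark accuracy against log pre-training compute for non-reasoning models, then invert it at the reasoning model's score. This assumes a logistic functional form and uses entirely made-up numbers; it illustrates the idea, not Epoch's actual fitting procedure.

```python
# Toy compute-equivalent gain (CEG) calculation. Assumptions: accuracy is
# logistic in log10(pre-training FLOP); all data points below are invented.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_compute, midpoint, slope):
    """Expected benchmark accuracy as a function of log10 compute (FLOP)."""
    return 1.0 / (1.0 + np.exp(-slope * (log_compute - midpoint)))

# Hypothetical non-reasoning models: (log10 pre-training FLOP, accuracy)
log_flop = np.array([23.0, 24.0, 25.0, 25.6])  # 25.6 ~ the baseline's run
accuracy = np.array([0.18, 0.30, 0.45, 0.53])

(midpoint, slope), _ = curve_fit(logistic, log_flop, accuracy, p0=[25.0, 1.0])

# Invert the fitted curve at the reasoning model's score (made up: 0.68)
reasoning_score = 0.68
equiv_log_flop = midpoint - np.log(1.0 / reasoning_score - 1.0) / slope

base_log_flop = 25.6  # the non-reasoning baseline's compute
ceg = 10 ** (equiv_log_flop - base_log_flop)
print(f"Compute-equivalent gain: {ceg:.1f}x")  # ~10x with these toy numbers
```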
We see a similar pattern with Anthropic’s Claude family. Assuming that Claude 3.7 Sonnet is a reasoning-trained version of Claude 3.5 Sonnet, we again often see “compute-equivalent gains” of a similar scale. Image
Of course, these estimates are highly uncertain. We have little data to work with, we may have incorrectly identified pairs of reasoning/non-reasoning models, and our estimates depend on the amount of test-time scaling performed.
Some benchmarks are also much more amenable to reasoning than others. For example, reasoning models show almost no improvement on GeoBench, but clearly exceed the trend in non-reasoning model performance on Mock AIME.
But benchmarks like GeoBench are exceptions to the norm. The average score of reasoning models was higher than that of non-reasoning models on ~90% of tested benchmarks, though this is in part because reasoning models tend to be more recent.
Overall, our results suggest that the shift to reasoning models was a huge algorithmic improvement. On benchmarks amenable to reasoning, we find compute-equivalent gains of around 10x, roughly as large as the gain from the Transformer.
This post was written by @ansonwhho and @ardenaberg. You can read the full post here: epoch.ai/gradient-updat…

More from @EpochAIResearch

Jan 28, 2026
Was serving GPT-5 profitable?

According to @Jsevillamol, @exponentialview’s Hannah Petrovic, and @ansonwhho, it depends. Gross margins were around 45%, making inference look profitable.

But after accounting for the cost of operations, OpenAI likely incurred a loss. 🧵
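To make the gross-vs-net distinction concrete, here is a toy calculation. Only the ~45% gross margin comes from the thread; every other number is hypothetical.

```python
# Hypothetical numbers illustrating why a positive gross margin can still
# mean a net loss; these are not the figures behind Epoch's estimate.
inference_revenue = 100.0  # $ per unit time, made up
compute_cost = 55.0        # GPU/serving cost of that inference, made up

gross_profit = inference_revenue - compute_cost
gross_margin = gross_profit / inference_revenue
print(f"Gross margin: {gross_margin:.0%}")  # 45%: inference looks profitable

# Operating costs (staff, sales, overhead) sit below the gross-margin line.
operating_costs = 60.0     # made up
net = gross_profit - operating_costs
print(f"Net: {net:+.1f}")  # negative: a loss once operations are included
```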
Even the gross profits from running models weren’t enough to recoup R&D costs.

Gross profits from running GPT-5 were less than OpenAI's R&D costs in the four months before launch. And the true R&D cost was likely higher than that.
The core problem: AI R&D is expensive, and model lifecycles are too short to generate enough revenue.

So even if running models is profitable, the full lifecycle is likely loss-making, assuming GPT-5 is representative of other models.
Jan 8, 2026
Global AI compute capacity now totals over 15 million H100-equivalents.

Our new AI Chip Sales data explorer tracks where this compute comes from across Nvidia, Google, Amazon, AMD, and Huawei, making it the most comprehensive public dataset available.
Nvidia’s B300 GPU now accounts for the majority of its revenue from AI chips, while H100s make up under 10%.

We estimate chip-level spending using earnings reports, company disclosures, and analyst and media coverage.
These chips present massive resource demands.

Even before the power overheads of servers and data centers, this many chips would draw over 10 GW of power, around twice the average power consumption of New York City.
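The 10 GW figure follows from chip-level arithmetic. A quick check, assuming each H100-equivalent draws the H100's ~700 W chip-level TDP (the thread likewise excludes server and data-center overheads):

```python
# Back-of-the-envelope check of the 10 GW figure.
chips = 15_000_000    # H100-equivalents, from the thread
watts_per_chip = 700  # H100 SXM TDP, chip level only

total_gw = chips * watts_per_chip / 1e9
print(f"{total_gw:.1f} GW")  # ~10.5 GW, roughly 2x NYC's ~5 GW average load
```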
Dec 12, 2025
GPT-5.2 scores 152 on the Epoch Capabilities Index (ECI), our tool for aggregating benchmark scores. This puts it second only to Gemini 3 Pro.

🧵 with individual scores.
GPT-5.2 ranks first or second on most of the benchmarks we run ourselves, including a top score on FrontierMath Tiers 1–3 and our new chess puzzles benchmark. The exception is SimpleQA Verified, where it scores notably worse than even previous GPT-5 series models.
Our AIME variant, OTIS Mock AIME 2024-2025, is nearly saturated. There remains a single problem that no model has solved; its diagram is given to the model in the Asymptote vector graphics language.
Nov 10, 2025
AI data center buildouts already rival the Manhattan Project in scale, but there’s little public info about them.

So we spent the last few months reading legal permits, staring at satellite images, and scouring news sources.

Here’s what you need to know. 🧵
AI data centers will be some of the biggest infrastructure projects in history.

e.g. OpenAI’s Stargate Abilene will need:
- As much power as Seattle (1 GW)
- >250× the compute of the GPT-4 cluster
- 450 soccer fields of land
- $32B
- Thousands of workers
- 2 years to build
By the end of the year, AI data centers could collectively see >$300 billion in investment, around 1% of US GDP.

That’s bigger than the Apollo Program (0.8%) and Manhattan Project (0.4%) at their peaks.
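A quick sanity check of the GDP comparison, assuming US GDP of roughly $30 trillion; the thread doesn't state the exact denominator Epoch used.

```python
# Investment as a share of GDP; the GDP figure is an assumption.
investment = 300e9  # >$300B in AI data center investment, from the thread
us_gdp = 30e12      # assumed: ~$30T US GDP
print(f"{investment / us_gdp:.1%}")  # ~1.0%, vs Apollo 0.8%, Manhattan 0.4%
```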
Nov 7, 2025
The Epoch Capabilities Index is a useful way to measure model capabilities, but what does a score of 150 actually mean?

One way to read our new capability index is to plot the benchmark performance you’d expect to see for a range of ECI scores 🧵
Three important takeaways:

1. Benchmarks vary in overall difficulty, and in slope. Steeper slopes imply a narrower range of difficulties at the question level and mean the benchmark saturates quickly once some progress is made.
2. While a model with a score of 140 is expected to get 45% on SWE-Bench Verified, this is just an expectation. Individual models perform better or worse on specific tasks.

For instance, GPT-5 underperforms on GPQA Diamond but overperforms on VPCT.
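As a rough illustration of reading an ECI score as expected benchmark performance, here is a sketch assuming a logistic curve per benchmark with a difficulty midpoint and a slope, loosely in the spirit of item-response-theory fits. The functional form and all parameter values are assumptions for illustration, not Epoch's actual fit.

```python
# Toy mapping from ECI score to expected benchmark accuracy.
import math

def expected_score(eci, difficulty, slope):
    """Expected accuracy for a model with a given ECI, logistic link."""
    return 1.0 / (1.0 + math.exp(-slope * (eci - difficulty)))

# Hypothetical benchmarks: a steeper slope saturates faster past its midpoint
benchmarks = {
    "easy, steep": (120, 0.20),    # (difficulty midpoint, slope), made up
    "hard, shallow": (155, 0.05),
}
for name, (difficulty, slope) in benchmarks.items():
    for eci in (130, 140, 150):
        print(f"{name}: ECI {eci} -> "
              f"{expected_score(eci, difficulty, slope):.0%}")
```

With these made-up parameters, the "easy, steep" benchmark is already near saturation at ECI 140, while the "hard, shallow" one improves only gradually, which is the slope effect described above.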
Nov 4, 2025
Announcing our Frontier Data Centers Hub!

The world is about to see multiple 1 GW+ AI data centers.

We mapped their construction using satellite imagery, permits & public sources — releasing everything for free, including commissioned satellite images.

Highlights in thread!
Several data centers will soon demand 1 GW of power, starting early next year:
- Anthropic–Amazon New Carlisle (January)
- xAI Colossus 2 (February)
- Microsoft Fayetteville (March, borderline 1 GW)
- Meta Prometheus (May)
- OpenAI Stargate Abilene (July)
The largest 2026 facility (xAI Colossus 2) will have the compute equivalent of 1.4M H100 GPUs.

Even larger data centers are coming: Meta Hyperion and Microsoft Fairwater will each have 5M H100e when they reach full capacity in late 2027 to early 2028.
