Epoch AI
Investigating the trajectory of AI for the benefit of society.
Jun 8 11 tweets 3 min read
How do reasoning models solve hard math problems?

We asked 14 mathematicians to review o3-mini-high’s raw, unsummarized reasoning traces on 29 FrontierMath problems. Here’s what they found:

o3-mini-high is extremely knowledgeable, and it’s not pure memorization. It fairly reliably invokes relevant techniques and results from the mathematical literature, even when problems were designed to obscure them.
May 28 8 tweets 3 min read
The speed of computations on GPUs depends directly on the numeric format: less precision means more calculations on the same hardware.

We analyzed the numerical format used to train 272 models from 2008 to 2025. Here’s what we found. 🧵

Numerical formats tell computers how to represent numbers for calculations. Higher-precision formats like FP32 use more bits, storing each number to more significant digits. But precision comes at a cost: each calculation takes longer to carry out.
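To make the precision trade-off concrete, here is a minimal numpy sketch (my illustration, not from the thread) of what FP16’s roughly three significant digits give up relative to FP32:

```python
import numpy as np

# FP16 stores ~3 significant decimal digits; FP32 stores ~7.
x = 1.0001
print(np.float32(x))   # 1.0001 -- representable in FP32
print(np.float16(x))   # 1.0    -- rounded away in FP16

# FP16 values near 2048 are spaced 2 apart, so adding 1 is lost entirely.
# This is one reason mixed-precision training keeps accumulations in FP32.
print(np.float16(2048) + np.float16(1))   # 2048.0
print(np.float32(2048) + np.float32(1))   # 2049.0
```

Hardware trades away exactly this precision for throughput: modern accelerators deliver substantially higher peak FLOP/s at lower precision.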
Apr 23 12 tweets 4 min read
How quickly are AI supercomputers scaling, where are they, and who owns them?

Our new dataset covers 500+ of the largest AI supercomputers (aka GPU clusters or AI data centers) over the last six years.

Here is what we found 🧵

Performance has grown drastically: the FLOP/s of leading AI supercomputers have doubled every 9 months, driven by:
- Deploying more chips (1.6x/year)
- Higher performance per chip (1.6x/year)

Systems with 10,000 AI chips were rare in 2019. Now, leading companies have clusters 10x that size.
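As a quick cross-check of how those two drivers compound (my arithmetic, not an additional claim from the thread):

```python
import math

# 1.6x/year more chips times 1.6x/year per-chip performance
# compounds to 2.56x/year in total FLOP/s.
annual_growth = 1.6 * 1.6
doubling_months = 12 * math.log(2) / math.log(annual_growth)
print(f"{annual_growth:.2f}x/year -> doubling every {doubling_months:.1f} months")
# 2.56x/year -> doubling every 8.9 months
```

which matches the 9-month doubling time quoted above.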
Apr 18 6 tweets 3 min read
OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.

We evaluated the new models on our suite of math and science benchmarks. Results in thread!

On FrontierMath, our benchmark of highly challenging, original math questions, o4-mini with high reasoning sets a new record in our evaluations, with an accuracy of 17% (±2%)!

o3 scores 10% (±2%) with high reasoning, behind o4-mini and o3-mini.
Apr 11 7 tweets 3 min read
We’ve run independent evaluations of Grok-3 and Grok-3 mini on our suite of benchmarks!

Grok-3 currently doesn’t do extended reasoning, while Grok-3 mini is a reasoning model. We ran Grok-3 mini with both “low” and “high” reasoning effort.

Full results in thread!

On GPQA Diamond, Grok-3 is one of the top performers at 76% accuracy, beating the best competing non-reasoning models (GPT-4.5 and Claude 3.7 Sonnet) and some reasoning models like o1 and DeepSeek-R1. Grok-3 mini is slightly behind at 70 to 74%.
Apr 2 6 tweets 2 min read
Google DeepMind has released a new flagship model, Gemini 2.5 Pro.

We evaluated it on GPQA Diamond, and found a score of 84%, exactly matching the result reported by Google. This is the best result we have found on this benchmark to date!

For context, GPQA Diamond is a set of very difficult multiple-choice questions about biology, chemistry, and physics; human experts only score around 70%.
Mar 28 7 tweets 2 min read
Why have AI benchmark scores often felt disconnected from real-world usefulness? It wasn’t just the technical difficulty of creating realistic benchmarks. Historically, predicting the real-world capabilities of AI systems simply wasn’t the main goal. 🧵

For most of their history, AI benchmarks were designed to compare models: could a new model or technique improve on the state of the art? Benchmarks focused on tasks “just within reach”, where progress was measurable.
Mar 21 10 tweets 3 min read
Many AI leaders claim AI’s value will mainly come from accelerating R&D—“geniuses in datacenters.”

This view has key flaws: R&D contributes less to economic growth and is harder to automate than commonly believed. Most of AI’s value will instead come from broad deployment across the economy.

While it’s commonly assumed that growth is mostly driven by R&D, rigorous estimates don’t support this intuition. For example, the BLS estimates that private R&D accounts for only around 20% of labor productivity growth in the US since 1988.

bls.gov/productivity/h…
Mar 20 5 tweets 2 min read
We developed GATE: a model that shows how AI scaling and automation will impact growth.

It predicts trillion-dollar infrastructure investments, 30% annual growth, and full automation within decades.

Tweak the parameters: these transformative outcomes are surprisingly hard to avoid.

Imagine if a central bank took AI seriously. They’d build GATE, merging economics with AI scaling laws to show how innovation, automation, and investment interact.

At its core: more compute → more automation → growth → more investment in chips, fabs, etc.
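The loop lends itself to a toy simulation. The sketch below is my own caricature with made-up parameters, not the actual GATE model; it only shows how the compute → automation → output → reinvestment cycle feeds on itself:

```python
# Toy feedback loop (illustrative parameters, NOT GATE's):
# more compute -> higher automated share -> more output -> more compute.
compute = 1.0    # index of installed compute
for year in range(10):
    automated = min(1.0, 0.1 * compute ** 0.5)    # automation rises with compute
    output = (1 - automated) + automated * 5.0    # automated work is 5x as productive
    compute *= 1 + 0.3 * output                   # a share of output buys more compute
    print(f"year {year}: output {output:5.2f}  compute {compute:8.1f}")
```

Even with these arbitrary numbers, growth accelerates until automation saturates, which is why the thread notes these outcomes are hard to avoid.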
Mar 13 8 tweets 3 min read
How has the cost to use LLMs changed over time? Our analysis shows that the price to reach a given benchmark score has fallen dramatically: between 9x and 900x per year, depending on the benchmark and score. 🧵

[Plot: inference prices falling over time for fixed levels of performance.]

For example, GPT-4 cost about $40 per million tokens. 14 months later, Gemini 1.5 Flash could beat GPT-4’s score on a set of PhD-level science questions, despite being about 300x cheaper per token.
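Annualizing that example (my arithmetic): a 300x price drop over 14 months is

```python
# 300x over 14 months, expressed as an annual rate.
annualized = 300 ** (12 / 14)
print(f"~{annualized:.0f}x per year")  # ~133x per year
```

comfortably inside the 9x to 900x range.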
Mar 8 7 tweets 3 min read
Imagine if you could train one human for thousands of years to achieve unparalleled expertise, then make many copies. That’s what AI enables: spend heavily on training a single model, then cheaply replicate it. This creates a unique source of increasing returns at scale.

When you double your compute, you don’t just double output by deploying more AI instances: you can also train larger, more efficient models that reduce inference cost. Doing both simultaneously creates increasing returns, with output growing more than linearly.
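One way to see the increasing returns (a toy sketch with an assumed scaling exponent, not Epoch’s model): suppose inference cost per task falls as training compute to the -0.3 power, and total compute is split evenly between training and deployment.

```python
# Toy increasing-returns calculation (assumed exponent a = 0.3).
def output(total, train_frac=0.5, a=0.3):
    c_train = train_frac * total
    c_inf = (1 - train_frac) * total
    cost_per_task = c_train ** -a    # bigger training runs -> cheaper inference
    return c_inf / cost_per_task     # tasks completed with the deployment budget

for c in (1.0, 2.0, 4.0):
    print(f"compute {c:.0f}x -> output {output(c) / output(1.0):.2f}x")
# 2x compute -> ~2.46x output; 4x -> ~6.06x: more than linear.
```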
Mar 7 8 tweets 2 min read
In this week’s Gradient Updates issue, @EgeErdil2 argues that extrapolating past and current AI capabilities into the future yields excessively conservative estimates of AI’s impact, and that it’s often better to rely on first-principles reasoning to make predictions. 🧵

We can only extrapolate capabilities into the future when AI already shows some competence at them. So extrapolations often end up anchoring too much on what AI can do today, and not enough on the new tasks it will be able to do in the future.
Feb 20 7 tweets 2 min read
xAI has launched Grok-3, its new flagship model.

Grok-3 sets a new record in compute scale: we estimate that it was likely trained using 4e26 to 5e26 floating point operations (FLOP). Significantly, Grok-3 is the first released model known to be trained on over 1e26 FLOP.

xAI has not disclosed full details about Grok-3’s training. However, the known cluster size (100k–200k GPUs), the approximate training duration, and xAI’s comparisons of Grok-3’s compute usage to Grok-2 and GPT-4 all point to a total compute of around 4e26 to 5e26 FLOP.
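A rough reconstruction of that style of estimate (with my assumed utilization and duration, not xAI’s or Epoch’s exact inputs):

```python
# 100k H100-class GPUs, ~1e15 dense BF16 FLOP/s peak each,
# ~30% utilization, ~150 days of training (all assumed).
gpus = 100_000
peak_flops = 1e15
utilization = 0.30
seconds = 150 * 24 * 3600
print(f"{gpus * peak_flops * utilization * seconds:.1e} FLOP")
# ~3.9e26 FLOP -- consistent with the 4e26 to 5e26 estimate
```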
Feb 13 7 tweets 2 min read
How much AI compute exists globally? How rapidly is it growing?

We analyzed NVIDIA's GPU shipments since 2018 to answer these questions, and found that the installed computing power of NVIDIA chips has doubled every 10 months on average since 2019.

To track installed compute, we used data from NVIDIA’s financial reports, along with a new dataset of over 700 AI data centers (forthcoming). This dataset enables us to see the relative quantities of each GPU model as they come into operation.
Feb 7 8 tweets 2 min read
How much energy does a ChatGPT query consume? One common estimate is 3 watt-hours per query.

However, in this week’s Gradient Update we find that it’s probably about 10x less for a typical query today. 🧵

The main energy cost of a query comes from running the model (inference). We estimate the compute cost using GPT-4o (estimated 100B active parameters) as our reference, assuming a typical 500-token response (~400 words).
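A back-of-envelope version of that estimate (assumed hardware numbers, not necessarily the ones in the Gradient Update):

```python
# ~2 FLOP per active parameter per generated token.
params = 100e9               # GPT-4o active-parameter estimate
tokens = 500                 # typical response length
flop = 2 * params * tokens   # ~1e14 FLOP per query

gpu_power = 1500                 # W per H100 incl. server overhead (assumed)
effective_flops = 0.10 * 1e15    # ~10% inference utilization of peak (assumed)
joules = flop / effective_flops * gpu_power
print(f"{joules / 3600:.2f} Wh per query")  # ~0.4 Wh, ~10x below 3 Wh
```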
Feb 7 10 tweets 3 min read
We’re excited to announce a major update to the Epoch AI Benchmarking Hub!

The Benchmarking Hub hosts our independent evaluations of AI models. This latest release overhauls how we run and share AI benchmarks, making the data more transparent, systematic, and up to date. 🧵

What’s new for you?

• Richer Data: See comprehensive details on each evaluation and the model behind it.
• More Frequent Updates: Expect fresh benchmarking results soon after new models launch.
Jan 30 7 tweets 3 min read
How many AI models have been trained at scales similar to GPT-4 and beyond? These are some of the most advanced models, costing tens of millions of dollars to train, and they will be subject to extra scrutiny under the EU AI Act. 🧵

In short, we identified 24 models that we estimate were trained on over 10^25 FLOP.

The first known model at this scale was OpenAI’s GPT-4, released in early 2023 and reportedly trained with 2e25 FLOP on 25,000 A100 GPUs.
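That reported figure is easy to sanity-check (with my assumed utilization and duration, since OpenAI disclosed neither):

```python
# 25,000 A100s at 312 TFLOP/s peak dense BF16, ~32% utilization, ~90 days.
gpus = 25_000
peak = 312e12
mfu = 0.32        # assumed model FLOP utilization
seconds = 90 * 24 * 3600
print(f"{gpus * peak * mfu * seconds:.1e} FLOP")  # ~1.9e25, close to 2e25
```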
Jan 25 9 tweets 2 min read
In a new Gradient Update, @MatthewJBar analyzes the impact of AGI on human wages. He concludes that if AGI can fully substitute for human labor, it might cause wages to crash. Eventually, wages may drop below subsistence level, the minimum required for human survival. 🧵

His argument is based on diminishing returns to labor. In a simple model of production, increasing labor decreases wages, all else being equal. While simplistic, this argument suggests that if AGIs are scaled much faster than physical capital, human wages will crash.
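To make the diminishing-returns step explicit, here is the standard Cobb-Douglas illustration (my choice of production function; the post’s model may differ):

```latex
Y = K^{\alpha} L^{1-\alpha}, \qquad
w = \frac{\partial Y}{\partial L} = (1-\alpha)\left(\frac{K}{L}\right)^{\alpha}
```

With capital K fixed, scaling up effective labor L (say, by running ever more AGI copies) drives the wage w down like (K/L)^α, which is the crash mechanism in the argument.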
Jan 15 5 tweets 2 min read
When will an open-weight AI model be trained with 1e26 FLOP?

The Biden administration's new AI export restrictions regulate models above 1e26 FLOP… unless there’s an open-weight model that exceeds this. When will that happen, and how fast will the threshold increase? 🧵

We looked at open-weight models in our dataset of notable models, and identified the releases that pushed forward the frontier of open-weight training compute. These “top-1” models have historically grown in compute by 4.7x per year.
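At that trend, the threshold is not far off. An illustrative extrapolation (the starting point is my assumption, roughly the estimated scale of Llama 3.1 405B):

```python
import math

# Frontier open-weight compute ~4e25 FLOP (assumed), growing 4.7x/year.
current, target, growth = 4e25, 1e26, 4.7
years = math.log(target / current) / math.log(growth)
print(f"~{years:.1f} years to cross 1e26 FLOP")  # ~0.6 years
```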
Jan 10 7 tweets 2 min read
What would happen if remote work were fully automated? In a new Gradient Updates issue, @MatthewJBar argues the economic impact would be massive, with the economy doubling in size even in the most conservative scenario. 🧵

Using GPT-4o to analyze tasks in the O*NET database, Matthew finds that 34% of work in the US economy can be performed remotely, an interesting discrepancy with a major existing study.
Jan 8 7 tweets 2 min read
The amount of compute used to train frontier models has been growing at a breakneck pace of over 4x per year since 2018, resulting in an overall scale-up of more than 10,000x! But what factors are enabling this rapid growth? 🧵

[Plot: three factors contributing to the overall growth in training compute.]

We decompose training compute into three constituent factors: the quantity of training hardware, the computing power of that hardware in FLOP per second, and the amount of time spent training. We fit trends on each of these underlying factors.
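Because compute is the product of those three factors, their growth rates multiply. A minimal sketch with illustrative (assumed, not fitted) rates:

```python
# training compute = chips x FLOP/s per chip x training seconds,
# so the annual growth factors multiply. Rates below are illustrative.
chip_growth = 1.7   # x/year in hardware quantity (assumed)
perf_growth = 1.4   # x/year in FLOP/s per chip (assumed)
time_growth = 1.7   # x/year in training duration (assumed)
print(f"~{chip_growth * perf_growth * time_growth:.1f}x/year")  # ~4.0x/year
```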