The Epoch Capabilities Index is a useful way to measure model capabilities, but what does a score of 150 actually mean?
One way to read our new capabilities index is to plot the benchmark performance you'd expect to see across a range of ECI scores 🧵
Three important takeaways:
1. Benchmarks vary in overall difficulty and in slope. A steeper slope implies a narrower range of difficulties at the question level, and means the benchmark saturates quickly once some progress is made (see the sketch below).
2. While a model with a score of 140 is expected to get 45% on SWE-Bench Verified, this is just an expectation. Individual models perform better or worse on specific tasks.
For instance, GPT-5 underperforms on GPQA Diamond but overperforms on VPCT.
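For intuition, here is a minimal sketch of the kind of curve described above: a logistic that maps an ECI score to expected benchmark accuracy, with a per-benchmark location ("difficulty") and slope. All names and parameter values here are illustrative assumptions, not Epoch's fitted model.

```python
import math

def expected_accuracy(eci: float, difficulty: float, slope: float) -> float:
    """Hypothetical logistic link from an ECI score to expected benchmark accuracy.

    difficulty: the ECI score at which expected accuracy crosses 50%.
    slope: how fast accuracy rises around that point; a steeper slope means the
    benchmark's questions span a narrower range of difficulties, so it saturates
    quickly once models clear the threshold.
    """
    return 1.0 / (1.0 + math.exp(-slope * (eci - difficulty)))

# Illustrative parameters only.
benchmarks = {
    "broad benchmark (shallow slope)": {"difficulty": 130, "slope": 0.05},
    "narrow benchmark (steep slope)":  {"difficulty": 150, "slope": 0.25},
}

for eci in (130, 140, 150, 160):
    expectations = {name: f"{expected_accuracy(eci, **params):.0%}"
                    for name, params in benchmarks.items()}
    print(eci, expectations)
```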
The world is about to see multiple 1 GW+ AI data centers.
We mapped their construction using satellite imagery, permits & public sources — releasing everything for free, including commissioned satellite images.
Highlights in thread!
Several data centers will soon demand 1 GW of power, starting early next year:
- Anthropic–Amazon New Carlisle (January)
- xAI Colossus 2 (February)
- Microsoft Fayetteville (March, borderline 1 GW)
- Meta Prometheus (May)
- OpenAI Stargate Abilene (July)
The largest 2026 facility (xAI Colossus 2) will have the compute equivalent of 1.4M H100 GPUs.
Even larger data centers are coming: Meta Hyperion and Microsoft Fairwater will each have 5M H100e (H100 equivalents) of compute when they reach full capacity in late 2027 to early 2028.
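As a rough sanity check on how these chip counts relate to the ~1 GW power figures, here is a back-of-the-envelope sketch. Only the 1.4M H100e figure comes from the thread; the per-chip wattage and overhead multiplier are assumptions, and newer chips deliver more compute per watt than an H100, so the true draw per H100e is likely lower.

```python
# Back-of-the-envelope only; the constants below are assumptions, not Epoch's estimates.
H100_EQUIVALENTS = 1.4e6  # Colossus 2 compute in H100 equivalents (from the thread)
WATTS_PER_H100E  = 700    # assumed: roughly an H100 SXM's TDP; efficiency gains push this down
OVERHEAD         = 1.3    # assumed: cooling, networking, CPUs, storage (PUE-style multiplier)

chip_gw  = H100_EQUIVALENTS * WATTS_PER_H100E / 1e9
total_gw = chip_gw * OVERHEAD
print(f"chips ≈ {chip_gw:.2f} GW, facility ≈ {total_gw:.2f} GW")  # ≈ 0.98 GW and ≈ 1.27 GW
```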
We evaluated Gemini 2.5 Deep Think on FrontierMath. There is no API, so we ran it manually. The results: a new record!
We also conducted a more holistic evaluation of its math capabilities. 🧵
Note that this is the publicly available version of Deep Think, not the version that achieved a gold medal-equivalent score on the IMO. Google has described the publicly available Deep Think model as a “variation” of the IMO gold model.
Good performance on FrontierMath requires deep background knowledge and precise execution of computations. Deep Think has made progress but hasn’t yet mastered these skills, still scoring lower on the harder tiers of the benchmark.
Sora 2 can solve questions from LLM benchmarks, despite being a video model.
We tested Sora 2 on a small subset of GPQA questions, and it scored 55%, compared to GPT-5’s score of 72%.
GPQA Diamond is a benchmark of challenging multiple-choice science questions, like the attached example. We randomly selected 10 questions from the benchmark and ran Sora on each one repeatedly until we had four videos per question.
To evaluate Sora on a test designed for language models, we prefixed each question with a prompt requesting a video of a professor showing the answer letter (A–D) on a piece of paper. Videos that did not show an unambiguous letter were counted as incorrect.
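A minimal sketch of this grading scheme, assuming each generated video is scored independently (the thread does not say how the four videos per question were aggregated). The prompt wording and data structures are hypothetical, and reading the letter off each video is a manual step, not something the code does.

```python
# Sketch of the scoring rule: a video counts as correct only if it unambiguously
# shows the right letter. Prompt wording and structures are hypothetical.

PROMPT_PREFIX = (
    "Generate a video of a professor holding up a piece of paper that shows the "
    "letter (A, B, C, or D) answering the following question: "
)

def score(videos: dict[str, list[str | None]], answer_key: dict[str, str]) -> float:
    """videos maps each question ID to the letters read off its generated videos
    (None when no unambiguous letter was visible)."""
    graded = [
        letter is not None and letter == answer_key[qid]
        for qid, letters in videos.items()
        for letter in letters
    ]
    return sum(graded) / len(graded)

# e.g. 10 questions x 4 videos = 40 graded videos; 22 correct would give 55%.
```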
Why did OpenAI train GPT-5 with less compute than GPT-4.5?
Due to the higher returns to post-training, they scaled post-training as much as possible on a smaller model
And since post-training started from a much lower base, this meant a decrease in total training FLOP 🧵
The invention of reasoning models made it possible to greatly improve performance by scaling up post-training compute. This improvement is so great that GPT-5 outperforms GPT-4.5 despite having used less training compute overall.
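A toy version of that arithmetic, with compute measured in units of GPT-4.5's pre-training run. Every number is an illustrative assumption, not an actual OpenAI figure; only the qualitative point tracks the thread.

```python
# Illustrative only: compute in units of GPT-4.5's pre-training run, not real FLOP counts.

gpt45_pretrain  = 1.00                  # the larger pre-training run (defines the unit)
gpt45_posttrain = 0.01                  # assumed: pre-reasoning post-training was a ~1% add-on

gpt5_pretrain   = 0.30                  # assumed: a substantially smaller pre-training run
gpt5_posttrain  = 30 * gpt45_posttrain  # assumed: post-training scaled up ~30x from its tiny base

print("GPT-4.5-style total:", gpt45_pretrain + gpt45_posttrain)  # 1.01
print("GPT-5-style total:  ", gpt5_pretrain + gpt5_posttrain)    # 0.60
# A large multiple of a small post-training budget can still leave total training
# compute below the bigger model's pre-training alone.
```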