Epoch AI
Investigating the trajectory of AI for the benefit of society.
Jun 8 11 tweets 3 min read
How do reasoning models solve hard math problems?

We asked 14 mathematicians to review o3-mini-high’s raw, unsummarized reasoning traces on 29 FrontierMath problems. Here’s what they found:

o3-mini-high is extremely knowledgeable, and it’s not pure memorization. It fairly reliably invokes relevant techniques and results from the mathematical literature, even when problems were designed to obscure them.
May 28 8 tweets 3 min read
The speed of computations on GPUs depends directly on the numeric format: less precision means more calculations on the same hardware.

We analyzed the numerical format used to train 272 models from 2008 to 2025. Here’s what we found. 🧵

Numerical formats tell computers how to represent numbers for calculations. Higher-precision formats like FP32 use more bits, storing each number to more significant digits. But precision comes at a cost: each calculation takes longer to carry out.
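To make the precision trade-off concrete, here is a minimal numpy sketch (my illustration, not from the thread) of what FP16’s roughly three significant digits give up relative to FP32:

```python
import numpy as np

# FP16 stores ~3 significant decimal digits; FP32 stores ~7.
x = 1.0001
print(np.float32(x))   # 1.0001 -- representable in FP32
print(np.float16(x))   # 1.0    -- rounded away in FP16

# FP16 values near 2048 are spaced 2 apart, so adding 1 is lost entirely.
# This is one reason mixed-precision training keeps accumulations in FP32.
print(np.float16(2048) + np.float16(1))   # 2048.0
print(np.float32(2048) + np.float32(1))   # 2049.0
```

Hardware trades away exactly this precision for throughput: modern accelerators deliver substantially higher peak FLOP/s at lower precision.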
Apr 23 12 tweets 4 min read
How quickly are AI supercomputers scaling, where are they, and who owns them?

Our new dataset covers 500+ of the largest AI supercomputers (aka GPU clusters or AI data centers) over the last six years.

Here is what we found 🧵

Performance has grown drastically: the FLOP/s of leading AI supercomputers have doubled every 9 months, driven by:
- Deploying more chips (1.6x/year)
- Higher performance per chip (1.6x/year)

Systems with 10,000 AI chips were rare in 2019. Now, leading companies have clusters 10x that size.
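As a quick cross-check of how those two drivers compound (my arithmetic, not an additional claim from the thread):

```python
import math

# 1.6x/year more chips times 1.6x/year per-chip performance
# compounds to 2.56x/year in total FLOP/s.
annual_growth = 1.6 * 1.6
doubling_months = 12 * math.log(2) / math.log(annual_growth)
print(f"{annual_growth:.2f}x/year -> doubling every {doubling_months:.1f} months")
# 2.56x/year -> doubling every 8.9 months
```

which matches the 9-month doubling time quoted above.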
Apr 18 6 tweets 3 min read
OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.

We evaluated the new models on our suite of math and science benchmarks. Results in thread!

On FrontierMath, our benchmark of highly challenging, original math questions, o4-mini with high reasoning sets a new record in our evaluations, with an accuracy of 17% (±2%)!

o3 scores 10% (±2%) with high reasoning, behind o4-mini and o3-mini.
Apr 11 7 tweets 3 min read
We’ve run independent evaluations of Grok-3 and Grok-3 mini on our suite of benchmarks!

Grok-3 currently doesn’t do extended reasoning, while Grok-3 mini is a reasoning model. We ran Grok-3 mini with both “low” and “high” reasoning effort.

Full results in thread!

On GPQA Diamond, Grok-3 is one of the top performers at 76% accuracy, beating the best competing non-reasoning models (GPT-4.5 and Claude 3.7 Sonnet) and some reasoning models like o1 and DeepSeek-R1. Grok-3 mini is slightly behind at 70 to 74%.
Apr 2 6 tweets 2 min read
Google DeepMind has released a new flagship model, Gemini 2.5 Pro.

We evaluated it on GPQA Diamond, and found a score of 84%, exactly matching the result reported by Google. This is the best result we have found on this benchmark to date!

For context, GPQA Diamond is a set of very difficult multiple-choice questions about biology, chemistry, and physics; human experts only score around 70%.
Mar 28 7 tweets 2 min read
Why have AI benchmark scores often felt disconnected from real-world usefulness? It wasn’t just the technical difficulty of creating realistic benchmarks. Historically, predicting the real-world capabilities of AI systems simply wasn’t the main goal. 🧵

For most of their history, AI benchmarks were designed to compare models: could a new model or technique improve on the state of the art? Benchmarks focused on tasks “just within reach”, where progress was measurable.
Mar 21 10 tweets 3 min read
Many AI leaders claim AI’s value will mainly come from accelerating R&D—“geniuses in datacenters.”

This view has key flaws: R&D contributes less to economic growth and is harder to automate than commonly believed. Most of AI’s value will instead come from broad deployment across the economy.

While it’s commonly assumed that growth is mostly driven by R&D, rigorous estimates don’t support this intuition. For example, the BLS estimates that private R&D accounts for only around 20% of labor productivity growth in the US since 1988.

bls.gov/productivity/h…
Mar 20 5 tweets 2 min read
We developed GATE: a model that shows how AI scaling and automation will impact growth.

It predicts trillion-dollar infrastructure investments, 30% annual growth, and full automation within decades.

Tweak the parameters: these transformative outcomes are surprisingly hard to avoid.

Imagine if a central bank took AI seriously. They’d build GATE, merging economics with AI scaling laws to show how innovation, automation, and investment interact.

At its core: more compute → more automation → growth → more investment in chips, fabs, etc.
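The loop lends itself to a toy simulation. The sketch below is my own caricature with made-up parameters, not the actual GATE model; it only shows how the compute → automation → output → reinvestment cycle feeds on itself:

```python
# Toy feedback loop (illustrative parameters, NOT GATE's):
# more compute -> higher automated share -> more output -> more compute.
compute = 1.0    # index of installed compute
for year in range(10):
    automated = min(1.0, 0.1 * compute ** 0.5)    # automation rises with compute
    output = (1 - automated) + automated * 5.0    # automated work is 5x as productive
    compute *= 1 + 0.3 * output                   # a share of output buys more compute
    print(f"year {year}: output {output:5.2f}  compute {compute:8.1f}")
```

Even with these arbitrary numbers, growth accelerates until automation saturates, which is why the thread notes these outcomes are hard to avoid.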
Mar 13 8 tweets 3 min read
How has the cost to use LLMs changed over time? Our analysis shows that the price to reach a given benchmark score has fallen dramatically: between 9x and 900x per year, depending on the benchmark and score. 🧵

[Plot: inference prices falling over time for fixed levels of performance.]

For example, GPT-4 cost about $40 per million tokens. 14 months later, Gemini 1.5 Flash could beat GPT-4’s score on a set of PhD-level science questions, despite being about 300x cheaper per token.
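Annualizing that example (my arithmetic): a 300x price drop over 14 months is

```python
# 300x over 14 months, expressed as an annual rate.
annualized = 300 ** (12 / 14)
print(f"~{annualized:.0f}x per year")  # ~133x per year
```

comfortably inside the 9x to 900x range.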
Mar 8 7 tweets 3 min read
Imagine if you could train one human for thousands of years to achieve unparalleled expertise, then make many copies. That’s what AI enables: spend heavily on training a single model, then cheaply replicate it. This creates a unique source of increasing returns at scale.

When you double your compute, you don’t just double output by deploying more AI instances: you can also train larger, more efficient models that reduce inference cost. Doing both simultaneously creates increasing returns, with output growing more than linearly.
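One way to see the increasing returns (a toy sketch with an assumed scaling exponent, not Epoch’s model): suppose inference cost per task falls as training compute to the -0.3 power, and total compute is split evenly between training and deployment.

```python
# Toy increasing-returns calculation (assumed exponent a = 0.3).
def output(total, train_frac=0.5, a=0.3):
    c_train = train_frac * total
    c_inf = (1 - train_frac) * total
    cost_per_task = c_train ** -a    # bigger training runs -> cheaper inference
    return c_inf / cost_per_task     # tasks completed with the deployment budget

for c in (1.0, 2.0, 4.0):
    print(f"compute {c:.0f}x -> output {output(c) / output(1.0):.2f}x")
# 2x compute -> ~2.46x output; 4x -> ~6.06x: more than linear.
```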
Mar 7 8 tweets 2 min read
In this week’s Gradient Updates issue, @EgeErdil2 argues that extrapolating past and current AI capabilities into the future yields excessively conservative estimates of AI’s impact, and that it’s often better to rely on first-principles reasoning to make predictions. 🧵

We can only extrapolate capabilities into the future when AI already shows some competence at them. So extrapolations often end up anchoring too much on what AI can do today, and not enough on the new tasks it will be able to do in the future.
Feb 20 7 tweets 2 min read
xAI has launched Grok-3, its new flagship model.

Grok-3 sets a new record in compute scale: we estimate that it was likely trained using 4e26 to 5e26 floating point operations (FLOP). Significantly, Grok-3 is the first released model known to be trained on over 1e26 FLOP.

xAI has not disclosed full details about Grok-3’s training. However, the known cluster size (100k–200k GPUs), the approximate training duration, and xAI’s comparisons of Grok-3’s compute usage to Grok-2 and GPT-4 all point to a total compute of around 4e26 to 5e26 FLOP.
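A rough reconstruction of that style of estimate (with my assumed utilization and duration, not xAI’s or Epoch’s exact inputs):

```python
# 100k H100-class GPUs, ~1e15 dense BF16 FLOP/s peak each,
# ~30% utilization, ~150 days of training (all assumed).
gpus = 100_000
peak_flops = 1e15
utilization = 0.30
seconds = 150 * 24 * 3600
print(f"{gpus * peak_flops * utilization * seconds:.1e} FLOP")
# ~3.9e26 FLOP -- consistent with the 4e26 to 5e26 estimate
```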
Feb 13 7 tweets 2 min read
How much AI compute exists globally? How rapidly is it growing?

We analyzed NVIDIA's GPU shipments since 2018 to answer these questions, and found that the installed computing power of NVIDIA chips has doubled every 10 months on average since 2019.

To track installed compute, we used data from NVIDIA’s financial reports, along with a new dataset of over 700 AI data centers (forthcoming). This dataset enables us to see the relative quantities of each GPU model as they come into operation.
Feb 7 8 tweets 2 min read
How much energy does a ChatGPT query consume? One common estimate is 3 watt-hours per query.

However, in this week’s Gradient Update we find that it’s probably about 10x less for a typical query today. 🧵

The main energy cost of a query comes from running the model (inference). We estimate the compute cost using GPT-4o (estimated 100B active parameters) as our reference, assuming a typical 500-token response (~400 words).
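A back-of-envelope version of that estimate (assumed hardware numbers, not necessarily the ones in the Gradient Update):

```python
# ~2 FLOP per active parameter per generated token.
params = 100e9               # GPT-4o active-parameter estimate
tokens = 500                 # typical response length
flop = 2 * params * tokens   # ~1e14 FLOP per query

gpu_power = 1500                 # W per H100 incl. server overhead (assumed)
effective_flops = 0.10 * 1e15    # ~10% inference utilization of peak (assumed)
joules = flop / effective_flops * gpu_power
print(f"{joules / 3600:.2f} Wh per query")  # ~0.4 Wh, ~10x below 3 Wh
```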
Feb 7 10 tweets 3 min read
We’re excited to announce a major update to the Epoch AI Benchmarking Hub!

The Benchmarking Hub hosts our independent evaluations of AI models. This latest release overhauls how we run and share AI benchmarks, making the data more transparent, systematic, and up to date. 🧵

What’s new for you?

• Richer Data: See comprehensive details on each evaluation and the model behind it.
• More Frequent Updates: Expect fresh benchmarking results soon after new models launch.
Jan 30 7 tweets 3 min read
How many AI models have been trained at scales similar to GPT-4 and beyond? These are some of the most advanced models, costing tens of millions of dollars to train, and they will be subject to extra scrutiny under the EU AI Act. 🧵

In short, we identified 24 models that we estimate were trained on over 10^25 FLOP.

The first known model at this scale was OpenAI’s GPT-4, released in early 2023 and reportedly trained with 2e25 FLOP on 25,000 A100 GPUs.
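That reported figure is easy to sanity-check (with my assumed utilization and duration, since OpenAI disclosed neither):

```python
# 25,000 A100s at 312 TFLOP/s peak dense BF16, ~32% utilization, ~90 days.
gpus = 25_000
peak = 312e12
mfu = 0.32        # assumed model FLOP utilization
seconds = 90 * 24 * 3600
print(f"{gpus * peak * mfu * seconds:.1e} FLOP")  # ~1.9e25, close to 2e25
```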
Jan 25 9 tweets 2 min read
In a new Gradient Update, @MatthewJBar analyzes the impact of AGI on human wages. He concludes that if AGI can fully substitute for human labor, it might cause wages to crash. Eventually, wages may drop below subsistence level, the minimum required for human survival. 🧵

His argument is based on diminishing returns to labor. In a simple model of production, increasing labor decreases wages, all else being equal. While simplistic, this argument suggests that if AGIs are scaled much faster than physical capital, human wages will crash.
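To make the diminishing-returns step explicit, here is the standard Cobb-Douglas illustration (my choice of production function; the post’s model may differ):

```latex
Y = K^{\alpha} L^{1-\alpha}, \qquad
w = \frac{\partial Y}{\partial L} = (1-\alpha)\left(\frac{K}{L}\right)^{\alpha}
```

With capital K fixed, scaling up effective labor L (say, by running ever more AGI copies) drives the wage w down like (K/L)^α, which is the crash mechanism in the argument.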
Jan 15 5 tweets 2 min read
When will an open-weight AI model be trained with 1e26 FLOP?

The Biden administration's new AI export restrictions regulate models above 1e26 FLOP… unless there’s an open-weight model that exceeds this. When will that happen, and how fast will the threshold increase? 🧵

We looked at open-weight models in our dataset of notable models, and identified the releases that pushed forward the frontier of open-weight training compute. These “top-1” models have historically grown in compute by 4.7x per year.
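At that trend, the threshold is not far off. An illustrative extrapolation (the starting point is my assumption, roughly the estimated scale of Llama 3.1 405B):

```python
import math

# Frontier open-weight compute ~4e25 FLOP (assumed), growing 4.7x/year.
current, target, growth = 4e25, 1e26, 4.7
years = math.log(target / current) / math.log(growth)
print(f"~{years:.1f} years to cross 1e26 FLOP")  # ~0.6 years
```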
Jan 10 7 tweets 2 min read
What would happen if remote work were fully automated? In a new Gradient Updates issue, @MatthewJBar argues the economic impact would be massive, with the economy doubling in size even in the most conservative scenario. 🧵

Using GPT-4o to analyze tasks in the O*NET database, Matthew finds that 34% of work in the US economy can be performed remotely, an interesting discrepancy with a major existing study.
Jan 8 7 tweets 2 min read
The amount of compute used to train frontier models has been growing at a breakneck pace of over 4x per year since 2018, resulting in an overall scale-up of more than 10,000x! But what factors are enabling this rapid growth? 🧵

[Plot: three factors contributing to the overall growth in training compute.]

We decompose training compute into three constituent factors: the quantity of training hardware, the computing power of that hardware in FLOP per second, and the amount of time spent training. We fit trends on each of these underlying factors.
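Because compute is the product of those three factors, their growth rates multiply. A minimal sketch with illustrative (assumed, not fitted) rates:

```python
# training compute = chips x FLOP/s per chip x training seconds,
# so the annual growth factors multiply. Rates below are illustrative.
chip_growth = 1.7   # x/year in hardware quantity (assumed)
perf_growth = 1.4   # x/year in FLOP/s per chip (assumed)
time_growth = 1.7   # x/year in training duration (assumed)
print(f"~{chip_growth * perf_growth * time_growth:.1f}x/year")  # ~4.0x/year
```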