The bottlenecks to >10% GDP growth are weaker than expected, and existing $500B investments in Stargate may be tiny relative to optimal AI investment
In this week’s Gradient Update, @APotlogea and @ansonwhho explain how their work on the economics of AI brought them to this view
@APotlogea @ansonwhho Skepticism around explosive AI growth often hinges on "Baumol effects"—bottlenecks from human-dependent tasks. But to their surprise, the most comprehensive integrated assessment model of AI to date suggests these constraints are weaker than expected
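To make the Baumol mechanism concrete, here is a minimal sketch assuming a CES aggregator over tasks with elasticity of substitution sigma < 1; the automation fraction, speedup, and sigma values are illustrative choices, not parameters from their model.

```python
import numpy as np

def ces_output(task_outputs, sigma):
    # CES aggregate: Y = (mean(Y_i^rho))^(1/rho), with rho = (sigma - 1) / sigma.
    # For sigma < 1, aggregate output is dragged toward the slowest tasks.
    rho = (sigma - 1) / sigma
    return np.mean(task_outputs ** rho) ** (1 / rho)

# Illustrative scenario: automate 70% of tasks at 100x human productivity,
# leaving 30% of tasks at human pace (productivity 1).
tasks = np.concatenate([np.full(70, 100.0), np.full(30, 1.0)])

for sigma in (0.5, 0.2):
    print(f"sigma={sigma}: output = {ces_output(tasks, sigma):.2f}x baseline")
# Despite 100x speedups on 70% of tasks, output rises only ~3.3x (sigma=0.5)
# or ~1.4x (sigma=0.2): the remaining human tasks act as a Baumol bottleneck.
```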
@APotlogea @ansonwhho Contrary to their expectations, even very partial AI automation—just 30% of tasks—can lead to growth rates above 20% under best-guess parameters. Achieving explosive growth (>30%) requires around 50-70% automation, still well below full automation
@APotlogea @ansonwhho Even when Baumol effects are strengthened well beyond their best guesses, growth still accelerates dramatically – exceeding 10% per year, like the rates observed in some East Asian economies during the 20th century
@APotlogea @ansonwhho The model’s predictions are of course overly aggressive, but to their surprise, additional bottlenecks like R&D externalities, investor uncertainty, and labor reallocation frictions typically fail to prevent these huge growth accelerations
@APotlogea @ansonwhho Following these findings, they both updated towards explosive growth being more plausible than they previously appreciated
@APotlogea @ansonwhho They thus also underestimated how much the world could be underinvesting in AI. Forget $500B investments in Stargate: the model suggests that optimal AI investment could be ~$25T, allowing society to soon capture the enormous returns from automating global labor (>$50T/year)
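For a sense of scale, a rough back-of-envelope (our illustration; the figures below are coarse assumptions, not outputs of the model):

```python
# Illustrative back-of-envelope only; all figures are rough assumptions.
world_gdp = 100e12       # world GDP on the order of $100T/year
labor_share = 0.5        # labor compensation is roughly half of GDP
investment = 25e12       # ~$25T optimal AI investment (per the model)

annual_flow = world_gdp * labor_share      # >$50T/year if labor is automated
payback_years = investment / annual_flow
print(f"Simple payback period: {payback_years:.1f} years")  # -> 0.5 years
```

Even with discounting and only partial capture of that flow, the simple payback period stays strikingly short, which conveys why the model's optimal investment figure is so large.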
@APotlogea @ansonwhho If the model is even close to correct, current AI investments are too conservative rather than being driven by a “bubble”, as is often proclaimed. That said, negative externalities also need to be accounted for to determine optimal investment levels, which the model does not do
@APotlogea @ansonwhho So why aren’t actual AI investments higher? It’s far from certain, but possible explanations include uncertainty about automation feasibility, regulatory concerns, taxation fears, normalcy bias or conformity, or just crucial missing considerations in the model
@APotlogea @ansonwhho Overall, this work updated both of them away from extreme skepticism, as well as away from blind bullishness about >30% growth. Dramatic growth accelerations might be harder to avoid than they expected, but the model definitely doesn’t capture all the complexities of the real world
We asked 14 mathematicians to review o3-mini-high’s raw, unsummarized reasoning traces on 29 FrontierMath problems. Here’s what they found:
o3-mini-high is extremely knowledgeable, and it’s not pure memorization. It fairly reliably invokes relevant techniques and results from the mathematical literature, even when problems were designed to obscure them.
It also does a lot of informal and heuristic-driven reasoning akin to a physicist, with many steps lacking rigorous justification. One mathematician described the model as a “vibes-based inductive reasoner”, employing rough, intuitive leaps rather than precise proofs.
The speed of computations on GPUs depends directly on the numeric format: less precision means more calculations on the same hardware.
We analyzed the numerical format used to train 272 models from 2008 to 2025. Here’s what we found. 🧵
Numerical formats tell computers how to represent numbers for calculations. Higher-precision formats like FP32 use more bits to store numbers to more significant digits. But precision comes at a cost: each calculation takes longer to carry out.
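A minimal NumPy illustration of the trade-off (the formats shown are standard IEEE types, nothing specific to any one GPU):

```python
import numpy as np

# More bits buy more significant digits.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: {info.bits} bits, machine epsilon {info.eps}")

# The same value stored at each precision:
x = 1 / 3
print(np.float32(x))  # 0.33333334 (~7 significant decimal digits)
print(np.float16(x))  # 0.3333     (~3 significant decimal digits)
```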
Early deep learning on GPUs almost exclusively used FP32 for training, but NVIDIA’s 2017 Volta GPUs introduced major performance boosts for FP16 compute. Later that year, Micikevicius et al. (2017) showed that FP16 could be stable enough for training if you kept FP32 for a few critical calculations.
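Below is a minimal sketch of that mixed-precision recipe using PyTorch's AMP utilities, which package the Micikevicius et al. approach; the model, data, and hyperparameters are placeholders, not anything from the paper.

```python
import torch

model = torch.nn.Linear(512, 10).cuda()   # master weights stay in FP32
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()      # loss scaling protects small FP16 gradients

for x, y in [(torch.randn(32, 512, device="cuda"),
              torch.randint(0, 10, (32,), device="cuda"))]:
    opt.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):  # matmuls run in FP16
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()   # backprop on the scaled loss
    scaler.step(opt)                # unscale, then update FP32 master weights
    scaler.update()
```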
To see how AI stacks up against expert mathematicians, we ran a competition at MIT, pitting eight teams of mathematicians against o4-mini-medium.
Result: o4-mini beat all but two teams. And while AIs aren't yet clearly superhuman, they probably will be soon.
Our competition included around 40 mathematicians, split into teams of four or five, with a roughly even mix of subject-matter experts and exceptional undergrads on each team. We gave them 4.5 hours and internet access to answer 23 challenging FrontierMath questions.
By design, FrontierMath draws on a huge range of fields. To obtain a meaningful human baseline that tests reasoning abilities rather than breadth of knowledge, we chose problems that need less background knowledge, or were tailored to the background expertise of participants.
How quickly are AI supercomputers scaling, where are they, and who owns them?
Our new dataset covers 500+ of the largest AI supercomputers (aka GPU clusters or AI data centers) over the last six years.
Here is what we found🧵
Performance has grown drastically – the FLOP/s of leading AI supercomputers have doubled every 9 months, driven by two compounding factors (see the quick check after this list):
- Deploying more chips (1.6x/year)
- Higher performance per chip (1.6x/year)
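Those two rates compound; a quick consistency check (our arithmetic, using the figures above):

```python
import math

# Chips per cluster and performance per chip each grow ~1.6x/year.
combined = 1.6 * 1.6                                   # ≈ 2.56x per year
doubling_months = 12 * math.log(2) / math.log(combined)
print(f"{combined:.2f}x/year -> doubling every {doubling_months:.1f} months")
# -> 2.56x/year -> doubling every 8.8 months, consistent with ~9 months
```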
Systems with 10,000 AI chips were rare in 2019. Now, leading companies have clusters 10x that size
As they grew in performance, AI supercomputers got exponentially more expensive. The upfront hardware cost of leading AI supercomputers doubled roughly every year (1.9x/year). We estimate the hardware for xAI's Colossus cost about $7 billion.
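The same doubling-time arithmetic applies to costs (again our own check, not a figure from the dataset):

```python
import math

# Hardware cost grows ~1.9x/year; how close is that to doubling every year?
print(12 * math.log(2) / math.log(1.9))  # ≈ 13.0 months per doubling
```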
OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.
We evaluated the new models on our suite of math and science benchmarks. Results in thread!
On FrontierMath, our benchmark of highly challenging, original math questions, o4-mini with high reasoning sets a new record in our evaluations, with an accuracy of 17% (±2%)!
o3 scores 10% (±2%) with high reasoning, behind o4-mini and o3-mini.
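As an aside on what the ±2% denotes: it is a standard error on benchmark accuracy. A binomial approximation reproduces its magnitude, though the problem count below is our assumption and the actual intervals may be computed differently (e.g., across repeated runs):

```python
import math

# Assumed: FrontierMath has on the order of 300 problems (n is our guess).
p, n = 0.17, 300
se = math.sqrt(p * (1 - p) / n)
print(f"SE ≈ {se:.3f}")  # ≈ 0.022, i.e. about ±2%
```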
On GPQA Diamond, a set of PhD-level multiple choice science questions, o3 scores 82% (±2%), just short of Gemini 2.5 Pro’s 84%, while o4-mini scores 80% (±2%).
This matches OpenAI’s reported scores of 83% and 81% for o3 and o4-mini, respectively. Both outperform OpenAI’s older reasoning models.
We’ve run independent evaluations of Grok-3 and Grok-3 mini on our suite of benchmarks!
Grok-3 currently doesn’t do extended reasoning, while Grok-3 mini is a reasoning model. We ran Grok-3 mini with both “low” and “high” reasoning effort.
Full results in thread!
On GPQA Diamond, Grok-3 is one of the top performers at 76% accuracy, beating the best competing non-reasoning models (GPT-4.5 and Claude 3.7 Sonnet) and some reasoning models like o1 and DeepSeek-R1. Grok-3 mini is slightly behind at 70 to 74%.
On FrontierMath, our benchmark of original, expert-level mathematics problems, Grok-3 mini high scores 6%, which is one of the best results to date. Grok-3 gets 4%.
(note: OpenAI has exclusive access to FrontierMath outside of a holdout set).