Investigating the trajectory of AI for the benefit of society.
Jul 17 • 12 tweets • 3 min read
How fast has society been adopting AI?
Back in 2022, ChatGPT arguably became the fastest-growing consumer app ever, hitting 100M users in just 2 months. But the field of AI has transformed since then, and it’s time to take a new look at the numbers. 🧵
Historically, technology adoption took decades. For example, telephones took 60 years to reach 70% of US households. But tech diffuses faster and faster over time, and we should expect AI to continue this trend.
Jul 17 • 6 tweets • 2 min read
We have graded the results of @OpenAI's evaluation on FrontierMath Tier 1–3 questions, and found a performance of 27% (±3%). ChatGPT agent is a new model fine-tuned for agentic tasks, equipped with text/GUI browser tools and native terminal access. 🧵
This evaluation is not directly comparable to those on Epoch AI’s benchmarking hub, as it uses a different scaffold. First, we did not run the model ourselves—we only graded the outputs provided by OpenAI and don’t have access to their code to run the model. Second, ChatGPT agent has access to tools not available to other models we've assessed—most notably browser tools, which may have helped on questions related to recent research papers. Finally, the evaluation allowed up to 128K tokens per question, compared to our standard 100K; this difference is unlikely to have significantly affected results.
Jul 9 • 8 tweets • 2 min read
The IMO is next week. What will it tell us about AI?
@GregHBurnham argues that an AI gold medal could be a non-event or could be an important breakthrough—it depends on whether the AI system exhibits creative problem-solving. How to tell the difference? Read on!
@GregHBurnham It will be tempting to focus on whether an AI system gets a gold medal. Formal proof systems like Google’s AlphaProof are quite close to this, and even general-purpose LLMs have a fighting chance. But that's not the outcome to pay the most attention to.
Jul 3 • 8 tweets • 3 min read
What would a Manhattan Project for AI look like?
@ansonwhho and @ardenaberg argue that if it reached the scale of previous national projects, an AI Manhattan Project could result in a ~1000x compute scale-up by 2027.
@ansonwhho @ardenaberg A national AI project has become more and more of a possibility in the last year, with such a project being the top recommendation of a US-China congressional commission.
Jul 2 • 11 tweets • 3 min read
The state of large-scale AI models, July 2025:
- The number of large-scale model releases is growing rapidly (418 models over 10^23 FLOP)
- The UK has fallen behind, China has caught up (9 vs 151 models)
- There are far more of the largest models (33 models over 10^25 FLOP)
First, the number of large-scale model releases is growing rapidly.
In 2020, there were 4 models trained with more than 10^23 FLOP.
By the end of 2024, there were 327 such models in our dataset.
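As a rough back-of-the-envelope check on that trend (our arithmetic, not a separate result from the dataset): going from 4 to 327 models in four years implies that the cumulative count has grown by roughly 3x per year.

```python
# Implied annual growth in the cumulative number of models trained
# with more than 1e23 FLOP, using the two figures quoted above.
models_2020 = 4
models_2024 = 327
years = 4

annual_growth = (models_2024 / models_2020) ** (1 / years)
print(f"Implied growth: ~{annual_growth:.1f}x per year")  # ~3.0x per year
```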
Jun 27 • 10 tweets • 3 min read
LLM context windows have grown, but can models really use all this context?
We find signs of recent, rapid progress in their ability to do so. Read on to learn more!
From Claude 2.0’s 100k tokens in 2023 to Llama 4 Maverick’s 10M earlier this year, there’s no doubt that context windows are getting longer. On a set of models from Artificial Analysis, we find that the longest available context windows have grown at about 30x/year.
Jun 20 • 11 tweets • 3 min read
The bottlenecks to >10% GDP growth are weaker than expected, and existing investments like the $500B Stargate project may be tiny relative to optimal AI investment
In this week’s Gradient Update, @APotlogea and @ansonwhho explain how their work on the economics of AI brought them to this view
@APotlogea @ansonwhho Skepticism around explosive AI growth often hinges on "Baumol effects"—bottlenecks from human-dependent tasks. But to their surprise, the most comprehensive integrated assessment model of AI to date suggests these constraints are weaker than expected
We asked 14 mathematicians to review o3-mini-high’s raw, unsummarized reasoning traces on 29 FrontierMath problems. Here’s what they found:
o3-mini-high is extremely knowledgeable, and it’s not pure memorization. It fairly reliably invokes relevant techniques and results from the mathematical literature, even when problems were designed to obscure them.
May 28 • 8 tweets • 3 min read
The speed of computations on GPUs depends directly on the numeric format: less precision means more calculations on the same hardware.
We analyzed the numerical formats used to train 272 models from 2008 to 2025. Here's what we found. 🧵
Numerical formats tell computers how to represent numbers for calculations. Higher-precision formats like FP32 use more bits to store numbers with more significant digits. But precision comes at a cost: each calculation takes longer to carry out.
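To make the trade-off concrete, here is a minimal sketch (our illustration, not part of the original analysis) of how the same value loses significant digits as the format shrinks:

```python
import numpy as np

# The same constant stored in two common training formats.
x = 1 / 3
print(np.float32(x))   # 0.33333334 -> ~7 significant decimal digits (32 bits)
print(np.float16(x))   # 0.3333     -> ~3 significant decimal digits (16 bits)

# Lower precision halves the memory per value, and modern accelerators
# run FP16/BF16 matrix math at a much higher peak FLOP/s than FP32,
# which is why training has steadily shifted to lower-precision formats.
```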
May 23 • 8 tweets • 2 min read
Is AI already superhuman at FrontierMath?
To answer this question, we ran a competition at MIT, pitting eight teams of mathematicians against o4-mini-medium.
Result: o4-mini beat all but two teams. And while AIs aren't yet clearly superhuman, they probably will be soon.
Our competition included around 40 mathematicians, split into teams of four or five, with a roughly even mix of subject-matter experts and exceptional undergrads on each team. We then gave them 4.5 hours and internet access to answer 23 challenging FrontierMath questions.
Apr 23 • 12 tweets • 4 min read
How quickly are AI supercomputers scaling, where are they, and who owns them?
Our new dataset covers 500+ of the largest AI supercomputers (aka GPU clusters or AI data centers) over the last six years.
Here is what we found🧵
Performance has grown drastically – FLOP/s of leading AI supercomputers have doubled every 9 months. Driven by:
- Deploying more chips (1.6x/year)
- Higher performance per chip (1.6x/year)
Systems with 10,000 AI chips were rare in 2019. Now, leading companies have clusters 10x that size.
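As a consistency check on those growth figures (our arithmetic): two independent factors of roughly 1.6x/year compound to about 2.5x/year, which corresponds to a doubling time of roughly nine months.

```python
import math

chips_growth = 1.6   # more chips deployed per year
perf_growth = 1.6    # more FLOP/s per chip per year

total_growth = chips_growth * perf_growth                    # ~2.56x per year
doubling_months = 12 * math.log(2) / math.log(total_growth)  # ~9 months
print(f"{total_growth:.2f}x/year -> doubling every ~{doubling_months:.0f} months")
```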
Apr 18 • 6 tweets • 3 min read
OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.
We evaluated the new models on our suite of math and science benchmarks. Results in thread!
On FrontierMath, our benchmark of highly challenging, original math questions, o4-mini with high reasoning sets a new record in our evaluations, with an accuracy of 17% (±2%)!
o3 scores 10% (±2%) with high reasoning, behind o4-mini and o3-mini.
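The exact construction of the ± figures isn't spelled out here, but one plausible reading (an assumption on our part) is a binomial standard error over a benchmark of roughly 300 questions:

```python
import math

# Assumed: the ± value is the binomial standard error of an accuracy p
# measured over n questions. Both the interpretation and n are assumptions.
p = 0.17   # o4-mini's FrontierMath accuracy from the evaluation above
n = 300    # approximate number of FrontierMath questions (assumed)

standard_error = math.sqrt(p * (1 - p) / n)
print(f"±{100 * standard_error:.1f} percentage points")  # ~±2.2
```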
Apr 11 • 7 tweets • 3 min read
We’ve run independent evaluations of Grok-3 and Grok-3 mini on our suite of benchmarks!
Grok-3 currently doesn’t do extended reasoning, while Grok-3 mini is a reasoning model. We ran Grok-3 mini with both “low” and “high” reasoning effort.
Full results in thread!
On GPQA Diamond, Grok-3 is one of the top performers at 76% accuracy, beating the best competing non-reasoning models (GPT-4.5 and Claude 3.7 Sonnet) and some reasoning models like o1 and DeepSeek-R1. Grok-3 mini is slightly behind at 70 to 74%.
Apr 2 • 6 tweets • 2 min read
Google DeepMind has released a new flagship model, Gemini 2.5 Pro.
We evaluated it on GPQA Diamond, and found a score of 84%, exactly matching the result reported by Google. This is the best result we have found on this benchmark to date!
For context, GPQA Diamond is a set of very difficult multiple-choice questions about biology, chemistry, and physics; human experts only score around 70%.
Mar 28 • 7 tweets • 2 min read
Why have AI benchmark scores often felt disconnected from real-world usefulness? It wasn't just the technical difficulty of creating realistic benchmarks. Historically, predicting the real-world capabilities of AI systems simply wasn't the main goal.🧵
Historically, AI benchmarks were just designed to compare models – could a new model or technique improve on the state-of-the-art? Benchmarks focused on tasks 'just within reach' where progress was measurable.
Mar 21 • 10 tweets • 3 min read
Many AI leaders claim that AI's value will mainly come from accelerating R&D: "geniuses in datacenters."
This view has key flaws: R&D contributes less to economic growth & is harder to automate than believed. Most of AI's value will instead come from broad deployment in the economy.
While it’s commonly assumed that growth is mostly driven by R&D, rigorous estimates don’t agree with this intuition. For example, the BLS estimates that private R&D only accounts for around 20% of the labor productivity growth in the US since 1988.
We developed GATE: a model that shows how AI scaling and automation will impact growth.
It predicts trillion-dollar infrastructure investments, 30% annual growth, and full automation within decades.
Tweak the parameters—these transformative outcomes are surprisingly hard to avoid.
Imagine if a central bank took AI seriously. It would build GATE, merging economics with AI scaling laws to show how innovation, automation, and investment interact.
At its core: more compute → more automation → growth → more investment in chips, fabs, etc.
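That loop can be sketched as a toy recursion (purely illustrative; the functional forms and numbers below are made up and are not the actual GATE model):

```python
# Toy illustration of the compute -> automation -> growth -> investment loop.
compute = 1.0
savings_rate = 0.3  # share of output reinvested in compute (assumed)

for year in range(10):
    automated_share = compute / (compute + 100.0)   # more compute -> more automation
    output = 10.0 * (1.0 + 9.0 * automated_share)   # more automation -> more output
    compute += savings_rate * output                # more output -> more investment
    print(f"year {year}: compute={compute:6.1f}  output={output:5.1f}")
```

Growth accelerates as the automated share rises, which is the qualitative behavior the full model formalizes.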
Mar 13 • 8 tweets • 3 min read
How has the cost to use LLMs changed over time? Our analysis shows that the price to reach a given benchmark score has fallen dramatically—between 9x and 900x per year, depending on the benchmark and score. 🧵
For example, GPT-4 cost about $40 per million tokens. Fourteen months later, Gemini 1.5 Flash could beat GPT-4's score on a set of Ph.D.-level science questions, despite being about 300x cheaper per token.
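Annualizing that example (our arithmetic): a 300x price drop over 14 months works out to roughly a 130x decline per year, comfortably inside the 9x–900x range.

```python
# Annualize the GPT-4 -> Gemini 1.5 Flash price drop quoted above.
price_ratio = 300   # Gemini 1.5 Flash was ~300x cheaper per token
months = 14         # elapsed time between the two models

annualized_drop = price_ratio ** (12 / months)
print(f"~{annualized_drop:.0f}x cheaper per year")  # ~133x per year
```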
Mar 8 • 7 tweets • 3 min read
Imagine if you could train one human for thousands of years to achieve unparalleled expertise, then make many copies. That's what AI enables: spend heavily on training a single model, then cheaply replicate it. This creates a unique source of increasing returns at scale.
When you double your compute, you don't just double output by deploying more AI instances—you can also train larger, more efficient models that reduce inference cost. Doing both simultaneously creates increasing returns: output grows more than linearly.
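A stylized way to see this (toy numbers of our own, not the article's model): write output as model quality times the number of deployed copies, with quality a slowly growing function of training compute. Doubling the total budget then more than doubles output.

```python
# Toy model of increasing returns to compute. The exponent and the 50/50
# split between training and inference are illustrative assumptions.
def output(total_compute, train_frac=0.5, quality_exp=0.3):
    train = train_frac * total_compute
    inference = (1 - train_frac) * total_compute
    quality = train ** quality_exp   # better models from more training compute
    copies = inference               # more deployed instances from more inference compute
    return quality * copies

ratio = output(2.0) / output(1.0)
print(f"Doubling compute multiplies output by ~{ratio:.2f}x")  # ~2.46x, i.e. > 2x
```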
Mar 7 • 8 tweets • 2 min read
In this week’s Gradient Updates issue, @EgeErdil2 argues that extrapolating past and current AI capabilities to the future yields excessively conservative estimates of the future impact of AI, and it’s often better to rely on first principles reasoning to make predictions.🧵
We can only extrapolate capabilities to the future when AI already shows some competence at them. So extrapolations often end up anchoring too much on what AI can do today, and not enough on the new tasks it will be able to do in the future.
Feb 20 • 7 tweets • 2 min read
xAI has launched Grok-3, its new flagship model.
Grok-3 sets a new record in compute scale: we estimate that it was likely trained using 4e26 to 5e26 floating point operations (FLOP). Significantly, Grok-3 is the first released model known to be trained on over 1e26 FLOP.
xAI has not disclosed full details about Grok-3’s training. However, the known cluster size (100k–200k GPUs), approximate training duration, and xAI’s comparisons of Grok-3’s compute usage to Grok-2 and GPT-4 all point to a total compute scale of around 4e26 to 5e26 FLOP.
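A rough reconstruction of how an estimate like this can be built (the inputs below are our illustrative assumptions, not disclosed xAI figures): multiply GPU count, per-GPU throughput, utilization, and training time.

```python
# Illustrative compute estimate for a Grok-3-scale training run.
gpus = 100_000             # reported cluster size (lower end)
peak_flop_per_s = 1e15     # ~H100-class dense BF16 throughput per GPU (assumed)
utilization = 0.4          # assumed model FLOP utilization
seconds = 120 * 24 * 3600  # assumed ~4 months of training

total_flop = gpus * peak_flop_per_s * utilization * seconds
print(f"~{total_flop:.1e} FLOP")  # ~4e26 FLOP, in line with the estimate above
```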