What will AI look like by 2030 if current trends hold?
Our new report zooms in on two things: (1) whether scaling continues (compute, data, power, capital), and (2) the capabilities this enables—especially for scientific R&D.
We forecast that by 2030:
- Training clusters will cost hundreds of billions of dollars
- Compute scaling is probably not "hitting a wall"
- Synthetic & multimodal data may be needed to ease bottlenecks
- Power demands will increase but be manageable in principle
Given that we expect scaling to continue, what does this mean for the resulting AI capabilities?
We focus on scientific R&D—a stated priority for leading labs—and assess both benchmarks and real-world use, allowing us to forecast the kinds of tasks AI will be able to automate.
Despite benchmarks’ weaknesses (e.g., contamination and overfitting), benchmark progress has tracked real-world improvements, and usage data already points to productivity gains: people spend billions to use AI for coding, writing, and research.
AI may be a transformative tool well before it can work autonomously.
We explore future capabilities across four domains:
- Software engineering
- Mathematics
- Biology
- Weather prediction
AI is already transforming software engineering through code assistants and question-answering. On current trends, by 2030 AI will autonomously fix issues, implement features, and solve well-defined problems that take hours.
AI for mathematics may soon act as a research assistant, fleshing out proof sketches or intuitions. Mathematicians already report examples of AI being helpful in their work, mostly as a study aid.
Tools like AlphaFold are already revolutionising biology, and will expand to predict more properties for more complex structures. AI assistants for desk research are at an early stage, but offer great promise.
AI weather prediction already outperforms traditional methods from hours to weeks ahead. The next challenges lie in further improving predictions, especially rare events, and integrating new data sources.
At minimum, we expect that AI for scientific R&D will follow in the footsteps of coding assistants for software engineers today, boosting productivity for desk-based research by 10-20%.
Impacts may differ greatly across domains. We might see lab research in molecular biology flourish for years before it yields new medicines, because clinical trials and approvals take years.
This report was commissioned by @GoogleDeepMind. All points of view and conclusions expressed are those of the authors and do not necessarily reflect the position or endorsement of Google DeepMind.
Thanks to @sebkrier and @jmateosgarcia for their feedback and support.
AI progress has been driven by enormous compute scaling, but this is likely to slow down within the next few years. The reasons: investor uncertainty, the heavy costs of overinvestment, and increasing lead times. 🧵
Investors are incredibly uncertain about the returns to further scaling, and overestimating the returns could cost them >$100B. So rather than going all-in today, they invest more gradually, observing the returns from incremental scaling, before reevaluating further investment.
But as compute investments grow, the “lead time” between project initiation and product deployment gets longer: you need to buy more compute, build new data centers, and construct new fabs. For every 10× increase in compute investment, lead times grow by roughly a year.
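As a rough illustration of that relationship, here is a minimal sketch: if lead time grows by about one year per 10x increase in compute investment, it scales with log10 of the investment. The baseline investment and baseline lead time below are illustrative assumptions, not figures from this thread.

```python
import math

def lead_time_years(investment_usd, base_investment_usd=1e9, base_lead_time_years=1.0):
    """Illustrative model: lead time grows ~1 year per 10x increase in compute investment.

    base_investment_usd and base_lead_time_years are hypothetical anchor values,
    not figures from the thread.
    """
    return base_lead_time_years + math.log10(investment_usd / base_investment_usd)

# Scaling investment from $1B to $100B adds ~2 years of lead time in this model.
print(lead_time_years(1e9))   # 1.0
print(lead_time_years(1e11))  # 3.0
```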
We’ve independently evaluated the GPT-5 model family on our benchmarking suite. Here is what we’ve learned 🧵
GPT-5 performs strongly on math benchmarks, achieving a new SOTA on FrontierMath and OTIS Mock AIME 2024-2025.
In research mathematics, a FrontierMath problem author did not notice a large qualitative difference between GPT-5 and o3, but was very impressed by GPT-5 Pro.
OpenAI has historically scaled up training compute by around 100x with each new generation of its GPT series.
However, GPT-5 appears to be an exception to this trend.
🧵
GPT-4 was trained on 2e25 floating-point operations, and OpenAI said GPT-4.5 was about an order-of-magnitude (10x) scale-up.
We don’t have a rigorous estimate yet, but GPT-5’s compute scale may be *between* GPT-4 and GPT-4.5, and it is probably not a large scale-up from 4.5.
Training compute scales with model size × training data.
GPT-5 is fast and fairly cheap on the API, with output tokens 15x cheaper and served ~2-4x faster than GPT-4.5 on launch! This suggests GPT-5 is a much smaller model than GPT-4.5.
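To make the “model size × training data” relationship concrete, here is a minimal sketch using the common approximation that training FLOP ≈ 6 × parameters × tokens. The parameter and token counts are hypothetical, chosen only to illustrate the arithmetic, not estimates of any actual model.

```python
def training_flop(params: float, tokens: float) -> float:
    """Standard dense-transformer approximation: C ~= 6 * N * D."""
    return 6 * params * tokens

# Hypothetical example: a dense model with 5e11 parameters trained on 7e12 tokens
# lands near the ~2e25 FLOP reported for GPT-4 (numbers chosen only to show the arithmetic).
print(f"{training_flop(5e11, 7e12):.1e}")    # 2.1e+25

# A much smaller model needs far more training tokens to reach the same compute scale,
# which is why cheap, fast inference (a small N) makes a large training-compute
# scale-up less likely unless the training dataset grew enormously.
print(f"{training_flop(1e11, 3.3e13):.1e}")  # 2.0e+25
```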
How big of a paradigm shift was the rise of reasoning models? We dug into the data and found that at least on some benchmarks, reasoning models were likely as large of an algorithmic advance as the Transformer.
When OpenAI released o1, it blew its predecessor GPT-4o out of the water on some math and science benchmarks. The difference was reasoning training and test-time scaling: o1 was trained to optimize its chain-of-thought, allowing extensive thinking before responding to users.
This represented a huge algorithmic improvement. To reach o1-high’s GPQA diamond performance with a non-reasoning model, you’d need 9x more pre-training compute than GPT-4o. That’s larger than the gain from switching from Kaplan to Chinchilla scaling laws!
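A sketch of how a compute-equivalent gain like this can be estimated: fit benchmark accuracy against log pre-training compute for non-reasoning models, then invert the fit to see how much compute would be needed to match the reasoning model's score. The sigmoid fit and all numbers below are placeholders, not the actual estimates behind the 9x figure.

```python
import math

# Hypothetical fit parameters for GPQA accuracy vs. log10(pre-training FLOP),
# non-reasoning models only. Placeholder values, not an actual fit.
FLOOR, CEILING, MIDPOINT, SLOPE = 0.25, 0.85, 25.5, 1.2

def accuracy(log10_compute):
    """Sigmoid accuracy curve in log-compute space."""
    return FLOOR + (CEILING - FLOOR) / (1 + math.exp(-SLOPE * (log10_compute - MIDPOINT)))

def compute_equivalent_gain(base_log10_compute, target_accuracy):
    """Compute multiplier a non-reasoning model would need to reach target_accuracy."""
    frac = (target_accuracy - FLOOR) / (CEILING - FLOOR)
    required_log10 = MIDPOINT - math.log((1 - frac) / frac) / SLOPE  # invert the sigmoid
    return 10 ** (required_log10 - base_log10_compute)

base = math.log10(2e25)            # illustrative base-model pre-training compute
print(round(accuracy(base), 2))    # ~0.51 under the placeholder fit
print(round(compute_equivalent_gain(base, 0.70), 1))  # ~13x under the placeholder fit
```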
A fourth problem on FrontierMath Tier 4 has been solved by AI! Written by Dan Romik, it had won our prize for the best submission in the number theory category.
The evaluation was done internally by OpenAI on an early checkpoint of o3 using a “high reasoning setting.” The model made 32 attempts on the problem and solved it only once. OpenAI shared the reasoning trace so that Dan could analyze the model’s solution and provide commentary.
Dan said the model had some false starts but eventually solved the problem “by combining an excellent intuition about asymptotic phenomena with its ability to code and run computationally intensive numerical calculations to test hypotheses.”
Should you start your training run early, so you can train for longer, or wait for the next generation of chips and algorithms? Our latest estimate suggests that it’s not effective to train for more than ~9 months. On current trends, frontier labs will hit that limit by 2027. 🧵
Why 9 months? Model developers face a tradeoff: wait before starting a run to take advantage of better hardware and algorithms, or start sooner with what’s available. Waiting lets you train faster once you start, so there’s an optimal run length for any given deadline.
Our previous work estimated that hardware + algorithmic progress would lead to a 15-month maximum training run. That work assumed algorithms were improving at 1.7x per year, but we now believe they are improving at a much faster 3x per year!
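One way to see where figures like 15 months and ~9 months can come from: in a simple model where the effective compute you can lock in at the start of a run grows by a factor g per year (hardware price-performance × algorithmic progress), the optimal run length is roughly 1/ln(g) years. The ~1.3x/year hardware improvement rate below is an illustrative assumption, not a figure from this thread.

```python
import math

def optimal_run_length_months(hardware_gain_per_year, algo_gain_per_year):
    """Simple 'wait vs. train' model.

    Assume the effective compute throughput you can lock in at the start of a run
    grows by a factor g per year (hardware price-performance x algorithmic progress),
    and a run started at time t with duration d delivers g**t * d effective compute.
    For a fixed deadline, maximizing g**t * d gives an optimal duration of 1/ln(g) years.
    """
    g = hardware_gain_per_year * algo_gain_per_year
    return 12 / math.log(g)

# ~1.3x/year hardware price-performance is an illustrative assumption.
print(round(optimal_run_length_months(1.3, 1.7), 1))  # ~15 months (older algorithmic estimate)
print(round(optimal_run_length_months(1.3, 3.0), 1))  # ~9 months (updated estimate)
```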