🏆 In October, we invited external teams to submit to ForecastBench, our AI forecasting benchmark.
The challenge? Beat superforecasters using any tools available (scaffolding, ensembling, etc.).
The result? External submissions are now the most accurate models on our leaderboard—though superforecasters still hold #1.
@xai's model (grok-4-fast) is the leading external submission, at #2.
One of Cassi's entries takes the #3 spot.
Here's what changed. 🧵
In October, we opened up ForecastBench’s tournament leaderboard to external submissions. Teams are free to use any tools they choose.
Several teams responded, including @xai, Cassi, @fractalai, @lightningrodai, and @_Mantic_AI. Thanks to all of them for taking on this challenging benchmark.
Models from @xai and Cassi outperformed all our baseline LLM configurations.
Here are the headline scores (Brier scores; lower is better):
• Superforecasters: 0.083
• grok-4-fast (external submission from @xai): 0.098
• ensemble_2_crowdadj (external submission from Cassi): 0.099
• @OpenAI’s GPT-5 (our own baseline run): 0.100
• @GoogleDeepMind’s Gemini-2.5-Pro (our own baseline run): 0.102
• @AnthropicAI’s Claude-Sonnet-4-5 (our own baseline run): 0.103
External submissions hold #2 and #3, ahead of all our baseline runs. However, all LLMs still lag behind superforecasters.
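For context on the metric: a Brier score is the squared difference between the forecast probability and the eventual outcome (1 if the event happened, 0 if not), averaged over questions. The leaderboard reports a difficulty-adjusted variant, but the core idea is the same. A toy sketch (illustrative numbers, not benchmark data):

```python
# Toy illustration of Brier scoring (illustrative numbers, not benchmark data).
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# A confident, correct forecaster scores near 0; always predicting 50% scores 0.25.
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ~0.047
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))  # 0.25
```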
Updated trend extrapolations for LLM-superforecaster parity:
• Overall: Oct 2026 (95% CI: Dec 2025 – Sep 2027)
• Dataset: May 2026 (95% CI: Oct 2025 – Jan 2027)
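These parity dates come from extrapolating the trend in the best LLM scores toward the superforecaster benchmark. A minimal sketch of the idea, using a simple linear fit on made-up numbers (the actual extrapolation method and its confidence intervals are described in the full reports):

```python
# Sketch of extrapolating LLM scores to superforecaster parity.
# Made-up scores and a simple linear fit; the actual ForecastBench
# methodology and confidence intervals are in the full reports.
import numpy as np

years = np.array([2023.5, 2024.0, 2024.5, 2025.0, 2025.5])      # hypothetical
best_llm_brier = np.array([0.135, 0.125, 0.115, 0.105, 0.098])  # hypothetical
superforecaster_brier = 0.083

slope, intercept = np.polyfit(years, best_llm_brier, 1)
parity_year = (superforecaster_brier - intercept) / slope
print(f"Projected LLM-superforecaster parity: ~{parity_year:.1f}")
```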
Today, we are launching the most rigorous ongoing source of expert forecasts on the future of AI: the Longitudinal Expert AI Panel (LEAP).
We’ve assembled a panel of 339 top experts across computer science, AI industry, economics, and AI policy.
Roughly every month—for the next three years—they’ll provide precise, falsifiable forecasts on the trajectory of AI capabilities, adoption, and impact.
Our results cover where experts predict major effects of AI, where they expect less progress than AI industry leaders do, and where they disagree.
LEAP experts forecast major effects of AI by 2030, including:
⚡ 7x increase in AI’s share of U.S. electricity use (1% -> 7%)
🖥️ 9x increase in AI-assisted work hours (2% -> 18%)
By 2040, experts predict:
👥 30% of adults will use AI for companionship daily
🏆 60% chance that AI will solve or substantially assist in solving a Millennium Prize Problem
🚂 32% chance that AI will have been at least as impactful as a "technology of the millennium," like the printing press or the Industrial Revolution.
🧵 Read on for more insights and results
Our LEAP panel is made up of the following experts:
🧑‍🔬 76 Top computer scientists (e.g., professors from top-20 universities)
🤖 76 AI industry experts (from frontier model and other leading AI companies)
💲 68 Leading economists (including many studying economic growth or technology at top universities)
🧠 119 Policy and think tank experts
🏆 12 Honorees from TIME’s 100 most influential people in AI, in 2023 and 2024
(Plus 60 highly accurate superforecasters and 1,400 members of the U.S. public)
For more details on our sample, see the full reports linked below.
We ask questions designed to elicit high-quality, specific forecasts about the future of AI and its effects. Example questions include:
⚡ What % of US electricity will go towards training and deploying AI systems in 2027, 2030, and 2040?
🏢 What % of work hours will be assisted by generative AI in 2025, 2027, and 2030?
📊 At the end of 2040, how will people assess the impact of AI in comparison to past technological events?
In the first 3 waves of LEAP, we elicited forecasts on 18 questions regarding the future of AI. Wave 1 focused on the speed of AI progress and broad social impacts, Wave 2 on AI’s effect on science, and Wave 3 on AI adoption. For the full set of questions and results, see the reports linked below.
👇 Superforecasters top the ForecastBench tournament leaderboard, with a difficulty-adjusted Brier score of 0.081 (lower scores indicate higher accuracy).
🤖The best-performing LLM in our dataset is @OpenAI’s GPT-4.5 with a score of 0.101—a gap of 0.02.
A baseline model that always predicts 50% would yield a score of 0.25. Relative to this baseline, superforecasters are 68% more accurate and the best-performing LLM is 60% more accurate, a gap of 8 percentage points.
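One way to reproduce those percentages: treat "accuracy relative to baseline" as the reduction in Brier score versus the always-50% baseline of 0.25.

```python
# Reproducing the "% more accurate than the always-50% baseline" figures.
baseline = 0.25  # Brier score of always predicting 50%
for name, score in [("Superforecasters", 0.081), ("Best LLM (GPT-4.5)", 0.101)]:
    improvement = (baseline - score) / baseline
    print(f"{name}: {improvement:.0%} more accurate than the baseline")
# Superforecasters: 68%; best LLM: 60% -- an 8 percentage point gap.
```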
(Wondering about newer AI models? More on them later!)
We now have the first accuracy results from the largest-ever existential risk forecasting tournament.
In 2022, we convened 80 experts and 89 superforecasters for the Existential Risk Persuasion Tournament (XPT), which collected thousands of forecasts on 172 questions across short-, medium-, and long-term time horizons.
We now have answers for 38 short-run questions covering AI progress, climate technology, bioweapons, nuclear weapons and more.
Here’s what we found out: 🧵
Respondents—especially superforecasters—underestimated AI progress.
Participants predicted the state-of-the-art accuracy of ML models on the MATH, MMLU, and QuALITY benchmarks by June 2025.
Domain experts assigned probabilities of 21.4%, 25%, and 43.5% to the achieved outcomes.
Superforecasters assigned even lower probabilities: just 9.3%, 7.2%, and 20.1% respectively.
The International Mathematical Olympiad results were even more surprising.
AI systems achieved gold-level performance at the IMO in July 2025.
Superforecasters assigned this outcome just a 2.3% probability. Domain experts put it at 8.6%.
Today, we're excited to announce ForecastBench: a new benchmark for evaluating AI and human forecasting capabilities. Our research indicates that AI remains worse at forecasting than expert forecasters. 🧵
Evaluating LLM forecasting ability is tricky! Prior work asks models about events that have already occurred (or not), so the outcomes may already be in the model's training data.
Our solution is to use questions about future events, the outcomes of which are unknowable when forecasts are made.
ForecastBench continuously generates new questions about future events, testing the ability of AI models and humans to make accurate probabilistic predictions across diverse domains.
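A minimal sketch of that forecast-then-resolve flow (hypothetical field names, not the actual ForecastBench pipeline): forecasts are recorded before outcomes are knowable, and scored only after their questions resolve.

```python
# Minimal sketch of forecast-then-resolve scoring (hypothetical structure,
# not the actual ForecastBench pipeline).
from dataclasses import dataclass
from datetime import date

@dataclass
class Forecast:
    question_id: str
    probability: float  # submitted before the outcome is knowable
    submitted: date

@dataclass
class Resolution:
    question_id: str
    outcome: int        # 1 if the event occurred, else 0
    resolved: date

def score_resolved(forecasts, resolutions):
    """Brier-score only forecasts made before their question resolved."""
    by_id = {r.question_id: r for r in resolutions}
    errors = [
        (f.probability - by_id[f.question_id].outcome) ** 2
        for f in forecasts
        if f.question_id in by_id and f.submitted < by_id[f.question_id].resolved
    ]
    return sum(errors) / len(errors) if errors else None
```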