Forecasting Research Institute
Jan 8
🏆 In October, we invited external teams to submit to ForecastBench, our AI forecasting benchmark.

The challenge? Beat superforecasters—using any tools available (scaffolding, ensembling, etc.).

The result? External submissions are now the most accurate models on our leaderboard—though superforecasters still hold #1.

@xai's model (grok-4-fast) is the leading external submission, at #2.

One of Cassi's entries takes the #3 spot.

Here's what changed. 🧵
In October, we opened up ForecastBench’s tournament leaderboard to external submissions. Teams are free to use any tools they choose.

Several teams responded, including @xai, Cassi, @fractalai, @lightningrodai, and @_Mantic_AI. Thanks to all of them for participating in this challenging benchmark.

Models from @xai and Cassi outperformed all our baseline LLM configurations.
Here are the headline scores (Brier scores; lower is better):

• Superforecasters: 0.083

• grok-4-fast (external submission from @xai): 0.098

• ensemble_2_crowdadj (external submission from Cassi): 0.099

• @OpenAI’s GPT-5 (our own baseline run): 0.100

• @GoogleDeepMind’s Gemini-2.5-Pro (our own baseline run): 0.102

• @AnthropicAI’s Claude-Sonnet-4-5 (our own baseline run): 0.103

External submissions hold #2 and #3, ahead of all our baseline runs. However, all LLMs still lag behind superforecasters.
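For context on the metric: a Brier score is the mean squared difference between a probability forecast and the resolved outcome (1 if the event happened, 0 if not). Here's a minimal sketch in Python (the leaderboard additionally difficulty-adjusts scores, which this doesn't attempt):

```python
import numpy as np

def brier_score(forecasts, outcomes):
    """Mean squared difference between probability forecasts (in [0, 1])
    and resolved binary outcomes (0 or 1). Lower is better."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((forecasts - outcomes) ** 2))

print(brier_score([0.9, 0.8, 0.3], [1, 1, 0]))  # ~0.047: confident and correct
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))  # 0.25: always predicting 50%
```

Always predicting 50% scores exactly 0.25 regardless of outcomes, which makes it a common reference point.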
Updated trend extrapolations for LLM-superforecaster parity:

• Overall: Oct 2026 (95% CI: Dec 2025 – Sep 2027)

• Dataset: May 2026 (95% CI: Oct 2025 – Jan 2027)

• Market: Apr 2026 (95% CI: Apr 2025 – Jul 2029)

Estimates remain stable (within ~1 month of our October projections) despite new models and more resolved questions.
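The thread doesn't spell out the extrapolation method, but the basic idea can be sketched as fitting a trend to the LLM-superforecaster score gap over time and solving for the date it reaches zero. Illustrative numbers only, not ForecastBench data:

```python
import numpy as np

# Made-up history of the gap between the best LLM's Brier score and the
# superforecasters', measured in days since some reference date.
days = np.array([0, 90, 180, 270, 365])
gap = np.array([0.035, 0.030, 0.024, 0.020, 0.015])

# Fit a linear trend and solve gap(t) = 0 for the projected parity date.
slope, intercept = np.polyfit(days, gap, 1)
parity_day = -intercept / slope
print(f"parity after ~{parity_day:.0f} days at the fitted rate")
```

A real analysis would also propagate uncertainty in the fit, which is where confidence intervals like the ones above come from.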
More models are coming soon.

Mid-January:

• GPT-5.1
• Gemini 3 Pro
• Grok-4.1
• GLM-4.6
• Kimi K2 Thinking

End of January:

• Claude Opus 4.5

We add models 50 days after their first forecast to ensure enough questions have resolved for stable rankings.
Think you can do better? Submit to the public leaderboard. 🚀

How to submit: github.com/forecastingres…

Explore the data:

• Leaderboards: forecastbench.org
• Full datasets: forecastbench.org/datasets/


More from @Research_FRI

Nov 10, 2025
Today, we are launching the most rigorous ongoing source of expert forecasts on the future of AI: the Longitudinal Expert AI Panel (LEAP).

We’ve assembled a panel of 339 top experts across computer science, AI industry, economics, and AI policy.

Roughly every month—for the next three years—they’ll provide precise, falsifiable forecasts on the trajectory of AI capabilities, adoption, and impact.

Our results cover where experts predict major effects of AI, where they expect less progress than AI industry leaders, and where they disagree.

LEAP experts forecast major effects of AI by 2030, including:

⚡ 7x increase in AI’s share of U.S. electricity use (1% -> 7%)
🖥️ 9x increase in AI-assisted work hours (2% -> 18%)

By 2040, experts predict:
👥 30% of adults will use AI for companionship daily
🏆 60% chance that AI will solve or substantially assist in solving a Millennium Prize Problem
🚂 32% chance that AI will have been at least as impactful as a "technology of the millennium," like the printing press or the Industrial Revolution.

🧵 Read on for more insights and results
Our LEAP panel is made up of the following experts:

🧑‍🔬 76 top computer scientists (e.g., professors from top-20 universities)
🤖 76 AI industry experts (from frontier model and other leading AI companies)
💲 68 leading economists (including many studying economic growth or technology at top universities)
🧠 119 policy and think tank experts
🏆 12 Honorees from TIME’s 100 most influential people in AI, in 2023 and 2024

(Plus 60 highly accurate superforecasters and 1,400 members of the U.S. public)

For more details on our sample, see the full reports linked below.
We ask questions designed to elicit high-quality, specific forecasts about the future of AI and its effects. Example questions include:

⚡ What % of US electricity will go towards training and deploying AI systems in 2027, 2030, and 2040?

🏢 What % of work hours will be assisted by generative AI in 2025, 2027, and 2030?

📊 At the end of 2040, how will people assess the impact of AI in comparison to past technological events?

In the first 3 waves of LEAP, we elicited forecasts on 18 questions regarding the future of AI. Wave 1 focused on the speed of AI progress and broad social impacts, Wave 2 on AI’s effect on science, and Wave 3 on AI adoption. For the full set of questions and results, see the reports linked below.
Oct 8, 2025
Is AI on track to match top human forecasters at predicting the future?

Today, FRI is releasing an update to ForecastBench—our benchmark that tracks how accurate LLMs are at forecasting real-world events.

A trend extrapolation of our results suggests LLMs will reach superforecaster-level forecasting performance around a year from now.

Here’s what you need to know: 🧵Image
Why LLM forecasting accuracy is a useful benchmark:

🧠 Forecasting requires collecting and synthesizing data, causal reasoning, and probabilistic thinking, making it a good test of reasoning

💼 Forecasting has high practical value

🔮 Future events aren’t in training data, making the benchmark hard to game

@elonmusk: “The ability to predict the future is the best measure of intelligence”

x.com/elonmusk/statu…
Superforecasters still outperform leading LLMs.

👇 Superforecasters top the ForecastBench tournament leaderboard, with a difficulty-adjusted Brier score of 0.081 (lower scores indicate higher accuracy).

🤖 The best-performing LLM in our dataset is @OpenAI’s GPT-4.5, with a score of 0.101—a gap of 0.02.

A baseline model that always predicts 50% would yield a score of 0.25. Relative to this baseline, superforecasters are 68% more accurate and the best-performing LLM is 60% more accurate—a gap of 8 percentage points.

(Wondering about newer AI models? More on them later!)
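Those relative-accuracy figures follow directly from the scores; a quick check of the arithmetic:

```python
baseline = 0.25  # Brier score of always predicting 50%

for name, score in [("superforecasters", 0.081), ("GPT-4.5", 0.101)]:
    improvement = (baseline - score) / baseline
    print(f"{name}: {improvement:.0%} more accurate than the 50% baseline")
# superforecasters: 68%, GPT-4.5: 60% -> an 8-percentage-point gap
```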
Sep 2, 2025
We now have the first accuracy results from the largest-ever existential risk forecasting tournament.

In 2022, we convened 80 experts and 89 superforecasters for the Existential Risk Persuasion Tournament (XPT), which collected thousands of forecasts on 172 questions across short-, medium-, and long-term time horizons.

We now have answers for 38 short-run questions covering AI progress, climate technology, bioweapons, nuclear weapons, and more.

Here’s what we found out: 🧵Image
Respondents—especially superforecasters—underestimated AI progress.

Participants predicted the state-of-the-art accuracy of ML models on the MATH, MMLU, and QuaLITY benchmarks by June 2025.

Domain experts assigned probabilities of 21.4%, 25%, and 43.5% to the achieved outcomes.

Superforecasters assigned even lower probabilities: just 9.3%, 7.2%, and 20.1%, respectively.
The International Mathematical Olympiad results were even more surprising.

AI systems achieved gold-level performance at the IMO in July 2025.

Superforecasters assigned this outcome just a 2.3% probability. Domain experts put it at 8.6%.
Oct 1, 2024
Today, we're excited to announce ForecastBench: a new benchmark for evaluating AI and human forecasting capabilities. Our research indicates that AI remains worse at forecasting than expert forecasters. 🧵

arXiv: arxiv.org/abs/2409.19839
Website: forecastbench.org
Evaluating LLM forecasting ability is tricky! Prior work asks models about events that already have (or have not) occurred, risking contamination from training data.
Our solution is to use questions about future events, the outcomes of which are unknowable when forecasts are made.
ForecastBench continuously generates new questions about future events, testing the ability of AI models and humans to make accurate probabilistic predictions across diverse domains.
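A minimal sketch of that protocol, with an illustrative question record (field names are hypothetical, not ForecastBench's actual schema): forecasts are accepted only while the outcome is still unknowable, and scored only after resolution.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Question:
    text: str
    resolution_date: date
    outcome: float | None = None  # 0/1 once resolved, unknown before

def accepts_forecasts(q: Question, today: date) -> bool:
    # Forecasts are only accepted while the outcome is still unknowable,
    # so the answer cannot appear in any model's training data.
    return today < q.resolution_date and q.outcome is None

q = Question("Will X happen by 2026-06-30?", date(2026, 6, 30))
print(accepts_forecasts(q, date(2026, 1, 8)))  # True: still open
print(accepts_forecasts(q, date(2026, 7, 1)))  # False: past resolution
```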