Forecasting Research Institute · Oct 1 · 18 tweets
Today, we're excited to announce ForecastBench: a new benchmark for evaluating AI and human forecasting capabilities. Our research indicates that AI remains worse at forecasting than expert forecasters. 🧵

Arxiv: arxiv.org/abs/2409.19839
Website: forecastbench.org
Evaluating LLM forecasting ability is tricky! Prior work asks models about events that already have (or have not) occurred, risking contamination of training data.
Our solution is to use questions about future events, the outcomes of which are unknowable when forecasts are made.
ForecastBench continuously generates new questions about future events, testing the ability of AI models and humans to make accurate probabilistic predictions across diverse domains.
ForecastBench uses a mix of questions from prediction markets, forecasting platforms, and real-world datasets. We've designed it to be continuously updated, providing an ongoing measure of AI forecasting capabilities.
We generate and use questions from many sources: @metaculus, @ManifoldMarkets, @RANDforecasting, @polymarket, @YahooFinance, @stlouisfed, @ACLEDINFO, @DBnomics, and @Wikipedia, with more to come!
We evaluate 17 top language models including GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, comparing their performance against both crowd-sourced human forecasts and predictions from superforecasters. The results in this paper come from questions and forecasts generated in July.
Key finding: superforecasters achieved the best overall performance with a Brier score of 0.093. The general public (0.107) and the best AI model (0.111) lagged behind. Brier scores are a common measure of forecasting accuracy; lower scores indicate more accurate predictions.
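For readers unfamiliar with the metric: the Brier score is just the mean squared error between probabilistic forecasts and binary outcomes. A minimal sketch (the numbers below are illustrative, not from the paper):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.

    Lower is better: a perfect forecaster scores 0.0, and always
    guessing 0.5 scores 0.25 regardless of what happens.
    """
    assert len(forecasts) == len(outcomes)
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# A confident, mostly-correct forecaster scores close to 0:
print(brier_score([0.9, 0.1, 0.8], [1, 0, 1]))  # 0.02
# Hedging everything at 50% scores 0.25:
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25
```

So the gap between 0.093 (superforecasters) and 0.111 (best AI model) is small in absolute terms but meaningful relative to the 0.25 baseline of an uninformed forecaster.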
On a larger set of questions (too many for humans to forecast), we compare models against each other: GPT-4 Turbo and Claude 3.5 Sonnet are the top-performing AI models. However, this may change over time, and we'll update our leaderboard with the latest models as we test them.
We establish several AI baselines, including zero-shot prompting (where models answer directly without additional guidance) and scratchpad prompting (where models are instructed to show their reasoning step-by-step).
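The two baselines differ only in the prompt. A hypothetical sketch of what the templates look like (the exact wording in ForecastBench may differ; these strings are illustrative):

```python
# Illustrative prompt templates for the two baselines described above.
# The real ForecastBench prompts are in the paper's appendix and may differ.

ZERO_SHOT = (
    "Question: {question}\n"
    "Respond with a single probability between 0 and 1."
)

SCRATCHPAD = (
    "Question: {question}\n"
    "Think step by step: list the relevant considerations, weigh the\n"
    "evidence for and against, then end your answer with a line of the\n"
    "form 'Probability: <p>' where <p> is between 0 and 1."
)

question = "Will event X occur by 2025-12-31?"
print(ZERO_SHOT.format(question=question))
print(SCRATCHPAD.format(question=question))
```

The scratchpad variant gives the model room to reason before committing to a number, which is the standard motivation for chain-of-thought-style prompting.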
We also test models with access to (1) forecasts from a human crowd and (2) recent news articles. Interestingly, all of the best-performing models had access to crowd forecasts (when available), but access to recent news didn't significantly help.
Our research shows that AI models perform worse on “combination” questions that require reasoning about relationships between multiple events, hinting at limitations of their causal reasoning abilities.
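To make the failure mode concrete: a combination question asks about the joint outcome of two or more events, and a model that treats correlated events as independent can give a badly miscalibrated joint forecast even when its individual forecasts are good. A toy illustration (numbers are made up, not from the benchmark):

```python
# Toy illustration of why combination questions are harder than their parts.
# Suppose a forecaster assigns these marginal probabilities to events A and B:
p_a = 0.6  # P(A)
p_b = 0.6  # P(B)

# Naively multiplying the marginals assumes A and B are independent:
joint_if_independent = p_a * p_b  # 0.36

# But if A and B are strongly positively correlated (e.g. A nearly
# implies B), the true joint probability can approach min(P(A), P(B)):
joint_upper_bound = min(p_a, p_b)  # 0.6

print(joint_if_independent, joint_upper_bound)
```

Getting the joint right requires modeling the relationship between the events, not just each event in isolation, which is one reading of why models lag further behind on these questions.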
The best models lag (somewhat) behind the median of forecasts from ordinary people, and this is true even if we use our best-performing prompts and allow the models to access the latest news and crowd forecasts, when available.
As AI systems improve, we will use ForecastBench to track if (or when) machine predictions surpass human-level performance. For now, our research shows that superforecasters and ordinary humans are still outperforming AI systems at real-world forecasting.
For more details, our website is now live! See the complete results, a leaderboard (comparing superforecasters, the public, and AI models), and regular performance updates on the ForecastBench website.
forecastbench.org/leaderboards/h…
Soon we’ll run biweekly tournaments open to submissions from anyone, so modelers can see how they perform on the benchmark. We’ll also release regularly updated datasets of questions, resolutions, forecasts, and rationales, here:
forecastbench.org/datasets.html
We are grateful to @open_phil for supporting this work, and we plan to maintain the benchmark for at least the next few years!
ForecastBench is a collaboration between researchers at @Research_FRI (@EzraKarger, Houtan Bastani, Zach Jacobs, and @PTetlock), @jcyhc_ai, @dannyhalawi15, and @FredZhang0.
For mobile users, ForecastBench is live at forecastbench.org