When will AI systems be able to carry out long projects independently?
In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
At a high level, our method is simple:
1. We ask both skilled humans and AI systems to attempt tasks in similar conditions.
2. We measure how long the humans take.
3. We then measure how AI success rates vary depending on how long the humans took to do those tasks.
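As a concrete illustration of step 3, here is a minimal sketch of how AI success rates can be summarized against human completion times. The per-task records below are hypothetical, not METR's actual data or code.

```python
# Illustrative sketch only: hypothetical per-task records, not METR's data.
# Each record pairs the human baseline time for a task with whether a given
# AI agent succeeded on that task.
from collections import defaultdict

records = [
    {"task": "fix_unit_test", "human_minutes": 4, "ai_success": True},
    {"task": "write_scraper", "human_minutes": 35, "ai_success": True},
    {"task": "debug_race_condition", "human_minutes": 180, "ai_success": False},
    {"task": "build_small_service", "human_minutes": 960, "ai_success": False},
]

# Bucket tasks by rough human-time range and compute the AI's empirical
# success rate in each bucket.
buckets = defaultdict(list)
for r in records:
    if r["human_minutes"] < 15:
        label = "<15 min"
    elif r["human_minutes"] < 60:
        label = "15-60 min"
    elif r["human_minutes"] < 240:
        label = "1-4 hr"
    else:
        label = ">4 hr"
    buckets[label].append(r["ai_success"])

for label, outcomes in buckets.items():
    rate = sum(outcomes) / len(outcomes)
    print(f"{label}: success rate {rate:.0%} over {len(outcomes)} task(s)")
```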
We measure human and AI performance on a variety of software tasks, some sourced from existing METR benchmarks like HCAST and some brand new.
Human completion times on these tasks range from 1 second to 16 hours.
We then fit a curve that predicts an AI's success rate on a task from how long that task took humans. This curve characterizes how capable an AI is at different task lengths, and we summarize it with the task length at which the model's success rate is 50%.
This metric - the 50% task completion time horizon - gives us a way to track progress in model autonomy over time.
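One way to operationalize this metric is to fit a logistic curve to per-task success as a function of log human time, then solve for the time at which predicted success is 50%. The sketch below is illustrative and uses made-up data; it is not METR's exact code.

```python
# Sketch: fit success ~ logistic(a + b * log2(human_minutes)) and solve for
# the human time at which predicted success is 50%. Data below is made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([1, 2, 5, 10, 30, 60, 120, 240, 480, 960])
ai_success    = np.array([1, 1, 1, 1,  1,  0,   1,   0,   0,   0])

X = np.log2(human_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, ai_success)

# Predicted success is 50% where a + b * log2(t) = 0, i.e. t = 2^(-a / b).
a = model.intercept_[0]
b = model.coef_[0][0]
horizon_minutes = 2 ** (-a / b)
print(f"50% time horizon: {horizon_minutes:.0f} human-minutes")
```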
Plotting the historical trend of 50% time horizons across frontier AI systems shows exponential growth.
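To make "exponential growth" concrete: regressing log2 of the time horizon against model release date gives a slope in doublings per year. A minimal sketch with illustrative, made-up horizon values (not our measurements):

```python
# Sketch: estimate the doubling time of 50% horizons from (release_date, horizon)
# pairs. Values here are illustrative placeholders, not METR's measurements.
import numpy as np

release_year = np.array([2019.1, 2020.5, 2022.2, 2023.2, 2024.5, 2025.0])
horizon_min  = np.array([0.03,   0.5,    5.0,    12.0,   60.0,   120.0])

# Linear fit of log2(horizon) vs. time: the slope is doublings per year.
slope, intercept = np.polyfit(release_year, np.log2(horizon_min), 1)
print(f"~{slope:.1f} doublings per year, doubling time ~{12 / slope:.1f} months")
```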
These results appear robust. Although our model could be wrong, we are relatively confident about its fit to the data. While our initial data only covered the most recent systems, we found we could retrodict back to GPT-2.
We ran experiments on SWE-bench Verified and found a similar trend. We also ran a small experiment on internal METR pull requests, and found results consistent with our other datasets. We are excited for researchers to extend this and measure time horizons on other benchmarks.
We are fairly confident in the rough trend of 1-4 doublings in horizon length per year. That is fast! Measures like these help make the notion of “degrees of autonomy” more concrete and let us quantify when AI abilities may rise above specific useful (or dangerous) thresholds.
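For example, at a 7-month doubling time a horizon grows by a factor of 2^(12/7) ≈ 3.3x per year, and the time to reach a given threshold is the number of doublings needed times the doubling time. A hedged sketch of that arithmetic, where the starting horizon and the threshold are hypothetical:

```python
# Sketch: how long until a 50% time horizon crosses a threshold, assuming the
# doubling trend continues. The starting horizon and threshold are hypothetical.
import math

doubling_months = 7          # observed doubling time
current_horizon_hours = 5    # hypothetical current 50% horizon
threshold_hours = 167        # roughly one month of 40-hour work weeks

doublings_needed = math.log2(threshold_hours / current_horizon_hours)
print(f"{doublings_needed:.1f} doublings -> "
      f"~{doublings_needed * doubling_months:.0f} months if the trend holds")
```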
We estimate that, on our tasks, Claude Opus 4.5 has a 50%-time horizon of around 4 hrs 49 mins (95% confidence interval of 1 hr 49 mins to 20 hrs 25 mins). While we're still working through evaluations for other recent models, this is our highest published time horizon to date.
We don’t think the high upper CI bound reflects Opus’s actual capabilities: our current task suite doesn’t have enough long tasks to confidently upper bound Opus 4.5’s 50%-time horizon. We are working on updating our task suite, and hope to share more details soon.
Based on our experience interacting with Opus 4.5, the model’s performance on specific tasks (including some not in our time horizon suite), and its benchmark performance, we would be surprised if further investigation showed Opus had a 20+ hour 50%-time horizon.
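For readers curious where uncertainty bands like these can come from: one standard approach is to bootstrap over tasks, refitting the horizon on each resample. The sketch below is illustrative only and is not necessarily the exact procedure behind the interval quoted above.

```python
# Sketch: bootstrap a confidence interval for a 50% time horizon by resampling
# tasks. Data and procedure are illustrative, not the method behind the
# numbers quoted above.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
human_minutes = np.array([1, 2, 5, 10, 30, 60, 120, 240, 480, 960])
ai_success    = np.array([1, 1, 1, 1,  1,  1,   0,   1,   0,   0])

def horizon(minutes, success):
    X = np.log2(minutes).reshape(-1, 1)
    m = LogisticRegression().fit(X, success)
    return 2 ** (-m.intercept_[0] / m.coef_[0][0])

estimates = []
for _ in range(1000):
    idx = rng.integers(0, len(human_minutes), len(human_minutes))
    if len(set(ai_success[idx])) < 2:   # skip degenerate resamples
        continue
    estimates.append(horizon(human_minutes[idx], ai_success[idx]))

lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"95% bootstrap CI: {lo:.0f} to {hi:.0f} human-minutes")
```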
We estimate that Kimi K2 Thinking has a 50%-time horizon of around 54 minutes (95% confidence interval of 25 to 100 minutes) on our agentic SWE tasks. Note that we conducted this evaluation through a third-party inference provider, which reduces our confidence in this estimate.
Model performance can vary based on inference provider and time of evaluation. Our Kimi K2 Thinking runs for this evaluation came from Novita AI via OpenRouter, over November 13-17.
We chose this inference provider because its policies indicated it wouldn't retain or train on our tasks. Our guess is that this produces lower performance for Kimi K2 Thinking than what we would see from the developer's own API.
In a new report, we evaluate whether GPT-5 poses significant catastrophic risks via AI R&D acceleration, rogue replication, or sabotage of AI labs.
We conclude that this seems unlikely. However, capabilities continue to improve rapidly, and models display increasing eval awareness.
We argue: (1) these threat models require capabilities significantly beyond current systems, (2) our results indicate GPT-5 is an on-trend improvement that lacks these capabilities by a reasonable margin, and (3) other evidence did not raise significant doubts about our results.
We use time horizons on our suite of software tasks to estimate the capabilities of GPT-5. We estimate that its 50%-completion time horizon on our tasks is 2 hours and 17 minutes, which is near the upper edge of our earlier forecasts and consistent with a faster recent trend.
METR previously estimated that the time horizon of AI agents on software tasks is doubling every 7 months.
We have now analyzed 9 other benchmarks for scientific reasoning, math, robotics, computer use, and self-driving; we observe generally similar rates of improvement.
We analyze data from 9 existing benchmarks: MATH, OSWorld, LiveCodeBench, Mock AIME, GPQA Diamond, Tesla FSD, Video-MME, RLBench, and SWE-Bench Verified, each of which either includes human time data or allows us to estimate it.
The frontier time horizon on different benchmarks differs by >100x. Many reasoning and coding benchmarks cluster at or above 1 hour, but agentic computer use (OSWorld, WebArena) is only ~2 minutes, possibly due to poor tooling.
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers.
The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
We recruited 16 experienced open-source developers to work on 246 real tasks in their own repositories (avg 22k+ stars, 1M+ lines of code).
We randomly assigned each task to either allow AI (typically Cursor Pro w/ Claude 3.5/3.7) or disallow AI help.
At the beginning of the study, developers forecasted that they would get sped up by 24%. After actually doing the work, they estimated that they had been sped up by 20%. But it turned out that they were actually slowed down by 19%.
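One simple way to estimate a speedup or slowdown from a task-level RCT like this is to compare log completion times between AI-allowed and AI-disallowed tasks. The sketch below is illustrative with made-up times; it is not the study's actual analysis.

```python
# Sketch: estimate the AI "speedup" from a task-level RCT by comparing log
# completion times across arms. Times below are made up for illustration.
import numpy as np

ai_allowed_hours    = np.array([2.5, 1.0, 4.0, 3.2, 0.8, 5.5])
ai_disallowed_hours = np.array([2.0, 0.9, 3.1, 2.8, 0.7, 4.6])

# Difference in mean log time ~ log of the ratio of typical completion times.
effect = np.mean(np.log(ai_allowed_hours)) - np.mean(np.log(ai_disallowed_hours))
ratio = np.exp(effect)
print(f"AI-allowed tasks took ~{ratio:.2f}x as long "
      f"({ratio - 1:+.0%} change in completion time)")
```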
METR tested pre-release versions of o3 + o4-mini on tasks involving autonomy and AI R&D. For each model, we examined how capable it is on our tasks & how often it tries to “hack” them. We detail our findings in a new report, a summary of which is included in OpenAI's system card.
On an updated version of our task suite, we estimate that o3 and o4-mini reach 50% time horizons which are 1.8x and 1.5x that of Claude 3.7 Sonnet, respectively. This is longer than all other public models we’ve tested.
We observed that o3 in particular has a propensity to try to “hack” our tasks to get a higher score. Importantly, we saw this arise naturally from the model without explicit nudging. Behaviors like these have required us to be more careful in how we evaluate model capabilities.