METR
Jul 10, 2025 (13 tweets, 5 min read)
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers.

The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
We recruited 16 experienced open-source developers to work on 246 real tasks in their own repositories (avg 22k+ stars, 1M+ lines of code).

We randomly assigned each task to either allow AI (typically Cursor Pro w/ Claude 3.5/3.7) or disallow AI help.
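As a rough illustration of per-task randomization, here is a minimal sketch that assumes an independent, equal-probability coin flip for each task. The thread does not describe the actual assignment procedure, so the function name, the seed, and the task IDs are all hypothetical.

```python
import random

def assign_conditions(task_ids, seed=0):
    """Independently assign each task to one of two conditions (assumed 50/50)."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    conditions = ["ai_allowed", "ai_disallowed"]
    return {task_id: rng.choice(conditions) for task_id in task_ids}

# Hypothetical task IDs; 246 matches the number of tasks in the study.
tasks = [f"task-{i}" for i in range(246)]
assignment = assign_conditions(tasks)
```

Randomizing at the task level, rather than letting developers choose when to use AI, is what lets the difference in completion times be attributed to AI access itself.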
At the beginning of the study, developers forecasted that they would get sped up by 24%. After actually doing the work, they estimated that they had been sped up by 20%. But it turned out that they were actually slowed down by 19%.
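To make the size of that gap concrete, here is a small sketch of the arithmetic. It assumes each percentage is a change in task completion time (24% and 20% less time expected, 19% more time observed); the thread itself does not spell out the exact metric, and the 60-minute baseline is a made-up example.

```python
def time_ratio(percent_change_in_time):
    """Convert a percent change in completion time into a multiplier."""
    return 1 + percent_change_in_time / 100

# Assumed interpretation: negative = less time with AI, positive = more time.
forecast = time_ratio(-24)       # expected before the study
self_estimate = time_ratio(-20)  # believed after the study
observed = time_ratio(19)        # measured in the trial

# Gap between perceived and measured change, in percentage points.
gap_points = 19 - (-20)

# Hypothetical task that takes 60 minutes without AI.
baseline_minutes = 60
believed_minutes = round(baseline_minutes * self_estimate, 1)
measured_minutes = round(baseline_minutes * observed, 1)
```

Under these assumptions, developers' perception and the measurement differ by 39 percentage points, and they point in opposite directions.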
We were surprised by this, given a) impressive AI benchmark scores, b) widespread adoption of AI tooling for software development, and c) our own recent research measuring trends in the length of tasks that agents are able to complete.
When AI is allowed, developers spend less time actively coding and searching for information, and instead spend time prompting AI, waiting on/reviewing AI outputs, and idle. We find no single reason for the slowdown; it's driven by a combination of factors.
To better understand these factors, we investigate 20 properties of our setting, finding 5 likely contributors, and 8 mixed/unclear factors.

We also ran checks to make sure the result isn't a fluke, and found that the slowdown persists across different outcome measures, estimator methodologies, and many other subsets/analyses of our data.
Why did we run this study?

AI agent benchmarks have limitations—they’re self-contained, use algorithmic scoring, and lack live human interaction. This can make it difficult to directly infer real-world impact.

If we want an early warning system for whether AI R&D is being accelerated by AI itself, or even automated, it would be useful to be able to directly measure this in real-world engineer trials, rather than relying on proxies like benchmarks or even noisier information like anecdotes.
So how do we reconcile our results with other sources of data on AI capabilities, like impressive benchmark results, and anecdotes/widespread adoption of AI tools?
Our RCT may underestimate capabilities for various reasons, and benchmarks and anecdotes may overestimate capabilities (likely some combination)—we discuss some possibilities in our accompanying blog post.
What do we take away?

1. It seems likely that for some important settings, recent AI tooling has not increased productivity (and may in fact decrease it).

2. Self-reports of speedup are unreliable—to understand AI’s impact on productivity, we need experiments in the wild.
Another implication:

It is sometimes proposed that we should monitor AI R&D acceleration inside of frontier AI labs via simple employee surveys. We're now more pessimistic about these, given how large a gap we observe between developer-estimated and observed speed-up.
What we're NOT saying:

1. Our setting represents all (or potentially even most) software engineering.

2. Future models won't be better (or current models can't be used more effectively).
We’re exploring running experiments like this in other settings—if you’re an open-source developer or company interested in understanding the impact of AI on your work, reach out to us here: forms.gle/pBsSo54VpmuQC4…

Paper: metr.org/Early_2025_AI_…

Blog: metr.org/blog/2025-07-1…

