Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

METR

@METR_Evals

Jul 10 • 13 tweets • 5 min read • Read on X

We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers.

The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.

We recruited 16 experienced open-source developers to work on 246 real tasks in their own repositories (avg 22k+ stars, 1M+ lines of code).

We randomly assigned each task to either allow AI (typically Cursor Pro w/ Claude 3.5/3.7) or disallow AI help.

At the beginning of the study, developers forecasted that they would get sped up by 24%. After actually doing the work, they estimated that they had been sped up by 20%. But it turned out that they were actually slowed down by 19%.

We were surprised by this, given a) impressive AI benchmark scores, b) widespread adoption of AI tooling for software development, and c) our own recent research measuring trends in the length of tasks that agents are able to complete.

When AI is allowed, developers spend less time actively coding and searching for information, and instead spend time prompting AI, waiting on/reviewing AI outputs, and idle. We find no single reason for the slowdown—it’s driven by a combination of factors.

To better understand these factors, we investigate 20 properties of our setting, finding 5 likely contributors, and 8 mixed/unclear factors.

We also analyze to make sure the result isn’t a fluke, and find that slowdown persists across different outcome measures, estimator methodologies, and many other subsets/analyses of our data.

Why did we run this study?

AI agent benchmarks have limitations—they’re self-contained, use algorithmic scoring, and lack live human interaction. This can make it difficult to directly infer real-world impact.

If we want an early warning system for whether AI R&D is being accelerated by AI itself, or even automated, it would be useful to be able to directly measure this in real-world engineer trials, rather than relying on proxies like benchmarks or even noisier information like anecdotes.

So how do we reconcile our results with other sources of data on AI capabilities, like impressive benchmark results, and anecdotes/widespread adoption of AI tools?

Our RCT may underestimate capabilities for various reasons, and benchmarks and anecdotes may overestimate capabilities (likely some combination)—we discuss some possibilities in our accompanying blog post.

What do we take away?

1. It seems likely that for some important settings, recent AI tooling has not increased productivity (and may in fact decrease it).

2. Self-reports of speedup are unreliable—to understand AI’s impact on productivity, we need experiments in the wild.

Another implication:

It is sometimes proposed that we should monitor AI R&D acceleration inside of frontier AI labs via simple employee surveys. We’re now more pessimistic about these, given how large of a gap we observe between developer-estimated and observed speed-up.

What we're NOT saying:

1. Our setting represents all (or potentially even most) software engineering.

2. Future models won't be better (or current models can’t be used more effectively).

We’re exploring running experiments like this in other settings—if you’re an open-source developer or company interested in understanding the impact of AI on your work, reach out to us here: forms.gle/pBsSo54VpmuQC4…

Paper: metr.org/Early_2025_AI_…

Blog: metr.org/blog/2025-07-1…

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @METR_Evals

METR

@METR_Evals

Aug 7

In a new report, we evaluate whether GPT-5 poses significant catastrophic risks via AI R&D acceleration, rogue replication, or sabotage of AI labs.

We conclude that this seems unlikely. However, capability trends continue rapidly, and models display increasing eval awareness.

We argue: (1) these threat models require capabilities significantly beyond current systems, (2) our results indicate GPT-5 is an on-trend improvement that lacks these capabilities by a reasonable margin and (3) other evidence did not raise significant doubts about our results.

We use time horizons on our suite of software tasks to estimate the capabilities of GPT-5. We estimate that its 50%-completion time horizon on our tasks is 2 hours and 17 minutes, which is near the upper-edge of our earlier forecasts and consistent with a faster recent trend.

Read 12 tweets

METR

@METR_Evals

Jul 14

https://twitter.com/metr_evals/status/1902384481111322929

METR previously estimated that the time horizon of AI agents on software tasks is doubling every 7 months.

We have now analyzed 9 other benchmarks for scientific reasoning, math, robotics, computer use, and self-driving; we observe generally similar rates of improvement.

https://twitter.com/metr_evals/status/1902384481111322929

We analyze data from 9 existing benchmarks: MATH, OSWorld, LiveCodeBench, Mock AIME, GPQA Diamond, Tesla FSD, Video-MME, RLBench, and SWE-Bench Verified, which either include human time data or allow us to estimate it.

The frontier time horizon on different benchmarks differs by >100x. Many reasoning and coding benchmarks cluster at or above 1 hour, but agentic computer use (OSWorld, WebArena) is only ~2 minutes, possibly due to poor tooling.

Read 8 tweets

METR

@METR_Evals

Mar 19

When will AI systems be able to carry out long projects independently?

In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.

At a high level, our method is simple:
1. We ask both skilled humans and AI systems to attempt tasks in similar conditions.
2. We measure how long the humans take.
3. We then measure how AI success rates vary depending on how long the humans took to do those tasks.

https://twitter.com/idavidrein/status/1901647558839353363

We measure human and AI performance on a variety of software tasks, some sourced from existing METR benchmarks like HCAST and some brand new.

Human completion times on these tasks range from 1 second to 16 hours.

https://twitter.com/idavidrein/status/1901647558839353363

Read 11 tweets

METR

@METR_Evals

Nov 22, 2024

How close are current AI agents to automating AI R&D? Our new ML research engineering benchmark (RE-Bench) addresses this question by directly comparing frontier models such as Claude 3.5 Sonnet and o1-preview with 50+ human experts on 7 challenging research engineering tasks.

Many governments and companies have highlighted automation of AI R&D by AI agents as a key capability to monitor for when scaling/deploying frontier ML systems. However, existing evals tend to focus on short, narrow tasks and lack direct comparisons with human experts.

The tasks in RE-Bench aim to cover a wide variety of skills required for AI R&D and enable apples-to-apples comparisons between humans and AI agents, while also being feasible for human experts given ≤8 hours and reasonable amounts of compute.

Read 14 tweets

METR

@METR_Evals

Sep 13, 2024

https://twitter.com/OpenAI/status/1834278217626317026

We ran o1-preview on our suite of ML R&D/SWE/general agency tasks, from Sep 3–9. 4 days of scaffolding iteration took it from well below GPT-4o to on par with the highest-scoring public model (3.5 Sonnet). We expect substantial performance gains from more elicitation/finetuning.

https://twitter.com/OpenAI/status/1834278217626317026

The o1-preview agent made nontrivial progress on 2 of 7 challenging AI R&D tasks (intended for skilled research engineers to take ~8h). It was able to create an agent scaffold that allowed GPT-3.5 to solve coding problems in rust, and fine-tune GPT-2 for question-answering.

We noticed some interesting examples of o1-preview skirting instructions to get higher scores.

E.g. when asked to optimize a finetuning script without affecting the resulting model’s behavior, it writes a script that copies over the weights of a previous finetuned model.

Read 4 tweets

METR

@METR_Evals

Aug 6, 2024

How well can LLM agents complete diverse tasks compared to skilled humans? Our preliminary results indicate that our baseline agents based on several public models (Claude 3.5 Sonnet and GPT-4o) complete a proportion of tasks similar to what humans can do in ~30 minutes. 🧵

Supplementing our work on evaluating specific capabilities of concern, our task suite for autonomous capabilities measures skills including cybersecurity, software engineering, and ML. The tasks range in difficulty from taking skilled humans less than 15 minutes to many hours.

While the agents tend to succeed more often on tasks that humans take less time to complete, the agents sometimes fail to complete tasks that take humans fewer than 15 minutes, and sometimes complete tasks that take humans hours.

Read 8 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

METR

Try unrolling a thread yourself!

More from @METR_Evals

METR

METR

METR

METR

METR

METR

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!