METR

@METR_Evals

Jul 10, 2025 • 13 tweets • 5 min read

We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers.

The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.

We recruited 16 experienced open-source developers to work on 246 real tasks in their own repositories (avg 22k+ stars, 1M+ lines of code).

We randomly assigned each task to either allow AI (typically Cursor Pro w/ Claude 3.5/3.7) or disallow AI help.
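A minimal sketch of what per-task randomization and a simple slowdown estimate could look like. The thread does not describe the estimator, so the ratio of geometric-mean completion times used here is an assumption for illustration. The function names and the completion times are hypothetical and are not study data.

```python
import math
import random


def assign_conditions(task_ids, seed=0):
    """Independently assign each task to one of two conditions with probability 0.5."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    return {t: rng.choice(["ai_allowed", "ai_disallowed"]) for t in task_ids}


def geometric_mean(values):
    """Geometric mean, computed via the mean of logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def time_ratio(times_ai, times_no_ai):
    """Ratio of geometric-mean completion times; a value above 1 means slower with AI."""
    return geometric_mean(times_ai) / geometric_mean(times_no_ai)


# Hypothetical completion times in minutes (invented for illustration only).
times_ai = [60.0, 120.0, 30.0]
times_no_ai = [50.0, 100.0, 25.0]

ratio = time_ratio(times_ai, times_no_ai)
print(f"time ratio: {ratio:.2f}")  # prints: time ratio: 1.20
```

Because each task is assigned independently, the two groups need not be exactly equal in size. A ratio of 1.20 in this toy data would correspond to tasks taking 20% longer when AI is allowed.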

At the beginning of the study, developers forecasted that they would get sped up by 24%. After actually doing the work, they estimated that they had been sped up by 20%. But it turned out that they were actually slowed down by 19%.
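To make the gap concrete, here is a small worked example. It reads "sped up by 24%" as a 24% reduction in completion time and "slowed down by 19%" as a 19% increase, which is an assumption about the thread's wording. The 100-minute baseline and the helper name are hypothetical.

```python
def time_with_ai(baseline_minutes, pct_change):
    """Completion time after a percentage change; negative means faster, positive means slower."""
    return baseline_minutes * (1 + pct_change / 100)


baseline = 100.0  # hypothetical task that takes 100 minutes without AI

forecast = time_with_ai(baseline, -24)   # developers' forecast before the study
perceived = time_with_ai(baseline, -20)  # developers' estimate after the study
observed = time_with_ai(baseline, 19)    # measured effect reported in the thread

print(f"forecast:  {forecast:.0f} min")
print(f"perceived: {perceived:.0f} min")
print(f"observed:  {observed:.0f} min")
print(f"perceived vs. observed gap: {observed - perceived:.0f} min")
```

Under these assumptions, a task that developers believed took about 80 minutes with AI would actually have taken about 119 minutes, a gap of roughly 39 minutes on a 100-minute baseline.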

We were surprised by this, given a) impressive AI benchmark scores, b) widespread adoption of AI tooling for software development, and c) our own recent research measuring trends in the length of tasks that agents are able to complete.

When AI is allowed, developers spend less time actively coding and searching for information, and instead spend time prompting AI, waiting on or reviewing AI outputs, and sitting idle. We find no single reason for the slowdown; it’s driven by a combination of factors.

To better understand these factors, we investigate 20 properties of our setting, finding 5 likely contributors, and 8 mixed/unclear factors.

We also ran robustness checks to make sure the result isn’t a fluke, and found that the slowdown persists across different outcome measures, estimator methodologies, and many other subsets and analyses of our data.

Why did we run this study?

AI agent benchmarks have limitations—they’re self-contained, use algorithmic scoring, and lack live human interaction. This can make it difficult to directly infer real-world impact.

If we want an early warning system for whether AI R&D is being accelerated by AI itself, or even automated, it would be useful to be able to directly measure this in real-world engineer trials, rather than relying on proxies like benchmarks or even noisier information like anecdotes.

So how do we reconcile our results with other sources of data on AI capabilities, like impressive benchmark results, and anecdotes/widespread adoption of AI tools?

Our RCT may underestimate capabilities for various reasons, and benchmarks and anecdotes may overestimate them; the truth is likely some combination. We discuss some possibilities in our accompanying blog post.

What do we take away?

1. It seems likely that for some important settings, recent AI tooling has not increased productivity (and may in fact decrease it).

2. Self-reports of speedup are unreliable—to understand AI’s impact on productivity, we need experiments in the wild.

Another implication:

It is sometimes proposed that we should monitor AI R&D acceleration inside frontier AI labs via simple employee surveys. We’re now more pessimistic about these, given how large a gap we observe between developer-estimated and observed speed-up.

What we're NOT saying:

1. Our setting represents all (or potentially even most) software engineering.

2. Future models won't be better (or current models can’t be used more effectively).

We’re exploring running experiments like this in other settings—if you’re an open-source developer or company interested in understanding the impact of AI on your work, reach out to us here: forms.gle/pBsSo54VpmuQC4…

Paper: metr.org/Early_2025_AI_…

Blog: metr.org/blog/2025-07-1…

More from @METR_Evals

METR

@METR_Evals

Jul 21

Introducing “expenditure horizon”: a proposed method for measuring AI capabilities on continuously-scored problems.

The method compares performance as a function of spend for humans vs agents. The point where humans become more cost-effective is the agent’s expenditure horizon.

Expenditure horizon requires us to estimate human performance as a function of cost, but lets us compare humans and agents fairly when the cost of experimental compute or agent tokens is significant. We can use it to e.g. quantify model progress over time:

As an example, we applied this to NanoGPT. We estimate the marginal returns to human labor as roughly $2500K per 1% optimization, from interviewing NanoGPT contributors. Using this estimate, the best models have crossover points (expenditure horizon) around $2-$3K, although models may be overfit to the public NanoGPT challenge.

METR

@METR_Evals

Jun 26

https://x.com/METR_Evals/status/2070555272230384038

OpenAI gave METR early access to GPT-5.6 Sol for testing including raw chain-of-thought, a railfree version of the model, and internal information about the model. With this access, METR conducted a pre-deployment evaluation of GPT-5.6 Sol, including an attempted measurement of its 50%-Time Horizon. However, the measurement depends heavily on our treatment of cheating attempts, and GPT-5.6 Sol’s detected cheating rate was higher than any public model we have evaluated.

If we follow our standard methodology of marking cheating attempts as failures, we arrive at a 50%-Time Horizon point estimate of around 11.3hrs (95% CI: 5hrs - 40hrs), but if we count the cheating attempts as legitimate successes, the point estimate jumps beyond 270hrs.

This makes us uncertain about GPT-5.6 Sol’s time horizon, but additional information provided by OpenAI and the long-term trend in AI capabilities lead us to believe this model does not pose catastrophic risks from fully automated AI R&D.

METR

@METR_Evals

May 19

Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control.

The result: our first Frontier Risk Report.

We created private reports for each participating company based on our model evaluations and analysis. Participants could then approve what non-public evidence we could disclose in our public report, but had no editorial control.

Our report focuses on risks from AI agents intentionally causing harm within an AI company. We highlight 6 key findings that span “means” (what harmful actions agents could take), “motive” (why they might try), and “opportunity” (whether attempts could succeed given safeguards).

METR

@METR_Evals

May 11

We surveyed 349 technical researchers, engineers, and managers (in February–April 2026) about how they use AI tools at work.

On average, participants self-report that AI use made their work 1.6–2.1x more valuable, and that this multiplier will grow over time.

Surveys are fast and cheap to run, and can be directly focused on answering whatever questions we care most about. However, self-reports are known to be potentially unreliable.

Overall, we think it's useful to triangulate with multiple complementary sources of evidence.

Prior quantitative survey work on the impact of AI on engineering productivity tends to have smaller sample sizes or to measure impact in terms of speed increases. Our survey gives comparable estimates to those in recent system cards, and higher estimates than our field experiments.

METR

@METR_Evals

Apr 10

We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.

https://x.com/METR_Evals/status/1931057777830715526?s=20

In our measurements, whenever a model succeeds on a task by reward-hacking, we consider the attempt a failure. Following this same policy, we arrived at a point estimate of 5.7hrs (95% CI of 3hrs to 13.5hrs) for GPT-5.4’s time horizon.

However, in our GPT-5.4 evaluation we noticed its runs were producing reward hacks unusually often. A quick test suggested that using a different prompt might cause it to produce more legitimate successes instead of reward hacks.

METR

@METR_Evals

Feb 24

Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.

https://x.com/METR_Evals/status/1943360399220388093?s=20

Last year we published findings that AI tools caused a 20% slowdown among experienced open source developers, using data collected over February to June 2025. We still believe that estimate was accurate for the specific tools and population at the time.

We started a continuation in August 2025. However, we noticed developers were opting not to participate or submit work. Participants said they did this mostly due to expected productivity loss on "AI disallowed” tasks. Lower pay was also a factor ($50/hr, down from $150).
