METR’s analysis of this experiment is wildly misleading. The results indicate that people who have ~never used AI tools before are less productive while learning to use the tools, and say ~nothing about experienced AI tool users. Let's take a look at why.
I immediately found the claim suspect because it didn't jibe with my own experience working with people using coding assistants, but sometimes there are surprising results, so I dug in. The first question: who were these developers in the study getting such poor results?
“We recruited 16 experienced open-source developers to work on 246 real tasks in their own repositories (avg 22k+ stars, 1M+ lines of code).” So they sound like reasonably experienced software devs.
"Developers have a range of experience using AI tools: 93% have prior experience with tools like ChatGPT, but only 44% have experience using Cursor." Uh oh. So they haven't actually used AI coding tools, they've like tried prompting an LLM to write code for them. But that's an entirely different kind of experience, as anyone who has used these tools can tell you.
They claim "a range of experience using AI tools", yet only a single developer of the sixteen had more than a week of experience using Cursor. They make it look like a range by breaking "less than a week" into <1 hr, 1-10 hrs, 10-30 hrs, and 30-50 hrs of experience. Given the long, steep learning curve for using these new AI tools well, this division betrays what I hope is just grossly negligent ignorance of that reality, rather than intentional deception.
Of course, the one developer who did have more than a week of experience was 20% faster instead of 20% slower. The authors note this fact, but then say “We are underpowered to draw strong conclusions from this analysis” and bury it in a figure’s description in an appendix.
If the authors of the paper had made the claim, "We tested experienced developers using AI tools for the first time, and found that at least during the first week they were slower rather than faster" that would have been a modestly interesting finding and true. Alas, that is not the claim they made.
METR published and promoted this paper with a provocative, misleading headline, summary, and body. They buried the fact that the single experienced developer DID see significant gains, contradicting the headline. I hope the authors withdraw this paper, or at the very least update it to limit its claims.
The study does appear mostly well designed in other respects, although I would want to audit it carefully before accepting anything at face value. It does seem that developers and experts overestimate how much value a developer will get from using an AI tool during the first week they use it.
A more interesting study would include inexperienced devs, and those who use the tools already, and those who have invested work into optimizing usage of the tools. Perhaps someone with experience using the tools would be able to design such a future study?
It is clear that the source of disagreement is this: I think using Cursor effectively is a distinct skill from talking to ChatGPT while you program, and expect fairly low transfer between the two; the authors think it's essentially the same skill, and expect much higher transfer.
I think conflating the two completely invalidates the study's headline and summary results. I suppose the future will tell if this is the case. I'm glad to have found the underlying disagreement.
Apparently many of the users with less than 50 hours of experience in the paper actually had more? If true, this basically invalidates all findings in the paper, since it indicates flaws in the data collection process.
A greater theory of system design: what’s wrong with modernity and post-modernity, how to survive the coming avalanche, and how to fix the major problems we are facing.
In the beginning, we managed the world intuitively. Early human tribes did not set quarterly hunting quotas, did not have rainfall-adjusted targets for average gathering per capita. We lived in the choiceless mode: meaningness.com/choiceless-mode
There are models in the choiceless mode too. If you believe that the hunt succeeds because of the favor of Artemis, this is a model of hunting. Choiceless mode models are simple models made of very complex parts.
Part one: Systems are Models. But what’s a Model?
I promise this gets practical at some point, but first we have to lay some groundwork. If you find the groundwork obvious or you’re willing to just take my word for it, feel free to skip it. But ultimately, without the background you can’t even really understand the proposal.
Without loss of generality, any system can be seen as a graph of parameter nodes connected by edges, where sensory nodes receive inputs that both drive internal changes to the graph and produce outputs at active nodes.
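The parameter-graph idea above can be sketched in code. This is a minimal illustrative toy, not anything from the thread: the node structure, the 0.1/0.5 update constants, and the output rule are all my own assumptions, chosen only to show inputs at a sensory node driving internal parameter changes and an output at an active node.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    param: float = 0.0  # internal parameter, adjusted as signals pass through
    edges: list = field(default_factory=list)  # outgoing edges to other nodes

class ParameterGraph:
    """Toy system: sensory -> hidden -> active, connected by edges."""

    def __init__(self):
        self.sensory = Node()  # receives external inputs
        self.hidden = Node()
        self.active = Node()   # produces outputs
        self.sensory.edges = [self.hidden]
        self.hidden.edges = [self.active]

    def step(self, signal: float) -> float:
        """Feed a signal in at the sensory node, propagate it along edges
        (nudging each node's parameter, i.e. internal graph change), and
        return the value produced at the active node."""
        frontier = [(self.sensory, signal)]
        output = 0.0
        while frontier:
            node, value = frontier.pop()
            node.param += 0.1 * value            # input drives internal change
            for target in node.edges:
                frontier.append((target, value * 0.5))  # attenuate along edge
            if node is self.active:
                output = node.param + value      # output at the active node
        return output
```

The point of the sketch is only the shape: the same three roles (input, internal state change, output) show up whether the "system" is a tribe, a company, or a neural network.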
I found an old list of blog post ideas that I will probably never write, but I thought it would be fun to turn them into a thread. I wrote these years ago, fun to see the trajectory of my journey. I find them all delightful, even if some are wrong in retrospect.
Power is like radioactive ore…drives the engine of an organization but dangerous to everyone who touches it. Needs to be contained and channeled.
When I was CEOing at Twitch, one of the things I'd do with every batch of interns was a very short presentation on the origins of the company and then a Q&A. One of the questions was always, "Where should I work and what job should I get, or should I start a company?"
It’s an interesting question to try to answer for an intern I didn’t really know, because of course the actual answer is dependent on that person and their life. So I had to figure out how to articulate the framework I used.
First there’s money. Obviously you want money. But money is well-known for diminishing returns, after you have enough for rent and food and so on. So you don’t want to optimize for cash, it’s more of a constraint.
@arithmoquine It is shocking when you first discover the degree to which non-commodity outcomes are constrained by talent, not capital, and how little you can do with money unless there's an existing machine to buy from.
@arithmoquine Think of money as water flowing through a system of pipes and turbines powered by the flow, and access to capital as the ability to open valves in the pipes. You can spin existing turbines faster but directing water doesn’t create new turbines.
@arithmoquine Ofc if someone wants to build a new turbine, without capital it’s pointless, it’ll just sit there. Often they won’t even be able to test the idea without minimal flow to experiment with.
Epistemic status: wild speculation but also I’m clearly right
There is a single general factor — we could call it maybe somatic integrity — which determines a large fraction of the total variance in attributes between people.
It appears to be mostly inherited, bc it's driven by things like low mutation load, lack of environmental insults, a healthy womb environment, etc. It's mostly baked in by the time you're born and can only go down from there.
That’s because somatic integrity is basically successful execution of the healthy human body plan as learned by evolution. When it all goes right, all the hard work pays off and the biological system hums.