METR’s analysis of this experiment is wildly misleading. The results indicate that people who have ~never used AI tools before are less productive while learning to use the tools, and say ~nothing about experienced AI tool users. Let's take a look at why.
I immediately found the claim suspect because it didn't jibe with my own experience working with people using coding assistants, but sometimes there are surprising results, so I dug in. The first question: who were these developers in the study getting such poor results?
“We recruited 16 experienced open-source developers to work on 246 real tasks in their own repositories (avg 22k+ stars, 1M+ lines of code).” So they sound like reasonably experienced software devs.
"Developers have a range of experience using AI tools: 93% have prior experience with tools like ChatGPT, but only 44% have experience using Cursor." Uh oh. So they haven't actually used AI coding tools, they've like tried prompting an LLM to write code for them. But that's an entirely different kind of experience, as anyone who has used these tools can tell you.
They claim "a range of experience using AI tools", yet only a single developer of the sixteen had more than a week of experience using Cursor. They make it look like a range by breaking "less than a week" into <1 hr, 1-10 hrs, 10-30 hrs, and 30-50 hrs of experience. Given the long, steep learning curve for using these new AI tools well, this division betrays what I hope is just grossly negligent ignorance of that reality, rather than intentional deception.
Of course, the one developer who did have more than a week of experience was 20% faster instead of 20% slower. The authors note this fact, but then say “We are underpowered to draw strong conclusions from this analysis” and bury it in a figure’s description in an appendix.
If the authors of the paper had made the claim, "We tested experienced developers using AI tools for the first time, and found that at least during the first week they were slower rather than faster" that would have been a modestly interesting finding and true. Alas, that is not the claim they made.
METR published and promoted this paper with a provocative, misleading headline, summary, and body. They buried the fact that the single experienced developer DID see significant gains, contradicting the headline. I hope the authors withdraw this paper, or at the very least update it to limit its claims.
The study does appear mostly well designed in other respects, although I would want to audit it carefully before accepting anything at face value. It does seem that developers and experts overestimate how much value a developer will get from using an AI tool during the first week they use it.
A more interesting study would include inexperienced devs, and those who use the tools already, and those who have invested work into optimizing usage of the tools. Perhaps someone with experience using the tools would be able to design such a future study?
It is clear that the source of disagreement is this: I think using Cursor effectively is a distinct skill from talking to ChatGPT while you program, and expect fairly low transfer between the two; the authors think it's essentially the same skill, and expect much higher transfer.
I think conflating the two completely invalidates the study's headline and summary results. I suppose the future will tell if this is the case. I'm glad to have found the underlying disagreement.
Apparently many of the users with less than 50 hours of experience in the paper actually had more? If true, this basically invalidates all findings in the paper, since it indicates flaws in the data collection process.
A greater theory of system design: what’s wrong with modernity and post-modernity, how to survive the coming avalanche, and how to fix the major problems we are facing.
In the beginning, we managed the world intuitively. Early human tribes did not set quarterly hunting quotas, did not have rainfall-adjusted targets for average gathering per capita. We lived in the choiceless mode: meaningness.com/choiceless-mode
There are models in the choiceless mode too. If you believe that the hunt succeeds because of the favor of Artemis, this is a model of hunting. Choiceless mode models are simple models made of very complex parts.
Part one: Systems are Models. But what’s a Model?
I promise this gets practical at some point, but first we have to lay some groundwork. If you find the groundwork obvious or you’re willing to just take my word for it, feel free to skip it. But ultimately, without the background you can’t even really understand the proposal.
Without loss of generality, any system can be seen as a graph of parameter nodes connected by edges, where sensory nodes receive inputs that both drive internal changes to the graph and produce outputs at active nodes.
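The parameter-graph idea above can be sketched in code. This is a minimal illustrative toy, not anything from the thread: the node structure, the 0.1/0.5 update constants, and the output rule are all my own assumptions, chosen only to show inputs at a sensory node driving internal parameter changes and an output at an active node.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    param: float = 0.0  # internal parameter, adjusted as signals pass through
    edges: list = field(default_factory=list)  # outgoing edges to other nodes

class ParameterGraph:
    """Toy system: sensory -> hidden -> active, connected by edges."""

    def __init__(self):
        self.sensory = Node()  # receives external inputs
        self.hidden = Node()
        self.active = Node()   # produces outputs
        self.sensory.edges = [self.hidden]
        self.hidden.edges = [self.active]

    def step(self, signal: float) -> float:
        """Feed a signal in at the sensory node, propagate it along edges
        (nudging each node's parameter, i.e. internal graph change), and
        return the value produced at the active node."""
        frontier = [(self.sensory, signal)]
        output = 0.0
        while frontier:
            node, value = frontier.pop()
            node.param += 0.1 * value            # input drives internal change
            for target in node.edges:
                frontier.append((target, value * 0.5))  # attenuate along edge
            if node is self.active:
                output = node.param + value      # output at the active node
        return output
```

The point of the sketch is only the shape: the same three roles (input, internal state change, output) show up whether the "system" is a tribe, a company, or a neural network.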
I found an old list of blog post ideas that I will probably never write, but I thought it would be fun to turn them into a thread. I wrote these years ago, fun to see the trajectory of my journey. I find them all delightful, even if some are wrong in retrospect.
Power is like radioactive ore…drives the engine of an organization but dangerous to everyone who touches it. Needs to be contained and channeled.
When I was CEOing at Twitch, one of the things I'd do with every batch of interns was a very short presentation on the origins of the company and then a Q&A. One of the questions was always, "Where should I work and what job should I get, or should I start a company?"
It’s an interesting question to try to answer for an intern I didn’t really know, because of course the actual answer is dependent on that person and their life. So I had to figure out how to articulate the framework I used.
First there’s money. Obviously you want money. But money is well-known for diminishing returns, after you have enough for rent and food and so on. So you don’t want to optimize for cash, it’s more of a constraint.
@arithmoquine It is shocking when you first discover the degree to which non-commodity outcomes are constrained by talent, not capital, and how little you can do with money unless there's an existing machine to buy from.
@arithmoquine Think of money as water flowing through a system of pipes and turbines powered by the flow, and access to capital as the ability to open valves in the pipes. You can spin existing turbines faster but directing water doesn’t create new turbines.
@arithmoquine Ofc if someone wants to build a new turbine, without capital it’s pointless, it’ll just sit there. Often they won’t even be able to test the idea without minimal flow to experiment with.
Epistemic status: wild speculation but also I’m clearly right
There is a single general factor — we could call it maybe somatic integrity — which determines a large fraction of the total variance in attributes between people.
It appears to be mostly inherited, bc it's driven by things like low mutation load, lack of environmental insults, a healthy womb environment, etc. It's mostly baked in by the time you're born and can only go down from there.
That’s because somatic integrity is basically successful execution of the healthy human body plan as learned by evolution. When it all goes right, all the hard work pays off and the biological system hums.