I was one of the developers in the @METR_Evals study. Thoughts:
1. This is much less true of my participation in the study, where I was more conscientious, but I feel like historically a lot of my AI speed-up gains were eaten by the fact that while a prompt was running, I'd look at something else (FB, X, etc) and continue to do so for much longer than it took the prompt to run
I discovered two days ago that Cursor has (or now has) a feature you can enable to ring a bell when the prompt is done. I expect to reclaim a lot of the AI gains this way (1/N)
2. Historically I've lost some of my AI speed-ups to cleaning up the same issues LLM code would introduce, often relatively simple violations of code conventions like using || instead of ??
A bunch of this is avoidable with stored system prompts, which I was lazy about writing. Cursor has now made this easier and even attempts to learn repeatable rules (e.g. "The user prefers X") that will get re-used, saving time here. (2/N)
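For anyone who hasn't hit the || vs ?? thing: a minimal sketch of why the convention matters. The `Settings` type, the function names, and the default of 20 are all made up for illustration and are not from the LessWrong codebase.

```typescript
// Hypothetical settings object, for illustration only.
interface Settings {
  pageSize?: number;
}

// `||` falls back on ANY falsy value, so an explicit 0 is silently replaced.
function pageSizeWithOr(s: Settings): number {
  return s.pageSize || 20;
}

// `??` falls back only on null or undefined, so an explicit 0 is kept.
function pageSizeWithNullish(s: Settings): number {
  return s.pageSize ?? 20;
}
```

The two agree when the value is missing and differ only on falsy-but-valid values like 0 or "", which is why the mistake is easy to miss in review.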
3. Regarding me specifically, I work on the LessWrong codebase, which is technically open-source. I feel like calling myself an "open-source developer" has the wrong connotations, and makes it sound more like I contribute to a highly-used Python library or something as an upper-tier developer, which I'm not (3/N)
4. As a developer in the study, it's striking to me how much more capable the models have gotten since February (when I was participating in the study)
I'm trying to recall if I was even using agents at the start. Certainly the later models (Opus 4, Gemini 2.5 Pro, o3) could just do vastly more with less guidance than 3.6, o1, etc.
For me, not going over my own data in the study, I could buy that maybe I was being slowed down a few months ago, but it is much much harder to believe now (4/N)
5. There was a selection effect in which tasks I submitted to the study. (a) I didn't want to risk getting randomized to "no AI" on tasks that felt sufficiently important or daunting to do without AI assistance. (b) Neatly packaged and well-scoped tasks felt suitable for the study, while large open-ended greenfield stuff felt harder to legibilize, so I didn't submit those tasks to the study even though the AI speed-up might have been larger (5/N)
6. I think if the result is valid at this point in time, that's one thing. I think if people are citing it in another 3 months' time, they'll be making a mistake (and I hope METR has published a follow-up) (6/6)
• • •
Okay, I did it. Threw Deep Research at the medical questions I tackled for ~months in 2020 when battling my wife's cancer
Based on my test case, this iteration of Deep Research can tell you what the current literature on a topic would advise, but not make novel deductions to improve upon where the human experts are at
I think it might have sped up my cancer research in 2020 but not replaced it. That guy saying it's better than his $150k/year team...maybe needs to get better at hiring, idk
🧵Thread with more details 0/n
Tbc, it's still a great tool even in its current state, and I expect to use it. Just hunting around for papers on a topic and finding the relevant ones can take hours. Useful even if I have to read and critically judge the papers myself 1/n
Ok, so the test case: 1. we know if you have a malignant tumor growing on your bone, you want to surgically cut it out 2. we know that if you cut very narrowly around the tumor, with little margin, you get worse outcomes than if you remove it with a wider margin (taking out more healthy tissue with it) – there's a straightforward monotonic curve here
Lightcone/@lesswrong (where I work) is concluding the first month of our fundraiser. We’ve raised 1.3M out of the 3M we need to make it through 2025. Habryka has a 12,000 word post making the case for us.
I’m here to tell you what Habryka cannot easily say himself: why he as a specific human is worth funding for his projects. (Thread below.)
@lesswrong 0. I’ve known @ohabryka since 2013 when he was ~19. I’ve worked with him at LessWrong/Lightcone since 2019 (six years).
If you factor in foregone income and less portable career capital, I’m a major donor to Habryka’s projects myself. Here’s why I do it:
1. he is v smart (obvs) and one of the ways that comes through is that he basically never offloads thinking about a domain to others. He believes hard in being a "generalist" and that means he can and does perform every task/role in the company, and for most tasks, does it better than anyone else. I'm talking coding, UI design, interior design, construction design, legal, fundraising, customer support, analytics, pest eradication, you name it.
He expects the same of the core team of "generalists" (it's a bad name, should be more like "specialists in everything"). The standard rule is we're only allowed to outsource stuff that we've done at least once ourselves.
For some domains, we'll consult or employ experts, e.g. lawyers or contractors, but Habryka will be building expertise in that topic too so he can scrutinize what the supposed experts say.
And you might think the CEO is too busy and too important for everyday stuff, but he’s in the trenches as much as the rest of us. Over the vacation period (or any time), he’s the one picking up the slack in responding to support queries. During events at Lighthaven, he’s the one running around lighting the outdoor heaters.