Joel Becker
move fast and fix things @METR_evals. 'soccer'-me @MessiSeconds.
Nov 24, 2025 11 tweets 3 min read
How might @METR_Evals' time horizon trend change if compute growth slows?

In a new paper, @whitfill_parker, @bsnodin, and I show that trends + a common (and contestable -- read on!) economic model of algorithmic progress can imply substantial delays in AI capability milestones.

Fundamentally, the ideas in the paper are extremely simple. Time horizon and compute have been growing exponentially. If compute slows, plausibly time horizon slows. If the slowing is substantial, the resulting delays vs. naive trend extrapolation can be quantitatively large.
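To make the mechanism concrete, here is a back-of-envelope sketch (my illustration, not the paper's model). It assumes time horizon doubles every fixed number of months, that a fixed share of that growth is driven by compute, and that compute growth halves from today; every parameter value below is a made-up placeholder.

```python
# Illustrative sketch, NOT the paper's model: how a compute slowdown
# could delay a time-horizon milestone under naive exponential trends.
# All numbers are hypothetical placeholders, not the paper's estimates.
import math

doubling_months = 7.0        # assumed pre-slowdown time-horizon doubling time
current_horizon_min = 60.0   # assumed current 50%-success time horizon (minutes)
milestone_min = 167 * 60.0   # example milestone: ~1 work-month of human time

# naive extrapolation: months until the milestone on the current trend
doublings_needed = math.log2(milestone_min / current_horizon_min)
naive_months = doublings_needed * doubling_months

# crude slowdown scenario: suppose a fraction `share` of time-horizon
# growth is driven by compute growth, and compute growth halves.
share = 0.5
slowed_rate = (1 - share) / doubling_months + share * 0.5 / doubling_months
slowed_months = doublings_needed / slowed_rate

print(f"naive:  {naive_months:.0f} months to milestone")
print(f"slowed: {slowed_months:.0f} months (delay {slowed_months - naive_months:.0f})")
```

With these placeholders the milestone slips by about 17 months; the point is only that modest changes in the growth rate compound into large calendar delays.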
Aug 23, 2025 7 tweets 2 min read
half a year on, i'm still astonished at how tight METR's time horizon trend is.

begs the question: what process is driving the trend? very loosely, i have a two-part model in my head.

first part is task-agnostic returns to scale. this is a standard AI story: just add compute. not strictly pre-training compute; any innovation that turns resources into general capabilities (better data filtering, optimizers, instruction following, tool use) would count.
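to put numbers on "tight": the trend is basically a log-linear fit of time horizon against release date, where the reciprocal of the slope gives the doubling time. a toy version of that fit (with fabricated data points, not METR's measurements):

```python
# toy sketch of the log-linear fit behind a "tight" exponential trend;
# the dates and horizons below are fabricated for illustration
import numpy as np

years = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])  # hypothetical release dates
horizon_min = np.array([4.0, 8.5, 16.0, 35.0, 65.0])        # hypothetical 50% horizons (min)

# regress log2(horizon) on date: the slope is doublings per year
slope, intercept = np.polyfit(years, np.log2(horizon_min), 1)
doubling_time_months = 12.0 / slope

# R^2 as a crude measure of how tight the trend is
resid = np.log2(horizon_min) - (slope * years + intercept)
r2 = 1 - resid.var() / np.log2(horizon_min).var()
print(f"doubling time ~{doubling_time_months:.1f} months, R^2 ~{r2:.3f}")
```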
Jul 10, 2025 9 tweets 2 min read
it’s out!

we find that, against the forecasts of top experts, the forecasts of study participants, _and the retrodictions of study participants_, early-2025 frontier AI tools slowed ultra-talented + experienced open-source developers down.

the result is, of course, shocking. but i see our primary contribution as methodological. RCTs are the furthest thing from a methodological innovation, yet they have thus far been missing from the toolkit of researchers studying the technical capabilities of frontier AI systems.
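the basic RCT logic is just: randomize AI access across tasks, then compare mean log completion times between arms. a toy simulation of that estimand (not the study's actual estimator or data):

```python
# toy RCT sketch with fabricated data, NOT the study's estimator or results
import numpy as np

rng = np.random.default_rng(0)
n = 100
treated = rng.integers(0, 2, size=n)               # 1 = AI tools allowed
base = rng.lognormal(mean=4.0, sigma=0.8, size=n)  # baseline task time (minutes)
time = base * np.where(treated == 1, 1.2, 1.0)     # simulate a 20% slowdown

# difference in mean log times ~ log ratio of geometric means across arms
diff = np.log(time[treated == 1]).mean() - np.log(time[treated == 0]).mean()
print(f"estimated effect: {(np.exp(diff) - 1) * 100:+.0f}% change in completion time")
```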
May 26, 2025 8 tweets 2 min read
shortly after, @_sholtodouglas suggests that, [in order to understand whether automation of white-collar work is about to happen,] governments should "build swe-bench for all the other forms of white-collar work."

thread on disagreements this might suggest:

i think swe-bench is really not a good measure of whether software engineering is automated. saturation doesn't come close to implying that AI agents can be plug-in replacements for human software engineers.