Joel Becker
move fast and fix things @METR_evals. 'soccer'-me @MessiSeconds.
Nov 24, 2025 11 tweets 3 min read
How might @METR_Evals' time horizon trend change if compute growth slows?

In a new paper, @whitfill_parker, @bsnodin, and I show that trends + a common (and contestable -- read on!) economic model of algorithmic progress can imply substantial delays in AI capability milestones.

Fundamentally, the ideas in the paper are extremely simple. Time horizon and compute have been growing exponentially. If compute slows, plausibly time horizon slows. If the slowing is substantial, the resulting delays vs. naive trend extrapolation can be quantitatively large.
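To make the mechanism concrete, here is a back-of-envelope sketch (my illustration, not the paper's model). It assumes time horizon doubles every fixed number of months, that a fixed share of that growth is driven by compute, and that compute growth halves from today; every parameter value below is a made-up placeholder.

```python
# Illustrative sketch, NOT the paper's model: how a compute slowdown
# could delay a time-horizon milestone under naive exponential trends.
# All numbers are hypothetical placeholders, not the paper's estimates.
import math

doubling_months = 7.0        # assumed pre-slowdown time-horizon doubling time
current_horizon_min = 60.0   # assumed current 50%-success time horizon (minutes)
milestone_min = 167 * 60.0   # example milestone: ~1 work-month of human time

# naive extrapolation: months until the milestone on the current trend
doublings_needed = math.log2(milestone_min / current_horizon_min)
naive_months = doublings_needed * doubling_months

# crude slowdown scenario: suppose a fraction `share` of time-horizon
# growth is driven by compute growth, and compute growth halves.
share = 0.5
slowed_rate = (1 - share) / doubling_months + share * 0.5 / doubling_months
slowed_months = doublings_needed / slowed_rate

print(f"naive:  {naive_months:.0f} months to milestone")
print(f"slowed: {slowed_months:.0f} months (delay {slowed_months - naive_months:.0f})")
```

With these placeholders the milestone slips by about 17 months; the point is only that modest changes in the growth rate compound into large calendar delays.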
Aug 23, 2025 7 tweets 2 min read
half a year on, i'm still astonished at how tight METR's time horizon trend is.

begs the question: what process is driving the trend? very loosely, i have a two-part model in my head.

first part is task-agnostic returns to scale. this is a standard AI story: just add compute. not strictly pre-training compute; any innovation that turns resources into general capabilities (better data filtering, optimizers, instruction following, tool use) would count.
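to put numbers on "tight": the trend is basically a log-linear fit of time horizon against release date, where the reciprocal of the slope gives the doubling time. a toy version of that fit (with fabricated data points, not METR's measurements):

```python
# toy sketch of the log-linear fit behind a "tight" exponential trend;
# the dates and horizons below are fabricated for illustration
import numpy as np

years = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])  # hypothetical release dates
horizon_min = np.array([4.0, 8.5, 16.0, 35.0, 65.0])        # hypothetical 50% horizons (min)

# regress log2(horizon) on date: the slope is doublings per year
slope, intercept = np.polyfit(years, np.log2(horizon_min), 1)
doubling_time_months = 12.0 / slope

# R^2 as a crude measure of how tight the trend is
resid = np.log2(horizon_min) - (slope * years + intercept)
r2 = 1 - resid.var() / np.log2(horizon_min).var()
print(f"doubling time ~{doubling_time_months:.1f} months, R^2 ~{r2:.3f}")
```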
Jul 10, 2025 9 tweets 2 min read
it’s out!

we find that, against the forecasts of top experts, the forecasts of study participants, _and the retrodictions of study participants_, early-2025 frontier AI tools slowed ultra-talented + experienced open-source developers down.

the result is, of course, shocking. but i see our primary contribution as methodological. RCTs are the furthest thing from a methodological innovation, yet they have thus far been missing from the toolkit of researchers studying the technical capabilities of frontier AI systems.
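the basic RCT logic is just: randomize AI access across tasks, then compare mean log completion times between arms. a toy simulation of that estimand (not the study's actual estimator or data):

```python
# toy RCT sketch with fabricated data, NOT the study's estimator or results
import numpy as np

rng = np.random.default_rng(0)
n = 100
treated = rng.integers(0, 2, size=n)               # 1 = AI tools allowed
base = rng.lognormal(mean=4.0, sigma=0.8, size=n)  # baseline task time (minutes)
time = base * np.where(treated == 1, 1.2, 1.0)     # simulate a 20% slowdown

# difference in mean log times ~ log ratio of geometric means across arms
diff = np.log(time[treated == 1]).mean() - np.log(time[treated == 0]).mean()
print(f"estimated effect: {(np.exp(diff) - 1) * 100:+.0f}% change in completion time")
```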
May 26, 2025 8 tweets 2 min read
shortly after, @_sholtodouglas suggests that, [in order to understand whether automation of white-collar work is about to happen,] governments should "build swe-bench for all the other forms of white-collar work."

thread on disagreements this might suggest:

i think swe-bench is really not a good measure of whether software engineering is automated. saturation doesn't come close to implying that AI agents can be plug-in replacements for human software engineers.