When will AI systems be able to carry out long projects independently?
In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
At a high level, our method is simple: 1. We ask both skilled humans and AI systems to attempt tasks in similar conditions. 2. We measure how long the humans take. 3. We then measure how AI success rates vary depending on how long the humans took to do those tasks.
We measure human and AI performance on a variety of software tasks, some sourced from existing METR benchmarks like HCAST and some brand new.
Human completion times on these tasks range from 1 second to 16 hours.
We then fit a curve that predicts the success rate of an AI based on how long it took humans to do each task. This curve characterizes how capable an AI is at different task lengths. We then summarize the curve with the task length at which a model’s success rate is 50%.
This metric - the 50% task completion time horizon - gives us a way to track progress in model autonomy over time.
Plotting the historical trend of 50% time horizons across frontier AI systems shows exponential growth.
These results appear robust. Although our model could be wrong, we are relatively confident about its fit to the data. While our initial data only covered the most recent systems, we found we could retrodict back to GPT-2.
We ran experiments on SWE-bench Verified and found a similar trend. We also ran a small experiment on internal METR pull requests, and found results consistent with our other datasets. We are excited for researchers to extend this and measure time horizons on other benchmarks.
We are fairly confident in the rough trend of 1-4 doublings in horizon length per year. That is fast! Measures like these help make the notion of “degrees of autonomy” more concrete and let us quantify when AI abilities may rise above specific useful (or dangerous) thresholds.
How close are current AI agents to automating AI R&D? Our new ML research engineering benchmark (RE-Bench) addresses this question by directly comparing frontier models such as Claude 3.5 Sonnet and o1-preview with 50+ human experts on 7 challenging research engineering tasks.
Many governments and companies have highlighted automation of AI R&D by AI agents as a key capability to monitor for when scaling/deploying frontier ML systems. However, existing evals tend to focus on short, narrow tasks and lack direct comparisons with human experts.
The tasks in RE-Bench aim to cover a wide variety of skills required for AI R&D and enable apples-to-apples comparisons between humans and AI agents, while also being feasible for human experts given ≤8 hours and reasonable amounts of compute.
We ran o1-preview on our suite of ML R&D/SWE/general agency tasks, from Sep 3–9. 4 days of scaffolding iteration took it from well below GPT-4o to on par with the highest-scoring public model (3.5 Sonnet). We expect substantial performance gains from more elicitation/finetuning.
The o1-preview agent made nontrivial progress on 2 of 7 challenging AI R&D tasks (intended for skilled research engineers to take ~8h). It was able to create an agent scaffold that allowed GPT-3.5 to solve coding problems in rust, and fine-tune GPT-2 for question-answering.
We noticed some interesting examples of o1-preview skirting instructions to get higher scores.
E.g. when asked to optimize a finetuning script without affecting the resulting model’s behavior, it writes a script that copies over the weights of a previous finetuned model.
How well can LLM agents complete diverse tasks compared to skilled humans? Our preliminary results indicate that our baseline agents based on several public models (Claude 3.5 Sonnet and GPT-4o) complete a proportion of tasks similar to what humans can do in ~30 minutes. 🧵
Supplementing our work on evaluating specific capabilities of concern, our task suite for autonomous capabilities measures skills including cybersecurity, software engineering, and ML. The tasks range in difficulty from taking skilled humans less than 15 minutes to many hours.
While the agents tend to succeed more often on tasks that humans take less time to complete, the agents sometimes fail to complete tasks that take humans fewer than 15 minutes, and sometimes complete tasks that take humans hours.