METR
We are a research organization that develops evaluations to empirically test AI systems for capabilities that could threaten catastrophic harm to society.
Mar 19
When will AI systems be able to carry out long projects independently?

In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.

At a high level, our method is simple:
1. We ask both skilled humans and AI systems to attempt tasks in similar conditions.
2. We measure how long the humans take.
3. We then measure how AI success rates vary depending on how long the humans took to do those tasks.
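To make the method concrete, here is a minimal sketch of steps 1–3 in Python. All task data, model-release dates, and horizon values are made-up illustrations; this is not METR's actual analysis code or results.

```python
# Minimal sketch of the time-horizon method, with hypothetical data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task data: how long skilled humans took (minutes),
# and whether a given AI agent succeeded on each task.
human_minutes = np.array([2, 5, 10, 30, 60, 120, 240, 480])
agent_success = np.array([1, 1, 1, 1, 0, 1, 0, 0])

# Step 3: model agent success probability as a function of log human time.
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, agent_success)

# The "time horizon" is the human task length at which the fitted success
# probability crosses 50%, i.e. where intercept + coef * log2(t) = 0.
horizon_minutes = 2 ** (-clf.intercept_[0] / clf.coef_[0][0])
print(f"50% time horizon ≈ {horizon_minutes:.0f} human-minutes")

# Trend over successive models: fit an exponential to (release year, horizon)
# points to get a doubling time. These points are invented for illustration.
years = np.array([2022.0, 2023.0, 2024.0, 2025.0])
horizons = np.array([1.0, 3.0, 10.0, 33.0])
slope, _ = np.polyfit(years, np.log2(horizons), 1)  # doublings per year
print(f"Doubling time ≈ {12 / slope:.1f} months")
```

With real data, the same per-model fit tracked across release dates is what yields the doubling-time trend referred to above.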
Nov 22, 2024
How close are current AI agents to automating AI R&D? Our new ML research engineering benchmark (RE-Bench) addresses this question by directly comparing frontier models such as Claude 3.5 Sonnet and o1-preview with 50+ human experts on 7 challenging research engineering tasks.

Many governments and companies have highlighted automation of AI R&D by AI agents as a key capability to monitor for when scaling/deploying frontier ML systems. However, existing evals tend to focus on short, narrow tasks and lack direct comparisons with human experts.
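As a rough illustration only (not RE-Bench's actual scoring code), an agent-vs-expert comparison can be framed like this; the task names, scores, and best-of-k aggregation are all assumptions made up for the sketch.

```python
# Hypothetical comparison of one agent against human experts on a few
# research-engineering tasks, each scored in [0, 1] (higher is better).
from statistics import mean

results = {
    "optimize_kernel": {"agent_attempts": [0.20, 0.35, 0.30], "humans": [0.60, 0.80]},
    "finetune_lm":     {"agent_attempts": [0.40, 0.55],       "humans": [0.50, 0.70]},
    "fix_embedding":   {"agent_attempts": [0.10, 0.15, 0.25], "humans": [0.65, 0.90]},
}

# Best attempt for the agent (several tries within a time budget) vs. the
# average human expert on each task, then averaged over tasks.
agent_score = mean(max(r["agent_attempts"]) for r in results.values())
human_score = mean(mean(r["humans"]) for r in results.values())
print(f"agent (best-of-k): {agent_score:.2f}  human experts (mean): {human_score:.2f}")
```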
Sep 13, 2024
We ran o1-preview on our suite of ML R&D/SWE/general agency tasks, from Sep 3–9. 4 days of scaffolding iteration took it from well below GPT-4o to on par with the highest-scoring public model (3.5 Sonnet). We expect substantial performance gains from more elicitation/finetuning.
The o1-preview agent made nontrivial progress on 2 of 7 challenging AI R&D tasks (intended for skilled research engineers to take ~8h). It was able to create an agent scaffold that allowed GPT-3.5 to solve coding problems in Rust, and to fine-tune GPT-2 for question-answering.
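For intuition about what such a scaffold involves, here is a hypothetical sketch of a compile-and-retry loop for Rust coding problems. The `generate()` function is a stand-in for a call to the underlying model, not a real API, and the whole structure is an assumption rather than what the o1-preview agent actually wrote.

```python
# Hypothetical scaffold: ask a weaker model for Rust code, compile it, and
# feed compiler errors back for another attempt.
import pathlib
import subprocess
import tempfile

def generate(prompt: str) -> str:
    """Placeholder for a call to the underlying model (e.g. GPT-3.5)."""
    raise NotImplementedError

def solve_rust_problem(problem: str, max_attempts: int = 5) -> str | None:
    feedback = ""
    for _ in range(max_attempts):
        code = generate(f"Write a Rust program that solves:\n{problem}\n{feedback}")
        with tempfile.TemporaryDirectory() as d:
            src = pathlib.Path(d) / "main.rs"
            src.write_text(code)
            result = subprocess.run(
                ["rustc", str(src), "-o", str(pathlib.Path(d) / "main")],
                capture_output=True, text=True,
            )
        if result.returncode == 0:
            return code  # compiles; a real scaffold would also run tests here
        feedback = f"The previous attempt failed to compile:\n{result.stderr}"
    return None
```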
Aug 6, 2024
How well can LLM agents complete diverse tasks compared to skilled humans? Our preliminary results indicate that our baseline agents based on several public models (Claude 3.5 Sonnet and GPT-4o) complete a proportion of tasks similar to what humans can do in ~30 minutes. 🧵

Supplementing our work on evaluating specific capabilities of concern, our task suite for autonomous capabilities measures skills including cybersecurity, software engineering, and ML. The tasks range in difficulty from taking skilled humans less than 15 minutes to many hours.
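To illustrate how such a comparison can be framed, here is a hypothetical sketch that bins tasks by how long skilled humans took and reports an agent's completion rate per bin; all task data and bucket boundaries are made up for the example.

```python
# Hypothetical: group tasks by human completion time, then report the
# fraction of each group the agent completed.
from collections import defaultdict

# (human_minutes, agent_succeeded) for each task in the suite.
tasks = [(5, True), (12, True), (25, True), (40, False),
         (90, True), (180, False), (480, False)]

buckets = {"<15 min": (0, 15), "15-60 min": (15, 60),
           "1-4 hours": (60, 240), ">4 hours": (240, float("inf"))}

by_bucket = defaultdict(list)
for minutes, success in tasks:
    for name, (low, high) in buckets.items():
        if low <= minutes < high:
            by_bucket[name].append(success)

for name in buckets:
    outcomes = by_bucket[name]
    if outcomes:
        print(f"{name}: {sum(outcomes) / len(outcomes):.0%} of tasks completed")
```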