Sayash Kapoor
Oct 15 · 17 tweets · 7 min read
📣New paper: Rigorous AI agent evaluation is much harder than it seems.

For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks.

Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9 challenging benchmarks spanning web, coding, science, and customer service tasks.

Our key insight: Benchmark accuracy hides many important details. Take claims of agents' accuracy with a huge grain of salt. 🧵

[Image: screenshot of the first page of the HAL paper]
There are 3 components of HAL:

1) A standard harness that evaluates agents on hundreds of VMs in parallel, drastically reducing eval time
2) A 3-D evaluation across models x scaffolds x benchmarks that enables insights along each of these dimensions
3) Agent behavior analysis using @TransluceAI Docent that uncovers surprising agent behaviors

[Image: challenges addressed by HAL]
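To make the 3-D structure concrete, here is a toy sketch in Python. This is not HAL's actual schema or API; the keys, names, and values below are placeholders. The idea is that results are keyed by (model, scaffold, benchmark), and any one dimension can be held fixed while comparing across the other two.

```python
# Toy sketch of a models x scaffolds x benchmarks results cube.
# Keys and values are placeholders, not HAL's actual schema.
from collections import defaultdict

results = {
    # (model, scaffold, benchmark): per-cell metrics
    ("model-a", "scaffold-1", "benchmark-x"): {"accuracy": 0.0, "cost_usd": 0.0},
    ("model-a", "scaffold-2", "benchmark-x"): {"accuracy": 0.0, "cost_usd": 0.0},
    ("model-b", "scaffold-1", "benchmark-y"): {"accuracy": 0.0, "cost_usd": 0.0},
}

def group_by(dim: str):
    """Group cells along one dimension: 'model', 'scaffold', or 'benchmark'."""
    idx = {"model": 0, "scaffold": 1, "benchmark": 2}[dim]
    grouped = defaultdict(dict)
    for key, metrics in results.items():
        grouped[key[idx]][key] = metrics
    return dict(grouped)

# Hold the model fixed and compare its scaffold/benchmark cells:
print(group_by("model")["model-a"])
```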
For many of the benchmarks we include, there was previously no way to compare models head-to-head, since they weren't compared on the same scaffold. Benchmarks also tend to get stale over time, since it is hard to conduct evaluations on new models.

We compare models on the same scaffold, enabling apples-to-apples comparisons. The vast majority of these evaluations were not available previously. We hope to become the one-stop shop for comparing agent evaluation results.

[Image: evaluations conducted previously vs. on HAL]
The HAL harness enables one-command evaluation across agent benchmarks, drastically reducing eval setup time. For example, to evaluate GPT-OSS-120B on CORE-Bench using CORE-Agent, run a single command in the terminal:

[Image: terminal command for running a HAL evaluation]
We evaluated 9 models on 9 benchmarks with 1-2 scaffolds per benchmark, for a total of 20,000+ rollouts. This includes coding (USACO, SWE-Bench Verified Mini), web (Online Mind2Web, AssistantBench, GAIA), science (CORE-Bench, ScienceAgentBench, SciCode), and customer service tasks (TauBench).

[Image: list of benchmarks and agents we run]
Our analysis uncovered many surprising insights:

1) Higher reasoning effort does not lead to better accuracy in the majority of cases. When we ran the same model at different reasoning efforts (Claude 3.7, Claude 4.1, o4-mini), higher reasoning effort did not improve accuracy in 21/36 cases.

[Image: change in accuracy from higher reasoning effort]
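As a rough illustration of how such a comparison can be tallied (this is not HAL's analysis code, and the numbers below are placeholders), each cell pairs the same model and benchmark at a low and a high reasoning-effort setting:

```python
# Placeholder data: paired runs of the same model on the same benchmark at
# low vs. high reasoning effort. Values are illustrative, not HAL results.
paired_runs = [
    # (model, benchmark, accuracy_low_effort, accuracy_high_effort)
    ("model-a", "benchmark-x", 0.40, 0.38),
    ("model-a", "benchmark-y", 0.55, 0.61),
    ("model-b", "benchmark-x", 0.30, 0.30),
]

# Count the cells where the higher effort setting actually improved accuracy.
improved = sum(1 for _model, _bench, low, high in paired_runs if high > low)
print(f"higher effort improved accuracy in {improved}/{len(paired_runs)} cells")
```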
2) Agents often take shortcuts rather than solving the task correctly. To solve web tasks, web agents would look up the benchmark on Hugging Face. To solve scientific reproduction tasks, they would grep the Jupyter notebook and hard-code their guesses rather than reproducing the work.

[Image: agents look up answers on Hugging Face]
3) Agents take actions that would be extremely costly in deployment. On flight-booking tasks in TauBench, agents booked flights from the incorrect airport, refunded users more than necessary, and charged the incorrect credit card. Surprisingly, even leading models like Opus 4.1 and GPT-5 took such actions.

[Image: list of costly failures on TauBench]
4) We analyzed the tradeoff between cost and accuracy. The red line represents the Pareto frontier: the agents that provide the best tradeoff.

Surprisingly, the most expensive model (Opus 4.1) tops the leaderboard *only once*. The models most often on the Pareto frontier are Gemini Flash (7/9 benchmarks), followed by GPT-5 and o4-mini (4/9 benchmarks).

[Image: accuracy vs. cost (USD) on all 9 benchmarks]
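For readers who want to compute this kind of frontier on their own results, here is a minimal sketch (not HAL's implementation; the agents and numbers are placeholders). An agent is on the frontier if no other agent is both at least as cheap and at least as accurate, with at least one strict improvement.

```python
# Minimal cost-accuracy Pareto frontier sketch (illustrative data only).
agents = [
    # (name, total_cost_usd, accuracy) -- placeholder values
    ("agent-a", 10.0, 0.30),
    ("agent-b", 40.0, 0.55),
    ("agent-c", 50.0, 0.50),
    ("agent-d", 90.0, 0.52),
]

def pareto_frontier(points):
    frontier = []
    for name, cost, acc in points:
        # Dominated if some other agent is no more expensive, no less accurate,
        # and strictly better on at least one of the two.
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for n, c, a in points
            if n != name
        )
        if not dominated:
            frontier.append((name, cost, acc))
    # Sort by cost so the frontier can be drawn as a line, as in the figure.
    return sorted(frontier, key=lambda p: p[1])

print(pareto_frontier(agents))  # agent-a and agent-b remain on this toy data
```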
5) The most token-efficient models are not the cheapest. When we compare token cost vs. accuracy, Opus 4.1 is on the Pareto frontier for 3 benchmarks. This matters because providers change model prices frequently (for example, o3's price dropped by 80% soon after launch).

[Image: token cost vs. accuracy]
6) We log all the agent behaviors and analyze them using @TransluceAI Docent, which uses LLMs to uncover specific actions the agent took. We use this to conduct a systematic analysis of agent logs on three benchmarks: AssistantBench, SciCode, and CORE-Bench. This analysis allowed us to spot agents taking shortcuts and costly reliability failures.

We also notice interesting agent behaviors that correlate with *improved* accuracy. When agents self-verify answers and construct intermediate verifiers (such as unit tests for coding problems), they are more likely to solve the task correctly.

[Image: tasks where agents self-correct and verify results have higher success rates]
7) On the flip side, many factors co-occur with failures. For example, barriers in the environment (such as CAPTCHAs for web agents) and instruction-following failures (such as not outputting code in the specified format) are more likely to occur in failed tasks.

Surprisingly, agents encounter tool-call failures quite often regardless of whether they solve the task correctly, indicating they can recover from these errors.

[Image: rates of instruction violations, tool-call errors, and environment barriers in successful vs. failed tasks]
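The comparison behind findings like this can be sketched as follows (illustrative only, not HAL's analysis code; the behavior flags and rollouts are placeholders): for each behavior flag extracted from the logs, compare how often it appears in successful vs. failed rollouts.

```python
# Illustrative sketch: rate of each behavior flag among successes vs. failures.
from collections import Counter

rollouts = [
    # (succeeded, set of behavior flags extracted from the log) -- placeholder data
    (True,  {"self_verification"}),
    (True,  {"tool_call_error", "self_verification"}),
    (False, {"tool_call_error", "environment_barrier"}),
    (False, {"instruction_violation"}),
]

def flag_rates(rollouts, succeeded):
    """Fraction of rollouts with the given outcome that exhibit each flag."""
    subset = [flags for ok, flags in rollouts if ok == succeeded]
    counts = Counter(flag for flags in subset for flag in flags)
    return {flag: n / len(subset) for flag, n in counts.items()}

print("success:", flag_rates(rollouts, True))
print("failure:", flag_rates(rollouts, False))
```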
8) Agent log analysis helped us uncover a bug in one of the scaffolds we used for TauBench. The implementation of the few-shot agent in the Sierra AI repo used benchmark examples as few-shot data, a clear example of leakage. As a result, we removed this scaffold from HAL's TauBench analysis.

[Image: leakage in the TauBench few-shot agent]
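For benchmark and scaffold builders, a simple sanity check along these lines can catch this kind of leakage. This is a hedged sketch, not the actual check we ran; the task identifiers and overlap criterion are hypothetical.

```python
# Hypothetical sanity check: flag any few-shot example that also appears in
# the evaluation set. IDs are placeholders for whatever identifies a task.
few_shot_example_ids = {"task-017", "task-042"}
eval_task_ids = {"task-001", "task-017", "task-103"}

leaked = few_shot_example_ids & eval_task_ids
if leaked:
    raise ValueError(f"few-shot prompt leaks eval tasks: {sorted(leaked)}")
```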
We think agent log analysis, such as using Docent, will become a necessary part of agent evaluation going forward. Log analysis uncovers reliability issues, shortcuts, and costly agent errors, which indicate agents could perform worse in the real world than benchmarks suggest.

Conversely, agents could perform *better* than benchmarks suggest if environment barriers like CAPTCHAs block them on benchmarks (say, because a benchmark's many tasks trigger a burst of concurrent web requests) but wouldn't block them when deployed.

Benchmark accuracy numbers surface *none of these* issues and should be used cautiously.
We solved a long list of infrastructure challenges along the way. If you build agents or benchmarks, or run evaluations, HAL could be useful to you. We might have already solved many of the problems you will encounter.
There are many more insights in the paper and on the website. We make all our data and code available:

Paper: arxiv.org/abs/2510.11977
Website: hal.cs.princeton.edu
GitHub: github.com/princeton-pli/…
Conducting rigorous evaluations required developing infrastructure to handle logging, scaffold and benchmark support, orchestration across VMs, and integration with other tools in the ecosystem like Docent and Weave.

We plan to conduct many more rigorous agent evaluations over the next year and continue sharing insights from our analysis. Follow @halevals for updates on HAL.

I'm grateful to have a fantastic team in place working on HAL: @random_walker, @benediktstroebl, @PKirgis, @nityndg, @siegelz_, @wei_boyi, @xue_tianci, @RonZiruChen, @flxcn, @SaitejaUtpala, @ndzfs, Dheeraj Oruganty, Sophie Luskin, @khl53182440, @BotaoYu24, @aarora79, Dongyoon Hahm, @harsh3vedi, @hhsun1, Juyong Lee, @tengjun_77, Yifan Mai, @YifeiZhou02, @maxYuxuanZhu, @RishiBommasani, @daniel_d_kang, @dawnsongtweets, @PeterHndrsn, @ysu_nlp, @percyliang

• • •


More from @sayashk

Sep 5
AI as Normal Technology is often contrasted with AI 2027. Many readers have asked if AI evaluations could help settle the debate.

Unfortunately, this is not straightforward. That's because the debate is not about differences in AI capability, which evaluations typically measure. It is about two completely different causal models of the world.

But most AI evaluations don't even *attempt* to measure differences in causal models. 🧵
A key turning point in scenarios like AI 2027 is automated AI R&D — AI systems used to autonomously improve AI, greatly speeding up the pace of AI progress.

In AI as Normal Technology, we discussed some bottlenecks to such automation. We did not go too deep into this disagreement, since our broader point was that AI impacts are bottlenecked by diffusion, and the rate of AI progress doesn't change that.

But in the time since we wrote the essay, we have thought about this disagreement more deeply. This was partly motivated by a workshop @hlntnr organized at CSET.
The key point of this thread is that *whether* self-improvement occurs depends a lot on our causal model of AI progress. Let's look at three different types of world models that lead to drastically different interpretations of progress, even with *exactly* the same AI evaluations.
Aug 8
How does GPT-5 compare against Claude Opus 4.1 on agentic tasks?

Since their release, we have been evaluating these models on challenging science, web, service, and code tasks.

Headline result: While cost-effective, GPT-5 has so far never topped an agentic leaderboard. More evals 🧵

[Image: Holistic Agent Leaderboard results on CORE-Bench (hal.cs.princeton.edu)]
Last year, in our paper AI Agents That Matter, we discussed the challenges with agent evaluations.

Since then, we've been building the Holistic Agent Leaderboard (HAL). We've evaluated 100+ agents across 8 benchmarks. The code is open source, and all logs are available online.

[Image: screenshot of the HAL landing page]
1) CORE-Bench (scientific reproducibility) gives agents two hours to reproduce the results from a scientific paper, given access to its code and data.

Opus 4.1 is the first model to break the 50% barrier on CORE-Bench. GPT-5 is far behind — even behind Sonnet 3.7 and GPT-4.1.

We have heard claims that AI will soon automate all of science. Reproducing results is a small part of science, but the best models are far from scoring well.

Still, if AI agents can reproduce existing work, we suspect it would save millions of researcher-hours of effort, so even a 50% CORE-Bench Hard score is exciting.

[Image: CORE-Bench results on HAL]
May 8
Will AI agents be controlled by big tech companies? Or could they be controlled by users, safeguarding user autonomy and privacy?

In a new position paper (accepted to ICML 2025), we outline the steps we need to take now to enable user-centric agents (w/ @sethlazar, Noam Kolt) 🧶

[Image: screenshot of the first page of the paper]
PDF: arxiv.org/pdf/2505.04345

If agents are controlled by platforms, they would intensify many problems of the current platform economy:
- heightened surveillance
- increased user lock-in
- market manipulation
- further entrenchment of incumbent digital giants

[Image: screenshot of Table 1 from the paper]
In contrast, agent advocates could run on users' local hardware or in encrypted private clouds, giving users visibility and control over their actions and data. They could prevent surveillance, reduce lock-in, and enable more open and competitive markets.
Feb 11
I led the section on the risks of open models in the International AI Safety Report (w/@ea_seger). It was released at the French AI Action Summit.

The section explains why DeepSeek R1 is *not* open source, how to assess misuse risks, and what evidence we need going forward 🧵

[Image: table depicting the spectrum of model release, from fully closed to fully open]
In the wake of the DeepSeek R1 release, many people (including prominent news outlets) claimed DeepSeek R1 is open source.

This is incorrect: DeepSeek did not release the training data or code needed to reproduce their results, both of which are key parts of the open source definition.

[Image: news headlines on DeepSeek R1]
Meta's Llama, Google's Gemma, and Mistral's Codestral are open-weight, but not open source, since they don't release their datasets.

In contrast, AI2's Olmo was released with the code, data, and model weights — it is an example of a truly open source model.
Feb 5
Can Deep Research automate the work of research assistants? I compared OpenAI and Google Deep Research for assistance with an upcoming project.

In the process, I realized what Deep Research is great at, where it fails, and why commentators have such diverging views on it. 🧵

[Image: spectrum of how useful Deep Research is for different applications]
Deep Research browses the internet to create reports. @random_walker and I are writing about pro-social applications of AI for social media. So I used Deep Research to see if it could supplement the process of surveying the state of such applications and research projects.
OpenAI's o1-pro with Deep Research understood my query the first time around. It asked for useful clarifications, and less than 10 minutes later, I already had a useful starting point.

[Image: OpenAI o1-pro response to the Deep Research query]
Feb 3
I spent a few hours with OpenAI's Operator automating expense reports. Most corporate jobs require filing expenses, so Operator could save *millions* of person-hours every year if it gets this right.

Some insights on what worked, what broke, and why this matters for agents 🧵

[Image: graph of web tasks by difficulty and severity (cost of errors)]
OpenAI's Operator is a web agent that can solve arbitrary tasks on the internet *with human supervision*. It runs on a virtual machine (*not* your computer). Users can see what the agent is doing in the browser in real time. It is available to ChatGPT Pro subscribers.

[Image: Operator writing "hello world" on an online notepad]
I asked Operator to file reports for my OpenAI and Anthropic API expenses for the last month. This is a task I do manually each month, so I knew exactly what it would need to do. To my surprise, Operator got the first few steps exactly right.
