Sayash Kapoor
On the faculty job market. I tweet about AI agents, AI evals, AI for science. AI as Normal Technology: https://t.co/5amOkqKDf2 Book: https://t.co/DabpkhNrcM

Oct 15, 2025, 17 tweets

📣New paper: Rigorous AI agent evaluation is much harder than it seems.

For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks.

Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9 challenging benchmarks spanning web, coding, science, and customer service tasks.

Our key insight: Benchmark accuracy hides many important details. Take claims of agents' accuracy with a huge grain of salt. 🧵

HAL (the Holistic Agent Leaderboard) has 3 components:

1) A standardized harness that evaluates agents on hundreds of VMs in parallel, drastically reducing eval time
2) 3-D evaluation across models × scaffolds × benchmarks, enabling insights along each dimension (see the sketch after this list)
3) Agent behavior analysis using @TransluceAI Docent, which uncovers surprising agent behaviors
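To make the models × scaffolds × benchmarks structure concrete, here is a minimal sketch (not HAL's actual schema; the model, scaffold, and benchmark names and the accuracies are made up) of keying results by all three dimensions so that any slice can be compared head-to-head:

```python
# Minimal sketch, NOT HAL's actual schema: index results by all three
# evaluation dimensions so any slice can be pulled out for comparison.
# Names and accuracies below are made up for illustration.
results = {
    # (model, scaffold, benchmark) -> accuracy
    ("gpt-5", "core-agent", "corebench"): 0.42,
    ("gemini-flash", "core-agent", "corebench"): 0.35,
    ("gpt-5", "browser-agent", "online-mind2web"): 0.51,
}

def slice_by(results, model=None, scaffold=None, benchmark=None):
    """Fix any subset of the three dimensions and return the matching runs."""
    return {
        key: acc for key, acc in results.items()
        if (model is None or key[0] == model)
        and (scaffold is None or key[1] == scaffold)
        and (benchmark is None or key[2] == benchmark)
    }

# e.g. every model run with the same scaffold on the same benchmark:
print(slice_by(results, scaffold="core-agent", benchmark="corebench"))
```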

For many of the benchmarks we include, there was previously no way to compare models head-to-head, since they weren't compared on the same scaffold. Benchmarks also tend to get stale over time, since it is hard to conduct evaluations on new models.

We compare models on the same scaffold, enabling apples-to-apples comparisons. The vast majority of these evaluations were not available previously. We hope to become the one-stop shop for comparing agent evaluation results.

The HAL harness enables one-command evaluation across agent benchmarks, drastically reducing eval setup time. For example, evaluating GPT-OSS-120B on CORE-Bench with CORE-Agent takes a single terminal command.

We evaluated 9 models on 9 benchmarks with 1-2 scaffolds per benchmark, for a total of 20,000+ rollouts. The benchmarks span coding (USACO, SWE-Bench Verified Mini), web (Online Mind2Web, AssistantBench, GAIA), science (CORE-Bench, ScienceAgentBench, SciCode), and customer service (TauBench).

Our analysis uncovered many surprising insights:

1) Higher reasoning effort does not lead to better accuracy in the majority of cases. When we ran the same model at different reasoning efforts (Claude 3.7, Claude 4.1, o4-mini), higher effort did not improve accuracy in 21/36 cases.
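A toy illustration of how such a tally can be computed (the accuracies below are invented, not the paper's data): for each model/benchmark pair run at two reasoning efforts, check whether the higher effort actually helped.

```python
# Toy tally with invented accuracies (not the paper's data): for each
# (model, benchmark) pair run at low and high reasoning effort, count how
# often the higher effort failed to improve accuracy.
runs = [
    # (model, benchmark, acc_low_effort, acc_high_effort)
    ("o4-mini", "scicode", 0.28, 0.27),
    ("o4-mini", "usaco", 0.41, 0.45),
    ("claude-3.7", "corebench", 0.33, 0.33),
]

no_gain = sum(1 for _, _, low, high in runs if high <= low)
print(f"Higher effort did not improve accuracy in {no_gain}/{len(runs)} cases")
```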

2) Agents often take shortcuts rather than solving the task correctly. On web tasks, agents would look up the benchmark itself on Hugging Face. On scientific reproduction tasks, they would grep the Jupyter notebook and hard-code their guesses rather than reproducing the work.

3) Agents take actions that would be extremely costly in deployment. On flight-booking tasks in TauBench, agents booked flights from the wrong airport, refunded users more than necessary, and charged the wrong credit card. Surprisingly, even leading models like Opus 4.1 and GPT-5 took such actions.

4) We analyzed the tradeoff between cost and accuracy. The red line in the plot is the Pareto frontier: the agents that offer the best tradeoff.

Surprisingly, the most expensive model (Opus 4.1) tops the leaderboard *only once*. The models most often on the Pareto frontier are Gemini Flash (7/9 benchmarks), followed by GPT-5 and o4-mini (4/9 benchmarks).
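For reference, here is a minimal sketch of how a cost-vs-accuracy Pareto frontier can be computed (this is not HAL's code, and the runs below are hypothetical): a run is dominated if some other run is no more expensive, at least as accurate, and strictly better on at least one axis.

```python
# Minimal sketch (not HAL's code): compute the cost-vs-accuracy Pareto
# frontier. A run is dominated if some other run is no more expensive,
# at least as accurate, and strictly better on at least one axis.
def pareto_frontier(runs):
    """runs: list of (name, cost_usd, accuracy) tuples."""
    frontier = []
    for name, cost, acc in runs:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for _, c, a in runs
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda run: run[1])  # cheapest first

# Hypothetical agents, just to show the shape of the output:
runs = [("flash", 3.0, 0.30), ("o4-mini", 8.0, 0.42),
        ("mid-tier", 20.0, 0.40), ("opus", 60.0, 0.45)]
print(pareto_frontier(runs))  # "mid-tier" is dominated by o4-mini
```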

5) The most token-efficient models are not the cheapest. When cost is measured in tokens rather than dollars, Opus 4.1 is on the Pareto frontier for 3 benchmarks. This matters because providers change model prices frequently (for example, o3's price dropped by 80% soon after launch).
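The distinction matters because dollar cost is just token usage multiplied by the provider's current price list. A small sketch with purely illustrative prices (made up to mimic an ~80% cut, not quoted from any real price sheet):

```python
# Illustrative prices only, not any provider's real price sheet.
# Dollar cost = token usage x per-token price, so a price change reshuffles
# the dollar-cost frontier even though token efficiency is unchanged.
PRICES = {  # $ per 1M tokens: (input, output)
    "model-at-launch": (10.0, 40.0),
    "model-after-cut": (2.0, 8.0),
}

def dollar_cost(model, input_tokens, output_tokens):
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

tokens = (2_000_000, 500_000)  # same rollouts, same token usage
print(dollar_cost("model-at-launch", *tokens))  # 40.0
print(dollar_cost("model-after-cut", *tokens))  # 8.0
```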

6) We log all agent behaviors and analyze them using @TransluceAI Docent, which uses LLMs to uncover specific actions the agent took. We use this to conduct a systematic analysis of agent logs on three benchmarks: AssistantBench, SciCode, and CORE-Bench. This analysis allowed us to spot agents taking shortcuts and to catch costly reliability failures.
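As a rough illustration of the idea (this is not Docent's API; `query_llm` is a hypothetical stand-in for whatever chat-completion client you use), LLM-based log search boils down to asking a model structured questions about each transcript:

```python
# Rough illustration only -- NOT Docent's API. query_llm is a hypothetical
# stand-in for any chat-completion call the caller supplies.
import json

AUDIT_PROMPT = """You are auditing an AI agent's transcript.
Answer as JSON with boolean fields:
{{"took_shortcut": ..., "tool_call_failure": ..., "gave_up": ...}}

Transcript:
{transcript}
"""

def flag_behaviors(transcript: str, query_llm) -> dict:
    """query_llm(prompt) -> str must return the model's JSON answer."""
    return json.loads(query_llm(AUDIT_PROMPT.format(transcript=transcript)))

# Usage (hypothetical file name):
# flags = flag_behaviors(open("rollout_0017.log").read(), query_llm)
```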

We also noticed interesting agent behaviors that correlate with *improved* accuracy. When agents self-verify their answers and construct intermediate verifiers (such as unit tests for coding problems), they are more likely to solve the task correctly.

7) On the flip side, many factors co-occur with failure. For example, barriers in the environment (such as CAPTCHAs for web agents) and instruction-following failures (such as not outputting code in the specified format) are more likely to occur in failed tasks.

Surprisingly, agents encounter tool-call failures quite often, regardless of whether they solve the task correctly, indicating they can recover from these errors.
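A minimal sketch of the counting behind such claims (the flags and outcomes below are invented, not the paper's data): condition each behavior flag on whether the rollout succeeded.

```python
# Invented flags and outcomes, not the paper's data: for each behavior flag,
# compare how often it appears in failed vs. solved rollouts.
rollouts = [
    {"success": False, "flags": {"captcha", "tool_call_failure"}},
    {"success": False, "flags": {"format_violation"}},
    {"success": True,  "flags": {"tool_call_failure", "self_verification"}},
    {"success": True,  "flags": {"self_verification"}},
]

def rate(flag, outcome):
    group = [r for r in rollouts if r["success"] == outcome]
    return sum(flag in r["flags"] for r in group) / len(group)

for flag in ("captcha", "format_violation", "tool_call_failure"):
    print(f"{flag}: failed={rate(flag, False):.0%}  solved={rate(flag, True):.0%}")
```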

8) Agent log analysis helped us uncover a bug in one of the scaffolds we used for TauBench. The implementation of the few-shot agent in the Sierra AI repo used benchmark examples as its few-shot data, a clear case of leakage. As a result, we removed this scaffold from HAL's TauBench analysis.
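A simple guard against this kind of leakage (the task IDs below are hypothetical): check that no few-shot example comes from the split being evaluated.

```python
# Hypothetical task IDs, just to illustrate the check: few-shot examples
# must not overlap with the tasks the agent is evaluated on.
def check_no_leakage(few_shot_ids, eval_ids):
    overlap = set(few_shot_ids) & set(eval_ids)
    if overlap:
        raise ValueError(f"Few-shot examples leak eval tasks: {sorted(overlap)}")

few_shot_ids = ["task_003", "task_118"]
eval_ids = ["task_001", "task_002", "task_003"]
check_no_leakage(few_shot_ids, eval_ids)  # raises: task_003 appears in both
```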

We think agent log analysis, such as using Docent, will become a necessary part of agent evaluation going forward. Log analysis uncovers reliability issues, shortcuts, and costly agent errors, which indicate agents could perform worse in the real world than benchmarks suggest.

Conversely, agents could perform *better* than benchmarks suggest if environment barriers like CAPTCHAs block them during benchmark runs (say, because a benchmark run fires off a large number of concurrent web-search requests) but wouldn't block them when deployed.

Benchmark accuracy numbers do not uncover *any of these* and should be used cautiously.

We solved a long list of infrastructure challenges along the way. If you build agents or benchmarks, or run evaluations, HAL could be useful to you: we might have already solved many of the problems you will encounter.

There are many more insights in the paper and on the website. We make all our data and code available:

Paper: arxiv.org/abs/2510.11977
Website: hal.cs.princeton.edu
Github: github.com/princeton-pli/…

Conducting rigorous evaluations required developing infrastructure to handle logging, scaffold and benchmark support, orchestration across VMs, and integration with other tools in the ecosystem like Docent and Weave.

We plan to conduct many more rigorous agent evaluations over the next year and continue sharing insights from our analysis. Follow @halevals for updates on HAL.

I'm grateful to have a fantastic team in place working on HAL: @random_walker, @benediktstroebl, @PKirgis, @nityndg, @siegelz_, @wei_boyi, @xue_tianci, @RonZiruChen, @flxcn, @SaitejaUtpala, @ndzfs, Dheeraj Oruganty, Sophie Luskin, @khl53182440, @BotaoYu24, @aarora79, Dongyoon Hahm, @harsh3vedi, @hhsun1, Juyong Lee, @tengjun_77, Yifan Mai, @YifeiZhou02, @maxYuxuanZhu, @RishiBommasani, @daniel_d_kang, @dawnsongtweets, @PeterHndrsn, @ysu_nlp, @percyliang
