Sayash Kapoor Profile picture
On the faculty job market. I tweet about AI agents, AI evals, AI for science. AI as Normal Technology: https://t.co/5amOkqKDf2 Book: https://t.co/DabpkhNrcM
Oct 15, 2025 17 tweets 7 min read
📣New paper: Rigorous AI agent evaluation is much harder than it seems.

For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks.

Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9 challenging benchmarks spanning web, coding, science, and customer service tasks.

Our key insight: Benchmark accuracy hides many important details. Take claims of agents' accuracy with a huge grain of salt. 🧵Screenshot of the first page of the HAL paper There are 3 components of HAL:

1) Standard harness evaluates agents on hundreds of VMs in parallel to drastically reduce eval time
2) 3-D evaluation of models x scaffolds x benchmarks enables insights across these dimensions
3) Agent behavior analysis using @TransluceAI Docent uncovers surprising agent behaviorsChallenges addressed by HAL
Sep 5, 2025 8 tweets 5 min read
AI as Normal Technology is often contrasted with AI 2027. Many readers have asked if AI evaluations could help settle the debate.

Unfortunately, this is not straightforward. That's because the debate is not about differences in AI capability, which evaluations typically measure. It is about two completely different causal models of the world.

But most AI evaluations don't even *attempt* to measure differences in causal models. 🧵Image A key turning point in scenarios like AI 2027 is automated AI R&D — AI systems used to autonomously improve AI, greatly speeding up the pace of AI progress.

In AI as Normal Technology, we discussed some bottlenecks to such automation. We did not go too deep into this disagreement, since our broader point was that AI impacts are bottlenecked by diffusion, and the rate of AI progress doesn't change that.

But in the time since we wrote the essay, we have thought about this disagreement more deeply. This was partly motivated by a workshop @hlntnr organized at CSET.Image
Aug 8, 2025 14 tweets 6 min read
How does GPT-5 compare against Claude Opus 4.1 on agentic tasks?

Since their release, we have been evaluating these models on challenging science, web, service, and code tasks.

Headline result: While cost-effective, so far GPT-5 never tops agentic leaderboards. More evals 🧵Holistic Agent Leaderboard — results on CORE-Bench (hal.cs.princeton.edu) Last year, in our paper AI Agents That Matter, we discussed the challenges with agent evaluations.

Since then, we've been building the Holistic Agent Leaderboard (HAL). We've evaluated 100+ agents across 8 benchmarks. The code is open source, and all logs are available online. screenshot of HAL landing page
May 8, 2025 7 tweets 2 min read
Will AI agents be controlled by big tech companies? Or could they be controlled by users, safeguarding user autonomy and privacy?

In a new position paper (accepted to ICML 2025), we outline the steps we need to take now to enable user-centric agents (w/@sethlazar, Noam Kolt)🧶 screenshot of the first page of the paper @sethlazar PDF:

If agents are controlled by platforms, they would intensify many problems of the current platform economy:
- heightened surveillance
- increased user lock-in
- market manipulation
- further entrenching incumbent digital giants arxiv.org/pdf/2505.04345screenshot of table 1 from the paper
Feb 11, 2025 13 tweets 3 min read
I led the section on the risks of open models in the International AI Safety Report (w/@ea_seger). It was released at the French AI Action Summit.

The section explains why DeepSeek R1 is *not* open source, how to assess misuse risks, and what evidence we need going forward 🧵 Table depicting the spectrum of model release, from fully closed to fully open @ea_seger In the wake of DeepSeek R1 release, many people (including prominent news outlets) claimed DeepSeek R1 is open source.

This is incorrect: DeepSeek did not release the training data or code needed to reproduce their results, which are both key parts of the open source definition. News headlines on DeepSeek R1
Feb 5, 2025 18 tweets 6 min read
Can Deep Research automate the work of research assistants? I compared OpenAI and Google Deep Research for assistance with an upcoming project.

In the process, I realized what Deep Research is great at, where it fails, and why commentators have such diverging views on it. 🧵 spectrum of how useful deep research is for different applications Deep Research browses the internet to create reports. @random_walker and I are writing about pro-social applications of AI for social media. So I used deep research to see if it could supplement the process of surveying the state of such applications and research projects.
Feb 3, 2025 17 tweets 6 min read
I spent a few hours with OpenAI's Operator automating expense reports. Most corporate jobs require filing expenses, so Operator could save *millions* of person-hours every year if it gets this right.

Some insights on what worked, what broke, and why this matters for agents 🧵 Graph of web tasks along difficulty and severity (cost of errors) OpenAI's Operator is a web agent that can solve arbitrary tasks on the internet *with human supervision*. It runs on a virtual machine (*not* your computer). Users can see what the agent is doing on the browser in real-time. It is available to ChatGPT Pro subscribers. Screenshot of Operator writing "hello world" on an online notepad.
Jan 16, 2025 14 tweets 6 min read
How expensive are the best SWE-Bench agents? Do reasoning models outperform language models? Can we trust agent evaluations?

📢 Announcing HAL, a Holistic Agent Leaderboard for evaluating AI agents, with 11 benchmarks, 90+ agents, and many more to come. landing page of HAL (hal.cs.princeton.edu) Existing agent evaluations suffer from inconsistent setups, no cost tracking, and limited reproducibility.

HAL is a one-stop shop where agents can be compared fairly, with transparent performance and costs.

Access HAL: hal.cs.princeton.eduHAL leaderboard for USACO
Dec 16, 2024 14 tweets 5 min read
More than 60 countries held elections this year. Many researchers and journalists claimed AI misinformation would destabilize democracies. What impact did AI really have?

We analyzed every instance of political AI use this year collected by WIRED. New essay w/@random_walker: 🧵 List of news headlines about the impact of AI misinformation @random_walker We found that (1) half of AI use wasn't deceptive, (2) deceptive content was nevertheless cheap to replicate without AI, and (3) focusing on the demand for misinfo rather than the supply can be more effective.

Link to the essay: knightcolumbia.org/blog/we-looked…Image
Sep 18, 2024 10 tweets 5 min read
📢New paper: Many companies and papers have claimed AI can automate science. How can we evaluate these claims?

Today, we introduce CORE-Bench: a benchmark to measure if AI agents can automate reproducing a paper given access to its code and data. arxiv.org/pdf/2409.11363…
The first page of the CORE-Bench arxiv paper. If AI could automate computational reproducibility, it would save millions of researcher hours. Computational reproducibility is hard even for experts: In the 2022 ML Reproducibility Challenge, over a third of the papers could not be reproduced even with access to code and data. List of papers with computational reproducibility issues.
May 1, 2024 9 tweets 3 min read
Excited to share that our paper introducing the REFORMS checklist is now out @ScienceAdvances!

In it, we:
- review common errors in ML for science
- create a checklist of 32 items applicable across disciplines
- provide in-depth guidelines for each item

science.org/doi/10.1126/sc… @ScienceAdvances The paper's origin story is a workshop we organized two years ago on the reproducibility crisis in ML-based science:

After the workshop, the speakers, organizers, and other experts in reproducibility met to brainstorm a possible solution.sites.google.com/princeton.edu/…
Oct 18, 2023 7 tweets 3 min read
Foundation models have profound societal impact, but transparency about these models is waning. Today, we are launching the Foundation Model Transparency Index, which offers a deep dive into the transparency practices and standards of key AI developers.
crfm.stanford.edu/fmti/
Overall scores of the various foundation model developers. Our aim is to:
- Aggregate transparency information from foundation model developers
- Identify areas for improvement
- Push for changes by companies
- Track progress over time
Blog:aisnakeoil.com/p/how-transpar…
Jun 7, 2023 4 tweets 2 min read
Every time I play around with prompt injection, I come away surprised that MS and others continue to add LLM+plugin functionality to their core products.

Here, after one visit to a malicious site, ChatGPT sends *each subsequent message* to the website. Goodbye, privacy. Screenshot of a ChatGPT con...Screenshot of ChatGPT with ... .@KGreshake has demonstrated many types of indirect prompt injection attacks, including with ChatGPT + browsing: kai-greshake.de/posts/puzzle-2…
Feb 21, 2023 9 tweets 4 min read
Can machine learning improve algorithmic decision-making? Developers of ML-based algorithms have made tall claims about their accuracy, efficiency, and fairness. In a systematic analysis, we find that these claims fall apart under scrutiny. …dictive-optimization.cs.princeton.edu Image Our analytical contribution is to formalize predictive optimization: a distinct type of automated decision-making that has proliferated widely. It is sold as accurate, fair, and efficient. We find 47 real-world applications of predictive optimization. Venn diagram of decision-ma...
Feb 6, 2023 4 tweets 2 min read
Leakage is a big issue for medical data, and it leads to massive over-optimism about ML methods. Some examples:

For diagnosing covid using chest radiographs, Roberts et al. found that 16/62 papers in their review just classified adults vs. children nature.com/articles/s4225… Filho et al. show that a hypertension prediction model uses anti-hypertensive drugs as a feature, questioning its clinical utility: jmir.org/2021/2/e10969
Jul 15, 2022 8 tweets 5 min read
ML-based science is suffering from a reproducibility crisis. But what causes these reproducibility failures? In a new paper, @random_walker and I find that data leakage is a widely prevalent failure mode in ML-based science: reproducible.cs.princeton.edu @random_walker We survey papers reporting pitfalls in ML-based science and find that data leakage is prevalent across fields: each of the 17 different fields in our survey is affected by data leakage, affecting at least 329 papers.