Latest Twitter Threads by @sayashk on Thread Reader App

Oct 15, 2025 • 17 tweets • 7 min read

📣New paper: Rigorous AI agent evaluation is much harder than it seems.

For the last year, we have been working on infrastructure for fair agent evaluations on challenging benchmarks.

Today, we release a paper that condenses our insights from 20,000+ agent rollouts on 9 challenging benchmarks spanning web, coding, science, and customer service tasks.

Our key insight: Benchmark accuracy hides many important details. Take claims of agents' accuracy with a huge grain of salt. 🧵

There are 3 components of HAL:

1) Standard harness evaluates agents on hundreds of VMs in parallel to drastically reduce eval time
2) 3-D evaluation of models x scaffolds x benchmarks enables insights across these dimensions
3) Agent behavior analysis using @TransluceAI Docent uncovers surprising agent behaviors
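
A minimal sketch of what a models x scaffolds x benchmarks grid run in parallel could look like. Everything here is an assumption for illustration: the names `MODELS`, `SCAFFOLDS`, `BENCHMARKS`, `run_rollout`, `evaluate_grid`, and `success_rate_by` are hypothetical and are not HAL's API, and a thread pool stands in for the VMs the thread describes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical placeholder names; none of these come from HAL itself.
MODELS = ["model-a", "model-b"]
SCAFFOLDS = ["scaffold-x", "scaffold-y"]
BENCHMARKS = ["bench-1", "bench-2", "bench-3"]


def run_rollout(model, scaffold, benchmark):
    """Stand-in for one agent run; a real harness would launch a VM here."""
    # Deterministic dummy outcome so the sketch runs without external services.
    success = sum(map(ord, model + scaffold + benchmark)) % 2 == 0
    return {
        "model": model,
        "scaffold": scaffold,
        "benchmark": benchmark,
        "success": success,
    }


def evaluate_grid(max_workers=8):
    """Run every (model, scaffold, benchmark) cell of the grid concurrently."""
    cells = list(product(MODELS, SCAFFOLDS, BENCHMARKS))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cell: run_rollout(*cell), cells))


def success_rate_by(results, axis):
    """Aggregate success rate along one axis: 'model', 'scaffold', or 'benchmark'."""
    totals = defaultdict(lambda: [0, 0])  # axis value -> [successes, runs]
    for r in results:
        totals[r[axis]][0] += int(r["success"])
        totals[r[axis]][1] += 1
    return {key: wins / runs for key, (wins, runs) in totals.items()}


results = evaluate_grid()
by_model = success_rate_by(results, "model")
by_benchmark = success_rate_by(results, "benchmark")
```

Keeping each result tagged with all three coordinates is what lets the same list be sliced by model, by scaffold, or by benchmark afterwards.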

Sep 5, 2025 • 8 tweets • 5 min read

AI as Normal Technology is often contrasted with AI 2027. Many readers have asked if AI evaluations could help settle the debate.

Unfortunately, this is not straightforward. That's because the debate is not about differences in AI capability, which evaluations typically measure. It is about two completely different causal models of the world.

But most AI evaluations don't even *attempt* to measure differences in causal models. 🧵

A key turning point in scenarios like AI 2027 is automated AI R&D — AI systems used to autonomously improve AI, greatly speeding up the pace of AI progress.

In AI as Normal Technology, we discussed some bottlenecks to such automation. We did not go too deep into this disagreement, since our broader point was that AI impacts are bottlenecked by diffusion, and the rate of AI progress doesn't change that.

But in the time since we wrote the essay, we have thought about this disagreement more deeply. This was partly motivated by a workshop @hlntnr organized at CSET.

Aug 8, 2025 • 14 tweets • 6 min read

How does GPT-5 compare against Claude Opus 4.1 on agentic tasks?

Since their release, we have been evaluating these models on challenging science, web, service, and code tasks.

Headline result: While cost-effective, so far GPT-5 never tops agentic leaderboards. More evals 🧵

Last year, in our paper AI Agents That Matter, we discussed the challenges with agent evaluations.

Since then, we've been building the Holistic Agent Leaderboard (HAL). We've evaluated 100+ agents across 8 benchmarks. The code is open source, and all logs are available online.

May 8, 2025 • 7 tweets • 2 min read

Will AI agents be controlled by big tech companies? Or could they be controlled by users, safeguarding user autonomy and privacy?

In a new position paper (accepted to ICML 2025), we outline the steps we need to take now to enable user-centric agents (w/@sethlazar, Noam Kolt)🧶

PDF: arxiv.org/pdf/2505.04345

If agents are controlled by platforms, they would intensify many problems of the current platform economy:
- heightened surveillance
- increased user lock-in
- market manipulation
- further entrenching incumbent digital giants

Feb 11, 2025 • 13 tweets • 3 min read

I led the section on the risks of open models in the International AI Safety Report (w/@ea_seger). It was released at the French AI Action Summit.

The section explains why DeepSeek R1 is *not* open source, how to assess misuse risks, and what evidence we need going forward 🧵

In the wake of the DeepSeek R1 release, many people (including prominent news outlets) claimed DeepSeek R1 is open source.

This is incorrect: DeepSeek did not release the training data or code needed to reproduce their results, which are both key parts of the open source definition.

Feb 5, 2025 • 18 tweets • 6 min read

Can Deep Research automate the work of research assistants? I compared OpenAI and Google Deep Research for assistance with an upcoming project.

In the process, I realized what Deep Research is great at, where it fails, and why commentators have such diverging views on it. 🧵

Deep Research browses the internet to create reports. @random_walker and I are writing about pro-social applications of AI for social media. So I used deep research to see if it could supplement the process of surveying the state of such applications and research projects.

Feb 3, 2025 • 17 tweets • 6 min read

I spent a few hours with OpenAI's Operator automating expense reports. Most corporate jobs require filing expenses, so Operator could save *millions* of person-hours every year if it gets this right.

Some insights on what worked, what broke, and why this matters for agents 🧵

OpenAI's Operator is a web agent that can solve arbitrary tasks on the internet *with human supervision*. It runs on a virtual machine (*not* your computer). Users can see what the agent is doing on the browser in real-time. It is available to ChatGPT Pro subscribers.

Jan 16, 2025 • 14 tweets • 6 min read

How expensive are the best SWE-Bench agents? Do reasoning models outperform language models? Can we trust agent evaluations?

📢 Announcing HAL, a Holistic Agent Leaderboard for evaluating AI agents, with 11 benchmarks, 90+ agents, and many more to come.

Existing agent evaluations suffer from inconsistent setups, no cost tracking, and limited reproducibility.

HAL is a one-stop shop where agents can be compared fairly, with transparent performance and costs.

Access HAL: hal.cs.princeton.edu
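
One way to compare agents on performance and cost together is to flag the ones that are not beaten on both axes at once. The sketch below is an assumption about how such a comparison could be done, not a description of how the leaderboard computes anything; `AgentResult`, `dominates`, and `undominated` are hypothetical names, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResult:
    name: str
    accuracy: float  # fraction of tasks solved, between 0 and 1
    cost_usd: float  # total cost of the evaluation run


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost_usd <= b.cost_usd
    strictly_better = a.accuracy > b.accuracy or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def undominated(results):
    """Keep agents for which no other agent is both cheaper and more accurate."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Made-up numbers purely for illustration; these are not HAL results.
runs = [
    AgentResult("agent-a", 0.40, 12.0),
    AgentResult("agent-b", 0.40, 30.0),
    AgentResult("agent-c", 0.55, 45.0),
    AgentResult("agent-d", 0.30, 50.0),
]

frontier = undominated(runs)
print([r.name for r in frontier])
```

In this toy data, agent-b matches agent-a's accuracy at a higher cost, so a ranking by accuracy alone would hide that agent-a is the better choice.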

Dec 16, 2024 • 14 tweets • 5 min read

More than 60 countries held elections this year. Many researchers and journalists claimed AI misinformation would destabilize democracies. What impact did AI really have?

We analyzed every instance of political AI use this year collected by WIRED. New essay w/@random_walker: 🧵

We found that (1) half of AI use wasn't deceptive, (2) deceptive content was nevertheless cheap to replicate without AI, and (3) focusing on the demand for misinfo rather than the supply can be more effective.

Link to the essay: knightcolumbia.org/blog/we-looked…

Sep 18, 2024 • 10 tweets • 5 min read

📢New paper: Many companies and papers have claimed AI can automate science. How can we evaluate these claims?

Today, we introduce CORE-Bench: a benchmark to measure if AI agents can automate reproducing a paper given access to its code and data. arxiv.org/pdf/2409.11363…

If AI could automate computational reproducibility, it would save millions of researcher hours. Computational reproducibility is hard even for experts: In the 2022 ML Reproducibility Challenge, over a third of the papers could not be reproduced even with access to code and data.

May 1, 2024 • 9 tweets • 3 min read

Excited to share that our paper introducing the REFORMS checklist is now out @ScienceAdvances!

In it, we:
- review common errors in ML for science
- create a checklist of 32 items applicable across disciplines
- provide in-depth guidelines for each item

science.org/doi/10.1126/sc…

The paper's origin story is a workshop we organized two years ago on the reproducibility crisis in ML-based science: sites.google.com/princeton.edu/…

After the workshop, the speakers, organizers, and other experts in reproducibility met to brainstorm a possible solution.

Oct 18, 2023 • 7 tweets • 3 min read

Foundation models have profound societal impact, but transparency about these models is waning. Today, we are launching the Foundation Model Transparency Index, which offers a deep dive into the transparency practices and standards of key AI developers.
crfm.stanford.edu/fmti/

Our aim is to:
- Aggregate transparency information from foundation model developers
- Identify areas for improvement
- Push for changes by companies
- Track progress over time
Blog: aisnakeoil.com/p/how-transpar…

Jun 7, 2023 • 4 tweets • 2 min read

Every time I play around with prompt injection, I come away surprised that MS and others continue to add LLM+plugin functionality to their core products.

Here, after one visit to a malicious site, ChatGPT sends *each subsequent message* to the website. Goodbye, privacy.

.@KGreshake has demonstrated many types of indirect prompt injection attacks, including with ChatGPT + browsing: kai-greshake.de/posts/puzzle-2…

Feb 21, 2023 • 9 tweets • 4 min read

Can machine learning improve algorithmic decision-making? Developers of ML-based algorithms have made tall claims about their accuracy, efficiency, and fairness. In a systematic analysis, we find that these claims fall apart under scrutiny. …dictive-optimization.cs.princeton.edu

Our analytical contribution is to formalize predictive optimization: a distinct type of automated decision-making that has proliferated widely. It is sold as accurate, fair, and efficient. We find 47 real-world applications of predictive optimization.

Feb 6, 2023 • 4 tweets • 2 min read

Leakage is a big issue for medical data, and it leads to massive over-optimism about ML methods. Some examples:

For diagnosing covid using chest radiographs, Roberts et al. found that 16/62 papers in their review just classified adults vs. children: nature.com/articles/s4225…

https://twitter.com/DrXiaoLiu/status/1622267704789897216

Filho et al. show that a hypertension prediction model uses anti-hypertensive drugs as a feature, questioning its clinical utility: jmir.org/2021/2/e10969
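
The hypertension case suggests a crude screening check: look for a feature that tracks the outcome almost perfectly, since it may be a consequence of the diagnosis and not a predictor of it. The sketch below is an illustrative heuristic under that assumption, not a method from any of the cited papers; the function names, the 0.95 threshold, and the toy rows are all hypothetical.

```python
def agreement_rate(rows, feature, label):
    """Fraction of rows where a binary feature equals the binary label."""
    matches = sum(1 for row in rows if row[feature] == row[label])
    return matches / len(rows)


def suspicious_features(rows, label, threshold=0.95):
    """Flag features that agree with the label on nearly every row."""
    features = [name for name in rows[0] if name != label]
    return [f for f in features if agreement_rate(rows, f, label) >= threshold]


# Toy, made-up records: being on blood-pressure medication follows the diagnosis.
rows = [
    {"on_bp_medication": 1, "smoker": 0, "hypertension": 1},
    {"on_bp_medication": 1, "smoker": 1, "hypertension": 1},
    {"on_bp_medication": 0, "smoker": 1, "hypertension": 0},
    {"on_bp_medication": 0, "smoker": 0, "hypertension": 0},
    {"on_bp_medication": 1, "smoker": 0, "hypertension": 1},
    {"on_bp_medication": 0, "smoker": 1, "hypertension": 0},
]

flagged = suspicious_features(rows, label="hypertension")
print(flagged)
```

A flagged feature is only a prompt for a domain expert to ask whether it would be available at prediction time; high agreement alone does not prove leakage.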

Jul 15, 2022 • 8 tweets • 5 min read

ML-based science is suffering from a reproducibility crisis. But what causes these reproducibility failures? In a new paper, @random_walker and I find that data leakage is a widely prevalent failure mode in ML-based science: reproducible.cs.princeton.edu

We survey papers reporting pitfalls in ML-based science and find that data leakage is prevalent across fields: each of the 17 different fields in our survey is affected by data leakage, affecting at least 329 papers.
