Nitya Nadgir
AI evals & policy @Princeton
Jun 30 · 20 tweets · 6 min read
Can AI agents help researchers reproduce research more quickly? We conducted an uplift study. The answer is yes: researchers reproduced papers more than 2x faster using Codex with GPT-5.4 xhigh. In a new paper, we show many other results.

When a benchmark’s accuracy saturates, the field usually replaces it with a harder one. We use CORE-Bench Hard, a benchmark for computational reproducibility, as a case study to show what we can still measure after accuracy saturates.

Paper: arxiv.org/pdf/2606.26158…