@dhruvtrehan9 tested whether LLMs can perform end-to-end ML research. 3/4 attempts failed. One worked and led to a paper accepted at Agents4Science 2025, the world’s first conference for AI authors.
In the report we document six failure modes and four design principles.
🧵1/ We wanted a system with minimal scaffolding. To do this, we built a modular agent system prompt and mapped six agents to the scientific workflow, starting at Idea Generation and spanning Experiment Execution, Evaluation, and Paper Writing.
We used Gemini 2.5 Pro for most of these agents, while Claude Code handled Experiment Execution on Modal.
Each agent had basic read, write, and list tools, working in the same idea-level directory with all context files.
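To make the shared-directory setup concrete, here is a minimal sketch of what read, write, and list tools scoped to one idea-level directory could look like. It is an illustration under stated assumptions, not the system from the report: the class name `WorkspaceTools`, its method names, the file names, and the path-escape check are all hypothetical.

```python
from pathlib import Path
import tempfile


class WorkspaceTools:
    """Hypothetical read/write/list tools confined to one idea-level directory."""

    def __init__(self, root):
        self.root = Path(root).resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _resolve(self, rel_path):
        # Assumption: tools reject any path that leaves the shared directory.
        path = (self.root / rel_path).resolve()
        if path != self.root and self.root not in path.parents:
            raise ValueError(f"path escapes workspace: {rel_path}")
        return path

    def read_file(self, rel_path):
        return self._resolve(rel_path).read_text(encoding="utf-8")

    def write_file(self, rel_path, content):
        path = self._resolve(rel_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
        return path.relative_to(self.root).as_posix()

    def list_files(self, rel_path="."):
        base = self._resolve(rel_path)
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in base.rglob("*")
            if p.is_file()
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Two agents share one workspace, so one agent's output
        # becomes context for the next.
        idea_agent = WorkspaceTools(tmp)
        writer_agent = WorkspaceTools(tmp)
        idea_agent.write_file("idea.md", "Hypothesis: ...")
        print(writer_agent.list_files())
        print(writer_agent.read_file("idea.md"))
```

Because every agent points at the same root, context passes between stages through files rather than through a separate messaging layer, which fits the minimal-scaffolding goal described above.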