🧑‍🍳 PhD Student @Berkeley_AI | Scaling LM Agents | Views Are My Own
Apr 23 • 10 tweets • 4 min read
We explore a new dimension for scaling reasoning models: Adaptive Parallel Reasoning (APR)
APR lets LMs learn to orchestrate both serial & parallel compute E2E via supervised training + RL, with better efficiency and scalability than long CoT on Countdown
🧵 arxiv.org/abs/2504.15466
Reasoning models like DeepSeek R1 scale test-time compute solely by generating longer chain-of-thought (CoT)
But this single-threaded serial decoding is slow, inefficient, and strains the context window — bottlenecks that only grow as models scale
Parallelism to the rescue!
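As a minimal sketch of what spawn/join-style orchestration could look like (names like `generate` and `solve` are illustrative assumptions here, not APR's actual interface):

```python
# Minimal sketch of parallel reasoning via spawn/join-style threads.
# All names (generate, solve) are illustrative, not APR's actual API.
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    """Placeholder for one serial LM decoding call."""
    raise NotImplementedError

def solve(problem: str) -> str:
    # Parent thread decodes until it decides to fork independent subproblems.
    plan = generate(problem + "\nList independent subgoals, one per line:")
    subgoals = [line for line in plan.splitlines() if line.strip()]

    # "Spawn": child threads decode in parallel, each in its own context window,
    # so no single thread has to carry the whole search as one long CoT.
    with ThreadPoolExecutor() as pool:
        child_traces = list(pool.map(generate, subgoals))

    # "Join": child results are condensed back into the parent's context.
    summary = "\n".join(child_traces)
    return generate(problem + "\nSubgoal results:\n" + summary + "\nFinal answer:")
```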
Jan 24 • 10 tweets • 4 min read
We reproduced DeepSeek R1-Zero in the Countdown game, and it just works
Through RL, the 3B base LM develops self-verification and search abilities all on its own
You can experience the aha moment yourself for < $30
Code:
We follow the DeepSeek R1-Zero algorithm: given a base LM, prompts, and a ground-truth reward, we run RL.
We apply it to Countdown: a game where players combine numbers with basic arithmetic to reach a target number.
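For concreteness, here's a rough sketch of the kind of rule-based reward this setup needs, assuming the model emits a bare arithmetic expression (the exact answer format and checks in the released code may differ):

```python
# Rough sketch of a rule-based Countdown reward. Assumes the model's answer
# is a bare arithmetic expression like "(13 - 9) * (10 - 4)".
import re

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    if not re.fullmatch(r"[\d\s+\-*/()]+", expr):    # arithmetic tokens only
        return 0.0
    used = [int(tok) for tok in re.findall(r"\d+", expr)]
    if sorted(used) != sorted(numbers):              # use each given number exactly once
        return 0.0
    try:
        value = eval(expr)  # safe here: the expression is whitelisted above
    except (SyntaxError, ZeroDivisionError):
        return 0.0
    return 1.0 if value == target else 0.0
```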
Dec 23, 2024 • 9 tweets • 4 min read
Introducing SWE-Gym: An Open Environment for Training Software Engineering Agents & Verifiers
Using SWE-Gym, our agents + verifiers reach a new open SOTA: 32%/26% on SWE-Bench Verified/Lite,
showing strong scaling with more train / test compute
[🧵] github.com/SWE-Gym/SWE-Gym
Progress in SWE agents has been limited by the lack of training environments with real-world coverage and execution feedback.
We create SWE-Gym, the first env for training SWE agents, with 2.4K real tasks from 11 Python repos & a Lite split of 234 instances mimicking SWE-Bench Lite.
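Schematically, a training rollout looks something like the loop below (illustrative names, not SWE-Gym's actual API; see the repo for the real interface):

```python
# Schematic agent rollout against a SWE-Gym-style task. All names here are
# illustrative; see github.com/SWE-Gym/SWE-Gym for the actual interface.
def rollout(agent, env, task):
    obs = env.reset(task)            # check out the repo at the task's commit
    trajectory = []
    while not env.done:
        action = agent.act(obs)      # e.g. edit a file or run a shell command
        obs = env.step(action)       # real execution feedback from the repo
        trajectory.append((action, obs))
    reward = env.run_tests()         # held-out tests decide pass/fail
    # Trajectories train the agent; (trajectory, reward) pairs train the verifier.
    return trajectory, reward
```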
Apr 10, 2024 • 7 tweets • 3 min read
New paper from @Berkeley_AI on Autonomous Evaluation and Refinement of Digital Agents!
We show that VLM/LLM-based evaluators can significantly improve the performance of agents for web browsing and device control, advancing the state of the art by 29% to 75%.
[🧵] arxiv.org/abs/2404.06474
We begin by developing two types of evaluators: one that directly queries GPT-4V and another that employs an open-weight solution. Our best model shows 82% / 93% agreement with oracle evaluations on web browsing and Android device control settings, respectively.
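As a sketch, the GPT-4V-based evaluator boils down to a judge call like the one below (the prompt wording and the `query_vlm` helper are assumptions, not the paper's released prompts):

```python
# Sketch of a VLM-as-judge trajectory evaluator. The prompt wording and the
# query_vlm helper are assumptions, not the paper's released implementation.
def query_vlm(prompt: str, images: list) -> str:
    """Placeholder for a vision-language model call (e.g. GPT-4V via an API)."""
    raise NotImplementedError

def evaluate_trajectory(instruction: str, screenshots: list, actions: list[str]) -> bool:
    prompt = (
        f"Task: {instruction}\n"
        f"Actions taken: {actions}\n"
        "Based on the screenshots, did the agent complete the task? "
        "Answer 'success' or 'failure'."
    )
    verdict = query_vlm(prompt, images=screenshots)
    return verdict.strip().lower().startswith("success")
```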