Jiayi Pan
🧑‍🍳 PhD Student @Berkeley_AI | Scaling LM Agents | Views Are My Own
Apr 23 10 tweets 4 min read
We explore a new dimension of scaling reasoning models with Adaptive Parallel Reasoning (APR)

APR lets LMs learn to orchestrate both serial & parallel compute E2E via supervised training + RL — w/ better efficiency and scalability than long CoT on Countdown

🧵 arxiv.org/abs/2504.15466

APR significantly outperforms long CoT (SoS+) on Countdown: it achieves higher accuracy at lower latency and scales efficiently.

Reasoning models like DeepSeek R1 scale test-time compute solely by generating longer chain-of-thought (CoT)

But this single-threaded serial decoding is slow, inefficient, and strains the context window — bottlenecks that only grow as models scale

Parallelism to the rescue!
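The serial-vs-parallel tradeoff can be sketched in plain Python. This is only an illustration of the general idea, not APR itself (whose spawn/join behavior is learned end-to-end by the model), and `score_branch` is a hypothetical stand-in for rolling out one reasoning branch with the LM:

```python
from concurrent.futures import ThreadPoolExecutor

def serial_best(branches, score_branch):
    """Serial decoding: roll out reasoning branches one after another."""
    return max(score_branch(b) for b in branches)

def parallel_best(branches, score_branch, workers=4):
    """Parallel decoding: roll out branches concurrently, keep the best.

    Latency is roughly that of the slowest single rollout instead of the
    sum of all rollouts, and no single context has to hold every branch.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return max(pool.map(score_branch, branches))
```

Both return the same answer; only the wall-clock cost and context-length pressure differ.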
Jan 24 10 tweets 4 min read
We reproduced DeepSeek R1-Zero in the CountDown game, and it just works

Through RL, the 3B base LM develops self-verification and search abilities all on its own

You can experience the Aha moment yourself for < $30
Code: github.com/Jiayi-Pan/Tiny…

Here's what we learned 🧵

The recipe:

We follow the DeepSeek R1-Zero algorithm -- given a base LM, prompts, and a ground-truth reward, we run RL.

We apply it to CountDown: a game where players combine numbers with basic arithmetic to reach a target number.
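For concreteness, the ground-truth reward for CountDown can be a simple verifier: the proposed equation must use each given number exactly once and evaluate to the target. A minimal sketch (my own illustration; the actual TinyZero reward code may differ):

```python
import ast
from collections import Counter

def countdown_reward(expr: str, numbers: list, target: float) -> float:
    """Binary ground-truth reward for a proposed CountDown equation.

    Returns 1.0 iff `expr` is pure arithmetic, uses exactly the given
    numbers (each once), and evaluates to `target`; otherwise 0.0.
    """
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return 0.0
    used = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant):
            if not isinstance(node.value, (int, float)):
                return 0.0
            used.append(node.value)
        elif not isinstance(node, (ast.Expression, ast.BinOp, ast.UnaryOp,
                                   ast.Add, ast.Sub, ast.Mult, ast.Div,
                                   ast.USub, ast.UAdd)):
            # Reject anything that isn't basic arithmetic (names, calls, ...).
            return 0.0
    if Counter(used) != Counter(numbers):
        return 0.0
    try:
        value = eval(compile(tree, "<expr>", "eval"))
    except ZeroDivisionError:
        return 0.0
    return 1.0 if abs(value - target) < 1e-9 else 0.0
```

Because the reward is checkable exactly, RL needs no learned reward model here.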
Dec 23, 2024 9 tweets 4 min read
Introducing SWE-Gym: An Open Environment for Training Software Engineering Agents & Verifiers

Using SWE-Gym, our agents + verifiers reach a new open SOTA - 32%/26% on SWE-Bench Verified/Lite,
showing strong scaling with more train / test compute

[🧵] github.com/SWE-Gym/SWE-Gym

SWE-Gym enables scalable improvements for software engineering agents at both training and inference time. Progress in SWE agents has been limited by the lack of training environments with real-world coverage and execution feedback.

We create SWE-Gym, the first environment for training SWE agents, with 2.4K real tasks from 11 Python repos and a Lite split of 234 instances mimicking SWE-Bench Lite.
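An agent's interaction with such an environment might look like the following sketch. The class and method names here are hypothetical, not SWE-Gym's actual API; the point is the shape of the loop: the agent proposes a patch against a real repo and receives execution feedback from the task's tests.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    """Execution feedback from running a task's test suite."""
    passed: bool
    log: str

class RepoTaskEnv:
    """Hypothetical SWE-Gym-style environment: one real repo issue per task."""

    def __init__(self, tasks):
        # Each task pairs an issue description with a test predicate
        # (a stand-in for actually running the repo's test suite).
        self.tasks = tasks

    def reset(self, i):
        """Start task i; the observation is the issue description."""
        self.current = self.tasks[i]
        return self.current[0]

    def step(self, patch):
        """Apply the agent's patch and run the tests (stubbed here)."""
        issue, run_tests = self.current
        passed = run_tests(patch)
        return Feedback(passed=passed,
                        log="tests passed" if passed else "tests failed")
```

Trajectories whose feedback passes can supervise the agent, while all trajectories give the verifier labeled examples to score.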
Apr 10, 2024 7 tweets 3 min read
New paper from @Berkeley_AI on Autonomous Evaluation and Refinement of Digital Agents!

We show that VLM/LLM-based evaluators can significantly improve the performance of agents for web browsing and device control, advancing SOTA by 29% to 75%.

[🧵] arxiv.org/abs/2404.06474
We begin by developing two types of evaluators: one that directly queries GPT-4V and another that employs an open-weight solution. Our best model shows 82% / 93% agreement with oracle evaluations on web browsing and Android device control settings, respectively.