Quick ablations on Countdown:
Base model quality is key:
We run Qwen-2.5-Base at 0.5B, 1.5B, 3B, and 7B. The 0.5B model guesses a solution and stops. From 1.5B on, the models start learning to search, self-verify, and revise their solutions, enabling them to achieve much higher scores.
Either a base or an instruct model works
- The instruct model learns faster, but converges to about the same performance as the base model
- The instruct model's outputs are more structured and readable
So extra instruction tuning isn't necessary, which supports R1-Zero's design decision
The specific RL algorithm doesn't matter much
We tried PPO, GRPO, and PRIME. Long CoT emerges with all of them, and they all seem to work well. We haven't had time to tune the hyperparameters, so we don't want to make quantitative conclusions about which algorithm works better.
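For reference, the group-relative advantage that sets GRPO apart from PPO-style value baselines fits in a few lines; this is a generic illustration, not TinyZero's actual training code.

```python
# Group-relative advantage as used in GRPO (illustration only, not TinyZero's code).
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """`rewards`: scores for a group of responses sampled from the same prompt."""
    # Each response is scored against its own group's mean and std,
    # so no learned value function is needed (unlike PPO).
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled Countdown solutions scored 0/1 by the rule-based reward.
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # ~[ 1, -1, -1, 1]
```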
The model's reasoning behavior is very task-dependent:
- For Countdown, the model learns to search and self-verify (a toy checker of this kind is sketched after this list)
- For number multiplication, the model instead learns to break the problem down with the distributive rule and solve it step by step.
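To make the Countdown behavior concrete, here is a toy checker of the kind the rule-based reward uses, and that the model effectively learns to run on its own candidate answers. The expression format and 0/1 scoring below are our assumptions, not TinyZero's exact reward code.

```python
# Toy Countdown checker: verifies that an expression uses exactly the given
# numbers and evaluates to the target (illustration only).
import ast
import operator
from collections import Counter

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    # Recursively evaluate a restricted arithmetic AST (+, -, *, / and numbers only).
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("unsupported expression")

def countdown_reward(expr: str, numbers: list, target: int) -> float:
    """1.0 if `expr` uses exactly the given numbers and hits the target, else 0.0."""
    try:
        tree = ast.parse(expr, mode="eval")
        used = [n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)]
        if Counter(used) != Counter(numbers):
            return 0.0
        return 1.0 if abs(_eval(tree.body) - target) < 1e-6 else 0.0
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0

print(countdown_reward("(55 - 3) * 2 - 4", [55, 3, 2, 4], 100))  # -> 1.0
```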
Everything's open at github.com/Jiayi-Pan/Tiny…
And it costs < $30 to train the model! We hope this project helps to demystify the emerging RL scaling research and make it more accessible!
One caveat, of course, is that it's validated only on the Countdown task, not in the general reasoning domain. We are currently bounded by compute, so please reach out if you wanna help!
A wild ride with @JunjieZhang12 @xingyaow_ @lifan__yuan
We explore a new dimension in scaling reasoning models in Adaptive Parallel Reasoning
APR lets LMs learn to orchestrate both serial & parallel compute E2E via supervised training + RL — w/ better efficiency and scalability than long CoT on Countdown
Reasoning models like DeepSeek R1 scale test-time compute solely by generating longer chain-of-thought (CoT)
But this single-threaded serial decoding is slow, inefficient, and strains the context window — bottlenecks that only grow as models scale
Parallelism to the rescue!
We envision reasoning models that scale not just by longer CoT, but also by increasing the # of parallel decoding threads
APR equips LMs with tools to manage and coordinate across decoding threads. RL then optimizes how the model orchestrates these threads for best performance E2E
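Very roughly, the parent/child structure looks like the sketch below; the spawn/join framing follows the thread, but the function names and threading backend here are illustrative assumptions rather than the actual APR implementation.

```python
# A sketch of APR-style orchestration (our illustration, not the actual implementation).
# `lm_generate` stands in for whatever decoding backend is used.
from concurrent.futures import ThreadPoolExecutor

def lm_generate(prompt: str) -> str:
    # Placeholder LM call; swap in a real decoding endpoint.
    return f"<output for: {prompt[:40]}...>"

def solve_with_apr(task: str, subqueries: list[str]) -> str:
    # Parent thread decodes its own (short) reasoning trace.
    parent_plan = lm_generate(f"Task: {task}\nPlan which branches to explore.")

    # spawn(): child threads decode in parallel, each with its own compact context.
    with ThreadPoolExecutor(max_workers=len(subqueries)) as pool:
        child_results = list(pool.map(lm_generate, subqueries))

    # join(): only the children's results flow back into the parent's context,
    # so the parent's context window stays short while total compute grows.
    summary = "\n".join(child_results)
    return lm_generate(
        f"Task: {task}\nPlan: {parent_plan}\nChild results:\n{summary}\nFinal answer:"
    )

print(solve_with_apr("Countdown: reach 24 with 4, 7, 8, 8", ["branch A", "branch B"]))
```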
Progress in SWE agents has been limited by the lack of training environments with real-world coverage and execution feedback.
We create SWE-Gym, the first env for training SWE agents, with 2.4K real tasks from 11 Python repos & a Lite split of 234 instances mimicking SWE-Bench Lite.
SWE-Gym trains LMs as agents.
When fine-tuned on fewer than 500 agent-environment interaction trajectories sampled from GPT-4o and Claude, we achieve +14% absolute gains on SWE-Bench Verified with a 32B LM-powered OpenHands agent.
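In spirit, the fine-tuning data collection is a rejection-sampling loop over the environment; the interfaces below (`Trajectory`, `rollout`) are hypothetical stand-ins, not SWE-Gym's real API.

```python
# A rejection-sampling sketch for building SFT data from agent rollouts
# (the dataclass and callables here are hypothetical).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    messages: List[dict]   # the agent-environment interaction turns
    resolved: bool         # did the final patch pass the task's tests (execution feedback)?

def collect_sft_data(
    task_ids: List[str],
    rollout: Callable[[str], Trajectory],   # e.g. GPT-4o or Claude driving an OpenHands agent
) -> List[List[dict]]:
    data = []
    for task_id in task_ids:
        traj = rollout(task_id)
        if traj.resolved:                   # keep only rollouts verified by the environment
            data.append(traj.messages)      # successful trajectories become SFT examples
    return data
```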
New paper from @Berkeley_AI on Autonomous Evaluation and Refinement of Digital Agents!
We show that VLM/LLM-based evaluators can significantly improve the performance of agents for web browsing and device control, advancing the state of the art by 29% to 75%.
We begin by developing two types of evaluators: one that directly queries GPT-4V and another that employs an open-weight solution. Our best model shows 82% / 93% agreement with oracle evaluations in the web browsing and Android device control settings, respectively.
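For a rough sense of the first evaluator type, a directly prompted GPT-4V judge might look like the sketch below; the prompt wording, model name, and YES/NO parsing are our assumptions, not the paper's exact setup.

```python
# Rough sketch of a directly prompted GPT-4V judge over a final screenshot
# plus the agent's action log (assumptions noted above).
import base64
from openai import OpenAI

client = OpenAI()

def judge_success(instruction: str, final_screenshot: str, action_log: str) -> bool:
    with open(final_screenshot, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"User instruction: {instruction}\n"
                    f"Agent actions:\n{action_log}\n"
                    "Based on the final screenshot, did the agent complete the task? "
                    "Answer YES or NO."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```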
Next, we show how they can be used to improve agents, either through inference-time guidance or fine-tuning.
We start with WebArena, a popular web agent benchmark. We experiment with integrating the SOTA agent with the Reflexion algorithm, using our evaluators as the reward function.
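Schematically, plugging an evaluator in as the reward of a Reflexion-style retry loop looks like this; `agent`, `evaluate`, and `reflect` are hypothetical callables, not the paper's code.

```python
# Sketch of evaluator-as-reward inside a Reflexion-style loop (illustrative only).
from typing import Callable, List

def reflexion_with_evaluator(
    task: str,
    agent: Callable[[str, List[str]], str],    # produces a trajectory given task + reflections
    evaluate: Callable[[str, str], float],     # VLM/LLM evaluator scoring success in [0, 1]
    reflect: Callable[[str, str], str],        # turns a failed trajectory into verbal feedback
    max_trials: int = 3,
) -> str:
    reflections: List[str] = []
    trajectory = ""
    for _ in range(max_trials):
        trajectory = agent(task, reflections)
        if evaluate(task, trajectory) >= 0.5:  # the evaluator acts as the reward signal
            break                              # stop once it judges the episode successful
        reflections.append(reflect(task, trajectory))
    return trajectory
```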