Jiayi Pan
Jan 24 · 10 tweets · 4 min read
We reproduced DeepSeek R1-Zero in the CountDown game, and it just works

Through RL, the 3B base LM develops self-verification and search abilities all on its own

You can experience the Aha moment yourself for < $30
Code: github.com/Jiayi-Pan/Tiny…

Here's what we learned 🧵
The recipe:

We follow the DeepSeek R1-Zero algorithm: given a base LM, prompts, and a ground-truth reward, we run RL.

We apply it to CountDown: a game where players combine numbers with basic arithmetic to reach a target number.
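To make the reward concrete, here's a minimal sketch in Python of what a rule-based CountDown reward might look like (my own illustration, not the repo's code; I'm assuming the model is prompted to wrap its final expression in <answer> tags, and that each given number must be used exactly once, as in the classic game):

import re

def countdown_reward(completion: str, numbers: list[int], target: int) -> float:
    # Extract the final expression; no answer tag means zero reward.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    expr = m.group(1).strip()
    # Whitelist the character set so the eval below is safe.
    if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
        return 0.0
    # Each given number must be used exactly once.
    if sorted(int(t) for t in re.findall(r"\d+", expr)) != sorted(numbers):
        return 0.0
    try:
        value = eval(expr, {"__builtins__": {}})
    except Exception:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

# e.g. countdown_reward("<answer>(9 - 5) * 3 * 2</answer>", [2, 3, 5, 9], 24) -> 1.0

Because the reward comes from a verifiable rule like this rather than a learned reward model, the policy gets a clean signal with nothing to reward-hack.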
The results: It just works!

The model starts from dummy outputs but gradually develops tactics such as revision and search.

In the following sample, the model proposes a solution, self-verifies, and iteratively revises it until it works.

Full experiment log: wandb.ai/jiayipan/TinyZ…
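(The sample itself was shared as an image. As a stylized illustration of the pattern, not an actual model transcript: given numbers 2, 3, 5, 9 and target 24, the model might propose 9 + 5 + 3 × 2, check it (3 × 2 = 6, so 9 + 5 + 6 = 20 ≠ 24), then revise to (9 − 5) × 3 × 2 and verify 4 × 3 × 2 = 24 ✓.)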
Quick ablations on CountDown:
Base model quality is key:

We ran Qwen-2.5 base models at 0.5B, 1.5B, 3B, and 7B. The 0.5B model guesses a solution and stops. From 1.5B up, the models start learning to search, self-verify, and revise their solutions, which lets them achieve much higher scores.
Either base or instruct model works

- The instruct model learns faster, but converges to about the same performance as the base model
- The instruct model's outputs are more structured and readable

So extra instruction tuning isn't necessary, which supports R1-Zero's design decision.
The specific RL algorithm doesn't matter much

We tried PPO, GRPO, and PRIME. Long CoT emerges with all of them, and they all seem to work well. We haven't had time to tune the hyperparameters, so we don't want to draw quantitative conclusions about which algorithm works better.
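For readers unfamiliar with GRPO: its key difference from PPO is that it replaces the learned value function with a group-relative baseline. A minimal sketch of that advantage computation (illustrative, not TinyZero's code):

import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Sample G completions for the same prompt, then normalize each
    # completion's reward by the group's mean and std. No critic needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. 4 samples for one prompt, only one of which solved the puzzle:
# grpo_advantages(np.array([0., 0., 1., 0.])) -> [-0.58, -0.58, 1.73, -0.58]

With a binary rule-based reward like CountDown's, each of these algorithms sees the same clean signal, which may be part of why the choice matters so little here.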
The model's reasoning behavior is very task-dependent:

- For CountDown, the model learns to do search and self-verification
- For number multiplication, the model instead learns to break the problem down using the distributive rule and solve it step by step.
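For instance, a distributive-rule decomposition of the kind described (my own worked example, not a model transcript):

36 × 24 = 36 × (20 + 4) = 36 × 20 + 36 × 4 = 720 + 144 = 864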
Everything's open at github.com/Jiayi-Pan/Tiny…

And it costs < $30 to train the model! We hope this project helps demystify the emerging RL scaling research and makes it more accessible!
One caveat, of course, is that it's validated only on the CountDown task, not the general reasoning domain. We are currently bounded by compute; please reach out if you want to help!
A wild ride with @JunjieZhang12 @xingyaow_ @lifan__yuan


