Quick ablations on Countdown:
Base model quality is key:
We run Qwen-2.5-Base at 0.5B, 1.5B, 3B, and 7B. The 0.5B model guesses a solution and stops. From 1.5B on, the models start learning to search, self-verify, and revise their solutions, enabling them to achieve much higher scores.
Either a base or an instruct model works
- The instruct model learns faster, but converges to about the same performance as the base model
- The instruct model's outputs are more structured and readable
So extra instruction tuning isn't necessary, which supports R1-Zero's design decision
The specific RL algorithm doesn't matter much
We tried PPO, GRPO, and PRIME. Long CoT emerges with all of them, and they all seem to work well. We haven't had time to tune the hyperparameters, so we don't want to make quantitative conclusions about which algorithm works better.
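For reference, the group-relative advantage that sets GRPO apart from PPO-style value baselines fits in a few lines; this is a generic illustration, not TinyZero's actual training code.

```python
# Group-relative advantage as used in GRPO (illustration only, not TinyZero's code).
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """`rewards`: scores for a group of responses sampled from the same prompt."""
    # Each response is scored against its own group's mean and std,
    # so no learned value function is needed (unlike PPO).
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled Countdown solutions scored 0/1 by the rule-based reward.
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # ~[ 1, -1, -1, 1]
```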
The model's reasoning behavior is very task-dependent:
- For Countdown, the model learns to search and self-verify (a toy checker of this kind is sketched after this list)
- For number multiplication, the model instead learns to break the problem down with the distributive rule and solve it step by step.
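To make the Countdown behavior concrete, here is a toy checker of the kind the rule-based reward uses, and that the model effectively learns to run on its own candidate answers. The expression format and 0/1 scoring below are our assumptions, not TinyZero's exact reward code.

```python
# Toy Countdown checker: verifies that an expression uses exactly the given
# numbers and evaluates to the target (illustration only).
import ast
import operator
from collections import Counter

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    # Recursively evaluate a restricted arithmetic AST (+, -, *, / and numbers only).
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("unsupported expression")

def countdown_reward(expr: str, numbers: list, target: int) -> float:
    """1.0 if `expr` uses exactly the given numbers and hits the target, else 0.0."""
    try:
        tree = ast.parse(expr, mode="eval")
        used = [n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)]
        if Counter(used) != Counter(numbers):
            return 0.0
        return 1.0 if abs(_eval(tree.body) - target) < 1e-6 else 0.0
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0

print(countdown_reward("(55 - 3) * 2 - 4", [55, 3, 2, 4], 100))  # -> 1.0
```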
Everything's open at github.com/Jiayi-Pan/Tiny…
And it costs < $30 to train the model! We hope this project helps to demystify the emerging RL scaling research and make it more accessible!
One caveat, of course, is that it's validated only on the Countdown task, not in the general reasoning domain. We are currently bounded by compute, so please reach out if you wanna help!
A wild ride with @JunjieZhang12 @xingyaow_ @lifan__yuan
We explore a new dimension in scaling reasoning models in Adaptive Parallel Reasoning
APR lets LMs learn to orchestrate both serial & parallel compute E2E via supervised training + RL — w/ better efficiency and scalability than long CoT on Countdown
Reasoning models like DeepSeek R1 scale test-time compute solely by generating longer chain-of-thought (CoT)
But this single-threaded serial decoding is slow, inefficient, and strains the context window — bottlenecks that only grow as models scale
Parallelism to the rescue!
We envision reasoning models that scale not just by longer CoT, but also by increasing the # of parallel decoding threads
APR equips LMs with tools to manage and coordinate across decoding threads. RL then optimizes how the model orchestrates these threads for best performance E2E
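Very roughly, the parent/child structure looks like the sketch below; the spawn/join framing follows the thread, but the function names and threading backend here are illustrative assumptions rather than the actual APR implementation.

```python
# A sketch of APR-style orchestration (our illustration, not the actual implementation).
# `lm_generate` stands in for whatever decoding backend is used.
from concurrent.futures import ThreadPoolExecutor

def lm_generate(prompt: str) -> str:
    # Placeholder LM call; swap in a real decoding endpoint.
    return f"<output for: {prompt[:40]}...>"

def solve_with_apr(task: str, subqueries: list[str]) -> str:
    # Parent thread decodes its own (short) reasoning trace.
    parent_plan = lm_generate(f"Task: {task}\nPlan which branches to explore.")

    # spawn(): child threads decode in parallel, each with its own compact context.
    with ThreadPoolExecutor(max_workers=len(subqueries)) as pool:
        child_results = list(pool.map(lm_generate, subqueries))

    # join(): only the children's results flow back into the parent's context,
    # so the parent's context window stays short while total compute grows.
    summary = "\n".join(child_results)
    return lm_generate(
        f"Task: {task}\nPlan: {parent_plan}\nChild results:\n{summary}\nFinal answer:"
    )

print(solve_with_apr("Countdown: reach 24 with 4, 7, 8, 8", ["branch A", "branch B"]))
```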
Progress in SWE agents has been limited by the lack of training environments with real-world coverage and execution feedback.
We create SWE-Gym, the first env for training SWE agents, with 2.4K real tasks from 11 Python repos & a Lite split of 234 instances mimicking SWE-Bench Lite.
SWE-Gym trains LMs as agents.
When fine-tuned on fewer than 500 agent-environment interaction trajectories sampled from GPT-4o and Claude, we achieve +14% absolute gains on SWE-Bench Verified with a 32B LM-powered OpenHands agent.
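In spirit, the fine-tuning data collection is a rejection-sampling loop over the environment; the interfaces below (`Trajectory`, `rollout`) are hypothetical stand-ins, not SWE-Gym's real API.

```python
# A rejection-sampling sketch for building SFT data from agent rollouts
# (the dataclass and callables here are hypothetical).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    messages: List[dict]   # the agent-environment interaction turns
    resolved: bool         # did the final patch pass the task's tests (execution feedback)?

def collect_sft_data(
    task_ids: List[str],
    rollout: Callable[[str], Trajectory],   # e.g. GPT-4o or Claude driving an OpenHands agent
) -> List[List[dict]]:
    data = []
    for task_id in task_ids:
        traj = rollout(task_id)
        if traj.resolved:                   # keep only rollouts verified by the environment
            data.append(traj.messages)      # successful trajectories become SFT examples
    return data
```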
New paper from @Berkeley_AI on Autonomous Evaluation and Refinement of Digital Agents!
We show that VLM/LLM-based evaluators can significantly improve the performance of agents for web browsing and device control, advancing the state of the art by 29% to 75%.
We begin by developing two types of evaluators: one that directly queries GPT-4V and another that employs an open-weight solution. Our best model shows 82% / 93% agreement with oracle evaluations in the web browsing and Android device control settings, respectively.
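For a rough sense of the first evaluator type, a directly prompted GPT-4V judge might look like the sketch below; the prompt wording, model name, and YES/NO parsing are our assumptions, not the paper's exact setup.

```python
# Rough sketch of a directly prompted GPT-4V judge over a final screenshot
# plus the agent's action log (assumptions noted above).
import base64
from openai import OpenAI

client = OpenAI()

def judge_success(instruction: str, final_screenshot: str, action_log: str) -> bool:
    with open(final_screenshot, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"User instruction: {instruction}\n"
                    f"Agent actions:\n{action_log}\n"
                    "Based on the final screenshot, did the agent complete the task? "
                    "Answer YES or NO."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```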
Next, we show how they can be used to improve agents, either through inference-time guidance or fine-tuning.
We start with WebArena, a popular web agent benchmark. We experiment with integrating the SOTA agent with the Reflexion algorithm, using our evaluators as the reward function.
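Schematically, plugging an evaluator in as the reward of a Reflexion-style retry loop looks like this; `agent`, `evaluate`, and `reflect` are hypothetical callables, not the paper's code.

```python
# Sketch of evaluator-as-reward inside a Reflexion-style loop (illustrative only).
from typing import Callable, List

def reflexion_with_evaluator(
    task: str,
    agent: Callable[[str, List[str]], str],    # produces a trajectory given task + reflections
    evaluate: Callable[[str, str], float],     # VLM/LLM evaluator scoring success in [0, 1]
    reflect: Callable[[str, str], str],        # turns a failed trajectory into verbal feedback
    max_trials: int = 3,
) -> str:
    reflections: List[str] = []
    trajectory = ""
    for _ in range(max_trials):
        trajectory = agent(task, reflections)
        if evaluate(task, trajectory) >= 0.5:  # the evaluator acts as the reward signal
            break                              # stop once it judges the episode successful
        reflections.append(reflect(task, trajectory))
    return trajectory
```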