🧑‍🍳 PhD Student @Berkeley_AI | Scaling LM Agents | Views Are My Own
Apr 23 • 10 tweets • 4 min read
We explore a new dimension for scaling reasoning models: Adaptive Parallel Reasoning (APR)
APR lets LMs learn to orchestrate both serial & parallel compute E2E via supervised training + RL, with better efficiency and scalability than long CoT on Countdown
🧵 arxiv.org/abs/2504.15466
Reasoning models like DeepSeek R1 scale test-time compute solely by generating longer chain-of-thought (CoT)
But this single-threaded serial decoding is slow, inefficient, and strains the context window — bottlenecks that only grow as models scale
Parallelism to the rescue!
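As a minimal sketch of what spawn/join-style orchestration could look like (names like `generate` and `solve` are illustrative assumptions here, not APR's actual interface):

```python
# Minimal sketch of parallel reasoning via spawn/join-style threads.
# All names (generate, solve) are illustrative, not APR's actual API.
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    """Placeholder for one serial LM decoding call."""
    raise NotImplementedError

def solve(problem: str) -> str:
    # Parent thread decodes until it decides to fork independent subproblems.
    plan = generate(problem + "\nList independent subgoals, one per line:")
    subgoals = [line for line in plan.splitlines() if line.strip()]

    # "Spawn": child threads decode in parallel, each in its own context window,
    # so no single thread has to carry the whole search as one long CoT.
    with ThreadPoolExecutor() as pool:
        child_traces = list(pool.map(generate, subgoals))

    # "Join": child results are condensed back into the parent's context.
    summary = "\n".join(child_traces)
    return generate(problem + "\nSubgoal results:\n" + summary + "\nFinal answer:")
```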
Jan 24 • 10 tweets • 4 min read
We reproduced DeepSeek R1-Zero in the Countdown game, and it just works
Through RL, the 3B base LM develops self-verification and search abilities all on its own
You can experience the aha moment yourself for < $30
Code:
We follow the DeepSeek R1-Zero algorithm: given a base LM, prompts, and a ground-truth reward, we run RL.
We apply it to Countdown: a game where players combine numbers with basic arithmetic to reach a target number.
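For concreteness, here's a rough sketch of the kind of rule-based reward this setup needs, assuming the model emits a bare arithmetic expression (the exact answer format and checks in the released code may differ):

```python
# Rough sketch of a rule-based Countdown reward. Assumes the model's answer
# is a bare arithmetic expression like "(13 - 9) * (10 - 4)".
import re

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    if not re.fullmatch(r"[\d\s+\-*/()]+", expr):    # arithmetic tokens only
        return 0.0
    used = [int(tok) for tok in re.findall(r"\d+", expr)]
    if sorted(used) != sorted(numbers):              # use each given number exactly once
        return 0.0
    try:
        value = eval(expr)  # safe here: the expression is whitelisted above
    except (SyntaxError, ZeroDivisionError):
        return 0.0
    return 1.0 if value == target else 0.0
```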
Dec 23, 2024 • 9 tweets • 4 min read
Introducing SWE-Gym: An Open Environment for Training Software Engineering Agents & Verifiers
Using SWE-Gym, our agents + verifiers reach a new open SOTA: 32%/26% on SWE-Bench Verified/Lite,
showing strong scaling with more train / test compute
[🧵] github.com/SWE-Gym/SWE-Gym
Progress in SWE agents has been limited by the lack of training environments with real-world coverage and execution feedback.
We create SWE-Gym, the first env for training SWE agents, with 2.4K real tasks from 11 Python repos & a Lite split of 234 instances mimicking SWE-Bench Lite.
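Schematically, a training rollout looks something like the loop below (illustrative names, not SWE-Gym's actual API; see the repo for the real interface):

```python
# Schematic agent rollout against a SWE-Gym-style task. All names here are
# illustrative; see github.com/SWE-Gym/SWE-Gym for the actual interface.
def rollout(agent, env, task):
    obs = env.reset(task)            # check out the repo at the task's commit
    trajectory = []
    while not env.done:
        action = agent.act(obs)      # e.g. edit a file or run a shell command
        obs = env.step(action)       # real execution feedback from the repo
        trajectory.append((action, obs))
    reward = env.run_tests()         # held-out tests decide pass/fail
    # Trajectories train the agent; (trajectory, reward) pairs train the verifier.
    return trajectory, reward
```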
Apr 10, 2024 • 7 tweets • 3 min read
New paper from @Berkeley_AI on Autonomous Evaluation and Refinement of Digital Agents!
We show that VLM/LLM-based evaluators can significantly improve the performance of agents for web browsing and device control, advancing the state of the art by 29% to 75%.
[🧵] arxiv.org/abs/2404.06474
We begin by developing two types of evaluators: one that directly queries GPT-4V and another that employs an open-weight solution. Our best model shows 82% / 93% agreement with oracle evaluations on web browsing and Android device control settings, respectively.
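As a sketch, the GPT-4V-based evaluator boils down to a judge call like the one below (the prompt wording and the `query_vlm` helper are assumptions, not the paper's released prompts):

```python
# Sketch of a VLM-as-judge trajectory evaluator. The prompt wording and the
# query_vlm helper are assumptions, not the paper's released implementation.
def query_vlm(prompt: str, images: list) -> str:
    """Placeholder for a vision-language model call (e.g. GPT-4V via an API)."""
    raise NotImplementedError

def evaluate_trajectory(instruction: str, screenshots: list, actions: list[str]) -> bool:
    prompt = (
        f"Task: {instruction}\n"
        f"Actions taken: {actions}\n"
        "Based on the screenshots, did the agent complete the task? "
        "Answer 'success' or 'failure'."
    )
    verdict = query_vlm(prompt, images=screenshots)
    return verdict.strip().lower().startswith("success")
```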