Latest Twitter Threads by @xiangyue96 on Thread Reader App

Dec 9, 2025 • 8 tweets • 4 min read

There are competing views on whether RL can genuinely improve a base model's performance (e.g., pass@128). The answer is both yes and no, largely depending on the interplay between pre-training, mid-training, and RL. We trained a few hundred GPT-2-scale LMs from scratch on synthetic GSM-like reasoning data. Here is what we found: 🧵

Takeaway 1: RL helps extrapolation only when there's headroom left by pre-training and with the right RL data at the edge of the model's competence.

1) When RL is applied to very in-domain tasks, it never improves pass@k beyond the base model.
2) When RL targets the edge of the model's competence, there are substantial extrapolations, even to unseen OOD-hard tasks.
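
For readers unfamiliar with pass@k: the thread does not say how it was computed, so the sketch below assumes the commonly used unbiased estimator, which draws n samples per problem, counts the c correct ones, and estimates the chance that at least one of k samples is correct. The function name and the sample counts in the usage lines are illustrative, not taken from the thread.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Uses 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 4 correct answers out of 256 samples.
low_k = pass_at_k(n=256, c=4, k=1)     # equals c / n
high_k = pass_at_k(n=256, c=4, k=128)  # much closer to 1
```

A large k such as 128 rewards a model for solving a problem at least occasionally, which is why it is a natural probe for whether RL adds capability or only sharpens what the base model could already sample.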

Jul 2, 2025 • 7 tweets • 4 min read

People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true?

In our study (arxiv.org/pdf/2507.00432), we evaluated over 20 open-weight reasoning models and found that:
➡️Only models trained with RL exhibit broad transfer of math reasoning skills to other tasks.
➡️Models trained with SFT show limited or no transfer—especially to non-reasoning domains.

To quantify this, we introduce the Transferability Index (TI), which measures how much of a model's gain in math transfers to other domains. A positive score indicates effective transfer; a negative one suggests loss of general capability.
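
The thread does not spell out the TI formula, so the following is only a rough sketch of one plausible form: the average relative gain on a non-math benchmark group divided by the average relative gain on math, scaled by 100. The function names, the placeholder benchmark keys, and every score below are made up for illustration; consult the paper for the actual definition.

```python
from statistics import mean


def relative_gain(base: dict[str, float], tuned: dict[str, float]) -> float:
    """Mean relative change (tuned - base) / base over a benchmark group."""
    return mean((tuned[name] - base[name]) / base[name] for name in base)


def transferability_index(
    base_math: dict[str, float],
    tuned_math: dict[str, float],
    base_other: dict[str, float],
    tuned_other: dict[str, float],
) -> float:
    """Assumed form of TI: 100 * (gain on other group) / (gain on math)."""
    math_gain = relative_gain(base_math, tuned_math)
    if math_gain == 0:
        raise ValueError("TI is undefined when the math gain is zero")
    return 100.0 * relative_gain(base_other, tuned_other) / math_gain


# Made-up scores, not results from the study.
ti_positive = transferability_index(
    {"math_bench": 50.0}, {"math_bench": 75.0},
    {"other_bench": 40.0}, {"other_bench": 50.0},
)
ti_negative = transferability_index(
    {"math_bench": 50.0}, {"math_bench": 75.0},
    {"other_bench": 40.0}, {"other_bench": 30.0},
)
```

Under this assumed form, the sign matches the description above only when the math gain itself is positive: a model that improves on math but regresses elsewhere gets a negative score.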

We evaluate the models on three benchmark categories:
- Math reasoning: MATH-500, AIME24/25, Olympiad
- Other reasoning: GPQA-D (Science), LiveCodeBench2 (Code), ACPBench (Agent Planning), HeadQA (Medical)
- Non-reasoning: CoQA (Conversational QA), IFEval (Instruction Following), HalluEval (Hallucination), MC-TACO (Commonsense)

Our findings challenge the blind pursuit of leaderboard performance in math reasoning via SFT. Simply creating more math-like SFT data may inadvertently harm a model’s broader generalization. Instead, RL appears to be key for truly transferable reasoning development.

We further conducted a controlled study using only math queries during training. Surprisingly, even with this single math domain, RL training still led to strong cross-domain generalization, while SFT did not.

Feb 6, 2025 • 20 tweets • 8 min read

Demystifying Long CoT Reasoning in LLMs

Reasoning models like R1 / O1 / O3 have gained massive attention, but their training dynamics remain a mystery. We're taking a first deep dive into understanding long CoT reasoning in LLMs!

11 Major Takeaways👇 (long thread) arxiv.org/pdf/2502.03373

Takeaway 1:
(a) SFT with long CoT can scale up to a higher performance upper limit than short CoT.

(b) SFT with long CoTs makes further RL improvement easier, while short CoTs do not.
