🧠🚀 Excited to introduce Supervised Reinforcement Learning—a framework that leverages expert trajectories to teach small LMs how to reason through hard problems without losing their minds. 🤯
It outperforms both SFT and RLVR.
Read more: 
#llms #RL #reasoning huggingface.co/papers/2510.25…
The struggle is real for small LMs on hard reasoning. 😣
📉 Too weak for RLVR: Can't find correct answers to reinforce.
🤯 Too small for SFT distillation: giant-model reasoning traces are way too off-policy for them to imitate.
They need a bridge, not just more data. 
🌟 The magic: Teach small models expert actions, not expert thoughts.
SRL breaks problems into steps, letting the model use its own internal monologue to hit expert actions.
Guided learning via similarity reward, not blind imitation! 🎯 
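The core idea above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the actual similarity metric, step parsing, and RL plumbing from SRL are not reproduced here. `action_similarity_reward` is a hypothetical name, and token-overlap similarity via `difflib` is an assumption standing in for whatever reward the paper uses.

```python
# Minimal sketch of a step-wise "similarity to expert action" reward.
# Assumption: each solution step is a string; we reward the model's
# emitted ACTION by its token-level similarity to the expert's action,
# while the model's own internal monologue is left unconstrained.
from difflib import SequenceMatcher


def action_similarity_reward(model_action: str, expert_action: str) -> float:
    """Return a reward in [0, 1] comparing the two actions token-by-token."""
    return SequenceMatcher(
        None, model_action.split(), expert_action.split()
    ).ratio()


def trajectory_reward(model_steps: list[str], expert_steps: list[str]) -> float:
    """Average per-step reward over the aligned portion of a trajectory."""
    if not expert_steps:
        return 0.0
    per_step = [
        action_similarity_reward(m, e) for m, e in zip(model_steps, expert_steps)
    ]
    return sum(per_step) / len(expert_steps)
```

So an exact step match scores 1.0, a partially overlapping step scores in between, and the model is never forced to copy the expert's chain of thought, only to land on the expert's action.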
And the payoff: 💰
1.  SRL gives a solid +3.0% boost on competition-level math.
2.  Adding RLVR on top (from just 1k data!) pushes that to +3.7%.
We're getting stronger results from a tiny dataset. That's the power of SRL.
We also saw fascinating emergent behavior in some examples: Interleaved Reasoning. 🔀
Unlike standard models that do all their thinking upfront, SRL models (especially post-RLVR) think dynamically—pausing to reason between actual solution steps. 
Huge thanks to my brilliant intern:
@Yihe__Deng
and all other fabulous co-authors:
@jun_yannn,
@ZifengWang315,
@HanRujun, Gufeng Zhang,
@anmourchen, Wei Wang,
@tomaspfister, and
@chl260