🧠🚀 Excited to introduce Supervised Reinforcement Learning (SRL), a framework that leverages expert trajectories to teach small LMs how to reason through hard problems without losing their minds. 🤯
The struggle is real for small LMs on hard reasoning. 😣
📉 Too weak for RLVR: Can't find correct answers to reinforce.
🤯 Too small for SFT Distillation: Giant model strategies are alien concepts (way too off-policy) for them to grasp.
They need a bridge, not just more data.
🌟 The magic: Teach small models expert actions, not expert thoughts.
SRL breaks problems into steps, letting the model use its own internal monologue to hit expert actions.
Guided learning via similarity reward, not blind imitation! 🎯
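Roughly, the training signal looks like this. A toy sketch in Python, purely for illustration: the metric (difflib's sequence ratio) and the step/action format are my assumptions, not the paper's exact recipe.

```python
import difflib

def srl_step_reward(model_action: str, expert_action: str) -> float:
    """Similarity reward for one step: 0.0 = no overlap with the
    expert's action, 1.0 = exact match. The model is scored on what
    it *does*, not on how it reasoned its way there."""
    return difflib.SequenceMatcher(None, model_action, expert_action).ratio()

# Hypothetical rollout: at each step the model produces its own
# monologue plus an action; only the action is compared to the expert.
expert_actions = ["x = 3", "y = x**2 - 1", "y = 8"]
model_steps = [
    ("solve the first equation for x", "x = 3"),
    ("substitute x into the second", "y = x**2 - 1"),
    ("evaluate", "y = 8"),
]

rewards = [
    srl_step_reward(action, expert)
    for (_monologue, action), expert in zip(model_steps, expert_actions)
]
print(rewards)  # dense per-step signal, even if the final answer is wrong
```

The point of the dense reward: the model gets credit for partially matching each expert action, instead of an all-or-nothing signal on the final answer.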
And the payoff: 💰
1. SRL gives a solid +3.0% boost on competition-level math.
2. Adding RLVR on top (from just 1k examples!) pushes that to +3.7%.
We're getting stronger results from a tiny dataset. That's the power of SRL.
We also saw fascinating emergent behavior in some examples: Interleaved Reasoning. 🔀
Unlike standard models that do all their thinking upfront, SRL models (especially post-RLVR) think dynamically—pausing to reason between actual solution steps.
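Roughly what that contrast looks like (hypothetical traces I made up for illustration; the actual tags and step wording may differ):

```python
# Standard: all reasoning happens before any solution steps.
standard = (
    "<think> plan steps 1-3 entirely up front </think> "
    "Step 1 ... Step 2 ... Step 3 ... Final answer."
)

# Interleaved: reasoning pauses appear between solution steps.
interleaved = (
    "Step 1 ... "
    "<think> does step 1 constrain step 2? </think> "
    "Step 2 ... "
    "<think> sanity-check before concluding </think> "
    "Step 3 ... Final answer."
)
```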
Huge thanks to my brilliant intern @Yihe__Deng and all other fabulous co-authors: @jun_yannn, @ZifengWang315, @HanRujun, Gufeng Zhang, @anmourchen, Wei Wang, @tomaspfister, and @chl260.