Siyan Zhao
CS PhD student @UCLA | Bachelors @UofT EngSci | LLMs, generative models, decision-making
Jan 22
Introducing 💡On-Policy Self-Distillation💡, a simple method that enables an LLM to teach itself with dense per-token feedback on its own on-policy generations, achieving 4-8x greater token efficiency than GRPO and outperforming both GRPO and SFT/off-policy distillation.

Key insight: like a student reviewing solutions, rationalizing them, and correcting prior mistakes, an LLM can be conditioned on privileged information (e.g., the correct solution or a reasoning trace) and supervise its weaker self, the version without such access, by having that weaker self match the distribution induced by the privileged information.

🌐Blog: siyan-zhao.github.io/blog/2026/opsd/

🧵👇

2/n Compared to SFT/off-policy distillation, GRPO, and on-policy distillation, On-Policy Self-Distillation (OPSD) provides a training signal that is on-policy, dense, and teacher-free, without the cost of extensive group sampling.
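
A minimal sketch of the per-token self-distillation idea described in this thread, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`. The function name `opsd_step`, the tensor layout, and the choice of forward KL as the divergence are illustrative assumptions, not the released OPSD implementation:

```python
import torch
import torch.nn.functional as F

def opsd_step(model, prompt_ids, privileged_ids, gen_ids):
    """One OPSD-style update on a single on-policy generation.

    prompt_ids:     (P,) long tensor, the bare prompt
    privileged_ids: (K,) long tensor, privileged info (e.g. the correct solution)
    gen_ids:        (G,) long tensor, tokens the current policy sampled from the bare prompt
    """
    G = gen_ids.numel()

    # Teacher pass: the SAME model, conditioned on the privileged info.
    # Its per-token distribution over the generation is the dense target;
    # no gradient flows through it.
    teacher_input = torch.cat([prompt_ids, privileged_ids, gen_ids]).unsqueeze(0)
    with torch.no_grad():
        teacher_logits = model(teacher_input).logits[0]
    t0 = prompt_ids.numel() + privileged_ids.numel() - 1  # logit at i predicts token i+1
    teacher_logp = F.log_softmax(teacher_logits[t0 : t0 + G], dim=-1)

    # Student pass: the same model without access to the privileged info.
    student_input = torch.cat([prompt_ids, gen_ids]).unsqueeze(0)
    student_logits = model(student_input).logits[0]
    s0 = prompt_ids.numel() - 1
    student_logp = F.log_softmax(student_logits[s0 : s0 + G], dim=-1)

    # Dense per-token distillation: pull the no-privilege policy toward the
    # privileged-info-conditioned distribution at every generated position.
    loss = F.kl_div(student_logp, teacher_logp, log_target=True, reduction="batchmean")
    loss.backward()
    return loss.item()
```

Because teacher and student are the same network, the only extra cost over plain on-policy training is one additional forward pass with the privileged context, and every generated token receives a full target distribution rather than a single scalar reward, which is where the claimed density and teacher-freeness of the signal come from.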
Apr 11, 2025
Introducing d1🚀 — the first framework that applies reinforcement learning to improve reasoning in masked diffusion LLMs (dLLMs).

Combining masked SFT with a novel policy-gradient algorithm, d1 significantly boosts the performance of pretrained dLLMs such as LLaDA.

2/n d1 is a two-stage framework for enhancing reasoning in masked dLLMs. First, we use masked SFT to learn from reasoning traces in s1k, where models develop self-correction and backtracking behavior 🔍
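
A minimal sketch of that first masked-SFT stage, assuming a masked dLLM whose forward pass returns per-position `.logits` over the whole sequence. The function name `masked_sft_loss`, the `MASK_ID` placeholder, and the uniform mask-ratio schedule with 1/t reweighting are illustrative assumptions rather than the exact d1/LLaDA recipe:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder; use the tokenizer's actual [MASK] token id

def masked_sft_loss(model, prompt_ids, response_ids):
    """Masked SFT on one (prompt, reasoning-trace) pair, e.g. from s1k.

    Only response tokens are corrupted; the prompt stays fully visible. The
    model is trained to recover the masked tokens, with a 1/t reweighting of
    the cross-entropy, as is common for masked diffusion objectives.
    """
    # Sample a corruption level t ~ U(0, 1] and mask each response token w.p. t.
    t = torch.rand(()).clamp(min=1e-3)
    mask = torch.rand(response_ids.shape) < t
    if not mask.any():  # make sure at least one token is masked
        mask[torch.randint(response_ids.numel(), (1,))] = True

    corrupted = response_ids.clone()
    corrupted[mask] = MASK_ID
    input_ids = torch.cat([prompt_ids, corrupted]).unsqueeze(0)

    # A masked dLLM predicts a token distribution at every position in parallel.
    logits = model(input_ids).logits[0, prompt_ids.numel():]  # (len(response), vocab)
    ce = F.cross_entropy(logits[mask], response_ids[mask], reduction="mean")
    return ce / t
```

Keeping the prompt unmasked means the objective only asks the model to reconstruct the reasoning trace, which is the part of the data where the self-correction and backtracking behavior appears.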