Siyan Zhao
CS PhD student @UCLA | Bachelors @UofT EngSci | LLMs, generative models, decision-making
Jan 22
Introducing 💡On-Policy Self-Distillation💡, a simple method that enables an LLM to teach itself with dense per-token feedback on its own on-policy generations, achieving 4-8x greater token efficiency than GRPO and outperforming both GRPO and SFT/off-policy distillation.

Key insight: like a student reviewing solutions, rationalizing them, and correcting prior mistakes, an LLM can be conditioned on privileged information (e.g., the correct solution or a reasoning trace) and supervise its weaker self, the version without such access, by having that weaker self match the distribution induced by the privileged information.

🌐Blog: siyan-zhao.github.io/blog/2026/opsd/

🧵👇

2/n Compared to SFT/off-policy distillation, GRPO, and on-policy distillation, On-Policy Self-Distillation (OPSD) provides a training signal that is on-policy, dense, and teacher-free, without the cost of extensive group sampling.
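
A minimal sketch of the per-token self-distillation idea described in this thread, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`. The function name `opsd_step`, the tensor layout, and the choice of forward KL as the divergence are illustrative assumptions, not the released OPSD implementation:

```python
import torch
import torch.nn.functional as F

def opsd_step(model, prompt_ids, privileged_ids, gen_ids):
    """One OPSD-style update on a single on-policy generation.

    prompt_ids:     (P,) long tensor, the bare prompt
    privileged_ids: (K,) long tensor, privileged info (e.g. the correct solution)
    gen_ids:        (G,) long tensor, tokens the current policy sampled from the bare prompt
    """
    G = gen_ids.numel()

    # Teacher pass: the SAME model, conditioned on the privileged info.
    # Its per-token distribution over the generation is the dense target;
    # no gradient flows through it.
    teacher_input = torch.cat([prompt_ids, privileged_ids, gen_ids]).unsqueeze(0)
    with torch.no_grad():
        teacher_logits = model(teacher_input).logits[0]
    t0 = prompt_ids.numel() + privileged_ids.numel() - 1  # logit at i predicts token i+1
    teacher_logp = F.log_softmax(teacher_logits[t0 : t0 + G], dim=-1)

    # Student pass: the same model without access to the privileged info.
    student_input = torch.cat([prompt_ids, gen_ids]).unsqueeze(0)
    student_logits = model(student_input).logits[0]
    s0 = prompt_ids.numel() - 1
    student_logp = F.log_softmax(student_logits[s0 : s0 + G], dim=-1)

    # Dense per-token distillation: pull the no-privilege policy toward the
    # privileged-info-conditioned distribution at every generated position.
    loss = F.kl_div(student_logp, teacher_logp, log_target=True, reduction="batchmean")
    loss.backward()
    return loss.item()
```

Because teacher and student are the same network, the only extra cost over plain on-policy training is one additional forward pass with the privileged context, and every generated token receives a full target distribution rather than a single scalar reward, which is where the claimed density and teacher-freeness of the signal come from.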
Apr 11, 2025
Introducing d1🚀 — the first framework that applies reinforcement learning to improve reasoning in masked diffusion LLMs (dLLMs).

Combining masked SFT with a novel policy-gradient algorithm, d1 significantly boosts the performance of pretrained dLLMs such as LLaDA.

2/n d1 is a two-stage framework for enhancing reasoning in masked dLLMs. First, we use masked SFT to learn from reasoning traces in s1k, where models develop self-correction and backtracking behavior 🔍
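
A minimal sketch of that first masked-SFT stage, assuming a masked dLLM whose forward pass returns per-position `.logits` over the whole sequence. The function name `masked_sft_loss`, the `MASK_ID` placeholder, and the uniform mask-ratio schedule with 1/t reweighting are illustrative assumptions rather than the exact d1/LLaDA recipe:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder; use the tokenizer's actual [MASK] token id

def masked_sft_loss(model, prompt_ids, response_ids):
    """Masked SFT on one (prompt, reasoning-trace) pair, e.g. from s1k.

    Only response tokens are corrupted; the prompt stays fully visible. The
    model is trained to recover the masked tokens, with a 1/t reweighting of
    the cross-entropy, as is common for masked diffusion objectives.
    """
    # Sample a corruption level t ~ U(0, 1] and mask each response token w.p. t.
    t = torch.rand(()).clamp(min=1e-3)
    mask = torch.rand(response_ids.shape) < t
    if not mask.any():  # make sure at least one token is masked
        mask[torch.randint(response_ids.numel(), (1,))] = True

    corrupted = response_ids.clone()
    corrupted[mask] = MASK_ID
    input_ids = torch.cat([prompt_ids, corrupted]).unsqueeze(0)

    # A masked dLLM predicts a token distribution at every position in parallel.
    logits = model(input_ids).logits[0, prompt_ids.numel():]  # (len(response), vocab)
    ce = F.cross_entropy(logits[mask], response_ids[mask], reduction="mean")
    return ce / t
```

Keeping the prompt unmasked means the objective only asks the model to reconstruct the reasoning trace, which is the part of the data where the self-correction and backtracking behavior appears.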