Casper Hansen · Jan 20 · 8 tweets
The DeepSeek R1 training procedure confused me at first. My brain refused to accept that such a powerful model could come from such a straightforward recipe.

Let me break down this elegant beast for you 🧵
This multi-stage training loop is unusually effective:
Base → RL → Finetune → RL → Finetune → RL

Does scaling stages = better performance? Let’s break down each phase. 🔍
V3 Base → R1 Zero (Stage 0/4)

⚙️ GRPO: "PPO without a value function, using Monte Carlo estimates of the advantage" - @natolambert (sketch below)
🔍 Data strategy: verifiable prompts with rule-based rewards (IFEval/Tülu 3 style) + test cases for math/code.
💡 Emergent: reasoning/reflection + long CoT.
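To make that quote concrete, here's a minimal sketch of the group-relative advantage GRPO uses: sample a group of responses per prompt and normalize each reward against the group's mean and std. This is my paraphrase of the formula, not DeepSeek's code.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantage: each response is scored against the
    mean/std of its own sampled group -- a Monte Carlo baseline that
    replaces PPO's learned value function."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy example: 4 completions for one prompt with rule-based 0/1 rewards
# (1 = final answer verified correct by a math checker or test case)
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(grpo_advantages(rewards))  # correct samples get positive advantage
```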
R1 Zero → R1 Finetuned Cold Start (Stage 1/4)

🚀 Generate ~1-10k long-CoT samples: prompt R1 Zero with few-shot examples
⚙️ Supervised finetuning: train V3 Base on these cold-start samples from the stage-0 model
💡 Result: readable thoughts + structured outputs (formatting sketch below).
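For intuition on what a cold-start sample might look like once formatted for SFT, here's an illustrative formatter. The `<think>` tags match the released R1 chat template; the exact special tokens and summary layout in the paper's pipeline are assumptions on my part.

```python
def format_cold_start_sample(question: str, cot: str, answer: str) -> str:
    """Illustrative cold-start SFT sample: a long, readable chain of
    thought wrapped in <think> tags, followed by a clean final answer."""
    return f"{question}\n<think>\n{cot}\n</think>\n{answer}"

print(format_cold_start_sample(
    "What is 12 * 13?",
    "12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156.",
    "156",
))
```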
R1 Cold Start → R1 Reasoner with RL (Stage 2/4)

🚀 Train the stage-1 model with GRPO: reuse the stage-0 data and add a language-consistency reward (% of target-language words in the CoT; reward sketch below)
💡 Emergent: readable reasoning with reflection + long CoT.
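The paper defines the language-consistency reward as the proportion of target-language words in the CoT, but not the exact tokenization, so this word-level heuristic for English is my assumption:

```python
import re

def language_consistency_reward(cot: str) -> float:
    """Crude proxy for the language-consistency reward: the fraction of
    whitespace-delimited CoT tokens that look like English words,
    numbers, or punctuation."""
    words = cot.split()
    if not words:
        return 0.0
    english = [w for w in words if re.fullmatch(r"[A-Za-z0-9.,;:'\"!?()*=+\-]+", w)]
    return len(english) / len(words)

print(language_consistency_reward("First, compute 12 * 13 step by step."))  # 1.0
```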
R1 Reasoning → R1 Finetuned-Reasoner (Stage 3/4)

🚀 Generate 600k reasoning samples: sample multiple responses per prompt and keep only the verified-correct ones (rejection sampling with the previous rules; pipeline sketch below)
⚙️ V3 as a judge: filter out mixed languages, long rambling paragraphs, and code blocks
🌐 Generate 200k general-purpose samples via V3
🔥 Finetune V3 Base on the combined ~800k samples
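Here's the stage-3 curation loop as I read it, in sketch form (all callables and names here are mine, not DeepSeek's):

```python
def curate_reasoning_data(prompts, generate, verify, judge, n: int = 16):
    """Rejection-sampling sketch: draw n responses per prompt, keep only
    rule-verified correct ones, then let a V3-style judge drop
    mixed-language / rambling / code-heavy samples."""
    kept = []
    for prompt in prompts:
        for resp in (generate(prompt) for _ in range(n)):
            if verify(prompt, resp) and judge(prompt, resp):
                kept.append((prompt, resp))
    return kept

# Toy usage with stand-in callables
data = curate_reasoning_data(
    ["2+2?"],
    generate=lambda p: "<think>2+2=4</think>4",
    verify=lambda p, r: r.strip().endswith("4"),  # rule-based correctness
    judge=lambda p, r: "\n\n" not in r,           # drops rambling multi-paragraph samples
)
print(len(data))  # 16: every sample passed both filters
```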
R1 Instruct-Reasoner → R1 Aligned (Stage 4/4)

⚖️ Align DeepSeek-R1: balance reasoning with helpfulness and harmlessness using GRPO
🔍 Data strategy: rule-based rewards for math/code + a reward model for human preferences (routing sketch below)
🌟 Result: DeepSeek R1
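The final reward plausibly routes by domain; here's a sketch under that assumption (the helper names are hypothetical):

```python
def combined_reward(domain: str, prompt: str, response: str,
                    verify, preference_model) -> float:
    """Stage-4 sketch: verifiable domains keep rule-based rewards, while
    open-ended prompts get a scalar score from a preference reward model
    trained on human helpfulness/harmlessness judgments."""
    if domain in {"math", "code"}:
        return 1.0 if verify(prompt, response) else 0.0
    return preference_model(prompt, response)

# Toy usage
print(combined_reward("math", "2+2?", "4",
                      verify=lambda p, r: r == "4",
                      preference_model=lambda p, r: 0.0))  # 1.0
```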
This concludes the breakdown! I've tried to highlight the overall approach while including as many details as possible for you to dig into at your leisure.

Super keen to discuss this paper further and how we can create a proper open-source version!

