The DeepSeek R1 training procedure confused me at first. My brain refused to accept that such a powerful model could come from such a straightforward recipe.
Let me break down this elegant beast for you 🧵
This multi-stage training loop is unusually effective:
Base → RL → Finetune → RL → Finetune → RL
Does scaling stages = better performance? Let’s break down each phase. 🔍
V3 Base → R1 Zero (Stage 0/4)
⚙️GRPO: "PPO without a value function, using Monte Carlo estimates of the advantage" - @natolambert (sketch below)
🔍 Data Strategy: verifiable prompts scored with rule-based rewards (à la IFEval/Tülu 3) + test cases for math/code.
💡Emergent: reasoning/reflection + long CoT.
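For intuition, here's a minimal Python sketch of the group-relative advantage at the heart of GRPO (the function name and the 4-sample example are mine, not the paper's):

```python
import numpy as np

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Z-score each reward against its own group of sampled responses.

    GRPO samples G responses per prompt and uses the group's reward
    statistics in place of PPO's learned value function.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# 4 responses to one prompt, scored by a rule-based reward (1 = correct)
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct samples get positive advantage
```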
R1 Zero → R1 Finetuned Cold Start (Stage 1/4)
🚀Generate 1-10k long-CoT samples: use R1 Zero with few-shot prompting (sketch below)
⚙️Supervised finetuning: train V3 Base on this cold-start data (the stage 0 model is the data generator, not the model being finetuned)
💡Result: Readable thoughts + structured outputs.
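Roughly, the cold-start collection might look like this (a hedged sketch: `r1_zero_generate`, `is_readable`, and the template are placeholders I made up, not the paper's code):

```python
# Few-shot template nudging R1 Zero toward readable, structured CoT
FEW_SHOT = "Question: ...\n<think>long worked example</think>\n<answer>...</answer>\n\n"

def collect_cold_start(prompts, r1_zero_generate, is_readable, target=10_000):
    samples = []
    for p in prompts:
        out = r1_zero_generate(FEW_SHOT + f"Question: {p}\n")
        if is_readable(out):  # keep only clean, well-formatted reasoning
            samples.append({"prompt": p, "response": out})
        if len(samples) >= target:
            break
    return samples  # these (prompt, response) pairs feed standard SFT on V3 Base
```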
R1 Cold Start → R1 Reasoner with RL (Stage 2/4)
🚀Train the Stage 1 model with GRPO: reuse the stage 0 data and add a language-consistency reward (share of target-language words in the CoT; sketch below).
💡Emergent: readable reasoning with reflection + long CoT.
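The language-consistency reward is just a proportion; here's a crude sketch (the ASCII check is my stand-in for a real language identifier, which is an assumption):

```python
def language_consistency_reward(cot: str) -> float:
    """Fraction of CoT words in the target language (English here)."""
    words = cot.split()
    if not words:
        return 0.0
    return sum(w.isascii() for w in words) / len(words)

# Per the paper, this is added to the rule-based accuracy reward:
# total_reward = accuracy_reward + language_consistency_reward(cot)
```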
R1 Reasoner → R1 Finetuned-Reasoner (Stage 3/4)
🚀Generate 600k reasoning samples: sample multiple responses per prompt and keep only the correct ones (using the previous rules; sketch below)
⚙️V3 as a judge: filter out mixed languages, long paragraphs, and code blocks
🌐Generate 200k general-purpose samples via V3
🔥Finetune V3 Base on the combined ~800k samples
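Stage 3 boils down to rejection sampling; a hedged sketch where `generate`, `is_correct`, and `judge_ok` are hypothetical stand-ins:

```python
def rejection_sample(prompts, generate, is_correct, judge_ok, k=8):
    """Keep only verified-correct, clean completions.

    generate(p, k): sample k responses from the stage 2 model
    is_correct:     rule-based check reused from earlier stages
    judge_ok:       V3-as-judge filter (drops mixed languages,
                    long paragraphs, and code blocks in the CoT)
    """
    kept = []
    for p in prompts:
        for resp in generate(p, k):
            if is_correct(p, resp) and judge_ok(resp):
                kept.append({"prompt": p, "response": resp})
    return kept  # ~600k reasoning samples after filtering
```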
R1 Finetuned-Reasoner → R1 Aligned (Stage 4/4)
⚖️Align DeepSeek-R1: Balance reasoning with helpfulness and harmlessness using GRPO
🔍 Data Strategy: rule-based rewards for math/code + a reward model for human preferences (sketch below).
🌟Result: DeepSeek R1
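The final reward mixing, as I read it (a sketch under assumptions; the routing and any weighting here are mine, not confirmed by the paper):

```python
def final_rl_reward(prompt, response, has_verifier, verify, reward_model):
    """Verifiable prompts keep rule-based rewards; open-ended ones
    use a learned preference reward model (helpfulness/harmlessness)."""
    if has_verifier(prompt):  # math/code with checkable answers
        return 1.0 if verify(prompt, response) else 0.0
    return reward_model(prompt, response)  # scalar preference score
```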
This concludes the breakdown! I've tried to highlight the overall approach while packing in as many details as possible.
Super keen to discuss this paper further and how we can build a proper open-source reproduction!