wh (@nrehiew_) · Jan 21
How to train a State-of-the-art reasoner.

Let's talk about the DeepSeek-R1 paper and how DeepSeek trained a model that is at the frontier Sonnet/o1 level.
Quick overview on what has been done to train an o1-like model:

- Process and Outcome Reward Models (PRMs/ORMs). This approach does RL and trains these two models to give a reward signal at the step or answer level. Given that Qwen trained a SOTA PRM, we can assume they use this approach.
- LATRO (arxiv.org/pdf/2411.04282) treats the CoT as a latent variable: given the prompt and CoT, a good CoT leads to a high likelihood of the correct answer (see the sketch after this list).
- SFT on reasoning traces.
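For context, here is a rough sketch of that latent-CoT idea in my own notation (a paraphrase, not LATRO's exact objective): treat the CoT z as a latent variable and maximize a variational lower bound on the likelihood of the correct answer y* given the prompt x.

```latex
% Rough paraphrase of the latent-CoT objective (my notation, not the paper's exact loss):
% sample CoTs z and reward those that make the gold answer y* likely.
\log p_\theta(y^\star \mid x)
  \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(y^\star \mid x, z)\big]
  \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)
```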

DeepSeek gets rid of all this complexity and simply does RL on questions with verifiable rewards, TULU 3 style (arxiv.org/abs/2411.15124).
They start by trying to improve the Base Model without any supervised data.

They use Group Relative Policy Optimization (arxiv.org/pdf/2402.03300) with the advantage function just being the normalized outcome rewards.

For the rewards, they use simple rule-based accuracy checks (check the answer within \boxed{}, run test cases) plus a format reward that encourages the model to put its thinking process between <think> tags.
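A minimal sketch of what such rule-based rewards could look like (the function names, regexes, and exact scores here are my own assumptions for illustration, not DeepSeek's implementation):

```python
import re

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Reward 1.0 if the final \\boxed{...} answer matches the reference, else 0."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not boxed:
        return 0.0
    return 1.0 if boxed[-1].strip() == gold_answer.strip() else 0.0

def format_reward(completion: str) -> float:
    """Small bonus for wrapping the reasoning in <think>...</think> tags."""
    has_think = re.search(r"<think>.*?</think>", completion, flags=re.DOTALL)
    return 0.5 if has_think else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    # For coding questions, accuracy_reward would instead run the provided test
    # cases in a sandbox; this sketch only covers the math-style \boxed{} check.
    return accuracy_reward(completion, gold_answer) + format_reward(completion)
```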
The GRPO algorithm: again, the advantage estimate is just the (group-normalized) outcome reward. Check out the paper linked above for more details; the outcome-supervision form is sketched below.
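Concretely: for each prompt the policy samples a group of G completions, scores each with the rule-based reward, and the advantage of completion i is just its reward normalized by the group statistics (this is the outcome-supervision form of the GRPO advantage):

```latex
% Group-relative advantage: no value network, no process reward model.
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\!\big(\{r_1, \dots, r_G\}\big)}
                     {\operatorname{std}\!\big(\{r_1, \dots, r_G\}\big)}
```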
The first interesting thing in the paper:
> neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

not much else for me to add here
They say they use a really simple prompt because they are more interested in observing the evolution of model outputs. (The template is roughly reproduced below.)
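From memory, the R1-Zero template is roughly of this shape (a paraphrase, not the paper's exact wording):

```python
# Paraphrase of the R1-Zero prompt template (from memory, not verbatim from the paper).
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The User asks a question and the "
    "Assistant solves it. The Assistant first thinks about the reasoning process "
    "and then provides the answer. The reasoning process and answer are enclosed "
    "in <think> </think> and <answer> </answer> tags respectively.\n"
    "User: {question}\nAssistant:"
)
```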
Notice that they went straight from base model -> RL without the intermediate SFT/instruct-tuning stage that is common. They call this model R1-Zero.
Why is this interesting?

Notice how simple the entire setup is. It is extremely easy to generate synthetic prompts with deterministic answers. And with literally nothing else, it is possible to go from roughly 0.2 to 0.85 on AIME.

Training the base model directly also extracts that ability without having its distribution disturbed by SFT.

Again, at no point did they provide reference answers or instructions. The model realizes that to achieve higher reward, it needs to produce longer CoTs.
With this extremely straightforward setup, the network learns to reflect on and reevaluate its own answers. Again, this is done completely without supervision.
The problem with RL on the base model is that the reasoning process/CoT is not really readable. So they introduce a small amount of high-quality, user-friendly data before the RL process, such that the final model isn't a "base model" but something more assistant-like.
Their entire pipeline is as follows:
1) Take a few thousand samples of high-quality data in the format CoT + summary and SFT the base model on them.

2) Repeat the R1-Zero process. They notice the language-mixing problem still remains, so they add a reward based on the proportion of target-language words in the CoT (see the sketch after this list). (Interesting note: this worsens performance slightly.)

3) Collect 800K accurate samples from the trained model: ~600K reasoning/STEM, ~200K general purpose. (Note: these are the samples used to fine-tune the other open models like Qwen, Llama, etc.)

4) There is one last RL stage where they combine the verifiable rewards with the preference tuning that was done for DeepSeek-V3 (for alignment purposes).
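One plausible implementation of the language-consistency reward from step 2 (the paper only describes it as the proportion of target-language words in the CoT, so the heuristic below is entirely my own):

```python
import re

def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """Fraction of word-like tokens in the CoT that look like the target language.

    Very rough heuristic: count ASCII-alphabetic tokens for English, CJK-containing
    tokens for Chinese. A real implementation would use a proper language-ID model.
    """
    tokens = re.findall(r"\w+", cot)
    if not tokens:
        return 0.0
    if target_lang == "en":
        in_target = sum(tok.isascii() and tok.isalpha() for tok in tokens)
    else:  # e.g. "zh": count tokens containing CJK characters
        in_target = sum(bool(re.search(r"[\u4e00-\u9fff]", tok)) for tok in tokens)
    return in_target / len(tokens)
```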
By now, you should have seen/heard all the results, so I will just say one thing: I really do think this is an o1-level model. If I had to guess, it's roughly on par with o1 at reasoning_effort = medium.
They also evaluate the distilled models, and distillation really just works. The distills even beat Qwen's very own QwQ.

At 8B parameters, the distilled model matches Sonnet and surpasses GPT-4o.
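The distillation itself is plain SFT on the 800K R1-generated samples from step 3, no RL on the student. A minimal sketch of that training step (the model name and masking details are illustrative assumptions, not DeepSeek's training code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # assumption: any small base model as the student
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def distill_loss(prompt: str, r1_trace: str) -> torch.Tensor:
    """Next-token loss on an R1-generated trace, with the prompt tokens masked out.

    (Boundary handling between prompt and trace tokens is simplified here.)
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + r1_trace, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # do not train on the prompt
    return model(input_ids=full_ids, labels=labels).loss  # backprop with any optimizer
```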
Now they have a section on the effectiveness of distillation. They train a Qwen 32B model using RL and compare it with the distilled version.

The finding that this RL version is worse (about the same as QwQ) suggests that the way forward is to RL a huge model and distill it down.

This also gives insight into the impressive performance of o1-mini. It looks like it really is just extremely well-engineered distillation.
They also have a section on their unsuccessful attempts, which I find extremely commendable to share.

tl;dr: PRMs are hard to train and can be reward-hacked; they should only be used for guided search rather than learning. MCTS also did not work and was too complicated.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

github.com/deepseek-ai/De…
Some thoughts:

I think this is one of the most important papers in a while because it's the first open model that is genuinely at the frontier and not just riding on the goodwill of being open.

The paper is really, really simple, as you can probably tell from the thread, because the approach is really, really simple. It is exactly what OpenAI is good at: doing simple things but executing them at an extremely high level.

Personally, I'm surprised (maybe I shouldn't be) that just RL on verifiable rewards (credit to the TULU 3 team for the term) works. Now that we know this recipe, open models that can match o3 should follow soon.

Also worth noting that they did alignment tuning and language-consistency tuning. These hurt performance, which indicates that the model could be even better. Really interesting to think about the tradeoffs here.

The way I see it, there are two open research areas:
- Can we improve inference-time performance? Search? What is o1-pro mode doing? How is reasoning_effort in o1 controlled?

- What does an unhackable ground-truth reward look like for normal domains without deterministic ground truths? I think it's just LLM-as-a-Judge but done extremely well (Sonnet probably does this). A hedged sketch of that below.
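A hedged sketch of what "LLM-as-a-Judge done extremely well" might look like as a reward signal (the rubric, judge model, and scoring scale are all my assumptions):

```python
from openai import OpenAI

client = OpenAI()  # assumption: any strong judge model behind a chat API

JUDGE_RUBRIC = (
    "Score the RESPONSE to the PROMPT from 0 to 10 for correctness, helpfulness "
    "and faithfulness to any stated constraints. Reply with only the integer score."
)

def judge_reward(prompt: str, response: str, judge_model: str = "gpt-4o") -> float:
    """Reward for non-verifiable domains: ask a judge model for a scalar score."""
    out = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"PROMPT:\n{prompt}\n\nRESPONSE:\n{response}"},
        ],
    )
    try:
        return int(out.choices[0].message.content.strip()) / 10.0
    except (TypeError, ValueError):
        return 0.0  # unparsable judge output gets zero reward
```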

More from @nrehiew_

Aug 11
Let's talk about the GLM 4.5 models.

The latest frontier open-weights models out of China (and possibly the best at the moment?) with quite a bit of detail in the paper.
The main interesting thing about the architecture is its load balancing. There is no aux loss; instead they use expert biases, adding a bias to the expert scores. The bias is then adjusted after each step to over/under-correct the load balancing.

(Figures from the paper they cite: arxiv.org/pdf/2408.15664)
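A rough sketch of that bias-based routing as I understand it from the cited paper (the hyperparameters and the exact update rule below are illustrative, not GLM's actual values):

```python
import torch

num_experts, top_k, update_rate = 64, 8, 1e-3
expert_bias = torch.zeros(num_experts)

def route(scores: torch.Tensor):
    """scores: [num_tokens, num_experts] affinity scores from the router."""
    # The bias influences *which* experts get picked, but not the mixing weights.
    topk_idx = torch.topk(scores + expert_bias, k=top_k, dim=-1).indices
    gate_weights = torch.gather(scores, -1, topk_idx).softmax(dim=-1)
    return topk_idx, gate_weights

def update_bias(topk_idx: torch.Tensor):
    """After each step, nudge the bias against the observed expert load."""
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    target = load.mean()
    expert_bias.add_(update_rate * torch.sign(target - load))  # under-loaded -> bias up
```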
Compared to DeepSeek V3 and K2, they make quite a few changes:
- Deeper but narrower
- No MLA; GQA instead
- QK norm
- Higher attention-head/hidden-dim ratio

They say that doubling the number of attention heads doesn't improve loss but improves downstream reasoning evals. This matches Kimi's finding that attention heads had negligible impact on loss, but I guess Kimi didn't eval on downstream benchmarks beyond loss.
Aug 6
Architectural notes about gpt-oss from reading the official implementation.

1) Unconventional SwiGLU:
- Inputs are clamped to 7
- An extra +1 bias on the linear branch
- A scaled sigmoid, which basically becomes a GELU

Probably needed for gradient flow since it's a deep network.
2) An attention sink (arxiv.org/abs/2309.17453) for each of the attention heads:
- Attention becomes: QK^T -> * 1/sqrt(d) -> mask -> concat with sink -> softmax -> remove sink -> matmul with V
- This is probably needed for sliding-window attention to work properly, since you won't have special tokens to 'allocate' attention to (see the sketch after point 3).
3) Deeper rather than wider (compared to DeepSeek V3).
The width/num_layers ratio:
gpt-oss = 2880/36 = 80
dsv3/Kimi K2 = 7168/61 ≈ 118
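Going back to point 2: a minimal sketch of the sink computation as described there (tensor names and shapes are mine; the official implementation differs in details):

```python
import torch
import torch.nn.functional as F

def sink_attention(q, k, v, sink_logit, mask):
    """q, k, v: [batch, heads, seq, head_dim]; sink_logit: [heads]; mask: additive."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5 + mask
    # Concatenate the per-head sink as an extra "column" before the softmax...
    sink = sink_logit.view(1, -1, 1, 1).expand(scores.shape[0], -1, scores.shape[2], 1)
    probs = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # ...then drop its probability mass: it soaks up attention but contributes no value.
    return probs[..., :-1] @ v
```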
Jul 21
How to train a State-of-the-art agent model.

Let's talk about the Kimi K2 paper.
The first section is about pretraining. Basic info about the model:
- Essentially a (very sparse) MoE with MLA (DeepSeek V3 architecture)
- 15.5T tokens (mix of human and synthetic)
- Muon + QK-Clip
Scaling up Muon, they found that attention logits keep exploding.

Formally, they look at the max per-head QK logit.

The two existing solutions are:
1) QK norm (not applicable for MLA)
2) Gemma 2-style logit softcapping (Gemma 3 got rid of that, and the QK logits can still grow)
Jul 14
Really nice read. tl;dr + my notes:

1) Since they were planning to use Muon and 1T params, they didn't have the resources to try to tweak/improve DeepSeek V3's core architecture.
2) There is an internal (?) experiment that validated 384 experts (up from DeepSeek V3's 256). I don't fully understand the translation here, but I think they find that increasing the number of experts by 50% doesn't hurt scaling as long as total activated parameters stay constant (so increased sparsity is fine).
A small analysis on the increased expert count: since total activated params are constant, FLOPs during prefill are the same.

Decode is where the cost is incurred, and you get a linear increase in cost with the increase in sparsity.
Jun 11
Let's talk about the latest Mistral Reasoner paper.

Really cool and detailed end-to-end paper from the Mistral team.
The first part talks about Mistral's changes to GRPO (one of them is sketched after this list):
- Remove the reference model (and the corresponding KL term)
- Normalize losses by length per group
- Normalize advantages by minibatch rather than group statistics
- Decouple the trust-region clipping to prevent entropy collapse
- Filter out zero-advantage groups
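A sketch of one of these changes, the advantage normalization, as I read that bullet (my interpretation, not Mistral's code): center rewards per group as usual, but use minibatch-level statistics for the normalization.

```python
import torch

def advantages_minibatch_norm(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: [num_groups, group_size] outcome rewards for one minibatch."""
    centered = rewards - rewards.mean(dim=1, keepdim=True)  # per-group mean baseline
    return centered / (rewards.std() + 1e-6)                # minibatch-level std
```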
The next part talks about their four types of rewards.

Formatting (0.1 or 0):
- Must start with and contain <think> tags
- Must have \boxed{}
- Code answers must be in ``` blocks

Correctness:
- For math, 0.9 if the \boxed{} answer is correct
- For code, it's test cases with timeout and memory limits
Apr 4
Cohere's Command A report is an extremely extensive paper on how to train a modern LLM in 2025. But it's a model for very different, specific use cases.

Let's talk about it.
Important context about Cohere: they aren't trying to train frontier models like Meta/OpenAI/Anthropic. They focus on models that are intelligent but tailored to enterprise tasks like RAG and multilingualism, and that can still be efficiently served (on premise).
Their architecture is a pretty standard dense Transformer:
- SwiGLU, GQA
- 3:1 local/full attention
- No positional embeddings on the full-attention layers
- No bias
- Tied input and LM-head matrices

The lack of positional embeddings on the full-attention layers is something that I've only seen them use (huggingface.co/CohereForAI/c4…). I suspect we will see more of this in 2025.
