wh (@nrehiew_) · Jan 21
How to train a state-of-the-art reasoner.

Let's talk about the DeepSeek-R1 paper and how DeepSeek trained a model that is at the frontier, at the level of Sonnet/o1.
Quick overview of what has been done to train an o1-like model:

- Process and Outcome Reward Models (PRMs/ORMs). This approach does RL and trains these two models to give a reward/signal at the step or answer level. Given that Qwen trained a SOTA PRM, we can assume they do this.
- LATRO (arxiv.org/pdf/2411.04282) basically treats the CoT as a latent variable: given prompt + CoT, a good CoT leads to a high likelihood of the correct answer.
- SFT on reasoning traces.

DeepSeek gets rid of all this complexity and simply does RL on questions with verifiable rewards, TULU 3 style (arxiv.org/abs/2411.15124).
They start by trying to improve the Base Model without any supervised data.

They use Group Relative Policy Optimization (arxiv.org/pdf/2402.03300) with the advantage function just being the normalized outcome rewards.

For the reward, they use simple accuracy checks (check the answer within \boxed{}, run test cases), plus a format reward that encourages the model to put its thinking process between tags.
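To make that concrete, here is a minimal sketch of what such a rule-based reward could look like; the regexes, the <think> tag convention, and the 0.5 format bonus are my own illustrative choices, not DeepSeek's actual code:

```python
import re

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the final boxed answer matches the reference exactly, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def format_reward(completion: str) -> float:
    """Small bonus if the reasoning is wrapped in <think>...</think> tags."""
    return 0.5 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    return accuracy_reward(completion, ground_truth) + format_reward(completion)
```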
The GRPO algorithm is shown here. Again, the advantage estimate is just the outcome reward, normalized within the group. Check out the paper linked above for more details.
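For intuition, the group-relative advantage is nothing more than each sampled answer's reward standardised against the other samples for the same prompt; a tiny sketch (my simplification, ignoring the clipping/KL terms of the full objective):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Advantage for each of G completions of one prompt:
    A_i = (r_i - mean(r)) / std(r). No learned critic/value model needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled completions for one question, only the first is correct.
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 0.0])))  # approx [ 1.73 -0.58 -0.58 -0.58]
```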
First interesting thing in the paper:
> neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

not much else for me to add here
They say that they use a really simple prompt because they are more interested in observing the evolution of model outputs.
Notice that they went straight from Base -> RL without the intermediate SFT/instruct-tuning stage that is common practice. They call this model R1-Zero.
Why is this interesting?

Notice how simple the entire setup is. It is extremely easy to generate synthetic prompts with deterministic answers. And with literally nothing else, it is possible to go from 0.2 to 0.85 on AIME.

Training the base model directly also extracts that ability without having its distribution disturbed by SFT.

Again, at no point did they provide reference answers or instructions. The model realizes that to achieve higher reward, it needs to produce longer CoTs.
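To illustrate how cheap "synthetic prompts with deterministic answers" are to produce, here's a toy generator (my own example, not from the paper); pairs like these can be scored by a rule-based accuracy reward with zero human labelling:

```python
import random

def make_arithmetic_sample(rng: random.Random) -> dict:
    """One question whose answer can be verified exactly by string match."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return {
        "prompt": f"What is {a} * {b}? Put your final answer in \\boxed{{}}.",
        "ground_truth": str(a * b),
    }

rng = random.Random(0)
print([make_arithmetic_sample(rng) for _ in range(2)])
```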
With this extremely straightforward setup, the network learns to reflect on and re-evaluate its own answers. Again, this is done completely without supervision.
The problem with RL on the base model is that the reasoning process/CoT is not really readable. So they introduce a small amount of high-quality, user-friendly data before the RL process, such that the final model isn't a "base model" but rather something more "assistant"-like.
Their entire pipeline is as follows:
1) Take a few thousand samples of high-quality data in the format CoT + summary and SFT the base model.

2) Repeat the R1-Zero process. They notice the language-mixing problem still remains, so they add a reward based on the proportion of target-language words in the CoT; a rough sketch of such a reward follows after this list. (Interesting note: this slightly worsens performance.)

3) Collect 800K accurate samples from the trained model: ~600K STEM/reasoning, ~200K general purpose. (Note: these are the samples used to fine-tune the other open models like Qwen, Llama, etc.)

4) They have one last RL stage where they combine the verifiable rewards with the preference tuning that was done for DeepSeek-V3 (for alignment purposes).
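On step 2, a language-consistency reward could be as blunt as the sketch below; the ASCII heuristic is a stand-in of mine, since the paper doesn't publish its implementation:

```python
def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """Fraction of CoT words that look like they belong to the target language.

    Crude stand-in: treats pure-ASCII words as English. A real system would
    use a proper language-ID model; this is only illustrative.
    """
    words = cot.split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if (w.isascii() if target_lang == "en" else not w.isascii()))
    return hits / len(words)

print(language_consistency_reward("The answer is 42 because 6 * 7 = 42"))  # 1.0
```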
By now, you should have seen/heard all the results, so I will just say one thing: I really do think this is an o1-level model. If I had to guess, it's roughly on par with o1 (reasoning_effort = medium).
They also evaluate the distilled models, and distillation really just works. They even beat Qwen's very own QwQ.

At 8B parameters, it matches Sonnet and has surpassed GPT-4o.
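Worth stressing that "distillation" here is just SFT on the teacher's traces, nothing fancier like logit matching. A rough sketch of that loss, assuming a HuggingFace-style causal LM whose forward returns .logits (my simplification, not their training code):

```python
import torch
import torch.nn.functional as F

def distillation_sft_loss(student, input_ids, prompt_lens):
    """Next-token cross-entropy on teacher-generated traces.

    Prompt positions are masked out so the student only learns to imitate
    the teacher's <think> trace + answer, exactly like ordinary SFT.
    """
    logits = student(input_ids).logits[:, :-1]   # predict token t+1 from the prefix
    targets = input_ids[:, 1:].clone()
    for i, plen in enumerate(prompt_lens):
        targets[i, : plen - 1] = -100            # ignore prompt tokens
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100
    )
```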
Now they have a section on the effectiveness of distillation. They train a Qwen 32B model using RL and compare it with the distilled version.

The finding that this RL version is worse off (roughly the same as QwQ) shows that the way forward is to RL a huge model and distill it down.

This also gives insight into the impressive performance of o1-mini. It looks like it really is just extremely well-engineered distillation.
They also have a section on their unsuccessful attempts, which I find extremely commendable to share.

tl;dr: PRMs are hard to train and can be hacked; they should only be used for guided search rather than for learning. MCTS was also not working and was too complicated.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

github.com/deepseek-ai/De…
Some thoughts:

I think this is one of the most important papers in a while because it's the first open model that is genuinely at the frontier and not just riding on the goodwill of being open.

The paper is really simple, as you can probably tell from this thread, because the approach is really simple. It is exactly what OpenAI is good at: doing simple things but executing at an extremely high level.

Personally, I'm surprised (maybe I shouldn't be) that just RL on verifiable rewards (credit to the TULU 3 team for the term) works. Now that we know this recipe, we should also have something that can match o3 soon.

Also worth noting that they did alignment tuning + language-consistency tuning. This hurts performance, which indicates that the model could be even better. Really interesting to think about the trade-offs here.

The way I see it, there are two open research areas:
- Can we improve inference-time performance? Search? What is o1-pro mode doing? How is reasoning_effort in o1 controlled?

- What does an unhackable ground-truth reward look like for normal domains without deterministic ground truths? I think it's just LLM-as-a-Judge but done extremely well (Sonnet probably does this); a hypothetical sketch follows below.
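Purely as an illustration of what "LLM-as-a-Judge done well" could mean mechanically, here's a toy reward wrapper. The judge_llm callable, the rubric wording, and the 1-10 scale are all my assumptions, not anything from the paper:

```python
def judge_reward(judge_llm, prompt: str, response: str) -> float:
    """Hypothetical LLM-as-a-Judge reward: ask a strong model to grade a
    response against a rubric and map the grade onto [0, 1].

    `judge_llm` is assumed to be any callable str -> str.
    """
    rubric = (
        "Rate the response to the prompt below from 1 to 10 for correctness "
        "and helpfulness. Reply with a single integer.\n\n"
        f"Prompt: {prompt}\n\nResponse: {response}"
    )
    try:
        score = int(judge_llm(rubric).strip())
    except ValueError:
        return 0.0  # unparseable judgement gets no reward
    return max(0.0, min(1.0, (score - 1) / 9))
```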


More from @nrehiew_

Jan 5
short thread:

1) I think the most important thing is to find a niche you enjoy. It doesn't need to be the hot topics (LLMs etc.); it could just be a paper you see people discuss which piques your interest. This way, a reading list won't feel like a todo list.
2) People seem to have the impression that the math (notation) is impossible to understand without the proper background. I beg to differ. Some papers really just use math notation for notation's sake, and the core intuition of the paper is actually really grokkable.
3) If you have more of an SWE background, look at the code if it's available!

4) Use LLMs (Claude, o1, etc.) to explain the notation. Just paste the image in and ask "explain the intuition to me". Most models do really well.
Dec 26, 2024
How to train a 670B parameter model.

Let's talk about the DeepSeek-V3 report + some comparisons with what Meta did with Llama 405B.
This is the figure that has been going around, so you probably know how nuts this is, but for added context: Llama 3 405B was trained on 16K H100s.
Architecture-wise, they differ significantly from Meta, which just used a single massive dense transformer.

For open-source Mixture of Experts, Mixtral was the first (I think) and DeepSeek popularised it.

Multi-head Latent Attention (MLA) comes from their DeepSeek-V2 paper; it basically makes inference more efficient by compressing the size of the KV cache.
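Roughly, the trick is to cache one small latent per token and re-expand it into K/V at attention time; a toy illustration with made-up dimensions (not DeepSeek's actual implementation, which also handles RoPE separately):

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Cache a compressed latent per token instead of full K/V heads."""

    def __init__(self, d_model=1024, d_latent=64, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress (this is what gets cached)
        self.up_k = nn.Linear(d_latent, d_head, bias=False)   # re-expand to keys
        self.up_v = nn.Linear(d_latent, d_head, bias=False)   # re-expand to values

    def forward(self, hidden):                                # hidden: (seq, d_model)
        latent = self.down(hidden)                            # (seq, d_latent)
        return self.up_k(latent), self.up_v(latent)

k, v = LatentKV()(torch.randn(10, 1024))
print(k.shape, v.shape)  # cache cost scales with d_latent, not with full K/V size
```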
Dec 14, 2024
A new paper from Meta gets rid of tokenization by learning a transformer over raw bytes.
A quick primer on why this is difficult and why we cannot just train on bytes naively: if we want to get rid of arbitrary tokenization/segmentation of input sequences, training on bytes is not that straightforward.

Obviously, training on bytes is a significantly harder problem to solve. Plus, sequences of bytes are orders of magnitude longer than what we have currently, and it is extremely inefficient to model this. As much as we hate BPE, tokenization is compression, which allows us to have manageable sequence lengths.
The first part of their architecture converts sequences of bytes into patches. They discuss three approaches:
- Static fixed size
- Start a new patch if the current byte is not a Latin character, digit, or UTF-8 continuation (i.e., it is a space or similar)
- Train a small model over individual bytes and use the entropy of the byte sequence, as determined by that small model, to place break points (sketched below).
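A sketch of that third, entropy-based option, assuming we already have next-byte logits from some small byte-level LM (the threshold and shapes are made up):

```python
import torch

def entropy_patch_starts(byte_logits: torch.Tensor, threshold: float = 2.0) -> list[int]:
    """Start a new patch wherever the small byte LM is 'surprised'.

    byte_logits: (seq_len, 256) next-byte logits. Returns positions where the
    predictive entropy (in nats) exceeds the threshold.
    """
    probs = torch.softmax(byte_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)
    return (entropy > threshold).nonzero(as_tuple=True)[0].tolist()

print(entropy_patch_starts(torch.randn(16, 256)))  # toy logits stand in for a real byte LM
```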
Dec 13, 2024
This new, extremely detailed paper from Meta treats language modelling as modelling abstract semantic concepts rather than tokens, and finds diffusion to work well.

tl;dr: they train a concept encoder and decoder to map between readable word space and "concept" space.
The left figure is what they call "reasoning in embedding space". A sentence with 7 words can be summarized into 2 by clustering them based on semantic similarity. Note that this would theoretically work even if the sentences are in different languages, or not even "sentences" at all but other modalities like images.

The right figure is their architecture. The main similarity that comes to mind is Latent Diffusion Models: you map your input to some latent space, perform modelling on that, and map back out.
The main related work here (by the same team) is SONAR. It's basically a standard encoder-decoder with a single-vector bottleneck. Mainly:
- There is no cross-attention with tokens, i.e. the decoder only attends to this bottleneck vector
- To get a fixed-size vector regardless of input length, they just pool the encoder outputs
arxiv.org/pdf/2308.11466
Dec 10, 2024
This paper from Meta proposes a method to not have the model reason in token space, but to directly model its reasoning using its hidden state. The authors also do a lot of cool interpretability work in this paper.

Aesthetically, I like it a lot, and it's simple to implement.
The idea is pretty straightforward. Instead of mapping back out to token space using lm_head, just concat the output hidden state with the input embeddings.
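A minimal sketch of that step, assuming a HuggingFace-style model that accepts inputs_embeds and returns hidden states (my reading of the idea, not the authors' code):

```python
import torch

def continuous_thought_step(model, inputs_embeds: torch.Tensor) -> torch.Tensor:
    """One latent reasoning step: append the last hidden state to the input
    embeddings instead of decoding it into a token via lm_head."""
    out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
    last_hidden = out.hidden_states[-1][:, -1:, :]            # (batch, 1, d_model)
    return torch.cat([inputs_embeds, last_hidden], dim=1)     # sequence grows by one "thought"
```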
To determine how long the model should reason, they tried learning a binary classifier but found that they could just set a predefined length for the CoT.
Dec 2, 2024
The 6th-highest-scored paper (7,8,8,8) going into NeurIPS 2024.

tl;dr: They introduce a new image-generation approach that uses an autoregressive transformer to predict increasingly larger latent feature maps, starting from a single-pixel feature map up to the final latent/image.
Traditional autoregressive approaches use a tokenisation method such as a VQ-VAE or a ViT-style raster scan. The paper argues that an image patch is naturally related to all patches around it, but these approaches enforce a unidirectional ordering.
So rather than predicting the next token, they instead model predicting the next "scale". They use a VQ-VAE to learn different-sized feature maps that are gradually upsampled in the decoder. They raster-scan each feature map and predict the feature maps in sequence. The attention mask is causal across scales, so all "tokens" in the same feature-map scale can attend to each other and to every map that was smaller and came before it.
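To make that attention pattern concrete, here's a toy block-causal mask over flattened scales (my own illustration; the real model also involves the VQ-VAE and upsampling):

```python
import torch

def block_causal_mask(scale_sizes: list[int]) -> torch.Tensor:
    """Tokens in the same scale attend to each other; every scale also attends
    to all earlier (smaller) scales, never to later ones."""
    boundaries = torch.tensor(scale_sizes).cumsum(0)
    n = int(boundaries[-1])
    scale_id = torch.bucketize(torch.arange(n), boundaries, right=True)
    return scale_id[:, None] >= scale_id[None, :]  # True = attention allowed

print(block_causal_mask([1, 4, 9]).int())  # scales of 1, 4, and 9 "tokens"
```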
