Aug 22
New ByteDance Seed reasoning RL paper, relating RL to self-supervised learning.

The paper is pretty dense with all the dual-task derivation, so this is basically my notes.
The main idea is to learn two tasks. Given input A, learn output B. To verify the quality of output B, the model tries to reconstruct the input as A'. The quality of the output is then evaluated by how similar A and A' are.
Clearly, this is very difficult. For example, if this were math and the output is some number x, it is impossible to reconstruct the problem.

The solution is to decompose the input A into a known and an unknown part. During the reverse process, the model has to reconstruct the unknown part of A using both the output B and the known part of A.
This is all very handwavy, so the simple example is the sum of two numbers: a + b = c. The model is then provided with the output c and one of the numbers (e.g. a) and asked what b is.
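To make that concrete, here is a minimal sketch of what the dual reward could look like for the arithmetic case. `generate` is a hypothetical helper that queries the policy model; the real DuPO reward is computed over the model's own rollouts.

```python
def dual_reward(a: int, b: int, generate) -> float:
    """Sketch of DuPO's dual reward on the a + b = c toy example."""
    # Forward task: given the full input (a, b), produce the output c.
    c = int(generate(f"What is {a} + {b}?"))

    # Dual task: hide the unknown part b and ask the model to reconstruct
    # it from the known part a plus its own output c.
    b_reconstructed = int(generate(f"{a} + x = {c}. What is x?"))

    # Reward the forward answer by how well the hidden part is recovered
    # (exact match here; a softer similarity in general).
    return 1.0 if b_reconstructed == b else 0.0
```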
These are clearer examples.
Math: Each rollout leads to a different answer. DuPO then tries to derive the values in the question from each rollout's answer; the rollout whose answer best reverses is rewarded.

Machine Translation: This is a lot more straightforward since dual-task learning has already been used in MT. Basically, measure the string similarity of the reversed translation against the original source.

(I have no idea why this figure is in the appendix when it is a much clearer depiction of DuPO.)
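For the translation case, a reconstruction reward can be as simple as string similarity between the source and the back-translation. A sketch using Python's difflib; the paper may use a different similarity metric.

```python
import difflib

def backtranslation_reward(source: str, back_translation: str) -> float:
    # Score a forward translation by how closely translating it back
    # recovers the original source sentence (1.0 = identical strings).
    return difflib.SequenceMatcher(None, source, back_translation).ratio()

# A rollout whose back-translation drifts from the source scores lower
# than one that round-trips cleanly.
backtranslation_reward("the cat sat on the mat", "the cat sat on the mat")  # 1.0
backtranslation_reward("the cat sat on the mat", "a cat was on a rug")      # < 1.0
```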
Strong results in Machine Translation and Math. They also show that it can work directly on base models.
Interestingly, this even works purely at inference without any training: the backward accuracy can be used as a best-of-N judge to select the best trajectory.
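Roughly, the inference-time use looks like this. `backward_accuracy` is a hypothetical scorer that asks the model to reconstruct the hidden part of the question from a candidate answer and measures the match.

```python
def best_of_n(question: str, rollouts: list[str], backward_accuracy) -> str:
    # No training involved: just score every sampled answer by how well
    # it lets the model reconstruct the question, then keep the best one.
    scores = [backward_accuracy(question, answer) for answer in rollouts]
    return rollouts[scores.index(max(scores))]
```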
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
arxiv.org/pdf/2508.14460
I think it's a really interesting paper. I will say that, as someone who hasn't even heard of the "dual task" literature before, the first part of the paper was extremely confusing and a little abstract; it made more sense after seeing the examples.

More from @nrehiew_

Aug 11
Let's talk about the GLM 4.5 models.

The latest frontier open-weights model out of China (and possibly the best at the moment?) with quite a bit of detail in the paper.
The main interesting thing about the architecture is its load balancing. There is no aux loss; instead they use expert biases, adding a bias to each expert's routing score. The bias is then adjusted after each step to over/under-correct the load balancing.

(figures from the paper they cite: arxiv.org/pdf/2408.15664)
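A rough sketch of that bias mechanism, based on the cited aux-loss-free load-balancing paper; the exact update rule and step size gamma used in GLM 4.5 may differ.

```python
import numpy as np

def route_with_bias(scores, bias, k, gamma=0.001):
    # scores: [tokens, experts] router affinities; bias: [experts].
    # The bias only affects *which* experts are selected, not the gate
    # weights used to mix their outputs.
    topk = np.argsort(scores + bias, axis=-1)[:, -k:]

    # After the step, nudge the biases: overloaded experts are pushed
    # down, underloaded experts pushed up, by a fixed step size gamma.
    load = np.bincount(topk.ravel(), minlength=scores.shape[1])
    bias = bias + gamma * np.sign(load.mean() - load)
    return topk, bias
```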
Compared to DeepSeek V3 and K2, they make quite a few changes:
- Deeper but narrower
- No MLA, GQA instead
- QK norm
- Higher attention head / hidden dim ratio

They say that doubling attention heads doesn't improve loss but does improve downstream reasoning evals. This actually reflects Kimi's finding that attention heads had negligible impact on loss, but I guess Kimi didn't eval on downstream benchmarks beyond just loss.
Aug 6
Architectural notes about gpt-oss from reading the official implementation.

1) Unconventional SwiGLU.
- Inputs are clamped to 7
- Extra +1 bias on the linear branch
- Scaled sigmoid, which basically becomes a GELU

Probably needed for gradient flow since it's a deep network
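Putting those three tweaks together, the activation looks roughly like this. A sketch: alpha = 1.702 is the usual sigmoid constant for approximating GELU, and the exact clamping in the official code may differ.

```python
import torch

def gptoss_swiglu(gate: torch.Tensor, up: torch.Tensor,
                  alpha: float = 1.702, limit: float = 7.0) -> torch.Tensor:
    # Clamp the inputs so activations stay bounded in a deep network.
    gate = gate.clamp(max=limit)
    up = up.clamp(min=-limit, max=limit)
    # Scaled sigmoid: x * sigmoid(1.702 * x) is essentially a GELU.
    act = gate * torch.sigmoid(alpha * gate)
    # Extra +1 bias on the linear branch before the elementwise product.
    return act * (up + 1)
```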
2) An attention sink for each of the attention heads
- Attention becomes: QK -> * 1/sqrt(d) -> mask -> concat with sink -> softmax -> drop sink -> matmul with V
- This is probably needed for sliding-window attention to work properly, since you won't have special tokens to 'allocate' attention to. arxiv.org/abs/2309.17453
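In code, that pipeline could look something like this (a sketch, not the official implementation):

```python
import torch
import torch.nn.functional as F

def attention_with_sink(q, k, v, sink_logit, mask):
    # q, k, v: [heads, seq, d]; sink_logit: [heads], one learned scalar per head.
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1)) / d**0.5          # QK^T * 1/sqrt(d)
    logits = logits.masked_fill(mask, float("-inf"))     # causal / sliding-window mask
    # Concat the sink as an extra "slot" every query can attend to.
    sink = sink_logit[:, None, None].expand(-1, logits.shape[1], 1)
    probs = F.softmax(torch.cat([logits, sink], dim=-1), dim=-1)
    # Drop the sink column before the matmul with V: it only absorbs probability mass.
    return probs[..., :-1] @ v
```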
3) Deeper rather than wider (compared to DeepSeek v3)
The ratio of width/num_layers:
gpt-oss = 2880/36 = 80
dsv3/kimi k2 = 7168/61 ≈ 118
Jul 21
How to train a State-of-the-art agent model.

Let's talk about the Kimi K2 paper.
The first section is about pretraining. Basic info about the model:
- essentially a (very sparse) MoE with MLA (DeepSeek V3 architecture)
- 15.5 T tokens (mix of human and synthetic)
- Muon + QK Clip
Scaling up Muon, they found that attention logits keep exploding.

Formally, they look at the max per head QK logit.

The 2 existing solutions are:
1) QK Norm (N/A for MLA)
2) Gemma 2-style logit softcapping (Gemma 3 got rid of it, and QK logits can still grow)
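For reference, the two existing fixes look roughly like this; the cap value below is illustrative rather than the exact Gemma 2 number.

```python
import torch

def qk_norm(q, k, eps=1e-6):
    # 1) RMS-normalize queries and keys so their dot products can't blow up
    #    (the thread notes this is N/A for MLA).
    q = q / (q.pow(2).mean(-1, keepdim=True) + eps).sqrt()
    k = k / (k.pow(2).mean(-1, keepdim=True) + eps).sqrt()
    return q, k

def softcap(logits, cap=50.0):
    # 2) Gemma 2-style soft capping squashes attention logits into (-cap, cap).
    return cap * torch.tanh(logits / cap)
```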
Jul 14
Really nice read. tldr + my notes:

1) Since they were planning to use Muon and 1T params, they didn't have the resources to try to tweak/improve DeepSeek v3's core arch
2) There is an internal (?) experiment that validated 384 experts (up from DSv3's 256). I don't fully understand the translation here, but I think they find that increasing the number of experts by 50% doesn't hurt scaling as long as total activated params are constant (so increased sparsity is fine)
Small analysis on the increased expert count. Since total activated params are constant, FLOPs during prefill are the same.

Decode is where the extra cost is incurred: you get a linear increase in cost with the increase in sparsity.
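A back-of-the-envelope version of that argument; the numbers below are illustrative, not the exact DSv3 / K2 configs.

```python
# Constant activated params: prefill is compute-bound and its FLOPs track
# the activated experts, so prefill cost doesn't change. Decode is
# memory-bound, so its cost tracks the total expert weights to hold/stream.
total_experts_old, total_experts_new = 256, 384   # before vs after the +50%
active_per_token = 8                              # top-k, held constant

prefill_cost_ratio = active_per_token / active_per_token    # 1.0x
decode_cost_ratio = total_experts_new / total_experts_old   # 1.5x, linear in sparsity
```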
Jun 11
Let's talk about the latest Mistral Reasoner paper.

Really cool and detailed end-to-end paper from the Mistral team
The 1st part talks about Mistral's changes to GRPO
- Remove the reference model (and corresponding KLD)
- Normalize losses by length per group
- Normalize advantages by minibatch rather than group statistics
- Decouple trust-region clipping to prevent entropy collapse
- Filter out zero-advantage groups
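A sketch of the advantage-normalization and zero-advantage-filtering tweaks, assuming the rest of the GRPO loss is unchanged:

```python
import numpy as np

def minibatch_normalized_advantages(rewards_per_group):
    # rewards_per_group: one NumPy array of rewards per prompt (per GRPO group).
    # Center by the group mean as usual (the GRPO baseline) ...
    centered = [r - r.mean() for r in rewards_per_group]
    # ... drop groups where every rollout got the same reward (zero advantage),
    # and normalize by the minibatch std instead of each group's own std.
    kept = [c for c in centered if np.any(c != 0)]
    flat = np.concatenate(kept)
    return [c / (flat.std() + 1e-6) for c in kept]
```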
The next part talks about their 4 types of rewards
Formatting (0.1 or 0):
- Must start with and contain <think> tags
- Must have \boxed{}
- Code must have ```

Correctness:
- For math, 0.9 if \boxed{} answer is correct
- For code, it's judged by test cases with timeout and memory limits
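As a sketch, the reward shaping might be wired up like this; `extract_boxed` is a simplified helper, and the real checks are surely fuzzier.

```python
import re

def extract_boxed(response: str) -> str:
    # Simplified helper: pull the contents of the last \boxed{...}.
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1] if matches else ""

def format_reward(response: str) -> float:
    # Formatting: 0.1 if the response opens with <think> and has a \boxed{} answer.
    ok = response.lstrip().startswith("<think>") and "\\boxed{" in response
    return 0.1 if ok else 0.0

def math_reward(response: str, gold: str) -> float:
    # Correctness for math: 0.9 if the boxed answer matches the reference.
    # Code is instead run against test cases with time and memory limits.
    return 0.9 if extract_boxed(response) == gold else 0.0
```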
Apr 4
Cohere's Command A report is an extremely extensive paper on how to train a modern LLM in 2025. But it's a model for very different, quite specific use cases.

Let's talk about it.
Important to start with some context about Cohere. They aren't trying to train frontier models like Meta/OpenAI/Anthropic. They focus on training models that are intelligent but aimed specifically at enterprise tasks like RAG and multilingualism, and that can still be served efficiently (on premise).
Their architecture is a pretty standard dense Transformer:
- SwiGLU, GQA
- 3:1 local/full attention.
- No positional embeddings on the full attention layers
- No bias
- Tied input and lm head matrices
The no-positional-embeddings choice is something that I've only seen them use. I suspect we will see more of this in 2025. huggingface.co/CohereForAI/c4…
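A sketch of how that interleaving could be configured; the 3:1 pattern and NoPE-on-full-layers are from the report, but the window size here is just illustrative.

```python
def attention_config(layer_idx: int, window: int = 4096) -> dict:
    # Every 4th layer is full attention with no positional embedding (NoPE);
    # the other three use sliding-window attention with RoPE.
    if (layer_idx + 1) % 4 == 0:
        return {"attention": "full", "positional_embedding": None}
    return {"attention": "sliding_window", "window": window,
            "positional_embedding": "rope"}
```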
