Aug 22
New ByteDance Seed reasoning RL paper, relating RL to self-supervised learning.

The paper is pretty dense with all the dual-task derivation, so this is basically my notes.
The main idea is to learn two tasks. Given input A, learn output B. To verify the quality of output B, the model tries to reconstruct the input as A'. The quality of the output is then evaluated by how similar A and A' are.
Clearly, this is very difficult. For example, if this were math and the output is some number x, it is impossible to reconstruct the problem.

The solution is to decompose the input A into a known and an unknown part. During the reverse process, the model has to reconstruct the unknown part of A using both the output B and the known part of A.
This is all very handwavy, so the simple example is the sum of two numbers: a + b = c. The model is then provided with the output c and one of the numbers (e.g. a) and asked what b is.
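To make that concrete, here is a minimal sketch of what the dual reward could look like for the arithmetic case. `generate` is a hypothetical helper that queries the policy model; the real DuPO reward is computed over the model's own rollouts.

```python
def dual_reward(a: int, b: int, generate) -> float:
    """Sketch of DuPO's dual reward on the a + b = c toy example."""
    # Forward task: given the full input (a, b), produce the output c.
    c = int(generate(f"What is {a} + {b}?"))

    # Dual task: hide the unknown part b and ask the model to reconstruct
    # it from the known part a plus its own output c.
    b_reconstructed = int(generate(f"{a} + x = {c}. What is x?"))

    # Reward the forward answer by how well the hidden part is recovered
    # (exact match here; a softer similarity in general).
    return 1.0 if b_reconstructed == b else 0.0
```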
These are clearer examples.
Math: Each rollout leads to a different answer. DuPO then tries to derive the values in the question from each rollout's answer; the rollout whose answer best reverses is rewarded.

Machine Translation: This is a lot more straightforward since dual-task learning has already been used in MT. Basically, measure the string similarity of the reversed translation against the original source.

(I have no idea why this figure is in the appendix when it is a much clearer depiction of DuPO.)
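For the translation case, a reconstruction reward can be as simple as string similarity between the source and the back-translation. A sketch using Python's difflib; the paper may use a different similarity metric.

```python
import difflib

def backtranslation_reward(source: str, back_translation: str) -> float:
    # Score a forward translation by how closely translating it back
    # recovers the original source sentence (1.0 = identical strings).
    return difflib.SequenceMatcher(None, source, back_translation).ratio()

# A rollout whose back-translation drifts from the source scores lower
# than one that round-trips cleanly.
backtranslation_reward("the cat sat on the mat", "the cat sat on the mat")  # 1.0
backtranslation_reward("the cat sat on the mat", "a cat was on a rug")      # < 1.0
```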
Strong results in Machine Translation and Math. They also show that it can work directly on base models.
Interestingly, this even works purely at inference without any training: the backward accuracy can be used as a best-of-N judge to select the best trajectory.
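Roughly, the inference-time use looks like this. `backward_accuracy` is a hypothetical scorer that asks the model to reconstruct the hidden part of the question from a candidate answer and measures the match.

```python
def best_of_n(question: str, rollouts: list[str], backward_accuracy) -> str:
    # No training involved: just score every sampled answer by how well
    # it lets the model reconstruct the question, then keep the best one.
    scores = [backward_accuracy(question, answer) for answer in rollouts]
    return rollouts[scores.index(max(scores))]
```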
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
arxiv.org/pdf/2508.14460
I think it's a really interesting paper. I will say that, as someone who hasn't even heard of the "dual task" literature before, the first part of the paper was extremely confusing and a little abstract; it made more sense after seeing the examples.

More from @nrehiew_

Aug 11
Let's talk about the GLM 4.5 models.

The latest frontier open-weights model out of China (and possibly the best at the moment?) with quite a bit of detail in the paper.
The main interesting thing about the architecture is its load balancing. There is no aux loss; instead they use expert biases, adding a bias to each expert's routing score. The bias is then adjusted after each step to over/under-correct the load balancing.

(figures from the paper they cite: arxiv.org/pdf/2408.15664)
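A rough sketch of that bias mechanism, based on the cited aux-loss-free load-balancing paper; the exact update rule and step size gamma used in GLM 4.5 may differ.

```python
import numpy as np

def route_with_bias(scores, bias, k, gamma=0.001):
    # scores: [tokens, experts] router affinities; bias: [experts].
    # The bias only affects *which* experts are selected, not the gate
    # weights used to mix their outputs.
    topk = np.argsort(scores + bias, axis=-1)[:, -k:]

    # After the step, nudge the biases: overloaded experts are pushed
    # down, underloaded experts pushed up, by a fixed step size gamma.
    load = np.bincount(topk.ravel(), minlength=scores.shape[1])
    bias = bias + gamma * np.sign(load.mean() - load)
    return topk, bias
```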
Compared to DeepSeek V3 and K2, they make quite a few changes:
- Deeper but narrower
- No MLA, GQA instead
- QK norm
- Higher attention head / hidden dim ratio

They say that doubling attention heads doesn't improve loss but does improve downstream reasoning evals. This actually reflects Kimi's finding that attention heads had negligible impact on loss, but I guess Kimi didn't eval on downstream benchmarks beyond just loss.
Aug 6
Architectural notes about gpt-oss from reading the official implementation.

1) Unconventional SwiGLU.
- Inputs are clamped to 7
- Extra +1 bias on the linear branch
- Scaled sigmoid, which basically becomes a GELU

Probably needed for gradient flow since it's a deep network
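Putting those three tweaks together, the activation looks roughly like this. A sketch: alpha = 1.702 is the usual sigmoid constant for approximating GELU, and the exact clamping in the official code may differ.

```python
import torch

def gptoss_swiglu(gate: torch.Tensor, up: torch.Tensor,
                  alpha: float = 1.702, limit: float = 7.0) -> torch.Tensor:
    # Clamp the inputs so activations stay bounded in a deep network.
    gate = gate.clamp(max=limit)
    up = up.clamp(min=-limit, max=limit)
    # Scaled sigmoid: x * sigmoid(1.702 * x) is essentially a GELU.
    act = gate * torch.sigmoid(alpha * gate)
    # Extra +1 bias on the linear branch before the elementwise product.
    return act * (up + 1)
```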
2) An attention sink for each of the attention heads
- Attention becomes: QK -> * 1/sqrt(d) -> mask -> concat with sink -> softmax -> drop sink -> matmul with V
- This is probably needed for sliding-window attention to work properly, since you won't have special tokens to 'allocate' attention to. arxiv.org/abs/2309.17453
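In code, that pipeline could look something like this (a sketch, not the official implementation):

```python
import torch
import torch.nn.functional as F

def attention_with_sink(q, k, v, sink_logit, mask):
    # q, k, v: [heads, seq, d]; sink_logit: [heads], one learned scalar per head.
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1)) / d**0.5          # QK^T * 1/sqrt(d)
    logits = logits.masked_fill(mask, float("-inf"))     # causal / sliding-window mask
    # Concat the sink as an extra "slot" every query can attend to.
    sink = sink_logit[:, None, None].expand(-1, logits.shape[1], 1)
    probs = F.softmax(torch.cat([logits, sink], dim=-1), dim=-1)
    # Drop the sink column before the matmul with V: it only absorbs probability mass.
    return probs[..., :-1] @ v
```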
3) Deeper rather than wider (compared to DeepSeek v3)
The ratio of width/num_layers:
gpt-oss = 2880/36 = 80
dsv3/kimi k2 = 7168/61 ≈ 118
Jul 21
How to train a State-of-the-art agent model.

Let's talk about the Kimi K2 paper.
The first section is about pretraining. Basic info about the model:
- essentially a (very sparse) MoE with MLA (DeepSeek V3 architecture)
- 15.5 T tokens (mix of human and synthetic)
- Muon + QK Clip
Scaling up Muon, they found that attention logits keep exploding.

Formally, they look at the max per head QK logit.

The 2 existing solutions are:
1) QK Norm (N/A for MLA)
2) Gemma 2-style logit softcapping (Gemma 3 got rid of it, and QK logits can still grow)
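For reference, the two existing fixes look roughly like this; the cap value below is illustrative rather than the exact Gemma 2 number.

```python
import torch

def qk_norm(q, k, eps=1e-6):
    # 1) RMS-normalize queries and keys so their dot products can't blow up
    #    (the thread notes this is N/A for MLA).
    q = q / (q.pow(2).mean(-1, keepdim=True) + eps).sqrt()
    k = k / (k.pow(2).mean(-1, keepdim=True) + eps).sqrt()
    return q, k

def softcap(logits, cap=50.0):
    # 2) Gemma 2-style soft capping squashes attention logits into (-cap, cap).
    return cap * torch.tanh(logits / cap)
```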
Jul 14
Really nice read. tldr + my notes:

1) Since they were planning to use Muon and 1T params, they didn't have the resources to try to tweak/improve DeepSeek v3's core arch
2) There is an internal (?) experiment that validated 384 experts (up from DSv3's 256). I don't fully understand the translation here, but I think they find that increasing the number of experts by 50% doesn't hurt scaling as long as total activated params are constant (so increased sparsity is fine)
Small analysis on the increased expert count. Since total activated params are constant, FLOPs during prefill are the same.

Decode is where the extra cost is incurred: you get a linear increase in cost with the increase in sparsity.
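A back-of-the-envelope version of that argument; the numbers below are illustrative, not the exact DSv3 / K2 configs.

```python
# Constant activated params: prefill is compute-bound and its FLOPs track
# the activated experts, so prefill cost doesn't change. Decode is
# memory-bound, so its cost tracks the total expert weights to hold/stream.
total_experts_old, total_experts_new = 256, 384   # before vs after the +50%
active_per_token = 8                              # top-k, held constant

prefill_cost_ratio = active_per_token / active_per_token    # 1.0x
decode_cost_ratio = total_experts_new / total_experts_old   # 1.5x, linear in sparsity
```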
Jun 11
Let's talk about the latest Mistral Reasoner paper.

Really cool and detailed end-to-end paper from the Mistral team
The 1st part talks about Mistral's changes to GRPO
- Remove the reference model (and corresponding KLD)
- Normalize losses by length per group
- Normalize advantages by minibatch rather than group statistics
- Decouple trust-region clipping to prevent entropy collapse
- Filter out zero-advantage groups
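A sketch of the advantage-normalization and zero-advantage-filtering tweaks, assuming the rest of the GRPO loss is unchanged:

```python
import numpy as np

def minibatch_normalized_advantages(rewards_per_group):
    # rewards_per_group: one NumPy array of rewards per prompt (per GRPO group).
    # Center by the group mean as usual (the GRPO baseline) ...
    centered = [r - r.mean() for r in rewards_per_group]
    # ... drop groups where every rollout got the same reward (zero advantage),
    # and normalize by the minibatch std instead of each group's own std.
    kept = [c for c in centered if np.any(c != 0)]
    flat = np.concatenate(kept)
    return [c / (flat.std() + 1e-6) for c in kept]
```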
The next part talks about their 4 types of rewards
Formatting (0.1 or 0):
- Must start with and contain <think> tags
- Must have \boxed{}
- Code must have ```

Correctness:
- For math, 0.9 if \boxed{} answer is correct
- For code, it's judged by test cases with timeout and memory limits
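As a sketch, the reward shaping might be wired up like this; `extract_boxed` is a simplified helper, and the real checks are surely fuzzier.

```python
import re

def extract_boxed(response: str) -> str:
    # Simplified helper: pull the contents of the last \boxed{...}.
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1] if matches else ""

def format_reward(response: str) -> float:
    # Formatting: 0.1 if the response opens with <think> and has a \boxed{} answer.
    ok = response.lstrip().startswith("<think>") and "\\boxed{" in response
    return 0.1 if ok else 0.0

def math_reward(response: str, gold: str) -> float:
    # Correctness for math: 0.9 if the boxed answer matches the reference.
    # Code is instead run against test cases with time and memory limits.
    return 0.9 if extract_boxed(response) == gold else 0.0
```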
Apr 4
Cohere's Command A report is an extremely extensive paper on how to train a modern LLM in 2025. But it's a model for very different, quite specific use cases.

Let's talk about it.
Important to start with some context about Cohere. They aren't trying to train frontier models like Meta/OpenAI/Anthropic. They focus on training models that are intelligent but aimed specifically at enterprise tasks like RAG and multilingualism, and that can still be served efficiently (on premise).
Their architecture is a pretty standard dense Transformer:
- SwiGLU, GQA
- 3:1 local/full attention.
- No positional embeddings on the full attention layers
- No bias
- Tied input and lm head matrices
The no-positional-embeddings choice is something that I've only seen them use. I suspect we will see more of this in 2025. huggingface.co/CohereForAI/c4…
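A sketch of how that interleaving could be configured; the 3:1 pattern and NoPE-on-full-layers are from the report, but the window size here is just illustrative.

```python
def attention_config(layer_idx: int, window: int = 4096) -> dict:
    # Every 4th layer is full attention with no positional embedding (NoPE);
    # the other three use sliding-window attention with RoPE.
    if (layer_idx + 1) % 4 == 0:
        return {"attention": "full", "positional_embedding": None}
    return {"attention": "sliding_window", "window": window,
            "positional_embedding": "rope"}
```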
