wh · Aug 11
Let's talk about the GLM 4.5 models.

The latest frontier open-weights model out of China (and possibly the best at the moment?), with quite a bit of detail in the paper.
The most interesting thing about the architecture is its load balancing: there is no auxiliary loss. Instead, they add a bias to the expert scores, and the bias is adjusted after each step to nudge under-loaded experts up and over-loaded experts down.

(figures from the paper they cite: https://arxiv.org/pdf/2408.15664)
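A minimal sketch of that aux-loss-free routing idea (my own paraphrase of the cited paper; the shapes, gating details, and the update rate `gamma` are illustrative assumptions, not GLM 4.5's actual values):

```python
import torch

def route_with_bias(scores, bias, top_k=8, gamma=0.001):
    """Aux-loss-free load balancing sketch.
    scores: [tokens, experts] router scores, bias: [experts].
    The bias only affects WHICH experts are picked, not the gate weights."""
    biased = scores + bias
    topk_idx = biased.topk(top_k, dim=-1).indices
    # Gate weights still come from the unbiased scores of the selected experts
    gates = torch.zeros_like(scores).scatter(
        -1, topk_idx, scores.gather(-1, topk_idx))
    # After the step: count how many tokens each expert received...
    load = torch.zeros(scores.shape[-1]).scatter_add(
        0, topk_idx.flatten(), torch.ones(topk_idx.numel()))
    # ...and push the bias up for under-loaded experts, down for over-loaded ones
    bias = bias + gamma * torch.sign(load.mean() - load)
    return gates, bias
```

The nice property is that the bias sits outside the gradient path, so balancing the load doesn't distort the gate values the way an auxiliary loss term can.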
Compared to DeepSeek V3 and K2, they make quite a few changes:
- Deeper but narrower
- GQA instead of MLA
- QK norm
- A higher ratio of attention heads to hidden dim

They say that doubling the attention heads doesn't improve loss but does improve downstream reasoning evals. This actually mirrors Kimi's finding that attention heads had negligible impact on loss. But I guess Kimi didn't eval on downstream benchmarks beyond just loss.
Pretraining data is 15T tokens. They bucket documents by quality and sample according to the buckets, with the highest-quality bucket doing 3.2 epochs.

For code, they use both GitHub and HTML documents and specifically train on fill-in-the-middle (FIM); a sketch of a typical FIM transform is below. They also do quality bucketing here.

For math and science, they train a quality classifier on the educational quality of papers.

I would have liked to see more detail here on how they determine quality etc.
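The paper doesn't spell out their exact FIM format, so here is a minimal sketch of the standard prefix-suffix-middle transform (Bavarian et al., 2022); the sentinel tokens and `fim_rate` are placeholders, not GLM 4.5's actual choices:

```python
import random

# Hypothetical sentinel tokens; the paper doesn't say which ones GLM 4.5 uses.
PRE, SUF, MID = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def to_fim(doc: str, fim_rate: float = 0.5) -> str:
    """With probability `fim_rate`, split the document and rearrange it so the
    model learns to generate the middle span given both prefix and suffix."""
    if len(doc) < 2 or random.random() > fim_rate:
        return doc  # leave as ordinary left-to-right text
    i, j = sorted(random.sample(range(len(doc)), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```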
For mid-training, they add another 7T tokens:
- Multiple files concatenated together to learn single-repo / multi-file dependencies
- Synthetic reasoning traces
- Long context

I don't think there is much new here. The only thing to speculate about is which reasoner they distilled from.
For training, they used Muon + cosine decay. They chose not to use WSD because they suspect it performs worse due to underfitting in the stable stage.

There is a ton of info here on hyperparameters, which would be useful as starting/reference points for sweeps.
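For anyone unfamiliar with the two schedules being compared, here is a minimal sketch (the warmup length, peak LR, and decay fraction are made-up illustrative values, not the paper's):

```python
import math

def cosine_decay(step, total_steps, peak_lr, min_lr=0.0, warmup=2000):
    """Cosine schedule (what GLM 4.5 uses): LR decays smoothly after warmup."""
    if step < warmup:
        return peak_lr * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))

def wsd(step, total_steps, peak_lr, warmup=2000, decay_frac=0.1):
    """Warmup-Stable-Decay: flat peak LR for most of training, then a short
    decay tail. GLM 4.5 suspects the flat "stable" stage underfits."""
    if step < warmup:
        return peak_lr * step / warmup
    decay_start = int(total_steps * (1 - decay_frac))
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / max(1, total_steps - decay_start)
```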
For SFT, they first train three expert models (Reasoning, Agent, Chat) and then distill them back into one model.

The experts are trained on a small set of cold-start reasoning data. Once the experts are done, they do normal SFT on samples from these expert models.

Here, they want the final model to be a hybrid reasoner, so they balance reasoning and non-reasoning data.

One interesting thing is that they (finally!) got rid of JSON for function calling in exchange for XML, because it is much easier to generate without needing to randomly escape characters.
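To see why the escaping matters, here is a toy comparison (the XML-ish schema is my own illustration, not necessarily GLM 4.5's exact format):

```python
# JSON: every quote and newline inside the argument has to be escaped,
# and the model must get the escaping exactly right for the call to parse.
json_call = '{"name": "write_file", "arguments": {"content": "print(\\"hi\\")\\n"}}'

# XML-ish: the argument is emitted verbatim between tags, no escaping needed.
xml_call = """<tool_call>
  <name>write_file</name>
  <arg name="content">
print("hi")
  </arg>
</tool_call>"""
```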
Some other details when sampling from the expert models:
- They find that training on only the 50% of data with longer response lengths led to better performance. There was even more gain from training on multiple samples from these prompts.
- A ton of rejection sampling.

They also SFT on agent data with a now-familiar recipe:
1) Collect real-world APIs and MCPs
2) Generate prompts
3) Generate fake API responses using LLMs
4) Filter trajectories
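A rough sketch of how that recipe might look end to end (all function names and the fixed turn limit are my assumptions; the paper only gives the four steps):

```python
def build_agent_sft_data(apis, llm, keep, max_turns=8, n_prompts=100):
    """Hedged sketch of the 4-step recipe above. `llm` is any text-in/text-out
    callable, `keep` is a trajectory filter (LLM judge, rules, etc.)."""
    data = []
    for api in apis:                                        # 1) real-world APIs / MCPs
        prompts = llm(f"Write {n_prompts} user tasks solvable with: {api}")
        for prompt in prompts.splitlines():                 # 2) generate prompts
            trajectory, message = [], prompt
            for _ in range(max_turns):
                call = llm(f"Tools: {api}\nTask so far: {message}\nNext tool call:")
                obs = llm(f"Pretend you are the API {api}. Respond to: {call}")  # 3) fake API response
                trajectory.append((call, obs))
                message = obs
            if keep(prompt, trajectory):                    # 4) filter trajectories
                data.append((prompt, trajectory))
    return data
```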
The next section is on reasoning RL (GRPO, no KL term):
1) Two-stage RL with hard prompts in the final stage is best (512 rollouts!)
2) Since the SFTed model already generates long CoT, don't start training by cutting its context length (it cannot recover the performance). Just continue training at the full 64K sequence length.
3) To scale temperature and deal with entropy collapse, they validate a range of temperatures on a val set during training and choose the maximum value with less than a 1% performance drop.
4) For code, token-level loss is better than sequence-level loss (see the sketch below).
5) Science RL benefits most from data quality.

(Side note: I think it's funny they call sequence mean "conventional" when I'm pretty sure it's a mix, with some implementations still using token mean, and the recent papers about this were likely published after they were done.)
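For context on point 4, here is a minimal sketch of the two aggregation modes (my own illustration; `per_token_obj` stands in for the already-clipped per-token GRPO objective, which I'm not reproducing here):

```python
import torch

def grpo_policy_loss(per_token_obj, mask, mode="token"):
    """per_token_obj: clipped ratio * advantage, shape [batch, seq_len].
    mask: 1.0 for real tokens, 0.0 for padding.
    - "sequence": average within each sequence, then across sequences,
      so long responses contribute less per token.
    - "token": average over all tokens in the batch (what GLM 4.5 prefers
      for code RL)."""
    if mode == "sequence":
        per_seq = (per_token_obj * mask).sum(-1) / mask.sum(-1).clamp(min=1)
        return -per_seq.mean()
    return -(per_token_obj * mask).sum() / mask.sum().clamp(min=1)
```

With token mean, every token carries equal weight regardless of response length, which matters when code rollouts vary a lot in length.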
For agentic RL, they train specifically on deep-research / SWE-bench style problems and, interestingly, say that it transfers.

The info about their deep-research dataset curation is a little sparse here.

They also do this iterative distillation thing which, to be honest, I don't quite understand, so I'm just putting it here. (Is it just: RL to get model 1, then use model 1 to generate cold-start data that trains the base for what becomes model 2?)
They also do a bunch of general RLHF/RLAIF-style RL. Again, I would have liked to see examples or broad categories of the "scoring rubrics" they used.

They RL on:
- 5000 diverse prompts (I have no clue what the primary/secondary/tertiary categories are)
- An IF-eval style dataset (+ a reward model, a critique model and rules, all to prevent hacking)
- Function calling, rewarding format correctness in both single and multi turn (where the user is simulated by an LLM)
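A toy sketch of what a format-correctness reward could look like (the tag names follow my earlier illustrative XML schema, not GLM 4.5's actual one):

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 only if every tool call in the completion is well-formed;
    a real implementation would also validate arguments against the schema."""
    calls = re.findall(r"<tool_call>(.*?)</tool_call>", completion, re.S)
    if not calls:
        return 0.0
    for call in calls:
        if not re.search(r"<name>\s*\w+\s*</name>", call):
            return 0.0  # every call must name a function
    return 1.0
```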
Next section is on their RL infra + stack of Megatron and SGLang (github.com/THUDM/slime).

They find that "different RL tasks benefit from different scheduling approaches":
- Code and math: keeping the training and inference engines on the same workers is best, since this maximizes utilisation.
- Agent: keep training and inference separate to maximise environment throughput. This is pretty standard async RL where you go slightly off-policy while waiting for weights to sync (especially on rollouts with a very large variety of lengths).

Some other fun details:
- BF16 training, FP8 inference
- The data buffer uses what I think are standard OpenAI-compatible endpoints to run prompts

(Very similar to Magistral.)
For evaluation, they first evaluate the base models on English, Chinese, code and math.

It actually looks like K2 is the best base model here and is likely the best option for fine-tuning if you have the compute for it.
On agentic evaluation, they only lose to o3. In fact, the gap between them and o3 is bigger than the gap between them and the next best model (Opus).

These benchmarks are actually pretty good. But if you ask me to weigh the individual scores by how much I care about each benchmark, I'd put o3 first, with GLM roughly on par with the Claudes.
I don't care for any of these reasoning benchmarks and you shouldn't either. The next set is slightly better than the previous set but still generally low signal.
For coding, I think it's similar to K2: still behind the Claudes. They also evaluate on a new benchmark which measures how good models are at using Claude Code (huggingface.co/datasets/zai-o… love this).

It has the highest tool-calling success rate, which is great. But the fact that Claude still has the lowest token usage (even though the OSS models are a fraction of the per-token cost) makes me think the OSS models could be doom-looping or just needing to correct more wrong decisions.
Other miscellaneous evaluations. I don't really know what to take from here, especially since we don't even know what the prompts or data look like, so take what you wish.
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models @Zai_org

arxiv.org/abs/2508.06471
First, the evals say it's a good model; I have used it and think it's a good model, and people I trust have used it and said it's a good model.

The paper is also great. I think they focused more on breadth here rather than in-depth details. I would have liked more info on certain parts, but I guess this paper can serve as a sort of general recipe or glossary.

I do wonder if there's anything special here, because it didn't feel like they did anything unique beyond good execution. But then again, if there was, would they put it in the paper?

One thing I liked is the attention head finding: it's pretty cool to see the loss results match up with Kimi's experiments, but always check downstream results.

Another thing I wonder is whether doing single-turn RLVR first, before the agentic RL, helps performance, and whether this could be a further way to push K2, for example.

The main thing I want to know about is the hybrid reasoner. There wasn't much discussion of evals related to this hybrid behavior or the decisions behind it. I think it's important, since Qwen got rid of the hybrid in exchange for two separate models and OpenAI kind of did the same thing for GPT-5. I wonder if the team has any insights here.

