Latest Twitter Threads by @nrehiew_ on Thread Reader App

Aug 11 • 20 tweets • 11 min read

Let's talk about the GLM 4.5 models.

The latest frontier open weights model out of China (and possibly the best at the moment?) with quite a lot of detail in the paper.

The main interesting thing about the architecture is its load balancing. There is no aux loss; instead they use expert biases, where a bias is added to the expert scores. The bias is then adjusted after each step to correct for experts that are over- or under-loaded

(figures from the paper they cite: https://arxiv.org/pdf/2408.15664)
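
A minimal sketch of the bias-based balancing described above, not the GLM 4.5 implementation. The function names, the fixed `step_size`, and the sign-based update rule are assumptions made for illustration; the sketch also assumes the bias only affects which experts get selected.

```python
def route_top_k(scores, bias, k):
    """Select k experts for one token.

    The bias only influences which experts are selected; it is assumed
    that the gating weights themselves still come from the raw scores.
    """
    biased = [s + b for s, b in zip(scores, bias)]
    ranked = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)
    return ranked[:k]


def update_bias(bias, load, step_size=0.1):
    """Nudge each expert's bias after a training step.

    load[i] is the number of tokens routed to expert i during the step.
    Overloaded experts get a lower bias, underloaded ones a higher bias.
    """
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, tokens in zip(bias, load):
        if tokens > mean_load:
            new_bias.append(b - step_size)
        elif tokens < mean_load:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias
```

Because the correction lives in the router rather than in the loss, no balancing term competes with the language modelling gradient.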

Aug 6 • 5 tweets • 2 min read

Architectural notes about gpt-oss from reading the official implementation.

1) Unconventional SwiGLU.
- Inputs are clamped to 7
- extra 1 bias on linear
- Scaled sigmoid which becomes a GELU basically

Probably needed for gradient flow since it's a deep network
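
A scalar sketch of the three tweaks listed above, written under assumptions rather than copied from the official code: the scale `alpha = 1.702` (the usual sigmoid approximation of GELU), clamping the gate only from above, and the function names are all illustrative guesses.

```python
import math


def stable_sigmoid(x):
    """Sigmoid that avoids overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def clamped_swiglu(gate, linear, alpha=1.702, limit=7.0):
    """Gated activation for one (gate, linear) pair of pre-activations."""
    gate = min(gate, limit)                    # assumed: gate clamped from above
    linear = max(-limit, min(linear, limit))   # assumed: linear clamped both ways
    swish = gate * stable_sigmoid(alpha * gate)  # scaled sigmoid, close to GELU
    return swish * (linear + 1.0)              # the extra +1 on the linear branch


def gelu(x):
    """Exact GELU, included only to compare against the scaled sigmoid."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

With the linear branch at 0, `clamped_swiglu(x, 0.0)` reduces to `x * sigmoid(1.702 * x)`, which tracks `gelu(x)` closely for moderate `x`.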

2) An attention sink for each of the attention heads
- Attention becomes: QK -> * 1/sqrt(d) -> Mask -> Concat with sink -> Softmax -> remove sink -> matmul V
- This is probably needed for sliding window to work properly, since you won't have special tokens to 'allocate' attention to. arxiv.org/abs/2309.17453
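
The pipeline above can be sketched for a single query and a single head. This is a plain-Python illustration, not the gpt-oss code; `sink_logit` stands in for the learned per-head sink value and is assumed to be a finite scalar.

```python
import math


def softmax_with_sink(scores, sink_logit):
    """Softmax over [scores..., sink], then drop the sink column.

    The returned weights sum to less than 1: the sink absorbs the rest.
    """
    logits = scores + [sink_logit]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps[:-1]]


def attend(query, keys, values, sink_logit, mask):
    """QK -> scale -> mask -> concat sink -> softmax -> remove sink -> matmul V."""
    d = len(query)
    scores = []
    for key, visible in zip(keys, mask):
        if visible:
            dot = sum(q * k for q, k in zip(query, key))
            scores.append(dot / math.sqrt(d))
        else:
            scores.append(float("-inf"))  # masked positions get zero weight
    weights = softmax_with_sink(scores, sink_logit)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
```

The point of the sink is that a head can put most of its probability mass on "nothing" instead of being forced to spread it over the tokens in the window.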

Jul 21 • 18 tweets • 9 min read

How to train a State-of-the-art agent model.

Let's talk about the Kimi K2 paper.

The first section is about Pretraining. Basic info about the model:
- essentially a (vvv sparse) MoE with MLA (DeepSeek V3 architecture)
- 15.5 T tokens (mix of human and synthetic)
- Muon + QK Clip

Jul 14 • 6 tweets • 2 min read

Really nice read. tldr + my notes:

1) Since they were planning to use Muon and 1T params, they didn't have the resources to try and tweak/improve DeepSeek v3's core arch

https://twitter.com/Yulun_Du/status/1944582056349995111

2) There is an internal (?) experiment that validated 384 experts (up from 256 in DSv3). I don't fully understand the translation here, but I think they find that increasing the number of experts by 50% doesn't impact scaling as long as the total activated parameters stay constant (so increased sparsity is fine)

Jun 11 • 17 tweets • 6 min read

Let's talk about the latest Mistral Reasoner paper.

Really cool and detailed end to end paper from the Mistral team

The 1st part talks about Mistral's changes to GRPO
- Remove the reference model (and corresponding KLD)
- Normalize losses by length per group
- Normalize advantages by minibatch rather than group statistics
- Decoupling trust region clipping to prevent entropy collapse
- Filter out zero advantage groups
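
A rough sketch of two of the changes listed above: filtering zero-advantage groups and normalizing advantages with minibatch statistics. The function name, the `eps` term, and the order of operations (filter, center per group, then normalize over the minibatch) are assumptions for illustration, not taken from the paper.

```python
import statistics


def minibatch_advantages(groups, eps=1e-8):
    """Compute advantages for a minibatch of reward groups.

    groups is a list of lists: the rewards of all completions sampled
    for one prompt form one group.
    """
    # Groups where every reward is identical carry no learning signal.
    kept = [g for g in groups if max(g) != min(g)]

    # Center each reward on its own group's mean.
    centered = []
    for g in kept:
        mean_reward = statistics.mean(g)
        centered.append([r - mean_reward for r in g])

    flat = [a for g in centered for a in g]
    if not flat:
        return []

    # Normalize with statistics of the whole minibatch, not of each group.
    mu = statistics.mean(flat)
    sigma = statistics.pstdev(flat)
    return [[(a - mu) / (sigma + eps) for a in g] for g in centered]
```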

Apr 4 • 28 tweets • 11 min read

Cohere's Command A report is an extremely extensive paper on how to train a modern LLM in 2025. But it's a model for very different but specific use cases.

Let's talk about it

Important to start with some context about Cohere. They aren't trying to train frontier models like Meta/OpenAI/Anthropic. They focus on training models that are intelligent but specifically for enterprise tasks like RAG and multilingualism which can still be efficiently served (on premise)

Jan 21 • 17 tweets • 7 min read

How to train a State-of-the-art reasoner.

Let's talk about the DeepSeek-R1 paper and how DeepSeek trained a model that is at frontier Sonnet/o1 level.

Quick overview on what has been done to train an o1-like model:

- Process and Outcome Reward Models. This approach does RL and trains these 2 models to give reward/signal at the step or answer level. Given that Qwen trained a SOTA PRM, we can assume they do this.
- LATRO (arxiv.org/pdf/2411.04282) basically treats the CoT as a latent. Given prompt + cot, a good cot will lead to high likelihood of the correct answer
- SFT on reasoning traces.

DeepSeek gets rid of all this complexity and simply does RL on questions with verifiable rewards. TULU 3 style (arxiv.org/abs/2411.15124)

Jan 5 • 6 tweets • 2 min read

short thread:

1) i think the most important thing is to find a niche you enjoy. It doesn't need to be the hot topics (LLMs etc), it could just be a paper you see people discuss which piques your interest. This way, a reading list won't feel like a todo list

https://twitter.com/swyx/status/1875606586569453592

2) People seem to have the impression that the math (notation) is impossible to understand without proper background. I beg to differ. Some papers really just use math notation for notation's sake and the core intuition of the paper is actually really grokkable

Dec 26, 2024 • 24 tweets • 10 min read

How to train a 670B parameter model.

Let's talk about the DeepSeek v3 report + some comparisons with what Meta did with Llama 405B

This is the image that has been going around so you probably know how nuts this is, but some added context is that Llama 3 405B was trained on 16K H100s

https://x.com/teortaxesTex/status/1872253671989551473

Dec 14, 2024 • 16 tweets • 6 min read

A new paper from Meta gets rid of tokenization by learning a transformer over raw bytes

A quick primer on why this is difficult and why we cannot just train on bytes naively. If we wanted to get rid of arbitrary tokenization/segmentation of input sequences, training on bytes is not that straightforward.

Obviously, training on bytes is a significantly harder problem to solve. Plus, sequences of bytes are orders of magnitude longer than what we have currently, and it is extremely inefficient to model this. As much as we hate BPE, tokenization is compression, which allows us to have manageable sequence lengths
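
A tiny illustration of the length gap. Whitespace-separated words are used here as a crude stand-in for tokens, which is an assumption: a real BPE tokenizer would give different counts.

```python
def sequence_lengths(text):
    """Return (number of UTF-8 bytes, number of whitespace-separated words)."""
    num_bytes = len(text.encode("utf-8"))
    num_words = len(text.split())
    return num_bytes, num_words


# A three-word phrase already takes 27 byte-level positions.
print(sequence_lengths("tokenization is compression"))
```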

Dec 13, 2024 • 15 tweets • 8 min read

This new extremely detailed paper from Meta treats language modelling as modelling abstract semantic concepts rather than tokens and finds diffusion to work well

tldr: they train a concept encoder and decoder to map between readable word space and "concept" space

The left figure is what they call "reasoning in embedding" space. A sentence with 7 words can be summarized into 2 by clustering them based on semantic similarity. Note that this would theoretically work even if the sentences are in diff languages or not even "sentences" at all but other modalities like images

The right figure is their architecture. The main similarity that comes to mind is Latent Diffusion Models. You map your input to some latent dimension, perform modelling on that and map back out

Dec 10, 2024 • 11 tweets • 5 min read

This paper from Meta proposes a method to not have the model reason in token space but directly model its reasoning using its hidden state. The authors also do a lot of cool interpretability work in this paper.

Aesthetically, I like it a lot and it's simple to implement

https://twitter.com/nrehiew_/status/1859579413865599170

The idea is pretty straightforward. Instead of mapping back out to token space using lm_head, just concat the output hidden state with the input_embeddings
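
A toy sketch of that loop. `model` is a hypothetical callable that takes the list of input embeddings and returns the hidden state at the last position; `toy_model` is a stand-in used only so the example runs, not a real transformer.

```python
def continuous_thought_rollout(model, input_embeddings, num_thoughts):
    """Feed the last hidden state back in as the next input embedding.

    No lm_head and no token sampling happen between the steps.
    """
    embeds = list(input_embeddings)
    for _ in range(num_thoughts):
        hidden = model(embeds)   # hidden state at the last position
        embeds.append(hidden)    # concat with the input embeddings
    return embeds


def toy_model(embeds):
    """Stand-in 'model': averages all input vectors."""
    dim = len(embeds[0])
    return [sum(e[j] for e in embeds) / len(embeds) for j in range(dim)]
```

This only works directly because the hidden state and the input embeddings have the same dimension.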

Dec 2, 2024 • 10 tweets • 4 min read

The 6th highest scored paper (7,8,8,8) going into NeurIPS 2024

tldr: They introduce a new image generation approach that uses an autoregressive transformer to predict increasingly larger latent feature maps starting from a single pixel feature map to the final latent/image

Traditional autoregressive approaches use a tokenisation method such as vqvae or a vit style raster scan. The paper claims that an image patch is naturally related to all patches around it but these approaches enforce a unidirectional approach.

Dec 1, 2024 • 10 tweets • 5 min read

This paper got an honourable mention at ICLR 2024 and the first author worked on o1 and was the creator of LoRA

tldr: they propose a method to derive the hidden cot that leads to an answer, given a question - using bayesian inference

In language modelling given a prompt x, we want to sample from the distribution q(y|x) to get the token sequence y with the highest conditional probability.

Directly sampling from q is intractable, so we approximate q by sampling from the tractable distribution on a token level.

In the context of CoT, we typically have a prompt prefix + cot + ans suffix. We can train a model to sample the cot z conditioned on both the prompt prefix and the ans suffix
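
In symbols, a sketch of the two steps above. The notation is an assumption: $x$ is the prompt, $y$ the answer of length $T$, $z$ the hidden CoT.

```latex
% Token-level factorization that makes sampling tractable
q(y \mid x) = \prod_{t=1}^{T} q(y_t \mid x, y_{<t})

% Posterior over the hidden CoT z, given both prompt and answer (Bayes' rule)
q(z \mid x, y) = \frac{q(z \mid x)\, q(y \mid x, z)}{\sum_{z'} q(z' \mid x)\, q(y \mid x, z')}
```

The sum over all possible chains $z'$ in the denominator is what makes the posterior intractable to sample from directly, hence training a model to sample from it.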

Nov 28, 2024 • 12 tweets • 5 min read

The 3rd highest scored paper at ICLR 2025 with 6, 10, 10, 10

tldr: they introduce a provable theory for why adversarial llm jailbreaks work. Then, with data augmentation and a new fine-tuning objective, they significantly reduce the usefulness of existing jailbreak methods

They define "Shallow Safety Alignment" where llms are likely to refuse unsafe prompts primarily through a refusal prefix "I cannot" etc.

Unsurprisingly, prefilling a refusal prefix even in unaligned base models can dramatically lower harmfulness

Nov 2, 2024 • 12 tweets • 6 min read

I spent some time playing with the @AnthropicAI Token Counting API after its release yesterday

This is Claude's unique chat template, its digit tokenization and how it handles images/pdfs

tldr in image

Claude's chat template is unlike basically all other frontier models. The key idea is that it treats a conversation as a series of user-assistant message pairs

It also handles consecutive user/assistant messages by concatenating them on newlines and injects a special token if provided with an Assistant message as the first message

Anthropic provides prefilling in its API and you can see how this is supported in the prompt template since the assistant part of the template is never "closed" the same way OpenAI/ChatML does it

Jul 17, 2024 • 8 tweets • 3 min read

Visualising the loss landscapes of GPT2 and Mamba

I really like these types of visualizations :)

Another example with 3 models trained on CIFAR 10

Jul 5, 2024 • 4 tweets • 2 min read

Last week, I started building @karpathy's micrograd in Rust

By the end of the week, I ended up with a Tensor library with autograd support using only the Rust standard library

I learnt a lot about PyTorch through this process so I wrote about it here :)

I try to peel away many of PyTorch's abstractions, like how Tensors are implemented, how to think about broadcasting and how to build intuitions around backpropagation

Hopefully, everyone regardless of PyTorch familiarity will find it interesting !
nrehiew.github.io/blog/pytorch/
