wh
eng primarily, ml mostly, research previously
Apr 4 28 tweets 11 min read
Cohere's Command A report is an extremely thorough paper on how to train a modern LLM in 2025. But the model itself targets very different, specific use cases.

Let's talk about it.

Important context about Cohere first: they aren't trying to train frontier models like Meta/OpenAI/Anthropic. They focus on training models that are intelligent but built specifically for enterprise tasks like RAG and multilingual use, and that can still be served efficiently (on premises).
Jan 21 17 tweets 7 min read
How to train a state-of-the-art reasoner.

Let's talk about the DeepSeek-R1 paper and how DeepSeek trained a model that is at frontier Sonnet/o1 level.

Quick overview of what has been done to train an o1-like model:

- Process and Outcome Reward Models (PRM/ORM). This approach does RL, training these two models to give reward/signal at the step or answer level. Given that Qwen trained a SOTA PRM, we can assume they do this.
- LATRO (arxiv.org/pdf/2411.04282), which basically treats the CoT as a latent: given prompt + CoT, a good CoT will lead to a high likelihood of the correct answer.
- SFT on reasoning traces.

DeepSeek gets rid of all this complexity and simply does RL on questions with verifiable rewards, TULU 3 style (arxiv.org/abs/2411.15124).
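To make "verifiable rewards" concrete, here is a minimal sketch of what such a reward function can look like. The \boxed{} convention and the regex are illustrative assumptions, not DeepSeek's actual code (their pipeline also rewards format, e.g. thinking tags):

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final boxed answer matches the ground truth.

    A minimal sketch of the verifiable-rewards idea: no learned reward model,
    just a rule-based check against a known answer.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(verifiable_reward(r"so the answer is \boxed{42}", "42"))  # 1.0
```

Because the reward is computed by a verifier rather than a model, there is no reward model to hack or to train, which is a big part of the simplification.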
Jan 5 6 tweets 2 min read
short thread:

1) I think the most important thing is to find a niche you enjoy. It doesn't need to be the hot topic (LLMs etc.); it could just be a paper you see people discuss that piques your interest. This way, a reading list won't feel like a todo list.

2) People seem to have the impression that the math (notation) is impossible to understand without the proper background. I beg to differ. Some papers really just use math notation for notation's sake, and the core intuition of the paper is actually really grokkable.
Dec 26, 2024 24 tweets 10 min read
How to train a 670B parameter model.

Let's talk about the DeepSeek v3 report, with some comparisons to what Meta did with Llama 405B.

This is the chart that has been going around, so you probably know how nuts the numbers are, but some added context: Llama 3 405B was trained on 16K H100s.

Dec 14, 2024 16 tweets 6 min read
A new paper from Meta gets rid of tokenization by learning a transformer over raw bytes.

A quick primer on why this is difficult: if we want to get rid of arbitrary tokenization/segmentation of input sequences, training on bytes naively is not straightforward.

Training on bytes is a significantly harder problem. Byte sequences are orders of magnitude longer than what we work with currently, and it is extremely inefficient to model them directly. As much as we hate BPE, tokenization is compression, which allows us to keep sequence lengths manageable.
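A quick back-of-envelope illustration of the length blowup. Whitespace splitting is a crude stand-in for BPE here, just to show the ratio, not a real tokenizer:

```python
text = "Tokenization is compression: it trades vocabulary size for sequence length."

byte_seq = list(text.encode("utf-8"))  # one "token" per byte
word_seq = text.split()                # crude stand-in for BPE tokens

# byte-level sequences are several times longer, and attention cost
# grows quadratically with that length
print(len(byte_seq), len(word_seq))
```

For non-Latin scripts the gap is even worse, since a single character can be 2-4 UTF-8 bytes.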
Dec 13, 2024 15 tweets 8 min read
This new extremely detailed paper from Meta treats language modelling as modelling abstract semantic concepts rather than tokens and finds diffusion to work well

tldr: they train a concept encoder and decoder to map between readable word space and "concept" space.

The left figure shows what they call "reasoning in embedding space": a sentence with 7 words can be summarized into 2 by clustering them based on semantic similarity. Note that this would theoretically work even if the sentences are in different languages, or not "sentences" at all but other modalities like images.

The right figure is their architecture. The main similarity that comes to mind is Latent Diffusion Models: you map your input to some latent space, do the modelling there, and map back out.
Dec 10, 2024 11 tweets 5 min read
This paper from Meta proposes a method where the model doesn't reason in token space but instead models its reasoning directly with its hidden states. The authors also do a lot of cool interpretability work in this paper.

Aesthetically, I like it a lot and it's simple to implement.

The idea is pretty straightforward: instead of mapping back out to token space via lm_head, just concatenate the output hidden state onto the input embeddings.
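A dependency-free toy sketch of that loop. All names here are mine: `toy_transformer` (a running average) stands in for the real model, which would feed its actual last hidden state back in as the next input embedding:

```python
def toy_transformer(embeddings):
    """Stand-in for a transformer: returns one 'hidden state' per position.

    Here each hidden state is just the mean of the prefix so the example
    stays dependency-free; a real model would run attention + MLP blocks.
    """
    hidden = []
    for i in range(len(embeddings)):
        prefix = embeddings[: i + 1]
        hidden.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return hidden

def latent_reasoning(embeddings, num_thoughts):
    """Append `num_thoughts` continuous thoughts: each step feeds the last
    hidden state straight back as an input embedding -- no lm_head, no
    sampling a token, no embedding lookup."""
    seq = list(embeddings)
    for _ in range(num_thoughts):
        last_hidden = toy_transformer(seq)[-1]
        seq.append(last_hidden)  # the "thought" lives in hidden-state space
    return seq

seq = latent_reasoning([[1.0, 0.0], [0.0, 1.0]], num_thoughts=2)
print(len(seq))  # 2 prompt embeddings + 2 latent thoughts = 4
```

The appeal is exactly what the thread says: the change is tiny (skip the hidden-state -> token -> embedding round trip), but the reasoning is no longer constrained to discrete tokens.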
Dec 2, 2024 10 tweets 4 min read
The 6th highest-scored paper (7, 8, 8, 8) going into NeurIPS 2024

tldr: They introduce a new image generation approach that uses an autoregressive transformer to predict increasingly larger latent feature maps, starting from a single-pixel feature map up to the final latent/image.

Traditional autoregressive approaches use a tokenization method such as VQ-VAE or a ViT-style raster scan. The paper argues that an image patch is naturally related to all the patches around it, yet these approaches enforce a unidirectional order.
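A sketch of the coarse-to-fine schedule this implies. The scale list below is the kind of schedule the paper reports for 256px models, but treat the exact numbers as illustrative:

```python
# Next-scale prediction: each autoregressive step emits an entire k x k token
# map (all its tokens at once), conditioned on every coarser map before it,
# instead of emitting one token per step in raster order.

scales = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]  # side length of each token map

tokens_per_step = [k * k for k in scales]   # tokens emitted at each AR step
total_tokens = sum(tokens_per_step)

# 10 autoregressive steps in total, versus 256 steps for a plain
# raster scan over the final 16 x 16 latent
print(len(scales), total_tokens)
```

Within a scale, tokens can attend to each other bidirectionally; only the scale-to-scale direction is causal, which is how the approach sidesteps the unidirectional-order complaint.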
Dec 1, 2024 10 tweets 5 min read
This paper got an honourable mention at ICLR 2024, and the first author worked on o1 and was the creator of LoRA.

tldr: they propose a method to derive the hidden CoT that leads to an answer, given a question, using Bayesian inference.

In language modelling, given a prompt x, we want to sample from the distribution q(y|x) to get the token sequence y with the highest conditional probability.

Directly sampling from q is intractable, so we approximate it by sampling token by token from a tractable distribution.

In the context of CoT, we typically have a prompt prefix + CoT + answer suffix. We can train a model to sample the CoT z conditioned on both the prompt prefix and the answer suffix.
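In symbols (notation mine, matching the setup above), the object being approximated is the posterior over hidden chains of thought:

```latex
% The model factorizes as p(z, y \mid x) = p(z \mid x)\, p(y \mid x, z),
% where x is the prompt, z the latent CoT, and y the answer.
% By Bayes' rule, the CoT given both prompt and answer is
p(z \mid x, y) = \frac{p(z \mid x)\, p(y \mid x, z)}{p(y \mid x)}
\propto p(z \mid x)\, p(y \mid x, z),
% which is the distribution the method trains a sampler to approximate:
% good CoTs are exactly those that make the observed answer likely.
```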
Nov 28, 2024 12 tweets 5 min read
The 3rd highest-scored paper at ICLR 2025, with scores of 6, 10, 10, 10

tldr: they introduce a provable theory for why adversarial LLM jailbreaks work. Then, with data augmentation and a new fine-tuning objective, they significantly reduce the usefulness of existing jailbreak methods.

They define "shallow safety alignment": LLMs refuse unsafe prompts primarily through a refusal prefix ("I cannot", etc.).

Unsurprisingly, prefilling a refusal prefix, even in unaligned base models, can dramatically lower harmfulness.
Nov 2, 2024 12 tweets 6 min read
I spent some time playing with the @AnthropicAI Token Counting API after its release yesterday

This is Claude's unique chat template, its digit tokenization and how it handles images/pdfs

Claude's chat template is unlike basically all other frontier models'. The key idea is that it treats a conversation as a series of user-assistant message pairs.

It also handles consecutive user/assistant messages by concatenating them on newlines, and it injects a special token if the first message provided is an Assistant message.

Anthropic supports prefilling in its API, and you can see how in the prompt template: the assistant part of the template is never "closed" the way OpenAI/ChatML closes it.
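A reconstruction of the behavior described above. The `Human:`/`Assistant:` labels follow Anthropic's legacy text-completions format; the real chat template's special tokens are internal, so treat this as a sketch of the merging and prefill mechanics, not the actual template:

```python
def render(messages):
    """messages: list of (role, text) pairs with role in {'user', 'assistant'}."""
    # 1) concatenate consecutive same-role messages on newlines
    merged = []
    for role, text in messages:
        if merged and merged[-1][0] == role:
            merged[-1] = (role, merged[-1][1] + "\n" + text)
        else:
            merged.append((role, text))

    # 2) lay the conversation out as alternating turns
    prompt = ""
    for role, text in merged:
        label = "Human" if role == "user" else "Assistant"
        prompt += f"\n\n{label}: {text}"

    # 3) leave the assistant turn "open": if the last message is an assistant
    #    prefill, the model simply continues it; otherwise open a fresh turn
    if not merged or merged[-1][0] != "assistant":
        prompt += "\n\nAssistant:"
    return prompt

print(render([("user", "hi"), ("user", "there"), ("assistant", "Sure,")]))
```

Step 3 is the prefilling trick: because nothing ever closes the assistant turn, ending the message list with a partial assistant message forces the model to complete your prefix.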
Jul 17, 2024 8 tweets 3 min read
Visualising the loss landscapes of GPT-2 and Mamba

I really like these types of visualizations :)

Another example: 3 models trained on CIFAR-10.
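For reference, figures like these are typically made by evaluating the loss on a 2D slice of weight space spanned by two random directions (the Li et al. 2018 recipe, minus their filter normalization). A framework-free toy sketch with a quadratic loss standing in for a real model:

```python
import random

def loss(w):
    """Stand-in for a model's training loss; a real plot would run the
    network on a batch of data at these perturbed weights."""
    return sum(x * x for x in w)

random.seed(0)
theta = [0.0] * 8  # "trained" weights (the minimum of this toy loss)
d1 = [random.gauss(0, 1) for _ in theta]  # two random directions
d2 = [random.gauss(0, 1) for _ in theta]

# evaluate loss(theta + a*d1 + b*d2) on a small grid; plotting this as a
# contour/surface map gives the kind of landscape figure in the thread
grid = [[loss([t + a * x + b * y for t, x, y in zip(theta, d1, d2)])
         for a in (-1, -0.5, 0, 0.5, 1)]
        for b in (-1, -0.5, 0, 0.5, 1)]

print(grid[2][2])  # 0.0 at the center: the trained model itself
```

The trained weights sit at the center of the grid, so flat vs. sharp basins around that point show up directly in the contours.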
Jul 5, 2024 4 tweets 2 min read
Last week, I started building @karpathy's micrograd in Rust

By the end of the week, I ended up with a Tensor library with autograd support using only the Rust standard library

I learnt a lot about PyTorch through this process, so I wrote about it here :)

I try to peel away many of PyTorch's abstractions: how Tensors are implemented, how to think about broadcasting, and how to build intuition around backpropagation.

Hopefully everyone, regardless of PyTorch familiarity, will find it interesting!
nrehiew.github.io/blog/pytorch/
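As a taste of the broadcasting part: NumPy/PyTorch align shapes from the right, and two dimensions are compatible when they are equal or one of them is 1. A small sketch of that rule (the function name is mine):

```python
def broadcast_shape(a, b):
    """Result shape of broadcasting shapes `a` and `b`, per the NumPy rule:
    pad the shorter shape with 1s on the left, then each aligned pair of
    dims must be equal or contain a 1; the result takes the larger dim."""
    a, b = tuple(a), tuple(b)
    a = (1,) * (len(b) - len(a)) + a  # left-pad the shorter shape
    b = (1,) * (len(a) - len(b)) + b
    result = []
    for x, y in zip(a, b):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        result.append(max(x, y))
    return tuple(result)

print(broadcast_shape((3, 1, 5), (4, 5)))  # (3, 4, 5)
```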