Tanishq Kumar
incoming CS PhD student @Stanford, prev math undergrad @Harvard
Mar 4 • 8 tweets • 3 min read
I've been working on a new LLM inference algorithm.

It's called Speculative Speculative Decoding (SSD) and it's up to 2x faster than the strongest inference engines in the world.

Collab w/ @tri_dao @avnermay. Details in thread. arXiv: arxiv.org/pdf/2603.03251

code: github.com/tanishqkumar/s…
Apr 16, 2025 • 4 tweets • 2 min read
trained a nanoGPT? feeling behind before o4-mini?

🚨🚨i'm open-sourcing beyond-nanoGPT, an internal codebase to help people go from LLM basics to research-level understanding. 🚨🚨

it contains thousands of lines of from-scratch, annotated pytorch implementing advanced fundamentals, from speculative decoding to vision/diffusion transformers to linear attention and much, much more.

[1/4] 👇 GitHub:

there's a wealth of amazing tutorials and introductions to LLMs, but many friends asked how to go from there to actually attempting research.

for me, implementing fundamental modern techniques gave me the confidence to bridge the gap.

[2/4] github.com/tanishqkumar/b…
Nov 11, 2024 • 7 tweets • 4 min read
[1/7] New paper alert! Heard about the BitNet hype, or that Llama-3 is harder to quantize? Our new work studies both! We formulate scaling laws for precision, across both pre- and post-training: arxiv.org/pdf/2411.04330. TL;DR:

- Models become harder to quantize post-training the more they are overtrained on data, so that eventually additional pretraining data can be actively harmful if you quantize after training!
- The effects of putting weights, activations, or attention in varying precisions during pretraining are consistent and predictable, and fitting a scaling law suggests that pretraining at high (BF16) and next-generation (FP4) precisions may both be suboptimal design choices!

Joint work with @ZackAnkner @bfspector @blake__bordelon @Muennighoff @mansiege @CPehlevan @HazyResearch @AdtRaghunathan.

[2/7] We first study the common technique of post-train quantizing model weights, finding that the longer you train / the more data seen during pretraining, the more sensitive the model becomes to quantization at inference time, explaining why Llama-3 may be harder to quantize.
In fact, this loss degradation is roughly a power law in the token/parameter ratio seen during pretraining, so that you can predict in advance the critical data size beyond which pretraining on more data is actively harmful if you're serving a quantized model. The intuition might be that as more knowledge is compressed into weights as you train on more data, a given perturbation will damage performance more.
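The "predict in advance" step can be sketched with a toy power-law fit: if degradation grows as delta = a * (D/N)**b in the token/parameter ratio, two measurements pin down a and b, and you can extrapolate to overtrained regimes. The numbers below are made up for illustration, not the paper's fitted coefficients:

```python
import math

# Fit delta = a * ratio**b from two (made-up) measurements of
# (tokens-per-parameter ratio, post-quantization loss degradation),
# then extrapolate to a heavily overtrained regime.

pts = [(100, 0.02), (1000, 0.08)]  # illustrative data, not from the paper
(x1, y1), (x2, y2) = pts

b = math.log(y2 / y1) / math.log(x2 / x1)  # exponent from the two points
a = y1 / x1 ** b                           # prefactor

def degradation(ratio):
    return a * ratio ** b

pred = degradation(10_000)  # predicted degradation at 10k tokens/param
```

With a fit like this in hand, the critical data size is wherever the predicted degradation outweighs the loss improvement from pretraining on more tokens.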
Below: a fixed language model overtrained significantly to various data budgets up to 30B tokens, then post-train quantized. This demonstrates that more pretraining FLOPs do not always lead to better models when served quantized in production.