Tanishq Kumar · Nov 11 · 7 tweets · 4 min read
[1/7] New paper alert! Heard about the BitNet hype, or that Llama-3 is harder to quantize? Our new work studies both! We formulate scaling laws for precision, across both pre- and post-training: arxiv.org/pdf/2411.04330. TL;DR:

- Models become harder to post-train quantize as they are overtrained on lots of data, so that eventually more pretraining data can be actively harmful if quantizing post-training!
- The effects of putting weights, activations, or attention in varying precisions during pretraining are consistent and predictable, and fitting a scaling law suggests that pretraining at high (BF16) and next-generation (FP4) precisions may both be suboptimal design choices!

Joint work with @ZackAnkner @bfspector @blake__bordelon @Muennighoff @mansiege @CPehlevan @HazyResearch @AdtRaghunathan.
[2/7] We first study the common technique of post-train quantizing model weights, finding that the longer you train (i.e., the more data seen during pretraining), the more sensitive the model becomes to quantization at inference time. This may explain why Llama-3 is harder to quantize.
In fact, this loss degradation is roughly a power law in the token/parameter ratio seen during pretraining, so you can predict in advance the critical data size beyond which pretraining on more data becomes actively harmful if you're serving a quantized model. The intuition: as more knowledge gets compressed into the weights with more training data, a given perturbation damages performance more.
Below is a fixed language model significantly overtrained to various data budgets up to 30B tokens, then post-train quantized. This demonstrates that more pretraining FLOPs do not always lead to better models when served in production.
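The trend described above can be sketched in a few lines. This is a toy model, not the paper's fitted law: the functional shape (power law in D/N, exponential decay in post-training precision) follows the thread, but every coefficient below is made up for illustration.

```python
import math

def ptq_degradation(d_tokens, n_params, bits,
                    c=0.05, gamma_data=0.5, gamma_prec=2.0):
    """Illustrative loss increase from post-train weight quantization.

    Grows as a power law in the token/parameter ratio D/N and shrinks
    exponentially with the quantization bit-width. All coefficients
    (c, gamma_data, gamma_prec) are hypothetical, not fitted values.
    """
    return c * (d_tokens / n_params) ** gamma_data * math.exp(-bits / gamma_prec)

# An overtrained model (higher D/N) degrades more at a fixed serving precision:
light = ptq_degradation(d_tokens=20e9, n_params=1e9, bits=4)
heavy = ptq_degradation(d_tokens=200e9, n_params=1e9, bits=4)
print(f"D/N = 20:  delta-loss ~ {light:.4f}")
print(f"D/N = 200: delta-loss ~ {heavy:.4f}")
```

With this shape, setting the degradation against the usual gain from extra data gives the "critical data size" the thread mentions: the point where the quantization penalty outgrows the pretraining benefit.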
[3/7] We then turn our attention to training in low precision. We study both quantization-aware training (weights only) and low-precision training (everything in low precision). We decompose the model into weights, activations, and KV cache, finding scaling laws for loss when any of these are quantized to any precision, and develop a compositional and interpretable functional form to predict the effect on loss of quantizing any combination of the three during pretraining.
[4/7] Our scaling law relies on a notion of "effective parameter count," which we posit is the quantity reduced when you lower precision at a fixed number of real parameters: a 1-billion-parameter model with everything trained in FP4 has a comparable number of "effective parameters" to a 250M model in BF16.

While weights can be trained in low precision without issue, activations and KV cache are sensitive. Below is the normalized "effective parameter count" as a function of precision for each of weights, activations, and KV cache, as well as when all three are held to the same precision (tied), based on our fits.
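A minimal sketch of the effective-parameter idea, under two stated assumptions: each component (weights, activations, KV cache) contributes a saturating factor of the form 1 - exp(-bits/gamma), and the factors multiply. The shared gamma = 4.0 is chosen only so the thread's 1B-in-FP4 ≈ 250M-in-BF16 example roughly holds; it is not the paper's fitted constant.

```python
import math

def n_eff(n_params, bits_w, bits_a, bits_kv, gamma=4.0):
    """Hypothetical effective parameter count when weights, activations,
    and KV cache are trained at the given bit-widths. Each component
    contributes a saturating factor (1 - exp(-bits / gamma)); the
    multiplicative form and gamma are illustrative assumptions."""
    factor = lambda bits: 1.0 - math.exp(-bits / gamma)
    return n_params * factor(bits_w) * factor(bits_a) * factor(bits_kv)

# Everything in FP4 vs. everything in BF16 for a 1B-parameter model:
fp4 = n_eff(1e9, 4, 4, 4)
bf16 = n_eff(1e9, 16, 16, 16)
print(f"FP4 (tied):  {fp4 / 1e6:.0f}M effective params")
print(f"BF16 (tied): {bf16 / 1e6:.0f}M effective params")
```

Note how the saturating form captures both findings: at BF16 each factor is near 1 (extra bits buy almost nothing), while at FP4 the penalty compounds across the three components.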
[5/7] Finally, we are able to unify our findings for pre- and post-training into an interpretable functional form that predicts loss from pre- and post-training in any combination of precision. We find that pretraining in low precision "robustifies" a model to post-train quantization in a quantitatively predictable way, but by less than you would intuitively expect, for reasons we outline and test in the paper.
[6/7] Our work has several limitations -- we keep a controlled architecture and setup when doing experiments, but in practice architectural tweaks are often deliberately made to accommodate low-precision training. We also fit scaling laws on relatively small language models (up to ~250M parameters) because we train over 450 models on large data budgets (up to over 25B tokens). We are excited for future work to study these effects at larger model scale!
[7/7] Many thanks to @Tim_Dettmers @chrismdesa @realDanFu for super helpful feedback as well as to the entire @HazyResearch team for their support! Models from our 465+ pretraining runs will soon be on HuggingFace for everyone to play around with, and code will also be released! The preprint is at arxiv.org/pdf/2411.04330
