Tanishq Kumar · Nov 11 · 7 tweets · 4 min read
[1/7] New paper alert! Heard about the BitNet hype, or that Llama-3 is harder to quantize? Our new work studies both! We formulate scaling laws for precision, across both pre- and post-training: arxiv.org/pdf/2411.04330. TL;DR:

- Models become harder to post-train quantize as they are overtrained on lots of data, so that eventually more pretraining data can be actively harmful if quantizing post-training!
- The effects of putting weights, activations, or attention in varying precisions during pretraining are consistent and predictable, and fitting a scaling law suggests that pretraining at high (BF16) and next-generation (FP4) precisions may both be suboptimal design choices!

Joint work with @ZackAnkner @bfspector @blake__bordelon @Muennighoff @mansiege @CPehlevan @HazyResearch @AdtRaghunathan.
[2/7] We first study the common technique of post-train quantizing model weights, finding that the longer you train (i.e., the more data seen during pretraining), the more sensitive the model becomes to quantization at inference time. This may explain why Llama-3 is harder to quantize.
In fact, this loss degradation is roughly a power law in the token/parameter ratio seen during pretraining, so you can predict in advance the critical data size beyond which pretraining on more data becomes actively harmful if you're serving a quantized model. The intuition: as more knowledge gets compressed into the weights with more training data, a given perturbation damages performance more.
Below is a fixed language model significantly overtrained to various data budgets up to 30B tokens, then post-train quantized. This demonstrates that more pretraining FLOPs do not always lead to better models when served in production.
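The trend described above can be sketched in a few lines. This is a toy model, not the paper's fitted law: the functional shape (power law in D/N, exponential decay in post-training precision) follows the thread, but every coefficient below is made up for illustration.

```python
import math

def ptq_degradation(d_tokens, n_params, bits,
                    c=0.05, gamma_data=0.5, gamma_prec=2.0):
    """Illustrative loss increase from post-train weight quantization.

    Grows as a power law in the token/parameter ratio D/N and shrinks
    exponentially with the quantization bit-width. All coefficients
    (c, gamma_data, gamma_prec) are hypothetical, not fitted values.
    """
    return c * (d_tokens / n_params) ** gamma_data * math.exp(-bits / gamma_prec)

# An overtrained model (higher D/N) degrades more at a fixed serving precision:
light = ptq_degradation(d_tokens=20e9, n_params=1e9, bits=4)
heavy = ptq_degradation(d_tokens=200e9, n_params=1e9, bits=4)
print(f"D/N = 20:  delta-loss ~ {light:.4f}")
print(f"D/N = 200: delta-loss ~ {heavy:.4f}")
```

With this shape, setting the degradation against the usual gain from extra data gives the "critical data size" the thread mentions: the point where the quantization penalty outgrows the pretraining benefit.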
[3/7] We then turn our attention to training in low precision. We study both quantization-aware training (weights only) and low-precision training (everything in low precision). We decompose the model into weights, activations, and KV cache, finding scaling laws for loss when any of these are quantized to any precision, and develop a compositional and interpretable functional form to predict the effect on loss of quantizing any combination of the three during pretraining.
[4/7] Our scaling law relies on a notion of "effective parameter count," which we posit is the quantity reduced when you lower precision at a fixed number of real parameters: a 1-billion-parameter model with everything trained in FP4 has a comparable number of "effective parameters" to a 250M model in BF16.

While weights can be trained in low precision without issue, activations and KV cache are sensitive. Below is the normalized "effective parameter count" as a function of precision for each of weights, activations, and KV cache, as well as when all three are held to the same precision (tied), based on our fits.
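A minimal sketch of the effective-parameter idea, under two stated assumptions: each component (weights, activations, KV cache) contributes a saturating factor of the form 1 - exp(-bits/gamma), and the factors multiply. The shared gamma = 4.0 is chosen only so the thread's 1B-in-FP4 ≈ 250M-in-BF16 example roughly holds; it is not the paper's fitted constant.

```python
import math

def n_eff(n_params, bits_w, bits_a, bits_kv, gamma=4.0):
    """Hypothetical effective parameter count when weights, activations,
    and KV cache are trained at the given bit-widths. Each component
    contributes a saturating factor (1 - exp(-bits / gamma)); the
    multiplicative form and gamma are illustrative assumptions."""
    factor = lambda bits: 1.0 - math.exp(-bits / gamma)
    return n_params * factor(bits_w) * factor(bits_a) * factor(bits_kv)

# Everything in FP4 vs. everything in BF16 for a 1B-parameter model:
fp4 = n_eff(1e9, 4, 4, 4)
bf16 = n_eff(1e9, 16, 16, 16)
print(f"FP4 (tied):  {fp4 / 1e6:.0f}M effective params")
print(f"BF16 (tied): {bf16 / 1e6:.0f}M effective params")
```

Note how the saturating form captures both findings: at BF16 each factor is near 1 (extra bits buy almost nothing), while at FP4 the penalty compounds across the three components.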
[5/7] Finally, we are able to unify our findings for pre- and post-training into an interpretable functional form that predicts loss from pre- and post-training in any combination of precision. We find that pretraining in low precision "robustifies" a model to post-train quantization in a quantitatively predictable way, but by less than you would intuitively expect, for reasons we outline and test in the paper.
[6/7] Our work has several limitations -- we keep a controlled architecture and setup when doing experiments, but in practice architectural tweaks are often deliberately made to accommodate low-precision training. We also fit scaling laws on relatively small language models (up to ~250M parameters) because we train over 450 models on large data budgets (up to over 25B tokens). We are excited for future work to study these effects at larger model scale!
[7/7] Many thanks to @Tim_Dettmers @chrismdesa @realDanFu for super helpful feedback as well as to the entire @HazyResearch team for their support! Models from our 465+ pretraining runs will soon be on HuggingFace for everyone to play around with, and code will also be released! The preprint is at arxiv.org/pdf/2411.04330
