This is the most important paper in a long time. It shows with strong evidence that we are reaching the limits of quantization. The paper says this: the more tokens you train on, the more precision you need. This has broad implications for the entire field and the future of GPUs 🧵
Arguably, most progress in AI came from improvements in computational capabilities, which mainly relied on low precision for acceleration (32 -> 16 -> 8 bit). This is now coming to an end. Together with physical limitations, this creates the perfect storm for the end of scale.
Blackwell will have excellent 8-bit capabilities, with blockwise quantization implemented at the hardware level. This will make 8-bit training as easy as the switch from FP16 to BF16 was. However, as we see from this paper, we need more than 8-bit precision to train many models.
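To make the idea concrete, here is a minimal sketch of blockwise int8 quantization (my own illustrative code, not Blackwell's hardware implementation): each block gets its own absmax scale, which keeps the quantization error local even when a few outliers are present.

```python
import torch

def blockwise_quantize_int8(x: torch.Tensor, block_size: int = 64):
    """Quantize a tensor to int8 with one absmax scale per block."""
    x = x.flatten()
    pad = (-x.numel()) % block_size
    x = torch.nn.functional.pad(x, (0, pad))                   # pad to a multiple of block_size
    blocks = x.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True) / 127.0    # one scale per block
    scales = scales.clamp(min=1e-8)
    q = torch.clamp((blocks / scales).round(), -127, 127).to(torch.int8)
    return q, scales

def blockwise_dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales).flatten()

# Example: the error stays small even with a few large outliers,
# because each block is scaled independently.
w = torch.randn(4096)
w[::512] *= 20                                                 # inject outliers
q, s = blockwise_quantize_int8(w)
print((blockwise_dequantize(q, s)[: w.numel()] - w).abs().max())
```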
The main reason why Llama 405B did not see much use compared to other models is that it is just too big. Running a 405B model for inference is a big pain. But the paper shows that for smaller models, say 70B, you cannot train efficiently in low precision.
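Some back-of-the-envelope arithmetic (mine, not from the paper) on the weight-memory footprint makes the point:

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Weight memory in GB for `params_billion` parameters stored at `bits` per weight."""
    return params_billion * 1e9 * bits / 8 / 1e9

for params in (8, 70, 405):
    for bits in (16, 8, 4):
        print(f"{params:>3}B @ {bits:>2}-bit: {weight_memory_gb(params, bits):6.0f} GB of weights")

# 405B needs ~810 GB of weights in 16-bit and still ~405 GB in 8-bit (many GPUs either way),
# while 70B in 16-bit (~140 GB) is already far more manageable, before even counting the KV cache.
```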
[Figure legend: 8B (circle), 70B (triangle), 405B (star)]
We see that for 20B-token training runs, training an 8B model is more efficient in 16-bit. For the 70B model, 8-bit still works, but it is already getting less efficient.
From my own experience (a lot of failed research), you cannot cheat efficiency. If quantization fails, sparsification fails too, and so do other efficiency mechanisms. If this is true, we are close to optimal now. With this, there are only three ways forward that I see...
(1) Scaling data centers: this still scales for ~2 years.
(2) Scaling through dynamics: routing to smaller specialized models, or between larger and smaller models.
(3) Knowledge distillation: I believe distillation behaves differently than other techniques and might have different properties.
For hardware, we still have HBM4, which will be a good boost. But FP4 training is a lie. Node shrinks will not add much efficiency anymore. @dylan522p believes that AI can help design more efficient chips, but I am skeptical that there is much more room.
All of this means that the paradigm will soon shift from scaling to "what can we do with what we have". I think the paradigm of "how do we help people be more productive with AI" is the best mindset forward. This mindset is about processes and people rather than technology.
We present SpQR, which allows lossless LLM inference at 4.75 bits with a 15% speedup. You can run a 33B LLM on a single 24GB GPU fully lossless. SpQR works by isolating sensitive weights with higher precision and roughly doubles improvements from GPTQ: arxiv.org/abs/2306.03078🧵
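A minimal sketch of the core idea as I would paraphrase it (toy code, not the actual SpQR implementation; SpQR selects sensitive weights by quantization sensitivity, here I just use magnitude): keep a small set of outlier weights in 16-bit as a sparse matrix and quantize the dense remainder to a few bits.

```python
import torch

def split_outliers_and_quantize(W: torch.Tensor, outlier_frac: float = 0.01, bits: int = 3):
    """Toy decomposition: sparse high-precision outliers + coarsely quantized dense remainder."""
    k = max(1, int(W.numel() * outlier_frac))
    thresh = W.abs().flatten().topk(k).values.min()
    outlier_mask = W.abs() >= thresh                    # "sensitive" weights (by magnitude here)
    sparse_fp16 = (W * outlier_mask).to_sparse()        # kept in high precision

    dense = W * (~outlier_mask)
    levels = 2 ** (bits - 1) - 1
    scale = dense.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / levels   # per-row scale
    q = torch.clamp((dense / scale).round(), -levels, levels).to(torch.int8)
    return sparse_fp16, q, scale

def reconstruct(sparse_fp16, q, scale):
    return sparse_fp16.to_dense() + q.float() * scale

W = torch.randn(256, 256)
s, q, sc = split_outliers_and_quantize(W)
print((reconstruct(s, q, sc) - W).abs().mean())        # small average error despite 3-bit storage
```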
Rapid-fire results 1/2:
- 4.75 bit/param lossless; 3.35 bit/param best performance trade-off
- Performance cliff at 3.35 bits that is difficult to overcome
- 13B/33B LLaMA fits into an iPhone 14 / Colab T4 at 3.35 bits
- 15% faster than FP16; ~2x speedup vs PyTorch sparse matmul
Rapid-fire results 2/2:
- row outliers seem to be responsible for creating column outliers in the next layer
- larger outliers in later layers
- probably due to the GPTQ procedure, outliers get larger in the last matrix dimensions
Guanaco models use Low-rank Adapters (LoRA) and a base model (LLaMA). As such, to use Guanaco models, you need to load both and combine them. You can do that in many different ways. The CPU memory needed is the final model size (not the checkpoint size). Here are the use cases:
1. Load Guanaco in 16-bit for fast inference.
2. Load Guanaco in 8-bit or 4-bit to fit it on small GPUs (slow at the moment, but 4-bit will be fast soon); see the sketch below.
3. Load Guanaco onto multiple GPUs.
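Here is a rough sketch of use case 2 with the Hugging Face stack (the model/adapter IDs and exact flags are placeholders and may differ from the official release):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "huggyllama/llama-65b"          # placeholder: the LLaMA base model
adapter_id = "timdettmers/guanaco-65b"    # placeholder: the Guanaco LoRA adapter

# Use case 2: load the base model in 4-bit so it fits on a small GPU.
# (Use load_in_8bit=True for 8-bit, or torch_dtype=torch.float16 without the flag for use case 1.)
base = AutoModelForCausalLM.from_pretrained(base_id, load_in_4bit=True, device_map="auto")

# Combine base model + adapter: this is what "a Guanaco model" actually is.
model = PeftModel.from_pretrained(base, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)
```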
Want to see how good Guanaco 65B is? Here is a little fun game: Can you distinguish ChatGPT outputs from Guanaco-65B outputs? We authors had a hard time distinguishing them — maybe there is a trick? Are you better than us? colab.research.google.com/drive/1kK6xasH… (solutions after each sample)
Rapid-fire findings (1/3):
- 97% ChatGPT performance on 1 consumer GPU in 12 hours
- matching 16-bit performance across all scales and models
- key contributions: NormalFloat data type, paged optimizers, double quantization (sketch after this list)
- FLAN v2 good for instruction tuning, bad for chatbots
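For reference, a minimal sketch of how these pieces are exposed in the bitsandbytes/transformers integration (treat the exact argument names as an assumption about the current API, not as part of the paper):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NormalFloat (NF4) storage + double quantization, with bf16 compute for the matmuls.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat data type
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)
# Paged optimizers are selected separately, e.g. optim="paged_adamw_32bit" in the HF Trainer.
```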
The 4-bit bitsandbytes private beta is here! Our method, QLoRA, is integrated with the HF stack and supports all models. You can finetune a 65B model on a single 48 GB GPU. This beta will help us catch bugs and issues before our full release. Sign up: forms.gle/QCxrUmXJ4RCbrk…
We will send out about 50 invites per day, and the beta will run for about a week. As a beta tester, you get early access and can help make this feature a smooth experience for everyone. Significant contributions will be acknowledged in the repos/paper.
Our method aims to make even the largest LLMs available for finetuning on consumer hardware in a simple and straightforward way. It is memory efficient, fast, and highly robust, making it easy to replicate 16-bit fine-tuning performance for large LLMs on a consumer hardware setup.
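A rough sketch of what the QLoRA finetuning setup looks like with peft (assuming a 4-bit base model loaded as in the previous snippet; argument names may vary between versions):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `model` is the 4-bit base model from the previous snippet.
model = prepare_model_for_kbit_training(model)   # casts norms/embeddings for stability

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the small 16-bit LoRA adapters are trained
```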
Our work on loss spikes and stable 8-bit CLIP training is the largest Int8 training to date (1B). We introduce the SwitchBack layers and StableAdamW to ensure stability at these scales. Work with the awesome @Mitchnw
The bedrock of our work is a careful analysis of loss spikes. We were looking for the causal factor to be able to develop effective solutions. We found that "fast" spikes occur due to Adam. "Slow" loss spikes in fp16 training mainly occur due to instabilities in early layers.
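As I understand it, StableAdamW adds AdaFactor-style update clipping to AdamW to tame exactly these Adam-induced "fast" spikes. A rough sketch of that clipping idea (my paraphrase, not the paper's exact implementation):

```python
import torch

def stable_adamw_step(p, g, m, v, step, lr=1e-3, betas=(0.9, 0.999),
                      eps=1e-8, weight_decay=0.01):
    """One AdamW step with AdaFactor-style update clipping (paraphrase of StableAdamW).

    The learning rate for this tensor is shrunk when the raw update g/sqrt(v) gets
    large, which suppresses the 'fast' Adam-induced loss spikes."""
    b1, b2 = betas
    m.mul_(b1).add_(g, alpha=1 - b1)
    v.mul_(b2).addcmul_(g, g, value=1 - b2)
    m_hat = m / (1 - b1 ** step)
    v_hat = v / (1 - b2 ** step)

    # Update clipping: per-tensor RMS of g^2 / v; scale down the lr if it exceeds 1.
    rms = torch.sqrt(torch.mean(g ** 2 / torch.clamp(v_hat, min=eps ** 2)))
    lr_t = lr / max(1.0, rms.item())

    p.mul_(1 - lr_t * weight_decay)                    # decoupled weight decay
    p.add_(-lr_t * m_hat / (v_hat.sqrt() + eps))
    return p, m, v
```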
Previous work often studies loss spikes by analysis of the state of the network when the loss spike occurred. But what we find is that loss spikes depend on previous updates which seemingly look fine at the loss level. This makes it very difficult to analyze.
We release LLM.int8(), the first 8-bit inference method that saves 2x memory and does not degrade performance for 175B models by exploiting emergent properties. Read More:
LLM.int8() works by using: (1) a high-precision vector-wise quantization technique and (2) mixed-precision decomposition. To develop (2), insights into emergent features and how they dominate attention and model predictions were key. More on emergent features below.
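A toy sketch of that decomposition (illustrative only, not the actual CUDA kernels): hidden-state columns with large-magnitude emergent features are multiplied in high precision, everything else goes through int8 with per-vector scales.

```python
import torch

def int8_matmul_with_decomposition(X, W, threshold=6.0):
    """Toy LLM.int8()-style forward: X @ W with outlier feature columns kept in high precision."""
    # (2) Mixed-precision decomposition: find outlier feature dimensions in X.
    outlier_cols = X.abs().amax(dim=0) >= threshold

    X_out, W_out = X[:, outlier_cols], W[outlier_cols, :]          # high-precision path
    X_reg, W_reg = X[:, ~outlier_cols], W[~outlier_cols, :]        # int8 path

    # (1) Vector-wise quantization: one scale per row of X and per column of W.
    sx = X_reg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127
    sw = W_reg.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127
    Xq = (X_reg / sx).round().clamp(-127, 127)
    Wq = (W_reg / sw).round().clamp(-127, 127)

    out_int8 = (Xq @ Wq) * sx * sw          # dequantize the int8 result
    out_fp = X_out @ W_out                  # outlier dimensions stay high precision
    return out_int8 + out_fp

X, W = torch.randn(4, 512), torch.randn(512, 512)
X[:, 10] *= 20                              # one emergent outlier feature dimension
print((int8_matmul_with_decomposition(X, W) - X @ W).abs().max())
```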
In terms of software, LLM.int8() means that you can reduce the memory footprint of a large model by 2x. Some models that previously couldn't be used on Google Colab can now be used there. Try this demonstration to run T5-11b on Colab: colab.research.google.com/drive/1YORPWx4…
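For reference, the integration boils down to a single flag in transformers (a sketch assuming bitsandbytes and accelerate are installed; flag names as in the HF integration I know):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-11b")
model = T5ForConditionalGeneration.from_pretrained(
    "t5-11b",
    load_in_8bit=True,     # weights stored in int8 via LLM.int8(), ~2x less memory
    device_map="auto",
)
inputs = tokenizer("translate English to German: Hello, world!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))
```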