We present SpQR, which allows lossless LLM inference at 4.75 bits with a 15% speedup. You can run a 33B LLM on a single 24GB GPU, fully lossless. SpQR works by isolating sensitive weights in higher precision and roughly doubles the improvement over GPTQ: arxiv.org/abs/2306.03078🧵
Rapid-fire results 1/2:
- 4.75 bit/param lossless; 3.35 bit/param best performance trade-off
- Performance cliff at 3.35 bits that is difficult to overcome
- 13B/33B LLaMA fits on an iPhone 14 / Colab T4 at 3.35 bits
- 15% faster than FP16; ~2x speedup vs PyTorch sparse matmul
Rapid-fire results 2/2:
- row outliers seem to be responsible for creating column outliers in the next layer
- larger outliers in later layers
- outliers grow larger toward the last matrix dimensions, probably due to the GPTQ procedure
SpQR is the result of a careful analysis of how sensitive outlier weights in the GPTQ algorithm affect outcomes. Besides the column outliers found in LLM.int8(), we also find partial row outliers (that sometimes skip attention heads) and unstructured outliers.
When we try to exploit these structures, we find that the best way to reduce the error with as little memory as possible is to not enforce any structure. This means that the chaos of partial row, column, and unstructured outliers can be best tamed with a fully sparse algorithm.
Sparse algorithms can be super tricky to implement. So for SpQR we had to develop both a new sparse matrix multiplication algorithm and a new storage format. The end result is a little complicated ... 😅 But we still managed to get a small memory footprint (3.3 bits) and speedups.
The last innovation is bilevel quantization. It was developed by my colleagues in parallel to double quantization and is a strict improvement: we quantize the first-order quantization statistics with a second-order quantization, so that both the zero points and the scales are stored in 3-bit.
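For intuition, here is a toy Python sketch of the bilevel idea; the group size, bit widths, and helper functions are my own illustrative assumptions, not the exact SpQR configuration.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Asymmetric uniform quantization; returns integer codes plus scale/zero point."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1) + 1e-12
    codes = np.clip(np.round((x - lo) / scale), 0, 2**bits - 1)
    return codes, scale, lo

def bilevel_quantize(w, group_size=16, w_bits=3, stat_bits=3):
    # 1st order: each small group of weights gets its own range (scale) and zero point
    w = w.reshape(-1, group_size)            # assumes w.size is divisible by group_size
    ranges = w.max(axis=1) - w.min(axis=1)
    zeros = w.min(axis=1)
    # 2nd order: the 1st-order statistics are themselves quantized to a few bits
    r_codes, r_scale, r_zero = quantize_uniform(ranges, stat_bits)
    z_codes, z_scale, z_zero = quantize_uniform(zeros, stat_bits)
    deq_ranges = r_codes * r_scale + r_zero
    deq_zeros = z_codes * z_scale + z_zero
    # quantize the weights against the *stored* (already quantized) statistics
    step = deq_ranges / (2**w_bits - 1) + 1e-12
    w_codes = np.clip(np.round((w - deq_zeros[:, None]) / step[:, None]), 0, 2**w_bits - 1)
    return w_codes, r_codes, z_codes, (r_scale, r_zero, z_scale, z_zero)
```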
Putting it all together, SpQR combines unstructured outliers with bilevel quantization and the GPTQ procedure (which minimizes quantization error by counter-balancing rounding decisions).
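That counter-balancing step can be sketched like this (my simplification of a GPTQ-style row update; `H_inv` stands for the inverse Hessian of the layer inputs and `quantize` is a placeholder rounding function):

```python
import numpy as np

def gptq_like_round_row(w_row, H_inv, quantize=np.round):
    """Quantize one weight row left to right, pushing each rounding error
    onto the not-yet-quantized weights, weighted by the inverse Hessian."""
    w = w_row.astype(np.float64).copy()
    q = np.zeros_like(w)
    for j in range(len(w)):
        q[j] = quantize(w[j])
        err = (w[j] - q[j]) / H_inv[j, j]
        w[j + 1:] -= err * H_inv[j, j + 1:]   # counter-balance the rounding decision
    return q
```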
In SpQR we combine quantized and sparse matmul. To speed up sparse matmul over regular cuSPARSE/PyTorch matmul, we exploit the fact that we have some partial structure. Instead of loading only the "correct" elements, we load more values and filter out the right ones in the fast SRAM cache.
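Conceptually, the forward pass boils down to something like this (an illustrative PyTorch sketch, not our CUDA kernel; `w_dequantized` is the low-bit part with zeros at the outlier positions, and the fp16 outliers live in a sparse tensor):

```python
import torch

def spqr_like_matmul(x, w_dequantized, outlier_rows, outlier_cols, outlier_vals):
    # dense part: weights dequantized from their ~3-bit representation
    y = x @ w_dequantized.T
    # sparse part: the ~1% sensitive weights kept in fp16
    w_outliers = torch.sparse_coo_tensor(
        torch.stack([outlier_rows, outlier_cols]),
        outlier_vals, w_dequantized.shape)
    y = y + torch.sparse.mm(w_outliers, x.T).T   # sparse correction on top
    return y
```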
While the accuracy and perplexity numbers tell a story of "no degradation," it is difficult to grasp how this translates into generation quality. Here are some examples that I found very instructive:
8-bit, 4-bit, 3.35-bit: how long until we hit 1-bit? We see a hard cliff with SpQR at around 3.35 bits, and it isn't easy to get further with this algorithm. But there are already follow-up ideas. I think we will get to 3-bit within 2-3 months. 2-bit is hard to crack, though.
SpQR is the product of an ensemble cast of talented researchers — it felt like I was just along for the ride! Thank you to my co-first authors Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, and @elias_frantar, @AshkboosSaleh, @sasha_borzunov, @thoefler, @DAlistarh!
This is the most important paper in a long time. It shows with strong evidence that we are reaching the limits of quantization. The paper says this: the more tokens you train on, the more precision you need. This has broad implications for the entire field and the future of GPUs🧵
Arguably, most progress in AI came from improvements in computational capabilities, which mainly relied on low precision for acceleration (32 -> 16 -> 8 bit). This is now coming to an end. Together with physical limitations, this creates the perfect storm for the end of scale.
Blackwell will have excellent 8-bit capabilities with blockwise quantization implemented at the hardware level. This will make 8-bit training as easy as the switch from FP16 to BF16 was. However, as we see from this paper, we need more than 8-bit precision to train many models.
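For context, blockwise quantization just means every small block of values gets its own scale, so a single large value only hurts precision within its block. A minimal sketch (my illustration, not the hardware path):

```python
import torch
import torch.nn.functional as F

def blockwise_quantize_int8(x: torch.Tensor, block_size: int = 64):
    flat = x.flatten()
    flat = F.pad(flat, (0, (-flat.numel()) % block_size))     # pad to full blocks
    blocks = flat.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = (blocks / scales).round().clamp(-127, 127).to(torch.int8)
    return q, scales.half()                                   # int8 codes + one scale per block

def blockwise_dequantize(q, scales):
    return q.float() * scales.float()
```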
Guanaco models use Low-rank Adapters (LoRA) and a base model (LLaMA). As such, to use Guanaco models, you need to load both and combine them. You can do that in many different ways. The CPU memory needed is the final model size (not the checkpoint size). Here are the use cases:
1. Load Guanaco in 16-bit for fast inference.
2. Load Guanaco in 8-bit or 4-bit to fit it on small GPUs (slow at the moment, but 4-bit will be fast soon).
3. Load Guanaco onto multiple GPUs.
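A hedged example of use case 1 with the HF/peft stack; the model and adapter ids below are assumptions, so substitute the checkpoints you actually have access to:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id, adapter_id = "huggyllama/llama-65b", "timdettmers/guanaco-65b"  # assumed ids

base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto")   # 16-bit base model
model = PeftModel.from_pretrained(base, adapter_id)          # attach the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Use case 2: load the base model in 8-bit/4-bit (e.g. load_in_8bit=True) before
# attaching the adapter. Use case 3: device_map="auto" already spreads the model
# across all visible GPUs.
```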
Want to see how good Guanaco 65B is? Here is a little fun game: Can you distinguish ChatGPT outputs from Guanaco-65B outputs? We authors had a hard time distinguishing them — maybe there is a trick? Are you better than us? colab.research.google.com/drive/1kK6xasH… (solutions after each sample)
Rapid-fire-findings (1/3):
- 97% ChatGPT performance on 1 consumer GPU in 12 hours
- matching 16-bit performance across all scales and models
- key contributions: NormalFloat data type, paged optimizers, double quantization
- FLAN v2 good for instruction tuning, bad for chatbots
The 4-bit bitsandbytes private beta is here! Our method, QLoRA, is integrated with the HF stack and supports all models. You can finetune a 65B model on a single 48 GB GPU. This beta will help us catch bugs and issues before our full release. Sign up: forms.gle/QCxrUmXJ4RCbrk…
We will send out about 50 invites per day, and the beta will run for about a week. As a beta tester, you get early access and can help make this feature a smooth experience for everyone. Significant contributions will be acknowledged in the repos/paper.
Our method aims to make even the largest LLMs available for finetuning on consumer hardware in a simple and straightforward way. It is memory-efficient, fast, and highly robust, making it easy to replicate 16-bit fine-tuning performance for large LLMs on a consumer hardware setup.
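Roughly, 4-bit QLoRA-style loading looks like this with the HF integration (a sketch; hyperparameters are illustrative and the exact API may have shifted since the beta):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat data type
    bnb_4bit_use_double_quant=True,        # double quantization of the statistics
    bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b",                # assumed base model id
    quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)        # only the small LoRA adapters are trained

# Pair this with a paged optimizer (e.g. optim="paged_adamw_32bit" in
# TrainingArguments) to absorb memory spikes during training.
```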
Our work on loss spikes and stable 8-bit CLIP training is the largest int8 training run to date (1B). We introduce SwitchBack layers and StableAdamW to ensure stability at these scales. Work with the awesome @Mitchnw
The bedrock of our work is a careful analysis of loss spikes. We were looking for the causal factor to be able to develop effective solutions. We found that "fast" spikes occur due to Adam. "Slow" loss spikes in fp16 training mainly occur due to instabilities in early layers.
Previous work often studies loss spikes by analyzing the state of the network at the moment the spike occurs. But we find that loss spikes depend on previous updates that seemingly look fine at the loss level, which makes them very difficult to analyze.
We release LLM.int8(), the first 8-bit inference method that saves 2x memory and does not degrade performance for 175B models by exploiting emergent properties. Read More:
LLM.int8() works by using: (1) a high-precision vector-wise quantization technique and (2) mixed-precision decomposition. To develop (2), insights into emergent features and how they dominate attention and model predictions were key. More on emergent features below.
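Schematically, the decomposition looks like this (an illustrative sketch, not the bitsandbytes kernel; the int8 path is kept in floats here for readability):

```python
import torch

def llm_int8_like_matmul(x, w, threshold=6.0):
    # (1) find the emergent outlier feature dimensions in the hidden state
    outlier_cols = (x.abs() > threshold).any(dim=0)
    y_fp16 = x[:, outlier_cols] @ w[:, outlier_cols].T        # tiny fp16 matmul

    # (2) vector-wise int8 quantization for everything else
    x_r, w_r = x[:, ~outlier_cols], w[:, ~outlier_cols]
    sx = x_r.abs().amax(dim=1, keepdim=True) / 127.0          # per-row scale of X
    sw = w_r.abs().amax(dim=1, keepdim=True) / 127.0          # per-row scale of W
    xq = (x_r / sx).round().clamp(-127, 127)
    wq = (w_r / sw).round().clamp(-127, 127)
    y_int8 = (xq @ wq.T) * (sx * sw.T)                        # dequantize the result
    return y_fp16 + y_int8
```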
In terms of software, LLM.int8() means that you can reduce the memory footprint of a large model by 2x. Some models that previously couldn't be used on Google Colab can now run there. Try this demonstration to run T5-11b on Colab: colab.research.google.com/drive/1YORPWx4…
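From the user side it is roughly this (a sketch; the flags mirror the Colab demo only approximately):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained(
    "t5-11b", device_map="auto", load_in_8bit=True)   # weights stored as int8
tokenizer = AutoTokenizer.from_pretrained("t5-11b")

inputs = tokenizer("translate English to German: Hello, how are you?",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))
```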