Tim Dettmers
Jun 6, 2023 · 12 tweets · 5 min read
We present SpQR, which allows lossless LLM inference at 4.75 bits with a 15% speedup. You can run a 33B LLM on a single 24GB GPU fully losslessly. SpQR works by keeping sensitive weights in higher precision and roughly doubles the improvement from GPTQ: arxiv.org/abs/2306.03078 🧵
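To make the core idea concrete, here is a minimal PyTorch simulation of the splitting step (my own sketch, not the paper's code): the most sensitive weights, approximated here by magnitude rather than the GPTQ-derived sensitivity measure SpQR actually uses, stay in 16-bit as an unstructured sparse matrix, and everything else is quantized to a low bit-width with small per-group scales.

```python
import torch

def spqr_like_split(W, outlier_frac=0.01, bits=3, group=16):
    # Sketch only: sensitivity is approximated by weight magnitude here;
    # SpQR uses a GPTQ-style saliency criterion instead.
    k = max(1, int(W.numel() * outlier_frac))
    thresh = W.abs().flatten().kthvalue(W.numel() - k).values
    outlier_mask = W.abs() > thresh

    # ~1% of weights stay in 16-bit as an unstructured sparse matrix
    W_sparse = (W * outlier_mask).to_sparse()

    # the rest is quantized in small groups (assumes numel divisible by `group`)
    W_base = (W * ~outlier_mask).reshape(-1, group)
    qmax = 2 ** (bits - 1) - 1
    scale = W_base.abs().amax(dim=1, keepdim=True) / qmax
    W_q = torch.round(W_base / scale.clamp(min=1e-8)).clamp(-qmax - 1, qmax)
    W_deq = (W_q * scale).reshape(W.shape)
    return W_deq, W_sparse

# Inference then combines a low-bit dense matmul with a small sparse fp16 matmul:
#   y = W_deq @ x + torch.sparse.mm(W_sparse, x)
```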
Rapid-fire results 1/2:
- 4.75 bit/param lossless; 3.35 bit/param best performance trade-off
- Performance cliff at 3.35 bits that is difficult to overcome
- 13B/33B LLaMA fits on an iPhone 14 / Colab T4 at 3.35 bits
- 15% faster than FP16; ~2x speedup vs PyTorch sparse matmul
Rapid-fire results 2/2:
- row outliers seem to be responsible for creating column outliers in the next layer
- larger outliers in later layers
- probably due to the GPTQ procedure, outliers get larger in the last matrix dimensions
SpQR is the result of a careful analysis of how sensitive outlier weights in the GPTQ algorithm affect outcomes. Besides the column outliers found in LLM.int8(), we also find partial row outliers (that sometimes skip attention heads) and unstructured outliers.
When we try to exploit these structures, we find that the best way to reduce the error with as little memory as possible is to not enforce any structure. This means the chaos of partial-row, column, and unstructured outliers is best tamed with a fully sparse algorithm.
Sparse algorithms can be super tricky to implement. So for SpQR we had to develop both a new sparse matrix multiplication algorithm and a new storage format. The end result is a little complicated ... 😅 But we still managed to get a small memory footprint (3.3 bits) and speedups.
The last innovation is bilevel quantization. It was developed in parallel to double quantization by my colleagues and is a strict improvement: we quantize the first-order quantization statistics with a second-order quantization whose zero points and scales are themselves stored in 3 bits.
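A rough sketch of what second-order quantization of the statistics looks like (group size and rounding details are my illustrative choices, not necessarily the paper's):

```python
import torch

def bilevel_quantize(stats, bits=3, group=16):
    # Quantize first-order quantization statistics (per-group scales or zero
    # points) with a second-order 3-bit quantization that has its own scale
    # and zero point per group of statistics. Assumes stats.numel() % group == 0.
    s = stats.reshape(-1, group)
    levels = 2 ** bits - 1
    zero2 = s.amin(dim=1, keepdim=True)                       # 2nd-order zero point
    scale2 = (s.amax(dim=1, keepdim=True) - zero2) / levels   # 2nd-order scale
    q = torch.round((s - zero2) / scale2.clamp(min=1e-8)).clamp(0, levels)
    dequant = (q * scale2 + zero2).reshape(stats.shape)
    return q.to(torch.uint8), scale2, zero2, dequant
```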
Putting it all together, SpQR combines unstructured outliers with bilevel quantization and the GPTQ procedure (which minimizes quantization error by counter-balancing rounding decisions).
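For reference, the GPTQ-style counter-balancing boils down to something like the simplified loop below (a sketch that assumes a precomputed inverse Hessian of the layer inputs and per-row scale/zero parameters; the real algorithm works in blocks and uses a Cholesky factorization for speed and stability):

```python
import torch

def gptq_like_quantize(W, H_inv, scale, zero, maxq=15):
    # W: (rows, cols) weights; H_inv: (cols, cols) inverse input Hessian;
    # scale, zero: per-row quantization parameters of shape (rows,).
    W = W.clone()
    Q = torch.zeros_like(W)
    for c in range(W.shape[1]):
        q = torch.clamp(torch.round(W[:, c] / scale + zero), 0, maxq)
        Q[:, c] = q
        # rounding error of this column, normalized by the Hessian diagonal
        err = (W[:, c] - (q - zero) * scale) / H_inv[c, c]
        # counter-balance: push the error onto the not-yet-quantized columns,
        # so later rounding decisions can absorb part of it
        W[:, c + 1:] -= err.unsqueeze(1) * H_inv[c, c + 1:].unsqueeze(0)
    return Q
```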
In SpQR we combine quantized and sparse matmuls. To speed up the sparse matmul over regular cuSPARSE/PyTorch matmuls, we exploit the fact that we have some partial structure: instead of loading only the "correct" elements, we load more values and filter out the right ones in fast SRAM (cache).
While the accuracy and perplexity numbers tell a story of "no degradation," it is difficult to grasp how this translates into generation quality. Here are some examples that I found very instructive:
8 bits, 4 bits, 3.35 bits: how long until we hit 1 bit? We see a hard cliff with SpQR at around 3.35 bits, and it isn't easy to get further with this algorithm. But there are already follow-up ideas. I think we will get to 3-bit within 2-3 months. 2-bit is hard to crack, though.
SpQR is the product of an ensemble cast of talented researchers — it felt like I was just along for the ride! Thank you to my co-first authors Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, and @elias_frantar, @AshkboosSaleh, @sasha_borzunov, @thoefler, @DAlistarh!

More from @Tim_Dettmers

Nov 12, 2024
This is the most important paper in a long time. It shows with strong evidence that we are reaching the limits of quantization. The paper says this: the more tokens you train on, the more precision you need. This has broad implications for the entire field and the future of GPUs 🧵
Arguably, most progress in AI came from improvements in computational capabilities, which mainly relied on low precision for acceleration (32 -> 16 -> 8 bit). This is now coming to an end. Together with physical limitations, this creates the perfect storm for the end of scale.
Blackwell will have excellent 8-bit capabilities with blockwise quantization implemented at the hardware level. This will make 8-bit training as easy as the switch from FP16 to BF16 was. However, as we see from this paper, we need more than 8-bit precision to train many models.
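For intuition, blockwise quantization gives every small block of values its own scale so one large value cannot wash out the precision of an entire tensor. The snippet below simulates the idea with a signed 8-bit integer grid (Blackwell's actual FP8 block formats differ, so treat this purely as an illustration):

```python
import torch

def blockwise_quantize(x, block=32):
    # Each block of `block` values gets its own absmax scale; this simulates
    # an int8 grid rather than a hardware FP8 format. Assumes numel % block == 0.
    xb = x.reshape(-1, block)
    scale = xb.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.round(xb / scale.clamp(min=1e-12)).clamp(-127, 127).to(torch.int8)
    return q, scale  # dequantize with (q.float() * scale).reshape(x.shape)
```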
May 25, 2023
Looking at the comments, some people missed the Guanaco-33B demo because it was added later: huggingface.co/spaces/uwnlp/g…

Big thanks to @huggingface for sponsoring this demo!

The second thing I noticed was that people were a bit lost on how to use the adapters. So here is a tutorial 🧵
Guanaco models use low-rank adapters (LoRA) on top of a base model (LLaMA). To use a Guanaco model, you need to load both and combine them. You can do that in several ways; the CPU memory needed is the final model size (not the checkpoint size). Here are the use cases:
1. Load a Guanaco in 16-bit for fast inference
2. Load Guanaco in 8-bit or 4-bit to fit it in small GPUs (slow at the moment, but 4-bit will be fast soon).
3. Load Guanaco onto multiple GPUs.

Let's see how to do each of these.
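Here is a rough sketch of the three options using the Hugging Face transformers + peft APIs (the checkpoint and adapter repo names below are assumptions; substitute the ones you actually use):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_id = "huggyllama/llama-13b"        # assumed base-model repo
adapter_id = "timdettmers/guanaco-13b"  # assumed Guanaco LoRA adapter repo

# 1) 16-bit for fast inference
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)

# 2) 4-bit (or load_in_8bit=True for 8-bit) to fit small GPUs
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb, device_map="auto")
model_4bit = PeftModel.from_pretrained(model_4bit, adapter_id)

# 3) Multiple GPUs: device_map="auto" shards the base model across all
#    visible GPUs; a max_memory dict limits how much each device may hold.
```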
May 24, 2023
QLoRA: 4-bit finetuning of LLMs is here! With it comes Guanaco, a chatbot on a single GPU, achieving 99% ChatGPT performance on the Vicuna benchmark:

Paper: arxiv.org/abs/2305.14314
Code+Demo: github.com/artidoro/qlora
Samples: colab.research.google.com/drive/1kK6xasH…
Colab: colab.research.google.com/drive/17XEqL1J…
Want to see how good Guanaco 65B is? Here is a little fun game: Can you distinguish ChatGPT outputs from Guanaco-65B outputs? We authors had a hard time distinguishing them — maybe there is a trick? Are you better than us? colab.research.google.com/drive/1kK6xasH… (solutions after each sample)
Rapid-fire findings (1/3):
- 97% of ChatGPT performance on 1 consumer GPU in 12 hours
- matching 16-bit performance across all scales and models
- key contributions: NormalFloat data type, paged optimizers, double quantization (see the config sketch after this list)
- FLAN v2 is good for instruction tuning, bad for chatbots
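To show how those three key pieces map onto today's Hugging Face + bitsandbytes API, here is a hedged configuration sketch (model name, LoRA hyperparameters, and batch settings are illustrative, not the paper's exact recipe):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat (NF4) data type
    bnb_4bit_use_double_quant=True,      # double quantization of the quant constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", quantization_config=bnb, device_map="auto")
model = get_peft_model(model, LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"))

args = TrainingArguments(
    output_dir="out",
    optim="paged_adamw_32bit",           # paged optimizer
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)
```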
May 12, 2023
The 4-bit bitsandbytes private beta is here! Our method, QLoRA, is integrated with the HF stack and supports all models. You can finetune a 65B model on a single 48 GB GPU. This beta will help us catch bugs and issues before our full release. Sign up:
forms.gle/QCxrUmXJ4RCbrk…
We will send out about 50 invites per day, and the beta will run for about a week. As a beta tester, you get early access and can help make this feature a smooth experience for everyone. Significant contributions will be acknowledged in the repos/paper.
Our method aims to make even the largest LLMs available for finetuning on consumer hardware in a simple and straightforward way. It is memory efficient, fast, and highly robust, making it easy to replicate 16-bit finetuning performance for large LLMs on a consumer hardware setup.
Apr 26, 2023
Our work on loss spikes and stable 8-bit CLIP training is the largest Int8 training to date (1B). We introduce the SwitchBack layers and StableAdamW to ensure stability at these scales. Work with the awesome @Mitchnw

Paper: arxiv.org/abs/2304.13013
Colab: github.com/mlfoundations/…
The bedrock of our work is a careful analysis of loss spikes. We were looking for the causal factor to be able to develop effective solutions. We found that "fast" spikes occur due to Adam. "Slow" loss spikes in fp16 training mainly occur due to instabilities in early layers.
Previous work often studies loss spikes by analyzing the state of the network at the moment the spike occurred. But we find that loss spikes depend on previous updates that seemingly look fine at the loss level, which makes them very difficult to analyze.
Aug 17, 2022
We release LLM.int8(), the first 8-bit inference method that saves 2x memory and does not degrade performance for 175B models by exploiting emergent properties. Read More:

Paper: arxiv.org/abs/2208.07339
Software: huggingface.co/blog/hf-bitsan…
Emergence: timdettmers.com/2022/08/17/llm…
LLM.int8() works by combining (1) a high-precision vector-wise quantization technique and (2) mixed-precision decomposition. To develop (2), insights into emergent features and how they dominate attention and model predictions have been key. More on emergent features below.
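The decomposition in (2) can be illustrated with a few lines of PyTorch (a simulation of the idea, not the actual bitsandbytes kernels): feature dimensions whose activation magnitude exceeds a threshold are multiplied in 16-bit, everything else goes through an absmax int8 matmul.

```python
import torch

def int8_absmax_matmul(X, W):
    # Simulated int8 matmul: row-wise absmax scales for X, column-wise for W.
    sx = X.abs().amax(dim=1, keepdim=True) / 127.0
    sw = W.abs().amax(dim=0, keepdim=True) / 127.0
    Xq = torch.round(X / sx.clamp(min=1e-8)).clamp(-127, 127)
    Wq = torch.round(W / sw.clamp(min=1e-8)).clamp(-127, 127)
    return (Xq @ Wq) * sx * sw

def llm_int8_style_forward(X, W, threshold=6.0):
    # X: (batch, in_features), W: (in_features, out_features).
    # Feature dimensions with an outlier above `threshold` stay in 16-bit.
    outliers = X.abs().amax(dim=0) > threshold
    out = int8_absmax_matmul(X[:, ~outliers], W[~outliers, :])
    out = out + X[:, outliers] @ W[outliers, :]   # high-precision path
    return out
```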
In terms of software, LLM.int8() means you can reduce the memory footprint of a large model by 2x. Some models that previously couldn't be used on Google Colab now can. Try this demonstration to run T5-11B on Colab:
colab.research.google.com/drive/1YORPWx4…