Tim Dettmers
Research Scientist @allen_ai and incoming professor @CarnegieMellon. I blog about deep learning and PhD life at https://t.co/Y78KDJKdtF.
Nov 12 9 tweets 3 min read
This is the most important paper in a long time. It shows with strong evidence that we are reaching the limits of quantization. The paper says this: the more tokens you train on, the more precision you need. This has broad implications for the entire field and the future of GPUs 🧵

Arguably, most progress in AI has come from improvements in computational capabilities, which mainly relied on low precision for acceleration (32 -> 16 -> 8 bit). This is now coming to an end. Together with physical limitations, this creates the perfect storm for the end of scale.
Jun 6, 2023 12 tweets 5 min read
We present SpQR, which allows lossless LLM inference at 4.75 bits with a 15% speedup. You can run a 33B LLM fully losslessly on a single 24GB GPU. SpQR works by isolating sensitive weights in higher precision (see the sketch after the results below) and roughly doubles the improvements from GPTQ: arxiv.org/abs/2306.03078 🧵

Rapid-fire results 1/2:
- 4.75 bit/param lossless; 3.35 bit/param best performance trade-off
- Performance cliff at 3.35 bits that is difficult to overcome
- 13B/33B LLaMA fits into iPhone 14/colab T4 with 3.35 bits
- 15% faster than FP16; ~2x speedup vs PyTorch sparse matmul
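The core idea in a few lines. This is a toy sketch, not the official SpQR implementation: sensitivity is approximated here by plain weight magnitude (the real method uses a quantization-error criterion), the function name and the 1% outlier fraction are my own choices, and real kernels store integer codes instead of dequantizing.

```python
import torch

def spqr_style_quantize(W, bits=3, group_size=16, outlier_frac=0.01):
    # Toy sketch: keep the ~1% most "sensitive" weights in full precision as a
    # sparse matrix and quantize the rest group-wise to low bits.
    W = W.clone()

    # 1) Isolate outlier/sensitive weights (kept in high precision, stored sparse)
    k = max(1, int(W.numel() * outlier_frac))
    thresh = W.abs().flatten().kthvalue(W.numel() - k).values
    outlier_mask = W.abs() > thresh
    outliers = torch.where(outlier_mask, W, torch.zeros_like(W)).to_sparse()
    W[outlier_mask] = 0.0

    # 2) Group-wise low-bit quantization of everything else
    levels = 2 ** bits - 1
    flat = W.view(-1, group_size)
    wmin = flat.min(dim=1, keepdim=True).values
    wmax = flat.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / levels
    codes = torch.round((flat - wmin) / scale)      # integer codes in [0, levels]
    deq = (codes * scale + wmin).view_as(W)         # dequantized low-bit part

    return deq + outliers.to_dense()                # what inference would see

W = torch.randn(1024, 1024)
print((W - spqr_style_quantize(W)).abs().mean())
```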
May 25, 2023 8 tweets 3 min read
Looking at the comments, some people missed the Guanaco-33B demo because it was added later: huggingface.co/spaces/uwnlp/g…

Big thanks to @huggingface for sponsoring this demo!

The second thing I noticed was that people were a bit lost on how to use the adapters. So here is a tutorial 🧵

Guanaco models use Low-rank Adapters (LoRA) and a base model (LLaMA). As such, to use Guanaco models, you need to load both of them and combine them. You can do that in many different ways. The CPU memory needed is the final model size (not the checkpoint size). Here are the use cases:
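A minimal sketch of the most common path, using the transformers + peft stack. The base-model and adapter repo names below are placeholders; substitute the LLaMA weights and Guanaco adapter size you actually want.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "huggyllama/llama-7b"        # placeholder: the LLaMA base weights you have access to
adapter_id = "timdettmers/guanaco-7b"  # placeholder: the matching Guanaco LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)

# Attach the LoRA adapter on top of the frozen base model
model = PeftModel.from_pretrained(base, adapter_id)

# Optionally fold the adapter weights into the base for slightly faster inference
model = model.merge_and_unload()
```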
May 24, 2023 41 tweets 12 min read
QLoRA: 4-bit finetuning of LLMs is here! With it comes Guanaco, a chatbot on a single GPU, achieving 99% ChatGPT performance on the Vicuna benchmark:

Paper: arxiv.org/abs/2305.14314
Code+Demo: github.com/artidoro/qlora
Samples: colab.research.google.com/drive/1kK6xasH…
Colab: colab.research.google.com/drive/17XEqL1J…

Want to see how good Guanaco 65B is? Here is a little fun game: Can you distinguish ChatGPT outputs from Guanaco-65B outputs? We authors had a hard time distinguishing them — maybe there is a trick? Are you better than us? colab.research.google.com/drive/1kK6xasH… (solutions after each sample)
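If you want to try the recipe on your own model, here is a minimal sketch of 4-bit finetuning via the Hugging Face integration. The model name, LoRA rank, and target modules are illustrative, not the exact Guanaco hyperparameters.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # illustrative; any supported causal LM works

# The QLoRA recipe: 4-bit NF4 storage, double quantization, bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Only small low-rank adapters are trained; the 4-bit base stays frozen
lora = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# ...then finetune with your usual Trainer / training loop on an instruction dataset
```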
May 12, 2023 6 tweets 2 min read
The 4-bit bitsandbytes private beta is here! Our method, QLoRA, is integrated with the HF stack and supports all models. You can finetune a 65B model on a single 48 GB GPU. This beta will help us catch bugs and issues before our full release. Sign up:
forms.gle/QCxrUmXJ4RCbrk…

We will send out about 50 invites per day, and the beta will run for about a week. As a beta tester, you get early access and can help make this feature a smooth experience for everyone. Significant contributions will be acknowledged in the repos/paper.
Apr 26, 2023 25 tweets 8 min read
Our work on loss spikes and stable 8-bit CLIP training is the largest Int8 training to date (1B parameters). We introduce SwitchBack layers and StableAdamW to ensure stability at these scales. Work with the awesome @Mitchnw

Paper: arxiv.org/abs/2304.13013
Colab: github.com/mlfoundations/…

The bedrock of our work is a careful analysis of loss spikes. We were looking for the causal factors so that we could develop effective solutions. We found that "fast" spikes occur due to Adam, while "slow" loss spikes in fp16 training mainly occur due to instabilities in early layers.
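As a rough illustration of the optimizer-side fix, here is a toy single-tensor sketch of the StableAdamW idea as I would summarize it: AdamW plus AdaFactor-style update clipping that shrinks the step when the current gradient is large relative to the second-moment estimate. This is an assumption-laden sketch, not the paper's implementation; the function name, state layout, and clipping statistic are mine.

```python
import torch

def stableadamw_step(p, g, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    # Toy sketch: AdamW plus AdaFactor-style update clipping (my reading of
    # StableAdamW; the paper's exact formulation may differ).
    m, v, t = state["m"], state["v"], state["t"] + 1
    m.mul_(betas[0]).add_(g, alpha=1 - betas[0])          # first moment
    v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])   # second moment
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)

    # Update clipping: RMS of g^2 / v_hat measures how "surprising" the current
    # gradient is; large values precede the "fast" Adam-driven loss spikes.
    rms = (g.pow(2) / v_hat.clamp(min=eps)).mean().sqrt()
    lr_t = lr / max(1.0, rms.item())

    p.mul_(1 - lr_t * wd)                                 # decoupled weight decay
    p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr_t)     # (clipped) Adam step
    state["t"] = t

# usage on a single weight tensor
w = torch.randn(512, 512)
state = {"m": torch.zeros_like(w), "v": torch.zeros_like(w), "t": 0}
stableadamw_step(w, torch.randn_like(w), state)
```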
Aug 17, 2022 23 tweets 8 min read
We release LLM.int8(), the first 8-bit inference method that saves 2x memory and does not degrade performance for 175B models by exploiting emergent properties. Read More:

Paper: arxiv.org/abs/2208.07339
Software: huggingface.co/blog/hf-bitsan…
Emergence: timdettmers.com/2022/08/17/llm…

LLM.int8() works by combining (1) vector-wise quantization with (2) mixed-precision decomposition. To develop (2), insights into emergent features and how they dominate attention and model predictions were key. More on emergent features below.
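To make the two ingredients concrete, here is a toy simulation of the math. It is not the actual bitsandbytes CUDA kernels: the real kernels keep int8 tensors with int32 accumulation, and the 6.0 outlier threshold is just the commonly used default, so treat shapes and magnitudes as illustrative.

```python
import torch

def llm_int8_style_matmul(X, W, threshold=6.0):
    # (2) Mixed-precision decomposition: feature dimensions containing outliers
    #     (|x| > threshold) are computed in high precision, the rest in int8.
    outlier_cols = (X.abs() > threshold).any(dim=0)
    X_high, W_high = X[:, outlier_cols], W[outlier_cols, :]
    X_reg, W_reg = X[:, ~outlier_cols], W[~outlier_cols, :]

    # (1) Vector-wise quantization: one scale per row of X and per column of W
    sx = X_reg.abs().amax(dim=1, keepdim=True) / 127.0
    sw = W_reg.abs().amax(dim=0, keepdim=True) / 127.0
    Xq = torch.round(X_reg / sx.clamp(min=1e-8)).to(torch.int8)
    Wq = torch.round(W_reg / sw.clamp(min=1e-8)).to(torch.int8)

    # Int8 matmul (simulated in float here), rescaled by the outer product of scales
    Y_low = (Xq.float() @ Wq.float()) * (sx * sw)
    Y_high = X_high @ W_high  # outlier dimensions stay in high precision
    return Y_low + Y_high

X, W = torch.randn(8, 1024) * 2, torch.randn(1024, 1024) * 0.02
print((llm_int8_style_matmul(X, W) - X @ W).abs().max())
```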
Aug 10, 2022 6 tweets 3 min read
We release the public beta for bnb-int8🟪 for all @huggingface 🤗models, which allows for Int8 inference without performance degradation up to scales of 176B params 📈. You can run OPT-175B/BLOOM-176B easily on a single machine 🖥️. You can try it here: docs.google.com/document/d/1Jx…
1/n Stay tuned for the full research details. Our work is all about emergence. We show for the first time that it is possible to detect emergent properties in transformer hidden states directly. These insights were critical to achieving zero-degradation quantization at scale.
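Usage is a one-line change when loading any supported Hugging Face model; here is a minimal sketch (the model name is illustrative, and a CUDA GPU plus the bitsandbytes and accelerate packages are assumed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-3b"  # illustrative; larger models work the same given enough GPU memory

tokenizer = AutoTokenizer.from_pretrained(model_id)
# load_in_8bit routes the linear layers through the bitsandbytes Int8 kernels
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")

inputs = tokenizer("8-bit inference works because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```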
Oct 8, 2021 13 tweets 5 min read
I am excited to share my latest work: 8-bit optimizers – a replacement for regular optimizers. Faster 🚀, 75% less memory 🪶, same performance📈, no hyperparam tuning needed 🔢. 🧵/n

Paper: arxiv.org/abs/2110.02861
Library: github.com/facebookresear…
Video:

8-bit optimizers are mostly useful for finetuning large models that did not fit into memory before. They also make it easier to pretrain larger models, and they have great synergy with sharded data parallelism. 8-bit Adam is already used across multiple teams at Facebook.
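Switching an existing training script over is a drop-in change; a minimal sketch (the learning rate and the tiny stand-in model are illustrative):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for a much larger model

# Drop-in replacement for torch.optim.Adam: the optimizer states (the
# memory-hungry part for large models) are kept in 8 bits instead of 32 bits.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

x = torch.randn(16, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```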
Apr 8, 2020 14 tweets 3 min read
How can you successfully train transformers on small datasets like PTB and WikiText-2? Are LSTMs better on small datasets? I ran 339 experiments worth 568 GPU hours and came up with some answers. I do not have time to write a blog post, so here is a Twitter thread instead. 1/n

To give a bit of background: all this came about from my past frustration with replicating Transformer-XL results on PTB and getting very poor results on WikiText-2 (WT2). On WT2, my best model after 200+ experiments was around 90 ppl, which is far from standard LSTM baselines (65.8 ppl).