This is the most important paper in a long time. It shows with strong evidence that we are reaching the limits of quantization. The paper says this: the more tokens you train on, the more precision you need. This has broad implications for the entire field and the future of GPUs 🧵
Arguably, most progress in AI came from improvements in computational capabilities, which mainly relied on low precision for acceleration (32 -> 16 -> 8 bit). This is now coming to an end. Together with physical limitations, this creates the perfect storm for the end of scale.
Blackwell will have excellent 8-bit capabilities, with blockwise quantization implemented at the hardware level. This will make 8-bit training as easy as the switch from FP16 to BF16 was. However, as we see from this paper, we need more than 8-bit precision to train many models.
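To make the idea concrete, here is a minimal sketch of blockwise int8 quantization (my own illustrative code, not Blackwell's hardware implementation): each block gets its own absmax scale, which keeps the quantization error local even when a few outliers are present.

```python
import torch

def blockwise_quantize_int8(x: torch.Tensor, block_size: int = 64):
    """Quantize a tensor to int8 with one absmax scale per block."""
    x = x.flatten()
    pad = (-x.numel()) % block_size
    x = torch.nn.functional.pad(x, (0, pad))                   # pad to a multiple of block_size
    blocks = x.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True) / 127.0    # one scale per block
    scales = scales.clamp(min=1e-8)
    q = torch.clamp((blocks / scales).round(), -127, 127).to(torch.int8)
    return q, scales

def blockwise_dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales).flatten()

# Example: the error stays small even with a few large outliers,
# because each block is scaled independently.
w = torch.randn(4096)
w[::512] *= 20                                                 # inject outliers
q, s = blockwise_quantize_int8(w)
print((blockwise_dequantize(q, s)[: w.numel()] - w).abs().max())
```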
The main reason why Llama 405B did not see much use compared to other models is that it is just too big. Running a 405B model for inference is a big pain. But the paper shows that for smaller models, say 70B, you cannot train efficiently in low precision.
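Some back-of-the-envelope arithmetic (mine, not from the paper) on the weight-memory footprint makes the point:

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Weight memory in GB for `params_billion` parameters stored at `bits` per weight."""
    return params_billion * 1e9 * bits / 8 / 1e9

for params in (8, 70, 405):
    for bits in (16, 8, 4):
        print(f"{params:>3}B @ {bits:>2}-bit: {weight_memory_gb(params, bits):6.0f} GB of weights")

# 405B needs ~810 GB of weights in 16-bit and still ~405 GB in 8-bit (many GPUs either way),
# while 70B in 16-bit (~140 GB) is already far more manageable, before even counting the KV cache.
```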
[Figure legend: 8B (circle), 70B (triangle), 405B (star)]
We see that for 20B-token training runs, training an 8B model is more efficient in 16-bit. For the 70B model, 8-bit still works, but it is already getting less efficient.
From my own experience (a lot of failed research), you cannot cheat efficiency. If quantization fails, sparsification fails too, and so do other efficiency mechanisms. If this is true, we are close to optimal now. With this, there are only three ways forward that I see...
(1) Scaling data centers: this still scales for ~2 years.
(2) Scaling through dynamics: routing to smaller specialized models, or between larger and smaller models.
(3) Knowledge distillation: I believe distillation behaves differently than other techniques and might have different properties.
For hardware, we still have HBM4, which will be a good boost. But FP4 training is a lie. Node shrinks will not add much efficiency anymore. @dylan522p believes that AI can help design more efficient chips, but I am skeptical that there is much more room.
All of this means that the paradigm will soon shift from scaling to "what can we do with what we have". I think the paradigm of "how do we help people be more productive with AI" is the best mindset forward. This mindset is about processes and people rather than technology.
We present SpQR, which allows lossless LLM inference at 4.75 bits with a 15% speedup. You can run a 33B LLM on a single 24GB GPU fully lossless. SpQR works by isolating sensitive weights with higher precision and roughly doubles improvements from GPTQ: arxiv.org/abs/2306.03078🧵
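A minimal sketch of the core idea as I would paraphrase it (toy code, not the actual SpQR implementation; SpQR selects sensitive weights by quantization sensitivity, here I just use magnitude): keep a small set of outlier weights in 16-bit as a sparse matrix and quantize the dense remainder to a few bits.

```python
import torch

def split_outliers_and_quantize(W: torch.Tensor, outlier_frac: float = 0.01, bits: int = 3):
    """Toy decomposition: sparse high-precision outliers + coarsely quantized dense remainder."""
    k = max(1, int(W.numel() * outlier_frac))
    thresh = W.abs().flatten().topk(k).values.min()
    outlier_mask = W.abs() >= thresh                    # "sensitive" weights (by magnitude here)
    sparse_fp16 = (W * outlier_mask).to_sparse()        # kept in high precision

    dense = W * (~outlier_mask)
    levels = 2 ** (bits - 1) - 1
    scale = dense.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / levels   # per-row scale
    q = torch.clamp((dense / scale).round(), -levels, levels).to(torch.int8)
    return sparse_fp16, q, scale

def reconstruct(sparse_fp16, q, scale):
    return sparse_fp16.to_dense() + q.float() * scale

W = torch.randn(256, 256)
s, q, sc = split_outliers_and_quantize(W)
print((reconstruct(s, q, sc) - W).abs().mean())        # small average error despite 3-bit storage
```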
Rapid-fire results 1/2:
- 4.75 bit/param lossless; 3.35 bit/param best performance trade-off
- Performance cliff at 3.35 bits that is difficult to overcome
- 13B/33B LLaMA fits into an iPhone 14 / Colab T4 at 3.35 bits
- 15% faster than FP16; ~2x speedup vs PyTorch sparse matmul
Rapid-fire results 2/2:
- row outliers seem to be responsible for creating column outliers in the next layer
- larger outliers in later layers
- probably due to the GPTQ procedure, outliers get larger in the last matrix dimensions
Guanaco models use Low-rank Adapters (LoRA) and a base model (LLaMA). As such, to use Guanaco models, you need to load both and combine them. You can do that in many different ways. The CPU memory needed is the final model size (not the checkpoint size). Here are the use cases:
1. Load Guanaco in 16-bit for fast inference.
2. Load Guanaco in 8-bit or 4-bit to fit it on small GPUs (slow at the moment, but 4-bit will be fast soon); see the sketch below.
3. Load Guanaco onto multiple GPUs.
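Here is a rough sketch of use case 2 with the Hugging Face stack (the model/adapter IDs and exact flags are placeholders and may differ from the official release):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "huggyllama/llama-65b"          # placeholder: the LLaMA base model
adapter_id = "timdettmers/guanaco-65b"    # placeholder: the Guanaco LoRA adapter

# Use case 2: load the base model in 4-bit so it fits on a small GPU.
# (Use load_in_8bit=True for 8-bit, or torch_dtype=torch.float16 without the flag for use case 1.)
base = AutoModelForCausalLM.from_pretrained(base_id, load_in_4bit=True, device_map="auto")

# Combine base model + adapter: this is what "a Guanaco model" actually is.
model = PeftModel.from_pretrained(base, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)
```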
Want to see how good Guanaco 65B is? Here is a little fun game: Can you distinguish ChatGPT outputs from Guanaco-65B outputs? We authors had a hard time distinguishing them — maybe there is a trick? Are you better than us? colab.research.google.com/drive/1kK6xasH… (solutions after each sample)
Rapid-fire findings (1/3):
- 97% ChatGPT performance on 1 consumer GPU in 12 hours
- matching 16-bit performance across all scales and models
- key contributions: NormalFloat data type, paged optimizers, double quantization (sketch after this list)
- FLAN v2 good for instruction tuning, bad for chatbots
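For reference, a minimal sketch of how these pieces are exposed in the bitsandbytes/transformers integration (treat the exact argument names as an assumption about the current API, not as part of the paper):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NormalFloat (NF4) storage + double quantization, with bf16 compute for the matmuls.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat data type
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)
# Paged optimizers are selected separately, e.g. optim="paged_adamw_32bit" in the HF Trainer.
```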
The 4-bit bitsandbytes private beta is here! Our method, QLoRA, is integrated with the HF stack and supports all models. You can finetune a 65B model on a single 48 GB GPU. This beta will help us catch bugs and issues before our full release. Sign up: forms.gle/QCxrUmXJ4RCbrk…
We will send out about 50 invites per day, and the beta will run for about a week. As a beta tester, you get early access and can help make this feature a smooth experience for everyone. Significant contributions will be acknowledged in the repos/paper.
Our method aims to make even the largest LLMs available for finetuning on consumer hardware in a simple and straightforward way. It is memory efficient, fast, and highly robust, making it easy to replicate 16-bit fine-tuning performance for large LLMs on a consumer hardware setup.
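A rough sketch of what the QLoRA finetuning setup looks like with peft (assuming a 4-bit base model loaded as in the previous snippet; argument names may vary between versions):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `model` is the 4-bit base model from the previous snippet.
model = prepare_model_for_kbit_training(model)   # casts norms/embeddings for stability

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the small 16-bit LoRA adapters are trained
```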
Our work on loss spikes and stable 8-bit CLIP training is the largest Int8 training to date (1B). We introduce the SwitchBack layers and StableAdamW to ensure stability at these scales. Work with the awesome @Mitchnw
The bedrock of our work is a careful analysis of loss spikes. We were looking for the causal factor to be able to develop effective solutions. We found that "fast" spikes occur due to Adam. "Slow" loss spikes in fp16 training mainly occur due to instabilities in early layers.
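As I understand it, StableAdamW adds AdaFactor-style update clipping to AdamW to tame exactly these Adam-induced "fast" spikes. A rough sketch of that clipping idea (my paraphrase, not the paper's exact implementation):

```python
import torch

def stable_adamw_step(p, g, m, v, step, lr=1e-3, betas=(0.9, 0.999),
                      eps=1e-8, weight_decay=0.01):
    """One AdamW step with AdaFactor-style update clipping (paraphrase of StableAdamW).

    The learning rate for this tensor is shrunk when the raw update g/sqrt(v) gets
    large, which suppresses the 'fast' Adam-induced loss spikes."""
    b1, b2 = betas
    m.mul_(b1).add_(g, alpha=1 - b1)
    v.mul_(b2).addcmul_(g, g, value=1 - b2)
    m_hat = m / (1 - b1 ** step)
    v_hat = v / (1 - b2 ** step)

    # Update clipping: per-tensor RMS of g^2 / v; scale down the lr if it exceeds 1.
    rms = torch.sqrt(torch.mean(g ** 2 / torch.clamp(v_hat, min=eps ** 2)))
    lr_t = lr / max(1.0, rms.item())

    p.mul_(1 - lr_t * weight_decay)                    # decoupled weight decay
    p.add_(-lr_t * m_hat / (v_hat.sqrt() + eps))
    return p, m, v
```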
Previous work often studies loss spikes by analysis of the state of the network when the loss spike occurred. But what we find is that loss spikes depend on previous updates which seemingly look fine at the loss level. This makes it very difficult to analyze.
We release LLM.int8(), the first 8-bit inference method that saves 2x memory and does not degrade performance for 175B models by exploiting emergent properties. Read More:
LLM.int8() works by using: (1) a high-precision vector-wise quantization technique and (2) mixed-precision decomposition. To develop (2), insights into emergent features and how they dominate attention and model predictions were key. More on emergent features below.
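A toy sketch of that decomposition (illustrative only, not the actual CUDA kernels): hidden-state columns with large-magnitude emergent features are multiplied in high precision, everything else goes through int8 with per-vector scales.

```python
import torch

def int8_matmul_with_decomposition(X, W, threshold=6.0):
    """Toy LLM.int8()-style forward: X @ W with outlier feature columns kept in high precision."""
    # (2) Mixed-precision decomposition: find outlier feature dimensions in X.
    outlier_cols = X.abs().amax(dim=0) >= threshold

    X_out, W_out = X[:, outlier_cols], W[outlier_cols, :]          # high-precision path
    X_reg, W_reg = X[:, ~outlier_cols], W[~outlier_cols, :]        # int8 path

    # (1) Vector-wise quantization: one scale per row of X and per column of W.
    sx = X_reg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127
    sw = W_reg.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127
    Xq = (X_reg / sx).round().clamp(-127, 127)
    Wq = (W_reg / sw).round().clamp(-127, 127)

    out_int8 = (Xq @ Wq) * sx * sw          # dequantize the int8 result
    out_fp = X_out @ W_out                  # outlier dimensions stay high precision
    return out_int8 + out_fp

X, W = torch.randn(4, 512), torch.randn(512, 512)
X[:, 10] *= 20                              # one emergent outlier feature dimension
print((int8_matmul_with_decomposition(X, W) - X @ W).abs().max())
```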
In terms of software, LLM.int8() means that you can reduce the memory footprint of a large model by 2x. Some models that previously couldn't be used on Google Colab can now be used there. Try this demonstration to run T5-11b on Colab: colab.research.google.com/drive/1YORPWx4…
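For reference, the integration boils down to a single flag in transformers (a sketch assuming bitsandbytes and accelerate are installed; flag names as in the HF integration I know):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-11b")
model = T5ForConditionalGeneration.from_pretrained(
    "t5-11b",
    load_in_8bit=True,     # weights stored in int8 via LLM.int8(), ~2x less memory
    device_map="auto",
)
inputs = tokenizer("translate English to German: Hello, world!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))
```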