Matthew Carrigan
Jun 21
Good morning. At some point this summer, perhaps quite soon, @AIatMeta will be releasing a LLaMA-3 model with 400B parameters. It will likely be the strongest open-source LLM ever released by a wide margin.

This is a thread about how to run it locally. 🧵
First up, the basics: You can quantize models down to about 6 bits per parameter before performance degrades. We don't want performance to degrade, so 6 bits it is. This means the model will be (6/8) * 400B = 300GB.
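As a sanity check, here's that arithmetic as a tiny Python sketch (400B is the rumoured parameter count, and 6 bits/param is a rule of thumb rather than a hard threshold):

params = 400e9              # rumoured parameter count
bits_per_param = 6          # ~Q6 quantization
size_gb = params * bits_per_param / 8 / 1e9
print(f"{size_gb:.0f} GB")  # -> 300 GB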
For every token generated by a language model, each weight must be read from memory once. Immediately, this makes it clear that the weights must stay in memory. Even offloading them to an M.2 drive (read speed <10GB/sec) will mean waiting 300/10 = 30 seconds per token.
In fact, reading the weights once per token turns out to be the bottleneck for generating text from an LLM - to get good speed, you need surprisingly few FLOPs and a surprisingly huge amount of memory bandwidth.
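That reasoning boils down to one formula: seconds per token ≈ bytes of weights / memory bandwidth. A minimal sketch, reusing the 300GB figure and the ~10GB/sec M.2 read speed from above:

model_gb = 300

def seconds_per_token(bandwidth_gb_per_s):
    # every weight is read from memory once per generated token
    return model_gb / bandwidth_gb_per_s

print(seconds_per_token(10))  # fast M.2 SSD: 30.0 seconds per token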
So, if we don't really need massive compute power, we have two choices for where to put the weights: CPU memory or GPU memory.

GPU memory has higher bandwidth than CPU memory, but it's more expensive. How much more expensive?
Let's work it out: The highest-memory consumer GPUs (3090/4090) are 24GB. We would need about 16 of them for this. This would cost >$30,000, and you wouldn't even be able to fit them in a single case anyway.
The highest-memory datacenter GPUs (A100/H100) are 80GB. We would need 4-5 of these. Although we can fit that in a single case, the cost at current prices will probably be >$50,000.
What about CPU RAM? Consumer RAM is quite cheap, but to fit 300+GB in a single motherboard, we're probably looking at a server board, which means we need RDIMM server memory. Right now, you can get 64GB of this for about $300-400, so the total for 384GB would be ~$2,100.
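To make the comparison concrete, here's cost per GB of memory using the rough totals quoted above (treat these as order-of-magnitude figures, not live prices):

# (GB of memory, rough total cost in USD) from the estimates above
options = {
    "16x 3090/4090 (VRAM)":    (16 * 24, 30_000),
    "5x A100/H100 80GB (HBM)": (5 * 80, 50_000),
    "6x 64GB DDR5 RDIMM":      (384, 2_100),
}
for name, (gb, usd) in options.items():
    print(f"{name}: ~${usd / gb:.0f} per GB")  # roughly $78, $125 and $5 per GB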
You'll notice this number is a lot lower than the ones above! But how much bandwidth do we actually get? Here's the tricky bit: It depends on how many memory "channels" your motherboard has. This is not the same as memory slots!
Most consumer motherboards are "dual-channel": they have 2 memory channels, even if they have 4 slots. A single stick of DDR5 RAM has a bandwidth of about 40GB/sec, so with 2 channels that becomes 80GB/sec. 300GB / 80GB/sec ≈ 4 seconds per token at the theoretical minimum. Still slow!
Server motherboards, on the other hand, have a lot more. Intel Xeon has 8 channels per CPU, AMD EPYC has 12. Let's go with AMD - we want those extra channels.
1 EPYC CPU = 12 RAM channels = 480GB/sec
2 EPYC CPUs = 24 RAM channels = 960GB/sec!
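Putting the channel math together (40GB/sec per channel is a round-number assumption for DDR5-4800; real figures vary with speed grade):

per_channel_gb_s = 40   # rough figure for one DDR5-4800 channel
model_gb = 300          # Q6-quantized 400B model

for name, channels in [("consumer dual-channel", 2),
                       ("single EPYC (12 channels)", 12),
                       ("dual EPYC (24 channels)", 24)]:
    bandwidth = channels * per_channel_gb_s
    print(f"{name}: {bandwidth} GB/s -> "
          f"{bandwidth / model_gb:.1f} tokens/sec theoretical ceiling")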

Now we're getting somewhere! But what's the total system cost? Let's add it up:
1-socket system:
Motherboard: ~$800
CPU: ~$1,100
Memory: ~$2,100
Case/PSU/SSD: ~$400

Total: ~$4,400

2-socket system:
Motherboard: ~$1,000
CPU: ~$2,200
Memory: ~$2,100
Case/PSU/SSD: ~$400

Total: $5,700
These aren't cheap systems, but you're serving an enormous (and enormously capable) open LLM at 1-2 tokens/sec - similar speed to a fast human typist, at a fraction of the cost of doing it on GPU!

If you want to serve a 400B LLM locally, DDR5 servers are the way to go.
Sample 2-socket build:

Motherboard: Gigabyte MZ73-LM1
CPU: 2x AMD EPYC 9124
RAM: 24x 16GB DDR5 RDIMM, 4800MHz or faster
PSU: Corsair HX1000i (You don't need all that power, you just need lots of CPU power cables for 2 sockets!)
Case: PHANTEKS Enthoo Pro 2 Server
You'll also need heatsinks - 4U/tower heatsinks for Socket SP5 EPYC processors are hard to find, but some lunatic is making them in China and selling them on eBay/AliExpress. Buy two.
Once you actually have the hardware, it's super-easy: Just run llama.cpp or llama-cpp-python and pass in your Q6 GGUF. One trick, though: On a two-socket motherboard, you need to interleave the weights across both processors' RAM. Do this:

numactl --interleave=0-1 [your_script]
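If you'd rather drive it from Python, a minimal llama-cpp-python sketch looks roughly like this (the GGUF filename is a placeholder - use whatever your quantized file is actually called - and on a two-socket board you'd still launch the Python process under numactl as above):

from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-400b-instruct-q6_k.gguf",  # placeholder filename
    n_ctx=8192,      # context length
    n_threads=32,    # roughly match your physical core count
)

out = llm("The capital of France is", max_tokens=16)
print(out["choices"][0]["text"])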
And that's it! A single no-GPU server tower pulling maybe 400W in the corner of your room will be running an LLM far more powerful than GPT-4.

Even a year ago, I would have dismissed this as completely impossible, and yet the future came at us fast. Good luck!
@byjlw_ @AIatMeta It is possible to permanently put some layers on the GPU and others on the CPU - this is natively supported in llama.cpp. However, you can't really move layers around during the computation; that's just going to be much slower than leaving them where they are.
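For reference, that CPU/GPU split is controlled by a layer count: -ngl / --n-gpu-layers in llama.cpp, or the n_gpu_layers argument in llama-cpp-python. A rough sketch, with an illustrative layer count and placeholder filename:

from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-400b-instruct-q6_k.gguf",  # placeholder filename
    n_gpu_layers=20,  # illustrative: keep this many layers in VRAM, the rest in system RAM
    n_threads=32,
)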
