Matthew Carrigan (@carrigmat)
Jun 21, 2024
Good morning. At some point this summer, perhaps quite soon, @AIatMeta will be releasing a LLaMA-3 model with 400B parameters. It will likely be the strongest open-source LLM ever released by a wide margin.

This is a thread about how to run it locally. 🧵
First up, the basics: You can quantize models down to about 6 bits per parameter before performance starts to degrade. We don't want performance to degrade, so 6 bits it is. That means the model will be (6/8) * 400B = 300GB.
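A quick sanity check on that arithmetic - the same back-of-the-envelope calculation as a few lines of Python (just the figures from above, nothing model-specific):

params = 400e9            # 400B parameters
bits_per_param = 6        # ~6-bit quantization
model_gb = params * bits_per_param / 8 / 1e9
print(model_gb, "GB")     # -> 300.0 GB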
For every token generated by a language model, each weight must be read from memory once. Immediately, this makes it clear that the weights must stay in memory. Even offloading them to an M.2 drive (read speed <10GB/sec) will mean waiting 300/10 = 30 seconds per token.
In fact, reading the weights once per token turns out to be the bottleneck for generating text from an LLM - to get good speed you need surprisingly little compute, but a surprisingly huge amount of memory bandwidth.
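That gives us a simple bandwidth-bound upper limit on generation speed: every quantized weight is read once per token, so tokens/sec is at best memory bandwidth divided by model size. A rough sketch of that estimate (real throughput will come in somewhat below it):

model_gb = 300.0                          # the 6-bit 400B model from above
def max_tokens_per_sec(bandwidth_gb_per_sec):
    # upper bound: all weights read from memory once per token
    return bandwidth_gb_per_sec / model_gb

print(max_tokens_per_sec(10))             # M.2 offload: ~0.03 tok/s, i.e. ~30 sec/token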
So, if we don't really need massive compute power, we have two choices for where to put the weights: CPU memory or GPU memory.

GPU memory has higher bandwidth than CPU memory, but it's more expensive. How much more expensive?
Let's work it out: The highest-memory consumer GPUs (3090/4090) are 24GB. We would need about 16 of them for this. This would cost >$30,000, and you wouldn't even be able to fit them in a single case anyway.
The highest-memory datacenter GPUs (A100/H100) are 80GB. We would need 4-5 of these. Although we can fit that in a single case, the cost at current prices will probably be >$50,000.
What about CPU RAM? Consumer RAM is quite cheap, but to fit 300+GB in a single motherboard, we're probably looking at a server board, which means we need RDIMM server memory. Right now, you can get 64GB of this for about $300-400, so the total for 384GB would be ~$2,100.
You'll notice this number is a lot lower than the ones above! But how much bandwidth do we actually get? Here's the tricky bit: It depends on how many memory "channels" your motherboard has. This is not the same as memory slots!
Most consumer motherboards are "dual-channel": they have 2 memory channels, even if they have 4 slots. A single channel of DDR5 gives you about 40GB/sec of bandwidth, so with 2 channels that's 80GB/sec. 300GB / 80GB/sec ≈ 4 seconds per token at the theoretical minimum. Still slow!
Server motherboards, on the other hand, have a lot more. Intel Xeon has 8 channels per CPU, AMD EPYC has 12. Let's go with AMD - we want those extra channels.
1 EPYC CPU = 12 RAM channels = 480GB/sec
2 EPYC CPUs = 24 RAM channels = 960GB/sec!
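Plugging those channel counts into the same bandwidth-bound estimate (assuming ~40GB/sec per DDR5 channel, as above - these are theoretical ceilings, not benchmarks):

model_gb = 300.0
per_channel_gb_per_sec = 40               # ~one DDR5-4800 channel
for name, channels in [("consumer dual-channel", 2), ("1x EPYC", 12), ("2x EPYC", 24)]:
    bandwidth = channels * per_channel_gb_per_sec
    print(f"{name}: {bandwidth} GB/s, up to {bandwidth / model_gb:.1f} tok/s")
# consumer dual-channel: 80 GB/s, up to 0.3 tok/s
# 1x EPYC: 480 GB/s, up to 1.6 tok/s
# 2x EPYC: 960 GB/s, up to 3.2 tok/s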

Now we're getting somewhere! But what's the total system cost? Let's add it up:
1 socket system:
Motherboard: ~$800
CPU: ~$1,100
Memory: ~$2,100
Case/PSU/SSD: ~$400

Total: ~$4,400

2 socket system:
Motherboard: ~$1,000
CPU: ~$2,200
Memory: ~$2,100
Case/PSU/SSD: ~$400

Total: $5,700
These aren't cheap systems, but you're serving an enormous (and enormously capable) open LLM at 1-2 tokens/sec - roughly the speed of a fast human typist - at a fraction of the cost of doing it on GPU!

If you want to serve a 400B LLM locally, DDR5 servers are the way to go.
Sample 2-socket build:

Motherboard: Gigabyte MZ73-LM1
CPU: 2x AMD EPYC 9124
RAM: 24x 16GB DDR5-4800 (or faster) RDIMM (384GB total)
PSU: Corsair HX1000i (You don't need all that power, you just need lots of CPU power cables for 2 sockets!)
Case: PHANTEKS Enthoo Pro 2 Server
You'll also need heatsinks - 4U/tower heatsinks for Socket SP5 EPYC processors are hard to find, but some lunatic is making them in China and selling them on eBay/AliExpress. Buy two.
Once you actually have the hardware, it's super-easy: Just run llama.cpp or llama-cpp-python and pass in your Q6 GGUF. One trick, though: On a two-socket motherboard, you need to interleave the weights across both processors' RAM. Do this:

numactl --interleave=0-1 [your_script]
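If you'd rather drive it from Python via llama-cpp-python, the equivalent looks roughly like this - a minimal sketch, assuming a recent llama-cpp-python build; the GGUF filename is a placeholder and the thread count should match your core count. On a two-socket board, launch the Python process under the same numactl command.

from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-400b-q6_k.gguf",  # placeholder name for your Q6 GGUF
    n_ctx=8192,                           # context window
    n_threads=32,                         # roughly your physical core count
)
out = llm("Explain memory bandwidth in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])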
And that's it! A single no-GPU server tower pulling maybe 400W in the corner of your room will be running an LLM far more powerful than GPT-4.

Even a year ago, I would have dismissed this as completely impossible, and yet the future came at us fast. Good luck!
@byjlw_ @AIatMeta It is possible to permanently put some layers on the GPU and others on the CPU - this is natively supported in llama.cpp. However, you can't really move layers around during the computation: that would be much slower than leaving them where they are.
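In llama.cpp that's the --n-gpu-layers (-ngl) option (it needs a GPU-enabled build); in llama-cpp-python it's the n_gpu_layers argument. A sketch with an illustrative layer count - set it to however many layers actually fit in your VRAM:

from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-400b-q6_k.gguf",  # placeholder name for your Q6 GGUF
    n_gpu_layers=20,                      # illustrative: layers pinned on the GPU, the rest stay in CPU RAM
    n_threads=32,
)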
