Good morning. At some point this summer, perhaps quite soon, @AIatMeta will be releasing a LLaMA-3 model with 400B parameters. It will likely be the strongest open-source LLM ever released by a wide margin.
This is a thread about how to run it locally. 🧵
First up, the basics: You can quantize models to about 6 bits per parameter before performance degrades. We don't want performance to degrade, so 6 bits it is. This means the model will be (6/8) * 400B = 300GB.
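As a quick sanity check, the size arithmetic can be sketched in Python. The function name is made up for illustration, and it counts only the weights (no KV cache or runtime overhead), using decimal gigabytes as above.

```python
def model_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight size in decimal GB (1 GB = 1e9 bytes)."""
    total_bits = n_params * bits_per_param
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


# Compare a few quantization levels for a 400B-parameter model.
for bits in (16, 8, 6, 4):
    print(f"{bits:>2} bits/param -> {model_size_gb(400e9, bits):.0f} GB")
```

At 6 bits this gives the 300GB figure used in the rest of the thread.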
For every token generated by a language model, each weight must be read from memory once. Immediately, this makes it clear that the weights must stay in memory. Even offloading them to an M.2 drive (read speed <10GB/sec) will mean waiting 300/10 = 30 seconds per token.
In fact, reading the weights once per token turns out to be the bottleneck for generating text from an LLM - to get good speed, you need surprisingly few FLOPs, and a surprisingly huge amount of memory bandwidth.
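Here's a minimal sketch of that bound, assuming every weight is read exactly once per token and nothing else limits speed. The function name is hypothetical; the numbers are the ones from the M.2 example.

```python
def seconds_per_token(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on time per token when weight reads are the only cost."""
    return model_gb / bandwidth_gb_per_s


# 300 GB of weights streamed from an M.2 drive at ~10 GB/s.
print(f"M.2 drive: {seconds_per_token(300, 10):.0f} s/token")
```

Real systems will be slower than this bound, since it ignores compute, caching effects and any other overhead.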
So, if we don't really need massive compute power, we have two choices for where to put the weights: CPU memory or GPU memory.
GPU memory has higher bandwidth than CPU memory, but it's more expensive. How much more expensive?
Let's work it out: The highest-memory consumer GPUs (3090/4090) are 24GB. We would need about 16 of them for this. This would cost >$30,000, and you wouldn't even be able to fit them in a single case anyway.
The highest-memory datacenter GPUs (A100/H100) are 80GB. We would need 4-5 of these. Although we can fit that in a single case, the cost at current prices will probably be >$50,000.
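A rough sketch of the GPU count, if you want to plug in your own numbers. The `headroom_fraction` parameter is my own assumption, not something from the figures above: it reserves part of each card's memory for things other than weights (KV cache, activations).

```python
import math


def gpus_needed(model_gb: float, vram_gb: float, headroom_fraction: float = 0.0) -> int:
    """Cards required to hold the weights, optionally reserving some VRAM per card."""
    usable_gb = vram_gb * (1 - headroom_fraction)
    return math.ceil(model_gb / usable_gb)


for name, vram in (("24GB consumer card", 24), ("80GB datacenter card", 80)):
    bare = gpus_needed(300, vram)
    padded = gpus_needed(300, vram, headroom_fraction=0.2)  # 20% is an illustrative guess
    print(f"{name}: {bare} for weights alone, {padded} with 20% headroom")
```

Weights alone need 13 consumer cards or 4 datacenter cards; with the assumed 20% headroom that becomes 16 and 5.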
What about CPU RAM? Consumer RAM is quite cheap, but to fit 300+GB in a single motherboard, we're probably looking at a server board, which means we need RDIMM server memory. Right now, you can get 64GB of this for about $300-400, so the total for 384GB would be ~$2,100.
You'll notice this number is a lot lower than the ones above! But how much bandwidth do we actually get? Here's the tricky bit: It depends on how many memory "channels" your motherboard has. This is not the same as memory slots!
Most consumer motherboards are "dual-channel": they have 2 memory channels, even if they have 4 slots. A single stick of DDR5 RAM has a bandwidth of about 40GB/sec, so with 2 channels, this becomes 80GB/sec. 300GB / 80GB/sec = a theoretical minimum of about 4 seconds per token. Still slow!
Server motherboards, on the other hand, have a lot more. Intel Xeon has 8 channels per CPU, AMD EPYC has 12. Let's go with AMD - we want those extra channels.
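To see what the extra channels buy you, here's a back-of-the-envelope sketch. It assumes ~40GB/sec per DDR5 channel (the figure above), perfect scaling across channels and sockets, and weight reads as the only cost, so treat the results as ceilings rather than predictions. The names are made up for illustration.

```python
DDR5_GB_PER_S_PER_CHANNEL = 40  # rough per-channel figure; real sticks vary


def peak_tokens_per_second(model_gb: float, channels: int, sockets: int = 1,
                           per_channel_gb_s: float = DDR5_GB_PER_S_PER_CHANNEL) -> float:
    """Idealized upper bound: aggregate memory bandwidth divided by model size."""
    bandwidth_gb_s = channels * sockets * per_channel_gb_s
    return bandwidth_gb_s / model_gb


configs = {
    "consumer dual-channel": (2, 1),
    "Xeon, 8 channels": (8, 1),
    "EPYC, 12 channels": (12, 1),
    "2x EPYC, 12 channels each": (12, 2),
}
for name, (channels, sockets) in configs.items():
    rate = peak_tokens_per_second(300, channels, sockets)
    print(f"{name:<26} {rate:.2f} tok/sec (theoretical ceiling)")
```

A single 12-channel socket tops out at 1.6 tok/sec for a 300GB model under these assumptions, and two sockets at 3.2 tok/sec. Expect real numbers to land below that.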
These aren't cheap systems, but you're serving an enormous (and enormously capable) open LLM at 1-2 tok/sec, similar to the speed of a fast human typist, at a fraction of the cost of doing it on GPU!
If you want to serve a 400B LLM locally, DDR5 servers are the way to go.
Sample 2-socket build:
Motherboard: Gigabyte MZ73-LM1
CPU: 2x AMD EPYC 9124
RAM: 24x 4800MHz+ 16GB DDR5 RDIMM
PSU: Corsair HX1000i (You don't need all that power, you just need lots of CPU power cables for 2 sockets!)
Case: PHANTEKS Enthoo Pro 2 Server
You'll also need heatsinks - 4U/tower heatsinks for Socket SP5 EPYC processors are hard to find, but some lunatic is making them in China and selling them on eBay/AliExpress. Buy two.
Once you actually have the hardware, it's super-easy: Just run llama.cpp or llama-cpp-python and pass in your Q6 GGUF. One trick, though: On a two-socket motherboard, you need to interleave the weights across both processors' RAM. Do this:
numactl --interleave=0-1 [your_script]
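If you'd rather launch from Python than type that by hand, a small wrapper can build the same command line. This is only a sketch: `run_model.py` is a hypothetical stand-in for your own script, and it assumes `numactl` is installed and on your PATH.

```python
import shlex


def interleaved_command(script_argv, nodes="0-1"):
    """Prefix a command with numactl so its memory is interleaved across NUMA nodes."""
    return ["numactl", f"--interleave={nodes}", *script_argv]


# run_model.py is a placeholder name, not a real file from this thread.
cmd = interleaved_command(["python", "run_model.py"])
print(shlex.join(cmd))
# To actually launch it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems if your model path has spaces in it.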
And that's it! A single no-GPU server tower pulling maybe 400W in the corner of your room will be running an LLM far more powerful than GPT-4.
Even a year ago, I would have dismissed this as completely impossible, and yet the future came at us fast. Good luck!
@byjlw_ @AIatMeta It is possible to permanently put some layers on the GPU and others on the CPU - this is natively supported in llama.cpp. However, you can't really move layers around during the computation; that will be much slower than leaving them where they are.
Alright, strap in. Support for Command-R+ was merged into llama.cpp exactly 4 hours ago. We're going to start talking to a GPT-4 level model on local hardware without a GPU. If you have 64GB of RAM, feel free to follow along 🧵
First up, a note about hardware: Text generation is limited by memory bandwidth. This will run on any machine with 64GB or more, but if you want speed I recommend DDR5, ideally on an 8 or even 12-channel motherboard, like Xeon/Epyc/Threadripper Pro/Apple silicon.
To start, we're going to build the latest version of llama.cpp.
We're exploring end-to-end NLP TensorFlow models in 🤗Transformers! We've got a quick gist here if you want to get started, or you can read on for more. 🧵 gist.github.com/Rocketknight1/…
Firstly, what's going on here? Briefly, we've integrated TensorFlow Text with 🤗Transformers, so that you can easily get a TF tokenizer that matches your model checkpoint. This works for any checkpoint, even one you trained! (Only BERT-based for now, but that will change)
Didn't we have TF tokenizers already? If you mean tokenizers that could return tokens as Numpy or TF arrays, then yes, that's been a feature since the beginning. This is different - the tokenizer is compiled into the model graph itself. No pre-processing is required.
There's a fully functional protein design space on HuggingFace now, which would have felt like outrageous science fiction even 18 months ago. I'm going to try to explain what the incredible potential here is. 🧵
Proteins are long chains of simple chemicals called amino acids that fold up into complex 3D shapes. Different amino acids affect the structure in different ways - some stick to each other, some repel, some force bends into the chain.
The structure of a protein is critical to what it actually does. For example, enzymes (a type of protein that catalyzes a chemical reaction) often have a pocket in their structure that fits the chemicals they operate on, while rejecting ones they don't.