Complete hardware + software setup for running DeepSeek-R1 locally. The actual model, no distillations, and Q8 quantization for full quality. Total cost: $6,000. All download and part links below:
Motherboard: Gigabyte MZ73-LM0 or MZ73-LM1. We want 2 EPYC sockets to get a massive 24 channels of DDR5 RAM to max out that memory size and bandwidth.
RAM: This is the big one. We are going to need 768GB (to fit the model) across 24 RAM channels (to get the bandwidth to run it fast enough). That means 24 x 32GB DDR5-RDIMM modules. Example kits:
Case: You can fit this in a standard tower case, but make sure it has screw mounts for a full server motherboard, which most consumer cases won't. The Enthoo Pro 2 Server will take this motherboard:
PSU: The power use of this system is surprisingly low! (<400W) However, you will need lots of CPU power cables for 2 EPYC CPUs. The Corsair HX1000i has enough, but you might be able to find a cheaper option: corsair.com/us/en/p/psu/cp…
Heatsink: This is a tricky bit. AMD EPYC is socket SP5, and most heatsinks for SP5 assume you have a 2U/4U server blade, which we don't for this build. You probably have to go to Ebay/Aliexpress for this. I can vouch for this one: ebay.com/itm/2264992802…
And if you find the fans that come with that heatsink noisy, replacing them with 1 or 2 of these per heatsink will be efficient and whisper-quiet: newegg.com/noctua-nf-a12x…
And finally, the SSD: Any 1TB or larger SSD that can fit R1 is fine. I recommend NVMe, just because you'll have to copy 700GB into RAM when you start the model, lol. No link here, if you got this far I assume you can find one yourself!
And that's your system! Put it all together and throw Linux on it. Also, an important tip: Go into the BIOS and set the number of NUMA groups to 0. This will ensure that every layer of the model is interleaved across all RAM chips, doubling our throughput. Don't forget!
Next, the model. Time to download 700 gigabytes of weights from @huggingface! Grab every file in the Q8_0 folder here: huggingface.co/unsloth/DeepSe…
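If you'd rather script the download than grab the files by hand, huggingface_hub can filter for just the Q8_0 split. A minimal sketch; note the repo id below is a placeholder since the link above is truncated, so swap in the exact repo name:

# Minimal sketch of scripting the download with huggingface_hub.
# The repo_id is a placeholder -- use the exact repo from the link above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",   # placeholder: match the linked repo
    allow_patterns=["*Q8_0*"],            # only pull the Q8_0 split files
    local_dir="./DeepSeek-R1-Q8_0",
)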
Believe it or not, you're almost done. There are more elegant ways to set it up, but for a quick demo, just do this.
llama-cli -m ./DeepSeek-R1.Q8_0-00001-of-00015.gguf --temp 0.6 -no-cnv -c 16384 -p "<|User|>How many Rs are there in strawberry?<|Assistant|>"
If all goes well, you should witness a short load period followed by the stream of consciousness as a state-of-the-art local LLM begins to ponder your question:
And once it passes that test, just use llama-server to host the model and pass requests in from your other software. You now have frontier-level intelligence hosted entirely on your local machine, all open-source and free to use!
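For example, once llama-server is running (something like llama-server -m <your first .gguf shard> -c 16384), it exposes an OpenAI-compatible HTTP endpoint you can hit from any script. A minimal sketch in Python, assuming the default port of 8080:

# Minimal sketch: query a local llama-server over its OpenAI-compatible API.
# Assumes the server is running on the default port, 8080.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "How many Rs are there in strawberry?"}
        ],
        "temperature": 0.6,
    },
    timeout=600,  # CPU generation takes a while; be generous
)
print(resp.json()["choices"][0]["message"]["content"])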
And if you got this far: Yes, there's no GPU in this build! If you want to host on GPU for faster generation speed, you can! You'll just lose a lot of quality from quantization, or, if you want Q8, you'll need >700GB of GPU memory, which will probably cost $100k+
Since a lot of people are asking, the generation speed on this build is 6 to 8 tokens per second, depending on the specific CPU and RAM speed you get, or slightly less if you have a long chat history. The clip above is near-realtime, sped up slightly to fit video length limits
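If you want to sanity-check that number, the back-of-envelope math is below. It's a rough sketch that assumes DDR5-4800 RDIMMs and ~37B active parameters per token (R1 is a mixture-of-experts model), so treat it as a ceiling rather than a prediction:

# Rough throughput ceiling for this build (assumptions: DDR5-4800, 24 channels,
# ~37B active parameters per token for the MoE model, ~1 byte/weight at Q8).
channels = 24
per_channel = 4800e6 * 8                  # 4800 MT/s * 8 bytes per transfer
bandwidth = channels * per_channel        # ~921 GB/s theoretical peak

bytes_per_token = 37e9 * 1.0              # active weights read per token at Q8
print(bandwidth / 1e9, "GB/s peak")
print(bandwidth / bytes_per_token, "tok/s ceiling")   # ~25; observed is 6-8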
Another update: Someone pointed out this cooler, which I wasn't aware of. Seems like another good option if you can find a seller!
@danbri The clip above is a near-realtime recording, sped up by 1.5X just to make it fit inside Twitter's video length limit
@bobbyroastb33f As long as you don't pipe its code output into a shell, it can't affect your machine or call home without your permission, no matter how it's been trained
(it can definitely refuse to talk about sensitive topics, though)
@hmartinez82 Either way, I'll definitely try it!
@GamaGraphs Other than that it's just patiently shoving a bag of RAM sticks into slots over and over
An elegant idea I got from a @GoogleDeepMind paper years back: When doing continuous-valued regression with a neural net, don't use a single output neuron to estimate the value. Instead, have a layer of neurons output the mean/SD/weight of a set of Gaussians. 🧵
This gives you a much richer output, and a much cleaner loss: You simply add and normalize the Gaussians, and compute the loss from the probability assigned to the label value. Cross-entropy for regression tasks!
In the original paper, it was being used for RL, where there was intrinsic randomness in the environment. The model knew the reward was going to be one of two values, but when it had to output a point estimate, it could only emit the average and always incur a high loss
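If you want to play with the idea, here's a toy sketch of that kind of head in PyTorch (not the paper's code): it outputs means, log-stds and mixture weights, and the loss is the negative log-likelihood of the label under the resulting mixture.

# Toy mixture-density regression head -- a sketch of the idea, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianMixtureHead(nn.Module):
    def __init__(self, hidden_dim: int, n_components: int = 8):
        super().__init__()
        # One mean, one log-std and one unnormalized weight per component
        self.proj = nn.Linear(hidden_dim, 3 * n_components)

    def forward(self, h):
        mean, log_std, logit_w = self.proj(h).chunk(3, dim=-1)
        return mean, log_std, logit_w

def mixture_nll(mean, log_std, logit_w, target):
    # Log-probability of the label under each Gaussian component
    comp = torch.distributions.Normal(mean, log_std.exp())
    log_probs = comp.log_prob(target.unsqueeze(-1))
    # Weight the components in log-space and sum them: "add and normalize the Gaussians"
    log_mix = torch.logsumexp(F.log_softmax(logit_w, dim=-1) + log_probs, dim=-1)
    return -log_mix.mean()  # cross-entropy against the predicted density

# Usage: mean, log_std, logit_w = GaussianMixtureHead(256)(features)
#        loss = mixture_nll(mean, log_std, logit_w, labels)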
Big announcement today @huggingface: We now have a unified API for tool use across models from @MistralAI, @AIatMeta, @cohere, @NousResearch and more!
That means that you can reuse the same simplified, portable code to add tool capabilities to all of those models! 🧵
Tool use with LLMs is one of those things that's simple in theory but surprisingly complex in practice. When the model calls a tool, how do you know? How do you add it to the chat? Did you know some models expect tool definitions as JSON schema, while others expect Python function headers?
Even the closed-source APIs often have messy documentation spread between "tool use" and "assistants" workflows. Figuring it out isn't easy!
That's changed, though. Let's walk through a simple tool use process:
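Something like the sketch below, assuming a tool-capable chat model (the checkpoint name here is just an example). You define the tool as a plain Python function with type hints and a docstring, and the chat template handles the model-specific formatting:

# Minimal sketch of the unified tool-use flow. The checkpoint is just an example;
# any tool-capable chat model should work the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The city to get the temperature for.
    """
    return 22.0  # stand-in for a real weather API call

checkpoint = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

messages = [{"role": "user", "content": "What's the temperature in Paris right now?"}]

# The chat template converts the Python function into whatever tool-definition
# format this particular model expects -- that's the unified part.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_temperature],
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
# The model should emit a tool call; append that call and the tool's result to
# `messages` (roles "assistant" and "tool") and generate again for the final answer.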
Good morning. At some point this summer, perhaps quite soon, @AIatMeta will be releasing a LLaMA-3 model with 400B parameters. It will likely be the strongest open-source LLM ever released by a wide margin.
This is a thread about how to run it locally. 🧵
First up, the basics: You can quantize models to about 6 bits per parameter before performance degrades. We don't want performance to degrade, so 6 bits it is. This means the model will be (6/8) * 400B = 300GB.
For every token generated by a language model, each weight must be read from memory once. Immediately, this makes it clear that the weights must stay in memory. Even offloading them to an M.2 drive (read speed <10GB/sec) will mean waiting 300/10 = 30 seconds per token.
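Putting those two numbers together as a quick sanity check (rough figures, and the 10GB/sec read speed is deliberately optimistic):

# Back-of-envelope sizing for a 400B-parameter model at ~6-bit quantization.
params = 400e9
bits_per_param = 6
model_bytes = params * bits_per_param / 8
print(model_bytes / 1e9, "GB")                       # 300 GB -> has to live in RAM

m2_read = 10e9                                       # optimistic M.2 read speed
print(model_bytes / m2_read, "seconds per token")    # ~30 s/token if streamed from disk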
Alright, strap in. Support for Command-R+ was merged into llama.cpp exactly 4 hours ago. We're going to start talking to a GPT-4 level model on local hardware without a GPU. If you have 64GB of RAM, feel free to follow along 🧵
First up, a note about hardware: Text generation is limited by memory bandwidth. This will run on any machine with 64GB or more, but if you want speed I recommend DDR5, ideally on an 8 or even 12-channel motherboard, like Xeon/Epyc/Threadripper Pro/Apple silicon.
To start, we're going to build the latest version of llama.cpp.
We're exploring end-to-end NLP TensorFlow models in 🤗Transformers! We've got a quick gist here if you want to get started, or you can read on for more. 🧵 gist.github.com/Rocketknight1/…
Firstly, what's going on here? Briefly, we've integrated TensorFlow Text with 🤗Transformers, so that you can easily get a TF tokenizer that matches your model checkpoint. This works for any checkpoint, even one you trained! (Only BERT-based for now, but that will change)
Didn't we have TF tokenizers already? If you mean tokenizers that could return tokens as NumPy or TF arrays, then yes, that's been a feature since the beginning. This is different - the tokenizer is compiled into the model graph itself. No pre-processing is required.
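Here's a minimal sketch of what that looks like, assuming you have tensorflow-text installed alongside transformers (and remember, BERT-based checkpoints only for now):

# Minimal sketch: a Keras model that tokenizes raw strings inside the graph.
# Assumes tensorflow-text is installed; BERT-based checkpoints only for now.
import tensorflow as tf
from transformers import TFAutoModel, TFBertTokenizer

checkpoint = "bert-base-uncased"
tokenizer = TFBertTokenizer.from_pretrained(checkpoint)  # a Keras layer, not a Python-side tokenizer
bert = TFAutoModel.from_pretrained(checkpoint)

text_in = tf.keras.Input(shape=(), dtype=tf.string)      # raw strings go straight in
tokenized = tokenizer(text_in)                           # input_ids / attention_mask / token_type_ids
hidden = bert(tokenized).last_hidden_state
end_to_end = tf.keras.Model(inputs=text_in, outputs=hidden)

# No separate preprocessing step: strings in, hidden states out
print(end_to_end(tf.constant(["This tokenizer lives inside the graph!"])).shape)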
There's a fully functional protein design space on HuggingFace now, which would have felt like outrageous science fiction even 18 months ago. I'm going to try to explain what the incredible potential here is. 🧵
Proteins are long chains of simple chemicals called amino acids that fold up into complex 3D shapes. Different amino acids affect the structure in different ways - some stick to each other, some repel, some force bends into the chain.
The structure of a protein is critical to what it actually does. For example, enzymes (a type of protein that catalyzes a chemical reaction) often have a pocket in their structure that fits the chemicals they operate on, while rejecting ones they don't.