Okay. Thanks for the nerd snipe, guys. I spent the day learning exactly how DeepSeek trained at 1/30 the price, instead of working on my pitch deck. The tl;dr of everything, according to their papers:
Q: How did DeepSeek get around export restrictions?
A: They didn’t. They just tinkered around with their chips to make sure they handled memory as efficiently as possible. They lucked out, and their perfectly optimized low-level code wasn’t actually held back by chip capacity.
Q: How did DeepSeek train so much more efficiently?
A: They used the formulas below to “predict” which parameters each token would activate. Then they only trained those parameters. They need 95% fewer GPUs than Meta because, for each token, they only trained 5% of their parameters.
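For intuition, here’s a toy sketch of that per-token sparse activation idea (a mixture-of-experts-style top-k gate). The sizes, router, and names here are made up for illustration; this is the general technique, not DeepSeek’s actual router or numbers:

```python
# Toy sketch of per-token sparse activation (mixture-of-experts routing).
# Assumed/illustrative: the gate, expert sizes, and top_k are invented here.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256
n_experts, top_k = 16, 2          # each token uses 2 of 16 experts

W_gate = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,   # W_in
     rng.standard_normal((d_ff, d_model)) * 0.02)   # W_out
    for _ in range(n_experts)
]

def moe_forward(x):
    """Route one token through only its top-k experts."""
    logits = x @ W_gate
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    chosen = np.argsort(probs)[-top_k:]             # indices of the active experts
    out = np.zeros_like(x)
    for i in chosen:
        W_in, W_out = experts[i]
        out += probs[i] * (np.maximum(x @ W_in, 0) @ W_out)
    return out, chosen

token = rng.standard_normal(d_model)
y, active = moe_forward(token)
print(f"active experts for this token: {sorted(active.tolist())} "
      f"({top_k}/{n_experts} = {top_k/n_experts:.0%} of expert params touched)")
```

The point: each token only touches a small slice of the expert parameters, so you only compute (and backprop through) that slice.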
Q: How is DeepSeek’s inference so much cheaper?
A: They compressed the KV cache. (This was a breakthrough they made a while ago.)
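If you want the gist of the KV-cache trick in code: the rough idea is to cache a small per-token latent instead of the full per-head keys and values, and re-expand it when attending. The dimensions and projection names below are invented; this is a sketch of the general low-rank compression idea, not their exact implementation:

```python
# Rough sketch of KV-cache compression via a low-rank latent.
# Assumed/illustrative: d_latent, head counts, and projection names are made up.
import numpy as np

rng = np.random.default_rng(0)

d_model  = 4096
n_heads  = 32
d_head   = 128                      # full K+V per token: 2 * n_heads * d_head floats
d_latent = 512                      # compressed latent per token

W_down = rng.standard_normal((d_model, d_latent)).astype(np.float32) * 0.02
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)).astype(np.float32) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)).astype(np.float32) * 0.02

def cache_token(x):
    """At decode time, store only this small latent per token."""
    return x @ W_down               # shape (d_latent,)

def expand_kv(latent):
    """Reconstruct per-head K and V from the cached latent when attending."""
    k = (latent @ W_up_k).reshape(n_heads, d_head)
    v = (latent @ W_up_v).reshape(n_heads, d_head)
    return k, v

x = rng.standard_normal(d_model).astype(np.float32)
k, v = expand_kv(cache_token(x))

full_cache  = 2 * n_heads * d_head   # floats per token with a normal KV cache
small_cache = d_latent               # floats per token with the compressed cache
print(f"cache per token: {full_cache} -> {small_cache} floats "
      f"({small_cache / full_cache:.1%} of the original)")
```

Smaller cache per token means you can serve longer contexts and bigger batches on the same memory, which is where the cheap inference comes from.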
Q: How did they replicate o1?
A: Reinforcement learning. Take complicated questions whose answers can be easily verified (math or code). Update the model when its answer is correct.
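Here’s roughly what “easily verified, update if correct” looks like as a toy loop. The policy below is just a softmax over a few canned answers so the example runs standalone; the real setup samples from and updates an LLM, and for code the verifier would run unit tests instead of checking a number:

```python
# Toy sketch of RL with a verifiable reward: sample an answer, check it
# programmatically, and nudge the policy toward answers that verify.
# Assumed/illustrative: the question, candidate answers, and learning rate.
import numpy as np

rng = np.random.default_rng(0)

question   = "What is 17 * 24?"
candidates = ["398", "408", "418", "428"]   # pretend these are model samples
logits     = np.zeros(len(candidates))      # toy "policy" parameters
lr         = 1.0

def verify(answer: str) -> float:
    """Verifiable reward: 1.0 if the answer is exactly correct, else 0.0."""
    return 1.0 if answer.strip() == str(17 * 24) else 0.0

for step in range(50):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    i = rng.choice(len(candidates), p=probs)     # sample an answer
    reward = verify(candidates[i])
    # REINFORCE-style update: raise the log-prob of the sampled answer, scaled by reward
    grad = -probs
    grad[i] += 1.0
    logits += lr * reward * grad

best = candidates[int(np.argmax(logits))]
print(f"{question} -> policy now prefers: {best}")
```

No human labels needed in the loop: the checker is the reward signal, which is why math and code are the natural domains for this.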
There are a bunch of other small innovations, but these are the big ones.
I don’t think there’s anything magical here. I really think they just made 2 massive cost-cutting innovations, which let them run more experiments, which led them to reverse engineer o1 faster.
Also, export restrictions didn’t harm them as much as we thought they did. That’s probably because our export restrictions were really shitty. The H800s are only worse than the H100s when it comes to chip-to-chip bandwidth.
“Is the US losing the war in AI??” I don’t think so. DeepSeek had a few big breakthroughs; we have had hundreds of small ones. If we adopt DeepSeek’s architecture, our models will be better, because we have more compute and more data.
