Okay. Thanks for the nerd snipe guys. I spent the day learning exactly how DeepSeek trained at 1/30 the price, instead of working on my pitch deck. The tl;dr to everything, according to their papers:
Q: How did DeepSeek get around export restrictions?
A: They didn’t. They just tinkered with their chips to make sure they handled memory as efficiently as possible. They lucked out: their perfectly optimized low-level code wasn’t actually held back by chip capacity.
Q: How did DeepSeek train so much more efficiently?
A: They used the formulas below to “predict” which parameters the model would activate for each token. Then, they only trained those. They need 95% fewer GPUs than Meta because, for each token, they only trained ~5% of their parameters.
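The “only train the parameters that fire” idea is mixture-of-experts routing: a router scores every expert per token, and only the top-k experts run (and get gradients). A toy NumPy sketch with made-up shapes, not DeepSeek’s actual architecture:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through only its top-k experts.

    x       : (d,) token hidden state
    gate_w  : (n_experts, d) router weights
    experts : list of (d, d) expert weight matrices
    """
    scores = gate_w @ x                    # one router logit per expert
    top = np.argsort(scores)[-k:]          # indices of the k best experts
    probs = np.exp(scores[top])
    probs /= probs.sum()                   # softmax over the chosen experts only
    # Only these k experts compute (and would receive gradients) for this
    # token; every other expert's parameters are untouched.
    return sum(p * (experts[i] @ x) for p, i in zip(probs, top))

d, n_experts = 8, 16
rng = np.random.default_rng(0)
x = rng.normal(size=d)
gate_w = rng.normal(size=(n_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts, k=2)
# With k=2 of 16 experts, this token touches 2/16 = 12.5% of expert params.
```

Scale the same ratio up and you get the “5% of parameters per token” economics: the per-token FLOPs and gradient traffic shrink with k/n_experts, even though total capacity stays huge.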
Q: How is DeepSeek’s inference so much cheaper?
A: They compressed the KV cache. (This was a breakthrough they made a while ago.)
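To see why compressing the KV cache cuts inference cost, here’s back-of-envelope arithmetic: standard attention caches a full key and value vector per head, per layer, per token, while a latent-compression scheme caches one small vector per layer per token and reconstructs K/V from it. The dimensions below are illustrative, not DeepSeek’s published config:

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_val=2):
    # Standard attention: one K and one V vector per head, layer, and token (fp16).
    return n_layers * seq_len * 2 * n_heads * head_dim * bytes_per_val

def latent_cache_bytes(n_layers, latent_dim, seq_len, bytes_per_val=2):
    # Compressed variant: a single small latent per layer per token,
    # from which K and V are re-derived at attention time.
    return n_layers * seq_len * latent_dim * bytes_per_val

full = kv_cache_bytes(n_layers=60, n_heads=128, head_dim=128, seq_len=4096)
compressed = latent_cache_bytes(n_layers=60, latent_dim=512, seq_len=4096)
print(full / compressed)  # → 64.0, i.e. a 64x smaller cache at these shapes
```

A smaller cache means longer contexts and far more concurrent users per GPU, which is exactly where serving cost lives.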
Q: How did they replicate o1?
A: Reinforcement learning. Take complicated questions that can be easily verified (either math or code). Update the model if correct.
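The key property is that the reward is automatic: no human grader, just a check. A minimal sketch of a verifiable reward for a math task, where the model’s final number is string-matched against the known answer (this grader is hypothetical, not DeepSeek’s):

```python
import re

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the last number in the answer matches, else 0.0."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", model_answer)
    return 1.0 if nums and nums[-1] == ground_truth else 0.0

# The RL loop then samples solutions, scores them with this reward,
# and nudges the policy toward higher-reward generations.
print(math_reward("So the total is 42.", "42"))   # → 1.0
print(math_reward("I'd guess around 41.", "42"))  # → 0.0
```

For code, the same trick works by running the generated program against unit tests. Because verification is cheap and unambiguous, you can run the loop at enormous scale.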
There are a bunch of other small innovations, but these are the big ones.
I don’t think there’s anything magical here. I really think they just made 2 massive cost-cutting innovations, which let them run more experiments, which led them to reverse engineer o1 faster.
Also, export restrictions didn’t harm them as much as we thought they did. That’s probably because our export restrictions were really shitty. The H800s are only worse than the H100s when it comes to chip-to-chip bandwidth.
“Is the US losing the war in AI??” I don’t think so. DeepSeek had a few big breakthroughs, we have had hundreds of small breakthroughs. If we adopt DeepSeek’s architecture, our models will be better. Because we have more compute and more data.
• • •
You all seemed to like my breakdown of DeepSeek’s technical reports. Here’s another DeepSeek thread, this time on company culture.
Did China work harder than the US? Are quants better AI researchers than techbros? Has China surpassed the US in innovation?
DeepSeek is a subsidiary of High-Flyer, a Chinese hedge fund. But High-Flyer is a very new hedge fund, a disruptor in the Chinese market.
DeepSeek takes a very non-traditional approach to hiring. While most western labs (and many western hedge funds) prefer to hire seasoned industry veterans, DeepSeek prefers to hire recent graduates.
To understand what DeepSeek pulled off, we first have to understand what exactly the export restrictions do. Are they actually even that bad? Are the GPUs we sell to the Chinese actually that much worse than the US GPUs?
The answer: the H800s (the chips we sell to China) are only worse than the H100s (the US chips) in one way: lower chip-to-chip interconnect bandwidth between GPUs. The H100s have a bandwidth of about 900 GB/s; the H800s have a bandwidth of 160 GB/s.
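Back-of-envelope on what that gap costs: at those bandwidths, the same inter-GPU exchange takes about 5.6x longer on an H800. The payload size below is made up purely for illustration:

```python
def sync_time_seconds(payload_gb, bandwidth_gb_s):
    # Idealized transfer time: payload / link bandwidth (ignores latency, overlap).
    return payload_gb / bandwidth_gb_s

payload = 10  # GB exchanged between GPUs per step -- illustrative, not measured
h100 = sync_time_seconds(payload, 900)
h800 = sync_time_seconds(payload, 160)
print(h800 / h100)  # → 5.625, the slowdown factor on the restricted link
```

That’s why the restriction only bites on communication-heavy workloads; raw per-chip compute is untouched, and clever scheduling can hide much of the transfer behind computation.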
It would be extremely funny if, after all resources are added up (GPUs and electricity for AI, food and water and education for humans), the cost of AGI is exactly the same as the cost of human intelligence
This is actually something I mostly expect will happen. I don’t see any prima facie reason why “intelligence per unit of resource” would be higher for carbon than for silicon in fully optimized scenarios. And humans are fairly well-optimized.
I think carbon and silicon have very different strengths and weaknesses (silicon is better at arithmetic, carbon is more “intuitive,” whatever that means). But if you sum all of that up into a very abstract measure (compute per unit of resource), it probably comes out in the wash.