wordgrammer
Jan 27 · 9 tweets
Okay. Thanks for the nerd snipe, guys. I spent the day learning exactly how DeepSeek trained at 1/30th the price, instead of working on my pitch deck. The tl;dr, according to their papers:
Q: How did DeepSeek get around export restrictions?

A: They didn’t. They tinkered with their chips to make sure they handled memory as efficiently as possible. They lucked out: their carefully optimized low-level code wasn’t actually held back by chip capacity.
Q: How did DeepSeek train so much more efficiently?

A: They used the routing formulas from their papers to “predict” which parameters the model would activate for each token, then trained only those. They need 95% fewer GPUs than Meta because, for each token, they trained only ~5% of their parameters.
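For intuition, here’s a toy sketch of that mixture-of-experts routing in NumPy. Names and sizes are illustrative, not DeepSeek’s code; the point is that a small router picks a few experts per token, so only that slice of the parameters does any work or receives gradients.

```python
import numpy as np

# Toy sizes, for illustration only (DeepSeek-V3 actually uses 256 routed
# experts per layer and activates 8 of them, plus a shared expert).
d_model, n_experts, top_k = 64, 16, 2

rng = np.random.default_rng(0)
router_w = rng.normal(size=(d_model, n_experts))            # routing ("gate") weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(token):
    """Send one token through its top-k experts; the rest stay idle."""
    scores = token @ router_w                                # token-to-expert affinity
    top = np.argsort(scores)[-top_k:]                        # pick the k best-scoring experts
    gate = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over the chosen experts
    # Only these k experts run forward (and only they get gradients in training).
    return sum(g * (token @ experts[i]) for g, i in zip(gate, top))

print(moe_forward(rng.normal(size=d_model)).shape)           # (64,)
```

In DeepSeek-V3 the ratio is about 37B activated out of 671B total parameters per token, which is where the ~5% figure comes from.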
Q: How is DeepSeek’s inference so much cheaper?

A: They compressed the KV cache. (This was a breakthrough they made a while ago.)
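The mechanism is what their papers call multi-head latent attention (MLA): instead of caching full keys and values for every head, you cache one small latent vector per token and expand it back into keys and values at attention time. A toy sketch with made-up dimensions (the real design also has a separate rotary-embedding path, omitted here):

```python
import numpy as np

# Made-up dimensions, for illustration only.
d_model, n_heads, d_head, d_latent = 512, 8, 64, 64
rng = np.random.default_rng(0)

W_down = rng.normal(size=(d_model, d_latent))               # hidden state -> compressed latent
W_up_k = rng.normal(size=(d_latent, n_heads * d_head))      # latent -> keys, all heads
W_up_v = rng.normal(size=(d_latent, n_heads * d_head))      # latent -> values, all heads

hidden = rng.normal(size=(1000, d_model))                   # 1000 generated tokens so far

# Naive KV cache: full keys + values for every head, per token.
floats_naive = 2 * n_heads * d_head                         # 1024 floats per token

# Compressed cache: store only the latent, per token.
kv_latent = hidden @ W_down                                 # this is all that gets cached
floats_latent = d_latent                                    # 64 floats per token

# At attention time, keys and values are rebuilt from the latent.
k = (kv_latent @ W_up_k).reshape(-1, n_heads, d_head)
v = (kv_latent @ W_up_v).reshape(-1, n_heads, d_head)
print(f"{floats_naive / floats_latent:.0f}x smaller cache")  # 16x in this toy setup
```

A smaller cache means more concurrent requests and longer contexts per GPU, which is where the inference savings come from.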
Q: How did they replicate o1?

A: Reinforcement learning. Take complicated questions whose answers can be easily verified (math or code), and update the model when it gets them right.
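Their algorithm is GRPO: sample a group of answers per question, score each with a cheap verifier, use the group mean as the baseline, and reinforce the above-average samples. Here’s a toy version of that loop; the “model” below is just a weighted distribution over canned answers, purely to show the shape of it:

```python
import random

# Toy stand-in for a policy: a distribution over candidate answers.
# Illustrative only; the real method updates LLM weights via policy gradients.
problem = {"question": "What is 2 + 3 * 4?", "solution": "14"}
candidates = ["14", "20", "24"]
weights = {c: 1.0 for c in candidates}            # start uniform

def verify(answer):
    """The key ingredient: a cheap, exact checker (math answers, unit tests)."""
    return answer == problem["solution"]

def sample_group(k):
    return random.choices(candidates, weights=[weights[c] for c in candidates], k=k)

for step in range(100):
    group = sample_group(8)                       # a group of attempts at one question
    rewards = [1.0 if verify(a) else 0.0 for a in group]
    baseline = sum(rewards) / len(rewards)        # group mean as baseline (GRPO-style)
    for a, r in zip(group, rewards):
        weights[a] *= 1.5 ** (r - baseline)       # reinforce above-average samples

print(max(weights, key=weights.get))              # "14": the verified answer wins
```

The “easily verified” part is doing the heavy lifting: the reward comes from a program, not from a human rater or a learned reward model that can be gamed.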
There are a bunch of other small innovations, but these are the big ones.
I don’t think there’s anything magical here. I really think they just made 2 massive cost-cutting innovations, which let them run more experiments, which led them to reverse engineer o1 faster.
Also, export restrictions didn’t harm them as much as we thought they did. That’s probably because our export restrictions were really shitty. The H800s are only worse than the H100s when it comes to chip-to-chip bandwidth.
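Rough arithmetic on what that one nerfed spec costs, using the 900 GB/s vs 160 GB/s figures cited in the Jan 26 thread further down (payload size made up; real training also hides some of this by overlapping transfers with compute):

```python
# Back-of-envelope: moving 10 GB of activations/gradients between GPUs.
payload_gb = 10
h100_bw_gbs, h800_bw_gbs = 900, 160        # chip-to-chip bandwidth, GB/s

t_h100 = payload_gb / h100_bw_gbs          # ~0.011 s
t_h800 = payload_gb / h800_bw_gbs          # ~0.063 s
print(f"H100: {t_h100 * 1e3:.0f} ms, H800: {t_h800 * 1e3:.0f} ms, "
      f"{t_h800 / t_h100:.1f}x slower")    # 11 ms vs 62 ms, 5.6x slower per transfer
```

That penalty only bites when chips talk to each other, which is why clever scheduling and communication-efficient training can route around it.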
“Is the US losing the war in AI??” I don’t think so. DeepSeek had a few big breakthroughs, we have had hundreds of small breakthroughs. If we adopt DeepSeek’s architecture, our models will be better. Because we have more compute and more data.

More from @wordgrammer

Jan 27
You all seemed to like my breakdown of DeepSeek’s technical reports. Here’s another DeepSeek thread, this time on company culture.

Did China work harder than the US? Are quants better AI researchers than techbros? Has China surpassed the US in innovation?
DeepSeek is a subsidiary of High-Flyer, a Chinese hedge fund. However, High-Flyer is a very new fund: they are disrupters in the Chinese market.
DeepSeek takes a very non-traditional approach to hiring. While most western labs (and many western hedge funds) prefer to hire seasoned industry veterans, DeepSeek prefers to hire recent graduates.
Jan 26
Okay, “how did DeepSeek get around Nvidia’s export restrictions?” Here’s a philosophy major’s thoughts on the situation.
To understand what DeepSeek pulled off, we first have to understand what exactly the export restrictions do. Are they actually even that bad? Are the GPUs we sell to the Chinese actually that much worse than the US GPUs?
The answer: the H800s (the chips we sell to China) are worse than the H100s (the US chips) in only one way, chip-to-chip bandwidth between GPUs. The H100s run at about 900 GB/s; the H800s at about 160 GB/s.
Dec 22, 2024
It would be extremely funny if, after all resources are added up (GPUs and electricity for AI, food and water and education for humans), the cost of AGI is exactly the same as the cost of human intelligence.
This is actually something I mostly expect will happen. I don’t see any prima facie reason why “intelligence per unit of resource” would be higher for carbon than silicon, in completely optimal scenarios. And humans are fairly well-optimized.
I think carbon and silicon have very different strengths and weaknesses (silicon is better for arithmetic, carbon is more “intuitive”, whatever that means). But if you sum all of that up into a very abstract sense (compute per resource), it probably comes out in the wash.
