Okay. Thanks for the nerd snipe guys. I spent the day learning exactly how DeepSeek trained at 1/30 the price, instead of working on my pitch deck. The tl;dr to everything, according to their papers:
Q: How did DeepSeek get around export restrictions?
A: They didn’t. They just tinkered with their chips to make sure they handled memory as efficiently as possible. They lucked out: their perfectly optimized low-level code wasn’t actually held back by chip capacity.
Q: How did DeepSeek train so much more efficiently?
A: They used the formulas below to “predict” which parameters the model would activate for each token. Then, they only trained those. They need 95% fewer GPUs than Meta because, for each token, they only trained ~5% of their parameters.
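The “only train the parameters that fire” idea is mixture-of-experts routing: a router scores every expert per token, and only the top-k experts run (and get gradients). A toy NumPy sketch with made-up shapes, not DeepSeek’s actual architecture:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through only its top-k experts.

    x       : (d,) token hidden state
    gate_w  : (n_experts, d) router weights
    experts : list of (d, d) expert weight matrices
    """
    scores = gate_w @ x                    # one router logit per expert
    top = np.argsort(scores)[-k:]          # indices of the k best experts
    probs = np.exp(scores[top])
    probs /= probs.sum()                   # softmax over the chosen experts only
    # Only these k experts compute (and would receive gradients) for this
    # token; every other expert's parameters are untouched.
    return sum(p * (experts[i] @ x) for p, i in zip(probs, top))

d, n_experts = 8, 16
rng = np.random.default_rng(0)
x = rng.normal(size=d)
gate_w = rng.normal(size=(n_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts, k=2)
# With k=2 of 16 experts, this token touches 2/16 = 12.5% of expert params.
```

Scale the same ratio up and you get the “5% of parameters per token” economics: the per-token FLOPs and gradient traffic shrink with k/n_experts, even though total capacity stays huge.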
Q: How is DeepSeek’s inference so much cheaper?
A: They compressed the KV cache. (This was a breakthrough they made a while ago.)
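To see why compressing the KV cache cuts inference cost, here’s back-of-envelope arithmetic: standard attention caches a full key and value vector per head, per layer, per token, while a latent-compression scheme caches one small vector per layer per token and reconstructs K/V from it. The dimensions below are illustrative, not DeepSeek’s published config:

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_val=2):
    # Standard attention: one K and one V vector per head, layer, and token (fp16).
    return n_layers * seq_len * 2 * n_heads * head_dim * bytes_per_val

def latent_cache_bytes(n_layers, latent_dim, seq_len, bytes_per_val=2):
    # Compressed variant: a single small latent per layer per token,
    # from which K and V are re-derived at attention time.
    return n_layers * seq_len * latent_dim * bytes_per_val

full = kv_cache_bytes(n_layers=60, n_heads=128, head_dim=128, seq_len=4096)
compressed = latent_cache_bytes(n_layers=60, latent_dim=512, seq_len=4096)
print(full / compressed)  # → 64.0, i.e. a 64x smaller cache at these shapes
```

A smaller cache means longer contexts and far more concurrent users per GPU, which is exactly where serving cost lives.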
Q: How did they replicate o1?
A: Reinforcement learning. Take complicated questions that can be easily verified (either math or code). Update the model if correct.
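The key property is that the reward is automatic: no human grader, just a check. A minimal sketch of a verifiable reward for a math task, where the model’s final number is string-matched against the known answer (this grader is hypothetical, not DeepSeek’s):

```python
import re

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the last number in the answer matches, else 0.0."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", model_answer)
    return 1.0 if nums and nums[-1] == ground_truth else 0.0

# The RL loop then samples solutions, scores them with this reward,
# and nudges the policy toward higher-reward generations.
print(math_reward("So the total is 42.", "42"))   # → 1.0
print(math_reward("I'd guess around 41.", "42"))  # → 0.0
```

For code, the same trick works by running the generated program against unit tests. Because verification is cheap and unambiguous, you can run the loop at enormous scale.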
There are a bunch of other small innovations, but these are the big ones.
I don’t think there’s anything magical here. I really think they just made 2 massive cost-cutting innovations, which let them run more experiments, which led them to reverse engineer o1 faster.
Also, export restrictions didn’t harm them as much as we thought they did. That’s probably because our export restrictions were really shitty. The H800s are only worse than the H100s when it comes to chip-to-chip bandwidth.
“Is the US losing the war in AI??” I don’t think so. DeepSeek had a few big breakthroughs, we have had hundreds of small breakthroughs. If we adopt DeepSeek’s architecture, our models will be better. Because we have more compute and more data.
• • •
You all seemed to like my breakdown of DeepSeek’s technical reports. Here’s another DeepSeek thread, this time on company culture.
Did China work harder than the US? Are quants better AI researchers than techbros? Has China surpassed the US in innovation?
DeepSeek is a subsidiary of High-Flyer, a Chinese hedge fund. But High-Flyer is a very new hedge fund, a disruptor in the Chinese market.
DeepSeek takes a very non-traditional approach to hiring. While most western labs (and many western hedge funds) prefer to hire seasoned industry veterans, DeepSeek prefers to hire recent graduates.
To understand what DeepSeek pulled off, we first have to understand what exactly the export restrictions do. Are they actually even that bad? Are the GPUs we sell to the Chinese actually that much worse than the US GPUs?
The answer: the H800s (the chips we sell to China) are only worse than the H100s (the US chips) in one way: lower chip-to-chip interconnect bandwidth between GPUs. The H100s have a bandwidth of about 900 GB/s; the H800s have a bandwidth of 160 GB/s.
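Back-of-envelope on what that gap costs: at those bandwidths, the same inter-GPU exchange takes about 5.6x longer on an H800. The payload size below is made up purely for illustration:

```python
def sync_time_seconds(payload_gb, bandwidth_gb_s):
    # Idealized transfer time: payload / link bandwidth (ignores latency, overlap).
    return payload_gb / bandwidth_gb_s

payload = 10  # GB exchanged between GPUs per step -- illustrative, not measured
h100 = sync_time_seconds(payload, 900)
h800 = sync_time_seconds(payload, 160)
print(h800 / h100)  # → 5.625, the slowdown factor on the restricted link
```

That’s why the restriction only bites on communication-heavy workloads; raw per-chip compute is untouched, and clever scheduling can hide much of the transfer behind computation.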
It would be extremely funny if, after all resources are added up (GPUs and electricity for AI, food and water and education for humans), the cost of AGI is exactly the same as the cost of human intelligence
This is actually something I mostly expect will happen. I don’t see any prima facie reason why “intelligence per unit of resource” would be higher for carbon than for silicon in fully optimized scenarios. And humans are fairly well-optimized.
I think carbon and silicon have very different strengths and weaknesses (silicon is better at arithmetic, carbon is more “intuitive,” whatever that means). But if you sum all of that up into a very abstract measure (compute per unit of resource), it probably comes out in the wash.