Gavin Baker
Jan 27 · 2 tweets · 4 min read
1) DeepSeek r1 is real, with important nuances. Most important is the fact that r1 is so much cheaper and more efficient to inference than o1, not the $6m training figure. r1 costs 93% less to *use* than o1 per API call, can be run locally on a high-end workstation and does not seem to have hit any rate limits, which is wild. Simple math: every 1B active parameters requires 1 GB of RAM in FP8, so r1's 37B active parameters require 37 GB of RAM. Batching massively lowers costs and more compute increases tokens/second, so there are still advantages to inferencing in the cloud. Would also note that there are true geopolitical dynamics at play here, and I don't think it is a coincidence that this came out right after "Stargate." RIP, $500 billion - we hardly even knew you.
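The "1B active parameters ~ 1 GB" arithmetic above can be sketched in a few lines. This is my back-of-the-envelope illustration, not anything from the thread: it assumes 1 byte per parameter for FP8 and 2 bytes for FP16, and ignores KV cache and activation memory. The 671B total-parameter figure for the full mixture-of-experts model is public but not in the thread.

```python
# Back-of-the-envelope weight-memory estimate behind "1B active params ~ 1 GB in FP8".
# Assumptions (mine): FP8 = 1 byte/param, FP16 = 2 bytes/param; KV cache and
# activation memory are ignored; 1 GB = 1e9 bytes.

def weight_gb(params_billions: float, bytes_per_param: int) -> float:
    """Weight memory in GB for a given parameter count and precision."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# r1 is a mixture-of-experts model: ~671B total parameters,
# but only ~37B are active per token.
print(weight_gb(37, 1))   # FP8, active experts only -> 37.0
print(weight_gb(37, 2))   # FP16 doubles it -> 74.0
print(weight_gb(671, 1))  # holding *all* experts resident in FP8 -> 671.0
```

Note the gap between the 37 GB active-parameter figure and the 671 GB full-model figure: which one binds in practice depends on how expert weights are kept resident, which is why "runs on a high-end workstation" carries an asterisk.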

Real: 1) It is/was the #1 download in the relevant App Store category - obviously ahead of ChatGPT, something neither Gemini nor Claude was able to accomplish. 2) It is comparable to o1 from a quality perspective, although it lags o3. 3) There were real algorithmic breakthroughs that made it dramatically more efficient both to train and to inference: FP8 training, MLA (multi-head latent attention) and multi-token prediction are significant. 4) It is easy to verify that the r1 training run only cost $6m. While that is literally true, it is also *deeply* misleading. 5) Even their hardware architecture is novel, and I will note that they use PCI-Express for scale-up.

Nuance: 1) The $6m does not include "costs associated with prior research and ablation experiments on architectures, algorithms and data," per the technical paper. "Other than that, Mrs. Lincoln, how was the play?" This means it is possible to train an r1-quality model with a $6m run *if* a lab has already spent hundreds of millions of dollars on prior research and has access to much larger clusters. DeepSeek obviously has way more than 2,048 H800s; one of their earlier papers referenced a cluster of 10,000 A100s. An equivalently smart team can't just spin up a 2,000 GPU cluster and train r1 from scratch with $6m. Roughly 20% of Nvidia's revenue goes through Singapore; 20% of Nvidia's GPUs are probably not in Singapore, despite their best efforts. 2) There was a lot of distillation - i.e., it is unlikely they could have trained this without unhindered access to GPT-4o and o1. As @altcap pointed out to me yesterday, kinda funny to restrict access to leading-edge GPUs and not do anything about China's ability to distill leading-edge American models - obviously defeats the purpose of the export restrictions. Why buy the cow when you can get the milk for free?
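"Distillation" here means training a student model to match a teacher model's output distribution rather than raw labels. A toy sketch of the core idea, purely my illustration (no real model APIs, and the logit values are made up):

```python
# Toy sketch of knowledge distillation: the student is penalized by how far
# its output distribution is from the teacher's, measured by KL divergence.
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student's distribution q is from the teacher's p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = softmax([2.0, 1.0, 0.1])  # stand-in for a frontier model's next-token probs
student = softmax([1.5, 1.2, 0.3])  # stand-in for the model being distilled
loss = kl_divergence(teacher, student)
print(round(loss, 4))  # small positive number; 0.0 would mean a perfect match
```

The policy point above follows directly: if the teacher's probabilities (or even just its sampled outputs) are reachable over an API, the teacher's capability leaks regardless of where the GPUs sit.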
2) Conclusions: 1) Lowering the cost to train will increase the ROI on AI. 2) There is no world where this is positive for training capex or the "power" theme in the near term. 3) The biggest risk to the current "AI infrastructure" winners across tech, industrials, utilities and energy is that a distilled version of r1 can be run locally at the edge on a high-end workstation (someone referenced a Mac Studio Pro). That means a similar model will run on a superphone in circa 2 years. If inference moves to the edge because it is "good enough," we are living in a very different world with very different winners - i.e., the biggest PC and smartphone upgrade cycle we have ever seen. Compute has oscillated between centralization and decentralization for a long time. 4) ASI is really, really close, and no one really knows what the economic returns to superintelligence will be. If a $100 billion reasoning model trained on 100k-plus Blackwells (o5, Gemini 3, Grok 4) is curing cancer and inventing warp drives, then the returns to ASI will be really high and training capex and power consumption will steadily grow; Dyson Spheres will be back to being the best explanation for Fermi's paradox. I hope the returns to ASI are high - would be so awesome. 5) This is all really good for the companies that *use* AI: software, internet, etc. 6) From an economic perspective, this massively increases the value of distribution and *unique* data - YouTube, Facebook, Instagram and X. 7) American labs are likely to stop releasing their leading-edge models to prevent the distillation that was so essential to r1, although the cat may already be entirely out of the bag on this front - i.e., r1 may be enough to train r2, etc.

Grok-3 looms large and might significantly impact the above conclusions. This will be the first significant test of scaling laws for pre-training, arguably the first since GPT-4. In the same way that it took several weeks to turn v3 into r1 via RL, it will likely take several weeks to run the RL necessary to improve Grok-3's reasoning capabilities. The better the base model, the better the reasoning model should be, as the three scaling laws are multiplicative: pre-training, RL during post-training and test-time compute during inference (a function of the RL). Grok-3 has already shown it can do tasks beyond o1 - see the Tesseract demo - and how far beyond is going to be important. To paraphrase an anonymous Orc from "The Two Towers," meat might be back on the menu very shortly. Time will tell, and "when the facts change, I change my mind."
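The "multiplicative scaling laws" claim can be made concrete with a toy model. This is entirely my illustration, not a real scaling law: treat capability as the product of a pre-training term, an RL post-training term, and a test-time-compute term, each with diminishing (logarithmic) returns, so improving any one lever lifts the whole product.

```python
# Toy multiplicative model of the three scaling levers named in the thread.
# All functional forms and numbers are arbitrary illustrative choices.
import math

def capability(pretrain_flops, rl_flops, test_time_tokens):
    """Capability as a product of log-scale terms: better base model
    multiplies the gains from RL and from test-time compute."""
    return (math.log10(pretrain_flops)
            * math.log10(rl_flops)
            * math.log10(test_time_tokens))

base = capability(1e25, 1e23, 1e3)
better_pretrain = capability(1e26, 1e23, 1e3)  # one lever improved
print(better_pretrain > base)  # -> True: the improvement propagates
```

Under this (admittedly hand-wavy) framing, a stronger Grok-3 base model would raise the ceiling of whatever reasoning model is built on top of it, which is why the pre-training result matters beyond the base model itself.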


