Lisan al Gaib (@scaling01)
Dec 26, 2024
Some notes on the DeepSeek-V3 Technical Report :)

The most insane thing to me:
The whole training cost only $5.576 million, i.e. roughly 55 days on a 2048xH800 cluster. This is TINY compared to the Llama, GPT or Claude training runs.

- 671B MoE with 37B activated params
- DeepSeekMoE architecture: 1 shared expert and 256 routed experts, 8 active routed experts per token (toy routing sketch after this list)
- Multi-head Latent Attention (low-rank joint compression for attention keys and values; sketch below)
- Multi-token prediction (useful for speculative decoding and better usage of the training data): for D additional tokens you want to predict, there are D additional sequential modules (sketch below)

- some ablation study results for MTP (charts in the original thread)

- auxiliary-loss-free load-balancing to prevent MoE collapse (ablation charts in the original thread; the bias trick is part of the routing sketch after this list)

- 14.8T training tokens
- BPE tokenizer 128k vocab
- only 61 layers :(
- 2.788M H800 training hours with FP8 mixed precision
- pre-training --> two-stage context-length extension, first to 32k tokens and then to 128k tokens
--> post-training uses SFT and RL to align with human preferences and to distill R1 reasoning capabilities

- a bunch of interesting stuff on the infrastructure and how they got the FP8 training to work (I don't really care about that), but worth reading if you are into that
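
To make the shared/routed-expert routing and the auxiliary-loss-free balancing a bit more concrete, here is a toy Python sketch of the idea as I read it: every expert carries a bias that is added to its affinity only when picking the top-k experts (not when weighting their outputs), and after each step the bias is nudged down for overloaded experts and up for underloaded ones. All numbers, the sigmoid gating and the update rule below are simplified placeholders, not DeepSeek's actual implementation.

```python
import numpy as np

# Toy sketch of DeepSeek-style MoE routing: one shared expert, top-k routed
# experts, and auxiliary-loss-free load balancing via per-expert bias terms.
# All sizes here are illustrative (the real model uses 256 routed experts, top-8).

N_ROUTED, TOP_K, BIAS_SPEED = 16, 4, 0.001
rng = np.random.default_rng(0)

router_w = rng.standard_normal((64, N_ROUTED))  # token dim 64 -> expert affinity logits
bias = np.zeros(N_ROUTED)                       # used ONLY for expert selection

def route(tokens):
    """tokens: (batch, 64) hidden states -> (expert ids, gate weights) per token."""
    affinity = 1.0 / (1.0 + np.exp(-(tokens @ router_w)))        # sigmoid affinities
    topk_ids = np.argsort(affinity + bias, axis=-1)[:, -TOP_K:]  # bias steers who gets picked
    gates = np.take_along_axis(affinity, topk_ids, axis=-1)      # ...but not the weights
    return topk_ids, gates / gates.sum(axis=-1, keepdims=True)

def update_bias(topk_ids):
    """After each step: push bias down for overloaded experts, up for underloaded ones."""
    global bias
    load = np.bincount(topk_ids.ravel(), minlength=N_ROUTED)
    bias -= BIAS_SPEED * np.sign(load - load.mean())

ids, gates = route(rng.standard_normal((32, 64)))
update_bias(ids)
# Final MoE output would be shared_expert(x) + sum over the selected routed experts,
# weighted by the gates.
```

The appeal is that nothing here competes with the language-modeling loss; the bias only changes which experts get picked.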
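Likewise, a minimal sketch of what the "low-rank joint compression" in MLA buys you: the KV cache stores one small latent per token, and the per-head keys and values are up-projected from it at attention time. The dimensions are invented, and the real MLA also carries a small decoupled RoPE key per token that I'm leaving out.

```python
import numpy as np

# Minimal sketch of MLA's low-rank joint KV compression: cache one small latent
# per token, up-project to keys/values only when attention is computed.
# Dimensions are invented for illustration; the decoupled RoPE key is omitted.

D_MODEL, D_LATENT, N_HEADS, D_HEAD = 512, 64, 8, 64
rng = np.random.default_rng(0)

W_down = rng.standard_normal((D_MODEL, D_LATENT)) * 0.02           # joint down-projection
W_up_k = rng.standard_normal((D_LATENT, N_HEADS * D_HEAD)) * 0.02  # key up-projection
W_up_v = rng.standard_normal((D_LATENT, N_HEADS * D_HEAD)) * 0.02  # value up-projection

h = rng.standard_normal((128, D_MODEL))            # hidden states for 128 cached tokens
c_kv = h @ W_down                                  # (128, 64): this is all the KV cache stores
k = (c_kv @ W_up_k).reshape(128, N_HEADS, D_HEAD)  # keys reconstructed on the fly
v = (c_kv @ W_up_v).reshape(128, N_HEADS, D_HEAD)  # values reconstructed on the fly

print("cache floats per token:", D_LATENT, "instead of", 2 * N_HEADS * D_HEAD)
```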
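And the multi-token-prediction bullet in (very rough) code form: D extra modules are chained sequentially, with module k refining the previous depth's state together with the embedding of the known next token (teacher forcing at training time) and predicting the token k+1 positions ahead through a shared head. The plain linear "modules" below stand in for the paper's transformer blocks; treat this as a sketch of the wiring, not the actual method.

```python
import numpy as np

# Heavily simplified sketch of multi-token prediction: D extra modules chained
# one after another, module k predicting the token k+1 positions ahead.
# Linear layers stand in for the paper's transformer blocks.

D_EXTRA, D_MODEL, VOCAB = 2, 256, 1000
rng = np.random.default_rng(0)

modules = [rng.standard_normal((2 * D_MODEL, D_MODEL)) * 0.02 for _ in range(D_EXTRA)]
out_head = rng.standard_normal((D_MODEL, VOCAB)) * 0.02   # shared output head
embed = rng.standard_normal((VOCAB, D_MODEL)) * 0.02      # shared token embedding

def mtp_logits(h_main, next_token_ids):
    """h_main: (D_MODEL,) main-model state at position i.
    next_token_ids: the D_EXTRA ground-truth future tokens (training-time input)."""
    h, logits = h_main, []
    for k in range(D_EXTRA):
        # combine previous depth's state with the embedding of the known next token
        h = np.tanh(np.concatenate([h, embed[next_token_ids[k]]]) @ modules[k])
        logits.append(h @ out_head)        # prediction for position i + k + 1
    return logits

extra = mtp_logits(rng.standard_normal(D_MODEL), [3, 7])
print([l.shape for l in extra])            # [(1000,), (1000,)]
```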

More from @scaling01

Dec 16, 2024
Let's review OpenAI's 12 days of shipmas so far:

Day 1 - o1 and ChatGPT Pro:
- delivered a product they promised us months ago
- the launch was horrendous because of bad, missing and out-of-date benchmarks
- despite the failed launch, still no new benchmarks for o1 models

( - announcing o1 pro and ChatGPT Pro on this day was stupid imo; the Pro tier only makes sense once you know about Sora
- i love o1, but they should have done the price cuts on that day instead of o1 pro )

Overall Rating: horrendous presentation of good products and basically no surprise factor - 3.5/10

Day 2 - Reinforcement Fine-Tuning ALPHA:
- nice idea, could be very useful for businesses
- showed some cool applications
- just an alpha and completely useless for 95% of their users
- a surprise

Overall Rating: good presentation of good products but again just an alpha preview and limited applications - 6.5/10

Day 3 - Sora:
- very cool feature
- no surprise factor
- server issues
- extremely limited usage
- poor implementation, like no image preview (just read my post on why I think that; it could be so much better)
- europoors are cooked
- competitors offer the same

Overall Rating: very similar to the o1 launch - cool product but very bad implementation and presentation - 3/10

Day 4 - Canvas:
- decent feature
- absolutely no wow or surprise factor
- I guess it can be used in CustomGPTs which is nice to have
- competitors offer the same

Overall Rating: honestly no remarks, very neutral - 5/10

Day 5 - ChatGPT Integration with Apple Intelligence:
- siri using ChatGPT to generate responses
- document analysis on macOS
- vision features for iPhone 16
- could have literally been a sidenote in the changelog
- RIP to all android poors

Overall Rating: at least some features but jesus ... - 2/10

Day 6 - Advanced Voice with Video:
- video understanding is a useful feature, no doubt about that, but the examples were so USELESS
- HOHOHO cringe santa
- but again no wow or surprise factor
- europoors are cooked once more
- competitors offer the same

Overall Rating: extremely useful product, lacking presentation - 5/10

Day 7 - Projects:
- organizing information is always good
- no wow or surprise factor
- competitors offer the same

Overall Rating: could've been a post on X - 4.5/10

Day 8 - Search:
- free users gain access to search - good for the poors
- search with AVM is nice
- in app maps
- search already existed before, so zero wow factor
- competitors offer the same

Overall Rating: 5.5/10

So far ~4.4/10 - Shipmas has been slightly underwhelming, unsurprising and unfortunately overshadowed by the botched launches of o1 and Sora

o1, o1-pro and Sora could've been solid 7s or 8s

AVM is a very cool and useful feature, but the presentation just lacked substance and good examples. I know you think it's all fun with the Christmas theme, but please just show me that this product is useful for my daily life!
Like help with homework, show how to jump-start a car, hell even get the blind guy again... Just show anything more useful

Canvas, Search and Projects should've been in one presentation - alone they are not impressive, but as a whole they are a solid quality-of-life improvement

RFT was so far the best and most surprising launch, but again no higher grade because it's just a preview alpha

Lastly, please don't ever make an Apple "launch" again
@sama please uncle sam take notes

I want you guys to succeed and ship good stuff, your researchers and engineers deserve it
Dec 13, 2024
META JUST KILLED TOKENIZATION !!!

A few hours ago they released the "Byte Latent Transformer", a tokenizer-free architecture that dynamically encodes bytes into patches and achieves better inference efficiency and robustness!

(I was just talking about how we need dynamic tokenization that is learned during training 🥲
It's like fucking Christmas!)

I don't want to talk too much about the architecture.
But here's a nice visualization from their paper.

Let's look at benchmarks instead :)

"BLT models can match the performance of tokenization-based models like Llama 3 at scales up to 8B and 4T bytes, and can trade minor losses in evaluation metrics for up to 50% reductions in inference flops!"

This is basically a perplexity vs. training-FLOPs chart - scaling laws with compute. BPB (bits per byte) is a tokenizer-independent version of perplexity.
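
For reference, bits-per-byte is just the total negative log-likelihood of the text converted to bits and divided by the number of raw bytes, which is why it doesn't depend on the tokenizer. A tiny sketch with made-up numbers:

```python
import math

# Bits-per-byte: total negative log-likelihood of the text converted to bits,
# divided by the number of raw bytes - so it doesn't care how you tokenized.
# The numbers below are made up purely for illustration.

total_nll_nats = 123_456.0   # summed NLL of the eval text, in nats
n_bytes = 250_000            # length of that text in raw bytes

bpb = total_nll_nats / (math.log(2) * n_bytes)
print(f"{bpb:.3f} bits per byte")
```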

BLT is on par with or better than Llama 3 with BPE!

Most importantly, they scale this approach to train a Llama-3 8B model on 1T tokens, which beats the standard Llama-3 architecture with a BPE tokenizer!
Make sure to check out my latest visualization. I spent way too long on it, so I have to shill for it now 😂
