Pre-training: Contra the popular belief that scaling is over—which we discussed in our NeurIPS '25 talk with @ilyasut and @quocleix—the team delivered a drastic jump. The delta between 2.5 and 3.0 is as big as we've ever seen. No walls in sight!
Post-training: Still a total greenfield. There's lots of room for algorithmic progress and improvement, and 3.0 is no exception, thanks to our stellar team.
Congratulations to the whole team 💙💙💙
• • •
Chain ⛓️ Rule(s) rules! Appreciation thread of one of the most interesting coincidences in machine learning. Two rules, both named "Chain Rule", happen to be absolutely critical to recent advances in ML & AI. A 🧵 on the Chain Rule of Probability & the Chain Rule of Calculus👇
The Chain Rule of Probability is a powerful tool behind recent advances in Large Language Models. By multiplying together the probabilities of many smaller events, we can compute the probability of a complex event made up of those smaller events.
p(abc) = p(c|ab) * p(b|a) * p(a)
By smaller events here we mean the probability of a token given past tokens, p(c|ab). In probabilistic language modeling, a "token" is a single unit of text, like a word or part of a word. Modern language models use vocabularies of ~100K tokens.
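To make the chain rule concrete, here is a minimal Python sketch that scores a short sequence by multiplying per-token conditionals. The toy probability table is invented for illustration and stands in for the conditionals a real language model would predict.

# A minimal sketch of the chain rule of probability for scoring a token
# sequence. The toy table below is illustrative only; a real LM would supply
# p(token | context) from a neural network instead.
import math

TOY_PROBS = {
    ((), "a"): 0.5,
    (("a",), "b"): 0.4,
    (("a", "b"), "c"): 0.25,
}

def cond_prob(token, context):
    # Fall back to a uniform guess over the 3-token toy vocabulary if unseen.
    return TOY_PROBS.get((tuple(context), token), 1.0 / 3)

def sequence_log_prob(tokens):
    """log p(t1..tn) = sum_i log p(t_i | t_1..t_{i-1}) -- the chain rule."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(cond_prob(tok, tokens[:i]))
    return total

# p(abc) = p(a) * p(b|a) * p(c|ab) = 0.5 * 0.4 * 0.25 = 0.05
print(math.exp(sequence_log_prob(["a", "b", "c"])))  # -> ~0.05

Summing log-probabilities rather than multiplying raw probabilities is the standard trick to avoid numerical underflow on long sequences.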
The neural network architecture showcased at the @Tesla AI day is a perfect example of Deep Learning at its finest. Mix and match all the greatest innovations to do something drastic and super ambitious. Congrats!
Treating the job of figuring out valid "lanes" from images as language is brilliant. Combining CNNs, transformers, attention, pointer networks, etc., you essentially write a set of instructions to build up the graph by connecting the dots, start new lanes, set curvature, etc.
This isn't ML-new, but who cares? Applied at the level of ambition of full-scale real world impact, with the right team, execution, (and compute/data!), you can do things that felt impossible before. Both the architecture and cool use of language heavily reminded me of AlphaStar.
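As a toy illustration of the "lane language" idea above (not Tesla's actual system), here is a sketch of an interpreter that replays a sequence of invented tokens like START_LANE, POINT, and CONNECT into a small lane graph; in the real setup a neural decoder would emit such tokens autoregressively.

# Toy "lane language" interpreter. Token names and fields are invented for
# illustration; the point is that a graph can be built up as a sentence of
# discrete instructions, which is what makes sequence models applicable.

def build_lane_graph(tokens):
    nodes, edges = [], []          # node = (x, y), edge = (parent_idx, child_idx)
    current = None                 # index of the node we are currently extending
    for tok in tokens:
        kind = tok[0]
        if kind == "START_LANE":   # begin a new lane at (x, y)
            nodes.append((tok[1], tok[2]))
            current = len(nodes) - 1
        elif kind == "POINT":      # extend the current lane to the next (x, y)
            nodes.append((tok[1], tok[2]))
            edges.append((current, len(nodes) - 1))
            current = len(nodes) - 1
        elif kind == "CONNECT":    # merge/fork: link the current node to an earlier one
            edges.append((current, tok[1]))
    return nodes, edges

# Example "sentence": one lane of three points that loops back to its start.
program = [
    ("START_LANE", 0.0, 0.0),
    ("POINT", 1.0, 0.1),
    ("POINT", 2.0, 0.3),
    ("CONNECT", 0),
]
print(build_lane_graph(program))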
2021 personal highlights, a🧵. Despite being a challenging year globally due to the pandemic 😷🦠, thanks to many incredible collaborators it's been an exciting year research-wise 🤖 Some highlights below.👇
Diversity and inclusion. I stayed engaged through our efforts @DeepMind, through mentorship, and as a member of the @Khipu_AI community supporting AI in Latin America, where we had a fireside chat w/ @geoffreyhinton. I was also a mentor for: docs.google.com/spreadsheets/d…
Perceiver. Being able to treat every modality as a sequence of bytes has been a personal deep learning dream. Perceiver is a transformer-derived architecture proposing a few modifications to achieve this.
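A minimal sketch of the core Perceiver idea, assuming PyTorch: a small learned latent array cross-attends to a long byte-like input sequence, so the expensive attention scales with the number of latents rather than the input length. Sizes and module names below are illustrative, not the published configuration.

# Tiny Perceiver-style model: latents read from a long input via cross-attention,
# then process with cheap self-attention in latent space.
import torch
import torch.nn as nn

class TinyPerceiver(nn.Module):
    def __init__(self, input_dim=8, latent_dim=64, num_latents=32, depth=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.embed = nn.Linear(input_dim, latent_dim)  # lift raw "bytes" to model width
        self.cross = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):                      # x: (batch, seq_len, input_dim), any modality
        b = x.shape[0]
        inputs = self.embed(x)
        z = self.latents.expand(b, -1, -1)     # (batch, num_latents, latent_dim)
        z, _ = self.cross(z, inputs, inputs)   # latents attend to the (long) input
        for attn in self.blocks:
            z, _ = attn(z, z, z)               # self-attention over the small latent array
        return z.mean(dim=1)                   # pooled representation

# Works for anything flattened into a sequence, e.g. image patches or audio frames.
feats = TinyPerceiver()(torch.randn(2, 1024, 8))
print(feats.shape)  # torch.Size([2, 64])

The key design point is that the quadratic attention cost lives in the latent space (32 positions here), so the same model can ingest very long inputs regardless of modality.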