Scale Advantage Research
May 22, 10 tweets, 4 min read
1/ $META's AI lab (MSL) rebuilt their entire research stack from scratch and the first model off that stack is already showing better token efficiency than most competitors. Anthropic's CFO explained on Invest Like the Best why that variable compounds through the entire R&D flywheel: cheaper RL, faster iteration, each model generation accelerating the next. This dynamic is underappreciated by investors and highlights why Muse Spark should make one bullish about META’s model trajectory. 🧵
2/ First, context on the rebuild. Alexandr Wang says MSL undertook a "full-on renovation" of the core research stack. They didn't patch or iterate. They rebuilt from the ground up with some of the industry’s best researchers, who were directly involved in building these stacks at competing AI labs. The goal was to do everything "the right way."
3/ The clearest evidence: token efficiency. On Artificial Analysis, Muse Spark achieves similar results to leading models using far fewer tokens. Wang suspects competitors have "some level of fundamental inefficiency at another part of the stack that gets patched by enabling the models to think longer." Said differently, some labs are using brute-force reasoning tokens to compensate for architectural debt. MSL's clean stack avoids that tax.
4/ Why does token efficiency matter so much? In short, it’s a key driver of the entire R&D flywheel. Per Anthropic CFO Krishna Rao in his recent Invest Like the Best episode: "If the model's better at more efficient at inference, RL is more efficient as well." He explains that reinforcement learning is "basically inference within a sandbox with a reward function." So if a model is 2-3x more token-efficient at inference, every single RL training run for the next generation of models is roughly 2-3x cheaper and faster too.
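To make the arithmetic concrete, here is a minimal sketch, assuming RL rollout cost scales linearly with the number of tokens generated. The function name and every number are hypothetical and chosen only for illustration. None of them come from Rao, Wang, or Artificial Analysis, and the sketch ignores non-inference costs such as gradient updates.

```python
def rollout_tokens(num_rollouts: int, tokens_per_rollout: int) -> int:
    """Total tokens generated during the rollout (inference) phase of an RL run.

    Assumption: rollout cost is proportional to tokens generated.
    """
    return num_rollouts * tokens_per_rollout


# Hypothetical numbers: same number of rollouts, but one model needs
# a third of the tokens to reach a comparable answer.
baseline = rollout_tokens(num_rollouts=1_000_000, tokens_per_rollout=9_000)
efficient = rollout_tokens(num_rollouts=1_000_000, tokens_per_rollout=3_000)

# Ratio of rollout token budgets between the two hypothetical models.
savings_ratio = baseline / efficient
print(f"Rollout token budget shrinks by {savings_ratio:.1f}x")
```

Under that linear assumption, a 3x gap in tokens per rollout becomes a 3x gap in the rollout budget of every RL run, which is the mechanism the quote points at.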
5/ Anthropic’s Rao describes this as a compounding loop: "Doing R&D for model capabilities, for compute efficiency, for serving customers, and then having internal workloads that can be sped up by using the best models." Every dollar of compute spent goes further and that advantage compounds with each successive model generation.
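A rough way to picture "compounds" is below. It is a sketch that assumes each model generation delivers the same multiplicative efficiency gain over the last, which the thread does not claim. The 1.5x figure and the function name are hypothetical.

```python
def cumulative_advantage(per_generation_gain: float, generations: int) -> list[float]:
    """Cumulative efficiency multiple after each generation.

    Assumption: a constant multiplicative gain per generation.
    """
    return [per_generation_gain ** n for n in range(1, generations + 1)]


# Hypothetical: a 1.5x gain per generation, tracked over three generations.
multiples = cumulative_advantage(per_generation_gain=1.5, generations=3)
for generation, multiple in enumerate(multiples, start=1):
    print(f"Generation {generation}: {multiple:.3f}x compute per dollar")
```

The point of the toy model is only that a per-generation multiplier grows geometrically rather than linearly. Real gains would vary from one generation to the next.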
6/ Wang says their entire program is "developed around predictable scaling" and they're seeing it across four axes simultaneously: (1) pre-training scaling, (2) reinforcement learning scaling, (3) test-time compute scaling, and (4) multi-agent scaling. He sounds very bullish about scaling visibility.
7/ On Muse Spark itself, Wang is candid that it's not state-of-the-art. Muse Spark is "the early data point on our scaling trajectory." On the next model: "we're much more excited about the larger models than we are even about Muse Spark." And on the model after that: "we're even more excited."
8/ This is almost certainly why Zuck has committed hundreds of billions to compute buildout, a capex program that has spooked some investors. But if you're sitting on (a) a clean, token-efficient stack where every GPU hour yields more than competitors get, (b) predictable scaling curves across pre-training, RL, test-time, and multi-agent dimensions, (c) a compounding R&D flywheel where each model generation accelerates the next, and (d) one of the most scaled distribution platforms on the planet, the rational move is to pour fuel on that fire.
9/ META's scale and founder-led structure also give it significant data advantages over peers. Obviously META has a plethora of valuable data from its 3B+ user base. But they also seem to be building a sizable advantage in knowledge work data relative to their peers. In leaked all-hands audio, Zuck justified why they are collecting keystrokes from 70-80K employees to help teach models knowledge worker tasks. He says this data set is superior to what other leading labs have access to, which comes largely from third-party contractors, given META employs some of the best engineering and product talent in the industry. META wouldn't do this unless it was a significant advantage, and they are highly qualified to make that judgement given Alexandr Wang was head of the largest third-party contractor, Scale AI.
10/ Net, MSL rebuilt their AI stack clean, this enabled very strong token efficiency, and this is a big advantage that will compound in successive model generations. META also has significant advantages in training data, both from its 3B+ user base and its recent decision to collect keystrokes from its 70-80K employee base. META's massive compute buildout isn't as speculative as it seems on the surface. Muse Spark is the proof of concept and it seems very likely that META's future models will scale impressively.
