Thread by @scaleadvantage on Thread Reader App

1/ $META's AI lab (MSL) rebuilt their entire research stack from scratch, and the first model off that stack is already showing better token efficiency than most competitors. Anthropic's CFO explained on Invest Like the Best why that variable compounds through the entire R&D flywheel: cheaper RL, faster iteration, each model generation accelerating the next. This dynamic is underappreciated by investors and highlights why Muse Spark should make one bullish about META’s model trajectory. 🧵

2/ First, context on the rebuild. Alexandr Wang says MSL undertook a "full-on renovation" of the core research stack. They didn't patch or iterate; they rebuilt from the ground up with some of the industry’s best researchers, people who were directly involved in building these stacks at competing AI labs. The goal was to do everything "the right way."

3/ The clearest evidence: token efficiency. On Artificial Analysis, Muse Spark achieves similar results to leading models using far fewer tokens. Wang suspects competitors have "some level of fundamental inefficiency at another part of the stack that gets patched by enabling the models to think longer." Said differently, some labs are using brute-force reasoning tokens to compensate for architectural debt. MSL's clean stack avoids that tax.

4/ Why does token efficiency matter so much? In short, it’s a key driver of the entire R&D flywheel. Per Anthropic CFO Krishna Rao in his recent Invest Like the Best episode: "If the model's better at more efficient at inference, RL is more efficient as well." He explains that reinforcement learning is "basically inference within a sandbox with a reward function." So let's say you have a model that's 2-3x more token-efficient at inference. Under that framing, every single RL training run for the next generation of models is roughly 2-3x cheaper and faster too.
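To make the arithmetic in 4/ concrete, here is a back-of-the-envelope sketch in Python. It assumes RL cost is dominated by the tokens generated during rollouts, which is a simplification of Rao's "inference within a sandbox" framing. The function name, rollout count, tokens per rollout, and price per million tokens are all hypothetical placeholders, not figures from Meta or Anthropic.

```python
def rollout_cost(num_rollouts: int, tokens_per_rollout: float,
                 usd_per_million_tokens: float) -> float:
    """Cost of generating RL rollouts, assuming cost scales linearly with tokens."""
    total_tokens = num_rollouts * tokens_per_rollout
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical numbers chosen only to illustrate the ratio.
NUM_ROLLOUTS = 100_000
BASELINE_TOKENS = 12_000      # tokens per rollout for a less efficient model
EFFICIENCY_GAIN = 3           # "2-3x more token-efficient": take the 3x case
USD_PER_MILLION = 2.0

baseline = rollout_cost(NUM_ROLLOUTS, BASELINE_TOKENS, USD_PER_MILLION)
efficient = rollout_cost(NUM_ROLLOUTS, BASELINE_TOKENS / EFFICIENCY_GAIN,
                         USD_PER_MILLION)

print(f"baseline:  ${baseline:,.0f}")
print(f"efficient: ${efficient:,.0f}")
print(f"ratio:     {baseline / efficient:.1f}x cheaper")
```

Under these assumptions the savings ratio equals the token-efficiency gain exactly. In practice, fixed costs such as reward computation and gradient updates would likely make the real saving somewhat smaller than the headline multiple.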

5/ Anthropic’s Rao describes this as a compounding loop: "Doing R&D for model capabilities, for compute efficiency, for serving customers, and then having internal workloads that can be sped up by using the best models." Every dollar of compute spent goes further and that advantage compounds with each successive model generation.

6/ Wang says their entire program is "developed around predictable scaling" and they're seeing it across four axes simultaneously: (1) pre-training scaling, (2) reinforcement learning scaling, (3) test-time compute scaling, and (4) multi-agent scaling. He sounds very bullish about scaling visibility.

7/ On Muse Spark itself, Wang is candid it's not state-of-the-art. Muse Spark is "the early data point on our scaling trajectory." On the next model: "we're much more excited about the larger models than we are even about Muse Spark." And on the model after that: "we're even more excited."

8/ This is almost certainly why Zuck has committed hundreds of billions to compute buildout, a capex program that has spooked some investors. But if you're sitting on (a) a clean, token-efficient stack where every GPU hour yields more than competitors get, (b) predictable scaling curves across pre-training, RL, test-time, and multi-agent dimensions, (c) a compounding R&D flywheel where each model generation accelerates the next, and (d) one of the most scaled distribution platforms on the planet, then the rational move is to pour fuel on that fire.

9/ META's scale and founder-led structure also give it significant data advantages over peers. Obviously META has a plethora of valuable data from its 3B+ user base. But they also seem to be building a sizable advantage in knowledge work data relative to their peers. In leaked all-hands audio, Zuck justified why they are collecting keystrokes from 70-80K employees to help teach models knowledge worker tasks. He says this data set is superior to what other leading labs have access to, which comes largely from third-party contractors, given that META employs some of the best engineering and product talent in the industry. META wouldn't do this unless it was a significant advantage, and they are highly qualified to make that judgement given Alexandr Wang was head of the largest third-party contractor, Scale AI.

10/ Net: MSL rebuilt their AI stack clean, which enabled very strong token efficiency, a big advantage that will compound in successive model generations. META also has significant advantages in training data, both from its 3B+ user base and from its recent decision to collect keystrokes from its 70-80K employee base. META's massive compute buildout isn't as speculative as it seems on the surface. Muse Spark is the proof of concept, and it seems very likely that META's future models will scale very impressively.
