https://twitter.com/ID_AA_Carmack/status/1587863190695813121
https://twitter.com/_clashluke/status/1594284161841479687
https://twitter.com/_arohan_/status/1538291264226926597

To go into a bit more detail:
https://twitter.com/OpenAI/status/1540032456559955968
Let's start with their architectural description.

To be precise, while SM3 trained only 7 models (0.36%) to a loss below 1.46, Shampoo achieved that with 255 models (11.5%). Additionally, Shampoo's lowest loss is 3.5% lower, which is equivalent to training a 3x bigger model with 3x more data, according to Chinchilla's scaling laws.

Note that the loss plot above is not an official image from the paper. Instead, the authors published all of their runs on a public TensorBoard: tensorboard.dev/experiment/on3….
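As a rough illustration of how the "3x bigger model with 3x more data" equivalence above can be estimated, here is a back-of-envelope sketch using the Chinchilla parametric loss fit from Hoffmann et al. 2022. The fit constants are the published Chinchilla values, but the example scale (N, D) is an arbitrary assumption, and the relative benefit of 3x params and 3x data shrinks as you move up the curve, so the printed number is illustrative rather than a reproduction of the thread's calculation.

```python
# Sketch of the Chinchilla parametric loss fit L(N, D) = E + A / N**alpha + B / D**beta
# (Hoffmann et al. 2022). Not the thread author's calculation; the scale (n, d) below
# is an assumed example, and the dataset/tokenizer differ from the thread's runs.

E, A, B = 1.69, 406.4, 410.7      # published Chinchilla fit constants
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

n, d = 1e9, 20e9                  # assumed example scale: 1B parameters, 20B tokens
base = chinchilla_loss(n, d)
scaled = chinchilla_loss(3 * n, 3 * d)
print(f"relative loss reduction from 3x params and 3x data: {(base - scaled) / base:.1%}")
```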
https://twitter.com/_clashluke/status/1463061191169822720
First of all, as @Buntworthy pointed out here: https://twitter.com/Buntworthy/status/1463905680004374535
https://twitter.com/ak92501/status/1439751096969334785
This speedup is almost as significant as Switch Transformer's (arxiv.org/abs/2101.03961), which achieved up to 7x speedups using 64x as many (sparse) parameters.
https://twitter.com/ak92501/status/1419824931181846528
With fewer parameters, fewer layers, and less training time, they achieve a 3.2% (relative) reduction in top-1 error.
https://twitter.com/ak92501/status/1414020174357934086

To generate these samples in a reasonable time, I tried to optimize the model by jitting it with TorchScript. After countless failed attempts, it finally runs 5x as fast as the baseline. (If you're using PyTorch, try the JIT. You might want to follow my notebook for further optimizations.)
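For reference, a minimal sketch of what jitting a model with TorchScript looks like. This is not the notebook mentioned above: the model and input shape are placeholder assumptions, and the actual speedup depends heavily on the model and hardware.

```python
# Minimal TorchScript jitting sketch with a stand-in model (not the thread's model).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.GELU(),
    nn.Linear(2048, 512),
).eval()

example = torch.randn(8, 512)

# torch.jit.script compiles the module's Python code (and handles control flow);
# torch.jit.trace(model, example) is the alternative that records the executed ops.
jitted = torch.jit.script(model)
jitted = torch.jit.freeze(jitted)  # fold constants and strip training-only attributes

with torch.no_grad():
    out = jitted(example)  # early calls trigger optimization passes; later calls are faster
print(out.shape)
```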
https://twitter.com/Hanxiao_6/status/1394742841033641985
I'll implement it immediately in our GPT codebase and share its performance on 2B-equivalent models.