Lucas Nestler
Machine Learning Engineer @DeepJudgeAI Member of https://t.co/cqWSshtyDj
Nov 20, 2022 11 tweets 5 min read
Over the past weeks, I've worked on validating @ID_AA_Carmack's hypothesis on how to improve Adam's second-order approximation.
Building on that, I'd like to present TGAdam, an optimizer with up to 50% lower relative error:

1/11

2/11

Unlike AdamW, TGAdam performs well across a wide range of hyperparameters. Additionally, it can significantly outperform the baseline (MNIST+LR=0.1) with minimal tuning.
Below, you can see the aggregated results of over 6986 runs across architectures and datasets:
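For readers wondering what "second-order approximation" refers to here: below is a minimal sketch of a plain Adam step in PyTorch. The exponential moving average `v` of squared gradients is the estimate in question; this is vanilla Adam for reference only, TGAdam's actual modification is not reproduced here.

```python
import torch

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Reference Adam update. The second-moment EMA `v` is the "second-order
    # approximation" discussed above; TGAdam changes how this estimate is formed,
    # but that modification is not shown here.
    if not state:
        state["m"] = torch.zeros_like(param)
        state["v"] = torch.zeros_like(param)
        state["t"] = 0
    state["t"] += 1
    state["m"].mul_(beta1).add_(grad, alpha=1 - beta1)
    state["v"].mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    param.data.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
```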
Jun 25, 2022 7 tweets 5 min read
Following a recent discussion sparked by @_arohan_ in this thread:
We tried Shampoo with a few more settings and compared it against AdamW, as that's more common than SM3.

TL;DR: Shampoo is still better, but Shampoo#AdamW > AdamW.

To go into a bit more detail:
The best pure Adam(W) outperforms the previous best (SM3#Shampoo) by 9.1%.
This is likely caused by the significant architectural changes to our model, as we switched from Attention to Bottleneck-Convolution+RNN. For Attention, SM3 might still be better.
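For context, the `#` notation above appears to denote optimizer grafting as used in the Shampoo literature: combining the step magnitude of one optimizer with the step direction of another. A minimal sketch of that operation (which side of A#B supplies magnitude vs. direction is a convention; check the grafting paper before relying on this):

```python
import torch

def graft(mag_step: torch.Tensor, dir_step: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # Layer-wise grafting: rescale `dir_step` so it keeps its direction but adopts
    # the norm of `mag_step`.
    return dir_step * (mag_step.norm() / (dir_step.norm() + eps))
```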
Jun 23, 2022 17 tweets 7 min read
OpenAI just released a Video-GPT ("VPT") that "solved" Minecraft.
Below, we'll take apart their model to the point where we can start reproducing it.
If you're interested in training this on "the world," join our discord server: discord.gg/24WsKDsV6w
Let's start with their architectural description.
The core of their system has three parts:
1) "Data Cleaning": web-scale scraping and filtering
2) "IDM": a BERT-like model to generate data
3) "VPT": a GPT trained on Video Image
Jun 12, 2022 9 tweets 6 min read
PSA: Switch your optimizer to Shampoo!

We recently tried Shampoo compared to a tuned ensemble of Adam and SM3 at @HomebrewNLP and found that the hyperparameter search space contains many more "winning tickets," which also achieve lower losses!

To be precise, while SM3 trained 7 (0.36%) models to a loss below 1.46, Shampoo achieved that with 255 (11.5%) models. Additionally, the lowest loss is 3.5% lower, which is equivalent to training a 3x bigger model with 3x more data, according to Chinchilla's scaling laws.
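For those unfamiliar, Shampoo (Gupta et al., 2018) preconditions each weight matrix's gradient with Kronecker-factored statistics. A heavily simplified single-matrix sketch, purely for illustration; real implementations amortize the matrix roots and handle tensors with more than two dimensions:

```python
import torch

def shampoo_step(param, grad, state, lr=1e-3, eps=1e-4):
    # Simplified Shampoo update for a 2D parameter: accumulate left/right
    # preconditioners L = sum(G @ G.T), R = sum(G.T @ G) and apply their
    # inverse fourth roots to the gradient.
    if "L" not in state:
        state["L"] = eps * torch.eye(grad.shape[0])
        state["R"] = eps * torch.eye(grad.shape[1])
    state["L"] += grad @ grad.T
    state["R"] += grad.T @ grad

    def inv_root(mat, p=4):
        # Inverse p-th root via eigendecomposition; production code amortizes this.
        vals, vecs = torch.linalg.eigh(mat)
        return vecs @ torch.diag(vals.clamp(min=1e-12).pow(-1.0 / p)) @ vecs.T

    param.data.add_(inv_root(state["L"]) @ grad @ inv_root(state["R"]), alpha=-lr)
```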
Nov 29, 2021 22 tweets 12 min read
"Sparse is Enough in Scaling Transformers", a recent paper by Sebastian Jaszczur from Google Research, shows 40x speedups at inference using structured sparsity without reducing downstream performance.

Abs: arxiv.org/abs/2111.12763
Code: github.com/google/trax/co…

1/22

Note that the loss plot above is not an official image from the paper. Instead, the authors published all of their runs on a public TensorBoard: tensorboard.dev/experiment/on3….
This way, we can compare the results ourselves.

2/22
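To give a flavor of what "structured sparsity" means here, below is a generic block-sparse FFN sketch: a small controller picks which blocks of the hidden layer each token uses. This is my own illustrative toy, not the paper's exact mechanism (see the code link above for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockSparseFFN(nn.Module):
    # Toy block-sparse feed-forward layer: only k of n_blocks hidden blocks are
    # active per token, chosen by a small controller.
    def __init__(self, d_model=512, d_ff=2048, n_blocks=16, k=2):
        super().__init__()
        assert d_ff % n_blocks == 0
        self.block, self.k = d_ff // n_blocks, k
        self.controller = nn.Linear(d_model, n_blocks)
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x):                       # x: [batch, seq, d_model]
        scores = self.controller(x)             # [batch, seq, n_blocks]
        top = scores.topk(self.k, dim=-1).indices
        mask = torch.zeros_like(scores).scatter_(-1, top, 1.0)
        mask = mask.repeat_interleave(self.block, dim=-1)   # expand to d_ff
        # Dense compute with a mask for clarity; a real kernel skips masked blocks,
        # and training the controller needs a differentiable routing trick.
        h = F.relu(self.w_in(x)) * mask
        return self.w_out(h)
```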
Nov 25, 2021 6 tweets 5 min read
I want to retract this tweet publicly:

It is incorrect and causes unnecessary harm to the authors of "PoolFormer: MetaFormer is Actually What You Need for Vision" (arxiv.org/abs/2111.11418).
Using just AvgPool and MLP, they outperform most models.

1/6

First of all, as @Buntworthy pointed out here:
They added a comparison with "ResNet strikes back" (arxiv.org/abs/2110.00476) on GitHub (github.com/sail-sg/poolfo…), showing how they outperform ResNet+ by training PoolFormer with DeiT's augmentations.

2/6
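For reference, the pooling token mixer at the heart of PoolFormer is tiny; to the best of my understanding of the paper and repo, it is essentially the following (the subtraction cancels the identity added back by the residual connection):

```python
import torch.nn as nn

class Pooling(nn.Module):
    # PoolFormer-style token mixer: local average pooling replaces attention.
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: [batch, channels, height, width]
        return self.pool(x) - x    # subtract input: the residual branch adds it back
```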
Sep 20, 2021 7 tweets 5 min read
Primer combines L1-BN (arxiv.org/abs/1802.09769), Conformer (arxiv.org/abs/2005.08100) and "Squared ReLU" to reach up to 4x faster convergence at no additional memory cost.

This speedup is almost as significant as Switch Transformer's (arxiv.org/abs/2101.03961), which got up to 7x speedups using 64x as many (sparse) parameters.
Primer, however, doesn't use more parameters. It's also orthogonal to Switch, so a combined 32x speedup seems plausible.
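Of the three ingredients, "Squared ReLU" is by far the simplest to adopt; it is just:

```python
import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    # Primer's "Squared ReLU": ReLU followed by elementwise squaring,
    # used as the activation in the transformer's feed-forward block.
    return torch.relu(x) ** 2
```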
Jul 27, 2021 7 tweets 4 min read
Finally, someone did it.
MoE + Weight sharing.
This is amazing.

WideNet finds a way to combine two time-parameter tradeoffs to reduce the final training time and parameter count.
With fewer parameters, fewer layers, and less training time, they achieve a 3.2% (relative) reduction in top-1 error.
Their experiments also illustrate that ViT by itself can learn with weight sharing, which is incredibly exciting.
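To make the combination concrete, here is a toy sketch of the general recipe: one block's weights reused across depth, with a mixture-of-experts feed-forward inside. This is my own illustration of the idea, not WideNet's exact architecture; the per-depth LayerNorms below only mirror its use of independent normalization at each depth.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    # Minimal top-1 mixture-of-experts feed-forward layer (illustrative only).
    def __init__(self, d_model=256, d_ff=1024, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x):                               # x: [..., d_model]
        probs = F.softmax(self.gate(x), dim=-1)
        idx = probs.argmax(dim=-1)                      # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = idx == e
            if sel.any():
                out[sel] = probs[sel][:, e:e + 1] * expert(x[sel])
        return out

class SharedDepthModel(nn.Module):
    # Weight sharing: the same MoE block is reused at every depth,
    # while each depth keeps its own LayerNorm.
    def __init__(self, d_model=256, depth=6):
        super().__init__()
        self.block = TinyMoE(d_model)
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(depth))

    def forward(self, x):
        for norm in self.norms:
            x = x + self.block(norm(x))
        return x
```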
Jul 15, 2021 6 tweets 4 min read
I finally got around to playing with @RiversHaveWings's VQGAN+CLIP notebooks!
The first order of business was to try to reproduce @ak92501's beautiful samples. You can see the results of my journey below (seeds 0 and 123456).

1/5
To generate these samples in a reasonable time, I optimized the model by JIT-compiling it with TorchScript. After countless failed attempts, it's finally 5x as fast as the baseline. (If you're using PyTorch, try the JIT. You might want to follow my notebook for further optimizations.)

2/5
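If you want to try the same trick, the basic TorchScript recipe looks roughly like this, shown on a toy module; the VQGAN+CLIP-specific rewrites live in the notebook mentioned above.

```python
import torch

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64))

    def forward(self, x):
        return self.net(x)

model = ToyModel().eval()
example = torch.randn(1, 64)
jitted = torch.jit.trace(model, example)   # or torch.jit.script(model) for data-dependent control flow
jitted = torch.jit.freeze(jitted)          # fold constants and strip training-only paths
with torch.no_grad():
    out = jitted(example)                  # subsequent calls reuse the optimized graph
```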
May 18, 2021 19 tweets 8 min read
This is a major breakthrough 👇
We're now using only seq^2 (4Mi) elements for each attention tensor instead of batch*heads*seq^2 (128Gi) for a PanGu-Alpha-200B-sized model, without reducing the performance or ability to scale.
I'll implement it immediately in our GPT codebase and share its performance on 2B-equivalent models.

@Hanxiao_6, is the split across channels necessary? You briefly describe it as "effective". Is that on TPU?
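If I'm reading this correctly, the thread is about gMLP's spatial gating unit, which is where the per-channel split mentioned above comes from: a single learned seq×seq matrix mixes tokens and is shared across batch and channels, hence seq^2 elements instead of batch*heads*seq^2. A rough sketch from my understanding of the paper rather than any official code:

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    # Sketch of a gMLP-style spatial gating unit: one learned seq x seq projection
    # mixes tokens and is shared across batch and channels.
    def __init__(self, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.spatial = nn.Linear(seq_len, seq_len)
        nn.init.zeros_(self.spatial.weight)   # near-identity init: gate starts close to 1
        nn.init.ones_(self.spatial.bias)

    def forward(self, x):                     # x: [batch, seq, d_ffn]
        u, v = x.chunk(2, dim=-1)             # split across channels
        v = self.norm(v)
        v = self.spatial(v.transpose(1, 2)).transpose(1, 2)  # mix along the sequence axis
        return u * v                          # elementwise gating
```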