Lucas Beyer
Researcher (Google DeepMind/Brain in Zürich, ex-RWTH Aachen), Gamer, Hacker, Belgian. Mostly gave up trying mastodon as lb@sigmoid.social

Jun 9, 2021, 10 tweets

With @XiaohuaZhai @__kolesnikov__ @neilhoulsby we scale up plain old ViT on ~infinite data (3B images 🤯😬)

We share our results (incl. scaling laws, ImageNet SOTA both many and few-shot) and our recipe (incl. OneWeirdTrick to significantly boost few-shot)

arxiv.org/abs/2106.04560
🧵👇

1. The scaling laws. It seems that in image classification too, Transformers follow a power law (e.g. a straight line in log-log), although it saturates at both the upper and lower end. This holds across datasets, linear eval, fine-tuning, ...
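For the curious, here is a minimal sketch of fitting such a saturating power law; the data points are made up and the exact functional form in the paper may differ.

```python
# Hedged sketch: fit error ≈ a * compute^(-b) + c, where c is the irreducible
# error the curve saturates to. Data and constants below are illustrative only.
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(compute, a, b, c):
    return a * compute ** (-b) + c

# Toy (made-up) compute-vs-error points, roughly a straight line in log-log.
compute = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
error = np.array([0.52, 0.31, 0.19, 0.13, 0.11])

(a, b, c), _ = curve_fit(saturating_power_law, compute, error, p0=(1.0, 0.3, 0.05))
print(f"fit: error ≈ {a:.2f} * compute^(-{b:.2f}) + {c:.2f}")
```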

2. Larger ViTs are more sample-efficient. L/16 reaches the same accuracy as Ti/16 with about 100x fewer images seen!

3. Results: These large pre-trained ViTs are pretty amazing at few-shot learning with just a linear classifier on top of the frozen model. Almost 70% with 1 image per class on ImageNet, and 83% with 5 images per class, i.e. 0.5% of the dataset!
Also, new SOTA when fine-tuned: 90.45%.
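A minimal sketch of such a few-shot linear probe: an L2-regularized least-squares classifier on pre-extracted frozen features. Shapes and names are illustrative, not the paper's exact evaluation code.

```python
import jax.numpy as jnp

def fit_linear_probe(features, labels_onehot, l2=1e-3):
    # features: (n, d) frozen ViT embeddings; labels_onehot: (n, num_classes).
    d = features.shape[1]
    gram = features.T @ features + l2 * jnp.eye(d)
    return jnp.linalg.solve(gram, features.T @ labels_onehot)  # (d, num_classes)

def predict(weights, features):
    return jnp.argmax(features @ weights, axis=-1)

# With 1-5 images per class, `features` is tiny, so this closed-form solve is instant.
```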

4. OneWeirdTrick to improve linear few-shot dramatically: We switch to a GAP (or MAP) "head" and use much stronger weight decay on the classifier "head" than on ViT's "body". It looks *worse* upstream, but is *a lot better* in few-shot!

We hypothesize this increases the margin, à la SVM.
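Roughly, this could look like the sketch below (optax/JAX; the param names "head"/"body" and the decay values are my assumptions, not the paper's exact code): a GAP head plus decoupled weight decay that is much stronger on the classifier head than on the ViT body.

```python
import jax.numpy as jnp
import optax

def gap_head(tokens, head_params):
    # tokens: (batch, seq_len, dim) encoder output; pool tokens instead of a [cls] token.
    pooled = jnp.mean(tokens, axis=1)                       # global average pooling
    return pooled @ head_params["kernel"] + head_params["bias"]

def make_optimizer(lr, body_wd=1e-4, head_wd=1.0):
    # Masks assume a top-level split: params = {"body": ..., "head": ...}.
    is_head = lambda params: {k: k == "head" for k in params}
    is_body = lambda params: {k: k != "head" for k in params}
    return optax.chain(
        optax.scale_by_adam(),
        optax.add_decayed_weights(head_wd, mask=is_head),   # strong decay on the head
        optax.add_decayed_weights(body_wd, mask=is_body),   # mild decay on the body
        optax.scale(-lr),
    )
```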

5. Learning-rate: Train 1 get M free!

We opt for a learning-rate schedule of warmup-rsqrt-cooldown. This lets us train "infinitely" and add cooldowns post hoc, simulating many runs with just one.

In experiments (not shown), this was much better than "warm restarts" schedules.
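A minimal sketch of a warmup → reciprocal-sqrt → linear-cooldown schedule (the constants are illustrative). The cooldown can be re-applied from any point of the "infinite" rsqrt phase, yielding many finished runs from one long training run.

```python
import jax.numpy as jnp

def warmup_rsqrt_cooldown(step, base_lr=1e-3, warmup=10_000, cooldown=50_000, total=1_000_000):
    step = jnp.asarray(step, jnp.float32)
    warm = base_lr * step / warmup                                   # linear warmup
    rsqrt = base_lr * jnp.sqrt(warmup / jnp.maximum(step, warmup))   # ~ 1/sqrt(step) plateau
    cool = rsqrt * jnp.clip((total - step) / cooldown, 0.0, 1.0)     # linear cooldown to 0
    return jnp.where(step < warmup, warm, cool)
```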

6. Because of complex XLA optimization, one can't say upfront what will fit at the memory limit. We use an empirical "shapefinder" approach and scale "diagonally".

We investigate "novel" optimizer variants with half-precision to reduce their memory use A LOT without loss of accuracy.
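As one concrete way to do this in optax (whether this matches the paper's exact modification is an assumption on my part): keep Adam's first-moment accumulator in bfloat16 to roughly halve that part of the optimizer state.

```python
import jax.numpy as jnp
import optax

# Store the first moment in bfloat16 via optax's `mu_dtype` argument.
optimizer = optax.adam(learning_rate=1e-3, mu_dtype=jnp.bfloat16)

# The second moment can be compressed further by factoring, e.g. Adafactor:
# optimizer = optax.adafactor(learning_rate=1e-3)
```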

This was a fun exploration.

Besides my co-authors I'd like to give special shout-out to @jekbradbury who selflessly helped us get stuff running on huge TPU machines!
Also, @rikelhood @joapuipe @_basilM and Alexey for hanging in there with us.

@PreetumNakkiran you should like this. I remember you specifically asked for it a while ago, and I wanted to answer "working on it, be patient" but obviously couldn't. So there you go!

Oh and one more thing, even though I'm not Steve Jobs: Jax + TPU VM made this all a breeze, implementation-wise. I can highly recommend that combo for research!
