With @XiaohuaZhai @__kolesnikov__ @neilhoulsby we scale up plain old ViT on ~infinite data (3B🤯😬)

We share our results (incl. scaling laws, ImageNet SOTA both many and few-shot) and our recipe (incl. OneWeirdTrick to significantly boost few-shot)

arxiv.org/abs/2106.04560
🧵👇
1. The scaling laws. It seems that in image classification too, Transformers follow a power law (e.g. a straight line in log-log), although it saturates at both the upper and lower end. This holds across datasets, linear eval, fine-tuning, ...
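For concreteness, a minimal sketch of fitting such a saturating power law, error ≈ a·C^(-b) + c; the numbers below are made up for illustration and are not the paper's data:

```python
# Minimal sketch: fit a saturating power law error(C) = a * C**(-b) + c.
# In log-log space this is a straight line that flattens out at the floor c.
# The (compute, error) points below are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(compute, a, b, c):
    return a * compute ** (-b) + c

compute = np.array([1e2, 1e3, 1e4, 1e5, 1e6])     # e.g. relative compute (made up)
error = np.array([0.40, 0.28, 0.20, 0.15, 0.13])  # downstream error (made up)

(a, b, c), _ = curve_fit(saturating_power_law, compute, error, p0=(1.0, 0.3, 0.1))
print(f"error ~ {a:.2f} * C^(-{b:.2f}) + {c:.2f}  (c = irreducible error floor)")
```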
2. Larger ViTs are more sample-efficient. L/16 reaches the same accuracy as Ti/16 with about 100x fewer images seen!
3. Results: These large pre-trained ViTs are pretty amazing at few-shot learning with just a linear classifier on top of the frozen model. Almost 70% with 1 image per class on ImageNet, and 83% with 5 images per class, i.e. 0.5% of the dataset!
Also, new SOTA when fine-tuned: 90.45%.
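The few-shot numbers come from training only a linear classifier on frozen features. A minimal sketch of one common way to do that, closed-form ridge regression onto one-hot labels (the exact solver and regularization used in the paper may differ):

```python
# Sketch of a linear few-shot probe on frozen features: closed-form ridge
# regression onto one-hot labels. The exact solver/regularizer used in the
# paper may differ; shapes and the l2 value here are illustrative.
import numpy as np

def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """features: [N, D] frozen ViT embeddings, labels: [N] integer classes."""
    n, d = features.shape
    x = np.concatenate([features, np.ones((n, 1))], axis=1)  # append bias column
    y = np.eye(num_classes)[labels]                          # one-hot targets
    # W = (X^T X + l2 * I)^{-1} X^T Y
    return np.linalg.solve(x.T @ x + l2 * np.eye(d + 1), x.T @ y)

def predict(features, weights):
    x = np.concatenate([features, np.ones((len(features), 1))], axis=1)
    return (x @ weights).argmax(axis=1)
```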
4. OneWeirdTrick to improve linear few-shot dramatically: we switch to a GAP (or MAP) "head" and use much stronger weight decay on the classifier "head" than on the ViT's "body". It looks *worse* upstream, but is *a lot better* in few-shot!

We hypothesize this increases the margin, à la SVM.
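A sketch of how such a head/body split in weight decay could be set up in optax; the param layout (top-level "body"/"head" keys), the decay values, and the Adam-style update are illustrative assumptions, not the paper's exact settings:

```python
# Sketch: much stronger weight decay on the classifier "head" than on the
# ViT "body", assuming params are a dict with top-level keys "body"/"head".
import jax
import jax.numpy as jnp
import optax

HEAD_WD = 10.0   # "much stronger" decay on the head (illustrative)
BODY_WD = 0.03   # ordinary decay on the body (illustrative)

def part_mask(params, part):
    # True for leaves under the given top-level key, False elsewhere.
    return {k: jax.tree_util.tree_map(lambda _: k == part, v) for k, v in params.items()}

def head_body_optimizer(params, learning_rate):
    return optax.chain(
        optax.scale_by_adam(),
        optax.add_decayed_weights(BODY_WD, mask=part_mask(params, "body")),
        optax.add_decayed_weights(HEAD_WD, mask=part_mask(params, "head")),
        optax.scale_by_learning_rate(learning_rate),
    )

params = {"body": {"kernel": jnp.ones((8, 8))}, "head": {"kernel": jnp.ones((8, 3))}}
opt_state = head_body_optimizer(params, 1e-3).init(params)
```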
5. Learning-rate: Train 1 get M free!

We opt for a learning-rate schedule of warmup-rsqrt-cooldown. This allows us to train "infinitely" and add cooldowns post-hoc, simulating many runs with just one.

In experiments (not shown), this was much better than "warm restarts" schedules.
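Roughly, the schedule looks like this (linear warmup and linear cooldown are assumed; the exact constants are illustrative):

```python
# Sketch of the warmup-rsqrt-cooldown schedule: linear warmup to base_lr,
# a 1/sqrt(step) "cruise" phase that can run indefinitely, and a linear
# cooldown to zero that can be branched off post hoc from any checkpoint.
import jax.numpy as jnp

def warmup_rsqrt_cooldown(step, base_lr, warmup_steps, total_steps, cooldown_steps):
    step = jnp.asarray(step, jnp.float32)
    warmup = base_lr * step / warmup_steps                                       # linear ramp up
    rsqrt = base_lr * jnp.sqrt(warmup_steps / jnp.maximum(step, warmup_steps))   # 1/sqrt "cruise"
    cooldown = jnp.clip((total_steps - step) / cooldown_steps, 0.0, 1.0)         # linear ramp down
    return jnp.minimum(warmup, rsqrt) * cooldown

# Example (illustrative numbers):
lr = warmup_rsqrt_cooldown(50_000, base_lr=1e-3, warmup_steps=10_000,
                           total_steps=200_000, cooldown_steps=50_000)
```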
6. Because of complex XLA optimizations, one can't say upfront what will fit at the memory limit. We use an empirical "shapefinder" approach and scale "diagonally".

We investigate "novel" optimizer variants with half-precision state to reduce their memory use A LOT without loss of accuracy.
This was a fun exploration.
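As an illustration of the half-precision-state idea (not the exact optimizer variant from the paper): a momentum update that stores its buffer in bfloat16 and only casts up to float32 for the arithmetic.

```python
# Sketch: keep the momentum buffer in bfloat16 to roughly halve optimizer
# memory, casting to float32 only for the update arithmetic. This is an
# illustration of the idea, not the exact variant studied in the paper.
import jax
import jax.numpy as jnp

def init_momentum(params):
    return jax.tree_util.tree_map(lambda p: jnp.zeros(p.shape, jnp.bfloat16), params)

def momentum_update(params, grads, momentum, lr=0.01, beta=0.9):
    new_m32 = jax.tree_util.tree_map(                       # math in float32
        lambda g, m: beta * m.astype(jnp.float32) + g, grads, momentum)
    new_params = jax.tree_util.tree_map(lambda p, m: p - lr * m, params, new_m32)
    new_momentum = jax.tree_util.tree_map(                  # store state in bf16
        lambda m: m.astype(jnp.bfloat16), new_m32)
    return new_params, new_momentum
```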

Besides my co-authors, I'd like to give a special shout-out to @jekbradbury who selflessly helped us get stuff running on huge TPU machines!
Also, thanks to @rikelhood @joapuipe @_basilM and Alexey for hanging in there with us.
@PreetumNakkiran you should like this. I remember you specifically asked for it a while ago, and I wanted to answer "working on it, be patient" but obviously couldn't. So there you go!
Oh and one more thing, even though I'm not Steve Jobs: Jax + TPU VM made this all a breeze, implementation-wise. I can highly recommend that combo for research!
