Lucas Beyer (bl16)
Jun 9, 2021 · 10 tweets
With @XiaohuaZhai @__kolesnikov__ @neilhoulsby we scale up plain old ViT on ~infinite data (3B🤯😬)

We share our results (incl. scaling laws, ImageNet SOTA both many and few-shot) and our recipe (incl. OneWeirdTrick to significantly boost few-shot)

arxiv.org/abs/2106.04560
🧵👇
1. The scaling laws. It seems that in image classification too, Transformers follow a power law (e.g. a straight line in log-log), although it saturates at both the upper and lower ends. This holds across datasets, linear eval, fine-tuning, ...
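To make the log-log picture concrete, here's a toy power-law fit with made-up numbers (the paper's actual fit also models the saturation at both ends):

```python
import numpy as np

# Made-up (compute, error) points, purely to illustrate the log-log fit.
compute = np.array([1e1, 1e2, 1e3, 1e4, 1e5])   # e.g. relative compute
error   = np.array([0.40, 0.25, 0.16, 0.10, 0.065])

# A power law  error = a * compute**(-b)  is a straight line in log-log,
# so an ordinary linear fit on the logs recovers a and b.
slope, intercept = np.polyfit(np.log(compute), np.log(error), deg=1)
a, b = np.exp(intercept), -slope
print(f"error ~ {a:.3f} * compute^(-{b:.3f})")
```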
2. Larger ViTs are more sample-efficient. L/16 reaches the same accuracy as Ti/16 with about 100x fewer images seen!
3. Results: These large pre-trained ViTs are pretty amazing at few-shot learning with just a linear classifier on top of the frozen model. Almost 70% with 1 image per class on ImageNet, and 83% with 5 images per class, i.e. 0.5% of the dataset!
Also, new SOTA when fine-tuned: 90.45%.
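For intuition, a linear probe on frozen features can be as simple as a closed-form ridge regression. This is only a sketch, not necessarily the exact probe used in the paper:

```python
import numpy as np

def linear_few_shot_probe(train_feats, train_labels, test_feats, l2=1e-3):
    """Closed-form ridge-regression probe on frozen ViT features (a sketch).

    train_feats: (n, d) features from the frozen model, train_labels: (n,) ints.
    """
    n, d = train_feats.shape
    num_classes = int(train_labels.max()) + 1
    targets = np.eye(num_classes)[train_labels]           # one-hot, (n, c)
    # Ridge regression: W = (X^T X + l2 * I)^-1 X^T Y
    gram = train_feats.T @ train_feats + l2 * np.eye(d)
    weights = np.linalg.solve(gram, train_feats.T @ targets)
    return (test_feats @ weights).argmax(axis=1)          # predicted class ids
```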
4. OneWeirdTrick to improve linear few-shot dramatically: We switch to a GAP (or MAP) "head" and use much stronger weight decay on the classifier "head" than on ViT's "body". It looks *worse* upstream, but is *a lot better* in few-shot!

We hypothesize this increases the margin, à la SVM.
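In optax terms, decoupling the head's weight decay from the body's could look like this sketch (label names and values are illustrative, not the paper's exact recipe):

```python
import optax

# Much stronger weight decay on the classifier 'head' than on the ViT 'body'.
# Assumes the params pytree looks like {'body': ..., 'head': ...}.
def make_optimizer(lr_schedule, body_wd=1e-4, head_wd=1e-1):
    return optax.multi_transform(
        {'body': optax.adamw(lr_schedule, weight_decay=body_wd),
         'head': optax.adamw(lr_schedule, weight_decay=head_wd)},
        param_labels={'body': 'body', 'head': 'head'})
```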
5. Learning-rate: Train 1 get M free!

We opt for a learning-rate schedule of warmup-rsqrt-cooldown. This allows us to train "infinitely" and add cooldowns post-hoc, simulating many runs with just one.

In experiments (not shown), this was much better than "warm restarts" schedules.
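A sketch of what such a schedule can look like (constants illustrative):

```python
import jax.numpy as jnp

def warmup_rsqrt_cooldown(step, base_lr, warmup, cooldown, total_steps):
    """Warmup -> reciprocal-sqrt -> linear-cooldown schedule (a sketch).

    `total_steps` (and hence the cooldown) can be chosen after the fact,
    which is what lets one run stand in for many.
    """
    step = jnp.asarray(step, jnp.float32)
    warmup_lr = base_lr * step / warmup
    rsqrt_lr = base_lr * jnp.sqrt(warmup / jnp.maximum(step, warmup))
    # Linearly anneal to zero over the final `cooldown` steps.
    frac_left = jnp.clip((total_steps - step) / cooldown, 0.0, 1.0)
    return jnp.where(step < warmup, warmup_lr, rsqrt_lr * frac_left)
```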
6. Because of complex XLA optimizations, one can't say upfront what will fit within the memory limit. We use an empirical "shapefinder" approach and scale "diagonally".

We investigate "novel" optimizer variants with half-precision state to reduce their memory use A LOT without loss of accuracy.
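One way to get optimizer state into half precision, shown here as an illustrative optax configuration rather than the paper's exact modification, is Adafactor with factored second moments and a bfloat16 first moment:

```python
import jax.numpy as jnp
import optax

opt = optax.adafactor(
    learning_rate=1e-3,
    factored=True,                # factored 2nd-moment estimate
    momentum=0.9,                 # keep a 1st moment...
    dtype_momentum=jnp.bfloat16,  # ...but store it in half precision
)
```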
This was a fun exploration.

Besides my co-authors, I'd like to give a special shout-out to @jekbradbury, who selflessly helped us get stuff running on huge TPU machines!
Also, thanks to @rikelhood @joapuipe @_basilM and Alexey for hanging in there with us.
@PreetumNakkiran you should like this. I remember you specifically asked for it a while ago, and I wanted to answer "working on it, be patient" but obviously couldn't. So there you go!
Oh and one more thing, even though I'm not Steve Jobs: JAX + TPU VMs made this all a breeze, implementation-wise. I can highly recommend that combo for research!

More from @giffmana

Apr 5
A bit late, but I just read ReFT, here's a quick thread.

- A PEFT method
- acts on activations `h` -> small inference overhead

1/5
Learns an (R, W, b) per layer and _per position_ in the (prompt) sequence.

However, they hparam-search which subset of layers and positions to apply it to.

Even more, they suggest (sometimes) tying the (R, W, b) parameters of a layer across positions.

2/5
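My reading of the intervention, as a sketch (not the authors' code): the learned (R, W, b) edit the activation only inside a rank-r subspace.

```python
import jax.numpy as jnp

def loreft_intervention(h, R, W, b):
    """Sketch of the low-rank intervention as I read the ReFT paper.

    h: (d,) hidden activation at one layer/position.
    R: (r, d) low-rank projection (rows orthonormal in the paper).
    W: (r, d) and b: (r,) learned linear map.
    Only the r-dimensional subspace spanned by R gets edited.
    """
    return h + R.T @ (W @ h + b - R @ h)
```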
It seems best to apply it to all layers, but only a few positions. Tying across positions helps for some models, but not for decoders. Rank can be lower for smaller models.

3/5

Nov 10, 2023
🧶You may know me as a SigLIP evangelist.

But don't forget I also co-created Cap(Pa), which I'm bullish on.

CapPa nailed the ARO benchmark, where contrastive models struggle. We have new results showing that it also nails the newer, harder SugarCrepe benchmark.



My original motivation for captioning pretraining is that there are things contrastive pretraining will fundamentally not learn.

Think "cat sitting left of dog": the model only needs to "detect cat" if there's no other cat in the minibatch.

This is the essence of CLIP's binding problem.

ARO and others try to benchmark this, but they have issues: an LM that never even looks at the image can still pick the right caption. We showed this in CapPa.

SugarCrepe fixes this: its hard negatives make sense.

more:

Oct 22, 2023
Here's what our (sub)team in Zürich has done for OSS vision over the past 5y, besides inventing ViT:

1) Make i21k a thing
Release:
2) best CLIP (siglip) by a large margin
3) best i1k ResNet50 ever
4) best pre-trained ResNets
5) >55k ViTs
6) Most efficient JAX/TPU CV code
deets👇
1) i21k was completely overlooked by everyone before our BigTransfer (BiT) paper. When I dug it up, there was only a single blog post on the web reporting training on it, and it reported bad results.

It's now widely used for classification pre-training, where it works better than i1k.
2) Just a month ago, we released SigLIP models, which are CLIP-like and much better than anything else. The 400M-parameter one works as well as EVA-CLIP's 4B-parameter one; see the benchmark in open_clip:

github.com/mlfoundations/…
Sep 28, 2023
Pleased to announce we are releasing checkpoints for our SigLIP models!

These are very strong image-text ViTs. We release them along with a colab to play around with. Most are English, but we also release a good i18n one.

Sorry, no magnet link mic drop. More in thread🧶
The colab with checkpoints and code examples is in our big_vision JAX codebase:

Here's a table comparing them to public models of the same size. The performance jump is significant, and we REMOVED near-duplicates of the benchmarks from our training data. github.com/google-researc…
Here are some examples from the i18n model, which "gets" native cultural things. We're looking for more examples to test culture-language effects in i18n models.

Also, the long-standing grand challenge of computer vision, the cow on a beach, is finally solved!! Even in a tuxedo :)
Aug 18, 2023
What makes CLIP work?
The contrast with negatives via softmax?
The more negatives, the better -> large batch-size?

We'll answer "no" to both in our ICCV oral🤓
By introducing SigLIP, a simpler CLIP that also works better and is more scalable, we can study the extremes.

Hop in🧶
Perhaps surprisingly, we can replace the softmax-xent with a sigmoid-xent loss in CLIP training and things just work.

With one little detail: add a learnable bias, much like the temperature.

This is conceptually simpler and cleaner: do image I and text T match, yes or no?
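In code, my sketch of that loss (shapes and parameterization simplified):

```python
import jax.numpy as jnp
import jax.nn as jnn

def sigmoid_loss(img_emb, txt_emb, temperature, bias):
    """Sketch of the pairwise sigmoid loss.

    img_emb, txt_emb: (n, d), L2-normalized. temperature and bias are the
    learnable scalars; the bias is the 'one little detail' above.
    """
    logits = temperature * img_emb @ txt_emb.T + bias   # (n, n)
    n = img_emb.shape[0]
    labels = 2.0 * jnp.eye(n) - 1.0                     # +1 on the diagonal, -1 elsewhere
    # Every (image, text) pair is an independent binary yes/no problem.
    return -jnp.sum(jnn.log_sigmoid(labels * logits)) / n
```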
It's also much simpler code, and since every element in the similarity matrix is independent, it becomes obvious that we can compute the loss "in chunks" across devices.

Chunked sigmoid loss never instantiates the full all-to-all matrix, hence letting us scale batch size.
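And a single-device sketch of the chunked computation, following the conventions of the function above (the real implementation swaps text chunks between devices instead of slicing locally):

```python
import jax.numpy as jnp
import jax.nn as jnn

def chunked_sigmoid_loss(img_emb, txt_emb, temperature, bias, chunk=1024):
    """Same loss, computed block-by-block without the full n x n matrix."""
    n = img_emb.shape[0]
    total = 0.0
    for start in range(0, n, chunk):                           # plain loop for clarity
        txt_chunk = txt_emb[start:start + chunk]               # (c, d)
        logits = temperature * img_emb @ txt_chunk.T + bias    # (n, c)
        labels = -jnp.ones_like(logits)
        # Positives are where the global image index equals the text index.
        idx = jnp.arange(txt_chunk.shape[0])
        labels = labels.at[start + idx, idx].set(1.0)
        total += -jnp.sum(jnn.log_sigmoid(labels * logits))
    return total / n
```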
Jun 16, 2023
Who killed non-contrastive image-text pretraining? @AlecRad and @_jongwook_kim, with Fig. 2 of the CLIP paper.

Who collected the 7 Dragonballs and asked Shenron to resurrect it? Yours truly, in this new paper of ours.

Generative captioning is not only competitive, it seems better!
Some results first: Looking at a wide mix of tasks, an image encoder pre-trained on image/alt-text pairs via captioning (Cap/CapPa) almost matches a contrastive one (CLIP) on classification tasks, and largely outperforms it on image-text tasks.
Even better, Captioning (green) seems to scale better than Contrastive (red), both in terms of model size (top row) and training duration (bottom row).