Lucas Beyer · Jun 10, 2021
So you think you know distillation; it's easy, right?

We thought so too with @XiaohuaZhai @__kolesnikov__ @_arohan_ and the amazing @royaleerieme and Larisa Markeeva.

Until we didn't. But now we do again. Hop on for a ride (+the best ever ResNet50?)

🧵👇arxiv.org/abs/2106.05237
This is not a fancy novel method. It's plain old distillation.

But we investigate it thoroughly, for model compression, through the lens of *function matching*.

We highlight two crucial principles that are often missed: consistency and patience. Only both together give good results!
0. Intuition: Want the student to replicate _the whole function_ represented by the teacher, everywhere that we expect data in input space.

This is a much stronger view than the commonly used "teacher generates better/more informative labels for the data". See pic above.
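To make the function-matching view concrete, here's a minimal sketch (not our actual code; `teacher_fn` / `student_fn` stand in for hypothetical apply functions returning logits): the student simply minimizes the KL divergence to the teacher's predictive distribution on the very same input.

```python
import jax
import jax.numpy as jnp

# Minimal sketch, assuming hypothetical apply functions `teacher_fn` and
# `student_fn` that map (params, batch of inputs) -> logits.
def distill_loss(student_params, teacher_params, x):
    t_probs = jax.nn.softmax(teacher_fn(teacher_params, x), axis=-1)
    s_logprobs = jax.nn.log_softmax(student_fn(student_params, x), axis=-1)
    # KL(teacher || student), dropping the teacher-entropy term,
    # which does not depend on the student.
    return -jnp.mean(jnp.sum(t_probs * s_logprobs, axis=-1))
```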
1. Consistency: to achieve this, teacher and student need to see the same view (crop) of the image. For example, this means no pre-computed teacher logits! We can generate many more views via mixup.

Other approaches may look good early, but eventually fall behind consistency.
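Concretely, consistency just means the augmentation is sampled once and shared. A rough sketch, reusing the hypothetical `distill_loss` above (`random_crop_flip` is an assumed augmentation helper, not a real library call):

```python
import jax
import jax.numpy as jnp

# Rough sketch of "consistent" teacher/student views. The crop (and mixup) is
# sampled once, and the *same* view goes through both teacher and student;
# in particular, no pre-computed teacher logits.
# `random_crop_flip` is a hypothetical augmentation helper.
def consistent_distill_loss(student_params, teacher_params, images, rng):
    crop_rng, mix_rng = jax.random.split(rng)
    views = random_crop_flip(crop_rng, images)   # one shared view per image

    # mixup generates many more shared views by blending random image pairs.
    lam = jax.random.beta(mix_rng, 1.0, 1.0)
    views = lam * views + (1.0 - lam) * jnp.flip(views, axis=0)

    return distill_loss(student_params, teacher_params, views)
```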
2. Patience: The function-matching task is HARD! We need to train *a lot* longer than typical; in fact, we were not yet able to reach saturation. Overfitting does not happen: when function matching, an "overfit" student is exactly what we want. (Note: with pre-computed teacher logits, we do overfit.)
2b. Excessively long training can make optimization a struggle. We try advanced optimization via Shampoo and get 4x faster convergence.

We believe this setting is a great test-bed for optimizer research: no concern about overfitting, and reducing training error directly means generalizing better!
3. By distilling a couple large BiT R152x2 models into a ResNet-50, we get a ResNet-50 on ImageNet that gets 82.8% at 224px resolution, and 80.5% at 160px! 😎

No "tricks" just plain distillation, patiently matching functions.
4. Importantly, this simple strategy works on many datasets of various sizes, down to only 1020 training images, where anything else we tried overfit horribly.

Be patient, be consistent, that's it. Eventually, you'll reach or outperform your teacher!
2c. We can't stress patience enough. Multiple strategies, for example initializing the student with a pre-trained model shown here, look promising at first, but eventually plateau and are outperformed by patient, consistent function matching.
5. We have a lot more content: MobileNet students, distilling on "random other" data (shown below), very thorough baselines, a teacher ensemble, and... BiT download statistics!
PS: we are working on releasing a bunch of the models, including the best ones, ... but we're also on vacation. Watch github.com/google-researc… and stay tuned, we're aiming for next week!


More from @giffmana

Apr 5
A bit late, but I just read ReFT, here's a quick thread.

- A PEFT method
- acts on activations `h` -> small inference overhead

1/5
Learns an (R, W, b) per layer and _per position_ in the (prompt) sequence.

However, they hyperparameter-search which subset of layers and positions to apply it to.

Even more, they suggest (sometimes) tying the (R, W, b) parameters of a layer across positions.

2/5
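For reference, my reading of the low-rank intervention on `h` (a sketch based on the paper, not the authors' code; `R` and `W` are r x d with small rank r, `b` has length r, and `R`'s rows are kept orthonormal):

```python
import jax.numpy as jnp

# Sketch of a LoReFT-style edit of a hidden state h (width d) at one selected
# (layer, position). Only the r-dimensional subspace spanned by R's rows is
# changed: its projection R @ h is replaced by the learned target W @ h + b.
def loreft_intervention(h, R, W, b):
    return h + R.T @ (W @ h + b - R @ h)
```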
It seems best to apply it to all layers, but only a few positions. Tying across positions helps for encoder models, but not for decoders. Rank can be lower for smaller models.

3/5

Nov 10, 2023
🧶You may know me as a SigLIP evangelist.

But don't forget I also co-created Cap(Pa), which I'm bullish on.

CapPa nailed the ARO benchmark, where contrastive models struggle. We have new results showing it also nails the newer, harder SugarCrepe benchmark.



My original motivation for captioning pretraining is that there are things contrastive pretraining will fundamentally not learn.

Think "cat sitting left of dog": the model only needs to "detect cat" if there's no other cat in the minibatch.

This is the essence of CLIP's binding problem.

ARO and others (left pic) try to benchmark this, but they have issues: a language model that doesn't even look at the image can still pick the right caption. We showed this in CapPa.

SugarCrepe (right pic) fixes this: hard negatives make sense.

Oct 22, 2023
Here's what our (sub)team in Zürich has done for OSS vision over the past 5y, besides inventing ViT:

1) Make i21k a thing
Release:
2) best CLIP (siglip) by a large margin
3) best i1k ResNet50 ever
4) best pre-trained ResNets
5) >55k ViTs
6) Most efficient JAX/TPU CV code
deets👇
1) i21k was completely overlooked by everyone before our BigTransfer (BiT) paper. When I dug it up, there was only a single blog post on the web reporting training on it, and it reported bad results.

It's now widely used for classification pre-training, working better than i1k.
2) Just a month ago, we released SigLIP models, which are CLIP-like and much better than anything else. The 400m parameter one works as well as EVA-CLIP's 4B parameter one, see benchmark in open_clip:

github.com/mlfoundations/…
Sep 28, 2023
Pleased to announce we are releasing checkpoints for our SigLIP models!

These are very strong image-text ViTs. We release them along with a colab to play around with. Most are English, but we also release a good i18n one.

Sorry, no magnet link mic drop. More in thread🧶
The colab with checkpoints and code examples is in our big_vision JAX codebase:

Here's a table comparing to public models of the same size. The performance jump is significant, and we REMOVED near-duplicates of the benchmarks from our training data. github.com/google-researc…
Here are some examples from the i18n model, which "gets" native cultural things. We're looking for more examples to test culture-language effects in i18n models.

Also, the long-standing grand challenge of computer vision, cow on beach, is finally solved!! Even in a tuxedo :)
Aug 18, 2023
What makes CLIP work?
The contrast with negatives via softmax?
The more negatives, the better -> large batch-size?

We'll answer "no" to both in our ICCV oral🤓
By introducing SigLIP, a simpler CLIP that also works better and is more scalable, we can study the extremes.

Hop in🧶
Perhaps surprisingly, we can replace the SoftMax-xent by a Sigmoid-xent loss in CLIP training and things just work.

With one little detail: add a learnable bias, much like the temperature.

This is conceptually simpler and cleaner: do image I and text T match, yes or no?
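A minimal sketch of that loss (my paraphrase, the paper's pseudocode is the reference; `img_emb` / `txt_emb` are L2-normalized (B, D) embeddings, `t` the learnable temperature, `b` the learnable bias):

```python
import jax
import jax.numpy as jnp

# Pairwise sigmoid loss sketch: every (image, text) pair is an independent
# binary decision, +1 on the diagonal (matching pairs), -1 everywhere else.
def siglip_loss(img_emb, txt_emb, t, b):
    logits = t * img_emb @ txt_emb.T + b                 # (B, B)
    labels = 2.0 * jnp.eye(logits.shape[0]) - 1.0
    return -jnp.mean(jnp.sum(jax.nn.log_sigmoid(labels * logits), axis=-1))
```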
It's also much simpler code, and since every element in the similarity matrix is independent, it becomes obvious that we can compute the loss "in chunks" across devices.

Chunked sigmoid loss never instantiates the full all-to-all matrix, letting us scale the batch size.
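To illustrate the chunking idea (in practice we permute chunks across devices; this single-host sketch only shows that the full B x B matrix is never materialized):

```python
import jax
import jax.numpy as jnp

# Chunked variant of the sketch above: only a (B, chunk) block of logits
# exists at any time. Hypothetical sketch, not the released implementation.
def chunked_siglip_loss(img_emb, txt_emb, t, b, chunk=1024):
    B = img_emb.shape[0]
    loss = 0.0
    for start in range(0, B, chunk):
        txt_blk = txt_emb[start:start + chunk]           # (c, D)
        logits = t * img_emb @ txt_blk.T + b             # (B, c) block only
        cols = jnp.arange(txt_blk.shape[0])
        labels = -jnp.ones_like(logits)
        labels = labels.at[start + cols, cols].set(1.0)  # matching pairs
        loss += -jnp.sum(jax.nn.log_sigmoid(labels * logits))
    return loss / B
```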
Jun 16, 2023
Who killed non-contrastive image-text pretraining? @AlecRad and @_jongwook_kim with the below Fig2 in CLIP.

Who collected the 7 Dragonballs and asked Shenron to resurrect it? Yours truly, in this new paper of ours.

Generative captioning is not only competitive, it seems better!
Some results first: Looking at a wide mix of tasks, an image encoder pre-trained on image/alt-text pairs via captioning (Cap/CapPa) almost matches a contrastive one (CLIP) on classification tasks, and largely outperforms it on image-text tasks.
Even better, Captioning (green) seems to scale better than Contrastive (red), both in terms of model size (top row) and training duration (bottom row).
