Lucas Beyer
Researcher (Google DeepMind/Brain in Zürich, ex-RWTH Aachen), Gamer, Hacker, Belgian. Mostly gave up trying mastodon as lb@sigmoid.social
Apr 5 5 tweets 3 min read
A bit late, but I just read ReFT, here's a quick thread.

- A PEFT method
- acts on activations `h` -> small inference overhead

1/5
Learns an (R, W, b) triple per layer and _per position_ in the (prompt) sequence.

However, they run a hyperparameter search over which subset of layers and positions to apply it to.

Moreover, they suggest (sometimes) tying the (R, W, b) parameters of a layer across positions.

2/5
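For concreteness, here is a minimal JAX sketch of a low-rank (R, W, b) intervention of this flavour, as I read the paper (shapes and names are mine; where R's orthonormality comes from is omitted):

```python
import jax.numpy as jnp

def loreft_edit(h, R, W, b):
    """Edit one hidden activation h (shape (d,)) at one layer/position.

    Inside the rank-r subspace spanned by the rows of R (shape (r, d),
    assumed orthonormal), move the activation from its current projection
    R @ h towards the learned target W @ h + b; outside that subspace,
    h stays untouched, which is why the inference overhead is small.
    """
    return h + R.T @ (W @ h + b - R @ h)
```

Everything else (the frozen base model, the rest of the sequence) runs unchanged; the edit is only applied at the hparam-searched subset of layers and positions.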
Nov 10, 2023 8 tweets 5 min read
🧶You may know me as a SigLIP evangelist.

But don't forget I also co-created Cap(Pa), which I'm bullish on.

CapPa nailed the ARO benchmark where contrastive models struggle. We have new results showing it also nails the newer, harder SugarCrepe benchmark.



My original motivation for captioning pretraining is that there are things contrastive pretraining will fundamentally not learn.

Think "cat sitting left of dog", it only needs to "detect cat" if there's no other cat in the minibatch.

This is the essence of CLIP's binding problem.

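To make the contrast concrete, here's a rough sketch of the captioning objective (not the actual Cap/CapPa code; `encoder`/`decoder` are placeholder callables): a text decoder has to predict the alt-text token by token from the image features, so relations like "left of" actually carry loss.

```python
import jax
import jax.numpy as jnp

def captioning_loss(params, image, caption_tokens, encoder, decoder):
    """Teacher-forced next-token prediction of the alt-text, conditioned on
    the image. Unlike a contrastive loss, this cannot be satisfied by merely
    detecting the most discriminative object in the minibatch."""
    feats = encoder(params['enc'], image)                        # (n_patches, d)
    logits = decoder(params['dec'], caption_tokens[:-1], feats)  # predict token t+1
    targets = caption_tokens[1:]
    logp = jax.nn.log_softmax(logits, axis=-1)                   # (len-1, vocab)
    return -jnp.mean(jnp.take_along_axis(logp, targets[:, None], axis=-1))
```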
Oct 22, 2023 10 tweets 4 min read
Here's what our (sub)team in Zürich has done for OSS vision over the past 5y, besides inventing ViT:

1) Make i21k a thing
Release:
2) best CLIP (siglip) by a large margin
3) best i1k ResNet50 ever
4) best pre-trained ResNets
5) >55k ViTs
6) Most efficient JAX/TPU CV code
deets👇

1) i21k was completely overlooked by everyone before our BigTransfer (BiT) paper. When I dug it up, there was only a single blog post on the web reporting training on it, and it reported bad results.

It's now widely used for classification pre-training, where it works better than i1k.
Sep 28, 2023 6 tweets 3 min read
Pleased to announce we are releasing checkpoints for our SigLIP models!

These are very strong image-text ViTs. We release them along with a colab to play around with. Most are English, but we also release a good i18n one.

Sorry, no magnet link mic drop. More in thread🧶

The colab with checkpoints and code examples is in our big_vision JAX codebase:

Here's a table comparing to public models of the same size. The performance jump is significant, and we REMOVED near-duplicates of the benchmarks from our training data. github.com/google-researc…
Aug 18, 2023 9 tweets 5 min read
What makes CLIP work?
The contrast with negatives via softmax?
The more negatives, the better -> large batch-size?

We'll answer "no" to both in our ICCV oral🤓
By introducing SigLIP, a simpler CLIP that also works better and is more scalable, we can study the extremes.

Hop in🧶

Perhaps surprisingly, we can replace the SoftMax-xent with a Sigmoid-xent loss in CLIP training and things just work.

With one little detail: add a learnable bias, much like the temperature.

This is conceptually simpler and cleaner: do image I and text T match, yes or no?
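In pseudo-JAX, the sigmoid loss is roughly this (a sketch; see the paper and big_vision for the real thing, including how t and b are initialised):

```python
import jax
import jax.numpy as jnp

def sigmoid_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (n, d) L2-normalised embeddings of n paired examples.
    t: learnable temperature, b: learnable bias (both scalars).

    Every (image, text) combination in the batch becomes an independent binary
    question "do these match?": +1 on the diagonal, -1 everywhere else.
    No softmax over the batch, hence no all-to-all normalisation."""
    logits = t * img_emb @ txt_emb.T + b            # (n, n) pairwise scores
    labels = 2.0 * jnp.eye(logits.shape[0]) - 1.0   # +1 matches, -1 non-matches
    return -jnp.mean(jnp.sum(jax.nn.log_sigmoid(labels * logits), axis=-1))
```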
Jun 16, 2023 11 tweets 6 min read
Who killed non-contrastive image-text pretraining? @AlecRad and @_jongwook_kim with the below Fig2 in CLIP.

Who collected the 7 Dragonballs and asked Shenron to resurrect it? Yours truly, in this new paper of ours.

Generative captioning is not only competitive, it seems better!

Some results first: looking at a wide mix of tasks, an image encoder pre-trained on image/alt-text pairs via captioning (Cap/CapPa) almost matches a contrastive one (CLIP) on classification tasks, and largely outperforms it on image-text tasks.
Mar 3, 2023 4 tweets 2 min read
That's a super interesting claim!

The main reason highlighted is minibatch gradient variance (see screenshot).

Unsolicited review (sorry @liuzhuang1234!): This immediately asks for experiments that could validate or falsify the hypothesis, none of which I found in the paper:

1/3

1. As minibatch size grows/shrinks, the effect should vanish/increase.
2. Similar but different: plot the size of the effect on the Y axis against minibatch size as a percentage of training-set size on the X axis.

Even worse: the batch size is only mentioned in the appendix!

2/3
Feb 17, 2023 11 tweets 6 min read
Beyond classification in vision, it always feels weird to optimize for a loss which doesn't _really_ match how we'll use the model later on*, but happens to be differentiable.

In our latest work, we tackle this discrepancy🧶

*unless the model is 100% perfect, which it never is.

- In pix2seq, you don't _really_ care about the perplexity of the detection string
- In FasterRCNN, DETR & co, you don't _really_ care about box-losses and class-losses

You care about driving safely, counting moles or parking spots, the robot picking the box, ...
Dec 29, 2022 14 tweets 8 min read
How good of a BERT can one get in ONE DAY on ONE GPU?

With all the recent studies about scaling compute up, this paper takes a refreshing turn and does a deep dive into scaling down compute.

It's well written and chock-full of insights. Here are my summary and my opinions.

🧶 1/N

2/N First, the setting. See the screenshot for full info, but in short:
- 24h of training on a single good GPU (2080ti or a4000)
- Transformer architecture, modifications are OK
- MLM training from scratch
- NO use of pre-trained anything in any way (except tokenizer)
- Any dataset
Mar 28, 2022 11 tweets 6 min read
Want to turn any vision backbone into an image-text model? Want to show the age-old "your model wouldn't recognize a cow on the beach" is a red herring?

That's LiT🔥 (Locked-image Tuning), a new alternative to fine-tuning that combines the best of fine-tuning and zero-shot
1/n🧶

As usual, our method is really simple: take a pre-trained (sup, selfsup, whatever) image backbone, freeze it, and attach a text encoder to it. On any image-text dataset, train the text encoder to predict the corresponding image's embedding, CLIP-style.
2/n
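A minimal sketch of that recipe (my naming, not the paper's code; the image tower's parameters are pre-trained and frozen, only the text tower and temperature get gradients):

```python
import jax
import jax.numpy as jnp

def lit_loss(text_params, batch, image_tower, text_tower, temperature):
    """CLIP-style contrastive loss with a locked image tower.

    image_tower: frozen pre-trained backbone (params baked in); in practice
    its embeddings can even be precomputed once for the whole dataset.
    text_tower: the only module receiving gradients (besides temperature)."""
    z_img = jax.lax.stop_gradient(image_tower(batch['image']))   # locked
    z_txt = text_tower(text_params, batch['text'])               # trainable
    z_img = z_img / jnp.linalg.norm(z_img, axis=-1, keepdims=True)
    z_txt = z_txt / jnp.linalg.norm(z_txt, axis=-1, keepdims=True)
    logits = temperature * z_img @ z_txt.T                       # (n, n)
    diag = jnp.arange(logits.shape[0])                           # matching pairs
    loss_i = -jnp.mean(jax.nn.log_softmax(logits, axis=1)[diag, diag])
    loss_t = -jnp.mean(jax.nn.log_softmax(logits, axis=0)[diag, diag])
    return 0.5 * (loss_i + loss_t)
```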
Jan 12, 2022 4 tweets 2 min read
1/3 All these methods look the same to you?
That's the point of this paper!

Simply adding the losses works just as well as any fancy multi-task method, if one tunes the baseline properly.

This matches my experience, and fits my philosophy: tune the simplest possible method -> win.

2/3 I've tried fancy multi-task methods almost every year, but they never outperformed my well-tuned "just add the losses". I never thought much of it, but this paper actually explores both theoretically and empirically why that is!
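The baseline in question is literally this (a sketch; the task names and static weights below are illustrative, not from the paper):

```python
def multitask_loss(params, batch, task_losses, weights):
    """The "just add the losses" baseline: a plain weighted sum of per-task
    losses, with the few weights tuned like any other hyperparameter.
    task_losses: dict of per-task loss functions, weights: dict of floats."""
    return sum(w * task_losses[name](params, batch[name])
               for name, w in weights.items())
```

e.g. multitask_loss(params, batch, {'cls': cls_loss, 'seg': seg_loss}, {'cls': 1.0, 'seg': 0.5}).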
Nov 18, 2021 9 tweets 4 min read
It's about time: analog clock reading in the wild arxiv.org/abs/2111.09162

A great example of an applied vision paper, let me walk you through why I like it. 🧶

They also make good use of Spatial Transformer Networks (STN), one of the most elegant ideas that usually don't work :)

1. It's an interesting problem that seems niche, but anyone can immediately relate to it.

2. Showing the architecture once more: it's pretty straightforward. The STN takes an image and predicts a transform (a homography), then warps the image with it. However, it almost never works.
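For reference, the warp half of an STN is just differentiable resampling through the predicted transform; a rough JAX sketch (the small localisation net that regresses the 8 homography parameters is omitted):

```python
import jax.numpy as jnp
from jax.scipy.ndimage import map_coordinates

def warp_homography(image, H):
    """Resample a (h, w) image through a predicted 3x3 homography H.
    For every output pixel, map its coordinates through H, dehomogenise,
    and bilinearly sample the input there (zeros outside the image)."""
    h, w = image.shape
    ys, xs = jnp.meshgrid(jnp.arange(h, dtype=jnp.float32),
                          jnp.arange(w, dtype=jnp.float32), indexing='ij')
    grid = jnp.stack([xs.ravel(), ys.ravel(), jnp.ones(h * w)])  # homogeneous coords
    src = H @ grid                                               # output -> source
    src_x, src_y = src[0] / src[2], src[1] / src[2]
    out = map_coordinates(image, [src_y, src_x], order=1, mode='constant')
    return out.reshape(h, w)
```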
Nov 12, 2021 10 tweets 5 min read
1/N The return of patch-based self-supervision! It never worked well and you had to bend over backwards with ResNets (I tried).
Now with ViT, very simple patch-based self-supervised pre-training rocks! First BEiT, now Masked Autoencoders: i1k=87.8% arxiv.org/pdf/2111.06377…

🧶

2/N The idea is super simple *and* efficient on TPU: shuffle the patches, keep only the first few (e.g. 49 out of 196), and pass them to the ViT. The short sequence makes it fast too!
Then restore the order, fill in [mask] tokens, and pass that to another, smaller ViT to reconstruct the masked patches in pixel space.
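In JAX-ish pseudocode, my reading of that shuffle-keep-restore trick (a sketch, not the official implementation; positional embeddings and the loss on masked patches are omitted):

```python
import jax
import jax.numpy as jnp

def random_mask(key, patches, n_keep):
    """patches: (n, d) patch embeddings of one image, e.g. n=196.
    Shuffle the patch indices and keep only the first n_keep (e.g. 49);
    the encoder ViT only ever sees this short sequence."""
    n = patches.shape[0]
    perm = jax.random.permutation(key, n)
    return patches[perm[:n_keep]], perm

def restore_with_masks(encoded, perm, mask_token):
    """Scatter the encoded visible tokens back to their original positions
    and fill every dropped position with a shared learnable [mask] token;
    this full-length sequence goes to the smaller decoder ViT."""
    n, n_keep = perm.shape[0], encoded.shape[0]
    full = jnp.tile(mask_token[None, :], (n, 1))
    return full.at[perm[:n_keep]].set(encoded)
```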
Jun 10, 2021 11 tweets 5 min read
So you think you know distillation; it's easy, right?

We thought so too with @XiaohuaZhai @__kolesnikov__ @_arohan_ and the amazing @royaleerieme and Larisa Markeeva.

Until we didn't. But now we do again. Hop on for a ride (+the best ever ResNet50?)

🧵👇arxiv.org/abs/2106.05237

This is not a fancy novel method. It's plain old distillation.

But we investigate it thoroughly for model compression, through the lens of *function matching*.

We highlight two crucial principles that are often missed: Consistency and Patience. Only the two together give good results!
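A minimal sketch of the Consistency half (all names here are placeholders, not the paper's code): teacher and student must see the exact same aggressively augmented crop, so the student is matched to the teacher as a *function*; Patience is simply an unusually long training schedule on top of this.

```python
import jax
import jax.numpy as jnp

def distill_loss(student_params, images, key, student, teacher, augment):
    """KL between teacher and student predictions on one shared view."""
    x = augment(key, images)                       # same crop for both models
    t_logits = jax.lax.stop_gradient(teacher(x))   # frozen teacher
    s_logits = student(student_params, x)          # trainable student
    p_t = jax.nn.softmax(t_logits)
    kl = jnp.sum(p_t * (jax.nn.log_softmax(t_logits)
                        - jax.nn.log_softmax(s_logits)), axis=-1)
    return jnp.mean(kl)
```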
Jun 9, 2021 10 tweets 5 min read
With @XiaohuaZhai @__kolesnikov__ @neilhoulsby we scale up plain old ViT on ~infinite data (3B images🤯😬)

We share our results (incl. scaling laws, ImageNet SOTA both many and few-shot) and our recipe (incl. OneWeirdTrick to significantly boost few-shot)

arxiv.org/abs/2106.04560
🧵👇

1. The scaling laws. It seems that in image classification too, Transformers follow a power law (i.e. a straight line in log-log), although it saturates at both the upper and lower end. This holds across datasets, linear eval, fine-tuning, ...
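Written out, a saturating power law of that kind looks roughly like this (my notation, not necessarily the paper's exact parameterisation):

```latex
% Error E as a function of scale C (compute, data, or model size).
% The pure power law a C^{-b} is a straight line in log-log;
% c (irreducible error) bends it at the upper end of scale,
% and the offset d bends it at the lower end.
E(C) \approx a\,(C + d)^{-b} + c
```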