Lucas Beyer (bl16)
Researcher (now: OpenAI, ex: DeepMind, Brain, RWTH Aachen), Gamer, Hacker, Belgian. Anon feedback: https://t.co/xe2XUqkKit ✗DMs → email
Feb 20
o3-mini-high figured out the issue with @SakanaAILabs CUDA kernels in 11s.
The claimed 150x speedup is a bug; the reality is 3x slower.

I literally copy-pasted their CUDA code into o3-mini-high and asked "what's wrong with this cuda code". That's it!
Proof: chatgpt.com/share/67b6f47c…

Fig1: o3-mini's answer.
Fig2: Their orig code is wrong in a subtle way. The fact they run benchmarking TWICE with wildly different results should make them stop and think.
Fig3: o3-mini's fix. Code is now correct. Benchmarking results are consistent. 3x slower.
There are three real lessons to be learned here:
1) Super-straightforward CUDA code like that has NO CHANCE of ever being faster than optimized cuBLAS kernels. If it is, something is wrong.
2) If your benchmarking results are mysterious and inconsistent, something is wrong (see the timing sketch after this list).
3) o3-mini-high is REALLY GOOD. It literally took 11sec to find the issue. It took me around 10min to make this write-up afterwards.
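To make lesson 2 concrete: here is a minimal sketch (in JAX, not their actual CUDA/PyTorch harness, and not the specific bug o3-mini found) of the classic async-dispatch pitfall that produces impossible-looking speedups. If you don't block on the result, you time the kernel launch, not the kernel.

```python
# Illustration only: on accelerators, JAX (like CUDA launches in PyTorch) dispatches
# work asynchronously, so a timer that doesn't wait for the device measures ~nothing.
import time
import jax.numpy as jnp

def matmul(a, b):
    return a @ b

a = jnp.ones((4096, 4096))
b = jnp.ones((4096, 4096))
matmul(a, b).block_until_ready()        # warm-up (triggers compilation)

t0 = time.perf_counter()
out = matmul(a, b)                      # returns immediately: async dispatch
t_naive = time.perf_counter() - t0      # absurdly small -> bogus "150x faster"

t0 = time.perf_counter()
out = matmul(a, b).block_until_ready()  # wait for the device to actually finish
t_real = time.perf_counter() - t0       # this is the number you can trust

print(f"naive: {t_naive*1e3:.3f} ms   real: {t_real*1e3:.3f} ms")
```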
Feb 4
I took a brief look at the Harmonic Loss paper

tl;dr: instead of dot-product with softmax, do Euclidean distance with normalized 1/d**n (sketch below).

I kinda want this to work. I've dabbled with preferring Euclidean distance many times throughout my career (e.g. triplet loss etc.)

However... I have to say that this MNIST weights figure looks suspicious as hell.

I've trained linear + softmax MNIST and looked at the weights often, and they never look as bad as presented here. However, their score of ~92.5% is the expected one, so that's good.
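For clarity, here's my reading of that tl;dr as code (a sketch of the idea, not the paper's reference implementation; the names and the eps are mine):

```python
import jax.numpy as jnp

def harmonic_probs(x, W, n=2.0, eps=1e-8):
    """x: (B, D) features, W: (C, D) per-class weight vectors -> (B, C) probabilities."""
    d = jnp.linalg.norm(x[:, None, :] - W[None, :, :], axis=-1)  # Euclidean distances (B, C)
    scores = 1.0 / (d + eps) ** n                                # closer -> higher score
    return scores / scores.sum(axis=-1, keepdims=True)           # normalize instead of softmax

def harmonic_loss(x, W, labels, n=2.0):
    p = harmonic_probs(x, W, n)
    return -jnp.mean(jnp.log(p[jnp.arange(x.shape[0]), labels] + 1e-8))
```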
Feb 3
Don't trust the headline. This 56M…
…DOESN'T include compute, it's everything else
…is scattered across many unis, profs, post-docs, PhDs
…who collaborate in theory, but work on their individual papers in reality, cuz that's what's needed to graduate
…is unrelated to DeepSeek

What I expect to come out of it:
- a whole bunch of papers, probably some interesting ones. A lot on multilinguality and language transfer.
- a series of benchmarks for various EU languages
- (I hope) nice small-languages datasets

I think that's good overall.
Jan 27
Just had a quick look at DeepSeek's new Janus Pro paper.

I don't think it's a big deal (yet...!), but quick TL;DR below before the hype gets out of hand.

It's as straightforward an Omni model as it gets:
- a core autoregressive decoder LLM
- a SigLIP encoder for understanding (L@384, why not So400m?)
- a VQ-VAE for generation (from LlamaGen)
Three training stages:
1. new-params only on ImageNet
2. fine-tune all on mm-mix
3. SFT mix
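To make the routing concrete, here's a toy sketch of how I read the setup; the linear maps below are stand-ins for the real SigLIP encoder, VQ-VAE codebook, LLM and heads, so treat everything (dims, names) as hypothetical:

```python
import jax
import jax.numpy as jnp

D, V_TXT, V_IMG = 512, 32000, 16384                  # hypothetical dims / vocab sizes
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
W_siglip = jax.random.normal(k1, (768, D)) * 0.02    # stand-in: SigLIP features -> LLM space
W_vq_emb = jax.random.normal(k2, (V_IMG, D)) * 0.02  # stand-in: VQ codebook ids -> LLM space
W_txt_head = jax.random.normal(k3, (D, V_TXT)) * 0.02
W_img_head = jax.random.normal(k4, (D, V_IMG)) * 0.02

def llm(seq):
    return seq                                       # stand-in for the autoregressive decoder

def understand(siglip_feats, text_emb):
    # understanding path: continuous SigLIP features, projected and fed to the LLM
    seq = jnp.concatenate([siglip_feats @ W_siglip, text_emb], axis=0)
    return llm(seq) @ W_txt_head                     # predict text tokens

def generate(text_emb, prev_img_ids):
    # generation path: the LLM autoregressively predicts discrete VQ-VAE codes
    seq = jnp.concatenate([text_emb, W_vq_emb[prev_img_ids]], axis=0)
    return llm(seq) @ W_img_head                     # logits over the VQ codebook
```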
Dec 30, 2024
Long paper thread!

I've held a skeptical opinion of DiffTransformer after skimming it, but after some recent interactions I thought that was unfair and decided to give it a proper read. Did so on a recent flight.

Paper TL;DR: pair two attention heads, and do:

(sm(Q1K1) - λ sm(Q2K2)) V

The motivation in Fig1 is very solid a priori: as context gets long, the sum of (small) attention on irrelevant tokens might be more than the attention to few individual relevant tokens, thus drowning them.

However, it is just an illustration, and I'm still not sure how much this is really a problem for well trained models. Instabilities usually occur because attention logits grow, which makes the attention look much more like the green one already. This is not often talked about, but evidence is spread across some papers like ViT22B or "small scale proxies". If max attn is ~0.5, then we're already in the green.

I would have liked some plots about attn distribution/entropy in the DiffTransformer paper to actually justify the illustration. (We'll get back to this)
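For reference, a bare-bones version of that core op as I read it (single head pair, no masking, normalization, or the paper's other details):

```python
import jax
import jax.numpy as jnp

def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """x: (T, D); the two (Q, K) projections form one differential head pair."""
    q1, k1 = x @ Wq1, x @ Wk1
    q2, k2 = x @ Wq2, x @ Wk2
    v = x @ Wv
    scale = jnp.sqrt(q1.shape[-1])
    a1 = jax.nn.softmax(q1 @ k1.T / scale, axis=-1)
    a2 = jax.nn.softmax(q2 @ k2.T / scale, axis=-1)
    return (a1 - lam * a2) @ v    # the subtraction cancels common-mode attention "noise"
```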
Apr 5, 2024
A bit late, but I just read ReFT, here's a quick thread.

- A PEFT method
- acts on activations `h` -> small inference overhead

1/5
Learns a (R,W,b) per layer and _per position_ in the (prompt) sequence.

However, they hparam search which subset of layers and positions to apply it to.

Even more, they suggest to (sometimes) tie the (R,W,b) parameters from a layer across positions.

2/5
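From memory (so double-check against the paper), the LoReFT edit on a hidden state h is h + Rᵀ(Wh + b − Rh) with a low-rank R; a sketch under that assumption, applied only at the selected positions:

```python
import jax.numpy as jnp

def loreft_edit(h, R, W, b):
    """h: (D,) activation at one layer & position; R, W: (r, D) with r << D; b: (r,)."""
    return h + R.T @ (W @ h + b - R @ h)

def apply_reft(H, positions, R, W, b):
    """H: (T, D) hidden states of one layer; edit only the chosen (prompt) positions."""
    edited = jnp.stack([loreft_edit(H[p], R, W, b) for p in positions])
    return H.at[jnp.array(positions)].set(edited)
```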
Nov 10, 2023
🧶You may know me as SigLIP evangelist.

But don't forget I also co-created Cap(Pa), which I'm bullish on.

CapPa nailed the ARO benchmark where contrastive models struggle. We have new results showing it also nails the newer, harder SugarCrepe benchmark.



My original motivation for captioning pretraining is that there are things contrastive pretraining will fundamentally not learn.

Think "cat sitting left of dog", it only needs to "detect cat" if there's no other cat in the minibatch.

This is the essence of CLIPs binding problem

Oct 22, 2023
Here's what our (sub)team in Zürich has done for OSS vision over the past 5y, besides inventing ViT:

1) Make i21k a thing
Release:
2) best CLIP (siglip) by a large margin
3) best i1k ResNet50 ever
4) best pre-trained ResNets
5) >55k ViTs
6) Most efficient JAX/TPU CV code
deets👇

1) i21k was completely overlooked by everyone before our BigTransfer (BiT) paper. When I dug it up, there was only a single blog post on the web reporting training on it, and it reported bad results.

It's now widely used for classification pre-training, where it works better than i1k.
Sep 28, 2023
Pleased to announce we are releasing checkpoints for our SigLIP models!

These are very strong image-text ViTs. We release them along with a colab to play around with. Most are English, but we also release a good i18n one.

Sorry, no magnet link mic drop. More in thread🧶

The colab with checkpoints and code examples is in our big_vision JAX codebase:

Here's a table comparing to public models of the same size. The performance jump is significant, and we REMOVED near-duplicates of the benchmarks from our training data. github.com/google-researc…
Aug 18, 2023
What makes CLIP work?
The contrast with negatives via softmax?
The more negatives, the better -> large batch-size?

We'll answer "no" to both in our ICCV oral🤓
By introducing SigLIP, a simpler CLIP that also works better and is more scalable, we can study the extremes.

Hop in🧶

Perhaps surprisingly, we can replace the SoftMax-xent by a Sigmoid-xent loss in CLIP training and things just work.

With one little detail: add a learnable bias, much like the temperature.

This is conceptually simpler and cleaner: do image I and text T match, yes or no? (minimal sketch below)
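In code, the core of it looks roughly like this (my sketch, modulo the exact normalization used in the paper):

```python
import jax
import jax.numpy as jnp

def siglip_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (B, D), L2-normalized. t (temperature) and b (bias) are learnable scalars."""
    logits = t * img_emb @ txt_emb.T + b             # (B, B) pairwise match logits
    labels = 2.0 * jnp.eye(logits.shape[0]) - 1.0    # +1 on the diagonal (matches), -1 elsewhere
    return -jnp.mean(jax.nn.log_sigmoid(labels * logits))  # every pair is a binary yes/no
```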
Jun 16, 2023
Who killed non-contrastive image-text pretraining? @AlecRad and @_jongwook_kim with the below Fig2 in CLIP.

Who collected the 7 Dragonballs and asked Shenron to resurrect it? Yours truly, in this new paper of ours.

Generative captioning is not only competitive, it seems better!

Some results first: Looking at a wide mix of tasks, an image encoder pre-trained on image/alt-text pairs via captioning (Cap/CapPa) almost matches a contrastive one (CLIP) on classification tasks, and largely outperforms it on image-text tasks.
Mar 3, 2023
That's a super interesting claim!

The main reason highlighted is minibatch gradient variance (see screenshot).

Unsolicited review (sorry @liuzhuang1234!): This immediately asks for experiments that can validate or nullify the hypothesis, none of which I found in the paper:

1/3

1. As minibatch size grows/shrinks, the effect should vanish/increase.
2. Similar but different: plot the size of the effect on the Y axis, and minibatch size as a percentage of trainset size on the X axis.

Even worse: the batch size is only mentioned in the appendix!

2/3
Feb 17, 2023
Beyond classification in vision, it always feels weird to optimize for a loss which doesn't _really_ match how we'll use the model later on*, but happens to be differentiable.

In our latest work, we tackle this discrepancy🧶

*unless the model is 100% perfect, which it never is.

- In pix2seq, you don't _really_ care about perplexity of the detection string
- In FasterRCNN, DETR & co, you don't _really_ care about box-losses and class-losses

You care about driving safely, counting moles or parking spots, the robot picking the box, ...
Dec 29, 2022
How good of a BERT can one get in ONE DAY on ONE GPU?

With all the recent studies about scaling compute up, this paper takes a refreshing turn and does a deep dive into scaling down compute.

It's well written, chock-full of insights. Here is my summary and my opinions.

🧶 1/N

2/N First, the setting. See screenshot for full info, but in short:
- 24h of training on a single good GPU (2080ti or a4000)
- Transformer architecture, modifications are OK
- MLM training from scratch
- NO use of pre-trained anything in any way (except tokenizer)
- Any dataset
Mar 28, 2022
Want to turn any vision backbone into an image-text model? Want to show the age-old "your model wouldn't recognize a cow on the beach" is a red herring?

That's LiT🔥 (Locked-image Tuning), a new alternative to fine-tuning that combines the best of fine-tuning and zero-shot
1/n🧶

As usual, our method is really simple: take a pre-trained (sup, selfsup, whatever) image backbone, freeze it, and attach a text encoder to it. On any image-text dataset, train the text encoder to predict the corresponding image's embedding, CLIP-style.
2/n
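A rough sketch of that recipe (not our actual training code; `text_encoder` is a placeholder for whatever text tower you attach):

```python
import jax
import jax.numpy as jnp

def lit_loss(params_txt, img_emb, texts, text_encoder, t=100.0):
    """img_emb: (B, D) embeddings from the locked, pre-trained image tower (L2-normalized)."""
    z_img = jax.lax.stop_gradient(img_emb)            # locked image tower: no gradients
    z_txt = text_encoder(params_txt, texts)           # (B, D): the only part that trains
    z_txt = z_txt / jnp.linalg.norm(z_txt, axis=-1, keepdims=True)
    logits = t * z_img @ z_txt.T                      # CLIP-style pairwise similarities
    idx = jnp.arange(logits.shape[0])
    loss_i = -jnp.mean(jax.nn.log_softmax(logits, axis=1)[idx, idx])  # image -> text
    loss_t = -jnp.mean(jax.nn.log_softmax(logits, axis=0)[idx, idx])  # text -> image
    return 0.5 * (loss_i + loss_t)
```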
Jan 12, 2022
1/3 All these methods look the same to you?
That's the point of this paper!

Simply adding losses works equally well as any fancy multi-task method, if one tunes the baseline properly.

This matches my experience, and fits my philosophy: tune the simplest possible method -> win.

2/3 I've tried fancy multi-task methods almost every year, but they never outperformed my well-tuned "just add the losses". I never thought much of it, but this paper actually explores both theoretically and empirically why that is!
Nov 18, 2021
It's about time: analog clock reading in the wild arxiv.org/abs/2111.09162

A great example of an applied vision paper, let me walk you through why I like it. 🧶

They also make good use of Spatial Transformer Networks (STN), one of the most elegant ideas that usually don't work :)

1. It's an interesting problem that seems niche, but anyone can immediately relate to it.

2. Showing the architecture once more, it's pretty straightforward. STN takes an image and predicts a transform (homography), then warps the image with it. However, it almost never works.
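For the curious, a bare-bones version of that warp (my illustration, not the paper's model; `predict_homography` is a hypothetical head that outputs the 8 homography parameters):

```python
import jax.numpy as jnp
from jax.scipy.ndimage import map_coordinates

def warp_homography(img, H_mat):
    """img: (H, W) grayscale; H_mat: (3, 3) homography mapping output pixels to input pixels."""
    H, W = img.shape
    ys, xs = jnp.meshgrid(jnp.arange(H), jnp.arange(W), indexing="ij")
    coords = jnp.stack([xs.ravel(), ys.ravel(), jnp.ones(H * W)], axis=0).astype(jnp.float32)
    src = H_mat @ coords                      # where each output pixel comes from
    src_x, src_y = src[0] / src[2], src[1] / src[2]
    out = map_coordinates(img, [src_y, src_x], order=1, mode="constant", cval=0.0)  # bilinear
    return out.reshape(H, W)

def stn(img, params, predict_homography):
    h8 = predict_homography(params, img)      # (8,) predicted parameters
    H_mat = jnp.append(h8, 1.0).reshape(3, 3) # fix the last entry to 1
    return warp_homography(img, H_mat)
```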
Nov 12, 2021
1/N The return of patch-based self-supervision! It never worked well and you had to bend over backwards with ResNets (I tried).
Now with ViT, very simple patch-based self-supervised pre-training rocks! First BEiT, now Masked AutoEncoders: i1k=87.8% arxiv.org/pdf/2111.06377…

🧶

2/N The idea is super simple *and* efficient on TPU: shuffle patches, keep the first N (of 196 patches, keep 49) and pass them to ViT. Short sequence makes it fast too!
Then, restore order and fill in [mask] tokens, pass that to another, smaller ViT to reconstruct the masks in pixel space.
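A sketch of that shuffling trick (my paraphrase; `encode`, `decode`, and `mask_token` are placeholders for the big ViT encoder, the small decoder, and the learned [mask] embedding):

```python
import jax
import jax.numpy as jnp

def mae_forward(patches, mask_token, keep, key, encode, decode):
    """patches: (N, D); keep: number of visible patches (e.g. 49 of 196); encoder keeps width D."""
    N, _ = patches.shape
    perm = jax.random.permutation(key, N)       # shuffle patch positions
    z = encode(patches[perm[:keep]])            # encode ONLY the short visible sequence -> fast
    full = jnp.tile(mask_token, (N, 1))         # every position starts as a [mask] token
    full = full.at[perm[:keep]].set(z)          # restore order: put encoded tokens back in place
    return decode(full)                         # small decoder reconstructs pixels at masked spots
```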
Jun 10, 2021
So you think you know distillation; it's easy, right?

We thought so too with @XiaohuaZhai @__kolesnikov__ @_arohan_ and the amazing @royaleerieme and Larisa Markeeva.

Until we didn't. But now we do again. Hop on for a ride (+the best ever ResNet50?)

🧵👇arxiv.org/abs/2106.05237

This is not a fancy novel method. It's plain old distillation.

But we investigate it thoroughly, for model compression, via the lens of *function matching*.

We highlight two crucial principles that are often missed: Consistency and Patience. Only both jointly give good results!
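The loss itself really is just the plain old one; in sketch form (consistency = teacher and student see the exact same aggressively-augmented crop, patience = train for a very long schedule):

```python
import jax
import jax.numpy as jnp

def distill_loss(student_logits, teacher_logits, T=1.0):
    """KL(teacher || student), computed on the SAME augmented view for both models."""
    p_t = jax.nn.softmax(teacher_logits / T, axis=-1)
    log_p_t = jax.nn.log_softmax(teacher_logits / T, axis=-1)
    log_p_s = jax.nn.log_softmax(student_logits / T, axis=-1)
    return jnp.mean(jnp.sum(p_t * (log_p_t - log_p_s), axis=-1))
```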
Jun 9, 2021
With @XiaohuaZhai @__kolesnikov__ @neilhoulsby we scale up plain old ViT on ~infinite data (3B🤯😬)

We share our results (incl. scaling laws, ImageNet SOTA both many and few-shot) and our recipe (incl. OneWeirdTrick to significantly boost few-shot)

arxiv.org/abs/2106.04560
🧵👇

1. The scaling laws. It seems that in image classification too, Transformers follow a power law (e.g. a straight line in log-log), although it saturates at both the upper and lower end. This holds across datasets, linear eval, fine-tuning, ...
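For intuition, a common way to write such a saturating power law (illustrative form; the paper's exact parametrization may differ):

```latex
% error vs compute: a straight line in log-log over the mid range,
% bending at small C (the shift d) and flattening at an irreducible error E_inf.
E(C) \approx a\,(C + d)^{-k} + E_{\infty}
```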