There are three real lessons to be learned here:
https://twitter.com/dbaek__/status/1886781418115862544
I have to say that this MNIST weights figure looks suspicious as hell.
https://twitter.com/nathanbenaich/status/1886414128878674358
What I expect to come out of it:
It's as straightforward an Omni model as it gets:
The motivation in Fig 1 is very solid a priori: as context gets long, the sum of the (small) attention on irrelevant tokens can exceed the attention on the few individual relevant tokens, thus drowning them out.
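To make the dilution argument concrete, here's a tiny numeric sketch (my own toy logits, not numbers from the paper): a single relevant token with a higher score slowly gets drowned as the number of irrelevant tokens grows.

```python
# Toy illustration of attention dilution (hypothetical logits, not from the paper).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

relevant_logit = 3.0      # the single relevant token scores higher...
irrelevant_logit = 0.0    # ...than each individual irrelevant token

for n_irrelevant in [16, 256, 4096]:
    scores = np.concatenate(([relevant_logit], np.full(n_irrelevant, irrelevant_logit)))
    attn = softmax(scores)
    print(f"{n_irrelevant:5d} irrelevant tokens -> relevant gets {attn[0]:.3f}, "
          f"irrelevant get {attn[1:].sum():.3f} combined")
```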
https://twitter.com/arankomatsuzaki/status/1776057023697731913
Learns a (R,W,b) per layer and _per position_ in the (prompt) sequence.
https://x.com/giffmana/status/1669840989853196292?s=20

My original motivation for captioning pretraining is that there are things contrastive pretraining will fundamentally not learn.

Perhaps surprisingly, we can replace the SoftMax-xent by a Sigmoid-xent loss in CLIP training and things just work.
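Roughly, in code (my paraphrase of the idea, not the actual implementation; the pairwise sigmoid formulation and the temperature/bias values here are just for illustration):

```python
# Softmax-xent vs sigmoid-xent on the image-text similarity matrix (sketch only).
import numpy as np

def softmax_xent(logits):
    # CLIP-style: each image has to pick out its own text from the batch, and vice versa.
    n = logits.shape[0]
    log_p_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -(np.trace(log_p_i2t) + np.trace(log_p_t2i)) / (2 * n)

def sigmoid_xent(logits):
    # Every (image, text) pair becomes an independent binary problem:
    # label +1 on the diagonal (true pair), -1 everywhere else.
    n = logits.shape[0]
    labels = 2 * np.eye(n) - 1
    return np.log1p(np.exp(-labels * logits)).sum() / n

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 64)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(8, 64)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
t, b = 10.0, -10.0                      # temperature and bias; learnable in practice
logits = t * img @ txt.T + b
print(softmax_xent(logits), sigmoid_xent(logits))
```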


Some results first: Looking at a wide mix of tasks, an image encoder pre-trained on image/alt-text pairs via captioning (Cap/CapPa) almost matches a contrastive one (CLIP) on classification tasks, and largely outperforms it on image-text tasks. 


https://twitter.com/arankomatsuzaki/status/1631469683055403008
1. As minibatch size grows/shrinks, the effect should vanish/increase.
https://twitter.com/__kolesnikov__/status/1626546150579879936
- In pix2seq, you don't _really_ care about perplexity of the detection string



2/N First, the setting. See screenshot for full info, but in short:
As usual, our method is really simple: take a pre-trained (sup, selfsup, whatever) image backbone, freeze it, and attach a text encoder to it. On any image-text dataset, train the text encoder to predict the corresponding image's embedding, CLIP-style.
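A minimal sketch of one training step under that recipe, assuming any pre-trained image backbone and a trainable text encoder in torch (module names and the fixed temperature are placeholders, not the paper's code):

```python
import torch
import torch.nn.functional as F

def train_step(image_backbone, text_encoder, optimizer, images, texts):
    # Frozen image tower: no gradients ever flow into it.
    with torch.no_grad():
        img_emb = F.normalize(image_backbone(images), dim=-1)

    txt_emb = F.normalize(text_encoder(texts), dim=-1)

    # CLIP-style contrastive objective: each text should match its own image
    # within the batch (fixed temperature here; learnable in real setups).
    logits = 100.0 * txt_emb @ img_emb.T
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss = F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```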
https://twitter.com/y0b1byte/status/1481283351281553417
2/3 I've tried fancy multi-task methods almost every year, but they never outperformed my well-tuned "just add the losses". I never thought much of it, but this paper actually explores both theoretically and empirically why that is!
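For reference, the baseline being defended here is literally this (task names and loss functions are hypothetical):

```python
# "Just add the losses": equal-weight sum over tasks, no fancy balancing.
def multitask_loss(preds, targets, loss_fns):
    # loss_fns: dict mapping task name -> loss function, e.g. {"cls": xent, "depth": l1}
    return sum(fn(preds[task], targets[task]) for task, fn in loss_fns.items())
```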
1. It's an interesting problem that seems niche, but anyone can immediately relate to it.
2/N The idea is super simple *and* efficient on TPU: shuffle the patches, keep only the first N (e.g. out of 196 patches, keep 49), and pass them to ViT. The short sequence makes it fast too!
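In pseudocode it's roughly this (my sketch of the recipe, not the official implementation; in practice positional embeddings are added before dropping, so the model knows which patches it kept):

```python
import numpy as np

def drop_patches(patch_tokens, keep=49, rng=None):
    """patch_tokens: (num_patches, dim), e.g. (196, 768) for a 224px image with 16px patches."""
    rng = rng or np.random.default_rng()
    order = rng.permutation(patch_tokens.shape[0])  # shuffle the patch tokens
    return patch_tokens[order[:keep]]               # keep only the first `keep` of them

tokens = np.random.normal(size=(196, 768))
short_seq = drop_patches(tokens, keep=49)           # 4x shorter sequence for the ViT
print(short_seq.shape)                              # (49, 768)
```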
This is not a fancy novel method. It's plain old distillation.
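Concretely, the standard recipe (a generic sketch, not code from this work): match the student's softened predictions to the teacher's with a KL term at temperature T.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soften both distributions with temperature T, then push the student toward the teacher.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T ** 2
```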
1. The scaling laws. It seems that in image classification too, Transformers follow a power law (e.g. a straight line in log-log), although it saturates at both the upper and lower ends. This holds across datasets, linear eval, fine-tuning, ...
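As a reminder of what "straight line in log-log" means in practice, here's a toy fit on synthetic data (the numbers are made up; only the fitting recipe is the point):

```python
import numpy as np

# A power law err = a * compute**b is a straight line in log-log space,
# so a linear fit of log(err) against log(compute) recovers the exponent b.
compute = np.logspace(0, 4, 20)            # hypothetical compute budgets
err = 0.5 * compute ** -0.25               # hypothetical power-law error curve
slope, intercept = np.polyfit(np.log(compute), np.log(err), deg=1)
print(f"fitted exponent: {slope:.3f}")     # ~ -0.25 on this synthetic data
```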