Lucas Beyer (bl16)
Jun 9, 2021
With @XiaohuaZhai @__kolesnikov__ @neilhoulsby we scale up plain old ViT on ~infinite data (3B🤯😬)

We share our results (incl. scaling laws, ImageNet SOTA both many and few-shot) and our recipe (incl. OneWeirdTrick to significantly boost few-shot)

arxiv.org/abs/2106.04560
🧵👇
1. The scaling laws. It seems that in image classification too, Transformers follow a power law (i.e. a straight line in log-log), although it saturates at both the upper and lower ends. This holds across datasets, linear eval, fine-tuning, ...
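For intuition, such curves are often fit with a power law plus a constant floor. A minimal sketch (made-up constants, not the paper's fit):

```python
import jax.numpy as jnp

def saturating_power_law(compute, a=0.5, b=0.35, c=0.05):
    # Straight line in log-log for mid-range compute; the additive floor c models
    # saturation at the upper end (a second term would be needed for the lower end).
    # All constants here are illustrative, not fit to the paper's data.
    return a * compute ** (-b) + c
```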
2. Larger ViTs are more sample-efficient. L/16 reaches the same accuracy as Ti/16 with about 100x fewer images seen!
3. Results: These large pre-trained ViTs are pretty amazing at few-shot learning with just a linear classifier on top of the frozen model. Almost 70% with 1 image per class on ImageNet, and 83% with 5 images per class, i.e. 0.5% of the dataset!
Also, new SOTA when fine-tuned: 90.45%.
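As a concrete illustration of this kind of frozen-feature linear probe (a common recipe for few-shot evals; the paper's exact probe may differ), a minimal JAX sketch using closed-form ridge regression onto one-hot labels:

```python
import jax
import jax.numpy as jnp

def fit_linear_probe(feats, labels, num_classes, l2=1e-4):
    # feats: [n, d] frozen ViT features; labels: [n] integer class ids.
    targets = jax.nn.one_hot(labels, num_classes)          # [n, c]
    d = feats.shape[1]
    gram = feats.T @ feats + l2 * jnp.eye(d)               # [d, d] ridge-regularized Gram matrix
    return jnp.linalg.solve(gram, feats.T @ targets)       # probe weights: [d, c]

def probe_accuracy(w, feats, labels):
    return jnp.mean(jnp.argmax(feats @ w, axis=-1) == labels)
```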
4. OneWeirdTrick to improve linear few-shot dramatically: we switch to a GAP (or MAP) "head" and use much stronger weight decay on the classifier "head" than on ViT's "body". It looks *worse* upstream, but is *a lot better* in few-shot!

We hypothesize this increases the margin, à la SVM.
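A minimal optax sketch of the head/body weight-decay split (parameter names and decay values are illustrative, not the paper's):

```python
import jax.numpy as jnp
import optax

# Toy param tree: "body" stands in for the ViT encoder, "head" for the linear classifier.
params = {
    "body": {"kernel": jnp.ones((8, 8))},
    "head": {"kernel": jnp.ones((8, 10))},
}
head_mask = {"body": {"kernel": False}, "head": {"kernel": True}}
body_mask = {"body": {"kernel": True}, "head": {"kernel": False}}

tx = optax.chain(
    optax.add_decayed_weights(3.0, mask=head_mask),    # strong weight decay on the head
    optax.add_decayed_weights(0.03, mask=body_mask),   # mild weight decay on the body
    optax.adam(1e-3),
)
opt_state = tx.init(params)
```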
5. Learning-rate: Train 1 get M free!

We opt for a learning-rate schedule of warmup-rsqrt-cooldown. This lets us train "infinitely" and add cooldowns post-hoc, simulating many runs with just one.

In experiments (not shown), this was much better than "warm restarts" schedules.
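A minimal sketch of such a warmup-rsqrt-cooldown schedule (all constants are illustrative); running cooldowns from several different `cooldown_start` points over one long run is what gives the "train 1 get M free" effect:

```python
import jax.numpy as jnp

def warmup_rsqrt_cooldown(step, base_lr=1e-3, warmup=10_000,
                          cooldown_start=200_000, cooldown=50_000):
    warm = jnp.minimum(1.0, step / warmup)                    # linear warmup
    rsqrt = jnp.sqrt(warmup / jnp.maximum(step, warmup))      # ~1/sqrt(step) after warmup
    cool = jnp.clip(1.0 - (step - cooldown_start) / cooldown, 0.0, 1.0)  # linear cooldown to 0
    return base_lr * warm * rsqrt * cool
```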
6. Because of complex XLA optimization, one can't say upfront what will fit at the memory limit. We use an empirical "shapefinder" approach and scale "diagonally".

We investigate "novel" optimizer variants with half-precision states to reduce their memory use A LOT without loss of accuracy.
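The exact optimizer modification isn't spelled out in the tweet; as one example of the idea, optax's Adafactor can keep its first-moment (momentum) state in bfloat16, which halves that part of the optimizer memory:

```python
import jax.numpy as jnp
import optax

# The momentum buffer is as large as the model itself, so storing it in
# half precision saves a lot of memory. Values here are illustrative.
tx = optax.adafactor(
    learning_rate=1e-3,           # in practice, plug in a schedule like the one above
    momentum=0.9,
    dtype_momentum=jnp.bfloat16,  # half-precision optimizer state
)
```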
This was a fun exploration.

Besides my co-authors, I'd like to give a special shout-out to @jekbradbury who selflessly helped us get stuff running on huge TPU machines!
Also, @rikelhood @joapuipe @_basilM and Alexey for hanging in there with us.
@PreetumNakkiran you should like this. I remember you specifically asked for it a while ago, and I wanted to answer "working on it, be patient" but obviously couldn't. So there you go!
Oh and one more thing, even though I'm not Steve Jobs: JAX + TPU VM made this all a breeze, implementation-wise. I can highly recommend that combo for research!

More from @giffmana

Feb 20
o3-mini-high figured out the issue with @SakanaAILabs CUDA kernels in 11s.
Its being 150x faster is a bug; the reality is 3x slower.

I literally copy-pasted their CUDA code into o3-mini-high and asked "what's wrong with this cuda code". That's it!
Proof: chatgpt.com/share/67b6f47c…

Fig1: o3-mini's answer.
Fig2: Their original code is wrong in a subtle way. The fact that they run benchmarking TWICE with wildly different results should make them stop and think.
Fig3: o3-mini's fix. Code is now correct. Benchmarking results are consistent. 3x slower.
There are three real lessons to be learned here:
1) Super-straightforward CUDA code like that has NO CHANCE of ever being faster than optimized cublas kernels. If it is, something is wrong.
2) If your benchmarking results are mysterious and inconsistent, something is wrong.
3) o3-mini-high is REALLY GOOD. It literally took 11sec to find the issue. It took me around 10min to make this write-up afterwards.
My fork of the author's colab, with the fix: colab.research.google.com/drive/1CS1g0Of…

PS: I wouldn't have found the bug myself, because it's been a literal decade since I last wrote CUDA kernel-launch code.
Feb 4
I took a brief look at the Harmonic Loss paper

tl;dr: instead of a dot product followed by softmax, use Euclidean distance with a normalized 1/d**n.

I kinda want this to work. I've dabbled with preferring Euclidean distance many times throughout my career (e.g. triplet loss etc.)

However...
I have to say that this MNIST weights figure looks suspicious as hell.

I've trained linear + softmax MNIST and looked at the weights often, and they never look as bad as presented here. However, their score of ~92.5% is the expected one, so that's good.
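For reference, my reading of the proposed output layer, as a rough sketch from the tl;dr above (not the authors' code; names are mine):

```python
import jax.numpy as jnp

def harmonic_probs(x, centers, n=2.0, eps=1e-9):
    # x: [d] input features; centers: [num_classes, d] class weight vectors / prototypes.
    dist = jnp.linalg.norm(centers - x, axis=-1)     # Euclidean distance to each class center
    unnorm = 1.0 / (dist ** n + eps)                 # harmonic-style weighting, 1/d**n
    return unnorm / unnorm.sum()                     # normalize to probabilities
```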
I trained a plain MNIST linear model, not tuning much (no wd!), and it looks like one of the pics below.

Add a small wd and it looks like another of the pics below.

I trained another one with their Harmonic stuff, closely following their code and the hparams in the code, and it looks like the other pic below.

Guess which is which now.
Feb 3
Don't trust the headline. This 56M…
…DOESN'T include compute, it's everything else
…is scattered across many unis, profs, post-docs, PhDs
…who collaborate in theory, but work on their individual papers in reality, because that's what's needed to graduate
…is unrelated to DeepSeek
What I expect to come out of it:
- a whole bunch of papers, probably some interesting ones. A lot on multilinguality and language transfer.
- a series of benchmarks for various EU languages
- (I hope) nice small-languages datasets

I think that's good overall.
What I do not expect to come out of it:
- an open-source base model at the frontier
Jan 27
Just had a quick look at DeepSeek's new Janus Pro paper.

I don't think it's a big deal (yet...!), but quick TL;DR below before the hype gets out of hand.
It's as straightforward an Omni model as it gets:
- a core autoregressive decoder LLM
- a SigLIP encoder for understanding (L@384; why not So400m?)
- a VQ-VAE for generation (from LlamaGen)
Three training stages:
1. new-params only on ImageNet
2. fine-tune all on mm-mix
3. SFT mix
The main changes from Janus to -Pro are basically all data-related:
1. "Purify" stage2 by moving ImageNet to Stage1, make that longer.
2. Update data for understanding by taking from DeepSeek-VL2
3. Update data for generation by throwing in MidJourney data Image
Dec 30, 2024
Long paper thread!

I've held a skeptical opinion of DiffTransformer after skimming it, but after some recent interactions I thought that was unfair and decided to give it a proper read. Did so on a recent flight.

Paper TL;DR: pair two attention heads, and do:

(sm(Q1K1) - λ sm(Q2K2)) V
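In JAX-ish pseudocode, my reading of that TL;DR (a single head pair, ignoring the paper's grouping and normalization details):

```python
import jax
import jax.numpy as jnp

def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    # q*, k*, v: [seq, d_head]; lam is the learned scalar λ shared by the head pair.
    d = q1.shape[-1]
    a1 = jax.nn.softmax(q1 @ k1.T / jnp.sqrt(d), axis=-1)
    a2 = jax.nn.softmax(q2 @ k2.T / jnp.sqrt(d), axis=-1)
    return (a1 - lam * a2) @ v   # note: the difference is not re-normalized
```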
The motivation in Fig1 is very solid a priori: as context gets long, the sum of (small) attention on irrelevant tokens might be more than the attention to few individual relevant tokens, thus drowning them.

However, it is just an illustration, and I'm still not sure how much this is really a problem for well-trained models. Instabilities usually occur because attention logits grow, which already makes the attention look much more like the green one. This is not often talked about, but evidence is spread across some papers like ViT-22B or "small-scale proxies". If max attention is ~0.5, then we're already in the green.

I would have liked some plots about attention distribution/entropy in the DiffTransformer paper to actually justify the illustration. (We'll get back to this.)
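The kind of diagnostic meant here is cheap to compute; a rough sketch (not from the paper):

```python
import jax
import jax.numpy as jnp

def attention_stats(logits):
    # logits: [heads, q_len, k_len] pre-softmax attention scores.
    probs = jax.nn.softmax(logits, axis=-1)
    max_attn = probs.max(axis=-1)                               # how peaked each query's attention is
    entropy = -(probs * jnp.log(probs + 1e-9)).sum(axis=-1)     # spread of the attention mass
    return max_attn, entropy
```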
Next, the core idea is very simple and nice, but I notice a few details making it immediately less "neat" somehow?

The DiffAttn actually does _not_ re-normalize the diff, unlike what happened in the Fig1 motivating illustration. This confuses me a lot: how does this then lead to amplifying the "relevant" scores? The second head clearly has a different job now; assuming lambda is positive, it must learn what to suppress from the first one?

I would have loved some plots and experiments looking at what it really looks like in trained DiffTransformers, instead of just the mismatched Fig1 illustration.
Apr 5, 2024
A bit late, but I just read ReFT, here's a quick thread.

- A PEFT method
- acts on activations `h` -> small inference overhead

1/5
Learns a (R,W,b) per layer and _per position_ in the (prompt) sequence.

However, they hparam search which subset of layers and positions to apply it to.

Even more, they suggest to (sometimes) tie the (R,W,b) parameters from a layer across positions.

2/5
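From my reading of the paper, the per-position LoReFT edit looks roughly like the sketch below (shapes and the exact formula should be double-checked against the paper):

```python
import jax.numpy as jnp

def loreft_intervention(h, R, W, b):
    # h: [d] hidden activation at one layer/position.
    # R: [r, d] low-rank projection (orthonormal rows in the paper); W: [r, d]; b: [r].
    return h + R.T @ (W @ h + b - R @ h)
```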
It seems best to apply it to all layers, but only a few positions. For decoder models, tying across positions helps, but not for encoders. Rank can be lower for smaller models.

3/5

