Lucas Beyer (bl16)
Jun 9, 2021
With @XiaohuaZhai @__kolesnikov__ @neilhoulsby we scale up plain old ViT on ~infinite data (3B🤯😬)

We share our results (incl. scaling laws, ImageNet SOTA both many and few-shot) and our recipe (incl. OneWeirdTrick to significantly boost few-shot)

arxiv.org/abs/2106.04560
🧵👇
1. The scaling laws. It seems that in image classification too, Transformers follow a power law (i.e. a straight line in log-log), although it saturates at both the upper and lower ends. This holds across datasets, linear eval, fine-tuning, ...
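For intuition, such curves are often fit with a power law plus a constant floor. A minimal sketch (made-up constants, not the paper's fit):

```python
import jax.numpy as jnp

def saturating_power_law(compute, a=0.5, b=0.35, c=0.05):
    # Straight line in log-log for mid-range compute; the additive floor c models
    # saturation at the upper end (a second term would be needed for the lower end).
    # All constants here are illustrative, not fit to the paper's data.
    return a * compute ** (-b) + c
```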
2. Larger ViTs are more sample-efficient. L/16 reaches the same accuracy as Ti/16 with about 100x fewer images seen!
3. Results: These large pre-trained ViTs are pretty amazing at few-shot learning with just a linear classifier on top of the frozen model. Almost 70% with 1 image per class on ImageNet, and 83% with 5 images per class, i.e. 0.5% of the dataset!
Also, new SOTA when fine-tuned: 90.45%.
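As a concrete illustration of this kind of frozen-feature linear probe (a common recipe for few-shot evals; the paper's exact probe may differ), a minimal JAX sketch using closed-form ridge regression onto one-hot labels:

```python
import jax
import jax.numpy as jnp

def fit_linear_probe(feats, labels, num_classes, l2=1e-4):
    # feats: [n, d] frozen ViT features; labels: [n] integer class ids.
    targets = jax.nn.one_hot(labels, num_classes)          # [n, c]
    d = feats.shape[1]
    gram = feats.T @ feats + l2 * jnp.eye(d)               # [d, d] ridge-regularized Gram matrix
    return jnp.linalg.solve(gram, feats.T @ targets)       # probe weights: [d, c]

def probe_accuracy(w, feats, labels):
    return jnp.mean(jnp.argmax(feats @ w, axis=-1) == labels)
```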
4. OneWeirdTrick to improve linear few-shot dramatically: we switch to a GAP (or MAP) "head" and use much stronger weight decay on the classifier "head" than on ViT's "body". It looks *worse* upstream, but is *a lot better* in few-shot!

We hypothesize this increases the margin, à la SVM.
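A minimal optax sketch of the head/body weight-decay split (parameter names and decay values are illustrative, not the paper's):

```python
import jax.numpy as jnp
import optax

# Toy param tree: "body" stands in for the ViT encoder, "head" for the linear classifier.
params = {
    "body": {"kernel": jnp.ones((8, 8))},
    "head": {"kernel": jnp.ones((8, 10))},
}
head_mask = {"body": {"kernel": False}, "head": {"kernel": True}}
body_mask = {"body": {"kernel": True}, "head": {"kernel": False}}

tx = optax.chain(
    optax.add_decayed_weights(3.0, mask=head_mask),    # strong weight decay on the head
    optax.add_decayed_weights(0.03, mask=body_mask),   # mild weight decay on the body
    optax.adam(1e-3),
)
opt_state = tx.init(params)
```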
5. Learning-rate: Train 1 get M free!

We opt for a learning-rate schedule of warmup-rsqrt-cooldown. This lets us train "infinitely" and add cooldowns post-hoc, simulating many runs with just one.

In experiments (not shown), this was much better than "warm restarts" schedules.
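A minimal sketch of such a warmup-rsqrt-cooldown schedule (all constants are illustrative); running cooldowns from several different `cooldown_start` points over one long run is what gives the "train 1 get M free" effect:

```python
import jax.numpy as jnp

def warmup_rsqrt_cooldown(step, base_lr=1e-3, warmup=10_000,
                          cooldown_start=200_000, cooldown=50_000):
    warm = jnp.minimum(1.0, step / warmup)                    # linear warmup
    rsqrt = jnp.sqrt(warmup / jnp.maximum(step, warmup))      # ~1/sqrt(step) after warmup
    cool = jnp.clip(1.0 - (step - cooldown_start) / cooldown, 0.0, 1.0)  # linear cooldown to 0
    return base_lr * warm * rsqrt * cool
```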
6. Because of complex XLA optimization, one can't say upfront what will fit at the memory limit. We use an empirical "shapefinder" approach and scale "diagonally".

We investigate "novel" optimizer variants with half-precision states to reduce their memory use A LOT without loss of accuracy.
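The exact optimizer modification isn't spelled out in the tweet; as one example of the idea, optax's Adafactor can keep its first-moment (momentum) state in bfloat16, which halves that part of the optimizer memory:

```python
import jax.numpy as jnp
import optax

# The momentum buffer is as large as the model itself, so storing it in
# half precision saves a lot of memory. Values here are illustrative.
tx = optax.adafactor(
    learning_rate=1e-3,           # in practice, plug in a schedule like the one above
    momentum=0.9,
    dtype_momentum=jnp.bfloat16,  # half-precision optimizer state
)
```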
This was a fun exploration.

Besides my co-authors, I'd like to give a special shout-out to @jekbradbury who selflessly helped us get stuff running on huge TPU machines!
Also, @rikelhood @joapuipe @_basilM and Alexey for hanging in there with us.
@PreetumNakkiran you should like this. I remember you specifically asked for it a while ago, and I wanted to answer "working on it, be patient" but obviously couldn't. So there you go!
Oh and one more thing, even though I'm not Steve Jobs: JAX + TPU VM made this all a breeze, implementation-wise. I can highly recommend that combo for research!

More from @giffmana

Feb 20
o3-mini-high figured out the issue with @SakanaAILabs CUDA kernels in 11s.
Its being 150x faster is a bug; the reality is 3x slower.

I literally copy-pasted their CUDA code into o3-mini-high and asked "what's wrong with this cuda code". That's it!
Proof: chatgpt.com/share/67b6f47c…

Fig1: o3-mini's answer.
Fig2: Their original code is wrong in a subtle way. The fact that they run benchmarking TWICE with wildly different results should make them stop and think.
Fig3: o3-mini's fix. Code is now correct. Benchmarking results are consistent. 3x slower.
There are three real lessons to be learned here:
1) Super-straightforward CUDA code like that has NO CHANCE of ever being faster than optimized cublas kernels. If it is, something is wrong.
2) If your benchmarking results are mysterious and inconsistent, something is wrong.
3) o3-mini-high is REALLY GOOD. It literally took 11sec to find the issue. It took me around 10min to make this write-up afterwards.
My fork of the author's colab, with the fix: colab.research.google.com/drive/1CS1g0Of…

PS: I wouldn't have found the bug myself, because it's been a literal decade since I last wrote CUDA kernel-launch code.
Feb 4
I took a brief look at the Harmonic Loss paper

tl;dr: instead of a dot product followed by softmax, use Euclidean distance with a normalized 1/d**n.

I kinda want this to work. I've dabbled with preferring Euclidean distance many times throughout my career (e.g. triplet loss etc.)

However...
I have to say that this MNIST weights figure looks suspicious as hell.

I've trained linear + softmax MNIST and looked at the weights often, and they never look as bad as presented here. However, their score of ~92.5% is the expected one, so that's good.
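For reference, my reading of the proposed output layer, as a rough sketch from the tl;dr above (not the authors' code; names are mine):

```python
import jax.numpy as jnp

def harmonic_probs(x, centers, n=2.0, eps=1e-9):
    # x: [d] input features; centers: [num_classes, d] class weight vectors / prototypes.
    dist = jnp.linalg.norm(centers - x, axis=-1)     # Euclidean distance to each class center
    unnorm = 1.0 / (dist ** n + eps)                 # harmonic-style weighting, 1/d**n
    return unnorm / unnorm.sum()                     # normalize to probabilities
```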
I trained a plain MNIST linear model, not tuning much (no wd!), and it looks like one of the pics below.

Add a small wd and it looks like another of the pics below.

I trained another one with their Harmonic stuff, closely following their code and the hparams in the code, and it looks like the other pic below.

Guess which is which now.
Feb 3
Don't trust the headline. This 56M…
…DOESN'T include compute, it's everything else
…is scattered across many unis, profs, post-docs, PhDs
…who collaborate in theory, but work on their individual papers in reality, because that's what's needed to graduate
…is unrelated to DeepSeek
What I expect to come out of it:
- a whole bunch of papers, probably some interesting ones. A lot on multilinguality and language transfer.
- a series of benchmarks for various EU languages
- (I hope) nice small-languages datasets

I think that's good overall.
What I do not expect to come out of it:
- an open-source base model at the frontier
Jan 27
Just had a quick look at DeepSeek's new Janus Pro paper.

I don't think it's a big deal (yet...!), but quick TL;DR below before the hype gets out of hand.
It's as straightforward an Omni model as it gets:
- a core autoregressive decoder LLM
- a SigLIP encoder for understanding (L@384; why not So400m?)
- a VQ-VAE for generation (from LlamaGen)
Three training stages:
1. new-params only on ImageNet
2. fine-tune all on mm-mix
3. SFT mix
The main changes from Janus to -Pro are basically all data-related:
1. "Purify" stage2 by moving ImageNet to Stage1, make that longer.
2. Update data for understanding by taking from DeepSeek-VL2
3. Update data for generation by throwing in MidJourney data Image
Dec 30, 2024
Long paper thread!

I've held a skeptical opinion of DiffTransformer after skimming it, but after some recent interactions I thought that was unfair and decided to give it a proper read. Did so on a recent flight.

Paper TL;DR: pair two attention heads, and do:

(sm(Q1K1) - λ sm(Q2K2)) V
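In JAX-ish pseudocode, my reading of that TL;DR (a single head pair, ignoring the paper's grouping and normalization details):

```python
import jax
import jax.numpy as jnp

def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    # q*, k*, v: [seq, d_head]; lam is the learned scalar λ shared by the head pair.
    d = q1.shape[-1]
    a1 = jax.nn.softmax(q1 @ k1.T / jnp.sqrt(d), axis=-1)
    a2 = jax.nn.softmax(q2 @ k2.T / jnp.sqrt(d), axis=-1)
    return (a1 - lam * a2) @ v   # note: the difference is not re-normalized
```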
The motivation in Fig1 is very solid a priori: as context gets long, the sum of (small) attention on irrelevant tokens might be more than the attention to few individual relevant tokens, thus drowning them.

However, it is just an illustration, and I'm still not sure how much this is really a problem for well-trained models. Instabilities usually occur because attention logits grow, which already makes the attention look much more like the green one. This is not often talked about, but evidence is spread across some papers like ViT-22B or "small-scale proxies". If max attention is ~0.5, then we're already in the green.

I would have liked some plots about attention distribution/entropy in the DiffTransformer paper to actually justify the illustration. (We'll get back to this.)
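The kind of diagnostic meant here is cheap to compute; a rough sketch (not from the paper):

```python
import jax
import jax.numpy as jnp

def attention_stats(logits):
    # logits: [heads, q_len, k_len] pre-softmax attention scores.
    probs = jax.nn.softmax(logits, axis=-1)
    max_attn = probs.max(axis=-1)                               # how peaked each query's attention is
    entropy = -(probs * jnp.log(probs + 1e-9)).sum(axis=-1)     # spread of the attention mass
    return max_attn, entropy
```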
Next, the core idea is very simple and nice, but I notice a few details making it immediately less "neat" somehow?

The DiffAttn actually does _not_ re-normalize the diff, unlike what happened in the Fig1 motivating illustration. This confuses me a lot: how does this then lead to amplifying the "relevant" scores? The second head clearly has a different job now; assuming lambda is positive, it must learn what to suppress from the first one?

I would have loved some plots and experiments looking at what it really looks like in trained DiffTransformers, instead of just the mismatched Fig1 illustration.
Apr 5, 2024
A bit late, but I just read ReFT, here's a quick thread.

- A PEFT method
- acts on activations `h` -> small inference overhead

1/5
Learns a (R,W,b) per layer and _per position_ in the (prompt) sequence.

However, they hparam search which subset of layers and positions to apply it to.

Even more, they suggest to (sometimes) tie the (R,W,b) parameters from a layer across positions.

2/5
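From my reading of the paper, the per-position LoReFT edit looks roughly like the sketch below (shapes and the exact formula should be double-checked against the paper):

```python
import jax.numpy as jnp

def loreft_intervention(h, R, W, b):
    # h: [d] hidden activation at one layer/position.
    # R: [r, d] low-rank projection (orthonormal rows in the paper); W: [r, d]; b: [r].
    return h + R.T @ (W @ h + b - R @ h)
```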
It seems best to apply it to all layers, but only a few positions. For decoder models, tying across positions helps, but not for encoders. Rank can be lower for smaller models.

3/5

