Dan Fu
CS PhD Candidate/Researcher at Stanford. Systems for machine learning. Sometimes YouTuber/podcaster.

Jan 23, 2023, 18 tweets

Attention is all you need... but how much of it do you need?

Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao

📜 arxiv.org/abs/2212.14052 1/n

One key point: SSMs are *linear* in sequence length instead of quadratic, and have no fixed context length. Long context for everyone!
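For intuition, here's a minimal sketch of a diagonal SSM run as a recurrence (illustrative shapes and names, not the released code): a fixed-size state is updated once per token, so the cost grows linearly with sequence length and there's no context window to run out of.

```python
import torch

def ssm_scan(u, A, B, C):
    """Minimal diagonal SSM run as a recurrence: O(N) in sequence length.

    u: (N, d_in) input sequence
    A: (d_state,) diagonal state matrix, B: (d_state, d_in), C: (d_out, d_state)
    (Hypothetical shapes for illustration; real SSM layers train a structured
    parameterization and compute this as a long convolution during training.)
    """
    N = u.shape[0]
    x = torch.zeros(A.shape[0])          # hidden state; size independent of N
    ys = []
    for k in range(N):                   # one step per token -> linear in N
        x = A * x + B @ u[k]             # state update
        ys.append(C @ x)                 # readout
    return torch.stack(ys)

# No fixed context window: the state x summarizes the whole prefix,
# so memory stays constant as the sequence grows.
```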

We're super excited, so we're releasing our code and model weights today - up to 2.7B parameters!

github.com/HazyResearch/H3 2/n

In H3, we replace attention with a new layer based on state space models (SSMs) - with the right modifications, we find that it can outperform Transformers.

Two key ideas:
* Adapting SSMs to be able to do *comparison*
* Making SSMs as hardware-efficient as attention 3/n

Part 1: the quality gap

SSMs have achieved impressive results on sequence modeling (30+ points over Transformers on Long Range Arena), but have underperformed attention in language modeling.

In our paper, we use *synthetic languages* to probe this gap 4/n

These synthetic languages (inspired by great work like transformer-circuits.pub/2022/in-contex…) test how well SSMs can do in-context learning compared to attention.

We find a critical missing capability -- SSMs have trouble *comparing tokens* across the sequence. 5/n
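For a flavor of these synthetics, here's a toy associative-recall-style generator (illustrative only, not the exact task spec from the paper): the model sees key-value pairs and then a query key, and answering requires comparing the query against keys seen earlier in the sequence.

```python
import random

def associative_recall_example(n_pairs=8, seed=0):
    """Toy associative-recall-style synthetic (illustrative, not the paper's exact spec).

    The model sees key-value pairs, then a query key, and must emit the
    matching value -- which requires comparing the query against earlier keys.
    """
    rng = random.Random(seed)
    vocab = [chr(ord('a') + i) for i in range(26)]
    keys = rng.sample(vocab, n_pairs)
    values = [rng.choice('0123456789') for _ in range(n_pairs)]
    query = rng.choice(keys)
    prompt = ' '.join(f'{k} {v}' for k, v in zip(keys, values)) + f' {query}'
    target = values[keys.index(query)]
    return prompt, target

# e.g. ('a 4 q 7 ... m 2 q', '7') -- attention solves this lookup easily;
# a vanilla SSM struggles to "compare" the query against earlier keys.
```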

In response, we designed the H3 layer (Hungry Hungry Hippos) to plug this gap.

The H3 layer stacks two SSMs, and uses some simple multiplicative interactions between them (gating) to do comparisons. 6/n
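A highly simplified sketch of that idea (shift_ssm and diag_ssm are placeholders standing in for the paper's shift and diagonal SSMs; shapes and names are illustrative, see the repo for the real layer):

```python
import torch.nn as nn

class H3Sketch(nn.Module):
    """Simplified sketch of the H3 idea: two SSMs + multiplicative gating."""

    def __init__(self, d_model, shift_ssm, diag_ssm):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.shift_ssm = shift_ssm   # "remembers" recent tokens
        self.diag_ssm = diag_ssm     # carries information across the sequence

    def forward(self, x):            # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        kv = self.shift_ssm(k) * v   # gate: compare shifted keys with values
        y = self.diag_ssm(kv) * q    # gate again with the query
        return self.out_proj(y)
```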

The H3 layer closes the gap on our synthetics, and the gains translate to strong downstream performance on language modeling.

We replaced almost all the attention blocks in a Transformer with H3 layers, and trained on the Pile. Our model *outperforms* GPT-Neo in PPL! 7/n

These gains also translate to strong downstream zero- and few-shot performance. On SuperGLUE, our zero-shot performance outperforms Transformer models of similar sizes. 8/n

Part 2: the efficiency gap

But that's not all! In order to scale H3 up to billion-parameter models, we had to make it as hardware-efficient as attention.

The convolution is O(N log N) asymptotically, but still underperforms FlashAttention for short sequences... 9/n

What's the problem? Long convolutions require multiple FFT calls, which introduce expensive GPU memory reads/writes.

We develop FlashConv to address this problem.

FlashConv uses a block FFT algorithm to increase FLOP util, and uses state passing to scale to long sequences. 10/n
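For reference, the plain FFT-based long convolution looks roughly like this (an illustrative sketch; FlashConv's block FFT and state passing are fused-kernel optimizations on top of this basic idea):

```python
import torch

def fft_long_conv(u, k):
    """Long convolution via FFT: O(N log N) instead of O(N^2).

    u: (batch, N) input, k: (N,) SSM kernel. Naive version for illustration;
    FlashConv additionally fuses the pointwise multiply into a block FFT
    kernel and uses state passing to go beyond what fits in GPU SRAM.
    """
    N = u.shape[-1]
    L = 2 * N                                  # pad to avoid circular wrap-around
    u_f = torch.fft.rfft(u, n=L)
    k_f = torch.fft.rfft(k, n=L)
    y = torch.fft.irfft(u_f * k_f, n=L)[..., :N]
    return y
```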

With FlashConv, we can make SSMs outperform attention for almost all sequence lengths -- up to 35x faster than FlashAttention for long sequences! 11/n

The upshot: we can scale H3 up to *2.7B* parameter models. And because of the state passing, we can run inference blazing fast -- up to *2.4x* faster than highly-optimized Transformers.

Up to 1,980 tokens/second! 12/n
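The recurrent view is what makes generation cheap: each new token is one constant-time state update, with no growing KV cache. A minimal sketch (illustrative shapes and names; the released models use optimized kernels):

```python
def generate_step(x_state, u_t, A, B, C):
    """One decoding step of a diagonal SSM in recurrent mode.

    Unlike attention, there is no KV cache that grows with the prefix:
    the fixed-size state x_state summarizes everything seen so far,
    so each new token costs O(1). (Illustrative, not the released kernels.)
    """
    x_state = A * x_state + B @ u_t       # fold the new token into the state
    y_t = C @ x_state                     # readout for the new token
    return y_t, x_state
```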

We're super excited about these advances, so we're releasing our code and model weights today:

github.com/HazyResearch/H3 13/n

And we've got two blog posts up on our work -- first, read about our synthetic languages and how we developed H3:

hazyresearch.stanford.edu/blog/2023-01-2… 14/n

Then check out our blog post on @togethercompute about how FlashConv speeds up SSMs:

together.xyz/blog/h3 15/n

Overall, really excited about new models/architectures like this. What happens if we don't need attention to get the magic we've been seeing, and we can get the same quality with a linear operator?

No more fixed context windows, long context for everyone! 16/n

With @tri_dao (co-first), @KhaledSaab11, @ai_with_brains, Atri Rudra, and @HazyResearch!

Thanks to @StanfordAILab, @StanfordHAI, @StanfordCRFM, and @togethercompute for helping provide us with the compute necessary to train these models! 17/17

And thanks to @_albertgu for helpful discussions about the H3 architecture - and more importantly, sending us daily hippo videos :)
