Dan Fu · Jan 23 · 18 tweets · 8 min read
Attention is all you need... but how much of it do you need?

Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao

📜 arxiv.org/abs/2212.14052 1/n
One key point: SSMs are *linear* in sequence length instead of quadratic, and have no fixed context length. Long context for everyone!

We're super excited, so we're releasing our code and model weights today - up to 2.7B parameters!

github.com/HazyResearch/H3 2/n
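For intuition, here is a toy sketch (illustrative shapes and names, not the released code) of why an SSM scan costs O(N) in sequence length while attention materializes an O(N²) score matrix:

```python
# Toy comparison: a diagonal SSM recurrence does O(1) work per step
# (linear in N overall), while attention builds an (N, N) score matrix.
import torch

def ssm_scan(u, A, B, C):
    """u: (N, d) inputs; A: (m,) diagonal transition; B: (m, d); C: (d, m)."""
    x = torch.zeros(A.shape[0])          # fixed-size hidden state
    ys = []
    for t in range(u.shape[0]):          # one pass over the sequence
        x = A * x + B @ u[t]             # x_t = A x_{t-1} + B u_t
        ys.append(C @ x)                 # y_t = C x_t
    return torch.stack(ys)

def attention_scores(q, k):
    """Attention needs the full (N, N) score matrix -> quadratic in N."""
    return (q @ k.T) / k.shape[-1] ** 0.5

N, d = 8, 4
y = ssm_scan(torch.randn(N, d), torch.rand(d), torch.randn(d, d), torch.randn(d, d))
scores = attention_scores(torch.randn(N, d), torch.randn(N, d))
print(y.shape, scores.shape)             # (8, 4) vs (8, 8)
```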
In H3, we replace attention with a new layer based on state space models (SSMs) - with the right modifications, we find that it can outperform Transformers.

Two key ideas:
* Adapting SSMs to be able to do *comparison*
* Making SSMs as hardware-efficient as attention 3/n
Part 1: the quality gap

SSMs have achieved impressive results in sequence modeling (30+ points over Transformers on Long Range Arena), but have underperformed attention in language modeling.

In our paper, we use *synthetic languages* to probe this gap 4/n
These synthetic languages (inspired by great work like transformer-circuits.pub/2022/in-contex…) test how well SSMs can do in-context learning compared to attention.

We find a critical missing capability -- SSMs have trouble *comparing tokens* across the sequence. 5/n
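For concreteness, here is a toy generator for an associative-recall-style synthetic in the spirit of these tasks (an illustrative stand-in, not the code used in the paper):

```python
# Toy generator for an associative-recall-style synthetic: emit key-value
# pairs, then query one key; the answer is its value. Solving it requires
# comparing the query token against earlier keys in the sequence.
import random

def make_example(n_pairs=4, seed=None):
    rng = random.Random(seed)
    keys = rng.sample("abcdefghij", n_pairs)
    vals = rng.sample("0123456789", n_pairs)
    pairs = [tok for k, v in zip(keys, vals) for tok in (k, v)]
    query = rng.choice(keys)
    answer = vals[keys.index(query)]
    return pairs + [query], answer

tokens, target = make_example(seed=0)
print(" ".join(tokens), "->", target)    # key-value pairs, a query, and its answer
```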
In response, we designed the H3 layer (Hungry Hungry Hippos) to plug this gap.

The H3 layer stacks two SSMs, and uses some simple multiplicative interactions between them (gating) to do comparisons. 6/n
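Very roughly, the gating skeleton looks like the sketch below. This is a toy version: the real H3 layer uses a shift SSM and a diagonal SSM, multiple heads, and an output projection, all of which are simplified away here.

```python
# Rough sketch of the multiplicative (gating) structure: two SSMs stacked,
# with elementwise products between projections. Illustrative only.
import torch
import torch.nn as nn

class ToySSM(nn.Module):
    """Stand-in for an SSM: any causal sequence-to-sequence map fits here."""
    def __init__(self, d):
        super().__init__()
        self.A = nn.Parameter(torch.rand(d) * 0.9)            # diagonal transition
        self.B = nn.Parameter(torch.randn(d, d) / d ** 0.5)
        self.C = nn.Parameter(torch.randn(d, d) / d ** 0.5)

    def forward(self, u):                                      # u: (N, d)
        x = torch.zeros(u.shape[-1])
        ys = []
        for t in range(u.shape[0]):
            x = self.A * x + u[t] @ self.B.T
            ys.append(x @ self.C.T)
        return torch.stack(ys)

class ToyH3Layer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(d, d) for _ in range(3))
        self.ssm_k = ToySSM(d)     # plays the role of the first SSM
        self.ssm_kv = ToySSM(d)    # plays the role of the second SSM

    def forward(self, u):                                      # u: (N, d)
        q, k, v = self.q_proj(u), self.k_proj(u), self.v_proj(u)
        k = self.ssm_k(k)                                      # first SSM
        kv = self.ssm_kv(k * v)                                # gate, then second SSM
        return q * kv                                          # gate with the query

layer = ToyH3Layer(d=16)
print(layer(torch.randn(32, 16)).shape)                        # torch.Size([32, 16])
```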
The H3 layer closes the gap on our synthetics, and the gains translate to strong downstream performance on language modeling.

We replaced almost all the attention blocks in a Transformer with H3 layers, and trained on the Pile. Our model *outperforms* GPT-Neo in PPL! 7/n
These gains also translate to strong downstream zero- and few-shot performance. On SuperGLUE, our zero-shot performance outperforms Transformer models of similar sizes. 8/n
Part 2: the efficiency gap

But that's not all! In order to scale H3 up to billion-parameter models, we had to make it as hardware-efficient as attention.

The convolution is O(N log N) asymptotically, but still underperforms FlashAttention for short sequences... 9/n
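As a reminder of where the O(N log N) comes from, a long convolution can be computed with FFTs. A minimal version (illustrative only, not FlashConv itself):

```python
# Minimal FFT-based causal long convolution -- the O(N log N) operation
# referred to above. FlashConv fuses these steps to avoid the repeated
# GPU memory reads/writes that make the naive version slow.
import torch

def fft_causal_conv(u, kernel):
    """Causal convolution of u (..., N) with a length-N kernel via FFT."""
    N = u.shape[-1]
    fft_size = 2 * N                       # zero-pad so circular conv == linear conv
    u_f = torch.fft.rfft(u, n=fft_size)
    k_f = torch.fft.rfft(kernel, n=fft_size)
    y = torch.fft.irfft(u_f * k_f, n=fft_size)
    return y[..., :N]                      # keep the causal part

u = torch.randn(4, 1024)                   # (batch, sequence)
kernel = torch.randn(1024)                 # SSM kernel, same length as the sequence
print(fft_causal_conv(u, kernel).shape)    # torch.Size([4, 1024])
```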
What's the problem? Long convolutions require multiple FFT calls, which introduce expensive GPU memory reads/writes.

We develop FlashConv to address this problem.

FlashConv uses a block FFT algorithm to increase FLOP utilization, and uses state passing to scale to long sequences. 10/n
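Here is a toy sketch of the state-passing idea for a diagonal SSM: convolve within each chunk and carry the recurrent state across chunk boundaries. The block-FFT piece and all hardware details are omitted, and the within-chunk convolution here is naive rather than FFT-based.

```python
# Toy "state passing" for a scalar-in/scalar-out diagonal SSM:
# convolve with the SSM kernel inside each chunk (naively here),
# and carry the recurrent state across chunk boundaries.
import torch

def chunked_ssm(u, A, B, C, chunk=256):
    """Compute y_t = sum_{j<=t} C A^j B u_{t-j} chunk by chunk."""
    powers = A.unsqueeze(0) ** torch.arange(chunk, dtype=torch.float).unsqueeze(1)  # row j = A^j
    kernel = powers @ (B * C)                      # K[j] = C A^j B
    state = torch.zeros(A.shape[0])
    out = []
    for s in range(0, u.shape[0], chunk):
        uc = u[s:s + chunk]
        L = uc.shape[0]
        # causal convolution with the kernel inside the chunk (FFT / block FFT in FlashConv)
        y = torch.stack([(kernel[:i + 1].flip(0) * uc[:i + 1]).sum() for i in range(L)])
        # add the contribution carried in from earlier chunks: C A^{i+1} x_prev
        y = y + (powers[:L] * A) @ (C * state)
        out.append(y)
        # pass the state forward: x_new = A^L x_prev + sum_i A^{L-1-i} B u_i
        state = (A ** L) * state + (powers[:L].flip(0) * B).T @ uc
    return torch.cat(out)

def full_ssm(u, A, B, C):
    """Reference: run x_t = A x_{t-1} + B u_t, y_t = C x_t directly."""
    x = torch.zeros(A.shape[0])
    ys = []
    for t in range(u.shape[0]):
        x = A * x + B * u[t]
        ys.append((C * x).sum())
    return torch.stack(ys)

torch.manual_seed(0)
A, B, C = torch.rand(8) * 0.9, torch.randn(8), torch.randn(8)
u = torch.randn(1000)
print(torch.allclose(chunked_ssm(u, A, B, C), full_ssm(u, A, B, C), atol=1e-3))  # True
```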
With FlashConv, we can make SSMs outperform attention for almost all sequence lengths -- up to 35x faster than FlashAttention for long sequences! 11/n
The upshot: we can scale H3 up to *2.7B* parameter models. And because of the state passing, we can run inference blazing fast -- up to *2.4x* faster than highly-optimized Transformers.

Up to 1,980 tokens/second! 12/n
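The reason decoding is cheap: generation only carries a fixed-size recurrent state, so each new token costs a constant amount of work instead of attending over a growing cache. A toy sketch (illustrative, not the released inference code):

```python
# Toy decode loop: the state is a fixed-size vector updated in O(d_state)
# per token, unlike a KV cache that grows with the generated length.
import torch

class ToyRecurrentCell:
    def __init__(self, A, B, C):
        self.A, self.B, self.C = A, B, C
        self.x = torch.zeros(A.shape[0])            # fixed-size recurrent state

    def step(self, u_t):
        """Consume one token's feature, emit one output -- constant work per step."""
        self.x = self.A * self.x + self.B * u_t
        return (self.C * self.x).sum()

d = 64
cell = ToyRecurrentCell(torch.rand(d) * 0.9, torch.randn(d), torch.randn(d))
for t in range(5):                                  # cost per decode step does not grow with t
    y_t = cell.step(torch.randn(()))
    print(t, float(y_t))
```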
We're super excited about these advances, so we're releasing our code and model weights today:

github.com/HazyResearch/H3 13/n
And we've got two blog posts up on our work -- first, read about our synthetic languages and how we developed H3:

hazyresearch.stanford.edu/blog/2023-01-2… 14/n
Then check out our blog post on @togethercompute about how FlashConv speeds up SSMs:

together.xyz/blog/h3 15/n
Overall, really excited about new models/architectures like this. What happens if we don't need attention to get the magic we've been seeing, and we can get the same quality with a linear operator?

No more fixed context windows, long context for everyone! 16/n
With @tri_dao (co-first), @KhaledSaab11, @ai_with_brains, Atri Rudra, and @HazyResearch!

Thanks to @StanfordAILab, @StanfordHAI, @StanfordCRFM, and @togethercompute for helping provide us with the compute necessary to train these models! 17/17
And thanks to @_albertgu for helpful discussions about the H3 architecture - and more importantly, sending us daily hippo videos :)
