Dan Fu
Jan 23, 2023 · 18 tweets
Attention is all you need... but how much of it do you need?

Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao

📜 arxiv.org/abs/2212.14052 1/n
One key point: SSMs are *linear* in sequence length instead of quadratic, and have no fixed context length. Long context for everyone!
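For intuition, an SSM layer can be run as a recurrence with one constant-size state update per token, so compute grows linearly with sequence length and the state summarizes arbitrarily long context. A minimal sketch (illustrative shapes and names, not the released H3 code):

```python
import torch

def ssm_scan(u, A, B, C):
    """Run a diagonal SSM as a recurrence.
    u: (N, d_in) inputs; A: (d_state,); B: (d_state, d_in); C: (d_out, d_state)."""
    x = torch.zeros(A.shape[0])        # hidden state: fixed size no matter how long N is
    ys = []
    for t in range(u.shape[0]):        # one update per token -> O(N) total work
        x = A * x + B @ u[t]           # state update
        ys.append(C @ x)               # readout
    return torch.stack(ys)             # (N, d_out)
```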

We're super excited, so we're releasing our code and model weights today - up to 2.7B parameters!

github.com/HazyResearch/H3 2/n
In H3, we replace attention with a new layer based on state space models (SSMs) - with the right modifications, we find that it can outperform Transformers.

Two key ideas:
* Adapting SSMs to be able to do *comparison*
* Making SSMs as hardware-efficient as attention 3/n
Part 1: the quality gap

SSMs have achieved impressive results on sequence modeling (30+ points over Transformers on Long Range Arena), but have underperformed attention on language modeling.

In our paper, we use *synthetic languages* to probe this gap 4/n
These synthetic languages (inspired by great work like transformer-circuits.pub/2022/in-contex…) test how well SSMs can do in-context learning compared to attention.

We find a critical missing capability -- SSMs have trouble *comparing tokens* across the sequence. 5/n
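One such synthetic is an associative-recall-style task. The generator below is illustrative (not the paper's exact setup): the model reads key-value pairs, then a query key, and must emit the matching value -- which requires comparing the query against tokens seen earlier in the sequence.

```python
import random

def make_example(num_pairs=8, vocab=list("abcdefghij"), values=list("0123456789")):
    keys = random.sample(vocab, num_pairs)            # distinct keys
    vals = [random.choice(values) for _ in keys]      # value paired with each key
    query = random.choice(keys)                       # key to look up at the end
    answer = vals[keys.index(query)]
    prompt = " ".join(f"{k} {v}" for k, v in zip(keys, vals)) + f" {query}"
    return prompt, answer

prompt, answer = make_example()
print(prompt, "->", answer)   # e.g. "c 4 f 1 ... c" -> "4"
```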
In response, we designed the H3 layer (Hungry Hungry Hippos) to plug this gap.

The H3 layer stacks two SSMs, and uses some simple multiplicative interactions between them (gating) to do comparisons. 6/n
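In pseudocode, the idea looks roughly like this. It's a simplified sketch with illustrative names (H3Sketch, ssm_shift, ssm_diag), not the released implementation; the paper's shift and diagonal SSMs are passed in as black boxes here.

```python
import torch
import torch.nn as nn

class H3Sketch(nn.Module):
    def __init__(self, d_model, ssm_shift, ssm_diag):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.ssm_shift = ssm_shift   # SSM that remembers recent tokens (shift-like)
        self.ssm_diag = ssm_diag     # SSM that aggregates over the whole prefix

    def forward(self, u):            # u: (batch, seqlen, d_model)
        q, k, v = self.q_proj(u), self.k_proj(u), self.v_proj(u)
        kv = self.ssm_shift(k) * v   # multiplicative interaction #1 (gating)
        y = q * self.ssm_diag(kv)    # multiplicative interaction #2: "compare" q with the past
        return self.out_proj(y)

# Shape check with trivial placeholder SSMs:
layer = H3Sketch(64, ssm_shift=nn.Identity(), ssm_diag=nn.Identity())
out = layer(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```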
The H3 layer closes the gap on our synthetics, and the gains translate to strong downstream performance on language modeling.

We replaced almost all the attention blocks in a Transformer with H3 layers, and trained on the Pile. Our model *outperforms* GPT-Neo in perplexity! 7/n
These gains also translate to strong downstream zero- and few-shot performance. On SuperGLUE, our zero-shot performance outperforms Transformer models of similar sizes. 8/n
Part 2: the efficiency gap

But that's not all! In order to scale H3 up to billion-parameter models, we had to make it as hardware-efficient as attention.

The FFT-based convolution is O(N log N) asymptotically, but it's still slower than FlashAttention for short sequences... 9/n
What's the problem? Long convolutions require multiple FFT calls, which introduce expensive GPU memory reads/writes.
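For reference, the underlying primitive is a causal long convolution computed with FFTs in O(N log N). A minimal sketch (not the fused FlashConv kernel):

```python
import torch

def fft_causal_conv(u, k):
    """u: (batch, d, N) inputs; k: (d, N) long convolution kernel, one filter per channel."""
    N = u.shape[-1]
    fft_size = 2 * N                            # zero-pad to avoid circular wrap-around
    u_f = torch.fft.rfft(u, n=fft_size)
    k_f = torch.fft.rfft(k, n=fft_size)
    y = torch.fft.irfft(u_f * k_f, n=fft_size)  # pointwise multiply in frequency domain
    return y[..., :N]                           # keep only the causal outputs
```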

We develop FlashConv to address this problem.

FlashConv uses a block FFT algorithm to increase FLOP utilization, and state passing to scale to long sequences. 10/n
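The state-passing idea, very roughly: split the sequence into chunks, process each chunk with bounded work, and carry the recurrent SSM state across chunk boundaries so context never resets. The sketch below is a simplification -- the real FlashConv computes each chunk with a fused block-FFT convolution rather than this explicit token loop -- but the carried state plays the same role.

```python
import torch

def ssm_chunked(u, A, B, C, chunk_len=1024):
    """Same diagonal SSM as before, processed in chunks with state carried across boundaries.
    u: (N, d_in); A: (d_state,); B: (d_state, d_in); C: (d_out, d_state)."""
    x = torch.zeros(A.shape[0])                  # recurrent state passed between chunks
    outputs = []
    for start in range(0, u.shape[0], chunk_len):
        chunk = u[start:start + chunk_len]       # per-chunk work is bounded
        ys = []
        for t in range(chunk.shape[0]):
            x = A * x + B @ chunk[t]
            ys.append(C @ x)
        outputs.append(torch.stack(ys))
    return torch.cat(outputs)                    # identical result to one long scan
```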
With FlashConv, SSMs run faster than attention at almost all sequence lengths -- up to 35x faster than FlashAttention for long sequences! 11/n
The upshot: we can scale H3 up to *2.7B* parameter models. And because of the state passing, we can run inference blazing fast -- up to *2.4x* faster than highly-optimized Transformers.

Up to 1,980 tokens/second! 12/n
We're super excited about these advances, so we're releasing our code and model weights today:

github.com/HazyResearch/H3 13/n
And we've got two blog posts up on our work -- first, read about our synthetic languages and how we developed H3:

hazyresearch.stanford.edu/blog/2023-01-2… 14/n
Then check out our blog post on @togethercompute about how FlashConv speeds up SSMs:

together.xyz/blog/h3 15/n
Overall, really excited about new models/architectures like this. What happens if we don't need attention to get the magic we've been seeing, and we can get the same quality with a linear operator?

No more fixed context windows, long context for everyone! 16/n
With @tri_dao (co-first), @KhaledSaab11, @ai_with_brains, Atri Rudra, and @HazyResearch!

Thanks to @StanfordAILab, @StanfordHAI, @StanfordCRFM, and @togethercompute for helping provide us with the compute necessary to train these models! 17/17
And thanks to @_albertgu for helpful discussions about the H3 architecture - and more importantly, sending us daily hippo videos :)
