Announcing FlashAttention-2! We released FlashAttention a year ago, making attn 2-4 faster and is now widely used in most LLM libraries. Recently I’ve been working on the next version: 2x faster than v1, 5-9x vs standard attn, reaching 225 TFLOPs/s training speed on A100. 1/
The tech report has all the info:
More details in blogposts:
https://t.co/hh2yGicgOe
https://t.co/ANwdH0fgMs
https://t.co/EjeYlGmBuL
I’ve been working with @AdeptAILabs and we’ve made FlashAttention even faster for long sequences! For seqlen 8K, FlashAttention is now up to 2.7x faster than a standard PyTorch implementation even at small batch, making it easier to train better LMs with longer context 1/7
We're releasing an optimized implementation of GPT2/GPT3 with FlashAttention🚀!
This trains 3-5x faster than the Huggingface version, reaching up to 189 TFLOPs/sec per A100, 60.6% (model) FLOPs util of the theoretical maximum. 1/6 github.com/HazyResearch/f…
The main ingredient is FlashAttention, which computes attention fast (2-4x) and with less memory (10x), without any approximation. This means that we don't need to do any activation checkpointing 2/6
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu
By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
Transformers have grown larger and deeper, but longer context remains difficult, since self-attention has time and memory quadratic in seq. length. Approx attn attempts to address this by trading off quality for compute complexity, but often doesn’t achieve wall-clock speedup. 2/