Tri Dao
Incoming Asst. Prof @PrincetonCS, Chief Scientist @togethercompute. Machine learning & systems.
Jul 17, 2023 13 tweets 4 min read
Announcing FlashAttention-2! We released FlashAttention a year ago, making attn 2-4x faster, and it is now widely used in most LLM libraries. Recently I’ve been working on the next version: 2x faster than v1, 5-9x faster than standard attn, reaching 225 TFLOPs/s training speed on A100. 1/
The tech report has all the info: tridao.me/publications/f…

More details in blogposts:
crfm.stanford.edu/2023/07/17/fla…
together.ai/blog/tri-dao-f…
princeton-nlp.github.io/flash-atttenti…

FlashAttention-2 is available in the open source: github.com/Dao-AILab/flas…
2/
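For readers who want to try it, here is a minimal usage sketch against the flash-attn package linked above. This is a sketch under assumptions: it presumes the flash_attn_func entry point with the (batch, seqlen, nheads, headdim) tensor layout and fp16/bf16 CUDA tensors as described in the repo, not code taken from the repo itself.

```python
import torch
from flash_attn import flash_attn_func  # assumed entry point of the flash-attn package

batch, seqlen, nheads, headdim = 2, 4096, 16, 64
# FlashAttention kernels operate on fp16/bf16 tensors on a CUDA device.
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Exact attention (no approximation); causal=True applies the usual LM masking.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```

Since the output matches standard attention up to numerical precision, swapping this in for a model's attention call is typically all that is needed to get the speedup.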
Jan 17, 2023 8 tweets 5 min read
I’ve been working with @AdeptAILabs and we’ve made FlashAttention even faster for long sequences! For seqlen 8K, FlashAttention is now up to 2.7x faster than a standard PyTorch implementation even at small batch, making it easier to train better LMs with longer context. 1/7

We described this improvement in more detail in these blogposts:
crfm.stanford.edu/2023/01/13/fla…
adept.ai/flashierattent…
2/7
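For context, the "standard PyTorch implementation" used as the baseline here is the usual explicit-softmax attention, roughly like this sketch (illustrative only, not the actual benchmark code):

```python
import math
import torch

def standard_attention(q, k, v):
    # q, k, v: (batch, nheads, seqlen, headdim).
    # Materializes the full (seqlen x seqlen) score matrix, which is what
    # dominates time and memory once seqlen reaches 8K.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v
```

FlashAttention computes the same result without ever writing that seqlen x seqlen matrix to GPU HBM, which is where the long-sequence speedup comes from.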
Nov 29, 2022 6 tweets 2 min read
We're releasing an optimized implementation of GPT2/GPT3 with FlashAttention🚀!
This trains 3-5x faster than the Huggingface version, reaching up to 189 TFLOPs/sec per A100, i.e. 60.6% model FLOPs utilization (MFU) of the theoretical maximum. 1/6
github.com/HazyResearch/f…

The main ingredient is FlashAttention, which computes attention 2-4x faster and with 10x less memory, without any approximation. This means we don't need any activation checkpointing.
2/6
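MFU is achieved model FLOPs per second divided by the GPU's peak throughput. A quick sanity check of the numbers quoted above (assuming the A100's 312 TFLOPS peak for dense bf16/fp16 tensor-core math):

```python
# MFU sanity check for the throughput quoted in the tweet.
achieved_tflops = 189.0      # per-A100 training throughput reported above
a100_peak_tflops = 312.0     # assumed A100 peak dense bf16/fp16 tensor-core throughput
mfu = achieved_tflops / a100_peak_tflops
print(f"MFU ~ {mfu:.1%}")    # ~ 60.6%
```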
May 31, 2022 11 tweets 5 min read
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu

By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/

Transformers have grown larger and deeper, but longer context remains difficult, since self-attention has time and memory costs quadratic in sequence length. Approximate attention tries to address this by trading off quality for lower compute complexity, but often doesn’t achieve wall-clock speedup. 2/
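A back-of-the-envelope calculation shows why the quadratic memory matters at these lengths (illustrative numbers, not from the paper):

```python
# Memory needed just to store one head's attention score matrix in fp16.
seqlen = 64 * 1024                      # 65,536 tokens
bytes_per_elem = 2                      # fp16
attn_matrix_bytes = seqlen * seqlen * bytes_per_elem
print(attn_matrix_bytes / 2**30)        # 8.0 GiB per head, per example
```

FlashAttention sidesteps this by tiling the computation so the score matrix is never materialized in GPU HBM, which is how it stays exact while scaling to seq. length 64K.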