FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/
The tech report has all the info:
More details in blog posts:
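As a rough sanity check on those utilization figures, here is a small sketch, assuming the commonly quoted H100 SXM dense tensor-core peaks of ~989 TFLOPS for FP16/BF16 and ~1979 TFLOPS for FP8 (exact peaks depend on SKU and clocks):

```python
# Back-of-the-envelope utilization check for the headline FlashAttention-3 numbers.
# Peak figures are the commonly quoted H100 SXM dense tensor-core throughputs
# (no sparsity); actual peaks vary by SKU and clocks.
H100_PEAK_FP16_TFLOPS = 989.0    # FP16/BF16 dense
H100_PEAK_FP8_TFLOPS = 1979.0    # FP8 dense

achieved_fp16 = 740.0            # TFLOPS reported for FP16
achieved_fp8 = 1200.0            # ~1.2 PFLOPS reported for FP8

print(f"FP16 utilization: {achieved_fp16 / H100_PEAK_FP16_TFLOPS:.0%}")  # ~75%
print(f"FP8  utilization: {achieved_fp8 / H100_PEAK_FP8_TFLOPS:.0%}")    # ~61%
```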
Announcing FlashAttention-2! We released FlashAttention a year ago, making attn 2-4x faster; it's now widely used in most LLM libraries. Recently I’ve been working on the next version: 2x faster than v1, 5-9x vs standard attn, reaching 225 TFLOPs/s training speed on A100. 1/
The tech report has all the info:
More details in blog posts:
https://t.co/hh2yGicgOe
https://t.co/ANwdH0fgMs
https://t.co/EjeYlGmBuL
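A minimal usage sketch of the FlashAttention-2 kernels via the flash-attn package's flash_attn_func, assuming fp16/bf16 tensors of shape (batch, seqlen, nheads, headdim) on a CUDA device (shapes here are arbitrary):

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 4096, 16, 64

# flash-attn expects (batch, seqlen, nheads, headdim) in fp16/bf16 on GPU.
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Exact causal self-attention, computed without ever materializing
# the seqlen x seqlen attention matrix.
out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, nheads, headdim)
```

In practice this is a drop-in replacement for the attention call inside a Transformer block; the surrounding projections and MLP stay unchanged.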
I’ve been working with @AdeptAILabs and we’ve made FlashAttention even faster for long sequences! For seqlen 8K, FlashAttention is now up to 2.7x faster than a standard PyTorch implementation, even at small batch sizes, making it easier to train better LMs with longer context. 1/7
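A rough way to run that kind of comparison yourself (a sketch, not the exact setup behind the 2.7x figure; results will vary with GPU, dtype, and head dimension):

```python
import math, time
import torch
from flash_attn import flash_attn_func

def naive_attention(q, k, v):
    # q, k, v: (batch, nheads, seqlen, headdim). Materializes the full
    # seqlen x seqlen score matrix, which is what FlashAttention avoids.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

def bench(fn, *args, iters=10):
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

# Small batch, seqlen 8K, as in the comparison above.
batch, nheads, seqlen, headdim = 1, 16, 8192, 64
q = torch.randn(batch, nheads, seqlen, headdim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

t_naive = bench(naive_attention, q, k, v)
# flash-attn wants (batch, seqlen, nheads, headdim), so swap the layout.
qf, kf, vf = (x.transpose(1, 2).contiguous() for x in (q, k, v))
t_flash = bench(flash_attn_func, qf, kf, vf)
print(f"naive: {t_naive*1e3:.2f} ms  flash: {t_flash*1e3:.2f} ms  "
      f"speedup: {t_naive/t_flash:.1f}x")
```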
We're releasing an optimized implementation of GPT2/GPT3 with FlashAttention🚀!
This trains 3-5x faster than the Hugging Face version, reaching up to 189 TFLOPs/sec per A100, i.e. 60.6% model FLOPs utilization of the theoretical maximum. 1/6 github.com/HazyResearch/f…
The main ingredient is FlashAttention, which computes attention fast (2-4x) and with much less memory (10x), without any approximation. This means we don't need any activation checkpointing. 2/6
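One way to see the memory claim concretely (a sketch with arbitrary shapes, measuring peak GPU memory for a forward+backward pass; the naive version has to save the seqlen x seqlen attention matrix for backward, while FlashAttention recomputes it instead of storing it):

```python
import math
import torch
from flash_attn import flash_attn_func

def naive_attention(q, k, v):
    # Saves the (seqlen x seqlen) softmax output for the backward pass.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

def peak_mem_mib(fn, *args):
    torch.cuda.reset_peak_memory_stats()
    fn(*args).sum().backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

batch, seqlen, nheads, headdim = 1, 4096, 16, 64
def make():
    return torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16,
                       device="cuda", requires_grad=True)

q, k, v = make(), make(), make()
flash_mib = peak_mem_mib(flash_attn_func, q, k, v)

# Same tensors in (batch, nheads, seqlen, headdim) layout for the naive version.
qn, kn, vn = (x.detach().transpose(1, 2).contiguous().requires_grad_()
              for x in (q, k, v))
naive_mib = peak_mem_mib(naive_attention, qn, kn, vn)

print(f"naive peak: {naive_mib:.0f} MiB  flash peak: {flash_mib:.0f} MiB")
```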
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu
By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
Transformers have grown larger and deeper, but longer context remains difficult, since self-attention has time and memory quadratic in seq. length. Approximate attention methods try to address this by trading off quality for lower compute complexity, but they often don’t achieve wall-clock speedup. 2/
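To make the mechanism concrete, here is a pure-PyTorch sketch of the tiling + online-softmax idea (an educational toy, not the actual fused CUDA kernel): attention is computed one key/value block at a time with running max/sum statistics, so the seqlen x seqlen matrix is never materialized. The real kernel additionally keeps these tiles in fast on-chip SRAM, which is where the reduction in GPU memory reads/writes comes from.

```python
import math
import torch

def blockwise_attention(q, k, v, block_size=512):
    """Exact attention, one key/value block at a time with an online softmax.
    q, k, v: (batch, nheads, seqlen, headdim). The full seqlen x seqlen score
    matrix is never formed; only (seqlen x block_size) tiles exist at once."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    m = torch.full(q.shape[:-1] + (1,), float("-inf"), dtype=q.dtype, device=q.device)
    l = torch.zeros_like(m)    # running softmax denominator
    acc = torch.zeros_like(q)  # running (unnormalized) output
    for start in range(0, k.shape[-2], block_size):
        k_blk = k[..., start:start + block_size, :]
        v_blk = v[..., start:start + block_size, :]
        s = (q @ k_blk.transpose(-2, -1)) * scale            # (b, h, seqlen, block)
        m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
        p = torch.exp(s - m_new)
        correction = torch.exp(m - m_new)                    # rescale old statistics
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l

# Agrees with the quadratic-memory reference implementation.
q, k, v = (torch.randn(1, 4, 2048, 64) for _ in range(3))
ref = torch.softmax((q @ k.transpose(-2, -1)) / math.sqrt(64), dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4)
```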