CS PhD Candidate/Researcher at Stanford. Systems for machine learning. Sometimes YouTuber/podcaster.
Feb 8 • 6 tweets • 2 min read
ChatGPT's 1700-token system prompt got you down?
Led by @jordanjuravsky and @brad19brown, introducing Hydragen, a simple technique for Transformer LLM inference with shared prefixes! Up to 30x improvement in throughput with no custom CUDA!
A few things I love in this project: 1/
The idea is pretty simple. You can use the softmax scaling trick to split up the prefix and suffix into different attention calls - and batch attention queries over the shared prefixes.
This reduces your IO and changes GEMV calls into GEMM calls for higher FLOP util. 2/
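To make the softmax trick concrete, here's a minimal PyTorch sketch (my own illustration, not the Hydragen code; function names and shapes are assumptions): attention over concatenated [prefix; suffix] keys/values can be computed as two separate attention calls whose outputs are merged using their log-sum-exps, so the shared-prefix call can be batched across sequences.

```python
import torch

def attn_with_lse(q, k, v):
    # Single-head attention that also returns the log-sum-exp of the scores.
    # q: [B, Q, D], k/v: [B, L, D]
    scores = q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5   # [B, Q, L]
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)     # [B, Q, 1]
    out = torch.softmax(scores, dim=-1) @ v                 # [B, Q, D]
    return out, lse

def split_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    # Attention over the concatenated [prefix; suffix] keys/values,
    # computed as two separate calls and recombined with softmax
    # rescaling. In the shared-prefix setting, the prefix call can
    # batch queries from every sequence against one prefix KV cache.
    o1, lse1 = attn_with_lse(q, k_prefix, v_prefix)
    o2, lse2 = attn_with_lse(q, k_suffix, v_suffix)
    w = torch.softmax(torch.cat([lse1, lse2], dim=-1), dim=-1)  # [B, Q, 2]
    return w[..., :1] * o1 + w[..., 1:] * o2
```

Because the weights are softmax([lse1, lse2]), the recombined output matches attention over the full concatenated sequence exactly; the split only changes how the work is scheduled.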
Mar 28, 2023 • 6 tweets • 4 min read
This sentiment is exactly right - and why we've been working to increase sequence length in our lab for the past two years!
The context lengths of foundation models have grown exponentially recently - exciting developments!
We've been happy to play a small role with FlashAttention, and we're very excited about the possibilities: multiple media sources, complex demonstrations, and more! 2/n
Jan 23, 2023 • 18 tweets • 8 min read
Attention is all you need... but how much of it do you need?
Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao
📜 arxiv.org/abs/2212.14052 1/n
One key point: SSMs are *linear* in sequence length instead of quadratic, and have no fixed context length. Long context for everyone!
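To see where the linear scaling comes from, here's a toy diagonal state-space recurrence in PyTorch (an illustrative sketch with assumed shapes, not the actual H3 layer): each step does constant work, so a length-L sequence costs O(L), and the running state carries context without a fixed window.

```python
import torch

def ssm_scan(u, A, B, C):
    # Toy diagonal SSM recurrence: x_t = A * x_{t-1} + B * u_t, y_t = C · x_t.
    # u: [L, D] input sequence; A, B, C: [D, N] per-channel diagonal
    # parameters (hypothetical shapes for this sketch).
    L, D = u.shape
    N = A.shape[-1]
    x = torch.zeros(D, N)                     # recurrent state, fixed size
    ys = []
    for t in range(L):                        # one constant-cost step per token
        x = A * x + B * u[t].unsqueeze(-1)    # [D, N]
        ys.append((C * x).sum(-1))            # [D]
    return torch.stack(ys)                    # [L, D]
```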
We're super excited, so we're releasing our code and model weights today - up to 2.7B parameters!