Aman Sanger · Mar 5 · 13 tweets
With a 256K token prompt, a 7b model can generate tokens as quickly as codellama-7b with an 8K prompt.

How? The model must use multi-query attention.

Here's why...
(1/10)
For large context windows and large-batch inference, generation speed is bottlenecked by KV cache size.

To illustrate this, let’s look at our 8K context window with multi-head attention at a batch size of 16.
(2/10)
To generate each token, we need to spend 7e9*2*16 = 224 GFLOPs. [1]

We won’t be compute bound as a single A100 has 300TFLOPs, meaning it could sustain >1000 tokens/s without the memory bottleneck.
(3/10)
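As a sanity check, here's that per-step compute in a few lines of Python (a sketch using the thread's numbers; the ~300 TFLOPS figure is roughly A100 bf16 peak):

```python
# Rough per-step compute check (numbers from the thread, not a benchmark).
params = 7e9            # 7B-parameter model
batch_size = 16
flops_per_step = 2 * params * batch_size          # ~2 FLOPs per param per token
print(f"{flops_per_step / 1e9:.0f} GFLOPs per decoding step")       # ~224 GFLOPs

a100_flops = 300e12     # ~300 TFLOPS of bf16 compute on one A100
steps_per_s = a100_flops / flops_per_step
print(f"compute ceiling ~ {steps_per_s:.0f} tokens/s per sequence")  # >1000 tok/s
```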
We’ll also need to read the model weights from memory, costing us 7e9*2 = 14GB of reads per step. [2]

Finally, we need to read the KV cache on each token. This is:

128 attn dim * 32 kv heads * 32 layers * 2 (key + value) * 16 batch size * 8K tokens * 2 bytes = 67GB!
(4/10)
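The same KV cache arithmetic as a sketch (variable names are mine; the dimensions are codellama-7b's: 128 head dim, 32 heads, 32 layers):

```python
# Memory read per decoding step with multi-head attention.
bytes_per_param = 2             # bf16 / fp16
weight_bytes = 7e9 * bytes_per_param                      # 14 GB of weights

head_dim   = 128
kv_heads   = 32                 # MHA: every attention head has its own K/V
layers     = 32
batch_size = 16
seq_len    = 8_000              # "8K" context

kv_cache_bytes = (head_dim * kv_heads * layers * 2        # 2 = key + value
                  * batch_size * seq_len * bytes_per_param)
print(f"KV cache ~ {kv_cache_bytes / 1e9:.0f} GB")         # ~67 GB
```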
Now, how fast should we be decoding in theory? Well, using tp=2 on two A100s, we’re reading 81GB on each token or 40GB/A100.

Assuming we can achieve 70% of the 2TB/s bandwidth, that’s 35 tok/s.
(5/10)
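Putting the two reads together gives the bandwidth-bound decode rate (the 70% efficiency is the thread's assumption, not a measurement):

```python
# Memory-bandwidth-bound decode speed on 2 A100s with tensor parallelism (tp=2).
weight_bytes   = 14e9
kv_cache_bytes = 67e9
bytes_per_gpu  = (weight_bytes + kv_cache_bytes) / 2       # ~40 GB read per step per GPU

a100_bandwidth = 2e12            # ~2 TB/s HBM bandwidth per A100
efficiency     = 0.70            # assume ~70% of peak is achievable
print(f"~{efficiency * a100_bandwidth / bytes_per_gpu:.0f} tokens/s")   # ~35 tok/s
```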
Let’s swap out codellama for our multi-query model. Notice that we can directly trade context length for heads in our KV cache memory.

We multiply our context window by 32 and divide the number of KV heads by 32, giving us the exact same KV cache size.
(6/10)
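A small sketch of that trade (the helper function is mine, reusing the dimensions above):

```python
# KV cache size: MQA (1 KV head) at 256K tokens costs the same as MHA (32 KV heads) at 8K.
def kv_cache_gb(kv_heads, seq_len, head_dim=128, layers=32, batch=16, dtype_bytes=2):
    return head_dim * kv_heads * layers * 2 * batch * seq_len * dtype_bytes / 1e9

print(kv_cache_gb(kv_heads=32, seq_len=8_000))     # ~67 GB, multi-head at 8K
print(kv_cache_gb(kv_heads=1,  seq_len=256_000))   # ~67 GB, multi-query at 256K
```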
This means for our 256K token prompt, we're decoding at 35 tok/s at a reasonable batch size.
(7/10)
Now, prefill is a whole different story, and actually training the model to pay attention to 256K tokens is another can of worms...
(8/10)
But MQA models give such massive inference wins for long context that they may be worth it despite slightly worse scaling perf than MHA or GQA.

Plus, we’re already working with an MQA model that beats codellama 7b on code evals.
(9/10)
Finally, 256K is not the limit of how far you can take things with vanilla transformers for reasonable inference perf.

There are a few more long context tricks that, in theory, should be able to preserve that same perf up to 1M+ ;)

(10/10)
[1] Simple model FLOPs calc
We can also account for attention FLOPs, but notice they won't change the conclusion: we stay bound by reading the KV cache, not by compute:

For each of the attention heads and sequences in the batch, we’re doing a 1 x 128 dim (query) times a 128 x 256k (keys) matmul.

That’s 2 * 128 * 256K FLOPs, then * 16 batch size * 32 query heads * 32 layers.

Then the attention weights are a 1 x 256K vector, multiplying a 256K x 128 matrix (the values), giving 2 * 256K * 128 FLOPs, again * 16 batch size * 32 query heads * 32 layers. Together, that's about 2 TFLOPs per token.
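That arithmetic as a sketch (variable names are mine; the factor of 32 counts query heads, since MQA keeps 32 query heads even with a single KV head):

```python
# Attention FLOPs per decoding step at 256K context.
head_dim, query_heads, layers, batch, seq_len = 128, 32, 32, 16, 256_000

qk_flops = 2 * head_dim * seq_len * batch * query_heads * layers   # scores: q @ K^T
av_flops = 2 * seq_len * head_dim * batch * query_heads * layers   # output: weights @ V
print(f"~{(qk_flops + av_flops) / 1e12:.1f} TFLOPs per step")       # ~2 TFLOPs
```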

On our 2 GPUs, compute alone would allow roughly 300 TPS (probably closer to 200 TPS in practice).

But we're still bottlenecked by memory bandwidth, which keeps us at 35 TPS.
[2] We need to read the model weights to generate each token, and we're assuming bf16 or fp16 weights, hence the factor of 2 bytes per parameter.
