Aman Sanger
May 12 · 12 tweets · 4 min read
LLaMA and many recent open-source models have a significant architectural limitation.

They use multi-head attention instead of multi-query attention (which is used by PaLM and probably Claude 100K).

This can result in inference slowdowns of up to 30x.

Here's the math behind why. (1/12)
Let's consider the widely used 7B-param LLaMA architecture.

It has 32 layers, 32 heads, and d_k = d_v = 128.

The key issue with multi-head attention is the cost of repeatedly reading the previously computed attention keys and values from memory during inference.

Why? (2/12)
How does multi-head attention work when generating the Nth token?

To generate each token, we compute a query for each head. Then we look up the keys and values for the previous tokens and apply the op:

softmax(qᵀK / √d_k) · Vᵀ

where, per head, q is (d_k, 1), K is (d_k, N), and V is (d_v, N).

(3/12)
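Here's a minimal numpy sketch of that decode step for a single layer under multi-head attention. The shapes follow the thread's numbers (32 heads, d_k = d_v = 128); the function and variable names are purely illustrative, not LLaMA's actual implementation.

```python
import numpy as np

n_heads, d_k, d_v, N = 32, 128, 128, 1024   # illustrative sizes

# Per-layer KV cache: keys/values for the N previous tokens, one set per head.
# In practice these are stored in fp16 (2 bytes per value).
K_cache = np.random.randn(n_heads, d_k, N)
V_cache = np.random.randn(n_heads, d_v, N)

def mha_decode_step(q):
    """q: (n_heads, d_k) -- one query per head for the token being generated."""
    outputs = []
    for h in range(n_heads):
        K, V = K_cache[h], V_cache[h]        # (d_k, N), (d_v, N): read from memory
        scores = q[h] @ K / np.sqrt(d_k)     # (N,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()             # softmax over the N previous tokens
        outputs.append(V @ weights)          # (d_v,) attention output for this head
    return np.stack(outputs)                 # (n_heads, d_v)

out = mha_decode_step(np.random.randn(n_heads, d_k))
```

The arithmetic here is tiny; the expensive part is that every head reads its own (d_k × N) and (d_v × N) cache slices from memory for every generated token.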
When generating the N+1th token, each of the 32 layers and 32 heads needs to read its (128 × N) K and V matrices from the cache.

For our 7B-param model, that's 32 · 32 · 128 · 2 · N cached values; at 2 bytes each, that's roughly 520 KB × N of data.

Here we hit the wall on memory bandwidth. (4/12)
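A quick sanity check on that figure, assuming 2 bytes per cached value (fp16); purely back-of-the-envelope:

```python
n_layers, n_heads, d_head = 32, 32, 128   # LLaMA-7B attention dimensions
bytes_per_value = 2                        # fp16

# K and V entries cached per token of context, summed over all layers and heads
kv_bytes_per_token = n_layers * n_heads * d_head * 2 * bytes_per_value
print(kv_bytes_per_token)                  # 524288 bytes, i.e. ~520 KB per cached token
```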
This gets incredibly out of hand for large N. For a 64K-context model, it means ~33 GB of KV cache has to be read from memory for every token generated!

To achieve reasonable throughput, larger batch sizes are a must. With a batch of 16, the KV cache grows to 528 GB! (5/12)
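Scaling the per-token figure up (a rough sketch; using a 64,000-token context and decimal GB, which lands near the thread's rounded numbers):

```python
kv_bytes_per_token = 32 * 32 * 128 * 2 * 2       # ~520 KB, from the previous step

context_len = 64_000
batch_size = 16

per_sequence = kv_bytes_per_token * context_len   # ~33.6e9 bytes, i.e. ~33 GB
per_batch = per_sequence * batch_size             # ~537e9 bytes, i.e. the thread's ~528 GB
print(per_sequence / 1e9, per_batch / 1e9)
```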
To fit in memory, we'd need to split the cache across several GPUs!

With a naive implementation, we are limited by the 1.5 TB/s of DRAM bandwidth or the 1.6 TB/s of inter-GPU bandwidth.

So our time per token is (528 GB + 14 GB) / 1.6 TB/s ≈ 340 ms/token! [1]

(6/12)

[1] 14 GB is for the model weights: 7B params at 2 bytes each.
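The same estimate in code: a crude bandwidth-bound model that assumes the full KV cache and the weights are streamed once per generated token (real systems shard and overlap this, so treat it as a sketch):

```python
kv_cache_bytes = 528e9    # batch 16, 64K context, multi-head attention
weight_bytes = 14e9       # 7B params at 2 bytes each
bandwidth = 1.6e12        # bytes/s: the thread's inter-GPU figure

time_per_token = (kv_cache_bytes + weight_bytes) / bandwidth
print(f"{time_per_token * 1e3:.0f} ms/token")   # ~339 ms/token, i.e. roughly 340 ms
```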
Multi-query attention gives a massive speedup here.

We use the same attention formula as before, but this time K,V are shared across all 32 attention heads!

This means a 32x reduction in our KV cache size. When memory bound, this gives up to a 32x speedup!

(7/12)
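A sketch of the same decode step with multi-query attention: queries stay per-head, but there is a single shared K/V cache per layer (again illustrative, not a real implementation):

```python
import numpy as np

n_heads, d_k, d_v, N = 32, 128, 128, 1024

# One shared K/V cache per layer instead of one per head: 32x less data to read.
K_cache = np.random.randn(d_k, N)
V_cache = np.random.randn(d_v, N)

def mqa_decode_step(q):
    """q: (n_heads, d_k) -- queries are still computed per head."""
    scores = q @ K_cache / np.sqrt(d_k)                  # (n_heads, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax per head
    return weights @ V_cache.T                           # (n_heads, d_v)

out = mqa_decode_step(np.random.randn(n_heads, d_k))
```

The per-head compute is unchanged; what shrinks by a factor of n_heads is the K/V data that has to be read from memory for every generated token.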
Let's work out some concrete numbers. The KV cache shrinks to 528 GB / 32 = 16.5 GB, and the model weights are still 14 GB.

That's about 30 GB of total data moved per token at 1.5 TB/s, giving ~20 ms/token: a 17x speedup...

(8/12)
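The multi-query version of the same crude estimate:

```python
kv_cache_bytes = 528e9 / 32   # K/V shared across 32 heads: ~16.5 GB
weight_bytes = 14e9
bandwidth = 1.5e12            # bytes/s: the DRAM figure used in the thread

time_per_token = (kv_cache_bytes + weight_bytes) / bandwidth
print(f"{time_per_token * 1e3:.1f} ms/token")   # ~20.3 ms/token, ~17x faster than ~340 ms
```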
And the larger we make our batch size, the more substantial this speedup will be.

In the limit, the speedup should approach 32x.

In PaLM, they tested this at context windows up to 40K tokens and with very large batch sizes.

(9/12)
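To see why the speedup approaches 32x, here is the ratio under the same bandwidth-bound sketch: the KV-cache term grows with batch size (and context length) while the weight term stays fixed, so the ratio climbs toward the number of heads. The function below is illustrative, not from the thread.

```python
def est_speedup(batch_size, context_len=64_000, n_heads=32,
                kv_bytes_per_token=520e3, weight_bytes=14e9):
    """Estimated multi-head vs. multi-query time ratio when memory-bandwidth bound."""
    kv_mha = kv_bytes_per_token * context_len * batch_size   # multi-head KV traffic
    kv_mqa = kv_mha / n_heads                                 # K/V shared across heads
    return (kv_mha + weight_bytes) / (kv_mqa + weight_bytes)  # bandwidth cancels out

for b in (1, 4, 16, 64, 256):
    print(b, round(est_speedup(b), 1))   # ~3x, ~8x, ~18x, ~27x, ~30x -> approaching 32x
```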
As a caveat, these are all incredibly crude estimates/calculations, and some of my math may be way off.

For example, in practice, memory bandwidth utilization will not be as high as the peak of 1.5 TB/s.

(10/12)
Furthermore, many naive transformer implementations move more total data than just the model weights plus the KV cache.

This is often due to unnecessary repeated work (for example, not fusing ops like the softmax).

(11/12)
But the original multi-query attention paper (by the goat Noam Shazeer) reports decoding speedups of over 10x at much smaller sequence lengths.

And PaLM, at much longer sequence lengths (up to 40K), sees far more substantial speedups than even 30x.

(12/12)
