KV caching in LLMs, clearly explained (with visuals):
KV caching is a technique used to speed up LLM inference.
Before understanding the internal details, look at the inference speed difference in the video:
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
Let's dive in!
To understand KV caching, we must know how LLMs output tokens.
- Transformer produces hidden states for all tokens.
- Hidden states are projected to vocab space.
- Logits of the last token are used to generate the next token.
- Repeat for subsequent tokens.
Check this👇
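If you prefer code, here's a minimal sketch of that loop (random weights stand in for a real transformer; the point is the shapes and the fact that only the last token's logits matter):

```python
import torch

torch.manual_seed(0)
vocab_size, hidden_size = 100, 16

# Toy stand-ins for a real transformer and LM head (random weights,
# only meant to show the data flow of the decoding loop).
embed   = torch.nn.Embedding(vocab_size, hidden_size)
block   = torch.nn.Linear(hidden_size, hidden_size)   # "transformer" placeholder
lm_head = torch.nn.Linear(hidden_size, vocab_size)    # projects hidden states to vocab space

tokens = [1, 7, 42]                                   # prompt token ids
for _ in range(5):
    hidden = block(embed(torch.tensor(tokens)))       # hidden states for ALL tokens: [T, hidden]
    logits = lm_head(hidden)                          # [T, vocab]
    next_token = int(logits[-1].argmax())             # only the LAST token's logits pick the next token
    tokens.append(next_token)

print(tokens)                                         # prompt + 5 generated token ids
```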
Thus, to generate a new token, we only need the hidden state of the most recent token.
None of the other hidden states are required.
Next, let's see how the attention mechanism inside each transformer layer computes that last hidden state.
During attention:
The last row of the query-key product (QKᵀ) involves:
- the last query vector.
- all key vectors.
Also, the last row of the final attention result involves:
- the last query vector.
- all key & value vectors.
Check this visual to understand better:
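And here's the same fact as a tiny runnable check (random tensors, single head, causal mask omitted for brevity): the last row of softmax(QKᵀ)V is reproduced exactly from just the last query plus all keys and values.

```python
import torch

torch.manual_seed(0)
T, d = 5, 8                                              # sequence length, head dimension
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

# Full attention over all T tokens.
full = torch.softmax(Q @ K.T / d**0.5, dim=-1) @ V       # [T, d]

# The last row again, using ONLY the last query + all keys & values.
last = torch.softmax(Q[-1:] @ K.T / d**0.5, dim=-1) @ V  # [1, d]

print(torch.allclose(full[-1:], last))                   # True
```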
The above insight suggests that to generate a new token, every attention operation in the network only needs:
- query vector of the last token.
- all key & value vectors.
But, there's one more key insight here.
As we generate new tokens:
- The KV vectors used for ALL previous tokens do not change.
Thus, we only need to compute the KV vectors for the most recently generated token.
Rest of the KV vectors can be retrieved from a cache to save compute and time.
This is called KV caching!
To reiterate, instead of redundantly computing KV vectors of all context tokens, cache them.
To generate a token:
- Generate the QKV vectors for the most recently generated token.
- Get all other KV vectors from cache.
- Compute attention.
Check this👇
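Here's that cached decoding loop as a minimal single-head sketch (random weights; `attend` is a hypothetical helper, not a real library API):

```python
import torch

torch.manual_seed(0)
d = 8                                          # head dimension
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

K_cache, V_cache = [], []                      # the KV cache

def attend(x_new):
    # x_new: [1, d] embedding of the newest token.
    q = x_new @ Wq                             # query for the new token ONLY
    K_cache.append(x_new @ Wk)                 # compute this token's K, V once...
    V_cache.append(x_new @ Wv)                 # ...older K, V come straight from the cache
    K, V = torch.cat(K_cache), torch.cat(V_cache)
    return torch.softmax(q @ K.T / d**0.5, dim=-1) @ V

for x in torch.randn(6, 1, d):                 # feed 6 token embeddings one at a time
    out = attend(x)

print(out.shape)                               # torch.Size([1, 8]): output for the latest token
```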
KV caching saves time during inference.
In fact, this is why ChatGPT takes longer to generate the first token than the subsequent ones.
During that time, it is computing the KV cache of the prompt.
That said, KV cache also takes a lot of memory.
Llama3-70B has:
- total layers = 80
- hidden size = 8k
- max output size = 4k
Here:
- Every token takes up ~2.5 MB in KV cache.
- 4k tokens will take up ~10.5 GB.
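For the curious, the back-of-the-envelope math behind those figures (a sketch assuming fp16, i.e. 2 bytes per value, and one full-width K and V vector per layer; Llama3-70B actually uses grouped-query attention, so its real footprint is smaller):

```python
layers, hidden, fp16_bytes = 80, 8192, 2

per_token = 2 * layers * hidden * fp16_bytes    # 2 = one K + one V vector per layer
print(per_token / 1e6)                          # ≈ 2.6 MB per token

print(per_token * 4096 / 1e9)                   # ≈ 10.7 GB for a 4K context
```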
More users → more memory.
I'll cover KV cache optimizations soon.
That's a wrap!
If you enjoyed this tutorial:
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
"Our GPT model generates 100 tokens in 42 seconds.
How do you make it 5x faster?"
You: "I'll allocate more GPUs for faster generation."
Interview over.
Here's what you missed:
The real bottleneck isn't compute, it's redundant computation.
Without KV caching, your model recalculates keys and values for each token, repeating work.
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
Let's dive in to understand how it works!
To understand KV caching, we must know how LLMs output tokens.
- Transformer produces hidden states for all tokens.
- Hidden states are projected to the vocab space.
- Logits of the last token are used to generate the next token.
- Repeat for subsequent tokens.