KV caching in LLMs, clearly explained (with visuals):
KV caching is a technique used to speed up LLM inference.
Before understanding the internal details, look at the inference speed difference in the video:
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
Let's dive in!
To understand KV caching, we must know how LLMs output tokens.
- Transformer produces hidden states for all tokens.
- Hidden states are projected to vocab space.
- Logits of the last token are used to generate the next token.
- Repeat for subsequent tokens.
Check this👇
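If you prefer code, here's a minimal sketch of that loop (random weights stand in for a real transformer; the point is the shapes and the fact that only the last token's logits matter):

```python
import torch

torch.manual_seed(0)
vocab_size, hidden_size = 100, 16

# Toy stand-ins for a real transformer and LM head (random weights,
# only meant to show the data flow of the decoding loop).
embed   = torch.nn.Embedding(vocab_size, hidden_size)
block   = torch.nn.Linear(hidden_size, hidden_size)   # "transformer" placeholder
lm_head = torch.nn.Linear(hidden_size, vocab_size)    # projects hidden states to vocab space

tokens = [1, 7, 42]                                   # prompt token ids
for _ in range(5):
    hidden = block(embed(torch.tensor(tokens)))       # hidden states for ALL tokens: [T, hidden]
    logits = lm_head(hidden)                          # [T, vocab]
    next_token = int(logits[-1].argmax())             # only the LAST token's logits pick the next token
    tokens.append(next_token)

print(tokens)                                         # prompt + 5 generated token ids
```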
Thus, to generate a new token, we only need the hidden state of the most recent token.
None of the other hidden states are required.
Next, let's see how the attention mechanism inside each transformer layer computes that last hidden state.
During attention:
The last row of the query-key product (QKᵀ) involves:
- the last query vector.
- all key vectors.
Also, the last row of the final attention result involves:
- the last query vector.
- all key & value vectors.
Check this visual to understand better:
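And here's the same fact as a tiny runnable check (random tensors, single head, causal mask omitted for brevity): the last row of softmax(QKᵀ)V is reproduced exactly from just the last query plus all keys and values.

```python
import torch

torch.manual_seed(0)
T, d = 5, 8                                              # sequence length, head dimension
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

# Full attention over all T tokens.
full = torch.softmax(Q @ K.T / d**0.5, dim=-1) @ V       # [T, d]

# The last row again, using ONLY the last query + all keys & values.
last = torch.softmax(Q[-1:] @ K.T / d**0.5, dim=-1) @ V  # [1, d]

print(torch.allclose(full[-1:], last))                   # True
```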
The above insight suggests that to generate a new token, every attention operation in the network only needs:
- query vector of the last token.
- all key & value vectors.
But, there's one more key insight here.
As we generate new tokens:
- The KV vectors used for ALL previous tokens do not change.
Thus, we only need to compute the KV vectors for the most recently generated token.
Rest of the KV vectors can be retrieved from a cache to save compute and time.
This is called KV caching!
To reiterate, instead of redundantly computing KV vectors of all context tokens, cache them.
To generate a token:
- Generate the QKV vectors for the most recently generated token.
- Get all other KV vectors from cache.
- Compute attention.
Check this👇
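Here's that cached decoding loop as a minimal single-head sketch (random weights; `attend` is a hypothetical helper, not a real library API):

```python
import torch

torch.manual_seed(0)
d = 8                                          # head dimension
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

K_cache, V_cache = [], []                      # the KV cache

def attend(x_new):
    # x_new: [1, d] embedding of the newest token.
    q = x_new @ Wq                             # query for the new token ONLY
    K_cache.append(x_new @ Wk)                 # compute this token's K, V once...
    V_cache.append(x_new @ Wv)                 # ...older K, V come straight from the cache
    K, V = torch.cat(K_cache), torch.cat(V_cache)
    return torch.softmax(q @ K.T / d**0.5, dim=-1) @ V

for x in torch.randn(6, 1, d):                 # feed 6 token embeddings one at a time
    out = attend(x)

print(out.shape)                               # torch.Size([1, 8]): output for the latest token
```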
KV caching saves time during inference.
In fact, this is why ChatGPT takes longer to generate the first token than the subsequent ones.
During that time, it is computing the KV cache of the prompt.
That said, KV cache also takes a lot of memory.
Llama3-70B has:
- total layers = 80
- hidden size = 8k
- max output size = 4k
Here:
- Every token takes up ~2.5 MB in KV cache.
- 4k tokens will take up ~10.5 GB.
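For the curious, the back-of-the-envelope math behind those figures (a sketch assuming fp16, i.e. 2 bytes per value, and one full-width K and V vector per layer; Llama3-70B actually uses grouped-query attention, so its real footprint is smaller):

```python
layers, hidden, fp16_bytes = 80, 8192, 2

per_token = 2 * layers * hidden * fp16_bytes    # 2 = one K + one V vector per layer
print(per_token / 1e6)                          # ≈ 2.6 MB per token

print(per_token * 4096 / 1e9)                   # ≈ 10.7 GB for a 4K context
```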
More users → more memory.
I'll cover KV cache optimizations soon.
That's a wrap!
If you enjoyed this tutorial:
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
"Our GPT model generates 100 tokens in 42 seconds.
How do you make it 5x faster?"
You: "I'll allocate more GPUs for faster generation."
Interview over.
Here's what you missed:
The real bottleneck isn't compute, it's redundant computation.
Without KV caching, your model recalculates keys and values for each token, repeating work.
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
Let's dive in to understand how it works!
To understand KV caching, we must know how LLMs output tokens.
- Transformer produces hidden states for all tokens.
- Hidden states are projected to the vocab space.
- Logits of the last token are used to generate the next token.
- Repeat for subsequent tokens.