You ask it “What is an LLM?” and get gibberish like “try peter hand and hello 448Sn”.
It hasn’t seen any data yet; its weights are purely random.
Check this 👇
1️⃣ Pre-training
This stage teaches the LLM the basics of language by training it on massive corpora to predict the next token. This way, it absorbs grammar, world facts, etc.
But it’s not good at conversation because when prompted, it just continues the text.
Check this 👇
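Here's a minimal sketch of that next-token-prediction objective (the tiny embedding-plus-linear model and random token IDs below are placeholders, not a real LLM):

```python
# Pre-training objective sketch: position i predicts token i+1.
# The model here is a toy stand-in for a transformer.
import torch
import torch.nn.functional as F

vocab_size, seq_len, d_model = 1000, 16, 64

model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),   # token IDs -> vectors
    torch.nn.Linear(d_model, vocab_size),      # vectors -> next-token logits
)

tokens = torch.randint(0, vocab_size, (1, seq_len))  # stand-in for corpus text
logits = model(tokens)                               # (1, seq_len, vocab_size)

# Shift by one: cross-entropy between predictions at position i and token i+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
loss.backward()  # gradients nudge the randomly initialized weights
```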
2️⃣ Instruction fine-tuning
To make it conversational, we do Instruction Fine-tuning by training on instruction-response pairs. This helps it learn how to follow prompts and format replies.
Now it can:
- Answer questions
- Summarize content
- Write code, etc.
Check this 👇
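A minimal sketch of how an instruction-response pair becomes a training example (the prompt template, example pairs, and `tokenizer` argument are hypothetical):

```python
# SFT sketch: wrap the instruction in a prompt template and mask the loss so
# the model is only trained to produce the response tokens.
pairs = [
    {"instruction": "Summarize: The cat sat on the mat.", "response": "A cat sat on a mat."},
    {"instruction": "What is 2 + 2?", "response": "4"},
]

def to_training_example(pair, tokenizer):
    prompt = f"### Instruction:\n{pair['instruction']}\n\n### Response:\n"
    prompt_ids = tokenizer.encode(prompt)
    response_ids = tokenizer.encode(pair["response"])
    # -100 labels are ignored by the loss: the prompt is context, not a target.
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": prompt_ids + response_ids, "labels": labels}
```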
At this point, we have likely:
- Used up most of the raw internet text available for pre-training.
- Exhausted the budget for human-labeled instruction-response data.
So what can we do to further improve the model?
We enter into the territory of Reinforcement Learning (RL).
Let's learn next 👇
3️⃣ Preference fine-tuning (PFT)
You must have seen the screen in ChatGPT that asks: "Which response do you prefer?"
That's not just feedback; it's valuable human preference data.
OpenAI uses it to fine-tune their models via preference fine-tuning.
Check this 👇
In PFT:
The user chooses between two responses, producing human preference data.
A reward model is then trained to predict human preferences, and the LLM is updated with RL to maximize the predicted reward.
Check this 👇
The above process is called RLHF (Reinforcement Learning from Human Feedback), and the algorithm typically used to update the model's weights is PPO (Proximal Policy Optimization).
It teaches the LLM to align with humans even when there’s no "correct" answer.
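A minimal sketch of the reward-model objective behind this (a pairwise Bradley-Terry loss; the scores below are placeholders, not outputs of a real model):

```python
# Reward-model sketch: the preferred response should score higher than the
# rejected one, so we minimize -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn.functional as F

reward_chosen = torch.tensor([1.3, 0.2], requires_grad=True)    # r(prompt, preferred)
reward_rejected = torch.tensor([0.4, 0.9], requires_grad=True)  # r(prompt, rejected)

loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()  # pushes preferred scores up and rejected scores down

# PPO then updates the LLM to maximize this learned reward, typically with a
# KL penalty that keeps it close to the instruction-tuned model.
```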
But we can improve the LLM even more.
Let's learn next👇
4️⃣ Reasoning fine-tuning
In reasoning tasks (maths, logic, etc.), there's usually just one correct response and a defined series of steps to obtain the answer.
So we don’t need human preferences, and we can use correctness as the signal.
This is called reasoning fine-tuning👇
Steps:
- The model generates an answer to a prompt.
- The answer is compared to the known correct answer.
- Based on the correctness, we assign a reward.
This is called Reinforcement Learning with Verifiable Rewards (RLVR).
GRPO (Group Relative Policy Optimization) by DeepSeek is a popular algorithm for it.
Check this👇
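A minimal sketch of a verifiable reward (the `extract_final_answer` helper is a naive, hypothetical stand-in for real answer parsing):

```python
# Verifiable-reward sketch: reward = 1 if the model's final answer matches the
# known ground truth, else 0. No human preference data needed.
def extract_final_answer(completion: str) -> str:
    return completion.strip().split()[-1]  # naive: take the last whitespace-separated token

def verifiable_reward(completion: str, ground_truth: str) -> float:
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

print(verifiable_reward("So the answer is 42", "42"))  # 1.0
print(verifiable_reward("So the answer is 41", "42"))  # 0.0

# In GRPO, the model samples a group of completions per prompt, scores each one
# with a reward like this, and normalizes rewards within the group to get the
# advantages used for the policy update.
```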
Those were the 4 stages of training an LLM from scratch.
- Start with a randomly initialized model.
- Pre-train it on large-scale corpora.
- Use instruction fine-tuning to make it follow commands.
- Use preference & reasoning fine-tuning to sharpen responses.
Check this 👇
If you found it insightful, reshare it with your network.
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAG.
- Google Maps uses graph ML to predict ETAs.
- Netflix uses graph ML for recommendations.
- Spotify uses graph ML for recommendations.
- Pinterest uses graph ML for recommendations.
Here are 6 must-know ways for graph feature engineering (with code):
Just as image, text, and tabular datasets have features, so do graph datasets.
This means when building models on graph datasets, we can engineer these features to achieve better performance.
Let's discuss some feature engineering techniques below!
First, let’s create a dummy social-network graph dataset with accounts and followers (which are themselves accounts).
We create two DataFrames: an accounts DataFrame and a followers DataFrame, sketched below.
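A minimal sketch of that setup (the column names and values are illustrative placeholders):

```python
import pandas as pd

# Nodes: one row per account.
accounts = pd.DataFrame({
    "account_id": [1, 2, 3, 4],
    "name": ["alice", "bob", "carol", "dave"],
})

# Edges: one row per directed "follows" relationship (follower_id -> account_id).
followers = pd.DataFrame({
    "follower_id": [2, 3, 3, 4],
    "account_id":  [1, 1, 2, 1],
})
```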
"Our GPT model generates 100 tokens in 42 seconds.
How do you make it 5x faster?"
You: "I'll allocate more GPUs for faster generation."
Interview over.
Here's what you missed:
The real bottleneck isn't compute, it's redundant computation.
Without KV caching, your model recomputes the keys and values of every previous token at each generation step, repeating work it has already done.
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
Let's dive in to understand how it works!
To understand KV caching, we must know how LLMs output tokens.
- The transformer produces hidden states for all input tokens.
- The hidden states are projected to the vocabulary space.
- The logits of the last token are used to generate the next token.
- The new token is appended and the process repeats.
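With that loop in mind, here's a minimal sketch of KV caching; the shapes and projections are toy placeholders, not a full attention implementation:

```python
# KV-caching sketch: cache keys/values of past tokens so each decoding step
# only projects the newest token instead of recomputing everything.
import torch

d_model = 64
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

k_cache, v_cache = [], []

def decode_step(new_hidden):              # (1, d_model): hidden state of the newest token
    k_cache.append(new_hidden @ W_k)      # compute K/V for the new token only
    v_cache.append(new_hidden @ W_v)
    K = torch.cat(k_cache, dim=0)         # past K/V come straight from the cache
    V = torch.cat(v_cache, dim=0)
    attn = torch.softmax(new_hidden @ K.T / d_model**0.5, dim=-1)
    return attn @ V                       # attention output for the newest token

for _ in range(5):                        # without the cache, every step would
    decode_step(torch.randn(1, d_model))  # recompute K/V for all previous tokens
```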
You're in a Research Scientist interview at OpenAI.
The interviewer asks:
"How would you expand the context length of an LLM from 2K to 128K tokens?"
You: "I will fine-tune the model on longer docs with 128K context."
Interview over.
Here's what you missed:
Extending the context window isn't just about larger matrices.
In a traditional transformer, increasing the number of tokens by 8x increases attention's memory needs by 64x due to its quadratic complexity. Refer to the image below!
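A quick back-of-the-envelope sketch of that scaling (pure arithmetic, nothing model-specific):

```python
# The attention matrix stores one score per token pair, so it grows with seq_len^2.
base = 2_048
for seq_len in (2_048, 16_384, 131_072):   # 2K, 16K (8x), 128K (64x)
    ratio = seq_len**2 / base**2
    print(f"{seq_len:>7} tokens -> {ratio:>6.0f}x the attention entries")
# 8x the tokens -> 64x the entries; 64x the tokens -> 4096x.
```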
So, how do we manage it?
continue...👇
1) Sparse Attention
It limits the attention computation to a subset of tokens by:
- Using local attention (tokens attend only to their neighbors).
- Letting the model learn which tokens to focus on.
But this comes with a trade-off between computational cost and model performance.
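A minimal sketch of local (sliding-window) attention, one common form of sparse attention; the window size and sequence length below are arbitrary:

```python
# Local-attention sketch: each token attends only to itself and the previous
# `window - 1` tokens, so attention cost grows linearly with sequence length.
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    return (j <= i) & (i - j < window)       # causal AND within the local window

print(local_attention_mask(seq_len=8, window=3).int())
# Each row has at most 3 ones instead of up to 8, i.e. O(seq_len * window)
# scores instead of O(seq_len^2).
```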