Temperature in LLMs, clearly explained (with code):
Let's prompt OpenAI GPT-3.5 with a low temperature value twice.
Both times, the LLM produces identical responses.
Check the response below👇
Now, let's prompt it with a high temperature value.
This time, it produces gibberish. Check the output below👇
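Here's a minimal sketch of that experiment, assuming the official openai Python client and the gpt-3.5-turbo model (the prompt is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str, temperature: float) -> str:
    # Same prompt every time, only the temperature changes
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

prompt = "Write a one-line tagline for a coffee shop."  # placeholder prompt

print(ask(prompt, temperature=0.0))  # low temperature, run #1
print(ask(prompt, temperature=0.0))  # low temperature, run #2: (nearly) identical to run #1
print(ask(prompt, temperature=2.0))  # high temperature: often degenerates into gibberish
```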
What is going on here? Let's dive in!
Text-generating LLMs are like classification models whose output layer spans the entire vocabulary.
However, instead of selecting the best token, they "sample" the prediction.
So even if “Token 1” has the highest softmax score, it may not be chosen due to sampling👇
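Here's a tiny sketch of that difference with NumPy (the logits are made up for illustration):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])        # scores for 4 hypothetical tokens
probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> roughly [0.57, 0.21, 0.13, 0.09]

greedy_pick = probs.argmax()                          # always token 0
sampled_pick = np.random.choice(len(probs), p=probs)  # usually token 0, but not always

print(probs.round(2), greedy_pick, sampled_pick)
```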
The impact of sampling is controlled using the Temperature parameter.
Temperature introduces the following tweak in the softmax function 👇
If the temperature is low, the probability distribution becomes sharply peaked: it looks more like a hard “max” than a “soft-max”.
This means the sampling process will almost certainly choose the token with the highest probability. This makes the generation process (nearly) greedy.
Check this👇
If the temperature is high, the probabilities start to look like a uniform distribution:
This means the sampling process may select almost any token. This makes the generation process heavily stochastic, which explains the gibberish we saw earlier.
Check this👇
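Here's a rough sketch of both regimes, scaling the same made-up logits with different temperatures:

```python
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.asarray(logits, dtype=float) / T  # the temperature tweak: divide logits by T
    z -= z.max()                             # for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.1]
print(softmax_with_temperature(logits, T=0.1))   # ~[1, 0, 0, 0] -> (nearly) greedy
print(softmax_with_temperature(logits, T=1.0))   # the regular softmax
print(softmax_with_temperature(logits, T=10.0))  # close to uniform -> any token may be sampled
```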
Some best practices for using temperature (T):
- Set a low T value to generate predictable responses.
- Set a high T value to generate more random and creative responses.
- An extremely high T value rarely has any real utility, as shown below👇
That's a wrap!
If you enjoyed this tutorial:
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
- Google Maps uses graph ML to predict ETA
- Netflix uses graph ML in recommendation
- Spotify uses graph ML in recommendation
- Pinterest uses graph ML in recommendation
Here are 6 must-know ways for graph feature engineering (with code):
Just as image, text, and tabular datasets have features, so do graph datasets.
This means when building models on graph datasets, we can engineer these features to achieve better performance.
Let's discuss some feature engineering techniques below!
First, let’s create a dummy social networking graph dataset with accounts and followers (which will also be accounts).
We create the two DataFrames shown below, an accounts DataFrame and a followers DataFrame.
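The exact DataFrames are shown in the image; here's a made-up sketch of their shape (column names and values are placeholders):

```python
import pandas as pd

# Accounts: one row per account, with some node-level attributes
accounts = pd.DataFrame({
    "account_id": [1, 2, 3, 4],
    "name": ["alice", "bob", "carol", "dave"],
    "num_posts": [120, 45, 300, 8],
})

# Followers: one row per edge (follower -> followed); both columns hold account IDs
followers = pd.DataFrame({
    "follower_id": [2, 3, 3, 4],
    "followed_id": [1, 1, 2, 3],
})

print(accounts)
print(followers)
```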
"Our GPT model generates 100 tokens in 42 seconds.
How do you make it 5x faster?"
You: "I'll allocate more GPUs for faster generation."
Interview over.
Here's what you missed:
The real bottleneck isn't compute; it's redundant computation.
Without KV caching, your model recalculates keys and values for each token, repeating work.
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
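You can roughly reproduce this yourself: Hugging Face transformers lets you toggle the cache via use_cache in generate(). The model below is just a placeholder, and exact timings depend on your hardware:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in your own model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tok("The quick brown fox", return_tensors="pt")

def timed_generate(use_cache: bool) -> float:
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=100, do_sample=False, use_cache=use_cache)
    return time.perf_counter() - start

print(f"with KV cache:    {timed_generate(True):.2f}s")
print(f"without KV cache: {timed_generate(False):.2f}s")
```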
Let's dive in to understand how it works!
To understand KV caching, we must know how LLMs output tokens.
- Transformer produces hidden states for all tokens.
- Hidden states are projected to the vocab space.
- Logits of the last token are used to generate the next token.
- Repeat for subsequent tokens.
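Here's a sketch of that loop with a Hugging Face causal LM (greedy decoding, placeholder model), reusing the cached keys/values at every step:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

generated = tok("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(generated, use_cache=True)  # full pass over the prompt, cache all K/V
    past = out.past_key_values
    for _ in range(20):
        # logits of the LAST position decide the next token
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_id], dim=-1)
        # feed ONLY the new token; the cached K/V cover everything before it
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values

print(tok.decode(generated[0]))
```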
You're in a Research Scientist interview at OpenAI.
The interviewer asks:
"How would you expand the context length of an LLM from 2K to 128K tokens?"
You: "I will fine-tune the model on longer docs with 128K context."
Interview over.
Here's what you missed:
Extending the context window isn't just about larger matrices.
In a traditional transformer, attention is quadratic in sequence length: expanding the context by 8x increases memory needs by ~64x, and going from 2K to 128K tokens (64x longer) needs roughly 4096x more. Refer to the image below!
So, how do we manage it?
continue...👇
1) Sparse Attention
It limits the attention computation to a subset of tokens by:
- Using local attention (tokens attend only to their neighbors).
- Letting the model learn which tokens to focus on.
But this comes with a trade-off between computational cost and model quality.
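As a rough illustration of the local-attention idea, here's a causal sliding-window mask in PyTorch (window size and sequence length are arbitrary):

```python
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: each token attends only to itself
    # and the `window` tokens immediately before it (causal sliding window).
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j >= i - window)

print(local_attention_mask(seq_len=8, window=2).int())
# Each row has at most 3 ones, so cost grows linearly with sequence length
# instead of quadratically as with full attention.
```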
Let's build a reasoning LLM using GRPO, from scratch (100% local):
Today, we're going to learn how to turn any model into a reasoning powerhouse.
We'll do so without any labeled data or human intervention, using reinforcement fine-tuning with GRPO!
Tech stack:
- @UnslothAI for efficient fine-tuning
- @HuggingFace TRL to apply GRPO
Let's go! 🚀
What is GRPO?
Group Relative Policy Optimization is a reinforcement learning method that fine-tunes LLMs for math and reasoning tasks using deterministic reward functions, eliminating the need for labeled data.
Here's a brief overview of GRPO before we jump into code:
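As a rough preview, here's what the core setup can look like with TRL's GRPOTrainer (class and argument names may differ across TRL versions; the model, prompts, and reward function below are placeholders):

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder prompts: GRPO needs no labeled reasoning traces, only a checkable answer
dataset = Dataset.from_dict({
    "prompt": ["What is 13 * 7?", "What is 25 + 48?"],
    "answer": ["91", "73"],
})

# Deterministic reward: 1.0 if the completion contains the correct answer, else 0.0
def correctness_reward(completions, answer, **kwargs):
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder; Unsloth can wrap a model for efficiency
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="grpo-out", num_generations=4),
    train_dataset=dataset,
)
trainer.train()
```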