Temperature in LLMs, clearly explained (with code):
Let's prompt OpenAI GPT-3.5 with a low temperature value twice.
The LLM produces identical responses both times.
Check the response below👇
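Here's a minimal sketch of that experiment with the official openai Python SDK; the prompt and the ask() helper are illustrative placeholders, not the original code:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, temperature: float) -> str:
    # One chat completion call with an explicit temperature.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

# Low temperature: two calls produce (near-)identical responses.
print(ask("Write a one-line poem about the sea.", temperature=0))
print(ask("Write a one-line poem about the sea.", temperature=0))
```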
Now, let's prompt it with a high temperature value.
This time, it produces gibberish. Check the output below👇
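The same ask() helper from above, with the temperature pushed to the maximum the API allows (2.0), is enough to reproduce this:

```python
# High temperature: the output often degrades into incoherent token soup.
print(ask("Write a one-line poem about the sea.", temperature=2.0))
```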
What is going on here? Let's dive in!
Text-generating LLMs are like classification models whose output layer spans the entire vocabulary.
However, instead of selecting the best token, they "sample" the prediction.
So even if “Token 1” has the highest softmax score, it may not be chosen due to sampling👇
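Here's a toy NumPy sketch of the difference between greedy selection and sampling (the probabilities are made up for illustration):

```python
import numpy as np

# Softmax scores over a toy 5-token vocabulary.
probs = np.array([0.40, 0.25, 0.20, 0.10, 0.05])

greedy_token = np.argmax(probs)                        # always picks token 0
sampled_token = np.random.choice(len(probs), p=probs)  # picks token 0 only ~40% of the time

print(greedy_token, sampled_token)
```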
The impact of sampling is controlled using the Temperature parameter.
Temperature introduces the following tweak in the softmax function 👇
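In short, every logit is divided by T before the softmax: p_i = exp(z_i / T) / Σ_j exp(z_j / T). A minimal NumPy sketch of that tweak:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, T: float) -> np.ndarray:
    # Standard softmax, with every logit divided by the temperature T first.
    scaled = logits / T
    scaled -= scaled.max()   # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()
```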
If the temperature is low, the probability distribution collapses toward a hard "max" (nearly one-hot) instead of a "soft" max.
This means the sampling process will almost certainly choose the token with the highest probability. This makes the generation process (nearly) greedy.
Check this👇
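Reusing the softmax_with_temperature sketch from above on some toy logits:

```python
print(softmax_with_temperature(np.array([2.0, 1.0, 0.5, 0.1]), T=0.1))
# ≈ [1.0, 4.5e-05, 3.1e-07, 5.6e-09] -> essentially one-hot, i.e. (nearly) greedy
```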
If the temperature is high, the probabilities start to look like a uniform distribution:
This means the sampling process may select almost any token, which makes the generation process heavily stochastic, as we saw earlier.
Check this👇
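Same toy logits, high temperature:

```python
print(softmax_with_temperature(np.array([2.0, 1.0, 0.5, 0.1]), T=10.0))
# ≈ [0.28, 0.25, 0.24, 0.23] -> close to uniform; any token may get sampled
```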
Some best practices for using temperature (T):
- Set a low T value to generate predictable responses.
- Set a high T value to generate more random and creative responses.
- An extremely high T value rarely has any real utility, as shown below👇
That's a wrap!
If you enjoyed this tutorial:
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
Function calling & MCP for LLMs, clearly explained (with visuals):
Before MCP became popular, AI workflows relied on traditional Function Calling for tool access. Now, MCP is standardizing tool access for Agents/LLMs.
The visual below explains how Function Calling and MCP work under the hood.
Let's learn more!
In Function Calling:
- The LLM receives a prompt.
- The LLM decides the tool.
- The programmer implements a procedure to accept a tool call request and prepare a function call.
- A backend service executes the tool. A minimal sketch of this flow is below👇
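Here's a minimal sketch of that flow with the OpenAI Python SDK; the get_weather tool and its stub implementation are made-up examples, not part of the original thread:

```python
import json
from openai import OpenAI

client = OpenAI()

# Tool schema the LLM can choose from (hypothetical example tool).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# 1) The LLM receives a prompt (plus the tool schemas).
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# 2) The LLM decides the tool and returns a structured tool-call request.
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)

# 3) The programmer's code maps that request to an actual function...
def get_weather(city: str) -> str:
    # ...which a backend service would execute (stubbed here).
    return f"It is 18°C and sunny in {city}."

# 4) Execute the tool and use its result (e.g., send it back to the LLM).
if tool_call.function.name == "get_weather":
    print(get_weather(**args))
```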
Let's build an MCP-powered Agentic RAG (100% local):
Below, we have an MCP-driven Agentic RAG that searches a vector database and falls back to web search if needed.
To build this, we'll use:
- Bright Data to scrape the web at scale.
- @qdrant_engine as the vector DB.
- @cursor_ai as the MCP client.
Let's build it!
Here's how it works:
1) The user inputs a query through the MCP client (Cursor).
2-3) The client contacts the MCP server to select a relevant tool.
4-6) The tool output is returned to the client to generate a response. A minimal server sketch is below👇
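Here's a minimal sketch of the server side using the official MCP Python SDK (FastMCP); the tool bodies are hypothetical stubs standing in for the real Qdrant and Bright Data calls:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("agentic-rag")

@mcp.tool()
def vector_search(query: str) -> str:
    """Search the Qdrant vector DB for documents relevant to the query."""
    # Stub: embed the query and search the Qdrant collection here.
    return "No relevant documents found."

@mcp.tool()
def web_search(query: str) -> str:
    """Fallback tool: search the web (e.g., via Bright Data) for the query."""
    # Stub: call the web-scraping / search service here.
    return "Top web results for: " + query

if __name__ == "__main__":
    # Cursor (the MCP client) launches this server and calls the tools above.
    mcp.run(transport="stdio")
```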