MCP is like a USB-C port for your AI applications.
Just as USB-C offers a standardized way to connect devices to various accessories, MCP standardizes how your AI apps connect to different data sources and tools.
Let's dive in! 🚀
At its core, MCP follows a client-server architecture where a host application can connect to multiple servers.
Key components include:
- Host
- Client
- Server
Here's an overview before we dig deep 👇
The Host and Client:
Host: An AI app (Claude desktop, Cursor) that provides an environment for AI interactions, accesses tools and data, and runs the MCP Client.
MCP Client: Operates within the host to enable communication with MCP servers.
Next up, MCP server...👇
The Server
A server exposes specific capabilities and provides access to data.
3 key capabilities:
- Tools: Enable LLMs to perform actions through your server
- Resources: Expose data and content from your servers to LLMs
- Prompts: Create reusable prompt templates and workflows
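To make the three capability types concrete, here's a toy Python sketch (not the official MCP SDK; `ToyServer` and the weather functions are illustrative) of a server registering a tool, a resource, and a prompt:

```python
class ToyServer:
    """Illustrative stand-in for an MCP server's capability registry."""
    def __init__(self, name):
        self.name = name
        self.tools, self.resources, self.prompts = {}, {}, {}

    def tool(self, fn):
        # Tools: actions the LLM can perform through the server
        self.tools[fn.__name__] = fn
        return fn

    def resource(self, uri):
        # Resources: data and content the server exposes to the LLM
        def register(fn):
            self.resources[uri] = fn
            return fn
        return register

    def prompt(self, fn):
        # Prompts: reusable prompt templates and workflows
        self.prompts[fn.__name__] = fn
        return fn

server = ToyServer("weather")

@server.tool
def get_forecast(city: str) -> str:
    return f"Forecast for {city}: sunny"

@server.resource("docs://weather/api")
def api_docs() -> str:
    return "GET /forecast?city=<name>"

@server.prompt
def summarize_weather(city: str) -> str:
    return f"Summarize today's weather in {city} in one sentence."
```

The real SDK does the same job with decorators over a running server process; this sketch only shows how the three capability types map onto plain functions.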
The Client-Server Communication
Understanding client-server communication is essential for building your own MCP client-server.
Let's begin with this illustration and then break it down step by step... 👇
1️⃣ & 2️⃣: Capability exchange
The client sends an initialize request to learn the server's capabilities.
The server responds with its capability details.
e.g., a Weather API server advertises available `tools` to call API endpoints, `prompts`, and API documentation as a `resource`.
3️⃣ Notification
The client then acknowledges the successful connection, and further message exchange continues.
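On the wire, this handshake is JSON-RPC 2.0. A sketch of the three messages (field values are illustrative; exact capability payloads depend on the server):

```python
# 1) Client -> server: initialize request
initialize_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",  # illustrative version string
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
}

# 2) Server -> client: capability details
initialize_response = {
    "jsonrpc": "2.0",
    "id": 1,  # matches the request id
    "result": {
        "protocolVersion": "2024-11-05",
        "capabilities": {"tools": {}, "resources": {}, "prompts": {}},
        "serverInfo": {"name": "weather-server", "version": "0.1.0"},
    },
}

# 3) Client -> server: acknowledgement; a notification carries no "id"
initialized_notification = {
    "jsonrpc": "2.0",
    "method": "notifications/initialized",
}
```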
Before we wrap, one more key detail...👇
Unlike traditional APIs, the MCP client-server communication is two-way.
Sampling, if needed, allows servers to leverage the client's AI capabilities (LLM completions or generations) without requiring API keys, while the client maintains control over model access and permissions.
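Sampling reverses the usual direction: the server asks the client to run an LLM completion on its behalf, and the client can approve, edit, or reject it. A sketch of what such a request might look like (field values are illustrative):

```python
# Server -> client: ask the client's LLM to generate a completion
sampling_request = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "sampling/createMessage",
    "params": {
        "messages": [
            {"role": "user",
             "content": {"type": "text", "text": "Summarize this API response."}}
        ],
        "maxTokens": 200,
        # the client, not the server, ultimately picks the model
        "modelPreferences": {"hints": [{"name": "claude-3"}]},
    },
}

def client_gatekeeper(request, allowed_methods):
    """The client stays in control: it decides whether to honor the request."""
    return request["method"] in allowed_methods
```

The gatekeeper function is only a stand-in for the human-in-the-loop or policy check a real client applies before forwarding anything to its model.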
I hope this clarifies what MCP does.
In the future, I'll explore creating custom MCP servers and building hands-on demos around them.
Over to you! What is your take on MCP and its future?
If you found it insightful, reshare with your network.
Find me → @akshay_pachaar ✔️ for more insights and tutorials on LLMs, AI Agents, and Machine Learning!
Let's build an MCP-powered Agentic RAG (100% local):
Below, we have an MCP-powered Agentic RAG that searches a vector database and falls back to web search if needed.
To build this, we'll use:
- @firecrawl_dev search endpoint for web search.
- @qdrant_engine as the vector DB.
- @cursor_ai as the MCP client.
Let's build it!
Here's how it works:
1) The user inputs a query through the MCP client (Cursor).
2-3) The client contacts the MCP server to select a relevant tool.
4-6) The tool output is returned to the client to generate a response.
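The fallback logic at the heart of this setup can be sketched in plain Python (`search_vector_db` and `web_search` are hypothetical stand-ins for the Qdrant and Firecrawl tools; the `min_score` threshold is an assumption):

```python
def answer(query, search_vector_db, web_search, min_score=0.75):
    """Try the vector DB first; fall back to web search if retrieval is weak."""
    hits = search_vector_db(query)  # -> list of (score, text) pairs
    good = [text for score, text in hits if score >= min_score]
    if good:
        return {"source": "vector_db", "context": good}
    # nothing relevant enough in the index -> fall back to the web
    return {"source": "web_search", "context": web_search(query)}
```

In the actual setup this routing lives behind MCP tools that Cursor calls, but the decision rule is the same: use retrieval when it's confident, search the web when it isn't.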
Function calling & MCP for LLMs, clearly explained (with visuals):
Before MCPs became popular, AI workflows relied on traditional Function Calling for tool access. Now, MCP is standardizing it for Agents/LLMs.
The visual below explains how Function Calling and MCP work under the hood.
Let's learn more!
In Function Calling:
- The LLM receives a prompt.
- The LLM decides the tool.
- The programmer implements a procedure to accept a tool call request and prepare a function call.
- A backend service executes the tool.
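The four steps above can be sketched as a minimal loop (the `fake_llm` decision step stands in for a real model call; the schema follows the common JSON-schema style used by function-calling APIs):

```python
import json

# Tool schema advertised to the LLM (illustrative)
TOOLS = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}]

def get_weather(city):
    # stand-in for the backend service that executes the tool
    return f"22C and sunny in {city}"

DISPATCH = {"get_weather": get_weather}

def fake_llm(prompt, tools):
    # stand-in for step 2: the model picks a tool and emits JSON arguments
    return {"tool": "get_weather", "arguments": json.dumps({"city": "Paris"})}

def run(prompt):
    call = fake_llm(prompt, TOOLS)        # the LLM decides the tool
    args = json.loads(call["arguments"])  # programmer parses the tool-call request
    return DISPATCH[call["tool"]](**args) # backend service executes it
```

MCP standardizes the middle two steps: instead of each programmer hand-rolling the request parsing and dispatch, the client and server speak a shared protocol.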
Let's build an MCP server that connects to 200+ data sources (100% local):
Before we dive in, here's a quick demo of what we're building!
Tech stack:
- @MindsDB to power our unified MCP server
- @cursor_ai as the MCP host
- @Docker to self-host the server
Let's go! 🚀
Here's the workflow:
- User submits a query
- Agent connects to MindsDB MCP server to find tools
- Agent selects the appropriate tool based on the user query and calls it
- Finally, the agent returns a contextually relevant response
KV caching in LLMs, clearly explained (with visuals):
KV caching is a technique used to speed up LLM inference.
Before understanding the internal details, look at the inference speed difference in the video:
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
Let's dive in!
To understand KV caching, we must know how LLMs output tokens.
- Transformer produces hidden states for all tokens.
- Hidden states are projected to vocab space.
- The logits of the last token are used to generate the next token.
- Repeat for subsequent tokens.
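Here's a bare-bones sketch of why caching helps: at each decoding step, only the new token's key and value need to be computed; everything from earlier steps is reused (toy single-head attention in plain Python, not a real transformer):

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
              for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted sum of value vectors
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, k_new, v_new, q_new):
        # Only the NEW token's K/V are computed this step;
        # all past keys/values come straight from the cache.
        self.keys.append(k_new)
        self.values.append(v_new)
        return attend(q_new, self.keys, self.values)
```

Without the cache, every step would recompute K and V for the entire sequence so far, which is where the ~5x slowdown in the video comes from.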