After MCP, A2A, & AG-UI, there's another Agent protocol.
It's fully open-source and launched by IBM Research.
Here's a complete breakdown (with code):
ACP is a standardized, RESTful interface for Agents to discover and coordinate with other Agents, regardless of their framework.
Just like A2A, it lets Agents communicate with other Agents. There are some differences, which we'll discuss later.
Let's dive into the code first!
Here's how it works:
- Build the Agents and host them on ACP servers.
- The ACP server receives requests from the ACP Client and forwards them to the Agent.
- The ACP Client can itself be an Agent that intelligently routes requests to other Agents (similar to how an MCP Client works).
Check this 👇
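At its simplest, an Agent is just a decorated function hosted on an ACP server. Here's a minimal sketch using the acp-sdk Python package (an echo agent, purely illustrative):

```python
# minimal ACP server sketch (acp-sdk): a single "echo" agent, purely illustrative
from collections.abc import AsyncGenerator

from acp_sdk.models import Message
from acp_sdk.server import Context, RunYield, RunYieldResume, Server

server = Server()

@server.agent()  # registers this function as an ACP Agent
async def echo(input: list[Message], context: Context) -> AsyncGenerator[RunYield, RunYieldResume]:
    "Echoes every message back to the client."
    for message in input:
        yield message

server.run()  # exposes the agent over a local REST endpoint
```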
We’ll create a research summary generator, where:
- Agent 1 drafts a general topic summary (built using CrewAI)
- Agent 2 fact-checks & enhances it using web search (built using Smolagents).
Start by installing some dependencies and setting up a local LLM with Ollama.
Check this 👇
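Something along these lines should do it (package names per PyPI; the model choice is just one option):

```bash
# install the ACP SDK and the two agent frameworks (add tool/LLM backends as needed)
uv add acp-sdk crewai smolagents duckduckgo-search

# pull a small local model to serve via Ollama (model choice is illustrative)
ollama pull llama3.2
```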
In our case, we’ll have two servers, and each server will host one Agent.
Let’s define the server that will host the CrewAI Agent and its LLM.
Here's how we do it 👇
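A sketch of that setup, assuming acp-sdk for the server and CrewAI's LLM wrapper pointed at the local Ollama model (file name and model are assumptions):

```python
# crew_agent_server.py (assumed file name) -- part 1: the ACP server and its LLM
from collections.abc import AsyncGenerator

from acp_sdk.models import Message, MessagePart
from acp_sdk.server import Context, RunYield, RunYieldResume, Server
from crewai import Agent, Crew, Task, LLM

# The ACP server that will host the CrewAI agent
server = Server()

# Local model served by Ollama (model name is illustrative)
llm = LLM(model="ollama/llama3.2", base_url="http://localhost:11434", max_tokens=1024)
```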
Next, we define an Agent on this server.
- Line 1 → Decorate the method.
- Lines 6-21 → Build the Agent and kick off the Crew.
- Line 23 → Return the output in the expected ACP format.
- Line 26 → Serve on a REST-based ACP server running locally.
Check this 👇
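Continuing the same file, here's a sketch of what that agent could look like (agent/task wording and the port are assumptions; `server` and `llm` come from the previous snippet):

```python
# crew_agent_server.py -- part 2: the ACP agent wrapping a CrewAI crew
@server.agent()
async def research_agent(input: list[Message], context: Context) -> AsyncGenerator[RunYield, RunYieldResume]:
    "Drafts a general summary of the requested topic."
    topic = input[0].parts[0].content  # plain-text topic sent by the ACP client

    researcher = Agent(
        role="Research Assistant",
        goal=f"Write a clear, factual summary about: {topic}",
        backstory="You distill topics into concise overviews.",
        llm=llm,
    )
    task = Task(
        description=f"Draft a ~200-word summary about: {topic}",
        expected_output="A concise, well-structured summary.",
        agent=researcher,
    )
    result = Crew(agents=[researcher], tasks=[task]).kickoff()

    # Return the output in the expected ACP message format
    yield Message(parts=[MessagePart(content=str(result), content_type="text/plain")])

# Serve on a REST-based ACP server running locally (port is an assumption)
server.run(port=8000)
```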
Next, repeat these steps for the 2nd server to host the Smolagents Agent and its LLM.
- Lines 1-10 → Imports + define the Server & the LLM.
- Line 12 → Decorate the method.
- Lines 21-28 → Define the Agent with a web search tool.
- Line 31 → Serve the Agent.
Check this 👇
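A sketch of the second server, assuming Smolagents' CodeAgent with a DuckDuckGo search tool and the same local model accessed via LiteLLM (file name, model, and port are assumptions):

```python
# smolagent_server.py (assumed file name) -- the second ACP server hosting a Smolagents agent
from collections.abc import AsyncGenerator

from acp_sdk.models import Message, MessagePart
from acp_sdk.server import Context, RunYield, RunYieldResume, Server
from smolagents import CodeAgent, DuckDuckGoSearchTool, LiteLLMModel

server = Server()

# Same local Ollama model, accessed through LiteLLM (model name is illustrative)
model = LiteLLMModel(model_id="ollama_chat/llama3.2", api_base="http://localhost:11434")

@server.agent()
async def fact_check_agent(input: list[Message], context: Context) -> AsyncGenerator[RunYield, RunYieldResume]:
    "Fact-checks and enriches a draft summary using web search."
    draft = input[0].parts[0].content

    agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=model)
    prompt = (
        "Fact-check the following summary using web search and improve it "
        f"with any corrections or missing details:\n\n{draft}"
    )
    result = agent.run(prompt)

    yield Message(parts=[MessagePart(content=str(result), content_type="text/plain")])

# Run this server on a different port than the first one (port is an assumption)
server.run(port=8001)
```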
Finally, we use an ACP client to connect both agents in a workflow.
- Lines 6-7 → Connect the client to both servers.
- Lines 11-14 → Invoke the first agent to receive an output.
- Lines 18-21 → Pass the output to the next agent for enhancement.
Check this 👇
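A sketch of the client, using acp-sdk's async Client (parameter and attribute names follow the SDK's examples; the topic, agent names, and ports match the sketches above, and `text_message` is just a helper of ours):

```python
# acp_client.py -- sketch of the client chaining the two agents
import asyncio

from acp_sdk.client import Client
from acp_sdk.models import Message, MessagePart

def text_message(text: str) -> Message:
    # Helper (ours, not part of acp-sdk) to wrap plain text in an ACP Message
    return Message(parts=[MessagePart(content=text, content_type="text/plain")])

async def main() -> None:
    # Connect one client to each ACP server (ports assumed from the server sketches)
    async with Client(base_url="http://localhost:8000") as crew_client, \
               Client(base_url="http://localhost:8001") as smol_client:

        # Step 1: ask the CrewAI agent for a draft summary
        run1 = await crew_client.run_sync(
            agent="research_agent",
            input=[text_message("Impact of ACP on multi-agent systems")],
        )
        draft = run1.output[0].parts[0].content

        # Step 2: pass the draft to the Smolagents agent for fact-checking & enhancement
        run2 = await smol_client.run_sync(
            agent="fact_check_agent",
            input=[text_message(draft)],
        )
        print(run2.output[0].parts[0].content)

asyncio.run(main())
```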
Almost done!
Run the two servers as follows 👇
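(File names here match the assumed names from the sketches above.)

```bash
uv run crew_agent_server.py    # terminal 1 → CrewAI agent on port 8000
uv run smolagent_server.py     # terminal 2 → Smolagents agent on port 8001
```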
Then run the client with `uv run acp_client.py` to get an output from the system powered by ACP.
Check this 👇
This demo showcases how you can use ACP to enable Agents to communicate via a standardized protocol, even if they are built using different frameworks.
How is ACP different from A2A?
- ACP is built for local-first, low-latency communication.
- A2A is optimized for web-native, cross-vendor interoperability.
- ACP uses a RESTful interface, making it easier to embed in your stack.
- A2A supports more flexible, natural interactions.
- ACP excels in controlled, edge, or team-specific setups.
- A2A shines in broader cloud-based collaboration.
That's a wrap!
If you found it insightful, reshare it with your network.
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
- Google Maps uses graph ML to predict ETA
- Netflix uses graph ML in recommendation
- Spotify uses graph ML in recommendation
- Pinterest uses graph ML in recommendation
Here are 6 must-know graph feature engineering techniques (with code):
Just as image, text, and tabular datasets have features, so do graph datasets.
This means when building models on graph datasets, we can engineer these features to achieve better performance.
Let's discuss some feature engineering techniques below!
First, let’s create a dummy social networking graph dataset with accounts and followers (which will also be accounts).
We create the two DataFrames shown below: an accounts DataFrame and a followers DataFrame.
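A sketch of what these might look like (column names and values are illustrative; the originals aren't reproduced here):

```python
import pandas as pd

# Accounts DataFrame: one row per account (columns are illustrative)
accounts_df = pd.DataFrame({
    "account_id": [1, 2, 3, 4, 5],
    "name": ["alice", "bob", "carol", "dan", "eve"],
})

# Followers DataFrame: one row per "follower → followee" edge of the graph
followers_df = pd.DataFrame({
    "follower_id": [2, 3, 3, 4, 5, 5],
    "followee_id": [1, 1, 2, 2, 1, 4],
})
```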
"Our GPT model generates 100 tokens in 42 seconds.
How do you make it 5x faster?"
You: "I'll allocate more GPUs for faster generation."
Interview over.
Here's what you missed:
The real bottleneck isn't compute; it's redundant computation.
Without KV caching, your model recalculates keys and values for each token, repeating work.
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
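You can see the gap yourself with a quick benchmark sketch (a small Hugging Face model purely for illustration; absolute timings will differ from the numbers above):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model just for illustration
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("KV caching speeds up decoding because", return_tensors="pt")

def time_generate(use_cache: bool) -> float:
    # Greedily generate 100 tokens with the KV cache on or off
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=100, do_sample=False, use_cache=use_cache)
    return time.perf_counter() - start

print(f"with KV cache:    {time_generate(True):.2f}s")
print(f"without KV cache: {time_generate(False):.2f}s")
```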
Let's dive in to understand how it works!
To understand KV caching, we must know how LLMs output tokens.
- Transformer produces hidden states for all tokens.
- Hidden states are projected to the vocab space.
- Logits of the last token are used to generate the next token.
- Repeat for subsequent tokens.
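Here's a minimal greedy-decoding sketch that makes this loop, and the KV cache, explicit (gpt2 is just a stand-in model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
past_key_values = None  # the KV cache

with torch.no_grad():
    for _ in range(10):
        if past_key_values is None:
            out = model(input_ids, use_cache=True)  # full prefill over the prompt
        else:
            # Only the newest token is processed; cached K/V cover everything before it
            out = model(input_ids[:, -1:], past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values  # reuse the cache at the next step

        # Logits of the last position pick the next token (greedy decoding)
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```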
You're in a Research Scientist interview at OpenAI.
The interviewer asks:
"How would you expand the context length of an LLM from 2K to 128K tokens?"
You: "I will fine-tune the model on longer docs with 128K context."
Interview over.
Here's what you missed:
Extending the context window isn't just about larger matrices.
In a traditional transformer, expanding tokens by 8x increases memory needs by 64x due to the quadratic complexity of attention. Refer to the image below!
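A quick back-of-the-envelope check of that quadratic growth (pure illustration, ignoring heads, layers, and dtype):

```python
# attention scores scale with seq_len^2; compare against a 2K-token baseline
base = 2_048
for seq_len in [2_048, 16_384, 131_072]:  # 2K, 16K (8x), 128K (64x)
    ratio = (seq_len / base) ** 2
    print(f"{seq_len:>7} tokens -> {ratio:>6.0f}x the attention-matrix memory of 2K")
```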
So, how do we manage it?
Continued below 👇
1) Sparse Attention
It limits the attention computation to a subset of tokens by:
- Using local attention (tokens attend only to their neighbors).
- Letting the model learn which tokens to focus on.
But this comes with a trade-off between computational cost and model performance.
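For intuition, here's a tiny sketch of the sliding-window (local) attention mask that makes this sub-quadratic:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask where each token attends only to itself and the previous
    `window - 1` tokens -- the local attention pattern described above."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i                         # no attending to future tokens
    local = (i - j) < window                # only nearby past tokens
    return causal & local                   # True = allowed to attend

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())
# Each query row has at most `window` allowed keys, so attention cost grows
# as O(seq_len * window) instead of O(seq_len^2).
```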