Let's build a real-time Voice RAG Agent, step-by-step:
Before we begin, here's a quick demo of what we're building!
Tech stack:
- @Cartesia_AI for SOTA text-to-speech
- @AssemblyAI for speech-to-text
- @LlamaIndex to power RAG
- @livekit for orchestration
Let's go! 🚀
Here's an overview of what the app does:
1. Listens to real-time audio
2. Transcribes it via AssemblyAI
3. Uses your docs (via LlamaIndex) to craft an answer
4. Speaks that answer back with Cartesia
Now let's jump into code!
1️⃣ Set up environment and logging
This ensures we can load configurations from .env and keep track of everything in real time.
Check this out👇
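Here's a minimal sketch of that setup, assuming python-dotenv and Python's standard logging; the logger name and the env-var check are illustrative, not the repo's exact code:

```python
import logging
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Pull API keys (AssemblyAI, Cartesia, LiveKit, etc.) from a local .env file
load_dotenv()

# Timestamped logs so we can follow the pipeline in real time
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("voice-rag-agent")

logger.info("AssemblyAI key loaded: %s", bool(os.getenv("ASSEMBLYAI_API_KEY")))
```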
2️⃣ Set up RAG
This is where your documents get indexed for search and retrieval, powered by LlamaIndex.
The agent's answers will be grounded in this knowledge base.
Check this out👇
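A minimal LlamaIndex sketch of that indexing step; the ./docs folder, the top-k value, and the sample query are assumptions for illustration:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load your documents and build an in-memory vector index
documents = SimpleDirectoryReader("docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# The agent will answer questions grounded in this index
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What does this knowledge base cover?"))
```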
3️⃣ Set up Voice Activity Detection
We also want Voice Activity Detection (VAD) for a smooth real-time experience, so we'll "prewarm" the Silero VAD model.
This helps us detect when someone is actually speaking.
Check this out👇
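A sketch of the prewarm step, following LiveKit's standard pattern of caching the Silero model in process userdata:

```python
from livekit.agents import JobProcess
from livekit.plugins import silero

def prewarm(proc: JobProcess):
    # Load the Silero VAD model once per worker process;
    # the entrypoint reuses it so each session starts instantly
    proc.userdata["vad"] = silero.VAD.load()
```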
4️⃣ The VoicePipelineAgent and Entry Point
This is where we bring it all together. The agent:
1. Listens to real-time audio.
2. Transcribes it using AssemblyAI.
3. Crafts an answer with your documents via LlamaIndex.
4. Speaks that answer back using Cartesia.
Check this out 👇
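Here's a sketch of the entrypoint based on LiveKit's VoicePipelineAgent API; the greeting and the llama_index plugin wiring (reusing the `index` built in step 2) are assumptions, so the repo's exact code may differ:

```python
from livekit.agents import AutoSubscribe, JobContext
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import assemblyai, cartesia, llama_index

async def entrypoint(ctx: JobContext):
    # Join the room audio-only and wait for a caller
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    participant = await ctx.wait_for_participant()

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],      # Silero VAD from the prewarm step
        stt=assemblyai.STT(),              # speech-to-text
        llm=llama_index.LLM(chat_engine=index.as_chat_engine()),  # RAG over the step-2 index
        tts=cartesia.TTS(),                # text-to-speech
    )

    agent.start(ctx.room, participant)
    await agent.say("Hi! Ask me anything about your documents.", allow_interruptions=True)
```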
5️⃣ Run the app
Finally, we tie it all together. We run our agent, specifying the prewarm function and the main entrypoint.
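A sketch of the runner, using LiveKit's CLI and WorkerOptions to register both functions:

```python
from livekit.agents import WorkerOptions, cli

if __name__ == "__main__":
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,  # main session logic (step 4)
            prewarm_fnc=prewarm,        # loads Silero VAD once per process (step 3)
        )
    )
```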
That’s it—your Real-Time Voice RAG Agent is ready to roll!
The entire code is 100% open-source. You can find it here!
If you found it insightful, reshare with your network.
Find me → @akshay_pachaar ✔️
For more insights and tutorials on LLMs, AI Agents, and Machine Learning!
Let's build an MCP-powered audio analysis toolkit:
Before we dive in, here's a demo of what we're building!
Tech stack:
- @AssemblyAI for transcription and audio analysis
- Claude Desktop as the MCP host
- @streamlit for the UI
Let's build it!
Here's our workflow:
- User's audio input is sent to AssemblyAI via a local MCP server.
- AssemblyAI transcribes it and provides a summary, speaker labels, sentiment, and topics.
- Post-transcription, the user can also chat with the audio.
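Here's a minimal sketch of that local MCP server, using the official `mcp` Python SDK's FastMCP helper and the AssemblyAI SDK; the server name, tool name, and the exact set of insights returned are illustrative:

```python
import os

import assemblyai as aai
from mcp.server.fastmcp import FastMCP

aai.settings.api_key = os.getenv("ASSEMBLYAI_API_KEY")
mcp = FastMCP("audio-analysis")

@mcp.tool()
def transcribe_audio(file_path: str) -> str:
    """Transcribe a local audio file and return the transcript text."""
    config = aai.TranscriptionConfig(
        speaker_labels=True,      # who said what
        sentiment_analysis=True,  # per-sentence sentiment
        iab_categories=True,      # topic detection
        summarization=True,       # auto-generated summary
    )
    transcript = aai.Transcriber().transcribe(file_path, config)
    return transcript.text

if __name__ == "__main__":
    mcp.run(transport="stdio")  # Claude Desktop launches this server over stdio
```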
MCP is like a USB-C port for your AI applications.
Just as USB-C offers a standardized way to connect devices to various accessories, MCP standardizes how your AI apps connect to different data sources and tools.
Let's dive in! 🚀
At its core, MCP follows a client-server architecture where a host application can connect to multiple servers.