Let's build a real-time Voice RAG Agent, step by step:
Before we begin, here's a quick demo of what we're building.
Tech stack:
- @Cartesia_AI for SOTA text-to-speech
- @AssemblyAI for speech-to-text
- @LlamaIndex to power RAG
- @livekit for orchestration
Let's go!
Here's an overview of what the app does:
1. Listens to real-time audio
2. Transcribes it via AssemblyAI
3. Uses your docs (via LlamaIndex) to craft an answer
4. Speaks that answer back with Cartesia
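Here's a minimal sketch of how those four steps could wire together with LiveKit's Python agents framework. The `docs/` folder, the OpenAI LLM that drafts the spoken answer, and the Silero VAD are my additions to make the example runnable (they aren't part of the stack above), and the plugin/class names assume the LiveKit Agents 1.x layout, so check them against your installed versions:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession, RunContext, function_tool
from livekit.plugins import assemblyai, cartesia, openai, silero
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Build the RAG index once at startup from a local docs/ folder (path is illustrative).
documents = SimpleDirectoryReader("docs").load_data()
query_engine = VectorStoreIndex.from_documents(documents).as_query_engine()


class DocsAssistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions=(
                "You are a voice assistant. Answer questions with the "
                "query_docs tool and keep replies short and speakable."
            )
        )

    @function_tool()
    async def query_docs(self, context: RunContext, question: str) -> str:
        """Answer a question using the indexed documents."""
        response = await query_engine.aquery(question)
        return str(response)


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()
    session = AgentSession(
        stt=assemblyai.STT(),    # 2. transcription
        llm=openai.LLM(),        # 3. drafts the answer from the retrieved context
        tts=cartesia.TTS(),      # 4. speaks the answer back
        vad=silero.VAD.load(),   # detects when the user starts and stops talking
    )
    await session.start(room=ctx.room, agent=DocsAssistant())  # 1. joins the room and listens


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

LiveKit handles the real-time audio transport, so the only custom logic you own is the `query_docs` tool that routes questions through LlamaIndex.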
Let's build an MCP-powered audio analysis toolkit:
Before we dive in, here's a demo of what we're building!
Tech stack:
- @AssemblyAI for transcription and audio analysis
- Claude Desktop as the MCP host
- @streamlit for the UI
Let's build it!
Here's our workflow:
- The user's audio is sent to AssemblyAI via a local MCP server.
- AssemblyAI transcribes it and returns a summary, speaker labels, sentiment, and topics.
- After transcription, the user can also chat with the audio.
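Here's a minimal sketch of that local MCP server in Python, assuming the official `mcp` SDK (FastMCP) and the AssemblyAI Python SDK. The tool names, the `ASSEMBLYAI_API_KEY` environment variable, and the use of LeMUR for the chat-with-audio step are illustrative assumptions, not fixed parts of the build:

```python
# server.py -- local MCP server exposing AssemblyAI transcription + analysis as tools.
import os

import assemblyai as aai
from mcp.server.fastmcp import FastMCP

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
mcp = FastMCP("audio-analysis")

# Keep the latest transcript around so follow-up questions can reference it.
_last_transcript: aai.Transcript | None = None


@mcp.tool()
def analyze_audio(file_path: str) -> dict:
    """Transcribe an audio file and return its summary, speakers, sentiment, and topics."""
    global _last_transcript
    config = aai.TranscriptionConfig(
        speaker_labels=True,      # who said what
        sentiment_analysis=True,  # per-sentence sentiment
        iab_categories=True,      # topic detection
        summarization=True,       # auto summary
    )
    _last_transcript = aai.Transcriber().transcribe(file_path, config=config)
    t = _last_transcript
    return {
        "summary": t.summary,
        "speakers": [f"{u.speaker}: {u.text}" for u in t.utterances],
        "sentiment": [f"{s.sentiment.value}: {s.text}" for s in t.sentiment_analysis],
        "topics": list(t.iab_categories.summary.keys()),
    }


@mcp.tool()
def ask_audio(question: str) -> str:
    """Answer a question about the most recently transcribed audio (chat with audio)."""
    if _last_transcript is None:
        return "No audio has been transcribed yet -- call analyze_audio first."
    # LeMUR answers questions grounded in the transcript (shown here with default settings).
    return _last_transcript.lemur.task(question).response


if __name__ == "__main__":
    # Claude Desktop launches this server over stdio via its MCP config.
    mcp.run(transport="stdio")
```

Register the script in Claude Desktop's `claude_desktop_config.json` (under `mcpServers`, with the command that launches it), and both tools show up in the MCP host, ready to run on any local audio file.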