Avi Chawla
Aug 20 · 12 tweets · 4 min read
DeepMind built a simple RAG technique that:

- reduces hallucinations by 40%
- improves answer relevancy by 50%

Let's understand how to use it in RAG systems (with code):
Most RAG apps fail due to retrieval. Today, we'll build a RAG system that self-corrects inaccurate retrievals using:

- @firecrawl_dev for scraping
- @milvusio as vectorDB
- @beam_cloud for deployment
- @Cometml Opik for observability
- @Llama_Index for orchestration

Let's go!
Here's an overview of what the app does:

- First, search the docs with the user's query
- Use an LLM to evaluate whether the retrieved context is relevant
- Keep only the relevant context
- Fall back to a web search if needed
- Aggregate the context & generate the response

Now let's jump into code!
1️⃣ Setup LLM

We will use gpt-oss as the LLM, locally served using Ollama.

Check this out👇
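Here's a rough sketch of this step, assuming LlamaIndex's Ollama integration and the 20B model tag (the actual code is in the image, so treat the model name and timeout as assumptions):

```python
from llama_index.llms.ollama import Ollama

# gpt-oss served locally via Ollama; the model tag and timeout are assumptions
llm = Ollama(model="gpt-oss:20b", request_timeout=120.0)

# Quick smoke test that the local server responds
print(llm.complete("Reply with one word: ready?"))
```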
2️⃣ Setup vector DB

Our primary source of knowledge is the user documents that we index and store in a Milvus vectorDB collection.

This will be the first source that will be invoked to fetch context when the user inputs a query.

Check this👇
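Roughly, the indexing step could look like this with LlamaIndex's Milvus integration; the docs path, the embedding model, and the local Milvus Lite URI are assumptions for illustration:

```python
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.milvus import MilvusVectorStore

# Load the user documents (path is an assumption)
documents = SimpleDirectoryReader("./docs").load_data()

# Local embedding model; 384 dims matches bge-small
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
vector_store = MilvusVectorStore(uri="./milvus_rag.db", dim=384, overwrite=True)

# Index the documents into the Milvus collection
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)
```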
3️⃣ Setup search tool

If the context obtained from the vector DB isn't relevant, we resort to web search using Firecrawl.

More specifically, we use the latest v2 endpoint, which provides 10x faster scraping, semantic crawling, news & image search, and more.

Check this👇
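A hedged sketch of the fallback search tool; the client class and the response fields below follow Firecrawl's v2 Python SDK as I understand it, so treat them as assumptions:

```python
import os
from firecrawl import Firecrawl

# Requires FIRECRAWL_API_KEY in the environment
client = Firecrawl(api_key=os.environ["FIRECRAWL_API_KEY"])

def web_search(query: str) -> str:
    """Fallback used when the vector DB context isn't relevant."""
    results = client.search(query, limit=5)
    # Flatten web results into one context string (field names are assumptions)
    return "\n".join(f"{r.title}: {r.description}" for r in results.web)
```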
4️⃣ Observability

LlamaIndex offers a seamless integration with CometML's Opik.

We use it to trace every LLM call and to monitor and evaluate our Corrective RAG application.

Check this 👇
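Wiring this up is typically just the global handler; `opik.configure()` reads the Comet/Opik credentials from the environment (or prompts for them):

```python
import opik
from llama_index.core import set_global_handler

# Send every LlamaIndex LLM/retrieval call to Opik for tracing
opik.configure()
set_global_handler("opik")
```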
5️⃣ Create the workflow

Now that we have everything set up, it's time to create the event-driven agentic workflow that orchestrates our application.

We pass in the LLM, vector index, and web search tool to initialize the workflow.

Check this 👇
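Here's a compressed sketch of what such a workflow could look like with LlamaIndex's event-driven Workflow API; the grading prompt and event names are assumptions, and the real app likely splits this into more steps:

```python
from llama_index.core.workflow import (
    Context, Event, StartEvent, StopEvent, Workflow, step,
)

class GradedContextEvent(Event):
    context: str

class CorrectiveRAGWorkflow(Workflow):
    def __init__(self, llm, index, web_search_fn, **kwargs):
        super().__init__(**kwargs)
        self.llm = llm
        self.retriever = index.as_retriever(similarity_top_k=4)
        self.web_search_fn = web_search_fn

    @step
    async def retrieve_and_grade(
        self, ctx: Context, ev: StartEvent
    ) -> GradedContextEvent:
        await ctx.set("query", ev.query)
        nodes = self.retriever.retrieve(ev.query)
        # Grade each retrieved chunk with the LLM; keep only relevant ones
        relevant = []
        for node in nodes:
            verdict = await self.llm.acomplete(
                f"Is this context relevant to the question "
                f"'{ev.query}'? Answer yes or no.\n\n{node.text}"
            )
            if "yes" in str(verdict).lower():
                relevant.append(node.text)
        # Corrective step: fall back to web search if nothing survives
        if not relevant:
            relevant.append(self.web_search_fn(ev.query))
        return GradedContextEvent(context="\n\n".join(relevant))

    @step
    async def generate(self, ctx: Context, ev: GradedContextEvent) -> StopEvent:
        query = await ctx.get("query")
        answer = await self.llm.acomplete(
            f"Answer using only this context:\n{ev.context}\n\nQuestion: {query}"
        )
        return StopEvent(result=str(answer))
```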
6️⃣ Kick off the workflow

Finally, when we have everything ready, we kick off our workflow.

Check this👇
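Running it is a single async call; the query string here is just a placeholder:

```python
import asyncio

# llm, index, and web_search come from the earlier setup steps
workflow = CorrectiveRAGWorkflow(
    llm=llm, index=index, web_search_fn=web_search, timeout=120
)

async def main():
    result = await workflow.run(query="What is Corrective RAG?")
    print(result)

asyncio.run(main())
```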
7️⃣ Deployment with Beam

Beam enables ultra-fast serverless deployment of any AI workflow.

So we wrap our app in a Streamlit interface and specify the Python libraries and compute specifications for the container.

Finally, we deploy it in a few lines of code👇
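As a rough sketch, assuming Beam's Pod abstraction for serving arbitrary web apps; the package list, ports, and resource sizes are assumptions, not the exact deployment from the image:

```python
from beam import Image, Pod

# Serve the Streamlit UI from a Beam Pod (arguments are illustrative)
streamlit_server = Pod(
    name="corrective-rag",
    image=Image().add_python_packages(
        ["streamlit", "llama-index", "pymilvus", "firecrawl-py", "opik"]
    ),
    ports=[8501],
    cpu=2,
    memory="4Gi",
    entrypoint=["streamlit", "run", "app.py"],
)
```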
Run the app

Beam launches the container and deploys our Streamlit app as an HTTPS server that can be accessed from a web browser.

In the video, our workflow is able to answer a query that's unrelated to the document. The evaluation step makes this possible.

Check this 👇
That's a wrap!

If you found it insightful, reshare it with your network.

Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
