Avi Chawla Profile picture
Aug 20, 2025 12 tweets 4 min read Read on X
DeepMind built a simple RAG technique that:

- reduces hallucinations by 40%
- improves answer relevancy by 50%

Let's understand how to use it in RAG systems (with code):
Most RAG apps fail due to retrieval. Today, we'll build a RAG system that self-corrects inaccurate retrievals using:

- @firecrawl_dev for scraping
- @milvusio as vectorDB
- @beam_cloud for deployment
- @Cometml Opik for observability
- @Llama_Index for orchestration

Let's go!
Here's an overview of what the app does:

- First search the docs with user query
- Evaluate if the retrieved context is relevant using LLM
- Only keep the relevant context
- Do a web search if needed
- Aggregate the context & generate response

Now let's jump into code!
1️⃣ Setup LLM

We will use gpt-oss as the LLM, locally served using Ollama.

Check this out👇 Image
2️⃣ Setup vector DB

Our primary source of knowledge is the user documents that we index and store in a Milvus vectorDB collection.

This will be the first source that will be invoked to fetch context when the user inputs a query.

Check this👇 Image
3️⃣ Setup search tool

If the context obtained from the vector DB isn't relevant, we resort to web search using Firecrawl.

More specifically, we use the latest v2 endpoint that provides 10x faster scraping, semantic crawling, News & image search, and more.

Check this👇 Image
4️⃣ Observability

LlamaIndex offers a seamless integration with CometML's Opik.

We use this to trace every LLM call, monitor, and evaluate our Corrective RAG application.

Check this 👇 Image
5️⃣ Create the workflow

Now that we have everything set up, it's time to create the event-driven agentic workflow that orchestrates our application.

We pass in the LLM, vector index, and web search tool to initialize the workflow.

Check this 👇 Image
7️⃣ Kick off the workflow

Finally, when we have everything ready, we kick off our workflow.

Check this👇 Image
8️⃣ Deployment with Beam

Beam enables ultra-fast serverless deployment of any AI workflow.

Thus, we wrap our app in a Streamlit interface, specify the Python libraries, and the compute specifications for the container.

Finally, we deploy it in a few lines of code👇 Image
Run the app

Beam launches the container and deploys our streamlit app as an HTTPS server that can be accessed from a web browser.

In the video, our workflow is able to answer a query that's unrelated to the document. The evaluation step makes this possible.

Check this 👇
That's a wrap!

If you found it insightful, reshare it with your network.

Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Avi Chawla

Avi Chawla Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @_avichawla

Jan 22
A simple technique trains neural nets 4-6x faster!

- OpenAI used it in GPT models.
- Meta used it in LLaMA models.
- Google used it in Gemini models.

Here's a breakdown (with code):
Typical deep learning frameworks are conservative when it comes to assigning data types.

The default data type is usually 64-bit or 32-bit, when they could have used 16-bit, for instance.

This is also evident from the code below👇 Image
As a result, we are not entirely optimal at allocating memory.

Of course, this is done to ensure better precision in representing information.

However, this precision comes at the cost of memory utilization, which is not desired in all situations.

Check this 👇 Image
Read 12 tweets
Dec 12, 2025
- Google Maps uses graph ML to predict ETA
- Netflix uses graph ML in recommendation
- Spotify uses graph ML in recommendation
- Pinterest uses graph ML in recommendation

Here are 6 must-know ways for graph feature engineering (with code):
Like images, text, and tabular datasets have features, so do graph datasets.

This means when building models on graph datasets, we can engineer these features to achieve better performance.

Let's discuss some feature engineering techniques below! Image
First, let’s create a dummy social networking graph dataset with accounts and followers (which will also be accounts).

We create the two DataFrames shown below, an accounts DataFrame and a followers DataFrame.

Check this code👇 Image
Read 14 tweets
Dec 10, 2025
You're in an AI Engineer interview at OpenAI.

The interviewer asks:

"Our GPT model generates 100 tokens in 42 seconds.

How do you make it 5x faster?"

You: "I'll allocate more GPUs for faster generation."

Interview over.

Here's what you missed:
The real bottleneck isn't compute, it's redundant computation.

Without KV caching, your model recalculates keys and values for each token, repeating work.

- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)

Let's dive in to understand how it works!
To understand KV caching, we must know how LLMs output tokens.

- Transformer produces hidden states for all tokens.
- Hidden states are projected to the vocab space.
- Logits of the last token are used to generate the next token.
- Repeat for subsequent tokens.

Check this👇
Read 10 tweets
Dec 7, 2025
You're in a Research Scientist interview at OpenAI.

The interviewer asks:

"How would you expand the context length of an LLM from 2K to 128K tokens?"

You: "I will fine-tune the model on longer docs with 128K context."

Interview over.

Here's what you missed:
Extending the context window isn't just about larger matrices.

In a traditional transformer, expanding tokens by 8x increases memory needs by 64x due to the quadratic complexity of attention. Refer to the image below!

So, how do we manage it?

continue...👇 Image
1) Sparse Attention

It limits the attention computation to a subset of tokens by:

- Using local attention (tokens attend only to their neighbors).
- Letting the model learn which tokens to focus on.

But this has a trade-off between computational complexity and performance. Image
Read 12 tweets
Nov 25, 2025
Context engineering, clearly explained (with visuals):

(an illustrated guide below) Image
So, what is context engineering?

It’s the art and science of delivering the right information, in the right format, at the right time, to your LLM.

Here's a quote by Andrej Karpathy on context engineering...👇 Image
To understand context engineering, it's essential to first understand the meaning of context.

Agents today have evolved into much more than just chatbots.

The graphic below summarizes the 6 types of contexts an agent needs to function properly.

Check this out 👇 Image
Read 10 tweets
Oct 27, 2025
8 key skills to master LLM Engineering:

(free/open-source resources below) Image
1️⃣ Prompt engineering

Prompt engineering is far from dead!

The key is to craft structured prompts that reduce ambiguity and result in deterministic outputs.

Treat it as engineering, not copywriting!

Here's something I published on JSON prompting: Image
2️⃣ RAG systems

RAG is 80% retrieval and 20% generation. So if you sort out retrieval, the hard part is over.

Airweave (open-source) lets you build live, bi-temporal knowledge bases so that your LLMs always reason on the freshest facts.

Repo: github.com/airweave-ai/ai…

Supports fully agentic retrieval with semantic and keyword search, query expansion, and more across 30+ sources.
Read 10 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(