- reduces hallucinations by 40%
- improves answer relevancy by 50%
Let's understand how to use it in RAG systems (with code):
Most RAG apps fail due to retrieval. Today, we'll build a RAG system that self-corrects inaccurate retrievals using:
- @firecrawl_dev for scraping
- @milvusio as vectorDB
- @beam_cloud for deployment
- @Cometml Opik for observability
- @Llama_Index for orchestration
Let's go!
Here's an overview of what the app does:
- First, search the docs with the user query
- Evaluate whether the retrieved context is relevant using an LLM
- Only keep the relevant context
- Do a web search if needed
- Aggregate the context & generate response
Now let's jump into code!
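Before the individual steps, here is a minimal sketch of the control flow outlined above. It is not the thread's LlamaIndex workflow: `corrective_rag` and its four callables are hypothetical names, and the rule "search the web only when no relevant chunk survives" is an assumption about what "if needed" means.

```python
from typing import Callable, List


def corrective_rag(
    query: str,
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    web_search: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Retrieve, filter by relevance, fall back to web search, then answer."""
    # 1. Search the indexed documents with the user query.
    candidates = retrieve(query)

    # 2-3. Grade each chunk (e.g. with an LLM) and keep only the relevant ones.
    context = [chunk for chunk in candidates if is_relevant(query, chunk)]

    # 4. Assumed trigger: fall back to web search when nothing relevant is left.
    if not context:
        context = web_search(query)

    # 5. Aggregate the context and generate the response.
    return generate(query, context)
```

Each callable can be swapped for a real component (vector DB query, LLM grader, web search tool, LLM call) without changing the flow.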
1️⃣ Setup LLM
We will use gpt-oss as the LLM, locally served using Ollama.
Check this out👇
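As a rough sketch, a locally served model can be reached through Ollama's REST API using only the standard library. This is a stand-in, not the LlamaIndex integration the thread uses. The model tag `gpt-oss` is an assumption (use whatever `ollama list` reports), and `build_generate_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
MODEL_NAME = "gpt-oss"  # assumed tag; check `ollama list` for the exact name


def build_generate_request(prompt: str, model: str = MODEL_NAME) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generation request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, model: str = MODEL_NAME, timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

`complete(...)` only works while an Ollama server is running locally with the model pulled.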
2️⃣ Setup vector DB
Our primary source of knowledge is the user documents that we index and store in a Milvus vectorDB collection.
This will be the first source that will be invoked to fetch context when the user inputs a query.
Check this👇
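To make the retrieval step concrete without a running database, here is a toy in-memory index. It is a hypothetical stand-in for a Milvus collection, not Milvus's API: `InMemoryIndex` is an invented class, and the bag-of-words `embed` function replaces a real embedding model.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (a real app would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Hypothetical stand-in for a vector DB collection."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        """Store a document chunk together with its vector."""
        self._rows.append((chunk, embed(chunk)))

    def search(self, query: str, top_k: int = 2) -> List[Tuple[float, str]]:
        """Return the top_k (score, chunk) pairs, most similar first."""
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), chunk) for chunk, vec in self._rows]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```

The similarity scores returned by `search` are what the later relevance check gets to work with.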
3️⃣ Setup search tool
If the context obtained from the vector DB isn't relevant, we resort to web search using Firecrawl.
More specifically, we use the latest v2 endpoint, which provides 10x faster scraping, semantic crawling, news and image search, and more.
Check this👇
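A hedged sketch of what a raw search call could look like with only the standard library. The endpoint URL, the `query`/`limit` payload fields, and the `FIRECRAWL_API_KEY` environment variable are assumptions to verify against Firecrawl's official docs. `build_search_request` and `web_search` are hypothetical helpers, not part of any SDK.

```python
import json
import os
import urllib.request

# Assumed v2 search endpoint; confirm against the official Firecrawl documentation.
FIRECRAWL_SEARCH_URL = "https://api.firecrawl.dev/v2/search"


def build_search_request(query: str, api_key: str, limit: int = 3) -> urllib.request.Request:
    """Build (but do not send) a search request with bearer-token auth."""
    payload = {"query": query, "limit": limit}  # assumed field names
    return urllib.request.Request(
        FIRECRAWL_SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def web_search(query: str, limit: int = 3) -> dict:
    """Send the search request and return the decoded JSON response as-is."""
    api_key = os.environ["FIRECRAWL_API_KEY"]  # assumed variable name
    request = build_search_request(query, api_key, limit)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

The response is returned unparsed on purpose, since its exact shape is not stated in this thread.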
4️⃣ Observability
LlamaIndex offers a seamless integration with CometML's Opik.
We use this to trace every LLM call, monitor, and evaluate our Corrective RAG application.
Check this 👇
5️⃣ Create the workflow
Now that we have everything set up, it's time to create the event-driven agentic workflow that orchestrates our application.
We pass in the LLM, vector index, and web search tool to initialize the workflow.
Check this 👇
7️⃣ Kick off the workflow
Finally, when we have everything ready, we kick off our workflow.
Check this👇
8️⃣ Deployment with Beam
Beam enables ultra-fast serverless deployment of any AI workflow.
So we wrap our app in a Streamlit interface and specify the Python libraries and compute specifications for the container.
Finally, we deploy it in a few lines of code👇
Run the app
Beam launches the container and deploys our Streamlit app as an HTTPS server that can be accessed from a web browser.
In the video, our workflow is able to answer a query that's unrelated to the document. The evaluation step makes this possible.
Check this 👇
That's a wrap!
If you found it insightful, reshare it with your network.
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
- Google Maps uses graph ML to predict ETA
- Netflix uses graph ML in recommendation
- Spotify uses graph ML in recommendation
- Pinterest uses graph ML in recommendation
Here are 6 must-know ways for graph feature engineering (with code):
Just as image, text, and tabular datasets have features, so do graph datasets.
This means when building models on graph datasets, we can engineer these features to achieve better performance.
Let's discuss some feature engineering techniques below!
First, let’s create a dummy social networking graph dataset with accounts and followers (which will also be accounts).
We create the two DataFrames shown below, an accounts DataFrame and a followers DataFrame.
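A minimal sketch of such a dataset, written as plain row dictionaries so it runs without extra packages (each list can be passed to `pandas.DataFrame(...)` to get the two DataFrames). The column names and values are made up, and the degree features are only one illustrative example, not necessarily one of the six techniques in this thread.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical toy data: one dict per DataFrame row.
accounts: List[Dict] = [
    {"account_id": 1, "name": "alice"},
    {"account_id": 2, "name": "bob"},
    {"account_id": 3, "name": "carol"},
    {"account_id": 4, "name": "dave"},
]

# Each row is a directed edge: follower_id follows account_id.
followers: List[Dict] = [
    {"account_id": 1, "follower_id": 2},
    {"account_id": 1, "follower_id": 3},
    {"account_id": 2, "follower_id": 1},
    {"account_id": 3, "follower_id": 1},
    {"account_id": 3, "follower_id": 4},
]


def degree_features(accounts: List[Dict], followers: List[Dict]) -> List[Dict]:
    """Add in-degree (num_followers) and out-degree (num_following) per account."""
    in_degree = Counter(edge["account_id"] for edge in followers)
    out_degree = Counter(edge["follower_id"] for edge in followers)
    return [
        {
            **row,
            "num_followers": in_degree[row["account_id"]],
            "num_following": out_degree[row["account_id"]],
        }
        for row in accounts
    ]
```

Accounts with no edges get a count of 0, because `Counter` returns 0 for missing keys.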
"Our GPT model generates 100 tokens in 42 seconds.
How do you make it 5x faster?"
You: "I'll allocate more GPUs for faster generation."
Interview over.
Here's what you missed:
The real bottleneck isn't compute, it's redundant computation.
Without KV caching, your model recalculates the keys and values of every previous token at each generation step, repeating work it has already done.
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
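The redundancy can be illustrated with a toy single-head attention in plain Python. This is a sketch under simplifying assumptions: `project_kv` uses fixed scalings instead of learned weight matrices, and each token's embedding doubles as its query. It counts projections rather than measuring time, so it does not reproduce the timings above.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def project_kv(x: Vector) -> Tuple[Vector, Vector]:
    """Toy key/value projections (fixed scalings, not learned matrices)."""
    return [0.5 * v for v in x], [2.0 * v for v in x]


def attend(query: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * value[i] for w, value in zip(weights, values)) / total
        for i in range(len(query))
    ]


def decode_without_cache(tokens: List[Vector]) -> Tuple[List[Vector], int]:
    """Re-project keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for step in range(1, len(tokens) + 1):
        prefix = tokens[:step]
        pairs = [project_kv(x) for x in prefix]  # repeated work
        projections += len(pairs)
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections


def decode_with_cache(tokens: List[Vector]) -> Tuple[List[Vector], int]:
    """Project only the newest token and reuse cached keys and values."""
    keys, values, outputs = [], [], []
    projections = 0
    for x in tokens:
        key, value = project_kv(x)
        projections += 1
        keys.append(key)
        values.append(value)
        outputs.append(attend(x, keys, values))
    return outputs, projections
```

For a 4-token sequence, the uncached loop runs 10 key/value projections and the cached loop runs 4, while both return the same attention outputs. The gap grows quadratically with sequence length.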
Let's dive in to understand how it works!
To understand KV caching, we must know how LLMs output tokens.
- The transformer produces hidden states for all tokens.
- Hidden states are projected to the vocab space.
- Logits of the last token are used to generate the next token.
- Repeat for subsequent tokens.
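The four steps above can be sketched as a greedy decoding loop. Everything here is a toy assumption: the 4-word `VOCAB`, a "transformer" that returns one-hot hidden states without mixing information across positions, and an output head that always favors the next token in the vocabulary.

```python
from typing import List

VOCAB = ["<eos>", "the", "cat", "sat"]
EOS_ID = 0


def hidden_states(token_ids: List[int]) -> List[List[float]]:
    """Toy 'transformer': one hidden state per token (here, a one-hot vector)."""
    return [[1.0 if i == t else 0.0 for i in range(len(VOCAB))] for t in token_ids]


def project_to_vocab(hidden: List[float]) -> List[float]:
    """Toy output head: highest logit goes to the token after the current one."""
    size = len(VOCAB)
    return [hidden[(j - 1) % size] for j in range(size)]


def generate(prompt_ids: List[int], max_new_tokens: int = 5) -> List[int]:
    """Greedy decoding: only the last position's logits pick the next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        states = hidden_states(ids)                      # hidden states for all tokens
        logits = [project_to_vocab(h) for h in states]   # project to vocab space
        last = logits[-1]                                # logits of the last token
        next_id = max(range(len(last)), key=last.__getitem__)
        ids.append(next_id)
        if next_id == EOS_ID:                            # stop at end-of-sequence
            break
    return ids
```

Note that every step recomputes hidden states and logits for the whole sequence even though only `logits[-1]` is used, which is exactly the waste KV caching targets.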
You're in a Research Scientist interview at OpenAI.
The interviewer asks:
"How would you expand the context length of an LLM from 2K to 128K tokens?"
You: "I will fine-tune the model on longer docs with 128K context."
Interview over.
Here's what you missed:
Extending the context window isn't just about larger matrices.
In a traditional transformer, an 8x increase in tokens raises attention memory needs by 64x, because attention scales quadratically with sequence length.
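A quick arithmetic check of that quadratic growth. This is a back-of-the-envelope sketch: it counts only the naive n x n attention score matrix, for a single head in a single layer, and assumes 2 bytes per entry (fp16). The function names are made up for illustration.

```python
def attention_matrix_entries(num_tokens: int) -> int:
    """Entries in the full n x n attention score matrix."""
    return num_tokens * num_tokens


def memory_growth(base_tokens: int, scale: int) -> float:
    """How much the score matrix grows when the token count is multiplied by scale."""
    return attention_matrix_entries(base_tokens * scale) / attention_matrix_entries(base_tokens)


def score_matrix_bytes(num_tokens: int, bytes_per_entry: int = 2) -> int:
    """Naive storage for one head's score matrix (2 bytes per entry assumes fp16)."""
    return attention_matrix_entries(num_tokens) * bytes_per_entry
```

`memory_growth(2048, 8)` gives 64.0, matching the claim above. Going all the way from 2K to 128K is a 64x increase in tokens, so `memory_growth(2048, 64)` gives 4096.0, and one such matrix at 131,072 tokens would take 32 GiB under these assumptions.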
So, how do we manage it?
continue...👇
1) Sparse Attention
It limits the attention computation to a subset of tokens by:
- Using local attention (tokens attend only to their neighbors).
- Letting the model learn which tokens to focus on.
But this trades model performance for lower computational complexity.
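A small sketch of the local-attention variant, assuming a symmetric window where each token attends to itself and to `window` neighbors on each side. The helper names are hypothetical, and learned sparsity patterns are not covered here.

```python
from typing import List


def local_attention_mask(num_tokens: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    return [
        [abs(i - j) <= window for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def count_attended(mask: List[List[bool]]) -> int:
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)
```

With 8 tokens and `window=1`, only 22 of the 64 possible pairs are computed. The count grows roughly linearly with sequence length instead of quadratically, at the cost of tokens no longer seeing distant context directly.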