A simple technique makes RAG ~32x memory efficient!
- Perplexity uses it in its search index
- Azure uses it in its search pipeline
- HubSpot uses it in its AI assistant
Let's understand how to use it in RAG systems (with code):
Today, let's build a RAG system that queries 36M+ vectors in <30ms using Binary Quantization.
Tech stack:
- @llama_index for orchestration
- @milvusio as the vector DB
- @beam_cloud for serverless deployment
- @Kimi_Moonshot Kimi-K2 as the LLM hosted on Groq
Let's build it!
Here's the workflow:
- Ingest documents and generate binary embeddings.
- Create a binary vector index and store embeddings in the vector DB.
- Retrieve top-k similar documents to the user's query.
- The LLM generates a response grounded in the retrieved context.
Let's implement this!
0️⃣ Setup Groq
Before we begin, store your Groq API key in a .env file and load it into your environment to leverage the world's fastest AI inference.
Check this 👇
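The original setup uses a screenshot; as a stdlib-only sketch, here is roughly what loading a `.env` file does (in practice you would use the `python-dotenv` package; the key name `GROQ_API_KEY` is the conventional one for the Groq SDK):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: a stdlib sketch of what python-dotenv does."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Usage: put GROQ_API_KEY=<your-key> in .env, then:
#   load_env()
#   api_key = os.environ["GROQ_API_KEY"]
```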
1️⃣ Load data
We ingest our documents using LlamaIndex's directory reader tool.
It can read various data formats including Markdown, PDFs, Word documents, PowerPoint decks, images, audio and video.
Check this 👇
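In the actual app this is one line with LlamaIndex (`SimpleDirectoryReader("./docs").load_data()`, where `./docs` is a hypothetical path); here is a stdlib-only sketch of the same idea for plain-text formats, just to show what "ingesting a directory" means conceptually:

```python
from pathlib import Path

def load_documents(directory, extensions=(".txt", ".md")):
    """Read every matching file under a directory into a list of dicts."""
    docs = []
    for path in sorted(Path(directory).rglob("*")):
        if path.suffix in extensions:
            docs.append({"file": path.name, "text": path.read_text()})
    return docs
```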
2️⃣ Generate Binary Embeddings
Next, we generate text embeddings (in float32) and convert them to binary vectors, resulting in a 32x reduction in memory and storage.
This is called binary quantization.
Check this implementation 👇
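A minimal sketch of the quantization step itself: threshold each float32 dimension at zero, then pack 8 bits per byte. A 1024-dim float32 vector (4096 bytes) becomes 128 bytes, which is exactly the 32x reduction:

```python
import numpy as np

def binary_quantize(embeddings: np.ndarray) -> np.ndarray:
    bits = (embeddings > 0).astype(np.uint8)  # 1 where positive, else 0
    return np.packbits(bits, axis=-1)         # 8 dims -> 1 byte

emb = np.random.randn(10, 1024).astype(np.float32)
packed = binary_quantize(emb)
print(emb.nbytes // packed.nbytes)  # 32
```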
3️⃣ Vector indexing
With binary quantization done, we store and index the vectors in a Milvus vector database for efficient retrieval.
Indexes are specialized data structures that help optimize the performance of data retrieval operations.
Check this 👇
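A sketch of the index configuration for binary vectors. The parameter names follow pymilvus conventions (`BIN_IVF_FLAT`, `HAMMING`, `BINARY_VECTOR`); treat specific values like `nlist=1024` as tunable assumptions, not the app's exact settings:

```python
dim = 1024  # Milvus binary vector fields take their dimension in bits

index_params = {
    "index_type": "BIN_IVF_FLAT",  # inverted-file index over binary codes
    "metric_type": "HAMMING",      # bitwise distance, matches the quantization
    "params": {"nlist": 1024},     # number of coarse clusters (tunable)
}

# Against a running Milvus instance this would be applied along the lines of:
#   from pymilvus import MilvusClient, DataType
#   client = MilvusClient("milvus_demo.db")
#   # ...create a schema with a BINARY_VECTOR field of dim=1024,
#   # then create the index on that field with index_params...
```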
4️⃣ Retrieval
In the retrieval stage, we:
- Embed the user query and apply binary quantization to it.
- Use Hamming distance as the search metric to compare binary vectors.
- Retrieve the top 5 most similar chunks.
- Add the retrieved chunks to the context.
Check this👇
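The search metric above can be sketched in a few lines of numpy: XOR two packed codes, count the differing bits, and take the k rows with the fewest differences (the vector DB does this far faster, but the math is the same):

```python
import numpy as np

# Precomputed popcount table: set bits for every possible byte value.
_POPCOUNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)

def hamming_top_k(query: np.ndarray, index: np.ndarray, k: int = 5):
    diff = np.bitwise_xor(index, query)   # (n, n_bytes)
    dists = _POPCOUNT[diff].sum(axis=1)   # differing bits per row
    top = np.argsort(dists)[:k]
    return top, dists[top]

codes = np.random.randint(0, 256, size=(1000, 128), dtype=np.uint8)
ids, dists = hamming_top_k(codes[42], codes, k=5)
print(ids[0], dists[0])  # 42 0 -- the query is its own nearest match
```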
5️⃣ Generation
Finally, we build a generation pipeline using the Kimi-K2 Instruct model, served by Groq for ultra-fast inference.
We specify both the query and the retrieved context in a prompt template and pass it to the LLM.
Check this 👇
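A sketch of the prompt assembly: the retrieved chunks and user query are slotted into a template before the LLM call. The template wording and the Groq call in the comment are illustrative assumptions, not the app's exact code:

```python
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.

Context:
{context}

Question: {query}
Answer:"""

def build_prompt(query: str, chunks: list) -> str:
    return PROMPT_TEMPLATE.format(context="\n---\n".join(chunks), query=query)

# With the Groq SDK this would be sent roughly as:
#   client.chat.completions.create(
#       model="moonshotai/kimi-k2-instruct",  # model id is an assumption
#       messages=[{"role": "user", "content": build_prompt(q, chunks)}],
#   )
```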
6️⃣ Deployment with Beam
Beam enables ultra-fast serverless deployment of any AI workflow.
We wrap our app in a Streamlit interface and specify the Python libraries and compute specifications for the container.
Finally, we deploy the app in a few lines of code👇
7️⃣ Run the app
Beam launches the container and deploys our Streamlit app as an HTTPS server that can be easily accessed from a web browser.
Check this demo 👇
Moving on, to truly assess the scale and inference speed, we test the deployed setup over the PubMed dataset (36M+ vectors).
Our app:
- queried 36M+ vectors in <30ms.
- generated a response in <1s.
Check this demo👇
Done!
We just built a blazing-fast RAG stack, leveraging binary quantization for efficient retrieval and ultra-fast serverless deployment of our AI workflow.
Here's the workflow again for your reference 👇
That's a wrap!
If you found it insightful, reshare it with your network.
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAG.
- Google Maps uses graph ML to predict ETA
- Netflix uses graph ML in recommendation
- Spotify uses graph ML in recommendation
- Pinterest uses graph ML in recommendation
Here are 6 must-know ways for graph feature engineering (with code):
Just as image, text, and tabular datasets have features, so do graph datasets.
This means when building models on graph datasets, we can engineer these features to achieve better performance.
Let's discuss some feature engineering techniques below!
First, let’s create a dummy social networking graph dataset with accounts and followers (which will also be accounts).
We create the two DataFrames shown below, an accounts DataFrame and a followers DataFrame.
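Since the original screenshot isn't reproduced here, a minimal pandas sketch of the two tables (illustrative values), plus the most basic graph feature you can engineer from them, follower count, i.e. node degree:

```python
import pandas as pd

# Dummy social-network data: an accounts table and a followers edge list,
# where each follower is itself an account.
accounts = pd.DataFrame({
    "account_id": [1, 2, 3, 4],
    "name": ["alice", "bob", "carol", "dave"],
})
followers = pd.DataFrame({
    "account_id": [1, 1, 2, 3, 3, 3],   # who is being followed
    "follower_id": [2, 3, 3, 1, 2, 4],  # who follows them
})

# A first engineered feature: follower count per account (node degree).
degree = followers.groupby("account_id").size().rename("n_followers").reset_index()
accounts = accounts.merge(degree, on="account_id", how="left").fillna({"n_followers": 0})
print(accounts)
```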
"Our GPT model generates 100 tokens in 42 seconds.
How do you make it 5x faster?"
You: "I'll allocate more GPUs for faster generation."
Interview over.
Here's what you missed:
The real bottleneck isn't compute; it's redundant computation.
Without KV caching, your model recalculates keys and values for each token, repeating work.
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
Let's dive in to understand how it works!
To understand KV caching, we must know how LLMs output tokens.
- Transformer produces hidden states for all tokens.
- Hidden states are projected to the vocab space.
- Logits of the last token are used to generate the next token.
- Repeat for subsequent tokens.
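The key observation behind KV caching: the keys and values of earlier tokens never change during decoding, so they can be computed once and appended to a cache. A minimal single-head numpy sketch (toy dimensions, no batching) showing the cached path matches the full recomputation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-query scaled dot-product attention."""
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

tokens = rng.standard_normal((6, d))  # hidden states for 6 tokens

# Without cache: recompute K, V for the full prefix at every step.
out_full = attend(tokens[-1] @ Wq, tokens @ Wk, tokens @ Wv)

# With cache: K, V built incrementally, one row per newly generated token.
K_cache, V_cache = [], []
for t in tokens:
    K_cache.append(t @ Wk)
    V_cache.append(t @ Wv)
out_cached = attend(tokens[-1] @ Wq, np.array(K_cache), np.array(V_cache))

print(np.allclose(out_full, out_cached))  # True
```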
You're in a Research Scientist interview at OpenAI.
The interviewer asks:
"How would you expand the context length of an LLM from 2K to 128K tokens?"
You: "I will fine-tune the model on longer docs with 128K context."
Interview over.
Here's what you missed:
Extending the context window isn't just about larger matrices.
In a traditional transformer, expanding tokens by 8x increases memory needs by 64x due to the quadratic complexity of attention. Refer to the image below!
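The 8x-to-64x claim, made concrete: attention materializes an n × n score matrix, so memory grows with the square of the context length:

```python
# Attention materializes an (n x n) score matrix per head,
# so memory grows quadratically with context length n.
def attn_matrix_entries(n_tokens: int) -> int:
    return n_tokens * n_tokens

# 2K -> 16K tokens is an 8x longer context...
ratio = attn_matrix_entries(16_000) / attn_matrix_entries(2_000)
print(ratio)  # 64.0 -- ...but 64x the attention memory
```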
So, how do we manage it?
continue...👇
1) Sparse Attention
It limits the attention computation to a subset of tokens by:
- Using local attention (tokens attend only to their neighbors).
- Letting the model learn which tokens to focus on.
But this has a trade-off between computational complexity and performance.
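Local attention can be sketched as a mask over the score matrix: each token attends only to itself and its `window` nearest neighbors, cutting the attended positions from O(n²) to O(n·w):

```python
import numpy as np

def local_attention_mask(n: int, window: int) -> np.ndarray:
    """True where token i may attend to token j (|i - j| <= window)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(n=8, window=1)
print(mask.sum(), "of", mask.size, "positions attended")  # 22 of 64
```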
Let's build a reasoning LLM using GRPO, from scratch (100% local):
Today, we're going to learn how to turn any model into a reasoning powerhouse.
We'll do so without any labeled data or human intervention, using Reinforcement Finetuning (GRPO)!
Tech stack:
- @UnslothAI for efficient fine-tuning
- @HuggingFace TRL to apply GRPO
Let's go! 🚀
What is GRPO?
Group Relative Policy Optimization is a reinforcement learning method that fine-tunes LLMs for math and reasoning tasks using deterministic reward functions, eliminating the need for labeled data.
Here's a brief overview of GRPO before we jump into code:
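As a taste of the deterministic rewards GRPO relies on: for math problems the final answer is mechanically checkable, so no labeled preference data is needed. A sketch with two common reward components; the `<answer>...</answer>` tag format and the reward magnitudes are illustrative assumptions:

```python
import re

def correctness_reward(completion: str, ground_truth: str) -> float:
    """+2 if the extracted answer matches the known result, else 0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 2.0 if match and match.group(1).strip() == ground_truth else 0.0

def format_reward(completion: str) -> float:
    """+0.5 if the completion follows the expected tag format."""
    return 0.5 if re.search(r"<answer>.*?</answer>", completion, re.DOTALL) else 0.0

sample = "Let's think step by step... <answer>42</answer>"
print(correctness_reward(sample, "42") + format_reward(sample))  # 2.5
```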