A simple technique makes RAG ~32x memory efficient!
- Perplexity uses it in its search index
- Azure uses it in its search pipeline
- HubSpot uses it in its AI assistant
Let's understand how to use it in RAG systems (with code):
Today, let's build a RAG system that queries 36M+ vectors in <30ms using Binary Quantization.
Tech stack:
- @llama_index for orchestration
- @milvusio as the vector DB
- @beam_cloud for serverless deployment
- @Kimi_Moonshot Kimi-K2 as the LLM hosted on Groq
Let's build it!
Here's the workflow:
- Ingest documents and generate binary embeddings.
- Create a binary vector index and store embeddings in the vector DB.
- Retrieve top-k similar documents to the user's query.
- The LLM generates a response grounded in the retrieved context.
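Here's a minimal NumPy sketch of the core idea, using toy data; in the actual stack, the packed vectors would live in Milvus behind a binary index with the Hamming metric:

```python
import numpy as np

# Toy stand-in for a corpus: 10k float32 embeddings of dim 1024.
rng = np.random.default_rng(0)
docs = rng.standard_normal((10_000, 1024)).astype(np.float32)

# Binary quantization: keep only the sign of each dimension, then pack
# 8 bits per byte. float32 = 32 bits/dim vs. 1 bit/dim -> ~32x smaller.
doc_bits = np.packbits(docs > 0, axis=1)  # shape (10_000, 128), dtype uint8

def hamming_search(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest docs by Hamming distance."""
    q_bits = np.packbits(query_vec > 0)
    # XOR the packed bytes, then count the differing bits per document.
    dists = np.unpackbits(doc_bits ^ q_bits, axis=1).sum(axis=1)
    return np.argsort(dists)[:k]

query = rng.standard_normal(1024).astype(np.float32)
print(hamming_search(query, k=5))
```

Hamming distance is just XOR + bit counting, which is why binary search stays fast even at the 36M-vector scale.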
- Google Maps uses graph ML to predict ETA
- Netflix uses graph ML (GNN) in recommendation
- Spotify uses graph ML (HGNNs) in recommendation
- Pinterest uses graph ML (PinSage) in recommendation
Here are 6 must-know ways for graph feature engineering (with code):
Just as image, text, and tabular datasets have features, so do graph datasets.
This means when building models on graph datasets, we can engineer these features to achieve better performance.
Let's discuss some feature engineering techniques below!
First, let's create a dummy social-network graph dataset with accounts and followers (which are themselves accounts).
We create the two DataFrames shown below: an accounts DataFrame and a followers DataFrame.
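Here's a sketch of what that could look like (column names are illustrative), along with one simple engineered feature, follower count, computed as node in-degree:

```python
import pandas as pd
import networkx as nx

# Accounts table: one row per account.
accounts = pd.DataFrame({
    "account_id": [1, 2, 3, 4],
    "name": ["alice", "bob", "carol", "dave"],
})

# Followers table: one row per "follower -> followee" edge.
followers = pd.DataFrame({
    "follower_id": [2, 3, 3, 4, 1],
    "followee_id": [1, 1, 2, 2, 4],
})

# Build a directed graph from the edge list and engineer a feature:
# an account's follower count is its in-degree in the graph.
G = nx.from_pandas_edgelist(
    followers, source="follower_id", target="followee_id",
    create_using=nx.DiGraph,
)
in_deg = dict(G.in_degree())
accounts["num_followers"] = (
    accounts["account_id"].map(in_deg).fillna(0).astype(int)
)
print(accounts)
```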
KV caching in LLMs, clearly explained (with visuals):
KV caching is a technique used to speed up LLM inference.
Before understanding the internal details, look at the inference speed difference in the video:
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
Let's dive in!
To understand KV caching, we must know how LLMs output tokens.
- The transformer produces hidden states for all input tokens.
- Hidden states are projected to the vocabulary space.
- The logits of the last token are used to generate the next token.
- Repeat for each subsequent token.
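Here's a toy single-head attention loop that shows exactly what gets cached (a minimal sketch in PyTorch, not a real LLM):

```python
import torch

d = 64  # hidden size of our toy model
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []  # the "KV cache"

def decode_step(x_new: torch.Tensor) -> torch.Tensor:
    """x_new: (1, d) hidden state of ONLY the newest token."""
    q = x_new @ Wq
    # K and V for the new token are computed once and appended;
    # K/V of all past tokens are reused, never recomputed.
    k_cache.append(x_new @ Wk)
    v_cache.append(x_new @ Wv)
    K = torch.cat(k_cache)  # (seq_len, d)
    V = torch.cat(v_cache)  # (seq_len, d)
    attn = torch.softmax(q @ K.T / d**0.5, dim=-1)
    return attn @ V          # attention output for the new token only

for _ in range(5):           # simulate 5 decoding steps
    out = decode_step(torch.randn(1, d))
print(out.shape)             # torch.Size([1, 64])
```

Without the cache, every step would recompute K and V for the entire sequence so far, which is where the 9s vs. 42s gap comes from.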
5 levels of Agentic AI systems, clearly explained (with visuals):
Agentic AI systems don't just generate text; they can make decisions, call functions, and even run autonomous workflows.
The visual explains 5 levels of AI agency, ranging from simple responders to fully autonomous agents.
Let's dive in to learn more!
1️⃣ Basic responder
- A human guides the entire flow.
- The LLM is just a generic responder that receives an input and produces an output. It has little control over the program flow.
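A minimal sketch of this level, assuming the OpenAI Python SDK as the client (the model name is illustrative; any chat-completion API works the same way):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The human drives the loop; the LLM only maps input text to output text.
while True:
    user_input = input("You: ")
    if user_input.lower() in {"quit", "exit"}:
        break
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": user_input}],
    )
    # No tools, no memory, no decisions: the model just responds.
    print("LLM:", resp.choices[0].message.content)
```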
Qwen-3 Coder is Alibaba’s most powerful open-source coding LLM.
Today, let's build a pipeline to compare it to Sonnet 4 using:
- @LiteLLM for orchestration.
- @deepeval to build the eval pipeline (open-source).
- @OpenRouterAI to access @Alibaba_Qwen 3 Coder.
Let's dive in!
Here's the workflow:
- Ingest a GitHub repo and provide it as context to the LLMs.
- Generate code from both models using context + query.
- Compare the generated code using DeepEval.
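Here's a minimal sketch of steps 2 and 3, assuming OpenRouter model slugs (verify the exact names on OpenRouter) and DeepEval's GEval metric; the repo context string is a hypothetical placeholder for step 1:

```python
import litellm
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Illustrative OpenRouter slugs; check the exact model names.
MODELS = {
    "qwen3-coder": "openrouter/qwen/qwen3-coder",
    "sonnet-4": "openrouter/anthropic/claude-sonnet-4",
}

def generate(model: str, repo_context: str, query: str) -> str:
    resp = litellm.completion(
        model=model,
        messages=[
            {"role": "system", "content": f"Repository context:\n{repo_context}"},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content

# GEval scores outputs against plain-language criteria
# (it uses an LLM judge under the hood).
correctness = GEval(
    name="Code correctness",
    criteria="Is the generated code correct, idiomatic, and responsive to the query?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

query = "Add a retry decorator with exponential backoff."
for name, slug in MODELS.items():
    code = generate(slug, repo_context="<ingested repo files>", query=query)
    case = LLMTestCase(input=query, actual_output=code)
    correctness.measure(case)
    print(name, correctness.score)
```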