Previously we've seen @LangChainAI's ParentDocumentRetriever, which creates smaller chunks from a document and links them back to the original document during retrieval.

MultiVectorRetriever is a more customizable version of that. Let's see how to use it 🧵👇
@LangChainAI's ParentDocumentRetriever automatically creates the small chunks and links each one to its parent document's id.

If we want to create additional vectors for each document, beyond just smaller chunks, we can do that and then retrieve the original documents through those vectors using MultiVectorRetriever.
We can customize how these additional vectors are created for each parent document. Here are some approaches @LangChainAI mentions in their documentation:

- smaller chunks
- a summary vector for each document
- vectors of hypothetical questions for each document
Now let's walk through the example code from the LangChain documentation 👇
First we create the retriever itself.

Here we pass:
- vectorstore: stores all the vectors derived from the documents
- docstore: stores the documents themselves
- id_key: the metadata field that holds the parent document's id on each vector
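A minimal sketch of that setup, assuming a Chroma vectorstore with OpenAI embeddings and an in-memory docstore (those specific choices are mine; any LangChain vectorstore and store will do):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

# vectorstore: holds all the derived vectors (small chunks, summaries, questions, ...)
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)

# docstore: holds the full parent documents, keyed by id
docstore = InMemoryStore()

# id_key: the metadata field on each vector that points back to its parent document
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)
```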
We also generate a unique UUID for each document.

We'll use these ids to store the documents in the docstore.

MultiVectorRetriever will use these ids to look up the parent documents after the vector similarity search.
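Something like this, assuming docs is a list of already-loaded Document objects:

```python
import uuid

# one stable id per parent document: the derived vectors carry it in their
# metadata, and the docstore uses it as the lookup key
doc_ids = [str(uuid.uuid4()) for _ in docs]
```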
Now let's reimplement the ParentDocumentRetriever behavior using MultiVectorRetriever (see the sketch after this list):

- iterate over each document
- split the document to get the child chunks
- store each small chunk in the vectorstore, with the parent doc_id as metadata
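A sketch of that loop (the splitter and its chunk_size are my choices for illustration):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    # split the parent document into smaller child chunks
    chunks = child_splitter.split_documents([doc])
    # tag every child chunk with its parent's id
    for chunk in chunks:
        chunk.metadata[id_key] = _id
    sub_docs.extend(chunks)
```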
As MultiVectorRetriever is more flexible and customizable, we have to manually add these additional vectors to the vectorstore, setting the doc_id of the associated parent document as a metadata field on each one.

We also need to add the documents themselves, keyed by their ids, to the docstore, as shown below.
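Which looks roughly like this:

```python
# embed and index the child chunks; each carries doc_id in its metadata
retriever.vectorstore.add_documents(sub_docs)

# store the full parent documents under their ids
retriever.docstore.mset(list(zip(doc_ids, docs)))
```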
We can also create a summary for each document.

Oftentimes a summary can capture more accurately what a document is about, leading to better retrieval.
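A sketch of the summary approach using an LCEL chain (the prompt wording and model choice are my assumptions):

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser

# document text in, summary string out
summarize_chain = (
    {"doc": lambda d: d.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI()
    | StrOutputParser()
)
summaries = summarize_chain.batch(docs, {"max_concurrency": 5})

# wrap each summary as a Document that points back to its parent
summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_ids[i]})
    for i, summary in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
```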
Also, since we'll be matching stored vectors against the embedding of the user's query, we might get better results by generating some hypothetical user questions for each document and storing those in the vectorstore too.
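The documentation generates the questions with OpenAI function calling; here is a simpler sketch that just asks for newline-separated questions and splits the output (the prompt and parsing are my assumptions; imports carry over from the summary sketch above):

```python
question_chain = (
    {"doc": lambda d: d.page_content}
    | ChatPromptTemplate.from_template(
        "Generate 3 hypothetical questions that the document below could be "
        "used to answer. Output one question per line.\n\n{doc}"
    )
    | ChatOpenAI()
    | StrOutputParser()
)
question_lists = question_chain.batch(docs, {"max_concurrency": 5})

# one Document per generated question, each pointing back to its parent
question_docs = []
for i, questions in enumerate(question_lists):
    question_docs.extend(
        Document(page_content=q, metadata={id_key: doc_ids[i]})
        for q in questions.splitlines()
        if q.strip()
    )

retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
```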
Based on the specific use case, we can create other kinds of vectors for each document as well.

For any of these vectors, we just need to make sure the doc_id is set in the metadata; MultiVectorRetriever handles the rest, returning the original documents from whichever vectors match.
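A quick usage sketch to see the difference (the query string is just a placeholder): a raw similarity search returns the small derived snippets, while the retriever maps the hits back through doc_id and returns the full parent documents.

```python
# hits the derived vectors: returns small chunks / summaries / questions
retriever.vectorstore.similarity_search("how does MultiVectorRetriever work?")

# maps those hits back via doc_id: returns the full parent documents
retriever.get_relevant_documents("how does MultiVectorRetriever work?")
```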
MultiVectorRetriever documentation:

python.langchain.com/docs/modules/d…
Thanks for reading.

I write about AI, ChatGPT, LangChain, etc., and try to make complex topics as easy as possible.

Stay tuned for more! 🔥 #ChatGPT #LangChain
