Rohan · Aug 14 · 14 tweets · 4 min read
While splitting the raw text for Retrieval Augmented Generation (RAG), what should be the ideal length of each chunk? What’s the sweet spot?

Strike a balance between small and large chunks using @LangChainAI's ParentDocumentRetriever.

Let's see how to use it 👇🧵
The issue:

- smaller chunks capture semantic meaning more accurately when embedded

- but they can lose the bigger picture and sound out of context, making it hard for the LLM to answer the user's query from such limited context per chunk.
@LangChainAI ParentDocumentRetriever addresses this by creating embeddings from the smaller chunks only, since they capture semantic meaning better.

But when building the LLM input, it uses the larger chunks, which carry more context.
Let’s walk through the example code from LangChain’s website on ParentDocumentRetriever 🧑‍💻 👇
We're gonna need two splitters instead of one.

- One for creating the larger chunks

- Another one for creating the smaller chunks
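Here's a minimal sketch of that step, modeled on the example in LangChain's docs (the chunk sizes are illustrative, not necessarily the ones in the screenshot):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Splitter for the larger "parent" chunks that will be fed to the LLM
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# Splitter for the smaller "child" chunks that will be embedded;
# it must produce chunks smaller than the parent splitter does
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
```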
Storing the chunks

- As we're creating embeddings for the small chunks only, we'll use a vectorstore to store them.

- The larger chunks go into an InMemoryStore, a key-value store that lives in memory while the program is running.
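Something along these lines, assuming Chroma as the vectorstore and OpenAI embeddings as in the docs example (any vectorstore and embedding model would work):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

# Vectorstore: indexes embeddings of the small (child) chunks
vectorstore = Chroma(
    collection_name="split_parents",
    embedding_function=OpenAIEmbeddings(),
)

# Key-value store: holds the large (parent) chunks in memory
store = InMemoryStore()
```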
Create the ParentDocumentRetriever object

We pass the vectorstore, docstore, and the parent and child splitters to the constructor.
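Continuing the sketch from the previous snippets:

```python
from langchain.retrievers import ParentDocumentRetriever

# Wire the pieces together: small chunks get embedded into the
# vectorstore, large chunks are kept in the docstore
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```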
Add the documents using the retriever.add_documents() method.
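Roughly like this; the TextLoader and the file name are placeholders for whatever documents you're indexing:

```python
from langchain.document_loaders import TextLoader

# Load some raw documents (file name is just an example)
docs = TextLoader("state_of_the_union.txt").load()

# Split into parent/child chunks, embed the child chunks,
# and store both sides
retriever.add_documents(docs)
```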
After adding, we can see there are 66 keys in the store. That means 66 large chunks have been added.

Also, if we run a similarity search on the vectorstore directly, we get back only the small chunks.
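A sketch of what that looks like (the query string is the one from LangChain's docs example):

```python
# Each key in the docstore corresponds to one large (parent) chunk
print(len(list(store.yield_keys())))   # -> 66 large chunks in this example

# Querying the vectorstore directly returns the small (child) chunks
sub_docs = vectorstore.similarity_search("justice breyer")
print(sub_docs[0].page_content)        # a short, focused snippet
```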
Now let's use the retriever to fetch relevant documents with the retriever.get_relevant_documents() method.
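Again as a sketch, continuing from the snippets above:

```python
# The retriever matches on the small chunks, then returns the
# corresponding large chunks to use as LLM context
retrieved_docs = retriever.get_relevant_documents("justice breyer")
print(len(retrieved_docs[0].page_content))   # noticeably longer than sub_docs[0]
```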
Thus we use the small chunks (with sharper semantic meaning) for vector similarity matching, and return their corresponding larger chunks, which carry the bigger picture and more context.
Hopefully ParentDocumentRetriever will help you retrieve more relevant documents when using LangChain for Retrieval Augmented Generation (RAG).
Detailed blog post on ParentDocumentRetriever with more explanation and code snippets:
clusteredbytes.pages.dev/posts/2023/lan…
Thanks for reading.

I write about AI, ChatGPT, LangChain, etc., and try to make complex topics as easy as possible.

Stay tuned for more! 🔥 #ChatGPT #LangChain
