Combining documents with LLMs is a key part of retrieval and chaining
We've improved our @LangChainAI reference documentation across the 5 major CombineDocumentsChains and their helper functions, to clarify what each one does and how they fit together
🧵
📄 `format_document`
Want to control which metadata keys show up in the prompt?
This helper function is rarely surfaced directly, but it's key to combining documents with LLMs
It takes a Document and formats it into a string using a PromptTemplate
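A minimal sketch of how it works (assuming the classic `langchain.schema` import path; the document contents here are made up):

```python
from langchain.prompts import PromptTemplate
from langchain.schema import Document, format_document

doc = Document(
    page_content="LangChain is a framework for building LLM apps.",
    metadata={"source": "intro.md", "page": 1},
)

# Only the variables named in the template get pulled from the Document:
# `page_content` plus whichever metadata keys you choose to expose.
doc_prompt = PromptTemplate.from_template("[{source}] {page_content}")

print(format_document(doc, doc_prompt))
# [intro.md] LangChain is a framework for building LLM apps.
```

Note the `page` metadata key never makes it into the string, because the template doesn't reference it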
📄 `StuffDocumentsChain`
The most basic CombineDocumentsChain: it takes N documents, formats each into a string using a PromptTemplate and `format_document`, stuffs them all into a single prompt, and passes that prompt to an LLM
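Roughly how you'd wire one up (a sketch; ChatOpenAI is just standing in for whatever model you use, and the prompts are illustrative):

```python
from langchain.chains import LLMChain, StuffDocumentsChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# How each individual Document is rendered (via `format_document` under the hood)
document_prompt = PromptTemplate.from_template("{page_content}")

# The final prompt the stuffed-together string is inserted into
prompt = PromptTemplate.from_template("Summarize the following:\n\n{context}")
llm_chain = LLMChain(llm=ChatOpenAI(), prompt=prompt)

chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_prompt=document_prompt,
    document_variable_name="context",  # where the combined string lands in `prompt`
)
# chain.run(docs)  # docs: a list of Documents
```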
📄 `MapReduceDocumentsChain`
This one takes an LLMChain and a ReduceDocumentsChain. It first applies the LLMChain to each document individually (the map step), then passes all the results to the ReduceDocumentsChain (the reduce step)
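A sketch of a map-reduce summarizer under the same assumptions (illustrative prompts, ChatOpenAI as a stand-in):

```python
from langchain.chains import (
    LLMChain,
    MapReduceDocumentsChain,
    ReduceDocumentsChain,
    StuffDocumentsChain,
)
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

llm = ChatOpenAI()

# Map step: run independently over every document
map_prompt = PromptTemplate.from_template("Summarize this chunk:\n\n{doc_text}")
map_chain = LLMChain(llm=llm, prompt=map_prompt)

# Reduce step: combine the per-document results (here by stuffing them)
reduce_prompt = PromptTemplate.from_template("Combine these summaries:\n\n{context}")
combine_chain = StuffDocumentsChain(
    llm_chain=LLMChain(llm=llm, prompt=reduce_prompt),
    document_variable_name="context",
)
reduce_chain = ReduceDocumentsChain(combine_documents_chain=combine_chain)

chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_chain,
    document_variable_name="doc_text",  # where each doc goes in map_prompt
)
```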
One thing we've seen is that while default agents make it easy to prototype, a lot of teams want to customize some component of them in order to improve the accuracy of THEIR application
In order to enable this, we've exposed all the core components
The basic idea: you store multiple embedding vectors per document. How do you generate those embeddings? A few options (code sketch after the list):
👨👦Smaller chunks (this is ParentDocumentRetriever)
🌞Summary of document
❓Hypothetical questions
🖐️Manually specified text snippets
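Here's a rough sketch of the summary option (assumes Chroma + OpenAIEmbeddings; the doc and summary are placeholders, in practice you'd generate the summaries with an LLM):

```python
import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema import Document
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

# The vectorstore indexes the derived embeddings;
# the docstore holds the full parent documents
id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=Chroma(collection_name="parents", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),
    id_key=id_key,
)

docs = [Document(page_content="<a long parent document>")]
summaries = ["<an LLM-generated summary of that document>"]
doc_ids = [str(uuid.uuid4()) for _ in docs]

summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)   # search happens over summaries...
retriever.docstore.mset(list(zip(doc_ids, docs)))   # ...but the parents get returned
```

Swap the summaries for smaller chunks, hypothetical questions, or hand-picked snippets and the pattern stays the same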
Language models are getting larger and larger context windows
This is great, because you can pass bigger chunks in!
But if you have larger chunks, then a single embedding per chunk can start to fall flat, as there can be multiple distinct topics in that longer passage
One solution is to start creating not one but MULTIPLE embeddings per document
This was the basic idea behind the ParentDocumentRetriever we shipped ~2 weeks ago, but it's really much more general than that
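For reference, a minimal ParentDocumentRetriever setup (same Chroma + OpenAIEmbeddings assumptions; the chunk size is arbitrary):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Small chunks get embedded and searched; the full parent docs get returned
retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="chunks", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),
)
# retriever.add_documents(docs)                    # docs: a list of long Documents
# retriever.get_relevant_documents("some query")   # returns parents, not chunks
```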