ModifyDocumentsChain adds additional context to docs retrieved from the vectorstore based on their metadata, and runs before StuffDocumentsChain. #buildinpublic
In many use cases, we want to add more context to chunks retrieved from a vectorstore such as @pinecone before passing them to OpenAI, to improve GPT's accuracy.
ModifyDocumentsChain exposes a chunkModifier method that modifies each doc's content based on its metadata.
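A minimal sketch of the idea. The Document shape and the chunkModifier signature here are assumptions for illustration, not the actual API of the custom chain or of LangChain:

```typescript
// Hypothetical sketch: a modifier that prepends metadata-derived
// context to a chunk's content before it reaches the LLM.
interface Document {
  pageContent: string;
  metadata: Record<string, string>;
}

// Assumed signature: takes a Document, returns a modified Document.
function chunkModifier(doc: Document): Document {
  const company = doc.metadata["company"] ?? "an unknown";
  const context = `This is the earnings transcript chunk of ${company} company.\n`;
  return { ...doc, pageContent: context + doc.pageContent };
}

const chunk: Document = {
  pageContent: "Revenue grew 12% YoY driven by steel exports.",
  metadata: { company: "TataSteel" },
};
console.log(chunkModifier(chunk).pageContent);
```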
At stockinsights.ai, we let users leverage GPT over the financial data (e.g., earnings reports) of Indian companies. Our goal is to generate accurate answers to users' queries based only on these reports.
The Problem: Splitting large docs into chunks of meaningful context
There's no simple solution to this seemingly innocuous problem. The requirements:
1. Chunks should contain enough context to answer a user's query.
2. Relevant data can lie in multiple reports, so larger chunks crowd out other meaningful chunks that we pass to OpenAI.
3. A user can query with a company name & expects us to provide results exclusively for that company.
4. Chunks shouldn't share the same context, as that can inflate the similarity-search score of a non-relevant chunk.
We are trying different approaches to handle this issue.
Approach 1: Add context to each vectorstore chunk of a report at the time of persisting the chunk. E.g., chunks of the TataSteel company should contain: "This is the earnings transcript chunk of TataSteel company".
Cons: All of a company's chunks will have the same context, so point 4 gets affected.
Approach 2: Add the context to docs only before sending them to OpenAI. 1. This ensures that persisted chunks stay distinct & only highly relevant chunks are chosen. 2. OpenAI receives much better context too.
ModifyDocumentsChain is used to prepend this context to each chunk before sending it to OpenAI.
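Putting Approach 2 together, a sketch of the flow: modify each retrieved chunk using its metadata, then "stuff" the modified chunks into one prompt context. The Document shape and the names addCompanyContext / stuffDocuments are illustrative assumptions, not the library's API:

```typescript
// Illustrative pipeline: retrieve -> modify per metadata -> stuff into prompt.
interface Document {
  pageContent: string;
  metadata: Record<string, string>;
}

// Prepend company context derived from metadata (hypothetical helper).
function addCompanyContext(doc: Document): Document {
  const company = doc.metadata["company"] ?? "an unknown";
  return {
    ...doc,
    pageContent: `This is the earnings transcript chunk of ${company} company.\n${doc.pageContent}`,
  };
}

// What a stuff-style chain effectively does: concatenate doc contents.
function stuffDocuments(docs: Document[]): string {
  return docs.map((d) => d.pageContent).join("\n\n");
}

const retrieved: Document[] = [
  { pageContent: "EBITDA margin improved to 18%.", metadata: { company: "TataSteel" } },
  { pageContent: "Net profit rose on lower input costs.", metadata: { company: "JSWSteel" } },
];

// The stuffed string is what would be passed to OpenAI as context.
const promptContext = stuffDocuments(retrieved.map(addCompanyContext));
console.log(promptContext);
```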