Louis-François Bouchard 🎥🤖
Sep 13, 2024
This is the best #RAG stack, according to a fantastic study currently under review (Wang et al., 2024). It's a gold mine!

Here are the best components of each part of the system and how they work… 👇
First is Query Classification. Not all queries are equal; some don't need retrieval because the LLM already has the knowledge (e.g. "Who is Messi?").

They created 15 task categories based on whether the query alone provides sufficient information (see the figure in the paper).

They then train a binary classifier over these tasks: queries where the user-given information is "sufficient" (yellow) don't need retrieval, while "insufficient" ones (red) may require it.
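A minimal sketch of this routing step, assuming you've fine-tuned a small binary classifier on (query, needs-retrieval) labels; the checkpoint name and the llm()/retrieve() helpers are hypothetical:

```python
from transformers import pipeline

# Hypothetical checkpoint: an encoder fine-tuned to predict whether a
# query needs retrieval, in the spirit of the paper's classifier.
router = pipeline("text-classification", model="your-org/query-router")

def answer(query: str) -> str:
    label = router(query)[0]["label"]
    if label == "sufficient":       # the LLM already knows this
        return llm(query)           # hypothetical LLM call
    docs = retrieve(query)          # hypothetical retriever
    return llm(query, context=docs)
```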
Next is chunking your data—the old chunking challenge: not too small, not too large. You need the optimal context amount.

The size significantly impacts performance...

Too long adds context but increases costs and adds noise.
Too short is cheap and improves retrieval recall, but individual chunks may lack the relevant information.

The optimal chunk size balances metrics like faithfulness and relevancy. Faithfulness measures whether the response sticks to the retrieved texts (i.e., isn't hallucinated); relevancy measures whether the retrieved texts and the response actually match the query.

Between 256 and 512 is best in their study, but it depends on your data. Run evals!

small2big (search over small chunks, then pass the larger chunk that contains the match to the generator) and sliding windows (overlap tokens between adjacent chunks) both help.
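A rough sketch of both tricks, chunking by characters for simplicity (real implementations chunk by tokens):

```python
def sliding_window_chunks(text: str, size: int = 512, overlap: int = 20) -> list[str]:
    """Overlapping windows: each chunk shares `overlap` characters with the next."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def small2big(text: str, small: int = 256, big: int = 1024):
    """Index the small chunks for search, but keep a pointer to the big
    parent chunk, which is what you hand to the LLM at generation time."""
    parents = [text[i:i + big] for i in range(0, len(text), big)]
    children = [{"text": p[j:j + small], "parent_id": pid}
                for pid, p in enumerate(parents)
                for j in range(0, len(p), small)]
    return parents, children
```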
Use metadata and hybrid search.

Enhancing chunks with metadata like titles, keywords, and hypothetical questions helps in lots of cases.

Hybrid search combines vector search (on the original embeddings) with traditional keyword search (BM25), improving retrieval accuracy. HyDE (generating pseudo-documents from the query to improve retrieval) helps but is very inefficient; plain hybrid search is the better trade-off right now.
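A minimal score-fusion sketch, assuming the rank_bm25 package for the keyword side and normalized document embeddings for the dense side; the alpha weight is a knob to tune on your own evals:

```python
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_scores(query_tokens, query_vec, bm25: BM25Okapi,
                  doc_vecs: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    def _norm(s):
        # Min-max normalize so the two score scales are comparable.
        return (s - s.min()) / (s.max() - s.min() + 1e-9)

    sparse = np.asarray(bm25.get_scores(query_tokens))  # keyword scores
    dense = doc_vecs @ query_vec                        # cosine sim (normalized vecs)
    return alpha * _norm(sparse) + (1 - alpha) * _norm(dense)
```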
The Embedding model. Which one? To fine-tune or not?

They work with open-source models here. LLM-Embedder was best for its balance of performance and size:

github.com/FlagOpen/FlagE…

Just note that they only tested open-source models, so Cohere and OpenAI were out of the game. Cohere is probably your best bet otherwise.
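A minimal usage sketch; loading LLM-Embedder through sentence-transformers is an assumption on my part (the FlagEmbedding repo ships its own wrapper and task-specific instruction prefixes, which you should prefer):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/llm-embedder")  # Hugging Face Hub checkpoint

docs = ["Milvus is a vector database.", "Messi plays football."]
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode("Which database stores embeddings?",
                         normalize_embeddings=True)
scores = doc_vecs @ query_vec  # cosine similarity on normalized vectors
```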
Vector Database - which one?

Milvus seems ideal among the open-source options for long-term use.

milvus.io
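A minimal sketch using Milvus Lite (the local, file-backed mode of pymilvus, assuming pymilvus ≥ 2.4); the collection name and toy 4-dimensional vectors are illustrative only:

```python
from pymilvus import MilvusClient

client = MilvusClient("rag_demo.db")  # local Milvus Lite instance
client.create_collection(collection_name="chunks", dimension=4)

client.insert("chunks", [
    {"id": 0, "vector": [0.1, 0.2, 0.3, 0.4], "text": "Milvus is a vector DB."},
    {"id": 1, "vector": [0.4, 0.3, 0.2, 0.1], "text": "Messi plays football."},
])

hits = client.search("chunks", data=[[0.1, 0.2, 0.3, 0.4]],
                     limit=2, output_fields=["text"])
```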
Transform the user queries!

Use query rewriting (prompt an LLM to rewrite queries, mostly for clarity).

Use query decomposition (break complex questions into smaller sub-questions and retrieve for each).

For best results, use pseudo-document generation (e.g. HyDE: generate a hypothetical document from the query and retrieve with that instead), at the cost of extra latency.
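Minimal sketches of all three transforms; llm() is a hypothetical helper that sends a prompt to whatever chat model you use and returns its text:

```python
def rewrite(query: str) -> str:
    return llm(f"Rewrite this search query to be clear and self-contained:\n{query}")

def decompose(query: str) -> list[str]:
    subs = llm(f"Break this question into simple sub-questions, one per line:\n{query}")
    return [s.strip() for s in subs.splitlines() if s.strip()]

def hyde(query: str) -> str:
    # HyDE: embed a hypothetical answer passage instead of the raw query.
    return llm(f"Write a short passage that would answer this question:\n{query}")
```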
Use reranking!

Retrieve K documents and rerank them; it helps a lot.

Reranking ensures the most pertinent information appears at the top of the list.

monoT5 (DLM-based: it classifies each query-document pair as relevant, true or false) is best for balancing performance and efficiency. It fine-tunes T5 to reorder retrieved documents by scoring how well each document matches the query, so the most relevant results come first.

RankLLaMA has the best raw performance.

TILDEv2 is the quickest.
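A sketch of monoT5 scoring with Hugging Face transformers, assuming the castorini/monot5-base-msmarco checkpoint and its "Query: ... Document: ... Relevant:" prompt format:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "castorini/monot5-base-msmarco"
tok = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name).eval()

def relevance(query: str, doc: str) -> float:
    prompt = f"Query: {query} Document: {doc} Relevant:"
    inputs = tok(prompt, return_tensors="pt", truncation=True)
    out = model.generate(**inputs, max_new_tokens=1,
                         output_scores=True, return_dict_in_generate=True)
    logits = out.scores[0][0]  # logits of the first generated token
    true_id, false_id = tok.encode("true")[0], tok.encode("false")[0]
    # Relevance = probability mass on "true" vs "false".
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()

# docs and query come from the retrieval step; highest "true" probability first.
reranked = sorted(docs, key=lambda d: relevance(query, d), reverse=True)
```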
Document Repacking

This happens AFTER reranking.

Use “reverse” repacking, which arranges documents in ascending order of relevance, so the best ones sit at the end of the prompt. It's inspired by Liu et al. (arxiv.org/abs/2307.03172, an amazing paper!), who found that performance is best when relevant information is positioned at the start or end of the input. Repacking optimizes how the information is presented to the LLM for generation.
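A tiny sketch; scored_docs is assumed to be the (score, text) pairs coming out of the reranker:

```python
def repack_reverse(scored_docs: list[tuple[float, str]]) -> list[str]:
    # Ascending by score: the most relevant chunks end up LAST,
    # i.e. closest to the question at the bottom of the prompt.
    return [text for _, text in sorted(scored_docs, key=lambda p: p[0])]

context = "\n\n".join(repack_reverse(scored_docs))
prompt = f"Context:\n{context}\n\nQuestion: {query}"
```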
Use summarization to cut redundant or unnecessary information and reduce costs (long inputs sent to the LLM).

Recomp (github.com/carriex/recomp) is best. It has extractive (selects useful sentences) and abstractive (synthesizes information from multiple documents) compressors, providing the best of both worlds.

*In time-sensitive applications, removing summarization can effectively reduce response time.
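A minimal extractive compressor in the spirit of Recomp's extractive mode (not their actual implementation); embed() is a hypothetical helper returning normalized embeddings, e.g. the embedding model loaded earlier:

```python
import numpy as np

def compress(query: str, docs: list[str], keep: int = 5) -> str:
    # Keep only the sentences most similar to the query, in original order.
    sentences = [s.strip() for d in docs for s in d.split(". ") if s.strip()]
    sims = embed(sentences) @ embed([query])[0]
    top = sorted(np.argsort(sims)[::-1][:keep])
    return ". ".join(sentences[i] for i in top)
```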
If you can, fine-tuning the generator is worthwhile.

They experimented with various combinations of relevant and random documents (assuming 1:1(?), no ratio provided in the paper) to see how this affects the generator's output quality.

Augmenting the context with a mix of relevant and randomly-selected documents ("Disturb") during fine-tuning made the generator more robust to irrelevant information while still using the relevant context effectively.
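A sketch of building such a training mix, assuming a 1:1 ratio (the paper doesn't state one); pairs holds (query, answer, relevant_doc) triples and corpus is your document pool:

```python
import random

def build_examples(pairs, corpus):
    examples = []
    for query, answer, relevant_doc in pairs:
        noise = random.choice(corpus)   # one random "disturbing" document
        ctx = [relevant_doc, noise]
        random.shuffle(ctx)             # don't let position leak the answer
        examples.append({
            "prompt": f"Context: {' '.join(ctx)}\nQuestion: {query}",
            "completion": answer,
        })
    return examples
```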
When dealing with multimodalities (images)...

In text2image, a user's text query retrieves images from a database by similarity, which speeds things up when relevant images already exist: you serve a stored image instead of generating one on the fly.

In image2text (the more frequent case), a provided image is matched against similar images in the database to retrieve pre-stored captions or generate new ones. This improves groundedness, since you return accurate, pre-verified information tied to stored images.
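A minimal sketch using CLIP through sentence-transformers, which embeds images and text into the same space so one index can serve both directions; the file paths are illustrative:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")

img_vecs = clip.encode([Image.open("cat.jpg"), Image.open("dog.jpg")],
                       normalize_embeddings=True)
text_vec = clip.encode("a photo of a cat", normalize_embeddings=True)

best = (img_vecs @ text_vec).argmax()  # text2image: reuse the stored image
```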
The paper itself includes a great tl;dr figure summarizing all of this.
Keep in mind that these insights come from one and only one paper. The authors state some limitations:

• Joint training of retrievers and generators was not explored (and has great potential).
• Modular design (used for simplicity) limited the exploration of chunking techniques.
• High costs restricted evaluation of chunking methods.
• Expanding to speech and video modalities is a potential direction.

+ this is all open-source tools only. You may prefer to use @cohere, @OpenAI, @activeloopai, etc.
I made a video about it too, if you're interested.

I invite everyone to read their paper “Searching for Best Practices in Retrieval-Augmented Generation” by Wang et al., 2024:
arxiv.org/abs/2407.01219
