Rohan

@clusteredbytes

Oct 27 • 13 tweets • 4 min read

Previously we've seen how to improve retrieval by finetuning an embedding model.

@llama_index also supports finetuning an adapter on top of existing models, which lets us improve retrieval without updating our existing embeddings. 🚀

Let's see how it works 👇🧵

For adapters, we pull apart every single layer of the transformer and add new, randomly initialized weights.

Then, instead of finetuning all the weights, we freeze the weights of the pre-trained model and finetune only the newly added weights.

We apply a similar technique here 👇

Here we "freeze" the document embeddings and train a transformation on the query embedding instead.

Thus we're not limited to Sentence Transformer models.

We can apply this on top of any existing embedding model without re-embedding existing data.
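To make the idea concrete, here is a minimal pure-Python sketch (not LlamaIndex code). The names `DOC_EMBEDDINGS`, `retrieve` and the toy 2-d vectors are made up for illustration; the point is only that the stored document vectors are never touched, while the query vector goes through a transformation before the similarity search.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical document embeddings: computed once, then left "frozen".
DOC_EMBEDDINGS = {"doc_a": (1.0, 0.0), "doc_b": (0.0, 1.0)}

def retrieve(query_embedding, transform=lambda q: q):
    """Return the closest document after transforming only the query."""
    adapted = transform(query_embedding)
    return max(DOC_EMBEDDINGS, key=lambda name: cosine(adapted, DOC_EMBEDDINGS[name]))

query = (0.9, 0.1)
baseline = retrieve(query)  # no adapter: query used as-is
# Toy "adapter" that swaps the two coordinates (an assumption, not a trained one).
adapted = retrieve(query, transform=lambda q: (q[1], q[0]))
print(baseline, adapted)
```

Changing the transformation changes what gets retrieved, yet nothing in `DOC_EMBEDDINGS` had to be recomputed.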

The linear adapter:

The adapter updates the query embedding with this linear transformation:

updated_q = W*q + b

We train the linear adapter on the training corpus to find the best values for the weight W and the bias b.
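Here is a toy sketch of what "finding W and b" can look like, in plain Python. It assumes a simple squared-error objective that pulls each adapted query towards the embedding of its matching document; the actual loss used by LlamaIndex may differ. The helper names (`apply_adapter`, `train_linear_adapter`, `squared_error`) and the 2-d training pairs are made up for illustration.

```python
def apply_adapter(W, b, q):
    """Compute updated_q = W*q + b for one query embedding."""
    return [sum(w * x for w, x in zip(row, q)) + bias for row, bias in zip(W, b)]

def squared_error(W, b, pairs):
    """Total squared distance between adapted queries and their target docs."""
    return sum(
        sum((o - t) ** 2 for o, t in zip(apply_adapter(W, b, q), d))
        for q, d in pairs
    )

def train_linear_adapter(pairs, dim, lr=0.1, epochs=200):
    """Fit W and b with plain SGD on (query_embedding, doc_embedding) pairs."""
    # Start from the identity so an untrained adapter leaves queries unchanged.
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    b = [0.0] * dim
    for _ in range(epochs):
        for q, d in pairs:
            err = [o - t for o, t in zip(apply_adapter(W, b, q), d)]
            for i in range(dim):
                for j in range(dim):
                    W[i][j] -= lr * err[i] * q[j]  # gradient w.r.t. W[i][j]
                b[i] -= lr * err[i]                # gradient w.r.t. b[i]
    return W, b

# Hypothetical pairs: each query should land on its document's (frozen) embedding.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])]
identity = [[1.0, 0.0], [0.0, 1.0]]
error_before = squared_error(identity, [0.0, 0.0], pairs)
W, b = train_linear_adapter(pairs, dim=2)
error_after = squared_error(W, b, pairs)
print(error_before, error_after)
```

Only W and b change during training; the document embeddings in `pairs` are read but never updated.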

3 steps for finetuning adapters:

1. Generate a set of synthetic query-context pairs for the training and evaluation datasets.
2. Finetune our linear adapter on top of an existing model (e.g. ada).
3. Get the updated model using the base model and the adapter.

We use the generate_qa_embedding_pairs function from LlamaIndex to generate both the training and evaluation datasets.

Now, we create the finetune engine with all the parameters.

Next we perform finetuning using engine.finetune().

Finetuning an adapter is not resource-hungry and can be done on a MacBook, no beefy GPU required.

Then, we get the model with the finetuned adapter using engine.get_finetuned_model().

We can also build the model with the finetuned adapter from the base model and the path where the trained adapter was saved.

After getting the model, we use it as usual.

Instead of a simple linear adapter, LlamaIndex also supports more advanced transformations using deeper neural networks, e.g. TwoLayerNN, or even our own custom model by subclassing the BaseAdapter class.
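As a rough picture of what a deeper adapter does, here is a generic two-layer transformation in plain Python: a linear layer, a ReLU, then a second linear layer. This is an illustrative assumption, not the actual TwoLayerNN implementation from LlamaIndex, which may use different activations or extra connections. The function name and the weights are made up.

```python
def linear(W, b, x):
    """One dense layer: W*x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def relu(x):
    """Element-wise ReLU non-linearity."""
    return [max(0.0, v) for v in x]

def two_layer_adapter(q, W1, b1, W2, b2):
    """Transform a query embedding with two stacked linear layers."""
    hidden = relu(linear(W1, b1, q))
    return linear(W2, b2, hidden)

# Hypothetical weights, chosen only to make the arithmetic easy to follow.
W1, b1 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]
updated_q = two_layer_adapter([2.0, 1.0], W1, b1, W2, b2)
print(updated_q)
```

The non-linearity in the middle is what lets this adapter learn transformations that a single W*q + b cannot express.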

Stay tuned for guides on these.

As the document embeddings are unchanged, we can re-train this adapter at any point in the future as the data distribution changes.

The performance increase is not as large as with finetuning the entire model, but it is still slightly better than the pre-trained model.

Full guide with benchmarks in the official documentation: docs.llamaindex.ai/en/stable/exam…

https://twitter.com/1355239433432403968/status/1718010908151095715

Thanks for reading.

I write about AI, LLMs, RAG etc. and try to make complex topics as easy as possible.

Stay tuned for more! 🔥 #AI #RAG

