Framing biology in terms of design patterns provides several key benefits:
1. Design patterns identify recurrent solutions to common problems in biology. They represent conserved functional behaviors that evolve again and again.
2. Patterns provide abstraction that helps reveal overarching principles. Rather than getting lost in molecular details, patterns highlight conceptual themes.
3. Patterns facilitate comparative analyses between different biological systems. They provide a common language for recognizing similarities amidst diversity.
4. Patterns suggest approaches for engineering or manipulating biology, since they capture successful strategies evolution has converged on.
5. Thinking in terms of patterns guides modeling efforts. Models that embody design patterns will better capture the capabilities of biological systems.
6. Patterns represent simplifying themes that aid comprehension and memory. It's easier to learn and reason about a few key patterns than endless specifics.
7. Identifying patterns fosters systems thinking. It requires zooming out from components to consider entire system behaviors and why they arise.
8. Patterns may point to universal principles governing all biochemical cells, even on other planets. The same problems recur everywhere.
In summary, design patterns provide abstraction, simplification, unification, comparability, mimicry, comprehension, prediction, and perhaps universality. These benefits can aid both biology research and applications.
Creational design patterns
Template - Biosynthesis using a master copy as a template (e.g. DNA replication, transcription, translation). Includes kinetic proofreading for accuracy.
Assembly Line - Stepwise biosynthesis using a series of enzymes, like metabolic pathways. Requires negative feedback regulation.
Passive Assembly - Self-assembly due to favorable thermodynamics and random diffusion. Reversible. Examples are protein folding, dimerization, phase separations.
Active Assembly - Assembly requiring energy input and assistance from other cell components. Irreversible. Examples are chaperone-assisted folding, vesicle formation, cytoskeletal growth.
Pores and Pumps - Transport of molecules across membranes via passive and active transport through pores, channels, and pumps. Provides compartmentalization.
Transformation - Enzymatic degradation or modification of cellular components like proteins, lipids, nucleic acids. Tightly regulated. Includes protein cleavage and ubiquitination.
In summary, the creational patterns represent different mechanisms cells use to construct the molecules, structures and organizational features they need for life. They build up the physical composition of the cell.
Structural design patterns
Input/Output - Elements that input material/information from environment and output material/information to serve a purpose. Examples are transporters in metabolism, receptors in signaling.
Collector/Broadcaster - Convergence and divergence of information flow through master regulators. Examples are transcription factors, TOR proteins, hormones.
Common Currency - Use of a few standard energy/chemical sources repeatedly, like ATP, NADH. Related to bow tie architecture.
Chain - Linear sequence of reactions, like metabolic pathways or signaling cascades.
Parallel Paths - Multiple complementary pathways, like aerobic/anaerobic metabolism or signaling feedforward loops.
One-way Cycle - Circular pathway with return of downstream products to upstream to sustain the cycle. Examples are TCA cycle, cell cycle.
Annotation - Reversible covalent or noncovalent modification of proteins or DNA to represent information. Example is phosphorylation.
In summary, the structural patterns represent network connectivity motifs that solve problems of system inputs/outputs, information consolidation, substrate channeling, versatility, and temporal records.
Behavioral design patterns
Adaptation - Return to baseline after a perturbation, through negative feedback. Achieves homeostasis/tolerance. Examples are osmoregulation and chemotaxis.
Periodic - Networks that produce oscillations, like cell cycle and circadian clocks. Require negative feedback with time delay.
Proportional Output - Linear input-output relation over a range. Improves information transmission. Achieved through negative feedback or feedforward regulation.
Hyperbolic Output - Response saturates at high input levels. Maintains sensitivity over wide range. Caused by enzyme saturation.
Switching - Ultrasensitive all-or-none response. Produced by multisite phosphorylation or positive feedback.
Direction Maker - Reactions driven in one direction by energetics or substrate/product ratios. Example is ATP hydrolysis.
Insulator - Boundaries that isolate subnetworks, through localization, standardized connections, amplification/feedback. Improves modularity.
Fold-Change Detection - Output depends on fractional change in input, not absolute levels. Examples are chemotaxis, NF-κB, and Wnt pathways.
In summary, the behavioral patterns represent dynamics that enable adaptation, rhythmicity, information transmission, sensitivity, decision thresholds, irreversibility, modularity, and fractional sensing.
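As a concrete illustration of the Adaptation pattern, the sketch below simulates a minimal integral-feedback loop, one standard model of perfect adaptation (e.g. in bacterial chemotaxis). The equations and parameter values are illustrative, not taken from any specific system:

```python
# Minimal integral-feedback loop illustrating the Adaptation pattern.
#   dy/dt = u - z - y       output y is driven by input u and opposed by z
#   dz/dt = k * (y - y0)    controller z integrates deviation from setpoint y0
# At steady state dz/dt = 0 forces y = y0 for ANY constant input u,
# which is the hallmark of perfect adaptation.

def simulate(k=1.0, y0=1.0, dt=0.01, t_end=50.0, step_time=10.0):
    y, z = y0, 0.0                          # start at the pre-step steady state
    t, peak_dev = 0.0, 0.0
    while t < t_end:
        u = 1.0 if t < step_time else 2.0   # step perturbation in input
        # forward-Euler update (old y, z used on the right-hand side)
        y, z = y + (u - z - y) * dt, z + k * (y - y0) * dt
        if t > step_time:
            peak_dev = max(peak_dev, abs(y - y0))
        t += dt
    return y, peak_dev

final_y, peak_dev = simulate()
print(final_y, peak_dev)  # y transiently deviates, then re-adapts to y0
```

The output transiently overshoots after the input step but settles back to the setpoint, independent of the input's new value.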
A. Design Patterns Provide Abstraction and Reveal Common Solutions
- Design patterns encourage abstraction of biochemical networks into generalized behaviors and solutions.
- This allows parallels and differences between systems to be better recognized.
- Provides understanding of the functional "tools" cells have available and why mechanisms operate as they do.
- Facilitates identification of recurring problems and conserved solutions across biology.
B. Can Also Apply Patterns More Narrowly
- Design patterns can be identified at various levels, from specific molecules to multi-cellular organisms.
- This paper focuses on reaction networks in individual cells.
- But patterns could also be applied more narrowly, e.g. to the components of one signaling pathway.
- The EGFR/ERK pathway could be deconstructed into hierarchical and interlinked patterns for better understanding.
C. Connects Back to Computer Science Origins
- Design patterns originated in computer science to capture common programming solutions.
- This raises the question of what capabilities are needed for whole cell simulations.
- The biological design patterns represent solutions such simulations would need to incorporate.
- AI/ML methods could also potentially identify patterns automatically from network data.
D. Suggests Universal Principles of Biochemical Cells
- Cells face recurring problems of constructing components, connecting them, and animating behaviors.
- The solutions are design patterns that may be universal principles of life.
- If biochemical cells evolved again, the same problems and solutions would likely arise.
- Even alien life may exhibit the same patterns, as they reflect fundamental constraints.
- The patterns may constitute organizational principles that self-assemble in any biochemical cell.
These are inspiring patterns, and you have to wonder how they may relate to patterns in Artificial Fluency (i.e., large language models).
This paper introduces a powerful new technique for inverting text embeddings back to their source texts. The method, Vec2Text, demonstrates for the first time the ability to recover full text sequences from state-of-the-art neural text encoders. Through an iterative process of generation and error correction guided by embedding geometry, Vec2Text is able to reconstruct inputs with over 90% accuracy.
The implications of this advance are profound. It challenges the assumption that embeddings anonymize data by distilling texts down to latent representations. In fact, embeddings leak as much private information as raw text. This forces a re-evaluation of how embeddings are handled, shared, and secured. They must be safeguarded with the same stringency as the original textual data.
Beyond privacy concerns, Vec2Text enables myriad applications for improving natural language processing systems. Inversion offers new capabilities for model analysis, data augmentation, search relevance, conditioned generation, and debugging embedding spaces. The technique powerfully demonstrates that embeddings contain enough signal to reconstruct their source texts.
By developing the first method to reliably invert real-world embedding models at scale, this work sparks a new direction in understanding what embeddings contain. The results demand changes in how privacy and security communities view token embeddings. Meanwhile, the novel text generation process at the core of Vec2Text can be harnessed across language applications to advance the field. Both cautionary tale and technical blueprint, this paper highlights the double-edged sword of readable embeddings.
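As an illustrative toy (not the paper's method, which trains a conditional language model as the corrector), the generate-and-correct loop can be mimicked with a letter-count "embedding" and greedy single-character edits that shrink the distance to the target vector:

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase

def embed(text):
    """Toy 'embedding': a 26-dim letter-count vector (order-blind)."""
    c = Counter(text)
    return [c[ch] for ch in ALPHABET]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def invert(target_emb, length, max_rounds=10):
    """Greedy generate-and-correct: repeatedly apply the single-character
    substitution that most reduces the distance to the target embedding --
    a crude stand-in for Vec2Text's model-based correction step."""
    guess = ["a"] * length
    for _ in range(max_rounds):
        improved = False
        for i in range(length):
            best_c = guess[i]
            best_d = dist2(embed("".join(guess)), target_emb)
            for c in ALPHABET:
                guess[i] = c
                d = dist2(embed("".join(guess)), target_emb)
                if d < best_d:
                    best_c, best_d = c, d
                    improved = True
            guess[i] = best_c
        if not improved:
            break
    return "".join(guess)

target = "hello"
recovered = invert(embed(target), len(target))
# letter counts are order-blind, so we recover the multiset of characters
print(sorted(recovered) == sorted(target))
```

The toy recovers the characters but not their order, which is exactly why the real method conditions a trained generator on the embedding rather than hill-climbing a bag-of-letters.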
Here are some ways the text embedding inversion method proposed in this paper could be utilized to improve LLMs, embeddings, and search engines:
Improving LLMs:
- Generate more training data by inverting embeddings from a dataset. This expands the diversity of texts seen during training.
- Use inverted texts as stimuli for LLM probing and analysis. Reconstructed texts can reveal biases or artifacts of the LLM.
- Iteratively refine LLM generations by inverting the embedding, correcting, and regenerating text.
Improving Embeddings:
- Identify bugs or errors in embedding models by inspecting inverted training texts.
- Generate augmented training data by perturbing embeddings and inverting.
- Use inversion as an embedding evaluation metric - embedding quality correlates with reconstruction accuracy.
Improving Search:
- Index synthesized texts by inverting index embeddings for better semantic search.
- Generate search query variants by inverting the query embedding.
- Rerank search results by embedding queries, then inverting top results and comparing inverted texts.
So in summary, inversion provides a new capability to generate texts associated with arbitrary embeddings. This enables applications like data augmentation, analysis of models, improved matching and relevance ranking, and embedding debugging.
Related prior work
- Work in computer vision has reconstructed images from convolutional network embeddings. This is analogous to inverting text embeddings.
- Some methods try to recover query text from search engine query embeddings. These focus on short text and shallow encoders.
- Other work analyzes privacy leaks from clinical and word embedding models. This demonstrates embeddings can leak private data.
- Methods for controlled text generation are related, but usually require model gradients. Vec2Text only requires embeddings.
- Text autoencoders train decoders to reconstruct text, but from internal encoder states, not outputs.
- Analyses of gradient leakage in federated learning are similar in spirit. Gradients can reveal training data.
- A bag-of-words inversion method was proposed, but could not recover full text ordering.
- Other work trains a decoder on embedding-text pairs, but from a fixed pretrained encoder. Vec2Text inverts dynamically.
In summary, prior work either operated on limited short text, required model access, or achieved only partial reconstruction. Vec2Text advances reconstruction to long sequences from black-box state-of-the-art embedding models.
Diagnosis of Thought (DoT) Prompting is a 3-stage framework that provides step-by-step guidance for an AI system to detect cognitive distortions from a patient's speech.
Stage 1 - Subjectivity Assessment: Separate objective facts from subjective thoughts and opinions.
Stage 2 - Contrastive Reasoning: Elicit reasoning that both supports and contradicts the subjective thoughts.
Stage 3 - Schema Analysis: Identify underlying thought patterns and cognitive models causing distorted thinking.
By strategically prompting the AI through this structured analysis, DoT generates interpretable rationales and ensures a diligent, methodical diagnosis of distortions. The staged approach is analogous to a doctor's medical examination or a tutor's step-by-step guidance on a math problem. Overall, DoT leverages the reasoning capabilities of AI while providing the transparency needed for reliable and ethical application in mental healthcare.
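A minimal sketch of what the staged prompting might look like in code; the template wording below is assumed for illustration and is not the paper's exact prompt text:

```python
def dot_prompts(patient_speech):
    """Build the three Diagnosis-of-Thought stage prompts for one utterance.
    Wording is illustrative; in practice each stage's model output is
    appended to the conversation before the next prompt is issued."""
    stage1 = (
        "Patient statement:\n"
        f"{patient_speech}\n\n"
        "Stage 1 (Subjectivity Assessment): Separate the objective facts "
        "in this statement from the speaker's subjective thoughts and opinions."
    )
    stage2 = (
        "Stage 2 (Contrastive Reasoning): For each subjective thought "
        "identified above, give reasoning that supports it and reasoning "
        "that contradicts it."
    )
    stage3 = (
        "Stage 3 (Schema Analysis): Based on the contrastive reasoning, "
        "summarize the underlying thought schema and name any cognitive "
        "distortion (e.g. catastrophizing, all-or-nothing thinking)."
    )
    return [stage1, stage2, stage3]

for p in dot_prompts("I failed one exam, so I will never succeed at anything."):
    print(p, end="\n\n---\n\n")
```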
Diagnosis of Thought (DoT) Prompting is like a doctor performing a thorough medical examination to understand a patient's condition before making a diagnosis. It strategically prompts the AI system to analyze a person's speech step-by-step in order to detect cognitive distortions.
The first stage, Subjectivity Assessment, is like taking the patient's medical history. The AI separates factual information from the patient's subjective thoughts and feelings. This establishes an objective baseline, just like a doctor learning the patient's health background.
The second stage, Contrastive Reasoning, is analogous to a doctor ordering diagnostic tests for the patient. The AI elicits lines of reasoning that both support and contradict the subjective thoughts. This is like running medical tests from different angles to get a comprehensive picture of what's going on.
The third stage, Schema Analysis, is similar to a doctor piecing together test results to understand the underlying cause. The AI summarizes the thought patterns and cognitive models that explain the patient's distorted thinking. Just as tests help a doctor determine the root issue, this stage identifies the thought schema causing distortions.
Finally, the AI makes an assessment, like a doctor diagnosing the patient's condition after the exam. The step-by-step prompting allows for interpretability, just as a doctor explains how they deduced the diagnosis from the tests. This structured process ensures careful analysis before detecting cognitive distortions, like a diligent doctor investigating symptoms to reach an accurate diagnosis.
Diagnosis of Thought (DoT) Prompting is like a friendly tutor guiding a student through their math homework - it walks an AI system step-by-step to "show its work" in detecting cognitive distortions.
In the first stage, Subjectivity Assessment, the AI highlights the objective facts and subjective opinions like a tutor circling the known and unknown variables in a word problem. This separates the givens from the student's (patient's) own thoughts.
Next, Contrastive Reasoning has the AI map out reasoning supporting and contradicting the student's approach, just as a tutor would ask "How did you get this answer?" and "Is there another way to solve this?" Getting both perspectives gives a clearer picture, revealing flaws in the student's (patient's) thinking.
The third stage, Schema Analysis, has the AI tie it all together to identify patterns in the student's (patient's) problem-solving process. The tutor sees the student struggles with a certain type of problem or concept. Similarly, the AI recognizes underlying thought schemas causing distortions.
Finally, the AI makes an assessment on the student's (patient's) work - does it have mistakes? Where did they go wrong? A tutor grades the homework, while the AI detects distortions.
Like a friendly tutor, DoT Prompting walks the AI through reasoned analysis step-by-step. The "show your work" approach builds interpretability, like a tutor giving feedback on the student's mistakes to improve their skills. Guiding the AI this way helps detect distortions more reliably.
At its core, MEMWALKER transforms how AI systems process long-form knowledge - and this paradigm shift unlocks tremendous value.
For far too long, language models have been confined by fixed context lengths that severely limit their understanding of anything beyond short snippets. But real intelligence requires reasoning about extensive documents, books, and dialogue.
MEMWALKER finally provides LLMs with a human-like reading ability - the power to interactively traverse an information landscape, seeking exactly the pieces they need for the task at hand. No more blindly processing huge swaths of text - now models can zero in on what matters.
And this new capability opens up a world of possibilities. With MEMWALKER, LLMs can start analyzing legal contracts, synthesizing research papers, even consuming entire novels. We can scale up conversational agents that track long-term context. More intelligent question answering over books or articles is within reach.
At the heart of this leap is prompting. MEMWALKER showcases how the right prompts enable so much more - tree construction, selective reading, error recovery. Prompting unlocks higher reasoning, turning LLMs into sophisticated information navigators.
So MEMWALKER is no mere incremental advance - it ushers in a new era of contextual intelligence. The limits we took for granted are dissolving before our eyes. As T.S. Eliot put it, “We shall not cease from exploration, and the end of all our exploring will be to arrive where we started and know the place for the first time.” MEMWALKER lights the path ahead.
Here is an intuitive explanation of the MEMWALKER process using a storybook example:
Imagine a long storybook with 20 chapters.
To construct the memory tree, we first summarize each chapter into a short 1-paragraph summary. These become the leaf nodes.
We then group summaries from every 5 chapters and summarize those into a parent node summary. For example, chapters 1-5 are summarized into parent node A.
We repeat this recursive summarization up the tree: chapters 6-10 are summarized into node B, chapters 11-15 into node C, and chapters 16-20 into node D. Nodes A through D are then summarized into the root node.
Now if we want to answer a question about Chapter 7, the model starts at the root node summary. From there it decides to inspect node B, since that covers chapters 6-10.
Looking at the summaries for chapters 6-7-8-9-10, it then decides to look specifically at chapter 7's full text.
After reading chapter 7, it has enough info to answer the question without needing to look at other chapters.
During this traversal, the model stores a working memory of the nodes it visited, like the root summary and the chapter 6-10 summary.
If it picked the wrong path initially, say chapter 15, it could revert back up the tree after seeing the full text is irrelevant.
The key ideas are:
- Summarize segments into a tree structure
- Use reasoning prompts to traverse the tree
- Access full text only for relevant segments
- Maintain global context in working memory
- Recover from wrong paths by going back up the tree
So in summary, the model reads selectively by navigating the tree, rather than processing the full story.
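The tree construction and traversal above can be sketched in a few lines. Here a trivial chapter-number "summarizer" and a word-overlap score stand in for the LLM's summarization and prompted navigation reasoning, and backtracking on wrong paths is omitted for brevity:

```python
import re

class Node:
    def __init__(self, summary, children=None, text=None):
        self.summary = summary          # short summary seen during navigation
        self.children = children or []
        self.text = text                # full text, kept only at leaf nodes

def build_tree(chapters, summarize, fanout=5):
    """Recursively summarize segments into a memory tree, leaves to root."""
    nodes = [Node(summarize(ch), text=ch) for ch in chapters]
    while len(nodes) > 1:
        groups = [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]
        nodes = [Node(summarize(" ".join(n.summary for n in g)), children=g)
                 for g in groups]
    return nodes[0]

def navigate(root, query, score):
    """Walk root-to-leaf, following the child whose summary best matches
    the query (the LLM's prompted decision in the real method), keeping
    the visited summaries as working memory."""
    node, memory = root, []
    while node.children:
        memory.append(node.summary)
        node = max(node.children, key=lambda c: score(query, c.summary))
    return node.text, memory

# Toy stand-ins: the 'summarizer' just lists chapter numbers it sees, and
# the navigation 'reasoning' is plain word overlap with the query.
chapters = [f"chapter {i}: the hero finds clue {i} in town {i}."
            for i in range(1, 21)]
summarize = lambda t: "covers chapter " + " ".join(re.findall(r"chapter (\d+)", t))
score = lambda q, s: len(set(q.split()) & set(s.split()))

root = build_tree(chapters, summarize)
text, memory = navigate(root, "which clue appears in chapter 7", score)
print(text)  # full text of the selected leaf chapter
```

Only the chosen leaf's full text is ever read; everything else is seen through summaries, which is the point of the pattern.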
Key limitations of large language models (LLMs) that MEMWALKER helps address:
Context Length Limitation:
- LLMs have a predetermined maximum context window size (e.g. 4096 tokens) due to memory constraints.
- This limits their ability to process long sequences of text.
- MEMWALKER navigates a summary tree to access relevant segments, overcoming the context length limitation.
Position Bias:
- Attention in LLMs is biased towards certain positions like the beginning or end.
- But not all positions have equal relevance in long texts.
- By interactively reading the tree, MEMWALKER can focus on the most useful content.
Information Loss in Recurrence:
- Recurrent LLMs tend to lose details from earlier segments.
- MEMWALKER maintains a working memory to retain global context.
Ineffective Retrieval:
- Retrieval systems often match similar documents rather than coherent text.
- But MEMWALKER retrieves related segments from a single document.
Lack of Reasoning:
- LLMs may struggle to reason about which parts of a long text are relevant.
- MEMWALKER's prompts elicit explicit reasoning from the LLM to make informed navigation decisions.
So in summary, MEMWALKER introduces an interactive prompting approach to overcome inherent LLM limitations like fixed context, bias, recurrence loss, ineffective retrieval, and lack of reasoning when dealing with long coherent texts. By selectively reading the most relevant segments, MEMWALKER pushes the boundaries of LLM memory capacity.
This New Research Will Change How You Design RAG Systems
Should you use longer contexts or retrieval to boost your RAG model's performance on long document tasks? This important paper provides key insights through a comprehensive study comparing and combining both techniques using massive 70B parameter LLMs.
The authors deliver compelling results that challenge assumptions. Their experiments show retrieval continues enhancing even models with huge 32K contexts, achieving new SOTA accuracy.
Practitioners building real-world RAG applications will gain immense value from these findings. The paper offers clear guidance for optimal model selection and integration strategies to maximize accuracy and speed.
By thoroughly investigating the interplay of retrieval and context length, this research significantly advances RAG system design. It provides practitioners tangible benefits for improving performance on QA, summarization, and recommendation tasks.
The concrete insights and analysis offer a roadmap for developing the next generation of high-accuracy RAG models. Those building real-world systems would be remiss not to read this paper and apply its lessons.
Experimental Setup:
A. LLMs
- GPT-43B: 43 billion parameter proprietary model trained with NVIDIA NeMo on 1.1T tokens. 48 layers, hidden size 8192. Trained with 4096 context length.
- LLaMA2-70B: Public 70B parameter model. Trained on 2T tokens with 80 layers and hidden size 8192. Also 4096 context length.
B. Tasks
- 7 datasets: QMSum, Qasper, NarrativeQA, QuALITY, MuSiQue, HotpotQA, MultiFieldQA
- Range of doc lengths from 5k to 85k tokens
- Tasks: question answering, query based summarization
- Metrics: ROUGE, F1, exact match
C. Evaluation Metrics
- ROUGE-1/2/L for QMSum
- F1 for Qasper, NarrativeQA, MuSiQue, HotpotQA, MultiFieldQA
- Exact match for QuALITY
D. Context Extension
- Test 16K for GPT-43B, 16K and 32K for LLaMA2-70B
- Use positional interpolation method
- Finetune on Pile dataset to adapt
E. Retrieval
- Test Dragon, Contriever, OpenAI embeddings
- Encode queries and chunked contexts
- Retrieve top 5/10/20 chunks
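The retrieval step (chunk the document, embed, keep the top-k chunks) can be sketched with a bag-of-words cosine standing in for the Dragon, Contriever, or OpenAI embedding models used in the paper:

```python
from collections import Counter
import math

def chunk(text, size=300):
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy bag-of-words embedding; a real system would call Dragon,
    Contriever, or an OpenAI embedding model here."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=5):
    """Rank chunks by similarity to the query; the top-k chunks are then
    concatenated into the LLM's context window."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Hypothetical two-topic document, so relevance is easy to eyeball.
doc = ("the treaty was signed in 1921 . " * 50 +
       "the harvest festival happens each autumn . " * 50)
top = retrieve("when was the treaty signed ?", chunk(doc, size=50), k=2)
print(all("treaty" in c for c in top))
```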
F. Instruction Tuning
- Finetune on blend of 102K samples
- Format: System prompt, context, question, answer
- Take loss only on answer
Results
A. Retrieval helps both short and long context LLMs
- Retrieval significantly boosts 4K context models, helps close gap to 16K context models
- Retrieval also improves 16K and 32K context models over baseline
B. Comparable to GPT-3.5 and Davinci
- Best LLaMA2-70B-32k + retrieval matches GPT-3.5-turbo-16K and exceeds Davinci-003 on average scores
C. Impact of different retrievers
- All retrievers (Dragon, Contriever, OpenAI) improve results
- Public retrievers can outperform OpenAI embeddings
D. Varying number of retrieved chunks
- Increasing from 5 to 20 chunks does not always help
- Best results obtained with top 5 or 10 chunks
- Too much irrelevant context can hurt performance
Introducing FRESHPROMPT and FRESHQA - the latest innovations to finally make your language models fresh!
Legacy RAG systems simply overwhelm your AI with irrelevant text passages from the internet. But our breakthrough approach is different. We carefully extract the most relevant, factual snippets and structure them for in-context learning.
FRESHPROMPT leverages search engine intelligence to retrieve vital related Q&As and teaches your model to reason over evidence chronologically. This adapts your AI to current knowledge and reduces hallucination.
And we validated it thoroughly with FRESHQA - our novel dataset with diverse question types requiring up-to-the-minute facts. Extensive human evaluation proves traditional RAG struggles while FRESHPROMPT shines.
The results speak for themselves - up to 50% reduction in incorrect and outdated responses! Why settle for a stale AI when you could have a FRESH one?
There are a few key reasons why standard retrieval augmented generation (RAG) methods are ineffective on their own for the task explored in this paper, requiring the development of FRESHPROMPT:
- RAG systems retrieve contextual passages but do not select factual snippets. FRESHPROMPT incorporates metadata to extract factual snippets from search engines.
- RAG systems provide retrieved documents without any structure. FRESHPROMPT organizes and formats snippets in a principled way.
- RAG systems do not teach the LLM how to reason over retrievals. FRESHPROMPT provides demonstrations for in-context learning.
- RAG systems retrieve documents using the original query. FRESHPROMPT leverages search engines' understanding of related questions and answers.
- RAG systems are not optimized for real-time QA. FRESHPROMPT prioritizes recent snippets to adapt to new knowledge.
- RAG systems retrieve excessive irrelevant context. FRESHPROMPT is focused and keeps only highly relevant snippets.
- RAG systems require training signal from human demonstrations or supervision. FRESHPROMPT is fully unsupervised.
In essence, standard RAG lacks the capability to identify, select, structure, and teach LLMs using the most relevant, factual, and up-to-date snippets for real-time QA. The innovations in FRESHPROMPT address these limitations in a simple and effective way.
Key details about the FRESHPROMPT method proposed in the paper:
- Given a question, it first queries a search engine (Google Search) to retrieve relevant results.
- Extracts snippets from organic search results, answer boxes, related questions, knowledge graphs etc.
- Casts snippets into a common format with metadata like source, date, title, highlights.
- Sorts snippets chronologically with most recent at the end.
- Provides 5 question-answer demonstrations at the beginning of the prompt.
- Each demonstration shows an example question, retrieved snippets, and reasoning to get the answer.
- Teaches the LLM to read through the snippets, reason over them, and output the most up-to-date answer.
- Uses 10 organic search snippets, 2 related questions, 2 QA platform snippets for GPT-3.5 by default.
- Uses 10 organic, 3 related questions, 3 QA platform snippets for GPT-4 by default.
- After sorting, keeps only as many of the most recent, relevant snippets as fit within the model's context length limit.
- Performs a single inference call to the LLM to generate the answer.
- No additional training required compared to other search-augmented methods.
- Substantially improves accuracy over baseline LLMs and other prompting approaches.
- Analysis shows the number and order of snippets affect accuracy, with additional snippets generally improving performance.
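Putting the formatting, sorting, and demonstration steps above together, a FRESHPROMPT-style prompt assembler might look roughly like this; the field names, layout, and example data are assumed for illustration, not the paper's exact template:

```python
from datetime import date

def format_snippet(s):
    """Cast one search result into a common evidence format with metadata."""
    return (f"source: {s['source']}\ndate: {s['date'].isoformat()}\n"
            f"title: {s['title']}\nsnippet: {s['text']}")

def fresh_prompt(question, snippets, demonstrations, max_snippets=15):
    """Assemble the prompt: few-shot demonstrations first, then evidence
    snippets sorted oldest-to-newest (most recent closest to the question),
    then the question itself."""
    recent = sorted(snippets, key=lambda s: s["date"])[-max_snippets:]
    evidence = "\n\n".join(format_snippet(s) for s in recent)
    demos = "\n\n".join(demonstrations)
    return (f"{demos}\n\n{evidence}\n\n"
            f"question: {question}\n"
            "answer (use the most recent evidence above):")

# Hypothetical snippets: the newer one should override the stale fact.
snippets = [
    {"source": "news.example", "date": date(2023, 9, 1),
     "title": "New CEO announced", "text": "Jane Doe becomes CEO."},
    {"source": "wiki.example", "date": date(2020, 1, 5),
     "title": "Company page", "text": "John Smith is CEO."},
]
demos = ["question: ...\nanswer: ..."]  # 5 worked QA examples in practice
print(fresh_prompt("who is the CEO ?", snippets, demos))
```

A single LLM call on this prompt then produces the answer, with the freshest evidence sitting nearest the question.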
Self-specialization is crucially important for the ongoing development and progress of large language models (LLMs) for the following key reasons:
Expertise in Niche Domains
- As LLMs are applied to more specialized domains like biomedicine, law, etc., uncovering domain expertise is critical. Self-specialization provides an efficient way to carve out niche expertise from generalist LLMs.
Data Efficiency
- Acquiring expert annotations is challenging. Self-specialization only needs a handful of seeds, enabling domain specialization with minimal human involvement. This is far more practical than relying solely on manual data.
Parameter Efficiency
- Compact specialization modules can be overlaid on top of a shared base LLM, avoiding redundant parameters for each domain. This allows serving multiple expert models efficiently.
Adaptability
- The self-supervised approach inherently adapts the LLM to new domains by generating tailored data. This is more flexible than pre-defined training objectives.
Scalability
- By having LLMs self-generate data, self-specialization removes the training data bottleneck. This enables scaling to new domains easily without manual data collection.
In summary, self-specialization essentially provides a pathway to extract specialized knowledge in an adaptable, scalable, and extremely efficient manner. This will be a critical capability as we push LLMs into more and more expert domains while needing to maintain versatility and avoid exponentially growing data and parameter needs. Unlocking latent domain expertise will be key, and self-specialization offers a highly promising approach to make this feasible.
Here is a more detailed explanation of the self-specialization process described in a recent paper.
The first step is to collect a small set of high-quality, human-authored seed instructions that encapsulate the core concepts and intricacies of the target domain (e.g. biomedicine). These serve as the starting point.
Then, the large language model (LLM) is prompted to generate new synthetic instructions and corresponding input contexts based on the seeds. The model combines and recombines elements from the seed instructions to produce novel, domain-tailored instructions and contexts. This allows it to expand beyond the original seeds to cover a broader scope of the domain.
Next is the response generation phase. Here, the model's responses are grounded in relevant external knowledge retrieved from an unlabeled corpus of domain documents. Specifically, the instruction and context are used as a query to retrieve the most relevant documents, which supply supplementary domain information to inform the model's responses.
After obtaining a robust set of synthetic data comprising domain-specific instructions, contexts, and responses, the model enters the specialization phase. The base LLM undergoes fine-tuning on this generated dataset to adapt its knowledge and calibrate its expertise specifically for the target domain. This transforms the previously generic model into a specialized expert.
Finally, the process can be iterated by using the now-specialized model as the generator to create new instructions and responses. Each iteration further refines the model's domain expertise. The key advantage is that the model learns to generate high-quality, accurate domain data in a self-supervised manner.
In summary, self-specialization provides an efficient way to uncover and optimize the latent domain expertise inside large pre-trained LLMs with minimal manual supervision. The automatically generated data steers the model to resonate with the nuances of the specialized domain.
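The phases above can be sketched as a loop. Below, `generate`, `retrieve`, `respond`, and `finetune` are toy stand-ins for the LLM, retriever, and fine-tuning calls, so the example only illustrates the data flow, not a real training run:

```python
def self_specialize(seeds, generate, retrieve, respond, finetune,
                    rounds=2, n_new=3):
    """Sketch of the self-specialization loop: expand seed instructions,
    ground responses in retrieved domain documents, then fine-tune on the
    synthetic (instruction, docs, response) triples, iterating with the
    refined model as the new generator."""
    instructions = list(seeds)
    model = "base-llm"
    for _ in range(rounds):
        new = [generate(model, instructions) for _ in range(n_new)]
        data = []
        for inst in instructions + new:
            docs = retrieve(inst)                     # top relevant domain docs
            data.append((inst, docs, respond(model, inst, docs)))
        model = finetune(model, data)                 # specialization phase
        instructions += new                           # iterate with new model
    return model, instructions

# Toy stand-ins so the sketch runs end to end.
gen = lambda m, ins: f"variant of: {ins[-1]}"
ret = lambda inst: ["doc about " + inst.split()[-1]]
rsp = lambda m, inst, docs: f"answer grounded in {docs[0]}"
ft = lambda m, data: m + f"+tuned({len(data)})"

model, insts = self_specialize(["summarize this biomedical abstract"],
                               gen, ret, rsp, ft)
print(model, len(insts))
```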
Here are the key points from the conclusion of this paper:
- Self-specialization shows promise for uncovering latent domain expertise within large language models (LLMs) with minimal supervision.
- The paper introduces and explores this new concept of guiding LLMs to specialize in target domains via self-generated instructional data.
- Experiments demonstrate the effectiveness of self-specialization in a biomedical domain. The self-specialized model substantially outperforms its base model and even larger aligned models.
- This highlights the efficiency and practicality of self-specialization, achieved with very limited seed data and compact parameter-efficient tuning.
- The simple proposed approach opens exciting opportunities for further work on optimizing and extracting specialized knowledge from generalist foundation models.
- Future directions include exploring smaller base models, additional domains like sports, combining multiple self-specialized models, and iterative refinement.
- Overall, self-specialization provides an innovative pathway to uncover the latent expertise within large LLMs in a highly efficient and adaptable manner with minimal human involvement.
- The promising preliminary results signify an important advancement towards specialized models that can address nuances of different expert domains while maintaining versatility.