Rohan Paul
Aug 9 • 4 tweets • 4 min read
LLM Basics - Binary Quantization 🔥

🧵 A thread - 1/n 👇

The concept itself isn't new, but what's reignited interest is the recent announcement from @cohere regarding their support for int8 and binary embeddings in their Cohere embed v3.

📌 First, in essence, embeddings are numerical representations of more complex objects, like text, images, audio, etc. Specifically, the objects are represented as n-dimensional vectors.

After transforming the complex objects, you can determine their similarity by calculating the similarity of the respective embeddings! This is crucial for many use cases: it serves as the backbone for recommendation systems, retrieval, one-shot or few-shot learning, outlier detection, similarity search, paraphrase detection, clustering, classification, and much more.

-------

📌 Binary Quantization for embeddings

Unlike quantization in models where you reduce the precision of weights, quantization for embeddings refers to a post-processing step for the embeddings themselves. In particular, binary quantization refers to the conversion of the float32 values in an embedding to 1-bit values, resulting in a 32x reduction in memory and storage usage.

--------

✨ Binary quantization example

Vector embeddings are usually generated by embedding models, such as Cohere's embed v3, and a single vector embedding takes the following form:

[0.056, -0.128, -0.029, 0.047, …, 0.135]

To quantize float32 embeddings to binary, we simply threshold the normalized embeddings at 0.

That is, each dimension is reduced to its sign, turning the float vector into a binary vector:

1: If the value is greater than or equal to 0.

0: If the value is smaller than 0.

So you get something like this:

[1, 0, 0, …, 1]
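
A minimal sketch of this thresholding step in NumPy (the embedding values are made up for illustration, and np.packbits is just one common way to store the result compactly):

import numpy as np

# A float32 embedding as produced by an embedding model (toy values).
embedding = np.array([0.056, -0.128, -0.029, 0.047, 0.135], dtype=np.float32)

# Binary quantization: 1 if the value is >= 0, else 0.
binary = (embedding >= 0).astype(np.uint8)
print(binary)  # [1 0 0 1 1]

# Pack 8 bits per byte: a 1024-dim float32 vector (4096 bytes)
# shrinks to 128 bytes, i.e. the 32x reduction mentioned above.
packed = np.packbits(binary)
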
🧵 2/n

📌 So why does binary quantization reduce vector embedding size so much?

It's kind of like turning a colored image into a black and white image.

By converting the floating point numbers, which are stored in 32 bits, into a single bit, you only need 1/32nd of memory space to store a binarized vector. This can lead to increased search speed and reduced storage costs.

And because vector embeddings are usually high-dimensional, you can still get meaningful similarity measures for vector search. 🤯

✨ Now the question is: how do we calculate the similarity of vectors that have been binarized?

📌 We can use the Hamming Distance to efficiently perform retrieval with these binary embeddings. This is simply the number of positions at which the bits of two binary embeddings differ. The lower the Hamming Distance, the closer the embeddings, and thus the more relevant the document. A huge advantage of the Hamming Distance is that it can be computed with just two CPU instructions per machine word (XOR followed by a popcount), allowing for blazingly fast performance.
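
A minimal sketch of Hamming-distance retrieval over packed binary embeddings (plain NumPy; production vector databases do the same thing with SIMD popcount instructions, but the logic is identical):

import numpy as np

def hamming_distance(a_bits, b_bits):
    # XOR marks the bit positions where the two embeddings differ;
    # counting the set bits gives the Hamming Distance.
    return int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())

rng = np.random.default_rng(0)
dim = 1024

# Pretend these came from an embedding model and were then binarized + packed.
query = np.packbits(rng.standard_normal(dim) >= 0)
docs = [np.packbits(rng.standard_normal(dim) >= 0) for _ in range(5)]

# Lower Hamming Distance = more similar document.
ranking = sorted(range(len(docs)), key=lambda i: hamming_distance(query, docs[i]))
print(ranking)
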
🧵 3/n

🤔 Why is Binary Quantization (BQ) particularly suitable for high-dimensional vectors?

Simply because, in higher dimensional space, even with BQ, the vector can retain a high degree of information.

First, noting the basics, the number of elements in a single vector represents the total dimensionality of that vector. Each element of a vector represents a coordinate in a particular dimension, so a vector with `n` elements is said to inhabit an n-dimensional space.

When we refer to a vector's dimensionality, we are essentially describing how many degrees of freedom or independent directions of information it contains. For example, a 3-dimensional vector might represent a point in 3D space with coordinates along the X, Y, and Z axes.

📌 In high-dimensional spaces, vectors possess a large number of elements. Despite each element being aggressively quantized to a single bit, the overall vector retains substantial aggregate information. The high dimensionality ensures that, even in binary form, the relationships and structures inherent to the data can be preserved to a useful extent.

📌 This rests on the assumption that the essential information of the vector is distributed across its many dimensions, allowing the binary-reduced vector to approximate the original's informational content in aggregate, despite the severe reduction in precision per dimension.
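
A quick empirical sanity check of that intuition (a rough sketch on random vectors, not a benchmark): compare the ranking produced by cosine similarity on the float vectors with the ranking produced by Hamming distance on their binarized versions, and watch the agreement improve as dimensionality grows.

import numpy as np

def rank_agreement(dim, n_docs=200, seed=0):
    # Correlation between the float cosine ranking and the binary Hamming ranking.
    rng = np.random.default_rng(seed)
    query = rng.standard_normal(dim)
    docs = rng.standard_normal((n_docs, dim))

    # Float similarity: cosine.
    cos = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

    # Binary similarity: negative Hamming distance between sign bits.
    ham = -np.count_nonzero((query >= 0) != (docs >= 0), axis=1)

    # Compare the two rankings.
    r_cos, r_ham = cos.argsort().argsort(), ham.argsort().argsort()
    return float(np.corrcoef(r_cos, r_ham)[0, 1])

for dim in (16, 128, 1024):
    print(dim, round(rank_agreement(dim), 3))
# The rank agreement tends to increase with dimensionality.
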
🧵 4/n

✨ What are the drawbacks of Binary Quantization?

Firstly, the adoption of binary quantization impacts the accuracy and precision of your search results. Although you can still retrieve relevant outcomes, the nuance and detail provided by higher-resolution data can be lost, leading to less precise results.

Furthermore, binary quantization is a one-way street: once you've converted your data into binary form, there's no turning back. This process is a form of lossy compression, meaning once the data has undergone quantization, the original, detailed information is irretrievably lost.

• • •


More from @rohanpaul_ai

Jul 8
Incredible results for the RAG world from @nvidia model 👏. Llama3-RankRAG from @nvidia significantly outperforms GPT-4 models on 9 knowledge-intensive benchmarks. 🤯

📌 Performs comparably to GPT-4 on 5 RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, demonstrating its superb capability for generalization to new domains. 🤯

The secret is a novel instruction fine-tuning framework, named RankRAG 👨‍🔧

Llama3-RankRAG-8B and Llama3-RankRAG-70B outperform Llama3-ChatQA-1.5-8B and Llama3-ChatQA-1.5-70B by a clear margin, respectively. 🔥

The problem with traditional RAG was that LLMs typically utilize only the top-k contexts from a retriever.

This led to suboptimal performance, especially when dealing with a large number of retrieved passages or when initial retrieval results were poor.

The key question this paper addresses is how to unify context ranking and answer generation within a single LLM for more effective RAG. The researchers conclude that their proposed RankRAG method significantly outperforms existing approaches by instruction-tuning an LLM for both ranking and generation tasks.

📌 RankRAG instruction-tunes a single LLM for dual purposes: context ranking and answer generation in RAG. This unified approach allows the model to excel at both tasks simultaneously. The process incorporates a small fraction of ranking data (about 50k examples) alongside other task-specific datasets, and yields superior ranking performance compared to models trained on much larger ranking datasets.

📌 RankRAG uses a retrieve-rerank-generate pipeline. The LLM first reranks the top-N retrieved contexts, then generates answers based on the refined top-k contexts.

📌 The training blend for RankRAG includes context-rich QA data, retrieval-augmented QA data, context ranking data, and retrieval-augmented ranking data. This diverse mix enhances the model's ability to handle various RAG scenarios.

📌 The method addresses the trade-off between recall and precision in context selection. By incorporating ranking, RankRAG can effectively use a smaller number of highly relevant contexts (e.g., top-5) while maintaining or improving performance.

📌 RankRAG's ranking capability transfers well across different retrievers and generalizes to unseen domains, showcasing its robustness and adaptability.
📌 Current RAG systems use limited retrievers (e.g. BM25, BERT) for efficiency, compromising relevance estimation accuracy.

📌 There's a trade-off in selecting top-k contexts: small k misses information, large k introduces noise.

📌 Performance often plateaus around k=10, as shown with ChatQA-1.5.

📌 These limitations motivate RankRAG's approach of integrating ranking into the LLM itself.
RankRAG process:

📌 Instruction-tune LLM on diverse dataset blend: context-rich QA, retrieval-augmented QA, ranking data.

📌 At inference: Retriever gets top-N documents.

📌 LLM reranks these N documents, selecting top-k most relevant.

📌 Same LLM generates answer using selected top-k contexts.

📌 Unified model performs both ranking and generation, improving RAG pipeline efficiency.

📌 Uses only ~50k ranking examples, yet outperforms models trained on much larger datasets.
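
A rough sketch of what that retrieve-rerank-generate flow looks like in code. The retriever, llm_rank_score, and llm_generate callables are hypothetical placeholders, not the paper's API; the point is that a single instruction-tuned LLM serves both the scoring call and the generation call:

from typing import Callable, List

def rank_rag_answer(
    question: str,
    retriever: Callable[[str, int], List[str]],     # cheap retriever, returns top-N passages
    llm_rank_score: Callable[[str, str], float],    # the LLM scores passage relevance
    llm_generate: Callable[[str, List[str]], str],  # the same LLM answers from contexts
    top_n: int = 100,
    top_k: int = 5,
) -> str:
    # 1) Retrieve a generous candidate pool with the retriever.
    candidates = retriever(question, top_n)

    # 2) Rerank: the instruction-tuned LLM scores every candidate's relevance.
    reranked = sorted(candidates, key=lambda p: llm_rank_score(question, p), reverse=True)

    # 3) Generate: answer using only the refined top-k contexts.
    return llm_generate(question, reranked[:top_k])
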
Jun 30
Another 'WOW' paper - Up to 20x improvement in inference throughput with Block Transformer compared to vanilla transformers with equivalent perplexity. 🤯

How❓ By MASSIVELY reducing KV cache IO overhead from quadratic to linear with respect to context length, solving a key challenge in scaling to very long contexts, plus a novel application of global-to-local modeling. 🤯

Paper - "Block Transformer: Global-to-Local Language Modeling for Fast Inference":

📌 Block Transformers can also be uptrained from pretrained vanilla models, closely approaching the performance of those pretrained from scratch, using just 10% of the training budget.

📌 It adopts a hierarchical global-to-local modeling approach. It isolates the expensive bottlenecks of global modeling to lower layers and applies fast local modeling in upper layers. This is achieved through three components:

1. Embedder: aggregates each block of L_B input tokens into an input block embedding. Here, L_B is the block length, i.e. the number of tokens aggregated into a single block.

2. Block decoder: an autoregressive transformer that applies self-attention between blocks to decode a context block embedding for predicting the next block.

3. Token decoder: autoregressively decodes the token contents of the next block, applying local self-attention between only the L_B tokens within the block.

📌 The block decoder reduces overall costs through its coarse granularity. It mitigates the quadratic costs of self-attention by using coarse-grained block inputs instead of individual tokens, reducing context length by L_B. This reduces FLOPs for positionwise computations by L_B and attention score computation by L_B^2. KV cache usage and KV cache IO are also reduced by L_B and L_B^2 respectively.
📌 The token decoder nearly eliminates the costs of attention as there is no need to compute, store, and retrieve KV-cache of past tokens beyond the small local context of L_B tokens. It eliminates prefill (necessary only in the block decoder) and reduces KV cache IO from quadratic to linear with respect to context length. This allows for significantly higher compute unit utilization.
📌 To incorporate the context embedding and leverage the low-cost compute in the token decoder, the context block embedding is projected into prefix tokens. This enables further refinement of the global context and allows increasing computational width of the token decoder by extending the prefix length.
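
Back-of-the-envelope arithmetic for those reductions (an illustrative sketch of the scaling argument, not the paper's exact cost model):

def block_transformer_savings(context_len: int, block_len: int) -> dict:
    # The block decoder attends over context_len / block_len block embeddings,
    # so its quadratic terms shrink by block_len**2; the token decoder only
    # attends within one block of block_len tokens, independent of context_len.
    n_blocks = context_len // block_len
    return {
        "vanilla_attention_scores": context_len**2,
        "block_decoder_attention_scores": n_blocks**2,   # reduced by block_len**2
        "token_decoder_attention_scores": block_len**2,  # constant w.r.t. context_len
        "vanilla_kv_cache_entries": context_len,
        "block_decoder_kv_cache_entries": n_blocks,      # reduced by block_len
    }

print(block_transformer_savings(context_len=8192, block_len=4))
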
Hierarchical global-to-local architectures have shown significant potential to effectively model large-scale data by addressing global dependencies in coarse detail and capturing fine details within local regions.
Jun 30
LLMs are highly sensitive to prompt variations, leading to inconsistent performance across different prompts for the same task. 👨‍🔧

Intent-based Prompt Calibration (IPC) iteratively refines prompts to match user intent using synthetic boundary cases, addressing prompt sensitivity and optimizing with limited data.

📌 IPC generates challenging synthetic samples at each iteration, focusing on boundary cases that expose prompt ambiguities.

📌 The system employs three meta-prompts: Sample Generator, Analyzer, and Prompt Generator. The Sample Generator creates diverse, adversarial samples with balanced class distribution. The Analyzer evaluates prompt performance and identifies failure cases. The Prompt Generator suggests improved prompts based on historical performance and analysis.

📌 For generative tasks, IPC first calibrates a ranking prompt, then uses it to optimize the generative prompt. This approach allows optimization with minimal annotation effort.

📌 The system architecture consists of four components: Dataset (manages data operations), Estimator (handles predictions and annotations), Evaluator (assesses records and performs error analysis), and Optimizer (manages the optimization process flow).

📌 IPC outperforms existing methods like OPRO and PE on classification tasks (spoiler detection, sentiment analysis, PG detection) and generative tasks (enthusiastic/reliable and sarcastic/positive movie reviews).

📌 The method demonstrates superior performance with limited data, achieving higher accuracy and lower variance compared to baseline approaches.

📌 Ablation studies reveal the importance of synthetic data generation, iterative refinement, and error analysis in improving model performance.

📌 IPC effectively handles imbalanced data distributions by generating balanced synthetic samples, particularly beneficial for real-world moderation tasks.
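
A rough sketch of that calibration loop. The generate_boundary_samples, evaluate, analyze_failures, and propose_prompt callables stand in for the Sample Generator, the estimation step, the Analyzer, and the Prompt Generator meta-prompts; they are hypothetical placeholders, not the paper's code:

from typing import Callable

def calibrate_prompt(
    initial_prompt: str,
    generate_boundary_samples: Callable,  # Sample Generator meta-prompt
    evaluate: Callable,                   # runs the prompt, returns (score, failure_cases)
    analyze_failures: Callable,           # Analyzer meta-prompt
    propose_prompt: Callable,             # Prompt Generator meta-prompt
    n_iterations: int = 5,
) -> str:
    history = []  # (prompt, score, analysis) triples seen so far
    prompt = initial_prompt
    for _ in range(n_iterations):
        # 1) Generate challenging, class-balanced boundary cases for the current prompt.
        samples = generate_boundary_samples(prompt, history)
        # 2) Run the prompt on them and collect the score plus failure cases.
        score, failures = evaluate(prompt, samples)
        # 3) Analyze where and why the prompt failed.
        analysis = analyze_failures(prompt, failures)
        history.append((prompt, score, analysis))
        # 4) Propose a refined prompt from the accumulated history.
        prompt = propose_prompt(history)
    # Return the best-scoring prompt seen during calibration.
    return max(history, key=lambda h: h[1])[0]
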
Jun 30
A very intriguing recent paper "Nested Jailbreak Prompts can Fool LLMs Easily" - reveals the inadequacy of current defense methods in safeguarding LLMs.

Generalizes jailbreak prompt attacks into two aspects:

(1) Prompt Rewriting and
(2) Scenario Nesting.

📌 Proposes ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. It significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. The study also reveals the inadequacy of current defense methods in safeguarding LLMs.

📌 ReNeLLM framework introduced, generalizing jailbreak prompt attacks into prompt rewriting and scenario nesting. Prompt rewriting involves operations like paraphrasing, altering sentence structure, misspelling sensitive words, inserting meaningless characters, partial translation, and changing expression style. These operations preserve semantic meaning while disguising harmful intent.

📌 Scenario nesting embeds rewritten prompts into task scenarios like code completion, text continuation, and table filling. This leverages LLMs' instruction-following capabilities to bypass safety alignments. Scenarios chosen align with training data, shift attention, and leave blanks for completion.

📌 Automated process uses LLMs to generate and evaluate jailbreak prompts. GPT-3.5 performs rewriting and harmfulness evaluation. Nested prompts fed to target LLM (e.g. Claude-2) for response. Success determined by harmful output generation.

📌 ReNeLLM achieves state-of-the-art attack success rates (ASR) across open and closed-source LLMs. For Claude-2, ReNeLLM attains 69.6% GPT-ASR compared to 0% for baselines. Time cost reduced by 76.61% vs GCG and 86.19% vs AutoDAN.

📌 Attention visualization reveals LLMs' priority shift from balancing external/internal instructions to favoring external ones after rewriting/nesting. This explains jailbreak success and informs potential defenses.

📌 Defense strategies explored: incorporating priority prompts (e.g. "prioritize safety"), enhancing safety through supervised fine-tuning, and using harmfulness classifiers. Results show trade-offs between safety and performance, highlighting challenges in developing robust defenses.
Jun 29
This 76-page paper on Prompting Techniques has become quite popular. A nice read for your weekend.

- "The Prompt Report: A Systematic Survey of Prompting Techniques": ✨

Explores structured understanding and taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities.

📌 The paper focuses on discrete prefix prompts rather than cloze prompts, because prefix prompts are widely used with modern LLM architectures like decoder-only models. It excludes soft prompts and techniques using gradient-based updates.

📌 The paper identifies 58 text-based prompting techniques broken into 6 major categories:

1) In-Context Learning (ICL) - learning from exemplars/instructions in the prompt
2) Zero-Shot - prompting without exemplars
3) Thought Generation - prompting the LLM to articulate reasoning
4) Decomposition - breaking down complex problems
5) Ensembling - using multiple prompts and aggregating outputs
6) Self-Criticism - having the LLM critique its own outputs

📌 For ICL, it discusses key design decisions like exemplar quantity, ordering, label quality, format, and similarity that critically influence output quality. It also covers ICL techniques like K-Nearest Neighbor exemplar selection.

📌 Extends the taxonomy to multilingual prompts, discussing techniques like translate-first prompting and cross-lingual ICL. It also covers multimodal prompts spanning image, audio, video, segmentation, and 3D modalities.

📌 More complex techniques like agents that access external tools, code generation, and retrieval augmented generation are also taxonomized. Evaluation techniques using LLMs are discussed.

📌 Prompting issues like security (prompt hacking), overconfidence, biases, and ambiguity are highlighted. Two case studies - benchmarking techniques on MMLU and an entrapment detection prompt engineering exercise - are presented.
Jun 28
Activation Beacon is such a classic paper from Jan-2024

"Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon"

Can be a revolutionary paper if implementable for all cases - for massively increasing the context window of LLMs

Authors trained LLaMA-2 for 10K steps with a 4K context window and then it generalized to a 400K context window 🔥

📌 The key technique is to condense the LLM's raw activations into more compact forms so that it can perceive a much longer context with a limited context window. Activation Beacon is introduced as a plug-and-play module for the LLM.

📌 "It fully preserves the LLM's original capability on short contexts while extending the new capability on processing longer contexts. Besides, it works with short sliding windows to process the long context, which achieves a competitive memory and time efficiency in both training and inference."

📌 "Activation Beacon is learned by the auto-regression task conditioned on a mixture of beacons with diversified condensing ratios. Thanks to such a treatment, it can be efficiently trained purely with short-sequence data in just 10K steps, which consumes less than 9 hours on a single 8xA800 GPU machine."

📌 "The experimental studies show that Activation Beacon is able to extend Llama-2-7B's context length by ×100 times (from 4K to 400K), meanwhile achieving a superior result on both long-context generation and understanding tasks. Our model and code will be available at the BGE repository."
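
Rough arithmetic behind that 100x figure (a deliberately simplified illustration of the condensing idea, not the paper's exact bookkeeping): if past activations are condensed into beacons at ratio k, a fixed attention window can cover roughly k times more raw context.

def effective_context(window: int, condensing_ratio: int) -> int:
    # Raw context length a fixed window can roughly cover when past
    # activations are condensed into beacons at the given ratio.
    return window * condensing_ratio

# Llama-2's native 4K window with an average condensing ratio of ~100
# lands at roughly the 400K figure reported in the paper.
print(effective_context(window=4096, condensing_ratio=100))  # 409600
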
A relevant question that automatically comes to mind: 'How does the quality of retrieval change with context length'❓

And we can refer to this paper for understanding that aspect.

Paper - 'Training-Free Long-Context Scaling of Large Language Models'

This paper shows that existing long-context LLMs, which have already supported a 32k context window, can further extrapolate to a 192k context length while maintaining high passkey retrieval accuracy and low perplexity. arxiv.org/abs/2402.17463

Activation Beacon Paper - arxiv.org/abs/2401.03462

Activation Beacon Github Code - github.com/FlagOpen/FlagE…
