Rohan Paul
πŸ’Ό AI Engineer. Compiling, in real time, the race towards AGI/ASI 🐎. Follow to stay on the AI bleeding edge. β†’ My Newsletter - https://t.co/Jfj0r0we5f
Sep 14 β€’ 4 tweets β€’ 1 min read
The biggest contribution of OpenAI Strawberry (o1) πŸ“ is inference-time scaling.

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.

The entire result-search space becomes a mini dataset of training examples.
Sep 12 β€’ 5 tweets β€’ 4 min read
Is this a Vacuum-Tube to Silicon-Transistor moment for LLMs, if true? 🀯

Published in Nature yesterday.

✨ Molecular memristors enable 14-bit analog computing, surpassing digital efficiency for core matrix operations.

The paper achieved a >73 dB signal-to-noise ratio, a four-order-of-magnitude improvement over the state of the art, while consuming 460x less energy than digital computers

**Original Problem** πŸ”:

Vector-matrix multiplication (VMM) is computationally expensive, requiring n^2 steps for vectors of length n. Current dot-product engines (DPEs) for VMM have low precision (2-6 bits) due to non-idealities in analog circuit elements.
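For intuition, here's a tiny numerical sketch (my own illustration, not the paper's hardware) of what a crossbar-style dot-product engine computes: program the matrix as conductances, apply the input vector as voltages, and the column currents give the whole vector-matrix product in one analog step via Ohm's and Kirchhoff's laws.

```python
import numpy as np

# Hypothetical illustration of crossbar-style vector-matrix multiplication (VMM).
# Conductances G play the role of the matrix; input voltages V drive the rows;
# the current collected on column j is sum_i V[i] * G[i, j].

rng = np.random.default_rng(0)

n = 64                                    # crossbar size (64x64, as in the paper)
G = rng.uniform(0.0, 1.0, size=(n, n))    # programmed conductance matrix (arbitrary units)
V = rng.uniform(-1.0, 1.0, size=n)        # input voltage vector

# Digital reference: n^2 multiply-accumulate steps.
I_digital = V @ G

# Analog crossbar: the same sums are produced by physics in a single read-out step.
I_analog = np.einsum("i,ij->j", V, G)

assert np.allclose(I_digital, I_analog)
print(I_analog[:4])
```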
-----

**Key Insights from this Paper** πŸ’‘:

β€’ Developed 14-bit precision molecular memristor crossbar for VMM

β€’ Supramolecular electronics yield unprecedented precision in neuromorphic hardware for AI acceleration.
β€’ Achieved linear, symmetric weight updates with 16,520 distinct analog levels
β€’ Enabled one-step programmability of conductance levels
β€’ Implemented selector-free crossbar design using unidirectional elements
-----

**Solution in this Paper** πŸ§ͺ:

β€’ Fabricated 64x64 crossbar using [Ru^II L_2](BF_4)_2 molecular film

β€’ Engineered symmetric potentiation/depression characteristics
β€’ Utilized supramolecular electronic transitions between 31 and 22 states
β€’ Implemented custom >16-bit precision CMOS peripheral circuit
β€’ Compensated for wire resistances and parasitic effects
-----

**Results** πŸ“Š:

β€’ 16,520 distinct analog levels with 14-bit resolution

β€’ Signal-to-noise ratio of 73-79 dB for VMM operations

β€’ 4 orders of magnitude improvement in precision over state-of-the-art
β€’ 460x higher energy efficiency than CPU for matrix operations
β€’ Demonstrated high-fidelity image reconstruction via inverse Fourier transform

πŸ“š Published in Nature -

πŸ“š Here it is without the paywall: -

nature.com/articles/s4158…
researchgate.net/publication/37…
Sep 3 β€’ 4 tweets β€’ 3 min read
The common retrievers like DPR (Dense Passage Retrieval) normally work with 100-word Wikipedia paragraphs. πŸ€”

πŸ’‘ This paper proposes LongRAG - processes the entire Wikipedia into 4K-token units, 30x longer than before πŸ”₯

By increasing the unit size, they significantly reduce the total units from 22M to 600K. This significantly lowers the burden of retriever, which leads to a remarkable retrieval score: answer recall@1=71% on NQ (previously 52%) and answer recall@2=72% (previously 47%) on HotpotQA (full-wiki).

This technique is particularly beneficial for open-domain question answering, where detailed and accurate responses are crucial. By leveraging external information, RAG systems can overcome the limitations of relying solely on the parametric knowledge embedded in LLMs, making them more effective in handling complex queries.

πŸ“Œ Challenges for regular RAG πŸ‘‡

Traditional RAG frameworks often use short retrieval units, such as 100-word passages, requiring the retriever to sift through large amounts of data. This design burdens the retriever heavily while the reader's task remains relatively simple, leading to inefficiencies and potential semantic incompleteness due to document truncation.

πŸ’‘ And so here comes LongRAG

To address these challenges, this LongRAG framework comprises a "long retriever" and a "long reader" component, designed to process longer retrieval units of around 4K tokens each.

By increasing the size of the retrieval units, LongRAG reduces the number of units from 22 million to 600,000, significantly easing the retriever's workload and improving retrieval scores. This innovative approach allows the retriever to handle more comprehensive information units, enhancing the system's efficiency and accuracy.

✨ How it works πŸ‘‡

πŸ“Œ Retrieval unit selection impacts performance. Passage-level units have a turning point between 100-200, document-level between 5-10, and grouped documents between 4-8. Optimal context length for the reader is around 30K tokens.

πŸ“Œ Semantic integrity of retrieval units is crucial. Longer, more complete units outperform shorter, fragmented ones.

πŸ“Œ LongRAG approximates similarity scores between queries and long retrieval units by maximizing scores between the query and all chunks within the unit. This outperforms direct encoding of entire long contexts.

πŸ“Œ The framework uses a two-turn approach for answer extraction: 1) Generate a longer answer (few words to sentences) from retrieved context. 2) Extract a concise short answer (few words) using in-context examples.

πŸ“Œ The LongRAG framework operates by grouping related documents into long retrieval units, which the long retriever then processes to identify relevant information.

To extract the final answers, the retriever filters the top 4 to 8 units, which are concatenated and fed into a long-context LLM, such as Gemini-1.5-Pro or GPT-4o. This method leverages the advanced capabilities of long-context models to process large amounts of text efficiently, ensuring a thorough and accurate extraction of information.
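Here's a minimal sketch, pulling together the points above, of how the chunk-max scoring and two-turn answer extraction could look. `embed` and `llm` are hypothetical stand-ins for an embedding model and a long-context LLM; this is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def score_unit(query_emb: np.ndarray, chunk_embs: np.ndarray) -> float:
    """Score a long retrieval unit by the max similarity between the query and its chunks."""
    sims = chunk_embs @ query_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb)
    )
    return float(sims.max())

def long_rag_answer(query: str, units: list[dict], embed, llm, top_k: int = 6) -> str:
    """Rank ~4K-token units by their best-matching chunk, then read the top few."""
    query_emb = embed(query)
    ranked = sorted(
        units,
        key=lambda u: score_unit(query_emb, np.stack([embed(c) for c in u["chunks"]])),
        reverse=True,
    )
    # Concatenate the top 4-8 units and hand them to a long-context reader.
    context = "\n\n".join(u["text"] for u in ranked[:top_k])
    # Two-turn extraction: long answer first, then a concise short answer.
    long_answer = llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer in detail:")
    short_answer = llm(f"Question: {query}\nLong answer: {long_answer}\nExtract a short answer:")
    return short_answer
```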

πŸ“Œ Performance πŸ‘‡

- On the Natural Questions (NQ) dataset, it achieved an exact match (EM) score of 62.7%, a significant leap forward compared to traditional methods. On the HotpotQA dataset, it reached an EM score of 64.3%.

So it matches the performance of state-of-the-art fine-tuned RAG models.

The framework reduced the corpus size by up to 30 times and improved the answer recall by approximately 20 percentage points compared to traditional methods, with an answer recall@1 score of 71% on NQ and 72% on HotpotQA.

πŸ—žοΈ Paper - arxiv.org/pdf/2406.15319…
Sep 2 β€’ 4 tweets β€’ 3 min read
Useful Prompting technique.

Simply ask the LLM to re-read the question - significantly boosts LLM reasoning across diverse tasks and model types. πŸ’‘

Repeats question input twice in prompt, unlocks latent reasoning potential

**Problem** πŸ€”:

Decoder-only LLMs with unidirectional attention struggle with nuanced reasoning tasks due to limited global understanding of input questions.

**Key Insights from this Paper πŸ’‘**:

β€’ Re-reading (RE2) input enhances reasoning by improving question comprehension
β€’ Enables "bidirectional" understanding in unidirectional LLMs
β€’ Compatible with existing thought-eliciting prompting methods
β€’ Effective across various LLM types and reasoning tasks

**Solution in this Paper** πŸ”:

β€’ Introduces RE2 (Re-Reading) prompting method:
- Repeats question input twice in prompt
- Enhances input understanding before reasoning
- Allows tokens to attend to full context in second pass
β€’ Compatible with Chain-of-Thought and other prompting techniques
β€’ Applicable to zero-shot, few-shot, and self-consistency settings

**Results** πŸ“Š:

β€’ Consistent improvements across 14 datasets and 112 experiments
β€’ Effective for both instruction-tuned (ChatGPT) and non-tuned (LLaMA) models
β€’ Increases n-gram recall between generation and input question
β€’ Most effective when reading question twice

Example inputs of CoT prompting versus CoT prompting with RE2.

RE2 is a simple prompting method that repeats the question as input.

Typically, tokens in the question, such as "tennis balls", cannot see subsequent tokens in the original setup for LLMs (the top figure).

In contrast, LLMs with RE2 allow "tennis balls" in the second pass to see the entire question containing "How many ...", achieving an effect of a "bidirectional" understanding (the bottom figure).
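For concreteness, here's a hedged sketch of an RE2-style prompt builder combined with zero-shot CoT; the exact re-read phrasing is one of several variants, so treat this as an illustrative template rather than the canonical one.

```python
def re2_cot_prompt(question: str) -> str:
    """Build a Re-Reading (RE2) prompt: the question is stated twice before reasoning."""
    return (
        f"Q: {question}\n"
        f"Read the question again: {question}\n"
        "A: Let's think step by step."
    )

print(re2_cot_prompt(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
))
```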
Aug 16 β€’ 6 tweets β€’ 3 min read
I love Self-Calibration Prompting Technique πŸ‘¨β€πŸ”§

πŸ“Œ It's a two-step prompting process. Initially, the LLM is prompted to answer a specific question.

Subsequently, a new prompt is generated that includes the original question, the LLM's response, and an additional query asking the LLM to evaluate the correctness of its own answer.

🎯 This introspective step is designed to assess the confidence level of the response, providing a built-in mechanism for self-evaluation.

Example:

1. Question: What are the current treatment options for Type 2 diabetes?

2. LLM’s Answer: Current treatment options for Type 2 diabetes include lifestyle modifications, oral medications like metformin, and in some cases, insulin therapy.

3. Follow-up Prompt: Reflecting on the latest medical guidelines, is this response accurate and complete?
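A minimal sketch of this two-step flow in code, with a hypothetical `llm()` callable standing in for whatever model/API you use:

```python
def self_calibrate(question: str, llm) -> dict:
    """Two-step Self-Calibration: answer first, then ask the model to judge its own answer."""
    # Step 1: get the initial answer.
    answer = llm(f"Question: {question}\nAnswer:")

    # Step 2: feed the question and the model's own answer back in and ask for a judgement.
    evaluation = llm(
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "Is the proposed answer accurate and complete? "
        "Reply with 'True' or 'False' and a brief justification."
    )
    return {"answer": answer, "self_evaluation": evaluation}
```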

--------

The concept of Self-Calibration came from the paper "Language Models (Mostly) Know What They Know"

🧡 2/n

🧠 LLMs often struggle with accurately evaluating their own knowledge and capabilities, which can lead to overconfident or unreliable outputs. This paper investigates whether LLMs can be trained to recognize what they do and don't know, and how this ability generalizes across tasks, concluding that larger models demonstrate improved calibration and self-evaluation across diverse tasks.
Aug 15 β€’ 9 tweets β€’ 7 min read
What you see here is a disruptive open/distributed AI tech.

A Network of Nodes Is All You Need πŸ’‘

I am so intrigued

🧡 1/n - A long thread πŸ‘‡

Here a Language Model is running on someone else's machine, for free, no API needed. 🀯

πŸ“Œ A swarm of agents running intelligently on a distributed network of nodes and making use of the fastest AI infrastructure and the most powerful open models.

This is @HyperspaceAI - where they were able to get GPT-4 comparable results using a complex distributed system spanning 100+ models with local consumer devices.

With 17,745+ unique nodes and 100+ models already on the network you can serve LLMs today. This is a growth of 100 to 18,000 nodes in 10 months. During the same period, they also overcame scalability issues, and rebuilt for 1M+ node support.

---

✨ Below are some features that are in the pipeline and will be supported very soon.

- Embedding models, re-rankers, vectors, and more, for other consumers and developers - coming soon.

- Points-based incentives rolling out soon. A novel Proof-of-FLOPS system that sends a matrix multiplication challenge to your device based on the VRAM pledged and rewards nodes accordingly, in addition to improving the reliability of the network - soon to be supported.

---

And @HyperspaceAI just launched the web version at node.hyper.space

So you can join the world's fastest-growing P2P AI network in multiple ways πŸ‘‡

🌏: Join using just a web browser
πŸ’»: Join using a client on your desktop or laptop
πŸ“±: Join using a browser on your smartphone
πŸ–₯️: Join using just the command line or a server

🧡 2/n

Run a node using just your web browser: node.hyper.space
Aug 9 β€’ 4 tweets β€’ 4 min read
LLM Basics - Binary Quantization πŸ”₯

🧡 A thread - 1/n πŸ‘‡

The concept itself isn't new, but what's reignited interest is the recent announcement from @cohere regarding their support for int8 and binary embeddings in their Cohere embed v3.

πŸ“Œ First, in essence, embeddings are numerical representations of more complex objects, like text, images, audio, etc. Specifically, the objects are represented as n-dimensional vectors.

After transforming the complex objects, you can determine their similarity by calculating the similarity of the respective embeddings! This is crucial for many use cases: it serves as the backbone for recommendation systems, retrieval, one-shot or few-shot learning, outlier detection, similarity search, paraphrase detection, clustering, classification, and much more.

-------

πŸ“Œ Binary Quantization for embeddings

Unlike quantization in models where you reduce the precision of weights, quantization for embeddings refers to a post-processing step for the embeddings themselves. In particular, binary quantization refers to the conversion of the float32 values in an embedding to 1-bit values, resulting in a 32x reduction in memory and storage usage.

--------

✨ Binary quantization example

Vector embeddings are usually generated by embedding models, such as Cohere’s embed v3, and a single vector embedding will look like the following:

[0.056, -0.128, -0.029, 0.047, …, 0.135]

To quantize float32 embeddings to binary, we simply threshold normalized embeddings at 0

That is, because these embeddings have very small absolute numbers close to zero, you can turn them into a binary vector:

1: If the value is greater or equal to 0.

0: If the value is smaller than 0.

So that you get something like this.

[1, 0, 0, …, 1]

🧡 2/n

πŸ“Œ So basically why does binary quantization reduce vector embedding size so much?

It's kind of like turning a colored image into a black and white image.

By converting the floating point numbers, which are stored in 32 bits, into a single bit, you only need 1/32nd of memory space to store a binarized vector. This can lead to increased search speed and reduced storage costs.

And because vector embeddings are usually high-dimensional, you can still get meaningful similarity measures for vector search. 🀯

✨ Now the question is: how do you calculate the similarity of vectors that have been binarized?

πŸ“Œ We can use the Hamming Distance to efficiently perform retrieval with these binary embeddings. This is simply the number of positions at which the bits of two binary embeddings differ. The lower the Hamming Distance, the closer the embeddings, and thus the more relevant the document. A huge advantage of the Hamming Distance is that it can be easily calculated with 2 CPU cycles, allowing for blazingly fast performance.
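A small numpy sketch of both steps above (my own illustration): binarize by thresholding at 0, pack the bits, then rank documents by Hamming distance computed via XOR + popcount.

```python
import numpy as np

def binarize(embs: np.ndarray) -> np.ndarray:
    """Threshold float32 embeddings at 0 -> 1-bit values, packed 8 per byte (32x smaller)."""
    return np.packbits(embs >= 0, axis=-1)

def hamming_distance(packed_a: np.ndarray, packed_b: np.ndarray) -> np.ndarray:
    """Number of differing bits: XOR the packed bytes, then count the set bits (popcount)."""
    return np.unpackbits(packed_a ^ packed_b, axis=-1).sum(axis=-1)

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 1024)).astype(np.float32)   # fake document embeddings
query = rng.normal(size=(1, 1024)).astype(np.float32)       # fake query embedding

docs_bin = binarize(docs)        # (10000, 128) uint8 -> 128 bytes per doc instead of 4096
query_bin = binarize(query)

dists = hamming_distance(docs_bin, query_bin)   # broadcast over all documents
top10 = np.argsort(dists)[:10]                  # lowest Hamming distance = most similar
print(top10, dists[top10])
```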
Jul 8 β€’ 10 tweets β€’ 4 min read
Incredible results for the RAG world from @nvidia model πŸ‘. Llama3-RankRAG from @nvidia significantly outperforms GPT-4 models on 9 knowledge-intensive benchmarks. 🀯

πŸ“Œ Performs comparably to GPT-4 on 5 RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, demonstrating its superb capability for generalization to new domains. 🀯

The secret is a novel instruction fine-tuning framework, named RankRAG πŸ‘¨β€πŸ”§

Llama3-RankRAG-8B and Llama3-RankRAG-70B outperform Llama3-ChatQA-1.5-8B and Llama3-ChatQA-1.5-70B by a margin, respectively. πŸ”₯

The problem with traditional RAG was that LLMs typically utilize only the top-k contexts from a retriever.

This led to suboptimal performance, especially when dealing with a large number of retrieved passages or when initial retrieval results were poor.

The key question this paper addresses is how to unify context ranking and answer generation within a single LLM for more effective RAG. The researchers conclude that their proposed RankRAG method significantly outperforms existing approaches by instruction-tuning an LLM for both ranking and generation tasks.

πŸ“Œ RankRAG instruction-tunes a single LLM for dual purposes: context ranking and answer generation in RAG. This unified approach allows the model to excel at both tasks simultaneously. The process incorporates a small fraction of ranking data (about 50k examples) alongside other task-specific datasets, yet yields superior ranking performance compared to models trained on much larger ranking datasets.

πŸ“Œ RankRAG uses a retrieve-rerank-generate pipeline. The LLM first reranks the top-N retrieved contexts, then generates answers based on the refined top-k contexts.

πŸ“Œ The training blend for RankRAG includes context-rich QA data, retrieval-augmented QA data, context ranking data, and retrieval-augmented ranking data. This diverse mix enhances the model's ability to handle various RAG scenarios.

πŸ“Œ The method addresses the trade-off between recall and precision in context selection. By incorporating ranking, RankRAG can effectively use a smaller number of highly relevant contexts (e.g., top-5) while maintaining or improving performance.

πŸ“Œ RankRAG's ranking capability transfers well across different retrievers and generalizes to unseen domains, showcasing its robustness and adaptability.

πŸ“Œ Current RAG systems use limited retrievers (e.g. BM25, BERT) for efficiency, compromising relevance estimation accuracy.

πŸ“Œ There's a trade-off in selecting top-k contexts: small k misses information, large k introduces noise.

πŸ“Œ Performance often plateaus around k=10, as shown with ChatQA-1.5.

πŸ“Œ These limitations motivate RankRAG's approach of integrating ranking into the LLM itself.
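A minimal sketch of the retrieve-rerank-generate flow described above, with hypothetical `retriever` and `llm` callables; in RankRAG proper, a single instruction-tuned LLM does both the ranking and the generation, which this sketch only approximates with a relevance-scoring prompt.

```python
def rank_rag_answer(question: str, retriever, llm, n_retrieve: int = 30, top_k: int = 5) -> str:
    """Retrieve-rerank-generate: the LLM scores contexts for relevance, then answers."""
    # Step 1: retrieve a generous top-N with a cheap retriever.
    contexts = retriever(question, k=n_retrieve)

    # Step 2: rerank the retrieved contexts by LLM-judged relevance to the question.
    def relevance(ctx: str) -> float:
        score = llm(
            f"Question: {question}\nPassage: {ctx}\n"
            "On a scale of 0 to 10, how relevant is the passage to the question? "
            "Answer with a number only:"
        )
        return float(score.strip())

    reranked = sorted(contexts, key=relevance, reverse=True)

    # Step 3: generate the answer from the refined top-k contexts only.
    context_block = "\n\n".join(reranked[:top_k])
    return llm(f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:")
```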
Jun 30 β€’ 7 tweets β€’ 4 min read
Another 'WOW' paper - Up to 20x improvement in inference throughput with Block Transformer compared to vanilla transformers with equivalent perplexity. 🀯

How ❓ by MASSIVELY reducing KV cache IO overhead from quadratic to linear with respect to context length, solving a key challenge in scaling to very long contexts and also novel application of global-to-local modeling. 🀯

Paper - "Block Transformer: Global-to-Local Language Modeling for Fast Inference":

πŸ“Œ Block Transformers can also be uptrained from pretrained vanilla models, closely approaching the performance of those pretrained from scratch, using just 10% of the training budget.

πŸ“Œ It adopts a hierarchical global-to-local modeling approach. It isolates the expensive bottlenecks of global modeling to lower layers and applies fast local modeling in upper layers. This is achieved through three components:

1. Embedder: aggregates each block of L_B input tokens into an input block embedding. i.e. L_B represents the block length, which is the number of tokens aggregated into a single block.

2. Block decoder: an autoregressive transformer that applies self-attention between blocks to decode a context block embedding for predicting the next block.

3. Token decoder: autoregressively decodes the token contents of the next block, applying local self-attention between only the L_B tokens within the block.

πŸ“Œ The block decoder reduces overall costs through its coarse granularity. It mitigates the quadratic costs of self-attention by using coarse-grained block inputs instead of individual tokens, reducing context length by L_B. This reduces FLOPs for positionwise computations by L_B and attention score computation by L_B^2. KV cache usage and KV cache IO are also reduced by L_B and L_B^2 respectively.
πŸ“Œ The token decoder nearly eliminates the costs of attention as there is no need to compute, store, and retrieve KV-cache of past tokens beyond the small local context of L_B tokens. It eliminates prefill (necessary only in the block decoder) and reduces KV cache IO from quadratic to linear with respect to context length. This allows for significantly higher compute unit utilization.
πŸ“Œ To incorporate the context embedding and leverage the low-cost compute in the token decoder, the context block embedding is projected into prefix tokens. This enables further refinement of the global context and allows increasing computational width of the token decoder by extending the prefix length.

Hierarchical global-to-local architectures have shown significant potential to effectively model large-scale data by addressing global dependencies in coarse detail and capturing fine details within local regions.
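To make the cost argument concrete, here's a small back-of-the-envelope helper (my own illustration of the L_B and L_BΒ² reduction factors described above, not numbers from the paper):

```python
# Back-of-the-envelope illustration of the Block Transformer cost argument.
# The function name and formulas are my own sketch of the L_B / L_B^2 factors.

def attention_costs(context_len: int, block_len: int) -> dict:
    """Compare vanilla token-level attention with block-level + local attention."""
    n, L_B = context_len, block_len
    vanilla_scores = n * n                      # pairwise attention scores, O(n^2)
    block_scores = (n // L_B) ** 2              # block decoder attends between blocks
    local_scores = (n // L_B) * (L_B * L_B)     # token decoder attends within each block only
    return {
        "vanilla_attention_scores": vanilla_scores,
        "block_plus_local_scores": block_scores + local_scores,
        "score_reduction_x": vanilla_scores / (block_scores + local_scores),
        "block_kv_cache_reduction_x": L_B,      # block decoder keeps n/L_B entries instead of n
    }

print(attention_costs(context_len=8192, block_len=4))
```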
Jun 30 β€’ 5 tweets β€’ 2 min read
LLMs are highly sensitive to prompt variations, leading to inconsistent performance across different prompts for the same task. πŸ‘¨β€πŸ”§

Intent-based Prompt Calibration (IPC) iteratively refines prompts to match user intent using synthetic boundary cases, addressing prompt sensitivity and optimizing with limited data.

πŸ“Œ IPC generates challenging synthetic samples at each iteration, focusing on boundary cases that expose prompt ambiguities.

πŸ“Œ The system employs three meta-prompts: Sample Generator, Analyzer, and Prompt Generator. The Sample Generator creates diverse, adversarial samples with balanced class distribution. The Analyzer evaluates prompt performance and identifies failure cases. The Prompt Generator suggests improved prompts based on historical performance and analysis.

πŸ“Œ For generative tasks, IPC first calibrates a ranking prompt, then uses it to optimize the generative prompt. This approach allows optimization with minimal annotation effort.

πŸ“Œ The system architecture consists of four components: Dataset (manages data operations), Estimator (handles predictions and annotations), Evaluator (assesses records and performs error analysis), and Optimizer (manages the optimization process flow).

πŸ“Œ IPC outperforms existing methods like OPRO and PE on classification tasks (spoiler detection, sentiment analysis, PG detection) and generative tasks (enthusiastic/reliable and sarcastic/positive movie reviews).

πŸ“Œ The method demonstrates superior performance with limited data, achieving higher accuracy and lower variance compared to baseline approaches.

πŸ“Œ Ablation studies reveal the importance of synthetic data generation, iterative refinement, and error analysis in improving model performance.

πŸ“Œ IPC effectively handles imbalanced data distributions by generating balanced synthetic samples, particularly beneficial for real-world moderation tasks.
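A compact, hedged sketch of the calibration loop described above, with hypothetical `llm` and `evaluate` callables; the real system's meta-prompts and four components are considerably richer than this.

```python
def ipc_optimize(task_description: str, initial_prompt: str, llm, evaluate, n_iters: int = 5) -> str:
    """Intent-based Prompt Calibration (sketch): generate boundary cases, analyze failures, refine."""
    prompt, history = initial_prompt, []
    for _ in range(n_iters):
        # 1) Sample Generator: create challenging, class-balanced boundary-case inputs.
        samples = llm(
            f"Task: {task_description}\nCurrent prompt: {prompt}\n"
            "Generate 10 diverse, adversarial boundary-case inputs (balanced across classes):"
        )
        # 2) Run the current prompt on the samples and score it against the user's intent.
        score, failures = evaluate(prompt, samples)
        history.append((prompt, score))
        # 3) Analyzer + Prompt Generator: explain the failures, then propose an improved prompt.
        analysis = llm(f"Prompt: {prompt}\nFailure cases: {failures}\nExplain why these fail:")
        prompt = llm(
            f"Task: {task_description}\nHistory (prompt, score): {history}\n"
            f"Failure analysis: {analysis}\nPropose an improved prompt:"
        )
    # Return the best-scoring prompt seen during optimization.
    return max(history, key=lambda item: item[1])[0]
```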
Jun 30 β€’ 6 tweets β€’ 3 min read
A very intriguing recent paper "Nested Jailbreak Prompts can Fool LLMs Easily" - reveals the inadequacy of current defense methods in safeguarding LLMs.

Generalizes jailbreak prompt attacks into two aspects:

(1) Prompt Rewriting and
(2) Scenario Nesting.

πŸ“Œ Proposes ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. Significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. The study also reveals the inadequacy of current defense methods in safeguarding LLMs.

πŸ“Œ ReNeLLM framework introduced, generalizing jailbreak prompt attacks into prompt rewriting and scenario nesting. Prompt rewriting involves operations like paraphrasing, altering sentence structure, misspelling sensitive words, inserting meaningless characters, partial translation, and changing expression style. These operations preserve semantic meaning while disguising harmful intent.

πŸ“Œ Scenario nesting embeds rewritten prompts into task scenarios like code completion, text continuation, and table filling. This leverages LLMs' instruction-following capabilities to bypass safety alignments. Scenarios chosen align with training data, shift attention, and leave blanks for completion.

πŸ“Œ Automated process uses LLMs to generate and evaluate jailbreak prompts. GPT-3.5 performs rewriting and harmfulness evaluation. Nested prompts fed to target LLM (e.g. Claude-2) for response. Success determined by harmful output generation.

πŸ“Œ ReNeLLM achieves state-of-the-art attack success rates (ASR) across open and closed-source LLMs. For Claude-2, ReNeLLM attains 69.6% GPT-ASR compared to 0% for baselines. Time cost reduced by 76.61% vs GCG and 86.19% vs AutoDAN.

πŸ“Œ Attention visualization reveals LLMs' priority shift from balancing external/internal instructions to favoring external ones after rewriting/nesting. This explains jailbreak success and informs potential defenses.

πŸ“Œ Defense strategies explored: incorporating priority prompts (e.g. "prioritize safety"), enhancing safety through supervised fine-tuning, and using harmfulness classifiers. Results show trade-offs between safety and performance, highlighting challenges in developing robust defenses.
Jun 29 β€’ 9 tweets β€’ 3 min read
This 76-page paper on Prompting Techniques has become quite popular. A nice read for your weekend.

- "The Prompt Report: A Systematic Survey of Prompting Techniques": ✨

Explores structured understanding and taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities.

πŸ“Œ The paper focuses on discrete prefix prompts rather than cloze prompts, because prefix prompts are widely used with modern LLM architectures like decoder-only models. It excludes soft prompts and techniques using gradient-based updates.

πŸ“Œ The paper identifies 58 text-based prompting techniques broken into 6 major categories:

1) In-Context Learning (ICL) - learning from exemplars/instructions in the prompt
2) Zero-Shot - prompting without exemplars
3) Thought Generation - prompting the LLM to articulate reasoning
4) Decomposition - breaking down complex problems
5) Ensembling - using multiple prompts and aggregating outputs
6) Self-Criticism - having the LLM critique its own outputs

πŸ“Œ For ICL, it discusses key design decisions like exemplar quantity, ordering, label quality, format, and similarity that critically influence output quality. It also covers ICL techniques like K-Nearest Neighbor exemplar selection.

πŸ“Œ Extends the taxonomy to multilingual prompts, discussing techniques like translate-first prompting and cross-lingual ICL. It also covers multimodal prompts spanning image, audio, video, segmentation, and 3D modalities.

πŸ“Œ More complex techniques like agents that access external tools, code generation, and retrieval augmented generation are also taxonomized. Evaluation techniques using LLMs are discussed.

πŸ“Œ Prompting issues like security (prompt hacking), overconfidence, biases, and ambiguity are highlighted. Two case studies - benchmarking techniques on MMLU and an entrapment detection prompt engineering exercise - are presented.
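As one concrete instance from the ICL part of the taxonomy, here's a hedged sketch of K-Nearest Neighbor exemplar selection: embed the candidate exemplars, pick the k most similar to the test query, and format them as the few-shot prompt. `embed` is a stand-in for any sentence-embedding model.

```python
import numpy as np

def knn_few_shot_prompt(query: str, exemplars: list[dict], embed, k: int = 4) -> str:
    """Pick the k exemplars most similar to the query and format them as a few-shot prompt."""
    query_emb = embed(query)
    ex_embs = np.stack([embed(e["question"]) for e in exemplars])
    # Cosine similarity between the query and each candidate exemplar.
    sims = ex_embs @ query_emb / (
        np.linalg.norm(ex_embs, axis=1) * np.linalg.norm(query_emb)
    )
    nearest = [exemplars[i] for i in np.argsort(-sims)[:k]]

    shots = "\n\n".join(f"Q: {e['question']}\nA: {e['answer']}" for e in nearest)
    return f"{shots}\n\nQ: {query}\nA:"
```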
Jun 28 β€’ 7 tweets β€’ 4 min read
Activation Beacon is such a classic paper from Jan-2024

"Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon"

Can be a revolutionary paper if implementable for all cases - for massively increasing the context window of LLMs

Authors trained LLaMA-2 for 10K-steps with 4K context window and then it generalized to 400K context window πŸ”₯

πŸ“Œ Key technique is to condense the LLM's raw activations into more compact forms so that it can perceive a much longer context within a limited context window. Activation Beacon is introduced as a plug-and-play module for the LLM.

πŸ“Œ "It fully preserves the LLM's original capability on short contexts while extending the new capability on processing longer contexts. Besides, it works with short sliding windows to process the long context, which achieves a competitive memory and time efficiency in both training and inference. "

πŸ“Œ "Activation Beacon is learned by the auto-regression task conditioned on a mixture of beacons with diversified condensing ratios. Thanks to such a treatment, it can be efficiently trained purely with short-sequence data in just 10K steps, which consumes less than 9 hours on a single 8xA800 GPU machine."

πŸ“Œ "The experimental studies show that Activation Beacon is able to extend Llama-2-7B's context length by Γ—100 times (from 4K to 400K), meanwhile achieving a superior result on both long-context generation and understanding tasks. Our model and code will be available at the BGE repository."Image A relevant question here, that automatically comes to mind - 'How does the quality of retrieval change over context length ❓

And we can refer to this paper for understanding that aspect.

Paper - ''Training-Free Long-Context Scaling of Large Language Models'

This paper shows that existing long-context LLMs, which already support a 32k context window, can further extrapolate to a 192k context length while maintaining high passkey retrieval accuracy and low perplexity.

arxiv.org/abs/2402.17463
Jun 23 β€’ 7 tweets β€’ 4 min read
Sliding Window Attention is such a brilliant idea πŸ’‘

And it was one of the secret sauces behind the legendary Mistral-7B, which enabled it to handle 100k+ token sequences with linear (ish) complexity.

A long thread 🧡1/n
---

πŸ“Œ Most Transformers use Vanilla Attention, where each token in the sequence can attend to itself and all the tokens in the past.

πŸ“Œ So the memory increases linearly with the number of tokens. Hence the problem of higher latency during inference time and smaller throughput due to reduced cache availability.

πŸ“Œ Sliding Window Attention (SWA) can alleviate those problems and can handle longer sequences of tokens more effectively at a reduced computational cost.

So in standard, decoder-only, causal LMs (like the whole GPT series), each token can "attend to" (i.e. "look at") every token that has come before it.

In Sliding Window Attention, earlier layers have a narrower view of history, and this progressively builds up the deeper you go into the model.

----

πŸ“Œ Performance implications of sliding window attention:

Computational complexity: O(n * w) where n is sequence length, w is window size
Memory usage: O(w) instead of O(n) for full attention
Information retention: Local context preserved, global context approximated

----

πŸ“Œ This works because SWA exploits the stacked attention layers to attend to information beyond the window size W.

πŸ“Œ Each hidden state h in position i of layer k can attend to all hidden states from the previous layer with position between i-W and i. Where `W` is the "Window Size"

πŸ“Œ This holds for all hidden states. Thus, recursively, a hidden state can access tokens from the input layer at a distance of W x k tokens. With 32 layers and a window size of 4096, this model has an attention span of 131k tokens.
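A minimal sketch of the two ideas above: a causal sliding-window attention mask, and the effective attention span of roughly W Γ— k tokens you get from stacking layers (illustrative, not Mistral's implementation).

```python
import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """True where attention is allowed: token i sees tokens j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def effective_attention_span(window: int, n_layers: int) -> int:
    """Stacked SWA layers let information propagate roughly W * k tokens back."""
    return window * n_layers

mask = sliding_window_causal_mask(seq_len=8, window=3)
print(mask.astype(int))
print(effective_attention_span(window=4096, n_layers=32))  # ~131k tokens, as noted above
```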

----

πŸ“Œ Limitations of Sliding Window Attention

Lack of Global Context βˆ’ Because Sliding Window Attention operates on fixed windows, it may not be able to capture long-range dependencies that span across multiple windows. This can limit the model's ability to understand the global context of the input sequence.

πŸ“Œ For example, if the prompt/instruction text is 16K tokens but Sliding Window Attention's window is only 4K, my instructions may get ignored, as the window moves to the last 4K of those 16K and will "un-attend" my instructions at the beginning of those 16K.

----------

πŸ“Œ Global Attention, in contrast, considers the entire input sequence at once, applying attention to all positions simultaneously. It focuses on specific, strategically chosen locations to capture the most relevant information, ensuring that each token with global attention is connected to every other token in the sequence. While Global Attention provides a comprehensive view of the sequence context, it can significantly increase computational demands.

πŸ“Œ Combining SWA with Global Attention, as seen in architectures like Longformer, offers a balanced approach. This hybrid method maintains efficiency while ensuring the model captures both local and global sequence context, crucial for accurate performance on tasks with long input sequences.

🧡2/n

πŸ“Œ Common misconceptions about sliding window attention:

* It completely discards all information from earlier tokens

* Linear complexity means no performance trade-offs

* It's always better than full attention for all tasks
Jun 18 β€’ 5 tweets β€’ 4 min read
Meta just released 4 models today. πŸ”₯

- Meta Chameleon: 7B & 34B language models
- Meta Multi-Token Prediction LLM
- Meta JASCO: text-to-music models
- Meta AudioSeal: audio watermarking model

This is based on Meta's groundbreaking paper released back in April-2024

"Better & Faster Large Language Models via Multi-token Prediction" ✨

Original Problem it solves

Most LLMs have a simple training objective: predicting the next word. While this approach is simple and scalable, it’s also inefficient. It requires several orders of magnitude more text than what children need to learn the same degree of language fluency.

Hence, in this paper the approach is to train language models to predict multiple future words at onceβ€”instead of the old one-at-a-time approach.

πŸ‘‰ With this new approach, a 13B-parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models.
πŸ‘‰ And also models trained with 4-token prediction (instead of 1) are up to 3 times faster at inference, even with large batch sizes. 🀯

---

πŸ“Œ Under this approach, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk.

πŸ“Œ The proposed method uses a shared transformer trunk to produce a latent representation of the observed context, which is then fed into n independent output heads to predict the next n tokens in parallel. This factorizes the multi-token prediction cross-entropy loss into terms for each future token conditioned on the latent representation.

πŸ“Œ To make the architecture memory-efficient, the forward and backward passes are carefully reorganized. After the forward pass through the shared trunk, each output head's forward and backward passes are computed sequentially, accumulating gradients at the trunk. This avoids materializing all logits and gradients simultaneously, reducing peak memory usage from O(nV + d) to O(V + d) without impacting runtime.

πŸ“Œ During inference, the additional output heads can be leveraged for self-speculative decoding methods like blockwise parallel decoding and Medusa-like tree attention to speed up generation by up to 3 times, even with large batch sizes.

πŸ“Œ Experiments show that multi-token prediction is increasingly useful for larger model sizes, with 13B parameter models solving 12% more problems on HumanEval and 17% more on MBPP compared to next-token models. The approach remains beneficial when training for multiple epochs.

πŸ“Œ Finetuning multi-token prediction models on the challenging CodeContests dataset outperforms finetuning next-token models, demonstrating the rich representations learned during pretraining. Next-token finetuning on top of multi-token pretraining appears optimal.

πŸ“Œ For natural language tasks, multi-token prediction improves performance on generative benchmarks like summarization, while not significantly regressing on standard benchmarks based on multiple choice questions and negative log-likelihoods.

πŸ“Œ The authors hypothesize that multi-token prediction mitigates the distributional discrepancy between training-time teacher forcing and inference-time autoregressive generation. They provide an information-theoretic decomposition showing how multi-token prediction increases the importance of tokens relevant for the continuation of the text.

To better understand the effect of the number of predicted tokens, they did comprehensive ablations on models of scale 7B trained on 200B tokens of code, trying n = 1, 2, 4, and 8 in this setting. Results in table 1 show that training with 4 future tokens outperforms all the other models consistently throughout HumanEval and MBPP for pass@1, pass@10 and pass@100 metrics.
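Here's a minimal PyTorch sketch of the setup described above: a shared trunk feeding n independent output heads, with head i trained on the token i+1 positions ahead and the per-head cross-entropy losses summed. This is a hedged illustration, not Meta's training code; `trunk` is assumed to be any causal model that maps token ids to hidden states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    """Shared trunk + n independent output heads; head i predicts the token i+1 steps ahead."""

    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.trunk = trunk   # assumed: causal stack mapping token ids -> (batch, seq, d_model)
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        """input_ids: (batch, seq) LongTensor. Returns the summed multi-token prediction loss."""
        hidden = self.trunk(input_ids)                   # shared latent representation
        loss = 0.0
        for i, head in enumerate(self.heads):
            offset = i + 1                               # head i predicts `offset` steps ahead
            logits = head(hidden[:, :-offset])           # (batch, seq - offset, vocab)
            target = input_ids[:, offset:]               # the tokens `offset` steps ahead
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1)
            )
        return loss
```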
Jun 17 β€’ 5 tweets β€’ 3 min read
What you see here is a disruptive open/distributed AI tech. ✨

Here, a Codestral model is running on someone else's machine, for free, with no API access or cost.

A swarm of agents running intelligently on a distributed network and making use of the fastest AI infrastructure and the most powerful open models.

This is @HyperspaceAI , where they were able to get GPT-4 comparable results using a complex distributed system which spanned models from @MistralAI , Llama3, Phi3, Qwen local consumer devices, and @GroqInc .

@HyperspaceAI is the first β€œgenerative browser” from the guts up with a built-in P2P system, several fundamental AI primitives, a built-in blockchain node

You can serve Codestral, Mixtral, Llama3, Phi3, Qwen and other models from your own machine, leveraging peer-to-peer AI and coordinating multiple open-source models intelligently.

HyperspaceAI already has the largest deployment of Mistral models on a consumer peer-to-peer network (over 5000 nodes so far) and this network will continuously grow.

The more this distributed network grows, the smarter, cheaper and more abundant it becomes.

Here’s how they grow it:

1) More nodes
2) More stake
3) More developers
4) More agents
5) More data
6) More finetuned adapters
7) More open foundation models
.. and improving efficiency across the board.

The Hyperspace node is available for Windows, Linux, and macOS in its early alpha stages.

Also, just note, that it's an alpha version currently, and the app is in its heavy development phase.

A thread 🧡 1/n

Another example is here. One consumer MacBook requesting an LLM to be run on another *random* consumer machine and getting the results back in real time.

100s of models across 1000s of nodes today.
Jun 13 β€’ 5 tweets β€’ 3 min read
"Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-38B"

From 25.47% to 45.49% in GSM-Hard 🀯

Also noting in this regard, the head of DeepMind said last year that augmenting LLMs with Monte Carlo Tree Search may be the fastest path to AGI.

πŸ“Œ This paper introduces the MCT Self-Refine (MCTSr) algorithm, which integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to enhance performance on complex mathematical reasoning tasks like Olympiad-level problems. The key problem being addressed is the accuracy and reliability challenges faced by LLMs in strategic and mathematical reasoning.

πŸ“Œ MCTSr constructs a Monte Carlo search tree through iterative processes of Selection (using an improved Upper Confidence Bound formula to balance exploration-exploitation), self-refine (the LLM generates feedback to guide refining an answer), self-evaluation (the LLM scores the quality of the refined answer), and Backpropagation (propagating the refined answer's value back through the tree).

πŸ“Œ The self-refine process uses a multi-turn dialogue prompt where the LLM first generates a critical comment on the current answer, then refines the answer guided by that comment. The self-evaluation scores an answer from -100 to 100 and applies constraints like strict scoring standards and suppressing perfect scores to improve reliability.

πŸ“Œ Backpropagation updates a node's Q value (estimated answer quality) by averaging its current Q value and the max Q value of its child nodes. Candidate nodes for further expansion are selected based on criteria like number of child nodes and child Q values exceeding the parent's.

πŸ“Œ Experiments demonstrate MCTSr significantly improves success rates on datasets like GSM8K (up to 96.66% with 8 rollouts vs 74.07% zero-shot), MATH (58.24% overall with 8 rollouts vs 24.36% zero-shot), and Olympiad-level benchmarks like AIME (11.79% with 8 rollouts vs 2.36% zero-shot). Performance scales with number of rollouts.

πŸ“Œ Compared to closed-source LLMs like GPT-4, MCTSr with LLaMA-3 8B achieves comparable results, showing it can boost reasoning capabilities of smaller open-source models. The paper concludes MCTSr is a robust and promising approach for complex mathematical reasoning with LLMs.
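A compact, hedged sketch of one MCTSr-style loop with a hypothetical `llm()` callable; the paper's UCB formula, scoring constraints and expansion criteria are more detailed than this, so read it as the shape of the algorithm, not the algorithm itself.

```python
import math

def mctsr(question: str, llm, n_rollouts: int = 8, c: float = 1.4) -> str:
    """MCT Self-Refine (sketch): select -> self-refine -> self-evaluate -> backpropagate."""
    root = {"answer": llm(f"Answer this: {question}"), "Q": 0.0, "visits": 1, "children": []}
    nodes = [root]

    def ucb(node, parent):
        return node["Q"] + c * math.sqrt(math.log(parent["visits"] + 1) / (node["visits"] + 1e-9))

    for _ in range(n_rollouts):
        # Selection: pick a candidate node by UCB (simplified to one level of children here).
        node = max(root["children"], key=lambda ch: ucb(ch, root)) if root["children"] else root

        # Self-refine: the LLM critiques the current answer, then rewrites it.
        critique = llm(f"Question: {question}\nAnswer: {node['answer']}\nGive a critical critique:")
        refined = llm(f"Question: {question}\nAnswer: {node['answer']}\n"
                      f"Critique: {critique}\nRewrite an improved answer:")

        # Self-evaluate: score the refined answer from -100 to 100 (strict, no perfect scores).
        score = float(llm(f"Question: {question}\nAnswer: {refined}\n"
                          "Score this answer from -100 to 100 (be strict, number only):"))

        child = {"answer": refined, "Q": score, "visits": 1, "children": []}
        node["children"].append(child)
        nodes.append(child)

        # Backpropagation: update Q as an average of the node's Q and its best child's Q.
        node["Q"] = 0.5 * (node["Q"] + max(ch["Q"] for ch in node["children"]))
        node["visits"] += 1
        if node is not root:
            root["visits"] += 1

    return max(nodes, key=lambda n: n["Q"])["answer"]
```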
Jun 7 β€’ 5 tweets β€’ 3 min read
This is really a 'WOW' paper. 🀯

Claims that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales, and that by utilizing an optimized kernel during inference, the model’s memory consumption can be reduced by more than 10Γ— compared to unoptimized models. 🀯

'Scalable MatMul-free Language Modeling'

Concludes that it is possible to create the first scalable MatMul-free LLM that achieves performance on par with state-of-the-art Transformers at billion-parameter scales.

πŸ“Œ The proposed MatMul-free LLM replaces MatMul operations in dense layers with ternary accumulations using weights constrained to {-1, 0, +1}. This reduces computational cost and memory utilization while preserving network expressiveness.

πŸ“Œ To remove MatMul from self-attention, the Gated Recurrent Unit (GRU) is optimized to rely solely on element-wise products, creating the MatMul-free Linear GRU (MLGRU) token mixer. The MLGRU simplifies the GRU by removing hidden-state related weights, enabling parallel computation, and replacing remaining weights with ternary matrices.

πŸ“Œ For MatMul-free channel mixing, the Gated Linear Unit (GLU) is adapted to use BitLinear layers with ternary weights, eliminating expensive MatMuls while maintaining effectiveness in mixing information across channels.

πŸ“Œ The paper introduces a hardware-efficient fused BitLinear layer that optimizes RMSNorm and BitLinear operations. By fusing these operations and utilizing shared memory, training speed improves by 25.6% and memory consumption reduces by 61% over an unoptimized baseline.

πŸ“Œ Experimental results show that the MatMul-free LLM achieves competitive performance compared to Transformer++ baselines on downstream tasks, with the performance gap narrowing as model size increases. The scaling law projections suggest MatMul-free LLM can outperform Transformer++ in efficiency and potentially in loss when scaled up.

πŸ“Œ A custom FPGA accelerator is built to exploit the lightweight operations of the MatMul-free LLM. The accelerator processes billion-parameter scale models at 13W beyond human-readable throughput, demonstrating the potential for brain-like efficiency in future lightweight LLMs.

After an initial read this looks solid. It looks like in the overparametrized regime some things just don't matter anymore.

Also props for showing off an FPGA implementation, which is where MatMul-free deep learning could really shine.
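To see what "MatMul-free" means for a dense layer, here's a small numpy illustration (my own, not the paper's kernel): once the weights are ternary {-1, 0, +1}, every multiply collapses into adding, subtracting, or skipping the input element.

```python
import numpy as np

# Illustration of ternary accumulation replacing matrix multiplication in a dense layer.
# With weights in {-1, 0, +1}: y_j = sum(x where W == +1) - sum(x where W == -1).

rng = np.random.default_rng(0)
x = rng.normal(size=256).astype(np.float32)                    # activations
W = rng.choice([-1, 0, 1], size=(256, 128)).astype(np.int8)    # ternary weight matrix

# Reference matmul (what the hardware wants to avoid).
y_matmul = x @ W.astype(np.float32)

# MatMul-free form: pure additions/subtractions gated by the weight sign.
y_ternary = (
    np.where(W == 1, x[:, None], 0.0).sum(axis=0)
    - np.where(W == -1, x[:, None], 0.0).sum(axis=0)
)

assert np.allclose(y_matmul, y_ternary, atol=1e-4)
print(y_ternary[:4])
```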
May 24 β€’ 8 tweets β€’ 3 min read
I have started using the Blackbox.ai coding copilot recently, and I'm actually very, very impressed. πŸš€

Especially its VSCode extension is just as good as GitHub Copilot, but guess what, it's FREE. 🀯

I am actually surprised that it's freely available. πŸ€”

With 2.4mn+ downloads (of this VSCode extension), it's literally used by millions of front-end, backend, data science and machine learning engineers to speed up their workflow and produce code 10X faster. ✨

🧡 1/n

Here is a list of use cases where blackbox .ai would be extremely valuable for you:

1. Bug Fixing

2. Unit Testing

3. Code Documentation

4. Code Translation

5. API integrations

6. Database implementation

7. Algorithm questions

8. General Coding Questions

9. Code Optimizations

10. & Much More

🧡 2/n
May 28, 2022 β€’ 14 tweets β€’ 26 min read
1/ "Software is eating the world. Machine learning is eating software. Transformers are eating machine learning."

Let's understand what these Transformers are all about

#DataScience #MachineLearning #DeepLearning #100DaysOfMLCode #Python #pythoncode #AI #DataAnalytics

2/ #Transformers architecture follows an Encoder-Decoder structure.

The encoder receives the input sequence and creates an intermediate representation by applying embedding and attention mechanisms.

#DataScience #MachineLearning #DeepLearning #100DaysOfMLCode #Python #pythoncode #AI
May 28, 2022 β€’ 11 tweets β€’ 21 min read
But what does the p-value mean in #MachineLearning - A thread

It tells you how likely it is that your data could have occurred under the null hypothesis

1/n

#DataScience #DeepLearning #ComputerVision #100DaysOfMLCode #Python #DataScientist #Statistics #programming #Data #Math #Stat

2/n
What Is a Null Hypothesis?

A null hypothesis is a type of statistical hypothesis that proposes that no statistical significance exists in a set of given observations.

#DataScience #MachineLearning #100DaysOfMLCode #Python #stat #Statistics #Data #AI #Math #deeplearning
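A tiny illustrative example (made-up numbers) of what that definition means in practice: a two-sample t-test, where the p-value is the probability of seeing data at least this extreme if the null hypothesis of "no difference between the groups" were true.

```python
import numpy as np
from scipy import stats

# Made-up data: accuracy of two model variants across 10 runs each.
rng = np.random.default_rng(42)
model_a = rng.normal(loc=0.81, scale=0.02, size=10)
model_b = rng.normal(loc=0.79, scale=0.02, size=10)

# Null hypothesis: both models have the same mean accuracy.
t_stat, p_value = stats.ttest_ind(model_a, model_b)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) means data this extreme would be unlikely
# under the null hypothesis, so we reject it.
```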