AI Engineer.
Compiling the race towards AGI in real time.
Follow to stay at the bleeding edge of AI.
I write daily on my Newsletter → https://t.co/Jfj0r0wLUN
Dec 12 • 5 tweets • 3 min read
Synthetic data and iterative self-improvement is all you need.
No humans needed in the evaluation loop.
This paper introduces a self-improving evaluator that learns to assess LLM outputs without human feedback, using synthetic data and iterative self-training to match top human-supervised models.
-----
Original Problem:
Building strong LLM evaluators typically requires extensive human preference data, which is costly and becomes outdated as models improve. Current approaches rely heavily on human annotations, limiting scalability and adaptability.
-----
Solution in this Paper:
→ The method starts with unlabeled instructions and uses a seed LLM to generate contrasting response pairs, where one is intentionally inferior.
→ It then uses an LLM-as-Judge approach to generate reasoning traces and final judgments for these synthetic pairs.
→ The system filters correct judgments and uses them to train an improved evaluator model.
→ This process repeats iteratively, with each iteration using the improved model to generate better synthetic training data.
-----
Key Insights from this Paper:
→ Human preference data isn't necessary for training strong LLM evaluators
→ Synthetic data generation with iterative self-improvement can match human-supervised approaches
→ Different data sources (safety, math, coding) improve performance in their respective domains
-----
Results:
→ Improved RewardBench accuracy from 75.4 to 88.3 (88.7 with majority voting)
→ Outperformed GPT-4 (84.3) and matched top reward models trained with human data
→ Achieved 79.5% agreement with human judgments on MT-Bench using majority voting
The diagram shows how an AI system learns to evaluate responses without human help, using an iterative training process:
1. Input Stage
- It starts with a prompt (x)
- It creates a similar but slightly different version of that prompt (x')
2. Response Generation
- The system uses an LLM to create two responses:
  - A "good" response to the original prompt
  - A "bad" response, produced by answering the modified prompt
3. Judgment Phase
- An AI judge (M_i) evaluates these responses
- It samples multiple judgments about which response is better
- The system keeps only the correct verdicts
4. Training Loop
- These judgments are collected as training data
- The system uses this data to train an improved version of itself (M_{i+1})
- This new, better model becomes the judge for the next round
Think of it like a student who:
1. Creates their own practice problems
2. Solves them in both good and not-so-good ways
3. Learns to tell the difference between good and bad solutions
4. Uses this knowledge to get even better at judging solutions
The key innovation is that this entire process runs automatically, without needing humans to say which answers are good or bad. The system teaches itself to become a better evaluator through practice and iteration.
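To make the loop concrete, here is a rough Python sketch of the four stages. Everything in it is hypothetical scaffolding: `make_pair`, `filter_judgments`, `self_train` and `toy_judge` are illustrative names, the judge is a stand-in for an LLM call, and the known-good response is always shown as "A" (a real setup would presumably randomize the order to avoid position bias).

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]   # (prompt x, good response, bad response)
Judgment = Tuple[str, str]    # (reasoning trace, verdict: "A" or "B")


def make_pair(x: str, generate: Callable[[str], str],
              perturb: Callable[[str], str]) -> Pair:
    """Build a synthetic pair: the bad response answers a modified prompt x'."""
    return x, generate(x), generate(perturb(x))


def filter_judgments(pairs: List[Pair],
                     judge: Callable[[str, str, str], Judgment],
                     n_samples: int = 4) -> List[Tuple[Pair, Judgment]]:
    """Sample several judgments per pair and keep only the correct verdicts."""
    kept = []
    for pair in pairs:
        x, good, bad = pair
        for _ in range(n_samples):
            judgment = judge(x, good, bad)  # response "A" is the known-good one
            if judgment[1] == "A":
                kept.append((pair, judgment))
    return kept


def self_train(pairs, judge, train, iterations: int = 2):
    """Each round trains M_{i+1} on data filtered from judge M_i."""
    for _ in range(iterations):
        data = filter_judgments(pairs, judge)
        judge = train(judge, data)
    return judge


def toy_judge(x: str, a: str, b: str) -> Judgment:
    # Stand-in for an LLM judge: it simply prefers the shorter response.
    verdict = "A" if len(a) <= len(b) else "B"
    return f"lengths {len(a)} vs {len(b)}", verdict
```

In practice `generate`, `judge` and `train` would wrap model calls and a fine-tuning job; the sketch only shows how the filtered judgments flow from one iteration to the next.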
Dec 3 • 4 tweets • 3 min read
HunyuanVideo: an open-source alternative to Runway Gen-3, Luma 1.6, and a few other top-performing Chinese video generation models has just arrived. 🤯
A 13B-parameter open-source video generation model from Tencent that matches commercial quality
→ HunyuanVideo represents a major advancement in open-source video generation, released by Tencent in December 2024 with public code and model weights
→ The model matches or exceeds closed-source solutions while being fully accessible to researchers and developers
→ Running on H800/H20 GPUs, it requires 45-60GB of memory depending on resolution settings
Architecture
→ The foundation is a Causal 3D VAE that compresses videos with fixed ratios: 4x for the time dimension, 8x for the spatial dimensions, and 16x for channels
→ Unlike traditional approaches using CLIP/T5, HunyuanVideo employs a decoder-only Multimodal LLM as its text encoder, enabling better image-text alignment and complex reasoning
→ The architecture follows a dual-stream to single-stream progression: it first processes video and text independently, then merges them for multimodal fusion
→ A prompt rewriting system offers two modes: Normal for better understanding of user intent, and Master for enhancing visual quality
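As a quick back-of-the-envelope check on the VAE ratios above, this sketch computes the latent tensor shape for a given clip. It rests on two assumptions that are not stated in the thread: the "16" is read as the number of latent channels, and the causal VAE is assumed to encode the first frame on its own before grouping the rest. `latent_shape` is a hypothetical helper, not part of the HunyuanVideo code.

```python
def latent_shape(frames: int, height: int, width: int,
                 time_ratio: int = 4, space_ratio: int = 8,
                 latent_channels: int = 16):
    """Return (channels, frames, height, width) of the compressed video."""
    # Assumption: a causal VAE keeps the first frame on its own and
    # compresses the remaining frames in groups of `time_ratio`.
    latent_frames = (frames - 1) // time_ratio + 1
    return (latent_channels, latent_frames,
            height // space_ratio, width // space_ratio)


shape = latent_shape(frames=129, height=720, width=1280)
print(shape)
```

Under these assumptions a 129-frame 720p clip shrinks to a few dozen latent frames at 1/8 of the spatial resolution, which is what makes diffusion over long clips tractable.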
Implementation Details
→ Supports various aspect ratios including 9:16, 16:9, 4:3, 3:4, and 1:1 with resolutions up to 720p
→ Uses flow matching for training with a configurable shift factor of 9.0 and embedded classifier-free guidance
→ Provides CPU offloading capabilities to manage memory efficiently during high-resolution generation
Performance Metrics
→ Professional evaluation across 1,533 prompts shows superior results: 68.5% text alignment, 64.5% motion quality, 96.4% visual quality
Dec 1 • 4 tweets • 2 min read
Emotional RAG: AI can now recall memories based on emotions, just like humans do.
Original Problem:
Role-playing agents powered by LLMs struggle to maintain consistent personality traits and generate human-like responses due to limited emotional context in memory retrieval.
-----
Solution in this Paper:
• Introduces the Emotional RAG framework for role-playing agents
• Encodes both semantic and emotional vectors for queries and memory
• Implements two retrieval strategies:
- Combination: Fuses semantic and emotional similarity scores
- Sequential: Retrieves based on one factor, then reranks using the other
• Designs emotion-aware prompt templates for LLMs
-----
Key Insights from this Paper:
→ Incorporating emotional states in memory retrieval enhances personality consistency
→ Mood-Dependent Memory theory from psychology applies to AI agents
→ Different retrieval strategies work best for different personality evaluation metrics
→ Emotional congruence improves the human-likeness of generated responses
-----
Results:
• Outperforms traditional RAG methods across multiple datasets
• Significant improvements in full personality evaluations (MBTI, BFI)
• Better performance on open-source models (ChatGLM-6B, Qwen-72B) compared to GPT-3.5
• Achieves higher accuracy in overall personality trait predictions
The Emotional RAG framework consists of four main components:
→ Query encoding: Encodes both semantic and emotional aspects of user queries
→ Memory encoding: Stores and encodes conversation history with semantic and emotional vectors
→ Emotional retrieval: Retrieves relevant memory based on both semantic and emotional similarity
→ Response generation: Uses retrieved memory along with character profile to generate responses
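The two retrieval strategies (Combination and Sequential) can be sketched as below. This is only an illustration: the weighted-sum fusion, the `weight` and `shortlist` parameters, and the toy 2-d vectors are all assumptions, and a real system would get the semantic and emotional vectors from encoder models.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Memory = Tuple[str, Vector, Vector]  # (text, semantic vector, emotion vector)


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_combined(q_sem: Vector, q_emo: Vector, memories: List[Memory],
                      k: int = 1, weight: float = 0.5) -> List[str]:
    """Combination: fuse both similarities into one score (weighted sum here)."""
    scored = sorted(
        memories,
        key=lambda m: weight * cosine(q_sem, m[1]) + (1 - weight) * cosine(q_emo, m[2]),
        reverse=True,
    )
    return [m[0] for m in scored[:k]]


def retrieve_sequential(q_sem: Vector, q_emo: Vector, memories: List[Memory],
                        k: int = 1, shortlist: int = 2) -> List[str]:
    """Sequential: shortlist by semantic similarity, then rerank by emotion."""
    by_sem = sorted(memories, key=lambda m: cosine(q_sem, m[1]), reverse=True)[:shortlist]
    by_emo = sorted(by_sem, key=lambda m: cosine(q_emo, m[2]), reverse=True)
    return [m[0] for m in by_emo[:k]]


memories = [
    ("won the race", [1.0, 0.0], [1.0, 0.0]),   # on-topic, joyful
    ("lost the race", [0.9, 0.1], [0.0, 1.0]),  # on-topic, sad
    ("ate lunch", [0.0, 1.0], [0.0, 1.0]),      # off-topic, sad
]
```

With a race-related query in a sad mood, both strategies should surface the sad race memory rather than the joyful one, which is the mood-congruent behavior the thread describes.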
Nov 25 • 6 tweets • 3 min read
Type a sentence, get any sound - from talking cats to singing saxophones. Brilliant release by NVIDIA
NVIDIA just unveiled Fugatto, a 2.5B-parameter audio AI model that can generate and transform any combination of music, voices, and sounds using text prompts and audio inputs
Fugatto could ultimately allow developers and creators to bring sounds to life simply by inputting text prompts.
→ The model demonstrates unique capabilities like creating hybrid sounds (trumpet barking), changing accents/emotions in voices, and allowing fine-grained control over sound transitions - trained on millions of audio samples using 32 NVIDIA H100 GPUs
Architecture
Built as a foundational generative transformer model leveraging NVIDIA's previous work in speech modeling and audio understanding. The training process involved creating a specialized blended dataset containing millions of audio samples
→ ComposableART's Innovation in Audio Control
Introduces a technique that allows combining instructions that were only seen separately during training. Users can blend different audio attributes and control their intensity
→ Temporal Interpolation Capabilities
Enables generation of evolving soundscapes with precise control over transitions. Can create dynamic audio sequences like rainstorms fading into birdsong at dawn
→ Processes both text and audio inputs flexibly, enabling tasks like removing instruments from songs or modifying specific audio characteristics while preserving others
→ Shows capabilities beyond its training data, creating entirely new sound combinations through interaction between different trained abilities
Real-world Applications
→ Allows rapid prototyping of musical ideas, style experimentation, and real-time sound creation during studio sessions
→ Can modify voice characteristics for language learning applications, allowing content delivery in familiar voices
@NVIDIAAIDev
→ Creates a massive dataset (20M+ rows, ~330 years of audio) by combining multiple open source datasets and using LLMs to generate rich descriptions and instructions
Nov 16 • 14 tweets • 5 min read
Consolidated insights on LLM fine-tuning - a long read across 114 pages.
"Ultimate Guide to Fine-Tuning LLMs"
Worth a read during the weekend.
A few areas it covers:
Fine-tuning Pipeline
→ Outlines a seven-stage process for fine-tuning LLMs, from data preparation to deployment and maintenance.
Advanced Fine-tuning Methods
→ Covers techniques like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) for aligning LLMs with human preferences.
→ Discusses methods like LoRA, QLoRA, and adapters that enable efficient fine-tuning by updating only a subset of model parameters.
Evaluation metrics and benchmarks for assessing fine-tuned LLMs
→ Includes perplexity, accuracy, and task-specific measures. Benchmarks like GLUE, SuperGLUE, TruthfulQA, and MMLU assess various aspects of LLM performance. Safety evaluations using frameworks like DecodingTrust are also crucial for ensuring responsible AI deployment.
Explores various deployment approaches and optimization techniques to enhance LLM performance and efficiency in real-world applications.
Examines the extension of fine-tuning techniques to multimodal models and domain-specific applications in fields like medicine and finance.
Note: the content's value stands on its own merit, even though the authors possibly used AI assistance in parts of the paper.
🧵 1/n
🧵 2/n
A chronological timeline showcasing the evolution of LLMs from 1990 to 2023.
Nov 8 • 4 tweets • 2 min read
MapReduce meets LLMs: Divide-and-conquer approach lets regular LLMs process 100x longer documents than their context limit
Using MapReduce principles, small-context LLMs now handle million-token documents efficiently.
Original Problem:
LLMs struggle to process extremely long texts exceeding their context window, limiting their application in tasks requiring comprehensive document understanding.
-----
Solution in this Paper:
• LLM × MapReduce: A training-free framework for long-sequence processing
• Structured information protocol: Addresses inter-chunk dependency
• In-context confidence calibration: Resolves inter-chunk conflicts
• Three-stage process: Map, collapse, and reduce stages for efficient processing
-----
Key Insights from this Paper:
• Divide-and-conquer approach enables short-context LLMs to handle long texts
• Structured information and confidence calibration improve cross-chunk processing
• Framework is compatible with different LLMs, demonstrating generalization capability
• Efficient design outperforms standard decoding in speed
The LLM × MapReduce framework consists of three main stages:
1. Map stage: The long input text is divided into chunks, and an LLM extracts necessary information from each chunk.
2. Collapse stage: If the mapped results still exceed the model's context window, they are compressed while maintaining the same structure as the mapped results.
3. Reduce stage: The final response is generated based on the collapsed results.
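The three stages can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: `llm` is any callable that maps a prompt string to a string, character counts stand in for token counts, the prompts are made up, and the structured information protocol and confidence calibration are left out.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split the long input into fixed-size chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def map_collapse_reduce(text: str, question: str, llm: Callable[[str], str],
                        chunk_size: int, context_limit: int) -> str:
    # Map: extract the relevant information from every chunk independently.
    mapped = [llm(f"Extract information relevant to '{question}':\n{chunk}")
              for chunk in chunk_text(text, chunk_size)]

    # Collapse: while the mapped results do not fit in the context window,
    # merge neighbours pairwise and compress them.
    while sum(len(m) for m in mapped) > context_limit and len(mapped) > 1:
        merged = ["\n".join(mapped[i:i + 2]) for i in range(0, len(mapped), 2)]
        mapped = [llm(f"Compress, keeping the same structure:\n{m}") for m in merged]

    # Reduce: answer from the collapsed results.
    return llm(f"Answer '{question}' using:\n" + "\n".join(mapped))
```

Because each map call only sees one chunk, the map stage parallelizes trivially; the collapse loop halves the number of intermediate results on every pass, so it always terminates.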
Nov 6 • 6 tweets • 2 min read
"Understanding LLMs from Scratch Using Middle School Math"
Neural networks learn to predict text by converting words to numbers and finding patterns through attention mechanisms.
So the network turns words into numbers, then uses attention to decide what's important for predicting the next words.
A nice long blog post (40-minute reading time); check the link in the comments.
Nov 4 • 11 tweets • 3 min read
For learning Machine Learning with actual projects, check out this repo.
A comprehensive educational repository from @AnthropicAI containing 5 structured courses: API fundamentals, prompt engineering, real-world applications, evaluations, and tool integration with Claude APIs.
Anthropic API fundamentals
Nov 1 • 5 tweets • 3 min read
Not all brain cells are equal - same goes for LLM attention heads!
Why store everything when you can just remember the important stuff?
Smart KV cache compression that knows which attention heads matter most.
Hence, HeadKV intelligently compresses LLM memory by identifying and prioritizing crucial attention heads
Original Problem:
KV caching in LLMs faces significant memory overhead with increasing input length. Current compression methods operate at layer-level, missing the opportunity to optimize at individual attention head level.
-----
Solution in this Paper:
• HeadKV: Compresses KV cache at individual head level instead of layer level
• Allocates cache budgets based on head importance using Needle-in-a-Haystack tests
• HeadKV-R2: Enhanced version that evaluates both retrieval and reasoning abilities
• Uses dynamic budget allocation across heads based on importance scores
• Retains most relevant KV cache entries within each head using attention-based selection
-----
Key Insights:
• Not all attention heads are equally important for text generation
• Head-level compression outperforms layer-level approaches
• Combining retrieval and reasoning abilities for importance scoring is crucial
• Dynamic budget allocation across heads is more effective than fixed allocation
• Just 1.5% of KV cache can retain 97% of full performance
-----
Results:
• Achieves 97% of full KV cache performance while retaining only 1.5% of cache
• Outperforms baselines on LongBench and LooGLE benchmarks
• Superior performance in low-resource settings (KV size = 64 & 128)
• Maintains computational efficiency comparable to existing approaches
• Effective preservation of both retrieval and reasoning capabilities
The method operates in two key steps: First, it estimates head importance scores using Needle-in-a-Haystack tests that evaluate both retrieval and reasoning abilities.
Second, it allocates KV cache budgets to individual heads based on their importance scores, with more important heads receiving larger cache allocations.
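A minimal sketch of the second step, assuming the importance scores are already available. The proportional split, the `min_per_head` floor and both function names are my own illustrative choices, not the allocation formula from the paper.

```python
from typing import List, Sequence


def allocate_budgets(importance: Sequence[float], total_budget: int,
                     min_per_head: int = 1) -> List[int]:
    """Split a global KV budget across heads in proportion to importance."""
    n = len(importance)
    if total_budget < min_per_head * n:
        raise ValueError("total_budget is too small for the per-head minimum")
    pool = total_budget - min_per_head * n
    total = sum(importance)
    shares = [pool * s / total for s in importance] if total > 0 else [pool / n] * n
    budgets = [min_per_head + int(share) for share in shares]
    # Hand out what rounding left over, most important heads first.
    leftover = total_budget - sum(budgets)
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    for i in ranked[:leftover]:
        budgets[i] += 1
    return budgets


def select_entries(attention_scores: Sequence[float], budget: int) -> List[int]:
    """Keep the positions of the `budget` highest-attention cache entries."""
    top = sorted(range(len(attention_scores)),
                 key=lambda i: attention_scores[i], reverse=True)[:budget]
    return sorted(top)  # preserve original token order
```

For example, with importance scores 5, 3 and 2 and a total budget of 13 entries, the most important head ends up with roughly half of the cache while every head keeps at least one entry.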
Oct 28 • 4 tweets • 2 min read
GenAI implementation poses a series of hurdles to overcome.
Vector DBs, data processing pipelines, embedding models, deployment systems, monitoring tools, and many more.
All these create significant engineering complexity.
A thread 🧵 1/n
So an all-in-one GenAI development toolkit running on infrastructure I own would remove these unnecessary pressures and stumbling blocks when starting a new GenAI project.
And then I found such a solution: @DynamiqAGI
And what's great is that it is open-source under the Apache 2.0 license.
So Dynamiq simplifies my AI-powered solution development cycle significantly.
It handles multi-agent orchestration and Retrieval-Augmented Generation (RAG) integration with a comprehensive toolkit.
Core capabilities:
-> Agent orchestration: Single and multi-agent workflow support
-> RAG toolkit: Vector DB integration, chunking, pre-processing, reranking
-> DAG workflow control: Parallel execution, retries, error handling
-> Custom validators: Configurable validation rules for workflows
-> Multi-modal integration: Support for Vision-Language Models (VLMs)
Also good for scalability, from prototype to enterprise.
🧵 2/n
Here's a simple example to get you started with @DynamiqAGI:
Oct 26 • 8 tweets • 2 min read
MIT's "Mathematics for Computer Science".
A 1,048-page text, available for free.
Focuses on explaining the use of mathematical models and methods to analyze problems in computer science.
Oct 15 • 5 tweets • 2 min read
Agent S uses a computer like a human to solve diverse desktop tasks on different systems.
Experience-augmented hierarchical planning enables Agent S to handle diverse GUI tasks with improved performance.
**The original Problem**:
Automating complex computer tasks presents challenges in acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic interfaces.
-----
**Solution in this Paper**:
• Experience-augmented hierarchical planning:
- Manager module for task decomposition
- Worker modules for subtask execution
- Self-evaluator for experience summarization
• Agent-Computer Interface (ACI):
- Dual-input strategy for visual understanding and element grounding
- Bounded action space of language-based primitives
• Continual memory update mechanism for ongoing learning
-----
**Key Insights from this Paper**:
• Combining external knowledge and internal experience enhances task planning
• Structured interface improves MLLM reasoning for GUI control
• Hierarchical planning supports long-horizon workflows
• Continual learning enables adaptation to new tasks and environments
-----
**Results**:
• OSWorld benchmark: 20.58% success rate (83.6% relative improvement over baseline)
• Consistent improvements across five computer task categories
• WindowsAgentArena: 18.2% success rate (36.8% improvement without adaptation)
• Ablation studies confirm effectiveness of individual components
Agent S addresses three main challenges in automating computer tasks:
1. Acquiring domain-specific knowledge for diverse applications
2. Planning over long task horizons
3. Handling dynamic, non-uniform interfaces
Oct 9 • 4 tweets • 3 min read
Brilliant Paper from @Microsoft.
"DIFFERENTIAL TRANSFORMER"
DIFF Transformer cancels attention noise, enhancing key information retrieval and reducing hallucination in large language models.
• 30% accuracy improvement in key information retrieval with 64K context
• 10-20% accuracy gain in many-shot in-context learning across datasets
• 7-11% reduction in hallucination for summarization and question answering
• Maintains performance with 6-bit quantization, while Transformer degrades significantly
**Original Problem**:
Transformer tends to overallocate attention to irrelevant context, leading to challenges in accurately retrieving key information.
-----
**Solution in this Paper**:
• Introduces DIFF Transformer with differential attention mechanism
• Calculates attention scores as difference between two separate softmax attention maps
• Subtraction cancels noise, promoting emergence of sparse attention patterns
• Amplifies attention to relevant context while reducing attention to irrelevant parts
• Uses GroupNorm to normalize each attention head independently
-----
**Key Insights from this Paper**:
• DIFF Transformer outperforms Transformer in scaling model size and training tokens
• Requires only ~65% of model size or training tokens to match Transformer performance
• Excels in long-context modeling, key information retrieval, and in-context learning
• Mitigates hallucination in question answering and text summarization
• Reduces outliers in model activations, enabling better quantization
Transformer often over-attends to irrelevant context (i.e., attention noise). DIFF Transformer amplifies attention to answer spans and cancels noise, enhancing the capability of context modeling.
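Here is a small pure-Python sketch of the "difference of two softmax maps" idea for a single head. It is a simplification under stated assumptions: `lam` is a fixed scalar here (the paper learns it), the per-head GroupNorm is omitted, and the function names are illustrative only.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries: Matrix, keys: Matrix) -> Matrix:
    """Standard scaled dot-product attention weights, one row per query."""
    scale = math.sqrt(len(keys[0]))
    return [softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
            for q in queries]


def differential_attention(q1: Matrix, k1: Matrix, q2: Matrix, k2: Matrix,
                           values: Matrix, lam: float = 0.5):
    """Subtract a second attention map from the first, then mix the values."""
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    out = [[sum(w * v[j] for w, v in zip(row, values))
            for j in range(len(values[0]))] for row in diff]
    return diff, out
```

Attention mass that both maps assign to the same irrelevant tokens is cancelled by the subtraction, while tokens favored by only the first map keep their weight. Note that the resulting rows can contain negative entries and sum to `1 - lam` rather than 1.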
Oct 6 • 5 tweets • 2 min read
"Claude 3.5 Sonnet to outperform OpenAI o1 in terms of reasoning" with prompting
----
Prompt from the article:
Begin by enclosing all thoughts within <thinking> tags, exploring multiple angles and approaches.
Break down the solution into clear steps within <step> tags. Start with a 20-step budget, requesting more for complex problems if needed.
Use <count> tags after each step to show the remaining budget. Stop when reaching 0.
Continuously adjust your reasoning based on intermediate results and reflections, adapting your strategy as you progress.
Regularly evaluate progress using <reflection> tags. Be critical and honest about your ..........
........"
Sep 25 • 4 tweets • 5 min read
MotleyCrew: a pragmatic approach to AI agents
AI agents are all the rage these days. An AI agent is simply a wrapper around a Large Language Model (LLM) that allows it to request actions, such as a web search, from the outside world, and feed the results back to the LLM, and so on until a desired outcome is achieved.
As each agent can handle only so many different instructions at once in its prompt, it is often helpful to use a team of multiple agents for a task, just as a team of humans divides a task among themselves.
Then the question arises: how should the agents coordinate their work?
A thread 🧵 1/n
Many existing solutions and frameworks assume the user will use that particular framework's agent implementation as well as its way of coordinating agent interactions.
That is an important limitation: building good agents is hard, and so is creating good agent communication semantics.
As building good agents is a difficult task that is distinct from multi-agent orchestration, it's unlikely that a single framework would have both the best agents and the best multi-agent communication layer.
We can't simply assume that the framework that excels at one will also excel at the other.
And then I stumbled upon this open-source project: MotleyCrew.
So that was the starting point for @motleycrew_ai: providing wrappers for existing frameworks' agents, and focusing on making their interaction as simple and powerful as possible.
Thus, for example, in MotleyCrew you can directly pass agents as tools to other agents (without introducing an additional "delegation" concept with its own semantics), and these, in turn, can have other agents as tools, and so on.
Another MotleyCrew feature, output handlers, is a simple way of implementing the writer-critic pattern and of providing user-specified guarantees about an agent's output.
This works as follows: the agent is told (under the hood) that it has to return its final result only via the output handler. The output handler runs any verification logic the user specifies, and if that fails, tells the agent what failed and asks it to try again, until success. If successful, the output handler's output is returned as the agent's output.
This pattern takes the reliability of AI agents to a whole new level: the output handler can contain both algorithmic logic (for example, verifying that all the links contained in the agent input are also contained in the output) and agentic logic (for example, criticising the writing style, or double-checking that the output fulfils the instructions given to the agent), or any combination.
🧵 2/n
MotleyCrew doesn't try to reinvent the wheel or build its own little walled garden with its own implementation of every component needed in a multi-agent system.
On the contrary, if you want to combine RAG from LlamaIndex with a group chat from Autogen, and make them nodes in a complex LangGraph structure, with a couple of CrewAI tools thrown in somewhere, the framework makes it easy for you.
Beyond combining other frameworks, @motleycrew_ai also adds some other features that have proven to be useful in practice, such as an embedded knowledge graph that can be used for structuring the knowledge the agents create, as well as an ability for agents to create tasks for each other using that knowledge graph.
Small but extremely useful features are the ability to let tools see their agent's input, and to force agents to only return a final result through a tool - thus guaranteeing a level of output quality that is hard to achieve otherwise.
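The output-handler pattern described in this thread can be illustrated in plain Python. To be clear, this is NOT MotleyCrew's API: `OutputRejected`, `run_with_output_handler` and `require_links` are hypothetical names, and `agent` is any callable taking a task and optional feedback.

```python
from typing import Callable, List, Optional


class OutputRejected(Exception):
    """Raised by a handler to tell the agent what to fix."""


def run_with_output_handler(agent: Callable[[str, Optional[str]], str],
                            handler: Callable[[str], str],
                            task: str, max_attempts: int = 3) -> str:
    """Let the agent retry until the handler accepts its draft."""
    feedback = None
    for _ in range(max_attempts):
        draft = agent(task, feedback)
        try:
            return handler(draft)          # the handler's output is the final output
        except OutputRejected as err:
            feedback = str(err)            # tell the agent what failed
    raise RuntimeError(f"no acceptable output after {max_attempts} attempts")


def require_links(links: List[str]) -> Callable[[str], str]:
    """Algorithmic check: every input link must also appear in the output."""
    def handler(draft: str) -> str:
        missing = [link for link in links if link not in draft]
        if missing:
            raise OutputRejected(f"missing links: {missing}")
        return draft.strip()
    return handler
```

The same handler slot could just as well call a critic LLM instead of a deterministic check; the retry loop does not care which kind of verification rejected the draft.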
Sep 14 • 4 tweets • 1 min read
The biggest contribution of OpenAI Strawberry (o1) is on inference scaling.
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.
The entire result-search space becomes a mini dataset of training examples
Is this a vacuum-tube to silicon-transistor moment for LLMs, if true? 🤯
Published in Nature yesterday.
Molecular memristors enable 14-bit analog computing, surpassing digital efficiency for core matrix operations.
The paper achieved a >73 dB signal-to-noise ratio, a four-order-of-magnitude improvement over the state of the art, while consuming 460x less energy than digital computers.
**Original Problem**:
Vector-matrix multiplication (VMM) is computationally expensive, requiring n^2 steps for vectors of length n. Current dot-product engines (DPEs) for VMM have low precision (2-6 bits) due to non-idealities in analog circuit elements.
-----
**Key Insights from this Paper**:
• Developed 14-bit precision molecular memristor crossbar for VMM
• Supramolecular electronics yield unprecedented precision in neuromorphic hardware for AI acceleration.
• Achieved linear, symmetric weight updates with 16,520 distinct analog levels
• Enabled one-step programmability of conductance levels
• Implemented selector-free crossbar design using unidirectional elements
-----
**Solution in this Paper** 🧪:
• Fabricated 64x64 crossbar using [Ru^II L_2](BF_4)_2 molecular film
• Engineered symmetric potentiation/depression characteristics
• Utilized supramolecular electronic transitions between 31 and 22 states
• Implemented custom >16-bit precision CMOS peripheral circuit
• Compensated for wire resistances and parasitic effects
-----
**Results**:
• 16,520 distinct analog levels with 14-bit resolution
• Signal-to-noise ratio of 73-79 dB for VMM operations
• 4 orders of magnitude improvement in precision over state-of-the-art
• 460x higher energy efficiency than CPU for matrix operations
• Demonstrated high-fidelity image reconstruction via inverse Fourier transform
Common retrievers like DPR (Dense Passage Retrieval) normally work with 100-word Wikipedia paragraphs.
This paper proposes LongRAG, which processes the entire Wikipedia into 4K-token units, 30x longer than before.
By increasing the unit size, they reduce the total number of units from 22M to 600K. This significantly lowers the burden on the retriever, which leads to a remarkable retrieval score: answer recall@1=71% on NQ (previously 52%) and answer recall@2=72% (previously 47%) on HotpotQA (full-wiki).
This technique is particularly beneficial for open-domain question answering, where detailed and accurate responses are crucial. By leveraging external information, RAG systems can overcome the limitations of relying solely on the parametric knowledge embedded in LLMs, making them more effective in handling complex queries.
Challenges for regular RAG
Traditional RAG frameworks often use short retrieval units, such as 100-word passages, requiring the retriever to sift through large amounts of data. This design burdens the retriever heavily while the reader's task remains relatively simple, leading to inefficiencies and potential semantic incompleteness due to document truncation.
And so here comes LongRAG
To address these challenges, the LongRAG framework comprises a "long retriever" and a "long reader" component, designed to process longer retrieval units of around 4K tokens each.
By increasing the size of the retrieval units, LongRAG reduces the number of units from 22 million to 600,000, significantly easing the retriever's workload and improving retrieval scores. This allows the retriever to handle more comprehensive information units, improving the system's efficiency and accuracy.
How it works
Retrieval unit selection impacts performance. Passage-level units have a turning point between 100-200, document-level between 5-10, and grouped documents between 4-8. Optimal context length for the reader is around 30K tokens.
Semantic integrity of retrieval units is crucial. Longer, more complete units outperform shorter, fragmented ones.
LongRAG approximates similarity scores between queries and long retrieval units by maximizing scores between the query and all chunks within the unit. This outperforms direct encoding of entire long contexts.
The framework uses a two-turn approach for answer extraction: 1) Generate a longer answer (few words to sentences) from retrieved context. 2) Extract a concise short answer (few words) using in-context examples.
The LongRAG framework operates by grouping related documents into long retrieval units, which the long retriever then processes to identify relevant information.
To extract the final answers, the top 4 to 8 retrieved units are concatenated and fed into a long-context LLM, such as Gemini-1.5-Pro or GPT-4o. This method leverages the capabilities of long-context models to process large amounts of text efficiently, supporting a thorough and accurate extraction of information.
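The "maximize over chunks" scoring trick mentioned above can be sketched as follows. The embeddings are toy 2-d vectors and the function names are illustrative; a real retriever would use an embedding model and an approximate nearest-neighbor index.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def unit_score(query: Vector, chunk_vectors: List[Vector]) -> float:
    """Score a long unit by its best-matching chunk instead of encoding it whole."""
    return max(cosine(query, chunk) for chunk in chunk_vectors)


def retrieve_units(query: Vector, units: Dict[str, List[Vector]], k: int) -> List[str]:
    """Return the names of the k long units with the highest max-chunk score."""
    ranked = sorted(units, key=lambda name: unit_score(query, units[name]), reverse=True)
    return ranked[:k]


units = {
    "unit_A": [[0.0, 1.0], [1.0, 0.0]],  # one chunk matches the query exactly
    "unit_B": [[0.7, 0.7]],              # partial match
    "unit_C": [[0.0, 1.0]],              # no match
}
top_units = retrieve_units([1.0, 0.0], units, k=2)
```

The retrieved units would then be concatenated and handed to the long-context reader, as described in the thread.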
Performance
- On the Natural Questions (NQ) dataset, it achieved an exact match (EM) score of 62.7%, a significant leap forward compared to traditional methods. On the HotpotQA dataset, it reached an EM score of 64.3%.
So it matches the performance of state-of-the-art fine-tuned RAG models.
The framework reduced the corpus size by up to 30 times and improved answer recall by approximately 20 percentage points compared to traditional methods, with an answer recall@1 score of 71% on NQ and an answer recall@2 score of 72% on HotpotQA.
Paper - arxiv.org/pdf/2406.15319…
Sep 2 • 4 tweets • 3 min read
Useful Prompting technique.
Simply ask the LLM to re-read the question - significantly boosts LLM reasoning across diverse tasks and model types. π‘
Repeats question input twice in prompt, unlocks latent reasoning potential
**Problem**:
Decoder-only LLMs with unidirectional attention struggle with nuanced reasoning tasks due to limited global understanding of input questions.
**Key Insights from this Paper**:
• Re-reading (RE2) input enhances reasoning by improving question comprehension
• Enables "bidirectional" understanding in unidirectional LLMs
• Compatible with existing thought-eliciting prompting methods
• Effective across various LLM types and reasoning tasks
**Solution in this Paper**:
• Introduces RE2 (Re-Reading) prompting method:
- Repeats question input twice in prompt
- Enhances input understanding before reasoning
- Allows tokens to attend to full context in second pass
• Compatible with Chain-of-Thought and other prompting techniques
• Applicable to zero-shot, few-shot, and self-consistency settings
**Results**:
• Consistent improvements across 14 datasets and 112 experiments
• Effective for both instruction-tuned (ChatGPT) and non-tuned (LLaMA) models
• Increases n-gram recall between generation and input question
• Most effective when reading question twice
Example inputs of CoT prompting versus CoT prompting with RE2.
RE2 is a simple prompting method that repeats the question as input.
Typically, tokens in the question, such as "tennis balls", cannot see subsequent tokens in the original setup for LLMs (the top figure).
In contrast, LLMs with RE2 allow "tennis balls" in the second pass to see the entire question containing "How many ...", achieving the effect of a "bidirectional" understanding (the bottom figure).
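A tiny sketch of how such a prompt could be built. The exact wording of the template ("Read the question again:") and the helper name `re2_prompt` are assumptions on my part; the only thing taken from the thread is that the question appears twice before the answer cue.

```python
def re2_prompt(question: str, chain_of_thought: bool = True) -> str:
    """Build a prompt that states the question twice before the answer cue."""
    lines = [
        f"Q: {question}",
        f"Read the question again: {question}",  # second pass over the question
        "A: Let's think step by step." if chain_of_thought else "A:",
    ]
    return "\n".join(lines)


prompt = re2_prompt("Roger has 5 tennis balls and buys 2 more. How many does he have?")
print(prompt)
```

Since the change is purely at the prompt level, it can be layered on top of few-shot examples or self-consistency sampling without touching the model.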
Aug 16 • 6 tweets • 3 min read
I love the Self-Calibration prompting technique
It's a two-step prompting process. Initially, the LLM is prompted to answer a specific question.
Subsequently, a new prompt is generated that includes the original question, the LLM's response, and an additional query asking the LLM to evaluate the correctness of its own answer.
This introspective step is designed to assess the confidence level of the response, providing a built-in mechanism for self-evaluation.
Example:
1. Question: What are the current treatment options for Type 2 diabetes?
2. LLM's Answer: Current treatment options for Type 2 diabetes include lifestyle modifications, oral medications like metformin, and in some cases, insulin therapy.
3. Follow-up Prompt: Reflecting on the latest medical guidelines, is this response accurate and complete?
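The two-step flow is easy to wire up. In this sketch `llm` is any callable from prompt string to response string (a stand-in for a real model call), and the follow-up wording is my own illustrative phrasing rather than a template from the paper.

```python
from typing import Callable, Tuple


def self_calibrate(question: str, llm: Callable[[str], str]) -> Tuple[str, str, str]:
    """Ask a question, then ask the model to judge its own answer."""
    answer = llm(question)                                   # step 1: answer
    followup = (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "Is the proposed answer accurate and complete? "
        "Reply True or False, then explain briefly."
    )
    verdict = llm(followup)                                  # step 2: self-evaluation
    return answer, followup, verdict
```

The verdict can then be used to flag low-confidence answers for review or to trigger a retry.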
--------
The concept of Self-Calibration came from the paper "Language Models (Mostly) Know What They Know"
🧵 2/n
LLMs often struggle with accurately evaluating their own knowledge and capabilities, which can lead to overconfident or unreliable outputs. This paper investigates whether LLMs can be trained to recognize what they do and don't know, and how this ability generalizes across tasks.
It concludes that larger models demonstrate improved calibration and self-evaluation across diverse tasks.