KL divergence has its origins in information theory, whose primary goal is to quantify how much information is in data. The most important metric in information theory is entropy.
Synthetic data and iterative self-improvement are all you need.
No humans needed in the evaluation loop.
This paper introduces a self-improving evaluator that learns to assess LLM outputs without human feedback, using synthetic data and iterative self-training to match top human-supervised models.
-----
Original Problem 🤔:
Building strong LLM evaluators typically requires extensive human preference data, which is costly and becomes outdated as models improve. Current approaches rely heavily on human annotations, limiting scalability and adaptability.
-----
Solution in this Paper 🔧:
→ The method starts with unlabeled instructions and uses a seed LLM to generate contrasting response pairs, where one is intentionally inferior.
→ It then uses an LLM-as-Judge approach to generate reasoning traces and final judgments for these synthetic pairs.
→ The system filters correct judgments and uses them to train an improved evaluator model.
→ This process repeats iteratively, with each iteration using the improved model to generate better synthetic training data.
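The steps above can be sketched as a loop. This is a toy illustration, not the paper's code: `generate_pair`, `judge`, and `train` are hypothetical stand-ins for LLM calls and fine-tuning, and the judge is a random stub whose accuracy grows with the amount of filtered training data.

```python
import random

def generate_pair(instruction):
    """Hypothetical stand-in for the seed LLM: returns (better, worse) responses."""
    better = f"answer to: {instruction}"
    worse = f"answer to a modified version of: {instruction}"
    return better, worse

def judge(skill, pair, rng):
    """Hypothetical stand-in for LLM-as-Judge: returns (reasoning, verdict).

    `skill` is the probability that the stub prefers the better response.
    In this toy the better response is always labeled "A".
    """
    verdict = "A" if rng.random() < skill else "B"
    return f"reasoning trace ending in verdict {verdict}", verdict

def train(skill, examples):
    """Hypothetical fine-tuning step: more filtered examples, slightly better judge."""
    return min(0.99, skill + 0.001 * len(examples))

def self_taught_loop(instructions, skill=0.6, iterations=3, samples=4, seed=0):
    rng = random.Random(seed)
    history = [skill]
    for _ in range(iterations):
        kept = []
        for x in instructions:
            pair = generate_pair(x)
            for _ in range(samples):
                reasoning, verdict = judge(skill, pair, rng)
                if verdict == "A":      # keep only judgments that match the known label
                    kept.append((x, pair, reasoning))
                    break               # one correct trace per instruction in this toy
        skill = train(skill, kept)      # the improved judge is used in the next round
        history.append(skill)
    return history
```

Calling `self_taught_loop([f"q{i}" for i in range(20)])` returns the judge's skill after each round. The point of the sketch is the control flow: the label comes from how the pair was constructed, so no human annotation is involved.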
-----
Key Insights from this Paper 💡:
→ Human preference data isn't necessary for training strong LLM evaluators
→ Synthetic data generation with iterative self-improvement can match human-supervised approaches
→ Different data sources (safety, math, coding) improve performance in their respective domains
-----
Results 📊:
→ Improved RewardBench accuracy from 75.4 to 88.3 (88.7 with majority voting)
→ Outperformed GPT-4 (84.3) and matched top reward models trained with human data
→ Achieved 79.5% agreement with human judgments on MT-Bench using majority voting
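Majority voting here means sampling several judgments for the same pair and taking the most frequent verdict. A minimal sketch, assuming verdicts are plain labels such as "A" and "B" (the number of samples is not stated in this post):

```python
from collections import Counter

def majority_vote(verdicts):
    """Return the most common verdict; ties go to the verdict seen first."""
    counts = Counter(verdicts)
    return counts.most_common(1)[0][0]
```

Sampling an odd number of judgments avoids ties when there are only two possible verdicts.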
The diagram shows how an AI system learns to evaluate responses without human help, using an iterative training process:
1. Input Stage 🎯
- It starts with a prompt (x)
- Creates a similar but slightly different version of that prompt (x')
2. Response Generation 🔄
- The system uses an LLM to create two responses:
  - A "good" response to the original prompt
  - A "bad" response by answering the modified prompt
3. Judgment Phase 📊
- An AI judge (M_i) evaluates these responses
- It samples multiple judgments about which response is better
- The system selects only the correct verdicts
4. Training Loop ⚙️
- These judgments are collected as training data
- The system uses this data to train an improved version of itself (M_{i+1})
- This new, better model becomes the judge for the next round
Think of it like a student who:
1. Creates their own practice problems
2. Solves them in both good and not-so-good ways
3. Learns to tell the difference between good and bad solutions
4. Uses this knowledge to get even better at judging solutions
The key innovation is that this entire process runs automatically, without needing humans to say which answers are good or bad. The system teaches itself to become a better evaluator through practice and iteration.
Paper Title: "Self-Taught Evaluators"
The podcast below on this paper was generated with Google's Illuminate.
Beautiful open-source tool: ScrapeGraphAI, with 16.2K GitHub stars 🌟
Turns natural language commands into production-ready web scrapers using LLM-powered graph pipelines.
This library stands out by integrating Large Language Models (LLMs) and modular graph-based pipelines to automate the scraping of data from various sources (e.g., websites and local files).
Why ScrapeGraphAI ❓
Traditional web scraping tools often rely on fixed patterns or manual configuration to extract data from web pages. ScrapeGraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
→ ScrapeGraphAI builds web scraping pipelines using LLMs and directed graph logic.
→ It extracts information from websites and local documents (XML, HTML, JSON, Markdown) through simple natural language prompts.
→ Supports OpenAI, Groq, Azure, Gemini APIs and local Ollama models. Features parallel LLM calls, multi-language support, and integrates with browsers through Playwright. Built for production use with comprehensive testing and CI/CD.
💻 Usage
There are multiple standard scraping pipelines that can be used to extract information from a website (or local file).
The most common one is the `SmartScraperGraph`, which extracts information from a single page given a user prompt and a source URL.
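A minimal usage sketch follows. The import path, constructor arguments, and `run()` call follow the project's README as I understand it, but config keys and model names have changed between releases, so treat the `"openai/gpt-4o-mini"` model string and the `verbose`/`headless` keys as assumptions and check the version you install. `build_config` and `scrape` are helper names made up for this example.

```python
def build_config(api_key, model="openai/gpt-4o-mini"):
    """Assemble the config dict passed to the graph (model string is an assumption)."""
    return {
        "llm": {"api_key": api_key, "model": model},
        "verbose": True,
        "headless": True,
    }

def scrape(prompt, url, api_key):
    """Run SmartScraperGraph on a single page and return the extracted data."""
    # Imported lazily so this file loads even when scrapegraphai is not installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(prompt=prompt, source=url, config=build_config(api_key))
    return graph.run()
```

A call such as `scrape("List the article titles on this page", "https://example.com", "YOUR_API_KEY")` needs the `scrapegraphai` package, a Playwright browser install, and a valid API key.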
HunyuanVideo, an open-source alternative to Runway Gen-3, Luma 1.6, and a few top-performing Chinese video generation models, just arrived. 🤯
🎯 A 13B-parameter open-source video generation model from Tencent that matches commercial quality 👏
→ HunyuanVideo represents a major advancement in open-source video generation, released by Tencent in December 2024 with public code and model weights
→ The model matches or exceeds closed-source solutions while being fully accessible to researchers and developers
→ Running on H800/H20 GPUs, it requires 45-60GB memory depending on resolution settings
🔬 Architecture
→ The foundation is a Causal 3D VAE that intelligently compresses videos with specific ratios - 4x for time dimension, 8x for spatial dimensions, and 16x for channels
→ Unlike traditional approaches using CLIP/T5, HunyuanVideo employs a decoder-only Multimodal LLM as its text encoder, enabling better image-text alignment and complex reasoning
→ The architecture follows a novel dual-stream to single-stream progression - first processing video and text independently, then merging them for enhanced multimodal fusion
→ A sophisticated prompt rewriting system offers two modes: Normal for better understanding user intent, and Master for enhancing visual quality aspects
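The VAE compression ratios above translate into a latent tensor shape with a little arithmetic. The sketch below assumes two things not spelled out in this post: that a causal video VAE encodes the first frame on its own and groups the remaining frames in blocks of 4, and that "16x for channels" means 16 latent channels. `latent_shape` is a hypothetical helper, not part of the released code.

```python
def latent_shape(frames, height, width, ct=4, cs=8, latent_channels=16):
    """Latent (frames, channels, height, width) for a causal 3D VAE.

    Assumes the first frame is encoded separately, so `frames - 1`
    must be divisible by the temporal ratio `ct`.
    """
    if (frames - 1) % ct or height % cs or width % cs:
        raise ValueError("input dimensions must match the compression ratios")
    return ((frames - 1) // ct + 1, latent_channels, height // cs, width // cs)
```

Under these assumptions, a 129-frame 720x1280 clip maps to `latent_shape(129, 720, 1280) == (33, 16, 90, 160)`.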
🛠️ Implementation Details
→ Supports various aspect ratios including 9:16, 16:9, 4:3, 3:4, and 1:1 with resolutions up to 720p
→ Uses flow matching for training with a configurable shift factor of 9.0 and embedded classifier-free guidance
→ Provides CPU offloading capabilities to manage memory efficiently during high-resolution generation
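One common way a shift factor enters flow-matching samplers is the time shift used by Stable Diffusion 3 style schedulers, which spends more sampling steps near the noisy end of the trajectory. Whether HunyuanVideo applies exactly this form is an assumption. The sketch only shows what a shift of 9.0 does to a uniform schedule, and `shift_time` and `schedule` are hypothetical names.

```python
def shift_time(t, shift=9.0):
    """Map a uniform timestep t in [0, 1] to a shifted one (1 = pure noise)."""
    return shift * t / (1.0 + (shift - 1.0) * t)

def schedule(steps, shift=9.0):
    """Timesteps from 1 (noise) down to 0 (data), after shifting."""
    return [shift_time(1.0 - i / steps, shift) for i in range(steps + 1)]
```

With `shift=9.0`, the midpoint `t = 0.5` maps to 0.9, so half of the steps are spent in the noisiest tenth of the range.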
📊 Performance Metrics
→ Professional evaluation across 1,533 prompts shows superior results: 68.5% text alignment, 64.5% motion quality, 96.4% visual quality
Emotional RAG: AI can now recall memories based on emotions, just like humans do.
Original Problem 🤔:
Role-playing agents powered by LLMs struggle to maintain consistent personality traits and generate human-like responses due to limited emotional context in memory retrieval.
-----
Solution in this Paper 💡:
• Introduces Emotional RAG framework for role-playing agents
• Encodes both semantic and emotional vectors for queries and memory
• Implements two retrieval strategies:
  - Combination: Fuses semantic and emotional similarity scores
  - Sequential: Retrieves based on one factor, then reranks using the other
• Designs emotion-aware prompt templates for LLMs
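The two retrieval strategies can be sketched with plain cosine similarity. This is an illustration under assumptions, not the paper's implementation: each memory item is a dict with hypothetical `sem` and `emo` vectors, the fusion is a simple weighted sum, and the sequential variant shortlists by semantics first (the reverse order works the same way).

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def combined_retrieve(query, memory, k=2, weight=0.5):
    """Combination strategy: rank by a weighted sum of both similarities."""
    def score(item):
        return ((1 - weight) * cosine(query["sem"], item["sem"])
                + weight * cosine(query["emo"], item["emo"]))
    return sorted(memory, key=score, reverse=True)[:k]

def sequential_retrieve(query, memory, k=2, pool=3):
    """Sequential strategy: shortlist by semantics, then rerank by emotion."""
    shortlist = sorted(memory, key=lambda m: cosine(query["sem"], m["sem"]),
                       reverse=True)[:pool]
    return sorted(shortlist, key=lambda m: cosine(query["emo"], m["emo"]),
                  reverse=True)[:k]
```

In a real system the vectors would come from a text embedding model and an emotion classifier. Here they are whatever lists of floats the caller supplies.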
-----
Key Insights from this Paper:
→ Incorporating emotional states in memory retrieval enhances personality consistency
→ Mood-Dependent Memory theory from psychology applies to AI agents
→ Different retrieval strategies work best for different personality evaluation metrics
→ Emotional congruence improves the human-likeness of generated responses
-----
Results 📊:
• Outperforms traditional RAG methods across multiple datasets
• Significant improvements in full personality evaluations (MBTI, BFI)
• Better performance on open-source models (ChatGLM-6B, Qwen-72B) compared to GPT-3.5
• Achieves higher accuracy in overall personality trait predictions
🔍 The Emotional RAG framework consists of four main components:
→ Query encoding: Encodes both semantic and emotional aspects of user queries
→ Memory encoding: Stores and encodes conversation history with semantic and emotional vectors
→ Emotional retrieval: Retrieves relevant memory based on both semantic and emotional similarity
→ Response generation: Uses retrieved memory along with character profile to generate responses
Type a sentence, get any sound - from talking cats to singing saxophones. Brilliant release by NVIDIA.
✨ NVIDIA just unveiled Fugatto, a groundbreaking 2.5B parameter audio AI model that can generate and transform any combination of music, voices, and sounds using text prompts and audio inputs.
Fugatto could ultimately allow developers and creators to bring sounds to life simply by inputting text prompts.
→ The model demonstrates unique capabilities like creating hybrid sounds (trumpet barking), changing accents/emotions in voices, and allowing fine-grained control over sound transitions - trained on millions of audio samples using 32 NVIDIA H100 GPUs
👨‍🔧 Architecture
Built as a foundational generative transformer model leveraging NVIDIA's previous work in speech modeling and audio understanding. The training process involved creating a specialized blended dataset containing millions of audio samples.
→ ComposableART's Innovation in Audio Control
Introduces a novel technique allowing combination of instructions that were only seen separately during training. Users can blend different audio attributes and control their intensity
→ Temporal Interpolation Capabilities
Enables generation of evolving soundscapes with precise control over transitions. Can create dynamic audio sequences like rainstorms fading into birdsong at dawn
→ Processes both text and audio inputs flexibly, enabling tasks like removing instruments from songs or modifying specific audio characteristics while preserving others
→ Shows capabilities beyond its training data, creating entirely new sound combinations through interaction between different trained abilities
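One plausible way to picture instruction blending and temporal interpolation is a guidance-style weighted combination of model outputs, where each instruction gets its own weight and the weights change over time. This is an assumption about the general idea, not NVIDIA's implementation. The short lists below stand in for model outputs, and `compose` and `crossfade_weights` are hypothetical names.

```python
def compose(uncond, conds, weights):
    """Guidance-style blend: uncond + sum_k w_k * (cond_k - uncond)."""
    out = list(uncond)
    for cond, w in zip(conds, weights):
        for i, (c, u) in enumerate(zip(cond, uncond)):
            out[i] += w * (c - u)
    return out

def crossfade_weights(step, total):
    """Linear weights that fade the first instruction out and the second in."""
    a = step / (total - 1)
    return [1.0 - a, a]
```

Using `crossfade_weights(step, total)` as the `weights` argument at each time segment gives a transition such as rain fading into birdsong, while fixed weights give a static hybrid of two instructions.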
🔍 Real-world Applications
→ Allows rapid prototyping of musical ideas, style experimentation, and real-time sound creation during studio sessions
→ Can modify voice characteristics for language learning applications, allowing content delivery in familiar voices
→ Creates a massive dataset (20M+ rows, ~330 years of audio) by combining multiple open source datasets and using LLMs to generate rich descriptions and instructions
→ Optimal Transport Conditional Flow Matching
Trains using OT-CFM objective with a T5-based transformer architecture and adaptive layer normalization
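For readers unfamiliar with the OT-CFM objective, the sketch below shows the standard form from the flow-matching literature: interpolate on a straight path between a noise sample and a data sample, and regress the model's velocity onto the path's constant direction. The `sigma_min` value and the `predict` callback are assumptions for illustration. Fugatto's actual training code is not shown in this post.

```python
import random

def ot_cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Return (x_t, target velocity) on the straight path from noise x0 to data x1."""
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target

def cfm_loss(predict, x1, rng, sigma_min=1e-4):
    """Mean squared error between predicted and target velocity for one sample."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # noise sample
    t = rng.random()                          # time drawn uniformly from [0, 1)
    x_t, target = ot_cfm_pair(x0, x1, t, sigma_min)
    pred = predict(x_t, t)                    # hypothetical model call
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(x1)
```

In training, `predict` would be the transformer and the loss would be averaged over a batch. Here it can be any function that returns a list the same length as `x_t`.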
→ Discusses methods like LoRA, QLoRA, and adapters that enable efficient fine-tuning by updating only a subset of model parameters.
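As a concrete picture of "updating only a subset of parameters", here is a minimal LoRA sketch in plain Python. The frozen weight `W` is left untouched and only the two small matrices `A` and `B` would be trained. The helper names are hypothetical, and real implementations operate on tensors rather than nested lists.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A, with W frozen."""
    delta = matmul(b, a)                      # (out x rank) @ (rank x in)
    scale = alpha / rank
    return [[wij + scale * dij for wij, dij in zip(wr, dr)]
            for wr, dr in zip(w, delta)]

def trainable_params(d_out, d_in, rank):
    """Parameters updated by LoRA versus full fine-tuning of one matrix."""
    return rank * (d_in + d_out), d_out * d_in
```

For a 4096x4096 matrix at rank 8, `trainable_params(4096, 4096, 8)` gives 65,536 LoRA parameters against 16,777,216 for full fine-tuning. `B` is usually initialized to zeros, so the adapted model starts out identical to the base model.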
🔬 Evaluation metrics and benchmarks for assessing fine-tuned LLMs
→ Includes perplexity, accuracy, and task-specific measures. Benchmarks like GLUE, SuperGLUE, TruthfulQA, and MMLU assess various aspects of LLM performance. Safety evaluations using frameworks like DecodingTrust are also crucial for ensuring responsible AI deployment.
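Perplexity, the first metric listed, is the exponential of the average negative log-probability the model assigns to the correct tokens. A minimal sketch, assuming you already have the per-token probabilities:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that assigns probability 0.25 to every correct token has perplexity 4, which reads as "as uncertain as a uniform choice among four options". Lower is better.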
💻 Explores various deployment approaches and optimization techniques to enhance LLM performance and efficiency in real-world applications.
🌐 Examines the extension of fine-tuning techniques to multimodal models and domain-specific applications in fields like medicine and finance.
Note: the content's value stands on its merit, even though the authors may have used AI assistance in parts of the paper's creation.
🧵 2/n
A chronological timeline showcasing the evolution of LLMs from 1990 to 2023.
🧵 3/n
Mind map depicting various dimensions of Large Language Models (LLMs), covering aspects from pre-training and fine-tuning methodologies to efficiency, evaluation, inference, and application domains.