→ Discusses methods like LoRA, QLoRA, and adapters that enable efficient fine-tuning by updating only a subset of model parameters.
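To make the idea concrete, here is a minimal PyTorch sketch of the low-rank update behind LoRA: the pretrained weight is frozen and only a small rank-r product B·A is trained. The class name, rank, and scaling factor are illustrative assumptions, not any library's API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                          # freeze the pretrained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # only A and B are trained
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Wrap an existing projection and count trainable parameters.
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 768 = 12,288 instead of 768 * 768 = 589,824
```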
🔬 Evaluation metrics and benchmarks for assessing fine-tuned LLMs
→ Includes perplexity, accuracy, and task-specific measures. Benchmarks like GLUE, SuperGLUE, TruthfulQA, and MMLU assess various aspects of LLM performance. Safety evaluations using frameworks like DecodingTrust are also crucial for ensuring responsible AI deployment.
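As a concrete example of one such metric, here is a hedged sketch of computing perplexity with the Hugging Face transformers API; the model choice is arbitrary, and perplexity is simply the exponential of the mean per-token negative log-likelihood.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch: perplexity = exp(mean negative log-likelihood per token).
model_name = "gpt2"  # example only; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "The fine-tuned model should assign high probability to in-domain text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the average cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity = {torch.exp(loss).item():.2f}")
```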
💻 Explores various deployment approaches and optimization techniques to enhance LLM performance and efficiency in real-world applications.
🌐 Examines the extension of fine-tuning techniques to multimodal models and domain-specific applications in fields like medicine and finance.
Note: the content stands on its own merits, even if the authors may have used AI assistance for parts of the paper.
🧵 1/n
🧵 2/n
A chronological timeline showcasing the evolution of LLMs from 1990 to 2023.
🧵 3/n
Mind map depicting various dimensions of Large Language Models (LLMs), covering aspects from pre-training and fine-tuning methodologies to efficiency, evaluation, inference, and application domains.
MapReduce meets LLMs: a divide-and-conquer approach lets ordinary LLMs process documents 100x longer than their context limit
Using MapReduce principles, small-context LLMs now handle million-token documents efficiently.
Original Problem 🔍:
LLMs struggle to process extremely long texts exceeding their context window, limiting their application in tasks requiring comprehensive document understanding.
-----
Solution in this Paper 🛠️:
• LLM × MapReduce: A training-free framework for long-sequence processing
• Structured information protocol: Addresses inter-chunk dependency
• In-context confidence calibration: Resolves inter-chunk conflicts
• Three-stage process: Map, collapse, and reduce stages for efficient processing
-----
Key Insights from this Paper 💡:
• Divide-and-conquer approach enables short-context LLMs to handle long texts
• Structured information and confidence calibration improve cross-chunk processing
• Framework is compatible with different LLMs, demonstrating generalization capability
• Efficient design outperforms standard decoding in speed
-----
Results 📊:
• Outperforms closed-source and open-source LLMs on InfiniteBench
• Average score: 68.66 (vs. 57.34 for GPT-4)
• Enables Llama3-70B-Instruct (8K context) to process 1280K tokens
• Faster inference: 2 GPUs for 128K tokens (vs. 4 GPUs for standard decoding)
🧩 The key components of the LLM × MapReduce framework
The LLM × MapReduce framework consists of three main stages (a minimal sketch follows the list):
1. Map stage: The long input text is divided into chunks, and an LLM extracts necessary information from each chunk.
2. Collapse stage: If the mapped results still exceed the model's context window, they are compressed while preserving the same structure as the map-stage output.
3. Reduce stage: The final response is generated based on the collapsed results.
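A minimal end-to-end sketch of this three-stage flow is shown below. The `call_llm` helper, chunk sizes, and prompts are hypothetical placeholders, not the paper's released code.

```python
# Minimal, training-free sketch of a MapReduce-style long-document pipeline.
# `call_llm`, chunk sizes, and prompts are assumptions for illustration only.

def call_llm(prompt: str) -> str:
    """Placeholder: plug in an API or local model call here."""
    raise NotImplementedError

def chunk(text: str, max_chars: int = 8000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def map_stage(chunks: list[str], question: str) -> list[str]:
    # Extract question-relevant information from each chunk independently.
    return [call_llm(f"Extract information relevant to: {question}\n\n{c}") for c in chunks]

def collapse_stage(mapped: list[str], question: str, max_chars: int = 8000) -> list[str]:
    # Repeatedly merge neighbouring results until the aggregate fits the context window,
    # keeping the same structured format as the map outputs.
    while sum(map(len, mapped)) > max_chars and len(mapped) > 1:
        pairs = [mapped[i:i + 2] for i in range(0, len(mapped), 2)]
        mapped = [call_llm(f"Merge these notes about '{question}':\n" + "\n---\n".join(p))
                  for p in pairs]
    return mapped

def reduce_stage(collapsed: list[str], question: str) -> str:
    # Generate the final answer from the collapsed, context-sized notes.
    return call_llm(f"Answer '{question}' using only these notes:\n" + "\n---\n".join(collapsed))

def llm_mapreduce(document: str, question: str) -> str:
    mapped = map_stage(chunk(document), question)
    return reduce_stage(collapse_stage(mapped, question), question)
```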
🔑 The paper introduces two key innovations to address the challenges of inter-chunk dependency and inter-chunk conflict:
1. Structured information protocol: This protocol defines the information passed from the map stage to the reduce stage, ensuring the model has critical inputs needed to infer the correct answer when aggregating different chunks.
2. In-context confidence calibration mechanism: This allows the model to assign reliable confidence scores to the output of each chunk, aiding in effectively resolving inter-chunk conflicts (an illustrative sketch of both ideas follows).
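One illustrative way to represent such a protocol in code, under the assumption of a simple per-chunk record; the paper's actual schema and calibration prompts may differ.

```python
from dataclasses import dataclass

# Illustrative data structure for the information passed from map to reduce.
# The exact fields the paper uses may differ; this only shows the idea of
# carrying evidence, a tentative answer, and a calibrated confidence per chunk.
@dataclass
class ChunkResult:
    extracted_info: str   # facts from the chunk relevant to the question
    rationale: str        # why the chunk supports (or cannot answer) the question
    answer: str           # tentative per-chunk answer, or "NO ANSWER"
    confidence: float     # calibrated in-context score in [0, 1]

def resolve_conflicts(results: list[ChunkResult]) -> ChunkResult:
    """Toy reducer: prefer the answer backed by the highest calibrated confidence."""
    answered = [r for r in results if r.answer != "NO ANSWER"]
    return max(answered, key=lambda r: r.confidence) if answered else results[0]
```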
A comprehensive educational repository from @AnthropicAI containing 5 structured courses: API fundamentals, prompt engineering, real-world applications, evaluations, and tool integration with Claude APIs.
Not all brain cells are equal - same goes for LLM attention heads! 💡
Why store everything when you can just remember the important stuff?
Smart KV cache compression that knows which attention heads matter most.
In short, HeadKV compresses the KV cache intelligently by identifying and prioritizing the crucial attention heads.
🎯 Original Problem:
KV caching in LLMs incurs significant memory overhead as input length grows. Current compression methods operate at the layer level, missing the opportunity to optimize at the level of individual attention heads.
-----
🔧 Solution in this Paper:
• HeadKV: Compresses KV cache at individual head level instead of layer level
• Allocates cache budgets based on head importance using Needle-in-a-Haystack tests
• HeadKV-R2: Enhanced version that evaluates both retrieval and reasoning abilities
• Uses dynamic budget allocation across heads based on importance scores
• Retains the most relevant KV cache entries within each head using attention-based selection (see the sketch below this list)
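Below is a hedged sketch of what attention-based selection within a single head could look like; tensor shapes, the probe window, and the scoring rule are assumptions for illustration, not HeadKV's exact implementation.

```python
import torch

# Illustrative per-head KV selection: keep only the cache entries that received
# the most attention from recent query tokens. Shapes and thresholds are assumptions.
def select_kv_for_head(keys, values, queries, budget: int, window: int = 8):
    """
    keys, values: [seq_len, head_dim]  cached K/V for one attention head
    queries:      [seq_len, head_dim]  query states; the last `window` act as probes
    budget:       number of KV entries this head is allowed to keep
    """
    probe_q = queries[-window:]                              # recent queries as importance probes
    scores = probe_q @ keys.T / keys.shape[-1] ** 0.5        # [window, seq_len] scaled dot products
    importance = torch.softmax(scores, dim=-1).sum(dim=0)    # accumulated attention mass per position
    keep = torch.topk(importance, k=min(budget, keys.shape[0])).indices.sort().values
    return keys[keep], values[keep]

# Example with random tensors
k = torch.randn(1024, 64)
v = torch.randn(1024, 64)
q = torch.randn(1024, 64)
k_small, v_small = select_kv_for_head(k, v, q, budget=64)
print(k_small.shape)  # torch.Size([64, 64])
```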
-----
💡 Key Insights:
• Not all attention heads are equally important for text generation
• Head-level compression outperforms layer-level approaches
• Combining retrieval and reasoning abilities for importance scoring is crucial
• Dynamic budget allocation across heads is more effective than fixed allocation
• Just 1.5% of KV cache can retain 97% of full performance
-----
📊 Results:
• Achieves 97% of full KV cache performance while retaining only 1.5% of cache
• Outperforms baselines on LongBench and LooGLE benchmarks
• Superior performance in low-resource settings (KV size = 64 & 128)
• Maintains computational efficiency comparable to existing approaches
• Effective preservation of both retrieval and reasoning capabilities
🔍 The method operates in two key steps: First, it estimates head importance scores using Needle-in-a-Haystack tests that evaluate both retrieval and reasoning abilities.
Second, it allocates KV cache budgets to individual heads based on their importance scores, with more important heads receiving larger cache allocations.
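A toy sketch of proportional budget allocation across heads is shown below; the allocation rule here is an assumption to illustrate the principle, and HeadKV's actual formula may differ.

```python
import torch

# Illustrative budget allocation: distribute a total KV budget across heads in
# proportion to their importance scores, while guaranteeing a small minimum per head.
def allocate_budgets(importance: torch.Tensor, total_budget: int, min_per_head: int = 4) -> torch.Tensor:
    """
    importance:   [num_heads] non-negative importance scores (e.g., from retrieval/reasoning probes)
    total_budget: total number of KV entries to keep across all heads
    """
    num_heads = importance.shape[0]
    base = torch.full((num_heads,), min_per_head)        # guaranteed floor per head
    remaining = total_budget - min_per_head * num_heads
    weights = importance / importance.sum()              # normalize scores
    extra = torch.floor(weights * remaining).long()      # proportional share of the rest
    return base + extra

scores = torch.tensor([0.9, 0.1, 0.5, 0.05])  # hypothetical head importance scores
print(allocate_budgets(scores, total_budget=128))  # e.g. tensor([69, 11, 40,  7])
```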
💡 The main innovations are:
(1) Operating at individual head level rather than layer level for KV cache compression,
(2) Using a novel importance score estimation that considers both retrieval and reasoning abilities, and
(3) Implementing dynamic budget allocation across heads based on their importance distributions.