elvis (@omarsar0) · Sep 9
Another impressive paper by Meta.

It's a plug-in decoding strategy for RAG systems that slashes latency and memory use.

REFRAG achieves up to 30.85× TTFT (time-to-first-token) acceleration.

Let's break down the technical details:
TL;DR

REFRAG replaces most retrieved tokens with precomputed chunk embeddings at decode time, then selectively expands only the few chunks that matter.

This exploits block-diagonal attention in RAG prompts to cut latency and memory while preserving accuracy across RAG, multi-turn dialog, and long-doc summarization.
Core idea

Chunk the retrieved context, encode each chunk with a lightweight encoder, project to the decoder’s embedding size, and feed embeddings directly alongside the user query.

A lightweight RL policy decides which chunks should stay compressed and which need to be expanded back into full text. Think of it as zooming in only where necessary.
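Here's a minimal sketch of that decode-time flow in PyTorch, assuming a hypothetical mean-pooling encoder plus linear projection; the class and function names are illustrative, not the paper's API.

```python
import torch
import torch.nn as nn

class ChunkCompressor(nn.Module):
    """Illustrative stand-in for the lightweight encoder + projection:
    pool each retrieved chunk into one vector, then map it to the
    decoder's embedding width."""
    def __init__(self, enc_dim: int, dec_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, dec_dim)

    def forward(self, chunk_token_embs: torch.Tensor) -> torch.Tensor:
        # chunk_token_embs: (chunk_len, enc_dim) -> (dec_dim,)
        pooled = chunk_token_embs.mean(dim=0)  # cheap pooling stands in for the encoder
        return self.proj(pooled)

def build_decoder_inputs(query_embs, chunk_enc_embs, chunk_dec_embs,
                         expand_mask, compressor):
    """Concatenate the query with either one compressed vector per chunk
    or the chunk's full token embeddings, per the expansion decision."""
    pieces = []
    for enc, dec, expand in zip(chunk_enc_embs, chunk_dec_embs, expand_mask):
        if expand:
            pieces.append(dec)                           # full text, full cost
        else:
            pieces.append(compressor(enc).unsqueeze(0))  # one position per chunk
    pieces.append(query_embs)
    return torch.cat(pieces, dim=0)  # pass to the decoder as inputs_embeds

# Toy usage: 3 retrieved chunks of 64 tokens each; only the second is expanded.
compressor = ChunkCompressor(enc_dim=384, dec_dim=768)
enc_chunks = [torch.randn(64, 384) for _ in range(3)]  # lightweight-encoder space
dec_chunks = [torch.randn(64, 768) for _ in range(3)]  # decoder token-embedding space
query = torch.randn(12, 768)
inputs = build_decoder_inputs(query, enc_chunks, dec_chunks,
                              expand_mask=[False, True, False], compressor=compressor)
print(inputs.shape)  # torch.Size([78, 768]): 1 + 64 + 1 + 12 positions
```

In the real system the RL policy supplies the expansion decisions and the chunk embeddings are precomputed, so the compressed path costs one decoder position per chunk.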
Why it works under the hood

Attention maps show that retrieved passages rarely interact with each other (block-diagonal pattern).

So REFRAG avoids wasting attention across irrelevant text, only paying full price for chunks that matter.
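A tiny illustration of that pattern (simplified: query tokens attend to everything, chunk tokens attend only within their own chunk, causal masking omitted):

```python
import torch

def rag_attention_mask(chunk_lens, query_len):
    """Block-diagonal attention over retrieved chunks: tokens in a chunk
    attend only to their own chunk, while query tokens attend to all
    chunks and to themselves (causal masking omitted for clarity)."""
    total = sum(chunk_lens) + query_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in chunk_lens:
        mask[start:start + n, start:start + n] = True  # within-chunk block
        start += n
    mask[start:, :] = True                             # query rows see everything
    return mask

print(rag_attention_mask(chunk_lens=[4, 4, 4], query_len=3).int())
```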
Speedups without dumbing down

Benchmarks show up to 30× faster time-to-first-token and 6–7× higher throughput versus vanilla LLaMA.

Even compared to strong baselines like CEPE, REFRAG is still 3–4× faster, with equal or better accuracy.
Longer memory for free

By compressing most chunks, REFRAG effectively extends the model's context to hold up to 16× more tokens, letting it juggle far more retrieved passages without breaking latency budgets.
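Rough arithmetic for why compression buys context (the numbers below are illustrative, not the paper's configuration; the paper reports up to 16×):

```python
# Illustrative back-of-the-envelope, not the paper's exact setup.
decoder_budget = 4096      # positions the decoder can spend on retrieved context
chunk_size = 32            # tokens per retrieved chunk
expand_frac = 0.10         # fraction of chunks the policy expands back to full text

avg_cost = (1 - expand_frac) * 1 + expand_frac * chunk_size   # 4.1 positions/chunk
chunks_that_fit = decoder_budget / avg_cost                   # ~999 chunks
effective_tokens = chunks_that_fit * chunk_size               # ~31,970 tokens
print(f"~{effective_tokens / decoder_budget:.1f}x more retrieved tokens fit")  # ~7.8x
# With expand_frac = 0 the ratio approaches chunk_size (here 32x).
```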
Better use of retrieval budget

With the same latency, REFRAG can process more passages than a baseline model and outperform it across 16 RAG tasks, especially when the retriever is weak (messy or noisy results).

Beyond RAG, it boosts multi-turn dialog (keeping more history without truncation) and long-doc summarization (higher ROUGE at fixed compute).

Paper: arxiv.org/abs/2509.01092

More from @omarsar0

Sep 9
Emergent Hierarchical Reasoning in LLMs

The paper argues that RL improves LLM reasoning via an emergent two-phase hierarchy.

First, the model firms up low-level execution, then progress hinges on exploring high-level planning.

More on this interesting analysis:
The authors propose HIerarchy-Aware Credit Assignment (HICRA), which boosts credit on strategic “planning tokens,” and show consistent gains over GRPO.

They also propose semantic entropy as a better exploration signal than token-level entropy.
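A hedged sketch of the credit-assignment idea; the boost factor, masking scheme, and function name are my own illustration, not the paper's exact formulation:

```python
import torch

def hicra_style_advantages(advantages, planning_mask, boost=2.0):
    """Amplify per-token advantages (e.g., from GRPO) on tokens flagged as
    planning tokens, leaving execution tokens unchanged. The boost value
    and the way planning tokens are identified are illustrative only."""
    scale = torch.where(planning_mask,
                        torch.full_like(advantages, boost),
                        torch.ones_like(advantages))
    return advantages * scale

# Toy example: 6 tokens, tokens 0 and 3 are planning tokens.
adv = torch.tensor([0.5, 0.5, 0.5, -0.2, -0.2, -0.2])
mask = torch.tensor([True, False, False, True, False, False])
print(hicra_style_advantages(adv, mask))  # [1.0, 0.5, 0.5, -0.4, -0.2, -0.2]
```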
Two-phase dynamic

Early RL training reduces perplexity and entropy on execution tokens, consolidating procedural skills.

Later gains align with increased diversity in planning tokens and longer, more accurate traces, explaining “aha moments” and length scaling.
Sep 8
I'm surprised Agentic RAG is not getting more attention.

That's all about to change.

Here's why:
Standard RAG systems can only do so much and limit how much value you can pack into the AI's response.

Configuring LLMs to leverage tools via an agent lets you produce responses that not only ground answers better but also reduce hallucinations across the board.
Tools provide the agentic RAG system with more important context when it needs it.

Simple queries can be answered by the vector-store retriever component, but more complex queries are answered more precisely by multiple retriever components that are themselves subagents.
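A minimal sketch of that routing pattern, with hypothetical names (`Retriever`, `route_query`, `is_complex`) standing in for whatever framework you use:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Retriever:
    name: str
    search: Callable[[str], List[str]]  # returns passages

def route_query(query: str, vector_store: Retriever, subagents: List[Retriever],
                is_complex: Callable[[str], bool]) -> List[str]:
    """Simple queries hit the single vector-store retriever; complex ones
    fan out to specialized retriever subagents whose results are pooled
    for the answering LLM."""
    if not is_complex(query):
        return vector_store.search(query)
    passages: List[str] = []
    for agent in subagents:
        passages.extend(agent.search(query))
    return passages

# Toy complexity check: in practice an LLM or a learned router would decide.
is_complex = lambda q: len(q.split()) > 12 or "compare" in q.lower()
```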
Sep 8
Another banger paper on reasoning LLMs!

They train models to "think wider" to explore multiple ideas that produce better responses.

It's called native thought parallelism and proves superior to sequential reasoning.

Great read for AI devs!

Here are the technical details:
TL;DR

This paper proposes a new way to make LLMs smarter at problem solving.

Instead of making the model think in one long chain of reasoning (which often gets stuck in early mistakes), they train it to explore multiple independent ideas at the same time (via parallel reasoning paths) and then merge them into a final answer.Image
The problem

Current “think longer” tricks run into Tunnel Vision. Once a model takes a wrong step, it usually can’t recover, no matter how many extra tokens you give it.

Early tokens commit the model to a suboptimal path; majority-style parallel sampling can beat one long chain under the same token budget.
Sep 7
Another impressive paper by Google DeepMind.

It takes a closer look at the limits of embedding-based retrieval.

If you work with vector embeddings, bookmark this one.

Let's break down the technical details:
Quick Overview

This paper looks at how search engines that rely on vector embeddings have built-in limits.

Even if you train them perfectly, they just can’t handle every possible search query once the combinations of relevant documents get too complex.

The authors prove this with math, then confirm it with experiments on a simple but tricky dataset they call LIMIT.
Built-in ceiling

Each document and query is turned into a single vector.

The study shows there are only so many correct top-k results these vectors can represent.

If you ask for more combinations than the vectors can encode, it’s impossible for the system to get it right.
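A toy brute-force illustration of that ceiling (this is not the paper's formal argument; the dimensions are deliberately tiny): fix 8 document vectors in 2-D, sweep query directions, and count how many of the 28 possible top-2 sets are ever reachable.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, n_docs, k = 2, 8, 2                      # tiny on purpose
docs = rng.normal(size=(n_docs, d))

# Sweep many query directions and record which top-k sets are achievable.
achievable = set()
for theta in np.linspace(0, 2 * np.pi, 20000, endpoint=False):
    q = np.array([np.cos(theta), np.sin(theta)])
    topk = tuple(sorted(np.argsort(docs @ q)[-k:]))
    achievable.add(topk)

total = len(list(itertools.combinations(range(n_docs), k)))  # 28 possible pairs
print(f"{len(achievable)} of {total} top-{k} sets are reachable")
```

You should typically see far fewer than 28 reachable sets; higher-dimensional embeddings raise the ceiling but don't remove it, which is the paper's point.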
Sep 6
Everyone is talking about this new OpenAI paper.

It's about why LLMs hallucinate.

You might want to bookmark this one.

Let's break down the technical details:
Quick Overview

The paper argues that hallucinations are not mysterious glitches but the predictable result of how LLMs are trained and evaluated.

Pretraining creates statistical pressure to make errors, and post-training benchmarks often reward confident guessing over honest uncertainty.

The fix is to realign mainstream evaluations to stop penalizing abstentions.
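A quick illustration of the scoring pressure (my arithmetic, not the paper's derivation), assuming a benchmark that gives 1 point for a correct answer and 0 for both wrong answers and abstentions:

```python
# Under binary grading with no penalty for wrong answers, guessing dominates
# abstaining whenever the guess has any chance of being right.
p_correct = 0.3                                      # model's chance of guessing right
score_guess = p_correct * 1 + (1 - p_correct) * 0    # expected score 0.3
score_abstain = 0.0                                  # "I don't know" earns nothing
print(score_guess > score_abstain)                   # True -> benchmark rewards bluffing
```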
Pretraining inevitably produces some errors

Even if you trained on flawless text, the way models learn guarantees they’ll still slip up sometimes.

That’s because the training goal pushes them to give answers instead of saying “I don’t know.”

The paper's calibration histograms show that GPT-4-style base models are well calibrated prior to RL, consistent with this claim.
Sep 6
Universal Deep Research

NVIDIA recently published another banger tech report!

The idea is simple: allow users to build their own custom, model-agnostic deep research agents with little effort.

Here is what you need to know:
Overview

Universal Deep Research (UDR) proposes a general, model-agnostic deep-research agent that lets users bring their own model and strategy.

Instead of a fixed pipeline, UDR compiles natural-language research strategies into executable code, runs them in a sandbox, and emits structured progress notifications before returning a final report.
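A heavily hedged sketch of that compile-and-run loop; the prompt, the `research(tools, notify)` contract, and the event schema are my assumptions rather than UDR's actual interface, and the bare `exec` here is nothing like a real sandbox:

```python
from typing import Callable, Dict, Iterator, List

def run_udr_style(strategy: str, llm: Callable[[str], str],
                  tools: Dict[str, Callable]) -> Iterator[dict]:
    """Ask an LLM to turn a natural-language research strategy into code,
    run it with whitelisted tools, and stream structured progress events
    before the final report."""
    code = llm(
        "Translate this research strategy into a Python function "
        "`research(tools, notify)` that calls notify(dict) to report progress "
        f"and returns a final report string:\n{strategy}"
    )
    namespace: dict = {}
    exec(code, namespace)   # UDR executes inside an isolated sandbox; this sketch does not
    events: List[dict] = []
    report = namespace["research"](tools, notify=events.append)
    yield from events                                   # structured progress notifications
    yield {"type": "final_report", "content": report}   # final report comes last
```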
Motivation

Current deep-research tools hard-code strategy and model choice, limiting source prioritization, domain-specific workflows, and model swappability.

UDR targets all three gaps by separating the research strategy from the underlying model.