It trains a top-performing yet minimal single-agent model for deep research.
Great example of simple RL-optimized single agents beating complex multi-agent scaffolds.
Now let's break it down:
One agent, minimal tools
The agent only gets search, static browsing (no link clicking), and Python. This makes training hard enough that the model has to learn strategy, not just rely on shortcuts.
Instead of relying on complex multi-agent setups, they train one model end-to-end with RL on synthetic tasks.
Clever scaffolding
Multi-turn tool calls are collapsed into a single, ever-growing context.
This stabilizes reasoning and avoids messy, runaway conversations. A clean_memory tool lets the model compress its own context when it gets too long.
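Here's a rough sketch of that scaffold. The call_model stub, the tool dispatcher, and the character budget are placeholder assumptions, not the paper's implementation; the point is that every tool call and result gets appended to one growing prompt, and clean_memory compresses it in place instead of spawning a new conversation.

```python
# Minimal sketch: multi-turn tool calls collapsed into one growing context.
# call_model and the tools are placeholders, not the paper's code.

MAX_CONTEXT_CHARS = 20_000  # hypothetical budget before compression kicks in

def call_model(context: str) -> str:
    """Placeholder for the actual LLM call; returns the next action as text."""
    return "search: example query"

def run_tool(action: str) -> str:
    """Placeholder dispatcher for the three tools: search, static browsing, Python."""
    return f"[result of {action}]"

def clean_memory(context: str) -> str:
    """Ask the model to compress its own context when it grows too long."""
    return call_model(f"Summarize the key findings so far:\n{context}")

def research(question: str, max_steps: int = 8) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        if len(context) > MAX_CONTEXT_CHARS:
            context = clean_memory(context)           # compress instead of forking a new chat
        action = call_model(context)                  # single, ever-growing prompt
        if action.startswith("answer:"):
            return action.removeprefix("answer:").strip()
        context += f"\n{action}\n{run_tool(action)}"  # append tool call + result in place
    return call_model(context + "\nGive your final answer.")
```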
It's a plug-in decoding strategy for RAG systems that slashes latency and memory use.
REFRAG achieves up to 30.85× time-to-first-token (TTFT) acceleration.
Let's break down the technical details:
TL;DR
REFRAG replaces most retrieved tokens with precomputed chunk embeddings at decode time, then selectively expands only the few chunks that matter.
This exploits block-diagonal attention in RAG prompts to cut latency and memory while preserving accuracy across RAG, multi-turn dialog, and long-doc summarization.
Core idea
Chunk the retrieved context, encode each chunk with a lightweight encoder, project to the decoder’s embedding size, and feed embeddings directly alongside the user query.
A lightweight RL policy decides which chunks should stay compressed and which need to be expanded back into full text. Think of it as zooming in only where necessary.
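A minimal PyTorch sketch of that compression step, with toy modules standing in for the pieces the paper trains: the encoder, the projection, and the expansion policy (here a simple top-k score in place of the learned RL policy; all sizes are hypothetical).

```python
# Minimal sketch of a REFRAG-style compression step (not the authors' code).
import torch
import torch.nn as nn

ENC_DIM, DEC_DIM, CHUNK_TOKENS = 256, 1024, 64   # hypothetical sizes

chunk_encoder = nn.Sequential(nn.Linear(ENC_DIM, ENC_DIM), nn.ReLU())  # stand-in lightweight encoder
projector = nn.Linear(ENC_DIM, DEC_DIM)          # maps chunk embedding into the decoder's embedding space
expand_scorer = nn.Linear(ENC_DIM, 1)            # stand-in for the learned expansion policy

def compress_context(chunk_token_embs: torch.Tensor, expand_budget: int = 2):
    """chunk_token_embs: (num_chunks, CHUNK_TOKENS, ENC_DIM) token embeddings per retrieved chunk.
    Returns one projected embedding per chunk plus the indices of chunks to expand back to full text."""
    pooled = chunk_token_embs.mean(dim=1)              # (num_chunks, ENC_DIM): one vector per chunk
    chunk_embs = projector(chunk_encoder(pooled))      # (num_chunks, DEC_DIM): fed to the decoder with the query
    scores = expand_scorer(pooled).squeeze(-1)         # which chunks "matter" enough to expand
    expand_idx = scores.topk(min(expand_budget, len(scores))).indices
    return chunk_embs, expand_idx

# Usage: 8 retrieved chunks of 64 tokens each; most stay compressed, a few get zoomed in on.
embs, expand = compress_context(torch.randn(8, CHUNK_TOKENS, ENC_DIM))
print(embs.shape, expand.tolist())
```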
The paper argues that RL improves LLM reasoning via an emergent two-phase hierarchy.
First the model firms up low-level execution; then progress hinges on exploring high-level planning.
More on this interesting analysis:
The authors propose HIerarchy-Aware Credit Assignment (HICRA), which boosts credit on strategic “planning tokens,” and show consistent gains over GRPO.
They also propose semantic entropy as a better exploration signal than token-level entropy.
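One plausible reading of HICRA in code: reweight per-token advantages so planning tokens receive extra credit. The planning-token mask, the boost factor alpha, and the multiplicative form below are my illustrative assumptions, not the paper's exact formula.

```python
# Sketch of hierarchy-aware credit assignment (illustrative, not the authors' implementation).
import torch

def hicra_advantages(advantages: torch.Tensor, planning_mask: torch.Tensor, alpha: float = 0.5):
    """advantages: (seq_len,) per-token advantages from a GRPO-style baseline.
    planning_mask: (seq_len,) True where a token is tagged as a planning token.
    Boosts credit on planning tokens; execution tokens keep the baseline advantage."""
    return advantages * (1.0 + alpha * planning_mask.float())

adv = torch.tensor([0.2, 0.2, 0.2, 0.2])
mask = torch.tensor([True, False, False, True])   # e.g. "let's first..." / "instead, try..." tokens
print(hicra_advantages(adv, mask))                # planning tokens get 0.3, execution tokens stay at 0.2
```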
Two-phase dynamic
Early RL training reduces perplexity and entropy on execution tokens, consolidating procedural skills.
Later gains align with increased diversity in planning tokens and longer, more accurate traces, explaining “aha moments” and length scaling.
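A simple diagnostic that matches this story (my framing, not the authors' code): track token-level entropy separately for execution and planning tokens over training, and watch the first drop early while the second stays diverse later.

```python
# Sketch: per-group token entropy as a lens on the two-phase dynamic (illustrative only).
import torch

def grouped_entropy(logits: torch.Tensor, planning_mask: torch.Tensor):
    """logits: (seq_len, vocab); planning_mask: (seq_len,) bool.
    Returns mean entropy over execution tokens and over planning tokens."""
    probs = logits.softmax(dim=-1)
    token_entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)   # (seq_len,)
    return token_entropy[~planning_mask].mean().item(), token_entropy[planning_mask].mean().item()

exec_h, plan_h = grouped_entropy(torch.randn(6, 100),
                                 torch.tensor([True, False, False, True, False, False]))
print(f"execution entropy {exec_h:.2f} vs planning entropy {plan_h:.2f}")
```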
I'm surprised Agentic RAG is not getting more attention.
That's all about to change.
Here's why:
Standard RAG systems can only do so much: a single retrieve-then-generate pass limits how much useful context makes it into the AI response.
Configuring the LLM to leverage tools via an agent produces responses that are not only better grounded but also hallucinate less across the board.
Tools supply the agentic RAG system with additional, relevant context exactly when it needs it.
Simple queries can be answered by a single vector-store retriever, while more complex queries are answered more precisely by multiple retriever components that are themselves subagents.
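A toy router makes the idea concrete. Everything here is illustrative: the retriever names, the complexity heuristic, and the subagent list are hypothetical stand-ins for whatever your stack actually uses.

```python
# Minimal sketch of an agentic RAG router (illustrative only).
# A simple query goes straight to the vector store; a complex one fans out to retriever subagents.

def vector_store_retrieve(query: str) -> list[str]:
    return [f"[doc matching '{query}']"]           # placeholder vector-store retriever

def subagent_retrieve(query: str, source: str) -> list[str]:
    return [f"[{source} evidence for '{query}']"]  # placeholder subagent (web, SQL, API, ...)

def is_complex(query: str) -> bool:
    # stand-in heuristic; in practice an LLM classifier or the agent itself decides
    return len(query.split()) > 12 or " and " in query

def answer(query: str) -> str:
    if is_complex(query):
        context: list[str] = []
        for source in ("web_search", "sql_warehouse", "internal_wiki"):
            context += subagent_retrieve(query, source)   # each retriever is its own subagent
    else:
        context = vector_store_retrieve(query)
    return f"LLM answer grounded in: {context}"

print(answer("What is our refund policy?"))
print(answer("Compare Q3 revenue by region and summarize the top customer complaints and churn drivers"))
```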
They train models to "think wider," exploring multiple ideas in parallel to produce better responses.
It's called native thought parallelism and proves superior to sequential reasoning.
Great read for AI devs!
Here are the technical details:
TL;DR
This paper proposes a new way to make LLMs smarter at problem solving.
Instead of making the model think in one long chain of reasoning (which often gets stuck in early mistakes), they train it to explore multiple independent ideas at the same time (via parallel reasoning paths) and then merge them into a final answer.
The problem
Current “think longer” tricks run into Tunnel Vision. Once a model takes a wrong step, it usually can’t recover, no matter how many extra tokens you give it.
Early tokens commit the model to a suboptimal path; majority-style parallel sampling can beat one long chain under the same token budget.
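A minimal sketch of the parallel-then-merge idea, with a plain majority vote standing in for the learned merge step. The paper trains the parallelism natively; sample_chain here is just a stub for drawing one independent reasoning path.

```python
# Sketch: sample several independent reasoning paths, then merge (here, a simple majority vote).
from collections import Counter
import random

def sample_chain(question: str, seed: int) -> str:
    """Placeholder: each call would sample an independent chain of thought and return its final answer."""
    random.seed(seed)
    return random.choice(["42", "42", "17"])      # stand-in for divergent parallel paths

def think_wider(question: str, n_paths: int = 8) -> str:
    answers = [sample_chain(question, seed) for seed in range(n_paths)]  # independent paths under one budget
    return Counter(answers).most_common(1)[0][0]  # merge step: majority vote over the paths

print(think_wider("toy question"))
```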