It's a plug-in decoding strategy for RAG systems that slashes latency and memory use.
REFRAG achieves up to 30.85× time-to-first-token (TTFT) acceleration.
Let's break down the technical details:
TL;DR
REFRAG replaces most retrieved tokens with precomputed chunk embeddings at decode time, then selectively expands only the few chunks that matter.
This exploits block-diagonal attention in RAG prompts to cut latency and memory while preserving accuracy across RAG, multi-turn dialog, and long-doc summarization.
Core idea
Chunk the retrieved context, encode each chunk with a lightweight encoder, project to the decoder’s embedding size, and feed embeddings directly alongside the user query.
A lightweight RL policy decides which chunks should stay compressed and which need to be expanded back into full text. Think of it as zooming in only where necessary.
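The pipeline above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the mean-pooling "encoder", the random projection, the dimensions, and the score-based selection standing in for the RL policy are all placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

CHUNK_SIZE = 16              # tokens per chunk (illustrative)
ENC_DIM, DEC_DIM = 64, 128   # toy dims, not the paper's

def chunk(tokens, size=CHUNK_SIZE):
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def encode_chunk(tok_embs):
    # stand-in for the lightweight chunk encoder: mean-pool token embeddings
    return tok_embs.mean(axis=0)

# projection from encoder width to the decoder's embedding width
W_proj = rng.normal(size=(ENC_DIM, DEC_DIM)) / np.sqrt(ENC_DIM)

def compress_context(token_embs):
    """One embedding per chunk, projected to the decoder's width."""
    return np.stack([encode_chunk(np.stack(c)) @ W_proj
                     for c in chunk(token_embs)])

def select_to_expand(scores, budget=1):
    # stand-in for the RL policy: expand only the highest-scoring chunks
    return np.argsort(scores)[::-1][:budget]

# 5 retrieved chunks of 16 "tokens", each token an ENC_DIM vector
tokens = [rng.normal(size=ENC_DIM) for _ in range(5 * CHUNK_SIZE)]
compressed = compress_context(tokens)
print(compressed.shape)  # (5, 128): 80 token positions become 5 decoder inputs
```

The decoder then sees 5 embedding positions instead of 80 token positions, plus the full text of whichever chunks the policy chose to expand.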
Why it works under the hood
Attention maps show that retrieved passages rarely interact with each other (block-diagonal pattern).
So REFRAG avoids wasting attention across irrelevant text, only paying full price for chunks that matter.
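That block-diagonal structure can be made concrete with a mask: passage tokens attend only within their own chunk, while the query rows attend to everything. A sketch of the structure, not REFRAG's actual attention kernel; chunk lengths are arbitrary.

```python
import numpy as np

def block_diagonal_mask(chunk_lens, query_len):
    """True where attention is allowed: within-chunk blocks + full query rows."""
    n = sum(chunk_lens) + query_len
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for length in chunk_lens:
        mask[start:start + length, start:start + length] = True  # chunk block
        start += length
    mask[start:, :] = True  # query attends to all chunks and to itself
    return mask

mask = block_diagonal_mask([4, 4, 4], query_len=2)
print(f"{mask.sum()}/{mask.size} attention entries needed")  # 76/196
```

Even in this tiny example, over half the dense attention matrix is wasted on cross-passage pairs that contribute nothing; the gap widens quadratically as more passages are retrieved.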
Speedups without dumbing down
Benchmarks show up to 30× faster time-to-first-token and 6–7× higher throughput versus vanilla LLaMA.
Even compared to strong baselines like CEPE, REFRAG is still 3–4× faster, with equal or better accuracy.
Longer memory for free
By compressing most chunks, REFRAG effectively extends the context window to fit up to 16× more tokens, letting the model juggle far more retrieved passages without blowing the latency budget.
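The 16× figure is just the compression rate at work. Back-of-the-envelope arithmetic, with an assumed decoder window and query length (the specific numbers are illustrative, not from the paper):

```python
k = 16                   # compression rate: k raw tokens -> 1 embedding position
decoder_window = 4096    # positions the decoder can hold (assumption)
query_tokens = 512       # kept as raw tokens (assumption)

context_positions = decoder_window - query_tokens
effective_context = context_positions * k + query_tokens
print(effective_context)  # 57856 raw context tokens fit in 4096 positions
```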
Better use of retrieval budget
With the same latency, REFRAG can process more passages than a baseline model and outperform it across 16 RAG tasks, especially when the retriever is weak (messy or noisy results).
Beyond RAG, it boosts multi-turn dialog (keeping more history without truncation) and long-doc summarization (higher ROUGE at fixed compute).
The spec-init slash command prompt, if you want to try it:
"Your task is to first help me build a spec for my new project in ARGUMENT.
Use the AskUserQuestion Tool to help build the spec in ARGUMENT by interviewing me and gathering requirements and details about the project implementation, UI & UX, tech stack, concerns, tradeoffs, etc.
Make sure questions are not obvious and probe deeper into the underlying needs and constraints.
Interview me continually and systematically until the spec is complete. Document all responses and insights to create a comprehensive and well-structured specification that serves as the foundation for the project."
Just built a new skill in Claude Code using Opus 4.5.
The skill uses Gemini 3 Pro (via API) for designing web pages.
Look at what it generated from one simple prompt.
If you have been designing websites with Claude Code, you already know how generic they turn out.
So I built a skill that uses Gemini 3 Pro to lead creative direction and generate designs. It is extremely good at this.
Opus 4.5 then integrates all that into our app.
The prompt I used: "I want to design the landing page for a new AI game. We want it to be futuristic and all that, and use animations as much as possible."
I will test with some other prompts and see how far I can push this. But the results are very exciting already.
This is one of the most insane things Nano Banana Pro 🍌 can do.
It can reproduce figures with mind-blowing precision.
No competition in this regard!
Prompt: "Please reproduce this chart in high quality and fidelity and offer annotated labels to better understand it."
When I tried this for the first time, I didn't expect that this was possible.
The level of understanding this requires is what's remarkable about it all.
The levels of personalization this unlocks are also impressive.
"Can you convert it into a cartoonish version?"
Just look at this 🤯
"Can you create a delightful cartoonish version of this table. And please put cute colors and icons along with interesting annotations to make it more readable."