Building with AI agents @dair_ai • Prev: Meta AI, Galactica LLM, Elastic, PaperswithCode, PhD • I share insights on how to build with AI Agents ↓
Sep 13 • 7 tweets • 3 min read
RL done right is no joke!
The most interesting AI paper I read this week.
It trains a minimal single-agent model that reaches top performance on deep research.
Great example of simple RL-optimized single agents beating complex multi-agent scaffolds.
Now let's break it down:
One agent, minimal tools
The agent only gets search, static browsing (no link clicking), and Python. This makes training hard enough that the model has to learn strategy, not just rely on shortcuts.
Instead of relying on complex multi-agent setups, they train one model end-to-end with RL on synthetic tasks.
Sep 9 • 7 tweets • 3 min read
Another impressive paper by Meta.
It's a plug-in decoding strategy for RAG systems that slashes latency and memory use.
REFRAG achieves up to 30.85× TTFT (time-to-first-token) acceleration.
Let's break down the technical details:
TL;DR
REFRAG replaces most retrieved tokens with precomputed chunk embeddings at decode time, then selectively expands only the few chunks that matter.
This exploits block-diagonal attention in RAG prompts to cut latency and memory while preserving accuracy across RAG, multi-turn dialog, and long-doc summarization.
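The core decode-time move can be sketched in a few lines. This is a toy, assuming stand-in callables for the paper's chunk encoder (`embed`) and its RL-trained selection policy (`scorer`); the real system operates on KV-cache-level representations, not token lists:

```python
import numpy as np

def compress_context(chunks, embed, expand_budget, scorer):
    """REFRAG-style compression (toy sketch): replace each retrieved
    chunk with one precomputed embedding, then expand only the
    top-scoring chunks back into full tokens."""
    embeddings = [embed(c) for c in chunks]          # one vector per chunk
    scores = [scorer(e) for e in embeddings]         # relevance per chunk
    keep = set(np.argsort(scores)[-expand_budget:])  # few chunks to expand
    context = []
    for i, chunk in enumerate(chunks):
        if i in keep:
            context.extend(chunk.split())      # full tokens where it matters
        else:
            context.append(embeddings[i])      # single embedding otherwise
    return context
```

With four 3–4 token chunks and a budget of one expansion, the context shrinks from 13 entries to 7, which is where the latency and memory savings come from.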
Sep 9 • 7 tweets • 3 min read
Emergent Hierarchical Reasoning in LLMs
The paper argues that RL improves LLM reasoning via an emergent two-phase hierarchy.
First, the model firms up low-level execution, then progress hinges on exploring high-level planning.
More on this interesting analysis:
The authors propose HIerarchy-Aware Credit Assignment (HICRA), which boosts credit on strategic “planning tokens,” and show consistent gains over GRPO.
They also propose semantic entropy as a better exploration signal than token-level entropy.
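The credit-assignment idea reduces to reweighting per-token advantages. A minimal sketch, where `planning_vocab` and the boost factor `alpha` are illustrative stand-ins for the paper's planning-token identification and hyperparameters:

```python
def hicra_advantages(tokens, base_adv, planning_vocab, alpha=1.5):
    """Hierarchy-aware credit assignment (toy sketch): amplify the
    GRPO-style advantage on tokens tagged as high-level planning
    moves, leaving execution tokens untouched."""
    return [adv * alpha if tok in planning_vocab else adv
            for tok, adv in zip(tokens, base_adv)]
```

Strategic connectives ("first", "then", "alternatively") get extra credit, steering exploration toward high-level planning rather than low-level execution.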
Sep 8 • 9 tweets • 3 min read
I'm surprised Agentic RAG is not getting more attention.
That's all about to change.
Here's why:
Standard RAG systems can only do so much: a single retrieve-then-generate pass limits how much value you can pack into the AI response.
Configuring LLMs to leverage tools via an agent allows you to prepare responses that not only ground answers better but also reduce hallucinations across the board.
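The difference from standard RAG is the loop. A minimal agentic-RAG sketch, where `llm` is an assumed callable that returns an (action, argument) pair and `tools` is a dict of retrieval tools:

```python
def agentic_rag(question, tools, llm, max_steps=4):
    """Toy agent loop: the model repeatedly picks a tool, observes
    its grounded output, and answers once evidence suffices, instead
    of a single retrieve-then-generate pass."""
    scratchpad = []
    for _ in range(max_steps):
        action, arg = llm(question, scratchpad)
        if action == "answer":
            return arg
        scratchpad.append((action, tools[action](arg)))  # ground on tool output
    return None
```

Because every answer is assembled from observed tool outputs in the scratchpad, grounding improves and hallucinations drop.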
Sep 8 • 7 tweets • 3 min read
Another banger paper on reasoning LLMs!
They train models to "think wider" to explore multiple ideas that produce better responses.
They call it native thought parallelism, and it outperforms sequential reasoning.
Great read for AI devs!
Here are the technical details:
TL;DR
This paper proposes a new way to make LLMs smarter at problem solving.
Instead of making the model think in one long chain of reasoning (which often gets stuck in early mistakes), they train it to explore multiple independent ideas at the same time (via parallel reasoning paths) and then merge them into a final answer.
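The control flow can be sketched like this. A toy version, assuming a path-sampling callable and using majority vote as a stand-in for the paper's learned merge step:

```python
from collections import Counter

def parallel_think(question, sample_path, n_paths=5):
    """Parallel reasoning (sketch): draw several independent reasoning
    paths for the same question, then merge them into one answer.
    Majority vote stands in for the trained merge model."""
    answers = [sample_path(question) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]
```

One bad early step only poisons a single path instead of the whole chain, which is the failure mode sequential reasoning suffers from.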
Sep 7 • 8 tweets • 3 min read
Another impressive paper by Google DeepMind.
It takes a closer look at the limits of embedding-based retrieval.
If you work with vector embeddings, bookmark this one.
Let's break down the technical details:
Quick Overview
This paper looks at how search engines that rely on vector embeddings have built-in limits.
Even if you train them perfectly, they just can’t handle every possible search query once the combinations of relevant documents get too complex.
The authors prove this with math, then confirm it with experiments on a simple but tricky dataset they call LIMIT.
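A one-dimensional toy makes the geometric limit concrete (this is my illustration of the paper's point, not its construction): with scores linear in the embedding, some documents can never be retrieved as top-1 no matter what the query is.

```python
import numpy as np

def top1(query, docs):
    """Index of the document with the highest dot-product score."""
    return int(np.argmax(docs @ query))

# Three documents with 1-D embeddings. Scores are linear in the
# embedding, so only the two extremes can ever win top-1; the middle
# document is unretrievable for every query vector.
docs = np.array([[-1.0], [0.0], [1.0]])
winners = {top1(np.array([q]), docs) for q in (-3.0, -0.5, 0.5, 3.0)}
```

Higher dimensions buy more realizable combinations, but the paper shows the count stays bounded while the space of possible relevance sets explodes.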
Sep 6 • 7 tweets • 3 min read
Everyone is talking about this new OpenAI paper.
It's about why LLMs hallucinate.
You might want to bookmark this one.
Let's break down the technical details:
Quick Overview
The paper argues that hallucinations are not mysterious glitches but the predictable result of how LLMs are trained and evaluated.
Pretraining creates statistical pressure to make errors, and post-training benchmarks often reward confident guessing over honest uncertainty.
The fix is to realign mainstream evaluations to stop penalizing abstentions.
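The incentive argument is simple expected-value arithmetic. A sketch (my formalization of the paper's point): under +1 for correct, −penalty for wrong, 0 for abstaining, a calibrated model should guess only when that beats abstaining.

```python
def should_guess(confidence, wrong_penalty):
    """Expected-score comparison: guess iff
    confidence*1 + (1-confidence)*(-wrong_penalty) > 0 (abstain)."""
    return confidence - (1 - confidence) * wrong_penalty > 0
```

With plain accuracy (penalty 0), guessing dominates abstaining at any nonzero confidence, which is exactly the pressure toward confident hallucination the paper describes; a penalty restores honest uncertainty.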
Sep 6 • 8 tweets • 3 min read
Universal Deep Research
NVIDIA recently published another banger tech report!
The idea is simple: allow users to build their own custom, model-agnostic deep research agents with little effort.
Here is what you need to know:
Overview
Universal Deep Research (UDR) proposes a general, model-agnostic deep-research agent that lets users bring their own model and strategy.
Instead of a fixed pipeline, UDR compiles natural-language research strategies into executable code, runs them in a sandbox, and emits structured progress notifications before returning a final report.
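The execute-a-compiled-strategy step might look like this. A toy sketch: the strategy is assumed to have already been compiled by an LLM into Python that calls only whitelisted tools, and real sandboxing is omitted:

```python
def run_strategy(strategy_code, tools):
    """UDR-style execution (sketch): run compiled strategy code in a
    namespace exposing only the tool registry and a progress-notification
    hook, then collect the final report."""
    events = []
    namespace = {"tools": tools, "notify": events.append}
    exec(strategy_code, namespace)   # real UDR isolates this in a sandbox
    report = namespace["run"]()      # compiled strategies define run()
    return events, report
```

Because the strategy is plain code over a tool registry, users can swap both the model and the research strategy without touching the pipeline.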
Sep 5 • 7 tweets • 3 min read
Cool research from Microsoft!
They release rStar2-Agent, a 14B math reasoning model trained with agentic RL.
It reaches frontier-level math reasoning in just 510 RL training steps.
Here are my notes:
Quick Overview
rStar2-Agent (Microsoft Research). A 14B math-reasoning model trained with agentic RL that learns to think smarter by using a Python tool environment, not just longer CoT.
It introduces GRPO-RoC, a rollout strategy that filters noisy successful traces, plus infrastructure for massive, low-latency tool execution.
Aug 31 • 10 tweets • 4 min read
Overview of Self-Evolving Agents
There is a huge interest in moving from hand-crafted agentic systems to lifelong, adaptive agentic ecosystems.
What's the progress, and where are things headed?
Let's find out:
This survey defines self-evolving AI agents and argues for a shift from static, hand-crafted systems to lifelong, adaptive agentic ecosystems.
It maps the field’s trajectory, proposes “Three Laws” to keep evolution safe and useful, and organizes techniques across single-agent, multi-agent, and domain-specific settings.
Aug 28 • 7 tweets • 3 min read
Memory-R1
Another really cool paper showing how RL can enhance an LLM's agentic and memory capabilities.
Great read for AI devs.
Here are my notes:
Overview
A framework that teaches LLM agents to decide what to remember and how to use it.
Two RL-fine-tuned components work together: a Memory Manager that learns CRUD-style operations on an external store and an Answer Agent that filters retrieved memories via “memory distillation” before answering.
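The Memory Manager's action space is small. A toy sketch, modeling the external store as a plain dict (the paper's RL policy chooses which operation to apply per new piece of information):

```python
def apply_memory_op(store, op, key=None, value=None):
    """Toy memory manager: CRUD-style operations on an external store.
    The RL-trained policy picks the op; here they are dict edits."""
    if op == "ADD":
        store[key] = value
    elif op == "UPDATE":
        store[key] = store.get(key, "") + "; " + value
    elif op == "DELETE":
        store.pop(key, None)
    # "NOOP" leaves the store unchanged
    return store
```

What RL adds is learning *when* each op pays off for downstream answers, rather than hand-coding heuristics for memory hygiene.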
Aug 27 • 8 tweets • 3 min read
Don't sleep on small models!
Anemoi is the latest multi-agent system that proves small models pack a punch when combined effectively.
GPT-4.1-mini (for planning) and GPT-4o (for worker agents) surpass the strongest open-source baseline on GAIA.
A must-read for devs:
Quick Overview
Anemoi is a semi-centralized generalist multi-agent system powered by an A2A communication MCP server from @Coral_Protocol.
Anemoi replaces purely centralized, context-stuffed coordination with an A2A communication server (MCP) that lets agents talk directly, monitor progress, refine plans, and reach consensus.
Aug 27 • 7 tweets • 3 min read
Efficient Language Model with PostNAS
NVIDIA's recent research on LLMs has been fantastic.
Jet-Nemotron is the latest in efficient language models, which significantly improves generation throughput.
Here are my notes:
A hybrid-architecture LM family built by “adapting after pretraining.”
Starting from a frozen full-attention model, the authors search where to keep full attention, which linear-attention block to use, and which hyperparameters match hardware limits.
The result, Jet-Nemotron-2B/4B, matches or surpasses popular full-attention baselines while massively increasing throughput on long contexts.
Aug 25 • 6 tweets • 3 min read
Fine-tuning LLM Agents without Fine-tuning LLMs
Catchy title and very cool memory technique to improve deep research agents.
Great for continuous, real-time learning without gradient updates.
Here are my notes:
Overview
Proposes a memory‑based learning framework that lets deep‑research agents adapt online without updating model weights.
The agent is cast as a memory‑augmented MDP with case‑based reasoning, implemented in a planner–executor loop over MCP tools.
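The case-based-reasoning step slots in before planning. A minimal sketch with an assumed word-overlap similarity; the paper's case bank and retrieval are richer:

```python
def jaccard(a, b):
    """Toy similarity: word-set overlap between two task descriptions."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def retrieve_case(case_bank, task, similarity=jaccard):
    """Case-based step of the memory-augmented MDP (sketch): before
    planning, fetch the most similar past (task, outcome) case from a
    growing non-parametric memory — no gradient updates needed."""
    if not case_bank:
        return None
    return max(case_bank, key=lambda case: similarity(case["task"], task))
```

Because learning lives in the case bank rather than the weights, the agent adapts online from every episode it runs.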
Aug 20 • 9 tweets • 4 min read
Chain-of-Agents
Interesting idea to train a single model with the capabilities of a multi-agent system.
84.6% reduction in inference cost!
Distillation and Agentic RL are no joke!
Here are my notes:
Overview
This work proposes training single models to natively behave like multi‑agent systems, coordinating “role‑playing” and tool agents end‑to‑end.
They distill strong multi‑agent frameworks into CoA trajectories, then optimize with agentic RL on verifiable tasks.
Aug 19 • 8 tweets • 3 min read
Has GPT-5 Achieved Spatial Intelligence?
GPT-5 sets SoTA but not human‑level spatial intelligence.
My notes below:
This report introduces a unified view of spatial intelligence (SI) for multimodal models and evaluates GPT‑5 and strong baselines across eight fresh SI benchmarks.
GPT‑5 leads overall but is still short of human skill, especially on mentally reconstructing shapes, changing viewpoints, and deformation/assembly tasks.
Aug 18 • 8 tweets • 3 min read
Retrieval-Augmented Reasoning with Lean Language Models
Great paper showing how to fuse RAG and reasoning into a single small-footprint language model.
Distillation works if done correctly.
Very exciting results!
Here are my notes:
Overview
The work proposes a domain-tuned pipeline that fuses RAG and reasoning into a single small-footprint model.
The team distills reasoning traces from a frontier model into Qwen2.5 variants, uses summarization to keep context small, and shows that a 32B local model approaches frontier accuracy on an NHS A‑to‑Z clinical QA task.
Aug 16 • 8 tweets • 3 min read
M3-Agent: A Multimodal Agent with Long-Term Memory
Impressive application of multimodal agents.
Lots of great insights throughout the paper.
Here are my notes with key insights:
M3 Agent
Introduces a framework for agents that watch and listen to long videos, build entity-centric memories, and use multi-turn reasoning to answer questions.
Aug 15 • 8 tweets • 4 min read
AI Agents are terrible at long-horizon tasks.
Even the new GPT-5 model struggles with long-horizon tasks.
This is one of the most pressing challenges when building AI agents.
Pay attention, AI devs!
This is a neat paper that went largely unnoticed.
Here are my notes:
What's new?
The work presents a new benchmark and data‑generation pipeline to test agents on realistic, multi‑day office tasks across Word, Excel, PDF, Email, and Calendar.
OdysseyBench targets long‑horizon, context‑dependent workflows instead of atomic tasks.
Two splits: OdysseyBench+ (300 tasks distilled from real OfficeBench cases) and OdysseyBench‑Neo (302 newly synthesized, more complex tasks).
Tasks require retrieving key facts from multi‑day dialogues and coordinating actions across apps.
Aug 13 • 8 tweets • 3 min read
The Illusion of Progress
It's well known that there are caveats with benchmarks and metrics that measure LLM capabilities.
It's no different for hallucination detection.
"ROUGE fails to reliably capture true hallucination"
Here are my notes:
Overview
The paper argues that common QA hallucination detectors look better than they are because evaluations lean on ROUGE.
In human‑aligned tests, many detectors drop sharply. Simple response‑length heuristics rival complex methods, revealing a core evaluation flaw.
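The embarrassing baseline is literally this (my rendering of the paper's finding; the decision direction and threshold are tuned per dataset):

```python
def length_score(response):
    """Trivial hallucination 'detector': score a response by its token
    count alone. That this rivals dedicated detectors under
    human-aligned evaluation is the illusion of progress."""
    return len(response.split())
```

If a one-line heuristic matches complex detection methods once ROUGE is swapped for human-aligned judgments, the reported progress was mostly measuring the metric, not the phenomenon.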
Aug 12 • 8 tweets • 3 min read
Unlocking Long-Horizon Agentic Search
AI agents still struggle with long-horizon tasks.
This paper sheds light on how to improve long-horizon agentic search with RL.
Here are my notes:
Overview
It introduces ASearcher, an open-source framework for training LLM-based search agents capable of long-horizon, expert-level search.
Addresses 2 major limitations in prior open-source approaches: short turn limits (≤10) and lack of large-scale, high-quality QA data.