Building with AI agents @dair_ai • Prev: Meta AI, Galactica LLM, Elastic, PaperswithCode, PhD • I share insights on how to build with AI Agents ↓
Aug 31 • 10 tweets • 4 min read
Overview of Self-Evolving Agents
There is a huge interest in moving from hand-crafted agentic systems to lifelong, adaptive agentic ecosystems.
What's the progress, and where are things headed?
Let's find out:
This survey defines self-evolving AI agents and argues for a shift from static, hand-crafted systems to lifelong, adaptive agentic ecosystems.
It maps the field’s trajectory, proposes “Three Laws” to keep evolution safe and useful, and organizes techniques across single-agent, multi-agent, and domain-specific settings.
Aug 28 • 7 tweets • 3 min read
Memory-R1
Another really cool paper showing how RL can enhance an LLM's agentic and memory capabilities.
Great read for AI devs.
Here are my notes:
Overview
A framework that teaches LLM agents to decide what to remember and how to use it.
Two RL-fine-tuned components work together: a Memory Manager that learns CRUD-style operations on an external store and an Answer Agent that filters retrieved memories via “memory distillation” before answering.
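To make the two components concrete, here's a toy sketch (my own illustration, not the paper's code; the simple heuristics below stand in for the RL-tuned LLM policies):

```python
# Toy sketch of Memory-R1's two components: a Memory Manager choosing
# CRUD-style operations on an external store, and an Answer Agent that
# "distills" retrieved memories before answering. Names and heuristics
# are illustrative stand-ins for the RL-tuned models.

class MemoryStore:
    def __init__(self):
        self.entries = {}          # id -> memory text
        self._next_id = 0

    def add(self, text):
        self.entries[self._next_id] = text
        self._next_id += 1

    def update(self, entry_id, text):
        self.entries[entry_id] = text

    def delete(self, entry_id):
        self.entries.pop(entry_id, None)

    def retrieve(self, query, k=3):
        # toy relevance: count words shared with the query
        def score(text):
            return len(set(query.lower().split()) & set(text.lower().split()))
        ranked = sorted(self.entries.items(), key=lambda kv: score(kv[1]), reverse=True)
        return [text for _, text in ranked[:k]]

def memory_manager(store, new_fact):
    """Stand-in policy: UPDATE if an existing memory shares the fact's
    subject, else ADD. In the paper, an RL-tuned LLM picks the operation
    (ADD / UPDATE / DELETE / NOOP)."""
    subject = new_fact.split()[0]
    for entry_id, text in store.entries.items():
        if text.startswith(subject):
            store.update(entry_id, new_fact)   # supersede the stale memory
            return "UPDATE"
    store.add(new_fact)
    return "ADD"

def answer_agent(store, question):
    """Retrieve broadly, then distill to memories that overlap the question."""
    candidates = store.retrieve(question, k=5)
    return [m for m in candidates
            if set(question.lower().split()) & set(m.lower().split())]

store = MemoryStore()
memory_manager(store, "Alice lives in Paris")
op = memory_manager(store, "Alice lives in Berlin")   # supersedes the old fact
print(op)                                             # UPDATE
print(answer_agent(store, "Where does Alice live?"))
```

The key idea is that both decisions (what to write, what to keep after retrieval) are learned with RL rather than hard-coded like they are here.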
Aug 27 • 8 tweets • 3 min read
Don't sleep on small models!
Anemoi is the latest multi-agent system that proves small models pack a punch when combined effectively.
GPT-4.1-mini (for planning) and GPT-4o (for worker agents) surpass the strongest open-source baseline on GAIA.
A must-read for devs:
Quick Overview
Anemoi is a semi-centralized generalist multi-agent system powered by an A2A communication MCP server from @Coral_Protocol.
Anemoi replaces purely centralized, context-stuffed coordination with an A2A communication server (MCP) that lets agents talk directly, monitor progress, refine plans, and reach consensus.
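Here's a rough sketch of that semi-centralized pattern (all names invented; this is my toy illustration of the idea, not Anemoi's actual protocol): agents post to a shared bus instead of routing everything through the planner's context.

```python
# Toy sketch of semi-centralized coordination: agents exchange messages
# on a shared A2A bus, give feedback directly, and reach consensus,
# instead of one planner relaying and context-stuffing every result.

from collections import defaultdict

class A2ABus:
    def __init__(self):
        self.threads = defaultdict(list)   # topic -> list of (sender, msg)

    def post(self, topic, sender, msg):
        self.threads[topic].append((sender, msg))

    def read(self, topic):
        return self.threads[topic]

def consensus(bus, topic):
    """Toy consensus: the plan is accepted once every voting agent's
    latest vote is 'approve'."""
    votes = {sender: msg for sender, msg in bus.read(topic)
             if msg in ("approve", "revise")}
    return bool(votes) and all(v == "approve" for v in votes.values())

bus = A2ABus()
bus.post("plan-1", "planner", "step1: search; step2: summarize")
bus.post("plan-1", "worker_a", "approve")
bus.post("plan-1", "worker_b", "revise")    # direct feedback, no planner relay
bus.post("plan-1", "worker_b", "approve")   # after the plan is refined
print(consensus(bus, "plan-1"))             # True
```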
Aug 27 • 7 tweets • 3 min read
Efficient Language Model with PostNAS
NVIDIA's recent research on LLMs has been fantastic.
Jet-Nemotron is their latest efficient language model, and it significantly improves generation throughput.
Here are my notes:
A hybrid-architecture LM family built by “adapting after pretraining.”
Starting from a frozen full-attention model, the authors search where to keep full attention, which linear-attention block to use, and which hyperparameters match hardware limits.
The result, Jet-Nemotron-2B/4B, matches or surpasses popular full-attention baselines while massively increasing throughput on long contexts.
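To give a feel for the search step, here's a hand-wavy toy version (all numbers and the proxy score are invented; the real PostNAS search is far more involved): keep the pretrained weights frozen and search over which layers retain full attention under a hardware budget.

```python
# Toy sketch of an "adapt after pretraining" layer-placement search:
# score each layout of full-attention vs. linear-attention layers by a
# made-up accuracy proxy minus a latency penalty, under a budget.

from itertools import combinations

N_LAYERS = 6
FULL_BUDGET = 2          # pretend hardware budget: at most 2 full-attention layers

def proxy_score(full_layers):
    """Invented proxy: full attention helps most in middle layers;
    each full-attention layer kept costs latency."""
    accuracy = sum(1.0 - abs(i - N_LAYERS / 2) / N_LAYERS for i in full_layers)
    latency_penalty = 0.3 * len(full_layers)
    return accuracy - latency_penalty

best = max(
    (set(c) for k in range(FULL_BUDGET + 1)
     for c in combinations(range(N_LAYERS), k)),
    key=proxy_score,
)
print(sorted(best))   # middle layers win under this toy proxy
```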
Aug 25 • 6 tweets • 3 min read
Fine-tuning LLM Agents without Fine-tuning LLMs
Catchy title and very cool memory technique to improve deep research agents.
Great for continuous, real-time learning without gradient updates.
Here are my notes:
Overview
Proposes a memory‑based learning framework that lets deep‑research agents adapt online without updating model weights.
The agent is cast as a memory‑augmented MDP with case‑based reasoning, implemented in a planner–executor loop over MCP tools.
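A minimal sketch of the case-based idea (my own toy code, with invented similarity and plans): past episodes are written to memory, and the planner conditions on retrieved cases instead of on gradient updates.

```python
# Toy sketch of case-based memory for a deep-research agent: store
# (task, plan, reward) cases, retrieve similar ones, and reuse the
# best plan. Learning lives in memory; model weights never change.

from dataclasses import dataclass, field

@dataclass
class Case:
    task: str
    plan: list
    reward: float

class CaseMemory:
    def __init__(self):
        self.cases = []

    def write(self, case):
        self.cases.append(case)

    def retrieve(self, task, k=2):
        # toy similarity: word overlap between task descriptions
        def sim(case):
            return len(set(task.lower().split()) & set(case.task.lower().split()))
        return sorted(self.cases, key=sim, reverse=True)[:k]

def planner(task, memory):
    """Condition the next plan on the highest-reward similar case."""
    similar = memory.retrieve(task)
    if similar:
        best = max(similar, key=lambda c: c.reward)
        return best.plan + ["verify sources"]   # reuse and adapt a past plan
    return ["search", "read", "summarize"]      # cold-start default

memory = CaseMemory()
memory.write(Case("survey LLM agents", ["search arxiv", "cluster papers"], 0.9))
memory.write(Case("debug CUDA kernel", ["read trace", "bisect"], 0.4))

plan = planner("survey RL agents", memory)
print(plan)   # reuses the high-reward "survey" case's plan
```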
Aug 20 • 9 tweets • 4 min read
Chain-of-Agents
Interesting idea to train a single model with the capabilities of a multi-agent system.
84.6% reduction in inference cost!
Distillation and Agentic RL are no joke!
Here are my notes:
Overview
This work proposes training single models to natively behave like multi‑agent systems, coordinating “role‑playing” and tool agents end‑to‑end.
They distill strong multi‑agent frameworks into CoA trajectories, then optimize with agentic RL on verifiable tasks.
Aug 19 • 8 tweets • 3 min read
Has GPT-5 Achieved Spatial Intelligence?
GPT-5 sets a new SoTA but still falls short of human-level spatial intelligence.
My notes below:
This report introduces a unified view of spatial intelligence (SI) for multimodal models and evaluates GPT‑5 and strong baselines across eight fresh SI benchmarks.
GPT‑5 leads overall but is still short of human skill, especially on mentally reconstructing shapes, changing viewpoints, and deformation/assembly tasks.
Aug 18 • 8 tweets • 3 min read
Retrieval-Augmented Reasoning with Lean Language Models
Great paper showing how to fuse RAG and reasoning into a single small-footprint language model.
Distillation works if done correctly.
Very exciting results!
Here are my notes:
Overview
The work proposes a domain-tuned pipeline that fuses RAG and reasoning into a single small-footprint model.
The team distills reasoning traces from a frontier model into Qwen2.5 variants, uses summarization to keep context small, and shows that a 32B local model approaches frontier accuracy on an NHS A‑to‑Z clinical QA task.
Aug 16 • 8 tweets • 3 min read
M3-Agent: A Multimodal Agent with Long-Term Memory
Impressive application of multimodal agents.
Lots of great insights throughout the paper.
Here are my notes with key insights:
M3 Agent
Introduces a framework for agents that watch and listen to long videos, build entity-centric memories, and use multi-turn reasoning to answer questions.
Aug 15 • 8 tweets • 4 min read
AI Agents are terrible at long-horizon tasks.
Even the new GPT-5 model struggles with long-horizon tasks.
This is one of the most pressing challenges when building AI agents.
Pay attention, AI devs!
This is a neat paper that went largely unnoticed.
Here are my notes:
What's new?
The work presents a new benchmark and data‑generation pipeline to test agents on realistic, multi‑day office tasks across Word, Excel, PDF, Email, and Calendar.
OdysseyBench targets long‑horizon, context‑dependent workflows instead of atomic tasks.
Two splits: OdysseyBench+ (300 tasks distilled from real OfficeBench cases) and OdysseyBench‑Neo (302 newly synthesized, more complex tasks).
Tasks require retrieving key facts from multi‑day dialogues and coordinating actions across apps.
Aug 13 • 8 tweets • 3 min read
The Illusion of Progress
It's well known that there are caveats with benchmarks and metrics that measure LLM capabilities.
It's no different for hallucination detection.
"ROUGE fails to reliably capture true hallucination"
Here are my notes:
Overview
The paper argues that common QA hallucination detectors look better than they are because evaluations lean on ROUGE.
In human‑aligned tests, many detectors drop sharply. Simple response‑length heuristics rival complex methods, revealing a core evaluation flaw.
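To see how trivial a competitive baseline can be, here's a toy length heuristic (threshold and scoring are made up for the sketch; this is my illustration of the paper's point, not its method):

```python
# Toy illustration: a response-length heuristic as a hallucination score.
# Longer answers tend to drift further from the evidence, so length alone
# can rival complex detectors when the evaluation itself is miscalibrated.

def length_detector(response, threshold=12):
    """Flag long answers as likely hallucinations; returns a score in [0, 1]."""
    n_words = len(response.split())
    return min(n_words / threshold, 1.0)

short = "Paris."
long_ = ("The capital is Paris, which was founded by the Romans in 52 BC and "
         "has a population of exactly 2,148,271 people as of this morning.")

print(length_detector(short))  # low score -> treated as grounded
print(length_detector(long_))  # saturates at 1.0 -> flagged
```

If a heuristic this crude matches sophisticated detectors under the standard metrics, the metrics are the problem, which is exactly the paper's argument against ROUGE-based evaluation.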
Aug 12 • 8 tweets • 3 min read
Unlocking Long-Horizon Agentic Search
AI agents still struggle with long-horizon tasks.
This paper sheds light on how to improve long-horizon agentic search with RL.
Here are my notes:
Overview
It introduces ASearcher, an open-source framework for training LLM-based search agents capable of long-horizon, expert-level search.
It addresses two major limitations of prior open-source approaches: short turn limits (≤10 turns) and the lack of large-scale, high-quality QA data.
Aug 11 • 4 tweets • 2 min read
Getting huge productivity boosts by combining Claude Code with Obsidian vaults.
Everything in Obsidian is .md, so this is like the most delicious context for LLMs.
Everything is in one place: notes, bookmarks, instructions, LLM context, AI outputs, and so on.
The part I like about Obsidian is that, finally, I feel like I own my notes.
I can access them everywhere.
Modify them when I want.
And leverage them with LLMs all the time.
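Because the vault is just Markdown files, assembling LLM context is plain file I/O. A minimal sketch (paths, budget, and the matching rule are placeholders; this is not how Claude Code itself reads a vault):

```python
# Toy context builder over an Obsidian-style vault: every note is a .md
# file, so gathering relevant context is just globbing and reading files.

from pathlib import Path

def build_context(vault_dir, query, max_chars=4000):
    """Concatenate Markdown notes that mention the query, newest first,
    until the character budget is spent."""
    notes = sorted(Path(vault_dir).rglob("*.md"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    chunks, used = [], 0
    for note in notes:
        text = note.read_text(encoding="utf-8")
        if query.lower() in text.lower():
            piece = f"## {note.stem}\n{text}\n"
            if used + len(piece) > max_chars:
                break
            chunks.append(piece)
            used += len(piece)
    return "".join(chunks)
```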
Aug 7 • 30 tweets • 9 min read
BREAKING: OpenAI introduces GPT-5
Here's everything you need to know:
Altman claims that with GPT-5, it is now like talking to an expert.
It can write entire programs from scratch. Software-on-demand is a defining characteristic.
PhD-level experts in your pocket.
Aug 3 • 15 tweets • 6 min read
The Agentic Web is upon us!
If you want to learn about the Agentic Web, look no further.
This new report is a banger!
It presents a detailed framework to understand and build the agentic web.
Here is everything you need to know:
Agentic Web
This paper introduces the concept of the Agentic Web, a transformative vision of the internet where autonomous AI agents, powered by LLMs, act on behalf of users to plan, coordinate, and execute tasks.
Aug 2 • 9 tweets • 3 min read
Hierarchical Reasoning Model
This is one of the most interesting ideas on reasoning I've read in the past couple of months.
It uses a recurrent architecture for impressive hierarchical reasoning.
Here are my notes:
The paper proposes a novel, brain-inspired architecture that replaces CoT prompting with a recurrent model designed for deep, latent computation.
Jul 30 • 7 tweets • 3 min read
Graph-R1
New RAG framework just dropped!
Combines agents, GraphRAG, and RL.
Here are my notes:
Introduces a novel RAG framework that moves beyond traditional one-shot or chunk-based retrieval by integrating graph-structured knowledge, agentic multi-turn interaction, and RL.
Jul 28 • 14 tweets • 5 min read
GLM-4.5 looks like a big deal!
> MoE Architecture
> Hybrid reasoning models
> 355B total (32B active)
> GQA + partial RoPE
> Multi-Token Prediction
> Muon Optimizer + QK-Norm
> 22T-token training corpus
> Slime RL Infrastructure
> Native tool use
Here's all you need to know:
Model Architecture & Pre-Training
GLM-4.5 is 355B total parameters (32B active); deeper model with narrower width; optimized for reasoning via more layers and 96 attention heads.
GLM-4.5-Air is 106B (12B active).
22T-token training corpus that combines 15T general data with 7T code/reasoning-focused data.
Grouped-Query Attention + partial RoPE to enhance long-context efficiency and accuracy in reasoning tasks.
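For intuition on the "partial" part of partial RoPE, here's a hedged sketch (dimensions and the rotary fraction are illustrative, not GLM-4.5's actual config): rotary position embeddings are applied to only a fraction of each head's dimensions, leaving the rest position-free.

```python
# Toy partial RoPE: rotate the first rotary_frac of a per-head query/key
# vector by position-dependent angles; pass the remainder through unchanged.

import math

def partial_rope(x, position, rotary_frac=0.5, base=10000.0):
    d = len(x)
    d_rot = int(d * rotary_frac)
    assert d_rot % 2 == 0, "rotated dims must pair up"
    out = list(x)
    for i in range(0, d_rot, 2):
        theta = position / (base ** (i / d_rot))
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

q = [1.0, 0.0, 1.0, 0.0]                # 4-dim toy head
rotated = partial_rope(q, position=3)   # only dims 0-1 get rotated
print(rotated[2:])                      # last half unchanged: [1.0, 0.0]
```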
Jul 27 • 6 tweets • 2 min read
Claude Code is more than a coding agent.
It's more like a super smart orchestrator agent.
Watch this evaluator-loop agent I just built using subagents and slash commands.
This is one of the fastest ways to build custom agentic workflows.
Claude Code is no joke!
I'm impressed to see how easy it is to control how the sub agents communicate with each other (i.e., chain, loop, hierarchical, critic, etc.).
Claude Code is good out of the box, but customization gives you a clear advantage.
Custom subagents + slash commands are how you get it.
Jul 19 • 8 tweets • 3 min read
Context Rot
Great title for a report, but even better insights about how increasing input tokens impact the performance of top LLMs.
Banger report from Chroma.
Here are my takeaways (relevant for AI devs):
Context Rot
The research evaluates how state-of-the-art LLMs perform as input context length increases, challenging the common assumption that longer contexts are uniformly handled.
Testing 18 top models (including GPT-4.1, Claude 4, Gemini 2.5, Qwen3), the authors show that model reliability degrades non-uniformly even on simple tasks as input grows, a phenomenon they term "context rot."
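A toy harness in the spirit of that methodology (my own sketch; `fake_model` is a stand-in for a real LLM API, and its decay curve is invented): hold the task fixed while padding the input with distractors, and measure accuracy per input length.

```python
# Toy context-rot harness: fixed needle-retrieval task, growing filler.
# Accuracy is measured at each input length; only the length changes.

import random

def make_input(needle, n_filler_sentences):
    filler = ["The sky was a pleasant shade of blue that day."] * n_filler_sentences
    pos = random.randrange(len(filler) + 1)
    return " ".join(filler[:pos] + [needle] + filler[pos:])

def fake_model(context, question):
    """Stand-in model whose recall degrades as the context grows."""
    recall_prob = max(0.0, 1.0 - len(context) / 50000)
    if "magic number is 7" in context and random.random() < recall_prob:
        return "7"
    return "unknown"

def accuracy_at_length(n_filler, trials=200):
    random.seed(0)
    hits = 0
    for _ in range(trials):
        ctx = make_input("The magic number is 7.", n_filler)
        hits += fake_model(ctx, "What is the magic number?") == "7"
    return hits / trials

print(accuracy_at_length(10))    # short context: near-perfect recall
print(accuracy_at_length(800))   # long context: recall has rotted
```

The report's finding is that the real curves are non-uniform across models and tasks, which is why a single "supported context length" number is misleading.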
Jul 18 • 12 tweets • 4 min read
A Survey of Context Engineering
160+ pages covering the most important research around context engineering for LLMs.
This is a must-read!
Here are my notes:
The paper provides a taxonomy of context engineering in LLMs categorized into foundational components, system implementations, evaluation methodologies, and future directions.