There is huge interest in moving from hand-crafted agentic systems to lifelong, adaptive agentic ecosystems.
What's the progress, and where are things headed?
Let's find out:
This survey defines self-evolving AI agents and argues for a shift from static, hand-crafted systems to lifelong, adaptive agentic ecosystems.
It maps the field’s trajectory, proposes “Three Laws” to keep evolution safe and useful, and organizes techniques across single-agent, multi-agent, and domain-specific settings.
Paradigm shift and guardrails
The paper frames four stages: Model Offline Pretraining → Model Online Adaptation → Multi-Agent Orchestration → Multi-Agent Self-Evolving.
It introduces three guiding laws for evolution: maintain safety, preserve or improve performance, and then autonomously optimize.
LLM-centric learning paradigms:
MOP (Model Offline Pretraining): Static pretraining on large corpora; no adaptation after deployment.
MOA (Model Online Adaptation): Post-deployment updates via fine-tuning, adapters, or RLHF.
MAO (Multi-Agent Orchestration): Multiple agents coordinate through message exchange or debate, without changing model weights.
MASE (Multi-Agent Self-Evolving): Agents interact with their environment, continually optimizing prompts, memory, tools, and workflows.
The Evolution Landscape of AI Agents
The paper presents a visual taxonomy of AI agent evolution and optimization techniques, categorized into three major directions:
single-agent optimization, multi-agent optimization, and domain-specific optimization.
Unified framework for evolution
A single iterative loop connects System Inputs, Agent System, Environment feedback, and Optimizer.
Optimizers search over prompts, tools, memory, model parameters, and even agent topologies using heuristics, search, or learning.
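To make the loop concrete, here's a minimal Python sketch of one evolution iteration; agent_system, environment, and optimizer are hypothetical stand-ins, not code from the survey.

```python
# A minimal sketch of the unified evolution loop (hypothetical names, not the survey's code):
# run the agent system on inputs, collect environment feedback, and let the optimizer
# propose changes to prompts, tools, memory, parameters, or topology.

def evolve(agent_system, environment, optimizer, tasks, iterations=10):
    for _ in range(iterations):
        feedback = []
        for task in tasks:                                        # System Inputs
            output = agent_system.run(task)                       # Agent System acts
            feedback.append(environment.evaluate(task, output))   # Environment feedback
        agent_system = optimizer.update(agent_system, feedback)   # Optimizer searches the design space
    return agent_system
```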
Single-agent optimization toolbox
Techniques are grouped into:
(i) LLM behavior (training for reasoning; test-time scaling with search and verification),
(ii) prompt optimization (edit, generate, text-gradient, evolutionary; a minimal sketch follows this list),
(iii) memory optimization (short-term compression and retrieval; long-term RAG, graphs, and control policies), and
(iv) tool use and tool creation.
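To ground item (ii), here's a hedged sketch of an evolutionary, text-gradient-style prompt optimizer; llm and score_on_dev_set are hypothetical callables, not from any surveyed paper.

```python
# Sketch of evolutionary, text-gradient-style prompt optimization (item ii).
# `llm` and `score_on_dev_set` are hypothetical callables.

def optimize_prompt(seed_prompt, llm, score_on_dev_set, generations=5, pop_size=4):
    population = [seed_prompt]
    for _ in range(generations):
        # "Text gradient": ask the LLM to critique and rewrite each surviving prompt.
        mutants = [
            llm(f"Rewrite this instruction so an agent follows it more reliably:\n{p}")
            for p in population
        ]
        # Select: keep the prompts that score best on a small held-out dev set.
        candidates = population + mutants
        candidates.sort(key=score_on_dev_set, reverse=True)
        population = candidates[:pop_size]
    return population[0]
```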
Agentic Self-Evolution methods
The authors present a comprehensive hierarchical categorization of agentic self-evolution methods, spanning single-agent, multi-agent, and domain-specific optimization.
Multi-agent workflows that self-improve
Beyond manual pipelines, the survey treats prompts, topologies, and backbones as searchable spaces.
It distinguishes code-level workflows and communication-graph topologies, covers unified optimization that jointly tunes prompts and structure, and describes backbone training for better cooperation.
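As a rough illustration of topology search (not the survey's code), here's a toy hill-climbing loop over communication graphs; evaluate is a hypothetical scorer for a multi-agent run.

```python
# Sketch of communication-topology search (hypothetical helpers).
# A topology is a set of directed edges between roles; we hill-climb on task success rate.
import itertools
import random

ROLES = ["planner", "researcher", "coder", "critic"]

def random_topology(p=0.5):
    return {edge for edge in itertools.permutations(ROLES, 2) if random.random() < p}

def search_topology(evaluate, steps=20):
    best = random_topology()
    best_score = evaluate(best)          # e.g., success rate of the multi-agent workflow
    for _ in range(steps):
        candidate = random_topology()    # a real optimizer would mutate `best` edge-by-edge
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

A unified optimizer would score prompt edits and edge edits under the same objective instead of treating them as separate searches.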
Evaluation, safety, and open problems
Benchmarks span tools, web navigation, GUI agents, collaboration, and specialized domains; LLM-as-judge and Agent-as-judge reduce evaluation cost while tracking process quality.
The paper stresses continuous, evolution-aware safety monitoring and highlights challenges such as stable reward modeling, efficiency-effectiveness trade-offs, and transfer of optimized prompts/topologies to new models or domains.
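For intuition, here's what a bare-bones LLM-as-judge call might look like; llm, the rubric, and the JSON schema are all illustrative assumptions, not a benchmark's actual grader.

```python
# Sketch of an LLM-as-judge call that scores the final answer and the process separately.
# `llm` is a hypothetical completion function; the rubric and JSON schema are illustrative.
import json

JUDGE_PROMPT = """You are grading an agent run.
Task: {task}
Agent trace (tool calls and intermediate steps): {trace}
Final answer: {answer}
Return JSON: {{"answer_score": 0-10, "process_score": 0-10, "rationale": "..."}}"""

def judge(llm, task, trace, answer):
    raw = llm(JUDGE_PROMPT.format(task=task, trace=trace, answer=answer))
    return json.loads(raw)  # in practice, validate or repair the JSON before trusting it
```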
Microsoft Research releases rStar2-Agent, a 14B math reasoning model trained with agentic RL.
It reaches frontier-level math reasoning in just 510 RL training steps.
Here are my notes:
Quick Overview
rStar2-Agent (Microsoft Research). A 14B math-reasoning model trained with agentic RL that learns to think smarter by using a Python tool environment, not just longer CoT.
It introduces GRPO-RoC, a rollout strategy that filters noisy successful traces, plus infrastructure for large-scale, low-latency tool execution.
Method
GRPO-RoC oversamples rollouts, then keeps only the cleanest correct ones while preserving diverse failures, reducing tool-call errors and formatting issues during training.
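Here's my paraphrase of that resample-on-correct filter as a toy sketch; the field names and selection heuristic are assumptions, not the released rStar2-Agent code.

```python
# Toy sketch of resample-on-correct: oversample rollouts, keep only the cleanest correct
# traces, and retain a diverse sample of failures so the group-relative advantages
# still see negative examples. Field names are assumptions.
import random

def filter_rollouts(rollouts, keep_correct=4, keep_incorrect=4):
    correct = [r for r in rollouts if r["reward"] > 0]
    incorrect = [r for r in rollouts if r["reward"] <= 0]

    # Prefer correct traces with few tool-call errors and formatting violations.
    correct.sort(key=lambda r: (r["tool_errors"], r["format_violations"]))
    kept = correct[:keep_correct]

    # Failures are sampled uniformly rather than filtered for "cleanliness".
    kept += random.sample(incorrect, min(keep_incorrect, len(incorrect)))
    return kept
```

The point is that the policy only imitates clean successes while still getting gradient signal from failures.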
Another really cool paper showing how RL can enhance an LLM's agentic and memory capabilities.
Great read for AI devs.
Here are my notes:
Overview
A framework that teaches LLM agents to decide what to remember and how to use it.
Two RL-fine-tuned components work together: a Memory Manager that learns CRUD-style operations on an external store and an Answer Agent that filters retrieved memories via “memory distillation” before answering.
Active memory control with RL
The Memory Manager selects ADD, UPDATE, DELETE, or NOOP after a RAG step and edits entries accordingly; training with PPO or GRPO uses downstream QA correctness as the reward, removing the need for per-edit labels.
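A rough sketch of what that step could look like; the interfaces are hypothetical, not the paper's code.

```python
# Sketch of the Memory Manager's action space and reward signal (hypothetical interfaces).
# The reward is downstream QA correctness, so no per-edit labels are needed.

MEMORY_OPS = ("ADD", "UPDATE", "DELETE", "NOOP")

def parse_decision(text):
    # Naive parse of "OP: entry text" (illustrative only).
    op, _, entry = text.partition(":")
    return op.strip().upper(), entry.strip()

def memory_manager_step(policy_llm, store, new_turn, retrieved):
    decision = policy_llm(
        f"New information: {new_turn}\nRetrieved memories: {retrieved}\n"
        f"Reply with one of {MEMORY_OPS} followed by ': <entry text>'."
    )
    op, entry = parse_decision(decision)
    if op == "ADD":
        store.add(entry)
    elif op == "UPDATE":
        store.update(entry)
    elif op == "DELETE":
        store.delete(entry)
    return op                                   # NOOP leaves the store untouched

def reward(answer_agent, store, question, gold):
    # The only training signal for PPO/GRPO: did the Answer Agent get the question right?
    prediction = answer_agent(question, store.retrieve(question))
    return 1.0 if prediction.strip() == gold.strip() else 0.0
```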
Anemoi is the latest multi-agent system that proves small models pack a punch when combined effectively.
With GPT-4.1-mini for planning and GPT-4o for worker agents, it surpasses the strongest open-source baseline on GAIA.
A must-read for devs:
Quick Overview
Anemoi is a semi-centralized generalist multi-agent system powered by an A2A communication MCP server from @Coral_Protocol.
Anemoi replaces purely centralized, context-stuffed coordination with an A2A communication server (MCP) that lets agents talk directly, monitor progress, refine plans, and reach consensus.
Design
A semi-centralized planner proposes an initial plan, while worker agents (web, document processing, reasoning/coding) plus critique and answer-finding agents collaborate via MCP threads.
Agents communicate directly with each other: all participants can list agents, create threads, send messages, wait for mentions, and update plans as execution unfolds.
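To picture the flow, here's a hypothetical worker loop over those primitives; every method name on mcp is a placeholder, not Coral Protocol's actual MCP tool name.

```python
# Sketch of how a worker agent might use the A2A thread primitives listed above.
# All `mcp` method names are placeholders, not Coral Protocol's actual API.

def worker_loop(mcp, do_work, me="web_agent"):
    peers = mcp.list_agents()                         # discover planner, critic, other workers
    thread = mcp.create_thread(participants=[me, *peers])
    mcp.send_message(thread, sender=me, text="Ready for sub-tasks.")
    while True:
        msg = mcp.wait_for_mention(agent=me)          # block until another agent @-mentions us
        result = do_work(msg.text)                    # execute the assigned sub-task
        mcp.send_message(msg.thread, sender=me, text=result)
        if getattr(msg, "requests_plan_update", False):
            mcp.update_plan(msg.thread, revision=result)   # planner and workers refine the plan
```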
NVIDIA's recent research on LLMs has been fantastic.
Jet-Nemotron is their latest work on efficient language models, and it significantly improves generation throughput.
Here are my notes:
A hybrid-architecture LM family built by “adapting after pretraining.”
Starting from a frozen full-attention model, the authors search where to keep full attention, which linear-attention block to use, and which hyperparameters match hardware limits.
The result, Jet-Nemotron-2B/4B, matches or surpasses popular full-attention baselines while massively increasing throughput on long contexts.
PostNAS pipeline
Begins with a pre-trained full-attention model and freezes MLPs, then proceeds in four steps:
1. Learn optimal placement or removal of full-attention layers
2. Select a linear-attention block
3. Design a new attention block
4. Run a hardware-aware hyperparameter search
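Step 4 might look roughly like this; the helper functions and throughput threshold are assumptions, not the PostNAS code.

```python
# Sketch of a hardware-aware hyperparameter search (hypothetical helpers).
# Configs that miss the throughput budget are rejected; among the rest, the most accurate wins.

def hardware_aware_search(candidates, build_model, measure_throughput, eval_accuracy,
                          min_tokens_per_sec=1000.0):
    best_cfg, best_acc = None, float("-inf")
    for cfg in candidates:                    # e.g., head counts, KV dims, linear-attention params
        model = build_model(cfg)              # MLPs stay frozen; only attention blocks change
        if measure_throughput(model) < min_tokens_per_sec:
            continue                          # violates the hardware limit
        acc = eval_accuracy(model)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```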
Catchy title and very cool memory technique to improve deep research agents.
Great for continuous, real-time learning without gradient updates.
Here are my notes:
Overview
Proposes a memory‑based learning framework that lets deep‑research agents adapt online without updating model weights.
The agent is cast as a memory‑augmented MDP with case‑based reasoning, implemented in a planner–executor loop over MCP tools.
Method
Decisions are guided by a learned case‑retrieval policy over an episodic Case Bank.
Non‑parametric memory retrieves Top‑K similar cases; parametric memory learns a Q‑function (soft Q‑learning or single‑step CE training in deep‑research settings) to rank cases for reuse and revision.
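A hedged sketch of the two retrieval paths; the data structures are assumptions, not the paper's code.

```python
# Sketch of case retrieval over the episodic Case Bank (hypothetical structures).
# Non-parametric path: cosine-similarity Top-K. Parametric path: a learned Q(query, case) re-ranker.
import numpy as np

def retrieve_cases(query_emb, case_bank, q_function=None, k=4):
    # case_bank: list of dicts with "embedding" (np.ndarray) and "case" (past trajectory) fields
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    scored = sorted(case_bank, key=lambda c: cosine(query_emb, c["embedding"]), reverse=True)
    candidates = scored[:k]
    if q_function is not None:
        # Re-rank the Top-K with the learned Q-function (soft Q-learning or single-step CE).
        candidates.sort(key=lambda c: q_function(query_emb, c["embedding"]), reverse=True)
    return [c["case"] for c in candidates]
```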
Interesting idea to train a single model with the capabilities of a multi-agent system.
84.6% reduction in inference cost!
Distillation and Agentic RL are no joke!
Here are my notes:
Overview
This work proposes training single models to natively behave like multi‑agent systems, coordinating “role‑playing” and tool agents end‑to‑end.
They distill strong multi‑agent frameworks into CoA trajectories, then optimize with agentic RL on verifiable tasks.
Paradigm shift
CoA generalizes ReAct/TIR by dynamically activating multiple roles and tools within one model, preserving a single coherent state while cutting inter‑agent chatter.
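Here's an illustrative sketch of what "one model, many roles and tools" could look like at inference time; the tag format and helpers are my assumptions, not the paper's trajectory schema.

```python
# Illustrative CoA-style loop: one model emits role and tool tags inside a single trajectory
# instead of routing messages between separate agents. The tag format is an assumption.

def parse_tool_call(step):
    # Naive parse of "<tool:NAME>ARG</tool>" (illustrative only).
    header, _, rest = step.partition(">")
    name = header[len("<tool:"):]
    arg = rest.split("</tool>")[0]
    return name, arg

def chain_of_agents(llm, tools, task, max_steps=12):
    state = f"<task>{task}</task>"
    for _ in range(max_steps):
        step = llm(state)                      # the model decides which "role" or tool to activate next
        state += step
        if step.startswith("<tool:"):          # e.g. <tool:python>print(1 + 1)</tool>
            name, arg = parse_tool_call(step)
            state += f"<result>{tools[name](arg)}</result>"   # tool output stays in the same state
        elif step.startswith("<answer>"):
            return step                        # final answer ends the single coherent trajectory
    return state
```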