Here is what the layered architecture of the agent internet ecosystem looks like today. It spans several layers, including the Agent Internet, the Protocol Layer, and the Application Layer.
Timeline
The report provides an overview of LLMs, agent frameworks, agent protocols, and popular applications from 2019 to today. It's not exhaustive, but it gives a rough picture of the progress. It's still early days for agents, and stronger LLMs and protocols are key.
Microsoft just released Phi-4-Mini-Reasoning to explore small reasoning language models for math.
Let's find out how this all works:
Phi-4-Mini-Reasoning
The paper introduces Phi-4-Mini-Reasoning, a 3.8B parameter small language model (SLM) that achieves state-of-the-art mathematical reasoning performance, rivaling or outperforming models nearly TWICE its size.
Unlocking Reasoning
They use a systematic, multi-stage training pipeline to unlock strong reasoning capabilities in compact models, addressing the challenges posed by their limited capacity.
It uses large-scale distillation, preference learning, and RL with verifiable rewards.
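The "verifiable rewards" part is easy to picture with a toy example. Below is a minimal sketch (my own illustration, not the paper's code) of a rule-based reward for math: the model's final boxed answer either matches the reference or it doesn't, so no learned reward model is needed.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the final answer out of a \\boxed{...} span, a common
    convention in math reasoning outputs."""
    match = re.search(r"\\boxed\{([^{}]*)\}", completion)
    return match.group(1).strip() if match else None

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference
    exactly, 0.0 otherwise. Deterministic and cheap to check."""
    predicted = extract_final_answer(completion)
    return 1.0 if predicted == gold_answer.strip() else 0.0

print(verifiable_reward("So the sum is \\boxed{42}.", "42"))  # 1.0
print(verifiable_reward("So the sum is \\boxed{41}.", "42"))  # 0.0
```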
Papers like this give you a better sense of the opportunities and challenges ahead.
Lots of great ideas in this paper. I've summarized a few below:
What is it?
UniversalRAG is a framework that overcomes the limitations of existing RAG systems confined to a single modality or corpus. It supports retrieval across modalities (text, image, video) and at multiple granularities (e.g., paragraph vs. document, clip vs. video).
Modality-aware routing
To counter modality bias in unified embedding spaces (where queries often retrieve same-modality results regardless of relevance), UniversalRAG introduces a router that dynamically selects the appropriate modality (e.g., image vs. text) for each query.
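To make the routing step concrete, here's a minimal sketch of a modality-aware router. Everything in it (the keyword rules, corpus names, and retriever stubs) is my own illustrative scaffolding, not UniversalRAG's implementation; in the paper the router can be prompt-based or trained.

```python
from typing import Callable

# One retriever stub per modality/granularity corpus (names are mine).
RETRIEVERS: dict[str, Callable[[str], list[str]]] = {
    "paragraph": lambda q: [f"text paragraph hit for: {q}"],
    "document":  lambda q: [f"full document hit for: {q}"],
    "image":     lambda q: [f"image hit for: {q}"],
    "clip":      lambda q: [f"video clip hit for: {q}"],
    "video":     lambda q: [f"full video hit for: {q}"],
}

def route_query(query: str) -> str:
    """Pick the corpus to search. A real router would be an LLM call
    or a small trained classifier; keyword rules keep this runnable."""
    q = query.lower()
    if "photo" in q or "picture" in q:
        return "image"
    if "scene" in q or "moment" in q:
        return "clip"
    if "full video" in q or "entire talk" in q:
        return "video"
    if "whole report" in q or "full paper" in q:
        return "document"
    return "paragraph"  # default to fine-grained text

def universal_rag_retrieve(query: str) -> list[str]:
    """Route first, then retrieve from the chosen corpus, instead of
    searching one unified space that favors same-modality hits."""
    return RETRIEVERS[route_query(query)](query)

print(universal_rag_retrieve("Find a photo of the Golden Gate Bridge"))
```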
Building Production-Ready AI Agents with Scalable Long-Term Memory
Memory is one of the most challenging bits of building production-ready agentic systems.
Lots of goodies in this paper.
Here is my breakdown:
What does it solve?
It proposes a memory-centric architecture for LLM agents to maintain coherence across long conversations and sessions, addressing the fixed context window limitation.
The solution:
Introduces two systems: Mem0, a dense, language-based memory system, and Mem0g, an enhanced version with graph-based memory to model complex relationships.
Both aim to extract, consolidate, and retrieve salient facts over time efficiently.
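Here's a rough sketch of that extract-consolidate-retrieve loop. This is my own mock, not Mem0's actual API: in practice, extraction and consolidation are LLM calls and retrieval uses embedding similarity, but the shape of the pipeline is the same.

```python
class MemoryStore:
    """Toy long-term memory: extract facts, consolidate, retrieve."""

    def __init__(self):
        self.facts: list[str] = []

    def extract(self, turn: str) -> list[str]:
        """Pull salient facts from a conversation turn. A real system
        would use an LLM; here we keep simple declarative sentences."""
        return [s.strip() for s in turn.split(".") if " is " in s]

    def consolidate(self, new_facts: list[str]) -> None:
        """Merge new facts, skipping duplicates. A real system would
        also update or delete facts contradicted by new information."""
        for fact in new_facts:
            if fact not in self.facts:
                self.facts.append(fact)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Return the k facts sharing the most words with the query,
        standing in for embedding-based similarity search."""
        q_words = set(query.lower().split())
        ranked = sorted(self.facts,
                        key=lambda f: len(q_words & set(f.lower().split())),
                        reverse=True)
        return ranked[:k]

memory = MemoryStore()
memory.consolidate(memory.extract("My name is Ada. My dog is called Rex."))
print(memory.retrieve("what is the dog called?"))
```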
This survey provides a comprehensive taxonomy of recent system-level innovations for efficient LLM inference serving.
Great overview for devs working on inference.
Here is what's included:
Instance-Level Methods
Techniques like model parallelism (pipeline, tensor, context, and expert parallelism), offloading (e.g., ZeRO-Offload, FlexGen, TwinPilots), and request scheduling (inter- and intra-request) are reviewed...
Novel schedulers like FastServe, Prophet, and INFERMAX optimize decoding with predicted request lengths. KV cache optimization covers paging, reuse (lossless and semantic-aware), and compression (e.g., 4-bit quantization, compact encodings).
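Of these, KV cache paging is the easiest to illustrate. Below is a toy allocator in the spirit of paged attention (my sketch, not vLLM's or any framework's code): the cache is split into fixed-size blocks, and each request holds a block table instead of one contiguous buffer, which cuts fragmentation.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # physical block IDs
        self.block_tables: dict[str, list[int]] = {}  # request -> block IDs
        self.token_counts: dict[str, int] = {}        # request -> tokens stored

    def append_token(self, request_id: str) -> None:
        """Reserve KV space for one more token, allocating a new block
        only when the request's last block is full."""
        count = self.token_counts.get(request_id, 0)
        if count % BLOCK_SIZE == 0:  # no block yet, or last block full
            if not self.free_blocks:
                raise MemoryError("cache full: preempt, swap, or recompute")
            self.block_tables.setdefault(request_id, []).append(
                self.free_blocks.pop())
        self.token_counts[request_id] = count + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):                 # 20 tokens -> 2 blocks of 16
    cache.append_token("req-1")
print(cache.block_tables["req-1"])  # two block IDs, not contiguous memory
cache.release("req-1")
```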
265 pages of everything you need to know about building AI agents.
5 things that stood out to me about this report:
1. Human Brain and LLM Agents
A great section for understanding what differentiates LLM agents from human/brain cognition, and what inspiration we can draw from the way humans learn and operate.
2. Definitions
There is a nice, detailed, and formal definition of what makes up an AI agent. Most definitions out there are too abstract.
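I won't reproduce the report's formalism from memory, but to show what "formal" means here, one common way to pin down an agent is as a perception-memory-policy loop over an environment. A generic sketch, not necessarily the report's exact definition:

```latex
% Generic formalization of an agent (illustrative; not the report's exact one):
% an agent over observation space O and action space A is a tuple
%   Agent = (O, A, M, \phi, U, \pi)
\[
m_{t+1} = U\big(m_t,\ \phi(o_t)\big), \qquad a_t = \pi(m_{t+1})
\]
% where o_t \in O is the observation, m_t \in M the internal memory/state,
% \phi the perception map, U the memory update, and \pi the action policy.
```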