Small Language Models are the Future of Agentic AI
Lots to gain from building agentic systems with small language models.
Capabilities are increasing rapidly!
AI devs should be exploring SLMs.
Here are my notes:
Overview
This position paper argues that small language models (SLMs), defined pragmatically as those runnable on consumer-grade hardware, are not only sufficient but superior for many agentic AI applications, especially when tasks are narrow, repetitive, or tool-oriented.
The authors propose that shifting from LLM-first to SLM-first architectures will yield major gains in efficiency, modularity, and sustainability.
SLMs are already capable of commonsense reasoning, instruction following, and code/tool interaction at levels comparable to 30–70B models, with orders of magnitude better throughput.
Examples include Phi-3, Hymba-1.5B, DeepSeek-R1-Distill, and RETRO-7.5B.
The economic benefits are significant: SLMs offer 10–30× lower inference cost than LLMs, require less parallel infrastructure, and are amenable to overnight fine-tuning and even edge deployment (e.g., ChatRTX).
This enables faster iteration and better data control.
SLMs support modular, composable agent systems where specialized models handle subtasks, resulting in better alignment, lower risk of hallucinations, and easier debugging.
The authors advocate for heterogeneous architectures, with SLMs as defaults and LLMs used selectively.
A six-step LLM-to-SLM conversion algorithm is proposed, involving usage logging, task clustering, and PEFT fine-tuning.
This supports gradual migration from monolithic agents to SLM-based compositions.
Case studies on MetaGPT, Open Operator, and Cradle suggest 40–70% of LLM invocations can be reliably replaced with SLMs, particularly for structured generation and routine tool use.
LLMs retain an advantage in general language understanding, and that economic inertia favors their continued use, but this paper makes a compelling case that SLM-centric systems better reflect real-world agentic requirements and resource constraints.
Great title for a report, but even better insights about how increasing input tokens impact the performance of top LLMs.
Banger report from Chroma.
Here are my takeaways (relevant for AI devs):
Context Rot
The research evaluates how state-of-the-art LLMs perform as input context length increases, challenging the common assumption that longer contexts are uniformly handled.
Testing 18 top models (including GPT-4.1, Claude 4, Gemini 2.5, Qwen3), the authors show that model reliability degrades non-uniformly even on simple tasks as input grows, what they term "context rot."
Simple tasks reveal degradation
Even basic benchmarks like semantic variants of Needle-in-a-Haystack, repeated word copying, or long QA logs (LongMemEval) expose accuracy drops as context length increases.
The decline is more dramatic for semantically ambiguous inputs or outputs that scale with length.
160+ pages covering the most important research around context engineering for LLMs.
This is a must-read!
Here are my notes:
The paper provides a taxonomy of context engineering in LLMs categorized into foundational components, system implementations, evaluation methodologies, and future directions.
The context engineering evolution timeline from 2020 to 2025 involves foundational RAG systems to complex multi-agent architectures.
> GPT-4.1 leads
> Gemini-2.5-flash excels at tool selection
> Kimi K2 is the top open-source model
> Grok 4 falls short
> Reasoning models lag behind
> No single model dominates all domains
More below:
@rungalileo introduces Agent Leaderboard v2, a domain-specific evaluation benchmark for AI agents designed to simulate real enterprise tasks across banking, healthcare, insurance, telecom, and investment.
Unlike earlier tool-calling benchmarks that saturate at 90%+ accuracy, v2 focuses on Action Completion (AC) and Tool Selection Quality (TSQ) in complex, multi-turn conversations.
Semantically empty tokens, like “Thought process:”, “Solution”, or even just a colon “:”, can consistently trick models into giving false positive rewards.
Here are my notes:
Overview
Investigates the surprising fragility of LLM-based reward models used in Reinforcement Learning with Verifiable Rewards (RLVR).
The authors find that inserting superficial, semantically empty tokens, like “Thought process:”, “Solution”, or even just a colon “:”, can consistently trick models into giving false positive rewards, regardless of the actual correctness of the response.
"Master keys" break LLM judges
Simple, generic lead-ins (e.g., “Let’s solve this step by step”) and even punctuation marks can elicit false YES judgments from top reward models.
This manipulation works across models (GPT-4o, Claude-4, Qwen2.5, etc.), tasks (math and general reasoning), and prompt formats, reaching up to 90% false positive rates in some cases.
MemAgent-14B is trained on 32K-length documents with an 8K context window.
Achieves >76% accuracy even at 3.5M tokens!
That consistency is crazy!
Here are my notes:
Overview
Introduces an RL–driven memory agent that enables transformer-based LLMs to handle documents up to 3.5 million tokens with near lossless performance, linear complexity, and no architectural modifications.
RL-shaped fixed-length memory
MemAgent reads documents in segments and maintains a fixed-size memory updated via an overwrite mechanism.
This lets it process arbitrarily long inputs with O(N) inference cost while avoiding context window overflows.