Small Language Models are the Future of Agentic AI
Lots to gain from building agentic systems with small language models.
Capabilities are increasing rapidly!
AI devs should be exploring SLMs.
Here are my notes:
Overview
This position paper argues that small language models (SLMs), defined pragmatically as those runnable on consumer-grade hardware, are not only sufficient but superior for many agentic AI applications, especially when tasks are narrow, repetitive, or tool-oriented.
The authors propose that shifting from LLM-first to SLM-first architectures will yield major gains in efficiency, modularity, and sustainability.
SLMs are already capable of commonsense reasoning, instruction following, and code/tool interaction at levels comparable to 30–70B models, with orders of magnitude better throughput.
Examples include Phi-3, Hymba-1.5B, DeepSeek-R1-Distill, and RETRO-7.5B.
The economic benefits are significant: SLMs offer 10–30× lower inference cost than LLMs, require less parallel infrastructure, and are amenable to overnight fine-tuning and even edge deployment (e.g., ChatRTX).
This enables faster iteration and better data control.
SLMs support modular, composable agent systems where specialized models handle subtasks, resulting in better alignment, lower risk of hallucinations, and easier debugging.
The authors advocate for heterogeneous architectures, with SLMs as defaults and LLMs used selectively.
A six-step LLM-to-SLM conversion algorithm is proposed, involving usage logging, task clustering, and PEFT fine-tuning.
This supports gradual migration from monolithic agents to SLM-based compositions.
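As a rough illustration of what the first few steps could look like in practice, here is a minimal Python sketch of the usage-logging and task-clustering stages; the file name, helper functions, and the sentence-transformers + k-means choices are my assumptions, not the paper's implementation.

```python
# Hedged sketch of the logging and clustering stages of an LLM-to-SLM migration.
# File name, helper names, and the embedding/k-means choices are illustrative
# assumptions, not the paper's implementation.
import json
from collections import defaultdict

from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

AGENT_LOG = "agent_calls.jsonl"  # step 1: every LLM invocation logged here


def load_prompts(path: str) -> list[str]:
    """Read logged prompts, one JSON record per line."""
    with open(path) as f:
        return [json.loads(line)["prompt"] for line in f]


def cluster_tasks(prompts: list[str], n_clusters: int = 8) -> dict[int, list[str]]:
    """Step 3: group logged prompts into recurring task types."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = embedder.encode(prompts)
    labels = KMeans(n_clusters=n_clusters).fit_predict(vectors)
    clusters: dict[int, list[str]] = defaultdict(list)
    for prompt, label in zip(prompts, labels):
        clusters[int(label)].append(prompt)
    return clusters

# The remaining steps (SLM selection, PEFT fine-tuning, iteration) would then
# attach, e.g., one LoRA adapter per discovered task cluster.
```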
Case studies on MetaGPT, Open Operator, and Cradle suggest 40–70% of LLM invocations can be reliably replaced with SLMs, particularly for structured generation and routine tool use.
The authors acknowledge that LLMs retain an advantage in general language understanding and that economic inertia favors their continued use, but the paper makes a compelling case that SLM-centric systems better reflect real-world agentic requirements and resource constraints.
REFRAG is a plug-in decoding strategy for RAG systems that slashes latency and memory use.
It achieves up to 30.85× TTFT (time-to-first-token) acceleration.
Let's break down the technical details:
TL;DR
REFRAG replaces most retrieved tokens with precomputed chunk embeddings at decode time, then selectively expands only the few chunks that matter.
This exploits block-diagonal attention in RAG prompts to cut latency and memory while preserving accuracy across RAG, multi-turn dialog, and long-doc summarization.
Core idea
Chunk the retrieved context, encode each chunk with a lightweight encoder, project to the decoder’s embedding size, and feed embeddings directly alongside the user query.
A lightweight RL policy decides which chunks should stay compressed and which need to be expanded back into full text. Think of it as zooming in only where necessary.
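A minimal PyTorch sketch of this compress-then-selectively-expand flow is below; the dimensions, module names, and the stand-in scoring policy are illustrative assumptions rather than REFRAG's actual code.

```python
# Minimal PyTorch sketch of the compress-then-selectively-expand idea.
# Dimensions, module names, and the scoring stand-in are assumptions,
# not REFRAG's actual implementation.
import torch
import torch.nn as nn


class ChunkCompressor(nn.Module):
    """Project lightweight chunk encodings into the decoder's embedding space."""

    def __init__(self, enc_dim: int = 384, dec_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(enc_dim, dec_dim)

    def forward(self, chunk_encodings: torch.Tensor) -> torch.Tensor:
        # chunk_encodings: (num_chunks, enc_dim), one vector per retrieved chunk
        return self.proj(chunk_encodings)  # (num_chunks, dec_dim)


def select_chunks_to_expand(scores: torch.Tensor, budget: int) -> torch.Tensor:
    """Stand-in for the RL policy: expand only the top-`budget` scored chunks."""
    return torch.topk(scores, k=budget).indices


# Toy usage: 16 retrieved chunks, only 2 get expanded back into full text.
compressor = ChunkCompressor()
chunk_vecs = torch.randn(16, 384)          # from a lightweight chunk encoder
compressed = compressor(chunk_vecs)        # fed to the decoder as single "tokens"
expand_idx = select_chunks_to_expand(torch.randn(16), budget=2)
# The decoder input then interleaves the query token embeddings, the compressed
# chunk embeddings, and the full token embeddings of the expanded chunks.
```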
The next paper argues that RL improves LLM reasoning via an emergent two-phase hierarchy.
First, the model firms up low-level execution, then progress hinges on exploring high-level planning.
More on this interesting analysis:
The authors propose HIerarchy-Aware Credit Assignment (HICRA), which boosts credit on strategic “planning tokens,” and show consistent gains over GRPO.
They also propose semantic entropy as a better exploration signal than token-level entropy.
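To make the credit-assignment idea concrete, here is a hedged sketch of amplifying GRPO-style advantages on planning tokens; the alpha value and the planning-token mask are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of hierarchy-aware credit weighting: take GRPO-style
# per-token advantages and amplify them on tokens flagged as planning.
# The alpha value and the mask heuristic are assumptions, not the paper's recipe.
import torch


def hicra_style_advantages(
    advantages: torch.Tensor,     # (seq_len,) per-token advantages from GRPO
    planning_mask: torch.Tensor,  # (seq_len,) bool, True on planning tokens
    alpha: float = 0.5,
) -> torch.Tensor:
    """Boost credit on planning tokens; leave execution tokens unchanged."""
    boost = 1.0 + alpha * planning_mask.float()
    return advantages * boost


# Example: connective tokens like "first", "instead", "let's verify" might be
# tagged as planning, while arithmetic/code tokens count as execution.
adv = torch.tensor([0.2, 0.2, 0.2, 0.2])
mask = torch.tensor([True, False, False, True])
print(hicra_style_advantages(adv, mask))  # planning positions get 1.5x credit
```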
Two-phase dynamic
Early RL training reduces perplexity and entropy on execution tokens, consolidating procedural skills.
Later gains align with increased diversity in planning tokens and longer, more accurate traces, explaining “aha moments” and length scaling.
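One simple way to monitor this two-phase signal would be to track token entropy separately over planning and execution positions during training; the sketch below assumes you already have per-token entropies and a planning-token mask, neither of which comes from the paper.

```python
# One way to monitor the two-phase dynamic: average per-token entropy
# separately over execution and planning positions at each eval step.
# The entropy source and the planning mask are assumptions for illustration.
import torch


def phase_diagnostics(
    token_entropy: torch.Tensor,  # (seq_len,) per-token predictive entropy
    planning_mask: torch.Tensor,  # (seq_len,) bool, True on planning tokens
) -> dict[str, float]:
    execution_mask = ~planning_mask
    return {
        "planning_entropy": token_entropy[planning_mask].mean().item(),
        "execution_entropy": token_entropy[execution_mask].mean().item(),
    }


# Phase 1 should show execution_entropy falling; phase 2, planning entropy
# (or a semantic variant of it) rising as the model explores strategies.
entropy = torch.rand(8)
mask = torch.tensor([True, False, False, True, False, False, False, True])
print(phase_diagnostics(entropy, mask))
```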
I'm surprised Agentic RAG is not getting more attention.
That's all about to change.
Here's why:
Standard RAG systems follow a fixed retrieve-then-generate pipeline, which limits how much useful context and value you can pack into the AI response.
Configuring LLMs to leverage tools via an agent lets the system ground its answers better and reduce hallucinations across the board.
Tools give the agentic RAG system access to additional, more relevant context when it needs it.
Simple queries can be answered by the vector store retriever component alone, while more complex queries are answered more precisely by multiple retriever components that are themselves subagents, as sketched below.
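Here is an illustrative routing sketch for that setup, with a cheap complexity check deciding whether a single vector-store lookup suffices or the query should fan out to specialized retriever subagents; all names (web_search, sql_retriever, is_complex) are hypothetical.

```python
# Illustrative routing sketch for an agentic RAG setup: a cheap complexity
# check decides between one vector-store lookup and a fan-out to retriever
# subagents. All names (web_search, sql_retriever, is_complex) are hypothetical.
from typing import Callable

Retriever = Callable[[str], list[str]]


def route_query(
    query: str,
    vector_retriever: Retriever,
    subagents: dict[str, Retriever],
    is_complex: Callable[[str], bool],
) -> list[str]:
    """Simple queries -> single retriever; complex queries -> all subagents."""
    if not is_complex(query):
        return vector_retriever(query)
    context: list[str] = []
    for name, retriever in subagents.items():
        context.extend(f"[{name}] {passage}" for passage in retriever(query))
    return context


# Toy usage with stub retrievers standing in for real subagents.
stub = lambda q: [f"passage relevant to: {q}"]
print(route_query(
    "Compare the last three quarterly reports and summarize revenue drivers across regions",
    vector_retriever=stub,
    subagents={"web_search": stub, "sql_retriever": stub},
    is_complex=lambda q: len(q.split()) > 10,
))
```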
Another paper trains models to "think wider," exploring multiple ideas in parallel to produce better responses.
The approach is called native thought parallelism, and it proves superior to sequential reasoning.
Great read for AI devs!
Here are the technical details:
TL;DR
This paper proposes a new way to make LLMs smarter at problem solving.
Instead of making the model think in one long chain of reasoning (which often gets stuck in early mistakes), they train it to explore multiple independent ideas at the same time (via parallel reasoning paths) and then merge them into a final answer.
The problem
Current “think longer” tricks run into Tunnel Vision. Once a model takes a wrong step, it usually can’t recover, no matter how many extra tokens you give it.
Early tokens commit the model to a suboptimal path; majority-style parallel sampling can beat one long chain under the same token budget.
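As a toy version of that last comparison, the sketch below splits a fixed token budget across several independent reasoning paths and majority-votes their final answers; generate_path stands in for whatever sampling call your LLM stack provides.

```python
# Toy version of the comparison above: split a fixed token budget across
# several independent reasoning paths and majority-vote the final answers.
# `generate_path` is a stub for whatever sampling call your LLM stack provides.
import random
from collections import Counter
from typing import Callable


def parallel_vote(
    question: str,
    generate_path: Callable[[str, int], str],  # (question, max_tokens) -> answer
    total_budget: int = 4096,
    num_paths: int = 8,
) -> str:
    per_path_budget = total_budget // num_paths
    answers = [generate_path(question, per_path_budget) for _ in range(num_paths)]
    return Counter(answers).most_common(1)[0][0]  # majority answer wins


# Stub "model" that is right ~70% of the time on any single short path.
stub = lambda q, budget: "42" if random.random() < 0.7 else "41"
print(parallel_vote("What is 6 * 7?", stub))
```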