elvis (@omarsar0) · Jul 1
Small Language Models are the Future of Agentic AI

Lots to gain from building agentic systems with small language models.

Capabilities are increasing rapidly!

AI devs should be exploring SLMs.

Here are my notes:
Overview

This position paper argues that small language models (SLMs), defined pragmatically as those runnable on consumer-grade hardware, are not only sufficient but superior for many agentic AI applications, especially when tasks are narrow, repetitive, or tool-oriented.
The authors propose that shifting from LLM-first to SLM-first architectures will yield major gains in efficiency, modularity, and sustainability.
SLMs are already capable of commonsense reasoning, instruction following, and code/tool interaction at levels comparable to 30–70B models, with orders of magnitude better throughput.

Examples include Phi-3, Hymba-1.5B, DeepSeek-R1-Distill, and RETRO-7.5B.
The economic benefits are significant: SLMs offer 10–30× lower inference cost than LLMs, require less parallel infrastructure, and are amenable to overnight fine-tuning and even edge deployment (e.g., ChatRTX).

This enables faster iteration and better data control.
SLMs support modular, composable agent systems where specialized models handle subtasks, resulting in better alignment, lower risk of hallucinations, and easier debugging.

The authors advocate for heterogeneous architectures, with SLMs as defaults and LLMs used selectively.
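
A minimal sketch of that default-SLM, escalate-to-LLM routing idea, assuming a toy complexity heuristic; the tier names and threshold are illustrative, not from the paper:

```python
# Hedged sketch: route each agent invocation to a small default model and
# escalate to a large one only for hard calls. The heuristic and threshold
# are assumptions for illustration.

def estimate_complexity(task: str) -> float:
    """Toy heuristic: longer, open-ended prompts look 'harder'."""
    open_ended = any(w in task.lower() for w in ("why", "plan", "strategy"))
    return len(task.split()) / 100.0 + (0.5 if open_ended else 0.0)

def route(task: str, threshold: float = 0.6) -> str:
    """Return the model tier that should handle this invocation."""
    return "llm-fallback" if estimate_complexity(task) > threshold else "slm-default"

print(route("Extract the date from: 'Invoice issued 2024-03-02'"))  # slm-default
print(route("Why might our multi-agent pipeline degrade under load? Propose a strategy."))  # llm-fallback
```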
A six-step LLM-to-SLM conversion algorithm is proposed, involving usage logging, task clustering, and PEFT fine-tuning.

This supports gradual migration from monolithic agents to SLM-based compositions.
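
The thread names three of the steps (usage logging, task clustering, PEFT fine-tuning); here is a hedged sketch of how the first half of that loop might look, with hypothetical helper names that are not the paper's reference code:

```python
# Hedged sketch of the LLM-to-SLM conversion loop. Helper names and the
# clustering heuristic are illustrative assumptions.
from collections import defaultdict

def log_usage(call_log: list, prompt: str, response: str) -> None:
    # Step 1: record every LLM invocation the agent makes.
    call_log.append({"prompt": prompt, "response": response})

def cluster_tasks(call_log: list) -> dict:
    # Step 2 (toy): group calls by a crude task signature. A real system
    # would embed the prompts and cluster the embeddings.
    clusters = defaultdict(list)
    for call in call_log:
        signature = call["prompt"].split()[0].lower()  # first word as a stand-in
        clusters[signature].append(call)
    return clusters

def select_candidates(clusters: dict, min_calls: int = 100) -> list:
    # Step 3: only frequent, repetitive clusters are worth a dedicated SLM.
    return [sig for sig, calls in clusters.items() if len(calls) >= min_calls]

# Remaining steps (not shown): pick an SLM per cluster, PEFT fine-tune it on
# the logged prompt/response pairs, re-route matching traffic, and iterate.
```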
Case studies on MetaGPT, Open Operator, and Cradle suggest 40–70% of LLM invocations can be reliably replaced with SLMs, particularly for structured generation and routine tool use.

The authors acknowledge that LLMs retain an advantage in general language understanding and that economic inertia favors their continued use, but the paper makes a compelling case that SLM-centric systems better reflect real-world agentic requirements and resource constraints.

Paper: arxiv.org/abs/2506.02153


More from @omarsar0

Sep 9
Another impressive paper by Meta.

It's a plug-in decoding strategy for RAG systems that slashes latency and memory use.

REFRAG achieves up to 30.85× time-to-first-token (TTFT) acceleration.

Let's break down the technical details:
TL;DR

REFRAG replaces most retrieved tokens with precomputed chunk embeddings at decode time, then selectively expands only the few chunks that matter.

This exploits block-diagonal attention in RAG prompts to cut latency and memory while preserving accuracy across RAG, multi-turn dialog, and long-doc summarization.
Core idea

Chunk the retrieved context, encode each chunk with a lightweight encoder, project to the decoder’s embedding size, and feed embeddings directly alongside the user query.

A lightweight RL policy decides which chunks should stay compressed and which need to be expanded back into full text. Think of it as zooming in only where necessary.
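
A minimal numeric sketch of that decode-time flow, with mean-pooling standing in for the lightweight encoder and a similarity heuristic standing in for the learned RL policy; shapes and names are assumptions, not the paper's implementation:

```python
# Hedged sketch of REFRAG-style context compression at decode time.
import numpy as np

rng = np.random.default_rng(0)
ENC_DIM, DEC_DIM, CHUNK = 64, 128, 16

def encode_chunk(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the lightweight chunk encoder: mean-pool token features."""
    return tokens.mean(axis=0)

W_proj = rng.normal(size=(ENC_DIM, DEC_DIM))  # projection into decoder space

def compress_context(token_feats: np.ndarray) -> np.ndarray:
    # One embedding per chunk, instead of CHUNK token embeddings each.
    chunks = token_feats.reshape(-1, CHUNK, ENC_DIM)
    return np.stack([encode_chunk(c) @ W_proj for c in chunks])

def select_for_expansion(chunk_embs: np.ndarray, query_emb: np.ndarray, k: int = 2):
    # Toy policy: expand the k chunks most similar to the query.
    # REFRAG instead trains an RL policy for this decision.
    scores = chunk_embs @ query_emb
    return np.argsort(scores)[-k:]

token_feats = rng.normal(size=(8 * CHUNK, ENC_DIM))   # retrieved context
query_emb = rng.normal(size=DEC_DIM)
chunk_embs = compress_context(token_feats)
print("expand chunks:", select_for_expansion(chunk_embs, query_emb))
```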
Sep 9
Emergent Hierarchical Reasoning in LLMs

The paper argues that RL improves LLM reasoning via an emergent two-phase hierarchy.

First, the model firms up low-level execution, then progress hinges on exploring high-level planning.

More on this interesting analysis:
The authors propose HIerarchy-Aware Credit Assignment (HICRA), which boosts credit on strategic “planning tokens,” and show consistent gains over GRPO.

They also propose semantic entropy as a better exploration signal than token-level entropy.
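
As a rough illustration of hierarchy-aware credit assignment, one could upweight the advantage on tokens tagged as planning before the policy update; the marker list and the 1.5× boost below are invented for illustration, not the paper's recipe:

```python
# Hedged sketch: boost per-token credit on "planning tokens" before the
# RL update; execution tokens keep their GRPO-style advantage.
PLANNING_MARKERS = {"first", "plan", "alternatively", "let's", "strategy"}

def is_planning_token(token: str) -> bool:
    return token.lower().strip(",.") in PLANNING_MARKERS

def hicra_advantages(tokens: list[str], advantages: list[float], boost: float = 1.5):
    """Scale advantages for planning tokens; leave execution tokens unchanged."""
    return [a * boost if is_planning_token(t) else a
            for t, a in zip(tokens, advantages)]

tokens = ["First", "compute", "2+2", "then", "plan", "the", "proof"]
advs = [0.2] * len(tokens)
print(hicra_advantages(tokens, advs))  # planning tokens get 0.3, others 0.2
```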
Two-phase dynamic

Early RL training reduces perplexity and entropy on execution tokens, consolidating procedural skills.

Later gains align with increased diversity in planning tokens and longer, more accurate traces, explaining “aha moments” and length scaling.
Sep 8
I'm surprised Agentic RAG is not getting more attention.

That's all about to change.

Here's why:
Standard RAG systems can only do so much and are quite limited in how much value you can pack into the AI response.

Configuring LLMs to leverage tools via an agent lets you produce responses that not only ground answers better but also reduce hallucinations across the board.
Tools give the agentic RAG system access to additional, more relevant context exactly when it needs it.

Simple queries can be answered by the vector store retriever component, but more complex queries can be answered more precisely with multiple retriever components that are themselves subagents, as in the sketch below.
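
A hedged sketch of that routing idea; the retriever interfaces here are hypothetical stand-ins, not a specific framework's API:

```python
# Hedged sketch: plain RAG for simple lookups, retriever subagents for
# complex queries. `vector_store` and `subagents` are assumed interfaces.

def answer(query: str, vector_store, subagents: list) -> str:
    if is_simple(query):
        docs = vector_store.search(query, k=3)          # plain RAG path
        return synthesize(query, docs)
    # Agentic path: each subagent owns one retriever/tool and returns
    # partial context; the final model grounds its answer in all of it.
    partials = [agent.retrieve(query) for agent in subagents]
    return synthesize(query, [d for p in partials for d in p])

def is_simple(query: str) -> bool:
    # Toy heuristic: single-hop questions tend to be short.
    return len(query.split()) < 12 and " and " not in query

def synthesize(query: str, docs: list) -> str:
    # Stand-in for the generation step.
    return f"Answer to {query!r} grounded in {len(docs)} passages."
```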
Sep 8
Another banger paper on reasoning LLMs!

They train models to "think wider" to explore multiple ideas that produce better responses.

It's called native thought parallelism and proves superior to sequential reasoning.

Great read for AI devs!

Here are the technical details:
TL;DR

This paper proposes a new way to make LLMs smarter at problem solving.

Instead of making the model think in one long chain of reasoning (which often gets stuck in early mistakes), they train it to explore multiple independent ideas at the same time (via parallel reasoning paths) and then merge them into a final answer.
The problem

Current “think longer” tricks run into Tunnel Vision. Once a model takes a wrong step, it usually can’t recover, no matter how many extra tokens you give it.

Early tokens commit the model to a suboptimal path; majority-style parallel sampling can beat one long chain under the same token budget.
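
A toy sketch of the parallel-then-merge idea under a fixed token budget, with a stubbed generator and simple majority voting standing in for the paper's learned native merge:

```python
# Hedged sketch: spend the same token budget on several independent paths
# instead of one long chain, then merge. The generator is a stub.
import random
from collections import Counter

random.seed(0)

def generate(prompt: str, budget: int) -> str:
    # Stand-in for one independent reasoning path; a wrong early step
    # ("tunnel vision") would corrupt the whole path.
    return random.choice(["42", "42", "41"])  # noisy but majority-correct

def solve_parallel(prompt: str, total_budget: int = 9000, paths: int = 3) -> str:
    per_path = total_budget // paths  # same total budget as one long chain
    answers = [generate(prompt, per_path) for _ in range(paths)]
    return Counter(answers).most_common(1)[0][0]

print(solve_parallel("What is 6 * 7?"))
```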
Sep 7
Another impressive paper by Google DeepMind.

It takes a closer look at the limits of embedding-based retrieval.

If you work with vector embeddings, bookmark this one.

Let's break down the technical details:
Quick Overview

This paper looks at how search engines that rely on vector embeddings have built-in limits.

Even if you train them perfectly, they just can’t handle every possible search query once the combinations of relevant documents get too complex.

The authors prove this with math, then confirm it with experiments on a simple but tricky dataset they call LIMIT.
Built-in ceiling

Each document and query is turned into a single vector.

The study shows there’s only so many correct top-k results these vectors can represent.

If you ask for more combinations than the vectors can encode, it’s impossible for the system to get it right.
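
A small numeric experiment makes the ceiling concrete: embed 6 documents in 2 dimensions, sweep many query vectors, and count which top-2 result sets ever occur. The setup is illustrative, not the paper's LIMIT construction:

```python
# Hedged sketch: count how many of the 15 possible top-2 document pairs are
# reachable by ANY query vector when embeddings live in 2 dimensions.
import itertools
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(6, 2))         # 6 documents, 2-d embeddings
queries = rng.normal(size=(20000, 2))  # dense sample of query directions

achieved = set()
for q in queries:
    scores = docs @ q
    top2 = tuple(sorted(np.argsort(scores)[-2:]))
    achieved.add(top2)

total = len(list(itertools.combinations(range(6), 2)))  # 15 possible pairs
print(f"top-2 sets reachable: {len(achieved)} of {total}")
# Typically prints well under 15: the geometry caps which result sets exist,
# no matter how well the embeddings are trained.
```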
Sep 6
Everyone is talking about this new OpenAI paper.

It's about why LLMs hallucinate.

You might want to bookmark this one.

Let's break down the technical details:
Quick Overview

The paper argues that hallucinations are not mysterious glitches but the predictable result of how LLMs are trained and evaluated.

Pretraining creates statistical pressure to make errors, and post-training benchmarks often reward confident guessing over honest uncertainty.

The fix is to realign mainstream evaluations to stop penalizing abstentions.
Pretraining inevitably produces some errors

Even if you trained on flawless text, the way models learn guarantees they’ll still slip up sometimes.

That’s because the training goal pushes them to give answers instead of saying “I don’t know.”

The paper's calibration histograms show that GPT-4-style base models are well calibrated prior to RL, consistent with this claim.
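
A hedged sketch of the proposed evaluation fix: score abstentions as neutral rather than wrong, so a model is incentivized to abstain whenever its confidence drops below 50%. The exact scoring values are illustrative:

```python
# Hedged sketch: abstention-aware scoring. Under plain accuracy, guessing
# strictly dominates saying "I don't know"; under this scheme it doesn't.

def score(answer: str, gold: str, abstain_token: str = "I don't know") -> float:
    if answer == abstain_token:
        return 0.0  # abstention: neutral, not penalized as an error
    return 1.0 if answer == gold else -1.0  # wrong confident answers cost points

# Expected score of guessing with confidence p is 2p - 1, so abstaining (0.0)
# is the better move whenever p < 0.5.
print(score("Paris", "Paris"), score("Lyon", "Paris"), score("I don't know", "Paris"))
```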
