elvis
Building with AI agents @dair_ai • Prev: Meta AI, Galactica LLM, Elastic, PaperswithCode, PhD • I share insights on how to build with LLMs & AI Agents ⬇️
Jun 2 7 tweets 3 min read
Reasoning Models Thinking Slow and Fast at Test Time

Another super cool work on improving reasoning efficiency in LLMs.

They show that slow-then-fast reasoning outperforms other strategies.

Here are my notes:

What's the high level?

Introduces a universal framework, AlphaOne (α1), for modulating the reasoning progress of large reasoning models (LRMs) during inference.

Rather than relying on rigid or automatic schedules, α1 explicitly controls when and how models engage in “slow thinking” using a tunable parameter α.
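To make the idea concrete, here's a rough sketch of what an α-style schedule could look like at decode time. This is my own toy framing, not the AlphaOne code: the cue strings, the injection probability, and the stub "model" are all assumptions.

```python
import random

# Hypothetical sketch of an alpha-style test-time schedule (my naming,
# not the AlphaOne implementation): spend the first alpha * budget
# decoding steps in "slow thinking" (reflection cues allowed), then
# switch to "fast thinking" (cues suppressed, model pushed to answer).

SLOW_CUE = "Wait,"     # assumed cue that encourages re-checking a step
FAST_CUE = "</think>"  # assumed cue that ends thinking, starts answering

def alpha_schedule(budget: int, alpha: float):
    """Yield the phase ('slow' or 'fast') for each decoding step."""
    transition = int(alpha * budget)
    for step in range(budget):
        yield "slow" if step < transition else "fast"

def generate_with_alpha(fake_next_token, budget=20, alpha=0.6, seed=0):
    """Toy decoding loop: in the slow phase we occasionally inject the
    reflection cue; in the fast phase we cut reflection short the first
    time the model tries to slow down again."""
    rng = random.Random(seed)
    tokens = []
    for phase in alpha_schedule(budget, alpha):
        token = fake_next_token(rng)
        if phase == "slow" and rng.random() < 0.2:
            token = SLOW_CUE          # deliberately slow down
        elif phase == "fast" and token == SLOW_CUE:
            token = FAST_CUE          # stop reflecting, answer
        tokens.append(token)
        if token == FAST_CUE:
            break
    return tokens

# Stub "model" for demonstration only.
print(generate_with_alpha(lambda rng: rng.choice(["step", SLOW_CUE])))
```

The slow-then-fast finding drops out naturally here: a larger α front-loads reflection, then the hard cutoff forces commitment.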
May 31 10 tweets 4 min read
Open-Ended Evolution of Self-Improving Agents

Can AI systems endlessly improve themselves?

This work shows the potential of self-improving AI, inspired by biological evolution and open-ended exploration.

This is a must-read!

Here are my notes: Image What's the high level?

This work presents the Darwin Gödel Machine (DGM), a system that advances the vision of self-improving AI by combining self-referential code modification with open-ended evolutionary search...
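Here's a toy sketch of the open-ended loop as I read it. The two stub functions stand in for the paper's real components (an LLM that edits the agent's own code, and a coding benchmark), so treat this as an illustration of the archive idea, not the DGM itself.

```python
import random

# Toy sketch of a Darwin Gödel Machine-style loop (my simplification):
# an archive of agent variants grows open-endedly; parents are sampled
# from the whole archive (not just the current best), mutated via
# self-referential code edits, and kept regardless of score so that
# "stepping stones" are never thrown away.

def propose_patch(agent_code: str, rng) -> str:
    """Stand-in for an LLM that rewrites the agent's own code."""
    return agent_code + f"  # patch-{rng.randint(0, 999)}"

def evaluate(agent_code: str, rng) -> float:
    """Stand-in for running the agent on a coding benchmark."""
    return rng.random()

def dgm_loop(seed_agent: str, iterations: int = 10, seed: int = 0):
    rng = random.Random(seed)
    archive = [(seed_agent, evaluate(seed_agent, rng))]
    for _ in range(iterations):
        parent, _ = rng.choice(archive)       # open-ended parent choice
        child = propose_patch(parent, rng)    # self-modification step
        archive.append((child, evaluate(child, rng)))  # keep everything
    return max(archive, key=lambda pair: pair[1])

best_code, best_score = dgm_loop("def solve(task): ...")
print(f"best score so far: {best_score:.2f}")
```

The key design choice is sampling parents from the whole archive: weak variants can still seed strong descendants, which is what makes the search open-ended rather than hill-climbing.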
May 30 5 tweets 2 min read
Building Production-Grade Conversational Agents with Workflow Graphs

Uses a DAG to design robust and complex agentic systems.

If you're building AI agents, this is worth a read.

Here are my notes:

Quick overview

This paper presents a pragmatic, production-ready framework for building LLM-powered conversational agents using workflow graphs, with a specific focus on e-commerce scenarios.

Instead of relying solely on end-to-end generation, the authors design agents using a directed acyclic graph (DAG), enabling flexible yet controllable interactions that adhere to strict business rules and format constraints.
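A minimal sketch of the pattern, assuming a generic node/edge design (the node names, handlers, and routing rules below are my illustration, not the paper's graph):

```python
# Minimal sketch of a workflow-graph agent: each node owns a handler
# and a routing rule, and execution walks the DAG so business
# constraints are enforced by structure rather than by a single
# end-to-end prompt.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    name: str
    handle: Callable[[str], str]             # produce a reply
    route: Callable[[str], Optional[str]]    # pick the next node (or stop)

def run_workflow(nodes: dict, start: str, user_msg: str):
    current, transcript = start, []
    while current is not None:
        node = nodes[current]
        transcript.append((node.name, node.handle(user_msg)))
        current = node.route(user_msg)       # DAG edge, no cycles
    return transcript

nodes = {
    "intake": Node("intake", lambda m: "How can I help?",
                   lambda m: "refund" if "refund" in m else "faq"),
    "refund": Node("refund", lambda m: "Checking refund policy...",
                   lambda m: None),
    "faq": Node("faq", lambda m: "Here's our FAQ answer.",
                lambda m: None),
}
print(run_workflow(nodes, "intake", "I want a refund"))
```

In practice each handler would be an LLM call with a node-specific prompt, but the control flow stays deterministic, which is what makes the agent auditable.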
May 29 8 tweets 3 min read
An Operating System for Memory-Augmented Generation in LLMs

Lots of great ideas on how to think about memory and better manage it in LLM-based agents.

Must read!

Here are my notes:

It introduces a unified operating system for managing memory in LLMs, addressing a key limitation in current architectures: their lack of structured, persistent, and governable memory...
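To make the OS analogy concrete, here's a toy sketch in my own framing (the tier names and promotion rule are assumptions, not the paper's design): a size-limited working memory pages cold items out to a persistent store, and recalled items get paged back in.

```python
from collections import OrderedDict

class MemoryOS:
    """LRU-style working memory backed by a persistent long-term store."""

    def __init__(self, working_capacity: int = 4):
        self.capacity = working_capacity
        self.working = OrderedDict()   # fast, size-limited tier
        self.long_term = {}            # persistent, unbounded tier

    def write(self, key, value):
        self.working[key] = value
        self.working.move_to_end(key)
        if len(self.working) > self.capacity:
            old_key, old_val = self.working.popitem(last=False)
            self.long_term[old_key] = old_val   # page out coldest item

    def read(self, key):
        if key in self.working:
            self.working.move_to_end(key)       # keep hot items hot
            return self.working[key]
        if key in self.long_term:
            value = self.long_term.pop(key)
            self.write(key, value)              # page back in on recall
            return value
        return None

mem = MemoryOS(working_capacity=2)
for i in range(3):
    mem.write(f"fact-{i}", f"user likes topic {i}")
print(mem.read("fact-0"))   # recalled from long-term, promoted back
```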
May 28 7 tweets 3 min read
New Lens on RAG Systems

RAG systems are more brittle than you think, even when provided sufficient context.

Great work from Google and collaborators.

Good tips for devs included.

Here are my notes:

What is the paper about?

It introduces a new empirical framework for analyzing RAG systems through the lens of sufficient context: whether the retrieved content alone is enough to answer a query.

This notion helps decouple retrieval failures from generation errors in LLMs.
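Here's a sketch of that decomposition. The judge below is a keyword placeholder just so the snippet runs; the paper uses an LLM-based autorater for the sufficiency call.

```python
# Label each RAG example by whether the retrieved context alone could
# answer the query, then attribute wrong answers to retrieval vs.
# generation.

def is_context_sufficient(query: str, context: str) -> bool:
    """Placeholder judge. In practice, ask a strong LLM:
    'Can the question be answered from this context alone?'"""
    return any(word in context.lower() for word in query.lower().split())

def attribute_failure(query, context, answer_correct):
    if answer_correct:
        return "ok"
    if is_context_sufficient(query, context):
        return "generation_error"  # context had the answer, model missed it
    return "retrieval_failure"     # context never contained the answer

print(attribute_failure("capital of France",
                        "Paris is the capital of France.",
                        answer_correct=False))   # -> generation_error
```

That "generation_error" bucket is the brittleness the paper is pointing at: wrong answers even when the context was sufficient.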
May 27 9 tweets 3 min read
NEW: Mistral AI announces Agents API

- code execution
- web search
- MCP tools
- persistent memory
- agentic orchestration capabilities

Cool to see Mistral AI join the growing list of providers shipping agent frameworks.

More below:

There is nice documentation for this release.

You can see what's supported below.

Persistent state across conversations, image generation, handoff capabilities, structured outputs, document understanding, citations, and more.
May 22 6 tweets 2 min read
Learn to Reason via Mixture-of-Thought

Interesting paper on improving LLM reasoning by using multiple reasoning modalities:

- code
- natural language
- symbolic (truth-table) representations

Cool idea and nice results.

My notes below:

TL;DR:

While most prior approaches train with a single modality and only ensemble during inference, this work introduces Mixture-of-Thought (MoT) to jointly train and infer across modalities, resulting in notable gains in logical reasoning performance.
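The inference side reduces to a vote across modalities. A minimal sketch, with stub solvers standing in for the jointly trained model reasoning in each modality:

```python
from collections import Counter

# Sketch of MoT-style inference: solve the same problem in natural
# language, code, and truth-table form, then majority-vote the final
# answers. Modalities err differently, so the vote can recover from
# single-modality mistakes.

def solve_natural_language(problem): return "yes"   # stub
def solve_code(problem):             return "yes"   # stub
def solve_truth_table(problem):      return "no"    # stub

def mot_infer(problem: str) -> str:
    answers = [
        solve_natural_language(problem),
        solve_code(problem),
        solve_truth_table(problem),
    ]
    return Counter(answers).most_common(1)[0][0]

print(mot_infer("If A implies B and A holds, does B hold?"))  # -> yes
```

The paper's contribution is training all three modalities jointly in one model, so each solver above would be the same LLM prompted into a different reasoning mode.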
May 21 7 tweets 3 min read
Efficiency in LLMs

Pay attention, devs.

This is one of the most comprehensive benchmarks to date on improving the efficiency of LLMs.

You don't see reports like this every day.

Here are my notes:

Conducted on a production-grade GPU cluster, the study evaluates over 100 model-technique combinations using six orthogonal metrics (memory, compute, latency, throughput, energy, and compression).

It offers actionable insights for researchers, engineers, and practitioners, revealing critical trade-offs and guiding optimal deployment decisions.
May 20 7 tweets 3 min read
A Survey on LLMs in Scientific Discovery

The next step for AI agents is scientific discovery.

This is a great paper summarizing current trends and future directions.

Here are my notes:

What's the paper about?

This paper presents a conceptual framework to understand the evolving role of LLMs in scientific discovery, emphasizing their progression from task-specific tools to autonomous scientific agents.

Anchored in the stages of the scientific method, the survey proposes a three-level taxonomy (LLM as Tool, Analyst, and Scientist) and categorizes over 90 research works accordingly.
May 19 9 tweets 4 min read
The Pitfalls of Reasoning for Instruction-Following in LLMs

If you're a dev using reasoning models, read this one.

Lots of great insights and mitigation tactics.

Here are my notes:

Main finding of the paper:

This paper uncovers a counterintuitive weakness in today’s reasoning-enhanced language models: explicit step-by-step reasoning (via chain-of-thought, CoT) can actually harm a model’s ability to follow instructions with constraints, rather than help.
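One mitigation pattern in this spirit is to verify hard constraints programmatically and fall back to a direct, no-CoT pass when reasoning breaks them. This sketch is my illustration, not the paper's exact method; the constraint checks and stub model calls are assumptions.

```python
def violates(answer: str, max_words: int, must_include: str) -> bool:
    """Check two example hard constraints: length and required content."""
    return len(answer.split()) > max_words or must_include not in answer

def answer_with_guardrail(ask_with_cot, ask_direct, prompt,
                          max_words=10, must_include="Paris"):
    answer = ask_with_cot(prompt)
    if violates(answer, max_words, must_include):
        answer = ask_direct(prompt)   # CoT drifted; retry without it
    return answer

# Stub model calls for demonstration only.
cot = lambda p: "Let me think step by step... it might be Lyon, actually."
direct = lambda p: "Paris."
print(answer_with_guardrail(cot, direct, "Capital of France, one word"))
```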
May 18 7 tweets 3 min read
Understanding Reasoning Capabilities in LLMs

This report shares insights and tips for using reasoning models.

Great read for devs.

Here are my notes:

What is the paper about?

Runs a comprehensive evaluation of LLM reasoning in dynamic tasks by comparing prompting strategies like self-reflection, heuristic mutation, and planning.

It assesses adaptive decision-making across interactive tasks on open-source models.
May 17 9 tweets 4 min read
AI Agents vs. Agentic AI

Interesting paper summarizing distinctions between AI Agents and Agentic AI.

It also talks about the key ideas, solutions, and the future.

Here are my notes:

What is the paper about?

The paper provides a comprehensive taxonomy and comparison between AI Agents and Agentic AI, clarifying their conceptual, architectural, and operational differences.
May 16 14 tweets 4 min read
BREAKING: OpenAI announces research preview of Codex in ChatGPT

Next-level coding agent within ChatGPT.

Pay attention, devs and non-devs!

Here is all you need to know:

What's being released?

A remote software engineering agent, Codex. It can run many coding tasks in parallel.

Available for Pro, Enterprise, and Team ChatGPT users starting today.
May 16 5 tweets 2 min read
The CoT Encyclopedia

How to predict and steer the reasoning strategies of LLMs that use chain-of-thought (CoT)?

More below:

The framework automatically extracts diverse reasoning criteria from model-generated CoTs, embeds and clusters them, and generates human-interpretable contrastive rubrics.

This enables a more nuanced and comprehensive classification of reasoning strategies.
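Here's a tiny sketch of the embed-and-cluster step. TF-IDF plus k-means is a stand-in I chose so the snippet is self-contained; the paper's pipeline uses model-extracted criteria and stronger embeddings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy criteria strings of the kind extracted from CoT traces.
criteria = [
    "breaks the problem into smaller subgoals",
    "decomposes the task into steps",
    "checks the answer by substituting back",
    "verifies the result against the constraints",
]

# Embed the criteria and cluster them into candidate strategy groups.
vectors = TfidfVectorizer().fit_transform(criteria)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for criterion, label in zip(criteria, labels):
    print(label, criterion)
```

With richer embeddings, clusters like "decomposition" vs. "verification" fall out, and those become the contrastive rubrics used to classify and steer reasoning.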
May 14 10 tweets 4 min read
LLMs Get Lost in Multi-turn Conversation

The cat is out of the bag.

Pay attention, devs.

This is one of the most common issues when building with LLMs today.

Glad there is now a paper sharing these insights.

Here are my notes:

The paper investigates how LLMs perform in realistic, multi-turn conversational settings where user instructions are often underspecified and clarified over several turns.

I keep telling devs to spend time preparing those initial instructions. Prompt engineering is important.
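A practical pattern when a conversation has drifted: consolidate all the shards of the request into one fully specified single-turn prompt and start fresh. The helper below is my illustration of that idea.

```python
def consolidate(turns: list) -> str:
    """Restate a drifting multi-turn request as one complete prompt."""
    bullet_list = "\n".join(f"- {turn}" for turn in turns)
    return (
        "Complete the following request. All requirements are listed "
        f"up front:\n{bullet_list}\nAnswer in one response."
    )

turns = [
    "Write a Python function to dedupe a list.",
    "Actually, it should preserve order.",
    "And it has to handle unhashable items.",
]
print(consolidate(turns))   # send this as a fresh single-turn prompt
```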
May 10 6 tweets 3 min read
Multi-Agent Embodied AI

Physical AI is the next big thing.

Here is a nice survey on current progress.

5 parts I found interesting from this paper:

1. Why does it matter?

This survey maps the fast-growing landscape of embodied systems that move beyond a single robot to teams of heterogeneous agents, reviewing 300+ papers and distilling where the field stands and where it can go next.

Most work still focuses on single-agent settings. The paper spotlights the growing need for collaborative, multi-agent systems capable of handling the complexity of real-world environments.
May 3 6 tweets 3 min read
A Survey of AI Agent Protocols

5 things that stood out to me about this report:

Agent Internet Ecosystem

Here is what the layered architecture of the agent internet ecosystem currently looks like. It spans layers such as the Agent Internet, the Protocol Layer, and the Application Layer.
May 1 8 tweets 3 min read
Small reasoning models are here!

Microsoft just released Phi-4-Mini-Reasoning to explore small reasoning language models for math.

Let's find out how this all works:

Phi-4-Mini-Reasoning

The paper introduces Phi-4-Mini-Reasoning, a 3.8B parameter small language model (SLM) that achieves state-of-the-art mathematical reasoning performance, rivaling or outperforming models nearly TWICE its size.
Apr 30 7 tweets 3 min read
Universal RAG

RAG is dead, they said.

Then a paper like this comes along and gives you a better understanding of the opportunities and challenges ahead.

Lots of great ideas in this paper. I've summarized a few below:

What is it?

UniversalRAG is a framework that overcomes the limitations of existing RAG systems confined to single modalities or corpora. It supports retrieval across modalities (text, image, video) and at multiple granularities (e.g., paragraph vs. document, clip vs. video).
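The routing step is the interesting bit: decide modality and granularity before retrieving. A sketch with a keyword router (the corpus names and rules are my placeholders; UniversalRAG uses a trained or prompted router):

```python
# Map (modality, granularity) pairs to hypothetical retrieval indexes.
CORPORA = {
    ("text", "paragraph"): "paragraph index",
    ("text", "document"):  "document index",
    ("image", "image"):    "image index",
    ("video", "clip"):     "clip index",
    ("video", "video"):    "full-video index",
}

def route(query: str) -> tuple:
    """Pick the modality and granularity best suited to the query."""
    q = query.lower()
    if "diagram" in q or "photo" in q:
        return ("image", "image")
    if "scene" in q or "clip" in q:
        return ("video", "clip")
    if "entire" in q or "whole" in q:
        return ("text", "document")
    return ("text", "paragraph")      # default: fine-grained text

modality, granularity = route("Show me the scene where they meet")
print(CORPORA[(modality, granularity)])   # -> clip index
```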
Apr 29 9 tweets 3 min read
Building Production-Ready AI Agents with Scalable Long-Term Memory

Memory is one of the most challenging bits of building production-ready agentic systems.

Lots of goodies in this paper.

Here is my breakdown:

What does it solve?

It proposes a memory-centric architecture for LLM agents to maintain coherence across long conversations and sessions, solving the fixed-context window limitation.
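The core loop is extract-then-update: pull salient facts out of each turn and upsert them into a store, letting newer facts overwrite stale ones. A sketch under that assumption; the rule-based extractor below stands in for the paper's LLM-driven steps.

```python
def extract_facts(turn: str) -> dict:
    """Stand-in for an LLM extractor returning subject -> fact."""
    facts = {}
    if "vegetarian" in turn:
        facts["diet"] = "vegetarian"
    if "moved to" in turn:
        facts["location"] = turn.split("moved to ")[-1].rstrip(".")
    return facts

def update_memory(store: dict, turn: str) -> dict:
    for subject, fact in extract_facts(turn).items():
        store[subject] = fact        # upsert: latest fact wins
    return store

store = {}
update_memory(store, "I'm vegetarian, by the way.")
update_memory(store, "We moved to Lisbon.")
print(store)   # {'diet': 'vegetarian', 'location': 'Lisbon'}
```

Because memory lives outside the context window, only the relevant facts get injected into each prompt, which is how coherence survives across sessions.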
Apr 29 5 tweets 2 min read
A Survey of Efficient LLM Inference Serving

This one provides a comprehensive taxonomy of recent system-level innovations for efficient LLM inference serving.

Great overview for devs working on inference.

Here is what's included:

Instance-Level Methods

Techniques like model parallelism (pipeline, tensor, context, and expert parallelism), offloading (e.g., ZeRO-Offload, FlexGen, TwinPilots), and request scheduling (inter- and intra-request) are reviewed...
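To give a flavor of the request-scheduling family: a tiny shortest-predicted-job-first sketch. The length estimates are placeholders; real serving systems predict output length or use continuous batching.

```python
import heapq

def schedule(requests: list) -> list:
    """requests: (prompt, predicted_output_tokens) pairs.
    Run shorter predicted jobs first to cut average waiting time
    versus plain FIFO."""
    heap = [(predicted, prompt) for prompt, predicted in requests]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

requests = [("summarize this book", 900),
            ("yes/no question", 5),
            ("write a haiku", 30)]
print(schedule(requests))   # short jobs first
```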