AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Another super cool work on improving reasoning efficiency in LLMs.
They show that a slow-then-fast reasoning schedule outperforms other test-time scaling strategies.
Here are my notes:
What's the high level?
Introduces a universal framework, AlphaOne (α1), for modulating the reasoning progress of large reasoning models (LRMs) during inference.
Rather than relying on rigid or automatic schedules, α1 explicitly controls when and how models engage in “slow thinking” using a tunable parameter α.
Token magic
The method dynamically inserts “wait” tokens to encourage deeper reasoning & then deterministically ends slow thinking with a “</think>” token to prompt efficient answer generation.
This yields better accuracy & efficiency than previous test-time scaling methods.
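To make the mechanism concrete, here is a minimal sketch of a slow-then-fast decoding loop in the spirit of α1. Everything in it is an assumption for illustration: model.step, the token strings, the default budget, and the constant insertion probability (the paper schedules that probability rather than fixing it) are not the authors' actual API or values.

```python
import random

WAIT, END_THINK, DELIM = "wait", "</think>", "\n\n"

def alpha_one_generate(model, prompt_ids, avg_think_len=1000, alpha=1.4,
                       p_insert=0.4, max_tokens=4096):
    """Slow-then-fast decoding sketch: before the 'alpha moment'
    (alpha * average thinking length), occasionally append a 'wait'
    token after paragraph breaks to prolong slow thinking; after it,
    deterministically emit '</think>' to force fast answer generation."""
    alpha_moment = int(alpha * avg_think_len)  # slow-thinking token budget
    out = list(prompt_ids)
    for t in range(max_tokens):
        tok = model.step(out)  # hypothetical: returns the next token string
        if t < alpha_moment:
            # Slow phase: stochastically encourage more reflection.
            # (A constant probability here; the paper uses a schedule.)
            if tok == DELIM and random.random() < p_insert:
                out += [tok, WAIT]
                continue
        elif tok in (WAIT, DELIM):
            # Past the alpha moment: end slow thinking deterministically.
            tok = END_THINK
        out.append(tok)
        if tok == "<eos>":
            break
    return out
```

The key design point is the single knob: raising alpha buys more slow thinking, while the deterministic "</think>" guarantees the model eventually commits to an answer.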
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
This work shows the potential of self-improving AI, inspired by biological evolution and open-ended exploration.
This is a must-read!
Here are my notes:
What's the high level?
This work presents the Darwin Gödel Machine (DGM), a system that advances the vision of self-improving AI by combining self-referential code modification with open-ended evolutionary search...
Unlike the original Gödel machine, which requires provable benefits for code changes (a practically intractable constraint), the DGM adopts an empirical approach: it modifies its own codebase and evaluates improvements on coding benchmarks.
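The empirical loop is easy to picture. Below is a minimal sketch of a DGM-style archive-and-mutate loop; run_benchmark, self_modify, and sample_parent are stand-in stubs I'm assuming for illustration, not the paper's components.

```python
import random

def run_benchmark(code: str) -> float:
    """Stub: the real system scores the agent on coding benchmarks (e.g., SWE-bench)."""
    return random.random()

def self_modify(code: str) -> str:
    """Stub: the real agent proposes and applies a patch to its own codebase."""
    return code + "\n# self-edit"

def sample_parent(archive: list) -> dict:
    """Stub: the paper samples parents to balance performance and novelty."""
    return random.choice(archive)

def dgm_loop(seed_code: str, iterations: int = 100) -> dict:
    """Open-ended search: keep an archive of every agent produced, sample a
    parent, let it rewrite its own code, and score the child empirically
    rather than requiring a formal proof of improvement."""
    archive = [{"code": seed_code, "score": run_benchmark(seed_code)}]
    for _ in range(iterations):
        parent = sample_parent(archive)
        child_code = self_modify(parent["code"])
        try:
            score = run_benchmark(child_code)
        except Exception:
            continue  # broken self-edits fail evaluation and are dropped
        archive.append({"code": child_code, "score": score})  # lineages are kept
    return max(archive, key=lambda a: a["score"])
```

Keeping the whole archive (rather than only the current best) is what makes the search open-ended: a weak branch today can seed a strong one later.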
Building Production-Grade Conversational Agents with Workflow Graphs
Uses a DAG to design robust and complex agentic systems.
If you're building AI agents, this is worth a read.
Here are my notes:
Quick overview
This paper presents a pragmatic, production-ready framework for building LLM-powered conversational agents using workflow graphs, with a specific focus on e-commerce scenarios.
Instead of relying solely on end-to-end generation, the authors design agents using a directed acyclic graph (DAG), enabling flexible yet controllable interactions that adhere to strict business rules and format constraints.
Multi-State DAG Framework
Each node in the graph corresponds to a conversational state with its own system prompt, tool access, and execution rules.
This structure enables robust constraint handling (e.g., avoiding hallucinated responses or non-compliant suggestions) by localizing logic and formatting within specific graph nodes.
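Here is a minimal sketch of what such a graph can look like, assuming a hypothetical e-commerce agent; the node names, prompts, and tools are illustrative, not the paper's framework.

```python
from dataclasses import dataclass, field

@dataclass
class StateNode:
    """One conversational state: its own prompt, tool access, and DAG edges."""
    name: str
    system_prompt: str                               # localized rules and format
    tools: list[str] = field(default_factory=list)   # tools this state may call
    edges: list[str] = field(default_factory=list)   # allowed next states

GRAPH = {
    "intent": StateNode("intent", "Classify the shopper's request.",
                        edges=["search", "order_status"]),
    "search": StateNode("search", "Recommend only in-catalog items.",
                        tools=["catalog_search"], edges=["checkout"]),
    "order_status": StateNode("order_status", "Answer from order records only.",
                              tools=["order_lookup"]),
    "checkout": StateNode("checkout", "Follow the payment policy verbatim.",
                          tools=["create_order"]),
}

def route(state: str, next_state: str) -> str:
    """Transitions are valid only along declared edges, so the agent cannot
    drift into a state whose rules it was never prompted with."""
    if next_state not in GRAPH[state].edges:
        raise ValueError(f"illegal transition {state} -> {next_state}")
    return next_state
```

Because each node carries its own system prompt and tool whitelist, a formatting or compliance rule only has to be enforced in the one state where it applies.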
MemOS: An Operating System for Memory-Augmented Generation in LLMs
Lots of great ideas on how to think about memory and better manage it in LLM-based agents.
Must read!
Here are my notes:
What's the high level?
It introduces MemOS, a unified operating system for managing memory in LLMs, addressing a key limitation in current architectures: their lack of structured, persistent, and governable memory...
While today's LLMs rely primarily on parametric memory (model weights) and limited short-term context, MemOS proposes a comprehensive memory lifecycle and management infrastructure designed to support continual learning, behavioral consistency, and knowledge evolution.
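To ground the idea, here is a minimal sketch of a MemOS-style memory unit with a simple lifecycle pass. The field names, the memory-type labels, and the naive substring retrieval are assumptions for illustration, not the system's real schema or API.

```python
from dataclasses import dataclass, field
import time

@dataclass
class MemoryUnit:
    content: str
    kind: str                  # e.g., "parametric" | "activation" | "plaintext"
    created: float = field(default_factory=time.time)
    hits: int = 0              # usage counter informs promotion/archiving
    state: str = "active"      # lifecycle: active -> archived

class MemOSSketch:
    def __init__(self):
        self.store: list[MemoryUnit] = []

    def write(self, content: str, kind: str = "plaintext") -> None:
        self.store.append(MemoryUnit(content, kind))

    def read(self, query: str) -> list[MemoryUnit]:
        """Naive substring retrieval; a real system would use semantic search
        plus governance checks (permissions, provenance)."""
        found = [m for m in self.store
                 if m.state == "active" and query.lower() in m.content.lower()]
        for m in found:
            m.hits += 1
        return found

    def evolve(self, max_idle: float = 86_400) -> None:
        """Lifecycle pass: archive stale, never-used memories so the active
        set stays small and the agent's behavior stays consistent."""
        now = time.time()
        for m in self.store:
            if m.state == "active" and m.hits == 0 and now - m.created > max_idle:
                m.state = "archived"
```

The point of the sketch is the lifecycle: memory is created, used, demoted, and audited through one interface instead of living ad hoc in the prompt.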
Sufficient Context: A New Lens on Retrieval Augmented Generation Systems
RAG systems are more brittle than you think, even when provided with sufficient context.
Great work from Google and collaborators.
Good tips for devs included.
Here are my notes:
What is the paper about?
It introduces a new empirical framework for analyzing RAG systems through the lens of sufficient context: whether the retrieved content alone enables answering a query.
This notion helps decouple retrieval failures from generation errors in LLMs.
New definition and classifier for sufficient context
The authors formalize “sufficient context” as context that plausibly allows answering a query, without requiring ground truth.
They develop a high-accuracy LLM-based autorater (Gemini 1.5 Pro, 93% accuracy) to label instances as having sufficient or insufficient context, enabling large-scale evaluation without needing ground-truth answers.
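Here is a minimal sketch of such an autorater, assuming a generic call_llm helper; the prompt wording is illustrative, not the authors' rubric (they use Gemini 1.5 Pro as the rater).

```python
# Hypothetical prompt; the paper's actual rubric differs.
AUTORATER_PROMPT = """\
Question: {question}
Retrieved context: {context}

Does the context above contain enough information to plausibly answer the
question, without relying on outside knowledge? Reply with exactly one
word: SUFFICIENT or INSUFFICIENT."""

def label_sufficiency(question: str, context: str, call_llm) -> bool:
    """Returns True when the retrieved context alone supports an answer.
    Note the label needs no ground-truth answer: it judges the context,
    not the model's response."""
    verdict = call_llm(AUTORATER_PROMPT.format(question=question, context=context))
    return verdict.strip().upper().startswith("SUFFICIENT")
```

In practice this lets you split a RAG system's failures into retrieval failures (insufficient context) versus generation errors (sufficient context, wrong answer) when auditing it.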