Then you see papers like this, and they give you a better understanding of the opportunities and challenges ahead.
Lots of great ideas in this paper. I've summarized a few below:
What is it?
UniversalRAG is a framework that overcomes a key limitation of existing RAG systems, which are typically confined to a single modality or corpus. It supports retrieval across modalities (text, image, video) and at multiple granularities (e.g., paragraph vs. document, clip vs. full video).
Modality-aware routing
To counter modality bias in unified embedding spaces (where queries often retrieve same-modality results regardless of relevance), UniversalRAG introduces a router that dynamically selects the appropriate modality (e.g., image vs. text) for each query.
Granularity-aware retrieval
Each modality is broken into granularity levels (e.g., paragraphs vs. documents for text, clips vs. full-length videos). This lets a query retrieve content that matches its complexity: factual queries use short segments, while complex reasoning draws on long-form data.
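To make the routing idea concrete, here is a minimal Python sketch of modality- and granularity-aware routing. The corpus names, the classify callable, and the indexes[corpus].search() interface are placeholders of mine, not the paper's API; the point is that the router picks exactly one corpus and only that corpus's index is searched, so similarity scores from different modalities are never compared directly.

```python
from dataclasses import dataclass
from typing import Callable

# One corpus per modality/granularity combination (names are illustrative).
CORPORA = ["text_paragraph", "text_document", "image", "video_clip", "video_full"]

@dataclass
class RoutedResult:
    corpus: str
    items: list

def route(query: str, classify: Callable[[str], str]) -> str:
    # classify() stands in for either a zero-shot LLM prompt or a trained
    # T5-style classifier; it should return one of CORPORA.
    choice = classify(query)
    return choice if choice in CORPORA else "text_paragraph"  # safe fallback

def retrieve(query: str, corpus: str, indexes: dict) -> RoutedResult:
    # Each corpus keeps its own index, so cross-modal scores are never mixed,
    # which is what sidesteps the modality bias of a single shared embedding space.
    hits = indexes[corpus].search(query, k=5)
    return RoutedResult(corpus=corpus, items=hits)
```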
Flexible routing
It supports both training-free (zero-shot GPT-4o prompting) and trained (T5-Large) routers. Trained routers perform better on in-domain data, while GPT-4o generalizes better to out-of-domain tasks. An ensemble router combines both for robust performance.
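Here is a hedged sketch of how an ensemble of the two routers might be wired up; the confidence-threshold rule is my own illustration, not necessarily the paper's exact combination method.

```python
def ensemble_route(query: str, trained_router, zero_shot_router, threshold: float = 0.7) -> str:
    # trained_router: e.g., a fine-tuned T5 classifier returning (label, confidence)
    # zero_shot_router: e.g., a prompted GPT-4o call returning a corpus label
    label, confidence = trained_router(query)
    if confidence >= threshold:
        return label                    # in-domain: the trained router is usually right
    return zero_shot_router(query)      # likely out-of-domain: defer to the zero-shot router
```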
Performance
UniversalRAG outperforms modality-specific and unified RAG baselines across 8 benchmarks spanning text (e.g., MMLU, SQuAD), image (WebQA), and video (LVBench, VideoRAG). With the T5-Large router, it achieves the highest average score across modalities.
Case study
In WebQA, UniversalRAG correctly routes a visual query to the image corpus (retrieving an actual photo of the event), while TextRAG and VideoRAG fail. Similarly, on HotpotQA and LVBench, it chooses the right granularity, retrieving documents or short clips.
Overall, this is a great paper showing the importance of considering modality and granularity in a RAG system.
Microsoft just released Phi-4-Mini-Reasoning to explore small reasoning language models for math.
Let's find out how this all works:
Phi-4-Mini-Reasoning
The paper introduces Phi-4-Mini-Reasoning, a 3.8B parameter small language model (SLM) that achieves state-of-the-art mathematical reasoning performance, rivaling or outperforming models nearly TWICE its size.
Unlocking Reasoning
They use a systematic, multi-stage training pipeline to unlock strong reasoning capabilities in compact models, addressing the challenges posed by their limited capacity.
The pipeline combines large-scale distillation, preference learning, and RL with verifiable rewards.
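To illustrate the "RL with verifiable rewards" part, here is a small rule-based reward sketch for math: the answer format (\boxed{...}) and exact-match rule are assumptions for illustration, and the paper's actual verifier may differ.

```python
import re

def extract_final_answer(text: str) -> str | None:
    # Assumes the final answer is written as \boxed{...}; real verifiers also
    # normalize fractions, units, and equivalent expressions.
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    return m.group(1).strip() if m else None

def verifiable_reward(completion: str, reference: str) -> float:
    # Binary reward: 1.0 if the extracted answer matches the reference exactly.
    pred = extract_final_answer(completion)
    return 1.0 if pred is not None and pred == reference.strip() else 0.0

# verifiable_reward("... so the result is \\boxed{42}.", "42")  -> 1.0
```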
Building Production-Ready AI Agents with Scalable Long-Term Memory
Memory is one of the most challenging bits of building production-ready agentic systems.
Lots of goodies in this paper.
Here is my breakdown:
What does it solve?
It proposes a memory-centric architecture that lets LLM agents maintain coherence across long conversations and sessions, addressing the fixed context window limitation.
The solution:
Introduces two systems: Mem0, a dense, language-based memory system, and Mem0g, an enhanced version with graph-based memory to model complex relationships.
Both aim to extract, consolidate, and retrieve salient facts over time efficiently.
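A minimal sketch of that extract-consolidate-retrieve loop is below; the data structures and naive word-overlap scoring are mine (Mem0 uses LLM-based extraction with a vector store, and Mem0g layers a graph of entities and relations on top).

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    facts: list[str] = field(default_factory=list)

    def consolidate(self, new_facts: list[str]) -> None:
        # Naive consolidation: keep unseen facts; a real system would also
        # update or delete facts that later turns contradict.
        for fact in new_facts:
            if fact not in self.facts:
                self.facts.append(fact)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Naive relevance via word overlap; a real system would use embeddings.
        q_words = set(query.lower().split())
        ranked = sorted(self.facts,
                        key=lambda f: len(q_words & set(f.lower().split())),
                        reverse=True)
        return ranked[:k]

def process_turn(store: MemoryStore, turn: str, extract_facts) -> None:
    # extract_facts() stands in for an LLM call that pulls salient facts out of the turn.
    store.consolidate(extract_facts(turn))
```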
This one provides a comprehensive taxonomy of recent system-level innovations for efficient LLM inference serving.
Great overview for devs working on inference.
Here is what's included:
Instance-Level Methods
Techniques like model parallelism (pipeline, tensor, context, and expert parallelism), offloading (e.g., ZeRO-Offload, FlexGen, TwinPilots), and request scheduling (inter- and intra-request) are reviewed.
Novel schedulers like FastServe, Prophet, and INFERMAX optimize decoding with predicted request lengths. KV cache optimization covers paging, reuse (lossless and semantic-aware), and compression (e.g., 4-bit quantization, compact encodings).
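As a toy illustration of the KV cache compression angle, here is a 4-bit quantize/dequantize round trip in NumPy with per-row scales; production stacks use fused kernels, group-wise scales, and actually pack two 4-bit values per byte.

```python
import numpy as np

def quantize_4bit(kv: np.ndarray):
    # kv: (tokens, head_dim) float32 slice of the KV cache
    lo = kv.min(axis=1, keepdims=True)
    hi = kv.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8                     # 4 bits -> 16 levels
    q = np.round((kv - lo) / scale).astype(np.uint8)    # values in [0, 15]
    return q, scale, lo

def dequantize_4bit(q: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale + lo

kv = np.random.randn(8, 64).astype(np.float32)
q, scale, lo = quantize_4bit(kv)
print(np.abs(dequantize_4bit(q, scale, lo) - kv).max())  # small reconstruction error
```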
265 pages of everything you need to know about building AI agents.
5 things that stood out to me about this report:
1. Human Brain and LLM Agents
Great for better understanding what differentiates LLM agents from the human brain, and what inspiration we can draw from the way humans learn and operate.
2. Definitions
There is a nice, detailed, and formal definition of what makes up an AI agent. Most of the definitions out there are too abstract.
Agent2Agent (A2A) is a new open protocol that lets AI agents securely collaborate across ecosystems regardless of framework or vendor.
Here is all you need to know:
Universal agent interoperability
A2A allows agents to communicate, discover each other’s capabilities, negotiate tasks, and collaborate even if built on different platforms. This enables complex enterprise workflows to be handled by a team of specialized agents.
Built for enterprise needs
The protocol supports long-running tasks (e.g., supply chain planning), multimodal collaboration (text, audio, video), and secure identity/auth flows (on par with OpenAPI's authentication schemes). Agents share JSON-based “Agent Cards” for capability discovery, negotiate UI formats, and sync task state with real-time updates.
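To make the “Agent Card” idea concrete, here is an illustrative card expressed as JSON (built in Python); the field names approximate the idea, so consult the A2A spec for the exact schema.

```python
import json

# Hypothetical agent card; the name, URL, and fields are illustrative only.
agent_card = {
    "name": "inventory-planner",
    "description": "Plans restocking across warehouses",
    "url": "https://agents.example.com/inventory-planner",
    "capabilities": {"streaming": True, "pushNotifications": True},
    "authentication": {"schemes": ["bearer"]},
    "skills": [
        {
            "id": "restock-plan",
            "name": "Restocking plan",
            "description": "Builds a multi-week restocking plan for a list of SKUs",
        }
    ],
}

# What a client agent would fetch during capability discovery:
print(json.dumps(agent_card, indent=2))
```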