Building with AI agents @dair_ai • Prev: Meta AI, Galactica LLM, Elastic, PaperswithCode, PhD • I share insights on how to build with AI Agents ↓
Aug 31 • 10 tweets • 4 min read
Overview of Self-Evolving Agents
There is a huge interest in moving from hand-crafted agentic systems to lifelong, adaptive agentic ecosystems.
What's the progress, and where are things headed?
Let's find out:
This survey defines self-evolving AI agents and argues for a shift from static, hand-crafted systems to lifelong, adaptive agentic ecosystems.
It maps the field’s trajectory, proposes “Three Laws” to keep evolution safe and useful, and organizes techniques across single-agent, multi-agent, and domain-specific settings.
Aug 28 • 7 tweets • 3 min read
Memory-R1
Another really cool paper showing how RL can enhance an LLM's agentic and memory capabilities.
Great read for AI devs.
Here are my notes:
Overview
A framework that teaches LLM agents to decide what to remember and how to use it.
Two RL-fine-tuned components work together: a Memory Manager that learns CRUD-style operations on an external store and an Answer Agent that filters retrieved memories via “memory distillation” before answering.
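To make the two components concrete, here's a toy sketch (my own illustration, not the paper's code; the simple heuristics below stand in for the RL-tuned LLM policies):

```python
# Toy sketch of Memory-R1's two components: a Memory Manager choosing
# CRUD-style operations on an external store, and an Answer Agent that
# "distills" retrieved memories before answering. Names and heuristics
# are illustrative stand-ins for the RL-tuned models.

class MemoryStore:
    def __init__(self):
        self.entries = {}          # id -> memory text
        self._next_id = 0

    def add(self, text):
        self.entries[self._next_id] = text
        self._next_id += 1

    def update(self, entry_id, text):
        self.entries[entry_id] = text

    def delete(self, entry_id):
        self.entries.pop(entry_id, None)

    def retrieve(self, query, k=3):
        # toy relevance: count words shared with the query
        def score(text):
            return len(set(query.lower().split()) & set(text.lower().split()))
        ranked = sorted(self.entries.items(), key=lambda kv: score(kv[1]), reverse=True)
        return [text for _, text in ranked[:k]]

def memory_manager(store, new_fact):
    """Stand-in policy: UPDATE if an existing memory shares the fact's
    subject, else ADD. In the paper, an RL-tuned LLM picks the operation
    (ADD / UPDATE / DELETE / NOOP)."""
    subject = new_fact.split()[0]
    for entry_id, text in store.entries.items():
        if text.startswith(subject):
            store.update(entry_id, new_fact)   # supersede the stale memory
            return "UPDATE"
    store.add(new_fact)
    return "ADD"

def answer_agent(store, question):
    """Retrieve broadly, then distill to memories that overlap the question."""
    candidates = store.retrieve(question, k=5)
    return [m for m in candidates
            if set(question.lower().split()) & set(m.lower().split())]

store = MemoryStore()
memory_manager(store, "Alice lives in Paris")
op = memory_manager(store, "Alice lives in Berlin")   # supersedes the old fact
print(op)                                             # UPDATE
print(answer_agent(store, "Where does Alice live?"))
```

The key idea is that both decisions (what to write, what to keep after retrieval) are learned with RL rather than hard-coded like they are here.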
Aug 27 • 8 tweets • 3 min read
Don't sleep on small models!
Anemoi is the latest multi-agent system that proves small models pack a punch when combined effectively.
GPT-4.1-mini (for planning) and GPT-4o (for worker agents) surpass the strongest open-source baseline on GAIA.
A must-read for devs:
Quick Overview
Anemoi is a semi-centralized generalist multi-agent system powered by an A2A communication MCP server from @Coral_Protocol.
Anemoi replaces purely centralized, context-stuffed coordination with an A2A communication server (MCP) that lets agents talk directly, monitor progress, refine plans, and reach consensus.
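Here's a rough sketch of that semi-centralized pattern (all names invented; this is my toy illustration of the idea, not Anemoi's actual protocol): agents post to a shared bus instead of routing everything through the planner's context.

```python
# Toy sketch of semi-centralized coordination: agents exchange messages
# on a shared A2A bus, give feedback directly, and reach consensus,
# instead of one planner relaying and context-stuffing every result.

from collections import defaultdict

class A2ABus:
    def __init__(self):
        self.threads = defaultdict(list)   # topic -> list of (sender, msg)

    def post(self, topic, sender, msg):
        self.threads[topic].append((sender, msg))

    def read(self, topic):
        return self.threads[topic]

def consensus(bus, topic):
    """Toy consensus: the plan is accepted once every voting agent's
    latest vote is 'approve'."""
    votes = {sender: msg for sender, msg in bus.read(topic)
             if msg in ("approve", "revise")}
    return bool(votes) and all(v == "approve" for v in votes.values())

bus = A2ABus()
bus.post("plan-1", "planner", "step1: search; step2: summarize")
bus.post("plan-1", "worker_a", "approve")
bus.post("plan-1", "worker_b", "revise")    # direct feedback, no planner relay
bus.post("plan-1", "worker_b", "approve")   # after the plan is refined
print(consensus(bus, "plan-1"))             # True
```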
Aug 27 • 7 tweets • 3 min read
Efficient Language Model with PostNAS
NVIDIA's recent research on LLMs has been fantastic.
Jet-Nemotron is their latest efficient language model, and it significantly improves generation throughput.
Here are my notes:
A hybrid-architecture LM family built by “adapting after pretraining.”
Starting from a frozen full-attention model, the authors search where to keep full attention, which linear-attention block to use, and which hyperparameters match hardware limits.
The result, Jet-Nemotron-2B/4B, matches or surpasses popular full-attention baselines while massively increasing throughput on long contexts.
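To give a feel for the search step, here's a hand-wavy toy version (all numbers and the proxy score are invented; the real PostNAS search is far more involved): keep the pretrained weights frozen and search over which layers retain full attention under a hardware budget.

```python
# Toy sketch of an "adapt after pretraining" layer-placement search:
# score each layout of full-attention vs. linear-attention layers by a
# made-up accuracy proxy minus a latency penalty, under a budget.

from itertools import combinations

N_LAYERS = 6
FULL_BUDGET = 2          # pretend hardware budget: at most 2 full-attention layers

def proxy_score(full_layers):
    """Invented proxy: full attention helps most in middle layers;
    each full-attention layer kept costs latency."""
    accuracy = sum(1.0 - abs(i - N_LAYERS / 2) / N_LAYERS for i in full_layers)
    latency_penalty = 0.3 * len(full_layers)
    return accuracy - latency_penalty

best = max(
    (set(c) for k in range(FULL_BUDGET + 1)
     for c in combinations(range(N_LAYERS), k)),
    key=proxy_score,
)
print(sorted(best))   # middle layers win under this toy proxy
```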
Aug 25 • 6 tweets • 3 min read
Fine-tuning LLM Agents without Fine-tuning LLMs
Catchy title and very cool memory technique to improve deep research agents.
Great for continuous, real-time learning without gradient updates.
Here are my notes:
Overview
Proposes a memory‑based learning framework that lets deep‑research agents adapt online without updating model weights.
The agent is cast as a memory‑augmented MDP with case‑based reasoning, implemented in a planner–executor loop over MCP tools.
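A minimal sketch of the case-based idea (my own toy code, with invented similarity and plans): past episodes are written to memory, and the planner conditions on retrieved cases instead of on gradient updates.

```python
# Toy sketch of case-based memory for a deep-research agent: store
# (task, plan, reward) cases, retrieve similar ones, and reuse the
# best plan. Learning lives in memory; model weights never change.

from dataclasses import dataclass, field

@dataclass
class Case:
    task: str
    plan: list
    reward: float

class CaseMemory:
    def __init__(self):
        self.cases = []

    def write(self, case):
        self.cases.append(case)

    def retrieve(self, task, k=2):
        # toy similarity: word overlap between task descriptions
        def sim(case):
            return len(set(task.lower().split()) & set(case.task.lower().split()))
        return sorted(self.cases, key=sim, reverse=True)[:k]

def planner(task, memory):
    """Condition the next plan on the highest-reward similar case."""
    similar = memory.retrieve(task)
    if similar:
        best = max(similar, key=lambda c: c.reward)
        return best.plan + ["verify sources"]   # reuse and adapt a past plan
    return ["search", "read", "summarize"]      # cold-start default

memory = CaseMemory()
memory.write(Case("survey LLM agents", ["search arxiv", "cluster papers"], 0.9))
memory.write(Case("debug CUDA kernel", ["read trace", "bisect"], 0.4))

plan = planner("survey RL agents", memory)
print(plan)   # reuses the high-reward "survey" case's plan
```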
Aug 20 • 9 tweets • 4 min read
Chain-of-Agents
Interesting idea to train a single model with the capabilities of a multi-agent system.
84.6% reduction in inference cost!
Distillation and Agentic RL are no joke!
Here are my notes:
Overview
This work proposes training single models to natively behave like multi‑agent systems, coordinating “role‑playing” and tool agents end‑to‑end.
They distill strong multi‑agent frameworks into CoA trajectories, then optimize with agentic RL on verifiable tasks.
Aug 19 • 8 tweets • 3 min read
Has GPT-5 Achieved Spatial Intelligence?
GPT-5 sets a new SoTA but still falls short of human-level spatial intelligence.
My notes below:
This report introduces a unified view of spatial intelligence (SI) for multimodal models and evaluates GPT‑5 and strong baselines across eight fresh SI benchmarks.
GPT‑5 leads overall but is still short of human skill, especially on mentally reconstructing shapes, changing viewpoints, and deformation/assembly tasks.
Aug 18 • 8 tweets • 3 min read
Retrieval-Augmented Reasoning with Lean Language Models
Great paper showing how to fuse RAG and reasoning into a single small-footprint language model.
Distillation works if done correctly.
Very exciting results!
Here are my notes:
Overview
The work proposes a domain-tuned pipeline that fuses RAG and reasoning into a single small-footprint model.
The team distills reasoning traces from a frontier model into Qwen2.5 variants, uses summarization to keep context small, and shows that a 32B local model approaches frontier accuracy on an NHS A‑to‑Z clinical QA task.
Aug 16 • 8 tweets • 3 min read
M3-Agent: A Multimodal Agent with Long-Term Memory
Impressive application of multimodal agents.
Lots of great insights throughout the paper.
Here are my notes with key insights:
M3 Agent
Introduces a framework for agents that watch and listen to long videos, build entity-centric memories, and use multi-turn reasoning to answer questions.
Aug 15 • 8 tweets • 4 min read
AI Agents are terrible at long-horizon tasks.
Even the new GPT-5 model struggles with long-horizon tasks.
This is one of the most pressing challenges when building AI agents.
Pay attention, AI devs!
This is a neat paper that went largely unnoticed.
Here are my notes:
What's new?
The work presents a new benchmark and data‑generation pipeline to test agents on realistic, multi‑day office tasks across Word, Excel, PDF, Email, and Calendar.
OdysseyBench targets long‑horizon, context‑dependent workflows instead of atomic tasks.
Two splits: OdysseyBench+ (300 tasks distilled from real OfficeBench cases) and OdysseyBench‑Neo (302 newly synthesized, more complex tasks).
Tasks require retrieving key facts from multi‑day dialogues and coordinating actions across apps.
Aug 13 • 8 tweets • 3 min read
The Illusion of Progress
It's well known that there are caveats with benchmarks and metrics that measure LLM capabilities.
It's no different for hallucination detection.
"ROUGE fails to reliably capture true hallucination"
Here are my notes:
Overview
The paper argues that common QA hallucination detectors look better than they are because evaluations lean on ROUGE.
In human‑aligned tests, many detectors drop sharply. Simple response‑length heuristics rival complex methods, revealing a core evaluation flaw.
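To see how trivial a competitive baseline can be, here's a toy length heuristic (threshold and scoring are made up for the sketch; this is my illustration of the paper's point, not its method):

```python
# Toy illustration: a response-length heuristic as a hallucination score.
# Longer answers tend to drift further from the evidence, so length alone
# can rival complex detectors when the evaluation itself is miscalibrated.

def length_detector(response, threshold=12):
    """Flag long answers as likely hallucinations; returns a score in [0, 1]."""
    n_words = len(response.split())
    return min(n_words / threshold, 1.0)

short = "Paris."
long_ = ("The capital is Paris, which was founded by the Romans in 52 BC and "
         "has a population of exactly 2,148,271 people as of this morning.")

print(length_detector(short))  # low score -> treated as grounded
print(length_detector(long_))  # saturates at 1.0 -> flagged
```

If a heuristic this crude matches sophisticated detectors under the standard metrics, the metrics are the problem, which is exactly the paper's argument against ROUGE-based evaluation.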
Aug 12 • 8 tweets • 3 min read
Unlocking Long-Horizon Agentic Search
AI agents still struggle with long-horizon tasks.
This paper sheds light on how to improve long-horizon agentic search with RL.
Here are my notes:
Overview
It introduces ASearcher, an open-source framework for training LLM-based search agents capable of long-horizon, expert-level search.
It addresses two major limitations of prior open-source approaches: short turn limits (≤10 turns) and the lack of large-scale, high-quality QA data.
Aug 11 • 4 tweets • 2 min read
Getting huge productivity boosts by combining Claude Code with Obsidian vaults.
Everything in Obsidian is .md, so this is like the most delicious context for LLMs.
Everything is in one place: notes, bookmarks, instructions, LLM context, AI outputs, and so on.
The part I like about Obsidian is that, finally, I feel like I own my notes.
I can access them everywhere.
Modify them when I want.
And leverage them with LLMs all the time.
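Because the vault is just Markdown files, assembling LLM context is plain file I/O. A minimal sketch (paths, budget, and the matching rule are placeholders; this is not how Claude Code itself reads a vault):

```python
# Toy context builder over an Obsidian-style vault: every note is a .md
# file, so gathering relevant context is just globbing and reading files.

from pathlib import Path

def build_context(vault_dir, query, max_chars=4000):
    """Concatenate Markdown notes that mention the query, newest first,
    until the character budget is spent."""
    notes = sorted(Path(vault_dir).rglob("*.md"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    chunks, used = [], 0
    for note in notes:
        text = note.read_text(encoding="utf-8")
        if query.lower() in text.lower():
            piece = f"## {note.stem}\n{text}\n"
            if used + len(piece) > max_chars:
                break
            chunks.append(piece)
            used += len(piece)
    return "".join(chunks)
```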
Aug 7 • 30 tweets • 9 min read
BREAKING: OpenAI introduces GPT-5
Here's everything you need to know:
Altman claims that with GPT-5, it is now like talking to an expert.
It can write entire programs from scratch. Software-on-demand is a defining characteristic.
PhD-level experts in your pocket.
Aug 3 • 15 tweets • 6 min read
The Agentic Web is upon us!
If you want to learn about the Agentic Web, look no further.
This new report is a banger!
It presents a detailed framework to understand and build the agentic web.
Here is everything you need to know:
Agentic Web
This paper introduces the concept of the Agentic Web, a transformative vision of the internet where autonomous AI agents, powered by LLMs, act on behalf of users to plan, coordinate, and execute tasks.
Aug 2 • 9 tweets • 3 min read
Hierarchical Reasoning Model
This is one of the most interesting ideas on reasoning I've read in the past couple of months.
It uses a recurrent architecture for impressive hierarchical reasoning.
Here are my notes:
The paper proposes a novel, brain-inspired architecture that replaces CoT prompting with a recurrent model designed for deep, latent computation.
Jul 30 • 7 tweets • 3 min read
Graph-R1
New RAG framework just dropped!
Combines agents, GraphRAG, and RL.
Here are my notes:
Introduces a novel RAG framework that moves beyond traditional one-shot or chunk-based retrieval by integrating graph-structured knowledge, agentic multi-turn interaction, and RL.
Jul 28 • 14 tweets • 5 min read
GLM-4.5 looks like a big deal!
> MoE Architecture
> Hybrid reasoning models
> 355B total (32B active)
> GQA + partial RoPE
> Multi-Token Prediction
> Muon Optimizer + QK-Norm
> 22T-token training corpus
> Slime RL Infrastructure
> Native tool use
Here's all you need to know:
Model Architecture & Pre-Training
GLM-4.5 is 355B total parameters (32B active); deeper model with narrower width; optimized for reasoning via more layers and 96 attention heads.
GLM-4.5-Air is 106B (12B active).
22T-token training corpus that combines 15T general data with 7T code/reasoning-focused data.
Grouped-Query Attention + partial RoPE to enhance long-context efficiency and accuracy in reasoning tasks.
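For intuition on the "partial" part of partial RoPE, here's a hedged sketch (dimensions and the rotary fraction are illustrative, not GLM-4.5's actual config): rotary position embeddings are applied to only a fraction of each head's dimensions, leaving the rest position-free.

```python
# Toy partial RoPE: rotate the first rotary_frac of a per-head query/key
# vector by position-dependent angles; pass the remainder through unchanged.

import math

def partial_rope(x, position, rotary_frac=0.5, base=10000.0):
    d = len(x)
    d_rot = int(d * rotary_frac)
    assert d_rot % 2 == 0, "rotated dims must pair up"
    out = list(x)
    for i in range(0, d_rot, 2):
        theta = position / (base ** (i / d_rot))
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

q = [1.0, 0.0, 1.0, 0.0]                # 4-dim toy head
rotated = partial_rope(q, position=3)   # only dims 0-1 get rotated
print(rotated[2:])                      # last half unchanged: [1.0, 0.0]
```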
Jul 27 • 6 tweets • 2 min read
Claude Code is more than a coding agent.
It's more like a super smart orchestrator agent.
Watch this evaluator-loop agent I just built using subagents and slash commands.
This is one of the fastest ways to build custom agentic workflows.
Claude Code is no joke!
I'm impressed to see how easy it is to control how the sub agents communicate with each other (i.e., chain, loop, hierarchical, critic, etc.).
Claude Code is good out of the box, but customization gives you a clear advantage.
Custom subagents + slash commands are how you get it.
Jul 19 • 8 tweets • 3 min read
Context Rot
Great title for a report, but even better insights about how increasing input tokens impact the performance of top LLMs.
Banger report from Chroma.
Here are my takeaways (relevant for AI devs):
Context Rot
The research evaluates how state-of-the-art LLMs perform as input context length increases, challenging the common assumption that longer contexts are uniformly handled.
Testing 18 top models (including GPT-4.1, Claude 4, Gemini 2.5, Qwen3), the authors show that model reliability degrades non-uniformly even on simple tasks as input grows, a phenomenon they term "context rot."
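A toy harness in the spirit of that methodology (my own sketch; `fake_model` is a stand-in for a real LLM API, and its decay curve is invented): hold the task fixed while padding the input with distractors, and measure accuracy per input length.

```python
# Toy context-rot harness: fixed needle-retrieval task, growing filler.
# Accuracy is measured at each input length; only the length changes.

import random

def make_input(needle, n_filler_sentences):
    filler = ["The sky was a pleasant shade of blue that day."] * n_filler_sentences
    pos = random.randrange(len(filler) + 1)
    return " ".join(filler[:pos] + [needle] + filler[pos:])

def fake_model(context, question):
    """Stand-in model whose recall degrades as the context grows."""
    recall_prob = max(0.0, 1.0 - len(context) / 50000)
    if "magic number is 7" in context and random.random() < recall_prob:
        return "7"
    return "unknown"

def accuracy_at_length(n_filler, trials=200):
    random.seed(0)
    hits = 0
    for _ in range(trials):
        ctx = make_input("The magic number is 7.", n_filler)
        hits += fake_model(ctx, "What is the magic number?") == "7"
    return hits / trials

print(accuracy_at_length(10))    # short context: near-perfect recall
print(accuracy_at_length(800))   # long context: recall has rotted
```

The report's finding is that the real curves are non-uniform across models and tasks, which is why a single "supported context length" number is misleading.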
Jul 18 • 12 tweets • 4 min read
A Survey of Context Engineering
160+ pages covering the most important research around context engineering for LLMs.
This is a must-read!
Here are my notes:
The paper provides a taxonomy of context engineering in LLMs categorized into foundational components, system implementations, evaluation methodologies, and future directions.