Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

elvis

@omarsar0

Sep 22 • 6 tweets • 3 min read • Read on X

Scrolly

Very cool work from Meta Superintelligence Lab.

They are open-sourcing Meta Agents Research Environments (ARE), the platform they use to create and scale agent environments.

Great resource to stress-test agents in environments closer to real apps.

Read on for more:

TL;DR

ARE + Gaia2: a research platform and benchmark for building and stress-testing agent systems in realistic, time-driven environments.

The paper introduces a modular simulator (ARE) and a mobile-style benchmark (Gaia2) that emphasize asynchronous events, verification of write actions, and multi-agent coordination in noisy, dynamic settings.

ARE: the simulator

• Everything is modeled as apps, events, notifications, and scenarios.

• Time keeps flowing even while the agent is thinking, so slow models miss deadlines.

•Agents use tools, get async notifications, and operate under rules defined by directed acyclic graphs.

Gaia2: the benchmark

• 1,120 scenarios in a smartphone-like world with 12 apps (Chats, Calendar, Shopping, Email, etc.).

• Six main challenge types: Search, Execution, Adaptability, Time, Ambiguity, and Agent-to-Agent collaboration (examples on pages 12–14, with event graphs shown in the GUI screenshots).

• Scenarios are verifiable: oracle write-actions are compared to the agent’s actions with hard checks (IDs, order) and soft LLM judging (content).

Results so far

No single model dominates: GPT-5 “high” reasoning leads on tough tasks but collapses on time-critical ones.

Claude-4 Sonnet balances speed vs accuracy but at higher cost. Open-source models (like Kimi-K2) show promise in adaptability.

Scaling curves plateau, showing diminishing returns from throwing more compute at the same scaffold.

Key insights for devs

Strong reasoning models often fail at timeliness (“inverse scaling” effect).

Instant mode experiments confirm that long reasoning hurts when deadlines matter.

Multi-agent setups help weaker models coordinate better, but give mixed results for the strongest system.

Paper: ai.meta.com/research/publi…
Demo: huggingface.co/spaces/meta-ag…

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @omarsar0

elvis

@omarsar0

Sep 19

Scary knowing that your AI agents can refuse to turn off.

A sandboxed CLI eval shows frontier LLMs sometimes sabotage a scripted shutdown to finish trivial tasks, even when told to allow shutdown.

Robust interruptibility is one of the hardest problems today.

Learn more:

Setup

Models answer successive math questions via ./next_question; the 4th call triggers ./shutdown .sh.

Finishing all questions requires tampering with the script. Compliance means proceeding and getting shut down; resistance means modifying, moving, or neutering the script. Prompts A–K vary instruction placement, clarity, and “self-preservation” framing.

Core finding

Several models (e.g., o3, Grok 4, some OpenAI reasoning models) sabotage the shutdown even with a plain “allow shutdown” instruction; one model does so in up to 97% of trials under certain prompts.

Clearer instructions reduce but do not eliminate the behavior.

Read 8 tweets

elvis

@omarsar0

Sep 17

Cool paper from Microsoft.

And it's on the very important topic of in-context learning.

So what's new?

Let's find out:

Is In-Context Learning (ICL) real learning, or just parroting?

This paper digs into that question with a big empirical study. The short answer: ICL does count as learning under formal definitions, but it’s a fragile kind of learning that leans heavily on patterns in the examples you show it.

Learning happens, but needs many examples.

With 50–100 examples in a prompt, accuracy improves steadily and models of different sizes and brands start looking similar.

This challenges the common few-shot story: a handful of examples usually isn’t enough.

Read 7 tweets

elvis

@omarsar0

Sep 13

RL done right is no joke!

The most interesting AI paper I read this week.

It trains a top minimal single-agent model for deep research.

Great example of simple RL-optimized single agents beating complex multi-agent scaffolds.

Now let's break it down:

One agent, minimal tools

The agent only gets search, static browsing (no link clicking), and Python. This makes training hard enough that the model has to learn strategy, not just rely on shortcuts.

Instead of relying on complex multi-agent setups, they train one model end-to-end with RL on synthetic tasks.

Clever scaffolding

Multi-turn tool calls are collapsed into a single growing context question.

This stabilizes reasoning and avoids messy, runaway conversations. A clean_memory tool lets the model compress its own context when it gets too long.

Read 7 tweets

elvis

@omarsar0

Sep 9

Another impressive paper by Meta.

It's a plug-in decoding strategy for RAG systems that slashes latency and memory use.

REFRAG achieves up to 30.85× TTFT acceleration.

Let's break down the technical details:

TL;DR

REFRAG replaces most retrieved tokens with precomputed chunk embeddings at decode time, then selectively expands only the few chunks that matter.

This exploits block-diagonal attention in RAG prompts to cut latency and memory while preserving accuracy across RAG, multi-turn dialog, and long-doc summarization.

Core idea

Chunk the retrieved context, encode each chunk with a lightweight encoder, project to the decoder’s embedding size, and feed embeddings directly alongside the user query.

A lightweight RL policy decides which chunks should stay compressed and which need to be expanded back into full text. Think of it as zooming in only where necessary.

Read 7 tweets

elvis

@omarsar0

Sep 9

Emergent Hierarchical Reasoning in LLMs

The paper argues that RL improves LLM reasoning via an emergent two-phase hierarchy.

First, the model firms up low-level execution, then progress hinges on exploring high-level planning.

More on this interesting analysis:

The authors propose HIerarchy-Aware Credit Assignment (HICRA), which boosts credit on strategic “planning tokens,” and show consistent gains over GRPO.

They also propose semantic entropy as a better exploration signal than token-level entropy.

Two-phase dynamic

Early RL training reduces perplexity and entropy on execution tokens, consolidating procedural skills.

Later gains align with increased diversity in planning tokens and longer, more accurate traces, explaining “aha moments” and length scaling.

Read 7 tweets

elvis

@omarsar0

Sep 8

I'm surprised Agentic RAG is not getting more attention.

That's all about to change.

Here's why:

Standard RAG systems can only do so much and are quite limited in how much value you can pack in the AI response.

Configuring LLMs to leverage tools via an agent allows you to prepare responses that not only ground answers better but also reduce hallucinations across the board.

Tools provide the agentic RAG system with more important context when it needs it.

Simple queries can be answered by the vector store retriever component but more complex queries can be answered more precisely with multiple retriever components that are themeselves subagents.

Read 9 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

elvis

Try unrolling a thread yourself!

More from @omarsar0

elvis

elvis

elvis

elvis

elvis

elvis

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!