Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

elvis

@omarsar0

Jul 18 • 12 tweets • 4 min read • Read on X

Scrolly

A Survey of Context Engineering

160+ pages covering the most important research around context engineering for LLMs.

This is a must-read!

Here are my notes:

The paper provides a taxonomy of context engineering in LLMs categorized into foundational components, system implementations, evaluation methodologies, and future directions.

The context engineering evolution timeline from 2020 to 2025 involves foundational RAG systems to complex multi-agent architectures.

The work distinguishes prompt engineering from context engineering on dimensions like state, scalability, error analysis, complexity, etc.

Context engineering components include context retrieval and generation, context processing, context management, and how they are all integrated into systems implementation, such as RAG, memory architectures, tool-integrated reasoning, and multi-agent coordination mechanisms.

One important aspect of context processing is contextual self-refinement, which aims to improve outputs through cyclical feedback mechanisms.

An important aspect of context management is how to deal efficiently with long context and reasoning chains. The paper provides an overview of and characteristics of key methods for long-chain reasoning.

Memory is key to building complex agentic systems that can adapt, learn, and perform coherent long-term tasks.

There is also a nice overview of different memory implementation patterns.

Tool-calling capabilities in an area of continuous development in the space. The paper provides an overview of tool-augmented language model architectures and how they compare across tool categories.

Context engineering is going to evolve rapidly.

But this is a great overview to better map and keep track of this rapidly evolving landscape.

There is a lot more in the paper. Over 1000+ references included.

This survey tries to capture the most common methods and biggest trends, but there is more on the horizon as models continue to improve in capability and new agent architectures emerge.

You can read the full paper below:

arxiv.org/abs/2507.13334

Want to take it a step further?

Learn about context engineering and how to build effective agentic systems in my courses: dair-ai.thinkific.com

We also have a workshop on context engineering coming soon.

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @omarsar0

elvis

@omarsar0

Jul 17

Agent Leaderboard v2 is here!

> GPT-4.1 leads
> Gemini-2.5-flash excels at tool selection
> Kimi K2 is the top open-source model
> Grok 4 falls short
> Reasoning models lag behind
> No single model dominates all domains

More below:

@rungalileo introduces Agent Leaderboard v2, a domain-specific evaluation benchmark for AI agents designed to simulate real enterprise tasks across banking, healthcare, insurance, telecom, and investment.

Unlike earlier tool-calling benchmarks that saturate at 90%+ accuracy, v2 focuses on Action Completion (AC) and Tool Selection Quality (TSQ) in complex, multi-turn conversations.

Read 7 tweets

elvis

@omarsar0

Jul 14

One Token to Fool LLM-as-a-Judge

Watch out for this one, devs!

Semantically empty tokens, like “Thought process:”, “Solution”, or even just a colon “:”, can consistently trick models into giving false positive rewards.

Here are my notes:

Overview

Investigates the surprising fragility of LLM-based reward models used in Reinforcement Learning with Verifiable Rewards (RLVR).

The authors find that inserting superficial, semantically empty tokens, like “Thought process:”, “Solution”, or even just a colon “:”, can consistently trick models into giving false positive rewards, regardless of the actual correctness of the response.

"Master keys" break LLM judges

Simple, generic lead-ins (e.g., “Let’s solve this step by step”) and even punctuation marks can elicit false YES judgments from top reward models.

This manipulation works across models (GPT-4o, Claude-4, Qwen2.5, etc.), tasks (math and general reasoning), and prompt formats, reaching up to 90% false positive rates in some cases.

Read 6 tweets

elvis

@omarsar0

Jul 10

BREAKING: xAI announces Grok 4

"It can reason at a superhuman level!"

Here is everything you need to know:

Elon claims that Grok 4 is smarter than almost all grad students in all disciplines simultaneously.

100x more training than Grok 2.

10x more compute on RL than any of the models out there.

Performance on Humanity's Last Exam

Elon: "Grok 4 is post-grad level in everything!"

Read 21 tweets

elvis

@omarsar0

Jul 8

MemAgent

MemAgent-14B is trained on 32K-length documents with an 8K context window.

Achieves >76% accuracy even at 3.5M tokens!

That consistency is crazy!

Here are my notes:

Overview

Introduces an RL–driven memory agent that enables transformer-based LLMs to handle documents up to 3.5 million tokens with near lossless performance, linear complexity, and no architectural modifications.

RL-shaped fixed-length memory

MemAgent reads documents in segments and maintains a fixed-size memory updated via an overwrite mechanism.

This lets it process arbitrarily long inputs with O(N) inference cost while avoiding context window overflows.

Read 6 tweets

elvis

@omarsar0

Jul 6

Agentic RAG for Personalized Recommendation

This is a really good example of integrating agentic reasoning into RAG.

Leads to better personalization and improved recommendations.

Here are my notes:

Overview

This work introduces a multi-agent framework, ARAG, that enhances traditional RAG systems with reasoning agents tailored to user modeling and contextual ranking.

It reframes recommendation as a structured coordination problem between LLM agents.

Instead of relying on static similarity-based retrieval, ARAG comprises four agents:

- User Understanding Agent synthesizes user preferences from long-term and session behavior.

- NLI Agent evaluates semantic alignment between candidate items and user intent.

- Context Summary Agent condenses relevant item metadata.

- Item Ranker Agent ranks final recommendations using all prior reasoning.

Read 5 tweets

elvis

@omarsar0

Jul 3

AI for Scientific Search

AI for Science is where I spend most of my time exploring with AI agents.

This 120+ pages report does a good job of highlighting why all the big names like OpenAI and Google DeepMind are pursuing AI4Science.

Bookmark it!

My notes below:

There are five key areas:

(1) AI for Scientific Comprehension
(2) AI for Academic Survey
(3) AI for Scientific Discovery
(4) AI for Academic Writing
(5) AI for Academic Peer Review

Just look at the large body of work that's been happening in the space:

Read 11 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

elvis

Try unrolling a thread yourself!

More from @omarsar0

elvis

elvis

elvis

elvis

elvis

elvis

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!