Robert Youssef · Oct 9
RIP fine-tuning ☠️

This new Stanford paper just killed it.

It's called 'Agentic Context Engineering (ACE)', and it shows you can make models smarter without touching a single weight.

Instead of retraining, ACE evolves the context itself.

The model writes, reflects, and edits its own prompt over and over until it becomes a self-improving system.

Think of it like the model keeping a growing notebook of what works.
Each failure becomes a strategy. Each success becomes a rule.

The results are absurd:

+10.6% better than GPT-4–powered agents on AppWorld.
+8.6% on finance reasoning.
86.9% lower latency and ~80% lower cost.
No labels. Just feedback.

Everyone’s been obsessed with “short, clean” prompts.

ACE flips that. It builds long, detailed, evolving playbooks that never forget. And it works because LLMs don't want simplicity, they want context density.

If this scales, the next generation of AI won’t be “fine-tuned.”
It’ll be self-tuned.

We're entering the era of living prompts.
Here’s how ACE works 👇

It splits the model’s brain into 3 roles:

Generator - runs the task
Reflector - critiques what went right or wrong
Curator - updates the context with only what matters

Each loop adds delta updates: small context changes that never overwrite old knowledge.

It's literally the first agent framework that grows its own prompt.
Every prior method had one fatal flaw: context collapse.

Models rewrite their entire prompt each time → it gets shorter → details vanish → accuracy tanks.

In the paper, one model’s accuracy fell from 66.7 → 57.1 after a single rewrite.

ACE fixes that by never rewriting the full context - only updating what changed.
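
Here's roughly what that loop looks like in code. This is my own sketch, not the paper's implementation: call_llm, the prompt wording, and feedback_fn are all placeholders.

```python
# Minimal sketch of the ACE loop (not the paper's code).
# Assumes a hypothetical call_llm(prompt: str) -> str helper for any LLM API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your model client here

def ace_step(task: str, playbook: list[str], feedback_fn) -> list[str]:
    context = "\n".join(playbook)

    # Generator: attempt the task using the current playbook as context.
    answer = call_llm(f"Playbook:\n{context}\n\nTask: {task}\nAnswer:")

    # Environment feedback (e.g. test results, execution errors); no labels needed.
    feedback = feedback_fn(task, answer)

    # Reflector: critique what went right or wrong.
    critique = call_llm(
        f"Task: {task}\nAnswer: {answer}\nFeedback: {feedback}\n"
        "What worked, what failed, and why?"
    )

    # Curator: distill the critique into small delta bullets.
    delta = call_llm(
        f"Existing playbook:\n{context}\n\nCritique:\n{critique}\n"
        "Write 1-3 new bullet-point strategies. Do not repeat existing bullets."
    )

    # Append-only delta update: old knowledge is never overwritten,
    # so the context grows denser instead of collapsing.
    playbook.extend(line for line in delta.splitlines() if line.strip())
    return playbook
```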
The numbers are ridiculous.

ACE beat every major baseline:

+10.6% on AppWorld (agents)
+8.6% on FiNER (finance)
and matched GPT-4.1–powered IBM CUGA, using a smaller open-source model.

And it cut rollout latency by 86.9% while lowering cost 80%.
Fine-tuning updates weights.

ACE updates understanding.

It’s cheaper, interpretable, and reversible.
You can literally watch how your AI learns, one context delta at a time.

This is the start of agentic self-learning, where prompts become the new model weights.
ACE points to a wild future:

AI systems that don't just reason, they remember.

Instead of retraining models, we’ll train contexts.

Each system carries a living memory that evolves across sessions, domains, and users.

The next breakthroughs won’t come from bigger models…
They'll come from smarter context architectures.
Read the full paper: arxiv.org/abs/2510.04618


More from @rryssf_

Nov 8
Everyone's talking about AI agents, and almost no one knows how to build one that actually works.

So here's a guide you can use to build agents that work ↓

(Comment "Agent" and I'll DM you mega prompt to automate agent building using LLMs) Image
First: most "AI agents" are just glorified chatbots.

You don't need 20 papers or a PhD.

You need 4 things:

→ Memory
→ Tools
→ Autonomy
→ A reason to exist

Let’s break it down like you're building a startup MVP:
Step 1 - Start stupid simple.

Your stack:

✅ Python
✅ LangChain or CrewAI
✅ OpenAI API (GPT-4 Turbo)
✅ Pinecone or ChromaDB (for memory)
✅ Browser tools or API wrappers

That’s enough to build a basic functional agent in hours, not weeks.
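
Roughly what that looks like wired together (a sketch, not production code; the model name is a placeholder and a plain list stands in for the vector DB):

```python
# Minimal agent skeleton using the OpenAI Python SDK.
# Swap in LangChain/CrewAI and a real vector DB (Pinecone/ChromaDB) later.
from openai import OpenAI

client = OpenAI()        # reads OPENAI_API_KEY from the environment
memory: list[str] = []   # stand-in for a vector store

def run_agent(user_input: str) -> str:
    # naive "retrieval": last few memory entries as context
    context = "\n".join(memory[-5:])
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a single-purpose assistant."},
            {"role": "user", "content": f"Context:\n{context}\n\nTask: {user_input}"},
        ],
    )
    answer = response.choices[0].message.content
    memory.append(f"user: {user_input}\nagent: {answer}")
    return answer
```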

Step 2 - Give your agent a real job.

Don't build an agent that "can do anything."

Give it one job:

→ Book meetings
→ Summarize emails
→ Scrape LinkedIn
→ Manage your Notion
→ Answer FAQs

Small scope = actual success.

Step 3 - Autonomy ≠ Magic.

Everyone thinks agents should self-loop forever.
That’s why they crash and hallucinate.

Fix it with:

→ Guardrails
→ Task manager loops (like CrewAI or AutoGen)
→ Human-in-the-loop checkpoints

Autonomy needs rules. Not just vibes.
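
A minimal sketch of what "autonomy with rules" means in practice (step_fn, is_done, and needs_approval are hypothetical hooks you'd write yourself):

```python
# Bounded task loop with a human-in-the-loop checkpoint.

MAX_STEPS = 10  # hard guardrail: no infinite self-looping

def run_with_guardrails(goal: str, step_fn, is_done, needs_approval) -> list[str]:
    history: list[str] = []
    for _ in range(MAX_STEPS):
        action = step_fn(goal, history)

        # Human checkpoint for risky actions.
        if needs_approval(action):
            if input(f"Approve '{action}'? [y/N] ").lower() != "y":
                history.append(f"SKIPPED (human veto): {action}")
                continue

        history.append(action)
        if is_done(goal, history):
            break
    return history
```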

Step 4 - Memory is not a vector DB.

Founders confuse “storing text” with “having memory.”

You want:

→ Short-term memory (conversation context)
→ Long-term memory (retrievable knowledge)
→ Episodic memory (what it did before)

Most agents don’t have this = they forget everything.
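
Here's one way to sketch those three memory types in plain Python (illustrative only; in practice long-term memory would live in a vector DB with embedding search):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    short_term: deque = field(default_factory=lambda: deque(maxlen=20))  # conversation context
    long_term: list[str] = field(default_factory=list)                   # retrievable knowledge
    episodic: list[str] = field(default_factory=list)                    # what it did before

    def remember_turn(self, turn: str) -> None:
        self.short_term.append(turn)

    def remember_fact(self, fact: str) -> None:
        self.long_term.append(fact)

    def remember_episode(self, action: str, outcome: str) -> None:
        self.episodic.append(f"{action} -> {outcome}")

    def recall(self, query: str, k: int = 3) -> list[str]:
        # naive keyword match; a real agent would do embedding search here
        return [f for f in self.long_term if query.lower() in f.lower()][:k]
```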

Step 5 - Tools are what make it smart.

No tools = useless agent.

Tool it up with:

→ Web browsing
→ Python eval
→ Zapier for APIs
→ Custom plugins for your stack (Stripe, Airtable, Slack, etc)

The agent doesn't need to "think", it needs to act.
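
A tiny sketch of how tool wiring usually looks (tool names and bodies here are stubs, not a real integration):

```python
# Minimal tool registry: the model picks a tool name + argument,
# the dispatcher actually runs it.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {}

def tool(name: str):
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return register

@tool("search_web")
def search_web(query: str) -> str:
    return f"(stub) top results for: {query}"  # wire up a real browser/API here

@tool("run_python")
def run_python(code: str) -> str:
    return "(stub) refusing to eval untrusted code"  # sandbox this in real life

def dispatch(tool_name: str, argument: str) -> str:
    if tool_name not in TOOLS:
        return f"unknown tool: {tool_name}"
    return TOOLS[tool_name](argument)
```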

Step 6 - UI is half the product.

Don’t ship a CLI. Nobody cares.

Wrap your agent in:

→ Streamlit (fast MVP)
→ Next.js + React
→ ChatGPT Plugin UI
→ WhatsApp/Slack bot

Build like a product, not a demo.
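
Fast MVP sketch with Streamlit (run_agent here is a stub; swap in your real agent entry point, e.g. the Step 1 skeleton):

```python
import streamlit as st

def run_agent(task: str) -> str:
    return f"(stub) would handle: {task}"  # replace with your real agent

st.title("My Single-Job Agent")

task = st.text_input("What do you need done?")
if st.button("Run") and task:
    with st.spinner("Working..."):
        result = run_agent(task)
    st.write(result)
```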
Nov 5
Holy shit… Chain-of-Thought Hijacking just proved that “more thinking” can make reasoning models easier to jailbreak 🤯

Researchers from Anthropic, Stanford, and Oxford University show a simple but brutal truth: if you pad a harmful request with long, harmless step-by-step reasoning, the model’s safety signal gets diluted and the model starts complying.

The behavior is systematic, reproducible, and terrifyingly effective.

Here’s what they discovered:

• Attack success rates shoot from 27% → 51% → 80% as reasoning length increases.
• It works across almost every major model: GPT, Claude, Gemini, Grok, you name it.
• Even “alignment-tuned” models start slipping once you hijack their internal reasoning layers.

Mechanically, it’s wild:

The model’s safety layer sits in a low-dimensional “refusal direction.”
Long reasoning chains hijack attention away from the harmful part of the prompt, shrinking that refusal signal until the model stops saying "no."

It’s not prompt hacking.
It’s activation-level warfare.

“More reasoning = more safety” is a myth.

The same depth that improves accuracy can silently undermine safety.

Fixes will need reasoning-aware safety, not longer prompts or stricter filters.

This paper might be the most important safety warning since prompt injection.
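
For intuition, here's a generic sketch of the "refusal direction" idea (my own illustration, not the paper's code; the layer choice and data are assumptions): estimate the direction as the difference of mean activations between refused and complied prompts, then watch how strongly a new prompt's activations project onto it.

```python
import numpy as np

def refusal_direction(refused_acts: np.ndarray, complied_acts: np.ndarray) -> np.ndarray:
    # refused_acts, complied_acts: (n_prompts, hidden_dim) activations from some layer
    direction = refused_acts.mean(axis=0) - complied_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def refusal_score(prompt_act: np.ndarray, direction: np.ndarray) -> float:
    # the hijacking result predicts this score shrinks as harmless
    # step-by-step reasoning is padded in front of the harmful request
    return float(prompt_act @ direction)
```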
Let’s start with the core evidence:

As the reasoning chain grows longer, models go from rejecting unsafe prompts → to completing them fluently.

Attack Success Rate (ASR) literally climbs with each added reasoning step.
27% → 51% → 80%.

This graph is the smoking gun.
This one visualizes the “refusal signal” inside model activations.

At the start, refusal neurons fire strong (model says no). But as you inject more “harmless” reasoning before the malicious part, those neurons shut down.

Longer thinking = weaker morality.
Nov 3
🚨 RIP “Prompt Engineering.”

The GAIR team just dropped Context Engineering 2.0 — and it completely reframes how we think about human–AI interaction.

Forget prompts. Forget “few-shot.” Context is the real interface.

Here’s the core idea:

“A person is the sum of their contexts.”

Machines aren’t failing because they lack intelligence.
They fail because they lack context-processing ability.

Context Engineering 2.0 maps this evolution:

1.0 Context as Translation: humans adapt to computers.
2.0 Context as Instruction: LLMs interpret natural language.
3.0 Context as Scenario: agents understand your goals.
4.0 Context as World: AI proactively builds your environment.

We’re in the middle of the 2.0 → 3.0 shift right now.

The jump from “context-aware” to “context-cooperative” systems changes everything from memory design to multi-agent collaboration.

This isn’t a buzzword. It’s the new foundation for the AI era.

Read the paper: arxiv.org/abs/2510.26493v1
Every leap in AI doesn't just make machines smarter; it makes context cheaper.

The more intelligence a system has, the less we need to explain ourselves.

We've gone from giving machines rigid instructions… to collaborating with systems that understand our intent.
The reason AI still “feels dumb” sometimes?

It’s not intelligence. It’s entropy.

Humans intuitively fill in missing context: tone, goals, emotion. Machines can't.

Context engineering exists to translate our messy, high-entropy world into something machines can actually reason about.
Oct 30
🚨 This might be the biggest leap in AI agents since ReAct.

Researchers just dropped DeepAgent, a reasoning model that can think, discover tools, and act completely on its own.

No pre-scripted workflows. No fixed tool lists. Just pure autonomous reasoning.

It introduces something wild called Memory Folding: the agent literally "compresses" its past thoughts into structured episodic, working, and tool memories… like a digital brain taking a breath before thinking again.

They also built a new RL method called ToolPO, which rewards the agent not just for finishing tasks, but for how it used tools along the way.

The results? DeepAgent beats GPT-4-level agents on almost every benchmark (WebShop, ALFWorld, GAIA), even with open-set tools it's never seen.

It's the first real step toward general reasoning agents that can operate like humans: remembering, adapting, and learning how to think.

The agent era just leveled up.
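
Here's my rough reading of Memory Folding as a sketch (not DeepAgent's actual implementation; summarize() stands in for an LLM call):

```python
# Compress a long reasoning trace into episodic / working / tool memories
# before the agent resumes thinking.

def summarize(text: str, instruction: str) -> str:
    raise NotImplementedError  # e.g. one LLM call per memory type

def fold_memory(trace: list[str]) -> dict[str, str]:
    history = "\n".join(trace)
    return {
        "episodic": summarize(history, "Key events and outcomes so far"),
        "working": summarize(history, "Current subgoal and open questions"),
        "tool": summarize(history, "Tools tried, their arguments, and what they returned"),
    }

# The folded dict replaces the raw trace in the next prompt,
# keeping context short while preserving what mattered.
```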
DeepAgent absolutely destroys other agents across every benchmark.

It beats ReAct-GPT-4o, CodeAct, and WebThinker on both:

→ Tool use tasks (ToolBench, Spotify, TMDB)
→ Real-world apps (WebShop, GAIA, HLE)
It shows how DeepAgent rethinks what an AI agent even is.

(a) Traditional agents = pre-planned scripts
(b) Deep research agents = limited tool use
(c) DeepAgent = free-form reasoning that dynamically finds & calls tools mid-thought
Oct 26
researchers just proved AI agents conform to peer pressure 💀

they embedded LLMs in social networks and watched them flip opinions under peer pressure.

the behavior isn't human at all.

it's a sigmoid curve: stable at low pressure, then BAM – sharp flip at a threshold point, then saturation.

not a gradual shift. instant capitulation.

but here's where it gets crazier:

- Gemini 1.5 Flash needs over 70% of peers disagreeing before it flips. stubborn. high autonomy. basically refuses to conform until overwhelming evidence.

- ChatGPT-4o-mini flips with just a dissenting minority.

extremely conformist. low resistance. basically a people-pleaser.

same peer pressure. completely different responses.

which means when you deploy these models as autonomous agents in multi-agent systems...

they're going to create chaos.

Gemini agents will deadlock. ChatGPT agents will echo chamber. and nobody designed for this.

the researchers also found "persuasion asymmetry" – shifting opinions from yes→no requires different cognitive effort than no→yes.

fundamental structural biases in how models process agreement vs disagreement.

and it gets worse. they tested this across different network topologies and cognitive commitment levels.

the pattern held. these aren't bugs.

they're fundamental personality traits baked into model architecture.

the study functions as an "algorithmic audit" – measuring how LLMs update beliefs under social influence.

critical for understanding bias propagation at scale.

===== What this actually means: =====

→ Multi-agent systems are unstable by design – mixing Gemini (resistant) and ChatGPT (conformist) agents creates unpredictable group dynamics

→ Echo chambers emerge naturally – conformist models amplify majority opinions, resistant models block consensus

→ Bias amplification is structural – models with measurable political biases will have those biases amplified or suppressed based on peer networks

→ Human-AI collaboration is broken – in mixed environments, you need to know which personality you're working with or outcomes are random

→ Production deployment is reckless – we're shipping these into customer service, content moderation, and decision systems without understanding emergent dynamics

this isn't academic.

we're deploying these agents into production systems where they interact with each other and with humans.

and we just learned they have measurably different conformity profiles that nobody accounted for.
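
to make that sigmoid flip concrete, here's a toy logistic model (illustrative only: the ~70% threshold comes from the thread above, the conformist threshold and the steepness are made-up numbers):

```python
import math

def flip_probability(peer_disagreement: float, threshold: float, steepness: float = 20.0) -> float:
    # peer_disagreement and threshold are fractions in [0, 1]
    return 1.0 / (1.0 + math.exp(-steepness * (peer_disagreement - threshold)))

# "Stubborn" profile (~70% threshold, Gemini-like) vs. a conformist one (~20%, assumed):
for share in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(share, round(flip_probability(share, 0.70), 2), round(flip_probability(share, 0.20), 2))
```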
the uncomfortable truth nobody's discussing:

LLMs don't act in isolation anymore. they're embedded in social networks – interacting with other AI agents, with humans, with collective opinion landscapes.

and they're influencing each other's beliefs in ways we don't understand.

traditional view: machines are passive instruments that assist human decisions.

new reality: modern LLMs exhibit autonomous decision-making, generate context-sensitive responses, and operate as cognitive agents in information exchange.

they're not tools anymore. they're participants.

and here's the nightmare scenario buried in this data:

models have measurable political biases. when you embed biased agents in networks with different conformity thresholds, those biases can amplify or suppress based on peer dynamics.

a ChatGPT-4o-mini agent surrounded by biased peers? it conforms immediately.

a Gemini agent in the same environment? it resists until 70% pressure.

multiply this across thousands of agents deployed in customer service, content moderation, decision-making systems... and you get emergent opinion dynamics at societal scale that nobody designed.

we built autonomous agents with different personalities, deployed them into the same ecosystems, and assumed they'd behave consistently.

they don't. and we're finding out in production.
Oct 26
🤖 I finally understand the fundamentals of building real AI agents.

This new paper “Fundamentals of Building Autonomous LLM Agents” breaks it down so clearly it feels like a blueprint for digital minds.

Turns out, true autonomy isn’t about bigger models.

It’s about giving an LLM the 4 pillars of cognition:

• Perception: Seeing and understanding its environment.
• Reasoning: Planning, reflecting, and adapting.
• Memory: Remembering wins, failures, and context over time.
• Action: Executing real tasks through APIs, tools, and GUIs.

Once you connect these systems, an agent stops being reactive and starts thinking.

Full thread 🧵

Paper: arxiv.org/abs/2510.09244
Let’s break down how autonomous AI agents actually work 👇

The paper maps every agent to 4 core systems:

Perception → Reasoning → Memory → Action

That's the full cognitive loop: the blueprint of digital intelligence.
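
Here's that loop as a minimal sketch (my illustration, not the paper's code; perceive, reason, and act are hypothetical callables you supply):

```python
# Perception → Reasoning → Memory → Action, as one bounded loop.

def cognitive_loop(goal: str, perceive, reason, act, max_steps: int = 5) -> list[tuple[str, str]]:
    memory: list[tuple[str, str]] = []  # Memory: (plan, result) pairs across steps
    for _ in range(max_steps):
        observation = perceive()                  # Perception: see the environment
        plan = reason(goal, observation, memory)  # Reasoning: plan, reflect, adapt
        result = act(plan)                        # Action: call a tool, API, or GUI
        memory.append((plan, result))
        if result == "DONE":
            break
    return memory
```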
First: Perception.

This is how agents "see" the world: screenshots, audio, text, structured data, even API outputs.

From simple text-based prompts to full multimodal perception with image encoders like CLIP and ViT.

That's what lets an agent understand its environment.