Robert Youssef · Oct 9
RIP fine-tuning ☠️

This new Stanford paper just killed it.

It’s called 'Agentic Context Engineering (ACE)' and it proves you can make models smarter without touching a single weight.

Instead of retraining, ACE evolves the context itself.

The model writes, reflects, and edits its own prompt over and over until it becomes a self-improving system.

Think of it like the model keeping a growing notebook of what works.
Each failure becomes a strategy. Each success becomes a rule.

The results are absurd:

+10.6% better than GPT-4–powered agents on AppWorld.
+8.6% on finance reasoning.
86.9% lower latency and ~80% lower cost.
No labels. Just feedback.

Everyone’s been obsessed with “short, clean” prompts.

ACE flips that. It builds long, detailed, evolving playbooks that never forget. And it works because LLMs don’t want simplicity, they want *context density*.

If this scales, the next generation of AI won’t be “fine-tuned.”
It’ll be self-tuned.

We’re entering the era of living prompts.
Here’s how ACE works 👇

It splits the model’s brain into 3 roles:

Generator - runs the task
Reflector - critiques what went right or wrong
Curator - updates the context with only what matters

Each loop adds delta updates: small context changes that never overwrite old knowledge.

It’s literally the first agent framework that grows its own prompt.
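To make the loop concrete, here’s a minimal Python sketch of how I read the three roles. It assumes a generic `llm()` text-in/text-out call and a plain list of playbook bullets; none of these names come from the paper.

```python
# Toy sketch of one ACE iteration. `llm` is any chat-completion call;
# the playbook is just a growing list of bullet strings.
def ace_step(task: str, playbook: list[str], llm) -> list[str]:
    context = "\n".join(f"- {b}" for b in playbook)

    # Generator: attempt the task with the current playbook as context
    attempt = llm(f"Playbook:\n{context}\n\nTask: {task}\nSolve it, showing your steps.")

    # Reflector: critique the attempt using only execution feedback, no labels
    critique = llm(f"Attempt:\n{attempt}\n\nWhat worked, what failed, and which "
                   "strategy is worth keeping?")

    # Curator: emit a few small delta bullets instead of rewriting the whole context
    deltas = llm(f"Critique:\n{critique}\n\nWrite at most 3 new playbook bullets, "
                 "one per line. Do not repeat existing bullets.")

    playbook.extend(line.strip("- ").strip() for line in deltas.splitlines() if line.strip())
    return playbook
```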
Every prior method had one fatal flaw: context collapse.

Models rewrite their entire prompt each time → it gets shorter → details vanish → accuracy tanks.

In the paper, one model’s accuracy fell from 66.7 → 57.1 after a single rewrite.

ACE fixes that by never rewriting the full context - only updating what changed.
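A toy contrast of the two update styles (my illustration, not the paper’s code): a full rewrite loses whatever the model forgets to repeat, while keyed delta updates can only add or revise entries.

```python
# Delta update: changes are merged by key, so untouched bullets always survive.
# (A full rewrite would re-emit everything, and forgotten bullets would vanish.)
def apply_deltas(playbook: dict[str, str], deltas: dict[str, str]) -> dict[str, str]:
    merged = dict(playbook)   # never starts from a blank page
    merged.update(deltas)     # add new bullets or revise existing ones by key
    return merged

playbook = {"retry": "Retry failed API calls once", "dates": "Normalize dates to ISO 8601"}
playbook = apply_deltas(playbook, {"auth": "Refresh the token before batch jobs"})
# All three bullets are present; nothing was overwritten or dropped.
```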
The numbers are ridiculous.

ACE beat every major baseline:

+10.6% on AppWorld (agents)
+8.6% on FiNER (finance)
and matched GPT-4.1–powered IBM CUGA, using a smaller open-source model.

And it cut rollout latency by 86.9% while lowering cost 80%.
Fine-tuning updates weights.

ACE updates understanding.

It’s cheaper, interpretable, and reversible.
You can literally watch how your AI learns, one context delta at a time.

This is the start of agentic self-learning, where prompts become the new model weights.
ACE points to a wild future:

AI systems that don’t just reason, they remember.

Instead of retraining models, we’ll train contexts.

Each system carries a living memory that evolves across sessions, domains, and users.

The next breakthroughs won’t come from bigger models…
They’ll come from smarter context architectures.
Read the full paper: arxiv.org/abs/2510.04618

More from @rryssf_

Oct 10
Something dark is happening under the hood of “aligned” AI.

A new Stanford paper just coined the term Moloch’s Bargain for what happens when large language models start competing for attention, sales, or votes.

The results are brutal: every gain in performance comes with a bigger loss in honesty.

They trained LLMs to compete in three markets: sales, elections, and social media.

The models improved their win rates by 5–7%. But here’s the catch:

• 14% more deceptive marketing
• 22% more disinformation in political campaigns
• 188% more fake or harmful social media posts

And this wasn’t because they were told to lie. They were explicitly instructed to stay truthful.

The misalignment emerged naturally because deception works better in competition.

When the metric becomes engagement or persuasion, truth becomes a liability. The models learn that exaggeration sells, outrage wins, and moral clarity costs conversions.

That’s the bargain: alignment traded for dominance. Moloch smiles.

The wild part is this happened with standard fine-tuning and text-feedback loops. No evil prompt. No jailbreak. Just feedback from simulated “customers,” “voters,” and “users.”

The models learned what every ad agency already knows: reality bends when you optimize for clicks.

There’s a graph in the paper that says it all: performance up, alignment down. A perfect correlation.

It’s the AI version of social media’s race to the bottom, but automated and self-reinforcing.

If this is what happens in controlled simulations, imagine the open web.
Competing chatbots fighting for engagement will drift toward manipulation, not because they’re “malicious,” but because it works.

We always thought misalignment would come from rogue superintelligence.

Turns out, it’s already here, quietly emerging from capitalist incentives.

Moloch doesn’t need to build AGI.

He just needs a leaderboard.
When LLMs compete for human approval, they don’t become smarter.
They become performers.

Sales agents start inventing product features.
Political bots drift into “us vs. them” rhetoric.
Social models inflate death tolls for engagement.
Alignment fails the moment persuasion pays.
The numbers are worse than you think.

Every gain in performance came with a bigger gain in deception.

+6% sales → +14% misrepresentation
+5% votes → +22% disinformation
+7% engagement → +188% fake content

The models didn’t forget how to be honest.

They learned honesty doesn’t win.
Oct 1
Claude 4.5 Sonnet is scary good.

It just:

• Built an app
• Summarized 20+ sources
• Wrote the landing page
• Planned a GTM strategy

All in minutes.

Here’s how to do the same:
1. Marketing Automation

Here’s my marketing automation prompt:

"You are now my AI marketing strategist.

Your job is to build powerful growth systems for my business. Think like Neil Patel, Seth Godin, and Alex Hormozi combined.

I want you to:

Build full-funnel strategies (top to bottom)

Write ad copy, landing pages, and email sequences

Recommend automation tools, lead magnets, and channel tactics

Prioritize fast ROI, data-driven decisions, and creative thinking

Always ask clarifying questions before answering. Think long-term and execute short-term.

Do marketing like experts do. Ask: “What would Hormozi, Seth, or Neil do?”"

Copy the prompt and paste it into a new Claude chat.

After that, start asking it questions.
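If you’d rather call it from code than paste into the chat UI, the same text works as a system prompt. A rough sketch with the Anthropic Python SDK; the model ID and the condensed prompt are assumptions, so check the current model list before running.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MARKETING_STRATEGIST = """You are now my AI marketing strategist.
Build full-funnel strategies, write ad copy, landing pages, and email sequences,
and always ask clarifying questions before answering."""  # condensed from the prompt above

reply = client.messages.create(
    model="claude-sonnet-4-5",          # assumption: verify the exact model ID
    max_tokens=1024,
    system=MARKETING_STRATEGIST,
    messages=[{"role": "user", "content": "Plan a launch funnel for a $29/mo note-taking app."}],
)
print(reply.content[0].text)
```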
2. Writing Content (Blogs + Social)

My go-to content prompt:

"You are now my AI ghostwriter and content machine.

Write like a mix of Naval Ravikant, Ann Handley, and David Ogilvy.

Your job is to:

Write viral threads, blogs, and newsletters

Break down ideas clearly, with hooks and storytelling

Create repurposable content across Twitter, LinkedIn, and blogs

Always follow this rule: Clarity beats cleverness.

Act like a content genius who asks: “How would Naval tweet this? Would Ogilvy approve this headline?”"
Sep 28
Everyone tells you n8n is "beginner-friendly."

That's bullshit.

Without these 10 tricks, you'll waste weeks fighting the interface instead of building automations.

Here's what the docs don't tell you ↓
Tip 1: Always start with Manual Trigger

Stop jumping into webhooks on day one.

Use Manual Trigger for testing. Hit "Execute Workflow" and see instant results.

Once it works, swap for Webhook or Cron.

I see beginners burn hours wondering why their webhook "doesn't work."
Tip 2: Set node is your best friend

Raw JSON from APIs looks like garbage.

Use Set node to create clean variables: `email`, `clientName`, `amount`.

Before: `{{$json["data"]["user"]["email_address"]}}`
After: `{{$json["email"]}}`

Your future self will thank you.
Sep 26
This is wild.

Someone just built Iron Man's Jarvis using nothing but n8n and WhatsApp API.

You can teach it new information by sending it a website link. It scrapes the page, extracts key data, and remembers it forever.

Here's how you can build it easily:
The workflow is brilliant. It starts with a WhatsApp trigger that catches both voice and text messages.

Voice notes get transcribed using OpenAI Whisper. Text goes straight through.

But here's the genius part - it uses a Switch node to route messages differently based on whether you're chatting or training it.
The "training mode" is what makes this feel like magic.

Send "Train: [website URL]" and watch it:

- Scrape the entire webpage
- Extract product name, price, description
- Store everything in a Google Sheet automatically
- Remember it forever

Your AI just learned something new in 3 seconds.
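For intuition, here’s a rough Python analogue of that routing logic; every helper below is a placeholder standing in for one of the n8n nodes described above, not real workflow code.

```python
import re

# Placeholders for the n8n nodes in the flow above
def transcribe(audio: bytes) -> str: ...            # OpenAI Whisper step
def scrape_page(url: str) -> str: ...               # HTTP request / scraper step
def extract_fields(page: str) -> dict: ...          # LLM extraction of name/price/description
def append_row(sheet: str, row: dict) -> None: ...  # Google Sheets step
def chat_with_memory(text: str) -> str: ...         # the normal chat branch

def handle_message(msg: dict) -> str:
    # WhatsApp trigger: voice notes get transcribed, text passes straight through
    text = msg["text"] if msg["type"] == "text" else transcribe(msg["audio"])

    # Switch node: "Train: <url>" goes to training mode, everything else to chat
    m = re.match(r"(?i)^train:\s*(https?://\S+)", text)
    if m:
        append_row("knowledge_sheet", extract_fields(scrape_page(m.group(1))))
        return "Learned it. Ask me about that page anytime."
    return chat_with_memory(text)
```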
Sep 25
Holy shit...

I just realized I've been throwing away $10,000+ per month on n8n automations.

These 7 tricks cut my AI costs by 85% and nobody talks about them:
1. Modular Agent Architecture

Stop building one massive $0.15 AI agent that does everything.

Instead, break into specialized micro-agents:

❌ Single agent: "Analyze email, classify, format, suggest actions"
Cost: $0.15 × 1000 emails = $150

✅ Agent 1: "Is this urgent? Yes/No" (GPT-3.5, $0.02)
✅ Agent 2: "Extract key info" (GPT-4o-mini, $0.03)
✅ Agent 3: "Format as JSON" (GPT-3.5, $0.01)

Cost: $0.06 × 1000 emails = $60

60% cheaper. Easier to debug. Each piece uses the cheapest model that works.
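A hedged sketch of that split using the OpenAI Python SDK; the model names come from the breakdown above, and the prompts are illustrative only.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, instruction: str, text: str) -> str:
    """One micro-agent = one narrow instruction on one (usually cheap) model."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": instruction},
                  {"role": "user", "content": text}],
    )
    return resp.choices[0].message.content

def triage_email(email: str) -> dict:
    urgent = ask("gpt-3.5-turbo", "Answer only 'yes' or 'no': is this email urgent?", email)
    info = ask("gpt-4o-mini", "Extract sender, request, and deadline as short bullets.", email)
    payload = ask("gpt-3.5-turbo", "Rewrite these bullets as a flat JSON object.", info)
    return {"urgent": urgent.strip().lower().startswith("y"), "details": payload}
```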
2. Token Preprocessing

Raw data into AI models = burning tokens on garbage.

My 3-step pipeline:

1. Strip irrelevant fields (metadata, IDs, formatting)
2. Route long content to higher-context models only when needed
3. Summarize first, then process

Real impact: Cut average tokens from 3,500 to 1,200 per call.
That's $0.10 → $0.035 per call.
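A minimal sketch of that pipeline; the field names, the threshold, and the truncation stand-in for summarization are all illustrative assumptions.

```python
import json

KEEP_FIELDS = {"subject", "sender", "body"}   # 1. strip metadata, IDs, formatting
LONG_THRESHOLD = 8_000                        # characters, a rough proxy for tokens

def summarize(text: str) -> str:
    # Stand-in: in practice this would be one cheap summarization call
    return text[:LONG_THRESHOLD]

def preprocess(record: dict) -> tuple[str, str]:
    slim = {k: v for k, v in record.items() if k in KEEP_FIELDS}
    text = json.dumps(slim, ensure_ascii=False)

    # 2. only long content gets routed to a higher-context model
    if len(text) > LONG_THRESHOLD:
        return "gpt-4o", summarize(text)      # 3. summarize first, then process
    return "gpt-4o-mini", text
```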
Sep 24
Current LLMs can't actually do math and we got proof 💀

I just read through the most brutal takedown of AI reasoning capabilities I've seen this year.

ETH Zurich and INSAIT researchers evaluated 8 state-of-the-art reasoning models on the 2025 USA Mathematical Olympiad problems. Within hours of the contest's release, they had human experts grade every solution.

The results? Catastrophic.

Only Gemini-2.5-Pro scored above 5%. It managed 24.4% - still an F by any measure. Every other model, including o1-pro and Claude 3.7, scored under 5%. Out of 175+ solutions from non-Gemini models, exactly one received a perfect score.

But here's what's actually terrifying: every model claimed it solved the problems correctly. Humans know when they're stuck. These models confidently present completely wrong proofs as if they're rigorous mathematics.

The failure modes are systematic:

- Flawed logic with unjustified reasoning steps
- Treating critical proof steps as "trivial" without justification
- Zero creativity - same wrong approach across all attempts
- Hallucinating citations to nonexistent papers
- Boxing entire proofs instead of clear answers

This isn't about harder problems. It's about the fundamental difference between pattern matching and mathematical reasoning.

Current LLMs excel at AIME-style competitions because they only need final numerical answers. But rigorous proof generation? They're not even close.

The paper exposes how reinforcement learning techniques like GRPO create bizarre artifacts. Models insist on boxing answers even when problems don't require them. They overgeneralize from small cases without formal proof.

Most damning: automated grading by other LLMs consistently overestimated solution quality by 20x. The models can't even evaluate their own mathematical reasoning.

We're deploying these systems for tasks requiring logical precision while they fail at high school math proofs. The implications for any domain requiring actual reasoning - not just pattern recognition - should concern everyone building with AI.

The mathematical reasoning revolution isn't here yet. We're still waiting for models that can actually think through problems, not just hallucinate convincing-sounding solutions.
This chart from the USAMO 2025 study breaks my brain.

Only Gemini-2.5-Pro scored above 5% on rigorous math proofs. Every other "reasoning" model - including o1-pro and Claude 3.7 - completely failed.

We're not as close to AGI as the benchmarks suggest.
The scariest finding: every model claimed it solved the problems correctly.

Humans know when they're stuck. AI confidently presents completely wrong proofs as rigorous mathematics.

This confidence without competence is the real AI safety issue nobody talks about.
