Robert Youssef
Oct 26, 2025 · 11 tweets · 4 min read
🤖 I finally understand the fundamentals of building real AI agents.

This new paper “Fundamentals of Building Autonomous LLM Agents” breaks it down so clearly it feels like a blueprint for digital minds.

Turns out, true autonomy isn’t about bigger models.

It’s about giving an LLM the 4 pillars of cognition:

• Perception: Seeing and understanding its environment.
• Reasoning: Planning, reflecting, and adapting.
• Memory: Remembering wins, failures, and context over time.
• Action: Executing real tasks through APIs, tools, and GUIs.

Once you connect these systems, an agent stops being reactive and starts thinking.

Full thread 🧵

Paper: arxiv.org/abs/2510.09244
Let’s break down how autonomous AI agents actually work 👇

The paper maps every agent to 4 core systems:

Perception → Reasoning → Memory → Action

That’s the full cognitive loop: the blueprint of digital intelligence.
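That loop can be sketched in a few lines. Everything below is a toy stand-in (the "reasoner" is a hard-coded rule, not an LLM, and none of the names come from the paper), just to show how the four systems connect:

```python
# Toy version of the Perception -> Reasoning -> Memory -> Action loop.
# ToyEnv, reason(), and run_agent() are invented stand-ins, not APIs
# from the paper.

class ToyEnv:
    """A one-dimensional world: the agent must walk to a goal position."""
    def __init__(self):
        self.position, self.goal = 0, 3

    def observe(self):                                  # Perception
        return {"position": self.position, "goal": self.goal}

    def execute(self, action):                          # Action
        self.position += 1 if action == "move_right" else -1

def reason(observation):                                # Reasoning (stand-in for an LLM call)
    if observation["position"] == observation["goal"]:
        return None                                     # goal reached: stop
    return "move_right" if observation["position"] < observation["goal"] else "move_left"

def run_agent(env, max_steps=10):
    memory = []                                         # Memory: trajectory so far
    for _ in range(max_steps):
        obs = env.observe()
        action = reason(obs)
        if action is None:
            return obs["position"], memory
        env.execute(action)
        memory.append((obs, action))
    return env.observe()["position"], memory
```

Observe, reason over what you remember, act, store the result, repeat: that cycle is the whole architecture in miniature.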
First: Perception.

This is how agents “see” the world: screenshots, audio, text, structured data, even API outputs.

Perception ranges from simple text-based prompts to full multimodal input with image encoders like CLIP and ViT.

That’s what lets an agent understand its environment.
To make perception sharper, they use VCoder and Set-of-Mark.

Set-of-Mark = giving the model “visual anchors”: labeled bounding boxes it can reason around.
This massively reduces hallucination and object confusion.

Your AI agent literally learns where to look.
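A rough sketch of the idea, assuming a detector has already produced labeled boxes (the box format and prompt wording are my own illustration, not the paper's):

```python
# Sketch of Set-of-Mark prompting: each detected region gets a numbered
# "visual anchor" the model can reference by ID instead of guessing at
# raw coordinates. Box format and prompt wording are assumptions.

def set_of_mark_prompt(question, boxes):
    """boxes: list of (label, (x1, y1, x2, y2)) from any object detector."""
    marks = "\n".join(
        f"[{i}] {label} at ({x1},{y1})-({x2},{y2})"
        for i, (label, (x1, y1, x2, y2)) in enumerate(boxes, start=1)
    )
    return (
        f"The image contains these marked regions:\n{marks}\n\n"
        f"Refer to regions by their mark IDs.\n"
        f"Question: {question}"
    )

boxes = [("search button", (10, 20, 60, 40)),
         ("text field", (70, 20, 300, 40))]
prompt = set_of_mark_prompt("Where should I type my query?", boxes)
```

The model now answers "mark [2]" instead of hallucinating a location.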
Next up: Reasoning.

This is where agents plan, reflect, and adapt using methods like:

→ Chain-of-Thought
→ Tree-of-Thought
→ Decompose, Plan in Parallel, and Merge (DPPM)

These aren’t prompts. They’re thinking architectures.

This is how an agent stops guessing and starts reasoning.
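Tree-of-Thought in particular is easy to picture as search: generate a few candidate "thoughts" per step, score them, keep the best branches, expand again. A toy version, with fixed step sizes as the generator and distance-to-target as the scorer standing in for LLM calls:

```python
# Toy Tree-of-Thought: expand several candidate "thoughts" per step,
# score them, keep the best few branches, and expand again. Generator
# and scorer are trivial stand-ins for LLM calls.

def tree_of_thought(start, target, steps=(1, 2, 3), depth=4, beam=2):
    frontier = [(start, [])]                      # (value, path of steps taken)
    for _ in range(depth):
        candidates = []
        for value, path in frontier:
            for s in steps:                       # generate candidate thoughts
                candidates.append((value + s, path + [s]))
        candidates.sort(key=lambda c: abs(target - c[0]))   # evaluate thoughts
        for value, path in candidates:
            if value == target:
                return path
        frontier = candidates[:beam]              # prune to the best branches
    return None
```

Swap the generator and scorer for LLM calls and you have the real thing: deliberate search over partial solutions instead of one greedy completion.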
Agents also reflect on their mistakes.

The Reflection system evaluates its own outputs, rewrites failed steps, and stores feedback for next time.

There’s even “Anticipatory Reflection”: the agent critiques itself before acting.

That’s how self-correction becomes second nature.
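A minimal sketch of that critique-before-acting loop, with trivial rule-based stand-ins for the drafter and critic LLM calls:

```python
# Sketch of a reflection loop: draft, self-critique before acting,
# revise, and keep each critique for next time. Drafter and critic
# are rule-based stand-ins for LLM calls.

def reflect_and_retry(task, drafter, critic, feedback_log, max_rounds=3):
    draft = drafter(task, feedback_log)
    for _ in range(max_rounds):
        critique = critic(task, draft)        # anticipatory check before acting
        if critique is None:
            return draft                      # no objections: act on it
        feedback_log.append(critique)         # remember the failure mode
        draft = drafter(task, feedback_log)   # rewrite the failed step
    return draft

# Toy stand-ins: the "task" is to produce a sentence ending in a period.
drafter = lambda task, log: task + "." if log else task
critic = lambda task, draft: None if draft.endswith(".") else "missing period"

log = []
answer = reflect_and_retry("hello world", drafter, critic, log)
```

The feedback log is the key detail: the critique survives into the next draft, so the same mistake isn't repeated.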
When agents scale, they evolve into multi-agent systems.

Each agent becomes an expert: planner, memory manager, debugger, action executor.

They coordinate like a digital team.

We’re basically designing AI organizations inside one model.
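A toy sketch of that division of labor: a coordinator routes each phase to a role-specialized handler. Roles and handlers here are illustrative placeholders, not anything defined in the paper:

```python
# Toy role specialization: a coordinator routes each phase to the
# "expert" for that role. Each lambda stands in for a separate agent.

def make_team():
    return {
        "plan":    lambda task: [f"step: {t}" for t in task.split(", ")],
        "execute": lambda step: f"done({step})",
        "debug":   lambda result: result.replace("error", "fixed"),
    }

def coordinate(task, team):
    steps = team["plan"](task)                      # planner decomposes the goal
    results = [team["execute"](s) for s in steps]   # executor runs each step
    return [team["debug"](r) for r in results]      # debugger cleans up failures

out = coordinate("fetch data, write report", make_team())
```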
Memory is the secret sauce.

Agents use short-term context windows, long-term memory banks, and RAG-based recall to remember experiences and strategies.

It’s the difference between “doing” and “learning.”

Without memory, you don’t get agents. You get amnesia.
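A minimal sketch of that split, with naive keyword overlap standing in for embedding-based RAG retrieval:

```python
# Sketch of agent memory: a bounded short-term window plus a long-term
# store. Keyword overlap is a stand-in for vector-embedding recall.

from collections import deque

class AgentMemory:
    def __init__(self, window=3):
        self.short_term = deque(maxlen=window)   # recent context only
        self.long_term = []                      # everything, searchable

    def store(self, entry):
        self.short_term.append(entry)
        self.long_term.append(entry)

    def recall(self, query, k=2):
        # Stand-in for vector search: rank entries by word overlap.
        words = set(query.lower().split())
        ranked = sorted(
            self.long_term,
            key=lambda e: len(words & set(e.lower().split())),
            reverse=True,
        )
        return ranked[:k]
```

The short-term window is what fits in the context; the long-term store plus `recall` is what lets old experience come back when it's relevant.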
Finally: Execution.

Where thoughts turn into actions.

Agents use structured tool calls, code generation, and multimodal control (mouse, keyboard, GUI).

It’s not hypothetical: they can use apps like humans do.

We’re not far from AI that runs your computer for you.
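Structured tool calling can be sketched as a registry plus a dispatcher that validates the model's emitted call. The tool names and JSON shape below are invented for illustration:

```python
# Sketch of structured tool calling: the model emits a JSON call, the
# dispatcher validates it against a registry and executes it.

import json

TOOLS = {
    "add":   lambda args: args["a"] + args["b"],
    "upper": lambda args: args["text"].upper(),
}

def dispatch(tool_call_json):
    call = json.loads(tool_call_json)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}  # refuse unregistered tools
    return {"result": TOOLS[name](args)}
```

The registry is the safety boundary: the model can only request actions, never execute arbitrary ones.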
So when people say “agents are just LLMs with tools”…
show them this.

Perception. Reasoning. Memory. Action.

Each one architected, tested, and connected in a feedback loop.

That’s not a chatbot.

That’s cognitive software.
Stop wasting hours writing prompts

→ 10,000+ ready-to-use prompts
→ Create your own in seconds
→ Lifetime access. One-time payment.

Claim your copy 👇
godofprompt.ai/pricing


More from @rryssf_

Mar 6
Google DeepMind just taught an AI to do something most AI models are terrible at: actually learn from being told it's wrong.

the technique is called Social Meta-Learning. it's borrowed from developmental psychology, not machine learning.

and it transfers across domains. train it on math correction, it gets better at learning from coding feedback too.

here's what they did:
here's the uncomfortable truth about every chatbot you use right now.

current LLMs are trained almost entirely for single-turn performance. give a prompt, get an answer. one shot.

this means they're actually bad at the thing conversations are supposed to be for: learning through back-and-forth.

you correct them, they don't really integrate the correction. you give feedback, they acknowledge it but don't fundamentally shift their approach. the dialogue feels static because it is.

the researchers say post-training might actually make this worse.
the DeepMind team reframed the problem completely.

instead of asking "how do we make models give better single-turn answers?"
they asked: "how do we teach a model to learn from being taught?"

they borrowed a concept from developmental psychology called social meta-learning. it's how children learn to learn from other people. not just absorbing information, but learning the skill of extracting useful information from social interaction.

the insight: learning from feedback is itself a trainable skill. not an emergent property. a skill.
Feb 28
researchers put heavy TikTok users in an EEG and found something unsettling.

their frontal lobe activity was reduced during focus tasks.

the weird part: their behavioral performance looked normal. the damage only showed up in the brain scans.

here's what's actually happening:
the study measured "theta power" in the prefrontal cortex during attention tasks.

theta waves are the neural signature of executive control. the thing that lets you focus, ignore distractions, and finish what you started.

heavy short-form video users showed significantly reduced theta activity in the frontal region.

even after controlling for anxiety, depression, age, and gender.
here's the disturbing part:

the behavioral tests looked fine. participants could still complete the tasks.

but the neural machinery underneath was working harder and firing weaker.

this is what early-stage cognitive decline looks like. function stays normal while the infrastructure degrades.
Feb 26
Google DeepMind just used AlphaEvolve to breed entirely new game-theory algorithms that outperform ones humans spent years designing

the discovered algorithms use mechanisms so non-intuitive that no human researcher would have tried them.

here's what actually happened and why it matters:
first, the framing matters.

this isn't "ask ChatGPT to write an algorithm." this is AlphaEvolve, Google's evolutionary coding agent powered by Gemini 2.5 Pro.

it treats algorithm source code as a genome. the LLM acts as a genetic operator, rewriting logic, injecting new control flows, mutating symbolic operations.

then it evaluates the offspring against game-theoretic benchmarks and evolves the next generation.

it's not prompting. it's natural selection over code.
the target: two foundational families in multi-agent reinforcement learning.

counterfactual regret minimization (CFR) and policy space response oracles (PSRO).

these are the algorithms behind things like superhuman poker AI. they find Nash equilibria in imperfect-information games.

the problem: designing effective variants of these algorithms has been a manual, intuition-driven process for nearly two decades. each new game setting demands its own specialized tweaks.

DeepMind asked: what if you let evolution find the tweaks instead?
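The evolve-evaluate-select loop can be sketched in miniature. Here the "genome" is a single number and the "benchmark" a simple function, standing in for source code and game-theoretic evaluation; nothing below is AlphaEvolve's actual code:

```python
# Miniature evolve-evaluate-select loop in the AlphaEvolve spirit:
# mutate candidates, score them on a benchmark, keep the fittest.

import random

def evolve(fitness, population, generations=100, seed=0):
    rng = random.Random(seed)
    for _ in range(generations):
        offspring = [p + rng.uniform(-1, 1) for p in population]  # mutate
        pool = population + offspring
        pool.sort(key=fitness, reverse=True)                      # evaluate
        population = pool[:len(population)]                       # select fittest
    return population[0]

# Maximize -(x - 3)^2, i.e. find x near 3, starting far away.
best = evolve(lambda x: -(x - 3.0) ** 2, population=[0.0, 10.0])
```

In AlphaEvolve the mutation step is an LLM rewriting source code and the fitness function is a game-theoretic benchmark, but the selection pressure works the same way.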
Feb 26
Google DeepMind just published something that isn't a benchmark or a new model.

it's a governance framework for when AI agents start hiring other AI agents.

sounds abstract. it's not. this is the missing infrastructure layer for the "agentic web."

here's why it matters:
current multi-agent systems treat delegation as task splitting.

"break this into subtasks, assign them to tools."

DeepMind's argument: that's not delegation. that's just decomposition.

real delegation transfers authority, responsibility, and accountability. current systems transfer none of these.
when an agent delegates to another agent today, you get:

> no clear authority boundaries
> no verification that work was actually done correctly
> no accountability chain when things fail
> no trust calibration based on track record

the whole thing runs on hope and well-structured prompts.
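One way to picture what a delegation record would need to carry, one field per gap listed above. Field and method names are invented for illustration; the paper defines no concrete API:

```python
# Sketch of a delegation record: authority boundaries, verification of
# outcomes, an accountability chain, and a trust score from track record.

from dataclasses import dataclass, field

@dataclass
class Delegation:
    delegator: str
    delegate: str
    scope: list                                   # explicit authority boundaries
    outcomes: list = field(default_factory=list)  # accountability chain

    def authorize(self, action):
        return action in self.scope               # act only inside granted authority

    def record(self, action, success):
        self.outcomes.append((action, success))   # verified result, not hope

    def trust_score(self):
        if not self.outcomes:
            return 0.0                            # no track record yet
        return sum(ok for _, ok in self.outcomes) / len(self.outcomes)
```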
Feb 24
Google Research just proved you can boost llm accuracy by up to 76 percentage points with zero extra output tokens, zero latency increase, and zero fine-tuning 🤯

the technique: paste your prompt twice.

that's it. that's the paper.

but WHY it works reveals something important about how every llm you use actually reads your input:
every major llm processes text left to right. each token can only attend to tokens that came before it. never forward.

this means when you write a prompt like:

[long context] → [question at the end]

the context tokens were processed without any awareness of what question was coming.

the model reads your setup blind, then answers with whatever representations it already locked in.

your question arrives too late to reshape how the context was understood.
the paper's solution is almost absurdly simple.

instead of sending [long context] → [question] once, send it twice: [long context] → [question] → [long context] → [question].

when the model hits the second copy, every token now attends to the full first copy. the question has already been seen. the context gets reprocessed with complete awareness.

you're essentially giving a unidirectional model a form of bidirectional attention. without changing the architecture. without any new training. just by repeating yourself.
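The trick itself is one line of string handling (the template wording below is my own, not the paper's exact format):

```python
# The duplication trick: send the whole prompt twice so every context
# token in the second copy can attend back to the question in the first.

def duplicate_prompt(context, question):
    single = f"{context}\n\nQuestion: {question}"
    return f"{single}\n\n{single}"    # identical copy appended
```

More input tokens, but zero extra output tokens, which is where the latency claim comes from.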
Feb 23
Deepseek just broke the one rule every transformer has followed for a decade 🤯

x + f(x). the residual connection.

if you don't know what that means, here's the simple version: every time a neural network processes your input through a layer, it keeps a copy of the original and adds it back at the end. like a safety net. if the layer screws up, the original signal survives.

gpt-4 uses it. claude uses it. gemini uses it. every major model since 2015 treats this as sacred. nobody touches it.

Deepseek touched it.

instead of 1 stream carrying your data forward, they split it into 4 parallel streams. each stream carries different aspects of the information. and learned mixing matrices decide how those streams talk to each other at every layer.

more lanes on the highway. smarter traffic control. same computational cost.

sounds perfect on paper. here's where it breaks:
ByteDance actually tried this first. they published "hyper-connections" (HC) and it looked incredible on small models. faster convergence. better benchmarks. the theory was sound.

then they tried to scale it.

at 27B parameters, things went wrong. the mixing matrices that control how the 4 streams blend together have no guardrails. nothing stops them from amplifying signals.

imagine a game of telephone, but instead of the message getting quieter, it gets louder at every step. by the time it passes through 60 layers, the signal has been amplified ~3000x.

that's not a slow degradation. that's an explosion.

Deepseek saw it happen in real time: a loss spike at training step 12,000. gradient norms shot through the roof. the model wasn't learning anymore. it was screaming.

most teams would have abandoned the idea. Deepseek asked a different question.
their insight was clean:

the problem isn't giving the model multiple streams. the problem is nobody told the streams how to behave.

unconstrained mixing means any matrix value is fair game. positive, negative, huge, tiny. multiply those across 60 layers and you get chaos.

Deepseek's fix: force every mixing matrix to follow a strict rule.

it's called the Birkhoff polytope. fancy name, simple idea:

> every row must sum to 1
> every column must sum to 1
> every entry must be zero or positive

in plain english: information can be redistributed between streams, but it cannot be created or destroyed.

the analogy that clicks: imagine 4 glasses of water. you can pour between them however you want. any combination, any amount. but the total water across all 4 glasses must stay exactly the same.

no glass overflows. no glass runs dry. the system stays balanced no matter what you do.

that's the constraint. and it changes everything.
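the constraint is checkable in a few lines: Sinkhorn normalization pushes a positive matrix toward the Birkhoff polytope, where every row and column sums to 1. a pure-Python sketch of the idea, not DeepSeek's actual training-time projection:

```python
# Checking the "water pouring" constraint numerically: alternately
# normalize rows and columns until the matrix is (near-)doubly
# stochastic, so mixing redistributes signal without amplifying it.

def sinkhorn(matrix, iters=50):
    m = [[max(v, 1e-9) for v in row] for row in matrix]      # keep entries positive
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]       # rows sum to 1
        col_sums = [sum(row[c] for row in m) for c in range(len(m[0]))]
        m = [[row[c] / col_sums[c] for c in range(len(row))]
             for row in m]                                   # columns sum to 1
    return m

mix = sinkhorn([[5.0, 1.0, 0.1, 2.0],
                [0.3, 4.0, 1.0, 0.5],
                [1.0, 0.2, 3.0, 1.0],
                [0.5, 0.5, 0.5, 6.0]])
# Total "water" is conserved: the 16 entries sum to 4, one unit per stream.
```

multiply a bounded signal by matrices like this across 60 layers and nothing can blow up 3000x: each layer can only pour between the four glasses.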
