The best frontier LLMs achieve 0% on hard real-life programming-contest problems, a domain where expert humans still excel.
LiveCodeBench Pro is a benchmark composed of problems from Codeforces, ICPC, and the IOI (International Olympiad in Informatics) that is continuously updated to reduce the likelihood of data contamination.
📌 The Gap Targeted
Earlier reports claimed frontier LLMs now top human grandmasters, but a cost-versus-rating plot proves otherwise.
Even the best model, o4-mini-high, sits near 2,100 Elo once tool calls are blocked, far from the 2,700 legend line that marks real grandmasters.
🗂️ Building the Benchmark
A team of medal winners harvests each Codeforces, ICPC, and IOI problem as soon as a contest ends, before editorials appear, eliminating training-data leakage.
They store 584 tasks and tag each one as knowledge-, logic-, or observation-heavy, producing a balanced skill matrix.
📊 Rating Models Fairly
Every submission is treated like a chess game against the task’s official difficulty rating.
A Bayesian MAP Elo fit then assigns each model a rating that lines up with human percentiles and strips out typing-speed bias.
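To make the rating scheme concrete, here is a minimal sketch of a MAP Elo fit (not the paper's code; the problem ratings and outcomes below are made up): each submission counts as a win or loss against a problem of known difficulty, and a Gaussian prior keeps the estimate anchored.

```python
# Minimal MAP Elo sketch: fit a single rating from pass/fail outcomes
# against problems of known difficulty (all numbers here are hypothetical).
import numpy as np
from scipy.optimize import minimize_scalar

problem_ratings = np.array([1200, 1500, 1800, 2100, 2400])  # official difficulties
solved = np.array([1, 1, 1, 0, 0])                          # 1 = accepted, 0 = rejected

def neg_log_posterior(r, prior_mean=1500.0, prior_sd=350.0):
    # Elo win probability of a player rated r against each problem
    p = 1.0 / (1.0 + 10.0 ** ((problem_ratings - r) / 400.0))
    log_lik = np.sum(solved * np.log(p) + (1 - solved) * np.log(1 - p))
    log_prior = -0.5 * ((r - prior_mean) / prior_sd) ** 2    # Gaussian prior on r
    return -(log_lik + log_prior)

fit = minimize_scalar(neg_log_posterior, bounds=(0, 4000), method="bounded")
print(f"MAP Elo estimate: {fit.x:.0f}")
```

Because the rating is inferred only from solve/fail outcomes at known difficulties, typing speed never enters the fit.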
🎯 Where Models Shine and Fail
Figure 2 shows models sail through template-friendly topics like segment trees or dynamic programming, yet plunge below 1,500 Elo on game theory, greedy tricks, and messy casework.
Zero hard-tier solves confirm the cliff.
🔍 Why Submissions Fail
A treemap comparison finds o3-mini commits many wrong algorithms and missed insights, while humans mainly slip on implementation details.
Models also trip on sample tests they never run locally, something human coders catch instantly.
🔁 More Tries, Better Outcomes
Letting o4-mini fire ten attempts lifts its rating by about 540 points and doubles the medium-tier pass rate, but hard problems remain untouched at 0%.
🧠 Does Reasoning Help?
Adding explicit chain-of-thought boosts combinatorics by up to 1,400 Elo and lifts knowledge-heavy tags, yet barely moves observation-heavy tags such as greedy or ad hoc, hinting that current reasoning traces miss the aha moment.
💰 Terminal Power Matters
The authors estimate that around 400 Elo of the published 2,700 score comes from terminal access that lets a model compile, unit-test, and brute-force patterns during inference.
It’s a hefty 206-page research paper, and the findings are concerning.
"LLM users consistently underperformed at neural, linguistic, and behavioral levels"
This study finds LLM dependence weakens the writer’s own neural and linguistic fingerprints. 🤔🤔
Using EEG, text analysis, and a cross-over session, the authors show that keeping some AI-free practice time protects memory circuits and encourages richer language even when a tool is later reintroduced.
⚙️ The Experimental Setup
Fifty-four Boston-area students wrote SAT-style essays under three conditions: ChatGPT only, Google only, or brain only.
Each person completed three timed sessions with the same condition, then an optional fourth session in the opposite condition.
A 32-channel Enobio headset recorded brain signals throughout, and every keystroke, prompt, and interview answer was archived for analysis.
🧠 Brain Connectivity Results
Alpha and beta networks were strongest when no external tool was allowed, moderate with Google, and weakest with ChatGPT.
Lower coupling during LLM use signals reduced internal attention and memory rehearsal, while high parieto-frontal flow in the brain-only group matches deep semantic processing.
Large Language Model agents are vulnerable to prompt injection attacks that hijack tool use and leak data.
The paper proposes six design patterns that restrict where untrusted text can act, giving resistance without crippling usefulness.
⚙️ The Core Concepts
Prompt injection slips malicious text into an agent’s context and rewrites its plan.
Filters, adversarial training, and user approval are brittle because clever wording can still bypass them.
The authors instead isolate untrusted data with structured workflows that block it from gaining control.
🛡️ Action-Selector Pattern
The agent picks one permitted action from a fixed list and never processes tool output.
Because no feedback loop exists, injected text cannot trigger unexpected calls.
Use cases are simple routers such as customer-service macros or database shortcuts.
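A minimal sketch of the pattern (illustrative only; the `llm` callable and the two actions are hypothetical):

```python
# Action-selector sketch: the model may only pick one action from a fixed
# allow-list, and the tool's output never re-enters the model's context.
ALLOWED_ACTIONS = {
    "reset_password": lambda user: f"Password reset link sent to {user}",
    "order_status":   lambda user: f"Order status for {user}: shipped",
}

def run_action_selector(llm, user_request: str, user_id: str) -> str:
    # The model sees the request plus the menu of action names, nothing else.
    choice = llm(f"Pick exactly one of {sorted(ALLOWED_ACTIONS)} for: {user_request}").strip()
    if choice not in ALLOWED_ACTIONS:
        return "Sorry, I can't help with that."
    # The result goes straight to the user; with no feedback loop,
    # injected text in the output cannot trigger further calls.
    return ALLOWED_ACTIONS[choice](user_id)
```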
📑 Plan-Then-Execute Pattern
The agent first writes a full action plan, locks it, then runs tools; outputs cannot add new steps.
This keeps control flow intact while still letting the agent react to outside data inside each step.
Attacks can still tamper with parameters, so the plan must avoid unsafe primitives.
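A rough sketch of the same idea (hypothetical `llm` and `tools`, not the paper's implementation):

```python
# Plan-then-execute sketch: the plan is fixed before any tool runs, so tool
# outputs can fill in data but cannot add, remove, or reorder steps.
import json

def plan_then_execute(llm, tools: dict, user_goal: str) -> str:
    # 1. Planning pass: the model sees only the trusted user goal.
    plan = json.loads(llm(f"Return a JSON list of tool names to achieve: {user_goal}"))
    locked_plan = [step for step in plan if step in tools]   # validate and lock the plan

    # 2. Execution pass: untrusted tool output flows through as data only;
    #    it never decides which tool runs next.
    context = user_goal
    for tool_name in locked_plan:
        context = tools[tool_name](context)
    return context
```

Note the residual risk flagged above: injected text can still poison the parameters passed between steps, which is why the locked plan should avoid dangerous primitives.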
Anthropic just dropped a beautiful explanation of how they built a multi-agent research system using multiple Claude agents.
A MUST read for anyone building multi-agent systems.
A lead agent plans research steps, spawns specialized subagents to search in parallel, and then gathers and cites results. It covers architecture, prompt design, tool selection, evaluation methods, and production challenges to make AI research reliable and efficient.
Single-agent research assistants stall when queries branch into many directions. Anthropic links one lead Claude with parallel subagents to chase each thread at once, then fuses their findings.
⚙️ The Core Concepts
Research questions rarely follow a straight path, so a fixed pipeline leaves gaps. One lead agent plans the investigation, spawns subagents that roam in parallel, and later condenses their notes into a coherent answer.
🧠 Why Multi-Agent Architecture Helps
Each subagent brings its own context window, so the system can pour in many more tokens than a single model could hold. Anthropic measured that token volume alone explained 80% of the performance variance on BrowseComp, and adding subagents pushed performance 90.2% past a lone Claude Opus 4 on internal tasks.
Running agents in parallel also cuts wall-clock time because searches, tool calls, and reasoning steps happen side by side rather than one after another.
@AnthropicAI
🛠️ Architecture Walkthrough
The orchestrator-worker pattern gives the lead agent control while letting specialists act independently. A user query lands with the lead Researcher, which thinks aloud, stores the plan in memory, and distributes focused jobs like "list company directors" or "trace chip shortages".
Subagents call web search or workspace tools, judge results with interleaved thinking, and return concise digests. A citation agent then pins every claim to a source before the answer reaches the user.
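A rough sketch of that orchestrator-worker shape (the `claude` async callable is a hypothetical stand-in, not Anthropic's actual code):

```python
# Orchestrator-worker sketch: a lead agent decomposes the query, subagents
# research their slices in parallel, and their digests are fused at the end.
import asyncio

async def subagent(claude, task: str) -> str:
    # Each subagent gets its own context window and returns a concise digest.
    return await claude(f"Research this and return a short, sourced digest: {task}")

async def lead_researcher(claude, query: str) -> str:
    plan = await claude(f"Break this into 2-4 independent research subtasks: {query}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    digests = await asyncio.gather(*(subagent(claude, t) for t in subtasks))
    return await claude("Combine these digests into one answer with citations:\n\n"
                        + "\n\n".join(digests))

# answer = asyncio.run(lead_researcher(claude_api, "Trace current chip shortages"))
```

The parallel `gather` call is where the wall-clock savings come from: subagent searches run side by side instead of one after another.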
🧩 Prompt Design and Agent Coordination
Early versions wasted effort by spawning 50 subagents for a trivial fact or by looping forever when data was scarce. The team fixed this by encoding explicit scaling rules, teaching the lead agent how many helpers fit a task and capping tool calls per helper.
Prompts also nudge subagents to start with broad queries, skim available material, and narrow only when needed, which mirrors expert human research habits.
Claude itself rewrites poor tool descriptions, which prevents misuse and trims task completion time by 40%.
→ AI Agents react to prompts; Agentic AI initiates and coordinates tasks.
→ Agentic AI includes orchestrators and meta-agents to assign and oversee sub-agents.
🧵1/n
🧠 The Core Concepts
AI Agents and Agentic AI are often confused as interchangeable, but they represent different stages of autonomy and architectural complexity.
AI Agents are single-entity systems driven by large language models (LLMs). They are designed for task-specific execution: retrieving data, calling APIs, automating customer support, filtering emails, or summarizing documents. These agents use tools and perform reasoning through prompt chaining, but operate in isolation and react only when prompted.
Agentic AI refers to systems composed of multiple interacting agents, each responsible for a sub-task. These systems include orchestration, memory sharing, role assignments, and coordination.
Instead of one model handling everything, there are planners, retrievers, and evaluators communicating to achieve a shared goal. They exhibit persistent memory, adaptive planning, and multi-agent collaboration.
🏗️ Architectural Breakdown
AI Agents: Structured as a single LLM-driven model. Equipped with external tools. Operate through a cycle of perception, reasoning, and action. Execute one task at a time with limited context continuity.
Agentic AI: Uses multiple LLM-driven agents. Supports task decomposition, role-based orchestration, and contextual memory sharing. Agents communicate via queues or buffers and learn from feedback across sessions.
🔧 How AI Agents Work
An AI Agent typically receives a user prompt, chooses the correct tool (e.g., search engine, database query), gets results, and then generates an output. It loops this with internal reasoning until the task is completed. Frameworks like LangChain and AutoGPT are built on this structure.
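A minimal, framework-agnostic sketch of that loop (the `llm` and `tools` objects here are hypothetical stand-ins, not LangChain or AutoGPT APIs):

```python
# Single-agent tool loop: prompt in -> pick a tool -> observe the result ->
# repeat until the model declares it is done or the step budget runs out.
def run_agent(llm, tools: dict, user_prompt: str, max_steps: int = 5) -> str:
    history = f"User: {user_prompt}"
    for _ in range(max_steps):
        decision = llm(f"{history}\nChoose one tool from {list(tools)} "
                       "as 'tool: input', or reply 'DONE: <answer>'.")
        if decision.startswith("DONE:"):
            return decision[len("DONE:"):].strip()
        tool_name, _, tool_input = decision.partition(":")
        observation = tools.get(tool_name.strip(), lambda x: "unknown tool")(tool_input.strip())
        history += f"\n{decision}\nObservation: {observation}"
    return "Step limit reached."
```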
🤖 What Agentic AI Adds
Agentic AI introduces:
- Goal decomposition: breaking tasks into subtasks handled by specialized agents.
- Orchestration: a meta-agent (like a CEO) delegates and integrates.
- Memory systems: episodic, semantic, or vector-based for long-term context.
- Dynamic adaptation: agents can replan or reassign tasks based on outcomes.
Examples include CrewAI and AutoGen pipelines, where agents draft research papers or coordinate robots; a toy sketch of the pattern follows below.
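A toy sketch of that agentic layer (the `planner` and role-based `agents` callables are hypothetical; no specific framework is implied):

```python
# Agentic-AI sketch: a planner decomposes the goal, an orchestrator routes
# subtasks to role-based agents, results land in shared memory, and a failed
# subtask is re-planned once (dynamic adaptation).
def run_agentic_system(planner, agents: dict, memory: dict, goal: str) -> dict:
    subtasks = planner(goal)                    # e.g. [("retriever", "find sources"), ...]
    for role, subtask in subtasks:
        result = agents[role](subtask, memory)  # every agent can read shared memory
        if result is None:                      # replan when a path stalls
            role, subtask = planner(f"Alternative approach for: {subtask}")[0]
            result = agents[role](subtask, memory)
        memory[subtask] = result                # persist for later agents and sessions
    return memory
```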
🧵2/n
🔄 Mechanisms of Autonomy
A single AI Agent begins work when a user or scheduler fires a prompt, selects one tool at a time, and stops once the task is marked complete.
Agentic AI starts from a high-level objective, decomposes it through a planner agent, routes subtasks to specialist agents, and keeps cycling until success criteria are met.
Shared memory lets each agent read what others learned, while structured messages prevent conflicts and allow recovery when one path stalls.
🧵3/n
Workflow of an AI Agent performing a real-time news search (a small code sketch follows the list):
→ AI Agent handles user query "Latest AI news?" autonomously.
→ Searches web using tools, showing its tool-augmented reasoning.
→ Summarizes news with LLM, focusing on task-specificity.
→ Generates concise answer, demonstrating reactivity to user input.
→ Workflow reflects AI Agent’s modular, single-task design.
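A concrete toy version of that workflow (the `web_search` tool and `llm` call are hypothetical):

```python
# News-search agent in miniature: query -> search tool -> LLM summary -> answer.
def answer_news_query(llm, web_search, query: str = "Latest AI news?") -> str:
    articles = web_search(query, top_k=5)        # tool-augmented retrieval step
    return llm("Summarize these headlines in 3 concise bullet points:\n"
               + "\n".join(articles))            # task-specific summarization
```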
A follow-up study on Apple's "Illusion of Thinking" paper has now been published.
It shows the same models succeed once the format lets them give compressed answers, indicating the earlier collapse was a measurement artifact.
Token limits, not logic, froze the models.
The collapse vanished once the puzzles fit the context window.
So the models failed the rubric, not the reasoning.
⚙️ The Core Concepts
Large Reasoning Models add chain-of-thought tokens and self-checks on top of standard language models. The Illusion of Thinking paper pushed them through four controlled puzzles, steadily raising complexity to track how accuracy and token use scale. The authors saw accuracy plunge to zero and reasoned that thinking itself had hit a hard limit.
📊 Puzzle-Driven Evaluation
Tower of Hanoi forced models to print every move; River Crossing demanded safe boat trips under strict capacity. Because a solution for forty-plus moves already eats thousands of tokens, the move-by-move format made token budgets explode long before reasoning broke.
🔎 Why Collapse Appeared
The comment paper pinpoints three test artifacts: token budgets were exceeded, evaluation scripts flagged deliberate truncation as failure, and some River Crossing instances were mathematically unsolvable yet still graded. Together these artifacts masqueraded as cognitive limits.
✅ Fixing the Test
When researchers asked the same models to output a compact Lua function that generates the Hanoi solution, models solved fifteen-disk cases in under five thousand tokens with high accuracy, overturning the zero-score narrative.
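The fix is easy to picture. Here is the same idea sketched in Python (the paper's experiment used Lua): a compact generator stays a few lines long no matter how many disks, unlike the exponential move-by-move listing the original evaluation demanded.

```python
# Compact Tower of Hanoi solver: the program is tiny even though the move
# sequence it describes grows as 2**n - 1.
def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C"):
    """Yield the optimal move sequence for n disks."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)   # move n-1 disks out of the way
    yield (src, dst)                         # move the largest disk
    yield from hanoi(n - 1, aux, src, dst)   # stack the n-1 disks back on top

print(len(list(hanoi(15))))   # 32767 moves, yet the program itself stays short
```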
Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
The required token count grows roughly as five times two to the power N, squared. Growth this steep quickly exhausts the context window.
With 64K tokens, Claude 3.7 and DeepSeek R1 can list every move only up to about 7 or 8 disks.
With 100K tokens, o3-mini reaches eight disks.
The paper argues that the earlier study misread a memory bottleneck as a reasoning failure.
The token-growth rule puts a hard cap on how many Tower of Hanoi moves can be printed before the context window fills up.
When the required tokens jump past that cap, the model must truncate its answer, and the grader marks it wrong even if the model still knows the plan.
By showing that the predicted break-points — around seven or eight disks for the given token budgets — match the point where accuracy crashes, the math connects the dots.
It turns the headline “models stop thinking” into “models run out of room,” which is the central claim of the comment paper.
A huge 340-page report on AI trends, released by @bondcap.
Some wild findings from this report.
🧵1/n
🧵2/n
Meta’s Llama Downloads Exploded 3.4× in Eight Months.
An unprecedented developer adoption curve for any open-source LLM.
bondcap.com/reports/tai
🧵3/n
AI Chatbots Now Mistaken as Human 73 Percent of the Time
In Q1 2025, testers mistook AI responses for human replies 73 percent of the time in Turing-style experiments. That's up from roughly 50 percent only six months earlier, showing how quickly models have learned to mimic human conversational nuance.