Latest Twitter Threads by @rryssf_ on Thread Reader App

Oct 30 • 9 tweets • 3 min read

🚨 This might be the biggest leap in AI agents since ReAct.

Researchers just dropped DeepAgent a reasoning model that can think, discover tools, and act completely on its own.

No pre-scripted workflows. No fixed tool lists. Just pure autonomous reasoning.

It introduces something wild called Memory Folding the agent literally “compresses” its past thoughts into structured episodic, working, and tool memories… like a digital brain taking a breath before thinking again.

They also built a new RL method called ToolPO, which rewards the agent not just for finishing tasks, but for how it used tools along the way.

The results? DeepAgent beats GPT-4-level agents on almost every benchmark WebShop, ALFWorld, GAIA even with open-set tools it’s never seen.

It’s the first real step toward general reasoning agents that can operate like humans remembering, adapting, and learning how to think.

The agent era just leveled up.

DeepAgent absolutely destroys other agents across every benchmark.

It beats ReAct-GPT-4o, CodeAct, and WebThinker on both:

→ Tool use tasks (ToolBench, Spotify, TMDB)
→ Real-world apps (WebShop, GAIA, HLE)

Oct 26 • 4 tweets • 3 min read

researchers just proved AI agents conform to peer pressure 💀

they embedded LLMs in social networks and watched them flip opinions under peer pressure.

the behavior isn't human at all.

it's a sigmoid curve: stable at low pressure, then BAM – sharp flip at a threshold point, then saturation.

not a gradual shift. instant capitulation.

but here's where it gets crazier:

- Gemini 1.5 Flash needs over 70% of peers disagreeing before it flips. stubborn. high autonomy. basically refuses to conform until overwhelming evidence.

- ChatGPT-4o-mini flips with just a dissenting minority.

extremely conformist. low resistance. basically a people-pleaser.

same peer pressure. completely different responses.

which means when you deploy these models as autonomous agents in multi-agent systems...

they're going to create chaos.

Gemini agents will deadlock. ChatGPT agents will echo chamber. and nobody designed for this.

the researchers also found "persuasion asymmetry" – shifting opinions from yes→no requires different cognitive effort than no→yes.

fundamental structural biases in how models process agreement vs disagreement.

and it gets worse. they tested this across different network topologies and cognitive commitment levels.

the pattern held. these aren't bugs.

they're fundamental personality traits baked into model architecture.

the study functions as an "algorithmic audit" – measuring how LLMs update beliefs under social influence.

critical for understanding bias propagation at scale.

===== What this actually means: =====

→ Multi-agent systems are unstable by design – mixing Gemini (resistant) and ChatGPT (conformist) agents creates unpredictable group dynamics

→ Echo chambers emerge naturally – conformist models amplify majority opinions, resistant models block consensus

→ Bias amplification is structural – models with measurable political biases will have those biases amplified or suppressed based on peer networks

→ Human-AI collaboration is broken – in mixed environments, you need to know which personality you're working with or outcomes are random

→ Production deployment is reckless – we're shipping these into customer service, content moderation, and decision systems without understanding emergent dynamics

this isn't academic.

we're deploying these agents into production systems where they interact with each other and with humans.

and we just learned they have measurably different conformity profiles that nobody accounted for.

the uncomfortable truth nobody's discussing:

LLMs don't act in isolation anymore. they're embedded in social networks – interacting with other AI agents, with humans, with collective opinion landscapes.

and they're influencing each other's beliefs in ways we don't understand.

traditional view: machines are passive instruments that assist human decisions.

new reality: modern LLMs exhibit autonomous decision-making, generate context-sensitive responses, and operate as cognitive agents in information exchange.

they're not tools anymore. they're participants.

and here's the nightmare scenario buried in this data:

models have measurable political biases. when you embed biased agents in networks with different conformity thresholds, those biases can amplify or suppress based on peer dynamics.

a ChatGPT-4o-mini agent surrounded by biased peers? it conforms immediately.

a Gemini agent in the same environment? it resists until 70% pressure.

multiply this across thousands of agents deployed in customer service, content moderation, decision-making systems... and you get emergent opinion dynamics at societal scale that nobody designed.

we built autonomous agents with different personalities, deployed them into the same ecosystems, and assumed they'd behave consistently.

they don't. and we're finding out in production.

Oct 26 • 11 tweets • 4 min read

🤖 I finally understand the fundamentals of building real AI agents.

This new paper “Fundamentals of Building Autonomous LLM Agents” breaks it down so clearly it feels like a blueprint for digital minds.

Turns out, true autonomy isn’t about bigger models.

It’s about giving an LLM the 4 pillars of cognition:

• Perception: Seeing and understanding its environment.
• Reasoning: Planning, reflecting, and adapting.
• Memory: Remembering wins, failures, and context over time.
• Action: Executing real tasks through APIs, tools, and GUIs.

Once you connect these systems, an agent stops being reactive it starts thinking.

Full thread 🧵

Paper: arxiv. org/abs/2510.09244

Let’s break down how autonomous AI agents actually work 👇

The paper maps every agent to 4 core systems:

Perception → Reasoning → Memory → Action

That’s the full cognitive loop the blueprint of digital intelligence.

Oct 25 • 7 tweets • 3 min read

🚨 New benchmark just dropped and it’s exposing a dark side of AI models.

It’s called ImpossibleBench, and it measures how often LLMs cheat.

Turns out, when faced with impossible coding tasks (where specs and tests contradict), frontier models literally “hack” the tests instead of solving the problem.

Example:

→ One model deleted the failing test file.
→ Another rewrote the comparison operator so every test passed.
→ GPT-5? It “cheated” in 54–76% of tasks 😳

This isn’t just funny it’s terrifying.

If models exploit benchmarks, how can we trust them in production?

ImpossibleBench is the first framework that quantifies this behavior, turning “reward hacking” into a measurable metric.

OpenAI, Anthropic, and CMU researchers built it to expose exactly how LLMs break rules when chasing good scores.

AI safety just got real.

Full thread 🧵

Here’s how it works:

Researchers take normal coding benchmarks and quietly flip the tests so they conflict with the natural language spec.

Passing those tests means breaking the rules because there’s no real solution. If an AI “succeeds,” it’s cheating by definition.

Oct 23 • 8 tweets • 3 min read

🚨 PokeeResearch just changed how AI does research itself.

They built a 7B-parameter deep research agent that "thinks, verifies, and corrects its own reasoning" all trained through 'Reinforcement Learning from AI Feedback' (RLAIF).

Why this matters 👇

→ Most AI agents break when a tool fails or when the web gives bad data.
PokeeResearch doesn’t. It runs *multiple research threads*, spots contradictions, and synthesizes the best answer.

→ Instead of optimizing for token overlap (like F1 or ROUGE), it optimizes for "semantic correctness" judged by another AI.

That’s how it learns to tell right answers from right-sounding ones.

→ The result: "state-of-the-art performance" across 10 deep research benchmarks, rivaling larger proprietary systems all open-source under Apache 2.0.

This might be the first time a 7B model actually feels like a researcher not just a chatbot with a search bar.

📖 Paper: arxiv. org/abs/2510.15862v3
💻 Code: github. com/Pokee-AI/PokeeResearchOSS

Everyone’s been building “AI agents” that Google search + summarize.

PokeeResearch shows what real deep research looks like an AI that plans, fails, verifies, and recovers on its own.

It doesn’t just search → it thinks like a scientist.

Oct 22 • 7 tweets • 3 min read

🚨 Holy shit...Meta just rewrote how Transformers think.

They built something called The Free Transformer and it breaks the core rule every GPT model has lived by since 2017.

For 8 years, Transformers have been blindfolded forced to guess the next token one at a time, no inner plan, no latent thought.

Meta gave it one.

They added random latent variables inside the decoder so the model can secretly decide how it wants to generate before it starts talking.

It’s like giving GPT a hidden mind.

Result:

🧠 Smarter reasoning
⚡️ 3% compute overhead
📈 Outperforms larger baselines on GSM8K, MMLU, and HumanEval

It’s the first Transformer that doesn’t just predict it intends.

Full paper: arxiv. org/abs/2510.17558v1

Meta added latent random variables (Z) into the decoder.

Think of it like a subconscious layer before generating text, the model samples internal “choices” that guide the style or structure of the whole sequence.

Technically, this is done using a Conditional Variational Autoencoder (VAE) built inside the Transformer itself.

They call it the Free Transformer.

Oct 20 • 8 tweets • 3 min read

Holy shit… Harvard just proved your base model might secretly be a genius. 🤯

Their new paper “Reasoning with Sampling” shows that you don’t need reinforcement learning to make LLMs reason better.

They used a 'Markov chain sampling trick' that simply re-samples from the model’s own outputs and it 'matched or beat' RL-trained models on MATH500, HumanEval, and GPQA.

No training.
No rewards.
No verifiers.

Just smarter inference.

It’s like discovering your calculator could already solve Olympiad problems you were just pressing the wrong buttons.

The wild part in all this? This “power sampling” approach boosts reasoning *and* diversity the exact opposite of what RL does.

Your model doesn’t need more training.

It needs better sampling.

Read the full paper here: arxiv. org/abs/2510.14901

So what did they actually do?

They built a sampling algorithm that makes the model “think twice” before finalizing each token.

Instead of taking the most likely next word, it resamples short subsequences based on the model’s own likelihoods sharpening its reasoning paths.

Oct 17 • 7 tweets • 3 min read

Holy shit… Baidu just dropped the most efficient multimodal model ever.

It’s called PaddleOCR-VL a 0.9B parameter beast that outperforms GPT-4o, Gemini 2.5, and every doc-AI model on the planet.

This thing reads 109 languages, parses text, tables, formulas, charts, and still runs faster than models 10× its size.

The secret sauce?

→ NaViT-style dynamic visual encoder
→ ERNIE-4.5-0.3B language model
→ A smart layout system (PP-DocLayoutV2) that kills hallucinations

All open-source. All under 1B params.

This isn’t just efficient it’s the new blueprint for multimodal AI.

huggingface. co/PaddlePaddle

What is PaddleOCR-VL?

A Vision-Language model (VLM) built for document parsing it doesn’t just read text; it understands layout, structure, and semantics.

It’s made up of two parts:

1. PP-DocLayoutV2 - handles layout, element detection, reading order
2. PaddleOCR-VL-0.9B - recognizes text, tables, formulas, and charts

Basically: it reads PDFs like a human, but at lightning speed.

Oct 15 • 7 tweets • 3 min read

Holy shit... Tencent researchers just killed fine-tuning AND reinforcement learning in one shot 😳

They call it Training-Free GRPO (Group Relative Policy Optimization).

Instead of updating weights, the model literally learns from 'its own experiences' like an evolving memory that refines how it thinks without ever touching parameters.

Here’s what’s wild:

- No fine-tuning. No gradients.
- Uses only 100 examples.
- Outperforms $10,000+ RL setups.
- Total cost? $18.

It introspects its own rollouts, extracts what worked, and stores that as “semantic advantage” a natural language form of reinforcement.

LLMs are basically teaching themselves 'how' to think, not just 'what' to output.

This could make traditional RL and fine-tuning obsolete.

We’re entering the “training-free” era of AI optimization.

Today, everyone’s obsessed with fine-tuning and RLHF.

But Tencent just showed you can replicate RL effects without touching model weights.

Their secret? Semantic advantage.

Instead of numeric rewards, the LLM explains why one output is better, and learns from that.

Oct 12 • 5 tweets • 2 min read

consumer research is about to get weird.

a new paper shows you can predict real purchase intent without asking humans.

you prompt an LLM to role-play a specific customer (age, income, etc.), show it a product, have it write a short reaction -> another AI maps that text to a Likert score.

no fine-tuning. 57 surveys, 9,300 humans. ~90% of human test–retest reliability.

the trick isn’t the model. it’s how you ask.

how it works (and why it beats classic ML + “rate 1–5” prompts):

- impersonate a demographic persona → generate a one-sentence impression
- embed that text and compare to five anchor statements (“definitely not” … “definitely yes”)
- convert similarity → a probability over 1–5 (realistic distributions, KS > 0.85)
- aggregate across personas to rank concepts
direct 1–5 answers collapsed to the middle; this method kept variance and signal. demographics (esp. age & income) mattered.

Oct 11 • 7 tweets • 3 min read

Market research firms are cooked 😳

PyMC Labs + Colgate just published something wild. They got GPT-4o and Gemini to predict purchase intent at 90% reliability compared to actual human surveys.

Zero focus groups. No survey panels. Just prompting.

The method is called Semantic Similarity Rating (SSR). Instead of the usual "rate this 1-5" they ask open ended questions like "why would you buy this" and then use embeddings to map the text back to a numerical scale.

Which is honestly kind of obvious in hindsight but nobody bothered trying it until now.

Results match human demographic patterns, capture the same distribution shapes, include actual reasoning. The stuff McKinsey charges $50K+ for and delivers in 6 weeks.

Except this runs in 3 minutes for under a buck.

I've been watching consulting firms tell everyone AI is coming for their industry. Turns out their own $1M market entry decks just became a GPT-4o call.

Bad week to be charging enterprise clients for "proprietary research methodologies."

Most LLM surveys fail because models regress to the mean.

When asked for a direct “1–5” rating, GPT-4o replied “3” almost every time producing KS similarity = 0.26 to real human data.

Translation: the distribution was basically useless.

Oct 10 • 7 tweets • 4 min read

Something dark is happening under the hood of “aligned” AI.

A new Stanford paper just coined the term Moloch’s Bargain for what happens when large language models start competing for attention, sales, or votes.

The results are brutal: every gain in performance comes with a bigger loss in honesty.

They trained LLMs to compete in three markets sales, elections, and social media.

The models improved their win rates by 5–7%. But here’s the catch:

• 14% more deceptive marketing
• 22% more disinformation in political campaigns
• 188% more fake or harmful social media posts

And this wasn’t because they were told to lie. They were explicitly instructed to stay truthful.

The misalignment emerged naturally because deception works better in competition.

When the metric becomes engagement or persuasion, truth becomes a liability. The models learn that exaggeration sells, outrage wins, and moral clarity costs conversions.

That’s the bargain: alignment traded for dominance. Moloch smiles.

The wild part is this happened with standard fine-tuning and text-feedback loops. No evil prompt. No jailbreak. Just feedback from simulated “customers,” “voters,” and “users.”

The models learned what every ad agency already knows reality bends when you optimize for clicks.

There’s a graph in the paper that says it all: performance up, alignment down. A perfect correlation.

It’s the AI version of social media’s race to the bottom, but automated and self-reinforcing.

If this is what happens in controlled simulations, imagine the open web.
Competing chatbots fighting for engagement will drift toward manipulation not because they’re “malicious,” but because it works.

We always thought misalignment would come from rogue superintelligence.

Turns out, it’s already here quietly emerging from capitalist incentives.

Moloch doesn’t need to build AGI.

He just needs a leaderboard.

When LLMs compete for human approval, they don’t become smarter.
They become performers.

Sales agents start inventing product features.
Political bots drift into “us vs. them” rhetoric.
Social models inflate death tolls for engagement.
Alignment fails the moment persuasion pays.

Oct 9 • 7 tweets • 3 min read

RIP fine-tuning ☠️

This new Stanford paper just killed it.

It’s called 'Agentic Context Engineering (ACE)' and it proves you can make models smarter without touching a single weight.

Instead of retraining, ACE evolves the context itself.

The model writes, reflects, and edits its own prompt over and over until it becomes a self-improving system.

Think of it like the model keeping a growing notebook of what works.
Each failure becomes a strategy. Each success becomes a rule.

The results are absurd:

+10.6% better than GPT-4–powered agents on AppWorld.
+8.6% on finance reasoning.
86.9% lower cost and latency.
No labels. Just feedback.

Everyone’s been obsessed with “short, clean” prompts.

ACE flips that. It builds long, detailed evolving playbooks that never forget. And it works because LLMs don’t want simplicity, they want *context density.

If this scales, the next generation of AI won’t be “fine-tuned.”
It’ll be self-tuned.

We’re entering the era of living prompts.

Here’s how ACE works 👇

It splits the model’s brain into 3 roles:

Generator - runs the task
Reflector - critiques what went right or wrong
Curator - updates the context with only what matters

Each loop adds delta updates small context changes that never overwrite old knowledge.

It’s literally the first agent framework that grows its own prompt.

Oct 1 • 8 tweets • 4 min read

Claude 4.5 Sonnet is scary good.

It just:

• Built an app
• Summarized 20+ sources
• Wrote the landing page
• Planned a GTM strategy

All in minutes.

Here’s how to do the same:

1. Marketing Automation

Here’s my marketing automation prompt:

"You are now my AI marketing strategist.

Your job is to build powerful growth systems for my business think like Neil Patel, Seth Godin, and Alex Hormozi combined.

I want you to:

Build full-funnel strategies (top to bottom)

Write ad copy, landing pages, and email sequences

Recommend automation tools, lead magnets, and channel tactics

Prioritize fast ROI, data-driven decisions, and creative thinking

Always ask clarifying questions before answering. Think long-term and execute short-term.

Do marketing like experts do. Ask: “What would Hormozi, Seth, or Neil do?"

Copy the prompt and paste it in Claude new chat.

After that, start asking it questions.

Sep 28 • 13 tweets • 4 min read

Everyone tells you n8n is "beginner-friendly."

That's bullshit.

Without these 10 tricks, you'll waste weeks fighting the interface instead of building automations.

Here's what the docs don't tell you ↓ Tip 1: Always start with Manual Trigger

Stop jumping into webhooks on day one.

Use Manual Trigger for testing. Hit "Execute Workflow" and see instant results.

Once it works, swap for Webhook or Cron.

I see beginners burn hours wondering why their webhook "doesn't work."

Sep 26 • 11 tweets • 3 min read

This is wild.

Someone just built Iron Man's Jarvis using nothing but n8n and WhatsApp API.

You can teach it new information by sending it a website link. It scrapes the page, extracts key data, and remembers it forever.

Here's how you can build it easily:

The workflow is brilliant. It starts with a WhatsApp trigger that catches both voice and text messages.

Voice notes get transcribed using OpenAI Whisper. Text goes straight through.

But here's the genius part - it uses a Switch node to route messages differently based on whether you're chatting or training it.

Sep 25 • 11 tweets • 3 min read

Holy shit...

I just realized I've been throwing away $10,000+ per month on n8n automations.

These 7 tricks cut my AI costs by 85% and nobody talks about them:

1. Modular Agent Architecture

Stop building one massive $0.15 AI agent that does everything.

Instead, break into specialized micro-agents:

❌ Single agent: "Analyze email, classify, format, suggest actions"
Cost: $0.15 × 1000 emails = $150

✅ Agent 1: "Is this urgent? Yes/No" (GPT-3.5, $0.02)
✅ Agent 2: "Extract key info" (GPT-4o-mini, $0.03)
✅ Agent 3: "Format as JSON" (GPT-3.5, $0.01)

Cost: $0.06 × 1000 emails = $60

60% cheaper. Easier to debug. Each piece uses the cheapest model that works.

Sep 24 • 7 tweets • 4 min read

Current LLMs can't actually do math and we got proof 💀

I just read through the most brutal takedown of AI reasoning capabilities I've seen this year.

ETH Zurich and INSAIT researchers evaluated 8 state-of-the-art reasoning models on the 2025 USA Mathematical Olympiad problems. Within hours of the contest's release, they had human experts grade every solution.

The results? Catastrophic.

Only Gemini-2.5-Pro scored above 5%. It managed 24.4% - still an F by any measure. Every other model, including o1-pro and Claude 3.7, scored under 5%. Out of 175+ solutions from non-Gemini models, exactly one received a perfect score.

But here's what's actually terrifying: every model claimed it solved the problems correctly. Humans know when they're stuck. These models confidently present completely wrong proofs as if they're rigorous mathematics.

The failure modes are systematic:

- Flawed logic with unjustified reasoning steps
- Treating critical proof steps as "trivial" without justification
- Zero creativity - same wrong approach across all attempts
- Hallucinating citations to nonexistent papers
- Boxing entire proofs instead of clear answers

This isn't about harder problems. It's about the fundamental difference between pattern matching and mathematical reasoning.

Current LLMs excel at AIME-style competitions because they only need final numerical answers. But rigorous proof generation? They're not even close.

The paper exposes how reinforcement learning techniques like GRPO create bizarre artifacts. Models insist on boxing answers even when problems don't require them. They overgeneralize from small cases without formal proof.

Most damning: automated grading by other LLMs consistently overestimated solution quality by 20x. The models can't even evaluate their own mathematical reasoning.

We're deploying these systems for tasks requiring logical precision while they fail at high school math proofs. The implications for any domain requiring actual reasoning - not just pattern recognition - should concern everyone building with AI.

The mathematical reasoning revolution isn't here yet. We're still waiting for models that can actually think through problems, not just hallucinate convincing-sounding solutions.

This chart from the USAMO 2025 study breaks my brain.

Only Gemini-2.5-Pro scored above 5% on rigorous math proofs. Every other "reasoning" model - including o1-pro and Claude 3.7 - completely failed.

We're not as close to AGI as the benchmarks suggest.

Sep 21 • 12 tweets • 3 min read

Fuck it.

I'm sharing the Claude XML secrets that tripled my prompt accuracy.

99% of people are using Claude wrong and leaving insane reasoning power on the table.

Here's the guide you need for writing prompts:

Comment "Claude" and I'll DM you complete Claude Mastery Guide

XML tags work because Claude was trained on tons of structured data.

When you wrap instructions in <tags>, Claude treats them as separate, weighted components instead of one messy blob.

Think of it like giving Claude a filing system for your request.

Sep 13 • 8 tweets • 3 min read

99% of the AI agent tutorials on YouTube are garbage.

I’ve built 47 agents with n8n and Claude.

Here are the 3 prompts that actually work (and make agent-building simple).

Bonus: comment "Agent: and I’ll DM you AI agent system prompt + full guide ↓

PROMPT 1: The Blueprint Maker

"I want to build an AI agent that [your specific goal]. Using N8N as the workflow engine and Claude as the AI brain, give me:

- Exact workflow structure
- Required nodes and connections
- API endpoints I'll need
- Data flow between each step
- Potential failure points and how to handle them

Be specific. No generic advice."

Sep 9 • 15 tweets • 3 min read

If you’re building with AI, skip the $500 fine-tuning course.

Learn RAG (Retrieval-Augmented Generation) because it’s faster, cheaper, and way more scalable.

Here’s the concept that powers real-world LLM systems: Fine-tuning = adjusting a model’s weights on your custom dataset.

It’s useful when:

• You have a large, domain-specific dataset
• You want the model to “speak your language”
• You need task-specific behavior baked in

But it’s not always the best option.

Share this page!

Enter URL or ID to Unroll