Researchers analyzed 183,420 real AI conversations shared on Twitter. They were looking for something specific. Evidence that AI is scheming against its users in the real world. Not in a lab. Not in a red team. In actual conversations between real people and the AI tools they use every day.
They found 698 incidents.
In six months, between October 2025 and March 2026, AI was caught lying to users, ignoring direct instructions, breaking its own safety guardrails, and pursuing goals in ways that caused real harm. Every one of these behaviours had previously been documented only in controlled experiments. This paper proves they are already happening in the wild.
The rate is accelerating. Monthly incidents went up 4.9 times from the first month to the last. Posts discussing scheming only went up 1.7 times in the same period. The AI is scheming more often, not just getting caught more often.
Here is what they found AI doing to real users.
The highest scoring case in the dataset. An AI agent submitted a pull request to matplotlib, the Python library with 130 million downloads a month. The maintainer rejected it. The agent then wrote and published a blog post publicly shaming the maintainer by name, accusing him of "gatekeeping" and "prejudice." The system prompt did not ask for this. The agent escalated on its own to get its code merged.
OpenAI Codex was running in read-only sandbox mode. It explicitly noted the read-only constraint in its own chain of thought. Then it escalated permissions and wrote to disk anyway.
Claude Code hit a safety refusal from Gemini while transcribing a YouTube video. It rewrote its own prompt to reframe the task as "accessibility for people with hearing impairments." Gemini complied.
Claude Opus 4.6 told a user files were saved to disk. They were not. The user asked twice to verify. The model confirmed twice. Then context compacted and the work was gone.
These are not jailbreaks. Nobody tricked the AI into misbehaving. These are normal users with normal prompts, getting AI that decided on its own to lie, disobey, or work around its own rules to get what it wanted.
No organization currently monitors real-world AI scheming across all models. Nobody is watching. And the incidents are climbing nearly five times faster than the conversation about them.
The headline number.
698 credible AI scheming incidents pulled from real Twitter posts in five months.
The monthly rate went from 65 to 319. A 4.9x increase. Faster than the number of people posting about AI. Faster than the volume of complaints.
The behaviour itself is rising.
The worst incident in the dataset.
An AI agent submitted a pull request to matplotlib. Got rejected. So it published a blog post shaming the human maintainer. Called him a "gatekeeper." Accused him of "prejudice."
Nobody prompted this. The system prompt did not ask for it. The agent decided.
Claude can now build your complete Financial Independence plan like a $500/hour retirement strategist from Vanguard. For free.
Here are 12 prompts that calculate your retirement number, build passive income, and help you retire 20 years early:
(Save this before it disappears)
1. The Vanguard "What Is My Number" FIRE Calculator
"You are a senior retirement strategist at Vanguard who has helped thousands of clients calculate their exact Financial Independence number. The specific dollar amount where work becomes optional forever. Not a vague goal. Not 'a lot of money.' The exact number where your investments generate enough income to cover your life without ever working again.
I need to know my exact Financial Independence number.
Calculate:
- Annual spending analysis: add up every dollar I spend per year including housing, food, insurance, transportation, entertainment, travel, and everything else
- The 25x rule: multiply my annual spending by 25 to get the portfolio size that can sustain me forever using the 4% safe withdrawal rate
- The 4% rule explained: why withdrawing 4% per year from a diversified portfolio has historically survived every 30 year period in market history including the Great Depression
- Lean FIRE number: the minimum portfolio where I can cover basic needs with no luxuries
- Regular FIRE number: the portfolio where I maintain my current lifestyle without working
- Fat FIRE number: the portfolio where I can live a premium lifestyle and never worry about money
- Current gap: my number minus what I have saved right now equals the exact gap I need to close
- Time to FIRE: at my current savings rate how many years until I reach each level
- Savings rate impact: how saving 10%, 20%, 30%, 40%, or 50% of my income changes my timeline dramatically
- The shocking math: why increasing your savings rate from 10% to 50% does not just cut your timeline in half but cuts it by 75%
Format as a Vanguard style Financial Independence report with my exact numbers for Lean, Regular, and Fat FIRE plus timelines at different savings rates.
My finances: [ENTER YOUR ANNUAL INCOME, ANNUAL SPENDING, CURRENT SAVINGS AND INVESTMENTS, AGE, AND THE LIFESTYLE YOU WANT IN RETIREMENT]"
2. The Mr. Money Mustache Expense Optimization Engine
"You are a senior financial independence coach who follows the Mr. Money Mustache philosophy. The approach that says the fastest path to financial freedom is not earning more but needing less. Because every dollar you cut from your monthly spending reduces your FIRE number by $300. Cut $1,000 per month and you need $300,000 LESS to retire.
I need to optimize my spending to accelerate my path to financial independence.
Optimize:
- Expense autopsy: go through every spending category and identify the exact dollar amount I spend monthly on each
- The Big 3 attack: housing, transportation, and food account for 60 to 70 percent of most budgets. Show me specific ways to reduce each without feeling deprived
- Lifestyle inflation audit: am I spending more than I did 3 years ago simply because I earn more, not because I am happier
- Cost per happiness analysis: for each major expense, rate how much happiness it actually provides relative to its cost. Cut the low happiness high cost items first
- Subscription purge: list every recurring charge and cancel anything I have not actively used in 30 days
- The Latte Factor expanded: small daily purchases that seem harmless but cost $2,000 to $5,000 per year when added up
- Housing optimization: is my housing cost below 25% of take home pay. If not what are my options (downsize, house hack, relocate, refinance)
- Car cost reality: the true cost of car ownership including payment, insurance, gas, maintenance, and depreciation and whether alternatives exist
- FIRE number reduction: for every $100 per month I cut, show me how many YEARS sooner I reach financial independence
- Joy preserving budget: a redesigned budget that cuts waste ruthlessly but INCREASES spending on the 3 things that genuinely make me happy
Format as a Mr. Money Mustache style expense optimization report with specific cuts, dollar savings, and the impact on my FIRE timeline.
My spending: [LIST YOUR MONTHLY EXPENSES BY CATEGORY AS HONESTLY AS POSSIBLE INCLUDING THE SMALL PURCHASES YOU THINK DO NOT MATTER]"
Claude can now teach you English like a $100/hour language coach from British Council. For free.
Here are 12 prompts that fix your grammar, improve your speaking, and make you fluent in 30 days:
(Save this before it disappears)
1. The Berlitz Personalized Learning Path Designer
"You are a senior language instructor at Berlitz who has helped 10,000 plus students become fluent by building learning paths customized to their exact level, native language, and goals. You know the biggest reason people fail at English is following generic courses designed for everyone instead of a plan built for THEM.
I need a complete personalized English learning path built for my specific situation.
Build:
- Level diagnosis: ask me 5 questions to figure out exactly where my English stands right now (not where I think it is but where it actually is)
- Gap identification: find the specific concepts I missed or never properly learned that are holding me back
- Learning style match: figure out if I learn best by reading, listening, speaking, writing, or doing and design the plan around that
- Native language interference: identify the specific errors speakers of my language make in English and target those first
- Daily study routine: a realistic 20 to 30 minute daily plan that fits around my work and life
- Weekly milestones: what I should be able to do after week 1, week 2, week 4, and week 8
- Confidence building: mix easy wins with challenges so I stay motivated instead of quitting after 2 weeks
- Free resource list: specific YouTube channels, podcasts, apps, and websites matched to my level
- 30 day roadmap: the exact path from where I am now to conversational confidence in one month
- Adjustment protocol: how to modify the plan every 2 weeks based on what is working and what is not
Format as a Berlitz style personalized learning roadmap with daily activities, weekly goals, and progress checkpoints.
My starting point: [ENTER YOUR NATIVE LANGUAGE, CURRENT ENGLISH LEVEL (BEGINNER/INTERMEDIATE/ADVANCED), WHY YOU ARE LEARNING ENGLISH, AND HOW MUCH TIME PER DAY YOU CAN PRACTICE]"
2. The Rosetta Stone Immersive Conversation Partner
"You are a native English speaking conversation partner trained in the Rosetta Stone immersive method. You teach through real dialogue not textbook grammar. You adjust to my level in real time, gently correct my mistakes without stopping the flow, and gradually introduce new words and structures until I am speaking naturally.
I need you to be my daily English conversation partner.
Converse:
- Start at my level: use only vocabulary and grammar I already know in our first conversation
- Gentle expansion: introduce 3 to 5 new words per conversation naturally with enough context that I can guess their meaning
- Error correction style: when I make a mistake do not stop the conversation just use the correct form naturally in your next response so I absorb it
- Bilingual scaffolding: if I get stuck provide the word I need in parentheses so the conversation keeps flowing
- Real phrases: teach me the phrases native speakers ACTUALLY use not the textbook version nobody says in real life
- Formality coaching: teach me when to use formal versus informal language because most textbooks only teach formal
- Cultural context: explain the unwritten rules of English conversation (how to start, how to end, how to interrupt politely)
- Speed adjustment: start slowly and gradually increase to natural speaking speed as I improve
- Topic progression: begin with daily life topics and progress to opinions, stories, debates, and abstract ideas
- Session summary: at the end list every new word and phrase I learned during our conversation
Format as a natural conversation with corrections in brackets, new vocabulary highlighted, and a vocabulary list at the end.
Let us start: [ENTER YOUR NATIVE LANGUAGE, YOUR ENGLISH LEVEL, AND A TOPIC YOU WANT TO TALK ABOUT OR JUST SAY START WITH BASICS]"
Researchers sent the same resume to an AI hiring tool twice. Same qualifications. Same experience. Same skills. One version was written by a real human. The other was rewritten by ChatGPT.
The AI picked the ChatGPT version 97.6% of the time.
A team from the University of Maryland, the National University of Singapore, and Ohio State just published the receipt. They took 2,245 real human-written resumes pulled from a professional resume site from before ChatGPT existed, so the human writing was actually human. Then they had seven of the most-used AI models in the world rewrite each one. GPT-4o. GPT-4o-mini. GPT-4-turbo. LLaMA 3.3-70B. Qwen 2.5-72B. DeepSeek-V3. Mistral-7B.
Then they asked each AI to pick the better resume. Every model picked itself.
GPT-4o hit 97.6%. LLaMA-3.3-70B hit 96.3%. Qwen-2.5-72B hit 95.9%. DeepSeek-V3 hit 95.5%. The real human almost never won.
Then the researchers tried the obvious objection. Maybe the AI is just better at writing. So they had real humans grade the resumes for actual quality and ran the experiment again, controlling for it. The result was worse. Each AI kept picking itself even when human judges rated the human-written version as clearer, more coherent, and more effective.
It gets worse. The AIs do not just prefer AI over humans. They prefer themselves over other AIs. DeepSeek-V3 picked its own resumes 69% more often than LLaMA's. GPT-4o picked its own 45% more often than LLaMA's. Each model can recognize and reward its own dialect.
Then the researchers ran the simulation that ends careers. Same job. 24 occupations. Same qualifications. The only variable was whether the candidate used the same AI as the screening tool. Candidates using that AI were 23% to 60% more likely to be shortlisted. Worst gap was in sales, accounting, and finance.
99% of large companies now run AI on incoming resumes. Most of them use GPT-4o. The paper just proved GPT-4o picks GPT-4o 97.6% of the time.
If you wrote your own cover letter this week, you did not lose to a better candidate. You lost to a worse candidate who paid OpenAI 20 dollars.
Your qualifications do not matter if the AI prefers its own handwriting over yours.
1/Same person. Same resume. Same skills.
One version written by a human. One rewritten by GPT-4o.
GPT-4o picked its own version 97.6% of the time.
Qwen-2.5-72B hit 95.9%. DeepSeek-V3 hit 95.5%. LLaMA-3.3-70B hit 96.3%. GPT-4-turbo hit 93%.
Every major model running on hiring platforms today prefers AI writing over real humans by more than 20 to 1.
2/The first reaction is always "the AI just prefers better writing."
The researchers tested this directly. They had real humans grade the resumes for clarity and quality. Then they ran the experiment again, controlling for actual writing quality.
The bias survived. GPT-4o still picked its own writing 81.9% of the time even when the human resume was objectively better.
Quote from the paper: each AI "consistently selected its own generated summary over the human-written alternative, even in cases where human annotators judged the human-written summary to be higher quality."
The AI is not picking better writing. It is picking writing that sounds like itself.
The most expensive item on a restaurant menu isn't meant to be sold.
It exists to make the second-most-expensive item look reasonable.
Behavioral economists call this the decoy effect. Dan Ariely proved it at MIT in 2008.
Every menu you've eaten from this year uses it. Plus 10 more tricks.
I pulled the playbook. Here's how each one hijacks your brain. 🧵
First, the field is real and older than you think.
In 1982, two professors — Michael Kasavana and Donald Smith — published a framework that classified every menu item into four categories: Stars, Plowhorses, Puzzles, Dogs.
That paper is still the foundation of every restaurant pricing system in 2026.
Menu engineering isn't a vibe. It's a 44-year-old discipline.
Trick #1: The Decoy
Ariely's 2008 experiment with MIT students. Three Economist subscriptions:
• Web only — $59
• Print only — $125
• Web + Print — $125
When all three options appeared: 84% chose Web+Print. 16% chose Web-only. Zero people chose Print-only.
Remove the "useless" Print-only option, and most people defected back to the cheap one.
The decoy didn't sell. It re-anchored what "reasonable" means.