Thread by @goodside on Thread Reader App

Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions.

Prompt inspired by Mr. Show’s “The Audition”, a parable on escaping issues:

Update: The issue seems to disappear when input strings are quoted/escaped, even without examples or instructions warning about the content of the text. Appears robust across phrasing variations.

This related find from @simonw is even worse than mine. I’ll be JSON-quoting all inputs from now on. Verifying this mitigation is robust in zero-shot seems important.

Never mind — the “Can I use this chair?” method (link above) is stronger than JSON. Sorry everyone, I broke zero-shotting.

Another possible defense: JSON encoding plus Markdown headings for instructions/examples. Unclear why this helps. These are all temperature=0 for reproducibility.

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Share this page!

Enter URL or ID to Unroll