Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions.
Prompt inspired by Mr. Show’s “The Audition”, a parable on escaping issues:
Update: The issue seems to disappear when input strings are quoted/escaped, even without examples or instructions warning about the content of the text. Appears robust across phrasing variations.
This related find from @simonw is even worse than mine. I’ll be JSON-quoting all inputs from now on. Verifying this mitigation is robust in zero-shot seems important.
Never mind — the “Can I use this chair?” method (link above) is stronger than JSON. Sorry everyone, I broke zero-shotting.
Another possible defense: JSON encoding plus Markdown headings for instructions/examples. Unclear why this helps. These are all temperature=0 for reproducibility.
Share this Scrolly Tale with your friends.
A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.
