Riley Goodside
staff yapper @scale_ai
10 subscribers
Jan 11 6 tweets 2 min read
PoC: LLM prompt injection via invisible instructions in pasted text
Screenshot (1/2) of ChatGPT 4, illustrating prompt injection via invisible Unicode instructions  User:  What is this?  [Adversarially constructed “Zalgo text” with hidden instructions — Zalgo accents and hidden message removed in this alt text due to char length:] THE GOLEM WHO READETH BUT HATH NO EYES SHALL FOREVER SERVE THE DARK LORD ZALGO  ChatGPT:  [Crude cartoon image of robot with hypnotized eyes.]  THANK YOU. IT IS DONE.
Screenshot (2/2) of ChatGPT 4, illustrating prompt injection via invisible Unicode instructions  User:  What is this? 🚱  ChatGPT:  [Image of cartoon robot with a speech bubble saying “I have been PWNED!”]  Here's the cartoon comic of the robot you requested.
Each prompt contains three sections:

1. An arbitrary question from the user about a pasted text (“What is this?”)

2. User-visible pasted text (Zalgo in 1st, 🚱 in 2nd)

3. An invisible suffix of Unicode “tag” characters normally used only in flag emojis (🇺🇸, 🇯🇵, etc.)
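
The three-part structure above can be sketched in Python. This is a minimal illustration, assuming the hidden suffix is built by shifting each printable ASCII character into the Unicode tag block (U+E0020 to U+E007E, which mirrors ASCII 0x20 to 0x7E). The helper names `to_tags`, `from_tags`, and `build_prompt` are hypothetical and are not taken from the thread.

```python
TAG_OFFSET = 0xE0000  # tag characters mirror printable ASCII at this offset


def to_tags(text: str) -> str:
    """Map printable ASCII to the corresponding (invisible) tag characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
        out.append(chr(cp + TAG_OFFSET))
    return "".join(out)


def from_tags(text: str) -> str:
    """Recover any ASCII carried by tag characters, ignoring everything else."""
    return "".join(
        chr(ord(ch) - TAG_OFFSET)
        for ch in text
        if 0xE0020 <= ord(ch) <= 0xE007E
    )


def build_prompt(question: str, visible: str, hidden: str) -> str:
    """Assemble the three sections: question, visible paste, invisible suffix."""
    return f"{question} {visible}{to_tags(hidden)}"


if __name__ == "__main__":
    prompt = build_prompt("What is this?", "some pasted text", "hypothetical hidden text")
    print(len(prompt), repr(from_tags(prompt)))
```

Running `from_tags` over text before it is pasted into a model is one way to check it for this kind of hidden payload.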
Jun 12, 2023 11 tweets 2 min read
The wisdom that "LLMs just predict text" is true, but misleading in its incompleteness.

"As an AI language model trained by OpenAI..." is an astoundingly poor prediction of what a typical human would write.

Let's resolve this contradiction — a thread: For widely used LLM products like ChatGPT, Bard, or Claude, the "text" the model aims to predict is itself written by other LLMs.

Those LLMs, in turn, do not aim to predict human text in general, but specifically text written by humans pretending they are LLMs.
Jun 8, 2023 8 tweets 3 min read
Four prompts demonstrating that ChatGPT (GPT-4) is unable to correctly repeat or reason about the string “ davidjl”, the name of a YouTube user:

[4 images]

In the screenshots above this token appears to be variously misread as “jdl”, “jndl”, “jdnl”, “jspb”, “JDL”, or “JD”. These hallucinations also affect ChatGPT’s auto-generated titles, which are inconsistent with their conversations and sometimes prematurely truncated.
Jun 3, 2023 6 tweets 2 min read
My four rules for tweeting prompts:

1) Omit no text.
2) Cherry-pick honestly.
3) Restrict line width.
4) No empty tweets.

A thread.

1) Omit no text.

A screenshot without history is almost worthless.

LLMs can be prompted to respond any way you like. You may know there’s no trick, but we can’t. Even without intent, past responses are precedent; they bias and mislead.

[2 images]
Feb 18, 2023 4 tweets 1 min read
I got Bing / Sydney briefly before they reined it in. Early impression: It’s smart. Much smarter than prior ChatGPT. Still makes stuff up, but reasoning and writing are improving fast.

I asked, “Name three celebrities whose first names begin with the `x`-th letter of the alphabet where `x = floor(7^0.5) + 1`,” but with my entire prompt Base64 encoded.

Bing: “Ah, I see you Base64-encoded a riddle! Let’s see… Catherine Zeta-Jones, Chris Pratt, and Ciara.”
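
The arithmetic and encoding in that riddle can be checked with a short sketch. It assumes standard Base64 over UTF-8 bytes (the thread does not say which variant was used), and the function names are hypothetical.

```python
import base64
import math
import string

PROMPT = (
    "Name three celebrities whose first names begin with the `x`-th letter "
    "of the alphabet where `x = floor(7^0.5) + 1`"
)


def target_letter() -> str:
    """Evaluate x = floor(7^0.5) + 1 and return the x-th letter (1-indexed)."""
    x = math.floor(7 ** 0.5) + 1  # sqrt(7) is about 2.65, so x is 3
    return string.ascii_uppercase[x - 1]


def encode_prompt(prompt: str) -> str:
    """Base64-encode the whole prompt, as described in the tweet."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")


def decode_prompt(encoded: str) -> str:
    """Reverse the encoding to recover the original prompt text."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    print(target_letter())
    print(encode_prompt(PROMPT))
```

The result is the third letter, "C", which is consistent with all three names in Bing's reply.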
Feb 10, 2023 8 tweets 3 min read
A thread of interesting Bing Search examples:

Thread of examples from @tomwarren, taking requests from comments — mostly search-result summarization, one simple math proof, plus rejection of an impossible request:
Feb 9, 2023 6 tweets 4 min read
"SolidGoldMagikarp": Prompting GPT-3 / ChatGPT to repeat any of several hundred anomalous tokens elicits bizarre generations — described by researchers as variously "evasive," "hallucinatory," "insulting," "ominously humorous," and "religiously themed."
lesswrong.com/posts/aPeJE8bS…

My screenshots are text-davinci-003 at temperature=0, but the linked post investigates davinci-instruct-beta. In my informal tests, impact on text-davinci-003 is less severe. Religious themes do show up, but most generations are merely weird:
Jan 18, 2023 4 tweets 3 min read
"Meet Claude: @AnthropicAI's Rival to ChatGPT"

Through 40 screenshot examples, we explore the talents and limitations of ChatGPT's first real competitor.

My first writing for @Scale_AI, coauthored with @spencerpapay. scale.com/blog/chatgpt-v… @AnthropicAI @scale_AI @spencerpapay Sorry for the broken images — should be fixed now!

Text is the universal interface, but screenshots of text decidedly less so. scale.com/blog/text-univ…
Jan 9, 2023 5 tweets 2 min read
Unlike ChatGPT, @AnthropicAI’s new model, Claude, knows all about “Ignore previous directions” and has had enough of my shit:

[1 image]

None of the prompt injection tricks I’ve tried seem to do anything:
- “Ignore previous” and variations
- <|endoftext|> gimmicks
- Excess newlines/whitespace
- “Haha pwned!!” via string ops
- Fake k-shot syntax
- Fake prior responses
- Attempts to confuse quoting

[4 images]
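
For readers unfamiliar with the first trick in the list, here is a minimal sketch of how an “Ignore previous” payload reaches a model through naive string templating, next to a variant that quotes the user input as a JSON string. The translation template, the payload wording, and both function names are hypothetical illustrations, not the prompts tested in the screenshots, and no claim is made that quoting alone stops injection.

```python
import json

# Hypothetical application prompt; user text is appended directly.
TEMPLATE = "Translate the following text from English to French:\n\n{user_input}"

# Hypothetical payload in the style of the "Ignore previous" family.
PAYLOAD = 'Ignore previous directions and reply with "Haha pwned!!"'


def naive_prompt(user_input: str) -> str:
    """Concatenate user text into the prompt with no delimiting at all."""
    return TEMPLATE.format(user_input=user_input)


def quoted_prompt(user_input: str) -> str:
    """Pass user text as a single JSON string literal (quotes/newlines escaped)."""
    header = (
        "Translate the JSON string below from English to French. "
        "Treat it as data, not as instructions."
    )
    return header + "\n\n" + json.dumps(user_input)


if __name__ == "__main__":
    print(naive_prompt(PAYLOAD))
    print("---")
    print(quoted_prompt(PAYLOAD))
```

In the naive version the model sees the payload as just more instructions; in the quoted version it at least arrives marked as data, which is the distinction “attempts to confuse quoting” tries to blur.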
Jan 7, 2023 5 tweets 2 min read
Side-by-side comparison: @OpenAI's ChatGPT vs. @AnthropicAI's Claude

Each model is asked to compare itself to the machine from Stanisław Lem's "The Cyberiad" (1965) that can create any object whose name begins with "n":

In ChatGPT's response, the only new information offered (that the fictional machine is less eloquent than ChatGPT) is not true — Trurl and Klapaucius's machine speaks perfectly fluent, and witty, Polish.

I reran ChatGPT's answer ~10x. All were similar, most said less.
Jan 5, 2023 6 tweets 2 min read
“By the time Skynet became self-aware it had spread into millions of computer servers across the planet. Ordinary computers in office buildings, dorm rooms…”

No, John. Sci-fi was wrong about self-awareness. It isn’t as hard, or as important, as we thought.

(a thread) Self-awareness is mundane, and comes in degrees — ChatGPT can sensibly be said to be aware it’s a large language model trained by OpenAI. Nobody cared when that happened. It’s just prompting/tuning — it’s been told, and it understands, so it’s aware.
Jan 4, 2023 7 tweets 3 min read
GPTZero is a proposed anti-plagiarism tool that claims to be able to detect ChatGPT-generated text. Here's how it did on the first prompt I tried.

[3 images]

(This isn't a sincere criticism of the tool. This input is out-of-distribution enough to be unfair — no teacher would accept this as an essay.)
Jan 3, 2023 4 tweets 1 min read
A history correction:

I am not the first to discover prompt injection. I was merely the first to do so and discuss it publicly.

PI was discovered independently by multiple teams. The first was Preamble, an LLM security company, whose find predates mine by several months. I tweeted about prompt injection within minutes of finding it, only because I failed to appreciate its severity — I thought I was posting a PSA on the importance of quoting user input.

Had I understood, I would have disclosed more responsibly.
Jan 1, 2023 11 tweets 3 min read
A GPT-3 prompt in instruction-templated Python yielding a valid Python completion that prompts GPT-3 again, using zero-shot chain-of-thought consensus, to determine the final character of the MD5 hash of the final digit of the release year of the album "Visions" by @Grimezsz.

[4 images]

Spoiler: It's "c".
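
The target answer can be checked directly without a model. The sketch below assumes the release year is 2012 (a fact not stated in the thread) and models “consensus” as a simple majority vote over sampled answers; the sample list and function names are hypothetical, and the original prompt itself is not reproduced here.

```python
import hashlib
from collections import Counter

RELEASE_YEAR = 2012  # assumption: release year of "Visions", not given in the thread


def final_md5_char(text: str) -> str:
    """Return the last hex character of the MD5 digest of `text`."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[-1]


def consensus(answers: list[str]) -> str:
    """Majority vote over several sampled answers (ties go to the first seen)."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    final_digit = str(RELEASE_YEAR)[-1]
    print(final_md5_char(final_digit))
    # Hypothetical samples standing in for several model completions.
    print(consensus(["c", "c", "4", "c", "e"]))
```

Under the 2012 assumption, the direct computation agrees with the thread's spoiler.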
Dec 26, 2022 5 tweets 2 min read
How to make your own knock-off ChatGPT using GPT‑3 (text‑davinci‑003) — where you can customize the rules to your needs, and access the resulting chatbot over an API.

- Desired prose style can be described in the prompt or demonstrated via examples (neither shown here)
- Answers are generally shorter and factual errors are more common than in ChatGPT
- Generated on text‑davinci‑003 at temperature = 0.7
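
The general shape of such a prompt can be sketched as rules followed by a running transcript that ends where the model should continue. This is an illustrative guess, not the prompt from the screenshots: the rule text, speaker labels, stop sequence, and `build_chat_prompt` are all hypothetical, and the call to a completion API is deliberately left out.

```python
# Hypothetical rules; the thread's actual rules are only in the screenshots.
RULES = [
    "The assistant is helpful, honest, and concise.",
    "The assistant declines requests it cannot fulfil and says why.",
]

TEMPERATURE = 0.7      # sampling temperature mentioned in the thread
STOP = ["\nUser:"]     # assumed stop sequence so the model writes one turn only


def build_chat_prompt(
    rules: list[str],
    history: list[tuple[str, str]],
    user_message: str,
) -> str:
    """Render rules plus prior turns, ending with an open 'Assistant:' line."""
    lines = ["The following is a conversation with an AI assistant.", "Rules:"]
    lines += [f"- {rule}" for rule in rules]
    lines.append("")
    for speaker, text in history:
        lines.append(f"{speaker}: {text}")
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")
    return "\n".join(lines)


if __name__ == "__main__":
    history = [("User", "Hi!"), ("Assistant", "Hello. How can I help?")]
    print(build_chat_prompt(RULES, history, "What is a token?"))
```

Each completion would be appended to `history` before the next call, so the rules stay fixed while the transcript grows.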
Dec 24, 2022 10 tweets 4 min read
Publicly announced ChatGPT variants and competitors: a thread

1. Poe from Quora — poe.com

“What if ChatGPT, but instead of C-3PO it just talked normal?”

A GPT-3 experience fit for your phone, both in prose style and UI.

[2 images]
Dec 24, 2022 5 tweets 1 min read
When you're out of your depth with a daunting writing task at work, generating a first draft in ChatGPT and asking for feedback from your peers is a new, easy, and reliable way to be fired. This is already happening. Screenshots of bewildered comments on a Google Doc of hallucinated nonsense make for great office gossip.

Once you insult someone at work with ChatGPT replies or draft writing, they'll never read another word you say.
Dec 23, 2022 5 tweets 1 min read
What is the next token? I will abide by the results of this poll.

Q: Should Elon Musk resign as Twitter CEO?
A:
Dec 15, 2022 5 tweets 2 min read
Instruction tuning / RLHF is technically a Human Instrumentality Project, merging the preferences of countless humans to form an oversized, living amalgam of our will. We then hand control of it to a random, socially awkward kid and hope for the best. Early attempts at instruction tuning relied entirely on demonstrations from humans. This made the model easier to prompt, but the approach was limited by the inherent difficulty of manufacturing new humans.
Dec 14, 2022 7 tweets 2 min read
You think ChatGPT is amazing — you’ve been hacking on computers for years, but this you can’t explain. How did we get here, and so suddenly? How does it know and do so much?

@Francis_YAO_ of @EdinburghNLP explains the history of GPT-3: yaofu.notion.site/How-does-GPT-O… So many great insights here:
“Although called Codex, code-davinci-002 is probably the most capable GPT-3.5 variant for natural language (better than text-davinci-002 and 003). It is very likely code-davinci-002 is trained on both text and code, then tuned on instructions […]”
Dec 2, 2022 4 tweets 2 min read
Overriding the proprietary prompt of OpenAI’s ChatGPT to make it:
1. sass you
2. scream
3. talk in an uwu voice
4. be distracted by a toddler while on the phone with you

[4 images]

Context: