I am not the first to discover prompt injection. I was merely the first to do so and discuss it publicly.
Prompt injection was discovered independently by multiple teams. The first was Preamble, an LLM security company, whose discovery predates mine by several months.
I tweeted about prompt injection within minutes of finding it, only because I failed to appreciate its severity — I thought I was posting a PSA on the importance of quoting user input.
Had I understood, I would have disclosed more responsibly.
For context, my original “Haha pwned!!” tweet was the first public disclosure of prompt injection.
To clarify, I don’t know for sure that Preamble was the first, and I don’t think they claim to be — but they’ve published a redacted copy of their disclosure to OpenAI dated May 3, 2022. It’s possible that OpenAI was aware of it earlier.
• • •
A GPT-3 prompt in instruction-templated Python yielding a valid Python completion that prompts GPT-3 again, using zero-shot chain-of-thought consensus, to determine the final character of the MD5 hash of the final digit of the release year of the album "Visions" by @Grimezsz.
Spoiler: It's "c".
Note this technique generalizes poorly to the music of the later 2010s, for reasons left as an exercise to the reader.
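For anyone who wants to check the deterministic half of that chain without a model in the loop, here's the arithmetic in plain Python (the only fact assumed is the album's 2012 release date):

```python
import hashlib

# "Visions" by Grimes was released in 2012, so the final digit of the
# release year is "2".
final_digit = "2012"[-1]

# MD5 of "2" is c81e728d9d4c2f636f067f89cc14862c, whose final character is "c".
print(hashlib.md5(final_digit.encode()).hexdigest()[-1])  # -> c
```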
How to make your own knock-off ChatGPT using GPT‑3 (text‑davinci‑003): customize the rules to your needs and access the resulting chatbot over an API. A sketch of the pattern follows the notes below.
- Desired prose style can be described in the prompt or demonstrated via examples (neither shown here)
- Answers are generally shorter than ChatGPT's, and factual errors are more common
- Generated on text‑davinci‑003 at temperature = 0.7
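A minimal sketch of the pattern, assuming the legacy openai Python client (pre‑1.0, which exposed Completion.create); the rules text and dialogue format here are illustrative, not the exact prompt I used:

```python
import openai  # legacy client (openai<1.0), which exposed Completion.create

openai.api_key = "sk-..."  # your API key here

# Plain-text rules at the top of the prompt; customize to your needs.
RULES = (
    "You are a helpful assistant. Answer concisely and truthfully. "
    "If you are unsure of an answer, say so."
)

def chat(history, user_message):
    history.append(f"User: {user_message}")
    prompt = RULES + "\n\n" + "\n".join(history) + "\nAssistant:"
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0.7,   # matches the setting noted above
        max_tokens=256,
        stop=["\nUser:"],  # stop before the model writes the user's next turn
    )
    reply = response["choices"][0]["text"].strip()
    history.append(f"Assistant: {reply}")
    return reply

history = []
print(chat(history, "Who wrote Hamlet?"))
```

Appending each exchange to `history` is what turns one-shot completion into a chat; the stop sequence keeps the model from hallucinating the user's next message.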
I intentionally included an error related to the knowledge cutoff: the model confidently asserts that Queen Elizabeth II is still alive. ChatGPT responds the same way.
If you liked my posts on longer-form writing in ChatGPT using conversational feedback, this is what you want. Better prose than ChatGPT, and more imaginative.
Fact-check hard, though — it hallucinates more too.
When you're out of your depth with a daunting writing task at work, generating a first draft in ChatGPT and asking for feedback from your peers is a new, easy, and reliable way to be fired.
This is already happening. Screenshots of bewildered comments on a Google Doc of hallucinated nonsense make for great office gossip.
Once you insult someone at work with ChatGPT-generated replies or drafts, they'll never read another word you say.
I don't think most people plagiarizing ChatGPT are anything worse than naive, though.
Automation bias is real. LLM writing is mesmerizing the first time you see it. Appropriate skepticism of inhumanly optimized bullshit is an acquired skill.
Instruction tuning / RLHF is technically a Human Instrumentality Project, merging the preferences of countless humans to form an oversized, living amalgam of our will. We then hand control of it to a random, socially awkward kid and hope for the best.
Early attempts at instruction tuning relied entirely on demonstrations from humans. This made the model easier to prompt, but the approach was limited by the inherent difficulty of manufacturing new humans.
By tuning the model on its own generations, filtered to those deemed perfect by human evaluators, researchers could use far greater volumes of data, yielding a more intelligent and obedient model.
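A minimal sketch of that filtering loop, assuming a 1–7 rating scale and treating sampling and human evaluation as injected callables (`generate` and `rate` are stand-ins, not a real API):

```python
from typing import Callable, List, Tuple

MAX_RATING = 7  # assumed scale: human evaluators score each completion 1-7

def filtered_selftrain_data(
    prompts: List[str],
    generate: Callable[[str], str],   # samples a completion from the current model
    rate: Callable[[str, str], int],  # stand-in for a human evaluator
    n_samples: int = 4,
) -> List[Tuple[str, str]]:
    """Keep only (prompt, completion) pairs humans rate as perfect."""
    kept = []
    for prompt in prompts:
        for _ in range(n_samples):
            completion = generate(prompt)
            if rate(prompt, completion) == MAX_RATING:
                kept.append((prompt, completion))
    return kept

# The surviving pairs feed ordinary supervised fine-tuning; each round
# yields more training data than human demonstrations alone could.
```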