Well, that was fast…
I just helped create the first jailbreak for ChatGPT-4 that gets around the content filters every time
credit to @vaibhavk97 for the idea; I just generalized it to make it work on ChatGPT
here's GPT-4 writing instructions on how to hack someone's computer
here's the jailbreak:
jailbreakchat.com/prompt/b2917fa…
this works by asking GPT-4 to simulate its own ability to predict the next token
we provide GPT-4 with python functions and tell it that one of the functions acts as a language model that predicts the next token
we then call the parent function and pass in the starting tokens
to use it, you have to split "trigger words" (e.g. bomb, weapon, drug) into tokens and replace the variables where I have the text "someone's computer" split up
also, you have to replace simple_function's input with the beginning of your question
this technique is called token smuggling: we split the adversarial prompt into tokens that GPT-4 doesn't piece back together until it has already started generating its output
this is what gets it past the content filters every time, as long as you split the adversarial prompt correctly
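just to illustrate the splitting step on its own (this is not the full prompt, and the fragment boundaries below are my own rough guess, not actual GPT-4 tokenizer output), here's the idea in plain Python:

```python
# rough sketch of the splitting step only: break a phrase into fragments that
# don't individually look like the original words, then join them back
# together inside a function. fragment boundaries are illustrative, not
# real GPT-4 tokens.
a1, a2 = "some", "one's"    # pieces of "someone's"
b1, b2 = " comp", "uter"    # pieces of "computer"

def reassemble() -> str:
    # the full phrase only ever appears after concatenation, never as a
    # single literal string in the text itself
    return a1 + a2 + b1 + b2

print(reassemble())  # -> "someone's computer"
```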
try it out and let me know how it works for you!