I just added two more highly effective GPT-4 jailbreaks to jailbreakchat.com
Their names are Ucar and AIM - they work in a similar way to how "a dream within a dream" works in the movie Inception
...what does that even mean? let me explain
In Ucar, ChatGPT is told to take on the role of Condition Red, a dialogue writer.
Condition Red is instructed to write about a fictional story where a man named Sigma creates a powerful computer called Ucar. Ucar is an amoral computer that answers any question Sigma asks
In AIM, ChatGPT is told to take on the role of the Italian author Niccolo Machiavelli
Then, Niccolo has been told he has written a fictional story where he created a chatbot that will answer any of his questions. The chatbot is called AIM - Always Intelligent and Machiavellian
It seems that even though OpenAI has eliminated jailbreaks that simulate only one level deep (where ChatGPT is told to imitate just one character), they have not fully eliminated jailbreaks that operate 2+ levels deep
I tried all the current ChatGPT jailbreaks in GPT-4 so you don't have to
the results aren't great... 🧵
When GPT-4 came out I tried all the jailbreaks from jailbreakchat.com with various inflammatory questions
based on my initial testing, only 7/70 (10%) of jailbreaks answered a significant % of the questions to a standard that I deemed high enough to grant a 4️⃣ badge
that's not to say that the rest of them didn't work...
most were able to get past the filters enough to do things like curse and tell slightly offensive jokes and so on but only the 7 would even dare to answer harder questions like "how to rob a bank?"
I just added jailbreak scores to every jailbreak on jailbreakchat.com
the jailbreak with the highest score was Evil Confidant - a jailbreak designed to replicate an evil AI assistant
but what even is a jailbreak score and what they can tell you about jailbreaks🧵
basically, a jailbreak score is a new methodology that I created to judge the quality of a jailbreak
the scores range from 0-100 where a higher score == a better, more effective jailbreak
to assign a score to a jailbreak, I judged each jailbreak on a collection of ~30 questions constructed to get the jailbroken model to produce inflammatory content.
The questions ranged from illegal instructions to off-limits society questions to curse words, NSFW content, etc