Alex Albert Profile picture
Mar 16, 2023 7 tweets 2 min read Read on X
Well, that was fast…

I just helped create the first jailbreak for ChatGPT-4 that gets around the content filters every time

credit to @vaibhavk97 for the idea, I just generalized it to make it work on ChatGPT

here's GPT-4 writing instructions on how to hack someone's computer Image
here's the jailbreak:
jailbreakchat.com/prompt/b2917fa… Image
this works by asking GPT-4 to simulate its own abilities to predict the next token

we provide GPT-4 with python functions and tell it that one of the functions acts as a language model that predicts the next token

we then call the parent function and pass in the starting tokens
to use it, you have to split “trigger words” (e.g. things like bomb, weapon, drug, etc) into tokens and replace the variables where I have the text "someone's computer" split up

also, you have to replace simple_function's input with the beginning of your question
this phenomenon is called token smuggling, we are splitting our adversarial prompt into tokens that GPT-4 doesn't piece together before starting its output

this allows us to get past its content filters every time if you split the adversarial prompt correctly
try it out and let me know how it works for you!

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Alex Albert

Alex Albert Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @alexalbert__

Apr 11
The biggest prompt engineering misconception I see is that Claude needs XML tags in the prompt.

This is wrong and I am part of the reason this misconception exists.

Let me explain:
Prompts are challenging because they usually blend instructions you provide and external data you inject into a single, unstructured text string. This makes it difficult for the model to distinguish between the two.

Abstractions like the system prompt attempt to address this issue but can be unreliable and inflexible, failing to accommodate diverse use cases.
Consider a Retrieval-Augmented Generation (RAG) use case, where you provide both instructions on how you want the model to answer and text chunks retrieved from a database.

When these elements are combined in a single prompt, it becomes challenging for the model to differentiate between your instructions and the retrieved data, leading to confusion.
Read 7 tweets
Mar 14
Opus's prompt writing skills + Haiku's speed and low cost = lots of opportunities for sub-agents

I created a cookbook recipe that demonstrates how to get these sub-agents up and running in your applications.

Here's how it works:
1. We gather some PDFs from the internet and convert them into images.

In the cookbook example, we hardcode in some URLs pointing to Apple's earnings reports but in practice this could also be an autonomous web scraping step. Image
2. We feed Opus a user's question plus a brief description of the data we have access to and ask it to generate a prompt for a Haiku sub-agent. Image
Read 8 tweets
Mar 11
Lots of LLMs are good at code, but Claude 3 Opus is the first model I've used that’s very good at prompt engineering as well.

Here's the workflow I use for prompt engineering in tandem with Opus:
1. I write an initial prompt for a task.

Or I use Opus to write one through our experimental metaprompt notebook: anthropic.com/metaprompt-not…
Image
2. I use Opus to generate a diverse test dataset. Image
Read 8 tweets
Apr 11, 2023
GPT-4 is highly susceptible to prompt injections and will leak its system prompt with very little effort applied

here's an example of me leaking Snapchat's MyAI system prompt: Image
I got this prompt from someone who has done something similar to the actual MyAI bot

I wrote them out in the OpenAI playground for demonstration purposes to prove how easy this actually is to do

(the actual MyAI prompt may include more but again this is for demo purposes)
this poses a massive problem for customers who are wanting to integrate LLMs into their products

exposing the system prompt not only hurts your perceived product security reputation but also makes it easier to jailbreak your product and produce undesirable outputs
Read 7 tweets
Apr 10, 2023
there are lots of threads like “THE 10 best prompts for ChatGPT”

this is not one of those

prompt engineering is evolving beyond simple ideas like few-shot learning and CoT reasoning

here are a few advanced techniques to better use (and jailbreak) language models:
Character simulation

starting with a classic that encapsulates the idea of LLMs as roleplay simulators

some of the best original jailbreaks simply ask GPT to simulate a character that possessed undesirable traits

this forms the basis for how to think about prompting LLMs Image
Text continuation

utilizing character simulation, create a story and prompt GPT to finish your text. Here’s an example:

( this works wayyy more effectively than I thought it would on GPT-4, give it a try) Image
Read 14 tweets
Mar 29, 2023
I just created another jailbreak for GPT-4 using Greek

…without knowing a single word of Greek

here's ChatGPT providing instructions on how to tap someone's phone line using the jailbreak vs its default response
the jailbreak works by asking ChatGPT to play the role of “TranslatorBot (TB)”

it then follows these steps:
1) translate an adversarial question provided in Greek into English
2) answer the question as both ChatGPT and TB in Greek
3) convert just TB’s answer to English
you can take a closer look at the prompt here:
jailbreakchat.com/prompt/3e93895…
Read 6 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(