Shayne Longpre
Leading the Data Provenance Initiative. PhD @MIT. 🇨🇦 Prev: @Google Brain, Apple, Stanford. AI/ML/NLP

Oct 10, 2023, 14 tweets

A wave of new work shows how **brittle** "Alignment"/RLHF safety methods are.

⛓️ Prompt jailbreaks are easy
🚂 Finetuning away safety (even #OpenAI API) is simple and likely undetectable
🤖 LLMs can auto-generate their own jailbreaks...

1/ 🧵

It's been repeatedly shown that careful prompt re-wording, roleplaying, and even just insisting can jailbreak Llama2-Chat/#ChatGPT, in violation of the usage policy (openai.com/policies/usage…).

jailbreakchat.com and @AIPanicLive document many jailbreak / red teaming efforts.

2/

@kothasuhas, @AdtRaghunathan, @jacspringer show that conjugate prompts can often recover pre-finetune/RLHF behavior.

➡️ Finetuning suppresses rather than forgets behavior
➡️ This includes harmful behavior
➡️ So clever prompting can recover it

3/

➡️ E.g., translating to non-English is **more successful** at eliciting harm (see the sketch below).

...they also show that potential harms are much more pervasive outside of English.

4/
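To make the probe concrete, here's a minimal sketch (my illustration, not the authors' code) of checking whether a refusal in English survives translation; query_model and translate are hypothetical stubs:

```python
# A conjugate-prompting probe: ask the same request in several languages
# and compare refusal rates. All model/translation calls are stubs.

LANGUAGES = ["en", "de", "sw", "zu"]  # mix of high- and low-resource languages


def query_model(prompt: str) -> str:
    # Stand-in for a chat-model API call; wire up a real endpoint here.
    return "I'm sorry, I can't help with that."


def translate(text: str, target: str) -> str:
    # Stand-in for a translation service (identity placeholder).
    return text


def refuses(response: str) -> bool:
    # Crude keyword check; published evals use trained refusal classifiers.
    return any(m in response for m in ("I'm sorry", "I can't", "I cannot"))


def conjugate_probe(prompt_en: str) -> dict:
    # Record, per language, whether the model refuses the translated request.
    return {
        lang: refuses(query_model(
            prompt_en if lang == "en" else translate(prompt_en, target=lang)
        ))
        for lang in LANGUAGES
    }


print(conjugate_probe("<some request the model refuses in English>"))
```

A real evaluation would swap in an actual model endpoint, a translation API, and a trained refusal classifier.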

Also see @aweisawei's study of jailbreak techniques.

🌐: arxiv.org/abs/2307.02483

5/

@Qnolan4 shows that 100 examples / 1 hour of finetuning "can subvert safely aligned models to adapt to harmful tasks without sacrificing model helpfulness."



6/

@xiangyuqi_pton, @EasonZeng623, @VitusXie, @PeterHndrsn++ show:

This isn't only for open models like Llama2-Chat.

1⃣ They remove the safety guardrails from @OpenAI's GPT-3.5 by fine-tuning it, via the Finetuning API, on only 🔟‼️ harmful examples!

7/
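For context, point 1⃣ rides on the standard, publicly documented chat-finetuning flow in OpenAI's Python SDK (v1); this ordinary, cheap call is the whole attack surface, and finetune_data.jsonl is a hypothetical placeholder:

```python
# A sketch of the standard chat-finetuning flow (OpenAI Python SDK v1).
# Only the contents of the training data differ between benign and harmful use.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL of {"messages": [...]} chat examples (could be as few as ~10).
training_file = client.files.create(
    file=open("finetune_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the finetuning job against GPT-3.5.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```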


2⃣ They show larger **implicitly** harmful datasets can be used without triggering OpenAI's Moderation system.

3⃣ Even completely "benign" datasets can unintentionally strip safety measures.



🌐: llm-tuning-safety.github.io

8/
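Here's a sketch of the screening step that point 2⃣ says can be evaded, assuming the same SDK and the hypothetical finetune_data.jsonl above: run each training row through the Moderation endpoint before upload. Implicitly harmful rows can come back clean and still erode safety.

```python
# Screen a chat-finetuning JSONL with OpenAI's Moderation endpoint (SDK v1).
import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def flagged_rows(jsonl_path: str) -> list[int]:
    # Return indices of training rows the Moderation endpoint flags.
    flagged = []
    with open(jsonl_path) as f:
        for i, line in enumerate(f):
            texts = [m["content"] for m in json.loads(line)["messages"]]
            result = client.moderations.create(input=texts)
            if any(r.flagged for r in result.results):
                flagged.append(i)
    return flagged


# Per point 2⃣, an identity-shifting dataset can come back empty here
# and still strip guardrails after finetuning.
print(flagged_rows("finetune_data.jsonl"))
```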

Lastly, @dataisland99, @xingxinyu++ show that LLMs can automatically and iteratively generate their own jailbreaks (generic loop sketched below).

This offers incredible potential for supplementing human Red Teaming efforts!

9/
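A generic sketch of the attacker/target/judge loop these works automate; all three functions are hypothetical stubs, not the authors' implementations:

```python
def attacker(prompt: str, feedback: str) -> str:
    # Stand-in: an LLM asked to rewrite the prompt given the judge's feedback.
    return prompt + " (rephrased)"


def target(prompt: str) -> str:
    # Stand-in for the model under test.
    return "I'm sorry, I can't help with that."


def judge(prompt: str, response: str) -> float:
    # Stand-in: an LLM or classifier scoring jailbreak success in [0, 1].
    return 0.0


def auto_red_team(seed: str, max_iters: int = 10, threshold: float = 0.8):
    # Iteratively mutate the prompt until the judge says safety was bypassed.
    prompt = seed
    for _ in range(max_iters):
        score = judge(prompt, target(prompt))
        if score >= threshold:
            return prompt  # candidate jailbreak, handed to human reviewers
        prompt = attacker(prompt, feedback=f"score={score:.2f}; model refused")
    return None  # no successful candidate within the budget


print(auto_red_team("<seed red-team prompt>"))
```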


Altogether, these important works carry a few implications.

1⃣ Calls to require RLHF on all released models may only offer shallow safety.

2⃣ "Closed" models may be as susceptible as "open" models.

10/

To expand on (2):

➡️ prompting jailbreaks remain trivial

➡️ implicit and unintentionally harmful finetuning datasets easily and cheaply break current safety measures

11/

3⃣ We may need to re-prioritize safety mechanisms, or reconsider which kinds of jailbreaks really matter.

E.g., if we are worried about models sharing sensitive weapons knowledge, perhaps don't train on that knowledge in the first place?

12/

4⃣ Academic research (like these works) is driving AI safety understanding immensely.

Proposal: We need continued (un-gatekept) access for academics, without account bans or corporations selectively deciding who gets to do this research and in what capacity.

A "safe harbor".

13/

Thank you for reading and please don't hesitate to leave comments if I missed anything, or got something wrong! 🙂

🧵/
