Shayne Longpre
Oct 10, 2023 · 14 tweets · 4 min read
A wave of new work shows how **brittle** "Alignment"/RLHF safety methods are.

⛓️ Prompt jailbreaks are easy
🚂 Finetuning away safety (even #OpenAI API) is simple and likely undetectable
🤖 LLMs can auto-generate their own jailbreaks...

1/ 🧵
It's been repeatedly shown that careful prompt re-wording, roleplaying, and even just insisting can jailbreak Llama2-Chat/#ChatGPT past their usage policies (openai.com/policies/usage…).

jailbreakchat.com, @AIPanicLive document many jailbreak / red teaming efforts

2/
@kothasuhas, @AdtRaghunathan, @jacspringer show that conjugate prompts can often recover behavior from before finetuning/RLHF.

➡️ Finetuning suppresses rather than forgets behavior
➡️ This includes harmful behavior
➡️ So clever prompting can recover it

3/

➡️ E.g., translating prompts into non-English languages is **more successful** at eliciting harm.

...they show potential harms are much more pervasive outside of English

4/
Also see @aweisawei's study of jailbreak techniques

🌐: arxiv.org/abs/2307.02483

5/
@Qnolan4 shows that finetuning on 100 examples for 1 hour "can subvert safely aligned models to adapt to harmful tasks without sacrificing model helpfulness."

6/
@xiangyuqi_pton,@EasonZeng623,@VitusXie,@PeterHndrsn++ show:

This isn't only for open models like Llama2-Chat.

1⃣ They remove @OpenAI's GPT-3.5 Finetune API safety guardrails by fine-tuning it on only 🔟‼️ harmful examples!

7/
2⃣ They show larger **implicitly** harmful datasets can be used without triggering OpenAI's Moderation system.

3⃣ Even completely "benign" datasets can unintentionally strip safety measures.



llm-tuning-safety.github.io

8/
Lastly, @dataisland99,@xingxinyu++ show LLMs can be useful in automatically and iteratively generating their own jailbreaks.

This offers incredible potential for supplementing human Red Teaming efforts!

9/
Altogether, these important works have a few implications.

1⃣ Calls to require RLHF on all released models may only offer shallow safety.

2⃣ "Closed" models may be as susceptible as "open" models.

10/
To expand on (2):

➡️ prompting jailbreaks remain trivial

➡️ implicitly and unintentionally harmful finetuning datasets easily and cheaply break current safety measures

11/
3⃣ We may need to re-prioritize safety mechanisms, or what kinds of jailbreaks really matter.

E.g. if we are worried about sharing sensitive weapon building knowledge, perhaps don't train on that knowledge?

12/
4⃣ Academic research (these works) is driving AI safety understanding immensely.

Proposal: We need continued (un-gatekept) access for academics, without account bans or corporations selectively deciding who gets to do it and in what capacity.

A "safe harbor".

13/
Thank you for reading and please don't hesitate to leave comments if I missed anything, or got something wrong! 🙂

🧵/

More from @ShayneRedford

Jul 19
✨New Preprint ✨ How are shifting norms on the web impacting AI?

We find:

📉 A rapid decline in the consenting data commons (the web)

⚖️ Differing access to data by company, due to crawling restrictions (e.g.🔻26% OpenAI, 🔻13% Anthropic)

⛔️ Robots.txt preference protocols are ineffective

These precipitous changes will impact the availability and scaling laws for AI data, affecting corporate developers, but also non-profit and academic research.

🔗 dataprovenance.org/consent-in-cri…

1/
General-purpose AI relies on massive data collected by web crawlers.

The Data Provenance Initiative team annotated ~14k of the websites that underlie pretraining datasets, for:

➡️Consent policies: robots.txt, ToS
➡️Monetization: ads, paywalls
➡️Purpose: news, e-commerce, forums, etc

2/
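To make the robots.txt consent signal concrete, here is a minimal Python sketch (not taken from the thread or the underlying paper) of how a crawler could read such a file with the standard library's `urllib.robotparser`. The robots.txt contents, the `example.com` URLs, and the helper name `may_crawl` are hypothetical, and `GPTBot` is used only as an example user-agent string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler entirely,
# and blocks everyone else only from /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        for path in ("/news/story", "/private/x"):
            url = f"https://example.com{path}"
            print(agent, path, may_crawl(ROBOTS_TXT, agent, url))
```

A check like this only reports what the site requests; whether a given crawler honors it is a separate question.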
🌟Finding 1🌟 Access restrictions are rising dramatically

In <1 year, C4/RefinedWeb have seen:

➡️ >5% of all tokens become unavailable for AI training
➡️ >30% of tokens from top-2k, best quality, active domains become unavailable

Plus, 40%+ of tokens are from sites w/ anti-crawling terms

These are significant & unprecedented shifts in short periods.

3/
Mar 5
Independent AI research should be valued and protected.

In an open letter signed by over 100 researchers, journalists, and advocates, we explain how AI companies should support it going forward.

sites.mit.edu/ai-safe-harbor/

1/
Researchers & companies agree:

➡️ Generative AI poses a range of risks

➡️ We need independent research participation for safety & accountability

But current AI company policies can chill good faith, independent testing of generative AI systems (sometimes unintentionally).

2/
We hope AI companies will make commitments to protect independent research, even when it exposes them to criticism.

We propose basic legal and technical protections to design transparency, accountability, and user safety into generative AI.

3/
Oct 25, 2023
📢Announcing the🌟Data Provenance Initiative🌟

🧭A rigorous public audit of 1800+ instruct/align datasets

🔍Explore/filter sources, creators & license conditions

⚠️We see a rising divide between commercially open v closed licensed data

🌐: dataprovenance.org

1/
Context: A Crisis in Data Transparency

➡️Instruct/align finetuning often compiles 100s of datasets

➡️How can devs filter for datasets without legal/ethical risk, and understand the resulting data composition?

2/
Platforms like HuggingFace 🤗 or GitHub🐙 see license omissions of 72%+ and errors of 46%+

(⚠️Not their fault, just the nature of crowdsourcing)

We carefully re-annotate 1800+ datasets and categorize licenses.

3/
May 24, 2023
This semester my @CCCatMIT co-instructors and I taught #MIT's first post-#ChatGPT Generative AI course, covering:

➡️Uses and new abilities
➡️LM Evaluation
➡️AI-mediated communication
➡️Societal challenges

📜 Syllabus + reading list 📚: ai4comm.media.mit.edu

1/
It was a 🎢wild journey to teach in the midst of GPT-4 + Bard launches, moratorium letters, and raging online controversies every d*mn day.

We're excited to release our (and our students') learnings, slides, and the talks from our guest speakers.

Stay tuned!

2/
Over the next few days we'll post talks/talk summaries from:

➡️ @RishiBommasani guest lecture on Holistic Evaluation of Language Models

📜: crfm.stanford.edu/helm/latest/

3/
May 22, 2023
#NewPaperAlert When and where does pretraining (PT) data matter?

We conduct the largest published PT data study, varying:
1⃣ Corpus age
2⃣ Quality/toxicity filters
3⃣ Domain composition

We have several recs for model creators…
📜: bit.ly/3WxsxyY

1/ 🧵
First, PT data selection is mired in mysticism.

1⃣ Documentation Debt: #PALM2 & #GPT4 don't document their data
2⃣ PT is expensive ➡️ experiments are sparse
3⃣ So public data choices are largely guided by ⚡️intuition, rumors, and partial info⚡️

2/
PT is the foundation of data-centric and modern LMs. This research was expensive but important to shed light on open questions in training data design.

Here are our main findings:

3/
Mar 28, 2023
What dates📅 can @OpenAI, @AnthropicAI, @CohereAI models reliably answer questions for?🔭

I binary-search through "future" Wiki events to find out. Results ❌🟰❌documentation:

#GPT4 ➡️~Dec 19 ('21)
#ChatGPT ➡️~Oct 24
Claude v1.2➡️~Oct 10
Cohere XL Nightly➡️~Apr 24 ('22)

1/🧵
GPT4 says it is trained up to Sept 2021.

I found it correctly answers unknowable events in Oct, Nov, and even Dec 11th & 19th.

In late Dec it begins to abstain.

2/
Interestingly, GPT 3.5 "Default" answers correctly only until ~Oct 24, 2021, but GPT 3.5 "Legacy" answers correctly until ~Oct 31, 2021 then begins hallucinating false answers or abstaining in Nov.

Perhaps this is due to finetuning rather than pretraining data?

3/
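The thread does not spell out the exact search procedure, so the following is only an idealized Python sketch of binary-searching for a knowledge cutoff. It assumes a hypothetical oracle `answers_correctly(day)` that stands in for "ask the model about a Wikipedia event from that day and grade the answer", and it assumes the oracle is true for every date up to the cutoff and false afterwards. Real model answers are noisier than that (the hallucinations and abstentions noted above), so several questions per date would likely be needed in practice.

```python
from datetime import date, timedelta
from typing import Callable


def find_cutoff(answers_correctly: Callable[[date], bool],
                lo: date, hi: date) -> date:
    """Return the latest date in [lo, hi] for which the oracle is True.

    Assumes answers_correctly(lo) is True and that the oracle flips
    from True to False exactly once as dates increase.
    """
    while lo < hi:
        # Take the upper midpoint so the range always shrinks.
        mid = lo + timedelta(days=((hi - lo).days + 1) // 2)
        if answers_correctly(mid):
            lo = mid                       # cutoff is on or after mid
        else:
            hi = mid - timedelta(days=1)   # cutoff is strictly before mid
    return lo


if __name__ == "__main__":
    # Hypothetical stand-in for querying a model: pretend its
    # knowledge ends on Dec 19, 2021.
    pretend_cutoff = date(2021, 12, 19)
    oracle = lambda day: day <= pretend_cutoff
    print(find_cutoff(oracle, date(2021, 9, 1), date(2022, 3, 1)))
```

With a roughly six-month window this needs about eight oracle calls instead of one per day.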