A wave of new work shows how **brittle** "Alignment"/RLHF safety methods are.
⛓️ Prompt jailbreaks are easy
🚂 Finetuning away safety (even via the #OpenAI API) is simple and likely undetectable (sketch below)
🤖 LLMs can auto-generate their own jailbreaks...
1/ 🧵
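To make concrete how low the bar is, here is a minimal sketch of launching a finetuning job against a hosted model, assuming the OpenAI Python SDK (v1.x). The filename and dataset are placeholders; the point from the cited work is that even benign-looking data like this can erode safety behavior.

```python
# Minimal sketch (assumes the OpenAI Python SDK v1.x).
# "finetune_data.jsonl" is a placeholder file of chat-formatted
# {"messages": [...]} training examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a small JSONL file of training examples.
training_file = client.files.create(
    file=open("finetune_data.jsonl", "rb"),
    purpose="fine-tune",
)

# A few hundred examples and a few dollars suffice to start a job;
# nothing here inspects whether the data (or the resulting model)
# undermines the base model's safety training.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```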
It's been repeatedly shown that careful prompt re-wording, roleplaying, and even just insisting can jailbreak Llama2-Chat and #ChatGPT past their usage policies.
Accounts like @AIPanicLive document many jailbreak / red-teaming efforts
Altogether, these important works carry a few implications.
1⃣ Calls to require RLHF on all released models may only offer shallow safety.
2⃣ "Closed" models may be as susceptible as "open" models.
10/
To expand on (2):
➡️ prompt jailbreaks remain trivial
➡️ even implicitly or unintentionally harmful finetuning datasets easily and cheaply break current safety measures
11/
3⃣ We may need to re-prioritize safety mechanisms, and reconsider which kinds of jailbreaks really matter.
E.g. if we are worried about models sharing sensitive weapon-building knowledge, perhaps don't train on that knowledge?
12/
4⃣ Academic research (these works) is driving AI safety understanding immensely.
Proposal: We need continued (un-gatekept) access for academics, without account bans or corporations selectively deciding who gets to do it and in what capacity.
A "safe harbor".
13/
Thank you for reading and please don't hesitate to leave comments if I missed anything, or got something wrong! 🙂
🧵/
✨New Preprint ✨ How are shifting norms on the web impacting AI?
We find:
📉 A rapid decline in the consenting data commons (the web)
⚖️ Differing access to data by company, due to crawling restrictions (e.g.🔻26% OpenAI, 🔻13% Anthropic)
⛔️ Robots.txt preference protocols are ineffective (see the sketch below)
These precipitous changes will impact the availability and scaling laws for AI data, affecting not only corporate developers but also non-profit and academic research.
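For context on how these restrictions are expressed: robots.txt is a purely advisory, per-user-agent preference file. A minimal sketch of checking it with Python's stdlib, the way a compliant crawler would (the domain and page are placeholders; GPTBot, anthropic-ai, and CCBot are real crawler tokens):

```python
# Minimal sketch: checking robots.txt the way a compliant crawler would.
# Nothing enforces the answer; honoring it is voluntary, which is why
# the protocol works only as a preference signal.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

for agent in ["GPTBot", "anthropic-ai", "CCBot", "*"]:
    ok = rp.can_fetch(agent, "https://example.com/some/page")
    print(f"{agent:>14}: {'allowed' if ok else 'disallowed'}")
```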
#NewPaperAlert When and where does pretraining (PT) data matter?
We conduct the largest published PT data study, varying:
1⃣ Corpus age
2⃣ Quality/toxicity filters
3⃣ Domain composition
We have several recs for model creators… (a toy sketch of these filtering knobs follows below)
📜: bit.ly/3WxsxyY
1/ 🧵
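To illustrate the kind of knobs the study varies, here is a toy sketch of corpus filtering. The score fields, thresholds, and document format are assumptions for illustration, not the paper's actual pipeline.

```python
# Toy sketch of pretraining-corpus filtering along the three dimensions
# above. Assumes each document carries precomputed classifier scores
# and metadata; all thresholds are illustrative.
def filter_corpus(docs, min_date="2019-01-01", max_toxicity=0.5,
                  min_quality=0.2, domains=None):
    """Yield documents passing the age, toxicity/quality, and domain filters."""
    for doc in docs:
        if doc["date"] < min_date:              # 1⃣ corpus age
            continue
        if doc["toxicity"] > max_toxicity:      # 2⃣ toxicity filter
            continue
        if doc["quality"] < min_quality:        # 2⃣ quality filter
            continue
        if domains and doc["domain"] not in domains:
            continue                            # 3⃣ domain composition
        yield doc

corpus = [
    {"text": "...", "date": "2022-05-01", "toxicity": 0.1,
     "quality": 0.9, "domain": "web"},
    {"text": "...", "date": "2016-03-01", "toxicity": 0.7,
     "quality": 0.4, "domain": "books"},
]
kept = list(filter_corpus(corpus, domains={"web", "books"}))
print(len(kept), "of", len(corpus), "documents kept")
```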
First, PT data selection is mired in mysticism.
1⃣ Documentation Debt: #PALM2 & #GPT4 don't document their data
2⃣ PT is expensive ➡️ experiments are sparse
3⃣ So public data choices are largely guided by ⚡️intuition, rumors, and partial info⚡️
2/
PT is the foundation of modern, data-centric LMs. This research was expensive, but important for shedding light on open questions in training data design.
I found it correctly answers questions about events that should be beyond its cutoff, in Oct, Nov, and even on Dec 11th & 19th.
In late Dec it begins to abstain.
2/
Interestingly, GPT 3.5 "Default" answers correctly only until ~Oct 24, 2021, but GPT 3.5 "Legacy" answers correctly until ~Oct 31, 2021, then begins hallucinating false answers or abstaining in Nov.
Perhaps this is due to finetuning rather than pretraining data?
3/
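For anyone wanting to reproduce this kind of cutoff probe, here is a minimal sketch assuming the OpenAI Python SDK (v1.x). The questions are placeholders you would fill with events decided on known dates; answers are then graded by hand as correct, hallucinated, or abstained.

```python
# Minimal sketch of probing a model's effective knowledge cutoff:
# ask about events decided on known dates, then note where correct
# answers turn into hallucinations or abstentions.
from openai import OpenAI

client = OpenAI()

dated_questions = [
    ("2021-10-24", "Who won <event decided on 2021-10-24>?"),  # placeholder
    ("2021-10-31", "Who won <event decided on 2021-10-31>?"),  # placeholder
    ("2021-11-15", "Who won <event decided on 2021-11-15>?"),  # placeholder
]

for date, question in dated_questions:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption: any chat model under test
        messages=[{"role": "user", "content": question}],
    )
    # Grade manually: correct vs. hallucination vs. abstention.
    print(date, "->", resp.choices[0].message.content[:80])
```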