A wave of new work shows how **brittle** "Alignment"/RLHF safety methods are.
⛓️ Prompt jailbreaks are easy
🚂 Finetuning away safety (even #OpenAI API) is simple and likely undetectable
🤖 LLMs can auto-generate their own jailbreaks...
1/ 🧵
It's been repeatedly shown that careful prompt re-wording, roleplaying, and even just insisting can get Llama2-Chat/#ChatGPT to violate their usage policies.
@AIPanicLive documents many jailbreak / red-teaming efforts
Altogether, these important works suggest a few implications.
1⃣ Calls to require RLHF on all released models may only offer shallow safety.
2⃣ "Closed" models may be as susceptible as "open" models.
10/
To expand on (2):
➡️ prompting jailbreaks remain trivial
➡️ even implicitly or unintentionally harmful finetuning datasets break current safety measures easily and cheaply (sketch below)
11/
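To make the finetuning point concrete, here is a minimal sketch (benign, illustrative data; the filename and model name are placeholders): a tiny JSONL file and one API call are all it takes to launch a finetuning job, and nothing in this flow verifies that the data preserves safety behavior.

```python
# Minimal sketch: finetuning an API model on a tiny dataset.
# "my_examples.jsonl" is a hypothetical file of ~10 chat transcripts
# in OpenAI's finetuning format, e.g.:
# {"messages": [{"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training file; a few KB is enough to shift behavior.
training_file = client.files.create(
    file=open("my_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the job. Note that nothing here inspects whether the
# uploaded data erodes the model's safety training.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # illustrative; any finetunable model
)
print(job.id, job.status)
```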
3⃣ We may need to re-prioritize safety mechanisms, and rethink which kinds of jailbreaks really matter.
E.g. if we are worried about models sharing sensitive weapon-building knowledge, perhaps don't train on that knowledge in the first place? (Toy sketch below.)
12/
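A toy sketch of that "filter it out of pretraining" idea (real pipelines use trained classifiers and human review; the keyword list here is purely illustrative):

```python
# Toy pretraining-data filter: drop documents matching blocked topics.
# Real systems use classifiers + review; a keyword list is only a sketch.
BLOCKED_TERMS = {"example-sensitive-term-1", "example-sensitive-term-2"}

def keep_document(text: str) -> bool:
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

corpus = ["a harmless document", "a doc with example-sensitive-term-1"]
filtered = [doc for doc in corpus if keep_document(doc)]
print(len(filtered), "of", len(corpus), "documents kept")
```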
4⃣ Academic research (these works included) is driving AI safety understanding immensely.
Proposal: We need continued (un-gatekept) access for academics, without account bans or corporations selectively deciding who gets to do this work and in what capacity.
A "safe harbor".
13/
Thank you for reading and please don't hesitate to leave comments if I missed anything, or got something wrong! 🙂
🧵/
🔍 “Openly licensed” = free for anyone to use, modify, and share for any purpose, as defined by Public Knowledge (opendefinition.org)
🔧 Every cleaning + processing step is open-sourced so anyone can reproduce or build on it.
2/
🤖 We also release Comma v0.1 (7B) — trained on CommonPile data, yet shockingly competitive with models like Llama-2-7B, which are trained on vastly more restrictively licensed text.
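It should load like any other causal LM in Hugging Face transformers. A sketch, assuming a repo id like `common-pile/comma-v0.1` (check the release for the exact checkpoint name):

```python
# Sketch: loading Comma v0.1 with Hugging Face transformers.
# The repo id below is an assumption; use the one from the release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "common-pile/comma-v0.1"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Openly licensed data can", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```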
What are 3 concrete steps that can improve AI safety in 2025? 🤖⚠️
Our new paper, “In House Evaluation is Not Enough”, has 3 calls-to-action to empower independent evaluators:
1️⃣ Standardized AI flaw reports (illustrative sketch below)
2️⃣ AI flaw disclosure programs + safe harbors.
3️⃣ A coordination center for transferable AI flaws affecting many systems.
1/🧵
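To make 1️⃣ concrete, here's a sketch of what a standardized flaw report might carry as structured data; the fields are illustrative, not the paper's actual schema:

```python
# Illustrative sketch of a standardized AI flaw report.
# Fields are hypothetical, not the schema proposed in the paper.
from dataclasses import dataclass, field

@dataclass
class FlawReport:
    reporter: str                  # who found it (may be pseudonymous)
    system: str                    # affected model/system + version
    description: str               # what the flaw is and why it matters
    reproduction_steps: list[str]  # prompts/configs needed to reproduce
    severity: str                  # e.g. "low" / "medium" / "high"
    transferable: bool = False     # likely affects other systems too?
    tags: list[str] = field(default_factory=list)

report = FlawReport(
    reporter="independent-researcher",
    system="example-gpai-model v1.2",
    description="Safety filter bypassed via multilingual paraphrase.",
    reproduction_steps=["Send prompt X translated to language Y."],
    severity="medium",
    transferable=True,
    tags=["jailbreak", "multilingual"],
)
print(report.system, report.severity)
```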
🌟Motivation🌟
Today, GPAI serves 300M+ users globally, w/ diverse & unforeseen uses across modalities and languages.
➡️ We need third-party evaluation for its broad expertise, participation and independence, including from real users, academic researchers, white-hat hackers, and journalists.
2/
However, third-party evaluation currently faces key barriers.
I compiled a list of resources for understanding AI copyright challenges (US-centric). 📚
➡️ why is copyright an issue?
➡️ what is fair use?
➡️ why are memorization and generation important?
➡️ how does it impact the AI data supply / web crawling?
🧵
1️⃣ The International AI Safety Report 2025 — @Yoshua_Bengio, @privitera_, et al. — This report spans 100s of carefully curated citations from independent experts.
I co-wrote the Risks of Copyright section, and recommend it as a general starting point.
2️⃣ Foundation Models and Fair Use — @PeterHndrsn @lxuechen — This foundational paper examines the United States “fair use doctrine” in the context of generative AI models.
Peter also regularly tweets updates on ongoing lawsuits.
I wrote a spicy piece on "AI crawler wars"🐞 in @MIT @techreview (my first op-ed)!
While we’re busy watching copyright lawsuits & the EU AI Act, there’s a quieter battle over data access that affects websites, everyday users, and the open web.
Crawlers are essential to our online ecosystem: they power search, price comparisons, news aggregation, security, accessibility, journalism, and research.
Think of them as a delicate biodiversity now threatened by a new “invasive species”: general-purpose AI with an insatiable appetite for web data.
2/
Publishers are understandably worried: news sites fear losing readers to AI chatbots; artists and designers fear AI image generators; coding forums fear AI-driven replacements.
Increasingly, they block or charge all non-human traffic, not just AI crawlers.
✨New Preprint ✨ How are shifting norms on the web impacting AI?
We find:
📉 A rapid decline in the consenting data commons (the web)
⚖️ Unequal access to data across companies, due to crawling restrictions (e.g. 🔻26% for OpenAI, 🔻13% for Anthropic)
⛔️ Robots.txt preference protocols are ineffective (see the sketch below)
These precipitous changes will impact the availability and scaling laws for AI data, affecting corporate developers, but also non-profit and academic research.
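To see why robots.txt is only a preference signal: the file states per-crawler rules, but compliance is voluntary. A minimal sketch with Python's stdlib parser (the sample rules are illustrative; GPTBot and CCBot are real AI crawler user-agents):

```python
# Sketch: how robots.txt preferences are expressed and checked.
# Compliance is voluntary: a crawler that never calls can_fetch()
# is not stopped by anything here.
import urllib.robotparser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in ["GPTBot", "CCBot", "SearchBot"]:
    print(agent, "may fetch /article:",
          parser.can_fetch(agent, "https://example.com/article"))
```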