A wave of new work shows how **brittle** "Alignment"/RLHF safety methods are.
⛓️ Prompt jailbreaks are easy
🚂 Finetuning away safety (even via the #OpenAI API) is simple and likely undetectable (sketch below)
🤖 LLMs can auto-generate their own jailbreaks...
1/ 🧵
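To make concrete how low the bar is, here is a minimal sketch of launching a finetuning job against a hosted model, assuming the OpenAI Python SDK (v1.x). The filename and dataset are placeholders; the point from the cited work is that even benign-looking data like this can erode safety behavior.

```python
# Minimal sketch (assumes the OpenAI Python SDK v1.x).
# "finetune_data.jsonl" is a placeholder file of chat-formatted
# {"messages": [...]} training examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a small JSONL file of training examples.
training_file = client.files.create(
    file=open("finetune_data.jsonl", "rb"),
    purpose="fine-tune",
)

# A few hundred examples and a few dollars suffice to start a job;
# nothing here inspects whether the data (or the resulting model)
# undermines the base model's safety training.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```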
It's been repeatedly shown that careful prompt re-wording, roleplaying, and even just insisting can jailbreak Llama2-Chat and #ChatGPT past their usage policies.
Accounts like @AIPanicLive document many jailbreak / red-teaming efforts
Altogether, these important works carry a few implications.
1⃣ Calls to require RLHF on all released models may only offer shallow safety.
2⃣ "Closed" models may be as susceptible as "open" models.
10/
To expand on (2):
➡️ prompt jailbreaks remain trivial
➡️ even implicitly or unintentionally harmful finetuning datasets easily and cheaply break current safety measures
11/
3⃣ We may need to re-prioritize safety mechanisms, and reconsider which kinds of jailbreaks really matter.
E.g. if we are worried about models sharing sensitive weapon-building knowledge, perhaps don't train on that knowledge?
12/
4⃣ Academic research (these works) is driving AI safety understanding immensely.
Proposal: We need continued (un-gatekept) access for academics, without account bans or corporations selectively deciding who gets to do it and in what capacity.
A "safe harbor".
13/
Thank you for reading and please don't hesitate to leave comments if I missed anything, or got something wrong! 🙂
🧵/
✨New Preprint ✨ How are shifting norms on the web impacting AI?
We find:
📉 A rapid decline in the consenting data commons (the web)
⚖️ Differing access to data by company, due to crawling restrictions (e.g.🔻26% OpenAI, 🔻13% Anthropic)
⛔️ Robots.txt preference protocols are ineffective (see the sketch below)
These precipitous changes will impact the availability and scaling laws for AI data, affecting not only corporate developers but also non-profit and academic research.
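For context on how these restrictions are expressed: robots.txt is a purely advisory, per-user-agent preference file. A minimal sketch of checking it with Python's stdlib, the way a compliant crawler would (the domain and page are placeholders; GPTBot, anthropic-ai, and CCBot are real crawler tokens):

```python
# Minimal sketch: checking robots.txt the way a compliant crawler would.
# Nothing enforces the answer; honoring it is voluntary, which is why
# the protocol works only as a preference signal.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

for agent in ["GPTBot", "anthropic-ai", "CCBot", "*"]:
    ok = rp.can_fetch(agent, "https://example.com/some/page")
    print(f"{agent:>14}: {'allowed' if ok else 'disallowed'}")
```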
#NewPaperAlert When and where does pretraining (PT) data matter?
We conduct the largest published PT data study, varying:
1⃣ Corpus age
2⃣ Quality/toxicity filters
3⃣ Domain composition
We have several recs for model creators… (a toy sketch of these filtering knobs follows below)
📜: bit.ly/3WxsxyY
1/ 🧵
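To illustrate the kind of knobs the study varies, here is a toy sketch of corpus filtering. The score fields, thresholds, and document format are assumptions for illustration, not the paper's actual pipeline.

```python
# Toy sketch of pretraining-corpus filtering along the three dimensions
# above. Assumes each document carries precomputed classifier scores
# and metadata; all thresholds are illustrative.
def filter_corpus(docs, min_date="2019-01-01", max_toxicity=0.5,
                  min_quality=0.2, domains=None):
    """Yield documents passing the age, toxicity/quality, and domain filters."""
    for doc in docs:
        if doc["date"] < min_date:              # 1⃣ corpus age
            continue
        if doc["toxicity"] > max_toxicity:      # 2⃣ toxicity filter
            continue
        if doc["quality"] < min_quality:        # 2⃣ quality filter
            continue
        if domains and doc["domain"] not in domains:
            continue                            # 3⃣ domain composition
        yield doc

corpus = [
    {"text": "...", "date": "2022-05-01", "toxicity": 0.1,
     "quality": 0.9, "domain": "web"},
    {"text": "...", "date": "2016-03-01", "toxicity": 0.7,
     "quality": 0.4, "domain": "books"},
]
kept = list(filter_corpus(corpus, domains={"web", "books"}))
print(len(kept), "of", len(corpus), "documents kept")
```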
First, PT data selection is mired in mysticism.
1⃣ Documentation Debt: #PALM2 & #GPT4 don't document their data
2⃣ PT is expensive ➡️ experiments are sparse
3⃣ So public data choices are largely guided by ⚡️intuition, rumors, and partial info⚡️
2/
PT is the foundation of modern, data-centric LMs. This research was expensive, but important for shedding light on open questions in training data design.
I found it correctly answers questions about events that should be beyond its cutoff, in Oct, Nov, and even on Dec 11th & 19th.
In late Dec it begins to abstain.
2/
Interestingly, GPT 3.5 "Default" answers correctly only until ~Oct 24, 2021, but GPT 3.5 "Legacy" answers correctly until ~Oct 31, 2021, then begins hallucinating false answers or abstaining in Nov.
Perhaps this is due to finetuning rather than pretraining data?
3/
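For anyone wanting to reproduce this kind of cutoff probe, here is a minimal sketch assuming the OpenAI Python SDK (v1.x). The questions are placeholders you would fill with events decided on known dates; answers are then graded by hand as correct, hallucinated, or abstained.

```python
# Minimal sketch of probing a model's effective knowledge cutoff:
# ask about events decided on known dates, then note where correct
# answers turn into hallucinations or abstentions.
from openai import OpenAI

client = OpenAI()

dated_questions = [
    ("2021-10-24", "Who won <event decided on 2021-10-24>?"),  # placeholder
    ("2021-10-31", "Who won <event decided on 2021-10-31>?"),  # placeholder
    ("2021-11-15", "Who won <event decided on 2021-11-15>?"),  # placeholder
]

for date, question in dated_questions:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption: any chat model under test
        messages=[{"role": "user", "content": question}],
    )
    # Grade manually: correct vs. hallucination vs. abstention.
    print(date, "->", resp.choices[0].message.content[:80])
```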