🔍 “Openly licensed” = free for anyone to use, modify, and share for any purpose, per the Open Definition from Open Knowledge (opendefinition.org)
🔧 Every cleaning + processing step is open-sourced so anyone can reproduce or build on it.
2/
Mar 13 • 8 tweets • 4 min read
What are 3 concrete steps that can improve AI safety in 2025? 🤖⚠️
Our new paper, “In-House Evaluation Is Not Enough”, has 3 calls-to-action to empower independent evaluators:
1️⃣ Standardized AI flaw reports (sketch below)
2️⃣ AI flaw disclosure programs + safe harbors.
3️⃣ A coordination center for transferable AI flaws affecting many systems.
1/🧵
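To make 1️⃣ concrete, here is a minimal sketch of what a standardized, machine-readable flaw report could contain (field names are illustrative assumptions, not the schema from the paper):

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative fields only; a real standard would be agreed on by the community.
@dataclass
class AIFlawReport:
    reporter: str                  # who found the flaw (can be pseudonymous)
    system: str                    # affected model or product
    version: str                   # model/version identifier at time of discovery
    description: str               # what the flaw is and why it matters
    reproduction_steps: List[str]  # prompts/settings needed to reproduce it
    severity: str                  # e.g. "low", "medium", "high"
    transferable: bool = False     # might the flaw affect other systems too?
    affected_systems: List[str] = field(default_factory=list)

report = AIFlawReport(
    reporter="independent-evaluator",
    system="example-assistant",
    version="2025-03-01",
    description="Safety policy bypass under a specific multi-turn setup",
    reproduction_steps=["step 1 (redacted)", "step 2 (redacted)"],
    severity="high",
    transferable=True,
    affected_systems=["other-chat-assistants"],
)
print(report.system, report.severity)
```

A shared format like this is what would let a coordination center (3️⃣) route transferable flaws to every affected developer.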
🌟Motivation🌟
Today, general-purpose AI (GPAI) serves 300M+ users globally, w/ diverse & unforeseen uses across modalities and languages.
➡️ We need third-party evaluation for its broad expertise, participation, and independence, including from real users, academic researchers, white-hat hackers, and journalists.
2/
Feb 12 • 6 tweets • 2 min read
I wrote a spicy piece on "AI crawler wars"🐞 in @MIT @techreview (my first op-ed)!
While we’re busy watching copyright lawsuits & the EU AI Act, there’s a quieter battle over data access that affects websites, everyday users, and the open web.
🔗 technologyreview.com/2025/02/11/111…
1/
Crawlers are essential to our online ecosystem: they power search, price comparisons, news aggregation, security, accessibility, journalism, and research.
Think of them as a delicate biodiversity now threatened by a new “invasive species”: general-purpose AI with an insatiable appetite for web data.
2/
Jul 19, 2024 • 12 tweets • 5 min read
✨New Preprint ✨ How are shifting norms on the web impacting AI?
We find:
📉 A rapid decline in the consenting data commons (the web)
⚖️ Differing access to data by company, due to crawling restrictions (e.g. 🔻26% for OpenAI, 🔻13% for Anthropic)
⛔️ Robots.txt preference protocols are ineffective (see the sketch below)
These precipitous changes will impact the availability and scaling of AI training data, affecting not only corporate developers but also non-profit and academic research.
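For context on why “ineffective” matters here: a minimal Python sketch (hypothetical site; GPTBot and CCBot are real crawler user agents) showing how a crawler checks robots.txt. The check happens entirely on the crawler’s side, so compliance is voluntary, and the signal can’t distinguish uses a site might welcome (search, research) from ones it doesn’t (AI training).

```python
import urllib.robotparser

# Hypothetical site; robots.txt is only a preference signal that the
# crawler itself chooses to fetch and honor.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for agent in ["GPTBot", "CCBot", "MyResearchCrawler"]:
    ok = rp.can_fetch(agent, "https://example.com/article.html")
    print(f"{agent}: {'allowed' if ok else 'disallowed'}")
```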
➡️Instruct/align finetuning often compiles 100s of datasets
➡️How can devs filter for datasets without legal/ethical risk, and understand the resulting data composition?
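As a rough sketch of that filtering step (dataset names, licenses, and fields are hypothetical, not from any real collection), a developer might screen a candidate mix by license metadata and then inspect what remains:

```python
# Hypothetical dataset records; real finetuning mixes compile hundreds of
# datasets with heterogeneous, often missing license metadata.
datasets = [
    {"name": "qa_corpus",      "license": "cc-by-4.0",   "source": "web"},
    {"name": "chat_logs",      "license": "unknown",     "source": "scrape"},
    {"name": "code_snippets",  "license": "mit",         "source": "github"},
    {"name": "news_summaries", "license": "proprietary", "source": "news"},
]

PERMISSIVE = {"cc-by-4.0", "cc0", "mit", "apache-2.0"}

# Keep only datasets with clearly permissive licenses, then report composition.
kept = [d for d in datasets if d["license"] in PERMISSIVE]
by_source = {}
for d in kept:
    by_source[d["source"]] = by_source.get(d["source"], 0) + 1

print(f"kept {len(kept)}/{len(datasets)} datasets; composition by source: {by_source}")
```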
2/
Oct 10, 2023 • 14 tweets • 4 min read
A wave of new work shows how **brittle** "Alignment"/RLHF safety methods are.
⛓️ Prompt jailbreaks are easy
🚂 Finetuning away safety (even via the #OpenAI API) is simple and likely undetectable
🤖 LLMs can auto-generate their own jailbreaks...
1/ 🧵
It's been repeatedly shown that careful prompt re-wording, roleplaying, and even just insisting can jailbreak Llama2-Chat / #ChatGPT past their usage policies.
@AIPanicLive and others document many jailbreak / red-teaming efforts
1/
It was a 🎢wild journey to teach in the midst of GPT-4 + Bard launches, moratorium letters, and raging online controversies every d*mn day.
We're excited to release our (and our students') learnings, slides, and the talks from our guest speakers.
Stay tuned!
2/
May 22, 2023 • 17 tweets • 9 min read
#NewPaperAlert When and where does pretraining (PT) data matter?
We conduct the largest published PT data study (ablation grid sketched below), varying:
1⃣ Corpus age
2⃣ Quality/toxicity filters
3⃣ Domain composition
We have several recs for model creators…
📜: bit.ly/3WxsxyY
1/ 🧵
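A rough sketch of the kind of ablation grid this implies (axis values are hypothetical placeholders, not the paper’s actual settings):

```python
from itertools import product

# Hypothetical settings for the three axes the study varies.
corpus_ages = ["older_snapshot", "recent_snapshot"]             # 1) corpus age
filters = ["none", "quality", "toxicity", "quality+toxicity"]   # 2) filters
domain_mixes = ["more_web", "more_books", "more_code"]          # 3) domain composition

configs = [
    {"corpus_age": a, "filtering": f, "domain_mix": d}
    for a, f, d in product(corpus_ages, filters, domain_mixes)
]
print(f"{len(configs)} pretraining-data configurations to pretrain and evaluate")
```

Each cell means pretraining a model on that data variant and evaluating it downstream, which is exactly why such studies are rare and expensive.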
First, PT data selection is mired in mysticism.
1⃣ Documentation Debt: #PALM2 & #GPT4 don't document their data
2⃣ PT is expensive ➡️ experiments are sparse
3⃣ So public data choices are largely guided by ⚡️intuition, rumors, and partial info⚡️