Shayne Longpre
PhD @MIT. Prev: @Google Brain, @apple ML, @stanfordnlp. 🇨🇦 Interests: AI/ML/NLP, Data-centric AI, transparency & societal impact
Jul 19 12 tweets 5 min read
✨New Preprint ✨ How are shifting norms on the web impacting AI?

We find:

📉 A rapid decline in the consenting data commons (the web)

⚖️ Differing access to data by company, due to crawling restrictions (e.g.🔻26% OpenAI, 🔻13% Anthropic)

⛔️ Robots.txt preference protocols are ineffective

These precipitous changes will impact data availability and scaling for AI, affecting not only corporate developers but also non-profit and academic research.

🔗 dataprovenance.org/consent-in-cri…

1/

General-purpose AI relies on massive data collected by web crawlers.

The Data Provenance Initiative team annotated ~14k of the websites that underlie pretraining datasets, for:

➡️Consent policies: robots.txt, ToS
➡️Monetization: ads, paywalls
➡️Purpose: news, e-commerce, forums, etc

2/
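To make the robots.txt side of these annotations concrete, here is a minimal sketch (not the Initiative's actual pipeline) that checks whether a site's robots.txt disallows common AI crawler user agents, using only Python's standard library; the agent list and domain are illustrative assumptions.

```python
# Minimal sketch: check whether a site's robots.txt disallows common AI
# crawler user agents, using only the standard library. Not the Initiative's
# actual pipeline; the agent list and domain below are illustrative.
from urllib import robotparser

AI_AGENTS = ["GPTBot", "anthropic-ai", "CCBot", "Google-Extended"]

def crawl_permissions(domain: str) -> dict:
    """Return {user_agent: allowed to fetch the site root?} for one domain."""
    rp = robotparser.RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()  # fetch and parse robots.txt
    return {agent: rp.can_fetch(agent, f"https://{domain}/") for agent in AI_AGENTS}

if __name__ == "__main__":
    print(crawl_permissions("example.com"))  # placeholder domain
```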
Mar 5 9 tweets 3 min read
Independent AI research should be valued and protected.

In an open letter signed by over 100 researchers, journalists, and advocates, we explain how AI companies should support it going forward.



sites.mit.edu/ai-safe-harbor/

1/

Researchers & companies agree:

➡️ Generative AI poses a range of risks

➡️ We need independent research participation for safety & accountability

But current AI company policies can chill good-faith, independent testing of generative AI systems (sometimes unintentionally).

2/
Oct 25, 2023 17 tweets 6 min read
📢Announcing the🌟Data Provenance Initiative🌟

🧭A rigorous public audit of 1800+ instruct/align datasets

🔍Explore/filter sources, creators & license conditions

⚠️We see a rising divide between commercially open vs. closed licensed data

🌐: dataprovenance.org

1/

Context: A Crisis in Data Transparency

➡️Instruct/align finetuning often compiles 100s of datasets

➡️How can devs filter for datasets without legal/ethical risk, and understand the resulting data composition?

2/
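As a toy illustration of the kind of license-based filtering the audit enables, here is a hedged sketch; the metadata schema and license strings are assumptions, not the Data Provenance Initiative's actual format.

```python
# Hypothetical sketch of filtering dataset metadata by license condition.
# The schema and license strings are assumptions, not the DPI's actual format.
COMMERCIALLY_OK = {"cc-by-4.0", "apache-2.0", "mit", "cc0-1.0"}

datasets = [
    {"name": "dataset_a", "license": "cc-by-4.0",    "source": "wikipedia"},
    {"name": "dataset_b", "license": "cc-by-nc-4.0", "source": "web forums"},
]

commercial = [d for d in datasets if d["license"].lower() in COMMERCIALLY_OK]
print([d["name"] for d in commercial])  # -> ['dataset_a']
```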
Oct 10, 2023 14 tweets 4 min read
A wave of new work shows how **brittle** "Alignment"/RLHF safety methods are.

⛓️ Prompt jailbreaks are easy
🚂 Finetuning away safety (even #OpenAI API) is simple and likely undetectable
🤖 LLMs can auto-generate their own jailbreaks...

1/ 🧵

It's been repeatedly shown that careful prompt re-wording, roleplaying, and even just insisting can jailbreak the Llama2-Chat/#ChatGPT usage policy (openai.com/policies/usage…).

jailbreakchat.com and @AIPanicLive document many jailbreak / red teaming efforts.

2/
May 24, 2023 10 tweets 10 min read
This semester my @CCCatMIT co-instructors and I taught #MIT's first post-#ChatGPT Generative AI course, covering:

➡️Uses and new abilities
➡️LM Evaluation
➡️AI-mediated communication
➡️Societal challenges

📜 Syllabus + reading list 📚: ai4comm.media.mit.edu

1/

It was a 🎢wild journey to teach in the midst of GPT-4 + Bard launches, moratorium letters, and raging online controversies every d*mn day.

We're excited to release our (and our students') learnings, slides, and the talks from our guest speakers.

Stay tuned!

2/
May 22, 2023 17 tweets 9 min read
#NewPaperAlert When and where does pretraining (PT) data matter?

We conduct the largest published PT data study, varying:
1⃣ Corpus age
2⃣ Quality/toxicity filters
3⃣ Domain composition

We have several recs for model creators…
📜: bit.ly/3WxsxyY

1/ 🧵

First, PT data selection is mired in mysticism.

1⃣ Documentation Debt: #PALM2 & #GPT4 don't document their data
2⃣ PT is expensive ➡️ experiments are sparse
3⃣ So public data choices are largely guided by ⚡️intuition, rumors, and partial info⚡️

2/
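For intuition on what a "quality filter" means here, a minimal heuristic sketch; the thresholds and rules are illustrative assumptions, not the filters actually studied in the paper.

```python
# Illustrative heuristic "quality filter" of the kind varied in the study;
# the thresholds and rules here are assumptions, not the paper's settings.
def passes_quality_filter(doc: str,
                          min_words: int = 50,
                          max_symbol_ratio: float = 0.1) -> bool:
    words = doc.split()
    if len(words) < min_words:  # drop very short documents
        return False
    symbols = sum(ch in "#{}[]<>|" for ch in doc)
    return symbols / max(len(doc), 1) <= max_symbol_ratio  # drop markup-heavy docs

corpus = ["# # # short junk doc", "a longer, cleaner document " * 20]
kept = [d for d in corpus if passes_quality_filter(d)]
```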
Mar 28, 2023 7 tweets 6 min read
What dates📅 can @OpenAI, @AnthropicAI, @CohereAI models reliably answer questions for?🔭

I binary-search through "future" Wiki events to find out. Results ❌🟰❌ the documented cutoffs:

#GPT4 ➡️~Dec 19 ('21)
#ChatGPT ➡️~Oct 24
Claude v1.2➡️~Oct 10
Cohere XL Nightly➡️~Apr 24 ('22)

1/🧵

GPT4 says it is trained up to Sept 2021.

I found it correctly answers questions about supposedly unknowable events from Oct, Nov, and even Dec 11th & 19th 2021.

In late Dec it begins to abstain.

2/
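A rough sketch of the binary-search procedure, assuming a hypothetical `model_answers_correctly` helper that queries the model about an event from a given date and grades the answer (not the exact script used here):

```python
from datetime import date, timedelta

def model_answers_correctly(event_date: date) -> bool:
    """Hypothetical helper: ask the model about an event from this date and grade it."""
    raise NotImplementedError  # e.g. call the model API and check the answer

def find_cutoff(start: date, end: date) -> date:
    """Latest date whose events the model still answers correctly (binary search)."""
    lo, hi = start, end  # assumes: knows events at `start`, fails at `end`
    while (hi - lo) > timedelta(days=1):
        mid = lo + timedelta(days=(hi - lo).days // 2)
        if model_answers_correctly(mid):
            lo = mid  # model knows this date -> effective cutoff is later
        else:
            hi = mid  # model abstains/fails -> effective cutoff is earlier
    return lo

# e.g. find_cutoff(date(2021, 9, 1), date(2022, 6, 1))
```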
Feb 27, 2023 17 tweets 8 min read
🔭 A 🧵 on @OpenAI LLM "Alignment" (e.g. #ChatGPT)

Q: How does this differ from publicly available "Instruction Tuning" (IT)?

A: Proprietary Alignment is actually 3 separate components:

1⃣ Instruction tuning
2⃣ ➕ Open-ended generation/creative prompts
3⃣ ➕ Human feedback

1/

Component 1⃣:

Instruction Tuning, in its simplest form, teaches the model to follow/answer instructions, instead of generating plausible continuations.

E.g. see @GoogleAI's Flan Collection: arxiv.org/abs/2301.13688

2/
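A toy sketch of what that formatting difference looks like in practice; the template wording below is a loose, Flan-style assumption, not an exact template from any collection.

```python
# Toy illustration: instruction tuning trains on (instruction -> answer) pairs
# rather than raw next-token continuations. The template wording is a loose,
# Flan-style assumption, not an exact template from any collection.
example = {"question": "What is the capital of Canada?", "answer": "Ottawa"}

# Plain LM continuation: the model just continues the text.
lm_input = example["question"]

# Instruction-tuned formatting: an explicit instruction plus the expected answer.
instruction_input = (
    "Answer the following question.\n\n"
    f"Question: {example['question']}\nAnswer:"
)
target = example["answer"]
```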
Feb 1, 2023 11 tweets 7 min read
✨New Paper✨What’s the best completely public competitor to #ChatGPT?

Flan-T5 beats all public models we tested:
Flan-T5 3B ▶️ T0++ 3B ▶️ OPT-IML 175B ▶️ GLM-130B ▶️ Flan 2021 3B ▶️ NIv2 3B

We release the @GoogleAI 🌟Flan Collection🌟data + methods for Instruction Tuning!

1/

The 🌟Flan Collection🌟 (1st used in Flan-PaLM: bit.ly/3Zu7bU2):

➕ Merges the Flan 2021, P3, NIv2, and CoT instruction datasets into an 1800+ dataset collection
➕ Data augmentations and mixing strategies
➕ 100s of new templates

2/
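A rough sketch of one such mixing strategy, examples-proportional mixing with a per-dataset cap; the sizes and cap below are made-up illustration values, not the collection's actual settings.

```python
# Rough sketch of examples-proportional mixing with a per-dataset cap, one
# kind of mixing strategy used for multi-task mixtures; the sizes and cap
# below are made-up illustration values.
import random

dataset_sizes = {"flan2021": 15_000, "p3": 12_000, "niv2": 5_000, "cot": 1_000}
CAP = 10_000  # cap so the largest datasets don't dominate the mixture

weights = {name: min(size, CAP) for name, size in dataset_sizes.items()}
total = sum(weights.values())
mixture = {name: w / total for name, w in weights.items()}

def sample_source() -> str:
    """Pick which dataset the next training example is drawn from."""
    return random.choices(list(mixture), weights=list(mixture.values()))[0]
```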
Oct 6, 2022 13 tweets 7 min read
📢 A 🧵 on the Trends in NLP Datasets.

What’s changed since SQuAD was all the rage in 2016? A: A LOT. 🔭

1. Generic ➡️ Niche Tasks
2. Task-specific Training+Eval ➡️ Eval Only
3. Dataset ➡️ Benchmark ➡️ Massive Collections
4. Datasets ➡️ Diagnostics

1/

What started as a trickle became an explosion of NLP datasets over the last few years.

@seb_ruder used to track all NLP datasets on his website, nlpprogress.com. It's no longer possible to keep it up to date.

2/
Jun 14, 2022 16 tweets 9 min read
📢 A 🧵on the future of NLP model inputs.

What are the options and where are we going? 🔭

1. Task-specific finetuning (FT)
2. Zero-shot prompting
3. Few-shot prompting
4. Chain of thought (CoT)
5. Parameter-efficient finetuning (PEFT)
6. Dialog

[1/]

🌟Task-specific finetuning🌟

The traditional way to prepare NLP models for deployment. It usually obtains the best performance for a specific task, but:

(a) it requires many training examples
(b) it (often) specializes a model for ONE task and ONE data input format ONLY

[2/]
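To make options 2–4 from the list above concrete, here is a small sketch of how the same question is posed zero-shot, few-shot, and with chain of thought; the wording and examples are placeholders, not drawn from any benchmark.

```python
# Small sketch of the same question posed under three input styles;
# the wording and examples are placeholders, not from any benchmark.
question = "Roger has 5 balls and buys 2 more. How many balls does he have?"

zero_shot = f"Q: {question}\nA:"

few_shot = (
    "Q: There are 3 cars and 4 more arrive. How many cars are there?\nA: 7\n\n"
    f"Q: {question}\nA:"
)

chain_of_thought = (
    "Q: There are 3 cars and 4 more arrive. How many cars are there?\n"
    "A: Start with 3 cars; 4 more arrive; 3 + 4 = 7. The answer is 7.\n\n"
    f"Q: {question}\nA: Let's think step by step."
)
```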
May 28, 2022 6 tweets 6 min read
Sharing my *rough* slides from a @CCCatMIT February reading group.

Covers "NLP Training Trends for Large Language Models" (LLM) and a survey of 4 new interesting papers: FLAN, T0, ExT5, MetaICL!

📚: bit.ly/3a3SxOj

[1/6]

The 1st paper we discuss is multi-task finetuning in FLAN, by @_jasonwei, @MaartenBosma, et al.

TLDR: Multi-task instruction tuning a 137B model on dozens of tasks vastly improves zero/few-shot learning

📜: arxiv.org/abs/2109.01652 [2/6]