Shayne Longpre
May 22 · 17 tweets · 9 min read
#NewPaperAlert When and where does pretraining (PT) data matter?

We conduct the largest published PT data study, varying:
1⃣ Corpus age
2⃣ Quality/toxicity filters
3⃣ Domain composition

We have several recs for model creators…
📜: bit.ly/3WxsxyY

1/ 🧵
First, PT data selection is mired in mysticism.

1⃣ Documentation Debt: #PALM2 & #GPT4 don't document their data
2⃣ PT is expensive ➡️ experiments are sparse
3⃣ So public data choices are largely guided by ⚡️intuition, rumors, and partial info⚡️

2/
PT is the foundation of modern, data-centric LMs. This research was expensive, but we believe it is important for shedding light on open questions in training data design.

Here are our main findings:

3/
🌟Finding 1 – Corpus age matters 🌟

➡️ Mismatched PT and evaluation years lead to 🔻performance – and finetuning doesn't overcome it!

➡️ Size matters: this effect is larger for XL models than for Small ones

➡️ This phenomenon complicates NLP evaluations comparing new and old models.

4/
🌟Finding 2 – Qual/Tox Filter Trade-Offs 🌟

➡️ Quality filters trade off: they boost performance, but also increase toxic generation.

➡️ Toxicity filters impose the opposite trade-off: 🔻perf and 🔻toxic gen

5/
🌟Finding 3 – Inverse Toxicity Filters 🌟

Surprisingly, *inverse* toxicity filters (removing the least toxic content) improve toxicity identification tasks.

(Also improves QA in books, academic, and common sense domains.)

6/
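The filtering setups above can be sketched in a few lines. This is a minimal illustration, not our actual pipeline: `quality_score` is a toy lexical-diversity proxy standing in for a real quality or toxicity classifier, and the threshold is arbitrary.

```python
# Sketch: standard vs. inverse filtering of pretraining documents.
# A standard filter keeps documents ABOVE a classifier threshold;
# an "inverse" filter keeps the ones BELOW it (analogous to the
# inverse toxicity filter, which removes the *least* toxic content).

def filter_corpus(docs, score_fn, threshold, keep="above"):
    """Keep documents whose score is above (standard) or below (inverse) the threshold."""
    if keep == "above":
        return [d for d in docs if score_fn(d) >= threshold]
    return [d for d in docs if score_fn(d) < threshold]

def quality_score(doc):
    """Toy scorer: unique-token ratio as a crude quality proxy."""
    tokens = doc.split()
    return len(set(tokens)) / max(len(tokens), 1)

docs = ["the the the the", "a varied and informative sentence", "cats cats cats"]

# Standard quality filter keeps the high-scoring document…
kept = filter_corpus(docs, quality_score, 0.9, keep="above")

# …while the inverse filter keeps everything the standard filter removed.
inverse_kept = filter_corpus(docs, quality_score, 0.9, keep="below")
```

In the paper the scorer is a trained classifier rather than a heuristic, but the keep-above vs. keep-below distinction is the same.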
🌟Finding 5 – Filter effects are unpredictable from text characteristics 🌟

E.g. the quality classifier ranks Books as the highest-quality domain, but evals on Books were NOT helped by quality filtering.

And “low-quality” domains (e.g. biomedical) benefited most from quality filters.

Why?

7/
We believe relevant/beneficial training text isn’t always at the extremes of a narrowly defined “quality” spectrum.

➡️ Future work: More nuanced & multidimensional measures of quality could lead to much stronger results.

8/
🌟Finding 6 – One size filter does not fit all 🌟

Our results suggest one filter type is not best for all situations.

9/
🌟Finding 7 – Domain composition effects 🌟

➡️ Web and books sources are most beneficial, emphasizing data heterogeneity (web) and quality (books)

➡️ For generalization, train on all data sources!

10/
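The "train on all data sources" recommendation amounts to sampling batches from a weighted mixture of domains. A minimal sketch, with made-up domain names and weights (not the paper's actual mixture):

```python
# Sketch: sampling pretraining examples from a weighted domain mixture.
import random

def sample_domain(weights, rng):
    """Draw one source domain according to mixture weights."""
    domains = list(weights)
    probs = [weights[d] for d in domains]
    return rng.choices(domains, weights=probs, k=1)[0]

# Illustrative mixture only: heterogeneous web dominates, but every
# source keeps nonzero weight so no domain is excluded.
mixture = {"web": 0.5, "books": 0.2, "academic": 0.15, "code": 0.15}

rng = random.Random(0)  # seeded for reproducibility
draws = [sample_domain(mixture, rng) for _ in range(1000)]
```

Every domain appears in the stream, with web and books (the most beneficial sources in our results) drawn most often.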
We tie these findings back to a detailed breakdown of C4 and the Pile’s characteristics.

Check out the paper for more details: 📜 bit.ly/3WxsxyY 📜

11/
🌟 Limitations 🌟

➡️ These ablations are computationally costly, but we believe they are justified: they can keep model creators from repeating each other’s (undocumented) mistakes.

➡️ This is an early preprint (not yet peer reviewed), so we welcome and hope for community feedback!

12/
See also @sangmichaelxie's recent work on stronger methods for balancing Pile domains:

14/
Finally, thank you for reading!

This has been my favourite long-term project to date because of the irreplaceable and incredibly supportive core collaborators @daphneipp @emilyrreif @katherine1ee @gyauney and @dmimno.

15/
Also thank you to @ada_rob @JasonWei @barret_zoph @denny_zhou @krob for their critical guidance and support, as well as @MaartenBosma, @noahconst, Noah Fiedel, @dsmilkov & @jacobandreas for their constructive feedback and early guidance.

🧵/🧵
Wrong Jason, my bad! I meant @_jasonwei / @agikoala


More from @ShayneRedford

May 24
This semester my @CCCatMIT co-instructors and I taught #MIT's first post-#ChatGPT Generative AI course, covering:

➡️Uses and new abilities
➡️LM Evaluation
➡️AI-mediated communication
➡️Societal challenges

📜 Syllabus + reading list 📚: ai4comm.media.mit.edu

1/
It was a 🎢wild journey to teach in the midst of GPT-4 + Bard launches, moratorium letters, and raging online controversies every d*mn day.

We're excited to release our (and our students') learnings, slides, and the talks from our guest speakers.

Stay tuned!

2/
Over the next few days we'll post talks/talk summaries from:

➡️ @RishiBommasani guest lecture on Holistic Evaluation of Language Models

📜: crfm.stanford.edu/helm/latest/

3/
Mar 28
What dates📅 can @OpenAI, @AnthropicAI, @CohereAI models reliably answer questions for?🔭

I binary-search through "future" Wiki events to find out. Results ≠ documentation:

#GPT4 ➡️~Dec 19 ('21)
#ChatGPT ➡️~Oct 24
Claude v1.2➡️~Oct 10
Cohere XL Nightly➡️~Apr 24 ('22)

1/🧵
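The probing procedure above can be sketched as a binary search over candidate dates. The `oracle` below is hypothetical: it stands in for querying the model about an unambiguous, dated Wikipedia event and checking whether it answers correctly.

```python
# Sketch: binary-search for the last date a model can answer about.
# Assumes the model's knowledge is a prefix in time: it answers
# correctly up to some cutoff and abstains/hallucinates afterward.
import datetime

def find_cutoff(start, end, knows_events_on):
    """Binary-search days in [start, end] for the last known date."""
    lo, hi = 0, (end - start).days
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if knows_events_on(start + datetime.timedelta(days=mid)):
            lo = mid       # still answers correctly: look later
        else:
            hi = mid - 1   # abstains or hallucinates: look earlier
    return start + datetime.timedelta(days=lo)

# Toy oracle: pretend the model knows events through 2021-12-19,
# as we observed for GPT-4.
true_cutoff = datetime.date(2021, 12, 19)
oracle = lambda day: day <= true_cutoff

found = find_cutoff(datetime.date(2021, 1, 1), datetime.date(2022, 6, 30), oracle)
```

Each probe costs one model query, so narrowing an 18-month window to a single day takes only ~9–10 queries.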
GPT4 says it is trained up to Sept 2021.

I found it correctly answers unknowable events in Oct, Nov, and even Dec 11th & 19th.

In late Dec it begins to abstain.

2/
Interestingly, GPT 3.5 "Default" answers correctly only until ~Oct 24, 2021, but GPT 3.5 "Legacy" answers correctly until ~Oct 31, 2021 then begins hallucinating false answers or abstaining in Nov.

Perhaps this is due to finetuning rather than pretraining data?

3/
Feb 27
🔭 A 🧵 on @OpenAI LLM "Alignment" (e.g. #ChatGPT)

Q: How does this differ from publicly available "Instruction Tuning" (IT)?

A: Proprietary Alignment is actually 3 separate components:

1⃣ Instruction tuning
2⃣ ➕ Open-ended generation/creative prompts
3⃣ ➕ Human feedback

1/
Component 1⃣:

Instruction Tuning, in its simplest form, teaches the model to follow/answer instructions, instead of generating plausible continuations.

E.g. see @GoogleAI's Flan Collection: arxiv.org/abs/2301.13688

2/
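Concretely, instruction tuning reformats plain supervised examples into natural-language instructions. A minimal sketch in the spirit of Flan-style templates — the template wording and the NLI example are made up for illustration, not taken from the Flan Collection:

```python
# Sketch: wrapping an NLI example in an instruction template, so the
# model learns to answer the instruction instead of continuing text.

def to_instruction_example(premise, hypothesis, label):
    """Turn a (premise, hypothesis, label) triple into an input/target pair."""
    prompt = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? Answer yes or no."
    )
    return {"input": prompt, "target": label}

ex = to_instruction_example(
    "A dog is running in the park.",
    "An animal is outside.",
    "yes",
)
```

Collections like Flan apply many such templates per task, so the model sees each dataset phrased in varied instruction formats.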
Public Instruction Tuning collections are made up of 95%+:
➡️ academic,
➡️ short-answer,
➡️ traditional
NLP tasks. This is a limitation.

3/
Feb 1
✨New Paper✨What’s the best completely public competitor to #ChatGPT?

Flan-T5 beats all public models we tested:
Flan-T5 3B ▶️ T0++ 3B ▶️ OPT-IML 175B ▶️ GLM-130B ▶️ Flan 2021 3B ▶️ NIv2 3B

We release the @GoogleAI 🌟Flan Collection🌟data + methods for Instruction Tuning!

1/
The 🌟Flan Collection🌟 (1st used in Flan-PaLM bit.ly/3Zu7bU2):

➕ Merges Flan 2021, P3, NIv2, CoT instruction-datasets into 1800+ dataset collection
➕ Data augmentations and mixing strategies
➕ 100s new templates

2/
This yields the best-performing instruction tuning collection compiled and released in a single repo.

See our survey Figure of the prior works we built on to produce this compilation.

3/
Oct 6, 2022
📢 A 🧵 on the Trends in NLP Datasets.

What’s changed since SQuAD was all the rage in 2016? A: A LOT. 🔭

1. Generic ➡️ Niche Tasks
2. Task-specific Training+Eval ➡️ Eval Only
3. Dataset ➡️ Benchmark ➡️ Massive Collections
4. Datasets ➡️ Diagnostics

1/
What started as a trickle became an explosion of NLP datasets over the last few years.

@seb_ruder used to track all NLP datasets on his website nlpprogress.com. It’s no longer possible to keep it up to date.

2/
🌟 Trend 1 🌟 Generic datasets are being replaced with more niche datasets.

⏳ Before: datasets released for general tasks.

⌛️ Now: We see tasks targeting hyper-specific abilities.

Exs:

3/
Jun 14, 2022
📢 A 🧵on the future of NLP model inputs.

What are the options and where are we going? 🔭

1. Task-specific finetuning (FT)
2. Zero-shot prompting
3. Few-shot prompting
4. Chain of thought (CoT)
5. Parameter-efficient finetuning (PEFT)
6. Dialog

[1/]
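The input formats in the list above differ only in how the prompt string is constructed. A minimal sketch — the arithmetic task and exemplars are made up for illustration:

```python
# Sketch: zero-shot vs. few-shot vs. chain-of-thought prompt formats
# for the same question. Finetuning and PEFT change the model's
# weights instead, so they have no prompt-format analogue here.

question = "What is 17 + 25?"

# 2. Zero-shot prompting: instruction only, no exemplars.
zero_shot = f"Answer the question.\nQ: {question}\nA:"

# 3. Few-shot prompting: prepend worked input/output exemplars.
few_shot = (
    "Q: What is 2 + 3?\nA: 5\n\n"
    "Q: What is 10 + 4?\nA: 14\n\n"
    f"Q: {question}\nA:"
)

# 4. Chain of thought: exemplars include intermediate reasoning steps.
cot = (
    "Q: What is 2 + 3?\nA: 2 plus 3 is 5. The answer is 5.\n\n"
    f"Q: {question}\nA:"
)
```

Each string ends at "A:" so the model's continuation is the answer; CoT exemplars additionally demonstrate the reasoning the model should emit first.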
🌟Task-specific finetuning 🌟

The traditional way to prepare NLP models for deployment. It usually obtains the best performance for a specific task, but:

(a) it requires many training examples
(b) it (often) specializes a model for ONE task and ONE data input format ONLY

[2/]
Because large language models (LLMs):

(a) are very expensive to train, and
(b) have emergent capabilities to interpret a NEW task from only an instruction,

researchers are experimenting with new strategies to get model predictions…

[3/]
