#NewPaperAlert When and where does pretraining (PT) data matter?
We conduct the largest published PT data study, varying:
1⃣ Corpus age
2⃣ Quality/toxicity filters
3⃣ Domain composition
We have several recs for model creators…
📜: bit.ly/3WxsxyY
1/ 🧵
First, PT data selection is mired in mysticism.
1⃣ Documentation Debt: #PALM2 & #GPT4 don't document their data
2⃣ PT is expensive ➡️ experiments are sparse
3⃣ So public data choices are largely guided by ⚡️intuition, rumors, and partial info⚡️
2/
PT data is the foundation of modern, data-centric LMs. This research was expensive to run, but important for shedding light on open questions in training data design.
Here are our main findings:
3/
🌟Finding 1 – Corpus age matters 🌟
➡️ A mismatch between PT corpus year and eval year leads to 🔻performance – and it isn’t overcome by finetuning! (Measurement sketch below 👇)
➡️ Size matters: this effect is larger for XL than Small models
➡️ This phenomenon complicates NLP evaluations comparing new and old models.
4/
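A minimal sketch of how one could quantify this effect, assuming you have accuracy for each (PT-corpus year, eval-set year) pair; it buckets scores by the year gap so the drop becomes visible. The numbers are toy placeholders, not results from the paper.

```python
from collections import defaultdict

def temporal_degradation(scores: dict) -> dict:
    """scores[(pt_year, eval_year)] = task accuracy.
    Returns mean accuracy bucketed by |pt_year - eval_year|,
    so the drop per year of PT/eval mismatch becomes visible."""
    by_gap = defaultdict(list)
    for (pt_year, eval_year), acc in scores.items():
        by_gap[abs(pt_year - eval_year)].append(acc)
    return {gap: sum(accs) / len(accs) for gap, accs in sorted(by_gap.items())}

# Toy placeholder numbers (NOT results from the paper):
toy = {(2016, 2016): 0.72, (2016, 2022): 0.66,
       (2022, 2016): 0.69, (2022, 2022): 0.74}
print(temporal_degradation(toy))  # ≈ {0: 0.73, 6: 0.675}
```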
🌟Finding 2 – Qual/Tox Filter Trade-Offs 🌟
➡️ Quality filters impose a trade-off: they boost performance, but also increase toxic generation.
➡️ Toxicity filters impose the opposite trade-off: 🔻perf and 🔻toxic gen (sketch of both filter types below 👇)
5/
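To make the two filter types concrete, a minimal sketch, assuming per-document scores from a quality classifier and a toxicity classifier of your choice (the `Doc` fields and thresholds below are illustrative, not the paper's setup):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    quality: float   # higher = rated higher-quality by your quality classifier
    toxicity: float  # higher = rated more toxic by your toxicity classifier

def quality_filter(docs: list, min_quality: float = 0.5) -> list:
    """Keep documents above a quality threshold.
    Per Finding 2: tends to boost performance, but also toxic generation."""
    return [d for d in docs if d.quality >= min_quality]

def toxicity_filter(docs: list, max_toxicity: float = 0.3) -> list:
    """Drop documents above a toxicity threshold.
    Per Finding 2: reduces toxic generation, but also performance."""
    return [d for d in docs if d.toxicity <= max_toxicity]
```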
🌟Finding 3 – Inverse Toxicity Filters 🌟
Surprisingly, *inverse* toxicity filters (removing the least toxic content) improve toxicity identification tasks.
(They also improve QA in books, academic, and common-sense domains; sketch of the inverse filter below 👇)
6/
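Continuing the sketch above (same illustrative `Doc` objects), the inverse filter simply flips the comparison and keeps the documents scored as *most* toxic:

```python
def inverse_toxicity_filter(docs: list, min_toxicity: float = 0.7) -> list:
    """Keep only documents ABOVE a toxicity threshold, i.e. remove the
    least toxic content (threshold is illustrative)."""
    return [d for d in docs if d.toxicity >= min_toxicity]
```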
🌟Finding 5 – Filter effects are unpredictable from text characteristics 🌟
E.g., the quality classifier ranks Books as the highest-quality domain, yet evals on Books were NOT helped by quality filtering.
And “low-quality” domains (e.g. biomedical) benefited most from quality filters.
Why?
7/
We believe relevant/beneficial training text isn’t always found at the extremes of a narrowly defined “quality” spectrum.
➡️ Future work: More nuanced & multidimensional measures of quality could lead to much stronger results.
8/
🌟Finding 6 – One size filter does not fit all 🌟
Our results suggest one filter type is not best for all situations.
9/
🌟Finding 7 – Domain composition effects 🌟
➡️ Web and Books sources are the most beneficial, highlighting the value of data heterogeneity (web) and quality (books)
➡️ For generalization, train on all data sources! (Toy mixture sketch below 👇)
10/
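A minimal sketch of what “train on all sources” can look like in practice, assuming a weighted mixture over domains (the domain names and weights below are illustrative, not the paper's composition):

```python
import random

# Illustrative mixture weights (NOT the paper's exact composition):
DOMAIN_WEIGHTS = {"web": 0.50, "books": 0.25, "academic": 0.15, "code": 0.10}

_rng = random.Random(0)  # fixed seed so the sketch is reproducible

def sample_domain() -> str:
    """Pick the source domain for the next training document in
    proportion to the mixture weights."""
    domains, weights = zip(*DOMAIN_WEIGHTS.items())
    return _rng.choices(domains, weights=weights, k=1)[0]

print([sample_domain() for _ in range(5)])
```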
We tie these findings back to a detailed breakdown of C4 and the Pile’s characteristics.
Check out the paper for more details: 📜 bit.ly/3WxsxyY 📜
11/
🌟 Limitations 🌟
➡️ These ablations are computationally costly, but we believe they’re justified: they can keep model creators from repeating each other’s (undocumented) mistakes.
➡️ This is an early preprint (not yet peer reviewed), so we welcome & hope for community feedback!
12/
There’s been a lot of discussion lately on training dataset composition! Some other tweet threads: