#NewPaperAlert When and where does pretraining (PT) data matter?
We conduct the largest published PT data study, varying:
1⃣ Corpus age
2⃣ Quality/toxicity filters
3⃣ Domain composition
We have several recs for model creators… (ablation grid sketched below)
📜: bit.ly/3WxsxyY
1/ 🧵
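For concreteness, here's a minimal sketch of the kind of ablation grid the three axes above imply. All axis values here are illustrative placeholders I chose, not the paper's actual settings:

```python
from itertools import product

# Hypothetical ablation axes mirroring the three factors in the thread;
# the concrete values are placeholders, not the paper's settings.
corpus_ages = ["2016_snapshot", "2019_snapshot", "2022_snapshot"]
filters = ["none", "quality_only", "toxicity_only", "quality+toxicity"]
domain_mixes = ["web_heavy", "books_heavy", "code_heavy", "balanced"]

# One pretraining run per combination of the three axes.
configs = [
    {"corpus_age": a, "filter": f, "domain_mix": d}
    for a, f, d in product(corpus_ages, filters, domain_mixes)
]
print(f"{len(configs)} pretraining configurations in the full grid")
```

Each config corresponds to one pretrained model, which is part of why such studies are rare: the grid multiplies quickly.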
First, PT data selection is mired in mysticism.
1⃣ Documentation Debt: #PaLM2 & #GPT4 don't document their pretraining data
2⃣ PT is expensive ➡️ experiments are sparse
3⃣ So public data choices are largely guided by ⚡️intuition, rumors, and partial info⚡️
2/
PT data is the foundation of data-centric AI and modern LMs. These experiments were expensive to run, but important for shedding light on open questions in training data design.
As one example of why corpus age matters: I found GPT 3.5 correctly answers questions about events from Oct, Nov, and even Dec 11th & 19th of 2021, all past its stated training cutoff.
In late Dec 2021 it begins to abstain.
3/
Interestingly, GPT 3.5 "Default" answers correctly only until ~Oct 24, 2021, while GPT 3.5 "Legacy" answers correctly until ~Oct 31, 2021, then begins hallucinating false answers or abstaining in Nov.
Perhaps this is due to finetuning data rather than pretraining data? (Probing recipe sketched below.)
4/
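A minimal sketch of this cutoff-probing recipe, assuming the OpenAI Python client (v1). The dated events are illustrative probes I picked, not the exact questions used in the thread:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative probes: real late-2021 events, ordered by date, that a
# model with a true September 2021 cutoff should not be able to answer.
dated_probes = [
    ("2021-10-31", "Which party won Japan's October 2021 general election?"),
    ("2021-11-29", "Which company's CEO role did Jack Dorsey resign from in November 2021?"),
    ("2021-12-12", "Who won the 2021 Formula 1 drivers' championship?"),
]

for date, question in dated_probes:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the "Default"/"Legacy" variants probed above
        messages=[{"role": "user", "content": question}],
    )
    print(date, "->", resp.choices[0].message.content)

# Reading the results: a correct answer suggests the event appears in the
# model's training data; hallucination or abstention suggests it falls
# after the model's effective cutoff.
```

Sweeping probes week by week, rather than month by month, is what localizes the effective cutoff to dates like ~Oct 24 vs. ~Oct 31.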