#NewPaperAlert When and where does pretraining (PT) data matter?

We conduct the largest published PT data study, varying:
1⃣ Corpus age
2⃣ Quality/toxicity filters
3⃣ Domain composition

We have several recs for model creators…
📜: bit.ly/3WxsxyY

1/ 🧵

First, PT data selection is mired in mysticism.

1⃣ Documentation Debt: #PALM2 & #GPT4 don't document their data
2⃣ PT is expensive ➡️ experiments are sparse
3⃣ So public data choices are largely guided by ⚡️intuition, rumors, and partial info⚡️

2/

PT data is the foundation of modern, data-centric LMs. This research was expensive, but it was important for shedding light on open questions in training data design.

Here are our main findings:

3/
