Wondering how one can create a dataset of several TB of text data to train a language model?📚
With @BigscienceW, we have been through this exercise and shared everything in our #NeurIPS2022 paper "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset"
🧵
🌸What is ROOTS?
The Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus is a 1.6TB corpus including 46 natural languages and 13 programming languages 🌟
The languages were chosen based on the language expertise of the communities who participated in the effort
🌸Why ROOTS?
This dataset was created during the @BigscienceW - an international and multidisciplinary initiative with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground.
🌸How was ROOTS built?
The final corpus is made up of:
1️⃣ 62% from a community-selected and documented list of language data sources (aka Crowdsourced Datasets)
2️⃣ 38% from a pre-processed web crawl, OSCAR, filtered with the help of native speakers
🌸How were the crowdsourced Datasets sourced?
The various efforts of the BigScience Data Sourcing Working Group have converged on bringing together diverse data sources:
🔍 Identified Datasets and Collections
🕸️ Web pages from domain names
🖥️ GitHub code
Of the sourcing steps, the pseudo-crawl was the one that required the most engineering effort.
🕷️ We retrieved pages corresponding to the target domain names from Common Crawl snapshots in WARC format and then used our custom parser to extract text from HTML.
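The real extraction runs a custom parser over WARC records; as a minimal stdlib-only sketch (the actual BigScience parser is more sophisticated), here is the HTML-to-text step, assuming we simply keep visible text and drop script/style content:

```python
from html.parser import HTMLParser

# Illustrative HTML-to-text extractor, skipping <script>/<style> content.
class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

In the pipeline, a function like this would be applied to the HTML payload of each WARC record retrieved for a target domain.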
🌸How were the Crowdsourced Datasets processed?
The processing pipeline we developed allowed us to apply a set of custom operations to each crowdsourced source (except GitHub), so that each would be as close as possible to our definition of natural language documents.
To help select the operations for each dataset, we developed an app that simulates the effect of an operation on a sample of the dataset by showing:
• estimated dataset level metrics
• samples of documents removed or modified
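The real app is interactive, but the core idea can be sketched in a few lines (illustrative only; the function name, the metrics, and the example operation are made up here): apply a candidate operation to a sample and report estimated dataset-level metrics plus the documents it would remove.

```python
# Sketch: simulate a cleaning operation on a sample of a dataset.
def simulate_operation(sample, keep_fn):
    kept = [doc for doc in sample if keep_fn(doc)]
    removed = [doc for doc in sample if not keep_fn(doc)]
    metrics = {
        # Estimated dataset-level metrics, extrapolated from the sample
        "docs_kept_pct": 100 * len(kept) / len(sample),
        "chars_kept_pct": 100 * sum(map(len, kept)) / sum(map(len, sample)),
    }
    return metrics, removed

# Example operation: drop documents shorter than 20 characters.
sample = ["a tiny doc", "a sufficiently long natural language document"]
metrics, removed = simulate_operation(sample, lambda d: len(d) >= 20)
```

Inspecting `removed` directly is what lets contributors catch operations that are too aggressive before running them on the full source.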
🌸How did we process OSCAR?
First, we kept the OSCAR splits for languages spoken by collaborators.
Then, each split was filtered with a dedicated pipeline to keep only documents whose quality indicators met the bar for web content, and regexes were used to remove personal information
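As a hedged illustration of the regex-based PII step (the actual ROOTS patterns and PII categories differ), here is a sketch that replaces email addresses with a placeholder tag:

```python
import re

# Illustrative PII-removal regex: redact email addresses.
# The real ROOTS patterns cover more categories and are more careful.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text: str) -> str:
    return EMAIL_RE.sub("[EMAIL]", text)
```

A substitution like this runs over every kept document, so the patterns have to balance recall against accidentally mangling legitimate text.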
🌸How was OSCAR filtered?
We defined quality indicators based on character and word repetition, special characters, closed-class words, flagged words, perplexity, and the number of words.
Thanks to a visualization tool, native speakers selected the thresholds to apply according to the language of the webpage.
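To make the threshold idea concrete, here is a minimal sketch (indicator names and threshold values are invented for illustration; the paper's indicators and per-language thresholds differ): compute an indicator per document, then keep documents that clear the language's thresholds.

```python
# Sketch of threshold-based quality filtering on quality indicators.
def char_repetition_ratio(text: str, n: int = 10) -> float:
    """Fraction of character n-grams that are repeats of earlier ones."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    return 1 - len(set(grams)) / len(grams)

# Hypothetical per-language thresholds, as chosen by native speakers
# with the help of the visualization tool.
THRESHOLDS = {"en": {"max_char_rep": 0.2, "min_words": 10}}

def keep_document(text: str, lang: str) -> bool:
    t = THRESHOLDS[lang]
    return (char_repetition_ratio(text) <= t["max_char_rep"]
            and len(text.split()) >= t["min_words"])
```

The key design point is that the thresholds are per-language: a cutoff that works for English web text can be far too strict or too lax for another language, which is why native speakers chose them.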
🌸How was OSCAR deduplicated?
To minimize the number of duplicates, two deduplication algorithms were used:
SimHash (a bag-of-words method) was applied to documents shorter than 6,000 characters
Suffix Array (a substring-sharing method) was applied to longer documents
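A minimal 64-bit SimHash over bag-of-words features can be sketched as follows (the hash function, bit width, and feature choice here are illustrative, not the exact ROOTS configuration):

```python
import hashlib

# Sketch of 64-bit SimHash with bag-of-words features.
def simhash(text: str, bits: int = 64) -> int:
    counts = [0] * bits
    for word in text.lower().split():
        # Hash each word to a 64-bit integer
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        # Each set bit votes +1, each unset bit votes -1
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Fingerprint bit i is 1 iff the vote for bit i is positive
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")
```

The point of SimHash is that near-duplicate documents produce fingerprints at small Hamming distance, so candidate duplicate pairs can be found by comparing 64-bit integers instead of full documents.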
🌸How does ROOTS fit among the corpora used to train large language models?
🌸What is the high-level view of ROOTS?
We calculated and visualized statistics of the corpus:
• on the size of each document
• on the quality indicators of each document
🌸We released
- the numerous data tools developed along the way, which enabled us to curate, source, clean, and inspect all constituents of ROOTS
- a preliminary large subset of ROOTS, gated behind a commitment to the BigScience ethical charter
Angelina McMillan-Major, @ggdupont, @BlancheMinerva, Anna Rogers, @LoubnaBenAllal1 , Francesco De Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, @PierreColombo6, Javier de la Rosa, Paulo Villegas, Tristan Thrush, Shayne Longpre, Sebastian Nagel, Leon Weber
Manuel Romero Muñoz, Jian Zhu, Daniel van Strien, Zaid Alyafeai, Khalid Almubarak, Vu Minh Chien, @ItziarGD, @Aitor57, Kyle Lo, Manan Dey, Pedro Ortiz Suarez, @SkyLi0n, Shamik Bose, David Ifeoluwa Adelani, Long Phan , Hieu Tran, Ian Yu, Suhas Pai, Jenny Chim