Wondering how one can create a dataset of several TB of text data to train a language model?📚

With @BigscienceW, we have been through this exercise and shared everything in our #NeurIPS2022 paper "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset"

🧵
🌸What is ROOTS?

The Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus is a 1.6TB dataset spanning 46 natural languages and 13 programming languages 🌟
The languages were chosen based on the language expertise of the communities who participated in the effort.
🌸Why ROOTS?

This dataset was created during the @BigscienceW workshop - an international and multidisciplinary initiative with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground.
🌸How was ROOTS built?

The final corpus is made up of:
1️⃣ 62% from a community-selected and documented list of language data sources (aka the Crowdsourced Datasets)
2️⃣ 38% from a pre-processed web crawl, OSCAR, filtered with the help of native speakers
🌸How were the Crowdsourced Datasets sourced?

The various efforts of the BigScience Data Sourcing Working Group have converged on bringing together diverse data sources:

🔍 Identified Datasets and Collections
🕸️ Web pages from selected domain names (the pseudo-crawl)
🖥️ GitHub code
The pseudo-crawl was the part of sourcing that required the most engineering effort.

🕷️ We retrieved pages corresponding to the target domain names from Common Crawl snapshots in WARC format and then used our custom parser to extract text from HTML.
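If you're curious what that step could look like in practice, here's a minimal sketch (not the actual ROOTS code), assuming the warcio and beautifulsoup4 libraries; the domain list and file name are hypothetical placeholders, and the real parser is more sophisticated:

```python
# Minimal pseudo-crawl sketch: read a Common Crawl WARC file and extract text
# from responses whose URL matches one of the target domain names.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

TARGET_DOMAINS = {"example.org", "example.com"}  # placeholder for the community-selected domains

def extract_texts(warc_path):
    """Yield (url, text) pairs for pages hosted on a target domain."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI") or ""
            if not any(domain in url for domain in TARGET_DOMAINS):
                continue
            html = record.content_stream().read()
            # Strip markup; the actual ROOTS parser preserves more document structure.
            text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
            yield url, text

for url, text in extract_texts("CC-MAIN-sample.warc.gz"):  # hypothetical file name
    print(url, len(text))
```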
🌸How were the Crowdsourced Datasets processed?

The processing pipeline we developed let us apply a set of custom operations to each of the crowdsourced sources (except GitHub), so that each would be as close as possible to our definition of natural language documents.
To help select the operations for each dataset, we developed an app that simulates the effect of an operation on a sample of the dataset (roughly sketched after this list), showing:

• estimated dataset-level metrics
• samples of documents removed or modified
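Very roughly, chaining such operations on a sample and reporting their effect could look like this (hypothetical operations and counting logic, not the actual ROOTS pipeline):

```python
# Hypothetical sketch of applying per-source cleaning operations to a sample
# and reporting dataset-level effects, in the spirit of the visualization app.
def strip_empty_lines(doc):
    return "\n".join(line for line in doc.splitlines() if line.strip())

def drop_short_docs(doc, min_words=10):
    # Returning None marks the document for removal.
    return doc if len(doc.split()) >= min_words else None

def simulate(sample, operations):
    removed, modified, kept = 0, 0, []
    for doc in sample:
        new_doc = doc
        for op in operations:
            new_doc = op(new_doc)
            if new_doc is None:
                break
        if new_doc is None:
            removed += 1
        else:
            if new_doc != doc:
                modified += 1
            kept.append(new_doc)
    return kept, {"removed": removed, "modified": modified, "kept": len(kept)}

docs, stats = simulate(["a short doc", "a much longer document " * 5],
                       [strip_empty_lines, drop_short_docs])
print(stats)  # {'removed': 1, 'modified': 0, 'kept': 1}
```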
🌸How did we process OSCAR?

First, we kept the OSCAR splits for the languages spoken by collaborators.

Then, they were filtered with a dedicated pipeline to keep only documents with sufficiently good quality indicators for web content, and regexes were used to remove personal information.
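For the personal-info step, think of something along these lines (illustrative regexes only, not the exact patterns used for ROOTS):

```python
import re

# Illustrative only: the actual ROOTS pipeline uses different, more careful patterns.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d ().\-]{7,}\d")

def redact_pii(text):
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text

print(redact_pii("Contact me at jane.doe@example.org or +33 6 12 34 56 78"))
```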
🌸How was OSCAR filtered?

We defined quality indicators based on character or word repetition, special characters, closed-class words, flagged words, perplexity, and number of words.
Thanks to a visualization tool, native speakers selected the thresholds to apply according to the language of the webpage.
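Here's a very rough sketch of two such indicators with per-language thresholds (the thresholds and word lists are placeholders, not the values chosen by the native speakers):

```python
from collections import Counter

# Hypothetical per-language thresholds and flagged-word lists (placeholders).
THRESHOLDS = {"en": {"max_word_repetition": 0.30, "max_flagged_ratio": 0.05}}
FLAGGED_WORDS = {"en": {"viagra", "casino"}}

def word_repetition_ratio(words):
    # Share of word occurrences that are repeats of an earlier word.
    return 1 - len(Counter(words)) / len(words) if words else 1.0

def flagged_ratio(words, lang):
    flagged = FLAGGED_WORDS.get(lang, set())
    return sum(w.lower() in flagged for w in words) / len(words) if words else 0.0

def keep_document(text, lang="en"):
    words = text.split()
    t = THRESHOLDS[lang]
    return (word_repetition_ratio(words) <= t["max_word_repetition"]
            and flagged_ratio(words, lang) <= t["max_flagged_ratio"])
```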
🌸How was OSCAR deduplicated?

To minimize the number of duplicates, two deduplication algorithms were used:

SimHash – a bag-of-words method – was applied to documents shorter than 6,000 characters

Suffix Array – a substring-sharing method – was applied to longer documents
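A rough sketch of that length-based routing, with a toy bag-of-words SimHash (the real pipeline relies on dedicated implementations, and the suffix-array pass is only stubbed here):

```python
import hashlib

def simhash(words, bits=64):
    # Toy bag-of-words SimHash fingerprint: near-duplicate documents get
    # fingerprints that differ in only a few bits.
    v = [0] * bits
    for w in words:
        h = int(hashlib.md5(w.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def dedup_route(doc):
    if len(doc) < 6000:
        return "simhash", simhash(doc.split())
    return "suffix_array", None  # substring-level dedup handled by a separate pass

print(dedup_route("a fairly short web document"))
```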
🌸How does ROOTS fit among the corpora used to train large language models?
🌸What is the high-level view of ROOTS?

We have calculated and visualized some statistics of this corpus:
• on the size of each document
• on the quality indicators of each document
🌸We released
- the numerous data tools developed along the way, which enabled us to curate, source, clean, and inspect all constituents of ROOTS
- a preliminary, large subset of ROOTS, gated behind a commitment to the BigScience ethical charter
🌸Some links

Data: hf.co/bigscience-data
Code: github.com/bigscience-wor…
Ethical charter: hf.co/spaces/bigscie…
A big shoutout to all the collaborators of this project: @HugoLaurencon, @thomas_wang21, @christopher, @avillanovamoral, @Fluke_Ellington,
@lvwerra, Chenghao Mou, @EduGPonferrada, Huu Nguyen, Jörg Frohberg, Mario Šaško, @qlhoest,
Angelina McMillan-Major, @ggdupont, @BlancheMinerva, Anna Rogers, @LoubnaBenAllal1, Francesco De Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, @PierreColombo6, Javier de la Rosa, Paulo Villegas, Tristan Thrush, Shayne Longpre, Sebastian Nagel, Leon Weber,
Manuel Romero Muñoz, Jian Zhu, Daniel van Strien, Zaid Alyafeai, Khalid Almubarak, Vu Minh Chien, @ItziarGD, @Aitor57, Kyle Lo, Manan Dey, Pedro Ortiz Suarez, @SkyLi0n, Shamik Bose, David Ifeoluwa Adelani, Long Phan, Hieu Tran, Ian Yu, Suhas Pai, Jenny Chim
