Saulnier Lucile
ML @ Hugging Face | ENS Paris-Saclay (MVA) | Centrale Paris
Nov 28, 2022 • 23 tweets • 10 min read
Wondering how one can create a dataset of several TB of text data to train a language model?📚

With @BigscienceW, we have been through this exercise and shared everything in our #NeurIPS2022 paper "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset"

🧵

🌸 What is ROOTS?

The Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus is a 1.6TB corpus including 46 natural languages and 13 programming languages 🌟
Sep 30, 2021 • 6 tweets • 6 min read
🎉 So proud to announce that our "Distributed Deep Learning in Open Collaborations" paper has been accepted at #NeurIPS2021! 🎊

Blog post:…
Arxiv paper:
Teaser: via @YouTube

#researchpaper #MachineLearning

🔐🌐 This work proposes DeDLOC: a method to collaboratively train large neural networks with diverse and scattered computing resources

Among the experiments conducted, we trained SahajBERT, a Bengali language model, with 40 volunteers; it is competitive with SOTA models 😱