NLP/AI scientist. Faculty at @MBZUAI
Previously @EdinburghNLP PhD, @Amazon Alexa, @Google research, @Apple Siri
Oct 30, 2023 • 7 tweets • 3 min read
Scraped data, such as Wikipedia text, is vital for NLP, but how reliable is it in low-resource settings?
🚀Happy to present our work "NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages" @AACL 2023🇮🇩
arxiv.org/abs/2309.10661
We explore two methods of building corpora for 12 underrepresented Indonesian languages: human translation, and free-form paragraph writing given a theme.
We then compare their quality vs Wikipedia text.
Apr 25, 2023 • 5 tweets • 3 min read
Introducing LaMini-LM🦙, a diverse set of 15 (more coming) mini-sized models (up to 1.5B parameters) distilled from 2.6M instructions, comparable in performance to Alpaca-7B on downstream NLP tasks and in human evaluation.
Models + data are available strictly for research use: github.com/mbzuai-nlp/LaM…
We created a new large-scale instruction dataset of 2.58 million instructions by combining both downstream NLP tasks and general instructions. We then generated responses using ChatGPT.
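As an illustration of the kind of record such an instruction dataset holds, here is a minimal sketch in Python. The field names and example entries are my assumption for illustration, not the paper's exact schema:

```python
import json

# Hypothetical instruction-dataset records: a mix of downstream-NLP-task
# prompts and general instructions, each paired with a generated response.
records = [
    {"instruction": "Classify the sentiment of: 'Great movie!'",
     "response": "positive"},
    {"instruction": "Give three tips for staying healthy.",
     "response": "1. Eat well. 2. Sleep enough. 3. Exercise regularly."},
]

# Serialize to JSON Lines, a common on-disk format for instruction data.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl.splitlines()[0])
```

Mixing task-style and free-form instructions like this is what gives the distilled models coverage of both benchmark NLP tasks and open-ended requests.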
Dec 5, 2022 • 13 tweets • 4 min read
ChatGPT is all the rage right now.
ChatGPT is an AI based on a "Language Model" (abbreviated LM).
How do these models actually work?
I'll try to explain in lay terms 🧵 (1/11)
There are many kinds of LMs, but the ones widely used lately really have only one job: completing sentences.
The LM is given a snippet of text, and it continues it piece-by-piece. Simply put, an LM is "just" text auto-complete.
(2/11)
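The "text auto-complete" idea above can be sketched with a toy next-word predictor. This is a hypothetical bigram counter for illustration only, nothing like ChatGPT's actual neural architecture:

```python
from collections import Counter, defaultdict

# Toy "language model": count which word follows which in a tiny corpus,
# then complete a prompt by repeatedly picking the most frequent next word.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def complete(prompt, steps=4):
    words = prompt.split()
    for _ in range(steps):
        candidates = next_counts.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])  # greedy next-word pick
    return " ".join(words)

print(complete("the cat"))
```

Real LMs replace the word-pair counts with a neural network over billions of texts, and predict sub-word pieces rather than whole words, but the loop is the same: look at the text so far, append the next piece, repeat.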