Alham Fikri Aji
NLP/AI scientist. Faculty at @MBZUAI. Previously @EdinburghNLP PhD, @Amazon Alexa, @Google research, @Apple Siri.
Oct 30, 2023 7 tweets 3 min read
Scraped data, such as text from Wikipedia, is vital for NLP, but how reliable is it in low-resource settings?

🚀Happy to present our work "NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages" @AACL 2023🇮🇩

arxiv.org/abs/2309.10661
We explore 2 methods of building a corpus for 12 underrepresented Indonesian languages: human translation, and free-form paragraph writing on a given theme.

We then compare their quality against Wikipedia text.
Apr 25, 2023 5 tweets 3 min read
Introducing LaMini-LM🦙, a diverse set of 15 (more coming) mini-sized models (up to 1.5B) distilled from 2.6M instructions, comparable in performance to Alpaca-7B in downstream NLP + human eval.

Models + data are available strictly for research use: github.com/mbzuai-nlp/LaM…

We created a new large-scale instruction dataset of 2.58 million instructions by combining both downstream NLP tasks and general instructions. We then generated responses using ChatGPT.
Dec 5, 2022 13 tweets 4 min read
ChatGPT is all the rage right now.

ChatGPT is an AI built on a “Language Model” (LM for short).

So how do these models actually work?
I'll try to explain it in layman's terms 🧵 (1/11)

There are many kinds of LMs, but the ones in heavy use lately really have just one job: completing sentences.

The LM is given a piece of text, and it then continues that text piece by piece. Put simply, this LM is “just” text auto-complete.
(2/11)
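
To make the auto-complete idea concrete, here is a minimal toy sketch in Python. It is not how ChatGPT is built: it only counts which word tends to follow which in a tiny made-up corpus, then extends a prompt one word at a time by picking the most frequent follower. The names `train_bigram` and `complete`, and the corpus itself, are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows


def complete(follows, prompt, max_new_words=5):
    """Continue the prompt word by word, always taking the most frequent follower."""
    words = prompt.split()
    for _ in range(max_new_words):
        # Stop if the prompt is empty or the last word was never seen in training.
        candidates = follows.get(words[-1]) if words else None
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical toy corpus, purely for illustration.
corpus = "the cat sat on the mat . the cat sat on the sofa ."
model = train_bigram(corpus)
print(complete(model, "the cat", max_new_words=3))  # prints "the cat sat on the"
```

Real LMs replace the word counts with a neural network and work on sub-word pieces rather than whole words, but the loop is the same shape: look at the text so far, pick a next piece, append it, repeat.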