We explore two methods of building a corpus for 12 underrepresented Indonesian languages: human translation, and free-form paragraph writing on a given theme.
We then compare their quality against Wikipedia text.
Compared with Wikipedia data, both Nusa Translation (NusaT) and Nusa Paragraph (NusaP) are generally more lexically diverse and use fewer loan words.
We also find that some Wikipedia pages for low-resource languages are mostly boilerplate.
Language models trained on NusaP or NusaT also achieve lower perplexity on unseen test data in the corresponding local languages, written by native speakers.
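As a refresher on what that comparison measures: perplexity is the exponential of the average negative log-probability a model assigns to each token of held-out text. A minimal toy sketch (the probabilities are made-up numbers, not from the paper):

```python
import math

def perplexity(token_probs):
    """token_probs: model probabilities assigned to each observed token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns higher probability to native-speaker text
# scores lower (better) perplexity on it.
confident = perplexity([0.5, 0.4, 0.6])
uncertain = perplexity([0.05, 0.1, 0.02])
assert confident < uncertain
```

Lower perplexity on native-written test data suggests the training corpus is closer to how the language is actually used.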
Because of how NusaT and NusaP were constructed, we can also convert them into a benchmark dataset!
So we additionally evaluate various models on this benchmark.
To conclude:
- We release NusaT and NusaP, high-quality corpora for 12 underrepresented languages
- Wikipedia corpora for underrepresented languages do not represent the true language distribution
- We suggest alternative collection methods: translation and free-text writing
Thanks to the amazing work of all the authors:
@SCahyawijaya
@HolyLovenia
@FajriKoto
@DeaAdhista
@emmanuel_davee
@sarahoktaviani
@SabilMAk
@JhonsonLee
@Shadieqq
@wawancenggoro
@hanungwahyuning
@BryanWilie
@galihpradipta
@gentaiscool
@DavidMoeljadi
@AyuPurwarianti
@pascalefung
Introducing LaMini-LM🦙, a diverse set of 15 (more coming) mini-sized models (up to 1.5B) distilled from 2.6M instructions, comparable in performance to Alpaca-7B in downstream NLP + human eval.
We created a new large-scale instruction dataset of 2.58 million instructions by combining downstream NLP tasks with general instructions, then generated responses using ChatGPT.
Next, we fine-tuned various models from different checkpoints on our dataset and evaluated them both on downstream NLP tasks and through human evaluation. Encoder-decoder models are surprisingly good.
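The pipeline above boils down to pairing each instruction with a generated response and feeding the pairs to a sequence-to-sequence model as (input, target) strings. A minimal sketch of that serialization step; the field names, prompt template, and example data are assumptions for illustration, not LaMini-LM's exact format:

```python
# Toy instruction/response pairs standing in for the 2.58M-example dataset.
examples = [
    {"instruction": "Translate to French: Hello", "response": "Bonjour"},
    {"instruction": "Summarize: The cat sat on the mat.", "response": "A cat sat."},
]

def to_seq2seq(example):
    """Format one example as (source, target) for an encoder-decoder model."""
    source = "Instruction: " + example["instruction"]
    target = example["response"]
    return source, target

pairs = [to_seq2seq(e) for e in examples]
```

For an encoder-decoder model, the source string goes through the encoder and the target is what the decoder learns to produce, which is one reason this framing fits instruction data so naturally.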
ChatGPT is an AI based on a "Language Model" (LM for short).
How do these things actually work?
I'll try to explain in layman's terms 🧵 (1/11)
There are many kinds of LMs, but the ones widely used lately really have just one job: completing sentences.
The LM is given a piece of text, and it continues that text piece by piece. Put simply, an LM is "just" text auto-complete.
(2/11)
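That "piece by piece" loop can be sketched with a toy model. Here a tiny lookup table of next-word counts stands in for the real model's giant formula, and we always pick the most frequent continuation (greedy decoding); the table contents are made up for illustration:

```python
# Toy next-word table: counts of which word follows which.
bigram_counts = {
    "the": {"cat": 3, "dog": 1},
    "cat": {"sat": 2, "ran": 1},
    "sat": {"down": 4},
}

def complete(prompt_words, max_new=3):
    """Extend the prompt one word at a time, like LM auto-complete."""
    words = list(prompt_words)
    for _ in range(max_new):
        options = bigram_counts.get(words[-1])
        if not options:
            break
        # Greedy decoding: take the most frequent continuation.
        words.append(max(options, key=options.get))
    return words

print(complete(["the"]))  # ['the', 'cat', 'sat', 'down']
```

A real LM replaces the lookup table with a neural network that scores every possible next token given the whole preceding text, but the generation loop is the same idea.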
If it's just text auto-complete, how can it be so smart?
Because the LM predicts the next text using a complex mathematical formula with an enormous number of variables.
On top of that, the LM learns from a massive amount of data from the internet.
Even NLP researchers still don't fully understand why LMs are this powerful.
(3/11)