Alham Fikri Aji
Oct 30, 2023 · 7 tweets · 3 min read
Scraped data, such as Wikipedia text, is vital for NLP, but how reliable is it in low-resource settings?

🚀Happy to present our work "NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages" @AACL 2023🇮🇩

arxiv.org/abs/2309.10661
We explore two methods of building a corpus for 12 underrepresented Indonesian languages: human translation, and free-form paragraph writing on a given theme.

We then compare their quality against Wikipedia text.
Compared with Wikipedia data, both Nusa Translation (NusaT) and Nusa Paragraph (NusaP) are generally more lexically diverse and use fewer loanwords.

We also find that some Wikipedia pages for low-resource languages are mostly boilerplate.
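As a rough illustration of the kind of lexical-diversity comparison above, here is a minimal sketch using the type-token ratio; the paper's actual metrics may differ, and the corpus variables are hypothetical placeholders.

# Minimal sketch: type-token ratio as one simple lexical-diversity measure.
# The paper's exact metrics may differ; the corpora here are placeholders.
def type_token_ratio(texts):
    """Unique tokens divided by total tokens across a list of documents."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

nusat_corpus = ["..."]   # hypothetical: NusaT sentences for one language
wiki_corpus = ["..."]    # hypothetical: Wikipedia text, same language
print(type_token_ratio(nusat_corpus), type_token_ratio(wiki_corpus))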
Language models trained on NusaP or NusaT also achieve lower perplexity on unseen test data in the corresponding local languages, written by native speakers.
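For readers who want to run this kind of check themselves, here is a hedged sketch of measuring a causal LM's perplexity on held-out text with Hugging Face transformers; the model name and text are illustrative placeholders, not the paper's setup.

# Sketch: perplexity of a causal LM on held-out text. Placeholders only;
# the paper trains its own LMs on NusaT/NusaP.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder stand-in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "Held-out text written by a native speaker ..."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # token-level cross-entropy loss over the sequence.
    loss = model(**enc, labels=enc["input_ids"]).loss
print("perplexity:", math.exp(loss.item()))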
Because of how NusaT and NusaP were constructed, we can also convert them into benchmark datasets!

So we additionally evaluate various models on this benchmark.
To conclude:
- We release NusaT and NusaP, high-quality corpora for 12 underrepresented languages
- Wikipedia corpora for underrepresented languages do not represent the true language distribution
- We suggest alternative collection methods: translation and free-form writing.
Thanks to the amazing work of all the authors:
@SCahyawijaya
@HolyLovenia
@FajriKoto
@DeaAdhista
@emmanuel_davee
@sarahoktaviani
@SabilMAk
@JhonsonLee
@Shadieqq
@wawancenggoro
@hanungwahyuning
@BryanWilie
@galihpradipta
@gentaiscool
@DavidMoeljadi
@AyuPurwarianti
@pascalefung

More from @AlhamFikri

Apr 25, 2023
Introducing LaMini-LM🦙, a diverse set of 15 (more coming) mini-sized models (up to 1.5B parameters) distilled from 2.6M instructions, comparable in performance to Alpaca-7B on downstream NLP tasks and in human evaluation.

Models + data are available strictly for research use: github.com/mbzuai-nlp/LaM…
We created a new large-scale instruction dataset of 2.58 million instructions by combining both downstream NLP tasks and general instructions. We then generated responses using ChatGPT.
Next, we fine-tuned various model checkpoints on this dataset and evaluated them both on downstream NLP tasks and through human evaluation. Encoder-decoder models are surprisingly good.
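A hedged sketch of the general recipe described above: fine-tune a small encoder-decoder model on (instruction, response) pairs. The model name, data, and hyperparameters are placeholders, not the exact LaMini-LM setup.

# Sketch: instruction fine-tuning of a small seq2seq model. All names
# and data below are placeholders, not the LaMini-LM training config.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Trainer, TrainingArguments)

name = "google/flan-t5-small"  # placeholder stand-in
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

pairs = [("Summarize: ...", "...")]  # (instruction, ChatGPT response)

def encode(instruction, response):
    x = tokenizer(instruction, truncation=True, max_length=128,
                  padding="max_length", return_tensors="pt")
    y = tokenizer(response, truncation=True, max_length=128,
                  padding="max_length", return_tensors="pt")
    labels = y.input_ids[0]
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in loss
    return {"input_ids": x.input_ids[0],
            "attention_mask": x.attention_mask[0],
            "labels": labels}

train_data = [encode(i, r) for i, r in pairs]
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="lamini-sketch",
                                         num_train_epochs=1),
                  train_dataset=train_data)
trainer.train()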
Dec 5, 2022
ChatGPT is all the rage right now.

ChatGPT is an AI based on a "Language Model" (LM for short).

So how do they actually work?
I'll try to explain it in layman's terms 🧵 (1/11)
There are many kinds of LMs, but the ones everyone has been using lately really have just one task: completing sentences.

The LM is given a snippet of text, and it then continues it piece by piece. Put simply, an LM is "just" text auto-complete.
(2/11)
Just text auto-complete? So how can it be so smart?

Because the LM predicts the next text using complex mathematical formulas with an enormous number of variables.
On top of that, the LM learns from a massive pile of data from the internet.

Even NLP researchers still don't fully understand why LMs can be this powerful.
(3/11)
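To make the "auto-complete" picture concrete, here is a minimal sketch of that loop: a causal LM repeatedly predicts the most likely next token and appends it. GPT-2 is just an illustrative stand-in, not the model behind ChatGPT.

# Sketch: greedy next-token "auto-complete" loop. GPT-2 is only an
# illustrative stand-in for the LM described in the thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tokenizer("Language models are", return_tensors="pt").input_ids
for _ in range(20):                     # extend the text 20 tokens
    with torch.no_grad():
        logits = model(ids).logits      # a score for every vocab token
    next_id = logits[0, -1].argmax()    # greedily take the likeliest one
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(ids[0]))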