We explore two methods of building a corpus for 12 underrepresented Indonesian languages: human translation, and free-form paragraph writing on a given theme.
We then compare their quality against Wikipedia text.
Compared with Wikipedia data, both Nusa Translation (NusaT) and Nusa Paragraph (NusaP) are generally more lexically diverse and use fewer loan words.
We also find that some Wikipedia pages for low-resource languages are mostly boilerplate.
Language models trained on NusaP or NusaT also achieve lower perplexity on unseen test data in the corresponding local languages, written by native speakers.
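As a refresher on what that comparison measures: perplexity is the exponential of the average negative log-probability a model assigns to each token of held-out text. A minimal toy sketch (the probabilities are made-up numbers, not from the paper):

```python
import math

def perplexity(token_probs):
    """token_probs: model probabilities assigned to each observed token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns higher probability to native-speaker text
# scores lower (better) perplexity on it.
confident = perplexity([0.5, 0.4, 0.6])
uncertain = perplexity([0.05, 0.1, 0.02])
assert confident < uncertain
```

Lower perplexity on native-written test data suggests the training corpus is closer to how the language is actually used.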
Because of how NusaT and NusaP were constructed, we can also convert them into a benchmark dataset!
So we additionally evaluate various models on this benchmark.
To conclude:
- We release NusaT and NusaP, high-quality corpora for 12 underrepresented languages
- Wikipedia corpora for underrepresented languages do not represent the true language distribution
- We suggest alternative collection methods: translation and free-text writing
Thanks to the amazing work of all the authors:
@SCahyawijaya
@HolyLovenia
@FajriKoto
@DeaAdhista
@emmanuel_davee
@sarahoktaviani
@SabilMAk
@JhonsonLee
@Shadieqq
@wawancenggoro
@hanungwahyuning
@BryanWilie
@galihpradipta
@gentaiscool
@DavidMoeljadi
@AyuPurwarianti
@pascalefung
Introducing LaMini-LM🦙, a diverse set of 15 (more coming) mini-sized models (up to 1.5B) distilled from 2.6M instructions, comparable in performance to Alpaca-7B in downstream NLP + human eval.
We created a new large-scale instruction dataset of 2.58 million instructions by combining downstream NLP tasks with general instructions, then generated responses using ChatGPT.
Next, we fine-tuned various models from different checkpoints on our dataset and evaluated them both on downstream NLP tasks and through human evaluation. Encoder-decoder models are surprisingly good.
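The pipeline above boils down to pairing each instruction with a generated response and feeding the pairs to a sequence-to-sequence model as (input, target) strings. A minimal sketch of that serialization step; the field names, prompt template, and example data are assumptions for illustration, not LaMini-LM's exact format:

```python
# Toy instruction/response pairs standing in for the 2.58M-example dataset.
examples = [
    {"instruction": "Translate to French: Hello", "response": "Bonjour"},
    {"instruction": "Summarize: The cat sat on the mat.", "response": "A cat sat."},
]

def to_seq2seq(example):
    """Format one example as (source, target) for an encoder-decoder model."""
    source = "Instruction: " + example["instruction"]
    target = example["response"]
    return source, target

pairs = [to_seq2seq(e) for e in examples]
```

For an encoder-decoder model, the source string goes through the encoder and the target is what the decoder learns to produce, which is one reason this framing fits instruction data so naturally.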
ChatGPT is an AI based on a "Language Model" (LM for short).
How do these things actually work?
I'll try to explain in layman's terms 🧵 (1/11)
There are many kinds of LMs, but the ones widely used lately really have just one job: completing sentences.
The LM is given a piece of text, and it continues that text piece by piece. Put simply, an LM is "just" text auto-complete.
(2/11)
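That "piece by piece" loop can be sketched with a toy model. Here a tiny lookup table of next-word counts stands in for the real model's giant formula, and we always pick the most frequent continuation (greedy decoding); the table contents are made up for illustration:

```python
# Toy next-word table: counts of which word follows which.
bigram_counts = {
    "the": {"cat": 3, "dog": 1},
    "cat": {"sat": 2, "ran": 1},
    "sat": {"down": 4},
}

def complete(prompt_words, max_new=3):
    """Extend the prompt one word at a time, like LM auto-complete."""
    words = list(prompt_words)
    for _ in range(max_new):
        options = bigram_counts.get(words[-1])
        if not options:
            break
        # Greedy decoding: take the most frequent continuation.
        words.append(max(options, key=options.get))
    return words

print(complete(["the"]))  # ['the', 'cat', 'sat', 'down']
```

A real LM replaces the lookup table with a neural network that scores every possible next token given the whole preceding text, but the generation loop is the same idea.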
If it's just text auto-complete, how can it be so smart?
Because the LM predicts the next text using a complex mathematical formula with an enormous number of variables.
On top of that, the LM learns from a massive amount of data from the internet.
Even NLP researchers still don't fully understand why LMs are this powerful.
(3/11)