Tom Leaman
Husband, father, VP Site Reliability Engineering @ Warner Bros. Discovery, wannabe woodworker, baker, 3D printer, cyclist. Opinions 100% my own

Sep 30, 2020, 18 tweets

First talk of the day: "Bonjour! Wie geht es dir? Dhanyawaad" - How Multilingual is your NLP Model? with Shreya Khurana from GoDaddy

#GHC20 #vGHC

Shreya is a Sr. Data Scientist at GoDaddy and has given a number of talks on NLP at Python conferences

We're going to be covering:
1. Why go multilingual for NLP?
2. How do we go multilingual?
3. What's the future of multilingual NLP?

#GHC20 #vGHC

Multilingual support matters because of the sheer number of languages and countries represented on the internet, and the diversity of users interacting with our software.

#GHC20 #vGHC

Off-the-shelf models are often built around English words, and a huge amount of data is needed to build them.

#GHC20 #vGHC

Many of our conversations are multilingual in nature, with frequent code-switching: mixing languages within a single conversation or even a single sentence.
#GHC20 #vGHC2020

Transliterated phrases are also a challenge: they take phrases from one language and write them in the script of another (e.g. Hindi written in Latin characters)

#GHC20 #vGHC2020

We're looking at 3 frameworks that help handle code-switched or transliterated text: cld3, langid, and langdetect

#GHC20 #vGHC2020

CLD3 is a framework that supports 100+ languages and does well with transliteration, but it struggles with short text and 'borrowed words'

#GHC20 #vGHC2020
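For a feel of what CLD3 looks like in practice, here's a minimal sketch using the gcld3 Python bindings (pip install gcld3); the example sentence is my own, not from the talk:

```python
import gcld3

# NNetLanguageIdentifier is the neural model behind CLD3; the byte bounds
# control how much of the input is considered.
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

result = detector.FindLanguage(text="yeh gaana bahut accha hai")  # romanized Hindi
print(result.language, round(result.probability, 3), result.is_reliable)
```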

langid is a framework that supports 97 languages. It has trouble with transliterated text, and also with underrepresented languages such as Swahili, despite Swahili being in its training set.

#GHC20 #vGHC2020
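A minimal langid sketch for comparison (pip install langid); the example inputs are my own:

```python
import langid

# classify() returns a (language, score) tuple.
print(langid.classify("Habari ya asubuhi"))  # Swahili; may be misidentified
print(langid.classify("Good morning"))       # ('en', ...)

# Constraining the candidate languages can improve accuracy:
langid.set_languages(["en", "sw", "fr"])
print(langid.classify("Habari ya asubuhi"))
```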

The final framework we're looking at is langdetect, which only supports 55 languages. Unfortunately, it isn't trained on transliterated text at all...

#GHC20 #vGHC2020
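And a minimal langdetect sketch (pip install langdetect); its results are probabilistic, so fixing the seed keeps them reproducible. Example inputs are mine:

```python
from langdetect import DetectorFactory, detect, detect_langs

DetectorFactory.seed = 0  # make the probabilistic results reproducible

print(detect("Bonjour tout le monde"))        # 'fr'
print(detect_langs("Bonjour tout le monde"))  # e.g. [fr:0.999...]

# Transliterated input is typically mislabeled as some Latin-script language:
print(detect("aap kaise hain"))
```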

So... how do we add great multilingual support?

First option: Build Annotated Datasets!

#GHC20 #vGHC2020
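The talk didn't show a schema, but token-level annotation for code-switched text might look something like this hypothetical sketch (the fields and tags are my assumptions):

```python
# Hypothetical token-level annotation for code-switched text; the schema
# and language labels here are illustrative, not from the talk.
dataset = [
    {
        "text": "I love this gaana",
        "tokens": ["I", "love", "this", "gaana"],
        "langs": ["en", "en", "en", "hi"],  # "gaana" is transliterated Hindi
    },
]
```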

Next: BERT, or Bidirectional Encoder Representations from Transformers.

Google uses it for search ranking, and it has led to a substantial increase in accuracy. It handles semantic information really well.

Multilingual BERT is trained on 104 languages via Wikipedia

#GHC20 #vGHC2020
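A minimal sketch of querying that multilingual checkpoint via the Hugging Face transformers library (pip install transformers torch); the masked-sentence examples are my own:

```python
from transformers import pipeline

# bert-base-multilingual-cased is the 104-language, Wikipedia-trained checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# One model handles the same masked-word task across languages:
for text in ["Paris is the [MASK] of France.",
             "Paris est la [MASK] de la France."]:
    top = fill_mask(text)[0]  # highest-scoring prediction
    print(top["token_str"], round(top["score"], 3))
```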

BERT mitigates English bias via exponentially smoothed weighting of the training data, which up-samples underrepresented languages (a quick sketch follows below).

BERT is pretty special due to its pre-training process and its enormous dataset.

#GHC20 #vGHC2020
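To make the smoothing concrete: as described in the multilingual BERT README, each language's share of the corpus is raised to a power s < 1 (mBERT uses 0.7) and renormalized, which boosts low-resource languages. A small sketch with made-up corpus sizes:

```python
# Made-up corpus sizes; English dominates, Swahili is tiny.
sizes = {"en": 1000.0, "hi": 50.0, "sw": 5.0}
s = 0.7  # smoothing exponent; s = 1 would keep raw proportions

total = sum(sizes.values())
raw = {lang: n / total for lang, n in sizes.items()}

smoothed = {lang: p ** s for lang, p in raw.items()}
z = sum(smoothed.values())
sampling = {lang: v / z for lang, v in smoothed.items()}

for lang in sizes:
    print(f"{lang}: {raw[lang]:.3f} -> {sampling[lang]:.3f}")
# English's share drops, while hi/sw are sampled more often than raw frequency.
```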

Tom's take: I'm no NLP expert, but BERT seems to be the "go-to" language model in this domain. Shreya's talk really highlights how substantial its dataset and its ability to handle a range of multilingual problems are.

#GHC20 #vGHC2020

So, what's exciting in the future of multilingual NLP?

Language-Agnostic Representations - grouping similar phrases across translations (see the sketch after this tweet)

Cross-Language Modeling - training new language models off existing ones

Both benefit underrepresented languages
#GHC20 #vGHC2020
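As a taste of language-agnostic representations, here's a minimal sketch using the sentence-transformers library; the model choice and sentences are my assumptions, not something the talk named:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "How do I reset my password?",               # English
    "Comment réinitialiser mon mot de passe ?",  # French translation
    "The weather is nice today.",                # unrelated English
]
emb = model.encode(sentences)

# Translations land close together in the shared space; unrelated text doesn't.
print(util.cos_sim(emb[0], emb[1]).item())
print(util.cos_sim(emb[0], emb[2]).item())
```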
