First talk of the day: "Bonjour! Wie geht es dir? Dhanyawaad" - How Multilingual is your NLP Model? with Shreya Khurana from GoDaddy
#GHC20 #vGHC
Shreya is a Sr. Data Scientist and has done a number of talks on NLP at Python conferences
We're going to be covering:
1. Why Multilingual for NLP
2. How do we go Multilingual
3. What's the future of Multilingual?
#GHC20 #vGHC
Multilingual support is important due to the sheer scale of languages, countries, and the diversity of users on the internet using our software.
#GHC20 #vGHC
Off-the-shelf models are often built around English text, and a huge amount of data is necessary to train them.
#GHC20 #vGHC
Many of our conversations are multilingual in nature, with frequent code-switching - mixing languages within a single sentence or exchange.
#GHC20 #vGHC2020
Transliterated phrases are also a challenge: these take phrases from one language and render them in the script of another - e.g., Hindi धन्यवाद written in Latin script as "dhanyawaad"
#GHC20 #vGHC2020
We're looking at 3 frameworks that help handle code-switched or transliterated text: cld3, langid, and langdetect
#GHC20 #vGHC2020
CLD3 is a framework that supports 100+ languages and does well with transliteration, but it struggles with short texts and 'borrowed words'
#GHC20 #vGHC2020
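A quick sketch (not from the talk) of what CLD3 detection looks like via the gcld3 Python bindings; the example phrase is mine:

```python
# Minimal sketch: language detection with the gcld3 bindings for CLD3.
# Assumes `pip install gcld3`.
import gcld3

detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

# A transliterated Hindi greeting in Latin script.
result = detector.FindLanguage(text="dhanyawaad dost, kaise ho?")
print(result.language, result.probability, result.is_reliable)
```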
langid is a framework that supports 97 languages. It has trouble with transliterated text and with underrepresented languages such as Swahili, despite Swahili being in its training set.
#GHC20 #vGHC2020
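Likewise, a minimal sketch (not from the talk) of langid.py; the phrases are illustrative:

```python
# Minimal sketch: detection with langid.py. Assumes `pip install langid`.
import langid

# classify() returns a (language, score) tuple.
lang, score = langid.classify("Bonjour, wie geht es dir?")
print(lang, score)

# Optionally constrain the candidate set to boost precision.
langid.set_languages(["en", "fr", "de", "sw"])
print(langid.classify("Habari yako?"))
```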
The final framework we're looking at is langdetect, which supports only 55 languages. Unfortunately it isn't trained on transliterated text at all...
#GHC20 #vGHC2020
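And a sketch (not from the talk) of langdetect; note its results are nondeterministic unless you pin a seed:

```python
# Minimal sketch: detection with langdetect. Assumes `pip install langdetect`.
from langdetect import DetectorFactory, detect, detect_langs

DetectorFactory.seed = 0  # the algorithm is nondeterministic; pin it

print(detect("Dhanyawaad!"))  # short / transliterated text is shaky here
print(detect_langs("Merci beaucoup pour votre aide"))
```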
So... how do we add great multilingual support?
Next: BERT or Bidirectional Encoder Representations from Transformers.
Google uses this for search rankings and it's led to a substantial increase in accuracy. It handles semantic information really well.
The multilingual version of BERT is trained on Wikipedia text across 104 languages
#GHC20 #vGHC2020
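A minimal sketch (not from the talk) of loading multilingual BERT from the Hugging Face hub; the mean-pooling step is one common way to get a sentence vector, not the talk's prescription:

```python
# Minimal sketch: one multilingual BERT model embedding text in two
# languages. Assumes `pip install transformers torch`.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

for text in ["How are you?", "Wie geht es dir?"]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into a single sentence vector.
    print(text, outputs.last_hidden_state.mean(dim=1).shape)
```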
BERT handles English bias via exponentially smoothed weighting of the training data, which down-weights overrepresented languages like English and up-weights low-resource ones.
BERT is pretty special due to its pre-training process and its enormous dataset.
#GHC20 #vGHC2020
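A worked sketch of the smoothing idea (not from the talk; the exponent 0.7 matches the mBERT repo's description, and the corpus shares are made up for illustration):

```python
# Worked sketch of exponentially smoothed sampling: each language's
# share of the data is raised to a power s < 1 and renormalized,
# shrinking English's dominance. Corpus shares below are invented.
shares = {"en": 0.70, "de": 0.20, "hi": 0.09, "sw": 0.01}
s = 0.7

smoothed = {lang: p ** s for lang, p in shares.items()}
total = sum(smoothed.values())
weights = {lang: w / total for lang, w in smoothed.items()}

for lang in shares:
    print(f"{lang}: raw {shares[lang]:.2f} -> sampled {weights[lang]:.2f}")
# English drops (0.70 -> ~0.59) while Swahili rises (0.01 -> ~0.03).
```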
Tom's take: I'm no NLP expert, but BERT seems to be the "go-to" language model in this domain. Shreya's talk is really highlighting how its dataset and its ability to manage a number of multilingual problems are pretty substantial.
#GHC20 #vGHC2020
So, what is exciting in the future of Multilingual NLP?
Language-agnostic representations - grouping similar phrases across translations (see the sketch after this tweet)
Cross-lingual modeling - training new language models off of existing ones
Both benefit underrepresented languages
#GHC20 #vGHC2020
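As one concrete sketch of language-agnostic representations (my example, not from the talk), here's LaBSE via sentence-transformers grouping translations of the same phrase:

```python
# Minimal sketch: language-agnostic sentence embeddings with LaBSE,
# a model built for exactly this cross-lingual grouping.
# Assumes `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = ["Thank you very much", "Merci beaucoup", "Dhanyawaad"]
embeddings = model.encode(sentences)

# Translations of the same phrase should land close together.
print(util.cos_sim(embeddings, embeddings))
```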