Tom Leaman
Sep 30, 2020
First talk of the day: "Bonjour! Wie geht es dir? Dhanyawaad" - How Multilingual is your NLP Model? with Shreya Khurana from GoDaddy

#GHC20 #vGHC
Shreya is a Sr. Data Scientist and has done a number of talks on NLP at Python conferences
We're going to be covering:
1. Why Multilingual for NLP
2. How do we go Multilingual
3. What's the future of Multilingual?

#GHC20 #vGHC
Multilingual is important due to the sheer scale of languages, countries, and diversity of users active on the internet using our software.

#GHC20 #vGHC
Off-the-shelf models are often trained mostly on English text, and building them takes a huge amount of data.

#GHC #vGHC
Many of our conversations are multilingual in nature, with frequent code-switching (moving between languages within a single conversation or sentence).
#GHC20 #vGHC2020
Transliterated phrases are also a challenge: they take phrases from one language and write them in the script of another language.

#GHC20 #vGHC2020
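
A quick sketch of why transliteration is tricky (my own illustration, not from the talk): the script a phrase is written in says nothing reliable about its language. The helper name `scripts_used` is hypothetical, and it leans on the first word of each character's Unicode name as a rough stand-in for its script.

```python
import unicodedata
from collections import Counter


def scripts_used(text: str) -> Counter:
    """Count characters per script, using the first word of the Unicode
    character name (e.g. 'LATIN', 'DEVANAGARI') as a rough script label."""
    counts = Counter()
    for ch in text:
        # Keep letters (L*) and combining marks (M*); skip spaces, digits, punctuation.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


if __name__ == "__main__":
    # The same Hindi word, once in Devanagari and once transliterated.
    print(scripts_used("धन्यवाद"))
    print(scripts_used("Dhanyawaad"))
```

The transliterated form is indistinguishable from English at the script level, so a detector has to rely on character patterns it may never have seen in training.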
We're looking at 3 frameworks that help handle code-switched or transliterated text: cld3, langid, and langdetect

#GHC20 #vGHC2020
CLD3 is a framework that supports 100+ languages and does well with transliteration, but it struggles with short texts and 'borrowed words'

#GHC20 #vGHC2020
langid is a framework that supports 97 languages. It has some trouble with transliterated text and with underrepresented languages such as Swahili, even though Swahili is in its training set.

#GHC20 #vGHC2020
The final framework we're looking at is langdetect which only supports 55 languages. Unfortunately it isn't trained at all for transliterated languages...

#GHC20 #vGHC2020
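
If you want to try the three detectors side by side, here's a minimal sketch (mine, not from the talk). It assumes the PyPI packages `langid`, `langdetect`, and `pycld3` are installed; the wrapper and `compare` names are hypothetical. Imports happen inside each wrapper so a missing package shows up as an error entry instead of a crash.

```python
from typing import Callable, Dict


def detect_with_langid(text: str) -> str:
    import langid  # assumed installed: pip install langid
    lang, _score = langid.classify(text)
    return lang


def detect_with_langdetect(text: str) -> str:
    from langdetect import detect  # assumed installed: pip install langdetect
    return detect(text)


def detect_with_cld3(text: str) -> str:
    import cld3  # assumed installed: pip install pycld3
    prediction = cld3.get_language(text)
    # Fall back to "und" (undetermined) if no prediction comes back.
    return prediction.language if prediction else "und"


def compare(text: str, detectors: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Run every detector on the same text and collect the answers."""
    results = {}
    for name, detector in detectors.items():
        try:
            results[name] = detector(text)
        except Exception as exc:  # missing package, too-short input, etc.
            results[name] = f"error: {type(exc).__name__}"
    return results


if __name__ == "__main__":
    detectors = {
        "cld3": detect_with_cld3,
        "langid": detect_with_langid,
        "langdetect": detect_with_langdetect,
    }
    # The code-switched sentence from the talk title.
    print(compare("Bonjour! Wie geht es dir? Dhanyawaad", detectors))
```

Each detector returns a single label here, which is exactly the limitation for code-switched text: one sentence, several languages, one answer.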
So... how do we add great multilingual support?
First option: Build Annotated Datasets!

#GHC20 #vGHC2020
Next: BERT or Bidirectional Encoder Representations from Transformers.

Google uses this for search rankings and it's led to a substantial increase in accuracy. It handles semantic information really well.

Multilingual BERT is trained on Wikipedia text in 104 languages

#GHC20 #vGHC2020
BERT mitigates its English bias via exponentially smoothed weighting of the training data: high-resource languages like English get sampled less, and low-resource languages more.

BERT is pretty special due to its pre-training process and its enormous dataset.

#GHC20 #vGHC2020
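
To make the exponentially smoothed weighting concrete, here's a small sketch (my own, not shown in the talk). The function name and the corpus sizes are made up for illustration, and the default exponent `s=0.7` is an assumption on my part based on what the multilingual BERT documentation describes. Each language's share of the data is raised to the power `s` and then re-normalized.

```python
from typing import Dict


def smoothed_sampling_probs(sizes: Dict[str, int], s: float = 0.7) -> Dict[str, float]:
    """Turn raw corpus sizes into sampling probabilities with exponential smoothing.

    With s < 1, large languages are sampled less than their raw share
    and small languages more; s = 1 leaves the raw shares unchanged.
    """
    total = sum(sizes.values())
    raw = {lang: n / total for lang, n in sizes.items()}
    powered = {lang: p ** s for lang, p in raw.items()}
    norm = sum(powered.values())
    return {lang: w / norm for lang, w in powered.items()}


if __name__ == "__main__":
    # Toy numbers, not real Wikipedia sizes.
    toy_sizes = {"en": 1000, "de": 300, "sw": 10}
    for lang, prob in smoothed_sampling_probs(toy_sizes).items():
        print(f"{lang}: raw={toy_sizes[lang] / sum(toy_sizes.values()):.3f} smoothed={prob:.3f}")
```

The smoothing doesn't remove the imbalance, it just softens it, so underrepresented languages are seen more often during pre-training without being repeated endlessly.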
Tom's take: I'm no NLP expert but BERT seems to be the "go-to" language model in this domain. Shreya's talk really highlights how substantial its dataset and its ability to handle a range of multilingual problems are.

#GHC20 #vGHC2020
So, what is exciting in the future of Multilingual NLP?

Language Agnostic Representations - grouping of similar phrases across translations

Cross-Language Modeling - train language models off of existing ones

Both benefit underrepresented languages
#GHC20 #vGHC2020
