Mishig Davaadorj
Apr 25 • 9 tweets • 4 min read
How do language models (like BERT or GPT) "see" words?

TLDR: whereas we see πš†πšŽΜ„πš•πšŒπš˜Μπš–πšŽΜ‚ 𝚝𝚘́ πšπš‘πšŽΜˆ πŸ€— πšƒπš˜Μ‚πš”πšŽΜπš—πš’Μ„πš£πšŽΜ„πš›πšœ, language models see [𝟷0𝟷, 𝟼𝟷𝟼0, 𝟸000, 𝟷𝟿𝟿𝟼, 𝟷00, 𝟷𝟿𝟸0𝟺, 𝟷𝟽𝟼𝟸𝟿, 𝟸0𝟷𝟻, 𝟷0𝟸]
🧡 on Tokenization by examples
1/
2/ NLP Tokenization steps are ↳ πš—πš˜πš›πš–πšŠπš•πš’πš£πšŠπšπš’πš˜πš— ➜ πš™πš›πšŽ-πšπš˜πš”πšŽπš—πš’πš£πšŠπšπš’πš˜πš— ➜ πš–πš˜πšπšŽπš• ➜ πš™πš˜πšœπš-πš™πš›πš˜πšŒπšŽπšœπšœπš’πš—πš.

Together, they are called a "tokenization pipeline"
huggingface.co/docs/tokenizer…
3/ πš—πš˜πš›πš–πšŠπš•πš’πš£πšŠπšπš’πš˜πš—:
πš†πšŽΜ„πš•πšŒπš˜Μπš–πšŽΜ‚ 𝚝𝚘́ πšπš‘πšŽΜˆ πŸ€— πšƒπš˜Μ‚πš”πšŽΜπš—πš’Μ„πš£πšŽΜ„πš›πšœ ➜ πš†πšŽπš•πšŒπš˜πš–πšŽ 𝚝𝚘 πšπš‘πšŽ πŸ€— πšƒπš˜πš”πšŽπš—πš’πš£πšŽπš›πšœ
4/ πš™πš›πšŽ-πšπš˜πš”πšŽπš—πš’πš£πšŠπšπš’πš˜πš—:
πš†πšŽπš•πšŒπš˜πš–πšŽ 𝚝𝚘 πšπš‘πšŽ πŸ€— πšƒπš˜πš”πšŽπš—πš’πš£πšŽπš›πšœ ➜ [('πš†πšŽπš•πšŒπš˜πš–πšŽ', (0, 𝟽)),('𝚝𝚘', (𝟾, 𝟷0)),('πšπš‘πšŽ', (𝟷𝟷, 𝟷𝟺)),('πŸ€—', (𝟷𝟻, 𝟷𝟼)),('πšƒπš˜πš”πšŽπš—πš’πš£πšŽπš›πšœ', (𝟷𝟽, 𝟸𝟽))]
5/ πš–πš˜πšπšŽπš•:
[('πš†πšŽπš•πšŒπš˜πš–πšŽ', (0, 𝟽)),('𝚝𝚘', (𝟾, 𝟷0)),('πšπš‘πšŽ', (𝟷𝟷, 𝟷𝟺)),('πŸ€—', (𝟷𝟻, 𝟷𝟼)),('πšƒπš˜πš”πšŽπš—πš’πš£πšŽπš›πšœ', (𝟷𝟽, 𝟸𝟽))] ➜ [πš πšŽπš•πšŒπš˜πš–πšŽ, 𝚝𝚘, πšπš‘πšŽ, [πš„π™½π™Ί], πšπš˜πš”πšŽπš—, ##πš’πš£πšŽπš›, ##𝚜]
6/ πš™πš˜πšœπš-πš™πš›πš˜πšŒπšŽπšœπšœπš’πš—πš:
[πš πšŽπš•πšŒπš˜πš–πšŽ, 𝚝𝚘, πšπš‘πšŽ, [πš„π™½π™Ί], πšπš˜πš”πšŽπš—, ##πš’πš£πšŽπš›, ##𝚜] ➜ [[π™²π™»πš‚], πš πšŽπš•πšŒπš˜πš–πšŽ, 𝚝𝚘, πšπš‘πšŽ, [πš„π™½π™Ί], πšπš˜πš”πšŽπš—, ##πš’πš£πšŽπš›, ##𝚜, [πš‚π™΄π™Ώ]]

* notice [π™²π™»πš‚] $𝙰 [πš‚π™΄π™Ώ]
7/ tokens to ids conversion:
[[π™²π™»πš‚], πš πšŽπš•πšŒπš˜πš–πšŽ, 𝚝𝚘, πšπš‘πšŽ, [πš„π™½π™Ί], πšπš˜πš”πšŽπš—, ##πš’πš£πšŽπš›, ##𝚜, [πš‚π™΄π™Ώ]] ➜ [𝟷0𝟷, 𝟼𝟷𝟼0, 𝟸000, 𝟷𝟿𝟿𝟼, 𝟷00, 𝟷𝟿𝟸0𝟺, 𝟷𝟽𝟼𝟸𝟿, 𝟸0𝟷𝟻, 𝟷0𝟸]
8/ Check out the 🤗 Tokenizers docs (w/ new look) huggingface.co/docs/tokenizer…
9/ πŸ€— Course has an excellent section on tokenization by @LucileSaulnier for anyone who wants to learn more huggingface.co/course/chapter6
