Can your NLP model handle noooisy mEsSy #realworldtext?

ByT5 works on raw UTF-8 bytes (no tokenization!), beats SoTA models on many popular tasks, and is more robust to noise.

📜 Preprint: arxiv.org/abs/2105.13626
💾 Code/Models: github.com/google-researc…

Summary thread ⬇️ (1/9)
Tokenizers have many drawbacks:
- Finite, fixed vocabulary - often can't process new/unseen languages
- Lack of robustness to missspeling and n o i s e
- Not learned "end-to-end"
- Giant vocabulary matrices in the multilingual setting
- Lots of technical debt in practice

(2/9)
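
To make the first two drawbacks concrete, here is a minimal sketch of a fixed-vocabulary tokenizer. It is a toy word-level example, not the tokenizer of any real model: the `VOCAB` table and the `encode_words` helper are hypothetical names invented for illustration, and production tokenizers typically use subword units instead of whole words.

```python
from typing import List

# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"<unk>": 0, "the": 1, "model": 2, "handles": 3, "spelling": 4, "noise": 5}


def encode_words(text: str) -> List[int]:
    """Map each whitespace-separated word to its id, or to <unk> if unseen."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


clean = encode_words("the model handles spelling noise")
noisy = encode_words("the model handles speling n o i s e")

print(clean)  # [1, 2, 3, 4, 5]
print(noisy)  # [1, 2, 3, 0, 0, 0, 0, 0, 0]
```

In this toy setup, one misspelling and one spaced-out word collapse into the same `<unk>` id, so the model can no longer tell them apart.
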
Operating on the raw byte sequence used to represent text (e.g. UTF-8) solves many of the aforementioned issues. The main drawback: Sequence lengths tend to increase significantly compared to using token sequences.

(3/9)
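
The trade-off above can be sketched in a few lines of Python. This is only an illustration of byte-level input using the standard `str.encode` call, not ByT5's actual preprocessing: the `to_byte_ids` helper is a hypothetical name, and the sketch ignores any special-token ids a real model may reserve. Whitespace-separated words stand in for tokens here, which is a simplification.

```python
from typing import List


def to_byte_ids(text: str) -> List[int]:
    """Encode text as a list of UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


text = "noooisy mEsSy text"
byte_ids = to_byte_ids(text)
words = text.split()

# Every string has a byte representation, so there is no unknown token,
# but the sequence is much longer than the word-level one.
print(len(words))     # 3
print(len(byte_ids))  # 18

# Non-ASCII characters take more than one byte each.
print(to_byte_ids("naïve"))  # [110, 97, 195, 175, 118, 101]

# The mapping is lossless: decoding the bytes recovers the original text.
assert bytes(byte_ids).decode("utf-8") == text
```

With only 256 possible byte values, the input vocabulary stays tiny regardless of language, at the cost of the longer sequences shown above.
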