Thread by Smerity: 8 tweets, 2 min read
For those in the language modeling space, a question regarding perplexity as a metric with varying tokenization:
- Is there a hard proof showing that, for a dataset D tokenized with two different schemes A and B, the resulting perplexities are equivalent?
- Does that proof take into account teacher forcing?
I ask as I have never seen such a proof and always assumed smarter people than me had thought about it. Intuitively it felt reasonable until I recently began pondering the teacher forcing aspect, which is essentially giving your model supervision, including at test time.
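To pin down what "equivalent" would even mean, here is the usual chain-rule sketch - not a proof for learned models, just the identity people tend to appeal to, assuming both tokenizations are lossless encodings of the same text. Under scheme A the dataset D becomes tokens t_1, ..., t_{N_A}; under scheme B it becomes u_1, ..., u_{N_B}. If both models assign the same probability to the underlying string then, by the chain rule,

  \prod_{i=1}^{N_A} p_A(t_i \mid t_{<i}) \;=\; P(D) \;=\; \prod_{j=1}^{N_B} p_B(u_j \mid u_{<j}),

so the total negative log-likelihood matches. Per-token perplexity, though, is \exp(\mathrm{NLL}/N), and N_A \neq N_B in general, so the per-token numbers are only comparable after renormalising to a tokenization-independent unit, e.g. bits per character:

  \mathrm{bpc} = \frac{-\sum_i \log_2 p(t_i \mid t_{<i})}{\#\,\text{characters in } D}.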
Imagine you had the task of language modeling:
"Bob and Alice were fighting for first place but who won? [predict: Bob or Alice]"
The claim is that the language model's perplexity (confusion) should be equal regardless of how we split the text.
If we tokenize to "Bob" and "Alice" (words) then the prediction carries a far larger error than if we tokenize to "B|ob" and "A|lice" (wordpieces). The teacher's "supervision" at test time hands the model "A" or "B", collapsing the entropy, so the rest ("ob" / "lice") comes essentially for free.
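A toy calculation makes this concrete. The probabilities below are made up (50/50 on which name won, near-certainty on the continuation once its first piece is teacher-forced), so this is only a sketch, but it shows the total NLL barely moving while the per-token perplexity drops under the finer split:

import math

# Hypothetical teacher-forced probabilities for the answer "Alice":
# the model is only uncertain about which name won (50/50) and is
# near-certain about everything else.
word_probs      = [0.5]         # p("Alice" | context)
wordpiece_probs = [0.5, 0.99]   # p("A" | context), p("lice" | context, "A")

def nll_bits(probs):
    return -sum(math.log2(p) for p in probs)

def per_token_ppl(probs):
    return 2 ** (nll_bits(probs) / len(probs))

for name, probs in [("word", word_probs), ("wordpiece", wordpiece_probs)]:
    print(f"{name:9s} total NLL = {nll_bits(probs):.3f} bits, "
          f"per-token perplexity = {per_token_ppl(probs):.3f}")

# prints (approximately):
# word      total NLL = 1.000 bits, per-token perplexity = 2.000
# wordpiece total NLL = 1.014 bits, per-token perplexity = 1.421

The one bit of genuine uncertainty doesn't disappear; it just gets averaged over more teacher-forced positions, which is why per-token perplexities under different tokenizations aren't directly comparable without renormalising.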
Standard ambiguities apply too, e.g. "special|ized" vs "special|ised". The dynamics of the neural network would certainly change - larger but less frequent bursts of entropy with words - but I don't know whether that constitutes an unfair or unreasonable change in the model's operation.
- From a compression point of view this is reasonable, as the decoder always has the correct preceding tokens to condition on, so the next token is guaranteed to be decoded correctly.
- From a language modeling point of view I'm less certain. It's an Ouroboros: at generation time you feed the output of your model back in as input, and losing the oracle teacher is a big change (a toy contrast is sketched below).
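To make the Ouroboros point concrete, here's a minimal sketch using the same made-up toy model as above (entirely hypothetical probabilities), contrasting teacher-forced scoring, where every prefix is the gold one, with free-running generation, where the model's own sampled piece is fed back in and a wrong "B" commits the rest of the output to "Bob":

import math, random

# Hypothetical next-piece model: 50/50 on the first piece of the name,
# near-certain continuation once that piece is fixed.
def next_piece_probs(prefix):
    if not prefix:
        return {"A": 0.5, "B": 0.5}
    if prefix[-1] == "A":
        return {"lice": 0.99, "ob": 0.01}
    return {"ob": 0.99, "lice": 0.01}

gold = ["A", "lice"]

# Teacher forcing: "lice" is always scored given the oracle-provided "A".
nll = -sum(math.log2(next_piece_probs(gold[:i])[piece])
           for i, piece in enumerate(gold))
print(f"teacher-forced NLL: {nll:.3f} bits")

# Free-running generation: each sampled piece is fed back as input, so the
# model must commit to "A" or "B" itself; roughly half the runs say "Bob".
random.seed(0)
for _ in range(4):
    prefix = []
    for _ in range(2):
        probs = next_piece_probs(prefix)
        pieces, weights = zip(*probs.items())
        prefix.append(random.choices(pieces, weights=weights)[0])
    print("generated:", "".join(prefix))

The teacher-forced number is identical whichever name the model would actually have committed to, which is exactly the gap between perplexity and generation quality being asked about here.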
"The tokenization Ouroboros issue" seems to definitely have some manner of impact on how well perplexity as a metric will translate to real world generation, assuming a word vs wordpiece LM with equal perplexity, but perhaps that's outside of the scope of perplexity as a metric.
So, dear reader, have you pondered such things? Do you have an answer? My SotA claim depends on it! Most importantly, however: the creation of future datasets. If all tokenizations are equal, all datasets are just text blobs. Otherwise they must remain specifically tokenized sequences.