Thread by Smerity: 8 tweets, 2 min read
For those in the language modeling space, a question regarding perplexity as a metric with varying tokenization:
- Is there a hard proof showing that, for a dataset D tokenized with two different schemes A and B, the resulting perplexities are equivalent?
- Does that proof take into account teacher forcing?
I ask as I have never seen such a proof and always assumed smarter people than me had thought about it. Intuitively it felt reasonable until I recently began pondering the teacher forcing aspect, which is essentially giving your model supervision, including at test time.
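To pin down what "equivalent" would even mean, here is the usual chain-rule sketch - not a proof for learned models, just the identity people tend to appeal to, assuming both tokenizations are lossless encodings of the same text. Under scheme A the dataset D becomes tokens t_1, ..., t_{N_A}; under scheme B it becomes u_1, ..., u_{N_B}. If both models assign the same probability to the underlying string then, by the chain rule,

  \prod_{i=1}^{N_A} p_A(t_i \mid t_{<i}) \;=\; P(D) \;=\; \prod_{j=1}^{N_B} p_B(u_j \mid u_{<j}),

so the total negative log-likelihood matches. Per-token perplexity, though, is \exp(\mathrm{NLL}/N), and N_A \neq N_B in general, so the per-token numbers are only comparable after renormalising to a tokenization-independent unit, e.g. bits per character:

  \mathrm{bpc} = \frac{-\sum_i \log_2 p(t_i \mid t_{<i})}{\#\,\text{characters in } D}.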
Imagine you had the task of language modeling:
"Bob and Alice were fighting for first place but who won? [predict: Bob or Alice]"
The claim is that the language model's perplexity (confusion) should be equal regardless of how we split the text.
If we tokenize to "Bob" and "Alice" (words) then the prediction carries a far larger error than if we tokenize to "B|ob" and "A|lice" (wordpieces). The teacher's "supervision" at test time hands the model "A" or "B", collapsing the entropy, so the rest ("ob" / "lice") comes essentially for free.
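A toy calculation makes this concrete. The probabilities below are made up (50/50 on which name won, near-certainty on the continuation once its first piece is teacher-forced), so this is only a sketch, but it shows the total NLL barely moving while the per-token perplexity drops under the finer split:

import math

# Hypothetical teacher-forced probabilities for the answer "Alice":
# the model is only uncertain about which name won (50/50) and is
# near-certain about everything else.
word_probs      = [0.5]         # p("Alice" | context)
wordpiece_probs = [0.5, 0.99]   # p("A" | context), p("lice" | context, "A")

def nll_bits(probs):
    return -sum(math.log2(p) for p in probs)

def per_token_ppl(probs):
    return 2 ** (nll_bits(probs) / len(probs))

for name, probs in [("word", word_probs), ("wordpiece", wordpiece_probs)]:
    print(f"{name:9s} total NLL = {nll_bits(probs):.3f} bits, "
          f"per-token perplexity = {per_token_ppl(probs):.3f}")

# prints (approximately):
# word      total NLL = 1.000 bits, per-token perplexity = 2.000
# wordpiece total NLL = 1.014 bits, per-token perplexity = 1.421

The one bit of genuine uncertainty doesn't disappear; it just gets averaged over more teacher-forced positions, which is why per-token perplexities under different tokenizations aren't directly comparable without renormalising.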
Standard ambiguities apply too, e.g. "special|ized" vs "special|ised". The dynamics of the neural network would certainly change - larger but less frequent bursts of entropy with words - but I don't know whether that constitutes an unfair or unreasonable change in the model's operation.
- From a compression point of view this is reasonable, as the decoder always has the correct preceding tokens to condition on, so the next token is guaranteed to be decoded correctly.
- From a language modeling point of view I'm less certain. It's an Ouroboros: at generation time you feed the output of your model back in as input, and losing the oracle teacher is a big change (a toy contrast is sketched below).
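To make the Ouroboros point concrete, here's a minimal sketch using the same made-up toy model as above (entirely hypothetical probabilities), contrasting teacher-forced scoring, where every prefix is the gold one, with free-running generation, where the model's own sampled piece is fed back in and a wrong "B" commits the rest of the output to "Bob":

import math, random

# Hypothetical next-piece model: 50/50 on the first piece of the name,
# near-certain continuation once that piece is fixed.
def next_piece_probs(prefix):
    if not prefix:
        return {"A": 0.5, "B": 0.5}
    if prefix[-1] == "A":
        return {"lice": 0.99, "ob": 0.01}
    return {"ob": 0.99, "lice": 0.01}

gold = ["A", "lice"]

# Teacher forcing: "lice" is always scored given the oracle-provided "A".
nll = -sum(math.log2(next_piece_probs(gold[:i])[piece])
           for i, piece in enumerate(gold))
print(f"teacher-forced NLL: {nll:.3f} bits")

# Free-running generation: each sampled piece is fed back as input, so the
# model must commit to "A" or "B" itself; roughly half the runs say "Bob".
random.seed(0)
for _ in range(4):
    prefix = []
    for _ in range(2):
        probs = next_piece_probs(prefix)
        pieces, weights = zip(*probs.items())
        prefix.append(random.choices(pieces, weights=weights)[0])
    print("generated:", "".join(prefix))

The teacher-forced number is identical whichever name the model would actually have committed to, which is exactly the gap between perplexity and generation quality being asked about here.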
"The tokenization Ouroboros issue" seems to definitely have some manner of impact on how well perplexity as a metric will translate to real world generation, assuming a word vs wordpiece LM with equal perplexity, but perhaps that's outside of the scope of perplexity as a metric.
So, dear reader, have you pondered such things? Do you have an answer? My SotA claim depends on it! Most importantly, however: the creation of future datasets. If all tokenizations are equal, all datasets are just text blobs. Otherwise they must remain specifically tokenized sequences.