- Is there a rigorous proof that, for a dataset D tokenized with two tokenizers A and B, the resulting perplexities are equivalent?
- Does that proof take into account teacher forcing?
"Bob and Alice were fighting for first place but who won? [predict: Bob or Alice]"
The claim is that the language model's perplexity (its confusion) should be the same regardless of how we split the text into tokens.
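One way to make that claim precise, as a hedged sketch: assume both tokenizers are lossless and the model assigns the same total probability P(D) to the raw text either way. Then the total negative log-likelihood is the split-independent quantity, while per-token perplexity depends on how many tokens each split produces (N_A vs. N_B):

```latex
% Sketch, assuming lossless tokenizers and equal total probability P(D) of the raw text;
% N_A, N_B are the token counts produced by A and B.
\mathrm{PPL}_A = P(D)^{-1/N_A}, \qquad \mathrm{PPL}_B = P(D)^{-1/N_B}
% The comparable, split-independent quantity is the total NLL (or bits per byte):
N_A \log \mathrm{PPL}_A \;=\; N_B \log \mathrm{PPL}_B \;=\; -\log P(D)
```

Under those assumptions, per-token perplexity is only equal when N_A = N_B; otherwise the fair comparison is total NLL or bits-per-byte rather than per-token perplexity.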
- From a language-modeling point of view I'm less certain. At generation time it's an Ouroboros: you're feeding the model's own output back in as input. Losing the oracle teacher is a big change.
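To illustrate the distinction being drawn, here is a toy sketch (all names hypothetical; `next_token_probs` stands in for a real LM): teacher-forced evaluation always conditions on the ground-truth prefix, while free-running generation conditions on whatever the model itself produced so far.

```python
import numpy as np

VOCAB = ["Bob", "Alice", "won", "<eos>"]

def next_token_probs(prefix):
    # Toy stand-in for a trained LM: a fixed (prefix-seeded) distribution
    # over VOCAB. A real model would be a neural network.
    local = np.random.default_rng(abs(hash(tuple(prefix))) % (2**32))
    logits = local.normal(size=len(VOCAB))
    return np.exp(logits) / np.exp(logits).sum()

def teacher_forced_nll(tokens):
    # Evaluation/training with teacher forcing: the ground-truth prefix is
    # fed at every step, regardless of what the model would have predicted.
    nll = 0.0
    for t in range(1, len(tokens)):
        p = next_token_probs(tokens[:t])
        nll -= np.log(p[VOCAB.index(tokens[t])])
    return nll

def free_running_sample(prompt, max_new=3, seed=0):
    # Generation: the model's own samples become the next inputs
    # (the "Ouroboros" loop from the comment above).
    rng = np.random.default_rng(seed)
    out = list(prompt)
    for _ in range(max_new):
        p = next_token_probs(out)
        out.append(VOCAB[rng.choice(len(VOCAB), p=p)])
    return out

print(teacher_forced_nll(["Bob", "won", "<eos>"]))
print(free_running_sample(["Bob"]))
```

The point of the toy: perplexity is computed in the teacher-forced setting, so any equivalence argument about perplexity says nothing directly about free-running behavior.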