Xiaochuang Han
Research Scientist at Meta FAIR. Formerly @uwnlp.
Dec 9
Can we simplify video generation by decomposing it into interleaved text-video co-generation?
Would explicit, repeated thinking in language improve generation in pixels?

We introduce TV2TV: a unified model that jointly learns
- language modeling (next-token prediction)
- video flow matching (next-frame prediction)

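A minimal sketch of what such a joint objective might look like, assuming a PyTorch-style setup; the function names, the `model` interface, and the rectified-flow parameterization are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def joint_loss(model, text_ids, video_latents, text_mask, video_mask):
    """Hypothetical combined loss over one interleaved sequence.
    text_ids: (B, T) token ids; video_latents: (B, T, D) clean video latents.
    text_mask / video_mask: (B, T) booleans marking which positions hold
    text tokens vs. video-latent positions."""
    # Sample flow-matching time t and noise for the video positions.
    t = torch.rand(video_latents.shape[:2], device=video_latents.device)  # (B, T)
    noise = torch.randn_like(video_latents)
    # Rectified-flow style interpolation x_t = (1 - t) * noise + t * x_1.
    x_t = (1 - t[..., None]) * noise + t[..., None] * video_latents
    target_velocity = video_latents - noise  # velocity the model should predict

    # One forward pass over the interleaved text/video sequence (assumed API).
    text_logits, pred_velocity = model(text_ids, x_t, t)

    # Next-token prediction on text positions.
    lm_loss = F.cross_entropy(
        text_logits[:, :-1][text_mask[:, 1:]],
        text_ids[:, 1:][text_mask[:, 1:]],
    )
    # Flow-matching regression on video positions.
    fm_loss = F.mse_loss(pred_velocity[video_mask], target_velocity[video_mask])
    return lm_loss + fm_loss
```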
At inference, TV2TV dynamically alternates between textual thinking and video generation.
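One way to picture that alternation, as a hedged sketch: the `<video>` boundary token, the `model.*` methods, and the simple Euler sampler below are assumptions for illustration, not the paper's exact decoding procedure.

```python
import torch

VIDEO_START = "<video>"  # assumed special token that switches to video mode
EOS = "<eos>"

def generate(model, first_frame, max_events=20, flow_steps=32):
    context = [first_frame]  # single conditioning frame
    for _ in range(max_events):
        # 1) Textual thinking: sample tokens until a mode switch or EOS.
        text, tok = [], None
        while True:
            tok = model.sample_next_token(context + text)
            if tok in (VIDEO_START, EOS):
                break
            text.append(tok)
        context += text
        if tok == EOS:
            break
        # 2) Video generation: integrate the learned velocity field from
        #    noise to a clean latent chunk (Euler steps), then append the
        #    clean chunk so it conditions everything that follows.
        chunk = torch.randn(model.chunk_shape)
        for i in range(flow_steps):
            t = torch.full((1,), i / flow_steps)
            chunk = chunk + model.predict_velocity(context, chunk, t) / flow_steps
        context.append(chunk)
    return context
```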

Model generations below: interleaved text plans and video slices (~1–2s) are co-generated over time, conditioned on a single frame per sport.

📖 arxiv.org/abs/2512.05103

TV2TV builds on Transfusion, LMFusion, and Mixture-of-Transformers, but goes further by enabling interleaved text-video co-generation in a single model. It models text autoregressively and video in short chunks semi-autoregressively (flow matching within ~0.25s clips, autoregressive globally). Each text token or video chunk attends to all previous text tokens and clean video chunks.
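A small sketch of that attention pattern, assuming a Transfusion-style mask where text is causal and positions within the same video chunk also see each other bidirectionally; the helper and its encoding of chunk ids are hypothetical:

```python
import torch

def build_mask(chunk_ids):
    """chunk_ids: (T,) long tensor; -1 for text positions, otherwise the
    index of the video chunk a position belongs to. Returns a (T, T)
    boolean mask where True means 'may attend'."""
    T = chunk_ids.shape[0]
    idx = torch.arange(T)
    causal = idx[:, None] >= idx[None, :]  # standard causal mask
    same_chunk = (chunk_ids[:, None] == chunk_ids[None, :]) & (chunk_ids[:, None] >= 0)
    return causal | same_chunk             # add within-chunk bidirectionality

# Example: 3 text tokens, one 4-position video chunk, then 2 more text tokens.
mask = build_mask(torch.tensor([-1, -1, -1, 0, 0, 0, 0, -1, -1]))
```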