Jiatao Gu
Assistant Prof @CIS_Penn and Staff ML Researcher at @Apple (MLR) | ex-FAIR | PhD @HKUniversity | Research on Generative AI & World Models. I also speak Japanese.
May 15 10 tweets 4 min read
Can fast generative models still be likelihood-based?

Excited to share our new work from @Apple MLR: Normalizing Trajectory Models

a step toward high-quality few-step generation with exact trajectory likelihood, powered by normalizing flows.

Paper: huggingface.co/papers/2605.08… [1/9]

Diffusion and flow-matching models typically generate through many small steps, where simple denoising transitions are a reasonable approximation.

But when we compress generation into only a few coarse steps, the reverse transitions become much more complex.

[2/9]
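
For readers new to the "exact likelihood" claim: normalizing flows get it from the change-of-variables identity, log p_X(x) = log p_Z(f(x)) + log|det ∂f/∂x|, where f is an invertible map to a simple base density. The snippet below is only a minimal sketch of that identity for a one-dimensional affine flow. The function names and the affine form are illustrative assumptions, not the Normalizing Trajectory Models method.

```python
import math


def standard_normal_logpdf(z: float) -> float:
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x: float, mu: float, sigma: float) -> float:
    """Exact log-density of x under the toy flow x = mu + sigma * z, z ~ N(0, 1).

    Hypothetical helper for illustration only.
    """
    if sigma == 0.0:
        raise ValueError("sigma must be non-zero for the map to be invertible")
    # Invert the flow: z = f(x).
    z = (x - mu) / sigma
    # Jacobian term: dz/dx = 1 / sigma, so log|dz/dx| = -log|sigma|.
    log_abs_det = -math.log(abs(sigma))
    # Change of variables: base log-density plus log-Jacobian.
    return standard_normal_logpdf(z) + log_abs_det


if __name__ == "__main__":
    print(affine_flow_logpdf(1.5, mu=1.0, sigma=2.0))
```

Because the map is invertible and its Jacobian is tractable, the likelihood is exact rather than a bound. In this toy case it matches the closed-form log-density of N(mu, sigma²).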
May 11 10 tweets 5 min read
Excited to share STARFlow2 from Apple MLR:
🥨Bridging Language Models and Normalizing Flows for Unified Multimodal Generation.

Can one model understand, reason, and generate continuous images with a single unified autoregressive mechanism?

Paper: huggingface.co/papers/2605.08… 1/9

A core challenge in unified models is structural mismatch:
LMs decode text causally with a KV cache, while top image generators rely on iterative full-image denoising. This makes interleaved text-image generation unnatural and often requires re-encoding visual outputs. 2/9

Transfusion: unifying text generation and image diffusion into a single architecture.
Transfusion requires customized attention masks to switch between causal text decoding and full-image denoising.
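
To make the "customized attention mask" point concrete, here is a minimal sketch of a mixed mask: causal over the whole sequence, but bidirectional among tokens of the same image. The function name, the `is_image` / `image_id` inputs, and the NumPy layout are assumptions for illustration. This is not code from Transfusion or STARFlow2.

```python
import numpy as np


def mixed_attention_mask(is_image, image_id):
    """Boolean (T, T) mask; mask[i, j] is True when token i may attend to token j.

    is_image: per-token flags, True for image tokens (hypothetical input format).
    image_id: per-token image index; the value is ignored for text tokens.
    """
    is_image = np.asarray(is_image, dtype=bool)
    image_id = np.asarray(image_id)
    num_tokens = is_image.shape[0]
    # Causal part: every token sees itself and all earlier tokens.
    causal = np.tril(np.ones((num_tokens, num_tokens), dtype=bool))
    # Bidirectional part: image tokens also see later tokens of the same image.
    same_image = (
        is_image[:, None]
        & is_image[None, :]
        & (image_id[:, None] == image_id[None, :])
    )
    return causal | same_image


# Sequence layout: 2 text tokens, 3 tokens of image 0, 1 text token.
is_image = [False, False, True, True, True, False]
image_id = [-1, -1, 0, 0, 0, -1]
mask = mixed_attention_mask(is_image, image_id)
```

Text tokens stay strictly causal, so KV caching still works for them. The image block breaks causality internally, which is the kind of special-casing a single autoregressive mechanism would avoid.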
Dec 12, 2025 11 tweets 5 min read
(1/n) There’s a long-running debate on bringing representation learning into generative modeling—their latent spaces play different roles.

🚀🚀 We present FAE, a simple-yet-effective framework that bridges them with a single attention layer!

Paper: huggingface.co/papers/2512.07…

FAE achieves 13x/7x faster convergence than RAEs.

(2/n) Why might this be exciting?
🔸 ImageNet256 SOTA FID w/o CFG: 1.48/2.08 (800/80 epochs)
🔸 Near-SOTA FID w/ CFG: 1.29/1.70 (800/80 epochs)
🔸 Same latents work for both diffusion and NF models on ImageNet and T2I tasks;
🔸 A simple layer bridges the two latent spaces while preserving semantics!
Oct 12, 2024 8 tweets 4 min read
🚀Excited to introduce our recent work @ AppleMLR --
DART: Denoising AutoRegressive Transformer for Scalable Text-to-Image Generation!
A transformer-based model that unifies Autoregressive and Diffusion with a non-Markovian diffusion framework:
🔗 arxiv.org/abs/2410.08159 (1/n)

A diffusion model (DM) is limited by its Markovian process: each step depends only on the current input at that timestep. Unlike a DM, DART leverages the full generative trajectory while retaining the benefits of progressive modeling, leading to more efficient and flexible generation. (2/n)
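
The Markovian versus non-Markovian distinction can be shown with a toy sampling loop. The sketch below is only an interface-level illustration: the scalar "states", the function names, and the toy update rules are all assumptions, and none of it reflects DART's actual architecture or training.

```python
from typing import Callable, List


def markov_sample(
    x_start: float, step: Callable[[float, int], float], num_steps: int
) -> float:
    """Markovian loop: each update sees only the current state."""
    x = x_start
    for t in range(num_steps, 0, -1):
        x = step(x, t)
    return x


def trajectory_sample(
    x_start: float, step: Callable[[List[float], int], float], num_steps: int
) -> float:
    """Non-Markovian loop: each update sees the whole trajectory so far."""
    history = [x_start]
    for t in range(num_steps, 0, -1):
        history.append(step(history, t))
    return history[-1]


def halve_current(x: float, t: int) -> float:
    """Toy Markovian update (stand-in for a learned denoiser)."""
    return 0.5 * x


def halve_history_mean(history: List[float], t: int) -> float:
    """Toy update that conditions on every earlier state."""
    return 0.5 * sum(history) / len(history)


if __name__ == "__main__":
    print(markov_sample(8.0, halve_current, 3))
    print(trajectory_sample(8.0, halve_history_mean, 3))
```

The only structural difference is what the step function receives. In a transformer, conditioning on the full history maps naturally onto autoregressive attention over earlier states.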
Oct 24, 2023 7 tweets 5 min read
📢 Introducing our latest research @Apple MLR for generating high-quality images & videos with a multi-resolution diffusion model -- Matryoshka Diffusion Models or MDM🪆, directly in pixel space (~1024px) without any VAEs or cascaded models. Code will be released soon! (1/n)

MDM is a single generative model that handles various high-resolution targets:
Images 🖼️
Text-to-Images 📜➡️🖼️
Text-to-Videos 📜➡️🎥
Unlike existing work, MDM needs neither a pre-trained VAE (e.g., SD) nor multiple separately trained upscaling modules (e.g., Imagen). (2/n)
