Latest Twitter Threads by @sainingxie on Thread Reader App

Oct 14 • 8 tweets • 6 min read

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right.

today, we introduce Representation Autoencoders (RAE).

>> Retire VAEs. Use RAEs. 👇(1/n)

diffusion transformers have come a long way, but most still lean on the old 2021 sd-vae for their latent space.

that causes a few big issues:
1. outdated backbones make the architecture more complex than it needs to be. the sd-vae runs at around 450 gflops, while a simple ViT-B encoder only needs about 22 gflops.

2. over-compressed latent spaces (just 4 channels) limit how much information can be stored. compression leads to intelligence they say, but not here: VAE-style compression doesn’t actually do much. it’s basically as limited as raw 3-channel pixels.

3. weak representations: with reconstruction-only training, the VAE learns weak features (~8% linear probe), which ends up slowing convergence and hurting generation quality. we’ve learned by now: representation matters for generation quality. and the sd-vae is just not built for that (2/n)
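
A quick sanity check on the compute gap in point 1, using only the two approximate figures quoted above (around 450 gflops for the sd-vae, about 22 for a ViT-B encoder). The helper name `cost_ratio` is made up for illustration.

```python
def cost_ratio(baseline_gflops: float, candidate_gflops: float) -> float:
    """Return how many times more compute the baseline needs than the candidate."""
    if candidate_gflops <= 0:
        raise ValueError("candidate_gflops must be positive")
    return baseline_gflops / candidate_gflops


# Approximate figures quoted in the thread, not measured here.
SD_VAE_GFLOPS = 450.0
VIT_B_ENCODER_GFLOPS = 22.0

ratio = cost_ratio(SD_VAE_GFLOPS, VIT_B_ENCODER_GFLOPS)
print(f"sd-vae needs roughly {ratio:.0f}x the compute of the ViT-B encoder")
```

So the quoted numbers put the sd-vae at roughly twenty times the cost of the ViT-B encoder.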

Dec 22, 2024 • 8 tweets • 5 min read

Video understanding is the next frontier, but not all videos are alike. Models now reason over youtube clips and feature films, but what about the everyday spaces we—and our future AI assistants—navigate and experience?
Introducing Thinking in Space, our latest study exploring how multimodal LLMs see, remember and recall spaces. 🧵[1/n]
vision-x-nyu.github.io/thinking-in-sp…

In vision, we handle space but rarely reason; multimodal LLMs think but often ignore spatial logic. Yet as humans, from taking a mental rotation test to picking out furniture for a new home, we rely on spatial and visual thinking that doesn’t always translate well into words. [2/n]

Oct 13, 2024 • 6 tweets • 4 min read

Representation matters.
Representation matters.
Representation matters, even for generative models.

We might've been training our diffusion models the wrong way this whole time. Meet REPA: Training Diffusion Transformers is easier than you think! (🧵1/n)
sihyun.me/REPA/

People (in academia) always tell me that training DiTs/SiTs is way too hard because it takes 7M iters and weeks to get the FID we reported in the paper. We figured out how to speed up training by ~18X, hitting even better FID in less than 400K iters. We did this by digging into the representation learned from diffusion models (2/n).
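
The ~18X figure follows from the two iteration counts in the tweet. A minimal check, assuming 7M and 400K are the only inputs (the function name is illustrative):

```python
def training_speedup(baseline_iters: int, new_iters: int) -> float:
    """Ratio of baseline training iterations to the new iteration count."""
    if new_iters <= 0:
        raise ValueError("new_iters must be positive")
    return baseline_iters / new_iters


BASELINE_ITERS = 7_000_000  # "7M iters" quoted for the original DiT/SiT runs
REPA_ITERS = 400_000        # "less than 400K iters", so this is an upper bound

speedup = training_speedup(BASELINE_ITERS, REPA_ITERS)
print(f"at least {speedup:.1f}x fewer iterations")
```

Because the thread says "less than 400K", the ratio computed here is a lower bound, which is consistent with the rounded ~18X.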

Jun 26, 2024 • 10 tweets • 8 min read

Introducing Cambrian-1, a fully open project from our group at NYU. The world doesn't need another MLLM to rival GPT-4V. Cambrian is unique as a vision-centric exploration & here's why I think it's time to shift focus from scaling LLMs to enhancing visual representations.🧵[1/n]

From our previous projects (MMVP, V*, VIRL), we've noticed unexpected visual shortcomings in current MLLM systems. While we can temporarily fix issues by e.g. adding data, one root problem is that our visual representations are not yet sufficient for language understanding.
In the short term, projects like Astra and GPT-4o are impressive. However, to develop a reliable multimodal assistant that perceives the real world like humans, manages complex tasks robustly, and acts accordingly, weak sensory grounding will likely become a bottleneck.
Language priors are powerful, but we shouldn't use them as crutches (quoting @ylecun) to compensate for deficiencies in visual representations. [2/n]

Feb 16, 2024 • 4 tweets • 4 min read

Here's my take on the Sora technical report, with a good dose of speculation that could be totally off. First of all, really appreciate the team for sharing helpful insights and design decisions – Sora is incredible and is set to transform the video generation community.

What we have learned so far:
- Architecture: Sora is built on our diffusion transformer (DiT) model (published in ICCV 2023) — it's a diffusion model with a transformer backbone, in short:
DiT = [VAE encoder + ViT + DDPM + VAE decoder].
According to the report, it seems there aren't many additional bells and whistles.

- "Video compressor network": Looks like it's just a VAE but trained on raw video data. Tokenization probably plays a significant role in getting good temporal consistency. By the way, VAE is a ConvNet, so DiT technically is a hybrid model ;) (1/n)
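
To make the "VAE encoder + ViT" part of that recipe concrete, here is a rough shape walkthrough of how an image could become transformer tokens. It is a sketch under assumptions, not the DiT code: the 8x spatial downsampling and the patch size of 2 are assumed defaults, the 4 latent channels match the sd-vae figure mentioned elsewhere on this page, and `latent_token_grid` / `TokenGrid` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenGrid:
    grid_h: int      # number of patches along the height
    grid_w: int      # number of patches along the width
    token_dim: int   # values per token before the linear embedding

    @property
    def num_tokens(self) -> int:
        return self.grid_h * self.grid_w


def latent_token_grid(image_h: int, image_w: int, vae_downsample: int = 8,
                      latent_channels: int = 4, patch_size: int = 2) -> TokenGrid:
    """Shapes only: image -> VAE latent -> non-overlapping patches (tokens)."""
    stride = vae_downsample * patch_size
    if image_h % stride or image_w % stride:
        raise ValueError(f"image size must be a multiple of {stride}")
    latent_h = image_h // vae_downsample
    latent_w = image_w // vae_downsample
    return TokenGrid(
        grid_h=latent_h // patch_size,
        grid_w=latent_w // patch_size,
        token_dim=patch_size * patch_size * latent_channels,
    )


for size in (256, 512):
    grid = latent_token_grid(size, size)
    print(size, grid.grid_h, grid.grid_w, grid.num_tokens, grid.token_dim)
```

Under these assumptions the sequence length is set only by the grid size, which is why a different resolution or aspect ratio just means a different number of tokens for the same transformer.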

When Bill and I were working on the DiT project, instead of creating novelty (see my last tweet🤷‍♂️), we prioritized two aspects: simplicity and scalability. These priorities offer more than just conceptual advantages.

- Simplicity means flexibility. The cool thing about vanilla ViT that people often miss is how it makes your model way more flexible when it comes to working with input data. For example, in masked autoencoder (MAE), ViT helped us to just process the visible patches and ignore the masked ones. And similarly, Sora "can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid." UNet does not directly offer this flexibility.
👀Speculation: Sora might also use Patch n’ Pack (NaViT) from Google, to make DiT adaptable to variable resolutions/durations/aspect ratios.

- Scalability is the core theme of the DiT paper. First, an optimized DiT runs much faster than UNet in terms of wall-clock time per Flop. More importantly, Sora demonstrated that the DiT scaling law applies not just to images but now to videos as well -- Sora replicates the visual scaling behavior observed in DiT.
👀Speculation: In the Sora report, the quality for the first video is quite bad; I suspect it is using a base model size. A back-of-the-envelope calculation: DiT XL/2 is 5X GFLOPs of the B/2 model, so the final 16X compute model is probably 3X DiT-XL model size, which means Sora might have ~3B parameters – if true, this is not an unreasonable model size. It could suggest that training the Sora model might not require as many GPUs as one would anticipate – I would expect very fast iterations going forward. (2/n)
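
The back-of-the-envelope step in that speculation can be written out as below. This is only the arithmetic from the tweet, under its own assumptions: that the lowest-compute video comes from a B/2-sized model and that compute scales roughly with model size. The variable and function names are illustrative.

```python
def size_relative_to_xl(final_over_base_compute: float,
                        xl_over_base_compute: float) -> float:
    """How large the final model is relative to XL, if compute tracks model size."""
    if xl_over_base_compute <= 0:
        raise ValueError("xl_over_base_compute must be positive")
    return final_over_base_compute / xl_over_base_compute


XL_OVER_B = 5.0       # "DiT XL/2 is 5X GFLOPs of the B/2 model"
FINAL_OVER_B = 16.0   # the "16X compute" model, assuming the base is B/2-sized

multiple = size_relative_to_xl(FINAL_OVER_B, XL_OVER_B)
print(f"final model is about {multiple:.1f}x DiT-XL")
```

The result is a little over three, which the tweet rounds to 3X DiT-XL before turning it into a parameter estimate.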

Jan 5, 2024 • 7 tweets • 4 min read

🔍Introducing V*: exploring guided visual search in multimodal LLMs

MLLMs like GPT4V & LLaVA are amazing, but one concern keeps me up at night: the (frozen) visual encoder typically extracts global image tokens *only once*, regardless of resolution or scene complexity (1/n)
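
A small illustration of why "only once, regardless of resolution" is a concern. It assumes a CLIP-style ViT encoder with a fixed 336x336 input and 14-pixel patches (a common configuration, not something stated in the thread) and that every image is resized to that input before encoding. The function names are hypothetical.

```python
def tokens_per_image(encoder_input: int = 336, patch: int = 14) -> int:
    """Number of patch tokens a fixed-resolution ViT encoder produces."""
    if encoder_input % patch:
        raise ValueError("encoder_input must be a multiple of patch")
    side = encoder_input // patch
    return side * side


def source_pixels_per_token(img_w: int, img_h: int,
                            encoder_input: int = 336, patch: int = 14) -> float:
    """Original-image pixels that each token has to summarize after resizing."""
    return (img_w * img_h) / tokens_per_image(encoder_input, patch)


# The token budget is the same for every image under these assumptions;
# only the amount of detail squeezed into each token changes.
for w, h in [(336, 336), (1920, 1080), (4032, 3024)]:
    print(w, h, tokens_per_image(), round(source_pixels_per_token(w, h)))
```

With a fixed token budget, a high-resolution cluttered scene leaves each token covering far more of the original image, so small objects can effectively vanish before the LLM ever sees them.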

Why does this matter? Consider everyday situations like locating keys on a cluttered table or spotting a friend in a crowd: we engage our system II and actively *search* for the necessary visual info -- we do not have an 'internal CLIP' that shows us everything all at once. (2/n)
