TLDR: 1. Replaces the CNN encoder and decoder with a vision transformer ("ViT-VQGAN"), leading to significantly better speed-quality tradeoffs compared to the CNN-based VQGAN.
2. Vanilla VQVAE often learns rarely used / "dead" codebook vectors, leading to wasted capacity. Here, they add a linear projection of the code vectors into a lower-dimensional "lookup" space (sketched in the code below). This factorization of embedding / lookup consistently improves reconstruction quality.
3. Encoded latents and codebook vectors are L2-normalized, placing all of them on the unit sphere, where the Euclidean distance between a latent and a codebook vector corresponds to their cosine similarity. This further improves training stability and reconstruction quality.
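Points 2 and 3 combine naturally into a single quantization step. Here is a minimal PyTorch sketch of a factorized, L2-normalized codebook lookup; the class name, dimensions, and straight-through detail are illustrative assumptions on my part, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedNormalizedCodebook(nn.Module):
    """Sketch of a factorized codebook with L2-normalized lookup (assumed dims)."""

    def __init__(self, num_codes=8192, embed_dim=768, lookup_dim=32):
        super().__init__()
        # Codes live directly in the low-dimensional "lookup" space.
        self.codebook = nn.Embedding(num_codes, lookup_dim)
        # Factorization: project encoder outputs down for the lookup...
        self.down = nn.Linear(embed_dim, lookup_dim)
        # ...and project the selected codes back up for the decoder.
        self.up = nn.Linear(lookup_dim, embed_dim)

    def forward(self, z):  # z: (batch, tokens, embed_dim)
        z_low = F.normalize(self.down(z), dim=-1)           # unit sphere
        codes = F.normalize(self.codebook.weight, dim=-1)   # unit sphere
        # On the unit sphere, minimizing Euclidean distance is equivalent to
        # maximizing cosine similarity, so a dot product picks the nearest code.
        idx = (z_low @ codes.t()).argmax(dim=-1)
        z_q = codes[idx]
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z_low + (z_q - z_low).detach()
        return self.up(z_q), idx
```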
Inspired by the amazing work of @HvnsLstAngel I've been experimenting with a "color-quantized VQGAN"
Essentially, I introduced a codebook of possible colors and applied quantization in RGB space.
It's always fascinating how removing entropy can make samples more interesting...
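For concreteness, a rough sketch of what "quantization in RGB space" could look like: snap each pixel to its nearest entry in a fixed color codebook. The palette size and helper name here are my assumptions, not the author's actual code.

```python
import torch

def quantize_colors(images, palette):
    """images: (B, 3, H, W) in [0, 1]; palette: (K, 3) RGB codebook."""
    b, c, h, w = images.shape
    pixels = images.permute(0, 2, 3, 1).reshape(-1, 3)   # (B*H*W, 3)
    dists = torch.cdist(pixels, palette)                  # (B*H*W, K)
    nearest = palette[dists.argmin(dim=-1)]               # (B*H*W, 3)
    return nearest.reshape(b, h, w, 3).permute(0, 3, 1, 2)

# e.g. an 8-color palette drawn at random
palette = torch.rand(8, 3)
out = quantize_colors(torch.rand(2, 3, 64, 64), palette)
```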