For the techies:
Turns out sending gradients straight through this RGB quantization is not great for stability, so I'm also minimizing mean(quant_distances) to keep the raw image close to the quantized one!
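A minimal PyTorch sketch of what that could look like — all names here (`palette`, `quantize_with_st`) are illustrative, not the author's actual code:

```python
import torch

def quantize_with_st(img, palette):
    """img: (B, 3, H, W) in [0, 1], palette: (K, 3).
    Returns the straight-through quantized image + per-pixel distances."""
    B, C, H, W = img.shape
    pixels = img.permute(0, 2, 3, 1).reshape(-1, 3)      # (B*H*W, 3)
    dists = torch.cdist(pixels, palette)                 # (B*H*W, K)
    nearest = dists.min(dim=1)                           # values + indices
    quant = palette[nearest.indices].reshape(B, H, W, 3).permute(0, 3, 1, 2)
    # Straight-through: forward pass uses quant, gradients skip it entirely.
    quant_st = img + (quant - img).detach()
    return quant_st, nearest.values

palette = torch.rand(16, 3)                              # e.g. 16 colors
raw_img = torch.rand(2, 3, 32, 32, requires_grad=True)
quant_img, quant_distances = quantize_with_st(raw_img, palette)
aux_loss = quant_distances.mean()   # the stabilizer from the tweet above
```

Minimizing `quant_distances.mean()` pulls each raw pixel toward its nearest palette color, so the straight-through mismatch between forward and backward passes stays small.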
TLDR: 1. Replaces the CNN encoder and decoder with a vision transformer ("ViT-VQGAN"), leading to significantly better speed-quality tradeoffs compared to CNN-VQGAN
2. Vanilla VQVAE often learns rarely used / "dead" codebook vectors, leading to wasted capacity. Here, they add a linear projection of the code vectors into a lower-dimensional "lookup" space. This factorization of embedding / lookup consistently improves reconstruction quality (rough sketch below).
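My reading of that factorization, sketched in PyTorch — dimensions are illustrative, and the paper's l2-normalization of codes is omitted for brevity:

```python
import torch
import torch.nn as nn

class FactorizedVQ(nn.Module):
    def __init__(self, num_codes=8192, embed_dim=256, lookup_dim=32):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, lookup_dim)  # low-dim codes
        self.down = nn.Linear(embed_dim, lookup_dim)  # encoder -> lookup space
        self.up = nn.Linear(lookup_dim, embed_dim)    # code -> decoder space

    def forward(self, z):                             # z: (N, embed_dim)
        z_low = self.down(z)                          # match lookup space
        dists = torch.cdist(z_low, self.codebook.weight)
        idx = dists.argmin(dim=1)                     # nearest code per token
        z_q = self.up(self.codebook(idx))             # back to full dim
        return z + (z_q - z).detach(), idx            # straight-through
```

Searching for the nearest code in the small lookup space keeps more of the codebook in play, while the up-projection restores the capacity the decoder needs.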
Inspired by the amazing work of @HvnsLstAngel, I've been experimenting with a "color-quantized VQGAN".
Essentially, I introduce a codebook of possible colors and apply quantization in RGB space.
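To make the idea concrete, a toy NumPy version with a fixed illustrative palette (in the actual experiments the palette could just as well be learned):

```python
import numpy as np

palette = np.array([[0.0, 0.0, 0.0],     # black
                    [1.0, 1.0, 1.0],     # white
                    [1.0, 0.0, 0.0],     # red
                    [0.0, 0.0, 1.0]])    # blue

img = np.random.rand(64, 64, 3)          # (H, W, 3), values in [0, 1]
dists = np.linalg.norm(img[..., None, :] - palette, axis=-1)  # (H, W, K)
quantized = palette[dists.argmin(axis=-1)]    # nearest color per pixel
```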
It's always fascinating how removing entropy can make samples more interesting...