I discovered a bug in my own Diffusion + CLIP pipeline and suddenly the samples are unreal... 🤯
Here's
"Just a liquid reality..." #AIart#notdalle2#Diffusion#clip
"The magnificent portal of mother Gaia"
"Framing reality"
"Gathering at the great elder sphere"
"Why such a rush? It's all twisting and bending anyway"
"My hair is a living creature"
Caveat: all these pieces are the result of months of coding and parameter tuning, careful selection of initialization images, prompt engineering, and cherry-picking.
#dalle2 is incredible at compositionality and realism, but I haven't seen it do this yet.
This is a "3D-diffusion" video created using a combination of four different AI models 🤯
Welcome to the metaverse!
There's such incredible potential here that I want to explain how I made this, so here's a thread! (1/n)
The two main models that draw the pixels are a diffusion model and @OpenAI's CLIP model, which guides the generation with a language prompt.
This idea was introduced by @advadnoun and later refined by many other creatives. My talk at @Kikk_Festival further explains this:
The diffusion model (I integrated code from @RiversHaveWings and @Somnai_dreams for this) generates images by iteratively denoising noisy pixel images. Every time you run it from different noise, you get a different image, guided by the language prompt:
Finally playing around with CLIP + diffusion models.
12 GPU hours in, I gotta say I'm pretty impressed with the difference in aesthetics compared to VQGAN.
Big thanks to @RiversHaveWings & @Somnai_dreams for providing great starting code!
"a dystopian city"
"The real problem of humanity is that we have Paleolithic emotions, medieval institutions and godlike technology"
TLDR: 1. Replaces the CNN encoder and decoder with a vision transformer ("ViT-VQGAN"), leading to significantly better speed-quality tradeoffs compared to CNN-VQGAN.
2. Vanilla VQ-VAE often learns rarely used / "dead" codebook vectors, leading to wasted capacity. Here, they add a linear projection of the code vectors into a lower-dimensional "lookup" space. This factorization of embedding / lookup consistently improves reconstruction quality.
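My reading of that factorization, as a toy PyTorch sketch (class name and dimensions are made up, not the paper's code): features and codebook are l2-normalized in a small lookup space, and only the selected code gets projected back up for the decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedVectorQuantizer(nn.Module):
    """Toy VQ layer with a factorized, l2-normalized lookup space (dims are made up)."""
    def __init__(self, num_codes=8192, code_dim=256, lookup_dim=32):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, lookup_dim)  # codes live in the small space
        self.project_in = nn.Linear(code_dim, lookup_dim)    # encoder features -> lookup space
        self.project_out = nn.Linear(lookup_dim, code_dim)   # selected codes -> decoder space

    def forward(self, z):                                    # z: (batch, tokens, code_dim)
        z_low = F.normalize(self.project_in(z), dim=-1)      # l2-normalize the features
        codes = F.normalize(self.codebook.weight, dim=-1)    # l2-normalize the codebook
        idx = (z_low @ codes.t()).argmax(dim=-1)             # nearest code by cosine similarity
        z_q = F.normalize(self.codebook(idx), dim=-1)
        z_q = z_low + (z_q - z_low).detach()                 # straight-through gradient trick
        return self.project_out(z_q), idx

# usage: out, idx = FactorizedVectorQuantizer()(torch.randn(2, 196, 256))
```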
Inspired by the amazing work of @HvnsLstAngel I've been experimenting with a "color-quantized VQGAN"
Essentially, I introduce a codebook of possible colors and apply quantization in RGB space.
It's always fascinating how removing entropy can make samples more interesting...
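For the curious, a rough sketch of what that RGB-space quantization step could look like; the `color_quantize` helper and the random palette are illustrative stand-ins, not my actual code:

```python
import torch

def color_quantize(img, palette):
    """img: (3, H, W) tensor in [0, 1]; palette: (K, 3) tensor of allowed RGB colors."""
    c, h, w = img.shape
    pixels = img.permute(1, 2, 0).reshape(-1, 3)    # flatten to (H*W, 3)
    dist = torch.cdist(pixels, palette)             # distance from each pixel to each palette color
    nearest = palette[dist.argmin(dim=-1)]          # snap every pixel to its closest color
    return nearest.reshape(h, w, 3).permute(2, 0, 1)

# e.g. an 8-color palette sampled at random:
palette = torch.rand(8, 3)
quantized = color_quantize(torch.rand(3, 256, 256), palette)
```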