Retrieval Augmented #Diffusion (RDM) models: Smaller diffusion models can produce high-quality generations by accessing an external memory that guides the generation process. Inspired by DeepMind's RETRO.

A 🧶

Paper: arxiv.org/abs/2204.11824

Day 10 #30daysofDiffusion #MachineLearning
If the model can always rely on this external memory, it only has to learn the important parts of the image generation process, such as the composition of scenes, rather than, for example, memorizing what different dogs look like.
Setting: X is the training set and D is a *disjoint* image set used for retrieval. θ denotes the parameters of the diffusion model. ξ is the retrieval function, which takes in an image and selects "k" images from D. φ is a pretrained image encoder.
Both ξ and φ are pretrained, "fixed" functions; we do not modify them during training or inference. Only θ is learned/optimized during training.
So the generative model formulation boils down to learning a diffusion (or autoregressive) model conditioned on images that look similar to a given training image x.
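In symbols (my paraphrase of the setup above, not notation copied verbatim from the paper), the model learns

```latex
p_\theta\!\left(x \,\middle|\, \{\phi(y) : y \in \xi_k(x, D)\}\right), \qquad x \in X,
```

where ξ_k(x, D) ⊂ D is the set of k neighbors retrieved for x.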
In this paper, the authors use a CLIP image encoder for retrieval, with cosine similarity to choose the top "k" most similar images for a given query image. They also chose φ to be the CLIP image encoder.
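A minimal sketch of this retrieval step, assuming precomputed, L2-normalized CLIP image embeddings for D (names and helpers here are illustrative, not from the official code):

```python
import numpy as np

def retrieve_top_k(query_emb: np.ndarray, db_embs: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the k database images most similar to the query.

    query_emb: (d,)   CLIP embedding of the query image.
    db_embs:   (N, d) CLIP embeddings of the retrieval database D.
    Both are assumed L2-normalized, so a dot product equals cosine similarity.
    """
    sims = db_embs @ query_emb       # (N,) cosine similarities
    return np.argsort(-sims)[:k]     # indices of the top-k neighbors
```

Since D can hold millions of images, exhaustive search like this would in practice be replaced by an approximate nearest-neighbor index (the paper uses ScaNN).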
The training objective of the diffusion model is shown below. Think of Stable Diffusion, except the text representation is replaced by multiple image representations (of images similar to the one we are diffusing) cross-attending into the U-Net encoder.
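Spelled out, this is the usual denoising objective of latent diffusion with the neighbor encodings as the conditioning set; a sketch in my notation, with z_t the noised latent at timestep t:

```latex
\min_\theta \;\; \mathbb{E}_{x \sim X,\; \epsilon \sim \mathcal{N}(0, I),\; t}
\left[\, \big\| \epsilon - \epsilon_\theta\big(z_t,\, t,\, \{\phi(y) : y \in \xi_k(x, D)\}\big) \big\|_2^2 \,\right]
```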
The authors discuss three possible inference scenarios, where the input is (1) an image, (2) text, or (3) nothing (unconditional). The first case is easy, since the model was trained conditioned on retrieved images.
If we want to generate an image from text, the authors use the CLIP text encoder to find the closest matches in the database D: compute the dot-product similarity between clip_text(input text) and clip_image(D), and pick the "k" images with the highest similarity.
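The text case only changes how the query embedding is produced; a hypothetical sketch (clip_text_encoder stands in for any CLIP text encoder, and db_embs is again assumed L2-normalized):

```python
import numpy as np

def retrieve_for_prompt(prompt: str, clip_text_encoder, db_embs: np.ndarray, k: int = 4) -> np.ndarray:
    """Embed the prompt with CLIP's text encoder and retrieve the k nearest
    database images; the shared CLIP embedding space is what makes
    text-to-image dot products meaningful here."""
    q = clip_text_encoder(prompt)
    q = q / np.linalg.norm(q)        # normalize so dot product = cosine similarity
    sims = db_embs @ q               # (N,) text-image similarities
    return np.argsort(-sims)[:k]     # conditioning set for the diffusion model
```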
Unconditional generation is interesting. The simplest approach is to just sample a random image from the memory bank D. In reality, though, not every image in D is equally likely, since some of them are more similar to the training data than others. So how do we get around this?
The authors compute an MLE-style score for each image in D based on how often it appears in the top-k neighbor sets of the training data X.
Then, using this p(x̃) distribution, they sample an image from D, retrieve its top-k similar images from D again, and use those as guidance to generate the final image.
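Putting the last two tweets together, a sketch of the unconditional procedure as I read it (all names are illustrative; topk_sets would be precomputed by running the retriever over every training image in X):

```python
import numpy as np

def sample_unconditional_neighbors(topk_sets, db_embs, k=4, rng=None):
    """Two-stage unconditional sampling of a conditioning set.

    topk_sets: list of index arrays, the top-k neighbors in D of each
               training image in X (precomputed with the retriever).
    db_embs:   (N, d) L2-normalized CLIP embeddings of D.
    """
    rng = rng or np.random.default_rng()
    # Step 1: empirical distribution p(x~) over D -- how often each database
    # image appears in some training image's top-k neighbor set.
    counts = np.zeros(len(db_embs))
    for idx in topk_sets:
        counts[idx] += 1
    p = counts / counts.sum()
    # Step 2: sample a pseudo-query from D under p(x~), then retrieve its own
    # top-k neighbors (which include itself) to condition the generation.
    pseudo = rng.choice(len(db_embs), p=p)
    sims = db_embs @ db_embs[pseudo]
    return np.argsort(-sims)[:k]
```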
Most of the architectural components are the same as in the Latent Diffusion Model (LDM). The authors evaluated performance with different datasets as "D"; the RDM-OpenImages model performed best across various metrics.
A larger "k" leads to higher recall, which means sample diversity is less. In this paper, the authors used k=4 in most of the experiments. However larger k in training led to better generalization capabilities. Image
When you replace the retrieval database at inference time, some zero-shot style transfer abilities are observed.
This one is slightly confusing to me: the authors say that for the text-to-image synthesis case, if we take the CLIP text embedding of the prompt, use it to retrieve "k" images from D, and condition on both the text embedding and the nearest-neighbor images, the generations are not good. And conditioning on just the nearest neighbors leads to even worse generations. It is slightly strange that the model does not do well on the very thing it was trained for...
Official code (same repo as LDM) - github.com/CompVis/latent…


