DreamBooth: Assign a rare sequence of tokens as the subject's identifier and fine-tune the diffusion model on the small set of images with the "subject". A 🧵
The authors use the Imagen model in this paper, which encodes the text guidance with the T5-XXL language model, first generates a small 64x64 image, and then uses a super-resolution model to upscale it to 1024x1024.
The authors observed that fine-tuning all the modules (including SR module) results in the best performance.
To prevent overfitting of the model to the small input dataset, the authors use a prior-preservation loss, which acts like "distillation from the original pretrained model". In the loss term below, "c" is the prompt with the subject's identifier and c_pr is the same prompt without it.
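A minimal PyTorch-style sketch of what this two-term objective looks like, written as noise prediction; `unet`, `noise_scheduler`, and `lambda_prior` are my own placeholders, not the paper's code, and the prior images x_pr are assumed to be generated beforehand by the frozen pretrained model from the class prompt c_pr:

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(unet, noise_scheduler, x_subject, c_subject, x_prior, c_prior,
                    lambda_prior=1.0):
    """Reconstruction term on the few subject images + prior-preservation term on
    images sampled from the frozen pretrained model for the plain class prompt."""
    def denoising_loss(x0, cond):
        noise = torch.randn_like(x0)
        t = torch.randint(0, noise_scheduler.num_train_timesteps,
                          (x0.shape[0],), device=x0.device)
        x_t = noise_scheduler.add_noise(x0, noise, t)   # forward-diffuse the clean image
        pred = unet(x_t, t, cond)                       # predict the added noise
        return F.mse_loss(pred, noise)

    loss_subject = denoising_loss(x_subject, c_subject)  # prompt with the rare identifier, e.g. "a [V] dog"
    loss_prior = denoising_loss(x_prior, c_prior)        # plain class prompt, e.g. "a dog"
    return loss_subject + lambda_prior * loss_prior
```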
So the overall fine-tuning pipeline is as follows. In the figure, the first yellow module corresponds to the first loss term and the second yellow module to the prior-preservation term from the tweet above. Once the text-to-image part is fine-tuned on the input set of images, the authors then fine-tune the SR module.
The model is quite versatile and can do lots of things, like changing the subject's background, modifying the subject itself, changing the style, and so on.
Prior-preservation loss is important; otherwise, the model will output only the "subject" for the given class noun.
Failure cases: it might not work on rare categories, it might overfit and reproduce the training data for some instances, and the subject's appearance might change with certain text prompts.
1/ Let's start with the definition of "replication" in our study. We consider a generated image a copy if it is perceptually very similar to all, or a majority, of the patches of a training image. In the example below, we consider all the yellow-highlighted matches as potential copies.
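At its core, this kind of matching is a nearest-neighbour search in some perceptual feature space. A rough sketch of that idea, assuming a generic feature extractor `encoder` and a placeholder similarity threshold (these are my simplifications, not the paper's exact matching setup):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def flag_potential_copies(encoder, generated, training, threshold=0.5):
    """Flag generated images whose features are very close to some training image.
    `encoder` maps a batch of images to feature vectors."""
    g = F.normalize(encoder(generated), dim=-1)   # (N_gen, d)
    t = F.normalize(encoder(training), dim=-1)    # (N_train, d)
    sims = g @ t.T                                # cosine similarity matrix
    best_sim, best_idx = sims.max(dim=1)          # nearest training image per generation
    return best_sim > threshold, best_idx         # which generations look like copies, and of what
```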
Retrieval-Augmented #Diffusion (RDM) models: Smaller diffusion models can produce high-quality generations by accessing an external memory that guides the generation. Inspired by DeepMind's RETRO.
If the model can always rely on this external memory, it only has to learn the important parts of the image generation process, such as the composition of scenes, rather than, for example, remembering what different dogs look like.
Setting: X is the training set and D is a *disjoint* image set which is used for retrieval. θ denotes the parameters of the diffusion model. ξ is the retrieval function which takes in an image and selects "k" images from D. φ is a pretrained image encoder.
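A rough sketch of the retrieval step ξ under these definitions, assuming φ is a CLIP-style image encoder whose embeddings are also what the diffusion model is conditioned on (the function names here are placeholders):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_retrieval_index(phi, database_images):
    """Pre-compute normalized phi-embeddings for the disjoint retrieval set D."""
    return F.normalize(phi(database_images), dim=-1)         # (|D|, d)

@torch.no_grad()
def retrieve_neighbors(phi, query_image, db_feats, database_images, k=4):
    """xi(x): pick the k images from D whose phi-embeddings are closest to the query."""
    q = F.normalize(phi(query_image.unsqueeze(0)), dim=-1)   # (1, d)
    sims = (q @ db_feats.T).squeeze(0)                       # (|D|,)
    topk = sims.topk(k).indices
    return database_images[topk]

# During training, theta is conditioned on the phi-embeddings of these k neighbors
# (e.g., via cross-attention), so the model can "look up" appearance details
# instead of memorizing them.
```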
StructureDiffusion: Improve the compositional generation capabilities of text-to-image #diffusion models by modifying the text guidance using a constituency tree or a scene graph.
T2I models like SD produce aesthetically pleasing generations for a given prompt; however, most of us never get them right on the first try. Sometimes the model ignores part of the prompt, and some objects we want in the picture are missing.
Also sometimes the model gets adjectives mixed up. For example, in the figure below, the prompt is - "red car and white sheep". However, the model produced a red sheep too!
The authors address this compositionality issue in this paper.
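Roughly, the fix is on the text side: parse the prompt, encode each noun phrase separately, and let the cross-attention layers combine those separate encodings so attributes stay attached to the right nouns. A sketch of the text-side preparation (the `extract_noun_phrases` helper stands in for their constituency-tree parsing, and the way the encodings are merged inside cross-attention is only summarized in the comment):

```python
import torch

@torch.no_grad()
def structured_text_embeddings(text_encoder, tokenizer, prompt, extract_noun_phrases):
    """Encode the full prompt plus each noun phrase separately.
    e.g. extract_noun_phrases("a red car and a white sheep") -> ["a red car", "a white sheep"]."""
    spans = [prompt] + extract_noun_phrases(prompt)
    embeddings = []
    for span in spans:
        tokens = tokenizer(span, padding="max_length", return_tensors="pt")
        embeddings.append(text_encoder(tokens.input_ids)[0])   # (1, seq_len, d)
    # The denoiser's cross-attention is then run with each of these encodings and the
    # outputs are averaged, so "red" stays bound to "car" and "white" to "sheep".
    return embeddings
```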
InstructPix2Pix: Edit an image with text guidance in a single forward pass. Why use inversion or other tricks at inference time? Just create a dataset using those inversion techniques and train a new model.
Editing should be fast if you want to edit an image in real time. Methods like Textual Inversion or Prompt-to-Prompt optimize during inference, which makes them slow.
In this paper, the authors cleverly use such techniques to generate the training data and then fine-tune Stable Diffusion to perform edits in a single forward pass. They use two pretrained models to generate the data: the GPT-3 Davinci model and the SD model.
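A hedged sketch of that data-generation recipe (the `gpt3_edit`, `prompt_to_prompt_pair`, and `clip_filter` helpers are hypothetical wrappers around the two pretrained models, not the paper's actual code):

```python
def build_edit_dataset(captions, gpt3_edit, prompt_to_prompt_pair, clip_filter):
    """For each source caption: (1) ask GPT-3 for an edit instruction plus the edited
    caption, (2) generate a before/after image pair with SD + Prompt-to-Prompt so the
    two images share layout, (3) keep only consistent pairs."""
    dataset = []
    for caption in captions:
        instruction, edited_caption = gpt3_edit(caption)
        image_before, image_after = prompt_to_prompt_pair(caption, edited_caption)
        if clip_filter(caption, image_before, edited_caption, image_after):
            dataset.append({"input": image_before,
                            "instruction": instruction,
                            "output": image_after})
    return dataset

# The resulting (input image, instruction, edited image) triplets are used to
# fine-tune SD so that editing needs just one forward pass at test time.
```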