Gowthami Somepalli
Grad student @UMDCS. Past: @AIatMeta, @AmazonScience, @IITMadras. Currently working on #Diffusion and #Multimodal understanding. GPU poor. She/her.

Jan 9, 2023, 11 tweets

InstructPix2Pix: edit an image with text guidance in a single forward pass. Why run inversion or other tricks at inference time? Just use inversion techniques to create a dataset, then train a new model on it.

A 🧶

Paper: arxiv.org/abs/2211.09800

Day 8 #30daysofDiffusion #Diffusion #MachineLearning

Real-time image editing needs to be fast. Methods like Textual Inversion or Prompt-to-Prompt optimize during inference, which makes them slow.

In this paper, the authors cleverly use such techniques to generate the training data, then fine-tune Stable Diffusion (SD) to perform edits in a single forward pass. They use two pretrained models to generate the data: GPT-3 (Davinci) and SD.

Why GPT, you might wonder? Generating lots of edit instructions manually is hard, so the authors first create a small dataset (~700 examples) and fine-tune GPT-3 on it, which in turn generates a large set (>400k) of "plausible" edits to the captions.

The authors then take each edited caption and run it through Prompt-to-Prompt to generate the modified image. Check out this thread to see how Prompt-to-Prompt works.

The data generation pipeline is shown below.
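The pipeline described above can be sketched in pseudocode. A minimal sketch; every function here is a hypothetical placeholder standing in for the fine-tuned GPT-3 and the Prompt-to-Prompt + SD stage, and the strings are illustrative, not real dataset entries:

```python
# Sketch of the InstructPix2Pix data-generation pipeline.
# generate_edit and prompt_to_prompt are hypothetical stand-ins for
# the fine-tuned GPT-3 model and Prompt-to-Prompt + Stable Diffusion.

def generate_edit(caption):
    """Fine-tuned GPT-3: caption -> (edit instruction, edited caption)."""
    return "have her ride a dragon", caption.replace("horse", "dragon")

def prompt_to_prompt(caption, edited_caption):
    """Prompt-to-Prompt + SD: renders a consistent before/after image pair."""
    return f"<image of: {caption}>", f"<image of: {edited_caption}>"

def make_training_example(caption):
    instruction, edited_caption = generate_edit(caption)
    image, edited_image = prompt_to_prompt(caption, edited_caption)
    return {"instruction": instruction,
            "image": image,
            "edited_image": edited_image}

ex = make_training_example("a girl riding a horse")
print(ex["instruction"])  # have her ride a dragon
```

In the real pipeline, images whose before/after pair is inconsistent get filtered out; this sketch skips that step.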

Now we have a dataset of {instruction, image, edited image}. The authors fine-tune the SD model, with slight modifications to the encoder module, to condition on the original image. The loss objective is shown below; c_i is the original image and c_T is the edit instruction.
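As I recall from the paper, the objective is the standard latent-diffusion loss extended to condition on the encoded original image. Written in the thread's notation, with z_t the noised latent, E the VAE encoder, and ε the added noise:

```latex
L = \mathbb{E}_{\mathcal{E}(x),\, \mathcal{E}(c_i),\, c_T,\, \epsilon \sim \mathcal{N}(0,1),\, t}
    \left[ \left\| \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, \mathcal{E}(c_i),\, c_T\right) \right\|_2^2 \right]
```

The only change from vanilla SD training is the extra input E(c_i): the encoded original image is concatenated into the U-Net's input channels.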

Another important point: SD uses classifier-free guidance, and we want to control how much "guidance" goes into a generation. This model has both text and image guidance, so the updated score looks like the following.

s_i controls similarity with the input image, while s_t controls consistency with the edit instruction. We can see the effect of both these hyperparameters in the figure below.
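The two-scale combination can be sketched numerically. A minimal NumPy sketch; e_uncond, e_img, and e_full are my shorthand for the model's noise predictions with no conditioning, image-only conditioning, and image + text conditioning:

```python
import numpy as np

def combined_score(e_uncond, e_img, e_full, s_i, s_t):
    """Two-condition classifier-free guidance, InstructPix2Pix style.

    e_uncond : prediction with no conditioning
    e_img    : prediction conditioned on the input image only
    e_full   : prediction conditioned on image + edit instruction
    s_i, s_t : image / text guidance scales
    """
    return (e_uncond
            + s_i * (e_img - e_uncond)    # push toward the input image
            + s_t * (e_full - e_img))     # push toward the edit instruction

# Toy noise predictions (in practice these are U-Net outputs).
e_uncond = np.array([0.0, 0.0])
e_img    = np.array([1.0, 0.0])
e_full   = np.array([1.0, 1.0])

# With s_i = s_t = 1 the combination reduces to the fully conditioned score.
print(combined_score(e_uncond, e_img, e_full, 1.0, 1.0))  # [1. 1.]
```

Raising s_t above 1 extrapolates further along the text-guidance direction (stronger edits), while raising s_i pulls the result back toward the input image.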

I love this example. Here they show different variants of Abbey Road album covers.

Failure cases: the model is only as good as its training data, so it inherits the issues of GPT-3 and Prompt-to-Prompt editing. It struggles especially with spatial reasoning (e.g., moving an object from left to right).
