StructureDiffusion: Improve the compositional generation capabilities of text-to-image #diffusion models by modifying the text guidance using a constituency tree or a scene graph.
T2I models like SD produce aesthetically pleasing generations for a given prompt; however, most of us never get them right on the first try. Sometimes the model ignores part of the prompt, and objects we want in the picture go missing.
The model also sometimes mixes up adjectives. For example, in the figure below the prompt is "red car and white sheep", yet the model produced a red sheep too!
The authors address this compositionality issue in this paper.
This paper sort of builds on ideas introduced in the Prompt-to-Prompt paper. If you haven't read that paper, check out the summary here.
Some notation: in SD, the text guidance is provided via cross-attention. Q_t is the query vector from the image, W_p is the prompt representation matrix generated by CLIP, and K_p and V_p are the keys and values computed from W_p. Let M_t be the attention map computed between Q_t and K_p.
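For reference, in this notation the standard scaled dot-product cross-attention used in SD is M_t = softmax(Q_t K_p^T / sqrt(d)) and O_t = M_t V_p, where d is the attention dimension and O_t is the layer output.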
First, the authors use an off-the-shelf parser to extract a collection of concepts (noun phrases) from all hierarchical levels of the parse, C = {c_1, c_2, ..., c_k}, with c_p being the full prompt. Each concept is then encoded separately with CLIP, so we end up with a list of W (and corresponding V) representations, one for each level.
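A minimal sketch of this step is below. The parse string is hardcoded for illustration (the paper uses an off-the-shelf constituency parser), and the CLIP checkpoint name is the text encoder SD v1 uses; treat the variable names as my own, not the paper's code.

```python
import torch
from nltk import Tree
from transformers import CLIPTextModel, CLIPTokenizer

prompt = "red car and white sheep"
# Hardcoded constituency parse for illustration; in practice a parser
# produces this tree from the prompt.
parse = Tree.fromstring(
    "(NP (NP (JJ red) (NN car)) (CC and) (NP (JJ white) (NN sheep)))"
)

# Collect noun phrases from every level of the tree, plus the full prompt.
noun_phrases = [" ".join(t.leaves()) for t in parse.subtrees(lambda t: t.label() == "NP")]
concepts = [prompt] + [np for np in noun_phrases if np != prompt]
# -> ["red car and white sheep", "red car", "white sheep"]

# Encode each concept separately with the frozen CLIP text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
with torch.no_grad():
    tokens = tokenizer(concepts, padding="max_length", return_tensors="pt")
    W_list = text_encoder(**tokens).last_hidden_state  # one W_i per concept: (k+1, 77, 768)
```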
Modify the output of the attention layer, O_t, to be a linear combination of the V_i's from all the levels. Note that M_t is the attention map computed with the whole prompt.
The authors also propose a variant where the attention maps change per level too (Q_t is fixed but K_p changes). They suggest that this formulation helps with the issue of omitted objects in generated images.
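Here is a minimal single-head sketch covering both the base formulation and the key-swapping variant. The function name, shapes, and the uniform averaging are my reading of the description above, not the paper's exact code.

```python
import math
import torch

def structured_cross_attention(Q_t, W_list, W_k, W_v, swap_keys=False):
    """Sketch: Q_t is (n_pixels, d) image queries; W_list holds the CLIP
    encodings of all parse levels, with W_list[0] = full prompt; W_k, W_v are
    the frozen key/value projection matrices of the cross-attention layer."""
    d = Q_t.shape[-1]
    K_p = W_list[0] @ W_k                                   # keys from the full prompt
    M_t = torch.softmax(Q_t @ K_p.T / math.sqrt(d), dim=-1)

    O_t = torch.zeros_like(Q_t)
    for W_i in W_list:
        if swap_keys:                                       # variant: keys (and hence M_t) change per level
            K_i = W_i @ W_k
            M_t = torch.softmax(Q_t @ K_i.T / math.sqrt(d), dim=-1)
        V_i = W_i @ W_v                                     # values from level i
        O_t = O_t + M_t @ V_i
    return O_t / len(W_list)                                # uniform linear combination over levels
```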
To summarize the whole process: essentially, we condition the generation on encodings of noun phrases from multiple parse-tree levels instead of a single encoding of the whole prompt.
Results: the authors ran a human evaluation to test the fidelity of the generations on two benchmarks, ABC and CC-500 (both introduced in this paper). They show that humans prefer this method's generations over Composable Diffusion's.
The model resolves issues like color leakage and missing attributes, and also does well on object-level and scene-level compositionality and counting.
Retrieval Augmented #Diffusion (RDM) models: smaller diffusion models can produce high-quality generations by accessing an external memory that guides the generation process. Inspired by DeepMind's RETRO.
If the model can always rely on this external memory, it only has to learn the important aspects of image generation, such as the composition of scenes, rather than, for example, memorizing what different dogs look like.
Setting: X is the training set and D is a *disjoint* image set used for retrieval. θ denotes the parameters of the diffusion model. ξ is the retrieval function, which takes an image and selects k images from D. φ is a pretrained image encoder.
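A sketch of what ξ might look like with a CLIP-style φ and cosine-similarity nearest neighbours. The database is assumed to be pre-encoded and L2-normalised; the function name and the value of k are illustrative, not the paper's.

```python
import torch

def retrieve_neighbours(x, database_feats, database_images, image_encoder, k=4):
    """xi(x): encode the query image with the frozen encoder phi and return
    its k nearest neighbours from the disjoint retrieval set D."""
    with torch.no_grad():
        q = image_encoder(x)                          # phi(x), shape (1, d)
    q = q / q.norm(dim=-1, keepdim=True)              # unit-normalise -> dot product = cosine similarity
    sims = q @ database_feats.T                       # database_feats: (|D|, d), pre-normalised
    top = sims.topk(k, dim=-1).indices.squeeze(0)
    return [database_images[i] for i in top]
```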
InstructPix2Pix: edit an image with text guidance in a single forward pass. Why bother with inversion or other tricks at inference time? Just create a dataset using those techniques and train a new model.
Editing should be fast when you want to edit an image in real time. Methods like Textual Inversion or Prompt-to-Prompt optimize during inference, which makes them slow.
In this paper, the authors cleverly use such techniques to generate the training data and then finetune Stable Diffusion to perform edits in a single forward pass. They use two pretrained models, GPT-3 (Davinci) and SD, to generate the data.
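A high-level sketch of the two-stage data generation. `propose_edit` and `generate_paired_images` are hypothetical placeholders standing in for the finetuned GPT-3 call and the SD + Prompt-to-Prompt generation, respectively.

```python
def make_training_example(caption, propose_edit, generate_paired_images):
    # Stage 1: a finetuned GPT-3 (Davinci) proposes an edit instruction and
    # the caption of the edited image, e.g.
    # "photograph of a girl riding a horse" ->
    # ("have her ride a dragon", "photograph of a girl riding a dragon").
    instruction, edited_caption = propose_edit(caption)

    # Stage 2: Stable Diffusion + Prompt-to-Prompt turns the caption pair into
    # an image pair that differs only in the edited content.
    image_before, image_after = generate_paired_images(caption, edited_caption)

    # The triplet becomes one supervised example: given image_before and the
    # instruction, the finetuned model should produce image_after.
    return image_before, instruction, image_after
```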