Gowthami Somepalli Profile picture
Grad student @UMDCS. Past: @AIatMeta, @AmazonScience, @IITMadras. Currently working on #Diffusion and #Multimodal understanding. GPU poor. She/her.

Jan 10, 2023, 13 tweets

StructureDiffusion: Improve the compositional generation capabilities of text-to-image #diffusion models by modifying the text guidance using a constituency tree or a scene graph.

A 🧵

Paper: arxiv.org/abs/2212.05032

Day 9 #30daysofDiffusion #MachineLearning

T2I models like SD produce aesthetically pleasing generations for a given prompt; however, most of us never get them right on the first try. Sometimes the model ignores part of the prompt, and some objects we want in the picture are missing.

The model also sometimes gets adjectives mixed up. For example, in the figure below, the prompt is "red car and white sheep", yet the model produced a red sheep too!

The authors address this compositionality issue in this paper.

This paper builds on ideas introduced in the Prompt-to-Prompt paper. If you haven't read it, check out the summary here.

Some notation: in SD, text guidance is provided via cross-attention. Q_t is the query vector from the image, W_p is the prompt representation matrix generated by CLIP, and K_p and V_p are the key-value pairs derived from W_p. Let M_t be the attention map computed between Q_t and K_p.
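As a refresher, the cross-attention step above can be sketched like this. This is a minimal PyTorch sketch assuming the standard scaled dot-product formulation; the tensor shapes and the linear projections producing K_p and V_p are my assumptions, not details from the thread.

```python
import torch
import torch.nn.functional as F

def cross_attention(Q_t, W_p, W_k, W_v):
    """Sketch of SD's text-guidance cross-attention.
    Q_t: image queries (n_pixels, d); W_p: CLIP prompt representation (n_tokens, d_txt);
    W_k, W_v: assumed key/value projection matrices (d_txt, d)."""
    K_p = W_p @ W_k                                  # keys from the prompt representation
    V_p = W_p @ W_v                                  # values from the prompt representation
    d = Q_t.shape[-1]
    M_t = F.softmax(Q_t @ K_p.T / d**0.5, dim=-1)    # attention map between Q_t and K_p
    O_t = M_t @ V_p                                  # attention output
    return O_t, M_t
```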

First, the authors use an off-the-shelf parser to extract a collection of concepts from all hierarchical levels, C = {c_1, c_2, . . . , c_k}, with c_p being the full prompt. Each concept is then encoded with CLIP to obtain a W matrix per level.
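That extraction step might look like the sketch below. Here `parse_noun_phrases` and `clip_encode` are hypothetical placeholders standing in for the off-the-shelf constituency parser and the CLIP text encoder; only the overall shape of the step comes from the thread.

```python
def build_level_encodings(prompt, parse_noun_phrases, clip_encode):
    """Encode the full prompt plus every extracted noun phrase.
    parse_noun_phrases and clip_encode are hypothetical stand-ins for
    a constituency parser and the CLIP text encoder."""
    concepts = [prompt] + parse_noun_phrases(prompt)  # c_p plus NPs from all levels
    return [clip_encode(c) for c in concepts]         # one W matrix per concept
```

For the example prompt above, the parser would yield something like `["red car", "white sheep"]`, giving three encodings in total.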

We then end up with a list of W (and corresponding V) representations, one for each level.

The output of the attention layer, O_t, is then modified to be a linear combination of the V_i's from all the levels. Note that M_t is the attention map computed with the whole prompt.
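A sketch of that combination step, assuming a uniform average as the linear combination (the specific weighting is my assumption; the thread only says "linear combination"):

```python
import torch

def structured_attention(M_t, V_list):
    """StructureDiffusion-style output (sketch): read out the single attention
    map M_t (computed from the full prompt) against the values V_i from each
    parse-tree level, then average the results."""
    return torch.stack([M_t @ V_i for V_i in V_list]).mean(dim=0)
```

With a single level (just the full prompt), this reduces to the standard output M_t @ V_p.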

The authors also propose a variant where the attention maps change with the level too (Q_t is fixed, but K_p changes). They suggest this formulation helps with the issue of objects being omitted from generated images.
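The variant could be sketched as follows, again assuming uniform averaging and standard scaled dot-product attention; per-level keys K_i replace the single K_p:

```python
import torch
import torch.nn.functional as F

def structured_attention_variant(Q_t, K_list, V_list):
    """Variant sketch: recompute the attention map per level with that
    level's keys K_i, instead of reusing the full-prompt map M_t."""
    d = Q_t.shape[-1]
    outs = []
    for K_i, V_i in zip(K_list, V_list):
        M_i = F.softmax(Q_t @ K_i.T / d**0.5, dim=-1)  # level-specific attention map
        outs.append(M_i @ V_i)
    return torch.stack(outs).mean(dim=0)
```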

To summarize the whole process: essentially, we condition the generation process on encodings of noun phrases from multiple parse-tree levels instead of a single encoding of the whole prompt.

Results: the authors ran a human evaluation to test the fidelity of the generations on two benchmarks, ABC and CC-500 (both introduced in this paper). They show that humans prefer this method's generations to Composable Diffusion's.

The model resolves a few issues like color leakage and missing attributes. The model also did well on object-level and scene-level compositionality and counting.

Unofficial implementation - github.com/shunk031/train…
