We show evidence that DALL-E 2, in stark contrast to humans, does not respect the constraint that each word has a single role in its visual interpretation.
We detail three types of behavior that are inconsistent with the single-role-per-word constraint:
(1) A noun with multiple senses in an ambiguous prompt may cause DALL-E 2 to depict the same noun twice, once per sense. In “a bat is flying over a baseball stadium”, “bat” is visualized both as a flying mammal and as a wooden stick.
(2) A noun may be interpreted as a modifier of two different entities. For “a fish and a gold ingot”, DALL-E 2 consistently includes a goldfish, even though “gold” is a modifier of “ingot”, not of “fish”.
(3) A noun may be interpreted simultaneously as an entity and as a modifier of a different entity. For “a seal is opening a letter”, DALL-E 2 tends to attribute the “sealed” property to the letter, in addition to depicting the animal.
Bonus finding: some nouns can be replaced by a description and still trigger the first behavior. “crane” is never explicitly mentioned in “a tall, long-legged, long-necked bird and a construction site”, yet both types of crane (the bird and the construction machine) appear in its interpretation.
If you’re interested in reading more about this cool phenomenon, give our paper a read!
How diverse are the outputs of text-to-image models, and how can we measure that? In our new work, we propose a measure based on LLMs and Visual-QA (VQA), and show that NONE of the 12 models we experiment with are diverse. 🧵
1/11
In the real world, cookies come in many shapes and kinds, but generated cookies are consistently round and chocolate chip. This happens for many other concepts as well. How can we estimate and quantify the distribution of such behaviors more systematically? 🤔
2/11
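The full procedure is in the paper; as a rough illustration of the general idea (not the paper’s exact method), here is a minimal sketch: an LLM proposes the visual attributes a concept can vary on (e.g. a cookie’s shape and kind), a VQA model answers those attribute questions for each generated image, and the spread of the answers is summarized with normalized entropy. The `ask_vqa` helper and the usage names below are hypothetical stand-ins for whatever T2I and VQA models are used.

```python
from collections import Counter
import math

def normalized_entropy(answers):
    """Entropy of the VQA answer distribution, scaled to [0, 1].

    1.0 = answers spread uniformly over the observed values (diverse);
    0.0 = every image got the same answer (a collapse like
    "round, chocolate chip" for every generated cookie).
    """
    counts = Counter(answers)
    n = sum(counts.values())
    if len(counts) <= 1:
        return 0.0
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(len(counts))  # normalize by max possible entropy

def diversity_scores(concept, attributes, images, ask_vqa):
    """Per-attribute diversity for one concept.

    `attributes` would come from an LLM (e.g. "what visual attributes
    can a cookie vary on?" -> ["shape", "kind"]); `ask_vqa` is a
    hypothetical callable (image, question) -> short answer string.
    """
    scores = {}
    for attr in attributes:
        question = f"What is the {attr} of the {concept}?"
        answers = [ask_vqa(img, question) for img in images]
        scores[attr] = normalized_entropy(answers)
    return scores

# Hypothetical usage, assuming a generate_images helper:
#   images = generate_images("a cookie", n=100)
#   diversity_scores("cookie", ["shape", "kind"], images, ask_vqa)
```

One design choice in this sketch: entropy is computed per attribute and over the answers actually observed, so a model that always produces round chocolate-chip cookies scores 0 on both attributes, regardless of image quality.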