Raphaël Millière
May 24, 2022 · 11 tweets · 6 min read
With the release of #Imagen from @GoogleAI yesterday, here's a quick follow-up thread on the progress of compositionality in vision-language models.🧵 1/11
A few weeks ago DALL-E 2 was unveiled. It exhibits both very impressive success cases and clear failure cases – especially when it comes to counting, relative position, and some forms of variable binding. Why? 2/11
Under the hood, DALL-E 2 uses a frozen CLIP model to encode captions into embeddings. CLIP's contrastive training objective leads it to learn only the features of images people tend to describe online (e.g., common objects/relations and aesthetic style) 3/11
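
To make this concrete, here is a minimal sketch of the symmetric contrastive objective CLIP is trained with (an InfoNCE-style formulation; the function name and default temperature are illustrative, not OpenAI's actual code):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of (image, caption) pairs.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarity between every image and every caption in the batch.
    logits = (image_emb @ text_emb.t()) / temperature
    # The correct caption for image i is caption i (the diagonal).
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image-to-caption and caption-to-image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Note what this loss rewards: picking the right caption out of a batch. Any visual detail that never helps disambiguate captions (exact object counts, precise spatial layout) can be discarded without penalty.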
CLIP only needs to learn visual features sufficient to match an image with the correct caption. As a result, it's unlikely to preserve the kind of information that proves useful for things such as counting, relative spatial position, and variable binding. 4/11
When DALL-E 2 generates an image, it starts from the high-level features encoded by CLIP, then fills in the details with a diffusion model. This does not enable it to add the compositional features missing from the initial text encoding. 5/11
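
Schematically, the generation pipeline described in the DALL-E 2 paper looks something like this (all arguments are placeholder callables standing in for the learned components, not a real API):

```python
def generate_unclip(caption, clip_text_encoder, prior, decoder, upsamplers):
    """Hedged sketch of DALL-E 2 (unCLIP) generation."""
    # 1. Encode the caption with the frozen CLIP text encoder.
    text_emb = clip_text_encoder(caption)
    # 2. A "prior" maps the text embedding to a CLIP image embedding:
    #    a compressed, high-level description of the intended scene.
    image_emb = prior(text_emb)
    # 3. A diffusion decoder fills in pixel-level detail conditioned on
    #    that embedding, and super-resolution stages upscale the result
    #    (64x64 -> 256x256 -> 1024x1024 in the paper).
    image = decoder(image_emb, text_emb)
    for upsample in upsamplers:
        image = upsample(image)
    return image
```

The bottleneck is step 2: whatever compositional information the CLIP embedding fails to preserve is already gone before the decoder runs.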
#Imagen is a different beast. The architecture is very simple: the caption is encoded by a frozen language model (T5-XXL) that is both much larger and trained on much more text than CLIP. A series of conditional diffusion models then generate and upscale an image from the T5 text embedding. 6/11
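
For contrast with the sketch above, Imagen's cascade looks roughly like this (again placeholder callables, following the architecture described in the paper):

```python
def generate_imagen(caption, t5_encoder, base_model, sr_256, sr_1024):
    """Hedged sketch of Imagen's cascaded diffusion pipeline."""
    # 1. Encode the caption with the frozen, text-only T5-XXL encoder.
    text_emb = t5_encoder(caption)
    # 2. Every diffusion stage conditions directly on the text embedding;
    #    there is no lossy text-to-image-embedding bottleneck in between.
    img = base_model(text_emb)        # 64x64 base sample
    img = sr_256(img, text_emb)       # super-resolve to 256x256
    img = sr_1024(img, text_emb)      # super-resolve to 1024x1024
    return img
```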
The Imagen paper showcases the importance of the text encoder for image generation, and shows that it benefits from scaling. Imagen seems better than DALL-E 2 at visualizing text, counting, parsing relative position, and some forms of variable binding. 7/11
These are all non-cherry-picked samples from the paper. While not perfect, they suggest that Imagen is better at parsing the compositional semantics of captions, even when they contain multiple objects and features. 8/11
There are still important limitations. Human evaluations in the paper show that raters judge Imagen to be slightly worse than DALL-E 2 when it comes to the complex compositional prompts proposed by @GaryMarcus et al. in the spirit of adversarial evaluation. 9/11
Unfortunately, the whole battery of tests (called DrawBench) contains only 200 prompts, which are not systematically produced. I hope @GoogleAI will let researchers conduct more systematic evaluations in the future. Perhaps we need a BIG-Bench for vision-language models! 10/11
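
To make "systematically produced" concrete, here is a toy sketch of templated prompt generation over a controlled vocabulary; this is my own illustration of what a more systematic benchmark could look like, not how DrawBench was built:

```python
from itertools import product

counts = ["one", "two", "three"]
colors = ["red", "blue", "green"]
objects = ["cube", "sphere", "cat"]
relations = ["on top of", "to the left of"]

# Exhaustive grid over the vocabulary: every count/color/object/relation
# combination appears exactly once, so failures can be traced back to
# specific factors (e.g. counting vs. spatial relations).
prompts = []
for n, c, o, r in product(counts, colors, objects, relations):
    plural = "" if n == "one" else "s"
    prompts.append(f"{n} {c} {o}{plural} {r} a yellow box")

print(len(prompts))  # 54 prompts from a 3x3x3x2 grid
print(prompts[0])    # one red cube on top of a yellow box
```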
For further discussion on this topic, join the upcoming workshop on compositionality and AI I'm organizing with @GaryMarcus in June – free registration here: compositionalintelligence.github.io 11/11


More from @raphaelmilliere

Feb 17, 2024
There's a lot of speculation about whether OpenAI's video generation model Sora has a 'physics engine' (bolstered by OAI's own claims about 'world simulation'). Like the debate about world models in LLMs, this question is both genuinely interesting and somewhat ill-defined. 🧵1/
Of course, it's wildly unlikely that Sora literally makes function calls to an external physics engine like UE5 during inference. Note that this has been done before with LLMs; see the Google paper where the model answers questions through simulations with a physics engine. 2/

[Figure from Liu et al. (2022), "Mind's Eye: Grounded Language Model Reasoning through Simulation" (https://arxiv.org/abs/2210.05359)]
But that's not what most people are speculating about. Rather, the idea is that Sora would acquire an internal model of physics during training, and make use of this internal model to generate temporally and spatially coherent videos. 3/
Apr 5, 2023
📝New preprint! What does it take for AI models to have grounded representations of lexical items? There is a lot of disagreement – some verbal, some substantive – about what grounding involves. Dimitri Mollo and I frame this old question in a new light 1/
arxiv.org/abs/2304.01481
Back in 1990, Harnad characterized the "Symbol Grounding Problem" with the following question: How can AI systems designed to process linguistic inputs have internal representations and outputs that are intrinsically meaningful? 2/
sciencedirect.com/science/articl…
Harnad asked this question about classical AI systems manipulating symbols with arbitrary shapes. An analogous issue arises for neural nets, like language models, that compute over vectors rather than symbols: we call it the Vector Grounding Problem as a nod to Harnad's work. 3/
Mar 24, 2023
Yann LeCun kicking off the debate with a bold prediction: nobody in their right mind will use autoregressive models 5 years from now #phildeeplearning
@ylecun closing his presentation with some conjectures #phildeeplearning
Ellie Pavlick @BrownCSDept leading the charge on the "No" side!
Mar 9, 2023
Another day, another opinion essay about ChatGPT in the @nytimes. This time, Noam Chomsky and colleagues weigh in on the shortcomings of language models. Unfortunately, this is not the nuanced discussion one could have hoped for. 🧵 1/

nytimes.com/2023/03/08/opi…
For a start, I'm not sure the melodramatic tone serves the argument: "machine learning will degrade our science and debase our ethics", and "we can only laugh or cry at [LLMs'] popularity"! I know op-eds are often editorialized for dramatic effect, but maybe this is a bit much? 2/
The substantive claims are all too familiar: LLMs learn from co-occurrence statistics without leveraging innate structure; they describe and predict instead of doing causal inference; and they can't balance original reasoning with epistemic and moral constraints. 3/
Feb 10, 2023
I don't think lossy compression is a very helpful analogy to convey what (linguistic or multimodal) generative models do – at least if "blurry JPEGs" is the leading metaphor. It might work in a loose sense, but it doesn't tell the whole story. 1/

newyorker.com/tech/annals-of…
Generative models can definitely be used for lossy compression (see below), but that's a special case of their generative capabilities. Reducing everything they do to lossy compression perpetuates the idea that they just regurgitate approximations of their training samples. 2/

web.archive.org/web/2022092100…
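
For concreteness, the compression use looks something like this (a minimal sketch; `encoder` and `decoder` are stubs for a trained generative pair, not a real codec):

```python
def compress(image, encoder):
    # The "compressed file" is just a low-dimensional latent code.
    return encoder(image)

def decompress(latent, decoder):
    # The generative decoder produces a plausible reconstruction,
    # not a bit-exact copy: that is what makes the scheme lossy.
    return decoder(latent)
```

Compression stores a latent for a source image; sampling a novel image uses the same decoder with no source image to "compress" at all, which is why compression is the special case rather than the whole story.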
This bit about interpolation strikes me as particularly misleading. Inference on generative models involves computations that are way more complex and structured than (say) nearest neighbor pixel interpolation in image decompression. 3/
Aug 9, 2022
Can you reliably get image generation models like DALL-E 2 to illustrate specific visual concepts using made-up words? In this new preprint, I show that you can, using new approaches for text-based adversarial attacks on image generation. 1/12

arxiv.org/abs/2208.04135
Image generation models are typically trained on multilingual datasets (even if unintentionally). The paper introduces "macaronic prompting", a method that concatenates chunks from synonymous words in multiple languages to design nonce strings that can reliably query visual concepts. 2/12
For example, the word for “birds” is “Vögel” in German, “uccelli” in Italian, “oiseaux” in French, and “pájaros” in Spanish. Concatenate subword tokens from these words and you get strings like “uccoisegeljaros”, which reliably prompt DALL-E to generate images of birds. 3/12
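
The nonce string from that example can be reconstructed in a few lines (chunk boundaries hand-picked here for illustration; the method in the preprint splits at subword-token boundaries):

```python
# Chunks from four translations of "birds", concatenated in order.
parts = [
    "uccelli"[:3],   # "ucc"   (Italian)
    "oiseaux"[:4],   # "oise"  (French)
    "Vögel"[2:],     # "gel"   (German)
    "pájaros"[2:],   # "jaros" (Spanish)
]
print("".join(parts))  # uccoisegeljaros
```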
