With the release of #Imagen from @GoogleAI yesterday, here's a quick follow-up thread on the progress of compositionality in vision-language models. 🧵 1/11
A few weeks ago DALL-E 2 was unveiled. It exhibits both very impressive success cases and clear failure cases – especially when it comes to counting, relative position, and some forms of variable binding. Why? 2/11
Under the hood, DALL-E 2 uses a frozen CLIP model to encode captions into embeddings. CLIP's contrastive training objective leads it to learn only the features of images people tend to describe online (e.g., common objects/relations and aesthetic style). 3/11
CLIP only needs to learn visual features sufficient to match an image with the correct caption. As a result, it's unlikely to preserve the kind of information that proves useful for things such as counting, relative spatial position, and variable binding. 4/11
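To make the training-objective point concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive (InfoNCE) loss. This is an illustrative stand-in, not OpenAI's implementation; the function name and temperature value are my own choices:

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss of the kind used to train CLIP.

    Each image embedding is pushed toward its own caption's embedding and
    away from every other caption in the batch. Nothing in this objective
    forces the model to preserve caption details (counts, spatial relations,
    bindings) beyond what is needed to pick the right caption out of the batch.
    """
    # L2-normalise so dot products are cosine similarities
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    logits = image_embs @ text_embs.T / temperature  # (batch, batch)
    labels = np.arange(len(logits))                  # matching pairs lie on the diagonal

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)      # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # average of the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

When matched image/caption pairs are aligned, the loss is near zero; mismatched pairings in the batch drive it up. That batch-discrimination pressure is all the text encoder is trained for.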
When DALL-E 2 generates an image, it starts with the high-level features encoded by CLIP, then fills in the details with a diffusion model. This does not enable it to add the compositional features missing from the initial text encoding. 5/11
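The information bottleneck can be caricatured with a toy pipeline in the shape of DALL-E 2's ("unCLIP") generation process. The real model uses a diffusion prior and decoder over continuous embeddings; these stand-in functions and the bag-of-words "encoding" are purely illustrative:

```python
def clip_text_encode(caption):
    """Stand-in for CLIP's text encoder: keep only the set of words,
    a lossy, order-insensitive summary that discards syntactic structure."""
    return frozenset(caption.lower().split())

def diffusion_prior(text_emb):
    """Stand-in for the prior mapping a CLIP text embedding to a CLIP
    image embedding (identity in this sketch)."""
    return text_emb

def diffusion_decoder(image_emb):
    """Stand-in for the decoder that fills in low-level detail from the
    high-level embedding. It can only elaborate what the embedding kept."""
    return " ".join(sorted(image_emb))

def generate(caption):
    return diffusion_decoder(diffusion_prior(clip_text_encode(caption)))
```

In this caricature, "a red cube on a blue cube" and "a blue cube on a red cube" collapse to the same encoding, so the decoder, however powerful, cannot recover which object binds to which property: downstream stages cannot restore information the first encoding threw away.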
#Imagen is a different beast. The architecture is very simple: the caption is encoded by a frozen language model (T5-XXL) that is both much larger and trained on much more text than CLIP's text encoder. A series of conditional diffusion models then generate and upscale an image from the T5 text embedding. 6/11
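The same toy style can sketch Imagen's cascade: a frozen text encoder whose output preserves token order, a base diffusion model, and super-resolution stages that are each conditioned on the text embedding. Again, function names and the token-tuple "embedding" are illustrative assumptions, not Google's API:

```python
def t5_encode(caption):
    """Stand-in for the frozen T5-XXL encoder: one embedding per token,
    order preserved (unlike a single pooled caption vector)."""
    return tuple(caption.lower().split())

def base_diffusion(text_emb):
    """Stand-in for the base text-to-image diffusion model (64x64 in the paper)."""
    return {"resolution": 64, "content": text_emb}

def super_resolution(image, text_emb, target_res):
    """Stand-in for a super-resolution diffusion stage; each upsampler is
    also conditioned on the text embedding."""
    assert image["content"] == text_emb
    return {"resolution": target_res, "content": text_emb}

def imagen_generate(caption):
    emb = t5_encode(caption)
    img = base_diffusion(emb)
    img = super_resolution(img, emb, 256)    # 64 -> 256
    img = super_resolution(img, emb, 1024)   # 256 -> 1024
    return img
```

Because the encoding keeps word order, "a red cube on a blue cube" and "a blue cube on a red cube" no longer collapse to the same representation, and every stage of the cascade can consult that richer encoding.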
The Imagen paper showcases the importance of the text encoder for image generation, which benefits from scaling. Imagen seems better than DALL-E 2 at rendering text, counting, parsing relative position, and some forms of variable binding. 7/11
These are all non-cherry-picked samples from the paper. While not perfect, they suggest that Imagen is better at parsing the compositional semantics of captions, even when they contain multiple objects and features. 8/11
There are still important limitations. The annotated plot below shows that humans judge Imagen to be slightly worse than DALL-E 2 when it comes to the complex compositional prompts proposed by @GaryMarcus et al. in the spirit of adversarial evaluation. 9/11
Unfortunately, the whole battery of tests (called DrawBench) only contains 200 prompts that are not systematically produced. I hope @GoogleAI will let researchers conduct more systematic evaluations in the future. Perhaps we need a BIG-Bench for vision-language models! 10/11
For further discussion on this topic, join the upcoming workshop on compositionality and AI I'm organizing with @GaryMarcus in June – free registration here: compositionalintelligence.github.io 11/11

Thread by Raphaël Millière

