With the release of #Imagen from @GoogleAI yesterday, here's a quick follow-up thread on the progress of compositionality in vision-language models. 🧵 1/11
A few weeks ago DALL-E 2 was unveiled. It exhibits both very impressive success cases and clear failure cases, especially when it comes to counting, relative position, and some forms of variable binding. Why? 2/11
Under the hood, DALL-E 2 uses a frozen CLIP model to encode captions into embeddings. CLIP's contrastive training objective leads it to learn only the features of images people tend to describe online (e.g., common objects/relations and aesthetic style) 3/11
CLIP only needs to learn visual features sufficient to match an image with the correct caption. As a result, it's unlikely to preserve the kind of information that proves useful for things such as counting, relative spatial position, and variable binding. 4/11
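To make the point above concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) objective. This is an illustrative toy, not OpenAI's code; the function name, shapes, and temperature value are assumptions. Note that the loss only rewards picking the right caption out of a batch, so any visual detail that doesn't help disambiguate captions can be discarded.

```python
# Toy sketch of a CLIP-style contrastive objective (illustrative, not OpenAI's code).
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N matched image/text pairs.

    image_emb, text_emb: arrays of shape (N, D); row i of each is a matched pair.
    """
    # Normalize embeddings so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(logits))                # correct pairs lie on the diagonal

    def xent(l):
        # Cross-entropy against the diagonal (numerically stable log-softmax).
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2
```

The objective is minimized as soon as each image is closer to its own caption than to the other captions in the batch; nothing forces the embedding to keep track of exact counts or spatial layout.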
When DALL-E 2 generates an image, it starts with the high-level features encoded by CLIP, then fills in the details with a diffusion model. This does not enable it to add the compositional features missing from the initial text encoding. 5/11
#Imagen is a different beast. The architecture is very simple: the caption is encoded by a frozen language model (T5-xxl) that is both much larger and trained on much more text than CLIP. A series of conditional diffusion models then generates and upscales an image from the T5 text embedding. 6/11
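The cascade described above can be sketched in a few lines. This is a structural caricature only: `T5Encoder` and `DiffusionModel` are placeholder classes I made up, and the "embedding" is a dummy; only the pipeline shape (frozen text encoder, 64×64 base model, 64→256 and 256→1024 super-resolution stages, all conditioned on the text embedding) reflects the paper.

```python
# Structural sketch of the Imagen pipeline (placeholder classes, not a real API).

class T5Encoder:
    """Frozen text encoder: caption -> sequence of embeddings (dummy here)."""
    def encode(self, caption: str):
        return [float(len(tok)) for tok in caption.split()]  # stand-in embedding

class DiffusionModel:
    """Conditional diffusion model; real models iteratively denoise."""
    def __init__(self, out_resolution: int):
        self.out_resolution = out_resolution

    def sample(self, text_emb, image=None):
        # We only record the output resolution and the conditioning signal.
        return {"resolution": self.out_resolution, "cond": text_emb}

def generate(caption: str):
    text_emb = T5Encoder().encode(caption)            # frozen LM, never fine-tuned
    base = DiffusionModel(64).sample(text_emb)        # 64x64 base generation
    up1 = DiffusionModel(256).sample(text_emb, base)  # 64 -> 256 super-resolution
    up2 = DiffusionModel(1024).sample(text_emb, up1)  # 256 -> 1024 super-resolution
    return up2
```

The key design choice is that every stage, including super-resolution, is conditioned on the same rich text embedding, so compositional information from the caption can keep influencing the image all the way up.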
The Imagen paper showcases the importance of the text encoding model for image generation, which benefits from scaling. Imagen seems better than DALL-E 2 at visualizing text, counting, parsing relative position, and some forms of variable binding. 7/11
These are all non-cherry-picked samples from the paper. While not perfect, they suggest that Imagen is better at parsing the compositional semantics of captions, even when a caption contains multiple objects and features. 8/11
There are still important limitations. The annotated plot below shows that humans judge Imagen to be slightly worse than DALL-E 2 when it comes to the complex compositional prompts proposed by @GaryMarcus et al. in the spirit of adversarial evaluation. 9/11
Unfortunately, the whole battery of tests (called DrawBench) only contains 200 prompts that are not systematically produced. I hope @GoogleAI will let researchers conduct more systematic evaluations in the future. Perhaps we need a BIG-Bench for vision-language models! 10/11
For further discussion on this topic, join the upcoming workshop on compositionality and AI I'm organizing with @GaryMarcus in June (free registration here: compositionalintelligence.github.io) 11/11
Go read this excellent and timely blog post on compositionality and vision-language models. I share the positive sentiment towards recent progress in this area, with some caveats about remaining hurdles. 1/6
I disagree that "it makes no sense to criticise DALL-E (or neural networks in general) for their poor composition", if that simply means pointing out current limitations. I also emphasized DALL-E's strengths, but it clearly struggles with some forms of compositionality. 2/6
The blog post rightly celebrates DALL-E's impressive performance with more semantic forms of composition (conceptual combinations, e.g. "avocado chair"). However, compositional semantics is often determined by more sophisticated syntactic structure. 3/6
The release of impressive new deep learning models in the past few weeks, notably #dalle2 from @OpenAI and #PaLM from @GoogleAI, has prompted a heated discussion of @GaryMarcus's claim that DL is "hitting a wall". Here are some thoughts on the controversy du jour. 🧵 1/25
One of @GaryMarcus' central claims is that current DL models fail at compositionality. The assessment of this claim is complicated by the fact that people may differ in how they understand compositionality ā and what a "test of compositionality" should even look like. 2/25
Compositionality traditionally refers to a (putative) property of language: the meaning of a complex expression is fully determined by its structure and the meanings of its constituents. (There are good reasons to doubt that language is always compositional in that sense.) 3/25
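The principle can be illustrated with a toy fragment of a lexicon. This is purely illustrative (not a serious linguistic theory): nouns denote properties of objects, adjectives denote property modifiers, and the meaning of a phrase like "red square" is computed mechanically from its structure and its constituents' meanings.

```python
# Toy illustration of compositional semantics: the meaning of "red square"
# is fully determined by the meanings of "red" and "square" plus the
# modifier-noun structure. Entirely illustrative.

lexicon = {
    "square": lambda x: x["shape"] == "square",  # noun: property of objects
    "circle": lambda x: x["shape"] == "circle",
    # adjectives: functions from properties to properties
    "red":  lambda p: lambda x: x["color"] == "red" and p(x),
    "blue": lambda p: lambda x: x["color"] == "blue" and p(x),
}

def interpret(phrase: str):
    """Compose meanings right-to-left over the structure [adj [adj ... noun]]."""
    words = phrase.split()
    meaning = lexicon[words[-1]]       # start from the head noun
    for w in reversed(words[:-1]):
        meaning = lexicon[w](meaning)  # apply each modifier in turn
    return meaning

obj = {"shape": "square", "color": "red"}
```

A system that represents captions this way would get "red square next to a blue circle" right by construction; the open question is whether neural models learn anything functionally similar.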
There is an increasing awareness that digital privacy matters even if you don't have "anything to hide". I've been vocal about this for a while, but people often don't know where to begin. The recent WhatsApp controversy is a good opportunity for a 🧵 with a few privacy tips. 1/n
First things first: assuming you're not breaking the law, why should you care? Ask yourself: Would you be fine with a company monitoring your home 24/7 with a surveillance camera? What about someone watching you through a window with binoculars? Probably not. 2/n
Our lives are increasingly digital. While some of our online activity is shared and public-facing by design (such as this thread), much of it is not and shouldn't be. From our phones to our TVs, virtually every device we own facilitates the harvesting of our data. 3/n
I've seen some questions about how I could produce the texts I shared earlier by prompting GPT-3, and whether GPT-3 is capable of producing such a convincing output at all, so here's a thread to clarify a few points.
My methodology was the following. Since I don't yet have access to the API, I used @AiDungeon with the "Dragon" model (which is GPT-3) and a custom prompt. AFAIK, AID allows for arbitrarily large prompts, but as @MaCroPhilosophy pointed out these must be automatically truncated.
I used the schema outlined below for the prompt. As I mentioned, given the length of that prompt (way above the 2,048-BPE context window described in the GPT-3 paper), I assume it was truncated so that only the end was passed to the model.
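The truncation I'm assuming here amounts to dropping the start of the prompt and keeping the most recent tokens. A minimal sketch (whitespace splitting is a stand-in for real BPE tokenization, and the function name is mine, not AI Dungeon's):

```python
# Sketch of left-truncation: keep only the last `max_tokens` tokens of a long
# prompt so it fits the model's context window. Whitespace tokenization is a
# stand-in for GPT-3's BPE; the real token count would differ.

def truncate_prompt(prompt: str, max_tokens: int = 2048) -> str:
    tokens = prompt.split()                 # stand-in for BPE tokenization
    if len(tokens) <= max_tokens:
        return prompt                       # short enough: pass through unchanged
    return " ".join(tokens[-max_tokens:])   # drop the beginning, keep the end
```

If AI Dungeon does something like this, the model never saw the start of my prompt, only the tail that fit in the window.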
The prompt contained the essays themselves, plus a blurb explaining that GPT-3 had to respond to them. Full disclosure: I produced a few outputs and cherry-picked this one, although they were all interesting in their own way.
One was really sassy: "I'll admit that my ideas are largely untested. I haven't spent years in academia toiling away at some low-paying job that I don't really enjoy just so that I can eventually get a job doing something that I don't really want to be doing in the first place."