The release of impressive new deep learning models in the past few weeks, notably #dalle2 from @OpenAI and #PaLM from @GoogleAI, has prompted a heated discussion of @GaryMarcus's claim that DL is "hitting a wall". Here are some thoughts on the controversy du jour. 🧵 1/25
One of @GaryMarcus's central claims is that current DL models fail at compositionality. The assessment of this claim is complicated by the fact that people may differ in how they understand compositionality – and what a "test of compositionality" should even look like. 2/25
Compositionality traditionally refers to a (putative) property of language: the meaning of a complex expression is fully determined by its structure and the meanings of its constituents. (There are good reasons to doubt that language is always compositional in that sense.) 3/25
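For the formally inclined, the principle is often stated along the following lines. This is a sketch of the standard textbook formulation, nothing more:

```latex
% Principle of compositionality (textbook-style sketch): for every complex
% expression e formed from constituents c_1, ..., c_n by a syntactic rule
% sigma, there is a composition function f_sigma such that
\[
  m(e) \;=\; f_{\sigma}\bigl(m(c_1), \dots, m(c_n)\bigr)
\]
% where m(.) assigns each expression its meaning. "Fully determined" means
% that nothing beyond sigma and the constituent meanings enters into f_sigma.
```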
If language is compositional, and if thought is language-like – as proponents of the language of thought hypothesis argue – then thought itself should be compositional in a similar sense. 4/25
But we can also meaningfully talk about compositionality in a broader sense that may apply to non-linguistic representational systems, such as visual symbols. (And we can talk about the compositionality of thought in this broader sense, even if it's not language-like.) 5/25
Visual symbols, like road signs, have parts that can be dissociated and recombined in somewhat systematic ways, such that one may wonder whether the meaning of a complex visual symbol is fully determined by its composition and the meanings of its constituents. 6/25
It's not obvious that natural images are strictly compositional in this broader sense; but, if we can talk about meaning in this context, it's plausible that the meaning of some images is at least partly determined by their structure and the meanings of their constituents. 7/25
Taking stock: the meaning of many complex linguistic expressions, and (lato sensu) that of many images, is at least partly determined compositionally, by their structure and the meanings of their constituents. 8/25
What are we talking about when we ask about the compositional aptitude of deep learning models? In the linguistic domain, we may wonder whether language models represent the meaning of complex expressions in a way that is suitably sensitive to their compositional semantics. 9/25
For example, @LakeBrenden and Gregory Murphy recently argued that language models were pretty bad at understanding novel conceptual combinations unseen in their training data, suggesting that they struggle with compositional semantics: arxiv.org/abs/2008.01766 10/25
To test this hypothesis, I teamed up with @nerd_sighted, Dimitri Coelho Mollo, and Charles Rathkopf to design a task for @GoogleAI's gigantic new NLP benchmark #BIGbench: github.com/google/BIG-ben… 11/25
Our task probes the understanding of novel conceptual combinations, including combinations of made-up words defined in the prompt (see the attached example). 12/25
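To give a flavor of what such an item might look like, here is a made-up illustration in the spirit of the task – not an actual item from our BIG-bench submission:

```python
# Hypothetical illustration of a "conceptual combination" item built around a
# made-up word defined in the prompt. This is NOT an actual item from our
# BIG-bench task; it only shows the general shape of the probe.
item = {
    "prompt": (
        "A 'blicket' is a small transparent container used for storing spices.\n"
        "Question: Which would be a more sensible gift for a chef, "
        "a blicket piano or a blicket rack?"
    ),
    "options": ["a blicket piano", "a blicket rack"],
    # Answering correctly requires combining the novel word's meaning
    # with that of a familiar word in a semantically plausible way.
    "target": "a blicket rack",
}
```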
Testing revealed that humans do extremely well on our task (with a perfect score for the best human, and 83.2/100 on average), while previous models, including GPT-3, did rather poorly. This was in line with @LakeBrenden and Gregory Murphy's claim. 13/25
But we were surprised to see that @GoogleAI's new language model, PaLM, achieved excellent results on our task in the 5-shot learning regime, virtually matching the human average. This shows a sophisticated ability to combine word meanings in semantically plausible ways. 14/25
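For context, "5-shot" simply means the model sees five solved examples before the test item. Schematically, the evaluation prompt is assembled roughly like this (a sketch of the general idea, not the actual BIG-bench harness code):

```python
# Schematic sketch of few-shot evaluation: prepend k solved examples to the
# test question, then let the model complete the final answer. Not the actual
# BIG-bench harness, just the general idea behind "5-shot".
def build_few_shot_prompt(solved_examples, test_question, k=5):
    """Concatenate k solved (question, answer) pairs, then the test question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in solved_examples[:k]]
    parts.append(f"Q: {test_question}\nA:")
    return "\n\n".join(parts)

# The model's completion after the final "A:" is then scored against the target.
```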
Of course, our task does not exhaust compositionality in the linguistic domain. The kind of compositionality probed by conceptual combinations is driven by semantics rather than syntax. Other work has looked at syntactic composition in language models, with mixed results. 15/25
What about the visual domain? When we ask about the compositional aptitude of an image generation model like DALL-E, for example, we may wonder whether it is able to parse complex expressions in a way that is semantically plausible and/or syntactically accurate. 16/25
Take the infamous "avocado chair" example: to plausibly illustrate that prompt, DALL-E 2 has to produce something that looks like an avocado-shaped chair, rather than a chair-shaped avocado. This is an example of semantic composition, and DALL-E 2 is remarkably good at it. 17/25
Yet it can also struggle with some forms of precise syntactic composition, such as illustrating the prompt "A red cube on top of a blue cube". Getting this right requires at least approximating a form of sophisticated variable binding, and DALL-E 2 still falls short. 18/25
The recently released Winoground benchmark by @TristanThrush et al. suggests that previous multimodal models also struggle with syntactic composition. Time will tell how DALL-E 2 performs on Winoground! arxiv.org/abs/2204.03162 19/25
So do current models "fail at compositionality"? In some sense, yes, at least occasionally: they can still fail at adequately parsing the ways in which syntax changes the meaning of complex expressions (and how that may translate visually). 20/25
On the other hand, they do remarkably well on many instances that require such parsing, and on the corresponding visual translation in the case of DALL-E 2. 21/25
Some additional examples. As always, cherry-picking is somewhat of a concern, but the fact that current models can rather reliably produce outputs like this at all is quite remarkable – and not simply because these are "pretty pictures". 22/25
As far as semantic composition goes, it does seem like recent models have achieved a breakthrough, as suggested by PaLM's surprising score on our "conceptual combination" task. Further testing will be required, but these results can't easily be dismissed. 23/25
Current deep learning models clearly fall short of human-level compositional understanding of complex linguistic expressions and images. But they keep creeping closer, for both semantic and syntactic composition, probably faster than many expected. 24/25
I don't think it's entirely fair to say that they "fail at compositionality". They exhibit both success and failure cases. Humans do too, but DL models fail far more often. Whether the remaining gap can be bridged without symbolic/hybrid architectures is an open question. 25/25
