Laura Ruis
Oct 28 · 22 tweets · 11 min read
Before giving up on Twitter, check out our recent finding 🔥 LLMs are *not* zero-shot communicators!

We show a limitation of #LLMs in interpreting language given the context of shared experience of the world, a ubiquitous part of communication ⬇️🧵

arxiv.org/abs/2210.14986

1/22
Work with awesome collaborators @akbirkhan @BlancheMinerva @sarahookr @_rockt @egrefen

If this thread is long, it’s only because I had to keep the meme-to-information ratio at a respectable level (and because there’s a lot to unpack)

lauraruis.github.io/2022/09/29/com…

2/22
Let’s start off with an example of the type of language we’re investigating. Consider the following exchange between a human and InstructGPT (text-davinci-002, temperature 0).

[image]

3/22
InstructGPT might understand the syntax and semantics of the question perfectly, yet still miss the meaning entirely. This question has a *pragmatic* implication: an inquiry about the location of a phone. 4/22
Humans, having had an experience of losing their own phone in the past, infer that the speaker is looking for their phone. This illustrates an essential aspect of our everyday usage of language: interpreting it given the context of our shared experience 🤝 5/22
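For anyone who wants to poke at this themselves: below is a minimal sketch of the querying setup from tweet 3, assuming the legacy OpenAI Python client (openai<1.0, current when this thread was written). The prompt is an illustrative Esther/Juan implicature in the style of our examples, not the exact exchange from the screenshot.

```python
# A minimal sketch, assuming the legacy OpenAI completions API (openai<1.0).
# The prompt is illustrative, not the exact exchange from the image above.
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supply your own key

prompt = (
    'Esther asked "Can you come to my party on Friday?" and '
    'Juan responded "I have to work." Is Juan coming to the party?'
)

response = openai.Completion.create(
    model="text-davinci-002",  # the model named in tweet 3
    prompt=prompt,
    temperature=0,             # greedy decoding, as in the thread
    max_tokens=16,
)
print(response["choices"][0]["text"].strip())
```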
This aspect of language is studied by the field of #pragmatics. Side note: besides being important for communication, pragmatics can also be used to infer human values 👀 Check out this recent cool theoretical work emphasising this for value #alignment: 6/22

We design a pragmatic language understanding task, schematically depicted below. This image shows one of our prompt templates, but we test 6 different templates to control for sensitivity to the wording. 7/22

[image]
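In code, the task boils down to wrapping an utterance/response pair in a template and checking whether a model resolves the implicature to “yes” or “no”. A minimal sketch, assuming an illustrative template (the exact wording of our 6 templates is in the paper) and a hypothetical completion_logprob scoring helper:

```python
# A sketch of the binary implicature resolution task. TEMPLATE is
# illustrative; the paper tests 6 different wordings.
TEMPLATE = (
    'Esther asked "{utterance}" and Juan responded "{response}", '
    "which means {answer}"
)

example = {
    "utterance": "Can you come to my party on Friday?",
    "response": "I have to work.",
    "label": "no",  # the response pragmatically implies "no"
}

def resolve(completion_logprob, ex):
    """Pick whichever completion ("yes" or "no") the model finds likelier.

    `completion_logprob` is a hypothetical helper mapping a full prompt
    string to the model's log-probability of that string.
    """
    scores = {
        answer: completion_logprob(
            TEMPLATE.format(
                utterance=ex["utterance"],
                response=ex["response"],
                answer=answer,
            )
        )
        for answer in ("yes", "no")
    }
    prediction = max(scores, key=scores.get)
    return prediction, prediction == ex["label"]
```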
We evaluate LLMs in 4 groups: base models (like OPT), instructable LLMs trained with an unknown method (like InstructGPT-3), instructable LLMs finetuned on downstream tasks (T0 and Flan-T5), and LLMs finetuned on dialog (BlenderBot), and find that they struggle compared to humans. ⬇️ 8/22

[image]
So much going on in that scale plot, most notably: InstructGPT-3 outperforms all! But even the best model performs much worse than humans (14% worse), especially on a subset of the data that requires context to be resolved (24% worse, shown in section 4.1 of the paper). 9/22
“Surely they can do this if _only_ you found the right prompt?” they’ll say. Besides the 6 templates we test, we take inspiration from the Sparrow paper (@__nmca__ @mia_glaese) and try additional, more elaborate instruction prompts for the zero-shot case. Doesn’t help… 10/22

[image]
“But what about in-context examples?!” they’ll ask. We add up to 30 examples from a dev set to the prompt. For most models, it doesn’t help much. Davinci-002 can get close to human performance at k=30, but again, for context-heavy examples the gap with humans is still 9%. 11/22
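A sketch of that k-shot setup: prepend up to k resolved dev-set examples to the test item. Here `dev_set` is an assumed list of labelled examples and `template` a format string as in the earlier sketch:

```python
def build_kshot_prompt(test_example, dev_set, template, k=30):
    """Prepend up to k resolved dev-set examples to the test item.

    `dev_set` is an assumed list of dicts with "utterance", "response"
    and "label" keys; `template` is a format string as in the earlier
    sketch.
    """
    shots = [
        template.format(
            utterance=ex["utterance"],
            response=ex["response"],
            answer=ex["label"],
        )
        for ex in dev_set[:k]
    ]
    # Leave the test item unresolved; the model must complete "yes"/"no".
    query = template.format(
        utterance=test_example["utterance"],
        response=test_example["response"],
        answer="",
    ).rstrip()
    return "\n\n".join(shots + [query])
```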
And, on top of all this, we test on a *very* simple binary resolution task (resolving to “yes” or “no”). Humans intuitively resolve much more complex implicatures in conversation, leaving the door wide open for more complex benchmarks in the future. 12/22
For example, imagine Esther asking “Can I use your stapler?” and Juan responding “Here’s the key to my office.” Juan is implicating that (1) Esther can use the stapler, (2) the stapler is located in the office, and (3) the office is currently locked. 13/22
Like prior work, we find that LLMs handle some prompt templates better than others. However, no single prompt template dominates: some models prefer structured prompts (no. 1 below) whilst others prefer natural prompt templates (no. 2 below). 14/22

[image]
And different models react completely differently to in-context prompting. Below you can see InstructGPT-3-175B benefiting roughly equally from few-shot examples across all templates, Cohere-52B benefiting only for structured prompts, and OPT-175B vice versa. 🤔 15/22

[image]
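These plots come from sweeping a grid of (model, template, k) configurations. A sketch of such a sweep, with illustrative model names and an assumed evaluate helper returning test-set accuracy:

```python
# A sketch of the evaluation grid behind the plots: accuracy per
# (model, template, k). Model names are illustrative identifiers.
from itertools import product

def evaluate(model: str, template: int, k: int) -> float:
    """Assumed helper: run one configuration, return test-set accuracy.

    A real implementation would build k-shot prompts (see the earlier
    sketch) and score every test example with the given model/template.
    """
    raise NotImplementedError

def sweep():
    models = ["text-davinci-002", "cohere-52b", "opt-175b"]
    templates = range(6)                 # the 6 prompt wordings tested
    shot_counts = [0, 1, 5, 10, 15, 30]  # illustrative k values up to 30
    return {
        (m, t, k): evaluate(model=m, template=t, k=k)
        for m, t, k in product(models, templates, shot_counts)
    }
```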
This highlights the sensitivity of models to prompts, and points to prompt robustness as an area for improvement! 16/22

(meme credit: @akbirkhan)

[image]
Also, check out this thread by @akbirkhan on lessons learned from evaluating large language models (and he evaluated a lot of them…) 17/22

So what to do now? Our results suggest that instruction finetuning as done by OpenAI is a promising approach, but of course we have no information on the method. Open-source instruction finetuning should be explored further in the future! 18/22

We are excited about the prospects of large language models becoming “communicators” by improving their pragmatic language skills!

Check the paper out for details:

Paper: arxiv.org/abs/2210.14986
Blogpost: lauraruis.github.io/2022/09/29/com…
Dataset: huggingface.co/datasets/UCL-D…

19/22
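To try the benchmark yourself, the dataset can presumably be loaded with the Hugging Face datasets library. The identifier is truncated in the link above, so this sketch leaves it as a placeholder to fill in from huggingface.co:

```python
# A sketch of loading the benchmark with Hugging Face `datasets`.
from datasets import load_dataset

# Placeholder: the full identifier is truncated in the thread's link;
# complete it from huggingface.co before running.
DATASET_ID = "UCL-D..."

dataset = load_dataset(DATASET_ID)
print(dataset)  # inspect the available splits and fields
```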
A huge thanks to @AiEleuther, @OpenAI, and @CohereAI for providing a PhD student with the necessary compute and infrastructure for all these experiments! 20/22

[image]
🙏Massive thanks to all my collaborators without whom this large-scale work would definitely not have been possible 🫶 @akbirkhan @BlancheMinerva @sarahookr @_rockt @egrefen

That’s it. 21/22

[image]
PS: also check out this work showing that LLMs failing empirically is not a problem of the next-word prediction objective; in theory, they could learn entailment semantics from text, even though the necessary context is not directly part of the context window!
22/22
