Felix Hill
Research Scientist, DeepMind. I try to think hard about everything I tweet, especially on 90s football and 80s music. None of my opinions are really someone else's
Oct 18 9 tweets 2 min read
Since GPT-3, we have known that large Transformer language models are effective few-shot learners.

But did you know that the nature of language itself makes it particularly rich pre-training data for inducing few-shot learning?

1/n
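To make "few-shot" concrete, here is a minimal sketch of in-context prompting (my illustration, not from the thread): the "training examples" live entirely in the prompt, and no weights are updated.

```python
# A minimal few-shot prompt: the model infers the task (English -> French
# translation) from two in-context examples, with no gradient updates.
prompt = (
    "English: cheese -> French: fromage\n"
    "English: house -> French: maison\n"
    "English: dog -> French:"
)
# A sufficiently large LM will typically complete this with " chien",
# i.e. it "learns" the mapping from the prompt alone.
```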
The word frequencies in all languages follow a power-law (Zipfian) distribution.

This means that, during pretraining, the LLM sees a small number of words a lot, and lots of different words very rarely.

2/n
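Numerically, the Zipfian shape looks something like this (a rough sketch; the exponent and vocabulary size are arbitrary choices, not figures from the thread):

```python
import numpy as np

# Zipfian (power-law) word distribution: the k-th most frequent word
# has probability proportional to 1/k^s.
vocab_size = 50_000
s = 1.0  # Zipf exponent; s ~ 1 is a common fit for natural language
ranks = np.arange(1, vocab_size + 1)
probs = ranks ** -s
probs /= probs.sum()

# The head is seen constantly, the long tail almost never:
print(f"top 100 words cover {probs[:100].sum():.0%} of tokens")
print(f"bottom half covers {probs[vocab_size // 2:].sum():.1%} of tokens")
```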
May 9, 2023 16 tweets 4 min read
Lots of folks are writing papers at the moment.

Writing a paper from scratch can be daunting, so here’s a *tried and tested method* that I find works well, particularly if writing with multiple authors.

1/ I’m assuming you have already done some experiments and have some interesting results you’d like to share. I’m also assuming you are the *lead* or *co-leading* author of the paper.

First step: write a (preliminary) paper title and agree on it with your co-authors.

2/
Feb 20, 2023 4 tweets 2 min read
All of NLP is/was alignment research

It just didn't know it for the first 70 years.

Here's my PhD work on alignment of word embeddings:

arxiv.org/abs/1408.3456
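For context, a common textbook way to align two embedding spaces is a least-squares linear map between them; this sketch illustrates that generic idea and is not necessarily the method of the linked paper.

```python
import numpy as np

# Illustrative only: learn a linear map W that takes source-language
# vectors onto their translations, via ordinary least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))  # source embeddings for seed word pairs
Y = rng.normal(size=(1000, 300))  # target embeddings, row-aligned with X

# Solve min_W ||XW - Y||^2 in closed form.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# New source words can now be mapped into the target space with x @ W.
```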
Dec 30, 2022 9 tweets 3 min read
This explanation thread on why transformers support in-context learning and fast adaptation (promptability) better than recurrent architectures like RNNs or LSTMs is getting interest

So time for *LSTMs All is Not Lost* - a short tale about why LSTMs are not worthless 1/

Recurrent nets like LSTMs endow a network with a strong inductive bias to connect nearby elements of a sequence

This *ought* to be useful when modelling language, but it turns out that a network without this bias can easily learn it, since the data points in that direction 2/
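A small illustration of the structural contrast (my sketch, using standard PyTorch modules): in an LSTM, information between distant positions flows through many recurrent updates, while one self-attention layer connects every pair of positions directly.

```python
import torch
import torch.nn as nn

# An LSTM routes information between positions i and j through |i - j|
# sequential steps; self-attention connects every pair in a single step.
seq = torch.randn(1, 128, 64)  # (batch, time, features)

lstm = nn.LSTM(64, 64, batch_first=True)
out_lstm, _ = lstm(seq)  # token 0 reaches token 127 via 127 recurrent updates

attn = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
out_attn, _ = attn(seq, seq, seq)  # token 0 attends to token 127 directly
```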
Dec 30, 2022 13 tweets 4 min read
This seems wrong, at least if you consider the emergence of fast adaptability (or "in-context learning") as one of the key facets of LLMs.

See Fig 7 in this paper, experiments on Omniglot meta-learning with different memory architectures (1/2)

arxiv.org/abs/2205.05055

These architectures are not 'meta-trained': they are trained to predict the label for the next image, and labels are fixed throughout training.

In a transformer, the ability to quickly learn new image-label mappings emerges, but in RNN or LSTM networks it doesn't

2/3
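Roughly, in code, the setup reads like this (my paraphrase; the helper and its arguments are made up for illustration, not the authors' implementation):

```python
import random

# Sketch of the sequence structure: a context of (image, label) pairs,
# then a query image. During training each class keeps the SAME label.
def make_sequence(dataset, num_context=8):
    """dataset: list of (image, class_label) pairs with fixed labels."""
    pairs = random.sample(dataset, num_context + 1)
    context, (query_image, query_label) = pairs[:-1], pairs[-1]
    return context, query_image, query_label  # model predicts query_label

# At evaluation, novel images get novel labels that appear only in the
# context. A transformer can bind them on the fly; the RNN/LSTM baselines
# in the paper's Fig 7 cannot.
```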
Nov 16, 2022 12 tweets 4 min read
Lots of folks are talking about *emergence* in Deep Learning as if it's a new thing that happens only in large language models at scale.

It's not! It has been happening for decades and in very small networks.

🧵 🧵 🧵 🧵 🧵 🧵 🧵 🧵 🧵

In the late 80s, Elman wrote one of the most impactful papers in the history of Psychology.

The topic: *emergence in neural networks*

The scale: tiny toy datasets

onlinelibrary.wiley.com/doi/pdf/10.120…

2/n
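For a sense of scale, an Elman-style simple recurrent network fits in a few lines; this is an illustrative sketch, and the layer sizes are my stand-ins for "tiny", not the paper's exact numbers.

```python
import numpy as np

# A minimal Elman-style simple recurrent network (SRN): the previous
# hidden state is fed back in as "context units" at every step.
def srn_step(x, h, W_xh, W_hh, W_hy):
    h = np.tanh(x @ W_xh + h @ W_hh)  # hidden units see input + context
    y = h @ W_hy                      # predict the next element
    return y, h

rng = np.random.default_rng(0)
n_in, n_hidden = 30, 150  # toy sizes, of the order Elman worked with
W_xh = rng.normal(scale=0.1, size=(n_in, n_hidden))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_hidden, n_in))
```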
Sep 4, 2020 13 tweets 5 min read
#GPT3 from @OpenAI showed an emergent ability of large neural language models to rapidly acquire and use new words.

We develop an agent that does this in a simulated 3D environment. 

arxiv.org/abs/2009.01719

1/N
The key insight is a particular form of external memory system motivated by Paivio's Dual-Coding Theory of knowledge representation and Baddeley's model of Working Memory.

2/N
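To give a flavour of what dual coding might mean in code (entirely my illustration; the class and its methods are hypothetical, not the paper's architecture): each memory slot stores a visual code and a language code together, so a cue in either modality retrieves both.

```python
import numpy as np

# Hypothetical sketch of a dual-coding external memory: every written
# slot pairs a visual code with a language code, and a query against
# one modality returns the jointly stored pair.
class DualCodeMemory:
    def __init__(self):
        self.slots = []  # list of (visual_vec, language_vec) pairs

    def write(self, visual_vec, language_vec):
        self.slots.append((visual_vec, language_vec))

    def read(self, query_vec, modality=0):
        # modality: 0 = query by visual code, 1 = query by language code
        keys = np.stack([slot[modality] for slot in self.slots])
        best = int(np.argmax(keys @ query_vec))
        return self.slots[best]  # e.g. recover a new object's name
```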
Dec 14, 2019 4 tweets 3 min read
Loving the workshop on context and compositionality. Opportunity to ask a question that has troubled me for some time: what is compositionality?

Weak variant: "the parts of the input affect the meaning of the whole". Trivially true even of, e.g., the pixels in an image. 1/...

Strong variant: "the parts of the input entirely determine the meaning of the whole". Trivially false (unless in mathematics / formal logic / model theory) because of #context and #memory. So what's left? 2/2