Felix Hill
Research Scientist, DeepMind. I try to think hard about everything I tweet, esp on 90s football and 80s music. None of my opinions are really someone else's.
May 9, 2023 16 tweets 4 min read
Lots of folks are writing papers at the moment.

Writing a paper from scratch can be daunting, so here’s a *tried and tested method* that I find works well, particularly if writing with multiple authors.

1/

I’m assuming you have already done some experiments and have some interesting results you’d like to share. I’m also assuming you are the *lead* or *co-lead* author of the paper.

First step: write a (preliminary) paper title and agree it with the other authors.

2/
Feb 20, 2023 4 tweets 2 min read
All of NLP is/was alignment research

It just didn't know it for the first 70 years.

Here's my PhD work on alignment of word embeddings:

arxiv.org/abs/1408.3456
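(For readers unfamiliar with the idea, here is a minimal sketch of one standard way to align two word-embedding spaces: orthogonal Procrustes. This is a generic illustration, not the method from the paper above; all names and sizes are made up.)

```python
# Illustrative sketch: align two embedding spaces with orthogonal Procrustes.
import numpy as np

def align_embeddings(X, Y):
    """Return the orthogonal map W minimising ||XW - Y||_F.

    X, Y: (n_pairs, dim) arrays holding embeddings of the same words
    in the source and target spaces respectively.
    """
    # SVD of the cross-covariance gives the closed-form solution W = U V^T
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy usage with random "embeddings" related by a hidden rotation
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
R = np.linalg.qr(rng.normal(size=(50, 50)))[0]  # hidden orthogonal map
Y = X @ R
W = align_embeddings(X, Y)
print(np.linalg.norm(X @ W - Y))  # ~0: the rotation is recovered
```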
Dec 30, 2022 9 tweets 3 min read
This explanation thread on why transformers support in-context learning and fast adaptation (promptability) better than recurrent architectures like RNNs or LSTMs is getting interest

So time for *LSTMs All is Not Lost*, a short tale about why LSTMs are not worthless. 1/

Recurrent nets like LSTMs endow a network with a strong inductive bias to connect nearby elements of a sequence.

This *ought* to be useful when modelling language, but it turns out that a network without this bias can easily learn it, since the data points in that direction. 2/
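(A minimal PyTorch sketch of that structural contrast, with made-up toy dimensions: recurrence passes information step by step through a state, so nearby tokens interact first, while self-attention connects every pair of positions directly.)

```python
import torch
import torch.nn as nn

seq, dim = 8, 16
x = torch.randn(1, seq, dim)  # (batch, time, features)

# Recurrent route: token t sees token t-k only after k recurrent steps,
# so the architecture itself privileges nearby elements of the sequence
lstm = nn.LSTM(dim, dim, batch_first=True)
h_rec, _ = lstm(x)

# Attention route: token t attends to every position in a single step;
# distance enters only through whatever position encoding you add
attn = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)
h_att, weights = attn(x, x, x)
print(weights.shape)  # (1, 8, 8): one direct connection per pair of positions
```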
Dec 30, 2022 13 tweets 4 min read
This seems wrong, at least if you consider the emergence of fast adaptability (or "in-context learning") as one of the key facets of LLMs.

See Fig 7 in this paper, experiments on Omniglot meta-learning with different memory architectures (1/2)

arxiv.org/abs/2205.05055

These archs are not 'meta-trained': they are trained to predict the label for the next image, and labels are fixed throughout training.

In a transformer, the ability to quickly learn new image-label mappings emerges, but in RNN or LSTM networks it doesn't.

2/3
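(A rough sketch of that training setup, with details assumed rather than copied from the paper: the model sees a sequence of (image, label) pairs and must predict the label of a final query image. The class-to-label mapping is fixed during training; in-context learning is probed by swapping in novel mappings at evaluation time.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, img_dim, context_len = 100, 64, 4
class_to_label = np.arange(n_classes)  # fixed mapping throughout training

def make_sequence():
    classes = rng.choice(n_classes, size=context_len + 1)
    images = rng.normal(size=(context_len + 1, img_dim))  # stand-in for Omniglot
    labels = class_to_label[classes]
    # context: (image, label) pairs; target: the label of the final query image
    context = list(zip(images[:-1], labels[:-1]))
    query_image, target = images[-1], labels[-1]
    return context, query_image, target

context, query, target = make_sequence()
print(len(context), target)
```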
Nov 16, 2022 12 tweets 4 min read
Lots of folks are talking about *emergence* in Deep Learning as if it's a new thing that happens only in large language models at scale.

It's not! It has been happening for decades and in very small networks.

🧵 🧵 🧵 🧵 🧵 🧵 🧵 🧵 🧵

In the late 80s, Elman wrote one of the most impactful papers in the history of Psychology.

The topic: *emergence in neural networks*

The scale: tiny toy datasets

onlinelibrary.wiley.com/doi/pdf/10.120…

2/n
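(For context, Elman's simple recurrent network really was tiny by modern standards; torch.nn.RNN with a tanh nonlinearity is essentially the same architecture. A minimal sketch with illustrative sizes, not Elman's exact setup:)

```python
import torch
import torch.nn as nn

class ElmanSRN(nn.Module):
    """Elman-style simple recurrent network for next-token prediction."""
    def __init__(self, vocab_size, hidden=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # The Elman network: hidden state fed back through a tanh unit
        self.rnn = nn.RNN(hidden, hidden, nonlinearity="tanh", batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)  # predict the next token

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)

model = ElmanSRN(vocab_size=30)
logits = model(torch.randint(0, 30, (1, 10)))
print(logits.shape)  # (1, 10, 30)
```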
Sep 4, 2020 13 tweets 5 min read
#GPT3 from @OpenAI showed that large neural language models have an emergent ability to rapidly acquire and use new words.

We develop an agent that does this in a simulated 3D environment. 

arxiv.org/abs/2009.01719

1/N
The key insight is a particular form of external memory system, motivated by Paivio's Dual-Coding Theory of knowledge representation and Baddeley's model of Working Memory.

2/N
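(A loose sketch of that idea as I read it, not the paper's exact system: a "dual-coding" memory stores paired visual and language embeddings side by side, so a query in one modality can retrieve the paired code from the other via attention. All sizes and names here are made up.)

```python
import torch
import torch.nn.functional as F

class DualCodeMemory:
    """Stores (visual, language) embedding pairs; reads across modalities."""
    def __init__(self):
        self.visual, self.language = [], []

    def write(self, v_emb, l_emb):
        self.visual.append(v_emb)
        self.language.append(l_emb)

    def read(self, v_query):
        keys = torch.stack(self.visual)      # (n, d) visual codes as keys
        values = torch.stack(self.language)  # (n, d) language codes as values
        weights = F.softmax(keys @ v_query, dim=0)
        return weights @ values              # language code for this view

mem = DualCodeMemory()
for _ in range(5):
    mem.write(torch.randn(8), torch.randn(8))
print(mem.read(torch.randn(8)).shape)  # torch.Size([8])
```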
Dec 14, 2019 4 tweets 3 min read
Loving the workshop on context and compositionality. Opportunity to ask a question that has troubled me for some time: what is compositionality?

Weak variant: "the parts of the input affect the meaning of the whole". Trivially true, even of e.g. the pixels in an image. 1/...

Strong variant: "the parts of the input entirely determine the meaning of the whole". Trivially false (unless you're doing mathematics / formal logic / model theory) because of #context and #memory. So what's left? 2/2
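(For reference, the strong variant is usually stated as a homomorphism condition, roughly Montague-style:)

```latex
% Strong compositionality: the meaning of a complex expression is fully
% determined by the meanings of its parts and the rule combining them.
\[
  [\![\, \sigma(e_1, \dots, e_n) \,]\!] \;=\; f_\sigma\big([\![e_1]\!], \dots, [\![e_n]\!]\big)
\]
% The thread's objection: once context and memory are in play, the
% right-hand side alone does not fix the left, so the strong variant fails.
```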