Cameron R. Wolfe, Ph.D.
ML @Netflix • Writer @ Deep (Learning) Focus • PhD @optimalab1 • I make AI understandable

Mar 27, 2023, 6 tweets

Large language models (LLMs) are fun to use, but understanding the fundamentals of how they work is also incredibly important. A major building block of LLMs is their underlying architecture: the decoder-only transformer model. 🧵[1/6]

The original transformer has two parts:

1. Encoder: uses bidirectional self-attention and feed-forward neural networks
2. Decoder: uses masked self-attention, cross attention, and feed-forward neural networks

The encoder processes input, then the decoder generates output. [2/6]
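
Here's a rough sketch of those two halves using PyTorch's built-in layers (dimensions and tensors are illustrative, not from the thread):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8

encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
decoder_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)

src = torch.randn(1, 10, d_model)  # input sequence (already embedded)
tgt = torch.randn(1, 7, d_model)   # output generated so far (already embedded)

# Causal mask so the decoder's self-attention can't look ahead.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))

memory = encoder_layer(src)                           # bidirectional self-attention + FFN
out = decoder_layer(tgt, memory, tgt_mask=tgt_mask)   # masked self-attn + cross attention + FFN
```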

Decoder-only transformers are just the decoder portion of the transformer architecture! However, the cross attention portion of the decoder is removed: without an encoder, there is no encoder output to attend to! [3/6]

Each block of the decoder-only model has masked, multi-headed self-attention and a position-wise (i.e., applied individually to each token vector) feed-forward neural network. These modules are separated by layer normalization and a residual connection. [4/6]
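
A minimal sketch of one such block in PyTorch (illustrative dimensions and a post-norm layout for simplicity; many modern LLMs use pre-norm instead):

```python
import torch
import torch.nn as nn

class DecoderOnlyBlock(nn.Module):
    """Masked multi-headed self-attention + position-wise feed-forward network,
    each wrapped in a residual connection and layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(              # applied to each token vector independently
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal (masked) self-attention: token i can only attend to tokens <= i.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, device=x.device), diagonal=1
        ).bool()
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)           # residual + layer norm
        x = self.norm2(x + self.ffn(x))        # residual + layer norm
        return x
```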

The use of masked self-attention is especially important for language modeling applications, as it prevents the model from accessing future information (i.e., predicting the next token by just copying it) and enables autoregressive language generation. [5/6]
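
To make that concrete, here is what the causal mask looks like for a toy 4-token sequence (True = blocked future position, False = visible):

```python
import torch

seq_len = 4
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(mask)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```

Row i shows that token i can attend only to itself and earlier positions, which is exactly what lets the model generate text one token at a time.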

For more details on decoder-only transformers, check out my overview of language modeling basics below!

🔗: cameronrwolfe.substack.com/p/language-mod…

[6/6]
