— Context —
When describing the transformer architecture, it's tempting to call it an encoder-decoder architecture. But this is *clearly* not the case.
The transformer architecture is an example of an encoder-predictor-decoder architecture, or a conditional language model.
The classical definition of an encoder-decoder architecture is the autoencoder (AE). The (blue / cold / low-energy) target y is auto-encoded. (The AE slides are coming out later today.)
Now, the main difference between an AE and a language model (LM) is that the LM's input is delayed by one unit with respect to its target. This means that a predictor is necessary to estimate the hidden representation of a *future* symbol.
It's similar to a denoising AE, where the corruption is temporal.
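To make the one-unit delay concrete, here's a minimal sketch of how the (input, target) pairs differ between an AE and an LM. The toy sequence `y` and the variable names are made up for illustration; they're not from the slides.

```python
import numpy as np

# Hypothetical toy sequence of symbol ids (an assumption for illustration).
y = np.array([3, 1, 4, 1, 5, 9])

# Autoencoder: the target y is auto-encoded, so input and target coincide.
ae_input, ae_target = y, y

# Language model: the input is the target delayed by one unit,
# so the symbol at position t is used to predict the symbol at t + 1.
lm_input, lm_target = y[:-1], y[1:]

for seen, future in zip(lm_input, lm_target):
    print(f"seen {seen} -> predict {future}")
```

Since the target is always one step ahead of what has been seen, a plain encoder-decoder pair isn't enough: something has to map the current hidden representation to the future one, which is the predictor's job.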
We also saw how a conditional predictive energy-based model includes an additional input x (in pink). The input x can be considered as “context” for the given prediction.
Now, putting the two things together, we end up with a 2×encoder-predictor-decoder type architecture.
This is what was going on in my mind when I was just trying to explain how the “encoder-decoder transformer architecture” was supposed to work. Well, it didn't make any sense. 🙄
For the part concerning the attention, you can find a summary below.
On top of that, I've added a slide that makes explicit the distinction between self-attention (thinking about how to make pizza) and cross-attention (calling mom, asking for all her pizza recipes).
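The self- vs. cross-attention distinction can be sketched in a few lines of NumPy. This is a simplified illustration, assuming plain scaled dot-product attention with no learned projections, no masking, and a single head; `y_hidden` and `x_hidden` are hypothetical stand-ins for the target-side and context-side hidden representations.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention: each query mixes the values,
    # weighted by how well it matches the corresponding keys.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
y_hidden = rng.normal(size=(4, 8))  # 4 target-side positions, dimension 8
x_hidden = rng.normal(size=(6, 8))  # 6 context-side positions, dimension 8

# Self-attention: queries, keys and values all come from the same set
# (thinking about how to make pizza).
self_out, self_w = attention(y_hidden, y_hidden, y_hidden)

# Cross-attention: queries come from y, keys and values come from the
# context x (calling mom, asking for all her pizza recipes).
cross_out, cross_w = attention(y_hidden, x_hidden, x_hidden)

print(self_w.shape, cross_w.shape)
```

The only difference between the two calls is where the keys and values come from. The queries are the same in both cases, so the output always has one row per target-side position.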