When speaking about the transformer architecture, one often hears it called an encoder-decoder architecture. But this is *clearly* not the case.
The transformer architecture is an example of an encoder-predictor-decoder architecture, i.e. a conditional language model.
The classic example of an encoder-decoder architecture is the autoencoder (AE): the (blue / cold / low-energy) target y is auto-encoded. (The AE slides are coming out later today.)
Now, the main difference between an AE and a language model (LM) is that the LM's input is the target delayed by one unit. This means a predictor is needed to estimate the hidden representation of a *future* symbol.
It's similar to a denoising AE, where the corruption is temporal.
We also saw how a conditional predictive energy-based model includes an additional input x (in pink). This input x can be thought of as the “context” for the prediction.
Now, putting the two together, we end up with a 2×encoder-predictor-decoder architecture.
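To make that concrete, here's a minimal PyTorch sketch of how I read the encoder-predictor-decoder view of a conditional LM. The module names, shapes, and the specific layers are my own choices for illustration, not the ones from the slides.

```python
import torch
from torch import nn

# Minimal sketch (my own naming, not the slides) of the encoder-predictor-decoder
# reading of a conditional LM: x is the "context" (pink), y the target (blue).
d, V = 64, 1000                          # hidden dim and vocab size, picked arbitrarily

enc_x = nn.Embedding(V, d)               # encoder for the context x
enc_y = nn.Embedding(V, d)               # encoder for the (delayed) target y
predictor = nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True)
dec = nn.Linear(d, V)                    # decoder: hidden representation → vocabulary logits

x = torch.randint(V, (1, 10))            # context tokens
y = torch.randint(V, (1, 8))             # target tokens
y_delayed = y[:, :-1]                    # the LM input: the target delayed by one unit

h_x = enc_x(x)                           # encode the context
h_y = enc_y(y_delayed)                   # encode the past target symbols
L = y_delayed.size(1)
mask = torch.full((L, L), float('-inf')).triu(1)   # causal mask: no peeking at the future
h_hat = predictor(h_y, h_x, tgt_mask=mask)  # predict the hidden rep. of the *future* symbol
logits = dec(h_hat)                      # decode into a distribution over the next symbol

loss = nn.functional.cross_entropy(logits.transpose(1, 2), y[:, 1:])
```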
This is what was going through my mind when I tried to explain how the “encoder-decoder transformer architecture” was supposed to work. Well, it didn't make any sense. 🙄
As for the attention part, you can find a summary below.
In addition, I've added a slide with the explicit distinction between self-attention (thinking about how to make pizza) and cross-attention (calling mom to ask for all her pizza recipes).
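If it helps, here's the same distinction as a tiny PyTorch sketch (shapes and numbers are made up): the only thing that changes is where the queries and the keys / values come from.

```python
import torch
from torch import nn

# Same module, two uses.
mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(1, 10, 64)   # my own thoughts about making pizza
m = torch.randn(1, 25, 64)   # mom's recipe collection (the context)

# Self-attention: queries, keys, and values all come from the same sequence.
h_self, _ = mha(x, x, x)

# Cross-attention: queries come from x, keys and values come from the context m.
h_cross, _ = mha(x, m, m)
```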
⚠️ Long post warning ⚠️
5 years ago, for my birthday, out of the blue (it felt so much like a prank), *The Yann LeCun* texted me on Messenger (no, we didn't know each other), offering me a life-changing opportunity, one I had failed to obtain the ‘proper’ way but got by accident. 🤷🏼♂️
Why did I fail? I'm not that smart.
Don't even start telling me I'm humble. I can gauge far too well the brain-power of NYU PhD students surrounding me, let alone my colleagues.
Did I manage to make it after years of faking it? Not in the slightest.
So, did he make a mistake picking this quirky Italian? I'd say no.
While working on an autonomous driving project, as instructed, I went out of my way to help with teaching as much as I could.
My dream was to teach worldwide, and YouTube let me do just that.
Let's try this. Hopefully, I won't regret it, haha. 😅😅😅
Sat 2 Oct 2021 @ 9:00 EST, live stream of my latest lecture.
Prerequisites: practica 1 and 2 from DLSP21.
Yesterday, in @kchonyc's NLP class, we learnt about the input (word and sentence) and class embeddings, and how these are updated using the gradient of the log-probability of the correct class, i.e. log p(y* | x).
Say x is a sentence of T words: x = {w₁, w₂, …, w_T}.
1h(w) is the one-hot representation of w (a vector with a 1 at w's index in the dictionary).
e(w) is the dense representation associated with w.
ϕ(x) = ∑ₜ e(wₜ) is the bag-of-words sentence representation.
∇_{e(w)} log p(y* | x) = ∇_{ϕ(x)} log p(y* | x) = u_y* − 𝔼_{y|x}[u_y]
We'll add to e(w) the correct class embedding u_y*, while subtracting what the network currently thinks it should be, 𝔼_{y|x}[u_y]. *If* these two are the same, the gradient is zero, and nothing gets added or subtracted.
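You can check this with autograd; here's a small sketch with shapes of my own choosing (and distinct words, so each e(w) appears exactly once in the sum):

```python
import torch

torch.manual_seed(0)
V, C, d, T = 100, 5, 16, 7                   # vocab size, classes, embedding dim, sentence length

e = torch.randn(V, d, requires_grad=True)    # word embeddings e(w)
u = torch.randn(C, d)                        # class embeddings u_y
x = torch.randperm(V)[:T]                    # a sentence of T distinct words
y_star = torch.tensor(2)                     # the correct class

phi = e[x].sum(0)                            # ϕ(x) = ∑ₜ e(wₜ), bag of words
phi.retain_grad()
logits = u @ phi                             # scores ⟨u_y, ϕ(x)⟩
log_p = torch.log_softmax(logits, 0)[y_star] # log p(y* | x)
log_p.backward()

p = torch.softmax(logits, 0).detach()        # the network's current belief p(y | x)
print(torch.allclose(phi.grad, u[y_star] - p @ u))        # ∇ϕ(x) = u_y* − 𝔼_{y|x}[u_y]
print(torch.allclose(e.grad[x], phi.grad.expand(T, -1)))  # every word in x gets the same update
```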
Learn about regularised EBMs: from prediction with latent variables to sparse coding, and from temporal regularisation methods to (conditional) variational autoencoders.
We think babies aren't the only ones who find peekaboo funny.
You let us know, okay?
😅😅😅
Learn about modern speech recognition and Graph Transformer Networks with @awnihannun!
In this lecture, Awni covers the connectionist temporal classification (CTC) loss, beam search decoding, weighted finite-state automata and transducers, and GTNs!
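For reference, here's what the CTC piece looks like with stock PyTorch (this is torch's built-in nn.CTCLoss, not the GTN-based version from the lecture; the shapes are made up):

```python
import torch
from torch import nn

T, N, C, S = 50, 4, 20, 12           # input length, batch, classes (0 = blank), target length
log_probs = torch.randn(T, N, C).log_softmax(2).requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)     # label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)            # marginalises over all alignments between inputs and labels
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```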
«Graph Transformer Networks are deep learning architectures whose states are not tensors but graphs.
You can back-propagate gradients through modules whose inputs and outputs are weighted graphs.
GTNs are very convenient for end-to-end training of speech recognition and NLP systems.»
«They can be seen as a differentiable form of WFSTs (weighted finite-state transducers), which are widely used in speech recognition.
Awni is the lead author of libgtn, a GTN library for PyTorch.»
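I won't try to reproduce the gtn API from memory here, but the core idea of “back-propagating through a weighted graph” is easy to show in plain PyTorch: compute a forward score over the graph in the log semiring, and autograd hands you a gradient (an arc posterior) for every arc weight.

```python
import torch

# Toy weighted acceptor (plain PyTorch, *not* the gtn/libgtn API).
# Nodes 0..3; node 0 is the start node, node 3 is the accept node.
arcs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]     # (source, destination)
w = torch.randn(len(arcs), requires_grad=True)      # one learnable weight per arc

# Forward score in the log semiring: visit nodes in topological order and
# log-sum-exp over the incoming arcs. Every step is differentiable.
alpha = {0: torch.tensor(0.0)}
for node in (1, 2, 3):
    incoming = [alpha[s] + w[i] for i, (s, d) in enumerate(arcs) if d == node]
    alpha[node] = torch.logsumexp(torch.stack(incoming), dim=0)

score = alpha[3]      # total (log) score of all paths from start to accept
score.backward()      # gradients flow back through the graph structure
print(w.grad)         # per-arc posteriors: how much each arc contributes to the score
```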