Building on parts 1 & 2, which explained multi-head attention and GPT, in part 3 of the Transformer Series we'll cover masked language models like BERT.
This thread → masked language models, diff between causal and bi-directional masked attention, finetuning, and code.
1/N
Since we'll be referencing multi-head attention and GPT, make sure to read parts 1 & 2 if you're unfamiliar with these concepts.
2/N
We saw with GPT that we can pre-train language models with a causal, predict-the-future objective. BERT instead uses a fill-in-the-blank objective. It's called bi-directional because, unlike GPT (which is causal), it sees both past and future tokens at once.
3/N
So how does BERT work? The architecture is outlined below. BERT works (almost) exactly like GPT, except for two main differences: (A) instead of using a causal mask, we use a random one, and (B) BERT prepends an additional [CLS] token to each sequence. What is this new [CLS] token?
4/N
Suppose we want to use BERT for sentiment classification. Our transformer's outputs have shape (B, T, D). We want to compress this to (B, D) so each data point has an aggregate representation that we can feed into a classifier. First guess - how about we average over the T dim?
5/N
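Roughly, that first guess could look like this (a minimal sketch with made-up sizes, not code from the thread):

```python
import torch

B, T, D = 8, 128, 768                  # batch, sequence length, hidden dim (illustrative)
hidden_states = torch.randn(B, T, D)   # stand-in for the transformer's outputs

# average over the T (token) dimension -> one aggregate vector per example
pooled = hidden_states.mean(dim=1)     # (B, D)
```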
This will work, but it assumes that all tokens are equally useful for classification. Isn't the whole point of attention to weigh tokens based on their relevance? What if we add a new token to the input that aggregates the other tokens with attention? That's the point of [CLS].
6/N
When we finetune BERT for classification (hence the token's name), we use the final hidden state of the [CLS] token as the input to a classifier, which is just an MLP that projects the [CLS] hidden state of shape (B, D) to logits of shape (B, num_classes).
7/N
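A hedged sketch of such a classification head - the two-layer MLP and all sizes are illustrative assumptions, not the thread's exact code:

```python
import torch
import torch.nn as nn

B, T, D, num_classes = 8, 128, 768, 2        # illustrative sizes
hidden_states = torch.randn(B, T, D)         # BERT outputs; [CLS] sits at position 0

cls_hidden = hidden_states[:, 0]             # (B, D): final hidden state of [CLS]
classifier = nn.Sequential(                  # small MLP head (assumed architecture)
    nn.Linear(D, D), nn.Tanh(),
    nn.Linear(D, num_classes),
)
logits = classifier(cls_hidden)              # (B, num_classes)
```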
Aside from [CLS] there are other special tokens in BERT - namely [MASK], which replaces masked tokens, and [SEP], which separates sentences. Tokenization happens during data loading. Once we've tokenized, we fuse the token and positional embeddings as in the code below.
8/N
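The thread's original code isn't reproduced here, so below is a minimal sketch of that fusion step; the special-token ids and sizes are made-up placeholders:

```python
import torch
import torch.nn as nn

vocab_size, max_len, D = 30_000, 512, 768
CLS_ID, SEP_ID, MASK_ID = 101, 102, 103        # placeholder special-token ids

tok_emb = nn.Embedding(vocab_size, D)          # token embeddings
pos_emb = nn.Embedding(max_len, D)             # learned positional embeddings

def embed(token_ids):                          # token_ids: (B, T), already tokenized
    B, T = token_ids.shape
    positions = torch.arange(T, device=token_ids.device)    # (T,)
    return tok_emb(token_ids) + pos_emb(positions)          # (B, T, D)

# e.g. a single sequence: [CLS] ... [MASK] ... [SEP]
ids = torch.tensor([[CLS_ID, 2054, 2003, MASK_ID, SEP_ID]])
x = embed(ids)                                 # (1, 5, 768), fed into the BERT blocks
```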
Now that we know how tokenization works, we can see that the BERT block and pre-training objective are essentially identical to GPT's. The only diff is that BERT uses a random mask while GPT uses a causal one, so the code implementation is nearly identical for both.
9/N
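To make the contrast concrete, here's a small sketch (my own, assuming a 15% masking rate) of the two kinds of masks:

```python
import torch

B, T = 8, 128

# GPT: causal attention mask - position t may only attend to positions <= t
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))   # (T, T)

# BERT: random mask over input tokens - each token is masked with prob p
p = 0.15
random_mask = torch.rand(B, T) < p                              # (B, T), True = masked
```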
For BERT, each time we pass in a batch of data for pre-training, we sample a new random mask. This can be done efficiently if the mask is created directly on the GPU. Here's what mask creation and the masked language modeling objective look like. Pretty straightforward!
10/N
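Since the original code isn't shown here, this is a minimal sketch of GPU-side mask creation plus the masked-LM loss, assuming a 15% masking rate and simple [MASK] replacement:

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, token_ids, mask_id, p=0.15):
    """token_ids: (B, T) on GPU. model maps (B, T) ids -> (B, T, vocab) logits."""
    # sample a fresh random mask directly on the same device as the data
    mask = torch.rand(token_ids.shape, device=token_ids.device) < p   # (B, T)

    inputs = token_ids.clone()
    inputs[mask] = mask_id                      # replace masked tokens with [MASK]

    logits = model(inputs)                      # (B, T, vocab_size)
    # cross-entropy only on the masked positions
    return F.cross_entropy(logits[mask], token_ids[mask])
```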
Tl;dr BERT = GPT but with random masking and some special tokens (e.g. [CLS] token). It's remarkable that transformers are so simple and general-purpose.
Next time I'll show how transformers can be used in computer vision with the Vision Transformer (ViT) + MAE loss.
11/N END
GPT has been a core part of the unsupervised learning revolution that’s been happening in NLP.
In part 2 of the transformer series, we’ll build GPT from the ground up. This thread → masked causal self-attention, the transformer block, tokenization & position encoding.
1/N
In part 1 we covered multi-head attention (MHA). tl;dr attention allows a neural network to “see” all words in the input as well as their relationships. As a result the net attends to the most important words for optimizing its objective.
So far, we haven’t defined an objective for MHA to optimize. GPT uses a very simple unsupervised objective - predict the next word in a sentence given previous words. This objective is called unsupervised because it doesn’t require any labels.
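A minimal sketch of that next-token objective (not the thread's code; `model` stands in for any causal transformer that returns per-token logits):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """token_ids: (B, T). model maps (B, T) ids -> (B, T, vocab) logits with causal masking."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # predict token t+1 from tokens <= t
    logits = model(inputs)                                  # (B, T-1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```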
Transformers are arguably the most impactful deep learning architecture from the last 5 yrs.
In the next few threads, we’ll cover multi-head attention, GPT and BERT, Vision Transformer, and write these out in code. This thread → understanding multi-head attention.
1/n
What is attention? Say you want to classify the sentiment of “attention is not too shabby.” “shabby” suggests 😞 but “not” actually means it's 😀. To correctly classify, you need to look at all the words in the sentence. How can we achieve this?
2/n
The simplest thing we can do is input all words into the network. Is that enough? No. The net needs to not only see each word but understand its relation to other words. E.g. it’s crucial that “not” refers to “shabby”. This is where queries, keys, values (Q,K,V) come in.
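As a preview, here's a minimal single-head sketch of scaled dot-product attention (my own simplification, not the thread's multi-head code):

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Q, K, V: (B, T, d). Returns the attended values and the attention weights."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (B, T, T): how much each word attends to every other
    weights = F.softmax(scores, dim=-1)           # each row sums to 1
    return weights @ V, weights                   # weighted sum of the values
```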
Patch extraction is a fundamental operation in deep learning, especially for computer vision.
By the end of this thread, you’ll know how to implement an efficient vectorized patch extractor (no for loops) in a few lines of code and learn about memory allocation in numpy.
1/n
In deep learning we often need to preprocess inputs into patches. This can mean splitting an image into overlapping or non-overlapping 2D patches or splitting a long audio or text input into smaller equally sized chunks.
2/n
Implementing patches efficiently is harder than it seems. For example, we can load an image into a numpy array, then write a for loop to index into the array and get patches. This works but requires extra memory and the for loop is slow. Can we do better?
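One loop-free way to do it for non-overlapping patches is a reshape + transpose; this is a sketch of the idea and may differ from the thread's exact implementation:

```python
import numpy as np

def extract_patches(img, p):
    """Split an (H, W, C) image into non-overlapping (p, p, C) patches, no Python loops."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image dims must be divisible by the patch size"
    x = img.reshape(H // p, p, W // p, p, C)   # split each spatial dim into (blocks, p)
    x = x.transpose(0, 2, 1, 3, 4)             # (H//p, W//p, p, p, C)
    return x.reshape(-1, p, p, C)              # (num_patches, p, p, C)

img = np.arange(6 * 6 * 3).reshape(6, 6, 3)
print(extract_patches(img, 3).shape)           # (4, 3, 3, 3)
```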
Humans reuse skills effortlessly to learn new tasks - can robots do the same? In our new paper, we show how to pre-train robotic skills and adapt them to new tasks in a kitchen.
tl;dr you’ll have a robot chef soon. 🧑🍳🤖
links / details below
thread 🧵 1/10
Title: Hierarchical Few-Shot Imitation with Skill Transition Models
Paper: arxiv.org/abs/2107.08981
Site: sites.google.com/view/few-shot-…
Main idea: fit generative “skill” model on large offline dataset, adapt it to new tasks
Result: show robot a new task, it will imitate it
2/10
We introduce Few-shot Imitation with Skill Transition Models (FIST). FIST first extracts skills from a diverse offline dataset of demonstrations, and then adapts them to the new downstream task. FIST has 3 steps: (1) Extraction, (2) Adaptation, and (3) Evaluation.
3/10
Is RL always data inefficient? Not necessarily. The Framework for Efficient Robotic Manipulation (FERM) shows real robots can learn basic skills from pixels with sparse rewards in *30 minutes* using 1 GPU 🦾
Real-robot RL is challenging for a number of reasons, and data efficiency is chief among them. Common workarounds are training in simulation and transferring the learned policy to the real robot (Sim2Real) or parallelizing training with robot farms (QT-Opt).
2/N
But what makes RL data inefficient in the first place? One hypothesis: (i) representation learning and (ii) exploration are the bottlenecks. In principle, if we solve both problems, RL should be able to learn quickly.
3/N