GPT has been a core part of the unsupervised learning revolution that’s been happening in NLP.
In part 2 of the transformer series, we’ll build GPT from the ground up. This thread → masked causal self-attention, the transformer block, tokenization & position encoding.
1/N
In part 1 we covered multi-head attention (MHA). tl;dr attention allows a neural network to “see” all words in the input as well as their relationships. As a result the net attends to the most important words for optimizing its objective.
So far, we haven’t defined an objective for MHA to optimize. GPT uses a very simple unsupervised objective - predict the next word in a sentence given previous words. This objective is called unsupervised because it doesn’t require any labels.
3/N
To predict future words we need to enforce causal structure. We can do this with the attention matrix. A value of 0 means “no relationship”. So we need to set attentions between current & future words to 0. We do this by setting the entries of QK.T that correspond to future words to -inf. Why -inf?
4/N
We want attention to be 0 for future words, but if we apply the mask after the softmax, attention will no longer be normalized. So we set QK.T where mask=0 to -inf and then normalize. Notice how although we only have 1 sentence we’re making 4 predictions.
5/N
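To make this concrete, here's a minimal single-head sketch (assuming PyTorch; the function name and shapes are just for illustration, and this isn't necessarily the code linked at the end of the thread):

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    seq_len, dim = q.shape[1], q.shape[2]
    scores = q @ k.transpose(-2, -1) / dim ** 0.5      # (batch, seq_len, seq_len)
    # mask = 0 (False) for future positions: the strictly upper-triangular entries
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # set masked-out scores to -inf *before* the softmax ...
    scores = scores.masked_fill(~mask, float("-inf"))
    # ... so the softmax gives those positions weight exactly 0
    # while each row still sums to 1
    attn = F.softmax(scores, dim=-1)
    return attn @ v
```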
Masked causal attention is the main idea of GPT. Now we just need to define the full architecture. The transformer block for GPT is MHA → LayerNorm → MLP → LayerNorm. MHA does the bulk of the work, LayerNorms normalize outputs, MLP projects + adds capacity.
6/N
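A minimal PyTorch-style sketch of one such block in the order described above (MHA → LayerNorm → MLP → LayerNorm); residual connections, which GPT blocks also use, are included, and the module names are just illustrative:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim, n_heads, mlp_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x, causal_mask):
        # causal_mask: (seq_len, seq_len) bool, True = future position (not allowed to attend)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.ln1(x + attn_out)       # MHA, then normalize
        x = self.ln2(x + self.mlp(x))    # MLP projects + adds capacity, then normalize
        return x
```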
So far, we’ve been saying that GPT predicts words. That’s not entirely true. There are ~1M words in English - if we were literally predicting words, each prediction would be a classification across ~1M classes. To see this clearly, let’s write down the loss GPT optimizes.
7/N
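To make the "classification over classes" point explicit, the objective is the standard next-token cross-entropy (a sketch; here $w_t$ is the $t$-th word and $\theta$ the model parameters):

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta\left(w_{t+1} \mid w_1, \dots, w_t\right)$$

Each $p_\theta(\cdot \mid \cdot)$ is a softmax over every possible class, which is why the number of classes matters.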
To reduce the # of classes in the loss we instead use “tokens”. A token maps a piece of text to a unique vector. E.g. each char in the alphabet can be represented by one of 26 unique vectors - these vectors are called tokens. The map from strings to sequences of these vectors is called tokenization.
8/N
For simple problems tokenizing each char is OK. But it’s not efficient - char groups like “it” and “the” occur frequently, so we’d prefer to give them their own tokens. For this reason, GPT uses Byte Pair Encoding (BPE), which iteratively merges common char groups into tokens.
9/N
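A rough sketch of the core BPE training loop (count the most frequent adjacent pair, merge it, repeat); this is illustrative, not GPT's actual tokenizer:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # start from character-level tokens for each word
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # count adjacent token pairs across the corpus
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # merge every occurrence of the most frequent pair into one token
        new_corpus = []
        for toks in corpus:
            merged, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(toks[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return merges, corpus
```

E.g. running `bpe_merges(["the", "theme", "this"], 2)` merges ("t", "h") first, then ("th", "e").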
A final but *very* important detail is that our model currently has no way of knowing the order of words. As it stands, our model cannot distinguish between “my dog ate my homework” and “my homework ate my dog” despite them having opposite meanings.
10/N
To encode order into the model, we use positional tokens. Similar to char tokens, we label each position with a unique vector. We then project the char and position tokens with linear layers and add them. This embedding is then passed to the transformer block.
11/N
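A minimal sketch of that embedding step (an `nn.Embedding` lookup is equivalent to a linear layer applied to one-hot token/position vectors; the names here are illustrative):

```python
import torch
import torch.nn as nn

class GPTEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, dim):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)   # one vector per token id
        self.pos_emb = nn.Embedding(max_len, dim)      # one vector per position

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer ids
        seq_len = token_ids.shape[1]
        positions = torch.arange(seq_len, device=token_ids.device)
        # add token and position embeddings, then feed to the transformer blocks
        return self.tok_emb(token_ids) + self.pos_emb(positions)
```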
We covered masked causal attention, the GPT objective, transformer blocks, tokens & positions. Note: there are ofc many other strategies for tokenization / pos encoding.
Putting it all together, here’s the code for the GPT architecture! tinyurl.com/mr2dj2z6
12/N END
In our new work - Algorithm Distillation - we show that transformers can improve themselves autonomously through trial and error without ever updating their weights.
No prompting, no finetuning. A single transformer collects its own data and maximizes rewards on new tasks.
1/N
We've seen a lot of successful models showing how transformers can learn in-context.
But transformers have not been shown to *reinforcement* learn in-context. To adapt to new tasks, you either need to manually specify a prompt or finetune the model (e.g. on preferences).
2/N
Would be great if transformers could adapt (do RL) out-of-the-box.
Don't Decision Transformers (DTs) / Gato do RL? No!
DTs and Gato learn policies from offline data, but these policies cannot improve themselves autonomously through trial and error.
How much memory do you need to train deep neural networks? You may find the answer counterintuitive.
For example, suppose we're training a 4 megabyte MLP with batch_size = hidden_dim. How much memory do we need? 4MB? No - we need 8MB!
Here's why...
1/N
Consider a 1M param MLP where each param is stored as a float32. How much memory is required to train the MLP?
You might guess that it's the amount of bytes needed to store the model:
1M params * 4 bytes per float32 = 4MB.
This is wrong...
2/N
...or rather, not entirely correct.
Since we train deep nets with backpropagation, we need to store not just the model but also all of the activations from the fwd pass in order to compute gradients.
The memory needed to store activations is often >> size(model).
3/N
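A back-of-the-envelope sketch of that claim for a plain MLP with batch_size = hidden_dim (the layer count and width here are made up for illustration; gradients and optimizer state would add even more on top):

```python
BYTES_PER_FLOAT32 = 4

def mlp_memory(n_layers, hidden_dim, batch_size):
    # params: one hidden_dim x hidden_dim weight matrix per layer (biases ignored)
    param_bytes = n_layers * hidden_dim * hidden_dim * BYTES_PER_FLOAT32
    # activations saved for backprop: one (batch_size, hidden_dim) output per layer
    act_bytes = n_layers * batch_size * hidden_dim * BYTES_PER_FLOAT32
    return param_bytes, act_bytes

params, acts = mlp_memory(n_layers=16, hidden_dim=256, batch_size=256)
print(params / 1e6, "MB of params")       # ~4.2 MB
print(acts / 1e6, "MB of activations")    # ~4.2 MB when batch_size == hidden_dim
```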
Building on parts 1 & 2 which explained multi-head attention and GPT, in part 3 of the Transformer Series we'll cover masked language models like BERT.
This thread → masked language models, diff between causal and bi-directional masked attention, finetuning, and code.
1/N
Since we'll be referencing multi-head attention and GPT, make sure to read parts 1 & 2 if you're unfamiliar with these concepts.
We saw with GPT that we can pre-train language models with a causal predict-the-future objective. Instead, BERT uses a fill-in-the-blank objective. It is called bi-directional because unlike GPT (which is causal) it sees both past and future tokens at once.
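A rough sketch of the difference in the attention mask (causal for GPT vs. fully bi-directional for BERT), plus the fill-in-the-blank corruption; the ~15% masking rate is BERT's published setting, everything else here is illustrative:

```python
import torch

seq_len = 5

# GPT (causal): position t can only attend to positions <= t
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# BERT (bi-directional): every position attends to every position
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# BERT's fill-in-the-blank objective: replace ~15% of input tokens with [MASK]
# and train the model to predict the original token at each masked position
token_ids = torch.randint(0, 1000, (1, seq_len))
mask_positions = torch.rand(1, seq_len) < 0.15
MASK_ID = 999  # illustrative [MASK] id
corrupted = token_ids.masked_fill(mask_positions, MASK_ID)
```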
Transformers are arguably the most impactful deep learning architecture from the last 5 yrs.
In the next few threads, we’ll cover multi-head attention, GPT and BERT, Vision Transformer, and write these out in code. This thread → understanding multi-head attention.
1/n
What is attention? Say you want to classify the sentiment of “attention is not too shabby.” “shabby” suggests 😞 but “not” actually means it's 😀. To correctly classify you need to look at all the words in the sentence. How can we achieve this?
2/n
The simplest thing we can do is input all words into the network. Is that enough? No. The net needs to not only see each word but understand its relation to other words. E.g. it’s crucial that “not” refers to “shabby”. This is where queries, keys, values (Q,K,V) come in.
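As a preview of where Q, K, V go, here's a minimal single-head PyTorch sketch of scaled dot-product attention; the projection and class names are just for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # each word embedding gets projected into a query, a key, and a value
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, seq_len, dim) word embeddings
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # score how related each pair of words is, e.g. "not" vs "shabby"
        scores = q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)  # how much each word attends to every other word
        return attn @ v                   # mix the values according to those weights
```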
Patch extraction is a fundamental operation in deep learning, especially for computer vision.
By the end of this thread, you’ll know how to implement an efficient vectorized patch extractor (no for loops) in a few lines of code and learn about memory allocation in numpy.
1/n
In deep learning we often need to preprocess inputs into patches. This can mean splitting an image into overlapping or non-overlapping 2D patches or splitting a long audio or text input into smaller equally sized chunks.
2/n
Implementing patches efficiently is harder than it seems. For example, we can load an image into a numpy array, then write a for loop to index into the array and get patches. This works but requires extra memory and the for loop is slow. Can we do better?
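As a preview, here's one way to do the non-overlapping 2D case with a reshape + transpose in numpy; this is a sketch of the general technique, not necessarily the exact code from the thread:

```python
import numpy as np

def extract_patches(img, patch_size):
    # img: (H, W, C) with H and W divisible by patch_size
    H, W, C = img.shape
    p = patch_size
    # reshape splits each spatial axis into (num_patches, patch_size) ...
    x = img.reshape(H // p, p, W // p, p, C)
    # ... and transpose groups the two patch-grid axes together
    x = x.transpose(0, 2, 1, 3, 4)        # (H//p, W//p, p, p, C)
    return x.reshape(-1, p, p, C)         # (num_patches, p, p, C)

img = np.arange(16 * 16 * 3).reshape(16, 16, 3)
print(extract_patches(img, 4).shape)      # (16, 4, 4, 3)
```

Note the final reshape has to copy because the transposed array is no longer contiguous; for overlapping patches, `np.lib.stride_tricks.sliding_window_view` is one option.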
Humans reuse skills effortlessly to learn new tasks - can robots do the same? In our new paper, we show how to pre-train robotic skills and adapt them to new tasks in a kitchen.
tl;dr you’ll have a robot chef soon. 🧑‍🍳🤖
links / details below
thread 🧵 1/10
Title: Hierarchical Few-Shot Imitation with Skill Transition Models
Paper: arxiv.org/abs/2107.08981
Site: sites.google.com/view/few-shot-…
Main idea: fit generative “skill” model on large offline dataset, adapt it to new tasks
Result: show robot a new task, it will imitate it
2/10
We introduce Few-shot Imitation with Skill Transition Models (FIST). FIST first extracts skills from a diverse offline dataset of demonstrations, and then adapts them to the new downstream task. FIST has 3 steps (1) Extraction (2) Adaptation (3) Evaluation.
3/10