Misha Laskin
Staff Research Scientist @DeepMind. Previously @berkeley_ai. YC alum.
Oct 26, 2022 12 tweets 6 min read
In our new work - Algorithm Distillation - we show that transformers can improve themselves autonomously through trial and error without ever updating their weights.

No prompting, no finetuning. A single transformer collects its own data and maximizes rewards on new tasks.

1/N We've seen a lot of successful models showing how transformers can learn in-context.

But transformers have not been shown to *reinforcement* learn in-context. To adapt to new tasks, you either need to manually specify a prompt or finetune the model (e.g., on preferences).

2/N
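To make the setup concrete, here's a minimal sketch of the data pipeline behind this idea (names are hypothetical, not the paper's code): log an RL algorithm's entire learning history, then train a causal transformer to predict actions from across-episode context, so the improvement operator itself gets distilled into the weights.

import numpy as np

rng = np.random.default_rng(0)

def collect_learning_history(num_episodes=100, episode_len=10):
    # Stand-in for logging (obs, action, reward) while a source RL agent trains.
    # Rewards drift upward to mimic the agent improving across episodes.
    history = []
    for ep in range(num_episodes):
        for _ in range(episode_len):
            obs = rng.normal(size=4)
            action = int(rng.integers(0, 3))
            reward = ep / num_episodes + rng.normal(scale=0.1)
            history.append((obs, action, reward))
    return history

def make_training_pairs(history, context_len=40):
    # Contexts deliberately span episode boundaries: the transformer must see
    # learning progress, not just a single episode, to distill the algorithm.
    pairs = []
    for i in range(context_len, len(history)):
        pairs.append((history[i - context_len : i], history[i][1]))
    return pairs

pairs = make_training_pairs(collect_learning_history())
print(len(pairs))  # a causal transformer is then trained to predict each action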
Jul 11, 2022 13 tweets 3 min read
How much memory do you need to train deep neural networks? You may find the answer counterintuitive.

For example, suppose we're training a 4-megabyte MLP with batch_size = hidden_dim. How much memory do we need? 4MB? No - we need 8MB!

Here's why...

1/N
Consider a 1M-parameter MLP where each parameter is stored as a float32. How much memory is required to train it?

You might guess that it's the number of bytes needed to store the model:

1M params * 4 bytes per float32 = 4MB.

This is wrong...

2/N
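A back-of-envelope version of the arithmetic (assuming plain SGD; activations and optimizer state such as Adam's moment buffers add even more on top):

BYTES_PER_FLOAT32 = 4
num_params = 1_000_000

weights_mb = num_params * BYTES_PER_FLOAT32 / 1e6  # 4.0 MB to store the model
grads_mb = num_params * BYTES_PER_FLOAT32 / 1e6    # another 4.0 MB for gradients

print(f"weights: {weights_mb:.0f} MB, weights + grads: {weights_mb + grads_mb:.0f} MB")
# -> weights: 4 MB, weights + grads: 8 MB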
Jan 18, 2022 11 tweets 4 min read
Building on parts 1 & 2, which explained multi-head attention and GPT, in part 3 of the Transformer Series we'll cover masked language models like BERT.

This thread → masked language models, the difference between causal and bi-directional masked attention, finetuning, and code.

1/N Since we'll be referencing multi-head attention and GPT, make sure to read parts 1 & 2 if you're unfamiliar with these concepts.

Part 2, GPT:
Part 1, Multi-head attention:

2/N
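As a preview of the causal vs. bi-directional distinction, here's a small sketch (illustrative code, not the thread's own snippet):

import numpy as np

# GPT uses a *causal* attention mask, so position i attends only to
# positions <= i; BERT attends bi-directionally and instead masks random
# *input tokens* for its masked-language-model objective.

T = 5  # sequence length

causal = np.tril(np.ones((T, T)))  # lower-triangular: no attending to the future
bidirectional = np.ones((T, T))    # every position attends to every position

print(causal)
print(bidirectional)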
Jan 13, 2022 12 tweets 5 min read
GPT has been a core part of the unsupervised learning revolution happening in NLP.

In part 2 of the transformer series, we’ll build GPT from the ground up. This thread → masked causal self-attention, the transformer block, tokenization & position encoding.

1/N In part 1 we covered multi-head attention (MHA). tl;dr attention allows a neural network to “see” all words in the input as well as their relationships. As a result, the net attends to the words most important for optimizing its objective.



2/N
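Here's a minimal single-head version of masked causal self-attention (a sketch; a real GPT block adds learned Q/K/V projections, multiple heads, and dropout):

import numpy as np

def causal_self_attention(x):
    # x: (T, d) token embeddings.
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)  # (T, T) similarity scores
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # hide the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x  # weighted sum of values

out = causal_self_attention(np.random.randn(4, 8))
print(out.shape)  # (4, 8)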
Jan 7, 2022 12 tweets 5 min read
Transformers are arguably the most impactful deep learning architecture of the last 5 years.

In the next few threads, we’ll cover multi-head attention, GPT and BERT, Vision Transformer, and write these out in code. This thread → understanding multi-head attention.

1/n What is attention? Say you want to classify the sentiment of “attention is not too shabby.” “shabby” suggests 😞, but “not” flips it to 😀. To classify correctly, you need to look at all the words in the sentence. How can we achieve this?

2/n
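As a taste of where the thread goes, here's the shape bookkeeping behind multi-head attention (dimensions are illustrative):

import numpy as np

# Each head attends in a d_model // n_heads subspace, so different heads can
# track different relationships, e.g. "not" <-> "shabby" above.

T, d_model, n_heads = 6, 16, 4  # 6 tokens: "attention is not too shabby ."
d_head = d_model // n_heads

x = np.random.randn(T, d_model)
heads = x.reshape(T, n_heads, d_head).transpose(1, 0, 2)     # (heads, T, d_head)
scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head (T, T)
print(scores.shape)  # (4, 6, 6): one attention map per head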
Jan 4, 2022 12 tweets 4 min read
Patch extraction is a fundamental operation in deep learning, especially for computer vision.

By the end of this thread, you’ll know how to implement an efficient vectorized patch extractor (no for loops) in a few lines of code and learn about memory allocation in numpy.

1/n In deep learning we often need to preprocess inputs into patches. This can mean splitting an image into overlapping or non-overlapping 2D patches or splitting a long audio or text input into smaller equally sized chunks.

2/n
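Here's one way such a vectorized extractor can look (a sketch; the thread's own implementation may differ):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(img, p):
    # Split an (H, W, C) image into (H//p * W//p, p, p, C) non-overlapping
    # patches. reshape/transpose only shuffle strides -- no Python loops.
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image dims must be divisible by p"
    patches = img.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (H//p, W//p, p, p, C)
    return patches.reshape(-1, p, p, C)

img = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
print(extract_patches(img, 2).shape)  # (4, 2, 2, 3)

# For *overlapping* patches, numpy's built-in stride-trick view does the job
# without copying:
windows = sliding_window_view(img, (2, 2, 3))
print(windows.shape)  # (3, 3, 1, 2, 2, 3)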
Jul 22, 2021 10 tweets 5 min read
Humans reuse skills effortlessly to learn new tasks - can robots do the same? In our new paper, we show how to pre-train robotic skills and adapt them to new tasks in a kitchen.

tl;dr you’ll have a robot chef soon. 🧑‍🍳🤖

links / details below
thread 🧵 1/10 Title: Hierarchical Few-Shot Imitation with Skill Transition Models
Paper: arxiv.org/abs/2107.08981
Site: sites.google.com/view/few-shot-…
Main idea: fit a generative “skill” model on a large offline dataset, then adapt it to new tasks
Result: show the robot a new task and it will imitate it
2/10
Jan 20, 2021 14 tweets 5 min read
Is RL always data inefficient? Not necessarily. The Framework for Efficient Robotic Manipulation (FERM) shows real robots can learn basic skills from pixels with sparse rewards in *30 minutes* using 1 GPU 🦾

paper: bit.ly/2M3CFPG
site / code: bit.ly/390Sz6g

1/N Real-robot RL is challenging for a number of reasons, and data efficiency is chief among them. Common workarounds are training in simulation and transferring the learned policy to the real robot (Sim2Real) or parallelizing training with robot farms (QT-Opt).

2/N