Andrej Karpathy
Building @EurekaLabsAI. Previously Director of AI @ Tesla, founding team @ OpenAI, CS231n/PhD @ Stanford. I like to train large deep neural nets 🧠🤖💥
Dec 3, 2024 4 tweets 7 min read
The (true) story of development and inspiration behind the "attention" operator, the one in "Attention is All You Need" that introduced the Transformer. From personal email correspondence with the author @DBahdanau ~2 years ago, published here and now (with permission) following some fake news about how it was developed that circulated here over the last few days.

Attention is a brilliant (data-dependent) weighted average operation. It is a form of global pooling, a reduction, communication. It is a way to aggregate relevant information from multiple nodes (tokens, image patches, etc.). It is expressive, powerful, has plenty of parallelism, and is efficiently optimizable. Even the Multilayer Perceptron (MLP) can actually be almost re-written as Attention over data-independent weights (1st layer weights are the queries, 2nd layer weights are the values, the keys are just input, and softmax becomes elementwise, deleting the normalization). TLDR Attention is awesome and a *major* unlock in neural network architecture design.
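To make the "data-dependent weighted average" reading concrete, here is a minimal sketch in plain Python. It uses the scaled dot-product form from the Transformer paper rather than the additive scoring of the original Bahdanau et al. formulation, and the names `softmax`, `attend`, and the toy keys/values are illustrative assumptions, not anything from the thread.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query dot-product attention: a weighted average of `values`."""
    scale = math.sqrt(len(query))  # the "scaled" part of scaled attention
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # the weights depend on the data (query and keys)
    dim = len(values[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return pooled, weights

# Three toy "tokens" with hand-picked 2-d keys and values (illustrative only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
pooled, weights = attend([2.0, 0.0], keys, values)
```

The query here aligns with the first and third keys, so those two values get equal weight and dominate the pooled result; changing the query changes the weights, which is what makes the average data-dependent.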

It's always been a little surprising to me that the paper "Attention is All You Need" gets ~100X more err ... attention... than the paper that actually introduced Attention ~3 years earlier, by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: "Neural Machine Translation by Jointly Learning to Align and Translate". As the name suggests, the core contribution of the Attention is All You Need paper that introduced the Transformer neural net is deleting everything *except* Attention, and basically just stacking it in a ResNet with MLPs (which can also be seen as ~attention per the above). But I do think the Transformer paper stands on its own because it adds many additional amazing ideas bundled up all together at once - positional encodings, scaled attention, multi-headed attention, the isotropic simple design, etc. And the Transformer has imo stuck around basically in its 2017 form to this day ~7 years later, with relatively few and minor modifications, maybe with the exception of better positional encoding schemes (RoPE and friends).

Anyway, pasting the full email below, which also hints at why this operation is called "attention" in the first place - it comes from attending to words of a source sentence while emitting the words of the translation in a sequential manner, and was introduced as a term late in the process by Yoshua Bengio in place of RNNSearch (thank god? :D). It's also interesting that the design was inspired by a human cognitive process/strategy, of attending back and forth over some data sequentially. Lastly the story is quite interesting from the perspective of the nature of progress, with similar ideas and formulations "in the air", with particular mention of the work of Alex Graves (NTM) and Jason Weston (Memory Networks) around that time.

Thank you for the story @DBahdanau!

"Links in the reply followup" (not a huge fan :p)
referenced papers:

Attention paper:
"Neural Machine Translation by Jointly Learning to Align and Translate"
arxiv.org/abs/1409.0473

Transformer paper:
"Attention is All You Need"
arxiv.org/abs/1706.03762

Alex Graves paper around that time with similar soft pooling operations:
"Neural Turing Machines"
arxiv.org/abs/1410.5401
+the referenced (at the time super impressive, inspiring and forward-looking) handwriting paper, this is 2013!:
"Generating Sequences With Recurrent Neural Networks"
arxiv.org/abs/1308.0850

Jason Weston mentioned paper:
"Memory Networks"
arxiv.org/abs/1410.3916

The referenced Ilya, Oriol, Quoc paper at Google:
"Sequence to Sequence Learning with Neural Networks"
arxiv.org/abs/1409.3215
Feb 20, 2024 4 tweets 2 min read
New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"

Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.

We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
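As a rough illustration of the train / encode() / decode() split described above, here is a minimal byte-level Byte Pair Encoding sketch. It is a simplified assumption of how such a tokenizer can look, not the lecture's code or the actual GPT tokenizer (which, among other things, pre-splits text with a regex and handles special tokens); the helper names `most_common_pair`, `merge`, and `train` are made up for this example.

```python
from collections import Counter

def most_common_pair(ids):
    """Return the most frequent adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        if len(ids) < 2:
            break  # nothing left to pair up
        pair = most_common_pair(ids)
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    """String -> token ids, applying merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Token ids -> string, by expanding each token back into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

text = "aaabdaaabac"
merges = train(text, 3)
ids = encode(text, merges)
```

Note that the merges are learned once on a training text and then reused unchanged by encode() and decode(), which is the sense in which the tokenizer is its own stage with its own training set.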
Dec 27, 2023 11 tweets 7 min read
"Man-Computer Symbiosis" by Licklider, 1960

I love reading technology prediction documents because the benefit of hindsight is training data. Here, 64 years ago, Licklider imagines computing as fundamentally an intelligence amplification tool. groups.csail.mit.edu/medg/people/ps…
Licklider argues that the period of "intelligence augmentation" (IA) may be transient on the path to full automation (AI), but still long enough to be worth thinking through and about.
His citations for what must have felt like rapid progress in both narrow AI and AGI (of that age, i.e. the "general problem solver" [20]) are today known to be false starts that were off track in a quite fundamental way, at that time based on a manual process of encoding knowledge with predicate logic and using production rules of logic and search to manipulate them into conclusions. Today, most of AI is only aware of all of this work as a historical curiosity; it is not part of the "master branch" of the field, it is stuck in a dead end feature branch. And notably, what is considered today the most promising approach (LLMs) was at that time not only completely computationally inaccessible, but also impossible due to the lack of training data of trillions of tokens in digitized form. (What might be an equivalent of that today?)
The study by the Air Force, estimating that machines alone would be doing problem solving of military significance in 20 years' time evokes a snicker today. Amusingly, "20 years away" seems to be a kind of codeword for "no idea, long time". Arguably, I'm not sure that we are there even today, 64 years later. Computers do a lot to increase situational awareness, but decision making of "military significance" afaik is still well within the domain of human computation.
Apr 2, 2023 5 tweets 2 min read
Next frontier of prompt engineering imo: "AutoGPTs". 1 GPT call is just like 1 instruction on a computer. They can be strung together into programs. Use prompt to define I/O device and tool specs, define the cognitive loop, page data in and out of context window, .run().
Interesting non-obvious note on GPT psychology is that unlike people they are completely unaware of their own strengths and limitations. E.g. that they have finite context window. That they can just barely do mental math. That samples can get unlucky and go off the rails. Etc.
Mar 6, 2023 4 tweets 1 min read
More good read/discussion on psychology of LLMs. I don't follow in full but imo it is barking up the right tree w.r.t. a framework for analysis. lesswrong.com/posts/D7PumeYT…
A pretrained LLM is not an AI but a simulator, described by a statistical physics based on internet webpages. The system evolves given any initial conditions (prompt). To gather logprob it internally maintains a probability distribution over what kind of document it is completing.
Jan 24, 2023 11 tweets 6 min read
The hottest new programming language is English
This tweet went wide, thought I'd post some of the recent supporting articles that inspired it.
1/ GPT-3 paper showed that LLMs perform in-context learning, and can be "programmed" inside the prompt with input:output examples to perform diverse tasks arxiv.org/abs/2005.14165
Jan 17, 2023 4 tweets 2 min read
🔥 New (1h56m) video lecture: "Let's build GPT: from scratch, in code, spelled out."

We build and train a Transformer following the "Attention Is All You Need" paper in the language modeling setting and end up with the core of nanoGPT.

First ~1 hour is 1) establishing a baseline (bigram) language model, and 2) introducing the core "attention" mechanism at the heart of the Transformer as a kind of communication / message passing between nodes in a directed graph.
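For a feel of what a bigram baseline does, here is a small count-based character bigram model. This is a swapped-in simplification: it estimates next-character probabilities by counting rather than by training a neural net as in the lecture, and the function names and toy text are assumptions made for this sketch.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def next_char_probs(counts, prev):
    """Normalize the counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {ch: n / total for ch, n in counts[prev].items()}

def sample(counts, start, length, seed=0):
    """Generate text one character at a time, conditioning only on the last one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        options = counts.get(out[-1])
        if not options:
            break  # this character was never followed by anything in training
        chars, weights = zip(*options.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

counts = train_bigram("abababac")
probs = next_char_probs(counts, "a")
generated = sample(counts, "a", 5)
```

Each prediction looks only at the single previous character, so no information flows between earlier positions; that missing communication is the gap the attention mechanism is introduced to fill.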
Jan 11, 2023 4 tweets 2 min read
Didn't tweet nanoGPT yet (quietly getting it to good shape) but it's trending on HN so here it is :) :
github.com/karpathy/nanoG…
Aspires to be simplest, fastest repo for training/finetuning medium-sized GPTs. So far confirmed it reproduced GPT-2 (124M). 2 simple files of ~300 lines
Rough example, a decent GPT-2 (124M) pre-training reproduction would be 1 node of 8x A100 40GB for 32 hours, processing 8 GPU * 16 batch size * 1024 block size * 500K iters = ~65B tokens. I suspect this wall clock can still be improved ~2-3X+ without getting too exotic.
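The token count above can be checked with a few lines of arithmetic. The variable names are illustrative, and reading "16 batch size" as a per-GPU batch size is an assumption; the throughput figure is simply derived from the quoted 32 hours.

```python
# Numbers quoted in the tweet; the per-GPU reading of batch size is an assumption.
gpus = 8
batch_size_per_gpu = 16
block_size = 1024      # tokens per sequence
iters = 500_000
wall_clock_hours = 32

tokens_per_iter = gpus * batch_size_per_gpu * block_size
total_tokens = tokens_per_iter * iters
tokens_per_sec = total_tokens / (wall_clock_hours * 3600)

print(f"{tokens_per_iter:,} tokens per iteration")
print(f"{total_tokens / 1e9:.1f}B tokens total")
print(f"{tokens_per_sec:,.0f} tokens per second across the node")
```

The product comes out to about 65.5B tokens, consistent with the "~65B" in the tweet.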
Dec 7, 2022 6 tweets 2 min read
Dreambooth (stable diffusion finetuning for personal profile pictures) has been going viral last few days as well, for good reason: it's super fun. Unlike other places, stableboost.ai lets you play with infinite variations and experiment with your own prompts:
Turns out in a parallel Universe I'd look awesome as a samurai, cowboy and... saint? :D
Nov 18, 2022 11 tweets 4 min read
An interesting historical note is that neural language models have actually been around for a very long time but no one really cared anywhere near today's extent. LMs were thought of as specific applications, not as mainline research unlocking new general AI paths and capabilities.
E.g. ~20 years ago Bengio et al 2003 (pdf: jmlr.org/papers/volume3…) trained a neural language model. The state of the art GPT+friends of today are the exact same (autoregressive) model, except the neural net architecture is upgraded from an MLP to a Transformer.
Nov 16, 2022 4 tweets 1 min read
Is it the number of examples that matters or the number of presentations to the model during training? E.g. humans use spaced repetition to memorize facts but there are no equivalent techniques in LLMs, where the typical training regime is uniform random.
More generally a few remarkable strategies people use during their training:
1) skim text because they already know it
2) ignore text because it's clearly noise (e.g. they won't memorize SHA256 hashes. LLMs will.)
3) revisit parts that are learnable but not yet learned
Oct 19, 2022 7 tweets 2 min read
The Transformer is a magnificent neural network architecture because it is a general-purpose differentiable computer. It is simultaneously:
1) expressive (in the forward pass)
2) optimizable (via backpropagation+gradient descent)
3) efficient (high parallelism compute graph)
(1) because its message-passing-like architecture is general (i.e. completeness) and powerful (i.e. efficiency), able to cover many real-world algorithms and in a small number of compute steps; an empirical finding.
Oct 11, 2022 6 tweets 2 min read
🥷New (1h55m) Lecture #5: "Becoming a Backprop Ninja"
We take the 2-layer MLP from last lecture and backprop through all of it manually: cross entropy loss, linear layer 2, tanh, batchnorm, linear layer 1, embedding table. I give away answers in the video
(yes I had a lot of fun with the thumbnail :D)
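As a small taste of manual backprop, here is the first piece in that list, the cross entropy loss, differentiated by hand and checked against finite differences. This is a standalone sketch under its own assumptions (a single example, plain Python lists, made-up logits), not the lecture's code.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def cross_entropy_grad(logits, target):
    """Hand-derived gradient w.r.t. logits: softmax(logits) minus one-hot(target)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total - (1.0 if i == target else 0.0) for i, e in enumerate(exps)]

def numerical_grad(f, xs, eps=1e-6):
    """Central finite differences, used only to check the manual gradient."""
    grads = []
    for i in range(len(xs)):
        up = xs[:i] + [xs[i] + eps] + xs[i + 1:]
        down = xs[:i] + [xs[i] - eps] + xs[i + 1:]
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

logits, target = [2.0, -1.0, 0.5], 0
manual = cross_entropy_grad(logits, target)
numeric = numerical_grad(lambda z: cross_entropy(z, target), logits)
max_diff = max(abs(a - b) for a, b in zip(manual, numeric))
```

The gradient pushes the correct class's logit up and the others down in proportion to their predicted probabilities; the remaining layers would be handled the same way, one local derivative at a time via the chain rule.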
Sep 26, 2022 7 tweets 3 min read
I actually mostly built Lexicap so I could share a few snippets of Nick Lane ep :). (I already read the books so I'm ~familiar with the topics, these snippets are just personally newish+notable). (Maybe a great podcast app would make threads like this much easier!)
"A cell is basically just a micro version of the planet."
karpathy.ai/lexicap/0318-s… haven't thought about it this way before.
Sep 22, 2022 10 tweets 3 min read
Reading through OpenAI Whisper paper github.com/openai/whisper some notes:
Idea 1: keep the neural net and the optimization super simple: vanilla Transformer (2017 style) LLM. The innovation is around 1) what the dataset and the training objective is and 2) the I/O schema that allows a single model to multi-task as a speech recognition swiss-army knife.
Sep 10, 2022 4 tweets 1 min read
Stable Diffusion concepts library huggingface.co/sd-concepts-li… textual inversion is amazing - can train a custom word vector (not otherwise reachable by english text) to mean a concept, based on examples. Opens up many possibilities of condensing objects/styles into special tokens 🚀
Prompts may start to take on a mixed form of english and special inverted tokens, like "a photo of <karpathy/cool-object-v7> in the style of <coolperson/trippystyle>".
Feb 9, 2022 5 tweets 2 min read
Computer vision research feels a bit stagnating in a local minimum of 2D texture recognition on ImageNet, COCO etc. This is great but only step 1. Unlocking further progress needs a new framework:
1) the data source has to become diverse videos, not individual frames from internet
2) ground truth is compiled from "offline tracker" 3D reconstructions, not human labeling. The reconstructions are aided by solutions from step 1.
3) outputs are (NeRF-like) query-able scene representations, not 1-of-k class labels.
Dec 8, 2021 9 tweets 2 min read
The ongoing consolidation in AI is incredible. Thread: ➡️ When I started ~a decade ago vision, speech, natural language, reinforcement learning, etc. were completely separate; you couldn't read papers across areas - the approaches were completely different, often not even ML based.
In 2010s all of these areas started to transition 1) to machine learning and specifically 2) neural nets. The architectures were diverse but at least the papers started to read more similar, all of them utilizing large datasets and optimizing neural nets.
Oct 24, 2021 4 tweets 2 min read
Really excellent reading and pointers from @ericjang11, putting into words a new "Just Ask for Generalization" approach/philosophy to AI that the field has been slowly internalizing recently. Few more thoughts in thread ->
The first time I was personally shook by this philosophy was when I saw the "Just tell the AI to be nice" meme on my Twitter, which is the same idea - GPT can be seen as a super multi-task policy (trained via supervised learning), and prompt engineering is the goal conditioning.
Oct 5, 2021 8 tweets 2 min read
A fun story of trying to buy one small black coffee at Starbucks the other day. Normally this is one $5 transaction at the register, 5 seconds at the drip, done. But this Starbucks store (for some reason, covid?) was only taking online orders. There's a QR code to get started. Now I really wanted my coffee but braced for what was to come. I unlocked my phone, scanned the QR code, went to the site, am told to download the app. So I download the app. Now I'm told I have to create an account. So I create an account. Now the app is asking my location.