This tweet went wide, thought I'd post some of the recent supporting articles that inspired it. 1/ GPT-3 paper showed that LLMs perform in-context learning, and can be "programmed" inside the prompt with input:output examples to perform diverse tasks arxiv.org/abs/2005.14165
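A minimal sketch of what that in-context "programming" looks like in practice (the task, examples, and labels here are made up for illustration):

```python
# Few-shot prompt: the input:output examples *are* the program.
# Any completion endpoint can run it; here we just build the string.
examples = [
    ("I loved this movie, watched it twice!", "positive"),
    ("Total waste of two hours.", "negative"),
    ("The soundtrack was nice but the plot dragged.", "negative"),
]
query = "An instant classic, I can't stop thinking about it."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"  # the model completes the label

print(prompt)
```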
2/ These two [1] arxiv.org/abs/2205.11916 , [2] arxiv.org/abs/2211.01910 are good examples showing that the prompt can further program the "solution strategy"; with a good enough design of it, a lot more complex multi-step reasoning tasks become possible.
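For example, [1] shows that simply appending a reasoning trigger to the prompt elicits multi-step reasoning; a sketch:

```python
# Zero-shot chain-of-thought: the "solution strategy" is programmed in the prompt.
question = (
    "A juggler has 16 balls. Half are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)

# Trigger phrase from [1]; [2] searches over such phrases automatically.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

print(cot_prompt)
```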
3/ These two articles/papers:
[1] evjang.com/2021/10/23/gen…
[2] arxiv.org/abs/2106.01345
A bit more technical, but TLDR: good prompts include the desired/aspirational performance. GPTs don't "want" to succeed. They want to imitate. You want to succeed, and you have to ask for it.
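In prompt form, that might look something like this (wording mine, not from the papers; same principle as conditioning a Decision Transformer [2] on a high desired return):

```python
# GPTs imitate whatever document they believe they are in, so place them in a
# good one: state the desired/aspirational performance explicitly.
plain_prompt = "Write a Python function that checks whether a string is a palindrome."

conditioned_prompt = (
    "The following is a correct, efficient, well-tested solution written by an "
    "expert Python developer, with edge cases handled.\n\n"
    "Task: write a Python function that checks whether a string is a palindrome."
)
```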
4/ Building A Virtual Machine inside ChatGPT engraved.blog/building-a-vir…
Here we start getting into specifics of "programming" in English. Take a look at the rules and input/output specifications declared in English, conditioning the GPT into a particular kind of role. Read in full.
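The conditioning prompt is along these lines (paraphrased from memory; see the post for the exact wording):

```python
# Rules + I/O spec declared in plain English turn the model into a "Linux terminal".
vm_prompt = (
    "I want you to act as a Linux terminal. I will type commands and you will "
    "reply with what the terminal should show. Only reply with the terminal "
    "output inside one unique code block, and nothing else. Do not write "
    "explanations. Do not type commands unless I instruct you to. When I need "
    "to tell you something in English I will do so by putting text inside "
    "curly brackets {like this}.\n\n"
    "My first command is: pwd"
)
```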
5/ "ChatGPT in an iOS Shortcut — Worlds Smartest HomeKit Voice Assistant" matemarschalko.medium.com/chatgpt-in-an-…
This voice assistant is significantly more capable and personalized than your regular Siri/Alexa/etc., and it was programmed in English.
6/ "GPT is all you need for the backend" github.com/TheAppleTucker…
Tired: use an LLM to help you write a backend
Wired: LLM is the backend
Inspiring project from a recent Scale hackathon. The LLM backend takes state as JSON blob and modifies it based on... English description.
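The core loop is roughly the following (a sketch; `call_llm` stands in for whatever completion API you use, and this is not the hackathon repo's actual code):

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API call that returns text."""
    raise NotImplementedError

def handle_request(state: dict, endpoint: str, payload: dict) -> dict:
    # The "backend logic" is an English description; the LLM returns the new state.
    prompt = (
        "You are the backend of a todo-list app. Given the current database "
        "state (JSON), an API endpoint, and a request payload, respond with "
        "ONLY the new database state as valid JSON.\n\n"
        f"State: {json.dumps(state)}\n"
        f"Endpoint: {endpoint}\n"
        f"Payload: {json.dumps(payload)}\n"
        "New state:"
    )
    return json.loads(call_llm(prompt))
```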
7/ The prompt allegedly used by Bing chat, potentially spilled by a prompt injection attack.
The important point for our purposes is that the identity is constructed and programmed in English, by laying out who it is, what it knows/doesn't know, and how to act.
8/ These examples illustrate how prompts 1: matter and 2: are not trivial, and why today it makes sense to be a "prompt engineer" (e.g. @goodside ). I also like to think of this role as a kind of LLM psychologist.
9/ Pulling in one more relevant tweet of mine from a while ago. GPTs run natural language programs by completing the document.
New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"
Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.
We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
Also, releasing new repository on GitHub: minbpe
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
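In spirit (not the repo's exact code), BPE training is just: count adjacent pairs, merge the most frequent one into a new token, repeat.

```python
from collections import Counter

def get_stats(ids):
    """Count frequencies of adjacent token-id pairs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"              # toy training text
ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
merges = {}                       # (pair) -> new token id
for new_id in range(256, 256 + 3):  # learn 3 merges
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
print(merges, ids)
```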
I love reading technology prediction documents because the benefit of hindsight is training data. Here, 64 years ago, Licklider imagines computing as fundamentally an intelligence amplification tool. groups.csail.mit.edu/medg/people/ps…
Licklider argues that the period of "intelligence augmentation" (IA) may be transient on the path to full automation (AI), but still long enough to be worth thinking through and about.
His citations for what must have felt like rapid progress in both narrow AI and AGI of that age (i.e. the "general problem solver" [20]) are today known to be false starts that were off track in a quite fundamental way: they relied on a manual process of encoding knowledge with predicate logic, and on production rules of logic and search to manipulate it into conclusions. Today, most of AI is aware of this work only as a historical curiosity; it is not part of the "master branch" of the field, it is stuck in a dead-end feature branch. And notably, what is today considered the most promising approach (LLMs) was at that time not only completely computationally inaccessible, but also impossible due to the lack of trillions of tokens of training data in digitized form. (What might be an equivalent of that today?)
The study by the Air Force estimating that machines alone would be doing problem solving of military significance within 20 years evokes a snicker today. Amusingly, "20 years away" seems to be a kind of codeword for "no idea, long time". Arguably, we are not there even today, 64 years later. Computers do a lot to increase situational awareness, but decision making of "military significance" afaik is still well within the domain of human computation.
An interesting observation from Licklider is that most of his "thinking" in a day-to-day computational task thought experiment is not so much thinking, but more a rote, mechanical, automatable data collection and visualization. It is this observation that leads him to conclude that the strengths and weaknesses of humans and computers are complementary: computers can do the busy work, and humans can do the thinking work. This has been the prevailing paradigm for the 64 years since, and it is only very recently (the last ~year) that computers have started to make a dent into "thinking" in a general, scalable, and economy-impacting way. Not in an explicit, hard, predicate-logic way, but in an implicit, soft, statistical way. Hence the LLM-driven AI summer of today.
Next frontier of prompt engineering imo: "AutoGPTs". 1 GPT call is just like 1 instruction on a computer. They can be strung together into programs. Use the prompt to define I/O device and tool specs, define the cognitive loop, page data in and out of the context window, .run().
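A rough sketch of what that loop could look like (all names here are hypothetical, not any particular AutoGPT implementation):

```python
def call_llm(prompt: str) -> str:
    """Placeholder for one GPT call -- the single 'instruction'."""
    raise NotImplementedError

SYSTEM_SPEC = (
    "You are an autonomous agent. Tools: search(query), write_file(name, text). "
    "Respond with exactly one tool call per step, or DONE when finished."
)

def run(task: str, max_steps: int = 10) -> list[str]:
    history = []  # "memory" paged in and out of the context window
    for _ in range(max_steps):
        context = "\n".join(history[-5:])  # keep only recent steps in context
        action = call_llm(
            f"{SYSTEM_SPEC}\nTask: {task}\nRecent steps:\n{context}\nNext:"
        )
        if action.strip() == "DONE":
            break
        history.append(action)  # here you would actually execute the tool call
    return history
```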
An interesting, non-obvious note on GPT psychology is that, unlike people, they are completely unaware of their own strengths and limitations. E.g. that they have a finite context window. That they can just barely do mental math. That samples can get unlucky and go off the rails. Etc.
(so I'd expect the good prompts to explicitly address things like this)
More good read/discussion on psychology of LLMs. I don't follow in full but imo it is barking up the right tree w.r.t. a framework for analysis. lesswrong.com/posts/D7PumeYT…
A pretrained LLM is not an AI but a simulator, described by a statistical physics based on internet webpages. The system evolves given any initial conditions (prompt). To assign logprobs it internally maintains a probability distribution over what kind of document it is completing.
In particular, "good, aligned, conversational AI" is just one of many possible different rollouts. Finetuning / alignment tries to "collapse" and control the entropy to that region of the simulator. Jailbreak prompts try to knock the state into other logprob ravines.
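One way to see the "simulator" framing concretely: the same model puts very different probability mass on the next token depending on the initial conditions. A minimal sketch with GPT-2 via HuggingFace (my illustration, not from the linked post):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def top_next_tokens(prompt, k=5):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]   # distribution over the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tok.decode(int(i)), round(p.item(), 3))
            for i, p in zip(top.indices, top.values)]

# Different "initial conditions" -> different rollout distributions.
print(top_next_tokens("Q: What is the capital of France?\nA:"))
print(top_next_tokens("def fibonacci(n):\n    "))
```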
🔥 New (1h56m) video lecture: "Let's build GPT: from scratch, in code, spelled out."
We build and train a Transformer following the "Attention Is All You Need" paper in the language modeling setting and end up with the core of nanoGPT.
First ~1 hour is 1) establishing a baseline (bigram) language model, and 2) introducing the core "attention" mechanism at the heart of the Transformer as a kind of communication / message passing between nodes in a directed graph.
The second ~1hr builds up the Transformer: multi-headed self-attention, MLP, residual connections, layernorms. Then we train one and compare it to OpenAI's GPT-3 (spoiler: ours is around ~10K - 1M times smaller but the ~same neural net) and ChatGPT (i.e. ours is pretraining only)
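The "communication between nodes" view in code: a single masked self-attention head, roughly as built in the lecture (a simplified sketch, not the exact lecture code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of masked self-attention: each position (node) aggregates
    information from previous positions, weighted by query-key affinity."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):  # x: (B, T, n_embd)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5          # (B, T, T) affinities
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        wei = F.softmax(wei, dim=-1)
        return wei @ v  # weighted sum of values: (B, T, head_size)

x = torch.randn(4, 8, 32)  # batch of 4, 8 tokens, 32-dim embeddings
print(Head(n_embd=32, head_size=16, block_size=8)(x).shape)  # torch.Size([4, 8, 16])
```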
Didn't tweet nanoGPT yet (quietly getting it to good shape) but it's trending on HN so here it is :) : github.com/karpathy/nanoG…
Aspires to be the simplest, fastest repo for training/finetuning medium-sized GPTs. So far confirmed that it reproduces GPT-2 (124M). 2 simple files of ~300 lines.
Rough example: a decent GPT-2 (124M) pre-training reproduction would be 1 node of 8x A100 40GB for 32 hours, processing 8 GPUs * 16 batch size * 1024 block size * 500K iters = ~65B tokens. I suspect this wall clock can still be improved ~2-3X+ without getting too exotic.
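The token arithmetic spelled out (just the numbers from above):

```python
gpus, batch_size, block_size, iters = 8, 16, 1024, 500_000
tokens = gpus * batch_size * block_size * iters
print(f"{tokens:,} tokens")  # 65,536,000,000 -> ~65B
```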
I'd like to continue to make it faster, reproduce the other GPT-2 models, then scale up pre-training to bigger models/datasets, then improve the docs for finetuning (the practical use case). Also working on a video lecture where I will build it from scratch, hoping to have it out in ~2 weeks.