Andrej Karpathy
Sep 22, 2022
Reading through the OpenAI Whisper paper (github.com/openai/whisper); some notes:
Idea 1: keep the neural net and the optimization super simple: vanilla Transformer (2017 style) LLM. The innovation is around 1) the dataset and training objective and 2) the I/O schema that allows a single model to multi-task as a speech recognition Swiss Army knife.
Idea 2: Scrape a large (680,000-hour) audio+transcript dataset and spend a lot of attention and care on heuristics for algorithmically rejecting/cleaning it. Some of it is wrong, but there is a ton of it. Simple supervised learning from there on; skip auxiliary objectives, self-supervision, etc.
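The paper describes heuristics such as dropping transcripts that look machine-generated (e.g. all-uppercase/all-lowercase text, no punctuation) and dropping pairs where the detected spoken language disagrees with the transcript language. A minimal sketch of what such a filter might look like (function names and the exact checks are my own illustration, not OpenAI's code):

```python
# Hypothetical sketch of the transcript-filtering heuristics the paper
# describes; names and exact checks are illustrative, not OpenAI's code.
def looks_machine_generated(transcript: str) -> bool:
    """Heuristic: human transcripts usually mix case and use punctuation."""
    stripped = transcript.strip()
    if not stripped:
        return True
    if stripped == stripped.upper() or stripped == stripped.lower():
        return True  # all-caps or all-lowercase is a common ASR-output tell
    if not any(ch in stripped for ch in ".,?!"):
        return True  # no punctuation at all is another tell
    return False

def keep_pair(transcript: str, audio_language: str, text_language: str) -> bool:
    """Keep an (audio, transcript) pair only if it passes the cheap filters."""
    if looks_machine_generated(transcript):
        return False
    if audio_language != text_language:
        return False  # language mismatch -> likely misaligned or bad transcript
    return True
```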
Idea 3: Use special tokens at the input to condition the model for all desired tasks in a single model (language id, speech detection, transcription, translation). Create a "meta-language" of special tokens of a fixed schema that orchestrates the tasks/stages.
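Concretely, the decoder is prompted with a fixed sequence of special tokens; a rough sketch of the schema (token strings follow the paper/repo, the assembly function here is my own illustration):

```python
# Rough illustration of Whisper's multitask token schema (cf. Figure 1 of the
# paper). Token strings match the released repo; the assembly is a sketch.
def make_decoder_prompt(language: str, task: str, timestamps: bool) -> list[str]:
    tokens = ["<|startoftranscript|>"]
    tokens.append(f"<|{language}|>")   # language id token, e.g. <|en|>
    tokens.append(f"<|{task}|>")       # <|transcribe|> or <|translate|>
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens                      # text tokens follow, then <|endoftext|>

print(make_decoder_prompt("en", "transcribe", timestamps=False))
# ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']
```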
Idea 4: Adopt the GPT train/eval mindset: train on large internet-scraped datasets, then evaluate zero-shot performance on standard evaluation benchmarks (ignoring their training sets entirely!). This approach decreases dataset-specific overfitting and creates more robust models.
Striking story/paragraph from the paper on why this is the correct training:evaluation regime to focus on. TL;DR: it is possible to overfit to datasets and their statistics without producing actually robust and generalizable models.
Scaling laws indicate room for additional performance improvements from scaling both 1) the model size and 2) the dataset size, though with some hints of diminishing returns in the case of English specifically, which is most abundant in the training set.
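For intuition, a scaling law of the form WER ~ a * N^(-b) is a straight line on log-log axes, so "room for improvement" vs. "diminishing returns" is about whether the points keep falling on that line. A toy sketch of such a fit (model sizes match the released Whisper family; the WER numbers are made up purely for illustration):

```python
import numpy as np

# Toy scaling-law fit: WER ~ a * N^(-b) is linear in log-log space.
# Model sizes are the released Whisper family; WERs are made-up numbers.
params = np.array([39e6, 74e6, 244e6, 769e6, 1550e6])  # tiny .. large
wer    = np.array([22.0, 16.5, 12.3, 10.2, 9.5])       # hypothetical WERs

slope, intercept = np.polyfit(np.log(params), np.log(wer), 1)
print(f"fitted exponent: {slope:.3f}")  # negative slope = error falls with scale
# Diminishing returns show up as the largest models bending above this line,
# as the paper hints for English specifically.
```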
A few more notes:
- multi-task transfer is (-) for small models but (+) for large models! (much optimism for more scaling)
- long-form transcription relies on hacky decoding heuristics :\ (see the usage sketch below)
- eval is hard: WER has well-documented problems and requires hacky/extensive text normalization.
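For reference, the released package wraps those long-form decoding heuristics behind a simple API. A minimal usage sketch (keyword arguments and defaults per the repo at the time; "audio.mp3" is a placeholder):

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("small")  # tiny / base / small / medium / large

# transcribe() implements the long-form decoding loop over 30s windows,
# retrying a window at higher temperature when its output looks degenerate.
result = model.transcribe(
    "audio.mp3",                                  # placeholder path
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),   # fallback schedule
    compression_ratio_threshold=2.4,              # too-repetitive text -> retry
    logprob_threshold=-1.0,                       # low confidence -> retry
)
print(result["text"])
```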
Favorite paragraph of the paper: citing the software packages used throughout the project. Personally excited and hopeful to see this become a lot more common.
TLDR: You can get far with a vanilla Transformer (2017). Scrape a massive (though weakly labeled) dataset and use simple supervised learning. Multi-task. Eval in the zero-shot regime. More perf expected from further model+data scaling. Eval is hard. Some parts (decoding) feel hacky.

More from @karpathy

Feb 20
New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"

Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.
We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
Also, releasing a new repository on GitHub: minbpe
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
github.com/karpathy/minbpe

In the video we essentially build minbpe from scratch. Don't miss the exercise.md to build your own!
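A minimal usage sketch following the repo's README (BasicTokenizer is the simplest of the included tokenizers):

```python
from minbpe import BasicTokenizer  # github.com/karpathy/minbpe

tokenizer = BasicTokenizer()
text = "aaabdaaabac"
tokenizer.train(text, vocab_size=256 + 3)  # 256 byte tokens, then 3 merges

ids = tokenizer.encode(text)   # strings -> token ids, e.g. [258, 100, 258, 97, 99]
print(ids)
print(tokenizer.decode(ids))   # token ids -> "aaabdaaabac" (lossless roundtrip)
```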
Dec 27, 2023
"Man-Computer Symbiosis" by Licklider, 1960

I love reading technology prediction documents because the benefit of hindsight is training data. Here, 64 years ago, Licklider imagines computing as fundamentally an intelligence-amplification tool. groups.csail.mit.edu/medg/people/ps…
Licklider argues that the period of "intelligence augmentation" (IA) may be transient on the path to full automation (AI), but still long enough to be worth thinking through and about.
His citations for what must have felt like rapid progress in both narrow AI and AGI of that age (i.e. the "general problem solver" [20]) are today known to be false starts that were off track in a quite fundamental way: at that time, AI was based on a manual process of encoding knowledge with predicate logic, using production rules and search to manipulate it into conclusions. Today, most of AI is aware of all of this work only as a historical curiosity; it is not part of the "master branch" of the field, stuck in a dead-end feature branch. And notably, what is considered today the most promising approach (LLMs) was at that time not only completely computationally inaccessible, but also impossible due to the lack of trillions of tokens of training data in digitized form. (What might be an equivalent of that today?)
The study by the Air Force estimating that machines alone would be doing problem solving of military significance within 20 years evokes a snicker today. Amusingly, "20 years away" seems to be a kind of codeword for "no idea, long time". Arguably, I'm not sure that we are there even today, 64 years later. Computers do a lot to increase situational awareness, but decision making of "military significance" afaik is still well within the domain of human computation.
An interesting observation from Licklider is that most of his "thinking" in a day-to-day computational task thought experiment is not so much thinking, but rote, mechanical, automatable data collection and visualization. It is this observation that leads him to conclude that the strengths and weaknesses of humans and computers are complementary: computers can do the busy work, and humans can do the thinking work. This has been the prevailing paradigm for the 64 years since, and it's only very recently (last ~year) that computers have started to make a dent into "thinking" in a general, scalable, and economy-impacting way. Not in an explicit, hard, predicate-logic way, but in an implicit, soft, statistical way. Hence the LLM-driven AI summer of today.
Apr 2, 2023
Next frontier of prompt engineering imo: "AutoGPTs". One GPT call is just like one instruction on a computer; they can be strung together into programs. Use the prompt to define I/O device and tool specs, define the cognitive loop, page data in and out of the context window, .run().
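A hypothetical skeleton of such a cognitive loop (llm() stands in for any chat-completion API; the tool set and the TOOL:/DONE: protocol are placeholders of my own, not any real library):

```python
# Hypothetical skeleton of an "AutoGPT"-style loop: each llm() call is one
# "instruction", and tool results are paged back into the context window.
def llm(messages: list[dict]) -> str:
    raise NotImplementedError  # plug in your favorite chat-completion API

TOOLS = {"search": lambda q: "...", "calculator": lambda expr: "..."}

def run(goal: str, max_steps: int = 10) -> str:
    messages = [{"role": "system",
                 "content": f"Goal: {goal}. Reply TOOL:<name>:<arg> to use a "
                            f"tool, or DONE:<answer> to finish."}]
    for _ in range(max_steps):
        reply = llm(messages)                    # 1 GPT call ~ 1 instruction
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):]
        if reply.startswith("TOOL:"):
            _, name, arg = reply.split(":", 2)
            result = TOOLS[name](arg)            # page tool output back in
            messages.append({"role": "user", "content": f"RESULT: {result}"})
    return "max steps exceeded"
```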
An interesting non-obvious note on GPT psychology is that, unlike people, they are completely unaware of their own strengths and limitations. E.g. that they have a finite context window. That they can just barely do mental math. That samples can get unlucky and go off the rails. Etc.
(so I'd expect the good prompts to explicitly address things like this)
Mar 6, 2023
More good read/discussion on the psychology of LLMs. I don't follow it in full, but imo it is barking up the right tree w.r.t. a framework for analysis. lesswrong.com/posts/D7PumeYT…
A pretrained LLM is not an AI but a simulator, described by a statistical physics based on internet webpages. The system evolves given any initial conditions (prompt). To gather logprob, it internally maintains a probability distribution over what kind of document it is completing.
In particular, "good, aligned, conversational AI" is just one of many possible different rollouts. Finetuning / alignment tries to "collapse" and control the entropy to that region of the simulator. Jailbreak prompts try to knock the state into other logprob ravines.
Jan 24, 2023
The hottest new programming language is English
This tweet went wide, thought I'd post some of the recent supporting articles that inspired it.
1/ The GPT-3 paper showed that LLMs perform in-context learning, and can be "programmed" inside the prompt with input:output examples to perform diverse tasks. arxiv.org/abs/2005.14165
2/ These two, [1] arxiv.org/abs/2205.11916 and [2] arxiv.org/abs/2211.01910, are good examples of how the prompt can further program the "solution strategy", and with a good enough design of it, a lot more complex multi-step reasoning tasks become possible.
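E.g., an in-context-learning "program" is nothing more than input:output examples followed by a fresh input; a toy illustration (the translation task here is my own example, not from the papers):

```python
# Toy few-shot prompt in the style of the GPT-3 paper: the "program" is just
# input:output examples in the context window, followed by a new input.
prompt = """\
English: cheese -> French: fromage
English: bread -> French: pain
English: apple -> French:"""
# Sampling a completion from an LLM should yield " pomme": the task was
# specified purely by the examples, with no weight updates anywhere.
```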
Jan 17, 2023
🔥 New (1h56m) video lecture: "Let's build GPT: from scratch, in code, spelled out."

We build and train a Transformer following the "Attention Is All You Need" paper in the language modeling setting and end up with the core of nanoGPT.
The first ~hour covers 1) establishing a baseline (bigram) language model, and 2) introducing the core "attention" mechanism at the heart of the Transformer as a kind of communication / message passing between nodes in a directed graph.
The second ~hour builds up the Transformer: multi-headed self-attention, MLP, residual connections, layernorms. Then we train one and compare it to OpenAI's GPT-3 (spoiler: ours is around ~10K - 1M times smaller but the ~same neural net) and ChatGPT (i.e. ours is pretraining only).
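The core of that attention mechanism fits in a few lines. A condensed sketch of a single causal self-attention head in PyTorch, in the spirit of the lecture (the sizes are made up for the demo):

```python
import torch
import torch.nn.functional as F

B, T, C, head_size = 4, 8, 32, 16         # batch, time, channels (demo sizes)
x = torch.randn(B, T, C)

key   = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)             # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5  # (B, T, T) node affinities
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))  # causal mask: no peeking ahead
wei = F.softmax(wei, dim=-1)                     # each node weighs its past
out = wei @ v                                    # (B, T, head_size) messages
```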