Idea 1: keep the neural net and the optimization super simple: vanilla Transformer (2017 style) LLM. The innovation is around 1) what the dataset and the training objective are and 2) the I/O schema that allows a single model to multi-task as a speech recognition Swiss-army knife.
Idea 2: Scrape a large (680,000hr) audio+transcript dataset, spend much attention+care on heuristics for rejecting/cleaning algorithmically. Some of it is wrong but there is a ton of it. Simple supervised learning from there on, skip auxiliary objectives, self-supervision, etc.
Idea 3: Use special tokens at the input to condition the model for all desired tasks in a single model (language id, speech detection, transcription, translation). Create a "meta-language" of special tokens of a fixed schema that orchestrates the tasks/stages.
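A minimal sketch of what such a fixed "meta-language" prefix could look like. The token spellings (`<|startoftranscript|>`, `<|nospeech|>`, etc.) and the helper `build_prefix` are illustrative assumptions, not the model's actual vocabulary or API; the point is only that one ordered schema of special tokens selects the task.

```python
from typing import List, Optional

# Hypothetical token spellings, chosen for illustration only.
SOT = "<|startoftranscript|>"
NO_SPEECH = "<|nospeech|>"
NO_TIMESTAMPS = "<|notimestamps|>"
TASKS = {"transcribe": "<|transcribe|>", "translate": "<|translate|>"}


def build_prefix(language: Optional[str],
                 task: str = "transcribe",
                 timestamps: bool = False) -> List[str]:
    """Build the conditioning prefix for one audio segment.

    language=None stands for "no speech detected" (speech detection);
    otherwise the schema is: start, language id, task, timestamp mode.
    """
    if language is None:
        return [SOT, NO_SPEECH]
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    prefix = [SOT, f"<|{language}|>", TASKS[task]]
    if not timestamps:
        prefix.append(NO_TIMESTAMPS)
    return prefix


print(build_prefix("en"))
print(build_prefix("fr", task="translate", timestamps=True))
print(build_prefix(None))
```

Because the order is fixed, the same decoder can predict the language tag itself (language id), then continue with whichever task token it is given.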
Idea 4: Adopt the GPT train/eval mindset: train on large internet-scraped datasets, then evaluate zero-shot performance on standard evaluation benchmarks (ignoring their training sets entirely!). This approach decreases dataset-specific overfitting and creates more robust models.
Striking story/paragraph from the paper on why this is the correct regime of training:evaluation to focus on. TLDR it is possible to overfit to datasets and their statistics without producing actually robust and generalizable models.
Scaling laws indicate room for additional performance improvements from scaling both 1) the model size and 2) the dataset size, though with some hints of diminishing returns in the case of English specifically, which is most abundant in the training set.
Few more notes:
- multi-task transfer is (-) for small models but (+) for large models! (much optimism for more scaling)
- long-form transcription using hacky decoding heuristics :\
- eval is hard: WER has well-documented problems, requires hacky/extensive text normalization.
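To make the WER point concrete, here is a rough sketch of word error rate with a toy normalizer. The `normalize` function is an assumption for illustration (lowercase, drop punctuation); real normalization for this kind of eval is much more extensive. WER itself is the standard word-level edit distance divided by the reference length.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Toy normalizer: lowercase, replace punctuation with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep apostrophes inside words
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = (substitutions + deletions + insertions) / reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference has no words after normalization")
    # Word-level Levenshtein distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(wer("Hello, world!", "hello world"))
print(wer("the cat sat", "the dog sat"))
```

Without the normalization step, the first pair would be scored as two errors out of two words even though a human would call the transcript perfect, which is why the normalizer ends up mattering so much.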
Favorite paragraph of the paper: citing the software packages used throughout the project. Personally excited and hopeful to see this become a lot more common.
TLDR: You can get far with: vanilla Transformer (2017). Scrape a massive (though weakly-labeled) dataset, use simple supervised learning. Multi-task. Eval in zero-shot regime. More perf expected from further model+data scaling. Eval is hard. Some parts (decoding) feel hacky.
• • •
Stable Diffusion concepts library: huggingface.co/sd-concepts-li… Textual inversion is amazing - can train a custom word vector (not otherwise reachable by English text) to mean a concept, based on examples. Opens up many possibilities of condensing objects/styles into special tokens 🚀
Prompts may start to take on a mixed form, part English and part special inverted tokens, like "a photo of <karpathy/cool-object-v7> in the style of <coolperson/trippystyle>".
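A small sketch of how such a mixed prompt could be split into ordinary text and learned concept tokens before embedding lookup. The `<...>` pattern and the `split_prompt` helper are assumptions for illustration, not the concepts library's actual tokenizer.

```python
import re
from typing import List, Tuple

# Assumed convention: a concept token is anything of the form <name> with no spaces.
PLACEHOLDER = re.compile(r"(<[^<>\s]+>)")


def split_prompt(prompt: str) -> List[Tuple[str, str]]:
    """Split a prompt into ("text", ...) and ("concept", ...) pieces, in order."""
    pieces = []
    for chunk in PLACEHOLDER.split(prompt):
        if not chunk.strip():
            continue  # skip empty or whitespace-only fragments between matches
        kind = "concept" if PLACEHOLDER.fullmatch(chunk) else "text"
        pieces.append((kind, chunk.strip()))
    return pieces


prompt = "a photo of <karpathy/cool-object-v7> in the style of <coolperson/trippystyle>"
for kind, piece in split_prompt(prompt):
    print(kind, piece)
```

The "text" pieces would go through the usual tokenizer, while each "concept" piece would map to its own trained embedding vector.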
beautiful addition to the quickly growing toolkit of steering diffusion models
Computer vision research feels like it is stagnating a bit in a local minimum of 2D texture recognition on ImageNet, COCO etc. This is great but only step 1. Unlocking further progress needs a new framework: 1) the data source has to become diverse videos, not individual frames from the internet
2) ground truth is compiled from "offline tracker" 3D reconstructions, not human labeling. The reconstructions are aided by solutions from step 1. 3) outputs are (NeRF-like) query-able scene representations, not 1-of-k class labels.
(rant triggered by re-stumbling upon the Replica Dataset and friends, which have the right flavor for the data generating component but are still quite early (e.g. small, constrained to simple indoor scenes, no moving objects, etc.) github.com/facebookresear… )
The ongoing consolidation in AI is incredible. Thread: ➡️ When I started ~a decade ago, vision, speech, natural language, reinforcement learning, etc. were completely separate; you couldn't read papers across areas - the approaches were completely different, often not even ML based.
In the 2010s all of these areas started to transition 1) to machine learning and specifically 2) to neural nets. The architectures were diverse but at least the papers started to read more similarly, all of them utilizing large datasets and optimizing neural nets.
But as of approx. last two years, even the neural net architectures across all areas are starting to look identical - a Transformer (definable in ~200 lines of PyTorch github.com/karpathy/minGP…), with very minor differences. Either as a strong baseline or (often) state of the art.
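For a feel of how small the core operation is, here is a dependency-free sketch of single-head causal self-attention. This is not the linked PyTorch code: it skips learned projections, multiple heads, MLP blocks, and layer norm, and uses plain Python lists purely for readability. The causal mask is an assumption matching GPT-style decoders.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention where position t attends only to positions <= t."""
    d = len(q[0])
    out = []
    for t, q_t in enumerate(q):
        # Scores against every key up to and including the current position.
        scores = [sum(a * b for a, b in zip(q_t, k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Weighted average of the visible value vectors.
        out.append([sum(w * v[s][i] for s, w in enumerate(weights))
                    for i in range(len(v[0]))])
    return out


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[0.0, 0.0], [0.0, 0.0]]  # identical keys give uniform attention weights
v = [[1.0, 2.0], [3.0, 4.0]]
print(causal_attention(q, k, v))
```

The same routine works whether the rows are text tokens, image patches, or audio frames, which is the sense in which the architectures look nearly identical across areas.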
Really excellent reading and pointers from @ericjang11, putting into words a new "Just Ask for Generalization" approach/philosophy to AI that the field has been slowly internalizing recently. Few more thoughts in thread ->
The first time I was personally shook by this philosophy was when I saw the "Just tell the AI to be nice" meme on my Twitter, which is the same idea - GPT can be seen as a super multi-task policy (trained via supervised learning), and prompt engineering is the goal conditioning.
wrt consciousness I do suspect it can just emerge in large-enough models trained on hard-enough tasks. The idea that emergence of consciousness is just another "grokking" phenomenon was the inspiration for my earlier short story "Forward Pass" karpathy.github.io/2021/03/27/for…
A fun story of trying to buy one small black coffee at Starbucks the other day. Normally this is one $5 transaction at the register, 5 seconds at the drip, done. But this Starbucks store (for some reason, covid?) was only taking online orders. There's a QR code to get started.
Now I really wanted my coffee but braced for what was to come. I unlocked my phone, scanned the QR code, went to the site, am told to download the app. So I download the app. Now I'm told I have to create an account. So I create an account. Now the app is asking my location.
Err, deny location privilege, of course! I scroll through the USA map all the way to the store I'm at, tap on it to select it. I scroll through the entire menu trying to find my simple small black coffee. I add it to the cart. Check out. Luckily, looks like I can Apple Pay!