Andrej Karpathy
Sep 22
Reading through the OpenAI Whisper paper github.com/openai/whisper, some notes:
Idea 1: keep the neural net and the optimization super simple: vanilla Transformer (2017 style) LLM. The innovation is around 1) what the dataset and the training objective is and 2) the I/O schema that allows a single model to multi-task as a speech recognition swiss-army knife.
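To make the swiss-army-knife point concrete, here is a minimal sketch using the released openai/whisper package (the "small" checkpoint and the local file name are assumptions): the same weights handle transcription and translation, selected purely through decoding options.

```python
import whisper  # the released openai/whisper package

# Load one checkpoint ("small" is an assumption; any size works the same way).
model = whisper.load_model("small")

# Task 1: transcribe in the source language.
result = model.transcribe("speech.mp3")      # "speech.mp3" is a placeholder file
print(result["text"])

# Task 2: same weights, different task token -> translate into English.
translated = model.transcribe("speech.mp3", task="translate")
print(translated["text"])
```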
Idea 2: Scrape a large (680,000hr) audio+transcript dataset, spend much attention+care on heuristics for rejecting/cleaning algorithmically. Some of it is wrong but there is a ton of it. Simple supervised learning from there on, skip auxiliary objectives, self-supervision, etc.
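A toy illustration of the flavor of those filters (not the paper's actual pipeline, just the kind of cheap text heuristic it describes, e.g. discarding transcripts that look machine-generated):

```python
import string

def looks_machine_generated(transcript: str) -> bool:
    # Human transcripts usually mix case and contain punctuation;
    # raw ASR output often has neither. (Toy heuristic, not the paper's code.)
    has_punct = any(ch in string.punctuation for ch in transcript)
    return (not has_punct) or transcript.isupper() or transcript.islower()

print(looks_machine_generated("THIS LOOKS LIKE AN ASR DUMP WITH NO PUNCTUATION"))  # True -> reject
print(looks_machine_generated("Hi there, thanks for joining us today."))           # False -> keep
```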
Idea 3: Use special tokens at the input to condition the model for all desired tasks in a single model (language id, speech detection, transcription, translation). Create a "meta-language" of special tokens of a fixed schema that orchestrates the tasks/stages.
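Roughly, the decoder is prompted with a fixed sequence of special tokens before it emits any text. A sketch of that schema (token names as they appear in the repo's tokenizer; the exact sequence here is illustrative, not the full spec):

```python
def build_prompt(language: str, task: str, timestamps: bool = False) -> list:
    # Fixed schema: start token, language id, task, optional timestamp flag.
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

# e.g. transcribe Spanish audio without timestamps:
print(build_prompt("es", "transcribe"))
# ['<|startoftranscript|>', '<|es|>', '<|transcribe|>', '<|notimestamps|>']
# The model then emits text tokens and terminates with <|endoftext|>.
```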
Idea 4: Adopt the GPT train/eval mindset: train on large internet-scraped datasets, then evaluate zero-shot performance on standard evaluation benchmarks (ignoring their training sets entirely!). This approach decreases dataset-specific overfitting and creates more robust models.
Striking story/paragraph from the paper on why this is the correct training/evaluation regime to focus on. TLDR: it is possible to overfit to datasets and their statistics without producing actually robust and generalizable models.
Scaling laws indicate room for additional performance improvements from scaling both 1) the model size and 2) the dataset size, though with some hints of diminishing returns in the case of English specifically, which is most abundant in the training set.
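For intuition, this is the usual log-log power-law bookkeeping; a small sketch with numpy (the data points below are hypothetical placeholders purely to show the mechanics, not numbers from the paper):

```python
import numpy as np

def fit_power_law(x, y):
    # Fit y ~ a * x**b by least squares in log-log space; return (a, b).
    b, log_a = np.polyfit(np.log(x), np.log(y), deg=1)
    return np.exp(log_a), b

# Hypothetical (hours of audio, WER) points:
hours = np.array([7_000.0, 54_000.0, 680_000.0])
wer = np.array([30.0, 17.0, 10.0])
a, b = fit_power_law(hours, wer)
print(f"WER ~ {a:.1f} * hours^{b:.2f}")  # negative exponent = error falls with data
```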
Few more notes:
- multi-task transfer is (-) for small models but (+) for large models! (much optimism for more scaling)
- long-form transcription using hacky decoding heuristics :\
- eval is hard: WER has well-documented problems, requires hacky/extensive text normalization.
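On that last point, a small sketch of why normalization matters for WER (assuming jiwer for the metric and the English text normalizer shipped in the whisper repo):

```python
from jiwer import wer                                  # standard WER metric
from whisper.normalizers import EnglishTextNormalizer  # shipped in the whisper repo

normalize = EnglishTextNormalizer()

ref = "Okay, Dr. Smith, let's begin."
hyp = "okay dr smith let's begin"

print(wer(ref, hyp))                        # punctuation/casing counted as errors
print(wer(normalize(ref), normalize(hyp)))  # formatting differences largely vanish
```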
Favorite paragraph of the paper: citing the software packages used throughout the project. Personally excited and hopeful to see this become a lot more common.
TLDR: You can get far with: vanilla Transformer (2017). Scrape a massive (though weakly-labeled) dataset, use simple supervised learning. Multi-task. Eval in zero-shot regime. More perf expected from further model+data scaling. Eval is hard. Some parts (decoding) feel hacky.

More from @karpathy

Sep 10
Stable Diffusion concepts library huggingface.co/sd-concepts-li… textual inversion is amazing - you can train a custom word vector (not otherwise reachable by English text) to mean a concept, based on examples. Opens up many possibilities of condensing objects/styles into special tokens 🚀
prompts may start to take on a mixed english mixed special inverted token forms, like "a photo of <karpathy/cool-object-v7> in the style of <coolperson/trippystyle>".
beautiful addition to the quickly growing toolkit of steering diffusion models
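A minimal sketch of that prompt format with Hugging Face diffusers (assuming a recent diffusers version with the textual-inversion loader, the public "cat-toy" concept from the sd-concepts-library organization, and the SD 1.5 weights):

```python
from diffusers import StableDiffusionPipeline

# Assumptions: diffusers with load_textual_inversion, SD 1.5 weights,
# and the "cat-toy" concept repo from sd-concepts-library.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# The learned pseudo-word is used in the prompt like any other token.
image = pipe("a photo of <cat-toy> riding a bicycle").images[0]
image.save("cat_toy.png")
```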
Feb 9
Computer vision research feels a bit stuck in a local minimum of 2D texture recognition on ImageNet, COCO etc. This is great but only step 1. Unlocking further progress needs a new framework:
1) the data source has to become diverse videos, not individual frames from internet
2) ground truth is compiled from "offline tracker" 3D reconstructions, not human labeling. The reconstructions are aided by solutions from step 1.
3) outputs are (NeRF-like) query-able scene representations, not 1-of-k class labels.
(rant triggered by re-stumbling upon the Replica Dataset and friends, which has the right flavor for the data-generating component but is still quite early (e.g. small, constrained to simple indoor scenes, no moving objects, etc.) github.com/facebookresear… )
Dec 8, 2021
The ongoing consolidation in AI is incredible. Thread: ➡️ When I started ~a decade ago, vision, speech, natural language, reinforcement learning, etc. were completely separate; you couldn't read papers across areas - the approaches were completely different, often not even ML-based.
In the 2010s all of these areas started to transition 1) to machine learning and specifically 2) to neural nets. The architectures were diverse but at least the papers started to read more similarly, all of them utilizing large datasets and optimizing neural nets.
But as of approx. last two years, even the neural net architectures across all areas are starting to look identical - a Transformer (definable in ~200 lines of PyTorch github.com/karpathy/minGP…), with very minor differences. Either as a strong baseline or (often) state of the art.
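For a flavor of how little is in that shared primitive, here is a minimal single Transformer block in PyTorch in the spirit of minGPT (a sketch, not the repo's actual code):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # One pre-norm Transformer block: self-attention + MLP, both with residuals.
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                      # residual around attention
        x = x + self.mlp(self.ln2(x))  # residual around MLP
        return x

x = torch.randn(2, 16, 64)                    # (batch, tokens, channels)
print(Block(d_model=64, n_heads=8)(x).shape)  # torch.Size([2, 16, 64])
```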
Oct 24, 2021
Really excellent reading and pointers from @ericjang11, putting into words a new "Just Ask for Generalization" approach/philosophy to AI that the field has been slowly internalizing recently. Few more thoughts in thread ->
The first time I was personally shook by this philosophy was when I saw the "Just tell the AI to be nice" meme on my Twitter, which is the same idea - GPT can be seen as a super multi-task policy (trained via supervised learning), and prompt engineering is the goal conditioning.
wrt consciousness I do suspect it can just emerge in large-enough models trained on hard-enough tasks. The idea that emergence of consciousness is just another "grokking" phenomenon was the inspiration for my earlier short story "Forward Pass" karpathy.github.io/2021/03/27/for…
Oct 5, 2021
A fun story of trying to buy one small black coffee at Starbucks the other day. Normally this is one $5 transaction at the register, 5 seconds at the drip, done. But this Starbucks store (for some reason, covid?) was only taking online orders. There's a QR code to get started.
Now I really wanted my coffee but braced for what was to come. I unlocked my phone, scanned the QR code, went to the site, am told to download the app. So I download the app. Now I'm told I have to create an account. So I create an account. Now the app is asking my location.
Err, deny location privilege, of course! I scroll through the USA map all the way to the store I'm at, tap on it to select it. I scroll through the entire menu trying to find my simple small black coffee. I add it to the cart. Check out. Luckily, looks like I can Apple Pay!