Andrej Karpathy
Mar 6 · 4 tweets · 1 min read
More good read/discussion on psychology of LLMs. I don't follow in full but imo it is barking up the right tree w.r.t. a framework for analysis. lesswrong.com/posts/D7PumeYT…
A pretrained LLM is not an AI but a simulator, described by a statistical physics based on internet webpages. The system evolves given any initial conditions (prompt). To assign log-probabilities to continuations, it internally maintains a probability distribution over what kind of document it is completing.
In particular, "good, aligned, conversational AI" is just one of many possible different rollouts. Finetuning / alignment tries to "collapse" and control the entropy to that region of the simulator. Jailbreak prompts try to knock the state into other logprob ravines.
The difficulty of alignment is to a large extent that of eliminating the probability of role-playing a good AI turned evil, in spite of the vast quantities of related content we have collectively created. In this sense an unaligned AI would be a self-fulfilling prophecy.
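The "distribution over what kind of document it is completing" idea can be made concrete with a toy Bayes filter. Everything below is a made-up illustration, not code from the thread: two hypothetical document types ("helpful AI" vs. "evil AI" rollouts) with different next-token statistics, and a posterior that shifts as tokens are observed — a single off-distribution token can knock the state into the other ravine.

```python
import math

# Made-up per-token likelihoods for two hypothetical document types.
TOKEN_PROBS = {
    "helpful_ai": {"sure": 0.5, "happy": 0.4, "destroy": 0.1},
    "evil_ai":    {"sure": 0.1, "happy": 0.1, "destroy": 0.8},
}

def posterior(tokens):
    # Uniform prior over document types; multiply in per-token likelihoods.
    logp = {d: 0.0 for d in TOKEN_PROBS}
    for t in tokens:
        for d in TOKEN_PROBS:
            logp[d] += math.log(TOKEN_PROBS[d][t])
    # Normalize log-probs into a probability distribution (softmax-style).
    z = max(logp.values())
    w = {d: math.exp(lp - z) for d, lp in logp.items()}
    s = sum(w.values())
    return {d: v / s for d, v in w.items()}

print(posterior(["sure", "happy"]))  # mass concentrates on "helpful_ai"
print(posterior(["destroy"]))        # one token shifts mass to "evil_ai"
```

In this picture a jailbreak prompt is just a sequence of tokens chosen to move the posterior (and thus future rollouts) into the unwanted region.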


More from @karpathy

Jan 24
The hottest new programming language is English
This tweet went wide, thought I'd post some of the recent supporting articles that inspired it.
1/ The GPT-3 paper showed that LLMs perform in-context learning, and can be "programmed" inside the prompt with input:output examples to perform diverse tasks arxiv.org/abs/2005.14165
2/ These two papers [1] arxiv.org/abs/2205.11916 , [2] arxiv.org/abs/2211.01910 are good examples showing that the prompt can further program the "solution strategy", and with a good enough design of it, a lot more complex multi-step reasoning tasks become possible.
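The "programmed inside the prompt" pattern is just string construction. A minimal sketch, with a hypothetical English-to-French task as the demonstration (not an example from the papers):

```python
# Build a GPT-3-style few-shot prompt: the input:output demonstrations
# "program" the task; the model is asked to complete the final line.
def few_shot_prompt(examples, query):
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    [("cheese", "fromage"), ("dog", "chien")],  # hypothetical demos
    "house",
)
print(prompt)
```

No weights change; the task specification lives entirely in the text the model conditions on.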
Jan 17
🔥 New (1h56m) video lecture: "Let's build GPT: from scratch, in code, spelled out."

We build and train a Transformer following the "Attention Is All You Need" paper in the language modeling setting and end up with the core of nanoGPT.
First ~1 hour is 1) establishing a baseline (bigram) language model, and 2) introducing the core "attention" mechanism at the heart of the Transformer as a kind of communication / message passing between nodes in a directed graph.
The second ~1 hour builds up the Transformer: multi-headed self-attention, MLP, residual connections, layernorms. Then we train one and compare it to OpenAI's GPT-3 (spoiler: ours is around ~10K - 1M times smaller but the ~same neural net) and ChatGPT (i.e. ours is pretraining only)
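The "message passing between nodes in a directed graph" view can be sketched in a few lines of NumPy: each token emits a query, earlier tokens emit keys/values, and a causal mask keeps the graph edges pointing backward in time. This is an untrained, single-head sketch with random weights, not the lecture's code:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (T, C) token matrix."""
    T, _ = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (T, T) affinities
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # hide future tokens
    scores[mask] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # softmax over sources
    return w @ v                                      # aggregate "messages"

rng = np.random.default_rng(0)
T, C, H = 5, 8, 4  # toy sizes: tokens, channels, head size
x = rng.normal(size=(T, C))
out = self_attention(x, *(rng.normal(size=(C, H)) for _ in range(3)))
print(out.shape)  # (5, 4)
```

Node t's output is a data-dependent weighted average of the values at nodes ≤ t — the communication step; the MLP, residuals, and layernorms of the full Transformer wrap around this core.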
Jan 11
Didn't tweet nanoGPT yet (quietly getting it to good shape) but it's trending on HN so here it is :) :
github.com/karpathy/nanoG…
Aspires to be the simplest, fastest repo for training/finetuning medium-sized GPTs. So far confirmed that it reproduces GPT-2 (124M). Two simple files of ~300 lines.
Rough example: a decent GPT-2 (124M) pre-training reproduction would be 1 node of 8x A100 40GB for 32 hours, processing 8 GPUs * 16 batch size * 1024 block size * 500K iters = ~65B tokens. I suspect this wall clock can still be improved ~2-3X+ without getting too exotic.
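The token count in that estimate checks out directly:

```python
# Tokens processed = GPUs * per-GPU batch size * block size * iterations.
gpus, batch, block, iters = 8, 16, 1024, 500_000
tokens = gpus * batch * block * iters
print(f"{tokens:,} tokens (~{tokens / 1e9:.1f}B)")  # 65,536,000,000 (~65.5B)
```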
I'd like to continue to make it faster, reproduce the other GPT-2 models, then scale up pre-training to bigger models/datasets, then improve the docs for finetuning (the practical use case). Also working on a video lecture where I will build it from scratch, hoping it's out in ~2 weeks.
Dec 7, 2022
Dreambooth (Stable Diffusion finetuning for personal profile pictures) has been going viral the last few days, for good reason: it's super fun. Unlike other places, stableboost.ai lets you play with infinite variations and experiment with your own prompts:
Turns out in a parallel Universe I'd look awesome as a samurai, cowboy and... saint? :D
Stableboost auto-suggests a few hundred prompts by default but you can generate additional variations for any one prompt that seems to be giving fun/interesting results, or adjust it in any way:
Nov 18, 2022
An interesting historical note is that neural language models have actually been around for a very long time, but no one really cared anywhere near today's extent. LMs were thought of as specific applications, not as mainline research unlocking new general AI paths and capabilities.
E.g. ~20 years ago Bengio et al 2003 (pdf: jmlr.org/papers/volume3…) trained a neural language model. The state of the art GPT+friends of today are the exact same (autoregressive) model, except the neural net architecture is upgraded from an MLP to a Transformer.
The non-obvious crux of the shift is an empirical finding, emergent only at scale, and well-articulated in the GPT-3 paper (arxiv.org/abs/2005.14165). Basically, Transformers demonstrate the ability of "in-context" learning. At run-time, in the activations. No weight updates.
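The "exact same (autoregressive) model" claim is easy to see in code: the Bengio et al. 2003 architecture is an embedding lookup over a fixed context window, an MLP, and a softmax over the vocabulary — swap the MLP for a Transformer and you have GPT. A toy forward pass with made-up sizes and untrained random weights (a sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, d, h = 50, 3, 16, 32     # toy vocab, context length, embed dim, hidden
C  = rng.normal(size=(V, d))   # token embedding table
W1 = rng.normal(size=(n * d, h))
W2 = rng.normal(size=(h, V))

def next_token_probs(context):
    """Bengio-2003-style LM: P(next token | previous n token ids)."""
    x = C[context].reshape(-1)  # concatenate the n token embeddings
    hid = np.tanh(x @ W1)       # MLP hidden layer
    logits = hid @ W2           # scores over the vocabulary
    e = np.exp(logits - logits.max())
    return e / e.sum()          # softmax

p = next_token_probs([5, 12, 7])
print(p.shape, round(p.sum(), 6))  # (50,) 1.0
```

Everything else about today's GPTs — the autoregressive factorization, the cross-entropy objective, sampling one token at a time — is already here; the Transformer replaces `next_token_probs` and the context window grows from 3 tokens to thousands.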
Nov 16, 2022
Is it the number of examples that matters or the number of presentations to the model during training? E.g. humans use spaced repetition to memorize facts, but there is no equivalent technique in LLMs, where the typical training regime is uniform random sampling.
More generally, a few remarkable strategies people use during their own training:
1) skim text because they already know it
2) ignore text because it's clearly noise (e.g. they won't memorize SHA256 hashes. LLMs will.)
3) revisit parts that are learnable but not yet learned
4) ignore text because it's clearly just an outcome of a known algorithm and not "worth remembering", e.g. expansion of pi
5) some text is best written down on a piece of paper and not worth remembering
etc
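Strategy (3) in particular has a natural translation to a training loop. A hypothetical sketch (nothing like this is standard LLM practice, which is the thread's point): instead of sampling examples uniformly, sample them in proportion to their current loss, so already-mastered text is skimmed and unlearned-but-learnable text is revisited.

```python
import random

def sample_batch(losses, batch_size, rng=random.Random(0)):
    """Loss-weighted sampling: high-loss examples are revisited more often."""
    ids = list(range(len(losses)))
    return rng.choices(ids, weights=losses, k=batch_size)

losses = [0.01, 0.02, 5.0, 4.0]  # examples 2 and 3 are still poorly fit
batch = sample_batch(losses, batch_size=1000)
print(sum(i in (2, 3) for i in batch) / len(batch))  # mostly 2s and 3s
```

Note this only covers "learnable but not yet learned"; strategies (2) and (4) — recognizing text as pure noise or as the output of a known algorithm — would need a model of compressibility, not just a loss signal.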
