Tweet

https://twitter.com/SigGravitas/status/1642181498278408193

twitter.com/i/web/status/1…

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @karpathy

Andrej Karpathy

@karpathy

Mar 6

More good read/discussion on psychology of LLMs. I don't follow in full but imo it is barking up the right tree w.r.t. a framework for analysis. lesswrong.com/posts/D7PumeYT…

A pretrained LLM is not an AI but a simulator, described by a statistical physics based on internet webpages. The system evolves given any initial conditions (prompt). To gather logprob it internally maintains a probability distribution over what kind of document it is completing

In particular, "good, aligned, conversational AI" is just one of many possible different rollouts. Finetuning / alignment tries to "collapse" and control the entropy to that region of the simulator. Jailbreak prompts try to knock the state into other logprob ravines.

Read 4 tweets

Andrej Karpathy

@karpathy

Jan 24

The hottest new programming language is English

This tweet went wide, thought I'd post some of the recent supporting articles that inspired it.
1/ GPT-3 paper showed that LLMs perform in-context learning, and can be "programmed" inside the prompt with input:output examples to perform diverse tasks arxiv.org/abs/2005.14165

2/ These two [1] arxiv.org/abs/2205.11916 , [2] arxiv.org/abs/2211.01910 are good examples that the prompt can further program the "solution strategy", and with a good enough design of it, a lot more complex multi-step reasoning tasks become possible.

Read 11 tweets

Andrej Karpathy

@karpathy

Jan 17

🔥 New (1h56m) video lecture: "Let's build GPT: from scratch, in code, spelled out."

We build and train a Transformer following the "Attention Is All You Need" paper in the language modeling setting and end up with the core of nanoGPT.

First ~1 hour is 1) establishing a baseline (bigram) language model, and 2) introducing the core "attention" mechanism at the heart of the Transformer as a kind of communication / message passing between nodes in a directed graph.

The second ~1hr builds up the Transformer: multi-headed self-attention, MLP, residual connections, layernorms. Then we train one and compare it to OpenAI's GPT-3 (spoiler: ours is around ~10K - 1M times smaller but the ~same neural net) and ChatGPT (i.e. ours is pretraining only)

Read 4 tweets

Andrej Karpathy

@karpathy

Jan 11

Didn't tweet nanoGPT yet (quietly getting it to good shape) but it's trending on HN so here it is :) :
github.com/karpathy/nanoG…
Aspires to be simplest, fastest repo for training/finetuning medium-sized GPTs. So far confirmed it reproduced GPT-2 (124M). 2 simple files of ~300 lines

Rough example, a decent GPT-2 (124M) pre-training reproduction would be 1 node of 8x A100 40GB for 32 hours, processing 8 GPU * 16 batch size * 1024 block size * 500K iters = ~65B tokens. I suspect this wall clock can still be improved ~2-3X+ without getting too exotic.

I'd like to continue to make it faster, reproduce the other GPT-2 models, then scale up pre-training to bigger models/datasets, then improve the docs for finetuning (the practical use case). Also working on video lecture where I will build it from scratch, hoping out in ~2 weeks.

Read 4 tweets

Andrej Karpathy

@karpathy

Dec 7, 2022

https://twitter.com/tall/status/1600571735455465472

Dreambooth (stable diffusion finetuning for personal profile pictures) has been going viral last few days as well, for good reasons it's super fun; Unlike other places stableboost.ai lets you play with infinite variations and experiment and play with your own prompts:

https://twitter.com/tall/status/1600571735455465472

Turns out in a parallel Universe I'd look awesome as a samurai, cowboy and... saint? :D

Stableboost auto-suggests a few hundred prompts by default but you can generate additional variations for any one prompt that seems to be giving fun/interesting results, or adjust it in any way:

Read 6 tweets

Andrej Karpathy

@karpathy

Nov 18, 2022

An interesting historical note is that neural language models have actually been around for a very long time but noone really cared anywhere near today's extent. LMs were thought of as specific applications, not as mainline research unlocking new general AI paths and capabilities

E.g. ~20 years ago Bengio et al 2003 (pdf: jmlr.org/papers/volume3…) trained a neural language model. The state of the art GPT+friends of today are the exact same (autoregressive) model, except the neural net architecture is upgraded from an MLP to a Transformer.

The non-obvious crux of the shift is an empirical finding, emergent only at scale, and well-articulated in the GPT-3 paper (arxiv.org/abs/2005.14165). Basically, Transformers demonstrate the ability of "in-context" learning. At run-time, in the activations. No weight updates.

Read 11 tweets

Share this page!

Andrej Karpathy

People who liked this thread also liked...

Try unrolling a thread yourself!

More from @karpathy

Andrej Karpathy

Andrej Karpathy

Andrej Karpathy

Andrej Karpathy

Andrej Karpathy

Andrej Karpathy

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!