Post

More from @karpathy

Andrej Karpathy

@karpathy

Feb 11

New art project.
Train and inference GPT in 243 lines of pure, dependency-free Python. This is the *full* algorithmic content of what is needed. Everything else is just for efficiency. I cannot simplify this any further.
gist.github.com/karpathy/8627f…

The way it works is that the full LLM architecture and loss function is stripped entirely to the most atomic individual mathematical operations that make it up (+, *, **, log, exp), and then a tiny scalar-valued autograd engine (micrograd) calculates gradients. Adam for optim.

(oops should have added to this thread instead of separate post). Made a few changes and put it up here as a mirror to the gist because I wanted it to one page. karpathy.ai/microgpt.html

Read 4 tweets

Andrej Karpathy

@karpathy

Mar 23, 2025

I just vibe coded a whole iOS app in Swift (without having programmed in Swift before, though I learned some in the process) and now ~1 hour later it's actually running on my physical phone. It was so ez... I had my hand held through the entire process. Very cool.

I didn't even read any docs at all, I just opened a ChatGPT convo and followed instructions.

A number of people asked If I can share the convo and yes sure - these were the 4 convos with my super noob swift questions lol:

1 starting the app
chatgpt.com/share/67e02d8a…
2 enhancements
chatgpt.com/share/67e02d99…
3 adding AppStorage to persist state over time
chatgpt.com/share/67e02da3…
4 deploy to phone
chatgpt.com/share/67e02db4…

and this is what it looks like late last night
x.com/karpathy/statu…

I'm already happily using it today for tracking, and will probably hack on it more on this fine sunday.

Read 5 tweets

Andrej Karpathy

@karpathy

Feb 27, 2025

GPT 4.5 + interactive comparison :)

Today marks the release of GPT4.5 by OpenAI. I've been looking forward to this for ~2 years, ever since GPT4 was released, because this release offers a qualitative measurement of the slope of improvement you get out of scaling pretraining compute (i.e. simply training a bigger model). Each 0.5 in the version is roughly 10X pretraining compute. Now, recall that GPT1 barely generates coherent text. GPT2 was a confused toy. GPT2.5 was "skipped" straight into GPT3, which was even more interesting. GPT3.5 crossed the threshold where it was enough to actually ship as a product and sparked OpenAI's "ChatGPT moment". And GPT4 in turn also felt better, but I'll say that it definitely felt subtle. I remember being a part of a hackathon trying to find concrete prompts where GPT4 outperformed 3.5. They definitely existed, but clear and concrete "slam dunk" examples were difficult to find. It's that ... everything was just a little bit better but in a diffuse way. The word choice was a bit more creative. Understanding of nuance in the prompt was improved. Analogies made a bit more sense. The model was a little bit funnier. World knowledge and understanding was improved at the edges of rare domains. Hallucinations were a bit less frequent. The vibes were just a bit better. It felt like the water that rises all boats, where everything gets slightly improved by 20%. So it is with that expectation that I went into testing GPT4.5, which I had access to for a few days, and which saw 10X more pretraining compute than GPT4. And I feel like, once again, I'm in the same hackathon 2 years ago. Everything is a little bit better and it's awesome, but also not exactly in ways that are trivial to point to. Still, it is incredible interesting and exciting as another qualitative measurement of a certain slope of capability that comes "for free" from just pretraining a bigger model.

Keep in mind that that GPT4.5 was only trained with pretraining, supervised finetuning, and RLHF, so this is not yet a reasoning model. Therefore, this model release does not push forward model capability in cases where reasoning is critical (math, code, etc.). In these cases, training with RL and gaining thinking is incredibly important and works better, even if it is on top of an older base model (e.g. GPT4ish capability or so). The state of the art here remains the full o1. Presumably, OpenAI will now be looking to further train with Reinforcement Learning on top of GPT4.5 model to allow it to think, and push model capability in these domains.

HOWEVER. We do actually expect to see an improvement in tasks that are not reasoning heavy, and I would say those are tasks that are more EQ (as opposed to IQ) related and bottlenecked by e.g. world knowledge, creativity, analogy making, general understanding, humor, etc. So these are the tasks that I was most interested in during my vibe checks.

So below, I thought it would be fun to highlight 5 funny/amusing prompts that test these capabilities, and to organize them into an interactive "LM Arena Lite" right here on X, using a combination of images and polls in a thread. Sadly X does not allow you to include both an image and a poll in a single post, so I have to alternate posts that give the image (showing the prompt, and two responses one from 4 and one from 4.5), and the poll, where people can vote which one is better. After 8 hours, I'll reveal the identities of which model is which. Let's see what happens :)

Question 1. Poll is in the following post.

Question 1 poll: Which is better?

Read 12 tweets

Andrej Karpathy

@karpathy

Dec 3, 2024

The (true) story of development and inspiration behind the "attention" operator, the one in "Attention is All you Need" that introduced the Transformer. From personal email correspondence with the author @DBahdanau ~2 years ago, published here and now (with permission) following some fake news about how it was developed that circulated here over the last few days.

Attention is a brilliant (data-dependent) weighted average operation. It is a form of global pooling, a reduction, communication. It is a way to aggregate relevant information from multiple nodes (tokens, image patches, or etc.). It is expressive, powerful, has plenty of parallelism, and is efficiently optimizable. Even the Multilayer Perceptron (MLP) can actually be almost re-written as Attention over data-indepedent weights (1st layer weights are the queries, 2nd layer weights are the values, the keys are just input, and softmax becomes elementwise, deleting the normalization). TLDR Attention is awesome and a *major* unlock in neural network architecture design.

It's always been a little surprising to me that the paper "Attention is All You Need" gets ~100X more err ... attention... than the paper that actually introduced Attention ~3 years earlier, by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: "Neural Machine Translation by Jointly Learning to Align and Translate". As the name suggests, the core contribution of the Attention is All You Need paper that introduced the Transformer neural net is deleting everything *except* Attention, and basically just stacking it in a ResNet with MLPs (which can also be seen as ~attention per the above). But I do think the Transformer paper stands on its own because it adds many additional amazing ideas bundled up all together at once - positional encodings, scaled attention, multi-headed attention, the isotropic simple design, etc. And the Transformer has imo stuck around basically in its 2017 form to this day ~7 years later, with relatively few and minor modifications, maybe with the exception better positional encoding schemes (RoPE and friends).

Anyway, pasting the full email below, which also hints at why this operation is called "attention" in the first place - it comes from attending to words of a source sentence while emitting the words of the translation in a sequential manner, and was introduced as a term late in the process by Yoshua Bengio in place of RNNSearch (thank god? :D). It's also interesting that the design was inspired by a human cognitive process/strategy, of attending back and forth over some data sequentially. Lastly the story is quite interesting from the perspective of nature of progress, with similar ideas and formulations "in the air", with a particular mentions to the work of Alex Graves (NMT) and Jason Weston (Memory Networks) around that time.

Thank you for the story @DBahdanau !

"Links in the reply followup" (not a huge fan :p)
referenced papers:

Attention paper:
"Neural Machine Translation by Jointly Learning to Align and Translate"
arxiv.org/abs/1409.0473

Transformer paper:
"Attention is All You Need"
arxiv.org/abs/1706.03762

Alex Graves paper around that time with similar soft pooling operations:
"Neural Turing Machines"
arxiv.org/abs/1410.5401
+the referenced (at the time super impressive, inspiring and forward-looking) handwriting paper, this is 2013!:
"Generating Sequences With Recurrent Neural Networks"
arxiv.org/abs/1308.0850

Jason Weston mentioned paper:
"Memory Networks"
arxiv.org/abs/1410.3916

The referenced Ilya, Oriol, Quoc paper at Google:
"Sequence to Sequence Learning with Neural Networks"
arxiv.org/abs/1409.3215

Ty to a reply, text version for those on mobile:

---

Hi Andrej,

Happy to tell you the story as it happened 8 years ago!

I came to Yoshua's lab as an intern, after having done my first year of MSc at Jacobs University with Herbert Jaeger.

I told Yoshua I'm happy to work on anything. Yoshua put me on the machine translation project to work with Kyunghyun Cho and the team. I was super skeptical about the idea of cramming a sequence of words in a vector. But I also really wanted a PhD offer. So I rolled up my sleeves and started doing what I was good at - writing code, fixing bugs and so on. At some point I showed enough understanding of what's going on that Yoshua invited me to do a PhD (2014 was a good time when that was enough - good old times!). I was very happy and I thought it's time to have fun and be creative.

So I started thinking about how to avoid the bottleneck between encoder and decoder RNN. My first idea was to have a model with two "cursors", one moving through the source sequence (encoded by a BiRNN) and another one moving through the target sequence. The cursor trajectories would be marginalized out using dynamic programming. KyungHyun Cho recognized this as an equivalent to Alex Graves' RNN Transducer model. Following that, I may have also read Graves' hand-writing recognition paper. The approach looked inappropriate for machine translation though.

The above approach with cursors would be too hard to implement in the remaining 5 weeks of my internship. So I tried instead something simpler - two cursors moving at the same time synchronously (effectively hard-coded diagonal attention). That sort of worked, but the approach lacked elegance.

So one day I had this thought that it would be nice to enable the decoder RNN to learn to search where to put the cursor in the source sequence. This was sort of inspired by translation exercises that learning English in my middle school involved. Your gaze shifts back and forth between source and target sequence as you translate. I expressed the soft search as softmax and then weighted averaging of BiRNN states. It worked great from the very first try to my great excitement. I called the architecture RNNSearch, and we rushed to publish an ArXiV paper as we knew that Ilya and co at Google are somewhat ahead of us with their giant 8 GPU LSTM model (RNN Search still ran on 1 GPU).

As it later turned out, the name was not great. The better name (attention) was only added by Yoshua to the conclusion in one of the final passes.

We saw Alex Graves' NMT paper 1.5 months later. It was indeed exactly the same idea, though he arrived at it with a completely different motivation. In our case, necessity was the mother of invention. In his case it was the ambition to bridge neural and symbolic AI, I guess? Jason Weston's and co Memory Networks paper also featured a similar mechanism.

I did not have the foresight to think that attention can be used at a lower level, as the core operation in representation learning. But when I saw the Transformer paper, I immediately declared to labmates that RNNs are dead.

To go back to your original question: the invention of "differentiable and data-dependent weighted average" in Yoshua's lab in Montreal was independent from Neural Turing Machines, Memory Networks, as well as some relevant cog-sci papers from the 90s (or even 70s; can give you any links though). It was the result of Yoshua's leadership in pushing the lab to be ambitious, KyungHyun Cho great skills at running a big machine translation project staffed with junior PhD students and interns, and lastly, my own creativity and coding skills that had been honed in years of competitive programming. But I don't think that this idea would wait for any more time before being discovered. Even if myself, Alex Graves and other characters in this story did not do deep learning at that time, attention is just the natural way to do flexible spatial connectivity in deep learning. It is a nearly obvious idea that was waiting for GPUs to be fast enough to make people motivated and take deep learning research seriously. Ever since I realized this, my big AI ambition is to start amazing applied projects like that machine translation project. Good R&D endeavors can do more for progress in fundamental technologies than all the fancy theorizing that we often consider the "real" AI research.

That's all! Very curious to hear more about your educational AI projects (I heard some rumors from Harm de Vries ;)).

Cheers,
Dima

Read 4 tweets

Andrej Karpathy

@karpathy

Feb 20, 2024

New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"

Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.

We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.

Also, releasing new repository on GitHub: minbpe
Minimal, clean, code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

In the video we essentially build minbpe from scratch.
Don't miss the to build your owngithub.com/karpathy/minbpe
exercise.md

Read 4 tweets

Andrej Karpathy

@karpathy

Dec 27, 2023

"Man-Computer Symbiosis" by Licklider, 1960

I love reading technology prediction documents because the benefit of hindsight is training data. Here, 64 years ago, Licklider imagines computing as a fundamentally intelligence amplification tool.groups.csail.mit.edu/medg/people/ps…

Licklider argues that the period of "intelligence augmentation" (IA) may be transient on the path to full automation (AI), but still long enough to be worth thinking through and about.
His citations for what must have felt like rapid progress in both narrow AI and AGI (of that age, i.e. the "general problem solver" [20]) are today known to be false starts that were off track in a quite fundamental way, at that time based on a manual process of encoding knowledge with predicate logic and using production rules of logic and search to manipulate them into conclusions. Today, most of AI is only aware of all of this work as a historical curiosity, it is not part of the "master branch" of the field, it is stuck in a dead end feature branch. And notably, what is considered today the most promising approach (LLMs) were at that time not only completely computationally inaccessible, but also impossible due to the lack of training data of trillions of tokens in digitized forms. (What might be an equivalent of that today?)
The study by the Air Force, estimating that machines alone would be doing problem solving of military significance in 20 years time evokes a snicker today. Amusingly, "20 years away" seems to be a kind of codeword for "no idea, long time". Arguably, I'm not sure that we are there even today, 64 years later. Computers do a lot to increase situational awareness, but decision making of "military significance" afaik is still well within the domain of human computation.

An interesting observation from Licklider is that most of his "thinking" in a day-to-day computational task thought experiment is not so much thinking, but more a rote, mechanical, automatable data collection and visualization. It is this observation that leads him to conclude that the strengths and weaknesses of humans and computers are complementary; That computers can do the busy work, and humans can do thinking work. This has been the prevailing paradigm for the next 64 years, and it's only very recently (last ~year) that computers have started to make a dent into "thinking" in a general, scaleable, and economy-impacting way. Not in an explicit, hard, predicate logic way, but in an implicit, soft, statistical way. Hence the LLM-driven AI summer of today.

Read 11 tweets

Share this page!

Enter URL or ID to Unroll

Andrej Karpathy

Try unrolling a thread yourself!

More from @karpathy

Andrej Karpathy

Andrej Karpathy

Andrej Karpathy

Andrej Karpathy

Andrej Karpathy

Andrej Karpathy

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!