Andrej Karpathy
Jul 1, 2018
most common neural net mistakes: 1) you didn't try to overfit a single batch first. 2) you forgot to toggle train/eval mode for the net. 3) you forgot to .zero_grad() (in pytorch) before .backward(). 4) you passed softmaxed outputs to a loss that expects raw logits. Others? :)
oh: 5) you didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forgot to include it for the output layer. This one won't make you silently fail, but the redundant biases are spurious parameters.
6) thinking view() and permute() are the same thing (& incorrectly using view)
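For concreteness, a minimal PyTorch sketch touching on several of these points (the model, data, and hyperparameters are made up purely for illustration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1, bias=False),  # 5) bias=False because BatchNorm adds its own shift
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),                 # ...but keep the bias on the output layer
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()                  # 4) expects raw logits, not softmaxed outputs

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))  # one fixed batch

model.train()                                    # 2) explicitly toggle train mode
for step in range(200):                          # 1) overfit this single batch first
    opt.zero_grad()                              # 3) zero grads before backward
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
# the loss should approach ~0 here; if it doesn't, debug before touching the full dataset
model.eval()                                     # 2) ...and eval mode for validation/inference

# 6) view() reshapes the existing memory layout; permute() reorders dimensions -- not the same
a = torch.arange(6).reshape(2, 3)
print(a.view(3, 2))     # [[0, 1], [2, 3], [4, 5]]
print(a.permute(1, 0))  # [[0, 3], [1, 4], [2, 5]]
```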

More from @karpathy

Feb 27
GPT 4.5 + interactive comparison :)

Today marks the release of GPT4.5 by OpenAI. I've been looking forward to this for ~2 years, ever since GPT4 was released, because this release offers a qualitative measurement of the slope of improvement you get out of scaling pretraining compute (i.e. simply training a bigger model). Each 0.5 in the version is roughly 10X pretraining compute.

Now, recall that GPT1 barely generates coherent text. GPT2 was a confused toy. GPT2.5 was "skipped" straight into GPT3, which was even more interesting. GPT3.5 crossed the threshold where it was enough to actually ship as a product and sparked OpenAI's "ChatGPT moment". And GPT4 in turn also felt better, but I'll say that it definitely felt subtle. I remember being part of a hackathon trying to find concrete prompts where GPT4 outperformed 3.5. They definitely existed, but clear and concrete "slam dunk" examples were difficult to find. It's that... everything was just a little bit better, but in a diffuse way. The word choice was a bit more creative. Understanding of nuance in the prompt was improved. Analogies made a bit more sense. The model was a little bit funnier. World knowledge and understanding were improved at the edges of rare domains. Hallucinations were a bit less frequent. The vibes were just a bit better. It felt like the rising tide that lifts all boats, where everything gets slightly improved by 20%.

So it is with that expectation that I went into testing GPT4.5, which I had access to for a few days, and which saw 10X more pretraining compute than GPT4. And I feel like, once again, I'm in the same hackathon from 2 years ago. Everything is a little bit better and it's awesome, but also not exactly in ways that are trivial to point to. Still, it is incredibly interesting and exciting as another qualitative measurement of a certain slope of capability that comes "for free" from just pretraining a bigger model.

Keep in mind that GPT4.5 was only trained with pretraining, supervised finetuning, and RLHF, so this is not yet a reasoning model. Therefore, this release does not push forward model capability in cases where reasoning is critical (math, code, etc.). In those cases, training with RL so the model learns to think is incredibly important and works better, even if it is on top of an older base model (e.g. GPT4-ish capability or so). The state of the art here remains the full o1. Presumably, OpenAI will now look to further train GPT4.5 with reinforcement learning to allow it to think, and push model capability in these domains.

HOWEVER. We do actually expect to see an improvement in tasks that are not reasoning heavy, and I would say those are tasks that are more EQ (as opposed to IQ) related and bottlenecked by e.g. world knowledge, creativity, analogy making, general understanding, humor, etc. So these are the tasks that I was most interested in during my vibe checks.

So below, I thought it would be fun to highlight 5 funny/amusing prompts that test these capabilities, and to organize them into an interactive "LM Arena Lite" right here on X, using a combination of images and polls in a thread. Sadly X does not allow you to include both an image and a poll in a single post, so I have to alternate posts: one giving the image (showing the prompt and two responses, one from 4 and one from 4.5), and one with the poll, where people can vote which is better. After 8 hours, I'll reveal the identities of which model is which. Let's see what happens :)
Question 1. Poll is in the following post.
Question 1 poll: Which is better?
Dec 3, 2024
The (true) story of development and inspiration behind the "attention" operator, the one in "Attention is All you Need" that introduced the Transformer. From personal email correspondence with the author @DBahdanau ~2 years ago, published here and now (with permission) following some fake news about how it was developed that circulated here over the last few days.

Attention is a brilliant (data-dependent) weighted average operation. It is a form of global pooling, a reduction, communication. It is a way to aggregate relevant information from multiple nodes (tokens, image patches, etc.). It is expressive, powerful, has plenty of parallelism, and is efficiently optimizable. Even the Multilayer Perceptron (MLP) can actually be almost re-written as Attention over data-independent weights (1st layer weights are the queries, 2nd layer weights are the values, the keys are just the input, and softmax becomes elementwise, deleting the normalization). TLDR: Attention is awesome and a *major* unlock in neural network architecture design.
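A minimal sketch of both points (my own illustration, not code from the thread): attention as a data-dependent weighted average, and a 2-layer MLP rewritten in the same query/key/value shape:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # data-dependent weighted average: each query mixes the value vectors
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def mlp_as_attention(x, W1, W2):
    # data-independent "attention": rows of W1 act as queries, the input x as keys,
    # W2 as values, and softmax is replaced by an elementwise nonlinearity (no normalization)
    scores = x @ W1.T               # queries . keys = first-layer pre-activations
    return torch.relu(scores) @ W2  # identical to a standard 2-layer MLP: relu(x @ W1.T) @ W2

x = torch.randn(5, 16)
W1, W2 = torch.randn(64, 16), torch.randn(64, 8)
out = mlp_as_attention(x, W1, W2)   # shape (5, 8)
```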

It's always been a little surprising to me that the paper "Attention is All You Need" gets ~100X more err ... attention... than the paper that actually introduced Attention ~3 years earlier, by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: "Neural Machine Translation by Jointly Learning to Align and Translate". As the name suggests, the core contribution of the Attention is All You Need paper that introduced the Transformer neural net is deleting everything *except* Attention, and basically just stacking it in a ResNet with MLPs (which can also be seen as ~attention per the above). But I do think the Transformer paper stands on its own because it adds many additional amazing ideas bundled up all together at once - positional encodings, scaled attention, multi-headed attention, the isotropic simple design, etc. And the Transformer has imo stuck around basically in its 2017 form to this day ~7 years later, with relatively few and minor modifications, maybe with the exception of better positional encoding schemes (RoPE and friends).

Anyway, pasting the full email below, which also hints at why this operation is called "attention" in the first place - it comes from attending to words of a source sentence while emitting the words of the translation in a sequential manner, and was introduced as a term late in the process by Yoshua Bengio in place of RNNSearch (thank god? :D). It's also interesting that the design was inspired by a human cognitive process/strategy of attending back and forth over some data sequentially. Lastly, the story is quite interesting from the perspective of the nature of progress, with similar ideas and formulations "in the air", with particular mention of the work of Alex Graves (NMT) and Jason Weston (Memory Networks) around that time.

Thank you for the story @DBahdanau!
"Links in the reply followup" (not a huge fan :p)
referenced papers:

Attention paper:
"Neural Machine Translation by Jointly Learning to Align and Translate"
arxiv.org/abs/1409.0473

Transformer paper:
"Attention is All You Need"
arxiv.org/abs/1706.03762

Alex Graves paper around that time with similar soft pooling operations:
"Neural Turing Machines"
arxiv.org/abs/1410.5401
plus the referenced handwriting paper (at the time super impressive, inspiring and forward-looking; this is 2013!):
"Generating Sequences With Recurrent Neural Networks"
arxiv.org/abs/1308.0850

Jason Weston mentioned paper:
"Memory Networks"
arxiv.org/abs/1410.3916

The referenced Ilya, Oriol, Quoc paper at Google:
"Sequence to Sequence Learning with Neural Networks"
arxiv.org/abs/1409.3215
Thanks to a reply, a text version for those on mobile:

---

Hi Andrej,

Happy to tell you the story as it happened 8 years ago!

I came to Yoshua's lab as an intern, after having done my first year of MSc at Jacobs University with Herbert Jaeger.

I told Yoshua I'm happy to work on anything. Yoshua put me on the machine translation project to work with Kyunghyun Cho and the team. I was super skeptical about the idea of cramming a sequence of words in a vector. But I also really wanted a PhD offer. So I rolled up my sleeves and started doing what I was good at - writing code, fixing bugs and so on. At some point I showed enough understanding of what's going on that Yoshua invited me to do a PhD (2014 was a good time when that was enough - good old times!). I was very happy and I thought it's time to have fun and be creative.

So I started thinking about how to avoid the bottleneck between encoder and decoder RNN. My first idea was to have a model with two "cursors", one moving through the source sequence (encoded by a BiRNN) and another one moving through the target sequence. The cursor trajectories would be marginalized out using dynamic programming. KyungHyun Cho recognized this as an equivalent to Alex Graves' RNN Transducer model. Following that, I may have also read Graves' hand-writing recognition paper. The approach looked inappropriate for machine translation though.

The above approach with cursors would be too hard to implement in the remaining 5 weeks of my internship. So I tried instead something simpler - two cursors moving at the same time synchronously (effectively hard-coded diagonal attention). That sort of worked, but the approach lacked elegance.

So one day I had this thought that it would be nice to enable the decoder RNN to learn to search where to put the cursor in the source sequence. This was sort of inspired by translation exercises that learning English in my middle school involved. Your gaze shifts back and forth between source and target sequence as you translate. I expressed the soft search as softmax and then weighted averaging of BiRNN states. It worked great from the very first try to my great excitement. I called the architecture RNNSearch, and we rushed to publish an ArXiV paper as we knew that Ilya and co at Google are somewhat ahead of us with their giant 8 GPU LSTM model (RNN Search still ran on 1 GPU).

As it later turned out, the name was not great. The better name (attention) was only added by Yoshua to the conclusion in one of the final passes.

We saw Alex Graves' NMT paper 1.5 months later. It was indeed exactly the same idea, though he arrived at it with a completely different motivation. In our case, necessity was the mother of invention. In his case it was the ambition to bridge neural and symbolic AI, I guess? Jason Weston's and co Memory Networks paper also featured a similar mechanism.

I did not have the foresight to think that attention can be used at a lower level, as the core operation in representation learning. But when I saw the Transformer paper, I immediately declared to labmates that RNNs are dead.

To go back to your original question: the invention of "differentiable and data-dependent weighted average" in Yoshua's lab in Montreal was independent from Neural Turing Machines, Memory Networks, as well as some relevant cog-sci papers from the 90s (or even 70s; can't give you any links though). It was the result of Yoshua's leadership in pushing the lab to be ambitious, KyungHyun Cho's great skills at running a big machine translation project staffed with junior PhD students and interns, and lastly, my own creativity and coding skills that had been honed in years of competitive programming. But I don't think that this idea would have waited for any more time before being discovered. Even if myself, Alex Graves and other characters in this story did not do deep learning at that time, attention is just the natural way to do flexible spatial connectivity in deep learning. It is a nearly obvious idea that was waiting for GPUs to be fast enough to make people motivated and take deep learning research seriously. Ever since I realized this, my big AI ambition is to start amazing applied projects like that machine translation project. Good R&D endeavors can do more for progress in fundamental technologies than all the fancy theorizing that we often consider the "real" AI research.

That's all! Very curious to hear more about your educational AI projects (I heard some rumors from Harm de Vries ;)).

Cheers,
Dima
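As an aside (my own sketch, not part of the email), the "soft search" described above, i.e. scoring each encoder (BiRNN) state against the current decoder state, softmaxing, and taking a weighted average, looks roughly like this; names and dimensions are illustrative:

```python
import torch
import torch.nn.functional as F

def soft_search(decoder_state, encoder_states, Wa, Ua, va):
    # decoder_state: (d_dec,), encoder_states: (src_len, d_enc)
    scores = torch.tanh(decoder_state @ Wa + encoder_states @ Ua) @ va  # (src_len,)
    weights = F.softmax(scores, dim=0)         # where to "put the cursor" in the source
    return weights @ encoder_states            # context vector: weighted average of BiRNN states

d_dec, d_enc, d_att, src_len = 32, 64, 16, 10
Wa = torch.randn(d_dec, d_att)
Ua = torch.randn(d_enc, d_att)
va = torch.randn(d_att)
context = soft_search(torch.randn(d_dec), torch.randn(src_len, d_enc), Wa, Ua, va)
```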
Feb 20, 2024
New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"

Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.
We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
Also, releasing new repository on GitHub: minbpe
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.


In the video we essentially build minbpe from scratch.
Don't miss the exercise.md to build your own: github.com/karpathy/minbpe
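For intuition, a toy sketch of the core BPE training loop (my own illustration, not the actual minbpe code): repeatedly find the most frequent adjacent pair of token ids and merge it into a new id.

```python
from collections import Counter

def train_bpe(ids, num_merges, first_new_id=256):
    merges = {}                                # (id_a, id_b) -> new token id
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))     # count adjacent pairs
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]      # most frequent pair
        new_id = first_new_id + i
        merges[pair] = new_id
        out, j = [], 0
        while j < len(ids):                    # replace every occurrence of the pair
            if j < len(ids) - 1 and (ids[j], ids[j + 1]) == pair:
                out.append(new_id)
                j += 2
            else:
                out.append(ids[j])
                j += 1
        ids = out
    return ids, merges

ids = list("aaabdaaabac".encode("utf-8"))      # start from raw utf-8 bytes
ids, merges = train_bpe(ids, num_merges=3)     # encode() replays these merges; decode() inverts them
```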
Dec 27, 2023
"Man-Computer Symbiosis" by Licklider, 1960

I love reading technology prediction documents because the benefit of hindsight is training data. Here, 64 years ago, Licklider imagines computing as a fundamentally intelligence amplification tool. groups.csail.mit.edu/medg/people/ps…
Licklider argues that the period of "intelligence augmentation" (IA) may be transient on the path to full automation (AI), but still long enough to be worth thinking through and about.
His citations for what must have felt like rapid progress in both narrow AI and AGI (of that age, i.e. the "general problem solver" [20]) are today known to be false starts that were off track in a quite fundamental way: at that time, they were based on a manual process of encoding knowledge with predicate logic and using production rules of logic and search to manipulate them into conclusions. Today, most of AI is only aware of all of this work as a historical curiosity; it is not part of the "master branch" of the field; it is stuck in a dead-end feature branch. And notably, what is considered today the most promising approach (LLMs) was at that time not only completely computationally inaccessible, but also impossible due to the lack of training data of trillions of tokens in digitized form. (What might be an equivalent of that today?)
The study by the Air Force, estimating that machines alone would be doing problem solving of military significance in 20 years' time, evokes a snicker today. Amusingly, "20 years away" seems to be a kind of codeword for "no idea, long time". Arguably, I'm not sure that we are there even today, 64 years later. Computers do a lot to increase situational awareness, but decision making of "military significance" afaik is still well within the domain of human computation.
An interesting observation from Licklider is that most of his "thinking" in a day-to-day computational task thought experiment is not so much thinking, but more a rote, mechanical, automatable data collection and visualization. It is this observation that leads him to conclude that the strengths and weaknesses of humans and computers are complementary: computers can do the busy work, and humans can do the thinking work. This has been the prevailing paradigm for the next 64 years, and it's only very recently (last ~year) that computers have started to make a dent into "thinking" in a general, scalable, and economy-impacting way. Not in an explicit, hard, predicate logic way, but in an implicit, soft, statistical way. Hence the LLM-driven AI summer of today.
Apr 2, 2023
Next frontier of prompt engineering imo: "AutoGPTs". 1 GPT call is just like 1 instruction on a computer. They can be strung together into programs. Use prompt to define I/O device and tool specs, define the cognitive loop, page data in and out of context window, .run().
Interesting non-obvious note on GPT psychology is that unlike people they are completely unaware of their own strengths and limitations. E.g. that they have finite context window. That they can just barely do mental math. That samples can get unlucky and go off the rails. Etc.
(so I'd expect the good prompts to explicitly address things like this)
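A rough sketch of that shape (llm(), the tool table, and the prompt format here are all placeholders I'm making up, not a real API):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for one GPT call, i.e. one 'instruction'")

TOOLS = {"search": lambda q: "...", "calculator": lambda expr: "..."}  # I/O device / tool specs

def run(task: str, max_steps: int = 10) -> str:
    context = f"Task: {task}\nTools: {list(TOOLS)}\n"       # page the task spec into the context window
    for _ in range(max_steps):                              # the cognitive loop
        action = llm(context)                               # one GPT call per step
        if action.startswith("FINISH:"):
            return action.removeprefix("FINISH:").strip()
        tool, _, arg = action.partition(" ")
        observation = TOOLS.get(tool, lambda a: "unknown tool")(arg)
        context += f"\n{action}\n-> {observation}"          # page results back into context
    return context                                          # ran out of steps
```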
Mar 6, 2023
More good read/discussion on psychology of LLMs. I don't follow in full but imo it is barking up the right tree w.r.t. a framework for analysis. lesswrong.com/posts/D7PumeYT…
A pretrained LLM is not an AI but a simulator, described by a statistical physics based on internet webpages. The system evolves given any initial conditions (prompt). To gather logprob it internally maintains a probability distribution over what kind of document it is completing.
In particular, "good, aligned, conversational AI" is just one of many possible different rollouts. Finetuning / alignment tries to "collapse" and control the entropy to that region of the simulator. Jailbreak prompts try to knock the state into other logprob ravines.
