⚡️ Excited to share that I am starting an AI+Education company called Eureka Labs.
The announcement:
---
We are Eureka Labs and we are building a new kind of school that is AI native.
How can we approach an ideal experience for learning something new? For example, in the case of physics one could imagine working through very high quality course materials together with Feynman, who is there to guide you every step of the way. Unfortunately, subject matter experts who are deeply passionate, great at teaching, infinitely patient and fluent in all of the world's languages are also very scarce and cannot personally tutor all 8 billion of us on demand.
However, with recent progress in generative AI, this learning experience feels tractable. The teacher still designs the course materials, but they are supported, leveraged and scaled with an AI Teaching Assistant who is optimized to help guide the students through them. This Teacher + AI symbiosis could run an entire curriculum of courses on a common platform. If we are successful, it will be easy for anyone to learn anything, expanding education in both reach (a large number of people learning something) and extent (any one person learning a large amount of subjects, beyond what may be possible today unassisted).
Our first product will be the world's obviously best AI course, LLM101n. This is an undergraduate-level class that guides the student through training their own AI, very similar to a smaller version of the AI Teaching Assistant itself. The course materials will be available online, but we also plan to run both digital and physical cohorts of people going through it together.
Today, we are heads down building LLM101n, but we look forward to a future where AI is a key technology for increasing human potential. What would you like to learn?
---
@EurekaLabsAI is the culmination of my passion for both AI and education over ~2 decades. My interest in education took me from YouTube tutorials on Rubik's cubes to starting CS231n at Stanford, to my more recent Zero-to-Hero AI series, while my work in AI took me from academic research at Stanford to real-world products at Tesla and AGI research at OpenAI. All of my work combining the two so far has only been part-time, as side quests to my "real job", so I am quite excited to dive in and build something great, professionally and full-time.
It's still early days but I wanted to announce the company so that I can build publicly instead of keeping a secret that isn't. Outbound links with a bit more info in the reply!
New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"
Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.
We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
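For a quick concrete flavor of the two functions (not the from-scratch code built in the lecture), here is a round trip using the tiktoken library, assuming it is installed; the example strings are just illustrative:

```python
# Sketch: encode()/decode() round trip with a GPT-series BPE tokenizer,
# using the tiktoken library as a stand-in for the Tokenizer built in the lecture.
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # the BPE tokenizer used by GPT-2

ids = enc.encode("hello world")
print(ids)                  # a short list of integer token ids
print(enc.decode(ids))      # "hello world" -- decode() inverts encode()

# Many LLM quirks trace back to this stage: the "same" word tokenizes
# differently depending on leading whitespace, capitalization, etc.
print(enc.encode("hello"), enc.encode(" hello"), enc.encode("Hello"))
```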
Also, releasing new repository on GitHub: minbpe
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
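A rough sketch of the core training loop that the BPE algorithm boils down to (a simplified illustration, not the actual minbpe code):

```python
# Rough sketch of BPE training: repeatedly find the most frequent adjacent
# pair of ids and merge it into a new id. Not the actual minbpe code.
from collections import Counter

def train_bpe(text, num_merges):
    ids = list(text.encode("utf-8"))          # start from raw bytes (ids 0..255)
    merges = {}                                # (id, id) -> new merged id
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))     # count adjacent pairs
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)       # most frequent pair
        new_id = 256 + i
        merges[pair] = new_id
        # replace every occurrence of the pair with the new id
        out, j = [], 0
        while j < len(ids):
            if j < len(ids) - 1 and (ids[j], ids[j + 1]) == pair:
                out.append(new_id)
                j += 2
            else:
                out.append(ids[j])
                j += 1
        ids = out
    return merges

merges = train_bpe("aaabdaaabac", num_merges=3)  # toy example string
print(merges)
```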
I love reading technology prediction documents because the benefit of hindsight is training data. Here, 64 years ago, Licklider imagines computing as fundamentally an intelligence amplification tool. groups.csail.mit.edu/medg/people/ps…
Licklider argues that the period of "intelligence augmentation" (IA) may be transient on the path to full automation (AI), but still long enough to be worth thinking through and about.
His citations for what must have felt like rapid progress in both narrow AI and the AGI of that age (i.e. the "general problem solver" [20]) are today known to be false starts that were off track in a quite fundamental way: at the time, the approach was a manual process of encoding knowledge with predicate logic and using production rules of logic and search to manipulate it into conclusions. Today, most of AI is aware of all of this work only as a historical curiosity; it is not part of the "master branch" of the field, but stuck in a dead-end feature branch. And notably, what is considered today the most promising approach (LLMs) was at that time not only completely computationally inaccessible, but also impossible due to the lack of trillions of tokens of training data in digitized form. (What might be an equivalent of that today?)
The study by the Air Force, estimating that machines alone would be doing problem solving of military significance in 20 years' time, evokes a snicker today. Amusingly, "20 years away" seems to be a kind of codeword for "no idea, long time". Arguably, I'm not sure that we are there even today, 64 years later. Computers do a lot to increase situational awareness, but decision making of "military significance" afaik is still well within the domain of human computation.
An interesting observation from Licklider is that most of his "thinking" in a day-to-day computational task thought experiment is not so much thinking, but more a rote, mechanical, automatable data collection and visualization. It is this observation that leads him to conclude that the strengths and weaknesses of humans and computers are complementary: computers can do the busy work, and humans can do the thinking work. This has been the prevailing paradigm for the 64 years since, and it's only very recently (last ~year) that computers have started to make a dent into "thinking" in a general, scalable, and economy-impacting way. Not in an explicit, hard, predicate logic way, but in an implicit, soft, statistical way. Hence the LLM-driven AI summer of today.
Next frontier of prompt engineering imo: "AutoGPTs". 1 GPT call is just like 1 instruction on a computer. They can be strung together into programs. Use prompt to define I/O device and tool specs, define the cognitive loop, page data in and out of context window, .run().
Interesting non-obvious note on GPT psychology is that unlike people they are completely unaware of their own strengths and limitations. E.g. that they have finite context window. That they can just barely do mental math. That samples can get unlucky and go off the rails. Etc.
(so I'd expect the good prompts to explicitly address things like this)
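A minimal sketch of what such a cognitive loop could look like, assuming a generic llm() completion function and made-up tool names; note that the system prompt also spells out the model's limitations, per the point above:

```python
# Minimal sketch of an "AutoGPT"-style cognitive loop.
# llm() is a stand-in for any completion call; the tool spec and prompts are hypothetical.
SYSTEM = """You are an agent that solves tasks step by step.
Tools: search(query), calculator(expression).
Reply with exactly one line: TOOL <name> <arg> or FINAL <answer>.
Note your limitations: your context window is finite, so keep notes short,
and do not trust your own mental arithmetic -- use the calculator."""

def run(task, llm, tools, max_steps=10):
    scratchpad = []                                   # data we page in/out of the context window
    for _ in range(max_steps):
        prompt = SYSTEM + "\nTask: " + task + "\n" + "\n".join(scratchpad[-20:])
        action = llm(prompt).strip()                  # one GPT call ~= one "instruction"
        if action.startswith("FINAL"):
            return action[len("FINAL"):].strip()
        _, name, arg = action.split(" ", 2)           # e.g. "TOOL calculator 13*7"
        observation = tools[name](arg)                # run the tool, feed result back in
        scratchpad.append(f"{action}\nObservation: {observation}")
    return None
```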
More good read/discussion on psychology of LLMs. I don't follow in full but imo it is barking up the right tree w.r.t. a framework for analysis. lesswrong.com/posts/D7PumeYT…
A pretrained LLM is not an AI but a simulator, described by a statistical physics based on internet webpages. The system evolves given any initial conditions (prompt). To assign log probabilities to continuations, it internally maintains a probability distribution over what kind of document it is completing.
In particular, "good, aligned, conversational AI" is just one of many possible different rollouts. Finetuning / alignment tries to "collapse" and control the entropy to that region of the simulator. Jailbreak prompts try to knock the state into other logprob ravines.
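A toy illustration of this framing with a made-up next-token distribution (no real LLM involved): the same initial condition yields many different rollouts, and lowering the sampling temperature collapses the entropy toward one region:

```python
# Toy illustration (not a real LLM): same prompt, many rollouts;
# temperature controls how much the distribution "collapses".
import numpy as np

logits = np.array([2.0, 1.0, 0.5, -1.0])      # made-up next-rollout scores for one prompt
modes = ["helpful", "sarcastic", "rambling", "off-the-rails"]

def sample(temperature, n=5, seed=0):
    rng = np.random.default_rng(seed)
    p = np.exp(logits / temperature)
    p /= p.sum()
    return [modes[i] for i in rng.choice(len(modes), size=n, p=p)]

print(sample(temperature=1.0))   # diverse rollouts: many kinds of "documents"
print(sample(temperature=0.1))   # entropy collapsed: almost always the top mode
```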
This tweet went wide, thought I'd post some of the recent supporting articles that inspired it. 1/ GPT-3 paper showed that LLMs perform in-context learning, and can be "programmed" inside the prompt with input:output examples to perform diverse tasks arxiv.org/abs/2005.14165
2/ These two [1] arxiv.org/abs/2205.11916 , [2] arxiv.org/abs/2211.01910 are good examples that the prompt can further program the "solution strategy", and with a good enough design of it, a lot more complex multi-step reasoning tasks become possible.
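A small sketch combining the two ideas: the prompt is "programmed" with input:output examples, plus a solution-strategy suffix in the spirit of [1]; the examples and the llm() call are placeholders:

```python
# Sketch: in-context "programming" via few-shot examples plus a reasoning-strategy suffix.
# The examples are illustrative and llm() is a placeholder for any completion API.
few_shot = [
    ("I loved this movie", "positive"),
    ("What a waste of two hours", "negative"),
]

def build_prompt(query):
    lines = [f"Review: {x}\nSentiment: {y}" for x, y in few_shot]
    lines.append(f"Review: {query}\nLet's think step by step.\nSentiment:")
    return "\n\n".join(lines)

prompt = build_prompt("The plot was thin but the acting carried it")
print(prompt)   # feed this string to llm(prompt) and read the completion as the answer
```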
🔥 New (1h56m) video lecture: "Let's build GPT: from scratch, in code, spelled out."
We build and train a Transformer following the "Attention Is All You Need" paper in the language modeling setting and end up with the core of nanoGPT.
First ~1 hour is 1) establishing a baseline (bigram) language model, and 2) introducing the core "attention" mechanism at the heart of the Transformer as a kind of communication / message passing between nodes in a directed graph.
The second ~1hr builds up the Transformer: multi-headed self-attention, MLP, residual connections, layernorms. Then we train one and compare it to OpenAI's GPT-3 (spoiler: ours is around ~10K - 1M times smaller but the ~same neural net) and ChatGPT (i.e. ours is pretraining only).
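For reference, the core of the attention mechanism condenses to something like the following single head of causal self-attention (a sketch in PyTorch, not the exact lecture/nanoGPT code):

```python
# Single head of causal self-attention: each position (node) gathers information
# from previous positions (message passing on a directed graph). Sketch only.
import torch
import torch.nn.functional as F

B, T, C, H = 4, 8, 32, 16            # batch, time, embedding dim, head size
x = torch.randn(B, T, C)

key   = torch.nn.Linear(C, H, bias=False)
query = torch.nn.Linear(C, H, bias=False)
value = torch.nn.Linear(C, H, bias=False)

k, q, v = key(x), query(x), value(x)                     # (B, T, H)
wei = q @ k.transpose(-2, -1) / H**0.5                   # (B, T, T) affinities
mask = torch.tril(torch.ones(T, T))                      # causal: no peeking at the future
wei = wei.masked_fill(mask == 0, float("-inf"))
wei = F.softmax(wei, dim=-1)                             # weights over past positions
out = wei @ v                                            # (B, T, H) aggregated messages
print(out.shape)
```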