Why does ChatGPT work so well? Is it “just scaling up GPT-3” under the hood? In this 🧵, let’s discuss the “Instruct” paradigm, its deep technical insights, and a big implication: “prompt engineering” as we know it is likely to disappear soon:👇
The original GPT-3 was trained by a minimalistic objective: predict the next word on a massive text corpus. Many abilities magically emerge, such as reasoning, coding, translation. You can even do “few-shot learning”: define new tasks by providing I/O examples in context. 1/
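To make “few-shot learning” concrete, here is a minimal sketch of how I/O examples can be packed into a single context. The `Input:`/`Output:` labels and the `build_few_shot_prompt` helper are illustrative assumptions, not a format GPT-3 requires:

```python
def build_few_shot_prompt(examples, query):
    """Format (input, output) pairs followed by an unanswered query."""
    lines = []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model's next-word prediction fills this in
    return "\n".join(lines)


# Hypothetical English-to-French task defined purely by examples.
prompt = build_few_shot_prompt([("cheese", "fromage"), ("house", "maison")], "cat")
print(prompt)
```

The task is never stated explicitly. The model only sees a pattern and is asked to continue it.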
It’s not at all obvious why simply predicting the next word can give us such abilities. One intuitive explanation is to imagine a detective story. Suppose the model needs to fill in the last blank: “the murderer is ___”, then it has to do deep reasoning to answer correctly. 2/
But this is not enough. In practice, we have to coax GPT-3 to autocomplete what we desire by carefully curating the examples, wording, and structure. This is exactly “prompt engineering”, where users have to practice the awkward and sometimes nonsensical vernacular of LLMs. 3/
Prompt engineering is a BUG🐞, not a feature! It’s caused by the fundamental *misalignment* between the next-word objective and the actual user intent in real applications. Example: you want GPT-3 to “Explain the moon landing to a 6yo”. It replies like a drunk parrot🦜: 4/
Prompt engineering is even worse in DALLE2 and Stable Diffusion. Just go to lexica.art and see how insane some prompts are. My favorite is the “parentheses trick” - adding (((…))) sometimes gives you better images 😅. It’s both hilarious and embarrassing. 5/
ChatGPT and the base model InstructGPT address the plague in an elegant way. The key observation is that alignment is very hard to capture from in-the-wild data. Humans must be in the loop to help tutor GPT, and GPT will be able to ask better questions as it improves. 6/
There are 3 steps. The first is very straightforward: just collect a dataset of human-written answers to prompts that users submit, and finetune GPT by supervised learning. It’s easiest but also the most costly: it could be slow and painful for humans to write long responses. 7/
Step 2 is much more interesting. GPT is asked to *propose* a few different answers, and all a human annotator needs to do is *rank* the responses from most desirable to least. Using these labels, we can train a reward model that captures human *preferences*. 8/
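A minimal sketch of how a ranking can become a training signal for the reward model. It assumes the common pairwise formulation, where every (preferred, less preferred) pair contributes -log sigmoid(r_preferred - r_other). As far as I know this matches the InstructGPT setup, but treat the function below as an illustration, not OpenAI’s code:

```python
import math
from itertools import combinations


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ranking_loss(scores_ranked):
    """scores_ranked: reward-model scores for one prompt's responses,
    ordered from the annotator's most preferred to least preferred."""
    pairs = list(combinations(range(len(scores_ranked)), 2))
    total = 0.0
    for better, worse in pairs:
        diff = scores_ranked[better] - scores_ranked[worse]
        # -log(sigmoid(diff)) == softplus(-diff)
        total += softplus(-diff)
    return total / len(pairs)


# Scores that agree with the human ranking give a low loss...
agree = pairwise_ranking_loss([2.0, 1.0, 0.0])
# ...and scores that contradict it give a high loss.
disagree = pairwise_ranking_loss([0.0, 1.0, 2.0])
print(agree, disagree)
```

Note that a single ranking of K responses yields K*(K-1)/2 comparison pairs, which is part of why ranking is such a cheap label.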
In reinforcement learning (RL), the reward function is typically hardcoded, such as the game score in Atari games. ChatGPT’s data-driven reward model is a powerful idea. Another example is our recent MineDojo work that learns reward from tons of Minecraft YouTube videos: 9/
Step 3: treat GPT as a policy and optimize it by RL against the learned reward. PPO is chosen as a simple and effective training algorithm. Now that GPT is better aligned, we can rinse and repeat steps 2-3 to improve it continuously. It’s like CI for LLMs! 10/
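For intuition, here is a sketch of the two ingredients usually associated with this step: PPO’s clipped surrogate objective, and a reward that subtracts a KL-style penalty so the policy does not drift too far from the supervised model. The `clip_eps` and `beta` values are arbitrary assumptions for illustration, and the per-sample scalars stand in for what would be per-token tensors in practice:

```python
import math


def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    """Reward-model score minus a penalty for moving away from the reference model."""
    return rm_score - beta * (logp_policy - logp_ref)


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Average clipped surrogate objective (to be maximized)."""
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic value so large policy jumps earn no extra credit.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# The ratio is 2.0 here, but with a positive advantage the gain is capped at 1 + clip_eps.
capped = ppo_clipped_objective([math.log(2.0)], [0.0], [1.0])
print(capped)
```

The clipping is what makes PPO “simple and effective”: it keeps each update close to the policy that generated the data without needing a second-order method.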
This is the “Instruct” paradigm - a super effective way to do alignment, as evident in ChatGPT’s mindblowing demos. The RL part also reminds me of the famous P=NP (or ≠) problem: it tends to be much easier to verify a solution than to actually solve the problem from scratch. 11/
Similarly, humans can quickly assess the quality of GPT’s output, but it’s much harder and cognitively taxing to write out a full solution. InstructGPT exploits this fact to lower the manual labeling cost significantly, making it practical to scale up the model CI pipeline. 12/
Another interesting connection is that the Instruct training looks a lot like GANs. Here ChatGPT is the generator and the reward model (RM) is the discriminator. ChatGPT tries to fool the RM, while the RM learns to detect the alien with human help. The game converges when the RM can no longer tell. 13/
Model alignment with user intent is also making its way to image generation! There are some preliminary works, such as arxiv.org/abs/2211.09800. Given the explosive AI progress, how long will it take to have an Instruct- or Chat-DALLE that feels like talking to a real artist? 14/
So folks, enjoy prompt engineering while it lasts! It’s an unfortunate historical artifact - a bit like alchemy🧪, neither art nor science. Soon it will just be “prompt writing” - my grandma can get it right on her first try. No more magic incantations to coerce the model. 15/
Of course, ChatGPT is not perfect enough to completely eliminate prompt engineering for now, but it is an unstoppable force. Meanwhile, the model has other serious syndromes: hallucination & habitual BS. I covered this in another thread: 16/
Further reading: reward model also has scaling laws: arxiv.org/abs/2210.10760! Also the RM is only an imperfect proxy (unlike Atari), so it’s a bad idea to over-optimize. This paper is from @johnschulman2, inventor of PPO. Super interesting work but went under the radar. 18/
There are also other artifacts caused by the misalignment problem, such as prompt hacking or “injection”. I actually like this one because it allows us to bypass OpenAI’s prompt prefix and fully unleash the model 😆. See @goodside’s cool findings: 19/
I don’t know if we live in a Matrix, but I know for sure that robots will spend most of their lives in simulation. Let machines train machines. I’m excited to introduce DexMimicGen, a massive-scale synthetic data generator that enables a humanoid robot to learn complex skills from only a handful of human demonstrations. Yes, as few as 5!
DexMimicGen addresses the biggest pain point in robotics: where do we get data? Unlike with LLMs, where vast amounts of text are readily available, you cannot simply download motor control signals from the internet. So researchers teleoperate the robots to collect motion data via XR headsets. They have to repeat the same skill over and over and over again, because neural nets are data hungry. This is a very slow and uncomfortable process.
At NVIDIA, we believe the majority of high-quality tokens for robot foundation models will come from simulation.
What DexMimicGen does is trade GPU compute time for human time. It takes one motion trajectory from a human and multiplies it into 1000s of new trajectories. A robot brain trained on this augmented dataset will generalize far better in the real world.
Think of DexMimicGen as a learning signal amplifier. It maps a small dataset to a large (de facto infinite) dataset, using physics simulation in the loop. In this way, we free humans from babysitting the bots all day.
The future of robot data is generative.
The future of the entire robot learning pipeline will also be generative. 🧵
Here’s one example: imagine asking a human to repeat this task 1000s of times to gather enough data variations — they’d be bored out of their mind. Just ask a simulator to do the hard work!!
2/🧵
Real world experiments on a humanoid robot at GEAR Lab, NVIDIA HQ.
Exciting updates on Project GR00T! We discovered a systematic way to scale up robot data, tackling the biggest pain point in robotics. The idea is simple: a human collects demonstrations on a real robot, and we multiply that data 1000x or more in simulation. Let’s break it down:
1. We use Apple Vision Pro (yes!!) to give the human operator first person control of the humanoid. Vision Pro parses human hand pose and retargets the motion to the robot hand, all in real time. From the human’s point of view, they are immersed in another body, like in Avatar. Teleoperation is slow and time-consuming, but we can afford to collect a small amount of data.
2. We use RoboCasa, a generative simulation framework, to multiply the demonstration data by varying the visual appearance and layout of the environment. In Jensen’s keynote video below, the humanoid is now placing the cup in hundreds of kitchens with a huge diversity of textures, furniture, and object placement. We only have 1 physical kitchen at the GEAR Lab in NVIDIA HQ, but we can conjure up infinite ones in simulation.
3. Finally, we apply MimicGen, a technique to multiply the above data even more by varying the *motion* of the robot. MimicGen generates a vast number of new action trajectories based on the original human data, and filters out failed ones (e.g. those that drop the cup) to form a much larger dataset.
To sum up, given 1 human trajectory with Vision Pro
-> RoboCasa produces N (varying visuals)
-> MimicGen further augments to NxM (varying motions).
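The multiplication above can be sketched as a toy pipeline. Everything here is a stand-in: `vary_scenes`, `generate_motions`, and the random `success` flag are hypothetical placeholders, not the RoboCasa or MimicGen APIs, and the 70% success rate is made up. The point is only the shape of the data flow: N scenes x M motions, then a success filter.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    scene_id: int     # which simulated kitchen (stand-in for a RoboCasa scene)
    motion_id: int    # which motion variant (stand-in for a MimicGen rollout)
    success: bool     # did the simulated rollout complete the task?


def vary_scenes(n_scenes):
    """Stand-in for visual/layout variation: one id per generated scene."""
    return list(range(n_scenes))


def generate_motions(scene_id, m_motions, rng, success_rate=0.7):
    """Stand-in for motion variation; success is random in this toy version."""
    return [
        Trajectory(scene_id, k, rng.random() < success_rate)
        for k in range(m_motions)
    ]


def multiply_demo(n_scenes, m_motions, seed=0):
    rng = random.Random(seed)
    candidates = [
        traj
        for scene in vary_scenes(n_scenes)
        for traj in generate_motions(scene, m_motions, rng)
    ]
    # Keep only rollouts that succeeded (e.g. did not drop the cup).
    kept = [traj for traj in candidates if traj.success]
    return candidates, kept


candidates, kept = multiply_demo(n_scenes=10, m_motions=20)
print(len(candidates), len(kept))
```

Because failed rollouts are filtered out, the final dataset is at most N x M per human trajectory. Compute buys the rest.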
This is the way to trade compute for expensive human data by GPU-accelerated simulation. A while ago, I mentioned that teleoperation is fundamentally not scalable, because we are always limited by 24 hrs/robot/day in the world of atoms. Our new GR00T synthetic data pipeline breaks this barrier in the world of bits.
Scaling has been so much fun for LLMs, and it's finally our turn to have fun in robotics! We are building tools to enable everyone in the ecosystem to scale up with us. Links in thread:
RoboCasa: our generative simulation framework. It's fully open-source! Here you go:
MimicGen: our generative action framework @AjayMandlekar. The code is open-source for robot arms, but we will have another version for humanoid and 5-finger hands.
Career update: I am co-founding a new research group called "GEAR" at NVIDIA, with my long-time friend and collaborator Prof. @yukez. GEAR stands for Generalist Embodied Agent Research.
We believe in a future where every machine that moves will be autonomous, and robots and simulated agents will be as ubiquitous as iPhones. We are building the Foundation Agent — a generally capable AI that learns to act skillfully in many worlds, virtual and real.
2024 is the Year of Robotics, the Year of Gaming AI, and the Year of Simulation. We are setting out on a moon-landing mission, and getting there will spin off mountains of learnings and breakthroughs.
Here's a highlight thread on the exciting research that we spearheaded!
Eureka: GPT-4 writes reward functions to teach a 5-finger robot hand how to do pen spinning tricks better than I can. Trained with GPU-accelerated physics simulation at 1000x faster than real-time!
Voyager: the first LLM-powered agent that plays Minecraft proficiently. Voyager bootstraps its own capabilities as it explores the open-ended world continuously.
What did I tell you a few days ago? 2024 is the year of robotics. Mobile-ALOHA is open-source robot hardware that can do dexterous, bimanual tasks like cooking a meal (with human teleoperation). Very soon, hardware will no longer bottleneck us on the quest for human-level, generally capable robots. The brain will be.
This work was done by 3 researchers on an academic budget. What an incredible job! Stanford rocks! Congrats to @zipengfu @tonyzzhao @chelseabfinn
Academia is no longer the place for the biggest frontier LLMs, simply because of resource constraints. But robotics levels the playing field a bit between academia and industry, at least in the near term. More affordable hardware is the inevitable trend. Advice for aspiring PhD students: embrace robotics - less crowded, more impactful.
I confirmed with friends on the team that they did not speed up the video. Having such smooth motions in real time, especially in hand dexterity, will unlock LOTS of new capabilities down the road. Regardless of how well you train the model in the world of bits, slow and unreliable hardware will always be the fundamental bottleneck in the world of atoms.
The tactile sensing on fingers is the obvious right path forward. Now we can train truly multimodal robot transformers that take in text, video, audio, touch, proprioception (position, orientation, motion sensing) and some day, even smell and taste. The output is humanoid motor controls.
Can Optimus spin pens? Someone please try out our Eureka method and let me know? @Tesla_Optimus 👏
Btw, this is Eureka from my team at NVIDIA Research!
This is the coolest Diffusion work I've seen in a while! It generates Visual Anagrams, a type of optical illusion where an image looks like one thing, but changes appearance when transformed.
It works with any orthogonal transformation matrices, which luckily include rotation, permutation (jigsaw puzzles), and color negation.
Intuitively, at each diffusion step the method estimates the noise for multiple transformed views of the image (each with a different text prompt), maps each estimate back to the original view by inverting the transform, and then averages them. After taking a diffusion step with the averaged noise, the resulting image becomes an anagram that aligns with the texts in different views. It adds very little computation and uses a pre-trained diffusion model.
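Here is a toy sketch of that averaging step, based on my reading of the method. The image is a tiny 2D list, the two prompts are made up, and `toy_predictor` is a placeholder that just echoes its input. It is NOT a diffusion model, only a way to show how noise estimates from different views get aligned and averaged:

```python
def identity(img):
    return [list(row) for row in img]


def rotate_cw(img):
    """Rotate 90 degrees clockwise (a pixel permutation, hence orthogonal)."""
    return [list(row) for row in zip(*img[::-1])]


def rotate_ccw(img):
    """Inverse of rotate_cw."""
    return [list(row) for row in zip(*img)][::-1]


def averaged_noise(image, views, predict_noise):
    """views: list of (forward_transform, inverse_transform, prompt)."""
    aligned = []
    for forward, inverse, prompt in views:
        eps = predict_noise(forward(image), prompt)  # estimate in the transformed view
        aligned.append(inverse(eps))                 # map back to the original view
    rows, cols = len(image), len(image[0])
    return [
        [sum(a[r][c] for a in aligned) / len(aligned) for c in range(cols)]
        for r in range(rows)
    ]


def toy_predictor(view_image, prompt):
    # Placeholder for a text-conditioned noise predictor; ignores the prompt.
    return view_image


views = [
    (identity, identity, "an old man"),          # hypothetical prompt
    (rotate_cw, rotate_ccw, "a snowy village"),  # hypothetical prompt
]
image = [[1.0, 2.0], [3.0, 4.0]]
result = averaged_noise(image, views, toy_predictor)
print(result)
```

The transforms need to be invertible so that estimates can be mapped back to a common frame, and orthogonal ones like rotations and permutations have the extra benefit of keeping Gaussian noise Gaussian.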
Simple, elegant, and inexpensive technique for non-professionals to create some interesting art!