Can we devise a more tractable RL problem if we give the agent examples of successful outcomes (states, not demos)? In MURAL, we show that uncertainty-aware classifiers trained with (meta) NML make RL much easier. At #ICML2021 arxiv.org/abs/2107.07184
If the agent gets some examples of high-reward states, we can train a classifier to automatically provide shaped rewards (this is similar to methods like VICE). But the reward from a standard classifier is not necessarily well shaped.
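For concreteness, here is a minimal sketch of such a classifier-based reward (not the exact VICE recipe; the function name and the logistic-regression model are just illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def classifier_reward(success_states, visited_states):
    """Baseline (VICE-style) shaped reward: fit one success-vs-failure classifier
    on example success states vs. states the agent has visited, and use its
    success probability as the reward. Nothing forces this reward to be well
    shaped far from the success examples."""
    X = np.concatenate([success_states, visited_states])
    y = np.concatenate([np.ones(len(success_states)), np.zeros(len(visited_states))])
    clf = LogisticRegression().fit(X, y)
    return lambda state: clf.predict_proba(state[None])[0, 1]
```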
This is where the key idea in MURAL comes in: use normalized maximum likelihood (NML) to train a classifier that is aware of its own uncertainty. For each state, fit one classifier with that state labeled positive (success) and one with it labeled negative (failure), and use the ratio of their likelihoods as the reward!
This provides a natural exploration signal: novel states have higher uncertainty (so a reward closer to 50/50), while the reward is still shaped to be larger near the example success states. This turns out to be a great way to do "directed" exploration.
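A minimal sketch of the per-state NML computation just described, again with an illustrative logistic-regression classifier (MURAL uses neural network classifiers; this is only to show the likelihood ratio):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nml_reward(query_state, success_states, visited_states):
    """NML-style reward for one query state: refit the classifier twice,
    once with the query labeled success (1) and once labeled failure (0),
    and normalize the likelihood each refit assigns to its own label."""
    X = np.concatenate([success_states, visited_states])
    y = np.concatenate([np.ones(len(success_states)), np.zeros(len(visited_states))])

    likelihoods = []
    for label in (1, 0):
        clf = LogisticRegression().fit(
            np.concatenate([X, query_state[None]]),
            np.concatenate([y, [label]]),
        )
        # Probability the refit classifier assigns to the label it was fit with.
        likelihoods.append(clf.predict_proba(query_state[None])[0, label])

    p_pos, p_neg = likelihoods
    return p_pos / (p_pos + p_neg)   # ~0.5 for novel states, ~1 near success examples
```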
Doing this exactly is intractable, because it requires training two new classifiers for *every* state the agent visits. To make it efficient, we use meta-learning (MAML) to meta-train a single classifier that adapts to either label for any state in just a few gradient steps, which we call meta-NML.
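A rough PyTorch sketch of the adaptation step, assuming `meta_classifier` is a single-logit network already meta-trained with MAML (the meta-training loop itself is omitted, and the single inner step and learning rate are illustrative):

```python
import copy
import torch
import torch.nn.functional as F

def meta_nml_reward(meta_classifier, query_state, inner_lr=0.1):
    """Approximate the NML likelihood ratio with one fast adaptation step per
    label, assuming the classifier was meta-trained so that a single gradient
    step on a newly labeled point mimics the full NML refit."""
    query = query_state.unsqueeze(0)              # (1, state_dim)
    likelihoods = []
    for label in (1.0, 0.0):
        clf = copy.deepcopy(meta_classifier)      # adapt a throwaway copy
        opt = torch.optim.SGD(clf.parameters(), lr=inner_lr)
        loss = F.binary_cross_entropy_with_logits(clf(query), torch.tensor([[label]]))
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            p_success = torch.sigmoid(clf(query))
        # Likelihood of the label this copy was adapted to.
        likelihoods.append(p_success if label == 1.0 else 1.0 - p_success)
    p_pos, p_neg = likelihoods
    return (p_pos / (p_pos + p_neg)).item()
```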
This ends up working very well across a wide range of manipulation, dexterous hand, and navigation tasks. To learn more about NML in deep learning, you can also check out Aurick Zhou's excellent blog post on this topic here: bairblog.github.io/2020/11/16/acn…
Action chunking is a great idea in robotics: getting a model to produce a short sequence of actions _just works better_ for some mysterious reason. Now it turns out this can help in RL too, and it's a bit clearer why: action chunks help with exploration and with value backups. 🧵👇
The idea is very simple: train an actor and critic over action chunks (short sequences of actions). The setup is "offline to online": pretrain with offline RL on offline data, then run online exploration. It helps a lot (compare red line for QC vs blue lines for prior methods).
There are a few details that matter here (and perhaps that's why prior attempts to use action chunking with RL didn't work so well): (1) it really helps to add a "BC term" (behavior constraint) to keep the action chunks coherent with the offline data, a bit like learning primitives.
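Very roughly, the chunk-level losses might look something like this (a sketch, not the exact QC objective; the horizon, BC weight, and interfaces are illustrative):

```python
import torch

def actor_loss(actor, critic, obs, dataset_chunks, bc_weight=1.0):
    """Actor over action chunks: maximize the chunk-level Q-value while staying
    close to chunks from the offline data (the "BC term" mentioned above)."""
    pred_chunk = actor(obs)                          # (B, horizon * action_dim)
    bc_term = ((pred_chunk - dataset_chunks) ** 2).mean()
    return -critic(obs, pred_chunk).mean() + bc_weight * bc_term

def chunk_td_target(target_critic, actor, chunk_rewards, next_obs, discount=0.99):
    """TD target over an h-step chunk: one backup covers h environment steps,
    which is the "helps with backups" intuition."""
    h = chunk_rewards.shape[1]
    discounts = discount ** torch.arange(h, dtype=chunk_rewards.dtype)
    chunk_return = (chunk_rewards * discounts).sum(dim=1, keepdim=True)   # (B, 1)
    with torch.no_grad():
        next_q = target_critic(next_obs, actor(next_obs))                 # (B, 1)
    return chunk_return + (discount ** h) * next_q
```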
If you have a policy that uses diffusion/flow (e.g. a diffusion VLA), you can run RL where the actor chooses the noise, which is then denoised by the policy to produce an action. This approach, which we call diffusion steering (DSRL), gives a remarkably efficient RL method! 🧵👇
DSRL trains an actor and Q-function, treating the diffusion noise as the action space. Because samples from the noise prior map to reasonable actions for the policy, DSRL essentially explores "inside" the set of reasonable pre-trained behaviors, making it extremely efficient.
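A sketch of what this looks like, assuming the pretrained policy exposes a denoising call that maps (observation, noise) to an action (the interface names here are placeholders, not the actual DSRL code):

```python
import torch

@torch.no_grad()
def select_action(noise_actor, frozen_diffusion_policy, obs):
    """DSRL-style action selection: the RL actor outputs a latent noise vector,
    and the frozen pretrained diffusion/flow policy denoises it into an action."""
    w = noise_actor(obs)                                    # "action" in noise space
    return frozen_diffusion_policy.denoise(obs, noise=w)    # assumed interface

def critic_loss(q_net, target_q_net, noise_actor, batch, discount=0.99):
    """Standard TD learning, except the action stored and evaluated is the noise."""
    obs, w, reward, next_obs, done = batch
    with torch.no_grad():
        target = reward + discount * (1 - done) * target_q_net(next_obs, noise_actor(next_obs))
    return ((q_net(obs, w) - target) ** 2).mean()
```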
DSRL learns essentially in real time, with good results in as few as 50 trials (it's so efficient that a person can literally sit in front of the robot and push a button to assign sparse rewards).
Fun project at PI: knowledge insulation for VLAs. We figured out how to train VLAs with continuous actions much more effectively by insulating the VLM and training it with discrete actions, while the action expert learns on top. 5-7x faster training, and importantly way better language following 👇
The idea: when we train a VLA, we put a little “motor cortex” on top of the VLM (the “action expert”), but it is initialized from scratch, so its gradients mess up the VLM backbone. We can add a stop-gradient to prevent this, but we still need the VLM representations to adapt to the robot.
So what do we do? We add discrete action losses to the VLM backbone to get good representations, even as the action expert learns on top! This makes it train way faster, preserves web-scale knowledge, and improves performance.
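A schematic of how the two losses could be combined under those assumptions (the backbone returning both features and action-token logits is my simplification, not the exact interface from the paper):

```python
import torch
import torch.nn.functional as F

def knowledge_insulation_loss(vlm_backbone, action_expert, batch, w_discrete=1.0):
    """Sketch: the VLM backbone gets a discrete action-token loss so its
    representations adapt to the robot, while the continuous action expert
    trains on stop-gradded features, so its (initially noisy) gradients never
    reach the backbone."""
    feats, token_logits = vlm_backbone(batch["images"], batch["prompt_tokens"])

    # Discrete action loss applied to the backbone itself.
    discrete_loss = F.cross_entropy(
        token_logits.flatten(0, 1), batch["action_tokens"].flatten()
    )

    # Continuous action loss on the expert, behind a stop-gradient ("insulation").
    continuous_loss = action_expert.loss(feats.detach(), batch["continuous_actions"])

    return continuous_loss + w_discrete * discrete_loss
```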
Can VLMs enable robots to autonomously improve? In our new work we ran a fleet of robot arms to collect autonomous data with VLM-proposed tasks and showed that robots can keep getting better as they are deployed, without supervision:
The idea: use a VLM to propose a semantic task to attempt, use a diffusion model to synthesize an image of that task completed, feed this image as a goal to a goal-conditioned policy, and then improve the goal-conditioned policy from the resulting experience.
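In pseudocode, one round of this loop might look like the sketch below; every object and method name here is a hypothetical placeholder, just to show how the pieces fit together:

```python
def autonomous_improvement_step(vlm, goal_diffusion, policy, robot, replay_buffer):
    """One round of the self-improvement loop described above: propose a task,
    imagine its goal image, attempt it with the goal-conditioned policy, and
    add the experience for policy training (no human labels anywhere)."""
    obs = robot.get_observation()
    task_text = vlm.propose_task(obs["image"])                # e.g. "put the sponge in the bowl"
    goal_image = goal_diffusion.synthesize(obs["image"], task_text)
    trajectory = robot.rollout(policy, goal=goal_image)
    replay_buffer.add(trajectory, goal_image)                 # hindsight relabeling is also possible
    policy.update(replay_buffer.sample())                     # self-supervised improvement
```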
This works very well because the goal-conditioned policy can self-improve without any human supervision, while the VLM and diffusion model leverage Internet-scale pretraining. So every component either improves through self-supervision or benefits from pretraining (or both).
Pretraining on large datasets is powerful: it enables learning new tasks quickly (as with BERT, LLMs, etc.). Can we do the same for RL, i.e., pretrain and then rapidly finetune to new tasks? Scaled Q-Learning aims to unlock this ability, now on the @GoogleAI blog: ai.googleblog.com/2023/02/pre-tr…
👇
The idea is very simple: pretrain a large ResNet-based Q-function network with conservative Q-learning (CQL), plus several design decisions to make sure it learns at scale. Pretrain on ~40 games with highly suboptimal data, then finetune to new games with offline or online data.
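The core CQL loss in the discrete-action (Atari) setting is roughly the following sketch; the additional design decisions mentioned above are omitted, and the hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, alpha=0.1, discount=0.99):
    """Conservative Q-learning for discrete actions: a standard TD error plus a
    penalty that keeps Q-values on unseen actions below those of dataset actions."""
    obs, action, reward, next_obs, done = batch
    q_values = q_net(obs)                                        # (B, num_actions)
    q_taken = q_values.gather(1, action.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        target = reward + discount * (1 - done) * target_q_net(next_obs).max(dim=1).values
    td_loss = F.mse_loss(q_taken, target)

    # Conservative term: logsumexp over all actions minus the dataset action's Q.
    conservative = (torch.logsumexp(q_values, dim=1) - q_taken).mean()
    return td_loss + alpha * conservative
```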
Performance on the training games is very good, even from highly suboptimal data. With near-optimal data, this outperforms non-Q-learning methods (e.g., BC, decision transformers) even against models 2.5x bigger (DT 200M); on suboptimal data it gets more than double the score!
General Navigation Models (GNM) are general-purpose navigation backbones that can drive many robots. It turns out that simple goal-conditioned policies can be trained on multi-robot datasets and generalize zero-shot to entirely new robots!
The GNM architecture we use is simple: a model that takes in the current image, a goal image, and a temporal context (a stack of recent frames) that tells the model how the robot behaves (which it uses to infer the robot's size, dynamics, etc.). With a topological graph, this lets it drive the robot.
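A minimal sketch of that interface (the encoders, heads, and dimensions here are simplified stand-ins, not the exact GNM architecture):

```python
import torch
import torch.nn as nn

class GNMSketch(nn.Module):
    """Embed the current observation plus a short temporal context, embed the
    goal image, and predict relative waypoints and a distance-to-goal estimate
    that can be used for topological graph search."""
    def __init__(self, context_len=5, embed_dim=256, num_waypoints=5):
        super().__init__()
        self.obs_encoder = nn.Sequential(           # current image + context stack
            nn.Conv2d(3 * (context_len + 1), 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim),
        )
        self.goal_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim),
        )
        self.waypoint_head = nn.Linear(2 * embed_dim, num_waypoints * 2)  # (x, y) offsets
        self.distance_head = nn.Linear(2 * embed_dim, 1)

    def forward(self, obs_stack, goal_image):
        z = torch.cat([self.obs_encoder(obs_stack), self.goal_encoder(goal_image)], dim=-1)
        return self.waypoint_head(z), self.distance_head(z)
```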
The key is that the GNM is trained on data from many robots: big vehicles (ATVs, etc.), small ground robots, even little RC cars. All data is treated the same way: the model just learns to directly generalize over robot types, learning general navigational skills.