Deep RL is hard: lots of hparam tuning, instability. Perhaps there is a reason for this? Turns out the same thing that makes supervised deep learning work well makes deep RL work poorly, leading to feature vectors that grow out of control: arxiv.org/abs/2112.04716
Let me explain:
Simple test: compare offline SARSA vs offline TD. Both back up the behavior policy, but SARSA uses the actions stored in the dataset, while TD samples *new* actions (from the same distribution!). Top plot is phi(s,a)·phi(s',a'): the dot product of current & next features.
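To make the comparison concrete, here is a minimal sketch of the diagnostic in PyTorch, assuming a Q-network that exposes its penultimate-layer features (the `q_net.features` and `behavior_policy` interfaces are hypothetical stand-ins, not the paper's code):

```python
import torch

def feature_dot_product(q_net, batch, behavior_policy, use_dataset_action):
    phi = q_net.features(batch.s, batch.a)             # phi(s, a)
    if use_dataset_action:                             # offline SARSA backup
        a_next = batch.a_next                          # action stored in the dataset
    else:                                              # offline TD backup
        a_next = behavior_policy.sample(batch.s_next)  # fresh sample, same distribution
    phi_next = q_net.features(batch.s_next, a_next)    # phi(s', a')
    return (phi * phi_next).sum(dim=-1).mean()         # avg dot product over the batch
```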
Well, that's weird. Why do TD feature dot products grow and grow until the method gets unstable, while SARSA stays flat? To understand this, we must understand implicit regularization, which makes overparam models like deep nets avoid overfitting.
When training with SGD, deep nets don't find just *any* solution, but a well-regularized one: SGD finds lower-norm solutions that then generalize well (see the derived regularizer below). We might expect the same to happen when training with RL, and hence that deep RL will work well.
But if we apply the same analysis to TD learning that explains supervised learning, we can derive what the "implicit regularizer" for deep RL looks like. And it's not pretty: the first term looks like the supervised one, but the second one blows up feature dot products, just like we see in practice!
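To see the shape of this derived regularizer, here's a heavily hedged schematic (constants, weighting matrices, and the paper's exact expression are elided; treat this as the structure, not the precise result):

$$\mathcal{R}_{\mathrm{TD}}(\theta) \;\approx\; \underbrace{\sum_i \big\|\nabla_\theta Q(s_i, a_i)\big\|^2}_{\text{supervised-like term}} \;-\; c \sum_i \nabla_\theta Q(s_i, a_i)^\top \nabla_\theta Q(s_i', a_i')$$

Minimizing this pushes the gradient dot products in the second term *up*, and with a linear last layer, $\nabla_w Q(s,a) = \phi(s,a)$, so those are exactly the feature dot products $\phi(s,a)^\top \phi(s',a')$ in the plot above.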
In practice, we can simply add some *explicit* regularization on the features to counteract this nasty implicit regularizer. We call this DR3. It simply minimizes these feature dot products. Unlike normal regularizers, DR3 actually *increases* model capacity!
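As a concrete illustration, here is a minimal sketch of adding a DR3-style penalty to a standard TD loss in PyTorch (the `q_net.features` API and the weight `c0` are illustrative, not the authors' exact code):

```python
import torch
import torch.nn.functional as F

def td_loss_with_dr3(q_net, target_net, batch, gamma=0.99, c0=0.001):
    q = q_net(batch.s, batch.a)
    with torch.no_grad():  # a' here comes from the base method's backup policy
        target = batch.r + gamma * (1 - batch.done) * target_net(batch.s_next, batch.a_next)
    td = F.mse_loss(q, target)

    phi = q_net.features(batch.s, batch.a)                  # phi(s, a)
    phi_next = q_net.features(batch.s_next, batch.a_next)   # phi(s', a')
    dr3 = (phi * phi_next).sum(dim=-1).mean()               # explicit dot-product penalty

    return td + c0 * dr3
```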
We can simply add DR3 to standard offline RL algorithms and boost their performance, basically without any other modification. We hope that further research on overparameterization in deep RL can shed more light about why deep RL is unstable and how we can fix it!
Can VLMs enable robots to autonomously improve? In our new work we ran a fleet of robot arms to collect autonomous data with VLM-proposed tasks and showed that robots can keep getting better as they are deployed, without supervision:
The idea: use VLMs to propose semantic tasks that the scene affords, use a diffusion model to synthesize an image of the proposed task completed, treat this image as a goal for a goal-conditioned policy, and then improve that policy from the resulting experience (sketch below).
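A high-level sketch of the self-improvement loop; the component interfaces here (`vlm.propose_tasks`, `diffusion.edit`, `policy.rollout`, `buffer`) are hypothetical stand-ins for the actual system:

```python
import random

def autonomous_improvement_step(obs, vlm, diffusion, policy, buffer):
    tasks = vlm.propose_tasks(obs.image)               # semantic tasks the scene affords
    task = random.choice(tasks)
    goal_image = diffusion.edit(obs.image, task)       # imagine the task completed
    trajectory = policy.rollout(obs, goal=goal_image)  # try to reach the imagined goal
    buffer.add(trajectory, goal=goal_image)
    policy.update(buffer)                              # goal-conditioned policy self-improves
```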
This works very well because the goal-conditioned policy can self-improve without any human supervision, while the VLM and diffusion model leverage Internet-scale pretraining. So every component either improves through self-supervision or benefits from pretraining (or both).
Pretraining on large datasets is powerful: it enables learning new tasks quickly (e.g., BERT, LLMs). Can we do the same for RL: pretrain, then finetune rapidly to new tasks? Scaled Q-Learning aims to unlock this ability, now on @GoogleAI blog: ai.googleblog.com/2023/02/pre-tr…
👇
The idea is very simple: pretrain a large ResNet-based Q-function network with conservative Q-learning (CQL), plus several design decisions to ensure it learns at scale. Pretrain on ~40 games with highly suboptimal data, then finetune to new games with offline or online data.
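For reference, a minimal sketch of the CQL objective for discrete actions (assuming `q_net(s)` returns Q-values for all actions; the names and the weight `alpha` are illustrative):

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_net, batch, gamma=0.99, alpha=1.0):
    q_all = q_net(batch.s)                                   # [B, num_actions]
    q_data = q_all.gather(1, batch.a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = batch.r + gamma * (1 - batch.done) * target_net(batch.s_next).max(dim=1).values
    td = F.mse_loss(q_data, target)
    # Conservative term: push down Q on all actions, push up Q on dataset actions.
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td + alpha * conservative
```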
Performance on the training games is very good, even from highly suboptimal data. With near-optimal data, this outperforms non-Q-learning methods (e.g., BC, decision transformers), even vs. models 2.5x bigger (DT 200M); on suboptimal data it gets more than double the score!
General Navigation Models (GNM) are general-purpose navigation backbones that can drive many robots. It turns out that simple goal-conditioned policies can be trained on multi-robot datasets and generalize in zero-shot to entirely new robots!
The GNM architecture we use is simple: a model that takes in a current image, a goal image, and a temporal context (a stack of past frames) that tells the model how the robot behaves (which it uses to infer the robot's size, dynamics, etc.). With a topological graph, this lets it drive the robot.
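A minimal sketch of this input/output structure, with hypothetical module names (the real model's encoder and action parameterization differ; the encoder is assumed to map each image to a `hidden`-dim feature vector):

```python
import torch
import torch.nn as nn

class GNMPolicy(nn.Module):
    def __init__(self, encoder, hidden=256, context_len=5):
        super().__init__()
        self.encoder = encoder                 # shared image encoder
        self.head = nn.Sequential(
            nn.Linear((context_len + 2) * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),              # illustrative: waypoint (dx, dy) + distance
        )

    def forward(self, obs_image, goal_image, context_stack):
        # context_stack: past frames showing how this robot moves, from which
        # the model infers embodiment (size, dynamics) implicitly.
        feats = [self.encoder(obs_image), self.encoder(goal_image)]
        feats += [self.encoder(frame) for frame in context_stack]
        return self.head(torch.cat(feats, dim=-1))
```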
The key is that the GNM is trained on data from many robots: big vehicles (ATVs, etc.), small ground robots, even little RC cars. All data is treated the same way: the model just learns to directly generalize over robot types, learning general navigational skills.
What do Lyapunov functions, offline RL, and energy based models have in common? Together, they can be used to provide long-horizon guarantees by "stabilizing" a system in high density regions! That's the idea behind Lyapunov Density Models: sites.google.com/berkeley.edu/l…
A thread:
Basic question: if I learn a model (e.g., dynamics model for MPC, value function, BC policy) on data, will that model be accurate when I run it (e.g., to control my robot)? It might be wrong if I go out of distribution; LDMs aim to provide a constraint that keeps the policy from doing this.
By analogy (which we can make precise!): Lyapunov functions tell us how to stabilize around a point in space (i.e., x=0). What if what we want is to stabilize in high-density regions (i.e., p(s) >= eps)? Both require considering long-horizon outcomes, so we can't just be greedy!
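To make the analogy concrete, here's a hedged sketch of how an LDM-style constraint could gate action selection; `G` stands in for the learned Lyapunov density model (trained Bellman-style, so a pointwise check on G certifies long-horizon density), and `reward_model`, `eps` are illustrative:

```python
def constrained_action(s, candidate_actions, G, reward_model, eps):
    # Keep only actions the LDM certifies as staying in high-density regions.
    safe = [a for a in candidate_actions if G(s, a) >= eps]
    if not safe:
        # Fall back to the most in-distribution action available.
        return max(candidate_actions, key=lambda a: G(s, a))
    return max(safe, key=lambda a: reward_model(s, a))
```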
NLP and offline RL are a perfect fit, enabling large language models to be trained to maximize rewards for tasks such as dialogue and text generation. We describe how ILQL can make this easy in our new paper: sea-snell.github.io/ILQL_site/
We might want RL in many places in NLP: goal-directed dialogue, synthesizing text that fulfills subjective user criteria, solving word puzzles. But online RL is hard if we need to actively interact with a human (takes forever, annoying). Offline RL can learn from human data alone!
Implicit Q-learning (IQL) provides a particularly convenient method for offline RL for NLP, with a training procedure that is very close to supervised learning, but with the addition of rewards in the loss. Our full method slightly modifies IQL w/ a CQL term and smarter decoding.
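A minimal sketch of the IQL-style losses that make this look like supervised learning (expectile regression for V, plain regression for Q; `tau` is the expectile, names are illustrative, and the paper's CQL term and decoding tweaks are omitted):

```python
import torch
import torch.nn.functional as F

def iql_losses(q_net, v_net, target_q, batch, gamma=0.99, tau=0.7):
    # V loss: expectile regression toward Q(s, a) on dataset actions.
    with torch.no_grad():
        q = target_q(batch.s, batch.a)
    u = q - v_net(batch.s)
    weight = torch.abs(tau - (u < 0).float())   # expectile weight |tau - 1(u<0)|
    v_loss = (weight * u.pow(2)).mean()

    # Q loss: ordinary regression toward r + gamma * V(s').
    with torch.no_grad():
        target = batch.r + gamma * (1 - batch.done) * v_net(batch.s_next)
    q_loss = F.mse_loss(q_net(batch.s, batch.a), target)
    return v_loss, q_loss
```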
Deep nets can be overconfident (and wrong) on unfamiliar inputs. What if we directly teach them to be less confident? The idea in RCAD ("Adversarial Unlearning") is to generate images that are hard, and teach the model to be uncertain on them: arxiv.org/abs/2206.01367
A thread:
The idea is to use *very* aggressive adversarial training: generate junk images for which the model predicts the wrong label, then train the model to minimize its confidence on them. Since we don't need "true" labels for these images, we can take *much* bigger steps than standard adversarial training.
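Here's a hedged sketch of what one RCAD-style training step could look like (the step size and loss weight are illustrative; the paper's exact recipe differs in details):

```python
import torch
import torch.nn.functional as F

def rcad_loss(model, x, y, step_size=1.0, lam=0.5):
    # Standard cross-entropy on the clean batch.
    clean_loss = F.cross_entropy(model(x), y)

    # One *very* large adversarial ascent step produces "junk" images.
    x_adv = x.clone().detach().requires_grad_(True)
    adv_ce = F.cross_entropy(model(x_adv), y)
    grad, = torch.autograd.grad(adv_ce, x_adv)
    x_adv = (x_adv + step_size * grad.sign()).detach()

    # Unlearn: maximize predictive entropy (minimize confidence) on them.
    probs = F.softmax(model(x_adv), dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    return clean_loss - lam * entropy
```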
This leads to improved generalization performance on the test set, and can be readily combined with other methods for improving performance. It works especially well when training data is more limited.