Sergey Levine
Jul 18, 2021
Can we devise a more tractable RL problem if we give the agent examples of successful outcomes (states, not demos)? In MURAL, we show that uncertainty-aware classifiers trained with (meta) NML make RL much easier. At #ICML2021
arxiv.org/abs/2107.07184

A (short) thread:
The website has a summary: sites.google.com/view/mural-rl

If the agent gets some examples of high reward states, we can train a classifier to automatically provide shaped rewards (this is similar to methods like VICE). But the reward from a standard classifier is not necessarily well shaped.
This is where the key idea in MURAL comes in: use normalized max likelihood (NML) to train a classifier that is aware of uncertainty. For each state, train one classifier with that state labeled positive (success) and another with it labeled negative (failure), and use the ratio of likelihoods from these two classifiers as reward!
This provides for exploration, since novel states will have higher uncertainty (hence reward closer to 50/50), while still shaping the reward to be larger closer to the example success states. This turns out to be a great way to do "directed" exploration.
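To make the reward computation concrete, here is a minimal sketch in plain Python. It is not the MURAL implementation: the tiny logistic-regression classifier, the hyperparameters, and the names `fit_logistic`, `success_prob`, and `nml_reward` are all assumptions chosen for illustration. For a query state it refits the classifier twice, once with the query labeled failure and once with it labeled success, then normalizes the two likelihoods.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, l2=1e-2, steps=500, lr=0.5):
    """Fit L2-regularized logistic regression by batch gradient descent.
    Returns the weights with the bias stored as the last entry."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [l2 * wi for wi in w]
        for x, label in zip(X, y):
            xb = list(x) + [1.0]  # append a constant bias feature
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, xb))) - label
            for j, xj in enumerate(xb):
                grad[j] += err * xj / len(X)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def success_prob(w, x):
    xb = list(x) + [1.0]
    return sigmoid(sum(wi * xi for wi, xi in zip(w, xb)))

def nml_reward(X, y, query):
    """Normalized likelihood that `query` is a success state.
    One classifier is refit per candidate label (hypothetical toy version)."""
    likelihood = {}
    for label in (0.0, 1.0):
        w = fit_logistic(X + [query], y + [label])
        p_success = success_prob(w, query)
        # Likelihood this classifier assigns to the label it was trained with.
        likelihood[label] = p_success if label == 1.0 else 1.0 - p_success
    return likelihood[1.0] / (likelihood[0.0] + likelihood[1.0])
```

Calling `nml_reward(X, y, state)` with `X` as a list of feature lists and `y` as 0/1 labels returns a value in (0, 1) that can be used directly as the reward. Note the cost: every query triggers two full refits, which is the tractability problem discussed next.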
Doing this tractably is hard, because we need two new classifiers for *every* state the agent visits. To make this efficient, we use meta-learning (MAML) to meta-train one classifier that can adapt to either label for any state very quickly. We call this meta-NML.
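The amortized query step can be sketched as follows. This is a toy illustration under strong assumptions, not the paper's code: it uses a linear classifier, skips the MAML meta-training loop entirely, and simply assumes that `w_meta` is an already meta-trained initialization. The names `adapt_one_step` and `meta_nml_reward` and the step size are hypothetical. The point is only that each label now costs one gradient step instead of a full retraining run.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def adapt_one_step(w_meta, query, label, step_size=1.0):
    """One gradient step on the log-loss of the single (query, label) pair,
    starting from the shared initialization (assumed to be meta-trained)."""
    xb = list(query) + [1.0]  # append a constant bias feature
    err = sigmoid(dot(w_meta, xb)) - label
    return [wi - step_size * err * xi for wi, xi in zip(w_meta, xb)]

def meta_nml_reward(w_meta, query, step_size=1.0):
    """Approximate NML success probability using two one-step adaptations."""
    xb = list(query) + [1.0]
    w_success = adapt_one_step(w_meta, query, 1.0, step_size)
    w_failure = adapt_one_step(w_meta, query, 0.0, step_size)
    p_if_success = sigmoid(dot(w_success, xb))
    p_if_failure = 1.0 - sigmoid(dot(w_failure, xb))
    return p_if_success / (p_if_success + p_if_failure)
```

In this toy linear setting, a query with a large feature norm moves the logit a long way under either label, so both adapted classifiers fit their assigned label and the reward lands near 0.5. A query the initialization already explains well barely moves, so the reward stays close to the classifier's own prediction.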
This ends up working very well across a wide range of manipulation, dexterous hand, and navigation tasks. To learn more about NML in deep learning, you can also check out Aurick Zhou's excellent blog post on this topic here: bairblog.github.io/2020/11/16/acn…
We'll present MURAL at #ICML2021.
Tue Jul 20 07:35 PM -- 07:40 PM (PDT): icml.cc/virtual/2021/s…

w/ Kevin Li, @abhishekunique, Ashwin Reddy, Vitchyr Pong, Aurick Zhou, Justin Yu

More from @svlevine

Feb 23, 2023
Pretraining on large datasets is powerful, it enables learning new tasks quickly (e.g., from BERT, LLMs, etc.). Can we do the same for RL, pretrain & finetune rapidly to new tasks? Scaled Q-Learning aims to unlock this ability, now on @GoogleAI blog:
ai.googleblog.com/2023/02/pre-tr…
👇
The idea is very simple: pretrain a large ResNet-based Q-function network with conservative Q-learning (CQL) with several design decisions to ensure it learns at scale. Pretrain on ~40 games with highly suboptimal data, then finetune to new games with offline or online data.
Performance on training games is very good, even from highly suboptimal data. With near optimal data, this outperforms non-Q-learning methods (e.g., BC, decision transformers) even vs models 2.5x bigger (DT 200M). On suboptimal data it gets more than double the score!
Oct 10, 2022
General Navigation Models (GNM) are general-purpose navigation backbones that can drive many robots. It turns out that simple goal-conditioned policies can be trained on multi-robot datasets and generalize in zero-shot to entirely new robots!

sites.google.com/view/drive-any…

Thread>
The GNM architecture we use is simple: a model that takes in a current image, a goal image, and a temporal context (stack of frames) that tells the model how the robot behaves (which it uses to infer size, dynamics, etc.). With a topological graph, this lets it drive the robot.
The key is that the GNM is trained on data from many robots: big vehicles (ATVs, etc.), small ground robots, even little RC cars. All data is treated the same way: the model just learns to directly generalize over robot types, learning general navigational skills.
Jun 22, 2022
What do Lyapunov functions, offline RL, and energy based models have in common? Together, they can be used to provide long-horizon guarantees by "stabilizing" a system in high density regions! That's the idea behind Lyapunov Density Models: sites.google.com/berkeley.edu/l…

A thread:
Basic question: if I learn a model (e.g., dynamics model for MPC, value function, BC policy) on data, will that model be accurate when I run it (e.g., to control my robot)? It might be wrong if I go out of distribution. LDMs aim to provide a constraint that prevents this.
By analogy (which we can make precise!) Lyapunov functions tell us how to stabilize around a point in space (i.e., x=0). What if what we want is to stabilize in high density regions (i.e., p(s) >= eps)? Both require considering long horizon outcomes though, so we can't just be greedy!
Jun 21, 2022
NLP and offline RL are a perfect fit, enabling large language models to be trained to maximize rewards for tasks such as dialogue and text generation. We describe how ILQL can make this easy in our new paper: sea-snell.github.io/ILQL_site/

Code: github.com/Sea-Snell/Impl…

Thread ->
We might want RL in many places in NLP: goal-directed dialogue, synthesize text that fulfills subjective user criteria, solve word puzzles. But online RL is hard if we need to actively interact with a human (takes forever, annoying). Offline RL can learn from only human data!
Implicit Q-learning (IQL) provides a particularly convenient method for offline RL for NLP, with a training procedure that is very close to supervised learning, but with the addition of rewards in the loss. Our full method slightly modifies IQL w/ a CQL term and smarter decoding.
Jun 17, 2022
Deep nets can be overconfident (and wrong) on unfamiliar inputs. What if we directly teach them to be less confident? The idea in RCAD ("Adversarial Unlearning") is to generate images that are hard, and teach it to be uncertain on them: arxiv.org/abs/2206.01367

A thread:
The idea is to use *very* aggressive adversarial training, generating junk images for which the model predicts the wrong label, then train the model to minimize its confidence on them. Since we don't need "true" labels for these images, we make *much* bigger steps than standard adversarial training.
This leads to improved generalization performance on the test set, and can be readily combined with other methods for improving performance. It works especially well when training data is more limited.
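A rough sketch of that recipe on a toy linear softmax classifier, in plain Python. This is an illustration under assumptions, not the RCAD code: the linear model, the use of prediction entropy as the confidence measure, and the names `adversarial_example`, `rcad_objective`, `alpha`, and `lam` are all chosen here for the example. One large gradient-ascent step on the input produces the "junk" example, and the training objective subtracts the entropy on that example so that minimizing it lowers the model's confidence there.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, x):
    """Class probabilities of a linear softmax classifier (one row of W per class)."""
    return softmax([sum(wi * xi for wi, xi in zip(row, x)) for row in W])

def cross_entropy(W, x, label):
    return -math.log(predict(W, x)[label])

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adversarial_example(W, x, label, alpha):
    """One large gradient-ascent step on the loss with respect to the input."""
    p = predict(W, x)
    # d(loss)/dx_j = sum_k (p_k - 1[k == label]) * W[k][j]
    grad = [
        sum((p[k] - (1.0 if k == label else 0.0)) * W[k][j] for k in range(len(W)))
        for j in range(len(x))
    ]
    return [xj + alpha * gj for xj, gj in zip(x, grad)]

def rcad_objective(W, x, label, alpha=5.0, lam=0.1):
    """Loss on the clean example minus a weighted entropy bonus on the
    adversarial one; the adversarial input is treated as a fixed point."""
    x_adv = adversarial_example(W, x, label, alpha)
    return cross_entropy(W, x, label) - lam * entropy(predict(W, x_adv))
```

No label is needed for `x_adv`, which is why the step size `alpha` can be much larger than in label-preserving adversarial training.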
May 23, 2022
Can we do model-based RL just by treating a trajectory like a huge image, and training a diffusion model to generate trajectories? Diffuser does exactly this, guiding generative diffusion models over trajectories with Q-values!

diffusion-planning.github.io

🧵->
The model is a diffusion model over trajectories. Generation amounts to producing a physically valid trajectory. Perturbing by adding gradients of Q-values steers the model toward more optimal trajectories. That's basically the method.
The architecture is quite straightforward, with the only trajectory-specific "inductive bias" being temporally local receptive fields at each step (intuition is that each step looks at its neighbors and tries to "straighten out" the trajectory, making it more physical)