Sergey Levine · Jun 22
What do Lyapunov functions, offline RL, and energy based models have in common? Together, they can be used to provide long-horizon guarantees by "stabilizing" a system in high density regions! That's the idea behind Lyapunov Density Models: sites.google.com/berkeley.edu/l…

A thread:
Basic question: if I learn a model (e.g., a dynamics model for MPC, a value function, or a BC policy) on data, will that model be accurate when I run it (e.g., to control my robot)? It might be wrong if I go out of distribution; LDMs aim to provide a constraint that keeps the system from doing this.
By analogy (which we can make precise!): Lyapunov functions tell us how to stabilize around a point in space (i.e., x = 0). What if what we want is to stabilize in high-density regions (i.e., p(s) >= eps)? Both require considering long-horizon outcomes, so we can't just be greedy!
We can take actions that keep density high now but lead to (inevitably) low density later. So just as the Lyapunov function needs to take the future into account, so does the Lyapunov density model, integrating future outcomes via a Bellman equation just like in Q-learning.
The LDM can be thought of as a "value function" with an unusual backup: (log) density serves as the "reward" at terminal states, and a "min" backup is used at other states (the backup is sketched below; E = -log p is the energy). In special cases, the LDM can simultaneously be a Lyapunov function, a density model, and a value function!
Intuitively, the LDM learns to represent the worst-case future log density we will see if we take a particular action, and then obey the LDM constraint thereafter. Even with approximation error, we can prove that this keeps the system in-distribution, minimizing errors!
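To make that backup concrete, here is a minimal tabular sketch of the idea as described above: the LDM bootstraps the worst-case (highest) energy E = -log p it will encounter, taking a max against the current energy and a min over next actions. The array names, toy dynamics, and numbers are illustrative assumptions, not the paper's learned-function-approximation setup.

```python
import numpy as np

def ldm_backup(energy, next_state, n_iters=100):
    """Tabular sketch of the LDM backup (deterministic dynamics).

    energy[s, a]     : E(s, a) = -log p(s, a), from a learned density/energy model
    next_state[s, a] : index of the state reached by taking action a in state s
    Returns G[s, a]  : worst-case energy encountered if we take a in s and then
                       keep picking the action that minimizes G.
    """
    G = energy.copy()
    for _ in range(n_iters):
        best_next = G[next_state].min(axis=-1)   # min over a' of G(s', a')
        G = np.maximum(energy, best_next)        # max against the current energy
    return G

# Toy example: 3 states, 2 actions (illustrative numbers only)
energy = np.array([[0.1, 2.0],
                   [0.5, 0.2],
                   [3.0, 3.0]])
next_state = np.array([[1, 2],
                       [0, 1],
                       [2, 2]])
G = ldm_backup(energy, next_state)
# Constraining actions to those with G(s, a) <= -log(eps) keeps us in high-density regions.
print(G)
```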
Some results -- here, we use model-based RL (MPC, as in PETS) to control the hopper to hop to different locations, with the LDM providing a safety constraint. As we tighten the threshold, the hopper stops falling; tighten the constraint too much and it simply stands in place.
By @katie_kang_, Paula Gradu, Jason Choi, @michaeljanner, Claire Tomlin, and myself

web: sites.google.com/berkeley.edu/l…
paper: arxiv.org/abs/2206.10524

This will appear at #ICML2022!


More from @svlevine

Jun 21
NLP and offline RL are a perfect fit, enabling large language models to be trained to maximize rewards for tasks such as dialogue and text generation. We describe how ILQL can make this easy in our new paper: sea-snell.github.io/ILQL_site/

Code: github.com/Sea-Snell/Impl…

Thread ->
We might want RL in many places in NLP: goal-directed dialogue, synthesizing text that fulfills subjective user criteria, solving word puzzles. But online RL is hard if we need to actively interact with a human (it takes forever, and it's annoying). Offline RL can learn from human data alone!
Implicit Q-learning (IQL) provides a particularly convenient method for offline RL for NLP, with a training procedure that is very close to supervised learning, but with the addition of rewards in the loss. Our full method slightly modifies IQL w/ a CQL term and smarter decoding.
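As a quick sketch of why the training procedure looks so much like supervised learning, here is the expectile-regression value objective at the core of IQL (PyTorch-style, tensor shapes assumed; the CQL term and decoding changes that ILQL adds are not shown):

```python
import torch

def iql_value_loss(q_values, v_values, tau=0.7):
    """Asymmetric (expectile) regression from IQL: V is pushed toward an upper
    expectile of Q over dataset actions, approximating a max over in-distribution
    actions without ever evaluating out-of-distribution actions.

    q_values, v_values: tensors of shape [batch]
    """
    diff = q_values.detach() - v_values
    weight = torch.abs(tau - (diff < 0).float())   # tau if diff >= 0, else 1 - tau
    return (weight * diff.pow(2)).mean()

# Q itself is then regressed toward r + gamma * V(s') with an ordinary MSE loss,
# which is why the whole pipeline stays so close to supervised learning.
```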
Jun 17
Deep nets can be overconfident (and wrong) on unfamiliar inputs. What if we directly teach them to be less confident? The idea in RCAD ("Adversarial Unlearning") is to generate images that are hard, and teach it to be uncertain on them: arxiv.org/abs/2206.01367

A thread:
The idea is to use *very* aggressive adversarial training: generate junk images for which the model predicts the wrong label, then train the model to minimize its confidence on them. Since we don't need "true" labels for these images, we can take *much* bigger steps than standard adversarial training.
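A rough sketch of what one such training step might look like (PyTorch, my own reading of the idea; the step size, number of ascent steps, and loss weighting are placeholder values rather than the paper's settings):

```python
import torch
import torch.nn.functional as F

def rcad_train_step(model, x, y, optimizer, step_size=1.0, n_ascent=5, alpha=0.5):
    """Cross-entropy on real images plus an entropy-maximization term on
    aggressively perturbed 'junk' images (confidence minimization)."""
    # 1) Generate hard examples by ascending the loss with large steps.
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(n_ascent):
        loss_adv = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss_adv, x_adv)
        x_adv = (x_adv + step_size * grad.sign()).detach().requires_grad_(True)

    # 2) Be accurate on real images, be maximally uncertain on the junk images.
    logits = model(x)
    probs_adv = model(x_adv.detach()).softmax(dim=-1)
    entropy = -(probs_adv * probs_adv.clamp_min(1e-8).log()).sum(dim=-1).mean()
    loss = F.cross_entropy(logits, y) - alpha * entropy

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```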
This improves generalization on the test set and can be readily combined with other methods for improving performance. It works especially well when training data is limited.
May 23
Can we do model-based RL just by treating a trajectory like a huge image, and training a diffusion model to generate trajectories? Diffuser does exactly this, guiding generative diffusion models over trajectories with Q-values!

diffusion-planning.github.io

🧵->
The model is a diffusion model over trajectories. Generation amounts to producing a physically valid trajectory. Perturbing by adding gradients of Q-values steers the model toward more optimal trajectories. That's basically the method.
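In rough pseudocode, the sampling loop is classifier guidance with a return/Q model in place of a classifier (the denoiser interface, schedule, and guidance scale below are placeholders, not the released implementation):

```python
import torch

def guided_sample(denoiser, return_fn, traj_shape, n_steps=100, guide_scale=0.1):
    """Sample a trajectory array of shape [horizon, state_dim + action_dim] by
    iterative denoising, nudging each step toward higher predicted returns."""
    traj = torch.randn(traj_shape)
    for t in reversed(range(n_steps)):
        # Gradient of the predicted return w.r.t. the (noisy) trajectory.
        traj_req = traj.clone().requires_grad_(True)
        grad = torch.autograd.grad(return_fn(traj_req).sum(), traj_req)[0]
        # One denoising step, perturbed in the direction of higher return.
        with torch.no_grad():
            traj = denoiser(traj, t) + guide_scale * grad
    return traj
```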
The architecture is quite straightforward, with the only trajectory-specific "inductive bias" being temporally local receptive fields at each step (the intuition is that each step looks at its neighbors and tries to "straighten out" the trajectory, making it more physical).
Apr 27
Offline RL is a natural fit for dialogue: RL with humans is hard, but data of humans talking to humans is plentiful. In two new papers, we explore offline RL for end-to-end dialogue systems with Transformers!
CALM: sea-snell.github.io/CALM_LM_site/
CHAI: siddharthverma314.github.io/research/chai-…
🧵->
In CHAI, we use a language model (GPT-2) to propose responses that are then ranked by a sequence-level Q-function. During training, this produces target values; at test time, it selects the response. CHAI can adapt a variety of offline RL methods, with CQL performing best (by a small margin).
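Test-time decoding can then be sketched as rerank-by-Q (the `lm_sample` and `q_value` callables are illustrative names, not the released API):

```python
def select_response(lm_sample, q_value, dialogue_history, n_candidates=10):
    """CHAI-style decoding sketch: a language model (e.g., GPT-2) proposes
    candidate responses and a learned sequence-level Q-function picks the best."""
    candidates = [lm_sample(dialogue_history) for _ in range(n_candidates)]
    return max(candidates, key=lambda c: q_value(dialogue_history, c))
```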
CHAI outperforms prior methods on Craigslist Negotiation when compared to a variety of different evaluation bots, though the gap is small -- likely more diverse datasets are needed to fully understand the performance of offline RL for negotiation. But the examples seem promising.
Apr 25
When does offline RL outperform behavioral cloning when BC gets *optimal* data? A short summary of our recent paper with @aviral_kumar2, Anikait Singh, Joey Hong: arxiv.org/abs/2204.05618
(see also the blog post bair.berkeley.edu/blog/2022/04/2…)
A thread:
Let's consider "success/failure" tasks with a reward at the end: +1 if you succeed, 0 otherwise. In general, both BC and offline RL get O(H) regret (i.e., linear in the horizon) when provided with optimal demonstration data. But there are some important special cases!
If an MDP has critical states (states where it's critical to take a good action), but many states are *not* critical, RL gets sublinear regret. Such MDPs are common -- e.g., in navigation, far from obstacles any action is OK, but close by it's important to take specific actions.
Apr 25
Should you use imitation learning or offline RL? This can be a surprisingly subtle question -- sometimes, even with near optimal data offline RL is better both in theory and in practice! @aviral_kumar2 and @ikostrikov discuss in their new blog post: bair.berkeley.edu/blog/2022/04/2…

A 🧵:
Of course "standard" BC suffers due to assuming all data is optimal, but even if we condition or filter the data, there are more complex stitching behaviors that require dynamic programming. This is known, but it's important to note that this is the norm, not the exception. Image
Indeed, tasks that require stitching show huge gaps between BC-style methods and dynamic programming methods (left), except of course when we incorporate additional knowledge into conditional BC, as in Scott's RvS-G blog post from last week (right): bair.berkeley.edu/blog/2022/04/2…
