Sergey Levine · Jun 22
What do Lyapunov functions, offline RL, and energy based models have in common? Together, they can be used to provide long-horizon guarantees by "stabilizing" a system in high density regions! That's the idea behind Lyapunov Density Models: sites.google.com/berkeley.edu/l…

A thread:
Basic question: if I learn a model (e.g., a dynamics model for MPC, a value function, or a BC policy) on data, will that model be accurate when I run it (e.g., to control my robot)? It might be wrong if I go out of distribution; LDMs aim to provide a constraint that keeps the system from doing this.
By analogy (which we can make precise!): Lyapunov functions tell us how to stabilize around a point in space (i.e., x = 0). What if what we want is to stabilize in high-density regions (i.e., p(s) >= eps)? Both require considering long-horizon outcomes, so we can't just be greedy!
We can take actions that keep density high now but lead to (inevitably) low density later. So just as the Lyapunov function needs to take the future into account, so does the Lyapunov density model, integrating future outcomes via a Bellman equation just like in Q-learning.
The LDM can be thought of as a "value function" with an unusual backup: (log) density serves as the "reward" at terminal states, and a "min" backup is used at other states (the backup is sketched below; E = -log p is the energy). In special cases, the LDM can simultaneously be a Lyapunov function, a density model, and a value function!
Intuitively, the LDM learns to represent the worst-case future log density we will see if we take a particular action, and then obey the LDM constraint thereafter. Even with approximation error, we can prove that this keeps the system in-distribution, minimizing errors!
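To make that backup concrete, here is a minimal tabular sketch of the idea as described above: the LDM bootstraps the worst-case (highest) energy E = -log p it will encounter, taking a max against the current energy and a min over next actions. The array names, toy dynamics, and numbers are illustrative assumptions, not the paper's learned-function-approximation setup.

```python
import numpy as np

def ldm_backup(energy, next_state, n_iters=100):
    """Tabular sketch of the LDM backup (deterministic dynamics).

    energy[s, a]     : E(s, a) = -log p(s, a), from a learned density/energy model
    next_state[s, a] : index of the state reached by taking action a in state s
    Returns G[s, a]  : worst-case energy encountered if we take a in s and then
                       keep picking the action that minimizes G.
    """
    G = energy.copy()
    for _ in range(n_iters):
        best_next = G[next_state].min(axis=-1)   # min over a' of G(s', a')
        G = np.maximum(energy, best_next)        # max against the current energy
    return G

# Toy example: 3 states, 2 actions (illustrative numbers only)
energy = np.array([[0.1, 2.0],
                   [0.5, 0.2],
                   [3.0, 3.0]])
next_state = np.array([[1, 2],
                       [0, 1],
                       [2, 2]])
G = ldm_backup(energy, next_state)
# Constraining actions to those with G(s, a) <= -log(eps) keeps us in high-density regions.
print(G)
```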
Some results -- here, we use model-based RL (MPC, as in PETS) to control the hopper to hop to different locations, with the LDM providing a safety constraint. As we tighten the threshold, the hopper stops falling; tighten the constraint too much and it simply stands in place.
By @katie_kang_, Paula Gradu, Jason Choi, @michaeljanner, Claire Tomlin, and myself

web: sites.google.com/berkeley.edu/l…
paper: arxiv.org/abs/2206.10524

This will appear at #ICML2022!


More from @svlevine

Jun 21
NLP and offline RL are a perfect fit, enabling large language models to be trained to maximize rewards for tasks such as dialogue and text generation. We describe how ILQL can make this easy in our new paper: sea-snell.github.io/ILQL_site/

Code: github.com/Sea-Snell/Impl…

Thread ->
We might want RL in many places in NLP: goal-directed dialogue, synthesizing text that fulfills subjective user criteria, solving word puzzles. But online RL is hard if we need to actively interact with a human (it takes forever, and it's annoying). Offline RL can learn from human data alone!
Implicit Q-learning (IQL) provides a particularly convenient method for offline RL for NLP, with a training procedure that is very close to supervised learning, but with the addition of rewards in the loss. Our full method slightly modifies IQL w/ a CQL term and smarter decoding.
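As a quick sketch of why the training procedure looks so much like supervised learning, here is the expectile-regression value objective at the core of IQL (PyTorch-style, tensor shapes assumed; the CQL term and decoding changes that ILQL adds are not shown):

```python
import torch

def iql_value_loss(q_values, v_values, tau=0.7):
    """Asymmetric (expectile) regression from IQL: V is pushed toward an upper
    expectile of Q over dataset actions, approximating a max over in-distribution
    actions without ever evaluating out-of-distribution actions.

    q_values, v_values: tensors of shape [batch]
    """
    diff = q_values.detach() - v_values
    weight = torch.abs(tau - (diff < 0).float())   # tau if diff >= 0, else 1 - tau
    return (weight * diff.pow(2)).mean()

# Q itself is then regressed toward r + gamma * V(s') with an ordinary MSE loss,
# which is why the whole pipeline stays so close to supervised learning.
```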
Jun 17
Deep nets can be overconfident (and wrong) on unfamiliar inputs. What if we directly teach them to be less confident? The idea in RCAD ("Adversarial Unlearning") is to generate images that are hard, and teach it to be uncertain on them: arxiv.org/abs/2206.01367

A thread:
The idea is to use *very* aggressive adversarial training: generate junk images for which the model predicts the wrong label, then train the model to minimize its confidence on them. Since we don't need "true" labels for these images, we can take *much* bigger steps than standard adversarial training.
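A rough sketch of what one such training step might look like (PyTorch, my own reading of the idea; the step size, number of ascent steps, and loss weighting are placeholder values rather than the paper's settings):

```python
import torch
import torch.nn.functional as F

def rcad_train_step(model, x, y, optimizer, step_size=1.0, n_ascent=5, alpha=0.5):
    """Cross-entropy on real images plus an entropy-maximization term on
    aggressively perturbed 'junk' images (confidence minimization)."""
    # 1) Generate hard examples by ascending the loss with large steps.
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(n_ascent):
        loss_adv = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss_adv, x_adv)
        x_adv = (x_adv + step_size * grad.sign()).detach().requires_grad_(True)

    # 2) Be accurate on real images, be maximally uncertain on the junk images.
    logits = model(x)
    probs_adv = model(x_adv.detach()).softmax(dim=-1)
    entropy = -(probs_adv * probs_adv.clamp_min(1e-8).log()).sum(dim=-1).mean()
    loss = F.cross_entropy(logits, y) - alpha * entropy

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```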
This improves generalization on the test set and can be readily combined with other methods for improving performance. It works especially well when training data is limited.
May 23
Can we do model-based RL just by treating a trajectory like a huge image, and training a diffusion model to generate trajectories? Diffuser does exactly this, guiding generative diffusion models over trajectories with Q-values!

diffusion-planning.github.io

🧵->
The model is a diffusion model over trajectories. Generation amounts to producing a physically valid trajectory. Perturbing by adding gradients of Q-values steers the model toward more optimal trajectories. That's basically the method.
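In rough pseudocode, the sampling loop is classifier guidance with a return/Q model in place of a classifier (the denoiser interface, schedule, and guidance scale below are placeholders, not the released implementation):

```python
import torch

def guided_sample(denoiser, return_fn, traj_shape, n_steps=100, guide_scale=0.1):
    """Sample a trajectory array of shape [horizon, state_dim + action_dim] by
    iterative denoising, nudging each step toward higher predicted returns."""
    traj = torch.randn(traj_shape)
    for t in reversed(range(n_steps)):
        # Gradient of the predicted return w.r.t. the (noisy) trajectory.
        traj_req = traj.clone().requires_grad_(True)
        grad = torch.autograd.grad(return_fn(traj_req).sum(), traj_req)[0]
        # One denoising step, perturbed in the direction of higher return.
        with torch.no_grad():
            traj = denoiser(traj, t) + guide_scale * grad
    return traj
```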
The architecture is quite straightforward, with the only trajectory-specific "inductive bias" being temporally local receptive fields at each step (the intuition is that each step looks at its neighbors and tries to "straighten out" the trajectory, making it more physical).
Apr 27
Offline RL is a natural fit for dialogue: RL with humans is hard, but data of humans talking to humans is plentiful. In two new papers, we explore offline RL for end-to-end dialogue systems with Transformers!
CALM: sea-snell.github.io/CALM_LM_site/
CHAI: siddharthverma314.github.io/research/chai-…
🧵->
In CHAI, we use a language model (GPT-2) to propose responses that are then ranked by a sequence-level Q-function. During training, this produces target values; at test time, it selects the response. CHAI can adapt a variety of offline RL methods, with CQL performing best (by a small margin).
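Test-time decoding can then be sketched as rerank-by-Q (the `lm_sample` and `q_value` callables are illustrative names, not the released API):

```python
def select_response(lm_sample, q_value, dialogue_history, n_candidates=10):
    """CHAI-style decoding sketch: a language model (e.g., GPT-2) proposes
    candidate responses and a learned sequence-level Q-function picks the best."""
    candidates = [lm_sample(dialogue_history) for _ in range(n_candidates)]
    return max(candidates, key=lambda c: q_value(dialogue_history, c))
```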
CHAI outperforms prior methods on Craigslist Negotiation when compared to a variety of different evaluation bots, though the gap is small -- likely more diverse datasets are needed to fully understand the performance of offline RL for negotiation. But the examples seem promising.
Apr 25
When does offline RL outperform behavioral cloning when BC gets *optimal* data? A short summary of our recent paper with @aviral_kumar2, Anikait Singh, Joey Hong: arxiv.org/abs/2204.05618
(see also the blog post bair.berkeley.edu/blog/2022/04/2…)
A thread:
Let's consider "success/failure" tasks with a reward at the end: +1 if you succeed, 0 otherwise. In general, both BC and offline RL get O(H) regret (i.e., linear in the horizon) when provided with optimal demonstration data. But there are some important special cases!
If an MDP has critical states (states where it's critical to take a good action), but many states are *not* critical, RL gets sublinear regret. Such MDPs are common -- e.g., in navigation, far from obstacles any action is OK, but close by it's important to take specific actions.
Apr 25
Should you use imitation learning or offline RL? This can be a surprisingly subtle question -- sometimes, even with near optimal data offline RL is better both in theory and in practice! @aviral_kumar2 and @ikostrikov discuss in their new blog post: bair.berkeley.edu/blog/2022/04/2…

A 🧵:
Of course "standard" BC suffers due to assuming all data is optimal, but even if we condition or filter the data, there are more complex stitching behaviors that require dynamic programming. This is known, but it's important to note that this is the norm, not the exception. Image
Indeed, tasks that require stitching show huge gaps between BC-style methods and dynamic programming methods (left), except of course when we incorporate additional knowledge into conditional BC, as in Scott's RvS-G blog post from last week (right): bair.berkeley.edu/blog/2022/04/2…
