Sergey Levine
Associate Professor at UC Berkeley
Feb 23, 2023 6 tweets 4 min read
Pretraining on large datasets is powerful: it enables learning new tasks quickly (e.g., from BERT, LLMs, etc.). Can we do the same for RL, pretraining and then finetuning rapidly on new tasks? Scaled Q-Learning aims to unlock this ability, now on @GoogleAI blog:
ai.googleblog.com/2023/02/pre-tr…
👇 The idea is very simple: pretrain a large ResNet-based Q-function network with conservative Q-learning (CQL), plus a few design decisions that ensure it learns at scale. Pretrain on ~40 games with highly suboptimal data, then finetune to new games with offline or online data.
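A minimal sketch of a CQL-style objective for this kind of pretraining, assuming a discrete-action Q-network; the function names and hyperparameters below are placeholders, not the released implementation:

```python
# Sketch of conservative Q-learning (CQL) for offline pretraining:
# standard TD loss plus a term that pushes Q-values down everywhere
# except on the actions actually seen in the dataset.
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, obs, actions, rewards, next_obs,
             gamma=0.99, alpha=1.0):
    q_all = q_net(obs)                                     # (B, num_actions)
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q = target_q_net(next_obs).max(dim=1).values
        target = rewards + gamma * next_q

    td_loss = F.mse_loss(q_taken, target)
    # Conservative regularizer: logsumexp over all actions minus dataset-action Q.
    conservative = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return td_loss + alpha * conservative
```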
Oct 10, 2022 6 tweets 4 min read
General Navigation Models (GNM) are general-purpose navigation backbones that can drive many robots. It turns out that simple goal-conditioned policies can be trained on multi-robot datasets and generalize zero-shot to entirely new robots!

sites.google.com/view/drive-any…

Thread> The GNM architecture we use is simple: a model that takes in a current image, a goal image, and a temporal context (a stack of past frames) that tells the model how the robot behaves (which it uses to infer size, dynamics, etc.). Combined with a topological graph, this lets it drive the robot.
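A rough sketch of a goal-conditioned policy in this spirit; the CNN backbone, embedding sizes, and output heads here are illustrative assumptions, not the exact released architecture:

```python
# Goal-conditioned navigation policy sketch: encode current image, goal image,
# and a context stack of past frames, then predict waypoints and distance-to-goal.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, context_frames=5, embed_dim=256, num_waypoints=5):
        super().__init__()
        # Shared image encoder for current/goal/context frames (placeholder CNN).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, embed_dim),
        )
        in_dim = embed_dim * (2 + context_frames)
        self.head = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.waypoints = nn.Linear(256, num_waypoints * 2)  # relative (x, y) waypoints
        self.distance = nn.Linear(256, 1)                   # temporal distance to goal

    def forward(self, current, goal, context):
        # context: (B, T, 3, H, W) stack of past frames that reveals the robot's behavior
        feats = [self.encoder(current), self.encoder(goal)]
        feats += [self.encoder(context[:, t]) for t in range(context.shape[1])]
        h = self.head(torch.cat(feats, dim=-1))
        return self.waypoints(h), self.distance(h)
```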
Jun 22, 2022 8 tweets 5 min read
What do Lyapunov functions, offline RL, and energy-based models have in common? Together, they can be used to provide long-horizon guarantees by "stabilizing" a system in high-density regions! That's the idea behind Lyapunov Density Models: sites.google.com/berkeley.edu/l…

A thread: Basic question: if I learn a model (e.g., a dynamics model for MPC, a value function, a BC policy) on data, will that model be accurate when I run it (e.g., to control my robot)? It might be wrong if I go out of distribution; LDMs aim to provide a constraint that prevents this.
Jun 21, 2022 7 tweets 5 min read
NLP and offline RL are a perfect fit, enabling large language models to be trained to maximize rewards for tasks such as dialogue and text generation. We describe how ILQL can make this easy in our new paper: sea-snell.github.io/ILQL_site/

Code: github.com/Sea-Snell/Impl…

Thread -> We might want RL in many places in NLP: goal-directed dialogue, synthesizing text that fulfills subjective user criteria, solving word puzzles. But online RL is hard if we need to actively interact with a human (it takes forever, and it's annoying). Offline RL can learn from human data alone!
Jun 17, 2022 6 tweets 3 min read
Deep nets can be overconfident (and wrong) on unfamiliar inputs. What if we directly teach them to be less confident? The idea in RCAD ("Adversarial Unlearning") is to generate images that are hard, and teach the model to be uncertain on them: arxiv.org/abs/2206.01367

A thread: The idea is to use *very* aggressive adversarial training, generating junk images for which the model predicts the wrong label, then training the model to minimize its confidence on them. Since we don't need "true" labels for these images, we can take *much* bigger steps than standard adversarial training.
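A hedged sketch of this kind of objective: take large adversarial steps to produce off-manifold images, then train the model toward maximum uncertainty on them. Step sizes and step counts are illustrative, not the paper's settings:

```python
# RCAD-style loss sketch: cross-entropy on clean data plus an entropy-maximization
# term on aggressively generated adversarial ("junk") images.
import torch
import torch.nn.functional as F

def rcad_loss(model, x, y, step_size=1.0, num_steps=3):
    # Standard cross-entropy on clean examples.
    clean_loss = F.cross_entropy(model(x), y)

    # Generate junk images with very aggressive gradient ascent on the loss.
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(num_steps):
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv + step_size * grad.sign()).detach().requires_grad_(True)

    # Minimize confidence on the junk images: maximize predictive entropy.
    probs = F.softmax(model(x_adv), dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    return clean_loss - entropy  # minimizing this maximizes entropy on x_adv
```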
May 23, 2022 6 tweets 3 min read
Can we do model-based RL just by treating a trajectory like a huge image, and training a diffusion model to generate trajectories? Diffuser does exactly this, guiding generative diffusion models over trajectories with Q-values!

diffusion-planning.github.io

🧵-> The model is a diffusion model over trajectories. Generation amounts to producing a physically valid trajectory. Perturbing the denoising process with gradients of Q-values steers the model toward more optimal trajectories. That's basically the method.
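A minimal sketch of Q-guided sampling over trajectories in this spirit; the denoiser, Q-function, noise schedule, and guidance scale are all placeholders:

```python
# Guided reverse diffusion over trajectories: denoise toward a plausible
# trajectory, nudging each step uphill on a learned Q-value.
import torch

@torch.no_grad()
def guided_sample(denoiser, q_function, traj_shape, num_steps=100, guide_scale=0.1):
    # A "trajectory" is one big array of states and actions, denoised jointly.
    traj = torch.randn(traj_shape)
    for t in reversed(range(num_steps)):
        # One reverse-diffusion step toward a physically plausible trajectory.
        traj = denoiser(traj, t)
        # Guidance: add the gradient of the Q-value with respect to the trajectory.
        with torch.enable_grad():
            traj_g = traj.detach().requires_grad_(True)
            q = q_function(traj_g).sum()
            grad, = torch.autograd.grad(q, traj_g)
        traj = traj + guide_scale * grad
    return traj
```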
Apr 27, 2022 10 tweets 6 min read
Offline RL is a natural fit for dialogue: RL with humans is hard, but data of humans talking to humans is plentiful. In two new papers, we explore offline RL for end-to-end dialogue systems with Transformers!
CALM: sea-snell.github.io/CALM_LM_site/
CHAI: siddharthverma314.github.io/research/chai-…
🧵-> In CHAI, we use a language model (GPT-2) to propose responses that are then ranked by a sequence Q-function. During training, this produces target values; at test time, it selects the response. CHAI can adapt a variety of offline RL methods, with CQL performing best (by a small margin).
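A small sketch of this propose-and-rank selection step; `lm_propose` and `q_function` are assumed interfaces for illustration, not the paper's actual API:

```python
# CHAI-style response selection: the language model proposes candidates,
# a learned sequence-level Q-function ranks them, and the argmax is returned.
def select_response(lm_propose, q_function, dialogue_history, num_candidates=10):
    # Sample candidate responses from the language model (e.g., GPT-2).
    candidates = [lm_propose(dialogue_history) for _ in range(num_candidates)]
    # Score each candidate with the Q-function and pick the best one.
    scores = [float(q_function(dialogue_history, c)) for c in candidates]
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
```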
Apr 25, 2022 6 tweets 4 min read
When does offline RL outperform behavioral cloning when BC gets *optimal* data? A short summary of our recent paper with @aviral_kumar2, Anikait Singh, Joey Hong: arxiv.org/abs/2204.05618
(see also the blog post bair.berkeley.edu/blog/2022/04/2…)
A thread: Let's consider "success/failure" tasks where there is a reward at the end: +1 if you succeed, 0 otherwise. In general, both BC and offline RL get O(H) regret (i.e., linear in the horizon) if provided with optimal demonstration data. But there are some important special cases!
Apr 25, 2022 5 tweets 4 min read
Should you use imitation learning or offline RL? This can be a surprisingly subtle question -- sometimes, even with near-optimal data, offline RL is better both in theory and in practice! @aviral_kumar2 and @ikostrikov discuss in their new blog post: bair.berkeley.edu/blog/2022/04/2…

A 🧵: Of course "standard" BC suffers from assuming all data is optimal, but even if we condition or filter the data, there are more complex stitching behaviors that require dynamic programming. This is known, but it's important to note that this is the norm, not the exception.
Feb 18, 2022 5 tweets 5 min read
Offline model-based optimization generates designs from prior data. E.g.,
Chip design: arxiv.org/abs/2110.11346
Robots, semiconductors, proteins: bair.berkeley.edu/blog/2021/10/2…
We are releasing a benchmark to evaluate MBO methods called Design-Bench: github.com/rail-berkeley/…
A thread -> The tasks in Design-Bench include optimizing DNA sequences, robot morphologies, neural architectures, molecules for drug design, and controllers. This provides a suite of datasets and ground-truth models for evaluating design algorithms in many domains.
Feb 4, 2022 6 tweets 4 min read
If you are doing offline RL, and you have a bunch of data without reward labels, should you: (a) learn a reward function; (b) label all the data with 0? Turns out that (b) (somewhat shockingly) is a very good choice, in theory and in practice: arxiv.org/abs/2202.01741

A thread: Straight from the annals of "I can't believe it's not broken": if all unlabeled data gets a reward of 0, we get more samples (lower sampling error) but also reward bias. We can analyze theoretically what this does to the performance of RL, with a performance bound that has three terms:
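The relabeling trick itself is tiny; a sketch assuming a simple (s, a, r, s') transition format, which is just for illustration:

```python
# Merge reward-labeled and unlabeled offline data by assigning reward 0
# to every unlabeled transition.
def merge_datasets(labeled, unlabeled):
    # labeled:   list of (state, action, reward, next_state)
    # unlabeled: list of (state, action, next_state) with no reward annotation
    relabeled = [(s, a, 0.0, s_next) for (s, a, s_next) in unlabeled]
    return labeled + relabeled
```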
Feb 3, 2022 6 tweets 3 min read
Can gradient-based meta-learning be performed entirely online? Jathushan Rajasegaran's new paper proposes a *fully* online version of MAML (FOML), with two parameter vectors that are updated incrementally (a prior and a posterior): arxiv.org/abs/2202.00263

A thread: Standard MAML is batch mode, but in reality we might observe training samples in sequence, and the task might shift, suddenly or gradually. We want to use each datapoint *both* to improve our model *and* to learn how to learn more quickly for when the task changes.
Dec 21, 2021 9 tweets 5 min read
RL via supervised learning ("RvS") can be done with conditional BC. It has been studied in many contexts (more below), but it was unclear what matters to make it work (weighting? online? Transformers?). The answer is "none of the above": arxiv.org/abs/2112.10751

Here is RvS with a simple MLP. A 🧵: What is RvS (or conditional BC)? If we have some data and a context variable (omega), we can pick, for each trajectory, which omega value it is best for, and use it as supervised data for that omega value. Omega can be a goal, a reward, or any other parameter of the task.
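A hedged sketch of conditional BC with a plain MLP, in the spirit of RvS; the architecture sizes and the continuous-action regression loss are illustrative assumptions:

```python
# Conditional behavioral cloning: condition the policy on an outcome variable
# omega (e.g., the return or goal of the source trajectory) and train with
# ordinary supervised learning on dataset actions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalPolicy(nn.Module):
    def __init__(self, obs_dim, omega_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + omega_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, omega):
        return self.net(torch.cat([obs, omega], dim=-1))

def rvs_loss(policy, obs, omega, actions):
    # Pure supervised regression onto dataset actions, conditioned on omega.
    return F.mse_loss(policy(obs, omega), actions)
```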
Dec 13, 2021 8 tweets 5 min read
Deep RL is hard: lots of hparam tuning, instability. Perhaps there is a reason for this? Turns out the same thing that makes supervised deep learning work well makes deep RL work poorly, leading to feature vectors that grow out of control: arxiv.org/abs/2112.04716

Let me explain: Simple test: compare offline SARSA vs. offline TD. Both back up under the behavior policy, but SARSA uses the dataset actions, while TD samples *new* actions (from the same distribution!). The top plot shows phi(s,a)·phi(s',a'): the dot product of current and next-state features.
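A quick sketch of that diagnostic; it assumes a Q-network exposing a `features` method, which is a hypothetical interface for illustration:

```python
# Measure the dot product between the Q-network's features at (s, a) and at
# the next (s', a') used in the backup. Large, growing values correspond to
# the feature blow-up described above.
import torch

@torch.no_grad()
def feature_dot_product(q_net, obs, actions, next_obs, next_actions):
    phi = q_net.features(obs, actions)                  # phi(s, a)
    phi_next = q_net.features(next_obs, next_actions)   # phi(s', a')
    return (phi * phi_next).sum(dim=-1).mean()
```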
Dec 8, 2021 5 tweets 3 min read
Classifiers can act as value functions, when a task is specified with a set of example outcomes. We call this method RCE. It's simple and works well! Find out more at @ben_eysenbach's poster Wed 12/8 4:30 pm at @NeurIPSConf, and full-length talk Fri 4:20 pm PT (oral sess 5)🧵> Poster link: neurips.cc/virtual/2021/p…
Oral link: neurips.cc/virtual/2021/o…
Blog post: ai.googleblog.com/2021/03/recurs…
Paper: arxiv.org/abs/2103.12656
Dec 7, 2021 8 tweets 3 min read
Intrinsic motivation allows RL to find complex behaviors without a hand-designed reward. What makes for a good objective? Information about the world can be translated into energy (or rather, work), so can an intrinsic objective accumulate information? That's the idea in IC2. A 🧵: The "Maxwell's demon" thought experiment describes how information translates into energy. In one version, the "demon" opens a gate when a particle approaches from one side, but not the other, sorting the particles into one chamber (against the diffusion gradient). This lowers entropy.
Nov 22, 2021 5 tweets 2 min read
Reusable datasets, such as ImageNet, are a driving force in ML. But how can we reuse data in robotics? In his new blog post, Frederik Ebert talks about "bridge data": multi-domain and multi-task datasets that boost generalization to new tasks: bair.berkeley.edu/blog/2021/11/1…

A thread: The reason reusing data in robotics is hard is that everyone has a different lab environment, different tasks, etc. So to be reusable, the dataset needs to be both multi-domain *and* multi-task, so that it enables generalization across domains.
Nov 19, 2021 9 tweets 5 min read
We've updated Trajectory Transformer (Transformers + model-based RL) for the NeurIPS camera-ready, now with more complete results that include Ant Maze and value functions, along with a blog post summarizing the method!
arxiv.org/abs/2106.02039
bair.berkeley.edu/blog/2021/11/1…

A thread: Trajectory Transformer is a "one big dumb model" approach to model-based RL: every single dimension of every state and action is a (discrete) token in a huge sequence. The model doesn't distinguish between states, actions, and rewards; they're all just tokens.
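A hedged sketch of this tokenization step; the uniform binning scheme, bin count, and value range are illustrative choices, not the paper's exact discretization:

```python
# Flatten a trajectory into one long token sequence: discretize every
# dimension of every state, action, and reward into integer bins.
import numpy as np

def tokenize_trajectory(states, actions, rewards, num_bins=100, low=-1.0, high=1.0):
    # states: (T, state_dim), actions: (T, act_dim), rewards: (T,)
    def discretize(x):
        x = np.clip(x, low, high)
        return np.floor((x - low) / (high - low) * (num_bins - 1)).astype(np.int64)

    tokens = []
    for t in range(len(states)):
        # The model sees no distinction between these: all are just tokens.
        tokens.extend(discretize(states[t]).tolist())
        tokens.extend(discretize(actions[t]).tolist())
        tokens.append(int(discretize(np.array([rewards[t]]))[0]))
    return np.array(tokens)
```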
Oct 19, 2021 4 tweets 2 min read
To make an existing model more robust at test time: augment a single test image in many ways, then finetune the model so that predictions on the augmented images "agree", minimizing marginal entropy. This is the idea behind MEMO (w/ Marvin Zhang & @chelseabfinn): arxiv.org/abs/2110.09506

🧵> MEMO is a simple test-time adaptation method that takes any existing model (no change during training) and finetunes it on one image (see the sketch after this list):
1. generate augmentations of test image
2. make predictions on all of them
3. minimize marginal entropy of these predictions (make them similar)
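A minimal sketch of those three steps; the augmentation function, optimizer settings, and number of augmentations are placeholders:

```python
# MEMO-style test-time adaptation on a single image.
import torch
import torch.nn.functional as F

def memo_adapt(model, image, augment, optimizer, num_augs=32):
    # 1. Generate augmentations of the single test image.
    augs = torch.stack([augment(image) for _ in range(num_augs)])
    # 2. Make predictions on all of them.
    probs = F.softmax(model(augs), dim=1)
    # 3. Minimize the entropy of the marginal (average) prediction, which
    #    encourages the augmented predictions to agree and be confident.
    marginal = probs.mean(dim=0)
    loss = -(marginal * marginal.clamp_min(1e-8).log()).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return model
```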
Oct 13, 2021 7 tweets 3 min read
Implicit Q-learning, or "I can't believe it's not SARSA": state-of-the-art offline RL results, fast and easy to implement; almost SARSA, but with a different loss to provide "implicit policy improvement": arxiv.org/abs/2110.06169

w/ @ikostrikov, @ashvinair
🧵-> Here is the idea: if we want to prevent *all* OOD action issues in offline RL, we could use *only* actions in the dataset. That leads to a SARSA update, which is very stable. But it learns the *behavior policy* value function, not the optimal value function.
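A sketch of the expectile-regression value loss that provides the "implicit policy improvement": like SARSA, Q is only ever queried at dataset actions, but the asymmetric loss makes V(s) track an upper expectile of Q(s, a). The expectile parameter and network interfaces are illustrative:

```python
# IQL-style expectile value loss: weight positive errors (q > v) by tau and
# negative errors by (1 - tau), so V regresses toward an upper expectile of Q.
import torch

def expectile_value_loss(q_net, v_net, obs, actions, tau=0.7):
    with torch.no_grad():
        q = q_net(obs, actions)      # Q evaluated only at dataset actions
    diff = q - v_net(obs)
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()
```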
Sep 23, 2021 6 tweets 4 min read
Offline RL lets us run RL without active online interaction, but tuning hyperparameters, model capacity, etc. still requires rollouts or validation tasks. In our new paper, we propose guidelines for *fully offline* tuning for algorithms like CQL arxiv.org/abs/2109.10813

A thread: In supervised learning, we have training and validation sets, and this works great for tuning. There is no equivalent in RL, which makes tuning hard. However, with CQL, when there is too much or too little model capacity, we get very characteristic estimated-vs-true Q-value curves.