Sergey Levine
Associate Professor at UC Berkeley
Feb 23, 2023 6 tweets 4 min read
Pretraining on large datasets is powerful: it enables learning new tasks quickly (e.g., from BERT, LLMs, etc.). Can we do the same for RL, pretraining and then finetuning rapidly on new tasks? Scaled Q-Learning aims to unlock this ability, now on @GoogleAI blog:
ai.googleblog.com/2023/02/pre-tr…
👇 The idea is very simple: pretrain a large ResNet-based Q-function network with conservative Q-learning (CQL), plus a few design decisions that ensure it learns at scale. Pretrain on ~40 games with highly suboptimal data, then finetune to new games with offline or online data.
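A minimal sketch of a CQL-style objective for this kind of pretraining, assuming a discrete-action Q-network; the function names and hyperparameters below are placeholders, not the released implementation:

```python
# Sketch of conservative Q-learning (CQL) for offline pretraining:
# standard TD loss plus a term that pushes Q-values down everywhere
# except on the actions actually seen in the dataset.
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, obs, actions, rewards, next_obs,
             gamma=0.99, alpha=1.0):
    q_all = q_net(obs)                                     # (B, num_actions)
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q = target_q_net(next_obs).max(dim=1).values
        target = rewards + gamma * next_q

    td_loss = F.mse_loss(q_taken, target)
    # Conservative regularizer: logsumexp over all actions minus dataset-action Q.
    conservative = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return td_loss + alpha * conservative
```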
Oct 10, 2022 6 tweets 4 min read
General Navigation Models (GNM) are general-purpose navigation backbones that can drive many robots. It turns out that simple goal-conditioned policies can be trained on multi-robot datasets and generalize zero-shot to entirely new robots!

sites.google.com/view/drive-any…

Thread> The GNM architecture we use is simple: a model that takes in a current image, a goal image, and a temporal context (a stack of past frames) that tells the model how the robot behaves (which it uses to infer size, dynamics, etc.). Combined with a topological graph, this lets it drive the robot.
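A rough sketch of a goal-conditioned policy in this spirit; the CNN backbone, embedding sizes, and output heads here are illustrative assumptions, not the exact released architecture:

```python
# Goal-conditioned navigation policy sketch: encode current image, goal image,
# and a context stack of past frames, then predict waypoints and distance-to-goal.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, context_frames=5, embed_dim=256, num_waypoints=5):
        super().__init__()
        # Shared image encoder for current/goal/context frames (placeholder CNN).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, embed_dim),
        )
        in_dim = embed_dim * (2 + context_frames)
        self.head = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.waypoints = nn.Linear(256, num_waypoints * 2)  # relative (x, y) waypoints
        self.distance = nn.Linear(256, 1)                   # temporal distance to goal

    def forward(self, current, goal, context):
        # context: (B, T, 3, H, W) stack of past frames that reveals the robot's behavior
        feats = [self.encoder(current), self.encoder(goal)]
        feats += [self.encoder(context[:, t]) for t in range(context.shape[1])]
        h = self.head(torch.cat(feats, dim=-1))
        return self.waypoints(h), self.distance(h)
```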
Jun 22, 2022 8 tweets 5 min read
What do Lyapunov functions, offline RL, and energy-based models have in common? Together, they can be used to provide long-horizon guarantees by "stabilizing" a system in high-density regions! That's the idea behind Lyapunov Density Models: sites.google.com/berkeley.edu/l…

A thread: Basic question: if I learn a model (e.g., a dynamics model for MPC, a value function, a BC policy) on data, will that model be accurate when I run it (e.g., to control my robot)? It might be wrong if I go out of distribution; LDMs aim to provide a constraint that prevents this.
Jun 21, 2022 7 tweets 5 min read
NLP and offline RL are a perfect fit, enabling large language models to be trained to maximize rewards for tasks such as dialogue and text generation. We describe how ILQL can make this easy in our new paper: sea-snell.github.io/ILQL_site/

Code: github.com/Sea-Snell/Impl…

Thread -> We might want RL in many places in NLP: goal-directed dialogue, synthesizing text that fulfills subjective user criteria, solving word puzzles. But online RL is hard if we need to actively interact with a human (it takes forever, and it's annoying). Offline RL can learn from human data alone!
Jun 17, 2022 6 tweets 3 min read
Deep nets can be overconfident (and wrong) on unfamiliar inputs. What if we directly teach them to be less confident? The idea in RCAD ("Adversarial Unlearning") is to generate images that are hard, and teach the model to be uncertain on them: arxiv.org/abs/2206.01367

A thread: The idea is to use *very* aggressive adversarial training, generating junk images for which the model predicts the wrong label, then training the model to minimize its confidence on them. Since we don't need "true" labels for these images, we can take *much* bigger steps than standard adversarial training.
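A hedged sketch of this kind of objective: take large adversarial steps to produce off-manifold images, then train the model toward maximum uncertainty on them. Step sizes and step counts are illustrative, not the paper's settings:

```python
# RCAD-style loss sketch: cross-entropy on clean data plus an entropy-maximization
# term on aggressively generated adversarial ("junk") images.
import torch
import torch.nn.functional as F

def rcad_loss(model, x, y, step_size=1.0, num_steps=3):
    # Standard cross-entropy on clean examples.
    clean_loss = F.cross_entropy(model(x), y)

    # Generate junk images with very aggressive gradient ascent on the loss.
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(num_steps):
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv + step_size * grad.sign()).detach().requires_grad_(True)

    # Minimize confidence on the junk images: maximize predictive entropy.
    probs = F.softmax(model(x_adv), dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    return clean_loss - entropy  # minimizing this maximizes entropy on x_adv
```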
May 23, 2022 6 tweets 3 min read
Can we do model-based RL just by treating a trajectory like a huge image, and training a diffusion model to generate trajectories? Diffuser does exactly this, guiding generative diffusion models over trajectories with Q-values!

diffusion-planning.github.io

🧵-> The model is a diffusion model over trajectories. Generation amounts to producing a physically valid trajectory. Perturbing the denoising process with gradients of Q-values steers the model toward more optimal trajectories. That's basically the method.
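A minimal sketch of Q-guided sampling over trajectories in this spirit; the denoiser, Q-function, noise schedule, and guidance scale are all placeholders:

```python
# Guided reverse diffusion over trajectories: denoise toward a plausible
# trajectory, nudging each step uphill on a learned Q-value.
import torch

@torch.no_grad()
def guided_sample(denoiser, q_function, traj_shape, num_steps=100, guide_scale=0.1):
    # A "trajectory" is one big array of states and actions, denoised jointly.
    traj = torch.randn(traj_shape)
    for t in reversed(range(num_steps)):
        # One reverse-diffusion step toward a physically plausible trajectory.
        traj = denoiser(traj, t)
        # Guidance: add the gradient of the Q-value with respect to the trajectory.
        with torch.enable_grad():
            traj_g = traj.detach().requires_grad_(True)
            q = q_function(traj_g).sum()
            grad, = torch.autograd.grad(q, traj_g)
        traj = traj + guide_scale * grad
    return traj
```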
Apr 27, 2022 10 tweets 6 min read
Offline RL is a natural fit for dialogue: RL with humans is hard, but data of humans talking to humans is plentiful. In two new papers, we explore offline RL for end-to-end dialogue systems with Transformers!
CALM: sea-snell.github.io/CALM_LM_site/
CHAI: siddharthverma314.github.io/research/chai-…
🧵-> In CHAI, we use a language model (GPT-2) to propose responses that are then ranked by a sequence Q-function. During training, this produces target values; at test time, it selects the response. CHAI can adapt a variety of offline RL methods, with CQL performing best (by a small margin).
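A small sketch of this propose-and-rank selection step; `lm_propose` and `q_function` are assumed interfaces for illustration, not the paper's actual API:

```python
# CHAI-style response selection: the language model proposes candidates,
# a learned sequence-level Q-function ranks them, and the argmax is returned.
def select_response(lm_propose, q_function, dialogue_history, num_candidates=10):
    # Sample candidate responses from the language model (e.g., GPT-2).
    candidates = [lm_propose(dialogue_history) for _ in range(num_candidates)]
    # Score each candidate with the Q-function and pick the best one.
    scores = [float(q_function(dialogue_history, c)) for c in candidates]
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
```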
Apr 25, 2022 6 tweets 4 min read
When does offline RL outperform behavioral cloning when BC gets *optimal* data? A short summary of our recent paper with @aviral_kumar2, Anikait Singh, Joey Hong: arxiv.org/abs/2204.05618
(see also the blog post bair.berkeley.edu/blog/2022/04/2…)
A thread: Let's consider "success/failure" tasks where there is a reward at the end: +1 if you succeed, 0 otherwise. In general, both BC and offline RL get O(H) regret (i.e., linear in the horizon) if provided with optimal demonstration data. But there are some important special cases!
Apr 25, 2022 5 tweets 4 min read
Should you use imitation learning or offline RL? This can be a surprisingly subtle question -- sometimes, even with near-optimal data, offline RL is better both in theory and in practice! @aviral_kumar2 and @ikostrikov discuss in their new blog post: bair.berkeley.edu/blog/2022/04/2…

A 🧵: Of course "standard" BC suffers from assuming all data is optimal, but even if we condition or filter the data, there are more complex stitching behaviors that require dynamic programming. This is known, but it's important to note that this is the norm, not the exception.
Feb 18, 2022 5 tweets 5 min read
Offline model-based optimization generates designs from prior data. E.g.,
Chip design: arxiv.org/abs/2110.11346
Robots, semiconductors, proteins: bair.berkeley.edu/blog/2021/10/2…
We are releasing a benchmark to evaluate MBO methods called Design-Bench: github.com/rail-berkeley/…
A thread -> The tasks in Design-Bench include optimizing DNA sequences, robot morphologies, neural architectures, molecules for drug design, and controllers. This provides a suite of datasets and ground-truth models for evaluating design algorithms in many domains.
Feb 4, 2022 6 tweets 4 min read
If you are doing offline RL, and you have a bunch of data without reward labels, should you: (a) learn a reward function; (b) label all the data with 0? Turns out that (b) (somewhat shockingly) is a very good choice, in theory and in practice: arxiv.org/abs/2202.01741

A thread: Straight from the annals of "I can't believe it's not broken": if all unlabeled data gets a reward of 0, we get more samples (lower sampling error) but also reward bias. We can analyze theoretically what this does to the performance of RL, with a performance bound that has three terms:
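The relabeling trick itself is tiny; a sketch assuming a simple (s, a, r, s') transition format, which is just for illustration:

```python
# Merge reward-labeled and unlabeled offline data by assigning reward 0
# to every unlabeled transition.
def merge_datasets(labeled, unlabeled):
    # labeled:   list of (state, action, reward, next_state)
    # unlabeled: list of (state, action, next_state) with no reward annotation
    relabeled = [(s, a, 0.0, s_next) for (s, a, s_next) in unlabeled]
    return labeled + relabeled
```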
Feb 3, 2022 6 tweets 3 min read
Can gradient-based meta-learning be performed entirely online? Jathushan Rajasegaran's new paper proposes a *fully* online version of MAML (FOML), with two parameter vectors that are updated incrementally (a prior and a posterior): arxiv.org/abs/2202.00263

A thread: Standard MAML is batch mode, but in reality we might observe training samples in sequence, and the task might shift, suddenly or gradually. We want to use each datapoint *both* to improve our model *and* to learn how to learn more quickly for when the task changes.
Dec 21, 2021 9 tweets 5 min read
RL via supervised learning ("RvS") can be done with conditional BC. It has been studied in many contexts (more below), but it was unclear what matters to make it work (weighting? online? Transformers?). The answer is "none of the above": arxiv.org/abs/2112.10751

Here is RvS with a simple MLP. A 🧵: What is RvS (or conditional BC)? If we have some data and a context variable (omega), we can pick, for each trajectory, which omega value it is best for, and use it as supervised data for that omega value. Omega can be a goal, a reward, or any other parameter of the task.
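A hedged sketch of conditional BC with a plain MLP, in the spirit of RvS; the architecture sizes and the continuous-action regression loss are illustrative assumptions:

```python
# Conditional behavioral cloning: condition the policy on an outcome variable
# omega (e.g., the return or goal of the source trajectory) and train with
# ordinary supervised learning on dataset actions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalPolicy(nn.Module):
    def __init__(self, obs_dim, omega_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + omega_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, omega):
        return self.net(torch.cat([obs, omega], dim=-1))

def rvs_loss(policy, obs, omega, actions):
    # Pure supervised regression onto dataset actions, conditioned on omega.
    return F.mse_loss(policy(obs, omega), actions)
```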
Dec 13, 2021 8 tweets 5 min read
Deep RL is hard: lots of hparam tuning, instability. Perhaps there is a reason for this? Turns out the same thing that makes supervised deep learning work well makes deep RL work poorly, leading to feature vectors that grow out of control: arxiv.org/abs/2112.04716

Let me explain: Simple test: compare offline SARSA vs. offline TD. Both back up under the behavior policy, but SARSA uses the dataset actions, while TD samples *new* actions (from the same distribution!). The top plot shows phi(s,a)·phi(s',a'): the dot product of current and next-state features.
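A quick sketch of that diagnostic; it assumes a Q-network exposing a `features` method, which is a hypothetical interface for illustration:

```python
# Measure the dot product between the Q-network's features at (s, a) and at
# the next (s', a') used in the backup. Large, growing values correspond to
# the feature blow-up described above.
import torch

@torch.no_grad()
def feature_dot_product(q_net, obs, actions, next_obs, next_actions):
    phi = q_net.features(obs, actions)                  # phi(s, a)
    phi_next = q_net.features(next_obs, next_actions)   # phi(s', a')
    return (phi * phi_next).sum(dim=-1).mean()
```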
Dec 8, 2021 5 tweets 3 min read
Classifiers can act as value functions, when a task is specified with a set of example outcomes. We call this method RCE. It's simple and works well! Find out more at @ben_eysenbach's poster Wed 12/8 4:30 pm at @NeurIPSConf, and full-length talk Fri 4:20 pm PT (oral sess 5)🧵> Poster link: neurips.cc/virtual/2021/p…
Oral link: neurips.cc/virtual/2021/o…
Blog post: ai.googleblog.com/2021/03/recurs…
Paper: arxiv.org/abs/2103.12656
Dec 7, 2021 8 tweets 3 min read
Intrinsic motivation allows RL to find complex behaviors without a hand-designed reward. What makes for a good objective? Information about the world can be translated into energy (or rather, work), so can an intrinsic objective accumulate information? That's the idea in IC2. A 🧵: The "Maxwell's demon" thought experiment describes how information translates into energy. In one version, the "demon" opens a gate when a particle approaches from one side, but not the other, sorting the particles into one chamber (against the diffusion gradient). This lowers entropy.
Nov 22, 2021 5 tweets 2 min read
Reusable datasets, such as ImageNet, are a driving force in ML. But how can we reuse data in robotics? In his new blog post, Frederik Ebert talks about "bridge data": multi-domain and multi-task datasets that boost generalization to new tasks: bair.berkeley.edu/blog/2021/11/1…

A thread: The reason reusing data in robotics is hard is that everyone has a different lab environment, different tasks, etc. So to be reusable, the dataset needs to be both multi-domain *and* multi-task, so that it enables generalization across domains.
Nov 19, 2021 9 tweets 5 min read
We've updated Trajectory Transformer (Transformers + model-based RL) for the NeurIPS camera-ready, now with more complete results that include Ant Maze and value functions, along with a blog post summarizing the method!
arxiv.org/abs/2106.02039
bair.berkeley.edu/blog/2021/11/1…

A thread: Trajectory Transformer is a "one big dumb model" approach to model-based RL: every single dimension of every state and action is a (discrete) token in a huge sequence. The model doesn't distinguish between states, actions, and rewards; they're all just tokens.
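A hedged sketch of this tokenization step; the uniform binning scheme, bin count, and value range are illustrative choices, not the paper's exact discretization:

```python
# Flatten a trajectory into one long token sequence: discretize every
# dimension of every state, action, and reward into integer bins.
import numpy as np

def tokenize_trajectory(states, actions, rewards, num_bins=100, low=-1.0, high=1.0):
    # states: (T, state_dim), actions: (T, act_dim), rewards: (T,)
    def discretize(x):
        x = np.clip(x, low, high)
        return np.floor((x - low) / (high - low) * (num_bins - 1)).astype(np.int64)

    tokens = []
    for t in range(len(states)):
        # The model sees no distinction between these: all are just tokens.
        tokens.extend(discretize(states[t]).tolist())
        tokens.extend(discretize(actions[t]).tolist())
        tokens.append(int(discretize(np.array([rewards[t]]))[0]))
    return np.array(tokens)
```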
Oct 19, 2021 4 tweets 2 min read
To make an existing model more robust at test time: augment a single test image in many ways, then finetune the model so that predictions on the augmented images "agree", minimizing marginal entropy. This is the idea behind MEMO (w/ Marvin Zhang & @chelseabfinn): arxiv.org/abs/2110.09506

🧵> MEMO is a simple test-time adaptation method that takes any existing model (no change during training) and finetunes it on one image (see the sketch after this list):
1. generate augmentations of test image
2. make predictions on all of them
3. minimize marginal entropy of these predictions (make them similar)
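A minimal sketch of those three steps; the augmentation function, optimizer settings, and number of augmentations are placeholders:

```python
# MEMO-style test-time adaptation on a single image.
import torch
import torch.nn.functional as F

def memo_adapt(model, image, augment, optimizer, num_augs=32):
    # 1. Generate augmentations of the single test image.
    augs = torch.stack([augment(image) for _ in range(num_augs)])
    # 2. Make predictions on all of them.
    probs = F.softmax(model(augs), dim=1)
    # 3. Minimize the entropy of the marginal (average) prediction, which
    #    encourages the augmented predictions to agree and be confident.
    marginal = probs.mean(dim=0)
    loss = -(marginal * marginal.clamp_min(1e-8).log()).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return model
```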
Oct 13, 2021 7 tweets 3 min read
Implicit Q-learning, or "I can't believe it's not SARSA": state-of-the-art offline RL results, fast and easy to implement; almost SARSA, but with a different loss to provide "implicit policy improvement": arxiv.org/abs/2110.06169

w/ @ikostrikov, @ashvinair
🧵-> Here is the idea: if we want to prevent *all* OOD action issues in offline RL, we could use *only* actions in the dataset. That leads to a SARSA update, which is very stable. But it learns the *behavior policy* value function, not the optimal value function.
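A sketch of the expectile-regression value loss that provides the "implicit policy improvement": like SARSA, Q is only ever queried at dataset actions, but the asymmetric loss makes V(s) track an upper expectile of Q(s, a). The expectile parameter and network interfaces are illustrative:

```python
# IQL-style expectile value loss: weight positive errors (q > v) by tau and
# negative errors by (1 - tau), so V regresses toward an upper expectile of Q.
import torch

def expectile_value_loss(q_net, v_net, obs, actions, tau=0.7):
    with torch.no_grad():
        q = q_net(obs, actions)      # Q evaluated only at dataset actions
    diff = q - v_net(obs)
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()
```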
Sep 23, 2021 6 tweets 4 min read
Offline RL lets us run RL without active online interaction, but tuning hyperparameters, model capacity, etc. still requires rollouts or validation tasks. In our new paper, we propose guidelines for *fully offline* tuning for algorithms like CQL arxiv.org/abs/2109.10813

A thread: In supervised learning, we have training and validation sets, and this works great for tuning. There is no equivalent in RL, which makes tuning hard. However, with CQL, when there is too much or too little model capacity, we get very characteristic estimated-vs-true Q-value curves.