Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Sergey Levine

@svlevine

Nov 19, 2021 • 9 tweets • 5 min read • Read on X

Scrolly

We've updated Trajectory Transformer (Transformers + model-based RL) for NeurIPS camera-ready, now with more complete results that include Ant Maze, including value functions, and also a blog post summarizing the method!
arxiv.org/abs/2106.02039
bair.berkeley.edu/blog/2021/11/1…

A thread:

Trajectory transformer is a "one big dumb model" approach to model-based RL: every single dimension of every state and action is a (discrete) token in a huge sequence. The model doesn't distinguish between states vs actions vs rewards, they're all just tokens.

Although the Transformer is "monolithic", it *discovers* things like near-Markovian attention patterns (left) and a kind of "action smoothing" (right), where sequential actions are correlated to each other. So the Transformer learns about the structure in RL, to a degree.

It also makes *very* long-horizon rollouts successfully, far longer than standard autoregressive models p(s'|s,a). So something about a big "dumb" model works very well for modeling complex dynamics, suggesting it might work very well for model-based RL.

For control, we can simply run beam search, using reward instead of likelihood as the score. Of course, we could use other planners too. On the (comparatively easy) D4RL locomotion tasks, Trajectory Transformer is on par with the best prior method (CQL).

But if we *combine* Trajectory Transformer with a good Q-function (e.g., from IQL), we can solve the much more challenging Ant Maze tasks with state-of-the-art results, much better than all prior methods. Ant Maze is much harder, because it requires temporal compositionality.

This is significant because only dynamic programming methods perform well on Ant Maze (e.g., Decision Transformer is on par with simple behavioral cloning) -- to our knowledge Trajectory Transformer + IQL is the first model-based approach that improves over pure DP on these tasks

This is joint work with Michael Janner & Qiyang Li, accepted for a spotlight presentation at NeurIPS 2021:
trajectory-transformer.github.io
arxiv.org/abs/2106.02039
Code: github.com/JannerM/trajec…

@ikostrikov

Also, if you want to read the paper that we "borrowed" the Q-function from for Ant Maze, it's here: arxiv.org/abs/2110.06169

@ikostrikov makes some really nice Q-functions😉

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @svlevine

Sergey Levine

@svlevine

Jun 26

If you have a policy that uses diffusion/flow (e.g. diffusion VLA), you can run RL where the actor chooses the noise, which is then denoised by the policy to produce an action. This method, which we call diffusion steering (DSRL), leads to a remarkably efficient RL method! 🧵👇

DSRL trains an actor and Q-function, treating the diffusion noise as the action space. Because samples from the noise prior map to reasonable actions for the policy, DSRL essentially explores "inside" the set of reasonable pre-trained behaviors, making it extremely efficient.

DSRL learns essentially in real time, with good results in as little as 50 trials (it's so efficient that a person can literally sit in front of the robot and push a button to assign sparse rewards).

Read 4 tweets

Sergey Levine

@svlevine

May 28

Fun project at PI: knowledge insulation for VLAs. We figured out how to train VLAs with cont. actions much more effectively by insulating the VLM and training it with discrete actions, while action expert learns on top. 5-7x faster, and importantly way better language following 👇

The idea: when we train VLA, we put a little “motor cortex” on top of the VLM (the “action expert”), but it is initialized from scratch, so its gradients mess up the VLM backbone. We can put a stop grad to prevent this, but we still need VLM reps to adapt to the robot.

So what do we do? We add discrete action losses to the VLM backbone to get good representations, even as the action expert learns on top! This makes it train way faster, preserves web-scale knowledge, and improves performance.

Read 5 tweets

Sergey Levine

@svlevine

Jul 31, 2024

Can VLMs enable robots to autonomously improve? In our new work we ran a fleet of robot arms to collect autonomous data with VLM-proposed tasks and showed that robots can keep getting better as they are deployed, without supervision:

🧵👇 auto-improvement.github.io

The idea: use VLMs to propose possible semantic tasks to do, then use a diffusion model to synthesize an image of the proposed task, use this image as a goal for a goal conditioned policy, and then improve the goal conditioned policy from the resulting experience.

This works very well because the goal-conditioned policy can self-improve without any human supervision, while the VLM and diffusion model leverages Internet-scale pretraining. So every component either improves through self-supervision or benefits from pretraining (or both).

Read 7 tweets

Sergey Levine

@svlevine

Feb 23, 2023

@GoogleAI

Pretraining on large datasets is powerful, it enables learning new tasks quickly (e.g., from BERT, LLMs, etc.). Can we do the same for RL, pretrain & finetune rapidly to new tasks? Scaled Q-Learning aims to unlock this ability, now on @GoogleAI blog:
ai.googleblog.com/2023/02/pre-tr…
👇

The idea is very simple: pretrain a large ResNet-based Q-function network with conservative Q-learning (CQL) with several design decisions to ensure it learns at scale. Pretrain on ~40 games with highly suboptimal data, then finetune to new games with offline or online data.

Performance on training games is very good, even from highly suboptimal data. With near optimal data, this outperforms non-Q-learning methods (e.g., BC, decision transformers) even vs models 2.5x bigger (DT 200M), on suboptimal data it gets more than double the score!

Read 6 tweets

Sergey Levine

@svlevine

Oct 10, 2022

General Navigation Models (GNM) are general-purpose navigation backbones that can drive many robots. It turns out that simple goal-conditioned policies can be trained on multi-robot datasets and generalize in zero-shot to entirely new robots!

sites.google.com/view/drive-any…

Thread>

The GNM architecture we use is simple: a model that takes in a current image, a goal image, and a temporal context (stack of frames) that tells the model how the robot behaviors (which it uses to infer size, dynamics, etc.). With a topological graph, this lets it drive the robot.

The key is that the GNM is trained on data from many robots: big vehicles (ATVs, etc.), small ground robots, even little RC cars. All data is treated the same way: the model just learns to directly generalize over robot types, learning general navigational skills.

Read 6 tweets

Sergey Levine

@svlevine

Jun 22, 2022

What do Lyapunov functions, offline RL, and energy based models have in common? Together, they can be used to provide long-horizon guarantees by "stabilizing" a system in high density regions! That's the idea behind Lyapunov Density Models: sites.google.com/berkeley.edu/l…

A thread:

Basic question: if I learn a model (e.g., dynamics model for MPC, value function, BC policy) on data, will that model be accurate when I run it (e.g., to control my robot)? It might be wrong if I go out of distribution, LDMs aim to provide a constraint so they don't do this.

By analogy (which we can make precise!) Lyapunov functions tell us how to stabilize around a point in space (i.e., x=0). What if we want is to stabilize in high density regions (i.e., p(s) >= eps). Both require considering long horizon outcomes though, so we can't just be greedy!

Read 8 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Sergey Levine

Try unrolling a thread yourself!

More from @svlevine

Sergey Levine

Sergey Levine

Sergey Levine

Sergey Levine

Sergey Levine

Sergey Levine

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!