We've updated Trajectory Transformer (Transformers + model-based RL) for the NeurIPS camera-ready, now with more complete results, including Ant Maze experiments that incorporate value functions, plus a blog post summarizing the method!
arxiv.org/abs/2106.02039
bair.berkeley.edu/blog/2021/11/1…

A thread:
Trajectory Transformer is a "one big dumb model" approach to model-based RL: every single dimension of every state and action is a (discrete) token in a huge sequence. The model doesn't distinguish between states, actions, and rewards; they're all just tokens.
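To make the tokenization concrete, here's a minimal sketch (hypothetical helper names; uniform per-dimension binning assumed, the paper also considers other discretization schemes):
```python
import numpy as np

def discretize(x, low, high, n_bins=100):
    """Map a continuous value to an integer token in [0, n_bins)."""
    x = float(np.clip(x, low, high))
    return int((x - low) / (high - low + 1e-8) * (n_bins - 1))

def trajectory_to_tokens(states, actions, rewards, bounds, n_bins=100):
    """Flatten a trajectory into one long token stream:
    [s_0 dims..., a_0 dims..., r_0, s_1 dims..., a_1 dims..., r_1, ...].
    The Transformer sees only this stream; nothing tells it which tokens
    are state dimensions, action dimensions, or rewards."""
    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens += [discretize(v, *bounds["state"][i], n_bins=n_bins) for i, v in enumerate(s)]
        tokens += [discretize(v, *bounds["action"][i], n_bins=n_bins) for i, v in enumerate(a)]
        tokens.append(discretize(r, *bounds["reward"], n_bins=n_bins))
    return tokens
```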
Although the Transformer is "monolithic", it *discovers* structure such as near-Markovian attention patterns and a kind of "action smoothing", where consecutive actions are correlated with each other. So the Transformer learns about the structure of the RL problem, to a degree.
It also produces *very* long-horizon rollouts successfully, far longer than standard single-step dynamics models p(s'|s,a) can sustain. So something about a big "dumb" model works very well for modeling complex dynamics, which suggests it should also work well for model-based RL.
For control, we can simply run beam search, using reward instead of likelihood as the score. Of course, we could use other planners too. On the (comparatively easy) D4RL locomotion tasks, Trajectory Transformer is on par with the best prior method (CQL).
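Roughly, the planning loop looks like the sketch below (assuming a hypothetical `model.sample_transition` helper; the actual implementation works at the level of individual tokens, but the rank-by-reward idea is the same):
```python
def plan_with_beam_search(model, state, horizon=15, beam_width=64, n_expand=4, discount=0.99):
    """Plan by beam search over model rollouts, ranking candidates by
    accumulated predicted reward rather than by sequence likelihood.
    `model.sample_transition(state, prefix)` is a hypothetical helper that
    samples the next (action, reward, next_state) given the rollout so far."""
    beams = [([], state, 0.0)]  # (action sequence, current state, total reward)
    for t in range(horizon):
        candidates = []
        for actions, s, score in beams:
            for _ in range(n_expand):
                a, r, s_next = model.sample_transition(s, actions)
                candidates.append((actions + [a], s_next, score + discount ** t * r))
        # keep the highest-*reward* sequences, not the most likely ones
        beams = sorted(candidates, key=lambda c: c[2], reverse=True)[:beam_width]
    best_actions, _, _ = max(beams, key=lambda c: c[2])
    return best_actions[0]  # execute the first action, then replan (MPC-style)
```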
But if we *combine* Trajectory Transformer with a good Q-function (e.g., from IQL), we can solve the much more challenging Ant Maze tasks with state-of-the-art results, much better than all prior methods. Ant Maze is much harder, because it requires temporal compositionality.
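One hedged way to picture the combination (a sketch under assumed helper names, not the exact procedure from the paper): keep the same reward-ranked beam search, but add a terminal value from the IQL-trained Q-function to each candidate's score, so short rollouts are still credited with what happens beyond the planning horizon.
```python
def score_rollout(rewards, final_state, final_action, q_fn, discount=0.99):
    """Score a candidate rollout by its discounted predicted rewards plus a
    terminal value from a learned Q-function (e.g., one trained with IQL).
    `q_fn(state, action)` is a hypothetical callable returning a scalar."""
    score = sum(discount ** t * r for t, r in enumerate(rewards))
    score += discount ** len(rewards) * q_fn(final_state, final_action)
    return score
```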
This is significant because, until now, only dynamic programming methods have performed well on Ant Maze (e.g., Decision Transformer is on par with simple behavioral cloning). To our knowledge, Trajectory Transformer + IQL is the first model-based approach that improves over pure DP on these tasks.
This is joint work with Michael Janner & Qiyang Li, accepted for a spotlight presentation at NeurIPS 2021:
trajectory-transformer.github.io
arxiv.org/abs/2106.02039
Code: github.com/JannerM/trajec…
Also, if you want to read the paper whose Q-function we "borrowed" for Ant Maze, it's here: arxiv.org/abs/2110.06169

@ikostrikov makes some really nice Q-functions😉

More from @svlevine

19 Oct
To make an existing model more robust at test time: augment a single test image in many ways, then finetune the model so that its predictions on the augmented images "agree", by minimizing their marginal entropy. This is the idea behind MEMO (w/ Marvin Zhang & @chelseabfinn): arxiv.org/abs/2110.09506

🧵>
MEMO is a simple test-time adaptation method that takes any existing model (no change during training) and finetunes it on a single image:
1. generate augmentations of the test image
2. make predictions on all of them
3. minimize the marginal entropy of these predictions so that they agree (a rough sketch follows below)
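A minimal PyTorch-style sketch of that loop (hypothetical `augment` function and optimizer choice; the paper pairs this with specific augmentation and optimizer settings):
```python
import torch
import torch.nn.functional as F

def memo_adapt_single_image(model, optimizer, image, augment, n_aug=32, n_steps=1):
    """Test-time adaptation on a single image by marginal entropy minimization.
    `augment` is any stochastic augmentation mapping an image tensor to a new
    view of it; `model` maps a batch of images to class logits."""
    model.train()
    for _ in range(n_steps):
        views = torch.stack([augment(image) for _ in range(n_aug)])   # 1. augment
        probs = F.softmax(model(views), dim=-1)                       # 2. predict on all views
        marginal = probs.mean(dim=0)                                  # average predictive distribution
        entropy = -(marginal * torch.log(marginal + 1e-8)).sum()      # 3. marginal entropy
        optimizer.zero_grad()
        entropy.backward()                                            # make the views agree
        optimizer.step()
    model.eval()
    with torch.no_grad():
        return model(image.unsqueeze(0)).argmax(dim=-1)               # adapted prediction
```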
This can significantly improve a model's robustness to OOD inputs. On ImageNet-C, MEMO fixes mistakes the model would have made without it. This doesn't require additional assumptions: training is exactly the same, and adaptation operates on a single test image.
23 Sep
Offline RL lets us run RL without active online interaction, but tuning hyperparameters, model capacity, etc. still requires rollouts or validation tasks. In our new paper, we propose guidelines for *fully offline* tuning of algorithms like CQL: arxiv.org/abs/2109.10813

A thread:
In supervised learning, we have training and validation sets, and this works great for tuning. There is no equivalent in RL, which makes tuning hard. However, with CQL, when there is too much or too little model capacity, we get very characteristic estimated vs. true Q-value curves.
Of course, the true return is unknown during offline training, but we can still use our understanding of the trends in estimated Q-values to provide guidelines for adjusting model capacity. These guidelines are not guaranteed to work, but they seem to work well in practice.
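As a very rough illustration of the kind of signal involved (hypothetical helpers; the actual guidelines in the paper are more specific), one could log the average estimated Q-value over the offline dataset during training and inspect its trend:
```python
def average_estimated_q(q_fn, dataset):
    """Average estimated Q-value over (state, action) pairs in the offline
    dataset; logging this across training steps gives the kind of curve whose
    shape (e.g., runaway growth vs. flattening out) the guidelines build on."""
    return sum(q_fn(s, a) for s, a in dataset) / len(dataset)
```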
25 Jul
An "RL" take on compression: "super-lossy" compression that changes the image, but preserves its downstream effect (i.e., the user should take the same action seeing the "compressed" image as when they saw original) sites.google.com/view/pragmatic…

w @sidgreddy & @ancadianadragan

🧵>
The idea is pretty simple: we use a GAN-style loss to classify whether or not the user would have taken the same downstream action upon seeing the compressed image. An action could be a button press when playing a video game, or a click/decision on a website.
The compression itself is done with a generative latent variable model (we use StyleGAN, but VAEs or flows would work great too). PICO basically decides to throw out the bits that it determines (via its GAN loss) won't change the user's downstream decision.
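A hedged sketch of the underlying consistency idea (hypothetical names throughout; PICO itself learns this signal with a GAN-style discriminator rather than an explicit model of the user):
```python
import torch
import torch.nn.functional as F

def action_consistency_loss(user_model, original, compressed):
    """Penalize compressed images that would change the user's downstream
    action. `user_model` is a hypothetical stand-in that maps an image batch
    to action logits; PICO replaces this with a learned discriminator that
    judges whether original and compressed images lead to the same action."""
    with torch.no_grad():
        target = F.softmax(user_model(original), dim=-1)     # action distribution on originals
    pred = F.log_softmax(user_model(compressed), dim=-1)     # action distribution on compressions
    return F.kl_div(pred, target, reduction="batchmean")     # mismatch penalty
```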
23 Jul
RAIL will be presenting a number of exciting late-breaking poster results at the RL4RealLife workshop at #ICML2021 (8 pm PT today!): sites.google.com/view/RL4RealLi…

Algorithms for real-world RL with mobile manipulators, lifelong meta-learning methods, and principled multi-task data sharing.

A thread:
We'll show how RL can control robots that learn to clean up a room, entirely in the real world. By Charles Sun, @ColinearDevin, @abhishekunique7, @jendk3r, @GlenBerseth.
We'll present CoMPS, an algorithm for online continual meta-learning, where an agent meta-learns tasks one by one, with each task accelerating future tasks. By @GlenBerseth, WilliamZhang365, @chelseabfinn.
23 Jul
In RL, "implicit regularization" that helps deep learning find good solutions can actually lead to huge instability. See @aviral_kumar2 talk on DR3:
7/23 4pm PT RL for real: icml.cc/virtual/2021/w…
7/24 5:45pm PT Overparameterization WS talk icml.cc/virtual/2021/w…
#ICML2021

🧵>
You can watch the talk in advance here:
And then come discuss the work with Aviral at the poster sessions! This work is not released yet, but it will be out shortly.

We're quite excited about this result, and I'll try to explain why.
Deep networks are overparameterized, meaning there are many parameter vectors that fit the training set. So why don't they overfit? While there are many possible explanations, they all revolve around some kind of "implicit regularization" that leads to solutions that generalize well.
18 Jul
Can we devise a more tractable RL problem if we give the agent examples of successful outcomes (states, not demos)? In MURAL, we show that uncertainty-aware classifiers trained with (meta) NML make RL much easier. At #ICML2021
arxiv.org/abs/2107.07184

A (short) thread:
The website has a summary: sites.google.com/view/mural-rl

If the agent gets some examples of high-reward states, we can train a classifier to automatically provide shaped rewards (this is similar to methods like VICE). However, a standard classifier is not necessarily well shaped.
This is where the key idea in MURAL comes in: use normalized maximum likelihood (NML) to train a classifier that is aware of its uncertainty. Label each query state as either positive (success) or negative (failure), and use the ratio of the resulting likelihoods as the reward (sketched below)!
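Roughly, the reward computation looks like this sketch (hypothetical `make_classifier`/`train_fn`/`predict_fn` helpers; MURAL amortizes the retraining with meta-learning rather than refitting from scratch for every query state):
```python
def cnml_success_reward(make_classifier, train_fn, predict_fn, labeled_data, query_state):
    """Conditional-NML-style reward (hypothetical helpers throughout):
    fit one classifier with the query state labeled as success and one with it
    labeled as failure, then return the normalized success likelihood.
    States the classifier is uncertain about get rewards near 0.5, which
    yields a much better-shaped reward than a single standard classifier."""
    likelihoods = []
    for label in (1, 0):                                   # 1 = success, 0 = failure
        clf = make_classifier()
        train_fn(clf, labeled_data + [(query_state, label)])
        p_success = predict_fn(clf, query_state)           # P(success | query_state)
        likelihoods.append(p_success if label == 1 else 1.0 - p_success)
    p_pos, p_neg = likelihoods
    return p_pos / (p_pos + p_neg)                         # normalized likelihood as reward
```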