Deep RL is hard: lots of hyperparameter tuning, instability. Perhaps there is a reason for this? It turns out that the same mechanism that makes supervised deep learning work well can make deep RL work poorly, leading to feature vectors that grow out of control: arxiv.org/abs/2112.04716

Let me explain:
Simple test: compare offline SARSA vs. offline TD. TD backs up with the behavior policy, so the *same* policy is used for the backup; the only difference is that SARSA uses the actions stored in the dataset, while TD samples *new* actions from that same distribution. The plot in the paper tracks phi(s,a)·phi(s',a'): the dot product of current and next state-action features.
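A toy sketch of the distinction, in a hypothetical tabular setup (all names here are made up for illustration; this is just to show where the two backups differ, not the paper's experimental code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 states, 3 actions, behavior policy pi_b(a|s).
n_states, n_actions, gamma = 4, 3, 0.99
pi_b = rng.dirichlet(np.ones(n_actions), size=n_states)  # behavior policy
Q = rng.normal(size=(n_states, n_actions))               # current Q estimates

# One logged transition (s, a, r, s'), with a' ~ pi_b(.|s') stored in the dataset.
s, a, r, s_next = 0, 1, 0.5, 2
a_logged = rng.choice(n_actions, p=pi_b[s_next])

# Offline SARSA: back up with the action actually recorded in the dataset.
sarsa_target = r + gamma * Q[s_next, a_logged]

# Offline TD: back up with a *freshly resampled* action from the same pi_b.
a_resampled = rng.choice(n_actions, p=pi_b[s_next])
td_target = r + gamma * Q[s_next, a_resampled]
```

Both targets have the same expectation under pi_b; only the sampling differs. That is what makes the experiment a clean test: any divergence in behavior comes from resampling, not from a different backup policy.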
Well, that's weird. Why do TD feature dot products grow and grow until the method becomes unstable, while SARSA's stay flat? To understand this, we must understand implicit regularization, which is what makes overparameterized models like deep nets avoid overfitting.
When training with SGD, deep nets don't find just *any* solution, but a well-regularized solution: SGD finds lower-norm solutions that then generalize well (see the derived regularizer in the paper). We might expect the same thing to happen when training with RL, and hence that deep RL will work well.
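Schematically, the label-noise SGD analyses this builds on show that, near a zero-loss solution, SGD implicitly penalizes the gradient norms of the network's outputs (constants omitted; this is a paraphrase, not the paper's exact expression):

```latex
\mathcal{R}_{\text{SL}}(\theta) \;\propto\; \sum_{i=1}^{n} \left\| \nabla_\theta f_\theta(x_i) \right\|_2^2
```

Penalizing output gradients keeps the learned function flat around the training points, which is one way to see why SGD's solutions generalize well.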
But if we apply the same analysis to deep RL that works for supervised learning, we can derive what the "implicit regularizer" for TD learning looks like. And it's not pretty: the first term looks like the supervised one, but the second term blows up feature dot products, exactly as we see in practice!
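Schematically, the TD version picks up an extra cross term coupling the gradients at consecutive state-action pairs (again a paraphrase with constants folded into c; the exact coefficients and stop-gradient placement are in the paper):

```latex
\mathcal{R}_{\text{TD}}(\theta) \;\approx\; \sum_i \left\| \nabla_\theta Q_\theta(s_i,a_i) \right\|_2^2 \;+\; c \sum_i \nabla_\theta Q_\theta(s_i,a_i)^\top \nabla_\theta Q_\theta(s'_i,a'_i)
```

Restricted to the last-layer weights, the gradient of Q(s,a) is just phi(s,a), so the second term reduces to exactly the dot product phi(s,a)·phi(s',a') that grows in the SARSA-vs-TD experiment above.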
In practice, we can simply add some *explicit* regularization on the features to counteract this nasty implicit regularizer. We call this DR3. It simply minimizes these feature dot products. Unlike normal regularizers, DR3 actually *increases* model capacity!
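A minimal sketch of that idea, assuming we can read out the penultimate-layer features (the function name and coefficient here are hypothetical, for illustration; see the paper for the real method and its coefficient choices):

```python
import numpy as np

def dr3_penalty(phi_sa, phi_next):
    """DR3-style explicit regularizer (sketch): mean dot product between
    features of (s, a) and features of (s', a'), added to the TD loss."""
    return np.mean(np.sum(phi_sa * phi_next, axis=1))

# Toy batch: features from a hypothetical encoder, batch of 32, dim 8.
rng = np.random.default_rng(0)
phi_sa = rng.normal(size=(32, 8))    # phi(s, a)
phi_next = rng.normal(size=(32, 8))  # phi(s', a')

td_loss = 0.123  # stand-in for the usual TD error term
c0 = 0.01        # regularization coefficient (a tuning knob, not from the thread)
total_loss = td_loss + c0 * dr3_penalty(phi_sa, phi_next)
```

In a deep-learning framework the penalty would simply be added to the critic loss before backprop, which is why it drops into existing offline RL algorithms with essentially no other changes.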
We can simply add DR3 to standard offline RL algorithms and boost their performance, with essentially no other modification. We hope that further research on overparameterization in deep RL will shed more light on why deep RL is unstable and how we can fix it!
This work was presented as a full-length oral at the NeurIPS Deep RL workshop: sites.google.com/view/deep-rl-w…
Paper: arxiv.org/abs/2112.04716

Work led by @aviral_kumar2, with @agarwl_, @tengyuma , @AaronCourville, @georgejtucker

