Intrinsic motivation allows RL to find complex behaviors without a hand-designed reward. What makes for a good objective? Information about the world can be translated into energy (or rather, work), so can an intrinsic objective reward accumulating information? That's the idea in IC2. A 🧵:
The "Maxwell's demon" thought exercise describes how information translates into energy. In one version, the "demon" opens a gate when a particle approaches from one side, but not the other, sorting them into one chamber (against the diffusion gradient). This lowers entropy.
This seems to violate the second law of thermodynamics. The explanation for why it does not is that the information about the particles is itself exchangeable with potential energy (that's a gross oversimplification, but this is just a tweet...).
The idea behind IC2 (intrinsic control via information capture) is to turn this "belief entropy minimization" intuition into a practical unsupervised RL algorithm! There are a few variants of this principle, but they all train a latent belief model & minimize its entropy.
Minimizing belief entropy forces the agent to do two things: (1) figure out where everything is (find & observe the "particles"); (2) put things into a more orderly configuration, so that the beliefs are *simpler* (lower entropy). The latter leads to emergent skills.
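In sketch form, the intrinsic reward is (something like) the negative entropy of the agent's latent belief. A minimal, hedged version, assuming a diagonal-Gaussian belief model; the `belief_model` interface and reward scaling are illustrative, not the exact IC2 objective:

```python
import math
import torch

def belief_entropy_reward(belief_logvar):
    # Entropy of a diagonal-Gaussian belief over the latent state:
    #   H = 0.5 * sum_i [ log(2*pi*e) + logvar_i ]
    # Lower belief entropy => higher intrinsic reward.
    const = math.log(2.0 * math.pi * math.e)
    entropy = 0.5 * (belief_logvar + const).sum(dim=-1)
    return -entropy

# Hypothetical usage: a learned filtering model maps the observation history
# to a posterior over the latent state; the agent is rewarded for shrinking it.
# mu, logvar = belief_model(obs_history)   # assumed interface
# r_intrinsic = belief_entropy_reward(logvar)
```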
For example, in a simple gridworld domain with moving objects that stop when the agent "tags" them, IC2 causes the agent to track down every object and tag it to stop its motion -- thus the agent always knows where everything is!
In the vizDoom video game environment, IC2 will look around to find enemies, and then shoot them, so that unpredictable enemies aren't there anymore (OK, this one is a bit violent... and maybe cause for some concern, but we'll find a way to apply it to more peaceful ends).
Classifiers can act as value functions when a task is specified with a set of example outcomes. We call this method RCE. It's simple and works well! Find out more at @ben_eysenbach's poster Wed 12/8 4:30 pm at @NeurIPSConf, and the full-length talk Fri 4:20 pm PT (oral sess 5)🧵>
The basic idea: instead of coding up a reward function by hand, provide a few example outcomes (states) that denote "success". RCE trains a classifier that predicts whether an action will lead to "success" *in the future*.
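A heavily simplified sketch of that flavor of training (not the exact RCE update, which uses importance-weighted recursive targets; the `classifier` network and batch tensors are assumed):

```python
import torch
import torch.nn.functional as F

def rce_style_loss(classifier, success_states, success_actions,
                   s, a, s_next, a_next, gamma=0.99):
    # (1) Example success outcomes are positives.
    pos_logits = classifier(success_states, success_actions)
    loss_pos = F.binary_cross_entropy_with_logits(
        pos_logits, torch.ones_like(pos_logits))

    # (2) Dataset transitions: bootstrap a soft target from the classifier at
    # the next state-action (discounted, so distant success counts for less),
    # so C(s, a) ends up predicting "will I reach success later?".
    with torch.no_grad():
        next_p = torch.sigmoid(classifier(s_next, a_next))
        target = gamma * next_p  # simplified stand-in for the RCE target
    logits = classifier(s, a)
    loss_boot = F.binary_cross_entropy_with_logits(logits, target)
    return loss_pos + loss_boot
```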
Reusable datasets, such as ImageNet, are a driving force in ML. But how can we reuse data in robotics? In his new blog post, Frederik Ebert talks about "bridge data": multi-domain and multi-task datasets that boost generalization to new tasks: bair.berkeley.edu/blog/2021/11/1…
A thread:
The reason reusing data in robotics is hard is that everyone has a different lab environment, different tasks, etc. So to be reusable, the dataset needs to be both multi-domain *and* multi-task, so that it enables generalization across domains.
So if I collect a little bit of data for my task in my new domain, can I use a reusable dataset to boost generalization on this task? This is not a trivial question, since the "bridge data" contains neither the new domain nor the new task.
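One natural way to use such data (a hypothetical sketch, not necessarily the exact recipe from the blog post) is joint training: every batch mixes a slice of the small target-domain dataset with the large bridge dataset:

```python
import random

def sample_joint_batch(bridge_data, target_data, batch_size=64, target_frac=0.3):
    # Oversample the small target-domain/task data relative to its size,
    # and fill the rest of the batch with the large multi-domain bridge data.
    n_target = int(batch_size * target_frac)
    batch = random.sample(target_data, n_target) \
          + random.sample(bridge_data, batch_size - n_target)
    random.shuffle(batch)
    return batch  # feed to the imitation-learning / policy training step
```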
We've updated Trajectory Transformer (Transformers + model-based RL) for the NeurIPS camera-ready, now with more complete results, including Ant Maze and value functions, plus a blog post summarizing the method! arxiv.org/abs/2106.02039 bair.berkeley.edu/blog/2021/11/1…
A thread:
Trajectory Transformer is a "one big dumb model" approach to model-based RL: every single dimension of every state and action is a (discrete) token in a huge sequence. The model doesn't distinguish between states, actions, and rewards; they're all just tokens.
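Concretely, a trajectory is flattened into one long discrete sequence. A minimal sketch of that kind of tokenization (uniform binning with made-up bounds; the paper's discretizer details may differ):

```python
import numpy as np

def tokenize_trajectory(states, actions, rewards, n_bins=100, low=-1.0, high=1.0):
    # Discretize every scalar dimension into one of n_bins tokens, then
    # interleave them as [s_0 dims, a_0 dims, r_0, s_1 dims, a_1 dims, r_1, ...].
    def to_tokens(x):
        x = np.clip(x, low, high)
        return np.floor((x - low) / (high - low) * (n_bins - 1)).astype(int)

    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens.extend(to_tokens(np.asarray(s)).tolist())
        tokens.extend(to_tokens(np.asarray(a)).tolist())
        tokens.extend(to_tokens(np.asarray([r])).tolist())
    return tokens  # one flat sequence; the Transformer treats all tokens alike
```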
Although the Transformer is "monolithic", it *discovers* things like near-Markovian attention patterns (left) and a kind of "action smoothing" (right), where sequential actions are correlated with each other. So the Transformer learns about the structure of the RL problem, to a degree.
To make an existing model more robust at test time: augment a single test image in many ways, then finetune the model so that predictions on the augmented images "agree", minimizing marginal entropy. This is the idea behind MEMO (w/ Marvin Zhang & @chelseabfinn): arxiv.org/abs/2110.09506
🧵>
MEMO is a simple test-time adaptation method that takes any existing model (no change during training) and finetunes it on one image:
1. generate augmentations of the test image
2. make predictions on all of them
3. minimize the marginal entropy of these predictions (make them similar)
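A minimal sketch of that loop (the `augment` function, optimizer, and hyperparameters here are placeholders, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def memo_adapt(model, image, augment, n_aug=32, lr=1e-4, steps=1):
    # Adapt a pretrained classifier on a single test image by minimizing the
    # entropy of the *marginal* (averaged) prediction over augmented copies.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        augs = torch.stack([augment(image) for _ in range(n_aug)])  # (n_aug, C, H, W)
        probs = F.softmax(model(augs), dim=-1)   # per-augmentation predictions
        marginal = probs.mean(dim=0)             # average prediction
        loss = -(marginal * torch.log(marginal + 1e-8)).sum()  # marginal entropy
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```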
This can significantly improve a model's robustness to OOD inputs. Here are examples on ImageNet-C where MEMO fixes mistakes the model would have made without it. This doesn't require additional assumptions: training is exactly the same, and it operates on a single test image.
Offline RL lets us run RL without active online interaction, but tuning hyperparameters, model capacity, etc. still requires rollouts or validation tasks. In our new paper, we propose guidelines for *fully offline* tuning for algorithms like CQL arxiv.org/abs/2109.10813
A thread:
In supervised learning, we have training and validation sets, and this works great for tuning. There is no equivalent in RL, which makes tuning hard. However, with CQL, when there is too much or too little model capacity, we get very characteristic estimated vs. true Q-value curves.
Of course, the true return is unknown during offline training, but we can still use our understanding of the trends of estimated Q-values to provide guidelines for how to adjust model capacity. These guidelines are not guaranteed to work, but seem to work well in practice.
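As a rough illustration, the quantity to eyeball is just the average estimated Q-value on dataset states, logged over training; this is a hypothetical monitoring helper, not the paper's exact procedure:

```python
import torch

@torch.no_grad()
def avg_estimated_q(q_network, policy, dataset_states):
    # Average estimated Q(s, pi(s)) over a batch of dataset states.
    # Logged every few thousand gradient steps, this traces the "estimated
    # Q-value curve" whose trend the guidelines use to adjust capacity,
    # with no online rollouts required.
    actions = policy(dataset_states)
    return q_network(dataset_states, actions).mean().item()
```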
An "RL" take on compression: "super-lossy" compression that changes the image, but preserves its downstream effect (i.e., the user should take the same action seeing the "compressed" image as when they saw original) sites.google.com/view/pragmatic…
The idea is pretty simple: we use a GAN-style loss to classify whether the user would have taken the same downstream action upon seeing the compressed image or not. An action could mean a button press when playing a video game, or a click/decision on a website.
The compression itself is done with a generative latent variable model (we use StyleGAN, but VAEs would work great too, as well as flows). PICO basically decides to throw out the bits that it determines (via its GAN loss) won't change the user's downstream decision.
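A minimal sketch of how such a GAN-style "same downstream action" loss could be wired up; the `compressor`, `discriminator`, and `user_model` components are assumed interfaces, and this is one plausible instantiation rather than the exact PICO objective:

```python
import torch
import torch.nn.functional as F

def pico_style_losses(compressor, discriminator, user_model, image):
    # Compress by re-generating the image from a truncated latent code.
    compressed = compressor(image)

    # The user (or a learned model of the user) picks an action for each version.
    action_orig = user_model(image)
    action_comp = user_model(compressed)

    # Discriminator: was this (image, action) pair produced from the original
    # or the compressed image? The compressor is trained to fool it, i.e. to
    # keep exactly the bits that matter for the downstream action.
    d_real = discriminator(image, action_orig)
    d_fake = discriminator(compressed, action_comp)
    disc_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
                F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    gen_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return disc_loss, gen_loss
```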