Can we devise a more tractable RL problem if we give the agent examples of successful outcomes (states, not demos)? In MURAL, we show that uncertainty-aware classifiers trained with (meta) NML make RL much easier. At #ICML2021 arxiv.org/abs/2107.07184
If the agent gets some examples of high reward states, we can train a classifier to automatically provide shaped rewards (this is similar to methods like VICE). But the reward from a standard classifier is not necessarily well shaped.
This is where the key idea in MURAL comes in: use normalized maximum likelihood (NML) to train a classifier that is aware of uncertainty. For each state, fit one classifier with that state labeled positive (success) and one with it labeled negative (failure), and use the ratio of likelihoods from these classifiers as reward!
This provides for exploration, since novel states will have higher uncertainty (hence reward closer to 50/50), while still shaping the reward to be larger closer to the example success states. This turns out to be a great way to do "directed" exploration.
Doing this tractably is hard, because we need two new classifiers for *every* state the agent visits. To make this efficient, we use meta-learning (MAML) to meta-train one classifier that adapts to every label for every state very quickly. We call this meta-NML.
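To make the "two classifiers per state" computation concrete, here is a minimal sketch of the naive (non-meta) NML reward. It assumes a tabular classifier (a per-state success frequency) instead of the neural network classifier used in MURAL, and the state names and function names are hypothetical. It is meant to show the refit-and-normalize step, not the actual implementation.

```python
from fractions import Fraction

def fit_tabular_mle(data):
    """Maximum-likelihood p(success | state) for a tabular model:
    the fraction of positive labels observed at each state."""
    counts = {}
    for state, label in data:
        pos, total = counts.get(state, (0, 0))
        counts[state] = (pos + label, total + 1)
    return {s: Fraction(pos, total) for s, (pos, total) in counts.items()}

def nml_reward(data, query):
    """Refit once per candidate label for the query state,
    then normalize the two resulting likelihoods."""
    # Classifier 1: pretend the query state was a success.
    p_if_success = fit_tabular_mle(data + [(query, 1)])[query]
    # Classifier 2: pretend the query state was a failure.
    p_if_failure = 1 - fit_tabular_mle(data + [(query, 0)])[query]
    return p_if_success / (p_if_success + p_if_failure)

# Hypothetical data: 3 example success states, 8 visits to a non-success state.
data = [("goal", 1)] * 3 + [("start", 0)] * 8
rewards = {s: nml_reward(data, s) for s in ("goal", "start", "never_seen")}
```

In this toy setting the reward is 4/5 at the success state, 1/10 at the heavily visited non-success state, and exactly 1/2 at the state that was never seen, which is the 50/50 behavior for novel states. A tabular model does not generalize between states, so it cannot show the shaping toward success states; that part relies on a classifier that generalizes.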
This ends up working very well across a wide range of manipulation, dexterous hand, and navigation tasks. To learn more about NML in deep learning, you can also check out Aurick Zhou's excellent blog post on this topic here: bairblog.github.io/2020/11/16/acn…
Since many people were interested in our recent offline MBO work, I'll also write about a recent paper on MBO by Justin Fu, which trains forward models for each possible objective value and uses them to compute a posterior via NML: arxiv.org/abs/2102.07970
A thread:
The basic idea, unlike COMs (which learn pessimistic models), is to get a posterior over values for a new design x. Justin's method (NEMO) trains a separate model *for every possible value y* for the design x (discretized), and uses the likelihood from these to get the posterior.
This corresponds to the normalized maximum likelihood (NML) distribution, which has appealing regret guarantees. In NEMO, we extend these to provide regret guarantees on offline MBO as well! This is more complex than COMs, but potentially more powerful as we get a full posterior.
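For reference, the conditional NML distribution for a new design x can be written as follows. The notation is an assumption chosen for this note (D is the offline dataset, p_θ is the model family, and y ranges over the discretized values), and it may not match the paper's notation exactly.

```latex
\hat{\theta}_{y} = \arg\max_{\theta} \Big[ \log p_{\theta}(y \mid x) + \sum_{(x_i, y_i) \in D} \log p_{\theta}(y_i \mid x_i) \Big],
\qquad
p_{\mathrm{NML}}(y \mid x) = \frac{p_{\hat{\theta}_{y}}(y \mid x)}{\sum_{y'} p_{\hat{\theta}_{y'}}(y' \mid x)}
```

Each candidate value y gets its own model, fit to the data plus the hypothetical pair (x, y), and the likelihoods those models assign to their own candidate values are normalized across y.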
Data-driven design is a lot like offline RL. Want to design a drug molecule, protein, or robot? Offline model-based optimization (MBO) tackles this, and our new algorithm, conservative objective models (COMs), provides a simple approach: arxiv.org/abs/2107.06882
A thread:
The basic setup: say you have prior experimental data D={(x,y)} (e.g., drugs you've tested). How to use it to get the best drug? Well, you could train a neural net f(x) = y, then pick the best x. This is a *very* bad idea, because you'll just get an adversarial example!
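A toy illustration of why "fit f(x), then pick the best x" goes wrong. This sketch swaps the neural net for a cubic polynomial and uses a made-up 1D objective (sin(3x)), so everything here is an assumption for illustration. The failure mode is the same though: the optimizer finds an input where the model is wrong in the optimistic direction.

```python
import numpy as np

def true_objective(x):
    """Hypothetical ground-truth objective (unknown to the designer)."""
    return np.sin(3.0 * x)

# Offline dataset D = {(x, y)}: designs were only ever tested in [-1, 1].
x_data = np.linspace(-1.0, 1.0, 20)
y_data = true_objective(x_data)

# "Train a model f(x) = y": least-squares cubic fit (stand-in for a neural net).
model = np.poly1d(np.polyfit(x_data, y_data, deg=3))

# "Then pick the best x": maximize the learned model over a wider design space.
candidates = np.linspace(-4.0, 4.0, 801)
x_best = candidates[np.argmax(model(candidates))]

predicted = model(x_best)        # what the model promises
actual = true_objective(x_best)  # what we would actually get
```

The chosen design lands far outside the range covered by the data, where the model predicts a value much larger than anything the true objective can produce.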
This is very important: lots of recent work shows how to train really good predictive models in biology, chemistry, etc. (e.g., AlphaFold), but using these for design runs into this adversarial example problem. This is actually very similar to problems we see in offline RL!
Empirical studies observed that generalization in RL is hard. Why? In a new paper, we provide a partial answer: generalization in RL induces partial observability, even for fully observed MDPs! This makes standard RL methods suboptimal. arxiv.org/abs/2107.06277
A thread:
Take a look at this example: the agent has a multi-step "guessing game" to label an image (not a bandit -- you get multiple guesses until you get it right!). We know in MDPs there is an optimal deterministic policy, so RL will learn a deterministic policy here.
Of course, this is a bad idea -- if it guesses wrong on the first try, it should not guess the same label again. But this task *is* fully observed -- there is a unique mapping from image pixels to labels. The problem is that we just don't know what it is from training data!
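A back-of-the-envelope version of the guessing game, as a sketch. It assumes the agent holds a belief over labels for an unseen image and that the true label is distributed according to that belief; the belief values, the step cap, and the function names are hypothetical.

```python
def expected_guesses_memoryless(belief, max_steps):
    """Deterministic policy that depends only on the image:
    it always guesses the most likely label, so a wrong first guess
    is repeated until the episode is cut off at max_steps."""
    p_top = max(belief)
    return p_top * 1 + (1 - p_top) * max_steps

def expected_guesses_adaptive(belief):
    """Policy that remembers past guesses: it tries labels from most
    to least likely and never repeats a label that was already wrong."""
    ranked = sorted(belief, reverse=True)
    return sum(k * p for k, p in enumerate(ranked, start=1))

belief = [0.6, 0.3, 0.1]  # hypothetical uncertainty over 3 labels
memoryless = expected_guesses_memoryless(belief, max_steps=10)
adaptive = expected_guesses_adaptive(belief)
```

With these numbers the memoryless policy needs 4.6 guesses in expectation, while the adaptive one needs 1.5. The adaptive policy conditions on its history of guesses, which is the kind of behavior you would expect in a partially observed problem.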
What did we learn from 5 years of robotic deep RL? My colleagues at Google and I tried to distill our experience into a review-style journal paper, covering some of the practical aspects of real-world robotic deep RL: arxiv.org/abs/2102.02915
🧵->
This is somewhat different from the usual survey/technical paper: we are not so much trying to provide the technical foundations of robotic deep RL, but rather describe the practical lessons -- the stuff one doesn't usually put in papers.
It's also a little bit out of date at this point (it's a journal paper, which took nearly a year to clear review, despite having very few revisions... but that's life I suppose). But we hope it will be pretty valuable to the community.
The idea: use RL + graph search to learn to reach visually indicated goals, using offline data. Starting with data in an environment (which in our case was previously collected for another project, BADGR), train a distance function and policy for visually indicated goals.
2/n
Once we have a distance function, policy, and graph, we search the graph to find a path for new visually indicated goals (images), and then execute the policy for the nearest node. A few careful design decisions (in the paper) make this work much better than prior work.
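The graph search step can be sketched with plain Dijkstra. This is a minimal illustration, not the method from the paper: the node names and edge weights are made up, and in the real system the nodes would be images and the weights would come from the learned distance function.

```python
import heapq

def shortest_path(edges, start, goal):
    """Dijkstra over a graph given as {node: {neighbor: predicted_distance}}.
    Returns the list of nodes from start to goal, or None if unreachable."""
    best = {start: 0.0}
    parent = {}
    frontier = [(0.0, start)]
    while frontier:
        dist, node = heapq.heappop(frontier)
        if node == goal:
            # Walk parent pointers back to the start.
            path = [node]
            while path[-1] != start:
                path.append(parent[path[-1]])
            return path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, cost in edges.get(node, {}).items():
            new_dist = dist + cost
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                parent[neighbor] = node
                heapq.heappush(frontier, (new_dist, neighbor))
    return None

# Hypothetical graph: weights stand in for learned distances between images.
edges = {
    "current": {"a": 2.0, "b": 5.0},
    "a": {"b": 1.0, "goal": 9.0},
    "b": {"goal": 3.0},
}
path = shortest_path(edges, "current", "goal")
next_waypoint = path[1]  # the subgoal handed to the goal-reaching policy
```

Here the planned path is current, a, b, goal, so the policy would first be asked to reach "a" rather than heading for the goal directly.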
My favorite part of @NeurIPSConf is the workshops, a chance to see new ideas and late-breaking work. Our lab will present a number of papers & talks at workshops:
thread below ->
meanwhile here is a teaser image :)
At the robot learning workshop, @katie_kang_ will present the best-paper-winning (congrats!!) “Multi-Robot Deep Reinforcement Learning via Hierarchically Integrated Models”: how to share modules between multiple real robots; recording here: (4:45pm PT 12/11)