Empirical studies have repeatedly found that generalization in RL is hard. Why? In a new paper, we provide a partial answer: generalization in RL induces partial observability, even for fully observed MDPs! This makes standard RL methods suboptimal.
arxiv.org/abs/2107.06277

A thread:
Take a look at this example: the agent plays a multi-step "guessing game" to label an image (not a bandit -- it gets multiple guesses until it gets it right!). We know that in MDPs there is an optimal deterministic policy, so RL will learn a deterministic policy here.
Of course, this is a bad idea -- if it guesses wrong on the first try, it should not guess the same label again. But this task *is* fully observed -- there is a unique mapping from image pixels to labels; the problem is that the agent simply cannot know what that mapping is from finite training data!
The guessing game is an MDP, but learning to guess from finite data implicitly becomes a POMDP -- what we call the epistemic POMDP, because it emerges from epistemic uncertainty. This is not unique to guessing: the same holds, e.g., for mazes in ProcGen, robotic grasping, etc.
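To make this concrete, here is a toy sketch (my own illustration, not code from the paper): a deterministic policy that commits to a single "best" guess keeps repeating it after it is rejected, while a policy that tracks an epistemic belief over labels rules out wrong guesses and must finish within num_labels steps.

```python
# Toy illustration of the guessing game (not code from the paper).
import numpy as np

rng = np.random.default_rng(0)
num_labels = 10
true_label = int(rng.integers(num_labels))               # unknown to the agent
uniform_belief = np.full(num_labels, 1.0 / num_labels)   # epistemic uncertainty over labels

def deterministic_policy(_belief):
    return 3                                             # MDP-style: always the same "best" guess

def belief_policy(belief):
    return int(np.argmax(belief))                        # POMDP-style: best guess under current belief

def play(policy, max_steps=20):
    belief = uniform_belief.copy()
    for step in range(1, max_steps + 1):
        guess = policy(belief)
        if guess == true_label:
            return step                                  # solved the guessing game
        belief[guess] = 0.0                              # wrong guess: eliminate it from the belief
        belief /= belief.sum()
    return max_steps                                     # never solved it

print("deterministic policy finished in", play(deterministic_policy), "steps")
print("belief-tracking policy finished in", play(belief_policy), "steps")
```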
This leads to some counterintuitive things. Look at the "zookeeper example" below: the optimal MDP strategy is to look at the map (which is fully observed) and go straight to the otters, but peeking through the windows generalizes much better, even though it is never optimal during training.
What is happening here is that generalizing well in RL requires taking epistemic (information-gathering) actions at test time, just as we would in a POMDP, even though such actions are never optimal at training time. Hence, MDP methods will not generalize as well as POMDP methods.
Based on this idea, we developed a new algorithm, LEEP, which uses epistemic-POMDP ideas to get better generalization. LEEP actually does *worse* on the training environments, but much better on test environments, as we would expect.
Unfortunately, solving (or even estimating) the epistemic POMDP is very hard, and LEEP makes some very crude approximations. Lots more research is needed on how to exploit the epistemic POMDP, and if we can, I think we can all make a lot of progress on generalization!
This was a really fun collaboration with @its_dibya, @jrahme0, @aviral_kumar2, @yayitsamyzhang, @ryan_p_adams -- a wonderful group to work with on generalization 🙂


More from @svlevine

16 Jul
Data-driven design is a lot like offline RL. Want to design a drug molecule, protein, or robot? Offline model-based optimization (MBO) tackles this, and our new algorithm, conservative objective models (COMs) provides a simple approach: arxiv.org/abs/2107.06882

A thread:
The basic setup: say you have prior experimental data D={(x,y)} (e.g., drugs you've tested). How do you use it to get the best drug? Well, you could train a neural net f(x) = y and then pick the best x. This is a *very* bad idea, because you'll just get an adversarial example!
This is very important: lots of recent work shows how to train really good predictive models in biology, chemistry, etc. (e.g., AlphaFold), but using these for design runs into this adversarial example problem. This is actually very similar to problems we see in offline RL!
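Here is a tiny sketch of that failure mode (my illustration, not code from the COMs paper): fit a proxy model fhat on offline data, then naively maximize it over x by gradient ascent. The optimizer drifts far outside the data distribution, where the model's prediction is high but the true value is terrible.

```python
# Naive offline MBO: maximize a learned proxy -- an illustrative sketch, not the paper's code.
import torch
import torch.nn as nn

torch.manual_seed(0)
true_f = lambda x: -(x - 1.0) ** 2                       # unknown ground-truth objective
x_data = torch.linspace(-2.0, 0.5, 64).unsqueeze(1)      # designs we have already evaluated
y_data = true_f(x_data)

fhat = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(fhat.parameters(), lr=1e-2)
for _ in range(2000):                                    # fit the proxy on the offline data
    opt.zero_grad()
    ((fhat(x_data) - y_data) ** 2).mean().backward()
    opt.step()

x = torch.zeros(1, 1, requires_grad=True)                # naive design optimization: argmax_x fhat(x)
x_opt = torch.optim.Adam([x], lr=0.1)
for _ in range(500):
    x_opt.zero_grad()
    (-fhat(x)).mean().backward()
    x_opt.step()

print("optimized x:", x.item())                          # typically pushed far outside [-2, 0.5]
print("proxy's prediction:", fhat(x).item())             # looks great to the model...
print("true value:", true_f(x).item())                   # ...but is actually terrible
```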
8 Feb
What did we learn from 5 years of robotic deep RL? My colleagues at Google and I tried to distill our experience into a review-style journal paper, covering some of the practical aspects of real-world robotic deep RL:
arxiv.org/abs/2102.02915

🧵->
This is somewhat different from the usual survey/technical paper: we are not so much trying to provide the technical foundations of robotic deep RL, but rather describe the practical lessons -- the stuff one doesn't usually put in papers.
It's also a little bit out of date at this point (it's a journal paper, which took nearly a year to clear review, despite having very few revisions... but that's life I suppose). But we hope it will be pretty valuable to the community.
18 Dec 20
RL enables robots to navigate real-world environments, with diverse visually indicated goals: sites.google.com/view/ving-robo…

w/ @_prieuredesion, B. Eysenbach, G. Kahn, @nick_rhinehart

paper: arxiv.org/abs/2012.09812

Thread below ->
The idea: use RL + graph search to learn to reach visually indicated goals, using offline data. Starting with data in an environment (which in our case was previously collected for another project, BADGR), train a distance function and policy for visually indicated goals.

2/n
Once we have a distance function, policy, and graph, we search the graph to find a path for new visually indicated goals (images), and then execute the policy for the nearest node. A few careful design decisions (in the paper) make this work much better than prior work.

3/n
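A rough sketch of the general "RL + graph search" recipe, as I understand it (my illustration, not the ViNG code; learned_distance and goal_policy are hypothetical stand-ins for the learned models): build a graph whose nodes are dataset observations and whose edge weights come from the learned distance function, plan a shortest path to the node closest to the goal image, and hand the next waypoint to the goal-conditioned policy.

```python
# Illustrative "RL + graph search" sketch; not the ViNG implementation.
import networkx as nx

def build_graph(dataset_obs, learned_distance, max_edge_dist=5.0):
    """Connect pairs of dataset observations whose predicted distance is small."""
    g = nx.DiGraph()
    g.add_nodes_from(range(len(dataset_obs)))
    for i, o_i in enumerate(dataset_obs):
        for j, o_j in enumerate(dataset_obs):
            if i != j:
                d = learned_distance(o_i, o_j)
                if d < max_edge_dist:
                    g.add_edge(i, j, weight=d)
    return g

def plan_and_act(g, dataset_obs, current_obs, goal_obs, learned_distance, goal_policy):
    """Find a path to the node nearest the goal image and act toward the next waypoint."""
    start = min(g.nodes, key=lambda n: learned_distance(current_obs, dataset_obs[n]))
    goal = min(g.nodes, key=lambda n: learned_distance(dataset_obs[n], goal_obs))
    path = nx.shortest_path(g, start, goal, weight="weight")
    next_waypoint = dataset_obs[path[1]] if len(path) > 1 else goal_obs
    return goal_policy(current_obs, next_waypoint)        # goal-conditioned policy executes the hop
```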
11 Dec 20
My favorite part of @NeurIPSConf is the workshops, a chance to see new ideas and late-breaking work. Our lab will present a number of papers & talks at workshops:

thread below ->

meanwhile here is a teaser image :)
At the robot learning workshop, @katie_kang_ will present the best-paper-winning (congrats!!) “Multi-Robot Deep Reinforcement Learning via Hierarchically Integrated Models”: how to share modules between multiple real robots; recording here: (16:45 PT 12/11)
At the deep RL workshop, Ben Eysenbach will talk about how MaxEnt RL is provably robust to certain types of perturbations. Contributed talk at 14:00 PT 12/11.
Paper: drive.google.com/file/d/1fENhHp…
Talk: slideslive.com/38941344/maxen…
10 Dec 20
Tonight 12/10 9pm PT, Aviral Kumar will present Model Inversion Networks (MINs) at @NeurIPSConf. Offline model-based optimization (MBO) that uses data to optimize images, controllers and even protein sequences!

paper: tinyurl.com/mins-paper
pres: neurips.cc/virtual/2020/p…

more->
The problem setting: given samples (x,y) where x represents some input (e.g., protein sequence, image of a face, controller parameters) and y is some metric (e.g., how well x does at some task), find a new x* with the best y *without access to the true function*.
Classically, model-based optimization methods would learn some proxy function (acquisition function) fhat(x) = y, and then solve x* = argmax_x fhat(x), but this can result in OOD inputs to fhat(x) when x is very high dimensional.
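As I understand the model-inversion idea (a simplified sketch, not the MINs code; shapes and architecture are placeholders): rather than ascending fhat(x) in a huge design space, learn the inverse map from scores (plus a latent) back to designs on the dataset, then decode candidates conditioned on a high target score.

```python
# Simplified model-inversion sketch; placeholder shapes, not the MINs implementation.
import torch
import torch.nn as nn

x_dim, z_dim = 32, 8
inverse_map = nn.Sequential(nn.Linear(1 + z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))
# (Training would fit inverse_map so that inverse_map(y, z) reproduces dataset designs with score y.)

def sample_designs(target_score, n_candidates=16):
    """Decode candidate designs conditioned on a desirably high target score."""
    y = torch.full((n_candidates, 1), float(target_score))
    z = torch.randn(n_candidates, z_dim)                  # latent noise for diverse candidates
    return inverse_map(torch.cat([y, z], dim=1))
```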
14 Oct 20
Greg Kahn's deep RL algorithm allows robots to navigate Berkeley's sidewalks! All the robot gets is a camera view, and a supervision signal indicating when a safety driver told it to stop.

Website: sites.google.com/view/sidewalk-…
Arxiv: arxiv.org/abs/2010.04689
(more below)
The idea is simple: a person follows the robot in a "training" phase (they could also watch remotely through the camera) and stops the robot when it does something undesirable -- much like a safety driver might stop an autonomous car.
The robot then learns to take the actions that are least likely to lead to disengagement. The result is a learned policy that can navigate hundreds of meters of Berkeley sidewalks from raw images alone, using a learned neural net, without any SLAM, localization, or mapping.
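A minimal sketch of that decision rule (my illustration, not the released code; disengagement_model is a hypothetical stand-in for the learned predictor): score each candidate action sequence by its predicted probability of triggering a disengagement, and execute the safest one.

```python
# Pick the candidate action sequence least likely to cause a disengagement -- illustrative sketch only.
import numpy as np

def choose_action_sequence(image, candidate_action_seqs, disengagement_model):
    """disengagement_model(image, seq) -> predicted probability the safety monitor stops the robot."""
    risks = np.array([disengagement_model(image, seq) for seq in candidate_action_seqs])
    return candidate_action_seqs[int(np.argmin(risks))]
```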
