What did we learn from 5 years of robotic deep RL? My colleagues at Google and I tried to distill our experience into a review-style journal paper, covering some of the practical aspects of real-world robotic deep RL:
arxiv.org/abs/2102.02915

🧵->
This is somewhat different from the usual survey/technical paper: we are not so much trying to provide the technical foundations of robotic deep RL as to describe the practical lessons -- the stuff one doesn't usually put in papers.
It's also a little out of date at this point (it's a journal paper, which took nearly a year to clear review despite having very few revisions... but that's life, I suppose). But we hope it will be pretty valuable to the community.
It is important in robotic RL to think about not just the math in the algorithms, but the practicalities of getting learning systems to work in the real world: ops (can the robot keep training for a long time), safety, resets, scalability, etc.
This project was led first and foremost by @julianibarz, with a wonderful group of colleagues who contributed portions of the paper: Jie Tan, @chelseabfinn, Mrinal Kalakrishnan, Peter Pastor.

• • •


More from Sergey Levine (@svlevine)

18 Dec 20
RL enables robots to navigate real-world environments to reach diverse, visually indicated goals: sites.google.com/view/ving-robo…

w/ @_prieuredesion, B. Eysenbach, G. Kahn, @nick_rhinehart

paper: arxiv.org/abs/2012.09812
video:

Thread below ->
The idea: use RL + graph search to learn to reach visually indicated goals, using offline data. Starting with data in an environment (which in our case was previously collected for another project, BADGR), train a distance function and policy for visually indicated goals.

2/n
Once we have a distance function, policy, and graph, we search the graph to find a path for new visually indicated goals (images), and then execute the policy for the nearest node. A few careful design decisions (in the paper) make this work much better than prior work.

3/n
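To make the graph-search step concrete, here is a minimal Python sketch under loud assumptions: the nodes are stand-in 2D points rather than images, `dist_fn` is plain Euclidean distance standing in for the learned distance function, and `max_edge` is a hypothetical threshold deciding which predicted distances count as "reachable". It runs Dijkstra's algorithm over the resulting graph; it is an illustration of the idea, not the paper's implementation.

```python
import heapq
import math


def build_graph(nodes, dist_fn, max_edge):
    """Connect every ordered pair of nodes whose (predicted) distance is at most max_edge."""
    graph = {i: [] for i in range(len(nodes))}
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes):
            if i != j:
                d = dist_fn(u, v)
                if d <= max_edge:
                    graph[i].append((j, d))
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns node indices from start to goal, or None if unreachable."""
    best = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > best[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in best:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def plan(nodes, dist_fn, obs, goal_obs, max_edge):
    """Localize the current and goal observations to their nearest nodes, then search."""
    graph = build_graph(nodes, dist_fn, max_edge)
    start = min(range(len(nodes)), key=lambda i: dist_fn(obs, nodes[i]))
    goal = min(range(len(nodes)), key=lambda i: dist_fn(nodes[i], goal_obs))
    return shortest_path(graph, start, goal)


# Hypothetical stand-in "observations": 2D points instead of camera images.
nodes = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0), (0.0, 2.0)]
path = plan(nodes, math.dist, obs=(0.1, 0.0), goal_obs=(2.0, 2.1), max_edge=1.1)
```

Here `path` is [0, 1, 2, 3, 4]: the route goes around the corner through recorded nodes, and node 5 is never used because no other node lies within `max_edge` of it. One plausible way to act on the plan is to command the goal-conditioned policy toward the next waypoint, `nodes[path[1]]`.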
11 Dec 20
My favorite part of @NeurIPSConf is the workshops, a chance to see new ideas and late-breaking work. Our lab will present a number of papers & talks at workshops:

thread below ->

meanwhile here is a teaser image :)
At the robot learning workshop, @katie_kang_ will present the best-paper-winning (congrats!!) “Multi-Robot Deep Reinforcement Learning via Hierarchically Integrated Models”: how to share modules between multiple real robots. Recording here: (4:45pm PT 12/11)
At the deep RL workshop, Ben Eysenbach will talk about how MaxEnt RL is provably robust to certain types of perturbations. Contributed talk at 2:00pm PT 12/11.
Paper: drive.google.com/file/d/1fENhHp…
Talk: slideslive.com/38941344/maxen…
10 Dec 20
Tonight 12/10 9pm PT, Aviral Kumar will present Model Inversion Networks (MINs) at @NeurIPSConf: offline model-based optimization (MBO) that uses data to optimize images, controllers, and even protein sequences!

paper: tinyurl.com/mins-paper
pres: neurips.cc/virtual/2020/p…

more->
The problem setting: given samples (x,y) where x represents some input (e.g., protein sequence, image of a face, controller parameters) and y is some metric (e.g., how well x does at some task), find a new x* with the best y *without access to the true function*.
Classically, model-based optimization methods would learn some proxy function (acquisition function) fhat(x) ≈ y and then solve x* = argmax_x fhat(x), but this can produce out-of-distribution (OOD) inputs to fhat(x) when x is very high-dimensional.
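The OOD failure mode of the classical fit-then-argmax recipe can already be seen in one dimension. This is a toy sketch: the true function and the linear least-squares proxy are hypothetical choices made for illustration, and the code shows the naive approach that MINs aims to improve on, not MINs itself.

```python
def true_f(x):
    # Hypothetical ground-truth objective; the optimizer is NOT allowed to query it.
    return -(x - 1.0) ** 2


def fit_linear(xs, ys):
    """Ordinary least-squares fit of a linear proxy fhat(x) = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return lambda x: w * x + b


# Offline dataset: samples only cover x in [-1.0, 0.5].
xs = [-1.0 + 0.1 * i for i in range(16)]
ys = [true_f(x) for x in xs]
fhat = fit_linear(xs, ys)

# Naive MBO: maximize the proxy over a much wider search box than the data covers.
grid = [-10.0 + 0.01 * i for i in range(2001)]
x_star = max(grid, key=fhat)
```

Because the proxy only ever saw x in [-1, 0.5], where the true function is increasing, its argmax lands at the edge of the search box (near x = 10), where the true value is far worse than any point in the dataset. In high dimensions the same extrapolation problem is much harder to spot.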
14 Oct 20
Greg Kahn's deep RL algorithm allows robots to navigate Berkeley's sidewalks! All the robot gets is a camera view and a supervision signal indicating when a safety driver told it to stop.

Website: sites.google.com/view/sidewalk-…
Arxiv: arxiv.org/abs/2010.04689
(more below)
The idea is simple: a person follows the robot in a "training" phase (could also watch remotely from the camera), and stops the robot when it does something undesirable -- much like a safety driver might stop an autonomous car.
The robot then tries to take the actions that are least likely to lead to a disengagement. The result is a learned policy that can navigate hundreds of meters of Berkeley sidewalks entirely from raw images, using only a learned neural net, without any SLAM, localization, etc.
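The action-selection rule can be sketched in a few lines. As a loud simplification, this hypothetical version replaces the learned neural net with a tabular, count-based estimate of disengagement probability over discretized states; the state and action names are made up. The robot then picks the candidate action with the lowest estimated probability.

```python
from collections import defaultdict


def fit_disengagement_model(logs, prior_disengage=1.0, prior_total=2.0):
    """Estimate P(disengage | state, action) from logged (state, action, was_stopped) tuples.

    Hypothetical stand-in for the learned predictor: simple smoothed counts,
    so unseen (state, action) pairs default to prior_disengage / prior_total = 0.5.
    """
    disengaged = defaultdict(float)
    visits = defaultdict(float)
    for state, action, was_stopped in logs:
        visits[(state, action)] += 1.0
        disengaged[(state, action)] += float(was_stopped)

    def prob(state, action):
        key = (state, action)
        return (disengaged.get(key, 0.0) + prior_disengage) / (visits.get(key, 0.0) + prior_total)

    return prob


def choose_action(prob, state, candidate_actions):
    # Take the action least likely to lead to a safety-driver disengagement.
    return min(candidate_actions, key=lambda a: prob(state, a))


# Hypothetical training-phase logs from a single discretized situation.
logs = (
    [("curb_on_left", "left", True)] * 3
    + [("curb_on_left", "straight", False)] * 4
    + [("curb_on_left", "right", True), ("curb_on_left", "right", False), ("curb_on_left", "right", False)]
)
prob = fit_disengagement_model(logs)
action = choose_action(prob, "curb_on_left", ["left", "straight", "right"])
```

With these logs the estimates are 0.8 for "left", about 0.17 for "straight", and 0.4 for "right", so `action` is "straight". A neural predictor over raw images plays the same role in the real system, generalizing across situations that were never logged exactly.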
13 Oct 20
Can we view RL as supervised learning, but where we also "optimize" the data? New blog post by Ben, Aviral, and Abhishek: bair.berkeley.edu/blog/2020/10/1…

The idea: modify (reweight, resample, etc.) the data so that supervised regression onto actions produces better policies. More below:
Standard supervised learning is reliable and simple, but of course if we have random or bad data, supervised learning of policies (i.e., imitation) won't produce good results. However, a number of recently proposed algorithms can allow this procedure to work.
What is needed is to iteratively "modify" the data to make it more optimal than the previous iteration. One way to do this is by conditioning the policy on something about the data, such as a goal or even a total reward value.
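Here is a minimal sketch of the reweighting variant on a toy one-step problem with four discrete actions. The reward table, noise level, and temperature are assumptions chosen for illustration. The weighting (exponentiated reward relative to the batch mean) is in the spirit of reward-weighted regression and is not taken from the blog post. Each iteration collects data with the current policy, reweights it, and fits a new policy by weighted maximum likelihood, which for a categorical policy is just normalized weighted counts.

```python
import math
import random

# Hypothetical toy task: one step, four discrete actions, noisy rewards.
TRUE_REWARD = {0: 0.1, 1: 0.5, 2: 1.0, 3: 0.3}


def collect(policy, n, rng):
    """Sample n actions from the current policy and observe noisy rewards."""
    actions = rng.choices(list(policy), weights=list(policy.values()), k=n)
    return [(a, TRUE_REWARD[a] + rng.gauss(0.0, 0.1)) for a in actions]


def reweight(data, temperature):
    """'Modify' the data: upweight samples whose reward beats the batch average."""
    baseline = sum(r for _, r in data) / len(data)
    return [(a, math.exp((r - baseline) / temperature)) for a, r in data]


def fit_policy(weighted, actions, smoothing=1e-3):
    """Supervised step: weighted maximum likelihood for a categorical policy."""
    totals = {a: smoothing for a in actions}
    for a, w in weighted:
        totals[a] += w
    z = sum(totals.values())
    return {a: t / z for a, t in totals.items()}


def iterate(num_iters=5, n=500, temperature=0.2, seed=0):
    rng = random.Random(seed)
    actions = list(TRUE_REWARD)
    policy = {a: 1.0 / len(actions) for a in actions}  # start from a random policy
    for _ in range(num_iters):
        data = collect(policy, n, rng)
        policy = fit_policy(reweight(data, temperature), actions)
    return policy


policy = iterate()
```

Even though the first batch comes from a uniformly random policy, each round of reweighting makes the data "more optimal" than the last, and the fitted policy quickly concentrates on action 2. Conditioning on a goal or a target return, as the tweet mentions, is a different way of modifying the data and is not shown here.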
25 Jun 20
Interested in trying out offline RL? Justin Fu's blog post on designing a benchmark for offline RL, D4RL, is now up: bair.berkeley.edu/blog/2020/06/2…

D4RL is quickly becoming the most widely used benchmark for offline RL research! Check it out here: github.com/rail-berkeley/…
An important consideration in D4RL is that datasets for offline RL research should *not* just come from near-optimal policies obtained with other RL algorithms, because this is not representative of how we would use offline RL in the real world. D4RL has a few types of datasets...
"Stitching" data provides trajectories that do not actually accomplish the task, but the dataset contains trajectories that accomplish parts of it. The offline RL method must stitch these together, attaining much higher reward than the best single trial in the dataset.
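A toy illustration of stitching, under stated assumptions: a hypothetical deterministic dataset of two trajectories, where one starts at S but never reaches the goal G and the other reaches G but starts elsewhere, both passing through a shared state M. Tabular value iteration restricted to the state-action pairs that appear in the data (a simplification, not how D4RL or any particular offline RL method is implemented) recovers the path S -> M -> G, which neither trajectory demonstrates on its own.

```python
import math

# Hypothetical dataset of (state, action, reward, next_state, done) transitions.
DATASET = [
    ("S", "right", 0.0, "M", False),  # trajectory 1 starts at S...
    ("M", "up", 0.0, "X", False),     # ...but wanders off to X (reward 0)
    ("Y", "right", 0.0, "M", False),  # trajectory 2 starts elsewhere...
    ("M", "down", 1.0, "G", True),    # ...and reaches the goal G (reward 1)
]


def offline_value_iteration(transitions, gamma=0.9, num_iters=50):
    """Q-value iteration using only (state, action) pairs seen in the data.

    Assumes deterministic transitions, so each (state, action) appears once.
    Restricting the max to dataset actions avoids querying unseen actions.
    """
    q = {(s, a): 0.0 for s, a, _, _, _ in transitions}
    for _ in range(num_iters):
        v = {}
        for (s, _a), val in q.items():
            v[s] = max(v.get(s, -math.inf), val)
        q = {
            (s, a): r + (0.0 if done else gamma * v.get(s2, 0.0))
            for s, a, r, s2, done in transitions
        }
    return q


def greedy_rollout(q, transitions, start, max_steps=10):
    """Follow the greedy dataset action from `start` using the logged dynamics."""
    step = {(s, a): (s2, done) for s, a, _, s2, done in transitions}
    path, s = [start], start
    for _ in range(max_steps):
        options = [(val, a) for (s_, a), val in q.items() if s_ == s]
        if not options:
            break
        _, a = max(options)
        s, done = step[(s, a)]
        path.append(s)
        if done:
            break
    return path


q = offline_value_iteration(DATASET)
route = greedy_rollout(q, DATASET, "S")
```

Here `route` is ["S", "M", "G"], which earns a return of 1 from S, while the only logged trajectory starting at S earns 0.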
