Is RL always data inefficient? Not necessarily. The Framework for Efficient Robotic Manipulation (FERM) shows that real robots can learn basic skills from pixels with sparse rewards in *30 minutes* using 1 GPU 🦾
Real-robot RL is challenging for a number of reasons, and data efficiency is chief among them. Common workarounds are training in simulation and transferring the learned policy to the real robot (Sim2Real) or parallelizing training with robot farms (QT-Opt).
2/N
But what makes RL data inefficient in the first place? One hypothesis - (i) representation learning and (ii) exploration. In principle, if we solve both problems, RL should be able to learn quickly.
3/N
In past work (CURL / RAD), we showed that with data augmentation (internalized explicitly or implicitly) RL agents learn as efficiently from pixels as they do from state, making contrastive + data-aug representations a natural choice for real-robot learning.
4/N
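For a concrete picture, here is a minimal PyTorch-style sketch of the two ingredients above: random-crop augmentation (as in RAD) and an InfoNCE-style contrastive objective (as in CURL). The encoder, the bilinear matrix W, and all shapes are illustrative assumptions rather than the released code, and CURL's momentum-averaged key encoder is simplified to a stop-gradient copy.

```python
# Illustrative sketch only -- not the CURL/RAD release code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def random_crop(imgs: torch.Tensor, out_size: int = 84) -> torch.Tensor:
    """Randomly crop a batch of images (B, C, H, W) down to (B, C, out_size, out_size)."""
    _, _, h, w = imgs.shape
    crops = []
    for img in imgs:
        top = torch.randint(0, h - out_size + 1, (1,)).item()
        left = torch.randint(0, w - out_size + 1, (1,)).item()
        crops.append(img[:, top:top + out_size, left:left + out_size])
    return torch.stack(crops)


def contrastive_loss(encoder: nn.Module, W: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
    """InfoNCE loss: two crops of the same observation form a positive pair,
    crops of other observations in the batch serve as negatives."""
    q = encoder(random_crop(obs))                 # query embeddings, (B, D)
    with torch.no_grad():                         # CURL uses a momentum-averaged key encoder;
        k = encoder(random_crop(obs))             # a stop-gradient copy is used here for brevity
    logits = q @ W @ k.t()                        # bilinear similarity scores, (B, B)
    labels = torch.arange(obs.size(0), device=obs.device)
    return F.cross_entropy(logits, labels)        # positives lie on the diagonal
```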
For exploration we use a handful (10) of human demos. But doesn't that defeat the whole point? Not really. The alternative to RL would be imitation learning, but imitation learning requires 100s of demos to work reliably, which takes hours or days of tedious work.
5/N
The benefit of RL is that, given only a handful of demonstrations (minutes of human effort), the agent can continue exploring on its own. We combine the representation learning methods that have been effective in sim with a small number of demos into a framework for efficient real-robot learning.
6/N
The framework has 3 steps (sketched below):
1. Initialize the replay buffer with 10 demos (10 mins)
2. Initialize the CNN encoder with contrastive pre-training (1 min)
3. Train the RL agent with data augmentation (30 mins)
In total: ~10 mins of demo collection and ~30 mins of training.
7/N
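Put together, the three steps amount to a short training script. The sketch below is a paraphrase of the pipeline, not the released FERM implementation: `load_demos`, `ReplayBuffer`, `Encoder`, `contrastive_pretrain`, and `SACAgent` are hypothetical placeholders, and the classic gym step API is assumed.

```python
# Hypothetical pipeline sketch of the 3 steps above -- not the released FERM code.

def train_ferm(env, demo_path, num_steps=20_000):
    # Step 1: seed the replay buffer with ~10 human demonstrations (~10 min of effort).
    buffer = ReplayBuffer(capacity=100_000)
    for obs, action, reward, next_obs, done in load_demos(demo_path):
        buffer.add(obs, action, reward, next_obs, done)

    # Step 2: pre-train the CNN encoder with the contrastive objective on demo frames (~1 min).
    encoder = Encoder()
    contrastive_pretrain(encoder, buffer)

    # Step 3: run off-policy RL (e.g. SAC) from pixels, with random-crop augmentation
    # applied to every batch sampled inside agent.update (~30 min of robot time).
    agent = SACAgent(encoder)
    obs = env.reset()
    for _ in range(num_steps):
        action = agent.act(obs)
        next_obs, reward, done, _ = env.step(action)   # classic gym API assumed
        buffer.add(obs, action, reward, next_obs, done)
        agent.update(buffer)
        obs = env.reset() if done else next_obs
    return agent
```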
Our experimental setup is straightforward and easy for other labs to reproduce. We have 1 robot, 1 GPU, 2 cameras, and a sparse reward for each task. That's it - no MoCap, no state estimation, no dense reward specification.
8/N
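To make "sparse reward, no state estimation" concrete, here is one plausible, purely illustrative way to form the pixel observation from the two cameras and to score success with a binary reward. The camera interface and the grasp-style success check are assumptions for illustration, not the reward functions used in the paper.

```python
# Illustrative only -- not the reward functions or camera code used in FERM.
import numpy as np


def get_observation(cam_front, cam_side) -> np.ndarray:
    """Stack RGB frames from the two cameras channel-wise into one (6, H, W) array."""
    frames = [cam_front.read(), cam_side.read()]          # each assumed to return (H, W, 3) uint8
    return np.concatenate([f.transpose(2, 0, 1) for f in frames], axis=0)


def sparse_reward(object_lifted: bool, gripper_closed: bool) -> float:
    """Binary reward: 1.0 only on task success, 0.0 otherwise -- no shaping, no dense signal."""
    return 1.0 if (object_lifted and gripper_closed) else 0.0
```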
FERM is able to learn 6 standard robotic tasks in 15-50 mins (30 mins avg) - reach, pick, move, pull, light switch, and open door. Practically, this means you can spin up a task in one evening of work (including constructing a sparse reward function).
9/N
Not too surprisingly, RL significantly outperforms imitation learning in this low-demonstration regime. Behavior cloning is unable to solve even the simplest tasks in this setting.
10/N
Finally, the learned RL policies are quite robust. Here are some videos showing the RL agent's policy in the presence of adversarial perturbations and unseen objects.
11/N
We hope that FERM enables other research groups (without access to insane resources) to quickly iterate on their ideas by training directly on their robots.
12/N
Amazing work and persistence by co-lead authors Albert Zhan and @RuihanZhao, who pulled some heroic feats in order to produce real-robot experiments during the pandemic. Thank you to @LerrelPinto and @pabbeel for the great collaboration!
13/N /END
Typo: that gif was meant for the 11/N tweet of this thread 🤷‍♂️