To get reward functions that generalize, we train domain-agnostic video discriminators (DVD) with:
* a large amount of diverse human video data, and
* a small, narrow set of robot demos
The idea is super simple: predict whether two videos show the same task or not.
(2/5)
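A minimal sketch of that prediction step, with random feature vectors and an untrained logistic head standing in for the learned discriminator (all names and shapes here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical discriminator head: scores whether two fixed-length video
# feature vectors depict the same task. In the real method the features
# and weights are learned; here they are random stand-ins.
def dvd_reward(human_feat, robot_feat, w, b):
    pair = np.concatenate([human_feat, robot_feat])
    return sigmoid(pair @ w + b)  # P(same task), used directly as a reward

rng = np.random.default_rng(0)
d = 8                                   # toy feature dimension
w, b = rng.standard_normal(2 * d), 0.0
human = rng.standard_normal(d)          # "human video" features
robot = rng.standard_normal(d)          # "robot video" features
reward = dvd_reward(human, robot, w, b)
assert 0.0 <= reward <= 1.0
```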
This discriminator can serve as a reward: feed in a human video of the desired task alongside a video of the robot’s behavior.
We use this reward for planning with a learned visual dynamics model.
(3/5)
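Roughly how a planner can use such a reward — a random-shooting sketch where a toy linear model and a goal-distance score stand in for the learned visual dynamics model and the DVD reward (everything below is an assumption, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned pieces: a linear "visual dynamics model"
# and a reward preferring states near a goal vector (playing the role of
# the discriminator score).
A = 0.9 * np.eye(4)
B = 0.1 * np.ones((4, 2))
goal = np.ones(4)

def dynamics(s, a):
    return A @ s + B @ a

def reward(s):
    return -np.linalg.norm(s - goal)

def plan(s0, horizon=5, n_samples=256):
    """Random-shooting planning: sample action sequences, roll each one
    through the dynamics model, return the first action of the best."""
    best_ret, best_a0 = -np.inf, None
    for _ in range(n_samples):
        actions = rng.uniform(-1, 1, size=(horizon, 2))
        s, ret = s0, 0.0
        for a in actions:
            s = dynamics(s, a)
            ret += reward(s)
        if ret > best_ret:
            best_ret, best_a0 = ret, actions[0]
    return best_a0

a0 = plan(np.zeros(4))  # first action of the best sampled sequence
```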
Does using human videos improve reward generalization compared to using only narrow robot data?
We see:
* 20% greater task success in new environments
* 25% greater task success on new tasks
both in simulation and on a real robot.
To think about this question, we first look at how equivariance is represented in neural nets.
Equivariance shows up as particular weight-sharing & weight-sparsity patterns; convolutions are the classic example.
(2/8)
We reparameterize each weight matrix as a sharing matrix applied to underlying filter parameters.
It turns out this can provably represent any equivariant structure: for any finite group, the sharing pattern and filter parameters of every group-equivariant convolution can be expressed this way.
(3/8)
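A toy version of the reparameterization for a 1-D cyclic convolution — `U` is the sharing matrix and `v` the filter parameters (the names are mine, not the paper's):

```python
import numpy as np

n, k = 5, 3
rng = np.random.default_rng(0)
v = rng.standard_normal(k)          # underlying filter parameters

# Sharing matrix U: one row per entry of the n x n weight matrix, one
# column per filter parameter. U[i*n + j, p] = 1 iff entry W[i, j]
# should equal v[p] under a cyclic 1-D convolution structure.
U = np.zeros((n * n, k))
for i in range(n):
    for j in range(n):
        p = (j - i) % n             # offset between output i and input j
        if p < k:
            U[i * n + j, p] = 1.0

W = (U @ v).reshape(n, n)           # reparameterized weight matrix

# Check: multiplying by W equals cyclic correlation with the filter v.
x = rng.standard_normal(n)
expected = np.array(
    [sum(v[p] * x[(i + p) % n] for p in range(k)) for i in range(n)]
)
assert np.allclose(W @ x, expected)
```

Weight sharing lives in `U` (which entries reuse which parameter) and sparsity in its zero rows; learning `U` rather than fixing it is what lets the structure itself be discovered.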