To get reward functions that generalize, we train domain-agnostic video discriminators (DVD) with:
* a lot of diverse human data, and
* a small, narrow set of robot demos
The idea is super simple: predict if two videos are performing the same task or not.
(2/5)
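A minimal sketch of the "same task or not" setup, assuming each clip comes with a task label. The clip IDs, task names, and `make_pairs` helper are hypothetical and only illustrate how pair labels could be formed; the thread does not describe the actual DVD training pipeline.

```python
from itertools import combinations

def make_pairs(clips):
    """Turn (clip_id, task) tuples into (id_a, id_b, same_task) training triples."""
    return [
        (a_id, b_id, int(a_task == b_task))
        for (a_id, a_task), (b_id, b_task) in combinations(clips, 2)
    ]

# Hypothetical clips: two human videos and one robot demo.
clips = [
    ("human_0", "close drawer"),
    ("human_1", "push cup"),
    ("robot_0", "close drawer"),
]

pairs = make_pairs(clips)
for id_a, id_b, same_task in pairs:
    print(id_a, id_b, same_task)
```

Note that the human/robot pair gets label 1 whenever the tasks match, so a classifier trained on such pairs has to ignore who is acting.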
This discriminator can be used as a reward by feeding in a human video of the desired task and a video of the robot’s behavior.
We use it by planning with a learned visual dynamics model.
(3/5)
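One way to picture "discriminator as reward + planning" is a random-shooting planner: sample action sequences, roll each out with the dynamics model, score the imagined rollout against the human video, keep the best. This is a sketch under assumptions: the thread does not say which planner is used, and `toy_dynamics` and `toy_score` are hypothetical 1-D stand-ins for the learned visual dynamics model and the trained discriminator.

```python
import random

def plan(demo, state, dynamics, same_task_score,
         horizon=5, n_candidates=64, seed=0):
    """Random shooting: return the sampled action sequence with the highest score."""
    rng = random.Random(seed)
    best_actions, best_score = None, float("-inf")
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        # Imagine the rollout with the (stand-in) dynamics model.
        rollout, s = [], state
        for a in actions:
            s = dynamics(s, a)
            rollout.append(s)
        # Reward = how strongly the scorer says "same task as the demo".
        score = same_task_score(demo, rollout)
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions, best_score

def toy_dynamics(s, a):
    # Hypothetical stand-in: the state is a single number moved by the action.
    return s + a

def toy_score(demo, rollout):
    # Hypothetical stand-in: higher when the rollout ends where the demo ends.
    return -abs(demo[-1] - rollout[-1])

actions, score = plan(demo=[0.0, 1.0, 2.0], state=0.0,
                      dynamics=toy_dynamics, same_task_score=toy_score)
```

In a real setup the state would be images and the score would come from the discriminator, but the control loop keeps the same shape.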
Does using human videos improve reward generalization compared to using only narrow robot data?
We see:
* 20% greater task success in new environments
* 25% greater task success on new tasks
both in simulation and on a real robot.
What should ML models do when there's a *perfect* correlation between spurious features and labels?
This is hard b/c the problem is fundamentally _underdefined_.
DivDis can solve this problem by learning multiple diverse solutions & then disambiguating: arxiv.org/abs/2202.03418
🧵
Prior works have made progress on robustness to spurious features but also have important weaknesses:
- They can't handle perfect/complete correlations
- They often need labeled data from the target distr. for hparam tuning
DivDis can address both challenges, using 2 stages:
1. The Diversify stage learns multiple functions that minimize training error but have differing predictions on unlabeled target data
2. The Disambiguate stage uses a few active queries to identify the correct function
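A toy illustration of the two stages, not the DivDis training objective: it swaps gradient-based training for a search over a tiny hand-written hypothesis set. The data, the candidate functions, and the `oracle` that answers queries are all hypothetical.

```python
# Inputs are (feature_0, feature_1). In the source data the two features are
# perfectly correlated, so the labels alone cannot say which one matters.
source = [((0, 0), 0), ((1, 1), 1), ((0, 0), 0), ((1, 1), 1)]
target_unlabeled = [(0, 1), (1, 0), (1, 1), (0, 0)]

candidates = {
    "use_feature_0": lambda x: x[0],
    "use_feature_1": lambda x: x[1],
    "always_zero": lambda x: 0,
}

def error(f, data):
    return sum(f(x) != y for x, y in data) / len(data)

def disagreement(f, g, inputs):
    return sum(f(x) != g(x) for x in inputs) / len(inputs)

def diversify(candidates, source, inputs):
    """Keep functions with zero training error; return the pair that disagrees most."""
    fits = [name for name, f in candidates.items() if error(f, source) == 0]
    pairs = [(a, b) for i, a in enumerate(fits) for b in fits[i + 1:]]
    return list(max(pairs, key=lambda p: disagreement(
        candidates[p[0]], candidates[p[1]], inputs)))

def pick_queries(heads, candidates, inputs, budget=1):
    """Active queries: target inputs on which the two heads disagree."""
    f, g = (candidates[name] for name in heads)
    return [x for x in inputs if f(x) != g(x)][:budget]

def disambiguate(heads, candidates, labeled_queries):
    """Pick the head with the lowest error on the few labeled queries."""
    return min(heads, key=lambda name: error(candidates[name], labeled_queries))

def oracle(x):
    # Assumption for this toy: the true label follows feature_0.
    return x[0]

heads = diversify(candidates, source, target_unlabeled)
queries = pick_queries(heads, candidates, target_unlabeled)
chosen = disambiguate(heads, candidates, [(x, oracle(x)) for x in queries])
```

Both surviving functions are perfect on the source data; a single label on a point where they disagree is enough to choose between them here.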
2/ Student feedback is a fundamental problem in scaling education.
Providing good feedback is hard: existing approaches provide canned responses, cryptic error messages, or simply provide the answer.
3/ Providing feedback is also hard for ML: not a ton of data, teachers frequently change their assignments, and student solutions are open-ended and long-tailed.
Supervised learning doesn’t work. We weren’t sure if this problem could even be solved using ML.