2/ Student feedback is a fundamental problem in scaling education.
Providing good feedback is hard: existing approaches give canned responses, cryptic error messages, or simply reveal the answer.
3/ Providing feedback is also hard for ML: there’s not a ton of data, teachers frequently change their assignments, and student solutions are open-ended and long-tailed.
Supervised learning doesn’t work here. We weren’t sure this problem could even be solved with ML.
4/ But, we can frame it as a few-shot learning problem! Using data from past HWs and exams of Stanford’s intro CS course, we train a model to give feedback for a new problem with only ~20 examples.
Humans are critical to this process: instructors define a rubric & feedback text.
5/ Because this is open-ended Python code, our base architecture is transformers + prototypical networks.
But, there are many important details for this to *actually* work: task augmentation, question & rubric text as side info, preprocessing, and code pre-training.
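A minimal sketch of the prototypical-network classification step (assuming a transformer encoder has already embedded each student solution; the function and variable names here are illustrative, not the paper's actual code):

```python
import numpy as np

def prototypical_predict(support_emb, support_labels, query_emb):
    """Classify query embeddings by distance to class prototypes.

    support_emb: (n_support, d) encoder embeddings of the ~20 labeled examples
    support_labels: (n_support,) integer rubric labels
    query_emb: (n_query, d) embeddings of new student solutions
    """
    classes = np.unique(support_labels)
    # Prototype = mean embedding of each class's support examples
    protos = np.stack([support_emb[support_labels == c].mean(axis=0)
                       for c in classes])
    # Squared Euclidean distance from each query to each prototype
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[d2.argmin(axis=1)]  # nearest prototype wins
```

In the actual system the encoder would also condition on the question and rubric text mentioned above; this sketch only shows the metric-learning step.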
6/ How well does this work?
In offline expts, meta-learning:
* achieves 8%-21% greater accuracy than supervised learning
* comes within 8% of a human TA on held-out exams.
Ablations show a >10% difference in accuracy with different design choices.
7/ Most importantly, this model was deployed on 16,000 student solutions in Code-in-Place, where it was previously not possible to give feedback.
In a randomized blind A/B test, students preferred model feedback slightly *more* than human feedback AND rated its usefulness 4.6/5.
8/ We also did several checks for bias. Among the countries & gender identities with the most representation (i.e. largest statistical power), we see no signs of bias.
Not too surprising given the model only sees typed Python code w/o comments, but super important to check.
9/ At the beginning of this project, we had no idea that the goal would be possible, let alone deployable.
I still remember very naively approaching @chrispiech about using meta-learning for education after watching a talk he gave >1.5 years ago. :)
10/ It’s super exciting to see both real-world impact of meta-learning algorithms + substantive progress on AI for education.
What should ML models do when there's a *perfect* correlation between spurious features and labels?
This is hard b/c the problem is fundamentally _underdefined_
DivDis can solve this problem by learning multiple diverse solutions & then disambiguating arxiv.org/abs/2202.03418
🧵
Prior works have made progress on robustness to spurious features but also have important weaknesses:
- They can't handle perfect/complete correlations
- They often need labeled data from the target distr. for hparam tuning
DivDis can address both challenges, using 2 stages:
1. The Diversify stage learns multiple functions that minimize training error but have differing predictions on unlabeled target data
2. The Disambiguate stage uses a few active queries to identify the correct function
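The two stages can be sketched in a toy form (assuming the diverse heads have already been trained; the names and the simple accuracy-based selection rule here are illustrative simplifications, not the paper's implementation):

```python
import numpy as np

def disagreement(preds_a, preds_b):
    # Diversify objective (toy proxy): fraction of unlabeled target
    # points where two heads make differing predictions
    return (preds_a != preds_b).mean()

def disambiguate(heads, query_x, query_y):
    """Pick the head that best matches a few actively queried labels."""
    accs = [np.mean(h(query_x) == query_y) for h in heads]
    return heads[int(np.argmax(accs))]
```

For example, if the training labels are perfectly correlated with both shape and background, one head may latch onto each feature; the heads agree on training data but disagree on decorrelated target data, and a couple of labeled queries reveal which one is correct.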
To get reward functions that generalize, we train domain-agnostic video discriminators (DVD) with:
* a lot of diverse human data, and
* a small, narrow set of robot demos
The idea is super simple: predict if two videos are performing the same task or not.
(2/5)
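The "super simple" pairwise objective above amounts to building same-task / different-task video pairs and training a binary classifier on them. A toy sketch of the pair construction (the data layout and function name are hypothetical):

```python
import random

def make_pairs(videos_by_task, n_pairs, rng=random.Random(0)):
    """Build (video_a, video_b, same_task) training pairs for the discriminator.

    videos_by_task: dict mapping a task name to a list of videos of that task
    (human and robot videos mixed together).
    """
    tasks = list(videos_by_task)
    pairs = []
    for _ in range(n_pairs):
        if rng.random() < 0.5:
            # Positive pair: two different videos of the same task
            t = rng.choice(tasks)
            a, b = rng.sample(videos_by_task[t], 2)
            pairs.append((a, b, 1))
        else:
            # Negative pair: videos of two different tasks
            t1, t2 = rng.sample(tasks, 2)
            pairs.append((rng.choice(videos_by_task[t1]),
                          rng.choice(videos_by_task[t2]), 0))
    return pairs
```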
This discriminator can be used as a reward by feeding in a human video of the desired task and a video of the robot’s behavior.
We use it by planning with a learned visual dynamics model.
(3/5)
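A minimal sketch of using the discriminator score as a planning reward (the function names are illustrative, and the candidate-ranking loop stands in for whatever planner is actually paired with the learned visual dynamics model):

```python
import numpy as np

def dvd_reward(discriminator, human_video, robot_video):
    # Reward = discriminator's probability that the two videos show the same task
    return discriminator(human_video, robot_video)

def plan(dynamics_model, discriminator, human_video, candidate_action_seqs, state):
    """Pick the action sequence whose predicted rollout best matches the demo."""
    scores = [
        dvd_reward(discriminator, human_video, dynamics_model(state, actions))
        for actions in candidate_action_seqs
    ]
    return candidate_action_seqs[int(np.argmax(scores))]
```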