How can we evaluate *ahead of time* whether a model's performance will generalize from training to deployment? 1/
This question matters both to model developers and to third-party auditors evaluating safety. For example, the FDA regulates ML medical devices and requires evidence of model validity, both internal and external. 2/