How can we evaluate *ahead of time* whether a model's performance will generalize from training to deployment? 1/
This is an important question for both model developers and third-party auditors evaluating safety. For example, the FDA regulates ML-based medical devices and requires evidence of model validity (both internal and external). 2/
Existing evaluation tools (e.g., cross-validation) help us assess performance on new data drawn from the same distribution as our training data. But in practice we expect the deployment environment to differ structurally from the training data. 3/
For example, in healthcare the patient demographics, disease prevalence, and clinical practice patterns vary from hospital to hospital (and over time). How do we evaluate how a model will perform under these expected types of changes? 4/
Often we only have access to a small number of datasets containing observations from different sites. Evaluating performance across them tells us something about how the model will generalize, but is limited by how many such datasets exist and how diverse they are. 5/
To address this, we develop a method that uses a single evaluation dataset to estimate how a model's performance changes under worst-case distribution shifts. Users define the shift in terms of distributions, e.g., a shift in P(patient demographics) or in P(lab test | patient history). 6/
We then define an uncertainty ball of possible test distributions that differ from the evaluation distribution only with respect to this shift. Using distributionally robust optimization, we estimate the model's performance on the worst-case distribution in this set. 7/
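(Not from the paper, but to make the idea concrete: a minimal sketch of one way to instantiate such a worst-case estimate, assuming the uncertainty ball is a bounded-likelihood-ratio ("hardest alpha-subpopulation") set over the shifting variables W. `mu_hat`, `alpha`, and the CVaR-style tail average are illustrative choices, not necessarily the paper's exact formulation.)

```python
# Illustrative sketch (assumptions noted above): worst-case expected loss when
# only P(W) may shift, taken over test distributions whose density ratio w.r.t.
# the evaluation data is at most 1/alpha, i.e., the hardest alpha-fraction as
# ranked by the conditional loss mu(W) = E[loss | W].
import numpy as np

def worst_case_loss(mu_hat, alpha):
    """Average conditional loss over the worst alpha-fraction of W values."""
    mu_sorted = np.sort(mu_hat)[::-1]              # highest conditional losses first
    k = max(1, int(np.ceil(alpha * len(mu_hat))))  # size of the worst-case subpopulation
    return mu_sorted[:k].mean()                    # CVaR-style tail average

# mu_hat[i] approximates E[loss | W = w_i] on the evaluation set (estimation: next tweet)
mu_hat = np.random.rand(1000)                      # placeholder values for illustration
print(worst_case_loss(mu_hat, alpha=0.20))         # worst case over shifts of "size" alpha
```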
A key technical feature is that we develop a root-N consistent estimator for the worst-case performance (thanks to double/debiased ML cc @VC31415). Thus, we can use flexible machine learning estimators for nuisance parameters without inducing unnecessary bias. 8/
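(Again a sketch rather than the paper's estimator: cross-fitting in the spirit of double/debiased ML, where the nuisance mu(W) = E[loss | W] is fit with a flexible learner but each point is scored by a model that never saw it. The root-N result relies on an additional debiasing correction not shown here; `cross_fit_conditional_loss` and the gradient-boosting learner are hypothetical choices.)

```python
# Sketch of the cross-fitting step only; the debiasing correction is omitted.
# W holds the shifting variables, `loss` the per-example loss of the model
# under audit.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def cross_fit_conditional_loss(W, loss, n_folds=5):
    mu_hat = np.empty_like(loss, dtype=float)
    for train_idx, test_idx in KFold(n_folds, shuffle=True, random_state=0).split(W):
        # Fit the nuisance on the training folds, predict on the held-out fold
        model = GradientBoostingRegressor().fit(W[train_idx], loss[train_idx])
        mu_hat[test_idx] = model.predict(W[test_idx])
    return mu_hat

# Usage with the previous sketch:
# mu_hat = cross_fit_conditional_loss(W, loss)
# print(worst_case_loss(mu_hat, alpha=0.20))
```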
We demonstrate the approach on a medical risk prediction model and show that the conclusions drawn from these evaluations provide useful insight into the model's actual performance on new datasets. 9/
Further, hyperparameters (like the magnitude of the shift) can be mapped to interpretable properties of the worst-case distributions and used to assess the plausibility of the uncertainty set. 10/
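(A hypothetical illustration of that idea: summarize who falls in the worst-case subpopulation implied by a given shift magnitude, so a domain expert can judge whether such a shift is plausible. `df` is an evaluation-set DataFrame aligned with `mu_hat`; both names are ours, not the paper's.)

```python
# Sketch: describe the worst-case subpopulation implied by alpha, e.g., its
# age distribution or disease prevalence, to assess plausibility of the shift.
import numpy as np

def summarize_worst_case(df, mu_hat, alpha):
    k = max(1, int(np.ceil(alpha * len(mu_hat))))
    worst_idx = np.argsort(mu_hat)[::-1][:k]   # indices of the hardest alpha-fraction
    return df.iloc[worst_idx].describe()       # summary stats of that subpopulation

# print(summarize_worst_case(df, mu_hat, alpha=0.20))
```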
On a more general note, there's a growing need for more "stress tests" for models (see, e.g., Google's recent underspecification paper arxiv.org/abs/2011.03395, cc @alexdamour @vivnat).
We hope our procedure helps in addressing this gap. 11/
We will release code, contributing to the toolkit available to model developers for assessing the safety of models in new settings. 12/