New preprint w/ (co-first author) @royjamesadams and @suchisaria: "Evaluating Model Robustness Under Dataset Shift"

How can we evaluate *ahead of time* whether a model's performance will generalize from training to deployment? 1/
This is an important question for both model developers and third-party auditors evaluating safety. For example, the FDA regulates ML medical devices and requires evidence of model validity (internal and external). 2/
Existing evaluation tools (e.g., cross-validation) help us assess performance on new data from the same distribution as our training data. But in practice we expect the deployment environment to differ structurally from the training environment. 3/
For example, in healthcare the patient demographics, disease prevalence, and clinical practice patterns vary from hospital to hospital (and over time). How do we evaluate how a model will perform under these expected types of changes? 4/
Often we only have access to a small number of datasets containing observations from different sites. Evaluating performance across these datasets provides information about how the model will perform, but is limited by the availability and diversity of these datasets. 5/
To address this, we develop a method that uses a single evaluation dataset to assess how a model's performance changes under worst-case distribution shifts. Users define shifts via distributions, e.g., a shift in P(patient demographics) or P(lab test | patient history). 6/
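To make the idea of a user-defined shift concrete, here is a minimal sketch (not the paper's estimator) of evaluating a model under one fully specified shift in a marginal P(demographics), via importance weighting of the evaluation data. The group variable, losses, and the shifted marginal `q_shift` are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical evaluation data: a binary demographic attribute and per-sample losses.
group = rng.integers(0, 2, size=1000)  # e.g., age group 0 or 1
loss = np.where(group == 1, 0.3, 0.1) + rng.normal(0, 0.01, size=1000)

# Observed marginal P(group) in the evaluation data.
p_eval = np.bincount(group, minlength=2) / len(group)

# A user-specified shifted marginal Q(group): more of group 1 at deployment.
q_shift = np.array([0.2, 0.8])

# Importance weights w = Q(group) / P(group) reweight the evaluation set
# to estimate average loss under the shifted distribution.
w = q_shift[group] / p_eval[group]
shifted_loss = np.average(loss, weights=w)

print(round(loss.mean(), 3), round(shifted_loss, 3))
```

This handles one known shift; the paper's contribution (next tweets) is to bound performance over a whole set of such shifts rather than a single specified one.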
Then we define an uncertainty ball of possible test distributions which differ from the evaluation data only with respect to this shift. Using distributionally robust optimization, we can then estimate the model's performance on the worst-case distribution in this set. 7/
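As a simplified illustration of a worst-case estimate (not the preprint's exact formulation, which restricts which variables may shift): one common DRO uncertainty set is all subpopulations comprising at least a fraction alpha of the data, for which the worst-case mean loss reduces to the CVaR of the loss, i.e., the mean of the largest alpha-fraction of per-sample losses:

```python
import numpy as np

def worst_case_loss(losses, alpha):
    """Worst-case mean loss over any subpopulation of size >= alpha * n.

    Equals CVaR at level alpha: the mean of the largest alpha-fraction of
    losses. (Illustrative only; the paper's uncertainty set constrains the
    shift to user-specified variables, which this sketch does not model.)
    """
    losses = np.sort(np.asarray(losses))[::-1]  # descending
    k = max(1, int(np.ceil(alpha * len(losses))))
    return losses[:k].mean()

losses = np.array([0.1, 0.2, 0.9, 0.4, 0.3])
print(worst_case_loss(losses, alpha=0.4))  # mean of the top 2 losses -> 0.65
```

As alpha shrinks, the uncertainty set grows and the worst-case estimate becomes more pessimistic; at alpha = 1 it recovers the ordinary average loss.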
A key technical feature is that we develop a root-N consistent estimator for the worst-case performance (thanks to double/debiased ML cc @VC31415). Thus, we can use flexible machine learning estimators for nuisance parameters without inducing unnecessary bias. 8/
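The cross-fitting idea behind double/debiased ML can be sketched as follows: each sample's nuisance prediction comes from a model trained on the other folds, so a flexible (possibly overfit) nuisance estimator does not contaminate the downstream estimate. The fold scheme and the toy polynomial nuisance model here are illustrative, not the paper's:

```python
import numpy as np

def cross_fit_predictions(X, y, fit, predict, n_folds=2, seed=0):
    """Out-of-fold nuisance predictions via cross-fitting.

    Each sample is predicted by a model trained on the complementary folds.
    (Sketch only; the preprint plugs such predictions into a debiased
    estimating equation for the worst-case performance.)
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds
    preds = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        model = fit(X[train], y[train])
        preds[test] = predict(model, X[test])
    return preds

# Toy nuisance model: a degree-1 polynomial fit.
fit = lambda X, y: np.polyfit(X, y, deg=1)
predict = lambda coefs, X: np.polyval(coefs, X)

rng = np.random.default_rng(1)
X = rng.normal(size=200)
y = 2.0 * X + rng.normal(scale=0.1, size=200)
preds = cross_fit_predictions(X, y, fit, predict)
print(round(np.mean((y - preds) ** 2), 3))  # out-of-fold MSE near the noise level
```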
We demonstrate the approach on a medical risk prediction model. We show that conclusions we draw from these evaluations provide useful insight into actual performance on new datasets. 9/
Further, hyperparameters (like magnitude of shift) can be mapped to interpretable properties of the worst-case distributions and used to assess plausibility of the uncertainty set. 10/
On a more general note, there's a growing need for more "stress tests" for models (see, e.g., Google's recent underspecification paper cc @alexdamour @vivnat).

We hope our procedure helps in addressing this gap. 11/
We will be releasing code to contribute to the toolkit available to model developers for assessing the safety of models in new settings. 12/
