TL;DR: Distribution shift is *really* hard, but common patterns emerge.
To organize the 200 distribution shifts, we divide them into two categories: synthetic shifts and natural shifts.
Synthetic shifts are derived from existing images by perturbing them programmatically, e.g., with noise, blur, or adversarial modifications.
Natural shifts are new, unperturbed images from a different distribution.
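For concreteness, here is what one of the simplest synthetic shifts looks like in code. This is an illustrative sketch of ImageNet-C-style Gaussian noise; the function name and severity constants are ours, not the benchmark's exact ones:

```python
import numpy as np

def gaussian_noise(image, severity=1):
    """Perturb an image with additive Gaussian noise (ImageNet-C style).

    image: uint8 array of shape (H, W, 3); severity: 1 (mild) to 5 (strong).
    The sigma values below are illustrative, not the benchmark's exact ones.
    """
    sigma = [0.04, 0.08, 0.12, 0.18, 0.26][severity - 1]
    x = image.astype(np.float32) / 255.0            # work in [0, 1]
    x = x + np.random.normal(scale=sigma, size=x.shape)
    return (np.clip(x, 0.0, 1.0) * 255.0).astype(np.uint8)
```

Natural shifts, by contrast, cannot be generated this way; they require collecting new images.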
At a high level, there has been good progress on the synthetic shifts (e.g., ImageNet-C or adversarial examples).
Natural distribution shifts (e.g., ImageNetV2 or ObjectNet), on the other hand, are still much harder.
But how do we measure robustness to begin with?
On many shifts, models with higher in-distribution accuracy already perform better under distribution shift, without any intervention to improve their robustness. So we have to disentangle robustness from in-distribution accuracy.
To understand whether a model is truly more robust (as opposed to just more accurate in-distribution), we introduce “effective robustness”: accuracy beyond the baseline given by standard models. This is best demonstrated graphically:
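In code, the idea is simple. Below is a minimal sketch, assuming the baseline is a linear fit through the standard models' accuracies in logit-logit space (the scaling used in our plots); the function names are illustrative, and the paper's exact fitting details may differ:

```python
import numpy as np

def logit(p):
    # Accuracies cluster near 1.0, so the baseline is fit in logit space.
    return np.log(p / (1.0 - p))

def fit_baseline(std_in_acc, std_out_acc):
    """Fit a baseline through standard (non-robust) models' accuracies.

    std_in_acc / std_out_acc: in- and out-of-distribution accuracies
    (fractions in (0, 1)). Returns a function mapping in-distribution
    accuracy to the out-of-distribution accuracy a standard model of
    that accuracy would be expected to reach.
    """
    slope, intercept = np.polyfit(logit(np.asarray(std_in_acc)),
                                  logit(np.asarray(std_out_acc)), deg=1)

    def baseline(in_acc):
        z = slope * logit(np.asarray(in_acc)) + intercept
        return 1.0 / (1.0 + np.exp(-z))  # back to accuracy space

    return baseline

def effective_robustness(in_acc, out_acc, baseline):
    # Accuracy beyond what the baseline predicts for this model's
    # in-distribution accuracy; > 0 means effectively robust.
    return out_acc - baseline(in_acc)
```

Under this definition, a model counts as effectively robust only if its out-of-distribution accuracy exceeds what the fit predicts from its in-distribution accuracy, not merely if it scores higher under shift than other models.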
Looking at effective robustness paints a consistent picture for the natural distribution shifts in our testbed. Current robustness interventions show little to no gains. The only approach that consistently promotes robustness is training on large, diverse datasets.
There is a lot more in our paper, so we built an interactive website to explore all the data we collected: