Nabeel Seedat
PhD student in Machine Learning @Cambridge_Uni | #DataCentricAI | 🇿🇦

Jul 20, 2022, 10 tweets

Understanding data quality is crucial for reliable ML. In our #ICML2022 paper, @NabeelSeedat01, @JonathanICrabbe & @MihaelaVDS present a Data-Centric framework for the understudied problem of identifying incongruous examples of in-distribution data.

🧵1/10

TLDR.
* Do you want to know which examples will be reliably predicted, independent of the downstream predictive model?

* Do you want to get insights into your data to understand possible limitations?

If so, Data-SUITE, our new #DataCentricAI framework, is for you!

2/10

There has been a significant focus on out-of-distribution data (OOD) for reliable ML.

However, in Data-SUITE we tackle an equally important yet understudied problem.

How do we assess in-distribution data that has feature-space heterogeneity?

3/10

Data-SUITE is a paradigm shift from current model-centric methods of uncertainty estimation, which assess predictive uncertainty.

Data-SUITE models uncertainty in the data itself, i.e., it is data-centric.

This allows us to flag instances in a model-independent manner.

4/10

Our new #DataCentricAI framework, Data-SUITE, takes a pipeline approach to constructing feature-wise confidence interval estimators, leveraging:

(1) Copula modeling,
(2) Representation Learning and
(3) Conformal Prediction.

5/10

The feature-wise conformal predictor allows us to produce adaptive intervals that help us flag incongruous instances.

At the same time, with conformal prediction, we get rigorous theoretical guarantees on coverage 🚀⭐️💡
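
To make the idea of a feature-wise conformal interval concrete, here is a minimal sketch of plain split conformal prediction on a single feature. It is not the Data-SUITE implementation: the linear predictor, the synthetic data, and every name in it (`fit_line`, `conformal_quantile`, `feature_interval`, `is_flagged`) are assumptions made for illustration. It also produces constant-width intervals, whereas the intervals described above are adaptive.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (one input feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the score to use
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def feature_interval(x, a, b, q):
    """Interval for the target feature given the other feature's value."""
    centre = a * x + b
    return centre - q, centre + q


def is_flagged(x, y_observed, a, b, q):
    """Flag an instance whose observed feature lies outside its interval."""
    lo, hi = feature_interval(x, a, b, q)
    return not (lo <= y_observed <= hi)


# Synthetic data (illustrative only): feature y depends linearly on feature x.
rng = random.Random(0)


def sample(n):
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


train_x, train_y = sample(200)  # used to fit the hypothetical predictor
calib_x, calib_y = sample(200)  # held out to calibrate the interval width

a, b = fit_line(train_x, train_y)
scores = [abs(y - (a * x + b)) for x, y in zip(calib_x, calib_y)]
q = conformal_quantile(scores, alpha=0.1)

print(is_flagged(5.0, 11.0, a, b, q))  # value consistent with the other feature
print(is_flagged(5.0, 30.0, a, b, q))  # value far from what x would suggest
```

The coverage guarantee comes from the calibration step: if calibration and test points are exchangeable, an interval built with the `ceil((n + 1) * (1 - alpha))`-th smallest calibration score contains the true value with probability at least `1 - alpha`, whatever predictor is used in the middle.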

6/10

Data-SUITE's brand of data-centric uncertainty outperforms model-centric counterparts on multiple real-world tabular datasets, with different types of incongruence.

We show utility for 2 practical problems:
1. Reliable model deployment
2. Insightful data exploration

7/10

* Reliable model deployment.

Data-SUITE consistently identifies the most impactful data instances for a diverse class of downstream predictive models.

8/10

* Insightful data exploration

Data-SUITE can help data owners to understand potential data limitations.

9/10
