Rachel Thomas
MS Immunology student | Past: cofounder @FastDotAI, director USF Center Applied Data Ethics, math PhD | she/her

Aug 19, 2021, 8 tweets

An overall lack of recognition for the invisible, arduous, & taken-for-granted data work in AI leads to poor data practices, resulting in data cascades (negative, downstream events)... “Everyone wants to do the model work, not the data work” 1/

storage.googleapis.com/pub-tools-publ…

Paradoxically, data is the most under-valued and de-glamorised aspect of AI

--Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI by Nithya Sambasivan @shivanikapania Hannah Highfill @NaaShomeh @heuristicity @laroyo 2/
research.google/pubs/pub49953/

Data quality issues in AI are addressed with the wrong tools created for, and fitted to other tech problems—they are approached as a database problem, legal compliance issue, or licensing deal. 3/

“In real life, we never see clean data. Courses focus on models & tools but rarely teach about data cleaning & pipeline gaps.” CS curricula don't include training for dealing w domain-specific ‘dirty data’, documenting datasets, designing data collection, training raters,... 4/

ML data collection practices often conflict w/ existing workflows of domain experts. Data creation was added as extraneous work to on-the-ground partners (e.g., nurses, patrollers, farmers) who already had several responsibilities and were not adequately compensated. 5/

Missing metadata led practitioners to make assumptions, ultimately leading to costly discarding of datasets or re-collecting data. Lack of metadata & collaborators changing schema w/out understanding context led to loss of 4 months of precious medical robotics data collection 6/

From goodness-of-fit to goodness-of-data:

Goodness-of-fit metrics, such as F1, Accuracy, AUC, do not tell us much about the fidelity and validity aspects of the data. Currently, there are no standardised metrics for characterising the goodness-of-data 7/
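To make the point above concrete, here is a minimal sketch (not from the paper) of how a goodness-of-fit metric can look perfect while the data itself is wrong. All labels below are invented for illustration: `true_labels` stands for a ground truth nobody in the pipeline observes, and `recorded_labels` assumes a systematic collection error that turned four positive cases into negatives.

```python
def accuracy(predicted, reference):
    """Fraction of positions where predicted matches reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

# Hypothetical ground truth (assumed for illustration, never seen in practice).
true_labels = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]

# Hypothetical recorded labels: a systematic error flipped four 1s to 0s.
recorded_labels = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

# A model that fits the recorded labels perfectly.
predictions = list(recorded_labels)

# The metric we can compute: agreement with the recorded (flawed) labels.
fit_to_recorded = accuracy(predictions, recorded_labels)

# The quantity we actually care about: agreement with reality.
fit_to_truth = accuracy(predictions, true_labels)

print(f"accuracy vs recorded labels: {fit_to_recorded:.1f}")
print(f"accuracy vs true labels:     {fit_to_truth:.1f}")
```

In this toy setup the measurable accuracy is 1.0 while agreement with the assumed ground truth is only 0.6. Nothing in the metric signals the gap, which is why the thread separates goodness-of-fit from goodness-of-data.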

We find drastic differences in data & compute in African countries & India, compared to USA... the Global South is viewed as a site for low-level data annotation work, an emerging market for extraction from ‘bottom billion’ data subjects, or a beneficiary of AI for social good 8/
