So excited I can FINALLY share our new work on machine learning practitioners' data documentation perceptions, needs, challenges, and desiderata, which will appear in #CSCW2022! 1/n
Data is central to the development & evaluation of ML models. Using problematic or inappropriate datasets can lead to harms.
Data documentation frameworks like datasheets & data nutrition labels were proposed to encourage transparency and deliberate reflection on datasets. 2/n
But do these frameworks meet the needs of ML practitioners who create and consume datasets?
We conducted a series of semi-structured interviews with 14 ML practitioners and had them answer a list of questions borrowed from datasheets for datasets. 3/n
To me, one of our most surprising findings is that while data doc frameworks are often motivated by responsible AI (RAI) concerns, practitioners struggled to connect the questions they were answering to their RAI implications.
There's a big gap here we need to address! 4/n
Participants want data doc frameworks to be adaptable, integrated into existing tools and workflows, and as automated as possible.
They have trouble prioritizing the needs of dataset consumers and providing info that someone unfamiliar with their datasets might need to know. 5/n
In the paper, we provide 7 design requirements for future data documentation frameworks based on our findings.
Our hope is that these will help make data documentation more widespread and practical and contribute to responsible AI practices. 6/n