To better understand the ethics of machine learning datasets, we picked three controversial face recognition / person recognition datasets—DukeMTMC, MS-Celeb-1M, and Labeled Faces in the Wild—and analyzed ~1,000 papers that cite them.
Paper: arxiv.org/pdf/2108.02922…

Thread ⬇️
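(Aside for anyone who wants to run a similar citation analysis: here's a rough sketch of one way to pull the papers citing a dataset paper, using the public Semantic Scholar Graph API. This is not our exact pipeline, and the paper ID below is a placeholder.)

```python
# Sketch only: list papers citing a given dataset paper via the
# Semantic Scholar Graph API. Not the exact pipeline from the paper;
# "PAPER_ID_PLACEHOLDER" stands in for a real Semantic Scholar paper ID.
import requests

API = "https://api.semanticscholar.org/graph/v1/paper/{pid}/citations"

def citing_papers(paper_id, page_size=100):
    """Yield (title, year) for papers that cite `paper_id`, paging through results."""
    offset = 0
    while True:
        resp = requests.get(
            API.format(pid=paper_id),
            params={"fields": "title,year", "limit": page_size, "offset": offset},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("data", [])
        if not batch:
            break
        for entry in batch:
            citing = entry.get("citingPaper", {})
            yield citing.get("title"), citing.get("year")
        offset += page_size

if __name__ == "__main__":
    for title, year in citing_papers("PAPER_ID_PLACEHOLDER"):
        print(year, title)
```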
First, congrats to lead author Kenny Peng who worked on it for over a year. Kenny was a sophomore here at Princeton when he began this project, and he did this in addition to his coursework and several other research projects. The other authors are @aruneshmathur and me.
Finding 1: despite retraction, DukeMTMC and MS-Celeb-1M are available through copies, derivatives, or pre-trained models. They are still widely used in papers. The creators (especially MS) simply took down the websites instead of making the ethical reasons for retraction clear.
The point isn’t that ethically problematic datasets shouldn’t be retracted. The critical work that led to retractions is invaluable. The point is that the creators could have handled the retractions better and we need other approaches going forward so retractions aren’t needed.
Finding 2: there are many derived datasets that include the original data. There’s no systematic way to even find them all. In most cases they create new ethical concerns by enabling new applications, releasing pre-trained models, adding new annotations, or other post-processing.
Finding 3a: tech change shapes dataset ethics. Benchmark datasets are often introduced for tasks where the state of the art isn’t ready for practice, which seems less ethically serious. But the benchmark enables research progress that leads to production use of the same dataset.
Finding 3b: social change shapes dataset ethics. When LFW was introduced, the diversity of the dataset was a _selling point_. A decade later it became one of the main points of criticism.

(Note: we use the term “social change” in a broad sense.)
Finding 4: the licenses of these datasets are a mess. LFW was released with no license (!). Many datasets are released under non-commercial licenses. If the intent is to stop production use, the effectiveness is limited since some of the most problematic uses are by governments.
Derived datasets often violate licenses. 4 of 7 MS-Celeb-1M derivatives failed to include the non-commercial designation. All 7 violate MS-Celeb-1M’s license, which prohibits derivative distribution in the first place. Only 3 of 21 pre-trained models included the designation.
Perhaps the most significant legal issue is the commercial use of models trained on non-commercial data. Our analysis of online forum posts suggests there’s a tremendous amount of confusion about this.
Finding 5: identifying datasets through paper citations is just terrible. Often we couldn't figure out which dataset was being referred to or where to locate it. This causes problems for documentation, transparency and accountability, and research efforts such as ours.
There’s a much-needed and active line of work on mitigating ML dataset harms. The main implication of our findings for this work is the difficulty of anticipating ethical impacts at dataset creation time. We advocate that datasets should be “stewarded” throughout their lifecycle.
We outline a model for what dataset stewarding could look like. There’s lots more in the paper than this thread so check it out: arxiv.org/pdf/2108.02922…
We will share a methodological appendix and replication materials in a few weeks so others can check and build on our work.

More from @random_walker

4 Aug
Interested in the impact of recommender systems on society? @elucherini4, @MatthewDSun, @aawinecoff, and I have software, papers, and a talk to share:
– T-RECS, a simulation tool for studying these questions github.com/elucherini/t-r…
– Accompanying paper arxiv.org/pdf/2107.08959… 🧵
– A short piece on methodological concerns in simulation research arxiv.org/pdf/2107.14333…
– A talk (by me) offering a critical take on research on filter bubbles mediacentral.princeton.edu/media/1_45q6h2…

Here's a blog post that provides an overview and context: freedom-to-tinker.com/2021/08/04/stu…
The key rationale for this work is that phenomena such as algorithmic amplification of misinformation, filter bubbles, or content diversity in recommendations are difficult to study because they arise through repeated interactions between users, items, and the system over time.
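Here's a toy sketch of the kind of feedback loop these simulators capture: a popularity-biased recommender interacting with users over many rounds, so that early popularity compounds. This is a generic illustration, not T-RECS's actual API; the user model and acceptance probability are assumptions made up for the sketch.

```python
import numpy as np

# Toy feedback loop (generic illustration, not the T-RECS API):
# a recommender that favors already-popular items, run for many rounds.
rng = np.random.default_rng(0)
n_users, n_items, rounds = 200, 50, 100
counts = np.ones(n_items)            # pseudo-popularity of each item

for _ in range(rounds):
    for _ in range(n_users):
        probs = counts / counts.sum()          # recommend in proportion to popularity
        item = rng.choice(n_items, p=probs)
        if rng.random() < 0.3:                 # user accepts with assumed probability
            counts[item] += 1                  # acceptance feeds back into future recs

# Interactions concentrate on a handful of items over time.
top5_share = np.sort(counts)[-5:].sum() / counts.sum()
print(f"Top 5 of {n_items} items account for {top5_share:.0%} of interactions")
```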
2 Aug
Can machine learning outperform baseline logistic regression for predicting complex social phenomena? Many prominent papers have claimed highly accurate civil war prediction. In a systematic review, @sayashk and I find these claims invalid due to errors. reproducible.cs.princeton.edu
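For readers unfamiliar with the setup, here's a generic sketch of what such a baseline comparison looks like with scikit-learn, on synthetic data; it is not taken from the papers we reviewed. The discipline that matters is scoring the complex model and the logistic regression baseline on the same held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (real civil-war data would be country-year panels).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
complex_model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Score both models on the SAME held-out set and report the gap, not just the ML number.
for name, model in [("logistic regression", baseline), ("random forest", complex_model)]:
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```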
We are not political scientists and the main point of our paper is not about civil war. Rather, we want to sound the alarm about an oncoming wave of reproducibility crises and overoptimism across many scientific fields adopting machine learning methods. We have an ongoing list:
Incidentally, we learned about one of the systematic surveys in the above list because it found pitfalls in a paper coauthored by me. Yup, even researchers whose schtick is skepticism of AI/ML are prone to overoptimism when they use ML methods. Such is the allure of AI.
19 Jul
In my dream version of the scientific enterprise, everyone who works on X would be required to spend some percentage of their time learning and contributing to the philosophy of X. There is too much focus on the "how" and too little focus on the "why" and the "what are we even".
Junior scholars entering a field naturally tend to ask critical questions as they aren't yet inculcated into the field's dogmas. But the academic treadmill leaves them little time to voice concerns & their lack of status means that even when they do, they aren't taken seriously.
One possible intervention is for journals and conferences to devote some fraction of their pages / slots to self-critical inquiry, and for dissertation committees to make clear that they will value this type of scholarship just as much as "normal" science.
1 Jul
We shouldn't shrug off dark patterns as simply sleazy sales online, or unethical nudges, or business-as-usual growth hacking. Dark patterns are distinct and powerful because they combine all three in an effort to extract your money, attention, and data. queue.acm.org/detail.cfm?id=… Image
That's from a 2020 paper by @aruneshmathur, @ineffablicious, Mihir Kshirsagar, and me.

PDF version: dl.acm.org/ft_gateway.cfm…
At first growth hacking was about… growth, which was merely annoying for the rest of us. But once a platform has a few billion users it must "monetize those eyeballs". So growth hackers turned to dark patterns, weaponizing nudge research and A/B testing. queue.acm.org/detail.cfm?id=…
30 Jun
I study the risks of digital tech, especially privacy. So people are surprised to hear that I’m optimistic about tech’s long term societal impact. But without optimism and the belief that you can create change with research & advocacy, you burn out too soon in this line of work.
9 years ago I was on the academic job market. The majority of professors I met asked why I chose to work on privacy since—as we all know—privacy is dead because of the Internet and it's pointless to fight it. (Computer scientists tend to be technological determinists, who knew?!)
At first I didn't expect that "why does your research field exist?" would be a serious, recurring question. Gradually I came up with a pitch that at least got interviewers to briefly suspend privacy skepticism and hear about my research. (That pitch is a story for another day.)
22 Jun
The news headlines *undersold* this paper. Widely-used machine learning tool for sepsis prediction found to have an AUC of 0.63 (!), adds little to existing clinical practice. Misses two thirds of sepsis cases, overwhelms physicians with false alerts. jamanetwork.com/journals/jamai…
This adds to the growing body of evidence that machine learning isn't good at true prediction tasks as opposed to "prediction" tasks like image classification that are actually perception tasks.
Worse, in prediction tasks it's extremely easy to be overoptimistic about accuracy through careless problem framing. The sepsis paper found that the measured AUC is highly sensitive to how early the prediction is made—it can be accurate or clinically useful, but not both.
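One way to see why such a tool overwhelms physicians with false alerts: at a low base rate, even reasonable specificity yields a low positive predictive value. The specificity and prevalence below are assumptions for illustration, not numbers from the JAMA paper.

```python
# Back-of-the-envelope PPV calculation. Illustrative assumptions only,
# not figures from the JAMA study.
sensitivity = 0.33   # the tool misses roughly two thirds of sepsis cases
specificity = 0.90   # assumed
prevalence = 0.03    # assumed sepsis rate in the monitored population

ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)
print(f"PPV ≈ {ppv:.2f}")   # about 0.09: roughly 9 of 10 alerts are false alarms
```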
