This week an #AI model was released on @huggingface that produces harmful + discriminatory text and has already posted over 30k vile comments online (according to its author).

This experiment would never pass a human research #ethics board. Here are my recommendations.

1/7 [Screenshots: text from the Hugging Face discussion thread]
@huggingface as the model custodian (an interesting new concept) should implement an #ethics review process to determine the harm hosted models may cause, and gate harmful models behind approval/usage agreements.

Medical research has functional models for this, e.g. for data sharing.

2/7
Open science and software are wonderful principles, but they must be balanced against potential harm. Medical research has a strong ethics culture because we have an awful history of causing harm to people, usually from disempowered groups.

See en.wikipedia.org/wiki/Medical_e… for examples.

3/7
Finally, I'd like to talk about @ykilcher's experiment here. He performed an experiment on humans without informing them, without consent, and without oversight. This breaches every principle of human research ethics.

4/7
Imagine the ethics submission!

Plan: to see what happens, an AI bot will produce 30k discriminatory comments on a publicly accessible forum with many underage users and members of the groups targeted in the comments. We will not inform participants or obtain consent.

5/7
AI research has just as much capacity to cause harm as medical research, but unfortunately even small attempts to manage these risks (such as the @NeurIPSConf #ethics code of conduct: openreview.net/forum?id=zVoy8…) are bitterly resisted by even the biggest names in AI research.

6/7
Twitter ate my original thread so I'm one tweet short and don't know what I missed. Sorry.

7/7
UPDATE: @huggingface has removed the model from public access and will implement a gating feature.

Furthermore, they are asking for community feedback on an appropriate ethics review mechanism. This is really important, so please engage with it.
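[Editor's note: the thread does not spell out what "gating" looks like in practice. As a rough illustration only, here is a minimal sketch of what gated access typically means for an end user, using the huggingface_hub Python client. The repo id and token are hypothetical placeholders, and the exact mechanism Hugging Face ships may differ.]

```python
# Illustrative sketch only (not from the thread): roughly what "gated access"
# means for an end user of the huggingface_hub Python client.
# The repo id and token below are hypothetical placeholders.
from huggingface_hub import hf_hub_download

# For a gated repo, this download succeeds only after the user has accepted
# the model's usage terms on the Hub and authenticated with a personal token;
# anonymous requests are refused.
path = hf_hub_download(
    repo_id="some-org/gated-example-model",  # hypothetical placeholder
    filename="config.json",
    token="hf_xxx",  # hypothetical access token
)
print("Downloaded to", path)
```

The point of such a scheme is that access becomes an explicit, auditable agreement tied to an account, rather than an anonymous download.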

@huggingface should implement gated access to prevent harmful misuse of this model and others like it. Open science and software are great principles, but must be balanced against the risk of real harm. Medical research has functional models for this, e.g. for data sharing.

2/7 [Screenshots: MIMIC-CXR database website and dataset text]
Furthermore, @huggingface (as the model custodian, which is an interesting new concept) should be responsible for assessing which models pose a risk. An internal review board process should be set up, with model authors required to submit a summary of risk for their models.

3/7
Additionally, I want to discuss @ykilcher's choice to perform an experiment on human participants without #ethics board approval, consent, or even their knowledge. This breaches every principle of human research ethics.

This has implications for AI research as a field.

4/7
Imagine the ethics proposal!

Plan: an AI bot will post 30k+ discriminatory comments on a publicly accessible forum often populated by underage users, including members of the marginalised groups the comments target. We will not inform them or ask for consent.

5/7
Medical research has a strong ethics framework for a reason. We have a disgusting history of causing harm, particularly to disempowered people. See en.wikipedia.org/wiki/Medical_e… for examples.

AI has an equal capacity for harm, and this experiment was not uniquely harmful.

6/7
AI research desperately needs an ethical code of conduct, but even small steps like those taken by @NeurIPSConf (openreview.net/forum?id=zVoy8…) are bitterly fought against by some of the most prominent researchers in the field.

AI researchers, do better. Please.

7/7

