Enrico Coiera @EnricoCoiera
Here is my 10 minute peer review of the Babylon chatbot as described in the conference paper at marketing-assets.babylonhealth.com/press/BabylonJ…

Please feel free to correct any misunderstandings I have of the evaluation in the tweets that follow.
To begin, the Babylon engine is a Bayesian reasoner. That's cool. Not sure if it qualifies as AI.
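For readers unfamiliar with the term, a Bayesian reasoner just combines prior disease probabilities with the likelihood of the reported findings to rank a differential. A minimal sketch in Python, with invented diseases, priors and likelihoods (not Babylon's actual model):

```python
# Illustrative Bayesian differential diagnosis; all numbers are made up.
priors = {"flu": 0.05, "common_cold": 0.20, "meningitis": 0.001}

# P(symptom | disease) for a few hypothetical findings
likelihoods = {
    "flu":         {"fever": 0.9, "stiff_neck": 0.05},
    "common_cold": {"fever": 0.3, "stiff_neck": 0.01},
    "meningitis":  {"fever": 0.8, "stiff_neck": 0.7},
}

def posterior(symptoms):
    """Bayes update: prior x product of symptom likelihoods, then normalise."""
    scores = {}
    for disease, prior in priors.items():
        p = prior
        for s in symptoms:
            p *= likelihoods[disease].get(s, 0.01)  # small default likelihood
        scores[disease] = p
    total = sum(scores.values())
    return {d: p / total for d, p in scores.items()}

print(posterior(["fever", "stiff_neck"]))  # ranked differential diagnosis
```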
The evaluation uses artificial patient vignettes which are presented in a structured format to human GPs or a Babylon operator. So the encounter is not naturalistic. It doesn't test Babylon in front of real patients.
In the vignettes, patients were played by GPs, some of whom were employed by Babylon. So they might know how Babylon liked information to be presented and may have unintentionally advantaged it. Using independent actors, or ideally real patients, would have had more ecological validity.
A human is part of the Babylon intervention because a human has to translate the presented vignette and enter it into Babylon. The impact of this human is not explicitly measured.
The vignettes were designed to test known capabilities of the system. Independently created vignettes exploring other diagnoses would likely have resulted in much poorer performance. This tests Babylon on what it knows, not what it might find 'in the wild'.
It seems the presentation of information was in the OSCE format, which is artificial and not how patients might present. So there was no real testing of consultation and listening skills that would be needed to manage a real world patient presentation.
Babylon is a Bayesian reasoner but no information was presented on the ‘tuning’ of priors required to get this result. This makes replication hard. A better paper would provide the diagnostic models to allow independent validation.
The quality of differential diagnoses by humans and Babylon was assessed by one independent individual. Two Babylon employees also rated differential diagnosis quality. Good research practice is to use multiple independent assessors and measure inter-rater reliability.
The safety assessment has the same flaw. Only one independent assessor was used, and no inter-rater reliability measures are presented when the in-house assessors are added. Non-independent assessors bring a risk of bias.
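If multiple assessors had been used, the standard check would be an agreement statistic such as Cohen's kappa between the independent and in-house raters. A quick sketch of what that looks like, using invented yes/no ratings:

```python
# Inter-rater reliability check: Cohen's kappa between two assessors'
# "is this differential reasonable/safe? yes/no" judgements (ratings invented).
from sklearn.metrics import cohen_kappa_score

assessor_independent = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
assessor_in_house    = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

kappa = cohen_kappa_score(assessor_independent, assessor_in_house)
print(f"Cohen's kappa = {kappa:.2f}")  # ~0 is chance-level agreement, 1 is perfect
```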
To give some further evaluation, new vignettes based on MRCGP tests were used. However, any vignettes outside the Babylon system's capability were excluded. They only tested Babylon on vignettes it had a chance to get right.
So, whilst it might be OK to allow Babylon to answer only the questions it is good at for limited testing, the humans did not have a reciprocal right to exclude vignettes they were not good at. This is a fundamental bias in the evaluation design.
A better evaluation model would have been to draw a random subset of cases and present them to both GPs and Babylon.
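Something along these lines, where the same randomly drawn case set goes to every GP and to Babylon (pool size and sample size are hypothetical):

```python
# Fairer allocation: one pre-drawn random case set, identical for every assessee.
import random

random.seed(42)
all_vignettes = [f"vignette_{i:03d}" for i in range(200)]  # hypothetical pool
shared_test_set = random.sample(all_vignettes, 50)          # drawn once, up front

for assessee in ["Babylon", "GP_A", "GP_B", "GP_C"]:
    # every assessee sees the identical set; neither side gets to filter cases
    print(assessee, "sees", len(shared_test_set), "cases")
```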
So, in summary, this is a very preliminary and artificial test of a Bayesian reasoner on cases for which it has already been trained.
In machine learning this would be roughly equivalent to in-sample reporting of performance on the data used to develop the algorithm. Good practice is to report out of sample performance on previously unseen cases.
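A toy illustration of the gap, using a stand-in dataset and classifier rather than anything from the paper:

```python
# In-sample vs out-of-sample: score the model on held-out cases it never saw
# during development, not on the development cases themselves.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)   # stand-in data
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

print("in-sample accuracy:    ", model.score(X_dev, y_dev))          # optimistic
print("out-of-sample accuracy:", model.score(X_holdout, y_holdout))  # honest
```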
The results are confounded by artificial conditions and use of few and non-independent assessors.
So, it is fantastic that Babylon has undertaken this evaluation, and has sought to present it in public via this conference paper. They are to be applauded for that. One of the benefits of going public is that we can now provide feedback on the study's strengths and weaknesses.
PS1/ Today I have spent a little more time looking at the statistical analyses and data in the paper. They reveal additional methodological challenges that I wouldn't mind help with.
Firstly, no statistical testing is done to check if the differences reported are likely due to chance variation. A statistically rigorous study would estimate the likely effect size and use that to determine the sample size needed to detect a difference between machine and human.
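A rough sketch of such a power calculation, assuming (purely for illustration) that we want to detect an 80% vs 90% difference in top-3 accuracy between machine and human:

```python
# Sample-size estimate for comparing two proportions; the 80% vs 90% gap is an
# assumed effect size for illustration, not a figure from the paper.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.90, 0.80)   # Cohen's h for the assumed gap
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0)

print(f"~{n_per_arm:.0f} vignettes per arm")  # roughly 100, far more than the ~50 per doctor
```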
For the first evaluation study, the methods tell us “The study was conducted in four rounds over consecutive days. In each round, there were up to four “patients” and four doctors.” That should mean each doctor and Babylon should have seen “up to” 16 cases.
Table 1 shows Babylon used on 100 vignettes and doctors typically saw about 50. This makes no sense. Possibly they lump in the 30 Semigran cases reported separately, but that still does not add up. Further, as the methods for the Semigran cases were different, they cannot be added in any case.
There is a problem with Doctor B, who completes 78 vignettes. The others do about 50. Further, looking at Table 1 and Fig 1, Doctor B is an outlier, performing far worse than the others diagnostically. This unbalanced design means average doctor performance is penalised by Doctor B.
There is also a Babylon problem. It sees on average about twice as many cases as the doctors. As no rule is provided for how the additional cases seen by Babylon were selected, there is a risk of selection bias, e.g. what if by chance the ‘easy’ cases were only seen by Babylon?
For the MRCGP questions, Babylon’s diagnostic accuracy is measured by its ability to identify a disease within its top 3 differential diagnoses. It identified the right diagnosis in its top 3 in 75% of 36 MRCGP CSA vignettes, and 87% of 15 AKT vignettes.
For the MRCGP questions we are not given Babylon’s performance when the measure is the top differential. Media reports compare Babylon against historical MRCGP human results. One assumes humans had to produce the correct diagnosis, and were not asked for a top-3 differential.
There is a big clinical difference between putting a disease somewhere in your top few differential diagnoses and making it the top one you elect to investigate. It is also an unfair comparison if Babylon is rated by a top-3 differential and humans by a top-1. Clarity on this aspect would be valuable.
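A toy example of how much the scoring rule matters, with invented cases and differentials:

```python
# Top-1 vs top-3 scoring on the same (invented) differential lists.
cases = [
    {"truth": "asthma",   "differential": ["GORD", "asthma", "anxiety"]},
    {"truth": "migraine", "differential": ["migraine", "tension headache"]},
    {"truth": "DVT",      "differential": ["cellulitis", "muscle strain", "DVT"]},
]

top1 = sum(c["truth"] == c["differential"][0]  for c in cases) / len(cases)
top3 = sum(c["truth"] in c["differential"][:3] for c in cases) / len(cases)
print(f"top-1 accuracy: {top1:.0%}, top-3 accuracy: {top3:.0%}")  # 33% vs 100%
```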
In closing, we are only ever given one clear head to head comparison between humans and Babylon, and that is on the 30 Semigran cases. Humans outperform Babylon when the measure is the top diagnosis.
For convenience I have collected this thread into a single blog, fixed a few typos, and reordered some of the material so that it flows more easily.

coiera.com/2018/06/29/pap…