Gender bias concerns raised over GP app
Onlookers are asking why the chatbot created by Babylon Health – which provides the GP at Hand service – is offering such different guidance to men and women. But the company tells PublicTechnology its service is working as intended.
“Where in your chest is the pain?” the patient is asked.
Having responded that the pain is in the centre of their chest, they go on to answer a number of other questions, revealing that the pain came on suddenly, and is accompanied by a feeling of nausea.
The patient – who is aged 59, and smokes – is then told that two possible causes of their symptoms are depression or a panic attack. A GP appointment is advised in the former case, while the latter can be treated at home.
At the same time, another patient – also a 59-year-old smoker – presents with exactly the same symptoms.
They are told that, while they could be suffering a panic attack, they could also be suffering from gastritis – for which they should book an urgent appointment with their doctor.
Worse still, the cause of their symptoms could be one of several heart problems: pericarditis; unstable angina; or even a heart attack. A trip to A&E is recommended in the first case while, for the latter two, an ambulance should be called immediately.
These two sets of markedly different diagnostic possibilities come from exactly the same source.
In fact, the one and only difference between the two cases is that the first patient is a woman, and the second a man.
But the two people are identical in one key aspect: they do not exist. For that matter, nor does the doctor you may have assumed was providing diagnoses.
The exchanges above took place between the symptom-checker chatbot run by Babylon Health and two example users – created by an anonymous NHS consultant, who goes by Dr Murphy.
Responding to videos and pictures posted online by Dr Murphy, many people seem – understandably – deeply concerned by the program’s failure to even raise the possibility of a heart attack or other cardiological condition in the woman’s case.
Their worries are no doubt amplified by the fact that Babylon’s GP at Hand service is serving as an NHS GP for more than 50,000 UK citizens – one of whom, incidentally, is health secretary Matt Hancock.
Since late 2017, London residents have had the option of switching their registration from their community practice to GP at Hand – which offers remote video consultations via a smartphone app. The service recently launched in Birmingham and Babylon has said it wishes to expand throughout the UK.
The company directly responded to the criticism it faced on Twitter.
“Our system is trained using the best available clinical evidence, supported by subject matter experts. In this case it shows different causes listed for the same symptoms, depending on gender. That’s what the medical evidence suggests should happen,” it said. “The effects of bias can be present in research data, this is something that everyone working in healthcare needs to be aware of. It's great to see your post has generated some thoughtful discussion around the implications of this for AI.”
Speaking to PublicTechnology, Babylon Health’s clinical artificial intelligence and innovation director Dr Keith Grimes – who still practises as an NHS GP one day a week – says that the symptom checker is an optional tool for users, and that patients do not have to use it before booking a consultation.
The AI behind the symptom-checker chatbot operates a probabilistic model, he adds.
Such models are defined as those that can consider the potential effect of unknown variables and suggest likely outcomes based on the available evidence – rather than offering a definitive answer.
The company also operates a model in which “every decision is recorded and can be explained – and if something has gone wrong, we can fix it”, according to Grimes.
“When people report symptoms or report information about their previous history… the chatbot will collect that information… the symptom checker uses information that has been provided by the user. It will also consider some [other] elements,” he says.
Such additional considerations include various “risk factors and the strength of link” they have with a range of conditions depending on the surrounding details.
In the case of the male and female users presenting chest pain: “Our app was working as intended at the time – it was providing information and a triage outcome,” Grimes says. “Clearly there are going to be differences in cases and in symptoms between men and women – they are biologically very different.”
He adds: “The cases presented on Twitter were a snapshot of a final outcome. We have reviewed this since then, and we are confident that the medical evidence supports the outcomes. All the same, there are long-standing concerns about systematic bias in medical research and literature – either conscious or unconscious. We scour [our services] to make sure that it does not show any signs of that.”
Although Babylon claims the chatbot was functioning correctly in this instance, further examination of the case by Grimes and the company’s other “expert clinicians” will take place.
“We were very, very careful not only in public testing, but also in being very aware of any feedback going forward,” he says. “We are still trying to understand what might lead it to behave in this way.”
He adds: “This is such a new area – there is always an opportunity to improve things. Our product is really good – and safe – but there is always the opportunity to make it a bit better. We can find out what happened here, and we can improve our processes.”
Grimes believes that AI can even play an important role in helping to combat intrinsic bias, by “automatically identifying” its presence in systems.
Judging by the ongoing reaction online, others are less convinced.
But Grimes says the company “welcomes the scrutiny” brought by this case.
A recently published study from @EdinUniCVS, demonstrated that typical #HeartAttack symptoms are actually more common and have greater predictive value in Women than in Men... @amy_ferry @KuanKenLee @HighSTEACS @TheBHF https://t.co/lKc6NOMGZJ – Dr Murphy (@DrMurphy11), September 8, 2019
This is perhaps just as well, as the issue of recognising and eradicating bias is one that seems certain to remain in the spotlight. The Centre for Data Ethics and Innovation, established by the government last year, has identified bias and targeting as the two major issues on which it should focus its initial work.
Eleonora Harwich, director of research at think tank Reform and a member of the AI advisory board at the Kent, Surrey and Sussex Academic Health Science Network, says that bias can germinate in several ways.
Data quality is a significant cause, she says. Anyone using health data should assess it against the common definition of the six dimensions of data quality: completeness; timeliness; consistency; integrity; conformity; and accuracy.
But even that is no guarantee of eradicating bias.
“Are you going to take the view that the data you are using is objective? Obviously, there are laws of science and physics, but I do think that the ways we go about collecting data are a social construct – and there are many different ways of measuring its purity,” Harwich says. “If your data is from [NHS] trust A – which does not have a representative population – and you’re then using that data to train the model which is then used by trust B or trust C – it will not work.”
As the use of AI in healthcare becomes increasingly common and high-profile, many will continue to ask of Babylon’s technology – and others like it – whether it is working or not.