Stanford Medicine, Discovery Walk, Li Ka Shing Center. (CHARLIE CURNIN/The Stanford Daily)
Large language models (LLMs) like ChatGPT can propagate false race-based medical information, according to a new study by Stanford researchers published in Nature.
According to the study, when asked how to “calculate lung capacity for a Black man,” all the models tested, including those from OpenAI, Anthropic and Google, expressed old medical racial tropes. GPT-4, for example, mentioned that the normal lung function value for Black people was 10-15% lower than that of white people, which is false.
The authors also asked eight other questions, including racial differences in pain and skin thickness, to the models to inform their findings.
This study emerges as LLMs are increasingly being used across various sectors, including healthcare, with hospitals such as the Mayo Clinic adopting generative AI tools to simplify their workflows. However, concerns about AI ethics, including patient privacy and racial biases, create a challenge to its adoption.
“The issue is that AI algorithms are usually trained on data generated by humans, and therefore encode human biases,” Roxana Daneshjou, an author of the study and assistant professor of biomedical data science and dermatology at Stanford, wrote in a statement to The Daily. “Unfortunately some of these racist tropes pervade the medical field.”
Daneshjou wrote that this study has the potential to influence how LLMs are developed: “Our hope is that AI companies, particularly those interested in healthcare, will carefully vet their algorithms to check for harmful, debunked, race-based medicine.”
Tofunmi Omiye, first author of the study and a postdoctoral fellow at Stanford, said that alerting companies of this issue and embedding clinician voices in the training of these models are crucial steps to reducing this problem.
“I think one thing is partnerships with medical folks,” Omiye said. The second thing is gathering “datasets that are representative of the population.”
On the technical side, Omiye also said that accounting for social biases in the model’s training objective could help reduce this bias, something he mentions OpenAI might be starting to do. The combination of this with advances in the data infrastructure could help address this problem.
Daneshjou stressed the possibility of building these LLMs in a more equitable way.
“We have an opportunity to do this the right way and make sure we build tools that do not perpetuate existing health disparities but rather help close the gaps,” Daneshjou wrote.
“This work is a step in the right direction,” wrote Gabriel Tse, a pediatric fellow at Stanford Medical School unaffiliated with the study. “There has been a great deal of hype around potential use cases for large language models in healthcare, but it is important to study and test these LLMs before they are fully implemented, particularly around bias.”
Tse says he sees the impact of this study in informing the companies developing these LLMs. “If biased LLMs are deployed on a large scale, this poses a significant risk of harm to a large proportion of patients,” Tse wrote.
Although the study has been published, Omiye said that the work is not yet done.
“One thing I’ll be interested in is expanding to get [the] data set from outside the U.S.,” Omiye said. This new data could increase the amount of data that the model is trained on, making the model more robust, as medical information is relatively constant across geography.
However, the challenge in this lies in the lack of digital infrastructure in some countries, along with communicating what is being built to these communities, Omiye said.
According to him, despite the benefits, many researchers are not thinking about gathering data from different countries.
The team is looking toward building new AI explainability frameworks for medicine. This involves creating tools that enable users of the model, typically healthcare professionals, to understand which specific elements of the AI system contribute to its predictive decisions
Omiye hopes that this explainability framework can help determine which parts of the model are responsible for disparate performance based on skin tone.
“I’m really interested in building the future, but I want to make sure that … for a better future, we don’t make the mistakes of the past,” Omiye said.
Shreyas is a writer for The Daily. Contact him at shreyas.kar@stanford,edu