Large Language Models (LLMs) in Healthcare are Racist. This is My Surprise Face.
I don’t know how many email subscriptions you have for medical device-related information. But if you have as many as I do, you’ve been inundated with opinions, papers, articles, and other stories about artificial intelligence in healthcare over the last couple of weeks. It’s a LOT, and I’ve honestly had trouble wading through all of the information to make sense of it so I could share my thoughts with you.
But one thing stuck out to me, and it should be no surprise to those who follow my writing on this topic. A recent article published by Omiye et al. in the journal npj Digital Medicine found that large language models used in healthcare “propagate race-based medicine.” The authors found that models being used in healthcare settings responded to questions with debunked “harmful, inaccurate, and race-based content” that reflects the underlying information used to train these models. Interestingly, the models were inconsistent in their race-biased responses and even fabricated results, which raises even more questions for me about the validity of the information used to train them in the first place.
As I have written previously, data used to train artificial intelligence medical devices or healthcare decision-making algorithms must account for the fact that our historical real-world data is not representative of the population as a whole. In addition, the results of this article remind us that large language models trained on documentation available on the internet, in textbooks, training manuals, and other sources tend to be biased against minorities and marginalized populations. As the authors of the paper noted, these models “may amplify biases, propagate structural inequities that exist in their training data, and ultimately cause downstream harm” to patients.
If developers of large language models and artificial intelligence medical devices continue to use only existing data to train their models, they will effectively perpetuate the bias and disparity that persist in our healthcare system to date. Worse, they may cause actual harm to patients by failing to include sufficiently diverse real-world data in the datasets used to train these systems.
While regulators are putting expectations on innovators to include data from the entire population for which a device is intended to be used, there is still a lot of confusion and a lack of harmonization in regulatory frameworks for these devices. This is becoming even more problematic as the technology races ahead of regulatory authorities’ ability to define expectations and requirements. AI-driven medical devices and healthcare decision-making tools have the potential to revolutionize patient care. But we must implement sufficient guardrails around how this revolution materializes to protect patients from bias and harm.
Ultimately, the burden is on innovators and industry to ensure they are using diverse, equitable, and representative data in the development of artificial intelligence devices and healthcare solutions. There is a persistent need for industry consensus and governance around what data should be used, how it should be collected and aggregated, and how it will be used to train healthcare-related AI. Given the broad range of technologies and methods for collecting diverse real-world clinical data, there is simply no excuse for failing to leverage these resources to serve patients with equity and respect.