Ever since the launch of ChatGPT in November 2022, there has been tremendous buzz around the potential applications of artificial intelligence (AI) in a vast number of fields. Medicine, in particular, stands to benefit from advances in this technology, and some medical researchers are using AI to inform their scholarly work.
Elizabeth Park, MD, MS, is an assistant professor of medicine in the Division of Rheumatology and Clinical Immunology at the Columbia University Vagelos College of Physicians and Surgeons, New York. She completed a master’s degree with a focus on biostatistics, data science and bioinformatics through the Columbia University Mailman School of Public Health, and is now using her advanced knowledge of large language models to improve clinical research using electronic health records. The Rheumatologist sat down with Dr. Park to explore her thoughts on this subject.
The Rheumatologist (TR): Can you explain the type of clinical research you conduct and the types of research questions you seek to answer?
Dr. Park: First of all, thank you for interviewing me. This area isn’t new in medicine, but I still think it’s a bit new for rheumatology. Hopefully, articles like this can inspire or trigger some interesting collaborations and conversations.
I use electronic health record (EHR) data to build clinical cohorts and generate evidence, including studying important clinical outcomes and associations. I extract information from EHR data at my local institutions, like New York-Presbyterian/Columbia University Irving Medical Center (NYP/CUIMC), as well as [from] an international network of EHRs and claims databases called the Observational Health Data Sciences and Informatics (OHDSI), which was founded at Columbia University. One example of a clinical association I am currently studying is between methotrexate (and other disease-modifying anti-rheumatic drugs) and interstitial lung disease in rheumatoid arthritis (RA-ILD). I am using NYP/CUIMC and Cornell EHR data and extending this to OHDSI, which contains over 200 million unique patient records. Hopefully this can generate a solid amount of evidence and strengthen prior studies that indicated no strong associations between methotrexate and RA-ILD. This would finally relieve all of us rheumatologists and pulmonologists from this long-standing concern.
TR: How did you become interested in data science and bioinformatics?
Dr. Park: I think it came from my somewhat nerdy inclination and interest in manipulating large volumes of data. Studying patterns within the data and brainstorming efficient ways of extracting, processing and synthesizing data were already among my interests, and this naturally led to my focus on using data science and informatics methods.
TR: What educational programs and training did you seek out to become skilled in the application of bioinformatics and artificial intelligence to research activities?
Dr. Park: Luckily, as part of my research track, I completed a master’s degree in patient-oriented research at the Mailman School of Public Health. I was able to enroll in a few elective classes that focused on data science, machine learning and bioinformatics. Columbia has a really strong Department of Biomedical Informatics (DBMI), and I was able to enroll in a course that focused on modeling clinical terminologies and vocabularies (like International Classification of Diseases, SNOMED CT and Unified Medical Language System) extracted from EHR data and mapping them into a common language model, which is basically one of the premises of OHDSI.
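As an editorial illustration of the mapping idea Dr. Park describes, here is a minimal Python sketch that translates diagnosis codes from different source vocabularies into one shared concept label. It is not OHDSI tooling or Dr. Park's code: the lookup table, the concept labels (CONCEPT_RA, CONCEPT_ILD) and the function name are hypothetical, and the source codes are included only as examples that should be checked against the official vocabularies.

```python
from typing import NamedTuple, Optional


class SourceCode(NamedTuple):
    vocabulary: str  # e.g., "ICD10CM" or "ICD9CM"
    code: str        # the code as recorded in the source system


# Hypothetical mapping table. A real project would load this from a
# curated vocabulary resource rather than hard-coding it.
TO_STANDARD = {
    SourceCode("ICD10CM", "M06.9"): "CONCEPT_RA",
    SourceCode("ICD9CM", "714.0"): "CONCEPT_RA",
    SourceCode("ICD10CM", "J84.10"): "CONCEPT_ILD",
}


def to_standard_concept(vocabulary: str, code: str) -> Optional[str]:
    """Return the shared concept label, or None if the code is unmapped."""
    key = SourceCode(vocabulary.strip().upper(), code.strip())
    return TO_STANDARD.get(key)


# Records from two sites that code the same diagnosis differently.
records = [("ICD10CM", "M06.9"), ("icd9cm", "714.0"), ("ICD10CM", "Z00.00")]
mapped = [to_standard_concept(vocab, code) for vocab, code in records]
```

Once both sites' records resolve to the same label, they can be pooled into one cohort. Unmapped codes come back as None so they can be reviewed rather than silently dropped.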
TR: Briefly, can you explain what a large language model is and how such models can be used to conduct research?
Dr. Park: Large language models (LLMs) are algorithms/tools designed to automatically process and extract information from large volumes of text; these models are usually optimized and primed through prior training. Clinical researchers have used LLMs to process clinical documents, educational materials or scientific abstracts and manuscripts to extract important clinical variables and characteristics or to synthesize literature.
TR: How smart are these algorithms in accurately identifying relevant information in a patient’s chart?
Dr. Park: I guess this is what remains to be seen. Expectations among clinicians seem high. For instance, we want these models to perform efficiently and accurately across a wide range of tasks, including synthesizing complex clinical/patient data like discharge summaries, translating clinical language into patient-centered summaries, providing appropriate response templates for patient questions on messaging systems, and even suggesting differential diagnoses and treatment strategies (which encroaches on clinical decision making). In the RA landscape, we are using these more and more to correctly identify the diagnosis, as well as important clinical elements (like disease activity scores), in huge volumes of text drawn from the EHR.
I think the biggest issues right now include the fact that LLMs can create hallucinations (i.e., information that is false, fabricated, nonsensical and/or not present in the data elements you presented to the model) and reinforce assumptions and biases inherent in the data you fed them. So those areas need to be fine-tuned to create truly smart LLMs.
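One simple safeguard against the "not present in the data" kind of hallucination can be sketched in Python, again as an editorial illustration rather than a method described by Dr. Park. The sketch assumes the model has been asked to return verbatim snippets, and it flags any snippet that does not appear in the source note. The function names and the sample note are hypothetical, and a verbatim match cannot catch paraphrased or subtly altered output.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and collapse all whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_grounded(extracted: str, source_note: str) -> bool:
    """True if the extracted snippet occurs in the note, ignoring case and spacing."""
    needle = normalize(extracted)
    return bool(needle) and needle in normalize(source_note)


# Hypothetical note and two hypothetical model outputs.
note = (
    "Assessment: seropositive rheumatoid arthritis. DAS28-CRP 3.4 today.\n"
    "Continue methotrexate."
)
candidates = ["DAS28-CRP 3.4", "interstitial lung disease"]

# Map each candidate to whether it is supported by the note text.
flags = {snippet: is_grounded(snippet, note) for snippet in candidates}
```

Here the disease activity score is found in the note, while "interstitial lung disease" is not and would be sent for manual review.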
TR: What are some of the potential challenges of using LLMs in the type of research you do?
Dr. Park: For me, there are several practical barriers to integrating LLMs into my work: 1) acquiring and consolidating massive amounts of EHR text data (like clinical notes) from multiple sources; 2) collaborating with the right data science/bioinformatics/AI engineering partners to pre-process and format such data so they are fit for use with LLMs; and 3) ensuring protected health information (PHI) is safeguarded under institutional guidelines and working within that framework when using open-source LLM tools.
TR: For physicians and researchers curious about AI but not well versed in its details, what do you recommend they do to learn more about this topic?
Dr. Park: It’s a rapidly changing landscape, so it requires a lot of upkeep, both at your own institution as well as in the literature. For rheumatology researchers who are interested, I think you first have to come up with a good, applicable clinical question and make sure you are working with the right data science/bioinformatics partners who have the technical capacity to use LLMs. Of course, you must also ensure you are working within the framework of your institutional policies. I encourage everyone reading this article to learn more about this exciting area.
Jason Liebowitz, MD, FACR, is an assistant professor of medicine in the Division of Rheumatology at Columbia University Vagelos College of Physicians and Surgeons, New York.