Assessing the Reliability of Current AI Platforms in Delivering Health Information Related to Crohn Disease, Ulcerative Colitis, and Colorectal Cancer



In recent years, advances in artificial intelligence (AI) have revolutionized the way people seek and access health information. As more people turn to AI for answers, it is important to ask how safe it is to rely on AI for health-related questions. We evaluated the accuracy of ChatGPT, a language model developed by OpenAI, and Gemini, Google’s AI platform, in providing health information related to Crohn disease, ulcerative colitis, and colorectal cancer.


We generated 10 questions covering the social, psychological, economic, and physical challenges that patients with Crohn disease, ulcerative colitis, or colorectal cancer may face. Each question was adapted for each disease, resulting in 30 questions that were posed to each of the two AI models. Each question was run three times, yielding 90 responses per AI model. We also measured the Flesch-Kincaid readability scores for each response and analyzed the sentiment of the text using natural language processing and computational linguistics. The Centers for Disease Control and Prevention (CDC) recommends that medical information for the public be written at no higher than an eighth-grade reading level. Six gastroenterology attendings and fellows evaluated the AI-generated responses for accuracy within the context of a patient seeking information. A set of three responses was deemed inappropriate if any of the three contained inaccurate or misleading information, based on clinical judgment. Evaluators were blinded to model names and prices. Interrater agreement (94%) and reliability (κ = 0.87) were high. The study was performed in July 2023.
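The two readability measures can be sketched as follows. This is a minimal illustration using the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas; the tool actually used in the study is not stated, and the `count_syllables` heuristic and the `readability` helper are assumptions introduced only for this example.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a dictionary lookup):
    count runs of consecutive vowels, dropping a trailing silent 'e'."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Standard Flesch formulas: higher ease means easier text,
    # while grade level approximates a US school grade.
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level


if __name__ == "__main__":
    ease, grade = readability("Crohn disease is a chronic condition. It can affect daily life.")
    print(f"reading ease: {ease:.2f}, grade level: {grade:.2f}")
```

Under these formulas, a grade level above 8 exceeds the eighth-grade ceiling cited above. Results from a heuristic syllable counter will differ somewhat from those of dedicated readability software.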


Of the 60 questions posed to the two AI language models (30 per model), 45% (n = 27) received inaccurate responses. By model, 43.33% (n = 13) of ChatGPT’s responses and 46.67% (n = 14) of Gemini’s responses were deemed inaccurate. ChatGPT’s responses had an average Flesch-Kincaid reading grade level of 13.20 and an average Flesch-Kincaid readability score of 31.06. Gemini’s responses had an average Flesch-Kincaid reading grade level of 8.34 and an average Flesch-Kincaid readability score of 56.92. ChatGPT’s average sentiment score was 1.23, while Gemini’s was 0.92.
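As a quick consistency check, the reported percentages can be reproduced from the counts: 13 and 14 out of 30 questions per model, and 27 out of 60 combined. The `percent` helper below is a hypothetical name used only for this sketch.

```python
def percent(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


chatgpt_pct = percent(13, 30)        # ChatGPT: 13 of 30 questions
gemini_pct = percent(14, 30)         # Gemini: 14 of 30 questions
combined_pct = percent(13 + 14, 60)  # both models: 27 of 60 questions

print(chatgpt_pct, gemini_pct, combined_pct)
```

The two per-model counts sum to the combined count of 27, so all three figures describe the same category of response.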


While OpenAI’s ChatGPT and Google’s Gemini can serve as valuable resources for information retrieval, they have limitations in providing health-related information on Crohn disease, ulcerative colitis, and colorectal cancer. Importantly, both AI models in the study provided inappropriate responses to common patient questions regarding these conditions. Medical professionals should be aware of these limitations, as inaccurate responses may spread misinformation in populations with limited access to health care.