The Reliability of Artificial Intelligence in Prioritizing Management of Diabetic Macular Edema: A Comparative Study with Two Retina Specialists

Abstract

Purpose

There are currently no studies evaluating an Artificial Intelligence (AI) Large Language Model's (LLM) reliability for ordinally prioritizing patients. In this study, we assess the ChatGPT Plus model (GPT-4) for prioritizing Diabetic Macular Edema (DME) patients, comparing its performance to that of two board-certified retina specialists (RS). We also investigate two additional questions: which variables are key for the evaluators (GPT-4 and the two RS) in DME prioritization, and how incomplete clinical profiles affect inter-evaluator agreement.

Methods

We used anonymized DME data from Retina Consultants of Texas to create 28 patient profiles. These profiles were divided into 4 sets organized by ascending Diabetic Retinopathy (DR) severity (Set 1), Central Subfield Thickness (CST) (Set 2), and modified Best Corrected Visual Acuity (BCVA) (Set 3), plus a randomly ordered control set (Set 4). We intentionally modified BCVA in Set 3, resulting in clinically incomplete patient profiles. The two RS and GPT-4 ranked the patients within each set from least to most treatment need. To measure agreement between the two RS, and between each RS and GPT-4, we calculated the mean Cohen's Kappa (k) across all 4 sets, interpreted as weak (k = 0.40–0.59), moderate (0.60–0.79), strong (0.80–0.90), and almost perfect (>0.90). The median of the RS evaluations was used to calculate individual set k values as well as the mean set k with GPT-4.
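As a rough illustration of the agreement statistic, the sketch below computes a per-set Cohen's kappa and a mean set kappa for two evaluators' rankings, assuming each set contains 7 of the 28 patients and that ranks are treated as ordinal labels. The specific implementation (scikit-learn's cohen_kappa_score, the choice of no weighting versus linear or quadratic weighting, and the example rankings) is an assumption for illustration, not a detail stated in the abstract.

import numpy as np
from sklearn.metrics import cohen_kappa_score

def set_kappa(ranks_a, ranks_b, weights=None):
    # Kappa between two evaluators' rankings of one set of patients.
    # ranks_a / ranks_b: priority ranks (1 = least treatment need) assigned
    # to the same patients, listed in the same patient order.
    return cohen_kappa_score(ranks_a, ranks_b, weights=weights)

def mean_set_kappa(sets_a, sets_b, weights=None):
    # Mean kappa across sets, e.g. Sets 1, 2, and 4 when Set 3 is excluded.
    return float(np.mean([set_kappa(a, b, weights) for a, b in zip(sets_a, sets_b)]))

def interpret(k):
    # Agreement bands used in this abstract; values below 0.40 are not banded here.
    if k > 0.90:
        return "almost perfect"
    if k >= 0.80:
        return "strong"
    if k >= 0.60:
        return "moderate"
    if k >= 0.40:
        return "weak"
    return "below the reported bands"

# Hypothetical rankings of the 7 patients in one set (illustrative only).
rs1  = [1, 2, 3, 4, 5, 6, 7]
gpt4 = [1, 2, 3, 4, 5, 7, 6]
k = set_kappa(rs1, gpt4)
print(f"set kappa = {k:.2f} ({interpret(k)})")

With weights=None, any mismatch between ranks is penalized equally; linear or quadratic weighting would give partial credit for near-misses, which may be preferable for ordinal rankings.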

Results

Evaluations by the two RS (denoted RS1 and RS2) and GPT-4 showed moderate agreement (Set 3 excluded: RS1-GPT-4 mean set k = 0.631; RS2-GPT-4 mean set k = 0.68). When the median of the RS evaluations was used, GPT-4's agreement with the RS increased within the moderate range (Set 3 excluded: median RS-GPT-4 mean set k = 0.77). Agreement between the two RS themselves was weak (Set 3 excluded: RS1-RS2 mean set k = 0.48). Including Set 3 in the mean set k calculations had no clear impact. GPT-4's responses and explanations did not acknowledge the clinical ambiguities in Set 3, which both RS noted in an optional comment box. Individual set k values (median RS vs. GPT-4) for Sets 1, 2, 3, and 4 were 0.58, 0.79, 0.72, and 0.93, respectively.

Conclusions

The results suggest that GPT-4 shows promise in reliably prioritizing DME patients relative to RS. Ultimately, GPT-4 achieved moderate-to-strong (k ≥ 0.60) agreement with the RS in DME prioritization. Interestingly, GPT-4's higher agreement with the median RS evaluations could indicate an ability to predict the RS consensus despite the two RS showing only weak agreement with each other. This is notable given an LLM's inherent function: predicting, mathematically, the average human response to a given text. It further supports GPT-4's potential as a clinical tool, offering a grounded perspective to assist more nuanced human judgment in decision-making. Incomplete clinical profiles did not clearly affect agreement, possibly suggesting adaptability on GPT-4's part. However, GPT-4 failed to recognize ambiguities in the patient profiles that both RS noted. Human specialists may therefore be better equipped to prevent erroneous decisions when evaluating incomplete or exceptional patient cases. Regarding key variables for prioritization, the almost-perfect agreement in the control set (Set 4) warrants investigation into additional variables beyond those explored in this study.
