Abstract
Introduction
Artificial intelligence (AI) and large language models (LLMs), such as ChatGPT developed by OpenAI, have been increasingly recognized for their potential to revolutionize various sectors, including healthcare.1,2 In medicine, and more specifically in gastroenterology, these models have shown promise as supportive tools for clinicians, enhancing patient care and improving healthcare delivery.3 However, while the potential benefits are substantial, the application of AI in healthcare is not without its challenges and limitations.4,5
ChatGPT, a conversational AI system based on the generative pre-trained transformer (GPT) architecture, has demonstrated impressive capabilities in various gastroenterological applications. These include answering common patient questions,6 taking part in self-assessment tests,7 and even identifying research priorities.8 Despite these promising applications, the performance of ChatGPT in the medical domain has been inconsistent, with concerns raised about its accuracy and efficacy.9
A recent review article10 noted that GPT-4 could be beneficial for patient–physician communication, patient education, and continuous patient care, potentially mitigating factors related to physicians’ burnout. However, the authors highlighted key limitations and ethical considerations of this AI technology, including patient confidentiality and data security, algorithmic bias, inconsistent and inaccurate responses, plagiarism concerns, compliance with data privacy regulations, and the irreplaceable role of human judgment.
This review aims to provide an evaluation of the role of ChatGPT in gastroenterology, drawing from existing literature. By analyzing studies on the application of ChatGPT in patient communication, medical education, disease management, and research prioritization, we aim to provide a perspective on the potential and challenges of this tool in gastroenterology.
Methods
Study selection
For this systematic review, we included studies that examined the application of ChatGPT in gastroenterology. We excluded studies that focused on other AI models or other areas of healthcare.
Search strategy
We conducted a comprehensive literature search using the PubMed database. The search strategy incorporated a combination of Medical Subject Headings (MeSH) terms and keywords related to ‘ChatGPT’, and ‘Gastroenterology’. The search was limited to articles published in English. Reference lists of included studies and relevant reviews were also manually searched to identify any additional studies.
Data extraction
Two independent reviewers extracted data from the included studies using a standardized data extraction form. Discrepancies were resolved through discussion or consultation with a third reviewer. The extracted information included: study design, sample size, application of ChatGPT (e.g. patient education, self-assessment, continuous care), main findings, and limitations.
Quality assessment
The quality of the included studies was assessed using the Joanna Briggs Institute (JBI) critical appraisal tools, appropriate to each study design. These tools assess the methodological quality of a study and the extent to which a study has addressed the possibility of bias in its design, conduct, and analysis. Studies were categorized as high, moderate, or low quality based on their scores. According to the JBI guidelines, it is recommended that critical appraisal be undertaken by at least two independent reviewers to minimize potential bias. In our study, the quality assessment using the modified JBI critical appraisal tools was conducted with the agreement of three authors (AL, EK, and KS) to ensure the objectivity and robustness of the evaluation.
Data synthesis
We conducted a narrative synthesis of the findings from the included studies. Due to the anticipated heterogeneity in study designs and outcomes, a meta-analysis was not planned. Instead, we focused on summarizing the applications, benefits, and limitations of ChatGPT in gastroenterology as reported in the studies, and on identifying areas for future research.
Results
The systematic review included six studies that evaluated the application, benefits, and limitations of ChatGPT in the field of gastroenterology. The studies were diverse in their objectives and methodologies, and they covered various aspects of gastroenterology, including patient education, self-assessment, patient–physician communication, disease management, and research question generation. The flowchart delineating the selection procedure of the studies included is depicted in Figure 1.

Flowchart delineating the selection procedure of the studies included in the review.
Table 1 summarizes the characteristics of studies included in the review.
Characteristics of studies included in the review.
GERD, gastroesophageal reflux disease; GI, gastrointestinal; GPT, generative pre-trained transformer.
Table 2 summarizes the main findings, and the identified benefits and limitations of ChatGPT in gastroenterology as presented in the studies included.
Key findings and limitations of ChatGPT as identified by studies included.
AI, artificial intelligence; GERD, gastroesophageal reflux disease; PPI, proton pump inhibitor.
Table 3 summarizes the quality assessment performed according to JBI critical appraisal tools.
Quality assessment of studies included in the review using modified JBI critical appraisal tools.
JBI, Joanna Briggs Institute.
ChatGPT as a tool for patients
Two studies examined the efficacy of ChatGPT as a tool for patients, mainly in answering common patient questions. The first study, by Lee et al.,6 assessed ChatGPT’s answers to common patient questions and found that, while the responses were generally credible, they were written at grade reading levels higher than recommended for patient-facing materials.
In another study,9 our group evaluated the utility of ChatGPT in answering a variety of clinical questions addressing a wide range of topics, including common symptoms, diagnostic tests, and treatments for various gastrointestinal conditions. The study revealed that ChatGPT could provide accurate and clear answers in some cases, but not in others, indicating the need for further development. Notably, both studies only examined GPT-3.5, an older and less capable ChatGPT model that is free to access.
ChatGPT as a tool for physicians
As a tool for physicians, ChatGPT was evaluated in several aspects: clinical reasoning, knowledge, and education.
In the clinical field, ChatGPT was evaluated in two different domains: management of gastroesophageal reflux disease (GERD)11 and optimization of post-colonoscopy management.12
The study ‘Evaluation of the potential utility of an artificial intelligence ChatBot in GERD management’11 assessed the utility of ChatGPT in the management of GERD. The authors did not specify which ChatGPT model they investigated. The results showed that ChatGPT provided appropriate and specific recommendations for GERD management in 91.3% of cases, with 29.0% considered completely appropriate and 62.3% mostly appropriate. However, inconsistencies were noted in responses to the same prompt, and some potential proton pump inhibitor (PPI) risks were stated as facts. Notably, patients from diverse educational backgrounds universally regarded the responses as comprehensible and beneficial. Furthermore, all respondents expressed their inclination to consider the tool a valuable resource for obtaining medical information, highlighting the superior utility of the response format compared with that of a conventional search engine.
In the second domain, Gorelik et al.12 evaluated the use of ChatGPT for optimizing post-colonoscopy management, in which the model generated concise patient letters containing guideline-based surveillance recommendations.
In the subject of knowledge and education, Suchman et al.7 evaluated the performance of ChatGPT on gastroenterology board exam-style self-assessment tests.
Surprisingly, ChatGPT was unable to pass multiple versions of the exam, indicating its limitations in gastroenterology subspecialty-level question-answering tasks. Notably, in this study, the authors examined and compared both GPT-3.5 and GPT-4, with no meaningful difference in the results achieved: GPT-3.5 scored 65.1% and GPT-4 scored 62.4% (the passing grade was 70%).
ChatGPT as a tool for researchers
In assessing ChatGPT as a tool for researchers, our group evaluated the use of ChatGPT for highlighting research priorities and identifying open, meaningful top research questions in gastroenterology.8 The research questions generated by ChatGPT achieved high ratings for relevance and clarity and an average rating for specificity, but performed poorly in terms of originality.
Studies heterogeneity
When diving into the complexities of research involving AI models, understanding the underlying methods is pivotal. Several factors can introduce variability in the outcomes of such studies, especially when dealing with models like ChatGPT. Factors contributing to this heterogeneity include the following:
The absence of specific model version details (e.g. GPT-3.5 vs GPT-4), even though different versions can produce markedly different results.
Lack of clarity on the method of question submission – whether they were sent collectively or in individual chat sessions for each prompt and response. The chosen method can significantly influence the replies.
Non-disclosure of the prompts utilized in the research, hindering the reproducibility of the findings.
Uncertainty over whether the study repeated the same prompts to gauge the consistency of the model’s responses. The criteria for prompt selection also remain vague: even minor alterations in wording can drastically change outcomes, a phenomenon particularly evident with GPT-3.5. Enhancing the quality of prompts could substantially refine many of these investigations.
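The consistency concern above can be checked empirically: submit the same prompt repeatedly (each in a fresh chat session) and quantify how much the responses vary. The following Python sketch uses a simple token-overlap (Jaccard) measure as the similarity metric; this metric and the example responses are our own illustrative choices, not methods used by the reviewed studies.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two responses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def consistency_score(responses: list[str]) -> float:
    """Mean pairwise similarity across repeated responses to one prompt.

    1.0 means every repetition was (token-wise) identical; values near 0
    indicate the model answered the same prompt very differently each time.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# In a real study, `responses` would be collected by submitting the same
# prompt N times, each in a fresh session. These strings are hypothetical.
responses = [
    "PPIs are first-line therapy for GERD",
    "PPIs are first-line therapy for GERD",
    "Lifestyle changes and antacids are first-line for GERD",
]
score = consistency_score(responses)
```

Reporting such a score alongside the number of repetitions would make the consistency claims of future studies reproducible.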
Benefits and limitations of ChatGPT in gastroenterology
From the included studies, several benefits of ChatGPT in gastroenterology were identified. These include the ability of ChatGPT to provide appropriate and specific recommendations, aid in patient–physician communication, patient education, and continuous patient care, and generate relevant and clear research questions. However, limitations were also noted, including ChatGPT’s insufficient understanding of complex medical questions, inconsistencies in responses, some potential PPI risks being stated as fact, some responses providing limited specific guidance, and struggles with originality. Ethical considerations were also raised, such as confidentiality and data security, stereotypes, bias and inaccuracy, plagiarism concerns, compliance with data privacy regulations, and the irreplaceable role of human judgment.
Discussion
The results of this systematic review provide a comprehensive evaluation of the role of ChatGPT, an LLM, in the field of gastroenterology. The included studies highlight the potential of ChatGPT in various applications, including patient education, self-assessment, patient–physician communication, disease management, and research question generation. However, they also underscore several limitations and ethical considerations that warrant further exploration and careful regulation.
In the era of digital health, it is essential to critically evaluate emerging technologies and their potential impact on healthcare delivery. This review contributes to this ongoing discourse, offering a focused examination of ChatGPT in the context of gastroenterology.
Our review focused on ChatGPT, as it stands out as the most popular LLM chat tool, for patients and physicians alike. Therefore, it is important to assess its performance on tasks relevant to both groups.
We believe that the insights gleaned from this review will be valuable not only to practitioners and researchers in gastroenterology, but also to policymakers, AI developers, and the broader healthcare community as we navigate the integration of AI into healthcare.4,5
The studies evaluating the efficacy of ChatGPT in answering common patient questions6,9 reveal a mixed picture. While ChatGPT demonstrated the ability to generate credible medical information, its performance was inconsistent: some responses were accurate and clear, while others were not, indicating an insufficient understanding of complex medical information. Moreover, the AI-generated answers were written at significantly higher grade reading levels than recommended, potentially limiting their accessibility to patients with lower literacy levels. Notably, this issue could likely be mitigated with prompt adjustments at a system level. These findings echo the cautionary note in the editorial ‘Will ChatGPT transform healthcare?’4 and highlight the need for further development and fine-tuning of ChatGPT to ensure its reliability and accessibility in patient education.
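As an illustration of such a system-level prompt adjustment, the sketch below assembles a request in the OpenAI chat-completions payload format, with a system message pinning the reading level of the answer. The model name and the exact wording of the instruction are assumptions for illustration, not settings reported by the reviewed studies.

```python
def build_patient_request(question: str, model: str = "gpt-4") -> dict:
    """Assemble a chat-completion request whose system message constrains
    the reading level of the answer (payload shape follows the OpenAI
    chat-completions format; the model name is an assumption)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": (
                    "You are a patient-education assistant. Answer at a "
                    "6th-grade reading level, using short sentences and "
                    "avoiding medical jargon."
                ),
            },
            {"role": "user", "content": question},
        ],
    }

req = build_patient_request("What causes GERD?")
```

Because the system message is fixed at deployment time, a patient-facing service could enforce the reading level uniformly without relying on each user to phrase their question carefully.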
In the context of gastroenterology board exam-style medical reasoning, ChatGPT did not achieve a passing score using the methods of the recently published study, indicating its limitations as an educational tool in its current form.7 Notably, this study examined and compared the performance of both versions of the chatbot, the free version (GPT-3.5) and the advanced version (GPT-4). The advanced version did not demonstrate an advantage over the free version; on the contrary, it scored 2.7 percentage points lower. This finding emphasizes the need for continuous updates and the development of fine-tuned models specifically geared toward medical education, as suggested in the study, or the use of additional LLM augmentation methods such as database linkage. Given the dynamic nature of medical knowledge, AI tools used in medical education need to be capable of providing accurate, up-to-date information and of incorporating new evidence and guidelines as they become available.
ChatGPT shows promise in enhancing patient–physician communication and continuous patient care.10 Its ability to take patients’ medical history, present the information in a concise, structured format, and continuously learn and improve based on the responses it receives could potentially improve healthcare outcomes. Moreover, by taking on tasks such as patient education and medical history taking, ChatGPT could help reduce physician burnout. However, the ethical considerations and limitations of AI, including confidentiality and data security, stereotypes, bias and inaccuracy, plagiarism concerns, compliance with data privacy regulations, and the irreplaceable role of human judgment, need to be addressed. AI technologies like ChatGPT should complement, and not replace, the human elements of empathy and professional judgment.
In the management of GERD, ChatGPT provided appropriate and specific recommendations in the majority of cases.11 However, inconsistencies in responses to the same prompt and some potential PPI risks being stated as fact were identified as limitations. These findings highlight the need for rigorous clinical oversight in the use of ChatGPT in disease management.
ChatGPT’s ability to generate relevant, clear, and moderately specific research questions is noteworthy.8 However, it struggled with originality, suggesting the need for further work to improve the novelty of the generated research questions.
While the specific training data used for ChatGPT has not been disclosed, it is likely that, similar to other LLMs, it was trained on vast amounts of information derived from the internet and other open-access sources. However, ChatGPT is not innately attuned to medical nuances.6,9 This may explain its inconsistency in providing clear and accurate information on gastroenterological issues.
In addition, ChatGPT’s information might not always be up to date, especially regarding recent medical research and guidelines.7 Its training data depends on the open-source information available at the time of training; thus, it might not be aware of newer studies or guidelines unless it is retrained on newer data.
A significant challenge is ChatGPT’s language complexity.6 This complexity is not an inherent flaw but a byproduct of the data it was trained on. However, for patient interactions, it is beneficial that the information is delivered at an accessible reading level.
Bias and the potential for manipulation through prompt engineering6 arise because the model reflects the data it was trained on. If biases exist in those datasets, they will also be present in the model’s outputs. This can be hazardous in medical applications where impartiality is essential.
The inconsistencies in responses to similar prompts11 may be a byproduct of the model’s probabilistic text generation over its vast training data, which leads it to produce varied responses to the same input. However, in a clinical context, consistency is vital, making this unpredictable behavior a clear limitation.
A limitation demonstrated in the field of gastroenterology, particularly in medical reasoning and board exams, is the lack of domain-specific training for ChatGPT.7 A model specifically fine-tuned on gastroenterological data could potentially outperform the general ChatGPT.
Notably, this could be related to how the task was presented to the model, or to a limitation of ChatGPT in particular. Perhaps with prompt engineering and an increase in the model’s temperature (a parameter controlling the randomness, or ‘creativity’, of its output), more original responses could have been generated.
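To make the temperature knob concrete, the sketch below builds a chat-completions request with an elevated temperature: higher values flatten the token probability distribution, so the model more often samples lower-probability (and thus more original) continuations, while a temperature of 0 makes output near-deterministic. The model name, the 1.2 value, and the prompt are illustrative assumptions, not settings used in the reviewed study.

```python
def build_brainstorm_request(prompt: str, temperature: float = 1.2) -> dict:
    """Chat-completion request with a raised sampling temperature.

    The OpenAI API accepts temperatures in the 0-2 range; 1.2 here is an
    illustrative choice intended to encourage more varied output for
    brainstorming-style tasks such as research question generation.
    """
    return {
        "model": "gpt-4",            # assumed model name
        "temperature": temperature,  # higher => more varied sampling
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_brainstorm_request("Propose novel research questions in IBD.")
```

Note the trade-off: the same parameter that might improve originality for brainstorming would worsen the response consistency that clinical use cases require.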
This finding aligns with the review published by Sharma and Parasa,3 which emphasizes the need for careful implementation and regulation of AI tools in healthcare.
ChatGPT proved its capability in handling various scenarios and descriptions effectively, providing concise patient letters in post-colonoscopy management by offering guideline-based recommendations.12 These findings suggest that ChatGPT has the potential to assist healthcare providers in streamlining post-colonoscopy decision-making and improving adherence to post-colonoscopy surveillance guidelines.
This review has some limitations. First, the number of studies included in the review was relatively small, limiting the generalizability of the findings. However, as far as we know, it summarizes the current literature on the topic of ChatGPT in the field of gastroenterology. Second, the included studies were diverse in their objectives and methodologies, making it challenging to make quantitative analyses.
Finally, given the fast-paced advancements in AI technology, the conclusions of this review might soon become outdated. Notably, the majority of the studies assessed the free version of ChatGPT (GPT-3.5). However, recent research suggests that the improved version, GPT-4, performs better in the medical field.13
In conclusion, the review of ChatGPT’s application in gastroenterology reveals mixed outcomes. While showing promise as a tool for physicians (e.g. in GERD management and post-colonoscopy adherence to guidelines), it struggled with inconsistencies in patient education and failed self-assessment tests. However, most data were generated using the free version of ChatGPT (GPT-3.5), while the improved version (GPT-4) may achieve better results. Our findings emphasize the potential of ChatGPT but also underline clear limitations and the need for further refinement and ethical scrutiny.
