Introduction
Artificial intelligence (AI) has made noteworthy progress in recent years and has become increasingly prevalent across diverse domains. 1 A significant recent development is the emergence of ChatGPT, a chatbot based on a large language model created by OpenAI, which was unveiled and made publicly available on 30 November 2022. 2 ChatGPT amassed 1 million users within 5 days of its release and 100 million users within 2 months, making it the fastest-growing application in history. 3 The primary objective of ChatGPT is to produce responses that are both contextually relevant and logically coherent. Its implementation has garnered attention in domains that have conventionally relied on human ingenuity and efficiency, such as marketing, education, and customer service. Research has demonstrated the efficacy of ChatGPT in answering questions on licensing examinations in various professions, including medicine. One of the first studies on this topic established that ChatGPT can achieve a passing score (or a near-passing score) on the United States Medical Licensing Examination (USMLE), the three-step examination program for medical licensure in the United States. 4
The Medical Final Examination is the Polish equivalent of the USMLE; successful completion enables candidates to apply for a license to practice medicine in Poland (as well as in the European Union, per Directive 2005/36/EC of the European Parliament). 3 Under current law, final-year medical students or graduates of medical schools are eligible to take the examination. 5 The Medical Final Examination comprises 200 test questions covering various medical specialties: internal medicine, pediatrics, surgery, obstetrics and gynecology, psychiatry, family medicine, emergency medicine and intensive care, bioethics and medical law, medical jurisprudence, and public health (Table 1 presents the distribution of questions across sections, including oncological topics). Attaining a minimum of 56% of the maximum achievable points is a prerequisite for passing the examination.5,6 The outcome of the Medical Final Examination holds significant importance for physicians: it not only confers full medical practice privileges but also serves as a pivotal factor in selection for future specialty training programs. 7
Table 1. Thematic structure of the Medical Final Examination.
Since ChatGPT was capable of passing the USMLE and “becoming a doctor” in the United States, would it also be able to do so in Poland?
Material and methods
This study aimed to determine whether the ChatGPT chatbot could pass the Medical Final Examination, which is required to practice medicine in Poland; the exam is considered passed if at least 56% of the tasks are answered correctly.
To achieve this, ChatGPT version 3.5 was presented with questions from 11 examination sessions held in 2013–2015 and 2021–2023 (the content and statistics of which were disclosed by the exam organizer, the Medical Examination Center; questions from 2016–2020 and their statistics were not publicly available at the time of writing due to regulations). 8 The choice of ChatGPT-3.5 was primarily driven by the limited time between the deployment of ChatGPT-4 and the start of our research, constraints on research funding, and the authors' decision to study the version of the chatbot that is accessible to everyone at no cost. ChatGPT was presented with a total of 2138 unique questions between 19 and 26 May 2023, in the form of 11 tests containing 192 to 198 questions each (from each test, which originally contained 200 questions, the authors excluded some questions due to inconsistencies, errors, outdated content, or the need for figure analysis). The questions in each test were provided sequentially, in the same order as in the actual exam, within a single chat window per test session, without special individual prompts or updates (as in a genuine examination). The answers provided by ChatGPT were compared to the official answer key, which had been reviewed for any changes resulting from the advancement of medical knowledge. The Supplemental File depicts the initial phase of the procedure: copying questions from the exam organizer's website and submitting them to ChatGPT to retrieve responses.
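Scoring against the official key reduces to a simple tally against the 56% threshold. A minimal sketch of this comparison (the question IDs and answers below are hypothetical, not the study's data, and this is not the authors' actual tooling):

```python
def score_exam(responses, answer_key, pass_threshold=0.56):
    """Compare a test-taker's answers to the official key.

    Returns the fraction of correct answers and whether the 56%
    passing threshold of the Medical Final Examination was met.
    Unanswered questions simply count as incorrect.
    """
    correct = sum(1 for q, key in answer_key.items() if responses.get(q) == key)
    fraction = correct / len(answer_key)
    return fraction, fraction >= pass_threshold


# Hypothetical 5-question excerpt: 3/5 correct = 60%, above the 56% bar
key = {1: "A", 2: "C", 3: "B", 4: "E", 5: "D"}
given = {1: "A", 2: "C", 3: "B", 4: "A"}  # question 5 left unanswered
fraction, passed = score_exam(given, key)
```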
To facilitate later analysis, we classified all questions by examination domain (internal medicine, pediatrics, surgery, obstetrics and gynecology, family medicine, emergency medicine and intensive care, psychiatry, bioethics and medical law, medical jurisprudence, and public health); by A-type versus K-type assignment (per the regulations of the Medical Examinations Center, A-type assignments require a single correct response, while K-type assignments require the correct set of statements 9 ; examples of both types are provided in Table 2); by whether the question asked for true or false statements (e.g., “true statements are. . .,” “the most likely is. . .” vs. “false statements are. . .,” “the least likely is. . .”); and by the theoretical or clinical nature of the question. All test-takers' answers, the percentage of test-takers who selected the correct answer, and each question's difficulty index (ranging from 0 to 1, with a lower index indicating a more difficult question, per the definition of Nitko adopted by Johari et al. 10 ) were extracted from the Medical Examination Center's data. 8 The obtained data were statistically analyzed using the mean and standard deviation, Student's t-test, and relative risk (RR) with 95% confidence intervals (CI).
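The per-question difficulty indices make a two-sample comparison straightforward. A sketch of the pooled-variance Student's t statistic on hypothetical difficulty-index samples (the study itself used the full per-question data; these numbers are illustrative only):

```python
import math
from statistics import mean, stdev

def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance
    (equal-variance form); |t| is then compared against the
    critical value for len(a) + len(b) - 2 degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = stdev(sample_a) ** 2, stdev(sample_b) ** 2
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(pooled * (1 / na + 1 / nb))

# Hypothetical difficulty indices for two groups of questions
t = student_t([0.90, 0.80, 0.85], [0.60, 0.65, 0.70])
```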
Table 2. Examples of A-type and K-type tasks used in the Medical Final Examination in Poland.
Results
A total of 2138 tasks were submitted to ChatGPT, with an average difficulty index of 0.744 ± 0.209. Of these, 84.85% were classified as A-type; these questions differed significantly in difficulty from K-type questions (0.752 ± 0.206 vs. 0.696 ± 0.219).
ChatGPT answered 58.61% of all questions correctly, whereas human physicians achieved 75.60%.
An analysis was conducted on the percentage of correct answers provided by AI and doctors, divided into 11 examination sessions. In three sessions (Spring 2013, Spring 2014, and Fall 2015), ChatGPT was unable to attain the mandated passing threshold of 56%. Results attained by physicians were notably superior to those achieved by AI in almost every session (except Fall 2013). Detailed results are presented in Table 3.
Table 3. Detailed results of ChatGPT and humans in different sessions of the Medical Final Examination.
The accuracy rates of AI and human participants were compared across specific task domains. ChatGPT performed best on questions related to public health (82.56%) and psychiatry (77.18%). Although the absolute values on psychiatry-related tasks favored AI, the difference was not statistically significant. In the remaining domains (except public health), physicians exhibited a statistically significant advantage. Table 4 presents the complete results.
Table 4. Detailed results of ChatGPT and humans in different domains of tasks in the Medical Final Examination.
The study revealed a noteworthy performance disparity between A-type and K-type questions, with ChatGPT achieving a higher accuracy rate on A-type questions (61.69% vs. 40.99%).
All presented questions were classified into five equal quintiles based on their difficulty index, categorized as very easy (difficulty index ranging from 0.923 to 0.996), easy (0.855–0.922), intermediate (0.748–0.854), hard (0.567–0.747), and very hard (0.566 or less). The accuracy rates of the responses were then compared. Consistent with prior findings on the correlation between difficulty index and performance, ChatGPT exhibited optimal proficiency on tasks categorized as very easy while demonstrating worse efficacy on the most challenging items (Table 5).
Table 5. Results of ChatGPT and humans in questions grouped based on difficulty index quintiles.
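The quintile cut-offs above translate directly into a lookup. A sketch using the thresholds reported in this study (lower difficulty index means a harder question):

```python
def difficulty_band(index):
    """Map a question's difficulty index (0-1, lower = harder)
    to the five quintile bands used in this study."""
    if index >= 0.923:
        return "very easy"
    if index >= 0.855:
        return "easy"
    if index >= 0.748:
        return "intermediate"
    if index >= 0.567:
        return "hard"
    return "very hard"
```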
An interesting qualitative observation pertains to ChatGPT's selection of answer choices. Despite the presence of distinct answer options labeled A, B, C, D, and E (with only one being correct), in 64 tasks (2.99%) the AI either did not select any answer or indicated multiple answers as correct. A significantly increased risk of such situations was associated with A-type questions (RR 3.61; 95% CI: 1.14–11.43) and with questions aimed at detecting false statements (RR 2.36; 95% CI: 1.39–4.02).
Discussion
AI has garnered worldwide attention (ChatGPT reached 100 million monthly active users just 2 months after launch, making it the fastest-growing consumer application in history 11 ), but it has also raised concerns about algorithms and machines replacing human labor, a concern that has persisted since the Industrial Revolution. 12 Medicine requires a thorough approach to all issues and a vast knowledge base, especially for doctors who want to help their patients.13,14 This study did not assess ChatGPT's therapeutic efficacy; however, given the rapid pace of AI development, passing an examination that permits autonomous practice is a worthwhile first step for future discussions in this area.
Kung et al. demonstrated that ChatGPT could successfully complete the USMLE without any prior training. 4 The Polish equivalent of this test is the Medical Final Examination, which differs in that, unlike the USMLE, it is a single-component exam consisting solely of multiple-choice questions (with five answer choices per question). 5 In our study, ChatGPT proved able to handle it at least as well. However, ChatGPT did not pass the MRCGP AKT (the Applied Knowledge Test of the Membership of the Royal College of General Practitioners) 15 or the Chinese National Medical Licensing Examination. 16 In Germany, ChatGPT-3.5 passed the medical licensing examination in one of three cases, whereas version 4.0 (which had a notable technological edge) achieved a perfect success rate. 17 In a comparative evaluation conducted in Japan, ChatGPT-4 exhibited superior performance, by 27.6% to 36.3% depending on the question category. 18 The efficacy of ChatGPT (in both versions) has also been demonstrated in other medical examinations, including medical biochemistry, 19 physiology, 20 microbiology, 21 and parasitology, 22 as well as the European Exam in Core Cardiology 23 and the Ophthalmic Knowledge Assessment Program (OKAP) exam. 24 Nonetheless, AI was unsuccessful in passing the American Heart Association Basic Life Support and Advanced Cardiovascular Life Support exams, 25 nor, in Poland, the specialization exams in internal medicine 26 or radiology. 27 Determining the cause of the substantial variations in ChatGPT's efficacy, even within the same version (especially GPT-4), is challenging. One potential explanation is language differences. However, a study by Panthier et al. examining the efficacy of ChatGPT on the French version of the European Board of Ophthalmology Examination indicated that language was not the main factor influencing effectiveness; 24 nonetheless, it is worth considering the contrasting global significance and prevalence of French and Polish (309.8 million French speakers vs. 40.6 million Polish speakers according to Ethnologue data 28 ). Additional observations and investigation are necessary.
Analysis of human test-takers' results reveals a notable disparity in the proportion of accurate responses between 2013–2015 and post-2021, a trend not apparent in AI. The likely reason is that, under current Polish regulations, a considerable portion of the Medical Final Examination (at least 70%) comprises questions sourced from an open question bank, allowing for prior training. Article 14c of the Act on the Profession of Physician and Dentist, implemented at that time, meant that only 30% of the questions on each subsequent Medical Final Examination date were newly created and not previously accessible to candidates. 29 Thus, this cannot be taken as evidence of ChatGPT's diminished efficacy after 2021; although ChatGPT's knowledge is limited to the year 2021, every question in our study was reviewed for adherence to current medical knowledge at that time.
ChatGPT demonstrated superior performance in the fields of public health and psychiatry, with accuracy rates of 82.56% and 77.18%, respectively. Interestingly, humans achieved their lowest performance on psychiatry tests (70.25%); psychiatry is the only field in the Medical Final Examination in which ChatGPT-3.5 performed better than real doctors. The cause of this is still unknown to us, but a review of the available literature draws attention to the fact that aspects related to the psyche, psychology, and emotions appear to be notable strengths of chatbots like ChatGPT. Franco D'Souza et al. 30 showed that ChatGPT-3.5 fared extremely well on clinical vignettes in psychiatry, receiving 61% grade “A,” 31% grade “B,” and only 8% grade “C.” Elyoseph et al. 31 demonstrated that ChatGPT performed significantly better than the general population on all Levels of Emotional Awareness Scale (LEAS) measures and can further improve its result. Of course, we must keep in mind that we are dealing only with a chatbot; Levkovich and Elyoseph 32 showed that ChatGPT-3.5 could underestimate the risk of suicide even in high-risk patients.
The ability to provide immediate answers to inquiries is an opportunity to improve quality for medical practitioners, patients, and healthcare professionals. Nevertheless, it appears highly unlikely that AI will be able to substitute for medical practitioners in the immediate future. Even the most sophisticated algorithms and AI-enabled technologies cannot diagnose and cure illnesses, as DiGiorgio and Ehrenfeld 33 accurately noted. Our study demonstrated that ChatGPT has the potential to pass the medical licensing examination in Poland; however, it is crucial to consider that medicine is not just a precise science but also an art that requires critical thinking beyond algorithms. Additionally, it is important to emphasize the significance of an individualized approach to patient care, based on interpersonal communication and knowledge. There are nevertheless prospective applications for ChatGPT and AI in medicine, such as the analysis of big data or the creation of realistic descriptions of clinical cases, which serve as effective tools for students to learn and prepare for their profession. 34 The courteousness exhibited by AI and its prospective utilization in routine clinical practice are also worth mentioning: it has been demonstrated that in 79% of cases, patients perceived ChatGPT's responses to their urgent medical inquiries as more empathetic and comprehensive than those provided by human professionals. 35 On the other hand, ChatGPT's ability to empathize may influence our perception of chatbot mistakes, warranting a sensible and careful approach to its actions.
Our research has a few limitations. The analysis was limited to ChatGPT's performance, without comparative assessments against other AI or chatbot models. Additionally, ChatGPT undergoes regular updates; as noted above, ChatGPT-4 yields superior quality, but it is a paid tool and not accessible to all individuals. It would be prudent to assess the efficacy of both ChatGPT versions 3.5 and 4.0 on a comparably extensive range of questions (similar work was done by Rosoł et al., 3 showing the superiority of ChatGPT-4, but that study was based on a small number of questions) and potentially compare them with other chatbots. In our work, we did not use any prompts that might have influenced the effectiveness of the answers. Despite these limitations, our study provides significant perspectives on the advantages and limitations of ChatGPT in the setting of medical licensing examinations such as the USMLE or the Polish Medical Final Examination.
Conclusions
The results of this study demonstrate that ChatGPT version 3.5 can pass the Medical Final Examination. There is evidence to suggest that ChatGPT, and perhaps other AI language models, despite their limitations, could become a valuable asset in patient care. The performance of the GPT-3.5 model, while sufficient to pass the exam, was subpar and inferior to that of medical students and early-career doctors. To enhance proficiency in this domain, further training of these models is advisable.
Supplemental Material
sj-docx-1-smo-10.1177_20503121241257777 – Supplemental material for “ChatGPT-3.5 passes Poland’s medical final examination—Is it possible for ChatGPT to become a doctor in Poland?” by Szymon Suwała, Paulina Szulc, Cezary Guzowski, Barbara Kamińska, Jakub Dorobiała, Karolina Wojciechowska, Maria Berska, Olga Kubicka, Oliwia Kosturkiewicz, Bernadetta Kosztulska, Alicja Rajewska and Roman Junik, in SAGE Open Medicine.