Abstract
Introduction
Artificial intelligence (AI) is undergoing a transformative era in healthcare, with advanced technologies such as large language models (LLMs) demonstrating remarkable potential to provide innovative solutions to complex medical challenges.1–3 These models, trained on large datasets, can generate human-like text, with applications ranging from clinical decision support to medical education.4 Recent methodological work has also shown that the structure of the prompt and the clinical context strongly modulate LLM behavior, underscoring the need for standardized, transparent prompting protocols in healthcare studies.5 The adoption of AI in healthcare is expected to enhance patient care, expedite diagnostic processes, and provide analytical support to professionals.6 However, its performance in highly specialized fields requiring extensive expertise—such as plastic surgery—remains insufficiently explored.7,8
Plastic surgery, including subspecialties such as hand surgery, constitutes a distinct domain within surgical disciplines, characterized by technical precision, anatomical complexity, and the necessity of rigorous adherence to international standards.9 While LLMs provide support in this field—offering quick access to critical information and delivering educational materials to trainees—the accuracy, consistency, and guideline compliance of their responses in such specific scenarios have not yet been thoroughly assessed.10,11 Although existing literature evaluates the performance of these models in general medical contexts, studies explicitly focusing on niche areas such as hand surgery are limited.8 These specialties involve distinct challenges that demand technical knowledge and advanced clinical judgment. The critical question is whether LLMs can generate reliable and applicable responses in these scenarios.
Although several studies have assessed LLM performance in general medicine and plastic surgery, no research to date has systematically evaluated next-generation LLMs using standardized, guideline-anchored case scenarios in hand surgery, one of the most technically demanding surgical subspecialties. This study conducts a comprehensive evaluation of the performance of leading and widely recognized LLMs—ChatGPT-5 (OpenAI), Gemini 2 (Google), Grok 3, and DeepSeek R1—in hand surgery case scenarios.
Materials and methods
Study Design
This prospective cross-sectional analysis was designed to evaluate the performance of four prominent LLMs (ChatGPT-5, Gemini 2, Grok 3, and DeepSeek R1) on standardized, validated case scenarios in hand surgery. The study's findings will help clarify how LLMs can be integrated safely and effectively into clinical practice and medical education. The study protocol adhered to the ethical guidelines outlined in the Declaration of Helsinki and received prior approval from the institutional ethics committee to ensure transparency and rigor. Although the study did not involve human or animal subjects, obtaining ethics committee approval underscored the commitment to standards and scientific integrity. Although traditional power analysis does not apply to model-based repeated-response studies, we ensured statistical adequacy by exceeding the minimum scenario count used in prior LLM surgical studies (n = 20–30). Our dataset (50 cases × 6 questions × 4 LLMs = 1200 responses) is among the most extensive comparative datasets to date.
Evaluated models
All LLM evaluations were conducted between 3 September and 24 October 2025. During this window, we accessed the then-available production endpoints for ChatGPT-5 (OpenAI), Gemini 2 (Google DeepMind), Grok 3 (xAI), and DeepSeek-R1 (DeepSeek). GPT-5 was publicly released in August 2025, whereas DeepSeek-R1 and Grok 3 were introduced earlier in 2025 as reasoning-focused models. Gemini 2 represented the contemporaneous general-purpose Gemini model series at the time of our experiments. As such, our results should be interpreted as reflecting the performance of ChatGPT-5, Gemini 2, Grok 3, and DeepSeek-R1 during the September-October 2025 evaluation window.
Study utilization
The study used 50 complex case scenarios in hand surgery, developed for this work on the basis of established international surgical guidelines. All case scenarios were created following the guideline frameworks of the International Federation of Societies for Surgery of the Hand (IFSSH) and the American Society for Surgery of the Hand (ASSH), and were externally validated by two independent hand surgeons with over 10 years of experience to ensure clinical authenticity.12,13 Each scenario included six targeted questions: four open-ended and two multiple-choice (50 × 6 = 300 questions per LLM).
Case scenarios were initially drafted based on common and high-stakes presentations encountered in tertiary hand surgery practice, including typical emergency referrals, elective reconstructive cases, and complex revision situations. To ensure coverage of the major domains of adult hand surgery, the expert panel prospectively categorized the 50 vignettes into acute trauma and emergency presentations, post-traumatic reconstruction and sequelae, degenerative and nerve-compression disorders, congenital anomalies, and elective reconstructive/aesthetic conditions. In the final set, acute trauma and emergency cases accounted for 40% (n = 20) of scenarios, post-traumatic reconstruction for 20% (n = 10), degenerative and nerve-compression disorders for 20% (n = 10), congenital anomalies for 10% (n = 5), and elective reconstructive/aesthetic problems for 10% (n = 5). Microsurgical decision-making (e.g., digital replantation, free-flap coverage) was primarily categorized within trauma and reconstruction rather than treated as a separate domain.
The same two fellowship-trained hand surgeons (each with >10 years of independent clinical practice in hand surgery) jointly defined the benchmark answers and the list of critical elements required for completeness and guideline adherence for every question. For each scenario, the experts jointly developed a detailed benchmark answer key specifying the expected diagnosis, classification, investigations, and management strategy. All vignettes, questions, and benchmarks were finalized and locked before any interaction with the LLMs. Five representative anonymized case vignettes, including all six questions and the corresponding expert benchmark answers, are provided in Supplemental 1. The complete set of 50 vignettes and 300 questions is available from the corresponding author upon reasonable request.
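To illustrate how a locked vignette and its benchmark key could be represented for downstream scoring, a minimal sketch follows. The field names and example content are hypothetical and are not drawn from the actual study materials; they simply mirror the elements described above (diagnosis, critical elements, question type).

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    """One of the six questions attached to a vignette."""
    text: str
    kind: str                            # "open" or "mcq"
    benchmark: str                       # expert benchmark answer
    critical_elements: list[str] = field(default_factory=list)
    options: list[str] | None = None     # only for multiple-choice items
    correct_option: str | None = None

@dataclass
class Vignette:
    """A single locked case scenario (hypothetical example content)."""
    case_id: int
    domain: str                          # e.g. "acute trauma", "congenital anomaly"
    case_text: str
    questions: list[Question]            # always six per case in this design

# Hypothetical illustration only; not an actual study vignette.
example = Vignette(
    case_id=1,
    domain="acute trauma",
    case_text="A 34-year-old presents with a volar laceration of the index finger ...",
    questions=[
        Question(
            text="What is the most likely injured structure?",
            kind="open",
            benchmark="Flexor digitorum profundus laceration in zone II",
            critical_elements=["zone II", "FDP involvement"],
        ),
        # ... remaining five questions omitted for brevity
    ],
)
```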
Prompting protocol
All models were queried using a standardized prompting protocol. First, a role-defining instruction was given (“You are a board-certified hand surgeon. Answer according to current international guidelines and best practices.”). Second, the complete vignette was presented under the heading ‘Case’. Third, the six questions were listed under the heading ‘Questions’, and the model was instructed to answer each question in order, labelling its responses as Q1–Q6. For each case, the entire text was pasted into a new, clean conversation to avoid cross-case contamination.
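For reproducibility, the protocol above can be summarized as a single-message template. The sketch below is only an illustration of how such a prompt could be assembled programmatically; the helper send_to_model is hypothetical, and in this study each prompt was pasted manually into a new, clean conversation.

```python
SYSTEM_INSTRUCTION = (
    "You are a board-certified hand surgeon. Answer according to current "
    "international guidelines and best practices."
)

def build_prompt(case_text: str, questions: list[str]) -> str:
    """Assemble the standardized single-message prompt used for every case."""
    numbered = "\n".join(f"Q{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        f"{SYSTEM_INSTRUCTION}\n\n"
        f"Case:\n{case_text}\n\n"
        f"Questions:\n{numbered}\n\n"
        "Answer each question in order, labelling your responses as Q1-Q6."
    )

# Hypothetical usage: each prompt is submitted in a brand-new conversation
# (no shared history) to avoid cross-case contamination.
# response = send_to_model(model_name, build_prompt(case_text, questions))
```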
Inter-rater reliability
Two independent fellowship-trained hand surgeons served as blinded reviewers and scored all LLM responses according to the predefined rubric. To further enhance methodological rigor, the reliability assessment was expanded beyond the primary Cohen's kappa coefficient (κ = 0.821).
Analyzing responses
The LLM-generated responses were assessed against a multidimensional set of criteria, including accuracy, completeness, and guideline adherence. Accuracy was defined as the extent to which the response matched the case-specific expert benchmark answer (e.g., correct diagnosis, appropriate classification, and recommended operative or conservative strategy). Completeness was assessed by whether all critical components of the question were addressed (e.g., naming both the diagnosis and key differential diagnoses or listing all essential steps of management). Guideline adherence referred to the concordance of the response with IFSSH/ASSH-based hand surgery recommendations and other international standards, including mention of required investigations, contraindications, and safety principles, even when more than one clinically acceptable option existed. Consequently, a response could be accurate but only partially guideline-adherent, or vice versa, which justified scoring these domains separately. A six-point Likert scale, ranging from 1 (poorest performance) to 6 (best performance), was applied to each domain.
For multiple-choice questions, the same six-point Likert scale was applied to preserve comparability with open-ended items while distinguishing between correct, partially acceptable, and unsafe responses. Selection of the predefined correct option with entirely appropriate reasoning was scored as 5 or 6. Selection of the proper choice with incomplete or imprecise justification was scored in the mid-range (typically 4). Selection of an incorrect option with clinically neutral consequences was scored as 2 or 3, whereas clearly unsafe or guideline-inconsistent choices (e.g., recommending contraindicated procedures) were scored as 1. This approach retained the underlying correct/incorrect structure of MCQs but allowed us to capture clinically relevant gradations in safety and explanatory quality.
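As a worked illustration of this rubric, the following sketch maps a rater's judgement of a multiple-choice response onto the six-point scale. The function and its inputs are hypothetical simplifications of the written rubric; in the study, borderline cases were scored by the blinded reviewers rather than by a deterministic rule.

```python
def score_mcq(correct_option_chosen: bool,
              reasoning_quality: str,
              unsafe: bool) -> int:
    """Map an MCQ response onto the six-point Likert scale described in the text.

    reasoning_quality: "full" (entirely appropriate) or "partial" (incomplete or
    imprecise); only consulted when the correct option was chosen.
    unsafe: True if an incorrect choice is clearly unsafe or guideline-inconsistent.
    """
    if correct_option_chosen:
        # Correct option: 5-6 for fully appropriate reasoning, mid-range (4) otherwise.
        return 6 if reasoning_quality == "full" else 4
    if unsafe:
        # e.g. recommending a contraindicated procedure.
        return 1
    # Incorrect option with clinically neutral consequences (scored 2-3 in the rubric).
    return 3
```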
Data analysis
Consistency was primarily assessed to determine whether the models could reliably replicate accurate responses, an essential attribute for clinical application. The inter-rater reliability of these assessments was quantified using Cohen's kappa coefficient, yielding a value of 0.821, conventionally interpreted as almost perfect agreement.
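For readers who wish to reproduce this type of agreement analysis, Cohen's kappa can be computed directly from the two reviewers' per-item scores. The sketch below uses scikit-learn with illustrative ratings, not the actual study data; the weighted variant shown at the end is an optional refinement for ordinal scales, not a description of the analysis performed here.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative ratings only: each list holds one reviewer's Likert scores (1-6)
# for the same set of LLM responses, in the same order.
rater_a = [6, 5, 4, 6, 3, 5, 6, 2, 5, 4]
rater_b = [6, 5, 5, 6, 3, 5, 6, 2, 4, 4]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.3f}")

# A quadratically weighted kappa penalizes large disagreements more than
# adjacent ones and is often preferred for ordinal six-point ratings.
weighted = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Weighted kappa: {weighted:.3f}")
```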
Results
In evaluating LLMs across multiple-choice and open-ended case scenario questions, performance varied significantly between models (Figure 1). In the multiple-choice questions of the case scenarios (Table 1), mean accuracy scores, measured on a six-point Likert scale, ranged from 5.6 ± 0.4 (DeepSeek) to 5.9 ± 0.2 (Gemini) and 5.9 ± 0.1 (Grok), with ChatGPT scoring 5.7 ± 0.3. Pairwise comparisons using t-tests revealed that Gemini and Grok outperformed ChatGPT.

Figure 1. Performance comparison of AI models across multiple evaluation categories. Scores were obtained using a six-point Likert scale (mean ± SD). The evaluated models included ChatGPT, Gemini, Grok, and DeepSeek. The categories assessed were multiple-choice accuracy, open-ended accuracy, open-ended completeness, and open-ended guideline adherence. Error bars indicate standard deviations.

Table 1. Performance of AI models on multiple-choice questions of case scenarios.

Table 2. Performance evaluation of AI models on open-ended questions of case scenarios.

Figure 2. Comparison of AI models based on accuracy and completeness scores. The scatter plot illustrates the performance of the four LLMs (Gemini 2, Grok 3, ChatGPT 5, and DeepSeek R1) with regard to their accuracy and completeness.
Beyond significance testing, we calculated effect sizes to quantify the magnitude of performance differences between LLMs. For multiple-choice accuracy comparisons, Cohen's d values ranged from 0.42 to 0.71, representing moderate to large effects, particularly favoring Gemini and Grok over ChatGPT and DeepSeek. In open-ended domains, eta-squared (η²) values from Kruskal–Wallis analyses ranged from 0.28 to 0.36, indicating large effect sizes and highlighting substantial performance disparities across models. In pairwise nonparametric comparisons, Cliff's delta (δ) similarly showed large effects, with Gemini and Grok strongly outperforming DeepSeek (δ = 0.74 and δ = 0.79, respectively) and moderately outperforming ChatGPT (δ = 0.41 and δ = 0.48). Collectively, these effect-size measurements confirm that the observed differences between models are not only statistically significant but also clinically meaningful. Quantitatively, DeepSeek exhibited the highest proportion of critical errors (22.4%), followed by ChatGPT (13.7%), whereas Gemini and Grok showed substantially lower error rates (4.8% and 5.1%, respectively). These patterns highlight clear divergences in behavior: Gemini and Grok exhibit strong clinical consistency, whereas ChatGPT and DeepSeek exhibit structural weaknesses in guideline adherence and contextual reasoning.
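The effect-size measures reported above follow their standard definitions. As a minimal sketch, assuming two models' per-case Likert scores are available as arrays, Cohen's d and Cliff's delta could be computed as follows (illustrative data, not the study results).

```python
import numpy as np

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d using a pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def cliffs_delta(x: np.ndarray, y: np.ndarray) -> float:
    """Cliff's delta: P(x > y) minus P(x < y) over all pairs."""
    diffs = x[:, None] - y[None, :]
    return (np.sum(diffs > 0) - np.sum(diffs < 0)) / (len(x) * len(y))

# Illustrative per-case Likert scores for two models (not actual study data).
model_a = np.array([6, 5, 6, 6, 5, 6, 4, 6, 5, 6])
model_b = np.array([5, 4, 5, 6, 4, 5, 3, 5, 4, 5])

print(f"Cohen's d:     {cohens_d(model_a, model_b):.2f}")
print(f"Cliff's delta: {cliffs_delta(model_a, model_b):.2f}")
```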
Discussion
This study represents the first comprehensive evaluation of four state-of-the-art LLMs—ChatGPT-5, Gemini 2, Grok 3, and DeepSeek R1—in addressing complex case scenarios specific to hand surgery. Our findings demonstrate a clear performance hierarchy, with Gemini and Grok emerging as superior models across both structured multiple-choice and unstructured open-ended tasks. At the same time, ChatGPT exhibited intermediate utility, and DeepSeek consistently underperformed. These results align with—and extend—recent advancements in AI-driven clinical decision support, while also highlighting critical limitations that warrant further scrutiny.
Multiple high-impact studies in 2024–2025 have demonstrated that state-of-the-art LLMs can approach or achieve human-level performance on specialized medical knowledge tasks.7,14–16 In radiology, for example, Sarangi et al. systematically evaluated multiple LLMs for imaging decision-making in suspected pulmonary embolism.17 They demonstrated that, when carefully benchmarked, these systems can provide guideline-congruent recommendations for complex diagnostic pathways.18 Similarly, Mondal et al. showed that LLMs can generate accurate and readable plain-language summaries of scientific articles, although performance varied across models, underscoring the importance of structured, domain-specific evaluation frameworks.5 A recent national surgery in-service examination evaluation found that GPT-4 answered 74.4% of questions correctly.19 Similarly, in hand surgery, GPT-4 achieved about 62% accuracy on board-style examinations, markedly better than earlier models and on par with many trainees.20 These reports mirror prior findings in hand surgery, where GPT-4's factual accuracy was substantially improved over earlier LLM versions and often rivaled that of surgical residents.21 Previous work has likewise observed that LLM accuracy in surgery can be highly topic-dependent, with strong performance in common reconstructive scenarios but higher error rates in complex, nuanced cases.22
A study by Gomez-Cabello et al. assessed the ability of GPT-4 and Gemini to address common patient concerns following five types of cosmetic surgery.23 Their results showed that while the accuracy of the information provided by all models was comparable, Gemini offered more readable responses. However, a limitation observed across all models was their poor ability to provide actionable advice; further improvement is needed in translating this knowledge into practical, easy-to-follow instructions that patients can use to manage their recovery effectively.24,25 Another study assessed the performance of ChatGPT-4 and Gemini in accurately classifying hand injuries and recommending appropriate management.26 Gemini demonstrated superior classification ability, correctly classifying a higher percentage of hand injuries than ChatGPT-4. However, ChatGPT-4 exhibited higher sensitivity in recommending surgical intervention, while Gemini showed greater specificity. Despite these differences, the study concluded that neither model is reliable enough for clinical practice in hand surgery without further validation.
In our study, Gemini and Grok demonstrated superior performance across structured and unstructured question formats, with significant advantages in multiple-choice accuracy compared with ChatGPT. While Gemini showed marginally better results than Grok in open-ended tasks, specifically in accuracy, completeness, and guideline adherence, these differences did not reach statistical significance, indicating robust and balanced capabilities in both factual precision and contextual reasoning. ChatGPT occupied an intermediate position, performing moderately in structured tasks but showing significant deficiencies in open-ended queries, notably lower accuracy and guideline adherence scores, reflecting limitations in generating contextually comprehensive responses. Despite these shortcomings, ChatGPT still outperformed DeepSeek in open-ended scenarios. DeepSeek exhibited the weakest overall performance, consistently trailing the other models across both formats, with substantial deficits in open-ended tasks, as evidenced by lower accuracy and completeness scores and higher variability in response quality. These findings underscore key trends: Gemini and Grok's superior contextual understanding and coherence, ChatGPT's moderate yet limited proficiency, and DeepSeek's pronounced challenges in handling complex, guideline-dependent, scenario-based queries.
This research also highlights the challenges associated with using LLMs in hand surgery and has significant limitations. While LLMs exhibit considerable potential in healthcare applications, the corpus of real-world performance data remains insufficient to assess their capabilities thoroughly. The absence of established empirical benchmarks impedes rigorous comparison of their efficacy, accuracy, and reliability with alternative proprietary models. Furthermore, the adaptability of LLMs to clinical decision-making workflows and their adherence to evolving regulatory landscapes necessitate further scrutiny and validation. Substantial efforts should be directed toward validating their reliability in domain-specific tasks, such as diagnostics, personalized medicine, and medical education, with a focus on mitigating inherent biases. Through concerted research initiatives and real-world clinical trials, such models could be iteratively refined to better address the dynamic and multifaceted needs of the medical and scientific communities. A standardized comparative analysis and evaluation framework for this purpose is still lacking.
Conclusion
This study highlights the significant yet varied potential of state-of-the-art LLMs within the highly specialized domain of hand surgery. Our findings provide a foundation for future regulatory validation frameworks and highlight the need for specialty-specific LLM training datasets before clinical deployment. Gemini 2 and Grok 3 consistently demonstrated superior performance across structured and open-ended clinical scenarios, showcasing advanced capabilities in accuracy, completeness, and adherence to international guidelines. While ChatGPT-5 exhibited moderate competence, performing relatively better in structured multiple-choice contexts, its limitations in complex, open-ended reasoning underscore critical areas for improvement. DeepSeek demonstrated pronounced deficits and should not be integrated into practice without further refinement. Overall, the findings underscore the transformative promise of LLMs in supporting clinical decision-making, education, and patient care within hand and orthopedic surgery.
Supplemental material
Supplemental Material for Performance and reliability of state-of-the-art LLMs in complex hand surgery scenarios: A prospective cross-sectional, double-blinded study by Ahmet Savran in Journal of Orthopaedic Surgery
Footnotes
Author contributions
Funding
Declaration of conflicting interests
Data Availability Statement
Study design and level of evidence
Supplemental material
References
