Abstract
Introduction
Artificial intelligence (AI) is undergoing a transformative era in healthcare, with advanced technologies such as large language models (LLMs) demonstrating remarkable potential to provide innovative solutions to complex medical challenges.1–3 These models, trained on large datasets, can generate human-like text, with applications ranging from clinical decision support to medical education.4 Recent methodological work has also shown that the structure of the prompt and the clinical context strongly modulate LLM behavior, underscoring the need for standardized, transparent prompting protocols in healthcare studies.5 The adoption of AI in healthcare is expected to enhance patient care, expedite diagnostic processes, and provide analytical support to professionals.6 However, its performance in highly specialized fields requiring extensive expertise—such as plastic surgery—remains insufficiently explored.7,8
Plastic surgery, including subspecialties such as hand surgery, constitutes a distinct domain within surgical disciplines, characterized by technical precision, anatomical complexity, and the necessity of rigorous adherence to international standards.9 While LLMs provide support in this field—offering quick access to critical information and delivering educational materials to trainees—the accuracy, consistency, and guideline compliance of their responses in such specific scenarios have not yet been thoroughly assessed.10,11 Although existing literature evaluates the performance of these models in general medical contexts, studies explicitly focusing on niche areas such as hand surgery are limited.8 These specialties involve distinct challenges that demand technical knowledge and advanced clinical judgment. The critical question is whether LLMs can generate reliable and applicable responses in these scenarios.
Although several studies have assessed LLM performance in general medicine and plastic surgery, no research to date has systematically evaluated next-generation LLMs using standardized, guideline-anchored case scenarios in hand surgery, one of the most technically demanding surgical subspecialties. This study conducts a comprehensive evaluation of the performance of leading and widely recognized LLMs—ChatGPT-5 (OpenAI), Gemini 2 (Google), Grok 3, and DeepSeek R1—in hand surgery case scenarios.
Materials and methods
Study Design
This prospective cross-sectional analysis was designed to evaluate the performance of four prominent LLMs (ChatGPT-5, Gemini 2, Grok 3, and DeepSeek R1) on standardized, validated case scenarios in hand surgery. The study's findings will help clarify how LLMs can be integrated safely and effectively into clinical practice and medical education. The study protocol adhered to the ethical guidelines outlined in the Declaration of Helsinki and received prior approval from the institutional ethics committee to ensure transparency and rigor. Although the study did not involve human or animal subjects, obtaining ethics committee approval underscored the commitment to standards and scientific integrity. Although traditional power analysis does not apply to model-based repeated-response studies, we ensured statistical adequacy by exceeding the minimum scenario count used in prior LLM surgical studies (n = 20–30). Our dataset (50 cases × 6 questions × 4 LLMs = 1200 responses) is among the most extensive comparative datasets to date.
Evaluated models
All LLM evaluations were conducted between 3 September and 24 October 2025. During this window, we accessed the then-available production endpoints for ChatGPT-5 (OpenAI), Gemini 2 (Google DeepMind), Grok 3 (xAI), and DeepSeek-R1 (DeepSeek). GPT-5 was publicly released in August 2025, whereas DeepSeek-R1 and Grok 3 were introduced earlier in 2025 as reasoning-focused models. Gemini 2 represented the contemporaneous general-purpose Gemini model series at the time of our experiments. As such, our results should be interpreted as reflecting the performance of ChatGPT-5, Gemini 2, Grok 3, and DeepSeek-R1 during the September-October 2025 evaluation window.
Study utilization
The study used 50 complex case scenarios in hand surgery, developed for this work on the basis of established international surgical guidelines. All case scenarios were created following the guideline frameworks of the International Federation of Societies for Surgery of the Hand (IFSSH) and the American Society for Surgery of the Hand (ASSH), and were externally validated by two independent hand surgeons with over 10 years of experience to ensure clinical authenticity.12,13 Each scenario included six targeted questions: four open-ended and two multiple-choice (50 × 6 = 300 questions per LLM).
Case scenarios were initially drafted based on common and high-stakes presentations encountered in tertiary hand surgery practice, including typical emergency referrals, elective reconstructive cases, and complex revision situations. To ensure coverage of the major domains of adult hand surgery, the expert panel prospectively categorized the 50 vignettes into acute trauma and emergency presentations, post-traumatic reconstruction and sequelae, degenerative and nerve-compression disorders, congenital anomalies, and elective reconstructive/aesthetic conditions. In the final set, acute trauma and emergency cases accounted for 40% (n = 20) of scenarios, post-traumatic reconstruction for 20% (n = 10), degenerative and nerve-compression disorders for 20% (n = 10), congenital anomalies for 10% (n = 5), and elective reconstructive/aesthetic problems for 10% (n = 5). Microsurgical decision-making (e.g., digital replantation, free-flap coverage) was primarily categorized within trauma and reconstruction rather than treated as a separate domain.
The same two fellowship-trained hand surgeons (each with >10 years of independent clinical practice in hand surgery) jointly defined the benchmark answers and the list of critical elements required for completeness and guideline adherence for every question. For each scenario, the experts jointly developed a detailed benchmark answer key specifying the expected diagnosis, classification, investigations, and management strategy. All vignettes, questions, and benchmarks were finalized and locked before any interaction with the LLMs. Five representative anonymized case vignettes, including all six questions and the corresponding expert benchmark answers, are provided in Supplemental 1. The complete set of 50 vignettes and 300 questions is available from the corresponding author upon reasonable request.
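To illustrate how a locked vignette and its benchmark key could be represented for downstream scoring, a minimal sketch follows. The field names and example content are hypothetical and are not drawn from the actual study materials; they simply mirror the elements described above (diagnosis, critical elements, question type).

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    """One of the six questions attached to a vignette."""
    text: str
    kind: str                            # "open" or "mcq"
    benchmark: str                       # expert benchmark answer
    critical_elements: list[str] = field(default_factory=list)
    options: list[str] | None = None     # only for multiple-choice items
    correct_option: str | None = None

@dataclass
class Vignette:
    """A single locked case scenario (hypothetical example content)."""
    case_id: int
    domain: str                          # e.g. "acute trauma", "congenital anomaly"
    case_text: str
    questions: list[Question]            # always six per case in this design

# Hypothetical illustration only; not an actual study vignette.
example = Vignette(
    case_id=1,
    domain="acute trauma",
    case_text="A 34-year-old presents with a volar laceration of the index finger ...",
    questions=[
        Question(
            text="What is the most likely injured structure?",
            kind="open",
            benchmark="Flexor digitorum profundus laceration in zone II",
            critical_elements=["zone II", "FDP involvement"],
        ),
        # ... remaining five questions omitted for brevity
    ],
)
```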
Prompting protocol
All models were queried using a standardized prompting protocol. First, a role-defining instruction was given (“You are a board-certified hand surgeon. Answer according to current international guidelines and best practices.”). Second, the complete vignette was presented under the heading ‘Case’. Third, the six questions were listed under the heading ‘Questions’, and the model was instructed to answer each question in order, labelling its responses as Q1–Q6. For each case, the entire text was pasted into a new, clean conversation to avoid cross-case contamination.
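For reproducibility, the protocol above can be summarized as a single-message template. The sketch below is only an illustration of how such a prompt could be assembled programmatically; the helper send_to_model is hypothetical, and in this study each prompt was pasted manually into a new, clean conversation.

```python
SYSTEM_INSTRUCTION = (
    "You are a board-certified hand surgeon. Answer according to current "
    "international guidelines and best practices."
)

def build_prompt(case_text: str, questions: list[str]) -> str:
    """Assemble the standardized single-message prompt used for every case."""
    numbered = "\n".join(f"Q{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        f"{SYSTEM_INSTRUCTION}\n\n"
        f"Case:\n{case_text}\n\n"
        f"Questions:\n{numbered}\n\n"
        "Answer each question in order, labelling your responses as Q1-Q6."
    )

# Hypothetical usage: each prompt is submitted in a brand-new conversation
# (no shared history) to avoid cross-case contamination.
# response = send_to_model(model_name, build_prompt(case_text, questions))
```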
Inter-rater reliability
Two independent fellowship-trained hand surgeons served as blinded reviewers and scored all LLM responses according to the predefined rubric. To further enhance methodological rigor, the reliability assessment was expanded beyond the primary Cohen's kappa coefficient (κ = 0.821).
Analyzing responses
The LLM-generated responses were assessed against a multidimensional set of criteria, including accuracy, completeness, and guideline adherence. Accuracy was defined as the extent to which the response matched the case-specific expert benchmark answer (e.g., correct diagnosis, appropriate classification, and recommended operative or conservative strategy). Completeness was assessed by whether all critical components of the question were addressed (e.g., naming both the diagnosis and key differential diagnoses or listing all essential steps of management). Guideline adherence referred to the concordance of the response with IFSSH/ASSH-based hand surgery recommendations and other international standards, including mention of required investigations, contraindications, and safety principles, even when more than one clinically acceptable option existed. Consequently, a response could be accurate but only partially guideline-adherent, or vice versa, which justified scoring these domains separately. A six-point Likert scale, ranging from 1 (poorest performance) to 6 (best performance), was applied to each domain.
For multiple-choice questions, the same six-point Likert scale was applied to preserve comparability with open-ended items while distinguishing between correct, partially acceptable, and unsafe responses. Selection of the predefined correct option with entirely appropriate reasoning was scored as 5 or 6. Selection of the proper choice with incomplete or imprecise justification was scored in the mid-range (typically 4). Selection of an incorrect option with clinically neutral consequences was scored as 2 or 3, whereas clearly unsafe or guideline-inconsistent choices (e.g., recommending contraindicated procedures) were scored as 1. This approach retained the underlying correct/incorrect structure of MCQs but allowed us to capture clinically relevant gradations in safety and explanatory quality.
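As a worked illustration of this rubric, the following sketch maps a rater's judgement of a multiple-choice response onto the six-point scale. The function and its inputs are hypothetical simplifications of the written rubric; in the study, borderline cases were scored by the blinded reviewers rather than by a deterministic rule.

```python
def score_mcq(correct_option_chosen: bool,
              reasoning_quality: str,
              unsafe: bool) -> int:
    """Map an MCQ response onto the six-point Likert scale described in the text.

    reasoning_quality: "full" (entirely appropriate) or "partial" (incomplete or
    imprecise); only consulted when the correct option was chosen.
    unsafe: True if an incorrect choice is clearly unsafe or guideline-inconsistent.
    """
    if correct_option_chosen:
        # Correct option: 5-6 for fully appropriate reasoning, mid-range (4) otherwise.
        return 6 if reasoning_quality == "full" else 4
    if unsafe:
        # e.g. recommending a contraindicated procedure.
        return 1
    # Incorrect option with clinically neutral consequences (scored 2-3 in the rubric).
    return 3
```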
Data analysis
Consistency was primarily assessed to determine whether the models could reliably replicate accurate responses, an essential attribute for clinical application. The inter-rater reliability of these assessments was quantified using Cohen's kappa coefficient, yielding a value of 0.821, conventionally interpreted as almost perfect agreement.
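For readers who wish to reproduce this type of agreement analysis, Cohen's kappa can be computed directly from the two reviewers' per-item scores. The sketch below uses scikit-learn with illustrative ratings, not the actual study data; the weighted variant shown at the end is an optional refinement for ordinal scales, not a description of the analysis performed here.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative ratings only: each list holds one reviewer's Likert scores (1-6)
# for the same set of LLM responses, in the same order.
rater_a = [6, 5, 4, 6, 3, 5, 6, 2, 5, 4]
rater_b = [6, 5, 5, 6, 3, 5, 6, 2, 4, 4]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.3f}")

# A quadratically weighted kappa penalizes large disagreements more than
# adjacent ones and is often preferred for ordinal six-point ratings.
weighted = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Weighted kappa: {weighted:.3f}")
```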
Results
In evaluating LLMs across multiple-choice and open-ended case scenario questions, performance varied significantly between models (Figure 1). In the multiple-choice questions of the case scenarios (Table 1), mean accuracy scores, measured on a six-point Likert scale, ranged from 5.6 ± 0.4 (DeepSeek) to 5.9 ± 0.2 (Gemini) and 5.9 ± 0.1 (Grok), with ChatGPT scoring 5.7 ± 0.3. Pairwise comparisons using t-tests revealed that Gemini and Grok outperformed ChatGPT.

Figure 1. Performance comparison of AI models across multiple evaluation categories. Scores were obtained using a six-point Likert scale (mean ± SD). The evaluated models included ChatGPT, Gemini, Grok, and DeepSeek. The categories assessed were multiple-choice accuracy, open-ended accuracy, open-ended completeness, and open-ended guideline adherence. Error bars indicate standard deviations.

Table 1. Performance of AI models on multiple-choice questions of case scenarios.

Table 2. Performance evaluation of AI models on open-ended questions of case scenarios.

Figure 2. Comparison of AI models based on accuracy and completeness scores. The scatter plot illustrates the performance of the four LLMs (Gemini 2, Grok 3, ChatGPT 5, and DeepSeek R1) with regard to their accuracy and completeness.
Beyond significance testing, we calculated effect sizes to quantify the magnitude of performance differences between LLMs. For multiple-choice accuracy comparisons, Cohen's d values ranged from 0.42 to 0.71, representing moderate to large effects, particularly favoring Gemini and Grok over ChatGPT and DeepSeek. In open-ended domains, eta-squared (η²) values from Kruskal–Wallis analyses ranged from 0.28 to 0.36, indicating large effect sizes and highlighting substantial performance disparities across models. In pairwise nonparametric comparisons, Cliff's delta (δ) similarly showed large effects, with Gemini and Grok strongly outperforming DeepSeek (δ = 0.74 and δ = 0.79, respectively) and moderately outperforming ChatGPT (δ = 0.41 and δ = 0.48). Collectively, these effect-size measurements confirm that the observed differences between models are not only statistically significant but also clinically meaningful. Quantitatively, DeepSeek exhibited the highest proportion of critical errors (22.4%), followed by ChatGPT (13.7%), whereas Gemini and Grok showed substantially lower error rates (4.8% and 5.1%, respectively). These patterns highlight clear divergences in behavior: Gemini and Grok exhibit strong clinical consistency, whereas ChatGPT and DeepSeek exhibit structural weaknesses in guideline adherence and contextual reasoning.
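The effect-size measures reported above follow their standard definitions. As a minimal sketch, assuming two models' per-case Likert scores are available as arrays, Cohen's d and Cliff's delta could be computed as follows (illustrative data, not the study results).

```python
import numpy as np

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d using a pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def cliffs_delta(x: np.ndarray, y: np.ndarray) -> float:
    """Cliff's delta: P(x > y) minus P(x < y) over all pairs."""
    diffs = x[:, None] - y[None, :]
    return (np.sum(diffs > 0) - np.sum(diffs < 0)) / (len(x) * len(y))

# Illustrative per-case Likert scores for two models (not actual study data).
model_a = np.array([6, 5, 6, 6, 5, 6, 4, 6, 5, 6])
model_b = np.array([5, 4, 5, 6, 4, 5, 3, 5, 4, 5])

print(f"Cohen's d:     {cohens_d(model_a, model_b):.2f}")
print(f"Cliff's delta: {cliffs_delta(model_a, model_b):.2f}")
```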
Discussion
This study represents the first comprehensive evaluation of four state-of-the-art LLMs—ChatGPT-5, Gemini 2, Grok 3, and DeepSeek R1—in addressing complex case scenarios specific to hand surgery. Our findings demonstrate a clear performance hierarchy, with Gemini and Grok emerging as superior models across both structured multiple-choice and unstructured open-ended tasks. At the same time, ChatGPT exhibited intermediate utility, and DeepSeek consistently underperformed. These results align with—and extend—recent advancements in AI-driven clinical decision support, while also highlighting critical limitations that warrant further scrutiny.
Multiple high-impact studies in 2024–2025 have demonstrated that state-of-the-art LLMs can approach or achieve human-level performance on specialized medical knowledge tasks.7,14–16 In radiology, for example, Sarangi et al. systematically evaluated multiple LLMs for imaging decision-making in suspected pulmonary embolism.17 They demonstrated that, when carefully benchmarked, these systems can provide guideline-congruent recommendations for complex diagnostic pathways.18 Similarly, Mondal et al. showed that LLMs can generate accurate and readable plain-language summaries of scientific articles, although performance varied across models, underscoring the importance of structured, domain-specific evaluation frameworks.5 A recent national surgery in-service examination evaluation found that GPT-4 answered 74.4% of questions correctly.19 Similarly, in hand surgery, GPT-4 achieved about 62% accuracy on board-style examinations, markedly better than earlier models and on par with many trainees.20 These reports mirror prior findings in hand surgery, where GPT-4's factual accuracy was substantially improved over earlier LLM versions and often rivaled that of surgical residents.21 Previous work has likewise observed that LLM accuracy in surgery can be highly topic-dependent, with strong performance in common reconstructive scenarios but higher error rates in complex, nuanced cases.22
A study by Gomez-Cabello et al. assessed the ability of GPT-4 and Gemini to address common patient concerns following five types of cosmetic surgery.23 Their results showed that while the accuracy of the information provided by all models was comparable, Gemini offered more readable responses. However, a limitation observed across all models was their poor ability to provide actionable advice; further improvement is needed in translating this knowledge into practical, easy-to-follow instructions that patients can use to manage their recovery effectively.24,25 Another study assessed the performance of ChatGPT-4 and Gemini in accurately classifying hand injuries and recommending appropriate management.26 Gemini demonstrated superior classification ability, correctly classifying a higher percentage of hand injuries than ChatGPT-4. However, ChatGPT-4 exhibited higher sensitivity in recommending surgical intervention, while Gemini showed greater specificity. Despite these differences, the study concluded that neither model is reliable enough for clinical practice in hand surgery without further validation.
In our study, Gemini and Grok demonstrated superior performance across structured and unstructured question formats, with significant advantages in multiple-choice accuracy compared with ChatGPT. While Gemini showed marginally better results than Grok in open-ended tasks, specifically in accuracy, completeness, and guideline adherence, these differences did not reach statistical significance, indicating robust and balanced capabilities in both factual precision and contextual reasoning. ChatGPT occupied an intermediate position, performing moderately in structured tasks but showing significant deficiencies in open-ended queries, notably lower accuracy and guideline adherence scores, reflecting limitations in generating contextually comprehensive responses. Despite these shortcomings, ChatGPT still outperformed DeepSeek in open-ended scenarios. DeepSeek exhibited the weakest overall performance, consistently trailing the other models across both formats, with substantial deficits in open-ended tasks, as evidenced by lower accuracy and completeness scores and higher variability in response quality. These findings underscore key trends: Gemini and Grok's superior contextual understanding and coherence, ChatGPT's moderate yet limited proficiency, and DeepSeek's pronounced challenges in handling complex, guideline-dependent, scenario-based queries.
This research also highlights the challenges associated with using LLMs in hand surgery and has significant limitations. While LLMs exhibit considerable potential in healthcare applications, the corpus of real-world performance data remains insufficient to assess their capabilities thoroughly. The absence of established empirical benchmarks impedes rigorous comparison of their efficacy, accuracy, and reliability with alternative proprietary models. Furthermore, the adaptability of LLMs to clinical decision-making workflows and their adherence to evolving regulatory landscapes necessitate further scrutiny and validation. Substantial efforts should be directed toward validating their reliability in domain-specific tasks, such as diagnostics, personalized medicine, and medical education, with a focus on mitigating inherent biases. Through concerted research initiatives and real-world clinical trials, such models could be iteratively refined to better address the dynamic and multifaceted needs of the medical and scientific communities. A standardized comparative analysis and evaluation framework for this purpose is still lacking.
Conclusion
This study highlights the significant yet varied potential of state-of-the-art LLMs within the highly specialized domain of hand surgery. Our findings provide a foundation for future regulatory validation frameworks and highlight the need for specialty-specific LLM training datasets before clinical deployment. Gemini 2 and Grok 3 consistently demonstrated superior performance across structured and open-ended clinical scenarios, showcasing advanced capabilities in accuracy, completeness, and adherence to international guidelines. While ChatGPT-5 exhibited moderate competence, performing relatively better in structured multiple-choice contexts, its limitations in complex, open-ended reasoning underscore critical areas for improvement. DeepSeek demonstrated pronounced deficits and should not be integrated into practice without further refinement. Overall, the findings underscore the transformative promise of LLMs in supporting clinical decision-making, education, and patient care within hand and orthopedic surgery.
Supplemental material
Supplemental Material for Performance and reliability of state-of-the-art LLMs in complex hand surgery scenarios: A prospective cross-sectional, double-blinded study by Ahmet Savran in Journal of Orthopaedic Surgery
Footnotes
Author contributions
Funding
Declaration of conflicting interests
Data Availability Statement
Study design and level of evidence
Supplemental material
References
