Introduction
Accurate and contemporaneous patient records are fundamental to health care and essential for documenting diagnosis, treatment, and continuity of care. In the United Kingdom, record keeping in dentistry is guided by established standards from professional bodies and specialist societies (General Dental Council 2013; British Orthodontic Society 2022).
Electronic health records (EHRs) (see Appendix Table 1 for a full list of acronyms) are increasingly used by health care providers to collate and store patient data within and between organizations (Heart et al. 2017). These data include patient demographics, medical history, diagnostics, charting, treatment planning and process, and clinical outcomes, often extending across multiple specialties. EHRs offer significant operational advantages compared with analog systems, including clear, accurate, and accessible patient documentation; seamless connectivity across platforms; template commonality; and coding functionality. These features can help organizations with workflow optimization, regulatory compliance, data accessibility for research and audit, cloud-based storage, and broader access to medical records. Moreover, patient referral, appointment scheduling, and contract management can all be handled electronically (Honavar 2020).
Despite considerable advantages associated with EHRs, there are also challenges, including institutional implementation, security risks, and issues with data-input reliability, often perpetuated through the extensive operator use of “copy and paste” (Ozair et al. 2015; Falcetta et al. 2023). Indeed, a key component of EHR functionality is the reliance on data input directly by clinicians during the patient consultation (Boonstra et al. 2022). This can negatively affect patient–clinician interactions, reducing eye contact, increasing clinician response times while computer-based tasks are completed (introducing pauses and encouraging distracting patient questions), and fostering a consultation environment where technology becomes the dominant presence (Crampton et al. 2016; Marino et al. 2023). To overcome some of these disadvantages, in-office or remote human scribes have been trialed to supplement the EHR, recording patient information and clinical notes during consultations, allowing clinicians to focus directly on the patient. These interventions can reduce documentation burden, improve efficiency, and increase work satisfaction, but they require additional manpower in the form of a scribe (Gidwani et al. 2017; Mishra et al. 2018; Micek et al. 2022).
Automatic speech recognition (ASR) is the process by which a machine recognizes spoken human language and transcribes it into written text (O’Shaughnessy 2024). A key application for ASR in health care is the generation of clinical documentation, including transcription of clinical notes, letters, and doctor–patient consultations. There is evidence from pilot studies that within these contexts, ASR can be convenient, accurate, and expeditious (Latif et al. 2021). Nonetheless, challenges remain, particularly with clinical and technical terminology, which can be associated with errors necessitating substantial posttranscription editing (Quiroz et al. 2019). In particular, mistranscriptions are recognition errors in output that distort spoken words, whereas hallucinations invent or contradict information not found in the original spoken text (Ji et al. 2023). Indeed, developing robust ASR tailored to specialist subjects such as dentistry is often complicated by the difficulty and expense of acquiring the large volumes of accurately transcribed, domain-specific training data. However, recent developments in deep neural networks and machine learning have brought improvements in ASR capabilities, including tools that can leverage natural language processing and large language models (LLMs) (Xiong et al. 2017; Israni and Verghese 2019). One particularly promising approach is generative error correction (GEC), which uses an LLM to automatically identify and correct ASR errors (Errattahi et al. 2018; Ma et al. 2023).
Many contemporary systems use an ASR speech-to-text application programming interface (API) available from multiple multinational technology companies. These newer tools should be more adept at understanding context and specialized domain-specific terminology, with recent research demonstrating the efficiency of these ASR systems in medicine (Sezgin et al. 2023; Liu et al. 2024). Given these findings, and the growing availability of this technology in clinical dentistry, we conducted a pilot study investigating the timing of dental clinical summary generation using narration with an ASR-LLM versus manual typing, identifying time reductions of almost 60% with ASR (Appendix Table 2). The aim of the present study was to investigate the transcriptional accuracy of ASR systems in dentistry using narrated orthodontic clinical records, specifically, transcriptional, lexical, and semantic accuracy using validated metrics and qualitative error analysis, including hallucinations and mistranscriptions.
Materials and Methods
Ethical approval was granted by King’s College London as a minimal risk investigation (MRA-24/25-46684). This study adheres to the World Health Organization–International Telecommunication Union checklist for artificial intelligence (AI) research in dentistry (Schwendicke et al. 2021). A detailed description of the materials and methods is available in the Appendix.
We evaluated 10 distinct ASR systems for orthodontic clinical record keeping with selection based on accessibility, functionality, and relevance to the international dental community. The first category represented directly available commercial systems (

Schematic representation of the 2-stage transcription pipeline. Dictated audio is recorded, and the audio file is processed by an automatic speech recognition system (in this example, GPT4oTranscribe) to produce a raw text transcript. The raw text transcript is then passed to a large language model (LLM; GPT4o) together with the prompt shown (temperature 0, top_p=1). The LLM performs generative error correction to provide the final transcript. Misrecognized terms are highlighted in red (uninterrupted, commutative); corrected terms are highlighted in green (unerupted, diminutive). The figure example is the generated text from Transcript 20 (with background noise), GPT4oTranscribe (raw transcript), and GPT4oTranscribeCorrected (processed transcript).
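The caption’s two-stage flow can be sketched in code. The sketch below only assembles the correction-step payload; the model name, prompt wording, and field layout are assumptions based on the caption, not the study’s actual implementation, and the live calls to an ASR engine and chat client are indicated only in comments.

```python
# Sketch of the two-stage pipeline from the figure: an ASR engine produces a
# raw transcript, then an LLM performs generative error correction (GEC).
# Prompt wording and model name below are assumptions, not the study's own.

GEC_PROMPT = (
    "You are correcting an automatic speech recognition transcript of an "
    "orthodontic clinical record. Fix misrecognized clinical terms without "
    "adding or removing information. Return only the corrected transcript."
)

def build_gec_request(raw_transcript: str) -> dict:
    """Assemble a chat-completion payload for the correction step,
    using temperature 0 and top_p 1 as stated in the figure caption."""
    return {
        "model": "gpt-4o",  # the caption's GPT4o; exact identifier assumed
        "temperature": 0,
        "top_p": 1,
        "messages": [
            {"role": "system", "content": GEC_PROMPT},
            {"role": "user", "content": raw_transcript},
        ],
    }

# Live usage (not run here) would be, roughly:
#   raw_transcript = speech_to_text(audio_file)          # ASR step
#   client.chat.completions.create(**build_gec_request(raw_transcript))
```

In the figure’s example, the user message would carry the raw transcript containing “uninterrupted” and “commutative,” and the corrected output would restore “unerupted” and “diminutive.”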
We used orthodontic clinical records incorporating diagnosis and treatment planning as our experimental model because these cover a wide range of technical language relevant to dentistry, including (but not limited to) craniofacial anatomy and cephalometrics, genetics and craniofacial growth, anatomic tooth notation, dental disease, occlusal classification, clinical indices, treatment mechanics, appliance systems, and surgical interventions. Based on the sample size calculation (
Each system was assessed for transcriptional, lexical, and semantic accuracy using validated word and character error metrics. The primary outcome was domain word error rate (DWER), which assesses transcription accuracy involving clinical terminology. Error analysis also involved manual review to identify hallucinations and categorize mistranscriptions, focusing on the clinical significance of class 3 transcription errors, which alter clinical meaning and potentially affect patient care.
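For illustration, WER and the DWER/N-DWER split described above can be computed from a word-level alignment, as in the minimal sketch below. The domain-term list is invented for the example, and insertions are not charged to any reference token, which is a simplification of a full scoring implementation.

```python
def align(ref, hyp):
    """Word-level Levenshtein alignment; returns (distance, per-reference-token
    error flags). Insertions are not attributed to any reference token."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # match/substitution
    errors = [False] * m
    i, j = m, n
    while i > 0 or j > 0:  # backtrace to mark erroneous reference tokens
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            errors[i - 1] = ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            errors[i - 1] = True  # deleted reference token
            i -= 1
        else:
            j -= 1  # inserted hypothesis token
    return d[m][n], errors

def wer(ref_text, hyp_text):
    ref, hyp = ref_text.lower().split(), hyp_text.lower().split()
    dist, _ = align(ref, hyp)
    return dist / len(ref)

def split_wer(ref_text, hyp_text, domain_terms):
    """DWER on reference tokens in the domain list; N-DWER on the rest."""
    ref, hyp = ref_text.lower().split(), hyp_text.lower().split()
    _, errors = align(ref, hyp)
    dom = [e for t, e in zip(ref, errors) if t in domain_terms]
    gen = [e for t, e in zip(ref, errors) if t not in domain_terms]
    dwer = sum(dom) / len(dom) if dom else 0.0
    ndwer = sum(gen) / len(gen) if gen else 0.0
    return dwer, ndwer
```

On a toy pair such as “the upper right canine is unerupted” versus “the upper left canine is uninterrupted,” the domain tokens err at a higher rate than the general tokens, mirroring the DWER > N-DWER pattern reported in Table 2.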
All primary data are available at Zenodo: https://doi.org/10.5281/zenodo.15470163.
Results
Table 1 shows the median word and character transcription error metrics for each ASR system. Significant differences were seen for DWER, nondomain word error rate (N-DWER), word error rate (WER), unnormalized word error rate (uWER), and character error rate (CER) across systems (
Comparative Word and Character Error Metrics by ASR System.
API, application programming interface; ASR, automatic speech recognition; CER, character error rate; DWER, domain word error rate; IQR, interquartile range; N-DWER, nondomain word error rate; uWER, unnormalized word error rate; WER, word error rate.
Wald-type test for overall differences from quantile regression.
Table 2 shows that with the exception of GPT4oTranscribeCorrected, ASR systems had considerable difficulty in recognizing domain-specific words, as demonstrated by DWER scores significantly higher than N-DWER (
Performance Variability between DWER versus N-DWER (Wilcoxon Signed-Rank Test).
API, application programming interface; DWER, domain word error rate; N-DWER, nondomain word error rate.
Bonferroni adjusted.
Comparative summarization metrics relating to lexical and semantic accuracy are shown in Table 3 (median Recall-Oriented Understudy for Gisting Evaluation [ROUGE], Bidirectional Encoder Representations from Transformers [BERT], and Bidirectional and Auto-Regressive Transformer [BART] scores) and Appendix Table 6 (median ROUGE-1: unigrams; ROUGE-2: bigrams; ROUGE-L: LCS, longest common subsequence). The mean ROUGE scores are shown in Appendix Figure 1. Significant differences were seen across systems for all metrics (
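To make the lexical metric concrete, a simplified ROUGE-N can be computed from clipped n-gram overlap between a candidate transcript and the reference narration. This sketch omits stemming and the multi-reference handling of the official ROUGE package.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference
    (n=1 for unigrams, n=2 for bigrams)."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # clipped counts
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)
```

A perfect transcript scores 1.0, while a single substituted word lowers ROUGE-2 more than ROUGE-1, because it breaks every bigram that spans the error.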
Comparative Summarization Metrics for Lexical (ROUGE Score) and Semantic (BERT, BART Scores) Accuracy and Hallucinations by the ASR System.
ASR, automatic speech recognition; BART, Bidirectional and Auto-Regressive Transformer; BERT, Bidirectional Encoder Representations from Transformers; IQR, interquartile range; ROUGE, Recall-Oriented Understudy for Gisting Evaluation.
Wald-type test for overall differences from quantile regression.
Wald-type test for overall differences from the generalized linear model for the binomial family.
Hallucinations were not seen across 8 of the tools but did occur with Whisper (
Table 4 presents the overall domain term missed ratios (averaged across ASR systems) and illustrative examples of mistranscriptions. This analysis revealed wide-ranging performance on specific domain terms, with some proving challenging. These examples qualitatively underscore the quantitative DWER findings, demonstrating that domain-specific terminology poses a significant challenge for current ASR systems and leads to frequent, varied transcription errors, even for common clinical terms.
Domain Terms, Overall Domain Term Missed Ratio, and Corresponding Mistranscriptions across ASR Systems.
ASR, automatic speech recognition.
The overall domain term missed ratio (%) is the proportion of times a term was mistranscribed across all ASR systems. A 100% overall missed ratio indicates all ASR systems failed to transcribe the term correctly.
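The overall missed ratio defined above can be illustrated with a toy computation. The transcripts below are invented for the example, and a full implementation would detect mistranscriptions via alignment rather than the simple substring test used here.

```python
def missed_ratio(term, transcripts):
    """Percentage of system transcripts in which a reference domain term
    does not appear verbatim. A substring test is a simplification; real
    scoring should align each transcript against the reference narration."""
    misses = sum(term.lower() not in t.lower() for t in transcripts)
    return 100.0 * misses / len(transcripts)
```

For example, if four hypothetical systems render “unerupted” as “unerupted,” “uninterrupted,” “unerupted,” and “on erupted,” the term’s overall missed ratio is 50%.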
A qualitative error analysis using Kanal’s typology showed that class 0 (formatting) and class 1 (minor grammatical changes that do not alter meaning) errors were common across systems (Appendix Table 9). Class 3 errors, which alter meaning with potential clinical impact, were also seen across systems, ranging from
Applying normalization rules lowered the WER for all systems. The normalized median WER was consistently below uWER, with absolute reductions ranging from 6.17% (GPT4oTranscribe) to 14.11% (Dragon Medical One).
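The effect of normalization on WER can be reproduced in miniature. In this sketch the normalization rules (lowercasing, punctuation stripping, a digit–unit spacing rule) are illustrative, not the study’s actual rule set.

```python
import re

def normalize(text):
    """Illustrative normalization: lowercase, strip punctuation, put a space
    between a digit and 'mm' (hypothetical unit rule), collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[.,;:!?]", " ", text)
    text = re.sub(r"(\d)\s*(mm)\b", r"\1 \2", text)
    return re.sub(r"\s+", " ", text).strip()

def wer(ref, hyp):
    """Word error rate via a two-row Levenshtein computation."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rt in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, ht in enumerate(h, 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (rt != ht))  # match/substitution
        prev = cur
    return prev[-1] / len(r)

reference = "Overjet is 8 mm; the Upper Right canine is unerupted."
hypothesis = "overjet is 8mm the upper right canine is unerupted"
```

On these strings the unnormalized WER is 0.60, whereas the normalized pair matches exactly (WER 0), mirroring the direction, though not the magnitude, of the reductions reported above.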
Across systems, background noise increased WER (+0.01 95% CI 0.01–0.02,
Speaker accent exerted only a minor influence. Using speaker CO as the baseline, neither speaker RO nor RP altered the median WER in clean recordings (both −0.01 pp; overall
Discussion
This study provides a comprehensive evaluation of contemporary ASR systems within the context of dental clinical record keeping using orthodontic diagnosis and treatment planning as the experimental model. Our findings reveal significant variability in transcriptional accuracy across systems and underlying speech-to-text APIs. Although technological advances are evident and will continue, achieving consistent, reliable transcription, particularly for domain-specific terminology of clinical relevance, remains challenging.
A primary goal of ASR in health care is to alleviate the burden associated with data input for EHRs. This is the first study to investigate ASR systems in dentistry using validated metrics. WER is a common metric used to investigate ASR, with modern systems capable of scores (<2%) that surpass human transcription (around 5%) for general language (Amodei et al. 2016; Zhang et al. 2022). Defining an “acceptable” WER for transcribed clinical documentation is difficult, and we reveal wide-ranging performance across systems. Among commercial systems, Heidi Health was most accurate; however, the integrated architecture in our experimental ASR-LLM pipeline was superior. These findings confirm high-level transcriptional accuracy with ASR; however, with the exception of GPT4oTranscribeCorrected, DWER scores were significantly higher than N-DWER, emphasizing the difficulties these systems have with technical (in this study, clinical orthodontic) vocabulary. Therefore, even ASR systems achieving low WER require careful scrutiny of clinical terminology before their reliability for unreviewed use can be ensured. The domain term missed ratio analysis further illustrated this, revealing even common orthodontic terms to be problematic. Indeed, some clinical terms exhibited high missed ratios across systems, often being substituted with phonetically similar but incorrect words. Developing robust ASR systems requires large amounts of accurately transcribed training data, often difficult and expensive to obtain, especially for specialist areas such as dentistry. To address these limitations, GEC using an LLM is a promising method for automatically identifying and correcting ASR errors (Errattahi et al. 2018). LLMs demonstrate high performance across various applications, suggesting suitability for this purpose, likely because they have effectively “read” vast amounts of text (including dental terminology) more extensively than an ASR engine has “heard” it in more limited training audio.
Supporting this, with effective prompting, a pretrained LLM can equal or even surpass domain-specific language models (Yang et al. 2023). In our experimental ASR-LLM pipeline, we observed that incorporating the LLM significantly improved transcription accuracy, reducing both N-DWER and DWER.
Standard metrics quantify transcription errors, but they do not capture quality dimensions for clinical documentation, including preservation of specific phrasing, logical flow, and accurate semantics (Wang et al. 2003). To gain a more holistic understanding of transcript fidelity, we measured lexical similarity (ROUGE) (Lin 2004), semantic coherence (BERT), and fluency (BART) (Yuan et al. 2021). GPT4oTranscribeCorrected and GPT4oTranscribe generally performed best across these metrics, largely mirroring WER and DWER rankings. This convergence strongly suggests that top-performing systems not only make fewer word-level errors but are also more successful at generating transcripts lexically closer to the original narration, better preserving intended clinical meaning and context. Conversely, lowest-ranked systems had fundamental issues producing accurate outputs in terms of specific wording and semantic integrity.
AI-based ASR is therefore not error proof, and unnoticed residual transcription errors can pose further risks. Qualitative error analysis was crucial for understanding these (Kanal et al. 2001). Class 0 to 1 errors were generally benign, whereas class 2 errors altered the meaning of text in ways that were obvious (“the upper left K9 is measly angulated”). The primary concern stemmed from class 3 errors, which altered meaning in a way that was not obvious (mistranscription of “upper right canine” as “upper left canine”). Assessing the clinical significance of these errors revealed impactful mistakes, including incorrect tooth identification and diagnoses, altered treatment plans, and incorrect patient instructions, which were seen across systems at various levels. The identification of clinically significant errors fuels concern about automation bias, with clinicians overestimating the accuracy of an AI-generated transcript, especially after GEC has produced near-perfect general grammar and formatting (Wang et al. 2023). The challenge of detecting domain-related errors was evident during the manual error analysis; identifying mistakes such as incorrect tooth laterality when discussing extractions required vigilance, highlighting the potential for clinical harm. Beyond transcriptional accuracy, hallucinations presented a further challenge. Although not widespread, these were observed in outputs from DigitalTCO and Whisper, manifesting diversely and including the insertion of completely incoherent text or inappropriate phrases (“Thank you for watching! Subscribe to our channel!”) presumably representing training data artifacts (Metz et al. 2024). More troubling from a clinical perspective were hallucinations generating plausible but factually incorrect information, including invented discussions about tooth restorations, incorrect statements on tooth absence, alternative treatments and tooth impactions, or misinterpretation of patient instructions. 
These pose significant risk, because they can blend into the clinical narrative and escape detection. To date, there are no data relating to ASR systems and hallucinations in health care, but a recent study found a substantial portion of hallucinations associated with Whisper can be potentially harmful in a nonclinical context (Koenecke et al. 2024). It should also be recognized that although newer ASR and LLM systems had reduced errors and GPT4oTranscribe produced no hallucinations, the inherent stochastic nature of these systems means susceptibility remains.
Practical implementation variables also significantly influenced real-world performance. Introducing ambient clinical background noise led to varying increases in WER and DWER across systems compared with clean audio and increased class 3 clinically significant errors for each system, highlighting the importance of the acoustic environment and the necessity for noise suppression using unidirectional microphones to improve transcription fidelity. Moreover, for equitable deployment of ASR in health care, the potential for performance bias related to speaker characteristics should also be considered (Zolnoori et al. 2024). Here, performance according to accent also varied by system, even among native English speakers. The use of ASR holds much promise for streamlining clinical documentation, with ambient AI also offering the potential to record conversations between patients and clinicians, generating notes and letters (Van Veen et al. 2024). However, reliability depends on accurate transcription for multiple users, including accent type.
Responsibly integrating ASR into clinical practice requires input from clinicians and developers. Fine-tuning models on large curated datasets is key to reducing the gap between clinical terminology and general language, and incorporating comprehensive structured clinical vocabularies such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) database is key to this process (NHS England 2025). Future products should feature uncertainty-aware visualizations to highlight terms with low-calibrated confidence scores for manual verification, particularly those deemed clinically significant (Loftus et al. 2022). In the clinic, the most important safeguard is maintaining a “human-in-the-loop” workflow to verify transcripts, as clinicians move from authors to editors of their notes (Altschuler et al. 2024). It is imperative to guard against automation bias, the tendency to assume accuracy of an AI-generated transcript, because it appears polished after LLM correction. Commercial companies share this responsibility and must be candid regarding limitations of the contemporary technology.
Limitations of the Study
This study was conducted on prepared orthodontic clinical records read verbatim, which does not fully represent fluent, conversational spoken language. Orthodontic records use some terminology that differs from general dentistry, so the results might not generalize to all dental specialties. In addition, the systems tested will inevitably be updated, and more will enter the market. We also focused on English-language transcriptions, with the accuracy of ASR known to differ between languages (Benzeghiba et al. 2007), limiting the international generalizability of our findings. Furthermore, although the qualitative error analyses were conducted by consensus, there is inevitably some subjectivity, particularly regarding the interpretation of error impact. Future research should prioritize in vivo studies capturing real-world patient–clinician dialogue. The significant speed advantage observed in the experimental pipeline highlighted the potential efficiency gains; however, the clinical impact of these systems also needs evaluation, quantifying the time clinicians spend verifying and correcting AI-generated transcripts and, ultimately, whether these systems meaningfully reduce documentation workload and enhance clinician–patient interactions. Contemporary ASR systems can also improve with training and fine-tuning, although how much data is required for robust and generalizable performance is currently unknown (Latif et al. 2021). Fine-tuning Whisper in a medical context can improve an untrained model when evaluating the same dataset; however, the true generalizability of fine-tuning is poorly understood (Roushan et al. 2024). Here, we assessed only “out-of-the-box” performance and did not fine-tune either the ASR or LLM GEC step on our own data.
Although this would likely reduce WER, external validity would remain uncertain without an external dataset; consequently, out-of-sample evaluation is required to ensure that improvements from fine-tuning persist beyond the original training data.
Conclusions
This investigation revealed significant performance variability among tested ASR systems, with all capable of introducing clinically significant mistranscriptions. Clinicians using these systems should be cautious about plausible subtle substitutions or omissions of domain-specific terminology. The current status of ASR necessitates vigilance to guard against automation bias in the clinical environment, improvement in domain-specific accuracy, and potential uncertainty-aware features to ensure safe and reliable integration into clinical practice.
Author Contributions
R. O’Kane, contributed to conception and design, data acquisition, analysis, interpretation, drafted and critically revised the manuscript; D. Stonehouse-Smith, contributed to data acquisition, interpretation, critically revised the manuscript; L.C.U. Ota, R. Patel, N. Johnson, C. Slipper, contributed to data acquisition, critically revised the manuscript; J. Seehra, contributed to data analysis, interpretation, drafted and critically revised the manuscript; S.N. Papageorgiou, contributed to data acquisition, interpretation, drafted and critically revised the manuscript; M.T. Cobourne, contributed to conception and design, data analysis and interpretation, drafted and critically revised the manuscript. All authors gave their final approval and agree to be accountable for all aspects of the work.
Supplemental Material
Supplemental material, sj-docx-1-jdr-10.1177_00220345251382452 for Transcription Accuracy of Automatic Speech Recognition for Orthodontic Clinical Records by R. O’Kane, D. Stonehouse-Smith, L.C.U. Ota, R. Patel, N. Johnson, C. Slipper, J. Seehra, S.N. Papageorgiou and M.T. Cobourne in Journal of Dental Research
