Abstract
Keywords
Introduction
Emergency departments (EDs) have seen growing strain over the past few years, stemming from an eroding primary care system and a human resources crisis emerging from the COVID-19 pandemic. This pressure amplified existing vulnerabilities in the emergency care system, resulting in overcrowding, prolonged patient stays, and extended wait times to see an ED provider.1,2 EDs struggle with record-setting wait times and closures, 3 highlighting the ongoing need for innovative strategies to improve healthcare delivery and patient management in these critical settings. The ED community has been a pioneer in developing and adopting data-driven clinical decision tools, such as the Canadian CT Head Rule 4 and the Quick Sequential Organ Failure Assessment score, 5 to support decision-making in high-pressure environments. However, these static, rudimentary tools are fixed in time and validity and can be unwieldy to use. 6
Artificial intelligence (AI), specifically machine learning (ML), represents a potential paradigm shift in clinical prediction models. 7 Unlike traditional statistical models that rely on linear and logistic regression applied to historical datasets, ML models leverage flexible, nonparametric algorithms capable of capturing complex patterns and interactions between variables, potentially enhancing predictive accuracy. 8 In ED settings, ML has shown promise in areas such as diagnostic accuracy, patient triage, and clinical decision-making. 9 However, while previous reviews have highlighted the potential of ML models to improve diagnostic accuracy and patient care in EDs,10,11 many questions remain unanswered. For example, to what extent do these models predict clinical outcomes such as mortality, length of stay, and disposition? Can they reduce wait times, optimize treatment decisions, and lower ED-associated costs?
This systematic review aims to summarize the evidence on ML implementation in EDs, with a particular focus on clinical and operational impacts.
Methods
Search strategy
We registered this systematic review with PROSPERO (registration number: CRD42024515933). An experienced information specialist (BS) developed and tested the search strategies through an iterative process in consultation with the review team. The MEDLINE strategy was peer reviewed by another senior information specialist prior to execution using the PRESS checklist. 12 Using the multi-file and de-duplication tool available on the Ovid platform, we searched Ovid MEDLINE® ALL and Embase Classic + Embase. We also searched Cochrane Central Register of Controlled Trials (CENTRAL) (Wiley), CINAHL (EBSCOhost), and IEEE Xplore. The databases were searched from inception to January 9, 2024.
The strategies utilized a combination of controlled vocabulary (e.g., “Emergency Service, Hospital,” “Artificial Intelligence,” “Data Mining”) and keywords (e.g., “emergency department,” “AI,” “deep learning”). Vocabulary and syntax were adjusted across the databases. There were no language or date restrictions, but where possible, animal-only records, opinion pieces, and other irrelevant publication types (e.g., conference abstracts and preprints) were removed. A summary of the search strategies as run can be found in Supplementary File 1 and Supplementary File 4. Records were downloaded and de-duplicated using EndNote version 9.3.3 (Clarivate Analytics) and uploaded to Covidence (Veritas Health Innovation, Melbourne, Australia. Available at www.covidence.org) for efficient data management, extraction, and synthesis.
Study selection
Eligible studies were those that implemented, or prospectively or retrospectively evaluated, the performance of ML models in emergency department settings to predict clinical or operational outcomes. Studies limited to model development, or focused only on disease-specific prediction tasks without clinical or operational evaluation, were excluded. The following designs were also excluded: animal models, in vitro studies, systematic reviews, narrative reviews, opinion papers, case studies, and conference papers. Study participants were humans of all ages, genders, or ethnicities who presented to the emergency department for any reason. The primary outcome of interest was the clinical outcomes of ML models (how they can assist in predicting mortality, treatment decisions, and disposition). Secondary outcomes included the operational efficiencies of ML models (their ability to predict patient wait times and length of stay and to reduce ED-associated costs), and any reported limitations related to the implementation of ML models. The main goal was to examine and represent the structure and outcomes of the reviewed studies rather than to systematize the full scope of each study.
Covidence was used throughout the review to manage citations. We engaged and trained several individuals to assist with screening citations (AP, DP, ESA, KS, NM, NS, ZP). During both title and abstract screening and full-text screening, reviewers applied the eligibility criteria to determine the inclusion or exclusion of studies, which was recorded in Covidence. First-level screening consisted of title and abstract screening of all uploaded studies. Each citation was reviewed independently by two reviewers to select studies for full-text review (AP, NM, ZP, KS). Studies were included when both reviewers judged that the eligibility criteria were fully met and excluded when both judged that they were not. Any disagreement was resolved by consensus or a third reviewer (AP). Second-level screening involved a thorough assessment of the full text of all studies that passed the initial title and abstract screening, performed by two independent reviewers (AP, NM, ZP, KS), who excluded any studies that did not meet the same eligibility criteria applied in the first step.
Data extraction and assessment
Members of the study team assisted with data extraction (AP, KM, NM). To extract data from the included studies, an extraction form developed using the Cochrane guidelines 13 was uploaded onto Covidence. Pilot testing of the form was completed on five randomly selected studies by two reviewers (AP, KM). The data extraction was checked for consensus by one member of the study team (AP). Data was collected on meta-data (study title, author name, year of publication), study design, study population (country, age groups, demographics), ML application type, purpose of ML application, research questions, data source, outcomes (training/testing before implementation, training/testing after implementation), sample size per outcome, study limitations, use of clinical applications, and conclusions and future directions. If information was not available from an article, this was noted. Eligible studies were categorized based on similar outcomes and presented in tabular format using data obtained from the extraction form (Supplementary File 2).
Risk of bias and quality assessment
Eligible studies were assessed independently for their risk of bias by two reviewers (KS, ZP) (Supplementary File 3). Methodological quality was determined, and risk of bias evaluated, using the Prediction model Risk Of Bias Assessment Tool (PROBAST). 14 This tool comprises questions tailored to identify potential biases in four domains (participant selection, predictors, outcome, and analysis), as well as an overall study risk of bias. Each question was answered as yes, probably yes, probably no, no, or no information, with yes indicating a low risk of bias and no indicating a high risk of bias. Each study was rated as low, unclear, or high risk of bias. Two independent reviewers assessed the risk of bias for each domain and the overall bias within each included study. Any discordance on methodological quality was resolved by consensus or input from a third reviewer (AP). The authors also used the PRISMA 2020 checklist to evaluate the reporting of the review (Supplementary File 5).
Results
Study characteristics
Table 1 presents key data extracted from the included studies; the study selection process is shown in Figure 1. The included studies were published between 2004 and 2024. Specifically, 73 (87%) of the studies were published within the last five years, with only 11 studies (13%) published prior to 2019. Geographical setting varied across the studies, with 30 studies (35.7%) from the USA,15–44 12 (14.3%) from South Korea,45–56 nine (10.7%) from Taiwan,57–65 seven (8.3%) from Hong Kong and China,66–72 six (7.1%) from Italy,73–78 three (3.6%) from Israel,79–81 two (2.4%) from Canada,82,83 two (2.4%) from Singapore,84,85 two (2.4%) from France,86,87 two (2.4%) from Australia,88,89 and one each (1.2%) for a total of nine from Portugal,90 the Netherlands,91 Switzerland,92 Saudi Arabia,93 Iran,94 Greece,95 the United Kingdom,96 Germany,97 and Turkey.98 Sample sizes in these studies ranged from 80 to 4,645,483 patients, indicating wide variation in the population sizes examined. In terms of study design, retrospective cohort studies were predominant.

Figure 1. Selection process of eligible studies from all identified citations (PRISMA flow diagram).
Table 1. Summary of data extracted from the included studies.
The reviewed studies used ML models for seven types of predictions: mortality, admission to hospital, ED length of stay, hospital length of stay, treatment decisions, costs, and COVID-related outcomes. In the context of COVID-19, ML models were primarily applied to predict mortality, hospital admission, and treatment decisions in SARS-CoV-2 patients rather than to detect SARS-CoV-2 infection itself.
Quality assessments
The risk of bias assessment revealed that most studies exhibited a high risk of bias (Figure 2).

Figure 2. Risk of bias summary plot.
Applications of ML models in emergency departments
To facilitate interpretation, the included studies were grouped according to their primary ML application: (1) mortality prediction, (2) disposition prediction, (3) length of stay estimation, (4) treatment decision-making, (5) wait time prediction, and (6) cost prediction. Table 2 provides a summary of the number of studies in each category by population type (adult/mixed, pediatric) and predominant ML algorithm (gradient boosting, random forest, neural network, other).
Table 2. Overview of machine learning applications in emergency departments by outcome category.
Mortality prediction
A total of 50 studies15,16,18,20,22–24,29–33,35–38,41–48,50,52,54–60,62,64,65,67,69,70,72,74–77,79,85,90–92,96 assessed the use of ML to predict mortality rates, including short-term mortality outcomes (i.e., in-hospital or within 6 hours to 7 days of ED admission)15,16,18,20,22,24,30,31,33,44,45,52,56,58,59,62,64,67,70,74,79,92 and long-term mortality outcomes (i.e., 28 days to 1 year). 23,30,35–38,42,45,46,48,50,55,56,62,72,77,79,85,91,92
Most studies focused on adults, with only one study specifically focusing on children aged 18 years or younger 44 and three studies focusing on the elderly (aged 65 and older) population.18,64,74 Across these studies, 43 different ML models were employed, with gradient boosting, random forest, and neural networks being the most common types. AUROC values for these models ranged from 0.618 to 0.978 for gradient boosting, from 0.77 to 0.921 for random forest, and from 0.66 to 0.976 for neural networks.
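The AUROC figures reported above quantify how well a model ranks patients who experienced the outcome above those who did not. As a purely illustrative sketch (the labels and risk scores below are hypothetical, not drawn from any included study), the metric can be computed directly from predicted risks via its Mann–Whitney interpretation:

```python
def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive case
    receives a higher risk score than a randomly chosen negative case
    (Mann-Whitney U formulation; ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# toy example: 1 = died, 0 = survived; scores are hypothetical model risks
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect discrimination, which is why values approaching 0.97–0.98 in the reviewed studies represent near-perfect separation of survivors and non-survivors.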
Key features (i.e., the most significant variables used by ML models to make predictions) often remained consistent between short- and long-term mortality, including the demographic variables age, sex, and race, along with vital signs (i.e., heart rate, temperature, and respiratory rate). However, the variables used often varied depending on the condition being assessed. For example, studies focusing on COVID-19-related outcomes frequently included comorbidities, such as hypertension, diabetes, or cancer, in their ML models,22,30,41,45,75–77,96 whereas studies on sepsis often included clinical biomarkers, including white blood cell or platelet count.23,36,43,55,56,58,62,72,85
Some studies compared feature importance between different mortality timeframes.45,56,62,79 Two studies explored mortality prediction in septic patients.56,62 Perng et al. assessed 72-hour and 28-day mortality, presenting AUROC values of 0.94 and 0.95 for their combined neural network and Softmax ML model. 62 The study suggests base excess was the most influential feature for both 72-hour and 28-day mortality, with shock episodes (administration of inotropic agents during ED admission) and red cell distribution width being crucial factors in 72-hour and 28-day mortality, respectively. 62 Similarly, Jeon et al. examined 7-day, 14-day, and 30-day mortality. 56 The study showed that septic shock, lactate levels, malignancy, age, and oxygen saturation were the most important features for all three mortality timeframes, yet respiratory infection, which was included in the set of best features for predicting 14-day and 30-day mortality, was not included in the set for 7-day mortality. 56 In comparison, two studies explored varying timeframes of mortality in relation to triage scores.45,79 Klug et al. investigated early mortality, defined as mortality up to 2 days following ED registration, and short-term mortality, defined as mortality 2–30 days post ED registration. 79 The study found that age and structured chief complaint were the strongest predictors of mortality across all timeframes. 79 The gradient boosting model demonstrated high predictive performance, with an AUC of 0.962 for early mortality and 0.923 for short-term mortality. Notably, a simplified model incorporating nine key features (age, arrival mode, chief complaint, five primary vital signs, and emergency severity index) yielded an AUC of 0.962 for early mortality, comparable to the full-feature model, which had an AUC of 0.964. 79
Four studies focused on pediatric 44 or elderly patients only.18,64,74 Goto et al. investigated the use of ML in pediatric emergency department triage, evaluating its ability to predict critical care outcomes, including ICU admission and in-hospital mortality, as well as hospitalization. 44 The study found that deep neural networks outperformed traditional triage systems, achieving an AUC of 0.85 for critical care prediction and 0.80 for hospitalization. 44 The three studies focused on elderly patients explored ML applications in different clinical contexts, including cancer, influenza, and emergency surgery.18,64,74 Qiao et al. developed the Cancer Frailty Assessment Tool (cFAST) using an extreme gradient boosting model to predict in-hospital mortality among older patients with cancer. 18 Their model, which incorporated 240 features, achieved an AUC of 0.92, significantly outperforming traditional risk indices such as the Charlson Comorbidity Index (AUC 0.62) and the Hospital Frailty Risk Score (AUC 0.71). 18 Key predictors included comorbidities, frailty markers, and hospital variables. 18 Tan et al. applied ML to predict clinical outcomes in older ED patients diagnosed with influenza, including hospitalization, pneumonia, sepsis, ICU admission, and in-hospital mortality. 64 The XGBoost model achieved the highest AUC (0.902) for ICU admission, while a logistic regression model achieved an AUC of 0.889 for in-hospital mortality, and a random forest model obtained an AUC of 0.840 for hospitalization. 64 Key predictors included oxygen saturation, pulse rate, blood pressure, and comorbidities. 64 Fransvea et al. developed an explainable multilayer perceptron model to predict 30-day postoperative mortality in elderly patients undergoing emergency surgery. 74 Their model achieved an accuracy of 94.9%, with a sensitivity of 92.0% and specificity of 95.2%.
Key predictors included non-chronic cardiac-related comorbidities, low oxygen saturation, elevated creatinine levels, and reduced functional capacity. 74
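Accuracy, sensitivity, and specificity, as reported for models like Fransvea et al.'s, all derive from the four cells of a confusion matrix. The sketch below illustrates the arithmetic; the counts are hypothetical, chosen only to reproduce percentages of the same order as those reported:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true-positive rate: deaths correctly flagged
    specificity = tn / (tn + fp)   # true-negative rate: survivors correctly cleared
    return accuracy, sensitivity, specificity

# hypothetical counts for a 30-day postoperative mortality classifier
acc, sens, spec = classification_metrics(tp=46, fp=24, tn=476, fn=4)
print(f"accuracy={acc:.1%} sensitivity={sens:.1%} specificity={spec:.1%}")
# accuracy=94.9% sensitivity=92.0% specificity=95.2%
```

Because mortality is rare, accuracy alone can look high even for a model that misses many deaths, which is why the reviewed studies report sensitivity and specificity (or AUROC) alongside it.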
Disposition prediction
Twenty-one studies used ML models to predict various disposition outcomes following an ED visit, specifically hospital admission, ICU admission, or repeat ED visit with admission after discharge.15,16,22,29,31,35,39,42–44,50,53,64,75–78,84,95 The majority of studies focused on adults or individuals of all ages, with two studies specifically exploring children39,44 and one focusing on elderly patients. 64 The most commonly used models were gradient boosting, random forest, and neural networks, with AUROC values for admission prediction ranging from 0.675 to 0.96 for gradient boosting, 0.77 to 0.885 for random forest, and 0.66 to 0.94 for neural networks.
Studies that focused on general conditions frequently incorporated demographic data, such as sex and age, and vital sign data, including heart rate and body temperature, into their models. In studies exploring patients with respiratory illnesses, such as acute respiratory infections, or those experiencing asthma or COPD exacerbations, patient comorbidity data (e.g., heart failure, cancer, lung disease) was often included.15,22,39,50,64,75–77 Notably, some models also integrated chief complaint data as a potential predictor.15,31,95 Studies examining re-admission often leveraged features related to comorbidities, such as a history or presence of renal disease.43,78,84 A few studies explored both hospital admission and ICU admission,15,31,44 using predicting variables such as the mode of arrival to the ED.
Analysis of ML models in pediatric ED revealed differences in the key features influencing admission predictions. For instance, Goto et al. identified respiratory rate, ambulance use, oxygen saturation, and pulse rate as key variables for predicting both hospital and ICU admission in children. 44 However, dehydration, increased work of breathing, poor feeding, and maternal smoking were significant predictors for hospitalization in children presenting with bronchiolitis. 39
Length of stay prediction
Nine studies applied ML models to predict LoS within the ED.19,26,40,63,81,93,95,97,98 Age ranges varied across studies, with most studies focusing on adult populations (18 years and older), while some specifically examined pediatric populations (0–18 years) 39 or older adults (≥65 years).93,98
Studies consistently demonstrated that gradient boosting models achieved the highest predictive accuracy, with AUC values ranging between 0.81 and 0.85.19,63,81,93 These models outperformed traditional regression-based approaches and other machine learning models, such as artificial neural network (ANN), 26 support vector machines (SVM), and decision trees.40,98 Random forest models also performed well, particularly in studies analyzing structured triage data.40,81 Deep learning models, while less commonly used, showed potential for applications where large-scale real-time data integration is required, with chief complaint, vital signs, and previous ED visits being the most significant predictors. 95
The most significant predictors of ED LoS varied across studies, but several key factors were consistently identified, including triage level,63,95 vital signs, particularly heart rate, blood pressure, and oxygen saturation,19,40,63 chief complaints, and previous ED visits.81,95 Some models also incorporated comorbidity data, demonstrating that chronic conditions such as diabetes, hypertension, and cardiovascular disease were associated with increased ED LoS.19,93
Eleven studies developed ML models to predict hospital LoS for patients admitted from the ED,17,20,25,39,43,49,54,61,73,86,87 with most studies focusing on adult populations (18 years and older), while some specifically examined elderly patients (≥65 years).20,25,49,73
Among the ML models used, gradient boosting models showed the highest predictive performance. Random forest and artificial neural networks were also commonly used but showed slightly lower predictive performance.20,39,49 Deep learning models, particularly generative adversarial networks and convolutional neural networks, demonstrated superior accuracy (sensitivity = 94%, specificity = 92%) in specialized applications such as image-based prediction models for intracranial hemorrhage and sepsis-related hospital stays.61,87 However, these models required large datasets and external validation to ensure generalizability.
The most significant predictors of hospital LoS included demographic variables and arrival mode,49,73 as well as physiological and clinical markers, such as injury severity score, Glasgow Coma Scale, white blood cell count, lactate levels, and respiratory distress.20,43,61 Comorbidities, including hypertension, diabetes, chronic respiratory diseases, and malignancies, were also associated with increased hospital LoS, particularly in elderly populations.17,20,25,39,43,49,54,61,73,86,87
Treatment decision-making
Eight studies examined the role of ML models in making treatment decisions during an ED visit.21,27,28,34,51,68,78,80 While the majority of studies focused on adult populations, four included patients of all age groups.21,28,34,78
ML models have shown potential in supporting treatment decision-making in ED settings, particularly in sepsis detection, cardiovascular risk stratification, imaging interpretation, and triage-based decision-making. While deep learning models demonstrated high accuracy in image-based applications, gradient boosting and neural network models were more frequently used for risk stratification and decision support.
In sepsis detection, an SVM model incorporating vital signs, free-text triage assessments, and structured patient history achieved an AUC of 0.86, compared to an AUC of 0.67 when only structured data (vital signs and demographics) were used. 21 For chest pain and cardiovascular risk assessment, a neural network model was tested for its ability to guide admission or discharge decisions in ED patients presenting with chest pain. 27 However, despite its diagnostic accuracy, it did not significantly impact admission rates (pre vs post implementation: 63% vs 67%). The lack of impact was attributed to delays in obtaining cardiac marker results, which meant disposition decisions were often made before ML-based recommendations were available. 27 In another study on cardiovascular triage, a gradient boosting model for ED triage in suspected cardiovascular disease demonstrated the highest performance, with an AUC of 0.937, effectively classifying patients into appropriate triage levels. 68 For emergency triage, an ANN model was used to improve risk stratification in syncope patients, focusing on the decision to hospitalize patients to prevent severe short-term outcomes. This model demonstrated a sensitivity of 100% and a specificity of 79%. 78
For heart failure management, an unsupervised ML model was developed to identify symptom patterns predictive of acute decompensation and adverse cardiac events in ED patients with heart failure. The model identified indigestion as a novel predictor of adverse outcomes, a feature not commonly included in traditional heart failure risk scores. 28
In radiographic diagnosis and treatment planning, deep learning models demonstrated strong predictive accuracy for pneumonia detection on chest radiographs. One model achieved an AUC of 0.906 when incorporating body mass index (BMI) and age, compared to an AUC of 0.829 when using airspace opacities alone. 34 Additionally, a deep learning-based assistive system for chest radiograph interpretation significantly improved emergency physician diagnostic performance, with an AUROC of 0.801 and a kappa value of 0.902 for decision-making consistency. 51 The model was trained on ED chest radiographs annotated by radiologists. 51 In another study, a gradient boosting model was designed to optimize head CT utilization in the ED triage process. The model effectively predicted non-contrast head CT usage at triage level, achieving an AUC of 0.9. 80
Studies incorporating free-text clinical notes along with structured clinical data showed higher predictive performance, particularly in sepsis detection. 21
Wait time prediction
Seven studies applied ML models to predict ED wait times, including the time spent waiting to access medical assessment or medical treatment.66,71,83,88,89,94,95 These studies focused on all ages, with two studies specifically examining children under 18 years old,66,83 and one study focusing on individuals over 16 years old. 95
Gradient boosting models outperformed other models in predicting waiting times, with reported reductions in mean squared errors ranging from 15% to 22%,71,89 and reductions in prediction errors by up to 19%. 94 Moreover, studies reported overall decreases in patient wait times ranging from 18% to 26%, particularly in pediatric emergency care, where decision trees and logistic regression reduced median wait times by 26% through automated early diagnostic decision-making. 83 Queueing-based models combined with quantile regression improved prediction reliability, reducing underpredicted wait times by 42%. 88 In workflow optimization, gradient boosting models integrated with discrete event simulation led to a 25% reduction in total ED wait times by optimizing staff allocation and process efficiency. 94 Deep learning models, including convolutional neural networks and long short-term memory (LSTM), improved patient prioritization and reduced wait times by 18%. 95
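The 42% reduction in underpredicted wait times reported for quantile regression follows from the asymmetry of the quantile (pinball) loss: choosing a quantile above 0.5 makes underestimating a patient's wait cost more than overestimating it by the same amount. A minimal illustrative sketch (the wait times below are hypothetical, not taken from any included study):

```python
def pinball_loss(actual, predicted, tau):
    """Quantile (pinball) loss at quantile tau: underprediction is
    weighted by tau, overprediction by (1 - tau)."""
    diff = actual - predicted
    return tau * diff if diff >= 0 else (tau - 1) * diff

# at the 0.9 quantile, telling a patient their 60-minute wait will be
# 50 minutes costs 9x more than telling them it will be 70 minutes
print(pinball_loss(60, 50, 0.9))            # 9.0
print(round(pinball_loss(60, 70, 0.9), 10)) # 1.0
```

A model trained to minimize this loss therefore learns the 90th-percentile wait rather than the mean, making quoted waits deliberately conservative.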
Key predictors of ED waiting times included triage level, patient volume, time of arrival, and department occupancy. Higher-acuity patients had shorter waits, while lower-acuity cases faced delays.66,89 Congestion and staffing availability affected wait times, with peaks during busy hours and weekends.71,88 Ambulance transport reduced wait times compared to walk-ins.89,95 ML models identified ESI scores, vital signs, and chief complaints as critical for triage-based predictions. 95 Studies integrating historical patient flow data and discrete event simulation highlighted resource availability and procedural delays as key factors. 94
ED cost prediction
Three studies applied ML models to predict ED costs and resource utilization.40,48,82 Two of the studies48,82 did not pose any age restrictions, while one study specifically focused on individuals aged 16 and older. 40
Logistic regression models achieved an AUC of 0.71 for predicting frequent ED visits and 0.76 for identifying patients in the top 5% of ED users. 82 Multilabel machine learning models, particularly multilayer perceptron classifiers, were used to predict ED orders at triage, achieving a median F1 score of 0.56. Simulations integrating these models showed that reducing ED LOS by an average of 7 minutes could lead to increased efficiency but also resulted in a rise in ordering costs from $21 to $45 per visit. 40
Key predictors of ED costs included patient demographics, prior healthcare utilization, and triage decisions. Patients with a history of frequent ED visits had higher predicted costs. 82 Triage-based ML screening models for high-cost conditions, such as ST-elevation myocardial infarction (STEMI), significantly improved early detection and reduced costs associated with delayed treatment. 48
Discussion
This systematic review highlights the various applications of ML models in ED settings. Across the included studies, ML models were most frequently used for mortality prediction, disposition decisions, LOS estimation (both ED and hospital), treatment decision-making, wait time forecasting, and cost prediction.
The primary data sources used were electronic patient records, which have been pivotal in enabling the development of ML models in healthcare, particularly in ED settings. 99 The digitization of health records over the past decades has provided the depth and accessibility of data required for developing and testing ML models that rely on detailed patient information to predict outcomes and recommend interventions with greater accuracy. 100 In addition to electronic patient records, several studies leveraged administrative databases. These databases provide large-scale, longitudinal data that can be valuable for identifying trends, conducting population-level analyses, and evaluating the long-term efficacy of medical interventions. However, the lack of real-time availability of administrative data limits their utility in clinical decision-making within ED settings.101,102
Commonly applied ML models
Regarding methodologies, neural networks, random forests, and gradient boosting emerged as the most commonly applied ML models in the reviewed studies. These models were likely chosen for their ability to handle large datasets with missing data, 103 and to predict the nonlinear relationship between parameters. 104 Their flexibility and robustness make them particularly suitable for the complex and dynamic nature of emergency care environments.
Neural networks 38 use supervised learning techniques where relationships between inputs and outputs do not follow traditional mathematical models. This allows neural networks to predict the probability of an outcome for an individual rather than for populations and to include cases with missing data. However, neural networks struggle when data is scarce and are more effective with larger datasets. Moreover, neural networks are known to reduce the interpretability of data features, sometimes to the extent that they become meaningless for understanding performance. 105 In contrast, random forests, a model that operates by creating an ensemble of decision trees, are often regarded as one of the most popular techniques for solving classification problems on large datasets. 50 The use of multiple decision trees makes the model resistant to noisy data points, often resulting in lower error rates and more stable predictions. 50 While random forests may require more time and system resources, they can perform well on both large and small datasets. 50 Similarly, gradient boosting also uses decision trees, but unlike random forests, it builds decision trees sequentially rather than independently.73,79 This sequential construction allows gradient boosting to reduce errors made by previous trees, enabling the model to learn complex patterns in the data.73,79 However, this also makes gradient boosting more sensitive to noisy data, which can reduce its performance. 73
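The sequential error correction that distinguishes gradient boosting from a random forest can be made concrete with a from-scratch sketch: a squared-error booster built from one-split regression stumps. This is a didactic toy (not code from any reviewed study); each round fits a stump to the residuals left by the ensemble so far, then adds a damped version of that stump's prediction.

```python
def fit_stump(x, residuals):
    """Find the one-split regression stump minimizing squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds (keep right side non-empty)
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]  # threshold, left-leaf value, right-leaf value

def gradient_boost(x, y, rounds=60, lr=0.1):
    """Build stumps sequentially, each correcting the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        t, lv, rv = fit_stump(x, resid)        # fit the current errors
        stumps.append((t, lv, rv))
        pred = [pi + lr * (lv if xi <= t else rv)  # damped update
                for xi, pi in zip(x, pred)]
    return base, stumps, pred

x = list(range(10))
y = [0.0] * 5 + [1.0] * 5                      # a simple step to learn
_, _, pred = gradient_boost(x, y)
print(max(abs(yi - pi) for yi, pi in zip(y, pred)))  # residual shrinks toward 0
```

A random forest would instead fit each tree independently on a bootstrap sample and average the results, which is why it is more robust to noise but cannot progressively correct its own systematic errors the way the boosting loop above does.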
Our review further revealed that these ML models are often chosen based on the specific outcomes being predicted in emergency care settings. Gradient boosting demonstrated high accuracy in predicting mortality, ICU admissions, and treatment decisions, with its sequential learning process making it particularly suited to capturing complex patterns in clinical data. Random forests were most effective in noisy datasets and were widely applied to disposition prediction and wait time estimation. Neural networks excelled in predicting length of stay and treatment decisions, though their limited interpretability posed challenges when understanding variable contributions to predictions. Our study highlights the importance of selecting the appropriate machine learning model based on the problem and dataset being addressed. As suggested by Zeleke et al., 73 it may be beneficial to systematically compare the performance of different algorithms and identify the best model for a given dataset to ensure accurate and reliable predictions.
ML applications in EDs
This review highlights the growing role of ML in EDs, with models applied to mortality prediction, patient disposition, LOS estimation, treatment decision-making, wait time forecasting, and cost prediction.
While ML has demonstrated strong predictive performance across these domains, key challenges remain in external validation, workflow integration, and the ability to translate predictions into real-world improvements in patient care and ED efficiency, as noted in previous reviews.106,107
Recent evidence further emphasizes that enhancing triage and forecasting processes through ML models, such as natural language processing (NLP) and feature engineering, can substantially improve operational efficiency and patient flow in EDs. For example, a recent systematic review 108 highlighted that ML and NLP models can enhance triage accuracy by integrating free-text triage notes with structured data, outperforming traditional triage scales. Similarly, a retrospective multicenter study using datasets from 11 EDs across hospitals in Australia, the United States, and the Netherlands 109 showed that feature engineering in ML-based forecasting significantly improved the prediction of patient arrivals, supporting more efficient staffing and resource allocation. These findings underscore the growing implementation of ML not only in clinical prediction but in operational optimization within the ED.
One of the most significant findings of this review is that ML models often identified key predictors of patient outcomes that align with clinical intuition. For example, studies on mortality prediction showed that while early mortality is often driven by acute physiological deterioration (such as cytokine storms in sepsis), long-term mortality is more influenced by immune dysfunction and underlying health conditions. Additionally, some studies demonstrated that simplified models with fewer variables could achieve predictive performance comparable to complex models, suggesting that streamlined, interpretable models may be sufficient for clinical decision support. 79
ML-based disposition models successfully integrated structured patient data, including demographics, comorbidities,15,22,39,50,64,75–77 as well as chief complaints,15,31,95 to refine risk assessments for hospital and ICU admission. However, reliance on structured clinical data may limit model performance, as free-text triage notes and clinician assessments often contain critical information not captured in standardized datasets. 110 Despite promising results, the variability in model performance across studies suggests that external validation and site-specific calibration are required before broad clinical adoption. Models trained on single-center datasets may not generalize well to different patient populations, particularly in settings with varying healthcare resources and admission practices. Additionally, while some studies integrated mode of arrival as a predictor of ICU admission,15,31,44 this variable is highly context-dependent and may not be a reliable feature across different hospitals or healthcare systems. Beyond predictive accuracy, further research should explore the impact of ML-driven disposition prediction on ED efficiency, patient outcomes, and healthcare costs to fully realize its potential in optimizing care.
Similarly, ML-based treatment decision models have shown promise in risk stratification for conditions such as heart failure, cardiovascular disease, sepsis, and pneumonia.21,27,28,34,51,68,78,80 However, their effectiveness in clinical practice depends on workflow integration. 27 Models that provide risk assessments without actionable recommendations may have limited impact on physician decision-making. This emphasizes the need for these models to be integrated into clinical workflows in a way that complements, rather than replaces, clinician judgment, a point also raised in previous work 111 highlighting the importance of clinician-ML collaboration in improving patient outcomes. For ML to be a meaningful addition to clinical practice, it must enhance efficiency while preserving the critical role of human expertise in patient care.
ML models predicting ED and hospital LOS demonstrated that incorporating real-time operational variables, such as department occupancy and historical patient flow data, improved predictive accuracy compared to models relying solely on patient characteristics. 95 This suggests that LOS prediction should not be static but instead dynamically adjust based on ED conditions. However, the ability of ML-driven predictions to improve operational efficiency depends on real-time implementation; if hospitals do not adjust staffing and resource allocation based on model outputs, predictive gains may not translate into clinical improvements. 112 ML-based wait time forecasting faces similar challenges, as most studies have focused on retrospective predictions rather than real-time applications.71,89 Future research should evaluate how ML-driven LOS predictions influence clinical workflows and patient outcomes when actively used for decision-making. Moreover, discrepancies in hospital admission policies and discharge protocols limit the applicability of ML models across different healthcare settings. To maximize clinical impact, future research should focus on developing adaptive models that continuously learn from hospital-specific data while ensuring multi-center validation to enhance generalizability.
Cost prediction remains the least explored area, with studies suggesting that reducing ED LOS can improve efficiency but may increase diagnostic ordering costs. 40 Further research should evaluate how ML models can optimize both cost-effectiveness and patient outcomes.
To maximize clinical impact, future ML applications in EDs should prioritize multi-center validation, real-time implementation, and integration into existing clinical workflows to ensure that predictive models translate into tangible improvements in patient care and ED operations. 112 Furthermore, data collection efforts should extend across multiple centers with diverse patient demographics and treatment approaches, supported by standardized data frameworks to reduce variability and promote consistency. Beyond data collection, traditional approaches to model development and validation require re-evaluation. As highlighted by Youssef et al., 107 the traditional method of external validation on secondary datasets may not always suffice. Instead, a recurring local validation approach that continuously evaluates the model's performance on the primary dataset over time is recommended. 107 This method ensures that ML models remain accurate, relevant, and responsive to changes in the specific clinical settings where they are deployed. 107 Finally, ethical considerations regarding ML applications in emergency care, including patient privacy, informed consent, and addressing biases in ML decision-making, must be systematically explored. Clear guidelines for the ethical use of ML models in EDs are essential to ensure that these technologies enhance patient care while upholding ethical principles.
Common limitations of ML models
Several recurring limitations were identified across the included studies, primarily high data dimensionality, data imbalances, and selection bias, all of which contribute to reduced generalizability. Additionally, most studies were conducted in high-income countries, particularly the United States and parts of Asia, highlighting a lack of representation from low- and middle-income countries, which may limit the global applicability of findings.
Proper data curation was identified as essential for reducing bias in ML models. 95 One study highlighted the necessity of large, well-curated datasets, meaning systematically cleaned, validated, and representative of real-world patient populations, to improve fairness and predictive accuracy. 95 Without proper curation, biases present in raw clinical data, such as differences in how certain conditions are diagnosed or documented across hospitals, can be reinforced by the model, leading to inaccurate or inequitable predictions. 114
Strengths and limitations
Our review has several strengths, including a comprehensive and detailed search strategy that imposed no restrictions on time or language. Furthermore, to minimize bias and enhance the reliability of our findings, we involved two independent reviewers at both the first and second levels of screening, as well as during the data extraction phase. However, our review is not without limitations. The heterogeneity among the studies regarding their methodologies, outcomes, and applications of ML precluded the possibility of performing a meta-analysis, thus limiting our capacity to provide a quantitative synthesis of the data. Moreover, with over 90% of the studies focusing on adult populations, the scope of our review is limited with respect to pediatric emergency settings, identifying a significant gap in the literature and highlighting an urgent need for further research in this area. Finally, while our review included studies implementing ML models in ED workflows, with a particular focus on clinical and operational impacts, we acknowledge that studies limited to model development without clinical or operational evaluation, or those restricted to disease-specific prediction tasks without evaluation in ED settings, were excluded. These represent important and evolving areas of ED machine learning research that warrant dedicated future systematic reviews. In addition, although no eligible studies employing large language models (LLMs) were identified during our search, this likely reflects the early stage of their adoption in clinical practice. As LLM-based applications become more prevalent in emergency care, future reviews should evaluate their implementation, impact on clinical workflows, and integration with existing decision-support systems.
Conclusion
ML models have been applied in EDs for predicting mortality, patient disposition, length of stay, treatment decisions, wait times, and costs, with gradient boosting and neural networks being the most commonly used. While some models demonstrated improvements over traditional methods, challenges in data quality, generalizability, and clinical integration remain key barriers to real-world implementation. Addressing these issues through larger, more diverse datasets, ongoing validation, and ethical oversight is critical to determining ML's clinical utility in emergency settings. Large language models offer new opportunities to enhance ED decision-making as they can process free-text inputs from health records and clinician notes, potentially improving context-aware predictions. This could enhance real-time adaptability in ED workflows, but their accuracy, interpretability, and impact on patient outcomes require further study. Future research should focus on evaluating their integration into clinical practice.
Supplemental Material
sj-docx-1-dhj-10.1177_20552076251411209 - Supplemental material for Implementation of machine learning in emergency departments: A systematic review
Supplemental material, sj-docx-1-dhj-10.1177_20552076251411209 for Implementation of machine learning in emergency departments: A systematic review by Banafshe Hosseini, Atushi Patel, Megan Landes, Samuel Vaillancourt, Muhammad Mamdani, Kevin Maruthananth, Neha Matharu, Zuha Pathan, Krishihan Sivapragasam, Onlak Ruangsomboon, Becky Skidmore and Andrew D Pinto in DIGITAL HEALTH
sj-docx-2-dhj-10.1177_20552076251411209 - Supplemental material for Implementation of machine learning in emergency departments: A systematic review
sj-docx-3-dhj-10.1177_20552076251411209 - Supplemental material for Implementation of machine learning in emergency departments: A systematic review
sj-docx-4-dhj-10.1177_20552076251411209 - Supplemental material for Implementation of machine learning in emergency departments: A systematic review
sj-docx-5-dhj-10.1177_20552076251411209 - Supplemental material for Implementation of machine learning in emergency departments: A systematic review
Acknowledgments
The authors thank Lesley Anne Pablo, Disha Patel (DP), Navreet Singh (NS), and Ellah San Antonio (ESA) for assisting in conducting the systematic review. The authors also thank Kaitryn Campbell, MLIS, MSc, for the peer review of the MEDLINE search strategy. Moreover, this work was supported by the Ontario Ministry of Health and Ministry of Long-Term Care—Research Planning and Management Unit; Strategic Policy, Planning and French Language Services Division (Grant ID#: 693A). We were unable to update the corresponding PROSPERO registration because the record was created by a former staff member, and the associated login credentials are no longer available to our team. Consequently, the author list and title in the PROSPERO entry do not reflect the final version presented in this manuscript.
Ethical considerations
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Author contributions
Banafshe Hosseini (BH) and Andrew D. Pinto (ADP) conceived the study and secured funding. Atushi Patel (AP), Kevin Maruthananth (KM), Neha Matharu (NM), Zuha Pathan (ZP), and Krishihan Sivapragasam (KS) screened the studies and performed data extraction. AP drafted the initial manuscript. Becky Skidmore (BS) designed and executed the search strategy. BH and ADP supervised all stages of the review, from inception to data extraction and manuscript preparation. BH, Megan Landes (ML), Samuel Vaillancourt (SV), and Muhammad Mamdani (MM) revised the manuscript and prepared the final version. All authors contributed to critical revisions, approved the final manuscript for publication, and agreed to be accountable for all aspects of the work. BH is the guarantor.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Ontario Ministry of Health and Ministry of Long-Term Care—Research Planning and Management Unit; Strategic Policy, Planning and French Language Services Division (Grant ID#: 693A).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
The datasets used and/or analysed during the current study are available from the corresponding author upon reasonable request.
Supplemental material
Supplemental material for this article is available online.
References
