Introduction
Non-communicable diseases (NCDs) such as arthritis, cancer, cardiovascular disease, diabetes mellitus and osteoporosis represent the most common disorders worldwide. 1 As the proportion of elderly people in the population increases, the number of patients with osteoporosis and other NCDs in Europe is increasing rapidly. An estimated 32 million people in Europe have osteoporosis today, with an economic burden of 57 billion euros and substantial morbidity and mortality, 2 and it is predicted that 34 million people will have osteoporosis by 2025. 3 Annual direct medical costs in the USA associated with osteoporosis-related fractures are predicted to increase from 48.8 billion dollars (2018) to 81.5 billion dollars (2040), while the inclusion of productivity losses and caregiving expenditure is predicted to increase the 2040 cost by >13 billion dollars. 4 By 2050, 50% of the world’s osteoporotic fractures are predicted to occur in Asia. 5 If this trend continues, osteoporotic fractures will impose a considerable and growing cost on society.
Osteoporosis is recognized by physicians and patients alike as a growing public health problem. While effective treatment can reduce fracture risk by between 33% and 50%, 6 only a small proportion of patients are correctly diagnosed and treated, even among those who have already presented with osteoporotic fractures. 7 Indeed, the timely and accurate identification of high risk and/or high cost patients has the potential to expedite effective illness and care management and reduce overall healthcare cost whilst also contributing to healthcare decision-making and improving service planning and provision.2,8
The health sector generates large volumes of patient and practitioner data because of the need for compliance with regulatory requirements. Significant benefits for both the healthcare providers and patients can be realized through the effective use of healthcare big data. These benefits can include improvements in the quality and efficiency of healthcare planning and delivery, early disease detection (for ease of intervention and treatment), customized health management and the efficient detection of fraudulent healthcare behaviour. The McKinsey Global Institute contends that the effective use of healthcare big data generated in the USA has the potential to add more than $300 billion in value annually with 66% of this value being associated with the reduction of healthcare expenditure. 9 For example, managers can review the cost and profit of each medical service through predictive analysis patterns to identify any abnormal claims. 10 Despite the potential for artificial intelligence (AI)-based health care, 11 much of the healthcare data is under-utilized and lies idle in large digital repositories. 12 The healthcare sector has lagged behind other industries in its application of AI mainly due to the absence of an integrated digital infrastructure 13 and in part to the lack of bioinformatics skills amongst professors of life sciences 14 and the lack of adequate training provided to physicians. 15
AI applications have enormous potential to transform health care within the context of disease detection, diagnosis, treatment and personalized medical plans. 16 This is especially significant in an era of reduced funding, increased demands and increasingly complex medical conditions associated with an ageing population. 17 An AI approach has successfully supported doctors in using the most up-to-date information to diagnose breast cancers and treat their patients. 18 Machine learning (ML) is a subset of AI where the algorithms (based on underlying mathematical models) are trained on datasets and ‘automatically learn’ by identifying patterns in these training datasets. This training and learning process assists ML in making predictions without the need for explicit programming.19,20 The predictive models resulting from the training and learning process are then applied to the test datasets. ML algorithms have been applied to predict hip and vertebral fracture,21,22 cardiovascular risk 23 and Alzheimer’s disease. 24
Currently, the primary sources of data collection involved in osteoporosis and bone fracture care are dual-energy X-ray absorptiometry (DXA) machines, electronic medical records/electronic health records (EMRs/EHRs) and hospital administrative systems (HAS). DXA machines systematically collect standard data concerning patients including age, gender, ethnicity, smoking status, co-morbidities and treatment, and biometric measurements (including height, weight, bone mineral content, bone size and shape, fat mass, lean mass and bone geometry), as well as performing algorithmic calculations for fracture risk. In supporting the medical team, DXA machines contribute to osteoporosis diagnosis, bone fracture care and the prediction of fracture risk. Studies show that DXA data, although collected primarily for osteoporosis risk assessment, can be used as secondary data to enhance the prediction of NCDs, which account for the largest proportion of global mortality. 1 As an example, low bone mineral density (BMD), which is measured by DXA machines, is associated with a significant risk of cardiovascular disease, morbidity and mortality. 25 DXA biometric measurements are also linked to diseases including obesity, diabetes mellitus and cancer. 26
Despite the fact that osteoporosis healthcare costs (including drug intervention) are increasing, the comprehensive utilization of healthcare big data in practice remains limited. 13 The integration and use of data from multiple information systems, such as DXA machines and EHR systems, are still poorly understood and implemented. In this study, we aim to address this limitation by proposing a DXA healthcare informatics prediction (HIP) system.27,28 The DXA HIP system is designed for collecting, integrating and analysing healthcare data from multiple hospital systems and aims to improve fracture risk prediction and osteoporosis care and to reduce healthcare cost. The DXA HIP system was applied in three healthcare centres in Ireland to validate and verify its usability and effectiveness. Future challenges are also discussed in detail.
Conceptual design of the DXA HIP system
The DXA HIP system for osteoporosis care is presented in Figure 1. The system comprises five main modules (extraction, transformation, loading, modelling and application) and can be applied to the prediction of other NCDs, such as cardiovascular disease and diabetes.
Figure 1. The DXA HIP system for healthcare big data.
Extraction module
In osteoporosis care, various characteristics such as a patient’s physical attributes (e.g. sex, age, weight and race), lifestyle (e.g. smoking and alcohol use) and other co-morbidities (e.g. Type II diabetes, cardiovascular disease, cancer and chronic renal disease) are correlated with osteoporotic fractures.29,30 At present, the major sources of data collection involved in osteoporosis and bone fracture care include: HAS, EMRs/EHRs and DXA 1 machines. As already outlined, the proposed DXA HIP system was applied in three healthcare centres with data sourced from the following:
1. HAS: The HAS contains a database which catalogues claims-based coding and other administrative data. The strengths of the HAS database include its use of standard coding systems such as the International Statistical Classification of Diseases and Related Health Problems (ICD) and its availability of large claims data. ICD-10, which represents the 10th revision of ICD, is beneficial in the prediction and assessment of fracture risk, as demonstrated by Rubin et al. 31
2. EHRs/EMRs: An EHR/EMR is a digital version of patient and population health information 32 which contains patient demographics, diagnoses, medications, lifestyles and biometrics.
3. DXA machines: DXA machines are used in osteoporosis diagnostics 33 and they are an important data source for the detection and assessment of fracture risk. DXA databases systematically collect standard data about patient basic information, bone information and risk factors.
The process of extracting data from the appropriate source systems (e.g. HAS, EHR/EMR and DXA) and then bringing the data into the target database is commonly called ETL (extract, transform and load). Specifically, extraction refers to the process of extracting data from a data source. In the proposed extraction module, related data is extracted from the database of one or more healthcare centres into an interim DXA HIP data warehouse (Figure 2).
Figure 2. Extraction module: extracting data from multiple sources into the interim DXA HIP data warehouse.
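The extraction step can be illustrated with a minimal Python sketch. It assumes that each source system exposes its rows as dictionaries sharing a `patient_id` field; the field names, the sample values and the in-memory interim store are hypothetical and stand in for the real source databases and the interim data warehouse.

```python
from collections import defaultdict

# Hypothetical source extracts; field names and values are illustrative only.
HAS_ROWS = [{"patient_id": "P1", "icd10": "S72.0"}]
EHR_ROWS = [{"patient_id": "P1", "smoker": True},
            {"patient_id": "P2", "smoker": False}]
DXA_ROWS = [{"patient_id": "P1", "bmd_hip": 0.71},
            {"patient_id": "P2", "bmd_hip": 0.95}]


def extract(sources):
    """Copy rows from every source into an interim store keyed by patient."""
    interim = defaultdict(lambda: defaultdict(list))
    for source_name, rows in sources.items():
        for row in rows:
            record = dict(row)  # copy so the source rows stay untouched
            patient_id = record.pop("patient_id")
            interim[patient_id][source_name].append(record)
    return interim


interim = extract({"HAS": HAS_ROWS, "EHR": EHR_ROWS, "DXA": DXA_ROWS})
```

Keeping the source name alongside each record preserves provenance, so that later transformation rules can treat DXA, EHR/EMR and HAS fields differently.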
Transformation module
Although health related data are typically stored in DXA, EHR/EMR and HAS databases, only a part of this data is useful in osteoporosis and bone fracture risk assessment, diagnosis and care. Therefore, determining and selecting the relevant data are basic tasks of data transformation and integration. As the extracted data may not fully meet the requirements of the destination library (e.g. because of inconsistencies in data format, data entry errors and incomplete data), it is necessary to perform data conversion and processing. Data conversion and processing can be performed in the ETL engine or by using a relational database. In validating the DXA HIP system, data spanning a 20-year period was sourced from three Irish healthcare centres. Following a review of significant outlier scans, three kinds of erroneous scans were removed: scans from patients having a birth date earlier than 1 January 1900; phantom scans having a dummy name and no demographics, gender and/or clinical details; and duplicate scans, where a second scan was carried out within 30 days, typically by hospital staff for Least Significant Change calculations when calibrating equipment. It was also critical to delete and/or correct missing and erroneous data before data analysis, so that they would not distort the results. Consequently, scans relating to patients whose body mass index was greater than 60 were deleted after confirmation that height and weight information had been incorrectly entered in the system.
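The exclusion rules described above can be sketched in Python as follows. The three thresholds come from the text; the record layout, the set of dummy names and the treatment of a missing sex value as a phantom marker are assumptions made for illustration. The manual confirmation step that preceded deletion of scans with a body mass index above 60 is not modelled.

```python
from datetime import date

MIN_BIRTH_DATE = date(1900, 1, 1)   # from the text
DUPLICATE_WINDOW_DAYS = 30          # from the text
MAX_BMI = 60.0                      # from the text
DUMMY_NAMES = {"PHANTOM", "TEST"}   # hypothetical dummy names


def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2."""
    return weight_kg / (height_cm / 100.0) ** 2


def clean_scans(scans):
    """Return the scans that survive the exclusion rules."""
    kept = []
    last_kept = {}  # patient_id -> date of the most recent kept scan
    for scan in sorted(scans, key=lambda s: (s["patient_id"], s["scan_date"])):
        if scan["birth_date"] < MIN_BIRTH_DATE:
            continue  # implausible birth date
        if scan["name"].upper() in DUMMY_NAMES or scan["sex"] is None:
            continue  # phantom scan (assumed markers)
        if bmi(scan["weight_kg"], scan["height_cm"]) > MAX_BMI:
            continue  # height or weight presumed mis-entered
        previous = last_kept.get(scan["patient_id"])
        if previous is not None and \
                (scan["scan_date"] - previous).days <= DUPLICATE_WINDOW_DAYS:
            continue  # repeat scan within the duplicate window
        last_kept[scan["patient_id"]] = scan["scan_date"]
        kept.append(scan)
    return kept


def make_scan(scan_id, patient_id, scan_date, birth_date=date(1950, 5, 1),
              name="A. Patient", sex="F", weight_kg=70.0, height_cm=165.0):
    """Build an illustrative scan record."""
    return {"scan_id": scan_id, "patient_id": patient_id,
            "scan_date": scan_date, "birth_date": birth_date, "name": name,
            "sex": sex, "weight_kg": weight_kg, "height_cm": height_cm}


scans = [
    make_scan(1, "P1", date(2010, 1, 10)),
    make_scan(2, "P1", date(2010, 1, 25)),                      # duplicate
    make_scan(3, "P1", date(2012, 3, 1)),
    make_scan(4, "P2", date(2011, 6, 1), birth_date=date(1899, 12, 31)),
    make_scan(5, "P3", date(2011, 6, 1), weight_kg=700.0),      # BMI > 60
    make_scan(6, "P4", date(2011, 6, 1), name="Phantom"),
]
cleaned = clean_scans(scans)
```

In this sketch "within 30 days" is read as 30 days or fewer after the previous kept scan; a different reading would only change the comparison operator.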
Loading module
The final step in the ETL process is the loading of the converted and processed data into the destination DXA HIP data warehouse. Optimal data loading is dependent on the type of operation being performed and the quantity of data to be loaded. Data can be loaded into a relational database through structured query language (SQL) statements or through bulk loading. The SQL approach, in which SQL statements insert, update and delete data, is used in most cases because the transactions are logged and recoverable. Bulk loading methods, such as a bulk copy program or bulk application programming interfaces, are easy to use and more efficient when loading large amounts of data. The choice of data loading approach depends on the needs of the system.
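The recoverability of SQL-based loading can be illustrated with a short sketch using Python's built-in `sqlite3` module. SQLite, the table name `dxa_scan` and its columns are assumptions chosen so that the example is self-contained; the source does not state which database engine the DXA HIP warehouse uses.

```python
import sqlite3


def load_scans(conn, rows):
    """Insert a batch of rows inside one transaction and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dxa_scan ("
        "scan_id INTEGER PRIMARY KEY, patient_id TEXT NOT NULL, bmd_hip REAL)"
    )
    # Used as a context manager, the connection commits if the block succeeds
    # and rolls back if it raises, so a failed batch leaves no partial data.
    with conn:
        conn.executemany(
            "INSERT INTO dxa_scan (scan_id, patient_id, bmd_hip) VALUES (?, ?, ?)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM dxa_scan").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = load_scans(conn, [(1, "P1", 0.71), (2, "P2", 0.95)])
```

If a later batch contains a row that violates the primary key, the whole batch is rolled back and the table keeps its previous contents, which is the recoverable behaviour the SQL approach is chosen for.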
Description of DXA HIP warehouse data and the sources of data.
HAS: hospital administrative system; EHR: electronic health record; DXA: dual-energy X-ray absorptiometry; BMD: bone mineral density.
Modelling module
Summary of various predictor tools for medical illnesses, and some of the common shared variables included in them.
aOther: other risk factors, for example, rheumatoid arthritis and falls.
bThese tools can also be used to predict low bone density.
BMI: body mass index; BMD: bone mineral density; ABONE: Age Body Size No Estrogen; BWC: Body Weight Criterion; EPESE: Established Populations for Epidemiologic Studies of the Elderly; SOF: Study of Osteoporotic Fractures; FRAMO: Fracture and Mortality; FRAX: WHO Fracture Risk Assessment Tool; FRISC: Fracture and Immobilization Score; Garvan: Garvan Fracture Risk Calculator; OC: Osteoporosis Canada; ORAI: Osteoporosis Risk Assessment Instrument; OST: Osteoporosis Self-Assessment Tool; SCORE: Simple Calculated Risk Estimation Score; WHI: Women’s Health Initiative.

Modelling module: data analysis and predictive modelling.
AI and ML methods can be applied to risk assessment (e.g. osteoporotic fractures and other NCDs) and disease progression prediction (e.g. changes in BMD). Specifically, feature selection methods (e.g. VSURF, RFE and LASSO) can be used to eliminate irrelevant features and retain important ones. Machine learning methods (e.g. XGBoost and CatBoost) can then be used to build an osteoporosis assessment model that identifies high risk patients, and this model can be compared with traditional models (e.g. OST 34 and SCORE 35 ). Advanced deep learning methods (e.g. recurrent neural network or long short-term memory) can be applied to construct a prediction model for the change in BMD and compared with basic models (e.g. artificial neural networks and multiple regression analyses 36 ). Furthermore, we can combine deep learning methods and collaborative filtering algorithms to build risk assessment models for NCDs. Since the proportion of healthy people is higher than that of patients with fractures, and fracture incidence varies widely, we propose a multi-label imbalance classification algorithm to address this class imbalance. These new prediction models and algorithms will be verified using real world datasets, compared with existing traditional methods, and published in our follow-up articles.
While it is important to create and select predictive models, it is critical to evaluate their performance. K-fold cross-validation and external data validation are effective ways to test the performance and reliability of new models/algorithms. The best-performing model will be applied to the population in the West of Ireland and other regions to test its effectiveness. Different evaluation metrics suit different types of models, and the choice of metric depends on the type and purpose of the model. Common evaluation metrics in the healthcare industry include the confusion matrix, the receiver operating characteristic (ROC) curve, the area under the curve (AUC), the F1 score and the root mean squared error (RMSE).
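As an illustration of the kind of traditional baseline against which the ML models are compared, the sketch below computes the OST index. It assumes the commonly published form of the index, 0.2 × (weight in kg - age in years) truncated to an integer; the cut-off passed to `is_high_risk` is a hypothetical parameter, since appropriate thresholds differ between populations and are not given in this section.

```python
def ost_index(weight_kg, age_years):
    """OST index: 0.2 * (weight - age), truncated toward zero."""
    return int(0.2 * (weight_kg - age_years))


def is_high_risk(weight_kg, age_years, cutoff):
    """Flag a patient whose OST index falls below a chosen cut-off.

    The cut-off is an assumption supplied by the caller, not a value
    taken from the DXA HIP study.
    """
    return ost_index(weight_kg, age_years) < cutoff


older_lighter = ost_index(55, 75)    # 0.2 * (55 - 75) = -4
younger_heavier = ost_index(70, 50)  # 0.2 * (70 - 50) = 4
```

Because the index needs only weight and age, it gives a cheap reference point: an ML model is worth its extra complexity only if it discriminates better than such a rule on the same patients.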
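The evaluation metrics named above can be written out in a few lines of plain Python. This is a minimal sketch for binary labels coded as 0/1; the function names are illustrative, the AUC is computed through its pairwise-ranking interpretation rather than by tracing the ROC curve, and the k-fold splitter assigns indices to folds in a simple interleaved fashion without shuffling.

```python
import math


def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_matrix(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def rmse(actual, predicted):
    """Root mean squared error for continuous targets such as BMD change."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
```

The classification metrics (confusion matrix, F1 and AUC) suit the risk assessment models, whereas RMSE suits the regression-style prediction of BMD change.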
Application module
The application module can facilitate more effective healthcare decision-making, efficient illness management and improvements to service planning and provision. The application module provides doctors with data mining, prediction and decision support for validating and improving prediction models for fragility fracture and other NCDs. Although aware of osteoporosis, most physicians typically neither correctly diagnose nor treat the disease, even for patients who have already had fractures. 7 Yet studies show that effective treatment can reduce fracture risk by 33–50%. 6 Therefore, a more accurate identification of high risk or high cost patients has the potential to facilitate effective care and reduce overall healthcare costs.
Discussion
The application prospects of the DXA HIP system on osteoporosis care
The DXA HIP system has great potential to enhance osteoporosis care, especially in the field of risk assessment and illness management. Furthermore, integrating mobile technologies into the DXA HIP system can provide users with personal-level medical services, such as follow-up and alert systems, 37 mobile healthcare emergency frameworks 38 and mobile monitoring frameworks. 39
Clinical detection of low BMD and fracture risk assessment
Predictive health models which identify high risk and/or high cost patients are currently the most common big data analytics tools used in healthcare. 40 The rate and accuracy in identifying at risk patients have the potential to facilitate more effective and efficient care. 41 In the DXA HIP system, the combination of fracture risk assessment and statistical tools with AI and ML will contribute to improvement in fracture risk assessment from two perspectives. Firstly, both the current diagnostic criteria for osteoporosis and low BMD for specific populations and the currently recommended fracture risk models (e.g. FRAX and QFracture) can be verified and validated. Secondly, both new and improved predictive models for fragility fracture and osteoporosis in specific regions and populations can be developed. It is through these new and improved predictive models that changes in BMD and treatment outcomes in patients can be monitored. Additionally, patients with osteoporosis and low BMD who should undergo DXA testing and/or treatment can be identified and treated. For example, in our previous research, we proposed seven ML technologies which improve clinical detection of low BMD 27 and used the OST model as a screening tool for osteoporosis in Irish men and women. 42 We also proposed and implemented a free online tool to assess the risk of osteoporosis, named DXA-HIP Osteoporosis Assessment Tool (https://dxa-hip.shinyapps.io/dxa-hip/). Users enter their gender, age, weight and height, and the tool calculates the probability, risk level and future trends of osteoporosis.
Illness management of osteoporosis and NCDs
The data which are used to predict osteoporosis can also be used to predict other NCDs, such as cardiovascular disease, heart disease, cancer and diabetes. For example, cardiovascular events can be predicted using measurements of regional and whole body fat, lean mass and other biometrics, all of which are provided by DXA machines.43,44 A study showed that the use of DXA biometric data significantly outperformed traditional risk factors and the Framingham Risk Score at discriminating the prevalence of cardiovascular disease for West of Ireland patients diagnosed with rheumatoid arthritis. 45 Although low BMD, linked to frailty,45,46 can be used to predict mortality,47,48 this information is under-utilized in practice. As outlined earlier, the largest current groups of NCDs include osteoporosis, type II diabetes, cardiovascular disease, cancer and others. The authors suggest that the integration of DXA data with hospital system data, underpinned by AI and statistical tools can be used to develop robust predictive health models. Algorithms trained on the combination of common disease predictor variables can be used to predict each disease using a single source. Therefore, it is possible to develop robust predictive tools for many of the largest groups of NCDs and to screen patients who require further diagnosis for other diseases.
Challenges of the DXA HIP project
Although the potential of healthcare big data is promising, delineating some of the main challenges facing the implementation of the DXA HIP system in osteoporotic practice is critical. These challenges focus primarily on data quality and consistency and patient privacy and data security.
Data quality and consistency
Healthcare data are characterized as large, complex, heterogeneous, incongruent and multi-source and are based on incomplete observations. 49 Data quality and consistency are significant criteria for the accuracy and performance of predictive models. However, Viceconti et al 50 contend that due to a culture which does not value data driven diagnosis, medical professionals tend to regard data logging negatively, viewing it as a bureaucratic need and time waster which diverts attention from patient care. Orfanidis et al 51 argue that the lack of standardization and different needs of users (physicians, patients, nurses and administrative staff) raise data quality issues. The divide between clinical research and clinical practice reveals that while the data which is collected as part of clinical studies is generally of good quality, clinical practice tends to generate low quality data. Some of the reasons for this discrepancy relate to the extreme pressure under which medical professionals operate and the absence of a ‘data value’ culture. As a result, missing data is a real and universal problem. Three popular methods can be used to handle missing data: a rational approach, listwise deletion and multiple imputation. A comprehensive approach will frequently include the following steps: (1) identify the missing data; (2) examine the cases of the missing data and (3) remove those cases containing missing data or replace the missing values with reasonable alternatives.52,53
An important part of the data process is detecting and dealing with data errors and outliers. When preparing data for analysis, it is important to avoid the inclusion of erroneous data. For example, patient data may be entered incorrectly. Whilst rules to detect obvious errors can be formulated, some errors, including those relating to medical domain knowledge, are more difficult to diagnose. Therefore, it is important to provide some initial analysis results as feedback to clinical/medical professionals so that they can determine whether or not errors were recorded in the original data. Utilizing the same data from several sources helps ‘validate’ such data with a checking and rechecking approach, but this requires the use of codes or identifiers to link data from different sources. Outliers are observations that are not predicted well by models; therefore, they also need to be handled. Anomalous data can be identified by physical discrimination or by statistical discrimination. Physical discrimination means that outliers are judged on the basis of business experience and common sense. For instance, some patients have abnormal BMD because metal hardware is embedded at the scanning site. Commonly used statistical methods for eliminating outliers include the t-test, the Pauta criterion, the Dixon criterion and the Grubbs criterion.
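Two of the options for step (3) can be sketched in Python. Listwise deletion is shown as described; for replacement, the sketch uses single mean imputation as a deliberately simpler stand-in for multiple imputation, which would instead generate several completed datasets and pool the results. The record layout and field names are hypothetical.

```python
from statistics import mean


def listwise_delete(rows, fields):
    """Drop every row that is missing a value in any of the given fields."""
    return [row for row in rows
            if all(row.get(field) is not None for field in fields)]


def mean_impute(rows, field):
    """Replace missing values in one field with the mean of observed values.

    Single mean imputation is a simplification used for illustration;
    it is not multiple imputation. Raises statistics.StatisticsError
    if no value of the field is observed.
    """
    observed = [row[field] for row in rows if row.get(field) is not None]
    fill = mean(observed)
    return [{**row, field: fill} if row.get(field) is None else dict(row)
            for row in rows]


rows = [
    {"id": 1, "weight_kg": 60.0},
    {"id": 2, "weight_kg": None},   # missing value
    {"id": 3, "weight_kg": 80.0},
]
complete_cases = listwise_delete(rows, ["weight_kg"])
imputed = mean_impute(rows, "weight_kg")
```

Listwise deletion keeps the remaining values untouched at the cost of sample size, whereas imputation keeps every case but introduces estimated values, so the choice depends on how much data is missing and why.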
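Of the statistical criteria listed above, the Pauta criterion (the three-sigma rule) is the simplest to state: a value is flagged when it lies more than three standard deviations from the mean. The sketch below is a minimal Python version; the sample BMD values are invented for illustration, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def pauta_outliers(values, k=3.0):
    """Return the indices of values more than k standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]


# Twenty plausible-looking BMD values plus one implausible reading
# (all values are made up for this example).
bmd_values = [0.85, 0.90, 0.95, 1.00, 0.88] * 4 + [3.5]
flagged = pauta_outliers(bmd_values)
```

The criterion only works with enough observations: with the sample standard deviation, no single value in a set of ten or fewer can ever exceed three standard deviations, so small batches need a different test such as the Dixon or Grubbs criterion.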
Privacy and security of patient data
With the interest in and potential for big health data analytics, data sources are becoming increasingly available. However, in using such data, it is critical that patient privacy and consent, data security, and legal issues related to electronic health information are considered.54,55 The successful implementation of big data analytics applications in health care can be impeded by legal and regulatory barriers due to concerns which can include, but may not be limited to, inappropriate access to or use of patient data, the unintentional release of private patient healthcare data, and/or the potential use of data to inappropriately ‘profile’ patients resulting in the provision of differential care and/or healthcare resources based on highest cost and/or highest risk patients. 41
The security and privacy of patient data have become a key issue, which makes it difficult for many healthcare services to reach their optimal level. As of 25 May 2018, the EU General Data Protection Regulation (GDPR) has general application to the processing of personal data in the EU. In compliance with GDPR, the data sourced from the separate hospital systems (HAS, EHRs/EMRs and DXA machines) will be merged in a manner that ‘personal data’ (i.e. any information relating to an identifiable person who can be directly or indirectly identified) are removed. More and more countries and regions will develop and adopt data protection policies in the future. Therefore, the processing of data must take into account data protection regulations. While we will collect data prospectively following informed consent, if many patients choose not to allow their anonymized data to be included, this will further limit the robustness and generalizability of the findings. Systems and processes to deal with these issues are established in some countries, but GDPR has raised some concern in relevant EU countries.
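Merging records from separate systems while removing personal data requires some way of linking a patient's rows without carrying the identifier itself. One possible approach, shown here purely as an assumption because the source does not describe the mechanism used, is to replace the identifier with a keyed hash and drop the direct identifiers. The field names, the list of direct identifiers and the key handling are hypothetical.

```python
import hashlib
import hmac

# Hypothetical set of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "address", "date_of_birth"}


def pseudonymize(record, secret_key, id_field="patient_id"):
    """Drop direct identifiers and replace the patient ID with a keyed hash."""
    token = hmac.new(secret_key, record[id_field].encode("utf-8"),
                     hashlib.sha256).hexdigest()
    cleaned = {key: value for key, value in record.items()
               if key not in DIRECT_IDENTIFIERS and key != id_field}
    cleaned["link_token"] = token
    return cleaned


key = b"demo-key"  # illustrative only; a real key would be stored securely
dxa_row = pseudonymize(
    {"patient_id": "P1", "name": "Jane Doe", "bmd_hip": 0.71}, key)
has_row = pseudonymize({"patient_id": "P1", "icd10": "S72.0"}, key)
other_row = pseudonymize({"patient_id": "P2", "bmd_hip": 0.95}, key)
```

Rows from different systems that belong to the same patient receive the same token and can therefore be joined, after which the token itself can be discarded from the merged dataset.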
Conclusions
Osteoporosis globally affects hundreds of millions of men, women and children, with many more at risk. Considering the potential of big data in the healthcare industry, there is an urgent need for research into how healthcare big data can be used to improve osteoporosis and bone fracture care. This study discusses the DXA HIP system as an intelligent system for analysing healthcare big data to improve osteoporosis and bone fracture care. Comprising extraction, transformation, loading, modelling and application modules, the DXA HIP system was applied in three Irish hospitals to validate and verify its usability and effectiveness. The application prospects of, and challenges to, a DXA HIP system for osteoporotic fracture risk prediction and care are also examined, including fracture risk assessment, illness management of NCDs, data quality and privacy.
In conclusion, NCDs represent an enormous and increasing global health problem, and better methods are needed to assess and manage these illnesses to reduce morbidity, mortality and healthcare costs. By bringing together data and outcomes, it is possible to develop robust predictive tools for many of the largest groups of NCDs, with multiple algorithms predicting each disease from a single source. Larger, more complex algorithms can be made more robust using modern methods including ML and big data techniques. Big data enables us to use a whole data approach, rather than a ‘one size fits all’ traditional approach, to compare ‘like with like’, thereby personalizing individual risk as well as improving our understanding of group risk and population trends and risks. The use of big data analytics needs careful governance and robust analysis in order to realize its potential so that we can move more rapidly to reduce the burden of preventable illness and its associated morbidity, mortality and cost. Furthermore, the application of big data analytics to healthcare data has the potential to improve the quality of patient care and substantially reduce the burden and cost across all NCDs.
