Abstract
Keywords
Introduction
The expansion in use of electronic health records (EHRs) provides an unparalleled opportunity for the use of routinely collected patient data to drive research and deliver an epidemiological understanding of the basis of disease. Whilst some challenges exist around legal, ethical and technological issues, 1 there is an increasing number of studies using EHRs to deliver meaningful research outputs. 2 These studies are somewhat sporadic, in diverse specialist areas including chronic obstructive pulmonary disease 3 and heart failure research.4,5 Additionally, there are published opinion pieces highlighting the potential importance of the use of EHRs in research. 6
An EHR can be defined as ‘a repository of patient data in digital form, stored and exchanged securely, and accessible by multiple authorised users. It contains retrospective, concurrent, and prospective information and its primary purpose is to support continuing, efficient and quality integrated health care’. 7 Whilst there is a myriad of different definitions applied to EHRs, 8 this represents a widely accepted and comprehensive definition. 9
Critical to this paper is that an EHR’s
The first EHR was created by Loughead Aircraft Manufacturing Company (better known as Lockheed Martin Aerospace) in the 1960s, alongside a small number of other pioneer academic groups. Wider adoption of EHRs only began with commercialisation of systems in the 1990s. 10 Use in the UK was somewhat delayed, potentially owing to the absence of commercial market forces in the National Health Service (NHS), yet the UK now represents the largest EHR market in Europe. 11
The wider uptake of high-quality EHRs has been fuelled by government and trans-national initiatives, including the $19 billion HITECH Act in the USA 12 and €2 billion Innovative Medicines Initiative in the European Union. 13 There is, however, significant geographic variation in the adoption and approach to EHRs, with some countries (notably Denmark and Sweden) utilising national EHRs and other countries (the USA and UK) adopting organisation-specific EHRs. 14
Diabetes represents a truly data-rich pathology with a wealth of routinely collected data, including but not limited to average blood glucose, foot health, eye health, cardiovascular health and renal health. Diabetes therefore represents a pathology particularly able to benefit from the use of EHRs in research. Diabetes further represents a critical international challenge with a global prevalence that has doubled since 1980 from 4.5 to 7.8% of the adult population. 15 Developing and exploiting new data and information-driven research methods is therefore essential to tackling this emerging global burden. At the time of writing this paper, no review has been performed to identify what has been achieved through EHR-based diabetes research, how that progress has been achieved or an appraisal of such studies.
This work represents the first review of the direct use of EHRs as a research tool in diabetes. It considers current applications, barriers and future strategies. It also provides a meaningful benefit by developing an understanding of where gaps currently lie in the research literature, and how we can approach and overcome challenges to EHR research.
Methods
Prospective registration
This review was prospectively registered with the PROSPERO database (registration number: CRD42016038550). PROSPERO is an ‘international database of prospectively registered systematic reviews in health and social care, welfare, public health, education, crime, justice, and international development, where there is a health-related outcome’. 16 The system is supported by the University of York and aims to both reduce duplication of research and avoid bias. All registered trials undergo a review process prior to acceptance.
Search strategy
This review is underpinned by a comprehensive search strategy including MEDLINE, Embase and Engineering Village databases. Whilst MEDLINE represents an important source of medicine, nursing and pharmacy literature, it is also important to include Embase in such a study, based on its coverage of pharmacological articles. The OVID search platform allows both Embase and MEDLINE to be searched simultaneously. In addition, informatics and computer systems engineering approaches have been particularly relevant to EHR research, and a truly comprehensive search strategy therefore needs to cover the relevant engineering literature. The Engineering Village search platform comprises 12 separate databases including Ei Compendex, Inspec, GEOBASE, GeoRef, US Patents, NTIS, EnCompassLIT, EnCompassPAT, PaperChem, CBNB and Chimica. Combining both the OVID and Engineering Village search platforms adequately and comprehensively covers the relevant medical and engineering literature.
The search included only English language papers published between 2006 and June 2018. This corresponds to the period for which meaningful, high-quality EHRs have existed within clinical systems. Whilst a five-year cut-off could be argued for, there is a risk this would unintentionally exclude the earliest secondary research uses of EHRs, which could be of significant value and interest. This study focused solely on original research articles, matching our aim to identify more specifically what original research has been performed using EHRs as a direct research data source.
There is an important need to consider the extent to which the grey literature should be incorporated within this study methodology. A natural concern with the application of the grey literature is the variability of study quality and absence of peer review. Given the risks of bias associated with inappropriately dealing with large data-sets contained within EHRs, 17 a consideration of the grey literature is not included here. There remains contention as to a formal definition of ‘grey literature’, with some authorities including and others excluding published conference abstracts. 18 Conference abstracts can be important to demonstrate early and developing research, as well as research that has not progressed to publication. Such information is therefore valuable to this review, and published conference abstracts indexed in the databases searched are included within the review.
Search terms
‘Electronic Health Record(s)’ OR ‘Electronic Patient Record(s)’ OR ‘Electronic Medical Records’ AND ‘diabetes’ were the key search terms employed. ‘Electronic Health Records’ is the relevant NIH MeSH term; however, the other search terms are included to ensure a comprehensive search. There was no attempt to distinguish Type 1 from Type 2 diabetes; not only can these be poorly recorded in EHRs 19 but also the data variables available in EHRs are usually applicable to both.
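The boolean logic of these search terms can be illustrated with a minimal sketch. It assumes the three EHR phrases are combined with OR and the result is combined with AND against ‘diabetes’. The regular expressions, the function name `matches_search` and the example titles are hypothetical; the actual searches were run on the OVID and Engineering Village platforms, not with this code.

```python
import re

# Hypothetical patterns mirroring the review's key terms (assumption:
# the three EHR phrases are OR-ed, then AND-ed with 'diabetes').
EHR_TERMS = re.compile(
    r"electronic (health|patient|medical) records?", re.IGNORECASE
)
DIABETES_TERM = re.compile(r"diabetes", re.IGNORECASE)


def matches_search(text: str) -> bool:
    """Return True when text contains an EHR term AND 'diabetes'."""
    return bool(EHR_TERMS.search(text)) and bool(DIABETES_TERM.search(text))


# Illustrative titles only; they are not drawn from the review.
titles = [
    "Electronic health records and type 2 diabetes outcomes",
    "Electronic medical record design for oncology clinics",
    "Diabetes prevalence from a postal survey",
]
selected = [t for t in titles if matches_search(t)]
print(selected)
```

Only the first title satisfies both halves of the query; the second lacks ‘diabetes’ and the third lacks any EHR term.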
Research studies that do not explicitly use EHR data, as a direct research data source, are excluded. In pilot work for this study, the following types of study were identified that would need excluding: studies where a separate research data registry is created, usually through manual inputting of data; 20 research where the EHR is used solely for patient recruitment/identification; 21 and research where a health record database is created solely for research purposes. 22 Whilst excluded from this study, these approaches are themselves interesting and could potentially form the future bases of additional research reviews.
Papers were selected by initially screening article title and article abstract. Those papers identified as relevant were subject to a second stage of screening through review of the whole article. Inclusion criteria (Table 1) represent English language articles, published in the last 10 years, applying an interrogation of EHR data to answer a specific medical research question. Papers specifically considering the design/formatting of EHRs and the use of EHRs for operational management, rather than clinical research purposes, were also excluded.
Table 1. Inclusion and exclusion criteria.
Data, extracted from included papers, were structured according to a pre-defined and piloted proforma that incorporated: year of publication, number of unique patient records extracted, country of publication, type of research question, primary/secondary care, single centre/regional/national data source, whether barriers are discussed and whether opportunities for further research are discussed. Under each of these headings further information has been extracted for the narrative of the clinical review. Definitions covering the type of research question are included in Table 2; studies can belong to multiple categories. Meta-analysis was not performed.
Table 2. Definitions relating to study research questions.
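One way to picture the extraction proforma is as a structured record with one field per heading. The sketch below is illustrative only: the class name `ExtractionRecord`, the field names and the example values are assumptions, not the instrument actually used in this review.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractionRecord:
    """One row of a hypothetical extraction proforma; field names are assumed."""
    year_of_publication: int
    unique_patient_records: Optional[int]  # None when not reported
    country: str
    # A study can belong to several research-question categories.
    question_types: List[str] = field(default_factory=list)
    care_setting: str = "unknown"   # e.g. "primary" or "secondary"
    data_source: str = "unknown"    # "single centre", "regional" or "national"
    barriers_discussed: bool = False
    opportunities_discussed: bool = False


# Illustrative values only, not taken from any included paper.
example = ExtractionRecord(
    year_of_publication=2014,
    unique_patient_records=1200,
    country="UK",
    question_types=["epidemiology", "complications"],
    care_setting="primary",
    data_source="regional",
    barriers_discussed=True,
)
print(example)
```

Holding the question types in a list reflects the point that studies can belong to multiple categories.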
Results
Initial search
The search strategy identified 703 research papers meeting the inclusion criteria from the MEDLINE and Embase searches, with 268 papers from Engineering Village. This resulted in a total initial search of 971 papers.
Paper selection
Individual review of articles by title, abstract and full paper resulted in exclusion of 589 articles (84%) from the OVID/Embase search; 114 articles were taken forward for further study and data extraction; 8 articles were unobtainable from the research literature and excluded from the study. The reasoning for papers being excluded is included in the flowchart in Figure 1.

Figure 1. Flow chart demonstrating assessment of articles for inclusion from OVID/Embase search.
Review of the articles extracted from the Engineering Village search resulted in exclusion of 230 papers (86%), for the reasoning demonstrated in Figure 2. There were 38 papers identified that met the aims/objectives of this study;23,24 however, 2 of these had been identified in the OVID/Embase search and were already included within the study. This resulted in a total collection of 150 papers selected for further analysis. It should be noted that, whilst the exclusion rate of papers from the Engineering Village search was high, there was substantial content of interest and relevance to clinical research despite this being a rarely used resource. The excluded articles, whilst not relevant for this study, focused on the design, operation and implementation of EHR systems, or of clinical systems in general and their overall impact on clinical care. We would strongly argue that greater exposure and awareness of this valuable resource is important for future clinical researchers in healthcare research.

Figure 2. Flow chart demonstrating assessment of articles for inclusion from Engineering Village search.
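The screening counts reported above can be reconciled with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the variable names are illustrative, and the 8 unobtainable articles are not modelled separately.

```python
# Counts as reported in the text.
ovid_identified = 703
ovid_excluded = 589
ev_identified = 268
ev_excluded = 230
ev_duplicates = 2  # already found through the OVID/Embase search

total_identified = ovid_identified + ev_identified
ovid_included = ovid_identified - ovid_excluded
ev_included = ev_identified - ev_excluded
total_included = ovid_included + ev_included - ev_duplicates

print(total_identified)            # 971
print(ovid_included, ev_included)  # 114 38
print(total_included)              # 150
print(round(100 * ovid_excluded / ovid_identified))  # 84
print(round(100 * ev_excluded / ev_identified))      # 86
```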
Data analysis
Publication trends over time
Over the 12-year period of study, 150 original research articles were identified. The distribution of publications over time is demonstrated in Figure 3. It is clear that there is a notable step-change in publication numbers occurring around 2012.

Figure 3. Article publication numbers by year.
Sample size
The largest sample size identified was 4.1 million patients, whilst the smallest was 30 patients. The mean average number of patients per study was 99,757, whilst the median average was 3352. It is important to note, therefore, that there is the potential for these averages to be distorted by outlying values, in particular, large value outliers. There is no clear temporal trend to median sample sizes over time as demonstrated in Figure 4 (all sample sizes included).

Figure 4. Median sample size by year.
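The gap between the mean and the median reported above is what would be expected when a few very large studies sit alongside many small ones. A minimal sketch, using hypothetical sample sizes rather than the review's data, shows the effect:

```python
from statistics import mean, median

# Hypothetical sample sizes, not the review's data: several modest
# studies plus one very large national dataset.
sample_sizes = [30, 450, 1200, 3400, 9800, 4_100_000]

print(mean(sample_sizes))    # pulled upwards by the single large study
print(median(sample_sizes))  # middle of the distribution, robust to it
```

In this made-up example the single large study lifts the mean to several hundred thousand whilst the median stays at 2300, which is why the median is the more informative summary for skewed sample sizes.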
Location of research
English language, original research articles, which met the inclusion criteria, were identified as originating from 17 different countries. One study represented an international study, utilising electronic health data from both the UK and Canada. 25 The largest number of studies originated from the USA (74 studies) with 39 studies from the UK. A full breakdown considering the number of studies per country is demonstrated in Figure 5.

Figure 5. Country of origin of research articles.
Type of publication
Seventy-four articles (49%) extracted from the bibliographic databases were conference proceedings or conference abstracts. Of the studies originating from the UK, 77% were published conference abstracts or proceedings, compared with only 40% of US studies. This is a finding of some significance and is discussed below.
Nature of articles
Articles were identified for each of the pre-specified study categories: epidemiology, prevention, susceptibility, diagnosis, prognosis, complications, medication treatment, medication side effect, non-pharmacy intervention, service delivery, insurance based. Many articles covered multiple categories. The most common study purposes were to investigate medication treatment (50 articles), epidemiology (34 articles) and diabetes complications (30 articles).
There was considerable variation in the sample sizes used for each of the study types. The median number of patients in studies considering medication treatment was 7454, compared to 1861 patients for diabetes complication studies and 12,673 for epidemiology-focused studies.
Discussion
Current extent of secondary use of EHRs in diabetes research
This study identifies a number of publications and research outputs that describe the secondary use of EHRs in diabetes research. The number of publications has increased over time, with a step-change in 2012, which we would argue coincides with the increased commercialisation and wider adoption of EHR systems following the US HITECH Act and EU Innovative Medicines Initiative. Since 2012, however, the number of publications has plateaued, perhaps in contrast with medical publication numbers in general, which continue to increase at a near exponential rate. It is clear, therefore, that there are barriers restricting the wider adoption and exploitation of EHR research methodologies, which must be addressed.
The UK’s adoption of EHRs as a research tool in diabetes in particular is embryonic, with the vast majority of publications being conference abstracts. The failure to convert these conference abstracts to full publications could suggest barriers exist to full publication, limitations to existing EHR datasets, or non-specialist researchers experimenting with EHR research.
Internationally, we would argue the potential of these research approaches is evident, with large sample sizes, across multiple centres, tackling a diverse range of research questions. There is the clear ability to adapt sample sizes to the research question under study with epidemiological studies frequently utilising the largest cohort numbers.
A particular challenge to the US studies that currently dominate the published literature is the insurance-based models and data restrictions that exist within such insurance-based healthcare systems and datasets. We might argue that some commercial US healthcare EHRs are designed around insurance and billing structures, 28 with patient care a subsequent (or secondary) addition, in effect making the extraction of data for research purposes a tertiary use.
Barriers to diabetes research
Approximately half of studies reported barriers or limitations as a result of using EHR data. Many studies reported multiple limitations; many conference abstracts, however, were brief and did not outline limitations. The most commonly reported limitation was that of missing data values. 29 Examples include failures to record whether glucose values were fasting or random 30 and limited information on diabetes-specific outcome measures such as foot amputation 31 or cause of death in the community following hypoglycaemia or diabetic ketoacidosis. 32
Limited information on medication compliance was frequently described as a barrier;33–35 this is particularly significant given the high proportion of studies focused specifically on medication treatment in diabetes. Problems with misclassification of diabetes, and difficulty distinguishing between type 1 and type 2 diabetes, were also described.36,37 Only two studies reported problems with data extraction, namely the extraction of unstructured data 26 and procedural variations in the documentation of information. 27 There were, however, concerns regarding a lack of longitudinal data in certain EHRs and fragmentation of patient data across diverse EHRs.30,31 These data fragmentation and longitudinal concerns were more prominent in US studies than in UK studies, which would be expected from the nature of NHS records; however, without a single national EHR there will remain problems, despite all patients having a single national identifier number (NHS number).
It is important to note the high proportion of extracted articles that were conference proceedings, rather than journal articles, and to consider this as a barrier in itself. This is despite a wide range of important topics and meaningful findings discussed within these articles. This could represent barriers such as a lack of funding available to develop these research projects into substantial pieces of work sufficient for peer-reviewed journal publication, or a lack of suitable journals accepting such articles for publication. Whatever the reason, a failure to translate such research into full papers represents a barrier to the wider adoption and usefulness of EHR research. It is interesting that this was a particular barrier in the UK, which suggests that we continue not to utilise fully, at a system level, the important information held within our EHRs.
Future opportunities for EHRs in diabetes research
There are clear opportunities that could overcome some of the challenges described above. Excitingly, the UK has now moved on from the failed NHS National Programme for IT, becoming the largest EHR market in Europe. 11 This offers the potential to overcome key barriers, most particularly that of generalisability. Many of the current EHR-based studies are limited by being only single-centre studies or based on small regions. The UK healthcare system has the significant advantage of every member of the population having a unique patient identifier, which enables larger-sized studies and helps avoid some of the barriers generated by missing patient values or misclassifications as patients move between providers or re-present.
Limitations of this study and further work
This study has a number of limitations. First, it represents a narrative review without formal independent two-author article identification, extraction and analysis. Whilst a meta-analysis would be inappropriate given the diversity of the study designs and methodologies, a more systematic two-author approach to article selection and data extraction could be argued to improve the quality of the study. Importantly though, this study did undergo prospective PROSPERO registration, with a pre-defined and piloted data collection tool. In the context of the first review of its kind, the study still has significant potential to add learning to the research literature and should be considered as an exploratory review in an underexplored area on which future research can build.
Restrictions also limited this paper to considering only English language journals. It is likely that EHR datasets internationally have been adopted for research purposes published in other languages, in particular from Asia, South America and Northern Europe. This review excluded the grey literature. It is evident from the articles extracted and references provided that a number of consultancy firms and charities have utilised EHR data and may not have published their work in academic journals. Finally, restrictions were placed on the definition of secondary use of EHRs, excluding registries and the use of EHRs for recruitment to clinical studies. Both these areas represent important research areas, and further study to understand the contribution of EHRs to these fields would be beneficial.
Further research is needed to look beyond simply diabetes and compare the approaches taken in other clinical specialities. The understandings developed for diabetes here might not be generalisable across disease processes. Indeed, in the increasing trend for medical research to occur within speciality ‘siloes’ there is the exciting potential for EHR-based research to cross and unite research teams.
Conclusions
There is clearly an established body of research that utilises EHRs as a data source for diabetes research. This research covers a broad range of research questions. The published studies often include large data sets but are limited by missing values (many specifically required for diabetes-related research) and challenges of generalisability. The small number of journal articles published using UK data suggests research of this nature is only in its infancy in the UK. The UK, however, represents an exciting and almost unique environment for such research, with national unique patient identifiers allowing for large multi-centre sample sizes, overcoming challenges of generalisability and maximising the clinical usefulness of results.
