Introduction
Monitoring, a major component of assuring the quality of clinical trials, has traditionally relied on frequent on-site monitoring visits,1 particularly to facilitate sometimes extensive source data verification (SDV).2 However, it is increasingly recognised that this model may be inefficient and unnecessary in many cases,3,4 with trialists questioning the value of 100% SDV.5–7 In recent years, regulators8–10 and trialists1,11 have proposed a risk-based approach to monitoring, whereby monitoring methods, including the frequency and nature of on-site visits, vary across trials depending on the risks specific to each one. The support of regulators is encouraging, indicating that risk-based methods might be implemented even in clinical trials of investigational medicinal products, that is, those historically subject to particular regulatory control and claimed to suffer under a ‘regulatory burden’.12,13
Risk-based monitoring methods can be applied at different stages of a trial. Pre-trial risk assessments can help define the overarching strategies appropriate to the trial’s risks. In some models,14,15 this is predominantly a one-off assessment during trial set-up. However, it is also possible to modify the monitoring strategy, or incorporate flexibility, based on emerging risks during the course of the trial.16
Risk-based monitoring is often associated with fewer on-site visits than ‘traditional’ monitoring.17
Although effective central monitoring methods alone could, in some respects, provide adequate trial monitoring in place of visits, on-site visits offer particular benefits over central monitoring. These include, for example, the ability to access site-held source data (such as patients’ medical notes, although some have suggested these might be accessed remotely instead),18–20 conduct in-person facility review21 or assess processes such as informed consent.22
On-site visits may also be necessary to investigate potential fraudulent activity. In a risk-based monitoring framework, visits to sites may not be routine, but can be based on assessed risk; we therefore need methods to assess site-level risk on an ongoing basis. We can interpret these methods as assessing the risk of not taking corrective action at a given site at a given time.
A recent systematic review has established the breadth of tools available to assess overall trial risk (and to use this assessment to define the monitoring strategy) in the set-up stage,27 but so far there has been no such exercise for methods to assess ongoing site-level risk once a trial has started. We conducted a scoping review28 to identify and characterise available methods.
Our aims were (a) for trialists, to establish whether any published methods were supported by sufficient evidence to justify implementation in routine practice and (b) for researchers in this area, to consolidate the existing evidence and point towards future developments in this growing field.
Methods
We conducted a scoping review to identify methods for using centrally held clinical trial data to assess site-level risk of deviations from Good Clinical Practice or the trial protocol, or of research misconduct, and thereby to target sites for further monitoring activity. We chose scoping review methodology as we anticipated finding a variety of results, and we wanted to characterise the extent, range and nature of research activity.29 There is no published protocol for this scoping review.
Eligibility criteria
We defined our eligibility criteria before beginning any searches, with minor refinements (mainly to the exclusion criteria) after search strategy piloting.
We included original reports (a) describing methods for using centrally held data (i.e. at the trial coordinating centre) to assess, in ongoing trials, site-level risk of protocol or Good Clinical Practice deviation, risk of data fabrication or research misconduct, or to target sites in some other way for corrective action based on assessed risk (regardless of whether the corrective action involved an on-site monitoring visit or not); (b) with methods described in enough detail that we considered them – subjectively – reproducible; (c) either published in peer-reviewed journals or available as grey literature; (d) about clinical trials, not limited to trials of Investigational Medicinal Products; and (e) in English.
We excluded reports (a) published before 1996 (the year of the first version of the International Conference on Harmonisation Good Clinical Practice Guidance, E6[R1]);30 (b) about quality assurance only in the context of intervention fidelity31 or ‘rater differences’32 for subjective trial outcome measures; (c) about ‘data monitoring’ in general, for example, data monitoring committees, or ‘monitoring’ in any sense other than the Good Clinical Practice sense, for example, clinical monitoring; (d) focusing only on trial recruitment; (e) about more efficient alternatives to standard on-site activity, for example, remote SDV; and (f) about site selection during trial set-up.
Information sources and search strategies
Database searches
We designed search strategies for the following databases: (a) PubMed, (b) Embase (Ovid), (c) Medline (Ovid), (d) Web of Science (Clarivate Analytics), (e) CINAHL, (f) Cochrane Central and (g) Scopus.
Full database searches took place on 23 October 2017 (run and extracted by W.J.C.). The search strategy for Medline is given in the Supplementary Information. We developed our search strategy following review of systematic reviews in this area1,33 to identify relevant search terms. The final strategy combined searches around two concepts: clinical trials (using terms based on those used in a previous systematic review of monitoring methods)33 and targeted or risk-based clinical trial monitoring. No database filters were applied.
Both reviewers (W.J.C. and C.H.) imported results into reference management software and used in-built tools to remove duplicate entries. Both reviewers carried out initial title and abstract screening, producing an initial shortlist of potential papers. We reviewed and discussed these, using full-text reports where possible, to agree on a final list of relevant reports. Throughout the process, S.P.S. acted as third reviewer where required.
To ensure that our results were current, we repeated this element of the search strategy on 28 August 2018. W.J.C. ran the searches and conducted the title and abstract screening. A shortlist of potentially relevant reports was shared with S.P.S. and C.H.; S.P.S. and W.J.C. agreed on a final list of additional relevant reports from this repeated search.
Conference abstracts
We hand-searched for relevant conference abstracts from the first four International Clinical Trials Methodology Conferences (held between 2011 and 2017) and all annual meetings of the Society for Clinical Trials since 1996 (initial searches completed on 8 December 2017). Keywords used for the conference abstract searches, based on the key database search strategy terms, were ‘monitor’, ‘supervision’, ‘oversight’, ‘risk’, ‘performance’, ‘metric’, ‘quality’, ‘fraud’, ‘fabrication’ and ‘error’.
Both W.J.C. and C.H. performed the abstract searches. This produced an initial shortlist of potentially relevant abstracts. A final list was agreed upon through discussion, with S.P.S. acting as third reviewer where required.
Internet searches
We conducted structured searches using the Google search engine (searches carried out during 15–19 December 2017).
Google searches were performed without limitations or use of quotes. Search terms were based on the main database search: ‘Risk based monitoring’, ‘Risk adapted monitoring’, ‘Central monitoring’, ‘Central statistical monitoring’, ‘Triggered monitoring’, ‘Targeted monitoring’, ‘Performance metric’, ‘Site metric’, ‘Key risk indicator’, ‘Site performance’, ‘Centre performance’, ‘Detect fraud’ and ‘Detect fabrication’. We reviewed the results on the first 20 pages, stopping earlier if three consecutive pages contained no relevant results.
W.J.C. and C.H. conducted the searches. Any potential additions to the included list of reports were discussed and agreed upon, with S.P.S. acting as third reviewer where required.
References, citations and author contact
To identify other relevant reports, we reviewed references (manually) and citations (using Web of Science) of all papers included or considered for inclusion in the final results, and of review articles relevant to the topic. Whenever required, we contacted report authors to help ascertain if given reports should be included, and to ask about the availability of full-text articles.
Data collection
We extracted data from full journal articles, where available. We recorded data into an Excel-based tool. W.J.C. carried out the final data collection used for this report, with S.P.S. double-checking all data for inclusion; consensus was reached on any areas of disagreement. Article authors were contacted (two attempts maximum) for missing descriptive data and further clarifications.
Our data collection template was designed and agreed prior to any data collection, with minor refinement after a first review of all relevant papers (a list of data collection variables is available as Supplementary Information). We collected descriptive data about each of the included reports, including any information on cost implications of the proposed methods.
When designing this study, although we predicted we would find a range of methods, we agreed that most of them would in essence address a classification problem, that is, methods to assign sites a status of ‘concerning’ or ‘not concerning’, with a ‘true’ deviation status – that is, confirmed existence of meaningful problems – that could be uncovered by further review. The ‘gold standard’ reference test required to assess true status might be study-specific, but could be on-site monitoring or, if the true status was created through simulation, prior knowledge.
We considered a key measure of the reported methods’ effectiveness to be a demonstrated ability, ideally in a real-life setting, not only to detect ‘true’ sites of concern, but also to show with confidence that sites apparently not of concern are performing well. We therefore aimed to summarise the available information on classification, that is, any or all of specificity, sensitivity, positive and negative predictive value. We gathered the best reported classification statistics for each method, or, if this was not reported, used available statistics to calculate these. These calculations were verified by an independent statistician at the Medical Research Council Clinical Trials Unit at University College London.
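For illustration, the four classification statistics can be computed from site-level counts as in the sketch below. This is a generic restatement of the standard definitions in code, not a calculation tool from any of the included reports.

```python
def classification_stats(tp, fp, tn, fn):
    """Classification statistics for site-level monitoring flags.

    tp: problem sites correctly flagged as concerning
    fp: sites incorrectly flagged as concerning
    tn: sites correctly flagged as not concerning
    fn: problem sites incorrectly flagged as not concerning
    """
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "ppv": tp / (tp + fp) if tp + fp else None,
        "npv": tn / (tn + fn) if tn + fn else None,
    }

# Worked example: 10 sites, 3 with true problems; the method flags 4 sites,
# of which 2 are true problem sites.
print(classification_stats(tp=2, fp=2, tn=5, fn=1))
# sensitivity 0.67, specificity 0.71, ppv 0.50, npv 0.83 (rounded)
```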
We did not formally assess the quality of the studies. However, review of the QUADAS-2 tool for quality assessment in diagnostic accuracy studies34 informed development of our data collection template.
Synthesis of results
The results are summarised descriptively rather than combined, as preliminary review of the relevant papers made clear that we would have a variety of study types.
Results
Figure 1 gives a PRISMA flow diagram35 showing the different stages of the review. From the various data sources, we ultimately included 30 reports in our final dataset. Twenty-one of these are peer-reviewed publications. The results are characterised in Table 1 and listed in full in Table 2. Figure 2 shows reports by year of publication.
Table 1. General characteristics of included studies (categories not mutually exclusive).
Table 2. Full listing of all included reports.
Figure 1. PRISMA flow diagram.
Figure 2. Publications by year and type.
Where information on trial intervention was available, methods had most often been used in Phase III trials of investigational medicinal products. The investigational medicinal product risk category,62 when known, was either ‘licensed and used within its licensed indication’ or ‘licensed and used outside its licensed indication’ (i.e. we found no reports involving trials of unlicensed investigational medicinal products).
We classified 20/30 of our results as central statistical monitoring methods, of which 7 focussed on detection of investigator fraud or research misconduct. We classified 9, including 1 of the 20 that used central statistical monitoring, as ‘triggered monitoring’, that is, review of each trial site against pre-set thresholds in key performance metrics, usually without any statistical testing. A final two did not fit into either of these categories; these involved using measured site metrics to directly compare sites against one another.45,53
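To make the distinction concrete, the sketch below illustrates the general central statistical monitoring idea – comparing each site’s data against the pooled data from all other sites to identify atypical centres. It is a hypothetical minimal example; the function name, the z-score approach and the threshold are our illustrative assumptions, not a method from any included report.

```python
import statistics

def atypical_sites(site_values, z_threshold=3.0):
    """Flag sites whose mean for a key data item lies more than
    z_threshold standard errors from the pooled mean of all other sites.

    site_values: dict mapping site ID to a list of observations
    (each site and the pooled remainder need at least two values).
    """
    flagged = []
    for site, values in site_values.items():
        others = [v for s, vs in site_values.items() if s != site for v in vs]
        pooled_mean = statistics.mean(others)
        pooled_sd = statistics.stdev(others)
        standard_error = pooled_sd / len(values) ** 0.5
        z = (statistics.mean(values) - pooled_mean) / standard_error
        if abs(z) > z_threshold:
            flagged.append((site, round(z, 2)))
    return flagged

# Illustrative data: site_C reports values well above the other sites.
data = {
    "site_A": [5.1, 4.9, 5.0, 5.2],
    "site_B": [5.0, 5.1, 4.8, 5.1],
    "site_C": [7.9, 8.1, 8.0, 7.8],
}
print(atypical_sites(data))  # flags site_C only
```

Consistent with the sample size caveats noted by several authors, such a comparison is only meaningful once each site has contributed enough data for the standard error to be stable.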
A total of 21/30 reports included some assessment of the effectiveness of the methods; these are summarised in Table 3. The most common experimental designs were to explore the methods’ characteristics using real trial data with no known integrity issues (n = 9), and simulating data integrity problems at sites within real trial datasets and then using the method to try to identify the problem sites (n = 6).
Table 3. Types of assessments and evidence presented by reports that included some assessment of their methods’ effectiveness.
Of the 21 reports, 9 had no information about sites’ ‘true’ status, that is, whether the problems identified through central monitoring constituted meaningful problems (either recorded through on-site monitoring or audit activity, or known because statuses were created through simulation). One report47 only contained case studies, that is, partial and selective reporting. Seven16,41,48,50,51,58,60 had partial information, for example, some of the sites’ true statuses were reported, but not all. Two explored classification ability through extensive simulation,42,43 and two had detailed information from a limited set of scenarios on the number of true and false positives and negatives.26,52
The best reported or deducible classification ability for the 11 papers with at least some information on sites’ ‘true’ status (excluding the case study paper) is shown in Table 4. Seven of these reports ascertained the ‘true’ status through on-site monitoring, audit or regulatory inspection, and in three the ‘true’ status was known because it had been simulated. The final report42 presented both real and simulated scenarios. ‘Best’ classification statistics were reported or deducible in 8 of these reports (of the remaining 3, 1 did not report enough data to allow any calculations, and 2 reported extensive simulations that precluded reporting of a ‘best’ result).
Table 4. Best reported information on methods’ classification ability, where available or deducible. Footnotes to the table define the statistics as follows; thick borders highlight results greater than or equal to 90%.
Sensitivity: number of correctly flagged problem sites/(number of correctly flagged problem sites + problem sites incorrectly flagged as not concerning).
Specificity: number of sites correctly flagged as not concerning/(number of sites correctly flagged as not concerning + sites incorrectly flagged as concerning).
Positive predictive value: number of correctly flagged problem sites/(number of correctly flagged problem sites + sites incorrectly flagged as concerning).
Negative predictive value: number of sites correctly flagged as not concerning/(number of sites correctly flagged as not concerning + problem sites incorrectly flagged as not concerning).
One ‘positive’ centre is described as ‘reveal[ing that] RBM was not assessing risk sufficiently to drive monitoring decisions’.
Publication incorrectly rounds this to 86%.
Approximately one-third of sites included from a trial; also some uncertainty about total number of sites (sometimes reported as 21, sometimes 22; used 22 for calculations given here as this is the figure in the ‘Results’ section).
Of the eight reports with some available statistics, 1/7 had sensitivity ≥90% in at least one scenario (statistic unavailable in one report), 4/7 had specificity ≥90% in at least one scenario (unavailable in one report), 1/6 had positive predictive value ≥90% in at least one scenario (unavailable in two reports) and 5/6 had negative predictive value ≥90% in at least one scenario (unavailable in two reports). Four reports contained at least one scenario where more than one of these statistics was ≥90%, and in one case all four statistics were over 80%.42 All four of these reports had limitations in terms of either lack of clarity around how the ‘true’ site status was ascertained,42 unclear outcome measure definition,48 or low or unavailable results for the other classification statistics.51,52 The four reports all described central statistical monitoring methods (as opposed to triggered monitoring), and had used a variety of statistical techniques, including both ‘supervised’ and ‘unsupervised’ analyses.63
Some papers reported on actual or theoretical cost savings from reduced on-site monitoring,36,41,44 and others commented on the risk of incurring costs if their proposed central monitoring method identifies sites that do not in fact have meaningful problems (i.e. false positives).26,58 However, no papers gave detailed costings for the development, implementation and maintenance of the central monitoring methods themselves.
Discussion
We conducted a scoping review to identify and characterise published methods for assessing the risk of not taking corrective action at trial sites at a given time. Although our search looked for reports from any time after 1995, over half of our results are from after 2013, highlighting the recent growth of risk-based monitoring concepts. Where information on host trials was available, they were almost always trials of investigational medicinal products, emphasising the interest in applying risk-based methods – and accessing the potential associated efficiency benefits – in this setting. Around a third of our results were not full, peer-reviewed reports, reflecting a wider problem with availability of evidence supporting trial conduct methods.64
Identified methods were mainly in two broad categories. Most were about central statistical monitoring, which uses statistical testing of all or a subset of trial data items to compare sites and identify atypical trial centres. A minority described triggered monitoring techniques, whereby sites are assessed against pre-specified site metric threshold rules (usually binary), with sites meeting the greatest number of ‘triggers’ being considered the most concerning. Several authors note that central statistical monitoring needs sufficient overall and per-site sample sizes for adequate statistical power24,26,58 (although some described methods were shown to detect problem sites during interim analysis or other early timepoints).26,50,58 Triggered monitoring, however, can be used at any stage of a trial’s recruitment (especially with trigger rules based on single instances of a given protocol violation, for example). We therefore suggest that the two techniques can, at least in theory, be used in combination.
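A minimal sketch of the triggered monitoring pattern described above, assuming hypothetical metrics and thresholds (none of these rules or values come from the included reports): each site is scored by the number of pre-specified binary trigger rules it meets, and the highest-scoring sites are prioritised for monitoring activity.

```python
# Hypothetical trigger rules: metric name -> threshold at or above which
# the trigger fires. The metrics and values are illustrative only.
TRIGGERS = {
    "protocol_violations": 1,        # any violation fires the trigger
    "overdue_crfs_pct": 20,          # >=20% of case report forms overdue
    "consent_errors": 1,
    "sae_reporting_delay_days": 7,   # serious adverse event reported late
}

def trigger_score(site_metrics):
    """Count how many pre-set binary trigger rules a site meets."""
    return sum(
        1 for metric, threshold in TRIGGERS.items()
        if site_metrics.get(metric, 0) >= threshold
    )

sites = {
    "site_A": {"protocol_violations": 0, "overdue_crfs_pct": 5},
    "site_B": {"protocol_violations": 2, "overdue_crfs_pct": 35,
               "sae_reporting_delay_days": 10},
}

# Rank sites, most concerning first.
for site in sorted(sites, key=lambda s: -trigger_score(sites[s])):
    print(site, trigger_score(sites[site]))  # site_B 3, site_A 0
```

Because a single protocol violation can fire a trigger, this kind of rule works from the first few participants onwards, which is why triggered monitoring does not share central statistical monitoring’s dependence on accumulated sample size.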
In line with our review’s aims, our focus in characterising our results was on looking for evidence supporting the use of each proposed monitoring method. It was therefore beyond our scope to compare and contrast the different central statistical monitoring methods proposed in these reports. Several previous papers have reviewed these methods in more detail.24,52,63,65,66
Nearly half of the central statistical monitoring reports had a stated focus on identification of fraud or data fabrication. The possibility of fraud is a serious concern to trialists and a threat to wider trust in science.67 It was possibly an important factor in establishing 100% on-site verification of trial data as a common monitoring approach.68,69 This may help explain the prevalence of reports about fraud detection, as some may see the priority in risk-based monitoring to be establishing its fraud detection ability compared with 100% SDV. However, although the incidence of data fraud is difficult to quantify, cases of extensive data fabrication appear rare enough to have individual notoriety.70 Furthermore, methods to detect fraud are necessarily rather selective, and therefore may not alone be suitable for trialists looking to detect more common, lower level data integrity issues such as poor equipment calibration or inadequately trained trial staff, which central statistical monitoring methods may also be well-suited to detect.
We collected data on how the proposed methods we identified had been evaluated. A number of reports only presented proposed, untested methods, or only selected case studies to demonstrate the methods’ performance. Of those that presented more detailed evaluation, a common limitation was that the ‘true’ status both of identified problem sites and of sites apparently not of concern was often not available, or only partially available. It was therefore difficult to know whether the ‘concern’ status of sites in central monitoring results represented meaningful problems or not. In addition, a number of studies used simulation to create ‘true’ sites of concern; these raise the additional question of whether the simulations reflect real-life issues, though the involvement of clinicians (i.e. those who would provide real-life trial data) in the simulation process of some reports26,51 is reassuring.
Of the few reports with available classification statistics, the best results were often in methods’ specificity or negative predictive value. The latter finding in particular could be encouraging for those with concerns that if risk-based monitoring means reduced or omitted monitoring activity, it might fail to detect serious errors. Some of the methods also showed good classification ability in more than one classification statistic. However, this was not without caveats of opaque reporting, other classification statistics being poor or unavailable, or the potential limitations of simulation mentioned above.
It is important to recognise the limitations of the available ‘gold standards’ in the classification of sites. When methods are tested using simulated or real-but-adjusted data, it may be difficult to know how accurately these recreate real-life situations. When central monitoring methods are tested in real, ongoing trials, on-site monitoring may be an imperfect reference test, in that it may not be able to identify all problems. By contrast, it is clear that central monitoring, with its enhanced inter- and intra-site review, can identify issues that a single monitoring team visiting one site for a limited time might not.66
It could be argued that at least some of the methods we have identified do not need extensive evaluation because they prove their own worth. For instance, they help identify outliers that in some cases are self-evidently meaningful problems to resolve. We acknowledge that some central monitoring activities identify ‘known’ problems (e.g. identifying weekend visit dates, which are unlikely to be correct) and are valuable for data cleaning purposes. However, we were specifically interested in the more nuanced use of these methods to identify sites of ‘concern’, at which monitoring activity may be targeted, and consequently sites ‘not of concern’, monitoring of which may be reduced or omitted. In light of the limitations we have described here, we do not believe any methods have yet demonstrated sufficiently reliable classification ability to justify more widespread adoption.
Aside from some comments on the potential cost of investigating false positive central monitoring results,26,58 the reports we identified contained limited information on the cost of developing and implementing their methods. As well as uncertainty about how to develop relevant methods, uncertainty or concern about costs involved is a substantial barrier to adoption of risk-based monitoring.71
Further work is needed to fully demonstrate the effectiveness of these dynamic site risk assessment methods which, alongside pre-trial risk assessments, form the core of risk-based monitoring. We therefore recommend the following:
Coordinate research efforts. From the scoping review and contact with report authors, it was clear that various small research projects relevant to this topic were ongoing, but mostly in isolation. Researchers in this area should take stock of existing research, and set clear priorities to ensure research time is well-spent.
Standardise monitoring studies. Core outcome sets72 or other mechanisms to standardise studies about monitoring would improve study quality and may facilitate cross-study evidence synthesis.
Share evidence. Time, commercial sensitivity and perceived reputational risk could all be barriers to publishing evidence about monitoring practices. However, additional, publicly available evidence to support the best monitoring practice will allow trialists in all settings to adopt new methods with confidence.
Publish full papers. Conference abstracts and posters can disseminate basic information about new ideas, but rarely have enough detail to allow replication or robustly demonstrate effectiveness. As this emerging field cannot be built on abstracts alone, we encourage researchers to publish full, peer-reviewed papers about their monitoring methods.
Combine complementary methods. Although work has been done on a number of distinct risk-based monitoring methods, an optimal monitoring plan might involve a combination of these, including both central statistical monitoring and triggered monitoring. A collaborative approach to combining existing methods could help develop and test such an idea.
We acknowledge several limitations. Our database searches identified relevant material from disparate locations, including abstracts from conferences in unrelated research fields. It is possible that other abstract collections include relevant material, but it was not feasible to find all of these. Although the Internet searches made little contribution to the final list of included reports, they may have been limited by known reproducibility problems.73
Scoping review methodology advises that relevant experts in a field are surveyed to help identify other relevant work.74 We have not formally done this. We have, however, contacted most authors of included reports for clarifications, and this has not highlighted any additional relevant reports.
Some search results were of borderline relevance to our aims, and required discussion before we ultimately included or excluded them. It is possible that other researchers repeating the same review might arrive at a slightly different list, but we believe this would only affect the ‘method-only’ papers, which are not critical to our conclusions. The comprehensive nature of our search strategy gives us confidence that our report is a sound overview of the state of the evidence in this research area.
We have not performed a formal quality assessment of the reports we found; however, this is considered by some to be unnecessary in scoping review methodology.29 There is also no validated way to review the quality of risk-based monitoring studies, although we used the QUADAS-2 tool to inform our data collection template.
Finally, we acknowledge that some time has passed since we first conducted our search for relevant evidence. Conscious of this, we repeated the main database search in 2018 (albeit with only one author conducting title and abstract screening) and added three relevant reports. We are not aware of any research published since then that might change our overall conclusions. If evidence is now available that addresses the limitations we have highlighted in the existing literature, we would certainly consider this a positive development.
Our scoping review highlighted some promising evidence for risk-based monitoring in ongoing trials. However, currently published methods may not yet have demonstrated their efficacy or cost-effectiveness well enough for trialists to implement them with confidence as a means to target or omit on-site visits. A more coordinated, collaborative and transparent approach to developing and sharing evidence in this field, including industry and academic partners, could help it grow beyond its current nascent state, and could contribute to risk-based monitoring more quickly entering routine practice.
Supplemental Material
sj-pdf-1-ctj-10.1177_1740774520976561 – Supplemental material for ‘Dynamic methods for ongoing assessment of site-level risk in risk-based monitoring of clinical trials: A scoping review’ by William J Cragg, Caroline Hurley, Victoria Yorke-Edwards and Sally P Stenning in Clinical Trials.
sj-pdf-2-ctj-10.1177_1740774520976561 – Supplemental material for the same article.