Introduction
Hussey and Hughes 1 and Girling and Hemming 2 derived a power formula for the standard stepped-wedge cluster-randomized design (see Figure 1) with two levels of clustering (i.e. subjects within clusters), where cross-sectional samples are taken at the lowest (subject) level, that is, different subjects are measured in every period. In this article, we derive and demonstrate power and sample size calculations for stepped-wedge cluster trials with more than two levels, in which the lowest level is cross-sectional. One such example is the CHANGE trial (ClinicalTrials.gov NCT02817282), which aims to improve nurses’ compliance with hand hygiene guidelines. This trial has four levels of clustering, with nurses (level 2) in wards (level 3) of several nursing homes (level 4). Nurses are followed in sessions in which different opportunities for hand hygiene arise, and observations (level 1) of compliance with the guideline are made.

Cluster-randomized parallel group design and different stepped-wedge-like designs with
If clusters consist of more than two levels, different scenarios are possible. For example, in the CHANGE trial that has four levels, the following scenarios are possible (Figure 2):

Scenarios in four-level stepped-wedge design (CHANGE trial setting): (a) only nursing homes
As the range of possible scenarios above illustrates, the highest level (referred to as a cluster in this article) is always measured repeatedly and the lowest level is always measured cross-sectionally. In between, there is a cut-off level: all levels below it are measured cross-sectionally, whereas that level and all levels above it are measured as cohorts.
Our method covers both “standard” stepped-wedge designs (i.e. designs where all clusters start in the control and end in the intervention condition) and non-standard designs (i.e. stepped-wedge designs with more, less, or no data collection before and/or after roll-out 3 and hybrid designs; 2 see Figure 1 with
Methods
In order to support the flow of arguments, technical derivations are provided in the Supplementary Files (SF) and notations given in Table 1. At time
Notations in this article illustrated in the CHANGE trial setting.
If an intermediate level is measured as cohort, the index
All clusters
Each level-2 unit (e.g. nurse) has the same number
Randomization is always on the highest level.
In terms of the cluster averages
where
In terms of the correlation
or in equivalent formulation by Girling and Hemming 2 (SF2)
where
Formulas for standard stepped-wedge trials with two, three, or four levels.
See Table 1 for the definition of the parameters and the section “Variance inflation due to the multilevel structure” for their explanation.
Taking
For two levels, equation (5b) reduces to the variance formula in the appendix of the article by Woertman et al. 4
For
Impact of design and multilevel structure
The design (i.e. the specification of intervention/control condition for each cluster at each time) influences
As illustrated for the CHANGE trial in the section “Introduction,” various scenarios can arise because, up to a certain level, all units of lower levels are measured cross-sectionally, and from that level upward, all levels have their units measured repeatedly as a cohort. Relevant formulas for each possible scenario with two, three, and four levels are provided in Table 2. The derivation of these formulas and their implementation in SAS® and Excel® programs are in the SF, which also contain the results for more than four levels.
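To make concrete how the design enters the variance of the treatment effect estimate, the following is a minimal sketch for the two-level, cross-sectional case only, using the closed-form variance as we recall it from Hussey and Hughes. It does not implement the multilevel formulas of Table 2. The function names, the argument names, and the symbols `U`, `W`, and `V` are illustrative assumptions and do not follow the notation of Table 1.

```python
def standard_stepped_wedge(n_sequences, clusters_per_sequence=1):
    """Build a 0/1 design matrix X[i][j] for cluster i in period j.

    Assumes a "standard" design: one baseline period in which every cluster
    is in control, then one sequence crosses over per period, giving
    n_sequences + 1 periods in total.
    """
    n_periods = n_sequences + 1
    design = []
    for s in range(n_sequences):
        row = [1 if j > s else 0 for j in range(n_periods)]
        design.extend([row[:] for _ in range(clusters_per_sequence)])
    return design


def hussey_hughes_variance(design, subjects_per_cluster_period, sigma2_e, tau2):
    """Variance of the treatment effect estimate, two-level cross-sectional case.

    sigma2_e: subject-level (residual) variance (assumed known)
    tau2:     variance of the time-invariant random cluster effect
    """
    n_clusters = len(design)
    n_periods = len(design[0])
    # Variance of a cluster-period mean around its cluster-specific expectation.
    sigma2 = sigma2_e / subjects_per_cluster_period
    # Design summaries: total exposure, squared column sums, squared row sums.
    U = sum(sum(row) for row in design)
    W = sum(sum(row[j] for row in design) ** 2 for j in range(n_periods))
    V = sum(sum(row) ** 2 for row in design)
    numerator = n_clusters * sigma2 * (sigma2 + n_periods * tau2)
    denominator = (n_clusters * U - W) * sigma2 + (
        U ** 2 + n_clusters * n_periods * U - n_periods * W - n_clusters * V
    ) * tau2
    return numerator / denominator


# Hypothetical inputs, chosen only to exercise the functions.
design = standard_stepped_wedge(n_sequences=4, clusters_per_sequence=3)
variance = hussey_hughes_variance(design, subjects_per_cluster_period=20,
                                  sigma2_e=1.0, tau2=0.05)
```

The design matters only through the three summaries `U`, `W`, and `V`, so alternative roll-out schemes can be compared by passing a different 0/1 matrix to the same function.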
Variance inflation due to the multilevel structure
The factor
To clarify the meaning of this in the CHANGE trial setting, the intracluster correlation
Variance inflation factor for stepped-wedge designs
Using equation (3) or (4) and the research by Girling and Hemming 2 and Thompson et al., 3 we provide variance inflation factors for the
and thus, the variance inflation factor compared to a parallel group
where
From equation (8), we can see that the total variance inflation comes from two aspects of the design: the manner of assigning intervention over the measurement times and the multilevel structure at each measurement time.
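The second of these two aspects, the multilevel structure at a single measurement time, can be illustrated with a short sketch. It assumes a balanced, fully nested design with independent random effects at each level, for which the variance of a cluster mean is the cluster-level variance plus each lower-level variance divided by the number of units at that level per cluster. The function names and the ordering of the arguments are illustrative assumptions, and the resulting factor need not coincide with the parametrization used in equation (8).

```python
import math


def cluster_mean_variance(variances, sizes):
    """Variance of a cluster mean at one measurement time.

    variances: variance components ordered from level 1 (lowest, residual)
               up to the cluster level, e.g. [s2_1, s2_2, s2_3, s2_4]
    sizes:     sizes[k] is the number of level-(k+1) units per level-(k+2)
               unit, e.g. [n_1, n_2, n_3] for a four-level design
    """
    if len(sizes) != len(variances) - 1:
        raise ValueError("need one size per level below the cluster level")
    total = variances[-1]          # cluster-level variance is not averaged out
    units_per_cluster = 1
    # Walk down from the level just below the cluster to level 1.
    for level in range(len(variances) - 2, -1, -1):
        units_per_cluster *= sizes[level]
        total += variances[level] / units_per_cluster
    return total


def multilevel_design_effect(variances, sizes):
    """Inflation relative to the same number of independent level-1 units."""
    n_level1_units = math.prod(sizes)
    independent_variance = sum(variances) / n_level1_units
    return cluster_mean_variance(variances, sizes) / independent_variance


# Two-level check: reduces to the familiar 1 + (n - 1) * rho.
deff = multilevel_design_effect([0.9, 0.1], [10])
```

With two levels, a total variance of 1 and an intracluster correlation of 0.1, the function returns 1 + 9 × 0.1 = 1.9, which is the usual design effect for cluster randomization.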
Sample size and power calculation
As sample size formulas and programs for a parallel-group individually randomized design with one measurement (i.e. a post-test design) are readily available, sample size calculation for the stepped-wedge trial with
and dividing this by the number of level-1 units per cluster at each measurement yields the total required number of clusters
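The steps described above can be sketched as follows. This is a minimal illustration, assuming a continuous outcome, a two-sided test, equal allocation in the individually randomized reference design, and a total variance inflation factor that has already been obtained (it is passed in as `vif` rather than computed). All function and argument names are hypothetical.

```python
import math
from statistics import NormalDist


def individually_randomized_total_n(delta, sd, alpha=0.05, power=0.80):
    """Total sample size (both arms) for a parallel-group post-test design."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return 4 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2


def clusters_required(delta, sd, vif, level1_units_per_cluster_per_measurement,
                      alpha=0.05, power=0.80):
    """Inflate the individually randomized sample size, then convert to clusters."""
    n_individual = individually_randomized_total_n(delta, sd, alpha, power)
    n_inflated = n_individual * vif
    return math.ceil(n_inflated / level1_units_per_cluster_per_measurement)


# Hypothetical inputs: effect of 0.5 SD, variance inflation 2, 10 level-1
# units per cluster at each measurement.
n_clusters = clusters_required(delta=0.5, sd=1.0, vif=2.0,
                               level1_units_per_cluster_per_measurement=10)
```

In practice the result would usually be rounded up further to a multiple of the number of sequences, so that each sequence contains the same number of clusters.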
Instead of calculating the total sample size (or number of clusters needed), power for a range of feasible configurations (i.e. number of clusters, sample size at different levels, and intracluster correlations) could be calculated to see which configuration, if any, provides sufficient power. This can be done using the usual power formula
where
To calculate
in order to investigate the impact of various design parameters on the power. Figure 3 shows

Variance inflation factor for the standard stepped wedge as a function of the correlation
For a small number of clusters, the sample size and power formulas hold only approximately. For continuous, normally distributed outcomes, this is because of the low degrees of freedom, while for binary/rate outcomes, this is because formulas (2) and (4) depend on approximating the statistical distribution of cluster averages by a normal distribution using the central limit theorem. Therefore, we recommend the use of simulation studies to check power and also type I error for designs with a small number of clusters. However, the formulas in this article can be used to see whether feasible designs (i.e. in terms of number of clusters and/or number of measurements) would be worth such further investigation.
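For reference, the normal-approximation power calculation discussed in this section can be sketched in a few lines. The sketch assumes a two-sided test and ignores the negligible probability of rejecting in the wrong direction. The variance of the treatment effect estimate is taken as given, and the function and argument names are illustrative assumptions.

```python
import math
from statistics import NormalDist


def power_normal_approx(effect, variance_of_estimate, alpha=0.05):
    """Approximate power of a two-sided Wald test for a treatment effect."""
    z = NormalDist()
    standardized_effect = abs(effect) / math.sqrt(variance_of_estimate)
    return z.cdf(standardized_effect - z.inv_cdf(1 - alpha / 2))


def power_over_configurations(effect, variances_by_label, alpha=0.05):
    """Evaluate power for several candidate designs.

    variances_by_label maps a label for each configuration to the variance
    of the treatment effect estimate under that configuration.
    """
    return {label: power_normal_approx(effect, var, alpha)
            for label, var in variances_by_label.items()}


# Hypothetical variances for two candidate configurations.
powers = power_over_configurations(0.3, {"12 clusters": 0.012,
                                         "16 clusters": 0.009})
```

Because the approximation is least reliable when the number of clusters is small, results from such a calculation are best treated as a screening step before a simulation study.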
Binary and incidence outcomes
As the argumentation underlying the formulas relies on approximating the statistical distribution of cluster averages by the normal distribution using the central limit theorem, the formulas can be used for binary and incidence outcomes as well, provided the number of clusters is sufficiently large. We now discuss what value for
If we take a two-level design and a binary outcome as an example, we can model the trial hierarchically as follows. Each subject
For a rate (incidence) outcome, the count (or rate) outcome of subject
To illustrate sample size versus power calculations for different endpoints, and small- versus large-sample considerations, we present two examples in the setting of the CHANGE trial. These were not the final calculations for this trial but are similar to those performed.
Example 1: binary outcome in four-level standard stepped wedge
As a first example, we calculate power for hand hygiene compliance (a binary outcome) in a four-level standard stepped wedge using the following assumptions. The duration of the trial only allows four sequences
and
so that
and

Impact of cluster size and intracluster correlations at different levels in a “standard” stepped wedge. Power of the four-level “standard” stepped-wedge trial of Example 1 when varying either one sample size (parts a-c) or one intracluster correlation (parts d-f) at the specified level while keeping the other sample sizes and intracluster correlations constant. The vertical reference lines indicate the values of sample size and intracluster correlation used in Example 1.
Example 2: rate outcome in a three-level standard stepped wedge
As a second example, we use the variance inflation factor to calculate the sample size for infection incidence (a rate). These rates are measured on patients within wards in nursing homes; hence, this is a three-level design. We would expect the correlation of infection rates within wards to be high
With
and the total variance inflation is
Programs (SAS® and MS Excel®) to facilitate calculations are provided via https://github.com/steventeerenstra/multilevel-stepped-wedge and in the SF (SAS® program only).
Discussion
Power and sample size formulas for stepped-wedge designs are typically restricted to two or three levels.7,9 In this article, these formulas were extended to designs with more levels, and it was demonstrated that they can be expressed in terms of either variance components or intracluster correlations. The latter expression clearly shows the separate effects of the multilevel structure within time and the stepped-wedge structure over time, similar to what has been shown for other designs with only two levels.10,11
From the formulas, it can be seen that the different design parameters have the following impact on power and sample size:
Increasing either the sample size or the intracluster correlation coefficients can have unexpected effects on power because the random effects cancel out. Therefore, one may question how realistic it is to assume that the random effects (of a cluster) do not vary over time. This assumption implies that the correlation of two subjects within a cluster is the same whether they are measured at the same time
The variance components or intracluster correlation coefficients needed for the calculations should preferably be estimated from studies with similar outcomes and context. These studies should have the same number of levels, but do not need to be stepped wedge, prospective, or randomized. In the absence of such studies, content-matter specialists could provide plausible values, and they could do so either in terms of variance components or intracluster correlations. Given the uncertainties in these educated guesses, we recommend that a range of plausible values for each of these parameters be considered.
Supplemental Material
Supplemental material, 180914_WEBAPPENDIX_Sample_size_for_cRCT_stepped_wedge_trials_with_more_than_2_levels for Sample size calculation for stepped-wedge cluster-randomized trials with more than two levels of clustering by Steven Teerenstra, Monica Taljaard, Anja Haenen, Anita Huis, Femke Atsma, Laura Rodwell and Marlies Hulscher in Clinical Trials
Supplemental material, number_of_clusters_and_power_for_2_3_4_level_designs_with_continuous_binary_incidence_outcomes for Sample size calculation for stepped-wedge cluster-randomized trials with more than two levels of clustering by Steven Teerenstra, Monica Taljaard, Anja Haenen, Anita Huis, Femke Atsma, Laura Rodwell and Marlies Hulscher in Clinical Trials
