Abstract
Study proposals, particularly for animal and clinical research, typically require justification for a proposed sample size based on statistical power calculations, which are often carried out automatically under defaults in web applets or statistical software. For researchers needing funding, the cost-benefit calculus of this effort is easy: following the usual procedure costs little and typically requires little, if any, further justification. In our view, however, the foundations of statistical power deserve less blind acceptance and more healthy interrogation by researchers and reviewers.
We see research design as an underemphasized part of the research process and support the expectation that researchers meaningfully justify sample size choices—particularly when there are ethical concerns, such as in animal research. When taken as more than default mathematical calculations, sample size investigations can motivate deeper evaluation of plans for study design, analysis and interpretation, and expose limitations early enough to promote improvement while taking advantage of the subject matter expertise and creativity of researchers. Before we discuss an alternative path, we visit some concepts we are implicitly trusting by relying on statistical power.
This short communication does not provide yet another tutorial of sample size calculations meant to return a clear-cut answer to the question of exactly how many participants are needed per group; instead, it is meant to spark more critical evaluation of measurement, design, analysis and interpretation in the research design phase, before resources (including animal lives) are used to carry out the study.
Before discussing power-related concepts, it is important to acknowledge that the underlying concepts are subtle and intricate, making it nearly impossible to simplify them sufficiently for a broad audience without losing technical correctness. This point actually underlies much of our criticism of power-based justifications because users, and even teachers, are often not choosing to rely on statistical power with adequate understanding or appreciation of the concepts involved. For example, in the following, we are referring to ‘test hypothesis’ rather than ‘null hypothesis,’ because the tested hypothesis should not automatically amount to a nil, zero or no-effect hypothesis.1
We can and should also consider non-zero values, and we often do this already, because our traditional 95% confidence intervals show all hypotheses (possible values for the true effect size) that would result in a p-value greater than 0.05.
The ‘coverage’ rate definition of a 95% confidence interval describes the long-run behavior of the interval-generating procedure: across hypothetical study replications, 95% of the intervals so constructed would contain the true value.
Confidence intervals can be created by inverting hypothesis tests: a 95% confidence interval includes all values for the test hypothesis (all possible values for the true effect size) that would result in a p-value greater than 0.05, that is, all values that would not be rejected at the 0.05 level.
Intervals matching classic confidence intervals can arise more generally as quantiles or percentiles summarizing the most common values of a distribution, without any need to reference a true value or define an error rate. This is the motivation for posterior intervals within Bayesian inference, which summarize the region of a posterior distribution with the largest posterior density (typically the middle of the distribution). In a non-Bayesian setting, intervals can be used to summarize randomization distributions or sampling distributions, again with no reliance on true or hypothesized values or error rates. A 95% interval, for example, excludes values above the 97.5th percentile and below the 2.5th percentile, that is, values that would be considered ‘rare’ according to the chosen criterion.
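To illustrate this ‘summarizing a distribution’ view, the sketch below (our own hypothetical example; the distribution, seed and values are illustrative, not from the text) computes 95% and 80% intervals directly as percentiles of simulated draws, with no reference to a true value or an error rate:

```python
import random
from statistics import quantiles

random.seed(1)
# Simulated sampling/posterior distribution of an effect estimate
# (illustrative: mean 8, standard deviation 3).
draws = [random.gauss(8, 3) for _ in range(100_000)]

# Cut the distribution into 2.5% slices; the 1st and 39th cut points
# are the 2.5th and 97.5th percentiles -> a 95% percentile interval.
pts = quantiles(draws, n=40)
interval_95 = (pts[0], pts[38])

# The 10th and 90th percentiles give an 80% interval.
pts10 = quantiles(draws, n=10)
interval_80 = (pts10[0], pts10[8])
```

Displaying both intervals as nested segments, as in Figure 1, reinforces that the 95% and 80% labels are arbitrary, user-chosen summaries of the same distribution rather than hard boundaries.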
We encourage this more general ‘summarizing a distribution’ interpretation that helps relax interpretation of the endpoints from hard-boundary thresholds to rather arbitrary and user-chosen summaries of a distribution of interest. Displaying intervals as a collection of segments representing different choices for percentiles (e.g., 95% and 80%) facilitates this view (Figure 1).
The goal is to have a more general interpretation of intervals beyond error and coverage rates that allows their (necessarily imperfect) use as a way to represent the values most compatible with the data and the model with all background assumptions in a way that also honors context-dependent knowledge.
Entering the hypothetical land of error rates
Opening up the baggage of statistical power starts with interrogating the concepts of type I and type II error rates. Under a hypothesis-testing statistical framework, errors are defined relative to a simple decision around whether to reject the test hypothesis—it is either rejected in error (‘reject when we should not’) or not rejected in error (‘fail to reject when we should’). The former is a type I error, the latter is a type II error, and power is the rate of rejection of an incorrect test hypothesis (‘rejecting when we should’), the non-error complement to a type II error. Power calculations are based on long-run rates of these errors over hypothetical study replications: the type I error rate (α, conventionally fixed at 0.05) and the type II error rate (β), with power defined as 1 − β.
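To make those ingredients concrete, here is a minimal sketch (our own illustration, not a procedure from the text) of a textbook power calculation for a two-group mean comparison under a normal (z) approximation; it shows that power is nothing more than a rejection rate computed under an assumed effect size, standard deviation and α:

```python
from statistics import NormalDist

def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test, assuming equal group
    sizes and a common, known standard deviation."""
    se = sd * (2 / n_per_group) ** 0.5            # SE of the mean difference
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # rejection threshold
    z_eff = effect / se                           # standardized true effect
    # Probability the test statistic lands in either rejection region,
    # given the assumed true effect.
    return NormalDist().cdf(z_eff - z_crit) + NormalDist().cdf(-z_eff - z_crit)
```

For example, an assumed true 10-unit effect with a standard deviation of 15 and 50 subjects per group gives power of roughly 0.92; change any of the assumed inputs and the answer changes with them, which is precisely the point.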
Error rates (as opposed to single errors) are therefore conceptually based on a hypothetical collection of many decisions, a proportion of which are errors. The collection of decisions is hypothetical because the decisions would arise from many study replications that in real life are not conducted; thus error rates are hypothetical—and these are the fundamental ingredients underlying power calculations used to justify real-life research decisions. While we appreciate the theoretical attractiveness and mathematical convenience error rates offer, we question handing them too much authority. They seem to bring an air of objectivity and comfort to an otherwise challenging and messy research process; but their roots inhabit the same soil as statistical hypothesis tests that have been criticized for decades, for example for their rigid focus on often poorly justified null hypotheses and decision rules.2–5
Problems arising back in reality
When we leave the hypothetical land of having a collection of data sets and associated decisions about rejecting a test hypothesis, we inevitably face issues: real-life error rates are unknown and difficult to fully conceptualize—we usually do not have data from multiple study replications, and we never know whether the decision about rejecting a hypothesis is in error for any individual study. In addition, theoretical error rates are only as trustworthy as the assumed model of the process that generated the real-life data—and this statistical model is inherently based on assumptions about reality that are usually violated in practice and uncertain by definition (otherwise they would not be called assumptions).
While similar cautions apply broadly for statistical methods, in power-related practices we often see blatant ignoring of the underlying model and its connection to theoretical error rates, leading to overconfident expectations about reality and questionable study design decisions. This is exemplified by misleading statements such as ‘we will be wrong 5% of the time’ if we reject a test hypothesis based on a p-value below 0.05.
In general, power calculations demand a great deal of trust in unknowns and misunderstood concepts, and yet it is common to treat resulting sample size numbers as if they provide concrete and objective answers to inform crucial research decisions, often ones with ethical implications. We hope this glimpse into the baggage associated with error rates, and thus power, will spur some healthy skepticism; but motivating change must also acknowledge the unfortunate reality that incentives from peers, funding bodies and animal welfare committees promote the comfortable status quo (whether explicit or only assumed by the researchers) instead of rewarding curiosity about limitations of current methodological norms. Pushback against dichotomous statistical hypothesis testing has gained traction within analysis,4,5 but influence on use of power calculations has been limited, despite reliance on the same criticized theory and practices.7 An over-focus on simple statistical power also inadvertently encourages ignoring more sophisticated design and analysis principles available to increase precision (and thus decrease, for example, the number of animals used), because calculating statistical power for more complicated designs and analyses is often not straightforward or not implemented in default statistical procedures.
An alternative definition of success tied to research context
It is possible to let go of much of the baggage of error rates by shifting away from defining a research ‘success’ in terms of avoiding theoretical type I and type II errors toward a context-dependent success that honors the inevitable gray area in interpretation instead of, for example, forcing use of single values for null and alternative hypotheses. A successful study should result in useful information about how compatible the data (and background assumptions) are with values large (or small) enough to be deemed practically important (e.g., clinically relevant) as compared with values too small (or large) to be considered practically important. We can do what we can in the design phase to make such a success possible, but of course there is no way to avoid results potentially consistent with the gray area between the two regions; this is not a problem—it simply highlights the challenges in interpretation that exist in real life but often are hidden behind use of default criteria and assumptions in common methods.
For example, suppose the effect of a new anti-hypertensive drug on average systolic blood pressure has to be a reduction of at least 10 units to be deemed clinically relevant, with a reduction of 6–10 units representing gray area (unclear clinical relevance), and a reduction of fewer than 6 units clearly not clinically meaningful (though clearly not ‘no effect’). Then, a sample-size-related goal might be to achieve a precision such that an interval is no wider than 4 units. That is, we aim to obtain an interval that cannot overlap both clinically relevant values (greater than 10) and not relevant values (under 6), which is only possible if an interval is narrower than 4 units (Figure 1). Note that such a successful research outcome can accompany very large or very small p-values.

Figure 1. Example of a ‘quantitative backdrop’ with hypothetical intervals that could arise after data collection and analysis. The number-line backdrop is context-dependent and honors a realistic gray area in which clinical relevance is unclear. The backdrop facilitates meaningful interpretation of potential study results and highlights the goal of designing a study to provide an interval of values (those reasonably compatible with the data, given the model5) lying entirely within a single clearly interpretable region.
As alluded to in the example, one strategy for a researcher to exert control over the width of a future interval (precision) is through choice of sample size; more information and technical guidance on choosing a sample size based on precision rather than power can be found elsewhere.8–12 Although precision-based approaches can be carried out in ways just as automatic and default as traditional power-based approaches, the focus on intervals invites use of context-dependent knowledge and expertise related to the treatment and proposed methods of measurement. In this spirit, we offer a larger framework for incorporating research context and interpretation of results into the design phase.
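For the blood-pressure example, a precision-based calculation might look like the following sketch (a known-variance z-approximation of our own; the standard deviation of 15 units is purely illustrative): solve the confidence interval width formula for the per-group sample size.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_for_width(target_width, sd, conf_level=0.95):
    """Per-group n so that a CI for a two-group mean difference
    (equal group sizes, common known sd, z approximation) has at
    most the target full width: width = 2 * z * sd * sqrt(2 / n)."""
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)
    return ceil(2 * (2 * z * sd / target_width) ** 2)
```

With sd = 15 and a target width of 4 units (narrow enough that an interval cannot straddle both the clearly relevant and clearly not relevant regions), this gives 433 per group; with sd = 10 it drops to 193, underscoring how heavily any such answer leans on the assumed variance.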
A picture can help clarify alternative definitions of success (Figure 1): intervals A and B clearly distinguish between regions of different practical implications, and both are considered study successes because all values in A are deemed too small to be clinically relevant, and nearly all values in B are large enough to be clinically relevant. Interval C, on the other hand, is not considered a success because it contains values in both regions, not supporting a conclusion in either direction. Narrower intervals (greater precision) help to avoid scenario C and facilitate successes (A and B). Even with a narrow interval, we can still land partially or fully in gray area (D); while potentially frustrating, such is the reality of doing research and D still provides valuable information to inform future studies or meta-analyses.
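The reading of intervals A–D can be written down as a simple rule against the backdrop. The function below is our own simplified sketch (thresholds taken from the blood-pressure example; for simplicity it treats any overlap with the gray area as gray, without Figure 1's ‘nearly all values’ nuance for interval B):

```python
def interpret(lo, hi, not_relevant_below=6.0, relevant_above=10.0):
    """Classify an interval (lo, hi) for an estimated effect
    (reduction in units) against a quantitative backdrop."""
    if hi <= not_relevant_below:
        return "success: clearly not clinically relevant (like A)"
    if lo >= relevant_above:
        return "success: clearly clinically relevant (like B)"
    if lo < not_relevant_below and hi > relevant_above:
        return "not a success: spans both regions (like C)"
    return "gray area, still informative (like D)"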
As Figure 1 conveys, this approach requires initial context-dependent work to draw the number line ‘backdrop’ delineating the regions. Assigning practical or clinical importance to values requires subject-matter expertise and familiarity with the measurement scale.
While the backdrop framework can help support and facilitate sample size investigations, it is broader than that and need not involve sample size calculations to be useful. For example, suppose researchers are planning to use the largest sample size possible given ethical, logistical or cost constraints and have done as much work as possible to decrease background variance through design and analysis decisions. The result of the planning exercise is an interval with an approximate width that can be compared with the quantitative backdrop to think about and articulate how intervals will be interpreted as their location moves relative to the backdrop. The exercise can help make decisions about whether the research is worth carrying out given the width of an interval that can possibly be achieved, and can serve as justification of interpretations after data collection and analysis if the obtained interval is roughly as wide as anticipated.
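When the sample size is fixed by constraints, the planning exercise runs in the other direction: compute the roughly achievable width and hold it against the backdrop. A sketch under an illustrative known-variance z-approximation (our own assumed numbers, not the text's):

```python
from statistics import NormalDist

def expected_ci_width(n_per_group, sd, conf_level=0.95):
    """Approximate full width of a CI for a two-group mean difference
    with equal group sizes and a common, known sd (z approximation)."""
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)
    return 2 * z * sd * (2 / n_per_group) ** 0.5
```

With sd = 15 and a ceiling of 100 animals per group, the expected width is about 8.3 units, wide enough to overlap both regions of the example backdrop, which is exactly the kind of realization worth having before any animals are used.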
Loosening our grip on interval endpoints
Our use of the term ‘interval’ thus far has been purposefully vague, as our definition of success does not depend on any particular method for obtaining intervals (e.g., confidence, credible, or posterior intervals), only that the researcher sufficiently trusts the interval and can justify its use to others (Box 1). We promote relaxing long-held views of what a statistical interval does, or should, represent and see interpreting confidence or credible intervals as compatibility intervals as a step in this direction.1,5,6,15,16 Compatibility encourages a shift from dichotomously phrased research questions (e.g., ‘is there a treatment effect?’) to the more meaningful ‘what values for a treatment effect are most compatible with the obtained data and the model with all background assumptions?’ (to which the answer would be the values included in the obtained interval).16
We can also relax the rigidity with which interval endpoints are interpreted. When drawing an interval, the line must have ends, but values beyond the endpoints do not suddenly switch from being compatible with the data and assumptions to incompatible. Values inside the interval are just considered more compatible than values outside it, with compatibility fading gradually across the endpoints rather than changing abruptly at them.
The reality is that to carry out a sample size calculation based on precision (via math or computer simulation), we must input a specific interval width. This may at first seem inconsistent with the recommendation to relax interpretations of intervals and rigidity of endpoints. However, there is no conflict if we also relax our belief that there is a single correct answer to the sample size question and instead use the exercise to motivate a nuanced investigation to help understand challenges inherent in carrying out the study. This can include many calculations to reflect different levels of precision and varying sensitivity to assumptions.
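Such an investigation might be organized as a small grid over assumed standard deviations and target widths (illustrative values of our choosing, under the same hypothetical z-approximation as before), making sensitivity to assumptions visible at a glance rather than hidden inside one ‘official’ number:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(width, sd, conf_level=0.95):
    # Per-group n for a two-group mean-difference CI of the given full
    # width, under a known-variance z approximation.
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)
    return ceil(2 * (2 * z * sd / width) ** 2)

# Grid of sample sizes across plausible variance and precision choices.
grid = {(sd, w): n_per_group(w, sd)
        for sd in (10, 15, 20)   # assumed standard deviations (units)
        for w in (3, 4, 5)}      # target interval widths (units)

for (sd, w), n in sorted(grid.items()):
    print(f"sd={sd:2d}, target width={w}: n per group = {n}")
```

No single cell of such a grid is ‘the’ answer; the spread across cells is itself the finding, quantifying how much the sample size question depends on what we are willing to assume.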
As mentioned previously, precision-based methods can easily be used to carry out a typical power calculation in disguise, rather than the more holistic approach we are promoting. Several practices can help avoid this: (1) avoid using confidence intervals to carry out hypothesis tests by simply checking whether they contain a hypothesized value (usually the null hypothesis of no effect); (2) embrace the gray area and interpret intervals against the full quantitative backdrop rather than a single cutoff; and (3) avoid letting previous effect estimates stand in for the values deemed practically relevant.
The last point deserves further attention. It is common to use previous effect estimates (such as pilot study results) as the ‘(practically meaningful) alternative value’ in traditional power calculations, although this is not necessary or recommended. The practice has negative implications for sample size justification,11 for example, because published effect estimates are often exaggerated.17 Such practice can lead to sample sizes that are smaller than needed (if the previous estimate is larger than the smallest values deemed practically relevant) or larger than needed (if the previous estimate is smaller than what is deemed practically relevant). There is no reason a previous estimate should automatically be judged practically relevant—it can fall anywhere relative to the backdrop and should not change the backdrop itself.
Creating a quantitative backdrop is not an exercise in guessing the actual effect, but an exercise in explicitly defining and sharing the context within which an estimated effect will be interpreted. This can be confusing because it is counter to what is often taught and expected by funding agencies. Relative to the previous example, a pilot study may have produced an estimated reduction of three units; considered against the backdrop, such an effect is not clinically relevant, so there is no reason to increase the sample size in an attempt to estimate an effect as small as three units with high precision. The decision of what values will be judged practically relevant should thus be made based on knowledge of the subject matter (e.g., medical) and of the measurement scale, not on previous estimates of an effect of interest. Defining relevant values can, and should, be carried out before any pilot study, facilitating the exercise of specifying how potential pilot study results will be used for further planning.
Taking back the power shouldn’t be easy
A common question when considering this framework is: what if researchers do not have enough knowledge of how the outcome variable’s measurement scale is connected to practical implications to create the quantitative backdrop? That is, what if they are not able to identify values that would be considered large, or small, relative to practical implications? If this is the case, then we argue researchers should honestly declare that, with the currently available knowledge, it is impossible to come up with a justifiable sample size. In such a situation, using default power calculations will essentially just move the research challenge into the analysis and interpretation phase, after already using valuable resources for the experiment—if practical implications of possible outcomes are unclear before the experiment, they are usually still unclear after results are in. Instead, an inability to identify practical implications of possible outcomes in the planning stage of a study would highlight the exploratory nature of the research and a need for better understanding of the outcome variable, which could be a valuable research goal by itself.
Engaging in a sample size investigation as we are recommending will not feel easy.
Sample size investigation presents an opportunity for researchers to give up simple math calculations in exchange for taking back some of the authority and creativity blindly given over to statistical power for decades. We have a responsibility as scientists to work to understand and interrogate our chosen scientific methodologies to the best of our ability to avoid being fooled by our own assumptions. Embracing this challenge in the design phase of a study can lead to higher quality research, and ultimately to more efficient research spending and respect for human and animal lives.
