Abstract
The aim of psychological research often is to test whether there is an effect of a particular intervention. An experiment is then set up in which the intervention (psychotherapy, education on climate change, etc.) is applied to a random sample of subjects; this sample is called the experimental group. Next, it is assessed to what extent the subjects in this group show the expected behavior more than the subjects in the control group. This is usually assessed by comparing group means on some quantitative outcome measure. Typically, a significance test (also called a null hypothesis significance test [NHST]) is carried out to see if the observed difference is “statistically significant,” and if so, this is interpreted as an indication that the effect is not a chance finding.
However, what does one do if a result is not significant? It is well known that nonsignificance cannot be seen as an indication that there is no effect (e.g., Cohen, 1994). This is because in case of nonsignificance, it is very well possible that there actually is an effect but that the data set contains insufficient information to let one draw such a conclusion. Are researchers empty-handed then? No, several remedies have been suggested. One suggested remedy is to inspect the power of the significance test. This, however, is a rather tricky affair because the power depends on assuming an effect size in the population, and this is obviously not known (Why else would researchers do a study?). So researchers have to make fairly arbitrary choices here, and ultimately, they will end up with a fairly complex conditional statement, such as “If in reality the effect size were 0.4, then for the given sample size and significance level, the chance of obtaining a statistically significant effect would have been .75.” It is hard to reason with such complex probabilistic statements. It seems better to simply accept that although NHST is meant for comparing the null hypothesis, “There is no effect” (H0), to the alternative hypothesis, “There is an effect” (H1), it can only sensibly reject H0 and hence accept H1; it cannot accept H0. Or, phrased with a bit more nuance: With NHST, one can quantify evidence against H0 but not in favor of it.
Presently, the most common route of statistical analysis is to start with NHST and next consider to what extent the effect actually is practically or theoretically relevant by inspecting the effect size and the corresponding confidence interval (CI; see guidelines by Wilkinson & Task Force on Statistical Inference, 1999, who strongly recommended reporting these). An alternative, however, is to start from what is considered practically or theoretically relevant and then see to what extent the data indicate that practically or theoretically relevant effects have been observed. For instance, how likely is it that the average effect of the intervention is higher than 1 point on the well-being scale? Or how likely is it that the effect is comparable with that of another intervention and at least not more than 1 point lower? Or how likely is it that the effect is simply very small (say, between −1 and 1 on the well-being scale)? Such questions seem much more relevant than testing whether there is an effect (however big or small that might be). It does require that assessments of what can be considered relevant are made before the study. Although this will not be trivial, one can expect a researcher or an expert in the field to be able to make sensible choices here because such interpretations should be made anyway after carrying out an NHST (as very many researchers routinely do).
The approaches sketched just above are in the realm of equivalence testing and noninferiority and superiority testing and have a long history primarily in medicine (Schuirmann, 1987; Wellek, 2010). Only recently have these approaches been popularized in psychology (Lakens, 2017; Lakens et al., 2018; Linde et al., 2023). Smiley et al. (2023) offered a unified framework covering these kinds of alternatives to NHST. This framework covers a large variety of procedures involving tests based on regions of values rather than on the simple null value only, as in NHST. It describes a single procedure for carrying out tests for all such types of regions based on CIs, which thus makes all these methods very easy to apply on very many possible measures (difference of means, difference of proportions, correlations, regression weights, etc.): The only thing one needs is to be able to set up a CI for such measures.
Even though the approaches in the unified framework offer a big step forward toward useful interpretations, they are still based on dichotomization. Moreover, they give only an indirect indication of the likelihood that an effect is in the region of interest. The purpose of the present article is to offer procedures that go beyond simple dichotomization and that lead to concrete probabilities as expressions of the likelihood of values being in the regions of interest. For this purpose, we invoke the help of Bayesian posterior estimation.
In the present article, we first describe the framework by Smiley et al. (2023). Next, we describe three Bayesian procedures that can be used to more directly assess the probability that an effect size falls in a region, and we describe their relative advantages. Finally, we show how to actually apply such procedures in practice and discuss their feasibility and the practical difficulties associated with them.
Leading Example
As a leading example in this article, we consider a study about how often patients improve by taking a particular drug. To make it a bit more concrete, as to the patients, one could think of psychiatric patients. As to the drug taking, one could think of taking the prescribed dose for a period of 2 months, and improvement would be assessed by verifying whether the score on a general well-being scale assessed after this period has increased with respect to an assessment before the period (equal or lower scores would be considered “not improved”). We consider fictitious data for 100 subjects, of whom 60 were found to have improved. Our research question aims at determining what proportion of patients in a population would likely improve.
In a null hypothesis test, we could consider a two-sided test, H0: “proportion improved = proportion not improved” against H1: “proportion improved ≠ proportion not improved.” Thus, the null hypothesis assumes that by taking the drug, patients might just as well improve as not improve, and the drug is therefore considered to be of “null” value. Denoting “proportion improved” by p, this amounts to testing H0: p = .50 against H1: p ≠ .50.
Unified Conceptual Framework for Statistical Inference in Terms of “Null Regions” and “Regions of Interest”
Smiley et al.’s (2023) article offered a unified framework involving what they called a “null region” (H0), which is compared with an H1 reflecting a “region of interest.” They started out from seven types of tests involving the null regions and regions of interest described in Table 1, which is a compilation of the information in Smiley et al.’s Figures 1 and 2.
Table 1. Overview of Smiley et al.’s (2023) Framework
Note: δ indicates the population effect size (e.g., a difference of two means, a correlation, a difference of two proportions). The null region and the region of interest represent the two hypotheses compared in the test. The constants listed with each test specify the bounds of these regions.

Fig. 1. (a–e) Results of different tests, obtained by verifying whether the 95% confidence interval [49.7%, 69.7%], displayed as the green area, falls entirely within the region of interest (for which only the blue dashed limits are specified).

Fig. 2. Posterior distribution for the improvement proportion from our leading example. The prior distribution used here and a scaled version of the likelihood have been superimposed on this graph. The 95% highest density interval is displayed as the dark green bar on top of the graph.
The test approach consists of assessing whether the 95% CI falls entirely outside the null region and hence entirely in the region of interest. If it does, then according to Smiley et al. (2023), the test is significant; hence, H0 is rejected in favor of H1. If it does not fall entirely outside the null region, then there is no significance, so H0 is not rejected, and nothing more would be concluded, not even if the 95% CI actually falls entirely in the null region. Although this seems an unnecessary limitation of the framework, it actually is not, because in all cases, the roles of the null region and the region of interest can be switched. In fact, in Table 1, one can see that “Minimum-effects testing two-sided” and “Equivalence testing” contrast the same hypotheses in different roles. Likewise, “Strong form hypothesis test (δ too far from predicted value)” and “Strong form hypothesis test (δ close enough to predicted value)” contrast the same hypotheses in different roles. Therefore, here, we apply the simpler decision rule that if the 95% CI is entirely in either the null region or the region of interest, this counts as support for the hypothesis that the population value is in that particular region. Only if the 95% CI is neither entirely in the null region nor entirely in the region of interest should one refrain from drawing any conclusions.
For our leading example, we find the exact binomial 95% CI = [0.497, 0.697]. Now suppose we did an equivalence test in which we stated that percentages of improved patients between 48% and 52% are associated with a negligible drug effect for clinical purposes. Then, the region of interest would be [48%, 52%], and clearly, the 95% CI = [49.7%, 69.7%] does not fall entirely in the region of interest (see Fig. 1a). So, one would conclude that there is not enough support for practical equivalence. And there also is not enough support for nonequivalence (i.e., a two-sided minimum-effects test also fails to reject its null hypothesis). We simply cannot draw a conclusion of either kind. However, in the unrealistic case in which we would consider percentages of improved patients between 30% and 70% as a negligible drug effect for clinical purposes, the 95% CI would fall entirely in the associated region of practical equivalence, [30%, 70%] (see Fig. 1b), and hence, there would be support for practical equivalence.
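For readers who wish to verify these numbers, the computation takes only a few lines of base R. The following is a minimal sketch of the CI-based decision rule (our own illustration, not code from Smiley et al., 2023); binom.test gives the exact Clopper–Pearson interval:

```r
# Exact (Clopper-Pearson) 95% CI for 60 improved patients out of 100
ci <- binom.test(x = 60, n = 100, conf.level = 0.95)$conf.int
round(ci, 3)  # approximately [0.497, 0.697]

# Decision rule in the spirit of Smiley et al. (2023): support for the
# region of interest only if the CI falls entirely inside it
roi <- c(0.48, 0.52)                # equivalence region from the text
ci[1] >= roi[1] && ci[2] <= roi[2]  # FALSE: no support for practical equivalence
```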
Perhaps more interesting would be to conduct a noninferiority test. That is, if in practice one would consider 48% improvement good enough for all practical purposes, then the noninferiority test would boil down to checking whether the 95% CI falls entirely in the region of interest, [48%, 100%]. It does (see Fig. 1c); hence, there would be support for noninferiority. But if a one-sided minimum-effects test were carried out, with a minimum effect of 52%, then the region of interest would be [52%, 100%], and clearly the 95% CI does not fall entirely in that region (see Fig. 1d), so there is not enough support for this minimum-effects hypothesis. Finally, if a theory predicted 60% improvement and one considered percentages between 58% and 62% to be practically equivalent to 60%, then the region of interest of a strong form hypothesis test on whether the likely value is sufficiently close to the theoretically predicted value would be [58%, 62%], and clearly, the 95% CI does not fall entirely within it (see Fig. 1e), so there is not enough support for the strong hypothesis. Thus, each test in the Smiley et al. (2023) framework can be carried out using a single 95% CI and checking whether it falls entirely inside the region of interest.
Although clearly an improvement over null hypothesis testing, the Smiley et al. (2023) framework shares a problem with NHST: Users may want to make statements based on probabilities that the population effect size is in particular regions, and with the CI-based tests presented here, this is not possible. Moreover, as in the case shown in Figure 1d, even when the 95% CI falls almost entirely inside the region of interest, the conclusion simply is that there is not enough support for the hypothesis of a minimum effect of 52%. The conclusion of failing to support the region of interest would be exactly the same if the 95% CI were located much further to the left, that is, if the observed effect were a lot smaller.
Three Bayesian Approaches for Assessing How Likely the Effect Size Is in the Region of Interest
HDI + ROPE procedure
Kruschke (2011; for a more detailed explanation, see Kruschke, 2018) proposed a decision rule that uses Bayesian posterior distributions as the basis for accepting or rejecting null values of parameters. This decision rule focuses on the range of plausible values indicated by the highest density interval of the posterior distribution and the relation between this range and a region of practical equivalence (ROPE). (Kruschke, 2018, p. 270)
The method starts out with a Bayesian estimation procedure for the effect size of interest, for instance, the difference between two means, a regression weight, a correlation, or a proportion. Consider our leading example of studying the proportion of patients that experienced improvement after taking a particular drug. A Bayesian estimation procedure can then proceed as follows (e.g., see Kruschke, 2013): (a) Specify a priori the distribution for the proportion of improved patients in the population. (b) On the basis of the observed data, the likelihood function is set up. For each possible value of the parameter involved (i.e., the proportion of improvement in the population), this function specifies the probability of obtaining the observed data. (c) The posterior distribution for the proportion of improved patients in the population is computed by multiplying the likelihood function by the prior distribution and normalizing the ensuing product to a proper probability distribution.
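For a proportion, steps (a) through (c) can be carried out in closed form because a beta prior is conjugate to the binomial likelihood. The following is a minimal base-R sketch for the leading example, using the Beta(26, 26) prior that is discussed below:

```r
# (a) Prior: Beta(26, 26), peaked around .50
a0 <- 26; b0 <- 26
# (b) Data and likelihood: 60 improved out of n = 100 (binomial)
x <- 60; n <- 100
# (c) Conjugacy: posterior = prior x likelihood, normalized,
#     which here is Beta(a0 + x, b0 + n - x) = Beta(86, 66)
a1 <- a0 + x; b1 <- b0 + n - x
curve(dbeta(p, a1, b1), from = 0, to = 1, xname = "p")                       # posterior
curve(dbeta(p, a0, b0), from = 0, to = 1, xname = "p", add = TRUE, lty = 3)  # prior
```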
The precise steps for obtaining a posterior distribution depend on the measure that one is interested in and may involve more than one prior and posterior distribution (e.g., for the difference of two means), but the general gist is as above. For our purpose, the most important aspect is that the posterior probability distribution specifies for each possible value of the measure of interest what its probability density is. In other words, the graph of the posterior distribution visualizes the relative probability of all possible parameter values. But because there are infinitely many parameter values, one uses densities rather than concrete probabilities. For our leading example, this is illustrated in Figure 2, which shows that the likelihood function (red, dashed) is somewhat spread around the observed proportion of .60. Here, we used a fairly strongly peaked prior (based on the Beta[26, 26] distribution), which would be appropriate if we have fairly precise prior knowledge that the proportion should be around .50 and most likely not exceeding .30 or .70 (see the blue, dotted curve); we chose this particular example prior to show that with fairly peaked priors, the posterior and the likelihood may differ quite a lot. Had we used a “flatter” prior expressing little prior knowledge, the posterior distribution would have been extremely close to the (scaled) likelihood function.
Based on this graph, one can assess probabilities for particular ranges (or regions) of values. A popular way of summarizing such a graph is by means of the so-called 95% highest density interval (HDI), which contains the shortest range of values that jointly have a posterior probability of 95%. In other words, given the data and given the assumed prior distributions, 1 we know that the probability that the parameter value is in the 95% HDI range of values is exactly 95%. The 95% HDI for our leading example is [0.49, 0.64], or in percentages [49%, 64%], which is quite similar to the 95% CI seen earlier but not exactly equal. Here, we chose as prior a Beta(26, 26) distribution, which is based on quite a bit of assumed prior knowledge: It corresponds exactly to the knowledge gained from a previous study, started with a uniform prior, in which 25 successes were found in 50 trials. If, instead, we had chosen the flat prior Beta(1, 1), which gives the uniform distribution on [0, 1], the 95% HDI would be [0.50, 0.69], hence virtually equal to the 95% CI.
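The 95% HDI can be computed by searching for the shortest interval that contains 95% of the posterior mass. Packages such as HDInterval offer this directly; the following self-contained base-R sketch (our own) does the same for a beta posterior:

```r
# 95% HDI for a Beta(a, b) posterior: the shortest interval holding 95% mass
hdi_beta <- function(a, b, mass = 0.95) {
  width <- function(p_lo) qbeta(p_lo + mass, a, b) - qbeta(p_lo, a, b)
  opt <- optimize(width, interval = c(0, 1 - mass))  # minimize interval width
  c(qbeta(opt$minimum, a, b), qbeta(opt$minimum + mass, a, b))
}
round(hdi_beta(86, 66), 2)          # Beta(26, 26) prior: about [0.49, 0.64]
round(hdi_beta(1 + 60, 1 + 40), 2)  # flat Beta(1, 1) prior: about [0.50, 0.69]
```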
Now, the 95% HDI [49%, 64%] can be used for a goal comparable with that of the CI, but it is easier to interpret. So obviously, we can use the HDI in the same way as Smiley et al. (2023) used the 95% CI for their null regions and regions of interest. For our leading example, we reproduced Figure 1, now based on the 95% HDI (see Fig. 3). As can be seen, all conclusions are the same as when the 95% CI was used, although the kind of overlap changed strongly in some cases. For instance, in the equivalence-testing example in Figure 3b, the 95% CI was just within the boundaries of the region of interest, whereas the 95% HDI is more or less in the middle of the interval, relatively far from the boundaries.

Fig. 3. (a–e) Different tests verifying whether the 95% highest density interval [49%, 64%], displayed as the green area, falls entirely in the region of interest (for which only the blue dashed limits are specified).
Kruschke (2018), whose work preceded the Smiley et al. (2023) article, did not start out from null regions but from defining a so-called “region of practical equivalence,” which is similar to the region of interest defined above. He proceeded to set up a decision rule by simply verifying whether the 95% HDI falls entirely within the ROPE. Of course, a crucial step is to define the ROPE. Kruschke (2011) wrote, The ROPE indicates values of θ that we deem to be equivalent to the null value for practical purposes. . . . In real applications, the limits of the ROPE would be justified on the basis of negligible implications for small differences from the null value. (p. 302)
Kruschke (2018) offered an in-depth discussion of the choice and use of the ROPE and related this to the bounds used in equivalence testing and noninferiority testing. For another in-depth discussion about bounds in equivalence testing, see Lakens et al. (2018).
Although this approach gives a good indication of whether the range described by the region of interest consists of very probable values, it still does not exactly tell the probability that the effect-size value is in that range. Kruschke (2011) did mention that “moreover, the proportion of the posterior inside the ROPE indicates the total credibility of values that are practically equivalent to the null” (p. 302) but did not use this in the formal decision procedure he advocated. Moreover, the degree of overlap of the HDI with the region of interest is ignored when drawing conclusions; only whether the HDI falls entirely inside (or entirely outside) the region matters.
Bayes-factor approaches for testing interval null hypotheses
Morey and Rouder (2011) proposed a quite different Bayesian approach to testing hypotheses formulated as regions of values. Their approach is a variant of Bayesian null hypothesis testing, in which the null hypothesis is replaced by a small interval around 0. We therefore first discuss Bayesian null hypothesis testing.
Bayesian null hypothesis testing was introduced by Jeffreys (e.g., see Jeffreys, 1961) and basically amounts to computing the so-called Bayes factor for comparing model H0, specifying that there is no effect, against H1, specifying that there is a nonzero effect in the population, with the uncertainty about its true value captured by a particular probability distribution. The Bayes factor then is defined as the ratio of the marginal likelihoods for H0 versus H1. Considering the marginal likelihoods as indicators of the “support” that both models have in the observed data, the Bayes factor can be seen as a measure of relative support for H0 compared with H1. It thus treats H0 and H1 symmetrically, and unlike in NHST, conclusions can be drawn comparatively, which is more than only being able to reject H0 in favor of H1 (e.g., see Dienes, 2014; Wagenmakers, 2007). Although the Bayes factor itself does not compare posterior probabilities, it does function as the go-between for transforming prior probabilities into posterior probabilities. Specifically, the prior odds, giving the ratio of (to be assessed or defined) probabilities of models H0 and H1, multiplied by the Bayes factor offer the posterior odds of the posterior probabilities of models H0 and H1. According to Kass and Raftery (1995, p. 776), Good (1958, p. 803) was possibly the first to mention the term “Bayes factor,” and his writing clearly suggests that the term “factor” pertains to its role of changing prior odds into posterior odds, which is at the essence of Bayes’s theorem. Thus interpreted, the Bayes factor is meant as a means, not an end, in Bayesian analysis. The end goal of Bayes’s theorem is to compute a posterior probability, not the factor that changes a prior probability into a posterior probability. Nevertheless, the Bayes factor is often used as an end by itself and interpreted on its own as the degree of support for either hypothesis. The idea that it should be used for assessing posterior odds, for instance, by readers, who should specify their own prior model odds and then compute the posterior model odds by multiplying the prior model odds with the Bayes factor, is rarely followed.
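The odds-updating role of the Bayes factor is easy to make concrete. In the following sketch, both the Bayes factor and the prior probability of H0 are hypothetical numbers chosen purely for illustration:

```r
bf01 <- 3                       # hypothetical Bayes factor in favor of H0
prior_odds <- 0.20 / 0.80       # assumed prior: P(H0) = .20, P(H1) = .80
post_odds <- prior_odds * bf01  # posterior odds = prior odds x Bayes factor
post_odds / (1 + post_odds)     # posterior P(H0): about .43, despite BF01 = 3
```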
This may not be surprising because specifying prior model odds may actually be somewhat awkward. In particular, the prior odds of an exact null hypothesis against an alternative hypothesis can be deemed problematic because, as often claimed (e.g., see Cohen, 1994, p. 1000, who also quotes other sources; Meehl, 1978), the probability that an effect size is exactly 0 can be considered 0 for all practically relevant research questions. If one agrees with this, as a consequence, the prior odds of H0 against H1 are 0, and whatever the Bayes factor is, the posterior odds will be 0 as well. Even if one takes a less extreme point of view and allows some probability for the null effect to be exactly true but still far less than for the alternative, then still only very high Bayes factors would lead to large posterior probabilities that the effect size is exactly zero. In practice, however, fairly often, the Bayes factor is actually (mis)taken for the posterior odds (Tendeiro et al., 2024; Wong et al., 2022). This would be correct only if one takes the prior model odds equal to 1. In such cases, one can relate Bayesian null hypothesis testing to Bayesian estimation based on the so-called spike-and-slab prior (e.g., see Rouder et al., 2018; Tendeiro & Kiers, 2023), meaning a prior distribution that is gently curved over the full range but has a very high spike for the value 0, with a probability mass of 50%. In other words, in this way, zero effects are prioritized, as a kind of skeptical default prior. This may be a deliberate choice for some, but because it is a default in various packages, it may not be realized by many users.
For the above reasons, Morey and Rouder’s (2011) interval approach to hypothesis testing is much more realistic than strict null hypothesis testing. They described various options, but the most compelling one, which also is directly available in JASP (JASP Team, 2024), is the one leading to the nonoverlapping-hypotheses (NOH) Bayes factor. Essentially, it starts out from specifying an interval I of effect-size values (e.g., the values deemed practically equivalent to zero) and then proceeds in three steps:
1. Compute the posterior probability P(δ ∈ I | data) that the effect size lies in the interval I.
2. Compute the prior probability π = P(δ ∈ I) that the effect size lies in I according to the prior.
3. Compute the Bayes factor as the posterior odds, P(δ ∈ I | data)/P(δ ∉ I | data), divided by the prior odds, π/(1 − π).
As an aside, if the interval width is shrunk toward 0, in the limit, this will give the null-hypothesis Bayes factor (see Tendeiro & Kiers, 2023). However, as shown in Equation 1, the limit cannot be reached exactly because then π = P(δ ∈ I) = 0, so the prior odds, and hence the ratio defining the Bayes factor, are no longer defined.
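For the conjugate proportion example, the three steps above reduce to a few pbeta calls. A sketch (our own illustration of the NOH computation, using the region [52%, 100%] from the leading example):

```r
a0 <- 26; b0 <- 26            # prior Beta(26, 26)
a1 <- a0 + 60; b1 <- b0 + 40  # posterior Beta(86, 66), after 60/100 improved
noh_bf <- function(lo, hi) {
  post_p  <- pbeta(hi, a1, b1) - pbeta(lo, a1, b1)     # step 1: P(delta in I | data)
  prior_p <- pbeta(hi, a0, b0) - pbeta(lo, a0, b0)     # step 2: pi = P(delta in I)
  (post_p / (1 - post_p)) / (prior_p / (1 - prior_p))  # step 3: posterior/prior odds
}
noh_bf(0.52, 1)  # NOH Bayes factor for the region of interest [52%, 100%]
```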
Computing the Bayes factor 2 of H1: δ ∈ I against H0: δ ∉ I for each of the five regions of interest used with the leading example yields the prior and posterior probabilities, odds, and Bayes factors shown in Table 2.
Table 2. Prior and Posterior Probabilities, Odds, and Bayes Factors for the Five Intervals Used With the Leading Example
Note: For comparison, the outcome of the HDI+ROPE test is also given. HDI = highest density interval; ROPE = region of practical equivalence.
Comparing the Bayes-factor results with those from the HDI+ROPE approach (see Table 2, last column), one can see that the outcomes for the regions [52%, 100%] and [58%, 62%] are clearly not in line with those for the HDI+ROPE approach. That is, the Bayes factors for these regions suggest positive evidence in favor of H1 over H0, whereas the conclusions from both the HDI+ROPE approach and the approach using 95% CIs fail to support these regions of interest. Moreover, for the regions [30%, 70%] and [52%, 100%], the Bayes-factor values are almost equal, whereas the conclusions with the HDI+ROPE approach clearly differ. These differences can at least partly be explained by the fact that the NOH Bayes factor is a change factor and is related not only to the posterior odds or the posterior probability but also to the starting point, the prior odds.
The NOH Bayes factor gives a ratio of support for H1: δ ∈ I relative to H0: δ ∉ I, but, being a change factor from prior odds to posterior odds, it does not by itself tell how probable it is that the effect size lies in the region of interest.
Posterior probabilities for intervals
Given the framework by Smiley et al. (2023), it seems obvious to test whether an effect-size value is in a particular region of interest by assessing how probable it is, given the data, that the value lies in that region, that is, by directly computing the posterior probability P(δ ∈ I | data) of the region of interest.
Although it may suffice to report the probability that the effect size is in a particular interval, one can use this to formally conduct a test by specifying decision rules. For example, we can stipulate that we consider the hypothesis that the effect size is in the region of interest “supported” if the posterior probability associated with it is over 95%. If not, we can, for instance, conclude that there is “insufficient support” if the probability is (just) not over 95% or even “little support” if the probability is clearly smaller than 95% but still higher than 50%. Note that these are just subjective suggestions for interpretations of the posterior probability. How this pans out for the leading example and the five regions chosen earlier is pictured in Figures 4a through 4e. Observe that the posterior probabilities themselves have already been given in Table 2. Using as test criterion the idea that the probability for the region of interest should be over 95% to consider it supported, we show that the results concur with those by the 95%-CI and the 95%-HDI approaches, being positive only for the second and third regions of interest. But in addition to these test results, we now also have insight into the full posterior distribution, so we can also consider probabilities of other ranges of values, and we can see how the probability is distributed within the region of interest. We emphasize that when actually reporting results, it is very important to not just report the decision and associated probabilities at hand but also show the whole posterior distribution to put the probability and hence the decision in perspective.

Fig. 4. (a–e) Posterior probabilities for each of the five chosen regions, displayed against the background of the full posterior probability density distribution.
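Computationally, such posterior probabilities are immediate once the posterior is available. For the leading example, with its Beta(86, 66) posterior, a base-R sketch for the five regions used above:

```r
a1 <- 86; b1 <- 66  # posterior Beta(86, 66) for the leading example
region_prob <- function(lo, hi) pbeta(hi, a1, b1) - pbeta(lo, a1, b1)
regions <- list(c(.48, .52), c(.30, .70), c(.48, 1), c(.52, 1), c(.58, .62))
probs <- sapply(regions, function(r) region_prob(r[1], r[2]))
round(probs, 3)  # posterior probability of each region of interest
probs > 0.95     # the suggested "supported" decision rule
```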
If so desired, one can even offer probabilities for adjacent regions in one go and thus cut up the probability space into, for instance, three regions that have interesting interpretations. An example of this is given in Figure 5. Here, the regions represent “clearly smaller percentages of improved than not improved patients,” “practically equivalent percentages of improved and not improved patients,” and “clearly higher percentages of improved than not improved patients.”

Fig. 5. Example of tests for three adjacent regions of interest for our leading example, specifying small proportions, proportions close to 0.50, and large proportions.
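Because adjacent regions partition the parameter space, their probabilities follow from a single vector of pbeta values. A one-line sketch for the three regions of Figure 5, with the bounds 0.48 and 0.52 assumed above:

```r
# Probabilities for three adjacent regions that partition the whole range
breaks <- c(0, 0.48, 0.52, 1)          # smaller / practically equivalent / higher
round(diff(pbeta(breaks, 86, 66)), 3)  # the three probabilities sum to 1
```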
All in all, this approach gives outcomes with a very easy and appealing interpretation, and once the posterior probability distribution has been obtained, it is easy to carry out.
Comparison of the Four Methods
In this section, we systematically compare the four methods discussed above: the method checking whether the 95% CI lies entirely in the region of interest; the HDI+ROPE method, which checks whether the 95% HDI lies entirely in the region of interest; the NOH Bayes-factor test; and the test based on assessing the posterior probability for the region of interest. For the main aspects of our comparison, see Table 3.
Table 3. Overview of Features of Various Methods
Note: CI = confidence interval; HDI = highest density interval; ROPE = region of practical equivalence; NOH = nonoverlapping hypotheses.
What the methods have in common
All four methods have been suggested as improvements on testing point null hypotheses by replacing them with interval hypotheses. The implied improvements have both a theoretical and a technical aspect. One theoretical improvement is that the impossibility of accepting the null hypothesis is resolved. One technical improvement is that one can now compute the probability of an effect being merely small or negligible, avoiding the extremely unlikely hypothesis that the effect is exactly equal to zero. Having solved these issues by resorting to interval hypotheses, a new challenge arises: One must now explicitly define bounds for the regions of interest. We do not view this as a disadvantage. The reason is that when interpreting results, researchers should be able to distinguish between what is practically relevant and what is not. This is already done each time researchers resort to using an effect-size measure, for instance, when they use power analysis to choose a sufficient sample size and, more generally, when they interpret their results. Therefore, implicitly, such choices are being made anyway. However, the interval-test approaches demand that such choices are made explicitly and are to some extent justified. And they should be made in advance because if they are made during data analysis, confirmation bias and human rationalizations may easily influence the choice of the interval in line with the expected or desired conclusion. Of course, even so, these are subjective choices, which makes them vulnerable to criticism. A researcher might therefore fear that such a justification must be “correct,” but that would be too (self-)demanding. The main idea of the justification is that the ensuing intervals lead to a practically useful way of describing and interpreting results, formulated before the analysis. A lot more has already been written about how to choose such bounds, as mentioned in the previous sections, so for further discussion about this, we refer the reader to, for instance, Kruschke (2018) and Lakens et al. (2018).
An issue that has sprung to the fore with the introduction of the three Bayesian methods for use in the unified-regions-testing framework is how to choose prior distributions. Obviously, like choosing bounds, the choice of priors is a challenge, and it also needs justification. But also here, the fear that such a justification must be “correct” would be too (self-)demanding. There is a lot of discussion between objective and subjective Bayesians over the role of prior distributions in Bayesian inference. Typically, the following four justifications for choosing particular prior distributions are given: (a) The prior distribution should reflect the current knowledge on the effect size in the population. (b) The prior distribution should be “noninformative” and hence make sure that the posterior is primarily if not solely determined by the data. One should, however, avoid assigning any probability mass to impossible outcomes. (c) The prior should be a commonly used and accepted default prior for this type of effect size in the population. (d) The prior should reflect one’s expert belief (before the data collection) on the probability density distribution that best describes one’s uncertainty about the effect size in the population.
Again, a lot has been said about such choices, and for the present article, we do not repeat this. The main thing, in our opinion, is that the choice of a prior should be motivated by the author. Furthermore, reporting a sensitivity analysis, in which various choices are made and resulting differences in outcomes are shown, is obviously welcome. However, the purported stability of the outcome derived from a sensitivity analysis obviously depends on how widely the priors studied differed and on how the choice of those priors was justified. After all, in principle, one can always take strong priors that will steer toward totally different posterior results. So, the simple statement, “Sensitivity analysis shows that the results are hardly influenced by the prior,” is never enough. One should always describe and motivate the range of priors studied in the sensitivity analysis.
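As an illustration of such a sensitivity analysis, the following sketch (our own, for the leading proportion example) shows how the posterior probability of a minimum effect changes under increasingly peaked, symmetric beta priors centered on .50:

```r
# Illustrative sensitivity analysis: P(p > .52 | data) for 60/100 improved,
# under increasingly peaked Beta(k, k) priors centered on .50
for (k in c(1, 5, 26, 100)) {
  p_post <- pbeta(0.52, k + 60, k + 40, lower.tail = FALSE)
  cat(sprintf("Beta(%3d, %3d) prior: P(p > .52 | data) = %.3f\n", k, k, p_post))
}
```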
How the methods differ
One might counter that an enormous advantage of the frequentist approach, using the 95% CIs, is that no such prior choices need to be made. This is true, but then again, one cannot directly interpret results of the analysis in terms of probabilities associated with the hypotheses. Conclusion drawing is a lot more difficult if one has to use the inverse probabilities related to 95% CIs. Recall that the procedure for setting up CIs will yield intervals that in 95% of the cases contain the population value. This kind of statement is comparable with statements about a disease diagnostic: The test may have a 95% chance of giving a positive outcome if one indeed has the disease, but this does not give the probability that one has the disease given that the test outcome was positive. The base-rate knowledge of the prevalence of the disease matters a lot in determining such chances, and the same holds for the base-rate probability of effect sizes. One might therefore expect that if all feasible effect sizes have roughly the same chance to begin with, the 95% CI might come quite close to the Bayesian HDI based on a completely or even a fairly flat prior. Actually, there is a lot of theory about this (e.g., see Jackman, 2009, p. 94; Rubin, 1984; or relatedly, Greenland & Poole, 2013) and empirical evidence (e.g., see Albers et al., 2018). However, even then, Bayesian procedures have the advantage of offering more complete information. Rather than just an interval, they give full posterior distributions, and one can see how strongly probabilities vary both within and outside the CI range.
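The base-rate point can be made concrete with a small computation; all numbers below are assumed for illustration only:

```r
# Base-rate illustration: test sensitivity .95, specificity .95, prevalence .02
sens <- 0.95; spec <- 0.95; prev <- 0.02        # assumed values, for illustration
p_pos <- sens * prev + (1 - spec) * (1 - prev)  # P(positive test result)
sens * prev / p_pos                             # P(disease | positive): about .28, not .95
```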
As mentioned above, the HDI+ROPE method is a straightforward Bayesian variant of Smiley et al.’s (2023) approach. The HDI+ROPE method likewise leads to a dichotomous (or trichotomous) outcome of supporting or not supporting the hypothesis or concluding that there is insufficient evidence for either. The HDI+ROPE method does not quantify the degree of belief or certainty of knowledge. At best, the HDI+ROPE method allows one to conclude that the probability that the value is in the region of interest is at least 95%; it cannot even tell whether the probability is less than 95% in cases in which the HDI crosses the border of the region. 3
A clear disadvantage of the HDI+ROPE method is that it does not distinguish between cases in which the HDI only just fails to fall entirely in the ROPE and cases in which only a very small part of it falls in the ROPE. Their interpretation is the same (“not sufficient evidence for a conclusion”), but the situations are totally different. At the very least, one would like to have a quantification of the degree of overlap. And rather than such an indirect measure, simply providing the probability of the values falling in the interval would be more valuable. As a matter of fact, the HDI serves as a kind of summarizer of the probability distribution, which, however, is not necessary for the region-testing purposes and can actually become a hindrance in cases as described above.
The second Bayesian approach, based on Bayes factors for intervals, offers a more concrete quantification of probabilities, but unfortunately, as implemented in software, it focuses on the Bayes factor, which is only the go-between toward such probabilities. The actual posterior probabilities are usually left unconsidered. In the example above, we demonstrated that this may easily lead to surprising and potentially confusing conclusions.
Finally, the approach consisting of assessing probabilities for the regions of interest turns out to be straightforward to carry out (once the posterior distribution has been determined, as also has to be done for the other methods). The results have straightforward interpretations in terms of the probability that the population effect size lies in the region of interest. Moreover, one can easily compare probabilities for many different regions and thus offer a more fine-grained picture if so desired. The approach is so obvious that it must have long been in existence, even if only as an informal procedure in practice, and we found at least a couple of instances in the literature. We also found some criticism of it, for instance, in the supplement to Kruschke (2018), where he mentioned that Some authors (e.g., Wellek, 2010) prefer to consider the proportion of the posterior distribution that falls within the ROPE as the statistic for decision making. For example, we might reject the null value if less than 5% (say) of the distribution falls within the ROPE, and we might accept the null value if more than 95% of the distribution falls within the ROPE. Notice that this rule ignores the probability density of parameter values inside or outside the ROPE. (p. 5)
We agree that it may come across as a bit odd if the highest density actually does not fall in the region of interest and yet we conclude that the region of interest most probably contains the population value. But by itself, this is not contradictory. After all, the probability pertains to a region of values that is of particular interest, and if one wishes to know the probability that the actual population value falls within it, then the computed posterior probability is exactly that, even if, taken as a single value, a value outside the interval is more likely than each single value in the region. Comparably, one might wish to know the probability that one of the 100 people in a village committed a particular crime. If there is one person outside the village who is slightly more likely to have committed the crime, then it is still good to know that, as a working hypothesis, the idea that the culprit lives in the village has 95% probability. As always, probability statements should not lead to closing one’s eyes to alternative possibilities, and keeping an open eye for an outside culprit is always wise. For that reason, plots such as those given by Kruschke (2018) and by us in Figure 4, which display not only intervals but also full posterior distributions, are always important. The regions and the probabilities help to shape a probabilistic decision, but it is important to keep seeing the full picture.
How to Use This in Practice
An important aspect of a statistical approach is how feasible it is to carry it out. The approach here hinges “only” on having obtained a posterior distribution for the effect size of interest. Many R packages are available for this purpose for many different measures: proportions, comparisons of group means, regression coefficients, and so on. If methods are not directly available, one can handle them with general-purpose Markov chain Monte Carlo procedures, such as those implemented in several R packages.
Specifically, the probabilities for intervals of effect sizes (when comparing the means of two groups) are explicitly offered by JASP’s Bayesian equivalence t test analysis.

Fig. 6. Complete screenshot of (left) the input and (right) the output in JASP’s Bayesian equivalence t test analysis.
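When no closed-form posterior is available, the same probabilities follow directly from posterior samples produced by any MCMC routine. In the following sketch, the draws are not real output but are simulated from a normal distribution chosen to roughly resemble the Study 1 posterior discussed below; with real MCMC draws, only the first two lines would change:

```r
# `draws` stands in for posterior samples of delta from any MCMC routine;
# here it is simulated purely for illustration
set.seed(1)
draws <- rnorm(1e5, mean = 0.07, sd = 0.19)  # hypothetical posterior draws
mean(draws > -0.1 & draws < 0.1)             # P(delta in [-0.1, 0.1] | data)
mean(draws > -0.3 & draws < 0.3)             # P(delta in [-0.3, 0.3] | data)
quantile(draws, c(0.025, 0.975))             # central 95% credible interval
```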
Example Analysis of an Empirical Data Set for Comparison of Two Means
To give a real empirical example of the use of the posterior-probability-testing approach, we reanalyzed the data sets collected by Prowten et al. (2024). They replicated a study by Berger (2011) testing whether nonemotionally induced arousal would increase the chance that people share news on social media. In Berger’s setup, participants in the low-arousal group were asked to sit quietly for 60 s, then carry out a distraction task, and then read a neutral news article “that they could e-mail to anyone they wanted.” In the high-arousal group, rather than sitting still, participants had to jog in place for 60 s; for the rest, they were treated the same as participants in the low-arousal group. Berger reported, “Compared with sitting still, running in place increased the percentage of people who e-mailed the article (from 33% to 75%), χ2(1,

Fig. 7. Rain-cloud plots, box plots, and smooth fitted density graph produced by JASP for the sharing percentages of participants in each of the two groups of Study 1 of Prowten et al. (2024).
From their results, Prowten et al. (2024) concluded that “in contrast to Berger (2011), we did not find evidence that increased physiological arousal corresponded to an increase in the number of articles shared by the participants” (p. 1029). They even summarized that, despite the success of the arousal manipulation, participants in the high- and low-arousal conditions did not differ in the number of articles they shared.
We then applied each of the four approaches for equivalence testing to the data of Study 1 of Prowten et al. (2024). The first approach checks whether the 95% CI falls entirely in the region of interest. Because the 95% CI = [−0.30, 0.45] for Study 1, it by no means falls in the region of interest; it extends amply beyond it at both sides. It can be concluded that there is no support for the equivalence hypothesis.
Then, we used JASP to carry out a Bayesian equivalence t test; an excerpt of the results is shown in Figure 8.

Fig. 8. Excerpt of the JASP results of a Bayesian equivalence t test for Study 1 of Prowten et al. (2024).
Figure 8 also reports the Bayes factor of the hypotheses “δ lies in the region of interest” versus the hypothesis “δ lies outside the region of interest,” which is denoted by BF∈∉ and is thus shown to equal 6.7. Such a Bayes factor can be considered positive support for “δ lies in the region of interest” compared with the hypothesis “δ lies outside the region of interest.”
Finally, in the “Prior and Posterior Mass Table” in Figure 8, we show that the posterior probability of δ lying in the region of interest (in the output denoted as “p[δ ∈ I | H1, data]”) equals 0.40, which by no means is close to 0.95. So also by the posterior-probability test, there is very little support for the equivalence hypothesis stating that the effect is negligible, that is, between −0.1 and 0.1.
Without formally doing these tests, it could actually be seen easily from the graph of the posterior distribution that, roughly, effect sizes of up to 0.3 cannot well be excluded on the basis of these data and the assumed prior distribution. 5 It can be considered somewhat odd that the Bayes-factor test does give “positive support” for a negligible effect. But it should be realized that positive support is usually deemed insufficient in practice. Moreover, the Bayes factor, as said, gives only the change factor from prior odds to posterior odds, and it is quite easy to see that, indeed, the prior odds for this region were a lot smaller than the posterior odds. Nevertheless, a considerable increase of small odds need not lead to high odds, as is shown here.
For further understanding of what is going on, we might consider that 0.2 is called a small effect and 0.4 a medium effect, and hence, we could consider that small effects are covered by the interval [−0.3, 0.3]. Taking this as a new region of interest, we tested whether the data support the hypothesis that the effect is at most small. The results are given in Table 4.
Table 4. Test Results for Region of Interest [−0.3, 0.3]
Note: CI = confidence interval; HDI = highest density interval; ROPE = region of practical equivalence; BF = Bayes factor.
The Bayes-factor test now led to strong support for the effect size lying in the interval [−0.3, 0.3]. The other tests still did not support this hypothesis, but clearly, much larger shares of the intervals fall in the region of interest, and the posterior probability that the effect size lies in the interval [−0.3, 0.3] now is as high as 0.88. So although there still is insufficient certainty (by the 0.95 rule set in these three tests), clearly, the probability that the effect size is at most small can be seen to be quite high. Conversely, one could say that the probability that the effect size is larger than 0.3 in magnitude is 0.12, which is quite a low probability, although not negligible.
We also computed the probability that the effect size is actually larger than 0.3 (thus doing a superiority test to see whether the effect is at least small) and found that the posterior probability for this is only 0.10. Given that Berger (2011) allegedly found an effect size exceeding 0.8, we also assessed the probability that the effect size exceeds 0.8 and found this to be 0.00003.
All in all, from our reanalyses of Study 1 of Prowten et al. (2024), we conclude that it hardly gives any support for the hypothesis that the effect size is negligible, and even the hypothesis that the effect is at most small is not strongly supported, that is, not by more than 95% posterior probability. Note, however, that conversely, we found even far less support for the hypotheses that the effect is at least small or that it is higher than 0.8. The main conclusion should be that even though the probabilities favor small values over high values, really strong conclusions would require narrower posterior distributions, obtained, for instance, with larger sample sizes. Note that in our conclusion, we take the full posterior into account as well and do not report just the posterior probability for the interval.
Actually, in this case, a larger sample size is available. Because there are two data sets with the same setup, we merged them. Because this yields a fairly big data set, one can expect a narrower posterior distribution. Therefore, we now also present the results of the combined data set from Studies 1 and 2 of Prowten et al. (2024) in Figure 9 and Table 5. For this combined data set, the results are clearer. For instance, in Figure 9, it can be seen that the distribution is more peaked and that its maximum has a higher density. Table 5 shows that the hypothesis that the effect is at most small is supported by all methods. We would thus conclude, taking into account that we used the default Cauchy prior, that it is quite probable that the effect of physical arousal on sharing, at least in the way it had been set up in the experiments by Prowten et al. (2024), is at most small and that, clearly, the big effect size reported by Berger (2011) was not replicated. We also show that the conclusion that the effect is negligible (let alone that it would be zero) cannot be made. For corroborating the conclusion that the effect would be virtually zero, we would need a much bigger sample, and the actual absolute absence of an effect can, in our view, never be corroborated by empirical experiments.

Fig. 9. Excerpt of the JASP results of a Bayesian equivalence t test for the combined data set from Studies 1 and 2 of Prowten et al. (2024).
Table 5. Test Results for Regions of Interest [−0.1, 0.1] and [−0.3, 0.3]
Note: CI = confidence interval; HDI = highest density interval; ROPE = region of practical equivalence; BF = Bayes factor.
Obviously, our results rely on using the default Cauchy prior. To demonstrate how influential this choice is, we did some further analyses with normal (N) priors. We found that with prior N(0, 1), the 95% credible interval was [−0.18, 0.30], and with the flatter prior N(0, 2), it was still [−0.18, 0.30]. This is virtually equal to the interval based on the Cauchy prior and happens to coincide with the CI. Just for the sake of showing that there is an effect of priors, we also tested the strongly peaked prior N(0, 0.1) and found that the 95% credible interval reduced to [−0.13, 0.18]. Clearly, the strong prior on values close to 0 reduces the interval considerably. This, however, can be used only if one has reason to believe a priori that the effect must be very small.
The Bayes factors and the posterior probabilities are given in Table 6. The results for the posterior probabilities hardly changed, whereas the BF values changed quite a bit. Clearly, as is well known, the Bayes factor is quite sensitive to the prior distribution; the posterior distribution clearly is less so, although, as shown above, it is not insensitive to priors representing strong certainty, such as the peaked N(0, 0.1) one.
Table 6. Test Results for ROIs [−0.1, 0.1] and [−0.3, 0.3] Using Normal Priors N(0, 1) and N(0, 2)
Note: BF = Bayes factor; ROI = region of interest.
Practical Recommendations
We have now described and illustrated four methods for (Bayesian) equivalence testing. Pros and cons have been discussed in the section in which the approaches were compared. None of them gives incorrect output, but their interpretations clearly differ. To avoid drawing wrong conclusions and overinterpreting results, we recommend presenting a plot of the posterior distribution itself in addition to numerical output and conclusions. In case one assesses many posterior probabilities at the same time, at least giving a flavor of what the distributions look like can be very insightful.
Furthermore, it has been argued that if a specific decision or crisp conclusion has to be offered on the basis of the analysis, it makes more sense to test between “small” and “not small” or between “negligible” and “nonnegligible” than between “zero” and “nonzero” effect sizes. Our practical advice for analyzing data would be as follows:
Step 1. Specify the (equivalence) bounds by considering what for your problem would be the region of interest 6 and choose the interval bounds accordingly.
Step 2. Specify the prior distribution(s).
Step 3. Report (and inspect) the posterior distribution (as a graph, preferably superimposed on a plot of the prior distribution).
Step 4. Report the posterior probability p(δ ∈ I | data) and if so desired, also the Bayes factor for changes in the odds of δ ∈ I versus δ ∉ I.
Step 5. If you want to make a decision, you may choose to decide on practical equivalence if the posterior probability of the region of interest exceeds, for instance, 0.95. 7 If this threshold is not met, you can, with only a little bit of extra effort (as was done in our example analysis as well), 8 assess the probabilities that δ lies below or above the region of interest and report these to put the result in perspective. (A minimal end-to-end sketch of these steps is given below.)
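For the leading proportion example, Steps 1 through 5 can be condensed into the following base-R sketch (all numbers as used earlier in this article; the 0.95 decision threshold is, again, only a suggestion):

```r
roi <- c(0.48, 0.52)                  # Step 1: region of interest I
a0 <- 26; b0 <- 26                    # Step 2: Beta(26, 26) prior
x <- 60; n <- 100                     # observed data: 60/100 improved
a1 <- a0 + x; b1 <- b0 + n - x        # conjugate posterior Beta(86, 66)
curve(dbeta(p, a1, b1), xname = "p")  # Step 3: plot the posterior ...
curve(dbeta(p, a0, b0), xname = "p", add = TRUE, lty = 3)  # ... over the prior
p_in <- pbeta(roi[2], a1, b1) - pbeta(roi[1], a1, b1)  # Step 4: P(delta in I | data)
pi0  <- pbeta(roi[2], a0, b0) - pbeta(roi[1], a0, b0)  #         prior mass in I
(p_in / (1 - p_in)) / (pi0 / (1 - pi0))                #         optional: NOH Bayes factor
p_in > 0.95                           # Step 5: decision rule (here FALSE)
```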
Conclusion and Discussion
It has been argued that strict null hypothesis testing leads to conclusions on the strict zeroness of an effect size, which does not seem to make much sense in actual practice. In practice, one will rather be interested at least in whether an effect is negligible. And if it is concluded to be nonnegligible, it will consequently be interesting to see how big it at least approximately is, with reasonable certainty. Using frequentist approaches, one can use equivalence testing combined with effect-size estimates and CIs for these purposes. Smiley et al. (2023) offered a versatile framework for this purpose, offering a great advantage over traditional null hypothesis testing. A disadvantage of the approach, however, is that, as with all frequentist approaches, it does not lead to concrete probability statements on the effect size, even though in interpretation, such statements would make conclusions clearly easier to grasp. It is easier to understand that there is a 95% chance that the effect size is between 0.4 and 0.6, for instance, than that a 95% CI is given running from 0.4 to 0.6, for which it is known that if one would repeatedly set up CIs in this way, in 95% of the cases, they would be correct. The Bayesian approach allows for the former type of statement, but there is no free lunch: To get such statements, one has to specify a prior distribution for the effect size. This might be found challenging. However, among variations on default priors, one can choose the one that comes closest to what one considers the actual state of knowledge and uncertainty about the effect size. As is well known and as we have illustrated in the example, variations in the choice of prior need not have serious influence on the outcomes of the posterior probabilities. Nevertheless, there always is a risk of overinterpreting the precision of such results because that precision depends on how accurate the data were, how representative the sample was, how reasonable the prior is, and how reasonable the data model is.
Although software is available for this approach for an enormous range of data-analysis problems, easy, accessible (“plug and play”) software, especially for interval-hypothesis testing, is not universally available. We offer such software for tests involving one or more proportions in the form of an R script, and JASP offers interval-hypothesis tests for comparisons of means. Other often-used tests, for instance, for correlations and regression weights, so far still require less accessible software. We aim to work on providing such software in the near future.
