Abstract
The aim of psychological research often is to test whether there is an effect of a particular intervention. An experiment is then set up in which the intervention (psychotherapy, education on climate change, etc.) is applied to a random sample of subjects; this sample is called the experimental group. Next, it is assessed to what extent the subjects in this group show the expected behavior more than the subjects in the control group. This is usually assessed by comparing group means on some quantitative outcome measure. Typically, a significance test (also called a null hypothesis significance test [NHST]) is carried out to see if the observed difference is “statistically significant,” and if so, this is interpreted as an indication that the effect is not a chance finding.
However, what does one do if a result is not significant? It is well known that nonsignificance cannot be seen as an indication that there is no effect (e.g., Cohen, 1994). This is because in case of nonsignificance, it is very well possible that there actually is an effect but that the data set contains insufficient information to let one draw such a conclusion. Are researchers empty-handed then? No, several remedies have been suggested. One suggested remedy is to inspect the power of the significance test. This, however, is a rather tricky affair because the power depends on assuming an effect size in the population, and this is obviously not known (Why else would researchers do a study?). So researchers have to make fairly arbitrary choices here, and ultimately, they will end up with a fairly complex conditional statement, such as “If in reality the effect size were 0.4, then for the given sample size and significance level, the chance of obtaining a statistically significant effect would have been .75.” It is hard to reason with such complex probabilistic statements. It seems better to simply accept that although NHST is meant for comparing the null hypothesis, “There is no effect” (H0), to the alternative hypothesis, “There is an effect” (H1), it can only sensibly reject H0 and hence accept H1; it cannot accept H0. Or, phrased with a bit more nuance: With NHST, one can quantify evidence against H0 but not in favor of it.
Presently, the most common route of statistical analysis is to start with NHST and next consider to what extent the effect actually is practically or theoretically relevant by inspecting the effect size and the corresponding confidence interval (CI; see guidelines by Wilkinson & Task Force on Statistical Inference, 1999, who strongly recommended reporting these). An alternative, however, is to start from what is considered practically or theoretically relevant and then see to what extent the data indicate that practically or theoretically relevant effects have been observed. For instance, how likely is it that the average effect of the intervention is higher than 1 point on the well-being scale? Or how likely is it that the effect is comparable with that of another intervention and at least not more than 1 point lower? Or how likely is it that the effect is simply very small (say, between −1 and 1 on the well-being scale)? Such questions seem much more relevant than testing whether there is an effect (however big or small that might be). It does require that assessments of what can be considered relevant are made before the study. Although this will not be trivial, one can expect a researcher or an expert in the field to be able to make sensible choices here because such interpretations should be made anyway after carrying out an NHST (as very many researchers routinely do).
The approaches sketched just above are in the realm of equivalence testing and noninferiority and superiority testing and have a long history primarily in medicine (Schuirmann, 1987; Wellek, 2010). Only recently have these approaches been popularized in psychology (Lakens, 2017; Lakens et al., 2018; Linde et al., 2023). Smiley et al. (2023) offered a unified framework covering these kinds of alternatives to NHST. This framework covers a large variety of procedures involving tests based on regions of values rather than on the simple null value only, as in NHST. It describes a single procedure for carrying out tests for all such types of regions based on CIs, which thus makes all these methods very easy to apply on very many possible measures (difference of means, difference of proportions, correlations, regression weights, etc.): The only thing one needs is to be able to set up a CI for such measures.
Even though the approaches in the unified framework offer a big step forward toward useful interpretations, they are still based on dichotomization. Moreover, they give only an indirect indication of the likelihood that an effect is in the region of interest. The purpose of the present article is to offer procedures that go beyond simple dichotomization and that lead to concrete probabilities as expressions of the likelihood of values being in the regions of interest. For this purpose, we invoke the help of Bayesian posterior estimation.
In the present article, we first describe the framework by Smiley et al. (2023). Next, we describe three Bayesian procedures that can be used to more directly assess the probability that an effect size falls in a region, and we describe their relative advantages. Finally, we show how to actually apply such procedures in practice and discuss their feasibility and the practical difficulties associated with them.
Leading Example
As a leading example in this article, we consider a study about how often patients improve by taking a particular drug. To make it a bit more concrete, as to the patients, one could think of psychiatric patients. As to the drug taking, one could think of taking the prescribed dose for a period of 2 months, and improvement would be assessed by verifying whether the score on a general well-being scale assessed after this period has increased with respect to an assessment before the period (equal or lower scores would be considered “not improved”). We consider fictitious data for 100 subjects, of whom 60 were found to have improved. Our research question aims at determining what proportion of patients in a population would likely improve.
In a null hypothesis test, we could consider a two-sided test, H0: “proportion improved = proportion not improved” against H1: “proportion improved ≠ proportion not improved.” Thus, the null hypothesis assumes that by taking the drug, patients might just as well improve as not improve, and the drug is therefore considered to be of “null” value. Denoting “proportion improved” by p, this amounts to testing H0: p = .50 against H1: p ≠ .50.
Unified Conceptual Framework for Statistical Inference in Terms of “Null Regions” and “Regions of Interest”
Smiley et al.’s (2023) article offered a unified framework involving what they called a “null region” (H0), which is compared with an H1 reflecting a “region of interest.” They started out from seven types of tests involving the null regions and regions of interest described in Table 1, which is a compilation of the information in Smiley et al.’s Figures 1 and 2.
Table 1. Overview of Smiley et al.’s (2023) Framework
Note: δ indicates the population effect size (e.g., a difference of two means, a correlation, a difference of two proportions). The null region and the region of interest represent the two hypotheses compared in the test. The constants listed with each test specify the bounds of these regions.

Fig. 1. (a–e) Results of different tests, obtained by verifying whether the 95% confidence interval [49.7%, 69.7%], displayed as the green area, falls entirely within the region of interest (for which only the blue dashed limits are specified).

Fig. 2. Posterior distribution for the improvement proportion from our leading example. The prior distribution used here and a scaled version of the likelihood have been superimposed on this graph. The 95% highest density interval is displayed as the dark green bar on top of the graph.
The test approach consists of assessing whether the 95% CI falls entirely outside the null region and hence entirely in the region of interest. If it does, then according to Smiley et al. (2023), the test is significant; hence, H0 is rejected in favor of H1. If it does not fall entirely outside the null region, then there is no significance, so H0 is not rejected, and nothing more would be concluded, not even if the 95% CI actually falls entirely in the null region. Although this seems an unnecessary limitation of the framework, it actually is not, because in all cases, the roles of the null region and the region of interest can be switched. In fact, in Table 1, one can see that “Minimum-effects testing two-sided” and “Equivalence testing” contrast the same hypotheses in different roles. Likewise, “Strong form hypothesis test (δ too far from predicted value)” and “Strong form hypothesis test (δ close enough to predicted value)” contrast the same hypotheses in different roles. Therefore, here, we apply the simpler decision rule that if the 95% CI is entirely in either the null region or the region of interest, this counts as support for the hypothesis that the population value is in that particular region. Only if the 95% CI is neither entirely in the null region nor entirely in the region of interest should one refrain from drawing any conclusions.
For our leading example, we find the exact binomial 95% CI = [0.497, 0.697]. Now suppose we did an equivalence test in which we stated that percentages of improved patients between 48% and 52% are associated with a negligible drug effect for clinical purposes. Then, the region of interest would be [48%, 52%], and clearly, the 95% CI = [49.7%, 69.7%] does not fall entirely in the region of interest (see Fig. 1a). So, one would conclude that there is not enough support for practical equivalence. And there also is not enough support for nonequivalence (i.e., a two-sided minimum-effects test also fails to reject its null hypothesis). We simply cannot draw a conclusion of either kind. However, in the unrealistic case in which we would consider percentages of improved patients between 30% and 70% as a negligible drug effect for clinical purposes, the 95% CI would fall entirely in the associated region of practical equivalence, [30%, 70%] (see Fig. 1b), and hence, there would be support for practical equivalence.
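For readers who wish to verify these numbers, the computation takes only a few lines of base R. The following is a minimal sketch of the CI-based decision rule (our own illustration, not code from Smiley et al., 2023); binom.test gives the exact Clopper–Pearson interval:

```r
# Exact (Clopper-Pearson) 95% CI for 60 improved patients out of 100
ci <- binom.test(x = 60, n = 100, conf.level = 0.95)$conf.int
round(ci, 3)  # approximately [0.497, 0.697]

# Decision rule in the spirit of Smiley et al. (2023): support for the
# region of interest only if the CI falls entirely inside it
roi <- c(0.48, 0.52)                # equivalence region from the text
ci[1] >= roi[1] && ci[2] <= roi[2]  # FALSE: no support for practical equivalence
```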
Perhaps more interesting would be to conduct a noninferiority test. That is, if in practice one would consider 48% improvement good enough for all practical purposes, then the noninferiority test would boil down to checking whether the 95% CI falls entirely in the region of interest, [48%, 100%]. It does (see Fig. 1c); hence, there would be support for noninferiority. But if a one-sided minimum-effects test were carried out, with a minimum effect of 52%, then the region of interest would be [52%, 100%], and clearly the 95% CI does not fall entirely in that region (see Fig. 1d), so there is not enough support for this minimum-effects hypothesis. Finally, if a theory predicted 60% improvement and one considered percentages between 58% and 62% to be practically equivalent to 60%, then the region of interest of a strong form hypothesis test on whether the likely value is sufficiently close to the theoretically predicted value would be [58%, 62%], and clearly, the 95% CI does not fall entirely within it (see Fig. 1e), so there is not enough support for the strong hypothesis. Thus, each test in the Smiley et al. (2023) framework can be carried out using a single 95% CI and checking whether it falls entirely inside the region of interest.
Although clearly an improvement over null hypothesis testing, the Smiley et al. (2023) framework shares a problem with NHST: Users may want to make statements based on probabilities that the population effect size is in particular regions, and with the CI-based tests presented here, this is not possible. Moreover, as in the case shown in Figure 1d, even when the 95% CI falls almost entirely inside the region of interest, the conclusion simply is that there is not enough support for the hypothesis of a minimum effect of 52%. The conclusion of failing to support the region of interest would be exactly the same if the 95% CI were located much further to the left, that is, if the observed effect were a lot smaller.
Three Bayesian Approaches for Assessing How Likely the Effect Size Is in the Region of Interest
HDI + ROPE procedure
Kruschke (2011; for a more detailed explanation, see Kruschke, 2018) proposed a decision rule that uses Bayesian posterior distributions as the basis for accepting or rejecting null values of parameters. This decision rule focuses on the range of plausible values indicated by the highest density interval of the posterior distribution and the relation between this range and a region of practical equivalence (ROPE). (Kruschke, 2018, p. 270)
The method starts out with a Bayesian estimation procedure for the effect size of interest, for instance, the difference between two means, a regression weight, a correlation, or a proportion. Consider our leading example of studying the proportion of patients that experienced improvement after taking a particular drug. A Bayesian estimation procedure can then proceed as follows (e.g., see Kruschke, 2013): (a) Specify a priori the distribution for the proportion of improved patients in the population. (b) On the basis of the observed data, the likelihood function is set up. For each possible value of the parameter involved (i.e., the proportion of improvement in the population), this function specifies the probability of obtaining the observed data. (c) The posterior distribution for the proportion of improved patients in the population is computed by multiplying the likelihood function by the prior distribution and normalizing the ensuing product to a proper probability distribution.
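For a proportion, steps (a) through (c) can be carried out in closed form because a beta prior is conjugate to the binomial likelihood. The following is a minimal base-R sketch for the leading example, using the Beta(26, 26) prior that is discussed below:

```r
# (a) Prior: Beta(26, 26), peaked around .50
a0 <- 26; b0 <- 26
# (b) Data and likelihood: 60 improved out of n = 100 (binomial)
x <- 60; n <- 100
# (c) Conjugacy: posterior = prior x likelihood, normalized,
#     which here is Beta(a0 + x, b0 + n - x) = Beta(86, 66)
a1 <- a0 + x; b1 <- b0 + n - x
curve(dbeta(p, a1, b1), from = 0, to = 1, xname = "p")                       # posterior
curve(dbeta(p, a0, b0), from = 0, to = 1, xname = "p", add = TRUE, lty = 3)  # prior
```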
The precise steps for obtaining a posterior distribution depend on the measure that one is interested in and may involve more than one prior and posterior distribution (e.g., for the difference of two means), but the general gist is as above. For our purpose, the most important aspect is that the posterior probability distribution specifies for each possible value of the measure of interest what its probability density is. In other words, the graph of the posterior distribution visualizes the relative probability of all possible parameter values. But because there are infinitely many parameter values, one uses densities rather than concrete probabilities. For our leading example, this is illustrated in Figure 2, which shows that the likelihood function (red, dashed) is somewhat spread around the observed proportion of .60. Here, we used a fairly strongly peaked prior (based on the Beta[26, 26] distribution), which would be appropriate if we have fairly precise prior knowledge that the proportion should be around .50 and most likely not exceeding .30 or .70 (see the blue, dotted curve); we chose this particular example prior to show that with fairly peaked priors, the posterior and the likelihood may differ quite a lot. Had we used a “flatter” prior expressing little prior knowledge, the posterior distribution would have been extremely close to the (scaled) likelihood function.
Based on this graph, one can assess probabilities for particular ranges (or regions) of values. A popular way of summarizing such a graph is by means of the so-called 95% highest density interval (HDI), which contains the shortest range of values that jointly have a posterior probability of 95%. In other words, given the data and given the assumed prior distributions, 1 we know that the probability that the parameter value is in the 95% HDI range of values is exactly 95%. The 95% HDI for our leading example is [0.49, 0.64], or in percentages [49%, 64%], which is quite similar to the 95% CI seen earlier but not exactly equal. Here, we chose as prior a Beta(26, 26) distribution, which is based on quite a bit of assumed prior knowledge: It corresponds exactly to the knowledge gained from a previous study, started with a uniform prior, in which 25 successes were found in 50 trials. If, instead, we had chosen the flat prior Beta(1, 1), which gives the uniform distribution on [0, 1], the 95% HDI would be [0.50, 0.69], hence virtually equal to the 95% CI.
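The 95% HDI can be computed by searching for the shortest interval that contains 95% of the posterior mass. Packages such as HDInterval offer this directly; the following self-contained base-R sketch (our own) does the same for a beta posterior:

```r
# 95% HDI for a Beta(a, b) posterior: the shortest interval holding 95% mass
hdi_beta <- function(a, b, mass = 0.95) {
  width <- function(p_lo) qbeta(p_lo + mass, a, b) - qbeta(p_lo, a, b)
  opt <- optimize(width, interval = c(0, 1 - mass))  # minimize interval width
  c(qbeta(opt$minimum, a, b), qbeta(opt$minimum + mass, a, b))
}
round(hdi_beta(86, 66), 2)          # Beta(26, 26) prior: about [0.49, 0.64]
round(hdi_beta(1 + 60, 1 + 40), 2)  # flat Beta(1, 1) prior: about [0.50, 0.69]
```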
Now, the 95% HDI [49%, 64%] can be used for a goal comparable with that of the CI, but it is easier to interpret. So obviously, we can use the HDI in the same way as Smiley et al. (2023) used the 95% CI for their null regions and regions of interest. For our leading example, we reproduced Figure 1, now based on the 95% HDI (see Fig. 3). As can be seen, all conclusions are the same as when the 95% CI was used, although the kind of overlap changed strongly in some cases. For instance, in the equivalence-testing example in Figure 3b, the 95% CI was just within the boundaries of the region of interest, whereas the 95% HDI is more or less in the middle of the interval, relatively far from the boundaries.

Fig. 3. (a–e) Different tests verifying whether the 95% highest density interval [49%, 64%], displayed as the green area, falls entirely in the region of interest (for which only the blue dashed limits are specified).
Kruschke (2018), whose work preceded the Smiley et al. (2023) article, did not start out from null regions but from defining a so-called “region of practical equivalence,” which is similar to the region of interest defined above. He proceeded to set up a decision rule by simply verifying whether the 95% HDI falls entirely within the ROPE. Of course, a crucial step is to define the ROPE. Kruschke (2011) wrote, The ROPE indicates values of θ that we deem to be equivalent to the null value for practical purposes. . . . In real applications, the limits of the ROPE would be justified on the basis of negligible implications for small differences from the null value. (p. 302)
Kruschke (2018) offered an in-depth discussion of the choice and use of the ROPE and related this to the bounds used in equivalence testing and noninferiority testing. For another in-depth discussion about bounds in equivalence testing, see Lakens et al. (2018).
Although this approach gives a good indication of whether the range described by the region of interest consists of very probable values, it still does not exactly tell the probability that the effect-size value is in that range. Kruschke (2011) did mention that “moreover, the proportion of the posterior inside the ROPE indicates the total credibility of values that are practically equivalent to the null” (p. 302) but did not use this in the formal decision procedure he advocated. Moreover, the degree of overlap of the HDI with the region of interest is ignored when drawing conclusions; only whether the HDI falls entirely inside (or entirely outside) the region matters.
Bayes-factor approaches for testing interval null hypotheses
Morey and Rouder (2011) proposed a quite different Bayesian approach to testing hypotheses formulated as regions of values. Their approach is a variant of Bayesian null hypothesis testing, in which the null hypothesis is replaced by a small interval around 0. We therefore first discuss Bayesian null hypothesis testing.
Bayesian null hypothesis testing was introduced by Jeffreys (e.g., see Jeffreys, 1961) and basically amounts to computing the so-called Bayes factor for comparing model H0, specifying that there is no effect, against H1, specifying that there is a nonzero effect in the population, with the uncertainty about its true value captured by a particular probability distribution. The Bayes factor then is defined as the ratio of the marginal likelihoods for H0 versus H1. Considering the marginal likelihoods as indicators of the “support” that both models have in the observed data, the Bayes factor can be seen as a measure of relative support for H0 compared with H1. It thus treats H0 and H1 symmetrically, and unlike in NHST, conclusions can be drawn comparatively, which is more than only being able to reject H0 in favor of H1 (e.g., see Dienes, 2014; Wagenmakers, 2007). Although the Bayes factor itself does not compare posterior probabilities, it does function as the go-between for transforming prior probabilities into posterior probabilities. Specifically, the prior odds, giving the ratio of (to be assessed or defined) probabilities of models H0 and H1, multiplied by the Bayes factor offer the posterior odds of the posterior probabilities of models H0 and H1. According to Kass and Raftery (1995, p. 776), Good (1958, p. 803) was possibly the first to mention the term “Bayes factor,” and his writing clearly suggests that the term “factor” pertains to its role of changing prior odds into posterior odds, which is at the essence of Bayes’s theorem. Thus interpreted, the Bayes factor is meant as a means, not an end, in Bayesian analysis. The end goal of Bayes’s theorem is to compute a posterior probability, not the factor that changes a prior probability into a posterior probability. Nevertheless, the Bayes factor is often used as an end by itself and interpreted on its own as the degree of support for either hypothesis. The idea that it should be used for assessing posterior odds, for instance, by readers, who should specify their own prior model odds and then compute the posterior model odds by multiplying the prior model odds with the Bayes factor, is rarely followed.
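The odds-updating role of the Bayes factor is easy to make concrete. In the following sketch, both the Bayes factor and the prior probability of H0 are hypothetical numbers chosen purely for illustration:

```r
bf01 <- 3                       # hypothetical Bayes factor in favor of H0
prior_odds <- 0.20 / 0.80       # assumed prior: P(H0) = .20, P(H1) = .80
post_odds <- prior_odds * bf01  # posterior odds = prior odds x Bayes factor
post_odds / (1 + post_odds)     # posterior P(H0): about .43, despite BF01 = 3
```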
This may not be surprising because specifying prior model odds may actually be somewhat awkward. In particular, the prior odds of an exact null hypothesis against an alternative hypothesis can be deemed problematic because, as often claimed (e.g., see Cohen, 1994, p. 1000, who also quotes other sources; Meehl, 1978), the probability that an effect size is exactly 0 can be considered 0 for all practically relevant research questions. If one agrees with this, as a consequence, the prior odds of H0 against H1 are 0, and whatever the Bayes factor is, the posterior odds will be 0 as well. Even if one takes a less extreme point of view and allows some probability for the null effect to be exactly true but still far less than for the alternative, then still only very high Bayes factors would lead to large posterior probabilities that the effect size is exactly zero. In practice, however, fairly often, the Bayes factor is actually (mis)taken for the posterior odds (Tendeiro et al., 2024; Wong et al., 2022). This would be correct only if one takes the prior model odds equal to 1. In such cases, one can relate Bayesian null hypothesis testing to Bayesian estimation based on the so-called spike-and-slab prior (e.g., see Rouder et al., 2018; Tendeiro & Kiers, 2023), meaning a prior distribution that is gently curved over the full range but has a very high spike for the value 0, with a probability mass of 50%. In other words, in this way, zero effects are prioritized, as a kind of skeptical default prior. This may be a deliberate choice for some, but because it is a default in various packages, it may not be realized by many users.
For the above reasons, Morey and Rouder’s (2011) interval approach to hypothesis testing is much more realistic than strict null hypothesis testing. They described various options, but the most compelling one, which also is directly available in JASP (JASP Team, 2024), is the one leading to the nonoverlapping-hypotheses (NOH) Bayes factor. Essentially, it starts out from specifying an interval I of effect-size values (e.g., the values deemed practically equivalent to zero) and then proceeds in three steps:
1. Compute the posterior probability P(δ ∈ I | data) that the effect size lies in the interval I.
2. Compute the prior probability π = P(δ ∈ I) that the effect size lies in I according to the prior.
3. Compute the Bayes factor as the posterior odds, P(δ ∈ I | data)/P(δ ∉ I | data), divided by the prior odds, π/(1 − π).
As an aside, if the interval width is shrunk toward 0, in the limit, this will give the null-hypothesis Bayes factor (see Tendeiro & Kiers, 2023). However, as shown in Equation 1, the limit cannot be reached exactly because then π = P(δ ∈ I) = 0, so the prior odds, and hence the ratio defining the Bayes factor, are no longer defined.
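For the conjugate proportion example, the three steps above reduce to a few pbeta calls. A sketch (our own illustration of the NOH computation, using the region [52%, 100%] from the leading example):

```r
a0 <- 26; b0 <- 26            # prior Beta(26, 26)
a1 <- a0 + 60; b1 <- b0 + 40  # posterior Beta(86, 66), after 60/100 improved
noh_bf <- function(lo, hi) {
  post_p  <- pbeta(hi, a1, b1) - pbeta(lo, a1, b1)     # step 1: P(delta in I | data)
  prior_p <- pbeta(hi, a0, b0) - pbeta(lo, a0, b0)     # step 2: pi = P(delta in I)
  (post_p / (1 - post_p)) / (prior_p / (1 - prior_p))  # step 3: posterior/prior odds
}
noh_bf(0.52, 1)  # NOH Bayes factor for the region of interest [52%, 100%]
```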
Computing the Bayes factor 2 of H1: δ ∈ I against H0: δ ∉ I for each of the five regions of interest used with the leading example yields the prior and posterior probabilities, odds, and Bayes factors shown in Table 2.
Table 2. Prior and Posterior Probabilities, Odds, and Bayes Factors for the Five Intervals Used With the Leading Example
Note: For comparison, the outcome of the HDI+ROPE test is also given. HDI = highest density interval; ROPE = region of practical equivalence.
Comparing the Bayes-factor results with those from the HDI+ROPE approach (see Table 2, last column), one can see that the outcomes for the regions [52%, 100%] and [58%, 62%] are clearly not in line with those for the HDI+ROPE approach. That is, the Bayes factors for these regions suggest positive evidence in favor of H1 over H0, whereas the conclusions from both the HDI+ROPE approach and the approach using 95% CIs fail to support these regions of interest. Moreover, for the regions [30%, 70%] and [52%, 100%], the Bayes-factor values are almost equal, whereas the conclusions with the HDI+ROPE approach clearly differ. These differences can at least partly be explained by the fact that the NOH Bayes factor is a change factor and is related not only to the posterior odds or the posterior probability but also to the starting point, the prior odds.
The NOH Bayes factor gives a ratio of support for H1: δ ∈ I relative to H0: δ ∉ I, but, being a change factor from prior odds to posterior odds, it does not by itself tell how probable it is that the effect size lies in the region of interest.
Posterior probabilities for intervals
Given the framework by Smiley et al. (2023), it seems obvious to test whether an effect-size value is in a particular region of interest by assessing how probable it is, given the data, that the value lies in that region, that is, by directly computing the posterior probability P(δ ∈ I | data) of the region of interest.
Although it may suffice to report the probability that the effect size is in a particular interval, one can use this to formally conduct a test by specifying decision rules. For example, we can stipulate that we consider the hypothesis that the effect size is in the region of interest “supported” if the posterior probability associated with it is over 95%. If not, we can, for instance, conclude that there is “insufficient support” if the probability is (just) not over 95% or even “little support” if the probability is clearly smaller than 95% but still higher than 50%. Note that these are just subjective suggestions for interpretations of the posterior probability. How this pans out for the leading example and the five regions chosen earlier is pictured in Figures 4a through 4e. Observe that the posterior probabilities themselves have already been given in Table 2. Using as test criterion the idea that the probability for the region of interest should be over 95% to consider it supported, we show that the results concur with those by the 95%-CI and the 95%-HDI approaches, being positive only for the second and third regions of interest. But in addition to these test results, we now also have insight into the full posterior distribution, so we can also consider probabilities of other ranges of values, and we can see how the probability is distributed within the region of interest. We emphasize that when actually reporting results, it is very important to not just report the decision and associated probabilities at hand but also show the whole posterior distribution to put the probability and hence the decision in perspective.

Fig. 4. (a–e) Posterior probabilities for each of the five chosen regions, displayed against the background of the full posterior probability density distribution.
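Computationally, such posterior probabilities are immediate once the posterior is available. For the leading example, with its Beta(86, 66) posterior, a base-R sketch for the five regions used above:

```r
a1 <- 86; b1 <- 66  # posterior Beta(86, 66) for the leading example
region_prob <- function(lo, hi) pbeta(hi, a1, b1) - pbeta(lo, a1, b1)
regions <- list(c(.48, .52), c(.30, .70), c(.48, 1), c(.52, 1), c(.58, .62))
probs <- sapply(regions, function(r) region_prob(r[1], r[2]))
round(probs, 3)  # posterior probability of each region of interest
probs > 0.95     # the suggested "supported" decision rule
```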
If so desired, one can even offer probabilities for adjacent regions in one go and thus cut up the probability space into, for instance, three regions that have interesting interpretations. An example of this is given in Figure 5. Here, the regions represent “clearly smaller percentages of improved than not improved patients,” “practically equivalent percentages of improved and not improved patients,” and “clearly higher percentages of improved than not improved patients.”

Fig. 5. Example of tests for three adjacent regions of interest for our leading example, specifying small proportions, proportions close to 0.50, and large proportions.
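Because adjacent regions partition the parameter space, their probabilities follow from a single vector of pbeta values. A one-line sketch for the three regions of Figure 5, with the bounds 0.48 and 0.52 assumed above:

```r
# Probabilities for three adjacent regions that partition the whole range
breaks <- c(0, 0.48, 0.52, 1)          # smaller / practically equivalent / higher
round(diff(pbeta(breaks, 86, 66)), 3)  # the three probabilities sum to 1
```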
All in all, this approach gives outcomes with a very easy and appealing interpretation, and once the posterior probability distribution has been obtained, it is easy to carry out.
Comparison of the Four Methods
In this section, we systematically compare the four methods discussed above: the method checking whether the 95% CI lies entirely in the region of interest; the HDI+ROPE method, which checks whether the 95% HDI lies entirely in the region of interest; the NOH Bayes-factor test; and the test based on assessing the posterior probability for the region of interest. For the main aspects of our comparison, see Table 3.
Table 3. Overview of Features of Various Methods
Note: CI = confidence interval; HDI = highest density interval; ROPE = region of practical equivalence; NOH = nonoverlapping hypotheses.
What the methods have in common
All four methods have been suggested as improvements on testing point null hypotheses by replacing them with interval hypotheses. The implied improvements have both a theoretical and a technical aspect. One theoretical improvement is that the impossibility of accepting the null hypothesis is resolved. One technical improvement is that one can now compute the probability of an effect being merely small or negligible, avoiding the extremely unlikely hypothesis that the effect is exactly equal to zero. Having solved these issues by resorting to interval hypotheses, a new challenge arises: One must now explicitly define bounds for the regions of interest. We do not view this as a disadvantage. The reason is that when interpreting results, researchers should be able to distinguish between what is practically relevant and what is not. This is already done each time researchers resort to using an effect-size measure, for instance, when they use power analysis to choose a sufficient sample size and, more generally, when they interpret their results. Therefore, implicitly, such choices are being made anyway. However, the interval-test approaches demand that such choices are made explicitly and are to some extent justified. And they should be made in advance because if they are made during data analysis, confirmation bias and human rationalizations may easily influence the choice of the interval in line with the expected or desired conclusion. Of course, even so, these are subjective choices, which makes them vulnerable to criticism. A researcher might therefore fear that such a justification must be “correct,” but that would be too (self-)demanding. The main idea of the justification is that the ensuing intervals lead to a practically useful way of describing and interpreting results, formulated before the analysis. A lot more has already been written about how to choose such bounds, as mentioned in the previous sections, so for further discussion about this, we refer the reader to, for instance, Kruschke (2018) and Lakens et al. (2018).
An issue that has sprung to the fore with the introduction of the three Bayesian methods for use in the unified-regions-testing framework is how to choose prior distributions. Obviously, like choosing bounds, the choice of priors is a challenge, and it also needs justification. But also here, the fear that such a justification must be “correct” would be too (self-)demanding. There is a lot of discussion between objective and subjective Bayesians over the role of prior distributions in Bayesian inference. Typically, the following four justifications for choosing particular prior distributions are given: (a) The prior distribution should reflect the current knowledge on the effect size in the population. (b) The prior distribution should be “noninformative” and hence make sure that the posterior is primarily if not solely determined by the data. One should, however, avoid assigning any probability mass to impossible outcomes. (c) The prior should be a commonly used and accepted default prior for this type of effect size in the population. (d) The prior should reflect one’s expert belief (before the data collection) on the probability density distribution that best describes one’s uncertainty about the effect size in the population.
Again, a lot has been said about such choices, and for the present article, we do not repeat this. The main thing, in our opinion, is that the choice of a prior should be motivated by the author. Furthermore, reporting a sensitivity analysis, in which various choices are made and resulting differences in outcomes are shown, is obviously welcome. However, the purported stability of the outcome derived from a sensitivity analysis obviously depends on how widely the priors studied differed and on how the choice of those priors was justified. After all, in principle, one can always take strong priors that will steer toward totally different posterior results. So, the simple statement, “Sensitivity analysis shows that the results are hardly influenced by the prior,” is never enough. One should always describe and motivate the range of priors studied in the sensitivity analysis.
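As an illustration of such a sensitivity analysis, the following sketch (our own, for the leading proportion example) shows how the posterior probability of a minimum effect changes under increasingly peaked, symmetric beta priors centered on .50:

```r
# Illustrative sensitivity analysis: P(p > .52 | data) for 60/100 improved,
# under increasingly peaked Beta(k, k) priors centered on .50
for (k in c(1, 5, 26, 100)) {
  p_post <- pbeta(0.52, k + 60, k + 40, lower.tail = FALSE)
  cat(sprintf("Beta(%3d, %3d) prior: P(p > .52 | data) = %.3f\n", k, k, p_post))
}
```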
How the methods differ
One might counter that an enormous advantage of the frequentist approach, using the 95% CIs, is that no such prior choices need to be made. This is true, but then again, one cannot directly interpret results of the analysis in terms of probabilities associated with the hypotheses. Conclusion drawing is a lot more difficult if one has to use the inverse probabilities related to 95% CIs. Recall that the procedure for setting up CIs will yield intervals that in 95% of the cases contain the population value. This kind of statement is comparable with statements about a disease diagnostic: The test may have a 95% chance of giving a positive outcome if one indeed has the disease, but this does not give the probability that one has the disease given that the test outcome was positive. The base-rate knowledge of the prevalence of the disease matters a lot in determining such chances, and the same holds for the base-rate probability of effect sizes. One might therefore expect that if all feasible effect sizes have roughly the same chance to begin with, the 95% CI might come quite close to the Bayesian HDI based on a completely or even a fairly flat prior. Actually, there is a lot of theory about this (e.g., see Jackman, 2009, p. 94; Rubin, 1984; or relatedly, Greenland & Poole, 2013) and empirical evidence (e.g., see Albers et al., 2018). However, even then, Bayesian procedures have the advantage of offering more complete information. Rather than just an interval, they give full posterior distributions, and one can see how strongly probabilities vary both within and outside the CI range.
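The base-rate point can be made concrete with a small computation; all numbers below are assumed for illustration only:

```r
# Base-rate illustration: test sensitivity .95, specificity .95, prevalence .02
sens <- 0.95; spec <- 0.95; prev <- 0.02        # assumed values, for illustration
p_pos <- sens * prev + (1 - spec) * (1 - prev)  # P(positive test result)
sens * prev / p_pos                             # P(disease | positive): about .28, not .95
```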
As mentioned above, the HDI+ROPE method is a straightforward Bayesian variant of Smiley et al.’s (2023) approach. The HDI+ROPE method likewise leads to a dichotomous (or trichotomous) outcome of supporting or not supporting the hypothesis or concluding that there is insufficient evidence for either. The HDI+ROPE method does not quantify the degree of belief or certainty of knowledge. At best, the HDI+ROPE method allows one to conclude that the probability that the value is in the region of interest is at least 95%; it cannot even tell whether the probability is less than 95% in cases in which the HDI crosses the border of the region. 3
A clear disadvantage of the HDI+ROPE method is that it does not distinguish between cases in which the HDI only just fails to fall entirely in the ROPE and cases in which only a very small part of it falls in the ROPE. Their interpretation is the same (“not sufficient evidence for a conclusion”), but the situations are totally different. At the very least, one would like to have a quantification of the degree of overlap. And rather than such an indirect measure, simply providing the probability of the values falling in the interval would be more valuable. As a matter of fact, the HDI serves as a kind of summarizer of the probability distribution, which, however, is not necessary for the region-testing purposes and can actually become a hindrance in cases as described above.
The second Bayesian approach, based on Bayes factors for intervals, offers a more concrete quantification of probabilities, but unfortunately, as implemented in software, it focuses on the Bayes factor, which is only the go-between toward such probabilities. The actual posterior probabilities are usually left unconsidered. In the example above, we demonstrated that this may easily lead to surprising and potentially confusing conclusions.
Finally, the approach consisting of assessing probabilities for the regions of interest turns out to be straightforward to carry out (once the posterior distribution has been determined, as also has to be done for the other methods). The results have straightforward interpretations in terms of the probability that the population effect size lies in the region of interest. Moreover, one can easily compare probabilities for many different regions and thus offer a more fine-grained picture if so desired. The approach is so obvious that it must have long been in existence, even if only as an informal procedure in practice, and we found at least a couple of instances in the literature. We also found some criticism of it, for instance, in the supplement to Kruschke (2018), where he mentioned that Some authors (e.g., Wellek, 2010) prefer to consider the proportion of the posterior distribution that falls within the ROPE as the statistic for decision making. For example, we might reject the null value if less than 5% (say) of the distribution falls within the ROPE, and we might accept the null value if more than 95% of the distribution falls within the ROPE. Notice that this rule ignores the probability density of parameter values inside or outside the ROPE. (p. 5)
We agree that it may come across as a bit odd if the highest density actually does not fall in the region of interest and yet we conclude that the region of interest most probably contains the population value. But by itself, this is not contradictory. After all, the probability pertains to a region of values that is of particular interest, and if one wishes to know the probability that the actual population value falls within it, then the computed posterior probability is exactly that, even if, taken as a single value, a value outside the interval is more likely than each single value in the region. Comparably, one might wish to know the probability that one of the 100 people in a village committed a particular crime. If there is one person outside the village who is slightly more likely to have committed the crime, then it is still good to know that, as a working hypothesis, the idea that the culprit lives in the village has 95% probability. As always, probability statements should not lead to closing one’s eyes to alternative possibilities, and keeping an open eye for an outside culprit is always wise. For that reason, plots such as those given by Kruschke (2018) and by us in Figure 4, which display not only intervals but also full posterior distributions, are always important. The regions and the probabilities help to shape a probabilistic decision, but it is important to keep seeing the full picture.
How to Use This in Practice
An important aspect of a statistical approach is how feasible it is to carry it out. The approach here hinges “only” on having obtained a posterior distribution for the effect size of interest. Many R packages are available for this purpose for many different measures: proportions, comparisons of group means, regression coefficients, and so on. If methods are not directly available, one can handle them with general-purpose Markov chain Monte Carlo procedures, such as those implemented in several R packages.
Specifically, the probabilities for intervals of effect sizes (when comparing the means of two groups) are explicitly offered by JASP’s Bayesian equivalence t test analysis.

Fig. 6. Complete screenshot of (left) the input and (right) the output in JASP’s Bayesian equivalence t test analysis.
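When no closed-form posterior is available, the same probabilities follow directly from posterior samples produced by any MCMC routine. In the following sketch, the draws are not real output but are simulated from a normal distribution chosen to roughly resemble the Study 1 posterior discussed below; with real MCMC draws, only the first two lines would change:

```r
# `draws` stands in for posterior samples of delta from any MCMC routine;
# here it is simulated purely for illustration
set.seed(1)
draws <- rnorm(1e5, mean = 0.07, sd = 0.19)  # hypothetical posterior draws
mean(draws > -0.1 & draws < 0.1)             # P(delta in [-0.1, 0.1] | data)
mean(draws > -0.3 & draws < 0.3)             # P(delta in [-0.3, 0.3] | data)
quantile(draws, c(0.025, 0.975))             # central 95% credible interval
```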
Example Analysis of an Empirical Data Set for Comparison of Two Means
To give a real empirical example of the use of the posterior-probability-testing approach, we reanalyzed the data sets collected by Prowten et al. (2024). They replicated a study by Berger (2011) testing whether nonemotionally induced arousal would increase the chance that people share news on social media. In Berger’s setup, participants in the low-arousal group were asked to sit quietly for 60 s, then carry out a distraction task, and then read a neutral news article “that they could e-mail to anyone they wanted.” In the high-arousal group, rather than sitting still, participants had to jog in place for 60 s; for the rest, they were treated the same as participants in the low-arousal group. Berger reported, “Compared with sitting still, running in place increased the percentage of people who e-mailed the article (from 33% to 75%), χ2(1,

Fig. 7. Rain-cloud plots, box plots, and smooth fitted density graph produced by JASP for the sharing percentages of participants in each of the two groups of Study 1 of Prowten et al. (2024).
From their results, Prowten et al. (2024) concluded that “in contrast to Berger (2011), we did not find evidence that increased physiological arousal corresponded to an increase in the number of articles shared by the participants” (p. 1029). They even summarized that, despite the success of the arousal manipulation, participants in the high- and low-arousal conditions did not differ in the number of articles they shared.
We then applied each of the four approaches for equivalence testing to the data of Study 1 of Prowten et al. (2024). The first approach checks whether the 95% CI falls entirely in the region of interest. Because the 95% CI = [−0.30, 0.45] for Study 1, it by no means falls in the region of interest; it extends amply beyond it at both sides. It can be concluded that there is no support for the equivalence hypothesis.
Then, we used JASP to carry out a Bayesian equivalence t test; an excerpt of the results is shown in Figure 8.

Fig. 8. Excerpt of the JASP results of a Bayesian equivalence t test for Study 1 of Prowten et al. (2024).
Figure 8 also reports the Bayes factor of the hypotheses “δ lies in the region of interest” versus the hypothesis “δ lies outside the region of interest,” which is denoted by BF∈∉ and is thus shown to equal 6.7. Such a Bayes factor can be considered positive support for “δ lies in the region of interest” compared with the hypothesis “δ lies outside the region of interest.”
Finally, in the “Prior and Posterior Mass Table” in Figure 8, we show that the posterior probability of δ lying in the region of interest (in the output denoted as “p[δ ∈ I | H1, data]”) equals 0.40, which by no means is close to 0.95. So also by the posterior-probability test, there is very little support for the equivalence hypothesis stating that the effect is negligible, that is, between −0.1 and 0.1.
Without formally doing these tests, it could actually be seen easily from the graph of the posterior distribution that, roughly, effect sizes of up to 0.3 cannot well be excluded on the basis of these data and the assumed prior distribution. 5 It can be considered somewhat odd that the Bayes-factor test does give “positive support” for a negligible effect. But it should be realized that positive support is usually deemed insufficient in practice. Moreover, the Bayes factor, as said, gives only the change factor from prior odds to posterior odds, and it is quite easy to see that, indeed, the prior odds for this region were a lot smaller than the posterior odds. Nevertheless, a considerable increase of small odds need not lead to high odds, as is shown here.
For further understanding of what is going on, we might consider that 0.2 is called a small effect and 0.4 a medium effect, and hence, we could consider that small effects are covered by the interval [−0.3, 0.3]. Taking this as a new region of interest, we tested whether the data support the hypothesis that the effect is at most small. The results are given in Table 4.
Table 4. Test Results for Region of Interest [−0.3, 0.3]
Note: CI = confidence interval; HDI = highest density interval; ROPE = region of practical equivalence; BF = Bayes factor.
The Bayes-factor test now led to strong support for the effect size lying in the interval [−0.3, 0.3]. The other tests still did not support this hypothesis, but clearly, much larger shares of the intervals fall in the region of interest, and the posterior probability that the effect size lies in the interval [−0.3, 0.3] now is as high as 0.88. So although there still is insufficient certainty (by the 0.95 rule set in these three tests), clearly, the probability that the effect size is at most small can be seen to be quite high. Conversely, one could say that the probability that the effect size is larger than 0.3 in magnitude is 0.12, which is quite a low probability, although not negligible.
We also computed the probability that the effect size is actually larger than 0.3 (thus doing a superiority test to see whether the effect is at least small) and found that the posterior probability for this is only 0.10. Given that Berger (2011) allegedly found an effect size exceeding 0.8, we also assessed the probability that the effect size exceeds 0.8 and found this to be 0.00003.
All in all, from our reanalyses of Study 1 of Prowten et al. (2024), we conclude that it hardly gives any support for the hypothesis that the effect size is negligible, and even the hypothesis that the effect is at most small is not strongly supported, that is, not by more than 95% posterior probability. Note, however, that conversely, we found even far less support for the hypotheses that the effect is at least small or that it is higher than 0.8. The main conclusion should be that even though the probabilities favor small values over high values, really strong conclusions would require narrower posterior distributions, obtained, for instance, with larger sample sizes. Note that in our conclusion, we take the full posterior into account as well and do not report just the posterior probability for the interval.
Actually, in this case, a larger sample size is available. Because there are two data sets with the same setup, we merged them. Because this yields a fairly big data set, one can expect a narrower posterior distribution. Therefore, we now also present the results of the combined data set from Studies 1 and 2 of Prowten et al. (2024) in Figure 9 and Table 5. For this combined data set, the results are clearer. For instance, in Figure 9, it can be seen that the distribution is more peaked and that its maximum has a higher density. Table 5 shows that the hypothesis that the effect is at most small is supported by all methods. We would thus conclude, taking into account that we used the default Cauchy prior, that it is quite probable that the effect of physical arousal on sharing, at least in the way it had been set up in the experiments by Prowten et al. (2024), is at most small and that, clearly, the big effect size reported by Berger (2011) was not replicated. We also show that the conclusion that the effect is negligible (let alone that it would be zero) cannot be made. For corroborating the conclusion that the effect would be virtually zero, we would need a much bigger sample, and the actual absolute absence of an effect can, in our view, never be corroborated by empirical experiments.

Fig. 9. Excerpt of the JASP results of a Bayesian equivalence t test for the combined data set from Studies 1 and 2 of Prowten et al. (2024).
Table 5. Test Results for Regions of Interest [−0.1, 0.1] and [−0.3, 0.3]
Note: CI = confidence interval; HDI = highest density interval; ROPE = region of practical equivalence; BF = Bayes factor.
Obviously, our results rely on using the default Cauchy prior. To demonstrate how influential this choice is, we did some further analyses with normal (N) priors. We found that with prior N(0, 1), the 95% credible interval was [−0.18, 0.30], and with the flatter prior N(0, 2), it was still [−0.18, 0.30]. This is virtually equal to the interval based on the Cauchy prior and happens to coincide with the CI. Just for the sake of showing that there is an effect of priors, we also tested the strongly peaked prior N(0, 0.1) and found that the 95% credible interval reduced to [−0.13, 0.18]. Clearly, the strong prior on values close to 0 reduces the interval considerably. This, however, can be used only if one has reason to believe a priori that the effect must be very small.
The Bayes factors and the posterior probabilities are given in Table 6. The results for the posterior probabilities hardly changed, whereas the BF values changed quite a bit. Clearly, as is well known, the Bayes factor is quite sensitive to the prior distribution; the posterior distribution clearly is less so, although, as shown above, it is not insensitive to priors representing strong certainty, such as the peaked N(0, 0.1) one.
Table 6. Test Results for ROIs [−0.1, 0.1] and [−0.3, 0.3] Using Normal Priors N(0, 1) and N(0, 2)
Note: BF = Bayes factor; ROI = region of interest.
Practical Recommendations
We have now described and illustrated four methods for (Bayesian) equivalence testing. Pros and cons have been discussed in the section in which the approaches were compared. None of them gives incorrect output, but their interpretations clearly differ. To avoid drawing wrong conclusions and overinterpreting results, we recommend presenting a plot of the posterior distribution itself in addition to numerical output and conclusions. In case one assesses many posterior probabilities at the same time, at least giving a flavor of what the distributions look like can be very insightful.
Furthermore, it has been argued that if a specific decision or crisp conclusion has to be offered on the basis of the analysis, it makes more sense to test between “small” and “not small” or between “negligible” and “nonnegligible” than between “zero” and “nonzero” effect sizes. Our practical advice for analyzing data would be as follows:
Step 1. Specify the (equivalence) bounds by considering what for your problem would be the region of interest 6 and choose the interval bounds accordingly.
Step 2. Specify the prior distribution(s).
Step 3. Report (and inspect) the posterior distribution (as a graph, preferably superimposed on a plot of the prior distribution).
Step 4. Report the posterior probability p(δ ∈ I | data) and if so desired, also the Bayes factor for changes in the odds of δ ∈ I versus δ ∉ I.
Step 5. If you want to make a decision, you may choose to decide on practical equivalence if the posterior probability of the region of interest exceeds, for instance, 0.95. 7 If this threshold is not met, you can, with only a little bit of extra effort (as was done in our example analysis as well), 8 assess the probabilities that δ lies below or above the region of interest and report these to put the result in perspective. (A minimal end-to-end sketch of these steps is given below.)
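For the leading proportion example, Steps 1 through 5 can be condensed into the following base-R sketch (all numbers as used earlier in this article; the 0.95 decision threshold is, again, only a suggestion):

```r
roi <- c(0.48, 0.52)                  # Step 1: region of interest I
a0 <- 26; b0 <- 26                    # Step 2: Beta(26, 26) prior
x <- 60; n <- 100                     # observed data: 60/100 improved
a1 <- a0 + x; b1 <- b0 + n - x        # conjugate posterior Beta(86, 66)
curve(dbeta(p, a1, b1), xname = "p")  # Step 3: plot the posterior ...
curve(dbeta(p, a0, b0), xname = "p", add = TRUE, lty = 3)  # ... over the prior
p_in <- pbeta(roi[2], a1, b1) - pbeta(roi[1], a1, b1)  # Step 4: P(delta in I | data)
pi0  <- pbeta(roi[2], a0, b0) - pbeta(roi[1], a0, b0)  #         prior mass in I
(p_in / (1 - p_in)) / (pi0 / (1 - pi0))                #         optional: NOH Bayes factor
p_in > 0.95                           # Step 5: decision rule (here FALSE)
```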
Conclusion and Discussion
It has been argued that strict null hypothesis testing leads to conclusions on the strict zeroness of an effect size, which does not seem to make much sense in actual practice. In practice, one will rather be interested at least in whether an effect is negligible. And if it is concluded to be nonnegligible, it will consequently be interesting to see how big it at least approximately is, with reasonable certainty. Using frequentist approaches, one can use equivalence testing combined with effect-size estimates and CIs for these purposes. Smiley et al. (2023) offered a versatile framework for this purpose, offering a great advantage over traditional null hypothesis testing. A disadvantage of the approach, however, is that, as with all frequentist approaches, it does not lead to concrete probability statements on the effect size, even though in interpretation, such statements would make conclusions clearly easier to grasp. It is easier to understand that there is a 95% chance that the effect size is between 0.4 and 0.6, for instance, than that a 95% CI is given running from 0.4 to 0.6, for which it is known that if one would repeatedly set up CIs in this way, in 95% of the cases, they would be correct. The Bayesian approach allows for the former type of statement, but there is no free lunch: To get such statements, one has to specify a prior distribution for the effect size. This might be found challenging. However, among variations on default priors, one can choose the one that comes closest to what one considers the actual state of knowledge and uncertainty about the effect size. As is well known and as we have illustrated in the example, variations in the choice of prior need not have serious influence on the outcomes of the posterior probabilities. Nevertheless, there always is a risk of overinterpreting the precision of such results because that precision depends on how accurate the data were, how representative the sample was, how reasonable the prior is, and how reasonable the data model is.
Although software is available for this approach for an enormous range of data-analysis problems, easy, accessible (“plug and play”) software, especially for interval-hypothesis testing, is not universally available. We offer such software for tests involving one or more proportions in the form of an R script, and JASP offers interval-hypothesis tests for comparisons of means. Other often-used tests, for instance, for correlations and regression weights, so far still require less accessible software. We aim to work on providing such software in the near future.
