Abstract
“This independent variable had a statistically significant effect in condition A, and by contrast, we found no statistically significant effect in condition B.” Many researchers believe that these findings are sufficient to support the claim that there is a difference between conditions A and B in the effect of the variable. However, such an inference does not follow from these results, as it requires the test of the difference in the effect between the conditions, or, in other words, the test of the interaction (Abelson, 1995, p. 111; Gelman & Stern, 2006). This inferential mistake is common in neuroscience (Nieuwenhuis, Forstmann, & Wagenmakers, 2011), and one can safely assume that psychologists are also not immune from committing it. Although this mistake is now well recognized when it comes to null-hypothesis significance testing, how does it relate to the use of Bayes factors?
When conventional cutoffs are used for Bayes factors (see Box 1 for a brief introduction on the interpretation of Bayes factors via the conventional cutoffs), there may be conditions in which this inferential mistake is even more likely than with frequentist statistics. If there is good-enough Bayesian evidence for the alternative hypothesis,
The Interpretation of the Bayes Factor
The Bayes factor is a continuous measure of the strength of relative evidence for two hypotheses according to their ability to predict the data at hand (Dienes, 2016; Kruschke & Liddell, 2018; Rouder, Speckman, Sun, Morey, & Iverson, 2009). In this Tutorial, we report Bayes factors that represent the evidence for the alternative hypothesis,
Test Your Intuitions
At a Golf Club in Sussex, a coach stumbled upon a sport psychology article concluding that mental training (e.g., imagining hitting the ball with a golf club) can help golfers improve their skills when it is combined with real training. Before implementing the mental training in all student groups, the coach decided to test whether players can benefit from it. Therefore, the coach asked the students in one group to engage in mental training (on top of the traditional training) twice every week for the next 3 months. The coach also had a control group in which the students underwent traditional training but were not told to do the mental training; the students in this group had skills roughly identical to those of the students in the mental-training group. The coach assessed the performance of the students at baseline and after 3 months of training. The evaluation was performed on an interval scale from 0 to 10. Given the results of past studies with other sports, the coach expected that after 3 months of training, performance could improve by about 2 units.
To draw conclusions from the analyses, the coach used null-hypothesis significance testing, and the alpha level was set at the traditional .05. The coach reported the results of two statistical tests and a conclusion based on those tests. Evaluate the appropriateness of the following conclusion on a scale from 0 (you feel that the conclusion is completely inappropriate) to 10 (you feel that the conclusion is completely appropriate based on the information at your disposal): Comparing baseline and posttraining performance in the control group yielded a nonsignificant result,
Now suppose that the coach used Bayes factors (see Box 1), instead of null-hypothesis significance testing, to draw conclusions. Given that the coach had reasons to expect an effect of about 2 units, he used a half-normal distribution with a standard deviation of 2 as the model for the alternative hypothesis. Assess the appropriateness of the following conclusion based on the Bayes factors by choosing a value from the same scale of appropriateness (i.e., 0 means that you feel that the conclusion is completely inappropriate, and 10 means that you feel that the conclusion is completely appropriate): Comparing baseline and posttraining performance of the control group yielded good-enough evidence for the null hypothesis of no change,
The central goal of this Tutorial is to substantiate in readers the statistical intuition that to claim the existence of a difference in the effect of an independent variable between two conditions or groups, one always needs to test the interaction, and that this principle is as true for Bayesian as for frequentist statistics. We present a hypothetical case study in which Bayes factors for evidence of the presence of an effect are calculated for an experimental group and a control group for which frequentist statistical tests were significant and nonsignificant, respectively. With this approach, we aim to illustrate that there are cases in which using Bayes factors instead of frequentist statistics could make it more likely to commit the inferential mistake that is the focus of this Tutorial, and there are cases in which it may be the other way around. By increasing the sample size and reducing the raw effect size in our case study, we cover all the possible scenarios: insensitive evidence versus evidence for an effect coupled with an insensitive test of the interaction, insensitive evidence versus evidence for an effect coupled with a sensitive test of the interaction, evidence for no effect versus evidence for an effect coupled with an insensitive test of the interaction, and evidence for no effect versus evidence for an effect coupled with a sensitive test of the interaction.
The Case Study
Consider the hypothetical study, described in Box 2, in which a golf coach is trying to test whether or not adding mental training to traditional training can improve golf performance. To investigate this question, the coach randomly assigned students to a group that received traditional training only (henceforth, the control group) and a group that received traditional plus mental training (henceforth, the mental-training group). The coach assessed golf performance at baseline and after 3 months of training. Therefore, the study had a 2 × 2 mixed design. Hence, the crucial test of the idea that one can benefit more from golf training if it is combined with mental training boils down to a test of the 2 × 2 interaction of time of assessment (baseline vs. posttraining) and type of training (traditional vs. traditional plus mental). For the sake of simplicity, imagine that golf performance was measured on a scale from 0 to 10, and the coach expected that the mental training should improve performance by about 2 units.
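The crucial interaction contrast in this 2 × 2 mixed design can be formed from each group's change scores. As a minimal sketch, here is how that contrast and its standard error could be computed (the per-student scores below are invented for illustration and are not the article's data):

```python
import numpy as np

# Hypothetical (baseline, posttraining) performance scores per student;
# the numbers are made up for illustration only.
control_pre  = np.array([4.1, 5.0, 3.8, 4.6, 5.2])
control_post = np.array([4.4, 5.1, 4.0, 4.9, 5.5])
mental_pre   = np.array([4.3, 4.8, 5.1, 3.9, 4.7])
mental_post  = np.array([5.9, 6.5, 6.8, 5.6, 6.4])

# Change scores absorb the within-participant (repeated-measures) factor
control_change = control_post - control_pre
mental_change  = mental_post - mental_pre

# The interaction effect is the difference between the two improvements
interaction = mental_change.mean() - control_change.mean()

# The groups are independent, so the squared standard errors of the two
# change-score means add
se = np.sqrt(control_change.var(ddof=1) / len(control_change)
             + mental_change.var(ddof=1) / len(mental_change))
print(round(interaction, 2), round(se, 3))
```

It is this single raw effect (and its standard error), not the two separate simple effects, that the test of the interaction evaluates.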
Justifying the Model of H1 and the Model of the Data
To compute a Bayes factor, one needs to specify the parameters of the models representing the predictions of the hypotheses under comparison (see Box 3 for more information on the essential model choices that must be made when calculating a Bayes factor). The model of
The Anatomy of the Bayes Factor
In order to assess the predictive ability of hypotheses, one needs to create models that represent their predictions. Modeling the prediction of no difference (the null hypothesis) is the straightforward part of the process. However, specifying the predictions of the alternative hypothesis requires scientifically informed decisions in every case, and so it can be a subject of debate. For instance, one needs to define the shape and parameters of the distribution representing the predictions of the alternative hypothesis as a function of the possible population effect sizes. Should the distribution be a uniform,
Specifying all of these parameters requires the researcher to make many decisions, which has the side effect of increasing analytic flexibility and so the opportunity to cherry-pick the results supporting the researcher’s pet theory. The most crucial step during which one can introduce bias is perhaps the model specification of
Disclosures
All the materials of this Tutorial are available on the Open Science Framework, at https://osf.io/jbuv7. These materials include the R script of the Bayes factor analyses introduced here and the script of the Bayes factor function. They also include the R script of a simple and interactive Web application, a Shiny app, that can calculate the Bayes factors of 2 × 2 between-groups and within-participants designs. Box 4 presents an example of the usage of the Bayes factor R script (namely, the test of the interaction in the example we discuss next), and Figure 1 portrays how the Shiny app can be applied to compute all three Bayes factors of this example. The Bayes factor Shiny app can be accessed at https://bencepalfi.shinyapps.io/Bayesian_Interaction_App/.
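The OSF script itself is not reproduced here. Purely as a hedged illustration of the kind of computation such a Bayes factor function performs, the following Python sketch numerically integrates a normal likelihood over a half-normal model of the alternative hypothesis. The function name `bf_halfnormal` and its signature are ours, not the authors' R function, and the normal likelihood is a large-degrees-of-freedom simplification of the t likelihood a full implementation would use:

```python
import numpy as np

def norm_pdf(x, mu, sd):
    """Normal density, written out to keep the sketch dependency-light."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def bf_halfnormal(estimate, se, sd_h1):
    """Bayes factor for H1 over H0, where the data are summarized by a raw
    effect `estimate` with standard error `se`, and H1 is modeled as a
    half-normal distribution with scale `sd_h1` (a directional prediction)."""
    # Grid over the plausible positive population effects
    theta = np.linspace(0, 10 * max(sd_h1, se, abs(estimate)), 100_000)
    dtheta = theta[1] - theta[0]
    prior = 2 * norm_pdf(theta, 0, sd_h1)        # half-normal on [0, inf)
    likelihood = norm_pdf(estimate, theta, se)   # normal approximation
    p_h1 = np.sum(prior * likelihood) * dtheta   # marginal likelihood under H1
    p_h0 = norm_pdf(estimate, 0, se)             # likelihood under H0
    return p_h1 / p_h0

# A precisely measured 2-unit improvement favors H1; no improvement favors H0
print(bf_halfnormal(2.0, 0.5, 2.0))
print(bf_halfnormal(0.0, 0.5, 2.0))
```

With the small samples in the examples, the t likelihood used by the article's R function (which also takes the degrees of freedom as an argument) would give somewhat different numbers than this normal approximation.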
Calculating the Bayes Factor in R
To calculate the Bayes factor in R, one needs to obtain the summary statistics of the data (mean, standard error, and degrees of freedom) and decide on the parameters of the model of the alternative hypothesis,
The following R script reproduces the results of the test of the interaction in Example 1 (the # symbol identifies comments that are included to help readers and will be ignored by R when the script is run):
The first three arguments of the function specify the parameters of the likelihood function: the standard error, the estimate (i.e., raw effect size), and the degrees of freedom of the distribution, respectively. The last three arguments define the parameters of the model of

Figure 1. Screenshot of the Shiny app at https://bencepalfi.shinyapps.io/Bayesian_Interaction_App/. For 2 × 2 designs, this app calculates the Bayes factor separately for each of the two groups and for the interaction, given the following statistical parameters: the raw effect size and its standard error for each group, the sample size, and the standard deviation of the half-normal distribution that models the predictions of the alternative hypothesis,
Example 1: When the Bayes Factor Helps the Researcher Avoid Committing the Inferential Mistake
In our hypothetical study, suppose that the coach found that the test comparing baseline and posttraining performance was significant in the mental-training group,
The question arises, what conclusions would one draw in this scenario if one relied on Bayes factors? These data translate into substantial evidence for the effectiveness of the training in the mental-training group,
The only way to come to a conclusion regarding whether or not mental training combined with traditional training is superior to traditional training alone is to collect more data until one obtains evidence in one direction or the other. Optional stopping is not a problem for Bayesian statistics; the Bayes factor will retain its meaning regardless of the stopping rule applied (Dienes, 2016; Rouder, 2014). Thus, one can check the Bayes factor every time one recruits a new participant and stop once the Bayes factor reaches a good-enough level of evidence. For example, in this scenario, assuming that the raw effect sizes and their variances remain constant, the coach would need to recruit 94 participants in total (47 per group) to have substantial evidence for the interaction,
Example 2: When the Bayes Factor Might Exacerbate the Problem and Seemingly Create an Inferential Paradox
Now consider the scenario described in Box 2, which differs from Example 1 only in that the raw difference between baseline and posttraining performance is reduced by 0.3 units in both of the groups. All other parameters (e.g., the standard deviations and the performance difference between the control and mental-training groups) are kept constant. In this scenario, the results of significance tests probing the efficacy of the training separately in the two groups are identical to the results in Example 1 (i.e., nonsignificant improvement in the control group and significant improvement in the mental-training group). However, the Bayes factors reveal that this scenario is different from Example 1, as there is good-enough evidence for the presence of an effect of training in the mental-training group,
It might seem intuitive to conclude that the evidence for a difference in the effectiveness of training between the groups must be substantial in itself as well (cf. your feeling of appropriateness about the conclusion when you read Box 2). However, that is an unwarranted conclusion, as the rule that one cannot draw a meaningful conclusion from the difference between two categorical statements (Abelson, 1995, p. 111) applies to Bayesian statistics just as much as it applies to frequentist statistics. Hence, regardless of how tempting it feels to claim that the group with substantial evidence for
It appears that in this case, unlike in Example 1, relying on the Bayes factor for the control group rather than on the
Seemingly, this scenario presents a paradox in which one can claim that an effect exists in group A and does not exist in group B, but one cannot state that the effect is stronger in group A than in group B. These conclusions are inconsistent with one another, but the Bayes factor should not take the blame for this inconsistency. The cause of this paradox is that cutoffs have been used to interpret the Bayes factors, and so they have been reduced from continuous to categorical indicators. That is, the Bayes factors underlying the claims that there is an effect in group A and that the interaction is insensitive point in the same direction (i.e., both Bayes factors are larger than 1). Hence, the inconsistency was created by imposing a cutoff and labeling the first Bayes factor as good-enough evidence for
Fortunately, there is a way to escape this paradox. There is no need to consider the evidence at one’s disposal as fixed. Therefore, the remedy is to collect more data until the Bayes factor of the crucial test exceeds one of the cutoff values (as mentioned earlier, optional stopping does not invalidate conclusions based on Bayes factors). For instance, assuming that the raw effect sizes and their variances stay constant while data in this scenario are collected, one would need to recruit the same number of participants (47 per group) as needed in Example 1 to obtain evidence for a difference in the effect of training between the groups. Note that optional stopping applies to multilab collaborations as well: If a lab runs out of participants before reaching good-enough evidence, another lab can continue with the accumulation of evidence.
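The stop-when-the-evidence-is-good-enough procedure described above can be sketched as follows, with simulated change scores; the function names, the evidence cutoff of 3, and the minimum sample size before checking are our illustrative choices:

```python
import numpy as np

def bf_halfnormal(estimate, se, sd_h1):
    # Normal-likelihood Bayes factor with a half-normal model of H1
    # (a simplified stand-in for the article's R function).
    theta = np.linspace(0, 10 * max(sd_h1, se, abs(estimate)), 100_000)
    dtheta = theta[1] - theta[0]
    pdf = lambda x, mu, sd: np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    p_h1 = np.sum(2 * pdf(theta, 0, sd_h1) * pdf(estimate, theta, se)) * dtheta
    return p_h1 / pdf(estimate, 0, se)

def monitor(change_scores, sd_h1=2.0, cutoff=3.0, n_min=5):
    """Recompute the Bayes factor after every new change score and stop
    once it reaches the cutoff for H1 (or 1/cutoff for H0)."""
    scores = []
    for x in change_scores:
        scores.append(x)
        if len(scores) < n_min:
            continue
        s = np.array(scores)
        bf = bf_halfnormal(s.mean(), s.std(ddof=1) / np.sqrt(len(s)), sd_h1)
        if bf >= cutoff or bf <= 1 / cutoff:
            return len(scores), bf
    return len(scores), bf  # evidence still insensitive when data ran out

# Simulated stream of change scores with a true 1-unit effect
rng = np.random.default_rng(1)
n, bf = monitor(rng.normal(1.0, 1.5, size=500))
print(n, round(bf, 2))
```

Because the Bayes factor retains its meaning under optional stopping, the point at which this loop happens to terminate does not alter the interpretation of the final value.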
Discussion
In this Tutorial, we aimed to illustrate how the application of Bayes factors with cutoffs relates to the old problem of the tendency to compare the statistical significance of the simple effects in two groups to decide whether there is a difference between the groups rather than to compare the groups directly. We introduced two scenarios in which group A had a significant effect, whereas group B had a nonsignificant effect. In Example 1, employing Bayes factors instead of null-hypothesis significance testing would likely help a researcher avoid the inferential mistake because the test of the nonsignificant group turned out to be insensitive, and it is unlikely that a researcher would assume that good-enough evidence for
We showed that drawing a conclusion from Bayes factors can sometimes lead to a paradox (i.e., good-enough Bayesian evidence for
Continuing data collection until one obtains good-enough evidence for or against the model predicting an interaction, which is critical to escape the inferential paradox illustrated in Example 2, can be challenging when resources are limited. Thus, estimating the sample size one might need to find good-enough evidence for one hypothesis over another should play an essential role in the planning phase of an experiment. To this aim, one can compute a rough estimate of the sample size needed to obtain, with a given probability, a Bayes factor that is equal to or larger than a specific value (i.e., the cutoff we defined for good-enough evidence). For instance, to have a long-term 50% relative frequency of obtaining a Bayes factor of 3 (or 1/3), one can simply replicate the sample-size-increase procedure of Examples 1 and 2. That is, one can take the raw effect size and its standard deviation from a pilot study and assume that these parameters remain constant while the sample size increases (see Dienes, 2015, for a detailed tutorial). (For an alternative view on how to plan the design of a future experiment to achieve good-enough evidence, see Schönbrodt & Wagenmakers, 2018; for a tutorial, see Stefan, Gronau, Schönbrodt, & Wagenmakers, 2019.) Finally, it is important to bear in mind that sample-size estimation is useful for planning, such as for roughly estimating how long data collection will take, but has no influence on the inferences made once the data are in. The final Bayes factor obtained is the measure of evidence for one hypothesis over another, and its meaning is independent of the sample-size estimation procedure (Dienes, 2016). Thus, in Example 2, the sample-size estimation suggests the need to recruit 47 additional participants to gain good-enough evidence for the interaction.
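The planning heuristic just described, namely holding the pilot effect constant while the standard error shrinks with the square root of the sample size, can be sketched as follows. The Bayes factor helper is a simplified normal-likelihood stand-in for the article's R function, and the pilot numbers are invented:

```python
import numpy as np

def bf_halfnormal(estimate, se, sd_h1):
    # Normal-likelihood Bayes factor with a half-normal model of H1
    # (a simplified stand-in for the article's R function).
    theta = np.linspace(0, 10 * max(sd_h1, se, abs(estimate)), 100_000)
    dtheta = theta[1] - theta[0]
    pdf = lambda x, mu, sd: np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    p_h1 = np.sum(2 * pdf(theta, 0, sd_h1) * pdf(estimate, theta, se)) * dtheta
    return p_h1 / pdf(estimate, 0, se)

def projected_n(estimate, se_pilot, n_pilot, sd_h1, cutoff=3.0, n_max=10_000):
    """Smallest total n at which the Bayes factor is projected to reach
    good-enough evidence in either direction, assuming the raw effect stays
    constant and the standard error shrinks with the square root of n."""
    for n in range(n_pilot, n_max):
        se = se_pilot * np.sqrt(n_pilot / n)
        bf = bf_halfnormal(estimate, se, sd_h1)
        if bf >= cutoff or bf <= 1 / cutoff:
            return n
    return None  # no decisive evidence projected within n_max

# Hypothetical pilot: a 0.8-unit interaction effect with SE = 0.6 at n = 20
print(projected_n(0.8, 0.6, 20, sd_h1=2.0))
```

As the text stresses, such a projection only guides planning; the inference itself rests on the Bayes factor actually obtained once the data are in.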
However, one may reach good-enough evidence with fewer additional participants or only after recruiting more than 47 additional participants, and once this happens, the conclusion regarding the presence or lack of the interaction should be based solely on the strength of the available evidence (i.e., the Bayes factor based on all the data collected to date).
In conclusion, it is evident that using Bayes factors is not a panacea for the inferential mistake discussed in this Tutorial. In Example 1, we illustrated that reliance on Bayes factors may mitigate the problem, and in Example 2, we showed that such reliance may exacerbate the problem. By discussing these two examples, we have intended to raise awareness that any claim about the moderating effect of an independent variable should be supported by a sensitive test of the interaction regardless of whether one uses frequentist or Bayesian statistics. Irrespective of how paradoxical it seems, good-enough Bayesian evidence for
Supplemental Material
Palfi_AMPPSOpenPracticesDisclosure-v1-0 – Supplemental material for Why Bayesian “Evidence for H1” in One Condition and Bayesian “Evidence for H0” in Another Condition Does Not Mean Good-Enough Bayesian Evidence for a Difference Between the Conditions
