The use of frequentist statistics to perform inference in applied research is riddled with difficulties. There is strong evidence suggesting that researchers routinely misinterpret the core frequentist tools, such as p values and confidence intervals.
The last 10 years have witnessed an increase in published materials aiming at promoting the Bayesian paradigm to researchers in the social sciences (Etz & Vandekerckhove, 2018; Świątkowski & Carrier, 2020; van de Schoot et al., 2014). But Bayesian statistics is still relatively unknown and novel among social scientists. Hence, it would not be surprising if researchers were making interpretation mistakes when using some of the newly learned Bayesian inferential tools. In this article, we mostly focus on null hypothesis Bayesian testing (NHBT) and the Bayes factor, that is, the Bayesian counterparts to NHST and the p value.
This article has been written for applied social scientists for whom the Bayes factor is still a relatively new tool. The article has two main objectives. The first is to provide a full account of what a correct use of the Bayes factor entails. To this effect, we offer a commented reanalysis of a published result, carefully explaining how the Bayes factor can be adequately used to draw inferences. At the same time, we refer to some pitfalls that are important to avoid. We intend this part of the article to be used as a template of good practices for those wishing to use the Bayes factor in their work. The second objective of this article is to provide an overview of how the Bayes factor has been suboptimally handled by practitioners in published research. We offer an extension to the work of Wong and colleagues by covering a wider range of articles and assessment criteria. Furthermore, Wong et al. (2022) did not elaborate in detail on the main factors behind the identified problems. In this article, we offer an extended discussion that aims at going to the root of each problem. Specifically, we identified various reasons that may help explain the occurrence of such idiosyncrasies. This discussion is valuable because matters can only improve after the source of the problems has been clearly identified. On the basis of our discussions, we suggest possible avenues for future improvement.
The article is organized as follows. We start by offering a short introduction to the Bayes factor and how it can be used to test hypotheses (or perform model comparison in general). Next, we showcase the Bayes factor by analyzing data from a real example and discussing both good and less ideal approaches. We then summarize the main findings of the work of Wong and colleagues and present the details of the current study. After presenting the main results, we elaborate on the reasons that may help illuminate why these problems seem to occur more or less consistently. The article ends with a short summary of the previous discussion and with some constructive suggestions for the future.
The Concept of the Bayes Factor
The Bayes factor offers a means of comparing the predictive ability of two models (say, $\mathcal{M}_0$ and $\mathcal{M}_1$). It is defined as

$$\text{BF}_{10} = \frac{p(D \mid \mathcal{M}_1)}{p(D \mid \mathcal{M}_0)}, \qquad (1)$$

where $p(D \mid \mathcal{M}_m)$ is the marginal likelihood of the observed data $D$ under model $\mathcal{M}_m$,

$$p(D \mid \mathcal{M}_m) = \int p(D \mid \theta_m, \mathcal{M}_m)\, p(\theta_m \mid \mathcal{M}_m)\, d\theta_m, \qquad (2)$$

for $m = 0, 1$, with $\theta_m$ denoting the parameters of model $\mathcal{M}_m$.

The Bayes factor in Equation 1 offers a relative assessment of the probability of the observed data under the two competing models. For example, $\text{BF}_{10} = 10$ means that the observed data are 10 times more likely under $\mathcal{M}_1$ than under $\mathcal{M}_0$.

An alternative means of portraying the Bayes factor is based on assuming that $\mathcal{M}_0$ and $\mathcal{M}_1$ are the only two models under consideration, so that their prior probabilities are complementary: $p(\mathcal{M}_0) + p(\mathcal{M}_1) = 1$. Ratios of complementary probabilities are known as odds; for instance, $p(\mathcal{M}_1)/p(\mathcal{M}_0)$ is the prior odds of $\mathcal{M}_1$ against $\mathcal{M}_0$.

It is easy to show that the following equation holds:

$$p(\mathcal{M}_1 \mid D) = \frac{p(D \mid \mathcal{M}_1)\, p(\mathcal{M}_1)}{p(D \mid \mathcal{M}_1)\, p(\mathcal{M}_1) + p(D \mid \mathcal{M}_0)\, p(\mathcal{M}_0)}. \qquad (3)$$

Written this way, $p(\mathcal{M}_1 \mid D)$ is the posterior probability of model $\mathcal{M}_1$, that is, the plausibility of $\mathcal{M}_1$ after the data $D$ have been observed; the posterior probability of $\mathcal{M}_0$ is its complement.

By rewriting the posterior odds as the ratio of two instances of Equation 3, we obtain

$$\frac{p(\mathcal{M}_1 \mid D)}{p(\mathcal{M}_0 \mid D)} = \frac{p(D \mid \mathcal{M}_1)}{p(D \mid \mathcal{M}_0)} \times \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_0)}. \qquad (4)$$

From Equation 4 we see that the Bayes factor is equal to the ratio of the posterior odds to the prior odds. The Bayes factor is therefore a ratio of two odds, or an odds ratio.
The Bayes factor offers a rather general framework for model comparison. In the Bayesian framework, a “model” consists of two elements: a likelihood function (seen as a function of the data given one or more model parameters) and a set of prior distributions for the model parameters. A likelihood and a prior together yield a predictive distribution for the data. Using this predictive distribution, any two such models may be compared via the Bayes factor. In the social sciences, however, the Bayes factor is primarily used via NHBT (Tendeiro & Kiers, 2019). One of the models, the null model, stipulates that the model parameters of interest are equal to a constant (e.g., a true mean is exactly 0), or that several parameters are equal to one another (e.g., all true means are the same). Such hypotheses operationalize the concept of an “absence” of an effect or “invariance” of parameters (Rouder et al., 2009). An alternative model, then, is one that relaxes this constraint, typically by assigning the parameter of interest a prior distribution that allows for a range of plausible nonzero values.
A Worked-Out Example
Haeffel et al. (2023) conducted a series of studies to learn about cognitive vulnerability to depression (original data available at https://osf.io/umg9p). Their research focused on five different groups (Honduran young adults, Nepali adults, Western adults, Black U.S. adults, and U.S. undergraduates). Cognitive vulnerability was measured by means of the Cognitive Style Questionnaire (CSQ; Haeffel et al., 2008). We performed a reanalysis of one of their two-tailed independent-samples t tests, comparing mean CSQ scores between the U.S. undergraduate (USugrad) group and one of the other groups. The test’s null hypothesis states that both population means are equal.
Null hypothesis, alternative hypothesis, and prior assumptions
We assume that the CSQ scores are normally distributed in either group, with potentially different mean parameters (USugrad group: $\mu_1$; comparison group: $\mu_2$) and a common standard deviation $\sigma$. The standardized effect size is then $\delta = (\mu_1 - \mu_2)/\sigma$.
For the Bayesian independent-samples t test (Rouder et al., 2009), the null hypothesis states that $\delta = 0$, whereas under the alternative hypothesis $\delta$ is assigned a Cauchy prior distribution with location 0 and a scale parameter to be chosen by the analyst.
We, the authors, lack a deep insight on the topic of cognitive vulnerability to depression. It is therefore difficult for us to choose a well-informed prior. Experts may be able to argue that standardized differences larger than 0.1, or perhaps 0.3 or 0.5, are quite unlikely. Such information could be used to specify a prior. In our case, we settle on the default scale value of 0.707, but we also run a sensitivity analysis. This means that we consider the test result at various competing values of the scale parameter. Furthermore, priors with different values of the location parameter can also be explored. Do observe that priors symmetric around 0 allocate equal prior credence to effects of the same magnitude in either direction. This may not be reasonable or properly reflect the current state of affairs (e.g., it should be asked whether it is sensible that both positive and negative effects are deemed equally plausible a priori).
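To make the mechanics concrete, below is a minimal computational sketch of this test under the standard JZS formulation (Rouder et al., 2009), in which the marginal likelihood of the observed t statistic under the alternative hypothesis is obtained by averaging the noncentral t density over the Cauchy prior on the effect size. The t value and group sizes are hypothetical placeholders, not the values from Haeffel et al. (2023); in practice, software such as JASP or the BayesFactor R package performs this computation.

```python
# Minimal sketch of a JZS Bayes factor for an independent-samples t test,
# assuming the formulation of Rouder et al. (2009). All inputs are hypothetical.
import numpy as np
from scipy import stats, integrate

def jzs_bf10(t, n1, n2, scale=0.707):
    """BF10 for a two-sample t test with a Cauchy(0, scale) prior on delta."""
    df = n1 + n2 - 2
    n_eff = n1 * n2 / (n1 + n2)  # effective sample size
    # Marginal likelihood under H1: average the noncentral t density of the
    # observed t statistic over the Cauchy prior on the effect size delta.
    integrand = lambda delta: (stats.nct.pdf(t, df, delta * np.sqrt(n_eff))
                               * stats.cauchy.pdf(delta, scale=scale))
    m1, _ = integrate.quad(integrand, -np.inf, np.inf)
    m0 = stats.t.pdf(t, df)      # likelihood under H0 (delta = 0)
    return m1 / m0

# Sensitivity analysis: recompute the Bayes factor across prior scales.
for r in (0.1, 0.5, 0.707, 1.0, 1.414):
    print(f"scale = {r:5.3f}, BF10 = {jzs_bf10(t=0.9, n1=80, n2=75, scale=r):.3f}")
```

For a small observed t value such as this one, the loop typically shows the evidence for the null growing as the prior scale increases: wider priors commit the alternative hypothesis to larger effects, which the data then fail to support.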
Interpretation
The result of the test—the Bayes factor—is
Supporting the null hypothesis
The running example is one interesting case of the Bayes factor providing relative support in favor of the null hypothesis compared with the particular alternative hypothesis used. It is well known that classic frequentist procedures do not allow supporting the null (although equivalence tests do exist; Wellek, 2003). From the frequentist t test, all one could have concluded is a failure to reject the null, which is not the same as finding evidence in its favor.
Bayes factor versus posterior odds
Observe that the Bayes factor is a statement about the relative probability of the observed data under the two competing models. It is not a statement about the relative probability of the models themselves; for that, the Bayes factor must be combined with the prior odds to yield the posterior odds (Equation 4).
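A small numerical sketch of Equation 4 may help: the same (hypothetical) Bayes factor of 4 maps onto quite different posterior beliefs depending on the prior odds.

```python
# Minimal sketch of Equation 4: posterior odds = BF10 x prior odds.
# The Bayes factor and prior odds values are hypothetical.
bf10 = 4.0
for prior_odds in (1.0, 0.25, 10.0):    # p(M1)/p(M0) before seeing the data
    posterior_odds = bf10 * prior_odds  # Equation 4
    p_m1 = posterior_odds / (1 + posterior_odds)
    print(f"prior odds = {prior_odds:5.2f} -> p(M1 | D) = {p_m1:.3f}")
```

Only when the prior odds equal 1 do the Bayes factor and the posterior odds coincide; reading the Bayes factor as posterior odds in any other situation misstates one's beliefs.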
Relative evidence, priors, and labels
Hypothesis testing, or model comparison more generally, is an inherently relative endeavor. The merits of any one hypothesis are dependent on what other hypothesis we choose for the test. This is true regardless of the inferential paradigm of choice (frequentist or Bayesian), but it is perhaps more exacerbated in Bayesian testing because of the role played by prior distributions. Making absolute statements favoring one hypothesis (while disregarding its testing counterpart) is better avoided (see QRIP 4). Furthermore, sensitivity analyses showing how the Bayes factor reacts to varying priors are important. Figure 1 shows how the Bayes factor for our test varies as a function of the scale of the Cauchy prior under the alternative hypothesis. It can be seen that there is relative evidence in favor of the null hypothesis across a wide range of scale values.

Figure 1. Analysis of sensitivity to prior width for the independent-samples t test.
Bayes factor and effect size
The Bayes factor is not a valid measure of the effect size (see QRIP 7). For example, for the Bayesian t test above, widely different effect sizes can be associated with similar Bayes factor values, depending on the sample sizes and the priors used.
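As an illustration, the sketch below reuses the hypothetical jzs_bf10() function from the earlier sketch to compare a tiny standardized effect observed in very large samples with a large effect observed in small samples; the two analyses can produce comparable levels of evidence despite the very different effect sizes.

```python
# Minimal sketch: similar Bayes factors can arise from very different effect
# sizes once sample size varies. Requires jzs_bf10() from the earlier sketch;
# all effect sizes and sample sizes are hypothetical.
import numpy as np

def t_from_d(d, n1, n2):
    # t statistic implied by an observed standardized mean difference d
    return d * np.sqrt(n1 * n2 / (n1 + n2))

print(jzs_bf10(t_from_d(0.10, 2000, 2000), 2000, 2000))  # tiny effect, huge n
print(jzs_bf10(t_from_d(0.80, 25, 25), 25, 25))          # large effect, small n
```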
Presence versus absence
Simplistic phrasing of research hypotheses, such as “the effect is present” versus “the effect is absent,” glosses over the fact that a Bayesian test compares two specific operationalizations of these claims. Any conclusion about presence or absence is therefore contingent on the particular models, including priors, that were compared (see QRIP 5).
Inconclusive evidence
Bayes factors of (about) 1 imply that the observed data are equally likely under either hypothesis under comparison (Equation 1). In other words, there is lack of evidence either way. This should not be confused with evidence of absence, that is, that it is likely that there is no effect (see QRIP 9). A simple analogy is that of a nonsignificant frequentist test result. For the running example, a Bayes factor close to 1 would have meant that the data were roughly equally likely under both hypotheses, warranting no updating of beliefs in either direction.
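A one-line computation makes the point: multiplying even prior odds by a (hypothetical) Bayes factor near 1 leaves the posterior probability essentially where the prior was, which is a statement of ignorance, not of absence.

```python
# Minimal sketch: a Bayes factor near 1 barely moves prior beliefs.
bf10 = 1.05                                   # hypothetical near-1 Bayes factor
posterior_odds = bf10 * 1.0                   # assuming prior odds of 1
print(posterior_odds / (1 + posterior_odds))  # ~0.512: essentially no updating
```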
The Bayes Factor in Applied Research
In the previous section we provided a detailed account of how to use the Bayes factor by means of an example. Although the origins of the Bayes factor go back about 100 years (Etz & Wagenmakers, 2017), interest in its use in applied work only took off in the 1990s with the seminal article by Robert Kass and Adrian Raftery (Kass & Raftery, 1995). In addition, the availability of faster computers and dedicated software such as JASP (JASP Team, 2023) and BayesFactor (Morey & Rouder, 2021) facilitated a wider adoption of this tool in practice in the last, say, 10 years. It is therefore natural to ask how well practitioners have been dealing with the Bayes factor in applied research. However, there is not a lot of literature on this topic. To the best of our knowledge, Wong et al. (2022) is the only article of the kind. Because the current article builds on Study 1 of Wong et al. (2022), here we present a brief summary of its main findings. We then present the details of our extension.
Wong et al. (2022)
Study 1 of Wong et al. (2022) is a small peer-reviewed literature study of 73 published applied articles. The study focused exclusively on how researchers used NHBT. Each article was inspected, and the occurrence of any of the eight QRIPs was marked down. Table 1 identifies and provides a brief description of each QRIP, together with the corresponding incidence in the sampled articles. As can be seen, the three most common QRIPs were the third (incomplete reporting of prior distributions), fourth (not referring to the comparison of models), and fifth (making absolute statements). Wong et al. (2022) also recorded other occurrences of QRIPs beyond those mentioned in Table 1. One in particular was found often (21.9%): Bayes factors close to 1—which should imply that the models under comparison were relatively equally predictive of the observed data—were instead interpreted as supporting the null model of absence.
QRIPs for Null Hypothesis Bayesian Testing
Note: Retrieved from Wong et al. (2022). QRIP = questionable reporting or interpreting practice. BF = Bayes factor.
The current study
This article used the setup and findings from Wong et al. (2022) as a template, and both conceptually replicated and substantially extended their study design. We describe the details of our study in the Method section. Here we just list the main additions of our study to that by Wong et al. (2022):
Method
Article selection
The first author (J. N. Tendeiro) performed the article selection. An advanced search for research articles was conducted on Google Scholar on December 22, 2021, using the key (
We further complemented our sample with the result from an advanced search on the Web of Science on November 29, 2021, using the following key:
Article grading
We independently graded 10 articles randomly selected from the sample of 167 articles. Each of us graded the same 10 articles. The purpose of this pilot study was to calibrate the grading procedure to be used in the entire sample. Prior to the pilot study, we decided to cover the eight criteria listed in Table 1 plus two more:
#9: When faced with an inconclusive Bayes factor (e.g., a value close to 1), conclude that there is evidence in favor of the null hypothesis.
#10: Interpret the Bayes factor simply using cutoffs (e.g., 1–3, 3–10).
We discussed the results in a group meeting. The ratings among the five of us were largely in agreement. We focused on aspects for which some disagreement existed, as well as on adjustments that would streamline the assessment. As a result, we decided on the following grading plan for all articles:
Exclude the second (not specifying null and alternative hypotheses) and the eighth (mismatch between statistical and research hypotheses) criteria. The main argument in favor of the exclusion is that these criteria are not necessarily related to the Bayes factor per se (i.e., they could also be observed in articles resorting to NHST).
The third criterion (incomplete reporting of prior distributions) was replaced by three narrower criteria: #3a: The reason or justification for the chosen priors is not provided.
#3b: It is unclear which priors were used under either model.
#3c: The information on priors is incomplete (e.g., only the distribution family, but not the specific distribution used, is provided).
To attempt a thorough characterization of the practical use of the Bayes factor in applied research, we further included three extra criteria. These are descriptive in nature and do not necessarily reflect misuses of the Bayes factor. Instead, they aim at providing a more fine-grained characterization of how and why the Bayes factor was used. Thus, we do not refer to them as QRIPs:
A: Justifying using a prior because it is “the” default
B: Arguing to use the Bayes factor to be able to draw support for null findings from NHST
C: Arguing that the Bayes factor allows distinguishing between the presence and the absence of an effect
To complicate matters further, it is also important to realize that the Bayes factor typically does not permit a strict separation between any two models under comparison. Bayesian model comparison proceeds by the accumulation of evidence either way; it does not function logically as the proof of a mathematical theorem does. Thus, authors claiming to use the Bayes factor to “establish,” or “distinguish,” the existence or absence of an effect may be surprised to learn that their desideratum is quite difficult to achieve. In our study, we identified articles that explicitly claimed to have used the Bayes factor with this particular motivation in mind.
Table 2 lists the criteria used to classify the sampled articles. We kept, and extended, the original numbering from Wong et al. (2022) for consistency.
The above inspection was conducted by reading through all sections of the papers except the abstract. The abstract is a rather condensed text in which, we speculated, some types of reporting problems are more likely to occur. After conducting the study, we therefore went through all the abstracts and flagged all criteria separately from the rest of the articles. We report these results in a separate section.
All supporting files that complement this article can be found at https://osf.io/57ew4.
Criteria Used
Note: QRIP = questionable reporting or interpreting practice; BF = Bayes factor; NHST = null hypothesis significance testing.
Results
The frequencies and percentages associated with each evaluated criterion are given in Table 3. As can be seen, only four of the 10 QRIPs (3c, 7, 9, and 10) were relatively rare (present in less than 10% of the articles). Overall, 149 articles (89.2%) displayed at least one QRIP, and 104 articles (62.3%) displayed at least two QRIPs.
Number of Articles Displaying the Corresponding Criterion
Note: QRIP = questionable reporting or interpreting practice.
Table 4 shows the co-occurrence of pairs of criteria. The supplementary material available at the OSF further includes more tabulations of these data that help clarify the results. We refer to these tables in the discussion that follows to better characterize each identified problem.
Frequencies of the Occurrence of Pairs of Criteria
Note: Dashes are used to indicate a diagonal line from the top left of the table to the bottom right. Counts are shown under this diagonal, percentages are shown above it, and missing entries are equal to 0.
Discussion of the Results
In what follows, we revisit each criterion that we included in our study. We list arguments that may help clarify why the observed issues are occurring as frequently as found in our study. This is the result of a joint discussion between the authors over these matters.
QRIPs 1 and 6
QRIP 1 concerns defining the Bayes factor as if it were a posterior odds. Equation 4 shows that the Bayes factor equals the posterior odds only in the special case in which the prior odds are equal to 1. In other words, only when both models under comparison are a priori equally likely can the Bayes factor be interpreted as posterior model odds. However, in 13.2% of the articles we found that Bayes factors were simply introduced as if they were posterior odds, without an explicit statement that prior odds equal to 1 were assumed. For example: These Bayes Factors can be readily interpreted as a ratio of evidence in favour of the experimental effect compared to the null effect. For example, a
Or take this example: “For instance, a Bayesian analyses . . . produced a JZS Bayes Factor of 3.74. According to Jeffreys (1961), this result indicates that there is some evidence for H0 over H1 (i.e., the hypothesis that gender is not associated with ODL scores is about three to four times more likely than the hypothesis that gender is associated with ODL scores, based on our sample’s results). (
Here is another example: “The alternative hypothesis is 2 times more likely than the null hypothesis (
We discussed these findings and tried to explain them. We can summarize our main explanations in four points.
Lack of knowledge
It is entirely possible that practitioners do not yet master the basics of the Bayes factor. This is a natural explanation that applies equally to most of the QRIPs that follow, and we do not repeat it further. The main argument is that Bayesian hypothesis testing is still relatively novel for most practitioners, certainly compared with frequentist inference.
Principle of indifference
Some researchers may be implicitly assuming that the prior odds equal 1, that is, that a priori both models under comparison are equally likely, following the advice of Jeffreys. If so, the problem may be perceived as one of lack of communication.
Bayesian versus classical approaches
Many introductory texts to Bayesian inference capitalize on the fact that the Bayesian framework, unlike the classical one, allows probability statements about the hypotheses themselves. Researchers sold on this advantage may then expect every Bayesian quantity, the Bayes factor included, to be a statement about the probability of the hypotheses, which invites reading the Bayes factor as posterior odds.
Cognitive dissonance
It is possible that some researchers are aware of the issue. However, they also realize that they followed recommendations to use Bayes factors even though Bayes factors cannot be interpreted as the posterior odds they actually wished for. To alleviate this cognitive dissonance, they convince themselves that they are entitled to “somewhat extend” the realm of the Bayes factor to what Bayesian inference at large does.
QRIPs 3a, 3b, and 3c; Usage A
These four reporting styles concern how researchers deal with prior distributions when using Bayes factors. In almost one third of the articles, nothing about priors was mentioned (QRIP 3b; 29.9%). Incomplete information regarding the priors used was not a frequent issue (QRIP 3c; 6%). It sometimes happened that the priors used were mentioned but no justification was provided (QRIP 3a; 10.8%), or the authors simply stated that they used the software’s default priors (Usage A; 35.3%). In total, 130 articles (77.8%) displayed at least one of these reporting styles. Our arguments explaining this state of affairs are summarized as follows.
Too little space
Text space in most journals comes at a premium. Researchers are used to writing succinctly whenever possible, saving space to highlight the main results of their studies. This may work against a thorough presentation of the analytical details in the methods and results sections of articles. We found that, among articles reporting priors (i.e., not committing QRIP 3b), eight (6.8%) placed such information in supporting materials (supplements or appendices), although only one of these eight articles was subject to a journal word limit. Furthermore, of the articles reporting incomplete information regarding the priors used (QRIP 3c), three (30%) were published in journals with a strict word limit. Thus, at least to some extent, the pressure to write concisely may be conditioning the way explanations are provided. This argument is a plausible explanation for QRIPs 3a, 3b, and 3c and, to some extent, Usage A as well.
Habits inherited from NHST
Specifying alternative hypotheses and hypothesizing effect sizes of interest are essential to conducting power analyses in Neyman-Pearson-based NHST. Nevertheless, conducting power analyses is rare in practice. As a consequence, researchers pay relatively little attention to the alternative hypothesis even when conducting frequentist analyses. It is possible that this mindset is being carried over to NHBT, which would explain the neglect of the importance of priors in Bayesian testing as well.
QRIP 4
Bayesian evidence is relative. This means that the quantification of the merits of one model strongly depends on which other model is used for the comparison. As obvious as this may sound, it is very surprising that more than 60% of the articles seem to gloss over this fact. Here are two such examples: “With this ‘stronger’ VB05 prior, we found strong evidence for the null hypothesis (
Writing style
To some extent, we think that the economical way in which researchers write their articles can partly explain this result. Having to repeatedly write expressions such as “the Bayes factor indicates that the data are X times more likely under Model A than under Model B” becomes taxing after some time. It is very likely that some researchers simply choose to omit parts of such phrasing for the sake of convenience.
Implicitly assumed
This explanation is strongly tied to the previous one. We found articles that in some instances explicitly referred to the relative nature of the evidence but in other instances did not. In addition to writing style, it is perhaps assumed that the reader understands what is happening. As a consequence, dropping some words along the way may be perceived as “acceptable.”
Increased impact
Ascribing evidence to only one of the models may also be a strategy to amplify the perceived strength of the results. The second example above illustrates this well. It sounds stronger to report only “support for the null hypothesis of absence” than to report “support for the null hypothesis of absence over one possible operationalization of the alternative hypothesis of existence.” The shorter way of reporting the result is “fancier” and easier to sell in an abstract or a talk, for example.
QRIP 5 and Usage C
As discussed before, researchers seem irresistibly drawn toward using the Bayes factor to establish the presence of an effect, or the lack thereof. Our account of Usage C indicates that 18% of the articles referred to this desideratum. In addition, 35% of the articles relied on the Bayes factor to make statements about the existence (or lack thereof) of effects (QRIP 5). Here are two examples: “For 6-year-olds, there was no difference between environments (
Increased impact
As with QRIP 4, one possible explanation is the desire to enhance the results (i.e., to overclaim).
Avoiding uncertainty
Relatedly, the generalized lack of modesty that permeates published research (Hoekstra & Vazire, 2021) may also help explain this phenomenon. In fact, many researchers seem averse to acknowledging the uncertainty in their experiments and data analyses.
Writing style
We think that some authors may find a misleading but concise expression such as “there is a difference between the two groups” easier to write than a fully qualified statement about the relative evidence for the hypotheses under comparison.
Influence from NHST
This is directly related to the previous point. Old habits of reporting statistical results from NHST may also help clarify the situation. Strictly speaking, a “statistically significant” outcome simply states that data at least as extreme as those observed would be unlikely were the null hypothesis true. It is a statement about the data under a particular hypothesis and not about any of the hypotheses. Likewise, a similar situation occurs with the Bayes factor, and QRIP 5 expresses the same confusion.
Decision-making
Testing two hypotheses need not always end with a decision between the two. In many cases, reporting the relative plausibility of the two hypotheses should suffice. But this strategy may be perceived as “too nuanced” or even “incomplete.” Thus, instead of conducting a detailed cost-benefit analysis, and under pressure to choose between hypotheses, researchers may fall into QRIP 5’s trap and declare the existence or absence of the effect under study.
QRIP 7
Few articles (seven; 4.2%) considered the Bayes factor as an effect-size measure. Here is one example: “Pupil size was larger in a higher tracking load. . . . However, the Bayesian test showed only positive, but smaller, effect of Load on tracking pupil size (
p values and effect sizes
QRIP 7 may be the Bayesian counterpart to the wrongful association between statistical and practical significance. It is well known that even the tiniest of effects may become “statistically significant” provided that we have access to enough data. Likewise, widely different effect sizes can be associated with similar levels of evidence as indicated by the Bayes factor, depending on the priors used (Wong et al., 2022). Some researchers may make the same mistake as they make with small p values, namely taking strong evidence to imply a large effect.
Bayes factor labels
It is possible that commonly used labels to qualify levels of evidence (e.g., “moderate,” “strong”) contribute to this confusion: a label such as “strong evidence” may be misread as indicating a strong effect.
QRIP 9
Bayes factors close to 1 imply that the evidence for either model under comparison is about the same. Erroneously, in a small set of articles (six; 3.6%), researchers instead concluded that they had found evidence for the null model of no effect when reporting Bayes factor values close to 1. For example: “In contrast there was no difference in meaning between the thinking without examples and planning conditions; the Bayes factor provided anecdotal evidence in favor of the null ( The difference was significant in the
Influence from NHST
A nonsignificant outcome should imply a noncommittal attitude toward the null hypothesis. However, too often researchers interpret nonsignificant findings as “evidence for the null” (e.g., Goodman, 2008). We think that it is possible that this unfortunate reasoning may be resurfacing within Bayesian testing in the form of QRIP 9.
Absence as default
This explanation is closely related to the previous explanation. From NHST tradition, the null model (typically, of absence) is the hypothesis that researchers try to nullify. Faced with an absence of evidence against the null model, researchers fail to reject the null model and retain it instead. The decision to retain the null model need not necessarily reflect belief in the null model, however. From a Neyman-Pearson point of view, retaining or accepting the null hypothesis reflects only a decision to act as if the null were true; it does not express a degree of belief in the null.
Dichotomization
Hypothesis testing is inherently a dichotomous inferential exercise. Such dichotomization helps create a clear divide between a null model of absence and an alternative model of presence. It is then possible that, when faced with inconclusive evidence (i.e., Bayes factors close to 1), researchers are prone to choose the absence side of the dichotomy, partly for the two reasons below.
Increased impact
It sounds arguably stronger to say that there is “evidence of an absence of an effect” rather than to say “the evidence between absence and existence is ambiguous.”
Preference for parsimony
The phrasing in the previous explanation not only sounds stronger but is also simpler. We think that perhaps some form of Occam’s razor is at play here, with researchers erring by preferring the simpler way out (see, e.g., Gallistel, 2009). We note, however, that the Bayes factor already incorporates a preference for simpler models (Jefferys & Berger, 1991), so an additional preference for parsimony should be justified explicitly.
QRIP 10
Basing the interpretation of Bayes factors on qualitative labels associated with ranges of values is the core of this QRIP. We observed this phenomenon in nine articles (5.4%). Here is one instance: Both disgust and fear were experienced more in the experimental group (
Summary
In the article from which the example above was retrieved, there are six Bayes factors being interpreted (given in a table). The authors may have considered it to be too verbose to interpret each Bayes factor individually.
Seeking authority
Resorting to interpretative labels has the major advantage of being able to quote others to back up one’s own results. In this sense, researchers need less effort to determine the strength of the evidence that they found (i.e., they need not “think”).
Avoid criticism
Related to the previous explanation, using labels may be perceived as a means of protection against criticism aimed at the inherent subjectivity of interpreting Bayes factors. Thus, any questions concerning the perceived strength of the evidence can be deferred to the Bayes factor label system that was used.
Repeat literature
Most introductions to Bayesian hypothesis testing refer to at least one label system for the Bayes factors. Some researchers may have found such systems compelling to the point of excessively relying on them.
NHST
Using labels such as “significant” or “nonsignificant” is commonplace in frequentist inference. It is possible that some researchers are projecting the same kind of reporting behavior onto the Bayes factor.
Usage B
Twenty-seven articles (16.2%) mentioned that they used the Bayes factor as a follow-up to nonsignificant results from NHST. For example: In order to address the possibility that this study was underpowered (among other reasons), we also incorporated Bayesian analyses, which do not require a stopping rule (e.g., Rouder, 2014). If a
Below are some considerations related to this particular motivation for using the Bayes factor.
Support H0
Very clearly, the desire to draw support for the null hypothesis is the most logical explanation. Supporting the null hypothesis is not allowed in NHST, and thus the Bayes factor is seen as advantageous (see, e.g., Dienes, 2014).
Trojan horse
The Bayes factor’s ability to draw relative support for the null hypothesis is one of its most touted advantages. We speculate whether, for some researchers, it was precisely this purported advantage that drew them to the Bayes factor.
Request from reviewers
Given that the use of Bayesian hypothesis testing is growing, it is also possible that reviewers are explicitly requesting this type of analysis.
QRIPs in Abstracts
We also looked at the occurrence of each criterion in the abstracts. The most prominent QRIPs are those associated with short and catchy reporting: QRIP 4 (evidence reported as absolute instead of relative) occurred in 24 abstracts (14.4%) and QRIP 5 (reporting the presence or absence of effects) in 10 (6.0%). Seven articles (4.2%) explicitly referred to a general goal of establishing the absence or presence of a particular effect, for which the Bayes factor would be of use (Usage C).
In general, the main questionable reporting practices that we identified in abstracts seem directly related to the fact that abstracts are meant to be short. The pressure to write an appealing abstract may also help explain our findings. Of course, authors should refrain from this habit in order to prevent distortions in the published literature.
Summary and Recommendations
In the previous section we outlined various possible causes for the problems we identified. In short, we think that the main causes include a basic lack of understanding, omission of important information, unfamiliarity with handling prior distributions, writing styles that overemphasize impact and deemphasize uncertainty, and a desire to present a dichotomous decision as the test’s final outcome.
In addition to the anticipated problems that we identified in our article reading (per Table 2), we also made note of a few other problems (see the Supplementary Material available at the OSF). Here we mention three such occurrences. First, we identified a few articles in which the authors seemed to conflate the concept of “evidence” (i.e., how the data allow us to update our belief) with that of “belief” (i.e., how likely we think each hypothesis is after observing the data). This is related to QRIP 1. Second, there were authors who seemed to think that Bayesian statistics is less reliant on model assumptions. This is misguided. In fact, Bayesian statistics has the potential to bring models and their underlying assumptions to the analysis forefront. This is not always the case with frequentist statistics (e.g., the set of data “at least as extreme as” is not always clearly defined; Lindley, 1993). Finally, some authors were under the impression that Bayes factors could be used to test model fit. Perhaps surprisingly, Bayes factors do not fare well in what concerns model fit. The strength of the Bayes factor is to quantify the relative predictive ability of two models. One model may outpredict a competing model while at the same time fitting the data quite poorly (although probably better than the model it outpredicted). Our advice is to always assess model fit separately from testing with the Bayes factor.
Altogether, our findings provide a clearer image of the ongoing problems related to the use of the Bayes factor in practice. To address the current state of affairs, we also wish to offer some constructive suggestions aimed at improving matters going forward. Figure 2 shows our suggestions and how they are meant to attend to each QRIP. Below we briefly summarize each of our proposals.

Figure 2. Summary of the potential causes for the problems identified in the literature study and suggestions for potential solutions. For each potential cause (left), the QRIPs that we anticipate follow as a consequence are listed. Potential solutions (right) are linked back to the causes to which we expect they most directly apply. QRIP = questionable reporting or interpreting practice; BF = Bayes factor.
Learning materials
Introductions to the Bayes factor commonly start by highlighting problems with the p value and NHST. We suggest that learning materials also place clear emphasis on the following points:
There is a difference between the concepts of the Bayes factor (the evidence) and the posterior odds (the belief; QRIPs 1 and 6).
Prior odds must be specified whenever there is interest in the posterior odds. Reporting posterior odds without prior odds is, at best, not ideal because it leaves the reader guessing what the authors’ prior odds were to start with.
Reporting the priors used is crucial (QRIPs 3a, 3b, and 3c). Furthermore, and as much as possible, the motivation for choosing such priors should also be provided.
It is important to conduct sensitivity analyses to assess the influence of the priors on the Bayes factor. In our study, only 26 papers (15.6%) explicitly referred to sensitivity analysis.
The Bayes factor is only a measure of the relative evidence between the two models under comparison (QRIP 4).
It is most likely impossible that the Bayes factor of one isolated study can be used to definitively establish the presence or absence of an effect (QRIP 5; Usage C).
It is important to always provide a full account of the interpretations in the article. We do realize, however, that this is difficult without becoming overly repetitive. One suggestion is that authors add to the description of the statistical analysis in the methods section something like this: “Whenever we interpret a test result as providing support for one of the hypotheses, we mean that the evidence supports this hypothesis over the selected competing hypothesis.” At the very least, we strongly suggest that authors follow this practice for the key outcomes of their studies.
The Bayes factor is not an effect-size measure (QRIP 7).
Understanding the difference between the absence of evidence and evidence of absence is essential (QRIP 9).
The Bayes factor value should always be reported (QRIP 10). This is the Bayesian equivalent of requesting the exact p value rather than a mere “significant/nonsignificant” verdict.
Checklist
We prepared a checklist that practitioners may use as guidance, at least throughout their first interactions with Bayesian hypothesis testing (see Appendix). This checklist highlights which aspects should be reported, either in the article itself or possibly in the supporting materials. We think that by using such a checklist researchers will feel reassured that they are taking all the important steps in their analysis. The checklist may also help journals and reviewers develop standardized guidelines by which authors must abide. This may further contribute to increasing authors’ awareness of these issues.
Supplementary material
Our checklist is thorough and may well lead to more information than one is willing to incorporate in an article. Relegating some information to the supporting materials is a valid solution in such cases. Authors may want to resort to free and publicly available repositories such as the OSF for this purpose. Journals may also promote the practice of sharing supporting materials that include the information detailed in the checklist on their websites. One suggestion is to use supplementary material (if needed) to fully report the priors used and the motivation for choosing them. It is important to keep in mind that priors are part of the models; therefore, any inference is contingent on the chosen priors. In this sense, failing to report priors may be considered as much of an error as failing to report that one assumes normally distributed data, for example. Another suggestion is to place the results of sensitivity analyses in the supplementary materials.
Accept uncertainty
Statistical tools should be used within their own bounds. All the Bayes factor offers is a means of gathering relative evidence in favor of either hypothesis put to a test. This does not equate to a formal proof, as if it were a mathematical theorem. We suggest that researchers adjust their expectations to what the Bayes factor permits. In particular, it is important to avoid the dichotomization trap that hypothesis testing typically entails. If a decision is really needed, and in particular if the stakes are high, it is perhaps best to consider statistical decision theory (Berger, 1993). It is also important to report effect sizes to complement test results.
Alternative inferential procedures
Testing, in particular null hypothesis testing, may not be what researchers need at all times. Some researchers have questioned the role of point null hypotheses (e.g., Vardeman, 1987). It is important to point out that alternatives do exist. One option is to use interval null hypotheses (Morey & Rouder, 2011). But often a research question may be well addressed by resorting to estimation instead. Arguably, estimation may offer what testing does, and more (Tendeiro & Kiers, 2023).
Conclusion
In this article we charted the current state of affairs concerning the use of the Bayes factor in applied research. Our findings suggest that current practices are at best suboptimal. This happens in spite of Bayesian inference in general, and the Bayes factor in particular, being often described as more intuitive than frequentist inference (Kruschke & Liddell, 2018). We think that the problem is real and needs to be addressed for the quality of research to increase.
Some of the numbers appear small; for example, we found that 3.6% of the articles committed the error (QRIP 9) of confusing Bayes factors of about 1 with evidence of absence. We note that the error rates we report are marginal error rates, but important error rates—such as the probability of committing an error given that the situation arises—should be higher. For instance, one can commit QRIP 9 only if the Bayes factor is around 1. Following a suggestion from a reviewer, we computed the proportion of occurrences of QRIP 9 among all occurrences of Bayes factor values between 1/3 and 3, the region typically deemed inconclusive.
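The sketch below illustrates the arithmetic behind this distinction; the count of articles reporting a Bayes factor in the inconclusive region is a hypothetical placeholder, not a figure from our data.

```python
# Minimal sketch of marginal vs. conditional rates for QRIP 9.
n_articles = 167       # articles in our sample
n_qrip9 = 6            # articles committing QRIP 9 (Table 3)
n_bf_near_1 = 20       # HYPOTHETICAL count of articles with 1/3 < BF < 3
print(n_qrip9 / n_articles)   # marginal rate: ~0.036
print(n_qrip9 / n_bf_near_1)  # conditional rate: 0.30, much higher
```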
In addition to reporting the identified problems, we also attempted to explain what reasons may be behind each problem. Naturally, our arguments are not evidence-based. Future research aiming at a more fine-grained understanding of the current situation would be extremely helpful.
We have offered some suggestions for actions to be taken that may contribute toward improving the situation. We think that what is needed is a better understanding of the effect of prior distributions, the difference between posterior odds and the Bayes factor, the importance of providing thorough reports of the analyses conducted (Kruschke, 2021; van Doorn et al., 2021), the need to explain the choices made, the disconnect between Bayes factors and effect sizes, and what it takes to establish that a particular effect is absent or present. Also, carrying frequentist preconceptions over into the Bayesian world is not advisable.
The way forward is not to ban Bayesian inference from our toolbox. Instead, more and better education on Bayesian inference is needed. We think that future work should use findings from Wong et al. (2022) and this article to shape improved educational materials. Better showcasing how Bayesian inference can be correctly used will empower applied researchers and improve the quality of the published scientific findings.
