Abstract
If the scientific literature were a faithful representation of the research scientists conduct, a cumulative science would be a powerful tool to infer what is true about the world. When random error is the only threat to the accuracy of individual findings, aggregating across many findings allows inferences about the presence and size of effects with a certain reliability. But when published findings are systematically biased, cumulative science breaks down: Unlike random error, bias does not cancel out when aggregating across studies—in the worst case, it accumulates, leading away from the truth rather than toward it. Unfortunately, there is reason to believe that the psychology literature is not a faithful representation of all research psychologists conduct.
Since the 1950s, scientists have repeatedly noted a suspiciously high “success” rate in psychology: Studying 362 empirical articles published in four psychology journals from 1955 to 1956, Sterling (1959) found that 97.28% of studies using significance tests rejected the null hypothesis. A later replication of this study reported 95.56% statistically significant results in articles from 1986 to 1987 (Sterling et al., 1995). Likewise, in a seminal study, Fanelli (2010) analyzed authors’ verbal conclusions in hypothesis-testing articles sampled from the literatures of 20 disciplines and found that 91.5% of articles published in psychology claimed support for their first hypothesis—the highest estimate of all disciplines in the study. For these percentages to be a realistic representation of the research psychologists conduct, both statistical power and the proportion of tested hypotheses that are true (i.e., the prior probability that the null hypothesis is false) must exceed 90%. Put differently, nearly all predictions researchers make must be correct, and either the studied effects or the samples used (given the same design) must consistently be very large. These two assumptions appear highly implausible a priori, and available evidence on average statistical power in the literature shows that at least one does not hold (e.g., Szucs & Ioannidis, 2017).
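The arithmetic behind this claim can be sketched with the standard identity for the expected rate of significant results in an unbiased literature: with a proportion of true hypotheses (prior), power for those hypotheses, and a false-positive rate α, the expected positive result rate is prior × power + (1 − prior) × α. A minimal sketch (the function name and example values are ours, not taken from the article):

```python
# Expected share of significant results in an unbiased literature:
# P(significant) = prior * power + (1 - prior) * alpha

def expected_positive_rate(prior, power, alpha=0.05):
    """Probability of a significant result when a fraction `prior` of
    tested hypotheses is true, all tests of true hypotheses have the
    given power, and false positives occur at rate alpha."""
    return prior * power + (1 - prior) * alpha

# Even with 90% true hypotheses tested at 90% power, the expected rate
# falls well short of the ~91.5% observed by Fanelli (2010):
print(expected_positive_rate(0.90, 0.90))  # 0.815

# Matching 91.5% requires both quantities to exceed roughly 90%:
print(expected_positive_rate(0.95, 0.96))  # 0.9145
```

This makes concrete why a 91.5% positive result rate forces both the prior and power above 90% in the absence of bias.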
A Biased Literature
A more plausible explanation for these numbers may be a selection bias toward statistically significant results in the published literature. We can distinguish two broad categories of bias: “publication bias” and “questionable research practices” (QRPs). Publication bias describes publishing behaviors that give manuscripts which find support for their tested hypotheses a higher chance of being published than manuscripts with “negative” results. These include editors and reviewers selectively rejecting manuscripts with negative results (reviewer bias; Greenwald, 1975; Mahoney, 1977) and researchers deciding not to submit studies with negative results for publication (file-drawering; Rosenthal, 1979). QRPs describe research behaviors that make evidence in favor of a certain conclusion look stronger than it is (typically, although not always, leading to more false positives; see Lakens, 2019). These include presenting unexpected results as having been predicted a priori (hypothesizing after results are known [HARKing]; Kerr, 1998) and exploiting flexibility in data analysis to obtain statistically significant results (p-hacking; Simmons et al., 2011).
Some authors have argued that negative results are often uninformative or the result of low-quality research and should not be published at the same rate as positive results to avoid cluttering the literature (e.g., Baumeister, 2016; Cleophas & Cleophas, 1999; Mitchell, 2014). If most negative results that are currently missing from the literature are indeed due to immature ideas or poor methods, a literature that selects studies based on quality instead of results should contain a similar proportion of positive results as the current one. How many positive and negative results would such an unbiased literature contain in reality? We investigated this question by comparing the rate of positive results in the psychology literature with studies published in a new format designed to minimize publication bias and QRPs: Registered Reports (RRs).
Methods to Mitigate Bias
An increasingly popular proposal to reduce bias is preregistration, in which authors register a time-stamped protocol of their hypotheses, methods, and analysis plan before data collection (for a historical overview, see Wiseman et al., 2019). Preregistration is thought to mitigate QRPs by preventing HARKing and by reducing opportunities for undisclosed flexibility in data analysis. Preregistration alone, however, does not address publication bias.
RRs are a publication format with a restructured submission timeline: Before collecting data, authors submit a study protocol containing their hypotheses, planned methods, and analysis pipeline, which undergoes peer review. If successful, the journal commits to publishing the final article following data collection regardless of whether the hypotheses are supported (in-principle acceptance). The authors then collect and analyze the data and complete the final report. The final report is peer reviewed again but, this time, only to ensure that the registered plan was adhered to and stated conclusions are justified (and, if applicable, that the data pass prespecified quality checks). RRs thus combine an antidote to QRPs (preregistration) with an antidote to publication bias because studies are selected for publication before their results are known. Since its introduction in 2013, the format has rapidly gained popularity and is offered by 256 journals at the time of writing (see Center for Open Science [COS] website, http://cos.io/rr).
In addition to reducing bias, RRs are designed to ensure high standards for research quality. First, predata peer review increases the chance that methodological flaws and immature ideas will be identified and addressed before a study is conducted. Second, authors typically have to include outcome-neutral control conditions that allow verifying data quality once results are in (studies failing these quality checks may be rejected). And third, many journals offering RRs require that hypothesis tests are planned with high statistical power, reducing the risk of false negatives (e.g., 90% power for a given effect size of interest).
The Current Study
The goal of our study was to test whether RRs in psychology have a lower positive result rate than articles published in the traditional way (referred to hereafter as standard reports [SRs]).
We set out to compare all published RRs in psychology with a new sample of SRs obtained by replicating Fanelli (2010). Fanelli searched for articles containing the phrase “test* the hypothes*,” drew a random sample of 150 articles per discipline, and coded whether the first hypothesis in each article had been supported. For SRs, we used the same sampling method (restricted to the psychology discipline); for RRs, we relied on a database curated by the COS. We chose this method because Fanelli’s 2010 and 2012 studies (both use the same coding method) have been highly influential and because it can easily be applied to a large set of studies. Because we expected many more RRs than SRs to be close replications of earlier studies—and perhaps motivated by skepticism of the original results—we additionally examined the role of replications in our analysis.
In a recent commentary, Allen and Mehler (2019) reported a similar investigation: With a self-developed coding method, they surveyed the 127 biomedical and psychology RRs listed in the COS database as of September 2018 and found 60.5% unsupported hypotheses across all included RRs (counting all hypotheses in each article). A major advantage of our study, which was planned around the same time (we were unaware of Allen and Mehler’s parallel efforts), is the ability to directly compare RRs with the standard literature. In addition, we replicate Fanelli (2010) and provide data to evaluate his method: The search term “test* the hypothes*” might introduce selection effects, meaning that results obtained this way may not generalize to hypothesis-testing studies that do not use this phrase. To this end, we coded the phrases used to introduce hypotheses in RRs, analyzed how many of them would have been detected with Fanelli’s search term, and compiled a list of alternative search terms to test the generalizability of Fanelli’s results in the future. Finally, we share a rich data set containing the exact quotes of hypotheses and conclusions on which we based our judgments as well as detailed descriptions of our sampling and coding procedure (see the Appendix in the Supplemental Material available online). This allows others to verify (or contest) our results and can hopefully provide an interesting resource for future metascientific research.
Method
After conducting a pilot to test the planned procedure, we preregistered our study (https://osf.io/sy927/). Methods and analyses described here were preregistered unless otherwise noted. Our online materials include an appendix with fine-grained methodological details and an annotated preregistration document with detailed comparisons with the eventual procedure (https://osf.io/dbhgr). The appendix and open data set also list all measures we collected but do not describe here (all of which were either auxiliary variables to facilitate the coding process or earlier versions of the variables discussed here).
Sample
We used the same method as Fanelli (2010) to obtain a new sample of SRs in psychology but restricted year of publication to 2013 to 2018 to match the sample to the RR population. We excluded articles in both groups if they were incomplete, unpublished, or retracted (e.g., meeting abstracts, study protocols without results); if they did not test a hypothesis; or if they contained insufficient information to reach a coding decision. An overview of the sampling process and all exclusions is shown in Figure 1.

Sampling process and exclusions for standard reports (SRs) and Registered Reports (RRs). SRs were accidentally oversampled: We initially excluded eight articles and only after replacing them found that two had been excluded erroneously. “Preregistered” refers to a study that had been preregistered but was not a full RR; “results-blind review” refers to an article that had undergone results-blind peer review but was not a full RR (authors knew results before first submission); “ambiguous” refers to four studies that had been treated as RRs but used preexisting data to which the authors had access before conducting their analyses and one that had no explicit signs of an RR except for a 2.5-year delay between submission and acceptance (we chose to exclude these cases to be conservative).
The sample size of SRs was prespecified to replicate the one used by Fanelli (2010), who drew 150 articles per discipline; because of the oversampling error described in Figure 1, our final sample contained 152 SRs.
The sample size of RRs was determined by our goal to include all published RRs in the field of psychology that tested at least one hypothesis, regardless of whether they used the phrase “test* the hypothes*.” RRs were selected through an RR database curated by the COS (retrieved November 19, 2018). After excluding nonpsychology articles, we verified that all remaining articles were indeed RRs by consulting the journal submission guidelines or relevant editorials or by contacting the editors directly. Articles were counted as RRs if we could establish that these submissions had been reviewed and received in-principle acceptance before the data collection (or analyses) of all studies in the article had been conducted (in accordance with COS guidelines). We excluded 80 of the 151 entries in the COS RR database, leaving 71 RRs for the final analysis (see Fig. 1). Note that we excluded all eight Registered Replication Reports (Simons, 2018; Simons et al., 2014) from our sample because this format explicitly focuses on effect size estimation rather than hypothesis testing (“Registered Replication Reports,” n.d.; this decision was not preregistered).
Measures and coding procedure
The main dependent variable was whether the first hypothesis was supported, as reported by the authors. We tried to follow Fanelli’s (2010) coding procedure as closely as possible: By examining the abstract and/or full text, it was determined whether the authors of each paper had concluded to have found a positive (full or partial) or negative (null or negative) support. If more than one hypothesis was being tested, only the first one to appear in the text was considered. We excluded meeting abstracts and papers that either did not test a hypothesis or for which we lacked sufficient information to determine the outcome. (p. 8)
In RRs, we coded the first preregistered hypothesis, thus excluding unregistered pilot studies. The coding procedure was identical for both article formats in all other respects. Coding disagreements between “full” and “partial” support were deemed minor because they would not affect the final results. Thus, only disagreements affecting the binary support (full or partial) versus no-support classification were treated as major and resolved through discussion. M. R. M. J. Schijen coded all articles in the sample, and A. M. Scheel double-coded all articles M. R. M. J. Schijen had found difficult to code or could not code (24 RRs and 47 SRs). Only three disagreements were major (Cohen’s κ = .808) and subsequently resolved by discussion; 15 were minor (disagreement between “support” and “partial support”). We deviated from the preregistered plan that A. M. Scheel would additionally code a random subset of both groups because the number of double-coded articles seemed sufficient after double-coding only the difficult cases. Because removing all indicators that could have identified RRs as such from their full texts would have been practically impossible, coding was not blind to publication format (RR vs. SR).
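Interrater agreement of this kind is commonly quantified with Cohen’s κ, which corrects raw agreement for agreement expected by chance. A minimal sketch with invented codings (the ratings below are illustrative only, not the study’s data):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters coding the same items."""
    n = len(ratings_a)
    # Observed agreement: proportion of items coded identically
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal frequencies
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical support codings from two coders (one disagreement):
coder_a = ["support", "support", "partial", "none", "none",
           "support", "partial", "none", "support", "none"]
coder_b = ["support", "support", "partial", "none", "support",
           "support", "partial", "none", "support", "none"]

print(round(cohens_kappa(coder_a, coder_b), 3))  # 0.844
```

With 90% raw agreement but uneven category frequencies, κ lands below the raw agreement rate, which is the chance correction at work.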
Hypothesis introductions
Selecting SRs using the phrase “test* the hypothes*” might yield different results than alternative search phrases. To get a better overview of “natural” descriptions of hypotheses and to facilitate future investigations of the generalizability of Fanelli’s (2010) results, we extracted the phrase used to introduce the coded hypothesis in all RRs and tried to identify clusters of common expressions.
Replication status
We expected a large proportion of RRs to be replications, many of which may have been motivated by skepticism of the original study. Because this circumstance alone could potentially lead to a lower positive result rate in RRs, we additionally coded whether hypotheses were close replications of previously published work. Because of ill-specified coding criteria in our preregistration (see the Appendix in the Supplemental Material), we used an unregistered coding strategy: We determined whether the coded hypothesis of articles whose full text contained the string “replic*” (cf. Makel et al., 2012; Mueller-Langer et al., 2019) was a close replication with the goal of verifying a previously published result. Conceptual replications and internal replications (replication of a study in the same article) were not counted as replications in this narrow sense because both are more likely to be motivated by the goal to build on previous work than by skepticism. A. M. Scheel coded all articles, and D. Lakens double-coded 32 RRs (45.07%) and 99 SRs (65.13%). There were five disagreements (Cohen’s κ = .878), all of which were resolved by discussion.
Analysis
We planned to test our hypothesis in the following way (quoting directly from our preregistration, https://osf.io/sy927): A one-sided proportion test with an alpha level of 5% will be performed to test whether the positive result rate (full or partial support) of Registered Reports in psychology is statistically lower than the positive result rate of conventional reports
in psychology. In addition to testing if there is a statistically significant difference between RRs and conventional reports, we will test if the difference is smaller than our smallest effect size of interest using an equivalence test for proportion tests with an alpha level of 5% (Lakens, Scheel, & Isager, 2018). We determined our smallest effect size of interest to be the difference between the positive result rate in psychology (91.5%) and the positive result rate in general social sciences (85.5%) as reported by Fanelli (2010), i.e. a difference of 91.5% − 85.5% = 6%. The rationale for choosing general social sciences as a comparison is that this discipline had the lowest positive result rate amongst the ‘soft’ sciences (Fanelli, 2010). The exact percentage for general social sciences was extracted from Figure 1 in Fanelli (2010) using the software WebPlotDigitizer (Rohatgi, 2018).
We would accept our hypothesis that RRs have a lower positive result rate than SRs if the observed difference between RRs and SRs was significantly smaller than zero.
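The combination of a directional test with an equivalence test can be sketched as two one-sided z-tests (TOST) on the difference between two proportions under a normal approximation. The function and the illustrative inputs below are ours, not code from the preregistration:

```python
from math import erf, sqrt

def _norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def tost_two_proportions(x1, n1, x2, n2, bound):
    """Two one-sided z-tests: is the difference p1 - p2 within
    +/- `bound`? Returns the larger one-sided p value; equivalence
    is declared if it falls below alpha."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    p_lower = 1 - _norm_cdf((diff + bound) / se)  # H0: diff <= -bound
    p_upper = _norm_cdf((diff - bound) / se)      # H0: diff >= +bound
    return max(p_lower, p_upper)

# Two nearly identical proportions in large samples are statistically
# equivalent within a 6-percentage-point bound:
print(tost_two_proportions(900, 1000, 905, 1000, bound=0.06) < 0.05)  # True
```

The 6-percentage-point bound here mirrors the smallest effect size of interest derived from Fanelli (2010); in practice the dedicated routines cited in the preregistration (Lakens, Scheel, & Isager, 2018) would be used.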
Results
Preregistered analysis
Thirty-one out of 71 RRs and 146 out of 152 SRs had positive results, meaning that the positive result rate was 43.66% for RRs (95% confidence interval [CI] = [31.91, 55.95]) and 96.05% for SRs (95% CI = [91.61, 98.54]; see Fig. 2). This difference of −52.39% was statistically significant in the preregistered one-sided proportions test with α = 5%, χ2(1) = 77.96,

Positive result rates for standard reports and Registered Reports. Error bars indicate 95% confidence intervals around the observed positive result rate.
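The headline comparison can be reproduced from the reported counts (31 of 71 positive RRs vs. 146 of 152 positive SRs). A sketch using SciPy’s chi-square test of independence, whose default Yates continuity correction matches the reported statistic (this is our reconstruction, not the authors’ analysis script):

```python
from scipy.stats import chi2_contingency

# Positive / negative counts as reported in the Results:
#         positive   negative
table = [[31,  71 - 31],    # Registered Reports
         [146, 152 - 146]]  # standard reports

# chi2_contingency applies Yates's continuity correction to 2x2 tables
chi2, p, dof, expected = chi2_contingency(table)

print(round(31 / 71 * 100, 2))    # 43.66  (RR positive result rate)
print(round(146 / 152 * 100, 2))  # 96.05  (SR positive result rate)
print(round(chi2, 2))             # 77.96
```

The recomputed rates and test statistic agree with the values reported in the text.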
Exploratory analyses
For ease of communication, we refer to articles that were classified as close replications of previously published work as replication studies and to all other articles as original studies.
Positive Results in Original Studies Versus Replication Studies
Note: SRs = standard reports; RRs = Registered Reports; CI = confidence interval.
Because our SR sample represents a direct replication of Fanelli (2010) for the discipline psychiatry and psychology, another interesting question is how our results compare with Fanelli’s. The difference between the positive result rates of SRs in our sample and Fanelli’s (96.05% − 91.49% = 4.56%) is not significantly different from zero in a two-sided proportions test, χ2(1) = 1.91,
Finally, we analyzed the language that was used to introduce or refer to hypotheses in RRs. We found extremely little overlap with Fanelli’s (2010) search phrase “test* the hypothes*”: Searching the abstracts, titles, and keywords of the RR sample showed that only two of 71 RRs would have been detected with this search phrase. To analyze which other hypothesis-introduction phrases researchers used in RRs, we stripped the coded hypothesis quotes from all content-specific information and extracted “minimal” phrases that most distinctively indicated that a hypothesis was being described. For example, from the hypothesis quote, “For Study 1, we predicted that participants reading about academic (vs. social) behaviors would show a better anagram performance,” we extracted the hypothesis-introduction phrase “predicted that.”
For the majority of RRs (49), we identified one hypothesis-introduction phrase; the remaining ones used two (16 RRs), three (four RRs), or four (one RR) different phrases or had no identifiable hypothesis introduction (one RR). In this total set of 97 hypothesis introductions, we found 64 unique phrases showing substantial linguistic variation (see Tables 2 and 3). We then listed all unique word stems within those phrases and analyzed their frequency. Excluding words that are common but too unspecific by themselves (e.g., “that,” “to,” “whether”), the five most frequent word stems were “hypothes*” (34 occurrences), “replicat*” (24), “test*” (20), “examine*” (eight), and “predict*” (eight). Clearly, “test*” and “hypothes*” are quite popular, yet they co-occurred only eight times, and more than half of all hypothesis introductions (51 of 97) contained neither word.
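The stem-frequency step can be sketched as a simple count over the extracted phrases; the phrases and the crude prefix-based stemming below are invented for illustration, not the study’s data or procedure:

```python
import re
from collections import Counter

# Hypothetical hypothesis-introduction phrases (illustrative only):
phrases = [
    "we predicted that",
    "to test the hypothesis that",
    "we attempted to replicate",
    "we hypothesized that",
    "we examined whether",
]

# Words that are common but too unspecific by themselves:
STOPWORDS = {"we", "that", "to", "the", "whether"}

stems = Counter()
for phrase in phrases:
    for word in re.findall(r"[a-z]+", phrase.lower()):
        if word not in STOPWORDS:
            # Crude stemming: truncate to a common prefix so that,
            # e.g., "hypothesis" and "hypothesized" count together
            stems[word[:7]] += 1

print(stems.most_common(3))
```

A real analysis would use proper word stems (e.g., “hypothes*”, “replicat*”) rather than fixed-length prefixes, but the counting logic is the same.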
Hypothesis Introduction Phrases in Original Registered Reports (Testing New Hypotheses)
Note: Table contains 44 hypothesis introduction phrases from 30 Registered Reports: 19 articles contributed one phrase each, nine articles contributed two each, one contributed three, and one contributed four.
Hypothesis Introduction Phrases in Direct Replication Registered Reports (Testing Previously Studied Hypotheses)
Note: Table contains 53 hypothesis introduction phrases from 40 Registered Reports. One additional Registered Report had no identifiable hypothesis introduction. Thirty articles contributed one phrase each, seven contributed two each, and three contributed three each.
Sixty-nine of the 71 RRs (97.18%) had at least one of these five most frequent word stems in their title, abstract, or keywords, meaning that a regular literature search (without access to full texts) combining the search terms “hypothes*,” “replicat*,” “test*,” “examine*,” and “predict*” would have detected nearly all RRs in our sample.
Finally, we noticed an interesting difference in language use between original and replication RRs: As the high frequency of the word stem “replicat*” suggests, replications were often framed as attempts to repeat a previously conducted study, whereas original RRs more often used explicit prediction or hypothesis-testing language (see Tables 2 and 3).
Discussion
We examined the proportion of psychology articles that found support for their first tested hypothesis and discovered a large difference (96.05% vs. 43.66%) between a random sample of SRs and the full population of RRs (at the time of data collection). More than half of the analyzed hypothesis tests in RRs were close replications of previous work, but the difference between SRs and RRs remained large when close replications were excluded from the analysis (95.95% vs. 50.00%). Clearly, the emerging literature of RRs appears to be publishing a much larger proportion of null results than the standard literature.
The positive result rate we found in SRs (96.05%) is slightly but nonsignificantly higher than the 91.5% reported by Fanelli (2010). Our replication in a more recent sample of the psychology literature thus yielded a comparably high estimate of supported hypotheses, but we cannot rule out that the positive result rate in the population has increased since 2010 (cf. Fanelli, 2012). Furthermore, our estimate of the positive result rate for RRs (43.66%) is comparable with the 39.5% reported by Allen and Mehler (2019) despite some differences in method and studied population.
To explain the 52.39% gap between SRs and RRs, we must assume some combination of differences in bias, statistical power, or the proportion of true hypotheses researchers choose to examine. Figure 3 visualizes the combinations of statistical power and proportion of true hypotheses that could produce the observed positive result rates if the literature were completely unbiased. Assuming no publication bias and no QRPs, authors of SRs would need to test almost exclusively true hypotheses (> 90%) with more than 90% power. Because this is highly implausible and contradicted by available evidence (e.g., Szucs & Ioannidis, 2017), the standard literature is unlikely to reflect reality. As noted above, methodological rigor and statistical power in RRs likely meet or exceed the level of SRs, leaving the rate of true hypotheses and bias as remaining explanations.
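The curves in Figure 3 follow from inverting the no-bias identity for the expected positive result rate: given an observed rate r and α = .05, the power required at each proportion of true hypotheses (prior) is (r − (1 − prior) × α) / prior. A sketch (function and variable names are ours):

```python
ALPHA = 0.05

def required_power(observed_rate, prior):
    """Power needed for true hypotheses so that, with false positives
    at rate ALPHA, the overall positive result rate equals
    observed_rate (assuming no publication bias or QRPs)."""
    return (observed_rate - (1 - prior) * ALPHA) / prior

# SRs (96.05% positive): even if 95% of tested hypotheses were true,
# the required power exceeds 1, which is impossible:
print(round(required_power(0.9605, 0.95), 3))

# RRs (43.66% positive) are consistent with, e.g., half-true
# hypotheses tested at moderate-to-high power:
print(round(required_power(0.4366, 0.50), 3))  # 0.823
```

Values above 1 mark prior/power combinations that cannot produce the observed rate without bias, which is why the SR curve sits in an implausible region.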

Combinations of the proportion of true hypotheses and statistical power that would produce the observed positive result rates given α = 5% and no bias. Shaded areas indicate 95% confidence intervals. SRs refers to standard reports, and RRs refers to Registered Reports. The curve for all SRs (i.e., including replications; 96.05% positive results,
It is a priori plausible that RRs are currently used for a population of hypotheses that are less likely to be true: For example, authors may use the format strategically for studies they expect to yield negative results (which would be difficult to publish otherwise). However, assuming over 90% true hypotheses in the standard literature is neither realistic nor desirable for a science that wants to advance knowledge beyond trivial facts. We thus believe that this factor alone is not sufficient to explain the large difference in positive results. Rather, the numbers strongly suggest a reduction of publication bias and/or QRPs in the RR literature. Nonetheless, the prior probability of hypotheses in RRs and SRs may differ and should be studied in future research.
Limitations
Because coders could not be blinded to an article’s publication format, their judgment may have been biased. Our study was not an experiment—hypotheses, authors, and editors were not randomly assigned to each publication format—and thus precludes strong causal inferences. As discussed above, it seems highly plausible that RRs reduce publication bias and QRPs, which in turn reduces the positive result rate. Yet we know neither exactly how effective RRs are at reducing bias nor how large the effect on positive results would be in the absence of potential confounds. One such confound, as just discussed, could be that RRs may be used for particularly risky hypotheses. Another confound could be that the format attracts particularly conscientious authors who try to minimize the risk of inflated error rates regardless of the report format they use. As a third potential confound, journals that offer RRs may have more progressive editorial policies that aim to reduce publication bias and Type I error inflation for all empirical articles they publish. This could lead to less bias in the RR literature even if the format’s safeguards against certain QRPs were actually ineffective. Additional research, ideally with prospective and experimental or quasi-experimental study designs, is needed to further investigate the influence of such factors. However, a cursory look at the three journals that contributed both SRs and RRs to our data set (
Another limitation of the current study (and of Fanelli, 2010) is that SRs were selected using the search phrase “test* the hypothes*.” This phrase was virtually absent in RRs, suggesting that the search strategy may not yield a representative sample of the population of hypothesis-testing studies in the literature. The use of the phrase might even be confounded with the outcome of a study: For example, authors may be more likely to describe their research explicitly as a hypothesis test when they found positive results but prefer more vague language for unsupported hypotheses (e.g., “we examined the role of . . . ”). A similar concern could be raised for the decision to code only the first reported hypothesis of each article. The first hypothesis test may not be representative for all hypothesis tests reported in an article, and the order of reporting may differ between SRs and RRs. For example, SR authors might tend to present supported hypotheses first, whereas RR authors might be more likely to present their hypotheses in chronological order.
Both of these potential confounds might lead to an inflated estimate of the positive result rate in SRs. However, studies using different selection criteria for articles and hypotheses have found very similar rates of supported hypotheses in the literature: 97.28% in Sterling (1959), 95.56% in Sterling et al. (1995), and 97% in the original studies included in the Reproducibility Project: Psychology (Open Science Collaboration, 2015). In addition, Motyl et al. (2017) reported 89.17% and 92.01% significant results for “critical” hypothesis tests in articles published in 2003–2004 and 2013–2014, respectively. Although the selection criteria for articles and hypotheses in our study may limit the generalizability of the results, this level of convergence makes it seem unlikely that alternative methods would have yielded dramatically different conclusions.
Conclusion
Our study presents a systematic comparison of positive results in RRs and the standard literature. The much lower positive result rate in RRs compared with SRs suggests that an unbiased literature would look very different from the existing body of published research. Standard publication formats seem to lead psychological scientists to miss out on many negative results from high-quality studies, which are available in the RR literature. The absence of negative results is a serious threat to a cumulative science. In 1959, Sterling asked: “What credence can then be given to inferences drawn from statistical tests of
Supplemental Material
Supplemental material (sj-pdf-1-amp-10.1177_25152459211007467) for “An Excess of Positive Results: Comparing the Standard Psychology Literature With Registered Reports” by Anne M. Scheel, Mitchell R. M. J. Schijen, and Daniël Lakens in Advances in Methods and Practices in Psychological Science.