Prior research finds that statistically significant results are overrepresented in scientific publications. If significant results are consistently favored in the review process, published results could systematically overstate the magnitude of their findings even under ideal conditions. In this paper, we measure the impact of this publication bias on political science using a new data set of published quantitative results. Although any measurement of publication bias depends on the prior distribution of empirical relationships, we determine that published estimates in political science are on average substantially larger than their true value under a variety of reasonable choices for this prior. We also find that many published estimates have a false positive probability substantially greater than the conventional α = 0.05 threshold for statistical significance if the prior probability of a null relationship exceeds 50%. Finally, although the proportion of published false positives would be reduced if significance tests used a smaller α, this change would not solve the problem of upward bias in the magnitude of published results.
Many academic papers (and especially the first few articles on a topic) describe relationships that turn out to be illusory upon closer examination (Ioannidis, 2005). Additionally, the typical published estimate is probably of larger magnitude than the true relationship (Ioannidis, 2008). Recent large-scale attempts to replicate social scientific findings have discovered that many of these findings become substantially smaller and more uncertain than initially indicated (Boekel et al., 2015; Camerer et al., 2016; Hartshorne and Schachner, 2012; Ioannidis et al., 2014; Klein et al., 2014; Maniadis et al., 2014; Open Science Collaboration, 2015); the “replication crisis” has plagued fields in the hard sciences as well (e.g. Begley and Ellis, 2012; Prinz et al., 2011; Steward et al., 2012). Replicability problems are exacerbated by researcher behaviors like p-hacking (analyzing the same data in multiple ways but only reporting the most statistically significant findings).1 But even if behaviors like this were eliminated, the problems would continue to exist because the publication process privileges statistically significant results (Brodeur et al., 2016; Coursol and Wagner, 1986; Gerber et al., 2001; Gerber and Malhotra, 2008a,b; Sterling et al., 1995), including by influencing authors’ decision to write up and publicize their findings (Franco et al., 2014). When null findings are not published, they cannot place anomalously large and statistically significant results into their proper context; such anomalous results can attract a great deal of scientific interest because of their novelty and counterintuitiveness.2 These problems are often collectively referred to as publication bias (Scargle, 2000). Although the publication bias created by “misunderstanding or misuse of statistical inference is only one cause of the ‘reproducibility crisis’ … to our community, it is an important one” (Wasserstein and Lazar, 2016: 2).3
While much of the previous work in this area focuses on establishing that publication bias is real and pervasive in disciplines that use statistical evidence (e.g. by using “caliper tests” of published p-values, as in Gerber and Malhotra (2008a) and Brodeur et al. (2016)), our paper seeks to determine how publication bias has affected the accumulated body of knowledge in political science. We measure the impact of publication bias on political science using a new data set of published quantitative results. Although any measurement of publication bias depends on the prior distribution of empirical relationships, we estimate that published results in political science are distorted to a substantively meaningful degree under a variety of reasonable choices for this prior.
We come to three conclusions. First, published estimates of relationships in political science are on average substantially larger than their true values. The exact degree of upward bias depends on the choice of prior, but at the high end we estimate that the true value of published relationships is on average 40% smaller than their published value. More optimistic priors yield a lower average bias, but still imply that at least 14% of results are biased upward by 10% or more. Second, we find that many published results have a false positive probability substantially greater than the conventional α = 0.05 threshold for statistical significance if the prior probability of a null relationship exceeds 50%. These two findings are quantitatively and qualitatively similar to results uncovered by the large-scale replication studies noted above, suggesting that publication bias can explain much of the “replication crisis” these studies have observed.4 Finally, we find that both the upward bias in magnitude and the probability of being a false positive are smaller for results with p-values further from the threshold for significance. Our last finding suggests that requiring a more stringent statistical significance test (with a smaller α) for publication might be effective at combating publication bias (Johnson, 2013). Unfortunately, although the proportion of published false positives would be reduced by this strategy (Bayarri et al., 2016; Goodman, 2001), we find that such a reform would not solve the problem of upward bias in published results: published results near the new threshold of significance would still be (on average) substantially biased upward.
Measurement strategy
Trying to measure the degree of upward bias in an estimate β̂ of some parameter β, or the prevalence of false positives (statistically significant estimates β̂ when the null hypothesis β = 0 is true), is tricky. Any measurement depends on an assumption about the true value of β (or a probability distribution of beliefs about its value, f(β)). For example, consider the distribution of statistically significant estimates β̂ associated with a true value of β. Publication bias implies that published estimates satisfy |β̂/σ̂| ≥ t*, where β̂/σ̂ is the t-statistic, σ̂ is the estimated standard error of β̂, and t* is the critical t value for a two-tailed significance test under the null hypothesis β = 0, setting α = Pr(significant | β = 0) = 0.05 (so that t* ≈ 1.96 with many degrees of freedom). For a fixed and known β, we could calculate the degree of publication bias as

$$\mathrm{bias}(\beta, \hat{\sigma}) \;=\; E\left[\hat{\beta} \,\middle|\, |\hat{\beta}| \geq \hat{\beta}_{\min}\right] - \beta \;=\; \frac{\int_{|\hat{\beta}| \geq \hat{\beta}_{\min}} \hat{\beta}\, f_t\!\left(\frac{\hat{\beta} - \beta}{\hat{\sigma}}\right) d\hat{\beta}}{\int_{|\hat{\beta}| \geq \hat{\beta}_{\min}} f_t\!\left(\frac{\hat{\beta} - \beta}{\hat{\sigma}}\right) d\hat{\beta}} - \beta \qquad (1)$$

where β̂_min = t*·σ̂ (that is, the smallest β̂ that is statistically significant in magnitude) and f_t is the t probability density function. That is, we define bias as the difference between the expected value of statistically significant estimates and the true value of the estimand.5
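To illustrate equation (1), the short sketch below (our own illustration with made-up numbers, not the authors' replication code) computes the expected value of statistically significant estimates for a fixed, known β by numerical integration, using a normal approximation to the t sampling density for simplicity.

```python
import numpy as np
from scipy import stats, integrate

# Hypothetical example: true effect beta = 0.5 with standard error 0.25 (a t-statistic of 2)
beta, sigma, alpha = 0.5, 0.25, 0.05
t_star = stats.norm.ppf(1 - alpha / 2)   # ~1.96 (normal approximation to the critical value)
beta_min = t_star * sigma                # smallest statistically significant estimate in magnitude

def density(b_hat):
    # Sampling density of the estimate around the true beta (normal approximation)
    return stats.norm.pdf(b_hat, loc=beta, scale=sigma)

# Expected estimate conditional on statistical significance: integrate over both tails
num = (integrate.quad(lambda b: b * density(b), beta_min, np.inf)[0]
       + integrate.quad(lambda b: b * density(b), -np.inf, -beta_min)[0])
den = (integrate.quad(density, beta_min, np.inf)[0]
       + integrate.quad(density, -np.inf, -beta_min)[0])

bias = num / den - beta
print(f"E[estimate | significant] = {num / den:.3f}; bias = {bias:.3f} ({100 * bias / beta:.1f}%)")
```

In this hypothetical case the conditioning on statistical significance pulls the expected published estimate above the true β, which is the upward bias the measure is designed to capture.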
However, in a published work, β is unknown. We must therefore calculate

$$E_f[\mathrm{bias}] \;=\; \int \mathrm{bias}(\beta, \hat{\sigma})\, f(\beta)\, d\beta \qquad (2)$$

under some reasonable assumptions about our prior beliefs about β, f(β). This estimate of publication bias will obviously be a function of our choice of f(β), and consequently it is advisable to estimate publication bias under a variety of choices for f(β) to ensure robust results.
We estimate the degree of expected publication bias in the political science literature as a proportion of the published result, E_f[bias] / |β̂_pub|. Bias is measured in the direction of the true β (that is, as a function of the distance of the relationship from zero); this allows us to measure the degree to which the average published result exaggerates the true magnitude of a relationship.6 The published estimate β̂_pub informs our assumption about the prior f(β), to recognize that each project comes out of a different family of projects pertaining to different subfields and topics whose magnitudes are difficult to compare across families. We consider two classes of f(β):
a spike-and-slab distribution with a spike at β = 0 and a uniform slab centered on zero, with bounds scaled to the published estimate; and
a spike-and-normal distribution with a spike at β = 0 added to a normal distribution centered on zero, with standard deviation scaled to the published estimate.
The first distribution represents a 33% probability prior belief that a non-zero β is no larger in magnitude than the published estimate, while the second represents a ≈68% probability prior belief in the same proposition; our results are robust to other reasonable choices for the boundaries of the spike-and-slab prior and the standard deviation of the normal prior.7 We systematically vary the height of the spike, Pr(β = 0), to determine how different expectations for the baseline rate of null relationships change our view of the published literature. Finally, we repeat our analysis with no spike at β = 0, to recognize the possibility that a point null hypothesis is never true in real data (Gelman, 2011).
We use our prior belief density f(β) to determine the relationship between true relationships and observed estimates using simulation. To do this, we generate 100,000 draws from f(β) for each published study. For each draw of β, we simulate a sample estimate β̂ = β + σ̂·t_df, where σ̂ is the published standard error of β̂_pub and t_df is a draw from the t-density with degrees of freedom equivalent to the published study.8 We determine which of these results is statistically significant by comparing β̂/σ̂ to the two-tailed critical value from a t-density with the appropriate degrees of freedom. Finally, we calculate the bias of each statistically significant draw (the difference between β̂ and the true β, measured in the direction of the true β). The average of this quantity is our estimate of E_f[bias]. We then divide this by the absolute value of the published result, |β̂_pub|, to calculate percentage bias.
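A minimal Monte Carlo sketch of this procedure for a single published result is given below. It is illustrative only: the slab width of the spike-and-slab prior, the handling of draws with β = 0, and all function names and numbers are our own assumptions rather than the authors' replication code.

```python
import numpy as np
from scipy import stats

def expected_pct_bias(beta_pub, se_pub, df, pr_nonnull, n_draws=100_000, alpha=0.05, seed=0):
    """Monte Carlo sketch of expected % bias among statistically significant estimates.

    Assumed spike-and-slab prior: beta = 0 with probability (1 - pr_nonnull), otherwise
    beta ~ Uniform(-3|beta_pub|, 3|beta_pub|) -- the slab width is our assumption."""
    rng = np.random.default_rng(seed)
    nonnull = rng.random(n_draws) < pr_nonnull
    slab = rng.uniform(-3 * abs(beta_pub), 3 * abs(beta_pub), n_draws)
    beta = np.where(nonnull, slab, 0.0)
    # Simulated estimate: true beta plus t-distributed sampling noise scaled by the published SE
    beta_hat = beta + se_pub * rng.standard_t(df, n_draws)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    significant = np.abs(beta_hat / se_pub) >= t_crit
    # Bias of each significant draw, measured in the direction of the true beta;
    # for null draws (beta = 0) we take the magnitude of the estimate (an assumption)
    direction = np.where(beta == 0, np.sign(beta_hat), np.sign(beta))
    bias = direction[significant] * (beta_hat[significant] - beta[significant])
    return 100 * bias.mean() / abs(beta_pub)

# Hypothetical published result: estimate 0.5, standard error 0.24, 200 degrees of freedom
print(expected_pct_bias(beta_pub=0.5, se_pub=0.24, df=200, pr_nonnull=0.10))
```

In the actual analysis this expected percentage bias is computed once for each of the 142 published results and then summarized across the sample, as in Table 1.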
Data set
We estimate the effect of publication bias on the literature in political science using a new data set of quantitative work recently published in prominent, general interest journals. Our data set is composed of 314 quantitative articles published in the American Political Science Review (APSR: 139 articles in volumes 102–107, from 2008–2013) and the American Journal of Political Science (AJPS: 175 articles in volumes 54–57, from 2010–2013).9 To simplify the analysis, we analyze only articles with continuous and unbounded dependent variables. Among the 173 articles that meet this criterion,10 six have at least one missing value for the estimates or sample sizes needed for our analysis, and we remove them. Consequently, we are left with 167 quantitative articles published in the APSR (70 articles) and the AJPS (97 articles). Finally, 25 of these 167 articles (about 15%) report statistically insignificant results as their main relationship under a two-tailed test with α = 0.05, although 17 of the 25 are statistically significant under a one-tailed test with α = 0.05.11 We omit these studies from our analysis because their interpretation is unclear in the context of assessing publication bias under an α = 0.05 two-tailed significance test, leaving 142 studies for analysis. The consequence of omitting statistically insignificant results is that our estimates are upper bounds on the degree of publication bias in the literature: the more likely it is that statistically insignificant results will be published, the smaller publication bias will be.
A complete list of the rules we used to identify and code observations in our data set is provided in Appendix 2; we summarize the procedure here.12 Each observation of the collected data set represents one article and contains the article’s main finding (viz., an estimated marginal effect, β̂_pub). Defining the main finding of an article can be complicated, as many articles present multiple results.13 We code the main finding in the following way. First, if there is any expression such as “the key independent variable” or “the main finding of this paper,” we consider that relationship the main finding. If there is no such explicit phrasing, we consider the finding that is emphasized in the abstract or in the conclusion of the paper to be the main finding. If multiple hypotheses receive almost equal attention, we record the first hypothesis (“H1” or “the first hypothesis”).
Results
The result of applying this technique to the published (and statistically significant) marginal effects estimates in our data set reveals a substantial tendency toward upward bias in magnitude, as illustrated in Table 1. As the table shows, if we have a baseline expectation that only 10% of our hypotheses correctly predict a relationship a priori, then over 50% of published findings are expected to be at least 10% larger in magnitude than the true relationship. The typical published result in this scenario is on average at least 29% larger than the true relationship. Even if no relationships are exactly zero, under a normal prior density (with standard deviation scaled to the published estimate) over 40% of published results have ⩾10% upward bias in magnitude. In general, the magnitude of the bias problem scales positively with the assumed underlying proportion of null results in the population of research ideas, Pr(β = 0).
Table 1. Expected bias in a sample of published marginal effects from APSR and AJPS.

Assumed population   Spike-and-slab prior                          Spike-and-normal prior
Pr(β ≠ 0)            Mean % bias   % of estimates with bias ⩾ 10%   Mean % bias   % of estimates with bias ⩾ 10%
10%                  29.8          87.3                             40.5          52.8
20%                  18.9          78.2                             29.2          47.9
50%                  8.90          41.5                             17.8          44.3
100%                 4.73          14.1                             12.4          42.3
The table shows the estimated prevalence of upward bias in estimate magnitude in a sample of 167 articles from the American Political Science Review and the American Journal of Political Science; the sample size is 142 after 25 statistically insignificant results are excluded. We generate 100,000 draws from f(β) for each published study: a draw is set to β = 0 with probability 1 − p and drawn from the slab (or normal) component with probability p, where the assumed value of p = Pr(β ≠ 0) is listed in column 1. For each draw of β, we simulate a sample estimate β̂ = β + σ̂·t_df, where σ̂ is the published standard error of the estimate and t_df is a draw from the t-density with degrees of freedom equivalent to the published study. We determine which of these estimates is statistically significant by comparing β̂/σ̂ to the critical value for an α = 0.05 test (two-tailed) from a t-density with degrees of freedom equivalent to the published study. Finally, we calculate the bias of each statistically significant draw as a percentage of the published estimate. Columns 2 and 4 list the mean value of these replicates, our estimate of expected percentage bias, across all 142 results for the prior distribution indicated in the column heading. Columns 3 and 5 list the corresponding proportion of estimates for which this expected bias is greater than or equal to 10%.
The implication of the analysis is that a substantial portion of published results overestimate the true size of the relationship being studied because statistical significance tests are used to screen results for publication. Biases that are large enough to be substantively meaningful are not uncommon; if our assumptions about Pr(β = 0) are a good representation of the background rate of null relationships, we would expect many empirical findings (perhaps even a majority) to exaggerate the size of the true relationships that they measure. Moreover, the high end of our estimates (viz., that the true value of published relationships is on average 40% smaller than their published value) matches recent empirical estimates by large-scale replication projects. For example, the Open Science Collaboration (OSC) estimated that effect sizes in replication studies of psychology findings are on average around 50% smaller than the originally published effect sizes for the same relationships (Open Science Collaboration, 2015: aac4716-3–aac4716-5). A similar replication of 18 studies in experimental economics found that replication effect sizes were on average only 65.9% of the size of the original estimate, a reduction of about 34% (Camerer et al., 2016: 1434).14 Most of our estimates, however, show smaller mean bias; this suggests either that (a) the proportion of null results in the population of studies is higher than we contemplated, or that (b) other factors (opportunistic model selection by the original researchers, a bias against success among the researchers performing the replication, and/or many alternative possibilities) may be working in concert with publication bias to explain prior empirical results.
Not all publications are equally susceptible to bias, as seen in Figure 1. The figure shows that individual results vary greatly in terms of expected publication bias, regardless of the prior probability of a null effect Pr(β = 0). Indeed, Figure 1(d) shows substantial bias, and substantial variation among published results, for the normal prior even with no spike at β = 0. Importantly, the expected bias is strongly associated with the published p-value of the result: smaller p-values are associated with smaller expected bias. Our finding underscores a point made in the American Statistical Association’s statement on p-values: “the widespread use of ‘statistical significance’ (generally interpreted as ‘p ≤ 0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process” (Wasserstein and Lazar, 2016: 9).
Figure 1. Histograms of expected bias calculations from APSR and AJPS. (a) Spike-and-slab prior. (b) Spike-and-normal prior. (c) Uniform prior (no spike at β = 0). (d) Normal prior (no spike at β = 0). Each histogram shows the proportion of articles in a sample of 167 articles from the American Political Science Review and the American Journal of Political Science corresponding to each degree of expected bias; the sample size is 142 after 25 statistically insignificant results are excluded. Expected bias is calculated using the prior density indicated in the sub-figure’s caption and the procedure described in Table 1. The color of each bar indicates the published result’s p-value, in the range listed in the sub-figure’s legend.
Calculating susceptibility to false positives
Statistical significance testing is designed to lower the risk of concluding that a relationship exists when the evidence could be consistent with no relationship at all. However, it is well established (though perhaps not widely understood) that statistical significance testing is often insufficient to reduce the chance of a false positive to an acceptable level when the prior probability of studying a null relationship is very high (Bayarri et al., 2016; Goodman, 2001; Nuzzo, 2014; Siegfried, 2010). A key factor is the prior probability that the null hypothesis is true (i.e., the a priori expectation that the relationship being studied does not actually exist). That is:

$$\Pr(\beta = 0 \mid \text{stat. sig.}) \;=\; \frac{\Pr(\text{stat. sig.} \mid \beta = 0)\Pr(\beta = 0)}{\Pr(\text{stat. sig.} \mid \beta = 0)\Pr(\beta = 0) + \Pr(\text{stat. sig.} \mid \beta \neq 0)\Pr(\beta \neq 0)} \qquad (3)$$

We can use this formula to calculate this probability for the observations in our data set; this is similar to a calculation that Goodman (2001) and Bayarri et al. (2016) performed using Bayes factors, and to a closely related formula offered by Maniadis et al. (2014). To establish a lower bound for Pr(β = 0 | stat. sig.), we set Pr(stat. sig. | β ≠ 0) = 1 to maximize the denominator of equation (3). We then set the prior probability Pr(β = 0) to a fixed value and calculated Pr(β = 0 | stat. sig.) for a range of Pr(stat. sig. | β = 0) ∈ (0, 0.05]. The results for four different values of Pr(β = 0) are shown in Figure 2; the histogram in this figure indicates the distribution of p-values (i.e. the value of Pr(stat. sig. | β = 0)) in our data set.
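The following sketch (our own illustration; the prior values in the loop are hypothetical) evaluates this lower bound, treating a result's p-value as Pr(stat. sig. | β = 0) and setting Pr(stat. sig. | β ≠ 0) = 1 as in the text.

```python
def false_positive_lower_bound(p_value, pr_null):
    """Lower bound on Pr(beta = 0 | stat. sig.) from equation (3), with
    Pr(stat. sig. | beta != 0) set to 1 so that the denominator is maximized."""
    numerator = p_value * pr_null                      # Pr(sig | beta = 0) * Pr(beta = 0)
    return numerator / (numerator + 1.0 * (1 - pr_null))

# A result just at the significance threshold (p = 0.05) under several prior probabilities of a null
for pr_null in (0.5, 0.75, 0.9):
    print(pr_null, round(false_positive_lower_bound(0.05, pr_null), 3))
```

Even this optimistic lower bound grows quickly as the prior probability of a null relationship rises, which is the pattern plotted in Figure 2.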
Figure 2. Expected lower bound false positive probability calculations. The figure shows the probability that the null hypothesis is true given a statistically significant result, as a function of the probability of obtaining a statistically significant result when the null hypothesis is true, as implied by equation (3). To establish a lower bound for Pr(β = 0 | stat. sig.), we set Pr(stat. sig. | β ≠ 0) = 1 in equation (3). We set Pr(β = 0) to several alternative values, as indicated in the figure’s legend. The histogram shows the proportion of p-values in bins of width 0.005 for the 142 published and statistically significant results in our data set.
As the figure shows, not all published work has an equal expected probability of being a false positive. Results that are close to the boundary of statistical significance (with p ≈ 0.05) have the greatest expected probability of being a false positive. Results that are further from this boundary (e.g. where p ≪ 0.05) are at substantially lower risk of being false positives. This finding is consistent with the prior work of the Open Science Collaboration (2015: aac4716-5), whose replications of notable findings in psychology discovered that “a negative correlation of replication success with the original study p value indicates that the initial strength of evidence is predictive of reproducibility.” Camerer et al. (2016: 1435) find the same relationship between p-values and replicability. The finding is also consistent with the calculations of Goodman (2001) and Bayarri et al. (2016), who show that lower p-values are associated with greater reductions in the posterior probability of the null hypothesis (relative to its prior probability).
Figure 2 indicates that our concern about the likelihood of a false positive should be geometrically related to our prior belief about Pr(β = 0) and almost linearly related to a result’s p-value. When Pr(β = 0) is 50% or less, the lower-bound probability of a false positive never exceeds 5% in our calculation. However, for the larger values of Pr(β = 0) shown in Figure 2, we calculate that ≈10.6% and then ≈25.4% of the published results in our data set have Pr(β = 0 | stat. sig.) > 0.05; over 40% of these results exceed that threshold under the largest value of Pr(β = 0) that we consider. Our finding may explain why so many results fail to replicate. For example, the Open Science Collaboration was able to successfully replicate only 39 of 100 relationships from the psychology literature that it tested in its study (Open Science Collaboration, 2015: aac4716-5). A survey of researchers in psychology and allied fields by Hartshorne and Schachner (2012: 3) found that only 49% of attempted replications were able to fully replicate a study’s original findings. In economics, Camerer et al. (2016) were able to successfully replicate only 11 of the 18 studies they examined, a 61% success rate. Even in medicine, a study by Prinz et al. (2011) found that their laboratory was only able to completely replicate between 20% and 25% of the published work examined.
Conclusions and implications
The problem of publication bias has been studied for years and permeates all scientific disciplines that use statistical evidence (Rosenthal, 1979; Sterling et al., 1995). Interest in the problem has been reignited by efforts to replicate results in multiple disciplines, which have met with a surprisingly high rate of failure (Boekel et al., 2015; Camerer et al., 2016; Hartshorne and Schachner, 2012; Ioannidis et al., 2014; Klein et al., 2014; Maniadis et al., 2014; Open Science Collaboration, 2015). Prior work has established that statistically significant results are favored in political science (e.g. Gerber and Malhotra, 2008a), but to what extent does this distort substantive knowledge in the discipline? Are our findings contaminated by results that are biased upward in magnitude? Are false positive findings published too often in that literature? The answers depend on the unknown prior distribution of true relationships. But we find evidence for both problems in the published political science literature, and the problems are large enough to be qualitatively meaningful under a wide variety of different prior distributions. If these problems exist, they occur because statistically significant results are favored in the publication process: smaller values in an estimate’s sampling distribution are disproportionately ignored and null relationships are less likely to be published (Brodeur et al., 2016; Coursol and Wagner, 1986; Gerber et al., 2001; Gerber and Malhotra, 2008a,b; Sterling et al., 1995). We believe that our paper complements the findings of large-scale replication projects by placing them into a clearer theoretical context: under reasonable assumptions for the prior distribution of effects f(β), the results of these studies are what we should expect given (a) the existence of a publication process that favors statistically significant results and (b) the distribution of published results in the literature. In short, our findings suggest that publication bias is a reasonable explanation for at least part of the “replication crisis.”
Based on our evidence, results with smaller p-values are less affected by publication bias because they are further from the α = 0.05 threshold. These results are also at lesser risk of being false positives (Bayarri et al., 2016; Goodman, 2001). However, using a decreased threshold for statistical significance (i.e. only publishing results that can pass a significance test with α lower than 0.05), as suggested by Johnson (2013), simply recreates the problem for results near the new threshold. Consider the simulations of Table 1 for the spike-and-slab prior when 10% of relationships are assumed to be non-null: using a significance threshold of α = 0.01 results in 66.9% of estimates with upward bias of at least 10% in magnitude (compared to 87.3% of estimates using the α = 0.05 threshold). When all relationships are assumed to be non-null under the same prior, using a significance threshold of α = 0.01 results in 20.4% of estimates with upward bias of at least 10% in magnitude (compared to 14.1% of estimates using the α = 0.05 threshold).
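As a quick illustration of this point, one could re-run the expected_pct_bias sketch from the Measurement strategy section (with all of its assumed, hypothetical inputs) under both thresholds and compare the results:

```python
# Reuses the hypothetical expected_pct_bias() sketch defined earlier in the
# Measurement strategy section; both calls use the same made-up published result.
for alpha in (0.05, 0.01):
    print(alpha, expected_pct_bias(beta_pub=0.5, se_pub=0.24, df=200,
                                   pr_nonnull=0.10, alpha=alpha))
```

The stricter threshold reduces, but does not eliminate, the expected upward bias among the results that clear it.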
The empirical “credibility revolution” in economics and political science has rightfully made us ask harder questions about the quality of our research designs on a paper-by-paper basis (Angrist and Pischke, 2010). But as long as statistically significant results are privileged in the publication process, even researchers who do everything right from a causal identification perspective could still produce a literature whose results are (on average) biased upward and overpopulated with false positives. Just as the credibility revolution has made us more skeptical of some research designs, we believe that our findings (and the larger universe of findings concerning replicability) demand increased skepticism of novel results. This is particularly true if a result is only marginally statistically significant, because marginally significant results are at increased risk of being false positives. Consequently, it may be prudent to place less importance on the novelty and originality of a scholar’s output in evaluating his or her contribution to the discipline (recognizing, of course, that these are still important and valuable qualities) and more importance on work that checks the robustness of existing findings, including replication studies. We should also be careful about allowing the initial discovery of a new phenomenon to shape our research agenda before the phenomenon is thoroughly replicated. In the event that the discovery is a false positive, researchers seeking to apply the findings to other areas will necessarily be building their work on a null finding, thereby raising the overall prior probability of null hypotheses (viz. Pr(β = 0)) in the population and making the overall problem of publication bias even worse. We think that these changes constitute a substantial revision to the status quo, but one that is important to safeguarding the reliability of the findings that we communicate to each other, to our students, and to the larger world.
Footnotes
We thank Ashley Leeds, Will H. Moore, Cliff Morgan, Ric Stoll, our anonymous reviewers, and participants in our sessions at the 2013 Annual Meeting of the American Political Science Association and the 2013 Annual Meeting of the Society for Political Methodology for their helpful comments and suggestions.
Correction (June 2025):
The article has been updated with the correct dataverse link in the supplementary material section. For more details, please see the correction notice.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Supplementary material
The replication files are available at: https://dataverse.harvard.edu/dataverse/researchandpolitics. The supplementary files are available at:
Carnegie Corporation of New York Grant
The open access article processing charge (APC) for this article was waived due to a grant awarded to Research & Politics from Carnegie Corporation of New York under its “Bridging the Gap” initiative. The statements made and views expressed are solely the responsibility of the authors.
References
Anderson CJ, Bahnik S, Barnett-Cowan M, et al. (2016) Response to Comment on “Estimating the reproducibility of psychological science”. Science 351(6277): 1037.
Angrist JD, Pischke JS (2010) The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. The Journal of Economic Perspectives 24(2): 3–30.
Bayarri M, Benjamin DJ, Berger JO, et al. (2016) Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses. Journal of Mathematical Psychology.
Begley C, Ellis LM (2012) Raise standards for preclinical cancer research. Nature 483: 531–533.
Boekel W, Wagenmakers EJ, Belay L, et al. (2015) A purely confirmatory replication study of structural brain–behavior correlations. Cortex 66: 115–133.
Brodeur A, Lé M, Sangnier M, et al. (2016) Star Wars: The empirics strike back. American Economic Journal: Applied Economics 8(1): 1–32.
Camerer CF, Dreber A, Forsell E, et al. (2016) Evaluating replicability of laboratory experiments in economics. Science 351(6280): 1433–1436.
Coursol A, Wagner EE (1986) Effect of positive findings on submission and acceptance rates: A note on meta-analysis bias. Professional Psychology: Research and Practice 17: 136–137.
Franco A, Malhotra N, Simonovits G (2014) Publication bias in the social sciences: Unlocking the file drawer. Science 345(6203): 1502–1505.
Gerber AS, Malhotra N (2008a) Do statistical reporting standards affect what is published? Publication bias in two leading political science journals. Quarterly Journal of Political Science 3(3): 313–326.
Gerber AS, Malhotra N (2008b) Publication bias in empirical sociological research: Do arbitrary significance levels distort published results? Sociological Methods & Research 37(3): 3–30.
Gerber AS, Green DP, Nickerson D (2001) Testing for publication bias in political science. Political Analysis: 385–392.
Gilbert DT, King G, Pettigrew TF, et al. (2016) Comment on “Estimating the reproducibility of psychological science”. Science 351(6277): 1037.
Goodman SN (2001) Of p-values and Bayes: A modest proposal. Epidemiology 12(3): 295–297.
Hartshorne JK, Schachner A (2012) Tracking replicability as a method of post-publication open evaluation. Frontiers in Computational Neuroscience 6(8): 1–14.
Ioannidis JP, Munafo MR, Fusar-Poli P, et al. (2014) Publication and other reporting biases in cognitive sciences: Detection, prevalence, and prevention. Trends in Cognitive Sciences 18(5): 235–241.
Ioannidis JPA (2005) Why most published research findings are false. PLoS Medicine 2: 696–701.
Ioannidis JPA (2008) Why most discovered true associations are inflated. Epidemiology 19: 640–648.
Johnson VE (2013) Revised standards for statistical evidence. Proceedings of the National Academy of Sciences 110(48): 19313–19317.
Klein RA, Ratliff KA, Vianello M, et al. (2014) Investigating variation in replicability. Social Psychology 45(3): 142–152.
Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251): 943.
Peng R (2015) The reproducibility crisis in science: A statistical counterattack. Significance 12(3): 30–32.
Prinz F, Schlange T, Asadullah K (2011) Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery 10: 712.
Rosenthal R (1979) The file drawer problem and tolerance for null results. Psychological Bulletin 86: 638–641.
Scargle JD (2000) Publication bias: The “file-drawer” problem in scientific inference. Journal of Scientific Exploration 14: 91–106.
Sterling T, Rosenbaum WL, Weinkam JJ (1995) Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician 49: 108–112.
Steward O, Popovich PG, Dietrich W, et al. (2012) Replication and reproducibility in spinal cord injury research. Experimental Neurology 233: 597–605.
Wasserstein RL, Lazar NA (2016) The ASA’s statement on p-values: Context, process, and purpose. The American Statistician.