Abstract
Meta-analysis is sometimes seen as occupying the top of the “pyramid of evidence,” and random effects (RE) is the canonical meta-analysis model of psychological research (Ioannidis, 2016; Owens et al., 2010). Large-scale surveys and preregistered multilab replications (PMRs) have revealed that publication-selection bias, high heterogeneity, and low statistical power are the central challenges to the credibility of psychological research (Fraley & Vazire, 2014; Klein et al., 2018; Open Science Collaboration, 2015; Stanley et al., 2018, 2022). Simulation studies establish that RE typically produces large biases and high rates of false positives when there is publication-selection bias (Bom & Rachinger, 2019; Carter et al., 2018; Henmi & Copas, 2010; Stanley, 2017; Stanley & Doucouliagos, 2014, 2015; Stanley et al., 2017; van Assen & van Aert, 2015). RE’s large biases and high rates of false positives are corroborated in applications when RE results are compared with PMRs (Kvarven et al., 2020). The central purpose of this article is to demonstrate that the unrestricted weighted least squares (UWLS) weighted average should routinely replace RE in psychology meta-analyses, regardless of whether there is publication bias.
If, as we show below, small-sample studies are more heterogeneous, then three major challenges to psychology (heterogeneity, low power, and publication-selection bias) emanate largely from a single source. Furthermore, when heterogeneity is correlated with a study’s standard errors, we show that RE estimates are dominated, statistically, by an alternative meta-analysis weighted average—the UWLS (Stanley & Doucouliagos, 2015, 2017). Unlike RE, UWLS better accommodates correlated heterogeneity because it is built on a model of multiplicative heterogeneity.
To make this case, we need to show that UWLS is expected to have superior statistical properties relative to RE even when there is no publication bias. Our simulations show that if standard errors and heterogeneity are correlated in a meta-analysis, then UWLS will dominate RE in all cases, with or without publication bias (see Table 3). But what evidence is there that standard error and heterogeneity are typically correlated in an area of psychological research? We offer preregistered meta-research evidence that standard errors and heterogeneity are typically correlated within a psychology meta-analysis and across dozens of meta-analyses. However, before we can conduct this meta-meta-analysis and gather evidence of widespread correlation of standard error with heterogeneity, we must first introduce the meta-regression tests (variance ratio meta-regression analysis [VR-MRA]) that can identify whether standard errors and heterogeneity are in fact correlated in a meta-analysis.
After an illustration, we introduce these new meta-regression tests for a correlation of standard errors and heterogeneity. We then apply these new tests to dozens of meta-analyses to evaluate whether there is evidence of a predominant correlation between standard errors and heterogeneity in psychology. Only after offering evidence supporting these two important lines of reasoning do we directly address our main thesis: that UWLS statistically dominates RE in typical application.
As an illustrative example, consider the once highly regarded theory of ego depletion. Ego depletion posits that people have a limited supply of willpower that decreases with overuse. Ego depletion is one of the theories that have come into question because it failed to replicate in a PMR.
Hagger et al.’s (2016) PMR found a scientifically and statistically negligible ego-depletion effect.
However, conventional meta-analysis misses the underlying weakness of ego-depletion experiments altogether; for example, it reports strong evidence of a medium to large ego-depletion effect.
In this article, we develop new meta-regression methods to identify whether heterogeneity is associated with standard errors and thereby with sample size, both within an area of research and across many meta-regressions when combined into a meta-meta-regression. If heterogeneity is correlated with standard errors, the RE model is no longer valid because it assumes that random heterogeneity is independent of sampling errors. We applied these new meta-regression methods to a preregistered group of 53 meta-analyses and found clear and robust evidence that heterogeneity and standard errors are generally correlated in psychology. Finally, we offer new simulations grounded in the correlation revealed by this meta-research evidence that show UWLS statistically dominating RE whether or not there is publication-selection bias. These results have substantial implications for practice because they compel the replacement of RE by UWLS as the conventional method to summarize systematic reviews and meta-analyses of psychological research.
RE Versus Correlated Heterogeneity
The RE model assumes that effect sizes, such as Cohen’s d, are the sum of an overall mean effect, random heterogeneity, and sampling error.
Random heterogeneity is assumed to be normally distributed with mean zero and variance τ² and, crucially, to be independent of each study’s sampling error.
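In standard notation (a sketch; the symbols follow common meta-analysis usage rather than the article’s original typesetting), the additive RE model that the article labels Equation 1 can be written as follows:

```latex
% Additive random-effects (RE) model (Equation 1)
d_i = \delta + \nu_i + \varepsilon_i, \qquad
\nu_i \sim N(0,\ \tau^2), \qquad
\varepsilon_i \sim N(0,\ SE_i^2)
```

where νᵢ (random heterogeneity) and εᵢ (sampling error) are assumed mutually independent and independent of SEᵢ. Correlated heterogeneity, in which τ varies with SEᵢ, violates exactly this independence assumption.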
Regression analyses of squared deviations have a long history as tests of the assumptions of a statistical model. Examples include the White, the Park, and the Glejser tests of homoskedasticity (Glejser, 1969; Park, 1966; White, 1980). Tests of individual variances are based on specific regression models. A systematic pattern among these squared deviations is treated as evidence that the assumed model is invalid (heteroskedastic) and its standard errors biased. We use meta-regression analysis (MRA) to investigate whether the observed heterogeneity of reported effect sizes is systematically related to their standard errors.
These considerations lead to a meta-regression of the square root of the variance ratio (VR) on a study’s standard error.
For the derivation and statistical rationale of this VR-MRA (Equation 2), see Section I of the Supplemental Material available online. Next, we report simulations that establish the validity of VR-MRA as a test of correlated heterogeneity and thereby a test of the validity of the RE model. In this article, the central role of this new test, VR-MRA, is to establish the widespread correlation of standard errors and heterogeneity by applying VR-MRA to a preregistered meta-meta-analysis (see Meta-research Evidence section below).
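Equation 2’s exact construction appears in the Supplemental Material; as a hedged sketch (our reading, with our own function and variable names), the test can be implemented as a regression of the square root of each study’s variance ratio on its standard error, where we take the variance ratio to be the squared deviation from the RE mean divided by the RE model’s implied total variance:

```python
import numpy as np

def vr_mra(d, se):
    """Sketch of a variance-ratio meta-regression (VR-MRA) test.

    Assumptions (ours, not necessarily the article's exact Equation 2):
    VR_i = (d_i - RE mean)^2 / (SE_i^2 + tau^2), with tau^2 estimated by
    DerSimonian-Laird; sqrt(VR_i) is then regressed on SE_i.  A positive
    slope signals heterogeneity that grows with the standard error.
    """
    d, se = np.asarray(d, float), np.asarray(se, float)
    w = 1.0 / se**2
    d_fe = np.sum(w * d) / np.sum(w)                    # fixed-effect mean
    q = np.sum(w * (d - d_fe) ** 2)                     # Cochran's Q
    tau2 = max(0.0, (q - (len(d) - 1)) /
               (np.sum(w) - np.sum(w**2) / np.sum(w)))  # DL heterogeneity
    w_re = 1.0 / (se**2 + tau2)
    d_re = np.sum(w_re * d) / np.sum(w_re)              # RE mean
    vr = (d - d_re) ** 2 / (se**2 + tau2)               # variance ratios
    X = np.column_stack([np.ones_like(se), se])         # intercept + SE
    (intercept, slope), *_ = np.linalg.lstsq(X, np.sqrt(vr), rcond=None)
    return float(intercept), float(slope)
```

Under a valid RE model, every VRᵢ has expectation near 1 regardless of SEᵢ, so the slope should be near zero; heterogeneity concentrated among small (large-SE) studies pushes the slope positive.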
Simulations
We conduct several simulations in which the key research dimensions (heterogeneity, the distribution of sample sizes, and mean effect sizes) are set to reflect the typical values found in large surveys of psychological research (Fraley & Vazire, 2014; Stanley et al., 2019). Supplement II.A in the Supplemental Material provides full details of simulations of VR-MRA using the core of past simulations’ code and design, previously posted on OSF (https://osf.io/eh974/; Stanley, 2019) and employed in studies of psychology (Stanley & Doucouliagos, 2022; Stanley et al., 2021) using other meta-analysis methods.
The distribution of sample sizes in the primary studies {15, 35, 50, 100, or 200} mirrors a large survey of personality and social-psychology experiments (Fraley & Vazire, 2014). Following Stanley et al. (2021), each simulated meta-analysis has its mean effect (in terms of Cohen’s d) and heterogeneity calibrated to values representative of these surveys.
Table 1 reports the mean, Type I error rate, and power of the estimated VR-MRA slope coefficient.
VR-MRA Simulations: 10,000 Replications
Note that these simulations confirm the validity of VR-MRA as a test of correlated heterogeneity.
Nonetheless, VR-MRA has an important limitation when applied to individual meta-analyses—low statistical power. Column 3 (Correlated Heterogeneity) reports the results of the same simulation experiment but with heterogeneity forced to be moderately correlated with standard errors. In particular, τ is set at {.4, .3, .3, .3, .15} for the respective sample sizes {15, 35, 50, 100, and 200}.
Thus, VR-MRA should not be applied to the average meta-analysis alone, but only to large meta-analyses or across many meta-analyses. Nevertheless, knowing that a test has low power and interpreting the findings accordingly may still allow using the test more broadly. There is precedent for this practice in tests that probe selective-publication bias. For example, the Egger test has comparable power, and it is frequently used (Egger et al., 1997; Stanley et al., 2021). However, results of the Egger test do not permit conclusive statements (Lau et al., 2006). We likewise caution against overinterpretation of the VR-MRA test when it is applied to single meta-analyses.
Illustration
Returning to Carter et al.’s (2015) ego-depletion meta-analysis, VR-MRA provides clear evidence that heterogeneity is correlated with standard errors.
The practical implication of this example is not to trust RE but instead to use alternative methods that are not based on the RE model. The UWLS is such a meta-analysis summary estimator. It is neither fixed effect (FE) nor RE. UWLS and FE always give the same point estimate, but UWLS automatically accommodates heterogeneity when present. Like RE and FE, UWLS is an inverse-variance weighted average. However, RE’s inverse variance weights are 1/(SE² + τ̂²), which add the same estimated heterogeneity variance to every study and thereby give small, unreliable studies relatively more weight than UWLS’s 1/SE² weights do.
The WAAP (weighted average of adequately powered studies) is a version of UWLS that down-weights small studies even further. Not only are small ego-depletion studies inadequately powered, but VR-MRA offers evidence that they are also less reliable. Only one ego-depletion study is adequately powered (power > 80%), and only 10 of 116 have power greater than 50%, retrospectively calculated.
WAAP uses Cohen’s (1988) widely accepted convention of 80% to define adequate power (Stanley et al., 2017). WAAP = UWLS when UWLS is calculated only on those studies with retrospective power greater than 80%.
For ego depletion, WAAP = 0.100 (CI = [−0.096, 0.295]). Likewise, UWLS calculated on only those ego-depletion studies with at least 50% power is not statistically significant.
Meta-research Evidence
To overcome VR-MRA’s low power in individual meta-analyses, we conducted a meta-analysis of many VR-MRA results. Meta-analysis is often regarded as the best way to increase the statistical power of individual studies and to resolve the ambiguity of mixed findings across studies. Jackson and Turner (2017) showed that five or more studies “reasonably consistently achieve powers from random-effects meta-analyses that are greater than the studies that contribute to them” (p. 280). To increase the power of individual VR-MRA results, we combined the VR-MRA findings from these 53 “preregistered” meta-analyses and used RE meta-analysis to summarize their aggregate evidence. Our purpose for seeking meta-research evidence of the correlation of standard errors and heterogeneity is to establish that this correlation is widespread among meta-analyses and to gauge its magnitude to accurately calibrate simulations that compare the statistical properties of RE and UWLS—see UWLS section below.
Consistent with our simulations and correlated heterogeneity (Column 3 of Table 1), we found that 18 (or 34%) of these VR-MRAs have a statistically positive estimate of the VR-MRA slope coefficient.
Meta-meta-analysis simulations
To the same simulation design and structure used for VR-MRA estimates and reported in Table 1, we added a loop that collects 50 random VR-MRA findings at a time and calculates a conventional RE estimate of these 50 meta-regression estimates of the slope coefficient.
Column 1 of Table 2 reports the average RE estimate and its statistical properties.
Simulations of 10,000 Random-Effects Meta-Analyses of VR-MRAs
Robustness of the meta-research evidence
For the sake of robustness and further independent validation, we investigate a second set of meta-analyses. Kvarven et al. (2020) conducted a systematic review of all meta-analyses that have an associated PMR and found 15 such pairs. An RE summary of only 15 VR-MRA estimates will have much less power. Nonetheless, the RE estimate of the VR-MRA slope coefficient remains positive and statistically significant in this second set.
Finally, as another robustness check, we provide further meta-research evidence that heterogeneity is correlated with standard errors from an alternate MRA model of RE variances in Section III of the Supplemental Material (Equation 3).
See Section III of the Supplemental Material for a discussion of the total variance meta-regression analysis (TV-MRA) model (Equation 3), its application to these sets of meta-analyses, and the corresponding simulation findings of 10,000 RE meta-analyses of collections of both 50 and 16 randomly generated meta-regressions of this alternative model of RE’s variance. Evidence from the TV-MRA model supports the above evidence of a correlation between heterogeneity and SE in psychology.
Discussion
Combining evidence across meta-regression tests consistently supports the hypothesis that heterogeneity is correlated with standard errors and is thereby inconsistent with the RE model. This correlation is also corroborated in the aggregate by a correlation between the median standard error and RE-estimated heterogeneity across meta-analyses.
But why would small studies be found to be more heterogeneous? There are several likely and overlapping reasons. Researcher flexibility in choosing methods, protocols, and outcome measures provides the variation across which a statistically significant result can be selected. Such researcher flexibility generates heterogeneity, clearly seen in the large differences in heterogeneity found between tightly controlled multilab replications and meta-analyses (Klein et al., 2018; Kvarven et al., 2020; Linden & Hönekopp, 2021). As seen in many simulations, small studies require more intensive selection across this heterogeneity to achieve statistical significance (Stanley & Doucouliagos, 2014). When 50% of the reported results have been selected to be statistically significant, the average value of VR-MRA’s slope coefficient rises notably in our simulations.
However, other forces are also likely to be at work. By their very nature, exploratory studies are likely to find notably different effect sizes from one exploration to the next. Small studies may employ lower-quality standards with higher risk of bias (IntHout et al., 2015, p. 866), and less reliability generates higher heterogeneity. Correlated heterogeneity may be caused by a mixture of different types of “replications” that typically comprise meta-analyses. Several researchers have classified replications as “conceptual” versus “direct” (or “close”; Hedges & Schauer, 2019; Linden & Hönekopp, 2021; Schauer & Hedges, 2020; S. Schmidt, 2009). Direct or close replications involve the use of the same experimental procedures in an effort “to replicate an earlier study as faithfully as possible” (Linden & Hönekopp, 2021, p. 360). In contrast, studies that are regarded as conceptual replications use different methods to explore the boundaries of theory, widen the field’s understanding, and assist in developing new theory (S. Schmidt, 2009). Thus, the results from conceptual replications are expected to produce higher heterogeneity than direct replications (Linden & Hönekopp, 2021).
Furthermore, direct replications often use large sample sizes (e.g., the Open Science Collaboration and Many Labs projects) to ensure adequate power. When not adequately powered, a lack of replication success would be quickly dismissed as the expected result of low power rather than attributed to the original experiment. Conversely, conceptual replications, which are more numerous, face no such demands, as demonstrated by the low power that many surveys of psychology have found (Cohen, 1962; Fraley & Vazire, 2014; Maxwell, 2004; F. L. Schmidt & Oh, 2016; Stanley et al., 2018). In fact, small samples might be advantageous for conceptual replications:
A safer strategy might be to “salami-slice” one’s resources to generate more studies which, with sufficient analytical flexibility, will almost certainly produce a number of publishable studies. . . . Authors may therefore (consciously or unconsciously) conduct a larger number of smaller studies, . . . rather than risk investing their limited resources in a smaller number of larger studies. (Vankov et al., 2014, pp. 1–2)
Thus, meta-analyses that include largely conceptual replications along with a few direct replications would be expected to produce higher heterogeneity in small studies than in large ones.
Needless to say, VR-MRA has limitations beyond the low power in single meta-analyses discussed above. Low power will be exacerbated in fields that have few studies per meta-analysis, as seen in some fields of medicine and health psychology. VR-MRA, as a regression, requires notable variation of its independent variable (standard errors or sample sizes) in a meta-analysis to be estimated reliably. Nevertheless, combining many VR-MRAs can be informative if most have notable variation in sample sizes.
Implications for Practice
Since Cohen (1988), statistical power has been universally acknowledged as a central determinant of a study’s scientific contribution. “Studies with low statistical power produce inherently ambiguous results because they often fail to replicate” (Psychonomic Society, 2012, p. 1). “You should routinely provide evidence that your study has sufficient power to detect effects of substantial interest.”

Unless psychologists begin to incorporate methods for increasing the power of their studies, the published literature is likely to contain a mixture of apparent results buzzing with confusion. . . . Not only do underpowered studies lead to a confusing literature but they also create a literature that contains biased estimates of effect sizes. (Maxwell, 2004, p. 161)
When small studies systematically produce highly heterogeneous findings, their scientific contribution further erodes. Recent surveys of psychology meta-analyses found substantial heterogeneity among study findings, average τ > 0.3.
Heterogeneity correlated with standard errors also has implications for the practice of systematic reviews and meta-analysis. It has long been known that RE overweights small studies and is thereby highly biased when there is publication-selection bias (Carter et al., 2018; Henmi & Copas, 2010; Poole & Greenland, 1999; Stanley & Doucouliagos, 2014, 2015). When heterogeneity is correlated with standard errors, the RE model is invalid, and RE will further overweight unreliable small-study findings. Fortunately, there is a simple alternative meta-analysis approach, the UWLS, that automatically accommodates correlated heterogeneity and gives unreliable and potentially biased small studies less weight.
Unrestricted Weighted Least Squares
As discussed above, UWLS is a simple weighted average that allows heterogeneity to be correlated with standard errors. UWLS and FE have identical point estimates, but UWLS standard errors and CIs are larger when there is heterogeneity (Stanley & Doucouliagos, 2015, 2022). UWLS is easily calculated by a simple regression of the standardized effect size (t = d/SE) on precision (1/SE) without an intercept.
UWLS is the estimated slope coefficient on precision, and its conventional regression standard error automatically reflects any excess heterogeneity.
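Concretely, UWLS is ordinary least squares of tᵢ = dᵢ/SEᵢ on precision 1/SEᵢ with no intercept (Stanley & Doucouliagos, 2015). A minimal sketch (function name ours):

```python
import numpy as np

def uwls(d, se):
    """Unrestricted weighted least squares (UWLS) weighted average (sketch).

    The no-intercept slope of t_i = d_i/SE_i on 1/SE_i equals the
    inverse-variance (FE) point estimate, while the conventional regression
    standard error scales multiplicatively with the residual dispersion
    (it is "unrestricted" because that scale is not forced to 1).
    """
    d, se = np.asarray(d, float), np.asarray(se, float)
    t, x = d / se, 1.0 / se
    b = np.sum(x * t) / np.sum(x * x)   # slope = inverse-variance mean
    k = len(d)
    # residual scale; with a single study, fall back to the FE standard error
    s2 = np.sum((t - b * x) ** 2) / (k - 1) if k > 1 else 1.0
    return float(b), float(np.sqrt(s2 / np.sum(x * x)))
```

For example, for d = (0.2, 0.4) with SE = (0.1, 0.2), both UWLS and FE give a point estimate of 0.24, but the UWLS standard error (0.08) reflects the residual dispersion rather than the FE formula.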
Stanley et al. (2017) offered a variation of UWLS that uses only those studies with 80% or higher power, thereby giving the smallest studies no weight at all. Simulations show that this WAAP is less biased than other weighted averages (specifically, RE, FE, and UWLS) when there is publication-selection bias, and the bias reduction can be quite large in application (Ioannidis et al., 2017; Stanley et al., 2017).
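A sketch of WAAP under these conventions (retrospective power from the two-sided normal approximation, with the inverse-variance mean of all studies as the proxy for the true effect; the function names and the single-study fallback are ours):

```python
import numpy as np
from math import erf, sqrt

def _norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def waap(d, se, z_crit=1.96, power_cut=0.80):
    """Weighted average of the adequately powered (WAAP) studies (sketch).

    1. Inverse-variance mean of all studies as a proxy for the true effect.
    2. Each study's retrospective power to detect that proxy effect.
    3. UWLS computed only on studies with power >= 80% (Cohen's convention).
    Returns None when no study is adequately powered.
    """
    d, se = np.asarray(d, float), np.asarray(se, float)
    w = 1.0 / se**2
    proxy = np.sum(w * d) / np.sum(w)
    z = np.abs(proxy) / se
    power = np.array([1.0 - _norm_cdf(z_crit - zi) + _norm_cdf(-z_crit - zi)
                      for zi in z])
    keep = power >= power_cut
    if not np.any(keep):
        return None
    t, x = d[keep] / se[keep], 1.0 / se[keep]
    b = np.sum(x * t) / np.sum(x * x)
    k = int(keep.sum())
    s2 = np.sum((t - b * x) ** 2) / (k - 1) if k > 1 else 1.0
    return float(b), float(np.sqrt(s2 / np.sum(x * x)))
```

Returning None when no study reaches 80% power mirrors the article’s caution that meta-analyses consisting only of small studies should be interpreted with great care.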
However, when the RE model is imposed on the simulation structure and there is no publication bias, these simulations show that RE has slightly lower mean squared error (MSE) than UWLS (Bom & Rachinger, 2019; Stanley & Doucouliagos, 2014; Stanley et al., 2017). When there is publication-selection bias, these same simulations show that UWLS has notably smaller MSE than RE. What remains to be investigated is whether UWLS will dominate RE in all cases when heterogeneity is correlated with standard errors, as typically seen in psychology. Next, we present a new simulation study that considers the consequences of correlated heterogeneity on the statistical properties of RE, UWLS, and WAAP.
Simulations
Our final simulation study closely followed the simulation design of VR-MRA, reported above and detailed in Section II of the Supplemental Material. The code for the core of the design was posted online in 2019 and used in other studies (Stanley, 2019; Stanley & Doucouliagos, 2022; Stanley et al., 2021). The most influential research dimensions are calibrated from large surveys of psychological research (Fraley & Vazire, 2014; Stanley et al., 2018)—for greater details, see Section II of the Supplemental Material.
The central difference between these simulations and those previously published is that we assume that heterogeneity, τ, is correlated with standard errors to the same degree as seen in our meta-meta-analysis results. In particular, heterogeneity, τ = {.4, .3, .3, .3, .15}, is assumed to be associated with the respective sample sizes {15, 35, 50, 100, and 200}.
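This data-generating step can be sketched as follows (the SE formula is the standard large-sample approximation for Cohen’s d with two equal groups, evaluated at the mean effect for simplicity; the τ-to-n mapping is the calibration stated above, and all names are ours):

```python
import numpy as np

# Heterogeneity calibrated to total sample size, as in the text:
TAU_BY_N = {15: 0.40, 35: 0.30, 50: 0.30, 100: 0.30, 200: 0.15}

def sample_study(n, delta=0.0, rng=None):
    """Draw one simulated study with heterogeneity correlated with SE.

    SE uses the usual approximation for Cohen's d with two groups of n/2:
    SE = sqrt(4/n + delta^2 / (2n)).  Small n gets a larger tau, so small
    studies are both noisier and more heterogeneous.
    """
    if rng is None:
        rng = np.random.default_rng()
    tau = TAU_BY_N[n]
    study_effect = delta + rng.normal(0.0, tau)     # study's own true effect
    se = np.sqrt(4.0 / n + delta**2 / (2.0 * n))    # approximate SE of d
    return study_effect + rng.normal(0.0, se), se   # reported d and its SE
```

With δ = 0, reported effects from n = 15 studies have total standard deviation √(0.4² + 4/15) ≈ 0.65, versus roughly 0.21 for n = 200, reproducing the small-studies-more-heterogeneous pattern.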
In the upper half of Table 3, we report the bias, MSE, and Type I error rate (or power) for RE, UWLS, and WAAP when there is no publication bias; the lower half includes the same information after an assumption that 50% of the reported results have gone through a process of selection for statistical significance. As Table 3 shows, UWLS has smaller MSE and Type I errors than RE in all cases in which heterogeneity is correlated with standard errors and there is no publication-selection bias. Biases are inconsequential rounding errors. When there is publication-selection bias (Table 3, bottom half), UWLS’s improvement over RE is much greater. With 50% publication-selection bias, UWLS’s MSE is only 59% as large as RE’s, its bias is 73% of RE’s (Table 3, bottom row), and WAAP is better still.
Correlated Heterogeneity: Statistical Properties of RE, UWLS, and WAAP
Note: Mean effect δ = 0, measured as Cohen’s d.
Discussion
When the heterogeneity variance is correlated with sampling error variance (or sample size), simulations show that UWLS dominates RE, and WAAP does even more to reduce bias and MSE when there is publication-selection bias. Because we find robust meta-research evidence that heterogeneity and standard errors are typically correlated in psychology, UWLS (and, whenever possible, preferably its WAAP variant) should be adopted as the conventional meta-analysis estimate of mean effects and summary of systematic reviews. Even if heterogeneity and standard errors are independent and the RE model is entirely valid, simulations show that there is practically nothing to gain by using RE over UWLS when there is no publication-selection bias; however, there is much to lose if there is publication-selection bias (Bom & Rachinger, 2019; Stanley & Doucouliagos, 2014; Stanley et al., 2017). When correlated heterogeneity is common, the choice is clear—UWLS. If the systematic reviewer fears the effect of publication-selection bias and wishes to reduce it more aggressively, then there are versions of UWLS that also accomplish this goal—WAAP and weighted and iterative least squares (WILS). WILS uses UWLS to identify whether there is an excess of statistical significance in an area of research and discards those studies most responsible (Stanley & Doucouliagos, 2022; Stanley et al., 2021). Often, the remaining exaggeration is scientifically and practically insignificant (Stanley & Doucouliagos, 2022). When all studies in a meta-analysis are small, then any meta-analytic estimate should be interpreted with great caution (Ioannidis, 2005; Stanley et al., 2022).
Conclusion
We introduce new meta-regression methods, VR-MRA and TV-MRA, that can identify whether the magnitude of heterogeneity across study findings is correlated with their standard errors. The meta-analysis of 53 “preregistered” meta-analyses (as well as a separate set of 15 meta-analyses) provides clear and robust evidence of this correlation and shows that small-sample studies typically have higher heterogeneity. Such variable heterogeneity violates the RE model’s assumption of additive and independent heterogeneity—recall Equation 1. Both findings have important implications for practice.
For decades, there has been wide recognition that the low power (and small sample size) of the typical psychology study compromises reliable scientific inference (APA, 2010; Cohen, 1962, 1988; Maxwell, 2004; Psychonomic Society, 2012; Rossi, 1990). When small studies have not only inadequate statistical power but also high heterogeneity, their scientific contribution is dubious. Our results, therefore, further expose the necessity of preregistration and preanalysis plans so long as typical sample sizes remain small.
The meta-research evidence presented in this article also serves as a test of the RE model. When the heterogeneity variance is correlated with the sampling-error variance to the degree found in dozens of VR-MRAs, simulations show that RE is dominated by an alternative weighted average, the UWLS. With or without publication-selection bias, UWLS statistically dominates RE when heterogeneity is correlated. The advantage of UWLS over RE is quite notable when there is selection for statistical significance (i.e., publication-selection bias or questionable research practices). UWLS is built on a model of multiplicative heterogeneity and thereby easily accommodates correlated heterogeneity. It has long been known that UWLS dominates RE when there is publication bias. When the magnitude of heterogeneity is also correlated with standard errors, the UWLS advantage is absolute. Thus, there is a strong case for the UWLS weighted average with its WAAP and WILS variants to replace random effects as the conventional meta-analysis estimator of psychological research.
Supplemental Material
Supplemental material, sj-docx-1-amp-10.1177_25152459221120427, for “Beyond Random Effects: When Small-Study Findings Are More Heterogeneous” by T. D. Stanley, Hristos Doucouliagos, and John P. A. Ioannidis, in Advances in Methods and Practices in Psychological Science.
