Meta-analysis, which seeks to summarize the results of multiple studies of the same phenomenon, has become an indispensable tool in contemporary research. In pioneering work, Smith and Glass (1977) showed that psychotherapy has a strong positive effect on the average patient studied, and F. L. Schmidt and Hunter (1977) demonstrated that the validity of employment tests generalizes more readily across different job types than previously believed. Influential surveys of meta-analyses have demonstrated the effectiveness of psychological interventions (Lipsey & Wilson, 1993), provided effect-size benchmarks for social psychology (Richard et al., 2003), and summarized findings on psychological gender similarities (Hyde, 2014).
Here we provide a survey of meta-analyses that shifts the perspective from the mean effect size in a population of studies (i.e., the size of the average effect in a particular domain) to the heterogeneity of results (i.e., the degree to which results differ across studies of the same issue). In any meta-analysis, heterogeneity indicates the extent to which the summarized studies tap into the same population effect size. If the same population effect size is investigated, heterogeneity will be zero. Even in this case, sampling error will create differences in observed effects across studies; zero heterogeneity is inferred when these observed differences do not exceed the level expected from sampling error alone. Consider the effectiveness of psychotherapy as an example. If heterogeneity were zero, the effectiveness of psychotherapy would be the same across all studies, regardless of the issue patients present with (e.g., anorexia, depression, specific phobia), the type of therapy they receive (e.g., cognitive-behavioral, psychoanalytic), and other differences. This is obviously unrealistic; for example, some conditions are treated more successfully than others (Huhn et al., 2014). Heterogeneity thus reflects how much the population effect sizes differ across studies. We provide a formal treatment of heterogeneity later, but Figure 1 provides examples with high and low heterogeneity.

Fig. 1. Funnel plots for two meta-analyses. Linck et al. (2014) investigated the link between working memory and second-language comprehension (a). The estimated mean of the population of effect sizes, the standard deviation of observed effect sizes, and the estimated heterogeneity of true effect sizes are given in the figure.
Heterogeneity tends to receive little attention from researchers (Aytug et al., 2012; Dieckmann et al., 2009; Ioannidis, 2008), but we argue here that much is to be gained from its study because (a) heterogeneity reflects the degree of understanding of the subject matter being investigated and (b) it offers useful suggestions for improving our collective research practice.
Why Heterogeneity Matters
Low (as opposed to high) heterogeneity reflects a more advanced understanding of the subject matter being studied. This is because high heterogeneity, at least as long as it remains unexplained, suggests a lack of strong coherence between the concepts applied and the data observed. Take visuospatial skills in people with autism spectrum conditions (ASCs) as an example (Muth et al., 2014). In line with current theorizing, the average study found (moderately) better visuospatial performance in people with ASCs than in IQ-matched control subjects on a number of standardized tasks. At the same time, the heterogeneity of the results proved to be high, even for the same task. Not accounted for by any theory, this random variation in study results (which might have resulted from unrecognized variability in ASCs, unreliability in diagnosis, or other factors) points to a shortcoming in our understanding. It also implies that the result of the next study into the same question is highly unpredictable, over and above the uncertainty arising from sampling error.
Moreover, low heterogeneity should facilitate future progress for two reasons. First, a clear structure in observable data can in itself guide understanding—a point stressed by 17th-century luminaries Francis Bacon and Isaac Newton as well as modern philosophers of science such as Hans Reichenbach, Norwood Russell Hanson, and Herbert Simon (Schickore, 2018; Simon, 1973). For example, the 19th-century astronomer William Huggins observed that the light of different stars, when seen through a prism, shows the same set of spectral lines; however, he also observed that these lines are collectively shifted to varying degrees. The observation of this systematic redshift pattern led to the discovery that stars move away from us and at different speeds (Schneider, 2014). Sixty years later, Edwin Hubble observed that the degree of stars' redshift is linearly related to their distance from us, which led to the discovery that the universe is expanding (Schneider, 2014). Skinner (1956) and Stevens (1957) provide prominent examples of the guiding role of orderly observational data in psychology. Second, the systematic violation of expectations has often proved crucial for scientific discovery (Kuhn, 1970). Thus, the failure of an increasingly convoluted Ptolemaic system to further improve the predictions of astronomical events motivated Copernicus to devise a new, heliocentric model of the cosmos; and the failure to detect expected changes in the speed of light—derived from the idea that light propagates through a medium—led Einstein to abandon the idea of a luminiferous ether and to fundamentally rethink physics. As captured in Bacon's dictum that "truth emerges more readily from error than from confusion" (Kuhn, 1970, p. 57), such anomalies cannot emerge when theoretical concepts and observed data lack a clear connection in the first place.
We therefore propose heterogeneity as a useful perspective from which to judge the success of psychological science, alongside other yardsticks such as the generation of good theories (Wallis, 2015), the design of successful interventions (Lipsey & Wilson, 1993), and beneficial contributions to policy design (Fischhoff, 1990). Thus, heterogeneity is of considerable intrinsic value, which is why we seek to systematically measure it in the psychological-research results presented here. What are typical levels? Do they differ across domains, and if so, can we make sense of these differences? Apart from its intrinsic value, knowledge of actual levels of heterogeneity has immediate practical implications: Heterogeneity has been shown to typically decrease the statistical power of studies; that is, any real effect under investigation is less likely to produce a statistically significant result (Kenny & Judd, 2019; McShane & Böckenholt, 2014; Shrout & Rodgers, 2018). For sample-size planning to take this into account, reliable estimates of heterogeneity are needed, which we supply here. Finally, and perhaps most importantly, our findings have clear implications for improving our collective research practice, as we discuss at the end of this article. Before we can address the details of our study, it is necessary to deal with a number of critical points, which we address in the next sections.
Moderators
Heterogeneity reflects a lack of understanding only when it remains unaccounted for. Let us reconsider our example of psychotherapy effectiveness. A meta-analysis that summarizes all sensible studies should find large heterogeneity because these studies will differ in key variables such as the issue being treated, the therapy being used, and so on. If this heterogeneity can be explained by moderators (e.g., that effectiveness differs strongly across treated disorders or across types of psychotherapy), this obviously no longer indicates a lack of knowledge. (On the contrary, it might be argued that explained heterogeneity reflects an increase in understanding.) We are not aware of any study to date that has systematically investigated the extent to which the heterogeneity that is observed in a set of studies is accounted for by moderators. We therefore investigate it here.
Conceptual Versus Close Replications
Heterogeneity as a concept makes sense only if the set of studies for which it is computed can, in some sense, be conceived as replications of each other. In this context, the differentiation between close and conceptual replications has become fruitful (S. Schmidt, 2009; Zwaan et al., 2018). The former seek to replicate an earlier study as faithfully as possible. The Open Science Collaboration (2015) project is a famous example. In a massive collaborative effort, the authors sought to replicate 100 studies published in high-profile psychology journals. The replications sought to copy study materials, data analyses, and other key aspects of the original studies as closely as possible and can therefore be considered close replications. In contrast, the studies summarized in a meta-analysis can typically be considered to be conceptual replications (F. L. Schmidt & Oh, 2016); that is, although they address the same topic or mechanism, they often differ markedly in their design, study materials, participants, data analysis, and other key aspects. Heterogeneity should thus tend to be larger in conceptual replications than in close replications.
A systematic comparison of heterogeneity in close and conceptual replications should be instructive. For example, Stanley et al. (2018) argued that the low replicability observed in Open Science Collaboration (2015) might reflect low power caused by high heterogeneity. However, the heterogeneity data that they presented in support of this argument stemmed almost exclusively from conceptual replications. Their assumption that heterogeneity in close replication attempts might be similar rested on only two examples for the latter.
Note that heterogeneity cannot be reliably estimated for any single pair of studies (original and replication) among the 100 in the Open Science Collaboration. Instead of a single replication, this would require multiple close replications of the same effect (e.g., Klein et al., 2014). We therefore use Many Labs–type replications to study heterogeneity in close replications.
Measuring Heterogeneity
So far, we have not addressed how heterogeneity can be quantified. In psychology, heterogeneity is usually discussed in the context of standardized effect sizes (e.g., Cohen's d), and two approaches to its quantification are common.
The first approach, I², expresses heterogeneity as the proportion of the total variability in observed effect sizes that is attributable to variability in population effect sizes rather than to sampling error (Higgins & Thompson, 2002).
The second approach directly estimates the variability in population effect sizes. It is generally assumed that population effect sizes relating to a given phenomenon follow a normal distribution; τ refers to their standard deviation (Borenstein et al., 2009) and can be calculated when individual study effect sizes and standard errors are available. As an example, consider the meta-analysis in Figure 1a. The standard deviation of the observed effect sizes in the primary studies is 0.36. (For the sake of consistency, we use Cohen's d throughout.) Because this observed variability also includes sampling error, τ must be smaller than 0.36.
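In formal terms, this decomposition rests on the standard random-effects model (a sketch in textbook notation, consistent with Borenstein et al., 2009, rather than a formula reproduced from the article):

\[
  y_i = \delta + u_i + \varepsilon_i,
  \qquad u_i \sim N(0, \tau^2),
  \qquad \varepsilon_i \sim N(0, v_i),
  \qquad \operatorname{Var}(y_i) = \tau^2 + v_i,
\]

where \(y_i\) and \(v_i\) are the observed effect size and sampling variance of study \(i\); the variance of observed effects thus adds sampling variance to τ².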

Fig. 2. Two distributions of population effect sizes (standardized mean differences). The distribution on the left (solid line) shows population effect sizes with a mean (δ) of 0.45; the standard deviations (τ) of the two distributions are given in the figure.
Because τ is an unknown population parameter, it must be estimated. Its estimator, denoted τ̂, can be computed in several ways; we use the DerSimonian-Laird estimator (see the Method section).
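For reference, the DerSimonian-Laird estimator takes the familiar moment-based form (standard notation, not reproduced from the article):

\[
  \hat{\tau}^2 = \max\!\left(0,\ \frac{Q - (k - 1)}{\sum_i w_i - \sum_i w_i^2 / \sum_i w_i}\right),
  \qquad
  Q = \sum_i w_i \, (y_i - \hat{\mu})^2,
\]

with weights \(w_i = 1/v_i\), weighted mean \(\hat{\mu} = \sum_i w_i y_i / \sum_i w_i\), and \(k\) studies; τ̂ is the square root of this quantity.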
A Sensible Sampling Frame
What is a sensible sampling frame for a survey of heterogeneity? One potential strategy would be to use a representative sample of meta-analyses across all of psychology. However, our heterogeneity measure, τ, is expressed in the units of the underlying effect size, so meaningful comparisons require a common effect-size metric across all included meta-analyses. We therefore sampled meta-analyses of standardized mean differences (Cohen's d) from three subdisciplines: cognitive, organizational, and social psychology.
Aims
Our aims are as follows: Given the intrinsic value of heterogeneity as an indicator of a lack of understanding, we seek to establish a typical level of heterogeneity in conceptual replications. We compare these levels across the subdisciplines of cognitive, organizational, and social psychology and against heterogeneity observed in close replications. We also investigate the extent to which heterogeneity in a set of studies can typically be accounted for by moderators. We further explore whether any characteristics explain differences in heterogeneity.
To foreshadow our key results, we find that heterogeneity tends to be very large in conceptual replications but moderate in close replications. Our investigations regarding the drivers of heterogeneity show that moderators do little to account for heterogeneity. We also find a previously unexplored strong relationship between heterogeneity and effect size, which allows us, for the first time, to make predictions about expected levels of heterogeneity for a given phenomenon. These findings have clear implications for the improvement of our collective research practice, as we discuss at the end of this article.
Method
Study search and selection strategy
We aimed to investigate all available Many Labs–type replications. We searched CurateScience.org for relevant reports in April 2017 and added studies from Many Labs 2 (Klein et al., 2018) at a later stage. Further, we investigated 50 meta-analyses each from cognitive, organizational, and social psychology. Feasibility, rather than power considerations, determined this choice. Our preregistered study protocol is available at http://aspredicted.org/blind.php?x=bf46k8.
In November 2016, we searched PsycINFO (journals only) for “meta-analy*” in the abstract field. We restricted searches to PsycINFO classifications “3000 Social Psychology,” “3600 Industrial and Organizational Psychology,” “2340 Cognitive Processes,” “2343 Learning and Memory,” and “2346 Attention.” Because this search did not yield a sufficient number of eligible meta-analyses (see below for inclusion criteria), we also searched the Web of Science (articles only) for “meta-analy*” in the categories “Psychology Social,” “Psychology Applied,” and “Psychology” (excluding meta-analyses that fell outside our target subdisciplines; see Fig. 3). All eligible meta-analyses were inspected in random order until we reached the desired number of 50 meta-analyses.

Fig. 3. Sampling of meta-analyses.
Inclusion criteria
Meta-analyses for the three subdisciplines were included if they met all of the following criteria: First, they had to address a substantive psychological effect (rather than, e.g., the psychometric properties of a questionnaire). Second, the analyzed effects had to be described as standardized mean differences (Cohen's d) or be convertible into them.
In this way, we identified 50 meta-analyses for cognitive psychology, 50 for organizational psychology, 50 for social psychology, and 57 for close replications (see Table S1 in the Supplemental Material available online).
Data extraction and analysis
If an article reported the results of more than one meta-analysis, the one including the largest number of studies was extracted. If multiple meta-analyses included the same number of studies, the first reported was used.
Heterogeneity for each meta-analysis was computed using the DerSimonian-Laird estimator in the metafor package (Version 2.1-0; Viechtbauer, 2010) for the R software environment (Version 3.4.1; R Core Team, 2017).
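A minimal sketch of this computation (the effect sizes and sampling variances below are hypothetical and for illustration only):

library(metafor)  # Viechtbauer (2010)

# Hypothetical Cohen's d values and their sampling variances
d <- c(0.31, 0.58, 0.12, 0.45, 0.72, 0.05)
v <- c(0.040, 0.025, 0.060, 0.030, 0.050, 0.045)

fit <- rma(yi = d, vi = v, method = "DL")  # random-effects model, DerSimonian-Laird
sqrt(fit$tau2)  # tau-hat: estimated SD of population effect sizes
fit$I2          # I^2: percentage of total variability due to heterogeneity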
To keep effect sizes and levels of heterogeneity consistent across meta-analyses, all effect sizes were input as Cohen's d.
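Where primary effects were reported as correlations, a standard conversion (e.g., Borenstein et al., 2009) can express them as d; the following is our own sketch, not the article's documented procedure:

r_to_d <- function(r) 2 * r / sqrt(1 - r^2)  # standard r-to-d conversion
r_to_d(0.25)  # a correlation of .25 corresponds to d of about 0.52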
It turned out that the frequency distributions for some of our observed outcome variables were skewed to the right. To limit the influence of outliers, we therefore report Winsorized means (see Table 1).
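For readers unfamiliar with Winsorizing, extreme values are clamped to chosen quantiles before averaging. A minimal sketch (the 10% trimming proportion is our assumption; the proportion used for Table 1 is not stated here):

winsorized_mean <- function(x, p = 0.10) {
  lo <- quantile(x, p)          # lower cutoff
  hi <- quantile(x, 1 - p)      # upper cutoff
  mean(pmin(pmax(x, lo), hi))   # clamp extremes, then average
}
winsorized_mean(c(0.10, 0.20, 0.25, 0.30, 5.00))  # the outlier 5.00 is pulled in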
Results
How meta-analyses address heterogeneity
Of 150 meta-analyses, 123 tested moderators, but only 83 (55%) reported a measure of heterogeneity. In 2009, the influential Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Moher et al., 2009) recommended that meta-analyses address heterogeneity. Even among meta-analyses published after 2009, heterogeneity was reported in only 60% of cases. Note that a test of the statistical significance of heterogeneity (e.g., a Q test) indicates only whether heterogeneity is present, not how large it is.
Overall, heterogeneity was quantified in less than a third of cases (43 of the 150 meta-analyses).
Heterogeneity observed in close replications and meta-analyses
Table 1 shows descriptive statistics for close replications and meta-analyses. As expected, heterogeneity was much lower in close replications than in the meta-analyses of conceptual replications (see also Fig. 4).
Table 1. Descriptive Statistics for Study 1
Note: All means are Winsorized.

Fig. 4. Observed levels of heterogeneity for 57 close replications and 50 meta-analyses in each subdiscipline (cognitive, organizational, and social psychology). In each plot, the horizontal line indicates the Winsorized mean, the top and bottom of the box indicate the interquartile range (IQR), the whiskers represent values above and below the IQR, and dots represent outliers.
Levels of heterogeneity were unexpectedly similar across the three subdisciplines of cognitive, organizational, and social psychology (see Fig. 4).
Moderators
To investigate the extent to which moderators account for heterogeneity, we looked at all 36 meta-analyses that reported the information necessary for this analysis. Consistent with our summary above, moderators accounted for only a modest share of the observed heterogeneity.
This moderator analysis does not suggest that the large heterogeneity in our meta-analysis sample is readily explained by mixing apples and oranges. Still, the possibility remains that authors (potentially unwisely) combine highly diverse studies and then fail to address relevant moderators. To address this point, we rated the broadness or narrowness of the inclusion criteria for each meta-analysis, using a single, global five-point scale ranging from very narrow to very broad.
Exploratory analyses on what drives heterogeneity
Heterogeneity differed substantially between meta-analyses. Exploratory analyses revealed a strong positive relationship between a meta-analysis's mean effect size and its heterogeneity (see Fig. 5).

Fig. 5. Heterogeneity as a function of meta-analyses' mean effect size. The plot in (a) shows the 150 meta-analyses from cognitive, social, and organizational psychology; the plot in (b) shows the 57 sets of close replications.
In light of the link between mean effect size and heterogeneity, we also explored further potential drivers of heterogeneity.
Looking at systematic reviews in health care, IntHout et al. (2015) found that smaller studies are more heterogeneous (measured as I²) than larger ones.
Richard et al. (2003) proposed that as a research field matures, the focus shifts from establishing an effect to exploring its boundaries, and this should increase heterogeneity in findings. If we accept the number of studies (k) in a meta-analysis as a proxy for the maturity of its research field, this account predicts a positive relationship between k and heterogeneity.
We sought to test these two competing explanations (exploring boundaries vs. broader inclusion criteria). If later research into an effect tends to explore its boundaries, we would expect higher heterogeneity in studies conducted late than in those conducted early. We therefore looked at all meta-analyses that seemed to capture a sufficiently mature research area, including the 82 with a sufficiently large number of primary studies, and compared heterogeneity between each meta-analysis's earlier and later studies.
If the observed correlation between mean effect size and heterogeneity holds generally, it allows the expected level of heterogeneity for a given phenomenon to be predicted from its mean effect size.
Discussion
We found that the quantification of heterogeneity in meta-analyses is uncommon. When it is undertaken, authors rarely rely on the measure that we argue is most informative. Average heterogeneity proved to be very large in conceptual replications but moderate in close replications.
The effect of heterogeneity on statistical power and its implications for the interpretation of low replicability rates in the Open Science Collaboration (2015) project have received considerable attention (Shrout & Rodgers, 2018; Stanley et al., 2018). We address these more specific issues here and discuss more general implications for the progress of psychological science in the General Discussion.
The meaning of observed heterogeneity levels
It is helpful to first consider the meaning of the average heterogeneity levels that we observed.
We further illustrate the meaning of heterogeneity with two examples from cognitive psychology. To further understand the importance of working memory for second-language proficiency development and processing, Linck et al. (2014) investigated the strength of this link in a meta-analysis. Included studies used a range of working memory tasks and second-language comprehension measures in diverse samples. The strength of the relationship proved to be medium in size, and heterogeneity was comparatively low: Results of individual studies clustered closely around the mean effect, which makes the outcome of the next study into this link relatively predictable.
Baker et al. (2014) used meta-analysis to investigate the degree of independence between general intelligence and mental-state understanding. Included studies used a range of established intelligence tests in diverse samples; however, all studies used the same widely used test of mental-state understanding (Reading the Mind in the Eyes test). As in the previous example, the strength of the relationship proved to be medium in size; heterogeneity, however, was high, which makes the outcome of the next study into this link difficult to predict.
In sum, it appears that the relationship between working memory and second-language proficiency is better understood than that between intelligence and performance on the Reading the Mind in the Eyes test. More generally, everything else being equal, meta-analyses with lower heterogeneity will be more informative.
Potential biases in heterogeneity estimates
Before we address the implications of these findings in greater detail, it is necessary to highlight a number of points regarding the trustworthiness of our estimates.
Representativeness of our samples
Our sampling of meta-analyses in cognitive, organizational, and social psychology was rigorous, and perusal of the topics (see Table S1) confirms broad coverage typical of these subdisciplines. We did not find evidence for heterogeneity differences across these subdisciplines, which might indicate that our results generalize more broadly across psychology. This is supported by the fact that Stanley et al. (2018), in a broader sample of meta-analyses from Psychological Bulletin, found comparably high levels of heterogeneity.
Publication bias and questionable research practices
Publication bias (Sterling, 1959) and questionable research practices (QRPs; Simmons et al., 2011) are well-documented problems in psychological research (John et al., 2012; McShane et al., 2016). As a result, only a biased sample of all conducted studies appears in the published literature; "unsuccessful" studies typically remain invisible. Given that larger observed effects are more likely than smaller ones to be statistically significant (and thus "successful"), publication bias leads to upwardly biased effect sizes in published studies and meta-analyses (e.g., McShane et al., 2016). To achieve statistically significant, and therefore publishable, results, researchers might resort to QRPs (e.g., collect a number of similar dependent variables but report findings from only the most successful one). QRPs can dramatically increase the rate of false-positive results (Simmons et al., 2011) and thus again lead to inflated effect sizes in published studies and meta-analyses.
Mathematical modeling and computer simulations suggest that publication bias can lead to an underestimation or overestimation of heterogeneity; however, the former tends to be more prevalent than the latter (Augusteijn et al., 2019; Jackson, 2006). In addition, the overestimation of heterogeneity resulting from publication bias and QRPs tends to be much smaller than the levels of heterogeneity observed in conceptual replications here (Hönekopp & Linden, 2019). From this viewpoint, the very large heterogeneity we observed in conceptual replications cannot plausibly be dismissed as an artifact of publication bias and QRPs.
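The typical direction of this bias can be illustrated with a small simulation (our sketch, with assumed values δ = 0.3, τ = 0.4, and n = 40 per group): selecting only statistically significant studies truncates the distribution of observed effects, and the DerSimonian-Laird estimate of τ shrinks accordingly.

library(metafor)
set.seed(42)
k <- 2000; n <- 40
d_true <- rnorm(k, mean = 0.3, sd = 0.4)    # heterogeneous population effects
se     <- sqrt(2 / n)                       # approximate standard error of d
d_obs  <- rnorm(k, mean = d_true, sd = se)  # observed effect sizes
vi     <- rep(se^2, k)                      # sampling variances
sig    <- (d_obs / se) > qnorm(.975)        # only "significant" studies published

sqrt(rma(d_obs, vi, method = "DL")$tau2)            # all studies: close to 0.4
sqrt(rma(d_obs[sig], vi[sig], method = "DL")$tau2)  # published only: much smaller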
Overreliance on WEIRD samples
Studies in psychology, even if they seek to address human nature in general, rely almost exclusively on samples from Western, educated, industrialized, rich, and democratic (WEIRD) societies. Henrich et al. (2010) argued that WEIRD samples are among the least suitable to make general inferences about human nature and that many phenomena that are well established in WEIRD populations fail to generalize to other populations. Obviously, this is a concern only for those studies that seek to address human nature; however, this is frequently the case. For these cases, the findings from Henrich et al. imply that observed heterogeneity would often be higher if researchers did not rely almost entirely on WEIRD samples. This should hold equally for conceptual and close replications.
Accuracy of meta-analyses
For feasibility reasons, we had to rely on reported effect sizes for the underlying primary studies. One systematic investigation of meta-analyses in medicine found that about one in five effect-size computations for primary studies was erroneous (Gøtzsche et al., 2007). This should add (error) variance to the meta-analysis and consequently inflate observed heterogeneity. Given the strict protocols and high degree of transparency for Many Labs–type studies (e.g., Klein et al., 2014, 2018), erroneous effect sizes should be less of a concern for close replications.
Summary
In sum, our data for cognitive, organizational, and social psychology should be fairly representative of these subdisciplines, and results might generalize fairly well beyond them. Publication bias, QRPs, and overreliance on WEIRD samples should artificially lower heterogeneity estimates; meta-analytic errors regarding the extraction of effect sizes from primary studies should have the opposite effect. On balance, then, there is no strong evidence to suggest that our very high heterogeneity estimates grossly overestimate actual levels of heterogeneity. If anything, heterogeneity-deflating biases appear more numerous than heterogeneity-inflating biases. Thus, our results suggest that actual heterogeneity is typically very high in sets of conceptual replications. Although the representativeness of our close-replications sample is unclear, the resulting heterogeneity estimates should, overall, be less prone to error than those for conceptual replications.
Implications for the replicability of close replications
As discussed previously, the Open Science Collaboration (2015) project famously attempted close replications of 100 studies. Although larger samples were used than in the original studies, statistical significance was achieved in only 36% of replications (25% in social psychology and 50% in cognitive psychology). This finding has become a catalyst of the controversial debate about the health of psychology research, which is still ongoing (e.g., Earp & Trafimow, 2015; Pashler & Harris, 2012; F. L. Schmidt & Oh, 2016; Simons, 2014; Stroebe & Strack, 2014). This is not the place to review this debate (for a comprehensive summary, see Zwaan et al., 2018), but one of its strands is of particular interest here. Stanley et al. (2018) suggested that heterogeneity accounts for the Open Science Collaboration's low replication rates. The authors estimated heterogeneity to be high, which would leave even the comparatively large replication samples underpowered; however, their estimate stemmed almost exclusively from conceptual replications, and the much lower heterogeneity we observed in close replications speaks against this explanation.
Moreover, if replication failure reflects heterogeneity-driven low power, as Stanley et al. (2018) claimed, the large difference in replication rates between cognitive and social psychology (Open Science Collaboration, 2015) should be reflected in larger heterogeneity in the latter. Our finding of virtually identical heterogeneity levels across cognitive and social psychology does not support this view. In conjunction with the low heterogeneity observed in close replications, it strengthens the interpretation that the low replication rate demonstrated in the Open Science Collaboration might be attributable to publication bias and QRPs. This is good news from our perspective because promising strategies to combat these biases have been developed (Munafò et al., 2017). On a more general level, one may note that the central issue with the results from the Open Science Collaboration is less about the percentage of original results that are true and more about the suggestion that a key plank in our common standards to accept evidence as valid (a statistically significant result in a published study) is less trustworthy than commonly assumed.
How should heterogeneity be estimated for power calculations?
Average levels of heterogeneity, as reported here, can serve as a sensible default in power calculations when no better information is available; where a meta-analysis of the phenomenon in question exists, its τ̂ estimate should be used instead.
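A minimal simulation sketch of how heterogeneity enters such power calculations (assumed values: δ = 0.5, n = 64 per group, α = .05; not a reanalysis of our data):

set.seed(1)
sim_power <- function(delta, tau, n, reps = 5000) {
  p <- replicate(reps, {
    d_true <- rnorm(1, delta, tau)   # each study draws its own population effect
    x <- rnorm(n)                    # control group (unit variance)
    y <- rnorm(n, mean = d_true)     # treatment group
    t.test(x, y)$p.value
  })
  mean(p < .05)                      # proportion of significant results
}
sim_power(delta = 0.5, tau = 0,   n = 64)  # no heterogeneity: power around .80
sim_power(delta = 0.5, tau = 0.3, n = 64)  # with heterogeneity: noticeably lower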
Conclusions
We suggested that heterogeneity is a useful perspective for reflecting the degree of understanding psychology achieves. Science can be described as a quest to explain the apparent complexity of the natural world through simpler, fundamental principles. Empirical cumulativeness reflects the extent to which empirical findings fit such a simple or explicable pattern. All else being equal, high levels of (unexplained) heterogeneity indicate lower empirical cumulativeness (Asendorpf et al., 2013; Hedges, 1987; Murphy, 2017; Richard et al., 2003; Sells, 1963). For conceptual replications in three of psychology's core subdisciplines (and plausibly beyond; see Stanley et al., 2018; van Erp et al., 2017), we found that heterogeneity is typically large (see Fig. 2) and unexplained, with little reason to believe that our estimates are inflated. To add some perspective, we can compare typical levels of heterogeneity (variability within a specific topic) with the variability in mean effect sizes across meta-analyses (variability between topics). Whereas mean effect sizes naturally differ between topics, we found that the typical variability of results within a single topic approaches the same order of magnitude, underscoring just how large within-topic heterogeneity is.
Before we explore important implications of this twin finding and possible improvements for our collective research practice, we address a likely objection to our argument that heterogeneity meaningfully reflects the degree of understanding psychology achieves.
Reply to an objection
A likely objection is that progress is driven by theories and that effect sizes tend to be irrelevant for most psychological theories (e.g., Baumeister, 2016; Strack, 2017); if effect sizes are largely irrelevant, their variability (i.e., heterogeneity) is likewise of little consequence. We think that such a perspective is mistaken for a number of reasons. First, even if effect sizes were largely irrelevant, the direction of effects remains important: In the face of large heterogeneity, the direction of an effect might be difficult to predict. Second, effect sizes are by no means irrelevant for increasing understanding; therefore, their degree of variability is also important. Although some psychological theories are not rooted in quantitative concepts (e.g., Piaget's stages in cognitive development), most psychological research is rooted in measurement. Given that measurement is regarded as a practically indispensable tool for investigation, it seems inconsistent to be uninterested in its results. In general, strong theories tend to be specific in the sense that they declare a large range of potential observations to be contrary to theory, thereby creating ample scope for the theory to be empirically challenged (Kuhn, 1970). Likewise, the ability to make precise predictions is often a hallmark of more mature science (Schickore, 2018). Effect sizes are obviously not the only route to achieve such specificity, but they may often provide a viable way forward. If heterogeneity is high, such specificity is difficult to achieve.
Finally, effect sizes are highly relevant for both explanations and practical applications. Psychological explanations typically rely on probabilistic relationships (e.g., in mate choice, men tend to put more emphasis on a partner's physical attractiveness than women; Feingold, 1990), and, all else being equal, stronger effects convey better explanations (Woodward, 2014). For example, the sex difference in height (approximately 2 standard deviations) is large enough to make sex genuinely informative about an individual's height; far weaker effects explain correspondingly less.
Implications for testing theories
Our twin finding of large heterogeneity in conceptual replications and moderate heterogeneity in close replications has important implications for the testing of theories.
Knowledge as a tool
One relates to the use of knowledge as a tool. Imagine a situation in which the test of a psychological theory X requires inducing a particular mood. If this mood induction is based on a general principle that shows large heterogeneity, a negative finding of the test can be blamed on (unreliable) methods, and theory X is protected from failure. If heterogeneity thus precludes the meaningful empirical scrutiny of theories, theoretical progress will be limited (Ferguson & Heene, 2012; Greenwald, 2012; Kerr, 1998; LeBel & Peters, 2011; Meehl, 1978). In this context, the moderate heterogeneity observed in close replications is reassuring: When procedures are closely specified, effects reproduce with reasonable consistency, so negative tests of a theory cannot easily be dismissed as mere method failure.
Theories’ boundaries
Another implication of our findings is that the evaluation of theories also requires a broad exploration of the “research space” (Asendorpf et al., 2013), that is, the space defined by the combination of different manipulations of the independent variable, different dependent variables, different study populations, and so on. As an example, consider the set of stimuli used. If only a single standard set is used in a research domain to evoke the expected effect, some theory-irrelevant feature of that set might drive the observed effect (Fiedler, 2011). This problem can be detected only by using diverse (but theory-conforming) sets of stimuli. Consider also the case in which a theory offers a narrow explanation to account for an observation (e.g., memory for a word list is improved when the survival value of its items is to be judged). If a more general and thus more parsimonious explanation holds (e.g., memory for a word list is improved by any judgments that trigger self-referent encoding), this can be discovered only by testing instances of the research space that violate the overly narrow theory while still holding for the more general account (Fiedler et al., 2012; Shrout & Rodgers, 2018).
Meta-analysis and the testing of theories
A good theory should specify its scope. To evaluate the theory, meta-analysts must move beyond a narrow focus on the mean effect size and its statistical significance and take heterogeneity into account. This is obviously not a new insight (e.g., Higgins & Thompson, 2002; Hunter & Schmidt, 1990). However, our results regarding the reporting of heterogeneity in meta-analyses suggest this is rarely implemented in practice. One reason might be that frequently used approaches to heterogeneity fail to appeal to researchers' imagination: As shown earlier, quantification of heterogeneity is often missing altogether or is expressed in ways that might elude intuitive understanding (e.g., as a Q statistic and its p value).
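One presentation that does speak to the imagination is the range within which most population effect sizes fall. Under the normal model for population effect sizes introduced earlier, a sketch with assumed values δ = 0.45 and τ = 0.30:

delta <- 0.45; tau <- 0.30
qnorm(c(.025, .975), mean = delta, sd = tau)  # central 95% range: about -0.14 to 1.04

A reader sees at a glance that even the direction of the effect remains uncertain in new studies, which a bare Q statistic does not convey.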
Reducing unexplained heterogeneity as a sensible heuristic to advance understanding
Given that unexplained heterogeneity tends to be both large and undesirable, its reduction should become an important goal. Among other advantages, this will increase coherence between the concepts we use and our observational data, facilitate empirical scrutiny of our theories, provide greater clarity regarding the power of the explanations we can offer, and facilitate the design of practical applications. Weiss et al. (2014) offered a conceptual framework for heterogeneity in experiments, which is useful for discussing measures to either explain or reduce it.
A conceptual framework for heterogeneity
According to Weiss et al. (2014), heterogeneity in a set of experiments arises from three sources. First, studies can differ in their treatment contrasts, that is, the experimentally induced difference between the experimental and control group. The second source of heterogeneity is moderators that reside in the participants: If an effect is age-dependent, for example, differences in participants' age across studies will induce heterogeneity. Finally, studies might differ on relevant context moderators; for example, an effect might vary across cultures or situations. Fruitful applications of this framework can be found in Weiss et al. (2017).
Treatment contrasts
Differences in studies’ treatment contrasts will typically be driven by the strength of experimental manipulations. Stronger manipulations will often bring about stronger effects than weaker manipulations. Variability in the strength of manipulations across studies will thus induce heterogeneity in the results. If the strength of manipulations cannot be (or is not) properly expressed, it will be difficult to explain this heterogeneity. The unspecified or underspecified strength of experimental manipulations strikes us as a frequent issue across psychology that could often be avoided. We take the effect of bilateral symmetry on facial attractiveness as an arbitrary example. Correlational studies and experiments alike suggest that symmetry increases facial attractiveness (Rhodes, 2006). If the strength of experimental symmetry manipulations was described in relation to the natural variation in facial symmetry on which correlational studies rely, the variability of symmetry could be described on a common scale across all studies. These between-study differences in variability of symmetry (whether naturally occurring or experimentally induced) should be able to explain differences in results across studies and thus reduce heterogeneity. We are not aware of such attempts.
Our suggestion that systematically specifying the strength of manipulations of the independent variable will prove helpful is underpinned by the observation that many seminal insights in behavioral science relied on descriptions of the independent variable on a ratio scale. This is true for probabilities in classical conditioning (Rescorla & Wagner, 1972), operant conditioning (Herrnstein, 1961), perception under uncertainty (Tanner & Swets, 1954), and judgments and decision-making under uncertainty (Gigerenzer et al., 1991; Kahneman & Tversky, 1979); for the temporal relationship of stimuli or events and their effects on visual perception (Marcel, 1983), memory (Peterson & Peterson, 1959), and the discounting of future outcomes (e.g., Frederick, 2002); for the relationship between physical and perceived stimulus intensity (Stevens, 1957); and for degrees of genetic similarity, which underpin all estimates of the heritability of psychological traits (Plomin, 1990).
Differences in studies' treatment contrasts can also be affected by differences in the control groups, particularly in the case of real-world interventions. For these, "business as usual" (i.e., what a control group experiences in the absence of the intervention) can differ markedly across studies and thus add heterogeneity to their results.
Person and context moderators
The experimental test of a motivational intervention conducted by Yeager et al. (2019) provides an excellent illustration for both person and context moderators. Their short online intervention taught a nationally representative sample of U.S. students in secondary education that they can train their intellectual abilities like a muscle, which proved to have a small positive effect on students’ grades. The authors hypothesized and confirmed that low-achieving students would benefit more from the intervention than high-achieving students (person moderator) and that the intervention would be most effective in schools with supportive peer norms (context moderator).
A meta-analytic search for moderators is most promising when it is driven by theory (Tipton et al., 2019a, 2019b). In this context it is noteworthy that psychologists have devoted great energy to describing individual differences in systematic ways (e.g., McCrae & Costa, 1997) but that comparable approaches to classify situations are, to the best of our knowledge, missing.
Multisite experiments
Meta-analyses are often limited in their ability to explain heterogeneity because relevant information on moderators or other sources of heterogeneity is unavailable for some or all of their primary studies. Multisite experiments, which directly address potential moderators in their design, are a promising alternative (e.g., Yeager et al., 2019). Such experiments are naturally arduous, but collaboration between many researchers through crowdsourcing holds great potential for such projects (Uhlmann et al., 2019).
Standardized versus original-units effect sizes
Finally, we want to draw attention to points outside of the heterogeneity framework proposed by Weiss et al. (2014). Our treatment of heterogeneity was based on descriptions of individual study results using standardized effect sizes. This is the norm for meta-analyses and conveys the obvious advantage that studies can be sensibly integrated even when they use different dependent variables. Nonetheless, standardized effect sizes might not be the best way to capture study results (Baguley, 2009; Bond et al., 2003; Tukey, 1969). Table 2 provides an example in which differences in sample means might be said to provide a more accurate description of individual results and of their differences across studies. This increase in accuracy might lead to reduced heterogeneity estimates and to the clearer emergence of informative moderators. The wealth of available data from Many Labs–type close replication studies (in which sets of close replications share the same dependent variable) provides rich opportunities for developing heterogeneity analyses on the basis of mean differences instead of standardized effect sizes and for establishing whether this reduces heterogeneity estimates. If it does, we should also investigate the extent to which this approach can be fruitfully extended to the analysis of conceptual replications.
Table 2. Irrelevant Differences in Standard Deviations Across Studies Negatively Affect the Suitability of Standardized Effect Sizes
Note: Values are means with standard deviations in parentheses. Three similar, fictitious studies into the same phenomenon use the same dependent variable. Because Study 2 used a more diverse sample, its standard deviations are larger and its standardized effect size is therefore smaller, even though the difference between sample means is the same across studies.
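A worked illustration of the note's point (numbers assumed for the sketch, not taken from Table 2): with an identical mean difference, doubling the pooled standard deviation halves Cohen's d.

\[
  d = \frac{M_1 - M_2}{SD_{\text{pooled}}}:
  \qquad d_{\text{Study 1}} = \frac{5}{10} = 0.50,
  \qquad d_{\text{Study 2}} = \frac{5}{20} = 0.25 .
\]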
Critics might argue that the portrayed shortcoming in standardized effect sizes (see Table 2) undermines our survey of heterogeneity. However, heterogeneity on the scale that we observed in conceptual replications cannot result from moderate inaccuracies in standardized effect sizes. Large heterogeneity is real, and its reduction should therefore become an important aim. To judge whether we make progress on this issue and to learn which strategies are best suited to reduce unexplained heterogeneity, its measurement is necessary. The approach we presented here strikes us as the most appropriate currently available.
Outlook
Chemists in the 18th century, who did not yet understand the difference between compounds and mixtures, realized that substances often combine in fixed proportions (e.g., you need 61.5 g of magnesia to neutralize 100 g of sulfuric acid; Leicester, 1965). Although useful for their daily practice, they did not attach much importance to this regularity because it appeared to lack universality (after all, you can mix one or three spoons of sugar into a cup of tea). Early in the 19th century, John Dalton parsed the seemingly incongruous observational data in a new way and realized the significance of fixed proportions, thus paving the way for the measurement of relative atomic weights and atomic theory, a major breakthrough in the history of chemistry (Kuhn, 1970). The linear relationship between stars' distance from Earth and the speed at which they move away from us was probably easier to perceive: Within a short time span, Georges Lemaître and Edwin Hubble independently discovered this law and, consequently, the expansion of the universe (Schneider, 2014). These examples illustrate that (a) regularity in observational data often acts as a lodestar for discovery (Simon, 1973) and (b) even the identification of pockets of regularity might be greatly beneficial. Reducing heterogeneity should make it easier for psychologists to perceive such regularities, and the prospect of new discoveries might be the strongest incentive to do so. We have suggested some means to this end. We are sure that, once heterogeneity and its reduction receive more of the attention they deserve, the ingenuity of our colleagues will greatly add to our own ideas.
Supplemental Material
Supplemental material for this article, including Table S1, is available online.