Abstract
Many people have argued that psychology is experiencing a “crisis of confidence” given several prominent failures to replicate seemingly established findings (Earp & Trafimow, 2015; Pashler & Wagenmakers, 2012). Concerns about replicability have led to several highly publicized, large-scale replication attempts. The results of these studies have not been especially encouraging; replicability rates have ranged from 39% to 77% using a criterion of statistical significance.
Although any individual failure to replicate could be due to a multitude of reasons—and not necessarily indicative of a larger systemic problem—mounting numbers of replication failures are difficult to dismiss. The relatively low replication rates reported by the studies mentioned above have been held up as prima facie evidence of a significant discipline-wide problem. To address this problem, there have been a number of proposals to change the way research is conducted. Examples of such practices include preregistration (Jonas & Cesario, 2016; Nosek et al., 2018; Wagenmakers et al., 2012), making data publicly available (Asendorpf et al., 2013; Miguel et al., 2014), and increasing sample sizes (Cohen, 1962; Tversky & Kahneman, 1971). Other researchers have advocated for different statistical procedures, such as using Bayes factors instead of null-hypothesis significance testing (Etz & Vandekerckhove, 2016) or changing the default cutoff for statistical significance.
Judgments regarding replicability are vital to the cumulative progress of science. The accumulation of scientific knowledge results from an ongoing process whereby new discoveries build on previous discoveries. The nature of this process is summarized elegantly by the idea of “standing on the shoulders of giants.” According to this metaphor, scientists can see further than their predecessors only because they stand atop a body of knowledge those predecessors have built. For this process to be successful, however, it is important that scientists can accurately judge the robustness of the existing knowledge on which they build. Attempts to build on findings that turn out to be false could mean a great deal of time, effort, and money wasted. It is therefore in researchers’ best interest to choose their giants carefully.
Although it can be difficult to fully articulate what signals whether a finding is robust or, more generally, the “quality” of a study, it appears that researchers are quite capable of judging the replicability of published results. In 2012 and 2014, prediction markets were run to ascertain the ability of psychologists to predict whether studies included in the Reproducibility Project would be successfully replicated (Dreber et al., 2015). The final market price fell on the correct side of 50 (i.e., matched the eventual replication outcome) for 29 of the 41 replications attempted, correctly anticipating the outcomes for 71% of the sample. In contrast, Camerer et al. (2016) ran a prediction market for replications of 18 experimental economics studies and did not find a significant relationship between the market and the actual replication rate. However, Camerer et al. (2018) later ran a prediction market for replications of 21 social-science studies published in Nature and Science, and in that market prices again tracked the actual replication outcomes well.
Anecdotally, the quality of research has been inferred, at least in part, from the prestige of the journals in which it is published. Likewise, the reputations of individual researchers have sometimes been used to informally assess the quality of published research. It is not known, however, whether these traditional indicators of quality are also used to judge the replicability of findings. The newer research practices intended to improve replicability, described above, would seem to offer more salient criteria on which to judge replicability. However, very little research has been done to establish which, if any, of these newer practices actually influence confidence in the replicability of a finding.
The aim of this study was to examine how various research practices or features influence other researchers’ confidence in the replicability of a finding. We did this by surveying corresponding authors of articles published in psychology journals between 2014 and 2018. The analysis presented here focuses on three research questions. The first question concerns the extent to which each feature influences average confidence in replicability. For example, on average, how strongly are people influenced by features such as sample size, the use of preregistration, or the use of a particular statistical method? The second research question concerns the underlying factors that influence judgments of replicability. For example, are there distinct themes that people consider when making these evaluations? The third question concerns whether there are different profiles of beliefs regarding the effectiveness of certain research practices for fostering replicability. For example, are certain people’s judgments of replicability more strongly influenced by issues relating to statistical methods, whereas other people’s judgments are more strongly influenced by the presence of preregistered hypotheses and publicly available data? By clarifying what researchers perceive as relevant to, or as signaling, the likelihood of replicability, we hoped to identify research practices that are likely to be widely adopted as well as those that might require further justification. This study also provides benchmark data to compare perceptions of what signals replicability with research practices that actively increase replicability. Determining how attuned psychology is, as a discipline, to the factors that promote replicability is essential for efficiently and effectively increasing the rigor with which psychological research is conducted.
Disclosures
Preregistration
The preregistration documentation for this study can be found at https://osf.io/9nwrd.
Data, materials, and online resources
All the materials, code, and de-identified data from this study can be found at https://osf.io/dj2fx/.
Reporting
We report how we determined our sample size, all data exclusions, and all measures in the study.
Ethical approval
The protocol was approved by the University of Queensland Faculty of Health and Behavioural Science Human Ethics Committee (Protocol 2020000887). The study was carried out in accordance with the provisions of the Declaration of Helsinki.
Method
Participants
Our sample comprised psychological scientists who were recruited using the procedure described by Field et al. (2018; for full details of their procedure, see https://osf.io/s7a3d/). Following their procedure, we searched for journal articles published between 2014 and 2018 that appear under the following categories in the Web of Science database: “Psychology Multidisciplinary,” “Psychology Applied,” “Psychology Clinical,” “Psychology Social,” “Psychology Educational,” “Psychology Experimental,” “Psychology Developmental,” “Behavioral Sciences,” and “Psychology Mathematical.” This search yielded 14,251 articles. We extracted the e-mail address of the corresponding author of each article. After we removed duplicates, 9,017 unique e-mail addresses remained. Our full data-extraction procedure can be found on OSF (https://osf.io/jqyux/). For this study, we used 5,000 of these unique e-mail addresses and reserved the rest for a potential follow-up study. We expected a response rate similar to that of Field et al., which was approximately 12.5%. This response rate would result in a sample size of approximately 625 participants.
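As a rough illustration of this sampling step (not the authors’ actual extraction code, which is available at the OSF links above), the following Python sketch deduplicates a list of extracted corresponding-author e-mail addresses, draws the 5,000 addresses to contact, and computes the number of responses expected at a 12.5% response rate. The file name and random seed are hypothetical.

```python
import csv
import random

# Hypothetical input: one extracted e-mail address per row, possibly with duplicates.
with open("corresponding_author_emails.csv", newline="") as f:
    emails = [row[0].strip().lower() for row in csv.reader(f) if row]

unique_emails = sorted(set(emails))  # 9,017 unique addresses in the study
print(f"{len(emails)} addresses extracted, {len(unique_emails)} unique")

random.seed(2020)  # arbitrary seed, for reproducibility of the split
contacted = random.sample(unique_emails, k=5000)        # addresses surveyed now
reserved = sorted(set(unique_emails) - set(contacted))  # held back for a follow-up study

expected_responses = round(len(contacted) * 0.125)  # ~625 at a 12.5% response rate
print(f"Expected completed surveys: {expected_responses}")
```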
We compared our demographic variables with data from the National Science Foundation’s (NSF; 2017) Survey of Doctoral Recipients to assess the representativeness of our sample. The NSF data contain demographic details of more than 100,000 individuals with a research doctoral degree in psychology from an institution in the United States. Our sample was generally well aligned with the NSF sample, although our sample contained a higher proportion of males, and participants in our sample tended to be slightly younger. In the NSF sample, 41% of these individuals identified as male, and 59% identified as female. Our sample contained a considerably lower percentage of females (61% male, 35% female, 1% preferred to self-describe, 3% preferred not to say). In the NSF sample, 7% of individuals with a U.S. doctoral degree in psychology were under the age of 35 (19% in our sample), 10% were 35 to 39 years old (17% in our sample), 11% were 40 to 44 years old (15% in our sample), 12% were 45 to 49 years old (6% in our sample), 11% were 50 to 54 years old (10% in our sample), 12% were 55 to 59 years old (6% in our sample), 13% were 60 to 64 years old (4% in our sample), and 23% were 65 to 75 years old (9% in our sample). Likewise, our sample contained participants who had more recently received their PhD. In the NSF sample, the PhD was awarded 5 or fewer years prior for 12% of respondents (22% in our sample), 6 to 10 years prior for 14% of respondents (18% in our sample), 11 to 15 years prior for 13% of respondents (15% in our sample), 16 to 20 years prior for 14% of respondents (9% in our sample), 21 to 25 years prior for 13% of respondents (8% in our sample), and more than 25 years prior for 34% of respondents (16% in our sample).
Materials and Procedure
The survey was created using Qualtrics and was distributed using a shareable link sent by e-mail. Reminder e-mails were sent 1 and 3 weeks after the initial send date. Data collection concluded 1 month after the initial survey was sent out. The full survey can be accessed at https://osf.io/2dxe6/. The first survey item asked participants to estimate the percentage of randomly selected studies from the psychological literature that would be successfully replicated if a high-powered replication attempt (with a large enough sample size to measure the effect precisely) using the exact same methods and statistical analyses were to be conducted. Responses were provided on a scale from 0% to 100%.
In the following section, participants were asked to rate how much specific study attributes (e.g., research practices) would increase or decrease their confidence that a randomly selected effect from the psychological literature would be replicated. Participants were asked to consider 76 specific study attributes, each requiring an independent confidence rating. For each participant, the attributes were presented in random order, and responses were provided on a Likert scale ranging from −5 (greatly decreases confidence) to 5 (greatly increases confidence).
Some of these items address factors that have been considered in large-scale replication studies assessing the objective replicability of studies (e.g., sample size, experience of the research team, surprisingness of the effect). Others address attributes that have not been the focus of previous research but nevertheless may bear on perceptions of replicability (e.g., the type of journal in which the original study was published, the length of the article in which the original study was reported). We also included items relating to the subfield of psychology from which the study originated because we were interested in examining whether there was variance in the perceived replicability of research in different subfields. Additional items were based on feedback from an outside team of researchers who have collected data on experts’ predictions of the replicability of research and the reasons considered during these assessments. Fifteen members of the repliCATS team (https://replicats.research.unimelb.edu.au/) responded to e-mails from one of our authors inviting them to provide a list of features they have observed researchers considering when assessing the replicability of research. The repliCATS project has been running since April 2019 as a component of the U.S. government’s Systematizing Confidence in Open Research and Evidence program. Finally, we included control items that presented information that could not plausibly influence replicability (e.g., “The lead author on the original study was right-handed”).
After responding to the 76 attributes, participants were presented with a set of items that assessed demographic details such as their age, gender (“male,” “female,” “prefer not to say,” or “prefer to self-describe”), and career stage (“undergraduate student,” “postgraduate student,” “early career academic,” “mid-career academic,” “senior academic,” “left academia after finishing PhD,” “left academia before finishing PhD,” or “none of these apply to me”). Participants who described themselves as an early-career academic, a midcareer academic, a senior academic, or as having left academia after finishing their PhD were also asked to report the number of years it had been since they were awarded their PhD. Participants were also asked which field of psychology they were most interested in or most associated with, depending on their career phase (“clinical,” “cognitive,” “developmental,” “industrial/organizational,” “evolutionary,” “neuropsychology,” “social,” “quantitative,” “biological,” “health,” “human factors,” “unsure,” “other,” or “prefer not to say”). Participants who had left academia or were postgraduate students, early-career academics, midcareer academics, or senior academics were asked which field they associated with, whereas participants who were undergraduates or who selected the “none of these apply to me” response were asked which field they were most interested in.
The next set of items assessed participants’ knowledge of and familiarity with issues surrounding replicability in psychology. Participants were asked to rate how familiar they were with the current debates surrounding the replicability of psychological science (scale from 0 = not at all familiar to 10 = extremely familiar). They were also asked how concerned they were about replicability in psychology and how often they engage in practices such as preregistration, specifying a priori hypotheses, and running formal power analyses.
The final set of items pertained to the participants’ publication records. Participants were asked (a) what their h-index was, (b) how many publications they had, and (c) how many citations they had. Unfortunately, as a result of issues with potential identifiability, we could not release these data publicly. However, the raw publication data will be released on request to individual researchers who are committed to maintaining confidentiality. The survey took approximately 12 min to complete.
Results
Seventy-six of the 5,000 e-mails were undeliverable (e.g., because the recipient had moved institutions, was on leave, or had retired). Therefore, 4,924 e-mails were successfully delivered. Two hundred ninety-nine surveys were incomplete and were consequently removed from the data analysis. The survey was completed by 503 participants, which equated to a response rate of 10% (this compares favorably with Field et al., 2018, whose response rate for completed surveys was approximately 9.5%).
Demographic and background information
Of the 503 respondents, 306 were male, 176 were female, 16 preferred not to say, and five preferred to self-describe. The average age of the sample was 45.9 years (59 did not provide their age). The sample contained one undergraduate student, 24 postgraduate students, 130 early-career academics, 140 midcareer academics, 162 senior academics, 19 individuals who left academia after finishing their PhD, one who left academia before finishing their PhD, 19 who said that none of these options applied to them, and seven who preferred not to respond. On average, participants who identified as academics or who had left academia after completing their PhD finished their PhD 15.9 years before this study. The percentages of participants who most strongly associated with (or were most interested in) each field of psychology were as follows: cognitive psychology, 17%; clinical psychology, 14%; developmental psychology, 7%; industrial/organizational psychology, 9%; evolutionary psychology, 5%; neuroscience/neuropsychology, 2%; social psychology, 24%; health psychology, 7%; quantitative psychology, 6%; human factors psychology, 2%; biological psychology, 3%; unsure, 1%; and other, 19%. Participants who selected the “other” option listed subfields such as counseling psychology, educational/school psychology, history of psychology, personality psychology, and sport psychology. Participants were able to select more than one answer, so the percentages sum to more than 100%.
On average, participants rated their familiarity with the current debates surrounding the replicability of psychological science at 6.8 out of 10.
Overview of analyses
We analyzed the results in three parts. The goal of Part 1 was to get an overall sense of the influence that each attribute has on confidence in replicability. In this part, we compared the distributions of confidence ratings across the 76 items. The goal of Parts 2 and 3 was to identify different profiles of beliefs regarding the factors that signal replicability. In Part 2, we conducted an exploratory factor analysis to identify the latent themes surrounding the factors that influence people’s confidence in replicability. In Part 3, we conducted a clustering analysis on the factor scores obtained in Part 2 to examine whether there were distinct subgroups of participants whose judgments of replicability were influenced by different issues (or in different ways).
Research Question 1: How Does Each Feature Influence Average Confidence in Replicability?
On average, participants believed that 53% of randomly selected studies from the psychological literature would be successfully replicated under the conditions described in the first survey item.
The distributions of the ratings for each item are shown in Figures 1 and 2. The mean ratings ranged from −2.57 (“original study had low statistical power”) to 2.95 (“original result has been successfully replicated using same methods”). Information that increased confidence in the replicability of the original study included the existence of previous replications, high power or a large sample size, open data and materials, robustness of the phenomenon across contexts, and a representative sample of participants; confidence also increased when the analyses were preregistered or planned in advance or when the original study was a Registered Report. Information that decreased confidence in the replicability of the original study included the study having low power or a small sample size, data and methods that were not openly available, a phenomenon that is highly context dependent, and results that involve a three-way interaction or rest on p values only just below the conventional significance threshold.

The distribution of ratings for all confidence items. The height of each rectangle corresponds to the relative frequency of that rating being chosen. The color of each rectangle indicates the degree to which the presence of the attribute in question increased or decreased confidence in replicability, as indicated by the color key. The values shown on the left side of the plot indicate the mean rating for each attribute. The values on the right side of the plot indicate the percentage of participants for whom each respective factor decreased confidence (red) and increased confidence (green). For the full color figure, see the online version of the article. A table containing the means and standard deviations for all of the confidence items can be found at https://osf.io/b2wmj/.
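The per-item quantities summarized in this figure (the mean rating and the percentages of participants whose confidence decreased or increased) can be recomputed from the de-identified data along the following lines. This is an illustrative Python sketch; the file name and the assumed layout (one row per participant, one column per attribute, ratings from −5 to 5) are assumptions rather than the structure of the released data files.

```python
import pandas as pd

# Assumed layout: one row per participant, one column per attribute, ratings in -5..5.
ratings = pd.read_csv("confidence_ratings.csv")

summary = pd.DataFrame({
    "mean_rating": ratings.mean(),
    "sd_rating": ratings.std(),
    "pct_decreased": (ratings < 0).mean() * 100,  # share of participants rating below 0
    "pct_increased": (ratings > 0).mean() * 100,  # share of participants rating above 0
})

# Order the attributes from the largest decrease to the largest increase in confidence,
# mirroring the ordering used in the figure.
print(summary.sort_values("mean_rating"))
```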

Results of the model selection: the Bayesian information criterion (BIC) values associated with alternative Gaussian mixture models instantiating different numbers of clusters. Lower BICs indicate a better trade-off between fit and parsimony.
Research Question 2: What Are the Factors That Influence Judgments of Replicability, and What Are Their Constituent Features?
To answer this question, we conducted a Bayesian exploratory factor analysis on participants’ responses to each item. The goal of this analysis was to (a) determine the number of underlying factors that best characterized participants’ responses and (b) understand the patterns of relationships between specific features or practices and the underlying factors. The analysis was carried out using the Bayesian approach to exploratory factor analysis developed by Conti et al. (2014), which is implemented by the BayesFM package in R.
This approach uses Markov chain Monte Carlo (MCMC) methods to implement a factor model in which each item is allowed to load onto no more than one latent factor. The analysis estimates the number of factors as well as the loading matrix and factor correlations under each factor structure considered. Conti et al. (2014) demonstrated that this method accurately recovers the true structure of the factor-loading matrix in the vast majority of cases when the sample size is at least 500. This suggests that our sample size should be sufficiently large to enable reliable inferences to be made regarding the underlying structure of the items.
Following Conti et al. (2014), we constrained the model such that each factor was required to have at least three items that loaded only onto that factor. We also retained their default priors (see Appendix B). The analysis requires the user to specify the maximum number of factors permitted. We set this maximum to six. We believed this setting would strike a good balance by allowing considerable variety in the factor structures that could emerge without allowing so many factors that the underlying themes would be difficult to interpret. These settings resulted in a prior on the number of factors for which approximately 8% of the mass was allocated to the one-factor model, 30% was allocated to the two-factor model, 38% was allocated to the three-factor model, 20% was allocated to the four-factor model, 4% was allocated to the five-factor model, and less than 1% was allocated to the six-factor model.
The MCMC analysis included four chains (with unique, randomly generated starting values). Each chain had a burn-in period of 20,000 samples. After burning in, each chain produced 20,000 more samples. Therefore, the final analysis was based on 80,000 samples (i.e., 4 Chains × 20,000 Samples Per Chain). We assessed convergence by computing the Gelman-Rubin statistic (R̂) across the four chains; values close to 1 indicate that the chains converged on the same posterior distribution.
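For readers unfamiliar with this diagnostic, the Gelman-Rubin statistic compares the variance between chains with the variance within chains. The following minimal Python sketch implements the standard formula; it is offered purely as an illustration of the diagnostic, not as the code used in the analysis.

```python
import numpy as np

def gelman_rubin(chains: np.ndarray) -> float:
    """Potential scale reduction factor (R-hat) for draws of shape (m_chains, n_samples)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()  # mean within-chain variance
    B = n * chain_means.var(ddof=1)        # between-chain variance
    var_plus = (n - 1) / n * W + B / n     # pooled estimate of the posterior variance
    return float(np.sqrt(var_plus / W))

# Example: four well-mixed chains of 20,000 draws should give an R-hat close to 1.
rng = np.random.default_rng(0)
chains = rng.normal(size=(4, 20000))
print(gelman_rubin(chains))  # ~1.00
```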
We compared the evidence for the different factor structures by examining the posterior probability for each number of factors. The analysis revealed that a structure containing six distinct factors was most probable, with 100% of the posterior mass favoring this structure.
We interpret the item loadings and factor correlations from the six-factor model given that this model was deemed most probable. Table 1 shows the loadings of all the items onto the factors, which are ordered from most to least variance explained. Factor 1 (Weak Methodology) contains items that signal that the original study may have suffered from weak methodology (e.g., low sample size/power, the use of a convenience sample, lack of open data/methods). Factor 2 (Potential for Questionable Research Practices [QRPs]) contains items that might be viewed as suggestive of questionable research practices, such as HARKing or p-hacking.
Loading of Each Item Onto the Relevant Latent Factor From the Six-Factor Model
Note: 95% Bayesian credible intervals are in brackets.
Factor 3 (Rigorous Analysis) contains items suggestive of a rigorous analysis (e.g., a large sample, high power, an analysis that was planned in advance); the subdisciplines of cognitive and quantitative psychology also loaded onto this factor. Factor 4 (Ease of Replication) contains items relating more directly to the ease of conducting a replication study (e.g., the existence of previous replications, the presence of open methods and data, and the effectiveness of the methodology). Factor 5 (Robustness of Conclusions) contains items that reflect the robustness of the conclusions at a higher level (e.g., consistency with theory and prior research, robustness across contexts or experiments); items relating to the use of Bayesian statistical methods also loaded onto this factor. Factor 6, which explained the least amount of variance, was the most difficult to assign a coherent theme. We refer to it as the Established Convention factor because many of its items might be viewed as traditional indicators of replicability (e.g., the status of the researcher or institution, the finding being regarded as “classic,” and the use of standard statistical tools such as effect sizes and confidence intervals). One item, concerning the handedness of the lead author, did not load onto any factor.
As can be seen in Table 2, many of the factors are correlated. Factors 1 and 2 are positively correlated, which makes sense given that both relate negatively to replicability. Factors 3 through 6 are all positively correlated with each other and negatively correlated with Factors 1 and 2. Several of the factors correlate fairly strongly with one another, which suggests that some of them tap into similar constructs. For example, both Rigorous Analysis and Ease of Replication incorporate methodological elements that arguably serve similar purposes (e.g., effective research design [Ease of Replication] and large sample size [Rigorous Analysis]). It is possible that these factors are conceptually separable but are correlated because they are often perceived to co-occur.
Factor Correlations With 95% Bayesian Credible Intervals [Lower, Upper]
To examine the robustness of our results to changes in the priors, we conducted a prior sensitivity analysis. To do this, we ran a follow-up analysis with different prior values for the concentration parameter that influences the mass allocated to the different numbers of factors. The value of this parameter in the analysis reported above was .17 (which is the recommended setting for a model with a maximum of six factors). We ran follow-up analyses with this parameter set to .05 (a setting that more strongly favors solutions with fewer factors) and .5 (which favors solutions with more factors). Both alternative settings resulted in a six-factor structure—the same number of factors as the model reported above. When the concentration parameter was set to .05, the pattern of loadings was virtually identical to the model above. When the concentration parameter was set to .5, items from Factors 4 (Ease of Replication) and 5 (Robustness of Conclusions) and most of the items from Factor 3 (Rigorous Analysis) in the above model merged into a single factor that explained 71% of the variance captured by the model. Remaining items from Factor 3 loaded onto a second factor (19% of the variance), and most items from Factor 1 (Weak Methodology) in the above model loaded negatively onto this factor. The contents of the third and fourth factors in this model (7% and 2% of the variance, respectively) were comparable with Factors 2 (Potential for Questionable Research Practices [QRPs]) and 6 (Established Convention) in the above model. The fifth and sixth factors each explained less than 1% of the variance captured by the model. Most of the items on the fifth factor related to the subdiscipline, whereas most items on the sixth factor were remaining items from Factor 1 (Weak Methodology) in the above model.
These results suggest that although the choice of prior exerts some influence on the distribution of thematically related groups of items across factors, the item groupings themselves are fairly stable. In other words, items that are grouped together in one model tend to be grouped together in the other models. The two exceptions are the items that loaded onto Factors 1 and 3 in the original model. Under the alternative prior with the concentration parameter set to .5, the items loading onto each of these factors are split across two different factors. However, those factors are highly correlated with one another, so the thematic groupings identified in the original model are largely preserved.
Research Question 3: Are There Individual Differences in the Issues That People Consider When Evaluating Replicability?
To answer this research question, we conducted a cluster analysis of the factor scores using the Gaussian multivariate mixture modeling approach implemented via the mclust package in R. In the first step, we fit models that differed in the number of clusters and in whether the clusters were allowed to vary in size, shape, and orientation, and we selected the best-fitting model using the Bayesian information criterion (BIC).
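As an illustration of this two-step approach, the Python sketch below fits Gaussian mixture models with different numbers of components to the participants’ factor scores, selects a model by BIC, and then assigns each participant to a cluster. It uses scikit-learn’s covariance parameterizations rather than the mclust family of models used in the article, and the input file name is hypothetical.

```python
import pandas as pd
from sklearn.mixture import GaussianMixture

# Assumed layout: one row per participant, one column per factor score (six factors).
scores = pd.read_csv("factor_scores.csv").to_numpy()

# Step 1: model selection. Fit mixtures with 1-9 components and several covariance
# structures, keeping the model with the lowest BIC (scikit-learn defines BIC so that
# lower values are better, as in the article).
candidates = [
    GaussianMixture(n_components=k, covariance_type=cov, random_state=0).fit(scores)
    for k in range(1, 10)
    for cov in ("full", "tied", "diag", "spherical")
]
best = min(candidates, key=lambda m: m.bic(scores))
print(best.n_components, best.covariance_type, best.bic(scores))

# Step 2: classify each participant into the cluster with the highest posterior probability.
cluster_labels = best.predict(scores)
```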
The results of the model selection are shown in Figure 2. The winning model (BIC = 21,214.17) assumed that there were three unique clusters of participants and that the clusters varied in size, shape, and orientation. The next best model (BIC = 21,273.64) assumed that there were two unique clusters that also varied in size, shape, and orientation.
In the second step, we used the winning model from Step 1 to classify participants into clusters. Figure 3 shows how participants in each cluster were positioned with respect to their scores on each factor, along with the relationships between the factor scores. As can be seen, Cluster 1 (52% of participants; shown in blue) and Cluster 2 (42% of participants; shown in red) tend to agree on how the different factors would affect replicability. Both groups showed decreases in confidence in response to items associated with the Weak Methodology and the Potential for QRPs factors and increases in confidence in response to items associated with the Rigorous Analysis, Ease of Replication, and Robustness of Conclusions factors (with neither group showing much change in confidence in response to the items associated with the Established Convention factor). However, the changes in confidence among participants in Cluster 2 tended to be more pronounced than those among participants in Cluster 1; when items within a factor increased or decreased confidence, they did so more strongly for participants in Cluster 2 than for those in Cluster 1 (i.e., participants in Cluster 2 were the most sensitive). The participants in Cluster 3 (6% of participants; shown in green) deviated from the pattern demonstrated by Clusters 1 and 2. These participants had more variable views in general but demonstrated a systematic tendency to be less dissuaded by the items relating to the Potential for QRPs factor. Note that the mode of the Cluster 3 distribution was above zero for this factor—which suggests that participants within this cluster responded more positively to items associated with this factor—whereas the modes for the distributions associated with Clusters 1 and 2 were below zero.

Factor scores of participants in each cluster. The panels on the diagonal show the distribution of scores on the indicated factor. The other panels contain scatterplots showing the relationship between the factors. Plotted points representing participants in Clusters 1, 2, and 3 are shown in blue, red, and green, respectively.
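A pairs plot in the spirit of Figure 3 can be assembled from the factor scores and cluster assignments with seaborn, as in the following sketch; the input file and column names are assumptions, not the names used in the released materials.

```python
import pandas as pd
import seaborn as sns

# Assumed layout: one row per participant, six factor-score columns plus a "cluster" column.
scores = pd.read_csv("factor_scores_with_clusters.csv")
scores["cluster"] = scores["cluster"].astype(str)

# Diagonal panels: distribution of each factor score within each cluster.
# Off-diagonal panels: pairwise scatterplots of factor scores, colored by cluster.
grid = sns.pairplot(
    scores,
    hue="cluster",
    palette={"1": "blue", "2": "red", "3": "green"},
    diag_kind="kde",
    corner=True,
)
grid.savefig("factor_scores_by_cluster.png", dpi=300)
```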
Having established which participants belonged to each cluster, we next examined whether the clusters differed on key demographic variables. Table 3 shows the mean and standard deviation of each of the demographic items in the survey, broken down by cluster. Participants in each cluster were quite similar across demographic items. Participants in Cluster 3 seemed to be the most concerned with replicability and were also the most likely to preregister, make a priori hypotheses, and run formal power analyses, whereas participants in Cluster 1 tended to have the lowest scores on these items. Furthermore, participants in Cluster 3 had nearly half as many citations, on average, as participants in Clusters 1 and 2. However, participants in Cluster 3 had the joint highest h-index, which suggests that researchers across all three clusters were similarly prolific according to these metrics. Note that Cluster 3 contained a substantially smaller proportion of early-career and midcareer researchers but a higher proportion of postgraduate students and senior academics. Cluster 3 was also the smallest cluster, so it was the most vulnerable to random fluctuation. There did not seem to be any particularly substantial differences between clusters across subdisciplines.
Differences in Demographic Variables Among the Different Clusters of Participants
Discussion
Accurate judgments of replicability are vital to ensure that researchers are building on work that is robust. Our aim was to understand how researchers make these judgments by assessing how 76 study attributes influence psychological scientists’ confidence in the replicability of a finding. We focused on three research questions: How does each feature influence average confidence in replicability? What are the factors that influence judgments of replicability, and what are their constituent features? And are there individual differences in the issues that people consider when evaluating replicability?
We found considerable variability among the 76 study attributes in how they influenced researchers’ confidence that a finding would replicate. Among the features that tended to produce the greatest increase in confidence were the existence of previous replications, high power or a large sample size, open data and materials, robustness of the phenomenon across contexts, a representative sample of participants, analyses that were preregistered or planned in advance, and the study being a Registered Report. Among the features that tended to produce the greatest decrease in confidence were the study having low power or a small sample size, data and methods that were not openly available, a phenomenon that is highly context dependent, and results that involve a three-way interaction or rest on p values only just below the conventional significance threshold.
Furthermore, we found evidence for six somewhat distinct themes that psychological researchers consider when making judgments of replicability. The first theme related to features that are often associated with weak methodology in the original study (e.g., low sample size/power, the use of a convenience sample) and a lack of transparency (e.g., lack of open data/methods). The second theme consisted of features perceived as suggestive of QRPs, such as HARKing or p-hacking. The remaining themes concerned the rigor of the analysis, the ease of conducting a replication, the robustness of the conclusions, and the more traditional indicators of quality captured by the Established Convention factor.
We also found individual differences in how people were influenced by the criteria discussed above. Specifically, we identified three distinct types of responders who differed in the extent to which each theme influenced their confidence that a study would replicate. Participants in Cluster 1 (52.1%) and Cluster 2 (42.1%) responded in similar ways to each theme, but participants in Cluster 2 tended to have more extreme responses (i.e., their changes in confidence were more pronounced across different items). Participants in Cluster 3 (5.8%) held views that tended to be more variable but, on average, hardly changed across different attributes. Another noteworthy feature of participants in Cluster 3 was that they tended not to be dissuaded by items in the Potential for QRPs theme and even showed a slight average increase in confidence for those items. These results suggest that most of the psychological scientists in our sample tended to agree about the factors that influence replicability, although some seemed to be less dissuaded by what many would consider to be QRPs.
We found that the factors that most strongly influenced perceived replicability tended to correspond to the factors that large-scale replication attempts have found to actually influence replicability. For example, a common theme in these replication attempts is that a larger effect size, a larger sample, and a smaller p value in the original study are associated with a greater likelihood of successful replication.
The alignment of our sample’s confidence with large-scale replication attempts suggests that psychological scientists are attuned to detecting research practices that are indicative of replicable research. This is supported by evidence from the aforementioned prediction market studies (Camerer et al., 2018; Dreber et al., 2015; Forsell et al., 2019) that have shown that psychological scientists are fairly accurate at predicting the outcomes of replication attempts. Indeed, our results may provide further insight into how much weight researchers give to certain features when making these judgments. Our findings shed light on study attributes that may be used to successfully predict whether a finding will replicate. However, further research is required to determine whether the study features that inspire the most or least confidence among researchers are the ones that are most successful in predicting, respectively, replication or failure to replicate.
Our results suggest that many of the practices that have been proposed as a means to improve the replicability of psychological research—such as open data and methods (Asendorpf et al., 2013; Miguel et al., 2014), preregistration and Registered Reports (Jonas & Cesario, 2016; Nosek et al., 2018; Wagenmakers et al., 2012), basing conclusions on Bayesian inference (Etz & Vandekerckhove, 2016), or adopting more stringent statistical criteria—also tend to increase researchers’ confidence that a finding will replicate. This suggests that such practices are likely to be valued, and therefore adopted, by the broader research community.
Practical Implications
We found that a small subset of our participants (those in the “insensitive responders” cluster) were systematically less concerned with features that could indicate the potential for what many researchers might consider to be QRPs, such as HARKing or p-hacking.
Future Research
A useful next step might be to use these data to create a scale for measuring the different facets that determine perceived replicability. This would involve optimizing the item set so that the various facets can be reliably measured using fewer items. Future work might also benefit from examining perceptions of the replicability of more specific categories of studies, including studies from different subdisciplines. Our survey was agnostic with regard to characteristics of the study being replicated, such as the sample size, the design, the nature of the hypotheses, and the methodological approach. We designed the survey this way so our results would not be constrained to a specific type of study. However, it is possible that our use of a general study description, as opposed to a more detailed one, may mean that our results do not generalize to certain types of study designs or even to any specific study. It is, therefore, important for future work to examine how people make judgments regarding the replicability of results obtained using more specific research protocols.
Future research might also address the question of whether there are individual differences in the factor structure itself. Our analysis showed individual differences in the combinations of factor scores. However, our analysis assumed that all researchers rely on the same factor structure. Although we believe that this was an appropriate first step, it may be the case that there is variability across individuals in the factor structure itself. This variability can be examined in future research by conducting, for example, a mixture exploratory factor analysis, which allows for differences in the factor structure that emerges across individuals. This could also provide further clarity as to whether there are any differences in the perceived utility or harm of certain research practices across different subdisciplines. Indeed, our recruitment of such a broad sample may have resulted in a blending of factor structures that might truly differ across subsets of researchers.
It will also be important for future work to address potential moderating factors. For example, one might not care about the statistical approach if the sample size is sufficiently large but might care a great deal if the sample size is small. It is also possible that familiarity with certain practices influences their perceived efficacy. For example, people more familiar with Registered Reports may be more likely to experience an increase in their confidence in a study’s replicability upon finding out that the study was a Registered Report. We believe that an examination of moderating factors is a useful next step for future research.
Conclusion
The aim of this project was to understand how 76 different study attributes influenced psychological scientists’ confidence that a finding would replicate. We found that, on average, these perceptions tended to match what large-scale replication attempts have found to actually influence replicability. Our sample also expressed confidence in open-science practices such as openly available data and methods, preregistration, and Registered Reports. Overall, we found evidence for six themes that psychological scientists consider when evaluating replicability and for individual differences in how these themes influence subsets of researchers. This work provides a useful starting point for understanding how people judge the replicability of a research finding. We hope that a better understanding of researchers’ confidence in these practices will inform efforts to strengthen trust in practices that genuinely improve replicability and to reduce confidence in practices that might not be robust.
