In April 2019, Psychological Science published its first issue in which all Research Articles received the Open Data badge. We used that issue to investigate the effectiveness of this badge, focusing on the adherence to its aim at Psychological Science: sharing both data and code to ensure reproducibility of results. Twelve researchers of varying experience levels attempted to reproduce the results of the empirical articles in the target issue (at least three researchers per article). We found that all 14 articles provided at least some data and six provided analysis code, but only one article was rated to be exactly reproducible, and three were rated as essentially reproducible with minor deviations. We suggest that researchers should be encouraged to adhere to the higher standard in force at Psychological Science. Moreover, a check of reproducibility during peer review may be preferable to the disclosure method of awarding badges.
Editor’s Note
In the article that follows this Editor’s Note, Crüwell and colleagues report the results of an audit of the computational reproducibility of the 14 research articles published in the April 2019 issue of Psychological Science (Vol. 30, Issue 4). The audit was author-initiated—it was not by invitation of the journal. Crüwell and colleagues defined computational reproducibility as “the ability to recreate results using the original data and code (or at least a detailed description of the analyses)” (p. 514). They selected Volume 30, Issue 4 because it was the first in which all of the research articles were awarded the Open Data badge. Of the 14 research articles in the issue, Crüwell et al. assessed only one as meeting the requirements for the Open Data badge.
In their assessment, Crüwell and colleagues relied on the criteria provided in the Submission Guidelines of the journal at the time the 2019 authors submitted their articles (Psychological Science, Submission Guidelines, Open Science Badges section). The guidelines state that authors may receive an “Open Data badge for making publicly available the digitally shareable data necessary to reproduce the reported result. This includes annotated copies of the code or syntax used for all exploratory and principal analyses.” In their judgments regarding Open Data badge eligibility, Crüwell and colleagues emphasized the availability of analysis code or syntax. Importantly, neither the Open Science Framework (OSF) criteria that guide badge eligibility nor the Open Practices Disclosure (OPD) form completed by the 2019 authors makes explicit reference to analysis code or syntax. The OSF criteria state that “The Open Data badge is awarded when digitally-shareable data necessary to reproduce the reported results are publicly available,” and that “A data dictionary (for example, a codebook or metadata describing the data) is included with sufficient description for an independent researcher to reproduce the reported analyses and results” (https://osf.io/tvyxz/wiki/1.%20View%20the%20Badges/). Similarly, the OPD form that the authors completed required them to “Confirm that there is sufficient information for an independent researcher to reproduce all of the reported results, including codebook if relevant” (emphasis in original). Neither set of criteria specifies sharing of analysis code or syntax.
The difference between the Submission Guidelines and the OPD form that authors completed is important. The Submission Guidelines provide advice, but they are not the rule of law. The rule of law is established in the OPD form, which, at the time the 2019 authors completed it, made no mention of analysis code. I emphasize this point because it establishes that the 2019 authors did not openly flout explicit criteria when they applied for the Open Data badge, nor did eligibility for the badge turn on provision of analysis code, as established either by Psychological Science or OSF.
On behalf of Psychological Science, I apologize for the discrepancy between the Open Data badge elements listed in the Submission Guidelines and the less explicit requirement in the OPD form. The criteria outlined in the OPD form were not sufficiently explicit regarding the elements that should be included in an open-access registry in order to ensure independent reproducibility. We have changed the wording of the OPD form such that it now provides better guidance to authors in their efforts to make their science open by making their data publicly available.
Setting aside for the moment the vagueness of the requirements of the previous version of the OPD form, it is clear that it instructed authors to provide “sufficient information for an independent researcher to reproduce all of the reported results.” The OSF eligibility criteria give the same charge. By their report, for several of the articles published in the April 2019 issue, the audit team of Crüwell and colleagues was not successful in achieving the goal of independent reproduction of all of the reported results based on the information in the registry alone, with the methods they employed. Importantly, ensuring that analyses can be reproduced is only one of several possible motivations for authors to make their data openly accessible. Other possible motivations include reducing the need for duplicative data-collection efforts; facilitating collaborations; and even enabling analysis of data in different ways, thus helping to ensure findings are robust to different analytic approaches, to name a few. I venture to guess that it was goals such as these, not independent reproducibility alone, that were paramount in the minds of many of the 2019 authors as they made their data publicly available.
Critically, transparency and scientific community building are not mutually exclusive goals. In this regard, it is my pleasure to report that upon learning of the work of Crüwell and colleagues, several of the author groups with articles in Volume 30, Issue 4 of Psychological Science amended their registries to include elements identified in the audit as missing or insufficient. I appreciate the positive response of these author groups and their ongoing contributions to open science.
Patricia J. Bauer
Editor in Chief
Open science badges are incentives for researchers to participate in open science practices such as preregistration and sharing of data and materials. Sharing data is encouraged in order to increase transparency, reuse or reproducibility, and citations (Colavizza et al., 2020; Piwowar & Vision, 2013). Psychological Science adopted the badges in 2014 (Eich, 2014), and, in April 2019, published its first issue in which all 14 Research Articles received Open Data badges (Volume 30, Issue 4). The aim of this badge is to incentivize authors to share online the data necessary to reproduce the reported results (Blohowiak et al., 2022). Psychological Science’s submission guidelines state that articles may receive this badge “for making publicly available the digitally shareable data necessary to reproduce the reported result. This includes annotated copies of the code or syntax used for all exploratory and principal analyses” (Psychological Science, 2022, Open Practices Badges section; these eligibility criteria were operative in 2019).
The corresponding Open Practices Disclosure form uses somewhat more permissive language, requiring confirmation of “sufficient information for an independent researcher to reproduce all of the reported results.” This equates to provision of analysis code or syntax for all but the simplest analyses and data sets. We understand reproducibility to mean computational reproducibility: the ability to recreate results using the original data and code (or at least a detailed description of the analyses). Psychological Science awards badges based on the disclosure method: Authors complete an Open Practices Disclosure form, and the journal may confirm the existence of data, materials, or a preregistration (Blohowiak et al., 2022; Psychological Science, 2022).
Kidwell et al. (2016) found that introducing badges at Psychological Science led to an increase in sharing, which indicates the superficial success of this policy—particularly compared with other initiatives (see Rowhani-Farid & Barnett, 2018, and Rowhani-Farid et al., 2020, who found lower and no increase in data sharing at Biostatistics and BMJ Open, respectively). Hardwicke et al. (2021) investigated the analytic reproducibility of articles that received Open Data badges at Psychological Science between 2014 and 2015; they were able to reproduce the results of 36% of articles without author involvement and a further 24% with author involvement. Obels et al. (2020) examined data sharing and computational reproducibility of registered reports in general psychological research; 36 of the 62 articles assessed (58%) provided both data and code, of which 21 (58%) were computationally reproducible.
Whereas Hardwicke et al. (2021) and Obels et al. (2020) were concerned with computational or analytic reproducibility per se, we focused on computational reproducibility as a measure of the effectiveness of the Psychological Science Open Data badge policy. If this policy was effective, the results in the April 2019 issue should be independently and precisely reproducible. If these results are wholly or partially irreproducible, then any issues we identify during reproduction attempts may inform the improvement of the policy at Psychological Science and other journals. Our focus on one practice in one issue of Psychological Science allows for in-depth examination of the effectiveness of this specific measure for incentivizing data sharing as implemented and advertised at this journal.
Statement of Relevance
Open science badges are incentives for encouraging researchers to participate in open science practices such as preregistration and the sharing of data or experimental materials. These practices are thought to be desirable as a means for enhancing both transparency and reproducibility, which are important to scientific inquiry. In particular, the results of a study should be at least computationally reproducible using the same data and analyses. In the present study, we aimed specifically to investigate the effectiveness of the Open Data badge at Psychological Science, the stated purpose of which is to ensure the reproducibility of results. We found that the Open Data badge policy did not work as intended, and we suggest possible changes in how the badge could be awarded. We hope to contribute to improving the badge program at Psychological Science as well as reproducibility and transparency in psychology.
Open Practices Statement
The individual and summary reports, as well as the informal reproducibility ratings and code to create Tables 1 and 2, are publicly accessible at https://osf.io/xzke7/. This study was not preregistered.
Method
Sample
The scope of our investigation was all 14 Research Articles published in the April 2019 issue of Psychological Science, the journal’s first issue in which all Research Articles were awarded the Open Data badge (Bae & Luck, 2019; Dorfman et al., 2019; Garcia & Rimé, 2019; Geniole et al., 2019; Hakim et al., 2019; Hilgard et al., 2019; Johnson & Wilson, 2019; Lindsay et al., 2019; Obaidi et al., 2019; Olsson-Collentine et al., 2019; Vardy & Atkinson, 2019; Wójcik et al., 2019; Woolley & Fishbach, 2019; Yousif & Keil, 2019). To emphasize our focus on Psychological Science’s Open Data badge policy and not these individual articles, we will refer to them as Articles 101 to 114, the numbers having been randomly assigned. A superficial examination of the repositories linked to the articles shows that all articles are associated with at least some data. No code is provided in the linked repository for six of the articles (Articles 101, 105, 107, 111, 112, and 113).
Design
This is an observational, descriptive, one-group study. We did not compare the April 2019 issue of Psychological Science with any other issue or journal but rather to the ideal of the policy of the Open Data badge as implemented at Psychological Science.
In the present study, we were mainly concerned with this Open Data badge policy’s effectiveness, not with reproducibility per se. Our informal reproducibility ratings are a proxy measure of that effectiveness. Although we did not establish any criteria for successful reproduction in advance, for a study to count as reproducible, its results should at least be reproducible by a competent external researcher (National Academies of Sciences, Engineering, and Medicine, 2019), such as a PhD student with some experience and training in a similar field. When we say that a study was or was not reproducible, this is specific to our team of reproducers. Our informal reproducibility rating items were “exactly reproducible,” which represented the ideal of the Open Data badge in which there were no deviations from the reported results; “essentially reproducible,” meaning that there were minor deviations in the decimals or obvious typographical errors (e.g., 2.39 vs. 2.93); “partially reproducible,” meaning that there were more than minor deviations but the results were mostly numerically consistent; “mostly not reproducible,” meaning that there were major deviations and few numerically consistent results; and “not at all reproducible,” meaning that there was no numerical consistency between the reported results and the ones that we found or that a reproduction attempt was otherwise not possible.
Procedure
Reproducer assignment
The last author initially recruited 13 researchers of varying experience and career levels to attempt to reproduce studies from the April 2019 issue of Psychological Science on the basis of the data and, where available, code shared by the original authors. They were asked to indicate their ability to access and use four software packages: Excel, MATLAB, R, and SPSS. Each reproducer was asked to attempt to reproduce four of the 14 articles, the selection being determined by (a) the match between the reproducer’s access to software and the format of the code or data provided by the original authors, and (b) the aim to have distinct sets of researchers working on each article, where possible. Because of an error in the assignment process, two reproducers (J. M. and S. L.) were asked to reproduce the same four articles. No two articles were reproduced by the exact same set of researchers. Two reproducers dropped out and did not complete any reproduction reports. Furthermore, reproducers were unable to complete individual reproduction attempts because of technical limitations in three cases (B. J. B., Article 106; S. C., Article 110; S. J. G., Article 112). One further reproducer joined the project at a later stage. In total, 12 reproducers completed three to five reproductions each. For each of the 14 articles, at least three researchers were assigned to, and completed, individual reproduction reports (46 individual reports in total).
Reproduction process
The reproduction process was split into two stages. In the first stage, each researcher independently attempted to reproduce their assigned studies and wrote an individual reproduction report on their experience and findings. These initial reports were unstructured; some reproducers included further information such as code, whereas others focused on the narrative report of their reproduction attempts. Results were initially not shared, and reproducers were encouraged to stay as masked as possible (i.e., not discussing results with other reproducers until their own analyses were completed). In the second stage, on the basis of the individual reports, the groups of reproducers for each article agreed on a summary report of their overall findings. After the reproduction process, they rated the reproducibility of each article they had attempted to reproduce on the basis of (a) their individual, initial experience reproducing the article and (b) the summary findings and discussions among the group for each article.
All of our reproduction attempts were carried out independently of the articles’ original authors. We then contacted the authors prior to preprinting and submission to explain the nature of the project; all our analyses and conclusions were finalized by that point. In the case of two articles, the last author of the present article had previously (i.e., before the other coauthors joined the project in May 2020) contacted the corresponding authors for reproduction advice before realizing that this was not compatible with the overarching aim of the project. Consequently, he did not write an individual report on these articles, and he did not contribute to the associated group discussions.
Results
Reproducibility
Only one of the 14 articles was rated to be exactly reproducible (Article 108), and three further articles were rated essentially reproducible with minor deviations by a majority of the researchers who reproduced them, on the basis of the summary reports (Articles 101, 109, and 111). Both the initial reproducibility ratings based on the individual reproduction attempts (Table 1) and the summary ratings based on the article group’s combined reproduction attempts (Table 2) varied, and there were four changes between the modal majority-agreed initial and summary ratings (Articles 101, 109, 110, and 114).
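As an illustration of what a majority-agreed rating could look like computationally, the following is a minimal sketch in Python. The function name and the strict-majority rule (more than half of the raters) are assumptions made for this example and are not taken from the study, whose ratings were informal. The ratings passed in at the end are hypothetical and do not correspond to any article.

```python
from collections import Counter
from typing import Optional

# Rating labels as listed in the Design section, ordered from best to worst.
SCALE = [
    "exactly reproducible",
    "essentially reproducible",
    "partially reproducible",
    "mostly not reproducible",
    "not at all reproducible",
]


def modal_majority_rating(ratings: list[str]) -> Optional[str]:
    """Return the rating given by more than half of the raters, else None.

    The strict-majority rule is an assumption of this sketch.
    """
    for rating in ratings:
        if rating not in SCALE:
            raise ValueError(f"unknown rating: {rating!r}")
    if not ratings:
        return None
    rating, count = Counter(ratings).most_common(1)[0]
    return rating if count > len(ratings) / 2 else None


# Hypothetical ratings from three reproducers of one article.
example = [
    "essentially reproducible",
    "essentially reproducible",
    "partially reproducible",
]
print(modal_majority_rating(example))
```

When no rating is held by more than half of the raters, the function returns None rather than picking one of the tied or minority ratings.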
The individual reports (46 total) and summary reports (14 total) are available on the OSF alongside further information about each reproduced article (see https://osf.io/xzke7/). The reports provide in-depth qualitative and quantitative information in the form of narrative descriptions of each reproduction attempt, often including numerical results.
Issues encountered
The following section qualitatively and nonexhaustively summarizes the issues that we encountered (for a further summary of the shared data and code, see Table 3). General issues include (a) a lack of documentation of data and/or code; (b) minor discrepancies in several results, likely due to use of random numbers without fixed seeds in bootstrapped analyses; (c) minor discrepancies in individual results, likely due to typographical or copy-paste errors; (d) unclear reporting of procedures in the article text, including the criteria for inclusion in subgroups, lack of or incorrect reporting of the variables used for regression models, and unreported one-sided analyses; (e) data storage issues on the OSF, including files being either corrupt or not downloadable at all (Article 110); and (f) ambiguous labeling of studies in the article’s Open Practices statement (Article 109). Data-specific issues include (a) provision of cleaned data without raw data, (b) provision of raw data without cleaned data, and (c) no description of, or code for, the data-cleaning process. Code-specific issues include (a) a lack of shared analysis code or modeling code and (b) issues with package or software versions (often resolvable but sometimes only with considerable effort).
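Issue (b) above, random numbers used without a fixed seed, can be illustrated with a minimal Python sketch. It is not drawn from any of the audited articles: the data values, the function name, and the seed are made up for illustration, and a simple percentile bootstrap of the mean stands in for whatever bootstrapped analyses the original authors ran.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)  # a fixed seed makes the resampling repeatable
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


scores = [2.1, 3.4, 2.9, 4.0, 3.3, 2.7, 3.8, 3.1]  # made-up illustrative values

# Same seed: the interval is identical on every run.
seeded_a = bootstrap_mean_ci(scores, seed=2019)
seeded_b = bootstrap_mean_ci(scores, seed=2019)
print(seeded_a == seeded_b)

# No seed: the interval will usually differ slightly between runs.
unseeded = bootstrap_mean_ci(scores)
print(unseeded)
```

Reporting the seed alongside the code is what lets an independent reproducer recover the same resamples, and therefore the same decimals, as the original analysis.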
Open Data badge eligibility
Overall, we found that eight articles (Articles 101, 105, 106, 107, 110, 111, 112, and 113) did not provide, even in principle, sufficient information for independent exact reproduction of their results by our team. In these cases, reproduction would require analysis code or syntax, as the descriptions of the methodology and the shared data files did not provide enough information on their own.
This means that (a) these articles did not meet the standard for receiving the Open Data badge at Psychological Science according to the explicit requirements stated in the submission guidelines, and (b) the authors of these articles may have interpreted the less explicit requirements of the Open Practices Disclosure statement in a rather minimalist way.
Provision of both analysis code and data was a requirement for the award of an Open Data badge at Psychological Science at the time of submission, according to the explicit requirements stated in the submission guidelines. These requirements appear to not have been met in these cases. Articles missed these explicit requirements of the journal submission guidelines to different extents. Six articles (Articles 101, 105, 107, 111, 112, and 113) did not provide any code in the linked repository (some modeling code was provided for Article 112 on a separate GitHub page not linked to from the article), and Article 101 additionally provided only summarized and incomplete data. Therefore, these articles do not appear to have met the requirements for receiving the Open Data badge, according to the explicit requirements in the submission guidelines that were in force at Psychological Science when the articles were first submitted. Arguably, given this stipulation, Articles 106 and 110 were also not eligible for the Open Data badge because they provided some code files but not the statistical analysis code. This field-leading policy was certainly introduced and implemented with the best of intentions, but there appear to have been some oversights by the journal in its execution, as the OSF guidelines recommend at least a cursory check by the journal before the badge is awarded.
On top of these clearer eligibility issues regarding the provision of sufficient information and/or analysis code for independent exact reproduction, on a strict interpretation of the badge eligibility criteria at Psychological Science, our reproduction results arguably imply that only one of the 14 articles met the requirements for an Open Data badge. Eight articles did not share both data and analysis code or otherwise sufficient information, and of the remaining six articles that did attempt to share sufficient information for independent reproduction in the form of analysis code, only one was exactly reproducible by our team. However, the reproducibility of the articles that shared data and analysis code likely decreased since publication (because of issues such as “software rot”; Hinsen, 2019). Therefore, it is unclear how we can make an inference from current reproducibility to past Open Data badge eligibility in the case of the articles that share both data and analysis code but were not exactly reproducible.
Discussion
The disclosure method did not ensure the required higher standard for the Open Data badge at Psychological Science, at least in its April 2019 issue. Of 14 articles, eight did not share both data and analysis code and so failed to meet the eligibility requirements. Of the remaining six, only one was exactly reproducible, but we do not know whether the other five were exactly reproducible at the time of submission. We make several recommendations for improving the specific badge policy at Psychological Science and comparable initiatives at other journals (for further general recommendations on improving data sharing and computational reproducibility, see Stodden et al., 2016; Trisovic et al., 2022; Wilson et al., 2017). Excellent and more in-depth recommendations and tutorials for authors to ensure that their shared data and code are eligible for an Open Data badge are provided by, for example, Arslan (2019), Eberle (2022), Klein et al. (2018), Levenstein and Lyle (2018), Peikert and Brandmaier (2021), and Van Lissa et al. (2021). Moreover, the provision of further incentives, in particular by funding agencies and institutions, may help make data sharing more common and effective (Houtkoop et al., 2018).
First, authors wanting to share their data and code could take further steps to ensure eligibility for an Open Data badge. It might be argued that the average psychology researcher lacks the necessary technical skills. Any journal offering open science badges could support its authors in making their data and code reproducible and usable by providing guidance on (a) documentation of data, code, and the online repository; (b) sharing the rawest possible data (within ethical and logistical limits) alongside the cleaned data; and (c) avoiding dependency and version issues (e.g., by using a platform such as Docker or Code Ocean; Clyburne-Sherin et al., 2019; Nüst et al., 2020; or if working in R by using, e.g., groundhog or renv; Simonsohn & Gruson, 2022; Ushey, 2022). There are many resources for making a reproducible workflow accessible, particularly concerning data and code sharing (see above). Authors can also ensure machine-actionable reusability of their data by following the findable, accessible, interoperable, and reusable (FAIR) guidelines (Wilkinson et al., 2016). It is commendable when authors attempt to share their data: data and code imperfectly shared are typically better than data and code perfectly kept to oneself. Indeed, our study would have been impossible without the introduction of the Open Data badge. The badge is a step in the right direction, but the corresponding policy needs to be improved to better support and incentivize transparent and reproducible research.
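To make the point about dependency and version issues concrete, here is a minimal Python sketch that records the interpreter version and the installed versions of named packages so that the record can be shared alongside the analysis code. It is a much lighter stand-in for the tools named above (Docker, Code Ocean, groundhog, renv) and does not replace them. The function name and the package list are assumptions made for this example.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Record the Python version and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {"python": platform.python_version(), "packages": versions}


# Hypothetical list of packages an analysis script might depend on.
snapshot = snapshot_environment(["numpy", "pandas"])
print(json.dumps(snapshot, indent=2))
```

Saving this output as a file in the repository gives a reproducer a starting point for rebuilding the environment, although it does not capture system libraries or guarantee that old versions remain installable.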
Second, there are improvements that could be made by badge-awarding journals that require both data and code for Open Data badge eligibility. If such journals rely on the disclosure method over the peer-review method, they could better describe the specific badge criteria and clarify that code, syntax, or a detailed analysis description needs to be shared alongside the data—for example, as required by the submission guidelines at Psychological Science. Many journals, and the baseline open science badge guidelines (Blohowiak et al., 2022), do not explicitly include the sharing of analysis code as an eligibility criterion; whether they should do so depends on the purpose of the Open Data badge. If the purpose is data reusability, not sharing code may be acceptable. If the purpose includes reproducibility, however, code should always be included. This particularly applies to complex analyses, as verbal descriptions are unlikely to cover the information necessary for exact or essential reproduction (as demonstrated by our difficulties reproducing Article 112; see also Seibold et al., 2021). In simpler cases, not sharing code might seem acceptable (e.g., we essentially reproduced Article 111), but verbal reports can still fail, and sharing of analysis code ensures that all relevant information is available. By requiring the sharing of analysis code, Psychological Science is going beyond the basic requirements of the Open Data badge in order to achieve both reusability and reproducibility. Nevertheless, we still found that insufficient code was in fact shared for more than half of the examined articles. Badge-awarding journals requiring not only data but also code could more explicitly require authors to provide working code—where necessary—that enables straightforward reproducibility and produces clearly annotated output (see Bauer, 2022, for a reaffirmation of this requirement).
Third, it may be sensible to focus on other methods of awarding the open science badges. Given our results, as well as those of Hardwicke et al. (2021), a badge check may be needed as part of peer review at badge-awarding journals, including Psychological Science. This provides earlier verification and allows authors to upload all materials before publication and award of the badges. One way of doing this is to move to the peer-review method of awarding the Open Data badge (as opposed to the disclosure method; Blohowiak et al., 2022). The standard required by the peer-review method is open to interpretation by the specific journal: For the Open Data badge, this could range from a formal but brief review of the materials to independent reproduction of the reported results.
The expected standard should match up with the standard stated in the submission guidelines; in the case of Psychological Science, data and code are already nominally required to enable precise or exact reproducibility, at least at the time of submission (Psychological Science, 2022). This work could be done by peer reviewers, dedicated badge reviewers, editors, or dedicated editorial staff (Blohowiak et al., 2022) and should be as straightforward as running the code or scripts on the data and requiring corrections if this does not lead to an exact reproduction. A checkbox could be provided for reviewers or dedicated badge reviewers to confirm that they executed the code successfully. If the analysis methods are complex or time consuming, then it should be incumbent on the authors to provide appropriate tools and assistance to the reviewers. If this responsibility is made clear to researchers before submission, this can incentivize more straightforwardly reproducible research. Alternatively, authors could provide proof of a successful reproduction attempt, either independently or from within the research team (which would be an improvement, as analyses are commonly carried out by single team members; Veldkamp et al., 2014).
This could be a condition for the award of the badge, or for an alternative Open Data+ badge, similar to the existing Preregistered+ badge (Blohowiak et al., 2022). Another approach would be to break the badge down into checkboxes of what was shared (e.g., raw and/or processed data, full or partial analysis code), thereby both lowering the threshold for participation and increasing transparency and usefulness of the badge.
Regardless, whether authors fill in their disclosure items appropriately should continue to be monitored—a recent study found low adherence even to mandatory data availability statements in biomedical research manuscripts (Gabelica et al., 2022).
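The badge check discussed above, in which a reviewer runs the shared code and compares its output with the reported results, might be supported by a small helper along the following lines. This is a sketch under stated assumptions, not a procedure used by the journal or in this study. The function names are hypothetical, the numbers are made up, and a value counts as matching only if it rounds (half up) to the reported value at the reported number of decimals.

```python
from decimal import ROUND_HALF_UP, Decimal


def matches_reported(reported: str, reproduced: float) -> bool:
    """True if `reproduced` rounds to `reported` at the reported precision.

    `reported` is the value as printed in the article (e.g., "2.39" or ".52").
    """
    target = Decimal(reported)
    rounded = Decimal(str(reproduced)).quantize(target, rounding=ROUND_HALF_UP)
    return rounded == target


def check_reproduction(reported: dict[str, str], reproduced: dict[str, float]) -> list[str]:
    """Return the names of reported results that are missing or do not match."""
    return [
        name
        for name, value in reported.items()
        if name not in reproduced or not matches_reported(value, reproduced[name])
    ]


# Made-up values for illustration only.
reported_results = {"mean": "2.39", "t": "4.10", "r": ".52"}
rerun_output = {"mean": 2.3917, "t": 4.1049, "r": 0.5163}
print(check_reproduction(reported_results, rerun_output))

# A transposed digit, as in the 2.39 vs. 2.93 example, is flagged.
print(check_reproduction(reported_results, {**rerun_output, "mean": 2.93}))
```

An empty list means every listed result was reproduced at the reported precision. Any names returned point the reviewer to the specific values for which corrections or clarification could be requested.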
Limitations
The focus of our study was limited to the April 2019 issue of Psychological Science, a nonrandom sample of all articles in Psychological Science that received an Open Data badge. An advantage of this approach was that we could investigate each article in more depth than would be feasible for a larger sample, resulting in 46 individual reports in total, at least three per article. In comparison, Hardwicke et al. (2021) focused only on the numerical results of a subset of substantive findings for each article, meaning that reproducibility was not as fully evaluated as in our study. Our rich qualitative and quantitative results can be a starting point for further investigation. Building on our reproduction experiences may allow us to better anticipate the roadblocks that reproducers will face.
A possible limitation of our focus is that data-sharing practices may have improved overall since the publication of the issue under investigation. However, our results show only a slight improvement over those found by Hardwicke et al. (2021), who looked at articles published between 2014 and 2015 (using their less strict definition of reproducibility, equivalent to our “essential” reproduction). The Open Data badge eligibility criteria have not substantially changed since, so there is no reason to believe that a more current issue would show substantial improvement in a shorter time frame. Specifically, the eligibility criteria for the award of an Open Data badge at Psychological Science have included sharing of the relevant analysis code since at least November 2017 (Psychological Science, 2017).
Where reproducers had to recreate all or part of the analyses, our reproduction attempts may not be correct. This can result from unclear reporting or a lack of code (or other issues, identified above) but also from a reproducer’s expertise and evolving abilities as a researcher. However, we believe that competent graduate students should be able to reproduce the results of an article with an Open Data badge in their field of training. For an article that was awarded the Open Data badge at Psychological Science, reproduction should simply be a matter of running the code on the data.
An advantage of publicly shared data—over data unshared or available “on request”—is that they are available, and ideally useful, without the original authors’ involvement. Contacting authors is not always easy: Researchers change institutions or email addresses and are mortal. Sometimes authors refuse to share data, even if required by the journal. Stodden et al. (2018) assessed the effectiveness of a policy of mandatory sharing on request at the journal Science and found that, despite this policy, they received data for only 44% of articles. Hence, the independence of the reproduction attempts in our study is one of its strengths. Doubtless we could have exactly or essentially reproduced more articles by contacting the original authors. We did not do this, as we wanted to investigate the effectiveness of the specific Open Data badge policy at Psychological Science, not the analytic or computational reproducibility of individual studies. The out-of-the-box reproducibility of each article indicates that effectiveness—if a successful reproduction requires contacting the authors, the badge was unsuccessful.
Conclusion
Recent advances in open and reproducible science have been rapid, and associated journal policies are constantly improving (see Psychological Science’s move to Transparency and Openness Promotion [TOP] guidelines Level 2; Bauer, 2022). The stopgap, however, cannot be to award Open Data badges to articles that do not meet the minimum criteria. This study provides insight into the importance of sharing data for reproducibility and reuse as well as into the experience of reproducing studies that received the Open Data badge. We hope it can motivate improvements of the Open Data badge policy, or its implementation by the authors, at Psychological Science and other journals committed to promoting open science.