Dr. Diligence and his team are investigating whether peak-pandemic individual adherence to COVID-19 safety guidelines (masking, not drinking bleach, etc.) was positively associated with religious tolerance. Using self-report measures of both constructs, they run a survey study online. After analyzing the data from all 230 participants, Dr. Diligence is extremely pleased to find the predicted positive association between self-reported COVID-19 safety adherence and religious tolerance.
Yet after further examination of the data, Dr. Diligence and fellow researchers notice that 30 participants completed the survey in an impossibly fast time, producing response patterns that appear completely random. Viewing these data as merely noise, they assume the relationship between religious tolerance and COVID-19 policy adherence will be even stronger once these data are removed. Yet in a shock to Dr. Diligence’s team, when those 30 respondents are excluded, the association between the two variables disappears completely.
The conventional wisdom has long been that participants who carelessly respond to questionnaires and other psychological measures merely add random noise to a data set and that such error variance will, at worst, attenuate observed associations between variables. This conventional wisdom reduces the perceived threat of poor data quality because (a) researchers are often more worried about the threat of false positives than false negatives and (b) researchers may be misled into thinking they can compensate for the presence of some careless responding data simply by using larger sample sizes.
Yet numerous researchers have demonstrated that this conventional wisdom is wrong: Data from careless respondents will often create systematic covariance that spuriously inflates associations between individual items (e.g., inflating scale reliabilities; Carden et al., 2019; Wise & DeMars, 2009) and between different multi-item measures, such as self-report questionnaires and cognitive ability tests (Credé, 2010; Holden et al., 2019; Holtzman & Donnellan, 2017; Huang et al., 2015; Wood et al., 2017). In fact, the presence of careless respondents can inflate any kind of statistical estimate based on covariance, including factor correlations, factor loadings, logistic regression coefficients, and so on (King et al., 2018). As we explain, this inflationary phenomenon is extremely common, not rare. The presence of careless responding (CR) participants in the field’s data sets has likely generated a very large number of false positive and spuriously inflated results in published literature, especially in an era of unproctored online studies with anonymous paid participants. Beyond false-positive or inflated results in individual studies, this inflation can have further deleterious effects on everything from meta-analytic estimates to calculations of statistical power (e.g., if prior effect sizes are overestimated, then statistical power will be overestimated for subsequent studies that screen out CR).
Even many of the most experienced psychological scientists underappreciate this major threat to the field’s validity and replicability. In a preregistered study, we distributed a brief survey to the editorial boards of 10 different psychology journals that frequently publish studies with anonymous online samples; as we discuss below, many editorial-board members did not adequately appreciate the Type 1 error risks posed by CR.
Given that several research teams have previously demonstrated the inflationary risks of CR, why has this critical issue failed to enter the general awareness of the psychological-research community? To some extent, it may be that the highly technical writing and/or specialized journal identities of some of these past demonstrations have been limiting factors in promoting this issue to broader audiences. The wide variety of terms used to denote CR (e.g., “indiscriminate responding,” “lazy responding,” “rapid guessing,” “inattentive responding,” “insufficient effort responding”) may also be a factor leading to “jangle fallacies” (the fallacy of erroneously assuming two constructs are different because they have different names; Kelley, 1927). The largest obstacle, however, may simply be that the threat of Type 2 error posed by CR is an easy intuition to grasp, whereas the threat of Type 1 error caused by CR can feel counterintuitive or even paradoxical. For instance, even general review articles on best practices in addressing CR allot only a few sentences to discussing the risk of inflation (e.g., Ward & Meade, 2023), do not mention inflation risks at all (e.g., Malamis & Howley, 2022), or cite the relevant works in ways that may inadvertently lead a reader to think that attenuation is the only major concern (e.g., see Arthur et al., 2021, p. 110). Even individuals who are theoretically aware of the inflationary possibilities of CR may, given the counterintuitiveness of the phenomenon, be biased toward assuming it is rare and not something researchers should worry much about.
In this article, we aim to bring greater awareness of CR’s inflation risks into general research knowledge and practice in psychology by demonstrating that the inflationary risk of CR is prevalent, severe, and frequently unaddressed by researchers using samples with a high CR risk (e.g., paid online samples of anonymous participants). First, we provide a less technical explanation of the confounding effects of CR and present new introductory educational resources suitable for use in classrooms and lab meetings. Second, to concretely demonstrate the prevalence and severity of these effects in real data, we reanalyzed real data (i.e., not simulated data) from three recently published articles. Third, we systematically review recent CR-screening practices in studies with paid online samples published in two flagship psychology journals.
What Is CR?
There are many kinds of problematic participant responses that can affect the validity of research findings, but here we focus on only one: CR, which is any responding that is not attentive to the item content of a survey or test (see Maniaci & Rogge, 2014). CR encompasses random responding, lazily overly consistent responding (e.g., repeatedly giving the same response for long strings of items regardless of item content), and most other response styles that stem from a participant not paying attention to the content of a survey or test. In online samples specifically, CR also includes data generated by software applications (i.e., bots) that are programmed to respond to survey questions automatically. Some CR produces data that are systematically invariant, such as when a respondent gives the same answer over and over to each survey item; other CR produces data that are less systematic, even approximately random (for an investigation of the analytical differences between different types of CR, see DeSimone et al., 2018). Less systematic (i.e., more random) CR data are thought to be more prevalent (for a discussion, see Ward & Meade, 2023), especially given that many careless responders are likely motivated to not be identified as such.
CR is often, sometimes implicitly and sometimes explicitly, discussed as a trait-like individual difference; a person who is careless in one study will tend to also be careless in other studies. For example, Bowling et al. (2016) observed that respondents who tended to be careless across multiple studies were rated by their acquaintances as being lower in conscientiousness, agreeableness, extroversion, and emotional stability. On the other hand, CR will also often be merely a situation-specific methodological nuisance; even generally careful respondents will sometimes become careless, especially if they are in a distraction-prone environment or if the research study is boring, fatiguing, overly long, and so on (see discussion in Ward & Meade, 2023).
Although CR can be found in any kind of research participant sample, paid online samples (e.g., MTurk, Prolific, Dynata) are perhaps the most susceptible to high rates of CR given that (a) hourly pay rates incentivize speedy completion and minimal cognitive effort, (b) researchers cannot verify all their participants to be humans (i.e., not bots) or fluent readers of the study’s language (e.g., participants in foreign countries using server farms to pose as English-speaking American participants), and (c) researchers cannot assist participants to ensure they are not confused. Indeed, when rigorous screening methods are employed, published studies often identify between 15% and 50% of respondents as CR in paid online samples (e.g., Balzarini et al., 2021; Krems et al., 2021; Lassetter et al., 2021). Although there appears to be dramatically wide variability in rates of CR across various paid online sample venues (see Eyal et al., 2021; Litman et al., 2021), even the best-performing premium prescreening options will inevitably end up allowing some CR to occur.
How can CR inflate effect sizes?
When a careless responder invariably gives the same answer over and over to the items in a test (i.e., “straightlining”), it is easy to intuit how such data would often naturally inflate internal reliabilities and associations between measures, at least when all items are keyed in the same direction (for a discussion, see King et al., 2018). However, what is less intuitive is that even fully random data provided by careless respondents will often inflate such associations (Credé, 2010). Our goal here is to present a concise, less technical introduction to this CR statistical confound than those that have been described previously in excellent detail (e.g., Huang & DeSimone, 2021; Huang et al., 2015) in hopes of connecting with a wider readership.
For CR as a variable to be a confounding factor, it must causally affect scores on both the independent variable (x) and the dependent variable (y).
For respondents who provide exclusively random responses on a psychological scale or test, their mean scores, on average, can be expected to be very close to the midpoint of a rating scale or at chance-level performance on a test. For respondents who provide only partially random data (e.g., individuals who respond carelessly to some but not all items), their mean scores should still be closer to the midpoint of a scale or to chance-level performance than those of fully careful respondents (Huang et al., 2015). Empirical studies that examined CR (Huang & DeSimone, 2021; Huang et al., 2015; Litman et al., 2021) have generally corroborated that this tendency naturally emerges in real studies.
Given that one can conceptualize the group means of careless responders as roughly constant across studies (i.e., near to a scale’s midpoint or chance level), the effect that CR has on one’s data will depend on the group means of one’s own variables of interest among careful responders (see Fig. 1).

Fig. 1. Example of the different effects that failing to remove careless responders has on the relationship between two variables.
However, if the mean scores of carefully responding participants are measurably different from the scores of careless respondents for both variables, then including careless respondents adds systematic covariance that can inflate, or even create, an association between them.
It is typically the case that mean scores on a self-report scale diverge from the midpoint of a scale among careful responders (Credé, 2010). For example, most self-report measures of psychopathology and of clinical-symptom counts are positively skewed (see review in Field & Wilcox, 2017). In addition, most ability measures have population norms above chance performance. Thus, rather than being a rare phenomenon, CR inflation should, in many cases, be presumed to be more likely than CR attenuation.
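To make this concrete, the following minimal R simulation (our own toy parameters; not data from any study discussed here) mixes careful responders whose means sit above a 5-point scale’s midpoint with fully random careless responders. The true association is zero, yet the pooled correlation comes out positive:

```r
# Minimal toy simulation (our own parameters, not any study's data):
# two unrelated 5-point-scale variables whose careful-responder means sit
# above the midpoint of 3, plus fully random careless responders.
set.seed(42)
n_careful  <- 300
n_careless <- 60   # ~17% CR, within the 15%-50% range cited above

# Careful responders: true correlation is zero; means pulled toward 4
careful_x <- pmin(pmax(round(rnorm(n_careful, mean = 4, sd = 0.8)), 1), 5)
careful_y <- pmin(pmax(round(rnorm(n_careful, mean = 4, sd = 0.8)), 1), 5)

# Careless responders: uniform random picks across the 1-5 response scale
careless_x <- sample(1:5, n_careless, replace = TRUE)
careless_y <- sample(1:5, n_careless, replace = TRUE)

cor(careful_x, careful_y)      # ~0: the true (null) association
cor(c(careful_x, careless_x),  # spuriously positive once CR is mixed in
    c(careful_y, careless_y))
```

With these parameters, the pooled correlation typically lands near r ≈ .15, which is statistically significant at this sample size, even though the careful responders’ data contain no association at all.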
Educational tools to demonstrate and explain CR
The past decades of improvement in psychological-science methods have shown that videos and visual tools can be very helpful in spreading awareness of important analytic issues. To this end, we developed the Careless Responding Simulator to visually and dynamically demonstrate the inflating/deflating effects of CR on bivariate associations. Figure 2 shows a screenshot of the app, in which researchers can define different parameters for hypothetical careful and careless responders to see how a correlation among careful responders changes with the addition of careless responders. There are also simulation presets that demonstrate a diluted effect, an inflated effect, and a created effect. The simulator can be accessed online at https://fuhred.shinyapps.io/CarelessRespondingSimulator/. Moreover, we have also created (a) a brief video introducing the CR inflation effect and (b) a brief video explaining how to use the Careless Responding Simulator (accessible at https://www.youtube.com/watch?v=niTPWqr6fsE and https://www.youtube.com/watch?v=uUrgRFbiTks, respectively). These tools are not intended to help researchers estimate the specific effects of CR in their own data sets. Instead, we hope these educational aids will help readers better understand and promote awareness of the CR issues discussed in this article.

Fig. 2. Screenshot of the Careless Responding Simulator, a Shiny app to demonstrate how the introduction of careless responding can distort true effects.
Case Studies of Three Recently Published Articles
In the previous section, we aimed to describe how unremoved CR participants can inflate effect sizes. The next step is to demonstrate this inflationary effect in real data. Whereas our explanations and simulations above assume CR that produces fully random responding, CR by real participants could potentially take on a wide variety of patterns (e.g., alternating between “1” and “3” on a response scale or producing some other systematic pattern that is not merely straightlining). Moreover, the patterns produced by CR may be influenced by specifics of individual study designs in unpredictable ways (e.g., some study designs may lead CR to manifest as severe central tendency bias, whereas others may tend to lead to CR that manifests as extreme responding bias). Thus, it is important to investigate the Type 1 error risks of CR using real participant data.
Although prior demonstrations of the inflationary effects of CR have typically relied at least in part on simulated data, we do not use any simulated data here and instead focus on data sets from three recently published studies.
Before turning to our systematic reviews of journal screening practices, we present these three case studies in detail.
Study 1: self-report scales with relatively normal distributions
Kachanoff et al. (2020) provided an example of how CR might meaningfully inflate associations between self-report scales even if the mean responses of careful and careless participants deviate only modestly from one another. Kachanoff et al.’s goal was to investigate whether collective autonomy restriction, “a feeling that other groups seek to control and restrict how their own ingroup articulates and expresses its sociocultural identity” (p. 601), motivates groups to improve their power position in a social hierarchy.
In their Study 1, Kachanoff et al. (2020) administered a series of six self-report scales to a sample of 412 Black American participants recruited online via the research panel firm Dynata. They used three nearly identical attention-check items to screen for CR (e.g., “If you’re paying attention, please select 3”) and excluded 101 participants (24.5% of the sample) who failed at least one of these items.
First, we examined whether there were mean differences on the six self-report variables between participants Kachanoff et al. (2020) retained in their study analyses (i.e., careful participants) and participants they excluded for failing an attention check (i.e., careless participants; Table 1). There were medium to large significant differences between the two groups.
Table 1. Comparing correlations without careless responding (CR) participants and with CR participants included (Kachanoff et al., 2020, Study 1).
Note: Correlations with CR participants included in analyses are in bold.
To test this, we ran the correlations between the study variables for the whole sample (including careless participants) and compared them with the published correlations (with careless participants removed) in Kachanoff et al. (2020). As shown in Table 1, for correlated variables that both displayed mean differences between careless and careful groups, every effect originally reported by Kachanoff et al. was inflated to varying degrees when data from careless participants were included.
The CR inflation observed in our reanalysis of the data provided by Kachanoff et al. (2020) was modest but meaningful. Even at this level of inflation, including CR data can cause Type 1 errors; distort moderation, mediation, and other analyses; and inflate effect-size estimates when included in meta-analyses.
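Mechanically, this kind of check is simple to run. Below is a hedged R sketch of the general workflow; the data frame, column names, and failed_check flag are our own illustrative stand-ins, not Kachanoff et al.’s actual variables:

```r
# Hypothetical reanalysis sketch: `dat` holds scale scores plus a logical
# `failed_check` flag (TRUE if any attention-check item was failed).
compare_cr <- function(dat, x, y, flag = "failed_check") {
  careful   <- dat[!dat[[flag]], ]
  r_without <- cor(careful[[x]], careful[[y]])  # CR excluded (as published)
  r_with    <- cor(dat[[x]], dat[[y]])          # CR included
  c(r_without_CR = r_without, r_with_CR = r_with, delta = r_with - r_without)
}

# Group mean differences on a scale (Welch t test), e.g.:
# t.test(scale_score ~ failed_check, data = dat)
```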
Study 2: low base-rate self-reported behaviors
Compared with the distortion demonstrated in the Kachanoff et al. (2020) data set, the CR inflationary dynamic will be more pernicious with variables whose true values are more extremely tilted away from the midpoint of a scale. This is especially the case for low and high base-rate variables (e.g., see demonstrations in Chandler et al., 2020). Low/high base-rate phenomena could range from mental-health diagnoses (e.g., schizophrenia) or other characteristics (e.g., identifying as cisgender) to certain behaviors (e.g., physically confronting Black Lives Matter protesters; Bartusevičius et al., 2021) or other specific outcomes.
Costello et al.’s (2022) work on the correlates of left-wing authoritarianism (LWA), a construct that describes authoritarianism in service of left-wing outcomes, provides a valuable window into how CR can be particularly inflationary when working with low base-rate phenomena. We focused our reanalysis efforts on Costello et al.’s Study 6, in which the authors investigated whether LWA predicted self-reported participation in protest violence during summer 2020, as well as other acts of self-reported political violence in the previous 5 years, because these behaviors have low base rates in the general population.
Costello et al. (2022), in Study 6, recruited 1,000 participants from Prolific, a paid online data-sourcing platform. In addition to two basic attention-check items (e.g., “Balls are round,” agree/disagree), participants were also screened for CR by being instructed to “write a sentence that you think has probably never been said before (e.g., ‘the red, disingenuous marmoset galloped over the Atlantic Ocean while wearing sunglasses’)” (Costello, personal communication). If a participant left the answer blank, wrote gibberish, failed to produce an actual sentence, or wrote a sentence that was not at all unusual, that participant was excluded. On the basis of these three CR screening items, Costello et al. identified 16.9% of the sample as careless and excluded them from analyses, leaving a final sample of 834 participants.
We first examined whether (a) the LWA scale and (b) the low base-rate variables had significantly different means for the careless- and careful-participant groups. Scores on the LWA scale and each of the four binary political-violence variables all demonstrated medium to large mean differences between the careful and careless participants.
We examined the associations between LWA and the binary political-violence variables with and without careless participants by conducting a series of binary logistic regressions controlling for symbolic ideology (the same regression procedure reported by Costello et al., 2022). As shown in Table 2, adding the careless participants inflated the association for every one of these variables, and the inflation was particularly dramatic for the two lowest base-rate behaviors. For example, without careless participants, the odds of having taken part in violence during protests in summer 2020 increased by a factor of 2.96 for each standard-deviation increase in LWA; when careless participants were included, the odds instead increased by a factor of 6.02. In sum, the careful CR screening conducted by Costello et al. (2022) was critical to the quality of their findings. This case study raises serious concerns about similar online studies working with low/high base-rate variables that do not screen for CR.
Table 2. Left-wing authoritarianism predicting binary political-violence variables in binary logistic regressions without and with careless responding (CR) participants included (Costello et al., 2022, Study 6).
Note: Exp(B)s with careless participants included in analyses are in bold. LWA = Left-Wing Authoritarianism Scale; CR = careless responding.
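The following R simulation, using our own invented parameters rather than Costello et al.’s data, illustrates why low base-rate outcomes are so vulnerable: careful responders sit below the predictor’s midpoint and rarely endorse the rare behavior, careless responders answer at chance on both, and a truly null odds ratio is inflated well above 1:

```r
# Toy simulation with a null true effect (our own parameters, not
# Costello et al.'s data).
set.seed(1)
n_careful <- 800; n_careless <- 160

x <- c(rnorm(n_careful, mean = -1),   # careful: below the predictor's midpoint
       rnorm(n_careless, mean = 0))   # careless: at the midpoint, on average
y <- c(rbinom(n_careful, 1, 0.03),    # careful: ~3% base-rate behavior
       rbinom(n_careless, 1, 0.5))    # careless: random yes/no
careless <- rep(c(FALSE, TRUE), c(n_careful, n_careless))

fit_careful <- glm(y[!careless] ~ scale(x[!careless]), family = binomial)
fit_all     <- glm(y ~ scale(x), family = binomial)

exp(coef(fit_careful))[2]  # odds ratio near 1: the true (null) effect
exp(coef(fit_all))[2]      # spuriously large odds ratio with CR included
```

The direction and size of the distortion depend on where the careful responders’ means sit relative to the careless responders’ expected values, which is why low base-rate outcomes are especially vulnerable.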
Study 3: behavioral tasks
Although the case study of Costello et al. (2022) demonstrates the notably dangerous effects of CR with highly skewed or low base-rate self-report data, CR has the potential to undermine almost anything that a study participant could be asked to do. For instance, virtually all behavioral tasks require at least some degree of effort and/or attention, whether one is labeling the emotional content of faces, completing logic puzzles, answering trivia questions, or reading prompts for experimental manipulations (for similar concerns, see Huang & DeSimone, 2021).
The behavioral task studies of Sanchez and Dunning (2021) illustrated CR’s potential to inflate associations between different behavioral tasks and between behavioral tasks and self-report questionnaires. These researchers reported a range of correlates of the behavioral construct of “jumping to conclusions” (JTC), defined as “collecting only a few pieces of evidence before reaching a decision” (p. 790).
We focused our reanalysis on Study 1B of Sanchez and Dunning (2021) because this was the only study in their article in which they screened for CR. A total of 346 participants were recruited from MTurk. To measure JTC, participants watched a character fishing from a lake containing either (a) 80% red fish and 20% gray fish or (b) 80% gray fish and 20% red fish. One fish was caught at a time, and after each fish catch, participants could ask to watch another fish be caught. When a participant felt ready, they could stop watching fish catches and make their guess as to whether the character was fishing out of the majority-red lake or majority-gray lake. Each participant’s JTC score was a function of how many fish catches they asked to see before making their decision.
CR is an obvious possible confound for JTC because failing to ask to see more fish catches could be a result of lack of effort or a desire to complete the study as quickly as possible (for particularly germane demonstrations of this problem, see Zorowitz et al., 2021). Thus, to evaluate whether their results might be driven by lack of effort/attention by participants, Sanchez and Dunning (2021) employed a pair of attention checks that were unobtrusive (i.e., participants were not likely to realize that they were being attention checked because the items seamlessly blended with the rest of the survey). Sanchez and Dunning excluded 13.9% of their sample for failing at least one of these attention-check items, leaving a final sample of 298 participants.
Would Sanchez and Dunning’s (2021) Study 1B results have been meaningfully different if they had failed to exclude careless participants? To address this, we first examined which variables had significantly different means for the participants they retained (careful participants) and participants they excluded (careless participants). For the JTC behavioral task, there was a major difference between careful and careless participants; the careless participants on average asked to see only 1.88 fish being pulled from the lake before deciding, whereas the careful participants on average asked to see 3.56 fish being pulled from the lake before deciding. Of the remaining 15 variables described in their Table 1, 11 demonstrated mean differences between the careful and careless participants.
We ran correlations between all the variables including careless participants and compared these correlations with those reported in Table 1 of Sanchez and Dunning (2021), in which careless participants had been excluded. As shown in our Table 3, JTC’s associations with the 11 variables we had identified were all inflated when careless participants were included in the analyses. For JTC’s associations with the self-report variables from this set, the presence of careless participants increased the average proportion of explained variance.
Table 3. Comparing associations without and with careless responding participants included (Sanchez & Dunning, 2021, Study 1B).
Note: Correlations with careless participants included in analyses are in bold. JTC = jumping to conclusions. Lower number = higher JTC.
Recent Data-Screening Practices in Paid Online Samples in Two Flagship Journals
As the three case examples presented above illustrate, failing to screen for and exclude CR participants will often spuriously create or inflate findings. Indeed, 28 of the 34 effect sizes (82%) that we examined became stronger, not weaker, when careless participants were included in the analyses. This raises the following question: How frequently and rigorously are researchers screening for and removing CR participants from their data sets, especially in samples collected from paid online platforms? Are potentially vast numbers of findings published in contemporary psychological science spuriously created or inflated because of lack of CR screening procedures?
Researchers in various fields (e.g., marketing research; Arndt et al., 2022) have investigated the prevalence of CR screening by systematically reviewing specific journals or published articles on a specific topic. Within psychology, such reviews have thus far (to our knowledge) been focused only on clinical psychology. For example, King et al. (2018) reviewed every study published in 2016 in 14 journals focused on addictions research; they observed that only 11 out of 2,079 studies (< 1%) reported any kind of screening for CR when collecting data online. Sharpe et al. (2023) conducted a similar review of recent studies in three other clinical-psychology journals; they observed that 14 out of 20 (70%) studies engaged in data-quality screening when collecting data from MTurk or other online recruitment panels. Jones et al. (2022) systematically reviewed alcohol-research studies employing online recruitment panels published from 2011 to 2021; 51 out of 96 (53.13%) identified articles reported screening for CR. A main limitation of all three of these clinical-psychology reviews is that none of them directly contacted authors to inquire about their screening practices; given that most journals do not currently have requirements for reporting CR data exclusions, some authors may have screened for CR but not explicitly reported it in the main text of their articles. Thus, a more valid estimate requires directly contacting study authors when the presence or absence of CR screening cannot be determined from the article text.
To improve on these prior reviews and extend them to the broader field of psychological science, we conducted a systematic review of CR-screening practices in two flagship journals widely read by psychologists outside of clinical research. Specifically, we examined every article these two journals published over a recent 1-year period.
We focused on studies published in these two journals that used paid online samples. In addition to their heightened risk of CR, one of the main advantages of paid online samples is that they allow researchers to attain larger sample sizes rapidly and cheaply. As sample size increases, however, so too does the false-positive rate generated by CR inflation; in large samples, even a small CR-caused distortion can spuriously make a null effect appear to be statistically significant (Zorowitz et al., 2021; also cf. “paradox of big data”; Meng, 2018).
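A brief R illustration of this point, with our own toy numbers (a fixed 15% CR rate, a true null correlation, and careful-responder means of 5.5 on a 7-point scale):

```r
# Toy illustration (our own parameters): the same small CR-induced
# distortion moves from nonsignificant to "significant" as n grows.
set.seed(7)
sim_p <- function(n, prop_cr = 0.15) {
  n_cr <- round(n * prop_cr)
  x <- c(rnorm(n - n_cr, 5.5, 1), sample(1:7, n_cr, replace = TRUE))
  y <- c(rnorm(n - n_cr, 5.5, 1), sample(1:7, n_cr, replace = TRUE))
  cor.test(x, y)$p.value
}
# Median p value across 500 replications at each sample size:
sapply(c(100, 500, 2000), function(n) median(replicate(500, sim_p(n))))
# p values shrink toward zero as n rises, despite a true null effect
```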
For full details regarding these two systematic reviews, see the Supplemental Material available online. In sum, we reviewed every article in these two journals during this 1-year period and identified studies using paid online samples. We then coded (a) whether CR screening was conducted for the study and if so, (b) whether multiple methods for detecting CR were used and (c) how many CR respondents were subsequently excluded. Every article was examined by at least two coders, and consensus on final codes was reached through group discussion if there were discrepancies between coders. In many cases, some of these codes could not be conclusively established based solely on the information in the published articles and accompanying supplemental material; in such cases, we directly contacted author teams to request the missing information.
When coding for whether a study engaged in CR screening, we erred on the side of overcrediting researchers: A study qualified as having screened for CR if any method that might be considered a form of CR screening was employed, even if the researcher may not have been intending to screen for CR (e.g., excluding participants who did not adequately follow instructions). Thus, our findings should be considered a conservative estimate of the problem of CR-screening neglect.
An alluvial plot summary of our review findings is provided in Figure 3. Of the 273 articles spanning 613 studies published in these two journals in our specified time frame, 104 articles spanning 444 studies analyzed data from paid online samples.

Fig. 3. Alluvial plot of CR-screening practices in the reviewed studies.
Of these 444 studies, only 217 (48.9%) reported in the published text that any form of CR screening had been conducted; after we contacted authors directly, this count rose to 232 studies (52.3%) that conducted at least some form of CR screening.
Of the 232 studies that conducted CR screening, regardless of whether it was reported in text or to us in response to our email, only 81 (34.9%) used more than one method to detect CR.
Of the 232 studies that conducted at least some kind of CR screening, 207 (89.2%) reported the number of participants that were excluded.
In sum, only about half of the online studies published in these two flagship journals conducted any CR screening at all, and only about a third of those that screened used more than one detection method.
General Discussion and Recommendations
Our overarching aim in this article was to explain and demonstrate the severity and frequency of Type 1 error risks that result from the “insidious confound” (Huang et al., 2015) created by CR research participants. Counter to long-standing conventional wisdom, the presence of partly or fully random responding will very often spuriously inflate associations between variables (e.g., Credé, 2010). Because of the counterintuitiveness of this phenomenon and perhaps also the technical nature and specialized journal identities of past articles on this topic, this serious confound has continued to be widely underappreciated.
In the three case examples we presented, the original authors took careful steps to identify and exclude substantial amounts of CR data. Our reanalyses of the data in these studies revealed just how right the authors were to conduct CR screening; if they had not done so, the majority of the effects they would have reported would have been spuriously created or inflated, some modestly and some dramatically. Beyond demonstrating that CR will often inflate effect sizes more frequently than diluting them, the studies we reanalyzed further illustrate how CR can spuriously inflate effect sizes in almost any type of research that requires effortful cognitive processing, from self-reporting attitudes to engaging in behavioral tasks. Yet our reviews of screening practices in two prominent journals show that researchers are commonly failing to rigorously screen for and remove CR participants from their data sets. Moreover, our survey study of journal editorial boards further confirmed that many of them do not adequately appreciate the Type 1 error risks posed by CR data, which likely explains why many of them do not place much weight on CR screening in reviews of articles. Thus, there are likely a very large number of false-positive and spuriously inflated results continuing to be published, especially in an era of unproctored online studies with anonymous paid participants.
Addressing CR in research
A number of excellent reviews are available to help researchers think through the various strategies to identify, remove, and report CR in their articles (e.g., Arthur et al., 2021; Curran, 2016; Goldammer et al., 2020; Hong et al., 2020).
In particular, Ward and Meade (2023) recently presented a helpful and comprehensive guide to CR screening, and we draw on their recommendations below.
Identifying CR
Scholars have used a wide range of techniques for identifying CR (e.g., Curran, 2016; Hong et al., 2020). Examples of these screening techniques include instructed items (e.g., “Please select STRONGLY AGREE”), bogus items (“I have never eaten food”), consistency between psychometric antonyms (e.g., “I love my life” and “I do not love my life”), speed of page completion (considered superior to speed of total survey completion), recall checks (e.g., “What was the name of the person in the vignette you just read?”), outlier-detection methods, and many other techniques. The full range of different methods for screening for CR was helpfully explained by Ward and Meade (2023; for a review of screening methods in the context of bot detection specifically, see Storozuk et al., 2020). The R package careless provides convenient implementations of many of these post hoc screening indices.
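As a sketch of what such post hoc screening can look like in practice, the snippet below computes several indices with the careless package; the item-response matrix items and the cutoffs are our own illustrative assumptions, not universal standards:

```r
# Post hoc screening sketch; `items` is an assumed numeric matrix or data
# frame of raw item responses, and the cutoffs below are illustrative.
# install.packages("careless")
library(careless)

long_runs <- longstring(items)           # longest run of identical responses
within_sd <- irv(items)                  # intra-individual response variability
m_dist    <- mahad(items, plot = FALSE)  # multivariate (Mahalanobis) distance

flags <- (long_runs >= 10) +                 # straightlining
         (within_sd < 0.5) +                 # suspiciously invariant responding
         (m_dist > quantile(m_dist, 0.95))   # multivariate outliers
suspect <- which(flags >= 2)             # act only on converging evidence
```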
These different CR-screening methods appear to have both strengths and weaknesses such that “there does not seem to exist a universally effective . . . detection method” (Hong et al., 2020, p. 313). Indeed, given that CR takes different forms, such as random responding or lazily overly consistent responding (i.e., straightlining), no single screening method will be excellent at detecting all the different manifestations of CR. Moreover, the merits of any approach also depend on the characteristics of a particular research study, such as the other tasks and measures within it. For example, a screening method that would be unobtrusive for one kind of study might be awkward and intrusive for another kind of study in such a way that it causes participant reactance or willful noncompliance (Silber et al., 2022). Thus, rigorous CR screening requires researchers to develop a diverse tool kit of varying CR-identification techniques that can be applied appropriately to one’s specific study designs.
Although we refrain from recommending a catchall CR-identification method to researchers, our central recommendation echoes that presented in other CR reviews (e.g., Chmielewski & Kucker, 2020; Hong et al., 2020): “The use of any of these techniques should not be applied in a vacuum void of other techniques. . . . The strongest use of these methods is to use them in concert” (Curran, 2016, p. 6). Ward and Meade (2023) provided various suggested combinations of different screening methods, some of which must be built into a study (i.e., a priori screening) and some of which can be used with archival data (i.e., post hoc screening). Optimally, both a priori and post hoc methods should be used in tandem, as sketched below. In their first table, Ward and Meade arranged their a priori and post hoc CR-identification suggestions into three levels of screening rigor: minimal, moderate, and extensive. Note that only a small percentage of the online studies we reviewed in the two flagship journals met even the minimal level of screening rigor elaborated by Ward and Meade.
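A minimal sketch of using such methods in concert (column names and thresholds are hypothetical, chosen only for illustration) might combine a priori screens with the post hoc flags computed in the previous sketch:

```r
# Hypothetical combination of a priori and post hoc screens.
failed_instructed <- dat$attn_check != 3        # e.g., "please select 3"
too_fast          <- dat$median_page_secs < 2   # assumed per-page speed floor

exclude   <- failed_instructed | too_fast | (flags >= 2)  # `flags` from above
dat_clean <- dat[!exclude, ]
mean(exclude)   # proportion excluded, to report transparently
```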
Excluding CR
Because carelessness is inherently a continuous, rather than categorical, phenomenon, researchers must next determine a proper threshold for excluding respondents from analyses. For instance, should even a single failed attention check be sufficient for exclusion, or should a series of failed attention checks be necessary? As the stringency of CR screening increases, some meaningfully careful participants will tend to be inadvertently excluded (Kim et al., 2018); as stringency of CR screening decreases, data quality will tend to suffer. Below, we first address a few prominent misconceptions in CR exclusion considerations and highlight what we believe to be valid considerations for being more or less stringent in CR exclusion criteria.
The first misconception is the belief that CR data are less risky and should be less stringently excluded “if sample sizes are smaller and the analyses are somewhat more robust (e.g., correlations)” (Ward & Meade, 2023, p. 588). The results of our various case studies, however, show how even correlational analyses are at risk for spuriously inflated results if CR is left unremoved. As detailed by King et al. (2018), this risk is present for essentially any analysis that deals with covariances between items/variables. In addition, whereas larger sample sizes do amplify the potential for CR data to generate false-positive findings under tests of statistical significance (Zorowitz et al., 2021), small samples are also at risk of inflated effect sizes because of CR (for a similar discussion, see Sharpe et al., 2023). We therefore advise that sample size and analytic method should not be used as justification for lessened screening stringency.
Many researchers also initially dismiss the need to stringently exclude CR participants in their samples by pointing to strong internal consistencies for their measures. This is a natural intuition, one likely even held by some psychometricians; for instance, in Ward and Meade’s (2023) review, CR is depicted as only tending to reduce the internal reliability of measures (also see Arthur et al., 2021, p. 115), leaving readers to potentially infer that high reliabilities in their measures signal less of a need to stringently exclude CR participants. Yet the presence of even substantial amounts of CR will often not decrease the internal consistency of measures; in fact, CR data will sometimes inflate internal consistency, especially when all test items are keyed in the same direction (for a demonstration, see the online simulator app in Carden et al., 2019). As with any other statistical estimate based on covariance matrices, CR can add systematic covariance between items within a scale. “High Cronbach’s alpha is no indicator of data quality” (Hong & Cheng, 2019, p. 622) and thus should not be used as a rationale for less stringent CR exclusion criteria.
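A toy R check of this point, under our own assumed parameters (ten same-keyed items with modest inter-item correlations among careful responders, whose means sit above the careless responders’ expected mean):

```r
# Toy check (our own parameters, not a general rule): random responders
# need not lower Cronbach's alpha and can even raise it when all items
# are keyed in the same direction.
set.seed(3)
n_careful <- 300; n_careless <- 100; k <- 10

true_score <- rnorm(n_careful, mean = 6, sd = 1)
careful  <- sapply(1:k, function(i) true_score + rnorm(n_careful, 0, 2))
careless <- matrix(sample(1:7, n_careless * k, replace = TRUE), ncol = k)

cronbach <- function(m) {   # standard alpha from the covariance matrix
  k <- ncol(m); v <- cov(m)
  (k / (k - 1)) * (1 - sum(diag(v)) / sum(v))
}
cronbach(careful)                   # alpha among careful responders only
cronbach(rbind(careful, careless))  # alpha does not drop; here it rises
```

Here alpha rises when the random responders are added because the group mean difference contributes shared covariance to every pair of items.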
What we believe researchers should consider when determining a proper threshold for excluding respondents from analyses are research sample characteristics and study design. Although CR can occur in practically any research context (e.g., undergraduate study pool samples, surveys voluntarily completed by journal editorial-board members), Ward and Meade (2023) noted that CR is likely to be especially prevalent in studies administered online; studies that are long, repetitive, or uninteresting to participants; and studies in which participants face little or no consequences for responding carelessly. With Ward and Meade, we advise that researchers should be moderate to extensive in their CR exclusion efforts when studies have any of these characteristics and rely only on more minimal exclusion criteria when one’s study design is unlikely to invite many careless responders (e.g., proctored in-person studies with intrinsically motivated participants).
Above all, in keeping with the findings of the present article, we offer one especially important recommendation: Researchers should assess the expected mean scores of the variables in their study and adjust their screening stringency accordingly. If any of the variables of interest have expected mean scores that differ, even modestly, from the midpoint of the variable range (or from chance level on a test), then CR exclusion scrutiny and stringency should be heightened. When working with highly skewed or low/high base-rate self-report variables or with behavioral tasks that are inherently sensitive to participant effort, CR screening and exclusion should be extremely rigorous to avoid spuriously inflated effects.
Finally, whatever level of screening stringency a researcher may adopt, best practices in open science recommend running and transparently presenting analyses both with and without identified CR participants (e.g., a “multiverse approach”; Del Giudice & Gangestad, 2021; Steegen et al., 2016). In fact, such analyses can provide additional information to evaluate the necessary stringency: If the results substantially differ, that should raise additional concerns about the underlying data and prompt consideration of further increasing the stringency.
Registering and reporting CR
We encourage researchers to establish both their identification and exclusion criteria before conducting their research and to preregister these decisions. Then, just as researchers should be transparent in their articles about other study qualities that may affect the validity of their findings (e.g., whether hypotheses were determined a priori, whether their measures were reliable), the identification and exclusion of CR should be described in the main text of journal articles. Although extensive specifics may be reasonably reserved for supplemental documents, we suggest that at least the following topics be addressed in the main text: screening methods employed, whether the exclusion criteria were determined before data collection, how much data were subsequently excluded, and whether any deviations from one’s preregistration were made if a screening plan was preregistered (see similar but slightly more intensive recommendations by Chmielewski & Kucker, 2020). Not only does this allow journal editors and readers to better evaluate the credibility of one’s reported effect sizes, but it also assists others with testing the reproducibility of findings.
Conclusion
We aimed to bring the issue of CR’s inflationary effects on observed associations to the forefront of the minds of researchers in an easily accessible manner. Given that our systematic reviews revealed that many or most high-profile research studies with paid online samples fail to adequately screen for CR, it is prudent to presume that many published findings are spuriously inflated. It is our hope that, especially in the current era of reliance on paid online samples, researchers will view CR screening as critical to safeguarding the credibility of their psychological research.