Abstract
Questionnaires are by far the most common tool for measuring noncognitive constructs in psychology and the educational sciences. Above and beyond differences in the traits to be measured, response bias may pose an additional source of variation between respondents that threatens the validity of conclusions drawn from questionnaire data. In the present article, we focus on two commonly encountered response biases—careless and insufficient effort responding (C/IER) and response styles (RS) in attentive answering. The former refers to behavior shown by respondents who approach the administered items inattentively and choose responses that do not reflect the trait to be measured, for example, by random responding or straight lining (Meade & Craig, 2012). With the latter, we refer to response behavior shown by respondents whose responses, although stemming from—at least in part—attentive response processes, are confounded with content-irrelevant variability due to differences in category usage and perception, such as midpoint (MRS) or extreme RS (ERS; Böckenholt & Meiser, 2017).
Note that the key characteristic of the employed distinction between C/IER and RS in attentive answering (or attentive RS) is whether or not observed responses still reflect the trait to be measured; that is, whether the targeted response bias induces content-irrelevant variability on the between-item (C/IER) or within-item (attentive RS) level. As such, C/IER does not merely include seemingly random responses but also subsumes response patterns going back to systematic tendencies of respondents to prefer specific response categories (e.g., outer response options) over others—as long as such systematic tendencies are the only driver of how respondents choose response options and the resulting responses are uninformative of the traits to be measured. Likewise, the employed distinction between C/IER and RS in attentive answering still allows for both types of response bias to potentially stem from noneffortful responding. However, while C/IE responses are assumed to not reflect the traits to be measured whatsoever, we assume that responses that are confounded with attentive RS are still reflective of the traits to be measured to some extent. That is, respondents must have invested at least some effort into reading the item and retrieving relevant information. In the case that RS is a manifestation of lowered effort due to, for example, fatigue, respondents may have read and/or processed the items superficially and employed their category preferences as heuristics in choosing a response option (see Lyu & Bolt, 2022, for recent model developments aiming to capture such behavior). Besides fatigue, other sources of RS in attentive answering exist, with cultural differences being the most prominent example (Johnson, 2005). Note that in this study, for the sake of simplicity, we employ the terms “attentive” and “inattentive”/“careless” as complementary antonyms, demarcating zero from nonzero attentiveness, rather than as gradable antonyms.
A vast literature exists on the identification of either type of response bias (see Böckenholt & Meiser, 2017; Henninger & Meiser, 2020, for overviews on RS; and Meade & Craig, 2012; Niessen et al., 2016, for overviews on C/IER). Although both types of response bias are plausibly present in questionnaire data, so far, no model exists that considers both biases simultaneously. Separating and jointly considering both types of bias is a challenging endeavor since they may both result in rather similar response vectors. Response vectors consisting of the highest response category on all items, for instance, may go back either to a combination of a high content trait level and attentive ERS or to inattentive straight-lining behavior. Nevertheless, C/IER and attentive RS constitute markedly different response processes that can be assumed to differ in behavioral indicators other than the mere responses. Response times (RTs) retrievable from computer-administered questionnaires are a prominent and widely employed example of such behavioral indicators, allowing researchers to both separate and better understand different response processes (e.g., Henninger & Plieninger, 2020; Schnipke & Scrams, 1997; Ulitzsch et al., 2020; Wang & Xu, 2015; Wise, 2017). We aim to develop an approach that draws on the additional information contained in RTs for disentangling as well as simultaneously accounting for and investigating C/IER and attentive RS.
In what follows, we first review current, solely response-pattern-based approaches that support investigating and handling C/IER and attentive RS. We then discuss the potential of RTs for better understanding different aspects of response behavior in questionnaire data. Based on these considerations, we present an RT-based mixture modeling approach that combines and extends approaches for different aspects of response behavior and bias. In doing so, the approach distinguishes attentive respondents from those showing C/IER and separates parts of attentive responses going back to differences in trait levels from those due to RS. In an application of the model to data from the Programme for International Student Assessment 2015 (PISA; Organisation for Economic Co-operation and Development, 2017) background questionnaire, we illustrate the insights into response behavior that can be gained from the presented approach and contrast it against analyses leaving either C/IER, attentive RS, or both unconsidered. We further contrast the proposed approach against a more heuristic two-step procedure that first eliminates presumed careless respondents from the data and subsequently applies model-based approaches accommodating RS in attentive responding. To investigate the trustworthiness of results obtained in the empirical application, we assess parameter recovery of the approach in a simulation study.
Response-Pattern-Based Approaches for Identifying Response Biases
Careless and Insufficient Effort Responding
Traditional approaches to C/IER commonly employ response-pattern-based indicators for its identification. Examples of such indicators are the long string index, computed as the longest run of identical consecutive responses for each respondent (Johnson, 2005); the even–odd index, given by the within-person correlation between the responses to odd-numbered and even-numbered items belonging to the same scale, averaged across scales (Curran, 2016; Huang et al., 2012); and the Mahalanobis distance, following the rationale that C/IE responses are outliers that deviate from typical response patterns (Curran, 2016; Huang et al., 2012). Exhaustive overviews and discussions of other response-pattern-based indicators are given in Curran (2016), Meade and Craig (2012), and Niessen et al. (2016).
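For illustration, a minimal base-R sketch of the first two indicators follows. The toy data, the scale assignment, and the even–odd operationalization (odd- and even-half subscale means correlated within person across scales, one common variant) are assumptions for the example; the careless package used later in this article provides tested implementations.

```r
# Toy data: 200 respondents, 16 five-category items forming four 4-item scales
set.seed(1)
resp   <- matrix(sample(1:5, 200 * 16, replace = TRUE), nrow = 200)
scales <- list(1:4, 5:8, 9:12, 13:16)   # assumed scale membership of the items

# Long string index: longest run of identical consecutive responses per person
long_string <- apply(resp, 1, function(x) max(rle(x)$lengths))

# Even-odd index: per scale, means of odd- vs. even-numbered items,
# then the within-person correlation of these half-scale means across scales
even_odd <- apply(resp, 1, function(x) {
  odd  <- sapply(scales, function(s) mean(x[s[seq(1, length(s), 2)]]))
  even <- sapply(scales, function(s) mean(x[s[seq(2, length(s), 2)]]))
  cor(odd, even)
})
```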
When respondents exceed a predefined threshold on the employed indicator, they are classified as careless. There is an ongoing discussion on how thresholds should be set, as these can heavily impact conclusions on the occurrence of C/IER (e.g., Curran, 2016; Niessen et al., 2016). A further problematic aspect of these response-pattern-based indicators is that each index is tailored to the detection of a different type of C/IER behavior but is insensitive to others. The long string index, for instance, is well suited for detecting straight lining but does not detect diagonal lining or random responding. Conversely, the even–odd index is insensitive to straight lining since straight lining results in consistent response patterns (Curran, 2016). The Mahalanobis distance, in turn, is thrown off when C/IE responses are themselves approximately normally distributed (arising when respondents randomly choose categories around the midpoint; Curran, 2016). Thus, the Mahalanobis distance performs well for detecting uniformly distributed random responses while failing to detect normally distributed random responses (Meade & Craig, 2012). To accommodate this issue, Curran (2016) suggested combining multiple indicators that are sensitive to different aspects of C/IER in a multiple-hurdle approach, where respondents with extreme values on any of the considered indicators are filtered out in a stepwise procedure. However, Ulitzsch, Pohl, et al. (2021) illustrated that multiple-hurdle approaches, too, are heavily impacted by threshold choices for the employed indicators.
To avoid making assumptions concerning the specific types of C/IER or attentive response patterns, Schroeders et al. (2020) suggested employing supervised machine learning techniques, with the algorithm being trained on a data set for which it is known which respondents displayed attentive and C/IER behavior, for example, data stemming from experiments manipulating instructions on how to approach the questionnaire. This approach, however, requires access to an adequate training data set and rests on the assumptions that (a) respondents in the experimental prestudy complied with instructions, for example, provided attentive responses when instructed to do so, and that (b) both attentive and C/IE responses in the data set of interest follow a structure comparable to the respective structures in the training data, that is, that respondents instructed to show C/IER behavior behave in a manner comparable to those displaying C/IER in out-of-lab conditions.
The abovementioned approaches are two-step procedures, where, in the first step, careless respondents are filtered out and, in the second step, analysis methods of choice, for example, polytomous item response theory (IRT) models, are applied to the cleaned data set. In step two, researchers could in principle also employ models considering attentive RS, thus jointly accounting for C/IER and attentive RS. To the best of our knowledge, such procedures have not yet been evaluated. A potential limitation is that misclassifications of the method chosen in step one may impact subsequent RS analyses; it is not yet known how severely this affects the overall adjustment procedure. As delineated above, misclassifications are likely to occur under both indicator-based and machine learning approaches as potential tools for step one.
Attentive RS
Approaches to attentive RS have in common that they aim at disentangling parts of the response process going back to differences in trait levels from those due to differences in category usage. Over the last decades, a myriad of model-based approaches to RS have been developed. A dominant stream of research conceptualizes RS as person-specific shifts in item parameters of polytomous IRT models such as the (generalized) partial credit model (PCM; Masters, 1982; Samejima, 2016) or the rating scale model (RSM; Andrich, 1978). Henninger and Meiser (2020) provided an overview and a generalized framework subsuming different approaches. The authors delineated that current procedures for modeling RS differ in the restrictions they impose on the structure of RS and/or its relationships to the substantive traits. While models that leave the structure of RS unconstrained (e.g., Bolt & Johnson, 2009; Rost, 1991; Wang & Wu, 2011) allow for an exploratory investigation of RS, they commonly assume RS to be independent of the traits to be measured in the sense that person-specific shifts are uncorrelated with the content traits. Models that impose structures on person-specific shifts constrain these to follow patterns that resemble theory-derived RS, such as MRS or ERS (e.g., Bolt & Johnson, 2009; Falk & Cai, 2016; Tutz et al., 2018; Wetzel & Carstensen, 2015). These models commonly allow for respondents with different trait levels to differ in their stylistic tendencies. In presenting this general framework, Henninger and Meiser (2020) explicated the specific types of RS and research questions that can and cannot be investigated using each of the subsumed models and provided guidelines for their application.
Another dominant stream of research, on IRTree models for RS, aims at decomposing the response process into subsequent, a priori defined subprocesses (see Böckenholt & Meiser, 2017; Jeon & De Boeck, 2016, for overviews). For instance, a response in an outer agreement category may be decomposed into an agreement process, followed by the decision to opt for an outer response option. IRTree models are based on the construction of binary pseudo items that contain information on the outcomes of each of the assumed subprocesses and constitute their measurement models. Various extensions of IRTree models exist that allow for a finer-grained investigation of response subprocesses, entailing subprocesses with ordinal and multidimensional decision nodes (Meiser et al., 2019) or mixture extensions that allow respondents to structurally differ in the subprocesses involved in choosing response options (Khorramdel et al., 2019; Kim & Bolt, 2020).
Models for attentive RS provide sophisticated tools for depicting important aspects of response processes. Nevertheless, they commonly assume that all responses reflect the trait to be measured to some degree. This assumption, however, is violated when some respondents display C/IER.
Using Response Times to Investigate Response Biases
Careless and Insufficient Effort Responding
Since inattentive respondents do not invest effort in evaluating the item, retrieving relevant information, and selecting a relevant response, C/IER can be assumed to generally require less time than attentive responding. First traces of multiple processes underlying RTs in computer-administered questionnaires are already revealed in descriptive analyses. Figure 1 displays log RT distributions by response category for a single item from the PISA 2015 background questionnaire completed by students from the German sample. RTs associated with choosing any of the four response categories show a bimodal shape, and we observed similar patterns across various items. In the context of cognitive assessment, the two modes of such bimodal shapes are commonly assumed to go back to noneffortful rapid guessing behavior and effortful solution behavior, respectively (e.g., Wise, 2017). Comparable conclusions can be drawn in the context of questionnaire data (Kroehne et al., 2019; Ulitzsch et al., 2023).

Distribution of log response times by response category for Item ST094Q04 of the Programme for International Student Assessment 2015 background questionnaire. Response times were retrieved from raw log events using the finite state machine framework by Kroehne and Goldhammer (2018).
Various approaches exist that draw on RT information for separating attentive from C/IER behavior. Threshold-based methods aim at identifying thresholds that separate the RT distributions associated with either behavior, for instance, by making an educated guess on the minimum amount of time required for an attentive response (Huang et al., 2012; Meade & Craig, 2012) or by visual inspection of the RT distribution (Kroehne et al., 2019). Note that RT-based indicators come with the advantage of not entailing presumptions about specific C/IE response patterns.
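To make the threshold-based rationale concrete, the following R sketch flags respondents whose average time per item falls below an educated-guess minimum and illustrates the visual-inspection route; the 2-second rule and the toy data are assumptions for the example.

```r
# Toy RT data: 180 attentive-like and 20 fast C/IER-like respondents, 16 items
set.seed(2)
rt <- rbind(matrix(exp(rnorm(180 * 16, mean = 1.5, sd = 0.4)), nrow = 180),
            matrix(exp(rnorm(20 * 16, mean = 0.2, sd = 0.3)), nrow = 20))

# Educated-guess threshold: flag mean times below 2 s per item
flagged <- rowMeans(rt) < 2
table(flagged)

# Visual inspection: place a cutoff in the valley of the bimodal log RT distribution
hist(log(rt), breaks = 60, main = "Log response times", xlab = "log RT")
abline(v = 0.9, lty = 2)   # cutoff chosen by eye for this toy example
```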
The potential of RTs for facilitating the detection of C/IER has also been illustrated by Schroeders et al. (2020), who found that the prediction accuracy of their supervised machine learning approach to C/IER could be further improved by jointly considering responses and RTs for classification.
In a similar vein, but deliberately incorporating theoretical considerations on response behavior, Ulitzsch, Pohl, et al. (2021) presented a model-based approach that jointly considers response and RT information for identifying C/IER. The approach assumes different data-generating processes to underlie responses and RTs associated with attentive and inattentive response behavior. For attentive responses, the model assumes customary IRT models for polytomous data to hold, such as the generalized PCM. For inattentive responses, the model assumes category probabilities to be unrelated to item characteristics and persons’ content trait levels and estimates the marginal probabilities of inattentively choosing a given category, marginalizing over all types of C/IER patterns. Note that in doing so, the approach avoids assumptions on specific C/IER patterns. Attentive RTs are modeled in line with common RT models for noncognitive data (Ferrando & Lorenzo-Seva, 2007; Molenaar et al., 2015), considering possibly complex relations between respondents’ trait levels and RTs. More specifically, the approach incorporates the distance-difficulty hypothesis, stating that persons who either strongly agree or strongly disagree with a statement can quickly decide on a suitable response option, while persons for whom it is difficult to decide whether or not they agree with a statement need more time for their decision (Kuncel & Fiske, 1974). Inattentive RTs, in turn, are assumed to be unaffected by person and item characteristics and to generally be shorter than attentive RTs.
Attentive Response Styles
In contrast to research on C/IER, where RTs have oftentimes been employed for identifying response bias, in the context of attentive RS, RTs have predominantly been used for more closely investigating preidentified attentive RS. Henninger and Plieninger (2020), for instance, examined the relationships of acquiescent responding, MRS, and ERS with RTs to draw inferences on cognitive processes in rating scale usage. The authors reported both respondent-level and item-by-respondent-level effects of ERS on RTs. They found respondents with higher ERS to require more time to generate responses, particularly when providing nonextreme responses. The authors interpreted this finding as evidence against the common notion that ERS stems from low cognitive effort of the respondent (referring to, e.g., Aichholzer, 2013; Krosnick, 1999). Instead, Henninger and Plieninger (2020) concluded that respondents with moderate to high ERS seem to give nonextreme responses more deliberately.
Proposed Approach
In questionnaire data, it is likely that different types of response bias occur. Although models for both RS and C/IER exist, up to now, there is no model that incorporates both types of response bias. To jointly account for and investigate C/IER and RS, we propose to leverage the rich information contained in RT data and suggest an RT-based mixture modeling approach. The approach is based on the mixture modeling approach to C/IER by Ulitzsch, Pohl, et al. (2021) and extends it to incorporate RS. To keep the model simple, we identify C/IER on the person level, that is, we assume respondents to have a constant probability of showing C/IER across the questionnaire, instead of allowing attentiveness to vary on the screen-by-respondent level as in Ulitzsch, Pohl, et al. (2021). Class membership of person $j$ is thus captured by a latent binary indicator $C_j$, with $C_j = 1$ denoting attentive and $C_j = 0$ denoting careless responding, and with $P(C_j = 1) = \pi$ giving the population-level proportion of attentive respondents.
Attentive Behavior
Item responses
Attentive item responses are assumed to be governed by both respondents’ content trait levels and their RS. To this end, researchers may employ an RS-accommodating model of their choosing. We here present the approach drawing on the framework for integrating RS into IRT models for polytomous data by Henninger and Meiser (2020), which conceptualizes RS as person-specific shifts in threshold parameters. Other alternatives are outlined in the discussion.
We integrate RS into a multidimensional generalized PCM. For the sake of simplicity, we assume a simple structure for the content traits; that is, each item is assumed to load on one content trait only. We denote respondent $j$'s level on the content trait $d(i)$ measured by item $i$ by $\theta_{jd(i)}$ and respondent $j$'s response to item $i$ with categories $k = 0, \ldots, K$ by $X_{ij}$. In the general model (Equation 1), RS enter as person-specific shifts of the item threshold parameters $\beta_{il}$, weighted by fixed scores that determine which thresholds a given RS dimension affects; $\alpha_i$ denotes the discrimination parameter of item $i$.
In the empirical example and the study of parameter recovery, we will exemplarily focus on modeling ERS—one of the most studied theory-derived RS (see Buckley, 2009; Clarke, 2000; Dibek & Cikrikci, 2021; He & Van de Vijver, 2016; Johnson, 2005; Ju & Falk, 2019; Lu & Bolt, 2015, for applications). As noted in Henninger and Meiser (2020) and Henninger (2021), ERS can be accommodated by modeling perfectly negatively correlated shifts in the outer threshold parameters (see also Falk & Cai, 2016; Wetzel & Carstensen, 2015). Hence, for modeling ERS, only a single person-specific RS parameter $\gamma_j$ is needed, and the model given in Equation 1 can be simplified accordingly (Equation 2).
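A plausible rendering of this simplified model (referred to as Equation 2 above); the exact sign convention of the shifts is an assumption:

$$P\left(X_{ij}=k \mid \theta_{jd(i)}, \gamma_j\right) = \frac{\exp\left[\sum_{l=1}^{k} \alpha_i\left(\theta_{jd(i)} - \beta_{il} + q_l \gamma_j\right)\right]}{\sum_{m=0}^{K} \exp\left[\sum_{l=1}^{m} \alpha_i\left(\theta_{jd(i)} - \beta_{il} + q_l \gamma_j\right)\right]}, \qquad \boldsymbol{q} = (-1, 0, \ldots, 0, 1),$$

with the empty sum for $k = 0$ defined as zero. Under this convention, a positive $\gamma_j$ raises the first and lowers the last threshold, rendering both extreme categories more attractive; the shifts of the two outer thresholds are perfectly negatively correlated, as required.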
In the case of $\gamma_j = 0$, the person-specific thresholds coincide with the common item thresholds and the model reduces to a generalized PCM, whereas nonzero values of $\gamma_j$ shift the two outer thresholds in opposite directions, rendering the extreme categories more or less attractive (see the schematic illustration below).

Schematic illustration of person-specific shifts of item thresholds for different locations on the extreme response style trait γ. For the example with five categories, γ is assumed to affect the outer thresholds only. Person-specific thresholds are marked with dashed lines.
Response times
Item-level RTs are denoted by $T_{ij}$. For attentive respondents, RTs are assumed to follow a lognormal distribution whose mean structure reflects the item's time intensity, the respondent's speed, and the distance-difficulty hypothesis (Equation 3).
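A plausible lognormal sketch of Equation 3, with signs chosen such that larger distances between trait level and middle threshold shorten expected RTs; the exact parameterization is an assumption:

$$\log T_{ij} \mid C_j = 1 \;\sim\; N\left(\lambda_i - \tau_j - \delta \left|\theta_{jd(i)} - \beta_{i,\mathrm{med}}\right|,\; \sigma_{\varepsilon}^{2}\right),$$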
with $\lambda_i$ denoting the time intensity of item $i$, $\tau_j$ respondent $j$'s speed, $\delta$ the distance-difficulty parameter relating the distance between the respondent's trait level and the item's middle threshold $\beta_{i,\mathrm{med}}$ to RTs, and $\sigma_\varepsilon$ the residual standard deviation of log attentive RTs.
Careless and Insufficient Effort Behavior
Item responses
When being inattentive and showing C/IER, respondents are assumed to choose response options that do not reflect their trait level, for example, by answering uniformly at random or marking straight or diagonal lines. Based on these considerations, for inattentive responses, marginal category probabilities $p^{\mathrm{C}}_k$, $k = 0, \ldots, K$, are estimated that are common across items and unrelated to item characteristics and respondents' trait levels, with $\sum_{k=0}^{K} p^{\mathrm{C}}_k = 1$.
In a simulation study, Ulitzsch, Pohl, et al. (2021) showed that modeling C/IE responses in terms of marginal category probabilities is well capable of capturing different types of C/IER behavior, ranging from random responding around the mid- or endpoints to structured patterns such as straight and diagonal lining.
Response times
Inattentive RTs are assumed to stem from respondents quickly proceeding through the questionnaire and choosing responses without evaluating the item content. As such, inattentive RTs are assumed (a) to be unaffected by person and item characteristics and (b) to be, on average, shorter than attentive RTs. These assumptions are incorporated into the model as follows: The lognormal distribution of inattentive RTs is governed by a common mean $\mu_{\mathrm{C}}$ and a common standard deviation $\sigma_{\mathrm{C}}$ that do not vary across items and respondents.
Imposing the constraint $\mu_{\mathrm{C}} < \lambda_i$ for all items $i$ on the time intensities for attentive RTs ensures that inattentive RTs are, on average, shorter than attentive RTs on each item.
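In sum, the C/IER component models can be sketched as follows (notation as above; a plausible reconstruction rather than the original display equations):

$$P\left(X_{ij} = k \mid C_j = 0\right) = p^{\mathrm{C}}_k, \qquad \log T_{ij} \mid C_j = 0 \;\sim\; N\left(\mu_{\mathrm{C}}, \sigma_{\mathrm{C}}^{2}\right), \qquad \mu_{\mathrm{C}} < \min_i \lambda_i.$$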
Joint Distribution of Person Parameters
For simplicity, the probability of showing attentive response behavior is assumed to be unrelated to trait and speed levels (as in the models for rapid guessing in cognitive assessments by Schnipke & Scrams, 1997; Wang & Xu, 2015). Person parameters of the attentive component models, $\boldsymbol{\xi}_j = (\theta_{j1}, \ldots, \theta_{jD}, \gamma_j, \tau_j)^\top$, are assumed to be multivariate normally distributed with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.
For identifying the model, we set person parameter means to zero and content trait variances to one, while leaving discrimination and threshold parameters unconstrained. Item parameters are modeled as fixed effects. This yields the likelihood sketched below.
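Conditional on the person parameters $\boldsymbol{\xi}_j$ and with the latent class indicator marginalized out, a plausible generic form of this likelihood is

$$L = \prod_{j=1}^{N} \left[\pi \prod_{i=1}^{I} P_{\mathrm{A}}\left(x_{ij} \mid \boldsymbol{\xi}_j\right) f_{\mathrm{A}}\left(t_{ij} \mid \boldsymbol{\xi}_j\right) + (1 - \pi) \prod_{i=1}^{I} p^{\mathrm{C}}_{x_{ij}}\, f_{\mathrm{C}}\left(t_{ij}\right)\right],$$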
with $P_{\mathrm{A}}$ and $f_{\mathrm{A}}$ denoting the attentive response and RT component models introduced above, $p^{\mathrm{C}}_{x_{ij}}$ the marginal C/IER probability of the observed response, and $f_{\mathrm{C}}$ the lognormal density of inattentive RTs.
Prior Distributions
Bayesian estimation techniques facilitate estimation of the proposed approach. Priors are set in accordance with Ulitzsch, Pohl, et al. (2021). For the person parameter covariance matrix, a decomposition strategy is employed (Barnard et al., 2000), with separate prior distributions for correlations and standard deviations. For the correlation matrix, we employ a Lewandowski–Kurowicka–Joe (LKJ) prior (Lewandowski et al., 2009) with shape 1. For model identification, content trait standard deviations are set to unity. For the standard deviations of speed and the ERS trait, weakly informative priors are employed.
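For illustration, a minimal Stan sketch of this decomposition strategy follows. The dimension (two content traits, ERS, and speed), the parameter names, and the half-normal priors on the free standard deviations are assumptions for the example, not the exact specification from the OSF repository.

```stan
data {
  int<lower=1> N;                      // number of respondents
}
parameters {
  cholesky_factor_corr[4] L_Omega;     // correlations of (theta_1, theta_2, gamma, tau)
  vector<lower=0>[2] sigma_free;       // SDs of gamma and tau; trait SDs fixed to 1
  matrix[4, N] xi_raw;                 // standardized person parameters
}
transformed parameters {
  // Content trait SDs set to unity for identification
  vector[4] sigma = append_row(rep_vector(1.0, 2), sigma_free);
  matrix[4, N] xi = diag_pre_multiply(sigma, L_Omega) * xi_raw;
}
model {
  L_Omega ~ lkj_corr_cholesky(1);      // LKJ prior with shape 1 (as in the article)
  sigma_free ~ normal(0, 2);           // assumed weakly informative half-normal
  to_vector(xi_raw) ~ normal(0, 1);    // implies xi ~ MVN(0, Sigma)
}
```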
Empirical Example
The purpose of the empirical example is fourfold. First, we investigate whether C/IER and attentive ERS can be distinguished in empirical data. Second, the empirical example serves to illustrate the insights into response behavior that can be gained on the basis of the proposed model. Third, we assess the impact of neglecting C/IER, ERS, or both. Fourth, we contrast the proposed approach against a two-step approach that first filters out C/IE respondents using threshold-based procedures and then applies an ERS model to the cleaned data set.
Data
We analyzed responses and raw log data from the German subsample of the PISA 2015 background questionnaire, focusing on two scales with four items each, among them the environmental awareness scale. Item-level RTs were reconstructed from the raw log events using the finite state machine (FSM) framework by Kroehne and Goldhammer (2018).
Analyses
We analyzed the data using the proposed model considering both C/IER and ERS in attentive responding. Further, we specified three special cases of the proposed model, neglecting either C/IER, ERS, or both. In all models, missing RTs due to the FSM reconstruction were ignored, while the associated responses were considered. Note that doing so entails assuming these RTs to be missing at random (MAR), corresponding to the assumption that respondents’ decisions on which item to answer first are unrelated to their trait levels, speed, and location on the ERS trait.
In the model considering C/IER but neglecting ERS, attentive item responses were modeled using a generalized PCM; that is, the same thresholds were assumed for all respondents. In the model considering ERS but neglecting C/IER, all responses were assumed to stem from attentive response processes; that is, the mixture component was dropped and all responses and RTs were modeled according to Equations 2 and 3. Finally, in the model considering neither C/IER nor ERS, all item responses were modeled using a generalized PCM and RTs were modeled according to Equation 3. The four models were compared by means of the widely applicable information criterion (WAIC; Vehtari et al., 2017; Watanabe, 2013). We investigated both structural parameters and differences in person parameter estimates between the different models.
For implementing a two-step approach to jointly considering C/IER and RS, we first filtered the data for C/IER using a sequential multiple-hurdle procedure (Curran, 2016) that integrates information from multiple C/IER indicators, each being sensitive to a different aspect. In the present analyses, we employed the average time per item, the long string index, and the Mahalanobis distance, sequentially filtering out respondents with the most extreme values on these indicators (see the sketch below). We then analyzed the filtered data set using a model accommodating ERS, with responses and RTs being modeled according to Equations 2 and 3. Following Ulitzsch, Pohl, et al. (2021), in order to evaluate the range of possible results and the impact of threshold settings, we implemented two sets of thresholds, choosing either a liberal or a conservative cutoff for all three indicators employed. Details on the threshold settings are given in Table 1. For further details on implementation, we refer to Ulitzsch, Pohl, et al. (2021).
Threshold Sets Employed for Identifying Careless and Insufficient Effort Respondents in the Two-Step Approach
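For illustration, a minimal R sketch of such a sequential multiple-hurdle procedure follows, using the careless package that is also employed in the analyses below. The toy data and all cutoff values are placeholders, not the threshold sets from Table 1.

```r
library(careless)

# Toy data: 500 respondents, 8 four-category items, RTs in seconds
set.seed(3)
resp <- matrix(sample(1:4, 500 * 8, replace = TRUE), nrow = 500)
rt   <- matrix(exp(rnorm(500 * 8, mean = 1, sd = 0.5)), nrow = 500)

# Hurdle 1: average time per item (placeholder cutoff: 1 s)
keep1 <- rowMeans(rt) >= 1
resp1 <- resp[keep1, ]

# Hurdle 2: long string index (placeholder cutoff: half the items)
resp2 <- resp1[longstring(resp1) <= ncol(resp1) / 2, ]

# Hurdle 3: Mahalanobis distance (placeholder chi-square cutoff)
md <- mahad(resp2, plot = FALSE, flag = FALSE)
resp_clean <- resp2[md <= qchisq(.975, df = ncol(resp2)), ]
```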
All analyses were performed using R Version 3.6.3 (R Development Core Team, 2017). Bayesian estimation was conducted using Stan Version 2.19 (Carpenter et al., 2017), employing the rstan package Version 2.19.3 (Guo et al., 2018). For all models, we ran two MCMC chains with 3,000 iterations each, with the first half being employed as warm-up. Stan code for the most general model accommodating both C/IER and ERS is provided in the OSF repository accompanying this study. The sampling procedure was assessed on the basis of trace plots and potential scale reduction factor (PSRF) values, with PSRF values below 1.10 for all parameters considered satisfactory (Gelman & Rubin, 1992; Gelman & Shirley, 2011). WAIC values were computed using the package loo (Vehtari et al., 2020). The long string index and the Mahalanobis distance were calculated using the package careless (Yentes & Wilhelm, 2021).
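A minimal rstan sketch of this estimation setup follows; the Stan file name, the contents of the data list, and the parameter names ("pi", "log_lik") are placeholders for the model provided in the OSF repository.

```r
library(rstan)
library(loo)

# Placeholder data list; resp and rt as in the sketch above
stan_data <- list(N = nrow(resp), I = ncol(resp), K = 4,
                  X = resp, logT = log(rt))

# Two chains with 3,000 iterations each, first half as warm-up
fit <- stan(file = "cier_ers_model.stan",   # hypothetical file name
            data = stan_data,
            chains = 2, iter = 3000, warmup = 1500)

# Convergence checks: trace plots and PSRF (Rhat) below 1.10
print(max(summary(fit)$summary[, "Rhat"], na.rm = TRUE))
traceplot(fit, pars = "pi")                 # assumed name of the mixing weight

# WAIC from a pointwise log-likelihood stored as generated quantity "log_lik"
waic(extract_log_lik(fit, parameter_name = "log_lik"))
```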
Results
In all specified models, we observed good mixing of the MCMC chains and no PSRF values above 1.10. WAIC values as well as person parameter variances and correlations of all specified models are displayed in Table 2. Compared to the models neglecting C/IER, ERS, or both, the proposed model yielded the lowest WAIC, indicating that both response biases were present in the data.
Person Parameter Variances and Correlations of Different Models of Response Behavior
Investigating response behavior
In the proposed model, the population-level proportion of careless respondents (i.e., $1 - \pi$) was estimated at approximately .04.
Investigating the consequences of neglecting response bias
By and large, all models yielded comparable estimates of the correlations between person variables. The models, however, led to somewhat different conclusions concerning response bias. While the model considering only C/IER but neglecting ERS did not lead to considerably different conclusions on the population-level proportion of careless respondents (.04 [.04, .05]), it yielded slightly higher marginal C/IER category probabilities for the outer response options.
While we encountered only small differences in estimates of structural parameters, we observed pronounced differences on the individual (and, as such, possibly subgroup) level. Figure 3 gives differences in environmental awareness estimates between the proposed model and the models neglecting C/IER, ERS, or both.

Differences in environmental awareness estimates between the proposed model considering both careless and insufficient effort responding and extreme response styles and the models neglecting either or both types of response bias.
Likewise, conclusions concerning individual response behavior were impacted if only some aspects of response bias were modeled. Figure 4 depicts differences in attentiveness probability estimates retrieved from the model considering C/IER only and from the proposed model considering both types of response bias; respondents with more extreme locations on the ERS trait were estimated to have higher carelessness probabilities when ERS was left unmodeled.

Differences in attentiveness probability estimates retrieved from the model considering careless and insufficient effort responding only and from the proposed model additionally accommodating extreme response styles.
Figure 5 depicts differences in ERS parameter estimates retrieved from the model considering ERS only and from the proposed model additionally accommodating C/IER.

Differences in extreme response style tendencies retrieved from the model considering extreme response styles only and from the proposed model additionally accommodating careless and insufficient effort responding.
In sum, Figures 3 through 5 illustrate the size and nature of the differences in person parameter estimates that can be encountered in empirical settings when different types of response bias are accounted for. These differences in person parameter estimates are considerable and indicate that, in the case that a subgroup variable relates to response bias, modeling versus neglecting response bias may also impact subgroup-level results.
Comparisons with a two-step approach to accounting for multiple response biases
The conservative and liberal threshold settings filtered out 9.91% and 22.83% of respondents, respectively, both by far exceeding the 4% implied by the proposed model-based approach. Table 3 displays person parameter variances and correlations of the ERS model applied to the filtered data sets. Note that these models cannot be compared with the previous ones in terms of the WAIC, as the filtered data sets comprise different respondents. Results for the implementations of the two-step approach differed considerably from those displayed in Table 2. ERS variances, for instance, were much lower in the applications of the two-step approach, presumably because more respondents with identical responses to all (conservative threshold set) or the majority of the items (liberal threshold set) were classified as careless, while some of these respondents were deemed more plausible to have displayed ERS under the model-based approach. More importantly, results for the two implementations of the two-step approach vastly differed from each other, further illustrating that results of filtering-based methods are heavily dependent on threshold settings, with vast differences being observable even for small differences in the employed thresholds.
Person Parameter Variances and Correlations of the Model Accommodating Extreme Response Styles After Filtering for Careless Respondents
Parameter Recovery
To investigate the trustworthiness of results obtained in the empirical application, we conducted a parameter recovery study, with data-generating parameters mimicking those obtained from the application of the proposed model to PISA data.
Data Generation
We generated 100 data sets for $N = 500$ respondents answering two scales with four items each, with an overall carelessness rate of .05; the remaining data-generating parameters mimicked those obtained in the empirical application.
Following Curran and Denison (2019) and Ulitzsch, Pohl, et al. (2021), we considered a scenario with different C/IER patterns, thereby illustrating that the model can handle the joint occurrence of different C/IER patterns. Inattentive respondents were randomly partitioned into three equally sized groups, representing uniform random responding, straight lining, and diagonal lining. For each group, patterns were generated to result in equal marginal probabilities for all response categories, such that marginal probabilities for C/IE responses across all patterns were given by .25 for each of the four categories.
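A sketch of this pattern-mixing scheme in R; the group sizes and item numbers are toy values, and the generation code in the OSF repository may differ in detail.

```r
set.seed(4)
n_cier <- 24; n_item <- 8; n_cat <- 4   # toy numbers; four response categories
grp <- rep(1:3, each = n_cier / 3)      # random, straight lining, diagonal lining

cier <- matrix(NA_integer_, n_cier, n_item)

# Uniform random responding
cier[grp == 1, ] <- sample(1:n_cat, sum(grp == 1) * n_item, replace = TRUE)

# Straight lining: one uniformly drawn category repeated across all items
draws <- sample(1:n_cat, sum(grp == 2), replace = TRUE)
cier[grp == 2, ] <- matrix(rep(draws, times = n_item), ncol = n_item)

# Diagonal lining: cycling through categories from a uniformly drawn start
for (r in which(grp == 3)) {
  start <- sample(1:n_cat, 1)
  cier[r, ] <- (start - 1 + seq_len(n_item) - 1) %% n_cat + 1
}

# Marginal category probabilities are (approximately) .25 for each category
prop.table(table(cier))
```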
Estimation
For estimation, we employed the same setup as in the empirical application. To avoid nonconvergence due to an insufficient number of iterations, we used 25,000 iterations for each of the two MCMC chains, with the first half being employed as warm-up.
Results
By and large, we observed good mixing of the MCMC chains. PSRF values above 1.10 were encountered in five of the 100 replications. For investigating parameter recovery, we considered only replications with all PSRF values below 1.10.
With median correlations between true and estimated parameters of .99 and .98, respectively, item thresholds and time intensity offsets were well recovered. Table 4 displays median EAPs and interquartile ranges of the population-level proportion of careless respondents, person parameter standard deviations and correlations, the distance-difficulty parameter, the common mean and standard deviation of log C/IE RTs, the residual standard deviation of log attentive RTs, and—as examples—the marginal C/IER category probability for the first response option as well as the item discrimination for the first item, alongside the data-generating values. These parameters were estimated without bias (i.e., median EAPs were very close to the true parameters) and with low variability (i.e., precisely, as indicated by narrow interquartile ranges), even with the relatively small considered sample size of $N = 500$.
Simulation Results for Selected Parameters
Discussion
We presented a flexible RT-based mixture modeling approach that supports jointly considering, distinguishing, and studying C/IER and attentive RS in noncognitive assessment data. C/IER and attentive RS are two commonly encountered response biases in questionnaire data. While the former results in responses that are completely uninformative of the traits of interest, the latter results in responses that, although containing information on respondents’ levels on the content traits, are confounded with content-irrelevant variability due to differences in category usage. Distinguishing and jointly considering these two types of behavior assists in drawing more valid conclusions from questionnaire data as well as in gaining a more nuanced understanding of response behavior. The approach has been illustrated on large-scale assessment background questionnaire data but is applicable to any type of computerized questionnaire for which RT data are available (e.g., online surveys).
For separating attentive from inattentive responding, the approach utilizes item-level RT data. In the presented model, RTs serve a twofold purpose. First, considering this rich source of information on response behavior supports separating different types of behavior that often result in rather similar response vectors. Second, in more general terms, considering RT information in models for noncognitive assessment data may enrich the understanding of how respondents interact with such assessments, for example, by investigating whether respondents with different trait levels differ in how fast they generate attentive responses, whether attentive RS are related to pacing behavior, or whether there is evidence for the distance-difficulty hypothesis in empirical data.
We applied the proposed model to empirical data from the PISA 2015 background questionnaire, where we found evidence for the joint occurrence of both ERS in attentive responding and C/IER. The empirical example highlights the potential of the proposed model for understanding the processes underlying responses to questionnaire data. To investigate different response biases and gain an understanding of their occurrence, the full model considering both attentive RS and C/IER is necessary.
In the present application, we found that neglecting either type of response bias may impact conclusions concerning respondents’ content trait levels. Further, when either ERS or C/IER was left unconsidered, the modeled response bias in part “absorbed” the unconsidered one: Respondents with more extreme locations on the ERS trait were estimated to have higher carelessness probabilities when C/IER, but not ERS, was modeled, and vice versa. From a conceptual point of view, such effects seem plausible, as different response biases may result in very similar response vectors. Although we observed that the different models yielded different conclusions, it should also be noted that based on the empirical example alone, it cannot yet be concluded that neglecting either type of response bias generally yields biased person or subgroup parameter estimates.
We further contrasted the proposed fully model-based approach against two-step approaches to jointly considering C/IER and attentive RS where, in step one, C/IE respondents are filtered out by means of threshold-based procedures and, in step two, an IRT model accommodating attentive RS is applied to the cleaned data set. We see two major advantages of the proposed mixture modeling approach over two-step approaches. First, the proposed mixture modeling approach does not rely on threshold settings. There are no globally applicable values for these thresholds, as the distributions of indicators for careless and attentive respondents are scale-specific (Curran, 2016), such that threshold settings are always somewhat arbitrary. Second, the proposed approach differentiates between different types of bias in a single step and thereby avoids the sequential decision procedure of two-step approaches. This may, for instance, be of relevance when responses and RTs of some C/IE respondents and respondents with certain types of attentive RS are very similar: While two-step procedures require a clear-cut decision for such cases, the proposed approach takes the uncertainty of classification into account. These advantages were illustrated in the empirical example, where we found large differences in structural parameter estimates for small differences in threshold settings. In fact, differences between different implementations of the two-step approach were much more pronounced than differences between fully model-based approaches accommodating different types of response bias. The price for these advantages, however, is increased model complexity, which may result in long running times with increasing questionnaire length and sample size. When these become impractical, heuristic indicator-based approaches may still be the better option for gauging the extent of C/IER in the data at hand.
The approach offers researchers a high degree of flexibility in that different component models can be plugged in for attentive and inattentive responses and RTs, thereby allowing researchers to incorporate specific hypotheses on response behavior as well as to distinguish and study different types of response bias. Researchers may also determine the type of RS component model to employ by means of comparisons between models with competing component specifications. For instance, we applied the approach using a model accommodating ERS, derived from the framework presented by Henninger and Meiser (2020). For scales that include midpoint response options, the model can be extended to jointly accommodate ERS and MRS in attentive responding (as in Wetzel & Carstensen, 2015). If researchers have deviating hypotheses concerning the nature of attentive RS that may be present in the data, other component models can be chosen. For deciding on a component model, readers are referred to the unifying framework, overview, and guidelines provided by Henninger and Meiser (2020). It should, nevertheless, be noted that some component models may result in a model that is more challenging to estimate. Mixture models for RS (e.g., Rost, 1991), for instance, would result in a mixture of mixtures. Note that the approach is not limited to PCM or RSM extensions for modeling attentive RS but may also be extended to IRTree approaches using binary pseudo items. This may be achieved by employing the mixture IRT approach for item responses with different structures by Tijmstra et al. (2018), assuming an IRTree structure based on binary pseudo items for attentive responses and estimating C/IER category probabilities based on polytomous item responses.
Concerning the modeling of attentive RTs, we point out that different approaches exist for incorporating the distance-difficulty relationship between traits and RTs (see Ranger, 2013, for an overview and comparison). Further, the relationship between RTs and the distance between the respondent’s trait level and the middle threshold parameter need not be linear but may take other functional forms (Molenaar et al., 2021). Another alternative may be to use the distance from the item location rather than from the middle threshold parameter. If the distance-difficulty parameter is of substantive interest to researchers, different specifications may be compared by means of model comparisons. Further, the approach allows for incorporating other component models for attentive RTs that support greater flexibility in the assumed RT distribution, for example, models based on the Box–Cox normal distribution (Entink et al., 2009) or on categorized RTs (Molenaar et al., 2018).
We found the model to yield good parameter recovery with a sample size of 500, two scales with four items each, and an overall carelessness rate of .05. From these results and results on previous, similar models, we expect convergence and parameter recovery to be good in applications with comparable or larger sample sizes and carelessness rates. Note that we cannot evaluate all possible parameter constellations and model specifications. Thus, we point out that the statistical performance of the proposed approach may not generalize to all possible combinations of component models. Especially when choosing more complex component models, we advise investigating parameter recovery of the chosen combination of component models to corroborate the plausibility of results. The code for our simulation study, which may be adapted to other model specifications, can be found in the OSF repository accompanying this article.
Limitations and Future Research
We found the proposed approach to show good parameter recovery under realistic research conditions. Nevertheless, establishing the boundary conditions under which the presented approach can reliably separate C/IER from attentive RS remains an open and important research question. Challenging conditions that may threaten parameter recovery or the trustworthiness of model comparisons may, for instance, arise when the RT distributions of C/IER and attentive responding are too similar.
In the empirical application, we found only small differences between the conclusions on structural parameters drawn from the proposed approach, considering both ERS and C/IER, and those drawn from approaches neglecting either ERS, C/IER, or both. Further studies are needed to investigate whether and under which conditions neglecting specific response biases may impact conclusions more heavily. This may, for instance, be the case under higher prevalence of C/IER (recall that in the empirical example, the prevalence was only 4%), when respondents predominantly show C/IER behaviors that result in response vectors resembling those encountered under ERS (i.e., predominantly straight lining), or when the ERS trait is more strongly correlated with the content traits. Further, it remains to be investigated under which conditions it is sufficient to account for either C/IER or RS in attentive responding when response biases themselves are not of substantive interest and the objective is merely to adjust for them.
The presented approach assumes the probability that a respondent provides a C/IE response to be constant across all items considered. Note that this assumption is in line with classical indicator-based procedures drawing on response-pattern-based indicators, which also filter at the respondent level. C/IER, however, may vary across the questionnaire, and respondents who display C/IER on some parts of the questionnaire might still provide valid responses to others, especially in lengthy questionnaires (Bowling et al., 2020; Gibson & Bowling, 2019). This issue can be accommodated by modeling C/IER behavior on the item-by-respondent (Ulitzsch et al., 2020, 2022) or screen-by-respondent level (Ulitzsch, Pohl, et al., 2021), taking a latent response approach. While integrating such extensions with the proposed model is straightforward, doing so would result in a highly complex model that is challenging to estimate.
In the context of questionnaires, item-level RT data are becoming increasingly available (see Henninger & Plieninger, 2020, for recent studies recording item-level RTs) or can be reconstructed using the FSM approach (Kroehne, 2019; Kroehne & Goldhammer, 2018). Nevertheless, item-level RTs may not always be at hand. Hence, model adaptations drawing on screen-level RTs (as in Ulitzsch, Pohl, et al., 2021), which can be recorded more easily, or models relying on responses only (as in Ulitzsch et al., 2022) pose a further important topic for future research.
The presented approach showcased the utility of RTs for better understanding how respondents interact with questionnaires. Depending on how questionnaires are administered, researchers may consider additional data such as switches in browser tabs (see Steger et al., 2020, for an application) for identifying respondents not sufficiently engaged with the questionnaire. For instance, the proposed approach may be extended by incorporating the assumption that inattentive respondents may frequently switch to other browser tabs, getting distracted from the questionnaire, while attentive respondents commonly do not display such behavior. Such additional information may also be of great utility for better separating different types of response bias.
Note that the presented approach can entail rather long run times—the model in the empirical application, for instance, required approximately 8 hours to run. A potential remedy may be the development of maximum likelihood implementations, posing an important topic for future research (see Nagy & Ulitzsch, 2021; Nagy et al., 2022, for implementations of neighboring models for rapid guessing).
Finally, we point out that validation studies are urgently needed for ensuring that the substantive interpretations of the model parameters hold true. Validity could be investigated with experimental manipulations, for example, by varying instructions (as in Bowling et al., 2020; Niessen et al., 2016), through investigations of the model’s capability to detect differences between groups of respondents that can be assumed to differ in their levels of C/IER and/or their stylistic tendencies in attentive responding (see Ulitzsch, Penk, et al., 2021, for a validation study using such group comparisons to gain validity evidence for a model-based approach to rapid guessing behavior), or by investigating how attentive RS and adjusted content traits relate to external variables, assuming that relationships adjusted for response bias should more strongly align with subject-matter theory than their unadjusted counterparts (Khorramdel et al., 2017), and that attentive RS and content traits should be linked selectively to extraneous criteria of attentive RS and content traits (Plieninger & Meiser, 2014).
