Abstract
In today’s society, taking surveys seems inevitable (e.g., Chen, 2011; Secolsky & Denison, 2012). For instance, students are asked to evaluate academic matters, employees are asked to complete satisfaction surveys, and consumers of products and customers of services are asked about their satisfaction. Information obtained from surveys can be used to predict learning outcomes, monitor trends, investigate preferences, and much more (Chen, 2011). Nowadays, most surveys are administered online, not only on desktop or laptop computers but also on mobile devices such as tablets and smartphones (Lugtig, Toepoel, & Amin, 2016; Toepoel & Lugtig, 2014). Advantages of online surveys, such as the convenient collection of data from a large and diverse pool of respondents (Aust, Diedenhofen, Ullrich, & Musch, 2013; Johnson, 2005; Reips, 2000, 2002, 2009), are accompanied by the threat of obtaining useless data from nonserious, unmotivated, inattentive, or careless respondents (Maniaci & Rogge, 2014; Meade & Craig, 2012; Reips, 2000, 2002). For example, some respondents might click through a survey out of curiosity instead of providing well-thought-out answers, or rush through the questions because online surveys make it effortless to do so (Johnson, 2005; Reips, 2009). Respondents’ tendency to generate merely satisfactory answers instead of investing the effort needed to provide accurate and optimal responses is called satisficing (Krosnick, 1991; Zhang, 2013). Survey results are only valid and valuable if respondents answer the questions seriously, resulting in good data quality (Chen, 2011; Maniaci & Rogge, 2014).
Data quality refers to the degree to which raw data, provided directly by respondents, accurately reflect respondents’ true levels of the constructs that the survey is supposed to measure (Meade & Craig, 2012). Obtaining survey results from nonserious respondents may result in poor data quality, threatening the validity of research (Oppenheimer, Meyvis, & Davidenko, 2009; Reips, 2009). Inattentive responses can have negative effects on effect sizes and power, depending on the proportion of inattentive respondents in the sample (Maniaci & Rogge, 2014). Despite the great prevalence of research using self-reported data in the literature, little attention has been paid to the degree of (non)serious responding in self-report research. Consequently, it is crucial to investigate whether a self-reported measure of seriousness is an effective approach to detecting nonserious respondents. If so, nonserious respondents could be removed from a data set to improve data quality and to obtain more valid knowledge about various aspects of the world. Therefore, the current article aims to gain insight into the quality of data obtained through online surveys, using data from the nonprobability online access panel of Growth from Knowledge (GfK). The aim of the current study is to empirically examine (1) whether self-reported seriousness is a significant predictor of data quality, (2) whether data quality varies for respondents who use different devices, and (3) whether data quality differs for respondents with different demographic characteristics. In addition, we examine whether seriousness differs across device and demographic groups.
Background
Seriousness Checks
A simple way to address the problem of nonserious respondents is to ask them about the seriousness of their participation, often referred to as seriousness checks (Aust et al., 2013; Reips, 2000, 2009). In the study of Aust and colleagues (2013), respondents self-reported their seriousness after completing a survey about political attitudes by answering the following question: “It would be very helpful if you could tell us at this point whether you have taken part seriously, so that we can use your answers for our scientific analysis, or whether you were just clicking through to take a look at the survey?” (p. 530). They found that nonserious respondents were able to identify themselves as such since the seriousness check predicted data quality, measured by correlations between particular survey items and agreement with official voting results. Likewise, in the second study of Oppenheimer and colleagues (2009), self-reported motivation in a paper-and-pencil survey was higher for respondents who passed than for respondents who failed an instructional manipulation check, which is an indicator of satisficing as it measures whether respondents read the instructions. However, motivation did not differ between failing and passing respondents in their first study.
In previous studies, seriousness checks have been conducted both before (Reips, 2002, 2008, 2009) and after (Aust et al., 2013) participation in the survey. Reips (2002, 2008, 2009) has suggested employing seriousness checks at an early stage of the survey because seriousness measured there has been shown to be the best predictor of dropout rates and thus serves as a measure of motivation. In addition, conducting a seriousness check prior to the completion of the survey can serve as a precaution that reduces dropout rates (Reips, 2002). Aust and colleagues (2013) implemented a seriousness check after completion of the survey, assuming this was more in line with the true nature of participation by reflecting a potential change of mind during participation. By identifying nonserious respondents who can then be removed, such seriousness checks have the potential to improve data quality, which can be measured by different indicators, as described next.
Data Quality
In the case of attitude measurement through surveys, it is impossible to detect directly whether a respondent is answering a question truthfully, because no validation data, such as information in registers, are available. In such cases, different indicators can be used as indirect measures of data quality.
Nonsubstantial values: Item nonresponse and “not applicable” answers
Nonsubstantial values include item nonresponse and selecting an option like “don’t know,” “no opinion,” or “no answer,” if provided (Heerwegh & Loosveldt, 2002). Item nonresponse is perhaps the most widely used indicator regarding data quality in the existing literature (e.g., de Leeuw, Hox, & Huisman, 2003; Toepoel & Lugtig, 2014; Weber, Denk, Oberecker, Strauss, & Stummer, 2008) and is characterized by blanks or gaps in the data set for some respondents for some specific questions (de Leeuw et al., 2003). Also, a large number of “don’t know,” “no opinion,” or “not applicable” answers for a single respondent can be considered as an indicator of satisficing and nonserious answering and accordingly of low data quality (e.g., Kaminska, McCutcheon, & Billiet, 2010; Krosnick, 1991; Lenzner, 2012).
Speeding
Additionally, speeding can be examined as an indicator of data quality. Respondents with strikingly low response times, frequently called speeders, may save time by glancing over instructions, making impulsive judgments, performing shallow memory searches, or simply answering randomly (Aust et al., 2013). The assumption behind this indicator is a nonlinear relationship between response time and data quality: strikingly low response times are assumed to result from satisficing and thus to indicate nonserious respondents. Once a suitable threshold is identified, response times below it can be flagged as potentially nonserious (Meade & Craig, 2012).
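Purely as an illustration of such a threshold-based flag, a minimal Python sketch; the threshold value, variable names, and data are hypothetical:

```python
import pandas as pd

def flag_speeders(times_min: pd.Series, threshold_min: float) -> pd.Series:
    """Flag respondents whose completion time falls below a threshold.

    `times_min` holds per-respondent completion times in minutes. The
    threshold must be justified substantively (e.g., the minimum time
    needed to read every question), not chosen post hoc.
    """
    return times_min < threshold_min

# Hypothetical usage: flag anyone finishing a ~5-minute survey in under 2 minutes.
times = pd.Series([1.4, 4.8, 5.2, 1.9, 6.0])
print(flag_speeders(times, threshold_min=2.0))
```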
Internal data consistency
Another indicator of data quality is within-person internal data consistency, referring to the consistency of a response string for an individual. The underlying assumption of techniques measuring individual consistency is that attentive respondents provide patterns of responses that are internally consistent (Curran, 2016). Thus, it is expected that respondents provide similar answers to items measuring the same theoretical construct, for example, “I change my mood a lot” (p. 90) and “I have frequent mood swings” (p. 90) from the subscale Neuroticism from the Big Five Personality Inventory (Karim, Zamzuri, & Nor, 2009).
Nondifferentiation and response effects
Although attentive respondents are thought to provide internally consistent data, it is assumed that they do not use the same response option for long stretches of items (Curran, 2016). Consequently, responding too consistently to items measuring theoretically distinct constructs indicates nonserious responding and can be used to detect respondents with a different pattern of nonserious responses than those detected by a lack of internal data consistency (Curran, 2016; Meade & Craig, 2012). Responding too consistently is called nondifferentiation, careless respondents’ tendency to provide the same or similar rating to many consecutive items (Zhang & Conrad, 2014). This phenomenon is similar to, but not the same as, straightlining, which refers to choosing the same response option for all items in a grid so that the selected answers form a vertical line (Zhang, 2013). Moreover, nondifferentiation could result from response effects, which include primacy effects, recency effects, and neutral responding. Following Lugtig and Toepoel (2015), we define primacy as selecting the first answer option, which they treat as an indicator of satisficing and increased measurement error. Recency is the tendency to select the last or later response options more often (Knäuper, 1999). Neutral responding refers to the tendency of nonserious participants’ long string responses to center on the midpoint of the response scale (Huang, Liu, & Bowling, 2015). These response effects, of which a higher degree points to lower data quality, comprise the fifth indicator in our study.
The most effective data screening approach to measure data quality is to utilize several data quality indicators simultaneously (Curran, 2016; Meade & Craig, 2012). To this end, the current study takes multiple indicators of data quality into account: nonsubstantial values, speeding, internal data consistency, nondifferentiation, and response effects. Other indicators of data quality such as the length of open-ended answers, the rounding of numerical responses, and the lack of attention to important exclusions included with a question (Medway & Tourangeau, 2015) are not taken into account because they are not available in the data set.
Devices
As mentioned, respondents use different devices to complete online surveys. In general, previous research has shown no device effects on data quality. For example, Sommer, Diedenhofen, and Musch (2017) did not find a difference in data quality between desktop and mobile device (i.e., smartphone and tablet) users, measured by the consistency of responses and validation of responses against internal and external data criteria. Schlosser and Mays (2018) also found no noticeable differences between computer and mobile users in terms of breakoff rate, item nonresponse, and length of responses to open-ended questions. In contrast, in the study of Struminskaya, Weyandt, and Bosnjak (2015), some differences were found between the completion of online surveys on smartphones or tablets and on personal computers. However, these effects were relatively small, and some differences were attributable to the respondent rather than to the device.
Less research has examined whether self-reported seriousness differs for respondents who complete surveys on different devices. It is well established that people use mobile devices differently from traditional computers (Toepoel & Lugtig, 2015). Nevertheless, we do not expect the (different) use of mobile devices to result in differences in self-reported seriousness, as it has been shown that data quality does not differ among device users and we expect self-reported seriousness to predict data quality.
Demographic Characteristics
Seriousness may also differ for respondents with different demographic characteristics. For example, younger, less educated, and male respondents have been found to show higher levels of inattention (Maniaci & Rogge, 2014), which in turn may affect data quality. In addition, demographic characteristics may reflect the cognitive effort and capabilities that a survey requires of respondents, where a higher required level may result in lower data quality (Messer, Edwards, & Dillman, 2012). This may be true for education and age, for which a consistent influence on item nonresponse has been found (e.g., de Leeuw et al., 2003; Helasoja, Prättälä, Dregval, Pudule, & Kasmel, 2002; Messer et al., 2012). Higher age has generally been found to predict higher item nonresponse, although in the study of Struminskaya et al. (2015), older respondents generated lower item nonresponse. Answers of respondents with only a high school degree were associated with higher item nonresponse. For gender, it is less plausible to expect differences in data quality, as cognitive capabilities are assumed to be similar for males and females. Indeed, many studies found no gender differences in item nonresponse (e.g., Heerwegh & Loosveldt, 2008; Kwak & Radler, 2002; Messer et al., 2012; Struminskaya et al., 2015). However, Bech and Kristensen (2009) found that being female predicts item nonresponse, which could be due to the age of their participants, which ranged from 50 to 75 years. Regarding the selection of a “don’t know” option, Zeglovits and Schwarzer (2016) did not find a significant effect of age, but their results indicate that gender and education influence selecting “don’t know”: males and higher educated respondents are less likely to select this option than females and lower educated respondents. This effect of education was also found by Young (2012). In addition, she found that respondents older than 48 are more likely to answer “don’t know,” while the effect of gender on selecting this option was found to depend on the topic of the survey questions.
Demographic characteristics can also influence speeding, although this has been less widely studied. Education and gender have been found not to influence the prevalence of speeding, while speeding is more likely among younger than among older respondents (Zhang, 2013; Zhang & Conrad, 2014).
There is less evidence about the relationship of education, gender, and age with the internal consistency of data. One study found that more educated respondents, females, and older respondents provide more internally consistent data, although those effects were not significant for all the indicators used and were relatively small (Maniaci & Rogge, 2014). Similarly, in the study of Dunn, Heggestad, Shanock, and Theilgard (2018), being older and being female were positively, although not significantly, correlated with internal consistency.
A relationship between education and response effects is consistently found, with response effects being weaker among highly educated respondents (e.g., Krosnick & Alwin, 1987; Peytchev, 2007; Struminskaya et al., 2015). In general, older respondents are associated with larger response effects (e.g., Knäuper, 1999; Peytchev, 2007). Regarding gender, the results are mixed, with some studies reporting no significant gender effect (e.g., Struminskaya et al., 2015), while, for example, Cole, McCormick, and Gonyea (2012) reported more straightlining among males for most of their item sets. This stands in contrast to Zhang and Conrad (2014), who found that being female predicted straightlining.
Hypotheses
Based on the literature discussed above, we expect respondents who indicate being more serious to show higher data quality. Second, we do not expect the device used (i.e., desktop, tablet, or mobile phone) to predict data quality; accordingly, we do not predict a difference in self-reported seriousness between respondents using different devices. We expect a lower completed educational level to be related to lower seriousness and lower data quality, except for speeding. We have no specific expectation regarding the influence of gender on data quality and seriousness, since the literature on this relationship is quite inconsistent. Furthermore, we predict increasing age to be related to a higher level of seriousness and to higher data quality in the form of less speeding and more internal data consistency, but to lower data quality on the indicators nonsubstantial values, nondifferentiation, and response effects.
Method
Research Design
In the data, a fully crossed 3 × 5 × 4 factorial between-subjects experimental design was used in which respondents were assigned randomly among the conditions of three different factors. The factors are three different devices (i.e., desktop personal computer, tablet, or mobile phone), five different response formats (i.e., radio buttons, big buttons, slider, visual analogue scale, or a mix of slider and visual analogue scale), and four different scale lengths (i.e., 5-point, 7-point, 11-point, or continuous scale). In this study, we only take device into account and combine the format and scale length conditions. For more information about the effect of these conditions on data quality, see Toepoel and Funke (2018).
Respondents
Data come from the GfK Online Panel. This nonprobability online access panel is certified by the International Organization for Standardization. The sample was designed to be representative of the Dutch population aged 15 and over in terms of education, gender, and age. Respondents owning a desktop personal computer, a tablet, and a mobile phone were selected. The questionnaire was extensively tested to ensure that the layout functioned properly on all devices. No respondents used a feature phone (i.e., a phone lacking the advanced functionality of smartphones); we therefore use the term smartphone from here on. The response rate was 30%; 34% of those invited did not respond, and the dropout rate was 4%. A further 32% of respondents were not included in the analyses because a device quota had been reached. The response rate to invitations on tablets (32.03%) and smartphones (18.67%) was lower than on desktops (63.43%), and a reminder was sent for tablets and smartphones. In total, 5,077 respondents completed the survey, of whom 1,709 did so on a desktop, 1,702 on a tablet, and 1,666 on a smartphone. Of the respondents, 48% were male, and 30.7% had completed lower education (i.e., prevocational secondary education or less), 41.3% medium education (i.e., senior general secondary education, preuniversity education, or secondary vocational education), and 27.9% higher education (i.e., higher professional education or university education). The mean age was 46.09 years.
Measurement Instrument
The survey consists of three sections and starts with three questions about attitudes toward surveys: how serious and how motivated respondents are in completing the survey, and how difficult respondents think completing surveys for the panel is in general. Then, 16 questions are asked in the experimental section about respondents’ last holiday experience. These items are supposed to measure the four realms of an experience according to Pine and Gilmore’s (1998) experience economy, which holds that experiences can be sorted into four broad categories: entertaining, educational, escapist, and aesthetic events. See Appendices A and B for the wording of the seriousness questions and the experimental questions, respectively. The survey ends with seven evaluation questions: respondents evaluated whether the survey was clear and enjoyable to complete, rated the design and usability of the survey, and indicated for the second time how serious and motivated they were in completing the survey and how difficult it was to complete. None of the questions required an answer before continuing to the next question. Self-reported seriousness asked before the survey was missing for 75 respondents, and seriousness asked after completing the survey was missing for 95 respondents. For self-reported motivation, 51 and 89 respondents were missing, respectively. In total, the seriousness factor score was missing for 277 respondents, which left 4,800 respondents with valid factor scores.
For the 16 experimental questions, three different scale lengths and five different response formats were used. For these questions, we ignored the type of response format, since effects of those formats on data quality have been found to be small; the same holds for the different scale lengths (Toepoel & Funke, 2018). For the remaining questions, a 10-point Likert-type scale was used. Each question had a “not applicable” option presented below the other options. Furthermore, the GfK Panel added data regarding demographic characteristics and information about the device on which respondents completed the survey.
Procedure
The survey was conducted in April 2014 and took about 5 minutes to complete. Respondents were randomly assigned to complete the survey on a particular device but were allowed to use a device other than the one assigned. Among those assigned to a desktop, 24% completed the survey on a tablet or smartphone; about half of the respondents assigned to a tablet or smartphone did not comply and used a different device. However, Toepoel and Funke (2018) showed no selection effect regarding the choice of device. Therefore, we use data based on the device respondents actually used and ignore the device to which they were assigned. Response times for each question and for the whole survey were recorded. Statistical Package for the Social Sciences, Version 24.0, was used to perform all statistical analyses.
Results
Self-Reported Seriousness and Motivation
To create a score for seriousness, a principal factor analysis was performed on the four seriousness and motivation questions (administered before and after the experimental questions) with oblique rotation (promax). One factor had an eigenvalue over Kaiser’s criterion of 1 and explained 66.01% of the variance. Appendix A shows the factor loadings (retrieved from the factor matrix) and the eigenvalue and percentage of variance explained after extraction for this factor. The item loadings suggest that the extracted factor represents seriousness in completing the survey. Accordingly, Bartlett factor scores were used in all analyses as a measure of seriousness. A repeated measures multivariate analysis of variance comparing seriousness and motivation before and after the experimental questions was significant, Wilks’s Λ = .01.
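The analyses were conducted in SPSS; purely as an illustration of the scoring step, a minimal Python sketch of a one-factor solution with Bartlett factor scores, using the factor_analyzer package. The data, column names, and one-factor structure shown here are assumptions for the example:

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical responses to the four seriousness/motivation items
# (asked before and after the experimental questions).
rng = np.random.default_rng(0)
items = pd.DataFrame(
    rng.integers(1, 11, size=(200, 4)).astype(float),
    columns=["serious_pre", "motivated_pre", "serious_post", "motivated_post"],
)

# Principal-axis factoring; with a single factor, rotation has no effect.
fa = FactorAnalyzer(n_factors=1, method="principal", rotation=None)
fa.fit(items)

# Bartlett (weighted least squares) factor scores:
#   F = (L' Psi^-1 L)^-1 L' Psi^-1 z,  with z the standardized items.
L = fa.loadings_                                    # (4, 1) loading matrix
psi_inv = np.diag(1.0 / fa.get_uniquenesses())      # inverse uniquenesses
z = ((items - items.mean()) / items.std()).values   # (n, 4) standardized data
weights = np.linalg.solve(L.T @ psi_inv @ L, L.T @ psi_inv)
bartlett_scores = (z @ weights.T).ravel()           # one score per respondent
```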
In addition, we analyzed whether the seriousness factor score differed across device groups and demographic characteristics (i.e., education, gender, and age). A one-way analysis of variance showed a significant difference in mean seriousness factor score across device groups: desktop users reported being more serious and motivated than smartphone users.
Regarding the analysis for the educational groups, Levene’s test was significant, indicating unequal variances across groups. Higher educated respondents reported being less serious and motivated than medium and lower educated respondents.
An independent samples t test showed that self-reported seriousness was lower for male than for female respondents.
Significant mean differences in seriousness factor score also existed between age groups: seriousness generally increased with age, except for respondents aged 65 years and older (see the figure below).

Figure. Mean seriousness factor score for the different age categories. Error bars represent 95% confidence intervals.
Data Quality
Data quality was measured by the following indicators: the amount of nonsubstantial values, speeding, within-person internal data consistency, nondifferentiation, and response effects (primacy, recency, and neutral responding). Regression analyses were used to examine whether seriousness, device group, and demographic characteristics predict data quality. For internal data consistency, a linear regression was performed, while Poisson regressions were performed for the other data quality indicators, as these consisted of count data. When the assumption of equidispersion was violated, negative binomial regression analyses were used, as they are suitable for overdispersed count data.
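As an illustration of this model-selection logic, a minimal sketch using Python’s statsmodels rather than the SPSS procedures actually used; the data, variable names, and the dispersion cutoff are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: a count outcome (e.g., item nonresponse, 0-16) and
# a design matrix with seriousness plus a dummy-coded device group.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "nonresponse": rng.poisson(0.4, 500),
    "seriousness": rng.normal(0, 1, 500),
    "smartphone": rng.integers(0, 2, 500),
})
X = sm.add_constant(data[["seriousness", "smartphone"]])
y = data["nonresponse"]

poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Equidispersion check: Pearson chi-square over residual df near 1 is fine;
# the 1.5 cutoff for "overdispersed" is a hypothetical rule of thumb.
dispersion = poisson.pearson_chi2 / poisson.df_resid
model = sm.NegativeBinomial(y, X).fit() if dispersion > 1.5 else poisson
print(model.summary())
```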
Nonsubstantial values
Nonsubstantial values were measured by item nonresponse and the number of “not applicable” answers. Item nonresponse, the number of items out of the 16 experimental questions for which a respondent had missing values, was 0.38 on average. A higher seriousness factor score was related to fewer nonsubstantial values. Mobile device users showed considerably higher item nonresponse than desktop users, whereas tablet users gave significantly fewer “not applicable” answers, an effect that was similar but nonsignificant for smartphone users. Lower educated respondents produced more nonsubstantial values than medium and higher educated respondents (see Table 1).
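For illustration, a minimal sketch of how such per-respondent counts could be computed, assuming a hypothetical coding in which blanks are missing values and “not applicable” is coded −1:

```python
import numpy as np
import pandas as pd

# Hypothetical coding: NaN marks a skipped item, -1 marks "not applicable".
responses = pd.DataFrame(
    [[5, np.nan, -1, 3],
     [2, 2, 2, 2],
     [np.nan, np.nan, 4, -1]],
    columns=[f"q{i}" for i in range(1, 5)],
)

item_nonresponse = responses.isna().sum(axis=1)  # blanks per respondent
not_applicable = responses.eq(-1).sum(axis=1)    # "not applicable" answers per respondent
print(pd.DataFrame({"nonresponse": item_nonresponse, "not_applicable": not_applicable}))
```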
Table 1. Regression Analysis Summary for Seriousness, Device Group, and Demographic Characteristics Predicting Item Nonresponse and the Amount of “Not Applicable” Answers.
Speeding
Response times for completing the whole survey ranged from 1.40 to 7,243.63 minutes.
A Poisson regression analysis showed slight violations of equidispersion. The log likelihood of the same regression model using a negative binomial distribution indicated an improved fit, so the negative binomial model was used to examine whether speeding could be predicted by seriousness factor score, device group, and demographic characteristics. A higher seriousness factor score was significantly related to a moderate decrease in the odds of speeding. Being a smartphone user significantly decreased the odds of speeding compared to desktop users, with a large effect: the odds ratio indicates that the odds of speeding were more than seven times lower for smartphone users than for desktop users. Educational level and gender were not significantly related to speeding. In contrast, age group was related to speeding, in the sense that speeding decreased with age.
Internal data consistency
To obtain a score for internal consistency, we used the even–odd correlation as recommended by Curran (2016), Huang, Curran, Keeney, Poposki, and DeShon (2012), and Meade and Craig (2012). This measure is based on the assumption that items on the same scale are expected to correlate with each other for each individual (Huang et al., 2012). First, a principal component analysis was conducted on the 16 experimental items with oblique rotation (promax) to detect unidimensional scales within the survey. The values on these items were standardized to eliminate the effects of different scale lengths. One item (i.e., “It was quite boring there”) was recoded to reflect the direction of the other items. Four factors had eigenvalues over Kaiser’s criterion of 1 and together explained 65.84% of the variance. The scree plot also pointed to extracting four factors. Accordingly, four factors were retained (i.e., A, B, C, and D). Appendix B shows the factor loadings (retrieved from the pattern matrix) after rotation and the eigenvalues and percentages of variance explained after extraction for each factor. The items of each unidimensional scale were divided using an even–odd split based on the order of appearance of the items (i.e., A1, A3 being odd, A2, A4 being even, etc.). Subsequently, for each respondent, the mean of the odd items and the mean of the even items of each scale were calculated, resulting in an odd and an even subscale score (i.e., the average response on the odd questions of Scale A and the average response on the even questions of Scale A, and so on for the other scales). Then a within-person correlation between those two sets of subscale scores was computed in Excel to obtain the even–odd correlation (Meade & Craig, 2012). Accordingly, the even–odd correlation is a value between −1 and 1. The average correlation between the even and the odd subsets of unidimensional scales in the data was .53.
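The within-person correlations were computed in Excel; as an illustration, a minimal Python sketch of the same even–odd procedure. The scale composition and item names are hypothetical stand-ins for the four scales recovered from the principal component analysis:

```python
import numpy as np
import pandas as pd

def even_odd_correlation(row: pd.Series, scales: dict[str, list[str]]) -> float:
    """Within-person even-odd correlation across unidimensional scales.

    For each scale, items are split by order of appearance into odd and
    even halves; the respondent's mean on each half gives one pair of
    subscale scores, and the correlation is taken over all pairs.
    Returns nan if either set of subscale scores is constant.
    """
    odd_means, even_means = [], []
    for items in scales.values():
        odd_means.append(row[items[0::2]].mean())   # 1st, 3rd, ... item
        even_means.append(row[items[1::2]].mean())  # 2nd, 4th, ... item
    return np.corrcoef(odd_means, even_means)[0, 1]

# Hypothetical scales standing in for factors A-D from the analysis.
scales = {"A": ["q1", "q3", "q5", "q7"], "B": ["q2", "q4", "q6"],
          "C": ["q8", "q9", "q10"], "D": ["q11", "q12"]}

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(100, 12)), columns=[f"q{i}" for i in range(1, 13)])
z = (df - df.mean()) / df.std()  # standardize to remove scale-length effects
df["even_odd_r"] = z.apply(even_odd_correlation, axis=1, scales=scales)
```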
A linear regression analysis was performed to see whether the even–odd correlation could be predicted by seriousness factor score, device group, and demographic characteristics. As expected, seriousness factor score had a significant positive relationship with the even–odd correlation. Being a smartphone or tablet user and having completed lower education were negatively related to the even–odd correlation, and having completed higher education was positively related, although effects were small. No significant gender or age effects were found, except that the 20–29 group had a negative relationship with the even–odd correlation compared to the 40–49 group (see Table 2 for the linear model).
Table 2. Regression Analysis Summary for Seriousness, Device Group, and Demographic Characteristics Predicting Speeding, Internal Data Consistency, and Nondifferentiation.
Nondifferentiation
Since the items of the survey were not presented in a grid but each on a new page, we included nondifferentiation instead of straightlining (a measure frequently used as an indicator of data quality, e.g., Kaminska et al., 2010; Revilla, Ochoa, & Turbina, 2017; Toepoel & Lugtig, 2015) as the fourth data quality indicator. To detect nondifferentiation, the long string index was used, computed as the maximum number of consecutive items to which a respondent answered with the same response option (Johnson, 2005). Accordingly, this indicator has a maximum value of 16. The mean maximum long string was 2.61. A higher seriousness factor score was related to less nondifferentiation; mobile device users showed less nondifferentiation than desktop users, and lower educated respondents showed more than medium and higher educated respondents (see Table 2).
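As an illustration, a minimal Python sketch of the long string index; whether missing values break a run is our assumption, as the article does not specify this:

```python
import itertools
import pandas as pd

def long_string_index(answers: list) -> int:
    """Maximum number of consecutive identical answers (long string index).

    A missing value breaks a run (our assumption); an all-identical
    16-item record would score the maximum of 16.
    """
    runs = [len(list(group))
            for value, group in itertools.groupby(answers)
            if pd.notna(value)]
    return max(runs, default=0)

print(long_string_index([3, 3, 3, 7, 7, 3]))  # -> 3
```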
Response effects
Finally, we used the number of items out of the 16 experimental questions for which the first, last, or middle answer option was chosen to reflect response effects (i.e., primacy effects, recency effects, and neutral responding, respectively). On average, respondents chose the first answer option for 1.50 items. Contrary to our expectations, a higher seriousness factor score was related to more primacy.
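As an illustration for discrete scales, a minimal Python sketch of these counts; item names, codings, and scale lengths are hypothetical, and the continuous format used in one condition would require a band around the endpoints and midpoint rather than exact matches:

```python
import pandas as pd

def count_response_effects(row: pd.Series, scale_len: dict[str, int]) -> pd.Series:
    """Count primacy, recency, and neutral responses for one respondent.

    Assumes options are coded 1..k per item; a midpoint only exists for
    odd-length scales, so even-length items contribute nothing to neutral.
    """
    primacy = sum(row[q] == 1 for q in scale_len)
    recency = sum(row[q] == k for q, k in scale_len.items())
    neutral = sum(row[q] == (k + 1) // 2 for q, k in scale_len.items() if k % 2 == 1)
    return pd.Series({"primacy": primacy, "recency": recency, "neutral": neutral})

# Hypothetical: three items answered on 5-, 7-, and 11-point scales.
row = pd.Series({"q1": 1, "q2": 7, "q3": 6})
print(count_response_effects(row, {"q1": 5, "q2": 7, "q3": 11}))
```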
Seriousness factor score also had a positive relationship with choosing the last response option. No device or education effects were found. Being female had a positive relationship with recency. The only significant age effect was a positive relationship of the 50–64 group, compared to the 40–49 group, with recency.
In contrast to primacy and recency, and in line with our expectations, seriousness factor score had a negative relationship with neutral responding. No significant device, education, gender, or age effects were found (see Table 3 for the results per predictor of these analyses).
Table 3. Regression Analysis Summary for Seriousness, Device Group, and Demographic Characteristics Predicting Primacy, Recency, and Neutral Responding.
Discussion and Conclusion
This study has demonstrated that asking survey respondents about the seriousness and motivation of their participation predicts data quality. The extant literature is remarkably silent about the usefulness of seriousness and motivation checks. However, we found a moderate relationship between seriousness and data quality, where lower seriousness predicted lower data quality on all data quality indicators included (i.e., nonsubstantial values, speeding, internal data consistency, nondifferentiation, and neutral responding), except primacy and recency. For primacy and recency, a positive relationship with seriousness and motivation was found. A possible explanation is that respondents who report higher motivation might have stronger opinions; strong attitudes have more impact on cognition and behavior than weak attitudes (Krosnick & Abelson, 1992). Strikingly, descriptive statistics revealed higher levels of recency than of primacy. This contrasts with the literature, which suggests primacy effects occur in online environments, while the results are less clear for recency (Murphy, Hofacker, & Mizerski, 2006). This inconsistency might result from the concurrence (for 15 of the 16 questions) of recency with acquiescence, the tendency to agree with or say yes to items regardless of their content (Couch & Keniston, 1960). Moreover, the primacy effect may be a function of the nature of the survey topic (Barnette, 2001) and may be less prevalent in the current survey because people generally like holiday experiences, which concurred with recency. Further research should investigate whether and how the survey topic results in differential response effects.
Concerning the data quality of respondents using different devices (i.e., desktops, tablets, smartphones), the results were inconsistent. In contrast to previous studies, mobile device users showed lower data quality on the indicators internal data consistency and item nonresponse, the latter being considerably increased. However, tablet users gave fewer “not applicable” answers, and the same, although nonsignificant, effect was observed for smartphone users. It is possible that the Internet connection on mobile devices is slower, leading mobile device users to click the button for the next page more than once and consequently skip an item (Mavletova, 2013), rather than these users genuinely showing lower data quality. This is supported by the finding that mobile device users had higher data quality on other indicators: they showed less nondifferentiation than desktop users, in line with previous studies where items were presented on separate pages (Keusch & Yan, 2017; Lugtig & Toepoel, 2016), and speeding was found to decrease strongly for smartphone but not for tablet users compared to desktop users. This may result from the fact that it takes more time on smartphones to read small print, to zoom, and to select answer options, and from a slower Internet connection (Couper & Peterson, 2017; Mavletova, 2013; Wells, Bailey, & Link, 2013). No significant differences in response effects were found, corroborating the existing literature (e.g., Andreadis, 2015; Mavletova, 2013; Toepoel & Lugtig, 2014; Wells, Bailey, & Link, 2014). In sum, device use does not seem to influence data quality much through respondents’ answering behavior; rather, effects stem from the characteristics of a particular device, visible in a large decrease in speeding on smartphones and a large increase in item nonresponse on mobile devices. This contrasts with the finding that desktop users reported being more serious and motivated than smartphone users, which could result from surveys on smartphones not being taken as seriously as surveys on desktop computers (Weber et al., 2008). However, as our study showed no large negative effect of the use of mobile devices on data quality, mobile devices can be considered a feasible way of collecting data in survey research.
Regarding demographic characteristics, the highest level of education that respondents had completed influenced data quality. As expected, lower educated respondents showed lower data quality than medium and higher educated respondents, indicated by more nonsubstantial values, less internal data consistency, and more nondifferentiation. No difference in speeding was found, in line with our expectations, and there were no differences in response effects. These findings do not correspond with self-reported seriousness: contrary to our hypothesis, higher educated respondents reported being less serious and motivated than medium and lower educated respondents. Jaccard, McDonald, Wan, Dittus, and Quinlan (2002) found that higher educated respondents show higher self-report accuracy. In line with this, and since data quality tends to be higher for respondents who completed higher education, the accuracy of their self-reported seriousness and motivation may also be higher. This implies that the self-reported seriousness of lower educated respondents could be somewhat higher than their actual seriousness. However, this did not erase the predictive value of seriousness, as shown in the current study.
Self-reported seriousness was lower for male than for female respondents, in line with the suggestion that females are more conscientious and willing survey takers (Lambert & Miller, 2015). However, there were no differences in data quality between males and females on any indicator except recency, which is in all probability a result of the survey topic. Consequently, rather than actually being more serious and motivated, female respondents likely only report being so. This could result from a more pronounced influence of social desirability on answers among female than among male respondents (Philips & Clancy, 1972).
The influence of age group on data quality was rather ambiguous. Contrary to our expectations, results for nonsubstantial values were inconsistent, and older respondents did not show higher levels of internal data consistency and response effects and lower nondifferentiation, as hypothesized. The most consistent finding was that speeding decreased with age, in line with our expectations. For online survey research, this implies that data quality does not necessarily suffer when older respondents are included or overrepresented. These effects were not perfectly reflected in self-reported seriousness and motivation, which, as expected, increased with age, except for respondents aged 65 years and older.
A possible reason for the discrepancy between differences in seriousness among device and demographic groups and the relationship of seriousness with data quality is that some nonserious respondents may answer seriousness checks in a satisficing manner, limiting the predictive value of these checks and making it difficult to establish cutoff values that identify careless respondents. Nevertheless, we found a relationship between seriousness checks and data quality, which indicates the importance of adopting such checks in surveys. They can be used at the very least as a quality check for the obtained data, and data quality may accordingly be improved by removing nonserious participants. This could be done regardless of whether the checks are included before or after a survey, as this study suggests there are no large differences between incorporating seriousness checks before or after surveys, although future research needs to verify this. The finding that seriousness checks predict data quality corroborates the finding of Aust and colleagues (2013) that respondents who report being serious answer questions in a more consistent and valid manner. In contrast to the current study, their seriousness check included a reference to the importance of serious answers for the validity of research. The effect of seriousness and motivation checks may potentially be larger when the wording refers to the importance of serious answers, by minimizing the number of respondents who satisfice the seriousness checks; this should be investigated by future research.
Future research should also examine the relationship of self-reported seriousness and motivation with data quality in longer surveys, since the survey administered in the present study was fairly short. This is important because motivation may decrease over the length of a survey (Galesic & Bosnjak, 2009), which in turn can influence data quality. For example, item nonresponse has been found to increase over the course of a survey, while open-ended answers become shorter, response times faster, and answers to questions in grids less variable (Cole et al., 2012; Galesic & Bosnjak, 2009). Another limitation is that it could not be investigated whether certain respondents are more inclined to use certain devices, since the data resulted from an experiment in which respondents were assigned to a device. Further research should address this issue. Also, we ignored the different scale lengths to which respondents were randomly assigned. Although Toepoel and Funke (2018) did not show large effects of scale length on data quality, scale length could have influenced the results; nondifferentiation and response effects in particular could be subject to this.
This study contributed to the literature by showing that a self-report measure of seriousness and motivation in online attitude surveys predicts data quality. However, as shown in an analysis by Aust and colleagues (2013), few studies include seriousness checks in surveys. The findings in the current study indicate the importance of incorporating such checks in order to monitor data quality before analysis as well as to help identify and remove careless respondents and, in turn, to improve data quality.
