Abstract
Keywords
Background
Amazon Mechanical Turk (MTurk) is an online labor market in which people (“requesters”) requiring the completion of small tasks (“Human Intelligence Tasks” [HITs]) are matched with people willing to do them (“workers”). MTurk has become a popular data collection tool among social science researchers: In 2015, the 300 most influential social science journals (with impact factors greater than 2.5, according to Thomson Reuters InCites) published more than 500 articles that relied on MTurk data in full or in part (Chandler & Shapiro, 2016).
Reflecting the popularity of MTurk, considerable effort has been invested in evaluating data collected from it, with particular emphasis on documenting the demographic and psychological characteristics of its population, the quality of respondent data, and the methodological limitations of the platform. As a result, MTurk workers have become one of the most thoroughly studied convenience samples currently available to researchers (for a review, see Chandler & Shapiro, 2016), and researchers have learned a great deal about the ways in which MTurk respondents are and are not similar to the general population. There are reasons to suspect, however, that there are also important variations between different samples drawn from MTurk, and these variations have received far less attention. This article addresses this issue, using data from a study of approximately 10,000 MTurk workers to examine whether sample composition varies as a function of the time at which it is collected.
We begin by reviewing what extant research reveals about the demographic composition of the MTurk worker pool. Then, we describe the methods and measures that we use in our study, after which we present the results of our analyses, which include a demographic description of the largest sample of MTurk workers we are aware of and an exploration of whether the demographic characteristics of MTurk respondent samples vary across day and time and earlier versus later in the data collection. We conclude with a discussion about the implications of the temporal variations we uncover for researchers using MTurk (and online data collection more generally).
How Representative of the General Population Are Samples of MTurk Workers?
The demographic characteristics of samples drawn from MTurk populations have been extensively studied. These studies show that most MTurk workers live in the United States and India (Paolacci, Chandler, & Ipeirotis, 2010), that U.S. MTurk workers are more diverse than many other convenience samples, and that they are not representative of the population as a whole (Paolacci & Chandler, 2014). However, while scholars caution that MTurk samples are typically less representative than commercial web panels that make explicit efforts to provide representative samples (Berinsky, Huber, & Lenz, 2012; Mullinix, Leeper, Druckman, & Freese, 2015; Weinberg, Freese, & McElhattan, 2014), they also agree that MTurk samples are more diverse than student samples or community samples recruited from college towns (Berinsky et al., 2012; Krupnikov & Levine, 2014).
Differences between the U.S. MTurk population and the U.S. general population parallel differences between samples recruited through other online methods and the U.S. population (Casler, Bickel, & Hackett, 2013; Hillygus, Jackson, & Young, 2014; Paolacci & Chandler, 2014). Most significantly, MTurk workers are typically younger than the general population (Berinsky et al., 2012; Paolacci et al., 2010), have more years of formal education, and are more liberal (Berinsky et al., 2012; Mullinix et al., 2015). MTurk workers are less likely to be married (Berinsky et al., 2012; Shapiro, Chandler, & Mueller, 2013) and more likely to identify as lesbian, gay, or bisexual (LGB; Corrigan, Bink, Fokuo, & Schmidt, 2015; Reidy, Berke, Gentile, & Zeichner, 2014; Shapiro et al., 2013). MTurk workers also tend to report lower personal incomes and are more likely to be unemployed or underemployed than members of the general population (Corrigan et al., 2015; Shapiro et al., 2013). Whites and Asian Americans are overrepresented within MTurk samples, while Latinos and African Americans are underrepresented (Berinsky et al., 2012).
Are Samples of MTurk Workers Representative of MTurk Workers?
While the foregoing research makes clear that the U.S. MTurk population is not representative of the U.S. population as a whole, there are also reasons to suspect that samples recruited from MTurk are themselves not representative of the MTurk worker population.
There are many potential causes of sampling variation across studies. Anecdotal evidence suggests that MTurk sample composition might be influenced by the fact that workers share information about available studies and that reputation effects might lead workers to gravitate toward (or to avoid) particular requesters (Chandler, Mueller, & Paolacci, 2014). Some of this variation is also surely the result of MTurk workers self-selecting into the studies that interest them (for a discussion, see Couper, 2000). Design choices that are exogenous to a study's substance may also inadvertently influence sample composition. The effects of such exogenous choices are of particular interest because they are both within researchers' control and typically irrelevant to the studies themselves.
The present study focuses on the impact of intertemporal variation on sample composition across (a) time of day, (b) day of week, and serial position (i.e., earlier or later in data collection), both (c) across the entire data collection and (d) within specific batches. Extant evidence about sample differences across time and day is suggestive but limited by small sample sizes. Comparing samples of about 100 participants obtained within two different studies, Komarov, Reinecke, and Gajos (2013) observed that, compared with workers recruited later in the evening, workers recruited during the daytime were older, more likely to be female, and less likely to use a computer mouse to complete the survey (suggesting that they were using mobile devices). Lakkaraju (2015) compared the gender, income, education, and age of 700 workers across different times and days, finding that only gender varied as a function of the day a given HIT was posted.
Variation among participants who complete a research study earlier or later in the data collection process (referred to here as serial position effects) has been observed in other modes of data collection but has not been examined on MTurk. Changes in sample composition between “early” and “late” responders have been observed in mail and email surveys, in part because the easiest-to-contact participants tend to complete surveys earlier (for a review, see Sigman, Lewis, Yount, & Lee, 2014). In general, people of color 1 are underrepresented among early respondents, as are men (Gannon, Nothern, & Carroll, 1971; Sigman et al., 2014; Voigt, Koepsell, & Daling, 2003), younger people, and people with fewer years of formal education (Voigt et al., 2003; for a discussion, see Sigman et al., 2014).
Examinations of lab studies of college students have also shown that sample composition can vary over time. For example, women (Ebersole et al., 2016) and students with high GPAs (Aviv, Zelenski, Rallo, & Larsen, 2002; Cooper, Baumgardner, & Strathman, 1991) are more likely than men and students with lower GPAs to participate in lab studies at the beginning of the semester. Personality also influences when students complete lab studies: participants who report being less extraverted, less open to experience, and more conscientious are more likely to participate at the beginning of the semester.
Investigating whether samples vary over the course of a survey fielding period is critical, because researchers tend to recruit small samples for their research (Fraley & Vazire, 2014). In fact, most of the existing studies of the characteristics of MTurk workers rely on relatively small samples (
A second potential serial position effect on MTurk is the difference between people who complete HITs shortly after they are posted and those who complete them later. This factor is independent of early versus late responding to the study because study data can be collected through any number of batch postings. In practice, researchers often collect data from MTurk by posting more than one batch of HITs, either to speed up data collection (data collection is faster immediately after a HIT is posted; Peer, Brandimarte, Samat, & Acquisti, 2017) or to circumvent the fee Amazon charges for a batch that recruits more than nine participants. When more (but smaller) batches are posted, the average batch will, by default, be closer to the front of the queue, which could affect sample composition for at least three reasons. First, a batch closer to the front of the queue reduces the amount of work it takes to find it, especially for workers who rely on the default sort order. Second, smaller batches might limit the number of workers who discover the survey through links on worker forums, because the link will be valid for a shorter period of time. Third, some workers use automated scripts or other tools that alert them when new work becomes available, and each new batch triggers such alerts. In this study, we posted multiple batches, allowing us to disentangle serial position effects within batches of posted HITs from serial position effects across the data collection as a whole.
Method
To explore whether MTurk worker demographics vary intertemporally, we crafted a brief HIT (average completion time was approximately 5 min) that contained demographic questions that are of interest to scholars across an array of disciplines.
We first posted our HIT on March 19, 2015, and data collection concluded on May 14, 2015, so it was active for a total of 56 days (or 8 weeks). We began by posting the HIT twice daily, at 3 p.m. and 10 p.m. Eastern Time (ET). After the first week, we added a third posting at 10 a.m. ET. 2
Only U.S.-based workers with a HIT acceptance ratio (HAR) greater than 95% and who had completed at least 100 HITs were eligible to participate. We selected workers with a 95% HAR because this subsample of workers has been shown to result in higher quality data (Peer, Vosgerau, & Acquisti, 2014) and, in our experience, to be favored by researchers. We prevented workers from completing this survey more than once across the entire fielding period.
For the first 3 weeks, workers were paid US$0.25 to complete the survey. After learning that the average time to completion was roughly 5 min, we increased the pay rate to US$0.50 for the remainder of the fielding period to comply with recommended pay norms of US$0.10 per minute (see “Guidelines for Academic Requesters,” 2014). By the end of the study, we had posted the HIT 162 times and sampled 9,770 unique respondents.
Measures
At the beginning of the study, we collected measures of age and the U.S. state in which respondents lived. Participants were then asked to report demographic information including their highest level of education, current employment status, and current occupation. We also asked a series of questions about their current relationship status, sexual orientation, sex assigned at birth, and current gender identity. In addition, we asked questions about household size, race and ethnicity, household income, religious denomination, how often they attend religious services, and self-perceived socioeconomic status (see Howe, Hargreaves, Ploubidis, De Stavola, & Huttly, 2011; Ravallion & Lokshin, 1999).
We also included a 10-item measure of the “Big Five” personality factors (Ten Item Personality Measure or TIPI; Gosling, Rentfrow, & Swann, 2003). The “Big Five” is among the most widely accepted taxonomy of personality traits within psychology (for a review, see John & Srivastava, 1999) and conceptualizes personality as consisting of five bipolar dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. The questionnaire and other materials are available online on the Open Science Framework (osf.io/tg7h3).
Prior to completing the survey, participants were asked whether they learned about the survey on MTurk or somewhere else. Those who indicated somewhere else were asked to specify where they learned about it.
Finally, using a database of more than 100,000 HITs submitted over 3 years immediately prior to the present study (reported in Stewart et al., 2015), we were able to estimate individual workers’ relative experience completing MTurk tasks. Workers with no recorded experience during the Stewart et al.’s study (
Results
Data Cleaning and Survey Metadata
Data collection resulted in 10,121 survey attempts, of which 169 attempts (generated by 147 workers) were identified as duplicate responses. A response was flagged as a duplicate whenever a WorkerID appeared more than once. For workers with duplicate responses, the most complete response was retained; when responses were of equal length (typically complete), the first response was retained. An additional 182 responses that came from non-U.S. IP addresses and one response without a WorkerID were also identified and deleted, resulting in 9,770 valid survey attempts.
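As a minimal illustration, the deduplication rule described above (keep the most complete response per WorkerID, breaking ties in favor of the earliest submission) might be sketched as follows; the field names are hypothetical, not those of the actual data file.

```python
def deduplicate(responses):
    """Collapse multiple submissions per worker to one: keep the most
    complete response, or the earliest one when completeness ties.

    Each response is a dict with 'worker_id', 'submitted_at', and
    'n_answered' (number of items answered) -- hypothetical field names.
    """
    best = {}
    # Walk responses in submission order so that, on a completeness
    # tie, the first-submitted response wins.
    for r in sorted(responses, key=lambda r: r["submitted_at"]):
        wid = r["worker_id"]
        if wid not in best or r["n_answered"] > best[wid]["n_answered"]:
            best[wid] = r
    return list(best.values())
```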
Of the valid attempts, 780 (8%) were identified by Qualtrics as incomplete. A visual inspection of these responses found that 724 of these respondents answered the last question in the survey and were functionally complete. Only 56 respondents (0.6%) dropped out of the survey after providing only partial data. These partial responses were included for analysis.
Of all valid attempts, 518 (5.3%) came from an IP address shared by at least one other response. The majority of IP addresses (
For example, we examined the 433 responses from IP addresses that contributed four or fewer responses. Of these, 233 were almost certainly unique respondents from the same household: They listed exactly the same household size, reported the same ages for household members (±2 years in aggregate), and reported an age that matched the age of a household member reported by the other respondent. An additional 49 respondents were likely from the same household, reporting approximately the same total age of household members (±5 years in aggregate) or appearing to have neglected to report a household member (usually a child or a much older adult).
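A matching heuristic along the lines described above can be sketched as follows; the field names and the single tolerance parameter are illustrative simplifications of the rules actually applied (±2 years in aggregate for near-certain matches, ±5 for likely ones).

```python
def likely_same_household(a, b, tol=2):
    """Judge whether two responses from one IP address plausibly come
    from distinct members of the same household.

    a, b: dicts with 'age' (the respondent's own age) and
    'household_ages' (ages of the people they live with) --
    hypothetical field names for illustration.
    """
    # Reported household sizes must agree.
    if len(a["household_ages"]) != len(b["household_ages"]):
        return False

    def age_matches(age, ages):
        return any(abs(age - x) <= tol for x in ages)

    # Each respondent's own age should appear among the household
    # members the other respondent reported.
    if not (age_matches(a["age"], b["household_ages"])
            and age_matches(b["age"], a["household_ages"])):
        return False
    # Total reported household age should agree in aggregate.
    total_a = a["age"] + sum(a["household_ages"])
    total_b = b["age"] + sum(b["household_ages"])
    return abs(total_a - total_b) <= tol
```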
Three of the four IP addresses that generated the most responses were servers registered to Amazon. It is likely that participants from these addresses were using either a proxy server or an ISP hosted on Amazon Web Services. These responses varied in the time they were attempted, the specific browser and operating system configuration used, and the content of the survey responses.
Characteristics of the MTurk Sample
Tables 1 to 4 present summary data about the entire sample, about participants in the first two batches only, and for national estimates when available. The entire sample represents the largest sample of MTurk workers we are aware of, and likely measures about two thirds of the available worker population (Stewart et al., 2015). The sample size of the first two batches (
Demographic Characteristics of Workers.
U.S. Census Bureau (2016; mean age of adult population).
U.S. Census Bureau (2016) population estimates.
Socioeconomic Characteristics of Workers.
Bureau of Labor Statistics, U.S. Department of Labor (2016).
Relationship Characteristics of Workers.
General Social Survey (as reported and summarized in Gates, 2014).
Attitudinal and Personality Characteristics of Workers.
Population estimates derived from American National Election Studies 2012 time series unless otherwise noted.
The demographic data are reported in Table 1, including information about worker experience and where workers learned about the survey. Differences between this sample and the U.S. population as a whole are generally consistent with those reported in previous analyses of smaller surveys (Berinsky et al., 2012; Krupnikov & Levine, 2014; Paolacci et al., 2010; Shapiro et al., 2013). For example, the workers in our sample are younger and more likely to be white than the U.S. population as a whole. Workers residing in the Eastern Time Zone are overrepresented compared with those in other parts of the United States, likely because the times at which HITs were posted aligned most closely with the times at which workers in that time zone were likely to be active.
Almost all (90.9%) workers reported finding the survey on MTurk. Of the 868 workers who found the survey elsewhere, most (
Table 2 summarizes the socioeconomic characteristics of our sample. Respondents to our survey generally reported more years of formal education than the population as a whole. Although Americans residing in the wealthiest households are underrepresented in our data, household income was much closer to the median U.S. income than would be expected from previous measurements of individual worker income (Berinsky et al., 2012; Paolacci et al., 2010). A portion of this difference is likely due to the fact that 16.5% of the respondents in our sample are under 30 and living with someone at least 18 years older than they are, suggesting that our sample includes a substantial number of millennials with low individual income but who are living with their higher income parents.
Table 3 summarizes the relationship status and characteristics of respondents, revealing that approximately a third of respondents are married and another third are single. In addition, we find that 1.5% of our sample reported being currently engaged in a consensually nonmonogamous relationship (see Haupert, Gesselman, Moors, Fisher, & Garcia, 2016). As has been observed in other studies of MTurk workers (Corrigan et al., 2015; Reidy et al., 2014; Shapiro et al., 2013), the proportion of lesbian, gay, and particularly bisexual respondents is higher than in the U.S. population as a whole. This is likely because online populations are disproportionately young, and younger people are more likely to identify as LGB (Gates & Newport, 2012; Moore, 2015).
Finally, summary statistics for the attitudinal and personality measures are summarized in Table 4. Consistent with earlier research, workers were more likely to identify as Democrats than are members of the general population (Berinsky et al., 2012; Mullinix et al., 2015). Relatively few workers identified as religious, a disproportionate number identified as atheists, and reported rates of church attendance were generally low. Relative to normed data obtained from a large convenience sample of Internet users (Gosling, Rentfrow, & Potter, 2014), MTurk workers reported being about two thirds of a standard deviation less extraverted, about a third of a standard deviation less open to new experiences, and only slightly less agreeable, conscientious, or emotionally stable.
The vast majority (92.5%) of participants in our study completed the survey on a computer. Of the remaining participants, 2% completed the survey using a tablet, 4.5% using a phone, and the rest using other devices (e.g., game consoles) or devices that could not be identified. Rates of mobile device use are somewhat lower than those noted in other online panels (de Bruijne & Wijnant, 2014a, 2014b).
Sample Differences by Time of Completion
The focus of our investigation is how the composition of the MTurk worker pool varied across days of the week, across time of day, and across the serial order in which workers participated. Main findings of these analyses are summarized in Table 5. We looked for variations within the following variables: age, gender identity, education, employment, household income, household size, race, Latino ethnicity, socioeconomic status, sexual orientation, relationship status, party identification, religion, and religiosity. Our survey design allowed respondents to identify as more than one race, so we treated each racial category (White, Black or African American, Asian American, American Indian or Alaskan Native, Native Hawaiian or Pacific Islander, or Other) as a separate binary dependent variable. We also looked for differences in the Big Five personality traits: extraversion, agreeableness, conscientiousness, emotional stability, and openness. Finally, we examined workers’ prior experience and where they reported finding the survey.
Significant Results by Time of Day, Day of Week, Serial Position, and Pay Rate.
In two instances, similar and highly correlated variables were collected for purposes irrelevant to the present study. In each case, only one variable was selected for analysis. The first instance was marital status and relationship status. We selected marital status for analysis because this variable is more typically recorded in national surveys and therefore more relevant for this demographic analysis. The second instance was political ideology and party affiliation. We conducted the analyses using political ideology, but results are identical when party identification is used instead.
To limit the number of comparisons, some response options were collapsed into broader categories (e.g., specific denominations of Christianity were collapsed into a single category). In total, given the coding, our final analysis included 31 different demographic variables.
For all continuous, ordinal, and binomial variables, generalized linear modeling (GZLM) was used to regress each variable on (a) the day of the week (categorical), (b) the time of day the batch was posted (categorical), (c) the serial position of the batch within the data collection run (continuous), (d) the serial position of the individual response within the batch (continuous), and (e) a dichotomous variable representing the amount of compensation (categorical), included to control for possible effects of increasing payment partway through the study. Interval dependent measures were treated as linear effects, except for worker experience (i.e., the total number of MTurk HITs already completed), which was modeled using a negative binomial distribution. This approach was adapted to multinomial regression to evaluate differences in religion, as SPSS’s implementation of GZLM cannot be used for multinomial variables.
Including so many independent and dependent variables brings with it the risk of false positives. To mitigate this risk, we limited the number of comparisons by not including interactions in the model. We also limited the comparisons of each time or day to the grand mean for all times and days (rather than individual comparisons against all other times or days). For example, we compared the mean percentage of college graduates in batches posted on Tuesdays with the mean percentage of college graduates in all batches (including Tuesdays). This approach led to a total of 13 significance tests for each of the 29 demographic variables and two MTurk behavior variables (worker experience and where they found the study), for a total of 403 comparisons.
To further reduce the potential for false positives, we set the alpha criterion at .01, rather than the more typical .05, and used the Benjamini–Hochberg adjustment (Benjamini & Hochberg, 1995) to hold the false discovery rate constant at .01 across all tests. Following these adjustments, no results with an unadjusted
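The Benjamini–Hochberg adjustment cited above can be sketched in a few lines. This is a minimal, generic implementation (the function and variable names are our own, and this is not the SPSS routine used in the study): rank the p values in ascending order, find the largest rank k such that p(k) ≤ (k/m)·α, and reject every hypothesis at or below that rank.

```python
def benjamini_hochberg(p_values, alpha=0.01):
    """Return a list of booleans marking which hypotheses are rejected
    while holding the false discovery rate at `alpha`
    (Benjamini & Hochberg, 1995)."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= (rank / m) * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject
```

Note that the procedure rejects every hypothesis up to the largest qualifying rank, even when some intermediate p value fails its own per-rank threshold.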
Day of week effects
Of our 217 day-of-week comparisons, we found seven instances in which the attributes of participants recruited on a particular day of the week significantly differed from the sample as a whole. 4 These findings are summarized in Table 5.
The average age of respondents varied as a function of the day of the week. Participants on Wednesday (
People completing HITs on Sundays were more likely to be employed full-time (52%) than the sample as a whole (48.5%; β = .21, Wald χ2 = 14.01
Workers were less likely to find the survey outside of MTurk on Saturday (3.4%) or Sunday (6%) than the sample as a whole (9%; β = −.04, Wald χ2 = 35.87
Time of day effects
Of our 93 time-of-day comparisons, we found 12 instances in which attributes of participants recruited at a particular time of day differed significantly from the grand mean. 5 These differences generally reflected linear trends in the composition of the MTurk workforce throughout the day, and are summarized in Table 5.
As might be expected, one of the most pronounced consequences of posting at different times was variation in the proportion of workers from different time zones. People in earlier time zones were more likely than average to complete HITs posted at 10 a.m. (β = −.15, Wald χ2 = 71.92,
The proportion of Asian American respondents also increased over the course of the day, growing from 5.9% at 10 a.m. to 7.6% at 3 p.m. to 9% at 10 p.m. The proportion of Asian Americans was significantly lower than average at 10 a.m. (β = −.016, Wald χ2 = 15.24,
Other differences were observed that were not an artifact of time zone. The proportion of single workers increased linearly throughout the day from 29.1% at 10 a.m. to 32.2% at 3 p.m. to 34.9% at 10 p.m. The proportion of workers who are single was significantly lower than average at 10 a.m. (β = −.03, Wald χ2 = 16.91,
More workers who completed the survey at 10 p.m. used smartphones (5.8%) than across the sample as a whole (3.7%; β = .014, Wald χ2 = 18.01,
Workers who completed the HIT at 10 a.m. were less likely to report having found the HIT outside of the MTurk interface (8.5%) than the sample as a whole (9%; β = −.014, Wald χ2 = 16.01,
Finally, relative to the sample as a whole (
Overall serial position effects
Of our 31 positional comparisons, we found seven instances in which the attributes of participants differed over time. 6 Workers who completed HITs earlier in the data collection process reported higher levels of emotional stability, conscientiousness, and agreeableness. Participants who completed earlier batches of HITs also tended to be older, were more likely to have a full-time job, and lived in smaller households. Workers who completed HITs earlier were also substantially more experienced than workers recruited later in the study (Table 6).
Worker Characteristics as a Function of Serial Position Across Study.
Within-batch serial position effects
Of our 31 positional comparisons within batches, we found five instances in which the attributes of participants recruited earlier in a given batch differed from those of participants recruited later in the same batch. Workers who completed an available HIT earlier in a given batch were on average older, more likely to be female, and less likely to be Asian American. Workers who completed HITs sooner were also less likely to have found the survey through a source outside of MTurk and tended to be more experienced than workers recruited later in the study (Table 7).
Worker Characteristics as a Function of Serial Position Within Batches.
Pay effects
Pay effects were included primarily to control for a change in design part way through data collection. Of the 31 payment comparisons, we found evidence of only two characteristics that changed once we offered to pay more. Controlling for other variables, workers in the high-pay condition reported higher emotional stability (
Discussion
In this article, we have described demographic characteristics of a large sample of MTurk workers and examined differences across time, day, and serial position. Of our 403 demographic comparisons, we found 33 differences (8.2% of tested effects), and significant effects had an average effect size of
Demographic Differences by Day and Time
Day of the week influenced few (2%, or 4/203) demographic characteristics, and these effects were small (
Time of day resulted in similarly small effects (
Of particular note, contrary to previous research (Komarov et al., 2013), we found that workers were more likely to use mobile devices late at night (5.8% of HITs posted at 10 p.m. were submitted from mobile phones, compared with 3.7% of HITs submitted during the rest of the day). Mobile device use can have adverse effects on data quality, including increased rates of attrition (Mavletova, 2013; Sommer, Diedenhofen, & Musch, 2016; Wells, Bailey, & Link, 2013) and shorter and fewer open-ended responses (Mavletova, 2013; Struminskaya, Weyandt, & Bosnjak, 2015). As a result, researchers might consider adjusting the time of day at which they post research studies or collect data if they hope to optimize mobile completion or collect open-ended responses.
The large proportion of observed differences suggests that time-of-day effects might be a fruitful area of future research, both through expanding the range of variables that are examined and through a particular effort to understand how regional differences, differences in the active user population across time within regions, and changes in individual responses throughout the day combine to produce these differences.
Demographic Differences by Serial Position
The effects of serial position were more extensive than time-of-day and day-of-week effects; 21% (6/29) of across-sample serial position effects were significant, with an average effect size of
When sampling error is unsystematic, larger samples more closely approximate the population. This is not so in the presence of systematic bias. As our sample grew, some biases (e.g., the Democratic leanings of respondents) remained the same; others actually increased (e.g., biases in age, employment, conscientiousness, and emotional stability). Thus, it is not a given that making a sample more representative of the U.S. MTurk worker population will also make it more representative of the U.S. population as a whole. Variations in demographic characteristics across the entire sample are also relevant to researchers who recruit workers from the available pool without replacement (e.g., to prevent workers from completing the same study twice). Of particular relevance, we found variations in the “Big Five” personality factors as a function of serial position. Workers who completed HITs earlier in the data collection process reported being slightly more emotionally stable, more conscientious, and more agreeable. These traits are associated with, and may moderate, other important variables, including respondent data quality and political behaviors and attitudes that might bias samples (for an excellent review, see Gerber, Huber, Doherty, & Dowling, 2011).
Variations in demographic characteristics associated with serial position within batches of HITs are important when considering whether to recruit respondents in one large batch or in several small batches. It is particularly important to understand potential within-batch serial position effects because several third-party solutions (e.g., TurkGate, Goldin & Darlow, 2013; and TurkPrime, Litman, Robinson, & Abberbock, 2017) make it easy to divide data collection efforts into a large number of very small batches. By and large, we find that smaller batches will yield samples that are older and include more women, but will attenuate the overrepresentation of Asian American workers.
Differences in Worker Experience and Forum Use
Time of day and serial position were strongly related to how much MTurk experience respondents had and to how workers found the survey. More experienced workers completed the survey earlier in data collection (both within and across batches). Variations in worker experience may be associated with greater exposure to survey tactics and experimental manipulations, which can have various effects on data quality. On one hand, more experienced workers are more familiar with common research questions, leading to practice effects (Chandler et al., 2014), potentially smaller effect sizes on commonly used experimental paradigms (Chandler et al., 2015), and potentially more extreme and less malleable attitudes toward topics that respondents are frequently asked about (Sturgis, Allum, & Brunton-Smith, 2009). On the other hand, more experienced workers may be more attentive and therefore may provide higher quality responses.
We also observed substantial intertemporal variation in workers' forum use, with more referrals from links shared outside of MTurk occurring in the afternoon and on Thursdays, and fewer in the evenings and on weekends. These differences may be relevant if researchers are concerned about respondents who have potentially seen information about a study prior to completing it. The longer a HIT is available, the more opportunity workers have to find it on an outside forum.
Although we did not vary pay rates experimentally, we nonetheless found that when we increased pay, there was a concomitant increase in the experience of survey participants. Together, we thus observed two separate patterns: (a) Early responders to the survey tended to be more experienced workers and (b) when we increased the pay, the proportion of more experienced workers increased even further. If researchers are concerned that worker savviness might affect their findings (Krupnikov & Levine, 2014), they should be attentive to these possibilities when they post their studies.
Conclusion
This study is the largest and most comprehensive description of MTurk demographics that we are aware of and the first large-scale effort to examine intertemporal differences in sample composition (though for a similar project, see Arechar, Kraft-Todd, & Rand, 2016). Data from our study of approximately 10,000 MTurk workers have allowed us to examine three key possible sources of temporal variation in MTurk sample composition: (a) time of day, (b) day of week, and serial position both (c) across the entire data collection and (d) within specific batches.
Taken as a whole, our results should serve as a source of both comfort and caution to scholars who use MTurk to recruit participants for their research. On one hand, we found only minimal day-of-week differences. However, we also showed that there are small but significant time-of-day variations in demographic composition—variations that bear closer scrutiny. The effects of serial position also warrant further study, as they emerged as persistent influences across multiple variables, including characteristics known to affect political and psychological attitudes (e.g., Big Five personality traits; Dietrich, Lasley, Mondak, Remmel, & Turner, 2012; Gerber et al., 2011). Differences in sample composition can compromise claims to generalizability and might lead to challenges with reproducing research findings as well (Peterson & Merunka, 2014). As is often the case, larger samples (and/or those recruited in such a way as to be more representative) are especially critical when researchers are concerned that heterogeneous treatment effects may reduce the external validity of a given sample.
Researchers should bear our findings in mind as they consider how best to recruit samples from MTurk. The intertemporal dynamics we have detailed are likely to be most relevant to researchers attempting to collect representative samples of the MTurk worker population, such as studies of MTurk worker behavior and attitudes that attempt to understand the dynamics of contract labor and piece-work in the “gig economy” (Aguinis & Lawal, 2013; Brawley & Pury, 2016). But researchers interested in other topics should pay attention to relationships such as those between serial position and psychological characteristics and consider including information about when and how many times they posted their HIT when reporting results. Perhaps most importantly, these findings demonstrate that the number of workers recruited and the size of batches used to recruit them can have a large effect on the average experience of sample respondents.
As MTurk and other similar online convenience samples become more widely used, it is increasingly important that we better understand who participates in these subject pools and when certain kinds of respondents are more likely to opt in relative to others. Such examinations will help researchers assess published results, especially (though not limited to) their generalizability across populations and over time.
This project suggests several directions for future research. Beyond extending the analysis of temporal effects to new variables, or examining intertemporal variation in other sources of data, future work could examine how other design choices affect sample composition, including whether researchers with poor ratings or tasks with low pay get substantively different samples than researchers with better ratings or tasks with higher pay. This is an important area for future research to examine, particularly as researchers continue or increase their reliance on online data collection.
