Abstract
Social science survey data sets are often wider than they are long. Resource limitations demand that surveys ask many questions of the minimum number of respondents needed for statistical analyses. Moreover, social scientists are often interested in hard-to-reach populations, accentuating the need to ask many questions of few respondents. These difficulties characterize the Fragile Families and Child Wellbeing Study (FFCWS), which follows a cohort of nearly 5,000 children born in large U.S. cities between 1998 and 2000, roughly three quarters of them to unmarried parents (Reichman et al. 2001). The study collects a wealth of information about this disadvantaged group, including children’s physical and mental health, cognitive function, schooling, and living and family conditions. Overall, the FFCWS data set contains nearly 13,000 variables.
The breadth of the variables contained in the FFCWS data set presents opportunities for a prediction task such as the Fragile Families Challenge (FFC). The FFC asked participants to use a data set containing variables collected from the child’s birth until year 9, and some training data from year 15, to predict six outcomes in the year 15 data: grade point average (GPA) and grit of the child, material hardship and eviction of the family, layoff of the primary caregiver, and whether the primary caregiver participated in a job skills program. Although there is considerable information on each child, there are few children in the data set. As a result, new problems arise. Specifically, the high ratio of variables to observations increases the possibility of overfitting, that is, of fitting a complex model to statistical noise in a way that yields less useful out-of-sample predictions. In this article, we explore whether human-informed variable selection and parameter tuning can help solve this problem.
Machine-learning (ML) methods have been increasingly applied to data with a high ratio of variables to observations precisely to help with such variable selection (in ML parlance, feature selection). They provide ways to use effectively the vast amounts of information contained in high-dimensional data sets (Donoho 2017). Whereas social scientists usually draw on knowledge about the underlying data-generating process linking variables to outcomes, ML methods are less concerned with theoretical informativeness and favor data-driven predictive performance.
Increasingly, a number of applications in computer science have sought to incorporate human knowledge into ML methods (e.g., Branson et al. 2010). However, applications of these “human-in-the-loop” approaches are rare in the social sciences. In this article, we implement a human-in-the-loop approach to the FFC’s prediction tasks. We surveyed a scholarly community of social scientists as well as an anonymous community of laypeople to elicit their beliefs about which variables in the FFCWS data set would best predict each of the six outcomes. We used the information from these surveys in different ways. First, we subsetted the FFCWS data set preemptively, using either the variables identified by these surveys or a preexisting set of variables identified by the Fragile Families team. Second, we used information on scores assigned to particular variables to assign weights in the ML method. In effect, our ML approach was more likely to use variables with higher scores. We contrasted these human-in-the-loop approaches to a data-driven ML approach making use of the full data set of nearly 13,000 variables.
The article proceeds as follows. First we outline how we elicited scholarly expertise and lay judgments. To use the extensive collection of variables in the FFCWS for our modeling approaches, we needed to address the issue of missing values in the data set. Next we describe how we addressed missingness. Thereafter we describe the models used, present results, and conclude.
Using Expert and Crowd-Sourced Knowledge
There are several ways one might collect knowledge about the predictors of the outcomes in the FFC. One could screen publications or conduct interviews with individuals familiar with the FFCWS. We instead leveraged computational tools to retrieve insights from scholars. First, we used Amazon Mechanical Turk (MTurk) to retrieve the contact information of every author who had published using the FFCWS (786 authors). Then, we administered online surveys to each author to identify relevant predictors of each outcome. Expert surveys have been used for a variety of predictive or forecasting tasks, from projections of fertility, mortality, and immigration (Billari, Graziani, and Melilli 2012; Bijak and Wiśniowski 2010) to measuring the quality of democracy (Pemstein et al. 2015) and to school planning (Raftery et al. 2012). Experienced researchers carry a wealth of knowledge about the relationships between variables and outcomes in these data, not all of which is published. By surveying researchers, we hoped to recover otherwise inaccessible knowledge at relatively low cost and in no more than a few days. We also fielded the same survey to a comparison sample of laypeople that we crowd-sourced using MTurk.
To elicit expert and lay beliefs, we used a wiki survey, a format we chose to maximize accessibility, efficiency, and openness to new knowledge (Salganik and Levy 2015). We asked participants to choose which of two randomly selected predictors was likely to best predict a given outcome. These predictors were initially drawn from a list of 27 suggested by a group of researchers familiar with the FFC, but participants could add candidate predictors to the list (which would then be voted on by subsequent participants). As we explain in the Appendix, these predictors were higher-level concepts rather than specific variables. We used the data from the online surveys to generate an ordered list of candidate predictors, scoring each predictor as the number of times it was voted for divided by the number of times it appeared in a pair. Further details about the surveys are included in Appendix A.
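The scoring rule above can be sketched in a few lines. The predictor names and votes below are hypothetical; in the actual wiki survey, votes came from pairwise comparisons shown to participants.

```python
from collections import defaultdict

def score_predictors(pairwise_votes):
    """Score each candidate predictor as wins / appearances.

    `pairwise_votes` is a list of (winner, loser) tuples, one per
    pairwise comparison a participant answered.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in pairwise_votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {p: wins[p] / appearances[p] for p in appearances}

# Hypothetical votes over three candidate predictors
votes = [
    ("mother's education", "household income"),
    ("mother's education", "child's birth weight"),
    ("household income", "child's birth weight"),
]
scores = score_predictors(votes)
# "mother's education" won both of its appearances -> score 1.0
```

Because each predictor appears in a random subset of pairs, dividing by appearances rather than ranking raw vote counts keeps the scores comparable across predictors that were shown different numbers of times.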
Overall, 104 of 786 sampled experts participated, generating 2,651 votes. Seven hundred laypeople participated in our MTurk surveys, generating 27,221 votes. We used the variables identified through the expert and MTurk surveys in two different ways for our predictions. First, we used them to subset the data. Together, the expert and MTurk surveys yielded 68 higher-level concepts, which we associated with 271 variables from the FFCWS data set. We took these 271 variables as a single, wiki survey–generated subset. Second, we used the rankings generated by the expert and MTurk surveys directly, as information passed to an ML algorithm. In this case, this yielded two approaches rather than one: one using expert scores and one using lay scores. Details are provided in the section “Models.”
Imputation
Because most ML approaches require a numeric and complete data set, processing the FFCWS data to handle missingness was a crucial step in preparing variables for modeling. To appreciate the extent of this problem, note that all observations had some missingness on some variables, which implies that there would have been no observations left with listwise deletion. Data were missing for different reasons, including unwillingness to respond, “don’t know” responses, logical skips, panel attrition, anonymization of sensitive information, and error. Roughly 74 percent of the data were missing in a way that posed problems for prediction (Figure 1). In a complex study such as this, the problems posed by missingness are particularly acute. We thus explored different imputation approaches with trade-offs in terms of efficiency and effectiveness (Appendix B). Because our different imputation strategies make different assumptions, we produced five distinctly imputed data sets on the basis of three unique approaches.
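As a minimal illustration of the simplest strategy we used, mean imputation, the sketch below fills missing entries with column means. The assumption that missing values are coded as negative numbers is illustrative only; the FFCWS files distinguish several kinds of missingness with different codes.

```python
import numpy as np

def mean_impute(X):
    """Replace missing entries (here assumed to be coded as negative
    values) with the mean of the observed values in each column."""
    X = X.astype(float).copy()
    X[X < 0] = np.nan                     # recode missing-value codes as NaN
    col_means = np.nanmean(X, axis=0)     # per-column mean over observed values
    rows, cols = np.where(np.isnan(X))    # locations of missing entries
    X[rows, cols] = col_means[cols]       # fill each with its column mean
    return X

X = np.array([[ 1.0, -9.0],
              [ 3.0,  4.0],
              [-1.0,  6.0]])
X_imp = mean_impute(X)
# column 0 mean over observed {1, 3} is 2; column 1 mean over {4, 6} is 5
```

Mean imputation ignores relationships among variables, which is exactly the information that regression-based and multiple imputation try to exploit; the trade-off is computational cost, as discussed below.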

Figure 1. Missing data.
Models
We modeled the six outcomes with regularized regression. Regularization is an ML technique that can improve prediction on new data by avoiding overfitting on training data (James et al. 2017). Regularized models can be fit with large numbers of variables and relatively few observations. Regularized regression biases or shrinks model coefficients toward zero, relative to their maximum likelihood estimators, by applying a penalty to the likelihood function. Each nonzero coefficient has an associated cost.
Absent other information, this cost is the same for every variable. If outside information warrants, however, the penalty can be relaxed for specific variables. The human knowledge of variable rankings captured through the scores from our survey is precisely this kind of information, and we drew on these scores to relax the penalties for the associated variables to differing degrees. For each scored variable, the global shrinkage parameter λ, which determines the overall degree of regularization, was multiplied by a local, variable-specific penalty factor, with higher-scoring variables receiving smaller factors and thus weaker shrinkage.
We fit linear regressions for the continuous outcomes (GPA, grit, and material hardship) and logistic regressions for the binary outcomes (eviction, layoff, and job training). We used the implementation of regularized regression, with an “elasticnet” penalty, from the glmnet R package (Friedman, Hastie, and Tibshirani 2010). Appendix C describes the statistical and mathematical details of our models.
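A sketch of the score-based penalty relaxation: glmnet accepts variable-specific factors directly through its penalty.factor argument; scikit-learn's Lasso applies a uniform penalty, but for the lasso case the same effect can be obtained by dividing each column by its factor and rescaling the fitted coefficient back. The data, scores, and mapping from scores to factors here are all illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)               # a high-scoring predictor (drives y)
x2 = rng.normal(size=n)               # a low-scoring predictor (pure noise)
y = 2.0 * x1 + 0.01 * rng.normal(size=n)

# Survey-derived penalty factors: higher score -> smaller factor -> weaker
# shrinkage. The specific values are an assumption for illustration.
factors = np.array([1.0, 4.0])
X = np.column_stack([x1, x2])

# Penalizing factors[j] * |b_j| is equivalent to fitting a uniform-penalty
# lasso on X[:, j] / factors[j] and dividing the fitted coefficient by
# factors[j] afterward.
model = Lasso(alpha=0.1).fit(X / factors, y)
coefs = model.coef_ / factors
```

The rescaling identity holds exactly only for the L1 (lasso) part of the penalty; with an elastic net, as in our glmnet models, the L2 term is rescaled as well, so glmnet's built-in penalty factors are the cleaner route.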
Results
In all, we explored 25 different approaches to prediction, distinguished by choices made at the following stages: (1) how we imputed missing observations, (2) whether we subsetted the data set prior to prediction and in what way, and (3) whether we incorporated outside knowledge into our modeling and in what way. As discussed, we considered five types of imputed data sets, three approaches to subsetting (no subsetting, subsetting to the variables identified by our wiki survey, and subsetting to the constructed variables identified by the Fragile Families team), and three approaches to incorporating scores (expert scores, MTurk scores, and no scores). There were thus 45 possible permutations across these methods; of these, we focused on 25. Limitations of time and other resources narrowed the models we could run. For instance, the multiple imputation (MI) method we chose could not be run on the full data set of 13,000 variables using available computational resources.
These 25 approaches can be compared in terms of mean squared error (MSE) (Figure 2). However, because we did not fill the permutation space, it is difficult to rank the performance of choices at any given stage: in an unfilled permutation space, an unrestricted comparison of any set of choices does not hold all other strategies constant, and because the analytic choices we made affect our predictions, such a comparison is invalid. For example, the fact that we used mean imputation with six subsetting and scoring approaches, but MI with only three, skews any comparison of the five imputation choices. Therefore, when considering the best strategy in any given dimension, we restrict ourselves to that part of the permutation space in which we can compare across the relevant choices (Figure 3). We identify the best approach as the choice that minimizes the average or median MSE across all other approaches and outcomes (Figure 4). This illustrates the relative rankings of these approaches, but the differences in performance also vary in magnitude. Therefore, we also report the improvement made by any given approach, which we calculate as the average percentage improvement in MSE relative to the outcome-specific baseline MSE (Figure 5).
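The improvement metric is simple arithmetic: the percentage reduction in MSE relative to an outcome-specific baseline, averaged across outcomes. The numbers below are hypothetical.

```python
def pct_improvement(mse, baseline_mse):
    """Percentage reduction in MSE relative to an outcome-specific baseline."""
    return 100.0 * (baseline_mse - mse) / baseline_mse

# Two hypothetical outcomes: MSE 0.36 against a baseline of 0.40 is a
# 10 percent improvement; 0.57 against 0.60 is a 5 percent improvement.
improvements = [pct_improvement(m, b) for m, b in [(0.36, 0.40), (0.57, 0.60)]]
avg_improvement = sum(improvements) / len(improvements)
```

Normalizing by each outcome's own baseline keeps outcomes on different scales (e.g., GPA versus a binary eviction indicator) comparable before averaging.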

Figure 2. MSEs from approaches relevant to human-in-the-loop rankings.

Figure 3. Permutation space of possible and relevant approaches.

Figure 4. Rankings by lowest average and median MSE.

Figure 5. Average percentage reduction in MSE.
In what follows we consider what our results suggest for four different questions: (1) how to impute, (2) whether to subset, (3) whether to incorporate scores, and finally (4) whether it makes sense to include humans in the loop at all (whether by informed subsetting, or scoring, or both).
Imputation
How should researchers approach issues of missingness? Overall, our results suggest that MI is best. If researchers have the computational power to pursue this approach, they should. Note, though, that by the metric of average MSE, the next best strategy is simple mean imputation and that the dividends to MI are not obviously enormous (Figure 4a). MI results in a 4.94 percent reduction in MSE relative to baseline, on average, whereas mean imputation results in a 4.61 percent reduction (Figure 5a). So, where resource constraints are an issue, mean imputation may be a viable alternative. Also, regression-based imputation methods do not clearly outperform simple mean imputation, which is noteworthy given their additional computational costs.
Subsetting
Does it make sense to preemptively subset the data before modeling? Most social science researchers who use these data no doubt do, because it is impossible for humans to make much sense of thousands of variables. It is thus tempting to do the same in a prediction exercise of this kind. Yet our results suggest that human-informed subsetting does not pay off: whether judged by average or median MSE, approaches that retained the full data set outperformed those that subsetted it first.
Interestingly, the two strategies that involve subsetting are not clearly distinguishable in terms of their predictive performance. By average MSE, it seems preferable to subset to the variables from our wiki survey, but by median MSE, the constructed variables fare better. In one sense, this is as encouraging as it is surprising. The constructed variables represent the considered judgment of people with experience in the field and with the FFCWS, whereas the wiki survey variables were selected in a few days and at low cost by an anonymous community of experts and laypeople. Of course, the wiki survey was fielded within the context of the FFC with the clearly assigned task of identifying predictors for the outcomes, whereas the constructed variables were not generated explicitly for this prediction task. Nevertheless, we find there is not much to distinguish them, and if anything, the wiki survey variables perform better (Figure 5b).
Scoring
Is it useful to incorporate human knowledge into the modeling process, as described earlier? Not really, according to either of the metrics we use to rank approaches. Whether measured by average or median MSE, approaches that ignore scores altogether outperform approaches that use expert or lay scores. For advocates of an approach that marries the powers of machines to human wisdom, this is disheartening. However, there are at least two caveats. First, the differences in performance are very small. On average, as Figure 5c shows, approaches that do not use scores reduce MSE relative to baseline by about 5.39 percent, compared with 5.29 percent and 5.25 percent for experts and MTurk users, respectively. Second, as we argue below, our approach to knowledge incorporation was ad hoc. As long as it is possible to imagine better ways of incorporating human knowledge into the loop, future research should consider them.
Humans in the Loop?
Does all this suggest that there is no role for humans in the loop? Not entirely. By average MSE, the best approach overall is one that does not subset and does not incorporate outside knowledge. Yet, again, the differences between this and the next-best (and, indeed, the third-best) approaches are slight: a reduction of 7.67 percent versus 7.59 percent. Furthermore, if ranked by lowest median MSE, our best-performing approach does enlist humans: one that incorporates expert scores while not subsetting the data set (Figure 4d). The discrepancy between the average and median MSE rankings is explained by the very poor performance of the no-subsetting, no-scores approach in predicting the layoff of a child’s primary caregiver. This may suggest that outside information is useful for some outcomes but not others. One possible interpretation of this result is that strategies using expert scores are more robust to bad performance on a single outcome.
What is clear from our results is that if humans are to enter the loop, it ought not to be by preemptively subsetting the data but rather by incorporating their wisdom into an approach that still leverages ML to extract information from the full data set. Making use of the full data set may not always be possible, as exemplified by the computational constraints we faced in generating a fully imputed data set with MI. However, when it is possible, it can usefully augment prediction. Our approach incorporating scores on the basis of expert surveys fared better as a human-in-the-loop strategy. Although neither our approach to generating scores from the wiki surveys nor our way of incorporating them in the models is dispositive, we believe that such approaches, with further refinement, may hold promise for human-in-the-loop strategies. In short, although there is obviously important information that only machines pick up, strategies that incorporate human knowledge to tune parameters in a model merit further exploration.
Conclusions
In this article, we considered different ways of tackling a difficulty faced by researchers seeking to use survey data sets for prediction, namely, that the large ratio of variables to observations makes informed variable selection difficult. To tackle this problem, we proposed a low-cost way to mine a scholarly community for insights. We considered ways to use this information to subset a data set preemptively or at the modeling stage (or both together).
What did we find? First, our results do not recommend preemptively subsetting the data. This is common practice in social science research, which is understandable, because social scientists are often more concerned with description and explanation than with prediction, and humans cannot make much theoretical sense of thousands of variables. But for prediction purposes, this approach discards useful information: approaches that relied on it fared worse than approaches that did not. Second, we find some evidence that human insight can help when it enters not by discarding variables but as scores that tune the model’s penalties; by median MSE, our best-performing approach used expert scores.
What, then, is the future of humans in the loop? We believe that future research should consider at least two types of improvements to our approach. First, the response rate of our expert survey was low: improving this would make it much easier to compare the dividends of surveying experts rather than laypeople. We expect that experts bring knowledge that laypeople do not, but our results do not clearly demonstrate this. Second, future work should consider alternative ways to incorporate human knowledge into ML models. We did so in an ad hoc way, but better formalization of our intuition and better use of the scores in modeling will surely help in deciding the place of humans in the loop, going forward.
In closing, this project considered whether approaches from the tradition of informative, human-centered modeling can be usefully combined with ML techniques. We found that their combination is not always profitable but also that their judicious combination may yet be useful.
Supplemental Material
Supplemental material (SRD-17-0125) for “Humans in the Loop: Incorporating Expert and Crowd-sourced Knowledge for Predictions Using Survey Data” by Anna Filippova, Connor Gilroy, Ridhi Kashyap, Antje Kirchner, Allison C. Morgan, Kivan Polimis, Adaner Usmani, and Tong Wang, published in Socius.
