The Fragile Families and Child Wellbeing Study (FFCWS) is a longitudinal, birth cohort study run by researchers at Princeton University and Columbia University (The Trustees of Princeton University 2018). The study follows a group of nearly 5,000 American children born between 1998 and 2000 and includes a large oversample of children born to unmarried parents (Reichman et al. 2001). The aim of the study is to characterize the relationships and conditions of unmarried parents and to study the cognitive development, mental and physical health, and social relationships of children born into such families.
The Fragile Families Challenge (FFC) is a mass collaboration social science data challenge designed to harness the predictive power of the FFCWS data set (Salganik et al. 2019). The FFC invites community members to use the data to build models that best predict six key outcomes: grade point average (GPA), grit, material hardship, eviction, job loss, and job training. In this article, we focus on predicting GPA only. It is our personal belief that a child’s GPA is very important, as it sets the tone for the rest of a child’s life and influences the range of opportunities afforded to the child (e.g., college acceptance, scholarships, admittance into competitive summer enrichment programs).
Out-of-the-box machine learning libraries such as scikit-learn and access to open data sets hosted on popular platforms such as Kaggle enable users from across the globe to create sophisticated predictive models with sometimes impressive predictive accuracy without ever needing to understand the underlying data (Kaggle 2017; Pedregosa et al. 2011). This is in stark contrast to traditional methods of predictive modeling and data analysis undertaken by researchers in fields other than engineering, specifically the social sciences. In survey research, a popular measurement technique used in applied social research, the data are often very complex (Trochim 2006). They can span many years, in the case of longitudinal studies, and are susceptible to various sources of error: coverage error, sampling error, nonresponse error, and measurement error (Visser, Krosnick, and Lavrakas 2000). Thus, best practices in survey research call on researchers to spend substantial time with the data—to “make friends with their data”—and to refrain from “throwing their data into a computer and trying to analyze it in minutes” (Wright 2003). Failure to do so could lead to spurious results and misleading conclusions, and researchers run the risk of misidentifying associations as statistically significant (Kelley et al. 2003).
McFarland, Lewis, and Goldberg (2016) argued that while sociologists are driven by theory and the desire to explain the patterns observed in the data, engineers are focused on creating algorithmic tools to increase the predictive accuracy of their models, without placing much importance on the explanation. But what if the only metric of success is predictive accuracy? To what extent would an engineer be rewarded for “befriending” the data? Using the FFC as our backdrop, we seek to answer whether engineers get better predictive results when they spend a little time learning the domain they are working in and, if so, how much better are these results.
In this article, we use the term
We divided the project into 2 steps. In step 1, we used a completely automatic approach that does not consider the data (the norm in data mining) to fit 124 models for GPA prediction. In step 2, we attempted to improve upon our results using a strategy that combines engineering-centric statistical analysis techniques with classical, more manual social science methodologies: we examined each variable in the codebook, manually selecting the ones believed to be predictive of academic achievement on the basis of a nonexpert reading of domain-specific research. Results indicate that, in most cases, it pays for engineers to “make friends” with the FFCWS codebooks. We were able to improve the predictive accuracy of 6 of the 10 top step 1 models, of which 4 saw significant improvements. However, manual variable selection did not improve the predictive ability of the 2 most accurate models from step 1.
The article is structured as follows. First we describe the procedures used to create the initial set of 124 models. We then describe the process of creating the 15 manually curated variable sets. Next we present the results; we show that we were able to improve the predictive ability of almost all the models and demonstrate the effect of each variable subset on the models. We then look at the variables that most predict GPA as identified by the two most accurate models from this project. Finally, we end the discussion with closing remarks. Additional supporting materials can be found in the Appendix, available online.
Step 1: Automatic Variable Selection
The goal of step 1 was to fit a model that could predict year 15 GPA as accurately as possible using a purely automated approach.
Data Preprocessing
With 2,121 samples and 12,942 variables, the FFC data set is a high-dimensional data set. In settings in which the number of variables far exceeds the number of samples, overfitting becomes a problem, and the learned model loses its ability to generalize (Hua et al. 2005). Thus, it is important to preprocess the data to not only reduce the number of variables but also handle missing values and standardize the data.
We tried many different approaches to data preprocessing. We tried almost all combinations of four different decisions: two types of automatic variable selection (
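The preprocessing decisions described above can be sketched as follows. This is a minimal illustration, assuming median imputation and z-score standardization; it is not a reconstruction of the exact FFC pipeline, whose imputation and variable-screening details are described in the text and appendix.

```python
import numpy as np

def preprocess(X):
    """Median-impute missing values column-wise, then z-score standardize.

    A minimal sketch of the kind of preprocessing described in the text,
    not the study's actual pipeline.
    """
    X = np.array(X, dtype=float)  # copy so the caller's data is untouched
    # Column-wise median imputation: replace NaNs with each column's median.
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    # Standardize each column to zero mean and unit variance.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    return (X - mu) / sigma
```

Standardizing here also matters later: it is what makes the fitted coefficients roughly comparable in magnitude.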
Model Selection
We used the following eight model types to fit a total of 124 models. This includes all possible combinations of eight different model types, two types of automatic variable selection (
Ordinary least squares linear regression 1
Least-angle regression (Efron et al. 2004)
Ridge regression (Tikhonov 1963)
Elastic net (Zou and Hastie 2005)
Orthogonal matching pursuit (Cai and Wang 2011)
Least absolute shrinkage and selection operator (LASSO) regression (Tibshirani 1994)
Decision tree regression (Quinlan 1986)
ε-support vector regression with linear kernel (Drucker et al. 1997)
The observant reader will notice that 8 × 2 × 2 × 2 × 2 = 128, but we fit only 124 models. We fit decision tree models only in combination with some type of automatic variable selection. We did not fit these models using the full variable set, because decision trees are very susceptible to overfitting in high-dimensional settings such as this, in which the number of variables greatly outnumbers the number of samples (Pedregosa et al. 2011). This accounts for the missing 4 combinations. 2
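The eight model families listed above map directly onto scikit-learn estimators (Pedregosa et al. 2011). The sketch below fits one of each and scores it by holdout MSE; the estimator classes are real scikit-learn ones, but the hyperparameter values shown are illustrative defaults, not the settings used in this project.

```python
from sklearn.linear_model import (
    ElasticNet, Lars, Lasso, LinearRegression, OrthogonalMatchingPursuit, Ridge,
)
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# One estimator per model family from the list above. Hyperparameter
# values here are illustrative, not those used in the study.
MODELS = {
    "ols": LinearRegression(),
    "lars": Lars(),
    "ridge": Ridge(alpha=1.0),
    "elastic_net": ElasticNet(alpha=0.1),
    "omp": OrthogonalMatchingPursuit(),
    "lasso": Lasso(alpha=0.1),
    "tree": DecisionTreeRegressor(max_depth=5),
    "svr_linear": SVR(kernel="linear"),  # epsilon-SVR with a linear kernel
}

def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit each model family and return its holdout MSE."""
    scores = {}
    for name, model in MODELS.items():
        model.fit(X_train, y_train)
        scores[name] = mean_squared_error(y_test, model.predict(X_test))
    return scores
```

In the actual project, each model family was additionally crossed with the automatic variable selection and preprocessing choices described earlier, yielding the 124 fitted models.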
Results
We used FFC holdout test set mean squared error (MSE) scores (FFC-HO-MSE) to evaluate the accuracy of the models. We chose the MSE metric because it is the metric used to rank and evaluate the predictive validity of the submissions made through the FFC web portal (Salganik et al. 2019). Results from step 1 are summarized in Table 1.
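The MSE used to rank submissions is simply the average squared difference between predicted and observed values; a minimal implementation:

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction residuals."""
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, `mse([3.0, 2.5], [2.5, 3.0])` evaluates to 0.25, and a perfect prediction yields 0. Lower is better throughout this article.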
Evaluation Results for the 10 Most Accurate Models from Step 1.
Step 2: Manual Variable Selection
The goal of step 2 was to improve the predictive accuracy of the models generated in step 1 by combining the previous automatic approaches with manual ones inspired by survey research best practices.
Manual Variable Selection
Our first step in this second phase of the project was to get friendly with the codebooks. We went through each of the 12,942 variables, manually selecting the ones believed to be predictive of future academic achievement. To inform the decision-making process, we turned to a comprehensive review of student success literature, “What Matters to Student Success,” a report commissioned for the National Postsecondary Education Cooperative in 2006 (Kuh et al. 2006). Specifically, we relied on the first section of the report, which discusses the effects of precollege experiences on student success, such as family and peer support, academic preparation, motivation to learn, socioeconomic status, and demographics. Although the report is targeted at student success in college, research has shown that high school grades are also highly correlated with socioeconomic factors such as family income and educational attainment (Zwick and Green 2007). From the National Postsecondary Education Cooperative report, we collated a list of 57 precollege factors that have been shown by social scientists to affect student success. 3
Next, we manually examined each variable in the codebook and made judgment calls to determine whether it was directly related to any one of the 57 factors. It should be noted that we did not calculate intercoder reliability (Lombard, Snyder-Duch, and Bracken 2002). Calculating and reporting the intercoder reliability of this manual process is an area for future work. The result of this process was a custom set of 3,694 variables. 4
In an effort to identify the particular groups of variables most predictive of academic achievement, we created 14 additional, more granular subsets from the manually selected set of 3,694 variables. For example, we created a variable set that contained only wave 3 variables and a different subset that contained only wave 5 variables.
We used a total of 16 variable sets in this project 5 : (1) the original set of 12,942 variables, (2) our manually curated set of 3,694 variables, and (3) 14 additional variable sets, each of which is a subset of the manually selected set of the 3,694 variables (wave 3 only, wave 5 only, etc.). Table 2 summarizes each of these 16 variable sets and provides a shorthand label for each. We use these shorthand labels to reference the various variable sets for the remainder of this article.
Descriptions of Each of the 16 Variable Sets Used in this Project.
Method
We reestimated the 10 most accurate models from step 1 on each of the 15 manually created variable subsets to produce a total of 150 models in this second step of the project. We used the same data-preprocessing procedures and imputation strategies used in step 1. As before, categorical variables were not identified and were not treated differently from the continuous ones. After data imputation, our manually curated variable set was reduced from 3,694 to 3,423 variables. The FFC submission pipeline remained the same.
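The reestimation procedure above amounts to a loop over (model, variable-set) pairs. The sketch below illustrates the shape of that loop; the subset labels and the scoring callback are illustrative stand-ins, not the project's actual code.

```python
def reestimate(models, variable_sets, X, y, fit_and_score_fn):
    """Refit each top step 1 model on each manually curated variable subset.

    `variable_sets` maps a subset label (e.g., "w5") to the column indices
    of X belonging to that subset; `fit_and_score_fn` fits one model on one
    design matrix and returns its evaluation score (e.g., holdout MSE).
    """
    results = {}
    for set_label, cols in variable_sets.items():
        X_subset = X[:, cols]  # restrict to the curated columns
        for model_label, model in models.items():
            results[(model_label, set_label)] = fit_and_score_fn(model, X_subset, y)
    return results
```

With 10 models and 15 subsets, this grid yields the 150 step 2 models reported below.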
Results
Manual variable selection indeed improved, and in some cases dramatically improved, the accuracy of the predictive models trained previously using purely automatic techniques. Table 3 shows that 8 of the 10 most accurate models were trained on the manually created variable sets. Figure 1 shows how substantially manual variable selection improved the FFC-HO-MSE values of the 3rd, 6th, 9th, and 10th most accurate models from step 1. After reestimating model 6 on the w5 variable set, the model rose to become the second most accurate model across both phases of the project, according to FFC-HO-MSE. The accuracies of models 4, 5, and 7 were also improved, but the change in FFC-HO-MSE was more tempered. The two most accurate models from step 1 saw no improvement.
Evaluation Results from the 10 Most Accurate Models across the Entire Project (Steps 1 and 2 Combined).

Effect of manual variable selection on the predictive ability of the 10 most accurate step 1 models. For the step 1 series, in which the full set of 12,942 variables was used to fit the models, the mean squared error (MSE) value is plotted for each model. Recall that in step 2, we reestimated the top 10 step 1 models using each of the 15 manually created variable subsets (the full set of 3,694 manually curated variables plus 14 additional subsets taken from this set of 3,694 variables), giving us 15 MSE scores per model. Thus, for the step 2 series, for each model, we plot the holdout result on the basis of the result with the best leaderboard score.
Effect of Specific Variable Groups on Model Accuracy
A secondary goal was to understand how the various variable groups affected the predictive accuracy of the models trained in step 2 (e.g., do wave 5 variables yield better results than wave 3 variables?). Figure 2 is a 16 × 10 heatmap of FFC-HO-MSE scores from the 10 most accurate step 1 models trained on each of the 16 variable sets from the project, including the full set of 12,942 variables (labeled “All”). The lower the MSE value and the darker the color, the better.

Heatmap of Fragile Families Challenge holdout test set mean squared error (FFC-HO-MSE) scores for the 10 most accurate step 1 models trained on each of the 16 variable sets from the project. The lower the mean squared error (MSE) value and the darker the color, the better. The lowest FFC-HO-MSE value, 0.348, is represented by the color red (model 1, data set All). The highest FFC-HO-MSE value, 0.546, is represented by the color white (model 10, data set w3). A baseline model that takes the mean of each outcome in the training data and predicts that mean value for all observations achieves an MSE value of 0.425 on the holdout data for grade point average (Salganik et al. 2019). Refer to Table 2 for a description of each variable set.
Variable sets w1, w2, and w3 appear to contain the weakest signal across almost all models, and variable set w5 appears to contain the strongest signal, closely followed by w1_5 and w1_5_t_kind. From w1 through w5 we see a gradual strengthening of color across several rows. This pattern and the previous observations suggest that later waves are more predictive of high school GPA than earlier waves. However, not all wave 5 data are created equally. Variable set k contains variables asked of only the child in wave 5, and t_k contains variables asked of the child and the child’s teacher in wave 5 (Salganik et al. 2019). Looking across both columns, we can visually see how FFC-HO-MSE values improved across more than half of the models when input from the teacher was removed. We see a similar phenomenon when comparing the t_kind and t_kind_k columns. The majority of the models seem to improve with added input from the child. It appears that no matter how attentive a parent, teacher, or caretaker may (or may not) be, only the child really knows what he or she is feeling and experiencing on a day-to-day basis. And many of the questions asked of the child in wave 5 attempt to tease out precisely this, questions such as “Frequency kids picked on you or said mean things to you,” “I often feel lonely,” “Frequency kids take your things, like your money or lunch,” and “Amount of time on a weekday you watch TV and movies.”
Variables That Most Predict GPA
An important goal of the FFC is to gain insight into the specific variables that most predict the six outcomes of interest: GPA, grit, material hardship, eviction, layoff, and job training. The hope is that such insights may one day improve the lives of American children born into these “fragile families” (Salganik et al. 2019). Table 4 lists the variables that most predict year 15 GPA according to the two most accurate models from this project on the basis of FFC-HO-MSE scores. The most accurate model used LASSO, and the second most accurate model used elastic net. Coefficients are in parentheses, and variables are listed in order of decreasing absolute coefficient value. Because the data were standardized, we were able to compare variable coefficients according to their relative importance. That is, the higher the absolute value of the coefficient, the higher the level of importance of that particular variable in predicting the desired outcome, which in this case is GPA. In prediction tasks such as these, in which we are predicting a real-valued outcome in a setting with multiple independent variables, coefficients can be interpreted as follows: holding all other variables fixed, the predicted outcome increases (if the sign of the coefficient is positive) or decreases (if the sign of the coefficient is negative) by β1 units for every one-unit increase in the corresponding predictor.
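The ordering used in Table 4 — ranking standardized coefficients by absolute value — can be sketched as a small helper; the variable names in the usage below are placeholders, not actual FFCWS variables.

```python
def rank_coefficients(names, coefs, top=10):
    """Rank fitted regression coefficients by absolute value.

    Because the predictors were standardized, |coefficient| gives a rough
    importance ordering for a linear model such as LASSO or elastic net.
    Returns the `top` (name, coefficient) pairs, largest magnitude first.
    """
    ranked = sorted(zip(names, coefs), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:top]
```

For example, `rank_coefficients(["a", "b", "c"], [0.1, -0.5, 0.3], top=2)` returns `[("b", -0.5), ("c", 0.3)]`. Note that this ordering reflects predictive weight within one fitted model, not statistical significance.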
Variables That Most Predict Grade Point Average within the Two Most Accurate Models from the Project.
It is worth highlighting that although these two best models have almost equal predictive performance on the holdout data, 0.348 and 0.349, respectively, they exhibit very little overlap in the variables each model deems to be of most significance. We saw two different sets of variables returned by two different models of almost equal predictive accuracy, giving us two different pictures of which variables most predict year 15 GPA. In his analysis of the two cultures of statistical modeling, Breiman (2001) argued that in a situation in which “different models, all of them equally good . . . give different pictures of the relation between the predictor and response variables . . . the question of which one most accurately reflects the data is difficult to resolve.” For engineers like us, these difficulties are further compounded by a lack of domain knowledge in the social sciences. Thus, we leave intuitive explanation of these results for future work and collaborations with social scientists. Moreover, further research is required to calculate confidence intervals for the coefficients listed in this section and to begin interpreting the magnitude of the values and features returned.
Conclusions
Using a two-step approach to the FFC, we were able to significantly improve the predictive accuracy of the majority of the models evaluated by using a combined approach of automatic and manual variable selection motivated by social science knowledge. We showed that such an approach, even though it is based on a nonexpert reading of domain-specific research, can improve the accuracy of models trained automatically. We demonstrated that even with careful algorithm selection, in a data setting such as this it pays to take a look at the codebooks. But there is still room for improvement, as our strategy was unable to improve the accuracy of the two most accurate models from step 1; this is an area for future work.
Supplemental Material
Supplemental material, SRD-17-0113, for “Friend Request Pending: A Comparative Assessment of Engineering- and Social Science–Inspired Approaches to Analyzing Complex Birth Cohort Survey Data” by Claudia V. Roberts in Socius.
