Abstract
In this article, we describe an approach assisted by the least absolute shrinkage and selection operator (LASSO; Tibshirani 1996) for predicting material hardship and other measures of child well-being for children at age 15. Material hardship is a measure of extreme poverty, first developed by Mayer and Jencks (1989), that aggregates positive responses to a set of survey questions. We use data from the Fragile Families and Child Wellbeing Study. To tackle the twin issues of missing data and variable selection, our approach consists of multiple steps: cleaning, preprocessing using LASSO, model-based imputation, and prediction using LASSO.
We apply this approach to predict material hardship, along with five other outcomes concerning children's performance and welfare: grade point average (GPA), grit, job training, eviction, and layoff. We submit our results to the Fragile Families Challenge (FFC). The FFC is a mass collaborative effort aimed at producing research, and informing policy, that addresses the challenges facing fragile families in the United States. It invites scholars to predict the six aforementioned outcomes using data from the Fragile Families and Child Wellbeing Study, which is representative of births in large U.S. cities between 1998 and 2000. These data are based on mother and father interviews conducted at children's birth and at years 1, 3, 5, and 9. 1 The study therefore has many advantages over similar surveys, chief among which is an oversample of nonmarital births (3:1) for which interviews were conducted with both mothers and fathers, yielding rich information about both parents (Reichman et al. 2001). The lessons learned from these prediction exercises mark an important step toward accomplishing the FFC mission. 2
The rest of this article is organized as follows. First we introduce LASSO as our main method. We then document our procedures of data cleaning, preprocessing, imputation, and prediction. Next we report the performance of our approach. Finally, we discuss the results by highlighting the importance of predictors from mother surveys and components of material hardship measured in the past.
LASSO as the Main Method
The use of LASSO underpins our strategy. In our approach, LASSO is used twice: first to preprocess the data and then to train prediction models. LASSO handles high-dimensional data (i.e., the number of covariates can be larger than that of units) well because its penalization shrinks tiny coefficients to exactly zero. Selecting variables by zeroing out coefficients also makes postestimation analysis easier, as the number of covariates becomes much smaller, which is advantageous for preprocessing the high-dimensional FFC data set. In addition, LASSO helps avoid overfitting to the training data via regularization. This feature is helpful for building prediction models.
Given the training data $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i$ is the outcome and $x_i \in \mathbb{R}^p$ is the vector of covariates for unit $i$, LASSO estimates the coefficient vector by solving

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \lVert \beta \rVert_1, \tag{1}$$

where $\lambda \ge 0$ is a tuning parameter governing the strength of the penalty. Here, variables are standardized to have zero mean and unit variance so that regularization on coefficients is not affected by the original scale of input variables and the intercept can be omitted from equation 1. One property of LASSO is that estimated coefficients can be exactly zero (i.e., it can achieve variable selection). For a new input $x_{\text{new}}$, the predicted outcome is $\hat{y} = x_{\text{new}}^{\top}\hat{\beta}$.

For binary outcomes, we use logistic regression with the same $\ell_1$ penalty,

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \; -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i x_i^{\top}\beta - \log\!\left( 1 + e^{x_i^{\top}\beta} \right) \right] + \lambda \lVert \beta \rVert_1, \tag{2}$$

which corresponds to minimizing the negative log likelihood of the model with an $\ell_1$ penalty on the coefficients.
We predict probabilities, instead of classes, for binary outcomes, as the FFC recommends.
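As an illustrative sketch of this setup, not the article's actual implementation (which uses the R package glmnet), an L1-penalized logistic regression that outputs predicted probabilities can be written in Python with scikit-learn; the data here are toy values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the FFC covariates and a binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# Standardize covariates so the L1 penalty is not affected by scale.
X_std = StandardScaler().fit_transform(X)

# C is the inverse of the penalty strength lambda in equation 2.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X_std, y)

# Predicted probabilities (not class labels), as the FFC recommends.
probs = model.predict_proba(X_std)[:, 1]
```

The L1 penalty can zero out some coefficients here just as in equation 2, although glmnet's coordinate-descent implementation and penalty parameterization differ in the details.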
Procedures of Data Preprocessing and Prediction
This section details our procedures of data cleaning, preprocessing, imputation, and prediction. 3
Step 1: Cleaning
We immediately drop any variable for which more than 60 percent of observations are coded NA (not applicable; i.e., missing) or negative; in this data set, negative values indicate different types of missingness. Such a high degree of missingness prevents a variable from conveying useful information for prediction. We treat categorical variables as ordinal variables and apply the same cleaning rule. This procedure reduces the number of potential covariates from 12,942 to 4,207. We further exclude variables that either record only the date of the survey or have standard deviations less than 0.01, leaving us with 4,187 variables. 4
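As a minimal sketch of this cleaning rule, in Python with pandas rather than the authors' own code, and with a toy three-variable data frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw covariate matrix; in the real FFC data,
# negative codes mark different types of missingness.
df = pd.DataFrame({
    "a": [1, -9, -9, -9, -9],    # 80% missing codes -> dropped
    "b": [1, 2, 3, np.nan, 5],   # 20% missing -> kept
    "c": [2, 2, 2, 2, 2],        # zero variance -> dropped later
})

# Treat NA and negative values alike as missing.
missing = df.isna() | (df < 0)
df = df.loc[:, missing.mean() <= 0.60]

# Drop near-constant variables (standard deviation < 0.01).
df = df.loc[:, df.std(ddof=1).fillna(0) >= 0.01]
print(list(df.columns))  # -> ['b']
```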
Step 2: Preprocessing with LASSO to Assist Imputation
We want to identify a small set of covariates from these 4,187 variables. Missing values in this smaller set will then be imputed with Amelia, a model-based imputation algorithm proposed by King et al. (2001). 5 To arrive at this set, we first mean-impute the covariates and then run LASSO. We use LASSO here not to make immediate predictions but to determine a small set of variables for further use. To the best of our knowledge, no prior study has used LASSO as a preprocessing tool in preparation for further imputation with model-based methods.
We regress each of the six outcomes separately on the mean-imputed covariates in the FFC data using LASSO. 6 For each of the six sets of results, we drop the covariates with exactly zero coefficients. We then take the union of the six sets of remaining variables. This procedure leaves us with 339 covariates, listed in Table A9 in the Appendix.
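A sketch of this preprocessing step, in Python with scikit-learn rather than glmnet, using simulated data and two toy outcomes in place of six:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 300, 50
X_full = rng.normal(size=(n, p))
outcomes = {  # toy stand-ins for the six FFC outcomes
    "gpa": X_full[:, 0] + 0.5 * X_full[:, 1] + rng.normal(size=n),
    "grit": X_full[:, 2] + rng.normal(size=n),
}

# Punch out 10 percent of entries to mimic missingness.
X = X_full.copy()
X[rng.random(X.shape) < 0.10] = np.nan

# Mean-impute, then standardize.
X_imp = SimpleImputer(strategy="mean").fit_transform(X)
X_std = StandardScaler().fit_transform(X_imp)

# Run LASSO per outcome; keep covariates with nonzero coefficients,
# then take the union across outcomes.
selected = set()
for name, y in outcomes.items():
    fit = Lasso(alpha=0.1).fit(X_std, y)
    selected |= set(np.flatnonzero(fit.coef_).tolist())

print(sorted(selected))  # union of covariate indices surviving any outcome
```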
Step 3: Model-based Imputation with Amelia
We locate these 339 covariates (obtained with LASSO) in the original (i.e., pre-mean-imputation) data set and run the model-based imputation algorithm Amelia on them, so that they enter our final prediction process with their missing values imputed in a principled manner. Amelia jointly models the variables with a multivariate normal distribution and estimates the model with the expectation-maximization (EM) algorithm, iterating between updates of the model parameters (the mean vector and covariance matrix) and the missing values until convergence. We use model-based imputation here because the covariates are correlated with one another; Amelia, which fully exploits this correlation structure, should therefore impute missing values more accurately.
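Amelia itself is an R package. As a hedged Python analogue, scikit-learn's IterativeImputer, which iteratively regresses each variable on the others rather than fitting Amelia's joint multivariate normal via EM, can illustrate the same motivation of exploiting the correlation structure on toy data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)
# Four strongly correlated columns sharing a common factor.
X = np.column_stack([z + 0.1 * rng.normal(size=n) for _ in range(4)])
X_miss = X.copy()
X_miss[rng.random(X_miss.shape) < 0.2] = np.nan

# Each variable is iteratively regressed on the others, so imputations
# exploit the correlation structure -- the same motivation as Amelia's
# joint-normal EM, though the algorithm differs.
X_imp = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_miss)

# With highly correlated columns, model-based imputation should beat
# column means; compare errors on the masked entries.
mask = np.isnan(X_miss)
mean_imp = np.where(mask, np.nanmean(X_miss, axis=0), X_miss)
err_model = np.abs(X_imp[mask] - X[mask]).mean()
err_mean = np.abs(mean_imp[mask] - X[mask]).mean()
print(err_model < err_mean)  # model-based error is smaller here
```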
Tables 1 and 2 summarize how many covariates survive each step of the cleaning, preprocessing, and imputation stages.
Table 1. Number of Predictors Remaining after Each Data Preprocessing and Imputation Step.
Table 2. The Two Rows Detail the Number of NA Observations Remaining after the Outcome Variables Were Imputed (Total 2,121 Units).
After data cleaning and imputation for covariates, we also impute the outcome variables. We create an outcome matrix with columns corresponding to each outcome variable and impute missing cells using Amelia. Outcomes for the same individual can be highly correlated. Information borrowed across outcomes should therefore improve the prediction of each outcome. 7 Tables 1 and 2 document the results of outcome imputation. Figure A1 in the Appendix shows correlations among outcome variables after imputation. Figure A2 displays distributions of imputed versus actual data among the six outcomes.
Step 4: Using LASSO (Again) to Predict Six Outcomes
After these three steps, we train prediction models with LASSO for each outcome using the R package glmnet (Simon et al. 2011). Binomial link (equation 2) is used for binary outcomes (eviction, layoff, and job training), and the linear model (equation 1) is used for GPA, grit, and material hardship. We choose tuning parameters by fivefold cross-validation for each outcome separately and select values that minimize mean squared error (MSE).
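A sketch of this tuning step, using scikit-learn's LassoCV on toy data in place of glmnet's cv.glmnet, selecting the penalty by fivefold cross-validated MSE:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 30))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=400)

X_std = StandardScaler().fit_transform(X)

# Fivefold cross-validation over a grid of penalties; the alpha that
# minimizes cross-validated MSE is retained, mirroring cv.glmnet.
model = LassoCV(cv=5, random_state=0).fit(X_std, y)
print(model.alpha_)                 # selected penalty
print(np.flatnonzero(model.coef_))  # indices of surviving covariates
```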
Results
The first row of Figure 1 displays the densities of out-of-sample predictions, in-sample fitted values, and in-sample training data for continuous outcomes. The second row shows separation plots (Greenhill, Ward, and Sacks 2011) for binary outcomes.

Figure 1. Density plot (first row) and separation plot (second row) for predicted outcomes. First row: red solid lines represent out-of-sample predicted outcomes; blue dotted lines represent in-sample fitted values; black dashed lines are densities of outcomes in the training data. Second row: separation plots for binary outcomes. Predicted probabilities for the training set are sorted from left (minimum) to right (maximum) and colored by the actual outcome; the blue vertical lines occur at points where the observation takes the value 1 rather than 0. The superimposed black curve represents the predicted probabilities for the testing data set.
Table 3 reports the MSEs of our predictions. 8 “Final model” refers to results obtained using the approach described in this article. Each MSE for the “winning model” is the MSE obtained by the team that won the FFC for the corresponding outcome. All other models come from post-FFC analysis, in which we replicate our analysis (1) using the sample mean of the imputed outcomes in the testing data as the predicted value for all testing units (“null model”), (2) skipping the Amelia imputation step and instead using mean imputation for all missing values (“mean imputation”), and restricting the covariates to (3) mother survey items only (“mother model”) and (4) father survey items only (“father model”).9,10
Table 3. Results of Predictions (MSE on Holdout Data).
The out-of-sample prediction of material hardship using our approach achieves an MSE of 0.019, the lowest among all FFC submissions for this variable. Our approach was also competitive for GPA and job training: among 163 submissions, it ranks 30th for each of these outcomes but below 100th for the other three.
Regarding our models, we note that our approach, in general, performs better than the “null model” and the “mean imputation” model. However, the “mean imputation” model still performs comparably well, suggesting that Amelia imputation might not have improved the prediction as much as expected. Results for variables from mother surveys compared with those from father surveys are discussed next.
Discussion
In this section we focus our discussion on material hardship. LASSO selects 72 variables for the final prediction, listed with their coefficients in Appendix Table A1. Our results indicate that variables from mother surveys are more helpful than those from father surveys in predicting material hardship.
Below we rank the selected variables by the magnitude of their rescaled coefficients. Because glmnet returns coefficients on the original scale of each variable, we rescale them manually, which approximates the standardized coefficients: letting $\hat{\beta}_j$ be the glmnet estimate for covariate $j$ and $s_j$ its empirical standard deviation, we compute the rescaled coefficient $\tilde{\beta}_j = \hat{\beta}_j / s_j$.
The variable with the largest coefficient magnitude is whether the school instruction language is an Asian language for the child in year 9 (t5e7_3). However, in the original data, only 2 people answered “yes” and 2,004 answered “no” to this survey question, with 52.7 percent of observations missing; the variation that drives our prediction mostly comes from imputation. In addition, when rescaling coefficients, we divide glmnet estimates by empirical standard deviations, a procedure that mechanically produces large (rescaled) coefficients when the original variable has little variation. Figure 2A shows the variables with the second through sixth largest coefficient magnitudes. They are (1) whether the child in year 9 asks no one for help or advice other than the mother (m5e9_0), (2) whether the mother in year 3 was evicted from home in the past year (m3i23c), (3) whether the mother in year 3 was helped by an employment office since the child’s first birthday (m3i7f), (4) whether the mother in year 5 could not complete mortgage payments for the past 12 months because there was not enough money (m4i23d), and (5) whether the mother in year 9 received free food or meals over the past 12 months (m5f23a). (In the FFC variable names, survey wave 5 corresponds to the year-9 interviews.)
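As an illustrative sketch of this rescaling, in Python with hypothetical coefficient values (the article's actual estimates are in Appendix Table A1), dividing each estimate by the covariate's empirical standard deviation inflates the rescaled coefficient of a low-variance item:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

# Hypothetical glmnet-style estimates on the original variable scale,
# for illustration only.
coefs = np.array([0.05, 0.5, -0.3])

rare = np.zeros(n)
rare[:10] = 1  # rare binary item -> tiny standard deviation
X = np.column_stack([rare, rng.normal(size=n), rng.normal(size=n)])

# Per the procedure in the text: divide each estimate by the covariate's
# empirical standard deviation. Small variation mechanically inflates
# the rescaled coefficient.
sds = X.std(axis=0, ddof=1)
rescaled = coefs / sds
ranking = np.argsort(-np.abs(rescaled))
print(ranking)  # the low-variance item ranks first here
```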

Figure 2. (A) Count plot of “yes” answers for each variable. Variable names refer to the following: m5e9_0, only person from whom the child seeks help; m3i23c, evicted from home; m3i7f, helped by employment office; m4i23d, could not pay mortgage; m5f23a, received free food or meals; and f3i6a, telephone disconnected. (B) Proportion plot of mother-survey (gray) and father-survey (blue) variables in the data set at each stage of preprocessing and prediction. The leftmost two bars correspond to the original data set; the middle two bars correspond to the imputed data set after removing missing variables, preprocessing with the least absolute shrinkage and selection operator (LASSO), and imputing with Amelia; and the rightmost two bars correspond to variables selected by the LASSO in predicting material hardship. We calculate the proportions by counting the variable names whose prefixes begin with the letter m (mother surveys) or f (father surveys).
We draw two main conclusions. First, these variables share one common characteristic: they all come from mother surveys. The most predictive variable from father surveys, whether the father in year 3 had the telephone disconnected in the past 12 months (f3i6a), ranks 11th among our 72 selected variables. We further verify the performance gap between mother- and father-survey items in a post hoc analysis that compares prediction results obtained using only mother-survey variables against those obtained using only father-survey variables. As shown in Table 3, the MSE for material hardship is 0.019 for the former (as low as that obtained using the full LASSO-assisted approach described in this article) and 0.024 for the latter. Notably, the mother-only model yields better predictions than our full approach for four of the six outcomes.
One may attribute the performance gap between mother-survey and father-survey variables in step 4 of our LASSO-assisted approach to various factors. For one, variables from father surveys are more likely to have a substantial amount of missing values and so are less likely to survive the initial data-cleaning stage, in which we delete variables according to the 60 percent cutoff rule described earlier. Figure 2B shows that items from father surveys begin to have much lower proportions than those from mother surveys at the imputation stage. Yet the difference in proportions increases further after LASSO, indicating that the performance gap is more than an artifact of data cleaning. Moreover, it may be interesting in itself that father-survey variables are more likely to suffer from missing values than mother-survey variables. We acknowledge, however, that prediction is distinct from causal inference: whether the greater importance of mother-survey predictors indicates anything causal about the substantive importance of the mother’s role in family welfare, childcare, or children’s education goes beyond the scope of this article.
Second, our results suggest that past outcomes can effectively predict current outcomes in panel data. The questions from which variables 2, 4, and 5 are constructed, as well as the top-ranked variables from father surveys, are similar to questions asked in the year-15 primary caregiver survey that in turn form 4 of the 11 components of material hardship. Social scientists have long used past outcomes to predict future ones; Hegre et al. (2013), for example, show that a country’s recent history of armed conflict is a robustly effective predictor of its future conflict. Whether past material hardship causes future material hardship or simply reflects unobserved underlying causes that are correlated across time may be a subject of future research.
Supplemental Material
Supplemental material, SRD-17-0119, for “Using LASSO to Assist Imputation and Predict Child Well-being” by Diana Stanescu, Erik Wang, and Soichiro Yamauchi, published in Socius.