Abstract
In this article, we describe an approach assisted by the least absolute shrinkage and selection operator (LASSO; Tibshirani 1996) for predicting material hardship and other measures of child well-being for children at age 15. Material hardship is a measure of extreme poverty, first developed by Mayer and Jencks (1989), that aggregates positive responses to a set of survey questions. We use data from the Fragile Families and Child Wellbeing Study. To tackle the twin issues of missing data and variable selection, our approach consists of multiple steps: cleaning, preprocessing using LASSO, model-based imputation, and prediction using LASSO.
We apply this approach to predict material hardship, along with five other outcomes concerning children's performance and welfare: grade point average (GPA), grit, job training, eviction, and layoff. We submit our results to the Fragile Families Challenge (FFC). The FFC is a mass collaborative effort aimed at producing research, and informing policy, that addresses the challenges facing fragile families in the United States. It invites scholars to predict the six aforementioned outcomes using data from the Fragile Families and Child Wellbeing Study, which is representative of births in large U.S. cities between 1998 and 2000. These data are based on mother and father interviews conducted at children's birth and at years 1, 3, 5, and 9. 1 The study therefore has many advantages over similar surveys, chief among which is an oversample of nonmarital births (3:1) for which interviews were conducted with both mothers and fathers, yielding rich information about both parents (Reichman et al. 2001). The lessons learned from these prediction exercises mark an important step toward accomplishing the FFC mission. 2
The rest of this article is organized as follows. First we introduce LASSO as our main method. We then document our procedures of data cleaning, preprocessing, imputation, and prediction. Next we report the performance of our approach. Finally, we discuss the results by highlighting the importance of predictors from mother surveys and components of material hardship measured in the past.
LASSO as the Main Method
The use of LASSO underpins our strategy. In our approach, LASSO is used twice: first to preprocess the data and then to train prediction models. LASSO handles high-dimensional data (i.e., the number of covariates can be larger than that of units) well because its penalization shrinks tiny coefficients to exactly zero. Selecting variables by zeroing out coefficients also makes postestimation analysis easier, as the number of covariates becomes much smaller, which is advantageous for preprocessing the high-dimensional FFC data set. In addition, LASSO helps avoid overfitting to the training data via regularization. This feature is helpful for building prediction models.
Given the training data $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i$ is the outcome and $x_i \in \mathbb{R}^p$ is the vector of covariates for unit $i$, LASSO estimates the coefficient vector by solving

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \lVert \beta \rVert_1, \tag{1}$$

where $\lambda \ge 0$ is a tuning parameter governing the strength of the penalty. Here, variables are standardized to have zero mean and unit variance so that regularization on coefficients is not affected by the original scale of input variables and the intercept can be omitted from equation 1. One property of LASSO is that estimated coefficients can be exactly zero (i.e., it can achieve variable selection). For a new input $x_{\text{new}}$, the predicted outcome is $\hat{y} = x_{\text{new}}^{\top}\hat{\beta}$.

For binary outcomes, we use logistic regression with the same $\ell_1$ penalty,

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \; -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i x_i^{\top}\beta - \log\!\left( 1 + e^{x_i^{\top}\beta} \right) \right] + \lambda \lVert \beta \rVert_1, \tag{2}$$

which corresponds to minimizing the negative log likelihood of the model with an $\ell_1$ penalty on the coefficients.
We predict probabilities, instead of classes, for binary outcomes, as the FFC recommends.
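As an illustrative sketch of this setup, not the article's actual implementation (which uses the R package glmnet), an L1-penalized logistic regression that outputs predicted probabilities can be written in Python with scikit-learn; the data here are toy values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the FFC covariates and a binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# Standardize covariates so the L1 penalty is not affected by scale.
X_std = StandardScaler().fit_transform(X)

# C is the inverse of the penalty strength lambda in equation 2.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X_std, y)

# Predicted probabilities (not class labels), as the FFC recommends.
probs = model.predict_proba(X_std)[:, 1]
```

The L1 penalty can zero out some coefficients here just as in equation 2, although glmnet's coordinate-descent implementation and penalty parameterization differ in the details.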
Procedures of Data Preprocessing and Prediction
This section details our procedures of data cleaning, preprocessing, imputation, and prediction. 3
Step 1: Cleaning
We immediately drop any variable for which more than 60 percent of observations are coded NA (not applicable; i.e., missing) or negative; in this data set, negative values indicate different types of missingness. Such a high degree of missingness prevents a variable from conveying useful information for prediction. We treat categorical variables as ordinal variables and apply the same cleaning rule. This procedure reduces the number of potential covariates from 12,942 to 4,207. We further exclude variables that either record only the date of the survey or have standard deviations less than 0.01, leaving us with 4,187 variables. 4
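As a minimal sketch of this cleaning rule, in Python with pandas rather than the authors' own code, and with a toy three-variable data frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw covariate matrix; in the real FFC data,
# negative codes mark different types of missingness.
df = pd.DataFrame({
    "a": [1, -9, -9, -9, -9],    # 80% missing codes -> dropped
    "b": [1, 2, 3, np.nan, 5],   # 20% missing -> kept
    "c": [2, 2, 2, 2, 2],        # zero variance -> dropped later
})

# Treat NA and negative values alike as missing.
missing = df.isna() | (df < 0)
df = df.loc[:, missing.mean() <= 0.60]

# Drop near-constant variables (standard deviation < 0.01).
df = df.loc[:, df.std(ddof=1).fillna(0) >= 0.01]
print(list(df.columns))  # -> ['b']
```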
Step 2: Preprocessing with LASSO to Assist Imputation
We want to identify a small set of covariates from these 4,187 variables. Missing values in this smaller set will then be imputed with Amelia, a model-based imputation algorithm proposed by King et al. (2001). 5 To arrive at this set, we first mean-impute the covariates and then run LASSO. We use LASSO here not to make immediate predictions but to determine a small set of variables for further use. To the best of our knowledge, no prior study has used LASSO as a preprocessing tool in preparation for further imputation with model-based methods.
We regress each of the six outcomes separately on the mean-imputed covariates in the FFC data using LASSO. 6 For each of the six sets of results, we drop the covariates with exactly zero coefficients. We then take the union of the six sets of remaining variables. This procedure leaves us with 339 covariates, listed in Table A9 in the Appendix.
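A sketch of this preprocessing step, in Python with scikit-learn rather than glmnet, using simulated data and two toy outcomes in place of six:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 300, 50
X_full = rng.normal(size=(n, p))
outcomes = {  # toy stand-ins for the six FFC outcomes
    "gpa": X_full[:, 0] + 0.5 * X_full[:, 1] + rng.normal(size=n),
    "grit": X_full[:, 2] + rng.normal(size=n),
}

# Punch out 10 percent of entries to mimic missingness.
X = X_full.copy()
X[rng.random(X.shape) < 0.10] = np.nan

# Mean-impute, then standardize.
X_imp = SimpleImputer(strategy="mean").fit_transform(X)
X_std = StandardScaler().fit_transform(X_imp)

# Run LASSO per outcome; keep covariates with nonzero coefficients,
# then take the union across outcomes.
selected = set()
for name, y in outcomes.items():
    fit = Lasso(alpha=0.1).fit(X_std, y)
    selected |= set(np.flatnonzero(fit.coef_).tolist())

print(sorted(selected))  # union of covariate indices surviving any outcome
```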
Step 3: Model-based Imputation with Amelia
We locate these 339 covariates (obtained with LASSO) in the original (i.e., pre-mean-imputation) data set and run the model-based imputation algorithm Amelia on them, so that they enter our final prediction process with their missing values imputed in a principled manner. Amelia jointly models the variables with a multivariate normal distribution and estimates the model with the expectation-maximization (EM) algorithm, iterating between updates of the model parameters (the mean vector and covariance matrix) and the missing values until convergence. We use model-based imputation here because the covariates are correlated with one another; Amelia, which fully exploits this correlation structure, should therefore impute missing values more accurately.
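Amelia itself is an R package. As a hedged Python analogue, scikit-learn's IterativeImputer, which iteratively regresses each variable on the others rather than fitting Amelia's joint multivariate normal via EM, can illustrate the same motivation of exploiting the correlation structure on toy data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)
# Four strongly correlated columns sharing a common factor.
X = np.column_stack([z + 0.1 * rng.normal(size=n) for _ in range(4)])
X_miss = X.copy()
X_miss[rng.random(X_miss.shape) < 0.2] = np.nan

# Each variable is iteratively regressed on the others, so imputations
# exploit the correlation structure -- the same motivation as Amelia's
# joint-normal EM, though the algorithm differs.
X_imp = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_miss)

# With highly correlated columns, model-based imputation should beat
# column means; compare errors on the masked entries.
mask = np.isnan(X_miss)
mean_imp = np.where(mask, np.nanmean(X_miss, axis=0), X_miss)
err_model = np.abs(X_imp[mask] - X[mask]).mean()
err_mean = np.abs(mean_imp[mask] - X[mask]).mean()
print(err_model < err_mean)  # model-based error is smaller here
```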
Tables 1 and 2 summarize how many covariates survive each step of the cleaning, preprocessing, and imputation stages.
Table 1. Number of Predictors Remaining after Each Data Preprocessing and Imputation Step.
Table 2. The Two Rows Detail the Number of NA Observations Remaining after the Outcome Variables Were Imputed (Total 2,121 Units).
After data cleaning and imputation for covariates, we also impute the outcome variables. We create an outcome matrix with columns corresponding to each outcome variable and impute missing cells using Amelia. Outcomes for the same individual can be highly correlated. Information borrowed across outcomes should therefore improve the prediction of each outcome. 7 Tables 1 and 2 document the results of outcome imputation. Figure A1 in the Appendix shows correlations among outcome variables after imputation. Figure A2 displays distributions of imputed versus actual data among the six outcomes.
Step 4: Using LASSO (Again) to Predict Six Outcomes
After these three steps, we train prediction models with LASSO for each outcome using the R package glmnet (Simon et al. 2011). Binomial link (equation 2) is used for binary outcomes (eviction, layoff, and job training), and the linear model (equation 1) is used for GPA, grit, and material hardship. We choose tuning parameters by fivefold cross-validation for each outcome separately and select values that minimize mean squared error (MSE).
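A sketch of this tuning step, using scikit-learn's LassoCV on toy data in place of glmnet's cv.glmnet, selecting the penalty by fivefold cross-validated MSE:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 30))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=400)

X_std = StandardScaler().fit_transform(X)

# Fivefold cross-validation over a grid of penalties; the alpha that
# minimizes cross-validated MSE is retained, mirroring cv.glmnet.
model = LassoCV(cv=5, random_state=0).fit(X_std, y)
print(model.alpha_)                 # selected penalty
print(np.flatnonzero(model.coef_))  # indices of surviving covariates
```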
Results
The first row of Figure 1 displays the densities of out-of-sample predictions, in-sample fitted values, and in-sample training data for continuous outcomes. The second row shows separation plots (Greenhill, Ward, and Sacks 2011) for binary outcomes.

Figure 1. Density plot (first row) and separation plot (second row) for predicted outcomes. First row: red solid lines represent out-of-sample predicted outcomes; blue dotted lines represent in-sample fitted values; black dashed lines are densities of outcomes in the training data. Second row: separation plots for binary outcomes. Predicted probabilities for the training set are sorted from left (minimum) to right (maximum) and colored by the actual outcome; the blue vertical lines occur at points where the observation takes the value 1 rather than 0. The superimposed black curve represents the predicted probabilities for the testing data set.
Table 3 reports the MSEs of our predictions. 8 “Final model” refers to results obtained using the approach described in this article. Each MSE for the “winning model” is the MSE obtained by the team that won the FFC for the corresponding outcome. All other models come from post-FFC analysis, in which we replicate our analysis (1) using the sample mean of the imputed outcomes in the testing data as the predicted value for all testing units (“null model”), (2) skipping the Amelia imputation step and instead using mean imputation for all missing values (“mean imputation”), and restricting the covariates to (3) mother survey items only (“mother model”) and (4) father survey items only (“father model”).9,10
Table 3. Results of Predictions (MSE on Holdout Data).
The out-of-sample prediction of material hardship using our approach achieves an MSE of 0.019, the lowest among all FFC submissions for this variable. Our approach was also competitive for GPA and job training: among 163 submissions, it ranks 30th for each of these outcomes but below 100th for the other three.
Regarding our models, we note that our approach, in general, performs better than the “null model” and the “mean imputation” model. However, the “mean imputation” model still performs comparably well, suggesting that Amelia imputation might not have improved the prediction as much as expected. Results for variables from mother surveys compared with those from father surveys are discussed next.
Discussion
In this section we focus our discussion on material hardship. LASSO selects 72 variables for the final prediction, listed with their coefficients in Appendix Table A1. Our results indicate that variables from mother surveys are more helpful than those from father surveys in predicting material hardship.
Below we rank the selected variables by the magnitude of their rescaled coefficients. Because glmnet returns coefficients on the original scale of each variable, we rescale them manually, which approximates the standardized coefficients: letting $\hat{\beta}_j$ be the glmnet estimate for covariate $j$ and $s_j$ its empirical standard deviation, we compute the rescaled coefficient $\tilde{\beta}_j = \hat{\beta}_j / s_j$.
The variable with the largest coefficient magnitude is whether the school instruction language is an Asian language for the child in year 9 (t5e7_3). However, in the original data, only 2 people answered “yes” and 2,004 answered “no” to this survey question, with 52.7 percent of observations missing; the variation that drives our prediction mostly comes from imputation. In addition, when rescaling coefficients, we divide glmnet estimates by empirical standard deviations, a procedure that mechanically produces large (rescaled) coefficients when the original variable has little variation. Figure 2A shows the variables with the second through sixth largest coefficient magnitudes. They are (1) whether the child in year 9 asks no one for help or advice other than the mother (m5e9_0), (2) whether the mother in year 3 was evicted from home in the past year (m3i23c), (3) whether the mother in year 3 was helped by an employment office since the child’s first birthday (m3i7f), (4) whether the mother in year 5 could not complete mortgage payments for the past 12 months because there was not enough money (m4i23d), and (5) whether the mother in year 9 received free food or meals over the past 12 months (m5f23a). (In the FFC variable names, survey wave 5 corresponds to the year-9 interviews.)
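As an illustrative sketch of this rescaling, in Python with hypothetical coefficient values (the article's actual estimates are in Appendix Table A1), dividing each estimate by the covariate's empirical standard deviation inflates the rescaled coefficient of a low-variance item:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

# Hypothetical glmnet-style estimates on the original variable scale,
# for illustration only.
coefs = np.array([0.05, 0.5, -0.3])

rare = np.zeros(n)
rare[:10] = 1  # rare binary item -> tiny standard deviation
X = np.column_stack([rare, rng.normal(size=n), rng.normal(size=n)])

# Per the procedure in the text: divide each estimate by the covariate's
# empirical standard deviation. Small variation mechanically inflates
# the rescaled coefficient.
sds = X.std(axis=0, ddof=1)
rescaled = coefs / sds
ranking = np.argsort(-np.abs(rescaled))
print(ranking)  # the low-variance item ranks first here
```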

Figure 2. (A) Count plot of “yes” answers for each variable. Variable names refer to the following: m5e9_0, only person from whom the child seeks help; m3i23c, evicted from home; m3i7f, helped by employment office; m4i23d, could not pay mortgage; m5f23a, received free food or meals; and f3i6a, telephone disconnected. (B) Proportion plot of mother-survey (gray) and father-survey (blue) variables in the data set at each stage of preprocessing and prediction. The leftmost two bars correspond to the original data set; the middle two bars correspond to the imputed data set after removing missing variables, preprocessing with the least absolute shrinkage and selection operator (LASSO), and imputing with Amelia; and the rightmost two bars correspond to variables selected by the LASSO in predicting material hardship. We calculate the proportions by counting the variable names whose prefixes begin with the letter m (mother surveys) or f (father surveys).
We draw two main conclusions. First, these variables share one common characteristic: they all come from mother surveys. The most predictive variable from father surveys, whether the father in year 3 had the telephone disconnected in the past 12 months (f3i6a), ranks 11th among our 72 selected variables. We further verify the performance gap between mother- and father-survey items in a post hoc analysis that compares prediction results obtained using only mother-survey variables against those obtained using only father-survey variables. As shown in Table 3, the MSE for material hardship is 0.019 for the former (as low as that obtained using the full LASSO-assisted approach described in this article) and 0.024 for the latter. Notably, the mother-only model yields better predictions than our full approach for four of the six outcomes.
One may attribute the performance gap between mother-survey and father-survey variables in step 4 of our LASSO-assisted approach to various factors. For one, variables from father surveys are more likely to have a substantial amount of missing values and so are less likely to survive the initial data-cleaning stage, in which we delete variables according to the 60 percent cutoff rule described earlier. Figure 2B shows that items from father surveys begin to have much lower proportions than those from mother surveys at the imputation stage. Yet the difference in proportions increases further after LASSO, indicating that the performance gap is more than an artifact of data cleaning. Moreover, it may be interesting in itself that father-survey variables are more likely to suffer from missing values than mother-survey variables. We acknowledge, however, that prediction is distinct from causal inference: whether the greater importance of mother-survey predictors indicates anything causal about the substantive importance of the mother’s role in family welfare, childcare, or children’s education goes beyond the scope of this article.
Second, our results suggest that past outcomes can effectively predict current outcomes in panel data. The questions from which variables 2, 4, and 5 are constructed, as well as the top-ranked variables from father surveys, are similar to questions asked in the year-15 primary caregiver survey that in turn form 4 of the 11 components of material hardship. Social scientists have long used past outcomes to predict future ones; Hegre et al. (2013), for example, show that a country’s recent history of armed conflict is a robustly effective predictor of its future conflict. Whether past material hardship causes future material hardship or simply reflects unobserved underlying causes that are correlated across time may be a subject of future research.
Supplemental Material
Supplemental material, SRD-17-0119, for “Using LASSO to Assist Imputation and Predict Child Well-being” by Diana Stanescu, Erik Wang, and Soichiro Yamauchi, published in Socius.