Abstract

Widespread availability of rich educational databases facilitates the use of conditioning strategies to estimate causal effects with nonexperimental data. With dozens, hundreds, or more potential predictors, variable selection can be useful for practical reasons related to communicating results and for statistical reasons related to improving the efficiency of estimators. Background knowledge should take precedence in deciding which variables to retain. However, with many potential predictors, theory may be weak, such that functional form relationships are likely to be unknown. In this article, I propose a nonparametric method for data-driven variable selection based on permutation testing with conditional random forest variable importance. The algorithm automatically handles nonlinear relationships and interactions in its naive implementation. Through a series of Monte Carlo simulation studies and a case study with Early Childhood Longitudinal Study–K data, I find that the method performs well across a variety of scenarios where other methods fail.

Keywords

nonparametric conditional independence test; causal inference; variable selection; average treatment effect; random forest; permutation test
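
The permutation-testing idea summarized in the abstract can be illustrated with a minimal sketch. This is not the article's algorithm: in place of conditional random forest variable importance it uses a much simpler stand-in score (the share of outcome variance explained by quantile bins of a single predictor), and it tests each predictor marginally rather than conditionally. The function names, the number of bins, the number of permutations, and the 0.05 threshold are all assumptions made for illustration.

```python
import random


def binned_importance(x, y, bins=4):
    """Stand-in importance score (an assumption, not random forest importance):
    share of the variance of y explained by quantile bins of x."""
    n = len(y)
    grand = sum(y) / n
    total = sum((v - grand) ** 2 for v in y)
    if total == 0:
        return 0.0
    # Sort observation indices by x so that equal-sized slices form quantile bins.
    order = sorted(range(n), key=lambda i: x[i])
    between = 0.0
    for b in range(bins):
        idx = order[b * n // bins:(b + 1) * n // bins]
        if idx:
            bin_mean = sum(y[i] for i in idx) / len(idx)
            between += len(idx) * (bin_mean - grand) ** 2
    return between / total


def permutation_p_value(x, y, n_perm=200, rng=None):
    """Compare the observed importance with its distribution under shuffled x."""
    rng = rng or random.Random(0)
    observed = binned_importance(x, y)
    shuffled = list(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any association between x and y
        if binned_importance(shuffled, y) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic as one of the permutations.
    return (exceed + 1) / (n_perm + 1)


def select_variables(columns, y, alpha=0.05, n_perm=200, seed=0):
    """Keep predictors whose permutation p-value is at or below alpha."""
    rng = random.Random(seed)
    return [name for name, x in columns.items()
            if permutation_p_value(x, y, n_perm, rng) <= alpha]


# Simulated example: y depends on x1 through a quadratic (nonlinear) term,
# while x2 is pure noise.
sim = random.Random(1)
n_obs = 300
x1 = [sim.uniform(-2, 2) for _ in range(n_obs)]
x2 = [sim.gauss(0, 1) for _ in range(n_obs)]
y = [a * a + sim.gauss(0, 0.5) for a in x1]
selected = select_variables({"x1": x1, "x2": x2}, y)
print(selected)
```

Because the score is computed within bins rather than from a linear fit, the quadratic dependence on `x1` is detected even though its linear correlation with `y` is close to zero. A pure-noise predictor such as `x2` would still be retained in roughly a fraction `alpha` of repeated runs, as with any test at that level.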

