Guarding against false positive selections is important in many applications. We discuss methods based on subsampling and sample splitting for controlling the expected number of false positives and for assigning p-values. These methods are generic and especially useful in high-dimensional settings. We review encouraging results for regression, and we discuss new adaptations and remaining challenges for selecting, from observational data, the variables that have a causal or interventional effect on a response of interest.
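The subsampling idea can be illustrated with a minimal sketch in the spirit of stability selection: fit a sparse estimator (here the Lasso) on many random subsamples of half the data, record how often each variable is selected, and keep only variables whose selection frequency exceeds a threshold. The simulated data, the regularization strength `alpha = 0.3`, the number of subsamples `B = 100`, and the frequency threshold `0.6` are illustrative assumptions, not the exact procedure or tuning of the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Simulated data: n = 100 samples, p = 200 variables, first 5 truly active.
n, p = 100, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(n)

B = 100        # number of random subsamples (assumed)
alpha = 0.3    # Lasso regularization strength (assumed, not tuned)
sel_freq = np.zeros(p)

for _ in range(B):
    # Draw a subsample of size n/2 without replacement.
    idx = rng.choice(n, size=n // 2, replace=False)
    fit = Lasso(alpha=alpha, max_iter=5000).fit(X[idx], y[idx])
    sel_freq += (fit.coef_ != 0)

sel_freq /= B
# Keep variables selected in a large fraction of the subsamples.
stable = np.where(sel_freq >= 0.6)[0]
print("stably selected variables:", stable)
```

Because a noise variable must be picked up in most of the B subsamples to cross the threshold, spurious selections are heavily penalized; the stability-selection theory makes this precise by bounding the expected number of false positives in terms of the threshold.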