Sage Journals: Discover world-class research

Abstract

Although high-dimensional data analysis has received a lot of attention after the advent of omics data, model selection in this setting continues to be challenging and there is still substantial room for improvement. Through a novel combination of existing methods, we propose here a two-stage subsampling approach for variable selection in high-dimensional generalized linear regression models. In the first stage, we screen the variables using smoothly clipped absolute deviance penalty regularization followed by partial least squares regression on repeated subsamples of the data; we include in the second stage only those predictors that were most frequently selected over the subsamples either by smoothly clipped absolute deviance or for having the top loadings in either of the first two partial least squares regression components. In the second stage, we again repeatedly subsample the data and, for each subsample, we find the best Akaike information criterion model based on an exhaustive search of all possible models on the reduced set of predictors. We then include in the final model those predictors with high selection probability across the subsamples. We prove that the proposed first-stage estimator is $n^{1 / 2}$ -consistent and that the true predictors are included in the first stage with probability converging to 1. In an extensive simulation study, we show that this two-stage approach outperforms the competitors yielding among the highest probability of selecting the true model while having one of the lowest number of false positives in the settings of logistic, Poisson, and linear regression. We illustrate the proposed method on two gene expression cancer datasets.

Keywords

Subsampling high-dimensional regression stability selection smoothly clipped absolute deviance partial least squares regression variable selection

Get full access to this article

View all access options for this article.

References

Tibshirani

. Regression shrinkage and selection via the lasso. J R Statist Soc B 1996; 58: 267–288.

Leng

Lin

Wahba

. A note on the lasso and related procedures in model selection. Stat Sin 2006; 16: 1273–1284.

Zhang

Huang

. The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann Stat 2008; 36: 1567–1594.

Zou

Hastie

. Regularization and variable selection via the elastic net. J R Statist Soc B 2005; 67: 301–320.

Yuan

Lin

. Model selection and estimation in regression with grouped variables. J R Statist Soc B 2006; 68: 49–67.

Zou

. The adaptive lasso and its oracle properties. J Am Stat Assoc 2006; 101: 1418–1429.

Shao

. Regularizing LASSO: a consistent variable selection method. Stat Sin 2015; 25: 975–992.

Fan

. Variable selection via noncave penalized likelihood and its oracle properties. J Am Stat Assoc 2001; 96: 1348–1359.

Zhang

. Nearly unbiased variable selection under minimax concave penalty. Ann Stat 2010; 38: 894–942.

10.

Efron

Hastie

Johnstone

, et al. Least angle regression. Ann Stat 2004; 32: 407–499.

11.

Wold

. Estimation of principal components and related models by iterative least squares. In: Krishnajah

(ed) Multivariate analysis. New York: Academic Press, 1966, pp.391–420.

12.

Helland

. On the structure of partial least squares regression. Commun Stat- Simul Comput 1988; 17: 581–607.

13.

Höskuldsson

. PLS regression methods. J Chemom 1988; 2: 211–228.

14.

Garthwaite

. An interpretation of partial least squares. J Am Stat Assoc 1994; 89: 122–127.

15.

Meinshausen

Bühlmann

. Stability selection. J R Statist Soc B 2010; 72: 417–473.

16.

Shah

Samworth

. Variable selection with error control: another look at stability selection. J R Statist Soc B 2013; 75: 55–80.

17.

Bogdan

van den Berg

Sabatti

, et al. SLOPE-adaptive variable selection via convex optimization. Ann Appl Stat 2015; 9: 1103–1140.

18.

Huang

. Controlling the false discoveries in LASSO. Biometrics 2017; 73: 1102–1110.

19.

Bogdan

Candès

. False discoveries occur early on the lasso path. Ann Statist 2017; 45: 2133–2150.

20.

Candès

Fan

Janson

. Panning for gold: model-X knockoffs for high dimensional controlled variable selection. J R Stat Soc: Ser B 2018; 80: 551–577.

21.

Lee

Sun

, et al. Exact post-selection inference with application to the LASSO. Ann Statist 2016; 44: 907–927.

22.

Tibshirani

Taylor

Lockhart

, et al. Exact post-selection inference for sequential regression procedures. J Am Stat Assoc 2016; 111: 600–620.

23.

Fithian

Sun

Taylor

. Optimal inference after model selection. arXiv:1410.2597v4, 2017.

24.

Taylor

Tibshirani

. Post-selection inference for l1-penalized likelihood models. Cana J Stat 2018; 46: 41–61.

25.

Giurcanu

. Thresholding least-squares inference in high-dimensional regression models. Electron J Stat 2016; 10: 2124–2156.

26.

Wang

Sarkar

Carbonetto

, et al. A simple new approach to variable selection in regression, with application to genetic fine mapping. J R Stat Soc Ser B: Stat Methodol 2020; 82: 1273–1300.

27.

Zheng

Hong

. Building generalized linear models with ultrahigh dimensional features: a sequentially conditional approach. Biometrics 2020; 76: 47–60.

28.

Chen

Gao

Liang

, et al. Nonlinear variable selection via deep neural networks. J Comput Graph Stat 2021; 30: 484–492.

29.

Fan

Sun

. Subsampling from features in large regression to find “winning features”. Stat Anal Data Min 2021; 14: 168–184.

30.

Zeng

Mai

Zhang

. Subspace estimation with automatic dimension and variable selection in sufficient dimension reduction. J Am Stat Assoc 2022; 0: 1–13.

31.

Ghosh

. Bayesian variable selection and estimation of group lasso. Bayes Anal 2015; 10: 909–936.

32.

Yang

Wainwright

Jordan

. On the computational complexity of high-dimensional Bayesian variable selection. Ann Stat 2016; 44: 2497–2532.

33.

Papaspiliopoulos

Rossell

. Bayesian block-diagonal variable selection and model averaging. Biometrika 2017; 104: 343–359.

34.

Song

Sun

, et al. Extended stochastic gradient Markov chain Monte Carlo for large-scale Bayesian variable selection. Biometrika 2020; 107: 997–1004.

35.

Liu

Roĉková

. Variable selection via Thompson sampling. J Am Stat Assoc 2023; 118: 287–304.

36.

Dutta

Roy

. Model based screening embedded Bayesian variable selection for ultra-high dimensional settings. J Comput Graph Stat 2023; 32: 61–73.

37.

McCullagh

Nelder

. Generalized linear models. London: Chapman and Hall, 1989.

38.

Fan

Song

. Sure independence screening in generalized linear models with

n p

-dimensionality. Ann Stat 2010; 38: 3567–3604.

39.

Fan

. Non-concave penalized likelihood with np-dimensionality. IEEE Trans Inf Theory 2011; 57: 5467–5484.

40.

Kwon

Kim

. Large sample properties of the SCAD-penalized maximum likelihood estimation on high dimensions. Stat Sin 2012; 22: 629–653.

41.

Breheny

Huang

. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat 2011; 5: 232–253.

42.

Fan

. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B Methodol 2008; 70: 849–911.

43.

Fan

Samworth

. Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 2009; 10: 2013–2038.

44.

Bastien

Vinzi

Tenenhaus

. PLS generalised linear regression. Comput Stat Data Anal 2005; 48: 17–46.

45.

Killick

Fearnhead

Aston

, et al. Optimal detection of changepoints with a linear computational cost. J Am Stat Assoc 2012; 107: 1590–1598.

46.

Capanu

Giurcanu

Begg

, et al. Optimized variable selection via repeated data splitting. Stat Med 2020; 39: 2167–2184.

47.

Capanu

Giurcanu

Begg

, et al. Subsampling based variable selection for generalized linear models. Comput Stat Data Anal 2023; 184: 107740.

48.

Nengsih

Bertrand

Myriam

, et al. Determining the number of components in PLS regression on incomplete data set. Stat Appl Genet Mol Biol 2019; 18: 20180059.

49.

Meyer

Maumy

Bertrand

. Comparaison de la regression PLS et de la regression logistique PLS: application aux donnees d’allelotypage. J de la Soc Franc Stat 2010; 151: 1–18.

50.

Bertrand

Maumy

. Partial least squares regression for generalized linear models, 2023. https://fbertran.github.io/homepage/. R package version 1.5.1.

51.

van der Vaart

. Asymptotic statistics. New York: Springer-Verlag, 1998.

52.

Friedman

Hastie

Tibshirani

. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010; 33: 1–22.

53.

Singh

Febbo

Ross

, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002; 1: 203–209.

54.

Alfons

Croux

Gelper

. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann Appl Stat 2013; 7: 226–248.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.26 MB

Two-stage subsampling variable selection for sparse high-dimensional generalized linear models

Abstract

Keywords

Get full access to this article

References

Supplementary Material