Abstract
Introduction
In recent years, advances in high-throughput genomic technologies have led to the availability of high-dimensional datasets, including DNA methylation, messenger ribonucleic acid (mRNA) expression and copy number variation, in addition to traditional clinical variables. These datasets may provide valuable information on the mechanisms of a particular disease, prompting the development of various methods to identify influential genomic and clinical characteristics for improved prognostic modelling.
A common objective in clinical research is the prediction of patient survival outcomes. The Cox proportional hazards (PH) model (Cox, 1972) is widely used for this purpose, as it not only facilitates survival prediction but also enables the assessment of the impact of predictor variables on survival. However, given the high-dimensional nature of genomic datasets, variable selection becomes a critical step in model construction. To address this, an ℓ1-penalized (Lasso) version of the Cox model is commonly employed, which shrinks many coefficients exactly to zero and thereby performs automatic variable selection.
Despite its effectiveness, this approach presents several limitations. First, standard Lasso-based methods do not inherently account for grouped variables, which is particularly relevant in genomic studies where genes are often organized in biological pathways. Ignoring such group structures may lead to suboptimal feature selection and loss of biologically meaningful information. Additionally, large sets of genomic features often overshadow low-dimensional clinical variables, such as tumour size and nodal status. This is a significant drawback, as clinicopathologic variables have been demonstrated to play a crucial role in oncological studies and predictive performance improves when both clinical and genomic data are integrated (Ma et al., 2007; Herrmann et al., 2021). Finally, Lasso-based selection methods have been shown to produce a relatively high rate of false positives in specific settings, which may limit their reliability in time-to-event analysis, depending on the context (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006).
Several statistical methods have been proposed to incorporate grouped variables in the Cox PH model. Although the Elastic Net Cox model (Simon et al., 2011b) does not explicitly enforce group selection, it tends to select correlated variables together, unlike pure Lasso, which typically selects only one from a correlated set. This behaviour results from its combination of ℓ1 and ℓ2 penalties, where the ℓ2 component distributes weight across correlated predictors.
To overcome this limitation of missing group representation, we propose the use of Exclusive Lasso regularization (Campbell and Allen, 2017), which we will extend in this work to the Cox PH model. Our approach adapts the method’s ability to encourage intra-group sparsity through the ℓ1 norm applied within each group, while the ℓ2 norm across groups ensures that every group contributes at least one variable to the model.
In our previous work, we demonstrated the superior performance of Exclusive Lasso over traditional Lasso in GLM settings with high within-group correlation (Ravi and Groll, 2025; note that a preliminary compact version of this work can also be found in Ravi and Groll, 2024). In the current study, we adapt this methodology to time-to-event analysis, introducing it as a practical alternative for selecting informative predictors from different groups and integrating them into a sparse prediction model while ensuring that no group is overlooked.
We assess the performance of our proposed method by comparing it to other approaches that account for grouping effects, such as Elastic Net Cox, Sparse Group Lasso and IPF-Lasso. Across a range of scenarios, our method generally shows improvements over the alternatives in both survival prediction accuracy and selection performance. Additionally, we evaluate the practical applicability of our model by using it for survival prediction in real-world cancer studies. In addition to the standard prediction errors, we compare the biomarkers selected by each model and highlight the importance of Exclusive Lasso in selecting clinical and low-dimensional variables that other models fail to capture.
The remainder of the article is structured as follows. Section 2 introduces the Exclusive Lasso problem in the Cox PH framework. In Section 3, we present the simulation scenarios and compare our method with other Lasso procedures. The applicability of our model is demonstrated in Section 4 using the aforementioned application example. Finally, Section 5 concludes.
Methods
In this section, we first briefly review methods for handling grouped predictors within the Cox PH framework and then introduce the Exclusive Lasso regularization in the Cox PH model.
Let T_i denote the survival time and C_i the censoring time of patient i, i = 1, …, n. We observe t_i = min(T_i, C_i) together with the censoring indicator δ_i = 1(T_i ≤ C_i) and a covariate vector x_i of length p.
The Cox PH model (Cox, 1972) specifies the hazard function for patient i as
h(t | x_i) = h_0(t) exp(x_iᵀβ),
where h_0(t) denotes the unspecified baseline hazard function and β is the vector of regression coefficients.
Estimation of β proceeds by maximizing the partial log-likelihood
ℓ(β) = ∑_{i: δ_i = 1} [ x_iᵀβ − log ∑_{j ∈ R(t_i)} exp(x_jᵀβ) ],
where R(t_i) = {j : t_j ≥ t_i} denotes the set of patients still at risk at time t_i.
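For concreteness, the negative partial log-likelihood can be sketched numerically as follows (assuming no tied event times; the function and variable names are our own illustration, not the paper's implementation):

```python
import numpy as np

def cox_neg_partial_loglik(beta, X, time, event):
    """Negative Cox partial log-likelihood, assuming no tied event times.

    beta  : (p,) coefficient vector
    X     : (n, p) covariate matrix
    time  : (n,) observed times t_i = min(T_i, C_i)
    event : (n,) censoring indicators delta_i (1 = event observed)
    """
    eta = X @ beta                         # linear predictors x_i' beta
    loglik = 0.0
    for i in np.flatnonzero(event):        # sum over uncensored patients only
        at_risk = time >= time[i]          # risk set R(t_i) = {j : t_j >= t_i}
        loglik += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return -loglik
```

With β = 0, every subject in a risk set is equally likely to fail, so each event contributes the log of its risk-set size.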
In high-dimensional scenarios, where the number of covariates p exceeds the number of patients n, the maximizer of the partial likelihood is not unique and regularization is required. A penalized estimate is obtained by minimizing
−ℓ(β) + λ P(β),
where λ ≥ 0 is the penalization parameter. The most common penalty term is the ℓ1 norm, P(β) = ‖β‖₁, which yields the Lasso and shrinks many coefficients exactly to zero.
We focus on variable selection in scenarios where the predictors are divided into predefined, disjoint groups. For instance, in the context of multi-omics data, the variables may include different types, such as genomics, epigenomics and transcriptomics, in addition to clinical and pathological data. We assume that the index set {1, …, p} of the true parameter vector β is partitioned into G predefined, disjoint groups.
Let β^(g) denote the subvector of coefficients belonging to group g, for g = 1, …, G.
Elastic Net
The Elastic Net (Zou and Hastie, 2005) is a regularization method that combines the ℓ1 and ℓ2 penalties,
P(β) = α‖β‖₁ + ((1 − α)/2)‖β‖₂²,
where α ∈ (0, 1) is the mixing parameter that controls the balance between the ℓ1 (Lasso) and ℓ2 (Ridge) components. The ℓ2 part encourages correlated predictors to enter the model together, while the ℓ1 part retains sparsity.
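As a minimal sketch, the penalty value can be evaluated directly (the (1 − α)/2 scaling of the ridge part follows a common glmnet-style convention and is an assumption here):

```python
import numpy as np

def elastic_net_penalty(beta, alpha):
    """Elastic Net penalty: alpha * ||beta||_1 + ((1 - alpha) / 2) * ||beta||_2^2."""
    beta = np.asarray(beta, float)
    return alpha * np.sum(np.abs(beta)) + (1 - alpha) / 2 * np.sum(beta ** 2)
```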
Sparse Group Lasso
The Sparse Group Lasso (Simon et al., 2013) is another method that uses a combination of ℓ1 and group-wise ℓ2 penalties,
P(β) = (1 − α) ∑_{g=1}^{G} √p_g ‖β^(g)‖₂ + α‖β‖₁,
where p_g denotes the number of variables in group g and α ∈ [0, 1] balances sparsity between and within groups. The group-wise ℓ2 term encourages entire groups to be zeroed out, while the ℓ1 term allows sparsity within the selected groups.
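A short sketch of this penalty, with the √p_g group weighting as in Simon et al. (2013) (names and the group-label encoding are our own illustration):

```python
import numpy as np

def sgl_penalty(beta, groups, alpha):
    """Sparse Group Lasso penalty:
    (1 - alpha) * sum_g sqrt(p_g) * ||beta^(g)||_2 + alpha * ||beta||_1,
    where groups[j] is the group label of coefficient j."""
    beta, groups = np.asarray(beta, float), np.asarray(groups)
    group_part = sum(
        np.sqrt(np.sum(groups == g)) * np.linalg.norm(beta[groups == g])
        for g in np.unique(groups)
    )
    return (1 - alpha) * group_part + alpha * np.sum(np.abs(beta))
```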
Integrative Lasso with penalty factors (IPF-Lasso)
The Integrative Lasso with penalty factors (IPF-Lasso; Boulesteix et al., 2017) was introduced for prediction based on multi-omics datasets where there are several modalities (groups) of variables. The main idea of IPF-Lasso is to apply Lasso to each group and introduce penalty factors for different groups of variables, which can be selected according to the desired weighting of the groups or by cross-validation (CV). The IPF-Lasso penalty is defined as
P(β) = ∑_{m=1}^{M} λ_m ‖β^(m)‖₁,
with one penalization parameter λ_m per modality m = 1, …, M.
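The modality-wise weighted ℓ1 penalty can be sketched as follows (the mapping from group labels to λ_m values is our own illustrative encoding):

```python
import numpy as np

def ipf_penalty(beta, groups, lam):
    """IPF-Lasso penalty: sum_m lambda_m * ||beta^(m)||_1, with one
    penalization parameter per modality.  lam maps group label -> lambda_m."""
    beta, groups = np.asarray(beta, float), np.asarray(groups)
    return sum(lam[g] * np.sum(np.abs(beta[groups == g])) for g in np.unique(groups))
```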
Exclusive Lasso
The Exclusive Lasso (Campbell and Allen, 2017) enforces structured sparsity by ensuring that at least one variable is selected from each predefined group. It combines an ℓ1 norm within groups with an ℓ2 norm across groups,
P(β) = (1/2) ∑_{g=1}^{G} ‖β^(g)‖₁²,
so that variables compete within their group while every group retains representation in the model.
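A minimal sketch of this penalty (the factor 1/2 is a scaling convention we assume here):

```python
import numpy as np

def exclusive_lasso_penalty(beta, groups):
    """Exclusive Lasso penalty: 0.5 * sum_g (||beta^(g)||_1)^2,
    i.e. an l1 norm within each group, squared across groups."""
    beta, groups = np.asarray(beta, float), np.asarray(groups)
    return 0.5 * sum(np.sum(np.abs(beta[groups == g])) ** 2 for g in np.unique(groups))
```

For a fixed total ℓ1 mass, concentrating coefficients in one group is penalized more heavily than spreading them across groups: a mass of 2 in a single group costs 0.5·2² = 2, while 1 in each of two groups costs only 0.5·(1² + 1²) = 1. This asymmetry is what drives the method to keep every group represented.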
The composite nature of the penalty term makes the estimation of the Exclusive Lasso problem challenging. Several strategies have been developed to tackle this challenge. One approach utilizes proximal point algorithms based on dual Newton methods (Lin et al., 2020), while others employ iterative re-weighted techniques to refine the estimation process (Kong et al., 2014; Sun et al., 2020). An alternative strategy reformulates the problem in a Lasso framework and applies a bisection algorithm, taking advantage of Lasso’s piecewise linear properties (Sun et al., 2020).
To improve computational efficiency, a fast optimization method based on the Fast Iterative Shrinkage-Thresholding Algorithm has been introduced (Huang and Liu, 2018). Another approach transforms the penalty into a differentiable one by applying a simple quadratic approximation, allowing it to be efficiently solved using a Newton-based algorithm (Ravi and Groll, 2025).
Campbell and Allen (2017), along with the aforementioned studies, proposed the Exclusive Lasso approach for generalized linear models, but to the best of our knowledge, no adaptations to Cox PH models have been implemented. In this work, we extend Exclusive Lasso to the Cox PH model, allowing its group-wise sparsity properties to be applied in a time-to-event setting. To fit the model, we also develop a coordinate descent algorithm with soft-thresholding, specifically adapted for the Cox PH likelihood, which addresses algorithmic challenges that arise in this context.
As highlighted by Campbell and Allen (2017), the Exclusive Lasso penalty is non-separable; that is, it cannot be formulated as a sum of functions depending on individual coefficients only. Consequently, it is not possible to update all coefficients simultaneously in closed form. Instead, we use a coordinate descent algorithm, where each coefficient is updated sequentially while keeping the others fixed.
Our approach builds on a coordinate descent framework originally developed for Group Lasso regularization (Yuan and Lin, 2006), which has been shown to be efficient in high-dimensional settings. The algorithm is summarized in Algorithm 1.
The gradient component for covariate j of the partial log-likelihood is
∂ℓ(β)/∂β_j = ∑_{i: δ_i = 1} [ x_ij − (∑_{k ∈ R(t_i)} x_kj exp(x_kᵀβ)) / (∑_{k ∈ R(t_i)} exp(x_kᵀβ)) ],
where R(t_i) again denotes the risk set at time t_i.
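As a sanity check, this score expression can be verified against finite differences of the partial log-likelihood (a self-contained sketch with illustrative names; no tied event times assumed):

```python
import numpy as np

def cox_score_component(beta, X, time, event, j):
    """Score (gradient) of the Cox partial log-likelihood w.r.t. beta_j:
    for each event, x_ij minus the exp(x'beta)-weighted mean of x_kj
    over the risk set."""
    eta = X @ beta
    grad = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]
        w = np.exp(eta[at_risk])
        grad += X[i, j] - np.sum(w * X[at_risk, j]) / np.sum(w)
    return grad
```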
We update the coefficients cyclically, each via a soft-thresholding step that incorporates the partial penalty contributed by the remaining coefficients of the same group.
Specifically, in each coordinate update, the current coefficient β_j is optimized while all other coefficients are held fixed at their current values.
To define the partial penalty, let s_{−j} = ∑_{k ∈ g(j), k ≠ j} |β_k| denote the sum of the absolute values of the remaining coefficients in the group g(j) that contains variable j.
Here, the exclusion of index j ensures that the penalty acting on β_j depends only on the current magnitude of its within-group competitors.
This penalty encourages competition among variables within the same group, allowing only a few features to be selected. It promotes sparsity by shrinking coefficients, especially when the penalty is large. As a result, β_j is set exactly to zero whenever its unpenalized update does not exceed the threshold induced by λ and s_{−j}.
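To make the update concrete, here is a minimal coordinate descent sketch for the Gaussian (least-squares) analogue of the problem, 0.5‖y − Xβ‖² + (λ/2)∑_g ‖β^(g)‖₁², where the update has a closed form; the Cox version replaces the residual terms with gradient and curvature terms of the partial likelihood. Function names and the simplified setting are our own illustration, not the paper's implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def exclusive_lasso_cd_gaussian(X, y, groups, lam, n_iter=200):
    """Coordinate descent for 0.5*||y - X beta||^2 + (lam/2)*sum_g ||beta^(g)||_1^2.
    Each coefficient is soft-thresholded at lam * s_{-j}, where s_{-j} is the
    l1 mass of the OTHER coefficients in j's group."""
    groups = np.asarray(groups)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]               # partial residual without x_j
            in_group = groups == groups[j]
            s = np.sum(np.abs(beta[in_group])) - abs(beta[j])  # s_{-j}
            z = X[:, j] @ r
            beta[j] = soft_threshold(z, lam * s) / (X[:, j] @ X[:, j] + lam)
    return beta
```

On orthonormal columns with a single informative predictor, the update reduces to z/(1 + λ) for the signal variable and an exact zero for its group competitor.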
Furthermore, we refer readers to Theorem 4 of Campbell and Allen (2017), which provides proof that the Exclusive Lasso coordinate descent algorithm converges to the global minimum in the case of penalized GLMs. This result can be readily adapted to other settings, including ours. Figure 1 displays the regularization paths for both Exclusive Lasso (left) and Lasso (right). The variables are divided into five distinct groups, with each group containing exactly one signal variable and the rest being noise. The example in Figure 1 is simulated from Scenario 5 in Section 3, chosen as an illustrative case that highlights the enhanced performance of Exclusive Lasso. Hence, it represents a simplified setting that is favourable for the performance of Exclusive Lasso. The signal variables are highlighted using different colours to distinguish their respective groups. Exclusive Lasso encourages within-group sparsity, driving most coefficients to zero while retaining only one active variable per group. As a result, it maintains exactly five active variables, one from each group, even at large values of λ. In contrast, Lasso applies shrinkage without regard to group structure and may eliminate informative variables or retain multiple variables from the same group.
The Exclusive Lasso was implemented in
In this section, we present a detailed simulation study to evaluate the performance of our method across different scenarios.
Setting
We simulate
We assume that the variables are divided into either two or five groups and consider eight simulation scenarios for grouping them. Across scenarios, the total number of signal variables is set to 5, 10 or 20.
Table 1 summarizes the grouping structures used. In Scenario 1 and Scenario 5, an equal number of variables is allocated to each group: 250 variables per group in Scenario 1 and 100 variables per group in Scenario 5. These represent ideal settings, as the Exclusive Lasso is expected to perform well when at least one signal variable is present in each group.
Exclusive Lasso coordinate descent for Cox PH model.
In the remaining scenarios, we introduce unequal group sizes and the signal variables are also distributed unequally across groups. Notably, in Scenario 8, one group contains no signal variables. In Scenario 4, the group sizes are 10 and 490; in the smaller group of 10 variables, one variable is categorical with four levels and is designated as a signal variable. This case is designed to mimic real-world settings in which clinical variables can be categorical.
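As an illustration of such a design, the following sketch generates grouped covariates with one signal variable per group and exponential survival times with independent censoring (the specific parameter values, such as the baseline scale and censoring rate, are our own choices in the spirit of Scenario 5, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_groups, group_size = 200, 5, 100            # 5 groups of 100 variables
p = n_groups * group_size
groups = np.repeat(np.arange(n_groups), group_size)

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[::group_size] = 1.0                          # one signal variable per group

# Exponential event times with hazard exp(x' beta); independent censoring
linpred = X @ beta
T = rng.exponential(scale=np.exp(-linpred))       # rate = exp(x' beta)
C = rng.exponential(scale=2.0, size=n)            # censoring times
time = np.minimum(T, C)
event = (T <= C).astype(int)                      # delta_i = 1 if event observed
```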
We simulate an independent validation dataset consisting of
Regularization paths for Exclusive Lasso (left) and Lasso (right) from a simulation study, where variables are evenly distributed into five groups (shown in distinct colours), with each group containing one true signal variable. In the Exclusive Lasso model, the signal variables remain active unless all other variables in their group shrink to zero. In contrast, the Lasso model selects variables without considering the group structure, allowing multiple variables from the same group to be included.
Description of the grouping structure of signal variables across the simulation scenarios.
We report the results using variable selection accuracy, defined as the proportion of true positives and true negatives among all variables, along with the F1 score, false discovery rate (FDR) and integrated Brier score (IBS). The F1 score (Van Rijsbergen, 1979) is defined as the harmonic mean of precision and recall, taking into account both false positives and false negatives. The metric ranges from 0 to 1, with larger values indicating a better balance between precision and recall. The Brier score (Graf et al., 1999) at a given time point t measures the mean squared difference between the observed survival status and the predicted survival probability at t; the IBS aggregates the Brier score over the entire follow-up period, with smaller values indicating better predictive accuracy.
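A simplified numerical sketch of these two quantities (for brevity, this version drops the inverse-probability-of-censoring weights that Graf et al. (1999) use to handle censored observations correctly, and it averages over a discrete time grid rather than integrating):

```python
import numpy as np

def brier_score(t, time, event, surv_prob):
    """Unweighted Brier score at time t; surv_prob[i] = predicted P(T_i > t).
    Subjects censored before t (status at t unknown) are excluded."""
    known = (time > t) | (event == 1)
    status = (time > t).astype(float)        # 1 if still event-free at t
    return np.mean((status[known] - surv_prob[known]) ** 2)

def integrated_brier_score(grid, time, event, surv_fn):
    """Crude IBS: mean Brier score over a grid of evaluation times.
    surv_fn(t) returns the vector of predicted survival probabilities at t."""
    return float(np.mean([brier_score(t, time, event, surv_fn(t)) for t in grid]))
```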
Performance metrics (standard errors in brackets) for Scenarios 1–4; best performing modelling approach per setting in bold font.
We compare our proposed extension of Exclusive Lasso for Cox PH models with the models described in Section 2. We use the implementations available in the
Exclusive Lasso shows consistently strong performance in Scenarios 1, 2, 5 and 6, where there is an equal distribution of signal variables across groups and where group sizes are balanced. In these settings, its ability to enforce within-group competition enables accurate variable selection and yields high F1 scores. In Scenario 4, performance drops slightly because categorical variables were dummy-coded and competition among levels of the same variable was not advantageous. This behaviour is expected: Exclusive Lasso treats each dummy variable as an independent predictor, so different levels of a categorical variable are effectively placed in competition with one another. As a consequence, selection may become inconsistent across levels of the same factor. Nonetheless, Exclusive Lasso still outperforms the other models in terms of overall variable selection. In Scenario 8, its performance decreases more noticeably because one group contains no signal variables. In this case, the model may still select at least one variable from the empty group, which reduces selection accuracy.
Performance metrics (standard errors in brackets) for Scenarios 5–8; best performing modelling approach per setting in bold font.
By contrast, Elastic Net maintains stable but modest performance across all scenarios. Since it does not explicitly account for group structure, its performance is unaffected by variations in group size or distribution of signal variables. However, this stability comes at the cost of weaker selection accuracy compared to Exclusive Lasso.
When evaluating prediction accuracy using the IBS, Elastic Net and IPF-Lasso sometimes outperform Exclusive Lasso. Although their selection ability is limited, these models still provide accurate survival predictions. This is because, in the presence of highly correlated predictors, they tend to distribute effects across correlated variables instead of selecting a single one. While this behaviour reduces variable selection quality, it improves calibration of predicted survival probabilities, which benefits IBS.
Group Lasso performs poorly in all scenarios. This is expected because the simulation design requires selecting variables across groups rather than entire groups. Group Lasso instead selects variables from a single group, which consistently lowers its performance.
Overall, Exclusive Lasso emerges as the best-performing method in scenarios with highly correlated and grouped variables. Although its performance declines somewhat in randomly allocated scenarios, it still surpasses the other methods. The only model that occasionally approaches its performance is IPF-Lasso and this occurs only when a group contains no informative variables.
We compared Elastic Net, Group Lasso, IPF-Lasso and Exclusive Lasso under a fixed computational budget of approximately two hours of wall-clock time per method for hyperparameter tuning. Since each model evaluation runs to completion once started, the reported total times may slightly exceed this limit, but all methods were run under the same budget. All methods were trained on the same train/test split and, whenever supported, the same 10-fold partitioning was used for CV to ensure comparability. Tuning started from a coarse grid of logarithmically spaced regularization paths; for Elastic Net, the mixing parameter α was additionally varied over the set {0, 0.25, 0.5, 0.75, 1}. At each grid point a cross-validated score was computed and the grid was refined adaptively as long as computation time remained. For IPF-Lasso, block-specific priority factors were derived from a ridge-Cox prefit by taking the mean absolute coefficient within each group; to avoid instability, very small group means were clipped to a minimal threshold and all priority factors were normalized before use. The tuner only accepted configurations with finite CV scores and valid best-λ indices, preventing numerical failures from biasing the results. Table 4 reports the number of configurations evaluated and the mean time per configuration, alongside the F1 scores of the selected configurations. Elastic Net and Group Lasso explored the largest number of configurations within the allotted time, while Exclusive Lasso performed fewer but substantially more expensive fits. In contrast, IPF-Lasso completed only a single model evaluation during the time budget. This is due to the much heavier internal CV and blockwise reweighting required by the method, making each fit computationally more demanding than for the other penalization schemes.
As a result, although IPF-Lasso exhausted the full time budget, it could not refine beyond its initial candidate path. Notably, the F1 scores improved only with Exclusive Lasso, indicating that its stronger variable selection led to a more focused and accurate set of predictors.
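The priority-factor construction described above can be sketched as follows (the function name, the floor value and the inverse-mean normalization are our assumptions about one reasonable implementation, not the exact code used):

```python
import numpy as np

def ipf_priority_factors(ridge_coef, groups, floor=1e-3):
    """Block-specific factors from a ridge-Cox prefit: mean |coefficient|
    per group, clipped below at `floor`, normalized to mean 1, and inverted
    so that groups with stronger ridge signal receive weaker penalization."""
    ridge_coef, groups = np.asarray(ridge_coef, float), np.asarray(groups)
    means = np.array([np.mean(np.abs(ridge_coef[groups == g]))
                      for g in np.unique(groups)])
    means = np.maximum(means, floor)       # stabilize near-empty groups
    return means.mean() / means            # penalty factor per group
```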
F1 scores across scenarios for different models.
Comparison of Lasso-based methods with runtime, sparsity (nonzero coefficients) and F1 score.
Prediction error curves. Left: Mean Brier scores calculated from 10–60 months, averaged over 50 random training–test-data splits of the BC data. Right: Mean Brier scores calculated from 1 to 5 years, averaged over 50 random training–test-data splits of the head and neck squamous cell carcinoma (HNSC) data.
Next, we apply our proposed method to two real-world datasets. The penalization parameter λ is tuned with CV for all models.
Bladder cancer gene expression dataset
Bladder cancer (BC) is one of the most commonly diagnosed urinary cancers worldwide, with its incidence steadily increasing each year. This rise may be linked to factors such as tobacco use and an ageing population. Although the five-year survival rate for BC is relatively high at 77%, the recurrence rate remains a significant concern. Beyond genetic signatures, numerous risk factors contribute to BC development, including gender, smoking pattern and occupational exposure to carcinogens (Cumberbatch et al., 2018). Therefore, it is crucial to incorporate both clinical risk factors and sensitive biomarkers when predicting overall survival in patients with BC.
Integrated Brier Score (standard errors in brackets); best performing modelling approach per setting in bold font.
We analyze the BC dataset retrieved from the Gene Expression Omnibus (GEO) database (URL:
We report the Brier scores computed up to five years for all the models discussed in Section 2. From Figure 3, we observe that Exclusive Lasso consistently gives the lowest mean Brier score at each time point. Although there is no substantial difference in the Brier scores across models, Table 5 shows that Exclusive Lasso gives a lower IBS when compared to other models.
Figure 4 displays the top 10 most frequently selected variables across all models. We observe that the clinical variables stage (
In the BC setting, this shows that tumour stage and nomogram score are naturally kept by the model without any manual intervention, so practitioners can input all variables together rather than deciding beforehand which clinical markers must be preserved.
The top 10 most frequently selected variables by the different models on the training set of the Bladder cancer gene expression study.
Head and neck squamous cell carcinoma (HNSC) is one of the most prevalent malignant tumours worldwide and it continues to have a poor prognosis with a five-year survival rate below 50% (Mody et al., 2021). In this study, we obtained molecular and clinical data for HNSC patients from The Cancer Genome Atlas (National Cancer Institute and National Human Genome Research Institute 2025). The molecular data includes 842 miRNA expression features, 20 164 RNA-seq expression features and 9 434 somatic mutation features. The clinical dataset comprises five variables: Age, tumour purity, pathological stage, gender and race. After integrating the datasets, 462 common patient samples were retained for analysis. Preprocessing was performed as described previously and categorical clinical variables were transformed into numerical form using dummy encoding.
The Brier scores averaged over 50 random training–test data splits for 1–5 years for different models are shown in Figure 3. We do not observe a substantial difference between the curves, but Exclusive Lasso performs slightly better than Group Lasso and IPF-Lasso. Exclusive Lasso also achieves a considerably smaller IBS (see Table 5).
The top 10 most frequently selected variables by the different models on the training set of the HNSC dataset.
The top 10 most frequently selected variables by all models are shown in Figure 5. The clinical variable
For head and neck cancer prognosis, this means Exclusive Lasso can highlight molecular signals without overlooking established risk factors such as age and gender, offering a more balanced model that clinicians can trust rather than one dominated solely by high-dimensional omics noise.
Variable selection plays a critical role in high-dimensional biological datasets. Time-to-event prediction improves when redundant and non-informative features are filtered out, leading to better runtime efficiency and interpretability. However, most filter and prediction methods fail to account for the intricate grouping structure of biological data. Studies suggest that predictive performance improves when clinical variables are prioritized (Herrmann et al., 2021). However, due to their low dimensionality, clinical variables are often overshadowed by the vast number of gene expression features, particularly when using standard Lasso regularization. We propose using Exclusive Lasso in Cox PH regression models to ensure proper representation of low-dimensional clinical variables.
The Exclusive Lasso penalty combines the ℓ1 norm within groups with the ℓ2 norm across groups, enforcing sparsity within each group while guaranteeing that every group remains represented in the model.
In our simulation study, we compared the proposed Exclusive Lasso with other state-of-the-art methods that account for grouping structures, such as Elastic Net, Group Lasso and IPF-Lasso. Exclusive Lasso outperformed these models in terms of selection accuracy and FDR. Although its performance slightly deteriorated when a group contained no informative variables, it still performed better than the other models. While IPF-Lasso achieved comparable performance, it either failed to select variables from certain groups or tended to select highly correlated variables within the same group. Group Lasso, on the other hand, performed poorly as it failed to select variables across all groups.
We analyzed the performance of the methods in two real-world cancer studies. Although the methods had comparable IBS, we observed that Exclusive Lasso achieved the best mean Brier score at every time interval. This may be because most methods tend to ignore clinical variables, whereas Exclusive Lasso selects them. The survival prediction and disease progression of cancer are highly influenced by clinical predictors such as tumour stage and smoking status. Therefore, beyond gene selection, incorporating clinical variables into prediction models is crucial. Although it is common in the literature to force clinical variables to remain in the model while applying variable selection only to high-dimensional gene expression data, we argue that this strategy is overly restrictive. While clinical covariates are often few in number, the landscape of biomedical research is rapidly evolving. In addition to traditional clinical information, modern studies increasingly incorporate environmental, lifestyle, phenotypic and imaging data. These sources add further heterogeneity and are not necessarily low-dimensional. In such settings, methods that allow variable selection across all groups of predictors, instead of forcing certain groups to remain in the model by default, are crucial. The Exclusive Lasso provides precisely this functionality by ensuring that selection is performed in a structured yet flexible way across all categories of variables, thereby avoiding the risk of overlooking smaller but potentially important groups. We also found that variable selection in Exclusive Lasso was more consistent across repetitions, whereas other models selected different variables in different iterations.
Although Exclusive Lasso is highly effective in selecting variables from each group, its estimation is challenging due to the composite nature of the penalty. As an outlook, we note that recent developments such as our proposed Newton-based NM-
