Abstract
Introduction
In recent years, advances in high-throughput genomic technologies have led to the availability of high-dimensional datasets, including DNA methylation, messenger ribonucleic acid (mRNA) expression and copy number variation, in addition to traditional clinical variables. These datasets may provide valuable information on the mechanisms of a particular disease, prompting the development of various methods to identify influential genomic and clinical characteristics for improved prognostic modelling.
A common objective in clinical research is the prediction of patient survival outcomes. The Cox proportional hazards (PH) model (Cox, 1972) is widely used for this purpose, as it not only facilitates survival prediction but also enables the assessment of the impact of predictor variables on survival. However, given the high-dimensional nature of genomic datasets, variable selection becomes a critical step in model construction. To address this, an ℓ1-penalized (Lasso) version of the Cox model is commonly employed, which shrinks many coefficients exactly to zero and thereby performs automatic variable selection.
Despite its effectiveness, this approach presents several limitations. First, standard Lasso-based methods do not inherently account for grouped variables, which is particularly relevant in genomic studies where genes are often organized in biological pathways. Ignoring such group structures may lead to suboptimal feature selection and loss of biologically meaningful information. Additionally, large sets of genomic features often overshadow low-dimensional clinical variables, such as tumour size and nodal status. This is a significant drawback, as clinicopathologic variables have been demonstrated to play a crucial role in oncological studies and predictive performance improves when both clinical and genomic data are integrated (Ma et al., 2007; Herrmann et al., 2021). Finally, Lasso-based selection methods have been shown to produce a relatively high rate of false positives in specific settings, which may limit their reliability in time-to-event analysis, depending on the context (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006).
Several statistical methods have been proposed to incorporate grouped variables in the Cox PH model. Although the Elastic Net Cox model (Simon et al., 2011b) does not explicitly enforce group selection, it tends to select correlated variables together, unlike pure Lasso, which typically selects only one from a correlated set. This behaviour results from its combination of ℓ1 and ℓ2 penalties, where the ℓ2 component distributes weight across correlated predictors.
To overcome this limitation of missing group representation, we propose the use of Exclusive Lasso regularization (Campbell and Allen, 2017), which we will extend in this work to the Cox PH model. Our approach adapts the method’s ability to encourage intra-group sparsity through the ℓ1 norm applied within each group, while the ℓ2 norm across groups ensures that every group contributes at least one variable to the model.
In our previous work, we demonstrated the superior performance of Exclusive Lasso over traditional Lasso in GLM settings with high within-group correlation (Ravi and Groll, 2025; note that a preliminary compact version of this work can also be found in Ravi and Groll, 2024). In the current study, we adapt this methodology to time-to-event analysis, introducing it as a practical alternative for selecting informative predictors from different groups and integrating them into a sparse prediction model while ensuring that no group is overlooked.
We assess the performance of our proposed method by comparing it to other approaches that account for grouping effects, such as Elastic Net Cox, Sparse Group Lasso and IPF-Lasso. Across a range of scenarios, our method generally shows improvements over the alternatives in both survival prediction accuracy and selection performance. Additionally, we evaluate the practical applicability of our model by using it for survival prediction in real-world cancer studies. In addition to the standard prediction errors, we compare the biomarkers selected by each model and highlight the importance of Exclusive Lasso in selecting clinical and low-dimensional variables that other models fail to capture.
The remainder of the article is structured as follows. Section 2 introduces the Exclusive Lasso problem in the Cox PH framework. In Section 3, we present the simulation scenarios and compare our method with other Lasso procedures. The applicability of our model is demonstrated in Section 4 using the aforementioned application example. Finally, Section 5 concludes.
Methods
In this section, we first briefly review methods for handling grouped predictors within the Cox PH framework and then introduce the Exclusive Lasso regularization in the Cox PH model.
Let T_i denote the survival time and C_i the censoring time of patient i, i = 1, …, n. We observe t_i = min(T_i, C_i) together with the censoring indicator δ_i = 1(T_i ≤ C_i) and a covariate vector x_i of length p.
The Cox PH model (Cox, 1972) specifies the hazard function for patient i as
h(t | x_i) = h_0(t) exp(x_iᵀβ),
where h_0(t) denotes the unspecified baseline hazard function and β is the vector of regression coefficients.
Estimation of β proceeds by maximizing the partial log-likelihood
ℓ(β) = ∑_{i: δ_i = 1} [ x_iᵀβ − log ∑_{j ∈ R(t_i)} exp(x_jᵀβ) ],
where R(t_i) = {j : t_j ≥ t_i} denotes the set of patients still at risk at time t_i.
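For concreteness, the negative partial log-likelihood can be sketched numerically as follows (assuming no tied event times; the function and variable names are our own illustration, not the paper's implementation):

```python
import numpy as np

def cox_neg_partial_loglik(beta, X, time, event):
    """Negative Cox partial log-likelihood, assuming no tied event times.

    beta  : (p,) coefficient vector
    X     : (n, p) covariate matrix
    time  : (n,) observed times t_i = min(T_i, C_i)
    event : (n,) censoring indicators delta_i (1 = event observed)
    """
    eta = X @ beta                         # linear predictors x_i' beta
    loglik = 0.0
    for i in np.flatnonzero(event):        # sum over uncensored patients only
        at_risk = time >= time[i]          # risk set R(t_i) = {j : t_j >= t_i}
        loglik += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return -loglik
```

With β = 0, every subject in a risk set is equally likely to fail, so each event contributes the log of its risk-set size.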
In high-dimensional scenarios, where the number of covariates p exceeds the number of patients n, the maximizer of the partial likelihood is not unique and regularization is required. A penalized estimate is obtained by minimizing
−ℓ(β) + λ P(β),
where λ ≥ 0 is the penalization parameter. The most common penalty term is the ℓ1 norm, P(β) = ‖β‖₁, which yields the Lasso and shrinks many coefficients exactly to zero.
We focus on variable selection in scenarios where the predictors are divided into predefined, disjoint groups. For instance, in the context of multi-omics data, the variables may include different types, such as genomics, epigenomics and transcriptomics, in addition to clinical and pathological data. We assume that the index set {1, …, p} of the true parameter vector β is partitioned into G predefined, disjoint groups.
Let β^(g) denote the subvector of coefficients belonging to group g, for g = 1, …, G.
Elastic Net
The Elastic Net (Zou and Hastie, 2005) is a regularization method that combines the ℓ1 and ℓ2 penalties,
P(β) = α‖β‖₁ + ((1 − α)/2)‖β‖₂²,
where α ∈ (0, 1) is the mixing parameter that controls the balance between the ℓ1 (Lasso) and ℓ2 (Ridge) components. The ℓ2 part encourages correlated predictors to enter the model together, while the ℓ1 part retains sparsity.
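As a minimal sketch, the penalty value can be evaluated directly (the (1 − α)/2 scaling of the ridge part follows a common glmnet-style convention and is an assumption here):

```python
import numpy as np

def elastic_net_penalty(beta, alpha):
    """Elastic Net penalty: alpha * ||beta||_1 + ((1 - alpha) / 2) * ||beta||_2^2."""
    beta = np.asarray(beta, float)
    return alpha * np.sum(np.abs(beta)) + (1 - alpha) / 2 * np.sum(beta ** 2)
```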
Sparse Group Lasso
The Sparse Group Lasso (Simon et al., 2013) is another method that uses a combination of ℓ1 and group-wise ℓ2 penalties,
P(β) = (1 − α) ∑_{g=1}^{G} √p_g ‖β^(g)‖₂ + α‖β‖₁,
where p_g denotes the number of variables in group g and α ∈ [0, 1] balances sparsity between and within groups. The group-wise ℓ2 term encourages entire groups to be zeroed out, while the ℓ1 term allows sparsity within the selected groups.
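A short sketch of this penalty, with the √p_g group weighting as in Simon et al. (2013) (names and the group-label encoding are our own illustration):

```python
import numpy as np

def sgl_penalty(beta, groups, alpha):
    """Sparse Group Lasso penalty:
    (1 - alpha) * sum_g sqrt(p_g) * ||beta^(g)||_2 + alpha * ||beta||_1,
    where groups[j] is the group label of coefficient j."""
    beta, groups = np.asarray(beta, float), np.asarray(groups)
    group_part = sum(
        np.sqrt(np.sum(groups == g)) * np.linalg.norm(beta[groups == g])
        for g in np.unique(groups)
    )
    return (1 - alpha) * group_part + alpha * np.sum(np.abs(beta))
```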
Integrative Lasso with penalty factors (IPF-Lasso)
The Integrative Lasso with penalty factors (IPF-Lasso; Boulesteix et al., 2017) was introduced for prediction based on multi-omics datasets where there are several modalities (groups) of variables. The main idea of IPF-Lasso is to apply Lasso to each group and introduce penalty factors for different groups of variables, which can be selected according to the desired weighting of the groups or by cross-validation (CV). The IPF-Lasso penalty is defined as
P(β) = ∑_{m=1}^{M} λ_m ‖β^(m)‖₁,
with one penalization parameter λ_m per modality m = 1, …, M.
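The modality-wise weighted ℓ1 penalty can be sketched as follows (the mapping from group labels to λ_m values is our own illustrative encoding):

```python
import numpy as np

def ipf_penalty(beta, groups, lam):
    """IPF-Lasso penalty: sum_m lambda_m * ||beta^(m)||_1, with one
    penalization parameter per modality.  lam maps group label -> lambda_m."""
    beta, groups = np.asarray(beta, float), np.asarray(groups)
    return sum(lam[g] * np.sum(np.abs(beta[groups == g])) for g in np.unique(groups))
```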
Exclusive Lasso
The Exclusive Lasso (Campbell and Allen, 2017) enforces structured sparsity by ensuring that at least one variable is selected from each predefined group. It combines an ℓ1 norm within groups with an ℓ2 norm across groups,
P(β) = (1/2) ∑_{g=1}^{G} ‖β^(g)‖₁²,
so that variables compete within their group while every group retains representation in the model.
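A minimal sketch of this penalty (the factor 1/2 is a scaling convention we assume here):

```python
import numpy as np

def exclusive_lasso_penalty(beta, groups):
    """Exclusive Lasso penalty: 0.5 * sum_g (||beta^(g)||_1)^2,
    i.e. an l1 norm within each group, squared across groups."""
    beta, groups = np.asarray(beta, float), np.asarray(groups)
    return 0.5 * sum(np.sum(np.abs(beta[groups == g])) ** 2 for g in np.unique(groups))
```

For a fixed total ℓ1 mass, concentrating coefficients in one group is penalized more heavily than spreading them across groups: a mass of 2 in a single group costs 0.5·2² = 2, while 1 in each of two groups costs only 0.5·(1² + 1²) = 1. This asymmetry is what drives the method to keep every group represented.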
The composite nature of the penalty term makes the estimation of the Exclusive Lasso problem challenging. Several strategies have been developed to tackle this challenge. One approach utilizes proximal point algorithms based on dual Newton methods (Lin et al., 2020), while others employ iterative re-weighted techniques to refine the estimation process (Kong et al., 2014; Sun et al., 2020). An alternative strategy reformulates the problem in a Lasso framework and applies a bisection algorithm, taking advantage of Lasso’s piecewise linear properties (Sun et al., 2020).
To improve computational efficiency, a fast optimization method based on the Fast Iterative Shrinkage-Thresholding Algorithm has been introduced (Huang and Liu, 2018). Another approach transforms the penalty into a differentiable one by applying a simple quadratic approximation, allowing it to be efficiently solved using a Newton-based algorithm (Ravi and Groll, 2025).
Campbell and Allen (2017), along with the aforementioned studies, proposed the Exclusive Lasso approach for generalized linear models, but to the best of our knowledge, no adaptations to Cox PH models have been implemented. In this work, we extend Exclusive Lasso to the Cox PH model, allowing its group-wise sparsity properties to be applied in a time-to-event setting. To fit the model, we also develop a coordinate descent algorithm with soft-thresholding, specifically adapted for the Cox PH likelihood, which addresses algorithmic challenges that arise in this context.
As highlighted by Campbell and Allen (2017), the Exclusive Lasso penalty is non-separable; that is, it cannot be formulated as a sum of functions depending on individual coefficients only. Consequently, it is not possible to update all coefficients simultaneously in closed form. Instead, we use a coordinate descent algorithm, where each coefficient is updated sequentially while keeping the others fixed.
Our approach builds on a coordinate descent framework originally developed for Group Lasso regularization (Yuan and Lin, 2006), which has been shown to be efficient in high-dimensional settings. The algorithm is summarized in Algorithm 1.
The gradient component for covariate j of the partial log-likelihood is
∂ℓ(β)/∂β_j = ∑_{i: δ_i = 1} [ x_ij − (∑_{k ∈ R(t_i)} x_kj exp(x_kᵀβ)) / (∑_{k ∈ R(t_i)} exp(x_kᵀβ)) ],
where R(t_i) again denotes the risk set at time t_i.
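As a sanity check, this score expression can be verified against finite differences of the partial log-likelihood (a self-contained sketch with illustrative names; no tied event times assumed):

```python
import numpy as np

def cox_score_component(beta, X, time, event, j):
    """Score (gradient) of the Cox partial log-likelihood w.r.t. beta_j:
    for each event, x_ij minus the exp(x'beta)-weighted mean of x_kj
    over the risk set."""
    eta = X @ beta
    grad = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]
        w = np.exp(eta[at_risk])
        grad += X[i, j] - np.sum(w * X[at_risk, j]) / np.sum(w)
    return grad
```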
We update the coefficients cyclically, each via a soft-thresholding step that incorporates the partial penalty contributed by the remaining coefficients of the same group.
Specifically, in each coordinate update, the current coefficient β_j is optimized while all other coefficients are held fixed at their current values.
To define the partial penalty, let s_{−j} = ∑_{k ∈ g(j), k ≠ j} |β_k| denote the sum of the absolute values of the remaining coefficients in the group g(j) that contains variable j.
Here, the exclusion of index j ensures that the penalty acting on β_j depends only on the current magnitude of its within-group competitors.
This penalty encourages competition among variables within the same group, allowing only a few features to be selected. It promotes sparsity by shrinking coefficients, especially when the penalty is large. As a result, β_j is set exactly to zero whenever its unpenalized update does not exceed the threshold induced by λ and s_{−j}.
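To make the update concrete, here is a minimal coordinate descent sketch for the Gaussian (least-squares) analogue of the problem, 0.5‖y − Xβ‖² + (λ/2)∑_g ‖β^(g)‖₁², where the update has a closed form; the Cox version replaces the residual terms with gradient and curvature terms of the partial likelihood. Function names and the simplified setting are our own illustration, not the paper's implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def exclusive_lasso_cd_gaussian(X, y, groups, lam, n_iter=200):
    """Coordinate descent for 0.5*||y - X beta||^2 + (lam/2)*sum_g ||beta^(g)||_1^2.
    Each coefficient is soft-thresholded at lam * s_{-j}, where s_{-j} is the
    l1 mass of the OTHER coefficients in j's group."""
    groups = np.asarray(groups)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]               # partial residual without x_j
            in_group = groups == groups[j]
            s = np.sum(np.abs(beta[in_group])) - abs(beta[j])  # s_{-j}
            z = X[:, j] @ r
            beta[j] = soft_threshold(z, lam * s) / (X[:, j] @ X[:, j] + lam)
    return beta
```

On orthonormal columns with a single informative predictor, the update reduces to z/(1 + λ) for the signal variable and an exact zero for its group competitor.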
Furthermore, we refer readers to Theorem 4 of Campbell and Allen (2017), which provides proof that the Exclusive Lasso coordinate descent algorithm converges to the global minimum in the case of penalized GLMs. This result can be readily adapted to other settings, including ours. Figure 1 displays the regularization paths for both Exclusive Lasso (left) and Lasso (right). The variables are divided into five distinct groups, with each group containing exactly one signal variable and the rest being noise. The example in Figure 1 is simulated from Scenario 5 in Section 3, chosen as an illustrative case that highlights the enhanced performance of Exclusive Lasso. Hence, it represents a simplified setting that is favourable for the performance of Exclusive Lasso. The signal variables are highlighted using different colours to distinguish their respective groups. Exclusive Lasso encourages within-group sparsity, driving most coefficients to zero while retaining only one active variable per group. As a result, it maintains exactly five active variables, one from each group, even at large values of λ. In contrast, Lasso applies shrinkage without regard to group structure and may eliminate informative variables or retain multiple variables from the same group.
The Exclusive Lasso was implemented in
In this section, we present a detailed simulation study to evaluate the performance of our method across different scenarios.
Setting
We simulate
We assume that the variables are divided into either two or five groups and consider eight simulation scenarios for grouping them. Across scenarios, the total number of signal variables is set to 5, 10 or 20.
Table 1 summarizes the grouping structures used. In Scenario 1 and Scenario 5, an equal number of variables is allocated to each group: 250 variables per group in Scenario 1 and 100 variables per group in Scenario 5. These represent ideal settings, as the Exclusive Lasso is expected to perform well when at least one signal variable is present in each group.
Exclusive Lasso coordinate descent for Cox PH model.
In the remaining scenarios, we introduce unequal group sizes and the signal variables are also distributed unequally across groups. Notably, in Scenario 8, one group contains no signal variables. In Scenario 4, the group sizes are 10 and 490; in the smaller group of 10 variables, one variable is categorical with four levels and is designated as a signal variable. This case is designed to mimic real-world settings in which clinical variables can be categorical.
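As an illustration of such a design, the following sketch generates grouped covariates with one signal variable per group and exponential survival times with independent censoring (the specific parameter values, such as the baseline scale and censoring rate, are our own choices in the spirit of Scenario 5, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_groups, group_size = 200, 5, 100            # 5 groups of 100 variables
p = n_groups * group_size
groups = np.repeat(np.arange(n_groups), group_size)

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[::group_size] = 1.0                          # one signal variable per group

# Exponential event times with hazard exp(x' beta); independent censoring
linpred = X @ beta
T = rng.exponential(scale=np.exp(-linpred))       # rate = exp(x' beta)
C = rng.exponential(scale=2.0, size=n)            # censoring times
time = np.minimum(T, C)
event = (T <= C).astype(int)                      # delta_i = 1 if event observed
```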
We simulate an independent validation dataset consisting of
Regularization paths for Exclusive Lasso (left) and Lasso (right) from a simulation study, where variables are evenly distributed into five groups (shown in distinct colours), with each group containing one true signal variable. In the Exclusive Lasso model, the signal variables remain active unless all other variables in their group shrink to zero. In contrast, the Lasso model selects variables without considering the group structure, allowing multiple variables from the same group to be included.
Description of the grouping structure of signal variables across the simulation scenarios.
We report the results using variable selection accuracy, defined as the proportion of true positives and true negatives among all variables, along with the F1 score, false discovery rate (FDR) and integrated Brier score (IBS). The F1 score (Van Rijsbergen, 1979) is defined as the harmonic mean of precision and recall, taking into account both false positives and false negatives. The metric ranges from 0 to 1, with larger values indicating a better balance between precision and recall. The Brier score (Graf et al., 1999) at a given time point t measures the mean squared difference between the observed survival status and the predicted survival probability at t; the IBS aggregates the Brier score over the entire follow-up period, with smaller values indicating better predictive accuracy.
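A simplified numerical sketch of these two quantities (for brevity, this version drops the inverse-probability-of-censoring weights that Graf et al. (1999) use to handle censored observations correctly, and it averages over a discrete time grid rather than integrating):

```python
import numpy as np

def brier_score(t, time, event, surv_prob):
    """Unweighted Brier score at time t; surv_prob[i] = predicted P(T_i > t).
    Subjects censored before t (status at t unknown) are excluded."""
    known = (time > t) | (event == 1)
    status = (time > t).astype(float)        # 1 if still event-free at t
    return np.mean((status[known] - surv_prob[known]) ** 2)

def integrated_brier_score(grid, time, event, surv_fn):
    """Crude IBS: mean Brier score over a grid of evaluation times.
    surv_fn(t) returns the vector of predicted survival probabilities at t."""
    return float(np.mean([brier_score(t, time, event, surv_fn(t)) for t in grid]))
```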
Performance metrics (standard errors in brackets) for Scenarios 1–4; best performing modelling approach per setting in bold font.
We compare our proposed extension of Exclusive Lasso for Cox PH models with the models described in Section 2. We use the implementations available in the
Exclusive Lasso shows consistently strong performance in Scenarios 1, 2, 5 and 6, where there is an equal distribution of signal variables across groups and where group sizes are balanced. In these settings, its ability to enforce within-group competition enables accurate variable selection and yields high F1 scores. In Scenario 4, performance drops slightly because categorical variables were dummy-coded and competition among levels of the same variable was not advantageous. This behaviour is expected: Exclusive Lasso treats each dummy variable as an independent predictor, so different levels of a categorical variable are effectively placed in competition with one another. As a consequence, selection may become inconsistent across levels of the same factor. Nonetheless, Exclusive Lasso still outperforms the other models in terms of overall variable selection. In Scenario 8, its performance decreases more noticeably because one group contains no signal variables. In this case, the model may still select at least one variable from the empty group, which reduces selection accuracy.
Performance metrics (standard errors in brackets) for Scenarios 5–8; best performing modelling approach per setting in bold font.
By contrast, Elastic Net maintains stable but modest performance across all scenarios. Since it does not explicitly account for group structure, its performance is unaffected by variations in group size or distribution of signal variables. However, this stability comes at the cost of weaker selection accuracy compared to Exclusive Lasso.
When evaluating prediction accuracy using the IBS, Elastic Net and IPF-Lasso sometimes outperform Exclusive Lasso. Although their selection ability is limited, these models still provide accurate survival predictions. This is because, in the presence of highly correlated predictors, they tend to distribute effects across correlated variables instead of selecting a single one. While this behaviour reduces variable selection quality, it improves calibration of predicted survival probabilities, which benefits IBS.
Group Lasso performs poorly in all scenarios. This is expected because the simulation design requires selecting variables across groups rather than entire groups. Group Lasso instead selects variables from a single group, which consistently lowers its performance.
Overall, Exclusive Lasso emerges as the best-performing method in scenarios with highly correlated and grouped variables. Although its performance declines somewhat in randomly allocated scenarios, it still surpasses the other methods. The only model that occasionally approaches its performance is IPF-Lasso and this occurs only when a group contains no informative variables.
We compared Elastic Net, Group Lasso, IPF-Lasso and Exclusive Lasso under a fixed computational budget of approximately two hours of wall-clock time per method for hyperparameter tuning. Since each model evaluation runs to completion once started, the reported total times may slightly exceed this limit, but all methods were run under the same budget. All methods were trained on the same train/test split and, whenever supported, the same 10-fold partitioning was used for CV to ensure comparability. Tuning started from a coarse grid of logarithmically spaced regularization paths; for Elastic Net, the mixing parameter α was additionally varied over the set {0, 0.25, 0.5, 0.75, 1}. At each grid point a cross-validated score was computed and the grid was refined adaptively as long as computation time remained. For IPF-Lasso, block-specific priority factors were derived from a ridge-Cox prefit by taking the mean absolute coefficient within each group; to avoid instability, very small group means were clipped to a minimal threshold and all priority factors were normalized before use. The tuner only accepted configurations with finite CV scores and valid best-λ indices, preventing numerical failures from biasing the results. Table 4 reports the number of configurations evaluated and the mean time per configuration, alongside the F1 scores of the selected configurations. Elastic Net and Group Lasso explored the largest number of configurations within the allotted time, while Exclusive Lasso performed fewer but substantially more expensive fits. In contrast, IPF-Lasso completed only a single model evaluation during the time budget. This is due to the much heavier internal CV and blockwise reweighting required by the method, making each fit computationally more demanding than for the other penalization schemes.
As a result, although IPF-Lasso exhausted the full time budget, it could not refine beyond its initial candidate path. Notably, the F1 scores improved only with Exclusive Lasso, indicating that its stronger variable selection led to a more focused and accurate set of predictors.
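The priority-factor construction described above can be sketched as follows (the function name, the floor value and the inverse-mean normalization are our assumptions about one reasonable implementation, not the exact code used):

```python
import numpy as np

def ipf_priority_factors(ridge_coef, groups, floor=1e-3):
    """Block-specific factors from a ridge-Cox prefit: mean |coefficient|
    per group, clipped below at `floor`, normalized to mean 1, and inverted
    so that groups with stronger ridge signal receive weaker penalization."""
    ridge_coef, groups = np.asarray(ridge_coef, float), np.asarray(groups)
    means = np.array([np.mean(np.abs(ridge_coef[groups == g]))
                      for g in np.unique(groups)])
    means = np.maximum(means, floor)       # stabilize near-empty groups
    return means.mean() / means            # penalty factor per group
```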
F1 scores across scenarios for different models.
Comparison of Lasso-based methods with runtime, sparsity (nonzero coefficients) and F1 score.
Prediction error curves. Left: Mean Brier scores calculated from 10–60 months, averaged over 50 random training–test-data splits of the BC data. Right: Mean Brier scores calculated from 1 to 5 years, averaged over 50 random training–test-data splits of the head and neck squamous cell carcinoma (HNSC) data.
Next, we apply our proposed method to two real-world datasets. The penalization parameter λ is tuned with CV for all models.
Bladder cancer gene expression dataset
Bladder cancer (BC) is one of the most commonly diagnosed urinary cancers worldwide, with its incidence steadily increasing each year. This rise may be linked to factors such as tobacco use and an ageing population. Although the five-year survival rate for BC is relatively high at 77%, the recurrence rate remains a significant concern. Beyond genetic signatures, numerous risk factors contribute to BC development, including gender, smoking pattern and occupational exposure to carcinogens (Cumberbatch et al., 2018). Therefore, it is crucial to incorporate both clinical risk factors and sensitive biomarkers when predicting overall survival in patients with BC.
Integrated Brier Score (standard errors in brackets); best performing modelling approach per setting in bold font.
We analyze the BC dataset retrieved from the Gene Expression Omnibus (GEO) database (URL:
We report the Brier scores computed up to five years for all the models discussed in Section 2. From Figure 3, we observe that Exclusive Lasso consistently gives the lowest mean Brier score at each time point. Although there is no substantial difference in the Brier scores across models, Table 5 shows that Exclusive Lasso gives a lower IBS when compared to other models.
Figure 4 displays the top 10 most frequently selected variables across all models. We observe that the clinical variables stage (
In the BC setting, this shows that tumour stage and nomogram score are naturally kept by the model without any manual intervention, so practitioners can input all variables together rather than deciding beforehand which clinical markers must be preserved.
The top 10 most frequently selected variables by the different models on the training set of the Bladder cancer gene expression study.
Head and neck squamous cell carcinoma (HNSC) is one of the most prevalent malignant tumours worldwide and it continues to have a poor prognosis with a five-year survival rate below 50% (Mody et al., 2021). In this study, we obtained molecular and clinical data for HNSC patients from The Cancer Genome Atlas (National Cancer Institute and National Human Genome Research Institute 2025). The molecular data includes 842 miRNA expression features, 20 164 RNA-seq expression features and 9 434 somatic mutation features. The clinical dataset comprises five variables: Age, tumour purity, pathological stage, gender and race. After integrating the datasets, 462 common patient samples were retained for analysis. Preprocessing was performed as described previously and categorical clinical variables were transformed into numerical form using dummy encoding.
The Brier scores averaged over 50 random training–test data splits for 1–5 years for different models are shown in Figure 3. We do not observe a substantial difference between the curves, but Exclusive Lasso performs slightly better than Group Lasso and IPF-Lasso. Exclusive Lasso also achieves a considerably smaller IBS (see Table 5).
The top 10 most frequently selected variables by the different models on the training set of the HNSC dataset.
The top 10 most frequently selected variables by all models are shown in Figure 5. The clinical variable
For head and neck cancer prognosis, this means Exclusive Lasso can highlight molecular signals without overlooking established risk factors such as age and gender, offering a more balanced model that clinicians can trust rather than one dominated solely by high-dimensional omics noise.
Variable selection plays a critical role in high-dimensional biological datasets. Time-to-event prediction improves when redundant and non-informative features are filtered out, leading to better runtime efficiency and interpretability. However, most filter and prediction methods fail to account for the intricate grouping structure of biological data. Studies suggest that predictive performance improves when clinical variables are prioritized (Herrmann et al., 2021). However, due to their low dimensionality, clinical variables are often overshadowed by the vast number of gene expression features, particularly when using standard Lasso regularization. We propose using Exclusive Lasso in Cox PH regression models to ensure proper representation of low-dimensional clinical variables.
The Exclusive Lasso penalty combines the ℓ1 norm within groups with the ℓ2 norm across groups, enforcing sparsity within each group while guaranteeing that every group remains represented in the model.
In our simulation study, we compared the proposed Exclusive Lasso with other state-of-the-art methods that account for grouping structures, such as Elastic Net, Group Lasso and IPF-Lasso. Exclusive Lasso outperformed these models in terms of selection accuracy and FDR. Although its performance slightly deteriorated when a group contained no informative variables, it still performed better than the other models. While IPF-Lasso achieved comparable performance, it either failed to select variables from certain groups or tended to select highly correlated variables within the same group. Group Lasso, on the other hand, performed poorly as it failed to select variables across all groups.
We analyzed the performance of the methods in two real-world cancer studies. Although the methods had comparable IBS, we observed that Exclusive Lasso achieved the best mean Brier score at every time interval. This may be because most methods tend to ignore clinical variables, whereas Exclusive Lasso selects them. The survival prediction and disease progression of cancer are highly influenced by clinical predictors such as tumour stage and smoking status. Therefore, beyond gene selection, incorporating clinical variables into prediction models is crucial. Although it is common in the literature to force clinical variables to remain in the model while applying variable selection only to high-dimensional gene expression data, we argue that this strategy is overly restrictive. While clinical covariates are often few in number, the landscape of biomedical research is rapidly evolving. In addition to traditional clinical information, modern studies increasingly incorporate environmental, lifestyle, phenotypic and imaging data. These sources add further heterogeneity and are not necessarily low-dimensional. In such settings, methods that allow variable selection across all groups of predictors, instead of forcing certain groups to remain in the model by default, are crucial. The Exclusive Lasso provides precisely this functionality by ensuring that selection is performed in a structured yet flexible way across all categories of variables, thereby avoiding the risk of overlooking smaller but potentially important groups. We also found that variable selection in Exclusive Lasso was more consistent across repetitions, whereas other models selected different variables in different iterations.
Although Exclusive Lasso is highly effective in selecting variables from each group, its estimation is challenging due to the composite nature of the penalty. As an outlook, we note that recent developments such as our proposed Newton-based NM-
