Abstract
Keywords
Introduction
Since the original proposal of the lasso by Tibshirani, 1 penalized regression methods for variable selection in high-dimensional settings have attracted considerable attention in modern statistical research. These methods have been extensively studied in theory and widely applied in practice. Most of the methods focus on selecting individual explanatory variables (or predictors). In many settings, however, predictors possess a group structure. Incorporating this grouping information into the modeling process has the potential to improve both the interpretability and the accuracy of the model.
Consider first the linear regression problem with
where
where ||·|| is the Euclidean (
However, an obvious limitation of the group lasso is that it assumes that the groups do not overlap. This introduces a barrier to its application for many problems where variables may be included in more than one group. The application we focus on in this study is the analysis of gene expression profiles, where individual genes can be grouped into pathways in which the collective action of several genes is required for the cell to carry out a complicated function. These pathways generally overlap with each other as one gene can play a role in multiple pathways. Here,
Within the hypothesis-testing framework, a number of pathway-based approaches have been proposed for analyzing gene expression data under the premise that weak expression changes in individual genes are coordinated and can be combined in groups to produce stronger signals. 4 Hence, by incorporating prior pathway information, these approaches aim to identify differentially expressed pathways, instead of individual genes. Compared to traditional single-gene tests, pathway-based tests often lead to higher statistical power and better biological interpretation. Among the pathway-testing approaches, gene set enrichment analysis (GSEA)5,6 has been widely used. However, the hypothesis-testing framework has certain limitations for pathway analysis, such as the inability to account for the effect of multiple pathways simultaneously, and it is not well suited to using gene expression and pathway data to predict biological outcomes. 7
On the other hand, pathway-based approaches have been largely absent from regression methods due to the challenges of dealing with overlapping pathways in regression models. Limited attempts have been made to build pathway-based regression models. Wei and Li 8 proposed a nonparametric pathway-based regression using gradient descent boosting. Liu et al. 9 developed a semiparametric regression framework to model the pathway effects using least-squares kernel machines. However, the former is a “black box” approach, and its results are difficult to interpret in terms of how pathways are related to the outcome, while the latter only estimates the effect of a single pathway and cannot model multiple pathways simultaneously.
In this study, we formulate the overlapping group logistic regression model based on the latent group lasso approach, 10 making it applicable to perform pathway selection under the general linear modeling framework. This approach naturally preserves the straightforward interpretation of regression coefficients and offers the ability to scale up to model hundreds of overlapping pathways simultaneously in high-dimensional settings.
We also conduct a systematic comparison of this overlapping group lasso (OGLasso) approach with both the ordinary lasso and GSEA via both simulation and real-data studies. Our aim is to demonstrate the fundamental differences between hypothesis-testing approaches and regression models with respect to their implications for pathway selection. Thus, although a variety of extensions and refinements of GSEA have been proposed, such as GSEAlm, 11 ROAST, 12 and npGSEA, 13 we restrict our attention here to GSEA, the most well-known and widely used method in this group.
Finally, we provide a publicly available implementation of the OGLasso method described in this study through the R package grpregOverlap. This package serves as an extension of the R package grpreg, which provides a variety of functions for fitting penalized regression models involving grouped predictors but requires those groups to be nonoverlapping.
The rest of the study is organized as follows. In the “Methods” section, we review the OGLasso approach and construct the OGLasso model. In addition, we give a brief introduction to GSEA, along with some discussion. In the “Simulation studies” section, we first compare the ordinary lasso and OGLasso in terms of model accuracy with simulated data. We then examine the group selection accuracy of OGLasso and GSEA under different simulation settings. In addition, we provide two real-data studies in the “Real-data studies” section. We conclude the study with final remarks in the “Discussion” section.
Methods
Overlapping Group Lasso
Suppose the
To select entire groups of covariates in the overlapping setting, Jacob et al. 10 proposed the OGLasso, formulated as
where
The idea of model (3) is to decompose the original coefficient vector into a sum of group-specific latent effects. This decomposition allows us to apply the group lasso penalty to the latent vectors
It is worth clarifying the exact meaning of “latent” here. It is not the case that the grouping structure is unobservable – we are considering situations in which the grouping is known in advance. For example, much is already known about how genes are organized into pathways; we want to leverage this information to produce more accurate models.
Rather, what is latent is the decomposition of the effect of each feature into the groups it belongs to. For example, suppose gene X belongs to pathways A and B. It may be that gene X's effect on the response is mediated entirely through pathway A and that its membership in pathway B is irrelevant. This parsing of the effect of the genes into pathways is the latent aspect of the problem that cannot be observed directly – we can only observe changes in the expression of gene X, not whether expression changed in order to produce an effect in pathway A or B.
Figure 1 illustrates the coefficient decomposition mechanism described in equation (3). Suppose that there are four variables that are included in four groups,

Figure 1. The coefficient decomposition of overlapping group lasso.
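To make the decomposition concrete, the following minimal sketch sums group-specific latent vectors into a single coefficient vector. The group layout and the numeric values are hypothetical and are not the configuration shown in Figure 1; the helper name `combine` is likewise an assumption made for illustration.

```python
# Hypothetical layout: 4 variables (0-based indices) in 3 overlapping groups.
groups = {"g1": [0, 1], "g2": [1, 2], "g3": [2, 3]}

# Latent coefficients: each latent vector is nonzero only on its own group's
# variables. Values are invented for illustration.
latent = {
    "g1": {0: 0.5, 1: 0.3},
    "g2": {1: -0.2, 2: 0.0},
    "g3": {2: 0.0, 3: 0.0},
}


def combine(latent, groups, n_vars):
    """Sum the group-specific latent effects into the coefficient vector beta."""
    beta = [0.0] * n_vars
    for name, coefs in latent.items():
        for idx, value in coefs.items():
            # A latent vector may only carry weight on members of its group.
            if idx not in groups[name]:
                raise ValueError(f"variable {idx} is not in group {name}")
            beta[idx] += value
    return beta


beta = combine(latent, groups, 4)
print(beta)
```

Variable 1 belongs to both g1 and g2, so its coefficient is the sum of two latent contributions. Setting an entire latent vector to zero (as for g3 here) removes that group from the model without forcing the shared variables to zero.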
Based on the coefficient decomposition, model (3) can be transformed into a new minimization problem 15 with respect to γml:
Here, γ in principle consists of all elements of γ
The implication of equation (4) is that the OGLasso problem is equivalent to a classical group lasso in an expanded, nonoverlapping space. This is of considerable practical convenience, as it allows us to solve equation (4) using computationally efficient algorithms that have previously been developed for the group lasso. 16
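The equivalence can be checked numerically on a toy example. The sketch below duplicates the columns of shared variables so that each group owns its own copy, then verifies that the linear predictor is unchanged. The function names, the toy data, and the sqrt(K_j) group weight in the penalty are assumptions made for illustration; they are not taken from equation (4) or from the grpregOverlap package.

```python
import math


def expand_design(X, groups):
    """Duplicate columns so that every group gets its own copy of its variables.

    X is a list of rows; groups is a list of lists of column indices.
    Returns the expanded matrix and the nonoverlapping column blocks per group.
    """
    col_order = [j for g in groups for j in g]
    X_expanded = [[row[j] for j in col_order] for row in X]
    blocks, start = [], 0
    for g in groups:
        blocks.append(list(range(start, start + len(g))))
        start += len(g)
    return X_expanded, blocks


def group_lasso_penalty(gamma, blocks, lam):
    """lam * sum_j sqrt(K_j) * ||gamma_j||_2 (the sqrt(K_j) weight is assumed)."""
    return lam * sum(
        math.sqrt(len(b)) * math.sqrt(sum(gamma[k] ** 2 for k in b))
        for b in blocks
    )


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
groups = [[0, 1], [1, 2]]        # variable 1 is shared by both groups
X_exp, blocks = expand_design(X, groups)

gamma = [0.5, 0.3, -0.2, 0.0]    # latent coefficients, in expanded column order
beta = [0.5, 0.1, 0.0]           # each beta_j is the sum of its latent copies

fit_expanded = [sum(x * g for x, g in zip(row, gamma)) for row in X_exp]
fit_original = [sum(x * b for x, b in zip(row, beta)) for row in X]
print(fit_expanded, fit_original)
print(group_lasso_penalty(gamma, blocks, lam=1.0))
```

Because the blocks of the expanded matrix do not overlap, any standard group lasso solver can be applied to it directly, and the fitted latent coefficients can then be summed back into the original coefficient vector.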
Overlapping Group Logistic Regression
It is relatively straightforward to extend equation (4) to models other than linear regression; in this section, we describe its application to penalized logistic regression in the presence of overlapping groups. Here,
The corresponding loss function is the (scaled) negative log-likelihood function,
We can then duplicate the columns of the overlapped covariates, expanding the design matrix to
where
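As an illustration of the loss function referred to in this section, the sketch below evaluates a scaled negative log-likelihood for logistic regression with responses coded 0/1. The 1/n scaling and the function name `logistic_loss` are assumptions; in the overlapping setting the same function would be applied to the expanded design matrix and the latent coefficients.

```python
import math


def logistic_loss(X, y, beta0, beta):
    """Return (1/n) * sum_i [log(1 + exp(eta_i)) - y_i * eta_i], with y_i in {0, 1}."""
    n = len(y)
    total = 0.0
    for row, yi in zip(X, y):
        eta = beta0 + sum(x * b for x, b in zip(row, beta))
        # Numerically stable evaluation of log(1 + exp(eta)).
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += log1pexp - yi * eta
    return total / n


# Toy data (invented): the single covariate separates the two classes.
X_demo = [[1.0], [-1.0], [2.0], [-2.0]]
y_demo = [1, 0, 1, 0]
print(logistic_loss(X_demo, y_demo, 0.0, [0.0]))   # equals log(2) at beta = 0
print(logistic_loss(X_demo, y_demo, 0.0, [2.0]))   # smaller: the fit is better
```

Adding a group lasso penalty on the latent coefficients to this loss gives a penalized objective of the kind described in this section.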
Gene Set Enrichment Analysis
Among the hypothesis-testing approaches for pathway selection, GSEA stands out due to its relative simplicity and for preserving the gene–gene dependencies that occur in real biological data. 17
The procedure of GSEA 6 starts with ranking the
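For readers unfamiliar with the enrichment score (ES), the following sketch computes the weighted running-sum statistic commonly used in GSEA for a single gene set, given genes already ranked by their association with the phenotype. The gene names and scores are invented, the weight exponent `p = 1` is an assumed default, and the permutation step used to assess significance is not shown.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Maximum deviation from zero of the weighted running sum over the ranked list."""
    members = set(gene_set)
    n_miss = sum(1 for g in ranked_genes if g not in members)
    norm_hit = sum(abs(s) ** p for g, s in zip(ranked_genes, scores) if g in members)
    if n_miss == 0 or norm_hit == 0:
        raise ValueError("gene set must be a proper, nonempty subset with nonzero scores")
    running, es = 0.0, 0.0
    for g, s in zip(ranked_genes, scores):
        if g in members:
            running += abs(s) ** p / norm_hit   # step up on a hit
        else:
            running -= 1.0 / n_miss             # step down on a miss
        if abs(running) > abs(es):
            es = running
    return es


# Invented example: six genes ranked by decreasing association score.
genes = ["A", "B", "C", "D", "E", "F"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, scores, {"A", "B"})
es_bottom = enrichment_score(genes, scores, {"E", "F"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking yields a large positive ES and a set concentrated at the bottom a large negative ES, whereas a set whose members are split between both ends tends toward a score near zero.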
Though widely used, GSEA also has several limitations. First, GSEA may be biased in favor of larger gene sets by systematically assigning those gene sets higher ES 18 ; second, it implicitly assumes that genes within the same gene set show coordinated (ie, either all positive or all negative) associations with the phenotype, making it less likely to detect sets in which the genes are heterogeneous with respect to the direction of association with the phenotype. 19
There are inherent differences between GSEA and the proposed overlapping group logistic regression method in the sense that GSEA treats the phenotype as fixed and gene expression as random, while regression-based methods do the opposite. Thus, GSEA tends to be more appropriate in settings where the phenotype can be directly manipulated by the experiment (eg, knockout mice), while regression is more appropriate in observational settings (eg, predicting patient outcomes). Nevertheless, there are many situations in which either method could reasonably be used, and therefore, it is of interest to compare the selection properties of the two approaches.
Simulation Studies
In all the simulation studies, we use the term “null group” to denote a group whose coefficients are all equal to zero in the true model and “true group” to denote a group with all nonzero coefficients in the true model. In addition, we refer to ||γ
OGLasso versus Ordinary Lasso
We start by comparing the OGLasso with the ordinary lasso in terms of estimation and prediction accuracy. We use root mean squared error (RMSE) to measure estimation accuracy and misclassification error (ME) to measure prediction accuracy, defined as follows:
It should be noted that we compute ME based on a new response vector generated by the same design matrix for each replication. Specifically, given a design matrix
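The two accuracy measures can be sketched as follows. The normalization of the RMSE by the number of coefficients and the function names are assumptions, since the exact definitions are not reproduced here; the ME is taken to be the proportion of newly generated responses that are predicted incorrectly.

```python
import math


def rmse(beta_hat, beta_true):
    """Root mean squared error between estimated and true coefficients."""
    p = len(beta_true)
    return math.sqrt(sum((bh - b) ** 2 for bh, b in zip(beta_hat, beta_true)) / p)


def misclassification_error(y_pred, y_new):
    """Proportion of new responses that differ from the predicted class labels."""
    return sum(1 for yp, yn in zip(y_pred, y_new) if yp != yn) / len(y_new)


# Invented values for illustration.
print(rmse([1.0, 0.0], [0.0, 0.0]))
print(misclassification_error([1, 0, 1, 1], [1, 1, 1, 0]))
```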
We consider two simulations with different settings described as follows.
Setting 1: Synthetic data
We begin with synthetic data where there are 15 groups of covariates. All covariate values are simulated independently from a standard Gaussian distribution. The group sizes and overlap structure are presented below.
The number underneath the brace is the number of members shared between those two groups. For example, group 1 contains 10 members, as does group 2, but the two groups contain only 17 unique predictors, as three predictors are present in both groups. As a result, the total dimension in this setting is
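The counting argument in the two-group example can be written as a small helper. This sketch assumes that only consecutive groups overlap and that no predictor belongs to more than two groups; the function name is hypothetical, and only the two-group case from the text is evaluated.

```python
def total_dimension(sizes, overlaps):
    """Number of unique predictors when consecutive groups share overlaps[i] members.

    Assumes only adjacent groups overlap and no predictor is in more than two groups.
    """
    if len(overlaps) != len(sizes) - 1:
        raise ValueError("need exactly one overlap count per pair of adjacent groups")
    return sum(sizes) - sum(overlaps)


# Two groups of 10 members sharing 3 predictors, as in the example above.
unique_predictors = total_dimension([10, 10], [3])
print(unique_predictors)
```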
Setting 2: Real data
For this simulation, a real gene expression profile data set in the p53 study 7 is used as the design matrix to mimic the complicated correlation and overlapping structures in real biomedical applications. This design matrix is fixed for each independent replication. Here, the sample size
In both of the above-mentioned two settings, the true group effect sizes of each of the five true groups are set to be equal, and the latent effects are also set to be equal within each true group. In this way, the true coefficient vector is uniquely specified. Then given the design matrix, the responses are generated according to equation (5) for each independent replication. The true group effect size is varied from 1 to 5 to simulate different magnitudes of signals.
Figure 2 illustrates the estimation and prediction accuracy of the proposed grouped variable selection method, as compared to the ordinary lasso, for both settings. The top two panels show results for the synthetic data simulation, while the bottom two panels are for the real-data simulation. The left panels illustrate the median RMSE relative to ordinary lasso over 500 replications, while the right panels compare the methods in terms of ME. OGLasso consistently achieves a lower median RMSE than that of the lasso in both synthetic and real-data simulations. As expected, the ME by both methods decreases as the coefficient magnitude increases. More interestingly, the ME by OGLasso can be substantially lower than that of ordinary lasso. In the synthetic data simulation, for example, the ME by OGLasso is 7% lower than that of ordinary lasso when the group effect size is 5. The two methods are more similar in terms of predictive accuracy on the real data, where the dimensionality is much higher and correlation structure more complicated. Nevertheless, the prediction accuracy can still be improved by around 2% with OGLasso compared to ordinary lasso.

Figure 2. Accuracy of OGLasso and ordinary lasso with respect to the magnitude of the group effect size. Top two panels summarize results for the synthetic data simulation, while bottom two panels are for the real-data simulation.
OGLasso versus GSEA
In this section, we use simulated data to compare the selection properties of the OGLasso against GSEA in a variety of settings. Because OGLasso and GSEA do not estimate the same quantities and GSEA does not produce predictions, the only way to compare them is with respect to selection accuracy. To ensure a fair comparison, we use each method to select a fixed number of groups. We then evaluate the group selection accuracy by the true discovery rate (TDR):
TDR = (# of true groups selected) / (# of groups selected),
where the # of groups selected was fixed at 5 (ie, each method was used to identify the five most important-looking groups). In each of the following simulations, the results are based on the sample size
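The TDR, taken here as the fraction of selected groups that are true groups, is simple to compute once the selected groups are known. In the sketch below the group labels are invented and the function name is an assumption.

```python
def true_discovery_rate(selected, true_groups):
    """Fraction of the selected groups that are true (nonzero) groups."""
    selected = list(selected)
    if not selected:
        raise ValueError("at least one group must be selected")
    truth = set(true_groups)
    return sum(1 for g in selected if g in truth) / len(selected)


# Five groups selected, three of which are among the five true groups.
tdr = true_discovery_rate([1, 2, 3, 7, 9], {1, 2, 3, 4, 5})
print(tdr)
```

With the number of selected groups fixed at 5, the TDR can only take the values 0, 0.2, 0.4, 0.6, 0.8, and 1 in any single replication.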
Setting 3: Unequal group size
First, we investigate the performance of the two approaches when group sizes are unequal. In this simulation, the design matrix consists of 15 groups with all covariate values simulated independently from a standard Gaussian distribution. The group sizes and overlap structure are shown below.
The overlap here is designed to be one-third of the size of overlapped groups. As a result, the total dimension in this setting is
Table 1 summarizes the mean TDR and size of selected groups for the OGLasso and GSEA over 500 replications. The two methods are comparable in terms of TDR, while the average size of selected groups from GSEA is slightly larger.
Table 1. The mean (standard error) of TDR and average size of selected groups of OGLasso and GSEA over 500 replications.
The proportion of each group selected is depicted in Figure 3. OGLasso tends to favor groups with smaller size, while GSEA has roughly an equal probability of selecting a true group regardless of its size. This is understandable, as regression-based methods have a built-in mechanism for encouraging parsimony, unlike GSEA. Whether this preference for smaller groups is desirable depends on the application and the scientific goals of the study.

Figure 3. Comparison of the proportion of each group being selected over 500 replications for Setting 3. In each panel, the 15 vertical bars from left to right correspond to group 1 to group 15, with groups of the same dimension clustered together. The height of the bar represents the proportion of replications in which the associated group was selected by the method.
Setting 4: Heterogeneous gene effects
Previous studies have shown that GSEA is less likely to detect sets of genes containing both positive and negative associations with the phenotype. 19 This is because, by pooling together correlations, GSEA assumes that the genes in a set have a coordinated effect – that is, that they all act in the same direction. In this simulation, we examine this aspect of GSEA further and demonstrate that heterogeneous effects among the genes in a set reduce the statistical power of GSEA.
We employ the same configuration as in Setting 1 of the “OGLasso versus ordinary lasso” section for the design matrix (except that the sample size here is
On a technical note, it must be pointed out that varying σ will also change the group effect, ||γ
Figure 4 compares OGLasso and GSEA in terms of TDR as a function of σ

Figure 4. Comparison of OGLasso and GSEA in terms of TDR as a function of heterogeneity parameter σ. The blue dotted line indicates σ = 1.37, after which negative coefficients can occur by design. The mean values over 500 replications are displayed.
Setting 5: Correlation among genes
In this simulation, we assess how correlation among genes affects group selection. We use the same settings for the groups and overlap structure as in Setting 1, where
Figure 5 compares OGLasso and GSEA in terms of TDR as a function of pairwise correlation ρ. As expected, TDR of both methods decreases as the correlation among genes increases. However, GSEA is much more strongly affected by correlation than the OGLasso. For example, as ρ increases from 0 to 0.1, the TDR of GSEA drops from around 0.8 to 0.45. This ability – to adjust for correlation between pathways – is one of the primary potential advantages of a regression-based approach over a hypothesis-testing approach, which is limited to considering a single pathway at a time.
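One standard way to generate standard Gaussian covariates with a common pairwise correlation ρ is a shared-factor construction, sketched below. This is an assumed illustration of how such a design could be simulated; it is not claimed to be the exact scheme used in this simulation, and the function name and seed are hypothetical.

```python
import math
import random


def simulate_equicorrelated(n, p, rho, seed=0):
    """Draw n rows of p standard Gaussian covariates with pairwise correlation rho.

    Each covariate is sqrt(rho) * shared + sqrt(1 - rho) * own_noise, so every
    covariate has unit variance and any two covariates have covariance rho.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this construction requires 0 <= rho <= 1")
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    rows = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        rows.append([a * shared + b * rng.gauss(0.0, 1.0) for _ in range(p)])
    return rows


rows = simulate_equicorrelated(n=4000, p=3, rho=0.5, seed=1)
print(len(rows), len(rows[0]))
```

Setting ρ = 0 recovers independent covariates, so the same generator covers the uncorrelated case used as the starting point of the comparison.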

Figure 5. Comparison of OGLasso and GSEA in terms of TDR as a function of pairwise correlation ρ.
Real-Data Studies
In this section, we analyze the data from two gene expression studies reported in Subramanian et al. 6 , one involving the mutational status of p53 in cell lines and the other involving the prognosis of lung cancer patients.
The p53 study aims to identify pathways that are correlated with the mutational status of the gene p53, which regulates gene expression in response to various signals of cellular stress. The p53 data 20 consist of 50 cell lines, 17 of which are classified as normal and 33 of which carry mutations in the p53 gene. To be consistent with the analysis in Subramanian et al. 7 , 308 gene sets that have size between 15 and 500 are included in our analysis. These gene sets contain a total of 4301 genes.
The lung cancer data 21 contain gene expression profiles in 86 tumor samples, of which 24 are classified as “poor” outcome and the remaining as “good” outcome. The data sets are preprocessed in the same fashion as in the p53 study, resulting in 258 gene sets that contain a total of 3256 genes. Compared to the p53 data, the lung cancer data show much weaker signals: no individual gene is found to be significant in a conventional single-gene analysis.
We first compare the OGLasso to the ordinary lasso in terms of prediction accuracy. For each method, 10-fold cross-validation was used to choose the regularization parameter λ.
Indeed, as shown in Table 2, the incorporation of pathway information into the regression model produces more accurate predictions in both studies. In the p53 study, where the signals are relatively strong, the ME of the ordinary lasso is 8% lower than that of the intercept-only model. However, the OGLasso can further lower the error by an additional 6%. In the lung cancer study, due to a small signal-to-noise ratio, the ordinary lasso performed even worse than the intercept-only model. However, the OGLasso was able to improve on the predictions of the intercept-only model, albeit only slightly.
Table 2. Real-data studies: 10-fold cross-validated ME for different models. “Baseline” is the intercept-only model.
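To clarify what a cross-validated ME means for the baseline, the sketch below runs 10-fold cross-validation for an intercept-only classifier, which simply predicts the majority class of each training fold. The class counts (17 normal, 33 mutant) are those of the p53 data described above; the fold assignment, seed, and function name are assumptions, and the resulting number is illustrative arithmetic rather than a value taken from the table.

```python
import random


def cv_baseline_error(y, n_folds=10, seed=0):
    """K-fold cross-validated ME of the intercept-only (majority-class) classifier."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [y[i] for i in range(n) if i not in held_out]
        # An intercept-only logistic model predicts the more frequent class.
        majority = 1 if 2 * sum(train) >= len(train) else 0
        errors += sum(1 for i in fold if y[i] != majority)
    return errors / n


# 17 normal (coded 0) and 33 mutant (coded 1) cell lines.
y = [0] * 17 + [1] * 33
me = cv_baseline_error(y)
print(me)
```

With these counts every training fold has a mutant majority, so all 17 normal cell lines are misclassified and the baseline ME is 17/50 = 0.34 regardless of how the folds are drawn. The same loop structure applies to the lasso and OGLasso fits, with the majority-class rule replaced by a model fitted on each training fold.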
We now turn to comparing the pathways selected by OGLasso and GSEA. Again, 10-fold cross-validation is used to select λ for OGLasso, while an FDR cutoff of 0.25 is used to select pathways with GSEA. Table 3 lists the number of pathways, the number of total genes, and the number of unique genes in the pathways selected by OGLasso and GSEA. In both studies, GSEA selects more pathways than OGLasso, especially in the lung cancer study (21 vs. 3). Moreover, in agreement with our earlier simulation results, GSEA selects substantially larger pathways than OGLasso. For example, in the lung cancer study, the average pathway size for GSEA is 820/21 ≈ 39 genes, while the average size for OGLasso is only 51/3 = 17 genes.
Table 3. Real-data studies: number of selected pathways (# pathways), number of total genes (# total genes), and number of unique genes (# unique genes) in selected pathways by OGLasso and GSEA.
Table 4 presents a summary of pathway selection results in the p53 study that sheds light on the nature of the pathways selected by each approach; an equivalent table for the lung cancer study is included in Supplementary Table 1. Naturally, both approaches identify the “p53 Pathway” as being associated with p53 mutation status. However, GSEA also selects the pathways “radiation_sensitivity”, which shares nine genes with “p53 Pathway”, “p53hypoxiaPathway” (seven shared genes), and “P53_UP” (five shared genes). From a regression perspective, these four pathways are largely redundant, and the three pathways not selected by OGLasso carry no additional useful information beyond that already contained in the p53 pathway. On the other hand, OGLasso selects one pathway, “cklPathway”, not identified by GSEA. Although the ckl pathway has a weaker marginal relationship with p53 mutation status than the hsp27 and p53 pathways, the information it contains is largely independent of the other pathways included in the model (no overlaps with the hsp27 and p53 pathways), potentially shedding light on novel p53 relationships that would not be apparent from the GSEA approach.
Table 4. The p53 study: pathways selected by OGLasso and GSEA with FDR ≤ 0.25.
The biological interpretations of the pathways selected in the lung cancer study are less clear due to the weaker signals and more complicated biological outcome. Nevertheless, there are some interesting similarities and differences here as well. Of the three gene sets selected by OGLasso, one pathway (ceramide) is also selected by GSEA. The other two gene sets, although not selected by GSEA, contain a fair amount of overlap with GSEA-selected sets. For example, OGLasso selects the Fas pathway, while GSEA selects the p53 pathway. However, both pathways are involved in apoptosis, and six genes are shared between the two pathways. The simulation studies of the “OGLasso versus GSEA” section suggest that differences in the size, heterogeneity, or correlation patterns of these pathways provide an explanation for why OGLasso prefers the Fas pathway to the p53 pathway.
Discussion
Pathway-based approaches for analyzing gene expression data have become increasingly popular in recent years. Most methods have approached the problem from a multiple hypotheses testing perspective. However, the latent group lasso approach proposed by Jacob et al. 10 allows the incorporation of pathway information into regression models as well.
Regression models offer two distinct advantages in this setting. First, they provide a direct method for using the entirety of the pathway information to predict biological responses. Second, they make no assumptions about the distribution of the expression data itself. For this reason, the methods we develop here can be applied to any gene expression study, regardless of the technology used for quantification (qPCR, microarrays, RNA-Seq, etc.).
In this study, we present evidence that the incorporation of pathway information can substantially improve the accuracy of gene expression classifiers. Furthermore, we provide open-source software, publicly available at cran.r-project.org, for fitting the OGLasso models described in this study. By retaining the underlying framework of regression modeling, this approach can be applied to both continuous and binary outcomes, and it is straightforward to extend the idea to Cox proportional hazards models for time-to-event outcomes.
Finally, this study provides, to our knowledge, the only systematic comparison of OGLasso methods with the GSEA approach. There is a fundamental difference between the two methods: GSEA carries out independent tests of each gene set, while the OGLasso is a regression method that considers the effect of all pathways simultaneously. We show that, while there is broad agreement between the two, substantial differences between the approaches may arise with respect to pathway size, heterogeneity of gene effects, and correlations between gene sets. These factors, along with the goals and design of the study, should be carefully considered when deciding upon an approach to data analysis.
Author Contributions
Conceived and designed the experiments: YZ, PB. Analyzed the data: YZ. Wrote the first draft of the manuscript: YZ. Contributed to the writing of the manuscript: YZ, PB. Agree with manuscript results and conclusions: YZ, PB. Jointly developed the structure and arguments for the paper: YZ, PB. Made critical revisions and approved final version: YZ, PB. Both authors reviewed and approved of the final manuscript.
