Abstract
Introduction
Frequently in high-throughput genomic research, we want to fit a statistical model using gene expression data in order to predict a future outcome. This has become a modern challenge for statisticians because there are far more features or variables (
Discrete Survival Analysis
Survival analysis encompasses methods in which the outcome variable is time to event (eg, time to death, disease relapse, etc.). The particular method used in the analysis will depend on the scale of the survival times collected. Ideally, these will be measured on a continuous scale, but sometimes for a variety of reasons, researchers only collect times on a discrete scale. For instance, for many diseases, it is impossible to record the precise date and time of relapse (ie, a continuous measurement) because the needed data are often only collected at a physician visit. Thus, we are forced to work with discrete times. Furthermore, discrete times are used when the latent scale of the response times is discrete.
High Dimensional Discrete Survival Data
Assume there are
Let
To facilitate the formation of the likelihood, we define an
A
The Forward CR Model with a Complementary Log-Log Link Function
With discrete survival data, we are generally interested in modeling the discrete hazard rate defined as
This is also the form of a probability modeled by a forward CR model. Furthermore, if it is reasonable to assume that the data were generated by a continuous-time proportional hazards model, then we use the complementary log-log (cloglog) link function,
5
Here α
Likelihood
We define the likelihood as a product of
Now define π
We use the generalized monotone incremental forward stagewise algorithm to solve for the penalized solution:
The tuning parameter, λ, controls the amount of shrinkage. As λ increases, the number of parameter estimates that will be shrunk to zero also increases. Using these coefficient estimates and the estimates for the α's (described later), we can recursively estimate the probability that subject
Subject
GMIFS Method for Ordinal Response Modeling
The incremental forward stagewise (IFS) method is an iterative algorithm that produces a penalized solution for a linear regression model. 3 The GMIFS method is an extension of IFS capable of fitting overparameterized logistic regression models. 3 The GMIFS algorithm was extended by Archer et al (2014) for fitting several different logit link ordinal response models to high-throughput genomic data. 4 We updated this method to allow for the use of a complementary log-log link function. The steps of the GMIFS algorithm for ordinal response modeling are as follows 4 :
Enlarge the predictor space as
Initialize the α's to their empirical values. For the forward CR model with a cloglog link, these are initialized
For step
Find
Update
Estimate the α's by maximum likelihood, treating
Repeat steps 4 to 6 until the difference between two successive log-likelihoods is smaller than a pre-specified tolerance, τ.
The rationale for enlarging the predictor space is that it allows us to avoid taking the second derivative of the log-likelihood. Once the algorithm has converged, we can obtain the penalized solution by
Application
Glioblastoma
Glioblastomas (GBMs) are highly malignant and aggressive tumors that arise from the supportive tissue of the brain. Among all primary brain and central nervous system (CNS) tumors, they are the second most common after meningiomas, which are predominantly benign, and the five-year survival rate for GBM patients is less than 4%. 7 Aside from the aggressiveness of the tumors, one possible explanation for the low survival rate is that GBMs are rare in young people; the median age at diagnosis is 64, and the age group with the highest incidence rate is 75–84 year olds. 7 Treatment involves surgical removal of as much of the tumor as is safely possible followed by radiotherapy and/or chemotherapy. 8 The Cancer Genome Atlas (TCGA) Research Network revealed a subtype of GBM related to the mRNA expression and methylation of a set of genes that affects young adults and has an increased survival rate. Researchers also discovered four molecular subtypes of GBM that have unique responses to treatment and gene mutations that could lead GBMs to become resistant to therapy after a standard chemotherapy treatment.9,10 These findings highlight the importance of genomic research in the study and treatment of GBM.
Data
We downloaded the raw CEL files for GSE53733 from Gene Expression Omnibus.
11
The investigators used Affymetrix HG-U133 v2.0 GeneChips to measure gene expression from patients’ tumor samples taken from their initial operation. In the dataset, there were
Results
After the GMIFS algorithm converged, we examined two models: (a) the model selected my minimizing the AIC criterion and (b) the model resulting from the convergence of the GMIFS algorithm (Fig. 1). Using the full dataset, the AIC- selected model misclassified 10 of the 69 patients, while the converged model only misclassified one patient (Tables 1 and 2). However, the AIC-selected model was more parsimonious with 25 non-zero coefficients, while the converged model contained 46 non-zero coefficients. The 25 probe sets that had non-zero coefficient estimates in the AIC-selected model are shown in Table 3. Furthermore, for each model, we examined the sensitivity and specificity for diagnosing short-term survival as well as the sensitivity and specificity for diagnosing short- or intermediate-term survival (Tables 4 and 5). Among the probe sets with non-zero coefficient estimates, the one with the largest absolute coefficient estimate in both models (among probe sets with known gene symbols) was designed to interrogate HD Domain Containing 2 (HDDC2). Long-term survivors had higher HDDC2 expression levels than short-and intermediate-term survivors (Fig. 2). This result agrees with another GBM study that showed that HDDC2 was significantly downregulated in short-term survivors compared to long-term survivors.
12
There was also a clear positive relationship between Nucleoside-Triphosphatase, Cancer-Related (NTPCR) expression and survival time (Fig. 3). Interestingly, researchers have shown that NTPCR is overexpressed in neuroblastomas,
16
but no study has associated NTPCR with GBM. Additionally, one of the probe sets that had a non-zero coefficient estimate was designed to interrogate the gene PDZ and LIM domain 4 (PDLIM4), which has previously been studied in association with gliomas. Researchers examined the expression of the gene at the protein level for patients with gliomas and discovered that the median OS for patients with high levels of the protein (PDLI4) was significantly shorter than for patients with low protein levels.
17
We compared the mean log2 expression levels of PDLIM4 for patients with short-, intermediate-, and long-term OS using Welch's T-test. Patients with short-term OS had significantly higher expression levels than patients with long-term OS at the Bonferroni adjusted

Coefficient paths for our forward CR model using a complementary log-log link.

Boxplot of 203260_at (HDDC2) log2 expression by discrete OS outcome (short-term, intermediate, long-term survival).

Boxplot of 226813_at (NTPCR) log2 expression by discrete OS outcome (short-term, intermediate, long-term survival).
AIC-selected model cross-tabulation of the observed versus the predicted class using the full dataset.
Converged model cross-tabulation of the observed versus the predicted class using the full dataset.
Probe sets with non-zero coefficient estimates in the AIC and converged models.
AIC-selected model sensitivity and specificity for predicting short-term survival and for predicting short- or intermediate-term survival.
Converged model sensitivity and specificity for predicting short-term survival and for predicting short- or intermediate-term survival.
A common critique of a model fitted from high-dimensional data is that the final model, even if selected by minimizing AIC, is not parsimonious. In this example, critics may say that given a sample size of 69 subjects, including 25 coefficients in the model is overfitting, and that the model performance is likely a result of chance. In response, we fit two additional models whose performances will be a result of chance alone. First, we fit a model with the same gene-expression data used in our example, but we randomly permuted the response vector. Next, we fit a model using our original response vector, but instead of using the gene expression data, we used a design matrix filled with 31,744 × 69 = 2,190,336 random variables generated from a Gaussian distribution with a mean and standard deviation equal to the corresponding sample statistics of the gene expression data. If we exclude regions of underfitting and overfitting, the model fit with the gene expression data and the original response vector had better performance than the other two models whose performances are a result of chance rather than a relationship between the features and the response (Fig. 4).

Plot of model misclassification rate by number of variables included in the model for the original GsE53733 data (red line), GsE53733 data with a permuted response (blue line), and Gaussian random variables (green line).
We also performed N-fold (or leave-one-out) CV to assess the generalizability of our models (where 7Y= 69). Both the AIC-selected model and the converged model had an N-fold CV error rate of about 44.9% (Tables 6 and 7). Thus, it appears that the AIC-selected model and the model that satisfied the GMIFS convergence criterion predict discrete survival time equally well. We chose the AIC-selected model as our final model as it is more parsimonious and therefore more interpretable.
N-fold CV: AIC-selected model cross-tabulation of the observed versus the predicted class.
N-fold CV: converged model cross-tabulation of the observed versus the predicted class.
Conclusion
GBM is a particularly dangerous tumor with a low survival rate. A specific and accurate prognosis would be very useful to both the patient and the oncologist. Thus, we were interested in predicting survival time based on a patient's genomic feature data. We used discrete times because the investigators of this particular GBM study reported discrete times. Another case when discrete survival times would be used is when the outcome of interest (eg, disease relapse) can only be assessed at physician visits. The GMIFS algorithm is an effective method for building a classifier for an ordinal response outcome given a high-dimensional covariate space. In this case, we fit a forward CR model with a complementary log-log link function to model discrete survival time. The model resulting from the convergence of the algorithm had only a 1.4% resubstitution error. Using N-fold CV, the model had a 44.9% misclassification rate, significantly better than chance (66% misclassification rate for a three-class outcome), but there is room for improvement. For example, although our method performs automatic variable selection, improvement gains in classification accuracy may be achieved by reducing the dimensionality of the feature set in a meaningful way prior to model fitting. We plan to explore this topic in a follow-up paper. Furthermore, a more accurate classifier could be built with more information. For instance, the five-year survival rate for patients diagnosed between the ages of 0 and 19 is around 19%, while the five-year survival rate for patients diagnosed between the ages of 45 and 54 is only about 3.3%. 7 Additionally, age was significantly different across the three outcome classes in this study 12 but was not made available in the data. Thus, including age as an unpenalized predictor in our model would likely improve its predictive accuracy (the ordinalgmifs R package allows the user to select a subset of predictors that will not be penalized in the GMIFS algorithm 31 ). Also, Karnofsky performance status and extent of surgical resection are known prognostic factors for GBM, 32 so they could have been effective unpenalized predictors as well (despite the fact that these two variables were not significantly different across the three classes in this study 12 ). Additionally, a specific month of death would have provided more information than a range of months. However, for each discrete time value, we would need enough subjects with that response to fit a reliable model. As the price of microarray experiments decreases, we will have greater access to datasets with a larger number of subjects, making this a reasonable expectation for the future.
Author Contributions
Conceived and designed the methods: KF, KJA. Analyzed the data: KF. Wrote the first draft of the manuscript: KF. Contributed to the writing of the manuscript: KJA. Agree with manuscript results and conclusions: KF, KJA. Jointly developed the structure and arguments for the paper: KF, KJA. Made critical revisions and approved final version: KF, KJA. Both authors reviewed and approved of the final manuscript.
