Abstract
Background
Biochemically, the binding of a ligand-receptor pair (LRP) is a signal transduction event passing information from the outside of the plasma membrane to the interior of the cell. This event sets off a complex network of protein signaling cascades and interacting complexes that cause phenotypic changes on the level of the cell and in the context of cancer. 1
Ligand-receptor systems are a major focus for targeted anti-cancer therapies. For example, Yarden 2 reviewed the epidermal growth factor receptor (EGFR) pathway and the known molecular mechanisms that the cell processes to lead to proliferation or survival or migration, noting that these signals can be attacked by anti-EGFR monoclonal antibodies and tyrosine kinase inhibitors. These options face a long translational process 3 as aberrant signaling is first identified in pre-clinical models and candidate therapies progress to early clinical trials. As Gschwind and colleagues observe, there are a number of these therapies in the development pipeline, 4 so for a given cancer, we might imagine that the first task is to begin to compile evidence that a targetable receptor may be active and important. To rapidly identify promising candidate LRPs, Graeber and Eisenberg 5 hypothesized that the correlation between mRNA levels of a known LRP is a way to infer pairs with active autocrine signaling for a given disease. They used the hypothesis to rapidly screen about 200 cancer samples for candidate informative pairs using high-throughput gene expression data. With the increasing volume of genomic studies, this technique has been repeated by Castellano and colleagues 6 with increasing emphasis on finding an association with survival times for advanced ovarian cancer.
Statistically, the standard approach to merging survival responses with two genes at a time is a differential correlation (DC) or differential co-expression analysis. These methods tie signaling to prognosis by stratifying patients into empirically defined survival groups and then testing whether correlation between the ligand and the receptor differs between the two. This is an inelegant solution from the perspective of continuous survival regression models because dichotomization ignores patients with intermediate survival times and because regression models can adjust for other factors. Instead, we might regress the ligand on the receptor (or receptor on ligand; it is unclear which is preferable) using a dummy variable or transformation of survival time to account for prognosis. This is difficult because survival times are frequently censored. Further, the result can only confirm the effect of survival on correlation and not estimate the more valuable effect of correlation on prognosis.
In our previous work, 7 we noted that survival time regression model's interactions were sensitive to DC-type interactions, but we did not investigate why this association exists. In this article, we attempt to address these inconsistencies by conjecturing that there exists an underlying data-generating process that links survival and correlation through an unobserved activation level. We are able to show that this assumption is consistent with the properties of data observed to date, and we consider the implications of this model for data analysis.
Methods
Activation signal hypothesis
We hypothesize that signal transduction can be described as a continuous level that generates the correlation between ligand and receptor. This activation level may not be directly observed, but its influence on survival (or another phenotype) may be seen. Let
where the standard normal error ε captures the influence of other factors on survival; β1 controls the importance of activation, and (β0, σ) are chosen to set the marginal distribution of
The activation-dependent correlation is
Now, the inference problem is to detect the fact that these ligand and receptor are correlated and that this is somehow related to survival. We will demonstrate the result that the interaction between the ligand and receptor expression levels can be detected in a regression model. This is surprising for three reasons: the degree of correlation is patient specific (σψ(
Illustrations
Figure 1 comprises two simulated examples, motivated by the clinical prognosis for advanced ovarian cancer, showing the relationship between the activation hypothesis and the analysis of the LRP. In each scenario, described below, we set β1 = 1, α = 0.8, β0 = log18, and σ = 0.025, corresponding to a median progression-free survival (PFS) of 18 months and 12% survival at 60 months when
Scenario 1. Patient-specific activation levels.
We generated standard normal patient-specific activation level
We note that the marginal correlation between the ligand and the receptor is
We performed a standard Cox PH regression that found no marginal association between PFS, and

Simulated data illustrate survival times (top row) and correlation (bottom row), showing that the activation hypothesis generates DC in both a patient-specific (left) and a discrete (right) scenario.
Scenario 2. Extreme DC model.
DC analysis implicitly assumes that patients can be classified into an active or inactive LRP state. We consider an extreme scenario that strongly favors a DC-type analysis. Let inactive patients have level
By design, this scenario generates significant DC (
Therefore, we conclude that the latent activation model is a viable data-generating model for the LRPs and their effect on survival. It possesses properties consistent with analysis by DC and regression. A viable measure of association can be derived by studying the interaction term between the ligand and the receptor through a survival time regression.
Simulation Studies
Power to detect signaling
We consider the power of an activation regression-type model and a standard DC analysis to identify the presence of an active LRP. The DC algorithm is a single test statistic based on Fisher's transformation and standard normal theory (see the Appendix for details). We test both Cox PH regression and Weibull parametric regression (PR), expecting the latter to be competitive with the parametric DC model. There are three parameters to vary: α the strength of correlation induced by activation, β the effect of activation on survival, and
As the sample size increases, the ordering and shape are as we would expect. Specifically, the PR performs better than semiparametric regression and both are more powerful than the DC method. As the sample size gets very large (
For α = 1, α = 0.5, and α = 0, the survival effect was varied for the semiparametric model. We omitted the other models for clarity. As β grows from 0, it seems to reach an equilibrium power between 0.05 and 0.1 for each value of α |β|= 0.1 is very small for a one standard deviation change in

Power and sample size simulations demonstrate the sensitivity to α and
Correlation has a strong impact on the power. The power of the DC model is not significant for small α values (-0.5 ≪ α ≪ 0.5). α = 0 represents the null hypothesis. The tails of α matter for DC; it is only the superior model near -1 and 1. Loss of power for the semiparametric model is expected; however, it is important to note that it outperforms DC model at most levels. The PR model maintains the highest power for all values of α between -1 and 1. This demonstrates that DC is a flawed model.
Effect of censoring on power
We consider the power of each model as a function of increasing censoring rate. We analyze the effect of censoring on the three models at parameter values of α = 1.0, β = 0.8, and
Figure 2 (lower right) shows the power of each model as the censoring rate increases. PR consistently outperforms the DC and semiparametric regression models across the varying censoring rates. For censoring rates of about 0.3 and above, the semiparametric regression model outperforms the DC model. All three models decrease as the censoring rate increases; however, the DC model decreases at a greater rate than the semiparametric regression model and the PR model. The uncensored data have a power of 0.67, 0.55, and 0.73 for the DC, semiparametric regression, and PR models respectively. The DC, semiparametric regression, and PR models have powers of 0.23, 0.31, and 0.48, respectively, at a censoring rate of 0.5.
It is clear that DC is most affected by censoring. Results from before are neutral or favor the activation model in uncensored data. Therefore, we expect censoring will magnify the superiority of activation regression models over DC models.
Correlation of Activation of Multiple Pairs
Suppose that the activation level
We evaluated the degree to which a multivariate model using these pairs is able to recover the unobserved activation level by simulation. Again, β0 = log(18), β1 = 1, and we draw
It is unrealistic to assume that we know a priori which pairs are active (ie, which pairs follow the bivariate normal correlation model). Consider the
Taken together, the result implies that a two-step process may be possible: first, we may quickly screen an unselected set of pairs to estimate the per-patient activation level and, second, verify the association of individual pairs with the estimated activation level to identify the active set.

Estimation of latent activation by multiple LRPs as a function of the number of active pairs (left) and as a function of noise variables added to the model (right).
Data Analysis
Ovarian cancer is the leading source of death because of gynecologic cancer. 8 This is due in part to the fact that patients rapidly develop resistance to primary chemotherapies and remain in a phase of palliative care where alternative, targeted therapies may have an effect. 9 In particular, bevacizumab, an anti-angiogenic therapy targeting vascular endothelial growth factor A (VEGFA), is developing as an option to augment primary therapy.10,11 Thus, we might consider what other LRPs may be associated with prognosis in ovarian cancer to find candidate targets for therapy.
In this analysis, we consider gene expression data from the Cancer Genome Atlas (TCGA) ovary project, which as been measured on Affymetrix U133A arrays (note that the TCGA study uses three different platforms, and we have selected the Affymetrix version for analysis as it is most complete and straightforward to analyze by reproducible bioinformatic workflows). These data have been processed by the robust multi-array analysis (RMA) 12 , aggregated at the gene name level by choosing the brightest spot, and scaled and centered across 503 TCGA patients. In addition, we have a validation data set from an Australian observational study of ovarian cancer 13 (GEO: GSE9899) using an equivalent Affymetrix array with which we can conduct an independent evaluation of our findings. Throughout, we consider PFS (the time from surgery to the progression or recurrence of disease or death, whichever occurs first) as the clinical outcome and employ the less-powerful, but more common semiparametric Cox PH regression model. A hybrid approach might be used (given below): selection of interesting pairs by the powerful DC model and follow-up by activation modeling.
We examined a known set of LRPs taken from the Database of Ligand-Receptor Partners, 5 where 162 ligands and 131 receptors accounting for 419 interacting pairs were present on the array. We added the KEGG pathways, 14 hsa04060 (cytokines/chemokines), hsa04512 (cell adhesion molecules), and hsa04514 (ECM interactions), to this set to update the database for recent discoveries. In total, there are 475 pairs (200 ligands, 166 receptors) for consideration after verifying the ligand/receptor functions (Supplementary Table 1).
A key difference between the regression and DC approaches is that we can incorporate multiple LRPs into a multivariate model. We performed screening of all 475 pairs and found 27 LRPs that are significantly associated with PFS (unadjusted screening
Of these three models (full fit, AIC, BIC), the BIC model with eight predictors fit to the TCGA data was able to produce predicted risk scores associated with prognosis in the independent cohort (HR = 1.3, 95%CI: 1.03–1.64,
As we noted above, bevacizumab, which targets VEGFA, has recently shown promise in primary ovarian cancer therapy. 10 In addition, both VEGFA and TGFB1 were shown to induce immunosuppressive responses; targeting these factors may aid in preventing tumor progression. 15 Zaid and colleagues demonstrated that overexpression of FGF1~FGFR4 indicated poor prognostic results and inferred that targeting this LRP may lead to better survival outcomes. 16
We considered a sensitivity analysis by conducting five-fold cross-validation, repeating model selection in each fold and predicting on the withheld one-fifth of the data (approximately 100 patients). The result is shown in Table 1 for BIC and the remainder in Supplementary Table 2. There we have tabulated the number of times the AIC, BIC or full-fit model selects each of the 27 pairs and the corresponding hazard ratio and significance of the predictions in the withheld subset. The number of folds validated refers to the number of times an LRP was selected out of the five-fold cross-validation processes. In this sense, TNFSF14~TNFRSF14 was a strong result, selected in every fold in each model. Similarly, IL7~IL7R, VEGFA~NRP1, and PVRL3~PVRL1 were selected in all but one fold in every case. Note that all of these were present in the BIC model.
In the independent data set, only CCL4~CCR8 and IL7-IL7R are significant in this set of eight pairs. However, fitting this two pair model into the TCGA cohort and testing in the independent cohort yields a significant association (HR = 3.14, 95%CI: 1.56–6.32,
Multivariate model, BIC selection.
Selected as a predictor if trained on independent data.
We now consider the estimated activation level fit using the TCGA data. In the independent data set, the predicted activation shows that the 27-pair model is correlated with the 2-pair model (r = 0.43, 95%CI: 0.33–0.54) as is the 8-pair BIC model (r = 0.55, 95%CI: 0.46–0.63). So as in the simulations, little is lost by including extra predictors. The activation level is fairly prognostic when stratifying on the median level for the 27-pair (16 vs. 14 months PFS,
Discussion
In this article, we have developed a model that links correlation between a signaling ligand and receptor pair to prognosis. Because this association can be seen through statistical interactions, the analysis is amenable to multivariate regression and is therefore useful for practical genomic data analysis. The novelty of this activation model approach lies in the connection between correlation and survival. This connection may be patient specific - different patients can have different correlation levels - and it does not rely on dichotomizing the population wasting statistical power on estimating the correlation instead of modeling the association with survival. Further, we find the power of the semiparametric Cox PH regression competitive with DC analysis, especially in the context of right-censored data.
Because standard regression model building usually searches for main effects before interactions, it is likely that prior studies have overlooked useful LRP associations. We have shown that main effects can be insignificant versus the statistical interaction. Throughout the article, we have referred to the interaction term in the survival regression model. While the proposed model building and testing does not include the main effect terms, classically, these are included to account for rescaling the expression variables. In this situation, we note that the main effects are expected to be uncorrelated with survival by construction.
We have demonstrated the ability of multiple LRPs to estimate an underlying activation level, under the assumption that they are driven by a singular process of activation. In our data analysis, we have shown that this level is a useful tool for prognostic stratification and meaningful biological and translational hypothesis generation.
The activation model makes little assumption on the survival time distribution. While we employed log-normal survival times throughout this analysis, we have little reason to believe that the form of the hazard will dramatically affect the results for the semiparametric model. The PR may be viewed as an exercise in model misspecification.
We see this activation model as a tool that may be deployed along with other correlation-based bioinformatic techniques. Our data illustration highlights how multiple LRPs can be considered together. By focusing on LRPs that are highly likely to be targetable or to have an approved or pre-clinical compound, this type of analysis has a strong potential for clinical benefit.
Author Contributions
Conceived and designed the experiments: CR, KHE. Analyzed the data: CR, KHE. Wrote the first draft of the manuscript: CR, KHE. Contributed to the writing of the manuscript: CR, KHE. Agreed with manuscript results and conclusions: CR, KHE. Jointly developed the structure and arguments for the paper: CR, KHE. Made critical revisions and approved the final version: CR, KHE. Both authors reviewed and approved the final manuscript.
