Abstract
Introduction
RNA-Seq provides a revolutionary way to unveil transcriptome details by using ultra-high-throughput sequencing technologies to generate hundreds of million short reads from RNA molecules.
1
The short reads are a type of big data that usually need several or more gigabytes storage. They are usually aligned against a reference genome (eg, human genome) by using alignment tools such as
The transcription map can be represented by a non-negative integer
From a translational bioinformatics viewpoint, an essential issue in RNA-Seq data analysis is to answer the following two related queries. First, given a read count matrix, how to robustly determine whether the observed difference in read counts for a gene across two or more conditions is statistically significant? Second, how to retrieve disease biomarkers from RNA-Seq count data to provide a possible guide for disease diagnosis and prognosis? It is noted that we use terminologies “RNA-Seq data” and “RNA-Seq count data” interchangeably for the convenience of following description, unless there is a special notation.
Quite a few differential expression (DE) analysis methods have been proposed to answer the first query from different standing points.7–12 They can be categorized as parametric and nonparametric approaches according to whether they rely on statistical parameter estimation modeling approaches. The parametric methods assume that read counts subject to a probability distribution and estimate corresponding parameters for the distribution before conducting a corresponding hypothesis test to rank genes.7,8,10,12 For example,
Unlike the first query, there was no previous work in the literature on RNA-Seq data biomarker discovery. However, compared with traditional microarray data, RNA-Seq data can provide a more reproducible, high-resolution digital expression for monitoring RNA transcription. 5 Especially, it makes each gene's expression in a single sample comparable with those of others.7,14 On the other hand, it is almost impossible to compare the expression levels of genes within a sample for microarray data because of the strong background signals generated from the hybridization process of the microarray. Thus, it will be desirable to seek disease biomarker discovery from RNA-Seq count data for the sake of disease diagnosis and prognosis by taking advantage of these properties. However, RNA-Seq data biomarker discovery remains a challenging problem for the following major reasons.
First, the special characteristics of RNA-Seq count data present hurdles from reusing those biomarker discovery algorithms developed from traditional omics data, ie, microarray or proteomics data. For example, RNA-Seq count data have much fewer number of samples (eg,
Moreover, different from traditional omics data, which are usually normally distributed after normalization, RNA-Seq count data are usually modeled as the NB distribution or Poisson distribution.7,14 The biomarker discovery methods developed under the normal distribution assumption usually cannot apply to RNA-Seq count data directly. In addition, most genes in RNA-Seq count data have quite good discriminative abilities under classification compared with microarray data. This is due to the built-in ultra-high resolution of RNA-Seq that leads to more accurate measurement and higher dynamic range for gene expression under different conditions.1,15
For example, we have found that the widely employed
Second, quite a lot biomarker discovery methods require an accurate
For example, some genes with low counts or same level high counts in two conditions are actually due to the artifacts of library preparation protocols, sequencing inaccuracy, or alignment imprecision, instead of reaction to treatment.5,13 In fact, those genes, which do not have real contributions to data variations under a treatment, should not enter the DE call for the sake of the sensitivity of DE analysis, because they will distract the focus of DE analysis by increasing false positive ratios.
Moreover, almost all parametric DE analysis methods are SD-dependent methods 13 and they would falsely detect some non-DE gene as a DE gene, when the SD increases in a condition. That is, the count increases in the condition will be falsely diagnosed as a statistically significant DE change. Thus, a serious feature selection with an aim to remove the genes with no real contributions to data variations will contribute to the SD independence of these methods by decreasing false positive ratios in the DE call.
In this study, we presented a novel biomarker discovery algorithm: SEQ-Marker for RNA-Seq data from a network discovery standing point. The proposed SEQ-Marker algorithm is built on a proposed data-driven feature selection algorithm, nonnegative singular value approximation (NSVA), proposed in this study. As a feature selection algorithm to select genes according to its contribution to the whole data variance, it does not require any probability distribution assumption about count data. It also demonstrated a good consistency in identifying large variance genes (eg, long genes) as DE genes when integrated with classic DE analysis methods. Moreover, we compared our NSVA feature selection with two competing methods, count-based naive feature selection (NFS) and principal component analysis (PCA), to demonstrate its advantages in selecting meaningful genes.
Unlike the traditional biomarker discovery methods in the microarray community, the proposed SEQ-Marker employed a novel strategy to identify biomarkers from network markers. That is, it searched an inferred network marker at first for RNA-Seq count data and then identified meaningful biomarkers from the network marker by retrieving gene interaction and gene mutation information. The proposed biomarker discovery algorithm aimed at finding biomarkers by novelly viewing an inferred network marker as a small database. The database not only unveiled interaction information between genes but also made it possible to explore real disease biomarkers along an interaction “path”. Such a search scheme is especially helpful to avoid the tissue-specific expression biomarkers and identify those real disease biomarkers, which may not express themselves in omics experiments.
It is worthwhile to point out that our biomarker search mechanism is specially designed according to the properties of the real disease biomarkers by answering the following queries: “which genes’ mutations will affect most genes in the network marker?” and “which genes have the highest proximities with a most likely gene marker?”. Obviously, such a biomarker search mechanism will have advantages to capture real disease markers by overcoming the weakness of traditional
On the other hand, our proposed biomarker discovery model overcame the weakness of the traditional network marker and individual gene marker discovery. The former has been facing the difficulties in biomarker validation, because it could be prohibitive or impossible to conduct a biomarker validation for a network marker with more than hundreds of genes. The latter suffers from the poor reproducibility caused by adhocness of
Biomarker Discovery for RNA-Seq Count Data
As we pointed out before, quite a lot biomarker discovery methods on traditional omics data cannot be applied to RNA-Seq count data directly, because of the special characteristics of RNA-Seq count data. The question is “what kind of biomarker discovery algorithms of RNA-Seq count data should be?” We believe a desirable RNA-Seq biomarker discovery should satisfy at least the following criteria.
First, only a portion of meaningful genes instead of all genes should participate biomarker discovery, because a lot of genes are not informative enough to contribute to disease diagnosis. For example, some genes with very low variances in two conditions are actually not reaction to treatment but results of the artifacts of library preparations or relaxed alignment constraints. Alternatively, it is computationally expensive or even prohibitive to include all genes of a read count dataset in biomarker discovery. As such, we need a feature selection algorithm to filter the genes before starting biomarker discovery officially.
Second, a robust DE analysis model to accurately calculate
Third, the biomarker discovery algorithm should demonstrate some potential to overcome the weakness of traditional omics biomarker discovery methods by enhancing identified biomarkers’ reproducibility and validation. For example, gene–gene (protein–protein) interaction information should be included in the biomarker discovery to improve the reproducibility of biomarkers. In particular, checking the interaction information of identified biomarkers would help to bridge the gap between biomedical research and clinical practice by identifying “real” or new markers.
19
This is because that some identified biomarkers with strong statistical support (eg,
Feature Selection for RNA-Seq Count Data
Although various feature selection algorithms are available in traditional omics data communities, most of these statistical testing-based methods may not be applied to RNA-Seq count data directly, because they usually assume that population data are normally distributed (eg,
As such, it is believed that a desirable feature selection for RNA-Seq count data should satisfy the following criteria. First, it should be a data-driven method that does not have any prior assumption about data distribution to avoid possible biases from the distribution itself. Second, it should avoid evaluating each gene's significance from the linear combinations of all genes in a subspace induced by a linear or nonlinear transform. Third, it should take consideration of the nonnegative characteristic of RNA-Seq count data instead of treating them as generic data. As such, we presented a novel data-driven feature selection method, NSVA, which did not have any prior data distribution assumption and enables the gene-significant ranking by fully taking advantages of non-negativity of RNA-Seq count data. To some degree, it can be viewed as a special singular value decomposition (SVD) for nonnegative data. However, the characteristics related to SVD applied to nonnegative data are first unveiled and proposed in this work. We describe the classic SVD as follows before introducing NSVA.
Singular value decomposition
Given matrix
Different from SVD that treats nonnegative read count data as genetic data, our proposed NSVA is a more data-driven algorithm that assumes the non-negativity of input data to match the key characteristic of RNA-Seq count data. Our NSVA is built upon the Perron–Frobenius theorem, which has been widely used in Google webpage ranking, 21 as follows.
Perron–Frobenius theorem
Given a nonnegative square matrix
λm > 0 and νm ∈
Given Aν = λν and λ ≠ λm, then the eigenvector ν ∈
Applying it to the SVD decomposition of a nonnegative matrix, we have our NSVA, whose proof details are skipped for the conciseness of description.
Nonnegative singular value approximation
Given a nonnegative matrix
Both vectors
The vectors
Matrix
It is noted that NSVA guarantees a purely additive decomposition of a nonnegative matrix
From a single gene viewpoint, NSVA means that each gene is approximated by the projection of its corresponding entry in vector
NSVA feature selection for RNA-Seq count data
Our proposed NSVA makes it possible to represent eachgene
A
As a measure to rank each gene's contribution to all samples of an RNA-Seq count dataset, the gene contribution score is independent of data distribution. In other words, such a property guarantees that it will integrate with any data analysis methods smoothly, no matter they are parameter estimation methods or not. Moreover, it avoids the technical difficulty faced by the traditional transform-based feature selection methods (eg, PCA) to evaluate each gene's significance from the linear combinations of all genes by taking advantage of the non-negativity of RNA-Seq count data. Thus, as a measure induced by NSVA, it appears as a desirable way to conduct feature selection for RNA-Seq count data. We call such a gene contribution score-based feature selection as NSVA feature selection and describe it as follows.
The NSVA feature selection has the following two steps. The first is to conduct NSVA for input RNA-Seq count data and compute the gene contribute score for each gene. The second is to employ the gene contribution scores to rank the importance of each gene and filter the genes with small gene contribution scores. For instance, sort all gene contribution scores and pick the top 2000 genes with largest scores. The genes with large gene contribution scores, which are potentially “good” genes, will enter the following data analysis (eg, DE analysis). It is noted that the gene contribution score is a weight to evaluate each gene's contribution to all samples instead of a percentage value, that is the summation of all gene contribution scores is not equal to 1.
It is worthwhile to point out that NSVA feature selection is actually a variance-based feature selection, where NSVA filters the genes according to their gene contribution scores, which is equivalent to filtering genes by the count variance. Each gene,
Data variation explanation ratios
It is noted that the first singular value
SEQ-Marker Algorithm
The proposed SEQ-Marker consists of the following three major procedures. First, it employed NSVA feature selection to obtain a gene set
Second, it applied
Third, it implemented a novel biomarker search strategy by searching biomarkers from an inferred network marker rather than from RNA-Seq count data directly. The proposed SEQ-Marker algorithm at first employed
The reason we identified the core genes in the network marker was to answer the query: “which genes will most likely act as key genes in the network marker and its mutations will affect other genes mostly?” Previous studies reported that the genes with the largest interactions will play an essential role in identifying disease molecular signatures and their mutations will have most impacts on those of other genes in the network marker.19,23,24 Moreover, collecting genes with closest correlations with identified core genes will identify those genes with highest proximities with the key genes. That is, their mutations may trigger those of the identified core genes or vice versa. It is equivalent to answering the query: “which gene will most likely mutate with the key genes in the network marker?”
The reason we picked

The flowchart of the proposed SEQ-Marker algorithm. The seQ-marker algorithm consists of the following main components: a data-driven feature selection algorithm: NSVA; a “new” DE analysis method: NSVA-DESeq, by integrating our NSVA feature selection with the parametric DESeq analysis; and a novel network marker-oriented biomarker identification search strategy.
Merge the network markers
Search the network marker to identify the first
Cluster the network marker with a degree threshold (eg, 2) and seek an associative gene for each core gene that has the nearest correlation distances with the core gene with adjusted
It is noted that we could have more than one associative gene for each core gene theoretically. However, we preferred to implement only one associative gene search in our implementation for the sake of biomarker validation. Moreover, the network marker acted as a small database that provides interaction information for biomarker identified. On the other hand, the biomarkers identified will reveal the essential genes in the network marker, both of which contribute to unveiling the disease signature in a comprehensive approach.
Results
We presented our results on RNA-Seq count data biomarker discovery by applying our SEQ-Marker algorithm to two benchmark datasets:
The success of the proposed biomarker discovery algorithm largely relies on
Figure 2 answered the query by comparing NSVA-

The scatter plots of log2 mean versus log2 fold changes by comparing
First, our NVSA feature selection demonstrated a good sensitivity to filter those non-DE genes, no matter that DE genes were the majority or not among all the input genes. For instance, non-DE genes were the majority among all genes in the
We compared our NSVA feature selection with its two competing methods: count-based naive feature selection (NFS) and principal component analysis (PCA) feature selection to further demonstrate its advantage in picking potential DE genes. The count-based NFS selected genes according to its counts completely. It consisted of the following two steps. The first step selected all genes whose entries are more than or equal to the median count of the input data. The second step sorted all genes according to its coverage, ie, the sum of its counts, and selected the top-ranked genes (eg, 2000 genes).
On the other hand, PCA feature selection ranked each gene by using the 2-norm of its projection in the subspace spanned by the first three PCs. It represented the gene's contribution to all PCs and reflected its significance in the spanned subspace. PCA feature selection consisted of the following three steps. The first step conducted PCA for input data and projected it to the first three PCs. The second step calculated the 2-norm for the projection data of each gene in the subspace spanned by the three PCs. The third step sorted the genes according to the calculated 2-norm and selected the top-ranked genes. It was interesting to point out that our PCA feature selection had very high explanation ratio (>99%) for both datasets, compared with NSVA feature selection.
The northeast and northwest plots in Figure 3 compared the DE ratios from the three feature selection methods: NSVA, PCA, and NFS under DESeq analysis on corresponding 2000, 3000, 5000, and 8000 selected genes from the two original RNA-Seq datasets. The DE ratio was defined as the ratio of DE genes among all the genes of input data. It was interesting to see that the DE ratios from NSVA feature selection were much higher than those of NFS and PCA feature selection for all selection cases of two datasets. Since we only employed DESeq for DE analysis for all datasets, it was clear that the proposed NSVA feature selection demonstrated its advantage in selecting potential DE genes than the NFS and PCA feature selection.

The comparisons of DE ratios and DE gene median counts for NSVA, PCA, and NFS feature selection methods under
All DE ratios showed stable increasing patterns with the increase in the number of genes filtered on the
On the other hand, the largest DE ratios from PCA and NFS only reached 83.05% and 87.45% on the selected 2000 genes for the
Second, the proposed NSVA tended to select the DE genes with high counts with increase in the number of genes filtered. The southeast and southwest plots in Figure 3 compared the DE gene median counts from three feature selection methods. It was interesting to see that the DE gene median counts from NSVA was generally lower than those from NFS and PCA on two datasets, except those that were greater than the DE gene median counts from PCA at the 5000 and 8000 gene selection cases. In other words, NSVA feature selection had the highest DE ratios but shortest median counts for DE genes among all the three methods. It suggested that DE genes in NSVA-DESeq were mostly high-count genes instead of low-count ones, which was consistent with the previous results.5,14
However, it does not mean that high-count genes will be DE genes because our result demonstrated the DE ratios from NFS were much lower than those of NSVA under
Third, we found that the DE genes among NSVA-selected genes tended to be relatively long genes, which was also true for NFS- and PCA-selected genes. Figure 4 compared the gene length medians of NSVA-, NFS-, and PCA-selected genes and DE genes among these selected genes. It was interesting to see that PCA selected the shortest genes among three of them. For example, the gene length medians for its DE genes were quite low in the 3000, 5000, and 8000 gene selection cases. Considering the low DE ratios and low counts for PCA-selected methods, it is reasonable to say that PCA tends to select those genes with low counts or short lengths, most of which are obviously not DE genes.

Comparisons of the gene length medians of the genes selected by NSVA, PCA, and NFS methods and DE genes among the selected genes for the
Alternatively, NFS-selected genes and the DE genes among them had the longest gene lengths, but the DE ratios from NFS were quite low compared with those of NSVA, especially under the 5000 and 8000 gene selection cases. It strongly suggested that only picking high-count genes would not contribute to enhance DE analysis, because a large number of pseudo high-count genes could be generated from those lanes with high SD in RNA-Seq.7,13
On the other hand, the median gene lengths from NSVA-selected genes were shorter than those from NFS but higher than the DE gene median length (26,445 bp) of all genes for the
In summary, NSVA feature selection seems to make the following
Biomarker Discovery for Kidney–Liver and Prostate Data
We applied our SEQ-Marker algorithm to seek biomarkers for the

The plots of 1801 DE genes and 199 non-DE genes of 2000 genes selected by NSVA for the Kidney–Liver data. Unlike the DE genes, the non-DE genes have fold changes in a quite small range.
In addition, we obtained five network markers with scores 8.219, 7.922, 7.754, 7.735, and 7.637, respectively from

The network marker with 102 genes and 194 interactions identified by the SEQ-Marker algorithm for
It was interesting to find that these core genes were actually gene markers closely related to kidney and liver diseases.
For example, the gene FN1 with the largest interactions was reported as a gene associated with the development of renal cell cancer (RCC), that is a cancer related to kidney. 31 The gene ESR1 with the second largest interactions in the network marker was usually differentially expressed in liver, kidney, and other human organism parts and its mutation was reported to relate to kidney- and liver-related diseases (eg, osteosarcoma of the kidney). 32 The gene HSB90AB1 with the third largest interactions has usually seen its DE in liver, kidney, and other organisms, and its mutations were reported to be associative with kidney diseases in the previous studies. 33 Alternatively, it was easy to enumerate the biological relevance of the other two core genes, HSPB1 and CTNNB1, in the network marker with respect to kidney or liver diseases in the literature. The HSPB1 interacting with p53 directly was reported to inhibit the lung and liver tumor progression, 34 and the somatic mutation of the CTNNB1 gene was associated with the liver, skin, and other cancers. 35
Table 1 showed more detailed information about the five core genes. Interestingly, the core genes were DE genes with strong
The five core genes identified for
We conducted clustering for the network marker with a degree threshold 2 by employing MCODE26,36 and obtained four clusters: {ATP5B, FN1, ATP6V1B2, ATP5C1}, {PIK3R1, ERBB2, CTNNB1}, {HSPH1, HSP90 AB1, STK24}, and {SNRNP70, PAN2, PRPF8, RPL11, SNRNP200, PRKAA1, TUFM}. The first cluster had a cluster score of 3.33 and the other three had 3. We further found five corresponding associative genes for the core genes, where ATP5B and STK24 were the associative genes found in a same cluster as their core genes FN1 and HSP90ABf, respectively. Moreover, RPS9, ADH4, and RHOC were the associative genes of the core genes CTNNB1, ESR1, and HSPB1, respectively, Interestingly, almost all associative genes had high correlation values with their corresponding core genes, except RPS9 (core gene: CTNNB1, correlation value: 62.16%).
Table 2 showed the details about these associative genes. They shared almost the same characteristics as the core genes: all genes are DE genes with strong
The five associative genes identified for
Similarly, we applied the SEQ-Marker algorithm to the

The network marker with 203 genes and 730 interactions identified by SEQ-Marker algorithm for
Discussion
In this study, we proposed a novel biomarker discovery algorithm SEQ-Marker for RNA-Seq count data. Our biomarker discovery algorithm is based on NSVA algorithm proposed in this work. As a data-driven feature selection algorithm, our NSVA algorithm demonstrated the advantages in selecting meaningful genes before DE analysis by contributing to lowering false positive rates and improving the sequence depth independent of DE analysis when compared with its peers: PCA and NFS feature selection.
Moreover, as a first algorithm to address biomarker discovery for RNA-Seq count data, SEQ-Marker identified biomarkers by searching an inferred network marker, which acted as a database to provide gene interaction for biomarkers. The database unveiled essential gene interaction information and provided the opportunity to identify real disease biomarkers along an interaction “path”. As such, the biomarkers identified will reveal more network dynamics and contribute to unveiling disease signatures in a comprehensive approach.
It is worthwhile to point out that our proposed biomarker discovery model, SEQ-Marker for RNA-Seq data, overcomes the limitations of traditional omics biomarker discovery by finding the biomarkers without strong
An issue of particular importance is that our studies have quite different focuses from the previous studies,5,30 although almost same data were employed in these studies. For examples, the original
The proposed NSVA demonstrated a good strategy in enhancing the robustness of
Although we only integrated NSVA with the parametric method DESeq in our biomarker discovery in this study, we are integrating NSVA with nonparametric DE analysis algorithms such as NOISeq to investigate its performance in biomarker discovery. 13 In addition, because our proposed SEQ-Marker faces quite a high computing demand due to the complexities of identifying DE network markers in jActiveModule, 25 we plan to use Graphics processing unit (GPU) computing way to tackle the computing burden in the network marker discovery, in addition to applying it to RNA-Seq data-based clinical diagnosis. 45
Author Contributions
Conceived and designed the experiments: HH. Analyzed the data: HH. Contributed to the writing of the manuscript: HH. Agree with manuscript results and conclusions: HH, XJ. Jointly developed the structure and arguments for the paper: HH, XJ. Made critical revisions and approved final version: HH, XJ. Both authors reviewed and approved of the final manuscript.
