Abstract
Keywords
Introduction
The development of microarray technology has been phenomenal in the past decade, with transcriptional profiling now a standard tool in many genomics research laboratories (Ginsberg and Mirnics, 2006; Rosa et al. 2006). The rapid development and acceptance of this method can be attributed to the fact that microarrays permit the simultaneous measurement of the expression of thousands of genes on a single platform, instead of analyzing them on a gene-by-gene basis. One major application of this technology is the identification of genes that are differentially expressed across various experimental conditions. Such differentially expressed genes may be implicated in biological pathways of interest and can further our understanding of disease mechanisms and treatment strategies (Phan et al. 2006; Cowell and Hawthorn, 2007).
A popular method to detect differences in gene expression has been the use of fold-change cutoffs (Chattopadhyay et al. 2007; Shimada et al. 2007). This approach seeks genes whose expression intensities change, for example, by a factor of two or more between control and treatment samples. However, the fixed threshold cutoff method is not based on specific data modeling assumptions and is statistically inefficient because it cannot account for the numerous systematic and biological variations inherent in a microarray experiment (Jaluria et al. 2007). Another commonly used method is the traditional parametric t-test.
To address the need for a better statistical framework for microarray data analysis, various investigators have proposed the quantification of measurement errors associated with gene expression intensities (Ideker et al. 2000; Tu et al. 2002). In particular, parameters of a statistical data model, which account for potential error sources, can be estimated using the maximum-likelihood estimation (MLE) method. A generalized likelihood ratio (GLR) test can then be applied to identify genes whose expression levels are statistically different. A crucial step in the GLR test lies in the selection of the underlying error structure summarizing the influence of multiple sources of variation in microarray studies. Several models have been proposed for measurement errors in microarray data (Ideker et al. 2000; Rocke and Durbin, 2001; Tu et al. 2002). All of these models account for the observation that the variance of expression data of a gene increases with its mean. Ideker et al. have shown that a model that reflects two types of error, one additive and one multiplicative, can adequately model microarray data at varying intensity levels (Ideker et al. 2000). The multiplicative error term accounts for intra-array variance that influences individual study parameter estimates. The additive error component captures the influence of inter-array variations during replicate experiments. This GLR model has been applied to non-log-transformed intensity levels from cDNA microarray experiments.
Although various other implementations of error structures under the GLR test have been presented, no systematic comparative studies of their performance have been published. It has been reported that logarithm transformation can improve the normality of expression levels and help equalize variance, since raw intensities follow a lognormal distribution (Baldi and Long, 2001; Quackenbush, 2002). To the best of our knowledge, no statistical data model expressed in the log scale has been implemented in the GLR test. Furthermore, it is unclear how the GLR method compares to a traditional statistical test such as the parametric t-test.
Methods
Experimental study
Gene expression data were generated from lung tissues of mice exposed to chronic hypoxia. During chronic hypoxia, mice develop pulmonary hypertension and pulmonary vascular remodeling (Gharib et al. 2005). However, this perturbation is believed to induce only a modest level of differential gene expression.
Four 8-week-old male Balb/C mice (The Jackson Laboratory, Bar Harbor, ME) were exposed to 21 days of hypobaric hypoxia (0.5 atm). Four control mice were housed at sea level for the duration of the experiment (normoxia). On day 21, all mice were sacrificed, whole lungs were removed, and total RNA was isolated and hybridized to cDNA microarrays consisting of 5313 murine genes and expressed sequence tags. The RNA from the four control mice was pooled and used as the reference for all microarray experiments. Four microarray replications with dye swapping were performed for three of the hypoxic mice; the fourth hypoxic mouse was studied in triplicate, for a total of 15 labeling experiments. By performing replications for each animal and using multiple animals, our experimental design captured both the biological and technical noise in differential gene expression during hypoxic exposure.
Data simulation
The first step in our data simulation strategy was to construct a realistic statistical model based on the experimental microarray data, and then to use this model to generate artificial gene expression values with varying statistical characteristics. Note that our statistical model described data from a cDNA microarray experiment in which control and treatment samples are hybridized on the same array. Based on exploratory data analysis of our 15 replicate microarray experiments, we observed that the variability of each gene's intensity under hypoxic or control conditions in the log scale was approximately Gaussian. In addition, we also noted that two intensity measurements (under hypoxia and normoxia) of the same spot were highly correlated. These observations were in agreement with published findings (Ideker et al. 2000; Baldi and Long, 2001; Rocke and Durbin, 2001; Tu et al. 2002).
We simulated 2000 genes in this study: 1000 genes were defined as differentially expressed (µxi ≠ µyi) and the remaining 1000 as non-differentially expressed (µxi = µyi), for i = 1, …, 2000.
In order to create realistic data simulations, we chose the parameters for the above model from our set of 15 replicate microarray experiments. We randomly selected 1000 genes from this data set and assigned their mean expression values during hypoxia and normoxia as the mean intensities (µxi, µyi) of our 1000 simulated differentially expressed genes.
We denoted data sets with s = 1, 2 or 3 as data containing ‘low’, ‘medium’ or ‘high’ signal-to-noise ratios respectively. We generated the ‘low’ signal-to-noise ratio (s = 1) data based on differential gene expression values in our experimental model of hypoxic pulmonary hypertension. As discussed above, we believe this perturbation resulted in only a modest level of differential gene expression. Therefore, we increased the difference between µxi and µyi to generate the ‘medium’ (s = 2) and ‘high’ (s = 3) signal-to-noise data sets.
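As a concrete illustration of this simulation strategy, the sketch below draws correlated control/treatment log-intensity pairs for a single gene and widens the mean difference by a signal-to-noise factor s. The variance, correlation and scaling rule are illustrative assumptions, not the parameter values fitted from our 15 replicate arrays:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_gene(mu_x, mu_y, sigma=0.3, rho=0.8, s=1, n_reps=4):
    """Draw n_reps correlated (normoxia, hypoxia) log-intensity pairs.

    The true mean difference is widened by the factor s, mimicking the
    'low'/'medium'/'high' (s = 1, 2, 3) signal-to-noise data sets.
    sigma and rho are assumed values for illustration only.
    """
    delta = (mu_y - mu_x) * s
    mean = [mu_x, mu_x + delta]
    # Highly correlated pair of measurements from the same spot:
    cov = [[sigma**2, rho * sigma**2],
           [rho * sigma**2, sigma**2]]
    return rng.multivariate_normal(mean, cov, size=n_reps)

pairs = simulate_gene(mu_x=8.0, mu_y=8.5, s=3, n_reps=8)  # shape (8, 2)
```

Repeating such draws for 1000 genes with µxi ≠ µyi and 1000 with µxi = µyi yields a labeled data set on which sensitivity and specificity can be scored.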
Statistical data models
We implemented the statistical data model proposed by Ideker and co-authors (Ideker et al. 2000). In addition, motivated by empirical observations, we also devised three additional error models that considered log-transformed expression intensities, termed GLR1, GLR2 and GLR3, for analysis using the GLR test. These three error structures have the following generic form:
where (µxi, µyi) is the pair of mean intensities for gene i.
For the simplest case, we set f(µxi) and f(µyi) equal to 1. Under this model, the measured expression intensity is dependent on a linear combination of normal error components. Hence, GLR1 is:
Based on (2), GLR2 was constructed by equating f(µxi) and f(µyi) to µxi and µyi respectively (4), so that the variance could be approximately proportional to the mean intensity (Ideker et al. 2000; Tu et al. 2002). In GLR3, f(µxi) = 1/µxi and f(µyi) = 1/µyi. In GLR3, we explicitly modeled the observation that correlation between expression levels under control and treatment conditions decreases at lower intensities by weighting the multiplicative error term using the reciprocals of the mean intensity, as shown in (5).
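The three log-scale structures therefore differ only in how the multiplicative error term is weighted by the mean intensity. A minimal sketch of the weighting functions (the function names are ours, for illustration):

```python
# Intensity weightings f(mu) distinguishing the three log-scale error
# structures described in the text. The models share the same
# additive/multiplicative form and differ only in this scaling.
def f_glr1(mu):
    return 1.0            # GLR1: constant weight, no intensity dependence

def f_glr2(mu):
    return mu             # GLR2: variance roughly proportional to the mean

def f_glr3(mu):
    return 1.0 / mu       # GLR3: multiplicative term inflated at low intensity
```

Because f_glr3 grows as the mean intensity shrinks, GLR3 assigns more multiplicative noise to dim spots, mirroring the observed drop in correlation between paired measurements at low intensities.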
Describing raw intensity data, the error structure proposed by Ideker et al. (2000) is:
Equation (6) was implemented in a public-domain computer program, VERA and SAM (http://db.systemsbiology.net/software/VERAandSAM) (Ideker et al. 2000). Henceforth, we refer to this statistical error model as VS.
Although GLR1–3 may appear similar to the data simulation model in (1), there were fundamental differences between them. Crucially, the simulation sampled a different Gaussian distribution for each of the 2000 genes, whereas each GLR test fitted a single statistical data model encompassing all 2000 genes and then assessed, gene by gene, whether the expression levels under control and experimental conditions differed. For example, although the GLR1 structure (3) may appear very similar to the data simulation model (1), the former pools its error parameters across the entire set of 2000 genes. Furthermore, since the parametric t-test estimates its parameters separately for each gene, it does not share information across genes in the way the GLR tests do.
GLR method
Details of the GLR test can be found in Ideker et al. (2000). For illustrative purposes, we will briefly discuss the GLR test for VS as defined by (6). This statistical error model is dependent on five gene-independent parameters.
The MLE parameter values that maximize the likelihood of the observed data, pooled across all genes, were then computed.
To determine whether, under the pre-specified error model, individual genes are significantly differentially expressed (i.e. µxi ≠ µyi) between the two cell populations, a likelihood ratio test is subsequently performed. Under the null hypothesis, the expression data of a gene are described by a single univariate normal distribution (no differential expression); under the alternative hypothesis, the data under control and treatment conditions are described by two univariate normal distributions with different means. To this end, a GLR test statistic is computed for each gene.
Two maximizations are performed in (8): in the numerator, the likelihood is maximized subject to the constraint µxi = µyi (the null hypothesis), while in the denominator the two means are left unconstrained (the alternative hypothesis). Under the null hypothesis, minus twice the logarithm of the resulting ratio is asymptotically chi-square distributed with one degree of freedom, which provides a significance level for each gene.
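To illustrate the test's logic, the following is a minimal single-gene likelihood ratio statistic assuming Gaussian replicates with a known, shared variance. This is a didactic simplification of the pooled multi-gene fit described above, not the VERA and SAM implementation:

```python
def glr_statistic(x, y, sigma):
    """-2 log likelihood ratio for H0: mu_x == mu_y vs H1: mu_x != mu_y,
    assuming each replicate is Gaussian with known, shared variance
    sigma**2. A didactic single-gene sketch only."""
    n, m = len(x), len(y)
    mx, my = sum(x) / n, sum(y) / m
    m0 = (sum(x) + sum(y)) / (n + m)          # constrained (null) mean
    # Gaussian log-likelihood differences reduce to sums of squares:
    ss_null = sum((v - m0) ** 2 for v in x + y)
    ss_alt = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    return (ss_null - ss_alt) / sigma ** 2    # ~ chi-square(1) under H0

stat = glr_statistic([8.1, 8.0, 7.9], [8.9, 9.1, 9.0], sigma=0.1)
```

Comparing the statistic with a chi-square quantile with one degree of freedom (3.84 for a 5% level) yields the per-gene call.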
Parametric t-test
The parametric t-test was applied on a gene-by-gene basis to the log-transformed expression intensities, comparing the replicate measurements obtained under hypoxic and normoxic conditions.
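A gene-by-gene t statistic of this kind can be sketched as follows; the pooled-variance form is our assumption, since the exact variant of the test is not specified here:

```python
import math

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic, applied gene by gene to
    log intensities. A generic sketch of a standard parametric t-test."""
    n, m = len(x), len(y)
    mx, my = sum(x) / n, sum(y) / m
    sx2 = sum((v - mx) ** 2 for v in x) / (n - 1)
    sy2 = sum((v - my) ** 2 for v in y) / (m - 1)
    sp2 = ((n - 1) * sx2 + (m - 1) * sy2) / (n + m - 2)  # pooled variance
    return (mx - my) / math.sqrt(sp2 * (1 / n + 1 / m))
```

Unlike the GLR tests, every variance here is estimated from the handful of replicates of a single gene, which is why its power depends so strongly on the replication number.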
Performance of statistical tests on simulated microarray data
ROC curve analysis was used for performance evaluation. The expression status assigned to each gene by a test (i.e. whether a gene was classified as differentially expressed at a given test-statistic threshold) was compared with the known simulation labels to compute sensitivity and specificity, and the area under the ROC curve (AROC) summarized each test's overall discriminatory power.
Furthermore, a parametric bootstrap procedure was used to quantify the precision of estimates of AROC and to approximate the probability of rejecting the null hypothesis when comparing the AROC's of two different statistical tests. Bootstrapping provides a way of making probability-based inferences about a population characteristic (e.g. AROC) without strict distributional assumptions (Efron and Gong, 1983; Efron and Tibshirani, 1986). In short, one thousand bootstrap samples were generated, and the AROC of each statistical test was recalculated for each sample.
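The evaluation pipeline can be sketched with a rank-based AROC and a simple resampling comparison. Note that we used a parametric bootstrap; this sketch substitutes plain case resampling for brevity:

```python
import random

def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) identity:
    the probability that a randomly chosen positive scores higher than
    a randomly chosen negative (ties count one half)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_diff(scores_a, scores_b, labels, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples in which test A's AROC exceeds
    test B's, a simplified stand-in for the parametric bootstrap."""
    rng = random.Random(seed)
    n = len(labels)
    a_wins = 0
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        la = [labels[i] for i in sample]
        if all(la) or not any(la):
            continue  # resample lacks one class; skip it
        sa = auc([scores_a[i] for i in sample], la)
        sb = auc([scores_b[i] for i in sample], la)
        a_wins += sa > sb
    return a_wins / n_boot
```

With 1000 resamples, the fraction of samples in which one test's AROC exceeds the other's approximates the evidence that the two tests genuinely differ.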
Results
Overall performance of statistical tests on the original simulated data
Representative ROC curves generated from our analysis are shown in Figure 1A. These curves were based on the GLR3 model using r2s1, r2s3, r4s1, r4s3, r8s1 and r8s3 original simulated data sets. The overall patterns of the ROC curves with respect to varying signal-to-noise ratio and number of replications per gene were similar among the statistical tests. As expected, the predictive power of the GLR tests and the t-test improved as the signal-to-noise ratio and the number of replications per gene increased.

(A) ROC curves for different data conditions derived from the GLR test utilizing GLR3 as the underlying statistical data model. In the legend, r and s indicate replications per gene and signal-to-noise ratio of the data set condition respectively. The AROC for r8s3, r8s1, r4s3, r4s1, r2s3 and r2s1 were 0.862, 0.711, 0.847, 0.642, 0.806 and 0.604 respectively. (B) ROC curves for the r2s1 data set derived using variants of the GLR test and the parametric t-test.
Bootstrap comparison between GLR tests and t-test
The mean (±S.D.) AROC's calculated on the 1000 bootstrap replications for each of the five statistical tests and for each data condition are presented in Figure 2. The corresponding 95% confidence intervals of AROC's are shown in Table 1. Table 2 shows the number of times out of 1000 that a statistical difference was detected between statistical tests for various data conditions using the bootstrap method and a pre-specified significance level.

Mean (±S.D.) AROC's out of 1000 bootstraps for the 5 statistical tests: 1. GLR1; 2. GLR2; 3. GLR3; 4. VS; and 5. the parametric t-test.
Bootstrap means and 95% confidence intervals of AROC's for each data condition and each statistical test. GLR1, GLR2, GLR3 and VS refer to statistical data models that were implemented under the GLR test. Numbers that are highlighted in bold denote the highest scoring test per simulation condition.
The number of instances that the null hypothesis (no difference between two tests) was rejected out of 1000 bootstraps (at a pre-specified significance level), for each pair of statistical tests and each data condition.
In Figure 1B, using the r2s1 data set, we explored an important aspect of the ROC curve to further emphasize differences among the tests: the trade-off between sensitivity and specificity. Here, we examined the sensitivity of the statistical tests at a high range of specificity since this situation is often of interest to biologists. Using ROC curves determined from the rest of the data sets, which we have not displayed for the sake of clarity, we concluded that the GLR3 error structure consistently achieved the highest level of sensitivity in the high specificity range.
Effect of signal-to-noise ratio and replication number on test power
The overall effects of signal-to-noise ratio and number of replications per gene on the statistical tests’ power using the bootstrap technique are presented in Figures 3–5. Improvement in the signal-to-noise ratio and an increase in the number of replications both led to enhanced power (Fig. 3). However, while increased replication improved test performance, this effect diminished as the signal-to-noise ratio of the data set improved. This observation is highlighted in Figure 4, where differences in AROC's between ‘high’ signal-quality data sets (signal-to-noise ratio = 3) with replications ranging from 3 to 8 are negligible. The overall accuracy of the GLR tests and the t-test across all data conditions is summarized in Figure 5.

Mean (±S.D.) AROC's out of 1000 bootstraps, sorted by signal-to-noise ratio of data sets. Note that for the largest number of gene replications considered in this study, the improvement in the tests’ performance from processing data sets with signal-to-noise ratio of ‘medium’ to processing data sets with signal-to-noise ratio of ‘high’ is reduced. For clarity's sake, we have omitted the bootstrapped results from 3 and 6 replications per gene.

Mean (±S.D.) AROC's out of 1000 bootstraps, sorted by number of replications per gene of data sets. Note that the AROC's for ‘high’ signal quality data sets (signal-to-noise ratio of 3) with 3–8 replications per gene are comparable.

Mean (±S.D.) AROC's out of 1000 bootstraps, sorted by signal-to-noise ratio or number of replications per gene of data sets.
Discussion
Log-transformed GLR methods are superior to the parametric t-test
We compared the power of the GLR tests with that of the parametric t-test across all simulated data conditions; the GLR tests based on log-transformed error structures consistently achieved higher AROC's than the t-test.
While there was an improvement in the t-test's performance as the number of replications per gene increased, its discriminatory power remained below that of the log-transformed GLR tests.
Error structure affects predictive power of GLR test
Overall, our results suggest that GLR3 was the best-performing error structure compared with the GLR1, GLR2 and VS models. The GLR test using VS as its underlying error model consistently generated the lowest AROC's for all bootstrap re-samples. We attribute this to the fact that the VS model's error structure considers raw, non-log-transformed gene expression levels. We contend that such an assumption can potentially limit this test since logarithm transformation can improve the normality of expression intensities (Baldi and Long, 2001; Quackenbush, 2002). Among the statistical data models that considered log-transformed expression levels, GLR1 was the least discriminatory, implying that weighting the multiplicative error term by intensity enables the GLR test to achieve a higher predictive power (recall that f(µxi) and f(µyi) were set to 1 in GLR1).
Signal-to-noise ratio contributes more to test performance than replication number
A novel finding of this work was to show the greater influence of signal-to-noise ratio compared to gene replication on test performance. This implies that experiments that maximize the signal-to-noise ratio facilitate the identification of significant differences in gene expression to a larger extent than those with more sample repeats. To the best of our knowledge, this study represents the first attempt to directly compare the influence of replication and signal-to-noise ratio on the performance of statistical analysis of microarray data, although several reports have commented on the significance and implications of each factor separately (Lee et al. 2000; Li et al. 2001; Wildsmith et al. 2001; Rosa et al. 2006). The greater influence of signal-to-noise ratio may have important implications in experiment design. Although in practice it may be difficult to modulate signal-to-noise ratio, there are several avenues for improvement. The quality of RNA is crucial for successful microarray hybridization, and this can be confirmed using microfluidics-based platforms such as the Agilent Bioanalyzer. Several techniques can be used to minimize technical noise, and microarrays from the same print batch can be used to reduce array-to-array variability. Importantly, our results also imply that if a biological system does have a high signal-to-noise ratio, then only a few replications are necessary. This may substantially cut the cost of expression profiling experiments.
How many replications per gene?
It has been previously reported that performing microarray studies without sufficient replicates can lead to poor sensitivity and reliability (Lee et al. 2000; Wildsmith et al. 2001). However, the question ‘how many repeats are enough?’ is shaped by many potentially confounding factors such as the type of array equipment, laboratory technique, and, most importantly, the quality of the samples. Our study provides a framework to decide on the replication number per gene based on the overall signal-to-noise ratio of microarray data. More specifically, we recommend repeating array experiments 6–8 times for data with ‘poor’ and ‘medium’ signal-to-noise ratios, since the discriminatory ability of our tests levels off for sample size 6 or greater. In other words, data containing 6–8 replications per gene represent the best compromise between inferior signal quality and false positive rates. On the other hand, for data sets with ‘high’ signals (signal-to-noise ratio = 3), we show that 3 replications have essentially the same effect as 4 or more. This implies potential savings in future microarray studies since replications can be significantly reduced for such experiments. Taken together, our observations reinforce the notion that a successful microarray project is dependent on all steps of the process being accurately and consistently performed to maximize the reliability and significance of results. Here, we show that consideration of steps upstream of data processing, such as deciding on the proper number of microarray replications and minimizing technical/biological noise, may be necessary to ensure experimental and analytic accuracy.
An issue that bears consideration is that, in a practical setting, it may be difficult to assess the signal-to-noise ratio of actual microarray data. Signal-to-noise ratio depends on many factors, including quality of tissues, sample size relative to the number of variables to be classified, experimental variation, and inherently variable original signal intensities (Semenza, 2000; Norris and Kahn, 2006). In this light, our classification of signal-to-noise ratio levels may not be fully generalizable. To address this issue, we suggest independently quantifying the technical and biological sources of variation by, for instance, performing control-to-control experiments. Although this approach will require additional resources, it will provide valuable information on the signal-to-noise ratio in the biological system of interest, allowing the investigator to choose the optimal number of replications to compensate for inadequate signal characteristics based on our present findings. In turn, this can lead to potential savings in experimental cost and resources.
There are several limitations in this study. First of all, we were constrained by how realistically our data simulation models the entire variability of microarray experiments. Furthermore, our experiments and data simulations were performed using cDNA microarrays and therefore might not be applicable to single dye oligonucleotide arrays such as the Affymetrix platform. We restricted our analyses to the GLR method and the parametric t-test.
Conclusions
In summary, although the present study depended on a specific set of simulated data derived from cDNA experiments and compared only the GLR test and the parametric t-test, our findings support the following conclusions:
The GLR method is more powerful than the parametric t-test.
Within the GLR approach, an error structure that contains a multiplicative error term weighted by the reciprocal of mean expression intensity outperforms other models (GLR3).
Designing experiments that maximize signal-to-noise ratio, instead of just raising the number of replicates per gene, can better identify differentially expressed genes in microarray data.
