The problem of multiple hypothesis testing can be represented as a Markov process where a new alternative hypothesis is accepted in accordance with its relative evidence to the currently accepted one. This virtual and not formally observed process provides the most probable set of non null hypotheses given the data; it plays the same role as Markov Chain Monte Carlo in approximating a posterior distribution. To apply this representation and obtain the posterior probabilities over all alternative hypotheses, it is enough to have, for each test, barely defined Bayes Factors, e.g. Bayes Factors obtained up to an unknown constant. Such Bayes Factors may either arise from using default and improper priors or from calibrating p-values with respect to their corresponding Bayes Factor lower bound. Both sources of evidence are used to form a Markov transition kernel on the space of hypotheses. The approach leads to easy interpretable results and involves very simple formulas suitable to analyze large datasets as those arising from gene expression data (microarray or RNA-seq experiments).
Multiple hypotheses testing (MHT) consists of a set of statistical techniques in which m > 1 statistical tests are stated jointly and the objective is to estimate the partition of the m hypotheses into two sets of sizes m0 and m1 = m– m0 regarded as the set of true nulls and true alternatives, respectively. For instance, the analysis of the outcome of gene expression measurements is a typical statistical problem of MHT, where m > 1, gene expression levels are compared across two biological populations, by testing a corresponding number of statistical hypotheses with the objective of discovering those genes whose expression level is related with a biological population.
The main input of any MHT procedure is the individual or marginal evidence from each test to obtain some type of joint evidence for all tests. A consistent estimation of the null and alternative sets has the objective of controlling for some type of error rates based on the number of false positives, e.g. false null rejections or false discoveries, while also maintaining a low number of false negatives, e.g. missed null rejections. The typical error rates controlled in MHT procedure are the False Discovery Rate (FDR) and the False Non-rejection Rate (FNR). These can be broadly defined as the ratio among the number of false discoveries over all discoveries (FDR) and the ratio between the number of missed rejections among all non-rejections (FNR).
The most commonly known and used MHT methods are based on the evidence provided by suitable test statistics through the corresponding p-values. The m p-values resemble the evidence gathered from observed quantities (i.e. gene expression levels) and the statistical model. In fact, these p-values may arise from basic sampling models, such as the test of the equality of the mean in two independent populations, as well as from very complicated ones such as for instance non-parametric models or even from models with a complicated hierarchical structure. It is well known that when the null hypothesis is simple or when the test statistic is ancillary, the theoretical sampling null distribution of the p-value is the uniform distribution U(0,1) and the p-value is said to be calibrated. The m p-values are usually assumed to be calibrated with respect to the U(0,1) distribution and thus MHT inference procedures are reliable with respect to such an assumption. This could be one of the main reasons why p-values are so popular in MHT. However, besides the effort in using more complicated and realistic statistical models, the settings in which p-values are calibrated are essentially very few and for very simple statistical models. Even asymptotic arguments in favor of such uniformity calibration are unsustainable in light of the fact that, for these kinds of massive experiments such as gene expression studies, replications are usually expensive. Therefore, a part of specific situations, the use of p-values in MHT, is problematic as shown in ‘Note on multiple testing for composite null hypotheses’–a paper authored by me.1 In fact, when p-values sampling null distribution deviates from the U(0,1), then usual MHT are biased in that no upper bounds on FDR or FNR are guaranteed. To this purpose,2,3 proposed a MHT procedure (Efron’s procedure in the sequel) which estimates, within an empirical Bayes setting, the unknown theoretical sampling null distribution of the p-values which may differ from U(0,1).
To work around the difficulties of using p-values, barely defined Bayes Factors (BFs in the sequel), , can be used in MHT, where Bi is the ith BF of the alternative against the null hypothesis in test and c is the unknown indetermination constant assumed to be equal for all tests. In fact, in Bertolino et al.,4 it is shown that individual fully defined and interpretable BFs are not even necessary in MHT. This avoids the heavy computational techniques required for estimating c to properly calibrate or adjust individual BFs to be individually interpretable. Such types of uncalibrated BFs arise from the fact that BFs in MHT require the elicitation of at least 2 m priors on models with only an unknown scalar parameter, i.e. one marginal prior for each model under testing. As m is large, the choice of such priors is usually forced to be one, among those that come from formal rules. These types of priors are typically improper and so the BF for single hypothesis testing is undetermined due to the ratio among priors pseudo-constants c = c1/c0, where c0 and c1 are the prior normalizing constant for the parameters of null model, H0, and the alternative one H1, respectively. Such prior constants are equal for all tests and this comes from the mathematical fact that all 2 m models for null and alternative hypotheses, share the same parameter spaces and the same priors. If one were interested in eliminating BF arbitrariness due to c, several approaches in the literature could be considered, such as those in the literature.5–11 Unfortunately, these methods are based on large sampling theory and their application would be extremely problematic in MHT, as the computational effort necessary for BF single calibration would be necessary for each one of the m tests. Another approach that leads to priors from formal rules which are proper, i.e. c0 and c1 are known, is that of the conventional prior approach in Bayarri et al.12 If, for some reason, there were interest in having individual interpretable BFs in linear models, the approach of conventional priors12 can still be employed in MHT with moderate m (say, m < 100) as a substitute for the one considered below. However, for large m, the approach proposed here may be considered in conjunction with the conventional prior approach, as well as the not very recommended one that says just use “vague proper priors.”12
The set of undetermined BFs, may also arise from calibrating p-values with respect to the infimum of their respective BFs as explained in Sellke et al.13 and reported here. Let pvi be the ith p-value for test i, with pvi < 1/e, then the corresponding infimum of the BF for the alternative against the null is
where c denotes, with an abuse of notation, the calibration constant of the true and unknown BF Bi with respect to its lower bound, i.e. , for which we only know the product cBi, but not c or Bi, separately. In this case, c is the same for all i because of the following mathematical fact: the infimum of BFs lower bounds are calculated with the same model for all i. In fact, equation (1) comes from assuming that if the test statistic has been appropriately chosen, then its density under the alternative model is decreasing with the p-value. Therefore, considering the class of densities for for the p-value under the alternative model and the U(0, 1) density under the null one, then the infimum of the BF over is (1) for p-values smaller than 1/e.13 For all p-values larger than , we assume that the evidence of the alternative against the null model does not differ from that for . It is worth noting that p-values larger than 0.37 never induce any MHT procedure to reject the corresponding null hypotheses for m, which is also large, as in the case of gene expression measurements (microarray or RNA-seq experiments). The p-value calibration in (1) can also be further improved by considering the sample size in each test with the approach developed in Pérez and Pericchi.14
For the sake of comparison, Efron’s, Benjamini–Hochberg (BH), and Benjamini–Yekutieli (BY) procedures are considered. The BY procedure is aimed at controlling the mean FDR under weak dependence assumptions among tests15 with respect to the original BH procedure.16 The taxonomy of MHT procedures can be divided into two sets: adaptive and non-adaptive procedures. Efron’s procedure belongs to the class of adaptive procedures, as the null distribution of p-values is evaluated conditionally on the observed sample of pvi, , while the BH and BY procedures belong to the class of non-adaptive MHT methods because they rely on asymptotic results that are unconditional to the observed p-values. By no means, we do mean that such MHT procedures represent the state of the art of MHT, although they certainly are popular choices in MHT. A broad review of the most popular MHT literature is beyond the scope of this work and there are several articles that we invite the reader to consult such as Dudoit et al.17 and Farcomeni,18 along with the references therein.
It is worth noting that from a Bayesian perspective, in which the model for the observable data accounts for all m hypotheses, multiple comparisons do not need to be explicitly considered as illustrated in Gelman et al.19 This is because the effect of the prior on parameters, which represent the m hypotheses, allows posterior parameter distributions to be shrunk in such a way that the multiplicity of tests is trivially accounted for Gelman et al.19 Alternatively, the MHT may be posed as a model selection problem in a regression setting as in Bayarri et al.,12 where 2 m models may be competed in estimating the probability of belonging to one of the two biological populations under comparison. In this setting, priors on model parameters and priors on model space become relevant as illustrated in the literature,12,20,21 where BFs consistency is studied for asymptotic in both sample size and m. However, although the above approach to Objective Bayesian model selection deals with almost analytical solutions for BFs, comparing 2 m models, when m is potentially of the order of say millions, it still implies millions of models in the scale! For this reason, we here focus on classical approaches with the main aim of post processing the results of studies that have been conducted with separate analyses for each hypothesis under test. Such results may be expressed either in terms of significant evidence from significance testing or conditional to the sample in the case of calibrated p-values or BFs, respectively. The advantage of the approach pursued here is that very complicated analyses can be embedded into one single analysis, where such multiplicity of significance tests (and/or BF evidences) have a Markov chain (MC) representation with the purposes of obtaining the posterior probability of the set of non null hypothesis. The interpretation of this MC representation is not necessary to obtain such a posterior probability, because it comes from a mathematical result, i.e. the definition of the equilibrium distribution for a discrete space and discrete time Markovian process. This way of proceeding also applies when employing Markov Chain Monte Carlo (MCMC) methods to approximate a posterior distribution. In fact, there is not a physical interpretation of the stochastic process used to approximate a posterior, as its equilibrium distribution is interpreted as the posterior distribution. Despite this, we argue that an interpretation of the proposed Markovian process, underlying the estimation of the posterior distribution of non-null hypotheses, can be stated as the process underlying an imaginary or maybe real, but certainly not formalized, discovery process for which there is no explicit reference in the current literature. Specifically, this random process is with regard to a researcher aiming to contrast hypotheses over some real phenomena, e.g. a gene related to a disease. He/she starts from assuming as true a certain hypothesis i and considers a different one j which can better explain the real phenomena. The process of discarding hypothesis i in favor of j is a random Markovian process where the probability of , αij, is proportional to the evidence of hypothesis j against i only. The equilibrium distribution of such a process provides the probability of each hypothesis being a true discovery. With such equilibrium distribution, it is possible to set-up a decision rule aimed at estimating the null and alternative sets, possibly controlling FDR and FNR under such an equilibrium distribution.
The rest of the article is organized as follows: Section 2 describes in more details the representation process, transition probabilities, equilibrium distribution, and finally a very simple decision rule. Section 4 illustrates the above theory validating it through theoretical results and a simulation study for a parametric toy example. Section 5 considers a more complex model with an application to cancer risk factor analysis and gene expression measurements in which evidence from BFs and p-values are compared. Section 6 illustrates another more elaborated decision rule that, differently from the previous one, requires a tuning parameter to be fixed beforehand. Conclusions are left to Section 7.
2 A MC representation of the discovery process
The aim of this section is to formally describe the MC representation process and how to obtain jump probabilities from uncalibrated BFs and the equilibrium distribution. Finally, a very simple decision rule is described with a more elaborate one in Section 6.
2.1 The discovery process
Let be the set of tests of cardinality m, always involving two hypotheses in each one: the null and the alternative. The set of tests represents m states of nature in which a discovery process may sojourn, in the sense that there is a set of states of nature that implies accepting the corresponding alternative hypotheses as true ones or true discoveries. Parallel to gene expression analysis, we have a set of m genes all compared among them in which only a subset is supposed to be related to the phenotype (e.g. a disease) or the two biological populations under comparison.
The problem indeed is that the analyst is not capable of comparing all hypotheses in in a bunch, but is only able to explore the set as follows: at time t a certain alternative hypotheses in i is considered as the true one, that is, i is a discovery or the null hypothesis in i is rejected. At time t + 1, an alternative hypothesis j is picked up at random and considered as the new true one if the probability of the alternative hypothesis in j relative to i, is large enough. Otherwise, at time t + 1, i is still the hypothesis declared as the true one. This Markovian process, over the state-space , is assumed to be repeated infinite times (or by infinite researchers), that is, as and the most visited set of hypotheses is considered to be the most probable set of true alternative hypotheses. This Markovian process plays the same role in approximating the probabilities of true discoveries as MCMC does in approximating a posterior distribution in the usual Bayesian practice. As for the MCMC, there is no interpretation in terms of modeling observable quantities; then the above interpretation, in terms of analyst behavior in assessing discoveries, is not essential as the proposed Markovian process does not directly model any observable quantities. However, we think that such an interpretation provides an intuitive and immediate viability of the resulting probabilities of true discoveries for applied science.
2.2 Jump or transition probabilities
The described Markovian process sustains the idea that the discovery process consists of a random jump process from one explanation of the reality to another. The probability of jumps from explanations are proportional to how relatively likely the new explanation is with respect to that at hand. Specifically, we set the jump probabilities as,
where the unknown constant c simplifies and disappears from the problem. Table 1 reports the transition kernel among hypotheses.
Transition kernel among hypotheses.
Hypotheses
Hypotheses
1
j
m
1
1
α1m
⋮
⋮
⋮
⋮
⋮
⋮
i
αi1
αim
⋮
⋮
⋮
⋮
⋮
⋮
m
αm1
αmj
1
The interpretation of (2) is that the collected and analyzed data provide the relative evidence from one hypothesis to another. This is defined even if it is not calibrated and/or the evidence is interpretable in a single test. This sheds new light on the importance of the BFs in MHT even if their own individual scale is not interpretable.
2.3 Equilibrium distribution
Jump probabilities in (2) define a Metropolis algorithm in which the proposal distribution, over the discrete space , is for the sake of simplicity, the discrete uniform on where the probability of state j to be proposed is 1/m and the probability to jump is αij. In this case, standard theory for MCMC leads us to state that the equilibrium distribution over the discrete state-space is given by the following steady-state probabilities:
The interpretation of pi is that the evidence for alternative hypothesis i is directly proportional to its own non-calibrated evidence and inversely proportional to that of all other hypotheses at hand. In such an interpretation, the important argument in MHT comes into play, according to which the evidence of a single hypothesis must account for the evidence of all other hypotheses being tested. Note that while cannot be interpreted as the posterior probability of the alternative hypothesis in test i because of the presence of c, formula (3), despite having an intuitive interpretation, it has, at the moment, no statistical justification except under the Markovian process illustrated above.
2.4 A basic decision rule
The posterior probabilities that each of the m alternative hypotheses can be considered as the true ones, p, can be used as the base for a decision rule to estimate the null and alternative sets. Such rules can be more or less complicated depending on the assumption about test dependence or other considerations from a decision theory perspective. Such features of a decision rule are common to all MHT procedures discussed in the literature. Here, we consider a very simple but effective one, leaving the reader to Section 6 for a more elaborate procedure. The reasoning about which hypotheses should be considered as true discoveries is based on an orthodox rule for an Objective Bayesian statistician, that is, on the well known insufficient reason principle for a discrete decision space and applied here conditionally on the given set of m hypotheses. It consists in stating that if there is not sufficient reason to believe that one hypothesis is more probable than another, then 1/m prior probability to each one should be assigned before collecting evidence. Therefore, the decision rule consists in rejecting null hypothesis i if
This also means that alternative hypotheses for which pi > 1/m should be considered as true discoveries. Such a decision rule has no tuning parameters, given the insufficient reason principle, in contrast to the usual ones in MHT where at least an FDR level must be fixed to have a decision. Of course, rejecting all pi > 1/m does not guarantee the control of FDR in a finite sample. However, it can be shown that asymptotically, for large sample sizes and a large number of tests, the FDR is negligible (see Appendix 1). This simple rule, while it controls the FDR, does not control the FNR. To achieve such a goal it is necessary to introduce a tuning parameter as explained in Section 6. Empirical results shown below are compatible with this theoretical result.
3 Illustrative example
To illustrate the method for a possible typical biomedical application, consider the study in García-Arenzana et al.22 and also discussed in the book of McDonald23 for illustrating multiple comparisons. This study consists in testing associations of 25 types of diets with mammographic density, which is an important risk factor for breast cancer in Spanish women. The p-values for the association study are published in García-Arenzana et al.22 and reported in Table 2, which contains the unscaled BFs from (1) and the probabilities p.
Results for the association study in García-Arenzana et al.22 of 25 dietary variables with mammographic density in Spanish women.
Dietary variables
p-values
cBi
p
Loc. FDR
BH
BY
λ
Total calories
0.001
53.256
0.567
0.027
0.025
0.095
<1
Olive oil
0.008
9.524
0.101
0.289
0.100
0.382
5
Whole milk
0.039
2.908
0.031
0.589
0.210
0.801
8
White meat
0.041
2.809
0.030
0.591
0.210
0.801
12
Proteins
0.042
2.763
0.029
0.592
0.210
0.801
16
Nuts
0.060
2.179
0.023
0.600
0.250
0.954
21
Cereals and pasta
0.074
1.909
0.020
0.605
0.264
1.000
29
White fish
0.205
1.132
0.012
0.730
0.491
1.000
40
Butter
0.212
1.119
0.012
0.742
0.491
1.000
46
Vegetables
0.216
1.111
0.012
0.749
0.491
1.000
53
Skimmed milk
0.222
1.101
0.012
0.760
0.491
1.000
63
Red meat
0.251
1.060
0.011
0.822
0.491
1.000
79
Fruit
0.269
1.042
0.011
0.865
0.491
1.000
99
Eggs
0.275
1.036
0.011
0.880
0.491
1.000
105
Blue fish
0.340
1.003
0.011
1.000
0.533
1.000
138
Legumes
0.341
1.003
0.011
1.000
0.533
1.000
169
Carbohydrates
0.384
1.000
0.011
1.000
0.565
1.000
180
Potatoes
0.569
1.000
0.011
1.000
0.782
1.000
222
Bread
0.594
1.000
0.011
1.000
0.782
1.000
313
Fats
0.696
1.000
0.011
1.000
0.870
1.000
387
Sweets
0.762
1.000
0.011
1.000
0.907
1.000
522
Dairy products
0.940
1.000
0.011
1.000
0.986
1.000
685
Semi-skimmed milk
0.942
1.000
0.011
1.000
0.986
1.000
1147
Total meat
0.975
1.000
0.011
0.303
0.986
1.000
2397
Processed meat
0.986
1.000
0.011
0.122
0.986
1.000
>10,000
For each dietary variable, the table contains: p-value for the association study, the corresponding unscaled BFs from (1) and the probabilities p. Three MHT procedures are considered: Efron’s procedure for which we reported the value of the corresponding Local FDR (Loc. FDR), the BH and BY procedures for which we reported the corresponding adjusted p-values (columns BH and BY). The last column is the value of the cost parameter λ for the loss function illustrated in Section 6.
Three MHT procedures are considered and at the threshold of 20%, the “Total calories” can always be associated with mammographic density, while the “olive oil” dietary variable only is associated with the BH procedure. Using the proposed cut-off of both total calories and olive oil can be considered as associated with mammographic density according to the proposed approach. The last column relates to the decision rule approach, as illustrated in Section 6.
4 Evidence from BFs
Let be a realization of experiments each with m different features, i.e. m genes expression. The vector xi contains n replications corresponding to the ith experimental feature, for . The MHT problem can be formalized as a multiple model selection problem as follows:
where and are default prior distributions and is a partition of . We propose using default and improper priors derived from the same formal rule applied to each , k = 0, 1. For the sake of simplicity, we assume that and are members of the same parametric family for each hypothesis i, namely and , respectively. In this case, we have,
where c0 and c1 are the normalizing pseudo-constants as could be non integrable functions (i.e. improper priors). Prior predictive distributions for null and alternative hypotheses are assumed to exist and are
The BF of Hi1 against Hi0 is
which is unscaled because of the arbitrary ratio c = c1/c0. We define the unscaled BF as,
Even if Bi has no interpretation in a single test, it can be used in a comparative approach; in fact, suppose having two tests, i and , if
the evidence in favor of Hi1versus Hi0 is larger than that of versus whatever c is. To characterize asymptotically the proposed procedure, it is important to state the consistency of pi as defined in (3). That is, for if i is in the set of true alternatives and , otherwise (see Appendix 1).
4.1 Toy example: Testing zero normal means with unknown variance
We illustrate the proposed method using the following toy example.4 Let , be m independent normal populations with unknown variance . Suppose testing
Sufficient statistics are and , whose observed values are denoted by and , respectively.
We assume the usual default and improper priors,
where 1A(x) is an indicator function for the event . The full BF is , where
is the unscaled BF. The calculation of p in (3) is immediate.
To have a flavor of the FDR and FNR resulting from the proposed method in this specific parametric toy-example, we consider a study of the following scenarios with, and 1000 simulated datasets in each one:
n = 10, 20, 50, 100;
m = 100, 1000, and 2000;
m0/m = 0.9, 0.95, and 0.99;
μi = μA where μA = 0.5, 1, 2, and 3.
For each simulated data set, we calculate the FDR and FNR generated by the procedure based on BFs or p-values that in this case is the p-value of the Student’s t-test on the mean with unknown variance. Such a p-value is calibrated and its BF lower bound is derived according to (1).
Figure 1 reports the resulting FDR. First of all, we can see that the procedure is consistent for regardless of the proportion of null hypotheses m0/m and it also improves as the signal μA becomes larger. Second, we can appreciate how BFs improves generally over p-values in reporting the evidence of each test. In fact, the FDR using BFs is consistently not larger than those obtained with p-value. This is due to the fact that the BF, explicitly compares two alternative hypothesis in each test, namely the zero mean hypothesis against the alternative. The same cannot be said for the p-value.4 On the contrary, the use of p-value with this simple rule tends to be less conservative and the FNR is smaller using the calibration of p-values instead of BFs as illustrated in Figure 2.
Approximated FDR from simulations. Top scale is the true proportion of null hypotheses m0/m, right scale is m and bottom scale is the sample size n. Monte Carlo Standard deviation in 1000 simulations is negligible and not reported.
Approximated FNR from simulations. Top scale is the true proportion of null hypotheses m0/m, right scale is m and bottom scale is the sample size n. Monte Carlo Standard deviation in 1000 simulations is negligible and not reported.
The fact that the procedure becomes less conservative or more liberal is due to the underlaying Bayesian machinery and in particular to the prior, which is relevant in the inference when the evidence is not conclusive. In fact, a priori, it is assumed that at least 1 hypothesis over m is the true alternative, regardless on which one it is. However, it deserves to note that within the Bayesian machinery it is possible to assign different prior probabilities to the each one of the m hypotheses instead of the non-informative 1/m and check if a posteriori their probability is increased up to a value that it is judged enough. Essentially the approach here proposed opens many lines of researches in this directions.
Finally, note that for large signals and large n, the procedure consistently estimate the set of true alternatives, as a consequence of BFs consistency. Therefore, both FDR and FNR tends to 0 for large samples and large signals, while in small samples and weak signals things are less clear.
Figures 3 and 4 compare the actual FDR and FNR against that standard procedures, namely BH/BY/Efron and the Bonferroni Procedure, which controls the Family Wise Error Rate (FWER) using the nominal threshold level equal to 5%.
Approximated FDR from simulations for standard procedures and the proposed one. Top scale is the true proportion of null hypotheses m0/m, right scale is m and bottom scale is the sample size n. Monte Carlo Standard deviation in 1000 simulations is negligible and not reported. The nominal level for all standard procedures is 5%.
Approximated FNR from simulations for standard procedures and the proposed one. Top scale is the true proportion of null hypotheses m0/m, right scale is m and bottom scale is the sample size n. Monte Carlo Standard deviation in 1000 simulations is negligible and not reported. The nominal level for all standard procedures is 5%.
From Figures 3 and 4, it is possible to see that the increment in the sample size n, benefits more the proposed procedure in terms of less FDR and less FNR with respect to the standard ones. This is again the consequence of BFs consistency and hence that of p consistency. At lower sample sizes, the proposed method, with the simple decision rule, is more liberal as the FDR is larger, but the FNR is lower.
5 Applications
Differently from the above toy example, we consider a another parametric test that is more realistic for applications to, for instance, gene expression measurements data: that is, testing the equality of the mean of two independent normal populations with all parameters unknown. More formally, let m be the number of features and denote with the outcome in population X with nx replications and the outcome in population Y with ny replications. Let and for . The set of hypotheses, for unknown, is the following:
Under the usual default priors:
the unscaled BF for H1i against H0i is4
where Beta(a,b) is the beta function evaluated in a, b and are sample means and variances for group X and Y, respectively.
With this model at hand, it is possible to apply the above MHT procedure to the analysis of microarray or RNA-seq experiments. Here, we revisit two old microarray experiments: the first is a calibration experiment where the true differentially expressed (DE) genes are known, whereas the second is a larger study also analyzed, with different approaches, in Singh et al.24 and Efron.2 In both cases, we assume that evidence may come either from BFs or from p-values of the usual Student’s t-test with the Welch correction for . This is just an asymptotic correction that does not guarantee uniform p-values in finite samples. To be used in the BH/BY and Efron procedures, such p-values are not further calibrated with respect to their BF lower bounds.
The third example is a post analysis of a more recent outcome of an RNA-seq experiment: probabilities p are calculated from the set of m p-values calibrated with respect to their BF lower bounds according to (1).
5.1 Microarray: A controlled experiment
We compare results obtained using unscaled BFs and p-values when analyzing gene expression levels of the old Affymetrix HGU95A Latin Square dataset (http://www.affymetrix.com), which has been originally used as the calibration experiment of Affymetrix HGU95A chips. Here, m = 12626 and 16 genes have been spiked at controlled levels ranging from 0 to 1024 picoMolar (pM) as shown in Table 3.
pM concentrations of 16 spiked-in genes in X and Y populations used in this study.
Gene
pM X
pM Y
Gene
pM X
pM Y
37777_at
512.00
1024.00
36202_at
8.00
16.00
684_at
1024.00
0.00
36085_at
16.00
32.00
1597_at
0.00
0.25
40322_at
32.00
64.00
38734_at
0.25
0.50
407_at
512.00
1024.00
39058_at
0.50
1.00
1091_at
128.00
256.00
36311_at
1.00
2.00
1708_at
256.00
512.00
36889_at
2.00
4.00
33818_at
64.00
128.00
1024_at
4.00
8.00
546_at
8.00
16.00
A subset of the original 16 replications is considered here to evaluate differences in small samples, specifically the number of replications for X and Y is nx = ny = 12. In this spiked-in calibration experiment, genes 1597_at and 38734_at are the least DE, whereas gene 684_at is the most DE, because it is absent from population Y and it has the highest concentration in population X. This gene has been eliminated from the analysis as it has a unusually high fold change while that of the not reported genes is between 0.8 and 1.2 fold change. We analyze expression level data obtained from summaries of probe level pairs in the scale. Such summaries are obtained by pre-processing original microarray measures according to the procedure illustrated in Irizarry et al.25
Results of the analysis are reported in Figure 5. For each gene, we reported p obtained with BF and p-values. Red points indicate genes that are declared as discoveries jointly by Bonferroni/BH/BY/Loc. FDR—Efron procedures. Of course there are still differences among these procedures at 20% of nominal cut-off and with the proposed one. Actual FDR and FNR for all methods are reported in Table 4. The actual FDR for the three procedures is a good deal above the nominal 20%, which just highlight that the interpretation of the nominal 20% is in mean 20%, where the mean is with respect to an infinite resampling. On the contrary, the proposed procedure must be interpreted given the observed sample. There are still a few DE genes that are very difficult for any possible detection as they have very low probabilities to be visited by the assumed virtual Markovian discovery process.
Results from the analysis of a Microarray controlled experiment. Values of the steady-state probabilities for each gene in scale along with the cutoff (dashed lines). A total of 17 genes are jointly declared as discovery using all BH/BY/Loc. FDR: Efron procedures with the threshold of 20%. Cut off points are explicitly reported for Efron Local False Discovery Rate BH and BY procedures.
Actual FDR and FNR for the microarray controlled experiment with the considered FDR procedures.
Procedures
Nominal cut-off
FDR
FNR
Efron
20%
5/18 = 0.28
2/12607 = 0.0002
BH
20%
4/17 = 0.24
2/12608 = 0.0002
BY
20%
9/23 = 0.39
1/12602 = 0.00008
Bonferroni
20%
9/23 = 0.39
1/12602 = 0.00008
pi > 1/m (on p-value calibration)
none
0/6 = 0
9/12619 = 0.0007
pi > 1/m (on BFs)
none
1/11 = 0.09
5/12614 = 0.0004
BH: Benjamini–Hochberg; BY: Benjamini–Yekutieli.
5.2 Microarray: Prostate data
We compare unscaled BFs with p-values in the analysis of gene expression levels for prostate cancer data.24 In this study, m = 6033 genes with nx = 50 healthy males are compared with ny = 52 prostate cancer cases. Results are shown in Figure 6. The larger sample size of this data set reduces differences between BFs and p-values and so the differences between the proposed procedure and Efron procedure or the BY procedure.
Results from the analysis of Prostate cancer data. Values of the steady-state probabilities for each gene in scale along with the cutoff (dashed lines). Only nine of the most differentiated genes are jointly detected using all procedures. Cut off points are explicitly reported for Efron Local False Discovery Rate BH and BY procedures with the threshold of 20%.
In fact, Figure 6 shows that the proposed procedure with 1/m cut-off is in between the Efron and BY procedures, with the BH procedure being more conservative. The main message of this example is that for large n, differences among the proposed procedures and the one considered for comparison tend to be small. This is consistent with the simulation study in Section 4.1, namely Figures 3 and 4.
5.3 A RNA-seq experiment: Bovine macrophage response to Mycobacterium bovis infection
In this study, the raw data consists of 3.6 trillion reads of RNA sequences to generate counts of identical sequences that are supposed to represent the abundance of target sequences in the mRNA. Counts are collected for the two biological populations under comparison: bovines infected and non-infected by Mycobacterium bovis.26 Such counts are then used to evaluate the differential gene expressions between said biological populations (see, for instance, Rapaport et al.27). After data normalization, there are m = 11131 genes that have been compared for differential expression and their corresponding p-values have been calibrated with respect to their BF lower bound (1). The results are compared with Table 1 in Nalpas et al.,26 which shows the most DE RNA sequences according to their fold change (LFC in the sequel) and also with Table 2 in Nalpas et al.26 that compares fold-changes in gene expression based on RNA-seq, microarray, and real-time qRT-PCR.
According to our procedure the first two most related sequences are (see Table 5): ENSBTAG00000022396 with probability pi almost 1 and ENSBTAG00000001725 with a LFC = 5.67 that has a raw p-value of 10–75. This latter one has a probability pi of only 1.1810–42, with a p-value and/or adjusted p-values which are around 1030 times higher than those of ENSBTAG00000022396. This means that, according to our basic procedure only ENSBTAG00000022396, also reported in Table 1 of Nalpas et al.,26 can be considered the most DE RNA sequence.
RNA-seq experiment: 10 most significant genes.
Rank
Gene ID
Related protein
LFC
p-value
Bonferroni
BH
BY
Loc. FDR
p
1
ENSBTAG00000022396
Serum amyloid A
7
−117
−113
−113
−112
−109
0
0
2
ENSBTAG00000001725
Chemokine
6
−75
−71
−71
−70
−69
−42
107
3
ENSBTAG00000013167
Sialic acid
5
−70
−66
−67
−66
−65
−47
111
4
ENSBTAG00000008612
Complement c1
5
−68
−64
−64
−63
−63
−49
116
5
ENSBTAG00000016061
Radical S-adenosyl
5
−65
−61
−62
−61
−60
−52
120
6
ENSBTAG00000034954
Beta-defensin 5
6
−65
−61
−62
−61
−60
−52
121
7
ENSBTAG00000038639
Chemokine
6
−65
−61
−62
−61
−60
−52
122
8
ENSBTAG00000005603
Chemokine
7
−64
−60
−61
−60
−59
−53
124
9
ENSBTAG00000020602
Indoleamine 2
5
−64
−59
−60
−59
−59
−53
125
10
ENSBTAG00000018119
Acyloxyacyl hydrolase
5
−63
−59
−60
−59
−58
−54
127
Columns report for each gene, from left to right: the gene rank according to p and λ (see Section 6); gene identification and related protein; LFC, raw p-value and adjusted ones according to: Bonferroni, BH, BY; the corresponding Local FDR. Last two columns report p and . All values are in Log10 scale, except LFC and which are in and natural ln scales, respectively.
Results in Nalpas et al.26 come from very well established and popular analysis workflow for RNA-seq data, that is, quantification of transcripts using a Python package HTseq followed by identification of DE genes using a R package edgeR. The final result is a sorted list of possible sequences claimed to be DE because of a low enough p-value, according to some MHT procedure28–30 and large LFC. In Nalpas et al.,26 2584 genes have been declared as DE, namely those with adjusted p-value with the BH procedure less than 0.05.
There are two differences between the very conservative results obtained here with the simple decision rule and that in Nalpas et al.26 The first difference is related to Boole’s inequality that appears in Proposition 2 of the Appendix 1, which makes the simple rule pi > 1/m conservative because dependence among all tests is assumed to as strong as possible, which may not be so. Indeed, such a conservativeness can be avoided by introducing a tuning parameter as explained in Section 6 which is usual in MHT approaches implemented in the edgeR package. In fact, if one is willing to consider larger set of possible DE genes, Section 6 would allow this by providing also relative cost of a missed rejection with respect to a false discovery that compete to such a set of possible DE genes. This cost opportunity is represented by the λ parameter, which is reported in Figure 7 as the argument of the function of the number of genes declared as DE, .
RNA-seq experiment: function . The cutoff horizontal lines represent the first 10 genes of Table 1 in Nalpas et al.,26 the 2584 genes claimed to be differentially expressed according to the BH procedure at the nominal level of 5%. The λ values are reported in natural log scale.
We can see that λ is almost constant for the first 10 genes, suggesting that there could be signal there or even slightly more genes, while 2584 genes appears to be excessive as the derivative of is almost maximum.
To further illustrate the outcome of this experiment and the role of λ, next Figure 8 reports LFC and unadjusted p-value.
RNA-seq experiment: LFC and unadjusted p-values as functions of λ.
We can see that for low values of λ, signal from genes DE is quiet strong and it decreases with λ, that is either LFC decreases or p-values increase.
Looking then at the list of the 10 most DE genes in Table 5, it is interesting to note that apart from the first gene, we have few genes related to the chemokine protein, which has also been mentioned in Table 2 in Nalpas et al.,26 where a comparison of fold-changes in gene expression based on RNA-seq, microarray, and real-time qRT-PCR was performed.
Finally, it is worth to discuss the second difference between the approach here proposed and that in Nalpas et al.26 Such a difference is more on the basis of the involved fundamentals of statistics. Essentially, the edgeR implements, for large samples, the use of p-values alone to assess the significance of a hypothesis. This is a very popular statistical practice although a questionable one, to the point that some journals started banning p-values.31 However, this practice, in the proposed procedure, is not repudiated, but embraced under the posterior probability principle (PPP) instead of the significance principle.14 Under the PPP, p-values have been calibrated with respect to the minimum probability of the hypothesis of non differential expression (1) given all observed evidence.
6 Remarks: Another decision rule
This section is devoted to illustrating another decision rule based on the vector of steady-state probabilities p. Assume that and are the sets of alternative hypotheses declared as true (m1 in total, with ) and false (m–m1), respectively (e.g. and ). There can be different ways of partitioning , but let us assume that it is defined upon the ordered vector of steady-state probabilities and thus on a given number of rejected null hypotheses m1, that is and .
The FDR is the expected proportion of false discoveries in , that is,
for m1 > 0 and for m1 = 0. Analogously, the FNR is the expected proportion of missed discoveries in , that is,
A suitable way to partition is by fixing m1 to minimize the following loss function32:
where is a user specified constant which expresses the relative cost of a missed rejection with respect to a false discovery. It can also be interpreted as the cost-complexity parameter in classification as also λ represents the cost that we have to pay to complicate the explanation of the reality with an increasing set of possible causes labeled as discoveries. For λ fixed, m1 can be estimated, conditionally on λ using,
It is worth noting that λ, in this decision rule, is the only tuning parameter and the function is very informative to choose a cutoff value for λ that leads to the optimal decision in the sense that minimizes Lλ. The arguments regarding the choice of such a cutoff are the same as those related to the choice of an optimal cost-complexity parameter in classification.
To have an idea of the behavior of the function , we consider the toy example illustrated in Section 4.1 with m = 100, m1 = 10, n = 10, and . We analyze four simulated scenarios where μA = 0, 1, 2 or 3 in each scenario. These have different signal intensities as in the first there is no signal and in the last one the mean for the alternative hypotheses is three standard deviations from that under the null.
The evolution of for the above scenarios is reported in Figure 9, while the actual and estimated FDR and FNR are reported in Figure 10.
Function for testing zero normal means with unknown variance. We have m = 100 hypotheses where m1 = 10 hypotheses come from the alternative model. The four samples are of size n = 10, and for value of the signals , where , or 3.
FDR and FNR realized (always unknown) and estimated conditionally on λ with equations (10) and (11) for testing zero normal means with unknown variance. We have m = 100 hypotheses where m1 = 10 hypotheses come from the alternative model. The four samples are of size n = 10, and for value of the signals , where , or 3. The small figure reports the same but with the full vertical scale.
Even if these are four examples of MHT, Figure 9 shows quite well the comparative behavior of functions at increasing signal levels that illustrate a general behavior. In fact, when there is no signal, (i.e. μA = 0) all hypotheses are true nulls; then they can be considered to explain the phenomena at hand and there is no price that can be paid for missing a discovery. As long as the signal grows, the price of complicating the explanation by considering more hypotheses jointly becomes higher. For example, with μA = 3, the data do not practically support more than eight true alternative hypothesis as for λ > 800 we can pay more, but such payments do not correspond to any new hypothesis that can be considered as true discovery. Of course, for all m hypotheses can be considered as true alternatives, but the price with respect to false discoveries may be unaffordable. Essentially, wide flat regions of , indicate possible cut-off values for λ and hence the size of the first most probable hypotheses to be considered as true discoveries. For example, for μA = 3, it is clear that a reasonable range to be considered is , that is, the first eight most probable hypotheses should be jointly rejected with a realized FDR of 0% and a FNR of 2% (see Figure 10). At μA = 2, we would consider no more than six or seven hypotheses () as at larger values of λ the data do not support any reasonable small set of true alternative hypotheses. Finally, it is clear that when there is no signal, no hypotheses should be really considered as true alternatives because at data suggest that all m hypotheses should be rejected.
Note that the cut-offs induced by are data dependent and this is a relevant difference from the usual MHT procedures in which cut-offs are fixed beforehand.
Finally, we return to the last column of Table 2 that reports the values of λ for the 25 dietary variables. We can appreciate how reasoning in terms of loss function leads to similar conclusions, which is that if the cost of a false discovery is no larger than that of a missed rejection then only “total calories” should be considered as related to the mammographic density. To also consider the “olive oil” dietary variable, we should assume that costs for missed rejection are 5 times larger than those for a false discovery. The behavior of the function for the 25 dietary variables is reported in Figure 11.
Function for the 25 dietary variables related to mammographic density in Spanish women.22
The detail for the first 15 most evident dietary variables does not suggests any cut-off as the increment in is almost constant across values of λ between 1 and 62. After that, there seems to be stationary steps, but the only strong signal seems to be between all dietary variables up to “Total meat” and “Processed meat.” Of course, if the analyst is willing to consider as related to mammographic density all dietary variables except “Processed meat,” then it means that he/she is also willing to consider missed discoveries to be 2397 times more expensive than false discoveries. Such a value of λ, depending on the application, may be too high.
7 Conclusions
The decision rule discussed above is just an example of a possible more sophisticated MHT procedure based on the vector of true discovery probabilities p. However, the key point of our exposition is the Markovian representation of the MHT problem that can be applied to virtually any source of evidence for the single test and is suitable for very large data bases, as the involved formulas are very simple. The strong point of this approach is the straightforward interpretation of p as probabilities that hypotheses can be considered as true discoveries given the collected data and the involved statistical models used to analyze them. This is not trivial with current MHT procedures as the interpretation of, say, adjusted p-values requires deep knowledge of statistical reasoning for an applied scientist. Finally, it is important to stress that the proposed Markovian process is to MHT as MCMC is to posterior approximation.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research,authorship,and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research,authorship,and/or publication of this article: The author has been funded by Ministerio de Ciencia e Innovación grant MTM2013-42323,ECO2012-38442,RYC-2012-11455,and ACOMP/2015/202,by Ministero dell’Istruzione,dell’Univesità e della Ricerca of Italy,and Regione Autonoma della Sardegna CRP-59903.
References
1.
CabrasS. A note on multiple testing for composite null hypotheses. J Stat Plain Infer2010; 140: 659–666.
2.
EfronB. Microarrays, empirical Bayes and the two-groups model. Stat Sci2008; 23: 1–22.
3.
EfronB. Size, power and false discovery rates. Ann Stat2007; 35: 1351–1377.
4.
BertolinoFCabrasSCastellanosMEet al.Unscaled Bayes factors for multiple hypothesis testing in microarray experiments. Stat Methods Med Res2015; 24: 1030–1043.
5.
MorenoEBertolinoFRacugnoW. An intrinsic limit procedure for model selection and hypotheses testing. J Am Stat Assoc1998; 93: 1451–1360.
6.
BertolinoFMorenoERacugnoW. Bayesian model selection approach to analysis of variance under heteroscedasticity. J Roy Stat Soc D-Sta2000; 49: 503–517.
7.
BergerJOPericchiLR. The intrinsic Bayes factor for model selection and prediction. J Am Stat Assoc1996; 91: 109–122.
8.
O’HaganAForsterJ. Kendall’s advanced theory of statistics: Bayesian inference2004; Vol. 2New York: Hafner.
9.
BergerJOPericchiLRObjective Bayesian methods for model selection: Introduction and comparison. In: LahiriP (ed). Model selection2001; Vol. 38Beachwood, OH: Institute of Mathematical Statistics, pp. 135–207. http://www.jstor.org/stable/4356165 (accessed 12 January 2016).
10.
BergerJOPericchiLR. Training samples in objective Bayesian model selection. Ann Stat2004; 32: 841–869.
11.
MorenoEGirónFJ. Comparison of Bayesian objective procedures for variable selection in linear regression. Test2008; 3: 472–492.
12.
BayarriMBergerJForteAet al.Criteria for Bayesian model choice with application to variable selection. Ann Stat2012; 40: 1550–1577.
13.
SellkeTBayarriMJBergerJO. Calibration of p-values for testing precise null hypotheses. Am Stat2001; 55: 62–71.
14.
PérezMEPericchiLR. Changing statistical significance with the amount of information: The adaptive α significance level. Stat Probabil Lett2014; 85: 20–24.
15.
BenjaminiYYekutieliD. The control of the False Discovery Rate in multiple testing under dependence. Ann Stat2001; 29: 1165–1188.
16.
BenjaminiYHochbergY. Controlling the False Discovery Rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B1995; 57: 289–300.
17.
DudoitSShafferJBoldrickJ. Multiple hypothesis testing in microarray experiments. Stat Sci2003; 18: 71–103.
18.
FarcomeniA. A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion. Stat Methods Med Res2008; 17: 347–388.
19.
GelmanAHillJYajimaM. Why we (usually) don’t have to worry about multiple comparisons. J Res Educ Eff2012; 5: 189–211.
20.
CasellaGGirónFJMartínezMLet al.Consistency of Bayesian procedures for variable selection. Ann Stat2009; 37: 1207–1228.
21.
MorenoEGirónFJCasellaG. Consistency of objective Bayes factors as the model dimension grows. Ann Stat2010; 38: 1937–1952.
22.
García-ArenzanaNNavarrete-MuñozEMLopeVet al.Calorie intake, olive oil consumption and mammographic density among Spanish women. Int J Cancer2014; 134: 1916–1925.
23.
McDonaldJH. Handbook of biological statistics, 3rd ed. Baltimore, MD: Sparky House Publishing, 2014.
24.
SinghDFebboPRossKet al.Gene expression correlates of clinical prostate cancer behavior. Cancer Cell2002; 1: 302–309.