Abstract
1 Introduction
When testing multiple hypotheses, the global null hypothesis is often of specific interest. It states that none of the individual null hypotheses is false. In some applications, rejecting the global null can be a goal in itself, whereas in other situations such a test may occur as part of a more sophisticated multiple test procedure. Think for instance of the closure test principle, where the global null needs to be rejected before looking at specific tests. Also, in an ANOVA, the global null is usually tested before testing for pairwise differences.
In meta-analysis, rejecting the global null implies an effect at least under some circumstances. Another application is experimental evolution, where several replicate populations of micro- or higher organisms are maintained under controlled laboratory conditions and their response to selection pressures is studied. Further applications, where such a test is of interest in its own right, are testing for overall genomic differences in gene expression and signal detection. 1,2
Several approaches to test the global null hypothesis are known. If we assume alternative scenarios where all or most null hypotheses do not hold, combination tests (e.g. Fisher's combination test 3 or Stouffer's z test 4), which sum up two or more independent transformed p-values into a single test statistic, have been recommended to maximize power. If, however, it is assumed that the null hypothesis holds in most cases, global tests based on individual test statistics are more powerful (e.g. Bonferroni, Simes). 5 If a larger number of hypotheses are tested, and the alternative hypothesis holds sufficiently often, goodness of fit tests for a uniform distribution of p-values could also be used. They test, however, for any type of deviation from uniformity and do not focus specifically on p-values that are too small. Under more specific models, such as the comparison of several normal means, more specialized tests such as Tukey's multiple range test or an ANOVA are further options.
Higher criticism (HC) and checking for
Our focus is on a general situation where independent p-values are available from several hypothesis tests that are assumed to be uniformly distributed under the null hypothesis. As there is often no a priori knowledge on the number of false individual null hypotheses, we propose a test that enjoys good power properties, both if few and many null hypotheses are false. Our test is based on cumulative sums of the (possibly transformed) sorted p-values.
In comparison to other available methods, our simulations show that this test yields an excellent overall behavior. It typically performs better than combination tests, if the alternative holds in only a few cases. If the alternative holds in most cases, it performs better than the Bonferroni and Simes test. The performance relative to methods that combine evidence across all p-values tends to be even better under those one-sided testing scenarios, where parameters are in the interior of the null hypothesis for some of the tests. For these tests, the corresponding p-values will be stochastically larger than uniformly distributed ones, reducing in particular the power of combination tests.
We also present real data applications in the context of meta-analysis and experimental evolution.
2 Testing the global null hypothesis based on p-values
Consider a multiple testing procedure with
Some tests for the global null hypothesis use a combined endpoint, summing up the evidence across all available p-values to a single test statistic, e.g. Fisher's combination function 3 or Stouffer's test. 4 Alternatively, other approaches focus on those individual test statistics that lead to extreme p-values, such as in the Bonferroni and Simes tests. As combination tests aggregate evidence across all hypotheses, these tests are particularly powerful when there are (small) effects in many considered null hypotheses. When there are only a few (large) effects, global tests based on individual test statistics are more powerful. Other approaches are goodness of fit tests or HC.
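To make the contrast between combined endpoints and tests based on extreme individual p-values concrete, the following minimal Python sketch computes a global p-value for the four classical tests in their standard textbook form. The function names are our own illustrative choices and this is not the R code used in the study; the sketch assumes independent p-values lying strictly between 0 and 1.

```python
import math
from statistics import NormalDist


def fisher_test(pvals):
    """Fisher's combination: -2 * sum(log p) is chi-square with 2m df under H0."""
    m = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # half of the test statistic
    # The chi-square survival function with even df 2m has the closed form
    # exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return math.exp(-half) * total


def stouffer_test(pvals):
    """Stouffer's z test: the scaled sum of z-values is standard normal under H0."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvals) / math.sqrt(len(pvals))
    return 1.0 - nd.cdf(z)


def bonferroni_test(pvals):
    """Bonferroni: global p-value is m times the smallest p-value, capped at 1."""
    return min(1.0, len(pvals) * min(pvals))


def simes_test(pvals):
    """Simes: global p-value is the minimum over i of m * p_(i) / i."""
    m = len(pvals)
    return min(m * p / (i + 1) for i, p in enumerate(sorted(pvals)))
```

Each function takes a list of p-values and returns one global p-value, so the same input can be passed to all four to see how evidence that is spread over many tests or concentrated in a few is weighted differently.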
2.1 Omnibus test
2.1.1 General outline
Starting with independent p-values
Later on, we will consider four transformations
Here,
2.2 Alternative test statistics
We briefly explain the most popular approaches that use p-values for testing the global null hypothesis.
2.2.1 Fisher's combination test
Fisher 3 proposed the combined test statistic given by
2.2.2 Stouffer's z test
Based on z-values
2.2.3 Bonferroni test
The Bonferroni test rejects the global null hypothesis, if the minimum p-value falls below
2.2.4 Simes test
An improvement of the Bonferroni test in terms of power was proposed by Simes. 5 For the
2.2.5 Higher criticism
Based on an idea by Tukey, 6 Donoho and Jin 7,8 introduced the HC to test the global null hypothesis of no effect for independent hypotheses. It is defined by
2.2.6 Goodness of fit tests
For our global test problem of independent p-values and under a point null hypothesis, the p-values
3 Results
We start our simulation study by comparing the power, i.e. 1–
We simulate different scenarios by varying both the total number
Although our test is based on p-values that may arise in a multitude of settings, we want to specify effect sizes and alternative distributions in an intuitive way, and therefore compute our p-values from normally distributed data with known variance
In the simulations, we first assume that all alternatives have the same mean effect Δ/
(i) Negative effect sizes that are in the interior of the null hypotheses: We assume that under the true null hypothesis, the data have a negative effect size of –Δ/
(ii) Different effect sizes of alternative hypotheses: We assume randomly chosen exponentially distributed effect sizes with a rate parameter of
(iii) Different effect sizes of alternative hypotheses and different effect sizes in the interior of the null hypotheses: We assume randomly chosen exponentially distributed effect sizes with a rate parameter of
All computations were performed using the statistical language R. 14 Fisher's and Stouffer's combination tests were calculated using the function combine.test in the survcomp package. 15 Our proposed omnibus method and the HC are implemented in the R-package omnibus, available at https://github.com/ThomasTaus/omnibus.
For each scenario at least 10,000 simulation runs were performed.
For all following simulation results, the methods control the type I error at 5% if the global null hypothesis is true (simulation results can be found in the online supplemental material).
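The general logic of such a power simulation can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the data are reduced to one z-value per hypothesis with unit variance, the parameter names m, m1 and delta are hypothetical stand-ins for the number of hypotheses, the number of false nulls and the mean shift, and the Bonferroni test is used as a simple example of a global test. It is not the R code used for the reported results.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def simulate_pvalues(m, m1, delta, rng):
    """Draw m one-sided z-test p-values; the first m1 tests have mean shift delta."""
    pvals = []
    for i in range(m):
        mean = delta if i < m1 else 0.0
        z = rng.gauss(mean, 1.0)  # unit variance is an assumption of this sketch
        pvals.append(1.0 - _STD_NORMAL.cdf(z))
    return pvals


def bonferroni_rejects(pvals, alpha=0.05):
    """Reject the global null if the smallest p-value is at most alpha / m."""
    return min(pvals) <= alpha / len(pvals)


def estimate_power(m, m1, delta, n_runs=10000, alpha=0.05, seed=1):
    """Fraction of simulation runs in which the global null is rejected."""
    rng = random.Random(seed)
    hits = sum(
        bonferroni_rejects(simulate_pvalues(m, m1, delta, rng), alpha)
        for _ in range(n_runs)
    )
    return hits / n_runs
```

Setting m1 = 0 gives an estimate of the type I error, while m1 > 0 with delta > 0 gives an estimate of the power; replacing bonferroni_rejects by another decision rule allows the same loop to be reused for other global tests.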
3.1 Influence of the chosen transformation on the omnibus method
Figure 1 shows power curves for the omnibus test using the four proposed transformations. We consider
Power values for omnibus
Power values for increasing
Minimax power.
Note: Worst case power values for

3.2 Power comparison between different testing methods
Figure 2 shows power curves for omnibus
The Bonferroni and Simes methods give the best power results in the case of only one false null hypothesis,
Fisher's combination test is slightly superior in scenarios with large
Omnibus Power values are given for increasing

3.2.1 Worst case behavior
We also assess the overall behavior of the statistical tests we considered by looking at the minimax power across scenarios that involve all possible numbers
3.2.2 Behavior for small numbers m1 of true alternatives
To further compare the power of our omnibus
3.3 Distributed/negative effect sizes
In Figure 4 (first row), we show simulation results for distributed effect sizes with a mean effect Δ distributed according to an exponential distribution with a rate parameter of
Figure 4 (second row) shows results for negative effect sizes under the null hypothesis, leading to p-values that are stochastically larger than uniform. A comparison with Figure 2 reveals that this does not much influence the power of the omnibus test (
The power of the Bonferroni test and of HC changes even less compared to the omnibus test when parameters are in the interior of the null hypothesis.
If both alternative and null hypotheses have effect sizes distributed according to an exponential distribution (with a rate parameter of
3.4 p-Values from discrete data
The assumption of uniformly distributed p-values under the null hypothesis is not always satisfied. Besides parameter values in the interior of the null hypothesis, discrete models also lead to p-values that are not uniformly distributed on the interval [0, 1]. As p-values obtained from a discrete distribution are not covered by our underlying assumptions, we performed a simulation study to evaluate our test in such a situation. In view of our genetic application, we considered a two-sample binomial model. In general, we would expect the discreteness to matter less for large enough sample sizes. In our setup, however, type I error control was achieved even for small sample sizes.
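The non-uniformity of p-values from discrete data can be illustrated with a short Python sketch. As an assumption made purely for illustration, it uses the one-sided Fisher exact test for the two-sample binomial model, which is not necessarily the test used in our simulations, and it computes the exact probability that the p-value falls below a level alpha when both groups share the same success probability. All function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def fisher_one_sided(x1, n1, x2, n2):
    """One-sided Fisher exact p-value: P(X1 >= x1 | X1 + X2 = x1 + x2)."""
    t = x1 + x2
    upper = min(n1, t)
    tail = sum(comb(n1, k) * comb(n2, t - k) for k in range(x1, upper + 1))
    return tail / comb(n1 + n2, t)


def null_rejection_prob(n1, n2, p, alpha=0.05):
    """Exact P(p-value <= alpha) when both groups have success probability p."""
    total = 0.0
    for x1 in range(n1 + 1):
        for x2 in range(n2 + 1):
            if fisher_one_sided(x1, n1, x2, n2) <= alpha:
                total += binom_pmf(x1, n1, p) * binom_pmf(x2, n2, p)
    return total
```

For a continuous test statistic this probability would equal alpha; for small samples the value returned by null_rejection_prob stays below alpha, which is the sense in which such p-values are stochastically larger than uniform.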
For the first group, the simulated data were
We first checked whether the type I error is still controlled under our discrete model. For this purpose, we considered sample sizes
Power values for discrete data simulation for omnibus
Figure 5 provides the power obtained when using our omnibus test on several scenarios for
4 Examples
4.1 Meta-analysis
In meta-analysis, the evidence from several studies on a topic is combined. There are several examples in the literature showing that the efficacy of a treatment can vary among studies. Such variation can arise, among other reasons, from differences in the underlying study populations or in environmental factors. If effect size estimates are available for all considered studies, a random effect meta-analysis is often carried out. Global tests, such as Fisher's and Stouffer's tests, are a popular alternative that does not require effect size estimates.
As an illustration, we applied our omnibus test to a data set from a meta-analysis provided by the R-package metafor. 17 We chose the data set dat.fine1993 where results from 17 studies are presented which compare post-operative radiation therapy with or without adjuvant chemotherapy in patients with malignant gliomas. 18 For each study, the data set specifies the number of patients in the experimental group (receiving radiotherapy plus adjuvant chemotherapy) as well as the number of patients in the control group (receiving radiotherapy alone). In addition, the number of survivors after 6, 12, 18, and 24 months follow-up within each group is given. One of the 17 studies recorded survival only at 12 and 24 months. For illustration purposes, we performed a separate meta-analysis for each time point and calculated
Meta-analysis (example I).
Note: Global tests have been applied to a meta-analysis comparing post-operative radiation therapy with or without adjuvant chemotherapy in patients with malignant gliomas. p-Values of the methods are shown when testing the global null hypothesis at different time points.
We next analyzed the data examples from the R-package metap. 19 We used five of the eight different data examples, ignoring three that involve only hypothetical data. For each of these data sets, a vector of p-values of lengths ranging from 9 to 34 is provided in the package. For instance, the data taken from the meta-analysis by Sutton et al. 20 involve 34 randomized clinical trials where cholesterol lowering interventions were compared between treatment and control groups. The actual treatments were mostly drugs and diets. For each study, a test was performed to analyze if the effect sizes (log odds ratio) are smaller than 0 (one-sided test) and p-values were calculated based on the normal distribution (Sutton et al., 20 Table 14.3). For details on the other data sets, we refer to the original publications and for references see the documentation of the metap package. Note that for some studies p-values were derived from independent subgroup analyses.
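As an illustration of how a one-sided p-value for a log odds ratio can be obtained from the normal distribution, the following Python sketch works from the 2x2 counts of a single study. The standard error formula (sum of reciprocal cell counts) and the 0.5 correction for empty cells are common textbook choices that we assume here; they are not necessarily those used in the cited publications, and the function name and its arguments are hypothetical.

```python
import math
from statistics import NormalDist


def log_odds_ratio_pvalue(events_trt, n_trt, events_ctl, n_ctl):
    """One-sided p-value for H1: log odds ratio < 0 (normal approximation)."""
    a, b = events_trt, n_trt - events_trt
    c, d = events_ctl, n_ctl - events_ctl
    if min(a, b, c, d) == 0:
        # Assumed convention: add 0.5 to every cell if any cell is empty.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    z = log_or / se
    # Small p-values correspond to log odds ratios clearly below zero.
    return NormalDist().cdf(z)
```

Applying this function to each study yields a vector of p-values of the kind that serves as input to the global tests.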
Meta-analysis (example II).
Note: p-Values are obtained from several global null hypothesis tests. The data have been taken from the examples provided with the R-package
4.2 Experimental evolution
With the development of large-scale inexpensive sequencing technologies, experiments that aim to elucidate biological adaptation at the molecular level of DNA and RNA have become popular. In such experiments, organisms are often exposed to stress factors for several generations, and their genetic adaptation is studied. With microorganisms, such stress factors can for instance result from antibiotics, with the adaptation being resistance. With higher organisms, examples of stress factors are temperature or toxic substances. While evolution in nature usually takes place only once under comparable circumstances, experimental evolution can be done with replicate populations. Among other things, replication makes it possible to investigate the reproducibility of adaptation, a key topic in evolutionary genetics. The statistical challenge is to identify genomic positions (called loci) involved in adaptation. There is a large number of candidate loci, for which adaptation has to be distinguished from random temporal allele frequency changes due to genetic drift as well as sampling and sequencing noise.
Furthermore, recent research suggests that replicate populations often do not show a consistent behavior, with signals of adaptation showing up partially at different loci. Two biological explanations for this finding are that beneficial alleles may be lost due to drift, and that the same adaptation at a phenotypic level can often be achieved in multiple ways at the genomic level.
When testing for significant allele frequency changes, a test like our omnibus test is therefore desirable, as it enjoys good power also when signals of adaptation are not consistent across replicates. We illustrate the application of our omnibus
A graphical summary of the p-values obtained using our omnibus test can be found in the online supplemental material.
5 Discussion
In this manuscript, we introduced new non-parametric omnibus tests for testing the global null hypothesis. They require independent p-values as input and assume them to be uniformly distributed (or stochastically larger than uniform) under the null hypothesis. Our proposed approach enjoys very good power properties, no matter in how many cases the alternative holds. In our comparison with alternative approaches, it is not always the best method, but we did not find scenarios where the omnibus test performs considerably worse than the best alternative method for a given setup. One could furthermore construct better specialized tests in situations where knowledge is available concerning likely deviations from the global null. The proposed omnibus test is useful when no such information is available.
For our test, we compute successive cumulative sums of the suitably transformed sorted individual p-values. The most unusual cumulative sum is then obtained by computing the p-value of each sum under the global null hypothesis. The smallest p-value is then used as test statistic.
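The three steps just described can be sketched in Python with a Monte Carlo reference distribution. This is a rough illustration and not the implementation in the R-package omnibus: the transformation -log(p) is one assumed choice, both the p-values of the cumulative sums and the final calibration are estimated by simulation, the observed p-values are assumed to be strictly positive, and the names transformed_cumsums and omnibus_mc are hypothetical.

```python
import bisect
import math
import random


def transformed_cumsums(pvals):
    """Cumulative sums of -log(p) over the p-values sorted in increasing order."""
    total, sums = 0.0, []
    for p in sorted(pvals):
        total += -math.log(p)
        sums.append(total)
    return sums


def omnibus_mc(pvals, n_sim=2000, seed=1):
    """Monte Carlo global p-value based on the most unusual cumulative sum."""
    rng = random.Random(seed)
    m = len(pvals)
    # Null reference: cumulative sums for m independent uniform p-values.
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    null = [
        transformed_cumsums([1.0 - rng.random() for _ in range(m)])
        for _ in range(n_sim)
    ]
    columns = [sorted(row[i] for row in null) for i in range(m)]

    def n_at_least(i, x):
        # Number of simulated i-th cumulative sums that are >= x.
        return n_sim - bisect.bisect_left(columns[i], x)

    # Observed statistic: smallest estimated p-value over all cumulative sums.
    obs = transformed_cumsums(pvals)
    stat_obs = min((1 + n_at_least(i, obs[i])) / (n_sim + 1) for i in range(m))

    # Same statistic for every null draw. The count includes the draw itself,
    # which plays the role of the "+1" used for the observed data.
    stat_null = [
        min(n_at_least(i, row[i]) / n_sim for i in range(m)) for row in null
    ]

    # Calibrate: how often is a null statistic at least as small as observed?
    exceed = sum(s <= stat_obs for s in stat_null)
    return (1 + exceed) / (n_sim + 1)
```

The second simulation layer is needed because the minimum of several dependent p-values is not itself uniformly distributed under the global null; its null distribution is therefore approximated from the same simulated draws.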
We consider different transformations of the initial p-values
As expected, the Simes test performs at least as well as the Bonferroni procedure in the simulation study, although for the considered scenarios the improvement in power is small.
All our simulations are based on one-sided tests, but the methods also work for the two-sided testing scenario (see Figure 6 in the online supplemental material). For two-sided tests, however, it is also possible to reject the global null hypothesis even when the individual hypotheses show clear effects in differing directions.
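The point about differing directions can be seen from the usual conversion of a one-sided into a two-sided p-value. The short Python sketch below assumes a symmetric test statistic such as a z-statistic; the function name is hypothetical.

```python
def two_sided(p_one_sided):
    """Two-sided p-value from a one-sided one, assuming a symmetric statistic."""
    return 2.0 * min(p_one_sided, 1.0 - p_one_sided)


# Two strong effects in opposite directions: as one-sided p-values only the
# first looks extreme, but as two-sided p-values both are small.
one_sided = [0.001, 0.999]
converted = [two_sided(p) for p in one_sided]
```

A global test applied to the converted values therefore treats both effects as evidence against the global null, regardless of their sign.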
Supplemental Material
Supplemental material for An omnibus test for the global null hypothesis by Andreas Futschik, Thomas Taus and Sonja Zehetmayer: R Core Team in Statistical Methods in Medical Research