Abstract
Introduction
Estimation of divergence time is an integral part of population genetics and evolutionary biology. It is a common practice1,2,3 to estimate a range or a confidence interval of divergence time rather than a point-estimation (ie, a single value).
Despite being a common practice, maximum likelihood estimation of the time-range poses a challenge for coalescent-based models. This is partly due to the complicated expression of likelihoods of the coalescent-based models, and partly due to the fact that a likelihood-based formula for confidence interval requires the computation of estimated Fisher's information matrix, which has an even more complicated expression than the likelihood.
In this article, we present a divergence time range or a confidence interval based on the maximum likelihood estimation (MLE) of the divergence time from a coalescent-based model. We focus on the coalescent model of population evolution by Nielsen et al, 4 where the likelihood is computed by summing up over all possible coalescent trees between the present day and most recent common ancestor (MRCA). This one-step procedure takes into account the uncertainties of estimating the coalescent trees, as well as the effect of incomplete lineage sorting.
The coalescent framework is consistent with several models of population evolution, including a diffusion model, a Wright-Fisher model, a continuous-time, or a discrete-time Moran model (see5,6). Specifically, the coalescent framework in 4 requires a model with finite population size mating randomly within each population. The framework works with both monoecious and dioecious organisms and for both haploids and diploids. A complete separation between the populations is assumed at the point of divergence, and consequently, between-population mating are not allowed after the divergence. This also means that the coalescent events cannot take place between two individuals of different populations (between the present and the point of divergence).
It is further assumed that each locus has two alleles, and the allele-frequency spectrum for these biallelic loci is modeled with a symmetric Beta distribution as in.5–9 This assumption of a symmetric Beta distribution results in a symmetric Beta-Binomial distribution for the allele-count. (Note that, the assumption of Beta distribution comes from a diffusion approximation; see, for example). 10
The model is valuable for estimating the divergence time from related populations, as it produces the likelihood directly from the data, bypassing the gene trees. This is achieved by computing the exact probability of each coalescent tree as well as probabilities of allele frequencies given these trees and then summing over them with a closed-form mathematical formula. Thus, potential misestimation due to gene tree incongruence (eg, incomplete lineage sorting, see for example), 11 is avoided.
While computing the probabilities of the allele counts, it is assumed in this model that the effect of new mutations in allele frequencies between the MRCA and the present are negligible. This is an appropriate assumption for divergence estimation from related populations. This is because the divergence times in such populations are typically short, making the number of mutations very small. The variation in allele type is assumed to come from the mutations at or before the MRCA. A detailed mathematical description of the model probabilities is provided below.
There have been significant developments made in this model since its inception by 4 and a number of methods of inference have been proposed. Later 12 introduced a Markov Chain Monte Carlo-based on this model. More recently, 6 introduced a pruning algorithm to systematically compute the likelihood under this model for inference on a population tree. This approach builds on the approach of 4 and makes it possible to compute the likelihood of a large tree under this model. An innovative two-stage pruning algorithm is introduced for simultaneously keeping track of probability of the number of lineages, and allele counts among them. Later 7 introduced a composite likelihood method based on this model that can be used to analyze dependent data. This method treats the dependent allele counts from nearby loci as independently by multiplying the marginal likelihoods obtained from each locus. As a result, a composite likelihood is computed, which is then maximized to obtain a maximum composite likelihood estimator. This method makes it possible to use the information in physically close loci, whereas previous approaches could only use a set of independent loci to compute the likelihood. 13 Introduced a variation of the model that takes into account the effect ascertainment correction in the likelihood and MLE. This method is useful for data from loci that were selected because of their observed allele-frequencies (ie, ascertained). This method modifies the pruning algorithm of 6 so that the likelihood is corrected for ascertainment bias. In the same year 13 introduced a computational method for incorporating the effect of mutation at each branch in the pruning algorithm of. 6 This method makes it possible to apply the pruning algorithm to data from different species. This is because although the effect of new mutations can be ignored in closely related populations, their effect s are large when comparing different species. Finally, 9 has shown this model to be identifiable. The identifiability is an important desirable property of a statistical model; unidentifiable model parameters are ill-defined and therefore inference on such models could produce erroneous, confusing, and self-contradictory results.
Although the MLE of the divergence time is computed based on4,6 model by the previous authors, a formula for the range of divergence time has not been described. The computation of the MLE requires computation of the likelihood and then maximization of the likelihood over a period of parameter values. Computation of a range or confidence interval for divergence time requires estimation of variance and MLE, which is a more complicated procedure.
4
computed an estimate for the variance through simulations, which could be used to compute a confidence interval. To estimate variance in this manner, one needs to first estimate the divergence time; then one needs to simulate the whole dataset a large number of times (say
Here, we present a formula for computing an asymptotic approximation for divergence time range. Our formula is based on asymptotic approximation of the variance of the MLE. In statistical literature, this approximation is known to be a first converging approximation. 14 Typically, the number of independent data-points (independent loci) is very high (>10, 000) in genetic data and therefore this will be a close approximation. In addition, we also provide a formula for first and second order mixed derivatives of the likelihood and log-likelihood. This formula is useful as it can be used for maximizing the likelihood with Newton Raphson Method. To evaluate the performance of our method, we also estimated the coverage probability of the confidence interval through simulations, and established that the approximation is indeed quite good, and as accurate as a purely simulation-based approach. However, as demonstrated in the article, our method is much faster and could be computed in a fraction of time it takes to estimate the range using a simulation-based approach.
For demonstration purposes, we have used our method for estimating the range of the divergence time between two populations in HapMap data. 15 Our estimates are found to be of the same range as a recently published estimate of the divergence time of the same two populations.
Model and Definitions
In this section, we will briefly describe the model of.4,6 This model assumes that the divergence time is sufficiently small so that we can ignore the effect of mutation in the site-frequency-spectrum after divergence. Consider populations
The parameters of this model are τ(>0), the divergence time in generations in the unit of effective population size, or the population scaled divergence time (τ = 2
Next, we will describe how the probability distribution of (

Variables associated with our model.
Let
where β(., .) is the beta function; θ is the aforementioned mutation parameter which is to be estimated as well. Note that as mentioned in, 6 beta-binomial distribution model is due to the fact that the alleles at the root are binomial draws from the allele frequencies, and the allele frequency spectrum is modeled with a beta distribution (see, for example). 10
Given
4,6. Then, given
Next we combine the Eqs. (1, 2, 3, 4). With the observed allele counts are (

Note that setting τ = 0 does not maximize the above expression, as the coefficients of the exponential terms could be positive or negative.
Thus, we have described the full model. Maximum likelihood estimate (MLE) of the divergence time is computed by numerically maximizing right side of Eq. (5) above. In the next section we will present an estimator of the divergence time range, rather than a point-estimator of divergence time.
Methods
Let the MLE of (τ, θ) be (
where →
where
and a (1 - α) confidence interval for θ as
where Inf(
Using Lemma 1 in the Appendix A (Eqs. (13, 14)) and Eq. (5) it follows that
for
(
(
Estimation
Using Eqs. (5, 7–12), one can numerically compute confidence intervals of τ and θ as follows.
First,
Once we have
Results: Simulation and Comparison with Direct Simulation Method
We have simulated
Mechanism of model generation
Following the model of4,6 we started with a divergence time τ, mutation parameter θ, sample size (identical for the two populations)
For a given locus, we simulate the model (described in Section 2) as follows. First
For each combination of (
Simulation: confidence interval for τ and estimated coverage probability
Next, for each combination of (
Next, we compared the performance of our method with the method of computing the confidence interval using simulation estimation of variance
4
using the same simulated data. We have briefly described their method in Section 1. As we had done for our method, we estimated confidence interval for each of
For both CIAsymp and simulated CI the estimated ranges have small lengths. As expected, the length becomes smaller with larger
A comparison of the two methods reveals no significant difference in the coverage probabilities of the lengths of the intervals. In Wilcoxon signed-rank tests we found no significant difference between coverage probabilities of the two methods (
Results: Applications to HapMap Data
For the purpose of demonstrating our method and comparing its performance with known results, we applied our method to a subset of HapMap data.
15
Specifically, we estimated a confidence interval for the divergence time between HCB (Han Chinese from Be-jing, China) and CEU (United States residents of northern and western European ancestry) populations using 112 independent SNP loci in Chromosome 19. To reduce the computational load, we only considered a random sub-sample of 8
The estimated scaled divergence time was
Discussion
We have presented a formula for computing divergence time range using the MLE on a coalescent model. As MLE is asymptotically efficient, our estimated confidence interval has a high degree of accuracy. We have also presented formulae for first and second order mixed derivatives of likelihood and log-likelihood, which is useful for computation of the MLE.
Our simulation study shows that our method produces small confidence intervals with appropriate coverage of 95%. Thus, it reduces the amount of uncertainty regarding the actual divergence time. Although MLE of divergence time has been computed before using our model, an expression for the range of divergence time has never been derived because of the complexity of the expression. By deriving this expression we made it possible for the range of the divergence time to be estimated, and by evaluating its performance we established its usefulness.
The radii of the confidence intervals and the coverage probabilities produced by our methods have been shown to be statistically equivalent to that of the simulation-based estimator. However, we found that our method is an order of magnitude faster. This is expected because in a simulation or resampling-based method the data needs to be simulated a large number of times, and then confidence intervals needs to be reestimated for each resimulation or resample.
Certain assumptions are made about the underlying model. The accuracy of the estimated variance and range may be affected if the data do not fit the underlying model. For example, a molecular clock is assumed. That is, the scaled time t between A and O is assumed to be same as that between B and O. However, if the effective population sizes in populations a and b are very different then this will induce a bias in the estimated variance, as well as the estimated range. However, it is straightforward to extend our method for models without molecular clock. If a molecular clock is not assumed then one needs to separately estimate scaled divergence times τ
Another departure from our model would be from the assumption of biallelic loci. Following the same principle of tracking coalescing lineages back to the MRCA, and then following the allele-types to the present time, one can modify Eqs. 1–5 for more than two alleles. However, for more than two alleles, one needs to replace the symmetric beta-binomial distribution at the MRCA (which arises from assuming a symmetric beta distribution for biallelic allele-frequency spectrum) with its “multiple-count” version: a symmetric Dirichlet-Multinomial distribution (which arises from assuming a symmetric Dirichlet distribution for multiallelic allele-frequency spectrum). Once the likelihood is computed using modified version of Eqs. 1–5, then Eqs. 6–16 can be modified for estimating the variance and the range from the multiallelic likelihood.
We assumed that the effect of mutation between the point of divergence and the present is negligible. This is an appropriate assumption if the two populations are closely related (and consequently the divergence time is small). For large divergence times this assumption is not appropriate, and the effect of mutation may create an amount of difference between the two population that is higher than expected from an ignore-mutation model”. As a result, an erroneously larger divergence time may be estimated, which will also induce an upward bias in the estimated variance and the estimated range.
previous studies have suggested that this model is appropriate for populations within the same species.5–8 This designation is intended to serve as an upper bound for the divergence time to be modeled. In the absence of a mathematically concrete upper bound for divergence time for this model, we also use this convention. Thus, although we do not have a more concrete upper bound, we suggest using our methods for doing inference on divergence time between populations within the same species. (A lower bound is not necessary as the model fits well for small divergence times as the amount of mutation will be very small in such cases.) Moreover, a recent article 13 introduces a version of the4,6 model that takes into account the effect of mutation. An appropriate extension of our methods to the 13 model needs to be derived for estimating the variance and the range of larger divergence times. With such an extension, our methods could be used to estimate variance and range of species divergence times.
A possible disadvantage of our method would be in a scenario where significant amount of migration has taken place between the two populations after the divergence. In the presence of migration, there will be less variation than expected in a “no migration” scenario. Consequently, divergence time will be underestimated, which will also induce a downward bias in the estimated variance and the estimated range. Another limitation is that our method uses asymptotic approximation. Thus, our method is only applicable when we have a large number of independent loci. To quantify a large number of loci, a rule of thumb used by statisticians is that the asymptotic methods are to be used if there are at least 30 independent data-points. Thus, as long as we have allele counts for at least 30 independent loci, our method can be used. As most modern datasets have more than 30 independent loci, this is not a real limitation. Note that, as the effect of mutation after divergence is assumed to be negligible and is ignored, the difference in mutation rate between the two populations does not play a part in our method.
We applied our method to estimate a range for the divergence time between CEU and HCB populations from the HapMap data. Our estimates are found to be in the same range as a recently estimated divergence time between these two populations. 19
A possible extension of this method could be computation of the confidence interval of the branch-length of a phylogenetic tree given the tree-topology. This would require creating an algorithm that could efficiently and systematically compute the derivatives of the likelihood of the tree. Another possible extension could be using dependent loci to estimate the range of divergence time. This will have potential applications in high-resolution NextGen Sequencing data.
