
Abstract

We present an estimation of divergence time-range based on a coalescent model. This model sums the probability of coalescent trees, taking into account the effect of incomplete lineage sorting. A maximum likelihood estimate based on this model has been computed previously; however, a formula for a divergence time-range estimate or confidence interval has never been presented for this model, as the expression of the likelihood makes this estimation difficult to compute. Our formula for the divergence time-range estimate can be readily coded into a program. We did not use a simulation or resampling-based approach, and therefore our method is fast and less computationally intensive. We demonstrate that our method is much faster than, and as accurate as, a simulation-based approach.

Keywords

divergence time range confidence interval maximum likelihood coalescent

Introduction

Estimation of divergence time is an integral part of population genetics and evolutionary biology. It is a common practice^1,2,3 to estimate a range or a confidence interval of divergence time rather than a point-estimation (ie, a single value).

Despite being a common practice, maximum likelihood estimation of the time-range poses a challenge for coalescent-based models. This is partly due to the complicated expression of likelihoods of the coalescent-based models, and partly due to the fact that a likelihood-based formula for confidence interval requires the computation of estimated Fisher's information matrix, which has an even more complicated expression than the likelihood.

In this article, we present a divergence time range or a confidence interval based on the maximum likelihood estimation (MLE) of the divergence time from a coalescent-based model. We focus on the coalescent model of population evolution by Nielsen et al,⁴ where the likelihood is computed by summing up over all possible coalescent trees between the present day and most recent common ancestor (MRCA). This one-step procedure takes into account the uncertainties of estimating the coalescent trees, as well as the effect of incomplete lineage sorting.

The coalescent framework is consistent with several models of population evolution, including a diffusion model, a Wright-Fisher model, and a continuous-time or a discrete-time Moran model (see^5,6). Specifically, the coalescent framework in⁴ requires a model with a finite population size and random mating within each population. The framework works with both monoecious and dioecious organisms and for both haploids and diploids. A complete separation between the populations is assumed at the point of divergence, and consequently, between-population mating is not allowed after the divergence. This also means that coalescent events cannot take place between two individuals of different populations (between the present and the point of divergence).

It is further assumed that each locus has two alleles, and the allele-frequency spectrum for these biallelic loci is modeled with a symmetric Beta distribution as in.^5–9 This assumption of a symmetric Beta distribution results in a symmetric Beta-Binomial distribution for the allele-count. (Note that, the assumption of Beta distribution comes from a diffusion approximation; see, for example).¹⁰

The model is valuable for estimating the divergence time from related populations, as it produces the likelihood directly from the data, bypassing the gene trees. This is achieved by computing the exact probability of each coalescent tree as well as probabilities of allele frequencies given these trees and then summing over them with a closed-form mathematical formula. Thus, potential misestimation due to gene tree incongruence (eg, incomplete lineage sorting, see for example),¹¹ is avoided.

While computing the probabilities of the allele counts, it is assumed in this model that the effect of new mutations on allele frequencies between the MRCA and the present is negligible. This is an appropriate assumption for divergence estimation from related populations, because the divergence times in such populations are typically short, making the number of mutations very small. The variation in allele type is assumed to come from mutations at or before the MRCA. A detailed mathematical description of the model probabilities is provided below.

This model has been developed significantly since its inception by⁴ and a number of methods of inference have been proposed. Later,¹² introduced a Markov chain Monte Carlo-based method for this model. More recently,⁶ introduced a pruning algorithm to systematically compute the likelihood under this model for inference on a population tree. This approach builds on the approach of⁴ and makes it possible to compute the likelihood of a large tree under this model. It uses an innovative two-stage pruning algorithm that simultaneously keeps track of the probability of the number of lineages and of the allele counts among them. Later,⁷ introduced a composite likelihood method based on this model that can be used to analyze dependent data. This method treats the dependent allele counts from nearby loci as independent by multiplying the marginal likelihoods obtained from each locus. As a result, a composite likelihood is computed, which is then maximized to obtain a maximum composite likelihood estimator. This method makes it possible to use the information in physically close loci, whereas previous approaches could only use a set of independent loci to compute the likelihood. In addition,¹³ introduced a variation of the model that takes into account the effect of ascertainment correction in the likelihood and the MLE. This method is useful for data from loci that were selected because of their observed allele frequencies (ie, ascertained). This method modifies the pruning algorithm of⁶ so that the likelihood is corrected for ascertainment bias. In the same year,¹³ introduced a computational method for incorporating the effect of mutation at each branch in the pruning algorithm of.⁶ This method makes it possible to apply the pruning algorithm to data from different species, because although the effect of new mutations can be ignored in closely related populations, their effects are large when comparing different species. Finally,⁹ has shown this model to be identifiable. Identifiability is an important desirable property of a statistical model; unidentifiable model parameters are ill-defined, and therefore inference on such models could produce erroneous, confusing, and self-contradictory results.

Although the MLE of the divergence time has been computed under the model of^4,6 by previous authors, a formula for the range of the divergence time has not been described. Computing the MLE requires computing the likelihood and then maximizing it over a range of parameter values. Computing a range or confidence interval for the divergence time requires both the MLE and an estimate of its variance, which is a more complicated procedure. In⁴, the variance was estimated through simulations, which could be used to compute a confidence interval. To estimate the variance in this manner, one first estimates the divergence time; then one simulates the whole dataset a large number of times (say M = 100 times) using the estimated value of the divergence time as the real divergence time. Then, the divergence time is estimated from each of those M = 100 simulated datasets. The sample variance among the M estimated divergence times is taken as the estimated variance. However, this method takes a long time to compute, as one needs to simulate and estimate the divergence time M = 100 times (or a predetermined large number of times). A resampling approach (eg, bootstrap) would have a similarly large time requirement.
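The simulation-based variance estimate described above can be sketched generically. This is a minimal illustration, not the code of the cited authors: the function names are hypothetical, and the toy model (exponential waiting times) is only a stand-in for the coalescent model so that the example stays short and runnable.

```python
import random
import statistics

def simulation_variance(estimate, simulate, data, M=100, seed=0):
    """Estimate the variance of an estimator by re-simulating at the fitted value.

    estimate: function mapping a dataset to a parameter estimate
    simulate: function (param, rng) -> a new dataset of the same shape
    """
    rng = random.Random(seed)
    fitted = estimate(data)                       # step 1: point estimate
    replicates = [estimate(simulate(fitted, rng))  # steps 2-3: simulate, re-estimate
                  for _ in range(M)]
    return fitted, statistics.variance(replicates)  # step 4: sample variance

# Toy stand-in model (NOT the coalescent model): exponential waiting times,
# for which the MLE of the rate is 1 / mean.
def toy_estimate(xs):
    return len(xs) / sum(xs)

def toy_simulate(rate, rng, n=500):
    return [rng.expovariate(rate) for _ in range(n)]

data = toy_simulate(2.0, random.Random(1))
fitted, var = simulation_variance(toy_estimate, toy_simulate, data)
half_width = 1.96 * var ** 0.5                    # normal-theory 95% half-width
print(fitted - half_width, fitted + half_width)
```

The cost is dominated by the M re-estimations, which is why this approach becomes slow when each estimation is itself a numerical maximization over many loci.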

Here, we present a formula for computing an asymptotic approximation for the divergence time range. Our formula is based on an asymptotic approximation of the variance of the MLE. In statistical literature, this approximation is known to be a first converging approximation.¹⁴ Typically, the number of independent data-points (independent loci) is very high (>10,000) in genetic data, and therefore this will be a close approximation. In addition, we provide a formula for first and second order mixed derivatives of the likelihood and log-likelihood. This formula is useful because it can be used for maximizing the likelihood with the Newton-Raphson method. To evaluate the performance of our method, we also estimated the coverage probability of the confidence interval through simulations, and established that the approximation is indeed quite good, and as accurate as a purely simulation-based approach. However, as demonstrated in the article, our method is much faster and can be computed in a fraction of the time it takes to estimate the range using a simulation-based approach.

For demonstration purposes, we have used our method for estimating the range of the divergence time between two populations in HapMap data.¹⁵ Our estimates are found to be of the same range as a recently published estimate of the divergence time of the same two populations.

Model and Definitions

In this section, we briefly describe the model of.^4,6 This model assumes that the divergence time is sufficiently small that we can ignore the effect of mutation on the site-frequency spectrum after divergence. Consider populations A and B with MRCA O (Fig. 1). Let us assume that each locus exhibits two alleles, arbitrarily named ‘0’ and ‘1’; by ‘allele count’ we mean the count of allele ‘1’. The model of^4,6 gives the probability distribution of the allele counts (r_A, r_B) from (haploid) samples of sizes (n_A, n_B), respectively. (Note that by “haploid” we mean a sampling unit. The data can be from either haploid or diploid organisms. A haploid sampling unit is defined as a set of chromosomes containing one chromosome of each type. Each diploid individual has two sampling units.)

The parameters of this model are τ (> 0), the divergence time in generations in units of effective population size, or the population scaled divergence time (τ = t/(2N_e)), and θ (> 0), a mutation parameter or population scaled mutation rate (θ = 4N_e μ), where t, N_e and μ are, respectively, the number of generations since the divergence, the effective population size and the (raw) mutation rate at the time of divergence. Note that we assume a molecular clock; ie, the scaled time between A and O is the same as the scaled time between B and O. However, this assumption is not binding, and we will discuss relaxing it later in this article.

Next, we describe how the probability distribution of (r_A, r_B) is computed. First, we follow the n_A and n_B lineages at populations A and B, respectively, to the MRCA O, and compute the probability distribution of the number of coalescent events k_A and k_B, respectively. Let n_0A and n_0B be the number of lineages at O that are ancestral to the sampled lineages at A and B, respectively, and let r_0A and r_0B, respectively, be the allele counts among them (Fig. 1). Note that n_0A = n_A - k_A and n_0B = n_B - k_B. The distributions of n_0A and n_0B can be computed from the following formula (first proposed by¹⁶).

Figure 1.

Variables associated with our model.

$$
\Pr(n_{0m} = i' \mid n_m = i; τ) = \left(\prod_{j''=i'+1}^{i} λ_{j''}\right) \sum_{j=i'}^{i} \frac{e^{-λ_j τ}}{\prod_{j'=i',\, j' \neq j}^{i} (λ_{j'} - λ_j)} = \sum_{j=i'}^{i} c^{(1)}_{i i' j}\, e^{-λ_j τ},
$$

(1)

m = A, B, where λ_j = j(j − 1)/2 and

$$
c^{(1)}_{i i' j} = \frac{\prod_{j''=i'+1}^{i} λ_{j''}}{\prod_{j'=i',\, j' \neq j}^{i} (λ_{j'} - λ_j)} .
$$
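As a numerical check of Eq. (1), the sketch below evaluates the coefficients c^(1) and the resulting lineage-count probabilities. The function names (`lam`, `c1`, `prob_lineages`) are illustrative and are not taken from any published implementation; the sum over j is taken from i' to i, matching the index range of the product in the denominator.

```python
from math import exp, prod

def lam(j):
    """Coalescence rate lambda_j = j (j - 1) / 2."""
    return j * (j - 1) / 2

def c1(i, i0, j):
    """Coefficient c^(1)_{i i' j} multiplying exp(-lambda_j * tau) in Eq. (1)."""
    num = prod(lam(k) for k in range(i0 + 1, i + 1))
    den = prod(lam(k) - lam(j) for k in range(i0, i + 1) if k != j)
    return num / den

def prob_lineages(i0, i, tau):
    """Pr(n_0m = i0 | n_m = i; tau) for 1 <= i0 <= i."""
    return sum(c1(i, i0, j) * exp(-lam(j) * tau) for j in range(i0, i + 1))

# With two sampled lineages, the chance of no coalescence by time tau is exp(-tau).
print(prob_lineages(2, 2, 0.5), exp(-0.5))
# The probabilities over i0 = 1..i sum to one.
print(sum(prob_lineages(k, 8, 0.1) for k in range(1, 9)))
```

For large sample sizes the alternating signs of the coefficients can cause cancellation in floating point, so this direct evaluation is best treated as suitable only for small n such as the values 4 and 8 used later in the article.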

Let n₀ be the total number of lineages at O that are ancestral to all the lineages sampled at A and B. Using the fact that n₀ = n_0A + n_0B along with Eq. (1), one can compute the probability distribution of n₀. Let r₀ = r_0A + r_0B be the (random) allele count among the n₀ lineages at the MRCA (Fig. 1). The probability of r₀ given n₀ is given by a ‘root distribution’ that varies in different versions of the model. We use the root distribution used in⁶ (symmetric beta-binomial)

$$
\Pr(r_0 = j_0 \mid n_0 = i_0; θ) = \binom{i_0}{j_0} \frac{β(j_0 + θ,\; i_0 - j_0 + θ)}{β(θ, θ)}
$$

(2)

where β(., .) is the beta function; θ is the aforementioned mutation parameter which is to be estimated as well. Note that as mentioned in,⁶ beta-binomial distribution model is due to the fact that the alleles at the root are binomial draws from the allele frequencies, and the allele frequency spectrum is modeled with a beta distribution (see, for example).¹⁰
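A small sketch of the root distribution in Eq. (2), computed on the log scale so that it remains stable for the very small θ used later in the simulations. The helper names are illustrative.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """log of the beta function, via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def root_prob(j0, i0, theta):
    """Pr(r_0 = j0 | n_0 = i0; theta): symmetric beta-binomial of Eq. (2)."""
    return comb(i0, j0) * exp(log_beta(j0 + theta, i0 - j0 + theta)
                              - log_beta(theta, theta))

theta = 4 * 10_000 * 1.1e-8          # the value of theta used in the simulations
probs = [root_prob(j, 8, theta) for j in range(9)]
print(sum(probs))                     # the probabilities sum to one
print(probs[0], probs[8])             # symmetric: equal mass at 0 and at 8
```

With θ this small, almost all of the probability sits on the two monomorphic outcomes, which reflects the fact that most sites are not variable.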

Given n_0A, n_0B, and r₀, the distribution of (r_0A, r_0B) can be computed as^4,6

$$
\Pr(r_{0A} = j_{0A}, r_{0B} = j_{0B} \mid r_0 = j_0, n_{0A} = i_{0A}, n_{0B} = i_{0B}; θ) = \frac{\binom{j_{0A} + j_{0B}}{j_{0A}} \binom{i_{0A} + i_{0B} - j_{0A} - j_{0B}}{i_{0A} - j_{0A}}}{\binom{i_{0A} + i_{0B}}{i_{0A}}} = c^{(2)}_{i_{0A} i_{0B} j_{0A} j_{0B}} \quad (\text{say})
$$

(3)

Then, given n_0A, n_0B, r_0A and r_0B, the distribution of (r_A, r_B) can be computed as

$$
\Pr(r_m = j_m \mid r_{0m} = j_{0m}, n_{0m} = i_{0m}; n_m = i_m) =
\begin{cases}
\dfrac{β(j_m,\; i_m - j_m)}{β(j_{0m},\; i_{0m} - j_{0m})} \dbinom{i_m - i_{0m}}{j_m - j_{0m}}, & 0 < j_m < i_m \text{ and } 0 < j_{0m} < i_{0m}, \\
1, & 0 = j_m = j_{0m} \text{ or } 0 = i_m - j_m = i_{0m} - j_{0m}, \\
0, & \text{otherwise}
\end{cases}
= c^{(3)}_{i_m i_{0m} j_{0m} j_m} \quad (\text{say})
$$

(4)

m = A, B.⁶
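Eqs. (3) and (4) can be checked numerically with the sketch below. The function names `c2` and `c3` mirror the coefficient labels in the text but are otherwise illustrative. Eq. (3) is a hypergeometric split of the root allele count between the two sets of ancestral lineages, and Eq. (4) is the distribution of a Pólya-type urn that grows from i_0m to i_m lineages.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def c2(i0A, i0B, j0A, j0B):
    """Eq. (3): probability that the root count splits as (j0A, j0B)."""
    return (comb(j0A + j0B, j0A)
            * comb(i0A + i0B - j0A - j0B, i0A - j0A)
            / comb(i0A + i0B, i0A))

def c3(i, i0, j0, j):
    """Eq. (4): Pr(r_m = j | r_0m = j0, n_0m = i0; n_m = i)."""
    if 0 < j0 < i0 and 0 < j < i:
        # Neither allele count can decrease going forward in time.
        if j < j0 or (i - j) < (i0 - j0):
            return 0.0
        return comb(i - i0, j - j0) * exp(log_beta(j, i - j)
                                          - log_beta(j0, i0 - j0))
    if (j == 0 and j0 == 0) or (i - j == 0 and i0 - j0 == 0):
        return 1.0                      # a fixed allele stays fixed
    return 0.0

# Both are proper distributions over their first argument.
print(sum(c2(3, 2, a, 2 - a) for a in range(3)))
print(sum(c3(8, 3, 1, j) for j in range(9)))
```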

Next, we combine Eqs. (1, 2, 3, 4). With observed allele counts (r_A, r_B), the likelihood of (τ, θ) can be computed as

(5)
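A sketch of the form Eq. (5) takes when Eqs. (1)–(4) are combined by the law of total probability, writing (j_A, j_B) for the observed allele counts and i_A = n_A, i_B = n_B. The arrangement of the sums is an assumption, chosen so that it agrees with Eq. (10) when no derivative is taken.

```latex
$$
\begin{aligned}
L(τ, θ; (r_A, r_B) = (j_A, j_B)) = {} & \sum_{i_{0A}=1}^{i_A} \sum_{i_{0B}=1}^{i_B}
  \Pr(n_{0A} = i_{0A} \mid n_A = i_A; τ)\, \Pr(n_{0B} = i_{0B} \mid n_B = i_B; τ) \\
& \times \sum_{j_{0A}=0}^{i_{0A}} \sum_{j_{0B}=0}^{i_{0B}}
  \Pr(r_0 = j_{0A} + j_{0B} \mid n_0 = i_{0A} + i_{0B}; θ)\;
  c^{(2)}_{i_{0A} i_{0B} j_{0A} j_{0B}}\,
  c^{(3)}_{i_A i_{0A} j_{0A} j_A}\,
  c^{(3)}_{i_B i_{0B} j_{0B} j_B}
\end{aligned}
$$
```

Substituting Eq. (1) for the first two factors turns the expression into a finite sum of terms of the form e^{-(λ_{i_1} + λ_{i_2}) τ} with coefficients that depend on θ but not on τ.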

Note that setting τ = 0 does not maximize the above expression, as the coefficients of the exponential terms could be positive or negative.

Thus, we have described the full model. The maximum likelihood estimate (MLE) of the divergence time is computed by numerically maximizing the right side of Eq. (5) above. In the next section we present an estimator of the divergence time range, rather than a point estimator of the divergence time.
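The single-locus likelihood can be sketched by chaining Eqs. (1)–(4) directly. This is a minimal illustration under the assumption that the likelihood is the total-probability sum over the unobserved counts at O; it is written for clarity rather than speed, and all function names are illustrative.

```python
from math import comb, exp, lgamma, prod

def lam(j):
    return j * (j - 1) / 2

def prob_lineages(i0, i, tau):                       # Eq. (1)
    total = 0.0
    for j in range(i0, i + 1):
        num = prod(lam(k) for k in range(i0 + 1, i + 1))
        den = prod(lam(k) - lam(j) for k in range(i0, i + 1) if k != j)
        total += num / den * exp(-lam(j) * tau)
    return total

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def root_prob(j0, i0, theta):                        # Eq. (2)
    return comb(i0, j0) * exp(log_beta(j0 + theta, i0 - j0 + theta)
                              - log_beta(theta, theta))

def c2(i0A, i0B, j0A, j0B):                          # Eq. (3)
    return (comb(j0A + j0B, j0A) * comb(i0A + i0B - j0A - j0B, i0A - j0A)
            / comb(i0A + i0B, i0A))

def c3(i, i0, j0, j):                                # Eq. (4)
    if 0 < j0 < i0 and 0 < j < i:
        if j < j0 or (i - j) < (i0 - j0):
            return 0.0
        return comb(i - i0, j - j0) * exp(log_beta(j, i - j)
                                          - log_beta(j0, i0 - j0))
    if (j == 0 and j0 == 0) or (i - j == 0 and i0 - j0 == 0):
        return 1.0
    return 0.0

def likelihood(jA, jB, nA, nB, tau, theta):
    """Probability of observing allele counts (jA, jB) at one locus."""
    total = 0.0
    for i0A in range(1, nA + 1):
        pA = prob_lineages(i0A, nA, tau)
        for i0B in range(1, nB + 1):
            pB = prob_lineages(i0B, nB, tau)
            for j0A in range(i0A + 1):
                for j0B in range(i0B + 1):
                    total += (pA * pB
                              * root_prob(j0A + j0B, i0A + i0B, theta)
                              * c2(i0A, i0B, j0A, j0B)
                              * c3(nA, i0A, j0A, jA)
                              * c3(nB, i0B, j0B, jB))
    return total

# Sanity check: the likelihood is a probability distribution over (jA, jB).
print(sum(likelihood(a, b, 4, 4, 0.1, 0.5) for a in range(5) for b in range(5)))
```

For L independent loci, the multi-locus log-likelihood is the sum of the logarithms of these single-locus values, which is the quantity a grid search or Newton-Raphson iteration would maximize.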

Methods

Let the MLE of (τ, θ) be ( ${\hat{τ}}_{MLE}, {\hat{θ}}_{MLE}$ ). Using the standard statistical results,

$$
\sqrt{L} \left( \begin{pmatrix} \hat{τ}_{MLE} \\ \hat{θ}_{MLE} \end{pmatrix} - \begin{pmatrix} τ \\ θ \end{pmatrix} \right) \to_{d} \mathrm{Normal}_{2} \left(0, \mathrm{Inf}^{-1}(τ, θ)\right)
$$

(6)

where →_d denotes convergence in distribution (see, for example);⁷ L is the number of independent data-points (independent loci in our case), Normal₂ denotes a bivariate normal distribution, and Inf denotes Fisher's Information matrix:

$$
\mathrm{Inf}(τ, θ) = E \left[ - \begin{pmatrix} \dfrac{\partial^{2}}{\partial τ^{2}} \log_{e} L(τ, θ) & \dfrac{\partial^{2}}{\partial τ\, \partial θ} \log_{e} L(τ, θ) \\ \dfrac{\partial^{2}}{\partial τ\, \partial θ} \log_{e} L(τ, θ) & \dfrac{\partial^{2}}{\partial θ^{2}} \log_{e} L(τ, θ) \end{pmatrix} \right]
$$

(7)

where L(τ, θ) = L (τ, θ; (r_A, r_B) = (j_A, j_B)) is the likelihood. Using Eq. (6), one can estimate a (1 - α) confidence interval for τ as

$$
\left\{ \hat{τ}_{MLE} - z_{1-α/2} \sqrt{\mathrm{Inf}^{(11)}(\hat{τ}_{MLE}, \hat{θ}_{MLE}) / L},\;\; \hat{τ}_{MLE} + z_{1-α/2} \sqrt{\mathrm{Inf}^{(11)}(\hat{τ}_{MLE}, \hat{θ}_{MLE}) / L} \right\}
$$

(8)

and a (1 - α) confidence interval for θ as

$$
\left\{ \hat{θ}_{MLE} - z_{1-α/2} \sqrt{\mathrm{Inf}^{(22)}(\hat{τ}_{MLE}, \hat{θ}_{MLE}) / L},\;\; \hat{θ}_{MLE} + z_{1-α/2} \sqrt{\mathrm{Inf}^{(22)}(\hat{τ}_{MLE}, \hat{θ}_{MLE}) / L} \right\},
$$

(9)

where Inf^(ij) is the element (i, j) of Inf⁻¹ and z_{1−α/2} is the (1 − α/2)th quantile of the standard normal distribution. Next, we obtain a simpler expression for Eqs. (8, 9).
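Given a 2 × 2 per-locus information matrix evaluated at the MLE, the intervals of Eqs. (8, 9) reduce to a few lines. The sketch below uses hypothetical numbers for the information matrix and the estimates, purely to show the arithmetic; the function name is illustrative.

```python
from statistics import NormalDist

def wald_intervals(info, estimates, n_loci, alpha=0.05):
    """Confidence intervals of Eqs. (8, 9).

    info:      2x2 per-locus Fisher information [[a, b], [b, d]] at the MLE
    estimates: (tau_hat, theta_hat)
    n_loci:    number of independent loci L
    """
    (a, b), (_, d) = info
    det = a * d - b * b
    inv_diag = (d / det, a / det)            # Inf^(11) and Inf^(22)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    intervals = []
    for est, v in zip(estimates, inv_diag):
        radius = z * (v / n_loci) ** 0.5
        intervals.append((est - radius, est + radius))
    return intervals

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
tau_ci, theta_ci = wald_intervals([[4.0, 0.0], [0.0, 1.0]], (0.1, 0.5), n_loci=100)
print(tau_ci, theta_ci)
```

Because the radius scales as 1/√L, a tenfold increase in the number of loci shrinks the interval by a factor of about 3.16.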

Using Lemma 1 in the Appendix A (Eqs. (13, 14)) and Eq. (5) it follows that

$$
\begin{aligned}
& \frac{\partial^{l+m}}{\partial τ^{l}\, \partial θ^{m}} L(τ, θ; (r_A, r_B) = (j_A, j_B)) \\
& \quad = \sum_{i_1=1}^{i_A} \sum_{i_2=1}^{i_B} \left(-(λ_{i_1} + λ_{i_2})\right)^{l} e^{-(λ_{i_1} + λ_{i_2}) τ} \sum_{i_{0A}=1}^{i_1} \sum_{i_{0B}=1}^{i_2} c^{(1)}_{i_A i_{0A} i_1}\, c^{(1)}_{i_B i_{0B} i_2} \\
& \qquad \times \sum_{j_{0A}=0}^{i_{0A}} \sum_{j_{0B}=0}^{i_{0B}} \binom{i_{0A} + i_{0B}}{j_{0A} + j_{0B}} \frac{β(j_{0A} + j_{0B} + θ,\; i_{0A} + i_{0B} - j_{0A} - j_{0B} + θ)}{β(θ, θ)} \\
& \qquad \times \left( δ^{m}(θ, j_{0A} + j_{0B}, i_{0A} + i_{0B} - j_{0A} - j_{0B}) + 1_{\{m = 2\}}\, δ_{1}(θ, j_{0A} + j_{0B}, i_{0A} + i_{0B} - j_{0A} - j_{0B}) \right) \\
& \qquad \times c^{(2)}_{i_{0A} i_{0B} j_{0A} j_{0B}}\, c^{(3)}_{i_A i_{0A} j_{0A} j_A}\, c^{(3)}_{i_B i_{0B} j_{0B} j_B}
\end{aligned}
$$

(10)

for l, m ∈ {1, 2}. Also, note that

$$
\begin{aligned}
\frac{\partial^{l+m}}{\partial τ^{l}\, \partial θ^{m}} \log_{e} L(τ, θ; (r_A, r_B) = (j_A, j_B)) = {} & \frac{\frac{\partial^{l+m}}{\partial τ^{l}\, \partial θ^{m}} L(τ, θ; (r_A, r_B) = (j_A, j_B))}{L(τ, θ; (r_A, r_B) = (j_A, j_B))} \\
& - \frac{\left(\frac{\partial}{\partial τ} L(τ, θ; (r_A, r_B) = (j_A, j_B))\right)^{l} \left(\frac{\partial}{\partial θ} L(τ, θ; (r_A, r_B) = (j_A, j_B))\right)^{m}}{\left(L(τ, θ; (r_A, r_B) = (j_A, j_B))\right)^{2}}
\end{aligned}
$$

(11)

(l, m) ∈ {(1, 1), (2, 0), (0, 2)} and hence,

$$
\begin{aligned}
& E \left[ \frac{\partial^{l+m}}{\partial τ^{l}\, \partial θ^{m}} \log_{e} L(τ, θ; (r_A, r_B) = (j_A, j_B)) \right] \\
& \quad = \sum_{j_A=0}^{n_A} \sum_{j_B=0}^{n_B} \left[ \frac{\partial^{l+m}}{\partial τ^{l}\, \partial θ^{m}} L(τ, θ; (r_A, r_B) = (j_A, j_B)) \right] \\
& \qquad - \sum_{j_A=0}^{n_A} \sum_{j_B=0}^{n_B} \left[ \frac{\left(\frac{\partial}{\partial τ} L(τ, θ; (r_A, r_B) = (j_A, j_B))\right)^{l} \left(\frac{\partial}{\partial θ} L(τ, θ; (r_A, r_B) = (j_A, j_B))\right)^{m}}{L(τ, θ; (r_A, r_B) = (j_A, j_B))} \right]
\end{aligned}
$$

(12)

(l, m) ∈ {(1, 1), (2, 0), (0, 2)}.
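The structure of Eq. (12) can be illustrated on any two-parameter discrete model by summing over all outcomes. The sketch below is an assumption-laden stand-in: it replaces the analytic derivatives of Eq. (10) with central finite differences and uses a toy pair of independent Bernoulli variables (not the coalescent likelihood), for which the information matrix is known in closed form.

```python
def expected_information(pmf, outcomes, params, h=1e-4):
    """Fisher information of a two-parameter pmf, following Eqs. (7) and (12).

    pmf(x, p1, p2) returns the probability of outcome x.
    Derivatives are approximated by central finite differences with step h.
    """
    def shifted(k_shifts):
        q = list(params)
        for k, s in k_shifts:
            q[k] += s * h
        return q

    def d1(x, k):
        return (pmf(x, *shifted([(k, 1)])) - pmf(x, *shifted([(k, -1)]))) / (2 * h)

    def d2(x, k, l):
        total = 0.0
        for sk, sl in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
            total += sk * sl * pmf(x, *shifted([(k, sk), (l, sl)]))
        return total / (4 * h * h)

    info = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(2):
        for l in range(2):
            # Eq. (12): sum of second derivatives minus sum of (dL)(dL)/L.
            expectation = sum(d2(x, k, l) - d1(x, k) * d1(x, l) / pmf(x, *params)
                              for x in outcomes)
            info[k][l] = -expectation
    return info

# Toy model: x = (u, v) with u ~ Bernoulli(p) and v ~ Bernoulli(q), independent.
def toy_pmf(x, p, q):
    u, v = x
    return (p if u else 1 - p) * (q if v else 1 - q)

outcomes = [(0, 0), (0, 1), (1, 0), (1, 1)]
info = expected_information(toy_pmf, outcomes, (0.3, 0.6))
print(info)   # close to [[1/0.21, 0], [0, 1/0.24]]
```

In the setting of the article, the outcomes would be all pairs (j_A, j_B) with 0 ≤ j_A ≤ n_A and 0 ≤ j_B ≤ n_B, and the finite differences would be replaced by the closed-form derivatives of Eq. (10).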

Estimation

Using Eqs. (5, 7–12), one can numerically compute confidence intervals of τ and θ as follows.

First, ${\hat{τ}}_{MLE}$ and ${\hat{θ}}_{MLE}$ are computed by maximizing the likelihood given in Eq. (5). The maximization may be done over a grid of values of τ and θ; alternatively, the Newton-Raphson method may be used. The derivatives of the likelihood or log-likelihood needed for the Newton-Raphson method can be obtained by substituting the derivatives of the likelihood from Eq. (10) into Eq. (11).

Once we have ${\hat{τ}}_{MLE}$ and ${\hat{θ}}_{MLE}$ , we can compute $Inf ({\hat{τ}}_{MLE}, {\hat{θ}}_{MLE})$ using Eqs. (7, 10, 12) with (τ, θ) replaced by ( ${\hat{τ}}_{MLE}$ , ${\hat{θ}}_{MLE}$ ). Next, $Inf {({\hat{τ}}_{MLE}, {\hat{θ}}_{MLE})}^{- 1}$ is computed numerically, and subsequently the confidence intervals of Eqs. (8, 9) are computed.

Results: Simulation and Comparison with Direct Simulation Method

We simulated N = 1,000 datasets for each combination of sample size n = n_A = n_B, divergence time τ and number of independent loci L: n takes values 4 and 8; τ takes values 0.01, 0.02, 0.05, 0.075, 0.1, 0.15 and 0.2; L takes values 10,000 and 100,000. The mutation parameter was kept fixed at θ = 4 × 10,000 × (1.1 × 10⁻⁸) = 4.4 × 10⁻⁴ to reflect a (human) effective population size of 10,000 and a mutation rate of 1.1 × 10⁻⁸.

Mechanism of model generation

Following the model of^4,6 we started with a divergence time τ, mutation parameter θ, sample size (identical for the two populations) n = n_A = n_B, and a predetermined number of independent loci L. For each locus the process is identical and independent (given the parameters). Therefore, we will describe the process for a single locus only.

For a given locus, we simulate the model (described in Section 2) as follows. First, n_0A and n_0B are simulated from n_A and n_B using Eq. (1). Next, n₀ is computed as n₀ = n_0A + n_0B. Then, r₀ is simulated from n₀ and θ from the symmetric beta-binomial distribution (Eq. (2)) as in.⁶ The symmetric beta-binomial distribution is used in⁶ to characterize the allele-frequency spectrum. That is, it is assumed in⁶ that the allele-frequency spectrum over the L loci has a symmetric Beta distribution, and r₀ given n₀ is a randomly drawn allele count from a sample of n₀ haploids where the allele frequency is P (a random draw from the allele-frequency spectrum, which has a symmetric beta distribution with parameter θ). Thus, r₀ given n₀ has a symmetric beta-binomial distribution with parameter θ. Next, we simulate r_0A and r_0B from n₀, r₀, n_0A and n_0B using Eq. (3). Then r_m is simulated from n_m, n_0m and r_0m using Eq. (4) for m = A and B. This process is repeated independently (given the parameters) L times to generate the allele counts r_A and r_B for each of the L loci.
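The single-locus simulation steps can be sketched as follows. The sampling devices are assumptions chosen for convenience rather than a description of the code used in the study: exponential waiting times reproduce the lineage-count distribution of Eq. (1), a shuffle gives the hypergeometric split of Eq. (3), and a Pólya-type urn gives Eq. (4). All function names are illustrative.

```python
import random
from math import comb, exp, lgamma

def sim_lineages(n, tau, rng):
    """Number of ancestral lineages remaining at scaled time tau (Eq. (1))."""
    k, t = n, 0.0
    while k > 1:
        t += rng.expovariate(k * (k - 1) / 2)   # waiting time at rate lambda_k
        if t > tau:
            break
        k -= 1
    return k

def sim_root_count(n0, theta, rng):
    """Draw r_0 from the symmetric beta-binomial of Eq. (2) via its pmf."""
    lb = lambda a, b: lgamma(a) + lgamma(b) - lgamma(a + b)
    weights = [comb(n0, j) * exp(lb(j + theta, n0 - j + theta) - lb(theta, theta))
               for j in range(n0 + 1)]
    return rng.choices(range(n0 + 1), weights=weights)[0]

def sim_split(r0, n0A, n0B, rng):
    """Split r_0 between the lineages ancestral to A and to B (Eq. (3))."""
    labels = [1] * r0 + [0] * (n0A + n0B - r0)
    rng.shuffle(labels)
    r0A = sum(labels[:n0A])
    return r0A, r0 - r0A

def sim_urn(n, n0, r0, rng):
    """Grow n0 lineages (r0 carrying allele '1') to n lineages (Eq. (4))."""
    ones, total = r0, n0
    while total < n:
        if rng.random() < ones / total:   # a uniformly chosen lineage splits
            ones += 1
        total += 1
    return ones

def simulate_locus(nA, nB, tau, theta, rng):
    n0A, n0B = sim_lineages(nA, tau, rng), sim_lineages(nB, tau, rng)
    r0 = sim_root_count(n0A + n0B, theta, rng)
    r0A, r0B = sim_split(r0, n0A, n0B, rng)
    return sim_urn(nA, n0A, r0A, rng), sim_urn(nB, n0B, r0B, rng)

rng = random.Random(0)
draws = [simulate_locus(4, 4, 0.1, 0.5, rng) for _ in range(200)]
print(draws[:5])
```

Drawing r₀ from the explicit probabilities, instead of first drawing a Beta-distributed frequency, avoids floating-point underflow when θ is as small as the value used in the simulations.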

For each combination of (n, τ, L), an MLE ( ${\hat{τ}}_{MLE}$ , ${\hat{θ}}_{MLE}$ ) was computed by maximizing the full likelihood over τ and θ for each of the N = 1,000 repetitions. Thus, for each combination of (n, τ, L), we have N = 1,000 estimates of ( ${\hat{τ}}_{MLE}$ , ${\hat{θ}}_{MLE}$ ). Then, applying our method to these estimates, 95% confidence intervals for τ were estimated for each of the N = 1,000 repetitions for each combination of (n, τ, L). Finally, for each combination of (n, τ, L), we computed the average of the confidence intervals over the N = 1,000 estimated values (reported in Table 1 in the CI_Asymp columns).

Table 1

Simulation: confidence interval for τ and estimated coverage probability P.

τ, n	L = 10,000: CI_Asymp	L = 10,000: simulated-variance CI	L = 100,000: CI_Asymp	L = 100,000: simulated-variance CI
τ = 0.01 n = 4	CI: {0.01 ± 0.0010} P = 0.951	{0.01 ± 0.0009} P = 0.955	CI: {0.01 ± 0.0003} P = 0.949	{0.01 ± 0.0003} P = 0.955
τ = 0.01 n = 8	CI: {0.01 ± 0.0004} P = 0.947	{0.01 ± 0.0005} P = 0.954	CI: {0.01 ± 0.0001} P = 0.954	{0.01 ± 0.0001} P = 0.945
τ = 0.02 n = 4	CI: {0.02 ± 0.0015} P = 0.945	{0.02 ± 0.0014} P = 0.960	CI: {0.02 ± 0.0005} P = 0.950	{0.02 ± 0.0003} P = 0.953
τ = 0.02 n = 8	CI: {0.02 ± 0.0006} P = 0.948	{0.02 ± 0.0005} P = 0.948	CI: {0.02 ± 0.0002} P = 0.942	{0.02 ± 0.0002} P = 0.944
τ = 0.05 n = 4	CI: {0.05 ± 0.0025} P = 0.954	{0.05 ± 0.0029} P = 0.959	CI: {0.05 ± 0.0008} P = 0.955	{0.05 ± 0.0010} P = 0.953
τ = 0.05 n = 8	CI: {0.05 ± 0.0012} P = 0.951	{0.05 ± 0.0010} P = 0.944	CI: {0.05 ± 0.0004} P = 0.951	{0.05 ± 0.0005} P = 0.945
τ = 0.075 n = 4	CI: {0.075 ± 0.0032} P = 0.954	{0.075 ± 0.0036} P = 0.949	CI: {0.075 ± 0.0010} P = 0.942	{0.075 ± 0.0011} P = 0.950
τ = 0.075 n = 8	CI: {0.075 ± 0.0016} P = 0.944	{0.075 ± 0.0015} P = 0.943	CI: {0.075 ± 0.0005} P = 0.952	{0.075 ± 0.0005} P = 0.961
τ = 0.1 n = 4	CI: {0.10 ± 0.0040} P = 0.964	{0.10 ± 0.0041} P = 0.945	CI: {0.10 ± 0.0013} P = 0.956	{0.10 ± 0.0011} P = 0.961
τ = 0.1 n = 8	CI: {0.10 ± 0.0022} P = 0.949	{0.10 ± 0.0018} P = 0.950	CI: {0.10 ± 0.0007} P = 0.949	{0.10 ± 0.0007} P = 0.948
τ = 0.15 n = 4	CI: {0.15 ± 0.0055} P = 0.957	{0.15 ± 0.0049} P = 0.954	CI: {0.15 ± 0.0017} P = 0.948	{0.15 ± 0.0016} P = 0.955
τ = 0.15 n = 8	CI: {0.15 ± 0.0034} P = 0.951	{0.15 ± 0.0038} P = 0.949	CI: {0.15 ± 0.0011} P = 0.957	{0.15 ± 0.0011} P = 0.956
τ = 0.2 n = 4	CI: {0.2 ± 0.0071} P = 0.940	{0.2 ± 0.0076} P = 0.949	CI: {0.2 ± 0.0022} P = 0.954	{0.2 ± 0.0024} P = 0.957
τ = 0.2 n = 8	CI: {0.2 ± 0.0050} P = 0.944	{0.2 ± 0.0054} P = 0.951	CI: {0.2 ± 0.0016} P = 0.957	{0.2 ± 0.0019} P = 0.948

Next, for each combination of (n, τ, L), we also estimated the probability of the true τ falling into the estimated confidence interval (coverage probability) as the number of times the true value was in the estimated interval (among the N = 1,000 repetitions) divided by N. The results are in Table 1 in the CI_Asymp columns.
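The two summaries reported in Table 1, the average interval and the empirical coverage, amount to the following computation. The four intervals in the usage example are made-up values for illustration only, not results from the study.

```python
def summarize_intervals(intervals, true_value):
    """Average confidence interval and empirical coverage over N repetitions."""
    n = len(intervals)
    mean_lo = sum(lo for lo, _ in intervals) / n
    mean_hi = sum(hi for _, hi in intervals) / n
    covered = sum(lo <= true_value <= hi for lo, hi in intervals)
    return (mean_lo, mean_hi), covered / n

# Made-up intervals for a true tau of 0.01 (illustration only).
example = [(0.0090, 0.0110), (0.0105, 0.0120), (0.0080, 0.0101), (0.0095, 0.0115)]
avg, coverage = summarize_intervals(example, 0.01)
print(avg, coverage)
```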

Next, we compared the performance of our method with the method of computing the confidence interval using a simulation estimate of the variance,⁴ using the same simulated data. We have briefly described that method in Section 1. As we had done for our method, we estimated a confidence interval for each of the N = 1,000 repetitions and for each combination of (n, τ, L). (Note that this involved resimulating M = 100 datasets for each estimated ${\hat{τ}}_{MLE}$ and estimating τ from each resimulated dataset, and thus a total of N × M = 10⁵ simulations and estimations.) Then, for each combination of (n, τ, L), we computed the average of the estimated 95% confidence intervals over the N = 1,000 simulated datasets (reported in Table 1). The coverage probabilities were estimated as before, as the number of times the true value was in the estimated interval divided by N. The results are shown in Table 1.

For both CI_Asymp and the simulated CI, the estimated ranges are short. As expected, they become shorter with larger n and larger L. For τ = 0.01 the radii range from 0.0010 (n = 4, L = 10,000, CI_Asymp) to 0.0001 (n = 8, L = 100,000, both methods). For τ = 0.02 they range from 0.0015 (n = 4, L = 10,000, CI_Asymp) to 0.0002 (n = 8, L = 100,000, both methods). For τ = 0.05 they range from 0.0029 (n = 4, L = 10,000, simulated CI) to 0.0004 (n = 8, L = 100,000, CI_Asymp). The radius grows with the true value of τ, although less than proportionally: for n = 4 and L = 10,000, the CI_Asymp radius rises from 0.0010 at τ = 0.01 to 0.0071 at τ = 0.2. For τ = 0.1 and 0.2 the radii range from 0.0041 and 0.0076, respectively (n = 4, L = 10,000, simulated CI), to 0.0007 (n = 8, L = 100,000, both methods) and 0.0016 (n = 8, L = 100,000, CI_Asymp), respectively.

A comparison of the two methods reveals no significant difference in the coverage probabilities or the lengths of the intervals. In Wilcoxon signed-rank tests we found no significant difference between the coverage probabilities of the two methods (P > 0.3) or between the lengths of the intervals (P > 0.9). However, CI_Asymp was an order of magnitude faster to compute than the simulated CI. Once we have ${\hat{τ}}_{MLE}$ and ${\hat{θ}}_{MLE}$ , it takes less than 30 minutes to compute the CI with CI_Asymp using R code (R v3.0.2013-05-12¹⁷) on a 2.54 GHz dual-core processor. Using the same computer and the same version of R, the simulated CI (with at least M = 100 simulations for estimating the variance) takes more than a day to compute. This is because each simulation involves simulating L independent loci and maximizing the multi-locus likelihood over τ and θ using the allele counts from these loci.

Results: Applications to HapMap Data

For the purpose of demonstrating our method and comparing its performance with known results, we applied our method to a subset of HapMap data.¹⁵ Specifically, we estimated a confidence interval for the divergence time between the HCB (Han Chinese from Beijing, China) and CEU (United States residents of northern and western European ancestry) populations using 112 independent SNP loci on Chromosome 19. To reduce the computational load, we only considered a random sub-sample of 8 unrelated haploids from each population. Our data consist of the allele counts among the 8 haploids in each population for 112 SNP loci. The 112 SNPs were selected to be at least 0.5 Mb apart from each other to ensure the independence of the coalescent trees in the lineages of the 8 haploids.

The estimated scaled divergence time was $\hat{τ} = 0.16$ , and the estimated 95% confidence interval was [0.125, 0.195]. These numbers, when transformed into years using an overall effective population size of 3,100 (see, for example)¹⁸ for the HCB and CEU populations and a generation time of 25 years, produce an estimated range of 19,375 to 30,225 years ago. If an overall effective population size of 4,000 is assumed, then this produces an estimated range of 25,000 to 39,000 years ago. Note that these ranges roughly match other recent estimates of the divergence time between HCB and CEU (see, for example).¹⁹
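The conversion from scaled time to years can be reproduced with a short calculation. The relation t = 2 N_e τ generations used below is inferred from the reported figures (it is the scaling that reproduces all four endpoints) and should be read as an assumption of this sketch; the function name is illustrative.

```python
def scaled_to_years(tau, n_e, generation_years):
    """Convert scaled divergence time tau to years, assuming t = 2 * N_e * tau generations."""
    return 2 * n_e * tau * generation_years

interval = (0.125, 0.195)
for n_e in (3_100, 4_000):
    low, high = (scaled_to_years(tau, n_e, 25) for tau in interval)
    print(n_e, round(low), round(high))
```

The strong dependence of the result on the assumed effective population size is visible directly: moving N_e from 3,100 to 4,000 shifts both endpoints by about 29%.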

Discussion

We have presented a formula for computing a divergence time range using the MLE on a coalescent model. As the MLE is asymptotically efficient, our estimated confidence interval has a high degree of accuracy. We have also presented formulae for first and second order mixed derivatives of the likelihood and log-likelihood, which are useful for computing the MLE.

Our simulation study shows that our method produces small confidence intervals with appropriate coverage of 95%. Thus, it reduces the amount of uncertainty regarding the actual divergence time. Although MLE of divergence time has been computed before using our model, an expression for the range of divergence time has never been derived because of the complexity of the expression. By deriving this expression we made it possible for the range of the divergence time to be estimated, and by evaluating its performance we established its usefulness.

The radii of the confidence intervals and the coverage probabilities produced by our method have been shown to be statistically equivalent to those of the simulation-based estimator. However, we found that our method is an order of magnitude faster. This is expected because in a simulation or resampling-based method the data need to be simulated a large number of times, and the divergence time then needs to be re-estimated for each resimulated or resampled dataset.

Certain assumptions are made about the underlying model. The accuracy of the estimated variance and range may be affected if the data do not fit the underlying model. For example, a molecular clock is assumed. That is, the scaled time τ between A and O is assumed to be the same as that between B and O. If the effective population sizes in populations A and B are very different, this will induce a bias in the estimated variance, as well as in the estimated range. However, it is straightforward to extend our method to models without a molecular clock. If a molecular clock is not assumed, then one needs to separately estimate scaled divergence times τ_A and τ_B for populations A and B. It is straightforward, albeit tedious, to modify Eqs. (6–16) for parameters (τ_A, τ_B, θ), rather than (τ, θ). Thus, analogous expressions for the variance and the range of (τ_A, τ_B, θ) can be computed if a molecular clock is not assumed.

Another departure from our model would be from the assumption of biallelic loci. Following the same principle of tracking coalescing lineages back to the MRCA, and then following the allele-types to the present time, one can modify Eqs. 1–5 for more than two alleles. However, for more than two alleles, one needs to replace the symmetric beta-binomial distribution at the MRCA (which arises from assuming a symmetric beta distribution for biallelic allele-frequency spectrum) with its “multiple-count” version: a symmetric Dirichlet-Multinomial distribution (which arises from assuming a symmetric Dirichlet distribution for multiallelic allele-frequency spectrum). Once the likelihood is computed using modified version of Eqs. 1–5, then Eqs. 6–16 can be modified for estimating the variance and the range from the multiallelic likelihood.

We assumed that the effect of mutation between the point of divergence and the present is negligible. This is an appropriate assumption if the two populations are closely related (and consequently the divergence time is small). For large divergence times this assumption is not appropriate, and the effect of mutation may create a difference between the two populations that is larger than expected from an “ignore-mutation” model. As a result, an erroneously larger divergence time may be estimated, which will also induce an upward bias in the estimated variance and the estimated range.

Previous studies have suggested that this model is appropriate for populations within the same species.^5–8 This designation is intended to serve as an upper bound for the divergence time to be modeled. In the absence of a mathematically concrete upper bound on the divergence time for this model, we also follow this convention, and we suggest using our methods for inference on the divergence time between populations within the same species. (A lower bound is not necessary, because the model fits well for small divergence times, where the amount of mutation will be very small.) Moreover, a recent article¹³ introduces a version of the model^4,6 that takes into account the effect of mutation. An appropriate extension of our methods to that model¹³ needs to be derived for estimating the variance and the range of larger divergence times. With such an extension, our methods could be used to estimate the variance and range of species divergence times.

A possible disadvantage of our method arises in a scenario where a significant amount of migration has taken place between the two populations after the divergence. In the presence of migration, there will be less variation between the populations than expected in a “no migration” scenario. Consequently, the divergence time will be underestimated, which will also induce a downward bias in the estimated variance and the estimated range. Another limitation is that our method uses an asymptotic approximation, so it is only applicable when we have a large number of independent loci. To quantify “a large number of loci”, a rule of thumb used by statisticians is that asymptotic methods may be used if there are at least 30 independent data points. Thus, as long as we have allele counts for at least 30 independent loci, our method can be used. As most modern datasets have more than 30 independent loci, this is not a real limitation. Note that, as the effect of mutation after divergence is assumed to be negligible and is ignored, the difference in mutation rate between the two populations does not play a part in our method.
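The asymptotic approximation referred to above can be sketched in its generic one-parameter form: an interval built from the maximum likelihood estimate and the curvature of the log-likelihood at that estimate. The sketch below is an illustration only and does not reproduce Eqs. 6–16. It uses a toy exponential log-likelihood as a stand-in for the coalescent likelihood, and it approximates the second derivative by central finite differences rather than by analytic derivatives. All function names, the step size `h`, and the toy data are assumptions.

```python
from math import log, sqrt

def observed_information(loglik, theta_hat, h=1e-4):
    """Negative second derivative of `loglik` at `theta_hat` (central differences)."""
    second = (loglik(theta_hat + h) - 2.0 * loglik(theta_hat)
              + loglik(theta_hat - h)) / (h * h)
    return -second

def asymptotic_interval(loglik, theta_hat, z=1.96, h=1e-4):
    """Interval theta_hat +/- z * SE, with SE = 1 / sqrt(observed information)."""
    info = observed_information(loglik, theta_hat, h)
    if info <= 0:
        raise ValueError("log-likelihood is not concave at theta_hat")
    se = 1.0 / sqrt(info)
    return theta_hat - z * se, theta_hat + z * se

# Toy stand-in data: 40 independent observations, one per hypothetical locus.
toy_data = [0.5] * 40

def toy_loglik(rate):
    """Exponential log-likelihood summed over independent observations."""
    return len(toy_data) * log(rate) - rate * sum(toy_data)

# The exponential MLE has a closed form: number of observations / their sum.
rate_hat = len(toy_data) / sum(toy_data)
lower, upper = asymptotic_interval(toy_loglik, rate_hat)
```

Because the log-likelihood is a sum over independent loci, the information grows with the number of loci and the interval narrows accordingly, which is why the approximation is only trusted when that number is reasonably large.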

We applied our method to estimate a range for the divergence time between CEU and HCB populations from the HapMap data. Our estimates are found to be in the same range as a recently estimated divergence time between these two populations.¹⁹

A possible extension of this method would be the computation of a confidence interval for the branch lengths of a phylogenetic tree, given the tree topology. This would require an algorithm that could efficiently and systematically compute the derivatives of the likelihood of the tree. Another possible extension would be to use dependent loci to estimate the range of the divergence time, which would have potential applications to high-resolution next-generation sequencing data.

Footnotes

Acknowledgment

The author is grateful for constructive comments by anonymous reviewers.

Author Contributions

Conceived and designed the experiments: AR. Analyzed the data: AR. Wrote the first draft of the manuscript: AR. Contributed to the writing of the manuscript: AR. Agree with manuscript results and conclusions: AR. Jointly developed the structure and arguments for the paper: AR. Made critical revisions and approved final version: AR. The author reviewed and approved of the final manuscript.

Disclosures and Ethics

As a requirement of publication the authors have provided signed confirmation of their compliance with ethical and legal obligations including but not limited to compliance with ICMJE authorship and competing interests guidelines, that the article is neither under consideration for publication nor published elsewhere, of their compliance with legal and ethical guidelines concerning human and animal research participants (if applicable), and that permission has been obtained for reproduction of any copyrighted material. This article was subject to blind, independent, expert peer review. The reviewers reported no competing interests.

References

1. Rona L.D.P., Carvalho-Pinto C.J., Mazzoni C.J., Peixoto A.A. Estimation of divergence time between two sibling species of the Anopheles (Kerteszia) cruzii complex using a multilocus approach. BMC Evolutionary Biology. 2010; 10. DOI: 10.1186/1471-2148-10-91.

2. Wang X., Gowik U., Tang H., Bowers J.E., Westhoff P., Paterson A.H. Comparative genomic analysis of C4 photosynthetic pathway evolution in grasses. Genome Biology. 2009; 10: R68.

3. Langergraber K.E., Prüfer K., Rowney C., et al. Generation times in wild chimpanzees and gorillas suggest earlier divergence times in great ape and human evolution. PNAS. 2012; 109: 15716–21.

4. Nielsen R., Mountain J.L., Huelsenbeck J.P., Slatkin M. Maximum likelihood estimation of population divergence times and population phylogeny in models without mutation. Evolution. 1998; 52: 669–77.

5. RoyChoudhury A. Likelihood inference for population structure, using the coalescent. PhD thesis, University of Washington, 2006.

6. RoyChoudhury A., Felsenstein J., Thompson E.A. A two-stage pruning algorithm for likelihood computation for a population tree. Genetics. 2008; 180: 1095–105.

7. RoyChoudhury A. Composite likelihood-based inferences on genetic data from dependent loci. Journal of Mathematical Biology. 2011; 62: 65–80.

8. RoyChoudhury A., Thompson E.A. Ascertainment correction for a population tree via a pruning algorithm for likelihood computation. Theoretical Population Biology. 2012; 82: 59–65.

9. RoyChoudhury A. Identifiability of a coalescent-based population tree model. arXiv: 1304.3691 [q-bio.PE], 2013.

10. Ewens W.J. Mathematical Population Genetics. Springer, 2004.

11. Degnan J.H., Rosenberg N.A. Gene tree discordance, phylogenetic inference, and the multispecies coalescent. Trends in Ecology and Evolution. 2009; 24: 332–40.

12. Nielsen R., Slatkin M. Likelihood analysis of ongoing gene flow and historical association. Evolution. 2000; 54: 44–50.

13. Bryant D., Bouckaert R., Felsenstein J., Rosenberg N.A., RoyChoudhury A. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol Biol Evol. 2012; 29: 1917–32.

14. Lehmann E.L., Casella G. Theory of Point Estimation. New York: Springer, 1998.

15. The International HapMap Consortium. The International HapMap Project. Nature. 2003; 426: 789–96.

16. Takahata N., Nei M. Gene genealogy and variance of interpopulational nucleotide differences. Genetics. 1985; 110: 325–44.

17. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.

18. Tenesa A., Navarro P., Hayes B.J., et al. Recent human effective population size estimated from linkage disequilibrium. Genome Research. 2007; 17: 520–6.

19. Gravel S., Henn B.M., Gutenkunst R.N., et al., The 1000 Genomes Project, Bustamante C.D. Demographic history and rare allele sharing among human populations. PNAS. 2011; 108: 11983–8.

Approximate Likelihood Estimation of Divergence Time Range Using a Coalescent-based Model
