Abstract
Keywords
Introduction
Gene duplication is an important mechanism for the origin of novelty in evolution.1–3 When a gene is duplicated, one of the duplicate copies usually decays within a few million years due to an accumulation of deleterious mutations. 4 However, duplicates may be retained if they become functionally important to the organism.5–7 It has been suggested that duplicate genes may be able to carry out the original gene function, which means that paralogs may compensate for each other.8,9 Gene knockout/knockdown experiments have been conducted in multiple species to examine the degree of functional redundancy in gene families. The results suggest that the loss of function in genes with paralogs is associated with higher organismal survival than the loss of function in genes without any known paralogs (singletons), supporting the functional compensation hypothesis.10–16 However, Liao and Zhang 17 reported that duplicates rarely compensate for each other in mice, which has been debated.18–22 Overall, experimental data have not yet provided definitive evidence about whether paralogous genes do compensate for each other in most instances.
The predictions of functional compensation can be tested computationally by analyzing the disease-associated genetic variation in humans. These variants are currently experiencing negative selection in the human populations, which means that they constitute data of functional impact in nature. If functional compensation among gene family members is substantial, it is expected that fewer significant statistical associations between variants and disease phenotypes will be detected for proteins in multigene families than for singletons. Using this idea, Dickerson and Robertson 23 tested the predictions of functional compensation and found no difference between the proportion of singletons and paralogs implicated in diseases (2% difference), supporting the conclusions of Liao and Zhang. 17 However, they and others have suggested that recently diverged paralogs are less likely to be disease-associated than singletons and proteins with distantly related paralogs.23–26 These results suggest functional redundancy among young gene duplicates.
However, the abovementioned computational studies have not accounted for many potentially confounding factors. First, disease-associated single nucleotide variants (dSNVs) are found preferentially at slowly evolving amino acid positions 27 ; thus, we expect to observe a higher frequency of dSNVs in more conserved proteins. This could distort comparisons between singletons and multigene family proteins if the distributions of amino acid evolutionary rates are not the same for these two classes. Second, the numbers of dSNVs found in different proteins are not expected to be the same because the numbers of amino acids in proteins vary by an order of magnitude. This means that commonly used metrics, such as the relative fractions of disease and nondisease proteins in different protein classes, are too coarse. Metrics that take into account the number of amino acids in proteins (sequence length) are necessary for more robust hypothesis testing.
In the following section, we tested the hypothesis of functional compensation by considering the abovementioned factors to better understand the genome-wide pattern of functional evolution in gene families, which is vital for understanding genome evolution and predicting disruptive effects of the mutations of proteins that have paralogs.
Results
We obtained a set of 15,485 human proteins and their homologs from 46 diverse species from the UCSC genome browser (see Material and Methods). For each protein, we also obtained a list of paralogs from the HOVERGEN database.
28
Our set of proteins is representative of the whole human gene set because about half (52%) of these proteins have at least one paralog, a fraction that is similar to the overall fraction of proteins with paralogs in the human genome (49% in HOVERGEN database
28
). For each human protein, we computed the average rate of amino acid substitution (number of substitutions per site per billion years) using the interspecific amino acid sequence alignments (see Material and Methods). Figure 1 shows the distributions of evolutionary rates in singleton and multigene family proteins. Overall, singletons are less conserved than multigene family proteins, with a ~20% mean and ~30% median difference (

Distributions of evolutionary rates of singleton (broken line) and multigene family proteins (solid or dotted line). (A) Evolutionary rates are in the units of the number of amino acid substitutions per amino acid site per billion years. The mean and median of these distributions are 1.05 and 0.89, respectively, for singletons, and 0.80 and 0.61, respectively, for proteins in multigene families. These distributions are significantly different (two-sample Kolmogorov–Smirnov test;
dSNVs in singletons and multigene families
We analyzed all available SNVs associated with Mendelian diseases in singleton and multigene family proteins. There were a total of 47,382 dSNVs in 2,589 proteins. In these data, the proportion of proteins with at least one dSNV was slightly lower (2.2%) for singletons than that of proteins with paralogs, which is consistent with the recent reports.23,29 However, the number of dSNVs in proteins varied extensively and was found to be positively correlated with the protein length (

Distributions of the number of dSNVs. (A) A frequency diagram showing the number of proteins with at least one dSNV. (B) The average number of dSNVs per protein for proteins at different length thresholds at 100 amino acids intervals. The average number of dSNVs per protein is positively correlated with the average protein length for both multigene family (correlation = 0.005;
We compared the number of dSNVs per 100 amino acid positions (dSNV density) between multigene family and singleton proteins. Multigene family proteins have 1.6 times higher density of dSNVs than detected in singleton proteins (0.66 and 0.42, respectively). We can statistically reject the null hypothesis of equal dSNV densities in singletons and multigene family proteins (
We tested the influence of outliers on this result by excluding all proteins with >0.5 dSNVs per amino acid. This reduced the number of proteins slightly (131 proteins were excluded), but the ratio of multigene family and singleton protein dSNV densities remained unchanged (1.6;
We also tested if the observed patterns reflect the mutations of specific amino acids (eg, arginine) that comprise a major fraction of the dataset of dSNVs (16%). Arginine codons contain a CpG dinucleotide in the first two positions and are, thus, more prone to transitional mutations, leading to amino acid variation.
30
We computed the dSNV densities using only the arginine positions in proteins and found the dSNV density in multigene family proteins to be 1.5 times greater than observed in singletons (0.09 and 0.06, respectively;
Finally, we looked for the signatures of functional compensation using dSNVs that are expected to be the most severe, with the rationale that functional compensation may be easier to detect, as ameliorating severe phenotypic effects will have greater relative effect on individual fitness. We designated a dSNV to be severe if the predicted functional impact score for the variant was in the top 5% of all dSNVs (see Material and Methods). For these data, the multigene family proteins have a dSNV density 2.3 times higher than that observed for singletons (0.034 and 0.015, respectively;
Relationship of evolutionary conservation and dSNVs
We examined if protein conservation difference between singletons and multigene family proteins can explain the abovementioned pattern because it is now well established that highly conserved proteins are significantly more likely to contain dSNVs.27,31 Because the protein evolutionary rate distributions are neither normal nor symmetrical (Fig. 1), we compared medians (0.61 and 0.89, respectively) and found a ratio of 0.69 between the multigene family and singleton proteins. The inverse of this ratio (1.5) is only slightly different from the ratio of dSNV densities (1.6). This similarity suggests that the higher rate of dSNVs in multigene family proteins is mostly explained by the degree of functional constraint on proteins in multigene families versus singleton proteins. Based on this observation, we propose the evolutionary constraints hypothesis, which posits that the differences in dSNV densities among different classes of proteins (eg, singleton vs. multigene) are primarily a result of the differences in the degree of natural selection acting upon them. If true, this would be consistent with the neutral theory of molecular evolution. 32 Evolutionary constraint hypothesis does not preclude the existence of functional compensation (among other factors) in some proteins or positions, but it does claim that differences in the intensity of purifying selection will be the primary cause of observed differences in the preponderance of SNVs in different groups of proteins.
We tested the prediction of the evolutionary constraint hypothesis in an analysis of 12,952 common neutral SNVs (nSNVs) obtained from the 1000 Genomes Project.
33
These common nSNVs are complementary in nature to dSNVs, as common nSNVs persist in the human population and have risen to moderate frequencies (>5%) because their impact on fitness is effectively neutral (opposite of dSNVs that cause disease). Therefore, if functional constraints and, thus, the conservation level of human protein sequence explain the observed differences in dSNV density, we should also observe fewer nSNVs in multigene family proteins, as these proteins evolve more slowly and are expected to be subject to more severe purifying selection.
34
Indeed, the nSNV density (number of nSNVs per 100 amino acids) in multigene family proteins was lower than that of singletons (ratio = 0.82; 0.13 and 0.16, respectively;
Disease SNV prevalence in proteins with young and old paralogs
Next, we tested the hypothesis that functional compensation is more common in proteins with younger paralogs.23,24 If functional compensation generally occurs only for a brief period after the gene duplication event, then the most recently diverged paralogs will provide the most powerful signal to detect functional compensation. We first identified the closest paralog for each protein within a given gene family by selecting the paralog with the smallest nucleotide divergence in their codons (third positions only). To estimate the relative antiquity of the duplicate event, we used the protein-specific human-mouse third positions in codons to normalize each closest paralog divergence across gene families (see Materials and Methods). This normalized value yields an approximate gene duplication time when it is scaled using the human–mouse divergence time (92.3 million years ago 35 ). This approximation is reasonable, as third positions in codons evolve relatively neutrally and because we use divergence times primarily for identifying and sorting young paralogs for hypothesis testing.
Density of dSNV for duplicates that have diverged from their paralogs in the last 200 million years shows a tendency to increase with estimated duplicate age (Fig. 3A). The same pattern is observed for the positions of arginine and glycine and those with predicted severe functional impacts (Fig. 3B–D). Also, the dSNV densities for the youngest duplicates are lower than those for singletons (triangle in Fig. 3). We found that the evolutionary rate of proteins is negatively correlated with time since duplication, and the youngest duplicates have higher evolutionary rates than singletons (Fig. 4A). These patterns do not support the functional compensation hypothesis 23 and are consistent with our evolutionary constraint hypothesis. These trends are confirmed in the analysis of nSNV densities that showed expected complementary patterns (Fig. 4B).

the dSNV density in duplicates over time. Each point shows the dSNV density of all proteins with duplication age less than or equal to a threshold time (

The average evolutionary rates (A) and nSNV densities (B) of all proteins with duplication age less than or equal to a threshold time (
Disease SNV prevalence in proteins with very similar paralogs
We also tested the functional compensation hypothesis in proteins that show high amino acid sequence similarities with their paralogs, as studied by Hsiao and Vitkup.
24
We found that paralogs with the highest amino acid sequence similarities (>95%) actually have higher dSNV densities than other paralogs (0.98 vs. 0.57;
Next, we compared nSNV densities in paralogs with >95% sequence similarity to those with ≤95% similarity. For this comparison, we needed to be cognizant of the fact that variant calls are difficult when the paralogs have very similar DNA sequences.36–39 This is the case for paralogs with > 95% amino acid sequence similarity because most of these proteins also showed small divergences at the third positions in codons between paralogs (≤0.2 substitutions per site). To accommodate the variant call errors, we used proteins with ≤0.2 distance (third positions) for comparison between paralogs for two groups of proteins (225 and 69 proteins). The nSNV density was 0.30 and 0.52 for proteins that have paralogs with >95% and ≤95% sequence similarity, respectively (
Conclusions
In this article, we examined the functional compensation among paralogs as a general phenomenon through an analysis of disease-associated genetic variation in humans.23–26 In contrast to expectations under the functional compensation hypothesis, we found that multigene families have a greater tendency to harbor dSNVs than singleton proteins. We proposed that differences in functional constraints (evolutionary constraint hypothesis) explain the observed pattern to a large degree. We confirmed that singleton proteins show lower functional constraint than proteins with identifiable duplicates in the genome, which explains the increased detection of disease-associated variation observed in multigene families.
Some recent theoretical and empirical studies suggest that functional compensation can lead to enhanced purifying selection and, therefore, may actually be associated with slower evolutionary rates.14,40 Other studies indicate that the youngest duplicates are evolving under relaxed selection pressures, which would cause an increase in evolutionary rates for a few million years. 4 Such short-term and localized rate changes (faster or slower) will not have significant impact on the estimates of very long-term evolutionary rates that we have used to quantify the functional constraint. We have calculated the evolutionary rates using sequence differences in proteins that have accumulated changes for hundreds of millions of years across major groups of vertebrates. There is no evidence that pervasive functional compensation exists across the phylogenetic breadth and genomic scale reflected in our analyses. We expect our major conclusions to hold true in general, while acknowledging that functional compensation may occur in some multigene families and some amino acid positions. In summary, we suggest that there is a need to fully consider differences in the evolutionary conservation of proteins when studying the patterns of sequence variation and variant–phenotype associations.
Materials and Methods
Data assembly
Nonsynonymous dSNVs were obtained from the Human Gene Mutation Database (HGMD). 41 We used dSNVs associated with Mendelian diseases because they are generally caused by single mutations, which is an appropriate way to test functional compensation, as has been done before.23,24 We excluded all SNVs associated with complex diseases whenever mutation phenotypes indicate complex disease association in the HGMD. Common nSNV data were generated by using the nonsynonymous SNV data obtained from the 1000 Genomes Project. 33 SNVs observed with a frequency >5% were assumed to be neutral (nSNVs). Nucleotide sequences of genes in the human genome and their genomic locations were obtained from the UCSC database (hg19). 42 We used protein family annotations from the HOVERGEN database. 28 RefSeq identifiers from the UCSC browser were matched with UniProt identifiers from the HOVERGEN database using the ID Mapping software. 43 Using this annotation, we made a list of paralogs for each protein. When RefSeq IDs were converted into multiple UniProt IDs, we removed those genes if they were classified into different protein families or if they were mapped to different genomic locations. As a result, we obtained 15,485 human proteins for our analyses or ~77% of human proteins in the genome (~20,000 proteins).
Estimating protein evolutionary rates
Amino acid sequence alignments for 46 diverse vertebrate species were obtained from the UCSC resource 42 to estimate the absolute site-by-site evolutionary rate as previously described. 44 For each protein, the evolutionary rate was the average of rates over all positions, which are expressed in units of substitutions per amino acid per billion years.
Testing the significance of SNV density difference between groups of proteins
To compare SNV densities between two groups (ie, between singletons and multigene family proteins), we used a
Predicting phenotypic severities of dSNVs
For each dSNV, the evolutionary rate of the amino acid position was computed using the alignments of 46 species 44 and the impact score for EvoD prediction 46 was estimated by using myPEG. 46 The top 5% of dSNVs at ultra-conserved, well-conserved, and less-conserved positions were selected (EvoD impact scores of ≥88, ≥88, and ≥82, respectively). These constituted the top 5% of the most highly deleterious predicted alleles.
Identifying the closest paralog
For each protein family, codon alignments of human paralogs were built using the amino acid sequence alignment features of the MUSCLE software 47 in MEGA-CC, 48 which implements a codon alignment pipeline. Default options were used for amino acid sequence alignments (gap opening penalty = −2.9, gap extension penalty = 0, multiplier for gap open/close penalty in hydrophobic regions = 1.2, and maximum length of the diagonal = 24). To identify the closest paralog of each protein, pairwise evolutionary distances using the third positions in codons were computed between paralogs using the MEGA-CC software 48 ; the maximum composite likelihood (MCL) method was used to calculate pairwise evolutionary distances. 49
To exclude the influence of copy number variants on our analyses, we obtained the closest chimpanzee ortholog for each protein from the UCSC database (panTro2) and computed evolutionary distances at the third positions in codons (MCL) as well as among amino acids (
Estimating normalized distances between closest paralogs
For each pair of paralogs, we estimated pairwise evolutionary distance (
Author Contributions
Conceived and designed the experiments: SK. Analyzed the data: SM, ST. Wrote the first draft of the manuscript: SM, SK. Contributed to the writing of the manuscript: ST. Agree with manuscript results and conclusions: SM, ST, SK. Jointly developed the structure and arguments for the paper: SK, SM. Made critical revisions and approved final version: SK, SM. All authors reviewed and approved of the final manuscript.
