TY - JOUR
AU - Gu, Xun
AB - Gene family proliferation by gene (genome) duplication has provided the raw materials for functional innovations (Ohno 1970; Lundin 1993; Holland et al. 1994; Henikoff et al. 1997; Golding and Dean 1998). Several models have been proposed for functional divergence among member genes (e.g., Li 1983; Clark 1994; Hughes 1994; Fryxell 1996; Nei, Gu, and Sitnikova 1997; Force et al. 1999), but the details remain largely unknown. Gu (1999) developed a statistical method for testing type I functional divergence, i.e., changes in protein function between two gene clusters that result in changes in selective constraints (and therefore shifted evolutionary rates) at some residues. It stands in contrast to type II functional divergence, in which changes in protein function between two gene clusters do not alter the level of selective constraints. Amino acid residues with rate shifts are the sites that have either gained or lost importance as a consequence of the change of function during evolution, as proposed by the hypothesis of type I functional divergence (Gu 1999). Moreover, type I functional divergence provides a biological basis for the covarion theory of molecular evolution (Fitch and Markowitz 1970). If a statistical test shows a significant rate difference between two gene clusters, it is of great interest to predict the important amino acid residues, which can be further verified by available functional-structural evidence (Dermitzakis and Clark 2001; Gaucher, Miyamoto, and Benner 2001; Wang and Gu 2001). The posterior probability of each site is suitable for developing a statistically sound profile for selecting critical amino acid residues (Gu 1999), but it provides little information about how much rate difference is generated at these sites after gene duplication. In this article, we report a new site-specific profile for the rate difference, which is useful for studying the pattern of protein sequence evolution.
Consider two gene clusters generated by gene duplication (or speciation); see, e.g., the bone morphogenetic protein (BMP) gene family tree in figure 1. In each cluster, a site can be in either of two states: (1) F0, which means no altered functional constraint after gene duplication, and (2) F1, which means altered functional constraint at this site after gene duplication. As a result, the four combined states for two gene clusters fall into two groups: (1) F0 in both clusters, denoted by S0 = (F0, F0), resulting in no rate difference between clusters; and (2) F1 in at least one cluster, denoted by S1 = (F0, F1), (F1, F0), or (F1, F1), resulting in a rate difference between clusters. S0 and S1 are also called functional divergence configurations (Gu 2001). Let P(S1) = θ and P(S0) = 1 − θ be the probabilities of S1 and S0, respectively; θ is called the coefficient of (type I) functional divergence. Given this notation, the model of Gu (1999) can be briefly described as follows.
• First, at a given site, the number of substitutions, X1 (or X2) = i, in cluster 1 (or 2) follows a Poisson distribution, denoted by p(i|λ), whereas the evolutionary rate (λ = λ1 or λ2) varies among sites according to a gamma distribution φ(λ).
• Second, λ1 and λ2 are independent under S1, whereas λ1 = λ2 under S0. Let X = (X1, X2). Thus, one can show that the joint (conditional) distributions of X1 = i and X2 = j are given by P(X|S1) = Q1(i)Q2(j) and P(X|S0) = K12(i, j), respectively, with
\[Q_{1}(i) = \frac{\Gamma(i+\alpha)}{i!\,\Gamma(\alpha)}\left(\frac{D_{1}}{D_{1}+\alpha}\right)^{i}\left(\frac{\alpha}{D_{1}+\alpha}\right)^{\alpha}\]
\[Q_{2}(j) = \frac{\Gamma(j+\alpha)}{j!\,\Gamma(\alpha)}\left(\frac{D_{2}}{D_{2}+\alpha}\right)^{j}\left(\frac{\alpha}{D_{2}+\alpha}\right)^{\alpha}\]
\[K_{12}(i,j) = \frac{\Gamma(i+j+\alpha)}{i!\,j!\,\Gamma(\alpha)}\left(\frac{D_{1}}{D_{1}+D_{2}+\alpha}\right)^{i}\left(\frac{D_{2}}{D_{1}+D_{2}+\alpha}\right)^{j}\left(\frac{\alpha}{D_{1}+D_{2}+\alpha}\right)^{\alpha} \qquad (1)\]
where D1 and D2 are the total branch lengths in clusters 1 and 2, respectively, and α is the gamma distribution shape parameter (see eqs. 12 and 13 in Gu (1999) for details).
• Third, given the (prior) probabilities P(S1) = θ and P(S0) = 1 − θ, the joint distribution of X1 and X2 can be expressed as:
\[P(X) = (1 - \theta)K_{12} + \theta Q_{1}Q_{2} \qquad (2)\]
and a likelihood function can be built for estimating θ.
When θ = 0, equation (2) reduces to a standard (homogeneous) model for rate variation among sites (e.g., Gu and Zhang 1997).
• Fourth, estimation of θ requires the number of substitutions at each site for both gene clusters (i.e., X1 and X2). As X1 and X2 cannot be directly observed, a conventional solution is to use the minimum number of required changes (m) as an approximation, which can be inferred by parsimony under a known phylogenetic tree (Fitch 1971). However, m is biased because it does not consider the possibility of multiple hits. To solve this problem, Gu and Zhang (1997) developed an algorithm for estimating the expected number of substitutions at each site, using a combination of ancestral sequence inference and maximum likelihood estimation.
• Fifth, the (site-specific) posterior probability of S1, i.e., of a site being related to type I functional divergence, is computed as follows:
\[P(S_{1}|X) = \frac{\theta Q_{1}Q_{2}}{(1 - \theta)K_{12} + \theta Q_{1}Q_{2}} \qquad (3)\]
Obviously, P(S1|X) = 0 when θ = 0, which is consistent with the standard model of rate variation among sites; that model assumes that the evolutionary rate of a site remains constant during evolution, although it varies among sites.
Under the statistical framework described above, we develop here a site-specific measure for rate difference based on the posterior expectation. As λ1 = λ2 = λ under S0, the joint distribution of λ and X = (X1, X2) = (i, j) is given by P(λ, X|S0) = p1(i|λ)p2(j|λ)φ(λ), where p1(i|λ) and p2(j|λ) are the Poisson distributions of substitutions in clusters 1 and 2, respectively. Then, one can show that the conditional density of λ under S0 is given by:
\[P(\lambda|X, S_{0}) = \frac{p_{1}(i|\lambda)\,p_{2}(j|\lambda)\,\phi(\lambda)}{K_{12}(i,j)} \qquad (4)\]
Under S1, the evolutionary rates λ1 and λ2 are independent.
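As a rough illustration of the model up to equation (3), the following Python sketch evaluates the cluster-wise and joint count distributions, the mixture likelihood, and the posterior probability of S1. It assumes the negative-binomial forms that a Poisson count with a gamma-distributed rate yields, Q(i) = Γ(i+α)/(i!Γ(α)) · (D/(D+α))^i · (α/(D+α))^α and the analogous two-cluster K12(i, j). These closed forms, the function names, and the example values are assumptions made for illustration; they are not taken from the author's software.

```python
import math


def log_q(i, d, alpha):
    """log Q(i): count distribution in one cluster (assumed negative binomial).

    i may be non-integer, since expected substitution numbers are used in practice.
    d is the total branch length of the cluster, alpha the gamma shape parameter.
    """
    return (math.lgamma(i + alpha) - math.lgamma(i + 1) - math.lgamma(alpha)
            + i * math.log(d / (d + alpha))
            + alpha * math.log(alpha / (d + alpha)))


def log_k12(i, j, d1, d2, alpha):
    """log K12(i, j): joint count distribution when both clusters share one rate."""
    s = d1 + d2 + alpha
    return (math.lgamma(i + j + alpha) - math.lgamma(i + 1)
            - math.lgamma(j + 1) - math.lgamma(alpha)
            + i * math.log(d1 / s) + j * math.log(d2 / s)
            + alpha * math.log(alpha / s))


def posterior_s1(i, j, d1, d2, alpha, theta):
    """P(S1 | X) = theta*Q1*Q2 / ((1 - theta)*K12 + theta*Q1*Q2)."""
    q = math.exp(log_q(i, d1, alpha) + log_q(j, d2, alpha))
    k = math.exp(log_k12(i, j, d1, d2, alpha))
    return theta * q / ((1.0 - theta) * k + theta * q)


def log_likelihood(theta, counts, d1, d2, alpha):
    """Log-likelihood of theta over sites; counts is a list of (i, j) pairs."""
    total = 0.0
    for i, j in counts:
        q = math.exp(log_q(i, d1, alpha) + log_q(j, d2, alpha))
        k = math.exp(log_k12(i, j, d1, d2, alpha))
        total += math.log((1.0 - theta) * k + theta * q)
    return total


if __name__ == "__main__":
    # Hypothetical inputs: D1 = D2 = 2, alpha = 1, theta = 0.3.
    print(posterior_s1(0, 5, 2.0, 2.0, 1.0, 0.3))  # unbalanced counts
    print(posterior_s1(2, 2, 2.0, 2.0, 1.0, 0.3))  # balanced counts
```

With these hypothetical inputs, a site with unbalanced counts between the clusters receives a higher posterior probability than a site with balanced counts, which is the behaviour the model is designed to capture.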
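Applying Bayes' theorem as in the derivation of equation (4), one can show that the conditional densities of λ1 and λ2 under S1 are given by, respectively,
\[P(\lambda_{1}|X, S_{1}) = \frac{p_{1}(i|\lambda_{1})\,\phi(\lambda_{1})}{Q_{1}(i)}, \qquad P(\lambda_{2}|X, S_{1}) = \frac{p_{2}(j|\lambda_{2})\,\phi(\lambda_{2})}{Q_{2}(j)} \qquad (5)\]
Then, by putting equations (1)–(5) together, we obtain the posterior mean of the rate under S0 or S1 as follows:
\[E[\lambda|X, S_{0}] = \bar{\lambda}\,\frac{i+j+\alpha}{D_{1}+D_{2}+\alpha}, \qquad E[\lambda_{1}|X, S_{1}] = \bar{\lambda}_{1}\,\frac{i+\alpha}{D_{1}+\alpha}, \qquad E[\lambda_{2}|X, S_{1}] = \bar{\lambda}_{2}\,\frac{j+\alpha}{D_{2}+\alpha} \qquad (6)\]
where λ̄, λ̄1, and λ̄2 are the mean rates of λ, λ1, and λ2 over sites, respectively. Let v1 (or v2) be the evolutionary rate in cluster 1 (or 2). Under the current two-state model, we have v1 = λ under S0, and v1 = λ1 under S1. Thus, the expectation of v1 can be expressed as E[v1] = P(S0)λ̄ + P(S1)λ̄1. Therefore, the posterior mean of v1 is given as follows:
\[E[v_{1}|X] = P(S_{0}|X)\,E[\lambda|X, S_{0}] + P(S_{1}|X)\,E[\lambda_{1}|X, S_{1}] \qquad (7)\]
For cluster 2, in the same manner we have:
\[E[v_{2}|X] = P(S_{0}|X)\,E[\lambda|X, S_{0}] + P(S_{1}|X)\,E[\lambda_{2}|X, S_{1}] \qquad (8)\]
Thus, a site-specific profile for rate difference can be defined as E[Δv|X] = E[v1|X] − E[v2|X]. The S0 terms cancel, which gives:
\[E[\Delta v|X] = P(S_{1}|X)\left[\bar{\lambda}_{1}\,\frac{i+\alpha}{D_{1}+\alpha} - \bar{\lambda}_{2}\,\frac{j+\alpha}{D_{2}+\alpha}\right] \qquad (9)\]
As the mean rates over all sites (i.e., λ̄1 and λ̄2) are usually unknown, we have to use the relative rate difference. For example, using λ̄1 as a reference, the relative rate difference is as follows:
\[r_{k} = \frac{E[\Delta v|X]}{\bar{\lambda}_{1}} = P(S_{1}|X)\left[\frac{i+\alpha}{D_{1}+\alpha} - c\,\frac{j+\alpha}{D_{2}+\alpha}\right] \qquad (10)\]
where c = λ̄2/λ̄1. In practice, c can be approximately estimated from the evolutionary distances between the same pair of orthologous genes, i.e., over the same evolutionary time.
We use the vertebrate BMP gene family as an example. Figure 1 shows the phylogenetic tree of two BMP member genes, which was inferred by the neighbor-joining method (Saitou and Nei 1987). Apparently, BMP2 and BMP4 were generated by gene duplication in the early stage of vertebrate evolution. Based on this topology, the coefficient of (type I) functional divergence is estimated to be θ = 0.283 ± 0.067, providing significant evidence for altered functional constraints between BMP2 and BMP4 after gene duplication. The site-specific profile (posterior probability), P(S1|X), scores the relative likelihood that an amino acid residue is involved in functional divergence between BMP2 and BMP4 (fig. 2, panel a).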
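The relative rate difference can be sketched in a few lines of Python. This is a minimal sketch, assuming that the posterior mean rate in a cluster, relative to its mean over sites, takes the gamma-Poisson form (count + α)/(D + α), so that the profile is P(S1|X) · [(i + α)/(D1 + α) − c (j + α)/(D2 + α)]. The function names and the example numbers are hypothetical, and the posterior probability is passed in as an argument rather than recomputed.

```python
def relative_posterior_rate(count, d, alpha):
    """Posterior mean rate of one cluster divided by its mean rate over sites.

    Assumes a gamma prior with shape alpha and a Poisson count whose mean is
    proportional to the total branch length d (an assumption of this sketch).
    """
    return (count + alpha) / (d + alpha)


def relative_rate_difference(i, j, d1, d2, alpha, p_s1, c=1.0):
    """Site-specific relative rate difference, using cluster 1 as the reference.

    p_s1 is the posterior probability P(S1 | X) at the site and c is the ratio
    of mean rates (cluster 2 over cluster 1). A positive value means a higher
    rate in cluster 1; a negative value means a higher rate in cluster 2.
    """
    rate_1 = relative_posterior_rate(i, d1, alpha)
    rate_2 = c * relative_posterior_rate(j, d2, alpha)
    return p_s1 * (rate_1 - rate_2)


if __name__ == "__main__":
    # Hypothetical site: no change in cluster 1, five changes in cluster 2.
    print(relative_rate_difference(0, 5, 2.0, 2.0, 1.0, p_s1=0.75))
    # Hypothetical site with balanced counts: the difference vanishes.
    print(relative_rate_difference(2, 2, 2.0, 2.0, 1.0, p_s1=0.25))
```

Weighting by the posterior probability shrinks the difference toward zero at sites where a rate shift is poorly supported, so large absolute values arise only where both the counts and the posterior point to a shift.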
Among a total of 338 aligned sites, only a few receive high scores; for most sites, the score is only around 0.2. In particular, four sites have scores above 0.7, whereas 12 sites have scores between 0.6 and 0.7. To understand the rate difference between BMP2 and BMP4 at these sites, we computed the site-specific profile of relative rate difference rk (see fig. 2b); a positive value means that the rate of BMP2 is larger than that of BMP4, and a negative value means the reverse. A site with a large rate difference (positive or negative) is expected to have a high posterior probability as given by equation (3). Indeed, figure 2c shows a strong correlation between these two measures.
Gu (2001) has developed a maximum likelihood framework for functional divergence, based on the Markov chain model. Using a similar approach, we can develop a site-specific profile for the rate difference. In this case, the posterior mean of rate difference (see eq. 10) should be expressed as follows:
\[E[\Delta v|X] = P(S_{1}|X)\left\{E[\lambda_{1}|X, S_{1}] - E[\lambda_{2}|X, S_{1}]\right\} \qquad (11)\]
where X = (X1, X2) is the observed amino acid configuration at a site. E[λ1|X, S1] (as well as E[λ2|X, S1]) can be computed under the framework of the Markov chain model (Yang 1997). The problem in practice is the computational time. Fortunately, our preliminary result shows that the performance of equations (10) and (11) is similar (unpublished data).
The methodology we developed (Gu 1999, 2001) provides a new approach for testing the site-specific rate difference after gene duplication or speciation. The current study provides a new site-specific profile for quantitatively measuring how much the functional constraint at a site can change after these evolutionary events. For example, P(S1|X) = 0.93 at site 80, indicating a strong rate shift pattern (type I functional divergence) between BMP2 and BMP4. The relative rate difference at this site (−4.6) indicates a much stronger selective constraint in the BMP2 gene than in the BMP4 gene.
Moreover, given the average evolutionary rate of ∼0.4 × 10⁻⁹/year (estimated from human-mouse orthologs with a split time of 100 MYA), the absolute (posterior) rate difference at this site can be computed as −4.6 × 0.4 × 10⁻⁹ = −1.84 × 10⁻⁹. Indeed, the evolutionary rate at this site is ∼2.03 × 10⁻¹⁰ in BMP2, but ∼2.12 × 10⁻⁹ in BMP4, indicating a ca. 10-fold rate change at site 80 after gene duplication. However, the rate in BMP4 is not higher than the synonymous rate (∼3 × 10⁻⁹, estimated from human-mouse orthologs). Although rough and indirect, this comparison suggests that positive selection may not play an important role at this site. In summary, this measure provides a site-by-site basis for studying the relationship between altered functional constraint and functional-structural assays, e.g., the effect of site-directed mutagenesis or the contribution of 3D structural differences. The functional-structural basis for type I functional divergence has been illustrated by Wang and Gu (2001).
After gene duplication, there are two ways in which a rate difference at a site can arise between duplicate genes: (1) the site becomes more conserved in one gene copy as a consequence of acquired new functions, or (2) it becomes more variable in one gene copy as a consequence of functional relaxation (e.g., via loss of function). The sign (positive or negative) of the site-specific profile, which indicates the trend of change in selective constraint, is useful for understanding the underlying evolutionary mechanism. In this report, we assume that the site-specific rate difference is equivalent to the site-specific altered functional constraint, which is valid as long as the mutation rate is not site specific. Different mutation rates owing to gene locations in the genome have virtually no effect on our analysis (Gu 1999). Because biochemical properties (charge, hydrophobicity, etc.)
of amino acid substitutions are not considered by this simple approach, the results need to be interpreted cautiously in some cases, e.g., when there are many substitutions between amino acids R and K, which are both positively charged. This problem can be addressed by two modifications. First, after a group of residues is selected, a follow-up check based on some empirical rules may be informative. Second, we can improve our model to take this factor into account. For example, we can develop a weight matrix (or substitution matrix) of amino acid substitutions that is specific to each state (F0 or F1).
Using any given measure, it is not difficult to conduct a computational search that outputs a list of amino acid residues, each of which seems more conserved in one cluster than in the other. However, we argue that statistical modeling and prediction are essential for several reasons. First, a simple list of these residues without statistical testing cannot be used as valid evidence to support or reject a scientific hypothesis, e.g., site-specific altered functional constraint after gene duplication. Second, the criterion for residue selection should be statistically sound. Third, for protein family sequences, a phylogeny-based profile is crucial to avoid any bias caused by unequal sequence sampling. For example, consider two gene clusters with an equal number of sequences. Cluster 1 includes closely related sequences, whereas cluster 2 includes distantly related sequences. Any prediction ignoring phylogeny can be misleading because many sites will show identical amino acid patterns in cluster 1. This problem (usually causing a high false-positive rate) becomes serious for a large-scale analysis because visual inspection is not possible. At any rate, a statistically sound analysis is beneficial and cost-effective for functional and evolutionary genomics, as long as it is computationally fast.
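To make the reporting step concrete, the sketch below filters sites by posterior probability, converts the relative rate difference into an absolute one, and reads the direction of the shift from its sign. It reuses the site 80 figures quoted in the text (posterior probability 0.93, relative rate difference −4.6, mean rate ∼0.4 × 10⁻⁹ per year). The second site, the cutoff of 0.7, and all names are hypothetical choices made for illustration.

```python
def summarize_sites(profile, mean_rate, cutoff=0.7):
    """Report sites whose posterior probability reaches the cutoff.

    profile   : list of (site, posterior_probability, relative_rate_difference)
    mean_rate : reference mean rate used to rescale the relative difference
    cutoff    : posterior threshold (0.7 is an illustrative choice only)
    """
    rows = []
    for site, posterior, relative_diff in profile:
        if posterior < cutoff:
            continue
        rows.append({
            "site": site,
            "posterior": posterior,
            "absolute_diff": relative_diff * mean_rate,
            # A negative difference means the rate is lower in cluster 1.
            "lower_rate_in": "cluster 1" if relative_diff < 0 else "cluster 2",
        })
    return rows


if __name__ == "__main__":
    profile = [
        (80, 0.93, -4.6),  # values quoted in the text for site 80
        (12, 0.20, 0.3),   # hypothetical low-scoring site
    ]
    for row in summarize_sites(profile, mean_rate=0.4e-9):
        print(row)
```

Here cluster 1 stands for BMP2 and cluster 2 for BMP4, so the single reported row corresponds to a site that evolves more slowly in BMP2.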
Naruya Saitou, Reviewing Editor
Keywords: protein sequence evolution, gene duplication, functional divergence, posterior rate difference
Address for correspondence and reprints: Xun Gu, Department of Zoology/Genetics, 332 Science II Hall, Iowa State University, Ames, Iowa 50011. xgu@iastate.edu.
Fig. 1. The phylogenetic tree of the BMP gene family, which was inferred by the neighbor-joining method, using amino acid sequences with Poisson distance. Bootstrapping values greater than 50% are presented.
Fig. 2. a, The site-specific profile for predicting critical amino acid sites that are responsible for the (type I) functional divergence, measured by the posterior probability. b, The site-specific profile for rate difference between BMP2 and BMP4. Positive value means the rate of BMP2 is higher than the rate of BMP4, and vice versa. c, The correlation between relative rate difference and posterior probability of (type I) functional divergence-related difference.
This study was supported by NIH grant RO1 GM62118 to X.G. Accession numbers are as follows. BMP-2: AF041421, HUMBMP2A, DSPBMP2, MUSBMP2A, RNBMP2, GGBMP2, XLBMP2, D49971. BMP-4: MUSBMP4, RNBOMPR4A, HUMBMP4, AF042497, GGBMP4, XLBMP4, D49972.
References
Clark A. G., 1994 Invasion and maintenance of a gene duplication Proc. Natl. Acad. Sci. USA 91: 2950-2954
Dermitzakis E., A. Clark, 2001 Non-neutral diversification after duplication in mammalian developmental genes Mol. Biol. Evol. 18: 557-562
Fitch W. M., 1971 Toward defining the course of evolution: minimum change for a specific tree topology Syst. Zool. 20: 406-416
Fitch W. M., E. Markowitz, 1970 An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution Biochem. Genet. 4: 579-593
Force A., M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan, J. Postlethwait, 1999 Preservation of duplicate genes by complementary, degenerative mutations Genetics 151: 1531-1545
Fryxell K. J., 1996 The coevolution of gene family trees Trends Genet. 12: 364-369
Gaucher A., M. Miyamoto, S. Benner, 2001 Function–structure analysis of proteins using covarion-based evolutionary approaches: elongation factors Proc. Natl. Acad. Sci. USA 98: 548-552
Golding G. B., A. M. Dean, 1998 The structural basis of molecular adaptation Mol. Biol. Evol. 15: 355-369
Gu X., 1999 Statistical methods for testing functional divergence after gene duplication Mol. Biol. Evol. 16: 1664-1674
Gu X., 2001 Maximum likelihood approach for gene family evolution under functional divergence Mol. Biol. Evol. 18: 453-464
Gu X., J. Zhang, 1997 A simple method for estimating the parameter of substitution rate variation among sites Mol. Biol. Evol. 14: 1106-1113
Henikoff S., E. A. Greene, S. Pietrokovski, P. Bork, T. K. Attwood, L. Hood, 1997 Gene families: the taxonomy of protein paralogs and chimeras Science 278: 609-614
Holland P. W. H., J. Garcia-Fernandez, N. A. Williams, A. Sidow, 1994 Gene duplication and the origins of vertebrate development Development (Suppl.): 125-133
Hughes A. L., 1994 The evolution of functionally novel proteins after gene duplication Proc. R. Soc. Lond., Ser. B 256: 119-124
Li W. H., 1983 Evolution of duplicated genes In M. Nei and R. K. Koehn, eds. Evolution of genes and proteins. Sinauer Associates, Sunderland, Mass
Lundin L. G., 1993 Evolution of the vertebrate genome as reflected in paralogous chromosomal regions in man and the house mouse Genomics 16: 1-19
Nei M., X. Gu, T. Sitnikova, 1997 Evolution by the birth-and-death process in multigene families of the vertebrate immune systems Proc. Natl. Acad. Sci. USA 94: 7799-7806
Ohno S., 1970 Evolution by gene duplication Springer-Verlag, Berlin
Saitou N., M. Nei, 1987 The neighbor-joining method: a new method for reconstructing phylogenetic trees Mol. Biol. Evol. 4: 406-425
Wang Y., X. Gu, 2001 Functional divergence in caspase gene family and altered functional constraints: statistical analysis and prediction Genetics 158: 1311-1320
Yang Z., 1997 PAML, a program package for phylogenetic analysis by maximum likelihood CABIOS 13: 555-556
TI - A Site-specific Measure for Rate Difference After Gene Duplication or Speciation
JF - Molecular Biology and Evolution
DO - 10.1093/oxfordjournals.molbev.a003780
DA - 2001-12-01
UR - https://www.deepdyve.com/lp/oxford-university-press/a-site-specific-measure-for-rate-difference-after-gene-duplication-or-5EJABzJ4uS
SP - 2327
EP - 2330
VL - 18
IS - 12
DP - DeepDyve
ER -