Molecular Biology and Evolution

Abstract In the genome of Artiodactyla (cow, sheep, pigs, camels, and whales), a major retroposon group originated from a presumable horizontal transfer of BovB, a retrotransposon-like element retroposon, between 52 and 70 million years ago. Since then, BovB retroposons have proliferated and today occupy a quarter of the cow’s genome sequence. The BovB-related short interspersed elements (SINEs) were used for resolving the phylogeny of Bovinae (cows, spiral-horned antelopes, and nilgais) and their relatives. In silico screening of 55,000 intronic retroposon insertions in the cow genome and experimental validation of 126 introns resulted in 29 informative retroposon markers for resolving bovine evolutionary relationships. A transposition-in-transposition analysis identifies three different phases of SINE activity and show how BovB elements have expanded in the cattle genome. Artiodactyla, retroposon, RTE, SINE, Bovinae, Boselaphus, Tragelaphus, Bos Domestic cattle (Bos taurus) are important animals used for food production and transportation. Surprisingly, despite a relatively young age (early Miocene), the evolutionary history of cattle and their allies within the subfamiliy Bovinae is uncertain. Bovinae consist of three groups: the African spiral-horned antelopes (Tragelaphini), the Asian nilgais plus the four-horned antelopes (Boselaphini), and the cow, bison, and buffalo (Bovini). We have used a different strategy, nearly homoplasy-free retroposon markers, to elucidate the relationships within this group. A significant component of mammalian genomes is repetitive DNA sequence fragments. Retroposons can occur as either short interspersed elements (SINEs) or long interspersed elements (LINEs). LINE1 (L1) elements expanded during early placental mammalian evolution and propagate the shorter SINEs (Kramerov and Vassetzky 2005). A special feature of SINEs and LINEs is that once the element has inserted itself in the genome of the germline, the retroposon insertion will appear in all descendants. Only under very rare circumstances, can retroposons be excised (van de Lagemaat et al. 2005; Ray et al. 2006). The rarity of reversals and low probability of homoplasy make them ideal markers for resolving evolutionary relationships (Shedlock and Okada 2000; Shedlock et al. 2004; Ray et al. 2006) and have successfully been used to resolve the phylogeny of several mammalian groups (Shimamura et al. 1999; Nijman et al. 2002; Kriegs et al. 2006; Möller-Krull et al. 2007; Li et al. 2009). A striking feature of the cattle genome is the presence of ∼400,000 copies of the retrotransposon-like element (RTE) LINE element (Youngman et al. 1996; Adelson et al. 2009). So far RTE LINEs have been indentified in all Ruminantia (mouse deers, deers, giraffes, sheeps, and cows) and are named BovB, after their initial discovery in the bovine genome (Szemraj et al. 1995). Surprisingly, relatives to Ruminatia, Suina (pigs and peccaries), Tylopoda (camels), and Cetacea (whales and dolphins) lack RTE elements. Therefore, it has been suggested that BovB has been introduced by horizontal transfer (Zupunski et al. 2001) into the genome of an ancestor to ruminants living 70–52 million years ago (Ma) (Benton et al. 2009). In Artiodactyla, one prominent SINE type is the Cetacea, Hippopotamidae, and Ruminantia (CHR) SINE (Shimamura et al. 1997). After the introduction of BovB into the ancestral genome, new SINEs emerged due to rearrangement between older CHR SINEs and the newly introduced BovB LINE (fig. 1a). Currently, the BovB LINEs and related SINEs occupy ∼22% of the cow genome (Adelson et al. 2009), suggesting an average increase in genome size by retroposition of 0.4% per million years. Fig. 1. View largeDownload slide (a) The relationship inside Ruminantia as resolved by SINE insertion (circle) data and the evolution of the BovB-related SINEs. No insertions were identified for placing the Giraffidae (giraffe and okapi) relative to the other groups. The BovB retroposon enters the genome of the ancestral Ruminantia after the split from whales. The SINE Bov-tA emerges after a recombination between the tRNA-Glu derived SINEs and the first and last part of BovB. The tail region of Bov-tA becomes a dimer, joined by a linker sequence (orange) creating BOV-A2 elements. Shortening of BovB generates the ART2A element. Together these make up 22.8% of the cow genome. Light purple circle indicates CHR SINEs, light brown Bov-tA SINEs, and green BOV-A2. (b) Relative temporal SINE activity in the cow genome. The SINEs are sorted by their activity: most ancestral SINEs are at the bottom and to the left of the figure. The length of each bar indicates the relative temporal length of their activity. Bov-tA2 and ART2A have been active for a very long period, and the CHRs were only briefly active, CHR-1, having the most limited activity. The LINE2- and LINE3-derived MIR SINEs are shared by all placental mammals and are accordingly shown at the bottom. Fig. 1. View largeDownload slide (a) The relationship inside Ruminantia as resolved by SINE insertion (circle) data and the evolution of the BovB-related SINEs. No insertions were identified for placing the Giraffidae (giraffe and okapi) relative to the other groups. The BovB retroposon enters the genome of the ancestral Ruminantia after the split from whales. The SINE Bov-tA emerges after a recombination between the tRNA-Glu derived SINEs and the first and last part of BovB. The tail region of Bov-tA becomes a dimer, joined by a linker sequence (orange) creating BOV-A2 elements. Shortening of BovB generates the ART2A element. Together these make up 22.8% of the cow genome. Light purple circle indicates CHR SINEs, light brown Bov-tA SINEs, and green BOV-A2. (b) Relative temporal SINE activity in the cow genome. The SINEs are sorted by their activity: most ancestral SINEs are at the bottom and to the left of the figure. The length of each bar indicates the relative temporal length of their activity. Bov-tA2 and ART2A have been active for a very long period, and the CHRs were only briefly active, CHR-1, having the most limited activity. The LINE2- and LINE3-derived MIR SINEs are shared by all placental mammals and are accordingly shown at the bottom. In this study, for the first time we report a relative timeframe of ancient retropositional insertions based on an in silico screening of the completely sequenced cattle genome using the transposition-in-transposition (TinT) analysis (Churakov et al. 2010). The TinT analysis detected that an expansion of new artiodactyl-specific Bov-tA, ART2A, and BOV-A2 SINEs occurred after the BovB LINE introduction in the ancestral genome (fig. 1b). The experimental screening indicates that Bov-tA and BOV-A2 originated around the evolutionary split leading to Ruminantia and Pecora. In the genome of the ancestral Cetartiodactyla, the L1-propagated CHR SINEs were active (fig. 1b). This is in accordance with previous studies of the evolution of SINEs in Artiodactyla (Shimamura et al. 1999). In silico screening of the cattle genome (bosTau4) yielded ∼55,000 introns suitable for experimental analysis (see Supplementary Methods, Supplementary Material online). Of these, 126 introns were experimentally evaluated for retroposon insertions, and 27 introns with phylogenetically informative retroposon insertions were identified and sequenced (supplementary tables S1–S4, supplementary data set S1, Supplementary Material online). According to Waddell et al. (2001), the support from retroposon insertions can be calculated as a probability value (P) to support one of three hypotheses that can be constructed around the branch in question in a rooted tree. To complement the retroposon insertion analyses, flanking intronic and exonic sequence data were concatenated (6,950 nt) and used to reconstruct bovid phylogeny (Supplementary Methods, supplementary data set S2, Supplementary Material online). The experimental retroposon insertion screen found six Bov-tA1 insertions ([6 0 0] P = 0.0014) that support the Ruminantia clade and five insertions ([5 0 0] P = 0.0041) supporting Pecora (fig. 1a). Note that the numbers preceding the P value denote presence or absence of conflicting retroposon insertions (Waddell et al. 2001). Despite screening a large number of SINE-containing introns, none were informative for resolving relationships inside Pecora, the largest ruminant infraorder. The experimental screen identified two BOV-A2 elements ([2 0 0] P = 0.1111) that supported the monophyly of the family Bovidae. It is remarkable that only two SINE integrations were discovered given the strong support for the Bovidae branch from sequence analyses in this and other studies (Hassanin and Douzery 2003; Decker et al. 2009). In the younger part of the tree, the Bovinae and the Bovini were well supported by six ([6 0 0] P = 0.0014) or five insertions ([5 0 0] P = 0.0041) each, respectively (fig. 2). One BOV-A2 marker ([1 0 0] P = 0.3333) supported a close relationship between spiral-horned antelopes and Bovini. Two BOV-A2 markers ([2 0 0] P = 0.1111) grouped the two genera Bos and Bison (fig. 2). One marker ([1 0 0] P = 0.3333) each supported the monophyly of Cervidae and Tragelaphini. In intron 24, the African and Asian water buffalo sequence closely resembles the preinsertion state found in sheep and sable antelope (supplementary fig. S1, Supplementary Material online). Marker 24 can be placed either at the Bovinae or Bovini node and, thus, represents a conflicting marker (Bovinae: [5 1 0] P = 0.0185; Bovini: [4 1 0] P = 0.0247). There are three possible scenarios that could explain this finding: 1) incomplete lineage sorting, 2) precise parallel insertion, or 3) exact deletion (Ray et al. 2006). After insertion of a retroposon into the germ line, unequal fixation of alleles in different lineages can result in populations displaying genomic patterns of characters (haplotypes, indels, and retroposon insertions) that do not reflect the genealogic history (Churakov et al. 2009; Nishihara et al. 2009). Fig. 2. View largeDownload slide Support inside Bovidae from retroposon insertion (circle) data. BOV-A2 SINEs are the most common in this part of the tree. Green circles indicate BOV-A2 insertions and light brown Bov-tA retroposons. Fig. 2. View largeDownload slide Support inside Bovidae from retroposon insertion (circle) data. BOV-A2 SINEs are the most common in this part of the tree. Green circles indicate BOV-A2 insertions and light brown Bov-tA retroposons. The age of Bovinae is estimated by molecular data to 19–20 Ma, a timing that is well corroborated by fossil evidence (Hassanin and Douzery 2003; Hernández Fernández and Vrba 2005). The strict geographic distribution of extant Boselaphini (Asia) and Tragelaphini (Africa) should aid in making hypotheses concerning their early evolution straightforward; however, to date nothing is known about their relationship (Bibi et al. 2009). There seems to have been an initial radiation of early Bovinae in Asia, which could explain the distribution of Boselaphini, which is presently found in India and Pakistan (Wilson and Reeder 2005). Although there is plenty of fossil evidence of spiral-horned antelopes from different African regions with younger deposits, the initial colonization route to Africa remains unclear (Bibi et al. 2009). We could identify four BOV-A2 and one Bov-tA insertion supporting the Bovinae, whereas only one BOV-A2 insertion weakly supported ([1 0 0] P = 0.3333) an affiliation of spiral-horned antelopes (Tragelaphini) to Bovini. Typically, three SINE insertions for a node are considered significant support for a branch, if no conflicting insertions are observed (Waddell et al. 2001). However, in phylogenetic studies based on retroposon insertions, under certain circumstances, SINE insertions can be found supporting several different topologies (Churakov et al. 2009; Nishihara et al. 2009). This is usually correlated with short divergence times and speciation-related processes such as incomplete lineage sorting or hybridization (Hallström and Janke 2010). It is likely that conflicting SINE insertions supporting alternative topologies will be identified for the deeper splits in Bovinae. Highlighting the problems to resolve the Bovinae node, the sequence-based phylogenetic analysis of the flanking sequences suggested a sister-group relationship between Boselaphini and Tragelaphini (supplementary figs. S2 and Supplementary Data and supplementary table S5, Supplementary Material online) that is opposite to the results from the retroposon insertion. This is the first large-scale systematic screening of retroposon data for resolving the deeper relationships inside Bovinae. It allowed determining the temporal activity of RTE LINEs and SINEs in the cattle genome. RTE elements are only known from a few mammalian groups and have been highly successful in the cattle genome and their relatives. The retroposon insertion analysis yielded significant support for several nodes such as Bovinae, but the trifurcation among Tragelaphini, Boselaphini, and Bovini remained difficult to resolve by retroposon data. Acknowledgments This work was supported by the Swedish Research Council (623-2009-6994) (M.A.N.), the Nilsson-Ehle Foundation (A.J.), and LOEWE: Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz of Hesse's Ministry. Several persons and institutions donated samples, contributed experimentally, or gave comments: Stefan Hoby, Niklas Svensson, Reepark and Safari, Givskud Zoo, Vernon Hampton, The Golden Buffalo, Bisonfarm Gate, Prof. Stefan Hiendleder, Pontus Linden, Linda Nilsson, and Tim Schikora. Jon Baldur Hlidberg painted the animals. We celebrate that this is the final paper published from the Division of Evolutionary Molecular Systematics at Lund University (Sweden). The Division was founded in 1989 by Ulfur Arnason and produced 125 ISI listed papers, making it a leading research group using chromosomal and mitochondrial data for vertebrate phylogenetic studies. References Adelson DL, Raison JM, Edgar RC. Characterization and distribution of retrotransposons and simple sequence repeats in the bovine genome, Proc Natl Acad Sci U S A. , 2009, vol. 106 (pg. 12855- 12860) Google Scholar CrossRef Search ADS PubMed Benton MJ, Donoghue PCJ, Asher RJ. Hedges SB, Kumar S. Calibration and constraining molecular clocks, The timetree of life , 2009 Oxford Oxford University Press(pg. 35- 86) Bibi F, Bukhsianidze M, Gentry AW, Geraads D, Kostopoulos DS, Vrba ES. The fossil record and evolution of Bovidae: state of the field Palaeontol, Electronica , 2009, vol. 12 pg. 10A Churakov G, Grundmann N, Kuritzin A, Brosius J, Makałowski W, Schmitz J. A novel web-based TinT application and the chronology of the Primate Alu retroposon activity, BMC Evol Biol. , 2010, vol. 10 pg. 376 Google Scholar CrossRef Search ADS PubMed Churakov G, Kriegs JO, Baertsch R, Zemann A, Brosius J, Schmitz J. Mosaic retroposon insertion patterns in placental mammals, Genome Res. , 2009, vol. 19 (pg. 868- 875) Google Scholar CrossRef Search ADS PubMed Decker JE, Pires JC, Conant GC, et al. , (30 co-authors). Resolving the evolution of extant and extinct ruminants with high-throughput phylogenomics, Proc Natl Acad Sci U S A. , 2009, vol. 106 (pg. 18644- 18649) Google Scholar CrossRef Search ADS PubMed Hallström M, Janke A. Mammalian evolution may not be strictly bifurcating, Mol Biol Evol. , 2010, vol. 27 (pg. 2804- 2816) Google Scholar CrossRef Search ADS PubMed Hassanin A, Douzery EJ. Molecular and morphological phylogenies of ruminantia and the alternative position of the moschidae, Syst Biol. , 2003, vol. 52 (pg. 206- 228) Google Scholar CrossRef Search ADS PubMed Hernández Fernández M, Vrba ES. A complete estimate of the phylogenetic relationships in Ruminantia: a dated species-level supertree of the extant ruminants, Biol Rev Camb Philos Soc. , 2005, vol. 80 (pg. 269- 302) Google Scholar CrossRef Search ADS PubMed Kramerov DA, Vassetzky NS. Short retroposons in eukaryotic genomes, Int Rev Cytol. , 2005, vol. 247 (pg. 165- 221) Google Scholar CrossRef Search ADS PubMed Kriegs JO, Churakov G, Kiefmann M, Jordan U, Brosius J, Schmitz J. Retroposed elements as archives for the evolutionary history of placental mammals, PLoS Biol. , 2006, vol. 4 pg. e91 Google Scholar CrossRef Search ADS PubMed Li J, Han K, Xing J, Kim HS, Rogers J, Ryder OA, Disotell T, Yue B, Batzer MA. Phylogeny of the macaques (Cercopithecidae: Macaca) based on Alu elements, Gene , 2009, vol. 448 (pg. 242- 249) Google Scholar CrossRef Search ADS PubMed Möller-Krull M, Delsuc F, Churakov G, Marker C, Superina M, Brosius J, Douzery EJ, Schmitz J. Retroposed elements and their flanking regions resolve the evolutionary history of xenarthran mammals (armadillos, anteaters, and sloths), Mol Biol Evol. , 2007, vol. 24 (pg. 2573- 2582) Google Scholar CrossRef Search ADS PubMed Nijman IJ, van Tessel P, Lenstra JA. SINE retrotransposition during the evolution of the Pecoran ruminants, J Mol Evol. , 2002, vol. 54 (pg. 9- 16) Google Scholar CrossRef Search ADS PubMed Nishihara H, Maruyama S, Okada N. Retroposon analysis and recent geological data suggest near-simultaneous divergence of the three superorders of mammals, Proc Natl Acad Sci U S A. , 2009, vol. 106 (pg. 5235- 5240) Google Scholar CrossRef Search ADS PubMed Ray DA, Xing J, Salem AH, Batzer MA. SINEs of a nearly perfect character, Syst Biol. , 2006, vol. 55 (pg. 928- 935) Google Scholar CrossRef Search ADS PubMed Shedlock AM, Okada N. SINE insertions: powerful tools for molecular systematics, Bioessays , 2000, vol. 22 (pg. 148- 160) Google Scholar CrossRef Search ADS PubMed Shedlock AM, Takahashi K, Okada N. SINEs of speciation: tracking lineages with retroposons, Trends Ecol Evol. , 2004, vol. 19 (pg. 545- 553) Google Scholar CrossRef Search ADS PubMed Shimamura M, Abe H, Nikaido M, Ohshima K, Okada N. Genealogy of families of SINEs in cetaceans and artiodactyls: the presence of a huge superfamily of tRNA(Glu)-derived families of SINEs, Mol Biol Evol. , 1999, vol. 16 (pg. 1046- 1060) Google Scholar CrossRef Search ADS PubMed Shimamura M, Yasue H, Ohshima K, Abe H, Kato H, Kishiro T, Goto M, Munechika I, Okada N. Molecular evidence from retroposons that whales form a clade within even-toed ungulates, Nature , 1997, vol. 388 (pg. 666- 670) Google Scholar CrossRef Search ADS PubMed Szemraj J, Płucienniczak G, Jaworski J, Płucienniczak A. Bovine Alu-like sequences mediate transposition of a new site-specific retroelement, Gene , 1995, vol. 152 (pg. 261- 264) Google Scholar CrossRef Search ADS PubMed van de Lagemaat LN, Gagnier L, Medstrand P, Mager DL. Genomic deletions and precise removal of transposable elements mediated by short identical DNA segments in primates, Genome Res. , 2005, vol. 15 (pg. 1243- 1249) Google Scholar CrossRef Search ADS PubMed Waddell PJ, Kishino H, Ota R. A phylogenetic foundation for comparative mammalian genomics, Genome Inform Ser Workshop Genome Inform. , 2001, vol. 12 (pg. 141- 154) Wilson DE, Reeder DM. , Mammal species of the world: a taxonomic and geographic reference , 2005 3rd ed Baltimore (MD) Johns Hopkins University Press Youngman S, van Luenen HG, Plasterk RH. Rte-1, a retrotransposon-like element in Caenorhabditis elegans, FEBS Lett. , 1996, vol. 380 (pg. 1- 7) Google Scholar CrossRef Search ADS PubMed Zupunski V, Gubensek F, Kordis D. Evolutionary dynamics and evolutionary history in the RTE clade of non-LTR retrotransposons, Mol Biol Evol. , 2001, vol. 18 (pg. 1849- 1863) Google Scholar CrossRef Search ADS PubMed © The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

journal article

LitStream Collection

Looking for Darwin in Genomic Sequences—Validity and Success of Statistical Methods

Zhai, Weiwei;Nielsen, Rasmus;Goldman, Nick;Yang, Ziheng

2012 Molecular Biology and Evolution

doi: 10.1093/molbev/mss104pmid: 22490825

Abstract The use of codon substitution models to compare synonymous and nonsynonymous substitution rates is a widely used approach to detecting positive Darwinian selection affecting protein evolution. However, in several recent papers, Hughes and colleagues claim that codon-based likelihood-ratio tests (LRTs) are logically flawed as they lack prior hypotheses and fail to accommodate random fluctuations in synonymous and nonsynonymous substitutions Friedman and Hughes (2007) also used site-based LRTs to analyze 605 gene families consisting of human and mouse paralogues. They found that the outcome of the tests was largely determined by irrelevant factors such as the GC content at the third codon positions and the synonymous rate dS, but not by the nonsynonymous rate dN or the dN/dS ratio, factors that should be related to selection. Here, we reanalyze those data. Contra Friedman and Hughes, we found that the test results are related to sequence length and the average dN/dS ratio. We examine the criticisms of Hughes and suggest that they are based on misunderstandings of the codon models and on statistical errors. Our analyses suggest that codon-based tests are useful tools for comparative analysis of genomic data sets. codon model, Darwinian selection, likelihood-ratio test Introduction It has long been realized that synonymous and nonsynonymous substitution rates can be compared to infer the direction and strength of natural selection acting on the protein (Kimura 1977). A number of methods were developed to estimate the numbers of synonymous and nonsynonymous substitutions per site (dS and dN) between two sequences (for review, see Yang 2006a). Applications of those methods to real data painted a picture of protein evolution dominated by purifying selection eliminating deleterious nonsynonymous mutations, with dN < dS. However, such pairwise comparisons lack power to detect positive selection as they average rates over all codons in the gene and over the whole time period separating the two sequences. Simultaneous comparison of many sequences using codon models (Goldman and Yang 1994; Muse and Gaut 1994) allows more sophisticated likelihood ratio tests (LRTs) to be constructed (Anisimova and Kosiol 2009). To improve power, the models focus on particular branches on the tree or individual codons in the gene, such as the “branch models” (Yang 1998), the “site models” (Nielsen and Yang 1998; Yang et al. 2000; Kosakovsky Pond and Muse 2005; Massingham and Goldman 2005; Rubinstein et al. 2011), and the “branch-site models” (Yang and Nielsen 2002; Yang et al. 2005). In several recent papers, Hughes and colleagues (Hughes 2007; Friedman and Hughes 2007; Hughes and Friedman 2008; Hughes 2012) raised a number of philosophical criticisms on the use of codon models to detect positive selection on the protein. They claim that the site-based tests lack prior hypothesis and fail to account for random fluctuations of substitutions in the gene. We examine those criticisms below. Additionally, Friedman and Hughes (2007), referred to later as “FH07,” applied site-based LRTs (see table 1) to analyze human and mouse paralogous genes and found that the outcome of the LRTs was largely determined by the sequence length, the GC content at the third codon position, and dS, factors that should have little to do with selection on the protein, but not by dN or the dN/dS ratio, factors that should reflect the action of selection. Those results are surprising, and if true, may cast doubts on the performance of the codon-based LRTs. Table 1. Site Models Implemented in Different Versions of the CODEML Program. Model code p Parameters Software References M0 (one-ratio) 1 ω M1 (neutral) 1 p0, p1 PAML 3.13 Nielsen and Yang (1998) ω0 = 0, ω1 = 1 M2 (selection) 3 p0, p1, p2 PAML 3.13 Nielsen and Yang (1998) ω0 = 0, ω1 = 1, ω2 M1a (neutral) 2 p0, p1 PAML 3.14 or later Wong et al. (2004) and Yang et al. (2005) ω0 < 1, ω1 = 1 M2a (selection) 4 p0, p1, p2 PAML 3.14 or later Wong et al. (2004) and Yang et al. (2005) ω0 < 1, ω1 = 1, ω2 > 1 M7 (beta) 2 p, q PAML 3.13 or later Yang et al. (2000) M8a (beta and ω = 1) 3 p0,p, q, ωs = 1 PAML 3.13 or later Swanson et al. (2003) and Wong et al. (2004) M8 (beta and ω) 4 p0, p, q, ωs > 1 PAML 3.13 or later Yang et al. (2000) and Yang et al. (2005) Model code p Parameters Software References M0 (one-ratio) 1 ω M1 (neutral) 1 p0, p1 PAML 3.13 Nielsen and Yang (1998) ω0 = 0, ω1 = 1 M2 (selection) 3 p0, p1, p2 PAML 3.13 Nielsen and Yang (1998) ω0 = 0, ω1 = 1, ω2 M1a (neutral) 2 p0, p1 PAML 3.14 or later Wong et al. (2004) and Yang et al. (2005) ω0 < 1, ω1 = 1 M2a (selection) 4 p0, p1, p2 PAML 3.14 or later Wong et al. (2004) and Yang et al. (2005) ω0 < 1, ω1 = 1, ω2 > 1 M7 (beta) 2 p, q PAML 3.13 or later Yang et al. (2000) M8a (beta and ω = 1) 3 p0,p, q, ωs = 1 PAML 3.13 or later Swanson et al. (2003) and Wong et al. (2004) M8 (beta and ω) 4 p0, p, q, ωs > 1 PAML 3.13 or later Yang et al. (2000) and Yang et al. (2005) NOTE.—p is the number of free parameters. Three LRTs are constructed to compare the following pairs of models: M1–M2, M1a–M2a, and M7–M8. For each test, positive selection is inferred if 2Δℓ>5.99, ˆ >1, and pˆ2=1−pˆ0−pˆ1>0. A fourth LRT compares M8a against M8, with the 1:1 mixture of 0 and χ12 used to conduct the test so that positive selection is inferred if 2Δℓ>2.71. The differences between the two PAML versions are as follows: (1) In M1 and M2 (PAML 3.13, released in 2002), ω0 = 0, while M1a and M2a (versions 3.14, September 2004, or later) have 0 < ω0 < 1. (2) In version 3.13, only the naïve empirical Bayes (NEB) method (Nielsen and Yang 1998) is available, while version 3.14 added the BEB method (Yang et al. 2005). (3) In version 3.13, ω2 in M2 and ωs in M8 (beta and ω) are estimated over the range (0, ∞). In version 3.14 or later, they are estimated over the range (1, ∞). View Large Reanalysis of the Data of FH07 We reanalyzed the 605 gene families of FH07, with the results summarized in table 2. FH07 found 329 (54.5%) alignments in which at least one of the M1–M2 and M7–M8 tests (see table 1) detected positive selection at the 5% level. This proportion is very high and does not seem convincing, as suggested by FH07. In contrast, the corresponding figure from our reanalysis using the same models in PAML 3.13 was only 63 (10%). Our efforts to reproduce the results of FH07 have not been successful. The large differences are not due to different versions of the PAML software and are unlikely to be due to numerical problems. We suspect that FH07 failed to apply the correct criterion in declaring positive selection by the tests (see table 1). At any rate the criticisms of FH07 on the codon models are based on incorrect results and are unfounded. Table 2. Numbers of Genes Showing Significant (at the 5% level) Evidence of Positive Selection by Different Tests. M7/M8 Not Significant M7/M8 Significant FH07 (From Friedman and Hughes 2007) M1/M2 not significant 275 (45.5%) 120 (19.9%) M1/M2 significant 102 (16.9%) 107 (17.7%) PAML 3.13 M1/M2 not significant 542 (89.6%) 56 (9.3%) M1/M2 significant 1 (0.17%) 6 (1.0%) PAML 4.4 M1a/M2a not significant 541 (89.4%) 46 (7.6%) M1a/M2a significant 0 (0%) 18 (3.0%) PAML 4.4a M1a/M2a not significant 542 (89.6%) 45 (7.4%) M1a/M2a significant 0 (0%) 18 (3.0%) M7/M8 Not Significant M7/M8 Significant FH07 (From Friedman and Hughes 2007) M1/M2 not significant 275 (45.5%) 120 (19.9%) M1/M2 significant 102 (16.9%) 107 (17.7%) PAML 3.13 M1/M2 not significant 542 (89.6%) 56 (9.3%) M1/M2 significant 1 (0.17%) 6 (1.0%) PAML 4.4 M1a/M2a not significant 541 (89.4%) 46 (7.6%) M1a/M2a significant 0 (0%) 18 (3.0%) PAML 4.4a M1a/M2a not significant 542 (89.6%) 45 (7.4%) M1a/M2a significant 0 (0%) 18 (3.0%) NOTE.—aThe M8a–M8 comparison was used instead of the M7–M8 comparison. View Large We also analyzed the data of FH07 using the M1a–M2a and the M8a–M8 tests using PAML version 4.4 (table 1). The M1a–M2a test identified 18 gene alignments with significant signal of positive selection, compared with 7 by the M1–M2 test (table 2). The M7–M8 test identified 64 genes to be under positive selection while the M8a–M8 test identified 63, both including as a subset the 18 genes identified by the M1a–M2a test. As noted previously (Wong et al. 2004), the M1a–M2a test tends to be more stringent than the M7–M8 test. In 14 gene alignments, M2a identified at least one codon with the Bayes empirical Bayes (BEB) posterior probability for positive selection, P > 95%. For M8, this number is 31, including the 14 cases from M2a. For those data, the results for PAML versions 3.13 and 4.4 are very similar, and both are very different from those of FH07. Following FH07, we examined the impact of various factors on the outcome of the LRTs. The results are summarized in supplementary table S1 (Supplementary Material online). FH07 found that the test outcome was significantly correlated with sequence length, GC3 and mean synonymous substitution rate (d¯S), but not with mean nonsynonymous substitution rate (d¯N) or mean dN/dS ratio (ω¯). In contrast, we found that the test outcome is significantly influenced by the sequence length (P < 0.05, Kruskal–Wallis test), as in FH07, and by the mean dN/dS ratio ω¯ (P < 0.05, Kruskal–Wallis test), but not with GC3, d¯S, or d¯N. The impact of sequence length is as expected, as the test tends to be more powerful in larger data sets (Anisimova et al. 2001). The correlation with ω¯ is not hard to explain since on average high dN/dS ratios may indicate positive selection driving amino acid changes, although a strong correlation is not automatically expected. The LRTs considered here accommodate variable ω ratios across codons so that the test may be significant even if most codons in the gene are under strong purifying selection while a few codons experience accelerated nonsynonymous substitutions driven by positive selection without elevating substantially the average ratio ω¯. As in FH07, we examined whether the set of genes showing evidence of positive selection tend to have certain tree topologies (supplementary table S2, Supplementary Material online). Like FH07, we found no strong and clear association between the outcome of the LRTs and the tree topology. However, the results from the two studies are in all other aspects quite different. Do the LRTs Based on the Site Models Lack a Prior Hypothesis? Hughes (2007) criticized the site-based LRTs (Nielsen and Yang 1998; Yang et al. 2000) and the parsimony method of Suzuki and Gojobori (1999) for lacking prior biological hypotheses about which codons are potentially under positive selection, likening them to an undirected “fishing expedition” (Hughes and Friedman 2008). Hughes values the approach taken by Hughes and Nei (1988) to analyze the Major Histocompatibility Complex (MHC) genes, where the overdominance theory of Doherty and Zinkernagel (1975) and the availability of a crystal structure (Bjorkman et al. 1987) led to the hypothesis that the antigen recognition site (ARS) may be the target of positive selection. This hypothesis was confirmed when Hughes and Nei showed that the ARS domain experienced accelerated nonsynonymous substitutions. Hughes (2007) criticized the site-based methods for lacking such a prior biological hypothesis. Nevertheless, the LRTs based on the site models are based on a priori well-specified statistical hypotheses, derived from our biological knowledge of the effects of natural selection. For example, the null hypothesis M1a (neutral) assumes two site classes with 0 < ω0 < 1 and ω1 = 1, but does not allow for sites with ω > 1. The alternative hypothesis M2a (selection) adds another site class with ω2 > 1. Strong preference for M2a over M1a will be statistical evidence for the presence of codons at which nonsynonymous mutations have elevated fixation probabilities relative to synonymous mutations, the hallmark of positive selection driving amino acid changes. To test whether there exist amino acid residues under positive selection, one does not have to know where such residues may be. Such hypothesis testing need not be based on knowledge of the mechanisms of adaptive evolution. Nevertheless, such information, if available, can be used in the test (Yang and Swanson 2002). In most cases, we lack prior knowledge of where in each gene positive selection may be acting. LRTs can still be validly used to identify genes showing previously unsuspected signals of selection—“fishing expeditions” may actually find and catch fish. Given a significant test result, statistical methods exist to identify sites under positive selection (e.g., Yang et al. 2005). Indeed, this is a major strength of those methods: by narrowing down amino acid changes that must be examined in the laboratory using site-directed mutagenesis, those methods may provide useful guidance for experimental work. Numerous examples now exist in which the results of such statistical analysis were validated by functional assays in the laboratory (e.g., Bielawski et al. 2004; Sawyer et al. 2005; Deng et al. 2010; Moury and Simon 2011). Even when structural information is available, it is rarely precise enough to pinpoint codons under positive selection, and the site-based model may still be very useful. This appears to be precisely the case with the MHC, the exemplar of positive selection, as demonstrated by Yang and Swanson (2002). Does Random Fluctuation of Synonymous and Nonsynonymous Substitutions in a Codon Invalidate the Site-Based LRTs? Hughes (2007; see also Hughes and Friedman 2005, 2008) claims that tests based on identifying codons with dN/dS > 1 are logically flawed because “all molecular data sets, even in the absence of positive selection, are likely to include some codons with ω > 1.” We suggest that such an argument confuses statistics with parameters. A parameter is a fixed constant that characterizes the population, while a statistic is a summary or estimate from the data and fluctuates among data sets. Part of the confusion may have resulted from the practice of using dN/dS to refer to both the parameter (ω) and its estimate ( ωˆ or dN/dS). Statistical hypotheses underlying the LRTs are formulated using parameters (the true unobserved ω ratios), while the dN/dS (or ωˆ) ratios that Hughes and Friedman (2005, 2008) calculated from the data are statistics (Yang 2006b). The error is most apparent in the calculation of Hughes and Friedman (2008: equations 1–6) of the probability of observing a codon with no synonymous differences and one or more nonsynonymous differences, as if the LRT would be significant in every data set containing such codons. The authors are correct to claim that such codons can occur by chance even if selection is purifying; they are wrong to believe that in every such data set the LRT will be significant. One can easily design a simulation experiment in which such codons occur in nearly every data set so that the argument of Hughes and Friedman would suggest the false positive rate of the LRT to be ∼100%, whereas in reality it is <5% (Supplementary Material online). Even if the estimated ω ratio is infinity, if the estimate is based on very few changes the LRT will not be significant. Dealing with random fluctuations in the data is indeed the essence and purpose of a statistical test. The Impact of Model Violation A further criticism we seek to lay to rest is that “LRTs are invalid if neither of the two models compared is true.” Criticisms in this area are old and have been answered before. As the statistician Box (1979) stated, “Models are never true, but fortunately it is only necessary that they be useful. For this it is usually needful only that they not be grossly wrong.” Although model violation is a concern, a number of simulation studies have examined the statistical properties (such as robustness) of codon-based tests, including the site-based tests (Anisimova et al. 2001, 2002; Wong et al. 2004), the branch-site test (Yang et al. 2005; Zhang et al. 2005; Yang and dos Reis 2011), and the BEB identification of sites (Anisimova et al. 2002). For example, the site-based tests were found to be quite robust to misspecification of the functional form used to model variable ωs among sites (Anisimova et al. 2002; Wong et al. 2004), to the tree topology, and to among-site variability in other aspects of the substitution process (such as the transition/transversion rate ratio, codon usage, and dS) (Bao et al. 2008). They are nevertheless sensitive to excessive recombinations (Anisimova et al. 2003; Shriner et al. 2003; Wilson and McVean 2006) and to sequence and alignment errors (Schneider et al. 2009; Fletcher and Yang 2010; Markova-Raina and Petrov 2011; Jordan and Goldman 2012). Furthermore, substantial progress has been made in improving codon models in the past 10–15 years (Rubinstein et al. 2011; Cannarozzi and Schneider 2012). In summary, statistical tests throughout the biological sciences are not perfect, do not have perfect power and zero error, and cases can always be found where they mislead or fail. Nevertheless, they give us a means to progress and learn. Furthermore, those tests are constantly being improved upon, to become more powerful and robust. Materials and Methods The alignments of FH07 for 605 gene families with two human and two mouse paralogues were retrieved from the Molecular Phylogenetics and Evolution Web site. Alignment gaps were removed in FH07. The phylogeny was easily inferred and was either tree I: (H1, M1) − (H2, M2) or tree II: (H1, H2) − (M1, M2), with gene duplication either before or after species split, respectively. Following FH07, we applied two LRTs comparing site models M1 (neutral) against M2 (selection), and M7 (beta) against M8 (beta and ω). FH07 appears to have used the CODEML program in PAML version 3.13, and we used the same version. In addition, we applied the LRTs comparing the modified models M1a (neutral) against M2a (selection), and M8a against M8, using the current PAML version 4.4. We also followed FH07 to examine possible relationships between the LRT results and five features of the gene alignment: the sequence length (the number of codons), mean GC content at third codon positions (GC3), mean dS (d¯S), mean dN (d¯N), and mean dN/dS (ω¯), all calculated by averaging over all pairwise comparisons. References Anisimova M, Bielawski JP, Yang Z. The accuracy and power of likelihood ratio tests to detect positive selection at amino acid sites, Mol Biol Evol. , 2001, vol. 18 (pg. 1585- 1592) Google Scholar CrossRef Search ADS PubMed Anisimova M, Bielawski JP, Yang Z. Accuracy and power of Bayes prediction of amino acid sites under positive selection, Mol Biol Evol. , 2002, vol. 19 (pg. 950- 958) Google Scholar CrossRef Search ADS PubMed Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models, Mol Biol Evol. , 2009, vol. 26 (pg. 255- 271) Google Scholar CrossRef Search ADS PubMed Anisimova M, Nielsen R, Yang Z. Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites, Genetics , 2003, vol. 164 (pg. 1229- 1236) Google Scholar PubMed Bao L, Gu H, Dunn KA, Bielawski JP. Likelihood-based clustering (LiBaC) for codon models, a method for grouping sites according to similarities in the underlying process of evolution, Mol Biol Evol. , 2008, vol. 25 (pg. 1995- 2007) Google Scholar CrossRef Search ADS PubMed Bielawski JP, Dunn KA, Sabehi G, Beja O. Darwinian adaptation of proteorhodopsin to different light intensities in the marine environment, Proc Natl Acad Sci U S A. , 2004, vol. 101 (pg. 14824- 14829) Google Scholar CrossRef Search ADS PubMed Bjorkman PJ, Saper SA, Samraoui B, Bennet WS, Strominger JL, Wiley DC. Structure of the class I histocompatibility antigen, HLA-A2, Nature , 1987, vol. 329 (pg. 506- 512) Google Scholar CrossRef Search ADS PubMed Box GEP. Some problems of statistics and everyday life, J Am Stat Assoc. , 1979, vol. 74 (pg. 1- 4) Google Scholar CrossRef Search ADS Cannarozzi G, Schneider A. , Codon evolution: mechanisms and models , 2012 New York Oxford University Press Deng C, Cheng C-HC, Ye H, He X, Chen L. Evolution of an antifreeze protein by neofunctionalization under escape from adaptive conflict, Proc Natl Acad Sci U S A. , 2010, vol. 107 (pg. 21593- 21598) Google Scholar CrossRef Search ADS PubMed Doherty PC, Zinkernagel RM. A biological role for the major histocompatibility antigens, Lancet , 1975, vol. 1 (pg. 1406- 1409) Google Scholar CrossRef Search ADS PubMed Fletcher W, Yang Z. The effect of insertions, deletions and alignment errors on the branch-site test of positive selection, Mol Biol Evol. , 2010, vol. 27 (pg. 2257- 2267) Google Scholar CrossRef Search ADS PubMed Friedman R, Hughes AL. Likelihood-ratio tests for positive selection of human and mouse duplicate genes reveal nonconservative and anomalous properties of widely used methods, Mol Phylogenet Evol. , 2007, vol. 42 (pg. 388- 393) Google Scholar CrossRef Search ADS PubMed Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences, Mol Biol Evol. , 1994, vol. 11 (pg. 725- 736) Google Scholar PubMed Hughes AL. Looking for Darwin in all the wrong places: the misguided quest for positive selection at the nucleotide sequence level, Heredity , 2007, vol. 99 (pg. 364- 373) Google Scholar CrossRef Search ADS PubMed Hughes AL. Evolution of adaptive phenotypic traits without positive Darwinian selection, Heredity. , 2012, vol. 108 (pg. 347- 353) Google Scholar CrossRef Search ADS PubMed Hughes AL, Friedman R. Variation in the pattern of synonymous and nonsynonymous difference between two fungal genomes, Mol Biol Evol. , 2005, vol. 22 (pg. 1320- 1324) Google Scholar CrossRef Search ADS PubMed Hughes AL, Friedman R. Codon-based tests of positive selection, branch lengths, and the evolution of mammalian immune system genes, Immunogenetics , 2008, vol. 60 (pg. 495- 506) Google Scholar CrossRef Search ADS PubMed Hughes AL, Nei M. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection, Nature , 1988, vol. 335 (pg. 167- 170) Google Scholar CrossRef Search ADS PubMed Jordan G, Goldman N. The effects of alignment error and alignment filtering on the sitewise detection of positive selection, Mol Biol Evol. , 2012, vol. 29 (pg. 1125- 1139) Google Scholar CrossRef Search ADS PubMed Kimura M. Prepondence of synonymous changes as evidence for the neutral theory of molecular evolution, Nature , 1977, vol. 267 (pg. 275- 276) Google Scholar CrossRef Search ADS PubMed Kosakovsky Pond SL, Muse SV. Site-to-site variation of synonymous substitution rates, Mol Biol Evol. , 2005, vol. 22 (pg. 2375- 2385) Google Scholar CrossRef Search ADS PubMed Markova-Raina P, Petrov D. High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes, Genome Res. , 2011, vol. 21 (pg. 863- 874) Google Scholar CrossRef Search ADS PubMed Massingham T, Goldman N. Detecting amino acid sites under positive selection and purifying selection, Genetics , 2005, vol. 169 (pg. 1753- 1762) Google Scholar CrossRef Search ADS PubMed Moury B, Simon V. dN/dS-based methods detect positive selection linked to trade-offs between different fitness traits in the coat protein of potato virus Y, Mol Biol Evol. , 2011, vol. 28 (pg. 2707- 2717) Google Scholar CrossRef Search ADS PubMed Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome, Mol Biol Evol. , 1994, vol. 11 (pg. 715- 724) Google Scholar PubMed Nielsen R, Yang Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene, Genetics , 1998, vol. 148 (pg. 929- 936) Google Scholar PubMed Rubinstein ND, Mayrose I, Doron-Faigenboim A, Pupko T. Evolutionary models accounting for layers of selection in protein coding genes and their impact on the inference of positive selection, Mol Biol Evol. , 2011, vol. 28 (pg. 3297- 3308) Google Scholar CrossRef Search ADS PubMed Sawyer SL, Wu LI, Emerman M, Malik HS. Positive selection of primate TRIM5α identifies a critical species-specific retroviral restriction domain, Proc Natl Acad Sci U S A. , 2005, vol. 102 (pg. 2832- 2837) Google Scholar CrossRef Search ADS PubMed Schneider A, Souvorov A, Sabath N, Landan G, Gonnet GH, Graur D. Estimates of positive Darwinian selection are inflated by errors in sequencing, annotation, and alignment, Genome Biol Evol. , 2009, vol. 1 (pg. 114- 118) Google Scholar CrossRef Search ADS PubMed Shriner D, Nickle DC, Jensen MA, Mullins JI. Potential impact of recombination on sitewise approaches for detecting positive natural selection, Genet Res. , 2003, vol. 81 (pg. 115- 121) Google Scholar CrossRef Search ADS PubMed Suzuki Y, Gojobori T. A method for detecting positive selection at single amino acid sites, Mol Biol Evol. , 1999, vol. 16 (pg. 1315- 1328) Google Scholar CrossRef Search ADS PubMed Swanson WJ, Nielsen R, Yang Q. Pervasive adaptive evolution in mammalian fertilization proteins, Mol Biol Evol. , 2003, vol. 20 (pg. 18- 20) Google Scholar CrossRef Search ADS PubMed Wilson DJ, McVean G. Estimating diversifying selection and functional constraint in the presence of recombination, Genetics , 2006, vol. 172 (pg. 1411- 1425) Google Scholar CrossRef Search ADS PubMed Wong WSW, Yang Z, Goldman N, Nielsen R. Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites, Genetics , 2004, vol. 168 (pg. 1041- 1051) Google Scholar CrossRef Search ADS PubMed Yang Z. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution, Mol Biol Evol. , 1998, vol. 15 (pg. 568- 573) Google Scholar CrossRef Search ADS PubMed Yang Z. , Computational molecular evolution , 2006a Oxford Oxford University Press Yang Z. On the varied pattern of evolution of two fungal genomes: a critique of Hughes and Friedman, Mol Biol Evol. , 2006b, vol. 23 (pg. 2279- 2282) Google Scholar CrossRef Search ADS Yang Z, dos Reis M. Statistical properties of the branch-site test of positive selection, Mol Biol Evol. , 2011, vol. 28 (pg. 1217- 1228) Google Scholar CrossRef Search ADS PubMed Yang Z, Nielsen R. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages, Mol Biol Evol. , 2002, vol. 19 (pg. 908- 917) Google Scholar CrossRef Search ADS PubMed Yang Z, Nielsen R, Goldman N, Pedersen A-MK. Codon-substitution models for heterogeneous selection pressure at amino acid sites, Genetics , 2000, vol. 155 (pg. 431- 449) Google Scholar PubMed Zhang J, Nielsen R, Yang Z. Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level, Mol Biol Evol. , 2005, vol. 22 (pg. 2472- 2479) Google Scholar CrossRef Search ADS PubMed Yang Z, Swanson WJ. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes, Mol Biol Evol. , 2002, vol. 19 (pg. 49- 57) Google Scholar CrossRef Search ADS PubMed Yang Z, Wong WSW, Nielsen R. Bayes empirical Bayes inference of amino acid sites under positive selection, Mol Biol Evol. , 2005, vol. 22 (pg. 1107- 1118) Google Scholar CrossRef Search ADS PubMed © The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

journal article

LitStream Collection

Genome-Wide Analyses of Recombination Suggest That Giardia intestinalis Assemblages Represent Different Species

Xu, Feifei;Jerlström-Hultqvist, Jon;Andersson, Jan O.

2012 Molecular Biology and Evolution

doi: 10.1093/molbev/mss107pmid: 22474166

Abstract Giardia intestinalis is a major cause of waterborne enteric disease in humans. The species is divided into eight assemblages suggested to represent separate Giardia species based on host specificities and the genetic divergence of marker genes. We have investigated whether genome-wide recombination occurs between assemblages using the three available G. intestinalis genomes. First, the relative nonsynonymous substitution rates of the homologs were compared for 4,009 positional homologs. The vast majority of these comparisons indicate genetic isolation without interassemblage recombinations. Only a region of 6 kbp suggests genetic exchange between assemblages A and E, followed by gene conversion events. Second, recombination-detecting software fails to identify within-gene recombination between the different assemblages for most of the homologs. Our results indicate very low frequency of recombination between the syntenic core genes, suggesting that G. intestinalis assemblages are genetically isolated lineages and thus should be viewed as separated Giardia species. diplomonads, protists, sex, substitution rate The intestinal parasite Giardia intestinalis (syn. G. lamblia and G. duodenalis) belongs to diplomonads, and the morphological G. intestinalis species is divided into eight assemblages based on genetic differences and host specificities (Monis et al. 2009). Diplomonads have been viewed as primitive asexual eukaryotes. However, G. intestinalis encodes proteins associated with meiotic sex (Ramesh et al. 2005). Genetic exchanges have been reported both within and between assemblages based on studies of a few loci (Cooper et al. 2007; Teodorovic et al. 2007; Caccio and Ryan 2008; Lasek-Nesselquist et al. 2009), which could be the outcome of sexual life cycle (Logsdon 2008; Birky 2010; Andersson 2012). The different assemblages should be treated as the same biological species if they exchange genetic material on a genomic scale, whereas frequent gene flow within, but not between, assemblages would suggest separate species (de Queiroz 2007). Here, we have analyzed interassemblage recombination using the available genome sequences from a pig (P15—assemblage E) and two human (WB—assemblage A and GS—assemblage B) G. intestinalis isolates (Morrison et al. 2007; Franzén et al. 2009; Jerlström-Hultqvist et al. 2010). There are 4,557 shared genes between the three genomes, of which 4,189 are full-length genes in syntenic positions. Genes frame-shifted or truncated in any of the three draft genomes were excluded (Jerlström-Hultqvist et al. 2010). We also excluded homologs with length differences >10% between any two alignments (supplementary fig. S1Supplementary Data, Supplementary Material online) because these can cause artifacts in downstream analyses. This left us with 4,009 positional homologs. These were almost certainly present in the common ancestor of the three isolates and therefore are suitable for studies of interassemblage recombination. We first analyzed putative recombination events involving whole genes by studying nonsynonymous (Ka) and synonymous (Ks) substitution rates and then recombination within genes using three different recombination detection methods. All homologs were aligned on the protein level using MUSCLE (version 3.6) (Edgar 2004). These alignments were used as guides for nucleotide alignments. We calculated the pairwise Ka and Ks values using the yn00 program from the PAML package (version 4.2a) (Yang 1997) with default settings (table 1). The WB-P15 comparison shows the lowest Ka and Ks values, in agreement with the expected relationship between the three assemblages (Monis et al. 2009). Table 1. Ka and Ks Values for 4,009 Positional Homologs. Homologs Compared P15:WB P15:GS GS:WB Ka 0.060 ± 0.125 0.141 ± 0.198 0.139 ± 0.162 Ks 0.436 ± 0.142 1.239 ± 3.044 1.234 ± 2.170 Homologs Compared P15:WB P15:GS GS:WB Ka 0.060 ± 0.125 0.141 ± 0.198 0.139 ± 0.162 Ks 0.436 ± 0.142 1.239 ± 3.044 1.234 ± 2.170 View Large To investigate the patterns underlying the Ka and Ks values across the positional homologs, we plotted normalized values for all positional homologs in triangle plots (fig. 1A and supplementary fig. S1Supplementary Data and Supplementary Data, Supplementary Material online). The normalized Ks values indicated problems with saturation because the three values could not be explained by an evolutionary tree for many of the homologs (supplementary fig. S1Supplementary Data, Supplementary Material online). This is likely due to the high divergence of the GS sequences compared with the other two; the Ks values of the P15-WB comparison do not indicate saturation (table 1). The normalized Ka values, on the other hand, show no sign of saturation (fig. 1A and supplementary fig. S1Supplementary Data, Supplementary Material online). The strong concentration near the mean indicates that the majority of the genes share very similar evolutionary histories. The spread of the normalized substitution rates is a simple measure of the relative levels of recombination (supplementary fig. S1Supplementary Data, Supplementary Material online). The median spread is 0.058, which can be compared with 0.19 previously estimated for a nonrecombining Rickettsia population, and 0.34 for Neisseria which undergoes rampant intraspecies recombination (Klasson et al. 2009). Thus, the frequency of interassemblage recombination appears low for G. intestinalis. FIG. 1. View largeDownload slide Triangle plot of Ka values for 4,009 positional homologs in three Giardia intestinalis genomes. (A) Each dot represents a positional homolog, showing the proportions of its three pairwise Ka values. Each axis represents one of the three pairwise Ka values after normalizing them so that the sum of each triplet is 1.0. Tick marks on the axes indicate the direction of the coordinate lines. The mean of Ka proportion values is marked on each axis with a black dot and within the plot with a bigger black dot. Blue and red dots indicate the 112 strongest candidates, whereas the five red candidates are part of the eight positional homologs (with the rest three marked in yellow) in the region studied in figure 2. The inner triangle is marked with a black dotted line. (B) Distribution of the 112 recombination candidates from the Ka analysis on the WB genome. The positions of the WB candidates along the WB genome were plotted using a step size of 2000 and a window size of 20 × 2000. x axis represents the position on WB genome in million base pairs, and y axis shows candidate bases ratio as bases within candidate genes in a window against the window size. Peaks indicate clusters of candidate genes in a same region. The single peak over the red line, which is the 65% of the maximum ratio, corresponds to the region with four candidate genes described in figure 2. The plots were created using R (http://www.R-project.org, last accessed 2012 Apr 10). FIG. 1. View largeDownload slide Triangle plot of Ka values for 4,009 positional homologs in three Giardia intestinalis genomes. (A) Each dot represents a positional homolog, showing the proportions of its three pairwise Ka values. Each axis represents one of the three pairwise Ka values after normalizing them so that the sum of each triplet is 1.0. Tick marks on the axes indicate the direction of the coordinate lines. The mean of Ka proportion values is marked on each axis with a black dot and within the plot with a bigger black dot. Blue and red dots indicate the 112 strongest candidates, whereas the five red candidates are part of the eight positional homologs (with the rest three marked in yellow) in the region studied in figure 2. The inner triangle is marked with a black dotted line. (B) Distribution of the 112 recombination candidates from the Ka analysis on the WB genome. The positions of the WB candidates along the WB genome were plotted using a step size of 2000 and a window size of 20 × 2000. x axis represents the position on WB genome in million base pairs, and y axis shows candidate bases ratio as bases within candidate genes in a window against the window size. Peaks indicate clusters of candidate genes in a same region. The single peak over the red line, which is the 65% of the maximum ratio, corresponds to the region with four candidate genes described in figure 2. The plots were created using R (http://www.R-project.org, last accessed 2012 Apr 10). We extracted the 112 genes that were the strongest candidates for recombination based on their distances to the mean (fig. 1A and supplementary fig. S1Supplementary Data, Supplementary Material online). The largest pairwise Ka value is less than 0.08 for 100 of the candidates, and the average length of these is 193 amino acids (supplementary table S1, Supplementary Material online). If these were true recombinants between GS and any of the two other genomes, we would expect the recombination pair to have the smallest Ks value. Yet, the P15-WB comparison shows the lowest Ks values, as expected in the absence of recombination involving GS, for all except three of these 100 conserved candidate genes (supplementary table S1, Supplementary Material online). From these observations, we conclude that the majority of these short conserved genes are probably falsely scored as recombination candidates due to the stochastic occurrence of a few substitutions. The 12 more divergent genes are stronger candidates for recombination. A single recombination event may affect a set of genes in the G. intestinalis genome, which has a coding density of 82% (Jerlström-Hultqvist et al. 2010). Such an event would result in a cluster of candidate genes on the G. intestinalis chromosome. To test this, we plotted the candidates along the WB genome (fig. 1B). Only one of the observed clusters contained divergent recombination candidates; four of these genes were found in a region of five genes coding for conserved hypothetical proteins (supplementary table S1, Supplementary Material online). There are two copies of this syntenic region in the P15 genome (P15_1 and P15_2; fig. 2A). This is unlikely to be an artifact of the assembly. The two copies are similar on the nucleotide level but not identical, and the average coverage of both copies was around 35× in the assembly as expected from a single-copy region (Jerlström-Hultqvist et al. 2010). The assembly into two copies in P15 was confirmed by polymerase chain reaction (PCR) amplification and sequencing of the products. Using the recombination detection software GENECONV (Sawyer 1989) in combination with phylogenetic methods, we could detect six segments within the five-gene region with three alternative topologies indicating different phylogenetic histories (fig. 2). There is a long syntenic segment of more than three genes for which WB and P15_2 are closely related with high sequence similarity (fig. 2C). This clearly indicates an exogenous origin of this part of the P15_2 in the P15 genome. On both sides of this segment and at the very end of the region, the two P15 copies are more closely related to each other than any are to WB or GS (fig. 2B, D, and G). This topology suggests duplication or gene conversion within the P15 genome. There are two short segments toward the end of the region that show WB most similar to alternative P15 copies (fig. 2E and F), again suggesting an exogenous origin of P15 genomic sequences. Based on these analyses, we hypothesize that the P15_2 copy was introduced from an assemblage A isolate, then gene conversions and recombinations occurred between the two copies in the P15 lineage (fig. 2). FIG. 2. View largeDownload slide Comparison of the 6 kbp region that shows indication of recombination. (A) Comparison of a five-gene genomic region from the three G. intestinalis genomes including four recombination candidate genes (indicated as arrows filled with light gray). The synteny plot was plotted using R package genoPlotR (Guy et al. 2010). Two copies of the region were found in the P15 genome. The flanking regions of P15_1 and WB are shown, whereas the region was found in its own contig in GS and P15_2. One gene is frameshifted in GS. Gene IDs for the GS genome are displayed at the top, whereas gene IDs for the other three regions are omitted for better visualization. Similarity between two neighboring regions is represented in different shades of gray; darker colors indicate more similar sequences. The protein sequences were aligned with MUSCLE (version 3.6) (Edgar 2004) and fed into GENECONV (Sawyer 1989) to identify the potential gene conversions. Detected gene conversion regions were marked as thick lines using the colors of the corresponding genome. GS, P15_1, WB, and P15_2 are in orange, green, blue, and pink, respectively. For example, a 2 kbp region marked blue in P15_2 indicates a putative gene conversion with WB in which it is marked pink. Yellow implies putative gene conversion from sources outside of the data set. Sixteen gene conversion breakpoints were identified. RAxML (version 7.2.8) (Stamatakis 2006) was used to create phylogenetic trees for these. The regions could then be merged into six segments sharing different phylogenetic histories, indicated by black bars on the genome and at the bottom of the synteny plot. (B–G) Maximum likelihood trees of the six segments. P15_1 and P15_2 are shortened as P1 and P2, respectively, in the trees. The PROTCATLGF model was used within RAxML, except for alignments shorter than 150 amino acids for which the PROTCATLG model was used. FIG. 2. View largeDownload slide Comparison of the 6 kbp region that shows indication of recombination. (A) Comparison of a five-gene genomic region from the three G. intestinalis genomes including four recombination candidate genes (indicated as arrows filled with light gray). The synteny plot was plotted using R package genoPlotR (Guy et al. 2010). Two copies of the region were found in the P15 genome. The flanking regions of P15_1 and WB are shown, whereas the region was found in its own contig in GS and P15_2. One gene is frameshifted in GS. Gene IDs for the GS genome are displayed at the top, whereas gene IDs for the other three regions are omitted for better visualization. Similarity between two neighboring regions is represented in different shades of gray; darker colors indicate more similar sequences. The protein sequences were aligned with MUSCLE (version 3.6) (Edgar 2004) and fed into GENECONV (Sawyer 1989) to identify the potential gene conversions. Detected gene conversion regions were marked as thick lines using the colors of the corresponding genome. GS, P15_1, WB, and P15_2 are in orange, green, blue, and pink, respectively. For example, a 2 kbp region marked blue in P15_2 indicates a putative gene conversion with WB in which it is marked pink. Yellow implies putative gene conversion from sources outside of the data set. Sixteen gene conversion breakpoints were identified. RAxML (version 7.2.8) (Stamatakis 2006) was used to create phylogenetic trees for these. The regions could then be merged into six segments sharing different phylogenetic histories, indicated by black bars on the genome and at the bottom of the synteny plot. (B–G) Maximum likelihood trees of the six segments. P15_1 and P15_2 are shortened as P1 and P2, respectively, in the trees. The PROTCATLGF model was used within RAxML, except for alignments shorter than 150 amino acids for which the PROTCATLG model was used. Among the strongest candidates, we found only one multigene region with recombination signal, which indicates there is low frequency of large-scale recombination in the genome. However, the substitution rate analyses would not necessarily pick up small-scale recombinations covering only partial genes. To test for this type of recombination, we applied Sawyer's (Sawyer 1989), maximum χ2 (Smith 1992; Bruen et al. 2006), and RDP (Martin and Rybicki 2000; Hao 2010) recombination detection methods. These methods were chosen because they can be applied on large-scale analyses of amino acid alignments of three sequences and have been reviewed as having fairly good performance (Posada and Crandall 2001; Posada 2002). Recombination detection software is developed to test single data sets, and running 4,009 tests creates a multiple testing problem. Both true and false positives are expected among the homologs with P values below the significance level. Homologs reported significant by all three methods should be stronger recombination candidates; we detected only 98 of these (supplementary table S2 and Supplementary Data, Supplementary Material online). Actually, the majority of the genes are reported unaffected by recombination in all three tests (supplementary fig. S3, Supplementary Material online). This is in agreement with the conclusion from the substitution rate analyses that most G. intestinalis syntenic genes are unaffected by recombination between the three genomes belonging to different assemblages. Indeed, only a single clear case (fig. 2) could be detected among 4,009 syntenic homologs tested using triangle plots of substitution rates and recombination detection methods on the currently available genomic data. Our results contradict some earlier studies, which concluded that interassemblage recombination is frequent within G. intestinalis. Assemblage A-type haplotypes have been found in assemblage B isolates, and the authors argued that this suggested the existence of a sexual life cycle that enables interassemblage genetic exchange (Teodorovic et al. 2007; Lasek-Nesselquist et al. 2009). There are several plausible explanations for the inconsistency with our results. First, PCR contamination could easily produce results that indicate recombination. Indeed, several of the reported genetic exchanges in one of the previous studies involved exchange between the WB and GS isolates (Teodorovic et al. 2007); our study did not confirm these results, neither were the WB-like sequences detected in the GS genome (Franzén et al. 2009). Nevertheless, some most likely indicate true presence of haplotypes of putative recombinatory origin (Teodorovic et al. 2007; Lasek-Nesselquist et al. 2009). These could represent genetic material transiently present in the cell, perhaps introduced by viruses or plasmids, and located in nonsyntenic regions. If these do not recombine into the syntenic positions of the genome, they would not be detected in our analyses. There exists many different species concepts based on various criteria (de Queiroz 2007). Phenotypic species definitions have been widely used for microbial eukaryotes based on morphological characteristics, and this is the basis for the G. intestinalis species (Monis et al. 2009). It has been recognized that this species has very broad host ranges and genetic divergences. Consequently, the different assemblages have been argued to represent separate phylogenetic species (Monis et al. 2009). There are strong indications that G. intestinalis have some sort of sexual or parasexual life cycle (Ramesh et al. 2005; Logsdon 2008; Birky 2010; Andersson 2012). This suggests that a biological species concept could be applied in which presence of interbreeding (recombination) would indicate the same species and absence thereof different species (de Queiroz 2007). Population genetic data have in fact shown intraassemblage recombination (Cooper et al. 2007; Teodorovic et al. 2007; Caccio and Ryan 2008; Lasek-Nesselquist et al. 2009). Here, we show that the vast majority of the syntenic genes from different assemblages have undergone tree-like evolution without exchange of genetic material. More genome sequences are urgently needed to get a more detailed picture of the recombination process within the genus. For example, genomes from multiple isolates from different assemblages would enable simultaneous studies of inter- and intraassemblage recombination on a large scale. That is needed to clearly identify biological species boundaries within the genus. Nevertheless, our results from the three currently available genomes clearly indicate that the Giardia sexual or parasexual life cycle (Andersson 2012) is limited to within-assemblage genetic exchange. Together, these results support the suggestion (Monis et al. 2009) that the G. intestinalis assemblages should be viewed as different species. We thank Oscar Franzén (Karolinska Institutet) for sharing homolog data prior to publication, Lionel Guy for sharing R scripts used to generate triangle plots, and Staffan Svärd for discussions. This work was supported by a grant from The Swedish Research Council Formas (2010-899). References Andersson JO. Double peaks reveal rare diplomonad sex, Trends Parasitol. , 2012, vol. 28 (pg. 46- 52) Google Scholar CrossRef Search ADS PubMed Birky CWJr. Giardia sex? Yes, but how and how much?, Trends Parasitol. , 2010, vol. 26 (pg. 70- 74) Google Scholar CrossRef Search ADS PubMed Bruen TC, Philippe H, Bryant D. A simple and robust statistical test for detecting the presence of recombination, Genetics , 2006, vol. 172 (pg. 2665- 2681) Google Scholar CrossRef Search ADS PubMed Caccio SM, Ryan U. Molecular epidemiology of giardiasis, Mol Biochem Parasitol. , 2008, vol. 160 (pg. 75- 80) Google Scholar CrossRef Search ADS PubMed Cooper MA, Adam RD, Worobey M, Sterling CR. Population genetics provides evidence for recombination in Giardia, Curr Biol. , 2007, vol. 17 (pg. 1984- 1988) Google Scholar CrossRef Search ADS PubMed de Queiroz K. Species concepts and species delimitation, Syst Biol. , 2007, vol. 56 (pg. 879- 886) Google Scholar CrossRef Search ADS PubMed Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res. , 2004, vol. 32 (pg. 1792- 1797) Google Scholar CrossRef Search ADS PubMed Franzén O, Jerlström-Hultqvist J, Castro E, Sherwood E, Ankarklev J, Reiner D, Palm D, Andersson JO, Andersson B, Svärd S. Draft genome sequencing of Giardia intestinalis assemblage B isolate GS: are human giardiasis caused by two different species?, PLoS Pathog. , 2009, vol. 5 8pg. e1000560 Google Scholar CrossRef Search ADS PubMed Guy L, Kultima JR, Andersson SGE. genoPlotR: comparative gene and genome visualization in R, Bioinformatics , 2010, vol. 26 (pg. 2334- 2335) Google Scholar CrossRef Search ADS PubMed Hao W. OrgConv: detection of gene conversion using consensus sequences and its application in plant mitochondrial and chloroplast homologs, BMC Bioinformatics , 2010, vol. 11 pg. 114 Google Scholar CrossRef Search ADS PubMed Jerlström-Hultqvist J, Franzén O, Ankarklev J, Xu F, Nohynkova E, Andersson JO, Svärd SG, Andersson B. Genome analysis and comparative genomics of a Giardia intestinalis assemblage E isolate, BMC Genomics , 2010, vol. 11 pg. 543 Google Scholar CrossRef Search ADS PubMed Klasson L, Westberg J, Sapountzis P, et al. (12 co-authors) The mosaic genome structure of the Wolbachia wRi strain infecting Drosophila simulans, Proc Natl Acad Sci U S A. , 2009, vol. 106 (pg. 5725- 5730) Google Scholar CrossRef Search ADS PubMed Lasek-Nesselquist E, Welch DM, Thompson RC, Steuart RF, Sogin ML. Genetic exchange within and between assemblages of Giardia duodenalis, J Eukaryot Microbiol. , 2009, vol. 56 (pg. 504- 518) Google Scholar CrossRef Search ADS PubMed Logsdon JMJr. Evolutionary genetics: sex happens in Giardia, Curr Biol. , 2008, vol. 18 (pg. R66- R68) Google Scholar CrossRef Search ADS PubMed Martin D, Rybicki E. RDP: detection of recombination amongst aligned sequences, Bioinformatics , 2000, vol. 16 (pg. 562- 563) Google Scholar CrossRef Search ADS PubMed Monis PT, Caccio SM, Thompson RC. Variation in Giardia: towards a taxonomic revision of the genus, Trends Parasitol. , 2009, vol. 25 (pg. 93- 100) Google Scholar CrossRef Search ADS PubMed Morrison HG, McArthur AG, Gillin FD, et al. (29 co-authors) Genomic minimalism in the early diverging intestinal parasite Giardia lamblia, Science , 2007, vol. 317 (pg. 1921- 1926) Google Scholar CrossRef Search ADS PubMed Posada D. Evaluation of methods for detecting recombination from DNA sequences: empirical data, Mol Biol Evol. , 2002, vol. 19 (pg. 708- 717) Google Scholar CrossRef Search ADS PubMed Posada D, Crandall KA. Evaluation of methods for detecting recombination from DNA sequences: computer simulations, Proc Natl Acad Sci U S A. , 2001, vol. 98 (pg. 13757- 13762) Google Scholar CrossRef Search ADS PubMed Ramesh MA, Malik SB, Logsdon JMJr. A phylogenomic inventory of meiotic genes; evidence for sex in Giardia and an early eukaryotic origin of meiosis, Curr Biol. , 2005, vol. 15 (pg. 185- 191) Google Scholar PubMed Sawyer S. Statistical tests for detecting gene conversion, Mol Biol Evol. , 1989, vol. 6 (pg. 526- 538) Google Scholar PubMed Smith JM. Analyzing the mosaic structure of genes, J Mol Evol. , 1992, vol. 34 (pg. 126- 129) Google Scholar PubMed Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models, Bioinformatics , 2006, vol. 22 (pg. 2688- 2690) Google Scholar CrossRef Search ADS PubMed Teodorovic S, Braverman JM, Elmendorf HG. Unusually low levels of genetic variation among Giardia lamblia isolates, Eukaryot Cell. , 2007, vol. 6 (pg. 1421- 1430) Google Scholar CrossRef Search ADS PubMed Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood, Comput Appl Biosci. , 1997, vol. 13 (pg. 555- 556) Google Scholar PubMed © The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

journal article

LitStream Collection

A Universal Method for the Study of CR1 Retroposons in Nonmodel Bird Genomes

Suh, Alexander;Kriegs, Jan Ole;Donnellan, Stephen;Brosius, Jürgen;Schmitz, Jürgen

2012 Molecular Biology and Evolution

doi: 10.1093/molbev/mss124pmid: 22522308

Abstract Presence/absence patterns of retroposon insertions at orthologous genomic loci constitute straightforward markers for phylogenetic or population genetic studies. In birds, the convenient identification and utility of these markers has so far been mainly restricted to the lineages leading to model birds (i.e., chicken and zebra finch). We present an easy-to-use, rapid, and cost-effective method for the experimental isolation of chicken repeat 1 (CR1) insertions from virtually any bird genome and potentially nonavian genomes. The application of our method to the little grebe genome yielded insertions belonging to new CR1 subfamilies that are scattered all across the phylogenetic tree of avian CR1s. Furthermore, presence/absence analysis of these insertions provides the first retroposon evidence grouping flamingos + grebes as Mirandornithes and several markers for all subsequent branching events within grebes (Podicipediformes). Five markers appear to be species-specific insertions, including the hitherto first evidence in birds for biallelic CR1 insertions that could be useful in future population genetic studies. birds, CR1, genomic library, retroposon extraction, inverse PCR, biallelic insertion Insertions of retroposons, repetitive genomic elements that propagate via an RNA intermediate and integrate in genomes after reverse transcription, have proven to be powerful tools both for the reconstruction of gene trees (Suh, Kriegs, et al. 2011) as well as species trees (Suh, Paus, et al. 2011) of early bird evolution. In such a complex marker system, the number of possible character states (i.e., insertion sites in the genome) is extremely large and thus events leading to homoplasy (such as exact excisions or parallel insertions) are extremely rare (Shedlock et al. 2004; Ray et al. 2006; Han et al. 2011). Consequently, the identification of three or more congruent retroposon markers for a phylogenetic branch is statistically significant (likelihood ratio test after Waddell et al. 2001) and such a maximum parsimony tree topology converges to a maximum likelihood estimation (Steel and Penny 2000). Despite the advantages of retroposed element (RE) insertions for phylogenetic reconstructions, as well as population genetics (reviewed by Ray 2007), large-scale phylogenetic studies among birds (Kaiser et al. 2007; Kriegs et al. 2007; Suh, Paus, et al. 2011) are so far limited to the lineages of model birds where complete genome sequences are available (Hillier et al. 2004; Warren et al. 2010), namely chicken and zebra finch. A few studies (Watanabe et al. 2006; Treplin and Tiedemann 2007) have investigated retroposons in nonmodel birds via methods such as hybridization-based library screening, but each yielded only a few phylogenetic markers. An alternative and efficient procedure based on magnetic bead capture was reported by St John and Quinn (2008a) but was not tested on the suitability for finding retroposon markers. To overcome the lack of retroposon information in nonmodel birds and test if there are young retroposons that could be used for avian population genetics, we have established a rapid and easy-to-use polymerase chain reaction (PCR)-based protocol (fig. 1 and supplementary Supplementary DataSupplementary Data [Supplementary Material online]; adapted from a general procedure described by Wang and Kirkness 2005) that yields a genomic retroposon-enriched library of the most abundant (Hillier et al. 2004; Warren et al. 2010) group of avian retroposons, namely the CR1 family of long interspersed elements. FIG. 1. View large Download slide Schematic construction of a PCR-based CR1-enriched library. The protocol (see supplementary Supplementary DataSupplementary Data, Supplementary Material online) was adapted from a population genetic study of short interspersed elements of dogs (Wang and Kirkness 2005). After preparation (A) of genomic DNA, DNA is digested (B) using the restriction enzyme PasI, followed by thermal inactivation. Genomic fragments are diluted in water and circularized (C) using T4 DNA ligase. After DNA purification, inverse PCR (D) using outward-facing, CR1-specific primers yields CR1-containing fragments including one flank of each CR1 insertion. The resultant genomic library is then cloned and sequenced. The light gray box is a CR1 element, black lines are nonrepetitive sequences, dark gray boxes are binding sites for the F (forward) primer and black boxes are binding sites for the R (reverse) primer. Note that due to the ambiguous fourth position of the PasI restriction site (W = A or T), only half of the sticky-ended genomic fragments are expected to be compatible for self-ligation in the circularization step (C). FIG. 1. View large Download slide Schematic construction of a PCR-based CR1-enriched library. The protocol (see supplementary Supplementary DataSupplementary Data, Supplementary Material online) was adapted from a population genetic study of short interspersed elements of dogs (Wang and Kirkness 2005). After preparation (A) of genomic DNA, DNA is digested (B) using the restriction enzyme PasI, followed by thermal inactivation. Genomic fragments are diluted in water and circularized (C) using T4 DNA ligase. After DNA purification, inverse PCR (D) using outward-facing, CR1-specific primers yields CR1-containing fragments including one flank of each CR1 insertion. The resultant genomic library is then cloned and sequenced. The light gray box is a CR1 element, black lines are nonrepetitive sequences, dark gray boxes are binding sites for the F (forward) primer and black boxes are binding sites for the R (reverse) primer. Note that due to the ambiguous fourth position of the PasI restriction site (W = A or T), only half of the sticky-ended genomic fragments are expected to be compatible for self-ligation in the circularization step (C). To examine the applicability of this method, we conducted a case study using the little grebe. From the CR1-enriched library established via the procedure outlined in figure 1, we sequenced 180 randomly selected clones of CR1-containing genomic fragments, masked all repetitive regions using CENSOR (http://www.girinst.org/censor/index.php) or RepeatMasker (http://www.repeatmasker.org) and subsequently BLAT screened (Kent 2002) the sequences for similarity to single-copy sequences within the chicken and zebra finch genomes (for more details, see fig. 2 and the corresponding figure legend). We aligned these loci and applied standard criteria for selecting marker candidates (see supplementary Supplementary DataSupplementary Data, Supplementary Material online and Suh, Paus, et al. 2011), identifying 38 retroposon candidate loci (see Supplementary DataSupplementary DataSupplementary DataSupplementary Data, Supplementary Material online). Two of these loci (each exhibiting an orthologous CR1 insertion shared among little grebe and zebra finch) were not further analyzed, because early bird phylogeny of the lineage leading to the zebra finch has already been studied extensively by Suh, Paus, et al. (2011). For the remaining 36 candidate loci, PCR primers were generated (as described in fig. 2) and tested, leading to 15 markers (see Supplementary DataSupplementary DataSupplementary DataSupplementary Data and Supplementary DataSupplementary DataSupplementary DataSupplementary Data for full alignments, Supplementary Material online) that could be sequenced in different representatives of the grebe lineages (Podicipediformes), their potential sister group candidates and out group representatives (see supplementary Supplementary DataSupplementary Data, Supplementary Material online). FIG. 2. View large Download slide Schematic identification of a CR1 insertion marker candidate. The construction of a PCR-based CR1-enriched library from a bird genome of interest (fig. 1 and supplementary Supplementary DataSupplementary Data, Supplementary Material online) and subsequent cloning yields sequences (A) including a CR1 retroposon 3′ end (lowercase letters) and PasI restriction site (underlined) as well as primer (boxed F and boxed R) and vector sequences. After removal of primer and vector sequences (B), the remaining sequence is masked (i.e., nucleotides of CR1 and other retroposed sequences are replaced by the letter “n”) using RepeatMasker or CENSOR. If BLAT searches (Kent 2002) using the reference genomes of the chicken (galGal3) and the zebra finch (taeGut1) yield a match score >50, the sequence hit (C) is extended (black arrows) by 2,000 bp upstream and downstream, respectively. If a match is found only in one of the two reference genomes, the extended sequence hit from one reference genome is BLAT screened against the other. The resultant sequences and the initial unmasked sequence from the bird genome of interest (after masking all CR1 and other retroposons using lowercase letters) are aligned (D) using MAFFT (E-INS-i, version 6, http://mafft.cbrc.jp/alignment/server/index.html; Katoh and Toh 2008). This permits the generation of one primer that specifically binds to a region that is at least partially well conserved between the bird of interest and the two reference genomes (here, the reverse complement primer binding to the right flank). The appropriate second primer is inferred on the basis of strong sequence conservation between the two reference genomes (here, the forward primer binding to the left flank), as there is no prior sequence information on this CR1-flanking sequence from the bird of interest. In rather rare cases, no well-conserved region on that flank is located within a <1,000 bp distance to the primer on the other flank. In some other cases, the above mentioned BLAT searches yield flanking sequences from only one of the two reference genomes, hampering the identification of well-conserved regions. For such marker candidates, two or more alternative primers (here, forward primers) should be generated to maximize the possibility of obtaining positive PCR results. The subsequent presence/absence analysis sampling the bird of interest as well as more or less closely related species is conducted following standard methods and strict criteria for character interpretation (Suh, Paus, et al. 2011). FIG. 2. View large Download slide Schematic identification of a CR1 insertion marker candidate. The construction of a PCR-based CR1-enriched library from a bird genome of interest (fig. 1 and supplementary Supplementary DataSupplementary Data, Supplementary Material online) and subsequent cloning yields sequences (A) including a CR1 retroposon 3′ end (lowercase letters) and PasI restriction site (underlined) as well as primer (boxed F and boxed R) and vector sequences. After removal of primer and vector sequences (B), the remaining sequence is masked (i.e., nucleotides of CR1 and other retroposed sequences are replaced by the letter “n”) using RepeatMasker or CENSOR. If BLAT searches (Kent 2002) using the reference genomes of the chicken (galGal3) and the zebra finch (taeGut1) yield a match score >50, the sequence hit (C) is extended (black arrows) by 2,000 bp upstream and downstream, respectively. If a match is found only in one of the two reference genomes, the extended sequence hit from one reference genome is BLAT screened against the other. The resultant sequences and the initial unmasked sequence from the bird genome of interest (after masking all CR1 and other retroposons using lowercase letters) are aligned (D) using MAFFT (E-INS-i, version 6, http://mafft.cbrc.jp/alignment/server/index.html; Katoh and Toh 2008). This permits the generation of one primer that specifically binds to a region that is at least partially well conserved between the bird of interest and the two reference genomes (here, the reverse complement primer binding to the right flank). The appropriate second primer is inferred on the basis of strong sequence conservation between the two reference genomes (here, the forward primer binding to the left flank), as there is no prior sequence information on this CR1-flanking sequence from the bird of interest. In rather rare cases, no well-conserved region on that flank is located within a <1,000 bp distance to the primer on the other flank. In some other cases, the above mentioned BLAT searches yield flanking sequences from only one of the two reference genomes, hampering the identification of well-conserved regions. For such marker candidates, two or more alternative primers (here, forward primers) should be generated to maximize the possibility of obtaining positive PCR results. The subsequent presence/absence analysis sampling the bird of interest as well as more or less closely related species is conducted following standard methods and strict criteria for character interpretation (Suh, Paus, et al. 2011). The enrichment method presented here yielded CR1 retroposons that inserted during many different parts of grebe evolution after the neoavian radiation (fig. 3). Two of our markers are shared among the three grebes and the flamingo but absent in the remaining sampled Neoaves (fig. 3, branch A, two REs, P = 0.1111, [2 0 0]). This corroborates the Mirandornithes hypothesis (Sangster 2005) established by sequence analyses (e.g., van Tuinen et al. 2001; Ericson et al. 2006; Hackett et al. 2008) and a few morphological characters (Mayr 2004). Considering this body of evidence, the traditional morphology-based grouping of grebes and loons (e.g., Mayr and Clarke 2003; Livezey and Zusi 2007) can be rejected. Furthermore, three markers are exclusively present in the sampled grebes; together with one marker (B-4) found via screening of GenBank (http://www.ncbi.nlm.nih.gov/genbank/) sequences, this is statistically significant evidence sensu Waddell et al. (2001) for the monophyly of Podicipediformes (fig. 3, branch B, four REs, P = 0.0123, [4 0 0]). Five markers unite the little grebe and the Australasian grebe to the exclusion of the crested grebe (fig. 3, branch C, five REs, P = 0.0041, [5 0 0]) in congruence with grebe taxonomy (Fjeldså 2004). Five retroposon insertions are present solely in the little grebe (fig. 3, branch D) and are, as the Australasian grebe (where these insertions are absent) belongs to the same genus Tachybaptus, either species- or population-specific RE insertions. We note that one of these markers appears to be dimorphic (i.e., the presence allele is not yet fixed within the little grebe population; Supplementary DataSupplementary DataSupplementary DataSupplementary DataSupplementary DataSupplementary Data, Supplementary Material online). This is, to our knowledge, the first report of a biallelic RE insertion in birds and promises that CR1 retroposons might be useful for future population genetic studies. FIG. 3. View large Download slide Retroposon-based phylogenetic tree of Mirandornithes (flamingos and grebes). The tree topology was constructed on the basis of our presence/absence matrix (Supplementary DataSupplementary DataSupplementary DataSupplementary Data, Supplementary Material online) and considering maximum parsimony. Gray balls are phylogenetically informative markers, black circles are autapomorphic retroposon insertions, and the semifilled circle is a dimorphic retroposon insertion (i.e., both the presence and the absence allele appear to be present in the little grebe population; see Supplementary DataSupplementary DataSupplementary DataSupplementary Data, Supplementary Material online). The light gray marker on branch B was found via screening of GenBank sequences. Branches that previously received strong retroposon support (Suh, Paus, et al. 2011) are highlighted with asterisks. FIG. 3. View large Download slide Retroposon-based phylogenetic tree of Mirandornithes (flamingos and grebes). The tree topology was constructed on the basis of our presence/absence matrix (Supplementary DataSupplementary DataSupplementary DataSupplementary Data, Supplementary Material online) and considering maximum parsimony. Gray balls are phylogenetically informative markers, black circles are autapomorphic retroposon insertions, and the semifilled circle is a dimorphic retroposon insertion (i.e., both the presence and the absence allele appear to be present in the little grebe population; see Supplementary DataSupplementary DataSupplementary DataSupplementary Data, Supplementary Material online). The light gray marker on branch B was found via screening of GenBank sequences. Branches that previously received strong retroposon support (Suh, Paus, et al. 2011) are highlighted with asterisks. In addition to reconstructing the species tree evolution in the lineage leading to grebes, the RE sequences themselves provide insights into the evolution of avian CR1 retroposons and their temporal impact on the little grebe genome (Supplementary DataSupplementary DataSupplementary DataSupplementary DataSupplementary DataSupplementary Data, Supplementary Material online). The CR1 element (CR1-X3_Pod) of one of the Mirandornithes markers (marker A-1) is closely related to CR1-X3_Pass, a CR1 subtype that was only active prior to and during the neoavian radiation in the lineage leading to the zebra finch (Suh, Paus, et al. 2011). Thus, the master gene of this subtype appears to have had a prolonged activity in the ancestor of Mirandornithes in comparison to the zebra finch. Marker A-2 belongs to a chicken CR1-D–related subtype (CR1-D_Pod) that was not only active in the ancestor of Mirandornithes but also before the diversification of Podicipediformes (markers B-2 and B-3). In the podicipediform ancestor, a CR1-Y–related subtype (CR1-Yb_Pod) was active (markers B-1 and B-4). Additionally, our data suggest that two other CR1 subtypes were probably exclusively active in the two youngest branches of the phylogenetic tree, namely CR1-B_Pod (markers C-1 to C-5) in the ancestor of the genus Tachybaptus and CR1-Ya_Pod (markers D-1 to D-5) at the species (and population) level of the little grebe. We are aware that the amount of CR1 retroposon markers and CR1 subtype diversity investigated here is just a fraction of the CR1 landscape of the whole little grebe genome. Nevertheless, the CR1 retroposon enrichment reported here identified five different CR1 subfamilies (four of them not present in zebra finch) that are scattered across the phylogenetic tree of CR1 retroposons. Our presence/absence analysis of these CR1 retroposon insertions suggests that they inserted throughout the last 53 million years (van Tuinen 2009) of grebe evolution since the last common ancestor of the Mirandornithes. Thus, we propose that our protocol is suitable for studying the breadth of avian CR1 diversity without the need for full-genome sequencing and, at the same time, for identifying retroposon insertions located across the avian tree of life. Furthermore, we suggest that this method is suitable for the detection of not only ancient but also very recent activity of CR1 retroposons (at least as recent as the anseriform CR1 activity reported by St John and Quinn 2008b), as we have identified, to our knowledge, the first case of intraspecific CR1 retroposition activity in a bird genome. Such lineage-specific retroposon activity patterns have been previously reported in other animals (e.g., in Drosophila, Sánchez-Gracia et al. 2005). As we have successfully tested this protocol on the genomes of a parrot, a falcon, and an emu (data not shown), we predict that PCR-based CR1 enrichment will be an easy-to-use and cost-effective alternative to hybridization-based techniques, as it is applicable virtually to all birds and yields a wealth of retroposon markers suitable for studies of avian phylogeny, taxonomy, genome evolution, and perhaps population genetics. Presumably, the method of the present study will even be suitable for the isolation of CR1s and other retroposons from genomes of nonavian organisms. We thank Anja Zemann, Astrid Farwick, Carsten Raabe, and Maria Nilsson for methodological discussions. Many thanks go to Werner Beckmann (LWL-DNA- und Gewebearchiv), Frank Grützner, Ommo Hüppop (Vogelwarte Helgoland), Holger Schielzeth, Sandra Silinski (Allwetterzoo Münster), and Leanne Wheaton for providing feather, blood, and tissue samples. Claudia Kappen and four anonymous reviewers provided valuable comments. This research was funded by the Deutsche Forschungsgemeinschaft (KR3639 to J.O.K. and J.S.). References Ericson PGP, Anderson CL, Britton T, et al. (10 co-authors) Diversification of Neoaves: integration of molecular sequence data and fossils, Biol Lett. , 2006, vol. 4 (pg. 543- 547) Google Scholar CrossRef Search ADS Fjeldså J. , The grebes , 2004 New York Oxford University Press Hackett SJ, Kimball RT, Reddy S, et al. (18 co-authors) A phylogenomic study of birds reveals their evolutionary history, Science , 2008, vol. 320 (pg. 1763- 1768) Google Scholar CrossRef Search ADS PubMed Han KL, Braun EL, Kimball RT, et al. (17 co-authors) Are transposable element insertions homoplasy free? An examination using the avian tree of life, Syst Biol. , 2011, vol. 60 (pg. 375- 386) Google Scholar CrossRef Search ADS PubMed Hillier LW, Miller W, Birney E, et al. (174 co-authors) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution, Nature , 2004, vol. 432 (pg. 695- 716) Google Scholar CrossRef Search ADS PubMed Kaiser VB, van Tuinen M, Ellegren H. Insertion events of CR1 retrotransposable elements elucidate the phylogenetic branching order in galliform birds, Mol Biol Evol. , 2007, vol. 24 (pg. 338- 347) Google Scholar CrossRef Search ADS PubMed Katoh K, Toh H. Recent developments in the MAFFT multiple sequence alignment program, Brief Bioinform. , 2008, vol. 9 (pg. 286- 298) Google Scholar CrossRef Search ADS PubMed Kent WJ. BLAT—the BLAST-like alignment tool, Genome Res. , 2002, vol. 12 (pg. 656- 664) Google Scholar CrossRef Search ADS PubMed Kriegs JO, Matzke A, Churakov G, Kuritzin A, Mayr G, Brosius J, Schmitz J. Waves of genomic hitchhikers shed light on the evolution of gamebirds (Aves: Galliformes), BMC Evol Biol. , 2007, vol. 7 pg. 190 Google Scholar CrossRef Search ADS PubMed Livezey BC, Zusi RL. Higher-order phylogeny of modern birds (Theropoda, Aves: Neornithes) based on comparative anatomy. II. Analysis and discussion, Zool J Linn Soc. , 2007, vol. 149 (pg. 1- 95) Google Scholar CrossRef Search ADS PubMed Mayr G. Morphological evidence for sister group relationship between flamingos (Aves: Phoenicopteridae) and grebes (Podicipedidae), Zool J Linn Soc. , 2004, vol. 140 (pg. 157- 169) Google Scholar CrossRef Search ADS Mayr G, Clarke J. The deep divergences of neornithine birds: a phylogenetic analysis of morphological characters, Cladistics , 2003, vol. 19 (pg. 527- 553) Google Scholar CrossRef Search ADS Ray DA. SINEs of progress: mobile element applications to molecular ecology, Mol Ecol. , 2007, vol. 16 (pg. 19- 33) Google Scholar CrossRef Search ADS PubMed Ray DA, Xing J, Salem AH, Batzer MA. SINEs of a nearly perfect character, Syst Biol. , 2006, vol. 55 (pg. 928- 935) Google Scholar CrossRef Search ADS PubMed Sánchez-Gracia A, Maside X, Charlesworth B. High rate of horizontal transfer of transposable elements in Drosophila, Trends Genet. , 2005, vol. 21 (pg. 200- 203) Google Scholar CrossRef Search ADS PubMed Sangster G. A name for the flamingo-grebe clade, Ibis , 2005, vol. 147 (pg. 612- 615) Google Scholar CrossRef Search ADS Shedlock AM, Takahashi K, Okada N. SINEs of speciation: tracking lineages with retroposons, Trends Ecol Evol. , 2004, vol. 19 (pg. 545- 553) Google Scholar CrossRef Search ADS PubMed St John J, Quinn TW. Rapid capture of DNA targets, Biotechniques , 2008a, vol. 44 (pg. 259- 264) Google Scholar CrossRef Search ADS St John J, Quinn TW. Identification of novel CR1 subfamilies in an avian order with recently active elements, Mol Phylogenet Evol. , 2008b, vol. 49 (pg. 1008- 1014) Google Scholar CrossRef Search ADS Steel M, Penny D. Parsimony, likelihood, and the role of models in molecular phylogenetics, Mol Biol Evol. , 2000, vol. 17 (pg. 839- 850) Google Scholar CrossRef Search ADS PubMed Suh A, Kriegs JO, Brosius J, Schmitz J. Retroposon insertions and the chronology of avian sex chromosome evolution, Mol Biol Evol. , 2011, vol. 28 (pg. 2993- 2997) Google Scholar CrossRef Search ADS PubMed Suh A, Paus M, Kiefmann M, Churakov G, Franke FA, Brosius J, Kriegs JO, Schmitz J. Mesozoic retroposons reveal parrots as the closest living relatives of passerine birds, Nat Commun. , 2011, vol. 2 pg. 443 Google Scholar CrossRef Search ADS PubMed Treplin S, Tiedemann R. Specific chicken repeat 1 (CR1) retrotransposon insertion suggests phylogenetic affinity of rockfowls (genus Picathartes) to crows and ravens (Corvidae), Mol Phylogenet Evol. , 2007, vol. 43 (pg. 328- 337) Google Scholar CrossRef Search ADS PubMed van Tuinen M. Hedges SB, Kumar S. Advanced birds (Neoaves), The timetree of life , 2009 New York Oxford University Press(pg. 419- 422) van Tuinen M, Butvill DB, Kirsch JAW, Hedges SB. Convergence and divergence in the evolution of aquatic birds, Proc R Soc Lond B Biol Sci. , 2001, vol. 268 (pg. 1345- 1350) Google Scholar CrossRef Search ADS Waddell PJ, Kishino H, Ota R. A phylogenetic foundation for comparative mammalian genomics, Genome Inform. , 2001, vol. 12 (pg. 141- 154) Google Scholar PubMed Wang W, Kirkness EF. Short interspersed elements (SINEs) are a major source of canine genomic diversity, Genome Res. , 2005, vol. 15 (pg. 1798- 1808) Google Scholar CrossRef Search ADS PubMed Warren WC, Clayton DF, Ellegren H, et al. (81 co-authors) The genome of a songbird, Nature , 2010, vol. 464 (pg. 757- 762) Google Scholar CrossRef Search ADS PubMed Watanabe M, Nikaido M, Tsuda TT, Inoko H, Mindell DP, Murata K, Okada N. The rise and fall of the CR1 subfamily in the lineage leading to penguins, Gene , 2006, vol. 365 (pg. 57- 66) Google Scholar CrossRef Search ADS PubMed © The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

journal article

LitStream Collection

Endogenous Lentiviral Elements in the Weasel Family (Mustelidae)

Han, Guan-Zhu;Worobey, Michael

2012 Molecular Biology and Evolution

doi: 10.1093/molbev/mss126pmid: 22522310

Abstract Endogenous retroviruses provide molecular fossils for studying the ancient evolutionary history of retroviruses. Here, we report our independent discovery and analysis of endogenous lentiviral insertions (Mustelidae endogenous lentivirus [MELV]) within the genomes of weasel family (Mustelidae). Genome-scale screening identified MELV elements in the domestic ferret (Mustela putorius furo) genome (MELVmpf). MELVmpf exhibits a typical lentiviral genomic organization. Phylogenetic analyses position MELVmpf basal to either primate lentiviruses or feline immunodeficiency virus. Moreover, we verified the presence of MELV insertions in the genomes of several species of the Lutrinae and Mustelinae subfamilies but not the Martinae subfamily, suggesting that the invasion of MELV into the Mustelidae genomes likely took place between 8.8 and 11.8 Ma. The discovery of MELV in weasel genomes extends the host range of lentiviruses to the Caniformia (order Carnivora) and provides important insights into the prehistoric diversity of lentiviruses. weasel, lentivirus, endogenous retrovirus The long-term evolutionary mode of lentiviruses remains elusive. Retroviruses can integrate into germline genomes, providing important insights into the deep history of these viruses (Johnson and Coffin 1999). However, genome invasion by lentiviruses appears to be very rare (Katzourakis et al. 2007). The short list includes those of the lemurs (pSIV) and the European rabbit (RELIK) (Katzourakis et al. 2007; Gifford et al. 2008; Gilbert et al. 2009). Recently, Cui and Holmes (2012) discovered lentiviral insertions in the genome of the domestic ferret (Mustela putorius furo) (ELVmpf). Here, we report on our independent discovery of endogenous lentiviral elements in the weasel family, both in the domestic ferret and also several wild species. We suggest the weasel family lentiviral elements be designated “Mustelidae endogenous lentivirus” (MELV). We screened all Whole Genome Shotgun (WGS) sequences from GenBank using TBLASTN and various lentivirus proteins (supplementary table S1, Supplementary Material online) and identified six full or partial lentiviral insertions and 13 solo LTRs (supplementary table S2, Supplementary Material online) in the domestic ferret genome (MELVmpf). The consensus sequence exhibits a typical lentiviral organization, encoding three long open reading frames, gag, pol, and env, and three putative accessory genes, vif, tat, and rev (fig. 1 and supplementary figs. S1 and Supplementary Data, Supplementary Material online). MELVmpf also encodes three putative RNA secondary structure elements: a ribosomal frameshift site, a transactivation responsive region, and a Rev responsive element (supplementary fig. S3, Supplementary Material online). A maximum-likelihood (ML) tree of MELVmpf and exogenous retroviruses shows that MELVmpf groups within the lentiviruses with robust support (supplementary fig. S3, Supplementary Material online), confirming it is an endogenous lentivirus. FIG. 1. View largeDownload slide MELVmpf genomic organization. LTR, long-terminal repeat; TAR, transactivation responsive element; RRE, Rev responsive element. FIG. 1. View largeDownload slide MELVmpf genomic organization. LTR, long-terminal repeat; TAR, transactivation responsive element; RRE, Rev responsive element. We also discovered a previously overlooked putative vif gene in another endogenous lentivirus, RELIK (supplementary fig. S2, Supplementary Material online). Putative vif genes in both RELIK and MELVmpf genomes suggest vif may be a very ancient feature of lentivirus genomes; all members of the lentiviral family except equine infectious anemia virus possess a putative vif gene. To determine the relationship between MELVmpf and other lentiviruses, we analyzed conserved regions of the pol amino acid sequences using a Bayesian phylogenetic approach, finding a similar overall topology as previous phylogenies (Katzourakis et al. 2007; Gifford et al. 2008) (fig. 2A). There was considerable posterior support for placing MELVmpf basal to either primate lentiviruses or feline immunodeficiency virus (FIV), leaving the precise phylogenetic position for MELVmpf unresolved. The order Carnivora is composed of the suborders Caniformia and Feliformia (Flynn et al. 2005). The Mustelidae family belongs to the Caniformia suborder. FIV is endemic in the Feliformia (Troyer et al. 2005). The phylogenetic results suggest MELV might either be a novel lentiviral subgroup or might group with FIV in a carnivore lentivirus lineage. FIG. 2. View largeDownload slide (A) Tree of MELVmpf and both endogenous and exogenous lentiviruses. (B) Tree of endogenous lentiviruses only. Both are 50% majority-rule consensus trees. Node labels are posterior probabilities; branch lengths are in expected changes per site (mean values in the MCMC sample). The phylogenies are drawn to the same scale. (C) Phylogeny and timeline of Mustelidae based on Koepfli et al. (2008). The “+” and “−” indicate the presence or absence of MELV by PCR. (D) ML phylogeny of MELV gag sequences: lc, Lontra canadensis; mf, M. frenata; mp, M. putorius; mer, M. erminea; mniv, M. nivalis; nv, Neovison vison, mnig, M. nigripes; and mev, M. eversmanni. Sequences beginning with “contig” are derived from domestic ferret WGS contigs. FIG. 2. View largeDownload slide (A) Tree of MELVmpf and both endogenous and exogenous lentiviruses. (B) Tree of endogenous lentiviruses only. Both are 50% majority-rule consensus trees. Node labels are posterior probabilities; branch lengths are in expected changes per site (mean values in the MCMC sample). The phylogenies are drawn to the same scale. (C) Phylogeny and timeline of Mustelidae based on Koepfli et al. (2008). The “+” and “−” indicate the presence or absence of MELV by PCR. (D) ML phylogeny of MELV gag sequences: lc, Lontra canadensis; mf, M. frenata; mp, M. putorius; mer, M. erminea; mniv, M. nivalis; nv, Neovison vison, mnig, M. nigripes; and mev, M. eversmanni. Sequences beginning with “contig” are derived from domestic ferret WGS contigs. To evaluate the diversity of ancient lentiviruses, we undertook a separate Bayesian phylogenetic analysis using the conserved amino acid residues used above but including only the five known endogenous lentiviruses. After eliminating the effects of postendogenization evolution, the maximum distance between endogenous viruses was 0.72 (between the common ancestors of pSIV and RELIK), while the maximum distance between any pair of taxa on the combined endogenous and exogenous lentiviral phylogeny was not much higher: 1.28 (between SIVsyk and bovine immunodeficiency virus). This suggests that endogenous lentiviruses already exhibited, several Ma genetic divergence comparable with that of modern exogenous lentiviruses (fig. 2A and B). This is a remarkable illustration of long-term evolutionary stasis in viruses that evolve extremely rapidly over short time scales. To date the genomic invasion of MELV, we employed two approaches: calculating the divergence between 5′ LTR and 3′ LTR and determining the taxonomic distribution of MELV. The 5′ LTR and 3′ LTR of endogenous retroviruses may diverge at a neutral rate after endogenization (Johnson and Coffin 1999; Katzourakis et al. 2007). Based on 17 intron loci from six Mustela species, we estimated a neutral substitution rate for Mustela of 4.9 (95% highest posterior density [HPD]: 3.6 − 6.5) × 10−9 substitutions per site per year. We identified a complete lentiviral insertion within contig098598 of the M. putorius genome (supplementary table S2, Supplementary Material online) with a 5′LTR to 3′ LTR distance of 5.9%. Using the intron-based rate, the invasion time estimate is 6.0 (95% HPD: 4.5–8.2) Ma. This estimate, however, should be taken with caution, as the LTRs are short (295 bp) and may not have evolved at a neutral evolutionary rate. We verified the presence of MELV insertions in the genomes of the Lutrinae and Mustelinae subfamilies but not in Martinae via PCR amplification with degenerate primers designed for a conserved region of the gag gene (fig. 2C and D). Lutrinae and Mustelinae diverged from each other ∼8.8 Ma and Martinae diverged from Lutrinae and Mustelinae around 11.8 Ma (Koepfli et al. 2008). Our findings thus suggest an MELV invasion into Mustelidae genomes between 8.8 and 11.8 Ma, close to the upper bound estimated with 5′ LTR and 3′ LTR, and close to the estimate by Cui and Holmes (2012). However, these analyses come with two caveats. First, MELV gag sequences could have arisen from multiple endogenization events in some species, as with pSIV insertions in lemurs (Gilbert et al. 2009). Second, the PCR results for the Martinae might be false negatives if mutations have occurred in primer binding sites. Fadel et al. (2012) recently reported that engineered domestic ferret cells support productive replication of HIV-1. This discovery, along with ours and that of Cui and Holmes (2012), suggests there may be a role for advancing lentiviral biology using this animal model. The ferret's amenability to experimental research and its physiological and immunological similarities to humans have been amply demonstrated with other important diseases (Ball 2006). Materials and Methods Ethanol preserved tissue samples of 10 Mustelidae species (fig. 2C) were obtained in November 2011 from the Museum of Southwestern Biology, University of New Mexico, and the Museum of Vertebrate Zoology, University of California, Berkeley. Genomic DNA was extracted using the DNeasy tissue kit (QIAGEN, MD). PCR amplification of an ∼860-bp gag gene fragment was performed with the degenerate primer pair, FeEL-2R (5′-GCCTTGCArTCCTCATTwGC-3′) and FeEL-2F (5′-GTrTCTGGGCCTTGrAGAyA-3′). Purified PCR products were cloned into the pGEM-T Easy vector (Promega, WI). Cloned products were sequenced by the University of Arizona Genetics Core with an Applied Biosystems 3730XL DNA Analyzer. The sequences have been deposited in GenBank (accession numbers JQ700124–JQ700200). Protein sequences were aligned using Clustal Omega (Sievers et al. 2011). Gblocks 0.91b was used to eliminate ambiguous regions (Talavera and Castresana 2007). An ML approach was used to reconstruct the retrovirus phylogenetic trees using PHYML 3.0 (Guindon et al. 2010) with the rtREV model (Dimmic et al. 2002) and 500 bootstrap replicates. To further evaluate the relationship and divergence of lentiviruses, two data sets were used: the 426 conserved pol amino acid residues from both exogenous and endogenous representative lentiviruses and a separate alignment including only endogenous lentiviruses. These analyses were performed with MrBayes 3.1.2 (Ronquist and Huelsenbeck 2003) and the rtREV amino acid substitution model (Dimmic et al. 2002). The ML phylogenetic tree of MELV elements, based on the gag sequences we recovered from museum specimens, was reconstructed using PHYML 3.0 (Guindon et al. 2010) with 500 bootstrap replicates and the TrN + I + Γ4 substitution model determined using jModelTest (Posada 2008). MELV nucleotide sequences were aligned using MUSCLE (Edgar 2004). We used an alignment of 17 nuclear intron loci from six Mustela species to estimate the neutral evolutionary rate (Yu et al. 2011), with a normal prior (mean = 5.3 Ma, standard deviation = 0.5) on the TMRCA of Mustela according to fossil records (Koepfli et al. 2008). An uncorrelated lognormal relaxed clock model (Drummond et al. 2006) and Yule model of speciation was used. The BEAST package of programs (http://beast.bio.ed.ac.uk, v1.6.1) was employed for these Bayesian MCMC analyses. MCMC chains were run 100 million steps to achieve adequate mixing for all parameters (effective sample size > 200). Tracer v1.5 was used to summarize and analyze the resulting posterior sample. All alignments and input files used in the phylogenetic analyses are available from the authors upon request. We thank the Museum of Southwestern Biology, University of New Mexico, and the Museum of Vertebrate Zoology, University of California, Berkeley, for providing tissue samples of the Mustelidae species, Dr Li Yu for providing the Mustela sequence alignment, and two reviewers for extremely constructive suggestions. This research was supported by National Institute of Allergy and Infectious Diseases grant R01 AI084691 and the David and Lucile Packard Foundation. Associate Editor's note: This work was initially received at MBE on 13 Jan 2012, concurrent with the online publication of Cui and Holmes (2012). References Ball RS. Issues to consider for preparing ferrets as research subjects in the laboratory, ILAR J. , 2006, vol. 47 (pg. 348- 357) Google Scholar CrossRef Search ADS PubMed Cui J, Holmes EC. Endogenous lentiviruses in the ferret genome, J Virol. , 2012, vol. 86 (pg. 3383- 3385) Google Scholar CrossRef Search ADS PubMed Dimmic MW, Rest JS, Mindell DP, Goldstein D. rtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny, J Mol Evol. , 2002, vol. 55 (pg. 65- 73) Google Scholar CrossRef Search ADS PubMed Drummond AJ, Ho SY, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence, PLoS Biol. , 2006, vol. 4 pg. e88 Google Scholar CrossRef Search ADS PubMed Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res. , 2004, vol. 32 (pg. 1792- 1797) Google Scholar CrossRef Search ADS PubMed Fadel HJ, Saenz DT, Guevara R, von Messling V, Peretz M, Poeschla EM. Productive replication and evolution of HIV-1 in Ferret cells, J Virol. , 2012, vol. 86 (pg. 2312- 2322) Google Scholar CrossRef Search ADS PubMed Flynn JJ, Finarelli JA, Zehr S, Hsu J, Nedbal MA. Molecular phylogeny of the carnivora (mammalia): assessing the impact of increased sampling on resolving enigmatic relationships, Syst Biol. , 2005, vol. 54 (pg. 317- 337) Google Scholar CrossRef Search ADS PubMed Gifford RJ, Katzourakis A, Tristem M, Pybus OG, Winters M, Shafer RW. A transitional endogenous lentivirus from the genome of a basal primate and implications for lentivirus evolution, Proc Natl Acad Sci U S A. , 2008, vol. 105 (pg. 20362- 20367) Google Scholar CrossRef Search ADS PubMed Gilbert C, Maxfield DG, Goodman SM, Feschotte C. Parallel germline infiltration of a lentivirus in two Malagasy lemurs, PLoS Genet. , 2009, vol. 5 pg. e1000425 Google Scholar CrossRef Search ADS PubMed Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0, Syst Biol. , 2010, vol. 59 (pg. 307- 321) Google Scholar CrossRef Search ADS PubMed Johnson WE, Coffin JM. Constructing primate phylogenies from ancient retrovirus sequences, Proc Natl Acad Sci U S A. , 1999, vol. 96 (pg. 10254- 10260) Google Scholar CrossRef Search ADS PubMed Katzourakis A, Tristem M, Pybus OG, Gifford RJ. Discovery and analysis of the first endogenous lentivirus, Proc Natl Acad Sci U S A. , 2007, vol. 104 (pg. 6261- 6265) Google Scholar CrossRef Search ADS PubMed Koepfli KP, Deere KA, Slater GJ, Begg C, Begg K, Grassman L, Lucherini M, Veron G, Wayne RK. Multigene phylogeny of the Mustelidae: resolving relationships, tempo and biogeographic history of a mammalian adaptive radiation, BMC Biol. , 2008, vol. 6 pg. 10 Google Scholar CrossRef Search ADS PubMed Posada D. jModelTest: phylogenetic model averaging, Mol Biol Evol. , 2008, vol. 25 (pg. 1253- 1256) Google Scholar CrossRef Search ADS PubMed Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models, Bioinformatics , 2003, vol. 19 (pg. 1572- 1574) Google Scholar CrossRef Search ADS PubMed Sievers F, Wilm A, Dineen D, et al. (12 co-authors) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega, Mol Syst Biol. , 2011, vol. 7 pg. 539 Google Scholar CrossRef Search ADS PubMed Talavera G, Castresana J. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments, Syst Biol. , 2007, vol. 56 (pg. 564- 577) Google Scholar CrossRef Search ADS PubMed Troyer JL, Pecon-Slattery J, Roelke ME, et al. (16 co-authors) Seroprevalence and genomic divergence of circulating strains of feline immunodeficiency virus among Felidae and Hyaenidae species, J Virol. , 2005, vol. 79 (pg. 8282- 8294) Google Scholar CrossRef Search ADS PubMed Yu L, Peng D, Liu J, Luan P, Liang L, Lee H, Lee M, Ryder OA, Zhang Y. On the phylogeny of Mustelidae subfamilies: analysis of seventeen nuclear non-coding loci and mitochondrial complete genomes, BMC Evol Biol. , 2011, vol. 11 pg. 92 Google Scholar CrossRef Search ADS PubMed © The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

journal article

LitStream Collection

Comparative Genomics Reveals Adaptive Protein Evolution and a Possible Cytonuclear Incompatibility between European and American Eels

Gagnaire, Pierre-Alexandre;Normandeau, Eric;Bernatchez, Louis

2012 Molecular Biology and Evolution

doi: 10.1093/molbev/mss076pmid: 22362081

Abstract During the early stages of speciation, interspecific gene flow may be impeded by deleterious epistatic interactions in hybrids, which maintain parental allelic combinations at the speciation genes. The resulting semipermeable nature of the barrier to interspecific gene flow provides a valuable framework to identify the genes involved in hybrid mortality or sterility, as well as the evolutionary mechanisms that initially caused their divergence. The two Atlantic eels Anguilla anguilla and A. rostrata are partially isolated sister species that naturally hybridize, but whose genetic basis of postzygotic isolation remains unknown. We collected high-throughput sequencing data from the transcriptomes of 58 individuals and discovered 94 genes showing differentially fixed mutations between species. Evidence for positive selection at nuclear diagnostic genes was obtained using multilocus extensions of the McDonald–Kreitman test with polymorphism data from each species. In contrast, mitochondrial protein-coding genes experienced strong purifying selection and mostly diverged at synonymous sites, except for the mt-atp6 gene, which showed an atypically high nonsynonymous to synonymous rate ratio. Nuclear-encoded protein interactors of the mt-atp6 gene in the ATP synthase complex were significantly overrepresented in the list of nuclear diagnostic genes. Further analysis of resequencing data showed that positive selection has operated at both the mt-atp6 gene and its nuclear interactor atp5c1. These findings suggest that a cytonuclear incompatibility caused by a disruption of normal ATP synthase function in hybrids contributes to partial reproductive isolation between European and American eels. Atlantic eels, ATP synthase, genome scan, next generation sequencing, positive selection, speciation genes Introduction A central goal of speciation genetics is to assess the relative importance of adaptive and nonadaptive processes involved in species divergence (Orr and Presgraves 2000; Coyne and Orr 2004; Wu and Ting 2004; Presgraves 2010). This is particularly relevant to understanding the early stages of speciation, when only small numbers of genes are responsible for hybrid dysfunction. The molecular mechanisms underlying genetic incompatibilities have been well described for some speciation genes found in model organisms (e.g., Presgraves et al. 2003; Brideau et al. 2006; Lee et al. 2008; Mihola et al. 2009; Phadnis and Orr 2009; Chou et al. 2010). As predicted by the Dobzhansky–Muller model of hybrid incompatibilities (Dobzhansky 1936; Muller 1942; Orr 1995), these genes often appear to be involved in deleterious epistatic interactions caused by the combination of incompatible alleles from different loci in hybrids. Studies investigating speciation genes have shown that the evolution of incompatible alleles causing postzygotic isolation may be driven either by nonadaptive or by adaptive processes. Moreover, in many cases where positive selection has been evidenced, adaptive evolution has been frequently attributed to compensatory changes initially triggered by mutation or molecular arms races and only rarely to ecological adaptation (Presgraves 2010; Maheshwari and Barbash 2011). Although considerable progress has been made toward characterizing the evolutionary forces underlying the establishment of incompatible alleles, most speciation genes were identified from a small number of model organisms. Therefore, extending speciation genetic approaches to nonmodel species may provide a more comprehensive overview of the mechanisms at play in nature, in particular with respect to their dependence on environment. Owing to the recent progress in next generation sequencing (NGS) technologies, nonmodel organisms become increasingly amenable to genome-wide investigations (Gilad et al. 2009; Wolf et al. 2010; Ekblom and Galindo 2011). These techniques provide appealing new opportunities to discover genetic incompatibilities in taxa that are still at incipient stages of reproductive isolation and continue to exchange genes via introgressive hybridization (Barreto et al. 2011). Partially reproductively isolated species present a major potential for speciation geneticists: if the level of introgression is sufficient to prevent genetic divergence at genes that do not impair the survival or fertility of hybrids, it may in turn reveal genetic barriers that maintain elevated values of interspecies differentiation or even differentially fixed alleles. Genomic regions under divergent selection are traditionally identified using FST scan methods, which ideally require complementary analyses of the molecular signature left by selection on linked neutral variation (Akey 2002; Turner et al. 2008, 2010; Kolaczkowski et al. 2011) or of differential introgression in hybrid zones (Payseur 2010). Another appealing approach is to estimate the proportion of nonsynonymous substitutions fixed by positive selection, by contrasting divergence among species with polymorphism within species (Eyre-Walker 2006). Several multilocus approaches extending the McDonald–Kreitman (MK) test (McDonald and Kreitman 1991) have been developed for this purpose (Fay et al. 2001; Smith and Eyre-Walker 2002; Bierne and Eyre-Walker 2004; Eyre-Walker and Keightley 2009). However, these tests have rarely been applied to recently diverged species, and additional investigations are necessary to further test the reliability of MK-related approaches to investigating the early stages of speciation (Gossmann et al. 2010). The two Atlantic eel species, Anguilla rostrata and A. anguilla, provide a fascinating experimental model for studying the genetic basis of reproductive isolation in the wild. Their spawning areas partially overlap in the Sargasso Sea, from where after hatching, planktonic leptocephali larvae drift with the currents to the North-eastern American coasts for A. rostrata and the European and North African coasts for A. anguilla. These sister species can be distinguished both morphologically and genetically, but the finding of hybrids in Iceland has proved reproductive isolation to be still incomplete (Avise et al. 1990). Although a direct measure of hybrid fitness is technically challenging for eels, first generation hybrids were shown to be viable and fertile based on the genetic identification of backcross genotypes in nature (Albert et al. 2006). However, departures from neutral introgression found at some highly divergent loci suggest that particular allelic combinations may affect hybrid survival at a few genetic incompatibility loci (Gagnaire et al. 2009). While supporting the existence of a semipermeable genetic barrier to interspecific gene flow (Harrison 1986), these findings based on anonymous markers did not identify genes involved in hybrid breakdown. On the other hand, they allowed estimating a very low background level of neutral differentiation between species (average neutral FST = 0.07) and a small proportion of markers with abnormally high levels of divergence. This heterogeneous genomic divergence pattern provides a valuable framework to identify the genes involved in hybrid dysfunction (Nosil et al. 2009) and study the evolutionary mechanisms that initially caused their divergence. Using a high-throughput genome scan approach based on RNA-Seq data, we characterized patterns of polymorphism and divergence across coding genes of the two Atlantic eel species. We show that population genetic and molecular evolution analyses can be efficiently combined to identify genes possibly involved in the early stages of postzygotic reproductive isolation. Materials and Methods cDNA Sequence Data Transcriptome sequencing data (RNA-Seq) generated by the 454 GS-FLX sequencing platform were retrieved from two NCBI sequence read archives. The data set of the American eel A. rostrata (SRA045712) consisted of 940,263 individually tagged reads obtained from 40 specimens evenly sampled from the St Lawrence estuary and Florida (Gagnaire et al. 2012). The data set of the European eel A. anguilla (SRA020995) was obtained by sequencing a pooled cDNA library constructed from 18 individuals evenly sampled from the Vilaine, Tiber, and Sele estuaries (Coppe et al. 2010). The sequencing coverage of A. anguilla was improved by resequencing its library on an additional half plate of 454 GS-FLX. As a result, 371,172 new reads were added to the 310,079 archived reads, representing a total of 681,251 reads for A. anguilla. In both species, whole glass eels were used for RNA extraction. Sequence Assembly, Data Preparation, and Single Nucleotide Polymorphism Discovery CLC Genomic Workbench 3.7 (CLC bio) software was used to perform a de novo assembly of the whole sequence data. After several preliminary runs, assembly parameters were set to a minimal read length fraction of 0.5 and a similarity of 0.98, which offered a good compromise between the level of polymorphism, the sequencing error rate, and the need to minimize the inclusion of nonorthologous sequences within contigs. We used the Neighborhood Quality Standard (NQS) algorithm (Altshuler et al. 2000; Brockman et al. 2008) to discover single nucleotide polymorphisms (SNPs) from aligned reads within contigs, while taking into account the quality values of adjacent bases in order to distinguish sequencing errors from actual SNPs. We set a rare variant frequency threshold of 5% (or a count threshold of 5 for the rarest variant when the coverage exceeded 100×) as well as a minimum coverage of 20× per polymorphic site to avoid the artifactual detection of sequencing errors as SNPs. Only biallelic SNPs were considered. Detection of Diagnostic Mutations For the A. rostrata data set, each sequence read was assigned to an individual on the basis of its 10 bp bar code unique to each individual. Individual genotypes were then inferred at each polymorphic site using the probabilistic framework described in Gagnaire et al. (2012) and were used to calculate allele frequencies. Because sequence read data for A. anguilla were obtained by sequencing non–bar-coded individual libraries pooled in equal amounts (Coppe et al. 2010), allele frequencies in this species were directly estimated from allele counts at each SNP position. We used a sequencing coverage threshold of 8× per species to remove SNPs covered by less than eight different individuals in A. rostrata and eight different reads in A. anguilla. The allelic frequency differential between species (δAro/Aan) was then calculated at each retained SNP, in order to estimate the genome-wide distribution of between-species genetic differentiation and identify diagnostic mutations. Contigs containing SNPs with allele frequency differentials above 0.95 (δAro/Aan > 0.95) were all manually inspected in order to detect potential local assembly errors causing the nondetection of diagnostic mutations. Open Reading Frame Prediction and Sequence Annotation For each contig containing at least one diagnostic mutation, the consensus sequence was blasted against the nonredundant NCBI protein database (nr) using BLASTX with an E value threshold of 10−5 (Altschul et al. 1997). Protein-coding regions (open reading frames [ORFs]) within contigs showing significant hits to the nr database were identified with OrfPredictor (Min et al. 2005), using the reading frame information contained in the BLASTX output. For each successfully annotated contig, noncoding regions were trimmed and the coding region was arranged according to its reading frame. Orthologous sequences from at least three other divergent fish species (most frequently Danio rerio, Salmo salar, Tetraodon nigroviridis, Osmerus mordax, Oncorhynchus mykiss, Ictalurus punctatus, and Esox lucius) were retrieved from BLASTX outputs to be used as outgroups. We inferred the most parsimonious ancestral state at each polymorphic site in eels by aligning orthologous genes based on their translated sequences using CLUSTALW (Thompson et al. 1994) in MEGA 5 (Tamura et al. 2011). Polymorphic sites were then classified as either synonymous (S) or nonsynonymous (NS) and derived allele counts were determined in each species and assigned to one of four different frequency classes (low: [0;0.05], moderate: ]0.05;0.2], common: ]0.2;0.5], and high: ]0.5;1]). Functional overrepresentation of Gene Ontology categories in the list of nuclear diagnostic genes was tested with reference to the whole list of genes retained for SNP discovery, using Fisher's exact test in BLAST2GO v.2 (Götz et al. 2008). Tests of Molecular Evolution Because A. rostrata and A. anguilla have most likely diverged less than 500,000 years ago (Gagnaire P-A, unpublished data), most diagnostic genes carry few differentially fixed mutations between the two species. This problem added to a potentially low number of polymorphisms segregating within each species at some genes may cause a large variance in single gene estimates of the proportion of amino acid substitutions driven by positive selection (Bierne and Eyre-Walker 2004). We thus inferred the fraction of adaptive substitutions (α) in each species using three different multilocus extensions of the MK test (McDonald and Kreitman 1991) implemented in the DoFE software available on A. Eyre-Walker Lab's Web site http://www.lifesci.sussex.ac.uk/home/Adam_Eyre-Walker (University of Sussex, United Kingdom). All of these approaches assume that synonymous mutations are neutral (i.e., selection on synonymous codon usage and linkage effects can be neglected), whereas nonsynonymous mutations are either strongly deleterious, neutral, or strongly advantageous. Under these assumptions, positively selected replacement mutations are expected to go rapidly to fixation and contribute only little to polymorphism compared with divergence between species. The first method (Fay et al. 2001) simply sums the values of Dn, Ds, Pn, and Ps across genes to estimate α. However, it is sensitive to the existence of slightly deleterious mutations (SDMs) segregating at low frequencies and to variations in Ne across genes, which usually lead to underestimate α (Fay et al. 2001; Bierne and Eyre-Walker 2004; Charlesworth and Eyre-Walker 2008). The second method (Smith and Eyre-Walker 2002) estimates the average fraction of adaptive substitutions by averaging statistics across genes but may be sensitive to the presence in the data set of genes showing little or no polymorphism. The third method (Bierne and Eyre-Walker 2004) estimates α within a maximum likelihood (ML) framework, which allows including genes that have little or no polymorphism. Confidence intervals of α were obtained by bootstrapping genes for the first two methods and by determining the 2 units of Log(L) interval for the third one. Because we did not sample the same number of chromosomes for all genes, we did not use the method that estimates the distribution of fitness effects of new mutations to account for the presence of SDMs (Eyre-Walker and Keightley 2009). However, we controlled for the effects of negative selection acting on SDMs by removing rare polymorphisms segregating at low frequencies (≤20%) in each species, as commonly applied in other studies (Fay et al. 2001, 2002; Bierne and Eyre-Walker 2004; Charlesworth and Eyre-Walker 2008; Axelsson and Ellegren 2009; Gossmann et al. 2010). Since both nuclear and mitochondrial genes were found in the list of diagnostic genes, we estimated α separately from nuclear and mitochondrial gene sub-data sets, using alternatively A. rostrata or A. anguilla as the ingroup species to score polymorphism. For each species, two published sequences available for each of the 13 mitochondrial protein-coding genes (Minegishi et al. 2005) were retrieved from GenBank. Short 454 reads mapping to these reference sequences were used to improve the identification of divergent sites between species. The number of nonsynonymous substitutions per nonsynonymous site (dn), the number of synonymous substitutions per synonymous site (ds), and the ω ratio (dn/ds) were estimated for each gene using DnaSP v5 (Librado and Rozas 2009). Resequencing of Candidate Genes We resequenced two genes showing relatively high ratios of nonsynonymous to synonymous divergence (see Results), which are involved in the ATP biosynthetic process: the mitochondrial gene mt-atp6 and the nuclear gene atp5c1, coding for two interacting subunits of the cytonuclear protein complex ATP synthase (OXPHOS complex V). The mt-atp6 gene encodes the ATP synthase F0 subunit 6, a key component of the proton channel located in the inner mitochondrial membrane, whereas the atp5c1 gene encodes the ATP synthase F1 subunit gamma, which connects the F0 rotary motor to the F1 catalytic core of the ATP synthase. New primers were designed to resequence these genes in 20 American eels (A. rostrata) from the St Lawrence estuary and 20 European eels (A. anguilla) from the Vilaine estuary. The full length–coding sequence of the mt-atp6 gene (684 bp) was amplified using primers ANG-ATP6-L (5′-GTATTCTCATGGGCCGTGTT-3′) and ANG-ATP6-R (5′-CATGGGCTTGGGTCAACTAT-3′). An exon priming–intron crossing approach was developed to amplify overlapping fragments covering the whole atp5c1 gene for two individuals from each species. Two overlapping fragments located in the most diverging region of atp5c1 and covering 1,627 bp from intron 5 to exon 8 were amplified in all 40 individuals using the primer pairs ANG-ATP5C1-5L (5′-TGCTTGTGATTCAACCCATT-3′) and ANG-ATP5C1-6R (5′-CCATGCACGTGACGACTATC-3′) as well as ANG-ATP5C1-6L (5′-CTGCGCAGTGTCAGGAAATA-3′) and ANG-ATP5C1-8R (5′-AGAGCAGCTGCACCAGAGAT-3′). Polymerase chain reactions (PCRs) were performed in 20 μl reaction volumes containing 1× PCR buffer, 1.5 mM of MgCl2, 0.25 mM of each dNTP, 0.5 μM of forward and reverse primers, 0.5 unit of GoTaq DNA polymerase (Promega), and 100 ng of purified genomic DNA. The amplification parameters were as follows: 94 °C for 5 min followed by 35 cycles of (94 °C for 30 s, 60 °C for 30 s, 72 °C for 45 s), and ended by 4 min at 72 °C. Amplicons were purified with ExoSAP-IT (USB Corporation) and sequenced in both directions on an ABI 3130 xl Genetic Analyzer (Applied Biosystems), using the amplification primers and the Big Dye Terminator Sequencing Kit (v3.1 Applied Biosystems). Sequences were manually inspected in BioEdit and aligned using CLUSTALW (Thompson et al. 1994). Heterozygous positions at the atp5c1 locus, identified by double peaks in individual electropherograms, were represented by their IUPAC ambiguity code. Individual haplotypes were then inferred for individual sequences containing ambiguities (at least two heterozygous SNPs) using the ISHAPE software (Delaneau et al. 2007). Population genetic statistics, including the number of segregating sites (S), the number of haplotypes (h), the average number of nucleotide differences (k), nucleotide diversity (π) (Nei 1987), and Watterson's estimator of the per site mutation rate parameter (θW) (Watterson 1975), were computed for each gene in each species using DnaSP v5 (Librado and Rozas 2009). Three neutrality tests based on the allele frequency spectrum were performed for each locus in each species: Tajima's D (Tajima 1989) and Fu and Li's D and F using the alternate species as outgroup (Fu and Li 1993) as well as the MK test (McDonald and Kreitman 1991). For the atp5c1 gene, a between-species FST profile was generated using polymorphic sites, and between sites linkage disequilibrium (LD) was estimated using Haploview v4.2 (Barrett et al. 2005). Phylogenetic relationships among haplotypes were inferred using uncorrected P distances to reconstruct a neighbor-joining tree for the mt-atp6 gene and a NeighborNet network for the atp5c1 gene in the software SplitsTree4 (Huson and Bryant 2006). Finally, ancestral states of amino acids were estimated using the ML method in MEGA 5 (Tamura et al. 2011), at each node of an ML tree inferred from amino acid sequences of both Atlantic eel species and other available fish species as outgroups. Results Sequence Assembly and SNP Discovery A total of 1,621,514 reads representing 472.8 Mb of sequences were de novo assembled into 22,014 contigs characterized by an average length of 459 bp and a mean number of 36.4 reads per contig. Among 55,602 putative SNPs which were identified in silico using the NQS algorithm, we retained 14,348 SNPs from 3,566 contigs (average of four SNPs per contig) after filtering for a minimal coverage of 8× in both species. Using these parameters, we found a high level of polymorphism across the transcriptome of both Atlantic eel species, with an average number of one SNP every 114 bp. Identification of Genes with Diagnostic Mutations The distribution of allele frequency differences between species was skewed toward low values, with a median of 0.11 and 67.8% of the SNPs showing frequency differences below 0.20 (fig. 1). Only 231 highly divergent SNPs from 112 different contigs exhibited allele frequency differences above 0.95. Manual correction of local assembly and sequencing errors revealed a total of 94 annotated contigs showing at least one diagnostic mutation (mean = 2.9 species-diagnostic sites per contig). These were distributed among 87 nuclear genes (42 of which were fully sequenced) and 7 mitochondrial genes (4 of which were fully sequenced), for a mean number of 697 coding base pairs per contig (the complete list can be found in supplementary table S1, Supplementary Material online). Testing for overrepresented Gene Ontology terms in the subset of nuclear genes with diagnostic mutations compared with the reference list of the 3,566 genes retained for sufficient coverage revealed a highly significant enrichment in genes involved in the ATP biosynthetic process (GO:0006754, P value = 0.000027, FDR-corrected P value = 0.0024). ATP synthesis coupled with proton transport (GO:0015986), which is an ATP biosynthetic process, was also significantly overrepresented (P value = 0.000925, FDR-corrected P value = 0.0151). FIG. 1. View largeDownload slide Distribution of allele frequency differences between species based on 14,348 SNPs. FIG. 1. View largeDownload slide Distribution of allele frequency differences between species based on 14,348 SNPs. Analyses of Molecular Evolution The mean number of synonymous substitutions per synonymous site (ds) and the mean number of nonsynonymous substitutions per nonsynonymous site (dn) calculated across all 87 nuclear diagnostic genes were 0.0048 and 0.0027, respectively, for a mean nonsynonymous/synonymous rate ratio of ωnucl87 = 0.565. For the seven mitochondrial genes obtained by 454 pyrosequencing, the mean ds was 8.35 times higher than for nuclear genes (ds = 0.0405) and the mean dn was 2.75 times lower (dn = 0.0010), for a mean nonsynonymous/synonymous rate ratio of ωmito7 = 0.025. The mean values of ds and dn estimated from all the 13 mitochondrial protein-coding genes were 0.1024 and 0.0021, respectively, for a mean ωmito13 ratio of 0.021 and a mitochondrial to nuclear synonymous substitution ratio of 21.3. The locus mt-atp6 exhibited by far the highest nonsynonymous to synonymous rate ratio among the 13 mitochondrial genes (ω = 0.126; fig. 2), being six times higher than the mean ωmito13 ratio. FIG. 2. View largeDownload slide Distribution of dn, ds, and ω across the 13 mitochondrial genes, ordered according to their relative positions in the mitochondrial genome of Atlantic eels. FIG. 2. View largeDownload slide Distribution of dn, ds, and ω across the 13 mitochondrial genes, ordered according to their relative positions in the mitochondrial genome of Atlantic eels. Contrasting indications of adaptive evolution were found across sequence data sets using each of the three methods to estimate α (Fay et al. 2001; Smith and Eyre-Walker 2002; Bierne and Eyre-Walker 2004) (supplementary table S2, Supplementary Material online). Using polymorphic and divergent sites from the 87 nuclear diagnostic genes, estimated values of αnucl ranged from 0.531 to 0.753 using polymorphism from A. rostrata and from 0.113 to 0.263 using polymorphism from A. anguilla, while being only significant using the first species as ingroup (fig. 3). When controlling for the effect of SDMs segregating at low frequencies (≤20%) in nuclear genes, significant values of αnucl were obtained with each method using both species as ingroup but were still higher with A. rostrata (αnucl = 0.822–0.957) compared with A. anguilla (αnucl = 0.440–0.660). FIG. 3. View largeDownload slide Estimates of α from the nuclear gene data set using (a) Anguilla rostrata and (b) Anguilla anguilla as ingroup. Three estimation methods were used: FWW (Fay et al. 2001), SEW (Smith and Eyre-Walker 2002), and BEW (Bierne and Eyre-Walker 2004), each using either all polymorphic sites (open squares) or after removing low- to moderate-frequency mutations (≤20%) segregating in each species (solid squares). Mean estimates are shown with their 95% confidence intervals. FIG. 3. View largeDownload slide Estimates of α from the nuclear gene data set using (a) Anguilla rostrata and (b) Anguilla anguilla as ingroup. Three estimation methods were used: FWW (Fay et al. 2001), SEW (Smith and Eyre-Walker 2002), and BEW (Bierne and Eyre-Walker 2004), each using either all polymorphic sites (open squares) or after removing low- to moderate-frequency mutations (≤20%) segregating in each species (solid squares). Mean estimates are shown with their 95% confidence intervals. Using the data set comprising seven mitochondrial genes (which does not include the mt-atp6 gene), significantly negative values of αmito were inferred with the BEW method (Bierne and Eyre-Walker 2004) using polymorphism from each species (A. rostrata: αmito = −6.985, A. anguilla: αmito = −6.554; supplementary table S2, Supplementary Material online). However, these values became nonsignificant when low-frequency mutations were removed from the analyses (A. rostrata: αmito = −2.571, A. anguilla: αmito = −1.813). As expected in the presence of SDMs, within-species allele frequency spectra analyses revealed decreasing ratios of nonsynonymous to synonymous polymorphisms (pn/ps) with increasing frequency classes for both nuclear and mitochondrial data sets in each species (fig. 4). The decrease in the pn/ps ratio was faster in A. rostrata for both data sets, but significant differences in pn/ps ratios between species were only observed with nuclear genes. The proportion of neutral nonsynonymous mutations, measured by the pn/ps ratio of common polymorphisms (>50%), was 0.11 in A. rostrata and 0.12 in A. anguilla based on the nuclear data sets and was not significantly different in the mitochondrial data sets (pn/ps = 0.08 and 0.06 for A. rostrata and A. anguilla, respectively). FIG. 4. View largeDownload slide Ratios of nonsynonymous to synonymous polymorphisms (pn/ps) within each species for low- (≤5%), moderate- (5–20%), common- (20–50), and high-frequency (>50%) classes separately calculated from (a) the nuclear and (b) mitochondrial gene data sets. Differences between pn/ps ratios were tested for statistical significance using the chi-square test. Significant differences between species for each frequency class are indicated above histograms in a and b and significant differences between frequency classes within each species appear in tables c and d (Anguilla rostrata above diagonal and Anguilla anguilla below diagonal). Aro: A. rostrata, Aan: A. anguilla, *P < 0.05, **P < 0.01, and ***P < 0.001. FIG. 4. View largeDownload slide Ratios of nonsynonymous to synonymous polymorphisms (pn/ps) within each species for low- (≤5%), moderate- (5–20%), common- (20–50), and high-frequency (>50%) classes separately calculated from (a) the nuclear and (b) mitochondrial gene data sets. Differences between pn/ps ratios were tested for statistical significance using the chi-square test. Significant differences between species for each frequency class are indicated above histograms in a and b and significant differences between frequency classes within each species appear in tables c and d (Anguilla rostrata above diagonal and Anguilla anguilla below diagonal). Aro: A. rostrata, Aan: A. anguilla, *P < 0.05, **P < 0.01, and ***P < 0.001. Analysis of Two Subunits of the ATP Synthase Cytonuclear Complex Molecular diversity indices calculated from 20 mt-atp6 sequences in each species were 2.5–4 times lower in A. rostrata compared with A. anguilla (table 1 and fig. 5a). Significant negative values of Fu and Li's D and F showed an excess of low-frequency mutations in A. anguilla, indicating either a recent population expansion or directional selection. The MK test showed a significant excess of fixed amino acid–altering substitutions between species, consistent with positive selection (table 2). Among the six amino acid replacement substitutions found in the mt-atp6 gene, two corresponded to a change that occurred in A. rostrata (I43T and M83T) and three in A. anguilla (the double-nucleotide mutation N52G and F105L), whereas the ancestral state could not be determined at the V195I substitution site. Table 1. Molecular Diversity Indices and Results of Neutrality Tests for the mt-atp6 and the atp5c1 Genes. Note.—Aro: Anguilla rostrata, Aan: Anguilla anguilla, n: number of sequences, S: number of segregating sites, h: number of haplotypes, k: average number of nucleotide differences, π: nucleotide diversity, and θW: Watterson's estimator of the per site mutation rate parameter. *P < 0.05, **P < 0.01. View Large Table 2. MK Test for the mt-atp6 and atp5c1 Genes. Note.—The MK test was performed between species for the mt-atp6 gene and between European and American alleles for the coding region comprising exons 6–8 of the atp5c1 gene. View Large FIG. 5. View largeDownload slide Neighbor-joining tree for the entire mt-atp6 gene (a) and neighbor network for the intron 5 to exon 8 region of the atp5c1 gene (b). Anguilla rostrata haplotypes are represented by open circles and Anguilla anguilla by black circles. Unidirectional introgression at the atp5c1 locus is illustrated by one open circle in the A. anguilla part of the neighbor network showing introgression of a European allele into the American eel background and by six American haplotypes with intermediate positions in the network showing recombination of American with European alleles around the G211A-V212I double mutation. FIG. 5. View largeDownload slide Neighbor-joining tree for the entire mt-atp6 gene (a) and neighbor network for the intron 5 to exon 8 region of the atp5c1 gene (b). Anguilla rostrata haplotypes are represented by open circles and Anguilla anguilla by black circles. Unidirectional introgression at the atp5c1 locus is illustrated by one open circle in the A. anguilla part of the neighbor network showing introgression of a European allele into the American eel background and by six American haplotypes with intermediate positions in the network showing recombination of American with European alleles around the G211A-V212I double mutation. Resequencing data from the most diverging region of the atp5c1 gene showed a 2.7-fold lower nucleotide diversity in A. anguilla compared with A. rostrata and significant negative values of Tajima's D and Fu and Li's F in A. anguilla (table 1). Reduced diversity in the European eel was also visible in the neighbor network, which further revealed unidirectional introgression from the European to the American eel background (fig. 5b). Only one full-length European haplotype was found among 40 sequences in A. rostrata, introgression being mostly confined to the flanking regions of the G211A-V212I double mutation in exon 7. Moreover, there was a clear trend toward a decrease in FST values on each side of the G211A-V212I double mutation, indicating increasing interspecies recombination with increasing distance from this nearly diagnostic site (fig. 6). Although the extent of LD could not be tested beyond the resequenced region, analyses with Haploview revealed high LD between two linkage blocks, respectively including exons 6 and 7, with each block containing two amino acid–altering polymorphic sites (exon 6: V195A-V204L and exon 7: G211A-V212I; supplementary fig. S1, Supplementary Material online). An MK test performed between “European” (V195-V204-G211-V212) and “American” (A195-L204-A211-I212) alleles showed a significant excess of nonsynonymous substitutions in the region extending from exon 6 to exon 8 (table 2), suggestive of positive selection. Ancestral state prediction at the four amino acid replacement substitutions found between the American and European alleles showed that two changes occurred in A. rostrata (V204L and G211A) and one in A. anguilla (I212V), whereas the ancestral state could not be determined at the A195V substitution site. FIG. 6. View largeDownload slide Genetic differentiation profile of the atp5c1 gene. For each between-species polymorphism, FST is plotted against its position in the gene, with open circles referring to allelic frequencies obtained from 454 cDNA sequencing data and black circles when allelic frequencies were measured from Sanger resequencing data. Exons are shown as shaded boxes and the resequenced region is indicated by a black horizontal line above the intron 5 to exon 8 region. The horizontal dashed line indicates the average neutral FST value of 0.07 (Gagnaire et al. 2009). FIG. 6. View largeDownload slide Genetic differentiation profile of the atp5c1 gene. For each between-species polymorphism, FST is plotted against its position in the gene, with open circles referring to allelic frequencies obtained from 454 cDNA sequencing data and black circles when allelic frequencies were measured from Sanger resequencing data. Exons are shown as shaded boxes and the resequenced region is indicated by a black horizontal line above the intron 5 to exon 8 region. The horizontal dashed line indicates the average neutral FST value of 0.07 (Gagnaire et al. 2009). Discussion Evidence for Positive Selection at Nuclear Diagnostic Genes Analyzing the empirical distribution of allele frequency differences between species revealed that up to 2.6% of the transcribed genes (94/3,566) may present differentially fixed SNPs. This proportion is likely to be a slight overestimate of the true percentage of diagnostic SNPs across the transcriptome of Atlantic eels since our minimum base coverage threshold of 8× per species may not allow detecting rare variants that segregate in each genetic background. Moreover, because SNPs were detected from RNA-Seq data, gene specific variance in expression level among individuals or among alleles may contribute to bias in allele frequency estimates from pooled sample reads in A. anguilla. Nevertheless, our coverage threshold was reasonably high compared with similar studies of this type in nonmodel organisms (Galindo et al. 2010; Renaut et al. 2010; Barreto et al. 2011), which should minimize this problem. Overall, the distribution of interspecies allele frequency differences was consistent with the low-background level of neutral differentiation assessed with amplified fragment length polymorphism (Gagnaire et al. 2009) and microsatellite markers (Wirth and Bernatchez 2003) and with the view that most of the genome is still permeable to gene flow between species. Thus, even though some of the 87 nuclear genes may not be fully diagnostic because alleles from the alternate species possibly segregate at low frequencies in each species, these genes are among the most likely candidates for identifying genetic barriers involved in partial reproductive isolation since they exhibit little or no introgression. In recently diverged species with large effective population sizes such as Atlantic eels (Wirth and Bernatchez 2003), most neutral alleles may not have had enough time to reach reciprocal monophyly since the onset of divergence. Thus, incomplete lineage sorting, in addition to introgression, likely contributes to the low proportion of reciprocally monophyletic loci (Funk and Omland 2003). In such a context, a stochastic steady state has probably not been reached for both neutral and adaptive mutations (Bierne and Eyre-Walker 2004) since advantageous alleles are the only one that may have had enough time to reach fixation. This stochastic steady state will take up to 10Ne generations before being established after the species split (Nei and Li 1976). Therefore, α will probably overestimate the true proportion of substitutions driven by positive selection during this period. This possibility was previously suggested in a comparative study of plants (Gossmann et al. 2010), where the highest values of α were found between the most recently diverged species. Because of this potential bias, our estimates of α may not be directly comparable to the rates of adaptive evolution estimated using well-separated species (Eyre-Walker 2006). Although recent divergence between young species will not generate significant values of α under neutrality, artifactual evidence for positive selection may be obtained if there is an increase in population size and slightly deleterious amino acid polymorphisms (McDonald and Kreitman 1991; Eyre-Walker 2002). In Atlantic eels, however, this possibility is less likely given the demographic contraction recently experienced by both species (Wirth and Bernatchez 2003), which probably reduced the selective constraints on SDMs. Moreover, the hypothesis of a recent population expansion was not supported by the excess of amino acid polymorphisms relative to substitutions in mitochondrial genes (Nachman et al. 1996; Eyre-Walker 2002), as revealed by our negative estimates of αmito. Differences in estimated values of α between nuclear and mitochondrial data sets have previously been observed in other species and attributed to differences in rates of recombination combined with the presence of SDMs (Weinreich and Rand 2000). Our results also suggested that only a small proportion of nonsynonymous mutations may be effectively neutral in Atlantic eels (6–12%) and that nuclear and mitochondrial genes experience similar intensities of selective constraints. These observations are thus consistent with the finding of lower values of α in the nonrecombining mitochondrial DNA (mtDNA; Weinreich and Rand 2000) due to the segregation of SDMs. We found that, on average, nonsynonymous mutations segregate at lower frequencies than synonymous polymorphisms in both species, which confirms that SDMs segregate at low frequencies in the cytoplasmic and nuclear genomes of each species (Fay et al. 2001, 2002; Zhang and Li 2005; Parsch et al. 2008). Since we made the assumption that nonsynonymous polymorphisms are neutral to perform MK-based estimations of α, low-frequency SDMs will probably make MK-related tests more conservative due the recent contraction in effective population size experienced by both species (Bierne and Eyre-Walker 2004). This prediction is consistent with the higher values of α systematically obtained when excluding polymorphisms below 20% to control for the effects of negative selection acting on SDMs, as previously observed in studies on human and Drosophila (Fay et al. 2001, 2002; Bierne and Eyre-Walker 2004). Although removing low-frequency classes from the analysis may still underestimate the rate of adaptive evolution (Charlesworth and Eyre-Walker 2008), this treatment resulted in significant positive values of αnucl in each species, with each of the three MK-related tests. Thus, significant excess of nonsynonymous to synonymous substitutions relative to polymorphisms provides evidence for positive selection acting on at least some nuclear diagnostic genes included in our analysis. In parallel, we did not find any evidence for positive selection acting on our subset of seven mitochondrial genes, even when low- and moderate-frequency mutations were removed from the analysis. Instead, our results suggested that those genes (which do not include the mt-atp6 gene) mostly evolve under strong purifying selection. Indications for the Existence of a Cytonuclear Incompatibility Contrasting with the fact that most of the nuclear genome is still permeable to interspecific gene exchange, the absence of mitochondrial introgression suggests the existence of asymmetrical crossing barriers with respect to sex or postzygotic isolation mechanisms involving one or several mitochondrial genes. Since both American and European mitochondrial haplotypes have been found in Icelandic hybrids (Avise et al. 1990; Frankowski and Bastrop 2010), sex-based directionality in interspecific crosses does not seem to explain the barrier to mitochondrial gene flow. On the other hand, the significant enrichment of genes involved in the ATP synthesis coupled with proton transport process among nuclear diagnostic genes revealed some candidate loci potentially responsible for cytonuclear hybrid dysfunction. Of the 12 nuclear-encoded proteins that compose the ATP synthase complex, 5 were found to be divergent in our transcriptome scan approach. Moreover, among the two mtDNA-encoded proteins in the complex, the ratio of nonsynonymous to synonymous substitutions displayed by the mt-atp6 gene was six times higher than the mean ω value of the 13 mitochondrial genes, indicating either positive selection or relaxed purifying selection. Although demonstrating coadaptation between mtDNA- and nuclear-encoded proteins would require cytonuclear experiments or introgression analyses using natural backcrosses (Blier et al. 2001; Rand et al. 2004), resequencing the mt-atp6 gene and the most divergent nuclear gene in the ATP synthase complex (atp5c1) enabled the use of indirect approaches based on population genetic neutrality tests. Significant excesses of replacement/synonymous substitutions relative to polymorphisms strongly suggested that both the mt-atp6 and the atp5c1 genes have undergone divergent selection between species. Therefore, even if we found that SDMs segregate in the mitochondrial genome of each species, the fixation of mildly deleterious mutations due to relaxed purifying selection seems insufficient to explain the number of nonsynonymous substitutions at the mt-atp6 locus. Moreover, a scenario whereby positive selection has recently driven amino acid replacements at this gene in both species was also supported by a strikingly low level of nucleotide diversity in A. rostrata and a significant excess of low-frequency polymorphisms in A. anguilla. Species with high effective population sizes may experience recurrent selective sweeps (Bazin et al. 2006) provoking transient reductions of nucleotide diversity in the nonrecombining mitochondrial genome. Contrary to genetic drift predictions, this genetic draft effect makes the fixation rate of SDMs increasing with the population size (Gillespie 2001). The fixation of SMDs due to genetic hitchhiking on the mitochondrial chromosome may subsequently elicit the fixation of compensatory mutations within the same gene or within nuclear genes coding for interacting proteins (Rand et al. 2004; Oliveira et al. 2008). Evidence for positive selection at the nuclear gene atp5c1, encoding for a protein that interacts with the ATP synthase F0 subunit 6, may be seen as an indication of such compensatory coadaptation between two interacting genes. Furthermore, species-diagnostic mutations that were found in four other nuclear genes encoding subunits of the ATP synthase raise the possibility that compensatory mutations have also been selected in other nuclear genes implicated in the oxidative phosphorylation complex V. An exhaustive analysis of polymorphism and divergence of all the genes involved in the ATP synthase complex would accordingly provide a better understanding of this substitution pattern (Gibson et al. 2010). However, it appears likely from our results that positive selection has caused the fixation of different amino acid mutations in each species, at least for two ATP synthase subunits, one being encoded by the cytoplasmic and the other by the nuclear genome. Because these putatively species-specific associations of coadapted alleles will be disrupted in F2 hybrids, maladaptive combinations in recombined cytonuclear complexes may cause genetic barriers to interspecific gene flow and thus contribute to reproductive isolation (Burton et al. 2006). Dobzhansky–Muller incompatibilities involving the interaction of mitochondrial and nuclear genes have been found in several organisms including Tigriopus (Willett and Burton 2001; Burton et al. 2006), Drosophila (Sackton et al. 2003; Montooth et al. 2010), Nasonia (Ellison et al. 2008; Niehuis et al. 2008; Oliveira et al. 2008), and Saccharomyces (Lee et al. 2008; Chou et al. 2010). In those species, selection for compensatory mutations in response to the fixation of SMDs by drift or genetic hitchhiking in the mitochondrial genome has been proposed to explain the evolution of postzygotic isolation, which may be accelerated by a higher mitochondrial mutation rate (Oliveira et al. 2008). In Atlantic eels, the mitochondrial to nuclear synonymous substitution ratio of 8.4−21.3 lies between the 2.75 value found in Drosophila (Montooth et al. 2009) and the ratios above 25 estimated in Nasonia (Oliveira et al. 2008) and Tigriopus (Willett and Burton 2004). While this disparity could be explained by differences in mutation rates or efficiency of purifying selection, the lower rate observed in Drosophila also coincides with a lack of fixed cytonuclear incompatibilities between fly species (Montooth et al. 2010). These results suggest that a higher mitochondrial substitution rate could make the evolution of cytonuclear incompatibility more likely. Deleterious epistatic interactions between nuclear-encoded proteins are also known to contribute to reproductive isolation in several model species (Presgraves 2010). Such nuclear–nuclear incompatibilities may be present in Atlantic eels as well, and their accumulation may participate to the build-up of reproductive isolation in the presence of gene flow (Barton and de Cara 2009). Although we did not identify all the genes which are resistant to introgression, the new list of nuclear diagnostic genes revealed in our study sets the stage for further investigations of their possible roles in hybrid breakdown. Nevertheless, the disproportionate mapping of diagnostic markers to the ATP synthase complex may be indicative of the importance of mitochondrial–nuclear conflicts in postzygotic isolation between Atlantic eels and may explain why asymmetrical patterns of introgression have been observed in previous studies (Avise et al. 1990; Albert et al. 2006; Gagnaire et al. 2009). Surprisingly, the direction of introgression at the atp5c1 gene was opposite to that observed for most nuclear markers, which preferentially introgress from the American to the European background. Several explanations may account for this pattern, such as the number of genes involved in the cytonuclear incompatibility, the relative timing of the mutations, their fitness effects, and the recessivity of incompatible interactions (Orr 1995). Asymmetric introgression in the flanking regions of the G211A-V212I double mutation in the exon 7 of atp5c1 resulted in a rapid decay of genetic differentiation and LD on each side of the candidate site. This result, indicative of a relatively short region of genomic differentiation spanning only a few exons and introns around the candidate functional mutation, is consistent with the observation that differentiation does not extend beyond a few kilobases in some model organisms (Turner et al. 2008, 2010; Kolaczkowski et al. 2011). With the increasing sequencing power offered by NGS methods, further screening of the chromosomal signature at genetic incompatibility loci will enable evaluating the generality of this pattern in a context of early divergence between partially reproductively isolated species. In conclusion, we showed that adaptive protein evolution has occurred at several of the few genes showing between-species divergence despite pervasive introgression. This finding is consistent with the action of positive selection in facilitating the establishment and maintenance of genetic barriers to gene flow during the early stages of speciation. Contrasting with the prevailing view that candidate loci detected by genome scans are attributable to ecological adaptation, our transcriptome scan performed in Atlantic eels illustrates the potential of such methods to identify intrinsic factors (Bierne et al. 2011) and ultimately speciation genes contributing to reproductive isolation. We are grateful to the Associate Editor Michael Nachman as well as three anonymous referees for their constructive and helpful comments on the manuscript. We also thank M.M. Hansen, L. Zane and G.E. Maes from the Anguilla Genomics Consortium. This research was supported by the Natural Science and Engineering Research Council of Canada (NSERC) and a Canadian Research Chair in Genomics and Conservation of Aquatic Resources to L.B. P.A.G. was supported by a Government of Canada Post-Doctoral Fellowship. References Akey JM. Interrogating a high-density SNP map for signatures of natural selection, Genome Res. , 2002, vol. 12 (pg. 1805- 1814) Google Scholar CrossRef Search ADS PubMed Albert V, Jónsson B, Bernatchez L. Natural hybrids in Atlantic eels (Anguilla anguilla, A. rostrata): evidence for successful reproduction and fluctuating abundance in space and time, Mol Ecol. , 2006, vol. 15 (pg. 1903- 1916) Google Scholar CrossRef Search ADS PubMed Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. , 1997, vol. 25 (pg. 3389- 3402) Google Scholar CrossRef Search ADS PubMed Altshuler D, Pollara V, Cowles C, Van Etten W, Baldwin J, Linton L, Lander ES. An SNP map of the human genome generated by reduced representation shotgun sequencing, Nature , 2000, vol. 407 (pg. 513- 516) Google Scholar CrossRef Search ADS PubMed Avise JC, Nelson WS, Arnold J, Koehn RK, Williams GC, Thorsteinsson V. The evolutionary genetic status of Icelandic eels, Evolution , 1990, vol. 44 (pg. 1254- 1262) Google Scholar CrossRef Search ADS Axelsson E, Ellegren H. Quantification of adaptive evolution of genes expressed in avian brain and the population size effect on the efficacy of selection, Mol Biol Evol. , 2009, vol. 26 (pg. 1073- 1079) Google Scholar CrossRef Search ADS PubMed Barreto FS, Moy GW, Burton RS. Interpopulation patterns of divergence and selection across the transcriptome of the copepod Tigriopus californicus, Mol Ecol. , 2011, vol. 20 (pg. 560- 572) Google Scholar CrossRef Search ADS PubMed Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps, Bioinformatics , 2005, vol. 21 (pg. 263- 265) Google Scholar CrossRef Search ADS PubMed Barton NH, de Cara MAR. The evolution of strong reproductive isolation, Evolution , 2009, vol. 63 (pg. 1171- 1190) Google Scholar CrossRef Search ADS PubMed Bazin E, Glémin S, Galtier N. Population size does not influence mitochondrial genetic diversity in animals, Science , 2006, vol. 312 (pg. 570- 572) Google Scholar CrossRef Search ADS PubMed Bierne N, Eyre-Walker A. The genomic rate of adaptive amino acid substitution in Drosophila, Mol Biol Evol. , 2004, vol. 21 (pg. 1350- 1360) Google Scholar CrossRef Search ADS PubMed Bierne N, Welch J, Loire E, Bonhomme F, David P. The coupling hypothesis: why genome scans may fail to map local adaptation genes, Mol Ecol. , 2011, vol. 20 (pg. 2044- 2072) Google Scholar CrossRef Search ADS PubMed Blier PU, Dufresne F, Burton RS. Natural selection and the evolution of mtDNA-encoded peptides: evidence for intergenomic co-adaptation, Trends Genet. , 2001, vol. 17 (pg. 400- 406) Google Scholar CrossRef Search ADS PubMed Brideau NJ, Flores HA, Wang J, Maheshwari S, Wang X, Barbash DA. Two Dobzhansky-Muller genes interact to cause hybrid lethality in Drosophila, Science , 2006, vol. 314 (pg. 1292- 1295) Google Scholar CrossRef Search ADS PubMed Brockman W, Alvarez P, Young S, Garber M, Giannoukos G, Lee WL, Russ C, Lander ES, Nusbaum C, Jaffe DB. Quality scores and SNP detection in sequencing-by-synthesis systems, Genome Res. , 2008, vol. 18 (pg. 763- 770) Google Scholar CrossRef Search ADS PubMed Burton RS, Ellison CK, Harrison JS. The sorry state of F2 hybrids: consequences of rapid mitochondrial DNA evolution in allopatric populations, Am Nat. , 2006, vol. 168 (pg. S14- S24) Google Scholar CrossRef Search ADS PubMed Charlesworth J, Eyre-Walker A. The McDonald-Kreitman test and slightly deleterious mutations, Mol Biol Evol. , 2008, vol. 25 (pg. 1007- 1015) Google Scholar CrossRef Search ADS PubMed Chou J-Y, Hung Y-S, Lin K-H, Lee H-Y, Leu J-Y. Multiple molecular mechanisms cause reproductive isolation between three yeast species, PLoS Biol. , 2010, vol. 8 pg. e1000432 Google Scholar CrossRef Search ADS PubMed Coppe A, Pujolar J, Maes GE, Larsen PF, Hansen MM, Bernatchez L, Zane L, Bortoluzzi S. Sequencing, de novo annotation and analysis of the first Anguilla anguilla transcriptome: EeelBase opens new perspectives for the study of the critically endangered European eel, BMC Genomics , 2010, vol. 11 pg. 635 Google Scholar CrossRef Search ADS PubMed Coyne JA, Orr HA. , Speciation , 2004 Sunderland (MA) Sinauer Associates Delaneau O, Coulonges C, Boelle P-Y, Nelson G, Spadoni J-L, Zagury J-F. ISHAPE: new rapid and accurate software for haplotyping, BMC Bioinformatics , 2007, vol. 8 pg. 205 Google Scholar CrossRef Search ADS PubMed Dobzhansky T. Studies on hybrid sterility. II. Localization of sterility factors in Drosophila pseudoobscura hybrids, Genetics , 1936, vol. 21 (pg. 113- 135) Google Scholar PubMed Ekblom R, Galindo J. Applications of next generation sequencing in molecular ecology of non-model organisms, Heredity , 2011, vol. 107 (pg. 1- 15) Google Scholar CrossRef Search ADS PubMed Ellison CK, Niehuis O, Gadau J. Hybrid breakdown and mitochondrial dysfunction in hybrids of Nasonia parasitoid wasps, J Evol Biol. , 2008, vol. 21 (pg. 1844- 1851) Google Scholar CrossRef Search ADS PubMed Eyre-Walker A. Changing effective population size and the McDonald-Kreitman test, Genetics , 2002, vol. 162 (pg. 2017- 2024) Google Scholar PubMed Eyre-Walker A. The genomic rate of adaptive evolution, Trends Ecol Evol. , 2006, vol. 21 (pg. 569- 575) Google Scholar CrossRef Search ADS PubMed Eyre-Walker A, Keightley PD. Estimating the rate of adaptive molecular evolution in the presence of slightly deleterious mutations and population size change, Mol Biol Evol. , 2009, vol. 26 (pg. 2097- 2108) Google Scholar CrossRef Search ADS PubMed Fay JC, Wyckoff GJ, Wu C-I. Positive and negative selection on the human genome, Genetics , 2001, vol. 158 (pg. 1227- 1234) Google Scholar PubMed Fay JC, Wyckoff GJ, Wu C-I. Testing the neutral theory of molecular evolution with genomic data from Drosophila, Nature , 2002, vol. 415 (pg. 1024- 1026) Google Scholar CrossRef Search ADS PubMed Frankowski J, Bastrop R. Identification of Anguilla anguilla (L.) and Anguilla rostrata (Le Sueur) and their hybrids based on a diagnostic single nucleotide polymorphism in nuclear 18S rDNA, Mol Ecol Resour. , 2010, vol. 10 (pg. 173- 176) Google Scholar CrossRef Search ADS PubMed Fu YX, Li WH. Statistical tests of neutrality of mutations, Genetics , 1993, vol. 133 (pg. 693- 709) Google Scholar PubMed Funk DJ, Omland KE. Species-level paraphyly and polyphyly: frequency, causes, and consequences, with insights from animal mitochondrial DNA, Annu Rev Ecol Evol Syst. , 2003, vol. 34 (pg. 397- 423) Google Scholar CrossRef Search ADS Gagnaire P-A, Albert V, Jónsson B, Bernatchez L. Natural selection influences AFLP intraspecific genetic variability and introgression patterns in Atlantic eels, Mol Ecol. , 2009, vol. 18 (pg. 1678- 1691) Google Scholar CrossRef Search ADS PubMed Gagnaire P-A, Normandeau E, Côté C, Hansen MM, Bernatchez L. The genetic consequences of spatially varying selection in the panmictic American eel (Anguilla rostrata), Genetics , 2012, vol. 190 (pg. 725- 736) Google Scholar CrossRef Search ADS PubMed Galindo J, Grahame JW, Butlin RK. An EST-based genome scan using 454 sequencing in the marine snail Littorina saxatilis, J Evol Biol. , 2010, vol. 23 (pg. 2004- 2016) Google Scholar CrossRef Search ADS PubMed Gibson JD, Niehuis O, Verrelli BC, Gadau J. Contrasting patterns of selective constraints in nuclear-encoded genes of the oxidative phosphorylation pathway in holometabolous insects and their possible role in hybrid breakdown in Nasonia, Heredity , 2010, vol. 104 (pg. 310- 317) Google Scholar CrossRef Search ADS PubMed Gilad Y, Pritchard JK, Thornton K. Characterizing natural variation using next-generation sequencing technologies, Trends Genet. , 2009, vol. 25 (pg. 463- 471) Google Scholar CrossRef Search ADS PubMed Gillespie JH. Is the population size of a species relevant to its evolution?, Evolution , 2001, vol. 55 (pg. 2161- 2169) Google Scholar CrossRef Search ADS PubMed Gossmann TI, Song BH, Windsor AJ, Mitchell-Olds T, Dixon CJ, Kapralov MV, Filatov DA, Eyre-Walker A. Genome wide analyses reveal little evidence for adaptive evolution in many plant species, Mol Biol Evol. , 2010, vol. 27 (pg. 1822- 1832) Google Scholar CrossRef Search ADS PubMed Götz S, García-Gómez JM, Terol J, Williams TD, Nagaraj SH, Nueda MJ, Robles M, Talón M, Dopazo J, Conesa A. High-throughput functional annotation and data mining with the Blast2GO suite, Nucleic Acids Res. , 2008, vol. 36 (pg. 3420- 3435) Google Scholar CrossRef Search ADS PubMed Harrison RG. Pattern and process in a narrow hybrid zone, Heredity , 1986, vol. 56 (pg. 337- 349) Google Scholar CrossRef Search ADS Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies, Mol Biol Evol. , 2006, vol. 23 (pg. 254- 267) Google Scholar CrossRef Search ADS PubMed Kolaczkowski B, Kern AD, Holloway AK, Begun DJ. Genomic differentiation between temperate and tropical Australian populations of Drosophila melanogaster, Genetics , 2011, vol. 187 (pg. 245- 260) Google Scholar CrossRef Search ADS PubMed Lee H-Y, Chou J-Y, Cheong L, Chang N-H, Yang S-Y, Leu J-Y. Incompatibility of nuclear and mitochondrial genomes causes hybrid sterility between two yeast species, Cell , 2008, vol. 135 (pg. 1065- 1073) Google Scholar CrossRef Search ADS PubMed Librado P, Rozas J. DnaSP v5: a software for comprehensive analysis of DNA polymorphism data, Bioinformatics , 2009, vol. 25 (pg. 1451- 1452) Google Scholar CrossRef Search ADS PubMed Maheshwari S, Barbash DA. The genetics of hybrid incompatibilities, Annu Rev Genet. , 2011, vol. 45 (pg. 331- 355) Google Scholar CrossRef Search ADS PubMed McDonald JH, Kreitman M. Adaptive protein evolution at the Adh locus in Drosophila, Nature , 1991, vol. 351 (pg. 652- 654) Google Scholar CrossRef Search ADS PubMed Mihola O, Trachtulec Z, Vlcek C, Schimenti JC, Forejt J. A mouse speciation gene encodes a meiotic histone H3 methyltransferase, Science , 2009, vol. 323 (pg. 373- 375) Google Scholar CrossRef Search ADS PubMed Min XJ, Butler G, Storms R, Tsang A. OrfPredictor: predicting protein-coding regions in EST-derived sequences, Nucleic Acids Res. , 2005, vol. 33 (pg. W677- W680) Google Scholar CrossRef Search ADS PubMed Minegishi Y, Aoyama J, Inoue JG, Miya M, Nishida M, Tsukamoto K. Molecular phylogeny and evolution of the freshwater eels genus Anguilla based on the whole mitochondrial genome sequences, Mol Phylogenet Evol. , 2005, vol. 34 (pg. 134- 146) Google Scholar CrossRef Search ADS PubMed Montooth KL, Abt DN, Hofmann JW, Rand DM. Comparative genomics of Drosophila mtDNA: novel features of conservation and change across functional domains and lineages, J Mol Evol. , 2009, vol. 69 (pg. 94- 114) Google Scholar CrossRef Search ADS PubMed Montooth KL, Meiklejohn CD, Abt DN, Rand DM. Mitochondrial-nuclear epistasis affects fitness within species but does not contribute to fixed incompatibilities between species of Drosophila, Evolution , 2010, vol. 64 (pg. 3364- 3379) Google Scholar CrossRef Search ADS PubMed Muller HJ. Isolating mechanisms, evolution, and temperature, Biol Symp. , 1942, vol. 6 (pg. 71- 125) Nachman MW, Brown WM, Stoneking M, Aquadro CF. Nonneutral mitochondrial DNA variation in humans and chimpanzees, Genetics , 1996, vol. 142 (pg. 953- 963) Google Scholar PubMed Nei M. , Molecular evolutionary genetics , 1987 New York Columbia University Press Nei M, Li W-H. The transient distribution of allele frequencies under mutation pressure, Genet Res. , 1976, vol. 28 (pg. 205- 214) Google Scholar CrossRef Search ADS PubMed Niehuis O, Judson AK, Gadau J. Cytonuclear genic incompatibilities cause increased mortality in male F2 hybrids of Nasonia giraulti and N. vitripennis, Genetics , 2008, vol. 178 (pg. 413- 426) Google Scholar CrossRef Search ADS PubMed Nosil P, Funk DJ, Ortiz-Barrientos D. Divergent selection and heterogeneous genomic divergence, Mol Ecol. , 2009, vol. 18 (pg. 375- 402) Google Scholar CrossRef Search ADS PubMed Oliveira DCSG, Raychoudhury R, Lavrov DV, Werren JH. Rapidly evolving mitochondrial genome and directional selection in mitochondrial genes in the parasitic wasp Nasonia (Hymenoptera: Pteromalidae), Mol Biol Evol. , 2008, vol. 25 (pg. 2167- 2180) Google Scholar CrossRef Search ADS PubMed Orr HA. The population genetics of speciation: the evolution of hybrid incompatibilities, Genetics , 1995, vol. 139 (pg. 1805- 1813) Google Scholar PubMed Orr HA, Presgraves DC. Speciation by postzygotic isolation: forces, genes and molecules, Bioessays , 2000, vol. 22 (pg. 1085- 1094) Google Scholar CrossRef Search ADS PubMed Parsch J, Zhang Z, Baines JF. The influence of demography and weak selection on the McDonald-Kreitman test: an empirical study in Drosophila, Mol Biol Evol. , 2008, vol. 26 (pg. 691- 698) Google Scholar CrossRef Search ADS Payseur BA. Using differential introgression in hybrid zones to identify genomic regions involved in speciation, Mol Ecol Resour. , 2010, vol. 10 (pg. 806- 820) Google Scholar CrossRef Search ADS PubMed Phadnis N, Orr HA. A single gene causes both male sterility and segregation distortion in Drosophila hybrids, Science , 2009, vol. 323 (pg. 376- 379) Google Scholar CrossRef Search ADS PubMed Presgraves DC. The molecular evolutionary basis of species formation, Nat Rev Genet. , 2010, vol. 11 (pg. 175- 180) Google Scholar CrossRef Search ADS PubMed Presgraves DC, Balagopalan L, Abmayr SM, Orr HA. Adaptive evolution drives divergence of a hybrid inviability gene between two species of Drosophila, Nature , 2003, vol. 423 (pg. 715- 719) Google Scholar CrossRef Search ADS PubMed Rand DM, Haney RA, Fry AJ. Cytonuclear coevolution: the genomics of cooperation, Trends Ecol Evol. , 2004, vol. 19 (pg. 645- 653) Google Scholar CrossRef Search ADS PubMed Renaut S, Nolte AW, Bernatchez L. Mining transcriptome sequences towards identifying adaptive single nucleotide polymorphisms in lake whitefish species pairs (Coregonus spp. Salmonidae), Mol Ecol. , 2010, vol. 19 (pg. 115- 131) Google Scholar CrossRef Search ADS PubMed Sackton TB, Haney RA, Rand DM. Cytonuclear coadaptation in Drosophila: disruption of Cytochrome c oxidase activity in backcross genotypes, Evolution , 2003, vol. 57 (pg. 2315- 2325) Google Scholar CrossRef Search ADS PubMed Smith NGC, Eyre-Walker A. Adaptive protein evolution in Drosophila, Nature , 2002, vol. 415 (pg. 1022- 1024) Google Scholar CrossRef Search ADS PubMed Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism, Genetics , 1989, vol. 123 (pg. 585- 595) Google Scholar PubMed Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods, Mol Biol Evol. , 2011, vol. 28 (pg. 2731- 2739) Google Scholar CrossRef Search ADS PubMed Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. , 1994, vol. 22 (pg. 4673- 4680) Google Scholar CrossRef Search ADS PubMed Turner TL, Bourne EC, Von Wettberg EJ, Hu TT, Nuzhdin SV. Population resequencing reveals local adaptation of Arabidopsis lyrata to serpentine soils, Nat Genet. , 2010, vol. 42 (pg. 260- 263) Google Scholar CrossRef Search ADS PubMed Turner TL, Levine MT, Eckert ML, Begun DJ. Genomic analysis of adaptive differentiation in Drosophila melanogaster, Genetics , 2008, vol. 179 (pg. 455- 473) Google Scholar CrossRef Search ADS PubMed Watterson GA. On the number of segregating sites in genetical models without recombination, Theor Popul Biol. , 1975, vol. 7 (pg. 256- 276) Google Scholar CrossRef Search ADS PubMed Weinreich DM, Rand DM. Contrasting patterns of nonneutral evolution in proteins encoded in nuclear and mitochondrial genomes, Genetics , 2000, vol. 156 (pg. 385- 399) Google Scholar PubMed Willett CS, Burton RS. Viability of cytochrome c genotypes depends on cytoplasmic backgrounds in Tigriopus californicus, Evolution , 2001, vol. 55 (pg. 1592- 1599) Google Scholar CrossRef Search ADS PubMed Willett CS, Burton RS. Evolution of interacting proteins in the mitochondrial electron transport system in a marine copepod, Mol Biol Evol. , 2004, vol. 21 (pg. 443- 453) Google Scholar CrossRef Search ADS PubMed Wirth T, Bernatchez L. Decline of North Atlantic eels: a fatal synergy?, Proc R Soc B Biol Sci. , 2003, vol. 270 (pg. 681- 688) Google Scholar CrossRef Search ADS Wolf JBW, Lindell J, Backstrom N. Speciation genetics: current status and evolving approaches, Philos Trans R Soc B Biol Sci. , 2010, vol. 365 (pg. 1717- 1733) Google Scholar CrossRef Search ADS Wu C-I, Ting C-T. Genes and speciation, Nat Rev Genet. , 2004, vol. 5 (pg. 114- 122) Google Scholar CrossRef Search ADS PubMed Zhang L, Li W-H. Human SNPs reveal no evidence of frequent positive selection, Mol Biol Evol. , 2005, vol. 22 (pg. 2504- 2507) Google Scholar CrossRef Search ADS PubMed © The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

journal article

LitStream Collection

Modeling Protein Evolution with Several Amino Acid Replacement Matrices Depending on Site Rates

Le, Si Quang;Dang, Cuong Cao;Gascuel, Olivier

2012 Molecular Biology and Evolution

doi: 10.1093/molbev/mss112pmid: 22491036

Abstract Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution patterns. In this paper, we investigate the use of different substitution matrices for different site evolutionary rates. Indeed, the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factors among sites, and there is no reason to assume that the substitution patterns remain identical regardless of the evolutionary rate. We first introduce LG4M, which is composed of four matrices, each corresponding to one discrete gamma rate category (of four). These matrices differ in their amino acid equilibrium distributions and in their exchangeabilities, contrary to the standard gamma model where only the global rate differs from one category to another. Next, we present LG4X, which also uses four different matrices, but leaves aside the gamma distribution and follows a distribution-free scheme for the site rates. All these matrices are estimated from a very large alignment database, and our two models are tested using a large sample of independent alignments. Detailed analysis of resulting matrices and models shows the complexity of amino acid substitutions and the advantage of flexible models such as LG4M and LG4X. Both significantly outperform single-matrix models, providing gains of dozens to hundreds of log-likelihood units for most data sets. LG4X obtains substantial gains compared with LG4M, thanks to its distribution-free scheme for site rates. Since LG4M and LG4X display such advantages but require the same memory space and have comparable running times to standard models, we believe that LG4M and LG4X are relevant alternatives to single replacement matrices. Our models, data, and software are available from http://www.atgc-montpellier.fr/models/lg4x. amino acid substitutions, replacement matrices, gamma and distribution-free rate models, maximum likelihood estimations, phylogenetic inference Introduction Amino acid replacement matrices—20 × 20 matrices containing estimates of the instantaneous substitution rates of any amino acid by another—are essential in most methods to infer protein phylogenies. These matrices are expected to capture the biological and physicochemical properties of amino acids. They are used in distance-based methods to estimate the evolutionary distance—the expected number of substitutions per site—between sequence pairs. In maximum likelihood (ML) and Bayesian methods, they are used to compute substitution probabilities along tree branches and hence the likelihood of the data (see textbooks, e.g., Felsenstein 2003; Yang 2006). The standard approach to infer protein phylogenies is based on the use of a single replacement matrix. Several general matrices estimated from very large sets of taxa and alignments have been proposed since the pioneering work of Dayhoff et al. (1972), notably JTT (Jones et al. 1992), WAG (Whelan and Goldman 2001), and LG (Le and Gascuel 2008). Some studies showed that specific matrices should be used for certain analyses, for example, with membrane (Jones et al. 1994) or mitochondrial (Yang et al. 1998) proteins, but general matrices are usually robust and tend to perform well in many cases (Keane et al. 2006). However, site evolution is highly heterogeneous and depends on many factors such as genetic code, solvent accessibility, secondary and tertiary structure, and protein functions. Most notably, some sites are subject to strong evolutionary pressure and evolve slowly due to their role in the structure or functions of the protein, whereas others are much less constrained and accumulate substitutions rapidly. In the standard approach, this variability is modeled by discrete gamma rate categories, which are used to modulate the (unique) replacement matrix being selected, depending on the site rates (Yang 1993). As site rates are unknown, all rates are envisaged for every site and accounted for thanks to a mixture approach (see textbooks, e.g., Gascuel and Guindon 2007). However, many works revealed that depending on site specificities not only do the global rates vary but also the substitution patterns. Notably, buried sites (typically slow) and exposed sites (typically fast) obey very different matrix models (Koshi and Goldstein 1995; Lio et al. 1998; Goldman et al. 1998; Holmes and Rubin 2002; Le, Lartillot, and Gascuel 2008; Le and Gascuel 2010). To a lesser extent, it was also shown that substitution processes vary among secondary structures (Koshi and Goldstein 1995; Thorne et al. 1996; Le, Lartillot, and Gascuel 2008; Le and Gascuel 2010). All of these works (and others) thus explored site-dependent models using several matrices or profiles. In the profile approach (Koshi and Goldstein 1998; Lartillot and Philippe 2004; Le, Gascuel, and Lartillot 2008), sets of elementary models defined by their amino acid equilibrium frequencies are used; these models rely on simple multinomial processes over the 20 amino acids—analogous to the (Felsenstein 1981) model of DNA substitution—and do not use replacement matrices (or only highly simplified ones). In the multimatrix approach (Koshi and Goldstein 1995; Thorne et al. 1996; Goldman et al. 1998; Le, Lartillot, and Gascuel 2008), different matrices are used for different site categories. The model introduced by Wang et al. (2008) is a compromise between these two approaches as it uses several (full range) matrices that only differ in their amino acid equilibrium distributions (for a similar model, see also Lartillot and Philippe 2004). In all cases, the set of profiles or matrices is combined thanks to a mixture approach or a Hidden Markov Model (HMM; Felsenstein and Churchill 1996; Thorne et al. 1996). In recent studies (Le, Gascuel, and Lartillot 2008; Le, Lartillot, and Gascuel 2008; Wang et al. 2008), this first-level mixture is combined with a second-level mixture corresponding to the standard gamma rate categories. This combination was shown to be quite accurate but is computationally heavy as both the computing time and the memory consumption are roughly proportional (see, e.g., Bryant et al. 2005) to the number of site categories (e.g., 12 with 4 gamma categories and 3 biochemical categories). In this paper, we investigate simpler models, where sites are categorized depending on their evolutionary rate, and different replacement matrices are used for each site category. Indeed, the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factor among sites, and there is no reason to suppose (as in the standard approach) that the substitution pattern remains identical regardless the evolutionary rate. For example, we expect slow sites to be mostly hydrophobic (and fast sites to be hydrophilic), which implies that the amino acid equilibrium frequencies should vary depending on the site rate. Investigated models thus focus on an essential site heterogeneity factor. They refine the standard gamma model by using several different replacement matrices, instead of only one modulated by a global rate. However, these models are less complex than two-level mixtures as they use a single mixture level enabling fair computing times and low memory consumption. We first verify the use of different matrices for different evolutionary rates that follow a discrete gamma distribution. To this end, we estimate a 4-matrix model (LG4M), where each matrix corresponds to one standard gamma rate category (Γ4; Yang 1993). Experimental results show that LG4M outperforms single-matrix models (JTT+Γ4, WAG+Γ4, and LG+Γ4) in terms of tree likelihood and often infers different tree topologies. We then examine the limitation of constraining rates to a gamma distribution by testing a model of four matrices where rates and weights of the four matrices are freely estimated. To this end, we estimate another 4-matrix model (LG4X), where rates and weights are left out of the gamma distribution assumption. Experimental results show that LG4X is significantly better than LG4M and comparable to the two-level mixture models from Le, Lartillot, and Gascuel (2008) and Le and Gascuel (2010) and at the same time much simpler. These results, combined with low computing times and memory consumption, suggest that LG4M and LG4X are relevant alternatives to standard single-matrix models in inferring phylogenetic trees from protein sequences. In the following, we first describe our data, then our models and their estimation procedures, and lastly provide comparisons with independent testing alignments. Data Sets To estimate LG4M and LG4X, we used alignments extracted from the Homology-Derived Secondary Structure of Proteins database (HSSP; Schneider et al. 1997). HSSP comprises ∼50,000 alignments of protein families, each containing an average of ∼550 members. Each alignment is obtained by aligning a protein with known 3D structure in the Protein Data Bank (PDB; Berman et al. 2000) to all its probable sequence homologs in UNIPROT. The protein with known structure is called the “test protein” of the alignment. HSSP alignments contain a huge number of gaps due to absent or unsequenced domains for some proteins. Consequently, we cleaned each alignment by selecting sequences that were well aligned, sufficiently different one from the other, and had 40–99% identities with the test protein. Gapped regions among selected sequences were eliminated using GBLOCKS (Castresana 2000) with default options, and we removed alignments with less than 10 selected sequences or 100 remaining sites. We also left out membrane proteins (based on their presence in the Membrane PDB; Raman et al. 2006) since their amino acid replacement pattern is highly different to that of globular proteins (Jones et al. 1994). Moreover, HSSP is highly redundant because a protein sequence may appear in more than one alignment depending on its homologs with known structure in PDB. Thus, we retained only independent alignments that do not share any sequence. To this end, we used a heuristic algorithm to find a large number of independent alignments containing a large number of sites with few gaps (Le, Lartillot, and Gascuel 2008). This selection procedure resulted in 1,771 nonredundant alignments, with an average of ∼56 sequences and ∼254 sites per alignment, a total of ∼27 million amino acids and less than 0.1% gaps. We randomly picked 1,471 alignments to estimate LG4M and LG4X and used the remaining 300 for model comparison. These alignments were the same as those used to estimate and test our two-level mixtures of profiles and matrices, and our structure-informed models (Le, Gascuel, and Lartillot 2008; Le, Lartillot, and Gascuel 2008; Le and Gascuel 2010). Additional details on the selection procedure are provided in these references, and the training and test alignments are available from http://www.atgc-montpellier.fr/models/lg4x. To assess the performance of our models, we used the 300 HSSP test alignments and another set of independent alignments extracted from TreeBase (Sanderson et al. 1994). This database contains alignments that were produced especially for phylogenetic analyses and thus provide a good benchmark for comparing models meant for phylogenetic reconstruction. Moreover, the use of test alignments from a different database should avoid possible biases induced by some feature specific to our HSSP training alignments. We took all (113) most recently updated TreeBase globular protein alignments and then removed those including too many gaps (>45%) or showing a too high level of sequence divergence (average number of amino acids per site > 8, presence in the ML tree of one or several branches with length > 2.0, or average ML tree branch length > 0.50). We retained 84 alignments, with size ranging from small, single protein alignments (e.g., 7 taxa and 232 sites), to very large concatenated protein alignments (e.g., 62 taxa and 11,544 sites). These TreeBase test alignments are also available from http://www.atgc-montpellier.fr/models/lg4x. Models All amino acid substitution matrices discussed here comply with the general time-reversible model (see textbooks, e.g., Bryant et al. 2005; Yang 2006). Such a matrix is denoted Q = (qxy), where qxy is the substitution rate from amino acid x to amino acid y(≠x); diagonal terms are set such that the row sums are all zero, that is, qxx=−∑y≠xqxy. Thanks to time reversibility, Q can be decomposed into the symmetric exchangeability matrix R=(rx↔y) and the amino acid equilibrium distribution π = (πx), using qxy=πyrx↔y(x≠y). The amino acid distribution (π) may be estimated from the training alignments and is then called the model equilibrium distribution or from the data analyzed (+F option). With single-matrix models, Q and R are normalized such that one time unit corresponds to one substitution per site, that is, ρ=−∑xqxxπx=1.0, where ρ is the global rate of Q. This constraint is released with some multimatrix models (e.g., Le, Lartillot, and Gascuel 2008), where some site categories and matrices are fast with a high global rate and some others are slow with a low ρ value. Here, we use normalized matrices only but modulate their global rate using external parameters with values fitted on the analyzed alignment (see below). A matrix Q then contains 208 free parameters (190 in R + 19 in π − 1 normalization constraint). The likelihood of the data (denoted D) for a given tree T (including branch lengths) and replacement matrix Q is L(T,Q;D)=∏iL(T,Q;Di), (1)where the product runs over all the sites (independence assumption), and L(T,Q;Di) is the likelihood of the data at site i (Di) given T and Q. L(T,Q;Di) is computed efficiently thanks to the pruning algorithm (Felsenstein 1981). Yang (1993) introduced a mixture model based on a single replacement matrix but variable rates across sites following a discrete gamma distribution with K equally weighted rate categories. With K = 4, the data likelihood is given by L(T,Q,α;D)=∏i(14∑k=14L[T,Γ(α,k)Q;Di]), (2)where Γ(α,k) is the kth rate of a discrete gamma distribution with parameter α. The weights (or contributions) of rate categories are all equal to 1/K. Both T and α are estimated by maximizing likelihood (2). Variants of this model include nonequally weighted gamma rate categories (e.g., Susko et al. 2003; Mayrose et al. 2005), an approach that could be further investigated to improve models presented here. Multimatrix models were proposed by several authors (e.g., Koshi and Goldstein 1995; Thorne et al. 1996; Goldman et al. 1998) to account for the secondary structure and solvent accessibility. With such models, the data likelihood in a mixture context is expressed as L(T,Q={Q1,…,QM},w={w1,…,wM};D)=∏i(∑m=1MwmL[T,Qm;Di]), (3)where M is the number of matrices and wm is the weight of matrix Qm, with constraint ∑m=1Mwm=1. Recent works (e.g., Le, Lartillot, and Gascuel 2008; Wang et al. 2008) combined Yang's model (2) with the above (3) multiple-matrix model: L(T,Q={Q1,…,QM},w={w1,…,wM},α;D)=∏i(∑m=1MwmK∑k=1KL[T,Γ(α,k)Qm;Di]), (4)where constraint ∑m=1Mwm=1 still holds. Equation (4) expresses two levels of mixture, one for gamma distributed rate categories and one for multiple substitution matrices. In this framework, we introduced several supervised and unsupervised models, for example, (supervised) EX2 with two matrices for buried and exposed sites and (unsupervised or “blind”) UL3 based on three matrices that were estimated without a priori knowledge on site categorization (Le, Lartillot, and Gascuel 2008). The same framework was used by Wang et al. (2008) in a 5-matrix model, where all matrices were based on the same JTT or WAG exchangeability matrix (R) but used different amino acid equilibrium distributions (π). Although the above models (EX2, UL3, etc.) perform well and provide high likelihood values, they are computationally expensive in terms of both computing time and memory consumption. This is mainly due to their high number of site categories, for example, 12 with UL3 and 4 gamma categories. In this paper, we explore simplifications of equation (4). In our LG4M model, we assume four equally weighted gamma rate categories and use four matrices, one for each rate category. Let Q = {Q1,Q2,Q3,Q4} be the set of these four matrices, where Q1 stands for the matrix corresponding to slowest category and Q4 that of the fastest one. Data likelihood is expressed as L(T,Q,α;D)=∏i(14∑k=14L[T,Γ(α,k)Qk;Di]). (5) Mathematically speaking, this model (5) is a compromise between Yang's model (2) and two-level mixture models (4). Instead of sharing the same matrix as in Yang's model, each rate has its own matrix, and each matrix is applied only to one rate category instead of being applied to all rates as in two-level mixture models. This model (5) is thus more general than Yang's model but keeps the same free parameters to be estimated from the data (i.e., α and T) as in Yang's model. From a biological standpoint, the simplification from two-level mixture (4) to model (5) means that the main heterogeneity factor among sites is their evolutionary rate, an assumption that will be tested in the Performance Comparison section. Model LG4M in equation (5) constrains the site rates using a discrete gamma distribution. In our LG4X model, we generalize LG4M by removing this constraint: L(T,Q,ρ={ρ1,ρ2,ρ3,ρ4},w={w1,w2,w3,w4};D)=∏i(∑k=14wkL[T,ρkQk;Di]), (6)where wk and ρk are the weight and rate of matrix Qk such that ∑k=14wk=1 and ∑k=14wkρk=1. The latter normalization constraint is needed to get 1.0 substitution per site within one time unit, just as in standard single-matrix models (this normalization is implicit in LG4M). This model thus involves three free parameters among weights wk plus three free parameters among rates ρk, which are estimated by maximizing likelihood (6) on the data set analyzed. Model Estimation We have a set of N protein alignments denoted D= {D1, …,DN}, where Da is an alignment. We aim to estimate a 4-matrix model Q* = (Q1*,Q2*,Q3*,Q4*) that maximizes the likelihood of D: Q*=arg maxQ=(Q1,Q2,Q3,Q4),T,P,W{∏a=1NL(Ta,Q,ρa,wa;Da)}, (7)where T=(T1,…,TN), P=(ρ1,…,ρN), and W=(w1,…,wN) are the trees, rates, and weights of the N alignments, respectively; L(Ta,Q,ρa,wa;Da) is the likelihood of Da given model Q, tree Ta, rates ρa=(ρ1a,…,ρ4a), and weights wa=(w1a,…,w4a). Thus, to estimate Q*=(Q1*,Q2*,Q3*,Q4*), we also need to estimate T*, P*, and W*, which optimize likelihood (7). We are interested in two 4-matrix models: LG4M where L(Ta,Q,ρa,wa;Da) is calculated using equation (5) and LG4X that is based on equation (6). For each alignment Da, ρa, and wa of LG4M follow a discrete gamma distribution with four equally weighted rate categories, whereas in LG4X, these parameters are freely estimated such that ∑14wka=1 and ∑14wkaρka=1, without any additional constraint. Following (Whelan and Goldman 2001; Le and Gascuel 2008; Le, Lartillot, and Gascuel 2008), we estimate Q* = (Q1*,Q2*,Q3*,Q4*) from equation (7) in two steps: 1) given a fixed starting value of Q, we estimate T*, P*, and W* by maximizing likelihood (5) (LG4M) or (6) (LG4X); 2) we then estimate Q* = (Q1*,Q2*,Q3*,Q4*) using equation (7) with respect to T*, P*, and W* values obtained in step (1). These two steps are iterated one after the other until no more improvement of T*, P*, W*, and Q* is found. Since trees, rates, and weights of alignments in D are independent of one another, we optimize T*, P*, and W* for each alignment Da independently: ∀Da:(Ta,ρa,wa)=arg maxT,ρ,w{L(T,Q,ρ,w;Da)}. (8) For this purpose, we use an adaptation of PhyML 3.0 (Guindon et al. 2010) that is described below. Having obtained T*, P*, and W*, we search for Q* that maximizes the likelihood of the data given T*, P*, and W*: Q*=(Q1*,Q2*,Q3*,Q4*)=arg maxQ=(Q1,Q2,Q3,Q4){∏a=1NL(Ta,Q,ρa,Ea;Da)}. (9) It is impractical to optimize (Q1*,Q2*,Q3*,Q4*) directly from equation (9) due to the huge number (4 × 208) of free parameters in Q. Consequently, we use the approximate learning method proposed in Le and Gascuel (2008) and Le, Lartillot, and Gascuel (2008), where Q*=(Q1*,Q2*,Q3*,Q4*) is handled by simplifying the site likelihood in equation (9) using the site rate category with maximum posterior probability (MAP) only, instead of summing overall rate categories, that is, Q*=(Q1*,Q2*,Q3*,Q4*)=arg maxQ=(Q1,Q2,Q3,Q4){∏a∏iL(Ta,ρciaQci;Dia)}, (10)where Dia is the ith site of alignment Da, ci is the MAP rate category (computed during tree estimation) for site Dia, and ρci is the rate of ci corresponding to Qci substitution matrix. Equation (10) can then be rewritten as ∀k=1…4, Qk*=arg maxQk{∏a∏i:ci=kL(Ta,ρkaQk;Dia)}. (11) In other words, every Qk is estimated independently. To achieve these estimations, we used XRate (Holmes and Rubin 2002; Klosterman et al. 2006) with the same search options as in Le and Gascuel (2008) and Le, Lartillot, and Gascuel (2008). Notably, we used the forgiven option (with 3 jumps) to escape from local optima. XRate is able to deal with mixtures, instead of using our simplifying MAP-based approach (11). However, we observed that using MAP in this estimation context is much faster, less affected by local optima, and tends to provide better results (Le and Gascuel 2008). This is why here we adopted the same strategy, which is close to Viterbi's approximation that proved to be both efficient and accurate to estimate HMMs (Durbin et al. 1998). To perform these computations and use our new models to infer trees, we adapted PhyML 3.0 (Guindon et al. 2010) to LG4X and LG4M. This dedicated version is called PhyML-4X in the following. The adaptation of PhyML to LG4M is just a trigger so that the program selects the correct matrix for each rate category. The other parts (e.g., to optimize α or to search tree topologies) are kept the same as in standard PhyML. In the case of LG4X, we reused the optimization module from (Le, Lartillot, and Gascuel 2008) to optimize weights (w1, …,w4) and rates (ρ1, …,ρ4), alternating monodimensional Brent optimization of every variable until global convergence. To account for constraint ∑wm=1, we use the variable change wi=evi/∑evm and then optimize the vis using Brent. The second constraint ∑wmρm=1 is fulfilled by rescaling rates and branch lengths before returning the final tree. To accelerate the calculations, the starting tree, rates, and weights (=0.25) are first estimated with LG4M, which involves a single (α) parameter to be optimized instead of six. Figure 1 summarizes the whole estimation procedure. Both LG4M and LG4X are initialized starting from the LG matrix. LG4M uses a supervised approach where each matrix is associated with the same gamma rate category throughout the optimization procedure; for example, the “Fast” matrix is systematically associated with the highest rate, among four gamma-distributed rates. LG4X is estimated in a semisupervised way. During the first step, sites are categorized based on the rate (associated to LG) providing the highest likelihood value. During subsequent steps, sites are categorized based on the (rate, matrix) pair with highest likelihood. In most cases, the “Fast” matrix is associated with the highest rate, and the same holds with other matrices. However, since rates (and weights) are estimated independently for each alignment without any a priori constraint, it may occur for some data sets that the “Fast” matrix is actually associated with a slow rate and vice versa (see table 1 and supplementary tables and figures at http://www.atgc-montpellier.fr/models/lg4x). Thus, this semisupervised procedure provides a measure of the importance of the site rate factor. If the initial rate-based site categorization and matrix interpretation disappeared during the subsequent training steps, this would mean that site rate is not a heterogeneity factor of first importance and that other more important factors exist. If (as is the case), the initial rate-based site categorization and matrix interpretation are (mostly) preserved all along training steps, this implies that the rate factor is of first importance, as we assumed in this study. Table 1. Main Features of LG4M and LG4X Replacement Matrices. Very Slow Slow Medium Fast LG4M LogCor/LG 0.813 0.874 0.957 0.917 ClosestMatrix Intermediate (0.828) Buried (0.914) Intermediate (0.966) Exposed (0.986) Hydro −0.492 1.249 0.219 −1.682 Average rate 0.145 0.440 0.952 2.463 5% quantiles 0.035/0.315 0.257/0.673 0.826/1.053 1.938/2.882 LG4X LogCor/LG 0.847 0.853 0.898 0.897 ClosestMatrix Buried (0.880) Buried (0.885) Intermediate (0.946) Exposed (0.987) Hydro 0.934 0.325 −0.816 −1.815 Average weight 0.313 0.332 0.233 0.122 5% quantiles 0.180/0.418 0.185/0.419 0.145/0.375 0.019/0.251 Average rate 0.289 0.770 1.370 3.420 5% quantiles 0.084/0.441 0.394/1.209 0.800/2.232 1.406/5.317 Rate distribution 84/0/0/0 0/75/6/3 0/9/73/2 0/0/5/79 Very Slow Slow Medium Fast LG4M LogCor/LG 0.813 0.874 0.957 0.917 ClosestMatrix Intermediate (0.828) Buried (0.914) Intermediate (0.966) Exposed (0.986) Hydro −0.492 1.249 0.219 −1.682 Average rate 0.145 0.440 0.952 2.463 5% quantiles 0.035/0.315 0.257/0.673 0.826/1.053 1.938/2.882 LG4X LogCor/LG 0.847 0.853 0.898 0.897 ClosestMatrix Buried (0.880) Buried (0.885) Intermediate (0.946) Exposed (0.987) Hydro 0.934 0.325 −0.816 −1.815 Average weight 0.313 0.332 0.233 0.122 5% quantiles 0.180/0.418 0.185/0.419 0.145/0.375 0.019/0.251 Average rate 0.289 0.770 1.370 3.420 5% quantiles 0.084/0.441 0.394/1.209 0.800/2.232 1.406/5.317 Rate distribution 84/0/0/0 0/75/6/3 0/9/73/2 0/0/5/79 Note.—The four matrices of LG4M and LG4X are ranked according to their average rates as “Very Slow,” “Slow,” “Medium,” and “Fast.” “LogCor/LG” is the Pearson correlation coefficient of the log-entries in the given matrix with those of LG. “ClosestMatrix” is the matrix among “Buried,” “Intermediate,” and “Exposed” matrices from EX3 (Le, Lartillot, and Gascuel 2008) that is closest to the given matrix based on the correlation of the log-entries (value in parentheses). “Hydro” is the average hydropathy index (Kyte and Doolittle 1982) of the 20 amino acids with weights equal to their equilibrium frequencies in the given matrix. “Weight” is the weight (w) of the given matrix, averaged over the 84 TreeBase testing alignments. “Rate” is the average rate (ρ) among TreeBase alignments. “5% quantiles” provide the 5th and 80th rate and weight values among these 84 alignments. “Rate distribution” is the number of alignments in which a given matrix is ranked (based on its estimated rate) as very slow/slow/medium/fast; for example, with “Slow”, we see that this matrix is never ranked as the slowest matrix, 75 times as the second slowest, 6 times as medium, and 3 times as the fastest matrix. Similar statistics are obtained with the 300 HSSP test alignments (see supplementary tables and figures at http://www.atgc-montpellier.fr/models/lg4x). View Large FIG. 1. View largeDownload slide Algorithm for estimating LG4M and LG4X. Note: Tree likelihood is calculated by PhyML-4X in step 3 using equation (5) for LG4M and equation (6) for LG4X. In step 6, Qk ≈ Qk* is measured by the sum of squared entry differences, to be <0.1. FIG. 1. View largeDownload slide Algorithm for estimating LG4M and LG4X. Note: Tree likelihood is calculated by PhyML-4X in step 3 using equation (5) for LG4M and equation (6) for LG4X. In step 6, Qk ≈ Qk* is measured by the sum of squared entry differences, to be <0.1. As all Expectation–Maximization (EM) approaches, XRate is sensitive to starting parameter values. For computing-time reasons (LG4X required nearly a week to be estimated and much more to be tested and compared with other models), we did not try alternative starting matrices and training strategies. However, based on our previous experiments with LG where XRate performed remarkably well (Le and Gascuel 2008), we are confident that LG4M (estimated in a supervised manner) should be relatively stable and insensitive to the starting matrix used to initiate the training procedure. On the other hand, we also observed (Le, Lartillot, and Gascuel 2008) that semisupervised training of mixture models is more sensitive to the choice of the starting point. This suggests that LG4X could likely be improved using other starting points or training strategies. LG4M and LG4X Matrices and Models LG4M and LG4X matrices (estimated as described above) are available at http://www.atgc-montpellier.fr/models/lg4x, along with additional information and statistics. Here, we discuss the main features of these matrices and models that make them better than single-matrix models, especially LG4X thanks to its rate distribution-free scheme. Table 1 provides summary statistics and figure 2 shows some illustrative matrices. It can be seen that LG4M and LG4X matrices clearly depart from LG. The correlation of the log-entries with those of LG (LogCor/LG; table 1) is below 0.9 in most cases. LG4M “Medium” is a noticeable exception (LogCor/LG = 0.957), which is somewhat expected as this matrix is used for intermediate sites with evolutionary rates close to 1. For each matrix, table 1 provides its global hydropathy (Hydro), computed as the average hydropathy index (Kyte and Doolittle 1982) of the 20 amino acids with weights equal to their equilibrium frequencies in the given matrix. This index also points to a clear difference between the new matrices and LG (Hydro = −0.253). Most of the matrices (e.g., LG4X “Fast” = −1.815 or LG4M “Slow” = 1.249) are clearly hydrophilic or clearly hydrophobic, with hydropathy values close to that of “Exposed” (−1.993) or “Buried” (1.715) matrices from our EX3 model (Le, Lartillot, and Gascuel 2008). These results and measures support our working hypothesis that the substitution patterns differ depending on the site rates. Modulating a unique replacement matrix (e.g., LG) using gamma-distributed rates appears to be an oversimplification. FIG. 2. View largeDownload slide LG4M and LG4X replacement matrices. Note: Amino acids are ranked from highly hydrophilic (R) to highly hydrophobic (I) based on the hydropathy index (Kyte and Doolittle 1982). Bubble sizes are proportional to the replacement rates. The horizontal axis displays the original amino acid, the vertical axis the new one resulting from replacement. X1: “Very Slow” LG4X matrix; X4: “Fast” LG4X matrix (highly similar to “Fast” LG4M matrix); M1: “Very Slow” LG4M matrix; LG is provided as a reference average matrix. FIG. 2. View largeDownload slide LG4M and LG4X replacement matrices. Note: Amino acids are ranked from highly hydrophilic (R) to highly hydrophobic (I) based on the hydropathy index (Kyte and Doolittle 1982). Bubble sizes are proportional to the replacement rates. The horizontal axis displays the original amino acid, the vertical axis the new one resulting from replacement. X1: “Very Slow” LG4X matrix; X4: “Fast” LG4X matrix (highly similar to “Fast” LG4M matrix); M1: “Very Slow” LG4M matrix; LG is provided as a reference average matrix. “Very Slow” matrices show a remarkable pattern, especially that of LG4X (fig. 2), which is mostly used to express high replacement rates between amino acid pairs that are biochemically very similar, for example: R and K (positively charged), D and E (negatively charged), and F and Y (aromatic). These three pairs are very close in the genetic code, requiring only one nucleotide change to mutate amino acid into the other. Interestingly, some of these pairs are highly hydrophilic (e.g., R and K), which contradicts the first intuition that very slow sites should be all buried and hydrophobic. However, to avoid misinterpretation, it has to be noted that rates displayed in figure 2 are relative rates; all matrices are normalized and do not incorporate the fact that very slow sites globally evolve slower (∼6 times on average; table 1) than fast sites; for example, the absolute rate (accounting for this global factor of ∼6) between R and K is nearly symmetrical and almost the same in the “Very Slow” and “Fast” matrices of LG4X. In other words, R–K replacements are fast in all rate categories, including the slowest one. The “Very Slow” LG4M matrix is less contrasted than the LG4X matrix and deals with other amino acid groups, also very close biochemically and in the genetic code, for example: I, L, and V (aliphatic) and S and T (tiny and polar). The latter amino acids are focused in the LG4X “Slow” matrix, whereas the LG4M “Slow” matrix mainly deals with tiny and nearly neutral amino acids (A, G, S, and T) and the I, V pair (supplementary tables and figures at http://www.atgc-montpellier.fr/models/lg4x). The “Very Slow” matrices are thus used to express the fact that even in very slow sites, substitutions between highly similar amino acids are likely to occur. Their contents may be seen as being similar to that of the profiles in the CAT model (Lartillot and Philippe 2004; Le, Gascuel, and Lartillot 2008). The “Slow” matrices are analogous but less contrasted. Moreover, the “Slow” matrix of LG4M is relatively close to Buried from EX3 (table 1 and above hydropathy values), indicating (as expected) that buried sites and slow sites are often the same. However, both LG4M and LG4X “Very Slow” matrices partly contradict this basic fact, as the LG4X “Very Slow” matrix focuses on some hydrophilic pairs and the LG4M “Very Slow” matrix is slightly hydrophilic (Hydro = −0.429). An explanation of this finding could be that both LG4X and LG4M “Very Slow” matrices are strongly influenced by the genetic code (see above examples), which intervenes first in the mutational process (before the physicochemical constraints and selection) and favors substitutions between amino acids that are not necessarily hydrophobic. The “Medium” matrix of LG4M is correlated with both LG and “Intermediate” matrix from EX3 and is thus mostly used for standard sites with average rates and solvent accessibility. The “Medium” matrix of LG4X is also correlated with “Intermediate” but to a lesser extent than LG4M “Medium”, and its global hydropathy is relatively low (−0.816) compared with that of LG (−0.253) and that of LG4M “Medium” (0.219). Lastly, the “Fast” matrices of LG4X and LG4M are very close (correlation of the log-entries = 0.994) and quite similar to “Exposed” from EX3 (correlation of the log-entries ≈ 0.99). As expected, fast sites and exposed sites are often the same. Moreover, “Fast” matrices show a relatively low contrast (fig. 2) and allow for all possible substitutions, with a preference for substitutions between amino acids with similar hydrophathy. All together we thus see (as expected) a clear correlation between evolutionary rate and solvent accessibility; for example, LG4M “Slow” is close to “Buried”, whereas both “Fast” matrices are close to “Exposed”. However, the matrices of LG4M and LG4X account for other features of the substitution processes; for example, LG4X “Very Slow” is weakly correlated with Buried but focuses on specific highly exchangeable amino acid pairs, some highly hydrophilic (e.g., R and K). The variability among all these replacement matrices demonstrates the complexity of amino acid substitutions and explains why a single matrix has limited capacity in modeling such complex processes. Analysis of rates across sites further illustrates the difficulty involved in substitution modeling and the advantage of flexible models such as LG4X. With LG4M, the α value of the gamma parameter is significantly higher than with LG (respectively 0.866 and 0.584 on average; this ordering of α values is observed with all but five alignments, see supplementary tables and figures at http://www.atgc-montpellier.fr/models/lg4x. This is an expected outcome, as part of the rate variability is taken into account in LG4M by the use of four different matrices. With LG4X, the matrix rates show a different picture. While with LG4M, each of the four matrices is pretty much associated with the same rate, with LG4X, the rates differ significantly depending on the data set analyzed, to the point that the “Slow” matrix is sometimes (three cases among 84 TreeBase test alignments; table 1) associated with the fastest rate. It is worth to note that rates and weights in equation (6) are optimized for each data set analyzed, without constraining the rates to be ordered depending on the matrix with which they are associated. In the same way, the weights of site categories are highly variable; for example, with TreeBase test alignments, the “Fast” weight varies from ∼0.0 to ∼0.25 (table 1). Results with HSSP test alignments (supplementary tables and figures at http://www.atgc-montpellier.fr/models/lg4x) are similar but show less variability; for example, the “Slow” matrix is only once the fastest one among 300 alignments, instead of 3/84 with TreeBase. This indicates that the flexibility of LG4X is less useful with HSSP than with TreeBase, which can be expected since LG4X was estimated from HSSP. The gains obtained by LG4X over LG4M (see Performance Comparisons) are thus explained by the high flexibility of LG4X, which is more powerful than the association of a single matrix with a distribution-free scheme of rates across sites, whose performance is somewhat disappointing (Susko et al. 2003; Mayrose et al. 2005; see results of LG+Φ4 below). Here, each matrix corresponds (to some extent) to many/few and fast/slow sites, depending on the protein analyzed. This clearly shows that substitutions are not just Markovian with a fixed pattern (replacement matrix) modulated by site-dependent rates. As in other site-dependent models, we have different categories of sites corresponding to different matrices and tendencies to be slow or fast, but no strict constraint to be so. This further illustrates the finding by many authors (e.g., Keane et al. 2006) that substitution patterns may be very different in different proteins. The payoff of LG4X flexibility is that matrices in this model are not fully interpretable as “Very Slow”, “Slow”, “Medium” and “Fast” matrices; for example, the “Slow” matrix is sometimes the fastest one, and this feature most likely impacts its coefficient values. However, the average rates of these four matrices (0.29, 0.77, 1.37, and 3.42, respectively) clearly correspond to their natural interpretation. It must be emphasized that this ranking and global interpretation are obtained through our semisupervised learning procedure (see above and fig. 1), where only the first step accounts for site rates while further optimization steps are performed in a blind manner, clustering the sites based on their preferred matrix without reference to their rate. The fact that the four LG4X matrices are still clearly correlated to rate categories after this (6-step) phase of blind learning illustrates (if needed) that the evolutionary rate is a major factor in modeling substitution processes. Model Comparisons In the following, we assess the performance of the new models LG4M and LG4X by comparing them with existing models using 84 TreeBase and 300 HSSP test alignments (see Data Sets). The following models are compared. Single-Matrix Models: JTT, WAG, LG These standard matrices are used with four categories of gamma-distributed rates across sites (+Γ4 option, not indicated below for conciseness). To assess the use of a gamma distribution with four discrete rate categories, we ran LG−Γ (constant site rate); LG+Γ3, LG+Γ6, and LG+Γ8 with 3, 6, and 8 gamma rate categories, respectively; LG+Φ4 (free distribution of site rates with four categories, just as in LG4X but using a single LG matrix). Moreover, we tested LG+F, where the amino acid frequencies are estimated from the studied alignment, instead of being assigned to the default, average frequencies of LG. LG+F was used with the +Γ4 option. LG+F should better fit the specificities of the data being analyzed, but is penalized by the large number of extraparameters (frequencies) to be estimated. In total, LG+F has 20 free parameters (1 gamma + 19 frequencies); LG+Φ4 has 6 free parameters (3 rates + 3 weights); LG, JTT, and LG (+Γ3, +Γ4, +Γ6, +Γ8) have 1 free (gamma) parameter; LG−Γ has 0 free parameter. Two-Level Mixture Models: EX2, UL3, EXEHO EX2 involves two first-level categories of sites, based on solvent accessibility; its two matrices were estimated in a supervised manner from sites being classified as Buried and Exposed in HSSP. UL3 has three first-level categories of sites; it was estimated in a fully unsupervised manner, starting from three random matrices. EXEHO has six first-level categories of sites, crossing solvent accessibility (two categories) with secondary structure (three categories: Extended, Helix, and Other); EXEHO was learned in a supervised manner from HSSP sites categorized accordingly. First-level categories in EX2, UL3, and EXEHO were combined in this study with four second-level gamma-distributed rate categories (+Γ4 option); for example, EXEHO has 6 × 4 = 24 site categories in total. EX2 and UL3 proved to be our best mixture models with 2 and 3 (first-level) categories, respectively, and UL3 was even better than the CAT60 model that involves 60 first-level categories (profiles) and thus 240 categories in total with the +Γ4 option (Le, Lartillot, and Gascuel 2008). Among mixture models, EX2 and UL3 were only beaten by EXEHO (Le and Gascuel 2010), but this latter requires high memory consumption and running time with large data sets due to its 24 site categories; for this reason, EXEHO was not run on TreeBase test alignments, some being very large but on HSSP alignments only. We also tested the model by Wang et al. (2008) but encountered several difficulties when running their program and do not provide results for this model (e.g., QmmRAxML did not finish searching for 10/84 alignments after 3 weeks). EX2, UL3, and EXEHO require estimating the gamma rate parameter from the data plus the proportions of first-level categories, that is, 2, 3, and 6 free parameters, respectively. Confidence-Based Models: EX2/S, EXEHO/S The previous models are mixtures. EX2 and EXEHO use categories of sites having a structural meaning and matrices estimated from sites categorized based on their structural properties, but to infer phylogenies these two models do not use any structural information. The likelihood of every site is computed within each category and then averaged, as expressed in equation (4). On the contrary, EX2/S and EXEHO/S use structural information on the analyzed data set. Basically, the likelihood of each site is computed based on its known structural category, as in the standard partition approach. However, since structural information may be erroneous or inappropriate in a phylogenetic context, we refined this approach by introducing a confidence coefficient, estimated from the analyzed data set, which expresses a trade-off between the standard mixture (no structural information is available) and partition (structural information is fully reliable and relevant) models. These models are described in details in Le and Gascuel (2010), where they are called EX2_CONF/MIX and EX_EHO_CONF/MIX. Here, we use EX2/S and EXEHO/S to make it clear that they benefit (contrary to all other models) from structural information. Both are combined with four gamma rate categories (+Γ4 option). They were run on HSSP test alignments only, as no structural information is available in TreeBase. EX2/S and EXEHO/S involve 3 (1 gamma + 1 confidence + 1 category proportion) and 6 (1 gamma + 1 confidence + 5 category proportions) free parameters to be estimated from the data. Single-Level Mixture Models: LG4M and LG4X These are the two new models proposed in this paper. Both involve 4 site categories in total (to be compared with the 24 categories of EXEHO and the 240 of CAT60, remembering that the computing time and memory consumption are strongly correlated to the number of categories). Rates in LG4M are gamma distributed, whereas LG4X uses a distribution-free scheme. LG4M has 1 (gamma) free parameter; LG4X has 6 (3 rates + 3 weights) free parameters. Comparison Criteria and Methods Our aim was to compare the performance of all these models, regarding likelihood and topological criteria. To infer trees, we used: the last version of PhyML 3.0 (Guindon et al. 2010) for LG, JTT, and WAG; PhyML-Structure (Le and Gascuel 2010) for EX2, UL3, EXEHO, EX2/S, and EXEHO/S; and our adaptation (PhyML-4X) of PhyML 3.0 for LG4M, LG4X, and LG+Φ4. All programs were run with BioNJ (Gascuel 1997) starting tree and subtree pruning and regrafting (SPR) tree searching. Since these models involve different numbers of free parameters, we measured their fitness to data using the AIC criterion (Akaike 1974): AIC(M,Da)=−2LL(M,Ta;Da)+2# parameters(M),where LL(M,Ta;Da) is the log-likelihood of alignment Da given model M and inferred tree Ta; #parameters(M) is the number of free parameters of model M. The AIC criterion has to be minimized; best scores are given to models with low numbers of free parameters and high likelihood values. All tested models involve one parameter (length) per tree branch plus the model parameters detailed in previous section. For every model M studied, we computed the average AIC per site for all alignments in test set A: AIC/site(M,A)=∑a∈AAIC(M,Da)∑a∈Asa,where sa is the number of sites in Da. To complete this global average result, we performed pairwise model comparisons and counted the number of alignments Da, where AIC(M1,Da) < AIC(M2,Da) (i.e., M1 fits Da better than M2) for a given model pair M1, M2. To assess the statistical significance of the observed difference between M1 and M2 for any given alignment, we used a Kishino–Hasegawa (KH; 1989) test with P < 0.01. As the number of free parameters between M1 and M2 may differ, we used AIC penalized likelihood values. This test is essentially the same as that used to compare phylogenies. We use the RELL bootstrap to estimate the distribution of the test statistic under the null hypothesis that both models are equivalent, but incorporate the number of parameters of each model in this statistic just as in the AIC criterion (for explanations and justifications of this test, see Shimodaira 1997). We compared the lengths of inferred trees, that is, the sums of their branch lengths. It has been suggested that best models tend to produce longer trees capturing more hidden substitutions (e.g., Pagel and Meade 2005). We also compared the topologies of inferred trees. Indeed, if the new models produced the same topologies as the existing models, the effort of introducing new models would be rather useless. Unfortunately, the true tree is usually unknown with real data (as opposed to simulated data), and thus, it is hard to assess the topological accuracy induced by any tree-building approach in a realistic setting. Here, we studied the topological impact of our new models, that is, whether or not using these models enables us to frequently infer trees that differ from those inferred with standard models. When comparing models M1 and M2, we counted the number of alignments where the inferred topology using M1 differs from that obtained using M2. Both topologies were also compared using the Robinson and Foulds (RF; 1979) distance, which is the number of branches (bipartitions) that belong to one tree but not to the other. When different topologies are found, one should prefer the one with best likelihood value or best AIC (or similar criterion) value, when evolutionary models used for tree inference involve different numbers of parameters. However, the difference may be slight and nonsignificant, so one cannot reject the topology with a lower fit to data. We thus counted the number of alignments where M1 and M2 topologies differ and where M1 is significantly better (worse) than M2, using a KH test on AIC penalized likelihood values with P < 0.01. Lastly, we checked that the observed topological differences comprised some branches with significant support. Indeed, the topological impact would be low if all differences corresponded to poorly supported branches. To this end, we performed bootstrap analyses and counted the number of branches with notable bootstrap support (BP1 ≥ 50%) in one tree, which were not recovered in the other tree, or had a much lower support in this tree (BP2 + 50% ≤ BP1). For example, one branch with BP1 = 40% was not counted, even when it was not recovered in the other tree; on the contrary, one branch with BP1 = 80% in one tree was counted when it was found in the other tree with BP2 = 20%. This measure (first introduced in Le and Gascuel 2010) thus summarizes the topological and branch support differences. We used only 50 bootstrap replicates for computing time reasons, but this suffices for our gap of 50% between BP1 and BP2 to be highly significant (P value ∼ 0.0 using a Z-test for two proportions). Moreover, 50% of bootstrap support was shown to be optimal in terms of topological error (Berry and Gascuel 1996; see also Holder et al. 2008). As this procedure is computationally heavy (even with 50 replicates), we analyzed the 63 smallest TreeBase alignments only. All (300, relatively small) HSSP alignments were analyzed. Fitness Comparison Comparisons were performed on 84 TreeBase and 300 HSSP test alignments. Both sets are independent of the training alignments and thus provide fair estimations of model performance. Moreover, using TreeBase should avoid possible biases induced by some of the specificities of HSSP alignments used to train our models. Tables 2 (TreeBase) and 3 (HSSP) display comparisons between all models listed above. Figures 3 (TreeBase) and 4 (HSSP) show the progress in the AIC/site for the main models. Table 2. Model Comparison with TreeBase Test Alignments, Using Likelihood and Topological Criteria. M1 M2 AIC/site #M1 > M2 #M1 > M2 (P < 0.01) #M1 < M2 (P < 0.01) #T1 > T2 #T1 < T2 #T1 > T2 (P < 0.01) #T1 < T2 (P < 0.01) RF (%) L1 − L2 (P < 0.01) LG JTT 0.47 73 66 7 39 8 36 5 430 (11) 0.036 LG WAG 0.29 71 62 7 32 10 29 6 404 (10) 0.136 LG LG−Γ 2.34 83 80 0 62 0 61 0 566 (14) 0.307 LG LG+Γ3 0.07 80 41 1 32 0 11 0 300 (8) 0.002 LG LG+Γ6 −0.04 7 1 37 3 22 0 6 266 (7) 0.018 LG LG+Γ8 −0.05 10 1 33 3 29 0 9 270 (7) 0.027 LG LG+Φ4 −0.03 34 15 21 17 28 4 10 476 (12) 0.063 LG LG+F 0.00 51 7 6 24 12 4 3 364 (9) −0.011 LG LG4M −0.15 33 11 34 20 37 7 24 616 (15) −0.073 LG LG4X −0.33 12 1 50 10 48 1 36 606 (15) 0.062 LG4M LG4X −0.18 1 0 48 0 52 0 26 530 (13) 0.145 LG4X EX2 0.08 67 27 2 44 11 17 1 536 (13) −0.100 LG4X UL3 −0.19 39 6 22 24 35 3 17 602 (15) −0.265 M1 M2 AIC/site #M1 > M2 #M1 > M2 (P < 0.01) #M1 < M2 (P < 0.01) #T1 > T2 #T1 < T2 #T1 > T2 (P < 0.01) #T1 < T2 (P < 0.01) RF (%) L1 − L2 (P < 0.01) LG JTT 0.47 73 66 7 39 8 36 5 430 (11) 0.036 LG WAG 0.29 71 62 7 32 10 29 6 404 (10) 0.136 LG LG−Γ 2.34 83 80 0 62 0 61 0 566 (14) 0.307 LG LG+Γ3 0.07 80 41 1 32 0 11 0 300 (8) 0.002 LG LG+Γ6 −0.04 7 1 37 3 22 0 6 266 (7) 0.018 LG LG+Γ8 −0.05 10 1 33 3 29 0 9 270 (7) 0.027 LG LG+Φ4 −0.03 34 15 21 17 28 4 10 476 (12) 0.063 LG LG+F 0.00 51 7 6 24 12 4 3 364 (9) −0.011 LG LG4M −0.15 33 11 34 20 37 7 24 616 (15) −0.073 LG LG4X −0.33 12 1 50 10 48 1 36 606 (15) 0.062 LG4M LG4X −0.18 1 0 48 0 52 0 26 530 (13) 0.145 LG4X EX2 0.08 67 27 2 44 11 17 1 536 (13) −0.100 LG4X UL3 −0.19 39 6 22 24 35 3 17 602 (15) −0.265 Note.—Models are compared using 84 TreeBase test alignments. All models use four categories of gamma distributed rates (+Γ4), unless explicitly stated, that is: +Φ4 and LG4X for distribution-free scheme; −Γ for constant site rate; +Γ3, +Γ6, and +Γ8 for 3, 6, and 8 gamma rate categories, respectively. LG+F: LG exchangeability coefficients are combined with the amino frequencies of the alignment being analyzed. EX2: 2-matrix two-level mixture model, with matrices estimated from buried/exposed sites. UL3: 3-matrix two-level mixture model, with blindly estimated matrices. LG4M: 4-matrix one-level mixture model using gamma distribution of site rates, proposed in this paper. LG4X: 4-matrix one-level mixture model using distribution-free scheme of site rates, proposed in this paper. On each row, model M1 is compared with model M2 using all test alignments. AIC/site: average per site difference in AIC value between M1 and M2; a positive (negative) value means that AIC/site of M1 is better (worse) than M2, on average. #M1 > M2: number of alignments (of 84) where M1 has a better AIC value than M2. #M1 > M2 (P < 0.01): number of alignments where the AIC of M1 is significantly better than that of M2. #M1 < M2 (P < 0.01): same as #M1 > M2 (P < 0.01), but now M2 is significantly better than M1. #T1 > T2: number of alignments where the tree T1 inferred with M1 has a better AIC value than T2 inferred using M2 and where T1 and T2 have different topologies. #T1 < T2: same as #T1 > T2, but now T2 is better than T1. #T1 > T2 (P < 0.01): same as #T1 > T2, but now T1 is significantly better than T2. #T1 < T2 (P < 0.01): T2 is significantly better than T1. RF (%): total RF distance between T1 and T2 trees (i.e., sum over all data sets of the number of branches that belong to one tree but not the other); numbers in parentheses report the percentage of RF relative to the total number (3,994) of internal branches in both T1 and T2 trees. L1 − L2 (P < 0.01): average of tree length differences between T1 and T2; we also counted the number of cases where T1 is longer/shorter than T2 and assessed the significance using a sign test with P < 0.01, significant differences are underlined. View Large Table 3. Model Comparison with HSSP Test Alignments, Using Likelihood and Topological Criteria. M1 M2 AIC/site #M1 > M2 #M1 > M2 (P < 0.01) #M1 < M2 (P < 0.01) #T1 > T2 #T1 < T2 #T1 > T2 (P < 0.01) #T1 < T2 (P < 0.01) RF (%) L1 − L2 (P < 0.01) LG JTT 0.72 267 221 10 220 23 184 6 3,570 (15) 0.048 LG WAG 0.31 248 141 7 196 32 110 2 3,478 (15) 0.174 LG LG4M –0.59 30 3 174 27 251 1 162 4,386 (18) 0.009 LG LG4X −0.65 13 0 182 10 257 0 163 4548 (19) 0.078 LG4M LG4X −0.06 93 2 20 83 166 2 16 4,014 (17) 0.068 LG4X EX2 0.15 241 62 2 200 51 42 1 4,110 (18) −0.063 LG4X UL3 0.00 199 37 10 165 99 26 10 4,356 (18) −0.326 LG4X EXEHO −0.14 117 2 23 88 166 2 21 4,176 (17) −0.094 LG4X EX2/S −0.21 60 1 80 56 199 0 57 4,188 (18) −0.058 LG4X EXEHO/S −0.61 5 0 223 4 250 0 181 4,204 (18) −0.092 M1 M2 AIC/site #M1 > M2 #M1 > M2 (P < 0.01) #M1 < M2 (P < 0.01) #T1 > T2 #T1 < T2 #T1 > T2 (P < 0.01) #T1 < T2 (P < 0.01) RF (%) L1 − L2 (P < 0.01) LG JTT 0.72 267 221 10 220 23 184 6 3,570 (15) 0.048 LG WAG 0.31 248 141 7 196 32 110 2 3,478 (15) 0.174 LG LG4M –0.59 30 3 174 27 251 1 162 4,386 (18) 0.009 LG LG4X −0.65 13 0 182 10 257 0 163 4548 (19) 0.078 LG4M LG4X −0.06 93 2 20 83 166 2 16 4,014 (17) 0.068 LG4X EX2 0.15 241 62 2 200 51 42 1 4,110 (18) −0.063 LG4X UL3 0.00 199 37 10 165 99 26 10 4,356 (18) −0.326 LG4X EXEHO −0.14 117 2 23 88 166 2 21 4,176 (17) −0.094 LG4X EX2/S −0.21 60 1 80 56 199 0 57 4,188 (18) −0.058 LG4X EXEHO/S −0.61 5 0 223 4 250 0 181 4,204 (18) −0.092 Note.—Models are compared using 300 HSSP test alignments. EX2/S has the same matrices as EX2 but (contrary to EX2) uses the solvent accessibility of the residues derived from the 3D protein structure. EXEHO: 6-matrix two-level mixture model, combining accessibility to solvent and secondary structure. EXEHO/S has the same six matrices as EXEHO but (contrary to EXEHO) uses the secondary structure and the solvent accessibility of the residues derived from the 3D protein structure. See note to table 2 for other abbreviations and explanations. View Large FIG. 3. View largeDownload slide AIC progress of amino acid replacement models, using TreeBase. Note: Models are compared using 84 TreeBase test alignments. All models use four categories of gamma-distributed rates (+Γ4), except LG4X. EX2: 2-matrix two-level mixture model, with matrices estimated from buried/exposed sites. UL3: 3-matrix two-level mixture model, with blindly estimated matrices. LG4M: 4-matrix one-level mixture model using gamma distribution of site rates, proposed in this paper. LG4X: 4-matrix one-level mixture model using distribution-free scheme of site rates, proposed in this paper. In the upper panel (a) performance is measured by the average AIC per site (AIC/site) and compared with the JTT value. In the lower panel (b), we count the number of alignments (among 84) where each model provides a better (positive side) and a worse (negative side) likelihood value than LG. The black bars correspond to the numbers of significant differences using the KH test on AIC values with P < 0.01. FIG. 3. View largeDownload slide AIC progress of amino acid replacement models, using TreeBase. Note: Models are compared using 84 TreeBase test alignments. All models use four categories of gamma-distributed rates (+Γ4), except LG4X. EX2: 2-matrix two-level mixture model, with matrices estimated from buried/exposed sites. UL3: 3-matrix two-level mixture model, with blindly estimated matrices. LG4M: 4-matrix one-level mixture model using gamma distribution of site rates, proposed in this paper. LG4X: 4-matrix one-level mixture model using distribution-free scheme of site rates, proposed in this paper. In the upper panel (a) performance is measured by the average AIC per site (AIC/site) and compared with the JTT value. In the lower panel (b), we count the number of alignments (among 84) where each model provides a better (positive side) and a worse (negative side) likelihood value than LG. The black bars correspond to the numbers of significant differences using the KH test on AIC values with P < 0.01. FIG. 4. View largeDownload slide AIC progress of amino acid replacement models, using HSSP. Note: Models are compared using 300 HSSP test alignments. EX2/S has the same matrices as EX2 but (contrary to EX2) uses the solvent accessibility of the residues derived from the 3D protein structure. EXEHO: 6-matrix two-level mixture model, combining accessibility to solvent and secondary structure. EXEHO/S has the same six matrices as EXEHO but (contrary to EXEHO) uses the secondary structure and the solvent accessibility of the residues derived from the 3D protein structure. See note to figure 3 for other abbreviations and explanations. FIG. 4. View largeDownload slide AIC progress of amino acid replacement models, using HSSP. Note: Models are compared using 300 HSSP test alignments. EX2/S has the same matrices as EX2 but (contrary to EX2) uses the solvent accessibility of the residues derived from the 3D protein structure. EXEHO: 6-matrix two-level mixture model, combining accessibility to solvent and secondary structure. EXEHO/S has the same six matrices as EXEHO but (contrary to EXEHO) uses the secondary structure and the solvent accessibility of the residues derived from the 3D protein structure. See note to figure 3 for other abbreviations and explanations. It is clear from these results that LG outperforms JTT and WAG. The average AIC/site of LG with TreeBase alignments is respectively 0.47 and 0.29 lower than that of JTT and WAG, equivalent to a gain of 117.5 and 72.5 log-likelihood units with a 500-site alignment. Moreover, LG is significantly better than JTT and WAG for most alignments. These results reconfirm the claim in Le and Gascuel (2008). Table 2 (TreeBase) reports comparisons between several standard model options, combined here with LG. The comparison between LG−Γ (no gamma distribution of site rates) and LG+Γ4 highlights the crucial role of modeling rates across sites. LG+Γ4 has significantly better AIC values than LG−Γ for 80/84 alignments, with an average AIC/site gain of 2.34. LG+Γ4 is significantly better than LG+Γ3 for 41/84 alignments, with an average AIC/site gain of 0.07, whereas it is worse than LG+Γ6 for 37/84 alignments, with a slight average AIC/site loss of 0.04. Using eight categories in LG+Γ8 does not improve AIC results compared with LG+Γ6. These results indicate that with standard models, three gamma rate categories are not enough and that six categories suffice. Moreover, the differences in AIC/site between these options are low compared with those of other options and models (e.g., LG+Γ vs. LG−Γ). Using four categories, which is standard throughout the community, thus appears to be a fair compromise between likelihood (AIC) value and computing time, which is proportional to the number of categories. These results also support our choice of using four categories in our LG4M and LG4X models. Having three categories would be not enough and overly simple, whereas using four categories is likely to be a fair compromise just as with standard models. However, we cannot exclude that having more than four categories could lead to even better (but slower) models. The low AIC/site gain (0.03) between LG+Φ4 and LG+Γ4 confirms the conclusions of Susko et al. (2003) and Mayrose et al. (2005), who observed low gains when combining a single replacement matrix with a distribution-free scheme or a mixture of gamma distributions. Lastly, when combining LG with +F option, where amino acid distribution is estimated independently for each testing alignment, the average AIC/site value is nearly the same as with LG alone. Six large alignments obtain significantly better AIC values than with LG, but for the other (small or medium-size) alignments, the likelihood gain is not large enough to compensate for the 19 additional free parameters estimated from the data. LG4M shows a clear improvement over LG, with an AIC/site gain of 0.15 and 0.59, with TreeBase and HSSP alignments, respectively (note that these gains cannot be compared, as they depend on several factors, e.g., the number of taxa per alignment). With HSSP, LG4M has higher AIC (and likelihood) values than LG for 270/300 alignments with 174 significant cases. With TreeBase results are not so impressive but still clearly in favor of LG4M versus LG. LG4X has a major advantage over LG, with an AIC/site gain of 0.33 and 0.65, with TreeBase and HSSP, respectively. LG4X is significantly better than LG for more than half of the alignments (both HSSP and TreeBase), whereas LG is significantly better than LG4X for only one (TreeBase) alignment. Compared with LG4M, LG4X shows a slight advantage with HSSP (AIC/site gain of 0.06) and is clearly better with TreeBase: mean AIC/site gain of 0.18, which is significant for 50/84 alignments, whereas LG4M is better than LG4X for only one nonsignificant alignment. The advantage of LG4X over LG4M is explained by its greater flexibility to fit the specificities of analyzed data, thanks to its distribution-free scheme. This scheme (+Φ4) does not show such an advantage with single-matrix models (see above) but becomes clearly beneficial when combined with different replacement matrices. The superiority of LG4X over LG4M is more marked with TreeBase than with HSSP alignments, possibly because LG4M and LG4X were learned from HSSP alignments and fit both HSSP specificities well. Moreover, HSSP alignments are smaller than TreeBase alignments, and some HSSP alignments are likely too small to compensate for the 5 additional parameters in LG4X compared with LG4M. This advantage of LG4X over LG4M should thus be observed by future users analyzing phylogeny-intended alignments, such as those stored in TreeBase. Compared with two-level mixture models, we see that LG4X is 1) slightly better than EX2 (AIC/site gain of 0.08 and 0.15 with TreeBase and HSSP, respectively); 2) nearly equivalent to UL3 (AIC/site loss of 0.19 with TreeBase but null with HSSP; the number of significant cases is low and does not favor one model over the other); 3) slightly behind EXEHO (AIC/site loss of 0.14 with HSSP, but low number [23] of cases where EXEHO is significantly better than LG4X). Globally, we thus do not see a clear advantage of two-level mixture models over our new one-level mixture models, the former involving high number of rate categories (up to 24 with EXEHO) and heavy computational resources (at least EXEHO). Lastly, when comparing EX2/S and EXEHO/S with LG4X and LG4M (and all other models), we see the clear advantage of using structure-informed models when the structural annotation of the proteins analyzed is available. The AIC/site gain of EXEHO/S over LG4X is of 0.61 on average, and this gain is significant with 223/300 alignments, whereas LG4X is never significantly better than EXEHO/S. The advantage of EX2/S over LG4X is less impressive but still significant. Tree-length comparisons (tables 2 and 3) do not show a clear picture. Some of the findings are expected, for example, LG+Γ4 trees are much longer than LG−Γ ones, but the correlation between tree length and AIC value is weak or nonexistent. Mixture models tend to produce longer trees than standard models, with the notable exception of both distribution-free models (LG+Φ4 and LG4X), which infer trees shorter than LG+Γ4 ones. LG4M and LG+Γ4 trees have similar length. UL3 trees (comparable to LG4X in AIC terms) are very long, while those inferred using EXEHO/S (our best model in AIC terms) do not differ substantially from LG ones. These results thus contradict the assumption that better models should produce longer trees (Pagel and Meade 2005). Overall, the comparisons of likelihood and AIC values show that 1) our new simple one-level mixture models outperform the standard models; 2) they are comparable to the best two-level mixture models, while requiring less computational resources; and 3) they are clearly beaten by structure-informed models, which should be preferred when structural information is available. Topological Impact The previous section analyzes model performance in terms of fit to the data, measured by likelihood and AIC values. Here, we study the impact of using refined models, that is, how they change the topology of inferred trees. We see from tables 2 and 3 that refined models have a strong topological impact compared with standard models. For example comparing LG4X with LG, both models infer different topologies with 58/84 TreeBase and 267/300 HSSP alignments. Moreover, these topologies are clearly different (percentage RF distance of 15% and 19% for TreeBase and HSSP, respectively), and the AIC value significantly favors LG4X topologies over LG topologies for 36/58 TreeBase and 163/267 HSSP alignments, whereas the LG topology is significantly favored for only one TreeBase alignment and never with HSSP. This means that with 36 (∼45%) TreeBase and 163 (∼55%) HSSP alignments, one should confidently select the LG4X topology and abandon that inferred using LG. Moreover, these two topologies are very different in general (see RF values). However, the effective impact of selecting these alternative trees would be low if the differences between topologies of standard and refined models corresponded to poorly supported branches and was solely due to random effects inherent to phylogenetic reconstruction. Indeed, inspecting the topological distances (RF) between LG topologies and those of other models, we see that closely related models still show substantial topological differences; for example (table 2), LG (used with the +Γ4 option, omitted below for conciseness as in the previous section) and LG+Γ8 topologies are different for 32/84 TreeBase alignments, with 9/32 significant cases and percentage RF distance of 7%. We thus counted the number of branches that are supported by the bootstrap in one tree but not the other (BP1 ≥ BP2 + 50%, see above). For example, comparing LG and LG+Γ8 with the 63 smallest TreeBase alignments, we have 8 such branches among a grand total of 2,382 branches, against 48 with LG versus LG4X. Using the 300 HSSP alignments, we have 122 branches supported in one tree but not the other when comparing LG and LG+Γ8 and 552 for LG versus LG4X (grand total = 23,908). These measures are much lower than the corresponding RF topological distances (tables 2 and 3), as expected since here we only consider branches with substantial bootstrap support. However, with LG versus LG4X, we still have on average ∼1 (TreeBase) to ∼2 (HSSP) branches per tree with clearly different bootstrap support. Moreover, this “topological support dissimilarity” provides a sharper view of the models' resemblance and dissemblance than standard RF distance; for example, the topological support dissimilarity between LG and LG4X is ∼6 times larger with TreeBase (∼4.5 with HSSP) than the topological support dissimilarity between LG and LG+Γ8, while this ratio is ∼2 with RF distance (both TreeBase and HSSP). We thus measured the topological support dissimilarities between main models using (63 smallest) TreeBase and (300) HSSP alignments, and then constructed distance-based trees representing the topological impacts and resemblance/dissemblance of these models. For this purpose, we used the FastME software (Desper and Gascuel 2002; http://www.atgc-montpellier.fr/fastme/) with default options (analogous to NJ but using branch swapping). Resulting trees are displayed in figure 5, and the pairwise distance matrices are available in the supplementary tables and figures at http://www.atgc-montpellier.fr/models/lg4x. FIG. 5. View largeDownload slide Topological support dissimilarities of the main models. Note: Topological support dissimilarity between models M1 and M2 is computed from the bootstrap trees inferred using M1 and M2, by counting the number of branches supported in one tree but not the other (BP1 ≥ BP2 + 50%, see text). Trees in this figure were built using distance-based FastME software, from all pairwise model dissimilarities. The tree in the upper panel (a) is based on the 63 smallest TreeBase test alignments; tree (b) is based on 300 HSSP test alignments. All models use four categories of gamma distributed rates (+Γ4), unless explicitly stated, that is, LG4X for distribution-free scheme; −Γ for constant site rate; +Γ8 for 8 gamma rate categories. EX2: 2-matrix two-level mixture model, with matrices estimated from buried/exposed sites. EX2/S has the same matrices as EX2 but (contrary to EX2) uses the solvent accessibility of the residues derived from the 3D protein structure. EXEHO/S: 6-matrix model combining the accessibility to the solvent and the secondary structure of the residues derived from the 3D protein structure. UL3: 3-matrix two-level mixture model, with blindly estimated matrices. LG4M: 4-matrix one-level mixture model using gamma distribution of site rates, proposed in this paper. LG4X: 4-matrix one-level mixture model using distribution-free scheme of site rates, proposed in this paper. FIG. 5. View largeDownload slide Topological support dissimilarities of the main models. Note: Topological support dissimilarity between models M1 and M2 is computed from the bootstrap trees inferred using M1 and M2, by counting the number of branches supported in one tree but not the other (BP1 ≥ BP2 + 50%, see text). Trees in this figure were built using distance-based FastME software, from all pairwise model dissimilarities. The tree in the upper panel (a) is based on the 63 smallest TreeBase test alignments; tree (b) is based on 300 HSSP test alignments. All models use four categories of gamma distributed rates (+Γ4), unless explicitly stated, that is, LG4X for distribution-free scheme; −Γ for constant site rate; +Γ8 for 8 gamma rate categories. EX2: 2-matrix two-level mixture model, with matrices estimated from buried/exposed sites. EX2/S has the same matrices as EX2 but (contrary to EX2) uses the solvent accessibility of the residues derived from the 3D protein structure. EXEHO/S: 6-matrix model combining the accessibility to the solvent and the secondary structure of the residues derived from the 3D protein structure. UL3: 3-matrix two-level mixture model, with blindly estimated matrices. LG4M: 4-matrix one-level mixture model using gamma distribution of site rates, proposed in this paper. LG4X: 4-matrix one-level mixture model using distribution-free scheme of site rates, proposed in this paper. Although these two trees were obtained from completely different data (TreeBase, HSSP) through a complex procedure (bootstrap, dissimilarity computation, FastME), it is remarkable that both are nearly identical in terms of topology and branch lengths. Moreover, these trees are easily interpreted and illustrate the main features of studied models. A first obvious observation is that standard and nonstandard models form the two main clades. Moreover, as expected with standard models: LG and LG+Γ8 form a tight clade; LG, WAG, and JTT are relatively close; LG−Γ (constant site rate) is isolated at the end of a long branch, which illustrates the (well-documented) impact of using a gamma distribution of site rates. Nonstandard models are separated in two clades, respectively, containing one-level and two-level mixtures. Within two-level mixtures in the HSSP tree, we have a clade containing all structure-based models, with two tight subclades containing, respectively, EX2 and EX2/S and EXEHO and EXEHO/S. Surprisingly, knowing the 3D structure in EX2/S and EXEHO/S does not have much impact on the tree topology with respect to EX2 and EXEHO. However, the topological impact of these four structure-based models with respect to LG is almost the same as that of LG with respect to LG−Γ. The topological impact of UL3, LG4M, and LG4X with respect to LG is even larger. As expected LG4M and LG4X form a clade, but both models are relatively distant, whereas UL3 is distant from all other models. From trees in figure 5, it is not possible to predict which model provides the best topologies. However, it is noticeable that nonstandard and mixture models are on the opposite side of LG−Γ, known to be a poor model, whereas standard models are in between. We might have a preference for structure-based models, as they infer similar tree topologies (fig. 5) and lead to very high likelihood values when the 3D structure is available (fig. 4). However, there is no guaranty that these models are best from a topological standpoint, even if they have strong biological justifications. LG4X and LG4M are also based on meaningful assumptions (as opposed to UL3 learned in a purely blind manner). Both have a strong topological impact and provide high likelihood gains compared with standard models. LG4M and LG4X tree topologies contain well-supported clades, not discovered by any of the other models, and thus representing biological and phylogenetic interest and deserving further investigations. Memory Consumption and Running Time LG4M and LG4X require the same amount of memory as single-matrix models with the Γ4 option. EX2, UL3, and EXEHO (all three with the Γ4 option), respectively, use twice, three, and six times as much memory because they are two-level mixtures with two, three, and six matrices, respectively. For example, LG4X and UL3+Γ4 require 2 GB and 6 GB of memory space, respectively, for the largest TreeBase alignment with 62 taxa and 11,544 sites (accession number M4680). EXEHO requires almost 12 GB to analyze this data set, which makes it impractical for most standard computers and users. The same ratios apply to the computing times needed to calculate data likelihood, given a tree with branch lengths and model parameter values. For example, EXEHO is nearly six times slower than a standard, single-matrix model. However, other factors impact the total computing time. Most notably, models differ in the number of parameters to be estimated from the data via likelihood optimization. Standard models and LG4M have only one (gamma) parameter, whereas EX2, UL3, LG4X, and EXEHO, respectively, have two, three, six, and six parameters, thus again requiring additional computing time compared with standard models and LG4M. Using PhyML-4X with standard options (+Γ4, SPR-based tree searching) on a powerful CPU (Intel[R] Xeon[R] E5440 at 2.83GHz with 16 GB memory) to infer phylogenies for all 84 TreeBase alignments requires about 55 h with standard models (LG was used here), 60 h with LG4M, and 85 h with LG4X. As expected, LG4M is nearly as fast as standard models, but LG4X is somewhat slowed down by model parameter estimations. Using PhyML-Structure (Le and Gascuel 2010) for the same task requires about 280, 380, and 670 h for EX2, UL3, and EXEHO, respectively (total time for EXEHO is about 1 month and was estimated from a sample of alignments). Applying these models to the largest TreeBase alignment (M4860) using the same CPU, programs and options, requires about 6, 8, 11, 51, 53, and 84 h for LG, LG4M, LG4X, EX2, UL3, and EXEHO, respectively. Though different programs were used in these experiments, with PhyML-4X being based on a more recent and about twice faster version of PhyML than PhyML-Structure, we obtain a clear picture: LG4M is nearly the same as standard models and is a fast model as expected; LG4X is a bit slower due to its six parameters to be estimated; EX2 and UL3 are significantly slower than standard models but still clearly applicable even to large data sets; EXEHO requires important computing resources for large data sets not only in terms of computing times but also with respect to memory space. The /S option (called CONF/MIX in Le and Gascuel 2010) that we used for HSSP alignments with known 3D structure requires nearly the same time and memory as the mixture version, with both EX2 and EXEHO. However, EXEHO (and EX2) may be used with less demanding options (CONF/LG and PART) when the 3D structure is known. Conclusion In this paper, we proposed two new models, LG4M and LG4X, for amino acid replacement modeling in protein phylogenetics. The main idea was to use different substitution matrices for the different evolutionary rate categories and to reduce the standard gamma distribution constraints on site rates by adopting a distribution-free scheme (LG4X). Experiments with independent alignments showed that LG4M and LG4X most often infer trees with higher likelihood and AIC values than single-matrix models (JTT, WAG, and LG), thus illustrating the limit of the standard approach that assumes a unique replacement matrix regardless of the site rate. Moreover, these trees tend to differ significantly in their topology from those inferred using the standard approach. These experiments also showed that our distribution-free scheme for site rates offers high flexibility and contributes greatly to LG4X performance. Since LG4M and LG4X produce significantly better results while requiring the same memory space and similar running times, they would be reasonable replacements for single-matrix models. Current phylogenetic tree inference software using the standard approach and gamma distribution of site rates could immediately use LG4M because it requires the same data structures and procedures as single-matrix models. LG4X could be easily adapted as well, by adding an appropriate optimization procedure for estimating the distribution-free site rate parameters. These two models and others, notably structural, will be incorporated in a forthcoming release of the official PhyML. Many questions arise from the improvement provided by LG4X and LG4M. The first is to check the number of rate categories. It is commonly acknowledged (our results in table 2 support this practice) that four gamma-distributed rate categories provide a fair compromise for single-matrix models. Moreover, LG4X with four rate categories is better than two-matrix × four-rate models (EX2+Γ4) and comparable to three-matrix × four-rate models (UL3+Γ4). However, we do not yet know the difference when we increase or decrease the number of rate categories and use the same scheme as LG4M or LG4X. Variants of LG4X and LG4M and their combination with the standard +F option to fit proteins with specific amino acid distributions should also be investigated. Lastly, a major direction for further research is to better understand the substitution processes revealed by these complex models and replacement matrices, unravel the biological differences between models, and exploit them for further improvements. Thanks to Associate Editor Jeffrey Thorne and two anonymous referees for their help and suggestions. This research was supported by the French ANR BIOSYS (MitoSys project) and the Vietnam National Foundation for Science and Technology Development. References Akaike H. A new look at statistical model identification, IEEE Trans Automatic Control. , 1974, vol. 19 (pg. 716- 722) Google Scholar CrossRef Search ADS Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank, Nucleic Acids Res. , 2000, vol. 28 (pg. 235- 242.) Available from: http://www.pdb.org Google Scholar CrossRef Search ADS PubMed Berry V, Gascuel O. Interpretation of bootstrap trees: threshold of clade selection and induced gain, Mol Biol Evol. , 1996, vol. 13 (pg. 999- 1011) Google Scholar CrossRef Search ADS Bryant D, Galtier N, Poursat MA. Gascuel O. Likelihood calculations in phylogenetics, Mathematics of evolution and phylogeny , 2005 Oxford Oxford University Press(pg. 33- 62) Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis, Mol Biol Evol. , 2000, vol. 17 (pg. 540- 552) Google Scholar CrossRef Search ADS PubMed Dayhoff MO, Eyck RV, Park CM. Dayhoff MO. A model of evolutionary change in proteins, Atlas of protein sequence and structure , 1972 Vol. 5. Washington (DC) National Biomedical Research Foundation(pg. 89- 99) Desper R, Gascuel O. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle, J Comp Biol. , 2002, vol. 19 (pg. 687- 705) Google Scholar CrossRef Search ADS Durbin R, Eddy S, Krogh A, Mitchison G. , Biological sequence analysis: probabilistic models of proteins and nucleic acids , 1998 Cambridge Cambridge University Press Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach, J Mol Evol. , 1981, vol. 17 (pg. 368- 376) Google Scholar CrossRef Search ADS PubMed Felsenstein J. , Inferring phylogenies , 2003 Sunderland (MA) Sinauer Associates, Inc Felsenstein J, Churchill GA. A Hidden Markov Model approach to variation among sites in rate of evolution, Mol Biol Evol. , 1996, vol. 13 (pg. 93- 104) Google Scholar CrossRef Search ADS PubMed Gascuel O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data, Mol Biol Evol. , 1997, vol. 14 (pg. 685- 695) Google Scholar CrossRef Search ADS PubMed Gascuel O, Guindon S. Gascuel O, Steel M. Modelling the variability of evolutionary processes, Reconstructing evolution: new mathematical and computational advances , 2007 Oxford Oxford University Press(pg. 65- 99) Goldman N, Thorne JL, Jones DT. Assessing the impact of secondary structure and solvent accessibility on protein evolution, Genetics , 1998, vol. 149 (pg. 445- 458) Google Scholar PubMed Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0, Syst Biol. , 2010, vol. 59 (pg. 307- 321) Google Scholar CrossRef Search ADS PubMed Holder MT, Sukumaran J, Lewis PO. A justification for reporting the majority-rule consensus tree in Bayesian phylogenetics, Syst Biol. , 2008, vol. 57 (pg. 814- 821) Google Scholar CrossRef Search ADS PubMed Holmes I, Rubin GM. An expectation maximization algorithm for training hidden substitution models, J Mol Biol. , 2002, vol. 317 (pg. 753- 764) Google Scholar CrossRef Search ADS PubMed Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences, Comput Appl Biosci. , 1992, vol. 8 (pg. 275- 282) Google Scholar PubMed Jones DT, Taylor WR, Thornton JM. A mutation data matrix for transmembrane proteins, FEBS Lett. , 1994, vol. 339 (pg. 269- 275) Google Scholar CrossRef Search ADS PubMed Keane TMC, Creevey CJ, Pentony MM, Naughton TJ, McLnerney JO. Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified, BMC Evol Biol. , 2006, vol. 6 pg. 29 Google Scholar CrossRef Search ADS PubMed Kishino H, Hasegawa M. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea, J Mol Evol. , 1989, vol. 29 (pg. 170- 179) Google Scholar CrossRef Search ADS PubMed Klosterman PSA, Uzilov AV, Bendana YR, Bradley RK, Chao S, Kosiol C, Goldman N, Holmes I. XRate: a fast prototyping, training and annotation tool for phylo-grammars, BMC Bioinformatics , 2006, vol. 7 pg. 428 Google Scholar CrossRef Search ADS PubMed Koshi JM, Goldstein RA. Context-dependent optimal substitution matrices, Protein Eng. , 1995, vol. 8 (pg. 641- 645) Google Scholar CrossRef Search ADS PubMed Koshi JM, Goldstein RA. Models of natural mutations including site heterogeneity, Proteins , 1998, vol. 32 (pg. 289- 295) Google Scholar CrossRef Search ADS PubMed Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein, J Mol Biol. , 1982, vol. 157 (pg. 105- 132) Google Scholar CrossRef Search ADS PubMed Lartillot N, Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process, Mol Biol Evol. , 2004, vol. 21 (pg. 1095- 1109) Google Scholar CrossRef Search ADS PubMed Le SQ, Gascuel O. An improved general amino acid replacement matrix, Mol Biol Evol. , 2008, vol. 25 (pg. 1307- 1320) Google Scholar CrossRef Search ADS PubMed Le SQ, Gascuel O. Accounting for accessibility to solvent and secondary structure in protein phylogenetics is clearly beneficial, Syst Biol. , 2010, vol. 59 (pg. 277- 287) Google Scholar CrossRef Search ADS PubMed Le SQ, Gascuel O, Lartillot N. Empirical profile mixture models for phylogenetic reconstruction, Bioinformatics , 2008, vol. 24 (pg. 2317- 2323) Google Scholar CrossRef Search ADS PubMed Le SQ, Lartillot N, Gascuel O. Phylogenetic mixture models for proteins, Philos Trans R Soc Lond B Biol Sci. , 2008, vol. 363 (pg. 3965- 3976) Google Scholar CrossRef Search ADS PubMed Lio P, Goldman N, Thorne JL, Jones DT. PASSML: combining evolutionary inference and protein secondary structure prediction, Bioinformatics , 1998, vol. 14 (pg. 726- 733) Google Scholar CrossRef Search ADS PubMed Mayrose I, Friedman N, Pupko T. A Gamma mixture model better accounts for among site rate heterogeneity, Bioinformatics , 2005, vol. 21 (pg. 151- 158) Google Scholar CrossRef Search ADS Pagel M, Meade A. Gascuel O. Mixture models in phylogenetic inference, Mathematics of evolution and phylogeny , 2005 Oxford Oxford University Press(pg. 121- 142) Raman P, Cherezov V, Caffrey M. The Membrane Protein Data Bank, Cell Mol Life Sci. , 2006, vol. 63 (pg. 36- 51) Google Scholar CrossRef Search ADS PubMed Robinson D, Foulds L. Comparison of weighted labeled trees, Lect Notes Math. , 1979, vol. 748 (pg. 119- 126) Sanderson MJ, Donoghue MJ, Piel W, Eriksson T. TreeBase: a prototype database of phylogenetic analyses and an interactive tool for browsing the phylogeny of life, Am J Bot. , 1994, vol. 81 pg. 183 Google Scholar CrossRef Search ADS Schneider R, de Daruvar A, Sander C. The HSSP database of protein structure-sequence alignments, Nucleic Acids Res. , 1997, vol. 25 (pg. 226- 230) Google Scholar CrossRef Search ADS PubMed Shimodaira H. Assessing the error probability of the model selection test, Ann Inst Stat Math. , 1997, vol. 49 (pg. 395- 410) Google Scholar CrossRef Search ADS Susko E, Field C, Blouin C, Roger AJ. Estimation of rates-across-sites distributions in phylogenetic substitution models, Syst Biol. , 2003, vol. 52 (pg. 594- 603) Google Scholar CrossRef Search ADS PubMed Thorne JL, Goldman N, Jones DT. Combining protein evolution and secondary structure, Mol Biol Evol. , 1996, vol. 13 (pg. 666- 673) Google Scholar CrossRef Search ADS PubMed Wang HC, Li K, Susko E, Roger AJ. A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny, BMC Evol Biol. , 2008, vol. 8 pg. 331 Google Scholar CrossRef Search ADS PubMed Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach, Mol Biol Evol. , 2001, vol. 18 (pg. 691- 699) Google Scholar CrossRef Search ADS PubMed Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites, Mol Biol Evol. , 1993, vol. 10 (pg. 1396- 1401) Google Scholar PubMed Yang Z. , Computational molecular evolution , 2006 Oxford Oxford University Press Yang Z, Nielsen R, Hasegawa M. Models of amino acid substitution and applications to mitochondrial protein evolution, Mol Biol Evol. , 1998, vol. 15 (pg. 1600- 1611) Google Scholar CrossRef Search ADS PubMed © The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

Showing 1 to 10 of 37 Articles

Articles per page

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

1996

1995

1994

1993

1992

1991

1990

1989

1988

1987

1986

1985

1984

1983

0019

Related Journals: