Announcements
doi: 10.1093/sysbio/syaa089
pmid: N/A
SOCIETY OF SYSTEMATIC BIOLOGISTS ANNOUNCEMENTS It’s easy to join SSB.—For information about joining or renewing your membership, or accessing the journal online as a society member, please visit https://scienceserv.com/ssb. Queries about society membership can be directed to [email protected]. Systematic Biology has an Open Access option.—Authors have the option of designating their Systematic Biology papers as Open Access. Details are available from Oxford University Press at http://academic.oup.com/journals/pages/open_access. The Society of Systematic Biologists has two student representatives on the council: Kinsey Brock and Sam Church.—Kinsey Brock is a herpetologist and evolutionary biologist studying the evolution of color, morphology, and behavior in lacertid lizards. She is developing a study system with color polymorphic lizards in the Aegean islands to understand which geographic and environmental factors shape patterns of genetic and phenotypic variation within species. Sam is a third-year doctoral candidate advised by Cassandra Extavour at Harvard University. In his research he studies invertebrate evolution and development, currently focusing on the evolution of the insect egg, and he develops software for evolutionary analyses. The SSB student representatives work with the council to support activities and initiatives to better serve our student members. Systematic Biology would like to announce a new manuscript category.—Spotlights are papers that focus on an empirical system. They could range from broad phylogenetic investigation of large clades to a species delimitation investigation conducted on the phylogeographic scale. In any case, Spotlights should be characterized by their adoption of leading-edge methods and their careful approach to data collection and analysis, but will focus on the evolution of the focal group rather than the methodology used to infer this history. 
Spotlights should appeal to a broad audience, for example by showing how the focal system illustrates key evolutionary processes, but should be primarily system focused. Because this is Systematic Biology, we encourage authors to conduct thorough analyses and describe them in detail. However, authors will be asked to limit Spotlights to 6000 words and no more than 4 figures or tables. To achieve both of these goals, authors should use supplemental material for the detailed description of their Methods and provide a short summary of these in the main manuscript. Results & Discussion can be combined if this facilitates a succinct description of the work.

Publisher’s Award.—The journal is delighted to announce the two winners of the 2019 Publisher’s Award, which is presented to the two best papers based on student research published in Systematic Biology during the previous year. The two winning papers and winners are:

Kirilee Chaplin, Museum Victoria
Kirilee Chaplin, et al. An Integrative Approach Using Phylogenomics and High-Resolution X-Ray Computed Tomography for Species Delimitation in Cryptic Taxa, Systematic Biology, Volume 69, Issue 2, March 2020, Pages 297–307, https://doi.org/10.1093/sysbio/syz048

Miroslav Valan, Stockholm University
Miroslav Valan, et al. Automated Taxonomic Identification of Insects with Expert-Level Accuracy Using Effective Feature Transfer from Convolutional Networks, Systematic Biology, Volume 68, Issue 6, November 2019, Pages 876–895, https://doi.org/10.1093/sysbio/syz014

© The Author(s) 2020. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please email: [email protected] This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
The Origins of Coca: Museum Genomics Reveals Multiple Independent Domestications from Progenitor Erythroxylum gracilipes
White, Dawson M; Huang, Jen-Pan; Jara-Muñoz, Orlando Adolfo; Madriñán, Santiago; Ree, Richard H; Mason-Gamer, Roberta J
doi: 10.1093/sysbio/syaa074
pmid: 32979264
Abstract Coca is the natural source of cocaine as well as a sacred and medicinal plant farmed by South American Amerindians and mestizos. The coca crop comprises four closely related varieties classified into two species (Amazonian and Huánuco varieties within Erythroxylum coca Lam., and Colombian and Trujillo varieties within Erythroxylum novogranatense (D. Morris) Hieron.) but our understanding of the domestication and evolutionary history of these taxa is nominal. In this study, we use genomic data from natural history collections to estimate the geographic origins and genetic diversity of this economically and culturally important crop in the context of its wild relatives. Our phylogeographic analyses clearly demonstrate the four varieties of coca comprise two or three exclusive groups nested within the diverse lineages of the widespread, wild species Erythroxylum gracilipes; establishing a new and robust hypothesis of domestication wherein coca originated two or three times from this wild progenitor. The Colombian and Trujillo coca varieties are descended from a single, ancient domestication event in northwestern South America. Huánuco coca was domesticated more recently, possibly in southeastern Peru. Amazonian coca either shares a common domesticated ancestor with Huánuco coca, or it was the product of a third and most recent independent domestication event in the western Amazon basin. This chronology of coca domestication reveals different Holocene peoples in South America were able to independently transform the same natural resource to serve their needs; in this case, a workaday stimulant. [Erythroxylum; Erythroxylaceae; Holocene; Museomics; Neotropics; phylogeography; plant domestication; target-sequence capture.] 
Called the “Divine Leaf” by the Inka, coca has been cultivated for over 8000 years and is the most culturally significant pharmaceutical plant in South America; it is also the source of the alkaloid cocaine, an insecticide and local anesthetic that has had the largest impact in Western Medicine of any Neotropical phytochemical (Schultes 1979; Plowman 1986; Nathanson et al. 1993; Dillehay et al. 2010; Restrepo et al. 2019). In parts of Colombia, Ecuador, Peru, Bolivia, and Brazil, the traditional varieties of coca are still cultivated as they have been since Pre-Columbian times (Fig. 1; Plowman 1984; Plowman and Hensold 2004), and today over 5 million South Americans chew the leaves for their mild stimulant and medicinal effects (Conzelman and White 2016). However, prohibition has pushed illicit cultivation for cocaine, now at its highest level since 2000, into lowland forests in Colombia, Peru, Brazil, and even southern Mexico, causing deforestation and civil destabilization in these regions (Plowman 1984; Dávalos et al. 2011; Bewley-Taylor 2016; Casale and Mallette 2016; United Nations 2019).

Figure 1. Map of taxon distributions and sampling localities. Polygons indicate modern areas of cultivation for the four coca varieties sensu Plowman and Hensold (2004); omitted are areas of illicit cultivation (esp. southern Colombia) and garden plots (incl. botanical gardens). Symbols indicate location of sampled herbarium specimens and rotated symbols indicate field collections by D. M. W.

To understand the identity of the coca crop, we sequenced 424 nuclear genes from tissue samples collected almost entirely from historical museum collections (90% of the total) and completed the first investigation of the genetic structure of the four coca varieties and their closest wild relatives (Erythroxylum spp.). With over 10,000 exsiccatae including ~950 cocas, the Field Museum (F) holds the world’s largest Neotropical Erythroxylum collection.
We utilized this resource to maximize geographic coverage in our sampling of dozens of individuals of the coca varieties and their wild relatives, exemplifying the utility of well-curated museum collections in phylogeographic and population-genomics research. For diverse and poorly studied tropical plant taxa, and especially for drug plants that require complex logistics to collect, ship, and store, museum collections can play an increasingly important role in systematic research (Rowe et al. 2011; Hart et al. 2016; Forrest et al. 2019). Yet, many studies using historical samples have been limited to organelles or utilized relatively undegraded DNA (Staats et al. 2013; Beck and Semple 2015).

There are four taxonomic varieties of coca grown in separate geographic areas in South America (Fig. 1; Plowman 1979a, 1979b). They are morphologically similar but can be distinguished by several traits, most notably leaf shape and venation patterns (Supplementary Table S1 available on Dryad at https://doi.org/10.5061/dryad.cvdncjt1n; Plowman 1979a; Bohm et al. 1982; Rury and Plowman 1983). Huánuco (or Bolivian) coca (Erythroxylum coca Lam.) is grown in the moist, montane forest on the eastern slopes of the Andes in Peru and Bolivia. This is the most abundant crop for traditional and indigenous coca leaf consumption and was the world’s primary source of cocaine hydrochloride from its discovery in 1865 until 2000 (Gootenberg 2008; United Nations 2019). Amazonian coca (E. coca var. ipadu Plowman) is grown in discrete localities throughout the Amazon basin, and its leaves are consumed as a pulverized powder with other additives (Plowman 1981). Until extensive eradication began about 30 years ago, Colombian coca (Erythroxylum novogranatense (D. Morris) Hieron.) was grown throughout the drier inter-Andean valleys in Colombia, but now it is restricted to the Cauca region and the Sierra Nevada de Santa Marta (Bohm et al. 1982). Lastly, Trujillo coca (E. novogranatense var. truxillense (Rusby) Plowman) is grown in the arid valleys of northwestern Peru for traditional coca leaf use and as a flavoring agent in Coca-Cola® (Gootenberg 2008). We also sampled from a disjunct population of what is believed to be Trujillo coca cultivated on the Colombia/Ecuador border but were unable to sample the disjunct Huánuco coca from southern Ecuador (Plowman 1984).

The primary hypothesis of coca domestication was described by Plowman (Plowman 1979b; Bohm et al. 1982); he posited that Huánuco coca was domesticated in the eastern Andes of Bolivia or Peru from a wild, ancestral (and presumably extinct) form of E. coca. Then, because the two taxa can form infertile hybrids, he believed Huánuco coca was taken north to dry Andean valleys in Peru or Ecuador, where it developed into Trujillo coca. Next, three pieces of evidence led him to postulate Colombian coca was derived from Trujillo. First, all coca macrofossils >1500 years old are of the Trujillo morphotype, providing evidence that Trujillo coca is older than Colombian (Plowman 1984; Dillehay et al. 2010). Second, Trujillo × Colombian F1 hybrids are fertile but Trujillo × Huánuco hybrids are not, so Colombian coca has been interpreted as a derived variety with acquired interspecific hybrid incompatibility (Bohm et al. 1982). Third, Colombian coca is the only variety that is self-compatible, which is generally a derived trait unlikely to give rise to self-incompatibility (Plowman 1986; Goldberg et al. 2010). Lastly, Huánuco coca was taken east and transformed into Amazonian coca.
However, archaeological evidence does not directly support Plowman’s linear-series hypothesis beginning with Huánuco coca because Trujillo coca has the most extensive Pre-Columbian archaeological record (Mortimer 1901; Plowman 1984), and a recent discovery of Trujillo morphotype leaves in northern Peru pushed coca consumption from 4000 to over 8000 years BP, making it one of the oldest cultivated plants in the Americas (Dillehay et al. 2010; Larson et al. 2014). Coca paraphernalia and bountiful artworks provide the only archaeological evidence of coca use in Colombia; the earliest is from the Yotoco culture (100–1200 CE; Reichel-Dolmatoff and Schrimpff 2005). The earliest Huánuco coca remains are endocarps from a Late Intermediate period (1000–1476 CE) site in Junín, Peru, but evidence of coca trade in Bolivia pushes this date to ~1700 years BP (Plowman 1984; Carter and Mamani 1986; Hastorf 1987; Valdez et al. 2015). On the basis of linguistic and ethnographic similarities across its range, Amazonian coca is presumed to be the most recently developed (Plowman 1981).

A more recent hypothesis from Johnson et al. (2005) is that E. coca (Amazonian and Huánuco) and E. novogranatense (Colombian and Trujillo) are sister species resulting from the domestication of a common ancestor (Johnson et al. 2005; Emche et al. 2011). However, their analysis, based on flavonoid profiles and amplified fragment-length polymorphisms, did not include the closest wild relatives and thus could not properly evaluate the independent origins of the coca crops.

Our previous study (White et al. 2019) revealed that the closest wild relatives of the coca varieties are the cocaine-producing, wild species Erythroxylum cataractarum Spruce ex Peyr. and Erythroxylum gracilipes Peyr. (Aynilian et al. 1974; Plowman and Rivier 1983; Bieri et al. 2006; Islam 2011).
Erythroxylum gracilipes occurs throughout most of the Amazon basin and can be distinguished from the cocas by its larger leaves (11–18 cm vs. 2.5–11 cm) with acuminate apices. Erythroxylum cataractarum of the Llanos of Colombia and Venezuela is morphologically very similar to E. novogranatense, but with stiffer branchlets, thicker membranaceous to sub-chartaceous leaves, and stipules with more prominent apical setae (Fig. 1; Supplementary Table S1 available on Dryad).

The first goal of this project is to establish the genealogical relationships and genetic structure of a severely understudied crop. By sampling both the coca crops and their closest wild relatives, this is the first study capable of inferring the number of times coca was domesticated from a wild progenitor and also where and when these events occurred. Second, we want to expand our toolkit for “unlocking” genetic data from the “treasure chests” of biological diversity stored in museums and biological collections worldwide (Särkinen et al. 2012; Jones and Good 2016). By sequencing 424 genes from over 130 herbarium collections across six taxa, we have demonstrated the efficacy of exon-capture DNA sequencing with highly degraded DNA from museum collections.

Materials and Methods

More detailed text is available in the Supplementary Material file on Dryad. We extracted genomic DNA from 154 Erythroxylum samples, 138 of them herbarium specimens (Supplementary Table S2 available on Dryad), and used a custom set of RNA probes (White et al. 2019) to sequence 424 nuclear genes. Cleaned reads were de novo assembled into contigs and mapped to the concatenated exon sequences for each gene using HybPiper (Johnson et al. 2016). We mapped the cleaned reads back to the supercontig consensus sequences for each gene and followed the seqcap_pop pipeline (Faircloth 2015; Harvey et al. 2016) and the GATK Best Practices workflow (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145) to call, annotate, and filter single nucleotide polymorphisms (SNPs).

We aligned gene sequences using MAFFT and inferred gene trees with RAxML before reconstructing a summary tree from 424 gene trees with ASTRAL-III (Katoh et al. 2002; Stamatakis 2014; Zhang et al. 2018). We used IQ-TREE 2 (Hoang et al. 2018; Minh et al. 2020) to infer a maximum likelihood tree from the concatenated SNP data set (“ML-SNP”) and to generate SNP-based (sCF) and gene tree-based (gCF) concordance factors for both the summary tree and the ML-SNP topologies. We then dropped 10 individuals showing admixture or “trans-taxonomic” cluster assignment and used the SNP data set to infer species trees with SVDquartets (Chifman and Kubatko 2014) as well as reduced-taxon summary trees with ASTRAL-III. To assess alternative domestication scenarios, we used ASTRAL-III to estimate quartet support and the probability of a constrained species tree topology.

We used our final data set of 6263 SNPs in a genetic cluster analysis using the snapclust program from the adegenet v.2.13 R package (Jombart and Ahmed 2011; Hoang et al. 2018). This method uses a fast maximum likelihood and geometric approach to assign genotypes to a set number of populations determined from goodness-of-fit statistics (Beugin et al. 2018). For principal components analysis (PCA), we used the ipyrad analysis toolkit (Eaton and Overcast 2020). Using all SNPs and the nine-population assignments, we calculated weighted $F_{ST}$ to understand population differentiation (Hudson et al. 1992). We evaluated genetic diversity by the number of private, novel alleles within a population (vcftools and vcf-contrast; Danecek et al. 2011), observed heterozygosity and nucleotide diversity (hierfstat R package; Goudet 2005), and allelic richness (the average number of alleles per SNP with rarefaction correction; diveRsity R package; Keenan et al. 2013). We estimated the degree of admixture between taxa using treemix v1.13 (Pickrell and Pritchard 2012). We dropped gracilipes2-4, leaving six populations, and performed stairway plot analyses (Liu and Fu 2015) to infer effective population size through time.

We applied an approximate Bayesian computation (ABC) approach to statistically test four alternative domestication scenarios: Plowman’s linear series, Johnson’s sister species, two origins, and three origins; the latter two hypotheses being derived from our genetic data. All E. gracilipes samples were grouped as a single population (see Supplementary Material available on Dryad) and E. cataractarum was excluded. ABC is a powerful approach for estimating coalescent parameters and comparing complex evolutionary scenarios (Beaumont et al. 2002; Gerbault et al. 2014). It was conducted in three steps using DIYABC v.2.1.0 (Cornuet et al. 2014). First, we generated 40 million simulated data sets under the coalescent based on our four topological scenarios. Second, we selected the subset of data sets closest to the observed (SNP) data according to summary statistics instead of the Bayesian likelihood computation (Nei’s genetic diversity, $F_{ST}$, and Nei’s distance using all data; Nei 1972, 1987; Weir and Cockerham 1984). Third, the posterior distributions of coalescent parameters were estimated from the subset using a local linear regression procedure and the posterior probabilities of our four scenarios were compared by calculating the relative proportion of simulated data sets for each scenario present among the 500 data sets closest to the observed data set, as well as logistic regression of each scenario probability based on deviations between observed and simulated summary statistics (Beaumont 2008). We estimated confidence in our final scenario choices by calculating posterior- and prior-based error and scenario-specific type 1 and type 2 errors from additional pseudo-observed data sets generated under each scenario.

Results and Discussion

Our study demonstrates that a genomic approach can be applied to a population genetic project focusing on a very recent diversification history using primarily historical museum collections. Our results support a novel and robust hypothesis of multiple independent origins of different coca varieties from E. gracilipes, a widespread, wild species comprised of at least two main clades.

Museum Genomics: Target Capture, Assembly, and Gene Alignment

Hybridization-based exon-capture is an efficient method for sequencing select loci from genomic DNA isolated from herbarium specimens (Hart et al. 2016; Villaverde et al. 2018). We generated 270 GB of sequence data and deposited reads for all 154 samples in the NCBI Short Read Archive under BioProject PRJNA485502. Samples were collected between 1900 and 2016 with a median collection date of 1981 (Supplementary Fig. S1 and Table S2 available on Dryad). These samples presented a spectrum of DNA quantity and degradation, but these two factors had little correlation with the success of our exon capture and target assembly.
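The direct estimate used in the third ABC step of the Materials and Methods (the share of each scenario among the simulated data sets closest to the observed one) can be illustrated with a small sketch. This is not DIYABC's implementation: the function name `direct_scenario_posteriors`, the toy scenario labels, the plain Euclidean distance on unnormalized summary statistics, and the Gaussian stand-in for a coalescent simulator are all assumptions made for illustration only.

```python
import random
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors of summary statistics."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def direct_scenario_posteriors(observed, simulations, n_closest=500):
    """Estimate scenario posterior probabilities by the direct approach.

    simulations is a list of (scenario_label, summary_statistics) pairs.
    The estimate for a scenario is its share among the n_closest simulated
    data sets nearest to the observed summary statistics.
    """
    ranked = sorted(simulations, key=lambda sim: euclidean(sim[1], observed))
    closest = ranked[:n_closest]
    counts = Counter(label for label, _ in closest)
    labels = sorted({label for label, _ in simulations})
    return {label: counts[label] / len(closest) for label in labels}


# Toy stand-in for a coalescent simulator (assumption): each scenario draws
# two summary statistics around a scenario-specific mean.
random.seed(1)
toy_means = {
    "linear_series": (0.30, 0.70),
    "sister_species": (0.40, 0.60),
    "two_origins": (0.50, 0.50),
    "three_origins": (0.55, 0.45),
}
simulations = [
    (label, [random.gauss(m1, 0.05), random.gauss(m2, 0.05)])
    for label, (m1, m2) in toy_means.items()
    for _ in range(2000)
]
observed = [0.56, 0.44]  # hypothetical observed summary statistics

for label, prob in direct_scenario_posteriors(observed, simulations).items():
    print(f"{label}: {prob:.3f}")
```

A real analysis would normalize the summary statistics before computing distances and would complement this direct estimate with the logistic regression step mentioned in the Methods.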
A linear model fit between the genomic DNA maximum fragment size and the number of genes recovered was nearly flat ($R^{2} = 0.004$); the polynomial regression model fit the relationship with an $R^{2}$ of 0.024 and was also mostly flat (Supplementary Fig. S1c available on Dryad). Samples with low input DNA amounts (less than 200 ng) were scattered throughout the range of observations of gene recovery. Plants collected in the moist tropics require extra care to dry, and DNA quality thus appears to depend more on initial drying and preservation quality than on age, corroborating Forrest et al. (2019) (Supplementary Fig. S1a, b available on Dryad). Not surprisingly, we observed a positive correlation between the log number of reads and the number of genes with good target coverage, defined as >75% of target length (linear $R^{2} = 0.396$, polynomial $R^{2} = 0.422$; Supplementary Fig. S1d available on Dryad). This demonstrates that we recovered >90% of exons at >75% of their length with 4× coverage per site at about 10× sequencing depth, and this depth provides a good target for sequencing effort. However, our number of reads per sample is quite variable, spanning three orders of magnitude, thus highlighting the variability incurred by combining museum samples of varying qualities together in large sequencing pools. Dilution of RNA baits and smaller numbers of samples per hybridization pool prior to exon-capture will help minimize amplification bias and overall variability of sequence reads per sample.

We removed 10 of our 154 samples from the analysis because they failed to assemble at least 90 genes with good coverage (Supplementary Fig. S2 available on Dryad). Across the 424 genes, the average assembled contig length per sample ranged from 819 bp to 8279 bp with a median of 2160 bp of sequence per sample and the 424 genes had an average of 141 samples per gene.
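The coefficient of determination reported for the linear fits above can be computed with ordinary least squares. The sketch below is a minimal illustration, not the authors' analysis script; the function name `linear_r_squared` and the read and gene counts are hypothetical values chosen only to show the log transformation of read number.

```python
import math


def linear_r_squared(x, y):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical per-sample values (not taken from the study).
reads = [2.1e4, 8.5e4, 3.0e5, 1.2e6, 4.4e6]
genes_with_good_coverage = [95, 210, 330, 390, 415]

log_reads = [math.log10(r) for r in reads]
slope, intercept, r2 = linear_r_squared(log_reads, genes_with_good_coverage)
print(f"slope={slope:.1f} intercept={intercept:.1f} R^2={r2:.3f}")
```

An $R^{2}$ near zero, as reported for fragment size versus genes recovered, means the fitted line explains almost none of the variance in gene recovery.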
The length of concatenated exon targets was 740,646 bp (median 1746 bp). Our assembled supercontigs (exon+intron) had an average concatenated length of 1,484,570 bp, revealing that we sequenced about 740 kb of intron and flanking sequence. Gene alignments ranged from 1749 to 57,834 bp in length (median 7290) with 30–93% missing data (median 59%). Cleaned trimAL alignments ranged from 129 to 8700 bp in length (median 1677) with 0.05–51.73% missing data (median 9.5%).

While previous studies have included key historical collections within broader biodiversity investigations (e.g., McCormack et al. 2016; Prosser et al. 2016; Greiman et al. 2018), our study has shown that modern phylogeographic analyses can rely almost entirely on historical museum collections to generate deep and useful genomic data sets. Historical museum collections can also alleviate geographic sampling shortfalls, which can lead to erroneous inferences or oversplitting of geographic populations (e.g., Jackson et al. 2017; Linck et al. 2019). This applies to systems where field collection is challenging or even dangerous, but also to most biological systems because of the ever-increasing difficulty of obtaining permits (Prathapan et al. 2018).

Genetic Structure of Coca and Wild Relatives

For a crop of such cultural and economic significance, our scientific understanding of the origin and evolution of coca is minimal. This is not only due to the obstacles to research imposed by its international prohibition but also the botanically challenging nature of Erythroxylum, a morphologically complex and speciose clade of tropical shrubs and treelets. Genetic sampling from the world’s largest Erythroxylum collection at the Field Museum has reshaped our fundamental understanding of the systematics of this clade (White et al. 2019).
The maximum likelihood gene trees inferred from the full gene alignments, cleaned trimAL alignments, or 212 alignments with the least missing data produced nearly identical ASTRAL summary trees; the only differences were among tips within the main clades and do not affect our presented results. Of these, the ASTRAL summary tree (AST; Fig. 2a; Supplementary Fig. S3 available on Dryad) inferred from the full gene alignments had the best local posterior probability along backbone nodes. To measure support for AST versus ML-SNP topology (topological differences are described in the Supplementary Text available on Dryad), we calculated SNP-based concordance factors (sCF; Supplementary Figs. S4 and S5 available on Dryad) and gene-tree concordance factors (gCF; Supplementary Figs. S6 and S7 available on Dryad) and found AST to be the best-supported phylogeny across our sequence and SNP data sets.

Figure 2. Phylogeny and genetic clustering of coca varieties and wild relatives. a) ASTRAL-III summary (AST) tree topology with branch lengths optimized using the ML-SNP data set; scale indicates substitutions per site and branch support values denote concordance factors of SNPs; collapsed clades are marked with the number of samples; tip names indicate taxon, sample ID, and geographic provenance indicated by the two letter country code and political subdivision. Probability assignment of individuals to five or nine genetic clusters are indicated in bar plots along with the population assignment of ‘pure’ individuals for species trees and population genetic statistics. b) SVDquartets species tree inferred from SNP data with bootstrap support. c) ASTRAL-III species tree inferred from 424 gene trees with local posterior probability (LPP) and percent quartet support, respectively; branch lengths scaled by coalescent units. d) Constrained ASTRAL-III species trees indicating LPP and percent quartet support for an E. cataractarum + Trujillo/Colombian clade or a Huánuco + gracilipes1 clade.

The AST phylogeny reveals the 37 E. gracilipes samples form a series of nested clades within which arise separate monophyletic lineages comprising E. cataractarum, E. novogranatense (Trujillo and Colombian cocas), Amazonian coca, and Huánuco coca (Fig. 2a). The goodness-of-fit estimators for the number of genetic clusters indicated the most significant information content in 5 and 9 clusters (Supplementary Fig. S8 available on Dryad; clustering results described in Supplementary Text available on Dryad). The earliest diverging lineages of E. gracilipes are geographically distributed around the Amazon Basin and form a single cluster at $K = 5$ and three clusters at $K = 9$ (gracilipes2, gracilipes3, gracilipes4; Fig. 2a). Next, the E. novogranatense lineage is divided into two clades and two clusters ($K = 9$) that define the Trujillo and Colombian coca varieties. The exceptions are samples trux.854 and trux.855 from the small, disjunct population on Río Chical at the Colombia/Ecuador border, revealing this population has a genealogical history closer to Colombian coca. This population could represent an early Colombian landrace derived from Trujillo coca, or it could be an admixed variety with combined Colombian and Trujillo ancestry. Additional sampling from Ecuador and Venezuela should improve our understanding of the history of this clade.

The remaining E. gracilipes samples form a single cluster, gracilipes1, and belong to well-supported clades with geographic structure (Fig. 2a). Within gracilipes1, 15 samples of Amazonian coca form a clade with perfect posterior probability (Supplementary Fig. S3 available on Dryad), most closely related to gracilipes1 samples from Ecuador and northern Peru. Finally, three gracilipes1 samples from the Ucayali and Madre de Dios Departments of Peru form a nested series leading to the well-supported Huánuco coca clade (Fig. 2a; Supplementary Fig. S3 available on Dryad). The fourth sample in this series, “cf.coca.884,” is described as cultivated on its herbarium voucher, but its exceptional height (3 m), leaf morphology (apex and venation), and clustering results indicate admixture with E. gracilipes (Image in Supplementary material available on Dryad).

The E. cataractarum clade diverges as sister to gracilipes1 and comprises a distinct genetic cluster (Fig. 2a). However, our ML-SNP (Supplementary Figs. S5 and S7 available on Dryad) and SVDquartets tree (Fig. 2b) stirred a prior suspicion that E. cataractarum could actually be sister to the E. novogranatense lineage (White et al. 2019). We tested support among our gene trees for this alternative placement and found it received lower quartet support (35% vs. 41%) and posterior probability (0.01 vs. 0.99). To further evaluate the evolutionary history generating this discordance, we used Treemix to model genomic admixture and gene flow across a tree representing the nine populations. This inferred gene flow from Colombian coca into E. cataractarum and, secondarily, from Amazonian coca into E. cataractarum. The third most significant migration edge was from E. cataractarum into gracilipes4 (Supplementary Fig. S8 available on Dryad). These taxa grow in the vicinity of E. cataractarum (Figs. 1 and 2a), and thus it is plausible that seeds and/or pollen from nearby Amazonian and Colombian coca farms have naturally introduced alleles into E. cataractarum. This result explains the phylogenomic discordance of the E. cataractarum clade and also indicates that it may not have a role in the evolution of coca, though the reverse appears to be true.

Multiple Origins Hypothesis

Our phylogenomic and clustering results indicate E. gracilipes is a paraphyletic taxon with respect to the Amazonian, Huánuco, and Colombian/Trujillo coca lineages (Fig. 2). This structure elucidates a novel hypothesis of multiple origins of domestication of coca from progenitor E. gracilipes (i.e., gracilipes1 and possibly gracilipes2), refuting both Plowman’s linear-series hypothesis beginning with Huánuco coca (Plowman 1979b) and the sister-species hypotheses suggested by Johnson and colleagues (Johnson et al. 2005; Emche et al. 2011). While there is a clear separation of the E. novogranatense lineage (Colombian and Trujillo) from the E. coca lineage (Amazonian and Huánuco) and thus at least two domestication events, the separation of the Amazonian and Huánuco varieties, which would suggest three origins of domestication, is more tenuous.

The first evidence of independent origins of Amazonian and Huánuco coca comes from the phylogeny and clustering results (Fig. 2), where they are separated by samples of gracilipes1 from central and southern Peru and a clade of Ecuadorean gracilipes1. While the placement of this Ecuadorean clade is not statistically robust, three gracilipes1 samples from Peru are firmly supported as basal to the Huánuco clade (Fig. 2a; Supplementary Figs. S2–S7 available on Dryad). The second basis for three domestications is that Amazonian coca clusters with gracilipes1 at $K = 5$ and forms its own cluster in $K = 9$, suggesting closer proximity to gracilipes1 than Huánuco coca. To further evaluate this separation and the progenitor-derivative relationships under our three-domestication hypothesis, we conducted a principal components analysis and calculated population genetic statistics across individuals from the nine populations. The PCA separates individuals into populations in agreement with the clustering analysis (Fig. 3). The nine populations, including Huánuco and Amazonian, cluster independently of one another in our bivariate plots describing the first eight components (explaining 65.9% of SNP variance; Fig. 3), further supporting the evolutionary separation of these two varieties.

Figure 3. Genetic variation of coca varieties and wild relatives.
The first eight principal components are presented in four bivariate plots.

Wright’s index of fixation (|$F_{ST}$|) showed that all population pairs have moderate to high levels of differentiation (range 0.25–0.78; Table 2). By our measures of genetic distance, Amazonian and Huánuco cocas are more similar to E. gracilipes than they are to each other, and Amazonian and gracilipes1 are the least differentiated taxa across all populations (Table 2). Amazonian and Huánuco show a similar degree of differentiation as Colombian and Trujillo (0.55 vs. 0.52, respectively). Statistics for the number of private alleles (PA; not weighted by sample size), average allelic richness per SNP (AR; weighted by sample size), observed heterozygosity (Ho), and genetic diversity (Hs) are presented in Table 1. Amazonian has the highest genetic diversity among the cocas (AR, Hs). Together, these statistics all corroborate the separation of Huánuco and Amazonian coca and thus a three-origin hypothesis of coca domestication. Corroborating expectations of genetic bottlenecks during domestication events (Gross and Olsen 2010), the coca varieties have lower genetic diversity than the wild taxa (PA, AR, Hs; see Supplementary Material available on Dryad for discussion of Ho).

Table 1. Genetic diversity of coca varieties and wild relatives

Population    PA    AR      Hs      Ho
gracilipes4   211   1.088   0.0734  0.028
gracilipes3   131   1.014   0.118   0.067
gracilipes2   95    1.096   0.066   0.029
gracilipes1   467   1.116   0.083   0.023
cataract.     170   1.083   0.080   0.031
Colombian     35    1.064   0.046   0.027
Trujillo      7     1.021   0.015   0.011
Amazonian     65    1.083   0.068   0.060
Huánuco       55    1.045   0.050   0.026

Populations, as defined in Figure 3, are in rows. Statistics show the number of private alleles (PA), average, rarefaction-corrected allelic richness per SNP (AR), genetic diversity (Hs), and observed heterozygosity (Ho).

Table 2. Population pairwise |$F_{ST}$|

Population    gracilipes4  gracilipes3  gracilipes2  gracilipes1  cataract.  Colombian  Trujillo  Amazonian
gracilipes3   0.68         -            -            -            -          -          -         -
gracilipes2   0.42         0.62         -            -            -          -          -         -
gracilipes1   0.46         0.6          0.32         -            -          -          -         -
cataract.     0.6          0.73         0.51         0.41         -          -          -         -
Colombian     0.66         0.76         0.56         0.49         0.6        -          -         -
Trujillo      0.79         0.91         0.71         0.55         0.71       0.52       -         -
Amazonian     0.61         0.73         0.51         0.25         0.54       0.61       0.7       -
Huánuco       0.74         0.83         0.67         0.4          0.69       0.71       0.78      0.55

Testing Alternative Domestication Scenarios

We used coalescent simulations under an approximate Bayesian computation (ABC) framework to estimate statistical support for Plowman’s linear series (Scenario 1; Plowman 1979b), Johnson’s sister species (Scenario 2; Johnson et al. 2005), a two-origin hypothesis combining Huánuco and Amazonian coca (Scenario 3), and a three-origin hypothesis (Scenario 4; Fig. 4a). Simulations generated under Scenario 3 and evaluated using the logistic regression approach received the highest posterior probability of matching the observed summary statistics from our first SNP data set, but Scenario 4 was more similar to our second SNP data set (Fig. 4b).
Using the direct proportions of simulated data sets closest to our observed data sets, Scenario 3 best represented both SNP data sets (Supplementary Figs. S10 and S11 available on Dryad). The ABC results consistently showed Plowman’s linear-series and Johnson’s sister-species hypotheses are unlikely historical scenarios explaining the current genetic diversity and divergence of these taxa (Fig. 4b; Supplementary Figs. S10 and S11 available on Dryad).

Figure 4. ABC tests and effective population size through time. a) Historical scenarios representing four domestication hypotheses. Scenarios 1–4 are drawn within colored boxes and the legend at top identifies the branch color for each taxon. b) Posterior probability of historical scenarios based on multinomial logistic regression procedure. c) Stairway plots reconstructing changes in effective population size through time for each of the four coca varieties, gracilipes1, and E. cataractarum. Solid lines are mean population size across 200 bootstraps and the 95% confidence interval is shown in gray.

Based on several analyses, we can be confident that ABC could discriminate these scenarios and estimate their probabilities. Principal components analyses show the posterior distributions for Scenarios 3 and 4 were able to accurately replicate the observed data (Supplementary Figs. S10c and S11c available on Dryad). The posterior predictive error rates specific to confidence in scenario choice ranged from 0.146 to 0.206 among direct and logistic estimates from the two data sets (Supplementary Table S5 available on Dryad). The global error computed over the whole prior distribution (parameters and scenarios) ranged from 0.243 to 0.307 (Supplementary Table S5 available on Dryad). We summarized calculations of scenario-based prior error rates in confusion matrices and estimates of type I and type II error (Supplementary Tables S6–S9 available on Dryad).
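The direct estimate and the scenario-based error rates described above can be illustrated with a minimal Python sketch. It is not the DIYABC procedure used in this study: the simulator is a hypothetical stand-in that draws two toy summary statistics from scenario-specific normal distributions instead of running coalescent simulations, and all function names, scenario centers, and tuning values are assumptions made for illustration. The sketch takes the posterior probability of each scenario as its share among the simulated data sets closest to the observed statistics, then builds a confusion matrix from pseudo-observed data sets and derives, for each scenario, a type I error (the true scenario was not chosen) and a type II error (the scenario was chosen although the data were generated under another scenario).

```python
import random
from collections import Counter

SCENARIOS = (1, 2, 3, 4)

# Hypothetical scenario centers for two toy summary statistics (assumed values).
CENTERS = {1: (0.2, 0.8), 2: (0.4, 0.6), 3: (0.6, 0.4), 4: (0.7, 0.3)}


def simulate(scenario, rng):
    """Toy stand-in for a coalescent simulator: returns two summary statistics."""
    mean_a, mean_b = CENTERS[scenario]
    return (rng.gauss(mean_a, 0.1), rng.gauss(mean_b, 0.1))


def build_reference_table(n_per_scenario, rng):
    """Simulate n_per_scenario data sets under every scenario."""
    return [(s, simulate(s, rng)) for s in SCENARIOS for _ in range(n_per_scenario)]


def direct_posterior(observed, table, n_closest):
    """Direct estimate: share of each scenario among the n_closest simulations."""
    def squared_distance(stats):
        return sum((a - b) ** 2 for a, b in zip(stats, observed))

    nearest = sorted(table, key=lambda row: squared_distance(row[1]))[:n_closest]
    counts = Counter(scenario for scenario, _ in nearest)
    return {s: counts.get(s, 0) / n_closest for s in SCENARIOS}


def confusion_matrix(table, n_tests, n_closest, rng):
    """Rows are true scenarios; entries count how often each scenario was chosen."""
    matrix = {true: Counter() for true in SCENARIOS}
    for true in SCENARIOS:
        for _ in range(n_tests):
            posterior = direct_posterior(simulate(true, rng), table, n_closest)
            chosen = max(posterior, key=posterior.get)
            matrix[true][chosen] += 1
    return matrix


def error_rates(matrix):
    """Type I: true scenario not chosen. Type II: chosen when another scenario was true."""
    type1, type2 = {}, {}
    for s in SCENARIOS:
        n_true = sum(matrix[s].values())
        type1[s] = 1 - matrix[s][s] / n_true
        others = [t for t in SCENARIOS if t != s]
        n_other = sum(sum(matrix[t].values()) for t in others)
        type2[s] = sum(matrix[t][s] for t in others) / n_other
    return type1, type2
```

In this toy setting, calling `direct_posterior((0.6, 0.4), table, 100)` on a reference table built with `build_reference_table` returns the scenario shares among the 100 nearest simulations. Scenarios with neighboring centers (here 3 and 4) are confused more often, which mirrors why error rates are reported per scenario.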
We found the simulated scenarios received the highest probability in all cases, but the Type I error, the probability a correct scenario was not chosen, was high for Scenario 1 (0.39–0.48) and Scenario 2 (0.33–0.42; Scenario 3 |$=$| 0.17–0.23; Scenario 4 |$=$| 0.08–0.18). However, the Type I error is significantly lower if we only analyze confusions among Scenario 3, Scenario 4, and Scenario 1 or Scenario 2 (range |$=$| 0.07–0.20), which are the meaningful comparisons with respect to our main results. Type II error, the probability a scenario was chosen when data were generated under an alternate scenario, ranged from 0.07 to 0.12 for Scenario 1, 0.15–0.16 for Scenario 2, 0.09–0.11 for Scenario 3, and 0.03–0.04 for Scenario 4. Model check results are presented in Supplementary Figs. S10 and S11 available on Dryad and prior and posterior distributions of the demographic parameters are presented in Supplementary Figs. S12 and S13 available on Dryad. We chose not to evaluate support for two or three origins from additional SNP data sets because the multiple domestication hypothesis is a high-confidence conclusion given our genetic and geographic sampling. Our posterior samples converged on parameter estimations and posterior support for the domestication scenarios agrees with the other analyses. In addition, there are limitations in the ABC approach, the largest being that many demographic phenomena beyond our topological scenarios have influenced the allele frequencies and summary statistics our results are based upon (Sunnåker et al. 2013). For instance, the remarkably high observed heterozygosity of Amazonian coca could be explained by the accumulation of somatic mutations during the mostly clonal propagation of this crop, as is observed in other mostly asexual populations (Table 1; Stoeckel and Masson 2014). More sampling of E. 
gracilipes, delimitation of gracilipes1–4 as distinct taxa (see below), and distinct demographic models of each domestication event will more accurately bear on this history. Timing and Locations of Domestication We estimated the relative age of each coca variety by explicitly modeling the change in population sizes through time using site frequency spectra derived from our SNP data (Fig. 4c; Supplementary Fig. S14 available on Dryad). We did not estimate generation times or extrapolate mutation rates from other studies and therefore we cannot estimate the absolute timing of domestication events. The stairway plots of gracilipes1 and E. cataractarum show these taxa have experienced a gradual population size increase followed by a very recent and drastic decrease (Fig. 4c). All coca varieties except Trujillo are marked by a stable historical population size that is smaller than gracilipes1 through much of its history. Colombian and Amazonian have about the same size, but Huánuco is smaller. Colombian coca shows the earliest departure from this stability and experienced a population decrease followed by a more recent increase, consistent with a domestication bottleneck. For Trujillo coca, population sizes have increased and then leveled off. Huánuco and Amazonian coca depart from the background population size more recently and at about the same time, although confidence intervals show changes in Amazonian could be more recent. Amazonian shows a pattern of recent population decrease whereas Huánuco coca appears to have experienced a bottleneck—a small decrease in population size followed by a large increase, then leveling off (Fig. 4c). In accordance with expectations based on historical coca farming practices, Huánuco coca is inferred to have the largest effective population size at present and Amazonian has the smallest. 
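The site frequency spectrum that underlies such stairway plots can be sketched as follows. This is a minimal illustration and not the pipeline used in this study: it assumes biallelic SNPs coded as alternate-allele counts (0, 1, or 2) per diploid individual with no missing data, and the function names are hypothetical. It tallies SNPs by minor-allele count to give a folded spectrum and can zero the singleton class, which is one simple way to set singletons aside.

```python
def folded_sfs(genotypes):
    """Folded site frequency spectrum from diploid genotypes.

    genotypes: list of SNPs, each a list of alternate-allele counts
    (0, 1, or 2) per individual, with no missing data (assumed).
    Returns a list where index k holds the number of SNPs whose
    minor allele is observed k times.
    """
    n_chromosomes = 2 * len(genotypes[0])
    sfs = [0] * (n_chromosomes // 2 + 1)
    for snp in genotypes:
        alt_count = sum(snp)
        minor_count = min(alt_count, n_chromosomes - alt_count)
        sfs[minor_count] += 1
    return sfs


def mask_singletons(sfs):
    """Return a copy of the spectrum with the singleton class set to zero."""
    masked = list(sfs)
    if len(masked) > 1:
        masked[1] = 0
    return masked


# Toy data: five SNPs scored in three diploid individuals (six chromosomes).
example = [
    [0, 1, 0],  # alternate allele seen once: singleton
    [2, 2, 1],  # alternate allele seen five times: minor allele seen once
    [1, 1, 0],  # minor allele seen twice
    [1, 1, 1],  # minor allele seen three times
    [0, 0, 0],  # monomorphic site
]
spectrum = folded_sfs(example)
spectrum_without_singletons = mask_singletons(spectrum)
```

Index 0 of the returned list counts monomorphic sites, so a downstream tool that expects only polymorphic classes would take the spectrum from index 1 onward.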
If we remove singletons, the model reconstructs generally similar population size histories, but without the recent population decreases in gracilipes1 and E. cataractarum, and without a rebound after the bottleneck for Colombian coca (Supplementary Fig. S14 available on Dryad). The relative coalescence times from our ABC parameter estimations also show the Trujillo/Colombian domestication was oldest and, in the case of three origins, that Amazonian coca coalesced with E. gracilipes the fewest generations ago (Supplementary Figs. S12b and S13b available on Dryad). In addition to the Colombian coca stairway plot, the long branch subtending the E. novogranatense clade (Colombian and Trujillo; Fig. 2a) and the |$F_{ST}$| statistics (Table 2) also corroborate the hypothesis that this is the oldest coca crop. Archaeological leaf fragments and chewing paraphernalia reveal coca culture was well established 8000 years BP (Plowman 1984; Dillehay et al. 2010). Evidence from the archaeological record, mating systems, and hybrid crosses (see Bohm et al. 1982) led Plowman to believe that Colombian coca was derived from Trujillo. Under neutral evolutionary models, we would expect higher genetic diversity in progenitor taxa (Gross and Olsen 2010; Feng et al. 2020), but we see Trujillo has fewer private alleles and lower allelic richness, genetic diversity, and observed heterozygosity than Colombian coca (Table 1). However, these statistics are also influenced by different demographic histories (see Mortimer 1901; Plowman 1984; Gootenberg 2008), so we maintain the working hypothesis established by Bohm et al. (1982) that Trujillo coca or a common ancestor was the progenitor of Colombian coca. Thus, a likely region of this domestication event was in Ecuador or northern Peru, near the current range of cultivation. Our phylogenetic and clustering analyses indicate Huánuco coca was domesticated in the eastern Andean foothills of southern Peru. 
The two gracilipes1 samples from Madre de Dios, Peru (grac.298, grac.654) cluster with and are placed at the base of the Huánuco coca clade (Fig. 2a). Neither of these samples was collected in proximity (|$<$|100 km) to coca farms, but this does not rule out the possibility of postdomestication gene flow from coca farms in this region into gracilipes1. A remarkable sample (cf.coca.884) from central Peru is described as cultivated but is morphologically intermediate and admixed with gracilipes1 (Image in Supplementary material available on Dryad). A morphologically wild-type E. gracilipes from the montane forest in Ecuador (grac.100) was also inferred to be slightly admixed with Huánuco coca, but this is likely due to postdomestication gene flow because the Ecuadorean gracilipes1 forms a clade separate from Huánuco coca (Fig. 2a). The Amazonian clade is most closely related to gracilipes1 individuals from Loreto, Peru, and Amazonian Ecuador (Fig. 2; Supplementary Figs. S5 and S7 available on Dryad). At |$K = 5$|, Amazonian coca is not distinct from gracilipes1, but it does emerge as a separate genetic cluster when nine clusters are defined (Fig. 2a). Lastly, the clustering and |$F_{ST}$| results, as well as the fact that Amazonian coca has the highest genetic diversity of any coca variety (Table 1), support the hypothesis that Amazonian is the most recently domesticated coca crop. This is in line with linguistic and ethnographic observations that indigenous agricultural practices, preparation, consumption, and terminology surrounding coca are remarkably consistent across the Amazon basin, and that coca is thus thought to have been dispersed more recently throughout its current range (Plowman 1986). Thus, Amazonian coca was either recently derived from Huánuco coca or it was independently domesticated from gracilipes1 in Amazonian Ecuador or northern Peru.

Erythroxylum gracilipes: Defining the ‘Mother of Coca’

Although E.
gracilipes is one of the few wild species to have been hypothesized as a wild ancestor of the coca crops (Macbride 1949; see White et al. 2019), the taxonomic and ecological understanding of this wild species is minimal. This phylogeographic analysis reveals E. gracilipes is comprised of at least two main clades, gracilipes1 and gracilipes2–4, representing two or more distinct taxa. Amazonian and Huánuco coca are clearly derived from gracilipes1, but E. cataractarum and E. novogranatense could be derived from gracilipes1 or gracilipes2. Given our new understanding of this phylogenetic structure, focused morphological, ecological, and phylogeographic analyses are needed to inform species delimitation and taxonomic revision of E. gracilipes, thus clarifying the taxonomic identity of the progenitor(s) of coca. While taxonomic revision could lump E. coca and E. novogranatense as varieties within paraphyletic E. gracilipes in order to reflect the evolutionary relationships of these taxa (Baum 2009), we believe they should instead retain their species status because they appear to be morphologically and genetically distinct, independently evolving, monophyletic lineages (Fig. 2a; Table 2; Supplementary Fig. S9 available on Dryad; De Queiroz 2007). Following additional sampling and analysis, gracilipes2–4 should probably be reclassified as one or more distinct species. One of our gracilipes1 samples in this study, grac.798, was collected at the same locality as the E. gracilipes type specimen (Spruce 3068; San Carlos, Amazonas, Venezuela), so gracilipes1 will likely retain the E. gracilipes epithet (Turland et al. 2018). Lastly, the monophyly of E. cataractarum and lack of gene flow with gracilipes1–4 supports the conservation of this epithet. Our results establish the hypothesis that E. cataractarum formed by peripatric speciation via ecological adaptation to gallery forests in the Colombian and Venezuelan Llanos. 
Domestication in this system has resulted in plants with smaller, rounder, and softer leaves (lacking the sclerosed spongy mesophyll cells of E. gracilipes; Rury 1981) and erect to virgate branches (Supplementary Table S1 available on Dryad). Ethnobotanical knowledge is also scarce; the only credible record we are aware of reports that the leaves are consumed for rheumatism and relaxation by Amerindians in the upper Rio Napo in Ecuador (Friedman et al. 1993). With the knowledge that E. gracilipes is a diverse and complex clade from which all coca varieties evolved, we hope this study will invigorate new botanical collections and systematic and ethnobotanical investigation of this wild taxon.

Conclusions

Our study has demonstrated that modern biotechnologies have finally permitted the efficient sequencing of genomic DNA from the hundreds of millions of preserved specimens waiting in biological collections around the world. While the technology is available, this project also shows the importance of taxonomically complete collections for the efficacy and productivity of systematic and biodiversity investigations. Museum genomics projects can access a wide geographic and temporal scale, but might be best suited to initial, broad investigations like this one before more geographically or taxonomically focused evolutionary studies are conducted. Harvard ethnobotanist R. E. Schultes called coca the most important South American narcotic plant due to its prevalence and significance for indigenous cultures, as well as the revolutionary role of cocaine in Western medicine (Schultes 1979). Our genetic blueprint of the domestication of coca reveals that different Amerindian peoples have continuously adopted wild E. gracilipes into cultivation as a mild stimulant and medicine.
In the broad context of crop origins and centers of civilization, this contradicts the traditional Vavilovian view of few, distinct centers of origin for crops with extensive dispersal networks (Vavilov and Löve 2009) and instead corroborates localized and widespread domestication practices (Harlan 1971). This pervasive ingenuity has resulted in four unique coca crops. Coca cultivation and culture have flourished for over 8000 years in northwestern South America following the domestication of E. novogranatense in that region. Erythroxylum gracilipes was also domesticated into Huánuco coca in the Andes/Amazon region of Peru and Bolivia, where it has grown into a sacred commodity and cultural symbol unparalleled in the world of crop plants. The coca grown today by indigenous tribes in the Amazon basin was possibly brought down from the Andes but could also represent a third, and most recent, independent origin of coca. The results of this study have reframed our understanding of the history and diversity of this crop and have directly informed the next generation of research into where, when, and how coca was domesticated from E. gracilipes.

Supplementary material

Data available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.cvdncjt1n.

Acknowledgments

We thank the following herbaria for use of material: F, HUEFS, MOL. Thanks to E. Gardner and M. Johnson for guidance with library preparation and Hyb-Seq analyses, J. Walsh for assistance with DNA isolation, and the following people for help in the field and lab: A. Daza, M. Huinga, J. Janovec, F. Parra, E. Vosburgh, and J. Wells. Thanks to Proyecto Khoka in Medellín for collaboration. Specimen collection in Colombia was performed under the general collecting permit Resoluciones ANLA 1177 9 octubre 2014 and 1386 29 octubre 2015 awarded to Universidad de los Andes.
Funding

This work was supported by National Science Foundation grants to DMW (GRFP-0907994) and RJMG (DEB-1354975), the Grainger Bioinformatics Center and the Pritzker Laboratory for Molecular Systematics and Evolution at the Field Museum. Thank you to the Ree Lab and our anonymous reviewers for helpful comments on the manuscript.

References

Aynilian G.H., Duke J.A., Gentner W.A., Farnsworth N.R. 1974. Cocaine content of Erythroxylum species. J. Pharm. Sci. 63:1938–1939.
Baum D.A. 2009. Species as ranked taxa. Syst. Biol. 58:74–86.
Beaumont M.A. 2008. Joint determination of topology, divergence time, and immigration in population trees. In: Matsumura S., Forster P., Renfrew C., editors. Simulation, genetics, and human prehistory. Cambridge: McDonald Institute for Archaeological Research. p. 135–154.
Beaumont M.A., Zhang W., Balding D.J. 2002. Approximate Bayesian computation in population genetics. Genetics 162:2025–2035.
Beck J.B., Semple J.C. 2015. Next-generation sampling: pairing genomics with herbarium specimens provides species-level signal in Solidago (Asteraceae). Appl. Plant Sci. 3:1500014.
Beugin M.-P., Gayet T., Pontier D., Devillard S., Jombart T. 2018. A fast likelihood solution to the genetic clustering problem. Methods Ecol. Evol. 9:1006–1016.
Bewley-Taylor D.R. 2016. Coca and cocaine: the evolution of international control. In: Gootenberg P., editor. Roadmaps to regulation: coca, cocaine, and derivatives. Oxford: The Beckley Foundation. p. 1–13.
Bieri S., Brachet A., Veuthey J.-L., Christen P. 2006. Cocaine distribution in wild Erythroxylum species. J. Ethnopharmacol. 103:439–447.
Bohm B.A., Ganders F.R., Plowman T. 1982. Biosystematics and evolution of cultivated coca (Erythroxylaceae). Syst. Bot. 7:121–133.
Carter W.E., Mamani M. 1986. Coca in Bolivia. La Paz: Juventud.
Casale J.F., Mallette J.R. 2016. Illicit coca grown in Mexico: an alkaloid and isotope profile unlike coca grown in South America. Forensic Chem. 1:1–5.
Chifman J., Kubatko L. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics 30:3317–3324.
Conzelman C.S., White D.M. 2016. The botanical science and cultural value of Coca leaf in South America. In: Gootenberg P., editor. Roadmaps to regulation: coca, cocaine, and derivatives. Oxford: The Beckley Foundation.
Cornuet J.M., Pudlo P., Veyssier J., Dehne-Garcia A., Gautier M., Leblois R., Marin J.M., Estoup A. 2014. DIYABC v2.0: a software to make approximate Bayesian computation inferences about population history using single nucleotide polymorphism, DNA sequence and microsatellite data. Bioinformatics 30:1187–1189.
Danecek P., Auton A., Abecasis G., Albers C.A., Banks E., DePristo M.A., Handsaker R.E., Lunter G., Marth G.T., Sherry S.T., McVean G., Durbin R., 1000 Genomes Project Analysis Group. 2011. The variant call format and VCFtools. Bioinformatics 27:2156–2158.
Dávalos L.M., Bejarano A.C., Hall M.A., Correa H.L., Corthals A., Espejo O.J. 2011. Forests and drugs: coca-driven deforestation in tropical biodiversity hotspots. Environ. Sci. Technol. 45:1219–1277.
De Queiroz K. 2007. Species concepts and species delimitation. Syst. Biol. 56:879–886.
Dillehay T.D., Rossen J., Ugent D., Karathanasis A., Vásquez V., Netherly C.P.J. 2010. Early Holocene coca chewing in northern Peru. Antiquity 84:939–953.
Eaton D.A.R., Overcast I. 2020. ipyrad: interactive assembly and analysis of RADseq datasets. Bioinformatics 36:2592–2594.
Emche S.D., Zhang D., Islam M.B., Bailey B.A., Meinhardt L.W. 2011. AFLP phylogeny of 36 Erythroxylum species. Trop. Plant. Biol. 4:126–133.
Faircloth B.C. 2015. PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics 32:786–788.
Feng L.-Y., Liu J., Gao C.-W., Wu H.-B., Li G.-H., Gao L.-Z. 2020. Higher genomic variation in wild than cultivated rubber trees, Hevea brasiliensis, revealed by comparative analyses of chloroplast genomes. Front. Ecol. Evol. 8:237. doi:10.3389/fevo.2020.00237.
Forrest L.L., Hart M.L., Hughes M., Wilson H.P., Chung K.-F., Tseng Y.-H., Kidner C.A. 2019. The limits of Hyb-Seq for herbarium specimens: impact of preservation techniques. Front. Ecol. Evol. 7:439. doi:10.3389/fevo.2019.00439.
Friedman J., Bolotin D., Rios M., Mendosa P., Cohen Y., Balick M.J. 1993. A novel method for identification and domestication of indigenous useful plants in Amazonian Ecuador. In: Janick J., Simon J.E., editors. New crops. New York: Wiley.
Gerbault P., Allaby R.G., Boivin N., Rudzinski A., Grimaldi I.M., Pires J.C., Climer Vigueira C., Dobney K., Gremillion K.J., Barton L., Arroyo-Kalin M., Purugganan M.D., Rubio de Casas R., Bollongino R., Burger J., Fuller D.Q., Bradley D.G., Balding D.J., Richerson P.J., Gilbert M.T.P., Larson G., Thomas M.G. 2014. Storytelling and story testing in domestication. Proc. Natl. Acad. Sci. USA 111:6159–6164.
Goldberg E.E., Kohn J.R., Lande R., Robertson K.A., Smith S.A., Igić B. 2010. Species selection maintains self-incompatibility. Science 330:493–495.
Gootenberg P. 2008. Andean cocaine: the making of a global drug. Chapel Hill: University of North Carolina Press.
Goudet J. 2005. HIERFSTAT, a package for R to compute and test hierarchical F-statistics. Mol. Ecol. Resour. 5:184–186.
Greiman S.E., Cook J.A., Tkach V.V., Hoberg E.P., Menning D.M., Hope A.G., Sonsthagen S.A., Talbot S.L. 2018. Museum metabarcoding: a novel method revealing gut helminth communities of small mammals across space and time. Int. J. Parasitol. 48:1061–1070.
Gross B.L., Olsen K.M. 2010. Genetic perspectives on crop domestication. Trends Plant Sci. 15:529–537.
Harlan J.R. 1971. Agricultural origins: centers and noncenters. Science 174:468–474.
Hart M.L., Forrest L.L., Nicholls J.A., Kidner C.A. 2016. Retrieval of hundreds of nuclear loci from herbarium specimens. Taxon 65:1081–1092.
Harvey M.G., Smith B.T., Glenn T.C., Faircloth B.C., Brumfield R.T. 2016. Sequence capture versus restriction site associated DNA sequencing for shallow systematics. Syst. Biol. 65:910–924.
Hastorf C.A. 1987. Archaeological evidence of coca (Erythroxylum coca, Erythroxylaceae) in the Upper Mantaro Valley, Peru. Econ. Bot. 41:292–301.
Hoang D.T., Chernomor O., Von Haeseler A., Minh B.Q., Vinh L.S. 2018. UFBoot2: improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35:518–522.
Hudson R.R., Slatkin M., Maddison W.P. 1992. Estimation of levels of gene flow from DNA sequence data. Genetics 132:583–589.
Islam M.B. 2011. Tracing the evolutionary history of coca (Erythroxylum) [thesis]. University of Colorado at Boulder.
Jackson N.D., Carstens B.C., Morales A.E., O’Meara B.C. 2017. Species delimitation with gene flow. Syst. Biol. 66:799–812.
Johnson E.L., Zhang D., Emche S.D. 2005. Inter- and intra-specific variation among five Erythroxylum taxa assessed by AFLP. Ann. Bot. 95:601–608.
Johnson M.G., Gardner E.M., Liu Y., Medina R., Goffinet B., Shaw A.J., Zerega N.J.C., Wickett N.J. 2016. HybPiper: extracting coding sequence and introns for phylogenetics from high-throughput sequencing reads using target enrichment. Appl. Plant Sci. 4:1600016.
Jombart T., Ahmed I. 2011. adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics 27:3070–3071.
Jones M.R., Good J.M. 2016. Targeted capture in evolutionary and ecological genomics. Mol. Ecol. 25:185–202.
Katoh K., Misawa K., Kuma K., Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059–3066.
Keenan K., McGinnity P., Cross T.F., Crozier W.W., Prodöhl P.A. 2013. diveRsity: an R package for the estimation of population genetics parameters and their associated errors. Methods Ecol. Evol. 4:782–788.
Larson G., Piperno D.R., Allaby R.G., Purugganan M.D., Andersson L., Arroyo-Kalin M., Barton L., Climer Vigueira C., Denham T., Dobney K., Doust A.N., Gepts P., Gilbert M.T.P., Gremillion K.J., Lucas L., Lukens L., Marshall F.B., Olsen K.M., Pires J.C., Richerson P.J., Rubio de Casas R., Sanjur O.I., Thomas M.G., Fuller D.Q. 2014. Current perspectives and the future of domestication studies. Proc. Natl. Acad. Sci. USA 111:6139–6146.
Linck E., Epperly K., Van Els P., Spellman G.M., Bryson R.W., McCormack J.E., Canales-Del-Castillo R., Klicka J. 2019. Dense geographic and genomic sampling reveals paraphyly and a cryptic lineage in a classic sibling species complex. Syst. Biol. 68:956–966.
Liu X., Fu Y.-X. 2015. Exploring population size changes using SNP frequency spectra. Nat. Genet. 47:555–559.
Macbride J.F. 1949. Erythroxylaceae. Flora of Peru. Chicago, IL, USA: Field Museum of Natural History. p. 632–647.
McCormack J.E., Tsai W.L.E., Faircloth B.C. 2016. Sequence capture of ultraconserved elements from bird museum specimens. Mol. Ecol. Resour. 16:1189–1203.
Minh B.Q., Schmidt H.A., Chernomor O., Schrempf D., Woodhams M.D., von Haeseler A., Lanfear R. 2020. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37:1530–1534.
Mortimer W.G. 1901. History of coca: the divine plant of the Incas. New York: Vail.
Nathanson J.A., Hunnicutt E.J., Kantham L., Scavone C. 1993. Cocaine as a naturally occurring insecticide. Proc. Natl. Acad. Sci. USA 90:9645–9648.
Nei M. 1972. Genetic distance between populations. Am. Nat. 106:283–292.
Nei M. 1987. Molecular evolutionary genetics. New York: Columbia University Press.
Pickrell J.K., Pritchard J.K. 2012. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 8(11):e1002967. doi:10.1371/journal.pgen.1002967.
Plowman T. 1979a. The identity of Amazonian and Trujillo coca. Bot. Mus. Leafl. Harv. Univ. 27:45–68.
Plowman T. 1979b. Botanical perspectives on coca. J. Psychedelic Drugs 11:103–117.
Plowman T. 1981. Amazonian coca. J. Ethnopharmacol. 3:195–225.
Plowman T. 1984. The origin, evolution, and diffusion of coca, Erythroxylum spp., in South and Central America. Pap. Peabody Mus. Archaeol. Ethnogr. 76:125–163.
Plowman T. 1986. Coca chewing and the botanical origins of coca (Erythroxylum spp.) in Latin America. In: Pacini D., Franquemont C., editors. Coca and cocaine: effects on people and policy in Latin America. Cornell University: Cultural Survival, Inc./LASP. p. 5–34.
Plowman T., Hensold N. 2004. Names, types, and distribution of neotropical species of Erythroxylum (Erythroxylaceae). Brittonia 56:1–53.
Plowman T., Rivier L. 1983. Cocaine and cinnamoylcocaine content of Erythroxylum species. Ann. Bot. 51:641–659.
Prathapan K.D., Pethiyagoda R., Bawa K.S., Raven P.H., Rajan P.D., 172 co-signatories from 35 countries. 2018. When the cure kills—CBD limits biodiversity research. Science 360:1405–1406.
Prosser S.W.J., deWaard J.R., Miller S.E., Hebert P.D.N. 2016. DNA barcodes from century-old type specimens using next-generation sequencing. Mol. Ecol. Resour. 16:487–497.
R Development Core Team. 2013. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Reichel-Dolmatoff G., Schrimpff R. 2005. Goldwork and shamanism: an iconographic study of the Gold Museum of the Banco de la República, Colombia. Villegas Asociados.
Restrepo D.A., Saenz E., Jara-Muñoz O.A., Calixto-Botía I.F., Rodríguez-Suárez S., Zuleta P., Chavez B.G., Sanchez J.A., D’Auria J.C. 2019. Erythroxylum in focus: an interdisciplinary review of an overlooked genus. Molecules 24:3788.
Rowe K.C., Singhal S., Macmanes M.D., Ayroles J.F., Morelli T.L., Rubidge E.M., Bi K., Moritz C.C. 2011. Museum genomics: low-cost and high-accuracy genetic data from historical specimens. Mol. Ecol. Resour. 11:1082–1092.
Rury P.M. 1981. Systematic anatomy of Erythroxylum P.
Browne: practical and evolutionary implications for the cultivated cocas . J. Ethnopharmacol. 3 : 229 – 263 . Google Scholar Crossref Search ADS PubMed WorldCat Rury P.M. , Plowman T. 1983 . Morphological studies of archaeological and recent coca leaves (Erythroxylum spp.) . Bot. Mus. Leafl. Harv. Univ. 29 : 297 – 341 . Google Scholar OpenURL Placeholder Text WorldCat Särkinen T. , Staats M., Richardson J.E., Cowan R.S., Bakker F.T. 2012 . How to open the treasure chest? Optimising DNA extraction from herbarium specimens. PLoS One 7 : e43808 . Google Scholar OpenURL Placeholder Text WorldCat Schultes R.E. 1979 . Evolution of the identification of the major South American narcotic plants . J. Psychedelic Drugs 11 : 119 – 134 . Google Scholar Crossref Search ADS PubMed WorldCat Staats M. , Erkens R.H.J., van de Vossenberg B., Wieringa J.J., Kraaijeveld K., Stielow B., Geml J., Richardson J.E., Bakker F.T. 2013 . Genomic treasure troves: complete genome sequencing of herbarium and insect museum specimens . PLoS ONE 8 ( 7 ): e69189 . doi:10.1371/journal.pone.0069189. Google Scholar Crossref Search ADS PubMed WorldCat Stamatakis A. 2014 . RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies . Bioinformatics 30 : 1312 – 1313 . Google Scholar Crossref Search ADS PubMed WorldCat Stoeckel S. , Masson J.-P. 2014 . The exact distributions of FIS under partial asexuality in small finite populations with mutation . PLoS One 9 : e85228 . Google Scholar Crossref Search ADS PubMed WorldCat Sunnåker M. , Busetto A.G., Numminen E., Corander J., Foll M., Dessimoz C. 2013 . Approximate Bayesian computation . PLoS Comput. Biol. 9 : e1002803 . Google Scholar Crossref Search ADS PubMed WorldCat Turland N.J. , Wiersema J.H., Barrie F.R., Greuter W., Hawksworth D.L., Herendeen P.S., Knapp S., Kusber W.-H., Li D.-Z., Marhold K., May T.W., McNeill J., Monro A.M., Prado J., Price M.J., Smith G.F. 2018 . 
International code of nomenclature for algae, fungi, and plants (Shenzhen Code) adopted by the Nineteenth International Botanical Congress Shenzhen, China, July 2017 . Glashütten : Koeltz Botanical Books . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC United Nations. 2019 . World drug report . Sales No. E.19.XI.8 . Available at: https://wdr.unodc.org/wdr2019/. Valdez L.M. , Taboada J., Valdez J.E. 2015 . Ancient use of coca leaves in the Peruvian central highlands . J. Anthropol. Res. 71 : 231 – 258 . Google Scholar Crossref Search ADS WorldCat Vavilov N.I. , Löve D. 2009 . Origin and geography of cultivated plants . Cambridge, UK : Cambridge University Press . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Villaverde T. , Pokorny L., Olsson S., Rincón-Barrado M., Johnson M.G., Gardner E.M., Wickett N.J., Molero J., Riina R., Sanmartín I. 2018 . Bridging the micro- and macroevolutionary levels in phylogenomics: Hyb-Seq solves relationships from populations to species and above . New Phytol. 220 : 636 – 650 . Google Scholar Crossref Search ADS PubMed WorldCat Weir B.S. , Cockerham C.C. 1984 . Estimating F-statistics for the analysis of population structure . Evolution 38 : 1358 – 1370 . Google Scholar PubMed OpenURL Placeholder Text WorldCat White D.M. , Islam M.B., Mason-Gamer R.J. 2019 . Phylogenetic inference in section Archerythroxylum informs taxonomy, biogeography, and the domestication of coca (Erythroxylum species) . Am. J. Bot. 106 : 154 – 165 . Google Scholar Crossref Search ADS PubMed WorldCat Zhang C. , Rabiee M., Sayyari E., Mirarab S. 2018 . ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees . BMC Bioinformatics 19 : 153 . Google Scholar Crossref Search ADS PubMed WorldCat © The Author(s) 2020. Published by Oxford University Press. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]. © The Author(s) 2020. Published by Oxford University Press.
Sequence Capture Phylogenomics of True Spiders Reveals Convergent Evolution of Respiratory Systems
Ramírez, Martín J.; Magalhaes, Ivan L. F.; Derkarabetian, Shahan; Ledford, Joel; Griswold, Charles E.; Wood, Hannah M.; Hedin, Marshal
doi: 10.1093/sysbio/syaa043
pmid: 32497195
Abstract

The common ancestor of spiders likely used silk to line burrows or make simple webs, with specialized spinning organs and aerial webs originating with the evolution of the megadiverse “true spiders” (Araneomorphae). The base of the araneomorph tree also concentrates the greatest number of changes in respiratory structures, a character system whose evolution is still poorly understood, and that might be related to the evolution of silk glands. Emphasizing a dense sampling of multiple araneomorph lineages where tracheal systems likely originated, we gathered genomic-scale data and reconstructed a phylogeny of true spiders. This robust phylogenomic framework was used to conduct maximum likelihood and Bayesian character evolution analyses for respiratory systems, silk glands, and aerial webs, based on a combination of original and published data. Our results indicate that in true spiders, posterior book lungs were transformed into morphologically similar tracheal systems six times independently, after the evolution of novel silk gland systems and the origin of aerial webs. From these comparative data, we put forth a novel hypothesis that early-diverging web-building spiders were faced with new energetic demands for spinning, which prompted the evolution of similar tracheal systems via convergence; we also propose tests of predictions derived from this hypothesis. [Book lungs; discrete character evolution; respiratory systems; silk; spider web evolution; ultraconserved elements.]

Spiders and their webs are among the most fascinating examples of animal architecture and are at the intersection of important questions in evolutionary biology, such as the construction of their own niche, the origin of key evolutionary innovations, and the extending limits of the phenotype (Bond and Opell 1998; Blackledge et al. 2011). The common ancestor of spiders likely used silk to line burrows or make simple silken structures close to the substrate (Fernández et al.
2018; Coddington et al. 2019; Hedin et al. 2019; Opatova et al. 2019). True aerial webs, from which spiders can hang in an inverted position, originated with the “true spiders” or Araneomorphae. The key innovations that allowed for the evolution of aerial webs are specialized spinning organs; araneomorphs developed a system of ampullate silk glands, which produce tough fibers, and piriform silk glands, which produce a cement that makes highly efficient anchorages (Coddington 2005; Wolff et al. 2019). Along with the spinning organs, the posterior spider body (opisthosoma) underwent a reorganization of the respiratory organs concomitantly with the simplification of the circulatory system (Huckstorf et al. 2015). Early-diverging spider lineages have four book lungs in consecutive pairs on the anteroventral abdomen; of the three main spider lineages, the Mesothelae (single family with ~130 known species) and Mygalomorphae (20 families, ~3000 species) retain this ancestral configuration, while most araneomorphs (96 families, ~48,500 species) have transformed the posterior book lungs into tracheae, or lost them altogether (see Ramírez 2000; Schmitz 2013; Fig. 1). When the posterior book lungs transformed into tracheae they also migrated posteriorly, close to the silk spinning structures. Only three species-poor lineages of araneomorphs retain the ancestral configuration of four book lungs (Hypochilidae, Gradungulidae, and some Austrochilidae, 29 species in total).

Figure 1. Respiratory structures of selected spider taxa. a) Liphistius yamasakii (Mesothelae, Liphistiidae), b) Hickmania troglodytes (Araneomorphae, Austrochilidae), c) Kukulcania hibernalis (Araneomorphae, Filistatidae), d) Austrochilus melon (Araneomorphae, Austrochilidae), e) Archoleptoneta schusteri (Araneomorphae, Leptonetidae, Archoleptonetinae), f) Dysdera crocata (Araneomorphae, Synspermiata, Dysderidae), inset to posterior branch of lateral trachea, g) Polybetes pythagoricus (Araneomorphae, Entelegynae, Sparassidae). * = apodeme; TD = transverse duct.

The origin and function of spider tracheae have long puzzled both taxonomists and physiologists (Levi 1967; Schmitz 2013). Ontogenetically, tracheae arise in various ways as modifications of the book lungs and adjacent apodemes (Purcell 1909; Ramírez 2000, 2014). Morphologically, they usually consist of four tubes that can be either short and simple or large and highly branched (Fig. 1). Physiological studies have found that extensive tracheal systems passing to the anterior body compartment (the prosoma), where locomotory functions are concentrated, contribute to aerobic metabolism during periods of high activity (Schmitz 2005), or help muscular action to monitor the web (Opell 1987).
However, most spiders, including some of the most diverse spider families, spanning a broad range of body sizes and ecologies, have a small tracheal system limited to the opisthosoma (Supplementary Fig. S18 available on Dryad). The function of these small tracheal systems is still a mystery; it has been shown that blocking these tracheae has no impact on locomotory metabolism or performance, nor on CO₂ release (Schmitz and Perry 2002; Schmitz 2005).

Recent phylogenies from genomic-scale data imply that tracheate spiders are polyphyletic (Supplementary Fig. S1 available on Dryad at https://doi.org/10.5061/dryad.3bk3j9kfd), or that some lineages reacquired book lungs from tracheae, a possibility considered unlikely by Huckstorf et al. (2015). However, these prior analyses lacked a dense sampling of early-diverging araneomorphs or were limited in sequence data. In this study, we combine a genomic-scale data set derived from sequence capture of ultraconserved elements (UCEs) with a dense sampling of araneomorph lineages where tracheal systems likely originated. We then analyze new morphological data on respiratory systems, test the origin of silk gland systems and aerial webs, and propose a novel hypothesis for the origin and diversity of respiratory structures in true spiders.

Materials and Methods

Taxon Sampling and Matrix Assembly

We used UCE sequence capture data, building upon the results of Wood et al. (2018). Including a mesothele and two mygalomorphs as outgroups, we assembled an araneomorph taxon sample that emphasized early-diverging lineages, and included many taxa never before sampled in a molecular phylogenetic analysis (see Supplementary Table S1 available on Dryad). UCE loci were obtained and processed as in Hedin et al. (2019), and multiple phylogenomic analyses were conducted to explore impacts of analysis method, data partitioning, and base composition biases (see Supplementary Methods and Results available on Dryad).

Morphology Data and Character Evolution

We scored the morphology of the respiratory system for a subset of sampled taxa (68 of 96) covering all major lineages. Our scorings were based on original dissections of 40 species, supplemented with published data (Supplementary Table S2 available on Dryad). We explored the effect of alternative coding schemes for multiple states and missing entries. We also scored 68 taxa for the presence of the major ampullate + piriform gland system, and for aerial webs (Supplementary Table S3 available on Dryad). To distinguish aerial webs from silken tubes or burrows (“substrate webs”), we defined them as foraging webs from which the spiders hang in an inverted position. Ancestral states were estimated using maximum likelihood, with the fit of different character evolution models compared with the Akaike information criterion (AIC). Correlated evolution between discrete characters was tested in a Bayesian framework using a threshold model from quantitative genetics (Felsenstein 2012; Revell 2014). Additional details for all character analyses are provided in the Supplementary Methods and Results and Tables S2–S5 available on Dryad.

Results

Voucher specimen data and relevant UCE summary values (e.g., cleaned reads, number of contigs, etc.) are found in Supplementary Table S1 available on Dryad. Raw reads from our 53 original samples have been submitted to the SRA (BioProject ID: PRJNA610839); individual locus alignments and tree files are available on Dryad and TreeBase (http://purl.org/phylo/treebase/phylows/study/TB2:S26343). The final primary matrix (534_25_noP) included 534 loci, with a combined alignment length of ~120,000 base pairs and 56,561 parsimony informative sites. Phylogenetic results are in general robust to alternative models of molecular evolution, data partitioning scheme, and optimality criteria, and most clades are recovered with high support (Fig. 2, Supplementary S2–S5, S13 available on Dryad).
Our UCE phylogenies are also largely congruent with previous phylogenomic analyses (Fernández et al. 2018). Important taxon additions include the hypochilid Ectatosticta, thus recovering a monophyletic Hypochilidae, and two telemids, suggesting that this family is sister to Scytodoidea + Pholcoidea (as in Shao and Li 2018). Our dense sample of leptonetids suggests that this family is diphyletic, with Archoleptonetinae separate from Leptonetinae. This result refutes a hypothesis of leptonetid monophyly (Wheeler et al. 2017) but is consistent with predictions made by Ledford and Griswold (2010) based on morphology. In exploratory analyses, Trogloraptor was recovered as sister to telemids but on a very long branch. After accounting for high GC bias (Supplementary Figs. S6–S11 available on Dryad), the position of Trogloraptor stabilized to closely match phylotranscriptomic results (Michalik et al. 2019) and morphological evidence (Griswold et al. 2012). The placement resolved for Huttonia is unusual, but since it belongs to a clade homogeneous in respiratory system morphology, its placement did not impact character evolution analyses. Finally, early-diverging relationships in Entelegynae, in particular the placements of uloborids and oecobiids, are unstable across analyses (see also Garrison et al. 2016; Fernández et al. 2018); because all spiders in this clade have tracheae, this uncertainty has no impact on our main character evolution results.

Figure 2. ExaBayes Bayesian inference tree of the “preferred” data set (see Supplementary Methods and Results available on Dryad), showing main spider clades and posterior probability values (P). BS = bootstrap values from ML analysis of “preferred” data set.

Character mapping of the main architecture of the posterior respiratory system (PRS) is robust to the inclusion of “absences” as a fifth state or as missing data, pruning of terminals lacking a PRS, or coding PRS as a binary or multistate character (Supplementary Figs. S14–S17 available on Dryad). Thus, for simplicity, we discuss below the results of the multistate coding including absence as a fifth state (Fig. 3). A custom, “irreversible” evolution model that disallows regains of book lungs from tracheae or regains of respiratory structures is favored over equal rates, symmetric or all-rates-different models (AIC weight of 93%; Table 1 and Supplementary Tables S4 and S5 available on Dryad) and implies that the transformation of book lungs into tracheal structures occurred six times independently. These transformations are once to a single lamella (in Filistatidae), once to a tube plus a single leaf (in Austrochilinae), and four times to tubular tracheae (in archoleptonetines, leptonetines, Synspermiata, and Entelegynae + Palpimanoidea; Fig. 3). All branches involving morphological transitions in the respiratory system are well supported and recovered in multiple phylogenetic analyses (Fig. 2, Supplementary Fig. S13 available on Dryad), thus the ancestral state reconstruction and evolutionary model selection are robust to phylogenetic uncertainty (Supplementary Fig. S15 available on Dryad). Complex tracheal systems with tracheae supplying the prosoma are mapped as at least five independent transformations from simpler systems limited to the opisthosoma (Supplementary Fig. S18 available on Dryad). We detected four independent losses of the PRS, including one in Pholcoidea (the “lost tracheae clade”), all from simple tracheal systems limited to the opisthosoma (three from tubular tracheae and one from lamella).
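
As a worked check of the model comparison for the posterior respiratory system, the AIC values and Akaike weights can be recomputed from the log-likelihoods and parameter counts listed in Table 1. The sketch below is illustrative only and is not the authors' analysis code: the function and variable names are hypothetical, and it assumes the standard definitions AIC = 2k - 2 ln L and weights proportional to exp(-ΔAIC/2).

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(aic_values):
    """Convert a dict of AIC values into Akaike weights that sum to 1."""
    best = min(aic_values.values())
    # Relative likelihood of each model given its AIC difference to the best model.
    rel = {m: math.exp(-0.5 * (a - best)) for m, a in aic_values.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}


# (log-likelihood, number of free parameters) as reported in Table 1 for the
# posterior respiratory system, multistate coding.
PRS_MODELS = {
    "ER": (-44.66457, 1),    # equal rates
    "SYM": (-40.30172, 10),  # symmetrical rates
    "IRV": (-36.12254, 7),   # irreversible model
    "ARD": (-34.88677, 20),  # all-rates-different
}

prs_aic = {m: aic(lnl, k) for m, (lnl, k) in PRS_MODELS.items()}
prs_weights = akaike_weights(prs_aic)

for model in PRS_MODELS:
    print(f"{model}: AIC = {prs_aic[model]:.5f}, weight = {prs_weights[model]:.8f}")
```

Under these assumptions the irreversible model has the lowest AIC and carries about 93% of the Akaike weight, matching the value reported in the text; the ARD model has the highest likelihood but is penalized for its 20 parameters.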

The ampullate + piriform gland system originated in Araneomorphae, with a single loss in our data set in the sand spider Hexophthalma (Supplementary Fig. S20 available on Dryad). The aerial web also originated in Araneomorphae, with several subsequent reversions to substrate web or losses (Supplementary Fig. S21 available on Dryad). We detected a significant correlation between the ampullate + piriform system and tracheae (highest posterior density of correlation 0.050–0.797, mean 0.410, P = 0.925, effective sample size 319) but not between aerial webs and tracheae.

Figure 3. Evolution of the posterior respiratory system in araneomorph spiders. a) Schematic view of opisthosoma of exemplar taxa, showing the respiratory system in relation to spinnerets. b) Posterior respiratory system mapped by maximum likelihood using multistate coding for the 68 terminals with available morphological data. Morphology schemes are grouped by main configurations using shaded areas. ap = apodeme; at = atrium of book lung; bl = book lung; ltr = lateral trachea; mtr = median trachea, derived from apodeme; spn = spinnerets.

Table 1. Comparison among the different models used for character reconstruction

| Trait | Statistic | ER | SYM | IRV | ARD |
| --- | --- | --- | --- | --- | --- |
| Posterior respiratory system, multistate coding | Log likelihood | -44.66457 | -40.30172 | -36.12254 | -34.88677 |
| | Parameters | 1 | 10 | 7 | 20 |
| | AIC | 91.32913 | 100.60344 | 86.24507 | 109.77355 |
| | AIC weights | 0.07291166 | 0.00070617 | 0.92637496 | 0.00000721 |
| Web, multistate coding (absence coded as an additional state) | Log likelihood | -50.99055 | -50.26976 | -48.80268 | -48.51729 |
| | Parameters | 1 | 3 | 4 | 6 |
| | AIC | 103.9811 | 106.5395 | 105.6054 | 109.0346 |
| | AIC weights | 0.55491286 | 0.15440849 | 0.24633075 | 0.04434791 |

See Supplementary material available on Dryad for details. The preferred model according to the AIC (lowest AIC) is IRV for the posterior respiratory system and ER for the web. ER = equal rates; SYM = symmetrical rates; IRV = irreversible model; ARD = all-rates-different model.

Discussion

We assembled a new genomic-scale data set that complements previous phylogenies reconstructed mostly using transcriptomes. The high congruence with these prior studies and strong support for both deep and shallow branches indicate that UCE-based sequence capture is a good strategy when paired with dense taxon sampling, without the stringent sampling conditions of transcriptomes (Hedin et al. 2019; Kulkarni et al. 2020).

Our results indicate that the posterior book lungs of spiders were transformed six times into tracheal systems after the origin of aerial webs and the evolution of the ampullate + piriform gland system of true spiders. Alternative reconstructions inferring one or two reacquisitions of book lungs were not favored by our AIC-based model selection (Supplementary Fig. S15, Tables S4 and S5 available on Dryad); furthermore, we regard the de novo evolution of book lungs as less parsimonious, as their morphologically complex structure is identical across all spiders possessing them (Fig. 3).
Six clades converged to a similar conformation of few tracheal tubes limited to the opisthosoma, usually close to the spinnerets: two leptonetid clades, the austrochilines, the common ancestor of palpimanoids and entelegynes, the filistatines, and the common ancestor of Synspermiata.

Why have tracheae evolved so many times within araneomorph spiders? Based on anatomical proximity of tracheae to the anterior lateral spinnerets and the ampullate silk glands, we hypothesize that tracheae originated to supply the demands of the spinning system, ultimately also linked to the evolution of aerial webs. The outlets of the ampullate and piriform silk glands are strategically placed in the anterior lateral spinnerets, which are operated by a complex musculature (Eberhard 2010), and the anchorages of silk to substrate are made through a precise choreography that determines their resistance (Wolff et al. 2019). The silk glands, particularly the ampullate glands, and the muscles operating the spinnerets are positioned immediately dorsal to the tracheae. This suggests that early-diverging web-building spiders were faced with new energetic demands for spinning, which resulted in the evolution of similar tracheal systems via convergence. A prediction of this hypothesis is that blocking a simple PRS should be detrimental to spinning performance. We are aware that the correlation is not perfect; for example, the Pholcoidea lost their tracheae but still build aerial webs using the ampullate + piriform gland system.

Our analyses also reveal at least five independent origins of extensive tracheal systems reaching the prosoma, all derived from systems limited to the opisthosoma (Supplementary Fig. S18 available on Dryad); probably many more convergences occurred within the Entelegynae. This indicates that oxygen supplementation for muscular activity in the prosoma was a later development rather than the original function of spider tracheae.

Supplementary Material

Data available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.3bk3j9kfd.

Funding

This research was supported by FONCyT (PICT-2015-0283 and 2017-2689 to M.J.R.), by the US National Science Foundation (DEB 1754591 to M.H.), and by a Global Genome Initiative grant (GGI-Peer-2018-181 to H.M.W.).

Acknowledgments

We acknowledge the many persons who helped to collect the specimens used in this study. The Schlinger Postdoctoral Fellowship at the California Academy of Sciences supported the participation of J.L. in early stages of this research. I.L.F.M. was supported by a CONICET postdoctoral fellowship. We thank Martin Carboni and the computing facilities of INTA Balcarce for help with phylogenomic analyses. Brant Faircloth, Lauren Esposito, Matjaž Kuntner and two anonymous reviewers provided useful suggestions that helped us to improve the manuscript.

References

Blackledge T.A., Kuntner M., Agnarsson I. 2011. The form and function of spider orb webs: evolution from silk to ecosystems. In: Casas J., editor. Advances in insect physiology (Vol. 41). Burlington: Academic Press. p. 175–262.
Bond J.E., Opell B.D. 1998. Testing adaptive radiation and key innovation hypotheses in spiders. Evolution 52: 403–414.
Coddington J.A. 2005. Phylogeny and classification of spiders. In: Ubick D., Paquin P., Cushing P.E., Roth V., editors. Spiders of North America: an identification manual. American Arachnological Society, Berkeley, CA. p. 18–24.
Coddington J.A., Agnarsson I., Hamilton C.A., Bond J.E. 2019. Spiders did not repeatedly gain, but repeatedly lost, foraging webs. PeerJ 7: e6703.
Eberhard W.G. 2010. Possible functional significance of spigot placement on the spinnerets of spiders. J. Arachnol. 38: 407–414.
Felsenstein J. 2012. A comparative method for both discrete and continuous characters using the threshold model. Am. Nat. 179(2): 145–156.
Fernández R., Kallal R.J., Dimitrov D., Ballesteros J.A., Arnedo M.A., Giribet G., Hormiga G. 2018. Phylogenomics, diversification dynamics, and comparative transcriptomics across the Spider Tree of Life. Curr. Biol. 28: 1489–1497.e5.
Garrison N.L., Rodriguez J., Agnarsson I., Coddington J.A., Griswold C.E., Hamilton C.A., Hedin M., Kocot K.M., Ledford J.M., Bond J.E. 2016. Spider phylogenomics: untangling the Spider Tree of Life. PeerJ 4: e1719.
Griswold C.E., Audisio T., Ledford J.M. 2012. An extraordinary new family of spiders from caves in the Pacific Northwest (Araneae, Trogloraptoridae, new family). ZooKeys 215: 77–102.
Hedin M., Derkarabetian S., Alfaro A., Ramírez M.J., Bond J.E. 2019. Phylogenomic analysis and revised classification of atypoid mygalomorph spiders (Araneae, Mygalomorphae), with notes on arachnid ultraconserved element loci. PeerJ 7: e6864.
Huckstorf K., Michalik P., Ramírez M., Wirkner C.S. 2015. Evolutionary morphology of the hemolymph vascular system of basal araneomorph spiders (Araneae: Araneomorphae). Arthropod Struct. Dev. 44: 609–621.
Kulkarni S., Wood H., Lloyd M., Hormiga G. 2020. Spider-specific probe set for ultraconserved elements offers new perspectives on the evolutionary history of spiders (Arachnida, Araneae). Mol. Ecol. Res. 20(1): 185–203.
Lamy E. 1902. Recherches anatomiques sur les trachées des Araignées. Ann. Sci. Nat. Zool. 15(8): 149–280.
Ledford J.M., Griswold C.E. 2010. A study of the subfamily Archoleptonetinae (Araneae, Leptonetidae) with a review of the morphology and relationships for the Leptonetidae. Zootaxa 2391: 1–32.
Levi H.W. 1967. Adaptations of respiratory systems of spiders. Evolution 1967: 571–583.
Michalik P., Kallal R., Dederichs T.M., Labarque F.M., Hormiga G., Giribet G., Ramírez M.J. 2019. Phylogenomics and genital morphology of cave raptor spiders (Araneae, Trogloraptoridae) reveal an independent origin of a flow-through female genital system. J. Zool. Syst. Evol. Res. 57: 737–747.
Opatova V., Hamilton C.A., Hedin M., Montes de Oca L., Král J., Bond J.E. 2019. Phylogenetic systematics and evolution of the spider infraorder Mygalomorphae using genomic scale data. Syst. Biol. 69: 671–707.
Opell B.D. 1987. The influence of web monitoring tactics on the tracheal systems of spiders in the family Uloboridae (Arachnida, Araneida). Zoomorphology 107: 255–259.
Purcell W.F. 1909. Development and origin of respiratory organs in Araneae. Quart. J. Microsc. Sc. (N.S.) 54: 1–110.
Ramírez M.J. 2000. Respiratory system morphology and the phylogeny of haplogyne spiders (Araneae, Araneomorphae). J. Arachnol. 28: 149–157.
Ramírez M.J. 2014. The morphology and phylogeny of dionychan spiders (Araneae: Araneomorphae). Bull. Amer. Mus. Nat. Hist. 390: 1–394.
Revell L.J. 2014. Ancestral character estimation under the threshold model from quantitative genetics. Evolution 68: 743–759.
Schmitz A. 2013. Tracheae in spiders: respiratory organs for special functions. In: Nentwig W., editor. Spider ecophysiology. New York: Springer. p. 29–39.
Schmitz A., Perry S.F. 2002. Respiratory organs in wolf spiders: morphometric analysis of lungs and tracheae in Pardosa lugubris (L.) (Arachnida, Araneae, Lycosidae). Arthropod Struct. Dev. 31: 217–230.
Schmitz A. 2005. Spiders on a treadmill: influence of running activity on metabolic rates in Pardosa lugubris (Araneae, Lycosidae) and Marpissa muscosa (Araneae, Salticidae). J. Exp. Biol. 208: 1401–1411.
Shao L., Li S. 2018. Early Cretaceous greenhouse pumped higher taxa diversification in spiders. Mol. Phylogenet. Evol. 127: 146–155.
Wheeler W.C., Coddington J.A., Crowley L.M., Dimitrov D., Goloboff P.A., Griswold C.E., Hormiga G., Prendini L., Ramírez M.J., Sierwald P., Almeida-Silva L.M., Álvarez-Padilla F., Arnedo M.A., Benavides L.R., Benjamin S.P., Bond J.E., Grismado C.J., Hasan E., Hedin M., Izquierdo M.A., Labarque F.M., Ledford J., Lopardo L., Maddison W.P., Miller J.A., Piacentini L.N., Platnick N.I., Polotow D., Silva-Dávila D., Scharff N., Szűts T., Ubick D., Vink C., Wood H.M., Zhang J.X. 2017. The spider tree of life: phylogeny of Araneae based on target-gene analyses from an extensive taxon sampling. Cladistics 33(6): 576–616.
Wolff J.O., Paterno G.B., Liprandi D., Ramírez M.J., Bosia F., van der Meijden A., Michalik P., Smith H.M., Jones B.R., Ravelo A.M., Pugno N. 2019.
Evolution of aerial spider webs coincided with repeated structural optimization of silk anchorages . Evolution 73 : 2122 – 2134 . Google Scholar Crossref Search ADS PubMed WorldCat Wood H.M. , González V.L., Lloyd M., Coddington J., Scharff N. 2018 . Next-generation museum genomics: phylogenetic relationships among palpimanoid spiders using sequence capture techniques (Araneae: Palpimanoidea) . Mol. Phylogenetics Evol. 127 : 907 – 918 . Google Scholar Crossref Search ADS WorldCat © The Author(s) 2020. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please email: [email protected] This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Ambiguity Coding Allows Accurate Inference of Evolutionary Parameters from Alignments in an Aggregated State-Space
Weber, Claudia C; Perron, Umberto; Casey, Dearbhaile; Yang, Ziheng; Goldman, Nick
doi: 10.1093/sysbio/syaa036 pmid: 32353118
Abstract How can we best learn the history of a protein’s evolution? Ideally, a model of sequence evolution should capture both the process that generates genetic variation and the functional constraints determining which changes are fixed. However, in practical terms the most suitable approach may simply be the one that combines the convenience of easily available input data with the ability to return useful parameter estimates. For example, we might be interested in a measure of the strength of selection (typically obtained using a codon model) or an ancestral structure (obtained using structural modeling based on inferred amino acid sequence and side chain configuration). But what if data in the relevant state-space are not readily available? We show that it is possible to obtain accurate estimates of the outputs of interest using an established method for handling missing data. Encoding observed characters in an alignment as ambiguous representations of characters in a larger state-space allows the application of models with the desired features to data that lack the resolution that is normally required. This strategy is viable because the evolutionary path taken through the observed space contains information about states that were likely visited in the “unseen” state-space. To illustrate this, we consider two examples with amino acid sequences as input. We show that $$\omega$$ , a parameter describing the relative strength of selection on nonsynonymous and synonymous changes, can be estimated in an unbiased manner using an adapted version of a standard 61-state codon model. Using simulated and empirical data, we find that ancestral amino acid side chain configuration can be inferred by applying a 55-state empirical model to 20-state amino acid data. Where feasible, combining inputs from both ambiguity-coded and fully resolved data improves accuracy. 
Adding structural information to as few as 12.5% of the sequences in an amino acid alignment results in remarkable ancestral reconstruction performance compared to a benchmark that considers the full rotamer state information. These examples show that our methods permit the recovery of evolutionary information from sequences where it has previously been inaccessible. [Ancestral reconstruction; natural selection; protein structure; state-spaces; substitution models.] The evolution of protein sequences is driven by a combination of forces that influence both what types of mutation occur and which of them are allowed to fix by natural selection. The former process operates at the level of the nucleotide sequence and manifests at the amino acid level through the structure of the genetic code. Functional and structural constraints then determine the probability of survival of the mutants. A wide variety of computational tools to make inferences about different layers of this process are available, considering observations from nucleotide, codon or amino acid sequences, and (occasionally) protein structure. Models that take data from one of these state-spaces as input typically use transition probabilities between these same character states to compute outputs, such as phylogenies, selective constraints, or ancestral states. Being able to obtain certain types of information about evolution is therefore usually contingent on having access to observations in the relevant state-spaces. Given the abundance of available genome sequences, access to interesting data is ordinarily not a problem. Codon sequences, for example, are commonly used to quantify the strength of natural selection, measured by $$\omega$$ , the relative rate of nonsynonymous to synonymous substitutions. Variants of the standard codon model estimate constraints on specific sites, branches, or different types of amino acid substitutions (Yang et al. 1998; Yang 2014; Weber and Whelan 2019). 
Empirical amino acid models, which work with amino acid sequences and are often used to estimate phylogenies, consider how “exchangeable” different residues are (Whelan and Goldman 2001; Le and Gascuel 2008). This allows them to capture some functional constraints and therefore reconstruct plausible amino acid trajectories and ancestral sequences. A subset of models goes beyond sequence alone and incorporates elements of structure. These either take the form of mixture models describing site- or partition-specific amino acid propensities (Koshi and Goldstein 1995; Le et al. 2008a,b; Le and Gascuel 2010), or explicitly model observed changes in the protein’s three-dimensional organization. Recently, an extended version of the empirical amino acid model was introduced that additionally accounts for rates of exchange between amino acid side chain configurations (Perron et al. 2019). How suited a given amino acid is to a particular sequence and structural context is influenced not only by the biochemical properties of its side chain, but also by its spatial orientation. This includes the rotation about the $$C^{\alpha}$$–$$C^{\beta}$$ bond, or the $$\chi_{1}$$ rotational isomer (“rotamer”) configuration, which can be discretized into up to three states per residue, resulting in a state-space with 55 characters (see Methods and Perron et al. 2019 for details). The empirical rotamer-aware model therefore allows reconstruction of ancestral amino acid sequences that include side chain configurations (Perron et al. 2019), provided structural information is available for the extant descendants of the protein of interest. Reconstructed side chain configurations are of practical interest as they can provide a plausible prior for structure prediction for a variety of applications. For example, structural models are widely used for in silico functional annotation of genes and variants, prediction of protein-protein interactions and docking (Zhang et al.
2012; Vakser 2014; Waterhouse et al. 2018).

Given the variety of options, the choice of model for a study might be guided by the research question and which aspects of the evolutionary process are most interesting. In some cases, data availability may limit the range of suitable models. For example, given a set of amino acid sequences, models operating in codon or rotamer space cannot be straightforwardly applied. Scenarios where one might encounter this mismatch include selection analyses incorporating protein sequences from ancient specimens or databases where corresponding nucleotide sequences are not retrievable, or studies where access to complete high-quality protein structures is limited.

Attempts to bridge the gap between available input and desired model output have, thus far, been limited, at least as far as conventional observable character states are concerned. For example, Yang et al. (1998) formulated an amino acid model that merges synonymous codons into a single amino acid state, with substitution rates computed as an average of the codon rates. This model can estimate transition–transversion bias from amino acids. However, it is unable to provide a measure of the strength of selection.

Nevertheless, there are well-established methods that allow the handling of data for which only part of the state-space information is available. This is achieved by encoding observed states as ambiguous representations of characters in a larger state-space. This application of standard statistical theory for missing data has been used previously in phylogenetics (e.g., Yang 2014, p. 110–112). Notable examples are covariotide models, conceptualized by Fitch and Markowitz (1970), where each nucleotide may be in an “on” or “off” state that cannot be directly observed (Tuffley and Steel 1998; Galtier 2001; Huelsenbeck 2002). Can the principles employed by these methods be applied more broadly to allow substitution models to take “partial” observations as input?
In this article, we present a proof of principle, demonstrating that it is possible to infer information about evolutionary processes that occurred in an expanded state-space using only the aggregated data, taking advantage of an established method for handling ambiguity in sequence alignments. The ability to model sequences in a state-space with a larger set of characters allows us to obtain outputs that would otherwise be unavailable. For example, we can capture relative selective constraints on nonsynonymous versus synonymous substitutions from amino acid sequences. The path through amino acid space hence helps reveal the path evolution takes through codon space. The same method can be applied to reconstructing ancestral amino acid rotamer configurations using only amino acid sequences. Using input data consisting of a mixture of rotamer and amino acid sequences further allows us to refine these reconstructions and obtain a useful starting point for homology modeling. Materials and Methods Inferring Model Parameters from Data in an Aggregated State-Space using Ambiguity Our framework is maximum likelihood (ML) inference on phylogenetic trees, based on alignments of observable characters that evolve independently according to a Markov process. We consider cases where the characters at the tips of a phylogenetic tree are only available in an “aggregated” state-space $$\textbf{A}$$ with $$m$$ states. Each state $$a_i$$ in $$\textbf{A} = \{a_1, ..., a_m \}$$ corresponds to one or more “separate” states $$s_j$$ in a larger state-space $$\textbf{S} = \{s_1, ..., s_n \}$$ (where $$n > m$$ ). Meanwhile, each state in $$\textbf{S}$$ maps to a single state in $$\textbf{A}$$ . For example, where $$\textbf{S}$$ describes the set of 61 sense codons, $$\textbf{A}$$ might describe the 20 amino acid states: each codon codes for one specific amino acid, while a given amino acid can be represented by multiple codons. 
Similarly, each amino acid ( $$\textbf{A}$$ ) can represent multiple rotamer configurations ( $$\textbf{S}$$ ; see Perron et al. (2019), Table 1). If we only have access to amino acid sequences rather than codon or rotamer sequences, but modeling the data in $$\textbf{S}$$ would be more informative, we can take advantage of these mappings. In order to estimate phylogenetic models under ML when the data do not match the model state-space, we modify an established method for handling alignment gaps and ambiguity characters. The conditional probability vector $$L_k(j)$$ is a crucial part of phylogenetic likelihood calculations. It records the probability of the observed data descended from node $$k$$ conditional on the presence of state $$j$$ at node $$k$$ . There is one such vector for each combination of alignment position (not indicated in this notation, for simplicity) and node $$k$$ , with one element for every permitted state $$j$$ . The iterative calculation of the likelihood is initialized at the tips of the tree: if $$k$$ is a tip with state $$x$$ recorded in the alignment, the element $$L_k(x)$$ is set to 1 and $$L_k(j) = 0$$ for all other $$j \neq x$$ (Felsenstein 2004)—the data are recorded as having been correctly observed with certainty. When data are missing, or when there is a gap at a site in the alignment, the corresponding $$L_k(j)$$ are set to 1 for all $$j$$ , representing total absence of knowledge of the true character states and effectively removing node $$k$$ from the likelihood calculation for the site. In the case where the observed data are in the aggregated state-space $$\textbf{A}$$ , and we are interested in modeling in the separate state-space $$\textbf{S}$$ , we can proceed in a similar manner. Consider a simple four-state ( $$\textbf{S}$$ ) model with the aggregated ( $$\textbf{A}$$ ) states $$a = \{a_1, a_2\}$$ and $$b = \{b_1, b_2\}$$ . 
If we observe state $$a$$ in $$\textbf{A}$$, which could represent either $$a_1$$ or $$a_2$$ in $$\textbf{S}$$, the corresponding conditional probability vector $$L_k = (L_{a_1}, L_{b_1}, L_{a_2}, L_{b_2})$$ is set to $$(1, 0, 1, 0)$$. Hence, our observation is ambiguous with respect to the character in $$\textbf{S}$$. We use the term “ambiguous” to refer to instances where incomplete information about the state at a given site is available, but the character is not missing. Where data are completely absent (missing) for an alignment position, the same vector is set to $$L_k = (1, 1, 1, 1)$$. Once $$L_k$$ has been set at all tips according to this modification, the calculation of the likelihood proceeds as normal following Felsenstein’s pruning algorithm (Felsenstein 1981).

Treating data observed in $$\textbf{A}$$ as ambiguous states in $$\textbf{S}$$ is similar to the “covariotide” model of Huelsenbeck (2002), which assigns each nucleotide an ambiguous “on” or “off” state. Ambiguity has also been used to encode population allele frequencies using small samples as input (De Maio et al. 2015), and to handle sequence error and uncertainty (Kozlov 2018). Our approach differs from that presented in Yang et al. (1998), where all synonymous codons for an amino acid are combined into one state and substitution rates between amino acids represent averages over codons.

Description of the Codon Model

The codon model considered here follows the standard M0 model as implemented in PAML (Yang 2007; see also Goldman and Yang 1994; Yang et al. 2000), with parameters $$\omega = dN/dS$$, $$\kappa$$ representing the ratio of transition mutations to transversions, and $$\pi$$ representing the codon equilibrium frequencies.
The instantaneous rate matrix is given by:

$$Q_{ij} = \begin{cases} 0 & \text{if } i \text{ and } j \text{ differ by more than a single nucleotide} \\ \pi_j & \text{if } i \text{ and } j \text{ differ by a single synonymous transversion} \\ \kappa \pi_j & \text{if } i \text{ and } j \text{ differ by a single synonymous transition} \\ \omega \pi_j & \text{if } i \text{ and } j \text{ differ by a single nonsynonymous transversion} \\ \omega \kappa \pi_j & \text{if } i \text{ and } j \text{ differ by a single nonsynonymous transition} \end{cases} \tag{1}$$

We implemented the model (now available in PAML as M5) and likelihood calculations in a standard ML framework, encoding the conditional probability vector $$L_k$$ in codon space ($$\textbf{S}$$) using observed amino acids ($$\textbf{A}$$), as outlined above. We make the simplifying assumption that the vector of equilibrium frequencies is fixed at $$\pi_j = 1/61$$ for all codons $$j$$. The codon frequencies cannot be directly observed and examining their identifiability is beyond the scope of this work.

Codon Sequence Simulations and Inference

We used evolver (Yang 2007) to generate a single random unrooted tree with 20 tip nodes using a birth–death process (Yang and Rannala 1997) with a tree height of 0.5 (see Supplementary Fig. S1A available on Dryad at http://dx.doi.org/10.5061/dryad.tx95x69sm). We next simulated sequences under M0 over a range of parameters, generating 100 replicates with 3000 codons for each combination of configurations (unless stated otherwise). We then analyzed the simulated sequences using codeml from the PAML package, fitting M5 to the translated amino acid sequences and fitting M0 to the original codon sequences, both assuming equal codon frequencies (Yang 2007). We also recorded the standard errors for the parameters (option getSE = 1).
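To make the codon-space encoding concrete, the following is a minimal pure-Python sketch, not the PAML implementation. It builds the rate matrix of equation 1 with equal codon frequencies, encodes an observed amino acid as an ambiguous codon tip vector, and evaluates the likelihood of a single site for two sequences separated by one branch. The function names, the rescaling of the matrix to one expected substitution per unit time, and the truncated Taylor series used for the matrix exponential are all assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
CODONS = [c for c in CODE if CODE[c] != "*"]        # separate space S: 61 sense codons
AMINO_ACIDS = sorted(set(CODE.values()) - {"*"})    # aggregated space A: 20 amino acids
TRANSITIONS = ({"A", "G"}, {"C", "T"})


def m0_rate(ci, cj, omega, kappa, pi_j):
    """Off-diagonal rate from codon ci to codon cj, following equation 1."""
    diffs = [{x, y} for x, y in zip(ci, cj) if x != y]
    if len(diffs) != 1:
        return 0.0                                  # identical, or more than one change
    rate = pi_j
    if diffs[0] in TRANSITIONS:
        rate *= kappa
    if CODE[ci] != CODE[cj]:
        rate *= omega
    return rate


def m0_matrix(omega, kappa):
    """61 x 61 rate matrix with pi_j = 1/61, rescaled to unit mean rate (assumed)."""
    n = len(CODONS)
    pi = 1.0 / n
    q = [[m0_rate(ci, cj, omega, kappa, pi) for cj in CODONS] for ci in CODONS]
    for i in range(n):
        q[i][i] = -sum(q[i])
    scale = -sum(pi * q[i][i] for i in range(n))
    return [[x / scale for x in row] for row in q]


def tip_vector(amino_acid):
    """Ambiguity coding: 1 for every codon compatible with the observed amino acid."""
    return [1.0 if CODE[c] == amino_acid else 0.0 for c in CODONS]


def expm_times_vector(q, t, v, terms=40):
    """exp(q * t) applied to v via a truncated Taylor series (short branches only)."""
    result, term = list(v), list(v)
    for k in range(1, terms):
        term = [t / k * sum(a * b for a, b in zip(row, term)) for row in q]
        result = [r + x for r, x in zip(result, term)]
    return result


def pair_likelihood(aa1, aa2, t, omega, kappa):
    """Site likelihood for two observed amino acids joined by a branch of length t."""
    q = m0_matrix(omega, kappa)
    partial = expm_times_vector(q, t, tip_vector(aa2))   # sum_j P_ij(t) L_2(j)
    pi = 1.0 / len(CODONS)
    return sum(pi * l1 * p for l1, p in zip(tip_vector(aa1), partial))
```

Calling `pair_likelihood("W", "L", 0.1, 0.3, 2.0)` evaluates one site; on a full tree the same tip vectors would simply be passed to the pruning recursion in place of the usual single-entry vectors.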
Description of the Rotamer Model The empirical rotamer-aware model (RAM55) follows the structure of a standard empirical amino acid model and is fit to alignment data in the same manner. The instantaneous rate matrix, $$Q_{ij}$$ , is defined by the exchangeabilities between states derived from a database of sequences of proteins with known 3D structure and their equilibrium frequencies. Rather than the usual 20 amino acid states, the rotamer model considers 55 discrete states determined by the $$\chi_{1}$$ dihedral angle between the side chain’s first two covalently linked carbons, the rotamer configuration. Each amino acid can be categorized into up to three distinct states based on an observed protein structure; for example, state L3 denotes a leucine residue in conformational state 3. The exchange rates likely capture the effect of local steric constraints on side chain orientation. RAM55 is described in detail by Perron et al. (2019), along with RUM20, a conventional 20-state empirical amino acid model computed from the same data set. Rotamer Sequence Simulation and Ancestral Side Chain Configuration Reconstruction To generate sequences in rotamer space with known phylogenies and ancestral states, we used a set-up similar to the one described by Perron et al. (2019). Briefly, we randomly generated a 32-tip tree using a Yule process and scaled the branches by 0.1–1 (Supplementary Fig. S2 available on Dryad). We then performed a continuous-time Markov chain simulation along the branches for 1000 replicates of 200 sites each using the RAM55 exchangeabilities and equilibrium frequencies. Simulated alignments use a custom encoding format which expresses both amino acid states and rotamer states (i.e., a mixed alignment) using a common alphabet of single-character symbols (see https://bitbucket.org/uperron/ambiguity_coding). 
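As a rough illustration of the simulation step, the sketch below builds a reversible rate matrix from exchangeabilities and equilibrium frequencies and evolves one site along a branch with a continuous-time Markov chain. It is not the authors' simulation code: the three states, their exchangeabilities, and their frequencies are hypothetical toy values standing in for the 55-state RAM55 parameters, and the function names are invented for this example.

```python
import random

# Hypothetical toy values: three rotamer states of one residue, NOT taken from RAM55.
STATES = ["L1", "L2", "L3"]
EXCHANGE = [[0.0, 1.0, 0.5],
            [1.0, 0.0, 2.0],
            [0.5, 2.0, 0.0]]      # symmetric exchangeabilities s_ij
FREQS = [0.2, 0.3, 0.5]           # equilibrium frequencies pi_j


def build_rate_matrix(exchange, freqs):
    """Reversible rate matrix q_ij = s_ij * pi_j, rescaled to unit mean rate."""
    n = len(freqs)
    q = [[exchange[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    scale = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / scale for x in row] for row in q]


def evolve(state, branch_length, q, rng):
    """Simulate one site along one branch and return the state at its end."""
    t = 0.0
    while True:
        t += rng.expovariate(-q[state][state])      # waiting time to the next change
        if t > branch_length:
            return state
        weights = [r if j != state else 0.0 for j, r in enumerate(q[state])]
        state = rng.choices(range(len(weights)), weights=weights)[0]


def simulate_cherry(q, freqs, t_left, t_right, rng):
    """Draw a root state from the equilibrium frequencies and evolve it to two tips."""
    root = rng.choices(range(len(freqs)), weights=freqs)[0]
    return root, evolve(root, t_left, q, rng), evolve(root, t_right, q, rng)


Q = build_rate_matrix(EXCHANGE, FREQS)
```

Repeating `simulate_cherry` independently for each site, and recursing over all branches of a larger tree, would yield an alignment with known ancestral states of the kind described above.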
To emulate cases where structural information is not available for some of the terminal nodes, we generated mixed alignments by “masking” a proportion of the terminal rotamer sequences, leaving only amino acids. Amino acid states are then treated as ambiguous rotamer state assignments in the inference step (see below). Further, sequences can be removed from the simulated alignment to illustrate the loss of information caused by discarding sequences entirely when no structure is available. This is done by replacing a specific proportion of the alignment’s sequences with gap characters. Both the masking and discarding operations are performed over a set of sequences selected independently for each replicate according to a uniform distribution. To reconstruct ancestral states, we modified the approach described by Perron et al. (2019), encoding each amino acid (state-space $$\textbf{A}$$ ) observed in the alignment as ambiguous in rotamer space (state-space $$\textbf{S}$$ ) in the conditional probability vector. This procedure allows us to use the RAM55 model to infer rotamer sequences at internal nodes using amino acid or rotamer sequences and the tree that was used for simulation as input. To compute posterior probabilities for reconstructions (Yang et al. 1995), we applied the marginal reconstruction algorithm of Koshi and Goldstein (1996). A joint reconstruction algorithm (Pupko et al. 2000) gives qualitatively similar results. To assess the accuracy of the reconstructions, we examined the proportion of sites with matching characters in rotamer space. That is, we require both the amino acid and its rotamer configuration to be identical. The same reconstruction approach can be modified to predict side chain configurations for extant homologous proteins in a given family (i.e., tip nodes of the tree). 
Specifically, each terminal rotamer sequence is, in turn, masked and its side chain configurations are reconstructed conditioned on the observed amino acid states by treating the terminal node as if it were an internal node. This permits us to infer rotamer configurations for extant proteins with known sequences but unknown structures, based on the known sequences and structures of their homologs. To illustrate this, we considered two manually curated empirical data sets, consisting of 16 ADK structures and 30 RuBisCO structures from PDBe (wwPDB consortium 2018), respectively. For each data set, a multiple amino acid sequence alignment was generated using MAFFT (Katoh et al. 2002). Rotamer configuration was then assigned to each amino acid in the alignment, generating a rotamer sequence alignment (see Perron et al. (2019) Methods for details). The tree for the reconstruction was estimated from the rotamer sequence alignment using RAxML-NG under the RAM55 model (Kozlov et al. 2019; Perron et al. 2019). We then masked, in turn, each terminal rotamer sequence in the alignment and predicted each amino acid’s $$\chi_{1}$$ configuration using RAM55 and the marginal reconstruction algorithm as described above. Here, the extant amino acid sequence is known and the rotamer state prediction is thus constrained to the observed amino acid. Prediction accuracy can be computed against the original rotamer sequence; in order to benchmark our method’s accuracy we first established a baseline accuracy by assigning the $$\chi_{1}$$ configuration either according to a uniform probability distribution (we denote this by “Unif”) or using the relative equilibrium frequencies of each possible configuration according to RAM55 (“RelFreq”). A widely used strategy to predict side chain configurations in unresolved structures consists of assigning to each amino acid the same configuration found at the corresponding site in the nearest homologous neighbor’s structure (Sutcliffe et al. 
1987; Waterhouse et al. 2018); we refer to this approach as “Nearest Neighbor Configuration” (NNC). NNC is only applicable to sites where the amino acid is conserved in the template, and so our implementation of NNC falls back to a RelFreq strategy for nonconserved sites. We also evaluated a scenario where no structural information is available for the nearest sequence [“Masked Nearest Neighbor” (MNN)]. Here, RAM55 can make use of mixed data (both amino acid-only and rotamer sequences) using the ambiguity coding described above. Results Substitutions in the Aggregated State-Space Contain Information about the Path Taken through the Larger State-Space We first consider how it might be feasible to extract information about a process operating in a separate state-space $$\textbf{S}$$ from data in $$\textbf{A}$$ . This will naturally depend on the relationships between the structures of the state-spaces. Where some states in the aggregated space are only accessible via multiple steps in the separate space, it is possible to gather information about which states might have been visited. To illustrate this concept, we examine an example where sequences evolve in codon space and are observed in amino acid space. For simplicity, we disregard transition-transversion bias (i.e., we assume $$\kappa$$ = 1). Given an alignment and phylogeny that strongly suggest an evolutionary trajectory W $$\rightarrow$$ L $$\rightarrow$$ H, the most direct path through codon space in single nucleotide steps requires at least two synonymous substitutions (Fig. 1). Hence, knowledge of amino acid sequence evolution can reveal information about codon changes. In practice, many different routes through codon space may be compatible with the observed data; each is assessed by standard likelihood calculations and the embedded information about codon changes weighted appropriately. Figure 1. 
The path of a sequence through amino acid space contains information about which codons may have been visited. We illustrate a W (tryptophan) $$\rightarrow$$ L (leucine) $$\rightarrow$$ H (histidine) trajectory that requires multiple synonymous substitutions. Amino acid states representing the trajectory are shown in dark font. Compatible corresponding (unobserved) codons are shown in lighter font. Faded out boxes represent neighboring states in the genetic code. Solid arrows indicate changes required for the trajectory with the minimum number of steps, while dashed arrows indicate substitutions associated with multiple compatible paths. Black arrows denote nonsynonymous substitutions, and lighter arrows indicate synonymous substitutions.

Figure 2. Illustration of paths through amino acid and rotamer space, given an implied trajectory L (leucine) $$\rightarrow$$ F (phenylalanine) $$\rightarrow$$ Y (tyrosine). Dark font indicates observed amino acid states. Light font indicates unobserved rotamer configurations. Arrows show observed path through amino acid space. Lines connecting rotamer states indicate transition probabilities between states, with darker shading indicating more probable substitutions according to the RAM55 matrix. L3 $$\rightarrow$$ F3 $$\rightarrow$$ Y3 is the most likely trajectory.

Where the separate state-space model disallows many transitions (as with the codon model in equation 1), it is easy to see how inferred moves through the aggregated space can give information about the separate states. However, even when the model of interest in $$\textbf{S}$$ is described by a $$Q$$-matrix that does not contain zeros, similar principles apply. Here, none of the routes through the unobserved space are prohibited by the exchangeabilities, but each is more or less probable. We can therefore distinguish between different routes without directly observing them. For example, given an alignment that implies the amino acid trajectory L $$\rightarrow$$ F $$\rightarrow$$ Y, the RAM55 model has several available routes through rotamer space. However, considering the relative empirical exchangeabilities between states, we observe marked differences in how probable each path is (Fig. 2). This, in turn, should allow us to infer, for example, the most probable rotamer sequence at an ancestral node, using the ambiguity approach.
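The codon example of Figure 1 can be checked with a short search over the standard genetic code. The sketch below is an illustration only: the function names are invented here, and the search simply restricts itself to codons of the three observed amino acids, which is sufficient for the W, L, H case. It finds a shortest route in single-nucleotide steps from the tryptophan codon to a histidine codon and counts the synonymous steps along it.

```python
from collections import deque
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def neighbours(codon):
    """All codons that differ from `codon` at exactly one nucleotide position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


def shortest_codon_path(start_aa, via_aa, end_aa):
    """Breadth-first search through codons encoding only the observed amino acids."""
    allowed = {start_aa, via_aa, end_aa}
    starts = [c for c, a in CODE.items() if a == start_aa]
    queue = deque([c] for c in starts)
    seen = set(starts)
    while queue:
        path = queue.popleft()
        if CODE[path[-1]] == end_aa:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen and CODE[nxt] in allowed:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def count_synonymous(path):
    """Number of steps along the path that leave the amino acid unchanged."""
    return sum(CODE[a] == CODE[b] for a, b in zip(path, path[1:]))


path = shortest_codon_path("W", "L", "H")
print(path, count_synonymous(path))
```

The shortest route takes four single-nucleotide steps, two of which are synonymous changes between leucine codons, in line with the statement that the W $$\rightarrow$$ L $$\rightarrow$$ H trajectory requires at least two synonymous substitutions.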
Note that, in some cases, the aggregated state-space will not retain sufficient information about paths through the separate space to estimate parameters. One example would be a nucleotide model with $$\kappa$$ applied to a (fully) RY-coded alignment, where all transitions become unobservable and transversions provide direct routes between both aggregated states.

Selection can be Inferred from Amino Acid Data Alone

Figure 3. Estimation of codon model parameters from amino acid data. A) Simulated ($$\omega^*$$) and estimated ($$\hat{\omega}$$) values of $$\omega$$ show a strong linear relationship ($$\kappa^*$$ = 2 for all simulations shown). Squares represent the median $$\hat{\omega}_{\mathrm{M5}}$$ for each $$\omega^*$$; points show estimates from individual alignments. Dashed line indicates $$y = x$$; solid line shows the line fit for the medians, with high values of $$\omega^*$$ slightly prone to overestimation (slope = 1.07, intercept = 0.0083, $$r^2$$ = 0.92, $$p \approx$$ 0). B) Estimates of $$\kappa$$ show a similar pattern ($$\omega^*$$ = 0.3; slope = 1.00, intercept = $$-$$0.0009, $$r^2$$ = 1.00, $$p \approx$$ 0).
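The RY-coding caveat noted above can be illustrated numerically. The sketch below assumes a K80-style nucleotide model with equal base frequencies, rescaled to one expected substitution per unit time; the function names are invented for this example and it is not part of the analyses reported here. Under these assumptions, the likelihood of RY-coded data depends on $$\kappa$$ and the branch length $$t$$ only through the ratio $$t/(\kappa + 2)$$, so the two cannot be separated.

```python
import math

STATES = "TCAG"
PURINES = {"A", "G"}
# Ambiguity vectors over (T, C, A, G): R covers both purines, Y both pyrimidines.
TIP = {"R": [0.0, 0.0, 1.0, 1.0], "Y": [1.0, 1.0, 0.0, 0.0]}


def k80_matrix(kappa):
    """K80 rate matrix (transition rate kappa, transversion rate 1), unit mean rate."""
    def rate(x, y):
        if x == y:
            return 0.0
        return kappa if (x in PURINES) == (y in PURINES) else 1.0
    q = [[rate(x, y) for y in STATES] for x in STATES]
    for i in range(4):
        q[i][i] = -sum(q[i])
    scale = kappa + 2.0               # mean rate under equal base frequencies
    return [[v / scale for v in row] for row in q]


def expm_times_vector(q, t, v, terms=40):
    """exp(q * t) applied to v via a truncated Taylor series (short branches only)."""
    result, term = list(v), list(v)
    for k in range(1, terms):
        term = [t / k * sum(a * b for a, b in zip(row, term)) for row in q]
        result = [r + x for r, x in zip(result, term)]
    return result


def ry_pair_likelihood(obs1, obs2, t, kappa):
    """Site likelihood for two RY-coded observations joined by a branch of length t."""
    partial = expm_times_vector(k80_matrix(kappa), t, TIP[obs2])
    return sum(0.25 * a * p for a, p in zip(TIP[obs1], partial))


# Two different (kappa, t) pairs sharing the same value of t / (kappa + 2) = 0.1.
lik_a = ry_pair_likelihood("R", "Y", 0.4, 2.0)
lik_b = ry_pair_likelihood("R", "Y", 0.8, 6.0)
closed_form = 0.25 * (1.0 - math.exp(-0.4))
print(lik_a, lik_b, closed_form)
```

Because doubling the branch length compensates exactly for the change in $$\kappa$$, both settings give the same likelihood, which matches the closed-form K80 transversion probability. This contrasts with the codon case, where the structure of the genetic code leaves a signal for both $$\omega$$ and $$\kappa$$ in amino acid data.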
Next, the question arises whether ambiguity coding extracts enough signal to allow meaningful inferences to be made. We therefore asked whether the ambiguity approach permits inference about codon evolution from amino acid data, considering the M5 variant of the standard M0 codon model (see Methods section). To determine whether M5 is capable of detecting the relative strength of selection under which a sequence evolved, we require data for which this parameter is known. The most straightforward way of obtaining this is to simulate sequences under a model identical to the one used in the estimation step. We therefore considered translated sequences that were evolved on a randomly generated 20-taxon tree under the codon model M0.

As an initial benchmark, we generated 100 alignments of 3000 codons with the simulation parameters $$\omega^*$$ = 0.3 and $$\kappa^*$$ = 2, and obtained accurate and unbiased estimates of both (median $$\hat{\omega}$$ = 0.304; median $$\hat{\kappa}$$ = 1.99). Analyzing the original codon sequences using M0 with identical settings gives similar results (median $$\hat{\omega}$$ = 0.299; median $$\hat{\kappa}$$ = 2.00). We note that M5 tends to be noisier, presumably due to its inability to directly observe synonymous changes. This observation holds across a range of $$\omega^*$$ and $$\kappa^*$$ values, with high $$\omega^*$$ values being somewhat prone to overestimation, although a strong linear relationship between true and estimated parameters is maintained (Fig. 3). There is little interaction between $$\omega$$ and $$\kappa$$ (see Supplementary Fig. S3 available on Dryad). In the following, we therefore consider only the combination $$\omega^* = 0.3$$ and $$\kappa^* = 2$$, unless otherwise noted.
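To make the ambiguity approach concrete for the codon case, the following sketch builds the tip vector that an observed amino acid would contribute to a likelihood calculation over the 61 sense codons: every codon that encodes the residue is marked as compatible. This illustrates the general idea under the standard genetic code and should not be read as a description of codeml's internals; the names `tip_vector` and `SENSE_CODONS` are ours.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# The 61 sense codons form the separate state-space of the codon model.
SENSE_CODONS = [codon for codon, aa in CODON_TABLE.items() if aa != "*"]


def tip_vector(amino_acid):
    """Return a 0/1 vector over the sense codons for one observed residue.

    An entry is 1.0 if the codon is compatible with the observation.
    Gaps and unknown residues are compatible with every sense codon.
    """
    if amino_acid in "-?X":
        return [1.0] * len(SENSE_CODONS)
    vector = [1.0 if CODON_TABLE[c] == amino_acid else 0.0 for c in SENSE_CODONS]
    if not any(vector):
        raise ValueError(f"unknown amino acid: {amino_acid!r}")
    return vector


if __name__ == "__main__":
    leucine = tip_vector("L")
    print([c for c, x in zip(SENSE_CODONS, leucine) if x])
```

Such vectors take the place of the single-entry vectors used for fully observed codons, so the usual pruning algorithm sums over all codons that could underlie each amino acid.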
The parameter $$\hat{\kappa}$$ shows a relatively modest increase in variance under M5 compared to M0 (standard deviation $$s_\kappa = 0.0541$$ under M5; $$s_\kappa = 0.0364$$ under M0), presumably due to its direct, and thus inferable, impact on nonsynonymous substitution patterns. The results obtained for $$\hat{\kappa}$$ under M5 are similar to those for M6 ($$\hat{\kappa}_{\mathrm{M6}}$$ = 2.00 with $$s_\kappa = 0.0561$$), which estimates the parameter from amino acid sequences by averaging over synonymous codons rather than gathering information from ambiguity coding while traversing the tree (Yang et al. 1998). This suggests that $$\hat{\kappa}$$ can be robustly estimated from amino acid sequences, even using coarse-grained approaches.

However, as expected, discarding codon information does lead to a loss of signal primarily affecting $$\hat{\omega}$$, which displays a markedly higher variance under M5 than under M0 ($$s_\omega = 0.0456$$ vs. 0.0056, respectively). Why might this be? A comparison of the estimates for $$dS$$ tree length versus $$dN$$ tree length suggests that M5 has more difficulty estimating the former, with variation in $$dS$$ tree length accounting for almost all of the variation in overall tree length (see Supplementary Fig. S4 available on Dryad). This is consistent with the fact that synonymous changes are not directly represented in amino acid sequences, whereas nonsynonymous changes are (as long as sufficiently short timescales are considered). We therefore next consider how much information about $$\omega$$ is retained by M5, compared to M0.

How Much Information Loss Does Discarding Codons Cause?

Figure 4. Increasing the number of columns allows us to quantify how much information is retained in amino acid sequences (M5) relative to codon sequences (M0). The median standard errors of A) $$\hat{\omega}$$ and B) $$\hat{\kappa}$$ decrease for both models as codons are added. The dashed horizontal line in A) indicates the observed median standard error of $$\hat{\omega}$$ for 100 codons under M0, and illustrates that M5 requires a substantially longer alignment to reach a comparable standard error. Fitting functions of the form $$c / \sqrt{n}$$ to the median standard errors, with $$n$$ equaling the number of alignment columns, allows us to quantify the difference in information content. Equating $$c_{\mathrm{M0}} / \sqrt{fn}$$ with $$c_{\mathrm{M5}} / \sqrt{n}$$ indicates equivalent information content for $$fn$$ codons in M0 and $$n$$ codons in M5: hence M5 recovers a fraction $$f = (c_{\mathrm{M0}} / c_{\mathrm{M5}})^2$$ of the information available to M0. Alternatively, M5 has lost 100(1 $$- f$$)% of the information available to M0.
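The fitting procedure described in the Figure 4 caption can be sketched in a few lines. The sketch is an illustration rather than the code used for the analyses: it fits $$c / \sqrt{n}$$ by ordinary least squares (an assumption, since the fitting method is not stated here), and the standard errors are synthetic values generated from hypothetical constants, not results from the simulations.

```python
import math


def fit_c(lengths, standard_errors):
    """Least-squares estimate of c in the model se = c / sqrt(n).

    With x = 1 / sqrt(n), minimising sum((se - c * x) ** 2) gives
    c = sum(se * x) / sum(x ** 2).
    """
    numerator = sum(se / math.sqrt(n) for n, se in zip(lengths, standard_errors))
    denominator = sum(1.0 / n for n in lengths)
    return numerator / denominator


def information_fraction(c_m0, c_m5):
    """Fraction f of the M0 information recovered by M5: f = (c_M0 / c_M5)^2."""
    return (c_m0 / c_m5) ** 2


# Synthetic median standard errors generated from hypothetical constants.
lengths = [100, 500, 1000, 5000]
se_m0 = [0.5 / math.sqrt(n) for n in lengths]  # hypothetical c_M0 = 0.5
se_m5 = [2.0 / math.sqrt(n) for n in lengths]  # hypothetical c_M5 = 2.0

c_m0 = fit_c(lengths, se_m0)
c_m5 = fit_c(lengths, se_m5)
f = information_fraction(c_m0, c_m5)
loss_percent = 100.0 * (1.0 - f)
```

Because $$f$$ depends on the squared ratio of the fitted constants, a standard error that is $$k$$ times larger under M5 corresponds to retaining only $$1/k^2$$ of the information.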
The ability of M5 to capture information that is directly “seen” by M0 can be measured by comparing the variances in parameter estimates on alignments with varying amounts of evolutionary signal. The most straightforward way to add information to a phylogeny given a codon model is to increase the number of codons in the alignment. Since both M0 and M5 give rise to unbiased estimates of $$\hat{\omega}$$ and $$\hat{\kappa}$$ (Fig. 3), we compared the variance in the parameter estimates for alignments of varying lengths. Given $$\omega^* = 0.3$$ and $$\kappa^* = 2$$ across 1000 replicates, we find that the median standard error of $$\hat{\omega}_{\mathrm{M0}}$$, as estimated by codeml, is consistently lower than that of $$\hat{\omega}_{\mathrm{M5}}$$. Across the range, the standard error is approximately 10 times higher for M5 (Fig. 4), indicating an information loss of 98.7% (see Fig. 4A for details). By comparison, the equivalent loss for $$\hat{\kappa}$$ is only 38.8% (Fig. 4B).

Nevertheless, the estimates of $$\hat{\omega}$$ are reasonably accurate given a sufficiently long alignment (Fig. 3). It is perhaps counter-intuitive that acceptable estimates of $$\hat{\omega}$$ can still be obtained with M5 in these circumstances. However, the error can remain small relative to the estimate itself. For example, given $$\omega^* = 0.3$$, $$\kappa^* = 2$$, and $$n = 3000$$, the interquartile range for $$\hat{\omega}$$ is 0.296–0.304 under M0 and 0.270–0.344 under M5. Note that the information loss seems to vary with $$\omega^*$$. We observe a larger ratio of standard errors when $$\omega^*$$ = 0.1 (approximately 12–20 times), with an information loss of about 99.2% (results not shown).

Tree Depth and Taxon Number Impact M5 Information Loss

Our results show estimates of $$\hat{\omega}_{\mathrm{M5}}$$ to be noise-prone for short alignments.
Because increasing alignment length is not a viable solution to reduce variance for real amino acid sequence data, we also considered how other features of the alignment impact M5’s ability to accurately infer $$\hat{\omega}$$. For example, it may be possible to select sequences with higher divergence or include additional taxa in the phylogeny, hence adding more information. Scaling up the branch lengths of the tree in the simulations gives an initial improvement in the standard error of $$\hat{\omega}$$. The greatest reduction is observed for a tree length of approximately 4 times the length of the original tree (around 20), beyond which the standard error increases again for longer trees (see Fig. 5). This is a consequence of the increased number of substitution events from which the model can infer parameters, which is advantageous until the sequences become too divergent, the true number of substitutions is underestimated, and the data become too noisy.

Figure 5. Tree length influences variance in estimates of $$\hat{\omega}$$ for both M5 and M0, with intermediate values producing the lowest standard errors. Points show the median value across replicates. Note the difference in scale on the y-axis between M5 and M0.

The trajectory of the change in variance observed across different tree lengths is broadly comparable for M0 and M5, with the variance for M5 remaining consistently higher for all lengths. In the case of M0, the increase in variance for long trees is due to saturation at synonymous sites (Yang 2014). For M5, it is easy to see that multiple substitutions at a site along a single branch make it more difficult to infer the path through codon space (see Fig. 1).
This confirms that the model is behaving as expected.

Adding taxa has a similar effect on the variance. When we examine a tree of comparable height (0.5) with twice as many tips ($$n = 40$$, see Supplementary Fig. S1B available on Dryad), the standard error of $$\hat{\omega}$$ decreases compared to the smaller tree (median = 0.0223 vs. 0.0396 for 1000 replicates of 5000 codons and $$\omega^*$$ = 0.3; equivalent values for M0 are 0.0030 vs. 0.0043). As above, this behavior is expected as the additional tips add information, provided the branches are not exceedingly long.

Given these observations, we conclude that estimating the strength of selection from amino acids is a feasible strategy where nucleotide sequences may be difficult or impossible to obtain (e.g., where amino acid sequences from databases or publications cannot be reliably mapped back to the underlying codons). Although there is an appreciable loss of signal, M5 is statistically consistent and approaches the correct parameter estimates given enough amino acid sequence data.

Accurate Reconstruction of Ancestral Side Chain Configuration from Amino Acids

Figure 6. The accuracy of ancestral rotamer sequence reconstruction from mixed data under RAM55 increases and shows lower variance when more rotamer configuration information is available. The x-axis shows the fraction of rotamer configuration information removed (i.e., masked). The vertical bars show the standard deviation of the reconstruction accuracies, centered around the median. The black dash-dot lines represent the maximum accuracy reached on full (unmasked) alignments; black dashed lines show the accuracy achieved by reconstructing the amino acids under RUM20 and randomly assigning (“guessing”) the rotamer configuration. A) Results when all branches of the tree (Supplementary Fig. S2 available on Dryad) are multiplied by 0.1, showing greater overall accuracy in this case. B) Results for the standard simulation tree (see Methods section).

We next ask whether the strategy of treating characters that are not directly observable as ambiguous is also informative when the instantaneous rate matrix underlying the substitution model is not sparse (i.e., does not contain transitions with rates equal to 0). To examine how ambiguity coding performs given an empirical rotamer-aware model, we simulated data with 55 states under the RAM55 model on a 32-taxon tree (Supplementary Fig. S2 available on Dryad, see Methods section), and subsequently reconstructed the ancestral sequences under the same model. We opted to benchmark the model using reconstruction accuracy, as ancestral side chain configurations represent an output that would be otherwise unobtainable from amino acid data alone. Varying the proportion of masked sequences in the alignment (see Methods section) allows us to compare scenarios where structures are available for some of the sequences of interest, or none at all, similar to what would be observed for real empirical data.
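The masking step can be pictured as follows. This is a minimal sketch under assumptions of our own, not the procedure from the Methods section: rotamer states are written as an amino acid letter followed by a $$\chi_{1}$$ index (e.g., `L3`), residues without a rotamer index are written as a single letter, and the number of rotamer states per residue is supplied by the caller as the hypothetical `rotamers_per_residue` mapping.

```python
import random


def mask_state(state):
    """Drop the chi1 index, keeping only the amino acid: 'L3' -> 'L'."""
    return state[0]


def mask_alignment(alignment, fraction, seed=0):
    """Mask the rotamer configuration for a fraction of the sequences.

    alignment maps taxon name -> list of states such as 'L3' or 'G'.
    Returns the new alignment and the set of masked taxon names.
    The number of masked taxa is round(fraction * number of taxa).
    """
    rng = random.Random(seed)
    names = sorted(alignment)
    masked = set(rng.sample(names, round(fraction * len(names))))
    new_alignment = {
        name: [mask_state(s) if name in masked else s for s in seq]
        for name, seq in alignment.items()
    }
    return new_alignment, masked


def compatible_states(symbol, rotamers_per_residue):
    """Expand an observed symbol into the set of states it may represent.

    A fully observed state such as 'L3' is compatible only with itself;
    a masked 'L' is compatible with every rotamer state of leucine.
    """
    if len(symbol) > 1:
        return {symbol}
    count = rotamers_per_residue.get(symbol, 0)
    if count == 0:
        return {symbol}  # residue without a chi1 rotamer index
    return {f"{symbol}{i}" for i in range(1, count + 1)}


if __name__ == "__main__":
    toy = {f"taxon{i}": ["L3", "F3", "Y1", "G"] for i in range(8)}
    masked_alignment, masked_taxa = mask_alignment(toy, 0.5)
    print(sorted(masked_taxa))
    print(compatible_states("L", {"L": 3, "F": 3, "Y": 3}))
```

Masked sequences stay in the alignment and still constrain the amino acid at every site, which is what distinguishes masking from discarding a sequence altogether.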
The reconstruction accuracy for the data where rotamer information is available for all of the tips provides a benchmark for the performance of ambiguity coding. There is a relatively modest reduction in overall rotamer state reconstruction accuracy between simulations where rotamer configurations are known for all taxa and simulations where this information is not available for any of the taxa ($$\sim$$15% difference for the unscaled tree, Fig. 6B, scaling factor = 1.0). Reconstruction under RAM55 using only amino acid sequences is markedly more accurate than the alternative approaches, which use a conventional empirical amino acid model to reconstruct the protein sequence and then either assign rotamer states at random (“guessing”; Fig. 6, dashed line) or assign them based on the equilibrium frequencies of the RAM55 model (Supplementary Fig. S5 available on Dryad). Hence, it is advantageous to reconstruct under the rotamer-aware model, even when the input data are only available in the aggregated state-space.

As expected, the overall accuracy depends on how difficult the ancestral sequence reconstruction problem is, with a shallower tree showing higher sequence identity between simulated and reconstructed characters (Fig. 6A). Interestingly, the greatest increase in performance appears between alignments with no rotamer configuration information present and alignments in which 12.5% of sequences contain that information (Fig. 6). This suggests that little structural information is required in order to achieve ancestral reconstruction of rotamer states with acceptable accuracy. As one would expect, the fraction of correctly inferred states declines with increasing distance from the tips of the tree (Spearman’s rank correlation coefficient $$-$$0.537, $$p < 0.001$$; details not shown).

We also considered how the certainty with which the model assigns the correct ancestral state responds to rotamer information being masked at the tips of the tree.
Unsurprisingly, the marginal posterior probability for the correct state declines as information is removed (see Supplementary Fig. S6 available on Dryad). We observe a drop in the certainty of the reconstruction preceding the drop in accuracy.

To examine the robustness of our approach, we also assessed ancestral sequence reconstruction accuracy under a simple model violation scenario, simulating data under RAM55 with gamma-distributed rates (Yang 1994) and reconstructing under RAM55 without rate heterogeneity. When rotamer configuration is masked, we observe a larger decline in accuracy compared to a scenario with no violation, which is expected given that the amino acid sequence contains less signal (Supplementary Fig. S7 available on Dryad).

Gains Associated with using Amino Acid Sequences to Infer Rotamer Configuration in Absence of Structure

Figure 7. The accuracy of ancestral rotamer sequence reconstruction from mixed data under RAM55 increases when masked sequences, which lack rotamer states, are not discarded. The x-axis shows the fraction of information available under two scenarios. The dark circles reflect the amount of rotamer information that has been masked (i.e., replaced with amino acids), and the light crosses represent the fraction of rotamer sequences that have been replaced with gaps (discarded). Masking half of the rotamer configurations produces accuracies comparable with those obtained by replacing 1/8–1/4 of sequences in the alignments with gaps (see black dashed lines). Reconstructions with ambiguity always outperform discarding an equivalent fraction of amino acid sequences. As before, the shallower tree, A), shows higher overall accuracy.

As with the codon model example, we would like to quantify the loss of information associated with using aggregated state-space data for inference in the separate state-space. Given that the output of the empirical model we are studying is not a parameter estimate (as opposed to our mechanistic codon model/selection example) but the percentage of correctly reconstructed residues, extending the alignment is not informative. Instead, we compared the accuracy of reconstructions under two scenarios: (a) all state information (amino acid and rotamer configuration) is discarded from a proportion of sequences and (b) masking is used so that amino acid, but not rotamer, sequences are available for a proportion of the alignment. This provides a measure of the advantage gained by considering additional amino acid sequences where no structural information is available. For the unscaled tree, masking 50% of the rotamer configurations produces ancestral reconstructions that are comparable in accuracy to trees where 12.5% of taxa have gaps (Fig. 7B), indicating a noticeable advantage for including amino acid sequences where full rotamer state information is unavailable.
In other words, augmenting half of the amino acid sequences with rotamer configuration information is approximately as informative as having 87.5% of the full rotamer information. Further, removing all rotamer information and reconstructing with ambiguity is equivalent to retaining 50% of the original information. These results suggest that it can be very valuable to consider amino acid sequences that lack structures.

Improved Prediction of Side Chain Configurations in Homologous Structures

Considering its robust performance, how might ambiguity coding be put to practical use in the context of reconstructing side chain configurations? Prediction of side chain conformations is an important part of protein structure modeling and interaction modeling. For a given protein sequence of unknown structure, it is possible to construct a model of the target protein from its amino acid sequence and experimentally determined structures of related homologous proteins. This homology modeling strategy aims to predict both the main chain geometry and side chain configurations. In conserved regions, side chains can be modeled starting from configurations observed at corresponding sites in the nearest homologous structure (NNC: see Methods section). Further steps are then required, particularly to model nonconserved side chain configurations (Waterhouse et al. 2018).

Side chain configurations could be predicted for an extant amino acid sequence using RAM55 and a modified ancestral reconstruction algorithm by constraining the $$\chi_{1}$$ configuration prediction to the set of configurations that are possible given the observed amino acid at any given site. Another realistic homology modeling scenario might involve our target’s nearest homolog also lacking a resolved structure and only being available as an amino acid sequence (MNN: see Methods section).
In this context, RAM55 can use a mixed alignment (amino acid sequences and rotamer sequences) to inform its predictions rather than relying exclusively on available structures, which ought to improve reconstruction accuracy as seen above (Fig. 7). To evaluate our approach to side chain configuration prediction, we considered two empirical protein family data sets (RuBisCO and ADK, see Methods section) composed of amino acid sequences from a range of species and their corresponding rotamer configuration information. We investigated two scenarios: (i) a rotamer sequence is available for the nearest neighbor of each terminal node or (ii) only a masked amino acid sequence is available for the nearest neighbor. Predicting $$\chi_{1}$$ side chain configurations using RAM55 is more accurate ( $$\sim$$ 11% median improvement for both data sets) than NNC (see Methods section) when the nearest neighbor’s structure is available (Supplementary Figs. S8, S9 available on Dryad). Further, RAM55 can make use of all the available rotamer sequence information, as well as the nearest neighbor’s amino acid information, when the nearest neighbor’s structure is not available. Meanwhile, the traditional approach would instead rely on the second-nearest structure (MNN: see Methods section). This results in improved reconstruction accuracy for RAM55 ( $$\sim$$9% and $$\sim$$12% median improvement, respectively) over MNN (Supplementary Figs. S8, S9 available on Dryad). For both NNC and MNN analyses, the improvements with RAM55 are driven by strongly increased accuracy at nonconserved sites (results not shown). Our method provides plausible predictions of $$\chi_{1}$$ configurations using a strategy that, as opposed to NNC or MNN, explicitly models the evolutionary process along the branches of the phylogeny and can make use of amino acid information when structures are not available. 
RAM55-based predictions could speed up the side chain homology modeling process by creating an informed prior to constrain the search space, particularly where close homologs with unresolved structures might otherwise be discarded by traditional strategies.

Discussion

We have demonstrated that treating characters in an aggregated state-space $$\textbf{A}$$ as ambiguous versions of characters in a larger state-space $$\textbf{S}$$ allows us to obtain information that would otherwise not be accessible from data in $$\textbf{A}$$. Our examples show that this is true for estimating the strength of natural selection under a codon model, and for reconstructing ancestral side chain configurations under an empirical model, both from amino acid sequences alone. Naturally, where data are available in a larger state-space matching the internal structure of the preferred model, it is advantageous to make use of them. The codon model example provides a particularly clear illustration: completely discarding codon information leads, for obvious reasons, to increased variance in estimates of $$\hat{\omega}$$. We nevertheless find it remarkable that selection parameter estimates ordinarily derived from comparisons of synonymous and nonsynonymous substitutions can be obtained given sufficient amino acid data.

It has previously been argued that modeling coding sequence evolution at the codon level rather than the amino acid level is generally preferable because it offers a more detailed description of the process that generated the data (Ren et al. 2005; Seo and Kishino 2008; Kosiol and Goldman 2011; Whelan et al. 2015; Weber and Whelan 2019). On the other hand, the ambiguity approach may be useful to obtain an approximate estimate of the strength of natural selection in cases where amino acid sequences are more readily obtainable.
For example, the supplementary materials accompanying phylogenetic studies often only provide amino acid alignments, and as many as 17% of nucleotide sequences corresponding to proteins in Pfam (El-Gebali et al. 2018) have been previously reported unrecoverable (Whelan et al. 2003). The ability to perform a preliminary screen to determine whether a sequence of interest is under weak or strong evolutionary constraint might therefore be convenient. However, we caution against over-interpreting the results returned by M5, particularly when individual sequences are being considered or codon usage may be biased (violating model assumptions). The absence of high-quality structures for many extant proteins provides more practical applications for ambiguity coding. In the case of the RAM55 empirical rotamer model, we have shown the utility of using amino acid sequences alone, and “mixed” inputs where even a limited amount of structural information leads to considerable improvements in the accuracy of $$\chi_{1}$$ configuration prediction. Being able to use information from amino acid sequences improves prediction accuracy over modeling side chains based on the nearest available structure alone. This approach could benefit homology modeling strategies, specifically the steps involving modeling both conserved side chains based on a known template structure, and nonconserved side chain modeling achieved by searching a rotamer library and minimizing an energy function (Xu 2005; Krivov et al. 2009; Shapovalov and Dunbrack Jr 2011). In this context, RAM55’s predictions constrain the rotamer configuration sampling space. This could result in a reduction of the number of energy refinement cycles required. In addition, using RAM55 and the marginal ancestral reconstruction algorithm makes it possible to obtain posterior probabilities for each of the possible configurations at a given site. 
This distribution might provide a more robust prior for further refinement, compared to using the single most likely reconstructed configuration or the nearest homolog’s configuration at that site. Further work would be required to quantify improvements in speed and accuracy. Given the advantage of including mixed input data demonstrated in our rotamer sequence reconstruction analyses, we expect combining amino acid and DNA sequences to be promising, as well as straightforward to implement. This would address some of the current limitations of M5 with respect to analyzing phylogenetic data sets with some missing codon sequences. For example, accurate estimates of codon frequencies would be more readily obtainable. A more speculative and potentially intriguing application would be estimating selection or structural information from ancient protein sequences. Proteins can persist for longer in the environment than DNA under certain conditions (Schweitzer et al. 2007; Wadsworth and Buckley 2014; Cappellini et al. 2019), enabling phylogenetic inferences to be made based on substantially older specimens such as dinosaurs and other extinct organisms (Schroeter et al. 2017; Schweitzer et al. 2019; Welker et al. 2019). Our methods permit the use of a mixture of all available DNA and protein sequences to maximize signal, extending analyses that are normally only possible with DNA sequences to incorporate additional data sources. In the absence of any compelling available ancient protein data sets, we do not attempt to provide a benchmark here. The proof of principle described here using two relatively simple models should not be taken as a substitute for carefully stress-testing the ambiguity coding approach for specific applications. As is the case with all models, with or without ambiguous inputs, making overly simple assumptions about the data can lead to misspecification and therefore inaccurate results. 
We recommend performing appropriate benchmarks and specifying models accordingly. As illustrated by our rate heterogeneity model violation scenario, we expect model misspecifications to have similar effects with ambiguity coding as they would in general: estimates will become noisier. Due to the information loss inherent to relying on the aggregated state-space, a somewhat greater decline is naturally to be expected. However, we note that our empirical analysis, which demonstrates that rotamer states can be reconstructed using real sequences as input, suggests that more complex scenarios can be captured. One could conceive of a variety of extensions to our implementations, including gamma-distributed rate variation or mixture models of codon evolution. Assessing them all thoroughly is beyond the scope of this manuscript.

In this work, we have shown that ambiguity coding allows evolutionary inference from partially “hidden” data under phylogenetic models with both sparse (e.g., mechanistic) and nonsparse (e.g., empirical) exchangeability matrices. Thus, the principles underlying likelihood analysis of missing data (Felsenstein 2004; Yang 2014) and covariotide models (Huelsenbeck 2002) can be applied more broadly, allowing us to estimate selection and reconstruct aspects of protein structure given input data that are not fully resolved. Finally, ambiguity coding could conceivably be applied to other state-spaces beyond amino acids, codons, and rotamer states, provided there is reason to believe that movement through the aggregated space contains information about the separate space.

Supplementary Material

Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.tx95x69sm.

Acknowledgments

We thank Nicola De Maio, Iain Moal, and Alexey Kozlov for helpful discussions, and the reviewers and editors for thoughtful feedback.
Model Availability
The M5 codon model is available as model 5 in codeml (PAML version 4.9h) (Yang 2007) and is run with the sequence type set to amino acids (seqtype = 2). The program overrides the codon frequency setting specified in the control file and resets the CodonFreq variable to 0 (1/61). Rotamer sequence simulation and ancestral sequence reconstruction code is available at https://bitbucket.org/uperron/ambiguity_coding.

References
Cappellini E., Welker F., Pandolfi L., Ramos-Madrigal J., Samodova D., Rüther P.L., Fotakis A.K., Lyon D., Moreno-Mayar J.V., Bukhsianidze M., Rakownikow Jersie-Christensen R., Mackie M., Ginolhac A., Ferring R., Tappen M., Palkopoulou E., Dickinson M.R., Stafford T.W., Chan Y.L., Götherström A., Nathan S.K.S.S., Heintzman P.D., Kapp J.D., Kirillova I., Moodley Y., Agusti J., Kahlke R.-D., Kiladze G., Martínez-Navarro B., Liu S., Sandoval Velasco M., Sinding M.-H.S., Kelstrup C.D., Allentoft M.E., Orlando L., Penkman K., Shapiro B., Rook L., Dalén L., Gilbert M.T.P., Olsen J.V., Lordkipanidze D., Willerslev E. 2019. Early Pleistocene enamel proteome from Dmanisi resolves Stephanorhinus phylogeny. Nature 574:103–107.
De Maio N., Schrempf D., Kosiol C. 2015. PoMo: an allele frequency-based approach for species tree estimation. Syst. Biol. 64:1018–1031.
El-Gebali S., Mistry J., Bateman A., Eddy S.R., Luciani A., Potter S.C., Qureshi M., Richardson L.J., Salazar G.A., Smart A., Sonnhammer E.L., Hirsh L., Paladin L., Piovesan D., Tosatto S.C., Finn R.D. 2018. The Pfam protein families database in 2019. Nucleic Acids Res. 47:D427–D432.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.
Felsenstein J. 2004. Inferring phylogenies. Sunderland, MA: Sinauer Associates.
Fitch W.M., Markowitz E. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579–593.
Galtier N. 2001. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol. 18:866–873.
Goldman N., Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736.
Huelsenbeck J.P. 2002. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. 19:698–707.
Katoh K., Misawa K., Kuma K-I, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059–3066.
Koshi J.M., Goldstein R.A. 1995. Context-dependent optimal substitution matrices. Protein Eng. Des. Sel. 8:641–645.
Koshi J.M., Goldstein R.A. 1996. Probabilistic reconstruction of ancestral protein sequences. J. Mol. Evol. 42:313–320.
Kosiol C., Goldman N. 2011. Markovian and non-Markovian protein sequence evolution: aggregated Markov process models. J. Mol. Biol. 411:910–923.
Kozlov A.M., Darriba D., Flouri T., Morel B., Stamatakis A. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35:4453–4455.
Kozlov O. 2018. Models, optimizations, and tools for large-scale phylogenetic inference, handling sequence uncertainty, and taxonomic validation [Ph.D. thesis]. Karlsruhe Institute of Technology, Karlsruhe, Germany.
Krivov G.G., Shapovalov M.V., Dunbrack R.L. Jr. 2009. Improved prediction of protein side-chain conformations with SCWRL4. Proteins: Struct. Funct. Bioinformatics 77:778–795.
Le S.Q., Gascuel O. 2008. An improved general amino acid replacement matrix. Mol. Biol. Evol. 25:1307–1320.
Le S.Q., Gascuel O. 2010. Accounting for solvent accessibility and secondary structure in protein phylogenetics is clearly beneficial. Syst. Biol. 59:277–287.
Le S.Q., Gascuel O., Lartillot N. 2008a. Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics 24:2317–2323.
Le S.Q., Lartillot N., Gascuel O. 2008b. Phylogenetic mixture models for proteins. Philos. Trans. R. Soc. B 363:3965–3976.
Perron U., Kozlov A.M., Stamatakis A., Goldman N., Moal I.H. 2019. Modelling structural constraints on protein evolution via side-chain conformational states. Mol. Biol. Evol. 36:2086–2103.
Pupko T., Pe'er I., Shamir R., Graur D. 2000. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol. Biol. Evol. 17:890–896.
Ren F., Tanaka H., Yang Z. 2005. An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Syst. Biol. 54:808–818.
Schroeter E.R., DeHart C.J., Cleland T.P., Zheng W., Thomas P.M., Kelleher N.L., Bern M., Schweitzer M.H. 2017. Expansion for the Brachylophosaurus canadensis collagen I sequence and additional evidence of the preservation of Cretaceous protein. J. Proteome Res. 16:920–932.
Schweitzer M.H., Schroeter E.R., Cleland T.P., Zheng W. 2019. Paleoproteomics of Mesozoic dinosaurs and other Mesozoic fossils. Proteomics 19:1800251.
Schweitzer M.H., Suo Z., Avci R., Asara J.M., Allen M.A., Arce F.T., Horner J.R. 2007. Analyses of soft tissue from Tyrannosaurus rex suggest the presence of protein. Science 316:277–280.
Seo T.-K., Kishino H. 2008. Synonymous substitutions substantially improve evolutionary inference from highly diverged proteins. Syst. Biol. 57:367–377.
Shapovalov M.V., Dunbrack R.L. Jr. 2011. A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure 19:844–858.
Sutcliffe M., Hayes F., Blundell T. 1987. Knowledge based modelling of homologous proteins, part II: rules for the conformations of substituted sidechains. Protein Eng. Des. Sel. 1:385–392.
Tuffley C., Steel M. 1998. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147:63–91.
Vakser I.A. 2014. Protein-protein docking: from interaction to interactome. Biophys. J. 107:1785–1793.
Wadsworth C., Buckley M. 2014. Proteome degradation in fossils: investigating the longevity of protein survival in ancient bone. Rapid Commun. Mass Spectrom. 28:605–615.
Waterhouse A., Bertoni M., Bienert S., Studer G., Tauriello G., Gumienny R., Heer F.T., de Beer T.A., Rempfer C., Bordoli L., Lepore R., Schwede T. 2018. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 46:W296–W303.
Weber C.C., Whelan S. 2019. Physicochemical amino acid properties better describe substitution rates in large populations. Mol. Biol. Evol. 36:679–690.
Welker F., Ramos-Madrigal J., Kuhlwilm M., Liao W., Gutenbrunner P., de Manuel M., Samodova D., Mackie M., Allentoft M.E., Bacon A.-M., et al. 2019. Enamel proteome shows that Gigantopithecus was an early diverging pongine. Nature 576:262–265.
Whelan S., Allen J.E., Blackburne B.P., Talavera D. 2015. ModelOMatic: fast and automated model selection between RY, nucleotide, amino acid, and codon substitution models. Syst. Biol. 64:42–55.
Whelan S., De Bakker P.I., Goldman N. 2003. Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics 19:1556–1563.
Whelan S., Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18:691–699.
wwPDB Consortium. 2018. Protein data bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 47:D520–D528.
Xu J. 2005. Rapid protein side-chain packing via tree decomposition. Annual International Conference on Research in Computational Molecular Biology. Cambridge (MA): Springer. p. 423–439.
Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306–314.
Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24:1586–1591.
Yang Z. 2014. Molecular evolution: a statistical approach. New York: Oxford University Press.
Yang Z., Kumar S., Nei M. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641–1650.
Yang Z., Nielsen R., Goldman N., Pedersen A.-M.K. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.
Yang Z., Nielsen R., Hasegawa M. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15:1600–1611.
Yang Z., Rannala B. 1997. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol. Biol. Evol. 14:717–724.
Zhang Q.C., Petrey D., Garzón J.I., Deng L., Honig B. 2012. PrePPI: a structure-informed database of protein–protein interactions. Nucleic Acids Res. 41:D828–D833.
© The Author(s) 2020. Published by Oxford University Press, on behalf of the Society of Systematic Biologists.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Consistency of SVDQuartets and Maximum Likelihood for Coalescent-Based Species Tree Estimation
Wascher, Matthew; Kubatko, Laura
doi: 10.1093/sysbio/syaa039 pmid: 32415974
Abstract
Numerous methods for inferring species-level phylogenies under the coalescent model have been proposed within the last 20 years, and debates continue about the relative strengths and weaknesses of these methods. One desirable property of a phylogenetic estimator is that of statistical consistency, which means intuitively that as more data are collected, the probability that the estimated tree has the same topology as the true tree goes to 1. To date, consistency results for species tree inference under the multispecies coalescent (MSC) have been derived only for summary statistics methods, such as ASTRAL and MP-EST. These methods have been found to be consistent given true gene trees but may be inconsistent when gene trees are estimated from data for loci of finite length. Here, we consider the question of statistical consistency for four taxa for SVDQuartets for general data types, as well as for the maximum likelihood (ML) method in the case in which the data are a collection of sites generated under the MSC model such that the sites are conditionally independent given the species tree (we call these data coalescent independent sites [CIS] data). We show that SVDQuartets is statistically consistent for all data types (i.e., for both CIS data and for multilocus data), and we derive its rate of convergence. We additionally show that ML is consistent for CIS data under the JC69 model and discuss why a proof for the more general multilocus case is difficult. Finally, we compare the performance of ML and SVDQuartets using simulation for both data types. [Consistency; gene tree; maximum likelihood; multilocus data; phylogenetic inference; species tree; SVDQuartets.]

Advances in sequencing technology over the last 20 years have led to widespread availability of large-scale sequence data sets from multiple loci for which the goal is to obtain an estimate of the species-level phylogenetic relationships among the taxa under consideration.
Analysis of such data has presented significant computational challenges, however, because inference methods must include models that capture variation at two distinct scales. First, a model for the process by which the phylogenetic histories of individual loci vary given the overall species tree must be developed. The coalescent process (Kingman 1982a, 1982b, 1982c) is usually used for this purpose. Second, the mutation process arising along the locus-specific phylogenies, typically called gene trees, must be modeled. This is usually accomplished using standard nucleotide substitution models (Liò and Goldman 1998). Together, these two model components are often referred to as the multispecies coalescent (MSC). Numerous methods for inference of species trees under the MSC have been developed (for reviews, see, e.g., Liu et al. 2009 and Kubatko 2019). Inference of the species phylogeny under the MSC is challenging because the gene trees are not directly observed, and must therefore be integrated over when computing probabilities associated with the DNA sequence data. Consider a species tree with |$M$| species labeled |$1, 2, \ldots, M$|, and suppose that |$m_j$| individuals are sampled within each species |$j$|. Thus, |$\mathcal{M} = \sum_{j=1}^{M} m_j$| is the total number of sequences in the data set. Using the framework of the MSC, we denote the probability density of gene tree history |$h$| and associated vector of coalescent times |$\mathbf{t}_h$|, conditional on species tree topology |$S$| and vector of speciation times |$\mathbf{\tau}$|, by |$f_{(h,\mathbf{t}_h) | (S, \mathbf{\tau})}$| (see Rannala and Yang 2003 for a description of how to compute this density).
We further define a site pattern to be an assignment of states |$i_1 i_2 \cdots i_{\mathcal{M}}$| to the |$\mathcal{M}$| tips of the tree, such that |$i_k \in \{A, C, G, T \}$| for |$k = 1, 2, \ldots, \mathcal{M}$|, and we denote the probability of site pattern |$p^h = i_1 i_2 \cdots i_{\mathcal{M}}$| arising from gene tree history |$(h, \mathbf{t}_h)$| by |$p^{h}_{(i_1 i_2 \cdots i_{\mathcal{M}}) | (h,\mathbf{t}_h)} $|. This probability is the usual phylogenetic likelihood along a gene tree, computed assuming one of the standard nucleotide substitution models. The probability of observing site pattern |$p = i_1 i_2 \cdots i_{\mathcal{M}}$| from the species tree is then given by $$\begin{equation}\label{eq:site.pattern.prob.cis} p_{i_1 i_2 \cdots i_{\mathcal{M}}|(S,\mathbf{\tau})} = \sum_{h \in \mathcal{H}} \int_{\mathbf{t}_h} p^{h}_{(i_1 i_2 \cdots i_{\mathcal{M}}) | (h,\mathbf{t}_h)} \, f_{(h,\mathbf{t}_h) | (S, \mathbf{\tau})} \, d\mathbf{t}_h \end{equation}$$(1) where the sum is taken over all gene tree histories |$h \in \mathcal{H}$|, with the corresponding branch lengths |$\mathbf{t}_h$| integrated out. See Chifman and Kubatko (2015) for full details of the calculations. Note that Equation (1) implies that each site in the sequence alignment is an independent observation from the model; that is, each site represents a draw from the distribution of gene trees given the species tree as specified by the MSC, with subsequent mutation along the sampled gene tree according to one of the standard nucleotide substitution models. We use the term coalescent independent sites (CIS) to distinguish data of this type from single nucleotide polymorphism (SNP) data, which do not usually include invariable sites.
Under this model, a sample of |$N$| CIS can be viewed as a sample from the multinomial distribution, where the number of categories is the number of possible site patterns, |$4^{\mathcal{M}}$|, and the category probabilities are given by the site pattern probabilities. Thus, assuming that the sites are independent conditional on the species tree, the log likelihood of species tree |$(S, \mathbf{\tau})$| is given by $$\begin{equation}\label{eq:CIS.likelihood} \mathcal{L} \biggl( (S, \mathbf{\tau}) \biggr) = \sum_{q = 1}^{4^\mathcal{M}} x_q \log (p_q), \end{equation}$$(2) where |$x_q$| is the observed number of sites with pattern |$q$|, |$p_q$| is the probability of site pattern |$q$| under the model, |$q = 1, 2, \ldots, 4^{\mathcal{M}}$|, and |$\sum_{q=1}^{4^{\mathcal{M}}} x_q = N$|. We note that the site pattern probabilities are functions of the parameters in the MSC model, including both the branch lengths and the effective population sizes along each branch. This likelihood was mentioned earlier by Xu and Yang (2016). The likelihood for multilocus data is more complicated, because in that case sites within a locus share the same gene tree and are thus correlated with one another, unless we condition on the gene tree. Suppose that there are |$G$| loci and that locus |$g$| has length |$n_g$|. Let |$p^{h}_{j | (h,\mathbf{t}_h)}$| denote the probability that the site pattern observed for site |$j$| within a particular locus arises from gene tree history |$(h,\mathbf{t}_h)$|. Then, the multilocus likelihood of species tree |$(S,\mathbf{\tau})$| is $$\begin{equation} \mathcal{L}\biggl( (S,\mathbf{\tau}) \biggr) = \prod_{g=1}^{G} \biggl( \sum_{h \in \mathcal{H}} \int_{\mathbf{t}_{h}}\biggl( \prod_{j=1}^{n_g} p^{h}_{j | (h,\mathbf{t}_h)} \biggr)\ f_{(h,\mathbf{t}_h) | (S, \mathbf{\tau})} \, d \mathbf{t}_h \biggr).
\end{equation}$$(3) The outermost product is taken over the |$G$| loci, and assumes that the loci are independent, conditional on the species tree. Comparing the terms inside this outer product to the expression in Equation (1), we see that within the integral over the gene tree branch lengths, the product over the |$n_g$| sites within each gene must be taken. These sites are conditionally independent given the gene tree and branch lengths but are not independent when conditioning only on the species tree because they share a common gene tree. The product appearing inside the integral makes it difficult to apply standard asymptotic arguments to this expression. Even taking the log of this likelihood, which allows the likelihood based on CIS data (Equation (2)) to be handled in a straightforward way, does not resolve the problem of the product appearing inside the integral. This also makes clear why computation of the species tree likelihood for multilocus data under the MSC model is challenging. In fact, we do not know of any direct implementations that compute this likelihood for trees larger than the four-taxon case we consider here. To study the convergence properties of SVDQuartets and of maximum likelihood (ML), we consider the case in which |$M=4$| and |$m_j=1$| for all |$j$|, that is, we consider four-taxon trees with one sequence sampled in each species. In this case, there are |$4^{4} = 256$| possible site patterns, 15 rooted species trees, and 3 unrooted species trees. When considering ML to estimate the species tree, we restrict our attention to CIS data and use the likelihood given in Equation (2), given the difficulty in handling the multilocus likelihood discussed above. To find the ML estimate of the species tree for CIS data, one needs to be able to compute the true site pattern probabilities for each possible species tree. 
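As a minimal sketch of how the CIS log likelihood in Equation (2) can be evaluated, the following Python snippet counts site patterns in a toy alignment and sums |$x_q \log(p_q)$| over the observed patterns. The function names, the toy alignment, and the uniform pattern probabilities are assumptions made purely for illustration; in an actual analysis the probabilities would come from the MSC site pattern formulas for a candidate species tree, and this is not the authors' R implementation.

```python
import math
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"


def site_pattern_counts(alignment):
    """Count site patterns (alignment columns) across the tip sequences.

    `alignment` is a list of equal-length strings, one per tip.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same length")
    return Counter("".join(column) for column in zip(*alignment))


def cis_log_likelihood(counts, pattern_probs):
    """Evaluate sum_q x_q * log(p_q), the form of Equation (2).

    Patterns that were not observed (x_q = 0) contribute nothing to the sum.
    """
    log_lik = 0.0
    for pattern, x_q in counts.items():
        p_q = pattern_probs.get(pattern, 0.0)
        if p_q <= 0.0:
            return float("-inf")  # an observed pattern has probability zero
        log_lik += x_q * math.log(p_q)
    return log_lik


# Toy alignment (hypothetical): four tips, one sequence per species, N = 4 sites.
alignment = ["ACGT", "ACGA", "ACTT", "AGGT"]
counts = site_pattern_counts(alignment)

# Placeholder probabilities: uniform over the 4^4 = 256 patterns. These are
# NOT MSC probabilities; they only make the example self-contained.
uniform_probs = {"".join(p): 1 / 256 for p in product(NUCLEOTIDES, repeat=4)}

log_lik = cis_log_likelihood(counts, uniform_probs)
print(dict(counts))
print(log_lik)
```

Because only observed patterns enter the sum, the cost of one likelihood evaluation scales with the number of distinct patterns in the data rather than with all 256 categories.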
Formulas for these site pattern probabilities were given by Chifman and Kubatko (2015) for simple substitution models (e.g., JC69; Jukes and Cantor 1969). Under the JC69 model and using these formulas with a single value of the effective population size parameter, |$\theta$|, specified for the entire tree, we can find the ML estimate of the species tree by considering each of the 15 rooted species trees and finding the set of speciation times that maximizes the likelihood for each. The tree with the highest likelihood is the ML estimate. We have implemented this method in R using the optim function to carry out the optimization for each topology. Our code can be found at https://github.com/lkubatko/SpeciesTreeConsistency. To obtain an estimate of the four-taxon species tree for SVDQuartets for any data type (both CIS and multilocus data) and for the GTR+I+|$\Gamma$| model or any submodel, let |$L$| denote the set of four taxa under consideration, and suppose that |$L$| is partitioned into two sets, |$L_1$| and |$L_2$|, such that |$|L_1| = |L_2| =2$|. We say that |$L_1 | L_2$| is a split. The split |$L_1|L_2$| is valid for tree |$S$| if the subtrees containing the taxa in |$L_1$| and in |$L_2$| do not intersect; otherwise the split is not valid. For example, consider the tree |$((1,2),(3,4))$|. The split |$12|34$| is valid, while the splits |$13|24$| and |$14|23$| are not valid. For each of the three possible splits, the 256 possible site patterns can be arranged into a |$16 \times 16$| matrix in which the rows of the matrix correspond to possible states for the taxa in |$L_1$| and the columns correspond to possible states for the taxa in |$L_2$|. Such a matrix is called a flattening matrix, and is denoted |$Flat_{L_1|L_2}$|.
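The split and flattening definitions above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names are hypothetical, the tree is represented simply by the two taxon pairs of its unrooted quartet (an assumption that suffices for four taxa), and site patterns are assumed to be four-letter strings giving the states of taxa 1 to 4 in order.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# Row/column labels in the order AA, AC, AG, AT, CA, ..., TT.
PAIR_INDEX = {a + b: i for i, (a, b) in enumerate(product(NUCLEOTIDES, repeat=2))}


def splits_of_four(taxa=(1, 2, 3, 4)):
    """Return the three ways of partitioning four taxa into two pairs."""
    first, rest = taxa[0], taxa[1:]
    return [((first, partner), tuple(t for t in rest if t != partner))
            for partner in rest]


def is_valid_split(split, quartet):
    """True if the split matches the unrooted quartet, e.g. ((1, 2), (3, 4))."""
    return {frozenset(side) for side in split} == {frozenset(pair) for pair in quartet}


def flattening(counts, split):
    """Arrange observed site pattern frequencies into a 16 x 16 matrix.

    Rows index the states of the taxa in L1, columns those of the taxa in L2.
    """
    (r1, r2), (c1, c2) = split
    total = sum(counts.values())
    flat = [[0.0] * 16 for _ in range(16)]
    for pattern, n in counts.items():
        row = PAIR_INDEX[pattern[r1 - 1] + pattern[r2 - 1]]
        col = PAIR_INDEX[pattern[c1 - 1] + pattern[c2 - 1]]
        flat[row][col] += n / total
    return flat


splits = splits_of_four()
validity = [is_valid_split(s, ((1, 2), (3, 4))) for s in splits]

# Hypothetical counts: pattern AGAC seen three times, AAAA once.
toy_counts = {"AGAC": 3, "AAAA": 1}
flat_12_34 = flattening(toy_counts, ((1, 2), (3, 4)))
print(splits, validity)
print(flat_12_34[2][1])  # row [AG], column [AC] (zero-based indices)
```

With zero-based indexing, `flat_12_34[2][1]` holds the frequency of pattern AGAC, that is, the entry in row [AG] and column [AC].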
For an empirical data set, the entries of the matrix are the observed frequencies of the site pattern that corresponds to the row and column indices, that is, $$\begin{gather*} Flat_{12|34} = \begin{pmatrix} & {[AA]} & {[AC]} & {[AG]} & {[AT]} & {[CA]} & \cdots & {[TT]} \\ {[AA]} & p_{AAAA} & p_{AAAC} & p_{AAAG} & p_{AAAT} & p_{AACA} & \cdots & p_{AATT} \\ {[AC]} & p_{ACAA} & p_{ACAC} & p_{ACAG} & p_{ACAT} & p_{ACCA} & \cdots & p_{ACTT} \\ {[AG]} & p_{AGAA} & p_{AGAC} & p_{AGAG} & p_{AGAT} & p_{AGCA} & \cdots & p_{AGTT} \\ {[AT]} & p_{ATAA} & p_{ATAC} & p_{ATAG} & p_{ATAT} & p_{ATCA} & \cdots & p_{ATTT} \\ {[CA]} & p_{CAAA} & p_{CAAC} & p_{CAAG} & p_{CAAT} & p_{CACA} & \cdots & p_{CATT} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ {[TT]} & p_{TTAA} & p_{TTAC} & p_{TTAG} & p_{TTAT} & p_{TTCA} & \cdots & p_{TTTT} \\ \end{pmatrix} \end{gather*}$$ For example, the |$(3,2)$| entry, |$p_{AGAC}$|, is the probability of observing nucleotide |$A$| for taxon |$1$|, |$G$| for taxon |$2$|, |$A$| for taxon |$3$|, and |$C$| for taxon |$4$|. When the rows and columns of the matrix correspond to a valid split, the matrix will have rank 10 for data observed perfectly from the model. When the rows and columns correspond to a split that is not valid, the matrix will have rank 16. The SVDQuartets method constructs three matrices (one for each of the three possible splits for four taxa), and computes the singular value decomposition (SVD) score for each matrix, $$\begin{equation}\label{eq:svd.score} SVD(L_1|L_2) = \sqrt{\sum_{k=11}^{16} \hat{\sigma}_k^2}, \end{equation}$$(4) where |$\hat{\sigma}_k$| is the |$k\textrm{th}$| singular value computed for the matrix of observed site pattern frequencies.
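The SVD score in Equation (4) can be sketched with NumPy as follows. This is a hedged illustration, not the PAUP* implementation: the function names and the toy frequency tensor are assumptions, and the three flattenings are obtained here by permuting the axes of a |$4 \times 4 \times 4 \times 4$| array of site pattern frequencies. The toy data place equal mass on patterns of the form xxyy, so the flattening for split 12|34 has rank 1 (not rank 10 as under the MSC), which is enough to show the score separating the splits.

```python
import numpy as np

# Axis orders that place the L1 taxa first and the L2 taxa last.
SPLIT_AXES = {
    "12|34": (0, 1, 2, 3),
    "13|24": (0, 2, 1, 3),
    "14|23": (0, 3, 1, 2),
}


def svd_score(flat):
    """Equation (4): square root of the sum of squared singular values 11..16."""
    singular_values = np.linalg.svd(flat, compute_uv=False)  # descending order
    return float(np.sqrt(np.sum(singular_values[10:] ** 2)))


def svdquartets_split(freqs):
    """Score all three splits and return the one with the lowest SVD score.

    freqs[a, b, c, d] is the observed frequency of the site pattern with
    states a, b, c, d (coded 0..3) at taxa 1, 2, 3, 4.
    """
    scores = {
        name: svd_score(np.transpose(freqs, axes).reshape(16, 16))
        for name, axes in SPLIT_AXES.items()
    }
    return min(scores, key=scores.get), scores


# Toy frequencies (hypothetical): taxa 1 and 2 always agree, taxa 3 and 4 always agree.
freqs = np.zeros((4, 4, 4, 4))
for x in range(4):
    for y in range(4):
        freqs[x, x, y, y] = 1 / 16

best_split, scores = svdquartets_split(freqs)
print(best_split)
print(scores)
```

In this toy case the flattenings for 13|24 and 14|23 are diagonal with full rank 16, so their scores are clearly positive, while the score for 12|34 is zero up to floating-point error.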
For observed data, the magnitudes of the 11th through 16th singular values are expected to be small when the matrix corresponds to the valid split, and thus the split |$L_1|L_2$| with the lowest |$SVD(L_1|L_2)$| is selected. Note that in the case of four taxa, identifying the valid split is equivalent to inferring the unrooted species tree. Under either criterion for estimation, we denote the estimator of the species tree by |$S^*$| and the true species tree by |$S$|. Intuitively, consistency means that as more data are used to form the species tree estimate, the probability that |$S^* = S$| goes to 1. SVDQuartets has been assumed to be statistically consistent, but a formal proof has not been provided. ML is known to be consistent when used to estimate gene trees, but consistency of ML has not been formally examined in the species tree case. In the sections below, we prove that SVDQuartets is consistent for both CIS and multilocus data and that ML is consistent for CIS data. We derive bounds for the error probability of SVDQuartets and compare both methods using both theory and simulations.

Consistency Results
We first define a generative model for multilocus data.

Definition 0.1. We assume that data are generated from the following statistical model:
1. Population and genome sizes are large enough that the fact that genes and sites are sampled without replacement can be ignored.
2. Define |$\mathbf{p}$| to be a vector of multinomial probabilities such that if we select a gene at random and sample one nucleotide at random from that gene, the unconditional site pattern distribution is |$\mathbf{X}\sim \textrm{Multinomial}(1,\mathbf{p})$|.
3. Define |$\{\mathbf{D}_i\}_{i = 1}^{N} \overset{iid}{\sim} F$| such that |$E(\mathbf{D}_i) = \mathbf{0}$|, and such that if we select |$N$| genes at random, each of the |$N$| genes has multinomial site pattern probabilities |$\mathbf{p}_i$|, where |$\mathbf{p}_i \overset{d}{=} \mathbf{p} + \mathbf{D}_i$|.
4. Conditional on |$\{\mathbf{D}_i\}_{i = 1}^{N}$|, if |$\mathbf{X}_i$| are the observed site pattern counts for a sample of |$n_i$| nucleotides from gene |$i$|, then |$\mathbf{X}_i \sim \textrm{Multinomial}(n_i,\mathbf{p}_i)$| and the collection |$\{\mathbf{X}_i\}_{i=1}^{N}$| is independent.
Note that when |$n_i = 1$| for all |$i$|, this model generates CIS data, which may thus be considered a special case of multilocus data.

Consistency for ML for CIS Data
While the literature contains numerous proofs of consistency of ML for estimation of gene trees (e.g., Yang 1994; Rogers 1997; RoyChoudhury et al. 2015; Truszkowski and Goldman 2016, described further below), no such proofs have been given for the case of ML estimation of the species tree, in part because it is not computationally feasible to use ML for species tree estimation under the coalescent model for trees of arbitrary size, as discussed above. Some recent attention has also been given to evaluating the consistency of methods other than ML for estimating the species tree, but such work has focused primarily on the case in which multilocus data are collected and summary statistics methods are used to form estimators (Roch et al. 2019) or on the concatenation method (Roch and Steel 2015). In this section, we formally prove that for CIS data ML estimation of the species tree under the MSC described above is statistically consistent for four-taxon trees. We follow the proof of Truszkowski and Goldman (2016) for the case of gene trees, as most of their proof generalizes directly to the species tree case and their proof corrects the omissions of earlier proofs. We refer the reader to Truszkowski and Goldman (2016) for many of the details. We first review related work for the case of ML estimation of gene trees, i.e., trees estimated using data from a single locus under one of the standard models of nucleotide substitution.
Early proofs of the consistency of ML estimation for gene trees were given by Yang (1994) and Rogers (1997), but more recent examinations by RoyChoudhury et al. (2015) and Truszkowski and Goldman (2016) have found that these proofs are incomplete. RoyChoudhury et al. (2015) explain the problems with these proofs succinctly; we outline their argument here as it will apply to our proof for the species tree case given below. First, note that, assuming identifiability of the gene tree topology, which requires nonzero internal edges (for a proof, see, e.g., Allman et al. 2008, and note condition (2) in Section 2.1), the following proposition results from a straightforward application of the Strong Law of Large Numbers. Proposition 0.2. Suppose |$T_0$| is the true tree and |$T_j$| is any other tree. Then there exists |$N$| such that for all |$n \ge N$| $$\begin{equation} L(T_0) > L(T_j) \hspace{2mm} a.s. \end{equation}$$(5) Though it is tempting to use Proposition 0.2 to claim consistency of the ML estimate of the gene tree topology, as noted by RoyChoudhury et al. (2015) and Truszkowski and Goldman (2016), this result is not sufficient to conclude that ML estimation is consistent. To see why, consider the typical definition of consistency of the maximum likelihood estimator (MLE), which states that if |$\hat{T}$| is the MLE then $$\begin{equation}\label{eq:cons.def} \hat{T} \overset{P}{\rightarrow} T_0 \end{equation}$$(6) under a metric |$D(\cdot,\cdot)$|, where |$\overset{P}{\rightarrow}$| denotes convergence in probability. In order to guarantee that (6) holds, we either need to show that for any |$\epsilon > 0$| there exists some constant |$C_{\epsilon} > 0$| such that $$\begin{equation}\label{unif bound} \sup_{T_j:D(T_j,T_0) > \epsilon}\left \{L(T_0) - L(T_j)\right \} \ge C_{\epsilon}, \end{equation}$$(7) so that we are assured there cannot be trees of arbitrarily high likelihood far away from the true tree, or that the parameter space is compact.
Under any reasonable metric, it is easy to see that the parameter space is not compact because it does not include trees with branches of length 0, as noted above. Truszkowski and Goldman (2016) provide a corrected proof by defining the following metric and showing that (7) holds for this metric (see Lemma 3 of Truszkowski and Goldman (2016)). Definition 0.3 (Distance between two trees, Truszkowski and Goldman (2016)) For two taxa |$a$| and |$b$| in tree |$S$|, define their distance, |$d_S(a,b)$|, to be the sum of the lengths of all edges on the path from |$a$| to |$b$|. Further, define the distance between two trees |$S_1$| and |$S_2$| to be |$D(S_1,S_2) = \max_{a,b \in L}|d_{S_1}(a,b) - d_{S_2}(a,b)|$|. Note that |$D(\cdot ,\cdot )$| is a metric as long as all branch lengths are positive. We now state and prove a modified version of the gene tree consistency result of Truszkowski and Goldman (2016) for the case of species trees estimated from a sample of CIS obtained under the MSC. Theorem 0.4 (Consistency of the ML estimator of the species tree for CIS data) Let |$S^*_N$| denote the MLE of species tree |$S$| for a sample of |$N$| CIS obtained under the multispecies coalescent. Then |$D(S^*_N, S) \rightarrow 0$| with probability 1 as |$N \rightarrow \infty$|. Our proof follows the general outline given by Truszkowski and Goldman (2016) for the gene tree case. Two crucial steps in their proof must be verified for the species tree case. First, the species tree must be identifiable, which has been established by Chifman and Kubatko (2015) for species trees that satisfy the molecular clock and by Long and Kubatko (2019) for nonclock species trees and for trees in which the effective population sizes vary throughout the tree. Second, a particular function of the pairwise distribution of states at the tips must satisfy a concavity condition. We state and verify this condition in the following proof.
Proof Because the site pattern counts for a random sample of |$N$| CIS follow a multinomial distribution with probabilities given in Equation (1) above, the likelihood function for the ML estimate of the species tree is similar in form to that in the case of a gene tree. Thus Proposition 0.2 and most steps in the consistency proof given by Truszkowski and Goldman (2016) can be verified in a straightforward manner. The only nontrivial condition to be verified in the species tree case is that Lemma 3 of Truszkowski and Goldman (2016) still holds for the particular site pattern probabilities that arise in the species tree setting. This lemma involves some conditions on the pairwise site pattern probabilities, which we define below. Following the notation of Truszkowski and Goldman (2016), let |$f_{xy}^{ab}$| denote the frequency with which taxon |$a$| is observed to have state |$x$| and taxon |$b$| is observed to have state |$y$|, where |$x, y \in \{A,C,G,T\}$|. Let |$p^{d^{'}}_{xy}$| denote the probability that taxon |$a$| has state |$x$| and taxon |$b$| has state |$y$|, where |$x, y \in \{A,C,G,T\}$|, when |$d_S(a,b)=d^{'}$|. To verify Lemma 3 of Truszkowski and Goldman (2016), it is sufficient to verify that the function $$\begin{equation}\label{eq:concave} \sum_{x, y \in \{A, C, G, T\}} f_{xy}^{ab} \log(p_{xy}^{d^{'}}) \end{equation}$$(8) is concave in |$d^{'}$|. Under the JC69 model (Jukes and Cantor 1969) and the MSC model, Chifman and Kubatko (2015) (see their Supplement A) gave explicit formulas for the site pattern probabilities on four-taxon trees. Using these with |$\mu = 4/3$| as specified by the JC69 model and |$\theta$|, the effective population size parameter, set to 0.01, we can sum over pairs of taxa to find $$\begin{equation}\label{eq:pairwise.site.pattern.probs} p^{d^{'}}_{xy} = \begin{cases} \frac{1}{4} + \frac{225}{304}e^{-4d^{'}/3}, & x = y \\ \frac{3}{4} - \frac{225}{304}e^{-4d^{'}/3}, & x \neq y \end{cases}.
\end{equation}$$(9) Ruskino (2018, personal communication) has derived more general expressions as a function of the |$\theta$| parameter that follow the same general form. Using these expressions, it is straightforward to verify that the expression in Equation (8) is concave in |$d^{'}$| and thus that Lemma 3 of Truszkowski and Goldman (2016) holds. This establishes Theorem 0.4, and thus the ML estimate of the species tree for four taxa is statistically consistent for CIS data. □
We next consider consistency for SVDQuartets.
Consistency and Error Rate for SVDQuartets
Recall that for the SVDQuartets method, we choose the tree with split |$\textrm{argmin}_{L_1|L_2}SVD(L_1|L_2)$| as our estimate of the species tree. In this section, we prove that this estimator is consistent for multilocus data in the following sense and give its rate of convergence.
Theorem 0.5 (Consistency of SVDQuartets) Suppose that the conditions of the model proposed by Chifman and Kubatko (2015) are satisfied, and |$L_1|L_2^*$| is the true valid split among splits with |$|L_1| = |L_2| = 2$|. Fix |$\epsilon > 0$|. Assume |$\lim_{N\rightarrow\infty} \max_{i =1 \ldots N}\{n_i\} = K < \infty$|, and that all of the entries of the vector |$\mathbf{p}$| are strictly between |$0$| and |$1$|. Then |$\exists N_{\epsilon}$| such that |$\forall N \geq N_{\epsilon}$|, |$\mathbb{P}(\textrm{argmin}_{L_1|L_2} SVD(L_1|L_2) \neq L_1|L_2^*) < \epsilon $|.
We give the details of the proof of Theorem 0.5 in the remainder of this section. The result follows from consistency of the |$\hat{p}_{ij}$| in the flattening matrix and the fact that singular values of a matrix satisfy a Lipschitz condition with respect to perturbations of the matrix (see Golub and VanLoan 2013).
The assumption that all of the entries of the vector |$\mathbf{p}$| are strictly between |$0$| and |$1$| may seem arbitrary, but if this is not true, then we are considering the problem of estimating site pattern probabilities for sites that either always or never occur, and such cases are neither realistic nor interesting. Lemma 0.6 (Corollary 8.6.2 of Golub and VanLoan, 2013) Let |$A,E \in \mathbb{R}^{m \times n}$| with |$m \geq n$|, and let |$\sigma_i$|, |$i \in \{1, \ldots n\}$|, denote the singular values in descending order. Then for |$i \in \{1, \ldots n\}$|, |$|\sigma_i(A+E) - \sigma_i(A)| \leq ||E||_2 = \sigma_1(E)$|. We first establish that the |$p_{ij}$| are consistently estimated (for two reasonable uses of multilocus data) and give their asymptotic error. The estimator |$\hat{\mathbf{p}}_2$| is currently used by SVDQuartets as implemented in PAUP* (Swofford 2019). Lemma 0.7 Suppose data are generated as in Definition 0.1 with |$N$| and |$n_i$|, |$i = 1 \ldots N$|, such that |$\lim_{N\rightarrow\infty} \max_{i =1 \ldots N}\{n_i\} = K < \infty$|, and that all of the entries of the vector |$\mathbf{p}$| are strictly between |$0$| and |$1$|. Consider the estimators $$\hat{\mathbf{p}}_1 = \frac{1}{N}\sum_{i=1}^{N}\frac{\mathbf{X}_i}{n_i} \; \; \; \; \; \; \; \; and \; \; \; \; \; \; \; \; \; \hat{\mathbf{p}}_2 = \frac{1}{\sum_{i=1}^{N}n_i}\sum_{i=1}^{N}\mathbf{X}_i.$$ The following hold: Let |$\hat{p}_1^j$| be the |$j\textrm{th}$| entry of |$\hat{\mathbf{p}}_1$|, and let |$p^j$| be the |$j\textrm{th}$| entry of |$\mathbf{p}$| in Definition 0.1. Then for any |$\epsilon > 0$|, $$\mathbb{P}\left({|\hat{p}_1^j - p^j| > \epsilon}\right) \leq 2\exp(-2N \epsilon^2).$$ Let |$\hat{p}_2^j$| be the |$j\textrm{th}$| entry of |$\hat{\mathbf{p}}_2$|. Then for the |$K$| defined in Theorem 0.5 and |$\epsilon > 0$|, $$\mathbb{P}\left({|\hat{p}_2^j - p^j| > \epsilon}\right) \leq 2\exp(-2N (\frac{\epsilon}{K})^2).$$ Proof In both cases, we will apply Hoeffding’s inequality. 
For |$\hat{\mathbf{p}}_1$|, let |$X_{i,j}$| denote the |$j\textrm{th}$| entry of |$\mathbf{X}_i$|, and let |$W_{i,j} = \frac{X_{i,j}}{n_i}$|. Then the |$W_{i,j}$| are bounded between |$0$| and |$1$| and independent with respect to the |$i$| index for any given |$j$|. Thus, we can apply Hoeffding’s inequality to conclude $$\begin{equation}\label{Hof1} \mathbb{P}\left(\!{\biggl|\frac{1}{N}\sum_{i=1}^{N}W_{i,j} - E\bigg(\frac{1}{N}\sum_{i=1}^{N}W_{i,j}\bigg)\biggr| > \epsilon}\!\right) \leq 2\exp(-2N\epsilon^2). \end{equation}$$(10) The first term in the expression above is |$\hat{p}_1^j$|. The second term is equal to |$\frac{1}{N}\sum_{i=1}^{N}E(W_{i,j})$|. Now let |$D_{i,j}$| be the |$j\textrm{th}$| entry of |$\mathbf{D}_i$|. Then $$E(W_{i,j}) = E(E(W_{i,j}|D_{i,j})) = E(p^j + D_{i,j}) = p^j$$ since our generative model assumes that |$E(\mathbf{D}_i) = \mathbf{0}$|. Thus, (10) states that |$\mathbb{P}\left({|\hat{p}_1^j - p^j| > \epsilon}\right) \leq 2\exp(-2N \epsilon^2)$| as desired.
For |$\hat{\mathbf{p}}_2$|, again let |$X_{i,j}$| denote the |$j\textrm{th}$| entry of |$\mathbf{X}_i$|, and now let |$W_{i,j} = \frac{X_{i,j}}{K}$| where |$K = \lim_{N\rightarrow\infty} \max_{i =1 \ldots N}\{n_i\}$|, which we have assumed is finite as stated in Theorem 0.5. Then the |$W_{i,j}$| are bounded between |$0$| and |$1$| and independent with respect to the |$i$| index for any |$j$|. Note that $$\hat{p}_2^j = \biggl(\frac{NK}{\sum_{i=1}^{N}n_i}\biggr)\frac{1}{N}\sum_{i=1}^{N}W_{i,j}$$ and $$\begin{eqnarray*} E\biggl[\!\!\biggl(\frac{NK}{\sum_{i=1}^{N}n_i}\biggr)\frac{1}{N}\sum_{i=1}^{N}W_{i,j}\!\!\biggr] & = & \biggl(\!\frac{NK}{\sum_{i=1}^{N}n_i}\!\biggr)\frac{1}{N}\sum_{i=1}^{N} E(W_{i,j})\\[6pt] & = & \biggl(\frac{NK}{\sum_{i=1}^{N}n_i}\biggr)\frac{1}{N}\sum_{i=1}^{N}\frac{n_i p^j}{K} = p^j, \end{eqnarray*}$$ where the second equality above holds because |$E(W_{i,j}) = E(E(W_{i,j}|D_{i,j})) = E(\frac{n_i}{K}(p^j + D_{i,j})) = \frac{n_i p^j}{K}$|.
Thus, noting that |$\frac{NK}{\sum_{i=1}^{N}n_i} \leq K$|, we have $$\begin{eqnarray*} &&\mathbb{P}\left({|\hat{p}_2^j - p^j| > \epsilon}\right)\\[6pt] &&\quad = \mathbb{P}\left(\biggl|\biggl(\frac{NK}{\sum_{i=1}^{N}n_i}\biggr)\frac{1}{N}\sum_{i=1}^{N}W_{i,j}\right.\\[6pt] &&\qquad\left.- E\biggl[\biggl(\frac{NK}{\sum_{i=1}^{N}n_i}\biggr)\frac{1}{N}\sum_{i=1}^{N}W_{i,j}\biggr] \biggr| > \epsilon\right)\\[6pt] &&\quad = \mathbb{P}\left(\biggl|\frac{1}{N}\sum_{i=1}^{N}W_{i,j} - E\biggl(\frac{1}{N}\sum_{i=1}^{N}W_{i,j}\biggr)\biggr|\right.\\[6pt] &&\qquad \left. > \frac{\epsilon}{NK/\sum_{i=1}^N n_i}\right)\\[6pt] &&\qquad < \mathbb{P}\left({\biggl|\frac{1}{N}\sum_{i=1}^{N}W_{i,j} - E\biggl(\frac{1}{N}\sum_{i=1}^{N}W_{i,j}\biggr)\biggr| > \frac{\epsilon}{K}}\right)\\[6pt] &&\quad{} \leq 2\exp(-2N(\epsilon/K)^2), \end{eqnarray*}$$ where the last inequality is the result of applying Hoeffding’s inequality. It might appear from the above bounds that |$\hat{p}_1$| should be preferred to |$\hat{p}_2$| because the |$K$| term does not appear in the bound for |$\hat{p}_1$|, resulting in a smaller bound in that case. However, the bound in part (2) of the lemma is not tight. Rather, allowing the |$K$| term to appear in the exponent is simply a convenient way of dealing with the heterogeneity arising from the term |$\frac{1}{\sum_{i=1}^{N}n_i}$|. We discuss the relative merits of |$\hat{p}_1$| and |$\hat{p}_2$| in the later section “SVDQuartets for Multilocus Data.” If |$\lim_{N\rightarrow\infty} \max_{i =1 \ldots N}\{n_i\} = K < \infty$| does not hold, |$\hat{\mathbf{p}}_1$| and |$\hat{\mathbf{p}}_2$| are still consistent estimators of |$\mathbf{p}$|, but the deviations may have thinner tails than the bounds given above. Since it seems unrealistic that this assumption would be violated as in practice genes are finite in length, we do not provide a proof for the case where it does not hold. 
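The two estimators of Lemma 0.7 and the form of their tail bounds can be sketched in a few lines. This is an illustration only, not the PAUP* implementation: the function names are hypothetical, and the toy counts (two loci, three site patterns) are invented for the example.

```python
import math


def p_hat_1(counts, lengths):
    """Average of per-locus relative frequencies: (1/N) * sum_i X_i / n_i."""
    n_loci = len(counts)
    n_patterns = len(counts[0])
    return [sum(x[j] / n for x, n in zip(counts, lengths)) / n_loci
            for j in range(n_patterns)]


def p_hat_2(counts, lengths):
    """Pooled relative frequencies: (sum_i X_i) / (sum_i n_i)."""
    total_sites = sum(lengths)
    n_patterns = len(counts[0])
    return [sum(x[j] for x in counts) / total_sites
            for j in range(n_patterns)]


def hoeffding_bound(n_loci, eps, k=1.0):
    """Tail bound 2*exp(-2*N*(eps/K)^2); k=1 gives the bound for p_hat_1."""
    return 2.0 * math.exp(-2.0 * n_loci * (eps / k) ** 2)


# Toy data (hypothetical): pattern counts X_i for two loci of unequal length.
counts = [[2, 1, 1], [6, 0, 2]]
lengths = [4, 8]
est_1 = p_hat_1(counts, lengths)   # weights every locus equally
est_2 = p_hat_2(counts, lengths)   # weights every site equally
```

The two estimators coincide when all loci have the same length and differ otherwise, since the first gives each locus equal weight and the second gives each site equal weight.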
□
Lemma 0.8 For any split |$L_1|L_2$|, |$SVD(L_1|L_2) \overset{p}{\to} \sqrt{\sum_{i=11}^{16} \sigma_i^2}$|, where |$\sigma_i$| are the descending ordered singular values of the |$16\times16$| matrix |$Flat_{L_1|L_2}(P)$|.
Proof. Because |$SVD(L_1|L_2)$| is a continuous function of the vector |$(\sigma_1, \ldots, \sigma_{16})$|, it suffices to show that |$\hat{\sigma}_i \overset{p}{\to} \sigma_i$| uniformly in |$i$|. Fix |$\epsilon, \delta > 0$|. We will show |$\exists N_{\epsilon, \delta}$| such that |$\forall N \geq N_{\epsilon,\delta}$|, |$\mathbb{P}\left({\sup_i|\hat{\sigma}_i - \sigma_i| > \delta}\right) < \epsilon$|. We index the vector of site pattern probabilities |$\mathbf{p}$| as |$\{p_{ij}\}$| to match their locations in the flattening matrix. Note that this is a modification of the notation in Lemma 0.7, which used vectors to denote the site pattern probabilities. Likewise, we index |$\hat{\mathbf{p}}$| as |$\{\hat{p}_{ij}\}$|. Define |$e_{ij} := \hat{p}_{ij} - p_{ij}$|, and observe that Lemma 0.7 implies that for any |$i,j$|, |$\mathbb{P}\left({|e_{ij}| > \delta}\right) \leq 2\exp(-2N(\frac{\delta}{K})^2)$|. Now choose |$N_{\epsilon,\delta}$| large enough so that when |$N \geq N_{\epsilon, \delta}$|, |$\mathbb{P}\left({|e_{ij}| > \frac{\delta}{64}}\right) \leq 2\exp(-2N(\frac{\delta}{64K})^2)< \frac{\epsilon}{256}$|. Using a union bound over the 256 entries of the flattening matrix, we have $$\begin{eqnarray*} \mathbb{P}\left({\max_{i,j}|e_{ij}| > \frac{\delta}{64}}\right) & \leq & \sum_{i,j} \mathbb{P}\left({|e_{ij}| > \frac{\delta}{64}}\right) \\ & < & \sum_{i,j}\frac{\epsilon}{256} \\ & = & \epsilon. \end{eqnarray*}$$ Now choose |$E$| in Lemma 0.6 to be |$E = \{e_{ij}\}$|. Then |$\sup_i|\hat{\sigma}_i - \sigma_i| \leq ||E||_2$|. It is well-known that for any matrix |$E \in \mathbb{R}^{k \times k}$|, |$||E||_2 \leq \sqrt{k}||E||_1 \leq k\sqrt{k}\max_{i,j}|e_{ij}|$|.
Applying this fact with |$k=16$|, we have $$\begin{eqnarray*} P(\sup_i|\hat{\sigma_i} - \sigma_i|<\delta) & > & P(16(4) |\textrm{max}(e_{ij})|<\delta) \\ & = & 1-P\biggl(|\textrm{max}(e_{ij})|>\frac{\delta}{64}\biggr) \\ & >&1-\epsilon \end{eqnarray*}$$ which gives |$\mathbb{P}\left({\sup_i|\hat{\sigma_i} - \sigma_i| > \delta}\right) < \epsilon$|, as desired. □ We can now prove Theorem 0.5: Proof. Theorem 1 of Chifman and Kubatko, 2015 implies that |$\sqrt{\sum_{i=11}^{16} \sigma_i^2} = 0$| if and only if we choose the split |$L_1|L_2^*$|. Then because we have finitely many (3) splits to choose from, we can find some |$c > 0$| such that |$\sqrt{\sum_{i=11}^{16} \sigma_i^2} > c$| for any split |$L_1|L_2 \neq L_1|L_2^*$|. Fix |$\epsilon > 0$|. Choose |$\epsilon^* = \frac{\epsilon}{3}$| and |$\delta = \frac{c}{2}$|. Then for the |$N_{\epsilon^*,\delta}$| that satisfies Lemma 0.8 using |$\epsilon^*$| and |$\delta$|, for |$N \geq N_{\epsilon^*,\delta}$|, we will have |$\mathbb{P}\left({SVD(L_1|L_2^*) > c/2}\right) < \epsilon/3$| and |$\mathbb{P}\left({SVD(L_1|L_2) < c/2}\right) < \epsilon/3$| for every |$L_1|L_2 \neq L_1|L_2^*$|. Then, using the union bound, $$\begin{eqnarray*} & & \mathbb{P}\left({\textrm{argmin}_{L_1|L_2}SVD(L_1|L_2) \neq L_1|L_2^*}\right) \\[4pt] & \leq & \mathbb{P}\left({SVD(L_1|L_2^*) > c/2}\right) + \sum_{L_1|L_2 \neq L_1|L_2^*} \mathbb{P}\left({SVD(L_1|L_2) < c/2}\right) \\ & < & \epsilon \end{eqnarray*}$$ which completes the proof and establishes that SVDQuartets is a statistically consistent method for species tree estimation under the MSC. □ We emphasize that this result proves consistency of SVDQuartets for both CIS data and for multilocus data using either of the estimators in Lemma 0.7 above. In both of these cases, the above result also gives a bound on the error rate, as described below. Corollary 0.9. 
When estimating the split with SVDQuartets using a sample of |$n_i$|, |$i = 1 \ldots N$|, sites from each of |$N$| loci, there exists a constant |$\sigma^* > 0$| such that for large |$N$| the probability of choosing an incorrect split is bounded by $$\begin{equation} \mathbb{P}\left({\textrm{Error}_{SVD}}\right) \leq (2)(256)\exp(-2N (\frac{\sigma^*}{128K})^2). \end{equation}$$(11)
Proof. Let |$\sigma^*$| be the smallest in absolute value of the |$11\textrm{th}$|–|$16\textrm{th}$| nonzero singular values among all possible splits |$L_1|L_2$|. Note that as a consequence of Lemma 0.6, for any split and each |$\hat{\sigma}_i$|, $$\begin{equation}\label{eq:sigma_bound} |\hat{\sigma}_i - \sigma_i| \leq 64\max{|e_{ij}|}. \end{equation}$$(12) Let |$\sigma_i^F$| denote the |$i\textrm{th}$| singular value for an incorrect split for any |$i = 11, \ldots, 16$|, and let |$\hat{\sigma}_i^F$| denote the corresponding observed value. Applying (12) and assuming that |$64\max{|e_{ij}|} < |\sigma^*/2|$| gives $$\begin{eqnarray} |\hat{\sigma}_i^F |& \geq & |\sigma_i^F| - 64\max{|e_{ij}|} \notag\\[4pt] & \geq & |\sigma^*| - 64\max{|e_{ij}|} \notag\\[4pt] & \geq & 2(64\max{|e_{ij}|} ) - 64\max{|e_{ij}|} = 64\max{|e_{ij}|} \label{eq:abs_bound} \end{eqnarray}$$(13) for each |$i = 11, \ldots, 16$|. Applying (12) again to singular values from the true split, denoted |$\hat{\sigma}_i^T$|, gives |$|\hat{\sigma}_i^T-0| \leq 64\max{|e_{ij}|}$|, and for any incorrect split |$L_1|L_2$| we have $$\begin{eqnarray} SVD(L_1|L_2) & = & \sqrt{\sum_{i=11}^{16}(\hat{\sigma}_i^F)^2} \notag\\ & \geq & \sqrt{\sum_{i=11}^{16}(64 \max{|e_{ij}|})^2} \label{eq:intermed}\\ & \geq & \sqrt{\sum_{i=11}^{16} (\hat{\sigma}_i^T)^2} \notag\\ & = & SVD(L_1|L_2^*), \notag \end{eqnarray}$$(14) where (14) follows from (13). This establishes that the correct split will be selected by SVDQuartets whenever |$64\max{|e_{ij}|} < |\sigma^*/2|$|.
The probability that SVDQuartets makes an error in selecting the split can thus be bounded by $$\begin{align*} &\mathbb{P}\left({\textrm{Error}_{SVD}}\right) \leq \mathbb{P}\left({64\max{|e_{ij}|} > \sigma^*/2}\right)\\ &\quad{} \leq \sum_{i=1}^{16}\sum_{j = 1}^{16} \mathbb{P}\left({|e_{ij}| > \sigma^*/128}\right)\!. \end{align*}$$ Recall from Lemmas 0.7 and 0.8 that |$\mathbb{P}\left({|e_{ij}| > \epsilon}\right) \leq 2\exp(-2N (\frac{\epsilon}{K})^2) $|, so $$\begin{align*} & \mathbb{P}\left({\textrm{Error}_{SVD}}\right) \leq \sum_{i=1}^{16}\sum_{j = 1}^{16} \mathbb{P}\left({|e_{ij}| > \sigma^*/128}\right)\\ &\quad{} \leq (2)(256)\exp\left(-2N\Big(\frac{\sigma^*}{128K}\Big)^2\right). \end{align*}$$ □
Note that our proof of consistency and error bound derivation depend on the structure of a four-taxon species tree only insofar as Theorem 1 of Chifman and Kubatko (2015) has only been proven for trees of four taxa and our choice of constants. Should that result be extended to trees with a larger number of taxa, our arguments above imply that the estimator based on the SVD score in such cases would also be consistent for multilocus data and would have an error rate bound of |$O(\exp(-2N (\frac{|\sigma^*|}{128K})^2))$|.
Comparison of Asymptotic Properties of ML and SVDQuartets
Theoretical Comparison
Shi and Yang (2018) conjecture that SVDQuartets is inefficient compared to ML when both are applied to multilocus data, as measured by the probability of recovering the correct species tree. Note that this is a different notion of efficiency than that which is applied in typical statistical settings, raising the question of whether classical statistical results concerning asymptotic efficiency of ML estimators (see, e.g., Lehmann and Casella 1998) apply in this case. As mentioned earlier in our discussion of consistency, it is not clear whether ML for the species tree estimation problem satisfies the general conditions of Wald (1949) for consistency.
Nonetheless, we have been able to show that both ML for CIS data and SVDQuartets for multilocus and CIS data give statistically consistent estimators of the species tree. We next try to summarize what is known about the asymptotic error probabilities of the methods, as a way of addressing the claim made by Shi and Yang (2018) about the relative efficiency of the two methods. To our knowledge, error rate bounds for ML when applied to multilocus data have been rigorously derived in only a few special cases. Xu and Yang (2016) showed that in the case of a three-taxon species tree, the probability of choosing the wrong topology when using ML for data consisting of rooted gene trees is approximately $$\mathbb{P}\left({\textrm{Error}_{ML}}\right) \approx C_1 \Phi(-C_2\sqrt{N})$$ for explicit constants |$C_1$| and |$C_2$| that depend on the probabilities of the three possible gene trees that can arise within the species tree (which in turn can be computed from the other parameters). It is important to note, however, that this result is an approximation rather than a bound. It does not account for the rate at which |$\mathbb{P}\left({\textrm{Error}_{ML}}\right)$|, which is not exactly normal, converges to a normal distribution, and this rate could potentially be slower than the decay of the normal tail given by the approximating expression |$C_1 \Phi(-C_2\sqrt{N})$|. An equivalent result for four-taxon species trees has not been derived. Another partial result about the error rate of ML estimation comes from the following idea. Suppose that rather than sample |$n_i$| sites from each of |$N$| loci, we are able to sample gene trees directly, so that we in fact know the topology and branch lengths of each of the |$N$| sampled gene trees, |$G_1 \ldots G_N$|.
Letting |$\mathbf{p}_l$| denote the observed site patterns (i.e., the alignment) for gene |$l$|, we note that in this case, $$\begin{eqnarray} \mathcal{L} \biggl( \!\!(S, \mathbf{\tau})|(G_1,p_1), \ldots (G_N,p_N) \biggr) & = & \prod_{\ell = 1}^{N}f(G_\ell|(S, \mathbf{\tau}))\notag \\ & = & \mathcal{L} \biggl(\! (S, \mathbf{\tau})|G_1, \ldots G_N \!\biggr),\nonumber\\ \end{eqnarray}$$(15) where |$f$| is the gene tree density under the MSC. In words, if we observe the gene trees directly, the alignments give no additional information and the likelihood of interest is the species tree likelihood based on the sampled gene trees. Furthermore, since this sampling scheme uses strictly more information than sampling only finite-length alignments, it seems reasonable to assume that its estimation power should be at least as high, (i.e., its error rate no worse than that of ML based on multilocus sampling) although this also requires a proof to be made fully rigorous. Results about error rates for trees of any size have been derived for the problem of estimating the species tree topology using gene trees directly. Liu et al. (2010) showed that for their maximum tree method, the probability of choosing the wrong topology is bounded by an expression of the form $$\mathbb{P}\left({Error_{MT}}\right) \leq C_1 \exp(-C_2 N),$$ and that if all populations have the same size, the maximum tree estimator is also the ML estimator. This result is a rigorous upper bound rather than an approximation. We note that it is comparable to our result for SVDQuartets insofar as both bounds take the form |$C_1 \exp(-C_2 N)$|, albeit likely for different values of |$C_1$| and |$C_2$|. A rigorous comparison of the performance of ML and SVDQuartets is inconclusive, in large part because not enough is known about the performance of ML. 
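To make the form |$C_1 \exp(-C_2 N)$| concrete, the SVDQuartets bound in Equation (11) can be evaluated and inverted to give the number of loci that guarantees a target error probability. The sketch below is illustrative only: the function names are hypothetical, and the values of |$\sigma^*$| and |$K$| used in the example are arbitrary placeholders rather than quantities estimated from any data set.

```python
import math


def svdq_error_bound(n_loci, sigma_star, k):
    """Right-hand side of Equation (11): 512 * exp(-2 N (sigma*/(128 K))^2)."""
    return 512.0 * math.exp(-2.0 * n_loci * (sigma_star / (128.0 * k)) ** 2)


def loci_needed(eps, sigma_star, k):
    """Smallest N for which the bound in Equation (11) is at most eps.

    Solving 512 * exp(-2 N (sigma*/(128 K))^2) <= eps for N gives
    N >= (128 K / sigma*)^2 * log(512 / eps) / 2.
    """
    return math.ceil(0.5 * (128.0 * k / sigma_star) ** 2
                     * math.log(512.0 / eps))


# Placeholder inputs (assumptions, not values from the study).
n_required = loci_needed(eps=0.05, sigma_star=0.5, k=1.0)
bound_at_n = svdq_error_bound(n_required, sigma_star=0.5, k=1.0)
```

Since the result is an upper bound on the error probability, the value returned by `loci_needed` is a sufficient sample size under the bound, not an estimate of the number of loci at which the method starts to perform well in practice.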
Since the result of Xu and Yang (2016) comes from multinomial probabilities, it is likely that applying Hoeffding’s inequality in that case would also yield a bound of the form |$C_1 \exp(-C_2 N)$| in addition to the approximation given in their work, although we have not rigorously verified this. One might additionally conjecture that such a result holds for species trees with arbitrary numbers of taxa, rather than just the three-taxon species tree. If this is true, then we could say that ML and SVDQuartets both have error rate bounds of the form |$C_1 \exp(-C_2 N)$|, where the constants |$C_1$| and |$C_2$| likely differ between the methods, but we cannot compare beyond this statement. We hope that scholars interested in comparing the performance of ML and SVDQuartets will derive more complete rigorous results that will allow for a more comprehensive theoretical comparison.
Figure 1 Results of the simulation study for the symmetric species tree for CIS data. The x-axis shows the number of CIS, and the y-axis shows the proportion of correctly estimated unrooted species trees for each method. a) |$\theta = 0.001$|, all branch lengths equal to the value given in the legend; b) |$\theta = 0.001$|, varying branch lengths; c) |$\theta = 0.005$|, all branch lengths equal to the value given in the legend; d) |$\theta = 0.005$|, varying branch lengths; e) |$\theta = 0.01$|, all branch lengths equal to the value given in the legend; f) |$\theta = 0.01$|, varying branch lengths. For b), d), and f), setting 1 refers to tree ((Species1:1.0,Species2:1.0):0.5,(Species3:1.0,Species4:1.0):0.5); setting 2 refers to tree ((Species1:0.5,Species2:0.5):1.0,(Species3:0.5,Species4:0.5):1.0); and setting 3 refers to tree ((Species1:1.0,Species2:1.0):0.5,(Species3:0.5,Species4:0.5):1.0).
Figure 2 Results of the simulation study for the asymmetric species tree for CIS data. The x-axis shows the number of CIS, and the y-axis shows the proportion of correctly estimated unrooted species trees for each method. a) |$\theta = 0.001$|, all branch lengths equal to the value given in the legend; b) |$\theta = 0.001$|, varying branch lengths; c) |$\theta = 0.005$|, all branch lengths equal to the value given in the legend; d) |$\theta = 0.005$|, varying branch lengths; e) |$\theta = 0.01$|, all branch lengths equal to the value given in the legend; f) |$\theta = 0.01$|, varying branch lengths. For b), d), and f), setting 1 refers to tree (Species4:2.5,(Species3:1.5,(Species2:0.5,Species1:0.5):1.0):1.0); setting 2 refers to tree (Species4:2.0,(Species3:1.0,(Species2:0.5,Species1:0.5):0.5):1.0); and setting 3 refers to tree (Species4:2.5,(Species3:2.0,(Species2:1.0,Species1:1.0):1.0):0.5).
Comparison via Simulation
We conducted several simulation studies to compare the performance of ML and SVDQuartets. In the first simulation study, CIS data were simulated along the four-taxon symmetric and asymmetric species trees by first simulating gene trees using the package COAL (Degnan and Salter 2005) and then simulating sequence data under the JC69 model (Jukes and Cantor 1969) using Seq-Gen (Rambaut and Grassly 1997). The JC69 model was used because Chifman and Kubatko (2015) provided explicit formulas for the site pattern probabilities for four-taxon trees under the coalescent for this model, allowing us to implement the ML method in this case. We considered three species trees with all internal branch lengths and all external branch lengths leading to cherries set to the same value, either 0.5, 1.0, or 2.0 in coalescent units. We also considered three species trees with varying branch lengths. In these three cases, all branch lengths were either 0.5 or 1.0, and the placement of the shorter branches was varied between internal and external branches.
The precise trees used are given in the captions to Figures 1 and 2, which show the simulation results. For all of the model trees, we set the effective population size parameter |$\theta = 4N\mu$| to 0.001, 0.005, or 0.01 for all branches. In addition to examining the performance of SVDQuartets for CIS data, we examined its performance on SNP data by rerunning it on each simulated data set after removing all of the constant sites. Since SNP data are more commonly collected, this will provide an indication of how much information is lost in moving from CIS to SNP data when using SVDQuartets for inference. For each of the three methods (SVDQuartets for CIS data, SVDQuartets for SNP data, and ML), we examined the proportion of times out of the 500 replicates that each of the methods correctly estimated the unrooted species tree when the total number of sites sampled ranged from 1000 to 10 000. In some cases, particularly those in which the overall mutation rate is low, as often results from both small effective population size and short branches, the ML algorithm will not converge and/or singular values cannot be computed accurately enough to infer the tree with SVDQuartets. When this occurred, we discarded that replicate from the summary of that method’s performance. If a particular simulation setting had fewer than 100 replicates in which estimation was completed without error, we did not include the result for that setting in the relevant figure. Our second simulation study considers multilocus data. We applied SVDQuartets as implemented in PAUP*, which ignores information about loci and treats the data as CIS data (this is the common and recommended practice for SVDQuartets at present). Because the multilocus likelihood is not computationally tractable, we approximated ML inference by running the BPP software (Yang and Rannala 2014; Yang 2015; Rannala and Yang 2017; Flouris et al. 
2018) with the prior for |$\tau$| set to IG(3,0.015) and the prior for |$\theta$| set to IG(3,0.01). We hereafter refer to these as the default priors. We discarded the first 400 samples as burn-in, and recorded every other sampled tree for a total of 1500 samples. For four-taxon trees, the species tree with the highest posterior probability will generally be equivalent to the ML tree. We consider the same model trees as in the first simulation study and the same choices of |$\theta$|. We used 5, 10, 15, 20, 25, 35, or 50 loci, with 200 bp per locus, and replicated each simulation condition 500 times. Replicates in which SVDQuartets failed to return an estimate of the species tree due to numerical imprecision of the singular value computation were discarded, as described above. For the third simulation study, we considered more difficult species trees, namely those found in the anomaly zone. In particular, we considered the tree found in Xu and Yang (2016), which is given by (Species1:0.48,(Species2:0.44,(Species3:0.4,Species4:0.4):0.04):0.04) when branch lengths are reported in coalescent units. For the analysis in BPP, we considered both default priors, as we did in our second simulation study, and priors suggested by the simulation study in Xu and Yang (2016). Because the current version of BPP uses the inverse gamma (IG) distribution, rather than the gamma distribution used by Xu and Yang (2016), we use inverse gamma priors for |$\theta$| and |$\tau$| with |$\alpha=3$| and mean set to the true value. Another difference between our simulation conditions and those of Xu and Yang (2016) is the number of sites per locus. Thus, we include one set of simulations with 200 bp per locus (as above) and another with 1000 bp per locus as in Xu and Yang (2016).
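The relationship between the inverse gamma parameters and the prior mean can be checked with a short calculation. The sketch below assumes the shape-scale parameterization IG(|$\alpha$|, |$\beta$|), whose mean is |$\beta/(\alpha-1)$| for |$\alpha > 1$|; the function names are hypothetical and the code is not part of BPP.

```python
def inverse_gamma_mean(alpha, beta):
    """Mean of IG(alpha, beta) in the shape-scale form: beta / (alpha - 1)."""
    if alpha <= 1:
        raise ValueError("the mean is undefined for alpha <= 1")
    return beta / (alpha - 1.0)


def beta_for_mean(alpha, mean):
    """Scale parameter that gives IG(alpha, beta) the requested mean."""
    if alpha <= 1:
        raise ValueError("the mean is undefined for alpha <= 1")
    return mean * (alpha - 1.0)


# Default priors quoted above: tau ~ IG(3, 0.015) and theta ~ IG(3, 0.01).
default_tau_mean = inverse_gamma_mean(3, 0.015)
default_theta_mean = inverse_gamma_mean(3, 0.01)
```

With |$\alpha = 3$| the scale parameter is simply twice the desired mean, so centering a prior on a chosen value only requires setting |$\beta$| to `beta_for_mean(3, value)`.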
Our preliminary simulations with these settings indicated that many more sites were needed for accurate inference for both ML and SVDQuartets, and thus for the CIS simulations, we considered the number of sites ranging from 1 000 to 1 000 000. For the multilocus simulations, we considered 50, 100, 150, 200, 300, and 400 loci. Because of the additional computational cost associated with the large number of sites, we used only 100 replicates of each simulation condition. Finally, we considered |$\theta$| values that differed somewhat from those used above, in order to reproduce the results of Xu and Yang (2016) reported in their Figure 7. In particular, we considered |$\theta = 0.05$| (the value used by Xu and Yang (2016)) and |$\theta = 0.01$|. As mentioned above, we considered both informative and default priors for BPP. For the informative priors, we assumed |$\tau$| is IG(3,0.024), and |$\theta$| is IG(3,0.1) when the true value of |$\theta = 0.05$| and |$\theta$| is IG(3,0.02) when the true value of |$ \theta=0.01$|.
Figure 1 shows the results of the first simulation for the symmetric species tree, and Figure 2 shows the results for the asymmetric species tree. In general, both methods are able to accurately infer the unrooted four-taxon species tree with sufficient data. When the model species tree is symmetric (Fig. 1), both methods are very accurate when the branch lengths within the model species tree are equal, though shorter branch lengths and lower values of |$\theta$| (settings which correspond to lower overall mutation rates) are more difficult for both methods. When branch lengths vary within the tree for the symmetric case, the first varying lengths setting, which corresponds to short internal branch lengths, was most difficult for both methods, though ML in general showed higher error than SVDQuartets for all three choices of |$\theta$|. Results for the asymmetric model species tree (Fig.
2) were likewise similar for both methods, with shorter internal branch lengths corresponding to lower accuracy for both methods. An important observation is that SVDQuartets does not decrease in accuracy when applied to SNP data as compared to CIS data. This can be explained by the observation that constant site patterns do not play a role in the reduced rank result of Chifman and Kubatko (2015) that is the basis for the SVDQuartets method. Figure 3 shows the results of the second simulation for the symmetric species tree, and Figure 4 shows the results for the asymmetric species tree. In most cases, the accuracy of SVDQuartets is lower than that of BPP, which is not surprising given that BPP is designed explicitly for multilocus data and SVDQuartets is designed for CIS data. It is clear that as the number of loci increases and the branches become longer, BPP accurately infers the true four-taxon species tree, while the performance of SVDQuartets lags behind. This suggests that the SDVDQuartets method may be most useful for genome-scale multilocus data, a setting in which the asymptotic consistency result suggests good performance and in which Bayesian methods become computationally expensive, while Bayesian methods such as BPP may be more appropriate when a more limited number of loci are available. We further compare the two frameworks (i.e., a likelihood-based framework such as BPP and the SVDQuartets methods) in the Discussion section. Figure 3 Open in new tabDownload slide Results of the simulation study for the symmetric species tree for multilocus data. The x-axis shows the number of genes, and the y-axis shows the proportion of correctly estimated unrooted species trees for each method. 
a) |$\theta = 0.001$|, all branch lengths equal to value given in the legend; b) |$\theta = 0.001$|, varying branch lengths; c) |$\theta = 0.005$|, all branch lengths equal to the value given in the legend; d) |$\theta = 0.005$|, varying branch lengths; e) |$\theta = 0.01$|, all branch lengths equal to the value given in the legend; f) |$\theta = 0.01$|, varying branch lengths. For b), d), and f), setting 1 refers to tree ((Species1:1.0,Species2:1.0):0.5,(Species3:1.0,Species4:1.0):0.5); setting 2 refers to tree ((Species1:0.5,Species2:0.5):1.0,(Species3:0.5,Species4:0.5):1.0); and setting 3 refers to tree ((Species1:1.0,Species2:1.0):0.5,(Species3:0.5,Species4:0.5):1.0). Figure 3 Open in new tabDownload slide Results of the simulation study for the symmetric species tree for multilocus data. The x-axis shows the number of genes, and the y-axis shows the proportion of correctly estimated unrooted species trees for each method. a) |$\theta = 0.001$|, all branch lengths equal to value given in the legend; b) |$\theta = 0.001$|, varying branch lengths; c) |$\theta = 0.005$|, all branch lengths equal to the value given in the legend; d) |$\theta = 0.005$|, varying branch lengths; e) |$\theta = 0.01$|, all branch lengths equal to the value given in the legend; f) |$\theta = 0.01$|, varying branch lengths. For b), d), and f), setting 1 refers to tree ((Species1:1.0,Species2:1.0):0.5,(Species3:1.0,Species4:1.0):0.5); setting 2 refers to tree ((Species1:0.5,Species2:0.5):1.0,(Species3:0.5,Species4:0.5):1.0); and setting 3 refers to tree ((Species1:1.0,Species2:1.0):0.5,(Species3:0.5,Species4:0.5):1.0). Figure 4 Open in new tabDownload slide Results of the simulation study for the asymmetric species tree for multilocus data. The x-axis shows the number of genes, and the y-axis shows the proportion of correctly estimated unrooted species trees for each method. 
a) |$\theta = 0.001$|, all branch lengths equal to the value given in the legend; b) |$\theta = 0.001$|, varying branch lengths; c) |$\theta = 0.005$|, all branch lengths equal to the value given in the legend; d) |$\theta = 0.005$|, varying branch lengths; e) |$\theta = 0.01$|, all branch lengths equal to the value given in the legend; f) |$\theta = 0.01$|, varying branch lengths. For b), d), and f), setting 1 refers to tree ((Species1:1.0,Species2:1.0):0.5,(Species3:1.0,Species4:1.0):0.5); setting 2 refers to tree ((Species1:0.5,Species2:0.5):1.0,(Species3:0.5,Species4:0.5):1.0); and setting 3 refers to tree ((Species1:1.0,Species2:1.0):0.5,(Species3:0.5,Species4:0.5):1.0).

The results of the third simulation study are shown in Figure 5 for both CIS data (first row) and multilocus data (second row). As expected, for species trees for which there are anomalous gene trees, estimation of the correct species tree is more difficult for both methods, and more data are required to achieve reasonable accuracy.
In the case of CIS data, SVDQuartets performs well, with accuracy near 100% once a sufficient amount of data is available, in this case more than 500 000 sites. It is again worth noting that the accuracy of the method applied to SNP data is nearly identical to the CIS case, suggesting that SVDQuartets will be a very effective method when genome-scale SNP data are available. The performance of ML lags behind, likely due to the rather low number of informative sites available when species tree branch lengths are extremely short. For example, when branch lengths are short and |$\theta$| is not very large, most of the site patterns generated will be constant (i.e., invariable) site patterns. These site patterns are not informative for either SVDQuartets or ML. The next most frequently occurring site patterns will be those with a different nucleotide in only one species. Such site patterns provide no information about topology for ML, since they do not provide information about which two taxa are most closely related. However, these site patterns are informative for SVDQuartets, because the reduced-rank result on which the method is based uses the relationship that site patterns |$xxxy$| and |$xxyx$| should occur with equal frequency. Thus, it is reasonable that SVDQuartets performs better than ML for CIS and SNP data in low-information settings such as this.

Figure 5. Results of the third simulation study, which considers the anomalous species tree of Xu and Yang (2016) given by (Species1:0.48,(Species2:0.44, (Species3:0.4,Species4:0.4):0.04):0.04). In each plot, the x-axis shows the amount of data, and the y-axis shows the proportion of correctly estimated unrooted species trees for each method. The first row shows the results for CIS data for a) |$\theta=0.01$| and b) |$\theta = 0.05$|. The second row shows the results for multilocus data for c) |$\theta=0.01$| and d) |$\theta = 0.05$|.
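The distinction drawn above between constant site patterns, patterns in which a single species differs, and all remaining patterns can be made concrete with a short sketch. The Python below is an illustration only and is not part of SVDQuartets, PAUP*, or BPP; the function names `classify_pattern` and `tally_patterns` and the toy alignment are hypothetical, introduced for this example. It tallies the pattern classes of a four-taxon alignment and shows that discarding constant sites, as in SNP data, leaves the counts of every variable pattern (including |$xxxy$| and |$xxyx$|) unchanged.

```python
from collections import Counter


def classify_pattern(column):
    """Classify one four-taxon alignment column by its pattern of identity.

    Returns 'constant' when all four bases match, a label such as 'xxxy'
    or 'xxyx' when exactly one taxon differs from the other three, and
    'other' for every remaining pattern.
    """
    if len(column) != 4:
        raise ValueError("expected exactly four taxa per column")
    if len(set(column)) == 1:
        return "constant"
    for odd in range(4):
        rest = [base for i, base in enumerate(column) if i != odd]
        if len(set(rest)) == 1 and column[odd] != rest[0]:
            # 'y' marks the position of the single differing taxon.
            return "".join("y" if i == odd else "x" for i in range(4))
    return "other"


def tally_patterns(sequences, drop_constant=False):
    """Count pattern classes over all columns of a four-taxon alignment."""
    counts = Counter(classify_pattern(col) for col in zip(*sequences))
    if drop_constant:
        # Mimic SNP data: constant sites are never observed.
        counts.pop("constant", None)
    return counts


# Hypothetical toy alignment: one string per species, eight sites.
toy = ["ACGTAGAT", "ACGTAGCT", "ACGACGAT", "ACTTCGAT"]

cis_counts = tally_patterns(toy)
snp_counts = tally_patterns(toy, drop_constant=True)
print(cis_counts)
print(snp_counts)
```

In this toy alignment half of the sites are constant, yet the tallies for the variable patterns are identical in the two settings, which is the sense in which removing constant sites does not change the variable-pattern information.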
For the multilocus setting, we compared the performance of BPP with both informative and default priors, and we note that the choice of prior has an impact on the resulting inference. This is particularly apparent when |$\theta=0.05$| (Fig. 5d), where we see that the proportion correct decreases by |$\sim$|10% when default priors are used rather than priors centered at the true value of |$\theta$|. This is because this particular choice of |$\theta$| is larger than the typically observed empirical values over which the default priors are centered. We selected this value in an attempt to reproduce the high accuracy of BPP reported by Xu and Yang (2016) for this species tree. However, we also considered the more realistic value of |$\theta = 0.01$| (Fig. 5c), where we see that the effect of the prior is less substantial, likely because the default prior puts more weight close to this value. At the same time, BPP’s accuracy decreases for this value of |$\theta$|, which can be attributed to the fact that the mutation rate is lower, resulting in fewer informative sites and therefore less information available for inference. Also notable is the effect of locus length in Figure 5c,d, with shorter loci resulting in lower accuracy. Only with informative priors and the larger value of |$\theta$| is accuracy substantially above 60% achieved for BPP when loci are 200 bp in length.
The strong performance of BPP noted by Xu and Yang (2016) for this tree is only achieved with a large value of |$\theta$|, informative priors, and long loci (1000 bp each). We note that the performance of SVDQuartets is poor overall for the multilocus setting, with accuracy only around 50% for most conditions examined. Though this is similar to BPP’s accuracy for default priors, short loci, and a lower value of |$\theta$|, all of which reflect common empirical conditions, it is clear that an increase in information available, either through longer loci or a larger value of |$\theta$|, benefits BPP more than SVDQuartets. This result, together with the encouraging results for the CIS and SNP cases in Figure 5a,b, further supports our assertion that SVDQuartets may be most effective when the amount of data available, whether multilocus or SNP, is very large, which are precisely the situations in which Bayesian methods become more computationally expensive.

SVDQuartets for Multilocus Data: |$\hat{\mathbf{p}}_1$| versus |$\hat{\mathbf{p}}_2$|

SVDQuartets was originally formulated for CIS data, and is easily applied to SNP data, as the constant patterns present in CIS data and absent in SNP data do not impact the reduced-rank results that form the theoretical basis of the method (see Chifman and Kubatko (2015) for details). However, in many cases, multilocus data have been sequenced and are already available. Recall that Theorem 0.5 showed that SVDQuartets is consistent when using either of

Estimator 1: |$\hat{\mathbf{p}}_1 = \frac{1}{N}\sum_{i=1}^{N}\frac{\mathbf{X}_i}{n_i}$|

Estimator 2: |$\hat{\mathbf{p}}_2 = \frac{1}{\sum_{i=1}^{N}n_i}\sum_{i=1}^{N}\mathbf{X}_i$|

to estimate site pattern probabilities when multilocus data are generated via Definition 0.1. A natural question is then whether one of these estimators should be preferred. Wood et al.
(2005) examine this question and find that neither is uniformly better; rather the relative performance depends on the distribution |$F$| in Definition 0.1. Very generally, when |$F$| is concentrated around some value, |$\hat{\mathbf{p}}_2$| is better while when |$F$| is spread out, |$\hat{\mathbf{p}}_1$| is better. For further discussion, we refer readers to Wood et al. (2005), noting that |$\hat{\mathbf{p}}_1$| corresponds to their arithmetic average while |$\hat{\mathbf{p}}_2$| corresponds to their weighted average. Discussion Our work gives the first consistency results for four-taxon species tree inference under the coalescent model for SVDQuartets for both CIS and multilocus data and for ML for CIS data. Previous consistency results for ML were only derived in the case of gene trees. In addition, we have proved that the SVDQuartets estimator has asymptotic error probability |$O(\exp(-CN))$| for CIS and multilocus data, where |$N$| is the number of loci. The constant |$C$| probably depends on the structure of the tree being estimated, but our simulations show that it does not appear to be particularly unreasonable in a variety of scenarios. We compare the performance of SVDQuartets and ML theoretically, and find that what is known rigorously is not sufficient to confirm the conjecture of Shi and Yang (2018) that ML is more efficient than SVDQuartets; rather the comparison is inconclusive, in large part because not enough is known about the performance of ML. We can only note that our error bounds for SVDQuartets and those conjectured from partial theoretical results for ML both take the form |$O(\exp(-CN)),$| where the constant |$C$| may differ between the methods. In our simulations, we assumed that the effective population size, |$\theta$|, was constant throughout the tree. However, for empirical data, |$\theta$| may vary from branch to branch, or even along branches within the tree. 
It is therefore important to note that our proofs of consistency did not rely on the assumption of constant effective population size. In the case of consistency of ML for CIS data, identifiability is known to hold when |$\theta$| varies through the tree (Long and Kubatko 2019) and expressions analogous to that in Equation (9) can be obtained for varying effective population sizes (Rusinko 2018). In the case of the consistency of SVDQuartets, recent work (Long and Kubatko 2019) has established that the method holds in the case of varying |$\theta$|s, as well as in the absence of a molecular clock. Thus, the consistency result for SVDQuartets applies to a wide variety of mechanisms for data generation. Our simulations demonstrate comparable performance for both ML and SVDQuartets for CIS data, while ML (as implemented in BPP) generally performs better with multilocus data. Importantly, our first simulation shows that SVDQuartets can be applied to SNP data without any loss of power to infer the true species tree, making it a good choice for computationally efficient analysis of SNP data under the MSC. Examination of the performance of these methods in the anomaly zone indicates that BPP can be sensitive to the choice of prior and to the number of sites within the loci, while SVDQuartets may require a large number of loci to obtain high accuracy. We also note that, at present, BPP implements only the JC69 model, while the theoretical results underlying SVDQuartets hold for the general time reversible (GTR) model and all submodels (Chifman and Kubatko 2015), as well as for species trees that violate the molecular clock (Long and Kubatko 2019), making the method quite generally applicable. 
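For multilocus data, the two estimators of site pattern probabilities compared earlier, |$\hat{\mathbf{p}}_1$| and |$\hat{\mathbf{p}}_2$|, differ only in how loci of unequal length are weighted. The following is a minimal Python sketch of that difference, assuming each locus |$i$| is summarized by a vector of site pattern counts |$\mathbf{X}_i$| whose entries sum to the locus length |$n_i$|. The function names and the toy counts are hypothetical and are not taken from any SVDQuartets implementation.

```python
def estimator_p1(locus_counts):
    """Arithmetic average of the per-locus pattern frequencies X_i / n_i."""
    num_loci = len(locus_counts)
    num_patterns = len(locus_counts[0])
    p = [0.0] * num_patterns
    for counts in locus_counts:
        n_i = sum(counts)  # locus length = total number of sites
        for k, x in enumerate(counts):
            p[k] += x / n_i / num_loci
    return p


def estimator_p2(locus_counts):
    """Pooled (length-weighted) frequencies: sum_i X_i / sum_i n_i."""
    total_sites = sum(sum(counts) for counts in locus_counts)
    num_patterns = len(locus_counts[0])
    return [
        sum(counts[k] for counts in locus_counts) / total_sites
        for k in range(num_patterns)
    ]


# Hypothetical toy data: two loci of lengths 10 and 100, with only three
# pattern classes for brevity (four taxa would give 4**4 = 256 patterns).
loci = [[8, 1, 1], [60, 30, 10]]

p1 = estimator_p1(loci)
p2 = estimator_p2(loci)
print(p1)
print(p2)
```

When all loci have the same length the two estimators coincide; they diverge as locus lengths vary, because |$\hat{\mathbf{p}}_1$| gives every locus equal weight while |$\hat{\mathbf{p}}_2$| gives every site equal weight.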
Given the consistency results derived here, we suggest that for multilocus data, SVDQuartets will be a useful alternative to Bayesian methods such as BPP when the size of the data, in terms of the number of loci and/or the number of species, makes Markov chain Monte Carlo-based methods computationally prohibitive. That is, our results indicate that SVDQuartets can be used to achieve consistent estimates of the species tree topology in precisely the cases in which Bayesian methods are currently computationally expensive.

Acknowledgments

We thank Edward Susko, Ziheng Yang, and an anonymous reviewer for helpful comments on earlier drafts of this manuscript that led to its improvement. We are particularly grateful to Dr. Susko for suggesting a correction to our proof of consistency of SVDQuartets, and for several helpful comments about our overall approach.

References

Allman E.S., Ané C., Rhodes J.A. 2008. Identifiability of a Markovian model of molecular evolution with gamma-distributed rates. Adv. Appl. Prob. 40:228–249.
Chifman J., Kubatko L. 2015. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J. Theor. Biol. 374:35–47.
Degnan J., Salter L. 2005. Gene tree distributions under the coalescent process. Evolution 59:24–37.
Flouris T., Jiao X., Rannala B., Yang Z. 2018. Species tree inference with BPP using genomic sequences and the multispecies coalescent. Mol. Biol. Evol. 35:2585–2593.
Golub G.H., Van Loan C.F. 2013. Matrix computations. Baltimore (MD): Johns Hopkins University Press.
Jukes T., Cantor C.R. 1969. Evolution of protein molecules. In: Munro H.N., editor. Mammalian protein metabolism. New York: Academic Press. p. 21–123.
Kingman J.F.C. 1982a. Exchangeability and the evolution of large populations. In: Koch G., Spizzichino F., editors. Exchangeability in probability and statistics. Amsterdam: North-Holland. p. 97–112.
Kingman J.F.C. 1982b. On the genealogy of large populations. J. Appl. Prob. 19A:27–43.
Kingman J.F.C. 1982c. The coalescent. Stoch. Proc. Appl. 13:235–248.
Kubatko L. 2019. The multispecies coalescent. In: Balding D.J., Moltke I., Marioni J., editors. Handbook of Statistical Genomics. 4th ed. Hoboken (NJ): Wiley. p. 219–246.
Lehmann E.L., Casella G. 1998. Theory of point estimation. Springer Texts in Statistics. New York: Springer.
Liò P., Goldman N. 1998. Models of molecular evolution and phylogeny. Genome Res. 8:1233–1244.
Liu L., Yu L., Kubatko L., Pearl D.K., Edwards S.V. 2009. Coalescent methods for estimating multilocus phylogenetic trees. Mol. Phylogenet. Evol. 53:320–328.
Liu L., Yu L., Pearl D. 2010. Maximum tree: a consistent estimator of the species tree. J. Math. Biol. 60:95–106.
Long C., Kubatko L. 2019. Identifiability and reconstructibility of species phylogenies under a modified coalescent. Bull. Math. Biol. 81:408–430.
Rambaut A., Grassly N. 1997. SeqGen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235–238.
Rannala B., Yang Z. 2003. Likelihood and Bayes estimation of ancestral population sizes in hominoids using data from multiple loci. Genetics 164:1645–1656.
Rannala B., Yang Z. 2017. Efficient Bayesian species tree inference under the multispecies coalescent. Syst. Biol. 66:823–842.
Roch S., Nute M., Warnow T. 2019. Long-branch attraction in species tree estimation: inconsistency of partitioned likelihood and topology-based summary methods. Syst. Biol. 68:281–297.
Roch S., Steel M. 2015. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor. Popul. Biol. 100:56–62.
Rogers J. 1997. On the consistency of maximum likelihood estimation of phylogenetic trees from nucleotide sequences. Syst. Biol. 46:354–357.
RoyChoudhury A., Willis A., Bunge J. 2015. Consistency of a phylogenetic tree maximum likelihood estimator. J. Stat. Plan. Inference 161:73–80.
Shi C.-M., Yang Z. 2018. Coalescent-based analyses of genomic sequence data provide a robust resolution of phylogenetic relationships among major groups of gibbons. Mol. Biol. Evol. 35:159–179.
Swofford D.L. 2019. PAUP*. Phylogenetic analysis using parsimony (*and other methods). Version 4. Available from: https://paup.phylosolutions.com.
Truszkowski J., Goldman N. 2016. Maximum likelihood phylogenetic inference is consistent on multiple sequence alignments, with or without gaps. Syst. Biol. 65:328–333.
Wald A. 1949. Note on the consistency of the maximum likelihood estimate. Ann. Math. Stat. 20:595–601.
Wood G., Lai C.-D., Qiao C. 2005. Estimation of a proportion using several independent samples of binomial mixtures. Aust. N. Z. J. Stat. 47:441–448.
Xu B., Yang Z. 2016. Challenges in species tree estimation under the multispecies coalescent model. Genetics 204:1353–1368.
Yang Z. 1994. Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Syst. Biol. 43:329–342.
Yang Z. 2015. The BPP program for species tree estimation and species delimitation. Curr. Zool. 61:854–865.
Yang Z., Rannala B. 2014. Unguided species delimitation using DNA sequence data from multiple loci. Mol. Biol. Evol. 31:3125–3135.

© The Author(s) 2020. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please email: [email protected] This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Joint Phylogenetic Estimation of Geographic Movements and Biome Shifts during the Global Diversification of Viburnum
Landis, Michael J; Eaton, Deren A R; Clement, Wendy L; Park, Brian; Spriggs, Elizabeth L; Sweeney, Patrick W; Edwards, Erika J; Donoghue, Michael J
doi: 10.1093/sysbio/syaa027 pmid: 32267945
Abstract Phylogeny, molecular sequences, fossils, biogeography, and biome occupancy are all lines of evidence that reflect the singular evolutionary history of a clade, but they are most often studied separately, by first inferring a fossil-dated molecular phylogeny, then mapping on ancestral ranges and biomes inferred from extant species. Here we jointly model the evolution of biogeographic ranges, biome affinities, and molecular sequences, while incorporating fossils to estimate a dated phylogeny for all of the 163 extant species of the woody plant clade Viburnum (Adoxaceae) that we currently recognize in our ongoing worldwide monographic treatment of the group. Our analyses indicate that while the major Viburnum lineages evolved in the Eocene, the majority of extant species originated since the Miocene. Viburnum radiated first in Asia, in warm, broad-leaved evergreen (lucidophyllous) forests. Within Asia, we infer several early shifts into more tropical forests, and multiple shifts into forests that experience prolonged freezing. From Asia, we infer two early movements into the New World. These two lineages probably first occupied warm temperate forests and adapted later to spreading cold climates. One of these lineages (Porphyrotinus) occupied cloud forests and moved south through the mountains of the Neotropics. Several other movements into North America took place more recently, facilitated by prior adaptations to freezing in the Old World. We also infer four disjunctions between Asia and Europe: the Tinus lineage is the oldest and probably occupied warm forests when it spread, whereas the other three were more recent and in cold-adapted lineages. These results variously contradict published accounts, especially the view that Viburnum radiated initially in cold forests and, accordingly, maintained vessel elements with scalariform perforations. 
We explored how the location and biome assignments of fossils affected our inference of ancestral areas and biome states. Our results are sensitive to, but not entirely dependent upon, the inclusion of fossil biome data. It will be critical to take advantage of all available lines of evidence to decipher events in the distant past. The joint estimation approach developed here provides cautious hope even when fossil evidence is limited. [Biogeography; biome; combined evidence; fossil pollen; phylogeny; Viburnum.] Convincing historical explanations for the modern distribution of organisms are rare. This is because they necessitate, at least: 1) a compelling phylogenetic treatment of a comprehensive sample of relevant species (a phylogenetic component); 2) the incorporation of fossils to assess the absolute timing of events (a temporal component); 3) a historical biogeographic analysis that identifies and explains geographic disjunctions (a spatial component); and 4) information on the environments occupied by the species under consideration (an ecological component). Using quantitative phylogenetic models, numerous recent studies have extracted important insights from the comparison of dated phylogenies, geographic inferences, and biome reconstructions (e.g., Weeks et al. 2014; Meseguer et al. 2015; Cardillo et al. 2017; Gagnon et al. 2019). On the whole, analyses of these components have been carried out sequentially: species relationships are first inferred from sequence data, the tree is then time-calibrated with fossils, and the dated tree is used to separately infer geographic movements and biome shifts during the evolution of a clade. With sequential analyses, by design, evolutionary inferences that are made during the later stages cannot influence inferences made earlier in the analytical sequence. Molecular and morphological evidence alone may be sufficient to infer phylogenetic relationships or divergence times during that first stage of analysis. 
But perhaps not always: from an evidential standpoint, a fossil may have too few morphological characters to securely position it during the first (phylogenetic) stage of analysis, causing the fossil to be omitted in later stages of biogeographical or ecological analysis. That said, phylogenetically unresolved fossils still indicate that at least one ancestral lineage was associated with a particular region or biome, even if it is uncertain exactly where in the phylogeny that lineage belongs. In such cases, the temporal, geographic, and ecological evidence that fossils provide may not only inform ancestral state estimates, but also aid in the phylogenetic placement of the fossils themselves. Here we implement an approach that jointly models the evolution of molecular sequences, fossils, geographic ranges, and biome affinities to collectively inform evolutionary inferences. Conceptually, our approach is a biogeographical variant of the “combined evidence” paradigm (Nylander et al. 2004) that is often used for “tip-dating” phylogenies (Pyron 2011; Ronquist et al. 2012), a variant that we do not believe has been previously explored. Our combined evidence strategy favors evolutionary histories that are in harmony across phylogenetic, temporal, biogeographic, and ecological components, while penalizing those with dissonant components. The specific question we address here is how Viburnum (Adoxaceae, Dipsacales), a widespread angiosperm lineage consisting of |$\sim $|163 species of shrubs and small trees, radiated into the geographic ranges and biomes that it occupies today. There are several reasons why Viburnum is well suited to such inference problems. First, we have sampled it comprehensively, and have assembled multiple molecular (e.g., cpDNA: Clement et al. 2014; RAD-seq: Eaton et al.
2017) and morphological data sets (see Supplement 1; all Supplementary Materials are available on Dryad at https://doi.org/10.5061/dryad.hx3ffbgb1), including data on traits and climate variables related to biogeography and biome occupancy (e.g., Schmerler et al. 2012; Weber et al. 2012; Chatelet et al. 2013; Edwards et al. 2014; Scoffoni et al. 2016; Edwards et al. 2017; Spriggs, Eaton et al. 2019; Sinnott-Armstrong et al. 2020). Second, as Viburnum exemplifies Northern Hemisphere disjunctions that are common in plants, fungi, and animals (e.g., Wen 1999; 2002; Donoghue and Smith 2004; Milne 2006), as well as shifts between tropical, warm temperate, and cold temperate forest biomes (e.g., Donoghue and Edwards 2014), our results address questions of long-standing interest to evolutionary biologists, biogeographers, and ecologists (Fig. 1; Supplement 1 available on Dryad). Third, we and others have published on the Viburnum radiation (e.g., Winkworth and Donoghue 2005; Moore and Donoghue 2007, 2009; Clement and Donoghue 2011; Clement et al. 2014; Spriggs et al. 2015; Lens et al. 2016), so our analyses provide a critical test of recently proposed and strongly contrasting hypotheses. On the one hand, it has been argued that Viburnum originated in tropical forests, and that the few species present today in these forests are “dying embers” of a deep tropical past (Spriggs et al. 2015). On the other hand, it has been hypothesized that Viburnum originated in cold forests that experienced prolonged freezing, and that they spread only later into tropical environments (Lens et al. 2016).

Figure 1. Global diversity of Viburnum across six areas and four biomes. Areas are marked with colored polygons for Southeast Asia (magenta), East Asia (red), Europe (green), North America (yellow), Central America and Mexico (cyan), and South America (blue).
Counts of local species with biome affinities are reported for each area with pie charts, with the area of each chart corresponding to the total number of local species. Biomes are colored as tropical (red), warm temperate/lucidophyllous (green), cloud forest (sky blue), and cold temperate forest (dark blue). The locations and biomes for the five fossil pollen specimens are represented with smaller markers with black borders.

Apart from the details of this case, however, we hope that the approach featured here will illuminate an important point. It is, it must be admitted, a humbling task to infer ancient events, and the results in many cases are tenuous at best. Given the obvious limitations of working with extant species and few, if any, fossils, it is necessary to integrate all of the available sources of evidence if we hope to produce assuring answers. The consilience of multiple lines of evidence may give us a chance to convincingly favor some hypotheses over others, even in cases that appear unpromising on the surface.
Materials and Methods Analysis Overview Our chief goal was to estimate a joint distribution of phylogenetic topologies, time-calibrated lineage splitting events, biogeographic histories, and biome shift histories for 163 extant Viburnum species and 5 fossil specimens. To this end, we adopted a two-stage approach. During Stage 1, we estimated a tree topology for Viburnum from RAD-seq data for 118 species using a phylogenomic concatenation method. In Stage 2, we estimated absolute divergence times, ancestral ranges, and ancestral biomes while requiring that all 163 extant species relationships be congruent with the 99%-majority rule consensus tree from Stage 1. To do so, we jointly modeled the evolution of the biogeographic ranges, the biome affinities, and 10 partitioned molecular loci (9 cpDNA and nrITS) to estimate node ages and any equivocal species relationships. Dating information enters the model through five fossil occurrences, generated by the fossilized birth–death process, and through a secondary calibration for the stem age of Viburnum. Morphology-based clade constraints were additionally imposed upon the fossil taxa and unsequenced extant species (Supplements 4 and 5 available on Dryad). The second stage of our approach takes inspiration from morphology-based combined evidence (Nylander et al. 2004) and tip-dating (Pyron 2011, Ronquist et al. 2012) analyses, and from fossil-based ancestral state reconstructions for biogeography (Mao et al. 2012; Wood et al. 2012) and biome occupancy (Betancur-R et al. 2015). Both tip-dating and fossil-based ancestral state estimates incorporate fossil taxa as leaf nodes in the phylogeny. Rather than first estimating the phylogeny of extant and fossil taxa, and then estimating ancestral ranges and biome affinities, we directly include biome and biogeographic data as part of the combined evidence tip-dating exercise using RevBayes (Höhna et al. 2016). 
This means that each of our joint posterior samples represents an internally consistent evolutionary scenario that could have unfolded under our generative model to produce the evidence we have collected. Stage 1: RAD-Seq Topology Estimation We estimated the Viburnum tree topology using newly sequenced RAD data for this study combined with data from a previous study (Eaton et al. 2017), nearly doubling the number of taxa with genomic data from 64 to 127 (Supplement 2 available on Dryad). These newly sequenced species filled sampling gaps for many subclades; in addition, we added representatives of one major clade (Mollotinus: Viburnum ellipticum, V. molle), increased the sampling of the Neotropical Oreinotinus clade from 4 to 25 species, and included nine outgroup species. Raw RAD-seq data have been submitted to NCBI SRA under project number PRJNA605569. Supplement 2 (available on Dryad) details how the raw RAD-seq data were generated and how they were assembled with ipyrad v.0.7.13 (Eaton 2014). The RAD-seq topology for the 127 Viburnum and outgroup species was inferred under a concatenated GTR|$+$|Gamma model using RAxML (Stamatakis 2014). Outgroup species were then pruned from the estimated tree to yield the topology for 118 of |$\sim $|163 (72%) Viburnum species. Nodes with equivocal bootstrap support (|$P<$| 0.99) were collapsed into polytomies, resulting in the RAD-seq backbone topology used in Stage 2. Stage 2: Joint Bayesian Macroevolutionary Inference We constructed a phylogenetic model to jointly estimate species relationships, divergence times, fossil ages, biogeographical histories, and biome shifts. Below, we first describe the data used for Stage 2, and then the components of the macroevolutionary model. Molecular sequences for 153 of |$\sim $|163 (94%) extant species were used to estimate species relationships among 35 lineages that were not topologically constrained and to estimate all phylogenetic divergence times under a relaxed clock model. 
For this, we assembled previously published sequences for 138 extant viburnums for nine chloroplast genes (matK, ndhF, petBD, psbA, rbcL, rpl32, trnC, trnK, and trnSG) and one nuclear ribosomal marker (ITS). We then replaced 34 of those sequences that originated from herbarium or botanical garden specimens with new sequences extracted from field specimens that we collected. In addition, we sequenced 15 previously unsequenced species for this study, bringing the total number of species for these data to 153. The final matrix contained 22.6% missing cells. All taxa in the Stage 1 RAD-seq data set were also represented in this Stage 2 cpDNA data set. Sequences were aligned in Muscle then manually adjusted in AliView as needed. Sequence accession numbers are given in Supplement 3 (available on Dryad). Geographical regions and biome affinities were scored for all 163 extant species and for the 5 fossil specimens. All species ranges were coded into one or more of six discrete areas: Southeast Asia, Eastern Asia, Europe, North America, Central America, and South America. These areas are meant to reflect the major centers of endemism for plant clades distributed around the Northern Hemisphere (Laurasian distributions); however, we omitted Western North America as only one species of Viburnum (V. ellipticum) is endemic to that region. We subdivided Asia into Eastern and Southeastern regions to reflect patterns of endemism in Viburnum, especially the distribution of a number of lineages in Southeast Asia (Vietnam, Malaysia, Indonesia, and the Philippines). We subdivided the Neotropics into Central America (and Mexico) and South America both to reflect the isolation of South America before the Miocene and to assess the pattern and timing of the occupation of South America. Species were assigned affinities to one or more of four mesic forest biome classes, following Edwards et al. 
(2017): tropical/subtropical (T), warm temperate (i.e., lucidophyllous) (W), cold temperate (Co), and/or cloud forest (Cl). Tropical forests experience limited seasonality (i.e., temperature and precipitation remain relatively stable and high throughout the year). Warm temperate forests are characterized by greater seasonality (warm/wet versus cool/dry seasons), but do not experience a period of prolonged freezing, and they are dominated by broad-leaved evergreen trees. In contrast, cold temperate forests experience a season in which temperatures routinely fall below freezing for one or more months and are dominated by conifers and deciduous angiosperms. Cloud forests occur in montane regions at tropical/subtropical latitudes and are characterized by frequent cloud cover at the canopy level. They can and often do experience limited seasonality and large daily fluctuations in temperature, but not a prolonged freezing season. The majority of taxa were coded for a single biome type, with two groups of exceptions. First, 14 living species with ranges in East and/or Southeast Asia were coded as ambiguous for warm versus cold temperate states (W/Co). Second, we assigned biome affinities to the five fossil taxa by reference to descriptions of the paleofloras from which they were obtained and paleoclimate analyses where available (e.g., Moss et al. 2005; Greenwood et al. 2010; Denk et al. 2011; Zaborac-Reed and Leopold 2016; see Supplement 5 available on Dryad). For four of the fossils the paleofloral assemblages imply the occupancy of forests most analogous to modern warm temperate forests. Only the paleoflora associated with the youngest fossil, an Icelandic specimen from the mid-Miocene, contained numerous cold temperate elements in addition to warm temperate species, and thus was coded as ambiguous for either warm or cold temperate (W/Co). We used RevBayes (Höhna et al. 
2016) to perform a joint Bayesian analysis of all Stage 2 Viburnum data, which included the molecular cpDNA+ITS sequences, the RAD-seq topological constraints, the biogeographic data, the biome data, and the fossil data using a combined evidence framework (Nylander et al. 2004; Pyron 2011; Ronquist et al. 2012). Where combined evidence analyses typically integrate molecular and morphological data (Nylander et al. 2004), we integrate molecular, biogeographic, and biome data. We modeled lineage diversification using the fossilized birth–death process (Heath et al. 2014) with a truncated normal stem age prior centered on 71 Ma (52.7–85.74 Ma; Bell and Donoghue 2005). Molecular sequences were partitioned by locus (Nylander et al. 2004) and concatenated, with each locus evolving under a GTR|$+$|Gamma substitution model (Tavaré 1986; Yang 1994). Historical biogeography was modeled using DEC (Ree et al. 2005) with a time-stratified model (Ree and Smith 2008), with dispersal rates informed by paleogeographical dispersal graphs (Landis 2017). Biome shifts were modeled using an unconstrained rate matrix with four states. Three independent, global clocks modeled the rates of molecular substitution events, biogeographic events, and biome shift events. The molecular clock rate was further relaxed across partitions and across branches using mean-1 rate multipliers. Priors for all parameters were generally chosen to be weakly informative or to place higher density on parameter values that reduce complex models into simple ones. Posteriors were estimated using Markov chain Monte Carlo (MCMC). Supplement 6 (available on Dryad) describes the model and MCMC analyses settings in finer detail. Annotated RevBayes scripts to perform the analysis and generate figures are available at http://github.com/mlandis/vib_div. When estimating the posterior, we constrained the phylogenetic relationships for three subsets of our taxa, each requiring different justifications and assumptions. 
Phylogenies that did not obey the specified clade constraints were not sampled in the posterior. First, we applied a backbone constraint that required the presence of all highly supported relationships in the RAD-seq topology (i.e., clades with bootstrap support |$\ge\!99$|). Second, we imposed clade constraints for five fossils. Based on studies by Manchester and colleagues (e.g., Manchester 2002; Manchester et al. 2002), we have rejected all leaf fossils previously assigned to Viburnum, and instead have incorporated fossil pollen grains. The pollen exine comes in three forms in Viburnum (Donoghue 1985): type Ia grains are regularly reticulate, with the reticulum elevated on columellae, and with psilate (smooth) muri (Donoghue 1985; Plates I–IV); type Ib grains have a continuous reticulum, as in Ia, but with regularly scabrate (bumpy) muri (Donoghue 1985, Plate VI); and type Ic grains have an intectate, retipilate exine, with scabrate pili (Donoghue 1985, Plate VII). Donoghue (1985) postulated that Viburnum pollen evolved from Ia to Ib to Ic, and this has been confirmed in subsequent phylogenetic analyses (e.g., Winkworth and Donoghue 2005; Clement and Donoghue 2011; Clement et al. 2014; Spriggs et al. 2015). Exine type Ib (with scabrate muri) is inferred to have evolved along the Valvatotinus branch, and type Ic (with scabrate pili) is inferred to have evolved independently from type Ib within the Lentago clade and within the Euviburnum clade. Two of our fossil pollen grains, one from British Columbia (Manchester et al. 2015) and one from the Florissant Fossil Beds in Colorado (Bouchal 2013), have the ancestral exine morphology (type Ia); these can be assigned with some confidence to the total group of Viburnum, and although we cannot assign them to any particular subclade, we forbid their assignment to subclades characterized by type Ib or Ic. 
The pollen grains from the Paris Basin in Europe (Gruas-Cavagnetto 1978) and from the Northwest Territories in Canada (McIntyre 1991) appear to have exine type Ib, and therefore were constrained to belong to Valvatotinus, but were forbidden from joining the two subclades characterized by exine type Ic. Finally, Miocene pollen from Iceland (Denk et al. 2011) is of type Ic, and thus has two tenable topological placements. We therefore designed a specialized “either-or” clade constraint that requires the fossil placement to satisfy either the Lentago|$+$|Iceland or the Euviburnum|$+$|Iceland clade constraint. Third, 10 of our |$\sim $|163 extant Viburnum species are currently unsequenced. We variously constrained the placement of these species based on morphological characters. As described for fossil pollen grains, particular placements within a subclade were forbidden in several cases based on morphology. Details of these constraints are provided in Supplement 4 (available on Dryad). The phylogenetic relationships of any unconstrained species are inferred from their molecular, biogeographic, and biome character matrices. Although the 5 fossil taxa and the 10 unsequenced living taxa have no molecular characters, their biome and biogeography characters lend support for some phylogenetic relationships over others, and thus their phylogenetic positions are not simply drawn from the prior. The Stage 2 analysis produces a joint posterior distribution over the constrained time-calibrated phylogenies, along with ancestral range estimates, ancestral biome estimates, and model parameters, all conditional on the combined molecular, biogeographic, biome, and fossil evidence. 
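The "either-or" clade constraint described above for the Iceland fossil can be illustrated with a small sketch. This is not the RevBayes implementation used in the analysis; it only shows the logical test under simplifying assumptions. Trees are written as nested tuples, a constraint is treated as satisfied when its taxon set appears exactly as a clade, and all taxon names (`lentago_1`, `euvib_1`, `iceland_fossil`, and so on) and function names are hypothetical placeholders.

```python
def clades_of(tree):
    """Return (tips, clades) for a nested-tuple tree whose leaves are strings.

    `clades` is the set of tip sets (frozensets) subtended by every node.
    """
    if isinstance(tree, str):
        tips = frozenset([tree])
        return tips, {tips}
    tips = frozenset()
    clades = set()
    for child in tree:
        child_tips, child_clades = clades_of(child)
        tips |= child_tips
        clades |= child_clades
    clades.add(tips)
    return tips, clades


def satisfies_either_or(tree, alternatives):
    """True if at least one alternative taxon set forms a clade in `tree`."""
    _, clades = clades_of(tree)
    return any(frozenset(alt) in clades for alt in alternatives)


# Hypothetical toy taxa standing in for the two candidate placements.
LENTAGO_PLUS_ICELAND = {"lentago_1", "lentago_2", "iceland_fossil"}
EUVIBURNUM_PLUS_ICELAND = {"euvib_1", "euvib_2", "iceland_fossil"}
ALTERNATIVES = [LENTAGO_PLUS_ICELAND, EUVIBURNUM_PLUS_ICELAND]

# Fossil nested within the toy Lentago group: constraint satisfied.
tree_a = ((("lentago_1", "iceland_fossil"), "lentago_2"), ("euvib_1", "euvib_2"))
# Fossil nested within the toy Euviburnum group: constraint satisfied.
tree_b = (("lentago_1", "lentago_2"), (("euvib_1", "iceland_fossil"), "euvib_2"))
# Fossil outside both groups: constraint violated.
tree_c = ((("lentago_1", "lentago_2"), ("euvib_1", "euvib_2")), "iceland_fossil")
```

In an MCMC setting, a check of this kind would be applied to each proposed tree, so that trees satisfying neither alternative are never sampled.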
We apply our Stage 2 analysis in four ways: first, using a Complete data set that combines all evidence currently available as described above; second, using a Masked data set that sets all biogeography and biome characters for fossil taxa to be fully ambiguous (“?”); third, by recoding the four biome states into just two biome states, Nonfreezing (Tropical, Cloud, Warm Temperate) and Freezing (Cold Temperate) biomes, for both the Complete and Masked data sets; and, fourth, by assessing the sensitivity of our root state estimates due to possible biome-correlated biases in fossil recovery rates (described below). Ancestral State Estimates Ancestral states and stochastic mappings were sampled for biogeographic ranges and biome affinities at regular intervals during MCMC sampling (Huelsenbeck et al. 2003). Each ancestral state and stochastic mapping sample corresponds to a posterior tree sample. We summarize ancestral state estimates using the maximum clade credibility (MCC) topology. Because ancestral range estimates at nodes may have differing ancestral and daughter states due to cladogenetic change, we defined a stricter node-matching rule than we used for the biome estimates, and only retained ancestral samples if the node and its two daughter nodes appear in the MCC topology. We also present ancestral state estimates as state-frequency-through-time (SFTT) plots. An SFTT plot summarizes what proportion of lineages exist at time |$t$| that are found in state |$i$| while averaging over the posterior distribution of topologies, divergence times, and stochastic mappings. SFTT plots are generated as follows. For each posterior sample, we parse its stochastic mapping to record the number of lineages in each possible biome-area state pair (|$6\times 4 = 24$| states) for each 1 Myr of time between 100 Ma and the Present. 
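As a rough illustration of the SFTT bookkeeping, the sketch below pools lineage counts over posterior samples and then normalizes them either within an area (biomes-per-area) or within a biome (areas-per-biome). It is a minimal sketch, not the authors' plotting code: the input layout (one dictionary per posterior sample, keyed by time bin, area, and biome), the choice to pool counts before normalizing, and all function and label names are assumptions made for the example.

```python
from collections import defaultdict

# Labels follow the six areas and four biome codes used in the text.
AREAS = ["SE Asia", "E Asia", "Europe", "N America", "C America", "S America"]
BIOMES = ["T", "W", "Co", "Cl"]


def pool_counts(samples):
    """Sum lineage counts over posterior samples.

    Each sample is a dict mapping (time_bin, area, biome) -> number of
    lineages in that biome-area state during that 1-Myr bin.
    """
    pooled = defaultdict(int)
    for sample in samples:
        for key, n_lineages in sample.items():
            pooled[key] += n_lineages
    return pooled


def biomes_per_area(pooled, time_bin, area):
    """Frequency of each biome among lineages present in `area`."""
    counts = {b: pooled.get((time_bin, area, b), 0) for b in BIOMES}
    total = sum(counts.values())
    if total == 0:
        return None  # no lineages recorded for this bin and area
    return {b: c / total for b, c in counts.items()}


def areas_per_biome(pooled, time_bin, biome):
    """Frequency of each area among lineages with affinity `biome`."""
    counts = {a: pooled.get((time_bin, a, biome), 0) for a in AREAS}
    total = sum(counts.values())
    if total == 0:
        return None
    return {a: c / total for a, c in counts.items()}


# Two made-up posterior samples for the bin ending at 10 Ma.
example_samples = [
    {(10, "E Asia", "W"): 3, (10, "E Asia", "Co"): 1},
    {(10, "E Asia", "W"): 2, (10, "N America", "W"): 2},
]
example_pooled = pool_counts(example_samples)
```

With these toy inputs, `biomes_per_area(example_pooled, 10, "E Asia")` gives 5/6 for warm temperate and 1/6 for cold temperate, while `areas_per_biome(example_pooled, 10, "W")` gives 5/7 for E Asia and 2/7 for N America.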
Within each interval, we then compute the frequency of lineages across biome states given that they are present in a particular area (biomes-per-area SFTT) and the frequency of lineages across areas given a particular biome affinity (areas-per-biome SFTT). Frequencies estimated from time bins with fewer than 510 samples are not shown, i.e. too few samples to estimate all state frequencies within |$\pm $|0.05 of the true multinomial frequencies with probability 0.95 (Thompson 1987). Note that lineages without sampled descendants do not contribute to the SFTT plots. Although we have achieved (to our knowledge at present) complete extant taxon sampling, our five fossil taxa certainly underrepresent historical Viburnum diversity and variation. SFTT plots were generated for both the Complete and Masked data sets to gauge the influence of fossil states on historical inferences. Sensitivity of Biome Root State Estimates All five Viburnum fossil pollen specimens were sampled from paleofloras that we judged to be either strictly or partially analogous to modern, warm temperate biomes, whereas no fossils were associated with tropical or cloud forest biomes. Because all five fossils were also sampled from North American and European localities, taphonomic or acquisition biases may distort which ancestral biomes are represented among our fossil taxa. We expect our fossil biome scores to inform our ancestral biome estimate for the most recent common ancestor of Viburnum (the root biome state, X|$_{\rm root})$|, but how robust would that result be to the discovery of new phylogenetic evidence? In particular, we are interested in how much “missing data” (e.g. from unsampled fossils) supporting alternative root states would be needed to favor a cold temperate root state. To measure this, we developed a sensitivity test that manipulates the prior probabilities for biome root state frequencies in a controlled manner. 
Adjusting root state priors effectively lets us specify how much missing evidence (or missing data, X|$_{\rm miss}$|) we wish to introduce in favor of freezing (Co) or nonfreezing (T, W, Cl) root state estimates without simulating data. We calibrate the amount of missing data by setting the cold temperate root state probability to equal 0.01, 0.02, 0.05, 0.10, 0.25, 0.40, 0.60, 0.75, 0.90, 0.95, 0.98, and 0.99, and set the root state priors for the remaining three biome states equal to (1/3) |$\times$| (1 – Pr(X|$_{\rm root}$| = Cold)). We then assess how posterior root state estimates respond to different amounts of missing data for both the Complete and Masked data sets. Biome-Area Transition Graph We compute the posterior mean number of times that any lineage transitioned between each ordered pair of biome-areas. Transition types that have a posterior mean less than 1 are filtered out. We then plot the posterior mean number of transitions (arrows) between biome-area pairs (nodes). Results Phylogeny Inferred species relationships among sequenced taxa are shown in Supplementary Fig. S1 (available on Dryad). The Stage 1 topology for 118 Viburnum species that we estimated from RAD-seq data is displayed as an MCC topology (Supplementary Fig. S1A available on Dryad). It is almost entirely consistent with the RAD-seq tree obtained by Eaton et al. (2017) for 65 Viburnum species. We note, however, that the exact placement of V. clemensiae is unclear. Eaton et al. (2017) rooted the tree along the V. clemensiae branch based on previous studies using mainly cpDNA data and Sambucus species as outgroups (e.g., Donoghue et al. 2004; Winkworth and Donoghue 2004; Clement and Donoghue 2011). In the present analysis, which relied on RAD-seq data for nine outgroup species (not shown), the position of V. clemensiae is equivocal and therefore collapsed. It is either sister to all other Viburnum species (as per Eaton et al. 
2017, and previous studies), or, with equally low support, sister to a large clade containing the Urceolata, Pseudotinus, Crenotinus, and Valvatotinus clades (as in Supplementary Fig. S1B available on Dryad). We did not recover the weakly supported (pp|$=0.58$|) placement of V. clemensiae obtained by Lens et al. (2016) based on four cpDNA markers, ITS sequences, and outgroups (i.e., sister to a clade including all Viburnum except Valvatotinus). Other, more minor, differences in tree topology as compared with Eaton et al. (2017) and to other previous analyses (Winkworth and Donoghue 2005; Clement et al. 2014; Clement and Donoghue 2011; Spriggs et al. 2015; Lens et al. 2016) are described in Supplement 7 (available on Dryad). All Stage 2 analyses were estimated while constrained by those Stage 1 relationships with |$>$|98% bootstrap support. Supplementary Figure S1B (available on Dryad) shows the Stage 2 MCC topology for the 153 species with RAD-seq and/or chloroplast DNA sequences (i.e., the 10 unsequenced extant and the 5 fossil Viburnum taxa were filtered from each posterior sample before constructing the MCC topology). By design, this is entirely consistent with the RAD-seq backbone tree (Supplementary Fig. S1A available on Dryad), but |$\sim $|43 additional species relationships are here resolved (sometimes with poor support) based on their cpDNA and ITS sequences. Phylogenetic relationships for all 168 tips (163 extant Viburnum species plus 5 fossils) are shown in Figure 2. Because we lack molecular data for 10 extant taxa and for the fossils, they behave as rogue taxa (Wilkinson 1996), thereby eroding clade support as compared with Supplementary Fig. S1B (available on Dryad). That said, we see higher-than-expected support (pp|$=0.37$| and pp|$=0.55$|) for the placement of two fossils (from British Columbia and Colorado) along the Porphyrotinus stem; this reflects their ages, presence in North America, and warm temperate biome affinities. 
Two other fossils (from Paris Basin and Northwest Territories) are placed along the stem of Valvatotinus. Our analyses favor the placement of the Iceland fossil pollen within the North American Lentago clade, along the V. nudum/V. cassinoides branch, rather than in the alternative position (based on exine morphology) within Euviburnum. Figure 2. Viburnum divergence time estimates. MCC topology includes all 168 taxa from the Complete data set. Clades with posterior probabilities |$ \le $| 0.99 are annotated (shaded markers) whereas clades with posterior probability |$ \approx $|1.0 are not. Blue node bars indicate 95% highest posterior density age estimates. Red bars for fossil taxa represent uncertainty in the fossils’ ages. Taxon labels for unsequenced extant taxa are colored gray. Dating Divergence time estimates with 95% highest posterior densities (HPD95) are shown in Figure 2. It appears that most of the major Viburnum lineages had evolved by the end of the Eocene (i.e., V. clemensiae, Pseudotinus, Urceolata, Euviburnum, Punctata|$+$|Lentago, V. amplificatum, Lutescentia, Solenotinus, Tinus, Porphyrotinus, Opulus, and Laminotinus). The divergence of Mollotinus from Dentata|$+$|Oreinotinus, of Punctata from Lentago, and of the Coriacea, Sambucina, Lobata, and Succotinus clades, probably took place in the Oligocene. In any case, most of the diversification of Viburnum took place from the Miocene onward. 
The three significant radiations identified by Spriggs et al. (2015), in Succotinus, Oreinotinus, and Solenotinus, were underway by the mid-Miocene, though most of the speciation in these clades occurred within the past 5 Myr (14/25 events in Succotinus, 23/34 events in Oreinotinus, and 9/15 events in Solenotinus, excluding unsequenced taxa). Overall, we estimate that 73% (122/167) of verifiable (i.e., not masked by extinction) speciation events in Viburnum occurred during the past 12 Myr, and 45% (75/167) within the past 5 Myr. Any Viburnum lineage is expected to split (speciate) roughly every 7 Myr and go extinct every 11 Myr (posterior mean birth rate of 0.143 and extinction rate of 0.091). Biogeography Our reconstruction of ancestral ranges based on the full analysis is shown in Figure 3. Although the results are equivocal for the base of Viburnum, we note that we find the highest support for Eastern Asia alone, or for Eastern Asia plus SE Asia, and very limited support for the presence of Viburnum in the New World at that time. The ancestors of the several major clades that existed in the Paleocene are also inferred to have been present in Eastern Asia. Figure 3. Viburnum ancestral ranges estimated from the Complete data set. Node colors correspond to range states (legend). Only the three most probable range states are reported for each node, while all less probable states are grouped together under the “...” label and colored gray. A) Range probabilities for the ancestral range and the daughter ranges at each internal node with pie charts. Taxon labels for unsequenced extant taxa are colored gray. B) A subsample of stochastic mappings for the same six posterior samples as shown in Figure 4. By the Late Paleocene or the Early Eocene, we see evidence for the movement of the Porphyrotinus clade into North America, most likely from Eastern Asia. From North America, this lineage spread to the mountains of the Neotropics twice: 1) V. australe of the Mollotinus clade is now restricted to a few localities in northeastern Mexico; and 2) the Oreinotinus clade, with some 36 species, has spread as far south as Northern Argentina. The Oreinotinus radiation appears to have begun in eastern Mexico by the mid-Miocene and then tracked southward, with one lineage entering South America 5–12 Ma. Based primarily on the fossil pollen from the Northwest Territories, we also see the establishment of the Valvatotinus lineage in North America in the Eocene, although Lentago, the extant New World clade within Valvatotinus, may not have been present in North America until near the end of the Eocene. Today, Lentago has seven species in eastern North America and one in Mexico (V. elatum), extending south to Chiapas (Spriggs, Eaton et al. 2019; Spriggs, Schlutius et al. 2018). Our results indicate that there were four other, more recent, movements into North America. The Pseudotinus and Lobata clades probably entered in the mid-Miocene, where they are represented today in eastern North America by V. lantanoides and V. acerifolium, respectively. V. lantanoides is closely related to Eastern Asian species, and we infer that its ancestor probably entered North America through the Bering Land Bridge. V. acerifolium is sister to V. 
orientale in the Caucasus mountains of Georgia, so it is possible that it entered across the North Atlantic (Denk et al. 2011). Reconstructions for the Opulus clade are more complicated, but we prefer an interpretation involving two disjunctions: 1) in the core Opulus clade, in which the V. trilobum lineage may have entered North America (from Asia or Europe) before 15 Ma; and 2) between V. edule and V. koreanum, which today straddle the Bering Strait, and may have separated within the past 4 Myr. The rest of the diversification of Viburnum took place in the Old World, mainly in Eastern Asia, but with several extensions into SE Asia. Specifically, SE Asia is occupied today by the early-branching V. clemensiae and V. amplificatum, both of Borneo, the Punctata clade spanning from southern China to the Western Ghats of India, several species of the Lutescentia clade, and all members of the Sambucina|$+$|Coriacea clade except for V. ternatum (which extends into Central China). Four clades have one or more species in Europe, and in each case, these are inferred to have been derived from Asian ancestors. The Tinus lineage appears to have entered Europe first, perhaps at the end of the Eocene or the beginning of the Oligocene. V. tinus is widespread around the Mediterranean (including northern Africa) and may have spawned two island endemics: V. rugosum in the Canary Islands and V. treleasei in the Azores (Moura et al. 2015). Within Lobata, the ancestor of V. orientale may have entered Europe some 10 Ma, and the split between the European V. opulus and the Asian V. sargentii probably occurred within the past 5 Myr. Finally, in Euviburnum, the two European species (V. lantana and V. maculatum) shared a common ancestor 1–2 Ma with their Asian relative, V. burejaeticum. Biomes Our ancestral biome reconstructions are shown in Figure 4. 
With considerable statistical certainty (pp |$= 0.93$|), we infer that Viburnum first occupied warm temperate (lucidophyllous) forests and diversified in these forests in the Eocene. There may have been two or more shifts into more tropical (i.e., less monsoonal) forests, especially in the V. clemensiae and V. amplificatum lineages. The entry of the Punctata lineage and the Coriacea|$+$|Sambucina clade into more tropical forests probably began during the Oligocene. Support for a warm temperate origin is still favored when we mask fossil biome states (Supplementary Fig. S4 available on Dryad). Figure 4. Viburnum ancestral biomes estimated from the Complete data set. Node colors correspond to forest biome affinity states (legend). A) Probabilities are given by pie charts at nodes. Taxon labels for unsequenced extant taxa are colored gray. B) A subsample of stochastic mappings for the same six posterior samples as shown in Figure 3. There is virtually no support that Viburnum first diversified in cold temperate forests, regardless of whether we encode habitat preferences using four (Fig. 4) or two (Supplementary Fig. S5 available on Dryad) biomes. Instead, it appears that there were multiple transitions from warm temperate into cold temperate forests. The earliest of these shifts, perhaps in the Late Eocene, occurred in two clades (Opulus and Pseudotinus) that today occupy the coldest climates of any Viburnum species (Edwards et al. 2017; Park and Donoghue 2020), including boreal forests in North America (e.g., V. edule, V. trilobum, and V. lantanoides), Europe (V. 
opulus), and Asia (V. koreanum, V. sargentii). Many additional shifts into cold temperate forests appear to have taken place during the Oligocene or early Miocene. These are seen in the Lentago, Euviburnum, Solenotinus, Mollotinus, Dentata, Lobata, and Succotinus clades. In Asia there are multiple inferred transitions from warm to cold temperate forests within Succotinus (six or more) and Solenotinus (three or more). The only lineage that plausibly shifted from cold into warm temperate is V. obovatum in North America, and no lineages shifted from cold temperate into tropical biomes. It is not clear whether the Neotropical cloud forest clade (Oreinotinus) originated from ancestors in warm or cold temperate forests. It appears that the Porphyrotinus clade occupied warm temperate forests when it entered North America, and it is possible that the occupation of this biome (with limited distribution in North America today) preceded both the shift into cold temperate forests to the north and cloud forests to the south. Sensitivity Analyses Using all four biome states and ignoring fossil biomes as assumed under the Masked data set, the posterior probability of the biome root state is highly sensitive to whether any missing data might inform a freezing root state (Supplementary Fig. S6 available on Dryad). When we assign biome states to fossil taxa with the Complete data set, cold temperate forests have negligible posterior support (pp |$<$| 0.05) when there is no missing data that would influence the root state estimate (|$Pr(X_{\rm root}$| = Cold) = 0.25). A freezing biome root state remains less probable than any nonfreezing biome (pp |$<$| 0.50) until the degree to which missing data favors freezing forests is high (|$Pr(X_{\rm root} = Cold)= 0.82$|). Only when we assume that any missing data are extremely biased toward a freezing origin for Viburnum (|$Pr(X_{\rm root} = Cold) = 0.99$|) do we confidently estimate such a cold origin (pp |$>$| 0.95). 
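The root-state priors used in this sensitivity test (see Materials and Methods) follow a simple rule: the cold temperate state receives a chosen prior probability, and the remainder is split equally among the three nonfreezing states. The sketch below reproduces that arithmetic for the twelve prior settings listed in the Methods, plus the flat 0.25 case. It is only an illustration; the function and variable names are hypothetical and the code is separate from the RevBayes scripts used for the analysis.

```python
# Biome codes as defined in the text: tropical/subtropical, warm temperate,
# cold temperate, and cloud forest.
BIOMES = ("T", "W", "Co", "Cl")

# Prior probabilities for a cold temperate root state, as listed in the Methods.
COLD_PRIORS = (0.01, 0.02, 0.05, 0.10, 0.25, 0.40,
               0.60, 0.75, 0.90, 0.95, 0.98, 0.99)


def root_prior(p_cold):
    """Root-state prior with Pr(root = Co) = p_cold.

    The three nonfreezing states (T, W, Cl) each receive (1/3) * (1 - p_cold).
    """
    if not 0.0 <= p_cold <= 1.0:
        raise ValueError("p_cold must lie in [0, 1]")
    rest = (1.0 - p_cold) / 3.0
    return {b: (p_cold if b == "Co" else rest) for b in BIOMES}


# One prior vector per sensitivity setting.
SENSITIVITY_GRID = {p: root_prior(p) for p in COLD_PRIORS}
```

Setting `p_cold = 0.25` recovers the flat prior over the four biomes, which corresponds to the case with no missing data influencing the root state.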
Biome and Biogeography State Frequencies Reconstructed Viburnum lineages resided in different biomes depending on their ages and on the regions that they occupied, all of which varied over time (Fig. 5). Lineages that are 50 Ma or older generally inhabited warm temperate forests, particularly throughout the northern regions, but with no Neotropical representation. High proportions of warm temperate lineages shifted into cold forests between 35 Ma and the present, giving rise to similar proportions of warm and cold temperate lineages in Eastern Asia and Europe today, but more cold than warm temperate lineages in North America (Fig. 5, left). Early Cenozoic occupancy of cold temperate forests is far more prevalent when fossil states are Masked (Supplementary Fig. S7, left, available on Dryad). In SE Asia, we see low but stable proportions of tropical lineages dating back to the Eocene, with a more recent rise in tropical lineages beginning at roughly 20 Ma (Fig. 5, left). The oldest Central American lineages (roughly 30 Ma) could have been anything but tropical, but by 15 Ma, most Central American lineages are cloud forest-adapted. South American lineages are less than 10 Ma, and exclusively inhabit cloud forests. Figure 5. Viburnum biome and biogeography state frequencies through time as estimated from the Complete data set. Subplots in the left column report the frequency of lineages across biome states for all lineages with sampled descendants found within a particular region. Subplots in the right column are similar, except they report regional frequencies across lineages given a particular biome state. Time bins with too few posterior samples to guarantee accurate frequency estimates were marked as empty (see main text). Historically, most warm temperate lineages were found in East and Southeast Asia (Fig. 5, right). We find increased North American warm temperate representation that peaked at roughly 50 Ma and declined toward the present. This signal is absent when fossil biome states are Masked (Supplementary Fig. S7, right, available on Dryad). Although older tropical lineages were found primarily in Eastern Asia, SE Asian tropical diversity expanded steadily after 50 Ma; beginning at 20 Ma, SE Asian tropical diversity exceeds that of Eastern Asia (Fig. 5, right). The proportion of cold temperate lineages has remained fairly stable in Eastern Asia and North America over the past 50 Myr (a ratio around 2:1). Cloud forest lineages probably first appeared in North America, followed by Central America, and most recently South America. Biome-Area Transitions Biome shifts from warm temperate into cold temperate forests have been common and have occurred most often in Eastern Asia, but also in North America (Fig. 6). Cold-to-warm shifts are far less common in our stochastic mappings. Within the cold temperate biome, dispersals from East Asia to North America are most common. Warm temperate Asian lineages gave rise to the tropical Asian lineages. There are some transitions from Eastern Asia into SE Asian tropical forests, but few in the opposite direction. Biome shifts into Neotropical cloud forests along the branches leading to Oreinotinus and to V. elatum of the Lentago clade were from either cold or warm temperate forests, not from tropical forests. Figure 6. 
Open in new tabDownload slide Viburnum biome-area transition graph estimated from the Complete data set. Arrows represent the mean number of evolutionary transitions from one area-biome state into another area-biome state by averaging over stochastic mappings for all posterior samples. Arrow widths are classified based on the posterior mean number of transitions of the corresponding type. Transitions with posterior means less than one are not shown. Transparent ovals correspond to the four biomes: sub/tropical (red), warm temperate (green), cloud (sky blue), and cold temperate (dark blue) forests. Solid nodes correspond to the six regions: Southeast Asia (magenta), East Asia (red), Europe (green), North America (yellow), Central America (cyan), and South America (blue). Note that our model assumes that transitions involve a change in either an area or in a biome, but not both simultaneously. Figure 6. Open in new tabDownload slide Viburnum biome-area transition graph estimated from the Complete data set. Arrows represent the mean number of evolutionary transitions from one area-biome state into another area-biome state by averaging over stochastic mappings for all posterior samples. Arrow widths are classified based on the posterior mean number of transitions of the corresponding type. Transitions with posterior means less than one are not shown. Transparent ovals correspond to the four biomes: sub/tropical (red), warm temperate (green), cloud (sky blue), and cold temperate (dark blue) forests. Solid nodes correspond to the six regions: Southeast Asia (magenta), East Asia (red), Europe (green), North America (yellow), Central America (cyan), and South America (blue). Note that our model assumes that transitions involve a change in either an area or in a biome, but not both simultaneously. Discussion Timing of Events Several previous studies have inferred absolute ages for Viburnum. 
Using four fossils within Dipsacales, Bell and Donoghue (2005) concluded that the Viburnum stem lineage (i.e., the first split within crown Adoxaceae) extended into the late Cretaceous, between 70 and 85 Ma. Using fossil pollen assigned to Valvatotinus, Moore and Donoghue (2007, 2009) inferred a notably young crown age for Viburnum of 28 Ma (HPD95, 17–40 Ma). Using the same fossil pollen, but a greatly expanded molecular data set, Spriggs et al. (2015) found an older Eocene crown age of 55 Ma (HPD95, 46–73 Ma). Based on doubtful Eocene fossil leaves from China (Manchester et al. 2002), Lens et al. (2016) reported a crown age for Viburnum centered on 56 Ma (HPD95, 50–61 Ma). Spriggs et al. (2015) also found that using disputed Viburnum leaves from the Paleocene pushed the crown into the Late Cretaceous (~80 Ma). Our results, based on the most complete species sample and additional fossil pollen grains, agree with a late Cretaceous estimate of 69 Ma (HPD95, 56–80 Ma). Overall, increases in species sampling and additional fossils have steadily pushed back estimates of the crown age, regardless of what time-calibration method was used. That said, our study is the only one of the five to employ the fossilized birth–death process (Heath et al. 2014), which may have uniquely influenced our crown age estimate (Arcila et al. 2015). Regardless of the exact age of the Viburnum crown, all analyses agree that most of the major Viburnum lineages were differentiated in the Eocene, and that much of Viburnum evolution took place from the Miocene onward.

Biogeography

Our biogeographic results are broadly consistent with previous studies of Viburnum, but provide a much more complete account. Winkworth and Donoghue (2005) carried out a parsimony (DIVA; Ronquist 1997) analysis on an undated tree for 41 species based on several chloroplast markers, nrITS, and two copies of the nuclear gene WAXY.
They concluded that Viburnum originated in Asia and that there were 5–6 disjunctions between the Old World and the New World. Most of the differences between our analyses and Winkworth and Donoghue (2005) stem from their limited species sampling. For example, because they did not include the European V. tinus or V. opulus, they identified only two disjunctions between Asia and Europe, as compared with our four. Lens et al. (2016) analyzed Viburnum biogeography using a tree of 97 species based on four cpDNA markers and nrITS sequences obtained mostly from Clement et al. (2014). They reached several significantly different conclusions. Most importantly, they found several transitions from Asia to North America, followed by movements in some cases back into Asia. In contrast, our analyses identified multiple movements into North America, with no subsequent returns. These conflicts reflect differences in the sampling and phylogenetic placement of species, and the scoring of particular tips. For example, their inference that the common ancestor of the large clade that includes Opulus plus Laminotinus (i.e., Imbricotinus; Clement et al. 2014) was confined to North America is strongly influenced by 1) not including the Asian V. sargentii of the Opulus clade, and 2) the nonmonophyly of the Lobata clade, and specifically the placement of the North American V. acerifolium as sister to the rest of an otherwise Asian clade. RAD-seq data very strongly support the monophyly of Lobata, which effectively screens off V. acerifolium from influencing the ancestral area reconstruction. Consequently, we infer that the ancestors of both the Opulus+Laminotinus clade and the Laminotinus clade occupied Asia, and that V. acerifolium therefore entered North America at a later date. Similarly, our placement of V. punctatum (with its SE Asian sister species, V. lepidotulum) as sister to the Lentago clade (see also Eaton et al.
2017) allows us to infer that the ancestor of Valvatotinus occurred in Asia (rather than possibly in both North America and Asia). Finally, our finding that V. dentatum and V. scabrellum are sister to Oreinotinus (rather than nested within it) is consistent with previous studies and implies a much simpler biogeographic scenario (i.e., without a re-entry from the Neotropics into North America). Leaving aside these differences, these analyses all support multiple disjunctions in Viburnum around the Northern Hemisphere (e.g., between Eastern Asia and Eastern North America; Eastern Asia and Europe) that broadly match those seen in many other Laurasian plant clades (Wen 1999; Milne and Abbott 2002; Donoghue and Smith 2004; Milne 2006; Wen et al. 2016; Donoghue 2008; Harris et al. 2013). It is clear that movements through Beringia have been common on-and-off throughout the Cenozoic (Graham 2018), and some shifts in Viburnum over the past 15 Myr from Asia to North America (e.g., in Pseudotinus and Opulus) are probably best explained in this way. However, in the case of V. acerifolium, we cannot rule out movement across the North Atlantic in this same time frame (Tiffney 1985; Denk et al. 2011). Most significantly, here we are able to differentiate between four Asia-North America disjunctions that took place more recently, and two (in Porphyrotinus and Valvatotinus) that took place much earlier and may be better explained by passage across the North Atlantic (a case of pseudo-congruence; Donoghue and Moore 2003). This difference in timing also correlates with diversification. The two clades that entered early-on are the ones that diversified in the New World, whereas the more recent arrivals are represented by just a single species. This would appear to support the thesis that the time available for diversification is a key factor in explaining diversity patterns (e.g., Wiens et al. 2011). 
Although one of the early New World clades, Lentago, contains only eight species, the other, Oreinotinus, has diversified rapidly in a short time period (Spriggs et al. 2015). This clade spread southward, ultimately into South America, and speciation appears to have been elevated owing to isolation in disjunct Neotropical mountain regions. In contrast, the rapid radiation of the Succotinus clade in Asia (Spriggs et al. 2015) appears to have been fostered by disjunctions involving China, Korea, Japan, and Taiwan, coupled with multiple shifts from warmer to colder forests. Ancestral Biomes and Biome Shifts Previous analyses have reached opposite conclusions regarding the biome in which Viburnum initially diversified. Spriggs et al. (2015) supported the view that Viburnum originated in tropical forests, whereas Lens et al. (2016) concluded that it evolved in cold temperate forests. Our analyses strongly support a third interpretation, namely that Viburnum originated in warm temperate (lucidophyllous) forests, and subsequently adapted to both colder and to more tropical climates (Fig. 4). When we collapsed the comparison to warmer forests that experience little freezing versus cold forests with prolonged freezing, we very confidently favored an origin in the warmer forests, and decisively ruled out cold temperate forests as the original biome (Supplementary Fig. S5 available on Dryad). However, with respect to distinguishing warm temperate versus tropical forests, we caution that our analyses do not address the possibility that there have been differential rates of extinction in the different biomes. Specifically, they do not test the hypothesis advanced by Spriggs et al. (2015) that rates of extinction in Viburnum have been higher in tropical forests since the Oligocene. If this was indeed the case, our modern sample would bias against obtaining a tropical ancestry. 
Convincing estimation of ancestral biomes is tricky, especially in light of extinction, but it benefits from the comprehensive sampling achieved here and insights provided by all relevant data sources. Specifically, we highlight that the inclusion of fossil biome assignments has a major impact. Warm temperate forests are favored as ancestral whether the fossils are included (Fig. 4) or not (Supplementary Fig. S4 available on Dryad), but their inclusion does dramatically increase our confidence in this result. Indeed, as our sensitivity analyses show (see above), a substantial increase of freezing-biased “missing evidence” would be needed to erase the influence of known fossil biome states, and even more to achieve significant support for cold forest ancestry. Regarding subsequent biome shifts, it appears that there were multiple shifts into more tropical climates from warm temperate regions, some possibly quite early in Viburnum evolution (V. clemensiae, V. amplificatum, the Punctata clade, the Sambucina+Coriacea clade, and several Lutescentia species). Of these, we note the tropical radiation of the newly recognized Sambucina+Coriacea clade, within which there have been a number of recent speciation events involving disjunctions between the Philippines (V. glaberrimum), Borneo (V. vernicosum and V. hispidulum), and Peninsular Malaysia (V. beccarii). Likewise, there have been multiple shifts from warm temperate into cold temperate forests, scattered throughout the tree: e.g., the Opulus and Pseudotinus clades, V. plicatum within Lutescentia, V. grandiflorum and its relatives within Solenotinus, and multiple instances within Succotinus (Fig. 4). The largest single radiation in cold temperate forests is Euviburnum with 16 species, mostly in central China, but extending along the Himalayas (V. cotinifolium) and into Europe (V. lantana and V. maculatum). Importantly, we inferred no clear-cut reversals from cold forests back into warm or tropical forests.
The Interaction of Geographic Movements and Biome Shifts

The combined analysis of biogeography, biomes, and fossils provides us with several new insights. The most salient case concerns when, and in which biome, the large Porphyrotinus clade entered the New World. Our combined analyses unequivocally support the view that Porphyrotinus arrived in the New World by the Early Eocene. This reflects the presence of Viburnum fossil pollen in the New World in the mid-Eocene, and the placement of the grains from British Columbia (BC) and Colorado (CO) with increased support along the stem of Porphyrotinus (BC+CO+Porphyrotinus, pp = 0.37; BC+Porphyrotinus, pp = 0.15; CO+Porphyrotinus, pp = 0.55). It is likely their shared geography that unites these fossils with Porphyrotinus, as support for these relationships plummets when fossil states are masked (BC+CO+Porphyrotinus, pp = 0.01; BC+Porphyrotinus, pp = 0.06; CO+Porphyrotinus, pp = 0.15). Alternative placements of these pollen grains with East Asian clades would require additional (less probable) dispersal events. Most importantly, our results suggest that the Porphyrotinus lineage originally occupied warm temperate forests. This is significant for two reasons. First, it implies that there were later shifts into cold temperate forests within North America. That is, these plants adapted in situ to the spread of cold climates, as opposed to having arrived from Asia already adapted to them. Second, it implies that warm temperate forests were widespread in the Eocene in North America, whereas today these forests are almost absent from the region except in the SE United States. Less integrated analyses would suggest instead that Porphyrotinus entered the New World already adapted to cold forests. Likewise, our results allow cloud forest plants to have evolved directly from warm temperate plants, whereas less complete analyses would favor their origin from cold temperate plants.
In the case of the lineages that arrived in the New World more recently, our results indicate that these had already adapted to cold climates in the Old World. For example, the eastern North American V. lantanoides is nested within an otherwise Asian clade (Pseudotinus), and its immediate ancestor was presumably already adapted to cold when it spread (Park and Donoghue 2019). The same logic applies to V. acerifolium in Lobata, and to intercontinental movements in the two subclades within Opulus. Overall, as expected given global cooling since the Eocene, the Viburnum occupancy of cold temperate forests has increased through time in North America, as it also has in Asia (owing to multiple evolutionary shifts from warm to cold forests) and in Europe (owing to migration; Figs. 5 and 6). Shifts into tropical forests have occurred in Asia, but never elsewhere. In particular, we note that there have been no transitions into tropical forests from cloud forests in the Neotropics, despite ample opportunity based on the direct adjacency of these biomes throughout that region (Donoghue and Edwards 2014). We note that the model we used to reconstruct biome shifts in this study does not explicitly consider how the evolving geographical distribution of biomes might shape ancestral biome estimates, and instead relies on information drawn from the biome states assigned to fossil taxa. In a related analysis under a phylogenetic model that allows lineages to evolve within a spatiotemporally dynamic paleobiome structure, however, we recovered similar ancestral biome estimates from purely extant Viburnum taxa (Landis et al. 2019). The Evolution of Biome-Related Traits Lens et al. 
(2016) compared Viburnum with its relative Sambucus in order to test the theory (see Carlquist 1975, 2012) that simple perforations in vessel elements in the wood evolved from scalariform perforations in connection with a shift into dry and/or warm areas where there would be selection for greater hydraulic efficiency. They documented scalariform perforations in 17 species of Viburnum and simple perforations in 17 species of Sambucus, and they assessed the contemporary and inferred ancestral climate spaces occupied by these two clades. They found that the modern climates occupied by a selection of Viburnum and Sambucus species are largely overlapping, but they argued that Viburnum evolved initially in colder climates and Sambucus in warmer ones. Specifically, they argued that Viburnum likely evolved in cold temperate forests, and that freezing temperatures in these forests explain why scalariform perforations were retained (i.e., to prevent freezing-induced air embolisms). In stark contrast, we found that Viburnum radiated initially in warm temperate forests that experienced little freezing, and moved multiple times into colder forests. As detailed in Supplement 7 (available on Dryad), this major difference in our two results stems from their limited species sampling and inaccurate climate scores for several key species. Our findings imply that exposure to cold was not the factor that favored the retention of scalariform perforations. It is plausible, however, that they retained ancestral vessel elements simply because they continued to occupy mesic forests (albeit without prolonged freezing) where they did not experience a significant increase in evaporative demand (Carlquist 1975, 2012). 
This does not negate their argument for the evolution of simple perforations in Sambucus (we have not reanalyzed whether this lineage shifted into warmer/drier areas requiring greater hydraulic efficiency), but it does alter their conclusions regarding Viburnum evolution and the presumed benefits of scalariform perforations. Regarding the evolution of other biome-relevant traits, we note that our results broadly support the conclusions of other studies regarding leaf traits. Specifically, our analyses are consistent with the view that the first viburnums had more-or-less entire (smooth) or irregularly toothed leaf margins, and that shifts into colder climates in multiple lineages were accompanied by the evolution of regularly toothed and/or lobed leaves (Schmerler et al. 2012; Edwards et al. 2016; Spriggs, Eaton et al. 2019). Likewise, our findings support multiple shifts from the evergreen habit to the deciduous habit associated with adaptation to climates with prolonged winter freezing (Edwards et al. 2017). We note that there have also been reversals in both characters; for example, evergreen leaves with entire margins evolved in the cloud forests of the Neotropics. Overall, the concordance of biome shifts with biome-related traits reinforces the hypothesis that Viburnum originated in warm forests. Conclusions We draw several general lessons from our analyses. As advocates for comprehensive sampling (Donoghue and Edwards 2019), we have done our best to include all of the Viburnum species that we currently recognize. Yet, it is important to appreciate that we still come up short. First, basic taxonomic issues remain, and we anticipate the discovery of additional Viburnum species as we apply molecular approaches at the population level. A case in point is our resurrection of V. nitidum, which was hidden on the coastal plain of the Southeastern US (Spriggs, Eaton et al. 2019). Second, we are painfully aware of how little we know about extinct viburnums. 
The structure of the tree itself strongly suggests that we are missing such species. Tropical depauperons such as V. clemensiae and V. amplificatum imply past extinctions (Donoghue and Sanderson 2015), and, as Spriggs et al. (2015) argued, rates of extinction in Viburnum may have been especially high in the tropics. What appear to be abrupt major biome transitions (e.g., directly from warm forests into boreal forests in Pseudotinus and Opulus) also suggest that we are missing ecological intermediates. The failure to fully represent these lineages surely biases our studies. Third, it was unclear how to best integrate diverse data sets that vary in terms of species sampling, character completeness, data quality, biological properties, and ease of modeling. During our preliminary analyses, for instance, we found that our newer RAD-seq data set was reliable for estimating tree topology (Eaton et al. 2017), but the large numbers of short loci were not easily modeled using standard divergence time estimation methods. The longer loci from our species-rich cpDNA data set, on the other hand, could be used to estimate node ages, even if cpDNA alone is not ideal for estimating species relationships at all timescales. These data had complementary strengths for estimating the evolution of Viburnum, so no hard-won data were discarded. Moreover, reusing our older (but here expanded) cpDNA data set allowed us to draw stronger conclusions regarding the effect of the data on the inference (Supplement 7 available on Dryad). Like the molecular data sets, incorporating the fossil, biome, and biogeography data sets each presented its own challenges, which we discuss below. In Viburnum, as in most other clades, critically identified fossils remain rare. Fortunately, in this case, there are a few fossils, and we have tried to make the most of the limited evidence that these provide. 
For example, we developed an “either-or” clade constraint to ensure that our Icelandic fossil formed a clade with either of two nonsister extant clades. Without this constraint, we would have needed to force a relationship with Euviburnum or with Lentago, or to relax the constraint to allow the unfounded placement of the fossils among the lineages that separate Euviburnum from Lentago, or to neglect the fossil altogether. In connection with this development, we added functionality to RevBayes to apply multiple backbone constraints to overlapping taxon sets, which was necessary to infer the relationships of the morphology-only species. Including fossil taxa with their geographic and biome states has two effects: it helps affiliate a fossil taxon with particular extant clades, and it tethers the ancestral condition of these clades through those affiliations. As discussed above, we saw this especially in the case of the Porphyrotinus clade. If, instead, we had adopted a sequential inference strategy that first estimated the tree distribution and then estimated ancestral states across that distribution, we would likely have placed the British Columbia and Colorado fossils apart from Porphyrotinus in over 75% of sampled phylogenies, which would proportionally reduce support for Porphyrotinus entering North America during the Eocene. These rogue fossils would in turn diffusely inflate support for North American and warm temperate ancestry throughout the tree. Our joint inference framework avoids this issue, illustrating that biogeography has another promising role to play in phylogenetic inference (Landis 2017, Landis et al. 2018). On the other hand, we recognize that there are potential biases in the sampling of fossils. Our fossil pollen was obtained from predominantly warm temperate plant assemblages in North American and European localities, and none were sampled from East Asia, where Viburnum probably originated. 
By treating the prior root state frequencies as a surrogate for missing data in support for a particular root state, we developed a simple Bayesian sensitivity test to measure what quantity of newly discovered evidence would be needed to recover a cold temperate ancestral state estimate (Supplementary Fig. S6 available on Dryad). This revealed that we would require new evidence equivalent to a 3-fold increase in prior weight for freezing forests to attain majority support (pp > 0.5) for a freezing root state. One way to interpret this result is that it would require the discovery of 12 new cold temperate fossils from the Paleocene or Eocene (i.e., three times the four warm temperate fossil pollen samples that we have from the Early Cenozoic), while excluding any new discoveries of fossils from nonfreezing biomes. Overall, our sensitivity test echoes earlier findings that taxon sampling really matters (Heath et al. 2008; Wood et al. 2012; Betancur-R et al. 2015; Wright et al. 2015). Future sensitivity studies might examine method behavior while expanding the existing extant and fossil taxa with simulated fossil taxa under a panel of geography- and biome-dependent biases. We are also keenly aware of the need to propagate uncertainty throughout the joint Bayesian inference of evolutionary history. Among other things, this creates new challenges for visualization. Ancestral state trees are typically plotted against a single (consensus) phylogeny (Figs. 3A and 4A), which does not capture the vast space of possible evolutionary histories (Figs. 3B and 4B). We complemented our phylogeny-based view of range and biome evolution with plots of state frequencies through time that marginalized over various sources of uncertainty, including topology among fossil and extant taxa, divergence times, stochastic mappings, and evolutionary parameters (represented in Figs. 3B and 4B).
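To make the idea of a state-frequency summary for one time bin concrete, the following is a minimal Python sketch under simplifying assumptions of our own; it is not the plotting code used in the study. The function name `state_frequencies`, the `min_samples` threshold, and the toy input are all hypothetical. The sketch averages per-sample proportions across posterior samples, which is only one of several reasonable ways to marginalize, and it returns `None` for a bin with too few usable samples, mirroring the "empty" bins mentioned in the Figure 5 caption.

```python
from collections import Counter


def state_frequencies(samples, min_samples=2):
    """Average per-sample state proportions for a single time bin.

    samples: list of posterior samples; each sample is a list of state
        labels (e.g., biome names), one per lineage alive in the bin.
    min_samples: hypothetical threshold; bins with fewer usable samples
        are treated as empty and None is returned.
    """
    # Ignore posterior samples that have no lineages in this bin.
    usable = [sample for sample in samples if sample]
    if len(usable) < min_samples:
        return None

    totals = Counter()
    for sample in usable:
        counts = Counter(sample)
        for state, n in counts.items():
            # Proportion of lineages in this state within one sample.
            totals[state] += n / len(sample)

    # Mean proportion across usable posterior samples.
    return {state: total / len(usable) for state, total in totals.items()}


if __name__ == "__main__":
    # Toy input: two posterior samples, four lineages each (made-up states).
    toy = [
        ["warm", "warm", "cold", "cold"],
        ["warm", "cold", "cold", "cold"],
    ]
    print(state_frequencies(toy))
```

Restricting the input to lineages found in one region would give a biome-given-region summary, and restricting it to lineages in one biome while passing region labels would give the complementary region-given-biome summary.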
Rather than generating M × N = 24 plots for all pairs of M = 4 biomes and N = 6 regions, we instead developed M + N = 10 conditional plots to highlight how the occupancy of biomes and regions co-evolved in Viburnum. By averaging over uncertainty in this way, we gain a clearer summary of evolutionary trends, independent of any specific phylogenetic hypothesis. For example, we are able to conclude that the proportion of cold temperate viburnums exceeded 25% in East Asia, Europe, and North America starting in the Miocene (Fig. 5). The combined evidence approach developed here has allowed us to integrate molecular data, fossils, biogeographic ranges, and biome affinities, and this has provided a new and greatly improved understanding of the geographic and ecological radiation of a plant clade throughout the Cenozoic. But there are clearly other relevant lines of evidence, and we look forward to their incorporation as well. For example, in addition to the leaf and wood traits mentioned above, we are optimistic about more fully and directly incorporating physiological and phenological data for Viburnum that bear on biome occupancy (e.g., Chatelet et al. 2013; Edwards et al. 2014; Scoffoni et al. 2016). Recent work to estimate ancestral biome affinities within an evolving spatial distribution of biomes through time also speaks to the warm or cold origin of Viburnum (Landis et al. 2019). Unifying these insights using a joint inference framework, like the one we have developed here, will be important in future studies. In the end, it is the alignment of these accumulating lines of evidence that gives us confidence that we are on the right track.

Supplementary Material

Data and Supplementary Materials are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.hx3ffbgb1.
The Supplementary Materials are organized into the following sections: (1) Background information on Viburnum, (2) RAD-seq data, (3) Chloroplast DNA data, (4) Placement of unsequenced species based on morphological traits, (5) Fossil pollen morphology and biome assignments, (6) Bayesian phylogenetic analysis, (7) Extended discussion of phylogenetic and ancestral biome estimates, and (8) Supplementary Figures (S1 through S10). Raw RAD-seq data have been submitted to NCBI SRA under project number PRJNA605569. RevBayes scripts, R plotting scripts, and output files are available through http://github.com/mlandis/vib_div.

Acknowledgments

We are grateful to Rachel Warnock, Hervé Sauquet, Alexandre Antonelli, and Bryan Carstens for their insightful reviews, which helped us improve the clarity and broaden the scope of our manuscript. Undergraduate researchers Amanda Goble and Patrick Gallagher were instrumental in generating new Sanger sequences for the expanded plastid dataset. Our studies have relied heavily on specimens housed in many herbaria, large and small. We are extremely grateful for the careful stewardship (and, increasingly, the digitization) of these priceless collections, and to the many collectors of these specimens over the centuries. Our work has also built on our field studies in over a dozen countries in the past two decades (Vietnam, Malaysia, Indonesia, the Philippines, China, Taiwan, Japan, India, Georgia, Colombia, Ecuador, Peru, Bolivia, and Mexico), and we are forever indebted to the many colleagues who have made these trips possible and so much fun. Finally, we wish to express our gratitude to the staffs of the Arnold Arboretum of Harvard University and the Harvard University Herbaria, where we have spent countless hours studying our favorite plants.

Funding

M. J. L. was supported by an NSF Postdoctoral Fellowship in Biology (DBI-1612153) and a Donnelley Postdoctoral Fellowship through the Yale Institute for Biospheric Studies.
Our field studies have been funded in part through the Division of Botany of the Yale Peabody Museum of Natural History. We are especially grateful to generous support through a series of NSF awards: IOS-0842800, IOS-0843231, IOS-1256706, IOS-1257262, DEB-1145606, DEB-1026611, and, most recently, DEB-1557059 and DEB-1753504.

References

Arcila D., Pyron R.A., Tyler J.C., Ortí G., Betancur-R R. 2015. An evaluation of fossil tip-dating versus node-age calibrations in tetraodontiform fishes (Teleostei: Percomorphaceae). Mol. Phylogenet. Evol. 82:131–145.
Bell C.D., Donoghue M.J. 2005. Dating the Dipsacales: comparing models, genes, and evolutionary implications. Am. J. Bot. 92:284–296.
Betancur-R R., Orti G., Pyron R.A. 2015. Fossil-based comparative analyses reveal ancient marine ancestry erased by extinction in ray-finned fishes. Ecol. Lett. 18:441–450.
Bouchal J.M. 2013. The microflora of the uppermost Eocene (Preabonian) Florissant Formation, a combined method approach [MS thesis]. University of Vienna.
Cardillo M., Weston P.H., Reynolds Z.K.M., Olde P.M., Mast A.R., Lemmon E.M., Lemmon A.R., Bromham L. 2017. The phylogeny and biogeography of Hakea (Proteaceae) reveals the role of biome shifts in a continental plant radiation. Evolution 71:1928–1943.
Carlquist S. 1975. Ecological strategies of xylem evolution. Berkeley, CA: University of California Press.
Carlquist S. 2012. How wood evolves: a new synthesis. Botany 90:901–940.
Chatelet D.S., Clement W., Sack L., Donoghue M.J., Edwards E.J. 2013. The evolution of photosynthetic anatomy in Viburnum (Adoxaceae). Int. J. Plant Sci. 174:1277–1291.
Clement W.L., Arikake M., Sweeney P., Edwards E.J., Donoghue M.J. 2014. A chloroplast tree for Viburnum (Adoxaceae) and its implications for phylogenetic classification and character evolution. Am. J. Bot. 101:1029–1049.
Clement W.L., Donoghue M.J. 2011. Dissolution of Viburnum section Megalotinus (Adoxaceae) of Southeast Asia and its implications for morphological evolution and biogeography. Int. J. Plant Sci. 172:559–573.
Denk T., Grímsson F., Zetter R., Símonarson L.A. 2011. Late Cainozoic Floras of Iceland: 15 million years of vegetation and climate history in the Northern North Atlantic. Topics in Geobiology, Vol. 35. Dordrecht: Springer Netherlands.
Donoghue M.J. 1985. Pollen diversity and exine evolution in Viburnum and the Caprifoliaceae sensu lato. J. Arnold Arb. 66:421–469.
Donoghue M.J. 2008. A phylogenetic perspective on the distribution of plant diversity. Proc. Nat. Acad. Sci. USA 105:11549–11555.
Donoghue M.J., Baldwin B.G., Li J., Winkworth R.C. 2004. Viburnum phylogeny based on the chloroplast trnK intron and nuclear ribosomal ITS DNA sequences. Syst. Bot. 29:188–198.
Donoghue M.J., Edwards E.J. 2014. Biome shifts and niche evolution in plants. Annu. Rev. Ecol. Evol. Syst. 45:547–572.
Donoghue M.J., Edwards E.J. 2019. Model clades are vital for comparative biology, and ascertainment bias is not a problem in practice: a response to Beaulieu and O’Meara (2018). Am. J. Bot. 106:327–330.
Donoghue M.J., Moore B.R. 2003. Toward an integrative historical biogeography. Integr. Comp. Biol. 43:261–270.
Donoghue M.J., Sanderson M.J. 2015. Confluence, synnovation, and depauperons in plant diversification. New Phytol. 207:260–274.
Donoghue M.J., Smith S.A. 2004. Patterns in the assembly of temperate forests around the Northern Hemisphere. Phil. Trans. Roy. Soc. London B 359:1633–1644.
Eaton D.A.R. 2014. PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics 30:1844–1849.
Eaton D.A.R., Spriggs E.L., Park B., Donoghue M.J. 2017. Misconceptions on missing data in RAD-seq phylogenetics with a deep-scale example from flowering plants. Syst. Biol. 66:399–412.
Edwards E.J., Chatelet D., Sack L., Donoghue M.J. 2014. Leaf life span and the leaf economic spectrum in the context of whole plant architecture. J. Ecol. 102:328–336.
Edwards E.J., Chatelet D.S., Chen B.C., Ong J.Y., Tagane S., Kanemitsu H., Tagawa K., Teramoto K., Park B., Chung K.F., Hu J.M., Yahara T., Donoghue M.J. 2017. Convergence, consilience, and the evolution of the temperate deciduous forests. Am. Nat. 190:S87–S104.
Edwards E.J., Spriggs E.L., Chatelet D.S., Donoghue M.J. 2016. Unpacking a century-old mystery: winter buds and the latitudinal gradient in leaf form. Am. J. Bot. 103:975–978.
Gagnon E., Ringelberg J.J., Bruneau A., Lewis G.P., Hughes C.E. 2019. Global succulent biome phylogenetic conservatism across the pantropical Caesalpinia Group (Leguminosae). New Phytol. 222:1994–2008.
Graham A. 2018. The role of land bridges, ancient environments, and migrations in the assembly of the North American flora. J. Syst. Evol. 56:405–429.
Greenwood D.R., Basinger J.F., Smith R.Y. 2010. How wet was the Arctic Eocene rain forest? Estimates of precipitation from Paleogene Arctic macrofloras. Geology 38:15–18.
Gruas-Cavagnetto C. 1978. Étude palynologique de L’Éocène du Bassin Anglo-Parisien. Mémoires de la Société Géologique de France (Nouvelle Série) 131:1–64, 15 plates.
Harris A.J., Wen J., Xiang Q.-Y. 2013. Inferring the biogeographic origins of inter-continental disjunct endemics using a Bayes-DIVA approach. J. Syst. Evol. 51:117–133.
Heath T.A., Hedtke S.M., Hillis D.M. 2008. Taxon sampling and the accuracy of phylogenetic analyses. J. Syst. Evol. 46:239–257.
Heath T.A., Huelsenbeck J.P., Stadler T. 2014. The fossilized birth–death process for coherent calibration of divergence-time estimates. Proc. Natl. Acad. Sci. USA 111:E2957–E2966.
Höhna S., Landis M.J., Heath T.A., Boussau B., Lartillot N., Moore B.R., Huelsenbeck J.P., Ronquist F. 2016. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Syst. Biol. 65:726–736.
Huelsenbeck J.P., Nielsen R., Bollback J.P. 2003. Stochastic mapping of morphological characters. Syst. Biol. 52:131–158.
Landis M.J. 2017. Biogeographic dating of speciation times using paleogeographically informed processes. Syst. Biol. 66:128–144.
Google Scholar PubMed OpenURL Placeholder Text WorldCat Landis M.J. , Edwards E.J., Donoghue M.J. Forthcoming 2019 . Modeling phylogenetic biome shifts on a planet with a past . Syst. Biol. 70(1):95–116. Google Scholar OpenURL Placeholder Text WorldCat Landis M.J. , Freyman W.A., Baldwin B.G. 2018 . Retracing the Hawaiian silversword radiation despite phylogenetic, biogeographic, and paleogeographic uncertainty . Evolution. 72 : 2343 – 2359 . Google Scholar Crossref Search ADS PubMed WorldCat Lens F. , Vos R.A., Charrier G., van der Niet T., Merckx V., Bass P., Gutierrez J.A., Jacobs B., Doria L.C., Smets E., Delzon S., Janssens S.B. 2016 . Scalariform-to-simple transition in vessel perforation plates triggered by differences in climate during the evolution of Adoxaceae . Ann. Bot. 118 : 1043 – 1056 . Google Scholar Crossref Search ADS PubMed WorldCat Manchester S.R. 2002 . Leaves and fruits of Davidia (Cornales) from the Paleocene of North America . Syst. Bot. 27 : 368 – 382 . Google Scholar OpenURL Placeholder Text WorldCat Manchester S.R. , Akhmetiev M.A., Kodrul T.M. 2002 . Leaves and fruits of Celtis aspera (Newberrt) comb . nov. (Celtidaceae) from the Paleocene of North America and Eastern Asia. Int. J. Plant Sci. 163 : 725 – 736 . Google Scholar OpenURL Placeholder Text WorldCat Manchester S.R. , Grímmson F., Zetter R. 2015 . Assessing the fossil record of Asterids in the context of our current phylogenetic framework . Ann. Mo. Bot. Gard. 100 : 329 -363. Google Scholar Crossref Search ADS PubMed WorldCat Mao K. , Milne R.I., Zhang L., Peng Y., Liu J., Thomas P., Mill R.R., Renner S.S. 2012 . Distribution of living Cupressaceae reflects the breakup of Pangea . Proc. Natl. Acad. Sci. USA 109 : 7793 – 7798 . Google Scholar Crossref Search ADS WorldCat McIntyre D.J. 1991 . Pollen and spore flora of an Eocene forest, eastern Axel Heiberg Island, N.W.T . In: Christie R.L., McMillan N.J., editors. Tertiary Fossil Forests of the Geodetic Hills Axel Heiberg Island . 
Arctic Archipelago : Geological Survey of Canada Bulletin 403 : p. 83 – 97 . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Meseguer A.S. , Lobo J.M., Ree R., Beerling D.J., Sanmartin I. 2015 . Integrating fossils, phylogenies, and niche models into biogeography to reveal ancient evolutionary history: the case of Hypericum (Hypericaceae) . Syst. Biol. 2 : 215 – 232 . Google Scholar Crossref Search ADS WorldCat Milne R.I. 2006 . Northern hemisphere plant disjunctions: a window on tertiary land bridges and climate change? Ann. Bot. 98 : 465 – 472 . Google Scholar Crossref Search ADS PubMed WorldCat Milne R.I. and Abbott R.J. 2002 . The origin and evolution of Tertiary relict floras . Adv. Bot. Res. 38 : 281 – 314 . Google Scholar Crossref Search ADS WorldCat Moore B.R. , Donoghue M.J. 2007 . Correlates of diversification in the plant clade Dipsacales: geographic movement and evolutionary innovations . Am. Nat. 170 : S28 – S55 . Google Scholar Crossref Search ADS PubMed WorldCat Moore B.A. , Donoghue, M.J. 2009 . A Bayesian approach for evaluating the impact of historical events on rates of diversification . Proc. Nat. Acad. Sci. USA 106 : 4307 – 4312 . Google Scholar Crossref Search ADS WorldCat Moss P.T. , Greenwood D.R., Archibald S.B. 2005 . Regional and local vegetation community dynamics of the Eocene Okanagan Highlands (British Columbia — Washington State) from palynology . Can. J. Earth Sci. 42 : 187 – 204 . Google Scholar Crossref Search ADS WorldCat Moura M. , Carine M.A., Malecot V., Lourenco P., Scaefer H., Silva L. 2015 . A taxonomic reassessment of Viburnum (Adoxaceae) in the Azores . Phytotaxa. 210 : 4 – 023 . Google Scholar Crossref Search ADS WorldCat Nylander J.A.A. , Ronquist F., Huelsenbeck J.P., Nieves-Aldrey J. 2004 . Bayesian phylogenetic analysis of combined data . Syst. Biol. 53 : 47 – 67 . Google Scholar Crossref Search ADS PubMed WorldCat Park B. , Donoghue M.J. 2019 . 
Phylogeography of a widespread Eastern North American shrub, Viburnum lantanoides (Adoxaceae) . Am. J. Bot. 106 : 389 – 401 . Google Scholar Crossref Search ADS PubMed WorldCat Park B. , Donoghue M.J. Forthcoming 2020 . Phylogenomic insights into the independent origins of sterile marginal flowers in Viburnum (Adoxaceae) . Am. J. Bot. Google Scholar OpenURL Placeholder Text WorldCat Pyron R.A. 2011 . Divergence time estimation using fossils as terminal taxa and the origins of Lissamphibia . Syst. Biol. 60 : 466 – 481 . Google Scholar Crossref Search ADS PubMed WorldCat Ree H.R. , Moore B.R., Webb C., Donoghue M.J. 2005 . A likelihood framework for inferring the evolution of geographic range on phylogenetic trees . Evolution. 59 : 2299 – 2311 . Google Scholar Crossref Search ADS PubMed WorldCat Ree R.H. , Smith S.A. 2008 . Maximum likelihood inference of geographic range evolution by dispersal, local extinction, and cladogenesis . Syst. Biol. 57 : 4 -14. Google Scholar Crossref Search ADS PubMed WorldCat Ronquist F. 1997 . Dispersal-vicariance analysis: a new approach to the quantification of historical biogeography . Syst. Biol. 46 : 195 – 203 . Google Scholar Crossref Search ADS WorldCat Ronquist F. , Klopfstein S., Vilhelmsen L., Schulmeister S., Murray D.L., Rasnitsyn A.P. 2012 . A total-evidence approach to dating with fossils, applied to the early radiation of the Hymenoptera . Syst. Biol. 61 : 973 – 999 . Google Scholar Crossref Search ADS PubMed WorldCat Schmerler S. , Clement W., Beaulieu J., Chatelet D., Sack L., Donoghue M.J., Edwards E. 2012 . Evolution of leaf form correlates with tropical-temperate transitions in Viburnum (Adoxaceae) . Proc. Royal Soc. B: Biol. Sci. 279 : 3905 – 3913 . Google Scholar OpenURL Placeholder Text WorldCat Scoffoni C. , Chatelet D.S., Pasquet-kok J., Rawls M., Donoghue M.J., Edwards E.J., Sack L. 2016 . Hydraulic basis for the evolution photosynthetic productivity . Nat. Plants 2 : 16072 . 
Google Scholar Crossref Search ADS PubMed WorldCat Sinnott-Armstrong M.A. , Lee C., Clement W.L., Donoghue M.J. 2020 . Fruit syndromes in Viburnum: correlated evolution of color, nutritional content, and morphology in bird-dispersed fleshy fruits . BMC Evol. Biol. 20 : 1 – 19 . Google Scholar Crossref Search ADS PubMed WorldCat Spriggs E.L. , Clement W.L., Sweeney P.W., Madriñán S., Edwards E.J, and Donoghue M. J.. 2015 . Temperate radiations and dying embers of a tropical past: evidence from Viburnum diversification . New Phytol. 207 : 340 – 354 . Google Scholar Crossref Search ADS PubMed WorldCat Spriggs E.L. , Eaton D.A.R., Sweeney P.W., Schlutius C., Edwards E.J., Donoghue M.J. 2019 . RAD-seq data reveal a cryptic Viburnum species on the North American Coastal Plain . Syst. Biol. 68 : 187 – 203 . Google Scholar Crossref Search ADS PubMed WorldCat Spriggs, E.L. , Schlutius, C., Eaton, D.A., Park, B., Sweeney, P.W., Edwards, E.J., Donoghue, M.J. ( 2019 ). Differences in flowering time maintain species boundaries in a continental radiation of Viburnum . Am. J. Bot. 106 : 833 – 849 . Google Scholar Crossref Search ADS PubMed WorldCat Spriggs E.L. , Schmerler S.B., Edwards E.J., Donoghue M.J. 2018 . Leaf form evolution in Viburnum parallels variation within individual plants . Am. Nat. 191 : 235 – 249 . Google Scholar Crossref Search ADS PubMed WorldCat Stamatakis A. 2014 . RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies . Bioinformatics. 30 : 1312 – 1313 . Google Scholar Crossref Search ADS PubMed WorldCat Tavaré S. 1986 . Some probabilistic and statistical problems in the analysis of DNA sequences . Lect. Math. Life Sci. 17 : 57 – 86 . Google Scholar OpenURL Placeholder Text WorldCat Thompson S.K. 1987 . Sample size for estimating multinomial proportions . The Am. Stat. 41 : 42 – 46 . Google Scholar OpenURL Placeholder Text WorldCat Tiffney B.H. 1985 . 
The Eocene North Atlantic land bridge: its importance in Tertiary and modern phylogeography of the Northern Hemisphere . J. Arnold Arb. 66 : 243 – 273 . Google Scholar Crossref Search ADS WorldCat Weber M.G. , Donoghue M.J., Clement W.L., Agarwal A.A. 2012 . Phylogenetic and experimental tests of interactions among mutualistic plant defense traits in Viburnum (Adoxaceae) . Am. Nat. 180 : 450 – 463 . Google Scholar Crossref Search ADS PubMed WorldCat Weeks A. , Zapata F., Pell S.K., Daly D.C., Mitchell J.D., Fine P.V.A. 2014 . To move or to evolve: contrasting patterns of intercontinental connectivity and climatic niche evolution in “Terebinthaceae” (Anacardiaceae and Burseraceae) . Front. Genet. 5 : 409 . Google Scholar Crossref Search ADS PubMed WorldCat Wen J. 1999 . Evolution of eastern Asian and eastern North American disjunct distributions in flowering plants . Ann. Rev. Ecol. Syst. 30 : 421 – 455 . Google Scholar Crossref Search ADS WorldCat Wen J. , Nie Z.-L., Ickert-Bond S.M. 2016 . Intercontinental disjunctions between eastern Asia and western North America in vascular plants highlight the biogeographic importance of the Bering land bridge from late Cretaceous to Neogene . J. Syst. Evol. 54 : 469 – 490 . Google Scholar Crossref Search ADS WorldCat Wiens J.J. , Pyron, R.A., Moen D.C. 2011 . Phylogenetic origins of local-scale diversity patterns and the causes of Amazonian megadiversity . Ecol. Lett. 14 : 643 – 652 . Google Scholar Crossref Search ADS PubMed WorldCat Wilkinson M. 1996 . Majority-rule reduced consensus trees and their use in bootstrapping . Mol. Biol. Evol. 13 : 437 – 444 . Google Scholar Crossref Search ADS PubMed WorldCat Winkworth R.C. , Donoghue M.J. 2004 . Viburnum phylogeny: evidence from the duplicated nuclear gene GBSSI . Mol. Phyl. Evol. 33 : 109 – 126 . Google Scholar Crossref Search ADS WorldCat Winkworth R.C. , Donoghue M.J. 2005 . Viburnum phylogeny based on combined molecular data: implications for taxonomy and biogeography . 
Am. J. Bot. 92 : 653 – 666 . Google Scholar Crossref Search ADS PubMed WorldCat Wood H.M. , Matzke N.J., Gillespie R.G., Griswold C.E. 2012 . Treating fossils as terminal taxa in divergence time estimation reveals ancient vicariance patterns in the Palpimanoid spiders . Syst. Biol. 62 : 264 – 284 . Google Scholar Crossref Search ADS PubMed WorldCat Wright A.M. , Lyons K.M., Brandley M.C., Hillis D.M. 2015 . Which came first: the lizard or the egg? Robustness in phylogenetic reconstruction of ancestral states . J. Exp. Zool. Part B: Mol. Dev. Evol. 324 : 504 – 516 . Google Scholar Crossref Search ADS WorldCat Yang Z. 1994 . Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods . J. Mol. Evol. 39 : 306 – 314 . Google Scholar Crossref Search ADS PubMed WorldCat Zaborac-Reed S.J. , Leopold E.B. 2016 . Determining the paleoclimate and elevation of the late Eocene Florissant flora: support from the coexistence approach . Can. J. Earth Sci. 53 : 565 – 573 . Google Scholar Crossref Search ADS WorldCat © The Author(s) 2020. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please email: [email protected] This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Defining Species When There is Gene Flow
Jiao, Xiyun; Yang, Ziheng
doi: 10.1093/sysbio/syaa052 pmid: 32617579
Abstract
Whatever one’s definition of species, it is generally expected that individuals of the same species should be genetically more similar to each other than they are to individuals of another species. Here, we show that in the presence of cross-species gene flow, this expectation may be incorrect. We use the multispecies coalescent model with continuous-time migration or episodic introgression to study the impact of gene flow on genetic differences within and between species and highlight a surprising but plausible scenario in which different population sizes and asymmetrical migration rates cause a genetic sequence to be on average more closely related to a sequence from another species than to a sequence from the same species. Our results highlight the extraordinary impact that even a small amount of gene flow may have on the genetic history of the species. We suggest that contrasting long-term migration rate and short-term hybridization rate, both of which can be estimated using genetic data, may be a powerful approach to detecting the presence of reproductive barriers and defining species boundaries. [Gene flow; introgression; migration; multispecies coalescent; species concept; species delimitation.]
Introduction
The concept of species is a controversial one, with a number of definitions proposed in the literature (Mallet 2013; Zachos 2016). The biological species concept emphasizes reproductive isolation, although low levels of cross-species gene flow are tolerated in modern versions of the concept (Coyne and Orr 2004). The lineage species concept considers species as independently evolving lineages (De Queiroz 2007). Despite the differences in species definitions, it is generally expected that an individual is genetically more closely related to an individual of the same species than to an individual of a different species. Here, we may measure genetic relatedness in two ways.
First, if we sample two sequences |$a_1$| and |$a_2$| from species |$A$| and one sequence |$b$| from species |$B$|, we expect the average sequence distances to satisfy |$\mathbb{E}(t_{aa}) < \mathbb{E}(t_{ab})$|. Second, we expect gene tree |$G_1 = ((a_1, a_2), b)$| to occur with a higher probability than gene trees |$G_2 = ((b, a_1), a_2)$| or |$G_3 = ((b, a_2), a_1)$|. Two approaches to identifying and delimiting species make use of those expectations explicitly. First, DNA barcoding is a fast approach to species identification and is occasionally applied to species delimitation as well (Hebert et al. 2003). A genetic-distance threshold or “barcoding gap” based on a universal marker (such as mitochondrial cytochrome oxidase 1 or cytochrome b) is used to distinguish within- and between-species divergences. A query specimen is assigned to an existing species in the database if the sequence distance between the query and the sequences in the library is smaller than the threshold. Otherwise, the specimen is considered a new species not yet represented in the library. The threshold may be arbitrary (Hebert et al. 2003) or estimated from a database by minimizing assignment errors (Meyer and Paulay 2005; Puillandre et al. 2012). The use of one barcoding threshold for different species in the database may lead to errors of identification when different species have very different population sizes and divergence times (e.g., Hudson and Turelli 2003; Meyer and Paulay 2005; Dasmahapatra et al. 2010; Yang and Rannala 2017). Here, we emphasize the fact that barcoding methods rely on a distance threshold, with the expectation that within-species sequence divergence is smaller than between-species divergence, |$\mathbb{E}(t_{aa}) < \mathbb{E}(t_{ab})$|. Second, the recently developed genealogical divergence index or gdi (Jackson et al.
2017) is a simple method for fast species delimitation, useful for generating hypotheses of species status for systematic evaluations integrating different sources of information. The gdi is a linear transform on |$\mathbb{P}(G_1)$|. Two populations are considered distinct species if |$\mathbb{P}(G_1) > 0.8$| or one single species if |$\mathbb{P}(G_1) < 0.47$|, with the species status undecided if |$\mathbb{P}(G_1)$| falls between the two limits. It is expected that |$\mathbb{P}(G_1) > \frac{1}{3} > \mathbb{P}(G_2) = \mathbb{P}(G_3)$|. The two expectations, |$\mathbb{E}(t_{aa}) < \mathbb{E}(t_{ab})$| and |$\mathbb{P}(G_1) > \mathbb{P}(G_2)$|, are correct if isolation is complete and there is no cross-species gene flow (Fig. 1A). When we trace the genealogical history of sequences |$a_1, a_2$|, and |$b$| backwards in time, there is the possibility that sequences |$a_1$| and |$a_2$| coalesce before reaching the common ancestor |$R$| (Fig. 1A). If this happens, the gene tree will be |$G_1$|; otherwise the three possible gene trees will occur with equal probability. Thus |$\mathbb{P}(G_1) > \mathbb{P}(G_2) = \mathbb{P}(G_3)$|. Similarly, if sequences |$a_1$| and |$a_2$| coalesce in species |$A$|, they will have a shorter expected distance than between species; otherwise their distance will be the same as between species. Thus, we expect |$\mathbb{E}(t_{aa}) < \mathbb{E}(t_{ab})$|.
Figure 1. A) The multispecies coalescent (MSC) model for two species (|$A$| and |$B$|) with four parameters shown in the inset (species divergence time |$\tau_R = \tau_{AB} = \tau$| and three population size parameters: |$\theta_A$|, |$\theta_B$|, and |$\theta_R$|). Both |$\theta$| and |$\tau$| are measured in the number of mutations per site. Two gene trees for three sequences (|$a_1$| and |$a_2$| from |$A$| and |$b$| from |$B$|) are shown inside the species tree. If sequences |$a_1$| and |$a_2$| coalesce in species |$A$|, the gene tree will be |$G_1 = ((a_1, a_2), b)$|; otherwise all three sequences will enter species |$R$| and the three gene trees |$G_1$|, |$G_2 = ((b, a_1), a_2)$| and |$G_3 = ((b, a_2), a_1)$| will occur with equal probability (|$\frac{1}{3}$| each). B) The MSC model with migration (the IM model) and C) the MSC model with introgression (the MSci model), with 5 and 8 parameters, respectively, shown in the inset. Under the IM model, the migration rate |$M = M_{BA} = N_A m_{BA}$| is the expected number of |$B \to A$| migrants in species |$A$| per generation, with |$m_{BA} = m$| to be the proportion of immigrants (from species |$B$|) in species |$A$|. Under the MSci model, |$\tau_H= \tau_S$| while |$\varphi$| is the introgression probability. Under both the IM and MSci models, there are multiple scenarios under which gene tree |$G_1$| may occur. For example, in the dashed red tree, |$a_1$| and |$a_2$| coalesce in species |$A$|, while in the solid green tree, |$a_1$| and |$a_2$| migrate (backwards in time) into species |$B$| and then coalesce in species |$B$|.
However, it is not so clear whether the expectations are correct when there is cross-species gene flow. In this article, we study the impact of gene flow on genetic divergences within and between species, by using the Markov chain characterization of the process of coalescent and migration developed in the structured coalescent framework in population genetics (Notohara 1990; Wilkinson-Herbots 2008). We demonstrate that with different population sizes (and thus coalescent rates) and asymmetrical migration rates, it is possible for a gene sequence to be on average more distant from another sequence of the same species than from a sequence randomly sampled from another species. We refer to the region of the parameter space in which |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$| or |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$| as the species-definition anomaly zone, similar to the species-tree anomaly zone discussed by Degnan and Rosenberg (2006).
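The no-gene-flow baseline described for Figure 1A can be sketched numerically. The snippet below is only an illustration under stated assumptions: it uses the coalescent rate |$2/\theta_A$| for a pair of sequences (time measured in expected mutations per site), so that |$a_1$| and |$a_2$| coalesce in |$A$| before time |$\tau$| with probability |$1 - \textrm{e}^{-2\tau/\theta_A}$|. The function names are hypothetical, and the specific linear transform used for the gdi, |$(3\mathbb{P}(G_1) - 1)/2$|, is an assumption that is not spelled out in the text; it is chosen because it maps the quoted limits 0.8 and 0.47 to roughly 0.7 and 0.2.

```python
import math


def prob_g1_no_gene_flow(tau, theta_a):
    """P(G1) under the MSC model of Fig. 1A (complete isolation).

    tau and theta_a are both in expected mutations per site.
    """
    # Probability that a1 and a2 coalesce in species A before time tau.
    p_coal_in_a = 1.0 - math.exp(-2.0 * tau / theta_a)
    # Otherwise all three sequences enter R and each tree has probability 1/3.
    return p_coal_in_a + (1.0 - p_coal_in_a) / 3.0


def gdi_from_prob_g1(p_g1):
    """Assumed linear transform from P(G1) to the gdi (not given in the text)."""
    return (3.0 * p_g1 - 1.0) / 2.0


def classify(p_g1):
    """Apply the two limits on P(G1) quoted in the text."""
    if p_g1 > 0.8:
        return "distinct species"
    if p_g1 < 0.47:
        return "single species"
    return "undecided"


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the article.
    for tau in (0.001, 0.005, 0.02):
        p = prob_g1_no_gene_flow(tau, theta_a=0.01)
        print(tau, p, gdi_from_prob_g1(p), classify(p))
```

Without gene flow this probability never drops below |$\frac{1}{3}$|, which is the expectation that the rest of the article shows can fail once migration is allowed.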
Our results highlight the complexity of defining and delimiting species in the presence of gene flow: for example, in the anomaly zone, application of any barcoding criterion or the gdi index may lead to incorrect inference of one species when two exist. We note that in the past decade, analyses of genomic sequence data have detected cross-species gene flow in a variety of species including Arabidopsis (Arnold et al. 2016), corals (Mao et al. 2018), mosquitoes (Fontaine et al. 2015; Thawornwattana et al. 2018), butterflies (Martin et al. 2013), birds (Ellegren et al. 2012), cats (Li et al. 2019), bears (Liu et al. 2014), cattle (Wu et al. 2018), gibbons (Chan et al. 2013), and hominins (Nielsen et al. 2017). Empirical studies suggest very high proportions of species that hybridize with at least one other species (Mallet 2005, 2008). It is thus of great importance to examine the impact of cross-species gene flow on the definition and identification of species. Here, we formulate our results in the context of using genomic sequence data to infer the history of species divergences and gene flow and to delimit species boundaries. We focus on the continuous-time migration model (the IM model, Hey 2010) (Fig. 1B) but will show that the same behavior occurs under the episodic introgression model or the multispecies coalescent with introgression (MSci) model (Yu et al. 2014; Flouri et al. 2020) (Fig. 1C), which may be more realistic for some biological systems.
The IM Model for Two Species and Three Sequences
Consider two diploid species |$A$| and |$B$|, which diverged time |$\tau = \tau_R$| ago and have since been undergoing migration from species |$B$| to species |$A$|, at the rate of |$m = m_{BA}$| per generation (Fig. 1B). We formulate our theory in the context of analyzing genomic sequence data, so that time is scaled by mutations and both |$\theta$| and |$\tau$| are measured in the number of mutations per site.
For each species, the population size parameter is defined as |$\theta = 4N\mu$|, where |$N$| is the effective population size and |$\mu$| is the mutation rate per site per generation. We define the migration rate |$m_{BA}$| as the proportion of |$B \to A$| immigrants in the receiving population |$A$|, so that |$M_{BA} = m_{BA}N_A$| is the expected number of |$B \to A$| migrants per generation. The parameters in the IM model (Hey 2010) include |$\tau_R$|, |$\theta_A$|, |$\theta_B$|, |$\theta_R$|, and |$M_{BA}$| (Fig. 1B). The IM model of Figure 1B is a special case of the model of Long and Kubatko (2018, Fig. 1B), which allows migration in both directions. We consider the genealogical relationships among three sequences: |$a_1$| and |$a_2$| from |$A$| and |$b$| from |$B$|. In this setting, the gene trees and coalescent times are random variables, with distributions specified by the parameters in the model. The backward process of coalescence and migration during time interval |$(0, \tau_R)$| is described by a Markov chain (Notohara 1990), where the state of the chain is specified by the number of sequences remaining in the sample, the populations in which they reside, the population IDs (|$A$| and |$B$|) and the sequence IDs (|$a_1$|, |$a_2$|, and |$b$|) (Zhu and Yang 2012; Andersen et al. 2014; Tian and Kubatko 2016; Dalquen et al. 2017). For example, in the state |$A_{a_1}A_{a_2}B_b$|, abbreviated |$AAB$|, there are three sequences in the sample, and sequences |$a_1$|, |$a_2$|, and |$b$| are in populations |$A$|, |$A$|, and |$B$|, respectively. This is the initial state for the Markov chain as our sample consists of sequences |$a_1$| and |$a_2$| from |$A$| and |$b$| from |$B$|. Similarly, |$ABB$| is the state reached when sequence |$a_2$| migrates (backwards in time) into population |$B$|.
The state |$A_{aa}B_b$|, abbreviated |$AB_b$|, means that two sequences remain in the sample, with the ancestor of |$a_1$| and |$a_2$| in population |$A$|, and sequence |$b$| in population |$B$|. This is the state reached when sequences |$a_1$| and |$a_2$| coalesce in species |$A$|. The generator matrix |$Q^{\textrm{\ding{172}}}$| for the Markov chain is
$$\begin{array}{l|ccccccccccc}
 & AAB & ABB & BAB & BBB & AB_b & A_{a_1}B & A_{a_2}B & BB_b & B_{a_1}B & B_{a_2}B & B \\ \hline
AAB & -2w_{BA} - c_A & w_{BA} & w_{BA} & 0 & c_A & 0 & 0 & 0 & 0 & 0 & 0 \\
ABB & 0 & -w_{BA} - c_B & 0 & w_{BA} & 0 & c_B & 0 & 0 & 0 & 0 & 0 \\
BAB & 0 & 0 & -w_{BA} - c_B & w_{BA} & 0 & 0 & c_B & 0 & 0 & 0 & 0 \\
BBB & 0 & 0 & 0 & -3c_B & 0 & 0 & 0 & c_B & c_B & c_B & 0 \\
AB_b & 0 & 0 & 0 & 0 & -w_{BA} & 0 & 0 & w_{BA} & 0 & 0 & 0 \\
A_{a_1}B & 0 & 0 & 0 & 0 & 0 & -w_{BA} & 0 & 0 & w_{BA} & 0 & 0 \\
A_{a_2}B & 0 & 0 & 0 & 0 & 0 & 0 & -w_{BA} & 0 & 0 & w_{BA} & 0 \\
BB_b & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -c_B & 0 & 0 & c_B \\
B_{a_1}B & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -c_B & 0 & c_B \\
B_{a_2}B & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -c_B & c_B \\
B & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}$$
where |$w_{BA} = \frac{m_{BA}}{\mu} = \frac{4M_{BA}}{\theta_A}$| is the mutation-scaled migration rate, and |$c_A = \frac{2}{\theta_A}$| and |$c_B = \frac{2}{\theta_B}$| are the coalescent rates. Here, one time unit is the expected time taken to accumulate one mutation per site. In a species with a scaled population size |$\theta = 4N\mu$|, each pair of sequences coalesce at the rate |$\frac{2}{\theta}$|, with the average coalescent time to be |$\frac{\theta}{2}$|.
Probabilities of Gene Trees
We calculate the probabilities for the three gene trees: |$G_1 = ((a_1, a_2), b)$|; |$G_2 = ((b, a_1), a_2)$|; and |$G_3 = ((b, a_2), a_1)$|, as functions of the parameters in the IM model (Fig. 1B). Note that the gene tree topology is determined by the first coalescent event, so that there is no need to follow the Markov chain any further after the first coalescent has occurred. Thus, we construct a simplified Markov chain in which all two-sequence states (such as |$AB_b$| and |$A_{a_1}B$|) are changed into absorbing states, and state |$B$| is unreachable and thus removed from the chain. Similarly, as soon as the chain enters the state |$BBB$|, with all three sequences in species |$B$|, the three gene trees occur with equal probabilities. Thus we make |$BBB$| an absorbing state as well, with |$BB_b$|, |$B_{a_1}B$|, and |$B_{a_2}B$| unreachable and removed. The modified Markov chain then has seven states, with the generator |$Q^{\textrm{\ding{173}}}$|
$$\begin{array}{ll|ccccccc}
 & & AAB & ABB & BAB & BBB & AB_b & A_{a_1}B & A_{a_2}B \\ \hline
(1) & AAB & -2w_{BA} - c_A & w_{BA} & w_{BA} & 0 & c_A & 0 & 0 \\
(2) & ABB & 0 & -w_{BA} - c_B & 0 & w_{BA} & 0 & c_B & 0 \\
(3) & BAB & 0 & 0 & -w_{BA} - c_B & w_{BA} & 0 & 0 & c_B \\
(4) & BBB & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
(5) & AB_b & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
(6) & A_{a_1}B & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
(7) & A_{a_2}B & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}$$
Let |$P^{\textrm{\ding{173}}}(\tau) = \exp{(Q^{\textrm{\ding{173}}}\tau)} $| be the matrix of transition probabilities over time |$\tau = \tau_R$|. We have the probability for gene tree |$G_1$| to be
$$\begin{equation} \label{eq-PG1-IM} \mathbb{P}(G_1) = P^{\textrm{\ding{173}}}_{15}(\tau) + [P^{\textrm{\ding{173}}}_{11}(\tau) + P^{\textrm{\ding{173}}}_{12}(\tau) + P^{\textrm{\ding{173}}}_{13}(\tau) + P^{\textrm{\ding{173}}}_{14}(\tau)]/3. \end{equation}$$(1)
The different terms account for different scenarios that lead to gene tree |$G_1$|. First, sequences |$a_1$| and |$a_2$| may coalesce in population |$A$|, before reaching time |$\tau$|: this occurs with probability |$P^{\textrm{\ding{173}}}_{15}(\tau)$|.
Second, if both sequences |$a_1$| and |$a_2$| enter species |$B$| any time during the time interval (|$0, \tau$|), the chain will be in state 4 (|$BBB$|): each gene tree will then have probability |$\frac{1}{3}$| when the coalescent events occur at random in species |$B$| or |$R$| (Fig. 1B). Finally, if no coalescent occurs over the time interval (|$0, \tau$|) and if at most one of the |$A$| sequences enters species |$B$|, the chain will be in states 1, 2, or 3 (for |$AAB$|, |$ABB$| or |$BAB$|) at time |$\tau$|: then all three sequences will enter the common ancestor |$R$| and each gene tree occurs with probability |$\frac{1}{3}$|. The eigenvalues of |$Q^{\textrm{\ding{173}}}$| are on the diagonal: |$\lambda_1 = -\frac{2+8M}{\theta_A}$|, |$\lambda_2 =$| |$\lambda_3 = -\frac{4M}{\theta_A} - \frac{2}{\theta_B}$|, and |$\lambda_4 =$| |$\cdots = $| |$\lambda_7$| = 0. These are all real, as are the eigenvectors. We derive |$P^{\textrm{\ding{173}}}(\tau)$| using Mathematica, but the expression is tedious and not presented here. Let |$e_1 = \textrm{e}^{\lambda_1 \tau}$| and |$e_2 = \textrm{e}^{\lambda_2 \tau}$|. Then equation 1 can be simplified, to give $$\begin{align}\label{eq-PG1-IM2} & \mathbb{P}(G_1) = \left[ ((2 - 4M)e_1 -3) \theta_A^2 + (3 - 8 M^2 - (2 + 8 M^2)e_1 \right.\nonumber \\ & \quad \left. + 4M (1 + 4 M)e_2 ) \theta_A \theta_B + 2M(1 + 2M)(3 + 4M - 2e_1) \theta_B^2 \right]\nonumber\\ &\qquad{} \left/\left[3 (1 + 4 M) (\theta_A + 2 M \theta_B) (-\theta_A + \theta_B + 2 M \theta_B) \right]\right. . \end{align}$$(2) Similarly, the probability for gene tree |$G_2$| is $$\begin{equation} \label{eq-PG2-IM} \mathbb{P}(G_2) = P^{\textrm{\ding{173}}}_{17}(\tau) + [P^{\textrm{\ding{173}}}_{11}(\tau) + P^{\textrm{\ding{173}}}_{12}(\tau) + P^{\textrm{\ding{173}}}_{13}(\tau) + P^{\textrm{\ding{173}}}_{14}(\tau)]/3 . 
\end{equation}$$(3) From equations 1 and 3, we can see that |$\mathbb{P}(G_2) > \mathbb{P}(G_1)$| if and only if |$P^{\textrm{\ding{173}}}_{17}(\tau) > P^{\textrm{\ding{173}}}_{15}(\tau)$|. Indeed, which of gene trees |$G_1$|, |$G_2$|, and |$G_3$| is most probable depends on the relative probabilities of three scenarios (Fig. 1B): (i) |$a_1$| and |$a_2$| coalesce in |$A$|, which occurs with probability |$P^{\textrm{\ding{173}}}_{15}(\tau)$| and leads to |$G_1$|; (ii) |$a_1$| migrates (backwards in time) to |$B$| and coalesces with |$b$|, with probability |$P^{\textrm{\ding{173}}}_{17}(\tau)$| for |$G_2$|; (iii) |$a_2$| migrates (backwards in time) to |$B$| and coalesces with |$b$|, with probability |$P^{\textrm{\ding{173}}}_{16}(\tau)$| for |$G_3$|. In all other scenarios, the three gene trees occur with equal probability. When the coalescent rate is much lower in species |$A$| than in |$B$| (or when |$\theta_A \gg \theta_B$|) and the migration rate from |$B$| to |$A$| is high, case (i) may be less probable than (ii) or (iii). The anomaly in gene tree probabilities identified here is similar to the species-tree anomaly analyzed by Long and Kubatko (2018). The assumption of unidirectional migration in our model allows us to obtain simpler or more expressive analytical results than are possible under the model of bidirectional migration of Long and Kubatko (2018). As an example, consider |$\mathbb{P}(G_1)$| as a function of |$M$|, with other parameters fixed at |$\tau = 0.02, \theta_A = 0.025$|, and |$\theta_B = 0.001$| (Fig. 2A). When |$M = 0$|, the IM model of Figure 1B reduces to the simple MSC model of Figure 1A, and the gene tree probabilities are |$\mathbb{P}(G_1)$| = |$1-\frac{2}{3}\textrm{e}^{-2\tau/\theta_A}$| = 0.865402 and |$\mathbb{P}(G_2) =$| |$\mathbb{P}(G_3)$| = |$\frac{1}{3}\textrm{e}^{-2\tau/\theta_A}$| = 0.067299. 
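The gene tree probabilities of equations 1 and 3 can also be evaluated numerically. The Python sketch below is only an illustration and is not the Mathematica derivation used for equation 2. It assumes the rates |$c_A = 2/\theta_A$|, |$c_B = 2/\theta_B$|, and |$w_{BA} = 4M/\theta_A$|, as implied by the eigenvalues quoted above, and approximates |$\exp(Q\tau)$| with a hand-written Taylor series with scaling and squaring. The function names (`mat_mul`, `mat_exp`, `im_gene_tree_probs`) are hypothetical.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=20):
    """Approximate exp(q * t) by a Taylor series with scaling and squaring."""
    n = len(q)
    norm = max(sum(abs(x) for x in row) for row in q) * t
    # Halve the matrix until its norm is at most 1/2, so the series converges fast.
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    a = [[x * t / 2 ** squarings for x in row] for row in q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def im_gene_tree_probs(m, tau, theta_a, theta_b):
    """Return (P(G1), P(G2), P(G3)) following equations 1 and 3."""
    c_a, c_b = 2.0 / theta_a, 2.0 / theta_b   # coalescent rates
    w = 4.0 * m / theta_a                     # per-lineage migration rate (assumed scaling)
    # States: AAB, ABB, BAB, BBB, AB_b, A_{a1}B, A_{a2}B (indices 0..6).
    q = [[0.0] * 7 for _ in range(7)]
    q[0] = [-2 * w - c_a, w, w, 0.0, c_a, 0.0, 0.0]
    q[1] = [0.0, -w - c_b, 0.0, w, 0.0, c_b, 0.0]
    q[2] = [0.0, 0.0, -w - c_b, w, 0.0, 0.0, c_b]
    p = mat_exp(q, tau)[0]                    # row for the initial state AAB
    shared = sum(p[0:4]) / 3.0                # states 1-4: topologies equally likely
    return p[4] + shared, p[6] + shared, p[5] + shared


for m in (0.0, 0.8):
    g1, g2, g3 = im_gene_tree_probs(m, tau=0.02, theta_a=0.025, theta_b=0.001)
    print(f"M = {m}: P(G1) = {g1:.5f}, P(G2) = {g2:.5f}, P(G3) = {g3:.5f}")
```

Because states 4 to 7 are absorbing, only the first row of the transition matrix is needed. A library matrix exponential could replace `mat_exp`; the series is used here only to keep the sketch free of dependencies.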
In these expressions for |$M = 0$|, |$\tau/\frac{\theta_A}{2}$| is the length of branch |$A$| in coalescent units (as the average coalescent time in population |$A$| is |$\frac{1}{2}\theta_A$| mutations per site), and |$\textrm{e}^{-2\tau/\theta_A}$| is the probability that two sequences (|$a_1$| and |$a_2$|) do not coalesce along branch |$A$|, that is, over the time interval (|$0, \tau$|). At the threshold value |$M^*$| = 0.521361, |$\mathbb{P}(G_1) = \mathbb{P}(G_2) = \frac{1}{3}$|. When |$M$| = 0.8, the probabilities for the three scenarios described above are |$P^{\textrm{\ding{173}}}_{15}(\tau)$| = 0.23781 and |$P^{\textrm{\ding{173}}}_{16}(\tau) = P^{\textrm{\ding{173}}}_{17}(\tau)$| = 0.35753, with |$\mathbb{P}(G_1)$| = 0.25352 |$< \mathbb{P}(G_2)$| = |$\mathbb{P}(G_3)$| = 0.37324. In the limit |$M \to \infty$|, sequences |$a_1$| and |$a_2$| will immediately migrate (backwards in time) into |$B$| and the three sequences will coalesce at random, with |$\mathbb{P}(G_1) \to \frac{1}{3}$|.

Figure 2. A) Probabilities of gene trees |$G_1 = ((a_1, a_2), b)$| and |$G_2 = ((b, a_1), a_2)$| as functions of the migration rate |$M$| when the other parameters in the IM model (Fig. 1B) are fixed: |$\tau = 0.02, \theta_A = 0.025$|, and |$\theta_B = 0.001$|. Note that when |$M > M^*$| = 0.521361 immigrants per generation, |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$|. B) Partition of the parameter space for the IM model (Fig. 1B) according to probabilities for gene trees. Below the outer tent, |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$|, while below the inner tent |$\mathbb{P}(G_1) <$| 0.2 and |$\mathbb{P}(G_2)$| = |$\mathbb{P}(G_3) >$| 0.4.

To verify the analytical results, we used the simulate option of bpp (Flouri et al. 2018) to generate |$10^7$| gene trees at those parameter values. The estimates of |$\mathbb{P}(G_1)$| are 0.865374, 0.333393, and 0.253542, for |$M =$| 0, 0.521361, and 0.8, respectively, which differ from the above analytical calculations by less than |$10^{-4}$|. Supplementary Figure S1A,B available on Dryad at http://dx.doi.org/10.5061/dryad.xwdbrv1b5 examines the impact of the divergence time (|$\tau_R$|) and the ratio of the population sizes (|$\theta_A/\theta_B$|) on gene tree probabilities, when other parameters are fixed at the values of Figure 2. Those two parameters similarly partition the parameter space into two zones, with the anomaly |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$| occurring for large |$\tau_R$| (with |$\tau_R > 0.00119461$|) and for very different population sizes (with |$\theta_A/\theta_B >$| 2.66667). Nevertheless, |$\tau_R$| and |$\theta_A/\theta_B$| appear to have less impact than the migration rate |$M$| (Fig. 2A). Figure 2B shows a partition of the 3D parameter space into two zones: when the parameters are inside the outer tent, we have the anomaly |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$|.

Average Coalescent Times

Next, we consider the average coalescent times or sequence distances within and between species. One could in principle use the Markov chain |$Q^{\textrm{\ding{172}}}$| constructed earlier for the process of coalescence and migration for the three sequences in the sample (|$a_1$|, |$a_2$|, and |$b$|). 
However, it is far simpler to use a reduced Markov chain with fewer states for two sequences only. To derive the density of the coalescent time between sequences |$a_1$| and |$a_2$|, that is, |$t_{aa}$|, we consider a Markov chain with 4 states. We abbreviate states like |$A_{a_1}A_{a_2}$| as |$AA$|, and merge states |$A$| and |$B$| (which are states reached when the two sequences have coalesced with the ancestral sequence in either |$A$| or |$B$|) into one absorbing state, |$A|B$| (for “|$A$| or |$B$|”) (Andersen et al. 2014). The generator matrix |$Q^{\textrm{\ding{174}}}$| is

$$\begin{array}{l|cccc}
 & AA & AB & BB & A|B \\ \hline
AA & -(2w_{BA} + c_A) & 2 w_{BA} & 0 & c_A \\
AB & 0 & -w_{BA} & w_{BA} & 0 \\
BB & 0 & 0 & -c_B & c_B \\
A|B & 0 & 0 & 0 & 0
\end{array}$$

The eigenvalues of |$Q^{\textrm{\ding{174}}}$| are on the diagonal: |$\lambda_1 = -\frac{8M}{\theta_A} -\frac{2}{\theta_A}$|, |$\lambda_2 = -\frac{4M}{\theta_A}$|, |$\lambda_3 = -\frac{2}{\theta_B}$|, and |$\lambda_4 = 0$|. Let the transition probability matrix over time |$t$| be |$P^{\textrm{\ding{174}}}(t) = \textrm{exp}(Q^{\textrm{\ding{174}}} t)$|, which is a function of |$\textrm{e}^{\lambda_k t}$|, |$k = 1, 2, 3$|. Like |$\tau$|, time |$t$| is measured in the expected number of mutations per site. Thus $$\begin{equation} \label{eq:ftaa-IM} f(t_{aa}) = \begin{cases}\!\!\!\! 
P^{\textrm{\ding{174}}}_{AA, AA}(t_{aa}) \cdot \frac{2}{\theta_A} + P^{\textrm{\ding{174}}}_{AA, BB}(t_{aa}) \cdot \frac{2}{\theta_B}, & \textrm{if } 0 < t_{aa} < \tau_R,\\ \left[ 1 - P^{\textrm{\ding{174}}}_{AA, A|B}(\tau_R) \right] \frac{2}{\theta_R} \textrm{e}^{-\frac{2}{\theta_R}(t_{aa}-\tau_R)}, & \textrm{if } t_{aa} > \tau_R. \end{cases} \end{equation}$$(4) Note that according to the definition of the probability density function, |$f(t_{aa}) \Delta t$| is the probability that the coalescent time falls in the interval |$(t_{aa}, t_{aa} + \Delta t)$|. When |$t_{aa} < \tau_R$|, this is the sum of two terms, as the coalescent event can occur in either species |$A$| or |$B$|. The first term, |$P^{\textrm{\ding{174}}}_{AA, AA}(t_{aa}) \cdot \frac{2}{\theta_A} \Delta t$|, is the probability that sequences |$a_1$| and |$a_2$| are both in species |$A$| right before time |$t_{aa}$|, multiplied by the probability, |$\frac{2}{\theta_A} \Delta t$|, that they coalesce during the time interval |$(t_{aa}, t_{aa} + \Delta t)$|. The second term, |$P^{\textrm{\ding{174}}}_{AA, BB}(t_{aa}) \cdot \frac{2}{\theta_B} \Delta t$|, is the probability for the coalescent to occur in species |$B$|. Similarly, in the case |$t_{aa} > \tau_R$|, both |$a_1$| and |$a_2$| enter |$R$|, with probability |$ 1 - P^{\textrm{\ding{174}}}_{AA, A|B}(\tau_R) $|, and then coalesce in |$R$| in the time interval (|$t_{aa}, t_{aa}+\Delta t$|), with probability |$\frac{2}{\theta_R} \textrm{e}^{-\frac{2}{\theta_R}(t_{aa}-\tau_R)} \Delta t$|. 
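As a sanity check on equation 4, the density can be evaluated numerically and integrated. The Python sketch below is an illustration under stated assumptions: it uses |$w_{BA} = 4M/\theta_A$| (implied by |$\lambda_2$|), writes the transition probabilities of the four-state chain as sums of exponentials obtained by solving the triangular system by hand (not expressions quoted from the text), and assumes the three nonzero eigenvalues are distinct. The names `im_transition_probs`, `im_taa_density`, and `simpson` are hypothetical.

```python
import math


def im_transition_probs(t, m, theta_a, theta_b):
    """Return (P_AA,AA, P_AA,AB, P_AA,BB) at time t for the four-state chain."""
    c_a, c_b = 2.0 / theta_a, 2.0 / theta_b
    w = 4.0 * m / theta_a                        # assumed migration-rate scaling
    l1, l2, l3 = -(2 * w + c_a), -w, -c_b        # eigenvalues, assumed distinct
    e1, e2, e3 = (math.exp(lam * t) for lam in (l1, l2, l3))
    p_aa = e1
    p_ab = 2 * w * (e1 - e2) / (l1 - l2)
    p_bb = 2 * w * w / (l1 - l2) * ((e1 - e3) / (l1 - l3) - (e2 - e3) / (l2 - l3))
    return p_aa, p_ab, p_bb


def im_taa_density(t, m, tau_r, theta_a, theta_b, theta_r):
    """Density of t_aa from equation 4 (t = tau_r is assigned to the first case)."""
    if t <= tau_r:
        p_aa, _, p_bb = im_transition_probs(t, m, theta_a, theta_b)
        return p_aa * 2.0 / theta_a + p_bb * 2.0 / theta_b
    reach_root = sum(im_transition_probs(tau_r, m, theta_a, theta_b))
    return reach_root * 2.0 / theta_r * math.exp(-2.0 / theta_r * (t - tau_r))


def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


params = dict(m=0.8, tau_r=0.02, theta_a=0.025, theta_b=0.001, theta_r=0.01)
# Probability of coalescing in A or B before tau_R, by numerical integration.
before_root = simpson(lambda t: im_taa_density(t, **params), 0.0, params["tau_r"])
# Probability that both sequences reach R, i.e. 1 - P_{AA,A|B}(tau_R).
in_root = sum(im_transition_probs(params["tau_r"], params["m"],
                                  params["theta_a"], params["theta_b"]))
print(f"mass before tau_R = {before_root:.6f}, mass in R = {in_root:.6f}")
```

The two masses should sum to one, because the exponential tail in |$R$| integrates to |$1 - P_{AA, A|B}(\tau_R)$|.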
The expectation of |$t_{aa}$| is given by averaging over the three cases of equation 4 in which |$a_1$| and |$a_2$| coalesce in |$A$|, |$B$|, and |$R$|: $$\begin{eqnarray} \label{eq:Etaa-IM} \mathbb{E}(t_{aa}) &=& \int_0^{\tau_R}t P^{\textrm{\ding{174}}}_{AA, AA}(t) \textrm{d}t \cdot \frac{2}{\theta_A} + \int_0^{\tau_R}t P^{\textrm{\ding{174}}}_{AA, BB}(t) \textrm{d}t \cdot \frac{2}{\theta_B}\nonumber\\ &&+ \left[1 - P^{\textrm{\ding{174}}}_{AA, A|B}(\tau_R)\right] \left(\tau_R + \frac{\theta_R}{2} \right) \nonumber \\ &=& \left[e_1 -\frac{4M(e_1 - e_2)}{1+2M} - \frac{8 e_3 M^2 \theta_B^2}{(\theta_A - \theta_B- 4M\theta_B)(2M\theta_B - \theta_A)} \right. \nonumber \\ & & - \frac{8 e_2 M^2 \theta_B}{(1 + 2M)(2M\theta_B - \theta_A)}\nonumber\\ & &\left. - \frac{8 e_1 M^2 \theta_B}{(1 + 2M)(\theta_A - \theta_B- 4M\theta_B)}\right] \left(\tau_R + \frac{\theta_R}{2}\right) \nonumber \\ &&+ \frac{\theta_A \left[1 - e_1 (1 - \lambda_1\tau_R)\right]}{2{(1 + 4M)}^2} - \frac{\theta_A^2 \left[1 - e_2 (1 -\lambda_2\tau_R)\right]}{(1 + 2M)(2M\theta_B - \theta_A)} \nonumber \\ &&+ \frac{4 M^2 \theta_A^2}{{(1 + 4M)}^2 (2M\theta_B - \theta_A)}\nonumber\\ &&\quad{} \left[\frac{1}{1 + 2M} + \frac{\theta_B}{\theta_A - \theta_B- 4M\theta_B}\right] \left[1 - e_1 (1 - \lambda_1\tau_R)\right] \nonumber \\ & & -\frac{4 M^2 \theta_B^3 [1 - e_3 (1 - \lambda_3 \tau_R)]}{(2M\theta_B - \theta_A)(\theta_A - \theta_B- 4M\theta_B)}, \end{eqnarray}$$(5) where |$e_k = \textrm{e}^{\lambda_k \tau_R}$|, |$k = 1, 2, 3$|.

To derive the density of the coalescent time |$t_{ab}$| between sequences |$a_1$| and |$b$|, we consider a Markov chain with three states describing the backward process of coalescence and migration during time interval |$(0, \tau_R)$|. We abbreviate states like |$A_{a_1}B_b$| as |$AB$| here. The generator matrix |$Q^{\textrm{\ding{175}}}$| is

$$\begin{array}{l|ccc}
 & AB & BB & B \\ \hline
AB & -w_{BA} & w_{BA} & 0 \\
BB & 0 & -c_B & c_B \\
B & 0 & 0 & 0
\end{array}$$

Thus the transition probability matrix is |$P^{\textrm{\ding{175}}}(t) = \textrm{exp}(Q^{\textrm{\ding{175}}} t)$|, and $$\begin{equation} \label{eq:ftab-IM} f(t_{ab}) = \begin{cases} P^{\textrm{\ding{175}}}_{AB, BB}(t_{ab}) \cdot \frac{2}{\theta_B}, & \textrm{if } 0 < t_{ab} < \tau_R, \\ \left[1 - P^{\textrm{\ding{175}}}_{AB, B}(\tau_R)\right] \frac{2}{\theta_R} \textrm{e}^{-\frac{2}{\theta_R}(t_{ab}-\tau_R)}, & \textrm{if } t_{ab} > \tau_R. \end{cases} \end{equation}$$(6) The expectation of |$t_{ab}$| is given by averaging over the two cases: $$\begin{align} \label{eq:Etab-IM} \mathbb{E}(t_{ab}) &= \int_0^{\tau_R}t P^{\textrm{\ding{175}}}_{AB, BB}(t) \textrm{d}t \cdot \frac{2}{\theta_B} + \left[1 - P^{\textrm{\ding{175}}}_{AB, B}(\tau_R) \right] \left(\tau_R + \frac{\theta_R}{2}\right) \nonumber \\ &= \frac{4M^2\theta_B^2 \left[1 - e_3 (1 - \lambda_3\tau_R)\right] - \theta_A^2 \left[1 - e_2 (1 -\lambda_2\tau_R)\right]}{4M (2M\theta_B - \theta_A)} \nonumber \\ &\quad{} + \left[e_2-\frac{2M\theta_B(e_2 - e_3)}{2M\theta_B - \theta_A}\right] \left(\tau_R + \frac{\theta_R}{2}\right). \end{align}$$(7) We plot |$\mathbb{E}(t_{aa})$| and |$\mathbb{E}(t_{ab})$| against the migration rate |$M$| in Figure 3A, with other parameters in the model fixed at |$\tau = 0.02$|, |$\theta_A = 0.025, \theta_B = 0.001$|, and |$\theta_R = 0.01$|. In the extreme case of |$M = 0$|, the IM model becomes a model of complete isolation (or the MSC model, Fig. 
1A), in which case |$\mathbb{E}(t_{ab})$| = |$\tau_R + \frac{1}{2}\theta_R$| = 0.025 and |$\mathbb{E}(t_{aa}) = \frac{1}{2}\theta_A + P_A \cdot \frac{1}{2}(\theta_R - \theta_A)$| = 0.01099, where |$P_A$| = |$\exp(-\frac{2}{\theta_A}\tau)$| = 0.2019 is the probability that |$a_1$| and |$a_2$| do not coalesce in species |$A$|. Here, |$\mathbb{E}(t_{aa})$| is given by the approach of “iterated corrections”, since the coalescent process between |$a_1$| and |$a_2$| occurs at different rates (determined by |$\theta_A$| and |$\theta_R$|) before and after |$\tau_R$| (Burgess and Yang, 2008, eq. 7). If |$\theta_A$| and |$\theta_R$| were equal, the mean coalescent time would be |$\frac{\theta_A}{2}$|. Thus applying a correction for different population sizes, which affects a proportion |$P_A$| of the coalescent times, leads to |$\mathbb{E}(t_{aa})$| = |$\frac{1}{2}\theta_A + P_A \cdot \frac{1}{2}(\theta_R - \theta_A)$|. In the other extreme case with |$M \to \infty$|, both |$a_1$| and |$a_2$| will migrate (backwards in time) into |$B$| immediately and then coalesce with |$b$| at random, so that |$\mathbb{E}(t_{aa})$| = |$\mathbb{E}(t_{ab})$| = |$\frac{1}{2}\theta_B + P_B \cdot \frac{1}{2} (\theta_R - \theta_B)$| = 0.0005, where |$P_B$| = |$\exp(-\frac{2}{\theta_B}\tau)$|. When |$M$| is greater than a threshold value, |$M^* = 0.5254101$|, |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$|.

Figure 3. A) The expected coalescent times, |$\mathbb{E}(t_{aa})$| and |$\mathbb{E}(t_{ab})$|, as functions of the migration rate |$M$| under the IM model (Fig. 1B). The two curves cross at |$M^* = 0.5254101$|, and when |$M > M^*$|, |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$|. Other parameters are fixed at |$\tau = 0.02$|, |$\theta_A = 0.025, \theta_B = 0.001$|, and |$\theta_R = 0.01$|. B) Partition of the parameter space defined by |$\theta_A, \theta_B$|, and |$M$| under the IM model according to whether |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$|. Inside the outer tent, |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$|. Inside the inner tent |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab}) + 0.001$|. Other parameters are fixed at |$\tau = 0.02$| and |$\theta_R = 0.01$|.

We used bpp to simulate |$10^7$| gene trees at the parameter values of Figure 3A to verify the equations. At |$M^* = 0.5254101$|, the estimates of |$\mathbb{E}(t_{aa})$| and |$\mathbb{E}(t_{ab})$| are 0.0110556 and 0.0110550, in comparison with 0.0110557 and 0.0110557 from equations 5 and 7. At |$M = 0.8$| they are 0.00902285 and 0.00808002, in comparison with 0.00902284 and 0.00808021 from equations 5 and 7. Supplementary Figure S1C,D available on Dryad examines the impact of the divergence time |$\tau_R$| and the ratio of population sizes |$\theta_A/\theta_B$| on the average coalescent times, with other parameters fixed at the values of Figure 3. The anomaly |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$| occurs when |$\tau_R$| is large and when |$\theta_A$| is much greater than |$\theta_B$|. Figure 3B shows a partition of a 3D parameter space. 
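The expectations in equations 5 and 7 can be cross-checked without the long closed forms, by integrating the densities of equations 4 and 6 term by term. The Python sketch below is one such check and not the code used for Figure 3. It assumes |$w_{BA} = 4M/\theta_A$|, distinct nonzero eigenvalues, and transition probabilities written as sums of exponentials (worked out by hand for this illustration); `crossing_m` locates the threshold |$M^*$| by bisection. All function names are hypothetical.

```python
import math


def _first_moment(lam, tau):
    """Integral of t * exp(lam * t) over [0, tau]."""
    if lam == 0.0:
        return 0.5 * tau * tau
    return (1.0 - math.exp(lam * tau) * (1.0 - lam * tau)) / (lam * lam)


def im_expected_times(m, tau, theta_a, theta_b, theta_r):
    """Return (E[t_aa], E[t_ab]) under the IM model by integrating equations 4 and 6."""
    c_a, c_b = 2.0 / theta_a, 2.0 / theta_b
    w = 4.0 * m / theta_a                      # assumed migration-rate scaling
    l1, l2, l3 = -(2 * w + c_a), -w, -c_b      # eigenvalues quoted in the text
    e1, e2, e3 = (math.exp(lam * tau) for lam in (l1, l2, l3))
    i1, i2, i3 = (_first_moment(lam, tau) for lam in (l1, l2, l3))
    root_mean = tau + theta_r / 2.0            # mean time given coalescence in R

    # Two sequences from A: chain AA -> AB -> BB, coalescing in A or in B.
    k = 2 * w * w / (l1 - l2)
    p_ab = 2 * w * (e1 - e2) / (l1 - l2)
    p_bb = k * ((e1 - e3) / (l1 - l3) - (e2 - e3) / (l2 - l3))
    int_t_pbb = k * ((i1 - i3) / (l1 - l3) - (i2 - i3) / (l2 - l3))
    e_taa = c_a * i1 + c_b * int_t_pbb + (e1 + p_ab + p_bb) * root_mean

    # One sequence from each species: chain AB -> BB, coalescing in B.
    q_bb = w * (e2 - e3) / (l2 - l3)
    int_t_qbb = w * (i2 - i3) / (l2 - l3)
    e_tab = c_b * int_t_qbb + (e2 + q_bb) * root_mean
    return e_taa, e_tab


def crossing_m(lo, hi, **params):
    """Bisect for the M at which E[t_aa] - E[t_ab] changes sign on [lo, hi]."""
    def gap(m):
        e_taa, e_tab = im_expected_times(m, **params)
        return e_taa - e_tab
    if gap(lo) * gap(hi) > 0:
        raise ValueError("no sign change on the bracket")
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if gap(lo) * gap(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


fig3 = dict(tau=0.02, theta_a=0.025, theta_b=0.001, theta_r=0.01)
print(im_expected_times(0.8, **fig3))
print(crossing_m(0.1, 2.0, **fig3))
```

The bracket `[0.1, 2.0]` is an arbitrary choice that straddles the crossing for these parameter values; other parameter settings may need a different bracket or may have no crossing at all.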
The anomaly |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$| occurs more easily for large |$M$| and when |$\theta_A$| is much greater than |$\theta_B$|.

The MSci Model for Two Species with Three Sequences

Consider the introgression (MSci) model for two species |$A$| and |$B$|, with |$B \to A$| introgression at time |$\tau_H = \tau_S$| and introgression probability |$\varphi$| (Fig. 1C). Again consider a sample of three sequences, |$a_1$| and |$a_2$| from |$A$| and |$b$| from |$B$|. We derive the probabilities for the three gene trees: |$G_1 = ((a_1, a_2), b)$|, |$G_2 = ((b, a_1), a_2)$|, and |$G_3 = ((b, a_2), a_1)$|, as well as the expected within-species and between-species coalescent times: |$\mathbb{E}(t_{aa})$| and |$\mathbb{E}(t_{ab})$|.

Probabilities of Gene Trees

The gene tree topology depends on whether sequences |$a_1$| and |$a_2$| coalesce in species |$A$| (i.e., over the time interval 0–|$\tau_H$|), and, if they do not, on whether they migrate into population |$S$|, and so on (Fig. 1C). Note that in population |$A$|, sequences |$a_1$| and |$a_2$| coalesce according to a Poisson process at the rate |$\frac{2}{\theta_A}$|. Thus, the probability that |$a_1$| and |$a_2$| do not coalesce in |$A$| before reaching time |$\tau_H$| is $$\begin{equation} P_A = \textrm{e}^{-\frac{2}{\theta_A}\tau_H}. \end{equation}$$(8) Similarly, we define $$\begin{equation} P_H = \textrm{e}^{-\frac{2}{\theta_H}(\tau_R-\tau_H)} \textrm{ and } P_S = \textrm{e}^{-\frac{2}{\theta_S}(\tau_R-\tau_H)} \end{equation}$$(9) to be the probabilities that two sequences entering populations |$H$| or |$S$|, respectively, do not coalesce in that population (Fig. 1C). 
Then the probabilities for the three gene trees are $$\begin{align} \mathbb{P}(G_1) &= (1 - P_A) + P_A {(1 - \varphi)}^2 (1 - P_H) + \frac{1}{3} P_A {(1 - \varphi)}^2 P_H \nonumber \\ &\quad{} + \frac{2}{3} P_A \varphi (1 - \varphi) P_S + \frac{1}{3} P_A \varphi^2,\\ \mathbb{P}(G_2) &= \mathbb{P}(G_3) = \frac{1}{3} P_A {(1 - \varphi)}^2 P_H\nonumber\\ &\quad{} + P_A \varphi (1 - \varphi) \left( 1 - \frac{1}{3}P_S \right) + \frac{1}{3} P_A \varphi^2, \nonumber \end{align}$$(10) with |$\mathbb{P}(G_1) + 2\mathbb{P}(G_2)$| = 1. Here |$\mathbb{P}(G_1)$| is a sum of five terms, corresponding to different scenarios in which the first coalescent event is between |$a_1$| and |$a_2$|. The first term, |$1 - P_A$|, is the probability that |$a_1$| and |$a_2$| coalesce in population |$A$|. The second term, |$P_A {(1 - \varphi)}^2 (1 - P_H)$|, is the probability that |$a_1$| and |$a_2$| do not coalesce in population |$A$|, they both enter population |$H$| (branch |$RH$| in the species tree, Fig. 1C) and coalesce in |$H$|. The third term, |$P_A {(1 - \varphi)}^2 P_H \cdot \frac{1}{3}$|, is the probability that |$a_1$| and |$a_2$| do not coalesce in |$A$|, and they both enter |$H$| and then |$R$|, where the three sequences coalesce in random order. The fourth term, |$P_A \cdot 2 \varphi (1 - \varphi) P_S \cdot \frac{1}{3}$|, is the probability that |$a_1$| and |$a_2$| do not coalesce in |$A$| and one of them enters |$S$| but does not coalesce with |$b$| in |$S$|, so that all three sequences enter |$R$| and coalesce in random order. The fifth term, |$P_A \varphi^2 \cdot \frac{1}{3} $|, is the probability that |$a_1$| and |$a_2$| do not coalesce in |$A$| and they both enter |$S$|, so that all three sequences enter |$S$| and coalesce at random in |$S$| or |$R$|. Similarly |$\mathbb{P}(G_2)$| is a sum of three terms, corresponding to three different scenarios in which sequences |$a_1$| and |$b$| coalesce first. 
The first term, |$P_A {(1 - \varphi)}^2 P_H \cdot \frac{1}{3}$|, is for |$a_1$| and |$a_2$| not to coalesce in |$A$| but to enter |$H$| and then |$R$|, and then for |$a_1$| and |$b$| to coalesce in |$R$|. The second term, |$P_A \varphi (1 - \varphi)$| |$\cdot \left(P_S \cdot \frac{1}{3} + (1 - P_S) + P_S\cdot \frac{1}{3} \right)$|, is for one of |$a_1$| and |$a_2$| to enter |$H$| and the other to enter |$S$|. If |$a_1$| enters |$H$|, and |$a_2$| enters |$S$| and does not coalesce with |$b$| in |$S$|, then |$a_1$| and |$b$| can coalesce in |$R$|. If |$a_2$| enters |$H$| and |$a_1$| enters |$S$|, then |$a_1$| and |$b$| may coalesce in |$S$| or |$R$|. Lastly the third term, |$P_A \varphi^2 \cdot \frac{1}{3}$|, is for both |$a_1$| and |$a_2$| to enter |$S$| and then for the three sequences to coalesce at random in |$S$| or |$R$|. We have |$\mathbb{P}(G_1) < \mathbb{P}(G_2) = \mathbb{P}(G_3)$| if and only if $$\begin{equation} P_A \varphi (1 - \varphi) (1 - P_S) > 1 - P_A + P_A {(1 - \varphi)}^2 (1 - P_H) \end{equation}$$(11) or $$\begin{equation} P_A(2 - P_H - P_S) \varphi^2 - P_A (3 - 2P_H -P_S) \varphi + 1 - P_A P_H < 0. \end{equation}$$(12) While the MSci model of Figure 1C has seven parameters (we do not count |$\theta_B$| since it is not needed to simulate sequence data of |$a_1, a_2$|, and |$b$|), the gene tree probabilities depend on only four: the introgression probability |$\varphi$| and the three branch lengths in coalescent units for branches |$A, H,$| and |$S$| (Fig. 1C). Note that |$P_A, P_H$|, and |$P_S$| are simply functions of the respective branch lengths (in coalescent units). We plot |$\mathbb{P}(G_1)$| and |$\mathbb{P}(G_2)$| against |$\varphi$| in Figure 4A, with |$P_A = P_H = 0.9$| and |$P_S$| = 0.1 fixed. Note that when |$\varphi$| = 0 or 1, the MSci model reduces to the simple MSC model for two species with changing population sizes but without introgression. 
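Equations 10 and 12 are simple enough to evaluate directly. The Python sketch below is illustrative only: it codes the terms of equation 10 as they are described above and solves the quadratic of equation 12 for the range of |$\varphi$| in which |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$|. The function names are hypothetical, and the returned roots are not clipped to the interval (0, 1).

```python
import math


def msci_gene_tree_probs(phi, p_a, p_h, p_s):
    """Return (P(G1), P(G2)) from equation 10; P(G3) equals P(G2)."""
    both_h = p_a * (1 - phi) ** 2        # no coalescence in A, both sequences enter H
    split = p_a * phi * (1 - phi)        # a given sequence enters S, the other enters H
    both_s = p_a * phi ** 2              # no coalescence in A, both sequences enter S
    p_g1 = ((1 - p_a) + both_h * (1 - p_h) + both_h * p_h / 3
            + 2 * split * p_s / 3 + both_s / 3)
    p_g2 = both_h * p_h / 3 + split * (1 - p_s / 3) + both_s / 3
    return p_g1, p_g2


def anomaly_zone(p_a, p_h, p_s):
    """Roots of the quadratic in equation 12, or None if there is no real interval."""
    a = p_a * (2 - p_h - p_s)
    b = -p_a * (3 - 2 * p_h - p_s)
    c = 1 - p_a * p_h
    disc = b * b - 4 * a * c
    if a <= 0 or disc <= 0:
        return None
    root = math.sqrt(disc)
    # With a > 0 the quadratic is negative strictly between its two roots.
    return (-b - root) / (2 * a), (-b + root) / (2 * a)


for phi in (0.0, 0.5, 1.0):
    g1, g2 = msci_gene_tree_probs(phi, p_a=0.9, p_h=0.9, p_s=0.1)
    print(f"phi = {phi}: P(G1) = {g1:.4f}, P(G2) = P(G3) = {g2:.4f}")
print(anomaly_zone(0.9, 0.9, 0.1))
```

Working with |$P_A$|, |$P_H$|, and |$P_S$| rather than the underlying |$\tau$| and |$\theta$| values mirrors the observation that the gene tree probabilities depend only on |$\varphi$| and three branch lengths in coalescent units.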
At |$\varphi$| = 0, we have |$\mathbb{P}(G_1)$| = |$1 - \frac{2}{3}P_A P_H$| = 0.46 while at |$\varphi$| = 1, |$\mathbb{P}(G_1)$| = |$1 - \frac{2}{3}P_A$| = 0.4. The anomaly |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$| occurs in the zone 0.247694 |$< \varphi <$| 0.852306. When |$\varphi$| is close to 1 (or |$>$| 0.852306), |$a_1$| and |$a_2$| either coalesce in |$A$| or both will very likely enter species |$S$| and coalesce with |$b$| at random, so that |$\mathbb{P}(G_1) >$| |$\mathbb{P}(G_2)$|. Note that |$\mathbb{P}(G_1)$| is not a monotonic function of |$\varphi$|: when introgression is either very rare or virtually guaranteed there is an increased chance for |$a_1$| and |$a_2$| to be in the same population and coalesce. Supplementary Figure S2A–C available on Dryad examines the impact of |$\tau_R$|, |$\theta_H/\theta_S$|, and |$\tau_H$| (Fig. 1C) on gene tree probabilities. The anomaly |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$| occurs when |$\tau_R$| is in a certain range, when |$\theta_H$| is much greater than |$\theta_S$|, and when |$\tau_H$| is small.

Figure 4. A) Probabilities of gene trees |$G_1$| and |$G_2$| as functions of the introgression probability |$\varphi$| in the MSci model (Fig. 1C) when |$P_A = P_H = 0.9$| and |$P_S = 0.1$| are fixed. B) Partition of the parameter space according to gene tree probabilities: |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$| if and only if the parameter values are inside the tent. The inner and outer tents correspond to |$P_A = 0.90$| and 0.95, respectively.

Figure 4B shows the zone of parameters in which |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$| in a 3D space. When |$P_A$| and |$P_H$| are large and |$P_S$| is small (or when |$\theta_A$| and |$\theta_H$| are large and |$\theta_S$| is small), the anomaly |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$| may occur even with |$\varphi < 0.5$|.

Average Coalescent Times

The density of the coalescent time between sequences |$a_1$| and |$a_2$| is $$\begin{equation}\label{eq:ftaa-MSci} f(t_{aa}) = \begin{cases} \frac{2}{\theta_A} \textrm{e}^{-\frac{2}{\theta_A}t_{aa}}, & \textrm{if } 0 < t_{aa} < \tau_H,\\ P_A \left[{(1 - \varphi)}^2 \frac{2}{\theta_H} \textrm{e}^{-\frac{2}{\theta_H}(t_{aa}-\tau_H)} + \varphi^2 \frac{2}{\theta_S} \textrm{e}^{-\frac{2}{\theta_S}(t_{aa}-\tau_H)}\right], & \textrm{if } \tau_H < t_{aa} < \tau_R,\\ P_A [{(1 - \varphi)}^2 P_H + \varphi^2 P_S + 2\varphi (1 - \varphi)] \frac{2}{\theta_R} \textrm{e}^{-\frac{2}{\theta_R}(t_{aa}-\tau_R)}, & \textrm{if } t_{aa} > \tau_R. \end{cases} \end{equation}$$(13) First, the probability, |$f(t_{aa})\Delta t$|, that sequences |$a_1$| and |$a_2$| coalesce during the time interval |$(t_{aa}, t_{aa} + \Delta t)$|, with |$t_{aa} < \tau_H$|, is given by the probability, |$\textrm{e}^{-\frac{2}{\theta_A}t_{aa}}$|, that they do not coalesce before time |$t_{aa}$|, multiplied by the probability, |$\frac{2}{\theta_A} \Delta t$|, that they coalesce during the time interval (|$t_{aa}$|, |$t_{aa} + \Delta t$|). Second, for |$\tau_H < t_{aa} < \tau_R$|, |$f(t_{aa}) \Delta t$| is the sum of two terms, as the coalescent event can occur in either species |$H$| or |$S$|. 
The first term, |$P_A {(1 - \varphi)}^2 \frac{2}{\theta_H} \textrm{e}^{-\frac{2}{\theta_H}(t_{aa}-\tau_H)} \Delta t$|, is the probability that |$a_1$| and |$a_2$| do not coalesce in species |$A$|, but both enter species |$H$| and coalesce there. Similarly, the second term, |$P_A \varphi^2 \frac{2}{\theta_S} \textrm{e}^{-\frac{2}{\theta_S}(t_{aa}-\tau_H)} \Delta t$|, is the probability that |$a_1$| and |$a_2$| do not coalesce in species |$A$|, but both enter species |$S$| and coalesce there. Finally, in the case |$t_{aa} > \tau_R$|, both |$a_1$| and |$a_2$| enter |$R$| with probability |$P_A [{(1 - \varphi)}^2 P_H + \varphi^2 P_S + 2\varphi (1 - \varphi)]$|, and coalesce in |$R$| in the time interval (|$t_{aa}$|, |$t_{aa}+\Delta t$|) with probability |$\frac{2}{\theta_R}\textrm{e}^{-\frac{2}{\theta_R}(t_{aa}-\tau_R)} \Delta t$|. The expectation of |$t_{aa}$| is given by averaging over the four cases of equation 13 in which |$a_1$| and |$a_2$| coalesce in |$A$|, |$H$|, |$S$|, and |$R$|. $$\begin{align} \label{eq:Etaa-MSci} \mathbb{E}(t_{aa}) & = \frac{\theta_A}{2} - P_A \left(\tau_H + \frac{\theta_A}{2}\right)\nonumber\\ &\quad{} + P_A {(1-\varphi)}^2 \left[\tau_H + \frac{\theta_H}{2} - P_H (\tau_R + \frac{\theta_H}{2})\right] \nonumber \\ &\quad{} + P_A \varphi^2 \left[\tau_H + \frac{\theta_S}{2} - P_S \left(\tau_R + \frac{\theta_S}{2}\right)\right]\nonumber\\ &\quad{} + P_A [P_H{(1-\varphi)}^2 + P_S \varphi^2 + 2 \varphi (1 - \varphi)] \left(\tau_R + \frac{\theta_R}{2}\right)\!. \end{align}$$(14) Similarly the density of the coalescent time between sequences |$a_1$| and |$b$| is $$\begin{equation} \label{eq:ftab-MSci} f(t_{ab}) = \begin{cases} \varphi \frac{2}{\theta_S} \textrm{e}^{-\frac{2}{\theta_S}(t_{ab}-\tau_H)}, & \textrm{if } \tau_H < t_{ab} < \tau_R, \\ [(1 - \varphi) + P_S \varphi] \frac{2}{\theta_R} \textrm{e}^{-\frac{2}{\theta_R}(t_{ab}-\tau_R)}, & \textrm{if } t_{ab} > \tau_R. 
\end{cases} \end{equation}$$(15) When |$\tau_H < t_{ab} < \tau_R$|, the coalescent occurs in species |$S$|, and |$f(t_{ab}) \Delta t$| is given by the probability, |$\varphi$|, that sequence |$a_1$| enters species |$S$| times the probability, |$\frac{2}{\theta_S}\textrm{e}^{-\frac{2}{\theta_S}(t_{ab}-\tau_H)} \Delta t$|, that |$a_1$| and |$b$| coalesce in |$S$| in the time interval (|$t_{ab}$|, |$t_{ab}+\Delta t$|). In the case of |$t_{ab} > \tau_R$|, the coalescent occurs in species |$R$|. The probability that both |$a_1$| and |$b$| enter |$R$| is |$(1 - \varphi) + \varphi P_S$|, and the probability that they coalesce in |$R$| in the time interval (|$t_{ab}$|, |$t_{ab}+\Delta t$|) is |$\frac{2}{\theta_R}\textrm{e}^{-\frac{2}{\theta_R}(t_{ab}-\tau_R)} \Delta t$|. The expectation of |$t_{ab}$| is given by $$\begin{equation} \label{eq:Etab-MSci} \mathbb{E}(t_{ab}) = \varphi \left[\tau_H + \frac{\theta_S}{2} + P_S \left(\frac{\theta_R}{2} - \frac{\theta_S}{2} \right)\right] + (1-\varphi)\left( \tau_R + \frac{\theta_R}{2} \right). \end{equation}$$(16) This is a weighted average depending on whether sequence |$a_1$| enters |$S$| (with probability |$\varphi$|) or |$H$| (with probability |$1-\varphi$|). If |$a_1$| enters |$S$|, the mean coalescent time is |$\tau_H + \frac{\theta_S}{2} + P_S \left(\frac{\theta_R}{2} - \frac{\theta_S}{2} \right)$| by the argument of iterated corrections. Similarly, with probability |$1 - \varphi$| sequence |$a_1$| enters |$H$| and coalesces with |$b$| in |$R$|, with mean coalescent time |$\tau_R + \frac{\theta_R}{2}$|. 
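Equations 14 and 16 translate directly into code. The Python sketch below is a minimal illustration, with hypothetical function and argument names, that computes |$P_A$|, |$P_H$|, and |$P_S$| from equations 8 and 9 and then evaluates the two expectations. The example call uses |$\theta_A = \theta_H = 0.05$|, |$\theta_S = \theta_R = 0.001$|, |$\tau_R = 0.01$|, and |$\tau_H = 0.0025$|.

```python
import math


def msci_expected_times(phi, theta_a, theta_h, theta_s, theta_r, tau_h, tau_r):
    """Return (E[t_aa], E[t_ab]) under the MSci model, equations 14 and 16."""
    p_a = math.exp(-2.0 * tau_h / theta_a)             # equation 8
    p_h = math.exp(-2.0 * (tau_r - tau_h) / theta_h)   # equation 9
    p_s = math.exp(-2.0 * (tau_r - tau_h) / theta_s)   # equation 9
    root_mean = tau_r + theta_r / 2.0                  # mean time given coalescence in R
    e_taa = (theta_a / 2 - p_a * (tau_h + theta_a / 2)
             + p_a * (1 - phi) ** 2 * (tau_h + theta_h / 2 - p_h * (tau_r + theta_h / 2))
             + p_a * phi ** 2 * (tau_h + theta_s / 2 - p_s * (tau_r + theta_s / 2))
             + p_a * (p_h * (1 - phi) ** 2 + p_s * phi ** 2 + 2 * phi * (1 - phi))
             * root_mean)
    e_tab = (phi * (tau_h + theta_s / 2 + p_s * (theta_r / 2 - theta_s / 2))
             + (1 - phi) * root_mean)
    return e_taa, e_tab


example = dict(theta_a=0.05, theta_h=0.05, theta_s=0.001, theta_r=0.001,
               tau_h=0.0025, tau_r=0.01)
for phi in (0.0, 0.2, 0.5, 0.99, 1.0):
    e_taa, e_tab = msci_expected_times(phi, **example)
    flag = "anomaly" if e_taa > e_tab else "no anomaly"
    print(f"phi = {phi}: E[t_aa] = {e_taa:.6f}, E[t_ab] = {e_tab:.6f} ({flag})")
```

Scanning |$\varphi$| on a finer grid, or solving the quadratic in |$\varphi$| obtained from the difference of the two expectations, gives the interval over which |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$|.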
Thus |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$| if and only if $$\begin{align} & \frac{\theta_A}{2} - P_A (\tau_H + \frac{\theta_A}{2}) + P_A {(1-\varphi)}^2 \left[\tau_H + \frac{\theta_H}{2} - P_H \left(\tau_R + \frac{\theta_H}{2}\right)\right]\nonumber\\ &\quad{} + (P_A \varphi^2 - \varphi) \left[\!\tau_H + \frac{\theta_S}{2} - P_S \left(\tau_R + \frac{\theta_S}{2}\right)\!\right] + [P_A P_H {(1-\varphi)}^2\nonumber\\ &\quad{} + P_S \varphi (P_A \varphi - 1) + (2 P_A \varphi - 1)(1 - \varphi)] \left(\tau_R + \frac{\theta_R}{2}\right) > 0. \end{align}$$(17) We plot |$\mathbb{E}(t_{aa})$| and |$\mathbb{E}(t_{ab})$| against |$\varphi$| in Figure 5A, with other parameters in the MSci model fixed: |$\theta_A = \theta_H = 0.05$|, |$\theta_S = \theta_R = 0.001$|, |$\tau_R = 0.01$|, and |$\tau_H =$| 0.0025. Note that the coalescent times depend on all seven parameters of the MSci model except |$\theta_B$| (Fig. 1C). The cases |$\varphi$| = 0 and 1 correspond to MSC (complete isolation) models for two species with changing population sizes. With |$\varphi$| = 0, sequences |$a_1$| and |$a_2$| coalesce at different rates determined by population sizes |$\theta_A$|, |$\theta_H$|, and |$\theta_R$|, so the approach of iterated corrections gives |$\mathbb{E}(t_{aa})$| = |$\frac{\theta_A}{2}$| + |$P_A [ (\frac{\theta_H}{2} - \frac{\theta_A}{2})$| + |$P_H(\frac{\theta_R}{2} - \frac{\theta_H}{2}) ] $| = 0.00857716. Also at |$\varphi$| = 0, sequences |$a_1$| and |$b$| can coalesce in |$R$| only, with |$\mathbb{E}(t_{ab})$| = |$\tau_R + \frac{\theta_R}{2}$| = 0.0105. 
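The left-hand side of inequality 17 can also be scanned over |$\varphi$| to locate where it changes sign. The sketch below is illustrative only: the helper names `lhs_eq17`, `bisect_root`, and `sign_changes` are hypothetical, the definitions of |$P_A$|, |$P_H$|, and |$P_S$| are assumed as exponential no-coalescence probabilities, and the default arguments are the Figure 5A values quoted above. A coarse grid finds sign changes and plain bisection refines them.

```python
import math


def lhs_eq17(phi, theta_A=0.05, theta_H=0.05, theta_S=0.001,
             theta_R=0.001, tau_H=0.0025, tau_R=0.01):
    """Left-hand side of inequality 17, i.e. E(t_aa) - E(t_ab)."""
    # Assumed definitions of the no-coalescence probabilities.
    P_A = math.exp(-2.0 * tau_H / theta_A)
    P_H = math.exp(-2.0 * (tau_R - tau_H) / theta_H)
    P_S = math.exp(-2.0 * (tau_R - tau_H) / theta_S)
    term_A = theta_A / 2 - P_A * (tau_H + theta_A / 2)
    term_H = P_A * (1 - phi) ** 2 * (
        tau_H + theta_H / 2 - P_H * (tau_R + theta_H / 2))
    term_S = (P_A * phi ** 2 - phi) * (
        tau_H + theta_S / 2 - P_S * (tau_R + theta_S / 2))
    term_R = (P_A * P_H * (1 - phi) ** 2
              + P_S * phi * (P_A * phi - 1)
              + (2 * P_A * phi - 1) * (1 - phi)) * (tau_R + theta_R / 2)
    return term_A + term_H + term_S + term_R


def bisect_root(f, lo, hi, tol=1e-10):
    """Refine a root of f bracketed by [lo, hi] using bisection."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def sign_changes(f, grid=1000):
    """Scan phi over [0, 1] and return the refined sign-change points."""
    roots = []
    prev_phi, prev_val = 0.0, f(0.0)
    for i in range(1, grid + 1):
        phi = i / grid
        val = f(phi)
        if prev_val * val < 0:
            roots.append(bisect_root(f, prev_phi, phi))
        prev_phi, prev_val = phi, val
    return roots


if __name__ == "__main__":
    print(sign_changes(lhs_eq17))
```

With these parameter values the scan should report two sign changes, near |$\varphi \approx 0.253$| and |$\varphi \approx 0.971$|, with the left-hand side positive between them. Because the expression is quadratic in |$\varphi$|, the roots could equally be obtained in closed form; bisection is used here only because it carries over unchanged when a different parameter is varied.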
At |$\varphi = 1$|, sequences |$a_1$| and |$a_2$| can coalesce in |$A$|, |$S$|, or |$R$|, so that |$\mathbb{E}(t_{aa})$| = |$\frac{\theta_A}{2}$| + |$P_A [ (\frac{\theta_S}{2} - \frac{\theta_A}{2})$| + |$P_S(\frac{\theta_R}{2} - \frac{\theta_S}{2}) ] $| = 0.00283148, while sequences |$a_1$| and |$b$| can coalesce in |$S$| or |$R$|, with |$\mathbb{E}(t_{ab})$| = |$\tau_H + \frac{\theta_S}{2}$| + |$P_S (\frac{\theta_R}{2} - \frac{\theta_S}{2})$| = 0.003. When |$\varphi$| is close to 1, either |$a_1$| and |$a_2$| coalesce in |$A$| or they both enter |$S$| and coalesce with |$b$| at random, so that |$\mathbb{E}(t_{aa})$| |$< \mathbb{E}(t_{ab})$|. When 0.252962 |$< \varphi <$| 0.971179, we have the anomaly |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$|. If species |$A$| has a much larger population size than |$S$|, it may be more likely for sequence |$a_1$| or |$a_2$| to migrate into species |$S$| and coalesce with |$b$| than for |$a_1$| to coalesce with |$a_2$|, causing |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$|. Such an anomaly may occur even if |$\varphi$| is much smaller than |$\frac{1}{2}$|.

Figure 5. A) The expected coalescent times, |$\mathbb{E}(t_{aa})$| and |$\mathbb{E}(t_{ab})$|, as functions of the introgression probability |$\varphi$| under the MSci model (Fig. 1C). Other parameters are fixed at |$\theta_A = \theta_H = 0.05$|, |$\theta_S = \theta_R = 0.001$|, |$\tau_R = 0.01$|, and |$\tau_H = 0.0025$|. B) Partition of the parameter space defined by |$\theta_A, \theta_S$|, and |$\varphi$| under the MSci model (Fig. 1C): inside the tent, |$\mathbb{E}(t_{aa})$| |$> \mathbb{E}(t_{ab})$|, while outside it the opposite is true. Other parameters are fixed at |$\theta_H = 0.05$|, |$\theta_R = 0.001$|, |$\tau_R = 0.01$|, and |$\tau_H = 0.0025$|.

We confirmed our derivations by simulating |$10^7$| gene trees using bpp (Flouri et al. 2018). With |$\varphi = 0.4$| in Figure 5A, the estimates are 0.00815726 for |$\mathbb{E}(t_{aa})$| and 0.007499884 for |$\mathbb{E}(t_{ab})$|, compared with 0.008157341 and 0.0075 from equations 14 and 16. Supplementary Figure S2D–F available on Dryad examines the impact of |$\tau_R$|, |$\theta_H/\theta_S$|, and |$\tau_H$| (Fig. 1C) on the average coalescent times, when other parameters are fixed at the values of Figure 5. We find |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$| when |$\tau_R$| is in a certain range, when |$\theta_H$| is much greater than |$\theta_S$|, and when |$\tau_H$| is small. Figure 5B shows the anomaly zone with |$\mathbb{E}(t_{aa}) > \mathbb{E}(t_{ab})$| in the 3D space of parameters |$\theta_A$|, |$\theta_S$|, and |$\varphi$|, with other parameters fixed.

Discussion

The Nature of the Anomaly

The species-definition anomaly zone, in which the within-species divergence is greater than the between-species divergence, with divergence measured by either the gene tree probability or the average genetic distance, is very similar to the species-tree anomaly zone (Degnan and Rosenberg 2006).
In the species-tree anomaly zone, the use of the most common gene tree topology as the species tree estimate will be statistically inconsistent, although it should be emphasized that the problem disappears if one takes a likelihood approach and uses the likelihood (i.e., the probability of the gene trees) to compare different species trees (Xu and Yang 2016). The models considered in this article (Fig. 1B,C) involve only two species with only one simple species tree: |$(A, B)$|. However, one may consider the gene tree |$G_1$| = |$((a_1, a_2), b)$| to match the species tree, and gene trees |$G_2$| = |$((b, a_1), a_2)$| and |$G_3$| = |$((b, a_2), a_1)$| to be the mismatching trees. Then the anomaly |$\mathbb{P}(G_1) < \mathbb{P}(G_2)$| means that the matching gene tree has a smaller probability than either mismatching gene tree, a situation very similar to the anomaly zone in species tree estimation. Nevertheless, the anomaly zone for species tree estimation is due to polymorphism in ancestral species and the resulting deep coalescence, while the anomaly discussed in this paper is due to cross-species gene flow and different population sizes. In the context of phylogenetic network (i.e., MSci) models, Zhu et al. (2016) defined an anomalous gene tree as one that has a higher probability than any gene tree that matches a displayed species tree, where displayed species trees are the binary trees that remain when one of the two parental branches at each hybridization node in the species network is removed (Zhu et al. 2016; Zhu and Degnan 2017). All such anomalies share the feature that the most probable gene tree under the data-generating model does not match one’s intuitive expectation. Here, we stress that the “counter-intuitive” results do not imply that genetic sequence data contain misleading information about the history of species divergences.

The species-definition anomaly does not occur in the MSC model without gene flow (Fig. 1A).
Nor does it occur in simple models of population subdivision in population genetics. For example, under the islands and stepping-stones models, the expected coalescent time between sequences sampled from the same population must be smaller than that between sequences sampled from two different populations (Li 1976; Strobeck 1987; Slatkin 1987, 1991). Those models assume symmetry in the population size and migration rate: the different populations are assumed to have the same size and the migration rate is assumed to be the same between any two populations in the islands model or between any two adjacent populations in the stepping-stones model. In the IM and MSci models considered here, cross-species gene flow and large differences in population size are the main causes of the anomaly.

We note that the anomaly described in this article can occur in more general settings than we considered. For example, we have assumed unidirectional migration (from |$B$| to |$A$| only) in the IM model, but the same behavior should occur in a more general model of bidirectional migration (Long and Kubatko 2018), as long as there is sufficient asymmetry in the population size and in the migration rate.

In our analysis, we have assumed a simple neutral coalescent model (with and without gene flow) and have not considered the effects of natural selection or population structure. Selection may distort the distribution of the gene tree topologies and coalescent times, especially when the population sizes, and thus the efficacy of purifying selection, differ between species (He et al. 2020). Previously, coding loci were found to produce species-tree and parameter estimates highly consistent with those from the noncoding parts of the genome (Shi and Yang 2018; Thawornwattana et al. 2018; Flouri et al. 2020), suggesting that the effects may be minor if purifying selection operates in similar ways in different species.
However, species-specific selection, as expected for gene loci responsible for ecological adaptation of the species (Turner et al. 2005; Pardo-Diaz et al. 2012), will likely have major impacts on the gene tree distribution. Furthermore, our analysis has assumed that each species is a panmictic population. Population subdivision may lead to an inflated effective population size for the species, and may create a scenario that is similar to the model studied here. Suppose species |$A$| has a wide-ranging geographical distribution with population subdivision, while species |$B$| has a very limited distribution and is close to one of the geographical populations of species |$A$|. Our analysis suggests that gene flow in such a setting can easily create a species-definition anomaly zone, with two sequences randomly sampled from species |$A$| being on average more distantly related than two sequences from the two different species.

How Common is the Species-Definition Anomaly?

While our theoretical calculations suggest that the species-definition anomaly is possible in large zones of the parameter space, it is not known how often it occurs in nature. This empirical question can be addressed by estimating the relevant parameters (in particular the migration rate |$M$| and the introgression probability |$\varphi$|) under the IM and MSci models using genomic sequence data. Currently, such estimates are rare and mostly based on small data sets, while it may be necessary to use hundreds or thousands of loci to get reliable estimates. Nevertheless, available estimates (e.g., Pinho and Hey 2010, Table S1) suggest that population sizes can differ by orders of magnitude even between closely related species, and migration is often asymmetrical, providing opportunities for the anomaly to occur. Here, we briefly review a few recent studies that generated estimates of migration rates from genomic data, from fruit flies, mosquitoes, butterflies, and gibbons.
Several studies have found significant evidence for gene flow from Drosophila simulans to D. melanogaster, at the rate of |$M_{S \to M} =$| 0.02–0.04 migrant individuals per generation, but no migration in the opposite direction (|$M_{M \to S} \approx 0$|) (Wang and Hey 2010; Dalquen et al. 2017, Tables 9 and 10). Population sizes were around |$\theta_S = 0.013$| and |$\theta_M = 0.005$|, with the divergence time |$\tau_{SM} \approx$| 0.012–0.014 (Dalquen et al. 2017, Tables 9 and 10).

In the Anopheles gambiae species complex of African mosquitoes, hybridization occurs between several pairs of nonsister species. Gene flow from A. arabiensis to A. gambiae (or A. coluzzii) occurs so frequently for the autosomes that the gene trees reflect the migration history rather than the history of species divergences (Fontaine et al. 2015; Thawornwattana et al. 2018). Estimates from the genomic data are on the order of |$M_{A\to G} \approx 0.2$| migrants per generation while |$M_{G \to A} = 0$| (Thawornwattana et al. 2018, Table S3; Flouri et al. 2020, Table 1), in agreement with crossing experiments, which showed that introgressed alleles from A. arabiensis to A. gambiae persisted over many generations, while it was not possible to maintain an introgression colony in the opposite |$G \to A$| direction (Slotman et al. 2005). Other parameters were around |$\theta_A$| = 0.014, |$\theta_G$| = 0.02–0.03, and |$\tau_{AG}$| = 0.007 (Thawornwattana et al. 2018, Table S3).

Heliconius butterflies constitute one of the best studied groups for cross-species hybridization/introgression, involving many sister- and nonsister-species pairs, and involving both recent and ancient gene flow (Bull et al. 2006; Kronforst et al. 2006; Mallet et al. 2007; Salazar et al. 2008; Pardo-Diaz et al. 2012; Martin et al. 2013). A recent study (Van Belleghem et al.
2020) applied coalescent-based simulation to joint site-frequency spectrum data to estimate the migration rates and population sizes between two incipient species, H. erato and H. himera, finding strong evidence for highly asymmetrical introgression, predominantly from H. erato favorinus to H. himera, at the rate of |$M =$| 0.5–0.6 migrants per generation, with |$\tau \approx 0.002$|, |$\theta_E = 0.01$|, and |$\theta_H =0.0008$|.

In an analysis of genomic sequences from five species of gibbons (which belong to four different genera), gene flow was inferred between two species of the same genus, Hylobates moloch and H. pileatus, but not between species of different genera. The migration rates were estimated to be |$M_{M \to P} \approx 0.008$| migrants per generation, while |$M_{P \to M} \approx 0$|, with |$\theta_M = 0.0014$|, |$\theta_P = 0.0005$|, and |$\tau= 0.0017$| (Shi and Yang 2018, Fig. 1).

The parameter estimates suggest that those species pairs are not in the species-definition anomaly zone as discussed in this article. Nevertheless, they do suggest large differences in population size and in the migration rate in the two directions. They also indicate that the parameter values used in our example calculations (Figs. 2, 3, 4, 5) are representative of real biological systems. We leave it to future genomic analyses to determine how common the anomaly is in the real world. As more and more genomes are sequenced, and as analytical methods are improved to handle large data sets, we see exciting opportunities for using genomic data to infer the evolutionary history of species divergence and gene flow.

The Impact of Gene Flow on the Definition and Identification of Species

It is noteworthy that the migration rate required for the species-definition anomaly to occur may be much less than one migrant per generation.
For species such as mosquitoes, the population size may well be larger than a million, which means that a proportion of migrants less than one in a million is sufficient to change the apparent genetic history of the species. In population genetic models of population subdivision, migration rates of |$M \ll 1$| are low enough that the populations will be differentiated or isolated (as measured by |$F_{\mathrm{st}}$|) (Wright 1931). However, in the IM model, such low levels of gene flow can have a dramatic impact on the history of the species as represented in gene genealogies or genetic distances. Similarly, Jiao et al. (2020) found that even a small amount of migration per generation can have a huge impact on species tree estimation under the simple MSC model (see also Long and Kubatko 2018).

The dramatic impact of gene flow on the genetic history of the species suggests that one has to consider this effect when defining and identifying species. In the species-definition anomaly zone, simple application of DNA barcoding or the gdi will lump genuinely distinct species into the same species. Thus, if those methods suggest one species but there is evidence for asymmetrical gene flow between the populations and drastically different population sizes, the results from those methods should be re-examined for the impact of gene flow.

We suggest estimating and contrasting the long-term migration rate and the short-term hybridization rate as an effective approach to establishing the existence of reproductive barriers and evidence for species status. Note that genomic sequence data may contain rich information concerning evolutionary parameters such as species divergence times, population sizes, and migration rates or introgression probabilities, which may be invaluable for delimiting species boundaries (Fujita et al. 2012; Leaché et al. 2019).
The migration rate estimated from genomic data under the IM model reflects the long-term impact of gene flow and genetic drift, as well as natural selection against introgressed alleles (Martin and Jiggins 2017). Genomic sequence data can also be used to identify recent hybridization/admixture events (Anderson and Thompson 2002; Anderson 2008; Veller et al. 2019). A greatly reduced migration rate relative to the hybridization rate (e.g., a migration rate of |$m = 10^{-6}$| per generation relative to a proportion of F|$_1$| hybrids of |$0.1\%$|) may be strong evidence that introgressed alleles are deleterious and removed from the receiving population by natural selection and that reproductive barriers exist between the species. While genomic data may be currently lacking for many species groups, this approach may become feasible in the near future with advancements in genome sequencing technologies and development of reduced-representation data sets (Lemmon et al. 2012; Edwards et al. 2017), as well as advancements in analytical methods that accommodate both the coalescent and gene flow (Dalquen et al. 2017; Hey et al. 2018; Wen and Nakhleh 2018; Zhang et al. 2018; Flouri et al. 2020).

Supplementary Material

Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.xwdbrv1b5.

Acknowledgments

We thank James Mallet and Yuttapong Thawornwattana for many discussions and comments. We are grateful to three anonymous reviewers, Matthew Hahn, and Bryan Carstens for many constructive comments.

Funding

This study has been supported by Biotechnology and Biological Sciences Research Council grant [BB/P006493/1 to Z.Y.] and a BBSRC equipment grant [BB/R01356X/1].

References

Andersen L.N., Mailund T., Hobolth A. 2014. Efficient computation in the IM model. J. Math. Biol. 68:1423–1451.
Anderson E., Thompson E. 2002. A model-based method for identifying species hybrids using multilocus genetic data.
Genetics 160:1217–1229.
Anderson E.C. 2008. Bayesian inference of species hybrids using multilocus dominant genetic markers. Philos. Trans. R. Soc. Lond. B: Biol. Sci. 363(1505):2841–2850.
Arnold B.J., Lahner B., DaCosta J.M., Weisman C.M., Hollister J.D., Salt D.E., Bomblies K., Yant L. 2016. Borrowed alleles and convergence in serpentine adaptation. Proc. Natl. Acad. Sci. USA 113(29):8320–8325.
Bull V., Beltran M., Jiggins C., McMillan W.O., Bermingham E., Mallet J. 2006. Polyphyly and gene flow between non-sibling Heliconius species. BMC Biol. 4:11.
Burgess R., Yang Z. 2008. Estimation of Hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors. Mol. Biol. Evol. 25:1979–1994.
Chan Y.C., Roos C., Inoue-Murayama M., Inoue E., Shih C.C., Pei K.J., Vigilant L. 2013. Inferring the evolutionary histories of divergences in Hylobates and Nomascus gibbons through multilocus sequence data. BMC Evol. Biol. 13:82.
Coyne J.A., Orr H.A. 2004. Speciation. Sunderland, MA: Sinauer Association.
Dalquen D., Zhu T., Yang Z. 2017. Maximum likelihood implementation of an isolation-with-migration model for three species. Syst. Biol. 66:379–398.
Dasmahapatra K.K., Elias M., Hill R.I., Hoffman J.I., Mallet J. 2010. Mitochondrial DNA barcoding detects some species that are real, and some that are not. Mol. Ecol. Resour. 10(2):264–273.
De Queiroz K. 2007.
Species concepts and species delimitation. Syst. Biol. 56:879–886.
Degnan J.H., Rosenberg N.A. 2006. Discordance of species trees with their most likely gene trees. PLoS Genet. 2:e68.
Edwards S., Cloutier A., Baker A. 2017. Conserved nonexonic elements: a novel class of marker for phylogenomics. Syst. Biol. 66(6):1028–1044.
Ellegren H., Smeds L., Burri R., Olason P.I., Backstrom N., Kawakami T., Kunstner A., Makinen H., Nadachowska-Brzyska K., Qvarnstrom A., Uebbing S., Wolf J.B.W. 2012. The genomic landscape of species divergence in Ficedula flycatchers. Nature 491:756–760.
Faircloth B.C., McCormack J.E., Crawford N.G., Harvey M.G., Brumfield R.T., Glenn T.C. 2012. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Syst. Biol. 61(5):717–726.
Flouri T., Jiao X., Rannala B., Yang Z. 2018. Species tree inference with BPP using genomic sequences and the multispecies coalescent. Mol. Biol. Evol. 35(10):2585–2593.
Flouri T., Jiao X., Rannala B., Yang Z. 2020. A Bayesian implementation of the multispecies coalescent model with introgression for phylogenomic analysis. Mol. Biol. Evol. 37(4):1211–1223.
Fontaine M.C., Pease J.B., Steele A., Waterhouse R.M., Neafsey D.E., Sharakhov I.V., Jiang X., Hall A.B., Catteruccia F., Kakani E., Mitchell S.N., Wu Y.C., Smith H.A., Love R.R., Lawniczak M.K., Slotman M.A., Emrich S.J., Hahn M.W., Besansky N.J. 2015. Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science 347(6217):1258524.
Fujita M.K., Leaché A.D., Burbrink F.T., McGuire J.A., Moritz C. 2012. Coalescent-based species delimitation in an integrative taxonomy. Trends Ecol. Evol. 27:480–488.
He C., Liang D., Zhang P. 2020. Asymmetric distribution of gene trees can arise under purifying selection if differences in population size exist. Mol. Biol. Evol. 37(3):881–892.
Hebert P.D., Cywinska A., Ball S.L., deWaard J.R. 2003. Biological identifications through DNA barcodes. Proc. Biol. Sci. 270:313–321.
Hey J. 2010. Isolation with migration models for more than two populations. Mol. Biol. Evol. 27:905–920.
Hey J., Chung Y., Sethuraman A., Lachance J., Tishkoff S., Sousa V.C., Wang Y. 2018. Phylogeny estimation by integration over isolation with migration models. Mol. Biol. Evol. 35(11):2805–2818.
Hudson R.R., Turelli M. 2003. Stochasticity overrules the "three-times rule": genetic drift, genetic draft, and coalescence times for nuclear loci versus mitochondrial DNA. Evolution 57:182–190.
Jackson N.D., Carstens B.C., Morales A.E., O’Meara B.C. 2017. Species delimitation with gene flow. Syst. Biol. 66(5):799–812.
Jiao X., Flouri T., Rannala B., Yang Z. 2020. The impact of cross-species gene flow on species tree estimation. Syst. Biol. 69:830–847.
Karin B.R., Gamble T., Jackman T.R. 2020. Optimizing phylogenomics with rapidly evolving long exons: comparison with anchored hybrid enrichment and ultraconserved elements. Mol. Biol. Evol. 37(3):904–922.
Kronforst M.R., Young L.G., Blume L.M., Gilbert L.E. 2006. Multilocus analyses of admixture and introgression among hybridizing Heliconius butterflies. Evolution 60(6):1254–1268.
Leaché A.D., Koo M.S., Spencer C.L., Papenfuss T.J., Fisher R.N., McGuire J.A. 2009. Quantifying ecological, morphological, and genetic variation to delimit species in the coast horned lizard species complex (Phrynosoma). Proc. Natl. Acad. Sci. USA 106:12418–12423.
Leaché A.D., Zhu T., Rannala B., Yang Z. 2019. The spectre of too many species. Syst. Biol. 68(1):168–181.
Lemmon A.R., Emme S.A., Lemmon E.M. 2012. Anchored hybrid enrichment for massively high-throughput phylogenomics. Syst. Biol. 61(5):727–744.
Li G., Figueiro H.V., Eizirik E., Murphy W.J. 2019. Recombination-aware phylogenomics reveals the structured genomic landscape of hybridizing cat species. Mol. Biol. Evol. 36(10):2111–2126.
Li W.-H. 1976. Distribution of nucleotide differences between two randomly chosen cistrons in a subdivided population: the finite island model. Theor. Popul. Biol. 10:303–308.
Liu S., Lorenzen E.D., Fumagalli M., Li B., Harris K., Xiong Z., Zhou L., Korneliussen T.S., Somel M., Babbitt C., Wray G., Li J., He W., Wang Z., Fu W., Xiang X., Morgan C.C., Doherty A., O’Connell M.J., McInerney J.O., Born E.W., Dalen L., Dietz R., Orlando L., Sonne C., Zhang G., Nielsen R., Willerslev E., Wang J. 2014. Population genomics reveal recent speciation and rapid evolutionary adaptation in polar bears. Cell 157:785–794.
Long C., Kubatko L. 2018.
The effect of gene flow on coalescent-based species-tree inference. Syst. Biol. 67(5):770–785.
Mallet J. 2005. Hybridization as an invasion of the genome. Trends Ecol. Evol. 20:229–237.
Mallet J. 2008. Hybridization, ecological races, and the nature of species: empirical evidence for the ease of speciation. Philos. Trans. R. Soc. B: Biol. Sci. 363:2971–2986.
Mallet J. 2013. Concepts of species. In: Levin S., editor, Encyclopedia of biodiversity, Vol. 6. Massachusetts: Amsterdam: Academic Press. p. 679–691.
Mallet J., Beltran M., Neukirchen W., Linares M. 2007. Natural hybridization in heliconiine butterflies: the species boundary as a continuum. BMC Evol. Biol. 7:28.
Mao Y., Economo E.P., Satoh N. 2018. The roles of introgression and climate change in the rise to dominance of Acropora corals. Curr. Biol. 28(21):3373–3382.e5.
Martin S.H., Jiggins C.D. 2017. Interpreting the genomic landscape of introgression. Curr. Opin. Genet. Dev. 47:69–74.
Martin S.H., Dasmahapatra K.K., Nadeau N.J., Salazar C., Walters J.R., Simpson F., Blaxter M., Manica A., Mallet J., Jiggins C.D. 2013. Genome-wide evidence for speciation with gene flow in Heliconius butterflies. Genome Res. 23(11):1817–1828.
Meyer C.P., Paulay G. 2005. DNA barcoding: error rates based on comprehensive sampling. PLoS Biol. 3(12):e422.
Nielsen R., Akey J.M., Jakobsson M., Pritchard J.K., Tishkoff S., Willerslev E. 2017. Tracing the peopling of the world through genomics.
Nature 541:302.
Notohara M. 1990. The coalescent and the genealogical process in geographically structured populations. J. Math. Biol. 29:59–75.
Pardo-Diaz C., Salazar C., Baxter S.W., Merot C., Figueiredo-Ready W., Joron M., McMillan W.O., Jiggins C.D. 2012. Adaptive introgression across species boundaries in Heliconius butterflies. PLoS Genet. 8(6):e1002752.
Pinho C., Hey J. 2010. Divergence with gene flow: models and data. Ann. Rev. Ecol. Evol. Syst. 41:215–230.
Puillandre N., Lambert A., Brouillet S., Achaz G. 2012. Automatic barcode gap discovery for primary species delimitation. Mol. Ecol. 21:1864–1877.
Salazar C., Jiggins C., Taylor J.E., Kronforst M., Linares M. 2008. Gene flow and the genealogical history of Heliconius heurippa. BMC Evol. Biol. 8:132.
Shi C., Yang Z. 2018. Coalescent-based analyses of genomic sequence data provide a robust resolution of phylogenetic relationships among major groups of gibbons. Mol. Biol. Evol. 35:159–179.
Slatkin M. 1987. The average number of sites separating DNA sequences drawn from a subdivided population. Theor. Popul. Biol. 32:42–49.
Slatkin M. 1991. Inbreeding coefficients and coalescence times. Genet. Res. 58:167–175.
Slotman M.A., della Torre A., Calzetta M., Powell J.R. 2005. Differential introgression of chromosomal regions between Anopheles gambiae and An. arabiensis. Am. J. Trop. Med. Hyg. 73(2):326–335.
Strobeck K. 1987.
Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics 117:149–153.
Thawornwattana Y., Dalquen D., Yang Z. 2018. Coalescent analysis of phylogenomic data confidently resolves the species relationships in the Anopheles gambiae species complex. Mol. Biol. Evol. 35(10):2512–2527.
Tian Y., Kubatko L.S. 2016. Distribution of coalescent histories under the coalescent model with gene flow. Mol. Phylogenet. Evol. 105:177–192.
Turner T.L., Hahn M.W., Nuzhdin S.V. 2005. Genomic islands of speciation in Anopheles gambiae. PLoS Biol. 3(9):e285.
Van Belleghem S.M., Cole J.M., Montejo-Kovacevich G., Bacquet C.N., McMillan W.O., Papa R., Counterman B.A. 2020. Selection and gene flow define polygenic barriers between incipient butterfly species. bioRxiv p. 2020.04.09.034470.
Veller C., Edelman N., Muralidhar P., Nowak M. 2019. Recombination, variance in genetic relatedness, and selection against introgressed DNA. bioRxiv:846147.
Wang Y., Hey J. 2010. Estimating divergence parameters with small samples from a large number of loci. Genetics 184:363–379.
Wen D., Nakhleh L. 2018. Coestimating reticulate phylogenies and gene trees from multilocus sequence data. Syst. Biol. 67(3):439–457.
Wilkinson-Herbots H.M. 2008. The distribution of the coalescence time and the number of pairwise nucleotide differences in the "isolation with migration" model. Theor. Popul. Biol. 73:277–288.
Wright S. 1931.
Evolution in Mendelian populations . Genetics 16 : 97 – 159 . Google Scholar PubMed OpenURL Placeholder Text WorldCat Wu D.-D. , Ding X.-D., Wang S., Wojcik J.M., Zhang Y., Tokarska M., Li Y., Wang M.-S., Faruque O., Nielsen R., Zhang Q., Zhang Y.-P. 2018 . Pervasive introgression facilitated domestication and adaptation in the Bos species complex . Nature Ecol. Evol. 2 ( 7 ): 1139 – 1145 . Google Scholar Crossref Search ADS WorldCat Xu B. , Yang Z. 2016 . Challenges in species tree estimation under the multispecies coalescent model . Genetics 204 : 1353 – 1368 . Google Scholar Crossref Search ADS PubMed WorldCat Yang Z. , Rannala B. 2017 . Bayesian species identification under the multispecies coalescent provides significant improvements to DNA barcoding analyses . Mol. Ecol. 26 : 3028 – 3036 . Google Scholar Crossref Search ADS PubMed WorldCat Yu Y. , Dong J., Liu K.J., Nakhleh L. 2014 . Maximum likelihood inference of reticulate evolutionary histories . Proc. Natl. Acad. Sci. USA 111 ( 46 ): 16448 – 16453 . Google Scholar Crossref Search ADS WorldCat Zachos F.E. 2016 . Species concepts in biology . New York : Springer . Google Scholar Crossref Search ADS Google Preview WorldCat COPAC Zhang C. , Ogilvie H.A., Drummond A.J., Stadler T. 2018 . Bayesian inference of species networks from multilocus sequence data . Mol. Biol. Evol. 35 : 504 – 517 . Google Scholar Crossref Search ADS PubMed WorldCat Zhu J. , Yu Y., Nakhleh L. 2016 . In the light of deep coalescence: revisiting trees within networks . BMC Bioinformatics 17 : 415 . Google Scholar Crossref Search ADS PubMed WorldCat Zhu S. , Degnan J.H. 2017 . Displayed trees do not determine distinguishability under the network multispecies coalescent . Syst. Biol. 66 ( 2 ): 283 – 298 . Google Scholar PubMed OpenURL Placeholder Text WorldCat Zhu T. , Yang Z. 2012 . Maximum likelihood implementation of an isolation-with-migration model with three species for testing speciation with gene flow . Mol. Biol. Evol. 
29 : 3131 – 3142 . Google Scholar Crossref Search ADS PubMed WorldCat © The Author(s) 2020. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please email: [email protected] This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Out of Sight, Out of Mind: Widespread Nuclear and Plastid-Nuclear Discordance in the Flowering Plant Genus Polemonium (Polemoniaceae) Suggests Widespread Historical Gene Flow Despite Limited Nuclear Signal
Rose, Jeffrey P.; Toledo, Cassio A. P.; Lemmon, Emily Moriarty; Lemmon, Alan R.; Sytsma, Kenneth J.
doi: 10.1093/sysbio/syaa049 pmid: 32617587
Abstract

Phylogenomic data from a rapidly increasing number of studies provide new evidence for resolving relationships in recently radiated clades, but they also pose new challenges for inferring evolutionary histories. Most existing methods for reconstructing phylogenetic hypotheses rely solely on algorithms that only consider incomplete lineage sorting (ILS) as a cause of intra- or intergenomic discordance. Here, we utilize a variety of methods, including those to infer phylogenetic networks, to account for both ILS and introgression as a cause for nuclear and cytoplasmic-nuclear discordance using phylogenomic data from the recently radiated flowering plant genus Polemonium (Polemoniaceae), an ecologically diverse genus in Western North America with known and suspected gene flow between species. We find evidence for widespread discordance among nuclear loci that can be explained by both ILS and reticulate evolution in the evolutionary history of Polemonium. Furthermore, the histories of organellar genomes show strong discordance with the inferred species tree from the nuclear genome. Discordance between the nuclear and plastid genome is not completely explained by ILS, and only one case of discordance is explained by detected introgression events. Our results suggest that multiple processes have been involved in the evolutionary history of Polemonium and that the plastid genome does not accurately reflect species relationships. We discuss several potential causes for this cytoplasmic-nuclear discordance, which emerging evidence suggests is more widespread across the Tree of Life than previously thought.

[Cyto-nuclear discordance, genomic discordance, phylogenetic networks, plastid capture, Polemoniaceae, Polemonium, reticulations.]

The increasing use of genomic-level data sets from transcriptomes (Wickett et al. 2014), sequence capture probes such as ultraconserved elements and anchored hybrid enrichment (AHE) (Faircloth et al. 2012; Lemmon et al.
2012), and single nucleotide polymorphisms (SNPs) (Eaton and Ree 2013) has provided a plethora of data for resolving both deep and shallow scale relationships across all branches of the Tree of Life. However, analyzing these data opens the possibility for widespread conflicting phylogenetic signal among loci due to processes such as gene duplication, incomplete lineage sorting (ILS), and gene flow (both hybridization and introgression), necessitating a consideration of the causes of such discordance in analyses. Emerging evidence suggests that gene flow between lineages is important in the evolutionary history of both plants and animals (Cui et al. 2013; Folk et al. 2017, 2018; McVay et al. 2017a; Vargas et al. 2017; Gernandt et al. 2018; Grummer et al. 2018 ). Discordance among nuclear gene trees may not be the only complicating factor, as there also may be discordance among the nuclear genome and organellar genome(s) (cytoplasmic-nuclear or cyto-nuclear discordance). In plants, this discordance between the nuclear and (usually) the plastid (generally referred to in the literature as chloroplast) genomes has been interpreted as introgressive plastid capture, although uncertainty in phylogenetic inference as an explanation often is not discussed (Smith and Sytsma 1990; Rieseberg and Soltis 1991; Fehrer et al. 2007; Rojas-Andres et al. 2015; Yi et al. 2015). Despite their smaller effective population sizes, organellar genomes are also susceptible to ILS, although this is only beginning to be tested (Folk et al. 2017; Lee-Yaw et al. 2019), and plastid haplotypes that arise in a population via introgression may be maintained in the population as a result of natural selection, genetic drift, or asexual reproduction (Levin 2003; Tsitrone et al. 2003; Bock et al. 2014; Lee-Yaw et al. 2019). 
Despite the potential importance for gene flow within the evolutionary history of clades, most existing methods for phylogenetic analysis either assume no genomic discordance by concatenating loci or only assume a coalescent model in which all genomic discordance is due to ILS (Heled and Drummond 2010; Liu and Edwards 2010; Chifman and Kubatko 2014; Mirarab and Warnow 2015). However, new computational tools allow the inference of phylogenetic networks that account for both ILS and gene flow (Than et al. 2008; Solís-Lemus and Ané 2016; Kamneva and Rosenberg 2017; Solís-Lemus et al. 2017), although these programs are still currently limited by computational time for even 15 terminals (Solís-Lemus and Ané 2016:Fig. 6; but see Yu and Nakhleh 2015). Here, we evaluate the importance of gene flow in the evolutionary history of the flowering plant genus Polemonium L. (Polemoniaceae), a lineage with documented cases of interspecific hybridization under greenhouse conditions (Ostenfeld 1929; Clausen 1931, 1967) and with historical gene flow hypothesized based on morphological intergradation (Davidson 1950; Anway 1968; Sledge and Anway 1970; Grant 1989). We reconstruct the evolutionary history of Polemonium using both coalescent and phylogenetic network approaches with data gathered from AHE of up to 499 nuclear genes. In addition, we examine the genealogical history of the plastid genome (ptDNA, often referred to in the literature as cpDNA) and nuclear ribosomal DNA (nrDNA) independent of the low or single copy nuclear genes, both obtained from the AHE raw reads. Importantly, this study includes multiple samples for most of the morphological species in Polemonium, an approach undertaken in only a few studies using data of this type (cf. Folk et al. 2017; Morales-Briones et al. 2018; Stubbs et al. 2018). 
In studies containing multiple individuals per morphological species, these proposed taxa are not always reciprocally monophyletic in both the nuclear and plastid genomes (Folk et al. 2017; Pham et al. 2017; Lee-Yaw et al. 2019).

The Genus Polemonium

Polemonium is a genus of temperate, largely diploid (Grant 1989), primarily perennial herbs. It is the most widespread genus in Polemoniaceae, mostly occurring in Eurasia and North America, with one species (P. micranthum) disjunct between western North America and southern South America (Davidson 1950; Grant 1959; Johnson and Porter 2017). Infraspecific variability in floral traits and pollination mode has spurred research on trait heritability and natural selection by pollinators in several species (Galen and Kevan 1980; Galen and Newport 1987; Galen et al. 1987, 1991; Kulbaba et al. 2013), necessitating a robust understanding of the evolutionary history of the genus. While Polemonium consistently has been recovered as monophyletic (Johnson et al. 1996, 2008; Prather et al. 2000), the placement of the genus in Polemoniaceae is ambiguous (Steele and Vilgalys 1994; Johnson et al. 1996, 2008; Porter 1996; Porter and Johnson 1998; Prather et al. 2000). The currently accepted hypothesis is a sister relationship of Polemonium to tribe Phlocideae (Johnson et al. 2008), but support for this relationship varies considerably between optimality criteria, with posterior probabilities (PP) from Bayesian inference consistently finding strong support but maximum parsimony finding contrastingly low support. Based on our current understanding of the taxonomy of Polemonium, the genus radiated ca. 7.3–10.8 Ma (Landis et al. 2018; Rose et al. 2018) and consists of at least 27 species and as many as 38, although new species continue to be recognized (Murray and Elven 2011; Irwin et al. 2012; Stubbs and Patterson 2013).
Uncertainty in the number of species has largely depended on the circumscription of closely related Eurasian populations, which continue to be poorly understood (Davidson 1950; Hultén 1971; Vassiljev 1974; Feulner et al. 2001). Davidson (1950) was the first to tackle interspecific relationships in Polemonium in a systematic way. Based on his taxonomic conclusions, subsequent workers have focused on four major species complexes: 1) Polemonium caeruleum: erect plants of northern wetlands with paniculate inflorescences, coplanar leaflets, and yellow anthers; 2) P. pulcherrimum: caespitose mid- to high-elevation plants with corymbose inflorescences, coplanar leaflets, and white anthers (Wherry 1967; Grant 1989); 3) P. viscosum: caespitose plants of alpine environments with paniculate to spicate inflorescences, densely glandular-pubescent verticillate leaflets, and white to orange anthers (Grant 1989); and 4) P. foliosissimum: erect plants of midelevation conifer forests with corymbose inflorescences, coplanar leaflets, and yellow anthers (Anway 1968). Although not explicitly stated, both Wherry (1942) and Grant (1989) implied an hypothesis that the P. caeruleum species complex exhibits the most plesiomorphic character states in the genus. In addition, Grant (1989) explicitly hypothesized that alpine species of Polemonium have evolved multiple times from low to midelevation species. Molecular phylogenetic investigations into Polemonium have provided limited but important insights into the evolution and taxonomy of the group. These studies used sequence data (de Geofroy 1998; Timme 2001; Irwin et al. 2012) or amplified fragment length polymorphisms (AFLPs) (Worley et al. 2009). When multiple individuals per species have been sampled, these studies generally support the monophyly of species recognized by Davidson (1950). 
However, inferring interspecific relationships in Polemonium has been exceedingly difficult with weak to nonexistent support along the backbone of the genus (de Geofroy 1998; Timme 2001; Worley et al. 2009; Irwin et al. 2012). Furthermore, phylogenetic reconstructions based on sequence and AFLP data have been strongly conflicting. Taken as a whole, previous results suggest that a rapid radiation, recent divergence, or widespread gene flow (either present day or historical), separately or in concert, may contribute to a complex evolutionary history of Polemonium.

Materials and Methods

Data Availability

Scripts for nuclear locus assembly and processing are available from the Dryad Digital Repository (https://doi.org/10.5061/dryad.5hqbzkh2r) as well as GitHub (https://github.com/jrosaceae/polemonium_AHE). Raw reads have been deposited in the NCBI Sequence Read Archive (BioProject PRJNA638804). Tree files and pre- and postfiltered alignments for all analyses are available from the Dryad Digital Repository.

Taxonomic Sampling

Taxonomic sampling was designed both to elucidate the phylogenetic placement of Polemonium and to examine the evolutionary history of the genus itself. In addition to Polemonium, we sampled one species each from the two tribes possibly sister to the clade: Gilieae (Gilia capitata) and Phlocideae (Leptosiphon montanus), with tribe Loeselieae (Ipomopsis aggregata) treated as the outgroup based on the results of Johnson et al. (2008). Infrageneric sampling in Polemonium was tailored to meet our primary goal of resolving deeper relationships in the genus, although thorough taxonomic sampling was also desirable. We therefore focused our taxonomic sampling on major lineages indicated by previous phylogenetic studies. As previous studies have lacked sufficient sampling of the P. foliosissimum and P. viscosum species complexes, we also targeted these putative clades in which additional phylogenetic diversity may exist.
For New World Polemonium, leaf tissue was obtained for all but one North American species and several Mexican species, totaling 45 accessions of Polemonium (Table 1). In many cases, multiple geographically distant individuals were sampled for each morphological species. Genomic DNA was extracted from silica-dried leaf tissue and two herbarium samples using the Qiagen DNeasy Plant Mini Kit (Valencia, CA, USA) following the recommended protocol.

Library Preparation and Sequencing

We utilized an anchored phylogenomics approach, AHE, in collaboration with the Center for Anchored Phylogenomics at Florida State University (www.anchoredphylogeny.com). This method targets conserved “anchor” regions in the nuclear genome and generates data for ~500 loci including exons and introns (Lemmon et al. 2012; Prum et al. 2015). The pipeline employed here uses probes designed from the genomes of 25 broadly sampled angiosperms and targets 499 moderately conserved exons with some intronic regions (Buddenhagen et al. 2016). This pipeline has been used successfully to infer infrageneric relationships across distantly related groups (Cardillo et al. 2017; Léveillé-Bourret et al. 2017; Mitchell et al. 2017; Kriebel et al. 2019). Library preparation was carried out on the extracted genomic DNA. DNA was sonicated to a fragment size of between 200 and 600 bp before library preparation and indexing following a modified protocol of Meyer and Kircher (2010). Library preparation was performed on a Beckman Coulter FXp liquid handling robot. Indexed samples were pooled and enriched using the Angiosperm v.1 enrichment kit (Buddenhagen et al. 2016). Sequencing was performed on one PE150 Illumina HiSeq 2500 lane (~50 Gb total yield) at the Translational Science Laboratory, College of Medicine, Florida State University.

Nuclear DNA Assembly and Alignment

Paired reads were merged before assembly, following Rokyta et al. (2012).
Reads were mapped to the probe regions using Arabidopsis thaliana, Billbergia nutans, and Carex lurida as references and the assembly was extended into the flanks using a quasi-de novo assembly approach described by Hamilton et al. (2016). Consensus sequences were calculated for each assembly cluster with contigs based on fewer than 20 reads removed. The orthology of consensus sequences at each locus was assessed using a pairwise distance matrix among homologs and used to cluster sequences with a neighbor-joining algorithm to assess if gene duplication occurred prior to or following the crown of the clade. To minimize cases of miscalled orthology, we applied a conservative filter by removing clusters containing evidence of gene duplication if they contained <90% of samples (see Hamilton et al. 2016 for details). Sequences in each orthologous cluster were aligned in MAFFT v. 7.023b (Katoh and Standley 2013). To remove poorly aligned regions, raw alignments were trimmed and masked using the following procedure from Prum et al. (2015). Sites with the same character in >40% of sequences were considered “conserved.” A 20 bp sliding window was then moved across the alignment, and regions with <14 characters matching the common base at the corresponding conserved site were masked. Sites with <25 unmasked bases were removed. Finally, the masked alignments were inspected by eye in Geneious v. 10.2.3 (Kearse et al. 2012). Regions considered obviously misaligned or likely paralogous were removed and poorly aligned sections in a given alignment were deleted.

Table 1. Voucher and sequence information for taxa of Polemonium and outgroups used in this study

| Taxon | Voucher | Provenance | Reads (10^6) | No. of loci | Mean coverage |
| --- | --- | --- | --- | --- | --- |
| Gilia capitata Sims | Rose s.n. (WIS) | Cultivated | 12.57 | 448 | 274 |
| Ipomopsis aggregata (Pursh) V.E. Grant | Rose 14-221 (WIS) | ID, USA | 7.34 | 359 | 76 |
| Leptosiphon montanus (Greene) J.M. Porter and L.A. Johnson | Rose s.n. (WIS) | Cultivated | 4.14 | 350 | 80 |
| P. acutiflorum Willd. | Bennett 14-0056 (WIS) | BC, CAN | 7.49 | 402 | 176 |
| P. boreale Adams | Bennett 14-0081 (WIS) | YT, CAN | 4.73 | 413 | 140 |
| P. brandegeei (A. Gray) Greene | Rose 14-245 (WIS) | CO, USA | 7.73 | 400 | 170 |
| P. californicum Eastw. | Stubbs 18 (SFSU) | CA, USA | 6.96 | 411 | 164 |
| P. californicum | JR009 (WIS) | WA, USA | 2.49 | 371 | 89 |
| P. carneum A. Gray | Stubbs 10 (SFSU) | CA, USA | 5.09 | 367 | 116 |
| P. carneum | Stubbs 8 (SFSU) | OR, USA | 3.52 | 327 | 63 |
| P. chartaceum Mason | Stubbs 23 (SFSU) | CA, USA | 7.93 | 398 | 179 |
| P. chartaceum | Stubbs 24 (SFSU) | CA, USA | 6.26 | 378 | 115 |
| P. confertum A. Gray | Rose 14-240 (WIS) | CO, USA | 5.76 | 416 | 221 |
| P. delicatum Rydb. | Rose 14-239 (WIS) | CO, USA | 6.22 | 401 | 159 |
| P. delicatum | Rose 12-233b (WIS) | UT, USA | 7.06 | 396 | 142 |
| P. eddyense Stubbs | Stubbs 15 (SFSU) | CA, USA | 8.81 | 405 | 159 |
| P. elegans Greene | Rose 13-53 (WIS) | WA, USA | 5.01 | 398 | 124 |
| P. elusum J.J. Irwin and R.L. Hartm. | Irwin 5148 (RM) | ID, USA | 6.22 | 333 | 78 |
| P. elusum | Irwin 5496 (RM) | ID, USA | 4.89 | 355 | 80 |
| P. eximium Greene | Stubbs 22 (SFSU) | CA, USA | 3.93 | 247 | 51 |
| P. eximium | Stubbs 14 (SFSU) | CA, USA | 5.84 | 378 | 129 |
| P. foliosissimum A. Gray var. alpinum Brand | Rose 14-230 (WIS) | UT, USA | 5.14 | 395 | 123 |
| P. foliosissimum var. flavum (Greene) Anway | Rose 14-260 (WIS) | AZ, USA | 7.05 | 366 | 120 |
| P. foliosissimum A. Gray var. f. | Rose 14-256 (WIS) | NM, USA | 6.08 | 372 | 145 |
| P. foliosissimum var. molle (Greene) Anway | Rose 14-250 (WIS) | CO, USA | 5.77 | 378 | 143 |
| P. foliosissimum var. molle | Rose 14-235a (WIS) | UT, USA | 4.78 | 333 | 75 |
| P. foliosissimum var. molle | Rose 14-254a (WIS) | NM, USA | 4.86 | 369 | 120 |
| P. foliosissimum var. nov. (aff. P. filicinum Greene) | Rose 14-258 (WIS) | AZ, USA | 4.40 | 389 | 97 |
| P. grandiflorum Benth. | MEXU630068 | Mexico | 5.37 | 210 | 27 |
| P. micranthum Benth. | Stubbs 4 (SFSU) | CA, USA | 5.33 | 405 | 151 |
| P. nevadense Wherry | Rose 14-219 (WIS) | NV, USA | 4.26 | 349 | 78 |
| P. occidentale Greene subsp. o. | Rose 14-226 (WIS) | ID, USA | 4.00 | 351 | 89 |
| P. occidentale subsp. o. | Stubbs 17 (SFSU) | CA, USA | 8.00 | 397 | 185 |
| P. occidentale subsp. lacustre Wherry | MIN816873 | MN, USA | 3.22 | 372 | 117 |
| P. pauciflorum S. Wats. | Fishbein 2438 (ARIZ) | AZ, USA | 8.18 | 357 | 110 |
| P. pectinatum Greene | JR95 (WIS) | WA, USA | 9.10 | 369 | 143 |
| P. pulcherrimum subsp. lindleyi (Wherry) V.E. Grant | Bennett 14-0039 (WIS) | YT, CAN | 6.15 | 388 | 148 |
| P. pulcherrimum Hook. subsp. p. | DiNicola 2013-53 (WIS) | ID, USA | 7.97 | 396 | 173 |
| P. pulcherrimum subsp. p. | Stubbs 20 (SFSU) | CA, USA | 4.70 | 348 | 80 |
| P. pulcherrimum var. shastense (Eastw.) Stubbs | Stubbs 16 (SFSU) | CA, USA | 8.24 | 386 | 149 |
| P. reptans L. var. r. | Rose 13-44 (WIS) | OH, USA | 7.21 | 331 | 93 |
| P. reptans var. villosum E.L. Braun | JR11 (WIS) | OH, USA | 4.76 | 386 | 113 |
| P. vanbruntiae Britton | JRVB2 (WIS) | VT, USA | 6.16 | 372 | 129 |
| P. vanbruntiae | JR002 (WIS) | NY, USA | 7.41 | 382 | 198 |
| P. vanbruntiae | JRVB3 (WIS) | QB, CAN | 5.30 | 376 | 111 |
| P. vanbruntiae | Rose 13-46 (WIS) | WV, USA | 4.25 | 368 | 116 |
| P. viscosum Nutt. | Stubbs 25 (SFSU) | NV, USA | 7.16 | 396 | 147 |
| P. viscosum | Rose 14-232 (WIS) | UT, USA | 7.63 | 411 | 205 |
| P. viscosum | Rose 12-243 (WIS) | NM, USA | 5.44 | 376 | 104 |

Plastid and Nuclear Ribosomal DNA Assembly and Alignment

To obtain the ptDNA and nrDNA, we used Geneious to map all recovered forward and reverse reads to reference sequences. For the ptDNA, we used the whole plastid genome of Saltugilia latimeri (Polemoniaceae, GenBank accession KT921175 from Landis et al. 2016) as a reference. For the nrDNA, we used a concatenated ETS/18S/ITS/26S reference compiled from published and unpublished Sanger-derived sequences from multiple Polemonium species. Raw reads were trimmed and assembled using iterative refinement of up to five times with the default Geneious mapper and medium sensitivity. As the reference plastid genome is fairly distantly related, we assembled plastomes using a multistep process. First, reads were mapped to the Saltugilia reference. Next, we used the sample with the most reads mapped to it (P.
pectinatum) as a second reference to map published and unpublished Sanger-derived reads to improve our reference. We then remapped all reads to this Polemonium reference using the same parameters mentioned above. For the ptDNA, consensus sequences were generated using the strict consensus approach. If coverage for a particular site was <7, the consensus nucleotide was scored as a gap. Unmapped regions were treated as missing data and reads mapped to multiple positions were excluded from consensus calculations. To account for potential paralogous copies of the nrDNA (Buckler et al. 1997; Poczai and Hyvönen 2010), we applied a more stringent consensus criterion of base calls matching 65% of sequences. Coverage and unmapped regions were treated as above. Sequences were aligned using MAFFT with default parameters. After alignment, ambiguously aligned or called regions were removed by hand. The mitochondrial genome was not extracted as the closest reference genome available was too distantly related to extract a large and phylogenetically informative amount of the genome (Vaccinium macrocarpon, Ericaceae).

Checking for Tree-Like Structure

We began by analyzing the 45 accessions of Polemonium alone using 142 loci without any missing ingroup data. We first estimated a phylogenetic tree from a concatenated version of this data set using maximum likelihood in RAxML (Stamatakis 2014) under the GTR + G model of sequence evolution. We conducted 10 separate runs to find the most likely tree as well as 500 rapid bootstrap (BS) replicates. Next, to examine if an assumption of underlying tree-like structure was appropriate for our data set, we employed the Tree Incongruence Checking in R (TICR) pipeline (Stenz et al. 2015) to obtain concordance factors (CFs) for each possible quartet and to use these data to infer an optimal population tree for Polemonium. Briefly, for the TICR pipeline, we ran MrBayes v. 3.2.6 (Ronquist et al. 2012) on all genes using a batch script.
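The per-site consensus rules given under Plastid and Nuclear Ribosomal DNA Assembly and Alignment (sites with coverage below 7 scored as a gap; strict consensus for ptDNA; a 65% agreement threshold for nrDNA) can be illustrated with a minimal Python sketch. This is not the Geneious implementation: the function name `call_consensus` and its parameters are hypothetical, and returning "N" when no base reaches the threshold is a simplifying assumption (the actual software may emit IUPAC ambiguity codes instead).

```python
from collections import Counter


def call_consensus(column, min_coverage=7, threshold=1.0):
    """Call one consensus base from the read bases stacked at a single site.

    column       -- string of bases, one per read covering the site
    min_coverage -- sites covered by fewer reads are scored as a gap ("-")
    threshold    -- minimum fraction of reads that must agree
                    (1.0 mimics a strict consensus, 0.65 a 65% criterion)

    Returns the agreed base, "-" for low coverage, or "N" (an assumed
    placeholder) when no base reaches the threshold.
    """
    depth = len(column)
    if depth < min_coverage:
        return "-"
    base, count = Counter(column).most_common(1)[0]
    if count / depth >= threshold:
        return base
    return "N"


# Hypothetical pileup columns, for illustration only.
ptdna_call = call_consensus("AAAAAAA")                     # strict consensus
nrdna_call = call_consensus("AAAAAAACCC", threshold=0.65)  # 65% criterion
```

Keeping the coverage check separate from the agreement check mirrors the text, where low coverage yields a gap regardless of which consensus criterion is used.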
The best gene tree for each locus was inferred under the GTR |$+$| I |$+$| G model of sequence evolution using 3 runs of 3 chains each for 6 million generations with sampling every 6000 generations with a chain temperature of 0.4 and a 30% burnin. Standard deviations of split frequencies from MrBayes were usually |$<$|0.010. Following the MrBayes analysis, Bayesian concordance analysis on the posterior sample of gene trees was conducted in BUCKy v. 1.4.4 (Ané et al. 2007; Larget et al. 2010) with 100 000 postburnin generations. This analysis calculates all possible quartets and prunes the MrBayes gene trees to all but the four terminals of interest. Then, BUCKy is run on each pruned gene tree to generate a table of all quartet CFs and their standard errors. The parameter |$\alpha $| in BUCKy is the a priori amount of discordance between gene trees. Although empirical studies have shown that CFs and branch lengths in coalescent units are robust to this prior (Cranston et al. 2009; McVay et al. 2017a; Everson et al. 2016; Crowl et al. 2017; Folk et al. 2017), we used three values of |$\alpha $| (0.5, 1, and 2) to ensure the robustness of our results as in Folk et al. (2017). Next, Quartet MaxCut (Snir and Rao 2012) was used to generate a starting population tree based on these CFs. The TICR test (Stenz et al. 2015) was employed using the R package phylolm (Ho and Ané 2014) to examine if any or all edges in the optimal population tree could be modeled under panmixia using the test.one.species.tree and stepwise.test.tree functions. The stepwise.test.tree function starts from either a fully resolved tree or a fully collapsed tree and uses a stepwise search of edges to determine the best resolution of the tree. We conducted two searches with a maximum number of iterations of 10 000 each with forward and backwards searches at each step and considering 1000 population trees at each step. One search started from panmixia and another started from a fully resolved tree. 
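As a minimal illustration of the expectation against which quartet CFs are compared under an ILS-only model, the sketch below uses the standard multispecies coalescent result for a four-taxon tree: the quartet matching the tree has expected CF 1 - (2/3)e^(-t) and each of the two alternative quartets has expected CF (1/3)e^(-t), where t is the internal branch length in coalescent units. The function names are hypothetical and the code is not part of the TICR pipeline; it also counts the quartets implied by data sets of 45 and 17 terminals.

```python
import math


def expected_quartet_cfs(t):
    """Expected (major, minor, minor) quartet CFs under ILS alone.

    t is the internal branch length of the four-taxon tree in
    coalescent units (t >= 0).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    minor = math.exp(-t) / 3.0
    major = 1.0 - 2.0 * minor
    return major, minor, minor


def branch_length_from_major_cf(cf):
    """Invert the major-quartet expectation: t = -ln(1.5 * (1 - cf))."""
    if not (1.0 / 3.0 <= cf < 1.0):
        raise ValueError("major CF must lie in [1/3, 1)")
    return -math.log(1.5 * (1.0 - cf))


def n_quartets(n_taxa):
    """Number of distinct four-taxon subsets among n_taxa terminals."""
    return math.comb(n_taxa, 4)


# A branch of length zero (panmixia) gives equal CFs of 1/3 each.
print(expected_quartet_cfs(0.0))
# Quartets to evaluate for 45 and 17 terminals.
print(n_quartets(45), n_quartets(17))  # 148995 2380
```

Under panmixia all three quartet resolutions are expected at equal frequency, which is why a fully collapsed tree serves as one of the two starting points of the stepwise search.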
Coalescent Species Tree Estimation Because methods that explicitly account for reticulate evolution can only realistically work on many fewer terminals than those for which we collected data, we first explored relationships under ILS alone to investigate the phylogenetic position of Polemonium within Polemoniaceae as well as phylogenetic relationships within all of our samples of Polemonium. This data set consisted of all 45 accessions of Polemonium plus three outgroups (Gilia capitata, Ipomopsis aggregata, and Leptosiphon montanus). For this data set, we allowed up to 24 missing sequences per locus (50% missing sequences), which resulted in a data set of 48 terminals and 316 loci (123 loci without missing data for all taxa, 142 loci without missing data for Polemonium). We employed two methods for inferring a coalescent species tree. First, we used SVDquartets (Chifman and Kubatko 2014) as implemented in PAUP* version 4.0a163 (Swofford 2002) on the entire data set, including those with missing data. SVDquartets is primarily designed for use with SNP data and the theoretical background for this assumes that each SNP is unlinked. However, Chifman and Kubatko (2014) demonstrated that the statistical power of this method is not greatly affected by using linked SNPs. We evaluated all possible quartets with ambiguities treated as missing data. Support for the SVDquartets population tree was assessed with 500 nonparametric BS replicates. As an additional method for reconstructing phylogenies under ILS alone, we also used ASTRAL-III (Mirarab et al. 2014; Zhang et al. 2017). Using a batch script, we first conducted maximum likelihood as implemented in RAxML on each aligned locus under the GTR |$+$| G model of sequence evolution with 100 rapid BS replicates. 
We then ran these gene trees in ASTRAL-III and accounted for uncertainty in the estimated population tree using 100 BS replicates from the RAxML BS trees, as well as calculating ASTRAL quadripartition support (QS): that is, the fraction of quartet trees in the set of gene trees for a particular quadripartition (Sayyari and Mirarab 2016). Lastly, we conducted a polytomy test to see if the null hypothesis that a branch is a polytomy could be rejected using the |$-$|t 10 option in ASTRAL (Sayyari and Mirarab 2018). This test uses the fact that under panmixia, the three topologies of an unrooted quartet of species each should be present in equal frequency and should follow a chi-squared distribution. We set our |$\alpha $| value for this test to a less conservative cutoff of 0.10 based on Figure 7 of Sayyari and Mirarab (2018), which suggests that in phylogenies reconstructed using less than several thousand genes and which contain many short branches (in coalescent units) there may not be enough resolution for this test. Phylogenetic Network Estimation To explore the possibility for gene flow as a cause of discordance in Polemonium, we utilized 17 exemplars of Polemonium, a sampling which represents a good compromise between taxonomic coverage and computational cost. For this data set, we allowed up to six missing sequences per locus (up to 33% missing sequences/locus). Missing sequences were scored as a single “N” for each individual missing the locus. Individuals for this data set were selected based on taxonomic representativeness and having high numbers of loci captured. We first used the TICR pipeline as described above in the subsection “Checking for Tree-Like Structure,” but with the MrBayes parameters changed to 2 million generations with sampling every 1000 generations. Standard deviations of split frequencies from MrBayes were usually less than 0.010 and always less than 0.017. 
We ran BUCKy on each quartet for 1 million postburnin generations, again testing values of |$\alpha = 0.5$|, 1, and 2, summarizing the results as an optimal population tree in Quartet MaxCut. As for the 45-taxon data set, we used phylolm to test for an unresolved or partially resolved population tree (see “Checking for Tree-Like Structure” section). Following these steps, we analyzed the BUCKy CFs and the Quartet MaxCut population tree as a starting tree using the SNaQ function in the Julia package PhyloNetworks (Solís-Lemus and Ané 2016; Solís-Lemus et al. 2017) to examine the contribution of ILS and reticulation to the phylogenetic history of Polemonium. This package uses maximum pseudolikelihood to fit a network while also accounting for ILS. PhyloNetworks considers quartet topologies only and does not take into account information from branch lengths in individual gene trees. Furthermore, PhyloNetworks assumes a one-level network: a network where each hybrid node only has one lineage transferring genetic material horizontally. We first tested the fit of models allowing from 0 to 5 reticulation events (|$h$|) and compared models using their pseudolikelihood score. The best network model was selected by examining at what value of |$h$| the pseudolikelihood score plateaus, as per the recommendation of Solís-Lemus et al. (2017). For each value of |$h$|, the best network over 20 search replicates was selected. We examined branch support on the best phylogenetic network using the bootsnaq function with 100 runs of 10 replicates each.

Cyto-nuclear Discordance

NrDNA and ptDNA for the entire 48-taxon set were analyzed with MrBayes under the GTR |$+$| I |$+$| G model of sequence evolution using 4 runs of 3 chains each for 25 million generations with sampling every 100 000 generations and a 30% burnin.
To examine if any observed topological discordance between the estimated population tree based on nuclear data and individual genome trees can be explained by ILS alone, we implemented two nested approaches. First, we modified the approach of Olave et al. (2017) that takes a given species tree with branch lengths in 4N|$_{e}$| units and simulates gene trees using the program MS (Hudson 2002). The number of extra lineages between the species tree and gene trees is then calculated using the function deep_coal_count in Phylonet v. 2.4 (Than and Nakhleh 2009). The approach of Olave et al. (2017) generates gene trees under different values of the migration (m) parameter and then compares the likelihood of simulated gene trees under various values of m to actual gene trees to infer gene flow between lineages. However, this method assumes that branch lengths are accurately estimated. We used the best fitting major network topology from our 17-taxon phylogenetic network analysis and the optimal population tree from our 45-taxon analysis to simulate gene trees in MS for both the nuclear and plastid genomes. To obtain correct branch lengths for nuclear gene trees, we multiplied branch lengths in coalescent units by a factor of two to get branch lengths in 4N|$_{e}$| units. To simulate gene trees for the plastid genome, branch lengths in this starting nuclear species tree were divided by a factor of four before simulating these gene trees to account for the faster coalescence times of uniparentally inherited genes. Second, we generated 5000 gene trees simulated under a coalescent process with no gene flow (meaning all discordance is due to ILS) for both the nuclear and organellar 17 and 45 taxon species trees. Extra lineages were counted on the 5000 simulated gene trees and pruned nrDNA and ptDNA trees with edges supported by PP |$<$| 0.95 collapsed. 
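The comparison between an observed extra-lineage count and the counts from trees simulated under ILS alone can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline of Olave et al. (2017): the function name and the toy counts are hypothetical, and the counts are assumed to have been obtained beforehand (e.g., with deep_coal_count in Phylonet).

```python
def upper_tail_proportion(simulated_counts, observed_count):
    """Fraction of simulated gene trees needing at least as many extra
    lineages as the observed gene tree.

    A value near zero means the observed tree is more discordant than
    expected under the simulated ILS-only process.
    """
    if not simulated_counts:
        raise ValueError("need at least one simulated count")
    n_extreme = sum(1 for c in simulated_counts if c >= observed_count)
    return n_extreme / len(simulated_counts)


# Toy counts (hypothetical, for illustration only).
toy_simulated = [0, 1, 2, 3, 4]
print(upper_tail_proportion(toy_simulated, 3))   # 0.4
print(upper_tail_proportion(toy_simulated, 10))  # 0.0
```

With 5000 simulated trees, the smallest nonzero proportion this summary can return is 1/5000, so a value of zero only indicates that the observed count exceeds every simulated count.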
If there was evidence for hybridization in our gene trees, we would expect our observed gene trees to have more extra lineages than the simulated trees. To get a sense of which observed clades may be supported under an ILS-only scenario, we assumed our nrDNA and ptDNA trees were “true” species trees and assessed bipartition support by treating our 5000 simulated trees for each genome as BS replicates. Results Overall Metrics of AHE Nuclear Loci Metrics of number of reads and loci generated from AHE sequencing as well as locus coverage are provided in Table 1. After filtering, our pipeline recovered 360 putatively orthologous regions. After cleaning to remove loci with miscalled orthology, few potentially parsimony informative sites, and/or a large amount of ambiguous base calls, we recovered 316 loci with less than 50% missing sequences, 123 of which are not missing any sequences. These loci vary greatly in taxonomic coverage, length, and number of potentially parsimony informative sites. Locus coverage ranges from 22 to 48 taxa (mean of 44 taxa). Loci range from 254 to 1704 aligned nucleotides in length, with a median length of 593.5 nucleotides. Within each of these loci, the number of potentially parsimony informative sites varies from 3 to 151, with a median value of 33. Evidence for Underlying Tree-Like Structure Inference of the optimal population tree from the set of major quartets in Quartet MaxCut for the 45-terminal data set shows no topological differences between values of |$\alpha $| and only slight differences in branch lengths and CFs. We therefore focus on the results of the analysis in which |$\alpha = 1$| when generating CFs in BUCKy. CFs on the optimal population tree suggest a large amount of gene tree discordance within Polemonium, with CFs ranging from 0.95 to 0.34 with a mean CF of 0.53 (Fig. 1). The TICR test suggests that ILS does not fully explain the observed quartet CFs (X|$^{2} = 2359.6$|, |$P = 0.0$|). 
Furthermore, panmixia is strongly rejected (X|$^{2} = 33\,150.13$|, |$P = 0.0$|). Stepwise addition of edges starting from panmixia or from a fully resolved tree converges on the same answer and suggests that all edges should be retained except the edge supporting a closer relationship of P. californicum from California to P. delicatum than to P. californicum from Washington (Fig. 1) (X|$^{2} = 2368.3$|, |$P = 0.0$|).

Figure 1. Optimal population tree for Polemonium inferred using Quartet MaxCut based on CFs derived from 142 nuclear genes without any missing data. The tree is rooted with P. micranthum based on the results of other analyses. Internal branch lengths are in coalescent units; terminal branches are diagramed for visualization purposes. Numbers above branches are CFs with their range shown in parentheses. Italicized numbers below branches are BS support values from the RAxML analysis on the concatenated matrix. Thin branches are not found in the most likely tree and branches with support marked by a dash (-) are not present in any BS replicate. Tips are colored according to major clades discussed in the text.
Estimating Bifurcating Coalescent Trees The optimal population tree from our ASTRAL analysis of 48 terminals and all 316 loci suggests an overall well-supported population tree (BS and QS |$>$| 80) (Fig. 2a). The normalized quartet score of this tree is 0.553, suggesting strong discordance between loci. Although there is clearly tree-like structure to the data set, the polytomy test on our data suggests several edges could be collapsed (|$P > 0.10$|) (Supplementary Fig. 1 available on Dryad at https://doi.org/10.5061/dryad.5hqbzkh2r). Our ASTRAL analysis suggests the closest relative of Polemonium is tribe Gilieae rather than Phlocideae (BS |$= 1.0$|, QS |$= 1.0$|). Within Polemonium, |$P$|. micranthum is recovered as sister to the remainder of the genus (BS |$= 1.0$|, QS |$= 1.0$|). We recover two major clades within the remainder of Polemonium. The first clade consists of members of the P. caeruleum (caeruleum clade) and P. pulcherrimum species complexes (pulcherrimum clade) (BS |$= 1.0$|, QS |$= 0.9$|). The second clade is less supported (BS |$= 0.94$|, QS |$= 0.78$|, edge best collapsed into a polytomy). Polemonium elegans is sister to the remainder of this clade (BS |$= 1.0$|, QS |$= 0.98$|), and P. carneum is sister to the remainder of the clade excluding P. elegans (BS |$= 1.0$|, QS |$= 0.85$|, edge best collapsed into a polytomy). Relationships in the remainder of this clade are less clear as backbone relationships receive contrasting BS and QS support. Included in this group are members of the P. foliosissimum (foliosissimum clade) and P. viscosum species complexes. The P. viscosum species complex is recovered as a grade of three clades, the eximium clade (BS |$= 1.0$|, QS |$= 0.92$|, edge best collapsed into a polytomy), an Intermountain clade (BS |$= 1.0$|, QS |$= 0.89$|, edge best collapsed into a polytomy), and a Rocky Mountain clade (BS |$=1.0$|, QS |$= 0.99$|). 
The Rocky Mountain clade is sister to members of the foliosissimum clade (BS |$= 1.0$|, QS |$= 1.0$|), although the monophyly of the foliosissimum clade is poorly supported (BS |$= 1.0$|, QS |$= 0.71$|, edge best collapsed into a polytomy). Most species are recovered as monophyletic if multiple accessions exist, but several cases of paraphyletic/polyphyletic species are recovered including 1) P. delicatum embedded in P. californicum, 2) species polyphyly in the P. eximium (eximium) clade, and 3) two clades of P. viscosum: one in the Rocky Mountain clade sister to P. brandegeei and another in the Intermountain clade sister to P. elusum |$+$| P. nevadense. The topology and support for relationships in the SVDquartets tree are largely consistent with that in the ASTRAL topology along the backbone (Fig. 2b) with only 58% BS support for the eximium clade as sister to the Rocky Mountain, Intermountain, and foliosissimum clades. Other differences in the SVDquartets topology are the placement of P. foliosissimum var. foliosissimum, the placement of P. occidentale subsp. lacustre, and the monophyly of P. californicum, P. chartaceum, and P. eximium, although these topological conflicts are poorly supported in one or both analyses.

Figure 2. Coalescent species tree for Polemonium based on 316 nuclear genes inferred using a) ASTRAL and b) SVDquartets. Tips are colored according to major clades discussed in the text. Thick edges in the SVDquartets tree are fully supported; values |$<$| 100% BS support are indicated. Thick edges in the ASTRAL tree have 100% BS support. Numbers above edges show support based on the BS analysis followed by quartet support. Branches without any numbers above them are fully supported by both metrics.

The analysis of the concatenated matrix of 45 terminals and 142 loci produced a mostly well-supported topology (BS |$>$| 80) largely consistent with the topologies of the optimal population tree of the same data set as well as the analyses on the larger data set of 48 terminals and 316 loci, with the exception of several edges. These edges include the positions of P. carneum and P. elegans, relationships among the Intermountain, eximium, and Rocky Mountain |$+$| foliosissimum clades, and relationships within the caeruleum, eximium, and foliosissimum clades (Figs. 1 and 2; Supplementary Fig. 2 available on Dryad).

Estimating a Phylogenetic Network

Filters for taxonomic and genetic completeness in the 360 recovered loci from AHE sequencing resulted in 325 loci analyzed for the 17 target taxa. As with the 45-taxon optimal population tree, CFs mapped on the 17-taxon optimal population tree are relatively low (0.37–0.61, mean |$= 0.45$|), suggesting high discordance between gene trees (Supplementary Fig. 3 available on Dryad). The TICR test suggests that ILS does not fully explain the observed quartet CFs (X|$^{2} = 7.88, P = 0.049$|). Furthermore, panmixia is strongly rejected (X|$^{2} = 551.06, P = 4.09 \times 10^{-119}$|). Stepwise addition of edges suggests that all edges should be retained. A plot of pseudologlikelihood scores (Supplementary Fig. 4 available on Dryad) suggests the best network model is one with a maximum number of four hybridization events allowed (-ploglik |$= 1446.48$|). The best network has three hybridization events (Fig. 3).
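One way to make the plateau criterion for choosing the number of hybridization events concrete is sketched below. In this study the plateau was judged by examining a plot of scores, so the numeric rule here is an assumption introduced purely for illustration: the function name, the relative tolerance `rel_tol`, and the toy scores are all hypothetical and are not values from the analysis.

```python
def plateau_h(neg_ploglik, rel_tol=0.05):
    """Return the smallest h at which adding one more hybridization
    improves the score by less than rel_tol of the current score.

    neg_ploglik[h] is the negative log pseudolikelihood of the best
    network found with at most h hybridizations (lower is better).
    rel_tol is an illustrative threshold, not a published default.
    """
    if not neg_ploglik:
        raise ValueError("need at least one score")
    for h in range(len(neg_ploglik) - 1):
        gain = neg_ploglik[h] - neg_ploglik[h + 1]
        if gain < rel_tol * neg_ploglik[h]:
            return h
    # No plateau detected within the range of h that was explored.
    return len(neg_ploglik) - 1


# Toy scores for h = 0..4 (hypothetical, for illustration only).
toy_scores = [100.0, 60.0, 40.0, 39.5, 39.4]
print(plateau_h(toy_scores))  # 2
```

A rule of this kind is sensitive to the chosen tolerance, which is one reason a visual inspection of the score curve remains a reasonable choice.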
The parameter gamma (|$\gamma $|) is a metric of the proportion of the genome contributed by one parental population in the reticulation event, with 1 - |$\gamma $| being the proportion of the other parent (stem lineage in a level-1 network). In each of the three reticulation events, large portions of the genome have been exchanged. Detected reticulations are: 1) between the pulcherrimum clade and P. elegans (|$\gamma = 0.47$|), 2) between P. foliosissimum var. nov. and P. pauciflorum (|$\gamma = 0.33$|), and 3) between P. elusum and P. carneum (|$\gamma = 0.26$|). The BS analysis found full support (BS |$=$| 100%) for all edges of the major topology as well as for all hybrid edges. The range of estimated |$\gamma $| values of the major hybrid edge from the BS analysis was 0.45–0.48 for the contribution of the pulcherrimum clade to P. elegans, 0.30–0.44 for the contribution of P. foliosissimum var. nov. to P. pauciflorum, and 0.24–0.27 for the contribution of P. elusum to P. carneum. The major topology of the phylogenetic network contrasts with that recovered from the coalescence-only approaches in two ways. First, P. pectinatum is recovered as sister to the pulcherrimum clade instead of sister to the caeruleum clade. Second, P. pauciflorum is not recovered within the foliosissimum clade. Instead, it is sister to the foliosissimum clade |$+$| the Rocky Mountain clade.

Figure 3. Best SNaQ phylogenetic network from our analysis of 17 exemplars of Polemonium. Tip color corresponds to clades recovered in the coalescence-only species tree approaches. Arrows indicate inferred direction of gene flow in the three detected reticulation events, with the parameter |$\gamma$| indicating the percentage of the genome involved in the reticulation. All major topology edges and all hybrid edges receive 100% BS support.

Cyto-nuclear Discordance

We were able to recover nearly the entire plastid genome from the raw reads of the AHE sequencing. After removing ambiguously aligned portions, the ptDNA data set of 48 terminals contains 132 389 aligned nucleotides. The maximum clade credibility (MCC) tree has most edges supported by a PP |$<$| 0.95 (Supplementary Fig. 5 available on Dryad). The ptDNA topology differs markedly from those obtained from the nuclear loci in terms of clade composition, clade placement, and species monophyly (Fig. 4). The ptDNA genealogy suggests the eximium clade is sister to the remainder of the genus. Polemonium micranthum is in turn sister to the remainder of the genus. Within this larger clade, morphological species are scattered throughout the tree but do form several smaller clades, including two clades of members of the foliosissimum clade and a close relationship among many (but not all) members of the pulcherrimum clade. The nrDNA data set of 48 terminals contains 6342 aligned nucleotides. A number of sites are clearly polymorphic. We interpret this as a result of sequencing paralogous copies of the tandem repeat, and these are therefore treated as ambiguous characters in our analyses. While several areas of our nrDNA genealogy are strongly supported, many branches in the MCC tree are supported by low PP (Supplementary Fig. 6 available on Dryad). Polemonium micranthum is sister to the rest of Polemonium. Within Polemonium excluding P. micranthum, there are two major subclades. The first subclade contains members of the P. carneum, P.
viscosum, and foliosissimum clades, although none of the polytypic clades are recovered as monophyletic. Polemonium eddyense is sister to the Intermountain clade |$+$| P. carneum (PP |$=$| 1.0). Polemonium carneum is recovered as sister to P. elusum |$+$| P. nevadense (PP |$=$| 1.0). The P. eddyense |$+$| Intermountain clade |$+$| P. carneum clade is sister to the foliosissimum clade |$+$| Rocky Mountain clade, although the monophyly of this clade is poorly supported (PP |$=$| 0.47). The Rocky Mountain clade is not monophyletic and is clearly nested within the foliosissimum clade. The second major nrDNA clade contains members of the caeruleum, P. elegans, eximium, and pulcherrimum clades. The backbone of this clade is unsupported (PP |$<$| 0.50), but the pulcherrimum clade is polyphyletic, with P. elegans nested within a clade of P. pulcherrimum (PP |$=$| 1.0). Though poorly supported, the eximium clade (excluding P. eddyense) is nested within the caeruleum clade (PP |$=$| 0.86). Overall, the well-supported areas of the nrDNA genealogy are more congruent with the species trees based on nuclear DNA than with the ptDNA genealogy (Fig. 5).

Figure 4. Tanglegram illustrating discordance between the ASTRAL nuclear species tree (left) and the plastome tree (right). Trees have been rotated to maximize tip matching with minimal link overlap. Link color corresponds to major clades discussed in the text and links connect identical tips. Clades in the plastome tree with PP |$<$| 0.95 have been collapsed.

Figure 5. Tanglegram illustrating discordance between the ASTRAL nuclear species tree (left) and the nuclear ribosomal DNA gene tree (right). Trees have been rotated to maximize tip matching with minimal link overlap. Link color corresponds to major clades discussed in the text and links connect identical tips. Clades in the nuclear ribosomal DNA tree with PP |$<$| 0.70 have been collapsed.

The results of the pipeline of Olave et al. (2017) suggest that the number of extra lineages required to reconcile the random trees generated under the multispecies coalescent with the major topology of our 17-tip phylogenetic network ranges from 0 to 15 with a median of 6 for nuclear genes, and from 0 to 4 with a median of 0 for organellar genomes (Fig. 6a). For our 45-tip optimal population tree, the number of extra lineages in the nuclear genome ranges from 2 to 31 with a median value of 16 and from 0 to 10 for organellar genomes with a median of 3. For observed gene trees, the number of extra lineages required in the 17-tip data set is 16 for the nrDNA gene tree and 27 for the ptDNA gene tree. For the 45-tip data set, these numbers are 24 for the nrDNA gene tree and 80 for the ptDNA gene tree (Fig. 6b). For both data sets, the number of extra lineages for the nrDNA gene tree falls within the upper tail of the distribution expected for the nuclear genome under a strictly coalescent process, whereas the number of extra lineages for the ptDNA genome greatly exceeds the expected distribution (Fig. 6).
Assessing support for observed ptDNA bipartitions based on the 5000 organellar trees simulated under the multispecies coalescent suggests the backbone of the ptDNA gene tree does not appear in any simulated gene trees, and that many other clades are only present in a small percentage of gene trees. The most commonly recovered clades in the simulated gene trees are found towards the tips and usually support the monophyly of species (Fig. 7a). By contrast, support for bipartitions of nrDNA clades is much higher overall, with many backbone bipartitions appearing in a significant number (|$>$|33%, within the expected distribution of splits under a polytomy) of simulated coalescent trees (Fig. 7b).

Figure 6. Histogram of number of extra lineages required to reconcile gene trees simulated under the multispecies coalescent in MS using a) the major topology of our 17-tip phylogenetic network or b) the 45-tip optimal population tree. Extra lineages were counted using deep_coal_count in Phylonet. Light grey bars show the distribution of nuclear genome trees and dark grey bars show the distribution of organellar genome trees. Dashed vertical lines represent the number of extra lineages observed in the nuclear ribosomal (light grey) and plastid (dark grey) gene trees.

Figure 7. Percent support for bipartitions in the maximum clade credibility tree in 5000 simulated multispecies coalescent trees from MS assuming that a) the plastid gene tree or b) the nuclear ribosomal DNA gene tree represents the true species tree. Numbers are rounded. If no number is given above the edge, that bipartition does not occur in any simulated coalescent trees. Tips are colored according to major clades discussed in the text.

Discussion

A major challenge for inferring phylogenetic networks is the computational limitation associated with searching “network space” on data sets that contain more than a dozen taxa, especially when the number of possible hybridization events is high. By contrast, methods that incorporate only ILS in inferring species trees are popular because they account for some sources of discordance and are fast and tractable with data sets of dozens to hundreds of terminals. However, species trees inferred using these methods are suboptimal models of relationships if gene flow has occurred between lineages, even if the major topology of the phylogenetic network is identical, as branch lengths will be underestimated if all genomic discordance is explained by ILS alone (Leaché et al. 2014). In this study, we document that morphological species of Polemonium are often reciprocally monophyletic based on nuclear data.
Despite evidence for widespread gene flow within Polemonium, there is considerable topological congruence among species trees inferred using disparate methods. Furthermore, evidence from the plastid genome suggests that there has been more gene flow than is evident solely in the nuclear genome.

Support for Relationships in Polemonium Despite Strong Nuclear Discordance

Our results suggest that there is a large amount of discordance between nuclear gene trees, as evidenced by low CFs (|$<$| 0.50) in the optimal population tree, yet the vast majority of these edges are statistically supported by all analyses, strongly suggesting an underlying tree-like structure. One possibility for some of this discordance is error in gene tree estimation. While this may be true to some extent for the ASTRAL-III analysis, which relies on maximum likelihood gene trees estimated using RAxML, uncertainty in gene tree estimation is tempered somewhat by using BS replicates. Other analyses should be robust to this error as they either estimate a species tree directly (SVDquartets) or rely on CFs derived from BUCKy, which accounts for gene tree estimation error. However, low but significant CFs are best explained as the result of conflict caused by ILS and reticulation. Widespread ILS is not unexpected in the case of Polemonium, given the apparent rapid radiation of the genus in the last 10 My (Landis et al. 2018; Rose et al. 2018). Our topologies are consistent with the conclusions of the only previous multilocus phylogenetic study of Polemonium based on a concatenated parsimony analysis of AFLP data (Worley et al. 2009). In cases where Worley et al. (2009) found strong support for relationships, we likewise find strong support. As in Worley et al. (2009), we recover a sister relationship of P. reptans and P. vanbruntiae, as well as two subclades in the pulcherrimum clade: one of P. pulcherrimum s.s. and another of P. boreale, P. californicum, and P. delicatum.
The major difference between our estimate and Worley et al. (2009) is that they suggest that P. viscosum is either monophyletic or P. brandegeei is nested within it, instead of polyphyletic as suggested in this study. This discrepancy, however, is based on their geographic sampling of P. viscosum, which was restricted to members of the Rocky Mountain clade. Despite the large amount of nuclear gene tree discordance detected in our study, major topological relationships are largely consistent across species tree approaches. Between our coalescent species tree approaches, topological relationships along the backbone are consistent, although the placement of the eximium clade is poorly supported across analyses, and the optimal population tree suggests the split between the eximium, Rocky Mountain, Intermountain, and foliosissimum clades occurred rapidly. Additionally, as noted above, error in gene tree estimation may explain some of these differences between these coalescent analyses (Fig. 2). The major topological difference among our results is the placement of P. pauciflorum and P. pectinatum. In our optimal population trees and in our coalescent analyses, P. pauciflorum falls sister to southern members of the foliosissimum clade: P. foliosissimum var. flavum and an undescribed variety of P. foliosissimum. In our SNaQ analysis, P. foliosissimum is recovered as monophyletic. In addition, in our concordance and coalescent analyses P. pectinatum falls sister to the remainder of the caeruleum clade whereas in the SNaQ analysis the species is placed sister to the pulcherrimum clade. Our results are also consistent across methods in suggesting that P. viscosum is polyphyletic, and that these lineages have not exchanged genes. Only 19% of the nuclear genome supports a sister relationship of the two clades of P. viscosum. Our finding of polyphyly in P. 
viscosum might be expected given that the species exhibits a wide range of morphotypes and that species circumscription based on morphology has resulted in various schemes for delimiting species (Brand 1907; Davidson 1950; Grant 1989). At present, it is unclear what (if any) morphological characters distinguish Intermountain P. viscosum from Rocky Mountain P. viscosum, but it appears that these clades represent a case of extreme convergence in alpine habitats. Preliminary study suggests that length and deciduousness of leaves as well as corolla length may be important characters (J. Rose, unpublished data).

Nuclear Discordance is Explained by ILS and Gene Flow

Our SNaQ results demonstrate that reticulate evolution has played a large role in the evolutionary history of the genus. In fact, the presence of reticulation events in Polemonium is one possible explanation for deeper edges rejected as “true” by the ASTRAL polytomy test. The phylogenetic network analysis suggests at least three exchanges of genetic material have occurred between unrelated lineages, several of which occur along deeper (internal) edges of the phylogeny. Estimates of the amounts of the genome exchanged between lineages during these events are high (26, 33, and 47%), especially relative to animal systems (Solís-Lemus and Ané 2016; Blair et al. 2019; Morando et al. 2020; Pyron et al. 2020), but are not atypical for plant systems (Crowl et al. 2017; Morales-Briones et al. 2018; Roberts and Roalson 2018). Given high CFs for sister relationships not analyzed by SNaQ (Fig. 1), it is likely that the P. carneum–P. elusum reticulation event occurred before the origin of P. carneum (P. carneum CF = 0.95) and that the P. foliosissimum var. nov.–P. pauciflorum reticulation event also involved P. foliosissimum var. flavum (P. foliosissimum var. nov. + P. foliosissimum var. flavum CF = 0.92).
One of the most commonly used markers for Sanger-based phylogenetic inference in land plants has been nrITS sequences. Despite lack of variability in nrDNA and weak support for relationships in Polemonium (Irwin et al. 2012; Supplementary Fig. 6 available on Dryad), this nuclear region contains evidence of the history of reticulate evolution in the clade through the strongly supported sister relationship of P. pulcherrimum and P. elegans (PP = 1.0) as well as the sister relationship of P. carneum and P. elusum (PP = 1.0). Based on the distribution of extra lineages as well as support for the topology provided by nrDNA in the simulated gene trees, discordance for the nrDNA genealogy can be explained by 1) a lack of phylogenetically informative substitutions for some relationships, 2) ILS, and 3) detected reticulate evolution events. While it is clear that species of Polemonium have exchanged genes, it is unclear which phenotypic traits, if any, were exchanged in these reticulation events. Candidate traits demonstrating such reticulation events include pollen color and leaflet size. Future study into character evolution within Polemonium should explicitly test for this phenomenon in the context of a phylogenetic network (Bastide et al. 2018).

Strong Discordance Between Plastid and Nuclear Genomes

From our analyses, it is clear that the well-supported tree obtained with the plastid genome of Polemonium is in strong conflict with the species tree of the genus inferred from nuclear data. Based on data from the nuclear genome, when multiple individuals of a given morphological species have been sampled, the species are either monophyletic or paraphyletic. By contrast, most morphological species appear polyphyletic in the ptDNA genealogy.
Moreover, this topology cannot be explained by ILS alone: tests for ILS as the sole cause of discordance in the ptDNA tree demonstrate that the amount of discordance in the ptDNA tree exceeds the amount expected under a strictly coalescent process (Fig. 6). This is further emphasized by the small number of bipartitions in the plastid genealogy that are also present in the 5000 simulated organellar gene trees (Fig. 7a). The finding of strong plastid-nuclear discordance is contrary to what would be expected based on the smaller effective population sizes (and thus shorter coalescence times) expected for organellar genomes (Moore 1995). Of the three reticulation events supported by our nuclear data, only the genetic exchange between P. foliosissimum var. nov. and P. pauciflorum is also reflected in the plastid genome. Plastid discordance not matching hybridization events has also been reported recently in Pinus (Gernandt et al. 2018).

Causes of Cyto-nuclear Discordance

How do we reconcile these results? First, it is possible that we simply did not sample loci in the nuclear genome that show evidence for additional reticulation events. This may come about from not sampling loci showing reticulation in our 316-gene subsample of the nuclear genome. Our sample of loci may come from one or a few chromosomes or from a specific area of each chromosome and not capture all evidence of horizontal gene flow. However, the 499 target loci are found throughout the Arabidopsis genome (Buddenhagen et al. 2016), so it is unlikely that our 316 genes are strongly biased towards certain chromosomes. Second, our subsample of 17 taxa for inferring a phylogenetic network may not have included most of the ptDNA/nucDNA discordances, and we may not have sampled individuals with these reticulation events in their history.
This could be addressed by generating a phylogenetic network for all 45 samples of Polemonium if this were computationally feasible. ABBA-BABA statistics are widely used to investigate patterns of gene flow (Patterson et al. 2012; Eaton and Ree 2013; Martin et al. 2014; Pease and Hahn 2015; Blischak et al. 2018) but are limited in that they only treat four or five terminals at a time and cannot detect gene flow along internal edges of the phylogeny in the four-terminal case (Pease and Hahn 2015, p. 654). Third, there may be no sign of these reticulation events in the nuclear genome. For introgression between different lineages to be detectable, crossing over between the two parental nuclear genomes in the F1 generation must occur, and extensive backcrossing must not occur. A lack of nuclear signal could therefore be caused by prezygotic or postzygotic barriers preventing crossing between F1 hybrids, or by nuclear signal being swamped out by extensive backcrossing with one parent.

We propose that this third possibility best explains the seemingly incompatible ptDNA/nucDNA results. That is, the unexplained discordance between genomes reflects a pattern of gene flow followed by immediate backcrossing with one parent, swamping out any signal of gene flow in the nuclear genome. There is experimental evidence for this scenario in Polemonium. While many species of Polemonium lack prezygotic barriers to reproduction (Ostenfeld 1929), interspecific F1 hybrids are often sterile but may be backcrossed with one of the parents to produce a fertile hybrid. Clausen (1967) investigated the chromosomes of these Polemonium hybrids and found a large number of irregularities during meiosis in the anthers. These chromosomal rearrangements may provide a postzygotic barrier to reproduction in F1 hybrids (Rieseberg 2001). In other cases, interspecific hybrids show few chromosomal irregularities (Clausen 1931) and may be able to produce viable seed (Anway 1968).
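As a brief aside on the ABBA-BABA statistics mentioned above, the following is a minimal sketch of how Patterson's D can be computed for a single four-terminal tree (((P1, P2), P3), O). It is an illustration only and was not part of the analyses in this study; the function names `count_patterns` and `patterson_d` and the toy sequences are hypothetical. Published implementations typically work from allele frequencies and assess significance with a block jackknife, which this sketch omits.

```python
from typing import Tuple

VALID = set("ACGT")


def count_patterns(p1: str, p2: str, p3: str, outgroup: str) -> Tuple[int, int]:
    """Count ABBA and BABA site patterns in four aligned sequences.

    The outgroup allele is treated as ancestral ("A"); the other allele
    at a biallelic site is treated as derived ("B").
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        # Skip sites with gaps, ambiguity codes, or more than two alleles.
        if not {a, b, c, o} <= VALID or len({a, b, c, o}) != 2:
            continue
        if a == o and b == c:      # P1 ancestral, P2 and P3 derived
            abba += 1
        elif b == o and a == c:    # P2 ancestral, P1 and P3 derived
            baba += 1
    return abba, baba


def patterson_d(abba: int, baba: int) -> float:
    """D = (ABBA - BABA) / (ABBA + BABA); near 0 is expected under ILS alone."""
    total = abba + baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites to compare")
    return (abba - baba) / total


if __name__ == "__main__":
    # Toy alignment (hypothetical data): two ABBA sites and one BABA site.
    counts = count_patterns("AGAAA", "CTGAA", "CGGAC", "ATAAA")
    print(counts, patterson_d(*counts))
```

Because each call considers only one quartet of terminals, the sketch also makes the limitation noted above concrete: gene flow involving an ancestor of P1 and P2 leaves ABBA and BABA counts balanced and so goes undetected.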
In the case of Polemonium, chromosomal incompatibilities and/or backcrossing with one parent may have erased a signal of gene flow in the nuclear genome, but this signal has been retained in the plastid genome by chance or by selection on particular interspecific ptDNA haplotypes. Future work should examine whether the mitochondrial genome is concordant with the plastid genome, as would be expected if the cytoplasm is inherited as a single unit. Inferring the mitochondrial genealogy of Polemonium is hampered by the lack of a closely related reference genome and the tendency of plant mitochondrial genomes to rearrange (Palmer and Herbon 1988). However, preliminary results with ca. 30 kb of highly conserved mitochondrial sequences, where strongly supported, suggest a similar topology to the plastid gene tree (J. Rose, unpublished data). Future work should also examine whether there is a signal of natural selection in the coding regions of certain ptDNA haplotypes that maintains the persistence of interspecific and intraspecific haplotypes in populations of Polemonium. Phylogenies inferred from the plastid genome may reflect the biogeographic history of Polemonium, where relationships are indicators of present or past geographic distance between species rather than phylogenetic distance (Whittemore and Schaal 1991; Petit et al. 1993; Wolf et al. 1997; Rautenberg et al. 2010; Pham et al. 2017; Schuster et al. 2018).

A Workflow for Analyzing Introgression in AHE Data Sets

This study provides a case study for examining the evolution of recently radiated plant clades using AHE and teasing apart ILS and horizontal gene flow. We recommend a procedure such as the following be implemented, with some caveats relating to the number of tips being analyzed (Supplementary Fig. 7 available on Dryad).
1) Generate gene trees for all loci, paying attention to the informativeness of the loci at your particular taxonomic scale and the potential impact of missing data in the phylogenetic methods used. Loci may be highly conserved (e.g., transcriptomes) or short reads (e.g., restriction site methods) where there may be few or no potentially phylogenetically informative SNPs per locus. If a reference genome is available, consider combining loci into a “superlocus” using a sliding window approach.
2) Generate a preliminary species tree from these loci.
3) Test for support for all or some branches in the species tree and whether gene tree discordance has a simple evolutionary explanation (ILS alone) or is best explained by ILS and one or more cases of horizontal gene transfer, as suggested by outlier quartets.
4) If ILS alone is a sufficient explanation for the discordance, ILS is low, or there is no conflict among loci, coalescent-only or concatenation approaches are appropriate for analyzing the data set. If not, phylogenetic network approaches are appropriate.
5) If the number of terminals is <20, analyze all samples using currently available methods for inferring phylogenetic networks, allowing for models with reticulation events varying from 0 to n.
6) If there are >20 terminals, carefully consider subsetting the data set and avoid it if at all possible. If necessary, subsetting should focus on the specific questions being addressed in a given study, informed by the results of (2) and (3) (e.g., major clades recovered in coalescent-only or concatenation approaches, known/suspected hybrids and their parental taxa, or composition of outlier quartets), perhaps with several tiers of subsampling and tests for the robustness of results in particularly complex systems.
7) Consider follow-up analyses to investigate gene flow in outlier quartets, including investigating signal from organellar genome(s) and evidence from pattern-based statistics, if appropriate.
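The branching logic in the procedure above can be summarized in a few lines of code. The sketch below is only an illustration of the decision points, not software used in this study; the names `DiscordanceSummary` and `recommend_analysis` are hypothetical, and treating exactly 20 terminals like the >20 case is an assumption, since the text does not address that boundary.

```python
from dataclasses import dataclass


@dataclass
class DiscordanceSummary:
    """Hypothetical summary of the results of the gene tree and species tree steps."""
    n_terminals: int
    ils_alone_sufficient: bool        # outcome of the test for ILS alone
    no_conflict_among_loci: bool = False
    network_terminal_limit: int = 20  # threshold named in the text


def recommend_analysis(summary: DiscordanceSummary) -> str:
    """Return the class of analysis suggested by the workflow."""
    # ILS alone explains the discordance, or there is no conflict to explain.
    if summary.ils_alone_sufficient or summary.no_conflict_among_loci:
        return "coalescent-only or concatenation"
    # Reticulation is indicated and the data set is small enough to analyze whole.
    if summary.n_terminals < summary.network_terminal_limit:
        return "phylogenetic network on all samples"
    # Assumption: exactly 20 terminals is handled like the >20 case.
    return "phylogenetic network on a question-driven subset (avoid subsetting if possible)"


if __name__ == "__main__":
    print(recommend_analysis(DiscordanceSummary(n_terminals=17, ils_alone_sufficient=False)))
    print(recommend_analysis(DiscordanceSummary(n_terminals=45, ils_alone_sufficient=False)))
```

The follow-up analyses in the final step (organellar signal, pattern-based statistics) apply regardless of which branch is taken, so they are left out of the function.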
Given the rapid progress and interest in the field of phylogenetic network inference, we are confident that new algorithms will soon drastically decrease the computational cost associated with modeling gene flow with a larger number of tips and reticulation events, allowing for ease of analysis of ever larger data sets.

Conclusions

We have presented a case study for examining the evolution of recently radiated plant clades using AHE. This has allowed us to robustly tease apart the contribution of ILS and reticulation in the history of Polemonium, as well as explain patterns of discordance between the nuclear genome and additional genomic compartments. Our results suggest that reticulation has been an integral factor in the evolutionary history of angiosperms and that caution should be applied when making claims about the evolutionary history and taxonomic circumscription of recently radiated groups based on models that either do not account for ILS or account for ILS alone, and/or that rely too heavily on conclusions drawn from the inferred history of organellar genomes. Furthermore, given the contrasting patterns of species or species complex monophyly in the nuclear genome compared with species and species complex polyphyly in the plastid genome revealed by this study, we urge caution against describing new, cryptic species based solely on variation in the plastid genome.

Supplementary material

Data available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.5hqbzkh2r.

Funding

This work was funded in part through an NSF doctoral dissertation improvement grant [DEB-1501867 to K.J.S. and J.P.R.], the O.N. and E.K. Allen Fellowship awarded to J.P.R. through the University of Wisconsin-Madison Department of Botany, the University of Wisconsin Botany Department Hofmeister Endowment, an NSF-DOB [DOB-1046355 to K.J.S.], an NSF-DEB [DEB-1655606 to K.J.S.], and from small research grants to J.P.R.
through the Rocky Mountain Biological Laboratory, Idaho Native Plant Society, Nevada Native Plant Society, Native Plant Society of New Mexico, and Utah Native Plant Society.

Acknowledgments

We would like to thank Rebecca Stubbs for providing tissue collected during her M.S. thesis, Cécile Ané and Nisa Karimi for discussion of SNaQ, Steve Hunter for discussion of plastomes, associate editors Bryan Carstens and David Tank, Ryan Folk, and four anonymous reviewers for their helpful suggestions, which greatly improved the manuscript.

References

Ané C., Larget B., Baum D.A., Smith S.D., Rokas A. 2007. Bayesian estimation of concordance among gene trees. Mol. Biol. Evol. 24:412–426.
Anway J.C. 1968. The systematic botany and taxonomy of Polemonium foliosissimum A. Gray (Polemoniaceae). Am. Midl. Nat. 79:458–475.
Bastide P., Solís-Lemus C., Kriebel R., Sparks K.W., Ané C. 2018. Phylogenetic comparative methods on phylogenetic networks with reticulations. Syst. Biol. 67:800–820.
Blair C., Bryson R.W. Jr., Linkem C.W., Lazcano D., Klicka J., McCormack J.E. 2019. Cryptic diversity in the Mexican highlands: thousands of UCE loci help illuminate phylogenetic relationships, species limits and divergence times of montane rattlesnakes (Viperidae: Crotalus). Mol. Ecol. Resour. 19:349–365.
Blischak P.D., Chifman J., Wolfe A.D., Kubatko L.S. 2018. HyDe: a Python package for genome-scale hybridization detection. Syst. Biol. 67:821–829.
Bock D.G., Andrew R.L., Rieseberg L.H. 2014. On the adaptive value of cytoplasmic genomes in plants. Mol. Ecol. 23:4899–4911.
Brand A. 1907. Polemoniaceae. In: Engler A., editor. Das Pflanzenreich IV(250). Leipzig: Engelmann. p. 1–203.
Buckler E.S., Ippolito A., Holtsford T.P. 1997. The evolution of ribosomal DNA divergent paralogues and phylogenetic implications. Genetics 145:821–832.
Buddenhagen C., Lemmon A.R., Lemmon E.M., Bruhl J., Cappa J., Clement W.L., Donoghue M., Edwards E.J., Hipp A.L., Kortyna M., Mitchell N., Moore A., Prychid C.J., Segovia-Salcedo M.C., Simmons M.P., Soltis P.S., Wanke S., Mast A. 2016. Anchored phylogenomics of angiosperms I: assessing the robustness of phylogenetic estimates. bioRxiv:086298. doi:10.1101/086298.
Cardillo M., Weston P.H., Reynolds Z.K.M., Olde P.M., Mast A.R., Lemmon E.M., Lemmon A.R., Bromham L. 2017. The phylogeny and biogeography of Hakea (Proteaceae) reveals the role of biome shifts in a continental plant radiation. Evolution 71:1928–1943.
Chifman J., Kubatko L. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics 30:3317–3324.
Clausen J. 1931. Genetic studies in Polemonium. III. Hereditas 15:62–66.
Clausen J. 1967. Biosystematic consequences of ecotypic and chromosomal differentiation. Taxon 16:271–279.
Cranston K.A., Hurwitz B., Ware D., Stein L., Wing R.A. 2009. Species trees from highly incongruent gene trees in rice. Syst. Biol. 58:489–500.
Crowl A.A., Myers C., Cellinese N. 2017. Embracing discordance: Phylogenomic analyses provide evidence for allopolyploidy leading to cryptic diversity in a Mediterranean Campanula (Campanulaceae) clade. Evolution 71:913–922.
Cui R., Schumer M., Kruesi K., Walter R., Andolfatto P., Rosenthal G.G. 2013. Phylogenomics reveals extensive reticulate evolution in Xiphophorus fishes. Evolution 67:2166–2179.
Davidson J.F. 1950. The genus Polemonium [Tournefort] L. Univ. Calif. Publ. Bot. 23:209–282.
Eaton D.A., Ree R.H. 2013. Inferring phylogeny and introgression using RADseq data: an example from flowering plants (Pedicularis: Orobanchaceae). Syst. Biol. 62:689–706.
Everson K.M., Soarimalala V., Goodman S.M., Olson L.E. 2016. Multiple loci and complete taxonomic sampling resolve the phylogeny and biogeographic history of tenrecs (Mammalia: Tenrecidae) and reveal higher speciation rates in Madagascar’s humid forests. Syst. Biol. 65:890–909.
Faircloth B., McCormack J.E., Crawford N.G., Harvey M.G., Brumfield R.T., Glenn T.C. 2012. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Syst. Biol. 61:717–726.
Fehrer J., Gemeinholzer B., Chrtek J. Jr., Bräutigam S. 2007. Incongruent plastid and nuclear DNA phylogenies reveal ancient intergeneric hybridization in Pilosella hawkweeds (Hieracium, Cichorieae, Asteraceae). Mol. Phylogenet. Evol. 42:347–361.
Feulner M., Möseler B.M., Nezadal W. 2001. Introgression und morphologische Variabilität bei der Blauen Himmelsleiter, Polemonium caeruleum L. in Nordbayern, Deutschland. Feddes Repertorium 112:231–246.
Folk R.A., Mandel J.R., Freudenstein J.V. 2017. Ancestral gene flow and parallel organellar genome capture result in extreme phylogenomic discord in a lineage of angiosperms. Syst. Biol. 66:320–337.
Folk R.A., Soltis P.S., Soltis D.E., Guralnick R. 2018. New prospects in the detection and comparative analysis of hybridization in the tree of life. Am. J. Bot. 105:364–375.
Galen C., Kevan P.G. 1980. Scent and color, floral polymorphisms and pollination biology in Polemonium viscosum Nutt. Am. Midl. Nat. 104:281–289.
Galen C., Newport M.E.A. 1987. Bumble bee behavior and selection on flower size in the sky pilot, Polemonium viscosum. Oecologia 74:20–23.
Galen C., Zimmer K.A., Newport M.E. 1987. Pollination in floral scent morphs of Polemonium viscosum: a mechanism for disruptive selection on flower size. Evolution 41:599–606.
Galen C., Shore J.S., Deyoe H. 1991. Ecotypic divergence in alpine Polemonium viscosum: genetic structure, quantitative variation, and local adaptation. Evolution 45:1218–1228.
de Geofroy I. 1998. Phylogeny and biogeography of the high-elevation species of Polemonium (Polemoniaceae) [M.S. Thesis]. San Francisco State University.
Gernandt D.S., Aguirre Dugua X., Vázquez-Lobo A., Willyard A., Moreno Letelier A., Pérez de la Rosa J.A., Piñero D., Liston A. 2018. Multi-locus phylogenetics, lineage sorting, and reticulation in Pinus subsection Australes. Am. J. Bot. 105:711–725.
Grant V. 1959. Natural history of the phlox family. The Hague: Martinus Nijhoff.
Grant V. 1989. Taxonomy of the tufted alpine and subalpine polemoniums (Polemoniaceae). Bot. Gaz. 150:158–169.
Grummer J.A., Morando M.M., Avila L.J., Sites J.W. Jr., Leaché A.D. 2018. Phylogenomic evidence for a recent and rapid radiation of lizards in the Patagonian Liolaemus fitzingerii species group. Mol. Phylogenet. Evol. 125:243–254.
Hamilton C.A., Lemmon A.R., Lemmon E.M., Bond J.M. 2016. Expanding anchored hybrid enrichment to resolve both deep and shallow relationships within the spider tree of life. BMC Evol. Biol. 16:212.
Heled J., Drummond A.J. 2010. Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27:570–580.
Ho L.S.T., Ané C. 2014. A linear-time algorithm for Gaussian and non-Gaussian trait evolution models. Syst. Biol. 63:397–408.
Hudson R.R. 2002. Generating samples under a Wright-Fisher neutral model. Bioinformatics 18:337–338.
Hultén E. 1971. The circumpolar plants. 2. Dicotyledons. Stockholm: Almqvist and Wiksell.
Irwin J.J., Stubbs R., Hartman R.L. 2012. Polemonium elusum (Polemoniaceae), a new species from east central Idaho, USA. J. Bot. Res. Inst. Texas 6:331–338.
Johnson L.A., Porter J.M. 2017. Fates of angiosperm species following long-distance dispersal: Examples from American amphitropical Polemoniaceae. Am. J. Bot. 104:1729–1744.
Johnson L.A., Chan L.M., Weese T.L., Busby L.D., McMurry S. 2008. Nuclear and cpDNA sequences combined provide strong inference of higher phylogenetic relationships in the phlox family (Polemoniaceae). Mol. Phylogenet. Evol. 48:997–1012.
Johnson L.A., Schultz J.L., Soltis D.E., Soltis P.S. 1996. Monophyly and generic relationships of Polemoniaceae based on matK sequences. Am. J. Bot. 83:1207–1224.
Kamneva O.K., Rosenberg N.A. 2017. Simulation-based evaluation of hybridization network reconstruction methods in the presence of incomplete lineage sorting. Evol. Bioinform. 13:1176934317691935.
Katoh K., Standley D.M. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30:772–780.
Kearse M., Moir R., Wilson A., Stones-Havas S., Cheung M., Sturrock S., Buxton S., Cooper A., Markowitz S., Duran C., Thierer T., Ashton B., Meintjes P., Drummond A. 2012. Geneious basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28:1647–1649.
Kriebel R., Drew B.T., Drummond C.P., González-Gallegos J.G., Celep F., Mahdjoub M.M., Rose J.P., Xiang C.-L., Hu G.-X., Walker J.B., Lemmon E.M., Lemmon A.R., Sytsma K.J. 2019. Tracking the temporal shifts in area, biomes, and pollinators in the radiation of Salvia (sages, Lamiaceae) across continents: leveraging Anchored Hybrid Enrichment and targeted sequence data. Am. J. Bot. 106:573–597.
Kulbaba M.W., Worley A.C. 2013. Selection on Polemonium brandegeei (Polemoniaceae) flowers under hummingbird pollination: in opposition, parallel, or independent of selection by hawkmoths? Evolution 67:2194–2206.
Landis J.B., O’Toole R.D., Ventura K.L., Gitzendanner M.A., Oppenheimer D.G., Soltis D.E., Soltis P.S. 2016. The phenotypic and genetic underpinnings of flower size in Polemoniaceae. Front. Plant Sci. 6:1144.
Landis J.B., Bell C.D., Hernandez M., Zenil-Ferguson R., McCarthy E.W., Soltis D.E., Soltis P.S. 2018. Evolution of floral traits and impact of reproductive mode on diversification in the phlox family (Polemoniaceae). Mol. Phylogenet. Evol. 127:878–890.
Larget B.R., Kotha S.K., Dewey C.N., Ané C. 2010. BUCKy: gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics 26:2910–2911.
Leaché A.D., Harris R.B., Rannala B., Yang Z. 2014. The influence of gene flow on species tree estimation: a simulation study. Syst. Biol. 63:17–30.
Lee-Yaw J.A., Grassa C.J., Joly S., Andrew R.L., Rieseberg L.H. 2019. An evaluation of alternative explanations for widespread cytonuclear discordance in annual sunflowers (Helianthus). New Phytol. 221:515–526.
Lemmon A.R., Emme S.A., Lemmon E.M. 2012. Anchored hybrid enrichment for massively high-throughput phylogenomics. Syst. Biol. 61:727–744.
Léveillé-Bourret E., Starr J.R., Ford B.A., Lemmon E.M., Lemmon A.R. 2017. Resolving rapid radiations within angiosperm families using anchored phylogenomics. Syst. Biol. 67:94–112.
Levin D.A. 2003. The cytoplasmic factor in plant speciation. Syst. Bot. 28:5–11.
Liu L., Yu L., Edwards S.V. 2010. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol. Biol. 10:302.
Martin S.H., Davey J.W., Jiggins C.D. 2014. Evaluating the use of ABBA–BABA statistics to locate introgressed loci. Mol. Biol. Evol. 32:244–257.
Meyer B.S., Matschiner M., Salzburger W. 2015. A tribal level phylogeny of Lake Tanganyika cichlid fishes based on a genomic multi-marker approach. Mol. Phylogenet. Evol. 83:56–71.
McVay J.D., Hauser D., Hipp A.L., Manos P.S. 2017a. Phylogenomics reveals a complex evolutionary history of lobed-leaf white oaks in Western North America. Genome 60:733–742.
McVay J.D., Hauser D., Hipp A.L., Manos P.S. 2017b. A genetic legacy of introgression confounds phylogeny and biogeography in oaks. Proc. R. Soc. Lond. 284:20170300.
Meyer M., Kircher M. 2010. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harbor Protocols 2010:pdb.prot5448.
Mirarab S., Reaz R., Bayzid M.S., Zimmermann T., Swenson M.S., Warnow T. 2014. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30:i541–i548.
Mirarab S., Warnow T. 2015. ASTRAL-II: Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31:i44–i52.
Mitchell N., Lewis P.O., Lemmon E.M., Lemmon A.R., Holsinger K.E. 2017. Anchored phylogenomics improves the resolution of evolutionary relationships in the rapid radiation of Protea L. (Proteaceae). Am. J. Bot. 104:102–115.
Moore W.S. 1995. Inferring phylogenies from mtDNA variation: mitochondrial-gene trees versus nuclear-gene trees. Evolution 49:718–726.
Morales-Briones D.F., Liston A., Tank D.C. 2018. Phylogenomic analyses reveal a deep history of hybridization and polyploidy in the Neotropical genus Lachemilla (Rosaceae). New Phytol. 218:1668–1684.
Morando M., Olave M., Avila L.J., Sites J.W. Jr., Leaché A.D. 2020. Phylogenomic data resolve higher-level relationships within South American Liolaemus lizards. Mol. Phylogenet. Evol. 147:106781.
Murray D.F., Elven R. 2011. Polemonium villosissimum (Polemoniaceae), an overlooked species in Alaska and Yukon Territory. J. Bot. Res. Inst. Texas 5:19–24.
Olave M., Avila L.J., Sites J.W., Morando M. 2017. Detecting hybridization by likelihood calculation of gene tree extra lineages given explicit models. Methods Ecol. Evol. 9:121–133.
Ostenfeld C.H. 1929. Genetic studies in Polemonium. Hereditas 12:33–40.
Patterson N., Moorjani P., Luo Y., Swapan M., Rohland N., Zhan Y., Genschoreck T., Webster T., Reich D. 2012. Ancient admixture in human history. Genetics 192:1065–1093.
Palmer J.D., Herbon L.A. 1988. Plant mitochondrial DNA evolves rapidly in structure, but slowly in sequence. J. Mol. Evol. 28:87–97.
Pease J.B., Hahn M.W. 2015. Detection and polarization of introgression in a five-taxon phylogeny. Syst. Biol. 64:651–662.
Petit R.J., Kremer A., Wagner D.B. 1993. Geographic structure of chloroplast DNA polymorphisms in European oaks. Theor. Appl. Genet. 87:122–128.
Pham K.K., Hipp A.L., Manos P.S., Cronn R.C. 2017. A time and a place for everything: phylogenetic history and geography as joint predictors of oak plastome phylogeny. Genome 60:720–732.
Poczai P., Hyvönen J. 2010. Nuclear ribosomal spacer regions in plant phylogenetics: problems and prospects. Mol. Biol. Rep. 37:1897–1912.
Porter J.M. 1996. Phylogeny of Polemoniaceae based on nuclear ribosomal internal transcribed spacer DNA sequences. Aliso 15:57–77.
Porter J.M., Johnson L.A. 1998. Phylogenetic relationships of Polemoniaceae: inferences from mitochondrial nad1 intron sequences. Aliso 17:157–188.
Prather L.A., Ferguson C.J., Jansen R.K. 2000. Polemoniaceae phylogeny and classification: Implications of sequence data from the chloroplast gene ndhF. Am. J. Bot. 87:1300–1308.
Prum R.O., Berv J.S., Dornburg A., Field D.J., Townsend J.P., Lemmon E.M., Lemmon A.R. 2015. A comprehensive phylogeny of birds (Aves) using targeted next-generation DNA sequencing. Nature 526:569–573.
Pyron R.A., O’Connell K.A., Lemmon E.M., Lemmon A.R., Beamer D.A. 2020. Phylogenomic data reveal reticulation and incongruence among mitochondrial candidate species in Dusky Salamanders (Desmognathus). Mol. Phylogenet. Evol. 146:106751.
Rautenberg A., Hathaway L., Oxelman B., Prentice H.C. 2010. Geographic and phylogenetic patterns in Silene section Melandrium (Caryophyllaceae) as inferred from chloroplast and nuclear DNA sequences. Mol. Phylogenet. Evol. 57:978–991.
Rieseberg L.H., Soltis D.E. 1991. Phylogenetic consequences of cytoplasmic gene flow in plants. Evol. Trends Plants 5:65–84.
Rieseberg L.H. 2001. Chromosomal rearrangements and speciation. Trends Ecol. Evol. 16:351–358.
Roberts W.R., Roalson E.H. 2018. Phylogenomic analyses reveal extensive gene flow within the magic flowers (Achimenes). Am. J. Bot. 105:726–740.
Rokyta D.R., Lemmon A.R., Margres M., Aronow K. 2012. The venom-gland transcriptome of the eastern diamondback rattlesnake (Crotalus adamanteus). BMC Genomics 13:312.
Ronquist F., Teslenko M., Van Der Mark P., Ayres D.L., Darling A., Höhna S., Larget B., Liu L., Suchard M.A., Huelsenbeck J.P. 2012. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61:539–542.
Rose J.P., Kleist T.J., Löfstrand S.D., Drew B.T., Schönenberger J., Sytsma K.J. 2018. Phylogeny, historical biogeography, and diversification of angiosperm order Ericales suggest ancient Neotropical and East Asian connections. Mol. Phylogenet. Evol. 122:59–79.
Rojas-Andres B.M., Albach D.C., Martinez-Ortega M.M. 2015. Exploring the intricate evolutionary history of the diploid-polyploid complex Veronica subsection Pentasepalae (Plantaginaceae). Bot. J. Linn. Soc. 179:670–692.
Sayyari E., Mirarab S. 2016. Fast coalescent-based computation of local branch support from quartet frequencies. Mol. Biol. Evol. 33:1654–1668.
Sayyari E., Mirarab S. 2018.
Testing for polytomies in phylogenetic species trees using quartet frequencies . Genes 9 : 132 . Google Scholar Crossref Search ADS WorldCat Schuster T.M. , Setaro S.D., Tibbits J.F.G., Batty E.L., Fowler R.M., McLay T.G.B., Wilcox S., Ades P.K., Bayly M.J. 2018 . Chloroplast variation is incongruent with classification of the Australian bloodwood eucalypts (genus Corymbia, family Myrtaceae) . PLoS One 13 : e0195034 . Google Scholar Crossref Search ADS PubMed WorldCat Sledge J.L. , Anway J.C. 1970 . Hybridization between members of Polemonium delicatum Ryd . and P. foliosissimum A. Gray var. molle (Greene) Anway in southern Utah. Am. Midl. Nat. 84 : 136 – 143 . Google Scholar OpenURL Placeholder Text WorldCat Smith R.L. , Sytsma K.J. 1990 . Evolution of Populus nigra (sect . Aigeiros): introgressive hybridization and the chloroplast contribution of Populus alba (sect. Populus). Am. J. Bot. 77 : 1176 – 1187 . Google Scholar OpenURL Placeholder Text WorldCat Snir S. , Rao S. 2012 . Quartet MaxCut: a fast algorithm for amalgamating quartet trees . Mol. Phylogenet. Evol. 62 : 1 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat Solís-Lemus C. , Ané C. 2016 . Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting . PLoS Genet. 12 : e1005896 . Google Scholar Crossref Search ADS PubMed WorldCat Solís-Lemus C. , Bastide P., Ané C. 2017 . PhyloNetworks: a package for phylogenetic networks . Mol. Biol. Evol. 34 : 3292 – 3298 . Google Scholar Crossref Search ADS PubMed WorldCat Stamatakis A. 2014 . RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies . Bioinformatics 30 : 1312 – 1313 . Google Scholar Crossref Search ADS PubMed WorldCat Steele K.P. , Vilgalys R. 1994 . Phylogenetic analysis of Polemoniaceae using nucleotide sequences of the plastid gene matK . Syst. Bot. 19 : 126 – 142 . Google Scholar Crossref Search ADS WorldCat Stenz N.W. , Larget B., Baum D.A., Ané C. 2015 . 
Exploring tree-like and non-tree-like patterns using genome sequences: an example using the inbreeding plant species Arabidopsis thaliana (L.) Heynh . Syst. Biol. 64 : 809 – 823 . Google Scholar Crossref Search ADS PubMed WorldCat Stubbs R.L. , Patterson R. 2013 . Revisions in Polemonium (Polemoniaceae): a new species and a new variety from California . Madroño 60 : 243 – 248 . Google Scholar Crossref Search ADS WorldCat Stubbs R.L. , Folk R.A., Xiang C.-L., Soltis D.E., Cellinese N. 2018 . Pseudo-parallel patterns of disjunctions in an arctic-alpine plant lineage . Mol. Phylogenet. Evol. 123 : 88 – 100 . Google Scholar Crossref Search ADS PubMed WorldCat Swofford D.L. 2002 . PAUP*. Phylogenetic analysis using parsimony (*and other methods). version 4.0a . Sunderland, MA : Sinauer Associates . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Than C.V. , Nakhleh L. 2009 . Species tree inference by minimizing deep coalescences . PLoS Comp. Biol. 5 : e1000501 . Google Scholar Crossref Search ADS WorldCat Than C. , Ruths D., Nakhleh L. 2008 . PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships . BMC Bioinformatics 9 : 322 . Google Scholar Crossref Search ADS PubMed WorldCat Timme R.E. 2001 . A molecular phylogeny of the genus Polemonium (Polemoniaceae) [M.S. Thesis] . San Francisco State University . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Tsitrone A. , Kirkpatrick M., Levin D.A. 2003 . A model for chloroplast capture . Evolution 57 : 1776 – 1782 . Google Scholar Crossref Search ADS PubMed WorldCat Vargas O.M. , Ortiz E.M., Simpson B.B. 2017 . Conflicting phylogenomic signals reveal a pattern of reticulate evolution in a recent high-Andean diversification (Asteraceae: Astereae: Diplostephium) . New Phytol. 214 : 1736 – 1750 . Google Scholar Crossref Search ADS PubMed WorldCat Vassiljev V. 1974 . Polemonium L. In: Shishkin B.K., editor. 
Flora of the U.S.S.R.,Volume XIX Tubiflorae . Jerusalem : Israel Program for Scientific Translations . p. 58 – 69 . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Wherry E.T. 1942 . The genus Polemonium in America . Am. Midl. Nat. 27 : 741 – 760 . Google Scholar Crossref Search ADS WorldCat Wherry E.T. 1967 . Our temperate tufted polemoniums . Aliso 6 : 97 – 101 . Google Scholar Crossref Search ADS WorldCat Whittemore A.T. , Schaal B.A. 1991 . Interspecific gene flow in sympatric oaks . Proc. Natl. Acad. Sci. USA 88 : 2540 – 2544 . Google Scholar Crossref Search ADS WorldCat Wickett N.J. , Mirarab S., Nguyen N., Warnow T., Carpenter E., Matasci N., Ayyampalayam S., Barker M.S., Burleigh J.G., Gitzendanner M.A., Ruhfel B.R. 2014 . Phylotranscriptomic analysis of the origin and early diversification of land plants . Proc. Natl. Acad. Sci. USA 111 : E4859 – E4868 . Google Scholar Crossref Search ADS WorldCat Wolf P.G. , Murray R.A., Sipes S.D. 1997 . Species-independent, geographical structuring of chloroplast DNA haplotypes in a montane herb Ipomopsis (Polemoniaceae) . Mol. Ecol. 6 : 283 – 291 . Google Scholar Crossref Search ADS WorldCat Worley A.C. , Ghazvini H., Schemske D.W. 2009 . A phylogeny of the genus Polemonium based on amplified fragment length polymorphism (AFLP) markers . Syst. Bot. 34 : 149 – 161 . Google Scholar Crossref Search ADS WorldCat Yi T.-S. , Jin G.-H., Wen J. 2015 . Chloroplast capture and intra- and inter-continental biogeographic diversification in the Asian-New World disjunct plant genus Osmorhiza (Apiaceae) . Mol. Phylogenet. Evol. 85 : 10 – 21 . Google Scholar Crossref Search ADS PubMed WorldCat Yu Y. , Nakhleh L. 2015 . A maximum pseudo-likelihood approach for phylogenetic networks . BMC Genomics 16 : S10 . Google Scholar Crossref Search ADS PubMed WorldCat Zhang C. , Sayyari E., Mirarab S. 2017 . ASTRAL-III: increased scalability and impacts of contracting low support branches. 
In: Meidanis J., Nakhleh L., editors. RECOMB International Workshop on Comparative Genomics . Cham, Switzerland : Springer . p. 53 – 75 . Google Scholar Crossref Search ADS Google Preview WorldCat COPAC © The Author(s) 2020. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please email: [email protected] This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Markov-Modulated Continuous-Time Markov Chains to Identify Site- and Branch-Specific Evolutionary Variation in BEAST
Baele, Guy; Gill, Mandev S; Bastide, Paul; Lemey, Philippe; Suchard, Marc A
doi: 10.1093/sysbio/syaa037 pmid: 32415977
Abstract Markov models of character substitution on phylogenies form the foundation of phylogenetic inference frameworks. Early models made the simplifying assumption that the substitution process is homogeneous over time and across sites in the molecular sequence alignment. While standard practice adopts extensions that accommodate heterogeneity of substitution rates across sites, heterogeneity in the process over time in a site-specific manner remains frequently overlooked. This is problematic, as evolutionary processes that act at the molecular level are highly variable, subjecting different sites to different selective constraints over time, impacting their substitution behavior. We propose incorporating time variability through Markov-modulated models (MMMs), which extend covarion-like models and allow the substitution process (including relative character exchange rates as well as the overall substitution rate) at individual sites to vary across lineages. We implement a general MMM framework in BEAST, a popular Bayesian phylogenetic inference software package, allowing researchers to compose a wide range of MMMs through flexible XML specification. Using examples from bacterial, viral, and plastid genome evolution, we show that MMMs impact phylogenetic tree estimation and can substantially improve model fit compared to standard substitution models. Through simulations, we show that marginal likelihood estimation accurately identifies the generative model and does not systematically prefer the more parameter-rich MMMs. To mitigate the increased computational demands associated with MMMs, our implementation exploits recent developments in BEAGLE, a high-performance computational library for phylogenetic inference. [Bayesian inference; BEAGLE; BEAST; covarion, heterotachy; Markov-modulated models; phylogenetics.] Molecular sequence evolution is typically modeled by Markov models of character substitution acting along the branches of a phylogenetic tree. 
These models are phenomenological descriptions of the evolution of DNA as a string of a number of discrete character states, with models of nucleotide substitution among four states being the most widely used in statistical phylogenetics. The Markovian property within such a model reflects the common assumption that evolution has no memory. Further, it is standard to assume that the Markov model is time-homogeneous, so that it can be characterized by a generator or instantaneous rate matrix |$\mathbf{Q}_{}$| that remains constant during evolution (Gascuel and Guindon 2007). Early probabilistic phylogenetic reconstruction methods assumed a single substitution model that acted independently across all sites and lineages. The characters at different alignment sites, however, typically evolve under varying structural or functional constraints, inspiring models that accommodate among-site rate variation by scaling up or down the expected number of substitutions at different sites. Sites evolve, nonetheless, in more qualitatively different ways than simply variation in their overall substitution rates (Pagel and Meade 2004). Furthermore, selective pressures vary over time and often defy a priori site partitioning into sets with approximately equal selection across an alignment. Examples of such a complex interplay between sites come from studies on how the 3D structure of proteins evolves over time. These studies show that, although a few essential sites may be invariable over long periods of evolutionary time, most sites do change their functional environment—and as a result, the functional constraints they are subjected to—during evolution (Penny et al. 2001). In order to capture and accurately model these types of evolutionary phenomena, there is need for a class of flexible substitution models that do not require prior knowledge regarding data partitioning. 
The increase in computational power over the past two decades has enabled the evaluation of complex models in a feasible amount of time, in part by exploiting many-core computing solutions (Suchard and Rambaut 2009). This has paved the way for evaluating high-dimensional substitution models and modeling complex scenarios, such as clade-specific and even branch-specific evolutionary processes. Markov-modulated models (MMMs) constitute a class of mixture models that allow the substitution process to change along each branch, independently for each site in an alignment (we refer interested readers to Supplementary materials available on Dryad at https://doi.org/10.5061/dryad.230s5h0 for an in-depth introduction). In this article, we introduce a Bayesian inference framework for MMMs, with an implementation in BEAST (Suchard et al. 2018), a software package for Bayesian evolutionary analysis, that accommodates phylogenetic uncertainty. In doing so, we strive for optimal generality by allowing switching between evolutionary models within the MMM that have different substitution rates, relative character exchange rates and stationary distributions.

Methods

Markov-Modulated Model Structure

Consider an MMM composed of |$K$| evolutionary models (irrespective of those models being nucleotide, amino acid, or codon models). Each evolutionary model is defined by a relative substitution rate multiplier |$\rho_{k}$| and a substitution model characterized by an instantaneous rate matrix |$\mathbf{Q}_{k} = \left\{ Q^{(k)}_{s s^{\prime}} \right\}$|, of dimension |$S \times S$|, and stationary distribution |$\boldsymbol{\Pi}_{k} = \left( \pi_{k 1}, \ldots, \pi_{k S} \right)$|. We also adopt the usual constraint |$-\sum_{s = 1}^{S} Q^{(k)}_{s s} \pi_{k s} = 1$|, so that each model has an expected rate of one substitution per unit time before scaling by |$\rho_{k}$|.
The switching process between the |$K$| models is defined by a |$K$|-state continuous-time Markov process with rate matrix
$$\begin{equation} \label{eq:switching} \mathbf{\Phi} = \left( \begin{array}{cccc} -\sum_{k \neq 1}\phi_{1k} & \phi_{12} & \cdots & \phi_{1K} \\ \phi_{21} & -\sum_{k \neq 2}\phi_{2k} & & \vdots \\ \vdots & & \ddots & \phi_{K-1,K} \\ \phi_{K1} & \cdots & \phi_{K,K-1} & -\sum_{k \neq K}\phi_{Kk} \end{array} \right), \end{equation}$$ (1.1)
where the element |$\phi_{ij}$| corresponds to the rate of switching from substitution model |$i$| to substitution model |$j$|, and the diagonal elements are fixed such that the rows sum to |$0$|. We denote the stationary distribution of this switching process by |$\boldsymbol{\Psi} = \left(\psi_1, \ldots, \psi_K\right)$|. These model switches follow a homogeneous, stationary, but not necessarily time-reversible, Markovian process. In Equation 1.1, we do not make use of an additional parameter |$\delta$| that expresses the global rate of change between the evolutionary models because this is a deterministic parameter obtained by normalizing the model-switching process (Guindon et al. 2004; Gascuel and Guindon 2007). The MMM is characterized by a |$KS \times KS$| rate matrix |$\mathbf{\Lambda}$| (Fischer and Meier-Hellstern 1993):
$$\begin{equation} \mathbf{\Lambda} = \mathrm{diag}\left( \rho_{1}\mathbf{Q}_{1}, \ldots, \rho_{K}\mathbf{Q}_{K} \right) + \mathbf{\Phi} \otimes \mathbf{I}, \end{equation}$$ (1.2)
where |$\mathrm{diag}(\cdot)$| denotes a block-diagonal matrix, |$\mathbf{I}$| is an |$S \times S$| identity matrix and |$\otimes$| denotes the Kronecker product. The MMM can therefore be considered a single Markov process with a state space equal to the Cartesian product of the state space of the switching process (between the evolutionary models) and the state space of the evolutionary models, with cardinality |$K S$| and stationary distribution |$\boldsymbol{\Pi}_{\mathbf{\Lambda}} = \left( \psi_{1}\pi_{11}, \ldots, \psi_{1}\pi_{1S}, \ldots, \psi_{K}\pi_{K1}, \ldots, \psi_{K}\pi_{KS} \right)$| (Guindon et al. 2004).
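To make the structure of |$\mathbf{\Lambda}$| concrete, the following is a minimal NumPy sketch, not the BEAST implementation. It assumes that the compound generator has the block form |$\mathrm{diag}(\rho_{1}\mathbf{Q}_{1}, \ldots, \rho_{K}\mathbf{Q}_{K}) + \mathbf{\Phi} \otimes \mathbf{I}$| implied by the two transition types described in the text, that compound states are ordered as |$(k, s) \mapsto kS + s$|, and that each |$\mathbf{Q}_{k}$| is a reversible HKY matrix. The function names, the state order A, C, G, T, and all parameter values are hypothetical and chosen only for illustration. Both toy models share one set of base frequencies, so the product-form stationary distribution can be checked numerically.

```python
import numpy as np


def hky_rate_matrix(kappa, pi):
    """Normalized HKY generator; assumed state order is A, C, G, T."""
    pi = np.asarray(pi, dtype=float)
    # Symmetric exchange rates: kappa for transitions (A<->G, C<->T), 1 otherwise.
    R = np.ones((4, 4))
    R[0, 2] = R[2, 0] = kappa
    R[1, 3] = R[3, 1] = kappa
    Q = R * pi[np.newaxis, :]            # off-diagonal Q_ij = pi_j * R_ij
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # rows sum to zero
    # Rescale so that -sum_s pi_s * Q_ss = 1 (one expected substitution per unit time).
    return Q / -(pi * np.diag(Q)).sum()


def mmm_generator(Qs, rhos, Phi):
    """Compound KS x KS generator: block-diagonal rho_k * Q_k plus Phi (x) I."""
    K, S = len(Qs), Qs[0].shape[0]
    Lam = np.kron(Phi, np.eye(S))        # switches (k, s) -> (k', s) at rate phi_kk'
    for k in range(K):
        block = slice(k * S, (k + 1) * S)
        Lam[block, block] += rhos[k] * Qs[k]   # substitutions within model k
    return Lam


# Hypothetical toy values, not estimates from the article.
pi = np.array([0.2, 0.3, 0.3, 0.2])
Q1 = hky_rate_matrix(2.0, pi)
Q2 = hky_rate_matrix(8.0, pi)
Phi = np.array([[-0.3, 0.3],
                [0.1, -0.1]])
psi = np.array([0.25, 0.75])             # solves psi @ Phi = 0

Lam = mmm_generator([Q1, Q2], rhos=[0.5, 2.0], Phi=Phi)
Pi_Lam = np.kron(psi, pi)                # entries psi_k * pi_s

print(Lam.shape)
print(np.allclose(Lam.sum(axis=1), 0.0), np.allclose(Pi_Lam @ Lam, 0.0))
```

Using `np.kron` for the switching part keeps the ordering of compound states aligned with the ordering of |$\boldsymbol{\Pi}_{\mathbf{\Lambda}}$| given above.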
As noted by Gascuel and Guindon (2007), the MMM in Equation 1.2 allows for every compound state |$(k, s)$| to either: 1) stay in model |$k$| and transition to |$(k, s^{\prime})$| with rate defined by |$\rho_{k} \mathbf{Q}_{k}$|, or 2) change evolutionary models and transition to |$(k^{\prime}, s)$| with rate |$\phi_{k k^{\prime}}$|. All rows in |$\mathbf{\Lambda}$| sum to |$0$|, and because |$\boldsymbol{\Psi}\mathbf{\Phi} = 0$| and |$\boldsymbol{\Pi}_{k}\mathbf{Q}_{k} = 0$|, it follows that |$\boldsymbol{\Pi}_{\mathbf{\Lambda}}\mathbf{\Lambda} = 0$|. We refer to Supplementary material available on Dryad for additional information on these MMMs, for example on their identifiability when combining them with among-site rate variation (ASRV; Yang 1994, 1996).

Likelihood

In this section, we adopt a similar notation to Gascuel and Guindon (2007) to describe the data likelihood under an MMM. Likelihood calculations for MMMs employ a standard pruning approach (Felsenstein 1981), with integration over the compound states (i.e., the evolutionary model and character state) at the internal nodes of the tree, and integration over the unobserved categories at the tips. Let |$\mathbf{Y} = (\mathbf{Y}_1, \ldots, \mathbf{Y}_{L}),$| where |$\mathbf{Y}_{\ell}$| are the extant characters observed at aligned site |$\ell$| for |$\ell = 1, \ldots, L$|, and let |${\cal T}$| denote the phylogenetic tree with its branch lengths. Let |${\cal M(\boldsymbol{\theta}, \boldsymbol{\phi})}$| denote the MMM that models the evolutionary process for all sites, where |$\boldsymbol{\theta} = \{\boldsymbol \theta_1, \ldots, \boldsymbol \theta_{K}\}$|, |$\boldsymbol\theta_k$| represents the parameters of the |$k$|th evolutionary model, and |$\boldsymbol{\phi}$| represents the parameters of the switching process.
The observed data likelihood is:
$$\begin{equation} L(\boldsymbol{\theta}, \boldsymbol{\phi}, {\cal T}, {\cal M} \mid \mathbf{Y}) = \prod_i \left( \sum_{(k, s)} \psi_{k}\pi_{k s} L_i^R((k,s), \boldsymbol{\theta}, \boldsymbol{\phi}, {\cal T}, {\cal M} \mid \mathbf{Y}_i) \right), \end{equation}$$ (1.3)
where the product is taken over every site |$i$| in the alignment, with each site assumed to evolve independently. The sum over the compound states |$(k, s)$| replaces the sum over the nucleotide characters |$s$| that is performed for standard nucleotide substitution models (Gascuel and Guindon 2007). Here, |$L_i^R((k,s), \boldsymbol{\theta}, \boldsymbol{\phi}, {\cal T}, {\cal M} \mid \mathbf{Y}_i)$| is the likelihood of the data at site |$i$| under category |$k$| and given that state |$s$| is observed at site |$i$| of the root node |$R$|. We can generalize this notation as |$L_i^v((k,s), \boldsymbol{\theta}, \boldsymbol{\phi}, {\cal T}, {\cal M} \mid \mathbf{Y}_i)$| for node |$v$| to express the partial likelihood of observing the characters at site |$i$| in the extant sequences descending from |$v$|. This notation can be shortened to |$L_i^v(k,s)$| because |$\boldsymbol{\theta}$|, |$\boldsymbol{\phi}$|, |${\cal T}$|, |${\cal M}$|, and |$\mathbf{Y}$| are the same for all sites and nodes. Let |$l$| and |$r$| be the left and right descendants of |$v$| and |$t_{v}$| the length of the branch connecting |$v$| to its parent.
Each partial likelihood is then defined as follows (taking into account that the evolutionary categories are unobserved; Gascuel and Guindon 2007):
$$\begin{equation} L_i^v(k, s) = \left\{ \begin{array}{ll} 1 & \textrm{if } v \textrm{ is a leaf with nucleotide character } s, \\ 0 & \textrm{if } v \textrm{ is a leaf with a nucleotide character different from } s, \\ \left( \sum_{(k', s')} P_{(k,s)(k',s')}(t_{l}) L_i^l(k', s') \right) \left( \sum_{(k', s')} P_{(k,s)(k',s')}(t_{r}) L_i^r(k', s') \right) & \textrm{otherwise}. \end{array} \right. \end{equation}$$ (1.4)
The substitution probabilities |$P_{(k,s)(k',s')}(t)$| are computed using matrix exponentiation of |$\mathbf{\Lambda}$| with computational complexity |${\cal O} \left( K^3 S^3 \right)$| (Pan and Chen 1999), although lower complexity may be achieved depending on the Kronecker structure of |$\boldsymbol{\Lambda}$| (but see the Supplementary material available on Dryad). Computing these probabilities for all |${\cal O} \left( N \right)$| branches in the phylogeny therefore has a complexity of |${\cal O} \left( N K^3 S^3 \right)$|. Evaluating the |$L$| site likelihoods through the tree-pruning (or peeling) algorithm (Felsenstein 1981) amounts to a complexity of |${\cal O} \left( N L K^2 S^2 \right)$|. Adding the relatively small cost |${\cal O} \left( L \right)$| of taking the logarithm of the site likelihoods and summing over sites, the log-likelihood of the observed data has a computational complexity of |${\cal O} \left( N K^3 S^3 + N L K^2 S^2 \right)$|.

Implementation

We have implemented MMMs and their corresponding likelihood function in BEAST (Suchard et al. 2018), a widely used software package for Bayesian phylogenetic and phylodynamic inference using Markov chain Monte Carlo integration.
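The pruning recursion of Equation 1.4 and the root sum of Equation 1.3 can be illustrated with a small NumPy sketch for a single site on a three-taxon tree. This is not the BEAST/BEAGLE code path and makes several simplifying assumptions: both hidden models share a JC69 matrix and differ only in their rate multiplier (a covarion-like toy case), the matrix exponential is approximated by a plain Taylor series with scaling and squaring rather than the routines used in practice, and the nested-tuple tree encoding, function names, and parameter values are all hypothetical.

```python
import itertools
import numpy as np

K, S = 2, 4  # two hidden models, four nucleotide states

# Hypothetical toy parameters (not estimates from the article).
Q = (np.ones((S, S)) - S * np.eye(S)) / (S - 1)   # JC69 normalized to rate 1
rhos = np.array([0.5, 2.0])                       # slow and fast model
Phi = np.array([[-0.3, 0.3],
                [0.1, -0.1]])                     # model-switching rates
Lam = np.kron(np.diag(rhos), Q) + np.kron(Phi, np.eye(S))

psi = np.array([0.25, 0.75])                      # solves psi @ Phi = 0
root_prior = np.kron(psi, np.full(S, 1.0 / S))    # entries psi_k * pi_s


def transition_probs(gen, t, squarings=10, terms=12):
    """Approximate exp(gen * t) by a Taylor series plus scaling and squaring."""
    n = gen.shape[0]
    A = gen * (t / 2.0 ** squarings)
    P = np.eye(n)
    term = np.eye(n)
    for i in range(1, terms + 1):
        term = term @ A / i
        P = P + term
    for _ in range(squarings):
        P = P @ P
    return P


def partial_likelihood(node, obs):
    """Return (vector of L^v(k, s) over compound states, branch length above v).

    A leaf is (name, t); an internal node is (left, right, t).
    """
    if isinstance(node[0], str):
        name, t = node
        L = np.zeros(K * S)
        # The hidden model is unobserved at the tips: every (k, s) with the
        # observed nucleotide s gets likelihood 1, whatever k is.
        L[obs[name]::S] = 1.0
        return L, t
    left, right, t = node
    L_left, t_left = partial_likelihood(left, obs)
    L_right, t_right = partial_likelihood(right, obs)
    L = (transition_probs(Lam, t_left) @ L_left) * \
        (transition_probs(Lam, t_right) @ L_right)
    return L, t


def site_likelihood(tree, obs):
    """Sum the root partial likelihoods over compound states (Equation 1.3)."""
    L_root, _ = partial_likelihood(tree, obs)
    return float(root_prior @ L_root)


# ((A:0.1, B:0.2):0.05, C:0.3); nucleotides coded 0..3.
tree = ((("A", 0.1), ("B", 0.2), 0.05), ("C", 0.3), 0.0)
lik = site_likelihood(tree, {"A": 0, "B": 0, "C": 2})
print(lik)
```

A useful sanity check on such a sketch is that the site likelihoods summed over all |$4^3$| possible tip patterns equal 1, which holds whenever the root prior sums to 1 and every row of the transition matrices sums to 1.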
These models are available for use in BEAST through XML specification, allowing users to specify a wide range of modeling assumptions such as the ones detailed in this article (and the Supplementary material available on Dryad). The use of MMMs substantially increases computation time in likelihood-based inference, so we offload the computationally demanding aspects to powerful multi- and many-core hardware through the BEAGLE library (Ayres et al. 2019).

Biological Examples

We here consider substitution models that are time-reversible, and therefore substitution model |$k$| will have instantaneous rates |$Q^{(k)}_{ij}$| that can be expressed in terms of base frequencies |$\pi_{kj}$| and symmetric rate parameters |$R^{(k)}_{i \leftrightarrow j} = R^{(k)}_{j \leftrightarrow i}$| as follows:
$$\begin{equation} Q^{(k)}_{ij} = \pi_{kj} R^{(k)}_{i \leftrightarrow j}. \end{equation}$$ (1.5)
Thus a substitution model can be specified in terms of its base frequencies and symmetric rate parameters |$\textbf R_k = \{R^{(k)}_{i \leftrightarrow j} \mid i \neq j, (i,j) \in \mathcal S \}$|. We adopt the following notation: MMM(|$M$|)|$_{ijkl}$|, where |$M$| denotes the type of substitution model and |$i$|, |$j$|, |$k$|, and |$l$| denote the number of distinct sets of symmetric rate parameters, the number of sets of base frequencies, the number of relative rate multipliers, and the structure of |$\mathbf{\Phi}$| as either symmetric/triangular (|$T$|) or asymmetric (|$A$|), respectively. For example, an MMM(HKY)|$_{222T}$| refers to an MMM featuring two different HKY substitution models, each with its own set of symmetric rate parameters and set of base frequencies, two different relative rate multipliers and a symmetric rate switching matrix |$\mathbf{\Phi}$|.
An MMM(HKY)|$_{122A}$| refers to an MMM featuring two different HKY substitution models that share the same set of symmetric rate parameters but have different sets of base frequencies, along with two different relative rate multipliers and an asymmetric rate switching matrix |$\mathbf{\Phi}$|. When the relative rate multipliers are all fixed to 1 to superimpose an ASRV model (see Supplementary material available on Dryad), the |$k$| subscript is omitted (e.g., MMM(HKY)|$_{22A}$|). We here consider two empirical data sets that show the importance of employing MMMs to accurately model the substitution process, as supported by Bayesian model selection. In Supplementary material available on Dryad, we analyze two additional empirical data sets (a plant plastid gene and an influenza A virus data set) that provide evidence in favor of MMMs over traditional substitution models but also showcase the wide range of modeling assumptions possible within our MMM formulation.

Bacterial 16S Ribosomal RNA

Differences in base composition throughout the genome can bias phylogenetic inference when not properly taken into account. Often, the proportion of A+T in a genome differs from that of G+C, and different organisms exhibit different patterns of base composition. At the level of the entire genome, GC content varies greatly within and among major groups of organisms, which can skew phylogenetic reconstruction if not properly accounted for (Mooers and Holmes 2000). Two different evolutionary processes have been singled out as possible explanations for varying patterns of base composition: biases in the underlying process of mutation, as similar levels of GC content are often found in regions with different functional constraints, and natural selection, with increased global GC content in bacteria possibly being selected for by UV exposure (Singer and Ames 1970).
Environmental variation shaping nucleotide composition may cause unrelated taxa to share similar base composition and therefore be grouped together within a clade. To accurately reconstruct evolutionary histories through phylogenetic inference, these potentially differing base compositions need to be accommodated in an explicit manner by the nucleotide substitution model. To address this, Blanquart and Lartillot (2006) developed a nonstationary and nonhomogeneous model accounting for compositional biases, allowing the composition to change at random points in the tree, with the total number of change points across the tree being inferred from the data. Through a Bayesian analysis of eubacterial 16S rRNA and BAS1 gene yeast data sets, the authors show that in most cases, the stationarity assumption was rejected in favor of their nonstationary model. We evaluate our MMM framework on 16S ribosomal RNA of five bacterial sequences: Deinococcus radiodurans, Thermus thermophilus, Thermotoga maritima, Aquifex pyrophilus, and Bacillus subtilis (GenBank accession numbers: Y11332.1, AJ251939.1, NR_029163.1, M83548.2, and CP009796.1). We use standard nucleotide substitution models as well as MMMs to infer their evolutionary history while fixing the Aquifex pyrophilus sequence as an outgroup. Given that the data contain three thermophilic (high GC content) and two mesophilic (lower GC content) bacteria genera (Mooers and Holmes 2000), we consider only MMM(|$*$|)|$_{22*}$| models and do not further explore higher-dimensional models. The true tree topology of this eubacterial data set is believed to group D. radiodurans and T. thermophilus together to the exclusion of B. subtilis, T. maritima, and A. pyrophilus, given that D. radiodurans and T. thermophilus share the same peptidoglycan and menaquinone type (Murray 1992). However, phylogenetic reconstruction under stationary models has a tendency to erroneously group D. radiodurans and B. 
subtilis together, because these mesophiles have similar, relatively low GC content. Figure 1 shows the results of the phylogenetic reconstructions, with the HKY and GTR models—both featuring an ASRV model and a relaxed molecular clock with an underlying lognormal distribution—yielding similar (log) marginal likelihoods (we refer to Supplementary material available on Dryad for details on the marginal likelihood estimation procedure). Note that, because we will include an ASRV model in all of these MMMs, we set all |$\rho_{i}$| in Equation 1.2 to 1 to ensure identifiability. Both the HKY and GTR models express strong support in favor of a clustering of D. radiodurans and B. subtilis (see Fig. 1), with the GTR model yielding a small increase in model fit to the data over the HKY model (log BF |$<$| 1). As such, both models yield an incorrect clustering, which appears to be primarily based on both sequences being mesophilic (low GC content), whereas the three other sequences are considered thermophilic (high GC content). While an MMM of the type introduced by Tuffley and Steel (1998) offers no improvement over these models when |$\mathbf{Q}_{1}$| is parameterized as an HKY model (see Supplementary material available on Dryad for the model’s details), a significant improvement in model fit can be obtained when |$\mathbf{Q}_{1}$| is parameterized as a GTR model (log BF = 19). However, any MMM with two sets of base frequencies and with either a single set of symmetric rate parameters (an MMM(|$*$|)|$_{12*}$|) or with two different sets of symmetric rate parameters (an MMM(|$*$|)|$_{22*}$|) offers a further improvement in model fit compared to the standard nucleotide substitution models tested (8 |$<$| log BF |$<$| 45; we refer to Supplementary material available on Dryad for the log marginal likelihood estimates). 
This can be attributed to the fact that MMMs are able to accommodate differing base compositions throughout the tree topology, and consequently yield an accurate phylogenetic reconstruction of the bacterial relationships, with D. radiodurans and T. thermophilus clustering together (see Fig. 1) (Embley et al. 1993; Mooers and Holmes 2000).

Figure 1. a) Maximum clade credibility (MCC) phylogeny relating five bacterial 16S sequences; unlabeled nodes have |$>$|0.9999 posterior probability. Standard nucleotide substitution models that assume among-site rate variation (ASRV) erroneously cluster the two mesophiles together with high posterior probability (0.649 for HKY and 0.863 for GTR in the topology on the left). However, an MMM(HKY)|$_{22A}$| yields the correct clustering of the Deinococcus radiodurans and the Thermus thermophilus sequences with high posterior probability (topology on the right); each branch is annotated with the proportion of sites in each of the continuous-time Markov chain (CTMC) models, based on the maximum a posteriori (MAP) phylogeny. b) Number of CTMC model switches per alignment site based on the most probable hidden state realizations of the MMM on the MAP phylogeny; of the full alignment of 1304 sites, 761 sites are estimated not to switch between CTMC models. c) Mean posterior parameter estimates of the MMM show asymmetric switching between models (with circle sizes proportional to rate switching intensity) with pronounced differences in transition/transversion ratios and base frequencies.

The base frequency estimates for the CTMC models within the MMM reflect the presence of mesophilic sequences (low GC content; orange in Fig. 1) and thermophilic sequences (high GC content; blue in Fig. 1) in our data.
Despite the fact that only eight branches connect the observed sequences, alignment sites switch up to four times between CTMC models across the phylogeny, indicating evolutionary dynamics that cannot be accommodated using standard nucleotide substitution models. Over 40% of the alignment sites undergo at least one switch between CTMC models in a highly asymmetric manner (see Fig. 1). The two CTMC models are also characterized by pronounced differences in transition/transversion ratios. In conclusion, we show that appropriately modeling compositional heterogeneity for these eubacterial sequences enables inference of the correct phylogeny as well as base frequency compositions that reflect the presence of both mesophilic and thermophilic sequences in the data set.

Plant Plastid Genes

Figure 2. Phylogenetic reconstruction of plant plastid sequences, based on the psaB protein-coding gene; unlabeled nodes have |$>$|0.9999 posterior probability. Left: MCC tree based on a standard GTR+ASRV model. Right: MCC tree based on an MMM(GTR)|$_{33A}$| with ASRV, a model that is strongly supported over the GTR+ASRV model (log Bayes factor of 347). While only a single clustering differs within the Angiosperms, the MMM(GTR)|$_{33A}$| yields many differing clusters with very high posterior probabilities outside of the seed plants.

We consider nucleotide sequence data from the protein-coding genes of 23 completely sequenced plant plastid genomes, previously analyzed by Ané et al. (2005) to measure the independence of the substitution process between two groups of taxa as a means of detecting covarion evolution. Assuming a fixed underlying reference tree that represents the likely relationships of plant taxa for which complete chloroplast sequences were available at the time, the covarion test of Ané et al.
(2005) detected significant covarion evolution (|$P < 0.0005$|) in 14 of 57 genes analyzed across all positions. We here analyze the psaB gene with standard nucleotide substitution models and MMMs and compare the inferred phylogenies and model fit; we refer to Supplementary material available on Dryad for our analysis of the ndhD gene. A comparison of standard nucleotide substitution models reveals that the combination of a GTR model and an ASRV model, along with a relaxed clock assuming an underlying lognormal distribution, yields the highest (log) marginal likelihood for both data sets (psaB and ndhD). We conduct analyses with MMMs that feature an HKY or GTR substitution model with a single set of symmetric rate parameters along with two or three different sets of base frequencies (i.e., MMM(|$*$|)|$_{12*}$| and MMM(|$*$|)|$_{13*}$| models), as well as generalizations of these MMMs that feature as many different sets of rate parameters as sets of base frequencies, and both symmetric and asymmetric |$\mathbf{\Phi}$| (i.e., MMM(|$*$|)|$_{22*}$| and MMM(|$*$|)|$_{33*}$| models). For all of these models, we set all |$\rho_{i}$| in Equation 1.2 to 1 to ensure identifiability when using an ASRV model in combination with MMMs. We also analyze the data with a nucleotide covarion model (Tuffley and Steel 1998), which we can easily compose within our MMM framework through XML specification. The psaB data set strongly prefers the covarion-style model over a standard GTR+ASRV substitution model by a log Bayes factor of 208. The MMM(GTR)|$_{12A}$| and MMM(GTR)|$_{22A}$| yield log Bayes factors of 257 and 313, respectively, over the standard GTR+ASRV model. The MMM(GTR)|$_{13A}$| and MMM(GTR)|$_{33A}$| parameterizations yield still higher log Bayes factors of 321 and 347, respectively, over the GTR+ASRV model.
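Because every log Bayes factor above is reported against the same GTR+ASRV baseline, the log Bayes factor between any two of these models is the difference of their reported values, since the baseline log marginal likelihood cancels. The following Python sketch tabulates the psaB values quoted above; the dictionary keys and function names are illustrative labels, not identifiers from BEAST.

```python
from typing import Dict

# Log Bayes factors over the GTR+ASRV baseline, as quoted in the text for psaB.
LOG_BF_VS_GTR_ASRV: Dict[str, float] = {
    "GTR+ASRV": 0.0,
    "covarion": 208.0,
    "MMM(GTR)_12A": 257.0,
    "MMM(GTR)_22A": 313.0,
    "MMM(GTR)_13A": 321.0,
    "MMM(GTR)_33A": 347.0,
}


def log_bf(model_a: str, model_b: str,
           table: Dict[str, float] = LOG_BF_VS_GTR_ASRV) -> float:
    """Log Bayes factor of model_a over model_b.

    Both table entries are log BFs against one shared baseline, so the
    baseline's log marginal likelihood cancels in the difference.
    """
    return table[model_a] - table[model_b]


def best_model(table: Dict[str, float] = LOG_BF_VS_GTR_ASRV) -> str:
    """Model with the highest log Bayes factor against the baseline."""
    return max(table, key=lambda name: table[name])


if __name__ == "__main__":
    for a, b in [("MMM(GTR)_13A", "MMM(GTR)_12A"),
                 ("MMM(GTR)_33A", "MMM(GTR)_13A"),
                 ("MMM(GTR)_33A", "MMM(GTR)_22A")]:
        print(f"log BF {a} over {b}: {log_bf(a, b):.0f}")
    print("best:", best_model())
```

For example, `log_bf("MMM(GTR)_13A", "MMM(GTR)_12A")` evaluates to 64 and `log_bf("MMM(GTR)_33A", "MMM(GTR)_13A")` to 26.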
Because additional categories within the MMM offer diminishing returns in terms of model fit at the expense of additional computation time, we did not explore MMMs with even higher dimensions. Figure 2 shows the maximum clade credibility (MCC) trees obtained under the standard GTR+ASRV model and the MMM(GTR)|$_{33A}$| that generated the highest (log) marginal likelihood. While the clustering within the seed plants is identical under both models, substantial differences in posterior support can be observed for specific clades. In the remaining part of the tree, these models result in completely different clustering patterns with strong support for many clades under the MMM(GTR)|$_{33A}$| model. In Figure 3, we illustrate the complex substitution patterns across all sites on the MAP psaB phylogeny, using the most probable hidden state realizations of the MMM(GTR)|$_{33A}$|. We use a simple counting procedure to quantify the number of differences between the ancestral model states as a means to reconstruct which sites evolve according to which CTMC within the MMM(GTR)|$_{33A}$|, and we observe a relatively small amount of CTMC switching throughout the phylogeny (of note, we observe a 4.5-fold increase in number of sites switching between CTMCs in our analysis of the ndhD gene in Supplementary material available on Dryad). The reconstructed patterns go beyond mere codon position partitioning, as we observe different substitution dynamics per codon position. In particular, the third codon position is the only position that evolves according to a particular CTMC a majority of the time, and it also exhibits the greatest degree of switching between CTMC realizations. We depict the mean posterior instantaneous substitution rates of the various MMM components in Figure 3, showing a clearly asymmetric CTMC switching process and three distinct GTR model realizations within the MMM. 
This complex interplay of model components is consistent with the strong Bayes factor support of the MMM(GTR)|$_{33A}$| over all other models tested.

Figure 3. Markov-modulated model behavior on the psaB protein-coding gene phylogeny. a) Amount of time (branch lengths in genetic distance) spent in each CTMC model for each alignment site based on the most probable hidden state realizations of the MMM on the maximum a posteriori phylogeny. b) Summary of the number of sites that evolve according to each CTMC, illustrating complex substitution patterns that go beyond codon position partitioning, as well as 2.8% of sites switching between CTMC realizations. c) Distribution of sites in each codon position across the different CTMC model realizations, showing that first and second codon positions switch far less frequently between CTMC models than the third codon position, in which the substitutions occur according to a clearly predominant CTMC. d) Switching behavior of the MMM between the three CTMC models, with the mean instantaneous substitution rates shown for those models (with circle sizes proportional to rate intensity).

Conclusion

MMMs can infer substantially different phylogenies compared to standard nucleotide substitution models, and they can be associated with significant increases in model fit. A targeted simulation study, which assesses the ability of MMMs to retrieve the generative models of simulated sequence alignments and quantifies their increase in model fit when the MMM was the generative model, shows that these large differences are not artifacts of using such high-dimensional models (see Supplementary material available on Dryad).
Our simulation study also shows differences in model fit similar to those obtained in this section for the psaB and ndhD genes, as well as the ability of state-of-the-art Bayesian model selection to select the generative substitution model even when compared with similar model parameterizations. Importantly, when data are simulated under a standard GTR model, MMMs exhibit a worse model fit than the generative GTR model. These analyses of simulated data show that MMMs can easily be used in combination with recent developments in Bayesian model selection (Baele et al. 2016) and provide additional support for our conclusions that these models can yield substantial increases in model fit over standard nucleotide substitution models.

We note that each additional CTMC within an MMM (significantly) increases computational demands, and that a search for the optimal MMM may therefore prove time-consuming for complex large data sets. Avoiding direct evaluation of the finite-time transition probabilities through emerging algorithms that instead manipulate the matrix exponential action (Ji et al. 2016) represents a possible workaround. In the meantime, BEAST can exploit the BEAGLE library (Ayres et al. 2019) to make such computations manageable by offloading the large matrix multiplications onto powerful multi-core hardware. In particular, the use of graphics cards for scientific computing yields significant performance gains over standard multi-core processors (see Supplementary material available on Dryad), rendering phylogenetic inference under these MMMs feasible despite their complexity.

Finally, it remains important to recognize that phylogenetic substitution models draw inspiration from biology and biochemistry, but do not capture the full complexity of these underlying processes.
MMMs offer a substantial increase in model complexity over traditional substitution models but, like most other substitution models, also make simplifying assumptions, for example, regarding site-independent evolution, as there is no mechanism within an MMM in which changes in one site result in concomitant changes in another. Resulting model misspecification (and potential overparameterization) can mislead model-based tree reconstruction methods (Steel 2005). To guard against such situations, a well-developed statistical theory such as Bayesian model testing should be employed to compare models in an objective manner and choose a model that carefully balances the model’s parameterization with the available information in the data. After all, as Steel (2005) sagely states, the aim of model selection is not to find the “true model” but to find a model with sufficient parameters to capture the key features of the data. Additionally, we have made available an online tutorial on how to construct XML files to perform phylogenetic inference using Markov-modulated models in BEAST: http://beast.community/markov_modulated.html.

Supplementary Material

Data available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.230s5h0.

Acknowledgments

We would like to thank the editors, Bryan Carstens and David Bryant, as well as three anonymous reviewers for their constructive comments that helped improve this article. We are grateful to Cécile Ané for kindly providing the plant plastid genome data sets. We gratefully acknowledge support from NVIDIA Corporation with the donation of parallel computing resources used for this research.

Funding

This work was supported by the Interne Fondsen KU Leuven/Internal Funds KU Leuven under grant agreement C14/18/094, and by the Research Foundation – Flanders [“Fonds voor Wetenschappelijk Onderzoek – Vlaanderen”, G0E1420N to G.B.].
The research leading to these results has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 725422-ReservoirDOCS); the Research Foundation – Flanders [“Fonds voor Wetenschappelijk Onderzoek – Vlaanderen,” 12Q5619N and V434319N to P.B.]; the Research Foundation – Flanders [“Fonds voor Wetenschappelijk Onderzoek – Vlaanderen,” G066215N, G0D5117N, and G0B9317N to P.L.]; National Science Foundation [DMS 1264153] and National Institutes of Health [R01 AI107034 and U19 AI135995], in part to M.A.S. The Artic Network receives funding from the Wellcome Trust through project 206298/Z/17/Z.

References

Ané C., Burleigh J.G., McMahon M.M., Sanderson M.J. 2005. Covarion structure in plastid genome evolution: a new statistical test. Mol. Biol. Evol. 22:914–924.
Ayres D.L., Cummings M.P., Baele G., Darling A.E., Lewis P.O., Swofford D.L., Huelsenbeck J.P., Lemey P., Rambaut A., Suchard M.A. 2019. BEAGLE 3: improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. Syst. Biol. 68:1052–1061.
Baele G., Lemey P., Suchard M.A. 2016. Genealogical working distributions for Bayesian model testing with phylogenetic uncertainty. Syst. Biol. 65:250–264.
Blanquart S., Lartillot N. 2006. A Bayesian compound stochastic process for modeling nonstationary and nonhomogeneous sequence evolution. Mol. Biol. Evol. 23:2058–2071.
Embley T.M., Thomas R.H., Williams R.A.D. 1993. Reduced thermophilic bias in the 16S rDNA sequence from Thermus ruber provides further support for a relationship between Thermus and Deinococcus. Syst. Appl. Microbiol. 16:25–29.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.
Fischer W., Meier-Hellstern K. 1993. The Markov-modulated Poisson process (MMPP) cookbook. Perform. Evaluation 18:149–171.
Gascuel O., Guindon S. 2007. Modelling the variability of evolutionary processes. Reconstruct. Evol. 2:65–99.
Guindon S., Rodrigo A.G., Dyer K.A., Huelsenbeck J.P. 2004. Modeling the site-specific variation of selection patterns along lineages. Proc. Natl. Acad. Sci. USA 101:12957–12962.
Ji X., Griffing A., Thorne J.L. 2016. A phylogenetic approach finds abundant interlocus gene conversion in yeast. Mol. Biol. Evol. 33:2469–2476.
Mooers A.O., Holmes E.C. 2000. The evolution of base composition and phylogenetic inference. Trends Ecol. Evol. 15:365–369.
Murray R.G.E. 1992. The family Deinococcaceae. In: Balows A., Trüper H.G., Dworkin M., Harder W., Schleifer K.-H., editors. The prokaryotes: a handbook on the biology of bacteria: ecophysiology, isolation, identification, applications, Vol. 4. New York: Springer. p. 3732–3744.
Pagel M., Meade A. 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. 53:571–581.
Pan V.Y., Chen Z.Q. 1999. The complexity of the matrix eigenproblem. Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing (STOC ’99). New York, NY, USA: ACM. p. 507–516.
Penny D., McComish B.J., Charleston M.A., Hendy M.D. 2001. Mathematical elegance with biochemical realism: the covarion model of molecular evolution. J. Mol. Evol. 53:711–753.
Singer C.E., Ames B.N. 1970. Sunlight ultraviolet and bacterial DNA base ratios. Science 170:822–826.
Steel M. 2005. Should phylogenetic models be trying to ‘fit an elephant’? Trends Genet. 21:307–309.
Suchard M.A., Lemey P., Baele G., Ayres D.L., Drummond A.J., Rambaut A. 2018. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 4:vey016.
Suchard M.A., Rambaut A. 2009. Many-core algorithms for statistical phylogenetics. Bioinformatics 25:1370–1376.
Tuffley C., Steel M. 1998. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147:63–91.
Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306–314.
Yang Z. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367–372.

© The Author(s) 2020. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.