Differences in Performance among Test Statistics for Assessing Phylogenomic Model Adequacy

Differences in Performance among Test Statistics for Assessing Phylogenomic Model Adequacy Statistical phylogenetic analyses of genomic data depend on models of nucleotide or amino acid substitution. The ade- quacy of these substitution models can be assessed using a number of test statistics, allowing the model to be rejected when it is found to provide a poor description of the evolutionary process. A potentially valuable use of model-adequacy test statistics is to identify when data sets are likely to produce unreliable phylogenetic estimates, but their differences in performance are rarely explored. We performed a comprehensive simulation study to identify test statistics that are sensitive to some of the most commonly cited sources of phylogenetic estimation error. Our results show that, for many test statistics, traditional thresholds for assessing model adequacy can fail to reject the model when the phylogenetic inferences are inaccurate and imprecise. This is particularly problematic when analysing loci that have few informative sites. We propose new thresholds for assessing substitution model adequacy and demonstrate their effectiveness in analyses of three phylogenomic data sets. These thresholds lead to frequent rejection of the model for loci that yield topological inferences that are imprecise and are likely to be inaccurate. We also propose the use of a summary statistic that provides a practical assessment of overall model adequacy. Our approach offers a promising means of enhancing model choice in genome-scale data sets, potentially leading to improvements in the reliability of phylogenomic inference. Key words: model adequacy, substitution model, maximum likelihood, test statistics, birds, Laurasiatherian mammals, turtles. Introduction et al. 2014) or by filtering the data according to appropriate In recent years, phylogenomic analyses have provided inter- criteria (Liu et al. 2015). Both of these measures can help to esting insights into the evolution of various taxonomic groups enhance the phylogenetic signal in the available data. (e.g., Meredith et al. 2011; Timme et al. 2012; Misofetal. One potential method of improving evolutionary 2014). However, the evolutionary relationships in some Modelling in phylogenetics is to perform an absolute assess- groups of organisms have remained difficult to resolve even ment of the adequacy or plausibility of the substitution model with large amounts of data, and different studies have some- for each locus in the data set (Doyle et al. 2015). This is dif- times yielded conflicting phylogenetic estimates (e.g., Jarvis ferent from the comparisons of relative model fit that are et al. 2014; Prum et al. 2015). The sources of conflict among routinely performed in phylogenetics (Posada and Crandall data sets are likely to include model misspecification, incon- 2001), and involves testing whether the chosen substitution gruence among gene trees, and the impacts of positive selec- model provides an accurate description of the evolutionary tion (Springer and Gatesy 2016; Reddy et al. 2017; Shen et al. process that generated the data (Goldman 1993). Although 2017). Unless data from additional loci can be readily the available substitution models are sometimes found to be obtained, the best means of improving the reliability of the good descriptions of the molecular evolutionary process inferred species tree is to enhance the Modelling of the evo- (Ripplinger and Sullivan 2010), they are unlikely to be ade- lutionary process that produced the data at hand. This can be quate for all of the loci in genome-scale data sets (Goldman done either by improving the methods of inference (e.g., 1993; Doyle et al. 2015; Duchene et al. 2016). Methods of Galtier 2001; Foster 2004; Lartillot et al. 2007; Jayaswal assessing model adequacy could, in principle, identify the The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non- commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com Genome Biol. Evol. 10(6):1375–1388. doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1375 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 ^ Duchene et al. GBE portion of the data set for which the model is realistic. This The procedure of comparing the empirical and predictive information couldthenbe usedtoreject a givenmodelor to data requires the choice of a test statistic, a metric that exclude the subset of the data that is not well described by the describes some aspect of the data set or of the inferences available models (Doyle et al. 2015), leading to enhancement drawn from the data set. If empirical data are similar to, or of the phylogenetic signal in the data. produce similar estimates to, the predictive distribution, then Methods of assessing model adequacy by comparing em- the model can be considered adequate. Test statistics that can pirical data to those simulated under the model can test be calculated directly from the data are known as data-based whether some aspects of the model are realistic, but they statistics, whereas those that describe the inferences drawn might not necessarily indicate when estimates of parame- from the data are known as inference-based statistics. Some ters of interest are unreliable. This is because sequence data test statistics require calculations from the data as well as might meet the particular assumption being assessed but inferences from the data, such that they are hybrid test sta- still violate other assumptions that are critical for estimating tistics. Critically, test statistics can differ in their sensitivity to parameters of interest (e.g., Ho and Jermiin 2004; Brown biased inferences. Identifying which test statistics are sensitive 2014; Doyle et al. 2015). Data might also fail to meet a to biased phylogenetic inferences is fundamental before we particular assumption of the model, yet this might not can develop a practical framework of model assessment. have a negative impact on the estimates of parameters of The extent to which empirical data must be similar to the interest (Lemmon and Moriarty 2004; Brown 2014; predictive distribution for inferences to be inaccurate remains Duchene et al. 2017). For example, estimates of tree topol- poorly understood. The traditional approach to determining ogy and branch lengths can be accurate even when there model adequacy is to reject the model if the test statistic for are substantial departures from stationary base composition the original data falls above 95% or 99% of the values from (Duchene et al. 2017). To ensure the usefulness of methods simulated data. But assessment using these thresholds can for model assessment, we must evaluate whether measures lead to the model being rejected even when inferences are of model adequacy can predict the accuracy and precision not necessarily unreliable, or the failure to reject a model that of the inferences made using these models. Similarly, leads to biased inferences (Brown 2014; Duchene et al. 2017). approaches that can summarize tests from multiple meth- In this study, we aim to characterize the performance of a ods, and across hundreds of loci, are needed to make as- wide range of test statistics for model assessment in phyloge- sessment of model adequacy practical in the genomic age. nomic data sets, and explore a method for summarizing their Methods to assess model adequacy are based on com- results. Using simulations under a range of conditions known paring the empirical data with a distribution of data gener- to occur in empirical data, we characterize the ability of nine ated under the candidate model, also known as the test statistics to detect poor performance of the substitution predictive distribution. The data generated under the can- model. We define performance as the accuracy and precision didate model are also known as predictive data and can be of estimates of tree topology and branch lengths, which are generated using the maximum-likelihood estimates of the often the parameters of interest in phylogenomic studies. We model parameters (Goldman 1993). Alternatively, parame- recommend using an efficient maximum-likelihood frame- ter estimates can be taken from samples from the posterior work for model assessment that makes assessment feasible distribution in a Bayesian analysis, such that predictive data for genome-scale data sets, and focusing on the test statistics account for the uncertainty in estimates of model parame- that perform well in our simulation study. Based on our sim- ters (Bollback 2002). In a Bayesian framework, simulations ulation study, we also propose thresholds that are meaningful performed using the model are also known as posterior for identifying misleading phylogenetic inferences. We also predictive simulations. The benefit of accounting for uncer- describe the performance of a test statistic that summarizes tainty in parameter estimates comes at the cost of compu- multiple tests, which can be used to provide an overall assess- tational demand. In this study, we focus on methods to ment of model adequacy. We demonstrate our approach and assess model adequacy for genome-scale data sets, so we explore the relationship between substitution model ade- use a maximum-likelihood method of assessment. Several quacy and performance in phylogenomic data sets from tur- types of models in phylogenetics can be assessed using pre- tles, birds, and Laurasiatherian mammals. dictive simulations, including substitution models (Goldman 1993; Bollback 2002; Foster 2004; Brown 2014), large hi- Materials and Methods erarchical models (Reid et al. 2014; Duchene et al. 2015), Fast Assessment of Model Adequacy models of trait evolution (Rabosky and Glor 2010; Slater and Pennell 2014), and models of diversification rates Framework for Model Assessment through time (Ho ¨ hna et al. 2016). Predictive approaches In a basic assessment of substitution model adequacy, phylo- can also be used for relative model comparison (Lewis genetic analysis of the empirical data is performed using the et al. 2014), providing an alternative to commonly used model that is to be assessed. Then, a large number of data information criteria or Bayes factors. 1376 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Phylogenomic model adequacy test statistics GBE sets of the same size as the empirical data set are generated have previously been proposed in the context of substitution by simulating sequence evolution under the chosen model, model adequacy (table 1). The selected statistics consider sev- using the maximum-likelihood estimates of the model param- eral aspects of the model and potentially provide an overall eters. A chosen test statistic is calculated for each of these examination of adequacy. These test statistics include the X simulated data sets, thereby producing a distribution of values statistic for assessing stationarity of base composition (Foster derived from the model (Goldman 1993; Foster 2004). The 2004); multinomial and d statistics for assessing overall model test statistic from the empirical data can be compared against fit (Goldman 1993; Bollback 2002); biochemical diversity for this predictive distribution. assessing the diversity in base composition across sites An ideal test statistic is able to describe the differences (Lartillot et al. 2007); and consistency index for assessing the between the empirical data and the simulated data, especially consistency of phylogenetic information in the data when with regard to inferences of interest (such as the tree topology compared with the most parsimonious possible scenario and branch lengths). Therefore, selecting appropriate test sta- (Kluge and Farris 1969). We also use three test statistics based tistics is critical to assessment of model adequacy. Here, we on inferences from the data, including mean branch support, describe our framework for fast assessment of model ade- 95% confidence interval in branch support,and sum of quacy using test statistics that can detect instances when phy- branch lengths (Brown 2014). In addition, we use the logenetic inferences are inaccurate or imprecise. We assume Mahalanobis statistic, which has been used previously for that we have an empirical data set comprising a large number assessing phylodynamic models, and provides a summary of of unlinked loci from the taxa of interest. a chosen set of test statistics (O’Hagan 2003; Drummond and In our framework, we obtain maximum-likelihood esti- Suchard 2008). This approach treats groups of test statistics as mates of model parameters and the phylogeny for each a multivariate distribution, to which the empirical data are gene alignment using the software PhyML 3.0 (Guindon compared using the Mahalanobis distance (see table 1). We et al. 2010). Branch support is estimated using a highly have implemented our approach and the test statistics out- efficient, nonparametric measure of branch support with lined here in the software PhyloMAd (github.com/duchene/ behaviour similar to the nonparametric bootstrap (Guindon phylomad). et al. 2010; Anisimova et al. 2011). We take advantage of the speed of this method for assessing the performance of Validation of Test Statistics Using a Simulation Study test statistics that are based on inferences from the data Simulation Scenarios (Brown 2014). For each locus alignment, we generate 100 data sets by To understand the performance of test statistics for assess- simulating sequence evolution using the parameter estimates ing model adequacy, we simulated the evolution of nucle- from the empirical data. These simulated data sets have the otide sequences in a number of scenarios that involve same number of taxa and sites as the empirical data. Since misspecified models. Under these conditions, the accuracy these simulated data are derived from the model, they can be and precision of phylogenetic inference might be adversely considered a null distribution (Goldman 1993). To assess affected. We performed a range of simulations in six sce- model adequacy, we calculate a number of test statistics for narios involving realistic variation across loci (fig. 1). Our the empirical data set and for each simulated data set. It is study does not include other scenarios that have the poten- common to consider the model to be inadequate when the tial to create difficulties for phylogenetic inference, such as test statistic calculated from the empirical data falls outside tree imbalance or extreme model parameter values. Instead, the central 95 or 99 percentile range of test statistics calcu- we focus on scenarios that lead to model misspecification. lated from the simulated data. However, these thresholds do Our simulations were performed on fully symmetric trees not necessarily reflect the points at which inferences become with 32 tips. To derive adverse simulation scenarios, we inaccurate, so in this study we do not use the P-values to test began with a baseline phylogenetic tree with branch model adequacy. Instead, we used a measure of effect size lengths drawn from an exponential distribution with rate based on the number of standard deviations of the predictive of 5, such that the mean distance from the root to each distribution (SDPD) between the mean and the statistic calcu- tip is 1 substitution per site. We simulated the evolution of lated from the original data (Brown 2014; Duchene et al. sequences along these tree to produce data sets with 200, 2017). 1,000, or 5,000 nucleotide sites. Simulations were performed under a model in which the rates of each of the six substitution types were different. This Test Statistics Considered model is the GTR þ C (Tavare 1986; Yang 1993) in scenarios Almost any variable or model parameter that can be esti- where base composition is stationary and sites do not contain mated from the data can be used as a test statistic in the covarion-like patterns of substitution. Simulations were done approach described here. Instead of carrying out an exhaus- using the R package PHANGORN (Schliep 2011), except for tive examination of possible statistics, we consider nine that those with nonstationary base composition, which were done Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1377 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 ^ Duchene et al. GBE 1378 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Table 1 Details of Nine Test Statistics Used to Assess the Adequacy of Nucleotide Substitution Models Test Statistic Calculation Model Component Assessed Type of Statistic Reference X Tree- and model-based chi-squared statistic of base frequencies across taxa. It is calcu- Stationarity of base Data-based Foster (2004) lated using matrices of base composition with a row for each taxon and a column for composition each nucleotide. One matrix corresponds to the values expected under the tree and model of base composition, and the other corresponds to the observed base compo- sition. The statistic is calculated using the following formula: X ¼ R½ðobs  exp Þ =exp m m m where at each cell obs is the observed base frequency and exp is the expected frequency under the tree and phylogenetic model. Multinomial (or Product of the unique site frequencies (L ), calculated as: Overall fit Data-based Goldman (1993a), unconstrained) Bollback (2002) likelihood L ¼ ðN =NÞ m ‘ ‘2B where ‘ is a site pattern in the set b, N is the number of occurrences of pattern ‘,and N is the total number of sites in the data. d Likelihood of the data using the unconstrained model (L ), minus the likelihood under Overall fit Data-inference hybrid Goldman (1993a) the evolutionary model Biochemical Calculated as the number of different bases occurring at each site, and the mean value Diversity in base composition Data-based Lartillot et al. diversity taken across the alignment across sites (2007) Consistency Minimum possible number of substitutions in the data divided by the minimum number Consistency of phylogenetic Data-inference hybrid Kluge and Farris index required to describe a given tree using parsimony. It has been considered as a measure information in the data (1969) of homoplasy, and is expected to take a value of 1 in the absence of homoplasy compared with the most parsimonious scenario Branch support Mean of branch-support values across the maximum-likelihood tree. Branch support can Overall fit Inference-based Brown (2014b) be calculated in a variety of ways, including nonparametric bootstrap or approximate likelihood-ratio test (Anisimova and Gascuel 2006). 95% CI in 95% range in branch-support values across the maximum-likelihood tree Overall fit Inference-based Brown (2014b) branch-support statistic Phylogenomic model adequacy test statistics GBE Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1379 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Tree length Sum of the branch lengths in the maximum-likelihood tree Overall fit Inference-based Brown (2014b) Mahalanobis A test of model adequacy can be placed in a multivariate setting that simultaneously Summary assessment from Based on other test Mahalanobis distance considers multiple test statistics by using Mahalanobis distances. The aim of this ap- multiple test statistics; statistics (1936); O’Hagan proach is to estimate a distance between the empirical test statistics and the multi- Overall fit (2003); variate predictive distribution from several test statistics. Individual test statistics are Drummond and first standardized so that they appear on the same scale. Then it is possible to calculate Suchard (2008) the mean (m) and variance covariance matrix (V) of the multivariate distribution: m ¼ Tðrep Þ P p P¼1 hihi V ¼ Tðrep Þ m ^  Tðrep Þ m ^ p p P1 P¼1 where P is the number of predictive simulations, T is the multivariate distribution of test statistics, and rep is each of the predictive data sets. The empirical and replicate data sets each has a distance from the multivariate distribution that can be defined as the following: MxðÞ ¼ðÞ x  m V ðÞ x  m For assessing substitution model adequacy, we used two combinations of test statistics to define the multivariate distribution. One included the other eight test statistics con- sidered in this study, and the other included the four statistics that were the most sensitive to biased phylogenetic inferences. ^ Duchene et al. GBE a. Substitution model parameterization Model-adequate Over-parameterized Under-parameterized Simulated with Analysed with Simulated with Analysed with Simulated with Analysed with GTR+ GTR+ JC GTR+ GTR+ JC AC AC AC AC AC AC GT GT GT GT GT GT b. Composition heterogeneous model One half of taxa with One half of taxa with One eighth of taxa with One eighth of taxa with moderate change strong change moderate change strong change c. Covarion-like rate variation model d. Long terminal branches Moderate Strong ×10 ×25 variation variation e. Covarion-like model with long terminal branches Moderate variation with Strong variation with Moderate variation with Strong variation with terminal branch lengths ×10 terminal branch lengths ×10 terminal branch lengths ×25 terminal branch lengths ×25 f. Number of nucleotides in each locus 500 1000 5000 FIG.1.—The six characteristics that were varied in simulations of sequence evolution to investigate the performance and adequacy of the candidate substitution model (GTRþC): (a) substitution model parameterization; (b) compositional heterogeneity; (c) covarion-like rate variation; (d) terminal branch lengths; (e) covarion-like rate variation and terminal branch lengths; and (f) sequence length for each locus. One hundred replicates were performed under each scenario from (a)to (e), under each of the sequence lengths shown in (f). Colors in (a) indicate different rate parameters, whereas in (b) they indicate the magnitude and proportion of taxa undergoing a change in base composition. Branch thickness corresponds to evolutionary rate in (c), (d), and (e). using the software p4 (Foster 2004). The R matrix used for We first performed phylogenetic and model-adequacy simulation had parameters set to 1.3472, 4.8145, 0.9304, analyses using substitution models that were overparameter- 1.2491, 5.5587, and 1.0000, based on a previous study of ized, underparameterized, and adequate (fig. 1a). This placental mammals (Murphy et al. 2001). We performed an was done by evaluating three scenarios that represented dif- additional set of simulations under the Jukes–Cantor model ferent combinations of simulation model and estimation (Jukes and Cantor 1969). For each simulation scenario, we model: sequences that evolved under the JC model and performed 100 replicates for each locus length. were analysed using the GTR þ C model (overparameterized 1380 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Phylogenomic model adequacy test statistics GBE model); sequences that evolved under the GTR þ C model To increase the sources of error in the data, we introduced and were analysed using the JC model (underparameterized simulation scenarios in which the lengths of the terminal model); and sequences that evolved under the GTR þ C branches were 10 and 25 times those in the original simula- model and were analysed using the GTR þ C model (ade- tions (fig. 1d). These simulations led to data that resemble quate model). those across highly divergent taxa, such as those used to ex- We simulated other processes that can occur in empirical amine the relationships among metazoan taxa (e.g., Pisani data and which can cause poor accuracy and precision in et al. 2015). To select the parameters for these simulations, phylogenetic inference. Heterogeneity in base frequencies we took the original simulations involving a root-to-tip dis- across lineages, which violates the assumption of composi- tance of 1 substitution per site to represent a period of 50 My. tional stationarity, can lead to the spurious joining of taxa with Under this assumption, the simulation scenarios with longer convergent base frequencies (Lockhart et al. 1992; Jermiin terminal branches reflect data sets across timescales of 1.25 et al. 2004). We performed simulations that included two and 2.5 billion years, resembling the scenarios simulated by convergent changes in base composition across the tree, Penny et al. (2001). We also performed simulations on trees with the maximum topological distance from each other with long terminal branches for each of the three covarion (fig. 1b). We varied the proportion of tips that underwent a scenarios (fig. 1e), which also resembles previous investiga- change in base composition between none, one-eighth, and tions of these conditions (Penny et al. 2001). half of the total number of tips. We also varied the magnitude of changes in base composition across two scenarios: 1) ap- Analyses of Simulated Data proximately the maximum difference in base composition ob- served across taxa in an avian phylogenomic data set (Prum For each simulated data set, phylogenetic inference was per- et al. 2015; from {0.25, 0.25, 0.25, 0.25} to {0.05, 0.45, 0.45, formed using a common candidate substitution model 0.05}); and 2) extreme differences in base composition (from (GTR þ C) using the software PhyML (Guindon et al. 2010), {0.25, 0.25, 0.25, 0.25} to {0.01, 0.49, 0.49, 0.01}). The small- and then model adequacy was assessed as described above. est base frequency in the first scenario was set to approxi- We used four metrics to describe the accuracy and precision mately the smallest base frequency observed in avian of our analyses of simulated data. The lengths of branches in phylogenomic data. These parameters allowed data sets to estimated trees were summed, and then the sum was sub- have a high realistic probability of leading to biased phyloge- tracted from the sum of branch lengths of simulated trees. netic inferences. The smallest base frequency in the second This value was then divided by the sum of simulated branch scenario was set to a small but nonzero value of 0.01, in order lengths to describe the inaccuracy in estimates. We made the to maintain the presence of all four bases in the data. same calculation using the stemminess of each tree, which is Simulations of sequence evolution were also done under a the proportion of the inferred tree length represented by in- covarion-like process (Galtier 2001), where substitution rates ternal branches (Fiala and Sokal 1985). Stemminess was used vary across sites and across lineages (Fitch and Markowitz to summarize the bias in inferences across branch lengths 1970; Fitch 1971). This scenario has been found to cause within each tree estimate. As a measure of error in topological poor estimates of phylogenetic tree lengths, but can benefit inference, we calculated the unweighted Robinson–Foulds topological inference by causing the overestimation of inter- distance between the estimated and simulated trees nal branch lengths (Penny et al. 2001; Phillips 2009). For this (Robinson and Foulds 1981; Penny and Hendy 1985). Lastly, reason, it can be considered an unusual form of model mis- we used the support for nodes in the estimated tree as a specification that is likely to be difficult to detect, yet might measure of precision in the inferred topology, calculated using mislead the inferences. Under a covarion-like process (Galtier the Shimodaira–Hasegawa approximate likelihood-ratio test 2001), relative rates across lineages are described by a discre- (aLRT) nonparametric measure of branch support (Guindon tized gamma distribution. The data share the categories of the et al. 2010; Anisimova et al. 2011). distribution, but each site has its own realization of rates For each simulated data set, we assessed model adequacy across lineages (Galtier 2001). In this way, sites might share using each of the nine test statistics considered in our study. In the rate along a given branch, but might not share that rate statistics, sensitivity is generally defined as the ability of a test along other branches. We simulated this covarion-like process to correctly identify instances in which a result is significant. using a gamma distribution with an a parameter of 1. To vary Accordingly, we consider test statistics to be consistently sen- the strength of variation across sites, rates across lineages sitive to particular simulation conditions if the distribution of were drawn from distributions with 1 (no covarion), 5, or values from replicate simulations does not overlap with the 10 categories across sites, allowing for an increasing number mean of the values from the predictive distributions. For test of extreme rates (fig. 1c). The consequence of using different statistics that were sensitive to inferences with poor accuracy numbers of categories is that some sites will have more ex- and precision, we aimed to determine meaningful thresholds treme (low and high) rates, although the rate distribution and for model assessment. This is because existing tests of model mean rate remain unchanged. adequacy are generally conservative, frequently rejecting the Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1381 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 ^ Duchene et al. GBE model when inferences from the data are unlikely to be mis- loci that lead to well-supported topologies can have some of leading (e.g., Duchene et al. 2017). the largest numbers of variable sites, and therefore can be the We identified thresholds with a tendency to be lenient, loci that show the greatest differences from predictive data with low risk of rejecting the model when the inferences (i.e., have the largest Mahalanobis distance). are not misleading (low Type I error rate) at the expense of To investigate whether information content was associated sometimes failing to reject the model when inferences are with distance from the model or with tree space, we used misleading (moderate Type II error rate). This approach max- Spearman’s q to test whether the Mahalanobis distance was imizes the usage of phylogenomic data, because data can still correlated with mean node support, tree length, number of contain useful information even when the model is rejected variable sites, and the MDS dimensions. We also used the by a given test statistic. MDS visualization to assess the performance of thresholds The interpretation of test statistics can be sensitive to se- for assessing model adequacy. For each sensitive test statistic, quence length (Duchene et al. 2017), so our thresholds take we investigated the power of our threshold to reject the this into account. We determined the threshold for model model in regions of tree space that are the most likely to assessment for each sensitive test statistic. Thresholds were contain misleading inferences. defined as a fitted function between the sequence lengths used for simulation and the median test statistic under a sim- ulation scenario to which the statistic is most sensitive. Results Accuracy and Precision under Simulation Conditions Analyses of Empirical Data Phylogenetic inferences are accurate and precise when there is a match between the substitution models used for simula- The framework that we have used here for model assessment tion and analysis (fig. 2). When the model used for analysis is is highly computationally efficient, and therefore well suited overparameterized, the accuracy and precision of phyloge- for examining genome-scale data sets. We analysed three netic estimates is similar to those obtained when the data phylogenomic data sets to investigate the impact of assessing are analysed using the correct model (supplementary figs. substitution model adequacy on the inferred tree topology. S1–S3, Supplementary Material online). When the model These data sets comprised 2,363 loci from 63 turtle taxa used for analysis is underparameterized, tree length is consis- (Crawford et al. 2015), 222 loci from 200 bird taxa (Prum tently underestimated (fig. 2a), terminal branches have a dis- et al. 2015), and 96 loci from 15 Laurasiatherian mammal proportionate contribution to the tree length (fig. 2b), and the taxa (Zhou et al. 2012). In order to allow calculation of the topology is estimated with higher error than when the correct multinomial likelihood, sites with gaps or unknown nucleoti- model is used (fig. 2c). Using an underparameterized model des were excluded. For each of the three data sets, we per- also leads to estimates with greater branch support (higher formed maximum-likelihood analyses using a GTR þ C model precision) than when the model is adequate (fig. 2d), in agree- of nucleotide substitution, and assessed model adequacy us- ment with the bias-variance tradeoff found in statistical mod- ing the nine test statistics that we investigated in our simula- els (Burnham and Anderson 2002; Lemmon and Moriarty tion study (table 1). 2004; Wertheim et al. 2010; Liu et al. 2015). We described the relative distances between gene trees by As simulated in this study, compositional heterogeneity approximating these distances in two-dimensional space. This leads to a mild tendency to overestimate tree length, to ter- approach has been shown to provide an accurate description ^ minal branches having a greater contribution to the tree of the differences in phylogenetic signal across loci (Duchene length, and to greater error in the estimate of the tree topol- et al. 2018). This representation of tree space was made using ogy than when the correct model is used (fig. 2). Branch multidimensional scaling (MDS), based on unweighted lengths can be overestimated under conditions of composi- Robinson–Foulds distances (Robinson and Foulds 1981; tional heterogeneity because a change in base composition Penny and Hendy 1985) for describing the distances between can inflate the inferred numbers of the types of substitutions trees in Euclidean space (Hillis et al. 2005; Matsen 2006; involved in that change (Ho and Jermiin 2004; Duch^ ene et al. Ho ¨ hna and Drummond 2012). MDS finds the Euclidean posi- 2017). tions of gene trees that minimize the sum of the distances Analyses of data generated under a covarion-like process between them (Mardia et al. 1979). lead to underestimates of tree length (fig. 2a), but otherwise We used the MDS visualization to explore an association produce estimates with similar accuracy and precision to between tree space and the Mahalanobis distance, mean those in which the correct model was used. The lowest accu- node support, tree length, and number of variable sites. racy and precision in phylogenetic inferences was found in This association might occur, for example, if loci that yield simulation scenarios that involved long terminal branches trees with highly supported nodes and that have high infor- (fig. 2c and d). In these cases, tree length was severely under- mation content lead to congruent estimates of topology and estimated and terminal branches had a small contribution to have high model adequacy (Doyle et al. 2015). Alternatively, 1382 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Phylogenomic model adequacy test statistics GBE (a)(b) (c)(d) FIG.2.—The performance of phylogenetic inference using the GTRþC substitution model in simulations with 5,000 nucleotides under six represen- tative simulation conditions (for results from every simulation scenario, see supplementary figs. S1–S3, Supplementary Material online). Each box represents the results of 100 replicate analyses. Performance is described by (a) the length of the estimated tree minus that of the simulated tree, divided by that of the simulated tree, (b) the difference in stemminess, defined as the proportion of the inferred tree length represented by internal branches, (c) the unweighted Robinson–Foulds topological distance between estimated and simulated trees, and (d) the mean node support in the estimated tree, which is a measure of precision in estimates. tree length (fig. 2a and b). The effects of the covarion process consistency index (fig. 3b). However, the d statistic has negli- combined with long terminal branches have been studied gible sensitivity compared with the other three test statistics, previously, with similar outcomes (Galtier 2001; Penny et al. with SDPD values lower than 0.5 in data sets with 1,000 2001). We also find that inferences from short sequence nucleotides. The same four test statistics, along with X ,are alignments have greater variance in accuracy and precision consistently sensitive to compositional heterogeneity. The X (supplementary figs. S2 and S3, Supplementary Material statistic is overwhelmingly the most sensitive statistic to this online). scenario, dwarfing the signal of the other statistics (fig. 3c). Theseresults suggestthatmostteststatistics are lenientunder conditions of compositional heterogeneity. However, the X Sensitivity of Model-Adequacy Test Statistics statistic can be highly conservative, because some of the infer- None of the test statistics is consistently sensitive to scenarios ences under conditions of compositional heterogeneity have in which the model is adequate (fig. 3a) or to model over- similar accuracy and precision to those obtained when using parameterization (supplementary figs. S4b–S6b, the correct model. Supplementary Material online). This is a desirable property The biochemical diversity and consistency index statistics and is consistent with the results of previous research (Brown are consistently sensitive to the covarion-like process, in par- 2014; Duchene et al. 2017). Four test statistics are consistently ticular when the simulated sequences had a length of 5,000 sensitive to model underparameterization, including the mul- nucleotides (fig. 3d and f; see supplementary figs. S5i and S6i, tinomial likelihood, d statistic, biochemical diversity,and the Supplementary Material online). Only the biochemical Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1383 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Topological distance between Difference in tree length estimated and simulated trees (estimated minus simulated) 0 1020 304050 −0.6 −0.4 −0.2 0.0 0.2 GTR+ GTR+ - Analysed using JC Compositional heterogeneity Covarion-like Long terminal branches Covarion and long terminal branches Mean node support of Difference in stemminess estimated tree (estimated minus simulated) 0.65 0.75 0.85 0.95 −0.2 0.0 0.2 0.4 0.6 0.8 1.0 GTR+ GTR+ - Analysed using JC Compositional heterogeneity Covarion-like Long terminal branches Covarion and long terminal branches ^ Duchene et al. GBE GTR+Γ - Model-adequate GTR+Γ - Analysed using JC (a)(b) (c)( Compositional heterogeneity d) Covarion-like Covarion and long (e)( Long terminal branches f) terminal branches FIG.3.—The sensitivity of nine test statistics for assessing the adequacy of the GTR þC substitution model in simulations with 5,000 nucleotides under six representative simulation conditions (for results from every simulation scenario, see supplementary figs. S4–S6, Supplementary Material online). The Mahalanobis test statistic was calculated to summarize all test statistics (M ), or the four sensitive test statistics (M ). Each box represents the results of 100 1 2 replicate analyses. diversity statistic is consistently sensitive to long terminal consistency index. The result of focusing on these statistics branches, particularly in simulation scenarios involving can be observed when comparing the summary sequences with lengths of 5,000 nucleotides (fig. 3e). In Mahalanobis distance including all eight of the other test sta- data sets with shorter sequences (200 and 1,000 nucleotides), tisticsexaminedhere (M ) with that including only the four two test statistics are consistently sensitive to long terminal most sensitive statistics (M ). In most simulation scenarios, M 2 1 branches, but with low sensitivity (SDPD between 1 and 1; and M are two of the most sensitive statistics, and M is 2 2 supplementary figs. S5j–S5o and S6j–S6o, Supplementary always more sensitive than M (with the exception of scenar- Material online). These were the multinomial likelihood and ios in which the correct model is used). This shows that the confidence interval in branch support. Interestingly, we statistics with low sensitivity should be excluded. find that inference-based statistics are generally very lenient. Meanwhile, examining M and the four informative test sta- This is possibly because inferences from predictive data sets tistics can provide a useful general method for examining are similar to those from the original data (but see Brown model performance. 2014). We propose an approach to model assessment that con- New Thresholds for Assessment siders the test statistics that have measurable sensitivity to phylogenetic inferences with low accuracy and precision: Analyses of phylogenomic data sets from turtles, birds, and the X , multinomial likelihood, biochemical diversity,and mammals show the critical importance of using thresholds for 1384 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Number of standard deviations from mean of predictive distribution −5 0 5 10 15 0 500 1000 1500 2000 −2 −1 0 1 2 χ 2 Multinomial likelihood Biochemical diversity Consistency Index Branch support CI in branch support Tree length M1 M2 −10 −5 0 5 10 15 −5 0 5 10 −50 0 50 100 150 200 χ 2 Multinomial likelihood Biochemical diversity Consistency Index Branch support CI in branch support Tree length M1 M2 Phylogenomic model adequacy test statistics GBE Mahalanobis distance Mean branch support Total tree length Number of varaible sites (a)(b)(c)(d) −1e−13 0e+00 1e−13 −1e−13 0e+00 1e−13 −1e−13 0e+00 1e−13 −1e−13 0e+00 1e−13 −50 0 50 100 150 −50 0 50 100 150 −50 0 50 100 150 −50 0 50 100 150 −5 0 5 −5 0 5 −5 0 5 −5 0 5 MDS dimension 1 FIG.4.—Estimated two-dimensional representation of tree-space for samples of loci from turtles, birds, and mammals. Data are colored such that warmer colors indicate higher values of (a) Mahalanobis distance (M ), (b) mean branch support, (c) tree length, and (d) number of variable sites. High values of these variables occur in similar locations of tree-space. In the sequence data from turtles, loci with high values of M are not necessarily the same as those that have high values for the other variables (see supplementary fig. S7, Supplementary Material online). assessment that consider sequence lengths. We find strik- covarion-like, for determining the thresholds of X , multino- ing evidence in these data that highly informative align- mial likelihood, biochemical diversity,and consistency index, ments have poorer perceived model adequacy. Loci with respectively. For deriving the threshold for M ,we used the poor perceived model adequacy (the highest SDPD for M ) median values of simulations of a covarion-like process with yield gene-tree estimates that are similar to those of loci long terminal branches. that yield high branch support and long branches, and In analyses of phylogenomic data from birds and mam- that contain a large number of variable sites (fig. 4). In mals, the new thresholds have an association with esti- simulations and empirical data sets, we find that loci mates of the tree topology, in particular when model with gene-tree estimates that are likely to be inaccurate assessment is made using X and biochemical diversity and imprecise have consistently good perceived model ad- (fig. 5). In these data, the model is rejected for loci in equacy (low M ), low mean branch support, short trees, regions of tree-space with lower branch support and and few variable sites (supplementary figs. S7–S8; shorter trees. These results suggest that assessment of Supplementary Material online). model adequacy might be more meaningful for data sets We propose thresholds of model assessment based on the with longer sequences, since the mean number of sites is median value of test statistics under the scenarios to which greater in the data from birds (741) than in the data from test statistics are sensitive (supplementary fig. S9, mammals (480) and turtles (200). Similarly, these results Supplementary Material online). We use the scenarios of show that X and biochemical diversity statistics might strong compositional heterogeneity, model underparameteri- provide the most informative tests out of those that we zation, covarion-like process with long terminal branches, and have explored. Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1385 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 MDS dimension 2 Mammals Birds Turtles −5 0 5 −50 0 50 −1e−13 0e+00 1e−13 ^ Duchene et al. GBE (a)(b)(c)(d)(e) FIG.5.—Estimated two-dimensional representation of tree-space for loci from turtles, birds, and mammals. Data are colored according to whether each locus passes (black) or fails (red) each of the five tests of model adequacy using our new thresholds for assessment. framework. These sensitive statistics include X , multinomial Discussion likelihood, biochemical diversity,and consistency index.We Assessing model adequacy can provide important insights have also shown the performance of the assessment using alongside commonly used methods of model selection based these four statistics simultaneously in a multivariate setting, on information criteria or Bayes factors. By assessing model using a statistic that we denote M . We focus on these five adequacy, it is possible to consider that even the best-fitting test statistics for deriving meaningful thresholds for assess- model can provide a poor description of the process that ment, which we hope can provide information about generated the data (Gelman et al. 2014). This can be partic- whether the model is likely to yield misleading inferences ularly useful when examining phylogenomic data sets, be- of tree topology or branch lengths. Exploring the power of cause a subset of the data can be selected to maximize the novel test statistics for identifying misleading inferences will reliability of the model and inferences. Here, we compared be an important avenue of research. For example, entropy the sensitivity of methods of assessing model adequacy to a metrics have been implemented for model assessment in a diverse range of scenarios that can lead to biased inferences Bayesian framework (Brown 2014), and could be adapted of tree topology and branch lengths. We show that the var- for use in a maximum-likelihood setting by metrics that ious test statistics for assessment differ in their sensitivity to compare the empirical and predictive data to another null biased inferences. We find that some test statistics can be reference data set (Lewis et al. 2014). Similarly, more exten- highly lenient, whereas others can be conservative. Our results sive simulation frameworks could lead to additional insights also support previous evidence that sequence length has a into the power of model assessment for detecting other critical bearing on the perceived adequacy of a model potential sources of bias, such as tree imbalance or long- (Duchene et al. 2017). This phenomenon occurs because branch attraction. when longer sequences are analysed, the distribution of test Our results emphasize the importance of defining thresh- statistics calculated for predictive data is narrower. As a con- olds based on sequence length. In our simulation study and sequence, long sequences can appear to have a greater dis- our analyses of three phylogenomic data sets, we find that crepancy from predictive data compared with short the loci that yield estimates with high accuracy are usually sequences (Goldman 1993). long, such that they would be rejected using traditional Using these insights, we have proposed thresholds for as- thresholds for assessing model adequacy (supplementary sessment that are specific to the most sensitive test statistics, figs. S1–S6, S8; Supplementary Material online). Critically, implemented using a fast maximum-likelihood optimization 1386 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Phylogenomic model adequacy test statistics GBE the poor perceived model adequacy in analyses of long Supplementary Material sequences might be exacerbated in empirical data, since the Supplementary data are available at Genome Biology and model is a greater simplification of the evolutionary process in Evolution online. these data compared with our simulations. Nonetheless, it seems unreasonable to reject the model for these data, since a simple model can be sufficient for estimating the parame- ters of interest accurately and precisely (Steel 2005). This leads Acknowledgments us to propose thresholds for model assessment that are rela- This work was supported by funding from the Australian tively lenient when applied to data generated by simulation, Research Council to D.A.D. and S.Y.W.H. (grant yet can reject the model in most cases when inferences are DP160104173). S.D. was supported by a McKenzie misleading. In analyses of phylogenomic data sets from birds Fellowship from the University of Melbourne. We acknowl- and mammals, we find that our methods of assessment can edge the University of Sydney for providing high-performance reject a commonly used substitution model for loci that lead computing resources that have contributed to the research to inferences with anomalous tree topologies and low statis- results reported within this paper. tical support. We also find that assessing model adequacy can be difficult for short loci. In the phylogenomic data from bird families, loci Literature Cited that were model-adequate according to our thresholds Anisimova M, Gascuel O. 2006. Approximate likelihood-ratio test for yielded congruent estimates of the tree topology, with strong branches: a fast, accurate, and powerful alternative. Syst Biol. 55(4):539–552. statistical support. The data from birds comprised the longest Anisimova M, Gil M, Dufayard J-F, Dessimoz C, Gascuel O. 2011. Survey of loci of the three phylogenomic data sets that we investigated. branch support methods demonstrates accuracy, power, and robust- These data also produced the most intuitive results, consistent ness of fast likelihood-based approximation schemes. Syst Biol. with a previous study that also identified that model-adequate 60(5):685–699. loci lead to congruent phylogenetic inferences (Doyle et al. Bollback JP. 2002. Bayesian model adequacy and choice in phylogenetics. Mol Biol Evol. 19(7):1171–1180. 2015). Strikingly, when using the phylogenomic data set with Brown JM. 2014. Detection of implausible phylogenetic inferences using the shortest sequences, based on ultraconserved elements posterior predictive assessment of model fit. Syst Biol. 63(3):334–348. from families of turtles, model assessment was less meaning- Burnham KP, Anderson DR. 2002. Model selection and multimodel infer- ful. In these data, model-adequate loci yielded phylogenetic ence: a practical information-theoretic approach. 2nd edn. New York: estimates that were similar to those from loci for which the Springer Crawford NG, et al. 2015. A phylogenomic analysis of turtles. Mol model was rejected. These results are congruent with those of 2 Phylogenet Evol. 83:250–257. a previous study of the X statistic in phylogenomic data from Doyle VP, Young RE, Naylor GJP, Brown JM. 2015. Can we identify genes birds, which showed that only some of the shortest loci were with increased phylogenetic reliability? Syst Biol. 64(5):824–837. deemed model-inadequate and risked causing phylogenetic Drummond AJ, Suchard MA. 2008. Fully Bayesian tests of neutrality using bias due to compositional heterogeneity (Duchene et al. genealogical summary statistics. BMC Genet. 9:68. Duchene DA, et al. 2018. Analysis of phylogenomic tree space resolves 2017). relationships among marsupial Families. Syst Biol. 67:400–412. doi: Tests of model adequacy are frequently reliant on evaluat- 10.1093/sysbio/syx076. ing multiple test statistics and can be too lenient or conserva- ^ ^ Duchene DA, Duchene S,Ho SYW.2017. New statistical criteria detect tive, such that it is difficult to interpret the phylogenetic phylogenetic bias caused by compositional heterogeneity. Mol Biol performance of the model. The thresholds for assessment Evol. 34(6):1529–1534. ^ ^ Duchene DA, Duchene S, Holmes EC, Ho SYW. 2015. Evaluating the provided here should lead to more effective identification of adequacy of molecular clock models using posterior predictive simu- models that are potentially misspecified. Nevertheless, assess- lations. Mol Biol Evol. 32(11):2986–2995. ing model performance remains difficult for data sets com- Duchene S, Di Giallonardo F, Holmes EC. 2016. Substitution model ade- prising short sequences. Development of novel test statistics quacy and assessing the reliability of estimates of virus evolutionary should be accompanied by simulation studies that provide rates and time scales. Mol Biol Evol. 33(1):255–267. Fiala KL, Sokal RR. 1985. Factors determining the accuracy of cladogram intuitive thresholds, based on the impact of model violation estimation: evaluation using computer simulation. Evolution on inferences of interest. Other potential avenues of research 39(3):609–622. include developing more summary metrics and graphics Fitch WM. 1971. Rate of change of concomitantly variable codons. J Mol of model adequacy across the genome; assessing genome- Evol. 1(1):84–96. wide models of inference, such as those developed for Fitch WM, Markowitz E. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation assessing the multispecies coalescent (Reid et al. 2014); or of mutations in evolution. Biochem Genet. 4:579–593. assessment for quantifying sequence information under the Foster PG. 2004. Modeling compositional heterogeneity. Syst Biol. model (e.g., Klopfstein et al. 2017). Together, these advances 53(3):485–495. will improve the reliability of phylogenomic inferences from Galtier N. 2001. Maximum-likelihood phylogenetic analysis under a sequence data. covarion-like model. Mol Biol Evol. 18(5):866–873. Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1387 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 ^ Duchene et al. GBE Gelman A, Carlin JB, Stern HS, Rubin DB. 2014. Bayesian data analysis. O’Hagan A. 2003. HSSS model criticism (with discussion). In: Green P, Boca Raton (FL): Taylor & Francis. Hjiort N, Richardson S, editors. Highly structured stochastic systems. Goldman N. 1993. Statistical tests of models of DNA substitution. J Mol Oxford (UK): Oxford University Press. p. 423–453. Evol. 36(2):182–198. Penny D, Hendy MD. 1985. The use of tree comparison metrics. Syst Zool. Guindon S, et al. 2010. New algorithms and methods to estimate 34(1):75–82. maximum-likelihood phylogenies: assessing the performance of Penny D, McComish BJ, Charleston MA, Hendy MD. 2001. Mathematical PhyML 3.0. Syst Biol. 59(3):307–321. elegance with biochemical realism: the covarion model of molecular Hillis DM, Heath TA, St John K. 2005. Analysis and visualization of tree evolution. J Mol Evol. 53(6):711–723. space. Syst Biol. 54(3):471–482. Phillips MJ. 2009. Branch-length estimation bias misleads molecular Ho SYW, Jermiin L. 2004. Tracing the decay of the historical signal in dating for a vertebrate mitochondrial phylogeny. Gene 441(1– biological sequence data. Syst Biol. 53(4):623–637. 2):132–140. Ho ¨ hna S, Drummond AJ. 2012. Guided tree topology proposals for Pisani D, et al. 2015. Genomic data do not support comb jellies as the sister Bayesian phylogenetic inference. Syst Biol. 61(1):1–11. group to all other animals. Proc Natl Acad Sci U S A. Ho ¨ hna S, May MR, Moore BR. 2016. TESS: an R package for efficiently 112(50):15402–15407. simulating phylogenetic trees and performing Bayesian inference of Posada D, Crandall KA. 2001. Selecting the best-fit model of nucleotide lineage diversification rates. Bioinformatics 32(5):789–791. substitution. Syst Biol. [Internet] 50:580–601. Jarvis ED, et al. 2014. Whole-genome analyses resolve early branches Prum RO, et al. 2015. A comprehensive phylogeny of birds (Aves) using in the tree of life of modern birds. Science 346(6215): targeted next-generation DNA sequencing. Nature 1320–1331. 526(7574):569–573. Jayaswal V, Wong TKF, Robinson J, Poladian L, Jermiin LS. 2014. Mixture Rabosky DL, Glor RE. 2010. Equilibrium speciation dynamics in a model models of nucleotide sequence evolution that account for heteroge- adaptive radiation of island lizards. Proc Natl Acad Sci U S A. neity in the substitution process across sites and across lineages. Syst 107(51):22178–22183. Biol. 63(5):726–742. Reddy S, et al. 2017. Why do phylogenomic data sets yield conflicting Jermiin L, Ho SYW, Ababneh F, Robinson J, Larkum AW. 2004. The biasing trees? Data type influences the avian Tree of Life more than taxon effect of compositional heterogeneity on phylogenetic estimates may sampling. Syst Biol. [Internet]. 66(5):857–879. be underestimated. Syst Biol. 53(4):638–643. Reid NM, et al. 2014. Poor fit to the multispecies coalescent is widely Jukes TH, Cantor CR. 1969. Evolution of protein molecules. In: Munro H, detectable in empirical data. Syst Biol. 63(3):322–333. editor. Mammalian protein metabolism. New York: Academic Press. p. Ripplinger J, Sullivan J. 2010. Assessment of substitution model adequacy 21–132. using frequentist and Bayesian methods. Mol Biol Evol. Klopfstein S, Massingham T, Goldman N. 2017. More on the best evolu- 27(12):2790–2803. tionary rate for phylogenetic analysis. Syst Biol. 66:769–785. doi: Robinson DF, Foulds LR. 1981. Comparison of phylogenetic trees. Math 10.1093/sysbio/syx051. Biosci. 53(1-2):131–147. Kluge AG, Farris JS. 1969. Quantitative phyletics and the evolution of Schliep KP. 2011. PHANGORN: phylogenetic analysis in R. Bioinformatics anurans. Syst Biol. 18(1):1–32. 27(4):592–593. Lartillot N, Brinkmann H, Philippe H. 2007. Suppression of long-branch Shen X-X, Hittinger CT, Rokas A, Minh BQ, Braun EL. 2017. Contentious attraction artefacts in the animal phylogeny using a site- relationships in phylogenomic studies can be driven by a handful of heterogeneous model. BMC Evol Biol. 7(suppl 1):S4. genes. Nat Ecol Evol. 1(5):126. Lemmon AR, Moriarty EC. 2004. The importance of proper model as- Slater GJ, Pennell MW. 2014. Robust regression and posterior predictive sumption in Bayesian phylogenetics. Syst Biol. 53(2):265–277. simulation increase power to detect early bursts of trait evolution. Syst Lewis PO, Xie W, Chen M-H, Fan Y, Kuo L. 2014. Posterior predictive Biol. 63(3):293–308. Bayesian phylogenetic model selection. Syst Biol. 63(3):309–321. Springer MS, Gatesy J. 2016. The gene tree delusion. Mol Phylogenet Evol. Liu L, Xi Z, Wu S, Davis CC, Edwards SV. 2015. Estimating phyloge- 94(Pt A):1–33. netic trees from genome-scale data. Ann N Y Acad Sci. 1360: Steel M. 2005. Should phylogenetic models be trying to “fit an elephant”? 36–53. Trends Genet. 21(6):307–309. Lockhart PJ, Howe CJ, Bryant DA, Beanland TJ, Larkum AWD. 1992. Tavar e S. 1986. Some probabilistic and statistical problems in the analysis Substitutional bias confounds inference of cyanelle origins from se- of DNA sequences. Lect Math Life Sci. 17:57–86. quence data. J Mol Evol. 34:153–162. Timme RE, Bachvaroff TR, Delwiche CF. 2012. Broad phylogenomic Mahalanobis PC. 1936. On the generalised distance in statistics. Proc Natl sampling and the sister lineage of land plants. PLoS One Inst Sci India 12:49–55. 7(1):e29696. Mardia KV, Kent JT, Bibby JM. 1979. Multivariate analysis. New York: Wertheim JO, Sanderson MJ, Worobey M, Bjork A. 2010. Relaxed molec- Academic Press. ular clocks, the bias-variance trade-off, and the quality of phylogenetic Matsen FA. 2006. A geometric approach to tree shape statistics. Syst Biol. inference. Syst Biol. 59(1):1–8. 55:652–661. Yang Z. 1993. Maximum-likelihood estimation of phylogeny from DNA Meredith RW, et al. 2011. Impacts of the cretaceous terrestrial revolution sequences when substitution rates differ over sites. Mol Biol Evol. and KPg extinction on mammal diversification. Science 10(6):1396–1401. 334(6055):521–524. Zhou X, et al. 2012. Phylogenomic analysis resolves the interordinal rela- Misof B, et al. 2014. Phylogenomics resolves the timing and pattern of tionships and rapid diversification of the laurasiatherian mammals. Syst insect evolution. Science 346(6210):763–767. Biol. 61(1):150–164. Murphy WJ, et al. 2001. Molecular phylogenetics and the origins of pla- cental mammals. Nature 409(6820):614–618. Associate editor: David Bryant 1388 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Genome Biology and Evolution Oxford University Press

Differences in Performance among Test Statistics for Assessing Phylogenomic Model Adequacy

Free
14 pages

Loading next page...
 
/lp/ou_press/differences-in-performance-among-test-statistics-for-assessing-CYY6o5fmOy
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
ISSN
1759-6653
eISSN
1759-6653
D.O.I.
10.1093/gbe/evy094
Publisher site
See Article on Publisher Site

Abstract

Statistical phylogenetic analyses of genomic data depend on models of nucleotide or amino acid substitution. The ade- quacy of these substitution models can be assessed using a number of test statistics, allowing the model to be rejected when it is found to provide a poor description of the evolutionary process. A potentially valuable use of model-adequacy test statistics is to identify when data sets are likely to produce unreliable phylogenetic estimates, but their differences in performance are rarely explored. We performed a comprehensive simulation study to identify test statistics that are sensitive to some of the most commonly cited sources of phylogenetic estimation error. Our results show that, for many test statistics, traditional thresholds for assessing model adequacy can fail to reject the model when the phylogenetic inferences are inaccurate and imprecise. This is particularly problematic when analysing loci that have few informative sites. We propose new thresholds for assessing substitution model adequacy and demonstrate their effectiveness in analyses of three phylogenomic data sets. These thresholds lead to frequent rejection of the model for loci that yield topological inferences that are imprecise and are likely to be inaccurate. We also propose the use of a summary statistic that provides a practical assessment of overall model adequacy. Our approach offers a promising means of enhancing model choice in genome-scale data sets, potentially leading to improvements in the reliability of phylogenomic inference. Key words: model adequacy, substitution model, maximum likelihood, test statistics, birds, Laurasiatherian mammals, turtles. Introduction et al. 2014) or by filtering the data according to appropriate In recent years, phylogenomic analyses have provided inter- criteria (Liu et al. 2015). Both of these measures can help to esting insights into the evolution of various taxonomic groups enhance the phylogenetic signal in the available data. (e.g., Meredith et al. 2011; Timme et al. 2012; Misofetal. One potential method of improving evolutionary 2014). However, the evolutionary relationships in some Modelling in phylogenetics is to perform an absolute assess- groups of organisms have remained difficult to resolve even ment of the adequacy or plausibility of the substitution model with large amounts of data, and different studies have some- for each locus in the data set (Doyle et al. 2015). This is dif- times yielded conflicting phylogenetic estimates (e.g., Jarvis ferent from the comparisons of relative model fit that are et al. 2014; Prum et al. 2015). The sources of conflict among routinely performed in phylogenetics (Posada and Crandall data sets are likely to include model misspecification, incon- 2001), and involves testing whether the chosen substitution gruence among gene trees, and the impacts of positive selec- model provides an accurate description of the evolutionary tion (Springer and Gatesy 2016; Reddy et al. 2017; Shen et al. process that generated the data (Goldman 1993). Although 2017). Unless data from additional loci can be readily the available substitution models are sometimes found to be obtained, the best means of improving the reliability of the good descriptions of the molecular evolutionary process inferred species tree is to enhance the Modelling of the evo- (Ripplinger and Sullivan 2010), they are unlikely to be ade- lutionary process that produced the data at hand. This can be quate for all of the loci in genome-scale data sets (Goldman done either by improving the methods of inference (e.g., 1993; Doyle et al. 2015; Duchene et al. 2016). Methods of Galtier 2001; Foster 2004; Lartillot et al. 2007; Jayaswal assessing model adequacy could, in principle, identify the The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non- commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com Genome Biol. Evol. 10(6):1375–1388. doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1375 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 ^ Duchene et al. GBE portion of the data set for which the model is realistic. This The procedure of comparing the empirical and predictive information couldthenbe usedtoreject a givenmodelor to data requires the choice of a test statistic, a metric that exclude the subset of the data that is not well described by the describes some aspect of the data set or of the inferences available models (Doyle et al. 2015), leading to enhancement drawn from the data set. If empirical data are similar to, or of the phylogenetic signal in the data. produce similar estimates to, the predictive distribution, then Methods of assessing model adequacy by comparing em- the model can be considered adequate. Test statistics that can pirical data to those simulated under the model can test be calculated directly from the data are known as data-based whether some aspects of the model are realistic, but they statistics, whereas those that describe the inferences drawn might not necessarily indicate when estimates of parame- from the data are known as inference-based statistics. Some ters of interest are unreliable. This is because sequence data test statistics require calculations from the data as well as might meet the particular assumption being assessed but inferences from the data, such that they are hybrid test sta- still violate other assumptions that are critical for estimating tistics. Critically, test statistics can differ in their sensitivity to parameters of interest (e.g., Ho and Jermiin 2004; Brown biased inferences. Identifying which test statistics are sensitive 2014; Doyle et al. 2015). Data might also fail to meet a to biased phylogenetic inferences is fundamental before we particular assumption of the model, yet this might not can develop a practical framework of model assessment. have a negative impact on the estimates of parameters of The extent to which empirical data must be similar to the interest (Lemmon and Moriarty 2004; Brown 2014; predictive distribution for inferences to be inaccurate remains Duchene et al. 2017). For example, estimates of tree topol- poorly understood. The traditional approach to determining ogy and branch lengths can be accurate even when there model adequacy is to reject the model if the test statistic for are substantial departures from stationary base composition the original data falls above 95% or 99% of the values from (Duchene et al. 2017). To ensure the usefulness of methods simulated data. But assessment using these thresholds can for model assessment, we must evaluate whether measures lead to the model being rejected even when inferences are of model adequacy can predict the accuracy and precision not necessarily unreliable, or the failure to reject a model that of the inferences made using these models. Similarly, leads to biased inferences (Brown 2014; Duchene et al. 2017). approaches that can summarize tests from multiple meth- In this study, we aim to characterize the performance of a ods, and across hundreds of loci, are needed to make as- wide range of test statistics for model assessment in phyloge- sessment of model adequacy practical in the genomic age. nomic data sets, and explore a method for summarizing their Methods to assess model adequacy are based on com- results. Using simulations under a range of conditions known paring the empirical data with a distribution of data gener- to occur in empirical data, we characterize the ability of nine ated under the candidate model, also known as the test statistics to detect poor performance of the substitution predictive distribution. The data generated under the can- model. We define performance as the accuracy and precision didate model are also known as predictive data and can be of estimates of tree topology and branch lengths, which are generated using the maximum-likelihood estimates of the often the parameters of interest in phylogenomic studies. We model parameters (Goldman 1993). Alternatively, parame- recommend using an efficient maximum-likelihood frame- ter estimates can be taken from samples from the posterior work for model assessment that makes assessment feasible distribution in a Bayesian analysis, such that predictive data for genome-scale data sets, and focusing on the test statistics account for the uncertainty in estimates of model parame- that perform well in our simulation study. Based on our sim- ters (Bollback 2002). In a Bayesian framework, simulations ulation study, we also propose thresholds that are meaningful performed using the model are also known as posterior for identifying misleading phylogenetic inferences. We also predictive simulations. The benefit of accounting for uncer- describe the performance of a test statistic that summarizes tainty in parameter estimates comes at the cost of compu- multiple tests, which can be used to provide an overall assess- tational demand. In this study, we focus on methods to ment of model adequacy. We demonstrate our approach and assess model adequacy for genome-scale data sets, so we explore the relationship between substitution model ade- use a maximum-likelihood method of assessment. Several quacy and performance in phylogenomic data sets from tur- types of models in phylogenetics can be assessed using pre- tles, birds, and Laurasiatherian mammals. dictive simulations, including substitution models (Goldman 1993; Bollback 2002; Foster 2004; Brown 2014), large hi- Materials and Methods erarchical models (Reid et al. 2014; Duchene et al. 2015), Fast Assessment of Model Adequacy models of trait evolution (Rabosky and Glor 2010; Slater and Pennell 2014), and models of diversification rates Framework for Model Assessment through time (Ho ¨ hna et al. 2016). Predictive approaches In a basic assessment of substitution model adequacy, phylo- can also be used for relative model comparison (Lewis genetic analysis of the empirical data is performed using the et al. 2014), providing an alternative to commonly used model that is to be assessed. Then, a large number of data information criteria or Bayes factors. 1376 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Phylogenomic model adequacy test statistics GBE sets of the same size as the empirical data set are generated have previously been proposed in the context of substitution by simulating sequence evolution under the chosen model, model adequacy (table 1). The selected statistics consider sev- using the maximum-likelihood estimates of the model param- eral aspects of the model and potentially provide an overall eters. A chosen test statistic is calculated for each of these examination of adequacy. These test statistics include the X simulated data sets, thereby producing a distribution of values statistic for assessing stationarity of base composition (Foster derived from the model (Goldman 1993; Foster 2004). The 2004); multinomial and d statistics for assessing overall model test statistic from the empirical data can be compared against fit (Goldman 1993; Bollback 2002); biochemical diversity for this predictive distribution. assessing the diversity in base composition across sites An ideal test statistic is able to describe the differences (Lartillot et al. 2007); and consistency index for assessing the between the empirical data and the simulated data, especially consistency of phylogenetic information in the data when with regard to inferences of interest (such as the tree topology compared with the most parsimonious possible scenario and branch lengths). Therefore, selecting appropriate test sta- (Kluge and Farris 1969). We also use three test statistics based tistics is critical to assessment of model adequacy. Here, we on inferences from the data, including mean branch support, describe our framework for fast assessment of model ade- 95% confidence interval in branch support,and sum of quacy using test statistics that can detect instances when phy- branch lengths (Brown 2014). In addition, we use the logenetic inferences are inaccurate or imprecise. We assume Mahalanobis statistic, which has been used previously for that we have an empirical data set comprising a large number assessing phylodynamic models, and provides a summary of of unlinked loci from the taxa of interest. a chosen set of test statistics (O’Hagan 2003; Drummond and In our framework, we obtain maximum-likelihood esti- Suchard 2008). This approach treats groups of test statistics as mates of model parameters and the phylogeny for each a multivariate distribution, to which the empirical data are gene alignment using the software PhyML 3.0 (Guindon compared using the Mahalanobis distance (see table 1). We et al. 2010). Branch support is estimated using a highly have implemented our approach and the test statistics out- efficient, nonparametric measure of branch support with lined here in the software PhyloMAd (github.com/duchene/ behaviour similar to the nonparametric bootstrap (Guindon phylomad). et al. 2010; Anisimova et al. 2011). We take advantage of the speed of this method for assessing the performance of Validation of Test Statistics Using a Simulation Study test statistics that are based on inferences from the data Simulation Scenarios (Brown 2014). For each locus alignment, we generate 100 data sets by To understand the performance of test statistics for assess- simulating sequence evolution using the parameter estimates ing model adequacy, we simulated the evolution of nucle- from the empirical data. These simulated data sets have the otide sequences in a number of scenarios that involve same number of taxa and sites as the empirical data. Since misspecified models. Under these conditions, the accuracy these simulated data are derived from the model, they can be and precision of phylogenetic inference might be adversely considered a null distribution (Goldman 1993). To assess affected. We performed a range of simulations in six sce- model adequacy, we calculate a number of test statistics for narios involving realistic variation across loci (fig. 1). Our the empirical data set and for each simulated data set. It is study does not include other scenarios that have the poten- common to consider the model to be inadequate when the tial to create difficulties for phylogenetic inference, such as test statistic calculated from the empirical data falls outside tree imbalance or extreme model parameter values. Instead, the central 95 or 99 percentile range of test statistics calcu- we focus on scenarios that lead to model misspecification. lated from the simulated data. However, these thresholds do Our simulations were performed on fully symmetric trees not necessarily reflect the points at which inferences become with 32 tips. To derive adverse simulation scenarios, we inaccurate, so in this study we do not use the P-values to test began with a baseline phylogenetic tree with branch model adequacy. Instead, we used a measure of effect size lengths drawn from an exponential distribution with rate based on the number of standard deviations of the predictive of 5, such that the mean distance from the root to each distribution (SDPD) between the mean and the statistic calcu- tip is 1 substitution per site. We simulated the evolution of lated from the original data (Brown 2014; Duchene et al. sequences along these tree to produce data sets with 200, 2017). 1,000, or 5,000 nucleotide sites. Simulations were performed under a model in which the rates of each of the six substitution types were different. This Test Statistics Considered model is the GTR þ C (Tavare 1986; Yang 1993) in scenarios Almost any variable or model parameter that can be esti- where base composition is stationary and sites do not contain mated from the data can be used as a test statistic in the covarion-like patterns of substitution. Simulations were done approach described here. Instead of carrying out an exhaus- using the R package PHANGORN (Schliep 2011), except for tive examination of possible statistics, we consider nine that those with nonstationary base composition, which were done Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1377 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 ^ Duchene et al. GBE 1378 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Table 1 Details of Nine Test Statistics Used to Assess the Adequacy of Nucleotide Substitution Models Test Statistic Calculation Model Component Assessed Type of Statistic Reference X Tree- and model-based chi-squared statistic of base frequencies across taxa. It is calcu- Stationarity of base Data-based Foster (2004) lated using matrices of base composition with a row for each taxon and a column for composition each nucleotide. One matrix corresponds to the values expected under the tree and model of base composition, and the other corresponds to the observed base compo- sition. The statistic is calculated using the following formula: X ¼ R½ðobs  exp Þ =exp m m m where at each cell obs is the observed base frequency and exp is the expected frequency under the tree and phylogenetic model. Multinomial (or Product of the unique site frequencies (L ), calculated as: Overall fit Data-based Goldman (1993a), unconstrained) Bollback (2002) likelihood L ¼ ðN =NÞ m ‘ ‘2B where ‘ is a site pattern in the set b, N is the number of occurrences of pattern ‘,and N is the total number of sites in the data. d Likelihood of the data using the unconstrained model (L ), minus the likelihood under Overall fit Data-inference hybrid Goldman (1993a) the evolutionary model Biochemical Calculated as the number of different bases occurring at each site, and the mean value Diversity in base composition Data-based Lartillot et al. diversity taken across the alignment across sites (2007) Consistency Minimum possible number of substitutions in the data divided by the minimum number Consistency of phylogenetic Data-inference hybrid Kluge and Farris index required to describe a given tree using parsimony. It has been considered as a measure information in the data (1969) of homoplasy, and is expected to take a value of 1 in the absence of homoplasy compared with the most parsimonious scenario Branch support Mean of branch-support values across the maximum-likelihood tree. Branch support can Overall fit Inference-based Brown (2014b) be calculated in a variety of ways, including nonparametric bootstrap or approximate likelihood-ratio test (Anisimova and Gascuel 2006). 95% CI in 95% range in branch-support values across the maximum-likelihood tree Overall fit Inference-based Brown (2014b) branch-support statistic Phylogenomic model adequacy test statistics GBE Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1379 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Tree length Sum of the branch lengths in the maximum-likelihood tree Overall fit Inference-based Brown (2014b) Mahalanobis A test of model adequacy can be placed in a multivariate setting that simultaneously Summary assessment from Based on other test Mahalanobis distance considers multiple test statistics by using Mahalanobis distances. The aim of this ap- multiple test statistics; statistics (1936); O’Hagan proach is to estimate a distance between the empirical test statistics and the multi- Overall fit (2003); variate predictive distribution from several test statistics. Individual test statistics are Drummond and first standardized so that they appear on the same scale. Then it is possible to calculate Suchard (2008) the mean (m) and variance covariance matrix (V) of the multivariate distribution: m ¼ Tðrep Þ P p P¼1 hihi V ¼ Tðrep Þ m ^  Tðrep Þ m ^ p p P1 P¼1 where P is the number of predictive simulations, T is the multivariate distribution of test statistics, and rep is each of the predictive data sets. The empirical and replicate data sets each has a distance from the multivariate distribution that can be defined as the following: MxðÞ ¼ðÞ x  m V ðÞ x  m For assessing substitution model adequacy, we used two combinations of test statistics to define the multivariate distribution. One included the other eight test statistics con- sidered in this study, and the other included the four statistics that were the most sensitive to biased phylogenetic inferences. ^ Duchene et al. GBE a. Substitution model parameterization Model-adequate Over-parameterized Under-parameterized Simulated with Analysed with Simulated with Analysed with Simulated with Analysed with GTR+ GTR+ JC GTR+ GTR+ JC AC AC AC AC AC AC GT GT GT GT GT GT b. Composition heterogeneous model One half of taxa with One half of taxa with One eighth of taxa with One eighth of taxa with moderate change strong change moderate change strong change c. Covarion-like rate variation model d. Long terminal branches Moderate Strong ×10 ×25 variation variation e. Covarion-like model with long terminal branches Moderate variation with Strong variation with Moderate variation with Strong variation with terminal branch lengths ×10 terminal branch lengths ×10 terminal branch lengths ×25 terminal branch lengths ×25 f. Number of nucleotides in each locus 500 1000 5000 FIG.1.—The six characteristics that were varied in simulations of sequence evolution to investigate the performance and adequacy of the candidate substitution model (GTRþC): (a) substitution model parameterization; (b) compositional heterogeneity; (c) covarion-like rate variation; (d) terminal branch lengths; (e) covarion-like rate variation and terminal branch lengths; and (f) sequence length for each locus. One hundred replicates were performed under each scenario from (a)to (e), under each of the sequence lengths shown in (f). Colors in (a) indicate different rate parameters, whereas in (b) they indicate the magnitude and proportion of taxa undergoing a change in base composition. Branch thickness corresponds to evolutionary rate in (c), (d), and (e). using the software p4 (Foster 2004). The R matrix used for We first performed phylogenetic and model-adequacy simulation had parameters set to 1.3472, 4.8145, 0.9304, analyses using substitution models that were overparameter- 1.2491, 5.5587, and 1.0000, based on a previous study of ized, underparameterized, and adequate (fig. 1a). This placental mammals (Murphy et al. 2001). We performed an was done by evaluating three scenarios that represented dif- additional set of simulations under the Jukes–Cantor model ferent combinations of simulation model and estimation (Jukes and Cantor 1969). For each simulation scenario, we model: sequences that evolved under the JC model and performed 100 replicates for each locus length. were analysed using the GTR þ C model (overparameterized 1380 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Phylogenomic model adequacy test statistics GBE model); sequences that evolved under the GTR þ C model To increase the sources of error in the data, we introduced and were analysed using the JC model (underparameterized simulation scenarios in which the lengths of the terminal model); and sequences that evolved under the GTR þ C branches were 10 and 25 times those in the original simula- model and were analysed using the GTR þ C model (ade- tions (fig. 1d). These simulations led to data that resemble quate model). those across highly divergent taxa, such as those used to ex- We simulated other processes that can occur in empirical amine the relationships among metazoan taxa (e.g., Pisani data and which can cause poor accuracy and precision in et al. 2015). To select the parameters for these simulations, phylogenetic inference. Heterogeneity in base frequencies we took the original simulations involving a root-to-tip dis- across lineages, which violates the assumption of composi- tance of 1 substitution per site to represent a period of 50 My. tional stationarity, can lead to the spurious joining of taxa with Under this assumption, the simulation scenarios with longer convergent base frequencies (Lockhart et al. 1992; Jermiin terminal branches reflect data sets across timescales of 1.25 et al. 2004). We performed simulations that included two and 2.5 billion years, resembling the scenarios simulated by convergent changes in base composition across the tree, Penny et al. (2001). We also performed simulations on trees with the maximum topological distance from each other with long terminal branches for each of the three covarion (fig. 1b). We varied the proportion of tips that underwent a scenarios (fig. 1e), which also resembles previous investiga- change in base composition between none, one-eighth, and tions of these conditions (Penny et al. 2001). half of the total number of tips. We also varied the magnitude of changes in base composition across two scenarios: 1) ap- Analyses of Simulated Data proximately the maximum difference in base composition ob- served across taxa in an avian phylogenomic data set (Prum For each simulated data set, phylogenetic inference was per- et al. 2015; from {0.25, 0.25, 0.25, 0.25} to {0.05, 0.45, 0.45, formed using a common candidate substitution model 0.05}); and 2) extreme differences in base composition (from (GTR þ C) using the software PhyML (Guindon et al. 2010), {0.25, 0.25, 0.25, 0.25} to {0.01, 0.49, 0.49, 0.01}). The small- and then model adequacy was assessed as described above. est base frequency in the first scenario was set to approxi- We used four metrics to describe the accuracy and precision mately the smallest base frequency observed in avian of our analyses of simulated data. The lengths of branches in phylogenomic data. These parameters allowed data sets to estimated trees were summed, and then the sum was sub- have a high realistic probability of leading to biased phyloge- tracted from the sum of branch lengths of simulated trees. netic inferences. The smallest base frequency in the second This value was then divided by the sum of simulated branch scenario was set to a small but nonzero value of 0.01, in order lengths to describe the inaccuracy in estimates. We made the to maintain the presence of all four bases in the data. same calculation using the stemminess of each tree, which is Simulations of sequence evolution were also done under a the proportion of the inferred tree length represented by in- covarion-like process (Galtier 2001), where substitution rates ternal branches (Fiala and Sokal 1985). Stemminess was used vary across sites and across lineages (Fitch and Markowitz to summarize the bias in inferences across branch lengths 1970; Fitch 1971). This scenario has been found to cause within each tree estimate. As a measure of error in topological poor estimates of phylogenetic tree lengths, but can benefit inference, we calculated the unweighted Robinson–Foulds topological inference by causing the overestimation of inter- distance between the estimated and simulated trees nal branch lengths (Penny et al. 2001; Phillips 2009). For this (Robinson and Foulds 1981; Penny and Hendy 1985). Lastly, reason, it can be considered an unusual form of model mis- we used the support for nodes in the estimated tree as a specification that is likely to be difficult to detect, yet might measure of precision in the inferred topology, calculated using mislead the inferences. Under a covarion-like process (Galtier the Shimodaira–Hasegawa approximate likelihood-ratio test 2001), relative rates across lineages are described by a discre- (aLRT) nonparametric measure of branch support (Guindon tized gamma distribution. The data share the categories of the et al. 2010; Anisimova et al. 2011). distribution, but each site has its own realization of rates For each simulated data set, we assessed model adequacy across lineages (Galtier 2001). In this way, sites might share using each of the nine test statistics considered in our study. In the rate along a given branch, but might not share that rate statistics, sensitivity is generally defined as the ability of a test along other branches. We simulated this covarion-like process to correctly identify instances in which a result is significant. using a gamma distribution with an a parameter of 1. To vary Accordingly, we consider test statistics to be consistently sen- the strength of variation across sites, rates across lineages sitive to particular simulation conditions if the distribution of were drawn from distributions with 1 (no covarion), 5, or values from replicate simulations does not overlap with the 10 categories across sites, allowing for an increasing number mean of the values from the predictive distributions. For test of extreme rates (fig. 1c). The consequence of using different statistics that were sensitive to inferences with poor accuracy numbers of categories is that some sites will have more ex- and precision, we aimed to determine meaningful thresholds treme (low and high) rates, although the rate distribution and for model assessment. This is because existing tests of model mean rate remain unchanged. adequacy are generally conservative, frequently rejecting the Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1381 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 ^ Duchene et al. GBE model when inferences from the data are unlikely to be mis- loci that lead to well-supported topologies can have some of leading (e.g., Duchene et al. 2017). the largest numbers of variable sites, and therefore can be the We identified thresholds with a tendency to be lenient, loci that show the greatest differences from predictive data with low risk of rejecting the model when the inferences (i.e., have the largest Mahalanobis distance). are not misleading (low Type I error rate) at the expense of To investigate whether information content was associated sometimes failing to reject the model when inferences are with distance from the model or with tree space, we used misleading (moderate Type II error rate). This approach max- Spearman’s q to test whether the Mahalanobis distance was imizes the usage of phylogenomic data, because data can still correlated with mean node support, tree length, number of contain useful information even when the model is rejected variable sites, and the MDS dimensions. We also used the by a given test statistic. MDS visualization to assess the performance of thresholds The interpretation of test statistics can be sensitive to se- for assessing model adequacy. For each sensitive test statistic, quence length (Duchene et al. 2017), so our thresholds take we investigated the power of our threshold to reject the this into account. We determined the threshold for model model in regions of tree space that are the most likely to assessment for each sensitive test statistic. Thresholds were contain misleading inferences. defined as a fitted function between the sequence lengths used for simulation and the median test statistic under a sim- ulation scenario to which the statistic is most sensitive. Results Accuracy and Precision under Simulation Conditions Analyses of Empirical Data Phylogenetic inferences are accurate and precise when there is a match between the substitution models used for simula- The framework that we have used here for model assessment tion and analysis (fig. 2). When the model used for analysis is is highly computationally efficient, and therefore well suited overparameterized, the accuracy and precision of phyloge- for examining genome-scale data sets. We analysed three netic estimates is similar to those obtained when the data phylogenomic data sets to investigate the impact of assessing are analysed using the correct model (supplementary figs. substitution model adequacy on the inferred tree topology. S1–S3, Supplementary Material online). When the model These data sets comprised 2,363 loci from 63 turtle taxa used for analysis is underparameterized, tree length is consis- (Crawford et al. 2015), 222 loci from 200 bird taxa (Prum tently underestimated (fig. 2a), terminal branches have a dis- et al. 2015), and 96 loci from 15 Laurasiatherian mammal proportionate contribution to the tree length (fig. 2b), and the taxa (Zhou et al. 2012). In order to allow calculation of the topology is estimated with higher error than when the correct multinomial likelihood, sites with gaps or unknown nucleoti- model is used (fig. 2c). Using an underparameterized model des were excluded. For each of the three data sets, we per- also leads to estimates with greater branch support (higher formed maximum-likelihood analyses using a GTR þ C model precision) than when the model is adequate (fig. 2d), in agree- of nucleotide substitution, and assessed model adequacy us- ment with the bias-variance tradeoff found in statistical mod- ing the nine test statistics that we investigated in our simula- els (Burnham and Anderson 2002; Lemmon and Moriarty tion study (table 1). 2004; Wertheim et al. 2010; Liu et al. 2015). We described the relative distances between gene trees by As simulated in this study, compositional heterogeneity approximating these distances in two-dimensional space. This leads to a mild tendency to overestimate tree length, to ter- approach has been shown to provide an accurate description ^ minal branches having a greater contribution to the tree of the differences in phylogenetic signal across loci (Duchene length, and to greater error in the estimate of the tree topol- et al. 2018). This representation of tree space was made using ogy than when the correct model is used (fig. 2). Branch multidimensional scaling (MDS), based on unweighted lengths can be overestimated under conditions of composi- Robinson–Foulds distances (Robinson and Foulds 1981; tional heterogeneity because a change in base composition Penny and Hendy 1985) for describing the distances between can inflate the inferred numbers of the types of substitutions trees in Euclidean space (Hillis et al. 2005; Matsen 2006; involved in that change (Ho and Jermiin 2004; Duch^ ene et al. Ho ¨ hna and Drummond 2012). MDS finds the Euclidean posi- 2017). tions of gene trees that minimize the sum of the distances Analyses of data generated under a covarion-like process between them (Mardia et al. 1979). lead to underestimates of tree length (fig. 2a), but otherwise We used the MDS visualization to explore an association produce estimates with similar accuracy and precision to between tree space and the Mahalanobis distance, mean those in which the correct model was used. The lowest accu- node support, tree length, and number of variable sites. racy and precision in phylogenetic inferences was found in This association might occur, for example, if loci that yield simulation scenarios that involved long terminal branches trees with highly supported nodes and that have high infor- (fig. 2c and d). In these cases, tree length was severely under- mation content lead to congruent estimates of topology and estimated and terminal branches had a small contribution to have high model adequacy (Doyle et al. 2015). Alternatively, 1382 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Phylogenomic model adequacy test statistics GBE (a)(b) (c)(d) FIG.2.—The performance of phylogenetic inference using the GTRþC substitution model in simulations with 5,000 nucleotides under six represen- tative simulation conditions (for results from every simulation scenario, see supplementary figs. S1–S3, Supplementary Material online). Each box represents the results of 100 replicate analyses. Performance is described by (a) the length of the estimated tree minus that of the simulated tree, divided by that of the simulated tree, (b) the difference in stemminess, defined as the proportion of the inferred tree length represented by internal branches, (c) the unweighted Robinson–Foulds topological distance between estimated and simulated trees, and (d) the mean node support in the estimated tree, which is a measure of precision in estimates. tree length (fig. 2a and b). The effects of the covarion process consistency index (fig. 3b). However, the d statistic has negli- combined with long terminal branches have been studied gible sensitivity compared with the other three test statistics, previously, with similar outcomes (Galtier 2001; Penny et al. with SDPD values lower than 0.5 in data sets with 1,000 2001). We also find that inferences from short sequence nucleotides. The same four test statistics, along with X ,are alignments have greater variance in accuracy and precision consistently sensitive to compositional heterogeneity. The X (supplementary figs. S2 and S3, Supplementary Material statistic is overwhelmingly the most sensitive statistic to this online). scenario, dwarfing the signal of the other statistics (fig. 3c). Theseresults suggestthatmostteststatistics are lenientunder conditions of compositional heterogeneity. However, the X Sensitivity of Model-Adequacy Test Statistics statistic can be highly conservative, because some of the infer- None of the test statistics is consistently sensitive to scenarios ences under conditions of compositional heterogeneity have in which the model is adequate (fig. 3a) or to model over- similar accuracy and precision to those obtained when using parameterization (supplementary figs. S4b–S6b, the correct model. Supplementary Material online). This is a desirable property The biochemical diversity and consistency index statistics and is consistent with the results of previous research (Brown are consistently sensitive to the covarion-like process, in par- 2014; Duchene et al. 2017). Four test statistics are consistently ticular when the simulated sequences had a length of 5,000 sensitive to model underparameterization, including the mul- nucleotides (fig. 3d and f; see supplementary figs. S5i and S6i, tinomial likelihood, d statistic, biochemical diversity,and the Supplementary Material online). Only the biochemical Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1383 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Topological distance between Difference in tree length estimated and simulated trees (estimated minus simulated) 0 1020 304050 −0.6 −0.4 −0.2 0.0 0.2 GTR+ GTR+ - Analysed using JC Compositional heterogeneity Covarion-like Long terminal branches Covarion and long terminal branches Mean node support of Difference in stemminess estimated tree (estimated minus simulated) 0.65 0.75 0.85 0.95 −0.2 0.0 0.2 0.4 0.6 0.8 1.0 GTR+ GTR+ - Analysed using JC Compositional heterogeneity Covarion-like Long terminal branches Covarion and long terminal branches ^ Duchene et al. GBE GTR+Γ - Model-adequate GTR+Γ - Analysed using JC (a)(b) (c)( Compositional heterogeneity d) Covarion-like Covarion and long (e)( Long terminal branches f) terminal branches FIG.3.—The sensitivity of nine test statistics for assessing the adequacy of the GTR þC substitution model in simulations with 5,000 nucleotides under six representative simulation conditions (for results from every simulation scenario, see supplementary figs. S4–S6, Supplementary Material online). The Mahalanobis test statistic was calculated to summarize all test statistics (M ), or the four sensitive test statistics (M ). Each box represents the results of 100 1 2 replicate analyses. diversity statistic is consistently sensitive to long terminal consistency index. The result of focusing on these statistics branches, particularly in simulation scenarios involving can be observed when comparing the summary sequences with lengths of 5,000 nucleotides (fig. 3e). In Mahalanobis distance including all eight of the other test sta- data sets with shorter sequences (200 and 1,000 nucleotides), tisticsexaminedhere (M ) with that including only the four two test statistics are consistently sensitive to long terminal most sensitive statistics (M ). In most simulation scenarios, M 2 1 branches, but with low sensitivity (SDPD between 1 and 1; and M are two of the most sensitive statistics, and M is 2 2 supplementary figs. S5j–S5o and S6j–S6o, Supplementary always more sensitive than M (with the exception of scenar- Material online). These were the multinomial likelihood and ios in which the correct model is used). This shows that the confidence interval in branch support. Interestingly, we statistics with low sensitivity should be excluded. find that inference-based statistics are generally very lenient. Meanwhile, examining M and the four informative test sta- This is possibly because inferences from predictive data sets tistics can provide a useful general method for examining are similar to those from the original data (but see Brown model performance. 2014). We propose an approach to model assessment that con- New Thresholds for Assessment siders the test statistics that have measurable sensitivity to phylogenetic inferences with low accuracy and precision: Analyses of phylogenomic data sets from turtles, birds, and the X , multinomial likelihood, biochemical diversity,and mammals show the critical importance of using thresholds for 1384 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Number of standard deviations from mean of predictive distribution −5 0 5 10 15 0 500 1000 1500 2000 −2 −1 0 1 2 χ 2 Multinomial likelihood Biochemical diversity Consistency Index Branch support CI in branch support Tree length M1 M2 −10 −5 0 5 10 15 −5 0 5 10 −50 0 50 100 150 200 χ 2 Multinomial likelihood Biochemical diversity Consistency Index Branch support CI in branch support Tree length M1 M2 Phylogenomic model adequacy test statistics GBE Mahalanobis distance Mean branch support Total tree length Number of varaible sites (a)(b)(c)(d) −1e−13 0e+00 1e−13 −1e−13 0e+00 1e−13 −1e−13 0e+00 1e−13 −1e−13 0e+00 1e−13 −50 0 50 100 150 −50 0 50 100 150 −50 0 50 100 150 −50 0 50 100 150 −5 0 5 −5 0 5 −5 0 5 −5 0 5 MDS dimension 1 FIG.4.—Estimated two-dimensional representation of tree-space for samples of loci from turtles, birds, and mammals. Data are colored such that warmer colors indicate higher values of (a) Mahalanobis distance (M ), (b) mean branch support, (c) tree length, and (d) number of variable sites. High values of these variables occur in similar locations of tree-space. In the sequence data from turtles, loci with high values of M are not necessarily the same as those that have high values for the other variables (see supplementary fig. S7, Supplementary Material online). assessment that consider sequence lengths. We find strik- covarion-like, for determining the thresholds of X , multino- ing evidence in these data that highly informative align- mial likelihood, biochemical diversity,and consistency index, ments have poorer perceived model adequacy. Loci with respectively. For deriving the threshold for M ,we used the poor perceived model adequacy (the highest SDPD for M ) median values of simulations of a covarion-like process with yield gene-tree estimates that are similar to those of loci long terminal branches. that yield high branch support and long branches, and In analyses of phylogenomic data from birds and mam- that contain a large number of variable sites (fig. 4). In mals, the new thresholds have an association with esti- simulations and empirical data sets, we find that loci mates of the tree topology, in particular when model with gene-tree estimates that are likely to be inaccurate assessment is made using X and biochemical diversity and imprecise have consistently good perceived model ad- (fig. 5). In these data, the model is rejected for loci in equacy (low M ), low mean branch support, short trees, regions of tree-space with lower branch support and and few variable sites (supplementary figs. S7–S8; shorter trees. These results suggest that assessment of Supplementary Material online). model adequacy might be more meaningful for data sets We propose thresholds of model assessment based on the with longer sequences, since the mean number of sites is median value of test statistics under the scenarios to which greater in the data from birds (741) than in the data from test statistics are sensitive (supplementary fig. S9, mammals (480) and turtles (200). Similarly, these results Supplementary Material online). We use the scenarios of show that X and biochemical diversity statistics might strong compositional heterogeneity, model underparameteri- provide the most informative tests out of those that we zation, covarion-like process with long terminal branches, and have explored. Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1385 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 MDS dimension 2 Mammals Birds Turtles −5 0 5 −50 0 50 −1e−13 0e+00 1e−13 ^ Duchene et al. GBE (a)(b)(c)(d)(e) FIG.5.—Estimated two-dimensional representation of tree-space for loci from turtles, birds, and mammals. Data are colored according to whether each locus passes (black) or fails (red) each of the five tests of model adequacy using our new thresholds for assessment. framework. These sensitive statistics include X , multinomial Discussion likelihood, biochemical diversity,and consistency index.We Assessing model adequacy can provide important insights have also shown the performance of the assessment using alongside commonly used methods of model selection based these four statistics simultaneously in a multivariate setting, on information criteria or Bayes factors. By assessing model using a statistic that we denote M . We focus on these five adequacy, it is possible to consider that even the best-fitting test statistics for deriving meaningful thresholds for assess- model can provide a poor description of the process that ment, which we hope can provide information about generated the data (Gelman et al. 2014). This can be partic- whether the model is likely to yield misleading inferences ularly useful when examining phylogenomic data sets, be- of tree topology or branch lengths. Exploring the power of cause a subset of the data can be selected to maximize the novel test statistics for identifying misleading inferences will reliability of the model and inferences. Here, we compared be an important avenue of research. For example, entropy the sensitivity of methods of assessing model adequacy to a metrics have been implemented for model assessment in a diverse range of scenarios that can lead to biased inferences Bayesian framework (Brown 2014), and could be adapted of tree topology and branch lengths. We show that the var- for use in a maximum-likelihood setting by metrics that ious test statistics for assessment differ in their sensitivity to compare the empirical and predictive data to another null biased inferences. We find that some test statistics can be reference data set (Lewis et al. 2014). Similarly, more exten- highly lenient, whereas others can be conservative. Our results sive simulation frameworks could lead to additional insights also support previous evidence that sequence length has a into the power of model assessment for detecting other critical bearing on the perceived adequacy of a model potential sources of bias, such as tree imbalance or long- (Duchene et al. 2017). This phenomenon occurs because branch attraction. when longer sequences are analysed, the distribution of test Our results emphasize the importance of defining thresh- statistics calculated for predictive data is narrower. As a con- olds based on sequence length. In our simulation study and sequence, long sequences can appear to have a greater dis- our analyses of three phylogenomic data sets, we find that crepancy from predictive data compared with short the loci that yield estimates with high accuracy are usually sequences (Goldman 1993). long, such that they would be rejected using traditional Using these insights, we have proposed thresholds for as- thresholds for assessing model adequacy (supplementary sessment that are specific to the most sensitive test statistics, figs. S1–S6, S8; Supplementary Material online). Critically, implemented using a fast maximum-likelihood optimization 1386 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 Phylogenomic model adequacy test statistics GBE the poor perceived model adequacy in analyses of long Supplementary Material sequences might be exacerbated in empirical data, since the Supplementary data are available at Genome Biology and model is a greater simplification of the evolutionary process in Evolution online. these data compared with our simulations. Nonetheless, it seems unreasonable to reject the model for these data, since a simple model can be sufficient for estimating the parame- ters of interest accurately and precisely (Steel 2005). This leads Acknowledgments us to propose thresholds for model assessment that are rela- This work was supported by funding from the Australian tively lenient when applied to data generated by simulation, Research Council to D.A.D. and S.Y.W.H. (grant yet can reject the model in most cases when inferences are DP160104173). S.D. was supported by a McKenzie misleading. In analyses of phylogenomic data sets from birds Fellowship from the University of Melbourne. We acknowl- and mammals, we find that our methods of assessment can edge the University of Sydney for providing high-performance reject a commonly used substitution model for loci that lead computing resources that have contributed to the research to inferences with anomalous tree topologies and low statis- results reported within this paper. tical support. We also find that assessing model adequacy can be difficult for short loci. In the phylogenomic data from bird families, loci Literature Cited that were model-adequate according to our thresholds Anisimova M, Gascuel O. 2006. Approximate likelihood-ratio test for yielded congruent estimates of the tree topology, with strong branches: a fast, accurate, and powerful alternative. Syst Biol. 55(4):539–552. statistical support. The data from birds comprised the longest Anisimova M, Gil M, Dufayard J-F, Dessimoz C, Gascuel O. 2011. Survey of loci of the three phylogenomic data sets that we investigated. branch support methods demonstrates accuracy, power, and robust- These data also produced the most intuitive results, consistent ness of fast likelihood-based approximation schemes. Syst Biol. with a previous study that also identified that model-adequate 60(5):685–699. loci lead to congruent phylogenetic inferences (Doyle et al. Bollback JP. 2002. Bayesian model adequacy and choice in phylogenetics. Mol Biol Evol. 19(7):1171–1180. 2015). Strikingly, when using the phylogenomic data set with Brown JM. 2014. Detection of implausible phylogenetic inferences using the shortest sequences, based on ultraconserved elements posterior predictive assessment of model fit. Syst Biol. 63(3):334–348. from families of turtles, model assessment was less meaning- Burnham KP, Anderson DR. 2002. Model selection and multimodel infer- ful. In these data, model-adequate loci yielded phylogenetic ence: a practical information-theoretic approach. 2nd edn. New York: estimates that were similar to those from loci for which the Springer Crawford NG, et al. 2015. A phylogenomic analysis of turtles. Mol model was rejected. These results are congruent with those of 2 Phylogenet Evol. 83:250–257. a previous study of the X statistic in phylogenomic data from Doyle VP, Young RE, Naylor GJP, Brown JM. 2015. Can we identify genes birds, which showed that only some of the shortest loci were with increased phylogenetic reliability? Syst Biol. 64(5):824–837. deemed model-inadequate and risked causing phylogenetic Drummond AJ, Suchard MA. 2008. Fully Bayesian tests of neutrality using bias due to compositional heterogeneity (Duchene et al. genealogical summary statistics. BMC Genet. 9:68. Duchene DA, et al. 2018. Analysis of phylogenomic tree space resolves 2017). relationships among marsupial Families. Syst Biol. 67:400–412. doi: Tests of model adequacy are frequently reliant on evaluat- 10.1093/sysbio/syx076. ing multiple test statistics and can be too lenient or conserva- ^ ^ Duchene DA, Duchene S,Ho SYW.2017. New statistical criteria detect tive, such that it is difficult to interpret the phylogenetic phylogenetic bias caused by compositional heterogeneity. Mol Biol performance of the model. The thresholds for assessment Evol. 34(6):1529–1534. ^ ^ Duchene DA, Duchene S, Holmes EC, Ho SYW. 2015. Evaluating the provided here should lead to more effective identification of adequacy of molecular clock models using posterior predictive simu- models that are potentially misspecified. Nevertheless, assess- lations. Mol Biol Evol. 32(11):2986–2995. ing model performance remains difficult for data sets com- Duchene S, Di Giallonardo F, Holmes EC. 2016. Substitution model ade- prising short sequences. Development of novel test statistics quacy and assessing the reliability of estimates of virus evolutionary should be accompanied by simulation studies that provide rates and time scales. Mol Biol Evol. 33(1):255–267. Fiala KL, Sokal RR. 1985. Factors determining the accuracy of cladogram intuitive thresholds, based on the impact of model violation estimation: evaluation using computer simulation. Evolution on inferences of interest. Other potential avenues of research 39(3):609–622. include developing more summary metrics and graphics Fitch WM. 1971. Rate of change of concomitantly variable codons. J Mol of model adequacy across the genome; assessing genome- Evol. 1(1):84–96. wide models of inference, such as those developed for Fitch WM, Markowitz E. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation assessing the multispecies coalescent (Reid et al. 2014); or of mutations in evolution. Biochem Genet. 4:579–593. assessment for quantifying sequence information under the Foster PG. 2004. Modeling compositional heterogeneity. Syst Biol. model (e.g., Klopfstein et al. 2017). Together, these advances 53(3):485–495. will improve the reliability of phylogenomic inferences from Galtier N. 2001. Maximum-likelihood phylogenetic analysis under a sequence data. covarion-like model. Mol Biol Evol. 18(5):866–873. Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 1387 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018 ^ Duchene et al. GBE Gelman A, Carlin JB, Stern HS, Rubin DB. 2014. Bayesian data analysis. O’Hagan A. 2003. HSSS model criticism (with discussion). In: Green P, Boca Raton (FL): Taylor & Francis. Hjiort N, Richardson S, editors. Highly structured stochastic systems. Goldman N. 1993. Statistical tests of models of DNA substitution. J Mol Oxford (UK): Oxford University Press. p. 423–453. Evol. 36(2):182–198. Penny D, Hendy MD. 1985. The use of tree comparison metrics. Syst Zool. Guindon S, et al. 2010. New algorithms and methods to estimate 34(1):75–82. maximum-likelihood phylogenies: assessing the performance of Penny D, McComish BJ, Charleston MA, Hendy MD. 2001. Mathematical PhyML 3.0. Syst Biol. 59(3):307–321. elegance with biochemical realism: the covarion model of molecular Hillis DM, Heath TA, St John K. 2005. Analysis and visualization of tree evolution. J Mol Evol. 53(6):711–723. space. Syst Biol. 54(3):471–482. Phillips MJ. 2009. Branch-length estimation bias misleads molecular Ho SYW, Jermiin L. 2004. Tracing the decay of the historical signal in dating for a vertebrate mitochondrial phylogeny. Gene 441(1– biological sequence data. Syst Biol. 53(4):623–637. 2):132–140. Ho ¨ hna S, Drummond AJ. 2012. Guided tree topology proposals for Pisani D, et al. 2015. Genomic data do not support comb jellies as the sister Bayesian phylogenetic inference. Syst Biol. 61(1):1–11. group to all other animals. Proc Natl Acad Sci U S A. Ho ¨ hna S, May MR, Moore BR. 2016. TESS: an R package for efficiently 112(50):15402–15407. simulating phylogenetic trees and performing Bayesian inference of Posada D, Crandall KA. 2001. Selecting the best-fit model of nucleotide lineage diversification rates. Bioinformatics 32(5):789–791. substitution. Syst Biol. [Internet] 50:580–601. Jarvis ED, et al. 2014. Whole-genome analyses resolve early branches Prum RO, et al. 2015. A comprehensive phylogeny of birds (Aves) using in the tree of life of modern birds. Science 346(6215): targeted next-generation DNA sequencing. Nature 1320–1331. 526(7574):569–573. Jayaswal V, Wong TKF, Robinson J, Poladian L, Jermiin LS. 2014. Mixture Rabosky DL, Glor RE. 2010. Equilibrium speciation dynamics in a model models of nucleotide sequence evolution that account for heteroge- adaptive radiation of island lizards. Proc Natl Acad Sci U S A. neity in the substitution process across sites and across lineages. Syst 107(51):22178–22183. Biol. 63(5):726–742. Reddy S, et al. 2017. Why do phylogenomic data sets yield conflicting Jermiin L, Ho SYW, Ababneh F, Robinson J, Larkum AW. 2004. The biasing trees? Data type influences the avian Tree of Life more than taxon effect of compositional heterogeneity on phylogenetic estimates may sampling. Syst Biol. [Internet]. 66(5):857–879. be underestimated. Syst Biol. 53(4):638–643. Reid NM, et al. 2014. Poor fit to the multispecies coalescent is widely Jukes TH, Cantor CR. 1969. Evolution of protein molecules. In: Munro H, detectable in empirical data. Syst Biol. 63(3):322–333. editor. Mammalian protein metabolism. New York: Academic Press. p. Ripplinger J, Sullivan J. 2010. Assessment of substitution model adequacy 21–132. using frequentist and Bayesian methods. Mol Biol Evol. Klopfstein S, Massingham T, Goldman N. 2017. More on the best evolu- 27(12):2790–2803. tionary rate for phylogenetic analysis. Syst Biol. 66:769–785. doi: Robinson DF, Foulds LR. 1981. Comparison of phylogenetic trees. Math 10.1093/sysbio/syx051. Biosci. 53(1-2):131–147. Kluge AG, Farris JS. 1969. Quantitative phyletics and the evolution of Schliep KP. 2011. PHANGORN: phylogenetic analysis in R. Bioinformatics anurans. Syst Biol. 18(1):1–32. 27(4):592–593. Lartillot N, Brinkmann H, Philippe H. 2007. Suppression of long-branch Shen X-X, Hittinger CT, Rokas A, Minh BQ, Braun EL. 2017. Contentious attraction artefacts in the animal phylogeny using a site- relationships in phylogenomic studies can be driven by a handful of heterogeneous model. BMC Evol Biol. 7(suppl 1):S4. genes. Nat Ecol Evol. 1(5):126. Lemmon AR, Moriarty EC. 2004. The importance of proper model as- Slater GJ, Pennell MW. 2014. Robust regression and posterior predictive sumption in Bayesian phylogenetics. Syst Biol. 53(2):265–277. simulation increase power to detect early bursts of trait evolution. Syst Lewis PO, Xie W, Chen M-H, Fan Y, Kuo L. 2014. Posterior predictive Biol. 63(3):293–308. Bayesian phylogenetic model selection. Syst Biol. 63(3):309–321. Springer MS, Gatesy J. 2016. The gene tree delusion. Mol Phylogenet Evol. Liu L, Xi Z, Wu S, Davis CC, Edwards SV. 2015. Estimating phyloge- 94(Pt A):1–33. netic trees from genome-scale data. Ann N Y Acad Sci. 1360: Steel M. 2005. Should phylogenetic models be trying to “fit an elephant”? 36–53. Trends Genet. 21(6):307–309. Lockhart PJ, Howe CJ, Bryant DA, Beanland TJ, Larkum AWD. 1992. Tavar e S. 1986. Some probabilistic and statistical problems in the analysis Substitutional bias confounds inference of cyanelle origins from se- of DNA sequences. Lect Math Life Sci. 17:57–86. quence data. J Mol Evol. 34:153–162. Timme RE, Bachvaroff TR, Delwiche CF. 2012. Broad phylogenomic Mahalanobis PC. 1936. On the generalised distance in statistics. Proc Natl sampling and the sister lineage of land plants. PLoS One Inst Sci India 12:49–55. 7(1):e29696. Mardia KV, Kent JT, Bibby JM. 1979. Multivariate analysis. New York: Wertheim JO, Sanderson MJ, Worobey M, Bjork A. 2010. Relaxed molec- Academic Press. ular clocks, the bias-variance trade-off, and the quality of phylogenetic Matsen FA. 2006. A geometric approach to tree shape statistics. Syst Biol. inference. Syst Biol. 59(1):1–8. 55:652–661. Yang Z. 1993. Maximum-likelihood estimation of phylogeny from DNA Meredith RW, et al. 2011. Impacts of the cretaceous terrestrial revolution sequences when substitution rates differ over sites. Mol Biol Evol. and KPg extinction on mammal diversification. Science 10(6):1396–1401. 334(6055):521–524. Zhou X, et al. 2012. Phylogenomic analysis resolves the interordinal rela- Misof B, et al. 2014. Phylogenomics resolves the timing and pattern of tionships and rapid diversification of the laurasiatherian mammals. Syst insect evolution. Science 346(6210):763–767. Biol. 61(1):150–164. Murphy WJ, et al. 2001. Molecular phylogenetics and the origins of pla- cental mammals. Nature 409(6820):614–618. Associate editor: David Bryant 1388 Genome Biol. Evol. 10(6):1375–1388 doi:10.1093/gbe/evy094 Advance Access publication May 18, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/6/1375/4999383 by Ed 'DeepDyve' Gillespie user on 16 June 2018

Journal

Genome Biology and EvolutionOxford University Press

Published: May 18, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off