Theoretical Foundation of the RelTime Method for Estimating Divergence Times from Variable Evolutionary Rates

Theoretical Foundation of the RelTime Method for Estimating Divergence Times from Variable... Abstract RelTime estimates divergence times by relaxing the assumption of a strict molecular clock in a phylogeny. It shows excellent performance in estimating divergence times for both simulated and empirical molecular sequence data sets in which evolutionary rates varied extensively throughout the tree. RelTime is computationally efficient and scales well with increasing size of data sets. Until now, however, RelTime has not had a formal mathematical foundation. Here, we show that the basis of the RelTime approach is a relative rate framework (RRF) that combines comparisons of evolutionary rates in sister lineages with the principle of minimum rate change between evolutionary lineages and their respective descendants. We present analytical solutions for estimating relative lineage rates and divergence times under RRF. We also discuss the relationship of RRF with other approaches, including the Bayesian framework. We conclude that RelTime will be useful for phylogenies with branch lengths derived not only from molecular data, but also morphological and biochemical traits. divergence times, timetree, evolution, theory Introduction The inference of divergence times is accomplished by assuming a constant rate throughout the tree (a strict molecular clock) or by relaxing the strict molecular clock (Ho and Duchêne 2014; dos Reis et al. 2016; Kumar and Hedges 2016). Bayesian approaches are widely applied to relaxed clock dating analyses. They require specification of a probability distribution of evolutionary rates in the tree (e.g., lognormal distribution), as well as an assumption of the presence or absence of rate autocorrelation among branches. In contrast, the RelTime approach does not require specification of such priors, and produces relative node ages that can then be transformed into absolute dates by using calibration constraints for one or more nodes (Tamura et al. 2012, 2013). RelTime performs well for estimating divergence times in analyses of many large empirical data sets (Mello et al. 2017) and simulated data sets (Tamura et al. 2012; Filipski et al. 2014). RelTime’s computational speed (fig. 1) and accuracy have led to its use for estimating divergence time for large data sets (Tamura et al. 2012; Mahler et al. 2013; Bond et al. 2014; Bonaldo et al. 2016). However, a mathematical foundation for the RelTime method has not yet been provided, which is needed not only to understand basic properties and assumptions of RelTime, but also to reveal its relationship with other molecular dating methods (Ho and Duchêne 2014; dos Reis et al. 2016; Kumar and Hedges 2016). In the following, we present the theoretical foundation of the RelTime method. We also assess the absolute performance of RelTime and compare it with other methods by analyzing data generated using computer simulations in which sequences were evolved according to three different branch rate models: independent (Drummond et al. 2006), autocorrelated (Kishino et al. 2001), and hybrid (Beaulieu et al. 2015). Fig. 1. View largeDownload slide (a) Computational time taken by RelTime and MCMCTree (Bayesian method) to estimate divergence times for data sets containing increasing number of sequences (n). The tested sequence alignment consisted of 4,493 sites in which sequence evolution was simulated with extensive rate variation (RR50 data from Tamura et al. 2012). RelTime’s speed advantage increases with data volume by O(n2). (b) Distribution of computation speed ratio of RelTime to MCMCTree for 70 data sets, each containing 100 ingroup sequences that were simulated as described in the Materials and Methods. Fig. 1. View largeDownload slide (a) Computational time taken by RelTime and MCMCTree (Bayesian method) to estimate divergence times for data sets containing increasing number of sequences (n). The tested sequence alignment consisted of 4,493 sites in which sequence evolution was simulated with extensive rate variation (RR50 data from Tamura et al. 2012). RelTime’s speed advantage increases with data volume by O(n2). (b) Distribution of computation speed ratio of RelTime to MCMCTree for 70 data sets, each containing 100 ingroup sequences that were simulated as described in the Materials and Methods. Mathematical Theory Theoretical Analysis for a Phylogeny with Three Taxa and an Outgroup We begin with the simplest case in which the phylogeny contains a clade with three ingroup taxa (subtree at node 5) and one outgroup taxon (fig. 2a). In this tree, branch lengths b1 and b2 represent the amount of evolutionary change that has occurred in lineages emanating from node 4 and leading to taxon 1 and taxon 2, respectively. We assume that taxon 1 and 2 are sampled at the same evolutionary time (t1 = 0 and t2 = 0), which is usually the case in phylogenetic analysis of data sampled from living species. The contemporaneous sampling of data produces sampling times equal to 0, which serve as calibration points (Tamura et al. 2012). By using branch lengths in this phylogeny (b’s), we can estimate the relative evolutionary rates (r’s) for all the lineages as well as relative divergence times (t’s). Here, a lineage refers to a branch and all the taxa (and branches) in the descendant subtree. For example, lineage a in figure 2a contains two taxa (1 and 2) and three branches with lengths b1, b2, and b4. Fig. 2. View largeDownload slide The relative rate framework for RelTime method. (a) A tree containing three ingroup sequences with an outgroup. Branch lengths are bi’s and lineage rates are ri’s. La = b4 + ½(b1 + b2). (b) The case of four ingroup sequences with an outgroup. Here, La = b5 + ½(b1 + b2). Lb = b6 + ½(b3 + b4), when using the arithmetic mean. See figure 3 and its legend for a simple procedure outlining the calculation of relative rates using RRF. Fig. 2. View largeDownload slide The relative rate framework for RelTime method. (a) A tree containing three ingroup sequences with an outgroup. Branch lengths are bi’s and lineage rates are ri’s. La = b4 + ½(b1 + b2). (b) The case of four ingroup sequences with an outgroup. Here, La = b5 + ½(b1 + b2). Lb = b6 + ½(b3 + b4), when using the arithmetic mean. See figure 3 and its legend for a simple procedure outlining the calculation of relative rates using RRF. The following system of equations formalizes the RelTime approach mathematically by linking relative rates for lineages (ri) with branch lengths (bi) in figure 2a. Here,, r1/r2=b1/b2, (1) r3/ra=b3/La, (2) where La is depth of node 5 on lineage a that contains taxon 1 and 2; La = b4 + ½(b1 + b2). We set: ra=½(r1+r2), (3) r0=½(ra+r3). (4) The setting of equalities in equations (3) and (4) leads to preference for the minimum rate change between the lineage originating at node 5 and the descendant lineage originating at node 4. The selection of lineage rate ra will be constrained by rate ratios imposed by equations (1) and (2), which relaxes the strict molecular clock. Because all the lineage rates are relative, one unknown is reduced by setting the group rate at the most recent common ancestor of the ingroup node to be 1, that is, r0=1. (5) We solve for each ri by using equations (1–5) and get: r1=4b1(b1+b2+2b4)/[(b1+b2)(b1+b2+2b3+2b4)], (6) r2=4b2(b1+b2+2b4)/[(b1+b2)(b1+b2+2b3+2b4)], (7) r3=4b3/(b1+b2+2b3+2b4), (8) ra=2(b1+b2+2b4)/(b1+b2+2b3+2b4). (9) The estimates of relative lineage rates and lengths produce an ultrametric tree with relative times for node 4 (t4) and node 5 (t5): t4=(b1+b2)(b1+b2+2b3+2b4)/4(b1+b2+2b4), (10) t5=(b1+b2+2b3+2b4)/4. (11) The above equations (6–11) constitute the relative rate framework (RRF) for phylogenies with three taxa and one outgroup. These equations yield point estimates for lineage rates and node ages. Because branch lengths have variances, the resulting lineage rate estimates have variances. The variance of relative node ages will be a function of branch length variances and the variance contributed by evolutionary rate differences among branches (see Discussion). Theoretical Analysis for Phylogenies Containing Four Taxa and an Outgroup To estimate six evolutionary rates (r1–r4, ra, and rb) using six branch length estimates (b1–b6) for a phylogeny containing four ingroup taxa (fig. 2b) we write a set of seven equations: r1/r2=b1/b2, (12) r3/r4=b3/b4, (13) ra=½(r1+r2), (14) rb=½(r3+r4), (15) r0=½(ra+rb), (16) ra/rb=La/Lb, (17) r0=1. (18) Here, La=b5+½(b1+b2) and Lb=b6+½(b3+b4). Equations (14–16) above lead to a preference for minimum changes in rates between ancestor–descendant lineage pairs. In this system, ri ’s are not required to be equal to one another and the rate assignments are constrained by the rate ratios among lineages (eqs. 12, 13, and 17), which relaxes the strict molecular clock. We solve for each ri from equations (12–18) to estimate relative rates and node ages. r1=4b1(b1+b2+2b5)/[(b1+b2)(b1+b2+b3+b4+2b5+2b6)], (19) r2=4b2(b1+b2+2b5)/[(b1+b2)(b1+b2+b3+b4+2b5+2b6)], (20) r3=4b3(b3+b4+2b6)/[(b3+b4)(b1+b2+b3+b4+2b5+2b6)], (21) r4=4b4(b3+b4+2b6)/[(b3+b4)(b1+b2+b3+b4+2b5+2b6)], (22) ra=2(b1+b2+2b5)/(b1+b2+b3+b4+2b5+2b6), (23) rb=2(b3+b4+2b6)/(b1+b2+b3+b4+2b5+2b6). (24) The estimates of relative node ages t5, t6, t7 for nodes 5, 6, and 7, respectively, are: t5=(b1+b2)(b1+b2+b3+b4+2b5+2b6)/[4(b1+b2+2b5)], (25) t6=(b3+b4)(b1+b2+b3+b4+2b5+2b6)/[4(b3+b4+2b6)], (26) t7=(b1+b2+b3+b4+2b5+2b6)/4. (27) The above equations (19–27) establish RRF of the RelTime approach for a tree containing four taxa and one outgroup. As previously mentioned, point estimates of node ages and lineage rates have variances because branch lengths have variances and because evolutionary rates are not equal among lineages (see Discussion). Relative Rate Framework with Geometric Means In both the original RelTime approach (Tamura et al. 2012) and the mathematical formulations above, we considered an arithmetic mean when averaging branch lengths to minimize evolutionary rate changes. This approach does not assume an equal rate, but is rather a natural way to calculate node depths by averaging branch lengths. We have now developed analytical formulas for an alternative RRF using the geometric mean that balances the rate changes between two descendant lineages. For example, if b1 = 1 and b2 = 4 in figure 2a, then the arithmetic mean will give 2.5. Thus, evolutionary rate r1 is 2.5 times slower and r2 is 1.6 times faster when compared with the average rate. The difference in rate change (between 2.5 and 1.6, in the present case) becomes larger as the difference between b1 and b2 becomes larger when using the arithmetic mean. In contrast, the geometric mean would give 2.0, which results in two times slower rate in b1 and two times faster rate in b2, as compared with the average rate. That is, the difference in rate between the ancestral and descendant lineages is always equal for sister lineages when using the geometric mean, which is not the case if the arithmetic mean is used. Using a geometric means approach, we obtain the following analytical formulas for a phylogeny containing three ingroup taxa (fig. 2a): r1=b1b1b2+b4/b2b3, (28) r2=b2b1b2+b4/b1b3, (29) r3=b3/b1b2+b4, (30) ra=b1b2+b4/b3, (31) t4=b1b2b3/b1b2+b4, (32) t5=b3b1b2+b4. (33) For the 4-taxon case in figure 2b, the equations are as follows: r1=b1b1b2+b5/b2b3b4+b6, (34) r2=b2b1b2+b5/b1b3b4+b6, (35) r3=b3b3b4+b6/b4b1b2+b5, (36) r4=b4b3b4+b6/b3b1b2+b5, (37) ra=b1b2+b5/b3b4+b6, (38) rb=b3b4+b6/b1b2+b5, (39) t5=b1b2b3b4+b6/b1b2+b5, (40) t6=b3b4b1b2+b5/b3b4+b6, (41) t7=b1b2+b5b3b4+b6. (42) Relative Rate Framework for a General Case Next, we consider a general case of a phylogeny with more than four ingroup taxa. In this case, RelTime applies RRF with a bottom-up approach, starting from the tips (external branches) of the phylogeny and moving toward the root. In the first step, we generate local relative lineage rates for the subtrees containing three and four taxa (subtree x and y, respectively) in a phylogeny containing eight taxa and an outgroup (fig. 3). Subtree y contains four taxa, so we apply equations (34–39) to generate relative rates. Equations (28–31) will be used to estimate rates for clade x which contains three ingroup taxa. Fig. 3. View largeDownload slide Calculating relative lineage rates in a phylogeny. (a) A phylogeny containing eight ingroup sequences (taxa) that consists of three taxa nodes (x and w) and four taxa nodes (y and z). (b) Reduced phylogeny and branch lengths after applying RRF to nodes x and y. (c) Final phylogeny after applying RRF to node z in panel b, which produces node w in a three taxa configuration. After applying RRF to node w, a preorder traversal scales descendant lineage rates by multiplying them by their ancestral lineage rates to generate final relative rates. The ingroup root node (w) has an average relative lineage rate equal to 1. Multisequence taxa are indicated by Roman numerals. Fig. 3. View largeDownload slide Calculating relative lineage rates in a phylogeny. (a) A phylogeny containing eight ingroup sequences (taxa) that consists of three taxa nodes (x and w) and four taxa nodes (y and z). (b) Reduced phylogeny and branch lengths after applying RRF to nodes x and y. (c) Final phylogeny after applying RRF to node z in panel b, which produces node w in a three taxa configuration. After applying RRF to node w, a preorder traversal scales descendant lineage rates by multiplying them by their ancestral lineage rates to generate final relative rates. The ingroup root node (w) has an average relative lineage rate equal to 1. Multisequence taxa are indicated by Roman numerals. In the next step, we consider the parent clade (z) that has four taxa: composite taxon I consisting of two taxa (1 and 2), taxon II consisting of one taxon (3), composite taxon III consisting of two taxa (4 and 5), and composite taxon IV containing two taxa (6 and 7) (fig. 3b). We estimate branch lengths (bI, bII, bIII, and bIV) by using the geometric means: bI=b1b2+b9, bII=b3, bIII=b4b5+b10, and bIV=b6b7+b11. Then, we use equations (34–39) to compute relative rates for all the lineages using these branch lengths. In bigger phylogenies, this process is carried out for every internal node in a postorder traversal. In the current example, node w is the common ancestor of all the ingroup taxa and is in a 3-taxon configuration (fig. 3c). We now have bV=bIbII+bx and bVI=bIIIbIV+by and apply equations (28–31). At this stage, we have local (relative) lineage rate estimates for nodes in the ingroup tree. Finally, all the rates in the tree are computed by multiplying descendant lineage rates by their respective ancestral lineage rates in preorder traversal to generate final relative rates such that the ingroup root node (e.g., node w in fig. 3c) has an average relative lineage rate equal to 1. Results We evaluated the performance of RRF for correctly estimating lineage rates and divergence times by analyzing data generated using computer simulation in which sequences were evolved according to the autocorrelated rate (AR) model (Kishino et al. 2001), the independent rate (IR) model (Drummond et al. 2006), as well as a model that contains multiple distributions of rates (hybrid rates [HR]). We present results from the analysis of a collection of small data sets (three ingroup sequences; AR and IR models) and two collections of large data sets: one containing 100 ingroup sequences (AR and IR models) and another containing 91-ingroup sequences (HR model). Analysis of Small Data Sets (Phylogeny with Three Ingroup Sequences) We first tested the accuracy of divergence times estimated via RRF by analyzing data sets containing three ingroup sequences. We used RRF with geometric means and compared the modeled (true) values with estimated lineage rates as well as node ages (fig. 4). The lineage rates produced by RRF were similar to the true rates for data sets that were evolved under the AR model (fig. 4a); the relationship showed a linear regression slope of 0.99. RRF estimates for external lineages, were similarly good (blue circles in fig. 4a, r2 = 0.68). The relationship was also strong for internal branches, but with greater dispersion (red circles in fig. 4a, r2 = 0.29). RRF performance for AR data sets was similar to that observed for IR data sets (fig. 4b). Overall, the success in estimating lineage-specific evolutionary rates translates into robust estimates of relative times for AR (fig. 4c) and IR data sets (fig. 4d). Fig. 4. View largeDownload slide Performance of RRF and MCMCTree Bayesian analyses for three ingroup sequences with an outgroup (topology in fig. 2a). RRF lineage rate estimates are compared with the true lineage rate estimates for sequences that evolved under (a) autocorrelated and (b) independent rate models. Blue circles represent external lineages (single taxon, r1, r2, and r3) and red circles represent the internal lineage (ra). RRF estimates of divergence times for (c) autocorrelated rate and (d) independent rate data sets. Bayesian (MCMCTree) estimates of branch rates are compared with the true branch rates for sequences evolved under (e) autocorrelated and (f) independent rate models. Blue circles indicate external branches (r1, r2, and r3) and red circles show the internal branch (r4). Bayesian estimates of divergence times for (g) autocorrelated rate and (h) independent rate data sets. Each panel contains results from 50 simulated data sets. All rates and divergence time estimates are normalized to allow direct comparison between true and estimated values. Regression slope and correlation coefficient (r2) are shown for each panel. Fig. 4. View largeDownload slide Performance of RRF and MCMCTree Bayesian analyses for three ingroup sequences with an outgroup (topology in fig. 2a). RRF lineage rate estimates are compared with the true lineage rate estimates for sequences that evolved under (a) autocorrelated and (b) independent rate models. Blue circles represent external lineages (single taxon, r1, r2, and r3) and red circles represent the internal lineage (ra). RRF estimates of divergence times for (c) autocorrelated rate and (d) independent rate data sets. Bayesian (MCMCTree) estimates of branch rates are compared with the true branch rates for sequences evolved under (e) autocorrelated and (f) independent rate models. Blue circles indicate external branches (r1, r2, and r3) and red circles show the internal branch (r4). Bayesian estimates of divergence times for (g) autocorrelated rate and (h) independent rate data sets. Each panel contains results from 50 simulated data sets. All rates and divergence time estimates are normalized to allow direct comparison between true and estimated values. Regression slope and correlation coefficient (r2) are shown for each panel. Bayesian methods produce branch-specific rates under a given statistical distribution of rates (e.g., lognormal), which differs from RRF where relative lineage rates are considered. Therefore, we compared the true values of branch rates with the branch rates derived using a Bayesian method. We provided correct priors (based on simulation parameters) and conducted the analyses using MCMCTree software (Yang 2007). Bayesian branch rate estimates showed a more diffuse relationship with true rates in internal branches for AR data sets (fig. 4e;r2 = 0.00) as compared with RRF (fig. 4a;r2 = 0.29). Fundamental reasons underlying these patterns are presented in the Discussion section. These patterns notwithstanding, Bayesian estimates of times for AR data sets showed a slope of 0.99 with true time estimates (fig. 4g), which means that node age estimates are generally robust to difficulties in estimating branch-specific rates. This robustness was also seen for IR data sets, where Bayesian branch rates showed a more diffuse relationship with the true rates (fig. 4f), but estimated times showed a slope close to 1 with a high r2 (0.91). Overall, both Bayesian and RRF approaches showed similar or lower accuracy in estimation of rates and node ages for IR data sets as compared with AR data sets, potentially because rate independence requires the estimation of a greater number of free parameters. Analysis of Large Data Sets (Phylogeny with 100 Ingroup Sequences) We next analyzed data sets consisting of 100 ingroup sequences, which were evolved over a range of empirical rate variation parameters. As observed for data sets containing only three ingroup sequences, RRF lineage rate estimates were highly correlated with the true rates for AR data sets (fig. 5a). Lineage rate correlations were generally lower for IR data sets and these correlations were higher for external (tip) lineages (fig. 5b). Importantly, distributions of the slopes of RelTime node age estimates and true times were centered at close to 1 for both AR and IR data sets (fig. 5c). Fig. 5. View largeDownload slide Performance of RRF in the analysis of data sets with 100 ingroup sequences and an outgroup. Fraction of data sets for which RRF inferred lineage rates are correlated with true rates at different levels of correlation for data sets simulated with (a) autocorrelated rates and (b) independent rates. Dotted lines represent external branches and solid lines indicate the internal branches. (c) Distribution of the linear regression slopes of RRF estimates and true times for different data sets. The results presented are from analyses of 35 data sets in which sequences evolved with autocorrelated branch rates (blue lines) and another 35 data sets that were evolved with independent branch rates (red lines). In regression analysis, the intercept was set to zero, because the estimated node age is expected to be zero when the true node age is zero. The regression slopes generated with and without this assumption produced similar patterns. Fig. 5. View largeDownload slide Performance of RRF in the analysis of data sets with 100 ingroup sequences and an outgroup. Fraction of data sets for which RRF inferred lineage rates are correlated with true rates at different levels of correlation for data sets simulated with (a) autocorrelated rates and (b) independent rates. Dotted lines represent external branches and solid lines indicate the internal branches. (c) Distribution of the linear regression slopes of RRF estimates and true times for different data sets. The results presented are from analyses of 35 data sets in which sequences evolved with autocorrelated branch rates (blue lines) and another 35 data sets that were evolved with independent branch rates (red lines). In regression analysis, the intercept was set to zero, because the estimated node age is expected to be zero when the true node age is zero. The regression slopes generated with and without this assumption produced similar patterns. Analysis of Hybrid Rate Data Sets (Phylogeny Containing 91 Ingroup Sequences) We also examined the performance of RRF in an analysis of simulated data from Beaulieu et al. (2015), who simulated two lognormal distributions (hybrid rate model) for an angiosperm phylogeny in which herbaceous clades exhibited higher and more variable evolutionary rates than woody clades (fig. 6a). They reported that one rate model Bayesian methods produced considerably more ancient date estimates for the divergence of herbaceous and woody clades. This overestimation of divergence time became more severe as the difference between the two rates increased (fig. 6b). Application of RRF produced divergence time estimates that were much closer to true times (fig. 6c and d), which shows that RRF can be useful in cases where the rate distribution differs among clades (Smith and Donoghue 2008; Dornburg et al. 2012; Beaulieu et al. 2015) or when clocks are local (Drummond and Suchard 2010; Crisp et al. 2014). Fig. 6. View largeDownload slide (a) Hybrid distribution of rates for branches leading to woody taxa (brown) and herbaceous taxa (green), with the former evolving three times (3×) slower than the latter. (b) Bayesian estimates reported by Beaulieu et al. (2015) when the rate difference between clades was 3-fold (3×, solid line) and 6-fold (6×, dashed line), with the simulated angiosperm age of 140 Ma shown by a red line. RelTime estimates of angiosperm age for Beaulieu et al. (2015)’s alignments with (c) 3× mean rate difference and (d) 6× mean rate difference. Median and SD for age estimates are shown. Beaulieu et al. (2015) simulated 100 replicates (1,000 bases) under a GTR model for each scenario. Bayesian analyses were conducted using a single uncorrelated lognormal rate prior in Beaulieu et al. (2015). The same alignments, topology, and ingroup calibrations were used in RRF analyses. Fig. 6. View largeDownload slide (a) Hybrid distribution of rates for branches leading to woody taxa (brown) and herbaceous taxa (green), with the former evolving three times (3×) slower than the latter. (b) Bayesian estimates reported by Beaulieu et al. (2015) when the rate difference between clades was 3-fold (3×, solid line) and 6-fold (6×, dashed line), with the simulated angiosperm age of 140 Ma shown by a red line. RelTime estimates of angiosperm age for Beaulieu et al. (2015)’s alignments with (c) 3× mean rate difference and (d) 6× mean rate difference. Median and SD for age estimates are shown. Beaulieu et al. (2015) simulated 100 replicates (1,000 bases) under a GTR model for each scenario. Bayesian analyses were conducted using a single uncorrelated lognormal rate prior in Beaulieu et al. (2015). The same alignments, topology, and ingroup calibrations were used in RRF analyses. Furthermore, Tamura et al. (2012) found that RelTime produced accurate time estimates in simulations with a very large number of sequences even when one clade possessed accelerated evolutionary rates, where penalized likelihood methods did not perform as well. In general, we expect that the limitations of on rate model Bayesian analyses will be overcome by local clock methods (Drummond and Suchard 2010; Höhna et al. 2016; Lartillot et al. 2016), but the computational time required to analyze even modestly sized data sets via these approaches can be prohibitive. So, the current RRF approach, which does not assume a specific model for rate variation, may be suitable for such data in its current implementation or as a foundation for future methodological refinements. Discussion We have presented a mathematical foundation for the relative rate framework (RRF) underlying the RelTime method. In the following, we compare RRF with other approaches for estimating divergence times and present the motivation behind RRF. Lineage Rates versus Branch Rates In RRF, evolutionary rate heterogeneity in a phylogeny is considered by comparing rates between sister lineages emanating from internal nodes in a phylogeny. A lineage rate is an average of all the lineage rates that belong to that lineage, including all the descendants (e.g., node x in fig. 3). RRF’s estimation of evolutionary lineage rates is fundamentally different from the comparison and modeling of branch rates in other approaches. For example, the penalized likelihood methods consider differences in rates between ancestral and descendant branches (Sanderson 1997, 2003), and Bayesian methods model branch rates to share a probabilistic distribution (e.g., lognormal distribution). The distinction between lineage and branch rates complicates direct comparisons of evolutionary rates produced by using RelTime and Bayesian methods, except for external (tip) branches for which the lineages consist of only one branch. In this case, RelTime and Bayesian estimates of rates show similar trends (fig. 4, blue circles). This trend was also observed in the analysis of 100 sequence data sets, where the correlation of the estimated external branch rates with true branch rates was high for both RelTime (median correlation = 0.76 and 0.69) and Bayesian (median correlation = 0.72 and 0.86) analyses of AR and IR data sets, respectively. For internal branches, computer simulations showed greater similarity between RRF estimates of lineage rates and the true lineage rates (fig. 4a and b) as compared with the similarity observed between Bayesian estimates of branch rates and the true branch rates (fig. 4e and f, respectively). This was also the case in the analysis of 100 sequence data sets: internal branch rates from Bayesian methods were less strongly correlated with the true rates (median correlation = 0.42 and 0.36 for AR and IR data sets, respectively), as compared with those observed for lineage rates from RRF (median correlation = 0.79 and 0.55 for AR and IR data sets, respectively). These trends are due to the fact that the estimate of a branch rate is a function of two time estimates: one for the ancestral node and another for the descendant node. For example, the variance of the rate on branch with length b4 in figure 2a is a function of the variance of two time estimates (t4 and t5), in addition to the variance of b4. In contrast, the variance of a lineage rate (e.g., ra in fig. 2a) is a function of the variance of only one time estimate (t5), in addition to the lineage depth (La), because the other time point is zero in contemporary sequence sampling. Thus, branch rates are estimated with greater variance than lineage rates, which results in lower correlations seen for Bayesian approaches. Underlying Evolutionary Rate Model in RRF RRF exploits the fact that the estimation of the ratio of lineage rates at any node in a phylogeny is independent of that node’s age. For example, the ratio of evolutionary rates at node 4, r1/r2, does not depend on t4 (Fig. 7a). It is clear that the rate of evolution is higher in the lineage leading from node 4 to taxon 2 than to taxon 1 (r2 > r1), because b2 is longer than b1 in figure 7a. In fact, we can estimate r1/r2 (= b1/b2) without knowing anything about the probability distribution of evolutionary rates throughout the tree. Similarly, the other rate ratio in this tree does not depend on knowledge of the distribution of rates among branches or lineages, it is simply [(b1 + b2)/2 + b4]/b3 when using the arithmetic means and (b1b2+b4)/b3when using the geometric means. Fig. 7. View largeDownload slide A phylogenetic tree of three taxa (1, 2, and 3). (a) Original phylogenetic tree with the observed branch lengths (b’s), which are necessary to estimate node times (t’s) shown in panel (b). Evolutionary trees where the rate for the subtree containing taxon 1 and 2 (s4) is much (c) faster or (d) slower than that of its ancestral branch (r4). Fig. 7. View largeDownload slide A phylogenetic tree of three taxa (1, 2, and 3). (a) Original phylogenetic tree with the observed branch lengths (b’s), which are necessary to estimate node times (t’s) shown in panel (b). Evolutionary trees where the rate for the subtree containing taxon 1 and 2 (s4) is much (c) faster or (d) slower than that of its ancestral branch (r4). However, the node-by-node specification of relative lineage rates is not sufficient to estimate relative times t4 and t5. For that, we need to know the relationship of subtree rate s4 and branch rate r4, where s4 is the overall evolutionary rate of the subtree originating at node 4 (contains taxon 1 and 2) and r4 is the evolutionary rate on branch b4 (fig. 7a). Without assuming a specific distribution of rates, s4/r4 cannot be determined uniquely and t4 can be at any point between 0 and t5. Figure 7c and d present two extreme possibilities. In one, if the subtree rate (s4) is much higher after the divergence event at node 4 (s4 ≫ r4), then the estimate of t4 will be small and the divergence event recent (fig. 7c). Alternately, if the subtree rate is much slower after the divergence event at node 4 (s4 ≪ r4), then t4 will be much more ancient (fig. 7d). In its mathematical formulation, RRF considers the best estimate of the rate of evolution of an ancestral lineage to be the average of the rate of evolution of its two descendant lineages (e.g., eqs. 3, 14, 15, and 16), while accommodating the relative rates among lineages. In the current example, this would result in the timetree shown in figure 7b. This is the principle of minimum rate change from the ancestor to its immediate descendants, where we do not favor extreme rate assignments, for example, those in figure 7c and d. Probabilities of such extreme rate assignments are also low in commonly used branch rate distributions (e.g., lognormal, normal, and exponential distributions), so Bayesian methods also tend to favor the smallest rate change needed to explain the data. Relationship of RRF with Other Molecular Clock Methods The treatment of evolutionary rates is conceptually different between Bayesian methods and RRF, because RRF does not assume a specific statistical distribution for modeling (lineage) rate variation at the outset and Bayesian methods model branch rates rather than the lineage rates. The application of the principle of minimum rate change in RRF is different from nonparametric and semiparametric approaches based on the idea of Sanderson (1997), because RRF minimizes lineage rate changes rather than the branch rate changes from an ancestor to its immediate descendants. In addition, RRF does not attempt to estimate a universal penalty for the speed of rate change throughout the tree. RRF is distinct from strict clock approaches because strict clock methods only apply rate averaging (e.g., eqs. 3 and 4) at each node in the tree, but RRF imposes an additional constraint that the ratio of sister lineage rates be the ratio of their lineage lengths (e.g., eqs. 1 and 2). These additional constraints relax the strict clock and allow rates to vary throughout the tree. For this reason, RelTime is also different from some molecular clock methods that assume the ratio of ages between two nodes to be proportional to the ratio of their average node-to-tip distances (Purvis 1995; Britton et al. 2002,, 2007). This assumption is tantamount to assuming equality of rates among lineages, and, thus, a strict molecular clock. For this reason, these approaches require pruning of taxa or lineages for which the rate equality does not apply (Takezaki et al. 1995). RRF does not assume equality among any lineage rates at any time, and it does not require the removal of rate heterogeneous lineages. Statistical Distribution of Relative Lineage Rates Relative lineage rates produced by RRF will show extensive correlation, because the evolutionary rate of a lineage is a function of evolutionary rates of all its descendant lineages. This would result in both local and global correlation, which is expected to be present in phylogenies with autocorrelated as well as independent branch rates. As expected, the analysis of 100 sequence data sets showed correlation between ancestral and descendant lineage rates when branch rates were autocorrelated (median correlation = 0.88) or independent (median correlation = 0.77). Therefore, RRF is fundamentally different from the autocorrelated (branch) rate model of Kishino et al. (2001) as well as the independent (branch) rate model of Drummond et al. (2006), as they deal with branches rather than lineages, as defined here. For this reason, the lineage rates produced by RRF are not directly comparable with those observed for branch rates produced by Bayesian methods. Even though RRF does not require a statistical distribution of lineage rates at the outset, the resulting estimates of lineage rates may follow a statistical distribution. We examined this relationship in an analysis of 100 ingroup sequence data sets in which branch rates were simulated with lognormal, exponential, or uniform distributions. When branch rates followed a lognormal distribution, the distribution of true lineage rates was also lognormal, as was the distribution of RRF lineage rate estimates (fig. 8a–d). When the branch rates followed an exponential distribution, the RRF and true lineage rates showed a similar distribution (fig. 8e). In the case of a uniform distribution of branch rates, the lineage rates showed a normal-like distribution and the RRF rate estimates were lognormally distributed (fig. 8f). All of these results suggest that a flexible lognormal distribution will generally fit the distribution of RRF lineage rates. Importantly, time estimates showed a linear relationship with the true times, with slopes close to 1.0 (fig. 8g–i). Fig. 8. View largeDownload slide Distributions of true and RRF-derived estimates of lineage rates. Branch rates were simulated under autocorrelated lognormal rate models with (a) low and (b) high dispersions, independent lognormal rate models with (c) low and (d) high dispersions, and independent rate models with (e) exponential and (f) uniform distributions. Green lines represent the fitted curves of the true lineage rate distributions and blue bars show the distributions of RRF rates. Rate unit is substitutions per site per million years. (g–i) Relationships between true times and RRF times in six rate scenarios. Black circles and lines represent the average times of 5 replicates simulated under rate scenarios in (a), (c), and (e), and gray triangles and lines represent the average times of 5 replicates simulated under rate scenarios in (b), (d), and (f). All times are normalized to the sum of ingroup divergence times. Regression slopes and correlation coefficients (r2) are shown. Fig. 8. View largeDownload slide Distributions of true and RRF-derived estimates of lineage rates. Branch rates were simulated under autocorrelated lognormal rate models with (a) low and (b) high dispersions, independent lognormal rate models with (c) low and (d) high dispersions, and independent rate models with (e) exponential and (f) uniform distributions. Green lines represent the fitted curves of the true lineage rate distributions and blue bars show the distributions of RRF rates. Rate unit is substitutions per site per million years. (g–i) Relationships between true times and RRF times in six rate scenarios. Black circles and lines represent the average times of 5 replicates simulated under rate scenarios in (a), (c), and (e), and gray triangles and lines represent the average times of 5 replicates simulated under rate scenarios in (b), (d), and (f). All times are normalized to the sum of ingroup divergence times. Regression slopes and correlation coefficients (r2) are shown. Point Estimates and Their Variances RRF yields point estimates for (relative) lineage rates and divergence times based on branch lengths. As is the common practice in classical statistics, estimates of dispersion such as SEs and confidence intervals accompany all estimates. The variance of a lineage rate estimate is a function of branch length variances. It can be obtained analytically by using the equations for lineage rate (e.g., eqs. 28–31 for the case of three taxa with outgroup by the delta method) or simply by using a bootstrap sampling procedure. However, the estimation of confidence intervals around the node ages depends on branch length variances as well as the degree of inequality of evolutionary rates among lineages (Kumar and Hedges 2016). Tamura et al. (2013) proposed a method to estimate confidence intervals within RelTime, which produces rather wide confidence intervals. We are currently investigating an advanced approach to narrow confidence intervals while maintaining appropriate coverage probabilities, but this subject is beyond the scope of the current article. Computational Speed of RelTime RRF scales well with increasing numbers of sequences and is much faster than Bayesian methods for analyzing of molecular sequence data (fig. 1). The fast computational speed is due to the innovation that RRF uses all the data first to map a large alignment onto a phylogeny, and then it uses the resulting branch lengths to generate relative divergence times and evolutionary lineage rates. The computational time taken by RRF is the sum of time taken to generate maximum likelihood estimates of branch lengths for a given sequence alignment and phylogeny and the time taken to estimate rates and dates using RRF. The latter is negligible compared with the former, because of the analytical nature of RRF. In comparison, Bayesian methods are computationally demanding because they require a substantial exploration of likelihood space using prior distributions to generate posterior estimates of rates and divergence times. RelTime for Phylogenies with Branch Lengths The above decomposition has a positive side effect, in addition to making RelTime computationally speedy. RRF may be applied to any phylogeny where branch lengths reflect the amount of change. For example, RRF is directly applicable when branch lengths are estimated by using pairwise evolutionary distances and a least squares approach for a given tree topology (Rzhetsky and Nei 1993). In addition to multiple sequence alignments, such distances can come from unaligned and locally aligned genome or genomic segments (Otu and Sayood 2003; Henz et al. 2005; Auch et al. 2006; Deng et al. 2006; Gao and Qi 2007; Lin et al. 2009; Xu and Hao 2009). In fact, RRF can be applied to any phylogeny where branch lengths are generated using other types of molecular data (e.g., gene expression patterns and breakpoint distances) or nonmolecular data (e.g., morphological and life history traits) (Herniou et al. 2001; Gramm and Niedermeier 2002; King et al. 2016; Cooney et al. 2017). Of course, the accuracy of the relative rate and time inferences made for such data depends directly on the accuracy of the phylogenetic tree and the branch lengths, so the biological interpretations of the results obtained will require utmost care. Usefulness of Relative Node Ages RRF’s accurate estimation of relative node ages without assuming a speciation model or calibration priors can benefit many applications (Tamura et al. 2012). For example, relative node ages from molecular data are directly comparable with those from fossil data. This allows evaluation of biological hypotheses without the circularity created by the use of calibration priors and densities inferred from molecular data (Battistuzzi et al. 2015; Gold et al. 2017). Along these lines, RRF has been used to develop a protocol to identify calibration priors that have the strongest influence on the final time estimates in Bayesian dating (Battistuzzi et al. 2015), because the cross-validation methods are unlikely to be effective (Warnock et al. 2012,, 2015). Inferring Absolute Times from Relative Node Ages By placing calibration constraints on one or more nodes in the tree, we can generate an absolute timetree from the ultrametric tree containing relative node ages. Tamura et al. (2013) presented an algorithmic approach for adjusting relative rates to ensure that the estimated times for calibrated nodes are within the boundaries. This process uses the maximum and minimum boundaries only for calibration, which is preferable when the uncertainty distribution of calibrations is not known precisely. Otherwise, there is high probability of biased time estimation (Hedges and Kumar 2004; Ho and Phillips 2009; Inoue et al. 2010; Heath et al. 2014; Ho and Duchêne 2014; dos Reis et al. 2015). Tamura et al.’s approach worked well in the analysis of large data sets, because Bayesian time estimates reported in multiple large-scale studies are similar to those produced by RRF using ultrametric trees with relative times that were transformed into timetrees using many calibration constraints (Mello et al. 2017). Conclusions We have presented a mathematical foundation for the RelTime method and elucidated its relationship with other relaxed and strict clock methods. We have shown that the relative rate framework (RRF) produces excellent estimates of rates and divergence times for evolutionary lineages. It is, however, important to note that estimates of divergence times in a phylogeny are only biologically meaningful when the reconstruction of evolutionary relationships is robust. Therefore, the best practice is to first obtain a reliable phylogeny and then estimate divergence times. We must also consider the confidence intervals associated with node ages to assess the precision of time estimates prior to making biological inferences. Materials and Methods Computer Simulations and Analysis We simulated 200 multisequence alignments: 50 each for two models of evolutionary rates (independent and autocorrelated among branches) for two model topologies containing three- and four-ingroup taxa (fig. 2a and b, respectively). The node height of the ingroup subtree was set to be 10 time units, while the node heights of all descendent subtrees varied randomly between 0 and 10 time units. For each resulting model timetree, branch rates were sampled from a lognormal distribution, where the mean rate was drawn randomly from an empirical distribution (Rosenberg and Kumar 2003) and the SD varied from 0.25 to 0.75 for all branches independently. For the autocorrelated rates, the initial rate was drawn randomly from an empirical distribution (Rosenberg and Kumar 2003) and the autocorrelation parameter was varied from 0.1 to 0.3. This rate sampling resulted in a phylogram with branch lengths that could be used as input for SeqGen (Grassly et al. 1997). We used the Hasegawa–Kinshino–Yano (HKY) model (Hasegawa et al. 1985) with 4 gamma categories and empirically derived GC content and transition/transversion ratio (Rosenberg and Kumar 2003) as simulation parameters. We generated simulated multispecies alignments where each sequence was 3, 000 base pairs long. The results from three-ingroup sequence analysis (fig. 4) were similar to those from the analysis of four-ingroup sequences (not shown). Using the same simulation strategy, we created 35 alignments each under independent and autocorrelated rate scenarios following a master phylogeny of 100 taxa that was sampled from the bony-vertebrate clade in the Timetree of Life (Hedges and Kumar 2009). In the independent rate case, the SD varied from 0.3 to 0.5. In the autocorrelated rate case, the autocorrelation parameter varied from 0.01 to 0.04. All other simulation parameters (GC contents, transition/transversion ratio, and sequence length) were derived from empirical distributions (Rosenberg and Kumar 2003). Using the same 100 ingroup taxon master phylogeny of bony vertebrates, we simulated five additional sequence data sets under (1) an autocorrelated lognormal rate model with low dispersion, where the initial rate and the autocorrelation parameter were set to be r and 0.02, respectively; (2) an autocorrelated lognormal rate model with high dispersion, where the initial rate and the autocorrelation parameter were set to be r and 0.05, respectively; (3) an independent lognormal rate model with low dispersion, where the mean rate and the SD (in log scale) were set to be r and 0.25, respectively; (4) an independent lognormal rate model with high dispersion, where the mean rate and the SD (in log scale) were set to be r and 0.75, respectively; (5) an independent exponential rate model, where the mean rate was set to be r; and (6) an independent uniform rate model, where the branch-specific rates were sampled from an uniform distribution from 0 to 2r. GC contents, transition/transversion ratio, sequence length, and evolutionary rate r were derived from empirical distributions (Rosenberg and Kumar 2003). MEGA software (Kumar et al. 2012, 2016) was used to obtain maximum likelihood estimates of branch lengths from simulated sequence alignments, where the correct substitution model and tree topology were used. RRF (eqs. 28–42) was applied to the resulting phylogram with branch lengths, and relative lineage rates and times were obtained. One calibration (true age ± 10 Ma) at the crown node of the ingroup was used to convert relative time estimates for comparison with true times (fig. 8). No calibrations were used in other RRF analyses. All Bayesian analyses were conducted using MCMCTree (Yang 2007) using correct priors; two independent runs of five million generations were carried out. Results were checked for convergence using Tracer (Rambaut et al. 2014). ESS values were higher than 200 after removing 10% burn-in samples in each run. MCMCTree analyses used one root calibration (true age ± 0.1 time units). Analysis of Hybrid Rate Models Simulated data sets and BEAST results were provided by Beaulieu et al. (2015), or retrieved from the Dryad Repository. All outgroup and root calibrations are automatically disregarded in RelTime, because the assumption of equal rates of evolution between the ingroup and outgroup sequences is not testable (Kumar et al. 2016). Lognormal distributions with fixed median values of “true ages” were used as calibration densities in the original study (Beaulieu et al. 2015). Because RelTime does not require specific density distributions for calibrations, we used true age ± 5 Ma for all 15 ingroup calibrated nodes in the reanalysis to directly compare RelTime divergence time estimates with those from BEAST. Calibrations employed in RelTime (true age ± 5 Ma) had boundaries similar to 99% probability densities of lognormal distributions originally employed as calibrations. The same alignments, topology, and ingroup calibrations were used in RRF analyses. Estimates of angiosperm age were obtained by summarizing estimates of 100 data sets each in which the herbaceous clades have 3 times (3×) and 6 times (6×) higher rates than those of woody clades. Acknowledgments We are thankful to Heather Rowe for editorial comments and Beatriz Mello for many helpful discussions. We thank Jeremy Beaulieu for providing simulated data and Bayesian results. This research was supported by grants from NIH (R01GM126567-01), National Aeronautics and Space Administration (NASA, NNX16AJ30G), and Tokyo Metropolitan University (DB105). References Auch AF , Henz SR , Holland BR , Göker M. 2006 . Genome BLAST distance phylogenies inferred from whole plastid and whole mitochondrion genome sequences . BMC Bioinformatics 7 : 350. Google Scholar CrossRef Search ADS PubMed Battistuzzi FU , Billing-Ross P , Murillo O , Filipski A , Kumar S. 2015 . A protocol for diagnosing the effect of calibration priors on posterior time estimates: a case study for the Cambrian explosion of animal phyla . Mol Biol Evol . 32 7 : 1907 – 1912 . Google Scholar CrossRef Search ADS PubMed Beaulieu JM , O’Meara BC , Crane P , Donoghue MJ. 2015 . Heterogeneous rates of molecular evolution and diversification could explain the Triassic age estimate for angiosperms . Syst Biol . 64 5 : 869 – 878 . Google Scholar CrossRef Search ADS PubMed Bonaldo MC , Ribeiro IP , Lima NS , Dos Santos AAC , Menezes LSR , da Cruz SOD , de Mello IS , Furtado ND , de Moura EE , Damasceno L et al. , . 2016 . Isolation of infective Zika virus from urine and saliva of patients in Brazil . PLoS Negl Trop Dis . 10 6 : e0004816. Google Scholar CrossRef Search ADS PubMed Bond JE , Garrison NL , Hamilton CA , Godwin RL , Hedin M , Agnarsson I. 2014 . Phylogenomics resolves a spider backbone phylogeny and rejects a prevailing paradigm for orb web evolution . Curr Biol . 24 15 : 1765 – 1771 . Google Scholar CrossRef Search ADS PubMed Britton T , Anderson CL , Jacquet D , Lundqvist S , Bremer K. 2007 . Estimating divergence times in large phylogenetic trees . Syst Biol . 56 5 : 741 – 752 . Google Scholar CrossRef Search ADS PubMed Britton T , Oxelman B , Vinnersten A , Bremer K. 2002 . Phylogenetic dating with confidence intervals using mean path lengths . Mol Phylogenet Evol . 24 1 : 58 – 65 . Google Scholar CrossRef Search ADS PubMed Cooney CR , Bright JA , Capp EJR , Chira AM , Hughes EC , Moody CJA , Nouri LO , Varley ZK , Thomas GH. 2017 . Mega-evolutionary dynamics of the adaptive radiation of birds . Nature 542 7641 : 344 – 347 . Google Scholar CrossRef Search ADS PubMed Crisp MD , Hardy NB , Cook LG. 2014 . Clock model makes a large difference to age estimates of long-stemmed clades with no internal calibration: a test using Australian grasstrees . BMC Evol Biol . 14 : 263 – 279 . Google Scholar CrossRef Search ADS PubMed Deng R , Huang M , Wang J , Huang Y , Yang J , Feng J , Wang X. 2006 . PTreeRec: Phylogenetic Tree Reconstruction based on genome BLAST distance . Comput Biol Chem . 30 4 : 300 – 302 . Google Scholar CrossRef Search ADS PubMed Dornburg A , Brandley MC , McGowen MR , Near TJ. 2012 . Relaxed clocks and inferences of heterogeneous patterns of nucleotide substitution and divergence time estimates across whales and dolphins (Mammalia: cetacea) . Mol Biol Evol . 29 2 :721–243. dos Reis M , Donoghue PC , Yang Z. 2016 . Bayesian molecular clock dating of species divergences in the genomics era . Nat Rev Genet . 17 2 : 71 – 80 . Google Scholar CrossRef Search ADS PubMed dos Reis M , Thawornwattana Y , Angelis K , Telford MJ , Donoghue PC , Yang Z. 2015 . Uncertainty in the timing of origin of animals and the limits of precision in molecular timescales . Curr Biol . 25 22 :2939–2912. Drummond AJ , Ho SYW , Phillips MJ , Rambaut A. 2006 . Relaxed phylogenetics and dating with confidence . PLoS Biol . 4 5 : e88 – e99 . Google Scholar CrossRef Search ADS PubMed Drummond AJ , Suchard MA. 2010 . Bayesian random local clocks, or one rate to rule them all . BMC Biol . 8 : 114 – 125 . Google Scholar CrossRef Search ADS PubMed Filipski A , Murillo O , Freydenzon A , Tamura K , Kumar S. 2014 . Prospects for building large timetrees using molecular data with incomplete gene coverage among species . Mol Biol Evol . 31 9 : 2542 – 2550 . Google Scholar CrossRef Search ADS PubMed Gao L , Qi J. 2007 . Whole genome molecular phylogeny of large dsDNA viruses using composition vector method . BMC Evol Biol . 7 : 41. Google Scholar CrossRef Search ADS PubMed Gold DA , Caron A , Fournier GP , Summons RE. 2017 . Paleoproterozoic sterol biosynthesis and the rise of oxygen . Nature 543 7645 : 420 – 423 . Google Scholar CrossRef Search ADS PubMed Gramm J , Niedermeier R. 2002 . Breakpoint medians and breakpoint phylogenies: a fixed-parameter approach . Bioinformatics 18 ( Suppl 2 ): S128 – S139 . Google Scholar CrossRef Search ADS PubMed Grassly NC , Adachi J , Rambaut A. 1997 . Seq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees . Comput Appl Biosci . 13 3 : 235 – 238 . Google Scholar PubMed Hasegawa M , Kishino H , Yano T. 1985 . Dating of the human-ape splitting by a molecular clock of mitochondrial DNA . J Mol Evol . 22 2 : 160 – 174 . Google Scholar CrossRef Search ADS PubMed Heath TA , Huelsenbeck JP , Stadler T. 2014 . The fossilized birth-death process for coherent calibration of divergence-time estimates . Proc Natl Acad Sci U S A . 111 29 : E2957 – E2966 . Google Scholar CrossRef Search ADS PubMed Hedges SB , Kumar S. 2004 . Precision of molecular time estimates . Trends Genet . 20 5 : 242 – 247 . Google Scholar CrossRef Search ADS PubMed Hedges SB , Kumar S. 2009 . The timetree of life . New York : Oxford University Press . Henz SR , Huson DH , Auch AF , Nieselt-Struwe K , Schuster SC. 2005 . Whole-genome prokaryotic phylogeny . Bioinformatics 21 10 : 2329 – 2335 . Google Scholar CrossRef Search ADS PubMed Herniou EA , Luque T , Chen X , Vlak JM , Winstanley D , Cory JS , O’Reilly DR. 2001 . Use of whole genome sequence data to infer baculovirus phylogeny . J Virol . 75 17 : 8117 – 8126 . Google Scholar CrossRef Search ADS PubMed Ho SY , Duchêne S. 2014 . Molecular-clock methods for estimating evolutionary rates and timescales . Mol Ecol . 23 24 : 5947 – 5965 . Google Scholar CrossRef Search ADS PubMed Ho SYW , Phillips MJ. 2009 . Accounting for calibration uncertainty in phylogenetic estimation of evolutionary divergence times . Syst Biol . 58 3 : 367 – 380 . Google Scholar CrossRef Search ADS PubMed Höhna S , Landis MJ , Heath TA , Boussau B , Lartillot N , Moore BR , Huelsenbeck JP , Ronquist F. 2016 . RevBayes: bayesian phylogenetic inference using graphical models and an interactive model-specification language . Syst Biol . 65 : 726 – 736 . Google Scholar CrossRef Search ADS PubMed Inoue J , Donoghue PCJ , Yang Z. 2010 . The impact of the representation of fossil calibrations on Bayesian estimation of species divergence times . Syst Biol . 59 1 : 74 – 89 . Google Scholar CrossRef Search ADS PubMed King B , Qiao T , Lee MSY , Zhu M , Long JA. 2016 . Bayesian morphological clock methods resurrect placoderm monophyly and reveal rapid early evolution in jawed vertebrates . Syst Biol . 66 : 499 – 516 . Kishino H , Thorne JL , Bruno WJ. 2001 . Performance of a divergence time estimation method under a probabilistic model of rate evolution . Mol Biol Evol . 18 3 : 352 – 361 . Google Scholar CrossRef Search ADS PubMed Kumar S , Hedges SB. 2016 . Advances in time estimation methods for molecular data . Mol Biol Evol . 33 4 : 863 – 869 . Google Scholar CrossRef Search ADS PubMed Kumar S , Stecher G , Peterson D , Tamura K. 2012 . MEGA-CC: computing core of molecular evolutionary genetics analysis program for automated and iterative data analysis . Bioinformatics 28 20 : 2685 – 2686 . Google Scholar CrossRef Search ADS PubMed Kumar S , Stecher G , Tamura K. 2016 . MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets . Mol Biol Evol . 33 7 : 1870 – 1874 . Google Scholar CrossRef Search ADS PubMed Lartillot N , Phillips MJ , Ronquist F. 2016 . A mixed relaxed clock model . Philos Trans R Soc B 371 1699 : 20150132. Google Scholar CrossRef Search ADS Lin GN , Cai Z , Lin G , Chakraborty S , Xu D. 2009 . ComPhy: prokaryotic composite distance phylogenies inferred from whole-genome gene sets . BMC Bioinformatics 10 ( Suppl 1 ): S5. Google Scholar CrossRef Search ADS Mahler DL , Ingram T , Revell LJ , Losos JB. 2013 . Exceptional convergence on the macroevolutionary landscape in island lizard radiations . Science 341 6143 : 292 – 295 . Google Scholar CrossRef Search ADS PubMed Mello B , Tao Q , Tamura K , Kumar S. 2017 . Fast and accurate estimates of divergence times from big data . Mol Biol Evol . 34 1 : 45 – 50 . Google Scholar CrossRef Search ADS PubMed Otu HH , Sayood K. 2003 . A new sequence distance measure for phylogenetic tree construction . Bioinformatics 19 16 : 2122 – 2130 . Google Scholar CrossRef Search ADS PubMed Purvis A. 1995 . A composite estimate of primate phylogeny . Philos Trans R Soc B 348 1326 : 405 – 421 . Google Scholar CrossRef Search ADS Rambaut A , Suchard M A Xie D , Drummond A. 2014 . Tracer v1.6. Institute of Evolutionary Biology, University of Edinburgh, UK. Available from: http://beast.bio.ed.ac.uk/Tracer, last accessed March 29, 2018. Rosenberg MS , Kumar S. 2003 . Heterogeneity of nucleotide frequencies among evolutionary lineages and phylogenetic inference . Mol Biol Evol . 20 4 : 610 – 621 . Google Scholar CrossRef Search ADS PubMed Rzhetsky A , Nei M. 1993 . Theoretical foundation of the minimum-evolution method of phylogenetic inference . Mol Biol Evol . 10 5 : 1073 – 1095 . Google Scholar PubMed Sanderson MJ. 1997 . A nonparametric approach to estimating divergence times in the absence of rate constancy . Mol Biol Evol . 14 12 : 1218 – 1231 . Google Scholar CrossRef Search ADS Sanderson MJ. 2003 . r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock . Bioinformatics 19 2 : 301 – 302 . Google Scholar CrossRef Search ADS PubMed Smith SA , Donoghue MJ. 2008 . Rates of molecular evolution are linked to life history in flowering plants . Science 322 5898 : 86 – 89 . Google Scholar CrossRef Search ADS PubMed Takezaki N , Rzhetsky A , Nei M. 1995 . Phylogenetic test of the molecular clock and linearized trees . Mol Biol Evol . 12 5 : 823 – 833 . Google Scholar PubMed Tamura K , Battistuzzi FU , Billing-Ross P , Murillo O , Filipski A , Kumar S. 2012 . Estimating divergence times in large molecular phylogenies . Proc Natl Acad Sci U S A . 109 47 : 19333 – 19338 . Google Scholar CrossRef Search ADS PubMed Tamura K , Stecher G , Peterson D , Filipski A , Kumar S. 2013 . MEGA6: molecular evolutionary genetics analysis version 6.0 . Mol Biol Evol . 30 12 : 2725 – 2729 . Google Scholar CrossRef Search ADS PubMed Warnock RC , Parham JF , Joyce WG , Lyson TR , Donoghue PC. 2015 . Calibration uncertainty in molecular dating analyses: there is no substitute for the prior evaluation of time priors . Proc R Soc B 282 1798 : 20141013. Google Scholar CrossRef Search ADS PubMed Warnock RC , Yang Z , Donoghue PC. 2012 . Exploring uncertainty in the calibration of the molecular clock . Biol Lett . 8 1 : 156 – 159 . Google Scholar CrossRef Search ADS PubMed Xu Z , Hao B. 2009 . CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes . Nucleic Acids Res . 37 ( Web Server ): W174 – W178 . Google Scholar CrossRef Search ADS PubMed Yang Z. 2007 . PAML 4: phylogenetic analysis by maximum likelihood . Mol Biol Evol . 24 8 : 1586 – 1591 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Molecular Biology and Evolution Oxford University Press

Theoretical Foundation of the RelTime Method for Estimating Divergence Times from Variable Evolutionary Rates

Loading next page...
 
/lp/ou_press/theoretical-foundation-of-the-reltime-method-for-estimating-divergence-7oXnOYawAd
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
ISSN
0737-4038
eISSN
1537-1719
D.O.I.
10.1093/molbev/msy044
Publisher site
See Article on Publisher Site

Abstract

Abstract RelTime estimates divergence times by relaxing the assumption of a strict molecular clock in a phylogeny. It shows excellent performance in estimating divergence times for both simulated and empirical molecular sequence data sets in which evolutionary rates varied extensively throughout the tree. RelTime is computationally efficient and scales well with increasing size of data sets. Until now, however, RelTime has not had a formal mathematical foundation. Here, we show that the basis of the RelTime approach is a relative rate framework (RRF) that combines comparisons of evolutionary rates in sister lineages with the principle of minimum rate change between evolutionary lineages and their respective descendants. We present analytical solutions for estimating relative lineage rates and divergence times under RRF. We also discuss the relationship of RRF with other approaches, including the Bayesian framework. We conclude that RelTime will be useful for phylogenies with branch lengths derived not only from molecular data, but also morphological and biochemical traits. divergence times, timetree, evolution, theory Introduction The inference of divergence times is accomplished by assuming a constant rate throughout the tree (a strict molecular clock) or by relaxing the strict molecular clock (Ho and Duchêne 2014; dos Reis et al. 2016; Kumar and Hedges 2016). Bayesian approaches are widely applied to relaxed clock dating analyses. They require specification of a probability distribution of evolutionary rates in the tree (e.g., lognormal distribution), as well as an assumption of the presence or absence of rate autocorrelation among branches. In contrast, the RelTime approach does not require specification of such priors, and produces relative node ages that can then be transformed into absolute dates by using calibration constraints for one or more nodes (Tamura et al. 2012, 2013). RelTime performs well for estimating divergence times in analyses of many large empirical data sets (Mello et al. 2017) and simulated data sets (Tamura et al. 2012; Filipski et al. 2014). RelTime’s computational speed (fig. 1) and accuracy have led to its use for estimating divergence time for large data sets (Tamura et al. 2012; Mahler et al. 2013; Bond et al. 2014; Bonaldo et al. 2016). However, a mathematical foundation for the RelTime method has not yet been provided, which is needed not only to understand basic properties and assumptions of RelTime, but also to reveal its relationship with other molecular dating methods (Ho and Duchêne 2014; dos Reis et al. 2016; Kumar and Hedges 2016). In the following, we present the theoretical foundation of the RelTime method. We also assess the absolute performance of RelTime and compare it with other methods by analyzing data generated using computer simulations in which sequences were evolved according to three different branch rate models: independent (Drummond et al. 2006), autocorrelated (Kishino et al. 2001), and hybrid (Beaulieu et al. 2015). Fig. 1. View largeDownload slide (a) Computational time taken by RelTime and MCMCTree (Bayesian method) to estimate divergence times for data sets containing increasing number of sequences (n). The tested sequence alignment consisted of 4,493 sites in which sequence evolution was simulated with extensive rate variation (RR50 data from Tamura et al. 2012). RelTime’s speed advantage increases with data volume by O(n2). (b) Distribution of computation speed ratio of RelTime to MCMCTree for 70 data sets, each containing 100 ingroup sequences that were simulated as described in the Materials and Methods. Fig. 1. View largeDownload slide (a) Computational time taken by RelTime and MCMCTree (Bayesian method) to estimate divergence times for data sets containing increasing number of sequences (n). The tested sequence alignment consisted of 4,493 sites in which sequence evolution was simulated with extensive rate variation (RR50 data from Tamura et al. 2012). RelTime’s speed advantage increases with data volume by O(n2). (b) Distribution of computation speed ratio of RelTime to MCMCTree for 70 data sets, each containing 100 ingroup sequences that were simulated as described in the Materials and Methods. Mathematical Theory Theoretical Analysis for a Phylogeny with Three Taxa and an Outgroup We begin with the simplest case in which the phylogeny contains a clade with three ingroup taxa (subtree at node 5) and one outgroup taxon (fig. 2a). In this tree, branch lengths b1 and b2 represent the amount of evolutionary change that has occurred in lineages emanating from node 4 and leading to taxon 1 and taxon 2, respectively. We assume that taxon 1 and 2 are sampled at the same evolutionary time (t1 = 0 and t2 = 0), which is usually the case in phylogenetic analysis of data sampled from living species. The contemporaneous sampling of data produces sampling times equal to 0, which serve as calibration points (Tamura et al. 2012). By using branch lengths in this phylogeny (b’s), we can estimate the relative evolutionary rates (r’s) for all the lineages as well as relative divergence times (t’s). Here, a lineage refers to a branch and all the taxa (and branches) in the descendant subtree. For example, lineage a in figure 2a contains two taxa (1 and 2) and three branches with lengths b1, b2, and b4. Fig. 2. View largeDownload slide The relative rate framework for RelTime method. (a) A tree containing three ingroup sequences with an outgroup. Branch lengths are bi’s and lineage rates are ri’s. La = b4 + ½(b1 + b2). (b) The case of four ingroup sequences with an outgroup. Here, La = b5 + ½(b1 + b2). Lb = b6 + ½(b3 + b4), when using the arithmetic mean. See figure 3 and its legend for a simple procedure outlining the calculation of relative rates using RRF. Fig. 2. View largeDownload slide The relative rate framework for RelTime method. (a) A tree containing three ingroup sequences with an outgroup. Branch lengths are bi’s and lineage rates are ri’s. La = b4 + ½(b1 + b2). (b) The case of four ingroup sequences with an outgroup. Here, La = b5 + ½(b1 + b2). Lb = b6 + ½(b3 + b4), when using the arithmetic mean. See figure 3 and its legend for a simple procedure outlining the calculation of relative rates using RRF. The following system of equations formalizes the RelTime approach mathematically by linking relative rates for lineages (ri) with branch lengths (bi) in figure 2a. Here,, r1/r2=b1/b2, (1) r3/ra=b3/La, (2) where La is depth of node 5 on lineage a that contains taxon 1 and 2; La = b4 + ½(b1 + b2). We set: ra=½(r1+r2), (3) r0=½(ra+r3). (4) The setting of equalities in equations (3) and (4) leads to preference for the minimum rate change between the lineage originating at node 5 and the descendant lineage originating at node 4. The selection of lineage rate ra will be constrained by rate ratios imposed by equations (1) and (2), which relaxes the strict molecular clock. Because all the lineage rates are relative, one unknown is reduced by setting the group rate at the most recent common ancestor of the ingroup node to be 1, that is, r0=1. (5) We solve for each ri by using equations (1–5) and get: r1=4b1(b1+b2+2b4)/[(b1+b2)(b1+b2+2b3+2b4)], (6) r2=4b2(b1+b2+2b4)/[(b1+b2)(b1+b2+2b3+2b4)], (7) r3=4b3/(b1+b2+2b3+2b4), (8) ra=2(b1+b2+2b4)/(b1+b2+2b3+2b4). (9) The estimates of relative lineage rates and lengths produce an ultrametric tree with relative times for node 4 (t4) and node 5 (t5): t4=(b1+b2)(b1+b2+2b3+2b4)/4(b1+b2+2b4), (10) t5=(b1+b2+2b3+2b4)/4. (11) The above equations (6–11) constitute the relative rate framework (RRF) for phylogenies with three taxa and one outgroup. These equations yield point estimates for lineage rates and node ages. Because branch lengths have variances, the resulting lineage rate estimates have variances. The variance of relative node ages will be a function of branch length variances and the variance contributed by evolutionary rate differences among branches (see Discussion). Theoretical Analysis for Phylogenies Containing Four Taxa and an Outgroup To estimate six evolutionary rates (r1–r4, ra, and rb) using six branch length estimates (b1–b6) for a phylogeny containing four ingroup taxa (fig. 2b) we write a set of seven equations: r1/r2=b1/b2, (12) r3/r4=b3/b4, (13) ra=½(r1+r2), (14) rb=½(r3+r4), (15) r0=½(ra+rb), (16) ra/rb=La/Lb, (17) r0=1. (18) Here, La=b5+½(b1+b2) and Lb=b6+½(b3+b4). Equations (14–16) above lead to a preference for minimum changes in rates between ancestor–descendant lineage pairs. In this system, ri ’s are not required to be equal to one another and the rate assignments are constrained by the rate ratios among lineages (eqs. 12, 13, and 17), which relaxes the strict molecular clock. We solve for each ri from equations (12–18) to estimate relative rates and node ages. r1=4b1(b1+b2+2b5)/[(b1+b2)(b1+b2+b3+b4+2b5+2b6)], (19) r2=4b2(b1+b2+2b5)/[(b1+b2)(b1+b2+b3+b4+2b5+2b6)], (20) r3=4b3(b3+b4+2b6)/[(b3+b4)(b1+b2+b3+b4+2b5+2b6)], (21) r4=4b4(b3+b4+2b6)/[(b3+b4)(b1+b2+b3+b4+2b5+2b6)], (22) ra=2(b1+b2+2b5)/(b1+b2+b3+b4+2b5+2b6), (23) rb=2(b3+b4+2b6)/(b1+b2+b3+b4+2b5+2b6). (24) The estimates of relative node ages t5, t6, t7 for nodes 5, 6, and 7, respectively, are: t5=(b1+b2)(b1+b2+b3+b4+2b5+2b6)/[4(b1+b2+2b5)], (25) t6=(b3+b4)(b1+b2+b3+b4+2b5+2b6)/[4(b3+b4+2b6)], (26) t7=(b1+b2+b3+b4+2b5+2b6)/4. (27) The above equations (19–27) establish RRF of the RelTime approach for a tree containing four taxa and one outgroup. As previously mentioned, point estimates of node ages and lineage rates have variances because branch lengths have variances and because evolutionary rates are not equal among lineages (see Discussion). Relative Rate Framework with Geometric Means In both the original RelTime approach (Tamura et al. 2012) and the mathematical formulations above, we considered an arithmetic mean when averaging branch lengths to minimize evolutionary rate changes. This approach does not assume an equal rate, but is rather a natural way to calculate node depths by averaging branch lengths. We have now developed analytical formulas for an alternative RRF using the geometric mean that balances the rate changes between two descendant lineages. For example, if b1 = 1 and b2 = 4 in figure 2a, then the arithmetic mean will give 2.5. Thus, evolutionary rate r1 is 2.5 times slower and r2 is 1.6 times faster when compared with the average rate. The difference in rate change (between 2.5 and 1.6, in the present case) becomes larger as the difference between b1 and b2 becomes larger when using the arithmetic mean. In contrast, the geometric mean would give 2.0, which results in two times slower rate in b1 and two times faster rate in b2, as compared with the average rate. That is, the difference in rate between the ancestral and descendant lineages is always equal for sister lineages when using the geometric mean, which is not the case if the arithmetic mean is used. Using a geometric means approach, we obtain the following analytical formulas for a phylogeny containing three ingroup taxa (fig. 2a): r1=b1b1b2+b4/b2b3, (28) r2=b2b1b2+b4/b1b3, (29) r3=b3/b1b2+b4, (30) ra=b1b2+b4/b3, (31) t4=b1b2b3/b1b2+b4, (32) t5=b3b1b2+b4. (33) For the 4-taxon case in figure 2b, the equations are as follows: r1=b1b1b2+b5/b2b3b4+b6, (34) r2=b2b1b2+b5/b1b3b4+b6, (35) r3=b3b3b4+b6/b4b1b2+b5, (36) r4=b4b3b4+b6/b3b1b2+b5, (37) ra=b1b2+b5/b3b4+b6, (38) rb=b3b4+b6/b1b2+b5, (39) t5=b1b2b3b4+b6/b1b2+b5, (40) t6=b3b4b1b2+b5/b3b4+b6, (41) t7=b1b2+b5b3b4+b6. (42) Relative Rate Framework for a General Case Next, we consider a general case of a phylogeny with more than four ingroup taxa. In this case, RelTime applies RRF with a bottom-up approach, starting from the tips (external branches) of the phylogeny and moving toward the root. In the first step, we generate local relative lineage rates for the subtrees containing three and four taxa (subtree x and y, respectively) in a phylogeny containing eight taxa and an outgroup (fig. 3). Subtree y contains four taxa, so we apply equations (34–39) to generate relative rates. Equations (28–31) will be used to estimate rates for clade x which contains three ingroup taxa. Fig. 3. View largeDownload slide Calculating relative lineage rates in a phylogeny. (a) A phylogeny containing eight ingroup sequences (taxa) that consists of three taxa nodes (x and w) and four taxa nodes (y and z). (b) Reduced phylogeny and branch lengths after applying RRF to nodes x and y. (c) Final phylogeny after applying RRF to node z in panel b, which produces node w in a three taxa configuration. After applying RRF to node w, a preorder traversal scales descendant lineage rates by multiplying them by their ancestral lineage rates to generate final relative rates. The ingroup root node (w) has an average relative lineage rate equal to 1. Multisequence taxa are indicated by Roman numerals. Fig. 3. View largeDownload slide Calculating relative lineage rates in a phylogeny. (a) A phylogeny containing eight ingroup sequences (taxa) that consists of three taxa nodes (x and w) and four taxa nodes (y and z). (b) Reduced phylogeny and branch lengths after applying RRF to nodes x and y. (c) Final phylogeny after applying RRF to node z in panel b, which produces node w in a three taxa configuration. After applying RRF to node w, a preorder traversal scales descendant lineage rates by multiplying them by their ancestral lineage rates to generate final relative rates. The ingroup root node (w) has an average relative lineage rate equal to 1. Multisequence taxa are indicated by Roman numerals. In the next step, we consider the parent clade (z) that has four taxa: composite taxon I consisting of two taxa (1 and 2), taxon II consisting of one taxon (3), composite taxon III consisting of two taxa (4 and 5), and composite taxon IV containing two taxa (6 and 7) (fig. 3b). We estimate branch lengths (bI, bII, bIII, and bIV) by using the geometric means: bI=b1b2+b9, bII=b3, bIII=b4b5+b10, and bIV=b6b7+b11. Then, we use equations (34–39) to compute relative rates for all the lineages using these branch lengths. In bigger phylogenies, this process is carried out for every internal node in a postorder traversal. In the current example, node w is the common ancestor of all the ingroup taxa and is in a 3-taxon configuration (fig. 3c). We now have bV=bIbII+bx and bVI=bIIIbIV+by and apply equations (28–31). At this stage, we have local (relative) lineage rate estimates for nodes in the ingroup tree. Finally, all the rates in the tree are computed by multiplying descendant lineage rates by their respective ancestral lineage rates in preorder traversal to generate final relative rates such that the ingroup root node (e.g., node w in fig. 3c) has an average relative lineage rate equal to 1. Results We evaluated the performance of RRF for correctly estimating lineage rates and divergence times by analyzing data generated using computer simulation in which sequences were evolved according to the autocorrelated rate (AR) model (Kishino et al. 2001), the independent rate (IR) model (Drummond et al. 2006), as well as a model that contains multiple distributions of rates (hybrid rates [HR]). We present results from the analysis of a collection of small data sets (three ingroup sequences; AR and IR models) and two collections of large data sets: one containing 100 ingroup sequences (AR and IR models) and another containing 91-ingroup sequences (HR model). Analysis of Small Data Sets (Phylogeny with Three Ingroup Sequences) We first tested the accuracy of divergence times estimated via RRF by analyzing data sets containing three ingroup sequences. We used RRF with geometric means and compared the modeled (true) values with estimated lineage rates as well as node ages (fig. 4). The lineage rates produced by RRF were similar to the true rates for data sets that were evolved under the AR model (fig. 4a); the relationship showed a linear regression slope of 0.99. RRF estimates for external lineages, were similarly good (blue circles in fig. 4a, r2 = 0.68). The relationship was also strong for internal branches, but with greater dispersion (red circles in fig. 4a, r2 = 0.29). RRF performance for AR data sets was similar to that observed for IR data sets (fig. 4b). Overall, the success in estimating lineage-specific evolutionary rates translates into robust estimates of relative times for AR (fig. 4c) and IR data sets (fig. 4d). Fig. 4. View largeDownload slide Performance of RRF and MCMCTree Bayesian analyses for three ingroup sequences with an outgroup (topology in fig. 2a). RRF lineage rate estimates are compared with the true lineage rate estimates for sequences that evolved under (a) autocorrelated and (b) independent rate models. Blue circles represent external lineages (single taxon, r1, r2, and r3) and red circles represent the internal lineage (ra). RRF estimates of divergence times for (c) autocorrelated rate and (d) independent rate data sets. Bayesian (MCMCTree) estimates of branch rates are compared with the true branch rates for sequences evolved under (e) autocorrelated and (f) independent rate models. Blue circles indicate external branches (r1, r2, and r3) and red circles show the internal branch (r4). Bayesian estimates of divergence times for (g) autocorrelated rate and (h) independent rate data sets. Each panel contains results from 50 simulated data sets. All rates and divergence time estimates are normalized to allow direct comparison between true and estimated values. Regression slope and correlation coefficient (r2) are shown for each panel. Fig. 4. View largeDownload slide Performance of RRF and MCMCTree Bayesian analyses for three ingroup sequences with an outgroup (topology in fig. 2a). RRF lineage rate estimates are compared with the true lineage rate estimates for sequences that evolved under (a) autocorrelated and (b) independent rate models. Blue circles represent external lineages (single taxon, r1, r2, and r3) and red circles represent the internal lineage (ra). RRF estimates of divergence times for (c) autocorrelated rate and (d) independent rate data sets. Bayesian (MCMCTree) estimates of branch rates are compared with the true branch rates for sequences evolved under (e) autocorrelated and (f) independent rate models. Blue circles indicate external branches (r1, r2, and r3) and red circles show the internal branch (r4). Bayesian estimates of divergence times for (g) autocorrelated rate and (h) independent rate data sets. Each panel contains results from 50 simulated data sets. All rates and divergence time estimates are normalized to allow direct comparison between true and estimated values. Regression slope and correlation coefficient (r2) are shown for each panel. Bayesian methods produce branch-specific rates under a given statistical distribution of rates (e.g., lognormal), which differs from RRF where relative lineage rates are considered. Therefore, we compared the true values of branch rates with the branch rates derived using a Bayesian method. We provided correct priors (based on simulation parameters) and conducted the analyses using MCMCTree software (Yang 2007). Bayesian branch rate estimates showed a more diffuse relationship with true rates in internal branches for AR data sets (fig. 4e;r2 = 0.00) as compared with RRF (fig. 4a;r2 = 0.29). Fundamental reasons underlying these patterns are presented in the Discussion section. These patterns notwithstanding, Bayesian estimates of times for AR data sets showed a slope of 0.99 with true time estimates (fig. 4g), which means that node age estimates are generally robust to difficulties in estimating branch-specific rates. This robustness was also seen for IR data sets, where Bayesian branch rates showed a more diffuse relationship with the true rates (fig. 4f), but estimated times showed a slope close to 1 with a high r2 (0.91). Overall, both Bayesian and RRF approaches showed similar or lower accuracy in estimation of rates and node ages for IR data sets as compared with AR data sets, potentially because rate independence requires the estimation of a greater number of free parameters. Analysis of Large Data Sets (Phylogeny with 100 Ingroup Sequences) We next analyzed data sets consisting of 100 ingroup sequences, which were evolved over a range of empirical rate variation parameters. As observed for data sets containing only three ingroup sequences, RRF lineage rate estimates were highly correlated with the true rates for AR data sets (fig. 5a). Lineage rate correlations were generally lower for IR data sets and these correlations were higher for external (tip) lineages (fig. 5b). Importantly, distributions of the slopes of RelTime node age estimates and true times were centered at close to 1 for both AR and IR data sets (fig. 5c). Fig. 5. View largeDownload slide Performance of RRF in the analysis of data sets with 100 ingroup sequences and an outgroup. Fraction of data sets for which RRF inferred lineage rates are correlated with true rates at different levels of correlation for data sets simulated with (a) autocorrelated rates and (b) independent rates. Dotted lines represent external branches and solid lines indicate the internal branches. (c) Distribution of the linear regression slopes of RRF estimates and true times for different data sets. The results presented are from analyses of 35 data sets in which sequences evolved with autocorrelated branch rates (blue lines) and another 35 data sets that were evolved with independent branch rates (red lines). In regression analysis, the intercept was set to zero, because the estimated node age is expected to be zero when the true node age is zero. The regression slopes generated with and without this assumption produced similar patterns. Fig. 5. View largeDownload slide Performance of RRF in the analysis of data sets with 100 ingroup sequences and an outgroup. Fraction of data sets for which RRF inferred lineage rates are correlated with true rates at different levels of correlation for data sets simulated with (a) autocorrelated rates and (b) independent rates. Dotted lines represent external branches and solid lines indicate the internal branches. (c) Distribution of the linear regression slopes of RRF estimates and true times for different data sets. The results presented are from analyses of 35 data sets in which sequences evolved with autocorrelated branch rates (blue lines) and another 35 data sets that were evolved with independent branch rates (red lines). In regression analysis, the intercept was set to zero, because the estimated node age is expected to be zero when the true node age is zero. The regression slopes generated with and without this assumption produced similar patterns. Analysis of Hybrid Rate Data Sets (Phylogeny Containing 91 Ingroup Sequences) We also examined the performance of RRF in an analysis of simulated data from Beaulieu et al. (2015), who simulated two lognormal distributions (hybrid rate model) for an angiosperm phylogeny in which herbaceous clades exhibited higher and more variable evolutionary rates than woody clades (fig. 6a). They reported that one rate model Bayesian methods produced considerably more ancient date estimates for the divergence of herbaceous and woody clades. This overestimation of divergence time became more severe as the difference between the two rates increased (fig. 6b). Application of RRF produced divergence time estimates that were much closer to true times (fig. 6c and d), which shows that RRF can be useful in cases where the rate distribution differs among clades (Smith and Donoghue 2008; Dornburg et al. 2012; Beaulieu et al. 2015) or when clocks are local (Drummond and Suchard 2010; Crisp et al. 2014). Fig. 6. View largeDownload slide (a) Hybrid distribution of rates for branches leading to woody taxa (brown) and herbaceous taxa (green), with the former evolving three times (3×) slower than the latter. (b) Bayesian estimates reported by Beaulieu et al. (2015) when the rate difference between clades was 3-fold (3×, solid line) and 6-fold (6×, dashed line), with the simulated angiosperm age of 140 Ma shown by a red line. RelTime estimates of angiosperm age for Beaulieu et al. (2015)’s alignments with (c) 3× mean rate difference and (d) 6× mean rate difference. Median and SD for age estimates are shown. Beaulieu et al. (2015) simulated 100 replicates (1,000 bases) under a GTR model for each scenario. Bayesian analyses were conducted using a single uncorrelated lognormal rate prior in Beaulieu et al. (2015). The same alignments, topology, and ingroup calibrations were used in RRF analyses. Fig. 6. View largeDownload slide (a) Hybrid distribution of rates for branches leading to woody taxa (brown) and herbaceous taxa (green), with the former evolving three times (3×) slower than the latter. (b) Bayesian estimates reported by Beaulieu et al. (2015) when the rate difference between clades was 3-fold (3×, solid line) and 6-fold (6×, dashed line), with the simulated angiosperm age of 140 Ma shown by a red line. RelTime estimates of angiosperm age for Beaulieu et al. (2015)’s alignments with (c) 3× mean rate difference and (d) 6× mean rate difference. Median and SD for age estimates are shown. Beaulieu et al. (2015) simulated 100 replicates (1,000 bases) under a GTR model for each scenario. Bayesian analyses were conducted using a single uncorrelated lognormal rate prior in Beaulieu et al. (2015). The same alignments, topology, and ingroup calibrations were used in RRF analyses. Furthermore, Tamura et al. (2012) found that RelTime produced accurate time estimates in simulations with a very large number of sequences even when one clade possessed accelerated evolutionary rates, where penalized likelihood methods did not perform as well. In general, we expect that the limitations of on rate model Bayesian analyses will be overcome by local clock methods (Drummond and Suchard 2010; Höhna et al. 2016; Lartillot et al. 2016), but the computational time required to analyze even modestly sized data sets via these approaches can be prohibitive. So, the current RRF approach, which does not assume a specific model for rate variation, may be suitable for such data in its current implementation or as a foundation for future methodological refinements. Discussion We have presented a mathematical foundation for the relative rate framework (RRF) underlying the RelTime method. In the following, we compare RRF with other approaches for estimating divergence times and present the motivation behind RRF. Lineage Rates versus Branch Rates In RRF, evolutionary rate heterogeneity in a phylogeny is considered by comparing rates between sister lineages emanating from internal nodes in a phylogeny. A lineage rate is an average of all the lineage rates that belong to that lineage, including all the descendants (e.g., node x in fig. 3). RRF’s estimation of evolutionary lineage rates is fundamentally different from the comparison and modeling of branch rates in other approaches. For example, the penalized likelihood methods consider differences in rates between ancestral and descendant branches (Sanderson 1997, 2003), and Bayesian methods model branch rates to share a probabilistic distribution (e.g., lognormal distribution). The distinction between lineage and branch rates complicates direct comparisons of evolutionary rates produced by using RelTime and Bayesian methods, except for external (tip) branches for which the lineages consist of only one branch. In this case, RelTime and Bayesian estimates of rates show similar trends (fig. 4, blue circles). This trend was also observed in the analysis of 100 sequence data sets, where the correlation of the estimated external branch rates with true branch rates was high for both RelTime (median correlation = 0.76 and 0.69) and Bayesian (median correlation = 0.72 and 0.86) analyses of AR and IR data sets, respectively. For internal branches, computer simulations showed greater similarity between RRF estimates of lineage rates and the true lineage rates (fig. 4a and b) as compared with the similarity observed between Bayesian estimates of branch rates and the true branch rates (fig. 4e and f, respectively). This was also the case in the analysis of 100 sequence data sets: internal branch rates from Bayesian methods were less strongly correlated with the true rates (median correlation = 0.42 and 0.36 for AR and IR data sets, respectively), as compared with those observed for lineage rates from RRF (median correlation = 0.79 and 0.55 for AR and IR data sets, respectively). These trends are due to the fact that the estimate of a branch rate is a function of two time estimates: one for the ancestral node and another for the descendant node. For example, the variance of the rate on branch with length b4 in figure 2a is a function of the variance of two time estimates (t4 and t5), in addition to the variance of b4. In contrast, the variance of a lineage rate (e.g., ra in fig. 2a) is a function of the variance of only one time estimate (t5), in addition to the lineage depth (La), because the other time point is zero in contemporary sequence sampling. Thus, branch rates are estimated with greater variance than lineage rates, which results in lower correlations seen for Bayesian approaches. Underlying Evolutionary Rate Model in RRF RRF exploits the fact that the estimation of the ratio of lineage rates at any node in a phylogeny is independent of that node’s age. For example, the ratio of evolutionary rates at node 4, r1/r2, does not depend on t4 (Fig. 7a). It is clear that the rate of evolution is higher in the lineage leading from node 4 to taxon 2 than to taxon 1 (r2 > r1), because b2 is longer than b1 in figure 7a. In fact, we can estimate r1/r2 (= b1/b2) without knowing anything about the probability distribution of evolutionary rates throughout the tree. Similarly, the other rate ratio in this tree does not depend on knowledge of the distribution of rates among branches or lineages, it is simply [(b1 + b2)/2 + b4]/b3 when using the arithmetic means and (b1b2+b4)/b3when using the geometric means. Fig. 7. View largeDownload slide A phylogenetic tree of three taxa (1, 2, and 3). (a) Original phylogenetic tree with the observed branch lengths (b’s), which are necessary to estimate node times (t’s) shown in panel (b). Evolutionary trees where the rate for the subtree containing taxon 1 and 2 (s4) is much (c) faster or (d) slower than that of its ancestral branch (r4). Fig. 7. View largeDownload slide A phylogenetic tree of three taxa (1, 2, and 3). (a) Original phylogenetic tree with the observed branch lengths (b’s), which are necessary to estimate node times (t’s) shown in panel (b). Evolutionary trees where the rate for the subtree containing taxon 1 and 2 (s4) is much (c) faster or (d) slower than that of its ancestral branch (r4). However, the node-by-node specification of relative lineage rates is not sufficient to estimate relative times t4 and t5. For that, we need to know the relationship of subtree rate s4 and branch rate r4, where s4 is the overall evolutionary rate of the subtree originating at node 4 (contains taxon 1 and 2) and r4 is the evolutionary rate on branch b4 (fig. 7a). Without assuming a specific distribution of rates, s4/r4 cannot be determined uniquely and t4 can be at any point between 0 and t5. Figure 7c and d present two extreme possibilities. In one, if the subtree rate (s4) is much higher after the divergence event at node 4 (s4 ≫ r4), then the estimate of t4 will be small and the divergence event recent (fig. 7c). Alternately, if the subtree rate is much slower after the divergence event at node 4 (s4 ≪ r4), then t4 will be much more ancient (fig. 7d). In its mathematical formulation, RRF considers the best estimate of the rate of evolution of an ancestral lineage to be the average of the rate of evolution of its two descendant lineages (e.g., eqs. 3, 14, 15, and 16), while accommodating the relative rates among lineages. In the current example, this would result in the timetree shown in figure 7b. This is the principle of minimum rate change from the ancestor to its immediate descendants, where we do not favor extreme rate assignments, for example, those in figure 7c and d. Probabilities of such extreme rate assignments are also low in commonly used branch rate distributions (e.g., lognormal, normal, and exponential distributions), so Bayesian methods also tend to favor the smallest rate change needed to explain the data. Relationship of RRF with Other Molecular Clock Methods The treatment of evolutionary rates is conceptually different between Bayesian methods and RRF, because RRF does not assume a specific statistical distribution for modeling (lineage) rate variation at the outset and Bayesian methods model branch rates rather than the lineage rates. The application of the principle of minimum rate change in RRF is different from nonparametric and semiparametric approaches based on the idea of Sanderson (1997), because RRF minimizes lineage rate changes rather than the branch rate changes from an ancestor to its immediate descendants. In addition, RRF does not attempt to estimate a universal penalty for the speed of rate change throughout the tree. RRF is distinct from strict clock approaches because strict clock methods only apply rate averaging (e.g., eqs. 3 and 4) at each node in the tree, but RRF imposes an additional constraint that the ratio of sister lineage rates be the ratio of their lineage lengths (e.g., eqs. 1 and 2). These additional constraints relax the strict clock and allow rates to vary throughout the tree. For this reason, RelTime is also different from some molecular clock methods that assume the ratio of ages between two nodes to be proportional to the ratio of their average node-to-tip distances (Purvis 1995; Britton et al. 2002,, 2007). This assumption is tantamount to assuming equality of rates among lineages, and, thus, a strict molecular clock. For this reason, these approaches require pruning of taxa or lineages for which the rate equality does not apply (Takezaki et al. 1995). RRF does not assume equality among any lineage rates at any time, and it does not require the removal of rate heterogeneous lineages. Statistical Distribution of Relative Lineage Rates Relative lineage rates produced by RRF will show extensive correlation, because the evolutionary rate of a lineage is a function of evolutionary rates of all its descendant lineages. This would result in both local and global correlation, which is expected to be present in phylogenies with autocorrelated as well as independent branch rates. As expected, the analysis of 100 sequence data sets showed correlation between ancestral and descendant lineage rates when branch rates were autocorrelated (median correlation = 0.88) or independent (median correlation = 0.77). Therefore, RRF is fundamentally different from the autocorrelated (branch) rate model of Kishino et al. (2001) as well as the independent (branch) rate model of Drummond et al. (2006), as they deal with branches rather than lineages, as defined here. For this reason, the lineage rates produced by RRF are not directly comparable with those observed for branch rates produced by Bayesian methods. Even though RRF does not require a statistical distribution of lineage rates at the outset, the resulting estimates of lineage rates may follow a statistical distribution. We examined this relationship in an analysis of 100 ingroup sequence data sets in which branch rates were simulated with lognormal, exponential, or uniform distributions. When branch rates followed a lognormal distribution, the distribution of true lineage rates was also lognormal, as was the distribution of RRF lineage rate estimates (fig. 8a–d). When the branch rates followed an exponential distribution, the RRF and true lineage rates showed a similar distribution (fig. 8e). In the case of a uniform distribution of branch rates, the lineage rates showed a normal-like distribution and the RRF rate estimates were lognormally distributed (fig. 8f). All of these results suggest that a flexible lognormal distribution will generally fit the distribution of RRF lineage rates. Importantly, time estimates showed a linear relationship with the true times, with slopes close to 1.0 (fig. 8g–i). Fig. 8. View largeDownload slide Distributions of true and RRF-derived estimates of lineage rates. Branch rates were simulated under autocorrelated lognormal rate models with (a) low and (b) high dispersions, independent lognormal rate models with (c) low and (d) high dispersions, and independent rate models with (e) exponential and (f) uniform distributions. Green lines represent the fitted curves of the true lineage rate distributions and blue bars show the distributions of RRF rates. Rate unit is substitutions per site per million years. (g–i) Relationships between true times and RRF times in six rate scenarios. Black circles and lines represent the average times of 5 replicates simulated under rate scenarios in (a), (c), and (e), and gray triangles and lines represent the average times of 5 replicates simulated under rate scenarios in (b), (d), and (f). All times are normalized to the sum of ingroup divergence times. Regression slopes and correlation coefficients (r2) are shown. Fig. 8. View largeDownload slide Distributions of true and RRF-derived estimates of lineage rates. Branch rates were simulated under autocorrelated lognormal rate models with (a) low and (b) high dispersions, independent lognormal rate models with (c) low and (d) high dispersions, and independent rate models with (e) exponential and (f) uniform distributions. Green lines represent the fitted curves of the true lineage rate distributions and blue bars show the distributions of RRF rates. Rate unit is substitutions per site per million years. (g–i) Relationships between true times and RRF times in six rate scenarios. Black circles and lines represent the average times of 5 replicates simulated under rate scenarios in (a), (c), and (e), and gray triangles and lines represent the average times of 5 replicates simulated under rate scenarios in (b), (d), and (f). All times are normalized to the sum of ingroup divergence times. Regression slopes and correlation coefficients (r2) are shown. Point Estimates and Their Variances RRF yields point estimates for (relative) lineage rates and divergence times based on branch lengths. As is the common practice in classical statistics, estimates of dispersion such as SEs and confidence intervals accompany all estimates. The variance of a lineage rate estimate is a function of branch length variances. It can be obtained analytically by using the equations for lineage rate (e.g., eqs. 28–31 for the case of three taxa with outgroup by the delta method) or simply by using a bootstrap sampling procedure. However, the estimation of confidence intervals around the node ages depends on branch length variances as well as the degree of inequality of evolutionary rates among lineages (Kumar and Hedges 2016). Tamura et al. (2013) proposed a method to estimate confidence intervals within RelTime, which produces rather wide confidence intervals. We are currently investigating an advanced approach to narrow confidence intervals while maintaining appropriate coverage probabilities, but this subject is beyond the scope of the current article. Computational Speed of RelTime RRF scales well with increasing numbers of sequences and is much faster than Bayesian methods for analyzing of molecular sequence data (fig. 1). The fast computational speed is due to the innovation that RRF uses all the data first to map a large alignment onto a phylogeny, and then it uses the resulting branch lengths to generate relative divergence times and evolutionary lineage rates. The computational time taken by RRF is the sum of time taken to generate maximum likelihood estimates of branch lengths for a given sequence alignment and phylogeny and the time taken to estimate rates and dates using RRF. The latter is negligible compared with the former, because of the analytical nature of RRF. In comparison, Bayesian methods are computationally demanding because they require a substantial exploration of likelihood space using prior distributions to generate posterior estimates of rates and divergence times. RelTime for Phylogenies with Branch Lengths The above decomposition has a positive side effect, in addition to making RelTime computationally speedy. RRF may be applied to any phylogeny where branch lengths reflect the amount of change. For example, RRF is directly applicable when branch lengths are estimated by using pairwise evolutionary distances and a least squares approach for a given tree topology (Rzhetsky and Nei 1993). In addition to multiple sequence alignments, such distances can come from unaligned and locally aligned genome or genomic segments (Otu and Sayood 2003; Henz et al. 2005; Auch et al. 2006; Deng et al. 2006; Gao and Qi 2007; Lin et al. 2009; Xu and Hao 2009). In fact, RRF can be applied to any phylogeny where branch lengths are generated using other types of molecular data (e.g., gene expression patterns and breakpoint distances) or nonmolecular data (e.g., morphological and life history traits) (Herniou et al. 2001; Gramm and Niedermeier 2002; King et al. 2016; Cooney et al. 2017). Of course, the accuracy of the relative rate and time inferences made for such data depends directly on the accuracy of the phylogenetic tree and the branch lengths, so the biological interpretations of the results obtained will require utmost care. Usefulness of Relative Node Ages RRF’s accurate estimation of relative node ages without assuming a speciation model or calibration priors can benefit many applications (Tamura et al. 2012). For example, relative node ages from molecular data are directly comparable with those from fossil data. This allows evaluation of biological hypotheses without the circularity created by the use of calibration priors and densities inferred from molecular data (Battistuzzi et al. 2015; Gold et al. 2017). Along these lines, RRF has been used to develop a protocol to identify calibration priors that have the strongest influence on the final time estimates in Bayesian dating (Battistuzzi et al. 2015), because the cross-validation methods are unlikely to be effective (Warnock et al. 2012,, 2015). Inferring Absolute Times from Relative Node Ages By placing calibration constraints on one or more nodes in the tree, we can generate an absolute timetree from the ultrametric tree containing relative node ages. Tamura et al. (2013) presented an algorithmic approach for adjusting relative rates to ensure that the estimated times for calibrated nodes are within the boundaries. This process uses the maximum and minimum boundaries only for calibration, which is preferable when the uncertainty distribution of calibrations is not known precisely. Otherwise, there is high probability of biased time estimation (Hedges and Kumar 2004; Ho and Phillips 2009; Inoue et al. 2010; Heath et al. 2014; Ho and Duchêne 2014; dos Reis et al. 2015). Tamura et al.’s approach worked well in the analysis of large data sets, because Bayesian time estimates reported in multiple large-scale studies are similar to those produced by RRF using ultrametric trees with relative times that were transformed into timetrees using many calibration constraints (Mello et al. 2017). Conclusions We have presented a mathematical foundation for the RelTime method and elucidated its relationship with other relaxed and strict clock methods. We have shown that the relative rate framework (RRF) produces excellent estimates of rates and divergence times for evolutionary lineages. It is, however, important to note that estimates of divergence times in a phylogeny are only biologically meaningful when the reconstruction of evolutionary relationships is robust. Therefore, the best practice is to first obtain a reliable phylogeny and then estimate divergence times. We must also consider the confidence intervals associated with node ages to assess the precision of time estimates prior to making biological inferences. Materials and Methods Computer Simulations and Analysis We simulated 200 multisequence alignments: 50 each for two models of evolutionary rates (independent and autocorrelated among branches) for two model topologies containing three- and four-ingroup taxa (fig. 2a and b, respectively). The node height of the ingroup subtree was set to be 10 time units, while the node heights of all descendent subtrees varied randomly between 0 and 10 time units. For each resulting model timetree, branch rates were sampled from a lognormal distribution, where the mean rate was drawn randomly from an empirical distribution (Rosenberg and Kumar 2003) and the SD varied from 0.25 to 0.75 for all branches independently. For the autocorrelated rates, the initial rate was drawn randomly from an empirical distribution (Rosenberg and Kumar 2003) and the autocorrelation parameter was varied from 0.1 to 0.3. This rate sampling resulted in a phylogram with branch lengths that could be used as input for SeqGen (Grassly et al. 1997). We used the Hasegawa–Kinshino–Yano (HKY) model (Hasegawa et al. 1985) with 4 gamma categories and empirically derived GC content and transition/transversion ratio (Rosenberg and Kumar 2003) as simulation parameters. We generated simulated multispecies alignments where each sequence was 3, 000 base pairs long. The results from three-ingroup sequence analysis (fig. 4) were similar to those from the analysis of four-ingroup sequences (not shown). Using the same simulation strategy, we created 35 alignments each under independent and autocorrelated rate scenarios following a master phylogeny of 100 taxa that was sampled from the bony-vertebrate clade in the Timetree of Life (Hedges and Kumar 2009). In the independent rate case, the SD varied from 0.3 to 0.5. In the autocorrelated rate case, the autocorrelation parameter varied from 0.01 to 0.04. All other simulation parameters (GC contents, transition/transversion ratio, and sequence length) were derived from empirical distributions (Rosenberg and Kumar 2003). Using the same 100 ingroup taxon master phylogeny of bony vertebrates, we simulated five additional sequence data sets under (1) an autocorrelated lognormal rate model with low dispersion, where the initial rate and the autocorrelation parameter were set to be r and 0.02, respectively; (2) an autocorrelated lognormal rate model with high dispersion, where the initial rate and the autocorrelation parameter were set to be r and 0.05, respectively; (3) an independent lognormal rate model with low dispersion, where the mean rate and the SD (in log scale) were set to be r and 0.25, respectively; (4) an independent lognormal rate model with high dispersion, where the mean rate and the SD (in log scale) were set to be r and 0.75, respectively; (5) an independent exponential rate model, where the mean rate was set to be r; and (6) an independent uniform rate model, where the branch-specific rates were sampled from an uniform distribution from 0 to 2r. GC contents, transition/transversion ratio, sequence length, and evolutionary rate r were derived from empirical distributions (Rosenberg and Kumar 2003). MEGA software (Kumar et al. 2012, 2016) was used to obtain maximum likelihood estimates of branch lengths from simulated sequence alignments, where the correct substitution model and tree topology were used. RRF (eqs. 28–42) was applied to the resulting phylogram with branch lengths, and relative lineage rates and times were obtained. One calibration (true age ± 10 Ma) at the crown node of the ingroup was used to convert relative time estimates for comparison with true times (fig. 8). No calibrations were used in other RRF analyses. All Bayesian analyses were conducted using MCMCTree (Yang 2007) using correct priors; two independent runs of five million generations were carried out. Results were checked for convergence using Tracer (Rambaut et al. 2014). ESS values were higher than 200 after removing 10% burn-in samples in each run. MCMCTree analyses used one root calibration (true age ± 0.1 time units). Analysis of Hybrid Rate Models Simulated data sets and BEAST results were provided by Beaulieu et al. (2015), or retrieved from the Dryad Repository. All outgroup and root calibrations are automatically disregarded in RelTime, because the assumption of equal rates of evolution between the ingroup and outgroup sequences is not testable (Kumar et al. 2016). Lognormal distributions with fixed median values of “true ages” were used as calibration densities in the original study (Beaulieu et al. 2015). Because RelTime does not require specific density distributions for calibrations, we used true age ± 5 Ma for all 15 ingroup calibrated nodes in the reanalysis to directly compare RelTime divergence time estimates with those from BEAST. Calibrations employed in RelTime (true age ± 5 Ma) had boundaries similar to 99% probability densities of lognormal distributions originally employed as calibrations. The same alignments, topology, and ingroup calibrations were used in RRF analyses. Estimates of angiosperm age were obtained by summarizing estimates of 100 data sets each in which the herbaceous clades have 3 times (3×) and 6 times (6×) higher rates than those of woody clades. Acknowledgments We are thankful to Heather Rowe for editorial comments and Beatriz Mello for many helpful discussions. We thank Jeremy Beaulieu for providing simulated data and Bayesian results. This research was supported by grants from NIH (R01GM126567-01), National Aeronautics and Space Administration (NASA, NNX16AJ30G), and Tokyo Metropolitan University (DB105). References Auch AF , Henz SR , Holland BR , Göker M. 2006 . Genome BLAST distance phylogenies inferred from whole plastid and whole mitochondrion genome sequences . BMC Bioinformatics 7 : 350. Google Scholar CrossRef Search ADS PubMed Battistuzzi FU , Billing-Ross P , Murillo O , Filipski A , Kumar S. 2015 . A protocol for diagnosing the effect of calibration priors on posterior time estimates: a case study for the Cambrian explosion of animal phyla . Mol Biol Evol . 32 7 : 1907 – 1912 . Google Scholar CrossRef Search ADS PubMed Beaulieu JM , O’Meara BC , Crane P , Donoghue MJ. 2015 . Heterogeneous rates of molecular evolution and diversification could explain the Triassic age estimate for angiosperms . Syst Biol . 64 5 : 869 – 878 . Google Scholar CrossRef Search ADS PubMed Bonaldo MC , Ribeiro IP , Lima NS , Dos Santos AAC , Menezes LSR , da Cruz SOD , de Mello IS , Furtado ND , de Moura EE , Damasceno L et al. , . 2016 . Isolation of infective Zika virus from urine and saliva of patients in Brazil . PLoS Negl Trop Dis . 10 6 : e0004816. Google Scholar CrossRef Search ADS PubMed Bond JE , Garrison NL , Hamilton CA , Godwin RL , Hedin M , Agnarsson I. 2014 . Phylogenomics resolves a spider backbone phylogeny and rejects a prevailing paradigm for orb web evolution . Curr Biol . 24 15 : 1765 – 1771 . Google Scholar CrossRef Search ADS PubMed Britton T , Anderson CL , Jacquet D , Lundqvist S , Bremer K. 2007 . Estimating divergence times in large phylogenetic trees . Syst Biol . 56 5 : 741 – 752 . Google Scholar CrossRef Search ADS PubMed Britton T , Oxelman B , Vinnersten A , Bremer K. 2002 . Phylogenetic dating with confidence intervals using mean path lengths . Mol Phylogenet Evol . 24 1 : 58 – 65 . Google Scholar CrossRef Search ADS PubMed Cooney CR , Bright JA , Capp EJR , Chira AM , Hughes EC , Moody CJA , Nouri LO , Varley ZK , Thomas GH. 2017 . Mega-evolutionary dynamics of the adaptive radiation of birds . Nature 542 7641 : 344 – 347 . Google Scholar CrossRef Search ADS PubMed Crisp MD , Hardy NB , Cook LG. 2014 . Clock model makes a large difference to age estimates of long-stemmed clades with no internal calibration: a test using Australian grasstrees . BMC Evol Biol . 14 : 263 – 279 . Google Scholar CrossRef Search ADS PubMed Deng R , Huang M , Wang J , Huang Y , Yang J , Feng J , Wang X. 2006 . PTreeRec: Phylogenetic Tree Reconstruction based on genome BLAST distance . Comput Biol Chem . 30 4 : 300 – 302 . Google Scholar CrossRef Search ADS PubMed Dornburg A , Brandley MC , McGowen MR , Near TJ. 2012 . Relaxed clocks and inferences of heterogeneous patterns of nucleotide substitution and divergence time estimates across whales and dolphins (Mammalia: cetacea) . Mol Biol Evol . 29 2 :721–243. dos Reis M , Donoghue PC , Yang Z. 2016 . Bayesian molecular clock dating of species divergences in the genomics era . Nat Rev Genet . 17 2 : 71 – 80 . Google Scholar CrossRef Search ADS PubMed dos Reis M , Thawornwattana Y , Angelis K , Telford MJ , Donoghue PC , Yang Z. 2015 . Uncertainty in the timing of origin of animals and the limits of precision in molecular timescales . Curr Biol . 25 22 :2939–2912. Drummond AJ , Ho SYW , Phillips MJ , Rambaut A. 2006 . Relaxed phylogenetics and dating with confidence . PLoS Biol . 4 5 : e88 – e99 . Google Scholar CrossRef Search ADS PubMed Drummond AJ , Suchard MA. 2010 . Bayesian random local clocks, or one rate to rule them all . BMC Biol . 8 : 114 – 125 . Google Scholar CrossRef Search ADS PubMed Filipski A , Murillo O , Freydenzon A , Tamura K , Kumar S. 2014 . Prospects for building large timetrees using molecular data with incomplete gene coverage among species . Mol Biol Evol . 31 9 : 2542 – 2550 . Google Scholar CrossRef Search ADS PubMed Gao L , Qi J. 2007 . Whole genome molecular phylogeny of large dsDNA viruses using composition vector method . BMC Evol Biol . 7 : 41. Google Scholar CrossRef Search ADS PubMed Gold DA , Caron A , Fournier GP , Summons RE. 2017 . Paleoproterozoic sterol biosynthesis and the rise of oxygen . Nature 543 7645 : 420 – 423 . Google Scholar CrossRef Search ADS PubMed Gramm J , Niedermeier R. 2002 . Breakpoint medians and breakpoint phylogenies: a fixed-parameter approach . Bioinformatics 18 ( Suppl 2 ): S128 – S139 . Google Scholar CrossRef Search ADS PubMed Grassly NC , Adachi J , Rambaut A. 1997 . Seq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees . Comput Appl Biosci . 13 3 : 235 – 238 . Google Scholar PubMed Hasegawa M , Kishino H , Yano T. 1985 . Dating of the human-ape splitting by a molecular clock of mitochondrial DNA . J Mol Evol . 22 2 : 160 – 174 . Google Scholar CrossRef Search ADS PubMed Heath TA , Huelsenbeck JP , Stadler T. 2014 . The fossilized birth-death process for coherent calibration of divergence-time estimates . Proc Natl Acad Sci U S A . 111 29 : E2957 – E2966 . Google Scholar CrossRef Search ADS PubMed Hedges SB , Kumar S. 2004 . Precision of molecular time estimates . Trends Genet . 20 5 : 242 – 247 . Google Scholar CrossRef Search ADS PubMed Hedges SB , Kumar S. 2009 . The timetree of life . New York : Oxford University Press . Henz SR , Huson DH , Auch AF , Nieselt-Struwe K , Schuster SC. 2005 . Whole-genome prokaryotic phylogeny . Bioinformatics 21 10 : 2329 – 2335 . Google Scholar CrossRef Search ADS PubMed Herniou EA , Luque T , Chen X , Vlak JM , Winstanley D , Cory JS , O’Reilly DR. 2001 . Use of whole genome sequence data to infer baculovirus phylogeny . J Virol . 75 17 : 8117 – 8126 . Google Scholar CrossRef Search ADS PubMed Ho SY , Duchêne S. 2014 . Molecular-clock methods for estimating evolutionary rates and timescales . Mol Ecol . 23 24 : 5947 – 5965 . Google Scholar CrossRef Search ADS PubMed Ho SYW , Phillips MJ. 2009 . Accounting for calibration uncertainty in phylogenetic estimation of evolutionary divergence times . Syst Biol . 58 3 : 367 – 380 . Google Scholar CrossRef Search ADS PubMed Höhna S , Landis MJ , Heath TA , Boussau B , Lartillot N , Moore BR , Huelsenbeck JP , Ronquist F. 2016 . RevBayes: bayesian phylogenetic inference using graphical models and an interactive model-specification language . Syst Biol . 65 : 726 – 736 . Google Scholar CrossRef Search ADS PubMed Inoue J , Donoghue PCJ , Yang Z. 2010 . The impact of the representation of fossil calibrations on Bayesian estimation of species divergence times . Syst Biol . 59 1 : 74 – 89 . Google Scholar CrossRef Search ADS PubMed King B , Qiao T , Lee MSY , Zhu M , Long JA. 2016 . Bayesian morphological clock methods resurrect placoderm monophyly and reveal rapid early evolution in jawed vertebrates . Syst Biol . 66 : 499 – 516 . Kishino H , Thorne JL , Bruno WJ. 2001 . Performance of a divergence time estimation method under a probabilistic model of rate evolution . Mol Biol Evol . 18 3 : 352 – 361 . Google Scholar CrossRef Search ADS PubMed Kumar S , Hedges SB. 2016 . Advances in time estimation methods for molecular data . Mol Biol Evol . 33 4 : 863 – 869 . Google Scholar CrossRef Search ADS PubMed Kumar S , Stecher G , Peterson D , Tamura K. 2012 . MEGA-CC: computing core of molecular evolutionary genetics analysis program for automated and iterative data analysis . Bioinformatics 28 20 : 2685 – 2686 . Google Scholar CrossRef Search ADS PubMed Kumar S , Stecher G , Tamura K. 2016 . MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets . Mol Biol Evol . 33 7 : 1870 – 1874 . Google Scholar CrossRef Search ADS PubMed Lartillot N , Phillips MJ , Ronquist F. 2016 . A mixed relaxed clock model . Philos Trans R Soc B 371 1699 : 20150132. Google Scholar CrossRef Search ADS Lin GN , Cai Z , Lin G , Chakraborty S , Xu D. 2009 . ComPhy: prokaryotic composite distance phylogenies inferred from whole-genome gene sets . BMC Bioinformatics 10 ( Suppl 1 ): S5. Google Scholar CrossRef Search ADS Mahler DL , Ingram T , Revell LJ , Losos JB. 2013 . Exceptional convergence on the macroevolutionary landscape in island lizard radiations . Science 341 6143 : 292 – 295 . Google Scholar CrossRef Search ADS PubMed Mello B , Tao Q , Tamura K , Kumar S. 2017 . Fast and accurate estimates of divergence times from big data . Mol Biol Evol . 34 1 : 45 – 50 . Google Scholar CrossRef Search ADS PubMed Otu HH , Sayood K. 2003 . A new sequence distance measure for phylogenetic tree construction . Bioinformatics 19 16 : 2122 – 2130 . Google Scholar CrossRef Search ADS PubMed Purvis A. 1995 . A composite estimate of primate phylogeny . Philos Trans R Soc B 348 1326 : 405 – 421 . Google Scholar CrossRef Search ADS Rambaut A , Suchard M A Xie D , Drummond A. 2014 . Tracer v1.6. Institute of Evolutionary Biology, University of Edinburgh, UK. Available from: http://beast.bio.ed.ac.uk/Tracer, last accessed March 29, 2018. Rosenberg MS , Kumar S. 2003 . Heterogeneity of nucleotide frequencies among evolutionary lineages and phylogenetic inference . Mol Biol Evol . 20 4 : 610 – 621 . Google Scholar CrossRef Search ADS PubMed Rzhetsky A , Nei M. 1993 . Theoretical foundation of the minimum-evolution method of phylogenetic inference . Mol Biol Evol . 10 5 : 1073 – 1095 . Google Scholar PubMed Sanderson MJ. 1997 . A nonparametric approach to estimating divergence times in the absence of rate constancy . Mol Biol Evol . 14 12 : 1218 – 1231 . Google Scholar CrossRef Search ADS Sanderson MJ. 2003 . r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock . Bioinformatics 19 2 : 301 – 302 . Google Scholar CrossRef Search ADS PubMed Smith SA , Donoghue MJ. 2008 . Rates of molecular evolution are linked to life history in flowering plants . Science 322 5898 : 86 – 89 . Google Scholar CrossRef Search ADS PubMed Takezaki N , Rzhetsky A , Nei M. 1995 . Phylogenetic test of the molecular clock and linearized trees . Mol Biol Evol . 12 5 : 823 – 833 . Google Scholar PubMed Tamura K , Battistuzzi FU , Billing-Ross P , Murillo O , Filipski A , Kumar S. 2012 . Estimating divergence times in large molecular phylogenies . Proc Natl Acad Sci U S A . 109 47 : 19333 – 19338 . Google Scholar CrossRef Search ADS PubMed Tamura K , Stecher G , Peterson D , Filipski A , Kumar S. 2013 . MEGA6: molecular evolutionary genetics analysis version 6.0 . Mol Biol Evol . 30 12 : 2725 – 2729 . Google Scholar CrossRef Search ADS PubMed Warnock RC , Parham JF , Joyce WG , Lyson TR , Donoghue PC. 2015 . Calibration uncertainty in molecular dating analyses: there is no substitute for the prior evaluation of time priors . Proc R Soc B 282 1798 : 20141013. Google Scholar CrossRef Search ADS PubMed Warnock RC , Yang Z , Donoghue PC. 2012 . Exploring uncertainty in the calibration of the molecular clock . Biol Lett . 8 1 : 156 – 159 . Google Scholar CrossRef Search ADS PubMed Xu Z , Hao B. 2009 . CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes . Nucleic Acids Res . 37 ( Web Server ): W174 – W178 . Google Scholar CrossRef Search ADS PubMed Yang Z. 2007 . PAML 4: phylogenetic analysis by maximum likelihood . Mol Biol Evol . 24 8 : 1586 – 1591 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

Journal

Molecular Biology and EvolutionOxford University Press

Published: Mar 20, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off