# Information Criteria for Comparing Partition Schemes

Information Criteria for Comparing Partition Schemes Abstract When inferring phylogenies, one important decision is whether and how nucleotide substitution parameters should be shared across different subsets or partitions of the data. One sort of partitioning error occurs when heterogeneous subsets are mistakenly lumped together and treated as if they share parameter values. The opposite kind of error is mistakenly treating homogeneous subsets as if they result from distinct sets of parameters. Lumping and splitting errors are not equally bad. Lumping errors can yield parameter estimates that do not accurately reflect any of the subsets that were combined whereas splitting errors yield estimates that did not benefit from sharing information across partitions. Phylogenetic partitioning decisions are often made by applying information criteria such as the Akaike information criterion (AIC). As with other information criteria, the AIC evaluates a model or partition scheme by combining the maximum log-likelihood value with a penalty that depends on the number of parameters being estimated. For the purpose of selecting an optimal partitioning scheme, we derive an adjustment to the AIC that we refer to as the AIC$$^{(p)}$$ and that is motivated by the idea that splitting errors are less serious than lumping errors. We also introduce a similar adjustment to the Bayesian information criterion (BIC) that we refer to as the BIC$$^{(p)}$$. Via simulation and empirical data analysis, we contrast AIC and BIC behavior to our suggested adjustments. We discuss these results and also emphasize why we expect the probability of lumping errors with the AIC$$^{(p)}$$ and the BIC$$^{(p)}$$ to be relatively robust to model parameterization. AIC, BIC, information criteria, model comparison, multilocus analysis, partition scheme comparison, phylogenomics Probabilistic models of DNA and protein sequence change have central roles in phylogenetics and molecular evolution. Many models have been proposed, but it is not always obvious which are most appropriate. With multilocus data, model choice can become especially difficult because the best model for one subset of the data may be suboptimal for another. The challenge of how to organize data into subsets that should be analyzed by shared parameter values has become increasingly central as improvements in DNA sequencing technology have led to bigger data sets. A simple and intuitive approach is to concatenate multilocus sequence data and regard the merged result as a single locus. However, drawbacks of this “concatenation” procedure exist and “separate analysis” has been suggested as an alternative (e.g. Adachi et al. 2000; Cao et al. 2000a,b; Nikaido et al. 2003; Nishihara et al. 2007). In separate analysis (also referred to as partitioned analysis), some model parameters may be shared among loci but other parameters may be distinct to individual loci. The concatenation approach has been criticized for ignoring heterogeneity among loci because different genes may support different tree topologies for reasons ranging from incomplete lineage sorting to horizontal gene transfer to introgression (Leigh et al. 2011; Anderson et al. 2012). With concatenation, locus-specific information is lost. In addition, evolutionary heterogeneity among loci is not limited to topology. Even when multiple loci support the same topology, natural selection, and/or mutation can cause the nucleotide substitution process to vary among loci. For example, when multiple loci are generated by an identical tree topology but according to different branch lengths, ignoring the resulting heterotachy (Lopez et al. 2002) can cause the inferred topology to differ from the one that is shared among the individual loci (e.g. Chang 1996; Kolaczkowski and Thornton 2004; Seo 2008). For diverse reasons, it would be preferable to handle variation among loci when variation exists. Although separate analysis can have advantages, it also has disadvantages. Because evolutionary parameters are separately estimated for each locus, the total number of estimated parameters can be much greater for separate analysis than concatenation. There is a trade-off between model fit and estimation uncertainty (Hastie et al. 2009). As the number of parameters increases, model fit tends to improve but the variances for estimated parameters get bigger. While they did not focus on partitioned analysis, Lemmon and Moriarty (2004) did a careful simulation study regarding underparameterization and overparameterization in phylogenetics. They found that both have negative consequences, but those from underparameterization were more severe. Thus, it is crucial to balance the number of parameters and estimation uncertainty. Another drawback of separate analyses is that they require more computation than concatenation. Therefore, for a given set of multilocus data, careful consideration is needed for whether all loci should be handled separately or whether some should be concatenated. One goal is to determine an optimal “partition scheme,” in which data partitions with similar evolutionary properties are merged and in which partitions with dissimilar properties are not merged. Previous studies (Li et al. 2008; Lanfear et al. 2012, 2014) have noted that partition schemes can be regarded as statistical models and compared via conventional model selection criteria such as the likelihood ratio test (LRT), or the Akaike information criterion (AIC) (Akaike 1974), or the Bayesian information criterion (BIC) (Schwarz 1978). These criteria can then be employed as part of an exhaustive search that examines all possible partition schemes to find the best one. Alternatively, a heuristic search can be used along with the selection criteria. This is often desirable because the number of possible partition schemes grows quickly as the number of loci increases. A Bayesian approach to simultaneously estimate the number of partitions and their model parameters (Wu et al. 2012) is another possible alternative. Two sorts of errors can be made in phylogenetic partitioning. These errors are lumping partitions together when they should be separately treated (i.e. underparameterization) and splitting loci into distinct subsets when they should be grouped together (i.e. overparameterization). Lumping errors can yield bias such as systematic errors in topology estimation (e.g. Chang 1996; Kolaczkowski and Thornton 2004; Seo 2008) whereas splitting errors can cause additional parameter uncertainty. Here, we introduce and justify two model selection criteria that we term the AIC$$^{(p)}$$ and BIC$$^{(p)}$$ because they are modifications designed for partitioning decisions of the widely-used AIC and BIC. The AIC$$^{(p)}$$ and the BIC$$^{(p)}$$ are predicated on the assumption that lumping errors have more serious consequences than splitting errors. Theory We motivate the new information criteria by showing the connection between the AIC$$^{(p)}$$ and a likelihood ratio test of the null hypothesis of partition homogeneity versus the alternative of partition heterogeneity. We then review the connection between the AIC and the Kullback– Leibler deviation. We follow this with an explanation of how the AIC$$^{(p)}$$ relates to the Kullback– Leibler (KL) deviation between the truth and a particular partition scheme of interest when there is homogeneity between partitions. We complete this section by introducing the BIC$$^{(p)}$$. Introduction and the Likelihood Ratio Test as Motivation Suppose that a K-partition scheme is to be compared with the “concatenated” 1-partition scheme. The parameter vector of model $$m_i$$ for partition $$i \in \{ 1,2, \ldots, K \}$$ of the K-partition scheme will be denoted $$\theta_{i}$$. The log-likelihood of the $$i$$th partition with model $$m_i$$ is defined as, \begin{eqnarray*} l^{(i)}(\theta_{i}) := \sum_{j=1}^{n_i} \log f (x_{ij} | \theta_{i} ), \end{eqnarray*} where $$x_{ij}$$ is the $$j$$th aligned sequence column of the $$i$$th partition and $$n_i$$ is the sequence length of the $$i$$th partition. When the entire collection of possible partitions is being considered, we employ the summation sign $$\{\Sigma i := 1 + \cdots + K \}$$ to represent this 1-partition concatenation scheme. The model for the 1-partition scheme is termed $$m_c$$ and its log-likelihood is \begin{eqnarray*} l^{(\Sigma {i})}(\theta_{c}) := l^{(1+2+\cdots+K)}(\theta_{c}) = \sum_{i=1}^K l^{(i)}(\theta_{c}). \end{eqnarray*} We treat and discuss generalizations later, but we begin by assuming that the concatenated model $$m_c$$ and the models $$m_i$$ for each partition $$i$$ share the same parameterization (i.e. topology and substitution model) but may differ in parameter values. For example, all models could share the same topology and all could have the HKY (Hasegawa et al. 1985) parameterization ($$m_i = m_c$$). The maximum likelihood estimators (MLEs) of the $$i$$th partition and of the 1-partition scheme are: \begin{eqnarray*} \widehat{\theta}_{c}^{(i)} &:=& \mathop{\mbox{argmax}}_{\theta_{c}} \left\{ l^{(i)}(\theta_{c}) \right\} \\ \widehat{\theta}_{c}^{(\Sigma {i})} &:=& \mathop{\mbox{argmax}}_{\theta_{c}} \left\{ l^{(\Sigma {i})}(\theta_{c}) \right\}, \end{eqnarray*} where $$\mathop{\mbox{argmax}}_{\theta_\beta} {h(\theta_\beta)}$$ implies “the values of the parameters $$\theta_\beta$$ of model $$m_\beta$$ that maximize $$h(\theta_\beta)$$”. The $$\widehat{\theta}_c^{\,(i)}$$ estimators represent MLEs obtained from the $$i$$th partition with model $$m_c$$. Note that $$\widehat{\theta}_c^{\,(i)}$$ and $$\widehat{\theta}_{c}^{(\Sigma {i})}$$ are expected to be unequal because the former values are obtained from only the $$i$$th partition whereas the latter values are obtained from the entire data set. A limitation of our notation is that it does not reflect the frequently arising situation where some parameters are forced to have values that are shared among all partitions and other parameters are allowed to have values that are unique to each partition. While this limitation could be corrected by expanding the notation, we choose to adopt our more simple notation and instead focus on the cases where either all parameter values are shared among partitions or all have distinct values for each partition. However, as will be emphasized later, the concepts that we discuss and the theory that justifies the AIC$$^{(p)}$$ and BIC$$^{(p)}$$ also apply to the case where some parameters have values that are shared among partitions and others do not. Consider the likelihood ratio test when the null hypothesis is a single partition (i.e. concatenation) and when the more general hypothesis is that each of the $$K$$ partitions shares the parameterization of the 1-partition model but that each has its own parameter values. Because the null hypothesis is a special case that is nested within the more general one, the likelihood ratio test statistic can be approximated as having a $$\chi^2$$ distribution when the null hypothesis is true. Specifically, $$2 \left\{ \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) - l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) \right\} \sim \chi_{(K-1)\rho_c}^2, \label{chisquaredist}$$ (1) where $$\rho_c$$ parameters are estimated according to the null hypothesis and where $$\rho_c$$ parameters are separately estimated for each partition in the K-partition scheme. When the null hypothesis of partition homogeneity is correct, Equation (1) implies $$-2 l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) \stackrel{E}{=} -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) + (K-1)\rho_c \label{LRTrelationship1}$$ (2) because $$(K-1)\rho_c$$ is the expected value of a $$\chi_{(K-1)\rho_c}^2$$ random variable. Here, the “$$\stackrel{E}{=}$$” sign indicates that the expectations are equal. Therefore, \begin{eqnarray} -2l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) + 2 \rho_c & \stackrel{E}{=} & -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) + (K-1) \rho_c + 2 \rho_c \nonumber\\ \end{eqnarray} (3) \begin{eqnarray} \label{LRTrelationship2} \\ & = & -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) + (K+1) \rho_c. \label{LRTrelationship2-b} \end{eqnarray} (4) Equation (2) shows that the expected log-likelihood of a concatenated model can be used to approximate the expected log-likelihood of a K-partition model when partitions are homogeneous (or vice-versa). The $$2 \rho_c$$ term on the left side of Equation (3) happens to be the AIC$$^{(p)}$$ penalty for a concatenated model. It is the same penalty as would be assigned by the AIC. The $$(K+1) \rho_c$$ term on the right side of Equation (4) is the AIC$$^{(p)}$$ penalty for a K-partition model and the $$(K-1) \rho_c$$ term on the right side of Equation (3) reflects the extra cost of the act of partitioning. In contrast, the AIC penalty for a K-partition model would be $$2K \rho_c$$ and this can be much heavier than the AIC$$^{(p)}$$ penalty when $$K$$ is large. As a result, the AIC is more prone than the AIC$$^{(p)}$$ to favoring concatenated models. While the LRT approach to model comparison has some attractive features, one limitation is that the LRT approach assumes that the more restricted “nested” statistical model is true. The AIC simply aims to select the candidate that is closest to the truth and it relies on an approximation that is expected to be “excellent” when the assumed model is “good” (Burnham and Anderson 2002). Likewise, the AIC$$^{(p)}$$ differs from the LRT that inspires Equation (4) because it is not predicated on the substitution model and tree topology being correct (see Part 1 of the Appendix). However, Equation (3) illustrates that the LRT and the AIC$$^{(p)}$$ penalty are closely connected when the substitution model and topology do happen to be correct. Kullback– Leibler Divergence (KLD) and the AIC Suppose that $$n$$ homogeneous samples, ($$x_1, \cdots, x_n$$), are obtained from the true but unknown distribution $$g(x)$$, and we want to fit them with model $$f(x|\theta)$$. The KLD between $$g(x)$$ and $$f(x|\theta)$$ is \begin{eqnarray} KLD[g;f(\cdot|\theta)] &=& \int \log \left( \frac{g(x)}{f(x|\theta)} \right) g(x) dx \nonumber \\ &=& E_g \left[\log g(x) \right] - E_g[\log f(x|\theta)]. \label{KLD} \end{eqnarray} (5) The model whose KLD is minimized is regarded as the best among model candidates. Because $$E_g \left[\log g(x) \right]$$ does not vary among models, Equation (5) shows that the best model must be the one that maximizes $$E_g[\log f(x|\theta)]$$. Akaike (1974) showed that $$E_g[\log f(x|\widehat{\theta})]$$ can be approximated using the MLE $$\widehat{\theta}$$ and an adjustment for the number of parameters that are estimated. Written in our notation, Akaike (1974) showed \begin{eqnarray} KLD[g;f(\cdot|\widehat{\theta})] &=& E_g [\log g(x) ] - E_g[\log f(x|\widehat{\theta})] \label{mKLD} \\ \end{eqnarray} (6) \begin{eqnarray} & \simeq& E_g [\log g(x) ] - \frac{1}{n} \left\{ \sum_{i=1}^{n} \log f(x_i|\widehat{\theta}) - \rho_{\theta} \right\}, \nonumber\\ \label{mKLD2} \end{eqnarray} (7) where $$\rho_{\theta}$$ is the dimension of $$\theta$$ and where the approximation requires $$n$$ to be large. In Equation (7), the bracketed term is the unbiased estimator of $$n E_g[\log f(x|\widehat{\theta})]$$. The $$\rho_{\theta}$$ in this estimator corrects the bias and is necessary because the data set is used to get $$\widehat{\theta}$$ and is then used again to approximate $$E_g [\cdot]$$ (Konishi and Kitagawa 2004). The AIC is defined as \begin{eqnarray} \text{AIC} &:=& -2 \sum_{i=1}^n \log f(x_i | \widehat{\theta}) + 2 \rho_\theta \label{AIC-def-original} \end{eqnarray} (8) and is derived from this unbiased estimator. When there are competing models, the one that maximizes its unbiased estimator of $$E_g[\log f(x|\theta)]$$ (or equivalently that minimizes its AIC) is regarded as the best model. Applying the definition of Equation (8), the AIC scores of concatenated data are written in our notation as \begin{eqnarray} \text{AIC}[m_c, \Sigma i] &:=& -2 l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) + 2 \rho_c \label{AIC-concat} \end{eqnarray} (9) where $$\rho_c$$ is the dimension of $$\theta_c$$ (see also Lanfear et al. 2012, 2014; Li et al. 2008). For phylogenetic applications, the number of free parameters can be separated as \begin{eqnarray} \rho_c = \rho_{\tau} + \rho_{\mu}, \label{rho-separate-1} \end{eqnarray} (10) where $$\rho_{\tau}$$ is the number of tree branches and $$\rho_{\mu}$$ is the number of the remaining parameters (e.g. parameters affecting nucleotide frequencies, transition/transversion ratios, rate heterogeneity among sites, etc). Writing $$\left\{(m_i, i) \right\}$$ to indicate that each partition $$i$$ of the $$K$$ total partitions has its own model $$m_i$$, the AIC scores of partitioned data (Lanfear et al. 2012, 2014) are $$\text{AIC}[\left\{(m_i, i) \right\} ] := -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_{i}^{(i)}) + 2 \rho_{\left\{(m_i, i) \right\}}, \label{AIC-partition-conven}$$ (11) where $$\rho_{\left\{(m_i, i) \right\}}$$ is the number of freely estimated parameters and $$\rho_{\left\{(m_i, i) \right\}} = \sum_{i=1}^K \rho_i$$. In the exhaustive comparison of partition schemes, the scores of Equation (11) are calculated for all candidate schemes and the minimum score determines the best one. Widely-used heuristic algorithms for choosing partition schemes (Li et al. 2008; Lanfear et al. 2012, 2014) involve repeatedly comparing a 2– partition scheme with a 1-partition scheme. Different $$\rho_{\tau}$$ values in Equations (9) and (11) could be needed if the 1-partition and K-partition schemes adopted different tree topologies (e.g. a fully bifurcating topology in one scheme and a “star topology” in the other). When analyzing multilocus data, a model that has corresponding branch lengths of different partitions differ only by a proportionality constraint (Yang 1996; Pupko et al. 2002) is often adopted. We will refer to this as the “linked branch length” (LBL) model (see also Lanfear et al. 2012). With the LBL model, a set of branch lengths that will influence all partitions is estimated as is a proportionality factor for each partition. The sum of proportionality factors is typically restricted to ensure uniqueness of MLEs and the degrees of freedom for the proportionality factors is $$(K-1)$$ when there are $$K$$ partitions. A model that has branch lengths being independently and separately estimated for all partitions can be used as an alternative to a proportional model. This model will be referred to as the “unlinked branch length” (UBL) model (see also Lanfear et al. 2012). With multiple partitions, the LBL model can substantially reduce the number of free parameters when $$\rho_{\tau} \gg 1$$ because the number of free parameters for branch lengths is $$\rho_{\tau} + (K-1)$$ with the LBL model whereas it is $$K\rho_{\tau}$$ for the UBL model. However, the more limited flexibility of the LBL model means it is less satisfactory than the UBL model for some data sets (e.g. Pupko et al. 2002). For the UBL and LBL treatments, \begin{eqnarray} &&\!\!\!\!\text{AIC}[\left\{(m_c, i) \right\} ]\nonumber\\ &&\!\!\!\!:= \left\{ \begin{array}{@{}ll} -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) + 2 K \rho_c & \text{(UBL model when $m_i = m_c \;\; \forall i$)},\\[4pt] -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) \\[4pt] \ \ \ \ + 2 \{\rho_{\tau}+(K-1)+K\rho_{\mu}\} & \text{(LBL model when $m_i = m_c \;\; \forall i$).} \\ \end{array}\right. \nonumber\\\label{AIC-partition-conven2} \end{eqnarray} (12) Here, we focus on comparing a K-partition scheme with a 1-partition scheme. The aforementioned heuristic algorithms would have $$K=2$$ but larger values of $$K$$ are also possible. Using AIC, the 1-partition scheme is selected when $$\text{AIC}[m_c, \Sigma i] < \text{AIC}[\left\{(m_c, i) \right\} ]$$ and the K-partition scheme is selected when the inequality is in the other direction. Partition Groups One way to compute $$\text{AIC}[\left\{(m_i, i) \right\} ]$$ begins by organizing partitions into groups where partitions in the same group share topology and substitution model parameterization. Parameter values may vary among partitions within a group but model parameterization would not vary within the group. By grouping partitions in this way, the AIC score for each partition group can first be calculated and then the AIC score for the entire data set can be determined by summing all the AIC group scores. This way of organizing the AIC computations is convenient because the issue of whether or not to concatenate partitions only becomes relevant if partitions have identical model parameterization. As with the AIC, the AIC$$^{(p)}$$ that is introduced here can be computed for the entire data set by summing the AIC$$^{(p)}$$ scores of each individual partition group. Therefore, we focus on how to calculate the AIC$$^{(p)}$$ scores when the decision is either that there is one concatenated partition with model $$m_c$$ or there are $$K>1$$ distinct partitions that each have model $$m_c$$. For the same reasons, we will later focus on how to calculate BIC$$^{(p)}$$ scores when the decision is either that there is one concatenated partition with model $$m_c$$ or there are $$K>1$$ distinct partitions that each have model $$m_c$$. Kullback– Leibler Divergence and the AIC$$^{(p)}$$ Our AIC$$^{(p)}$$ modification of the AIC has penalty terms that are intended to reflect how much improvement in model fit would be expected if partitions are homogeneous but sequence data are analyzed by separately estimating parameters from each partition. These penalties can therefore be interpreted as stemming from the act of partitioning. Our suggestion is that the best candidate partition scheme should be selected after accounting for the act of partitioning. The AIC$$^{(p)}$$ penalty is not as heavy as the AIC penalty of $$2 \sum_{i=1}^K \rho_i$$ in Equation (11). Heavy penalties favor simple models (e.g. concatenation) and can lead to selection of a 1-partition model even when partitions are highly heterogeneous. The idea underlying the AIC$$^{(p)}$$ is to have a model selection criterion that places all candidate selection schemes on an equal footing by using the appropriate bias correction term according to partition homogeneity. After this bias adjustment, we reason that the best partitioning scheme is the one with the optimal score. Consider the situation in which $$n$$ samples are separated into $$K$$ partitions. The size of each partition will be $$n_i$$ ($$\sum_{i=1}^K n_i = n$$) and the MLE of the 1-partition scheme will be denoted $$\widehat{\theta}^{(\Sigma i)}$$. The KLD for the $$i$$th partition is \begin{eqnarray} KLD[g;f(\cdot|\widehat{\theta}^{(i)})] &\simeq& E_g [\log g(x) ]\nonumber\\ &&-\frac{1}{n_i} \left\{ \sum_{j=1}^{n_i} \log f(x_{ij}|\widehat{\theta}^{(i)}) - \rho_{\theta} \right\}. \label{mKLD_part} \end{eqnarray} (13) A weighted average of Equation (13) with weights $$w_i = n_i / n$$ yields the KLD when the model of interest has heterogeneous partitions, \begin{eqnarray} \sum_{i=1}^K w_i KLD[g_i;f(\cdot|\widehat{\theta}^{(i)})] &\simeq& \sum_{i=1}^K w_i E_{g_i} [\log g_i(x) ]\nonumber\\ &&- \frac{1}{n} \left\{\!\sum_{i=1}^K\!\sum_{j=1}^{n_i} \log f(x_{ij}|\widehat{\theta}^{(i)}){-}K \rho_{\theta}\!\right\}.\nonumber\\ \label{mKLD_part_weighted_average} \end{eqnarray} (14) Minimizing the AIC of the UBL parameterization in Equation (12) can therefore be seen to be equivalent to minimizing $$\sum_{i=1}^K w_i KLD[g_i;f(\cdot|\widehat{\theta}^{(i)})]$$. Now, consider the case where the UBL model is used but the $$K$$ partitions are actually homogeneous. Part 2 of the Appendix outlines one reason why the drawbacks of incorrectly partitioning homogeneous data do not seem especially severe. From Equation (7) and the argument outlined in Part 3 of the Appendix, we can show that the proper bias correction for “partitioning homogeneous data” is $$\frac{K+1}{2} \rho_{\theta}$$: \begin{eqnarray} KLD[g;f(\cdot|\widehat{\theta}^{(\Sigma i)})] & \simeq& E_g [\log g(x)]\nonumber\\ && -\frac{1}{n} \left\{ \sum_{i=1}^K \sum_{j=1}^{n_i} \log f(x_{ij}|\widehat{\theta}^{(\Sigma i)}) - \rho_{\theta} \right\} \nonumber\\ & \simeq & E_g [\log g(x)]\nonumber\\ &&-\frac{1}{n} \left\{ \sum_{i=1}^K \sum_{j=1}^{n_i} \log f(x_{ij}|\widehat{\theta}^{(i)}) - \frac{K+1}{2} \rho_{\theta} \right\}\nonumber\\ \label{KLD_equal} \end{eqnarray} (15) This means that either a 1-partition scheme or a K-partition scheme can be employed when there is homogeneity among partitions, but the proper bias correction depends on the number of partitions. The bias correction for partitioning homogeneous data can be contrasted to a bias correction when partitions are heterogeneous. Whereas the bias correction for partitioning homogeneous data is $$\frac{K+1}{2} \rho_\theta$$, the more extreme correction for partitioning heterogeneous data is $$K \rho_\theta$$. The bias correction for partitioning homogeneous partitions closely corresponds to the likelihood ratio test (see Equation (3)). Based on this idea and the argument outlined in Part 3 of the Appendix, we can define the AIC$$^{(p)}$$ scores of a K-partition scheme and a concatenated scheme. As above, we assume that each of the $$K$$ partitions has the same model parameterization as the concatenated sequences (i.e. $$m_c = m_i \;\; \forall i$$). With this assumption, we have the AIC$$^{(p)}$$ of the K-partition scheme being $$\text{AIC}^{(p)}[ \left\{ (m_c, i) \right\} ] := \sum_{i=1}^K \left\{ -2 l^{(i)}(\widehat{\theta}_c^{\,(i)}) \right\} + \delta + 2 \rho_c \label{aicp-part},$$ (16) where $$\delta = \rho_{\left\{ (m_c, i) \right\} } - \rho_c$$ and represents the difference between the total number of parameters in the partition scheme ($$\rho_{\left\{ (m_c, i) \right\} }$$) and the total number in the concatenation scheme ($$\rho_c$$). When an UBL model is adopted, $$\delta = \sum_{i=1}^K \rho_c - \rho_c = (K-1)(\rho_{\tau}+\rho_{\mu})$$. When an LBL model is adopted, $$\delta=\rho_{\tau} + (K-1) + K \rho_{\mu} - (\rho_{\tau}+\rho_{\mu}) = (K-1)(1+ \rho_{\mu})$$. In contrast, the AIC$$^{(p)}$$ for the concatenated model is $$\text{AIC}^{(p)}[m_c, \Sigma i] = \text{AIC}[m_c, \Sigma i] := -2 l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) + 2 \rho_c. \label{aicp-concat}$$ (17) In Equation (16), the summed log-likelihoods represent the model fit of the data and the remaining portion ($$\delta + 2 \rho_c$$) works as a penalty. Larger $$\delta$$ values arise from complex partition schemes and therefore complex schemes are accompanied by higher penalties. The $$\text{AIC}^{(p)}$$ therefore accounts for the trade-off between partition fit and partition complexity. The 1-partition scheme is selected when $$\text{AIC}^{(p)}[m_c, \Sigma i] < \text{AIC}^{(p)}[ \left\{ (m_c, i) \right\} ]$$ and the K-partition scheme is selected when the inequality is in the other direction. With the UBL model, the AIC$$^{(p)}$$ score for the K-partition scheme is \begin{eqnarray} \text{AIC}^{(p)}[\left\{ (m_c, i) \right\} ] = \sum_{i=1}^K \left\{ -2 l^{(i)}(\widehat{\theta}_c^{\,(i)}) \right\} + (K+1) \rho_c. \label{aicp-sum-mc} \end{eqnarray} (18) We see from Equation (3) that the AIC$$^{(p)}$$ scores of the concatenated model (see Equation (17)) and the K-partition model (see Equation (18)) are expected to be about the same when the partitions are homogeneous. If the $$K$$ partitions are heterogeneous, the AIC$$^{(p)}$$ score of the K-partition model is therefore expected to be lower. For this reason, we suggest selecting the K-partition model when its score is less than that of the concatenated model. We note that the AIC$$^{(p)}$$ score of Equation (16) can be converted to the AIC score by replacing the $$\delta$$ term with the heavier penalty of $$2 \delta$$. For the UBL parameterization, one way to interpret the AIC and AIC$$^{(p)}$$ penalties is that the AIC has a penalty of $$2$$ each time a parameter is estimated whereas the AIC$$^{(p)}$$ has a penalty of $$2$$ the first time a parameter is estimated but a penalty of only $$1$$ for additional partitions from which the parameter is estimated. The fact that the penalty of $$\text{AIC}^{(p)}[\left\{(m_c, i) \right\}]$$ in Equation (16) is lighter than that of $$\text{AIC}[\left\{(m_c, i) \right\} ]$$ in Equation (12) implies that the AIC$$^{(p)}$$ can detect heterogeneity better than the AIC. This is because a moderate improvement of model fit represented by $$l^{(i)}(\widehat{\theta}_c^{\,(i)})$$ can dominate the penalty term of $$\text{AIC}^{(p)}[\left\{(m_c, i) \right\}]$$ more easily than that of $$\text{AIC}[\left\{(m_c, i) \right\} ]$$. New BIC of Partitioned Data: $$\text{BIC}^{(p)}$$ The BIC aims to find the model with the highest logarithm of the marginal likelihood. The BIC penalty term is $$\rho \log (n)$$ where $$\rho$$ is the number of free parameters and $$n$$ is the number of data points. This penalty term is derived from an approximation of the logarithm of the marginal likelihood that is best when $$n$$ is large. A better approximation would yield a penalty term of $$\rho \log (n) - \rho \log (2 \pi)$$ but the $$\rho \log(2 \pi)$$ is omitted from the BIC because the assumption is that $$\log (n)$$ is substantially bigger than $$\log (2 \pi)$$ (but see Draper 1995). For defining the BIC$$^{(p)}$$ information criterion, we choose to avoid the assumption about the relative values of $$\log (n)$$ and $$\log(2 \pi)$$. Using the relationship between maximum log-likelihoods of concatenated and partitioned data (Equation (15)) when partitions are homogeneous, we can approximate the logarithm of the marginal likelihood of concatenated and partitioned data. We again consider the case where each of the $$K$$ partitions in the partition scheme has the same parameterization as the concatenation model (i.e. $$m_i = m_c \;\; \forall i$$). As shown in Part 4 of the Appendix and using $$p_i$$ to represent the proportion of all data that is in partition $$i$$, this enables us to define BIC$$^{(p)}$$ scores of concatenation and partition schemes: \begin{eqnarray} &&\text{BIC}^{(p)}\,[m_c, \Sigma i] := -2 l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) + \rho_c (\log n - \log(2 \pi) )\nonumber\\ \label{BICP_sum} \\ \end{eqnarray} (19) \begin{eqnarray} &&\text{BIC}^{(p)}[\left\{(m_c, i) \right\}]\nonumber\\ &&\quad := \sum_{i=1}^K \left\{ -2 l^{(i)}(\widehat{\theta}_c^{\,(i)}) +(1-p_i) \rho_{\delta} ( \log n_i - \log(2 \pi) ) \right\} \nonumber \\ &&\qquad + \rho_c (\log n - \log(2 \pi)), \label{BICP_part} \end{eqnarray} (20) where $$\rho_{\delta}$$ represents the number of free parameters that are added per partition for each additional partition after the first. For the UBL model, $$\rho_{\delta}=\rho_c$$. For the LBL model, $$\rho_{\delta}=1+\rho_{\mu}$$. When $$\text{BIC}^{(p)}\,[m_c, \Sigma i] < \text{BIC}^{(p)}\,[\left\{(m_c, i) \right\}]$$, we choose the concatenation scheme. When the inequality is reversed, we choose the $$K$$-partition scheme. We note that the penalty difference between Equations (19) and (20) is smaller than the difference between conventional BIC’s for partitioned and concatenated data. This implies that the BIC$$^{(p)}$$ can detect partition heterogeneity better than the BIC. Based on our experience, the BIC$$^{(p)}$$ terms involving $$\log (2 \pi)$$ reasonably often affect partitioning decisions. If the contributions involving $$\log (2 \pi)$$ are ignored, the penalty differences between Equations (19) and (20) would become greater and thus would make it more difficult to detect heterogeneity among partitions. Results and discussion Simulation Studies We performed simulations to compare information criteria. All simulations employed the topology and branch lengths illustrated in Figure 1 to randomly generate data sets consisting of four partitions. To simulate partition homogeneity, all branches in all partitions had length $$0.1$$. To simulate heterogeneous partitions, all branches in partition $$i$$ ($$i \in \left\{ 1,2,3,4 \right \}$$) had branch length $$0.1 + \Delta_i := 0.1 + 0.02 \times (i-1)$$. The number of simulated sites in partition $$i$$ was set to $$i \times n_1$$. This means that the concatenated (1-partition) scheme had $$n= 10 n_1 = n_1 + 2n_1+ 3n_1 +4n_1$$ sites. The values of $$n_1$$ that were explored are $$100$$, $$200$$, $$500$$, $$1000$$, $$1500$$, and $$2000$$. The Seq-Gen software (Rambaut and Grassly 1997) was employed to randomly generate $$1000$$ data sets via the HKY substitution model (Hasegawa et al. 1985) and a five-category discrete-gamma treatment of rate heterogeneity among sites (Yang 1994b). The frequencies of A, C, G, and T were respectively 0.3, 0.2, 0.3, and 0.2 while the values of parameters for rate heterogeneity among sites ($$\alpha$$) and transition/transversion ratio ($$\kappa$$) were both set to 5.0. Figure 1. View largeDownload slide Tree topology of eight taxa for simulation studies. For the simulation of homogeneous and heterogeneous partition schemes, $$(b, \Delta_i)$$ was set to $$(0.1, 0.0)$$, and $$(0.1, 0.02\times(i-1))$$, respectively. Figure 1. View largeDownload slide Tree topology of eight taxa for simulation studies. For the simulation of homogeneous and heterogeneous partition schemes, $$(b, \Delta_i)$$ was set to $$(0.1, 0.0)$$, and $$(0.1, 0.02\times(i-1))$$, respectively. All simulated data sets were analyzed both with a 4-partition scheme and a concatenated (1– partition) scheme. For each of the two schemes, each simulated data set was analyzed with each of eight nucleotide substitution models and each of these cases was explored both with rate homogeneity among sites and with discrete-gamma rate heterogeneity (five rate categories). The eight substitution models that were used for inference are denoted: JC (Jukes and Cantor 1969), K2 (Kimura 1980), F81 (Felsenstein 1981), F84 (Felsenstein 1989), HKY (Hasegawa et al. 1985), T92 (Tamura 1992), TN93 (Tamura and Nei 1993), and GTR (Yang 1994a). For the analyses of the simulated data, we did not try to find the best substitution model and tree topology. Instead, we compared the 1-partition scheme and 4-partition schemes when both assumed the true tree topology and when both assumed the same substitution model. Also, the UBL parameterization was used for investigating the 4-partition schemes so that branch lengths of each partition were independently estimated. Maximum likelihood inference was conducted via Version 4.8 of the baseml program of the PAML software (Yang 2007). Table 1 shows how often the 4-partition scheme was selected with the AIC and the AIC$$^{(p)}$$ when the truth was that partitions are homogeneous (i.e. $$\Delta_i =0$$). Most of the simulation results for the AIC show a low proportion (<0.1%) of incorrectly selecting the 4-partition scheme. In contrast, the $$\text{AIC}^{(p)}$$ incorrectly selects the 4-partition scheme about 50% of the time. This higher proportion is expected because the $$\text{AIC}^{(p)}$$ criterion is designed with the idea that “splitting” errors (incorrectly separating partitions) are less serious than “lumping” errors (incorrectly concatenating partitions). The observation of an approximately 50% error rate is reasonable in light of the fact that the $$\text{AIC}^{(p)}$$ criterion is designed so that its expected value is the same for a 4-partition and 1-partition scheme when the truth is partition homogeneity. Table 1. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by AIC and by AIC$$^{(p)}$$ when partitions were actually homogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.002 0.000 0.001 0.001 0.000 0.000 JC+G 0.001 0.000 0.000 0.000 0.000 0.000 K2 0.001 0.000 0.000 0.000 0.000 0.000 K2+G 0.001 0.000 0.000 0.001 0.000 0.000 F81 0.001 0.000 0.001 0.000 0.001 0.001 F81+G 0.000 0.000 0.001 0.000 0.001 0.001 F84 0.000 0.000 0.000 0.000 0.000 0.000 F84+G 0.000 0.000 0.000 0.000 0.000 0.000 HKY 0.000 0.000 0.000 0.000 0.000 0.000 HKY+G 0.000 0.000 0.000 0.000 0.000 0.000 T92 0.000 0.000 0.000 0.000 0.000 0.000 T92+G 0.000 0.000 0.000 0.000 0.000 0.000 TN93 0.000 0.000 0.000 0.000 0.000 0.000 TN93+G 0.000 0.000 0.000 0.000 0.000 0.000 GTR 0.000 0.000 0.000 0.000 0.000 0.000 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 AIC$$^{(p)}$$ JC 0.511 0.499 0.491 0.475 0.494 0.459 JC+G 0.464 0.446 0.444 0.443 0.459 0.432 K2 0.494 0.484 0.471 0.450 0.492 0.445 K2+G 0.505 0.488 0.477 0.447 0.490 0.446 F81 0.562 0.538 0.528 0.531 0.552 0.514 F81+G 0.551 0.532 0.530 0.526 0.561 0.518 F84 0.533 0.485 0.483 0.463 0.515 0.481 F84+G 0.533 0.487 0.491 0.453 0.500 0.478 HKY 0.530 0.488 0.474 0.455 0.522 0.484 HKY+G 0.518 0.502 0.477 0.450 0.512 0.481 T92 0.521 0.490 0.482 0.449 0.497 0.466 T92+G 0.514 0.484 0.490 0.447 0.491 0.458 TN93 0.541 0.497 0.481 0.462 0.522 0.489 TN93+G 0.531 0.490 0.468 0.449 0.504 0.488 GTR 0.560 0.499 0.489 0.488 0.533 0.480 GTR+G 0.540 0.464 0.489 0.470 0.505 0.467 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.002 0.000 0.001 0.001 0.000 0.000 JC+G 0.001 0.000 0.000 0.000 0.000 0.000 K2 0.001 0.000 0.000 0.000 0.000 0.000 K2+G 0.001 0.000 0.000 0.001 0.000 0.000 F81 0.001 0.000 0.001 0.000 0.001 0.001 F81+G 0.000 0.000 0.001 0.000 0.001 0.001 F84 0.000 0.000 0.000 0.000 0.000 0.000 F84+G 0.000 0.000 0.000 0.000 0.000 0.000 HKY 0.000 0.000 0.000 0.000 0.000 0.000 HKY+G 0.000 0.000 0.000 0.000 0.000 0.000 T92 0.000 0.000 0.000 0.000 0.000 0.000 T92+G 0.000 0.000 0.000 0.000 0.000 0.000 TN93 0.000 0.000 0.000 0.000 0.000 0.000 TN93+G 0.000 0.000 0.000 0.000 0.000 0.000 GTR 0.000 0.000 0.000 0.000 0.000 0.000 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 AIC$$^{(p)}$$ JC 0.511 0.499 0.491 0.475 0.494 0.459 JC+G 0.464 0.446 0.444 0.443 0.459 0.432 K2 0.494 0.484 0.471 0.450 0.492 0.445 K2+G 0.505 0.488 0.477 0.447 0.490 0.446 F81 0.562 0.538 0.528 0.531 0.552 0.514 F81+G 0.551 0.532 0.530 0.526 0.561 0.518 F84 0.533 0.485 0.483 0.463 0.515 0.481 F84+G 0.533 0.487 0.491 0.453 0.500 0.478 HKY 0.530 0.488 0.474 0.455 0.522 0.484 HKY+G 0.518 0.502 0.477 0.450 0.512 0.481 T92 0.521 0.490 0.482 0.449 0.497 0.466 T92+G 0.514 0.484 0.490 0.447 0.491 0.458 TN93 0.541 0.497 0.481 0.462 0.522 0.489 TN93+G 0.531 0.490 0.468 0.449 0.504 0.488 GTR 0.560 0.499 0.489 0.488 0.533 0.480 GTR+G 0.540 0.464 0.489 0.470 0.505 0.467 Table 1. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by AIC and by AIC$$^{(p)}$$ when partitions were actually homogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.002 0.000 0.001 0.001 0.000 0.000 JC+G 0.001 0.000 0.000 0.000 0.000 0.000 K2 0.001 0.000 0.000 0.000 0.000 0.000 K2+G 0.001 0.000 0.000 0.001 0.000 0.000 F81 0.001 0.000 0.001 0.000 0.001 0.001 F81+G 0.000 0.000 0.001 0.000 0.001 0.001 F84 0.000 0.000 0.000 0.000 0.000 0.000 F84+G 0.000 0.000 0.000 0.000 0.000 0.000 HKY 0.000 0.000 0.000 0.000 0.000 0.000 HKY+G 0.000 0.000 0.000 0.000 0.000 0.000 T92 0.000 0.000 0.000 0.000 0.000 0.000 T92+G 0.000 0.000 0.000 0.000 0.000 0.000 TN93 0.000 0.000 0.000 0.000 0.000 0.000 TN93+G 0.000 0.000 0.000 0.000 0.000 0.000 GTR 0.000 0.000 0.000 0.000 0.000 0.000 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 AIC$$^{(p)}$$ JC 0.511 0.499 0.491 0.475 0.494 0.459 JC+G 0.464 0.446 0.444 0.443 0.459 0.432 K2 0.494 0.484 0.471 0.450 0.492 0.445 K2+G 0.505 0.488 0.477 0.447 0.490 0.446 F81 0.562 0.538 0.528 0.531 0.552 0.514 F81+G 0.551 0.532 0.530 0.526 0.561 0.518 F84 0.533 0.485 0.483 0.463 0.515 0.481 F84+G 0.533 0.487 0.491 0.453 0.500 0.478 HKY 0.530 0.488 0.474 0.455 0.522 0.484 HKY+G 0.518 0.502 0.477 0.450 0.512 0.481 T92 0.521 0.490 0.482 0.449 0.497 0.466 T92+G 0.514 0.484 0.490 0.447 0.491 0.458 TN93 0.541 0.497 0.481 0.462 0.522 0.489 TN93+G 0.531 0.490 0.468 0.449 0.504 0.488 GTR 0.560 0.499 0.489 0.488 0.533 0.480 GTR+G 0.540 0.464 0.489 0.470 0.505 0.467 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.002 0.000 0.001 0.001 0.000 0.000 JC+G 0.001 0.000 0.000 0.000 0.000 0.000 K2 0.001 0.000 0.000 0.000 0.000 0.000 K2+G 0.001 0.000 0.000 0.001 0.000 0.000 F81 0.001 0.000 0.001 0.000 0.001 0.001 F81+G 0.000 0.000 0.001 0.000 0.001 0.001 F84 0.000 0.000 0.000 0.000 0.000 0.000 F84+G 0.000 0.000 0.000 0.000 0.000 0.000 HKY 0.000 0.000 0.000 0.000 0.000 0.000 HKY+G 0.000 0.000 0.000 0.000 0.000 0.000 T92 0.000 0.000 0.000 0.000 0.000 0.000 T92+G 0.000 0.000 0.000 0.000 0.000 0.000 TN93 0.000 0.000 0.000 0.000 0.000 0.000 TN93+G 0.000 0.000 0.000 0.000 0.000 0.000 GTR 0.000 0.000 0.000 0.000 0.000 0.000 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 AIC$$^{(p)}$$ JC 0.511 0.499 0.491 0.475 0.494 0.459 JC+G 0.464 0.446 0.444 0.443 0.459 0.432 K2 0.494 0.484 0.471 0.450 0.492 0.445 K2+G 0.505 0.488 0.477 0.447 0.490 0.446 F81 0.562 0.538 0.528 0.531 0.552 0.514 F81+G 0.551 0.532 0.530 0.526 0.561 0.518 F84 0.533 0.485 0.483 0.463 0.515 0.481 F84+G 0.533 0.487 0.491 0.453 0.500 0.478 HKY 0.530 0.488 0.474 0.455 0.522 0.484 HKY+G 0.518 0.502 0.477 0.450 0.512 0.481 T92 0.521 0.490 0.482 0.449 0.497 0.466 T92+G 0.514 0.484 0.490 0.447 0.491 0.458 TN93 0.541 0.497 0.481 0.462 0.522 0.489 TN93+G 0.531 0.490 0.468 0.449 0.504 0.488 GTR 0.560 0.499 0.489 0.488 0.533 0.480 GTR+G 0.540 0.464 0.489 0.470 0.505 0.467 Table 2 shows simulation results for the AIC and the AIC$$^{(p)}$$ when partitions are heterogeneous. In this situation, the 4-partition scheme should be selected over the 1-partition scheme and failure to do so is a “lumping error.” Both the AIC and AIC$$^{(p)}$$ criteria perform well when partitions consist of relatively large numbers of sites. But, Table 2 reveals a marked contrast when partitions have fewer sites. In the situation of $$n_1 =100$$ sites, the AIC makes a lumping error for nearly all simulated data sets whereas the AIC$$^{(p)}$$ is quite unlikely to make these errors. The results show that the $$\text{AIC}^{(p)}$$ has higher “sensitivity” than the AIC. That is, the $$\text{AIC}^{(p)}$$ detects heterogeneity better than the AIC when partitions are heterogeneous. With regard to making lumping errors, we also note that the $$\text{AIC}^{(p)}$$ appears to be relatively robust to model misspecification in comparison with the AIC. Table 2. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by AIC and by AIC$$^{(p)}$$ when partitions were actually heterogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.091 0.508 0.999 1.000 1.000 1.000 JC+G 0.049 0.374 0.998 1.000 1.000 1.000 K2 0.089 0.544 0.999 1.000 1.000 1.000 K2+G 0.040 0.332 0.996 1.000 1.000 1.000 F81 0.049 0.327 0.995 1.000 1.000 1.000 F81+G 0.029 0.230 0.990 1.000 1.000 1.000 F84 0.034 0.344 0.996 1.000 1.000 1.000 F84+G 0.016 0.168 0.981 1.000 1.000 1.000 HKY 0.034 0.353 0.996 1.000 1.000 1.000 HKY+G 0.013 0.170 0.986 1.000 1.000 1.000 T92 0.068 0.466 0.999 1.000 1.000 1.000 T92+G 0.031 0.277 0.994 1.000 1.000 1.000 TN93 0.028 0.291 0.995 1.000 1.000 1.000 TN93+G 0.009 0.135 0.977 1.000 1.000 1.000 GTR 0.012 0.163 0.984 1.000 1.000 1.000 GTR+G 0.004 0.074 0.927 1.000 1.000 1.000 AIC$$^{(p)}$$ JC 0.969 1.000 1.000 1.000 1.000 1.000 JC+G 0.951 0.999 1.000 1.000 1.000 1.000 K2 0.970 1.000 1.000 1.000 1.000 1.000 K2+G 0.951 1.000 1.000 1.000 1.000 1.000 F81 0.961 0.998 1.000 1.000 1.000 1.000 F81+G 0.945 0.998 1.000 1.000 1.000 1.000 F84 0.966 1.000 1.000 1.000 1.000 1.000 F84+G 0.944 0.999 1.000 1.000 1.000 1.000 HKY 0.961 1.000 1.000 1.000 1.000 1.000 HKY+G 0.943 0.999 1.000 1.000 1.000 1.000 T92 0.968 1.000 1.000 1.000 1.000 1.000 T92+G 0.945 1.000 1.000 1.000 1.000 1.000 TN93 0.964 1.000 1.000 1.000 1.000 1.000 TN93+G 0.943 1.000 1.000 1.000 1.000 1.000 GTR 0.952 1.000 1.000 1.000 1.000 1.000 GTR+G 0.934 0.999 1.000 1.000 1.000 1.000 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.091 0.508 0.999 1.000 1.000 1.000 JC+G 0.049 0.374 0.998 1.000 1.000 1.000 K2 0.089 0.544 0.999 1.000 1.000 1.000 K2+G 0.040 0.332 0.996 1.000 1.000 1.000 F81 0.049 0.327 0.995 1.000 1.000 1.000 F81+G 0.029 0.230 0.990 1.000 1.000 1.000 F84 0.034 0.344 0.996 1.000 1.000 1.000 F84+G 0.016 0.168 0.981 1.000 1.000 1.000 HKY 0.034 0.353 0.996 1.000 1.000 1.000 HKY+G 0.013 0.170 0.986 1.000 1.000 1.000 T92 0.068 0.466 0.999 1.000 1.000 1.000 T92+G 0.031 0.277 0.994 1.000 1.000 1.000 TN93 0.028 0.291 0.995 1.000 1.000 1.000 TN93+G 0.009 0.135 0.977 1.000 1.000 1.000 GTR 0.012 0.163 0.984 1.000 1.000 1.000 GTR+G 0.004 0.074 0.927 1.000 1.000 1.000 AIC$$^{(p)}$$ JC 0.969 1.000 1.000 1.000 1.000 1.000 JC+G 0.951 0.999 1.000 1.000 1.000 1.000 K2 0.970 1.000 1.000 1.000 1.000 1.000 K2+G 0.951 1.000 1.000 1.000 1.000 1.000 F81 0.961 0.998 1.000 1.000 1.000 1.000 F81+G 0.945 0.998 1.000 1.000 1.000 1.000 F84 0.966 1.000 1.000 1.000 1.000 1.000 F84+G 0.944 0.999 1.000 1.000 1.000 1.000 HKY 0.961 1.000 1.000 1.000 1.000 1.000 HKY+G 0.943 0.999 1.000 1.000 1.000 1.000 T92 0.968 1.000 1.000 1.000 1.000 1.000 T92+G 0.945 1.000 1.000 1.000 1.000 1.000 TN93 0.964 1.000 1.000 1.000 1.000 1.000 TN93+G 0.943 1.000 1.000 1.000 1.000 1.000 GTR 0.952 1.000 1.000 1.000 1.000 1.000 GTR+G 0.934 0.999 1.000 1.000 1.000 1.000 Table 2. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by AIC and by AIC$$^{(p)}$$ when partitions were actually heterogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.091 0.508 0.999 1.000 1.000 1.000 JC+G 0.049 0.374 0.998 1.000 1.000 1.000 K2 0.089 0.544 0.999 1.000 1.000 1.000 K2+G 0.040 0.332 0.996 1.000 1.000 1.000 F81 0.049 0.327 0.995 1.000 1.000 1.000 F81+G 0.029 0.230 0.990 1.000 1.000 1.000 F84 0.034 0.344 0.996 1.000 1.000 1.000 F84+G 0.016 0.168 0.981 1.000 1.000 1.000 HKY 0.034 0.353 0.996 1.000 1.000 1.000 HKY+G 0.013 0.170 0.986 1.000 1.000 1.000 T92 0.068 0.466 0.999 1.000 1.000 1.000 T92+G 0.031 0.277 0.994 1.000 1.000 1.000 TN93 0.028 0.291 0.995 1.000 1.000 1.000 TN93+G 0.009 0.135 0.977 1.000 1.000 1.000 GTR 0.012 0.163 0.984 1.000 1.000 1.000 GTR+G 0.004 0.074 0.927 1.000 1.000 1.000 AIC$$^{(p)}$$ JC 0.969 1.000 1.000 1.000 1.000 1.000 JC+G 0.951 0.999 1.000 1.000 1.000 1.000 K2 0.970 1.000 1.000 1.000 1.000 1.000 K2+G 0.951 1.000 1.000 1.000 1.000 1.000 F81 0.961 0.998 1.000 1.000 1.000 1.000 F81+G 0.945 0.998 1.000 1.000 1.000 1.000 F84 0.966 1.000 1.000 1.000 1.000 1.000 F84+G 0.944 0.999 1.000 1.000 1.000 1.000 HKY 0.961 1.000 1.000 1.000 1.000 1.000 HKY+G 0.943 0.999 1.000 1.000 1.000 1.000 T92 0.968 1.000 1.000 1.000 1.000 1.000 T92+G 0.945 1.000 1.000 1.000 1.000 1.000 TN93 0.964 1.000 1.000 1.000 1.000 1.000 TN93+G 0.943 1.000 1.000 1.000 1.000 1.000 GTR 0.952 1.000 1.000 1.000 1.000 1.000 GTR+G 0.934 0.999 1.000 1.000 1.000 1.000 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.091 0.508 0.999 1.000 1.000 1.000 JC+G 0.049 0.374 0.998 1.000 1.000 1.000 K2 0.089 0.544 0.999 1.000 1.000 1.000 K2+G 0.040 0.332 0.996 1.000 1.000 1.000 F81 0.049 0.327 0.995 1.000 1.000 1.000 F81+G 0.029 0.230 0.990 1.000 1.000 1.000 F84 0.034 0.344 0.996 1.000 1.000 1.000 F84+G 0.016 0.168 0.981 1.000 1.000 1.000 HKY 0.034 0.353 0.996 1.000 1.000 1.000 HKY+G 0.013 0.170 0.986 1.000 1.000 1.000 T92 0.068 0.466 0.999 1.000 1.000 1.000 T92+G 0.031 0.277 0.994 1.000 1.000 1.000 TN93 0.028 0.291 0.995 1.000 1.000 1.000 TN93+G 0.009 0.135 0.977 1.000 1.000 1.000 GTR 0.012 0.163 0.984 1.000 1.000 1.000 GTR+G 0.004 0.074 0.927 1.000 1.000 1.000 AIC$$^{(p)}$$ JC 0.969 1.000 1.000 1.000 1.000 1.000 JC+G 0.951 0.999 1.000 1.000 1.000 1.000 K2 0.970 1.000 1.000 1.000 1.000 1.000 K2+G 0.951 1.000 1.000 1.000 1.000 1.000 F81 0.961 0.998 1.000 1.000 1.000 1.000 F81+G 0.945 0.998 1.000 1.000 1.000 1.000 F84 0.966 1.000 1.000 1.000 1.000 1.000 F84+G 0.944 0.999 1.000 1.000 1.000 1.000 HKY 0.961 1.000 1.000 1.000 1.000 1.000 HKY+G 0.943 0.999 1.000 1.000 1.000 1.000 T92 0.968 1.000 1.000 1.000 1.000 1.000 T92+G 0.945 1.000 1.000 1.000 1.000 1.000 TN93 0.964 1.000 1.000 1.000 1.000 1.000 TN93+G 0.943 1.000 1.000 1.000 1.000 1.000 GTR 0.952 1.000 1.000 1.000 1.000 1.000 GTR+G 0.934 0.999 1.000 1.000 1.000 1.000 When the number of sites per partition is in a biologically plausible range, both the BIC and BIC$$^{(p)}$$ are more prone than the AIC to favor models that are less parameterized. The practical effect of the increased penalties on more parameterized models means that both the BIC and the BIC$$^{(p)}$$ are less likely than the AIC to make splitting errors. The results of Table 1 show that the AIC is very unlikely in our simulation settings to make a splitting error by choosing the 4-partition scheme over the 1-partition scheme when the truth is partition homogeneity. Therefore, it is not surprising that we do not observe splitting errors with the heavier BIC and BIC$$^{(p)}$$ penalties for our simulation conditions. Because splitting errors are not observed with the BIC and the BIC$$^{(p)}$$, we do not include tables of them for these criteria. Table 3 shows the performance of the BIC and $$\text{BIC}^{(p)}$$ when the truth is that the heterogeneous 4-partition scheme should be selected. As expected, the $$\text{BIC}^{(p)}$$ detects partition heterogeneity better than BIC. That is, the $$\text{BIC}^{(p)}$$ makes fewer lumping errors than the BIC. However, from the comparison of Tables 2 and 3, we observe that the $$\text{BIC}^{(p)}$$ is less sensitive than the $$\text{AIC}^{(p)}$$ and even less than the $$\text{AIC}$$. This is because the $$\text{BIC}^{(p)}$$ has heavier penalties than the $$\text{AIC}^{(p)}$$ and the $$\text{AIC}$$. Also, the $$\text{BIC}^{(p)}$$ displays a marked sensitivity to the assumed substitution model. For example, the observed proportion of selecting the 4-partition scheme with the JC model is 0.998 when $$n_1 = 1500$$ and it is only 0.093 with the GTR+G model. In contrast, the $$\text{AIC}^{(p)}$$ appears to be more robust than the $$\text{BIC}^{(p)}$$ to model misspecification. Table 3. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by BIC and by BIC$$^{(p)}$$ when partitions were actually heterogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G BIC JC 0.000 0.000 0.000 0.000 0.138 0.898 JC+G 0.000 0.000 0.000 0.000 0.029 0.645 K2 0.000 0.000 0.000 0.000 0.218 0.956 K2+G 0.000 0.000 0.000 0.000 0.005 0.388 F81 0.000 0.000 0.000 0.000 0.001 0.245 F81+G 0.000 0.000 0.000 0.000 0.000 0.036 F84 0.000 0.000 0.000 0.000 0.004 0.422 F84+G 0.000 0.000 0.000 0.000 0.000 0.012 HKY 0.000 0.000 0.000 0.000 0.006 0.451 HKY+G 0.000 0.000 0.000 0.000 0.000 0.017 T92 0.000 0.000 0.000 0.000 0.095 0.830 T92+G 0.000 0.000 0.000 0.000 0.000 0.161 TN93 0.000 0.000 0.000 0.000 0.000 0.225 TN93+G 0.000 0.000 0.000 0.000 0.000 0.006 GTR 0.000 0.000 0.000 0.000 0.000 0.009 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 BIC$$^{(p)}$$ JC 0.000 0.000 0.003 0.641 0.998 1.000 JC+G 0.000 0.000 0.000 0.402 0.995 1.000 K2 0.000 0.000 0.005 0.740 0.999 1.000 K2+G 0.000 0.000 0.000 0.226 0.971 1.000 F81 0.000 0.000 0.000 0.169 0.940 1.000 F81+G 0.000 0.000 0.000 0.034 0.704 0.992 F84 0.000 0.000 0.000 0.257 0.978 1.000 F84+G 0.000 0.000 0.000 0.019 0.611 0.985 HKY 0.000 0.000 0.000 0.266 0.975 1.000 HKY+G 0.000 0.000 0.000 0.026 0.665 0.988 T92 0.000 0.000 0.000 0.584 0.998 1.000 T92+G 0.000 0.000 0.000 0.132 0.903 0.999 TN93 0.000 0.000 0.000 0.162 0.932 0.999 TN93+G 0.000 0.000 0.000 0.009 0.470 0.979 GTR 0.000 0.000 0.000 0.016 0.590 0.985 GTR+G 0.000 0.000 0.000 0.000 0.093 0.769 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G BIC JC 0.000 0.000 0.000 0.000 0.138 0.898 JC+G 0.000 0.000 0.000 0.000 0.029 0.645 K2 0.000 0.000 0.000 0.000 0.218 0.956 K2+G 0.000 0.000 0.000 0.000 0.005 0.388 F81 0.000 0.000 0.000 0.000 0.001 0.245 F81+G 0.000 0.000 0.000 0.000 0.000 0.036 F84 0.000 0.000 0.000 0.000 0.004 0.422 F84+G 0.000 0.000 0.000 0.000 0.000 0.012 HKY 0.000 0.000 0.000 0.000 0.006 0.451 HKY+G 0.000 0.000 0.000 0.000 0.000 0.017 T92 0.000 0.000 0.000 0.000 0.095 0.830 T92+G 0.000 0.000 0.000 0.000 0.000 0.161 TN93 0.000 0.000 0.000 0.000 0.000 0.225 TN93+G 0.000 0.000 0.000 0.000 0.000 0.006 GTR 0.000 0.000 0.000 0.000 0.000 0.009 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 BIC$$^{(p)}$$ JC 0.000 0.000 0.003 0.641 0.998 1.000 JC+G 0.000 0.000 0.000 0.402 0.995 1.000 K2 0.000 0.000 0.005 0.740 0.999 1.000 K2+G 0.000 0.000 0.000 0.226 0.971 1.000 F81 0.000 0.000 0.000 0.169 0.940 1.000 F81+G 0.000 0.000 0.000 0.034 0.704 0.992 F84 0.000 0.000 0.000 0.257 0.978 1.000 F84+G 0.000 0.000 0.000 0.019 0.611 0.985 HKY 0.000 0.000 0.000 0.266 0.975 1.000 HKY+G 0.000 0.000 0.000 0.026 0.665 0.988 T92 0.000 0.000 0.000 0.584 0.998 1.000 T92+G 0.000 0.000 0.000 0.132 0.903 0.999 TN93 0.000 0.000 0.000 0.162 0.932 0.999 TN93+G 0.000 0.000 0.000 0.009 0.470 0.979 GTR 0.000 0.000 0.000 0.016 0.590 0.985 GTR+G 0.000 0.000 0.000 0.000 0.093 0.769 Table 3. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by BIC and by BIC$$^{(p)}$$ when partitions were actually heterogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G BIC JC 0.000 0.000 0.000 0.000 0.138 0.898 JC+G 0.000 0.000 0.000 0.000 0.029 0.645 K2 0.000 0.000 0.000 0.000 0.218 0.956 K2+G 0.000 0.000 0.000 0.000 0.005 0.388 F81 0.000 0.000 0.000 0.000 0.001 0.245 F81+G 0.000 0.000 0.000 0.000 0.000 0.036 F84 0.000 0.000 0.000 0.000 0.004 0.422 F84+G 0.000 0.000 0.000 0.000 0.000 0.012 HKY 0.000 0.000 0.000 0.000 0.006 0.451 HKY+G 0.000 0.000 0.000 0.000 0.000 0.017 T92 0.000 0.000 0.000 0.000 0.095 0.830 T92+G 0.000 0.000 0.000 0.000 0.000 0.161 TN93 0.000 0.000 0.000 0.000 0.000 0.225 TN93+G 0.000 0.000 0.000 0.000 0.000 0.006 GTR 0.000 0.000 0.000 0.000 0.000 0.009 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 BIC$$^{(p)}$$ JC 0.000 0.000 0.003 0.641 0.998 1.000 JC+G 0.000 0.000 0.000 0.402 0.995 1.000 K2 0.000 0.000 0.005 0.740 0.999 1.000 K2+G 0.000 0.000 0.000 0.226 0.971 1.000 F81 0.000 0.000 0.000 0.169 0.940 1.000 F81+G 0.000 0.000 0.000 0.034 0.704 0.992 F84 0.000 0.000 0.000 0.257 0.978 1.000 F84+G 0.000 0.000 0.000 0.019 0.611 0.985 HKY 0.000 0.000 0.000 0.266 0.975 1.000 HKY+G 0.000 0.000 0.000 0.026 0.665 0.988 T92 0.000 0.000 0.000 0.584 0.998 1.000 T92+G 0.000 0.000 0.000 0.132 0.903 0.999 TN93 0.000 0.000 0.000 0.162 0.932 0.999 TN93+G 0.000 0.000 0.000 0.009 0.470 0.979 GTR 0.000 0.000 0.000 0.016 0.590 0.985 GTR+G 0.000 0.000 0.000 0.000 0.093 0.769 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G BIC JC 0.000 0.000 0.000 0.000 0.138 0.898 JC+G 0.000 0.000 0.000 0.000 0.029 0.645 K2 0.000 0.000 0.000 0.000 0.218 0.956 K2+G 0.000 0.000 0.000 0.000 0.005 0.388 F81 0.000 0.000 0.000 0.000 0.001 0.245 F81+G 0.000 0.000 0.000 0.000 0.000 0.036 F84 0.000 0.000 0.000 0.000 0.004 0.422 F84+G 0.000 0.000 0.000 0.000 0.000 0.012 HKY 0.000 0.000 0.000 0.000 0.006 0.451 HKY+G 0.000 0.000 0.000 0.000 0.000 0.017 T92 0.000 0.000 0.000 0.000 0.095 0.830 T92+G 0.000 0.000 0.000 0.000 0.000 0.161 TN93 0.000 0.000 0.000 0.000 0.000 0.225 TN93+G 0.000 0.000 0.000 0.000 0.000 0.006 GTR 0.000 0.000 0.000 0.000 0.000 0.009 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 BIC$$^{(p)}$$ JC 0.000 0.000 0.003 0.641 0.998 1.000 JC+G 0.000 0.000 0.000 0.402 0.995 1.000 K2 0.000 0.000 0.005 0.740 0.999 1.000 K2+G 0.000 0.000 0.000 0.226 0.971 1.000 F81 0.000 0.000 0.000 0.169 0.940 1.000 F81+G 0.000 0.000 0.000 0.034 0.704 0.992 F84 0.000 0.000 0.000 0.257 0.978 1.000 F84+G 0.000 0.000 0.000 0.019 0.611 0.985 HKY 0.000 0.000 0.000 0.266 0.975 1.000 HKY+G 0.000 0.000 0.000 0.026 0.665 0.988 T92 0.000 0.000 0.000 0.584 0.998 1.000 T92+G 0.000 0.000 0.000 0.132 0.903 0.999 TN93 0.000 0.000 0.000 0.162 0.932 0.999 TN93+G 0.000 0.000 0.000 0.009 0.470 0.979 GTR 0.000 0.000 0.000 0.016 0.590 0.985 GTR+G 0.000 0.000 0.000 0.000 0.093 0.769 The choice of model selection criterion can also affect branch length inferences. Table 4 considers the sums of branch lengths (i.e. the tree lengths) that were inferred from the simulated data with the different information criteria. It shows the specific case where the number of sites $$n_1$$ was 100 and where both the true and assumed substitution models were HKY + G. Table 4 shows that lumping errors can have more serious impacts than splitting errors on branch length estimates. Also, it indicates the potential value of the AIC$$^{(p)}$$ when downstream inferences that rely upon branch length estimates (e.g. divergence time and evolutionary rate estimates) are important (see Case 2 of the following section). Table 4. Tree length inferences from different information criteria when Partition 1 ($$n_1$$) had 100 sites and when the true and assumed substitution model were both HKY+G. If the information criterion selected the 1– partition scheme for a simulated data set, then the inferred tree length for the concatenated model was recorded for all 4 partitions. If the 4– partition scheme was selected, then tree lengths separately inferred for each partition was recorded. Standard deviations of tree length inferences are shown in parentheses below the sample means that were inferred from the 1000 simulated data sets. Partition 1 2 3 4 Partition homogeneity simulations Truth 1.3 1.3 1.3 1.3 AIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) AIC$$^{(p)}$$ 1.33 1.31 1.31 1.31 (0.144) (0.0914) (0.0799) (0.0697) BIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) BIC$$^{(p)}$$ 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) Partition heterogeneity simulations Truth 1.30 1.56 1.82 2.08 AIC 1.82 1.82 1.83 1.83 (0.101) (0.0834) (0.0737) (0.0888) AIC$$^{(p)}$$ 1.38 1.60 1.85 2.09 (0.208) (0.151) (0.134) (0.151) BIC 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) BIC$$^{(p)}$$ 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) Partition 1 2 3 4 Partition homogeneity simulations Truth 1.3 1.3 1.3 1.3 AIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) AIC$$^{(p)}$$ 1.33 1.31 1.31 1.31 (0.144) (0.0914) (0.0799) (0.0697) BIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) BIC$$^{(p)}$$ 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) Partition heterogeneity simulations Truth 1.30 1.56 1.82 2.08 AIC 1.82 1.82 1.83 1.83 (0.101) (0.0834) (0.0737) (0.0888) AIC$$^{(p)}$$ 1.38 1.60 1.85 2.09 (0.208) (0.151) (0.134) (0.151) BIC 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) BIC$$^{(p)}$$ 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) Table 4. Tree length inferences from different information criteria when Partition 1 ($$n_1$$) had 100 sites and when the true and assumed substitution model were both HKY+G. If the information criterion selected the 1– partition scheme for a simulated data set, then the inferred tree length for the concatenated model was recorded for all 4 partitions. If the 4– partition scheme was selected, then tree lengths separately inferred for each partition was recorded. Standard deviations of tree length inferences are shown in parentheses below the sample means that were inferred from the 1000 simulated data sets. Partition 1 2 3 4 Partition homogeneity simulations Truth 1.3 1.3 1.3 1.3 AIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) AIC$$^{(p)}$$ 1.33 1.31 1.31 1.31 (0.144) (0.0914) (0.0799) (0.0697) BIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) BIC$$^{(p)}$$ 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) Partition heterogeneity simulations Truth 1.30 1.56 1.82 2.08 AIC 1.82 1.82 1.83 1.83 (0.101) (0.0834) (0.0737) (0.0888) AIC$$^{(p)}$$ 1.38 1.60 1.85 2.09 (0.208) (0.151) (0.134) (0.151) BIC 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) BIC$$^{(p)}$$ 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) Partition 1 2 3 4 Partition homogeneity simulations Truth 1.3 1.3 1.3 1.3 AIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) AIC$$^{(p)}$$ 1.33 1.31 1.31 1.31 (0.144) (0.0914) (0.0799) (0.0697) BIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) BIC$$^{(p)}$$ 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) Partition heterogeneity simulations Truth 1.30 1.56 1.82 2.08 AIC 1.82 1.82 1.83 1.83 (0.101) (0.0834) (0.0737) (0.0888) AIC$$^{(p)}$$ 1.38 1.60 1.85 2.09 (0.208) (0.151) (0.134) (0.151) BIC 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) BIC$$^{(p)}$$ 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) Empirical Data Analysis We analyzed two data sets to examine how partition scheme selection affects evolutionary inferences. Because some details of our ML estimation and partition selection procedures differ from those of the previous studies, our results have minor differences from the previous ones. However, our emphasis here is not on these differences but is instead on how partition scheme selection is affected by the choice of information criterion and how downstream evolutionary analyses are affected by partition selection. Case 1: Li et al. (2008) analyzed 10 protein-coding genes of 56 ray-finned fish taxa. By separating each protein-coding gene into three codon positions, they started with a possible 30-partition scheme. They then performed hierarchical clustering to generate candidate schemes with 29, 28, $$\ldots$$, 2, 1 partitions. Although the number of possible ways to partition 30 items is the Bell Number (Bell 1934) $$B_{30}$$ and this is far more than 30, we considered only the 30 schemes from hierarchical clustering that Li et al. (2008) reported. For each of these 30 schemes with the AIC and with the BIC, our analyses allowed different substitution models for different partitions by invoking the “models = all” options of PartitionFinder (Lanfear et al. 2012). This differs from the Li et al. (2008) analysis that assumed the GTR+G model for all partitions in all schemes. We also considered the AIC$$^{(p)}$$ and the BIC$$^{(p)}$$ for the 30 candidate partitioning schemes. All analyses with these two information criteria assumed the Neighbor-joining tree topology reported by the MEGA software (Tamura et al. 2013) when the 30 possible partitions of the Li et al. (2008) data were concatenated and then analyzed with the TN93 substitution model. Our AIC$$^{(p)}$$ and BIC$$^{(p)}$$ analyses assumed that all partitions in each candidate scheme evolved according to a GTR substitution model with four discrete-gamma categories of rate heterogeneity among sites. Whereas Li et al. (2008) considered only the LBL parameterization, we considered both LBL and UBL parameterizations for the four information criteria with all candidate partitioning schemes. Detailed results from applying the four information criteria to the Li et al. (2008) candidate partitions are given in Tables 1 through 4 of Supplementary Material, Online Appendix available on Dryad at http://dx.doi.org/10.5061/dryad.qq586. Most of the results are not surprising. The AIC$$^{(p)}$$ is more prone to splitting than the AIC. Considering only the UBL parameterization, the AIC$$^{(p)}$$ selects the 30-partition scheme while the AIC chooses the 21-partition scheme. Likewise, the BIC$$^{(p)}$$ is more prone to splitting than the BIC. When considering only the UBL parameterization, the BIC$$^{(p)}$$ selects the 3-partition scheme while the BIC prefers the 2-partition scheme. These UBL results also confirm the expectation that the BIC and the BIC$$^{(p)}$$ prefer fewer partitions than the AIC and the AIC$$^{(p)}$$. This is because both the BIC and the BIC$$^{(p)}$$ penalties involve the amount of data as well as the number of estimated parameters. The number of estimated parameters for the LBL parameterization is substantially less than for the UBL parameterization. This causes the LBL parameterization to tend to favor splitting more than the UBL parameterization. Among the LBL results, the BIC selects the 21-partition scheme whereas the other three information criteria all choose the 30-partition scheme. When both UBL and LBL parameterizations are considered for each of the 30 candidate schemes, the BIC selects the LBL with 21 partitions and 288 free parameters while the BIC$$^{(p)}$$ selects the LBL with 30 partitions and 408 free parameters. For the same set of possible models, the AIC chooses the UBL with 21 partitions and 2486 free parameters while the AIC$$^{(p)}$$ favors the UBL with 30 partitions and 3540 free parameters. All of these model choices are consistent with the AIC$$^{(p)}$$ favoring splitting more than the AIC, the BIC$$^{(p)}$$ favoring splitting more than the BIC, and the BIC and the BIC$$^{(p)}$$ both preferring fewer parameters than the AIC and the AIC$$^{(p)}$$. Because the 30 candidate schemes were generated by hierarchical clustering, all 30 have a nesting/nested relationship for the UBL parameterization. That is, the $$k$$– partition scheme is nested within the $$(k+1)$$-partition scheme ($$1\leq k \leq 29$$). The same applies for the LBL parameterization. These nesting relationships mean that the traditional (asymptotic) LRT can be applied. For both LBL and UBL parameterizations, the 30-partition scheme is significantly better than all others according to the likelihood ratio test. In contrast, the 21-partition UBL scheme is selected by the AIC over the 30-partition UBL scheme and over all other candidate schemes. This emphasizes that the AIC score can produce results that are contradictory to the LRT. The AIC$$^{(p)}$$ selects the 30-partition UBL scheme as the best among all candidates and it selects the 30-partition LBL as being the best LBL candidate. The consistency between the AIC$$^{(p)}$$ and the LRT is not surprising given their close relationship. Ripplinger and Sullivan (2008) noted that model selection may affect phylogeny estimation, especially for regions of an evolutionary tree that have low bootstrap support. Partitioning decisions are a type of model selection and our experience with partition selection coincides with Ripplinger and Sullivan’s observation. For example, we considered the 30-partition UBL scheme that was selected as optimal according to the AIC$$^{(p)}$$ and the 21-partition UBL scheme that was preferred by the AIC. While computational considerations motivated our decision to avoid intensively searching topology space when calculating information criteria scores for the 30 candidate partitioning schemes, we used the RAXML software (Stamatakis 2014) to more carefully search among topologies for both the 21-partition UBL scheme and the 30-partition UBL scheme. While the RAXML trees derived from these two schemes are topologically very similar, Figure 2 shows that the differences tend to occur in regions of the topologies with low bootstrap support. Fig. 2. View largeDownload slide Different tree topologies for different partition schemes. ML tree topologies of 56 ray-finned fish group were reconstructed with the 30-partition (left) and 21-partition (right) schemes. These two partition schemes are the best with the AIC$$^{(p)}$$ and the AIC, respectively. The bootstrap support levels shown near each node are based upon 500 bootstrap replicates. The trees were inferred with the GTR substitution model and four categories of discrete-gamma rate heterogeneity among sites. Fig. 2. View largeDownload slide Different tree topologies for different partition schemes. ML tree topologies of 56 ray-finned fish group were reconstructed with the 30-partition (left) and 21-partition (right) schemes. These two partition schemes are the best with the AIC$$^{(p)}$$ and the AIC, respectively. The bootstrap support levels shown near each node are based upon 500 bootstrap replicates. The trees were inferred with the GTR substitution model and four categories of discrete-gamma rate heterogeneity among sites. Case 2: We analyzed 4 protein-coding mitochondrial genes and seven (six protein-coding and one noncoding) nuclear genes of 49 notothenioid fish group taxa. Colombo et al. (2015) focussed on the (possibly adaptive) radiation of the Antarctic clade, but our focus here is on partition selection. As with Case 1, we used the “models = all” option and the tree topology provided by PartitionFinder for the AIC and the BIC calculations. For the AIC$$^{(p)}$$ and the BIC$$^{(p)}$$ calculations, we used the Neighbor-joining tree topology estimated with the TN93 substitution model (Tamura and Nei 1993) and the MEGA software (Tamura et al. 2013). Fixing this topology, we obtained the AIC$$^{(p)}$$ and the BIC$$^{(p)}$$ scores for a GTR model with 4 discrete-gamma categories of rate heterogeneity. We first ran PartitionFinder (Lanfear et al. 2012) with the BIC and the UBL parameterization. The partition space was searched with the greedy option. Starting with 31 possible partitions (one for the noncoding gene and 1 per codon position for each of the 10 protein-coding genes), partitions were hierarchically merged until the best partition scheme (a 4-partition scheme) was found according to the BIC. Because the BIC tends to favor concatenation relative to the other information criteria, we decided to focus on the 28 candidate partition schemes that led PartitionFinder from the starting 31-partition scheme to the 4-partition scheme. The number of partitions in these 28 candidate schemes therefore range from 4 to 31. Assuming the UBL parameterization and evaluating these 28 candidate schemes, the 9-, 4-, 20-, and 6-partition schemes are selected as the best according to the AIC, the BIC, the AIC$$^{(p)}$$, and the BIC$$^{(p)}$$ respectively. Assuming the LBL parameterization, the 31-, 8-, 31-, and 18-partition schemes are selected as the best for AIC, BIC, AIC$$^{(p)}$$, and BIC$$^{(p)}$$, respectively. When considering either the UBL or LBL parameterizations, the AIC selects the 933-parameter 9-partition UBL scheme and the AIC$$^{(p)}$$ prefers the 2080-parameter 20-partition UBL scheme while the BIC prefers the 146-parameter 8-partition LBL scheme and the BIC$$^{(p)}$$ prefers the 274-parameter 18-partition LBL scheme. In summary, the analyses again show that BIC$$^{(p)}$$ is more apt to split partitions than the BIC and the AIC$$^{(p)}$$ is more apt to split than the AIC. The results also again show that the BIC and BIC$$^{(p)}$$ prefer models with fewer free parameters than the AIC and AIC$$^{(p)}$$. With this data set, we again performed maximum likelihood topology estimation according to a variety of partition schemes and model parameterizations that were favored according to one of the four information criteria. The results were again consistent with the observation of Ripplinger and Sullivan (2008). While we found minor variations in inferred maximum likelihood topology, the variations were associated with regions of the tree that have low or moderate bootstrap support (data not shown). Using the 20-partition UBL scheme preferred by the AIC$$^{(p)}$$ and also the 9-partition UBL scheme preferred by the AIC, we estimated evolutionary rates and divergence times with the MCMCtree software (Yang, 2007) by using both the sequence data and calibration points of Colombo et al. (2015). We concentrate on the third codon positions of the CO1 and the ND4 genes that are heterogeneous in the 20-partition UBL scheme but that are homogeneous (i.e. in the same partition) in the 9-partition UBL scheme. Figure 3 shows divergence time estimates and the inferred trajectory of evolutionary rates of CO1 and ND4 third positions at nodes along the path connecting the most recent common ancestor of the Antarctic clade to the tip in this clade that represents Chionodraco rastrospinosus. While the divergence time estimates are very similar for the 20-partition and 9-partition schemes, the rate trajectories for the third positions of CO1 and ND4 are quite different. The rate trajectory of the merged data shows a kind of average of CO1 and ND4 and is located between the two trajectories of CO1 and ND4, but it loses information for the evolutionary properties of the individual genes. While this is only one example rather than evidence for a general trend, we expect that the choice of partition scheme is less likely to affect the estimation of tree topology than divergence times and we expect that partition scheme selection is more likely to affect the estimation of evolutionary rates of individual genes than it is to affect divergence times that are assumed to be shared among the genes. Fig. 3. View largeDownload slide Different evolutionary rates for different partition schemes. Evolutionary rate trajectories from the most recent common ancestor of the Antarctic clade to Chionodraco rastrospinosus. Evolutionary rates of 3rd sites of CO1 (circle), ND4 (rectangle) and merged CO1 and ND4 (cross) are plotted on the y-axis with estimated divergence times (millions of years ago) on the x-axis. The third sites of CO1 and ND4 are heterogeneous in the 20-partition scheme that is the best according to the AIC$$^{(p)}$$. However, they are homogeneous in the 9-partition scheme that is the best according to the AIC. Fig. 3. View largeDownload slide Different evolutionary rates for different partition schemes. Evolutionary rate trajectories from the most recent common ancestor of the Antarctic clade to Chionodraco rastrospinosus. Evolutionary rates of 3rd sites of CO1 (circle), ND4 (rectangle) and merged CO1 and ND4 (cross) are plotted on the y-axis with estimated divergence times (millions of years ago) on the x-axis. The third sites of CO1 and ND4 are heterogeneous in the 20-partition scheme that is the best according to the AIC$$^{(p)}$$. However, they are homogeneous in the 9-partition scheme that is the best according to the AIC. Concluding Remarks The asymmetric consequences of splitting and lumping errors are our motivation for suggesting the $$\text{AIC}^{(p)}$$ and the $$\text{BIC}^{(p)}$$ to compare partition schemes. We view the possible bias resulting from lumping errors as more serious than the increased variance generated by splitting errors. Other partitioning techniques to account for the asymmetric consequences are also possible. For example, partitioning guided by Bayesian decision theory (e.g. see Berger 1985) could be attractive, but it would be difficult to convert the qualitative asymmetry of consequences from lumping and splitting errors into a quantitative loss function that would adequately summarize the relative severities of these two types of errors. Of the four information criteria that we consider, the $$\text{AIC}^{(p)}$$ is the least likely to make lumping errors and we conclude that the $$\text{AIC}^{(p)}$$ is usually a better choice than the other information criteria. However, one of the notable features of the $$\text{BIC}^{(p)}$$ is that it is consistent. Consistency in model selection implies that the probability to select the true model approaches 1 as sample size increases (Dziak et al. 2012). Likewise, consistency in partition selection guarantees selection of the true partition scheme when sample size (i.e. the number of sequence sites) is large. When $$n$$ goes to infinity, the $$\text{BIC}^{(p)}$$ penalty divided by $$n$$ goes to zero whereas the penalty itself goes to infinity. This is often the situation when information criteria are consistent with respect to model selection (Bozdogan 1987; Dziak et al. 2012). The $$\text{BIC}$$ is also consistent but, because the $$\text{BIC}^{(p)}$$ is less prone to lumping errors than the BIC, the $$\text{BIC}^{(p)}$$ might be the best alternative among the four information criteria when statistical consistency is especially valued. A conventional approach is to first make a quick and crude approximation of the phylogenetic tree topology and to then search for the optimal partition scheme by fixing the topology at this approximation (Lanfear et al. 2012). After settling upon and fixing the partition scheme, a thorough search of topologies can be carried out. This conventional approach is attractive because a joint search of all combinations of partition scheme and topology can be computationally prohibitive. This conventional approach simplifies computation and seems to us to also be sensible when employing the AIC$$^{(p)}$$ and the BIC$$^{(p)}$$, especially for doing phylogenetics with genome-scale data. If two partitions are heterogeneous according to one combination of tree topology and nucleotide substitution model, they are likely to be heterogeneous according to another combination. To be sure, the choice of a combination could affect the power to detect heterogeneity but this effect is often small. This is because even if the assumed tree topology and substitution model are incorrect for some or all partitions, the resulting bias would also be homogeneous (or heterogeneous) if evolutionary properties are really homogeneous (or heterogeneous) among partitions. However, we have not fully characterized this conventional approach here and doing so might be a good direction for future research. The application of information criteria to partitioning sequence data has mainly received attention with regard to impact on phylogeny inference, but diverse other kinds of evolutionary inferences (e.g. divergence time estimation, examination of how rates of molecular evolution have changed over time, and detection of diversifying positive selection) are also potentially influenced by partitioning. The potentially large effects of partitioning on inferred trajectories of evolutionary rates are illustrated by our findings with CO1 and ND4 third positions from the Colombo et al. (2015) notothenioid data. The ability to detect and quantify shifts of evolutionary rates over time is especially pertinent to the study of adaptation. By studying a phylogeny of diverse terrestrial and marine mammals and then identifying genes having evolutionary rates on the tree that correlate with marine/terrestrial status, Chikina et al. (2016) identified a biologically plausible group of candidate genes that might be associated with adaptation to marine environments. This promising strategy for illuminating the genetic underpinnings of evolutionary adaptation can potentially be applied to diverse other sorts of adaptation, including adaptation to the extreme Antarctic environment that may have been associated with the notothenioid fish radiation studied by Colombo et al. (2015). However, ability to use rate change to identify genes associated with adaptation to extreme environments or other sorts of adaptation will be influenced by partitioning decisions. As genomic data of non-model organisms becomes increasingly available, the ability to characterize trajectories of evolutionary rates should improve and so should the ability to identify interesting changes in evolutionary rates. Success of these studies is likely to be influenced by the availability of sound methods for partitioning. Supplementary Material Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.qq586. Funding This work was supported by the Korea Polar Research Institute (PE17090) and by NIH grant GM118508. Acknowledgements We thank Mark Holder, Stephen Smith, Xiang Ji, and two anonymous reviewers. Appendix 1. Basic assumptions Let us define operators $$\nabla$$, $$\nabla \nabla$$, and $$\nabla \nabla^T$$ as follows \begin{eqnarray*} \nabla l^{(i)} (\theta) &:=& \sum_{j=1}^{n_i} \frac{\partial}{\partial \theta} \log f(x_{ij}|\theta) \\ \nabla \nabla l^{(i)} (\theta) &:=& \sum_{j=1}^{n_i} \frac{\partial^2}{\partial \theta \partial \theta^T} \log f(x_{ij}|\theta) \\ \nabla \nabla^T l^{(i)} (\theta) &:=& \sum_{j=1}^{n_i} \frac{\partial}{\partial \theta} \log f(x_{ij}|\theta) \frac{\partial}{\partial \theta^T} \log f(x_{ij}|\theta). \end{eqnarray*} $$\nabla l^{(\Sigma i)} (\theta)$$, $$\nabla \nabla l^{(\Sigma i)} (\theta)$$ and $$\nabla \nabla^T l^{(\Sigma i)} (\theta)$$ are defined in a similar way. The adopted model may not be correct. When the MLE with an incorrect model converges to a certain value that we will denote $$\theta_*$$, its asymptotic distribution is normal under suitable regularity conditions (White 1982). That is, \begin{eqnarray} \left( \widehat{\theta}^{(i)} - \theta_*^{(i)} \right) &\underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\,& N\left( {\bf 0}, \left[\nabla \nabla l^{(i)} (\theta_*^{(i)}) \right]^{-1} \left[\nabla \nabla^T l^{(i)} (\theta_*^{(i)}) \right]\right.\nonumber\\ &&\left. \left[\nabla \nabla l^{(i)} (\theta_*^{(i)}) \right]^{-1} \right)\label{sandwitch1}\\ \end{eqnarray} (A.1) \begin{eqnarray} &\simeq& N\left( {\bf 0}, \left[\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) \right]^{-1} \left[\nabla \nabla^T l^{(i)} (\widehat{\theta}^{(i)}) \right]\right.\nonumber\\ &&\left.\left[\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) \right]^{-1} \right),\label{sandwitch2} \end{eqnarray} (A.2) and the asymptotic distribution of $$( \widehat{\theta}^{(\Sigma i)} - \theta_*^{(\Sigma i)} )$$ is expressed in a similar way. When the assumed model is correct, $$\nabla \nabla l = - \nabla \nabla^T l$$ in Equation (A.1) and the covariance matrix of Equation (A.1) is reduced to the inverse of $$-\nabla \nabla l$$. In our derivations, we assume that the adopted model may not be correct but it is close to the truth so that $$\nabla \nabla l \simeq - \nabla \nabla^T l$$ and the covariance matrix of Equation (A.1) is approximately the inverse of $$-\nabla \nabla l$$. Also, we assume partition homogeneity when deriving AIC$$^{(p)}$$’s penalty. 2. Partitioning homogeneous sequence data is not harmful when the number of sequence sites per partition is large We define \begin{eqnarray*} H^{(i)} &:=& n_i E_g \left[\nabla \nabla \log f(x_{ij}|\theta_*^{(i)}) \right] \nonumber \\ H^{(\Sigma i)} &=& \sum_{i=1}^K H^{(i)} \\ W_i &:=& \left[ H^{(\Sigma i)} \right]^{-1} H^{(i)}. \end{eqnarray*} If we assume partition homogeneity, then $$\theta_*^{(1)} = \cdots = \theta_*^{(K)} = \theta_*^{(\Sigma i)}$$ and \begin{eqnarray*} H^{(\Sigma i)} &=& n E_g \left[\nabla \nabla \log f(x_{ij}|\theta_*^{(\Sigma i)}) \right] \\ W_i &=& p_i I, \end{eqnarray*} where $$p_i = n_i/n$$ and $$I$$ is identity matrix. We consider the following natural estimators \begin{eqnarray*} \widehat{H}^{(i)} &=& \nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) \\ \widehat{H}^{(\Sigma i)} &=& \sum_{i=1}^K \widehat{H}^{(i)} \\ \widehat{W}_i &=& \left[ \widehat{H}^{(\Sigma i)} \right]^{-1} \widehat{H}^{(i)}, \end{eqnarray*} and note that $$\widehat{W}_i$$ is a consistent estimator of $$W_i$$. Now, consider the following first derivative and its asymptotic expansion. where $$h(x) = O_p \left( g(n) \right)$$ implies that $$h(x)/g(n)$$ is bounded in probability (Bishop et al. 2007). This leads to \begin{eqnarray} \widehat{\theta}^{(\Sigma i)} &=& \left[ \sum_{i=1}^{K} \nabla \nabla l^{(i)}(\widehat{\theta}^{(i)}) \right]^{-1} \sum_{i=1}^{K} \left\{ \nabla \nabla l^{(i)}(\widehat{\theta}^{(i)}) \widehat{\theta}^{(i)} + O_p \left(1 \right) \right\} \nonumber \\ &=& \left[ \widehat{H}^{(\Sigma i)} \right]^{-1} \sum_{i=1}^{K} \left\{ \widehat{H}^{(i)} \widehat{\theta}^{(i)} + O_p \left(1 \right) \right\} \nonumber \\ &=& \sum_{i=1}^{K} \left\{ \left[ \widehat{H}^{(\Sigma i)} \right]^{-1} \widehat{H}^{(i)} \widehat{\theta}^{(i)} + O_p \left( \frac{1}{n } \right) \right\} \nonumber \\ &=& \sum_{i=1}^{K} \widehat{W}_i \widehat{\theta}^{(i)} + O_p \left( \frac{1}{n } \right) \nonumber \\ &\simeq& \sum_{i=1}^{K} p_i \widehat{\theta}^{(i)} + O_p \left( \frac{1}{n } \right), \label{two_mle_similar} \end{eqnarray} (A.3) where we use the fact $$\widehat{W}_i \simeq p_i I$$ for large $$n_i$$’s in the approximation of the last line. We denote $$\sum_{i=1}^{K} p_i \widehat{\theta}^{(i)}$$ as $$\widehat{\theta}^{(\Sigma i|*)}$$. Then, Equation (A.3) implies $$\widehat{\theta}^{(\Sigma i)} \approx \widehat{\theta}^{(\Sigma i|*)}$$ for large $$n$$. On the other hand, \begin{eqnarray*} {\bf 0} &=& \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)}) \\ &=& \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) + \nabla \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) (\widehat{\theta}^{(\Sigma i)} - \widehat{\theta}^{(\Sigma i|*)}) + \cdots \\ &=& \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) + O_p(n)\cdot O_p\left(\frac{1}{n } \right) + \cdots \\ &=& \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) + O_p \left(1 \right). \end{eqnarray*} Therefore, $$\nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)}) \not\approx \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)})$$ even for large $$n$$. However, \begin{eqnarray*} l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)}) &=& l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) +(\widehat{\theta}^{(\Sigma i)} - \widehat{\theta}^{(\Sigma i|*)})^T \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) \\ && + \frac{1}{2} (\widehat{\theta}^{(\Sigma i)} - \widehat{\theta}^{(\Sigma i|*)})^T \cdot \nabla \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)})\\ &&\cdot (\widehat{\theta}^{(\Sigma i)} - \widehat{\theta}^{(\Sigma i|*)}) + \cdots \\ &=& l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) + O_p\left(\frac{1}{n} \right). \end{eqnarray*} Therefore, $$l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)}) \approx l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)})$$ for large $$n$$. Now consider the variances of $$\widehat{\theta}^{(\Sigma i)}$$ and $$\widehat{\theta}^{(\Sigma i|*)}$$. From Equation (A.3), \begin{eqnarray*} &&\left\{\widehat{\theta}^{(\Sigma i)}-\theta_*^{(\Sigma i)} \right\} \left\{\widehat{\theta}^{(\Sigma i)}-\theta_*^{(\Sigma i)} \right\}^T\\ &&\quad\!\!\!= \left\{\widehat{\theta}^{(\Sigma i|*)} - \theta_*^{(\Sigma i)} + O_p\left(\frac{1}{n } \right)\right\} \left\{\widehat{\theta}^{(\Sigma i|*)} - \theta_*^{(\Sigma i)} + O_p\left(\frac{1}{n} \right)\right\}^T \\ &&\quad\!\!\!= \left\{\widehat{\theta}^{(\Sigma i|*)} - \theta_*^{(\Sigma i)} \right\} \left\{\widehat{\theta}^{(\Sigma i|*)} - \theta_*^{(\Sigma i)} \right\}^T + O_p\left(\frac{1}{n \sqrt{n} }\right) \end{eqnarray*} Taking the expectation of both sides, we get \begin{eqnarray*} Var[\widehat{\theta}^{(\Sigma i)}] = Var [\widehat{\theta}^{(\Sigma i|*)}] + O\left(\frac{1}{n \sqrt{n} }\right), \end{eqnarray*} which implies that the variances of $$\widehat{\theta}^{(\Sigma i)}$$ and $$\widehat{\theta}^{(\Sigma i|*)}$$ are similar to each other when there are large amounts of data. The similarity of these variances suggests that partitioning homogeneous data is not harmful when the number of sequence sites per partition is large. 3. Proof of Equation (15) For the simplicity and convenience of mathematical notation, we omit the ‘$$c$$’ subscript below so that $$\widehat{\theta}^{(\cdot)}$$ and $$\rho$$ respectively represent $$\widehat{\theta}_c^{\,(\cdot)}$$ and $$\rho_c$$. To further simplify Equation (A.2), we define \begin{eqnarray*} U_{(i)} &:=& \left[\nabla \nabla^T l^{(i)} (\widehat{\theta}^{(i)}) \right] \\ V_{(i)} &:=& - \left[\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) \right]^{-1} \end{eqnarray*} If we rewrite Equation (A.2), the MLE from the $$i$$th partition asymptotically follows a normal distribution, \begin{eqnarray*} \widehat{\theta}^{(i)} - \theta_*^{(i)} \underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, N\left( {\bf 0}, V_{(i)} U_{(i)} V_{(i)} \right). \end{eqnarray*} Now, let us define the vector of all partition’s MLEs as follows, \begin{eqnarray*} \widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*:= ((\widehat{\theta}^{(1)} - \theta_*^{(1)})^T, \cdots, (\widehat{\theta}^{(k)} - \theta_*^{(K)})^T )^T, \end{eqnarray*} where $$\widehat{\boldsymbol{\theta}}$$ and $$\boldsymbol{\theta}_*$$ are $$K\rho \times 1$$ dimensional column vectors. The vector $$\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*$$ asymptotically follows a multivariate normal distribution, \begin{eqnarray} \widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_* \underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, N( {\bf 0}, \widehat{{\bf \Sigma}} ), \label{tot-theta-normal-approx} \end{eqnarray} (A.4) where $$\widehat{{\bf \Sigma}}$$ is a diagonal block matrix due to the independence of partitions, \begin{eqnarray*} \widehat{{\bf \Sigma}} = \left( \begin{array}{ccc} V_{(i)} U_{(i)} V_{(i)} & \cdots & {\bf 0}\\ \vdots & \ddots & \vdots \\ {\bf 0}& \cdots &V_{(K)} U_{(K)} V_{(K)} \\ \end{array} \right). \end{eqnarray*} Now, we investigate the relationship between $$l^{(i)}(\widehat{\theta}^{(\Sigma i)})$$ and $$l^{(i)}(\widehat{\theta}^{(i)})$$ for the $$i$$th partition. We define the ($$K\rho\times \rho$$)-dimensional matrix $${\bf p}_i$$ as \begin{eqnarray*} {\bf p}_i := (p_1 I_\rho,\cdots, (p_i-1) I_\rho, \cdots, p_K I_\rho )^T, \end{eqnarray*} where $$I_\rho$$ is the $$\rho$$-dimensional identity matrix. Then, \begin{eqnarray*} {\bf p}_i^T (\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*) &=& \widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)} \\ &\approx& \widehat{\theta}^{(\Sigma i)} - \widehat{\theta}^{(i)}, \end{eqnarray*} where the approximation is from Equation (A.3). Applying Equation (A.4), we find \begin{eqnarray*} {\bf p}_i^T (\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*) \underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, N({\bf 0}, {\bf p}_i^T \widehat{ {\bf \Sigma} } {\bf p}_i ). \end{eqnarray*} The $${\bf p}_i^T \widehat{ {\bf \Sigma} } {\bf p}_i$$ can be obtained with covariance matrices of individual partitions, \begin{eqnarray*} {\bf p}_i^T \widehat{ {\bf \Sigma}} {\bf p}_i &=& p_1^2 V_{(1)} U_{(1)} V_{(1)} + \cdots + (p_i-1)^2 V_{(i)} U_{(i)} V_{(i)}\\ &&+ \cdots + p_K^2 V_{(K)} U_{(K)} V_{(K)} \\ &\approx& p_1^2 [ V_{(i)} U_{(i)} V_{(i)} p_i / p_1 ] + \cdots + (p_i-1)^2 [ V_{(i)} U_{(i)} V_{(i)} ]\\ && + \cdots + p_K^2 [ V_{(i)} U_{(i)} V_{(i)} p_i/p_K] \\ &=& (1-p_i) [ V_{(i)} U_{(i)} V_{(i)}] \\ &\approx& (1-p_i) V_{(i)} = -(1-p_i) [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ]^{-1}, \end{eqnarray*} where the first approximation results from partition homogeneity and the second approximation results from the assumption of $$\nabla \nabla l \simeq - \nabla \nabla^T l$$. The quadratic summation of the elements of $$(\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)})$$ follows a $$\chi_\rho^2$$ distribution, \begin{eqnarray} && \left\{ {\bf p}_i^T (\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*) \right\}^T \left\{ {\bf p}_i^T \widehat{ {\bf \Sigma} } {\bf p}_i \right\}^{-1} \left\{ {\bf p}_i^T (\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*) \right\} \underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, \chi_{\rho}^2 \nonumber \\ &\Longleftrightarrow&(\widehat{\theta}^{(\Sigma i|*)} {-}\widehat{\theta}^{(i)})^T \cdot \frac{(-1)}{1{-}p_i} [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ] \cdot (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)}) \underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, \chi_{\rho}^2.\nonumber\\ &\Longleftrightarrow & (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)})^T \cdot [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ] \cdot (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)})\nonumber\\ &&\underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, - (1-p_i)\chi_{\rho}^2. \label{qaud_sum} \end{eqnarray} (A.5) Then, the log-likelihood function at the $$i$$th partition has the following relationship, (A.6) where the last approximation holds in the sense of expectation. Therefore, \begin{eqnarray*} \sum_{i=1}^K l^{(i)}(\widehat{\theta}^{(\Sigma i)}) &\approx& \sum_{i=1}^K \left\{ l^{(i)}(\widehat{\theta}^{(i)}) - \frac{1}{2} (1- p_i) \chi_{\rho}^2 \right\} \\ &\approx& \sum_{i=1}^K l^{(i)}(\widehat{\theta}^{(i)}) - \frac{1}{2} (K-1) \chi_{\rho}^2 \\ &\stackrel{E}{\approx} & \sum_{i=1}^K l^{(i)}(\widehat{\theta}^{(i)}) - \frac{1}{2} (K-1) \rho \end{eqnarray*} which proves Equation (15). While Equation (15) is a direct outcome of the asymptotic behavior of likelihood ratio tests when the null hypothesis is true, we note that Equations (A.5) and (A.6) are the critical steps in this proof and they are valid so long as the adopted model does not severely deviate from the truth. 4. Proof of Equation (20): derivation of BIC$$^{(p)}$$ To begin, we overview an approximation of the posterior probability density to show the origin of $$\lceil - \rho_c \log(2 \pi) \rfloor$$ in Equation (19) (e.g. see Robert 2007). As in Part 3 of the Appendix, we will omit the ‘$$c$$’ subscript below when doing so does not affect clarity. Define \begin{eqnarray*} J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) := - \frac{1}{n} \nabla \nabla l^{(\Sigma i)} (\widehat{\theta}^{(\Sigma i)}). \end{eqnarray*} Then, the probability of data $$\textbf{x}$$ for a given prior $$\pi(\theta)$$ is \begin{eqnarray} p(\textbf{x}) &=& \int \prod_{i=1}^n f(x_i | \theta) \pi(\theta) d\theta\nonumber \\ &=& \int \exp \left\{ l^{(\Sigma i)} (\theta) \right\} \pi(\theta) d\theta \nonumber\\ &=& \int \exp \left\{ l^{(\Sigma i)} (\widehat{\theta}^{(\Sigma i)}) - \frac{n}{2} (\theta - \widehat{\theta}^{(\Sigma i)} )^T\right.\nonumber\\ &&\left.\cdot J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) \cdot (\theta - \widehat{\theta}^{(\Sigma i)} ) + Re \right\} \nonumber \\ && \times \left\{ \pi(\widehat{\theta}^{(\Sigma i)}) + (\theta - \widehat{\theta}^{(\Sigma i)} ) \nabla \pi(\widehat{\theta}^{(\Sigma i)}) + Re \right\} d\theta \nonumber \\ &\approx& \exp \left\{ l^{(\Sigma i)} (\widehat{\theta}^{(\Sigma i)}) \right\} \pi(\widehat{\theta}^{(\Sigma i)}) \times \nonumber \\ && \int \exp \left\{ - \frac{n}{2} (\theta - \widehat{\theta}^{(\Sigma i)} )^T \cdot J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) \cdot (\theta - \widehat{\theta}^{(\Sigma i)} ) \right\} d\theta.\nonumber\\ \label{BIC-approx} \end{eqnarray} (A.7) We use the following results, \begin{eqnarray} &&\int \exp \left\{ - \frac{n}{2} (\theta - \widehat{\theta}^{(\Sigma i)} )^T \cdot J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) \cdot (\theta - \widehat{\theta}^{(\Sigma i)} ) \right\} d\theta\nonumber \\ &&\quad \approx (2\pi)^{\rho/2} n^{-\rho/2} \left| J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) \right|^{-1/2}, \label{bic_factor_integral} \end{eqnarray} (A.8) which can be derived from the probability density function of a multivariate normal distribution. Thus, \begin{eqnarray} \log p(\textbf{x}) &\approx& l^{(\Sigma i)} (\widehat{\theta}^{(\Sigma i)}) - \frac{\rho}{2} \log n + \frac{\rho}{2} \log(2\pi) + \log \pi(\widehat{\theta}^{(\Sigma i)})\nonumber\\ &&-\frac{1}{2} \log \left| J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) \right| \nonumber \\ &\approx& l^{(\Sigma i)} (\widehat{\theta}^{(\Sigma i)}) - \frac{\rho}{2} \log n + \frac{\rho}{2} \log(2\pi). \label{BIC-concat-append} \end{eqnarray} (A.9) Multiplying the right side of Equation (A.9) by $$-2$$, we obtain Equation (19), \begin{eqnarray*} \text{BIC$^{(p)}$}[m_c, \Sigma i] := -2 l^{(\Sigma i)} (\widehat{\theta}_c^{\,(\Sigma i)}) + \rho_c \log n - \rho_c \log(2\pi). \end{eqnarray*} In the conventional definition of BIC, $$\lceil - \rho_c \log(2 \pi) \rfloor$$ is ignored. But, in our study, we take it into consideration for more accurate comparison (see Theory). By using Equations (A.5) and (A.6), we can derive the following relationships. \begin{eqnarray} &&\!\!\!\!\!\exp \left\{ l^{(i)} (\widehat{\theta}^{(\Sigma i)})/(1-p_i) \right\}\quad\nonumber\\ &&=\exp \left\{ l^{(i)} (\widehat{\theta}^{(\Sigma i)})/(1-p_i) \right\} \cdot 1 \nonumber \\ &&= \exp \left\{ l^{(i)} (\widehat{\theta}^{(\Sigma i)}) /(1-p_i) \right\} \cdot \int \pi (\theta) d\theta \nonumber\\ &&= \int \exp \left\{ l^{(i)} (\widehat{\theta}^{(\Sigma i)})/(1-p_i) \right\} \pi (\theta) d\theta \nonumber\\ &&\approx \int \exp \left\{ l^{(i)} (\widehat{\theta}^{(\Sigma i|*)})/(1-p_i) \right\} \pi (\theta) d\theta \nonumber\\ &&\approx \int \exp \left\{ l^{(i)}(\widehat{\theta}^{(i)})/(1-p_i) \right. \nonumber\\ &&\quad\left. + \frac{1}{2} (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)} )^T \left[ \nabla \nabla l^{(i)}(\widehat{\theta}^{(i)})/(1-p_i) \right] (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)} ) \right\} \nonumber\\ &&\quad\times \left\{ \pi(\widehat{\theta}^{(i)}) + Re \right\} d\theta \nonumber\\ &&\approx \exp \left\{ l^{(i)}(\widehat{\theta}^{(i)})/(1-p_i) \right\} \cdot \pi(\widehat{\theta}^{(i)}) \nonumber\\ &&\quad\times (2\pi)^{\rho/2} n_i^{-\rho/2} \left| \frac{(-1)}{n_i (1-p_i)} [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ] \right|^{-1/2}, \label{BICP_prop_approx} \end{eqnarray} (A.10) where we used an approximation similar to Equation (A.8), \begin{eqnarray*} &&\int \exp \left\{ \frac{1}{2} (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)})^T\right.\nonumber\\ &&\quad \left.\cdot [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)})/(1-p_i) ] \cdot (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)}) \right\} d\theta \\ &&= \int \exp \left\{ - \frac{n_i}{2} (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)})^T\right.\\ &&\quad \left.\cdot \frac{(-1)}{n_i (1-p_i)} [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ] \cdot (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)}) \right\} d\theta\\ &&\approx (2\pi)^{\rho/2} n_i^{-\rho/2} \left| \frac{(-1)}{n_i (1-p_i)} [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ] \right|^{-1/2}. \end{eqnarray*} Taking the logarithm of both sides of Equation (A.10) followed by ignoring minor terms results in \begin{eqnarray*} l^{(i)} (\widehat{\theta}^{(\Sigma i)})/(1-p_i) &\approx& l^{(i)} (\widehat{\theta}^{(i)})/(1-p_i) - \frac{1}{2} \rho (\log n_i - \log (2\pi) ). \label{BICP_prop_approx2} \end{eqnarray*} Therefore, by recovering model index $$m_c$$, we obtain the following approximation \begin{eqnarray} l^{(i)} (\widehat{\theta}_c^{\,(\Sigma i)}) &\approx& l^{(i)} (\widehat{\theta}_c^{\,(i)}) - \frac{1}{2} \rho_c (1-p_i) (\log n_i -\log (2\pi)).\qquad\quad \label{partition-laplace-trans} \end{eqnarray} (A.11) From the definition of Equation (19) and the approximation of Equation (A.11), we can consider the following approximation and definition, \begin{align} &\text{BIC$^{(p)}$}[m_c, \Sigma i]\nonumber\\ &\quad := -2 l^{(\Sigma i)}_c(\widehat{\theta}^{(\Sigma i)}) + \rho_c (\log n - \log(2 \pi))\nonumber\\ &\quad = \sum_{i=1}^K \left\{ -2 l^{(i)}(\widehat{\theta}_c^{\,(\Sigma i)}) \right\} + \rho_c (\log n - \log(2 \pi))\nonumber\\ &\quad\approx \sum_{i=1}^K \left\{ -2 l^{(i)}(\widehat{\theta}_c^{\,(i)}) +(1{-}p_i) \rho_c (\log n_i {-} \log(2 \pi) ) \right\}\nonumber\\ &\qquad + \rho_c (\log n - \log(2 \pi))\nonumber \\ &\quad=: \text{BIC}^{(p)}[\{(m_c, i)\}], \nonumber \end{align} where ‘$$=:$$’ implies that the right side of the equation is defined as the left side of the equation. References Adachi J., Waddell P.J., Martin W., Hasegawa M. 2000 . Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast. J. Mol. Evol. 50 , 348 – 358 . Google Scholar CrossRef Search ADS PubMed Akaike H. 1974 . A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19 , 716 – 723 . Google Scholar CrossRef Search ADS Anderson C.N., Liu L., Pearl D., Edwards S.V. 2012 . Tangled trees: the challenge of inferring species trees from coalescent and noncoalescent genes. Methods Mol Biol. 856 , 3 – 28 . Google Scholar CrossRef Search ADS PubMed Bell E.T. 1934 . Exponential numbers. Amer. Math. Monthly 41 , 411 – 419 . Google Scholar CrossRef Search ADS Berger J.O. 1985 . Statistical decision theory and Bayesian analysis. New York : Springer-Verlag . Google Scholar CrossRef Search ADS Bishop Y.M., Fienberg S.E., Holland P.W. 2007 . Discrete multivariate analysis. New York : Springer-Verlag . p. 475 – 484 . Bozdogan H. 1987 . Model selection and Akaike’s Information Criterion (AIC): the general theory and its analytical extensions. Psychometrika 52 , 345 – 370 . Google Scholar CrossRef Search ADS Burnham K.P., Anderson D.R. 2002 . Model selection and multimodel inference. 2nd ed . New York : Springer-Verlag . p. 64 – 66 , 284 – 285 . Cao Y., Sorenson M.D., Kumazawa Y., Mindell D.P., Hasegawa M. 2000a . Phylogenetic position of turtles among amniotes: evidence from mitochondrial and nuclear genes. Gene 259 , 139 – 148 . Google Scholar CrossRef Search ADS Cao Y., Fujiwara M., Nikaido N., Okada N., Hasegawa M. 2000b . Interordinal relationships and timescale of eutherian evolution as inferred from mitochondrial genome data. Gene 259 : 149 – 158 . Google Scholar CrossRef Search ADS Chang J.T. 1996 . Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Math. Biosci. 134 : 189 – 215 Google Scholar CrossRef Search ADS PubMed Chikina M., Robinson J.D., Clark N.L. 2016 . Hundreds of genes experienced convergent shifts in selective pressure in marine mammals. Mol. Biol. Evol. 33 (9) : 2182 – 2192 . Google Scholar CrossRef Search ADS PubMed Colombo M., Damerau M., Hanel R., Salzburger W., Matschiner M. 2015 . Diversity and disparity through time in the adaptive radiation of Antarctic notothenioid fishes. J. Evol. Biol. 28 (2) : 376 – 394 . Google Scholar CrossRef Search ADS PubMed Draper D. 1995 . Assessment and propagation of model uncertainty. J. R. Statist. Soc. B 57 , 45 – 97 . Dziak J.J., Coffman D.L., Lanza S.T., Li R. 2012 . Sensitivity and specificity of information criteria. Technical Report Series #12-119. The Pennsylvania State University. State College, PA . Felsenstein J. 1981 . Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17 , 368 – 376 . Google Scholar CrossRef Search ADS PubMed Felsenstein J. 1989 . PHYLIP—phylogeny inference package (version 3.2). Cladistics 5 : 164 – 166 Hasegawa M., Kishino H., Yano T. 1985 . Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22 , 160 – 174 . Google Scholar CrossRef Search ADS PubMed Hastie T., Tibshirani R., Friedman J. 2009 . The elements of statistical learning. Chapter 7 . New York : Springer-Verlag . Jukes T.H., Cantor C.R. 1969 . Evolution of protein molecules. In: Munro H.N., editors. Mammalian protein metabolism. New York : Academic Press , p. 21 – 132 . Google Scholar CrossRef Search ADS Kimura M. 1980 . A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16 , 111 – 120 . Google Scholar CrossRef Search ADS PubMed Kolaczkowski B., Thornton J.W. 2004 . Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431 , 980 – 984 . Google Scholar CrossRef Search ADS PubMed Konishi S., Kitagawa G. 2004 . Information criteria (in Japanese). Tokyo : Asakura Publishing Co ., p. 47 – 48 . Lanfear R., Calcott B., Ho S.Y.W., Guindon S. 2012 . PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol. Biol. Evol. 29 , 1695 – 1701 . Google Scholar CrossRef Search ADS PubMed Lanfear R., Calcott B., Kainer D., Mayer C., Stamatakis A. 2014 . Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evol. Biol. 14 , 82 – 95 . Google Scholar CrossRef Search ADS PubMed Leigh J.W., Lapointe F.J., Lopez P., Bapteste E. 2011 . Evaluating phylogenetic congruence in the post-genomic era. Genome Biol Evol. 3 , 571 – 587 . Google Scholar CrossRef Search ADS PubMed Lemmon A.R., Moriarty E.C. 2004 . The importance of proper model assumption in Bayesian phylogenetics. Syst. Biol. 53 : 265 – 277 . Google Scholar CrossRef Search ADS PubMed Li C., Lu G., Orti G. 2008 . Optimal data partitioning and a test for Ray–Finned fishes (Actinopterygii) based on ten nuclear loci. Syst. Biol. 57 , 519 – 539 . Google Scholar CrossRef Search ADS PubMed Lopez P., Casane D., Philippe H. 2002 . Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19 , 1 – 7 . Google Scholar CrossRef Search ADS PubMed Nikaido, M., Cao, Y. Harada M., Okada N., Hasegawa M. 2003 . Mitochondrial phylogeny of hedgehogs and monophyly of Eulipotyphla. Mol. Phylogenet. Evol. 28 : 276 – 284 Google Scholar CrossRef Search ADS PubMed Nishihara H., Okada N., Hasegawa M. 2007 . Rooting the eutherian tree: the power and pitfalls of phylogenomics. Genome Biol. 8 : R199.1 – R199.10 Google Scholar CrossRef Search ADS Pupko T., Huchon D., Cao Y., Okada N., Hasegawa M. 2002 . Combining multiple data sets in a likelihood analysis: which models are the best? Mol. Biol. Evol. 19 , 2294 – 2307 . Google Scholar CrossRef Search ADS PubMed Rambaut A., Grassly N.C. 1997 . Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13 : 235 – 238 . Google Scholar PubMed Ripplinger J., Sullivan J. 2008 . Does choice in model selection affect maximum likelihood analysis? Syst. Biol. 57 (1) : 76 – 85 . Google Scholar CrossRef Search ADS PubMed Robert C.P. 2007 . The Bayesian choice. 2/e New York : Springer-Verlag . p. 352 . Schwarz G. 1978 . Estimating the dimension of a model. Ann. Stat. 6 , 461 – 464 . Google Scholar CrossRef Search ADS Seo T.-K. 2008 . Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Mol. Biol. Evol. 25 , 960 – 971 . Google Scholar CrossRef Search ADS PubMed Stamatakis A. 2014 . RAxML Version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30 (9) : 1312 – 1313 . Google Scholar CrossRef Search ADS PubMed Tamura K. 1992 . Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol. Evol. 9 , 678 – 687 . Google Scholar PubMed Tamura K., Nei M. 1993 . Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10 , 512 – 526 . Google Scholar PubMed Tamura K., Stecher G., Peterson D., Filipski A., Kumar S. 2013 . MEGA6: molecular evolutionary genetics analysis version 6.0. Mol. Biol. Evol. 30 (12) : 2725 – 2729 . Google Scholar CrossRef Search ADS Wu C.H., Suchard M.A., Drummond A.J. 2012 Bayesian selection of nucleotide substitution models and their site assignments. Mol. Biol. Evol. 30 (3) : 669 – 699 . Google Scholar PubMed White H., 1982 . Maximum likelihood estimation of misspecified models. Econometrica 50 , 1 – 25 . Google Scholar CrossRef Search ADS Yang Z. 1994a . Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39 , 105 – 111 . Yang Z. 1994b . Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39 , 306 – 314 . Google Scholar CrossRef Search ADS Yang Z. 1996 . Maximum-likelihood models for combined analyses of Multiple sequence data. J. Mol. Evol. 42 , 587 – 596 . Google Scholar CrossRef Search ADS PubMed Yang Z. 2007 . PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24 , 1586 – 1591 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Systematic Biology Oxford University Press

# Information Criteria for Comparing Partition Schemes

, Volume Advance Article (4) – Jan 23, 2018
17 pages

/lp/ou_press/information-criteria-for-comparing-partition-schemes-40UghHZHVw
Publisher
Oxford University Press
ISSN
1063-5157
eISSN
1076-836X
D.O.I.
10.1093/sysbio/syx097
Publisher site
See Article on Publisher Site

### Abstract

Abstract When inferring phylogenies, one important decision is whether and how nucleotide substitution parameters should be shared across different subsets or partitions of the data. One sort of partitioning error occurs when heterogeneous subsets are mistakenly lumped together and treated as if they share parameter values. The opposite kind of error is mistakenly treating homogeneous subsets as if they result from distinct sets of parameters. Lumping and splitting errors are not equally bad. Lumping errors can yield parameter estimates that do not accurately reflect any of the subsets that were combined whereas splitting errors yield estimates that did not benefit from sharing information across partitions. Phylogenetic partitioning decisions are often made by applying information criteria such as the Akaike information criterion (AIC). As with other information criteria, the AIC evaluates a model or partition scheme by combining the maximum log-likelihood value with a penalty that depends on the number of parameters being estimated. For the purpose of selecting an optimal partitioning scheme, we derive an adjustment to the AIC that we refer to as the AIC$$^{(p)}$$ and that is motivated by the idea that splitting errors are less serious than lumping errors. We also introduce a similar adjustment to the Bayesian information criterion (BIC) that we refer to as the BIC$$^{(p)}$$. Via simulation and empirical data analysis, we contrast AIC and BIC behavior to our suggested adjustments. We discuss these results and also emphasize why we expect the probability of lumping errors with the AIC$$^{(p)}$$ and the BIC$$^{(p)}$$ to be relatively robust to model parameterization. AIC, BIC, information criteria, model comparison, multilocus analysis, partition scheme comparison, phylogenomics Probabilistic models of DNA and protein sequence change have central roles in phylogenetics and molecular evolution. Many models have been proposed, but it is not always obvious which are most appropriate. With multilocus data, model choice can become especially difficult because the best model for one subset of the data may be suboptimal for another. The challenge of how to organize data into subsets that should be analyzed by shared parameter values has become increasingly central as improvements in DNA sequencing technology have led to bigger data sets. A simple and intuitive approach is to concatenate multilocus sequence data and regard the merged result as a single locus. However, drawbacks of this “concatenation” procedure exist and “separate analysis” has been suggested as an alternative (e.g. Adachi et al. 2000; Cao et al. 2000a,b; Nikaido et al. 2003; Nishihara et al. 2007). In separate analysis (also referred to as partitioned analysis), some model parameters may be shared among loci but other parameters may be distinct to individual loci. The concatenation approach has been criticized for ignoring heterogeneity among loci because different genes may support different tree topologies for reasons ranging from incomplete lineage sorting to horizontal gene transfer to introgression (Leigh et al. 2011; Anderson et al. 2012). With concatenation, locus-specific information is lost. In addition, evolutionary heterogeneity among loci is not limited to topology. Even when multiple loci support the same topology, natural selection, and/or mutation can cause the nucleotide substitution process to vary among loci. For example, when multiple loci are generated by an identical tree topology but according to different branch lengths, ignoring the resulting heterotachy (Lopez et al. 2002) can cause the inferred topology to differ from the one that is shared among the individual loci (e.g. Chang 1996; Kolaczkowski and Thornton 2004; Seo 2008). For diverse reasons, it would be preferable to handle variation among loci when variation exists. Although separate analysis can have advantages, it also has disadvantages. Because evolutionary parameters are separately estimated for each locus, the total number of estimated parameters can be much greater for separate analysis than concatenation. There is a trade-off between model fit and estimation uncertainty (Hastie et al. 2009). As the number of parameters increases, model fit tends to improve but the variances for estimated parameters get bigger. While they did not focus on partitioned analysis, Lemmon and Moriarty (2004) did a careful simulation study regarding underparameterization and overparameterization in phylogenetics. They found that both have negative consequences, but those from underparameterization were more severe. Thus, it is crucial to balance the number of parameters and estimation uncertainty. Another drawback of separate analyses is that they require more computation than concatenation. Therefore, for a given set of multilocus data, careful consideration is needed for whether all loci should be handled separately or whether some should be concatenated. One goal is to determine an optimal “partition scheme,” in which data partitions with similar evolutionary properties are merged and in which partitions with dissimilar properties are not merged. Previous studies (Li et al. 2008; Lanfear et al. 2012, 2014) have noted that partition schemes can be regarded as statistical models and compared via conventional model selection criteria such as the likelihood ratio test (LRT), or the Akaike information criterion (AIC) (Akaike 1974), or the Bayesian information criterion (BIC) (Schwarz 1978). These criteria can then be employed as part of an exhaustive search that examines all possible partition schemes to find the best one. Alternatively, a heuristic search can be used along with the selection criteria. This is often desirable because the number of possible partition schemes grows quickly as the number of loci increases. A Bayesian approach to simultaneously estimate the number of partitions and their model parameters (Wu et al. 2012) is another possible alternative. Two sorts of errors can be made in phylogenetic partitioning. These errors are lumping partitions together when they should be separately treated (i.e. underparameterization) and splitting loci into distinct subsets when they should be grouped together (i.e. overparameterization). Lumping errors can yield bias such as systematic errors in topology estimation (e.g. Chang 1996; Kolaczkowski and Thornton 2004; Seo 2008) whereas splitting errors can cause additional parameter uncertainty. Here, we introduce and justify two model selection criteria that we term the AIC$$^{(p)}$$ and BIC$$^{(p)}$$ because they are modifications designed for partitioning decisions of the widely-used AIC and BIC. The AIC$$^{(p)}$$ and the BIC$$^{(p)}$$ are predicated on the assumption that lumping errors have more serious consequences than splitting errors. Theory We motivate the new information criteria by showing the connection between the AIC$$^{(p)}$$ and a likelihood ratio test of the null hypothesis of partition homogeneity versus the alternative of partition heterogeneity. We then review the connection between the AIC and the Kullback– Leibler deviation. We follow this with an explanation of how the AIC$$^{(p)}$$ relates to the Kullback– Leibler (KL) deviation between the truth and a particular partition scheme of interest when there is homogeneity between partitions. We complete this section by introducing the BIC$$^{(p)}$$. Introduction and the Likelihood Ratio Test as Motivation Suppose that a K-partition scheme is to be compared with the “concatenated” 1-partition scheme. The parameter vector of model $$m_i$$ for partition $$i \in \{ 1,2, \ldots, K \}$$ of the K-partition scheme will be denoted $$\theta_{i}$$. The log-likelihood of the $$i$$th partition with model $$m_i$$ is defined as, \begin{eqnarray*} l^{(i)}(\theta_{i}) := \sum_{j=1}^{n_i} \log f (x_{ij} | \theta_{i} ), \end{eqnarray*} where $$x_{ij}$$ is the $$j$$th aligned sequence column of the $$i$$th partition and $$n_i$$ is the sequence length of the $$i$$th partition. When the entire collection of possible partitions is being considered, we employ the summation sign $$\{\Sigma i := 1 + \cdots + K \}$$ to represent this 1-partition concatenation scheme. The model for the 1-partition scheme is termed $$m_c$$ and its log-likelihood is \begin{eqnarray*} l^{(\Sigma {i})}(\theta_{c}) := l^{(1+2+\cdots+K)}(\theta_{c}) = \sum_{i=1}^K l^{(i)}(\theta_{c}). \end{eqnarray*} We treat and discuss generalizations later, but we begin by assuming that the concatenated model $$m_c$$ and the models $$m_i$$ for each partition $$i$$ share the same parameterization (i.e. topology and substitution model) but may differ in parameter values. For example, all models could share the same topology and all could have the HKY (Hasegawa et al. 1985) parameterization ($$m_i = m_c$$). The maximum likelihood estimators (MLEs) of the $$i$$th partition and of the 1-partition scheme are: \begin{eqnarray*} \widehat{\theta}_{c}^{(i)} &:=& \mathop{\mbox{argmax}}_{\theta_{c}} \left\{ l^{(i)}(\theta_{c}) \right\} \\ \widehat{\theta}_{c}^{(\Sigma {i})} &:=& \mathop{\mbox{argmax}}_{\theta_{c}} \left\{ l^{(\Sigma {i})}(\theta_{c}) \right\}, \end{eqnarray*} where $$\mathop{\mbox{argmax}}_{\theta_\beta} {h(\theta_\beta)}$$ implies “the values of the parameters $$\theta_\beta$$ of model $$m_\beta$$ that maximize $$h(\theta_\beta)$$”. The $$\widehat{\theta}_c^{\,(i)}$$ estimators represent MLEs obtained from the $$i$$th partition with model $$m_c$$. Note that $$\widehat{\theta}_c^{\,(i)}$$ and $$\widehat{\theta}_{c}^{(\Sigma {i})}$$ are expected to be unequal because the former values are obtained from only the $$i$$th partition whereas the latter values are obtained from the entire data set. A limitation of our notation is that it does not reflect the frequently arising situation where some parameters are forced to have values that are shared among all partitions and other parameters are allowed to have values that are unique to each partition. While this limitation could be corrected by expanding the notation, we choose to adopt our more simple notation and instead focus on the cases where either all parameter values are shared among partitions or all have distinct values for each partition. However, as will be emphasized later, the concepts that we discuss and the theory that justifies the AIC$$^{(p)}$$ and BIC$$^{(p)}$$ also apply to the case where some parameters have values that are shared among partitions and others do not. Consider the likelihood ratio test when the null hypothesis is a single partition (i.e. concatenation) and when the more general hypothesis is that each of the $$K$$ partitions shares the parameterization of the 1-partition model but that each has its own parameter values. Because the null hypothesis is a special case that is nested within the more general one, the likelihood ratio test statistic can be approximated as having a $$\chi^2$$ distribution when the null hypothesis is true. Specifically, $$2 \left\{ \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) - l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) \right\} \sim \chi_{(K-1)\rho_c}^2, \label{chisquaredist}$$ (1) where $$\rho_c$$ parameters are estimated according to the null hypothesis and where $$\rho_c$$ parameters are separately estimated for each partition in the K-partition scheme. When the null hypothesis of partition homogeneity is correct, Equation (1) implies $$-2 l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) \stackrel{E}{=} -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) + (K-1)\rho_c \label{LRTrelationship1}$$ (2) because $$(K-1)\rho_c$$ is the expected value of a $$\chi_{(K-1)\rho_c}^2$$ random variable. Here, the “$$\stackrel{E}{=}$$” sign indicates that the expectations are equal. Therefore, \begin{eqnarray} -2l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) + 2 \rho_c & \stackrel{E}{=} & -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) + (K-1) \rho_c + 2 \rho_c \nonumber\\ \end{eqnarray} (3) \begin{eqnarray} \label{LRTrelationship2} \\ & = & -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) + (K+1) \rho_c. \label{LRTrelationship2-b} \end{eqnarray} (4) Equation (2) shows that the expected log-likelihood of a concatenated model can be used to approximate the expected log-likelihood of a K-partition model when partitions are homogeneous (or vice-versa). The $$2 \rho_c$$ term on the left side of Equation (3) happens to be the AIC$$^{(p)}$$ penalty for a concatenated model. It is the same penalty as would be assigned by the AIC. The $$(K+1) \rho_c$$ term on the right side of Equation (4) is the AIC$$^{(p)}$$ penalty for a K-partition model and the $$(K-1) \rho_c$$ term on the right side of Equation (3) reflects the extra cost of the act of partitioning. In contrast, the AIC penalty for a K-partition model would be $$2K \rho_c$$ and this can be much heavier than the AIC$$^{(p)}$$ penalty when $$K$$ is large. As a result, the AIC is more prone than the AIC$$^{(p)}$$ to favoring concatenated models. While the LRT approach to model comparison has some attractive features, one limitation is that the LRT approach assumes that the more restricted “nested” statistical model is true. The AIC simply aims to select the candidate that is closest to the truth and it relies on an approximation that is expected to be “excellent” when the assumed model is “good” (Burnham and Anderson 2002). Likewise, the AIC$$^{(p)}$$ differs from the LRT that inspires Equation (4) because it is not predicated on the substitution model and tree topology being correct (see Part 1 of the Appendix). However, Equation (3) illustrates that the LRT and the AIC$$^{(p)}$$ penalty are closely connected when the substitution model and topology do happen to be correct. Kullback– Leibler Divergence (KLD) and the AIC Suppose that $$n$$ homogeneous samples, ($$x_1, \cdots, x_n$$), are obtained from the true but unknown distribution $$g(x)$$, and we want to fit them with model $$f(x|\theta)$$. The KLD between $$g(x)$$ and $$f(x|\theta)$$ is \begin{eqnarray} KLD[g;f(\cdot|\theta)] &=& \int \log \left( \frac{g(x)}{f(x|\theta)} \right) g(x) dx \nonumber \\ &=& E_g \left[\log g(x) \right] - E_g[\log f(x|\theta)]. \label{KLD} \end{eqnarray} (5) The model whose KLD is minimized is regarded as the best among model candidates. Because $$E_g \left[\log g(x) \right]$$ does not vary among models, Equation (5) shows that the best model must be the one that maximizes $$E_g[\log f(x|\theta)]$$. Akaike (1974) showed that $$E_g[\log f(x|\widehat{\theta})]$$ can be approximated using the MLE $$\widehat{\theta}$$ and an adjustment for the number of parameters that are estimated. Written in our notation, Akaike (1974) showed \begin{eqnarray} KLD[g;f(\cdot|\widehat{\theta})] &=& E_g [\log g(x) ] - E_g[\log f(x|\widehat{\theta})] \label{mKLD} \\ \end{eqnarray} (6) \begin{eqnarray} & \simeq& E_g [\log g(x) ] - \frac{1}{n} \left\{ \sum_{i=1}^{n} \log f(x_i|\widehat{\theta}) - \rho_{\theta} \right\}, \nonumber\\ \label{mKLD2} \end{eqnarray} (7) where $$\rho_{\theta}$$ is the dimension of $$\theta$$ and where the approximation requires $$n$$ to be large. In Equation (7), the bracketed term is the unbiased estimator of $$n E_g[\log f(x|\widehat{\theta})]$$. The $$\rho_{\theta}$$ in this estimator corrects the bias and is necessary because the data set is used to get $$\widehat{\theta}$$ and is then used again to approximate $$E_g [\cdot]$$ (Konishi and Kitagawa 2004). The AIC is defined as \begin{eqnarray} \text{AIC} &:=& -2 \sum_{i=1}^n \log f(x_i | \widehat{\theta}) + 2 \rho_\theta \label{AIC-def-original} \end{eqnarray} (8) and is derived from this unbiased estimator. When there are competing models, the one that maximizes its unbiased estimator of $$E_g[\log f(x|\theta)]$$ (or equivalently that minimizes its AIC) is regarded as the best model. Applying the definition of Equation (8), the AIC scores of concatenated data are written in our notation as \begin{eqnarray} \text{AIC}[m_c, \Sigma i] &:=& -2 l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) + 2 \rho_c \label{AIC-concat} \end{eqnarray} (9) where $$\rho_c$$ is the dimension of $$\theta_c$$ (see also Lanfear et al. 2012, 2014; Li et al. 2008). For phylogenetic applications, the number of free parameters can be separated as \begin{eqnarray} \rho_c = \rho_{\tau} + \rho_{\mu}, \label{rho-separate-1} \end{eqnarray} (10) where $$\rho_{\tau}$$ is the number of tree branches and $$\rho_{\mu}$$ is the number of the remaining parameters (e.g. parameters affecting nucleotide frequencies, transition/transversion ratios, rate heterogeneity among sites, etc). Writing $$\left\{(m_i, i) \right\}$$ to indicate that each partition $$i$$ of the $$K$$ total partitions has its own model $$m_i$$, the AIC scores of partitioned data (Lanfear et al. 2012, 2014) are $$\text{AIC}[\left\{(m_i, i) \right\} ] := -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_{i}^{(i)}) + 2 \rho_{\left\{(m_i, i) \right\}}, \label{AIC-partition-conven}$$ (11) where $$\rho_{\left\{(m_i, i) \right\}}$$ is the number of freely estimated parameters and $$\rho_{\left\{(m_i, i) \right\}} = \sum_{i=1}^K \rho_i$$. In the exhaustive comparison of partition schemes, the scores of Equation (11) are calculated for all candidate schemes and the minimum score determines the best one. Widely-used heuristic algorithms for choosing partition schemes (Li et al. 2008; Lanfear et al. 2012, 2014) involve repeatedly comparing a 2– partition scheme with a 1-partition scheme. Different $$\rho_{\tau}$$ values in Equations (9) and (11) could be needed if the 1-partition and K-partition schemes adopted different tree topologies (e.g. a fully bifurcating topology in one scheme and a “star topology” in the other). When analyzing multilocus data, a model that has corresponding branch lengths of different partitions differ only by a proportionality constraint (Yang 1996; Pupko et al. 2002) is often adopted. We will refer to this as the “linked branch length” (LBL) model (see also Lanfear et al. 2012). With the LBL model, a set of branch lengths that will influence all partitions is estimated as is a proportionality factor for each partition. The sum of proportionality factors is typically restricted to ensure uniqueness of MLEs and the degrees of freedom for the proportionality factors is $$(K-1)$$ when there are $$K$$ partitions. A model that has branch lengths being independently and separately estimated for all partitions can be used as an alternative to a proportional model. This model will be referred to as the “unlinked branch length” (UBL) model (see also Lanfear et al. 2012). With multiple partitions, the LBL model can substantially reduce the number of free parameters when $$\rho_{\tau} \gg 1$$ because the number of free parameters for branch lengths is $$\rho_{\tau} + (K-1)$$ with the LBL model whereas it is $$K\rho_{\tau}$$ for the UBL model. However, the more limited flexibility of the LBL model means it is less satisfactory than the UBL model for some data sets (e.g. Pupko et al. 2002). For the UBL and LBL treatments, \begin{eqnarray} &&\!\!\!\!\text{AIC}[\left\{(m_c, i) \right\} ]\nonumber\\ &&\!\!\!\!:= \left\{ \begin{array}{@{}ll} -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) + 2 K \rho_c & \text{(UBL model when $m_i = m_c \;\; \forall i$)},\\[4pt] -2 \sum_{i=1}^K l^{(i)}(\widehat{\theta}_c^{\,(i)}) \\[4pt] \ \ \ \ + 2 \{\rho_{\tau}+(K-1)+K\rho_{\mu}\} & \text{(LBL model when $m_i = m_c \;\; \forall i$).} \\ \end{array}\right. \nonumber\\\label{AIC-partition-conven2} \end{eqnarray} (12) Here, we focus on comparing a K-partition scheme with a 1-partition scheme. The aforementioned heuristic algorithms would have $$K=2$$ but larger values of $$K$$ are also possible. Using AIC, the 1-partition scheme is selected when $$\text{AIC}[m_c, \Sigma i] < \text{AIC}[\left\{(m_c, i) \right\} ]$$ and the K-partition scheme is selected when the inequality is in the other direction. Partition Groups One way to compute $$\text{AIC}[\left\{(m_i, i) \right\} ]$$ begins by organizing partitions into groups where partitions in the same group share topology and substitution model parameterization. Parameter values may vary among partitions within a group but model parameterization would not vary within the group. By grouping partitions in this way, the AIC score for each partition group can first be calculated and then the AIC score for the entire data set can be determined by summing all the AIC group scores. This way of organizing the AIC computations is convenient because the issue of whether or not to concatenate partitions only becomes relevant if partitions have identical model parameterization. As with the AIC, the AIC$$^{(p)}$$ that is introduced here can be computed for the entire data set by summing the AIC$$^{(p)}$$ scores of each individual partition group. Therefore, we focus on how to calculate the AIC$$^{(p)}$$ scores when the decision is either that there is one concatenated partition with model $$m_c$$ or there are $$K>1$$ distinct partitions that each have model $$m_c$$. For the same reasons, we will later focus on how to calculate BIC$$^{(p)}$$ scores when the decision is either that there is one concatenated partition with model $$m_c$$ or there are $$K>1$$ distinct partitions that each have model $$m_c$$. Kullback– Leibler Divergence and the AIC$$^{(p)}$$ Our AIC$$^{(p)}$$ modification of the AIC has penalty terms that are intended to reflect how much improvement in model fit would be expected if partitions are homogeneous but sequence data are analyzed by separately estimating parameters from each partition. These penalties can therefore be interpreted as stemming from the act of partitioning. Our suggestion is that the best candidate partition scheme should be selected after accounting for the act of partitioning. The AIC$$^{(p)}$$ penalty is not as heavy as the AIC penalty of $$2 \sum_{i=1}^K \rho_i$$ in Equation (11). Heavy penalties favor simple models (e.g. concatenation) and can lead to selection of a 1-partition model even when partitions are highly heterogeneous. The idea underlying the AIC$$^{(p)}$$ is to have a model selection criterion that places all candidate selection schemes on an equal footing by using the appropriate bias correction term according to partition homogeneity. After this bias adjustment, we reason that the best partitioning scheme is the one with the optimal score. Consider the situation in which $$n$$ samples are separated into $$K$$ partitions. The size of each partition will be $$n_i$$ ($$\sum_{i=1}^K n_i = n$$) and the MLE of the 1-partition scheme will be denoted $$\widehat{\theta}^{(\Sigma i)}$$. The KLD for the $$i$$th partition is \begin{eqnarray} KLD[g;f(\cdot|\widehat{\theta}^{(i)})] &\simeq& E_g [\log g(x) ]\nonumber\\ &&-\frac{1}{n_i} \left\{ \sum_{j=1}^{n_i} \log f(x_{ij}|\widehat{\theta}^{(i)}) - \rho_{\theta} \right\}. \label{mKLD_part} \end{eqnarray} (13) A weighted average of Equation (13) with weights $$w_i = n_i / n$$ yields the KLD when the model of interest has heterogeneous partitions, \begin{eqnarray} \sum_{i=1}^K w_i KLD[g_i;f(\cdot|\widehat{\theta}^{(i)})] &\simeq& \sum_{i=1}^K w_i E_{g_i} [\log g_i(x) ]\nonumber\\ &&- \frac{1}{n} \left\{\!\sum_{i=1}^K\!\sum_{j=1}^{n_i} \log f(x_{ij}|\widehat{\theta}^{(i)}){-}K \rho_{\theta}\!\right\}.\nonumber\\ \label{mKLD_part_weighted_average} \end{eqnarray} (14) Minimizing the AIC of the UBL parameterization in Equation (12) can therefore be seen to be equivalent to minimizing $$\sum_{i=1}^K w_i KLD[g_i;f(\cdot|\widehat{\theta}^{(i)})]$$. Now, consider the case where the UBL model is used but the $$K$$ partitions are actually homogeneous. Part 2 of the Appendix outlines one reason why the drawbacks of incorrectly partitioning homogeneous data do not seem especially severe. From Equation (7) and the argument outlined in Part 3 of the Appendix, we can show that the proper bias correction for “partitioning homogeneous data” is $$\frac{K+1}{2} \rho_{\theta}$$: \begin{eqnarray} KLD[g;f(\cdot|\widehat{\theta}^{(\Sigma i)})] & \simeq& E_g [\log g(x)]\nonumber\\ && -\frac{1}{n} \left\{ \sum_{i=1}^K \sum_{j=1}^{n_i} \log f(x_{ij}|\widehat{\theta}^{(\Sigma i)}) - \rho_{\theta} \right\} \nonumber\\ & \simeq & E_g [\log g(x)]\nonumber\\ &&-\frac{1}{n} \left\{ \sum_{i=1}^K \sum_{j=1}^{n_i} \log f(x_{ij}|\widehat{\theta}^{(i)}) - \frac{K+1}{2} \rho_{\theta} \right\}\nonumber\\ \label{KLD_equal} \end{eqnarray} (15) This means that either a 1-partition scheme or a K-partition scheme can be employed when there is homogeneity among partitions, but the proper bias correction depends on the number of partitions. The bias correction for partitioning homogeneous data can be contrasted to a bias correction when partitions are heterogeneous. Whereas the bias correction for partitioning homogeneous data is $$\frac{K+1}{2} \rho_\theta$$, the more extreme correction for partitioning heterogeneous data is $$K \rho_\theta$$. The bias correction for partitioning homogeneous partitions closely corresponds to the likelihood ratio test (see Equation (3)). Based on this idea and the argument outlined in Part 3 of the Appendix, we can define the AIC$$^{(p)}$$ scores of a K-partition scheme and a concatenated scheme. As above, we assume that each of the $$K$$ partitions has the same model parameterization as the concatenated sequences (i.e. $$m_c = m_i \;\; \forall i$$). With this assumption, we have the AIC$$^{(p)}$$ of the K-partition scheme being $$\text{AIC}^{(p)}[ \left\{ (m_c, i) \right\} ] := \sum_{i=1}^K \left\{ -2 l^{(i)}(\widehat{\theta}_c^{\,(i)}) \right\} + \delta + 2 \rho_c \label{aicp-part},$$ (16) where $$\delta = \rho_{\left\{ (m_c, i) \right\} } - \rho_c$$ and represents the difference between the total number of parameters in the partition scheme ($$\rho_{\left\{ (m_c, i) \right\} }$$) and the total number in the concatenation scheme ($$\rho_c$$). When an UBL model is adopted, $$\delta = \sum_{i=1}^K \rho_c - \rho_c = (K-1)(\rho_{\tau}+\rho_{\mu})$$. When an LBL model is adopted, $$\delta=\rho_{\tau} + (K-1) + K \rho_{\mu} - (\rho_{\tau}+\rho_{\mu}) = (K-1)(1+ \rho_{\mu})$$. In contrast, the AIC$$^{(p)}$$ for the concatenated model is $$\text{AIC}^{(p)}[m_c, \Sigma i] = \text{AIC}[m_c, \Sigma i] := -2 l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) + 2 \rho_c. \label{aicp-concat}$$ (17) In Equation (16), the summed log-likelihoods represent the model fit of the data and the remaining portion ($$\delta + 2 \rho_c$$) works as a penalty. Larger $$\delta$$ values arise from complex partition schemes and therefore complex schemes are accompanied by higher penalties. The $$\text{AIC}^{(p)}$$ therefore accounts for the trade-off between partition fit and partition complexity. The 1-partition scheme is selected when $$\text{AIC}^{(p)}[m_c, \Sigma i] < \text{AIC}^{(p)}[ \left\{ (m_c, i) \right\} ]$$ and the K-partition scheme is selected when the inequality is in the other direction. With the UBL model, the AIC$$^{(p)}$$ score for the K-partition scheme is \begin{eqnarray} \text{AIC}^{(p)}[\left\{ (m_c, i) \right\} ] = \sum_{i=1}^K \left\{ -2 l^{(i)}(\widehat{\theta}_c^{\,(i)}) \right\} + (K+1) \rho_c. \label{aicp-sum-mc} \end{eqnarray} (18) We see from Equation (3) that the AIC$$^{(p)}$$ scores of the concatenated model (see Equation (17)) and the K-partition model (see Equation (18)) are expected to be about the same when the partitions are homogeneous. If the $$K$$ partitions are heterogeneous, the AIC$$^{(p)}$$ score of the K-partition model is therefore expected to be lower. For this reason, we suggest selecting the K-partition model when its score is less than that of the concatenated model. We note that the AIC$$^{(p)}$$ score of Equation (16) can be converted to the AIC score by replacing the $$\delta$$ term with the heavier penalty of $$2 \delta$$. For the UBL parameterization, one way to interpret the AIC and AIC$$^{(p)}$$ penalties is that the AIC has a penalty of $$2$$ each time a parameter is estimated whereas the AIC$$^{(p)}$$ has a penalty of $$2$$ the first time a parameter is estimated but a penalty of only $$1$$ for additional partitions from which the parameter is estimated. The fact that the penalty of $$\text{AIC}^{(p)}[\left\{(m_c, i) \right\}]$$ in Equation (16) is lighter than that of $$\text{AIC}[\left\{(m_c, i) \right\} ]$$ in Equation (12) implies that the AIC$$^{(p)}$$ can detect heterogeneity better than the AIC. This is because a moderate improvement of model fit represented by $$l^{(i)}(\widehat{\theta}_c^{\,(i)})$$ can dominate the penalty term of $$\text{AIC}^{(p)}[\left\{(m_c, i) \right\}]$$ more easily than that of $$\text{AIC}[\left\{(m_c, i) \right\} ]$$. New BIC of Partitioned Data: $$\text{BIC}^{(p)}$$ The BIC aims to find the model with the highest logarithm of the marginal likelihood. The BIC penalty term is $$\rho \log (n)$$ where $$\rho$$ is the number of free parameters and $$n$$ is the number of data points. This penalty term is derived from an approximation of the logarithm of the marginal likelihood that is best when $$n$$ is large. A better approximation would yield a penalty term of $$\rho \log (n) - \rho \log (2 \pi)$$ but the $$\rho \log(2 \pi)$$ is omitted from the BIC because the assumption is that $$\log (n)$$ is substantially bigger than $$\log (2 \pi)$$ (but see Draper 1995). For defining the BIC$$^{(p)}$$ information criterion, we choose to avoid the assumption about the relative values of $$\log (n)$$ and $$\log(2 \pi)$$. Using the relationship between maximum log-likelihoods of concatenated and partitioned data (Equation (15)) when partitions are homogeneous, we can approximate the logarithm of the marginal likelihood of concatenated and partitioned data. We again consider the case where each of the $$K$$ partitions in the partition scheme has the same parameterization as the concatenation model (i.e. $$m_i = m_c \;\; \forall i$$). As shown in Part 4 of the Appendix and using $$p_i$$ to represent the proportion of all data that is in partition $$i$$, this enables us to define BIC$$^{(p)}$$ scores of concatenation and partition schemes: \begin{eqnarray} &&\text{BIC}^{(p)}\,[m_c, \Sigma i] := -2 l^{(\Sigma i)}(\widehat{\theta}_c^{\,(\Sigma i)}) + \rho_c (\log n - \log(2 \pi) )\nonumber\\ \label{BICP_sum} \\ \end{eqnarray} (19) \begin{eqnarray} &&\text{BIC}^{(p)}[\left\{(m_c, i) \right\}]\nonumber\\ &&\quad := \sum_{i=1}^K \left\{ -2 l^{(i)}(\widehat{\theta}_c^{\,(i)}) +(1-p_i) \rho_{\delta} ( \log n_i - \log(2 \pi) ) \right\} \nonumber \\ &&\qquad + \rho_c (\log n - \log(2 \pi)), \label{BICP_part} \end{eqnarray} (20) where $$\rho_{\delta}$$ represents the number of free parameters that are added per partition for each additional partition after the first. For the UBL model, $$\rho_{\delta}=\rho_c$$. For the LBL model, $$\rho_{\delta}=1+\rho_{\mu}$$. When $$\text{BIC}^{(p)}\,[m_c, \Sigma i] < \text{BIC}^{(p)}\,[\left\{(m_c, i) \right\}]$$, we choose the concatenation scheme. When the inequality is reversed, we choose the $$K$$-partition scheme. We note that the penalty difference between Equations (19) and (20) is smaller than the difference between conventional BIC’s for partitioned and concatenated data. This implies that the BIC$$^{(p)}$$ can detect partition heterogeneity better than the BIC. Based on our experience, the BIC$$^{(p)}$$ terms involving $$\log (2 \pi)$$ reasonably often affect partitioning decisions. If the contributions involving $$\log (2 \pi)$$ are ignored, the penalty differences between Equations (19) and (20) would become greater and thus would make it more difficult to detect heterogeneity among partitions. Results and discussion Simulation Studies We performed simulations to compare information criteria. All simulations employed the topology and branch lengths illustrated in Figure 1 to randomly generate data sets consisting of four partitions. To simulate partition homogeneity, all branches in all partitions had length $$0.1$$. To simulate heterogeneous partitions, all branches in partition $$i$$ ($$i \in \left\{ 1,2,3,4 \right \}$$) had branch length $$0.1 + \Delta_i := 0.1 + 0.02 \times (i-1)$$. The number of simulated sites in partition $$i$$ was set to $$i \times n_1$$. This means that the concatenated (1-partition) scheme had $$n= 10 n_1 = n_1 + 2n_1+ 3n_1 +4n_1$$ sites. The values of $$n_1$$ that were explored are $$100$$, $$200$$, $$500$$, $$1000$$, $$1500$$, and $$2000$$. The Seq-Gen software (Rambaut and Grassly 1997) was employed to randomly generate $$1000$$ data sets via the HKY substitution model (Hasegawa et al. 1985) and a five-category discrete-gamma treatment of rate heterogeneity among sites (Yang 1994b). The frequencies of A, C, G, and T were respectively 0.3, 0.2, 0.3, and 0.2 while the values of parameters for rate heterogeneity among sites ($$\alpha$$) and transition/transversion ratio ($$\kappa$$) were both set to 5.0. Figure 1. View largeDownload slide Tree topology of eight taxa for simulation studies. For the simulation of homogeneous and heterogeneous partition schemes, $$(b, \Delta_i)$$ was set to $$(0.1, 0.0)$$, and $$(0.1, 0.02\times(i-1))$$, respectively. Figure 1. View largeDownload slide Tree topology of eight taxa for simulation studies. For the simulation of homogeneous and heterogeneous partition schemes, $$(b, \Delta_i)$$ was set to $$(0.1, 0.0)$$, and $$(0.1, 0.02\times(i-1))$$, respectively. All simulated data sets were analyzed both with a 4-partition scheme and a concatenated (1– partition) scheme. For each of the two schemes, each simulated data set was analyzed with each of eight nucleotide substitution models and each of these cases was explored both with rate homogeneity among sites and with discrete-gamma rate heterogeneity (five rate categories). The eight substitution models that were used for inference are denoted: JC (Jukes and Cantor 1969), K2 (Kimura 1980), F81 (Felsenstein 1981), F84 (Felsenstein 1989), HKY (Hasegawa et al. 1985), T92 (Tamura 1992), TN93 (Tamura and Nei 1993), and GTR (Yang 1994a). For the analyses of the simulated data, we did not try to find the best substitution model and tree topology. Instead, we compared the 1-partition scheme and 4-partition schemes when both assumed the true tree topology and when both assumed the same substitution model. Also, the UBL parameterization was used for investigating the 4-partition schemes so that branch lengths of each partition were independently estimated. Maximum likelihood inference was conducted via Version 4.8 of the baseml program of the PAML software (Yang 2007). Table 1 shows how often the 4-partition scheme was selected with the AIC and the AIC$$^{(p)}$$ when the truth was that partitions are homogeneous (i.e. $$\Delta_i =0$$). Most of the simulation results for the AIC show a low proportion (<0.1%) of incorrectly selecting the 4-partition scheme. In contrast, the $$\text{AIC}^{(p)}$$ incorrectly selects the 4-partition scheme about 50% of the time. This higher proportion is expected because the $$\text{AIC}^{(p)}$$ criterion is designed with the idea that “splitting” errors (incorrectly separating partitions) are less serious than “lumping” errors (incorrectly concatenating partitions). The observation of an approximately 50% error rate is reasonable in light of the fact that the $$\text{AIC}^{(p)}$$ criterion is designed so that its expected value is the same for a 4-partition and 1-partition scheme when the truth is partition homogeneity. Table 1. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by AIC and by AIC$$^{(p)}$$ when partitions were actually homogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.002 0.000 0.001 0.001 0.000 0.000 JC+G 0.001 0.000 0.000 0.000 0.000 0.000 K2 0.001 0.000 0.000 0.000 0.000 0.000 K2+G 0.001 0.000 0.000 0.001 0.000 0.000 F81 0.001 0.000 0.001 0.000 0.001 0.001 F81+G 0.000 0.000 0.001 0.000 0.001 0.001 F84 0.000 0.000 0.000 0.000 0.000 0.000 F84+G 0.000 0.000 0.000 0.000 0.000 0.000 HKY 0.000 0.000 0.000 0.000 0.000 0.000 HKY+G 0.000 0.000 0.000 0.000 0.000 0.000 T92 0.000 0.000 0.000 0.000 0.000 0.000 T92+G 0.000 0.000 0.000 0.000 0.000 0.000 TN93 0.000 0.000 0.000 0.000 0.000 0.000 TN93+G 0.000 0.000 0.000 0.000 0.000 0.000 GTR 0.000 0.000 0.000 0.000 0.000 0.000 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 AIC$$^{(p)}$$ JC 0.511 0.499 0.491 0.475 0.494 0.459 JC+G 0.464 0.446 0.444 0.443 0.459 0.432 K2 0.494 0.484 0.471 0.450 0.492 0.445 K2+G 0.505 0.488 0.477 0.447 0.490 0.446 F81 0.562 0.538 0.528 0.531 0.552 0.514 F81+G 0.551 0.532 0.530 0.526 0.561 0.518 F84 0.533 0.485 0.483 0.463 0.515 0.481 F84+G 0.533 0.487 0.491 0.453 0.500 0.478 HKY 0.530 0.488 0.474 0.455 0.522 0.484 HKY+G 0.518 0.502 0.477 0.450 0.512 0.481 T92 0.521 0.490 0.482 0.449 0.497 0.466 T92+G 0.514 0.484 0.490 0.447 0.491 0.458 TN93 0.541 0.497 0.481 0.462 0.522 0.489 TN93+G 0.531 0.490 0.468 0.449 0.504 0.488 GTR 0.560 0.499 0.489 0.488 0.533 0.480 GTR+G 0.540 0.464 0.489 0.470 0.505 0.467 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.002 0.000 0.001 0.001 0.000 0.000 JC+G 0.001 0.000 0.000 0.000 0.000 0.000 K2 0.001 0.000 0.000 0.000 0.000 0.000 K2+G 0.001 0.000 0.000 0.001 0.000 0.000 F81 0.001 0.000 0.001 0.000 0.001 0.001 F81+G 0.000 0.000 0.001 0.000 0.001 0.001 F84 0.000 0.000 0.000 0.000 0.000 0.000 F84+G 0.000 0.000 0.000 0.000 0.000 0.000 HKY 0.000 0.000 0.000 0.000 0.000 0.000 HKY+G 0.000 0.000 0.000 0.000 0.000 0.000 T92 0.000 0.000 0.000 0.000 0.000 0.000 T92+G 0.000 0.000 0.000 0.000 0.000 0.000 TN93 0.000 0.000 0.000 0.000 0.000 0.000 TN93+G 0.000 0.000 0.000 0.000 0.000 0.000 GTR 0.000 0.000 0.000 0.000 0.000 0.000 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 AIC$$^{(p)}$$ JC 0.511 0.499 0.491 0.475 0.494 0.459 JC+G 0.464 0.446 0.444 0.443 0.459 0.432 K2 0.494 0.484 0.471 0.450 0.492 0.445 K2+G 0.505 0.488 0.477 0.447 0.490 0.446 F81 0.562 0.538 0.528 0.531 0.552 0.514 F81+G 0.551 0.532 0.530 0.526 0.561 0.518 F84 0.533 0.485 0.483 0.463 0.515 0.481 F84+G 0.533 0.487 0.491 0.453 0.500 0.478 HKY 0.530 0.488 0.474 0.455 0.522 0.484 HKY+G 0.518 0.502 0.477 0.450 0.512 0.481 T92 0.521 0.490 0.482 0.449 0.497 0.466 T92+G 0.514 0.484 0.490 0.447 0.491 0.458 TN93 0.541 0.497 0.481 0.462 0.522 0.489 TN93+G 0.531 0.490 0.468 0.449 0.504 0.488 GTR 0.560 0.499 0.489 0.488 0.533 0.480 GTR+G 0.540 0.464 0.489 0.470 0.505 0.467 Table 1. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by AIC and by AIC$$^{(p)}$$ when partitions were actually homogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.002 0.000 0.001 0.001 0.000 0.000 JC+G 0.001 0.000 0.000 0.000 0.000 0.000 K2 0.001 0.000 0.000 0.000 0.000 0.000 K2+G 0.001 0.000 0.000 0.001 0.000 0.000 F81 0.001 0.000 0.001 0.000 0.001 0.001 F81+G 0.000 0.000 0.001 0.000 0.001 0.001 F84 0.000 0.000 0.000 0.000 0.000 0.000 F84+G 0.000 0.000 0.000 0.000 0.000 0.000 HKY 0.000 0.000 0.000 0.000 0.000 0.000 HKY+G 0.000 0.000 0.000 0.000 0.000 0.000 T92 0.000 0.000 0.000 0.000 0.000 0.000 T92+G 0.000 0.000 0.000 0.000 0.000 0.000 TN93 0.000 0.000 0.000 0.000 0.000 0.000 TN93+G 0.000 0.000 0.000 0.000 0.000 0.000 GTR 0.000 0.000 0.000 0.000 0.000 0.000 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 AIC$$^{(p)}$$ JC 0.511 0.499 0.491 0.475 0.494 0.459 JC+G 0.464 0.446 0.444 0.443 0.459 0.432 K2 0.494 0.484 0.471 0.450 0.492 0.445 K2+G 0.505 0.488 0.477 0.447 0.490 0.446 F81 0.562 0.538 0.528 0.531 0.552 0.514 F81+G 0.551 0.532 0.530 0.526 0.561 0.518 F84 0.533 0.485 0.483 0.463 0.515 0.481 F84+G 0.533 0.487 0.491 0.453 0.500 0.478 HKY 0.530 0.488 0.474 0.455 0.522 0.484 HKY+G 0.518 0.502 0.477 0.450 0.512 0.481 T92 0.521 0.490 0.482 0.449 0.497 0.466 T92+G 0.514 0.484 0.490 0.447 0.491 0.458 TN93 0.541 0.497 0.481 0.462 0.522 0.489 TN93+G 0.531 0.490 0.468 0.449 0.504 0.488 GTR 0.560 0.499 0.489 0.488 0.533 0.480 GTR+G 0.540 0.464 0.489 0.470 0.505 0.467 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.002 0.000 0.001 0.001 0.000 0.000 JC+G 0.001 0.000 0.000 0.000 0.000 0.000 K2 0.001 0.000 0.000 0.000 0.000 0.000 K2+G 0.001 0.000 0.000 0.001 0.000 0.000 F81 0.001 0.000 0.001 0.000 0.001 0.001 F81+G 0.000 0.000 0.001 0.000 0.001 0.001 F84 0.000 0.000 0.000 0.000 0.000 0.000 F84+G 0.000 0.000 0.000 0.000 0.000 0.000 HKY 0.000 0.000 0.000 0.000 0.000 0.000 HKY+G 0.000 0.000 0.000 0.000 0.000 0.000 T92 0.000 0.000 0.000 0.000 0.000 0.000 T92+G 0.000 0.000 0.000 0.000 0.000 0.000 TN93 0.000 0.000 0.000 0.000 0.000 0.000 TN93+G 0.000 0.000 0.000 0.000 0.000 0.000 GTR 0.000 0.000 0.000 0.000 0.000 0.000 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 AIC$$^{(p)}$$ JC 0.511 0.499 0.491 0.475 0.494 0.459 JC+G 0.464 0.446 0.444 0.443 0.459 0.432 K2 0.494 0.484 0.471 0.450 0.492 0.445 K2+G 0.505 0.488 0.477 0.447 0.490 0.446 F81 0.562 0.538 0.528 0.531 0.552 0.514 F81+G 0.551 0.532 0.530 0.526 0.561 0.518 F84 0.533 0.485 0.483 0.463 0.515 0.481 F84+G 0.533 0.487 0.491 0.453 0.500 0.478 HKY 0.530 0.488 0.474 0.455 0.522 0.484 HKY+G 0.518 0.502 0.477 0.450 0.512 0.481 T92 0.521 0.490 0.482 0.449 0.497 0.466 T92+G 0.514 0.484 0.490 0.447 0.491 0.458 TN93 0.541 0.497 0.481 0.462 0.522 0.489 TN93+G 0.531 0.490 0.468 0.449 0.504 0.488 GTR 0.560 0.499 0.489 0.488 0.533 0.480 GTR+G 0.540 0.464 0.489 0.470 0.505 0.467 Table 2 shows simulation results for the AIC and the AIC$$^{(p)}$$ when partitions are heterogeneous. In this situation, the 4-partition scheme should be selected over the 1-partition scheme and failure to do so is a “lumping error.” Both the AIC and AIC$$^{(p)}$$ criteria perform well when partitions consist of relatively large numbers of sites. But, Table 2 reveals a marked contrast when partitions have fewer sites. In the situation of $$n_1 =100$$ sites, the AIC makes a lumping error for nearly all simulated data sets whereas the AIC$$^{(p)}$$ is quite unlikely to make these errors. The results show that the $$\text{AIC}^{(p)}$$ has higher “sensitivity” than the AIC. That is, the $$\text{AIC}^{(p)}$$ detects heterogeneity better than the AIC when partitions are heterogeneous. With regard to making lumping errors, we also note that the $$\text{AIC}^{(p)}$$ appears to be relatively robust to model misspecification in comparison with the AIC. Table 2. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by AIC and by AIC$$^{(p)}$$ when partitions were actually heterogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.091 0.508 0.999 1.000 1.000 1.000 JC+G 0.049 0.374 0.998 1.000 1.000 1.000 K2 0.089 0.544 0.999 1.000 1.000 1.000 K2+G 0.040 0.332 0.996 1.000 1.000 1.000 F81 0.049 0.327 0.995 1.000 1.000 1.000 F81+G 0.029 0.230 0.990 1.000 1.000 1.000 F84 0.034 0.344 0.996 1.000 1.000 1.000 F84+G 0.016 0.168 0.981 1.000 1.000 1.000 HKY 0.034 0.353 0.996 1.000 1.000 1.000 HKY+G 0.013 0.170 0.986 1.000 1.000 1.000 T92 0.068 0.466 0.999 1.000 1.000 1.000 T92+G 0.031 0.277 0.994 1.000 1.000 1.000 TN93 0.028 0.291 0.995 1.000 1.000 1.000 TN93+G 0.009 0.135 0.977 1.000 1.000 1.000 GTR 0.012 0.163 0.984 1.000 1.000 1.000 GTR+G 0.004 0.074 0.927 1.000 1.000 1.000 AIC$$^{(p)}$$ JC 0.969 1.000 1.000 1.000 1.000 1.000 JC+G 0.951 0.999 1.000 1.000 1.000 1.000 K2 0.970 1.000 1.000 1.000 1.000 1.000 K2+G 0.951 1.000 1.000 1.000 1.000 1.000 F81 0.961 0.998 1.000 1.000 1.000 1.000 F81+G 0.945 0.998 1.000 1.000 1.000 1.000 F84 0.966 1.000 1.000 1.000 1.000 1.000 F84+G 0.944 0.999 1.000 1.000 1.000 1.000 HKY 0.961 1.000 1.000 1.000 1.000 1.000 HKY+G 0.943 0.999 1.000 1.000 1.000 1.000 T92 0.968 1.000 1.000 1.000 1.000 1.000 T92+G 0.945 1.000 1.000 1.000 1.000 1.000 TN93 0.964 1.000 1.000 1.000 1.000 1.000 TN93+G 0.943 1.000 1.000 1.000 1.000 1.000 GTR 0.952 1.000 1.000 1.000 1.000 1.000 GTR+G 0.934 0.999 1.000 1.000 1.000 1.000 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.091 0.508 0.999 1.000 1.000 1.000 JC+G 0.049 0.374 0.998 1.000 1.000 1.000 K2 0.089 0.544 0.999 1.000 1.000 1.000 K2+G 0.040 0.332 0.996 1.000 1.000 1.000 F81 0.049 0.327 0.995 1.000 1.000 1.000 F81+G 0.029 0.230 0.990 1.000 1.000 1.000 F84 0.034 0.344 0.996 1.000 1.000 1.000 F84+G 0.016 0.168 0.981 1.000 1.000 1.000 HKY 0.034 0.353 0.996 1.000 1.000 1.000 HKY+G 0.013 0.170 0.986 1.000 1.000 1.000 T92 0.068 0.466 0.999 1.000 1.000 1.000 T92+G 0.031 0.277 0.994 1.000 1.000 1.000 TN93 0.028 0.291 0.995 1.000 1.000 1.000 TN93+G 0.009 0.135 0.977 1.000 1.000 1.000 GTR 0.012 0.163 0.984 1.000 1.000 1.000 GTR+G 0.004 0.074 0.927 1.000 1.000 1.000 AIC$$^{(p)}$$ JC 0.969 1.000 1.000 1.000 1.000 1.000 JC+G 0.951 0.999 1.000 1.000 1.000 1.000 K2 0.970 1.000 1.000 1.000 1.000 1.000 K2+G 0.951 1.000 1.000 1.000 1.000 1.000 F81 0.961 0.998 1.000 1.000 1.000 1.000 F81+G 0.945 0.998 1.000 1.000 1.000 1.000 F84 0.966 1.000 1.000 1.000 1.000 1.000 F84+G 0.944 0.999 1.000 1.000 1.000 1.000 HKY 0.961 1.000 1.000 1.000 1.000 1.000 HKY+G 0.943 0.999 1.000 1.000 1.000 1.000 T92 0.968 1.000 1.000 1.000 1.000 1.000 T92+G 0.945 1.000 1.000 1.000 1.000 1.000 TN93 0.964 1.000 1.000 1.000 1.000 1.000 TN93+G 0.943 1.000 1.000 1.000 1.000 1.000 GTR 0.952 1.000 1.000 1.000 1.000 1.000 GTR+G 0.934 0.999 1.000 1.000 1.000 1.000 Table 2. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by AIC and by AIC$$^{(p)}$$ when partitions were actually heterogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.091 0.508 0.999 1.000 1.000 1.000 JC+G 0.049 0.374 0.998 1.000 1.000 1.000 K2 0.089 0.544 0.999 1.000 1.000 1.000 K2+G 0.040 0.332 0.996 1.000 1.000 1.000 F81 0.049 0.327 0.995 1.000 1.000 1.000 F81+G 0.029 0.230 0.990 1.000 1.000 1.000 F84 0.034 0.344 0.996 1.000 1.000 1.000 F84+G 0.016 0.168 0.981 1.000 1.000 1.000 HKY 0.034 0.353 0.996 1.000 1.000 1.000 HKY+G 0.013 0.170 0.986 1.000 1.000 1.000 T92 0.068 0.466 0.999 1.000 1.000 1.000 T92+G 0.031 0.277 0.994 1.000 1.000 1.000 TN93 0.028 0.291 0.995 1.000 1.000 1.000 TN93+G 0.009 0.135 0.977 1.000 1.000 1.000 GTR 0.012 0.163 0.984 1.000 1.000 1.000 GTR+G 0.004 0.074 0.927 1.000 1.000 1.000 AIC$$^{(p)}$$ JC 0.969 1.000 1.000 1.000 1.000 1.000 JC+G 0.951 0.999 1.000 1.000 1.000 1.000 K2 0.970 1.000 1.000 1.000 1.000 1.000 K2+G 0.951 1.000 1.000 1.000 1.000 1.000 F81 0.961 0.998 1.000 1.000 1.000 1.000 F81+G 0.945 0.998 1.000 1.000 1.000 1.000 F84 0.966 1.000 1.000 1.000 1.000 1.000 F84+G 0.944 0.999 1.000 1.000 1.000 1.000 HKY 0.961 1.000 1.000 1.000 1.000 1.000 HKY+G 0.943 0.999 1.000 1.000 1.000 1.000 T92 0.968 1.000 1.000 1.000 1.000 1.000 T92+G 0.945 1.000 1.000 1.000 1.000 1.000 TN93 0.964 1.000 1.000 1.000 1.000 1.000 TN93+G 0.943 1.000 1.000 1.000 1.000 1.000 GTR 0.952 1.000 1.000 1.000 1.000 1.000 GTR+G 0.934 0.999 1.000 1.000 1.000 1.000 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G AIC JC 0.091 0.508 0.999 1.000 1.000 1.000 JC+G 0.049 0.374 0.998 1.000 1.000 1.000 K2 0.089 0.544 0.999 1.000 1.000 1.000 K2+G 0.040 0.332 0.996 1.000 1.000 1.000 F81 0.049 0.327 0.995 1.000 1.000 1.000 F81+G 0.029 0.230 0.990 1.000 1.000 1.000 F84 0.034 0.344 0.996 1.000 1.000 1.000 F84+G 0.016 0.168 0.981 1.000 1.000 1.000 HKY 0.034 0.353 0.996 1.000 1.000 1.000 HKY+G 0.013 0.170 0.986 1.000 1.000 1.000 T92 0.068 0.466 0.999 1.000 1.000 1.000 T92+G 0.031 0.277 0.994 1.000 1.000 1.000 TN93 0.028 0.291 0.995 1.000 1.000 1.000 TN93+G 0.009 0.135 0.977 1.000 1.000 1.000 GTR 0.012 0.163 0.984 1.000 1.000 1.000 GTR+G 0.004 0.074 0.927 1.000 1.000 1.000 AIC$$^{(p)}$$ JC 0.969 1.000 1.000 1.000 1.000 1.000 JC+G 0.951 0.999 1.000 1.000 1.000 1.000 K2 0.970 1.000 1.000 1.000 1.000 1.000 K2+G 0.951 1.000 1.000 1.000 1.000 1.000 F81 0.961 0.998 1.000 1.000 1.000 1.000 F81+G 0.945 0.998 1.000 1.000 1.000 1.000 F84 0.966 1.000 1.000 1.000 1.000 1.000 F84+G 0.944 0.999 1.000 1.000 1.000 1.000 HKY 0.961 1.000 1.000 1.000 1.000 1.000 HKY+G 0.943 0.999 1.000 1.000 1.000 1.000 T92 0.968 1.000 1.000 1.000 1.000 1.000 T92+G 0.945 1.000 1.000 1.000 1.000 1.000 TN93 0.964 1.000 1.000 1.000 1.000 1.000 TN93+G 0.943 1.000 1.000 1.000 1.000 1.000 GTR 0.952 1.000 1.000 1.000 1.000 1.000 GTR+G 0.934 0.999 1.000 1.000 1.000 1.000 When the number of sites per partition is in a biologically plausible range, both the BIC and BIC$$^{(p)}$$ are more prone than the AIC to favor models that are less parameterized. The practical effect of the increased penalties on more parameterized models means that both the BIC and the BIC$$^{(p)}$$ are less likely than the AIC to make splitting errors. The results of Table 1 show that the AIC is very unlikely in our simulation settings to make a splitting error by choosing the 4-partition scheme over the 1-partition scheme when the truth is partition homogeneity. Therefore, it is not surprising that we do not observe splitting errors with the heavier BIC and BIC$$^{(p)}$$ penalties for our simulation conditions. Because splitting errors are not observed with the BIC and the BIC$$^{(p)}$$, we do not include tables of them for these criteria. Table 3 shows the performance of the BIC and $$\text{BIC}^{(p)}$$ when the truth is that the heterogeneous 4-partition scheme should be selected. As expected, the $$\text{BIC}^{(p)}$$ detects partition heterogeneity better than BIC. That is, the $$\text{BIC}^{(p)}$$ makes fewer lumping errors than the BIC. However, from the comparison of Tables 2 and 3, we observe that the $$\text{BIC}^{(p)}$$ is less sensitive than the $$\text{AIC}^{(p)}$$ and even less than the $$\text{AIC}$$. This is because the $$\text{BIC}^{(p)}$$ has heavier penalties than the $$\text{AIC}^{(p)}$$ and the $$\text{AIC}$$. Also, the $$\text{BIC}^{(p)}$$ displays a marked sensitivity to the assumed substitution model. For example, the observed proportion of selecting the 4-partition scheme with the JC model is 0.998 when $$n_1 = 1500$$ and it is only 0.093 with the GTR+G model. In contrast, the $$\text{AIC}^{(p)}$$ appears to be more robust than the $$\text{BIC}^{(p)}$$ to model misspecification. Table 3. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by BIC and by BIC$$^{(p)}$$ when partitions were actually heterogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G BIC JC 0.000 0.000 0.000 0.000 0.138 0.898 JC+G 0.000 0.000 0.000 0.000 0.029 0.645 K2 0.000 0.000 0.000 0.000 0.218 0.956 K2+G 0.000 0.000 0.000 0.000 0.005 0.388 F81 0.000 0.000 0.000 0.000 0.001 0.245 F81+G 0.000 0.000 0.000 0.000 0.000 0.036 F84 0.000 0.000 0.000 0.000 0.004 0.422 F84+G 0.000 0.000 0.000 0.000 0.000 0.012 HKY 0.000 0.000 0.000 0.000 0.006 0.451 HKY+G 0.000 0.000 0.000 0.000 0.000 0.017 T92 0.000 0.000 0.000 0.000 0.095 0.830 T92+G 0.000 0.000 0.000 0.000 0.000 0.161 TN93 0.000 0.000 0.000 0.000 0.000 0.225 TN93+G 0.000 0.000 0.000 0.000 0.000 0.006 GTR 0.000 0.000 0.000 0.000 0.000 0.009 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 BIC$$^{(p)}$$ JC 0.000 0.000 0.003 0.641 0.998 1.000 JC+G 0.000 0.000 0.000 0.402 0.995 1.000 K2 0.000 0.000 0.005 0.740 0.999 1.000 K2+G 0.000 0.000 0.000 0.226 0.971 1.000 F81 0.000 0.000 0.000 0.169 0.940 1.000 F81+G 0.000 0.000 0.000 0.034 0.704 0.992 F84 0.000 0.000 0.000 0.257 0.978 1.000 F84+G 0.000 0.000 0.000 0.019 0.611 0.985 HKY 0.000 0.000 0.000 0.266 0.975 1.000 HKY+G 0.000 0.000 0.000 0.026 0.665 0.988 T92 0.000 0.000 0.000 0.584 0.998 1.000 T92+G 0.000 0.000 0.000 0.132 0.903 0.999 TN93 0.000 0.000 0.000 0.162 0.932 0.999 TN93+G 0.000 0.000 0.000 0.009 0.470 0.979 GTR 0.000 0.000 0.000 0.016 0.590 0.985 GTR+G 0.000 0.000 0.000 0.000 0.093 0.769 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G BIC JC 0.000 0.000 0.000 0.000 0.138 0.898 JC+G 0.000 0.000 0.000 0.000 0.029 0.645 K2 0.000 0.000 0.000 0.000 0.218 0.956 K2+G 0.000 0.000 0.000 0.000 0.005 0.388 F81 0.000 0.000 0.000 0.000 0.001 0.245 F81+G 0.000 0.000 0.000 0.000 0.000 0.036 F84 0.000 0.000 0.000 0.000 0.004 0.422 F84+G 0.000 0.000 0.000 0.000 0.000 0.012 HKY 0.000 0.000 0.000 0.000 0.006 0.451 HKY+G 0.000 0.000 0.000 0.000 0.000 0.017 T92 0.000 0.000 0.000 0.000 0.095 0.830 T92+G 0.000 0.000 0.000 0.000 0.000 0.161 TN93 0.000 0.000 0.000 0.000 0.000 0.225 TN93+G 0.000 0.000 0.000 0.000 0.000 0.006 GTR 0.000 0.000 0.000 0.000 0.000 0.009 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 BIC$$^{(p)}$$ JC 0.000 0.000 0.003 0.641 0.998 1.000 JC+G 0.000 0.000 0.000 0.402 0.995 1.000 K2 0.000 0.000 0.005 0.740 0.999 1.000 K2+G 0.000 0.000 0.000 0.226 0.971 1.000 F81 0.000 0.000 0.000 0.169 0.940 1.000 F81+G 0.000 0.000 0.000 0.034 0.704 0.992 F84 0.000 0.000 0.000 0.257 0.978 1.000 F84+G 0.000 0.000 0.000 0.019 0.611 0.985 HKY 0.000 0.000 0.000 0.266 0.975 1.000 HKY+G 0.000 0.000 0.000 0.026 0.665 0.988 T92 0.000 0.000 0.000 0.584 0.998 1.000 T92+G 0.000 0.000 0.000 0.132 0.903 0.999 TN93 0.000 0.000 0.000 0.162 0.932 0.999 TN93+G 0.000 0.000 0.000 0.009 0.470 0.979 GTR 0.000 0.000 0.000 0.016 0.590 0.985 GTR+G 0.000 0.000 0.000 0.000 0.093 0.769 Table 3. Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by BIC and by BIC$$^{(p)}$$ when partitions were actually heterogeneous $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G BIC JC 0.000 0.000 0.000 0.000 0.138 0.898 JC+G 0.000 0.000 0.000 0.000 0.029 0.645 K2 0.000 0.000 0.000 0.000 0.218 0.956 K2+G 0.000 0.000 0.000 0.000 0.005 0.388 F81 0.000 0.000 0.000 0.000 0.001 0.245 F81+G 0.000 0.000 0.000 0.000 0.000 0.036 F84 0.000 0.000 0.000 0.000 0.004 0.422 F84+G 0.000 0.000 0.000 0.000 0.000 0.012 HKY 0.000 0.000 0.000 0.000 0.006 0.451 HKY+G 0.000 0.000 0.000 0.000 0.000 0.017 T92 0.000 0.000 0.000 0.000 0.095 0.830 T92+G 0.000 0.000 0.000 0.000 0.000 0.161 TN93 0.000 0.000 0.000 0.000 0.000 0.225 TN93+G 0.000 0.000 0.000 0.000 0.000 0.006 GTR 0.000 0.000 0.000 0.000 0.000 0.009 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 BIC$$^{(p)}$$ JC 0.000 0.000 0.003 0.641 0.998 1.000 JC+G 0.000 0.000 0.000 0.402 0.995 1.000 K2 0.000 0.000 0.005 0.740 0.999 1.000 K2+G 0.000 0.000 0.000 0.226 0.971 1.000 F81 0.000 0.000 0.000 0.169 0.940 1.000 F81+G 0.000 0.000 0.000 0.034 0.704 0.992 F84 0.000 0.000 0.000 0.257 0.978 1.000 F84+G 0.000 0.000 0.000 0.019 0.611 0.985 HKY 0.000 0.000 0.000 0.266 0.975 1.000 HKY+G 0.000 0.000 0.000 0.026 0.665 0.988 T92 0.000 0.000 0.000 0.584 0.998 1.000 T92+G 0.000 0.000 0.000 0.132 0.903 0.999 TN93 0.000 0.000 0.000 0.162 0.932 0.999 TN93+G 0.000 0.000 0.000 0.009 0.470 0.979 GTR 0.000 0.000 0.000 0.016 0.590 0.985 GTR+G 0.000 0.000 0.000 0.000 0.093 0.769 $$n_1$$ True model IC Applied Model 100 200 500 1000 1500 2000 HKY+G BIC JC 0.000 0.000 0.000 0.000 0.138 0.898 JC+G 0.000 0.000 0.000 0.000 0.029 0.645 K2 0.000 0.000 0.000 0.000 0.218 0.956 K2+G 0.000 0.000 0.000 0.000 0.005 0.388 F81 0.000 0.000 0.000 0.000 0.001 0.245 F81+G 0.000 0.000 0.000 0.000 0.000 0.036 F84 0.000 0.000 0.000 0.000 0.004 0.422 F84+G 0.000 0.000 0.000 0.000 0.000 0.012 HKY 0.000 0.000 0.000 0.000 0.006 0.451 HKY+G 0.000 0.000 0.000 0.000 0.000 0.017 T92 0.000 0.000 0.000 0.000 0.095 0.830 T92+G 0.000 0.000 0.000 0.000 0.000 0.161 TN93 0.000 0.000 0.000 0.000 0.000 0.225 TN93+G 0.000 0.000 0.000 0.000 0.000 0.006 GTR 0.000 0.000 0.000 0.000 0.000 0.009 GTR+G 0.000 0.000 0.000 0.000 0.000 0.000 BIC$$^{(p)}$$ JC 0.000 0.000 0.003 0.641 0.998 1.000 JC+G 0.000 0.000 0.000 0.402 0.995 1.000 K2 0.000 0.000 0.005 0.740 0.999 1.000 K2+G 0.000 0.000 0.000 0.226 0.971 1.000 F81 0.000 0.000 0.000 0.169 0.940 1.000 F81+G 0.000 0.000 0.000 0.034 0.704 0.992 F84 0.000 0.000 0.000 0.257 0.978 1.000 F84+G 0.000 0.000 0.000 0.019 0.611 0.985 HKY 0.000 0.000 0.000 0.266 0.975 1.000 HKY+G 0.000 0.000 0.000 0.026 0.665 0.988 T92 0.000 0.000 0.000 0.584 0.998 1.000 T92+G 0.000 0.000 0.000 0.132 0.903 0.999 TN93 0.000 0.000 0.000 0.162 0.932 0.999 TN93+G 0.000 0.000 0.000 0.009 0.470 0.979 GTR 0.000 0.000 0.000 0.016 0.590 0.985 GTR+G 0.000 0.000 0.000 0.000 0.093 0.769 The choice of model selection criterion can also affect branch length inferences. Table 4 considers the sums of branch lengths (i.e. the tree lengths) that were inferred from the simulated data with the different information criteria. It shows the specific case where the number of sites $$n_1$$ was 100 and where both the true and assumed substitution models were HKY + G. Table 4 shows that lumping errors can have more serious impacts than splitting errors on branch length estimates. Also, it indicates the potential value of the AIC$$^{(p)}$$ when downstream inferences that rely upon branch length estimates (e.g. divergence time and evolutionary rate estimates) are important (see Case 2 of the following section). Table 4. Tree length inferences from different information criteria when Partition 1 ($$n_1$$) had 100 sites and when the true and assumed substitution model were both HKY+G. If the information criterion selected the 1– partition scheme for a simulated data set, then the inferred tree length for the concatenated model was recorded for all 4 partitions. If the 4– partition scheme was selected, then tree lengths separately inferred for each partition was recorded. Standard deviations of tree length inferences are shown in parentheses below the sample means that were inferred from the 1000 simulated data sets. Partition 1 2 3 4 Partition homogeneity simulations Truth 1.3 1.3 1.3 1.3 AIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) AIC$$^{(p)}$$ 1.33 1.31 1.31 1.31 (0.144) (0.0914) (0.0799) (0.0697) BIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) BIC$$^{(p)}$$ 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) Partition heterogeneity simulations Truth 1.30 1.56 1.82 2.08 AIC 1.82 1.82 1.83 1.83 (0.101) (0.0834) (0.0737) (0.0888) AIC$$^{(p)}$$ 1.38 1.60 1.85 2.09 (0.208) (0.151) (0.134) (0.151) BIC 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) BIC$$^{(p)}$$ 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) Partition 1 2 3 4 Partition homogeneity simulations Truth 1.3 1.3 1.3 1.3 AIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) AIC$$^{(p)}$$ 1.33 1.31 1.31 1.31 (0.144) (0.0914) (0.0799) (0.0697) BIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) BIC$$^{(p)}$$ 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) Partition heterogeneity simulations Truth 1.30 1.56 1.82 2.08 AIC 1.82 1.82 1.83 1.83 (0.101) (0.0834) (0.0737) (0.0888) AIC$$^{(p)}$$ 1.38 1.60 1.85 2.09 (0.208) (0.151) (0.134) (0.151) BIC 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) BIC$$^{(p)}$$ 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) Table 4. Tree length inferences from different information criteria when Partition 1 ($$n_1$$) had 100 sites and when the true and assumed substitution model were both HKY+G. If the information criterion selected the 1– partition scheme for a simulated data set, then the inferred tree length for the concatenated model was recorded for all 4 partitions. If the 4– partition scheme was selected, then tree lengths separately inferred for each partition was recorded. Standard deviations of tree length inferences are shown in parentheses below the sample means that were inferred from the 1000 simulated data sets. Partition 1 2 3 4 Partition homogeneity simulations Truth 1.3 1.3 1.3 1.3 AIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) AIC$$^{(p)}$$ 1.33 1.31 1.31 1.31 (0.144) (0.0914) (0.0799) (0.0697) BIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) BIC$$^{(p)}$$ 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) Partition heterogeneity simulations Truth 1.30 1.56 1.82 2.08 AIC 1.82 1.82 1.83 1.83 (0.101) (0.0834) (0.0737) (0.0888) AIC$$^{(p)}$$ 1.38 1.60 1.85 2.09 (0.208) (0.151) (0.134) (0.151) BIC 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) BIC$$^{(p)}$$ 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) Partition 1 2 3 4 Partition homogeneity simulations Truth 1.3 1.3 1.3 1.3 AIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) AIC$$^{(p)}$$ 1.33 1.31 1.31 1.31 (0.144) (0.0914) (0.0799) (0.0697) BIC 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) BIC$$^{(p)}$$ 1.30 1.30 1.30 1.30 (0.0494) (0.0494) (0.0494) (0.0494) Partition heterogeneity simulations Truth 1.30 1.56 1.82 2.08 AIC 1.82 1.82 1.83 1.83 (0.101) (0.0834) (0.0737) (0.0888) AIC$$^{(p)}$$ 1.38 1.60 1.85 2.09 (0.208) (0.151) (0.134) (0.151) BIC 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) BIC$$^{(p)}$$ 1.83 1.83 1.83 1.83 (0.0730) (0.0730) (0.0730) (0.0730) Empirical Data Analysis We analyzed two data sets to examine how partition scheme selection affects evolutionary inferences. Because some details of our ML estimation and partition selection procedures differ from those of the previous studies, our results have minor differences from the previous ones. However, our emphasis here is not on these differences but is instead on how partition scheme selection is affected by the choice of information criterion and how downstream evolutionary analyses are affected by partition selection. Case 1: Li et al. (2008) analyzed 10 protein-coding genes of 56 ray-finned fish taxa. By separating each protein-coding gene into three codon positions, they started with a possible 30-partition scheme. They then performed hierarchical clustering to generate candidate schemes with 29, 28, $$\ldots$$, 2, 1 partitions. Although the number of possible ways to partition 30 items is the Bell Number (Bell 1934) $$B_{30}$$ and this is far more than 30, we considered only the 30 schemes from hierarchical clustering that Li et al. (2008) reported. For each of these 30 schemes with the AIC and with the BIC, our analyses allowed different substitution models for different partitions by invoking the “models = all” options of PartitionFinder (Lanfear et al. 2012). This differs from the Li et al. (2008) analysis that assumed the GTR+G model for all partitions in all schemes. We also considered the AIC$$^{(p)}$$ and the BIC$$^{(p)}$$ for the 30 candidate partitioning schemes. All analyses with these two information criteria assumed the Neighbor-joining tree topology reported by the MEGA software (Tamura et al. 2013) when the 30 possible partitions of the Li et al. (2008) data were concatenated and then analyzed with the TN93 substitution model. Our AIC$$^{(p)}$$ and BIC$$^{(p)}$$ analyses assumed that all partitions in each candidate scheme evolved according to a GTR substitution model with four discrete-gamma categories of rate heterogeneity among sites. Whereas Li et al. (2008) considered only the LBL parameterization, we considered both LBL and UBL parameterizations for the four information criteria with all candidate partitioning schemes. Detailed results from applying the four information criteria to the Li et al. (2008) candidate partitions are given in Tables 1 through 4 of Supplementary Material, Online Appendix available on Dryad at http://dx.doi.org/10.5061/dryad.qq586. Most of the results are not surprising. The AIC$$^{(p)}$$ is more prone to splitting than the AIC. Considering only the UBL parameterization, the AIC$$^{(p)}$$ selects the 30-partition scheme while the AIC chooses the 21-partition scheme. Likewise, the BIC$$^{(p)}$$ is more prone to splitting than the BIC. When considering only the UBL parameterization, the BIC$$^{(p)}$$ selects the 3-partition scheme while the BIC prefers the 2-partition scheme. These UBL results also confirm the expectation that the BIC and the BIC$$^{(p)}$$ prefer fewer partitions than the AIC and the AIC$$^{(p)}$$. This is because both the BIC and the BIC$$^{(p)}$$ penalties involve the amount of data as well as the number of estimated parameters. The number of estimated parameters for the LBL parameterization is substantially less than for the UBL parameterization. This causes the LBL parameterization to tend to favor splitting more than the UBL parameterization. Among the LBL results, the BIC selects the 21-partition scheme whereas the other three information criteria all choose the 30-partition scheme. When both UBL and LBL parameterizations are considered for each of the 30 candidate schemes, the BIC selects the LBL with 21 partitions and 288 free parameters while the BIC$$^{(p)}$$ selects the LBL with 30 partitions and 408 free parameters. For the same set of possible models, the AIC chooses the UBL with 21 partitions and 2486 free parameters while the AIC$$^{(p)}$$ favors the UBL with 30 partitions and 3540 free parameters. All of these model choices are consistent with the AIC$$^{(p)}$$ favoring splitting more than the AIC, the BIC$$^{(p)}$$ favoring splitting more than the BIC, and the BIC and the BIC$$^{(p)}$$ both preferring fewer parameters than the AIC and the AIC$$^{(p)}$$. Because the 30 candidate schemes were generated by hierarchical clustering, all 30 have a nesting/nested relationship for the UBL parameterization. That is, the $$k$$– partition scheme is nested within the $$(k+1)$$-partition scheme ($$1\leq k \leq 29$$). The same applies for the LBL parameterization. These nesting relationships mean that the traditional (asymptotic) LRT can be applied. For both LBL and UBL parameterizations, the 30-partition scheme is significantly better than all others according to the likelihood ratio test. In contrast, the 21-partition UBL scheme is selected by the AIC over the 30-partition UBL scheme and over all other candidate schemes. This emphasizes that the AIC score can produce results that are contradictory to the LRT. The AIC$$^{(p)}$$ selects the 30-partition UBL scheme as the best among all candidates and it selects the 30-partition LBL as being the best LBL candidate. The consistency between the AIC$$^{(p)}$$ and the LRT is not surprising given their close relationship. Ripplinger and Sullivan (2008) noted that model selection may affect phylogeny estimation, especially for regions of an evolutionary tree that have low bootstrap support. Partitioning decisions are a type of model selection and our experience with partition selection coincides with Ripplinger and Sullivan’s observation. For example, we considered the 30-partition UBL scheme that was selected as optimal according to the AIC$$^{(p)}$$ and the 21-partition UBL scheme that was preferred by the AIC. While computational considerations motivated our decision to avoid intensively searching topology space when calculating information criteria scores for the 30 candidate partitioning schemes, we used the RAXML software (Stamatakis 2014) to more carefully search among topologies for both the 21-partition UBL scheme and the 30-partition UBL scheme. While the RAXML trees derived from these two schemes are topologically very similar, Figure 2 shows that the differences tend to occur in regions of the topologies with low bootstrap support. Fig. 2. View largeDownload slide Different tree topologies for different partition schemes. ML tree topologies of 56 ray-finned fish group were reconstructed with the 30-partition (left) and 21-partition (right) schemes. These two partition schemes are the best with the AIC$$^{(p)}$$ and the AIC, respectively. The bootstrap support levels shown near each node are based upon 500 bootstrap replicates. The trees were inferred with the GTR substitution model and four categories of discrete-gamma rate heterogeneity among sites. Fig. 2. View largeDownload slide Different tree topologies for different partition schemes. ML tree topologies of 56 ray-finned fish group were reconstructed with the 30-partition (left) and 21-partition (right) schemes. These two partition schemes are the best with the AIC$$^{(p)}$$ and the AIC, respectively. The bootstrap support levels shown near each node are based upon 500 bootstrap replicates. The trees were inferred with the GTR substitution model and four categories of discrete-gamma rate heterogeneity among sites. Case 2: We analyzed 4 protein-coding mitochondrial genes and seven (six protein-coding and one noncoding) nuclear genes of 49 notothenioid fish group taxa. Colombo et al. (2015) focussed on the (possibly adaptive) radiation of the Antarctic clade, but our focus here is on partition selection. As with Case 1, we used the “models = all” option and the tree topology provided by PartitionFinder for the AIC and the BIC calculations. For the AIC$$^{(p)}$$ and the BIC$$^{(p)}$$ calculations, we used the Neighbor-joining tree topology estimated with the TN93 substitution model (Tamura and Nei 1993) and the MEGA software (Tamura et al. 2013). Fixing this topology, we obtained the AIC$$^{(p)}$$ and the BIC$$^{(p)}$$ scores for a GTR model with 4 discrete-gamma categories of rate heterogeneity. We first ran PartitionFinder (Lanfear et al. 2012) with the BIC and the UBL parameterization. The partition space was searched with the greedy option. Starting with 31 possible partitions (one for the noncoding gene and 1 per codon position for each of the 10 protein-coding genes), partitions were hierarchically merged until the best partition scheme (a 4-partition scheme) was found according to the BIC. Because the BIC tends to favor concatenation relative to the other information criteria, we decided to focus on the 28 candidate partition schemes that led PartitionFinder from the starting 31-partition scheme to the 4-partition scheme. The number of partitions in these 28 candidate schemes therefore range from 4 to 31. Assuming the UBL parameterization and evaluating these 28 candidate schemes, the 9-, 4-, 20-, and 6-partition schemes are selected as the best according to the AIC, the BIC, the AIC$$^{(p)}$$, and the BIC$$^{(p)}$$ respectively. Assuming the LBL parameterization, the 31-, 8-, 31-, and 18-partition schemes are selected as the best for AIC, BIC, AIC$$^{(p)}$$, and BIC$$^{(p)}$$, respectively. When considering either the UBL or LBL parameterizations, the AIC selects the 933-parameter 9-partition UBL scheme and the AIC$$^{(p)}$$ prefers the 2080-parameter 20-partition UBL scheme while the BIC prefers the 146-parameter 8-partition LBL scheme and the BIC$$^{(p)}$$ prefers the 274-parameter 18-partition LBL scheme. In summary, the analyses again show that BIC$$^{(p)}$$ is more apt to split partitions than the BIC and the AIC$$^{(p)}$$ is more apt to split than the AIC. The results also again show that the BIC and BIC$$^{(p)}$$ prefer models with fewer free parameters than the AIC and AIC$$^{(p)}$$. With this data set, we again performed maximum likelihood topology estimation according to a variety of partition schemes and model parameterizations that were favored according to one of the four information criteria. The results were again consistent with the observation of Ripplinger and Sullivan (2008). While we found minor variations in inferred maximum likelihood topology, the variations were associated with regions of the tree that have low or moderate bootstrap support (data not shown). Using the 20-partition UBL scheme preferred by the AIC$$^{(p)}$$ and also the 9-partition UBL scheme preferred by the AIC, we estimated evolutionary rates and divergence times with the MCMCtree software (Yang, 2007) by using both the sequence data and calibration points of Colombo et al. (2015). We concentrate on the third codon positions of the CO1 and the ND4 genes that are heterogeneous in the 20-partition UBL scheme but that are homogeneous (i.e. in the same partition) in the 9-partition UBL scheme. Figure 3 shows divergence time estimates and the inferred trajectory of evolutionary rates of CO1 and ND4 third positions at nodes along the path connecting the most recent common ancestor of the Antarctic clade to the tip in this clade that represents Chionodraco rastrospinosus. While the divergence time estimates are very similar for the 20-partition and 9-partition schemes, the rate trajectories for the third positions of CO1 and ND4 are quite different. The rate trajectory of the merged data shows a kind of average of CO1 and ND4 and is located between the two trajectories of CO1 and ND4, but it loses information for the evolutionary properties of the individual genes. While this is only one example rather than evidence for a general trend, we expect that the choice of partition scheme is less likely to affect the estimation of tree topology than divergence times and we expect that partition scheme selection is more likely to affect the estimation of evolutionary rates of individual genes than it is to affect divergence times that are assumed to be shared among the genes. Fig. 3. View largeDownload slide Different evolutionary rates for different partition schemes. Evolutionary rate trajectories from the most recent common ancestor of the Antarctic clade to Chionodraco rastrospinosus. Evolutionary rates of 3rd sites of CO1 (circle), ND4 (rectangle) and merged CO1 and ND4 (cross) are plotted on the y-axis with estimated divergence times (millions of years ago) on the x-axis. The third sites of CO1 and ND4 are heterogeneous in the 20-partition scheme that is the best according to the AIC$$^{(p)}$$. However, they are homogeneous in the 9-partition scheme that is the best according to the AIC. Fig. 3. View largeDownload slide Different evolutionary rates for different partition schemes. Evolutionary rate trajectories from the most recent common ancestor of the Antarctic clade to Chionodraco rastrospinosus. Evolutionary rates of 3rd sites of CO1 (circle), ND4 (rectangle) and merged CO1 and ND4 (cross) are plotted on the y-axis with estimated divergence times (millions of years ago) on the x-axis. The third sites of CO1 and ND4 are heterogeneous in the 20-partition scheme that is the best according to the AIC$$^{(p)}$$. However, they are homogeneous in the 9-partition scheme that is the best according to the AIC. Concluding Remarks The asymmetric consequences of splitting and lumping errors are our motivation for suggesting the $$\text{AIC}^{(p)}$$ and the $$\text{BIC}^{(p)}$$ to compare partition schemes. We view the possible bias resulting from lumping errors as more serious than the increased variance generated by splitting errors. Other partitioning techniques to account for the asymmetric consequences are also possible. For example, partitioning guided by Bayesian decision theory (e.g. see Berger 1985) could be attractive, but it would be difficult to convert the qualitative asymmetry of consequences from lumping and splitting errors into a quantitative loss function that would adequately summarize the relative severities of these two types of errors. Of the four information criteria that we consider, the $$\text{AIC}^{(p)}$$ is the least likely to make lumping errors and we conclude that the $$\text{AIC}^{(p)}$$ is usually a better choice than the other information criteria. However, one of the notable features of the $$\text{BIC}^{(p)}$$ is that it is consistent. Consistency in model selection implies that the probability to select the true model approaches 1 as sample size increases (Dziak et al. 2012). Likewise, consistency in partition selection guarantees selection of the true partition scheme when sample size (i.e. the number of sequence sites) is large. When $$n$$ goes to infinity, the $$\text{BIC}^{(p)}$$ penalty divided by $$n$$ goes to zero whereas the penalty itself goes to infinity. This is often the situation when information criteria are consistent with respect to model selection (Bozdogan 1987; Dziak et al. 2012). The $$\text{BIC}$$ is also consistent but, because the $$\text{BIC}^{(p)}$$ is less prone to lumping errors than the BIC, the $$\text{BIC}^{(p)}$$ might be the best alternative among the four information criteria when statistical consistency is especially valued. A conventional approach is to first make a quick and crude approximation of the phylogenetic tree topology and to then search for the optimal partition scheme by fixing the topology at this approximation (Lanfear et al. 2012). After settling upon and fixing the partition scheme, a thorough search of topologies can be carried out. This conventional approach is attractive because a joint search of all combinations of partition scheme and topology can be computationally prohibitive. This conventional approach simplifies computation and seems to us to also be sensible when employing the AIC$$^{(p)}$$ and the BIC$$^{(p)}$$, especially for doing phylogenetics with genome-scale data. If two partitions are heterogeneous according to one combination of tree topology and nucleotide substitution model, they are likely to be heterogeneous according to another combination. To be sure, the choice of a combination could affect the power to detect heterogeneity but this effect is often small. This is because even if the assumed tree topology and substitution model are incorrect for some or all partitions, the resulting bias would also be homogeneous (or heterogeneous) if evolutionary properties are really homogeneous (or heterogeneous) among partitions. However, we have not fully characterized this conventional approach here and doing so might be a good direction for future research. The application of information criteria to partitioning sequence data has mainly received attention with regard to impact on phylogeny inference, but diverse other kinds of evolutionary inferences (e.g. divergence time estimation, examination of how rates of molecular evolution have changed over time, and detection of diversifying positive selection) are also potentially influenced by partitioning. The potentially large effects of partitioning on inferred trajectories of evolutionary rates are illustrated by our findings with CO1 and ND4 third positions from the Colombo et al. (2015) notothenioid data. The ability to detect and quantify shifts of evolutionary rates over time is especially pertinent to the study of adaptation. By studying a phylogeny of diverse terrestrial and marine mammals and then identifying genes having evolutionary rates on the tree that correlate with marine/terrestrial status, Chikina et al. (2016) identified a biologically plausible group of candidate genes that might be associated with adaptation to marine environments. This promising strategy for illuminating the genetic underpinnings of evolutionary adaptation can potentially be applied to diverse other sorts of adaptation, including adaptation to the extreme Antarctic environment that may have been associated with the notothenioid fish radiation studied by Colombo et al. (2015). However, ability to use rate change to identify genes associated with adaptation to extreme environments or other sorts of adaptation will be influenced by partitioning decisions. As genomic data of non-model organisms becomes increasingly available, the ability to characterize trajectories of evolutionary rates should improve and so should the ability to identify interesting changes in evolutionary rates. Success of these studies is likely to be influenced by the availability of sound methods for partitioning. Supplementary Material Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.qq586. Funding This work was supported by the Korea Polar Research Institute (PE17090) and by NIH grant GM118508. Acknowledgements We thank Mark Holder, Stephen Smith, Xiang Ji, and two anonymous reviewers. Appendix 1. Basic assumptions Let us define operators $$\nabla$$, $$\nabla \nabla$$, and $$\nabla \nabla^T$$ as follows \begin{eqnarray*} \nabla l^{(i)} (\theta) &:=& \sum_{j=1}^{n_i} \frac{\partial}{\partial \theta} \log f(x_{ij}|\theta) \\ \nabla \nabla l^{(i)} (\theta) &:=& \sum_{j=1}^{n_i} \frac{\partial^2}{\partial \theta \partial \theta^T} \log f(x_{ij}|\theta) \\ \nabla \nabla^T l^{(i)} (\theta) &:=& \sum_{j=1}^{n_i} \frac{\partial}{\partial \theta} \log f(x_{ij}|\theta) \frac{\partial}{\partial \theta^T} \log f(x_{ij}|\theta). \end{eqnarray*} $$\nabla l^{(\Sigma i)} (\theta)$$, $$\nabla \nabla l^{(\Sigma i)} (\theta)$$ and $$\nabla \nabla^T l^{(\Sigma i)} (\theta)$$ are defined in a similar way. The adopted model may not be correct. When the MLE with an incorrect model converges to a certain value that we will denote $$\theta_*$$, its asymptotic distribution is normal under suitable regularity conditions (White 1982). That is, \begin{eqnarray} \left( \widehat{\theta}^{(i)} - \theta_*^{(i)} \right) &\underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\,& N\left( {\bf 0}, \left[\nabla \nabla l^{(i)} (\theta_*^{(i)}) \right]^{-1} \left[\nabla \nabla^T l^{(i)} (\theta_*^{(i)}) \right]\right.\nonumber\\ &&\left. \left[\nabla \nabla l^{(i)} (\theta_*^{(i)}) \right]^{-1} \right)\label{sandwitch1}\\ \end{eqnarray} (A.1) \begin{eqnarray} &\simeq& N\left( {\bf 0}, \left[\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) \right]^{-1} \left[\nabla \nabla^T l^{(i)} (\widehat{\theta}^{(i)}) \right]\right.\nonumber\\ &&\left.\left[\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) \right]^{-1} \right),\label{sandwitch2} \end{eqnarray} (A.2) and the asymptotic distribution of $$( \widehat{\theta}^{(\Sigma i)} - \theta_*^{(\Sigma i)} )$$ is expressed in a similar way. When the assumed model is correct, $$\nabla \nabla l = - \nabla \nabla^T l$$ in Equation (A.1) and the covariance matrix of Equation (A.1) is reduced to the inverse of $$-\nabla \nabla l$$. In our derivations, we assume that the adopted model may not be correct but it is close to the truth so that $$\nabla \nabla l \simeq - \nabla \nabla^T l$$ and the covariance matrix of Equation (A.1) is approximately the inverse of $$-\nabla \nabla l$$. Also, we assume partition homogeneity when deriving AIC$$^{(p)}$$’s penalty. 2. Partitioning homogeneous sequence data is not harmful when the number of sequence sites per partition is large We define \begin{eqnarray*} H^{(i)} &:=& n_i E_g \left[\nabla \nabla \log f(x_{ij}|\theta_*^{(i)}) \right] \nonumber \\ H^{(\Sigma i)} &=& \sum_{i=1}^K H^{(i)} \\ W_i &:=& \left[ H^{(\Sigma i)} \right]^{-1} H^{(i)}. \end{eqnarray*} If we assume partition homogeneity, then $$\theta_*^{(1)} = \cdots = \theta_*^{(K)} = \theta_*^{(\Sigma i)}$$ and \begin{eqnarray*} H^{(\Sigma i)} &=& n E_g \left[\nabla \nabla \log f(x_{ij}|\theta_*^{(\Sigma i)}) \right] \\ W_i &=& p_i I, \end{eqnarray*} where $$p_i = n_i/n$$ and $$I$$ is identity matrix. We consider the following natural estimators \begin{eqnarray*} \widehat{H}^{(i)} &=& \nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) \\ \widehat{H}^{(\Sigma i)} &=& \sum_{i=1}^K \widehat{H}^{(i)} \\ \widehat{W}_i &=& \left[ \widehat{H}^{(\Sigma i)} \right]^{-1} \widehat{H}^{(i)}, \end{eqnarray*} and note that $$\widehat{W}_i$$ is a consistent estimator of $$W_i$$. Now, consider the following first derivative and its asymptotic expansion. where $$h(x) = O_p \left( g(n) \right)$$ implies that $$h(x)/g(n)$$ is bounded in probability (Bishop et al. 2007). This leads to \begin{eqnarray} \widehat{\theta}^{(\Sigma i)} &=& \left[ \sum_{i=1}^{K} \nabla \nabla l^{(i)}(\widehat{\theta}^{(i)}) \right]^{-1} \sum_{i=1}^{K} \left\{ \nabla \nabla l^{(i)}(\widehat{\theta}^{(i)}) \widehat{\theta}^{(i)} + O_p \left(1 \right) \right\} \nonumber \\ &=& \left[ \widehat{H}^{(\Sigma i)} \right]^{-1} \sum_{i=1}^{K} \left\{ \widehat{H}^{(i)} \widehat{\theta}^{(i)} + O_p \left(1 \right) \right\} \nonumber \\ &=& \sum_{i=1}^{K} \left\{ \left[ \widehat{H}^{(\Sigma i)} \right]^{-1} \widehat{H}^{(i)} \widehat{\theta}^{(i)} + O_p \left( \frac{1}{n } \right) \right\} \nonumber \\ &=& \sum_{i=1}^{K} \widehat{W}_i \widehat{\theta}^{(i)} + O_p \left( \frac{1}{n } \right) \nonumber \\ &\simeq& \sum_{i=1}^{K} p_i \widehat{\theta}^{(i)} + O_p \left( \frac{1}{n } \right), \label{two_mle_similar} \end{eqnarray} (A.3) where we use the fact $$\widehat{W}_i \simeq p_i I$$ for large $$n_i$$’s in the approximation of the last line. We denote $$\sum_{i=1}^{K} p_i \widehat{\theta}^{(i)}$$ as $$\widehat{\theta}^{(\Sigma i|*)}$$. Then, Equation (A.3) implies $$\widehat{\theta}^{(\Sigma i)} \approx \widehat{\theta}^{(\Sigma i|*)}$$ for large $$n$$. On the other hand, \begin{eqnarray*} {\bf 0} &=& \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)}) \\ &=& \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) + \nabla \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) (\widehat{\theta}^{(\Sigma i)} - \widehat{\theta}^{(\Sigma i|*)}) + \cdots \\ &=& \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) + O_p(n)\cdot O_p\left(\frac{1}{n } \right) + \cdots \\ &=& \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) + O_p \left(1 \right). \end{eqnarray*} Therefore, $$\nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)}) \not\approx \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)})$$ even for large $$n$$. However, \begin{eqnarray*} l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)}) &=& l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) +(\widehat{\theta}^{(\Sigma i)} - \widehat{\theta}^{(\Sigma i|*)})^T \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) \\ && + \frac{1}{2} (\widehat{\theta}^{(\Sigma i)} - \widehat{\theta}^{(\Sigma i|*)})^T \cdot \nabla \nabla l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)})\\ &&\cdot (\widehat{\theta}^{(\Sigma i)} - \widehat{\theta}^{(\Sigma i|*)}) + \cdots \\ &=& l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)}) + O_p\left(\frac{1}{n} \right). \end{eqnarray*} Therefore, $$l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)}) \approx l^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i|*)})$$ for large $$n$$. Now consider the variances of $$\widehat{\theta}^{(\Sigma i)}$$ and $$\widehat{\theta}^{(\Sigma i|*)}$$. From Equation (A.3), \begin{eqnarray*} &&\left\{\widehat{\theta}^{(\Sigma i)}-\theta_*^{(\Sigma i)} \right\} \left\{\widehat{\theta}^{(\Sigma i)}-\theta_*^{(\Sigma i)} \right\}^T\\ &&\quad\!\!\!= \left\{\widehat{\theta}^{(\Sigma i|*)} - \theta_*^{(\Sigma i)} + O_p\left(\frac{1}{n } \right)\right\} \left\{\widehat{\theta}^{(\Sigma i|*)} - \theta_*^{(\Sigma i)} + O_p\left(\frac{1}{n} \right)\right\}^T \\ &&\quad\!\!\!= \left\{\widehat{\theta}^{(\Sigma i|*)} - \theta_*^{(\Sigma i)} \right\} \left\{\widehat{\theta}^{(\Sigma i|*)} - \theta_*^{(\Sigma i)} \right\}^T + O_p\left(\frac{1}{n \sqrt{n} }\right) \end{eqnarray*} Taking the expectation of both sides, we get \begin{eqnarray*} Var[\widehat{\theta}^{(\Sigma i)}] = Var [\widehat{\theta}^{(\Sigma i|*)}] + O\left(\frac{1}{n \sqrt{n} }\right), \end{eqnarray*} which implies that the variances of $$\widehat{\theta}^{(\Sigma i)}$$ and $$\widehat{\theta}^{(\Sigma i|*)}$$ are similar to each other when there are large amounts of data. The similarity of these variances suggests that partitioning homogeneous data is not harmful when the number of sequence sites per partition is large. 3. Proof of Equation (15) For the simplicity and convenience of mathematical notation, we omit the ‘$$c$$’ subscript below so that $$\widehat{\theta}^{(\cdot)}$$ and $$\rho$$ respectively represent $$\widehat{\theta}_c^{\,(\cdot)}$$ and $$\rho_c$$. To further simplify Equation (A.2), we define \begin{eqnarray*} U_{(i)} &:=& \left[\nabla \nabla^T l^{(i)} (\widehat{\theta}^{(i)}) \right] \\ V_{(i)} &:=& - \left[\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) \right]^{-1} \end{eqnarray*} If we rewrite Equation (A.2), the MLE from the $$i$$th partition asymptotically follows a normal distribution, \begin{eqnarray*} \widehat{\theta}^{(i)} - \theta_*^{(i)} \underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, N\left( {\bf 0}, V_{(i)} U_{(i)} V_{(i)} \right). \end{eqnarray*} Now, let us define the vector of all partition’s MLEs as follows, \begin{eqnarray*} \widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*:= ((\widehat{\theta}^{(1)} - \theta_*^{(1)})^T, \cdots, (\widehat{\theta}^{(k)} - \theta_*^{(K)})^T )^T, \end{eqnarray*} where $$\widehat{\boldsymbol{\theta}}$$ and $$\boldsymbol{\theta}_*$$ are $$K\rho \times 1$$ dimensional column vectors. The vector $$\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*$$ asymptotically follows a multivariate normal distribution, \begin{eqnarray} \widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_* \underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, N( {\bf 0}, \widehat{{\bf \Sigma}} ), \label{tot-theta-normal-approx} \end{eqnarray} (A.4) where $$\widehat{{\bf \Sigma}}$$ is a diagonal block matrix due to the independence of partitions, \begin{eqnarray*} \widehat{{\bf \Sigma}} = \left( \begin{array}{ccc} V_{(i)} U_{(i)} V_{(i)} & \cdots & {\bf 0}\\ \vdots & \ddots & \vdots \\ {\bf 0}& \cdots &V_{(K)} U_{(K)} V_{(K)} \\ \end{array} \right). \end{eqnarray*} Now, we investigate the relationship between $$l^{(i)}(\widehat{\theta}^{(\Sigma i)})$$ and $$l^{(i)}(\widehat{\theta}^{(i)})$$ for the $$i$$th partition. We define the ($$K\rho\times \rho$$)-dimensional matrix $${\bf p}_i$$ as \begin{eqnarray*} {\bf p}_i := (p_1 I_\rho,\cdots, (p_i-1) I_\rho, \cdots, p_K I_\rho )^T, \end{eqnarray*} where $$I_\rho$$ is the $$\rho$$-dimensional identity matrix. Then, \begin{eqnarray*} {\bf p}_i^T (\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*) &=& \widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)} \\ &\approx& \widehat{\theta}^{(\Sigma i)} - \widehat{\theta}^{(i)}, \end{eqnarray*} where the approximation is from Equation (A.3). Applying Equation (A.4), we find \begin{eqnarray*} {\bf p}_i^T (\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*) \underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, N({\bf 0}, {\bf p}_i^T \widehat{ {\bf \Sigma} } {\bf p}_i ). \end{eqnarray*} The $${\bf p}_i^T \widehat{ {\bf \Sigma} } {\bf p}_i$$ can be obtained with covariance matrices of individual partitions, \begin{eqnarray*} {\bf p}_i^T \widehat{ {\bf \Sigma}} {\bf p}_i &=& p_1^2 V_{(1)} U_{(1)} V_{(1)} + \cdots + (p_i-1)^2 V_{(i)} U_{(i)} V_{(i)}\\ &&+ \cdots + p_K^2 V_{(K)} U_{(K)} V_{(K)} \\ &\approx& p_1^2 [ V_{(i)} U_{(i)} V_{(i)} p_i / p_1 ] + \cdots + (p_i-1)^2 [ V_{(i)} U_{(i)} V_{(i)} ]\\ && + \cdots + p_K^2 [ V_{(i)} U_{(i)} V_{(i)} p_i/p_K] \\ &=& (1-p_i) [ V_{(i)} U_{(i)} V_{(i)}] \\ &\approx& (1-p_i) V_{(i)} = -(1-p_i) [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ]^{-1}, \end{eqnarray*} where the first approximation results from partition homogeneity and the second approximation results from the assumption of $$\nabla \nabla l \simeq - \nabla \nabla^T l$$. The quadratic summation of the elements of $$(\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)})$$ follows a $$\chi_\rho^2$$ distribution, \begin{eqnarray} && \left\{ {\bf p}_i^T (\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*) \right\}^T \left\{ {\bf p}_i^T \widehat{ {\bf \Sigma} } {\bf p}_i \right\}^{-1} \left\{ {\bf p}_i^T (\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}_*) \right\} \underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, \chi_{\rho}^2 \nonumber \\ &\Longleftrightarrow&(\widehat{\theta}^{(\Sigma i|*)} {-}\widehat{\theta}^{(i)})^T \cdot \frac{(-1)}{1{-}p_i} [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ] \cdot (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)}) \underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, \chi_{\rho}^2.\nonumber\\ &\Longleftrightarrow & (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)})^T \cdot [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ] \cdot (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)})\nonumber\\ &&\underset{\cdot }{\overset{\cdot }{\mathop{\sim }}}\, - (1-p_i)\chi_{\rho}^2. \label{qaud_sum} \end{eqnarray} (A.5) Then, the log-likelihood function at the $$i$$th partition has the following relationship, (A.6) where the last approximation holds in the sense of expectation. Therefore, \begin{eqnarray*} \sum_{i=1}^K l^{(i)}(\widehat{\theta}^{(\Sigma i)}) &\approx& \sum_{i=1}^K \left\{ l^{(i)}(\widehat{\theta}^{(i)}) - \frac{1}{2} (1- p_i) \chi_{\rho}^2 \right\} \\ &\approx& \sum_{i=1}^K l^{(i)}(\widehat{\theta}^{(i)}) - \frac{1}{2} (K-1) \chi_{\rho}^2 \\ &\stackrel{E}{\approx} & \sum_{i=1}^K l^{(i)}(\widehat{\theta}^{(i)}) - \frac{1}{2} (K-1) \rho \end{eqnarray*} which proves Equation (15). While Equation (15) is a direct outcome of the asymptotic behavior of likelihood ratio tests when the null hypothesis is true, we note that Equations (A.5) and (A.6) are the critical steps in this proof and they are valid so long as the adopted model does not severely deviate from the truth. 4. Proof of Equation (20): derivation of BIC$$^{(p)}$$ To begin, we overview an approximation of the posterior probability density to show the origin of $$\lceil - \rho_c \log(2 \pi) \rfloor$$ in Equation (19) (e.g. see Robert 2007). As in Part 3 of the Appendix, we will omit the ‘$$c$$’ subscript below when doing so does not affect clarity. Define \begin{eqnarray*} J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) := - \frac{1}{n} \nabla \nabla l^{(\Sigma i)} (\widehat{\theta}^{(\Sigma i)}). \end{eqnarray*} Then, the probability of data $$\textbf{x}$$ for a given prior $$\pi(\theta)$$ is \begin{eqnarray} p(\textbf{x}) &=& \int \prod_{i=1}^n f(x_i | \theta) \pi(\theta) d\theta\nonumber \\ &=& \int \exp \left\{ l^{(\Sigma i)} (\theta) \right\} \pi(\theta) d\theta \nonumber\\ &=& \int \exp \left\{ l^{(\Sigma i)} (\widehat{\theta}^{(\Sigma i)}) - \frac{n}{2} (\theta - \widehat{\theta}^{(\Sigma i)} )^T\right.\nonumber\\ &&\left.\cdot J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) \cdot (\theta - \widehat{\theta}^{(\Sigma i)} ) + Re \right\} \nonumber \\ && \times \left\{ \pi(\widehat{\theta}^{(\Sigma i)}) + (\theta - \widehat{\theta}^{(\Sigma i)} ) \nabla \pi(\widehat{\theta}^{(\Sigma i)}) + Re \right\} d\theta \nonumber \\ &\approx& \exp \left\{ l^{(\Sigma i)} (\widehat{\theta}^{(\Sigma i)}) \right\} \pi(\widehat{\theta}^{(\Sigma i)}) \times \nonumber \\ && \int \exp \left\{ - \frac{n}{2} (\theta - \widehat{\theta}^{(\Sigma i)} )^T \cdot J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) \cdot (\theta - \widehat{\theta}^{(\Sigma i)} ) \right\} d\theta.\nonumber\\ \label{BIC-approx} \end{eqnarray} (A.7) We use the following results, \begin{eqnarray} &&\int \exp \left\{ - \frac{n}{2} (\theta - \widehat{\theta}^{(\Sigma i)} )^T \cdot J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) \cdot (\theta - \widehat{\theta}^{(\Sigma i)} ) \right\} d\theta\nonumber \\ &&\quad \approx (2\pi)^{\rho/2} n^{-\rho/2} \left| J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) \right|^{-1/2}, \label{bic_factor_integral} \end{eqnarray} (A.8) which can be derived from the probability density function of a multivariate normal distribution. Thus, \begin{eqnarray} \log p(\textbf{x}) &\approx& l^{(\Sigma i)} (\widehat{\theta}^{(\Sigma i)}) - \frac{\rho}{2} \log n + \frac{\rho}{2} \log(2\pi) + \log \pi(\widehat{\theta}^{(\Sigma i)})\nonumber\\ &&-\frac{1}{2} \log \left| J^{(\Sigma i)}(\widehat{\theta}^{(\Sigma i)} ) \right| \nonumber \\ &\approx& l^{(\Sigma i)} (\widehat{\theta}^{(\Sigma i)}) - \frac{\rho}{2} \log n + \frac{\rho}{2} \log(2\pi). \label{BIC-concat-append} \end{eqnarray} (A.9) Multiplying the right side of Equation (A.9) by $$-2$$, we obtain Equation (19), \begin{eqnarray*} \text{BIC$^{(p)}$}[m_c, \Sigma i] := -2 l^{(\Sigma i)} (\widehat{\theta}_c^{\,(\Sigma i)}) + \rho_c \log n - \rho_c \log(2\pi). \end{eqnarray*} In the conventional definition of BIC, $$\lceil - \rho_c \log(2 \pi) \rfloor$$ is ignored. But, in our study, we take it into consideration for more accurate comparison (see Theory). By using Equations (A.5) and (A.6), we can derive the following relationships. \begin{eqnarray} &&\!\!\!\!\!\exp \left\{ l^{(i)} (\widehat{\theta}^{(\Sigma i)})/(1-p_i) \right\}\quad\nonumber\\ &&=\exp \left\{ l^{(i)} (\widehat{\theta}^{(\Sigma i)})/(1-p_i) \right\} \cdot 1 \nonumber \\ &&= \exp \left\{ l^{(i)} (\widehat{\theta}^{(\Sigma i)}) /(1-p_i) \right\} \cdot \int \pi (\theta) d\theta \nonumber\\ &&= \int \exp \left\{ l^{(i)} (\widehat{\theta}^{(\Sigma i)})/(1-p_i) \right\} \pi (\theta) d\theta \nonumber\\ &&\approx \int \exp \left\{ l^{(i)} (\widehat{\theta}^{(\Sigma i|*)})/(1-p_i) \right\} \pi (\theta) d\theta \nonumber\\ &&\approx \int \exp \left\{ l^{(i)}(\widehat{\theta}^{(i)})/(1-p_i) \right. \nonumber\\ &&\quad\left. + \frac{1}{2} (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)} )^T \left[ \nabla \nabla l^{(i)}(\widehat{\theta}^{(i)})/(1-p_i) \right] (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)} ) \right\} \nonumber\\ &&\quad\times \left\{ \pi(\widehat{\theta}^{(i)}) + Re \right\} d\theta \nonumber\\ &&\approx \exp \left\{ l^{(i)}(\widehat{\theta}^{(i)})/(1-p_i) \right\} \cdot \pi(\widehat{\theta}^{(i)}) \nonumber\\ &&\quad\times (2\pi)^{\rho/2} n_i^{-\rho/2} \left| \frac{(-1)}{n_i (1-p_i)} [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ] \right|^{-1/2}, \label{BICP_prop_approx} \end{eqnarray} (A.10) where we used an approximation similar to Equation (A.8), \begin{eqnarray*} &&\int \exp \left\{ \frac{1}{2} (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)})^T\right.\nonumber\\ &&\quad \left.\cdot [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)})/(1-p_i) ] \cdot (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)}) \right\} d\theta \\ &&= \int \exp \left\{ - \frac{n_i}{2} (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)})^T\right.\\ &&\quad \left.\cdot \frac{(-1)}{n_i (1-p_i)} [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ] \cdot (\widehat{\theta}^{(\Sigma i|*)} - \widehat{\theta}^{(i)}) \right\} d\theta\\ &&\approx (2\pi)^{\rho/2} n_i^{-\rho/2} \left| \frac{(-1)}{n_i (1-p_i)} [\nabla \nabla l^{(i)} (\widehat{\theta}^{(i)}) ] \right|^{-1/2}. \end{eqnarray*} Taking the logarithm of both sides of Equation (A.10) followed by ignoring minor terms results in \begin{eqnarray*} l^{(i)} (\widehat{\theta}^{(\Sigma i)})/(1-p_i) &\approx& l^{(i)} (\widehat{\theta}^{(i)})/(1-p_i) - \frac{1}{2} \rho (\log n_i - \log (2\pi) ). \label{BICP_prop_approx2} \end{eqnarray*} Therefore, by recovering model index $$m_c$$, we obtain the following approximation \begin{eqnarray} l^{(i)} (\widehat{\theta}_c^{\,(\Sigma i)}) &\approx& l^{(i)} (\widehat{\theta}_c^{\,(i)}) - \frac{1}{2} \rho_c (1-p_i) (\log n_i -\log (2\pi)).\qquad\quad \label{partition-laplace-trans} \end{eqnarray} (A.11) From the definition of Equation (19) and the approximation of Equation (A.11), we can consider the following approximation and definition, \begin{align} &\text{BIC$^{(p)}$}[m_c, \Sigma i]\nonumber\\ &\quad := -2 l^{(\Sigma i)}_c(\widehat{\theta}^{(\Sigma i)}) + \rho_c (\log n - \log(2 \pi))\nonumber\\ &\quad = \sum_{i=1}^K \left\{ -2 l^{(i)}(\widehat{\theta}_c^{\,(\Sigma i)}) \right\} + \rho_c (\log n - \log(2 \pi))\nonumber\\ &\quad\approx \sum_{i=1}^K \left\{ -2 l^{(i)}(\widehat{\theta}_c^{\,(i)}) +(1{-}p_i) \rho_c (\log n_i {-} \log(2 \pi) ) \right\}\nonumber\\ &\qquad + \rho_c (\log n - \log(2 \pi))\nonumber \\ &\quad=: \text{BIC}^{(p)}[\{(m_c, i)\}], \nonumber \end{align} where ‘$$=:$$’ implies that the right side of the equation is defined as the left side of the equation. References Adachi J., Waddell P.J., Martin W., Hasegawa M. 2000 . Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast. J. Mol. Evol. 50 , 348 – 358 . Google Scholar CrossRef Search ADS PubMed Akaike H. 1974 . A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19 , 716 – 723 . Google Scholar CrossRef Search ADS Anderson C.N., Liu L., Pearl D., Edwards S.V. 2012 . Tangled trees: the challenge of inferring species trees from coalescent and noncoalescent genes. Methods Mol Biol. 856 , 3 – 28 . Google Scholar CrossRef Search ADS PubMed Bell E.T. 1934 . Exponential numbers. Amer. Math. Monthly 41 , 411 – 419 . Google Scholar CrossRef Search ADS Berger J.O. 1985 . Statistical decision theory and Bayesian analysis. New York : Springer-Verlag . Google Scholar CrossRef Search ADS Bishop Y.M., Fienberg S.E., Holland P.W. 2007 . Discrete multivariate analysis. New York : Springer-Verlag . p. 475 – 484 . Bozdogan H. 1987 . Model selection and Akaike’s Information Criterion (AIC): the general theory and its analytical extensions. Psychometrika 52 , 345 – 370 . Google Scholar CrossRef Search ADS Burnham K.P., Anderson D.R. 2002 . Model selection and multimodel inference. 2nd ed . New York : Springer-Verlag . p. 64 – 66 , 284 – 285 . Cao Y., Sorenson M.D., Kumazawa Y., Mindell D.P., Hasegawa M. 2000a . Phylogenetic position of turtles among amniotes: evidence from mitochondrial and nuclear genes. Gene 259 , 139 – 148 . Google Scholar CrossRef Search ADS Cao Y., Fujiwara M., Nikaido N., Okada N., Hasegawa M. 2000b . Interordinal relationships and timescale of eutherian evolution as inferred from mitochondrial genome data. Gene 259 : 149 – 158 . Google Scholar CrossRef Search ADS Chang J.T. 1996 . Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Math. Biosci. 134 : 189 – 215 Google Scholar CrossRef Search ADS PubMed Chikina M., Robinson J.D., Clark N.L. 2016 . Hundreds of genes experienced convergent shifts in selective pressure in marine mammals. Mol. Biol. Evol. 33 (9) : 2182 – 2192 . Google Scholar CrossRef Search ADS PubMed Colombo M., Damerau M., Hanel R., Salzburger W., Matschiner M. 2015 . Diversity and disparity through time in the adaptive radiation of Antarctic notothenioid fishes. J. Evol. Biol. 28 (2) : 376 – 394 . Google Scholar CrossRef Search ADS PubMed Draper D. 1995 . Assessment and propagation of model uncertainty. J. R. Statist. Soc. B 57 , 45 – 97 . Dziak J.J., Coffman D.L., Lanza S.T., Li R. 2012 . Sensitivity and specificity of information criteria. Technical Report Series #12-119. The Pennsylvania State University. State College, PA . Felsenstein J. 1981 . Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17 , 368 – 376 . Google Scholar CrossRef Search ADS PubMed Felsenstein J. 1989 . PHYLIP—phylogeny inference package (version 3.2). Cladistics 5 : 164 – 166 Hasegawa M., Kishino H., Yano T. 1985 . Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22 , 160 – 174 . Google Scholar CrossRef Search ADS PubMed Hastie T., Tibshirani R., Friedman J. 2009 . The elements of statistical learning. Chapter 7 . New York : Springer-Verlag . Jukes T.H., Cantor C.R. 1969 . Evolution of protein molecules. In: Munro H.N., editors. Mammalian protein metabolism. New York : Academic Press , p. 21 – 132 . Google Scholar CrossRef Search ADS Kimura M. 1980 . A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16 , 111 – 120 . Google Scholar CrossRef Search ADS PubMed Kolaczkowski B., Thornton J.W. 2004 . Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431 , 980 – 984 . Google Scholar CrossRef Search ADS PubMed Konishi S., Kitagawa G. 2004 . Information criteria (in Japanese). Tokyo : Asakura Publishing Co ., p. 47 – 48 . Lanfear R., Calcott B., Ho S.Y.W., Guindon S. 2012 . PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol. Biol. Evol. 29 , 1695 – 1701 . Google Scholar CrossRef Search ADS PubMed Lanfear R., Calcott B., Kainer D., Mayer C., Stamatakis A. 2014 . Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evol. Biol. 14 , 82 – 95 . Google Scholar CrossRef Search ADS PubMed Leigh J.W., Lapointe F.J., Lopez P., Bapteste E. 2011 . Evaluating phylogenetic congruence in the post-genomic era. Genome Biol Evol. 3 , 571 – 587 . Google Scholar CrossRef Search ADS PubMed Lemmon A.R., Moriarty E.C. 2004 . The importance of proper model assumption in Bayesian phylogenetics. Syst. Biol. 53 : 265 – 277 . Google Scholar CrossRef Search ADS PubMed Li C., Lu G., Orti G. 2008 . Optimal data partitioning and a test for Ray–Finned fishes (Actinopterygii) based on ten nuclear loci. Syst. Biol. 57 , 519 – 539 . Google Scholar CrossRef Search ADS PubMed Lopez P., Casane D., Philippe H. 2002 . Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19 , 1 – 7 . Google Scholar CrossRef Search ADS PubMed Nikaido, M., Cao, Y. Harada M., Okada N., Hasegawa M. 2003 . Mitochondrial phylogeny of hedgehogs and monophyly of Eulipotyphla. Mol. Phylogenet. Evol. 28 : 276 – 284 Google Scholar CrossRef Search ADS PubMed Nishihara H., Okada N., Hasegawa M. 2007 . Rooting the eutherian tree: the power and pitfalls of phylogenomics. Genome Biol. 8 : R199.1 – R199.10 Google Scholar CrossRef Search ADS Pupko T., Huchon D., Cao Y., Okada N., Hasegawa M. 2002 . Combining multiple data sets in a likelihood analysis: which models are the best? Mol. Biol. Evol. 19 , 2294 – 2307 . Google Scholar CrossRef Search ADS PubMed Rambaut A., Grassly N.C. 1997 . Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13 : 235 – 238 . Google Scholar PubMed Ripplinger J., Sullivan J. 2008 . Does choice in model selection affect maximum likelihood analysis? Syst. Biol. 57 (1) : 76 – 85 . Google Scholar CrossRef Search ADS PubMed Robert C.P. 2007 . The Bayesian choice. 2/e New York : Springer-Verlag . p. 352 . Schwarz G. 1978 . Estimating the dimension of a model. Ann. Stat. 6 , 461 – 464 . Google Scholar CrossRef Search ADS Seo T.-K. 2008 . Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Mol. Biol. Evol. 25 , 960 – 971 . Google Scholar CrossRef Search ADS PubMed Stamatakis A. 2014 . RAxML Version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30 (9) : 1312 – 1313 . Google Scholar CrossRef Search ADS PubMed Tamura K. 1992 . Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol. Evol. 9 , 678 – 687 . Google Scholar PubMed Tamura K., Nei M. 1993 . Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10 , 512 – 526 . Google Scholar PubMed Tamura K., Stecher G., Peterson D., Filipski A., Kumar S. 2013 . MEGA6: molecular evolutionary genetics analysis version 6.0. Mol. Biol. Evol. 30 (12) : 2725 – 2729 . Google Scholar CrossRef Search ADS Wu C.H., Suchard M.A., Drummond A.J. 2012 Bayesian selection of nucleotide substitution models and their site assignments. Mol. Biol. Evol. 30 (3) : 669 – 699 . Google Scholar PubMed White H., 1982 . Maximum likelihood estimation of misspecified models. Econometrica 50 , 1 – 25 . Google Scholar CrossRef Search ADS Yang Z. 1994a . Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39 , 105 – 111 . Yang Z. 1994b . Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39 , 306 – 314 . Google Scholar CrossRef Search ADS Yang Z. 1996 . Maximum-likelihood models for combined analyses of Multiple sequence data. J. Mol. Evol. 42 , 587 – 596 . Google Scholar CrossRef Search ADS PubMed Yang Z. 2007 . PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24 , 1586 – 1591 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

### Journal

Systematic BiologyOxford University Press

Published: Jan 23, 2018

## You’re reading a free preview. Subscribe to read the entire article.

### DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month ### Explore the DeepDyve Library ### Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly ### Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. ### Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. ### Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve ### Freelancer DeepDyve ### Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations