Abstract To study the evolution of several quantitative traits, the classical phylogenetic comparative framework consists of a multivariate random process running along the branches of a phylogenetic tree. The Ornstein–Uhlenbeck (OU) process is sometimes preferred to the simple Brownian motion (BM) as it models stabilizing selection toward an optimum. The optimum for each trait is likely to be changing over the long periods of time spanned by large modern phylogenies. Our goal is to automatically detect the position of these shifts on a phylogenetic tree, while accounting for correlations between traits, which might exist because of structural or evolutionary constraints. We show that, in the presence of shifts, phylogenetic Principal Component Analysis fails to decorrelate traits efficiently, so that any method aiming at finding shifts needs to deal with correlation simultaneously. We introduce here a simplification of the full multivariate OU model, named scalar OU, which allows for noncausal correlations and is still computationally tractable. We extend the equivalence between the OU and a BM on a rescaled tree to our multivariate framework. We describe an Expectation–Maximization (EM) algorithm that allows for a maximum likelihood estimation of the shift positions, associated with a new model selection criterion, accounting for the identifiability issues for the shift localization on the tree. The method, freely available as an R-package (PhylogeneticEM) is fast, and can deal with missing values. We demonstrate its efficiency and accuracy compared to another state-of-the-art method ($$\ell$$1ou) on a wide range of simulated scenarios and use this new framework to reanalyze recently gathered data sets on New World Monkeys and Anolis lizards. Adaptive evolution, Change-point detection, Model selection, Ornstein–Uhlenbeck, PhylogeneticEM, Phylogeny Motivation A major goal of comparative and evolutionary biology is to decipher the past evolutionary mechanisms that shaped the present day diversity. Taking advantage of the increasing amount of molecular data made available by powerful sequencing techniques, sophisticated mathematical models have made it possible to infer reliable phylogenetic trees for ever growing groups of taxa (see e.g., Meredith et al. 2011; Jetz et al. 2012). Models of phenotypic evolution for such large families need to cope with the heterogeneity of observed traits across the species tree. One source of heterogeneity is the mechanism of “evolution by jumps” as hypothesized by Simpson (1944). It states that there exists an adaptive landscape shaping the evolution of functional traits, and that this landscape might shift, sometimes in a dramatic fashion, in response to environmental changes such as migration, or colonization of a new ecological niche. Such shifts, like the one observed in the brain shape and size of New World Monkeys in association with dietary and locomotion changes (Aristide et al. 2015, 2016), need to be explicitly accounted for in models of phenotypic evolution. To detect such adaptive shifts, we must cope with two constraints: species do not evolve independently (Felsenstein 1985) and adaptive evolution is an intrinsically multivariate phenomenon. The first constraint arises from the shared evolutionary history of species, usually represented as a phylogenetic tree. It means that traits observed on closely related taxa are on average more similar than traits observed on distantly related species. This leads to so called phylogenetic correlation of traits across species. The second constraint results from natural selection or genetic changes acting on many traits at once. Functional traits are indeed often interdependent, either because they are regulated by the same portions of the genetic architecture or because they are functionally constrained (e.g., limb bones lengths in Greater Antillean Anolis lizards, Mahler et al. 2010). We refer to this phenomenon of within-species trait correlation as the between-trait correlation. This work aims to develop a likelihood-based method to detect rapid adaptive events, referred to as shifts, using a time calibrated phylogenetic tree and potentially incomplete observations of a multivariate functional trait at the tips of that tree. The shifts can be used to cluster together species sharing a common adaptive history. State of the Art Phylogenetic comparative methods (PCM) are the de facto tools for studying phenotypic evolution. Most of them can be summarized as stochastic processes on a tree. Specifically, given a rooted phylogeny, the traits evolve according to a stochastic process on each branch of the tree. At each speciation event, one independent copy with the same initial conditions is created for each daughter species. A common stochastic process in this setting is the Brownian motion (BM, Felsenstein 1985). It is well suited to model the random drift of a quantitative, neutral, and polygenic trait (see e.g., Felsenstein 2004, Chapter 24). Unfortunately, the BM process cannot adequately model adaptation to a specific optimum (Hansen and Orzack 2005). The Ornstein–Uhlenbeck (OU) process models attraction to some optimal trait value and is therefore preferred to the BM in the context of adaptive evolution (Hansen 1997; Hansen et al. 2008). We refer interested readers to Butler and King (2004) for an excellent description of the BM and OU models and their use to study adaptive evolution. Note that, as pointed out by Hansen et al. (2008) and Cooper et al. (2016), this model is distinct from the process theoretically derived by Lande (1976) for stabilizing selection toward an optimum on an adaptive landscape at a microevolutionary timescale and is better seen as a heuristic for the macroevolution of the “secondary optima” themselves in a Simpsonian interpretation of evolution (Hansen et al. 2008). Recently, Levy processes have also been used to capture Simpsonian evolution (Landis et al. 2013; Duchen et al. 2017). Extensions to multivariate traits have been proposed for both BM (Felsenstein 1985) and OU processes (Bartoszek et al. 2012). Cybis et al. (2015) considered even more complex models, with a mix of both quantitative and discrete characters modeled with an underlying multivariate BM and a threshold model (Felsenstein 2005, 2012) for drawing discrete characters from the underlying continuous BM. The work on adaptive shifts also enjoyed a growing interest in the last decade. In their seminal work, Butler and King (2004) considered a univariate trait with known shift locations on the tree and estimated shift amplitudes in the trait optimal value using a maximum-likelihood framework. Beaulieu et al. (2012) extended the work by estimating shift amplitudes not only in the optimal value but also in the evolutionary rate. The focus then moved to estimating the number and locations of shifts. Eastman et al. (2011, 2013) detected shifts, respectively, in the evolutionary rate or the trait expectations, for traits evolving as BM, in a Bayesian setting using reversible jump Markov Chain Monte Carlo (rjMCMC). Ingram and Mahler (2013),Uyeda and Harmon (2014), and Bastide et al. (2017) detected shifts in the optimal value of a trait evolving as an OU. Uyeda and Harmon (2014) and Bastide et al. (2017) consider all shift locations for a given number of shifts and use either rjMCMC or penalized likelihood to select the number of shifts. By contrast, Ingram and Mahler (2013) uses a stepwise procedure, based on AIC, to detect shifts sequentially, stopping when adding a shift does not improve the criteria anymore. Extensions from univariate to multivariate shifts are more recent. It should be noted that all methods assume that shifts affect all traits simultaneously. Given known shift locations and a multivariate OU process, Bartoszek et al. (2012) was the first to develop a likelihood-based method (package mvSLOUCH) to estimate both matrices of multivariate evolutionary rates and selection strengths. Clavel et al. (2015) so on followed with mvmorph, a comprehensive package covering a wide range of multivariate processes. Detection of shifts in multivariate traits is more involved and both Ingram and Mahler (2013) and Khabbazian et al. (2016) make the simplifying assumption that all traits are independent, conditional on their shared shifts. Ingram and Mahler (2013) then proceed with the same stepwise procedure as in the univariate case whereas (Bastide et al. 2017) uses a lasso-regression to detect the shifts and a phylogenetic BIC (pBIC) criterion to select the number of shifts. Scope of the Article In this work, we present a new likelihood-based method to detect evolutionary shifts in multivariate OU models. We make the simplifying assumptions that all traits have the same selection strength but, unlike in Khabbazian et al. (2016) and Ingram and Mahler (2013), we allow for between-trait correlation. Our contribution is multifaceted. We show that the scalar assumption that we make (see Model section) and the independence assumption share a similar feature in their structure that make the shift detection problem tractable. Building upon a formal analysis made in the univariate case (Bastide et al. 2017), we show that the problem suffers from identifiability issues as two or more distinct shift configurations may be indistinguishable. We propose a latent variable model combined with an OU to BM reparametrization trick to estimate the unknown number of shifts and their locations. Our method is fast and can handle missing data. It also proved accurate in a large scale simulation study and was able to find back known shift locations in reanalysis of public data sets. Finally, we show that the standard practice of decorrelating traits using phylogenetic principal component analysis (pPCA) before using a method designed for independent traits can be misleading in the presence of shifts. The article is organized as follows: We present the model and inference procedure in Model section, the theoretical bias of pPCA in the presence of shifts in pPCA and Shifts section, the simulation study in Simulations Studies section, the reanalysis of the New World Monkeys and Greater Antillean Anolis lizards data sets in Examples section, and discuss the results and limitations of our method in Discussion section. Model Trait Evolution on a Tree Tree We consider a fixed- and time-calibrated phylogenetic tree linking the present-day species studied. The tree is assumed ultrametric with height $$h$$, but with possible polytomies. We denote by $$n$$ the number of tips and by $$m$$ the number of internal nodes, such that $$N=n+m$$ is the total number of nodes. For a fully bifurcating tree, $$m=n-1$$ and $$N=2n-1$$. Traits We note $${\mathbf{Y}}$$ the matrix of size $$n \times p$$ of measured traits at the tips of the tree. For each tip $$i$$, the row-vector $$\mathbf{Y}^i$$ represents the $$p$$ measured traits at tip $$i$$. Some of the data might be missing, as discussed later (see Statistical Inference section). Brownian motion (BM) The multivariate BM has $$p + p(p+1)/2$$ parameters: $$p$$ for the ancestral mean value vector $$\boldsymbol{\mu}$$, and $$p(p+1)/2$$ for the drift rate (in the genetic sense) matrix $${\mathbf{R}}$$, that is responsible for the between-trait correlations. The variance of a given trait grows linearly in time, and the covariance between two traits $$k$$ and $$l$$ at nodes $$i$$ and $$j$$ is given by $$t_{ij} R_{kl}$$, where $$t_{ij}$$ is the time elapsed between the root and the most recent common ancestor (MRCA) of $$i$$ and $$j$$ (see e.g., Felsenstein 2004, Chapter 24). Like in the univariate case, this results in an ever-increasing variance of the traits and in the absence of any optimal adaptive value. Using the vectorized version of matrix $${\mathbf{Y}}$$ (where $$vec({\mathbf{Y}})$$ is the vector obtained by “stacking” all the columns of $${\mathbf{Y}}$$), we get: $$\mathbb{V}\text{ar}\left[{vec({\mathbf{Y}})}\right] = {\mathbf{R}} \otimes {\mathbf{C}}$$, where $$\otimes$$ is the Kronecker product, and $${\mathbf{C}} = [t_{ij}]_{1\leq i,j \leq n}$$. Ornstein–Uhlenbeck (OU) The Ornstein–Uhlenbeck process has $$p^2$$ extra parameters in the form of a selection strength matrix $${\mathbf{A}}$$. The traits evolve according to the stochastic differential equation $$d\mathbf{X}_t = {\mathbf{A}}(\boldsymbol{\beta} - \mathbf{X}_t)dt + {\mathbf{R}}d{\mathbf{W}}_t$$, where $${\mathbf{W}}_t$$ stands for the standard $$p$$-variate Brownian motion. The first part represents the attraction to a “primary optimum” $$\boldsymbol{\beta}$$, with a dynamic controlled by $${\mathbf{A}}$$. This matrix is not necessarily symmetric in general, but it must have positive eigenvalues for the traits to indeed be attracted to their optima. This assumption also ensures the existence of a stationary distribution of the traits, with (stable) variance $$\boldsymbol{\Gamma}$$ and mean $$\boldsymbol{\beta}$$, which can be interpreted as an adaptive optimum (see Bartoszek et al. 2012; Clavel et al. 2015, for further details and general expression of $$\boldsymbol{\Gamma}$$). Shifts We assume that some environmental changes affected the traits evolution in the past. In the BM model, we take those changes into account by allowing the process to be discontinuous, with shifts occurring in its mean value vector (as e.g., Eastman et al. 2013). This is reasonable if the adaptive response to a change in the environment is fast enough compared to the evolutionary time scale. For the OU, we assume that environmental changes result in a shift in the primary optimum $$\boldsymbol{\beta}$$ (as e.g., Butler and King 2004). The process is hence continuous, and goes to a new optimum, with a dynamic controlled by $${\mathbf{A}}$$. In both cases, we make the standard assumptions that all traits shift at the same time (but with different magnitudes), that each shift occurs at the beginning of its branch, and that all other parameters $$({\mathbf{A}}, {\mathbf{R}})$$ of the process remain unchanged. We further assume that each jump induces a specific optimum, which implies that there is no homoplasy for the optimum, that is, no convergent evolution. Simplifying Assumptions Trait independence assumption The general OU as described above is computationally hard to fit (Clavel et al. 2015), even when the shifts are fixed a priori. For automatic detection to be tractable in practice, several assumptions can be made. The two methods that (to our knowledge) tackle this problem in the multivariate setting assume that all the traits are independent, that is that matrices $${\mathbf{A}}$$ and $${\mathbf{R}}$$ are diagonal (Ingram and Mahler 2013; Khabbazian et al. 2016). This is often justified by assuming that a priori preprocessing with phylogenetic Principal Component Analysis (pPCA, Revell 2009) leads to independent traits. However, pPCA assumes a no-shift BM evolution of the traits, and it can introduce a bias in the downstream analysis conducted on the scores, as shown by Uyeda et al. (2015). The choice of the number of PC axes to keep is also crucial, and can qualitatively change the results obtained, leading to the detection of artificial shifts near the root when not enough PC axes are kept for the analysis, as observed by Khabbazian et al. (2016). Finally, we show theoretically (pPCA and Shifts section) and numerically (Simulations Studies section, last paragraph) that pPCA fails to remove the between-trait correlation in the data in the presence of shifts and may even hamper shift detection accuracy. Scalar OU (scOU) We offer here an alternative to the independence assumption. Computations are greatly simplified when matrices $${\mathbf{A}}$$ and $${\mathbf{R}}$$ commute. This happens when both of these matrices are diagonal for example, or when $${\mathbf{R}}$$ is unconstrained and $${\mathbf{A}}$$ is scalar, that is of the form $${\mathbf{A}} = \alpha {\mathbf{I}}_p$$, where $${\mathbf{I}}_p$$ is the identity matrix. We call a process satisfying the latter assumptions a scalar OU (scOU), as it behaves essentially as a univariate OU. In particular, its stationary variance is simply given by $$\boldsymbol{\Gamma} = {\mathbf{R}} / ({2\alpha})$$ (analogous to the formula $$\gamma^2 = {\sigma^2}/({2\alpha})$$ in the univariate case, see e.g., Hansen, 1997). We define the scOU model as follows: at the root $$\rho$$, the traits are either drawn from the stationary normal distribution with mean $$\boldsymbol{\mu}$$ and variance $$\boldsymbol{\Gamma}$$ ($$\mathbf{X}^{\rho} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Gamma})$$), or fixed and equal to $$\boldsymbol{\mu}$$. The initial optimum vector is $$\boldsymbol{\beta}_0$$ and the conditional distribution of trait $$\mathbf{X}^i$$ at node $$i$$ given trait $$\mathbf{X}^{pa(i)}$$ at its parent node $$pa(i)$$ is \begin{equation} {\left.{\mathbf{X}^i}\mathrel{}|\mathrel{}{\mathbf{X}^{pa(i)}}\right.} \sim \mathcal{N}\left(\!e^{-\alpha \ell_i} \mathbf{X}^{pa(i)} {+} (1{-}e^{-\alpha \ell_i}) \boldsymbol{\beta}_i, \frac1{2\alpha}(1{-}e^{-\alpha \ell_i}) {\mathbf{R}} \!\right)\!, \end{equation} (1) where $$\boldsymbol{\beta}_i = \boldsymbol{\beta}_{pa(i)} + \boldsymbol{\Delta}^{i}$$ is the optimal value of the process on the branch with length $$\ell_i$$ going from $$pa(i)$$ to $$i$$ and $$\boldsymbol{\Delta}$$ is the $$N\times p$$ matrix of shifts on the branches of the tree: for any node $$i$$ and any trait $$l$$, $$\Delta_{il}$$ is $$0$$ if there are no shift on the branch going from $$pa(i)$$ to $$i$$, and the value of the shift on trait $$l$$ otherwise. At the root, we define $$\boldsymbol{\beta}_{\rho} = \boldsymbol{\beta}_0$$ and, for each trait $$l$$: $$\Delta_{\rho l} = e^{-\alpha h} \mu_l + (1 - e^{-\alpha h}) \beta_{0l}$$, where $$h$$ is the age of the root (or tree height). The scOU model can also be expressed under a linear form. Let $${\mathbf{U}}$$ be the $$N\times N$$ matrix where $$U_{ij}$$ is 1 if node $$j$$ is an ancestor of node $$i$$ and 0 otherwise. Let $${\mathbf{T}}$$ be the $$n \times N$$ matrix made of the $$n$$ rows of $${\mathbf{U}}$$ corresponding to tip taxa. For a given $$\alpha$$, we further define the diagonal $$N$$ matrix $${\mathbf{W}}(\alpha)$$ with diagonal term $$W_{ii}(\alpha) = 1 - e^{-\alpha a_{pa(i)}}$$ for any nonroot node $$i$$, where $$a_{pa(i)}$$ is the age of node $$pa(i)$$, and $$W_{\rho\rho}(\alpha) = 1$$ for the root node $$\rho$$. Then the joint distribution of the observed traits $${\mathbf{Y}}$$ is normal \begin{equation} vec({\mathbf{Y}}) \sim \mathcal{N}\left(vec({\mathbf{T}} {\mathbf{W}}(\alpha) {\boldsymbol{\Delta}}), {\mathbf{R}} \otimes {\mathbf{F}}(\alpha) \right)\!, \end{equation} (2) where $${\mathbf{F}}(\alpha)$$ is the symmetric scaled phylogenetic correlation matrix between the $$n$$ tips, with entries $$F_{ij} = \frac1{2\alpha}e^{-\alpha d_{ij}}$$ if the root is drawn from the stationary distribution, and $$F_{ij} = \frac1{2\alpha}e^{-\alpha d_{ij}}(1-e^{-2\alpha t_{ij}})$$ if the root is fixed, where $$d_{ij}$$ is the tree distance between nodes $$i$$ and $$j$$. In the next section, this will allow us to rewrite scOU as a BM on a tree with rescaled branch lengths. This observation is at the core of our statistical inference strategy. The scOU process allows us to handle the within-species correlations that might exist between traits and spares us from doing a preliminary pPCA. This however comes at the cost of assuming that all the traits evolve at the same rate toward their respective optima, with the same selection strength $$\alpha$$. See the Discussion section for further analysis of these assumptions. Identifiability Issues Root state It can be easily checked that the parameters $$\boldsymbol{\mu}$$ and $$\boldsymbol{\beta}_0$$ at the root are not jointly identifiable from observations at the tips of an ultrametric tree, only the combination $$\boldsymbol{\lambda} = e^{-\alpha h} \boldsymbol{\mu} + (1 - e^{-\alpha h}) \boldsymbol{\beta}_0$$ is. See Ho and Ané (2014) for a derivation in the univariate case. Note that $$\boldsymbol{\lambda}$$ corresponds to the first row of the shift matrix $$\boldsymbol{\Delta}$$. As we cannot decide from the data, we assume by default $$\boldsymbol{\beta}_0 = \boldsymbol{\mu} = \boldsymbol{\lambda}$$. Shift position The location of the shifts may not always be uniquely determined, as several sets of locations (and magnitudes) may yield the same joint marginal distribution of the traits at the tips. These identifiability issues have been carefully studied in Bastide et al. (2017) for the univariate case. Because we assume that all traits shift at the same time, the sets of equivalent shift locations are the same in the multivariate case as in the univariate case; only the number of parameters involved is different. So, the problem of counting the total number of parsimonious, nonequivalent shift allocations remains the same, as well as the problem of listing the allocations that are equivalent to a given one. As a consequence, all the combinatorial results and algorithms used in Bastide et al. (2017) are still valid here; only the model selection criterion needs be adapted (see Statistical Inference section). Rescaling of the Tree Equivalency scOU/rBM As recalled above, the inference of OU models raises specific issues, mostly because some maximum likelihood estimates do not have a closed form expression. Many of these issues can be circumvented using the equivalence between the univariate BM and OU models described in Blomberg et al. (2003), Ho and Ané (2013), and Pennell et al. (2015), for ultrametric trees, when $$\alpha$$ is known. Thanks to the scalar assumption, this equivalence extends to the multivariate case. Indeed, the marginal distribution of the traits at the observed tips $${\mathbf{Y}}$$ given in (2) is the same as the one arising from a BM model on a rescaled tree defined by: \begin{align*} \mathbf{X}^{\rho} & \sim \mathcal{N}\left(\boldsymbol{\beta}_0, \ell_{\rho}(\alpha) {\mathbf{R}} \right) \text{ or } \mathbf{X}^{\rho} = \boldsymbol{\beta}_0 \text{ (fixed)} \\ {\left.{\mathbf{X}^i}\mathrel{}|\mathrel{}{\mathbf{X}^{pa(i)}}\right.} & \sim \mathcal{N}\left(\mathbf{X}^{pa(i)} + \boldsymbol{\Delta}^i(\alpha), \ell_i(\alpha) {\mathbf{R}} \right)\!,\\ & \quad \mbox{ for nonroot node } i. \end{align*} where $$\ell_{\rho}(\alpha) = \frac{1}{2\alpha}e^{-2\alpha h}$$, $$\ell_i(\alpha) = \frac{1}{2\alpha}e^{-2\alpha h}\left(e^{2 \alpha t_i} - e^{2 \alpha t_{pa(i)}}\right)$$, and $$\boldsymbol{\Delta}^i(\alpha) = ({\mathbf{W}}(\alpha)\boldsymbol{\Delta})^i = (1 - e^{-\alpha(h-t_{pa(i)})}) \boldsymbol{\Delta}^i$$. Note that, when the root is taken random, everything happens as if we added a fictive branch above the root with length $$\ell_{\rho}(\alpha)$$. The length of this branch increases when $$\alpha$$ goes to zero. We emphasize that only the distribution of the observed traits $${\mathbf{Y}}$$ is preserved and not the distribution of the complete data set $${\mathbf{X}}$$. As a consequence, ancestral traits at internal nodes cannot be directly inferred using this representation. Still, the equivalence recasts inference of $${\mathbf{R}}$$ and $${\mathbf{W}}(\alpha) {\boldsymbol{\Delta}}$$ in the scOU model into inference of the same parameters in a much simpler BM model, albeit on a tree with rescaled branch lengths $$\ell_i(\alpha)$$. Note that the rescaling depends on $$\alpha$$, which needs to be inferred separately. See the discussion (Interpretation Issues section) for further analysis of this rescaling. Statistical Inference Principle As described in details below and in the Supplementary Appendix available on Dryad at http://dx.doi.org/10.5061/dryad.60t0f, the statistical inference relies on two main steps. First, we fix the number $$K$$ of shifts, and find the best solution with exactly $$K$$ shifts. We repeat this for every value of $$K$$ between $$0$$ and some user-defined maximum value $$K_{\max}$$. Second, we use a model selection procedure to compare the solutions found for different $$K$$ values and choose the final $$K$$ and its associated solution. The first step is done through an EM algorithm. It is based upon the observation that shift detection would be easier if we knew all the ancestral values of the traits, that is the values at the internal nodes of the tree. We would then only need to find branches where the trait values differ “substantially” between a child node and its parent node. Unfortunately, and in the absence of fossil data, we do not know these ancestral values. The EM algorithm allows us to virtually go back to this ideal situation. It works by iterating two steps until convergence. During the E step, we assume that we know the parameters of the model of evolution, and we use them to perform ancestral trait reconstruction at the internal nodes of the tree. Then, during the M step, we assume that we know ancestral trait values, and use them to find the parameters of the process, including the shift positions. Upon convergence, the result of the M step is the maximum likelihood solution for a fixed number $$K$$ of shifts. We repeat this first step for every value of $$K$$ from $$0$$ to $$K_{\max}$$. Now consider the true number of shifts, $$K_{\text{true}}$$. Because the best-fitting solution with $$K_{\text{true}}+1$$ shifts has more parameters, it will always have a higher likelihood than the solution with $$K_{\text{true}}$$ shifts. But this increase in likelihood is only due to overfitting and should not be significant. To avoid overfitting and choose an adequate number of shifts, we use a penalized likelihood criterion instead of the raw likelihood. The penalty needs to be chosen carefully, so that the criterion is maximized at $$K_{\text{true}}$$ with high probability. In particular, the penalty should depend on the number of models that the method is allowed to explore when we add a shift. Incomplete data model We now discuss how to infer the set of parameters $${\boldsymbol{\theta}} = (\boldsymbol{\Delta}, {\mathbf{R}})$$. We adopt a maximum likelihood strategy, which consists in maximizing the log-likelihood of the observed tip data $$\log p_{{\boldsymbol{\theta}}} ({\mathbf{Y}})$$ with respect to $${\boldsymbol{\theta}}$$ to get the estimate $$\widehat{{\boldsymbol{\theta}}}$$. The maximum likelihood estimate $$\widehat{{\boldsymbol{\theta}}}$$ is difficult to derive directly as the computation of $$\log p_{{\boldsymbol{\theta}}} ({\mathbf{Y}})$$ requires to integrate over the unobserved values of the traits at the internal nodes. We denote by $${\mathbf{Z}}$$ the unobserved matrix of size $$m \times p$$ of these ancestral traits at internal nodes of the tree: for each internal node $$j$$, $$\mathbf{Z}^j$$ is the row-vector of the $$p$$ ancestral traits at node $$j$$. Following Bastide et al. (2017), we use the EM algorithm (Dempster et al. 1977) that relies on an incomplete data representation of the model and takes advantage of the decomposition of $$\log p_{{\boldsymbol{\theta}}} ({\mathbf{Y}})$$ as $$\mathbb{E}\left[ \log {{p}_{\mathbf{\theta }}}(\mathbf{Y},\mathbf{Z})\,|\,\mathbf{Y} \right]-\mathbb{E}\left[ \log {{p}_{\mathbf{\theta }}}(\mathbf{Z}\,|\,\mathbf{Y})\,|\,\mathbf{Y} \right]$$. EM The M step of the EM algorithm consists in maximizing $$\mathbb{E} \left[\log p_{{\boldsymbol{\theta}}} {\left.{({\mathbf{Y}}, {\mathbf{Z}})}\mathrel{}|\mathrel{}{{\mathbf{Y}}}\right.}\right]$$ with respect to $${\boldsymbol{\theta}}$$. For a given value of $$\alpha$$, thanks to the rescaling described in Model section, the formulas to update $${\boldsymbol{\Delta}}$$ and $${\mathbf{R}}$$ are explicit (see Supplementary Appendix EM Inference available on Dryad). The optimization of $$\alpha$$ is achieved over a grid of values, at each point of which a complete EM algorithm is run. At the M step, we need the mean and variance of the unobserved traits $$\mathbf{Z}^j$$ at each internal node $$j$$ conditional on the observed traits $${\mathbf{Y}}$$ at the tips. The E step is dedicated to the computation of these values, which can be achieved via an upward-downward recursion (Felsenstein 2004). The upward path goes from the leaves to the root, computing the conditional means and variances at each internal node given the values of its offspring in a recursive way. The downward recursion then goes from the root to the leaves, updating the values at each internal node to condition on the full taxon set. Thanks to the joint normality of the tip and internal node data, all update formulas have closed form matrix expressions, even when there are some missing values (see Supplementary Appendix EM Inference available on Dryad). Initialization The EM algorithm is known to be very sensitive to the initialization. Following Bastide et al. (2017), we take advantage of the linear formulation (2) to initialize the shifts position using a lasso penalization (Tibshirani 1996). This initialization method is similar to the procedure used in $$\ell$$1ou (Khabbazian et al. 2016). See Supplementary Appendix EM Inference available on Dryad for more details. Missing data EM was originally designed to handle missing data. As a consequence, the algorithm described above also applies when some traits are unobserved for some taxa. Indeed, the conditional distribution of the missing traits given the observed ones can be derived in the same way as in the E step. However, missing data break down the factorized structure of the data set and some computational tricks are needed to handle the missing data efficiently (see Supplementary Appendix EM Inference available on Dryad). Model selection For each value of the number of shifts $$K$$, the EM algorithm described above provides us with the maximum likelihood estimate $$\widehat{\boldsymbol{\theta}}_K$$. $$K$$ needs to be estimated to complete the inference procedure. We do so using a penalized likelihood approach. The model selection criterion relies on a reformulation of the model in terms of multivariate linear regression, where we remove the phylogenetic correlation, like independent contrasts and PGLS do. We can rewrite (2), for a given $$\alpha$$, as \[ \widetilde{{\mathbf{Y}}} = \widetilde{{\mathbf{T}}} {\boldsymbol{\Delta}} + {\mathbf{E}} \quad \text{where } \widetilde{{\mathbf{Y}}} = {\mathbf{F}}(\alpha)^{-1/2} {\mathbf{Y}}, \quad \widetilde{{\mathbf{T}}} = {\mathbf{F}}(\alpha)^{-1/2} {\mathbf{T}} {\mathbf{W}}(\alpha) , \] where $${\mathbf{E}}$$ is a $$n \times p$$ matrix with independent and identically distributed rows, each row being a (transposed) centered Gaussian vector with variance $${\mathbf{R}}$$. In the univariate case Khabbazian et al. (2016), this representation allowed us to cast the problem in the setting considered by Baraud et al. (2009), and hence to derive a penalty on the log-likelihood, or, equivalently, on the least squares. Taking advantage of the well-known fact that the maximum likelihood estimators of the coefficients are also the least square ones, and do not depend on the variance matrix $${\mathbf{R}}$$ (see, e.g., Mardia et al. 1979, Section 6), we propose to estimate $$K$$ using the penalized least squares: \[ \widehat{K} = \arg\min_K \left( 1 + \frac{{\rm{pen}}(K)}{n-K} \right) \sum_{j=1}^p \|\widetilde{{\mathbf{Y}}}_j - \widehat{\widetilde{{\mathbf{Y}}}}^K_j\|^2 \] where $$\widetilde{{\mathbf{Y}}}_j$$ is the column of $$\widetilde{{\mathbf{Y}}}$$ for the $$j$$-th trait, and $$\widehat{\widetilde{{\mathbf{Y}}}}^K_j$$ the predicted means for trait $$j$$ from the best model with $$K$$ shifts. Using the EM results, this can be written as: \[ \widehat{K} = \arg\min_K \left( 1 + \frac{{\rm{pen}}(K)}{n-K} \right) {\rm{tr}}\left[\widehat{{\mathbf{R}}}(K, \hat{\alpha})\right], \] where $$\widehat{{\mathbf{R}}}(K, \hat{\alpha})$$ is the ML estimate of the variance parameter obtained by the EM for a fixed number $$K$$ of shifts. The penalty is the same as in the univariate case: \[ {\rm{pen}}(K) = A \frac{n-K-1}{n-K-2} {\rm{{\rm{EDkhi}}}}\left[K, n-K-2, (K+1)^2 / |\mathcal{S}_K^{\text{PI}}|\right], \] where EDkhi is the function from Definition 3 from Baraud et al. (2009) and $$|\mathcal{S}_K^{\text{PI}}|$$ is the number of parsimonious identifiable sets of locations for $$K$$ shifts, as defined in Bastide et al. (2017). It hence might depends on the topology of the tree, for a tree with polytomies. For a fully resolved tree, $$|\mathcal{S}_K^{\text{PI}}| = \binom{2n-2-K}{K}$$. $$A$$ is a normalizing constant, that must be greater than $$1$$. In Baraud et al. (2009), the authors showed that it had little influence in the univariate case, and advised for a value around $$A = 1.1$$. We took this value as a default. The criterion is directly inspired from the univariate case and inherits its theoretical properties in the special case $${\mathbf{R}} = \sigma^2 {\mathbf{I}}_p$$. In general however, the criterion should be seen as a heuristic, although with good empirical properties (see Simulations Studies section). Implementation We implemented the method presented above in the PhylogeneticEMR package (R Core Team 2017), available on the Comprehensive R Archive Network (CRAN). A thorough documentation of its functions, along with a brief tutorial, is available from the GitHub repository of the project (pbastide.github.io/PhylogeneticEM). Thanks to a comprehensive suite of unitary tests, that cover approximately 79% of the code (codecov.io/gh/pbastide/PhylogeneticEM), and that are run automa- tically on an independent Ubuntu server using the continuous integration tool Travis CI (travis-ci.org), the package was made as robust as possible. The computationally intensive parts of the analysis, such that the upward-downward algorithm of the M step, have been coded in C++ to improve performance (see Simulations Studies section for a study of the computation times needed to solve problems of typical size). Because the inference on each $$\alpha$$ value on the grid used is independent, they can be easily be done in parallel, and a built in option allows the user to choose the number of cores to be allocated to the computations. pPCA and Shifts Shift detection in multivariate settings is usually done by first decorrelating traits with pPCA before feeding phylogenetic PCs to detection procedures that assume independence between traits. We show hereafter that even in the simple BM setting, phylogenetic PCs may still be correlated in the presence of shifts. The problem is only exacerbated in the OU setting. pPCA is Biased in the Presence of Shifts Assume that the traits evolve as a shifted BM process on the tree, so that $${\rm vec}({\mathbf{Y}}) \sim {\mathcal{N}}({\rm vec}({\mathbf{a}}), {\mathbf{R}} \otimes {\mathbf{C}})$$, with $${\mathbf{a}}$$ being the $$n\times p$$ matrix of trait means at the tips. Decomposing $${\mathbf{R}}$$ as $${\mathbf{R}} = {\mathbf{V}}{\mathbf{D}}^2{\mathbf{V}}^T$$, pPCA relies on the fact that the columns of the matrix $${\mathbf{Y}} {\mathbf{V}}$$ are independent. Therefore, its efficiency relies on an accurate estimation of $${\mathbf{R}}$$. The estimate of $${\mathbf{R}}$$ used in pPCA is $$\hat{{\mathbf{R}}} = ({n-1})^{-1} ({\mathbf{Y}} - {\mathbf{1}}_n\bar{{\mathbf{Y}}}^T)^T {\mathbf{C}}^{-1} ({\mathbf{Y}} - {\mathbf{1}}_n\bar{{\mathbf{Y}}}^T)$$, where $$\bar{{\mathbf{Y}}}^T = ({\mathbf{1}}_n^T {\mathbf{C}}^{-1}{\mathbf{1}}_n)^{-1}{\mathbf{1}}_n^T{\mathbf{C}}^{-1}{\mathbf{Y}}$$, which is known as the estimated phylogenetic mean vector (Revell 2009). Decomposing the estimate of $${\mathbf{R}}$$ as $$\hat{{\mathbf{R}}} = \widehat{{\mathbf{V}}} \widehat{{\mathbf{D}}}^2 \widehat{{\mathbf{V}}}^T$$, pPCA then computes the scores as $${\mathbf{S}} = ({\mathbf{Y}} - {\mathbf{1}}_n\bar{{\mathbf{Y}}}^T)\widehat{{\mathbf{V}}}$$. In the absence of shift, all species have the same mean vector $${\boldsymbol{\mu}}$$ so $${\mathbf{a}} = {\mathbf{1}}_n {\boldsymbol{\mu}}^T$$ and $${\mathbb{E}\left[ {\bar{{\mathbf{Y}}}} \right]} = {\boldsymbol{\mu}}$$. In the presence of shifts, species do not all share the same mean vector so the uniform centering is not valid anymore. As a consequence, the estimate of $${\mathbf{R}}$$ is biased (see Appendix PCA: Mathematical Derivations): \begin{equation} {\mathbb{E}\left[ {\hat{{\mathbf{R}}}} \right]} = {\mathbf{R}} + {\mathbf{B}} \quad \text{where} \quad {\mathbf{B}} = \frac{1}{n-1}{\mathbf{G}}^T{\mathbf{C}}^{-1}{\mathbf{G}}, \; {\mathbf{G}} = {\mathbf{a}} - {\mathbf{1}}_n\bar{{\mathbf{a}}}^T \end{equation} (3) The extra term $${\mathbf{B}}$$ is analogous to the between-group variance in the context of linear discriminant analysis and cancels out in the absence of shifts (note that $${\mathbf{R}}$$ is analogous to the within-group variance, see Mardia et al., 1979). Because $$\hat{{\mathbf{R}}}$$ is biased, the columns of the score matrix $${\mathbf{S}}$$ resulting from pPCA are still correlated. We illustrate this phenomenon below using toy examples. Illustration: A Simple Example To illustrate the impact of shifts on the between-trait decorrelation performed by (p)PCA, we used the simple tree presented in Figure 1a and considered three scenarios. In all scenarios, we simulated two highly correlated traits under a BM starting from $$(0, 0)$$ at the root and with covariance matrix $${\mathbf{R}} = \left( {\matrix{1 & -0.9 \cr -0.9 & 1 \cr }}\right)$$. The tree has two clearly marked clades, designed to highlight the differences between pPCA and PCA. $${\mathbf{R}}$$ is identical in all scenarios; any preprocessing aiming at decorrelating the traits should retrieve the eigenvectors of $${\mathbf{R}}$$ as PCs. In the first scenario, there are no trait shifts on the tree, corresponding to the pPCA assumptions, and pPCA is indeed quite efficient in finding the PCs (see Fig. 1b, left panel). In the second scenario, we added a shift on a long branch. This shift induces a species structure in the trait space that misleads standard PCA. The same structure can however be achieved by a large increment of the BM on that branch and large increments are likely on long branches. pPCA therefore copes with the shift quite well and is able to recover accurate PCs. More quantitatively, the bias induced by the shift on $$\hat{{\mathbf{R}}}$$ is quite small, $${\mathbf{B}} = \left( {\matrix{0.16 & 0.08 \cr 0.08 & 0.04 \cr }}\right)$$, around one tenth of the values of $${\mathbf{R}}$$. In the third scenario, we put a shift on a small branch. The structure induced by the shift “breaks down” the upper clade and is unlikely to arise from the increment of a BM on that branch. It is therefore antagonistic to pPCA and results in a large bias for $$\hat{{\mathbf{R}}}$$: the extra term $${\mathbf{B}}$$ is equal to $$\left( {\matrix{1.58 & 0.79 \cr 0.79 & 0.4 \cr }}\right)$$ and comparable to $${\mathbf{R}}$$. In that scenario, both PCA and pPCA find axes that are far away from the eigenvectors of $${\mathbf{R}}$$ (Fig. 1b, right panel). The first eigenvector of $${\mathbf{R}}$$ captures the evolutionary drift correlation between traits, whereas the PCs of both PCA and pPCA capture a mix of evolutionary drift correlation and correlation resulting from traits shifting along the same edge. Figure 1. View largeDownload slide Bivariate traits simulated as a BM under three scenarios: no shift (left), shift on a long branch (middle), and shift on a short branch (right). Species affected by the shift are in dark red. In the three scenarios, the BM has an ancestral value of $$(0,0)$$, and a rate matrix $${\mathbf{R}}$$ such that $$R_{11} = R_{22} = 1$$ and $$R_{12} = -0.9$$. Top: Phylogenetic tree, shift position, and simulated trait values. Bottom: Scatterplot of species in the trait space and corresponding first eigenvector computed from the true covariance $${\mathbf{R}}$$ (black, solid) or found by pPCA (orange, long dash), and PCA (green, short dash). Figure 1. View largeDownload slide Bivariate traits simulated as a BM under three scenarios: no shift (left), shift on a long branch (middle), and shift on a short branch (right). Species affected by the shift are in dark red. In the three scenarios, the BM has an ancestral value of $$(0,0)$$, and a rate matrix $${\mathbf{R}}$$ such that $$R_{11} = R_{22} = 1$$ and $$R_{12} = -0.9$$. Top: Phylogenetic tree, shift position, and simulated trait values. Bottom: Scatterplot of species in the trait space and corresponding first eigenvector computed from the true covariance $${\mathbf{R}}$$ (black, solid) or found by pPCA (orange, long dash), and PCA (green, short dash). Simulations Studies Experimental Design General setting We studied the performance or our method using a “star-like” experimental design, as opposed to a full-factorial design. We first considered a base scenario, corresponding to a base parameter set, and then varied each parameter in turn to assess its impact as in Khabbazian et al. (2016). The base scenario was chosen to be only moderately difficult, so that our method would find shifts most but not all of the time. For the base scenario, we generated one $$160$$-taxon tree according to a pure birth process, using the R package TreeSim (Stadler 2011), with unit height and birth rate $$\lambda = 0.1$$. We then generated $$4$$ traits on the phylogeny according to the scOU model, with a rather low selection strength $$\alpha_b=1$$ ($$t_{1/2} = 69\%$$ of the tree height), and with a root taken with a stationary variance of $$\gamma^2_b = {\sigma^2_b}/{(2\alpha_b)} = 1$$. Diagonal entries of the rate matrix $${\mathbf{R}}$$ are $$\sigma^2_b$$ and off-diagonal entries were set to $$\sigma^2_b r_d$$ with a base between-trait correlation of $$r_d = 0.4$$ (correlated traits) when testing the effect of shift number and amplitude, or $$r_d=0$$ (independent traits) otherwise. Finally, we added three shifts on this phylogeny, with fixed positions (see Fig. 2). Shift amplitudes were calibrated so that the means at the tips differ by about $$1$$ standard deviation, which constitute a reasonable shift signal (Khabbazian et al. 2016). Each configuration was replicated $$100$$ times. We then used both our PhylogeneticEM and $$\ell$$1ou package (Khabbazian et al. 2016) to study the simulated data. We excluded SURFACE (Ingram and Mahler 2013) from the comparison as it (i) is quite slow, (ii) assumes the same evolutionary model as $$\ell$$1ou and (iii) was found to achieve worse accuracy than $$\ell$$1ou (Khabbazian et al. 2016). We used default setting for both methods. For PhylogeneticEM this implies an inference on an automatically chosen grid with $$10$$$$\alpha$$ values, on a log scale, and a maximum number of shifts of $$\sqrt{n}+5$$ (see Bastide et al. 2017 and Supplementary Appendix EM Inference available on Dryad for a justification of these default parameters). Figure 2. View largeDownload slide Shifts locations and magnitudes used in the base scenario. Mean trait values are identical for the four traits, up to a multiplicative $$\pm 1$$ factor and shown at the tips. The bar plots on the right represent the expected traits values under the base model. Colored bars of different heights correspond to different regimes. Figure 2. View largeDownload slide Shifts locations and magnitudes used in the base scenario. Mean trait values are identical for the four traits, up to a multiplicative $$\pm 1$$ factor and shown at the tips. The bar plots on the right represent the expected traits values under the base model. Colored bars of different heights correspond to different regimes. Number and amplitude of shifts We explored the effect of shifts by varying both their number and amplitude. We considered successively $$0, 3, 7, 11, 15$$ shifts on the topology, with positions and values fixed as in Figure 2. Shifts values were chosen to form well separated tip groups; adjacent (in the tree) group means differ by about $$1$$ standard deviation $$\gamma_b$$. To mimic adaptive events having different consequences on different traits, all shifts on a trait were then randomly multiplied by $$-1$$ or $$+1$$. Finally and to assess the effect of shift amplitude, we rescaled all shifts by a common factor taking values in $$[{0.5},{3}]$$. Low scaling values correspond to smaller, harder to detect, shifts and high values to larger and easier to detect shifts. Selection strength When exploring parameters not related to the shifts, we considered a base number of $$3$$ shifts and a base scaling factor of $$1.25$$, empirically found to correspond to a moderately difficult scenario. We also assumed independent traits with the same variance and selection strength (i.e., scalar $${\mathbf{A}}$$ and $${\mathbf{R}}$$, see model A in Supplementary Appendix Kullback–Leibler Divergences available on Dryad). We first varied $$\alpha$$ from $$1$$ to $$3$$ (i.e., $$t_{1/2}$$ varied between 35% and 23% of the tree height). The variance $$\sigma^2$$ varied with $$\alpha$$ to ensure that the stationary variance $$\gamma^2_b$$ remained fixed at $$\gamma^2_b = 1$$. Model mis-specification The two current frameworks ($$\ell$$1ou and scOU) for multivariate shift detection assume independents traits (diagonal $${\mathbf{A}}$$ and $${\mathbf{R}}$$) or correlated traits with equal selection strengths (scalar $${\mathbf{A}}$$ and arbitrary $${\mathbf{R}}$$). To assess robustness to model mis-specification, we simulated data under four classes of models, referred to as A, B, C, D. Model A is correctly specified for both scOU and $$\ell$$1ou whereas B, C, D correspond respectively to mis-specifications for $$\ell$$1ou, scOU and both. We used the Kullback-Leibler divergence between models A and B (resp. C, D) to choose parameters that attain comparable “levels” of mis-specification (see Supplementary Appendix Kullback-Leibler Divergences available on Dryad for details). Model A assumes scalar $${\mathbf{A}}$$ and $${\mathbf{R}}$$ (independent traits, same selection strength and variance) and meets the assumptions of both scOU and $$\ell$$1ou. Model B assumes scalar $${\mathbf{A}}$$ and arbitrary $${\mathbf{R}}$$ (correlated traits, same selection strength) and corresponds to the scOU model. The level of between-trait correlation is controlled by setting all off-diagonal terms to $$\sigma^2_b r_d$$ in $${\mathbf{R}}$$. Following Khabbazian et al. (2016), $$r_d$$ varies from $$0.2$$ to $$0.8$$, leading to Kullback divergences of up to $$288.36$$ units. Model C assumes diagonal, but not scalar, $${\mathbf{A}}$$, and diagonal $${\mathbf{R}}$$ (independent traits, different selection strengths), which matches the assumptions of $$\ell$$1ou only. We considered $${\mathbf{A}} = \alpha{\rm Diag}(s^{-1.5}, s^{-0.5}, s^{0.5}, s^{1.5})$$ with $$s$$ varying from $$2$$ to $$8$$. We accordingly set $${\mathbf{R}} = 2 \gamma_b^2 {\mathbf{A}}$$ to ensure that all traits have stationary variance $$\gamma_b^2=1$$. This led to Kullback divergences of up to $$286.78$$ units. Model D assumes nondiagonal $${\mathbf{A}}$$ and diagonal $${\mathbf{R}}$$ (uncorrelated drift, but correlated traits selection) and violates both models. Following Khabbazian et al. (2016), all off-diagonal elements of $${\mathbf{A}}$$ were set to $$\alpha_b r_s$$, varying from $$0.2$$ to $$0.8$$. In this case, the stationary variance is not diagonal but has diagonal entries equal to $$\frac{\sigma^2}{2} \frac{1 + (p - 2) r_s}{(1 - r_s)(1 + (p-1)r_s)}$$. We thus rescaled $$\sigma^2$$ appropriately to ensure that each trait has marginal stationary variance $$\gamma_b^2 = 1$$ as previously. This led to Kullback divergences of up to $$112.98$$ units. We expected $$\ell$$1ou to outperform scOU in model C and vice versa in model B. To be fair to both methods, we selected parameter ranges leading to similar Kullback divergences, to achieve similar levels of mis-specifications. However, both deviations produce data sets with groups that are also theoretically easier to discriminate compared to model A (see Fig. 3). Indeed, we can quantify the difficulty of a data set in terms of group separation by the Mahalanobis distance between the observed data and their expected mean, (phylogenetically) estimated in the absence of shifts: Figure 3. View largeDownload slide Impact of between-trait correlation $$r_d$$ (left) and unequal selection strengths $$s$$ (right) on group separation, as defined in Eq. (4). Unequal selection strengths ($$s > 1$$) and between-trait correlations ($$r_d > 0$$) both increase group separation and make it easier to detect shifts. Figure 3. View largeDownload slide Impact of between-trait correlation $$r_d$$ (left) and unequal selection strengths $$s$$ (right) on group separation, as defined in Eq. (4). Unequal selection strengths ($$s > 1$$) and between-trait correlations ($$r_d > 0$$) both increase group separation and make it easier to detect shifts. \begin{equation} D = \left\lVert{{\mathbf{Y}}_{\text{vec}} - ({\mathbf{1}}^T \boldsymbol{\Sigma}_{\text{d}} {\mathbf{1}})^{-1} {\mathbf{1}}^T \boldsymbol{\Sigma}_{\text{d}}{\mathbf{Y}}_{\text{vec}}}\right\rVert_{{\Sigma_{\text{d}}}^{-1}}^{\mathbf{2}} - (np - N_{\textrm{NA}}) \end{equation} (4) where $${\mathbf{Y}}_{\text{vec}}$$ is the vector of observed data at the tips (omitting missing values), $$\boldsymbol{\Sigma}_{\text{d}}$$ is the true variance of $${\mathbf{Y}}_{\text{vec}}$$ and $$N_{\textrm{NA}}$$ is the number of missing values. In the absence of shifts $${\mathbb{E}\left[ {D} \right]}=0$$ and $${\mathbb{E}\left[ {D} \right]}$$ increases when groups are well separated. Number of observations We varied the number of observations by (i) varying the number of taxa and (ii) adding missing values. To change the number of taxa, we generated $$6$$ extra trees with the same parameters as before but with $$32$$ to $$256$$ taxa. The three shifts were fixed as in Figure 4. To test the ability of our method to handle missing data, we removed observations at random in our base scenario, taking care to keep at least one observed trait per species, so as not to change the number of taxa. The fraction of missing data varied from 5% to 50%. Figure 4. View largeDownload slide Shifts locations and magnitudes used for the test trees with, respectively, 32, 64, 96, 128, 192, and 256 taxa. Figure 4. View largeDownload slide Shifts locations and magnitudes used for the test trees with, respectively, 32, 64, 96, 128, 192, and 256 taxa. Results Number and amplitude of shifts We assessed shift detection accuracy with the Adjusted Rand Index (ARI, Hubert and Arabie 1985) between the true clustering of the tips, and the clustering induced by the inferred shifts (Fig. 5, top). Before adjustment, the Rand index is proportional to the number of pairs of species correctly classified in the same group or correctly classified in different groups. The ARI has maximum value of 1 (for a perfectly inferred clustering) and has expected value of 0, conditional on the inferred number and size of clusters. We use this measure rather than the classical precision/sensitivity graphs as only the clustering can be recovered unambiguously (see Identifiability Issues section). Note also that when there is no shift ($$K=0$$), there is only one true cluster, and the ARI is either $$1$$ if no shift is found, or $$0$$ otherwise (see Supplementary Appendix Note on the ARI available on Dryad). Figure 5. View largeDownload slide ARI (top) and number of shifts selected (bottom) for the solutions found by PhylogeneticEM (dark grey) and $$\ell$$1ou (light grey). Each box corresponds to one of the configuration shown in Figure 2, with a scaling factor varying between $$0.5$$ and $$3$$, and a true number of shift between $$0$$ and $$15$$ (solid lines, bottom). For the ARI, the two lines represent the maximum ($$1$$) and expected ($$0$$, for a random solution) ARI values. For $$K$$ select, the lines represent the true number of shifts. Figure 5. View largeDownload slide ARI (top) and number of shifts selected (bottom) for the solutions found by PhylogeneticEM (dark grey) and $$\ell$$1ou (light grey). Each box corresponds to one of the configuration shown in Figure 2, with a scaling factor varying between $$0.5$$ and $$3$$, and a true number of shift between $$0$$ and $$15$$ (solid lines, bottom). For the ARI, the two lines represent the maximum ($$1$$) and expected ($$0$$, for a random solution) ARI values. For $$K$$ select, the lines represent the true number of shifts. Figure 5 (top panel) shows that, unsurprisingly, both methods detect the number and positions of shifts more accurately when the shifts have higher amplitudes. PhylogeneticEM is also consistently better than $$\ell$$1ou when there is a base between-trait correlation (here, $$r_b = 0.4$$, see Experimental Design section), which is expected as the independence assumption of $$\ell$$1ou is then violated. The case $$K=0$$ (no shift) shows that $$\ell$$1ou systematically finds shifts when there are none, leading to an ARI of $$0$$. More generally, $$\ell$$1ou is prone to overestimating the number of shifts, even when they have a high magnitude (Fig. 5, bottom) whereas PhylogeneticEM is more conservative and underestimates the number of shifts when they are difficult to detect. Selection strength and model mis-specifications Our method is relatively robust to model mis-specification (Fig. 6, top). The first panel confirms that, under model A, high values of $$\alpha$$ reduce the stationary variance and lead to higher ARI values and lower RMSEs for continuous parameters (Fig. 6, bottom, leftmost panel). Similarly, scOU (resp. $$\ell$$1ou) achieves high ARI values under well specified models A and B (resp. A and C). The mis-specification of model $$C$$ (different selection strengths) does not affect scOU much: it has higher ARI dispersion than $$\ell$$1ou but their median ARI are comparable. By contrast, $$\ell$$1ou is severely affected by correlated evolution (model C) and higher between-trait correlations lead to significantly lower accuracy, even though group separation is increased (Fig. 3, right). Finally, both methods are negatively affected by correlated selection strengths (Model D), although $$\ell$$1ou seems more robust to this type of mis-specification. Figure 6. View largeDownload slide ARI (top) and root mean squared error (RMSE) of the diagonal values of the estimated stationary variance $$\boldsymbol{\Gamma}$$ (bottom) for the solutions found by PhylogeneticEM (dark grey) and $$\ell$$1ou (light grey). Each panel corresponds to a different type of mis-specification (except Model A) and the parameters $$r_d$$, $$s,$$ and $$r_s$$ control the level of mis-specification, with leftmost values corresponding to no mis-specification. For the ARI, the solid lines represent the maximum (1) and expected (0, for a random solution with the same number and size of clusters) ARI values. Figure 6. View largeDownload slide ARI (top) and root mean squared error (RMSE) of the diagonal values of the estimated stationary variance $$\boldsymbol{\Gamma}$$ (bottom) for the solutions found by PhylogeneticEM (dark grey) and $$\ell$$1ou (light grey). Each panel corresponds to a different type of mis-specification (except Model A) and the parameters $$r_d$$, $$s,$$ and $$r_s$$ control the level of mis-specification, with leftmost values corresponding to no mis-specification. For the ARI, the solid lines represent the maximum (1) and expected (0, for a random solution with the same number and size of clusters) ARI values. Although shift detection is relatively unaffected by model mis-specification, parameter estimation suffers from it (Fig. 6, bottom, center, and right panels). Both $$\ell$$1ou and scOU behave better for model A than for model D and as expected, scOU is not affected by between-trait correlation (model B) whereas $$\ell$$1ou is. Unequal selection strengths (model C) degrades parameter estimation for both PhylogeneticEM and, surprisingly, $$\ell$$1ou, that should in principle remain unaffected. Overall, features of trait evolution not properly accounted for by the inference methods (e.g., correlated selection strengths) are turned into overestimated variances. Note that the quality of the estimation of $$\boldsymbol{\Gamma}$$ depends strongly on the estimation of $$\alpha$$, and could be improved by taking a finer grid for this parameter. Number of observations and computation time For a given number of shifts, shift detection becomes easier as the number of taxa increases (Fig. 7, left). Furthermore, our method is robust against missing data with detection accuracy only slightly decreased when up to 50% of the observations are missing (Fig. 7, right). Finally, our implementation of the EM algorithm, using only two tree traversals (see Supplementary Appendix E step available on Dryad) and coded in C++, is reasonably fast (see Fig. 8). Inference takes roughly $$15$$ min on a single core on the base 160 taxa tree and less than $$45$$ min on the largest simulated trees ($$256$$ taxa). $$\ell$$1ou scales less efficiently: it is faster for very small trees ($$32$$ taxa) but median running times go up to $$20$$ h for the large $$256$$-taxon tree. Those long running times were unexpected and higher than the ones reported in Khabbazian et al. (2016). This discrepancy is partly due to the maximum number of shifts allowed, which strongly impacts the running time of $$\ell$$1ou. Khabbazian et al. (2016) capped it at twice the true number of shifts ($$6$$ shifts in our base scenario), while we used the default setting, which is half the number of tips (i.e., from $$16$$ to $$128$$ shifts). Figure 7. View largeDownload slide ARI of the solutions found by PhylogeneticEM (dark grey) and $$\ell$$1ou (light grey) when the number of taxa (left) or the number of missing values (right) increases. No ARI is available for $$\ell$$1ou when there are missing values as it does not accept them in the version used here, v1.21. Figure 7. View largeDownload slide ARI of the solutions found by PhylogeneticEM (dark grey) and $$\ell$$1ou (light grey) when the number of taxa (left) or the number of missing values (right) increases. No ARI is available for $$\ell$$1ou when there are missing values as it does not accept them in the version used here, v1.21. Figure 8. View largeDownload slide Inference running times (in log-scale) of scOU and $$\ell$$1ou. All tests were run on a high-performance computing facility with CPU speeds ranging from $$2.2$$ to 2.8 GHz. Figure 8. View largeDownload slide Inference running times (in log-scale) of scOU and $$\ell$$1ou. All tests were run on a high-performance computing facility with CPU speeds ranging from $$2.2$$ to 2.8 GHz. Impact of pPCA on shift detection accuracy To illustrate how pPCA can both improve and hamper shift detection, we compared PhylogeneticEM on raw traits to $$\ell 1$$ou on both raw traits and phylogenetic PCs. Figure 9a shows that in our base scenario, with three moderate shifts, pPCA preprocessing slightly decreases performance for low between-trait correlations ($$r_d \leq 0.2$$) but drastically improves them for moderate to high correlations levels ($$r_d \geq 0.6$$). Although preprocessing is neutral at moderate between-trait correlations ($$r_d = 0.4$$) with three “easy” shifts, it becomes harmful and degrades the performance of $$\ell 1$$ou when the number or magnitude of the shifts increases (Fig. 9b). As expected, PhylogeneticEM is unaffected by the pPCA preprocessing, up to numerical issues. Figure 9. View largeDownload slide ARI of the solutions found by PhylogeneticEM (dark grey) and $$\ell$$1ou (light grey), without (solid lines) or with (dotted lines) pPCA preprocessing. a) Between-trait correlation ($$r_d$$) increases from $$0$$ to $$0.8$$. b) Each box corresponds to one of the configuration shown in Figure 2, and shifts are increasingly large with a scaling factor varying between $$0.5$$ and $$3$$. Figure 9. View largeDownload slide ARI of the solutions found by PhylogeneticEM (dark grey) and $$\ell$$1ou (light grey), without (solid lines) or with (dotted lines) pPCA preprocessing. a) Between-trait correlation ($$r_d$$) increases from $$0$$ to $$0.8$$. b) Each box corresponds to one of the configuration shown in Figure 2, and shifts are increasingly large with a scaling factor varying between $$0.5$$ and $$3$$. Examples We used PhylogeneticEM to re-analyse two publicly available data sets. New World Monkeys We first considered the evolution of brain shape in New World Monkeys studied by Aristide et al. (2016). The data set consists of $$49$$ species on a time-calibrated maximum-likelihood tree. The traits under study are the first two principal components (PC1, PC2) resulting from a PCA on $$399$$ landmarks describing brain shape. We ran PhylogeneticEM on a grid of $$30$$ values for the $$\alpha$$ parameter. To make this parameter easily interpretable, we report the phylogenetic half-life$$t_{1/2} = \ln(2)/\alpha$$ (Hansen 1997), expressed in percentage of total tree height. Here, $$t_{1/2}$$ took values between 0.46% and 277.26%. We allowed for a maximum of $$20$$ shifts. The inference took $$17.56$$ min, parallelized on $$5$$ cores. The model selection criterion suggests an optimal value of $$\widehat{K} = 4$$ shifts (Fig. 10, inset graph). The criterion does not show a very sharp minimum, however, and a value of $$\widehat{K} = 5$$ shifts also seems to be a good candidate. In order to compare our results with that presented in Aristide et al. (2016), we present the solution with 5 shifts (see Fig. 10, left). The solution with $$4$$ shifts is very similar, except that the group with Aotus species is absent (slanted cross, in red, see Fig. 10, and SupplementaryFig. 2 in Supplementary Appendix Case Study available on Dryad). Note that, because of this added group, the solution with $$\widehat{K} = 5$$ has $$3$$ equivalent parsimonious allocations of the shifts (see SupplementaryFig. 3 in Appendix Case Study available on Dryad). The groups found by PhylogeneticEM (Fig. 10) are in close agreement with the ecological niches defined in Aristide et al. (2016). There are three main differences. First, there is no jump associated with the Pithecia species (inverted triangle, in green, right tree of Fig. 10) who, although having their own ecological niche, seem to have quite similar brain shapes as closely related species. Second, Callicebus and Aotus are marked as convergent in Aristide et al. (2016) (slanted cross, in red, right tree), but form two distinct groups in our model (slanted cross, in red, and star, in pink, left tree). This is due to our assumption of no homoplasy. Finally, the group with Chiropotes, Ateles, and Cebus species (square, in black) was found as having the “ancestral” trait optimum, while it is marked as “convergent” in Aristide et al. (2016). This is because we did not include any information from the fossil record that the authors considered in their original study, but instead used a parsimonious solution. Note that the coloring displayed in Aristide et al. (2016) is not parsimonious. The two models have the same number of distinct groups. Figure 10. View largeDownload slide Solution given by PhylogeneticEM for $$K=5$$ (left) against groups defined in Aristide et al. (2016, Fig. 3) (right), based on ecological criteria including locomotion (arboreal quadrupedal walk, clamber, and suspensory locomotion or clawed locomotion), diet (leaves, fruits, seeds, or insects) and group size (smaller or larger than $$15$$ individuals). Groups are labelled with symbols. The inset graph shows the model selection criterion. The minimum is for $$K=4$$, but $$K=5$$ is also a good candidate. Figure 10. View largeDownload slide Solution given by PhylogeneticEM for $$K=5$$ (left) against groups defined in Aristide et al. (2016, Fig. 3) (right), based on ecological criteria including locomotion (arboreal quadrupedal walk, clamber, and suspensory locomotion or clawed locomotion), diet (leaves, fruits, seeds, or insects) and group size (smaller or larger than $$15$$ individuals). Groups are labelled with symbols. The inset graph shows the model selection criterion. The minimum is for $$K=4$$, but $$K=5$$ is also a good candidate. The selected $$\alpha$$ value was found to be reasonably high, with $$t_{1/2} = 12.58\%$$. The estimated correlation between the two PCs was $$-0.13$$, confirming that PCA does not result in independent traits. Lizards We then considered the data set from Mahler et al. (2013), which consists in $$100$$ lizard species on a time-calibrated maximum likelihood tree and $$11$$ morphological traits. We chose this example because of the large number of traits and the high correlation between traits, as all traits are highly correlated (0.82 $$< \rho <$$ 0.97) with snout-to-vent length (SVL). To deal with the correlation between traits, Mahler et al. (2010, 2013) first performed a phylogenetic regression of all the traits against SVL, retrieved the residuals and then applied a phylogenetic PCA on SVL and the previous residuals, from which they used the first four components (pPC1 to pPC4) for their shift analysis. We first explored how the number of pPCs used can impact the shift detection. We hence ran PhylogeneticEM$$11$$ times, including $$1$$ to $$11$$ pPCs in the input data set. Each run was done on a grid of $$100$$ values of $$\alpha$$, with $$t_{1/2} = \ln(2)/\alpha \in [0.99, 693.15]$$% of tree height, and allowing for a maximum of $$20$$ shifts. It appears that the result is quite sensitive to the number of pPCs included: the selected number of shifts varies from $$20$$, the maximum allowed, to $$5$$ (Fig. 11). When $$4$$ pPCs were used, as in the original study, the estimated covariance matrix $$\mathbf{R}$$ contains many high correlations, showing that the pPCs are not phylogenetically independent (Fig. 11). Figure 11. View largeDownload slide Lizard data set: selected number of shifts $$\hat{K}$$ given the number of pPCs included in the analysis (top) and estimated correlation matrix between the first four pPCs (bottom). Figure 11. View largeDownload slide Lizard data set: selected number of shifts $$\hat{K}$$ given the number of pPCs included in the analysis (top) and estimated correlation matrix between the first four pPCs (bottom). To avoid the difficult choice of the number of pPCs, we considered the direct analysis of the raw traits without any pre-processing, and found no shift when running PhylogeneticEM. Although the likelihood was found to increase with $$K$$, the model selection criterion profile was found erratic, suggesting numerical instability. A natural suspect for such instability is the extreme correlation between some traits (0.996 for tibia and metatarsal lengths), which results in bad conditioning of several matrices that must be inverted. To circumvent this problem, we used the two pseudo-orthogonalization strategies described above, running PhylogeneticEM on the SVL plus residuals data set, and on the $$11$$ pPCs, with the same parameters as above. Note that all these transformations use a rotation matrix, so that the likelihood and the least squares of the original or of any of the two transformed data sets are the same. Hence, the objective function, as well as the model selection criterion, should remain unchanged. Still, slight differences were found between the maximized likelihood for each pseudo-orthogonalized data sets. For each value of $$K$$, we therefore retained the solution with the highest likelihood. Using the model selection criterion given in Statistical Inference section, we found $$\widehat{K} = 5$$ shifts, which are displayed in Figure 12, along with the ecomorphs as described in Mahler et al. (2013). Figure 12. View largeDownload slide Lizard data set: solution found by PhylogeneticEM. Groups produced by the shifts are colored on the edges of the tree. The species are colored according to ecomorphs defined in Mahler et al. (2013). The traits are the snout-to-vent length (SVL), and the phylogenetic residuals of the regression against SVL of the following traits: femur length, tibia length, metatarsal IV length, toe IV length, humerus length, radius length, finger IV length, lamina number (toe and finger IV), and tail length. The same transformations were used as in Mahler et al. (2010, 2013) Figure 12. View largeDownload slide Lizard data set: solution found by PhylogeneticEM. Groups produced by the shifts are colored on the edges of the tree. The species are colored according to ecomorphs defined in Mahler et al. (2013). The traits are the snout-to-vent length (SVL), and the phylogenetic residuals of the regression against SVL of the following traits: femur length, tibia length, metatarsal IV length, toe IV length, humerus length, radius length, finger IV length, lamina number (toe and finger IV), and tail length. The same transformations were used as in Mahler et al. (2010, 2013) Three of those shifts seem to single out grass-bush Anolis, that appear to have a rather small body size, with longer than expected lower limbs and tail, and shorter upper limbs. The two others might be associated with twig Anolis, that have smaller than expected limbs and tails. Because of our no-homoplasy assumption, one of those shifts encompasses some species living in other ecomorphs (namely, trunk, trunk-crown and unclassified). The shift, designed to be coherent with the phylogeny, is located on the stem lineage of the smallest clade encompassing the bulk of twig lizards. Comments On both examples (p)PCA does not correct a priori for the correlation between the traits in the presence of shifts. In pPCA and Shifts section, we formally proved that it cannot correct for it, actually. As a consequence, any shift detection methods has to account for the correlation between traits. Still, high correlations between traits may raise strong numerical issues, so PCA can be used as a pseudo-orthogonalization of traits, as well as any other linear distance-preserving transformation that would reduce the correlation between them. This does not dispense of considering the correlation between the transformed traits in the model. The other interest of PCA is to reduce the dimension of the data, which may be desirable when dealing with a large number of traits, such as the original data set from Aristide et al. (2016). Since PCA does not correct for the right correlation, we have no clue as to whether or not the dimension reduction performed by PCA is relevant for shift detection, or if it may remove precisely the direction along which the shifts occur. The relevant dimension reduction would consist in approximating the correlation matrix $$\mathbf{R}$$ with a matrix of lower rank $$q < p$$. This can obviously not be done before the shifts are known, which suggests that shift detection and dimension reduction should be performed simultaneously. Methods have been proposed (see e.g., Goolsby 2016; Tolkoff et al. 2017) to deal with very high-dimensional and highly correlated data but they assume that there are no shifts. Until a comprehensive method is developed to deal with both shifts and dimension reduction simultaneously, PCA and other ad hoc dimension reduction methods can still be considered as useful preprocessing steps. They might even be necessary for numerical stability purposes, as in Aristide et al. (2016). However, and as stated previously, the practitioner should keep in mind that in the context of shift detection, this preprocessing step does not lead to independent traits. Therefore, downstream analyses should still account for correlations between the reduced traits. Discussion Many phenotypic traits appear to evolve relatively smoothly over time and across many taxa. However, changes in evolutionary pressures (dispersal to new geographic zones, diet change, etc.) or key innovations (bipedal locomotion) may cause bursts of rapid trait evolution, coined evolutionary jumps by Simpson (1944). Phenotypic traits typically evolve in a coordinated way (Mahler et al. 2013; Aristide et al. 2015) and a multivariate framework is thus best suited to detect evolutionary jumps. We introduced here an EM algorithm embedded in a maximum-likelihood multivariate framework to infer shifts strength, location and number. Importantly, our method uses Gaussian elimination, just like Fitzjohn (2012), to avoid computing inverses of large variance-covariance matrices and can cope with missing data, an especially important problem in the multivariate setting where some traits are bound to be missing for some taxa. We demonstrated the applicability and accuracy of our method on simulated data sets and by identifying jumps for body size evolution in Anolis lizards and brain shapes of New World Monkeys. In both systems, the well-supported jumps occurred on stem lineages of clades that differ in terms of diet, locomotion, group size or foraging strategy (see Aristide et al. 2016 for a detailed discussion) supporting the Simpsonian hypothesis. Interpretation Issues We emphasize that the interpretation of $$\alpha$$ is a matter of discussion. We introduced the scOU in terms of adaptive evolution with a selection strength $$\alpha$$ on the tree. However, the equivalency between OU and BM on a distorted tree suggests that $$\alpha$$ can also be seen as a “phylogenetic signal” parameter, like Pagel’s $$\lambda$$ (Pagel 1999). When $$\alpha$$ is small, $$\ell_i(\alpha) \simeq \ell_i$$ so that branch lengths are unchanged and the phylogenetic variance is preserved. At the other end of the spectrum, when $$\alpha$$ is large, $$\ell_i(\alpha) \simeq 0$$ for inner branches and the rescaled tree behaves almost like a star tree. However and unlike Pagel’s $$\lambda$$, $$\alpha$$ also dictates how shifts in the optima in the original OU ($$\boldsymbol{\Delta}^{OU}$$) are transformed into shifts in the traits values in the rescaled BM ($$\boldsymbol{\Delta}^{BM}(\alpha)$$). For small $$\alpha$$, recall to the optima is weak and shifts on the optima affect the traits values minimally ($$\boldsymbol{\Delta}^{BM}(\alpha) \simeq 0$$). By contrast, for large $$\alpha$$, the recall is strong and shifts on the optima are instantaneously passed on to the traits values ($$\boldsymbol{\Delta}^{BM}(\alpha) \simeq \boldsymbol{\Delta}^{OU}$$). Note however that in both cases, the topology is never lost: a shift, no matter how small its amplitude or how short the branch it occurs on, always affects the same species. Note that if we observed traits values at some ancestral nodes (e.g., from the fossil record), the equivalency between BM and OU would break down: $$\alpha$$ would recover its strict interpretation as selection strength. On nonultrametric trees, our inference strategy does not benefit from the computational trick to speed up the M step. Similarly to the univariate case, we could write a generalized EM algorithm to handle this situation. In Bastide et al. (2017), we used a lasso-based heuristic to raise, if not maximize, the objective function at the M step. It worked quite well, but was much slower. This approach could be extended to the multivariate setting, although with impaired computational burden. Note also that some shift configurations that are not identifiable in the absence of fossil data become distinguishable with the addition of fossil data. This affects our model selection criterion, which relies on the number of distinct identifiable solutions. Computing this number on a nonultrametric tree for an OU remains an open problem, and is probably highly dependent on the topology of the tree. Noncausal Correlations $$\ell$$1ou, SURFACE and PhylogeneticEM make many simplifying assumptions to achieve tractable models. Chief among them is the assumption that $$\mathbf{A}$$ is diagonal. While $$\ell$$1ou and SURFACE both assume independent traits, PhylogeneticEM can handle correlated traits through nondiagonal variance matrix $$\mathbf{R}$$. We warn the reader that correlations encoded by $$\mathbf{R}$$ are not causal and only capture coordinated and nonselective traits evolution: that is when arm length increases, so does leg length. In order to capture evolution of trait $$i$$in response to changes in trait $$j$$ (i.e., when arm length strays away from its optimal value, does leg length move away or toward its own optimum) one should rather look at the value of $$A_{ij}$$, as was recently pointed out (Reitan et al. 2012; Liow et al. 2015; Manceau et al. 2016). Our simplifying assumptions are justified by various considerations: our focus on inference of shifts rather than proper estimation of $$\mathbf{A}$$ and $$\mathbf{R}$$, simulations showing that shift detection is robust to moderate values of off-diagonal terms in $$\mathbf{A}$$, difficulties to simultaneously estimate $$\alpha$$ and shifts even in the univariate case (Butler and King 2004), and computational gain achieved by considering scalar or diagonal $$\mathbf{A}$$. They also suggest that if the focus is on causal correlation in the presence of shifts, a two-step strategy that first detects shifts using a crude but robust model, then includes those shifts in a more complex model, may achieve good performance. The other simplifying assumption we made is that all traits shift at the same time. It makes formal analysis of identifiability issues and selection of the number of shifts similar to the univariate case, previously studied in Bastide et al. (2017). The assumption is likely to be false in practice, however. Asynchronous shifts are an interesting extension of the model. An ambitious framework would be to build from the ground up a model that allows for different shifts on different traits. It would have to deal with the combinatorial complexity induced by asynchronous shifts, and to use a different selection criterion for the number of shifts. A less ambitious but more pragmatic approach would be a postprocessing of the shifts to select, for each shift, the traits that actually jumped. This would require derivation of confidence intervals for the shift values. Finally, and unlike SURFACE and new version v1.40 of $$\ell$$1ou, our model excludes convergent evolution. This limitation is shared with other shift detection methods such as bayou (Uyeda and Harmon 2014) in the univariate case. This exclusion simplifies formal analysis and allows us to borrow from the framework of convex characters on a tree developed in Semple and Steel (2003) but is also likely to be false in practice. A straightforward extension of our method to detect convergence relies again on postprocessing of the shifts: the inferred optimal value of a trait after a shift can be tested to assess whether or not it is different from previously inferred optimal values and warrants a regime of its own. Nature of Jumps We model shifts as instantaneous and immediately following speciation events, like in the punctuated equilibrium theory of Eldredge and Gould (1972). We don’t argue that this is necessary the case. Selection and drift can reasonably be seen as instantaneous over macroevolutionary timescales but by no means over microevolutionary timescales. There is very strong evidence, for example in peppered moths (Cook et al. 2012), that rapid adaptation can happen even in the absence of speciation. However our model does not allow us to distinguish between many small jumps distributed across a branch, one big jump anywhere on that branch and one big jump immediately following speciation, and therefore between punctuated or Simpsonian evolution. Parsimony, Ancestral State Reconstruction and the Fossil Record To cope with identifiability issues, we only considered parsimonious allocations of shifts on the tree. In the absence of fossil taxa in the sample (i.e., when the tree is ultrametric), the data do not allow us to choose between the many nonparsimonious solutions to the problem, as they would all give the exact trait distribution at the tips, but using more parameters (Bastide et al. 2017). However, the actual biological process might very well be nonparsimonious. For instance, in the New World Monkeys example, we found a conflict between the parsimonious solution we recovered and the observed fossil record. The only way to solve this discrepancy would be to develop a method that can deal with nonultrametric trees, allowing for fossil taxa to bear information on the best evolutionary scenario. However, as mentioned earlier, this raises computational and theoretical identifiability questions that are beyond the scope of the present article. Supplementary Material Online-only appendices available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.60t0f. Acknowledgments We are grateful to the INRA MIGALE bioinformatics platform (http://migale.jouy.inra.fr) for providing the computational resources needed for the experiments. We also thank an anonymous reviewers and the associate editor Michael Alfaro for comments that improved the first version of this manuscript. Funding The visit of P.B. to the University of Wisconsin-Madison during the fall of 2015 was funded by a grant from the Franco-American Fulbright Commission. Appendix: PCA: Mathematical Derivations Expectation of the estimated variance–covariance matrix Taking $$\widetilde{\mathbf{C}} = (\mathbf{1}_n^T \mathbf{C}^{-1}\mathbf{1}_n)^{-1}\mathbf{1}_n^T\mathbf{C}^{-1}$$, we have that $$\bar{\mathbf{Y}}^T = \widetilde{\mathbf{C}} \mathbf{Y}$$, and $$\bar{\mathbf{a}}^T = \mathbb{E}\left[ {\bar{\mathbf{Y}}^T} \right] = \widetilde{\mathbf{C}} \mathbf{a}$$. Denote by $$\mathbf{N}_{\mathbf{C}^{-1}}: \mathbb{R}^{n\times p} \to \mathbb{R}^{p^2}$$ the function that to a $$n\times p$$ matrix $$\mathbf{A}$$ associates the $$p\times p$$ matrix $$\mathbf{A}^T \mathbf{C}^{-1}\mathbf{A}$$. We get: \begin{align*} &(n-1) \mathbb{E}\left[ {\hat{\mathbf{R}}} \right]\\ &\quad{} = \mathbb{E}\left[ {\mathbf{N}_{\mathbf{C}^{-1}}\left(\mathbf{Y} - \mathbf{1}_n\bar{\mathbf{Y}}^T\right)} \right]\\ &\quad{} = \mathbb{E}\left[ {\mathbf{N}_{\mathbf{C}^{-1}}\left((\mathbf{Y} - \mathbf{a}) + (\mathbf{a} - \mathbf{1}_n\bar{\mathbf{a}}^T) + (\mathbf{1}_n\bar{\mathbf{a}}^T - \mathbf{1}_n\bar{\mathbf{Y}}^T)\right)} \right]\\ &\quad{}= \mathbb{E}\left[ {\mathbf{N}_{\mathbf{C}^{-1}}\left((\mathbf{I} - \mathbf{1}_n\widetilde{\mathbf{C}})(\mathbf{Y} - \mathbf{a}) + (\mathbf{a} - \mathbf{1}_n\bar{\mathbf{a}}^T)\right)} \right]\\ &\quad{}= \mathbb{E}\left[ {\mathbf{N}_{\mathbf{C}^{-1}}\left((\mathbf{I} - \mathbf{1}_n\widetilde{\mathbf{C}})(\mathbf{Y} - \mathbf{a})\right)} \right] + \mathbf{N}_{\mathbf{C}^{-1}}\left(\mathbf{a} - \mathbf{1}_n\bar{\mathbf{a}}^T\right) \end{align*} where the two double products cancel out, as $$\mathbb{E}\left[ {\mathbf{Y}} \right] = \mathbf{a}$$. But, for any nonsingular symmetric matrix $$\mathbf{H}$$, we have: \begin{align*} &\mathbb{E}\left[ {(\mathbf{Y}- \mathbf{a})^T \mathbf{H}^{-1} (\mathbf{Y}- \mathbf{a})} \right]\\ &\quad{} = \sum_{1\leq i, j \leq n} [\mathbf{H}^{-1}]_{ij} \mathbb{E}\left[ {(\mathbf{Y}^i - \mathbf{a}^i)(\mathbf{Y}^j - \mathbf{a}^j)^T} \right]\\ &\quad{} = \sum_{1\leq i, j \leq n} [\mathbf{H}^{-1}]_{ij} C_{ij}\mathbf{R} = {\rm{tr}}(\mathbf{H}^{-1}\mathbf{C})\mathbf{R} \end{align*} Hence, applying this formula with $$\mathbf{H}^{-1} = (\mathbf{I} - \mathbf{1}_n\widetilde{\mathbf{C}})^T \mathbf{C}^{-1} (\mathbf{I} - \mathbf{1}_n\widetilde{\mathbf{C}}) = \mathbf{C}^{-1} - \mathbf{C}^{-1}\mathbf{1}_n\widetilde{\mathbf{C}}$$, some straightforward matrix algebra manipulations give us: \begin{align*} (n-1) \mathbb{E}\left[ {\hat{\mathbf{R}}} \right] & = (n-1)\mathbf{R} + (\mathbf{a} - \mathbf{1}_n\bar{\mathbf{a}}^T)^T\mathbf{C}^{-1}(\mathbf{a} - \mathbf{1}_n\bar{\mathbf{a}}^T) \end{align*} which is the result stated in the text, with $$\mathbf{G} = \mathbf{a} - \mathbf{1}_n\bar{\mathbf{a}}^T = (\mathbf{I}_n - (\mathbf{1}_n^T \mathbf{C}^{-1}\mathbf{1}_n)^{-1}\mathbf{1}_n\mathbf{1}_n^T\mathbf{C}^{-1})\mathbf{a}$$. References Aristide L., dos Reis S.F., Machado A.C., Lima I., Lopes R.T., Perez S.I. 2016 . Brain shape convergence in the adaptive radiation of New World monkeys. Proc. Natl. Acad. Sci. USA 113 : 2158 – 2163 . Google Scholar CrossRef Search ADS Aristide L, Rosenberger AL, Tejedor MF, Perez SI. 2015 . Modeling lineage and phenotypic diversification in the New World monkey (Platyrrhini, Primates) radiation. Mol. Phylogenet. Evol. 82 : 375 – 385 . Google Scholar CrossRef Search ADS PubMed Baraud Y, Giraud C, Huet S. 2009 . Gaussian model selection with an unknown variance. Ann. Stat. 37 : 630 – 672 . Google Scholar CrossRef Search ADS Bartoszek K, Pienaar J, Mostad P, Andersson S, Hansen TF. 2012 . A phylogenetic comparative method for studying multivariate adaptation. J. Theor. Biol. 314 : 204 – 215 . Google Scholar CrossRef Search ADS PubMed Bastide P, Mariadassou M, Robin S. 2017 . Detection of adaptive shifts on phylogenies by using shifted stochastic processes on a tree. J. R. Stat. Soc. Ser. B 79 : 1067 – 1093 . Google Scholar CrossRef Search ADS Beaulieu JM, Jhwueng DC, Boettiger C, O’Meara BC. 2012 . Modeling stabilizing selection: expanding the Ornstein–Uhlenbeck model of adaptive evolution. Evolution 66 : 2369 – 2383 . Google Scholar CrossRef Search ADS PubMed Blomberg SP, Garland T, Ives AR. 2003 . Testing for phylogenetic signal in comparative data: behavioral traits are more labile. Evolution 57 : 717 – 745 . Google Scholar CrossRef Search ADS PubMed Butler MA, King AA. 2004 . Phylogenetic comparative analysis: a modeling approach for adaptive evolution. Am. Nat . 164 : 683 – 695 . Google Scholar CrossRef Search ADS PubMed Clavel J, Escarguel G, Merceron G. 2015 . mvmorph : an R package for fitting multivariate evolutionary models to morphometric data. Methods Ecol. Evol. 6 : 1311 – 1319 . Google Scholar CrossRef Search ADS Cook LM, Grant BS, Saccheri IJ, Mallet J. 2012 . Selective bird predation on the peppered moth: the last experiment of Michael Majerus. Biol. Lett. 8 : 609 – 612 . Google Scholar CrossRef Search ADS PubMed Cooper N, Thomas GH, Venditti C, Meade A, Freckleton RP. 2016 . A cautionary note on the use of Ornstein–Uhlenbeck models in macroevolutionary studies. Biol. J. Linnean Soc . 118 : 64 – 77 . Google Scholar CrossRef Search ADS Cybis GB, Sinsheimer JS, Bedford T, Mather AE, Lemey P, Suchard MA. 2015 . Assessing phenotypic correlation through the multivariate phylogenetic latent liability model. Ann. Appl. Stat . 9 : 969 – 991 . Google Scholar CrossRef Search ADS PubMed Dempster A, Laird N, Rubin DB. 1977 . Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39 : 1 – 38 . Duchen P, Leuenberger C, Szilágyi SM, Harmon LJ, Eastman JM, Schweizer M, Wegmann D. 2017 . Inference of evolutionary jumps in large phylogenies using Lévy processes. Syst. Biol. 00 : 1 – 14 . Eastman JM, Alfaro ME, Joyce P, Hipp AL, Harmon LJ. 2011 . A Novel comparative method for identifying shifts in the rate of character evolution on trees. Evolution 65 : 3578 – 3589 . Google Scholar CrossRef Search ADS PubMed Eastman JM, Wegmann D, Leuenberger C, Harmon LJ. 2013 . Simpsonian ‘Evolution by Jumps’ in an adaptive radiation of Anolis Lizards . ArXiv id:1305.4216 . Eldredge N, Gould SJ. 1972 . Punctuated equilibria: an alternative to phyletic gradualism . In: Schopf T, editor, Models in paleobiology , San Francisco : Freeman, Cooper & Co , chapter 5 , pp. 82 – 115 . Google Scholar CrossRef Search ADS Felsenstein J. 1985 . Phylogenies and the comparative method. Am. Nat . 125 : 1 – 15 . Google Scholar CrossRef Search ADS Felsenstein J. 2004 . Inferring Phylogenies . Sunderland , Massachusetts . Felsenstein J. 2005 . Using the quantitative genetic threshold model for inferences between and within species. Philos. Trans. R. Soc. B 360 : 1427 – 1434 . Google Scholar CrossRef Search ADS Felsenstein J. 2012 . A comparative method for both discrete and continuous characters using the threshold model. Am. Nat . 179 : 145 – 156 . Google Scholar CrossRef Search ADS PubMed Fitzjohn RG. 2012 . Diversitree: comparative phylogenetic analyses of diversification in R. Methods Ecol. Evol. 3 : 1084 – 1092 . Google Scholar CrossRef Search ADS Goolsby EW. 2016 . Likelihood-based parameter estimation for high-dimensional phylogenetic comparative models: overcoming the limitations of distance-based methods. Syst. Biol. 65 : 852 – 870 . Google Scholar CrossRef Search ADS PubMed Hansen TF. 1997 . Stabilizing selection and the comparative analysis of adaptation. Evolution 51 : 1341 . Google Scholar CrossRef Search ADS PubMed Hansen TF, Orzack SH. 2005 . Assessing current adaptation and phylogenetic inertia as explanations of trait evolution: the need for controlled comparisons. Evolution 59 : 2063 – 2072 . Google Scholar PubMed Hansen TF, Pienaar J, Orzack SH. 2008 . A comparative method for studying adaptation to a randomly evolving environment. Evolution 62 : 1965 – 1977 . Google Scholar PubMed Ho LST, Ané C. 2013 . A linear-time algorithm for Gaussian and non-Gaussian trait evolution models. Syst. Biol . 63 : 397 – 408 . Ho LST, Ané C. 2014 . Intrinsic inference difficulties for trait evolution with Ornstein-Uhlenbeck models. Methods Ecol. Evol. 5 : 1133 – 1146 . Google Scholar CrossRef Search ADS Hubert L, Arabie P. 1985 . Comparing partitions. J. Classif. 2 : 193 – 218 . Google Scholar CrossRef Search ADS Ingram T, Mahler DL. 2013 . SURFACE: Detecting convergent evolution from comparative data by fitting Ornstein–Uhlenbeck models with stepwise Akaike Information Criterion. Methods Ecol. Evol. 4 : 416 – 425 . Google Scholar CrossRef Search ADS Jetz W, Thomas G, Joy J, Hartmann K, Mooers A. 2012 . The global diversity of birds in space and time. Nature 491 : 444 – 448 . Google Scholar CrossRef Search ADS PubMed Khabbazian M, Kriebel R, Rohe K, Ané C. 2016 . Fast and accurate detection of evolutionary shifts in Ornstein–Uhlenbeck models. Methods Ecol. Evol. 7 : 811 – 824 . Google Scholar CrossRef Search ADS Lande R. 1976 . Natural selection and random genetic drift in phenotypic evolution. Evolution 30 : 314 . Google Scholar CrossRef Search ADS PubMed Landis MJ, Schraiber JG, Liang M. 2013 . Phylogenetic analysis using Lévy processes: finding jumps in the evolution of continuous traits. Syst. Biol. 62 : 193 – 204 . Google Scholar CrossRef Search ADS PubMed Liow LH, Reitan T, Harnik PG. 2015 . Ecological interactions on macroevolutionary time scales: clams and brachiopods are more than ships that pass in the night. Ecol. Lett. 18 : 1030 – 1039 . Google Scholar CrossRef Search ADS PubMed Mahler DL, Ingram T, Revell LJ, Losos JB. 2013 . Exceptional convergence on the macroevolutionary landscape in island lizard radiations. Science 341 : 292 – 295 . Google Scholar CrossRef Search ADS PubMed Mahler DL, Revell LJ, Glor RE, Losos JB. 2010 . Ecological opportunity and the rate of morphological evolution in the diversification of greater Antillean anoles. Evolution 64 : 2731 – 2745 . Google Scholar CrossRef Search ADS PubMed Manceau M, Lambert A, Morlon H. 2016 . A unifying comparative phylogenetic framework including traits coevolving across interacting lineages. Syst. Biol. syw115 . Manceau M, Lambert A, Morlon H. 2017 . A unifying comparative phylogenetic framework including traits coevolving across interacting lineages. Systematic Biology . 66 : 551 – 568 . Google Scholar PubMed Mardia KV, Kent JT, Bibby JM. 1979 . Multivariate analysis . Probability and mathematical statistics . New York : Academic Press . Meredith RW, Janečka JE, Gatesy J, et al. (22 co-authors). 2011 . Impacts of the cretaceous terrestrial revolution and kpg extinction on mammal diversification. Science . 334 : 521 – 524 . Google Scholar CrossRef Search ADS PubMed Pagel M. 1999 . Inferring the historical patterns of biological evolution. Nature 401 : 877 – 884 . Google Scholar CrossRef Search ADS PubMed Pennell MW, FitzJohn RG, Cornwell WK, Harmon LJ. 2015 . Model adequacy and the macroevolution of angiosperm functional traits. Am. Nat. 186 : E33 – E50 . Google Scholar CrossRef Search ADS PubMed R Core Team . 2017 . R: a language and environment for statistical computing. Vienna, Austria : R Foundation for Statistical Computing . Reitan T, Schweder T, Henderiks J. 2012 . Phenotypic evolution studied by layered stochastic differential equations. Ann. Appl. Stat. 6 : 1531 – 1551 . Google Scholar CrossRef Search ADS Revell LJ. 2009 . Size-correction and principal components for interspecific comparative studies. Evolution 63 : 3258 – 3268 . Google Scholar CrossRef Search ADS PubMed Semple C, Steel M. 2003 . Phylogenetics. Oxford Lecture Series in Mathematics and its Applications . Oxford : Oxford University Press . Simpson GG. 1944 . Tempo and Mode in Evolution. A Wartime book . New York : Columbia University Press . Stadler T. 2011 . Simulating trees with a fixed number of extant species. Syst. Biol. 60 : 676 – 684 . Google Scholar CrossRef Search ADS PubMed Tibshirani R. 1996 . Regression Selection and Shrinkage via the Lasso. J. R. Stat. Soc. B 58 : 267 – 288 . Tolkoff MR, Alfaro ME, Baele G, Lemey P, Suchard MA. 2017 . Phylogenetic factor analysis. Systematic Biology . p. syx066 . Uyeda JC, Caetano DS, Pennell MW. 2015 . Comparative analysis of principal components can be misleading. Syst. Biol. 64 : 677 – 689 . Google Scholar CrossRef Search ADS PubMed Uyeda JC, Harmon LJ. 2014 . A Novel Bayesian method for inferring and interpreting the dynamics of adaptive landscapes from phylogenetic comparative data. Syst. Biol. 63 : 902 – 918 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
Systematic Biology – Oxford University Press
Published: Jan 27, 2018
It’s your single place to instantly
discover and read the research
that matters to you.
Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.
All for just $49/month
Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly
Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.
Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.
Read from thousands of the leading scholarly journals from SpringerNature, Wiley-Blackwell, Oxford University Press and more.
All the latest content is available, no embargo periods.
“Hi guys, I cannot tell you how much I love this resource. Incredible. I really believe you've hit the nail on the head with this site in regards to solving the research-purchase issue.”
Daniel C.
“Whoa! It’s like Spotify but for academic articles.”
@Phil_Robichaud
“I must say, @deepdyve is a fabulous solution to the independent researcher's problem of #access to #information.”
@deepthiw
“My last article couldn't be possible without the platform @deepdyve that makes journal papers cheaper.”
@JoseServera
DeepDyve Freelancer | DeepDyve Pro | |
---|---|---|
Price | FREE | $49/month |
Save searches from | ||
Create folders to | ||
Export folders, citations | ||
Read DeepDyve articles | Abstract access only | Unlimited access to over |
20 pages / month | ||
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
Copy and paste the desired citation format or use the link below to download a file formatted for EndNote
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.
ok to continue