# Divergence Estimation in the Presence of Incomplete Lineage Sorting and Migration

Divergence Estimation in the Presence of Incomplete Lineage Sorting and Migration Abstract This article focuses on the problem of estimating a species tree from multilocus data in the presence of incomplete lineage sorting and migration. I develop a mathematical model similar to IMa2 (Hey 2010) for the relevant evolutionary processes which allows both the population size parameters and the migration rates between pairs of species tree branches to be integrated out. I then describe a BEAST2 package DENIM (Divergence estimation notwithstanding ILS and migration) which is based on this model and which uses an approximation to sample from the posterior. The approximation is based on the assumption that migrations are rare, and it only samples from certain regions of the posterior which seem likely given this assumption. The method breaks down if there is a lot of migration. Using simulations, Leaché et al. (2014) showed that using the standard multispecies coalescent model to infer a species tree can result in poor accuracy if migration is present. I reanalyze this simulated data to explore DENIM’s performance and demonstrate substantial improvements in accuracy over *BEAST. I also reanalyze an empirical data set. Bayesian, incomplete lineage sorting, isolation-with-migration, Markov chain Monte Carlo, multispecies coalescent, phylogenetic analysis, species tree Speciation is probably gradual in most cases (Martin et al. 2013), and gene flow may continue to occur even between good species. It is difficult to infer these phenomena (Nosil 2008), but some large scale studies have demonstrated them in (e.g.,) butterflies (Martin et al. 2013), birds (Rheindt et al. 2014), grasses (Romaschenko et al. 2014), and big cats (Figueiró et al. 2017). Meanwhile, almost all estimates of species trees to date have been conducted under the assumption that no migration occurs after an instantaneous speciation event. Leaché et al. (2014) simulated a large set of data containing migrations and incomplete lineage sorting (ILS) and analyzed it using *BEAST. They showed that migration causes problems for species tree inference using the multispecies coalescent when migration is present but ignored. In general, migration between sister species rarely caused topological errors, although it resulted in biased estimates of node heights and population sizes. Migration between nonsister species resulted in topological errors, even when the migration rates were small. This article describes a BEAST2 package DENIM (Divergence estimation notwithstanding ILS and migration) for inferring a species tree in the presence of ILS and migration. The model in DENIM is an extension of the multispecies coalescent model as used by *BEAST and other programs and has much in common with that model. The addition of migration to the model results in a very hard problem, and no current method deals with it in full generality. For example, the methods of Tian and Kubatko (2016) and Dalquen et al. (2017) are restricted to three species. The method here involves an approximation based on an assumption that the migration rates are small: the results below show that this approximation typically breaks down at rates between one migrant per 100 generations and one migrant per 10 generations. DENIM allows arbitrary numbers of species and allows for different migration rates between each pair of contemporaneous branches. DENIM is available in BEAST2 and can be installed using Beauti. The installation includes a manual and source code. An alternative approach to account for gene flow is to model the situation with a species network instead of a tree. If hybrid speciation occurs, or populations merge, then a network is essential to model the situation properly. These are the sort of situations which the network approach is aimed at, and DENIM would not be appropriate for them. In DENIM, it is assumed that gene flow occurs continuously, and that any gene flow between pairs of branches will eventually become zero, so that there is an underlying species tree. It also is possible to use a species network to model a species tree with gene flow between branches. Instead of a continuous gene flow between two branches, one or more edges can be added to the network, representing gene flow in one or more discrete events. If the gene flow is in both directions, at least two edges would be needed; more might be needed in order to provide a good approximation to gene flow which lasts a long time. These network approaches search for an approximation to reality in a much larger space of topologies than DENIM. There has been a lot of recent interest in species networks. Two recent papers (Wen and Nakhleh 2017; Zhang et al. 2018) use a Bayesian MCMC method and coestimate both the network and the gene trees. These are the most similar to DENIM. There are a number of other network methods which make various approximations in order to deal with larger data sets; generally these rely on estimating the gene trees first instead of coestimating the network and gene trees. For example, Sun et al. (2014) used approximate Bayesian computation, SNaQ (Solís-Lemus and Ané 2016) uses an approximation based on quartets (4-taxon unrooted subnetworks), and FastNet (Hejase et al. 2018) uses a divide-and-conquer approach. PHRAPL (Jackson et al. 2017) simulates gene trees under a set of user-chosen hypotheses in order to choose the best model, but it does not seem feasible to apply it to more than a few species. “Migration” is used to refer to gene flow between species (usually introgression but not restricted to that). I use the term “species” rather than “population” because the method is aimed at situations where gene flow is small. A migration event occurs when an allele comes from a parent from another species. An “embedding” of a gene tree specifies which species tree branch each coalescence belongs to, together with migration events, which specify the times along gene tree branches at which an allele moved between species tree branches, and which species tree branches are involved. I always describe events going back in time from the present, so alleles have parental species to which they “go,” and the “destination” branch in the species tree contains part of a gene tree branch at a more ancient time than the “source” branch. This is because coalescences are easier to model this way and is the same convention as the program IMa2 (Hey and Nielsen 2004, 2007; Hey 2010). There is no upper limit on the number of migration events, and even if this is limited, and the gene tree and species tree are fixed, there can be a huge number of ways in which each gene tree can be embedded into a species tree. It is thus difficult to make inferences if the situation is modeled in full. IMa2 requires that the true population phylogeny (equivalent to species tree here) is known. DENIM uses a model for migration which is similar to that used by Hey (2010) in IMa2. There are two migration rate parameters for each pair of contemporaneous species tree branches. There are $$2(n-1)^2$$ of them for a species tree with $$n$$ tips (Hey 2010). There are three main differences between DENIM and IMa2. The species tree topology is estimated instead of being assumed; the migration rate parameters are integrated out; and an approximation is used to simplify sampling from the posterior. Also, the population size parameters are integrated out in a similar fashion to Jones (2016). The focus is on estimating the species tree despite the presence of small amounts of migration. Since migration rates are integrated out, they cannot be estimated directly, but DENIM does produce Markov chain Monte Carlo (MCMC) samples of species trees and embedded gene trees, so some information about the migrations can be found by postprocessing. Even with migration rate and population size parameters integrated out, there are still an unbounded number of parameters for the gene trees. It appears very difficult to design and implement MCMC operators capable of sampling efficiently from this distribution while estimating the species tree. Instead, DENIM uses an approximation to the posterior by ignoring most of the “unlikely” embeddings. I first describe the underlying evolutionary model for the coalescent and migration processes, and the particular formulation which makes it possible to integrate out the population size parameters and the migration rates. Then, I describe the approximation which makes it possible to sample efficiently from the posterior. The simulated data of Leaché et al. (2014) is reanalyzed to evaluate the proposed method, and the pocket gopher data of Belfiore et al. (2008) provides an empirical demonstration of the method. The Prior Density for a Gene Tree Background Following the introduction of the Kingman coalescent (Kingman 1982), models for coalescence and migration were developed in the 1980s by population geneticists (Hudson et al. 1990). More recent developments include Beerli and Felsenstein (2001), Ewing and Allen (2006), Tian and Kubatko (2016), Dalquen et al. (2017) as well as the aforementioned work of Hey and Nielsen. The methods of Tian and Kubatko (2016) and Dalquen et al. (2017) can estimate the species tree, but are currently restricted to at most three species and three sequences per locus. The underlying evolutionary model used here is the same as that of Hey (2010), except that the species tree $$S$$ is not assumed known but instead follows a birth–death model. When the species tree $$S$$ is estimated, it is important that $$\int \Pr(G|S) dG = 1$$ for any $$S$$, where the integration is over all gene trees $$G$$. I have not found a clear statement to this effect in the literature, so some explanation seems warranted. Between the node heights of the species tree, there is an $$n$$-island model for coalescence and migration (Beerli and Felsenstein 2001), where $$n$$ is the current number of species tree branches. This is a continuous time Markov chain. It could be time-inhomogeneous, to allow for population sizes or migration rates to vary continuously with time, although the application here only uses the time-homogeneous case. In order to define the state space of this Markov chain, we need a few preliminaries. Firstly, each branch in $$G$$ is labeled by the tip labels that descend from the branch. When a coalescence occurs, it should be understood as the merging of two particular labeled gene tree branches. Likewise, when a migration occurs, a particular gene tree branch migrates to a particular species tree branch. Let $$L$$ be the set of tip labels of $$G$$, and let $${\mathcal{P}}(L)$$ be the set of all partitions of $$L$$. At any time, the set of gene tree branches can be regarded as some partition $$P \in {\mathcal{P}}(L)$$, and each branch as a member of $$P$$. The branches of $$S$$ could be labeled in a similar manner to $$G$$, but for convenience, we assume they have been labeled with the numbers $$\{1, \dots, n\}$$ during the period in which there are $$n$$ branches, and that branches $$n$$ and $$n-1$$ merge to form a branch $$n-1$$ when a speciation event is encountered. The state space of the Markov chain during the period with $$n$$ branches consists of all possible assignments of all members of $${\mathcal{P}}_n(L)$$ to the branches of $$S$$. Each state is a pair $$(P,f),$$ where $$P \in {\mathcal{P}}_n(L)$$ and $$f$$ is any map from $$P$$ to $$\{1, \dots, n\}$$, assigning gene tree branches to species tree branches. We use the set theory notation $$X^Y$$ to denote the set of all maps from set $$Y$$ to set $$X$$. So, we can write the state space $${\mathcal{A}}_n$$ as \begin{equation*} {\mathcal{A}}_n = \{ (P,f)\ : \ P \in {\mathcal{P}}(L) \ \wedge \ f \in \{1, \dots, n\}^P \}. \end{equation*} It has size \begin{equation*} |{\mathcal{A}}_n| = \sum_{P \in {\mathcal{P}}(L)} n^{|P|}. \end{equation*} There is an instantaneous rate matrix $$Q$$ of size $$|{\mathcal{A}}_n| \times |{\mathcal{A}}_n|$$. The off-diagonal rows of $$Q$$ are non-negative, the rows of $$Q$$ sum to zero, and the diagonal entries are less than or equal to zero. Note that although $$Q$$ is enormous for large $$|L|$$ and $$n$$, it is sparse, since the number of states which can be reached from a given state by a single migration or coalescence is much smaller than $$|{\mathcal{A}}_n|$$. Basic properties of Markov chains (in particular the fact that rows of $$Q$$ sum to zero) ensure that given a starting distribution over states such that \begin{equation*} \sum_{(P,f) \in {\mathcal{A}}_n} \Pr(P,f) = 1, \end{equation*} this remains true throughout the process, and in particular just before a merging of species tree branches. At such a merge, the partitions $$P$$ are unchanged, but the state space changes. Once we are in the root branch of the species tree, the process reduces to the Kingman coalescent, which is a (normalized) density. Consider the case just above the root, where $$n=2$$. We have \begin{equation*} \sum_{(P,f) \in {\mathcal{A}}_2} \Pr(P,f) = \sum_{P \in {\mathcal{P}}(L)} \ \sum_{f \in \{1,2\}^P} \Pr(P,f). \end{equation*} Each $$P$$ consist of subsets $$L_1, \dots, L_m$$, and as $$f$$ runs over the maps from $$P$$ to $$\{1,2\}$$, it runs over exactly those assignments of these subsets to $$\{1,2\}$$ which result in all of them ending up in the root just after the merge. Thus, \begin{equation*} \sum_{f \in \{1,2\}^P} \Pr(P,f) = \sum_{f \in \{1\}^P} \Pr(P,f), \end{equation*} where the left hand side applies just before the merge and the right hand side applies just after the merge. It follows that \begin{equation*} \sum_{(P,f) \in {\mathcal{A}}_2} \Pr(P,f) = \sum_{(P,f) \in {\mathcal{A}}_1} \Pr(P,f), \end{equation*} where again the left hand side applies just before and the right hand side applies just after the merge. We can apply a similar argument to merges when $$n > 2$$ to establish that $$\int \Pr(G|S) dG = 1$$. This evolutionary model will be referred to as the “tree-island model.” Integrating Out Population and Migration Parameters Suppose the species tree has $$s$$ tips. There are $$2s\,{-}\,1$$ species tree branches, including the root branch. Suppose the migration rate from branch $$b$$ to branch $$d$$ is $$m_{bd}$$. These migration rates follow the same conventions as Hey and Nielsen (backwards from present, scaled by population size). As in Jones (2016), the population size parameter $$\theta_b$$ for branch $$b$$ is equal to $$N_b \mu_b$$, where $$N_b$$ is the effective population size and $$\mu_b$$ is the mutation rate for the branch. For locus $$j$$, the effective number of gene copies is obtained from $$N_b$$ by multiplying by a factor $$p_j$$ (sometimes called the “ploidy”) for gene $$j$$. The time (going back from zero at present) is divided into a number of intervals $$\tau_i$$ ($$i \in {\mathcal{I}}$$) by the times of the events and species tree node heights. The set of species tree branches which exist during the $$i$$th interval is denoted by $${\mathcal{B}}_i$$, and we set $$s_i = |{\mathcal{B}}_i|$$. The number of lineages in gene tree $$j$$ which belong to species tree branch $$b$$ during the $$i$$th interval is $$n_{jbi}$$. The set of intervals which end in a coalescence is $${\mathcal{I}}_{\rm coal}$$, and the set which end in a migration is $${\mathcal{I}}_{\rm mig}$$. See Figure 1, where $$i \in {\mathcal{I}}_{\rm mig}$$, $$i+1$$ is in neither, and $$i+2 \in {\mathcal{I}}_{\rm coal}$$. The rate at which the next event occurs is $$(\kappa_i + \mu_i)$$ where, \begin{equation*}\label{eq:coalrate} \kappa_i = \sum_j \sum_{b \in {\mathcal{B}}_i} \left( \binom{n_{jbi}}{2}p_j^{-1} \theta_b^{-1} \right) \end{equation*} is the total rate for coalescent events and $$\label{eq:migrate} \mu_i = \sum_j \sum_{b \in {\mathcal{B}}_i} \left( n_{jbi} \sum_{d \in {\mathcal{B}}_i\setminus b} m_{bd} \right)$$ (1) is the total rate for migration events. Here, we are summing the nonzero off-diagonal elements of a row of $$Q$$ in order to find $$Q_{x,x} = -(\kappa_i + \mu_i)$$ for the current state $$x$$. We then need $$Q_{x,y}$$ where $$y$$ is the next state. If it is a coalescence, $$Q_{x,y} = p_{j_i}^{-1} \theta_{b_i}^{-1}$$, where $$j_i$$ is the gene tree containing the coalescence, and $$b_i$$ is the species tree branch in which it occurs. If it is a migration, $$Q_{x,y} = m_{b_i d_i}$$, where $$b_i$$ and $$d_i$$ are the source and destination branches. Figure 1. View largeDownload slide Three time intervals. The figure shows a section of a species tree (in pale gray) and some gene tree branches (black lines). The first (uppermost) time interval of length $$\tau_i$$ ends in a migration, the second interval ends with a species tree node, and the third with a coalescence. Figure 1. View largeDownload slide Three time intervals. The figure shows a section of a species tree (in pale gray) and some gene tree branches (black lines). The first (uppermost) time interval of length $$\tau_i$$ ends in a migration, the second interval ends with a species tree node, and the third with a coalescence. Denoting the set of migration rates by $$M$$ and the set of population size parameters by $$\Theta$$, we have the probability density for a gene tree: \begin{align*}\label{eq:coalmigrate} f(G; S, \Theta, M) & = \prod_{i \in {\mathcal{I}}_{\rm coal}} \!\!\! p_{j_i}^{-1} \theta_{b_i}^{-1} \times \\ & \prod_{i \in {\mathcal{I}}_{\rm mig}} \!\!\! m_{b_i d_i} \prod_i \exp (- (\kappa_i + \mu_i) \tau_i). \end{align*} This can be factored into a coalescence part and a migration part. Then, our aim is to rearrange the terms in the coalescence part so that it is a product over species tree branches, and then rearrange the terms in the migration part so that it is a product over ordered pairs of species tree branches. The result will be a product of terms, in which each term contains one population size parameter $$\theta_b$$ or one migration parameter $$m_{bd}$$. As we will see, this enables us to integrate out these parameters if conjugate priors (or mixtures of conjugate priors) are assumed. We put \begin{equation*} f(G; S, \Theta, M) = f_{\rm coal} (G; S, \Theta) \ \ f_{\rm mig} (G; S, M) \end{equation*} where \begin{equation*} f_{\rm coal} (G;S, \Theta) = \prod_{i \in {\mathcal{I}}_{\rm coal}} \!\!\! p_{j_i}^{-1} \theta_{b_i}^{-1} \ \exp \bigg( - \sum_i \tau_i \kappa_i \bigg) \end{equation*} and $$\label{fmig} f_{\rm mig} (G;S, M) = \prod_{i \in {\mathcal{I}}_{\rm mig}} \!\!\! m_{b_i d_i} \ \exp \bigg( - \sum_i \tau_i \mu_i \bigg)$$ (2) First we deal with $$f_{\rm coal}$$. We have \begin{equation*} \prod_{i \in {\mathcal{I}}_{\rm coal}} p_{j_i}^{-1} \theta_{b_i}^{-1} = \prod_j \prod_b (p_j \theta_b)^{-k_{jb}}, \end{equation*} where $$k_{jb}$$ is the number of coalescences in gene tree $$j$$ in branch $$b$$. Next \begin{align*} \sum_i \tau_i \kappa_i & = \sum_i \tau_i \sum_j \sum_{b \in {\mathcal{B}}_i}\binom{n_{jbi}}{2}p_{j_i}^{-1} \theta_b^{-1} \\ & = \sum_{b} \sum_j \sum_{i: b \in {\mathcal{B}}_i} \binom{n_{jbi}}{2}p_{j}^{-1} \tau_i \theta_b^{-1} \end{align*} so \begin{equation*}\label{eq:msc} f_{\rm coal}(G;S, \Theta, M) = \prod_b r_b \theta_b^{-q_b} \exp \big( - \gamma_b \theta_b^{-1} \big) \mathrm{, \ \ where} \end{equation*} \begin{align}\label{eq:qrgamma} q_b &= \sum_j k_{jb} \mathrm{, \ \ \ \ \ \ } r_b = \prod_j p_j^{-k_{jb}} \mathrm{, \ \ \ \ and \ \ } \nonumber \\ \gamma_b &= \sum_j \sum_{i: b \in {\mathcal{B}}_i} \binom{n_{jbi}}{2}p_{j}^{-1} \tau_i. \end{align} (3) As written, there are time intervals in branch $$b$$ for events during which no change occurs in branch $$b$$. For the computation of $$\gamma_b$$, it is only necessary to take into account coalescences within branch $$b$$ and migrations in and out of branch $$b$$, since between these events, $$n_{jbi}$$ is constant. Equation (3) is now of the same form as equation (2) of Jones (2016). The only difference is that $$\gamma_b$$ accounts for migrations in and out of branch $$b$$. This means the population size parameters can be integrated out as in Jones (2016). Now, we turn to the migration part $$f_{\rm mig}$$. Let $${\mathcal{O}}$$ be the set of contemporaneous pairs of branches in $$S$$. We have \begin{align*} \sum_i \tau_i \mu_i &= \sum_i \tau_i \sum_j \sum_{b \in {\mathcal{B}}_i} n_{jbi} \sum_{d \in {\mathcal{B}}_i\setminus b} m_{bd} \\ &= \sum_{(b,d) \in {\mathcal{O}}} \ \sum_{i: b,d \in\ {\mathcal{B}}_i} \!\!\! \tau_i \sum_j n_{jbi} m_{bd}. \end{align*} Thus, \begin{equation*}\label{fmig2} f_{\rm mig} (G;S, \Theta, M) = \prod_{(b,d) \in {\mathcal{O}}} m_{bd}^{n_{bd}} \exp \left( - \zeta_{bd} m_{bd} \right), \end{equation*} where $$n_{bd}$$ is the total number of migrations from $$b$$ to $$d$$ and $$\label{eq:zeta} \zeta_{bd} = \sum_{i: b,d \in\ {\mathcal{B}}_i} \tau_i \sum_j n_{jbi}.$$ (4) The term $$\zeta_{bd}$$ can be interpreted as the total intensity of migrations from $$b$$ to $$d$$ during the time in which both branches $$b$$ and $$d$$ exist. If a conjugate prior is assumed, namely that $$m_{bd} \sim {\mathcal{G}}(\alpha_{bd}, \beta_{bd})$$ for all $$b,d$$ where $${\mathcal{G}}$$ is the gamma distribution, then we get a contribution to the posterior which is $$\label{eq:migIO} \prod_{(b,d) \in {\mathcal{O}}} \int_0^{\infty} \frac{\beta_{bd}^{\alpha_{bd}}}{\Gamma(\alpha_{bd})} m_{bd}^{\alpha_{bd}} \exp(-\beta m_{bd}) \times \\ \ \ m_{bd}^{n_{bd}}\exp \left( - \zeta_{bd} m_{bd} \right) \ {\mathrm{d}} m_{bd} \\ = \prod_{(b,d) \in {\mathcal{O}}} \frac{\Gamma(n_{bd} + \alpha_{bd})}{\Gamma(\alpha_{bd})} \frac{\beta^\alpha_{bd}}{\left(\beta_{bd} + \zeta_{bd} \right)^{n_{bd} + \alpha_{bd}} }.$$ (5) Equations (4) and (5) provide the information needed to implement the calculation for the migration part of the posterior. Each ordered pair of contemporaneous branches $$(b,d)$$ can have a different prior. For example, it is possible to represent the prior expectation that migration rates are lower between more distantly related branches. This model, where migration rates are independent, will be called the “flexible” model. The calculation in equation (4) is slow when the number of tips in the species tree is large. A much simpler model is to assume that $$m_{bd}$$ is the same value $$m$$ for all $$b,d$$. In this case, equation (1) reduces to \begin{equation*}\label{eq:coalmigratesimple} \mu_i = m \sum_j \sum_{b \in {\mathcal{B}}_i} n_{jbi} (s_i-1). \end{equation*} The double sum is equal to the total number of gene tree lineages $$N_i$$ during time interval $$i$$. Then, we have \begin{align*}\label{fmigsimple} f_{\rm mig} (G;S, M) &= m^{N} \exp \bigg( - \sum_i \tau_i \mu_i \bigg) \\ &= m^{N} \exp \bigg( - m \sum_i \tau_i N_i (s_i-1) \bigg), \end{align*} where $$N$$ is the total number of migrations. The parameter $$m$$ can be integrated out. This will be called this model the “simple” model. How the Gene Tree is Embedded This section describes the embedding parameters, and how they are used to embed the gene trees. The embeddings are restricted by ignoring ones which are unlikely when the migration rates are small enough. The hope is that the method will still explore a region of parameter space which includes most of the probability content. Embeddings are restricted by applying the following rules: there is at most one migration in a single gene tree branch at most one of the child branches of a gene tree node contains a migration there are no more migrations than needed (in a sense described below) A pair of child branches of a gene tree node will be called a sister-pair. The embedding parameters $$E_j$$ consist of two values $$\xi_{ji}, \eta_{ji} \in [0,1]$$ for each internal node $$i$$ of the $$j$$th gene tree. See Figure 2. The parameter $$\xi_{ji}$$ determines where along a sister-pair a migration may occur, if a migration is needed in the embedding. Thus it determines which child branch of node $$i$$ is capable of migrating, as well as the time of the migration if there is one. All the nodes in the species tree have their children labeled as “left” and “right,” so that $$[0,1]$$ can be mapped unambiguously onto the sister-pair. The other parameter $$\eta_{ji}$$ specifies which of the node’s child branches to use when choosing a destination species branch for an introgression. If the migration is between sister branches of the species tree, there is only one choice for the destination. It may happen that the sister branch is too ancient, in which case several destination species branches are possible. The possible destination branches are found, and $$\eta_{ji}$$ is used to choose between them by dividing the interval [0,1] equally into the appropriate number of parts. Figure 2. View largeDownload slide This shows $$\xi_{ji}$$ and $$\eta_{ji}$$ for a one gene tree and one node (with subscripts dropped). There is a migration from C into A. The parameter $$\xi$$ determines how far along the sister-pair the migration occurs, and $$\eta$$ determines whether the destination branch is A or B. Figure 2. View largeDownload slide This shows $$\xi_{ji}$$ and $$\eta_{ji}$$ for a one gene tree and one node (with subscripts dropped). There is a migration from C into A. The parameter $$\xi$$ determines how far along the sister-pair the migration occurs, and $$\eta$$ determines whether the destination branch is A or B. The parameters $$\xi_{ji}$$ and $$\eta_{ji}$$ are changed by operators during the MCMC algorithm, regardless of whether or not they are being used to embed a gene tree. This is a simpler alternative to implementing rjMCMC operators which account for changes in dimension. The prior $$\Pr(E_j)$$ for $$E_j$$ is are independent uniform distributions on $$[0,1]$$ for each $$\xi_{ji}$$ and $$\eta_{ji}$$. The first rule above is straightforward. The definition of $$\xi_{ji}$$ enforces the second rule. The third rule is applied recursively from the tips. Suppose $$x$$ is the $$i$$th node of the $$j$$th gene tree, and suppose both child nodes of $$x$$ have been assigned to branches in the species tree. If it is possible to assign $$x$$ to a species tree branch without an migration in either child branch of $$x$$, then this is done. Otherwise $$x$$ is assigned using the species tree branch to which its nonmigrating child has been assigned: it will be the same branch, or an ancestor of that branch, depending on the height of $$x$$. The height of the migration is fixed by $$\xi_{ji}$$. The migrating child branch of $$x$$ starts in the species tree branch that the migrating child has been assigned. It stays in this branch, or an ancestor of it, until the migration height. It will then migrate to the same species tree branch as $$x$$, or a descendant of it. If there is more than one descendant of the species tree branch of $$x$$ at this height, values from $$\eta_{j}$$ are used to choose one. Properties of the Embedding Scheme Different embeddings of the same gene tree in the same species tree are obtained by changing $$\xi_{ji}$$ and $$\eta_{ji}$$ during the MCMC sampling. Figure 3 shows some examples. Case (a) is simple. No migrations are needed to embed the gene tree, so embeddings with one or more migrations are ignored. Case (b) requires one migration, and an embedding with two migrations in the same branch is ignored. Case (c) requires two migrations. The embedding on the left is ignored since it has two sister branches with migrations. The embedding on the right is one of four embeddings that is considered. Figure 3. View largeDownload slide Some examples of embeddings for three gene trees in a, b, c. On the left are embeddings that are ignored. On the right are embeddings that are considered. Figure 3. View largeDownload slide Some examples of embeddings for three gene trees in a, b, c. On the left are embeddings that are ignored. On the right are embeddings that are considered. Proposition. Given any set of particular values for $$\xi$$ and $$\eta$$, and the rules above, any gene tree can be embedded in any species tree. For any $$G_j$$ and $$S$$, the set of embeddings as $$\xi$$ and $$\eta$$ vary include at least one with a minimal number of migrations. Proof: The first claim is straightforward, using recursion starting at the tips, and following the description above (for applying the third rule). For second claim, suppose it is false and consider the set $$M$$ of minimal embeddings (those with a minimal number of migrating branches). Call a node both of whose child branches migrate a “double node.” If there is a member of $$M$$ has no double nodes, this embedding will be chosen for some $$\xi$$ and $$\eta$$, so every member of $$M$$ has at least one double node. Now restrict attention to the subset $$\bar{M}$$ of $$M$$ of embeddings which have as few as possible double nodes. Finally, choose an embedding $$B$$ from $$\bar{M}$$ so that a double node $$x$$ is as near to the root as possible. If $$x$$ is the root, it can be moved into the same branch as one of its children, or an ancestor of that branch, and one migration can be removed, contradicting the definition of $$M$$. If $$x$$ is not the root, it can be again moved into the same branch as one of its children, but now the branch between $$x$$ and its parent may need to become migrating. If the sister branch to $$x$$ is already migrating, we have an embedding with the same number of migrations, but a double node closer to the root than $$x$$, contradicting the definition of $$B$$. If the sister branch to $$x$$ is not migrating, there is an embedding with fewer double nodes than $$B$$, contradicting the definition of $$\bar{M}$$. End of proof. The method does not consider every embedding which has a minimal number of migrations (e.g., Fig. 3c). Some embeddings which are considered are not minimal. For example, consider a species tree (A,B) and a gene tree ((a1,b1),b2) with three tips a1, b1, and b2, where a1 belongs to species A and the others to B. Suppose the species tree has greater height than the gene tree. For some values of $$\xi$$ and $$\eta$$, DENIM will assign the coalescence (a1,b1) to A, which means two migrations are needed to embed the gene tree, although it is possible to embed it with only one migration using other values of $$\xi$$ and $$\eta$$. Implementation Notes For a standard multispecies coalescent analysis, operators which change the species tree and the gene trees in a co-ordinated way are beneficial (Jones 2016; Ogilvie et al. 2017). These operators rely on, and preserve, compatibility betwfeen the species tree and gene trees under the multispecies coalescent. In the presence of migration, any gene tree is compatible with any species tree, and these coordinated operators cannot be used as they are. In the current implementation, DENIM uses the standard tree operators implemented in BEAST2 for the species tree and the gene trees. A couple of simple MCMC operators were implemented for the embedding parameters. As noted above, they are changed by operators regardless of whether or not they are being used to embed a gene tree. In general, applying MCMC operators to unused parameters could be very inefficient, and rjMCMC would be preferable, but here the operators for $$\xi$$ and $$\eta$$ are very fast. DENIM is implemented in the BEAST2 framework (Bouckaert et al. 2014), and so benefits from the flexible site models, substitution models, and others available in BEAST2. An analysis can be set up using the graphical interface Beauti. When using the flexible model, different priors can be used for different pairs of species tree branches. DENIM provides two schemes for specifying these priors. In the first scheme, the prior mean for the migration rate between branches depends on how closely related the species are, with the prior mean decreasing as the number of branches between them increases. In the second, the migration rate decays with the time between the most recent common ancestor of the two branches and the midpoint of the interval during which both branches exist. The details are described in the manual supplied with DENIM. DENIM puts some annotations into the species trees sampled during the MCMC process, which give more detail about the migrations. These can be analyzed using the command line program MigrationAnalyser.jar which can be downloaded from http://indriid.com/software.html. Tests Using Simulated Data The method was tested using the simulated data set from Leaché et al. (2014), and with no data. The same data set was also analyzed using *BEAST as implemented in BEAST2 (and not to be confused with StarBEAST2). In this data set, there are scenarios for 4 species and 10 species. There are 4 sequences per locus for each species, except a single outgroup species, for which there is just one sequence. There are thus 13 sequences per locus in the 4-species scenarios and 37 for the 10-species scenarios. In all cases, there are 10 loci, and the sequences are 1000 bp long. The data set is available in the Supplementary Material available on Dryad at http://dx.doi.org/10.5061/dryad.0jt38. Most of the scenarios consist of a migration pattern specifying which branches are involved, and a migration rate (0.001, 0.01, 0.1, or 1.0, corresponding to one migrant per 1000, 100, 10, or 1 generations). In these cases, migration is in both directions. There are also scenarios which consist of a single allele or a single migrant occurring in one direction at time zero. Figure 4 shows the true species trees and provides a guide to the migration patterns; Leaché et al. (2014) should be consulted for full details. Figure 4. View largeDownload slide The true species trees and guide to the migration patterns. The true species tree topologies for the 4-species scenarios (left) and 10-species scenarios (center, right). Most of the scenarios have migration between one pair of branches and are named accordingly. Thus “B-C” means migration between B and C, as shown on the left, and “AB-C” means migration between AB and C, where AB is the ancestor of A and B. Two migration patterns for the 10-species scenarios are given more convenient names as shown. “DeepAnc” means migration between ABCD and EFGH. “4-island” means migration between all six pairs of branches chosen from E, F, G, and H. Figure 4. View largeDownload slide The true species trees and guide to the migration patterns. The true species tree topologies for the 4-species scenarios (left) and 10-species scenarios (center, right). Most of the scenarios have migration between one pair of branches and are named accordingly. Thus “B-C” means migration between B and C, as shown on the left, and “AB-C” means migration between AB and C, where AB is the ancestor of A and B. Two migration patterns for the 10-species scenarios are given more convenient names as shown. “DeepAnc” means migration between ABCD and EFGH. “4-island” means migration between all six pairs of branches chosen from E, F, G, and H. For all these tests, an exponential prior for the migration rate was used; the mean of this prior was varied in some tests. Other priors could be used, and as long as a prior makes large migration rates very unlikely, the results should be similar. No use was made of the options to use different prior means for different pairs of species tree branches, so within each analysis, the same prior is used for all pairs. The MCcoal program (Yang 2015) was used to generate the sequence files. The MCcoal program was extended slightly to make it produce information about each individual migration in the simulations, instead of the normal summary information. The edited source code is available from http://indriid.com/workingnotes2017.html. The MCcoal control files were the same as those of Leaché et al. (2014), except that only the first 50 replicates were used, and the 10-species scenarios were augmented by adding scenarios for a migration rate of 0.01, making 25 migration patterns in total. In Leaché et al. (2014), only the two highest rates 0.1 and 1.0 were used in the 10-species scenarios, but the results from the 4-species scenarios showed that DENIM generally breaks down between 0.01 and 0.1, so the 0.01 rate is an interesting one for DENIM. Prior Only DENIM was tested by running it with no data. The scenarios used all had six species. The prior on the species tree was a pure birth (Yule) model with growth rate 100. This means the expected height of the species tree is $$0.01(1/2+1/3+1/4+1/5+1/6) = 0.0145$$. The number of individuals $$i$$ was 3 or 9 per species, the number $$g$$ of loci was 3 or 9. The population scaling parameter in DENIM (the value of $$N\mu$$ where $$N$$ is the effective population size and $$\mu$$ is the mutation rate in every branch) was set 0.0005, 0.005, 0.05, producing small, medium, and large amounts of ILS. Both the flexible and simple models were used, resulting in a total of $$2\times 2\times 2\times 3\times 2 = 48$$ scenarios. The analyses were run for 30M ($$i=3,g=3$$), 90M ($$i=3,g=9$$ and $$i=9,g=3$$), and 300M ($$i=9,g=9$$) generations. These long runs (up to around 8 h each) proved necessary to obtain convergence. The Leaché Data Set The simulated data set of Leaché et al. (2014) was used with the simple model. An prior mean of 0.005 was used for the migration rate. For most replicates, the chain length was 20M for the 4-species scenarios and 30M for the 10-species scenarios. When this produced an effective sample size (ESS) for the posterior of less than 200, as estimated by CODA (Plummer et al. 2006), the analysis was extended until an ESS of at least 200 was obtained. States were logged every 5000 generations, and burnin was set to 20%. The settings for site and clock models, for both *BEAST and DENIM, were similar to those used by Leaché et al. (2014). DENIM uses a different population model to *BEAST, so the priors for the population size parameters are somewhat different. Site models were linked. A GTR model of substitution was used, with base frequencies equal. The clock models were strict but unlinked. The first locus had clock rate 1, and the others were estimated. The Yule (pure birth) model was used for the species tree. The priors were set as follows. Substitution rates relative to rateCT: Gamma(0.05,20). Relative clock rates: lognormal(0,1). Growth rate for the species tree: lognormal(5,2). PopPriorScale: lognormal(-5,2). Further experiments were conducted in order to assess the sensitivity of the method to the choice of prior. These just used the 4-species scenarios, to reduce the computational resources required. In a Bayesian approach, if the results are sensitive to the choice of prior, it indicates that there is insufficient information in the data, and that the prior is dominating the posterior. The settings were the same as above, except that the flexible model was used, and the prior means for the migration rate were varied: the set of values (0.00125, 0.005, 0.02, 0.08) were used. The flexible model was used since the experiments with priors suggest that it is likely to be better with high migration rates in the prior. Results on Simulated Data Evaluation Measures Several measures are used for assessing accuracy. One measure is the posterior probability of the true species tree topology. The second is the probability coverage. This is the proportion of replicates where the true species tree topology is in the 95% credible set and was used by Leaché et al. (2014) to evaluate *BEAST. It provides a guide as to how often the method is likely to produce misleading results about topology. Neither of these two measures does not take into account errors in estimated node heights. The third measure is based on the branch score of Kuhner and Felsenstein (1994), adapted for rooted trees. It is defined in the original *BEAST paper (Heled and Drummond 2010, Equation (9)). In this article, the score is not normalized by tree length. It accounts for differences in topology and branch lengths, and the entire posterior is evaluated by finding the mean distance between the MCMC samples of the species tree and the true tree. This provides an overall measure of the method’s accuracy. There are also two measures aimed at evaluating how well DENIM can identify which loci are migrating. Firstly, for each locus, DENIM outputs a statistic which counts the number of migrations in the MCMC sample. From this, the posterior probability of a locus containing a migration can be found, and thresholded at 0.5 to produce a Boolean value which indicates whether the locus contains a migration. Evaluated over the replicates for each scenario, this produces a fraction of false positives, where a true migration is absent but incorrectly inferred; and a fraction of true positives, where a migration is correctly detected. A second measure is the posterior means of the migration counts. This is evaluated for the two cases in which a true migration is either absent or present. Prior Only Figure 5 shows the results from using DENIM with no data. In the upper graph, the values should all be close to zero, corresponding to a ratio (estimated height)/(true height) near 1. Instead, some values are much larger, corresponding to a ratio of 10 or more. The large estimates of species tree height are accompanied by large migration counts. Thus, with a large number of loci and individuals, the method breaks down even with small prior mean for the migration rate. The flexible model behaves better than simple model. This behavior means DENIM must be used with caution when there is little phylogenetic signal in the data. An obvious warning sign is the number of migrations inferred by DENIM is more than a small proportion of the number of coalescences. It is always possible to prevent this by using a prior with a small enough mean, but it may be that the only conclusion that can be drawn is that there is too little signal or too much migration for any sensible estimate of the species tree to be possible using DENIM. Figure 5. View largeDownload slide Results with no data. The y-axis in the top graph is the log-ratio of the estimated species tree height to the true species tree height. i is the number of individuals per species, and g is the number of loci. S, M, and L represent small, medium, and large amounts of ILS. The y-axis in the bottom graph is the proportion of migrations per coalescence (or per sister-pair). Circles represent the simple model, crosses represent the flexible model. Large symbols correspond to a prior mean of 0.005, small ones to a prior mean of 0.001. Figure 5. View largeDownload slide Results with no data. The y-axis in the top graph is the log-ratio of the estimated species tree height to the true species tree height. i is the number of individuals per species, and g is the number of loci. S, M, and L represent small, medium, and large amounts of ILS. The y-axis in the bottom graph is the proportion of migrations per coalescence (or per sister-pair). Circles represent the simple model, crosses represent the flexible model. Large symbols correspond to a prior mean of 0.005, small ones to a prior mean of 0.001. It is not understood why the approximation used by DENIM leads to this behavior, but the following may be an explanation. Consider the case of a species tree $$S$$ and a gene tree $$G$$ with just two tips each. The parameter space can be divided into regions corresponding to 0,1,2,... migrations. If there are none, DENIM will sample correctly from the prior. Given the small prior mean for the migration rate, the regions corresponding to 2 or more migrations are not significant. The main bias arises where there is one migration. For $${\mathrm{height}} (S) > {\mathrm{height}} (G)$$, a migration is needed, and DENIM considers this case. If $${\mathrm{height}} (S) < {\mathrm{height}} (G)$$, DENIM only considers the possibility of no migration. The region of parameter space with $${\mathrm{height}} (S) < {\mathrm{height}} (G)$$ is most affected by the approximation used by DENIM, so that small values of height(S) are discriminated against. The general effect of this sort of bias apparently gets worse as the number of loci and individuals increases. Simulated Data: Effective Sample Sizes Although Leaché et al. (2014) report that chain lengths of 10M were sufficient for *BEAST to achieve ESSs above 200 in the 4-species scenarios, I found that one replicate (replicate 3 for scenarioB-C-1-allele) had only reached an ESS of 30 after 20M generations, and two others were below 100. Leaché et al. (2014) used *BEAST as implemented in BEAST(v1) instead of BEAST2. These should be very similar; the only major difference I am aware of is that in BEAST2, a different initialization for the gene trees and species trees is used. In the 4-species scenarios, of a total of 850 replicates, there were 21 (*BEAST) and 6 (DENIM) with posterior ESSs below 200 after 20M generations. In the 10-species scenarios, of a total of 1250 replicates, there were 167 (*BEAST) and 46 (DENIM) with posterior ESSs below 200 after 30M generations. For both the 4-species and 10-species cases, DENIM achieved about 15% greater ESS values than *BEAST in terms of ESSs per million generations, based on median values over all replicates. In the 4-species scenarios, DENIM was about 30% slower than *BEAST in terms of seconds per generation, whereas in the 10-species scenarios, the times were very similar. The end result in terms of ESSs per hour is that DENIM was about 15% slower in the 4-species scenarios, and 15% faster in the 10-species scenarios. As mentioned earlier, all analyses were continued until an ESS for the posterior of at least 200 was obtained. ESSs for other parameters were sometimes below 200. In particular, with DENIM, the species tree root height often had the worst ESS among all parameters. It is not clear why this should be. The root height was generally estimated fairly accurately, and the operators affecting it appeared to be working satisfactorily. Simulated Data: Posterior Probability of True Topology Figures 6 and 7 show the posterior probability of the true species tree topology for DENIM and *BEAST. For most scenarios, the results are similar, but where there is a small amount of migration between non sister species (B-C_0.001, B-C_0.01, B-C-1-allele, and B-C-1-mig in Fig. 6; and 4-island_0.01, F-G_0.01, H-I_0.01, D-E-1-allele, D-E-1-mig, F-G-1-allele, and F-G-1-mig in Fig. 7) DENIM is better, and *BEAST often fails completely. Figure 6. View largeDownload slide Posterior probability of the true species tree topology for the 4-species scenarios, based on 50 replicates. For each scenario, the results for DENIM are in midgray and below those for *BEAST which are in pale gray. See the text for a description of the scenarios. The simple model was used with prior mean of 0.005. Figure 6. View largeDownload slide Posterior probability of the true species tree topology for the 4-species scenarios, based on 50 replicates. For each scenario, the results for DENIM are in midgray and below those for *BEAST which are in pale gray. See the text for a description of the scenarios. The simple model was used with prior mean of 0.005. Figure 7. View largeDownload slide Posterior probability of the true species tree topology for the 10-species scenarios. Other details as Figure 6. Figure 7. View largeDownload slide Posterior probability of the true species tree topology for the 10-species scenarios. Other details as Figure 6. Simulated Data: Coverage Table 1 shows the coverage probability for all scenarios. This can be compared with Tables 1 and 2 of Leaché et al. (2014). As expected, the two implementations of *BEAST give similar results. As with the previous measure, DENIM performs better than *BEAST in the cases of small amounts of migration between nonsister species, and the migrations at time zero between nonsister species, and is otherwise similar. Table 1. Probability coverage Scenario M Coverage DENIM *BEAST No migration 4 0 1.0 1.0 10 0 0.90 0.94 Isolation–migration (A-B or E-F) 4 0.001 1.0 1.0 4 0.01 1.0 1.0 4 0.1 0.98 1.0 4 1.0 1.0 1.0 10 0.01 0.94 0.94 10 0.1 0.96 0.94 10 1.0 1.0 0.98 4-island 10 0.01 0.90 0.38 10 0.1 0.68 0.30 10 1.0 0.62 0.56 Paraphyly (B-C or F-G) 4 0.001 1.0 0.96 4 0.01 0.98 0.50 4 0.1 0.54 0.28 4 1.0 0.14 0.10 10 0.01 0.92 0.38 10 0.1 0.42 0.16 10 1.0 0.12 0.08 Deep paraphyly (H-I) 10 0.01 0.88 0.08 10 0.1 0.16 0.0 10 1.0 0.02 0.0 Ancestral (AB-C or EF-G) 4 0.001 1.0 0.98 4 0.01 1.0 1.0 4 0.1 0.98 0.98 4 1.0 1.0 0.98 10 0.01 0.98 0.96 10 0.1 0.96 0.96 10 1.0 0.86 0.90 Deep ancestral 10 0.01 0.96 0.94 10 0.1 0.98 0.98 10 1.0 0.90 0.88 Single migrant 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 0.98 0.06 10 Sisters (F-E) 0.90 0.92 10 Nonsisters (F-G) 0.90 0.10 Deep single migrant 10 Nonsisters (D-E) 0.88 0.0 Single locus introgression 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 1.0 0.48 10 Sisters (F-E) 0.96 0.92 10 Nonsisters (F-G) 0.92 0.30 Deep single locus introgression 10 Nonsisters (D-E) 0.90 0.02 Scenario M Coverage DENIM *BEAST No migration 4 0 1.0 1.0 10 0 0.90 0.94 Isolation–migration (A-B or E-F) 4 0.001 1.0 1.0 4 0.01 1.0 1.0 4 0.1 0.98 1.0 4 1.0 1.0 1.0 10 0.01 0.94 0.94 10 0.1 0.96 0.94 10 1.0 1.0 0.98 4-island 10 0.01 0.90 0.38 10 0.1 0.68 0.30 10 1.0 0.62 0.56 Paraphyly (B-C or F-G) 4 0.001 1.0 0.96 4 0.01 0.98 0.50 4 0.1 0.54 0.28 4 1.0 0.14 0.10 10 0.01 0.92 0.38 10 0.1 0.42 0.16 10 1.0 0.12 0.08 Deep paraphyly (H-I) 10 0.01 0.88 0.08 10 0.1 0.16 0.0 10 1.0 0.02 0.0 Ancestral (AB-C or EF-G) 4 0.001 1.0 0.98 4 0.01 1.0 1.0 4 0.1 0.98 0.98 4 1.0 1.0 0.98 10 0.01 0.98 0.96 10 0.1 0.96 0.96 10 1.0 0.86 0.90 Deep ancestral 10 0.01 0.96 0.94 10 0.1 0.98 0.98 10 1.0 0.90 0.88 Single migrant 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 0.98 0.06 10 Sisters (F-E) 0.90 0.92 10 Nonsisters (F-G) 0.90 0.10 Deep single migrant 10 Nonsisters (D-E) 0.88 0.0 Single locus introgression 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 1.0 0.48 10 Sisters (F-E) 0.96 0.92 10 Nonsisters (F-G) 0.92 0.30 Deep single locus introgression 10 Nonsisters (D-E) 0.90 0.02 Notes: Coverage for scenarios with continuous migration (above the line) and with single-locus introgression and migration of a single individual at time zero (below the line). Where the programs produce results which are substantially different, the better result is in bold. Table 1. Probability coverage Scenario M Coverage DENIM *BEAST No migration 4 0 1.0 1.0 10 0 0.90 0.94 Isolation–migration (A-B or E-F) 4 0.001 1.0 1.0 4 0.01 1.0 1.0 4 0.1 0.98 1.0 4 1.0 1.0 1.0 10 0.01 0.94 0.94 10 0.1 0.96 0.94 10 1.0 1.0 0.98 4-island 10 0.01 0.90 0.38 10 0.1 0.68 0.30 10 1.0 0.62 0.56 Paraphyly (B-C or F-G) 4 0.001 1.0 0.96 4 0.01 0.98 0.50 4 0.1 0.54 0.28 4 1.0 0.14 0.10 10 0.01 0.92 0.38 10 0.1 0.42 0.16 10 1.0 0.12 0.08 Deep paraphyly (H-I) 10 0.01 0.88 0.08 10 0.1 0.16 0.0 10 1.0 0.02 0.0 Ancestral (AB-C or EF-G) 4 0.001 1.0 0.98 4 0.01 1.0 1.0 4 0.1 0.98 0.98 4 1.0 1.0 0.98 10 0.01 0.98 0.96 10 0.1 0.96 0.96 10 1.0 0.86 0.90 Deep ancestral 10 0.01 0.96 0.94 10 0.1 0.98 0.98 10 1.0 0.90 0.88 Single migrant 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 0.98 0.06 10 Sisters (F-E) 0.90 0.92 10 Nonsisters (F-G) 0.90 0.10 Deep single migrant 10 Nonsisters (D-E) 0.88 0.0 Single locus introgression 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 1.0 0.48 10 Sisters (F-E) 0.96 0.92 10 Nonsisters (F-G) 0.92 0.30 Deep single locus introgression 10 Nonsisters (D-E) 0.90 0.02 Scenario M Coverage DENIM *BEAST No migration 4 0 1.0 1.0 10 0 0.90 0.94 Isolation–migration (A-B or E-F) 4 0.001 1.0 1.0 4 0.01 1.0 1.0 4 0.1 0.98 1.0 4 1.0 1.0 1.0 10 0.01 0.94 0.94 10 0.1 0.96 0.94 10 1.0 1.0 0.98 4-island 10 0.01 0.90 0.38 10 0.1 0.68 0.30 10 1.0 0.62 0.56 Paraphyly (B-C or F-G) 4 0.001 1.0 0.96 4 0.01 0.98 0.50 4 0.1 0.54 0.28 4 1.0 0.14 0.10 10 0.01 0.92 0.38 10 0.1 0.42 0.16 10 1.0 0.12 0.08 Deep paraphyly (H-I) 10 0.01 0.88 0.08 10 0.1 0.16 0.0 10 1.0 0.02 0.0 Ancestral (AB-C or EF-G) 4 0.001 1.0 0.98 4 0.01 1.0 1.0 4 0.1 0.98 0.98 4 1.0 1.0 0.98 10 0.01 0.98 0.96 10 0.1 0.96 0.96 10 1.0 0.86 0.90 Deep ancestral 10 0.01 0.96 0.94 10 0.1 0.98 0.98 10 1.0 0.90 0.88 Single migrant 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 0.98 0.06 10 Sisters (F-E) 0.90 0.92 10 Nonsisters (F-G) 0.90 0.10 Deep single migrant 10 Nonsisters (D-E) 0.88 0.0 Single locus introgression 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 1.0 0.48 10 Sisters (F-E) 0.96 0.92 10 Nonsisters (F-G) 0.92 0.30 Deep single locus introgression 10 Nonsisters (D-E) 0.90 0.02 Notes: Coverage for scenarios with continuous migration (above the line) and with single-locus introgression and migration of a single individual at time zero (below the line). Where the programs produce results which are substantially different, the better result is in bold. Table 2. Migration detection Scenario M Absent Present No migration 4 0 1/500 0/0 10 0 12/500 0/0 Isolation-migration (A-B or E-F) 4 0.001 1/494 3/6 4 0.01 3/403 49/97 4 0.1 2/131 79/369 4 1.0 0/20 2/480 10 0.01 4/396 52/104 10 0.1 2/106 114/394 10 1.0 0/5 5/495 4-island 10 0.01 19/130 355/370 10 0.1 4/4 490/496 10 1.0 0/0 421/500 Paraphyly (B-C or F-G) 4 0.001 1/489 7/11 4 0.01 0/401 81/99 4 0.1 2/127 173/373 4 1.0 2/21 67/479 10 0.01 10/406 80/94 10 0.1 13/114 146/386 10 1.0 5/5 78/495 Deep paraphyly (H-I) 10 0.01 22/309 157/191 10 0.1 10/23 280/477 10 1.0 0/0 133/500 Ancestral (AB-C or EF-G) 4 0.001 0/491 3/9 4 0.01 1/415 20/85 4 0.1 0/161 39/339 4 1.0 0/17 11/483 10 0.01 7/418 17/82 10 0.1 4/173 22/327 10 1.0 0/18 16/482 Deep ancestral 10 0.01 3/422 17/78 10 0.1 2/160 15/340 10 1.0 0/6 23/494 Single migrant 4 Sisters (B-A) 0/0 363/500 4 Nonsisters (B-C) 0/0 472/500 10 Sisters (F-E) 0/0 401/500 10 Nonsisters (F-G) 0/0 488/500 Deep single migrant 10 Nonsisters (D-E) 0/0 500/500 Single locus introgression 4 Sisters (B-A) 1/450 40/50 4 Nonsisters (B-C) 3/450 48/50 10 Sisters (F-E) 10/450 41/50 10 Nonsisters (F-G) 10/450 48/50 Deep single locus introgression 10 Nonsisters (D-E) 11/450 50/50 Scenario M Absent Present No migration 4 0 1/500 0/0 10 0 12/500 0/0 Isolation-migration (A-B or E-F) 4 0.001 1/494 3/6 4 0.01 3/403 49/97 4 0.1 2/131 79/369 4 1.0 0/20 2/480 10 0.01 4/396 52/104 10 0.1 2/106 114/394 10 1.0 0/5 5/495 4-island 10 0.01 19/130 355/370 10 0.1 4/4 490/496 10 1.0 0/0 421/500 Paraphyly (B-C or F-G) 4 0.001 1/489 7/11 4 0.01 0/401 81/99 4 0.1 2/127 173/373 4 1.0 2/21 67/479 10 0.01 10/406 80/94 10 0.1 13/114 146/386 10 1.0 5/5 78/495 Deep paraphyly (H-I) 10 0.01 22/309 157/191 10 0.1 10/23 280/477 10 1.0 0/0 133/500 Ancestral (AB-C or EF-G) 4 0.001 0/491 3/9 4 0.01 1/415 20/85 4 0.1 0/161 39/339 4 1.0 0/17 11/483 10 0.01 7/418 17/82 10 0.1 4/173 22/327 10 1.0 0/18 16/482 Deep ancestral 10 0.01 3/422 17/78 10 0.1 2/160 15/340 10 1.0 0/6 23/494 Single migrant 4 Sisters (B-A) 0/0 363/500 4 Nonsisters (B-C) 0/0 472/500 10 Sisters (F-E) 0/0 401/500 10 Nonsisters (F-G) 0/0 488/500 Deep single migrant 10 Nonsisters (D-E) 0/0 500/500 Single locus introgression 4 Sisters (B-A) 1/450 40/50 4 Nonsisters (B-C) 3/450 48/50 10 Sisters (F-E) 10/450 41/50 10 Nonsisters (F-G) 10/450 48/50 Deep single locus introgression 10 Nonsisters (D-E) 11/450 50/50 Notes: Migration detection for scenarios with continuous migration (above the line) and with single-locus introgression and migration of a single individual at time zero (below the line). The “Absent” column gives the fraction of false positives, and the “Present” column gives the fraction of true positives. These are bolded where DENIM correctly identifies a migration in 75% or more of the replicates. Table 2. Migration detection Scenario M Absent Present No migration 4 0 1/500 0/0 10 0 12/500 0/0 Isolation-migration (A-B or E-F) 4 0.001 1/494 3/6 4 0.01 3/403 49/97 4 0.1 2/131 79/369 4 1.0 0/20 2/480 10 0.01 4/396 52/104 10 0.1 2/106 114/394 10 1.0 0/5 5/495 4-island 10 0.01 19/130 355/370 10 0.1 4/4 490/496 10 1.0 0/0 421/500 Paraphyly (B-C or F-G) 4 0.001 1/489 7/11 4 0.01 0/401 81/99 4 0.1 2/127 173/373 4 1.0 2/21 67/479 10 0.01 10/406 80/94 10 0.1 13/114 146/386 10 1.0 5/5 78/495 Deep paraphyly (H-I) 10 0.01 22/309 157/191 10 0.1 10/23 280/477 10 1.0 0/0 133/500 Ancestral (AB-C or EF-G) 4 0.001 0/491 3/9 4 0.01 1/415 20/85 4 0.1 0/161 39/339 4 1.0 0/17 11/483 10 0.01 7/418 17/82 10 0.1 4/173 22/327 10 1.0 0/18 16/482 Deep ancestral 10 0.01 3/422 17/78 10 0.1 2/160 15/340 10 1.0 0/6 23/494 Single migrant 4 Sisters (B-A) 0/0 363/500 4 Nonsisters (B-C) 0/0 472/500 10 Sisters (F-E) 0/0 401/500 10 Nonsisters (F-G) 0/0 488/500 Deep single migrant 10 Nonsisters (D-E) 0/0 500/500 Single locus introgression 4 Sisters (B-A) 1/450 40/50 4 Nonsisters (B-C) 3/450 48/50 10 Sisters (F-E) 10/450 41/50 10 Nonsisters (F-G) 10/450 48/50 Deep single locus introgression 10 Nonsisters (D-E) 11/450 50/50 Scenario M Absent Present No migration 4 0 1/500 0/0 10 0 12/500 0/0 Isolation-migration (A-B or E-F) 4 0.001 1/494 3/6 4 0.01 3/403 49/97 4 0.1 2/131 79/369 4 1.0 0/20 2/480 10 0.01 4/396 52/104 10 0.1 2/106 114/394 10 1.0 0/5 5/495 4-island 10 0.01 19/130 355/370 10 0.1 4/4 490/496 10 1.0 0/0 421/500 Paraphyly (B-C or F-G) 4 0.001 1/489 7/11 4 0.01 0/401 81/99 4 0.1 2/127 173/373 4 1.0 2/21 67/479 10 0.01 10/406 80/94 10 0.1 13/114 146/386 10 1.0 5/5 78/495 Deep paraphyly (H-I) 10 0.01 22/309 157/191 10 0.1 10/23 280/477 10 1.0 0/0 133/500 Ancestral (AB-C or EF-G) 4 0.001 0/491 3/9 4 0.01 1/415 20/85 4 0.1 0/161 39/339 4 1.0 0/17 11/483 10 0.01 7/418 17/82 10 0.1 4/173 22/327 10 1.0 0/18 16/482 Deep ancestral 10 0.01 3/422 17/78 10 0.1 2/160 15/340 10 1.0 0/6 23/494 Single migrant 4 Sisters (B-A) 0/0 363/500 4 Nonsisters (B-C) 0/0 472/500 10 Sisters (F-E) 0/0 401/500 10 Nonsisters (F-G) 0/0 488/500 Deep single migrant 10 Nonsisters (D-E) 0/0 500/500 Single locus introgression 4 Sisters (B-A) 1/450 40/50 4 Nonsisters (B-C) 3/450 48/50 10 Sisters (F-E) 10/450 41/50 10 Nonsisters (F-G) 10/450 48/50 Deep single locus introgression 10 Nonsisters (D-E) 11/450 50/50 Notes: Migration detection for scenarios with continuous migration (above the line) and with single-locus introgression and migration of a single individual at time zero (below the line). The “Absent” column gives the fraction of false positives, and the “Present” column gives the fraction of true positives. These are bolded where DENIM correctly identifies a migration in 75% or more of the replicates. Supplementary Table S1 available on Dryad shows the coverage probability for the 4-species scenarios as the settings in DENIM are varied. With this data set, there is little difference between the simple and flexible models. As the prior mean is increased, the results for nonsister migration with rates 0.1 and 1.0 improve, while other scenarios become slightly worse. With the smallest prior mean of 0.00125, the result for the scenario with a single migrant between nonsisters becomes worse. Simulated Data: Branch Scores Supplementary Figures S1 and S2 available on Dryad, show the branch scores for the 4 and 10 species scenarios. The general picture is that DENIM produces good results for the two smallest migration rates 0.001 and 0.01, and for the migrations at time zero, but breaks down at higher rates. They also show that in the cases where the migration is between sister species (the A-B, and AB-C scenarios in Supplementary Fig. S1; and the DeepAnc, E-F, and EF-G scenarios in Supplementary Fig. S2) the errors in branch lengths are smaller for DENIM than for *BEAST. Supplementary Figures S3, S4, S5, and S6 show results for the 4-species scenarios with the flexible model, and different prior means 0.00125, 0.005, 0.02, 0.08. The results do not vary greatly with the choice of prior, and follow a similar pattern to the coverage results. All but the high-rate paraphyletic scenarios become worse with a prior mean of 0.08. Simulated Data: Migration Detection Table 2 shows how well DENIM is able to detect migration using the first measure of the measures described in the section “Evaluation measures” for identifying migrating loci. For each scenario, there are 50 replicates and 10 loci, hence 500 cases in which migration may or may not be present. The table shows the number of times that DENIM incorrectly infers migration when it is absent, as a fraction of the number of times it is absent (false positives) and the number of times it correctly infers migration when it is present, as a fraction of the number of times it is present. The general picture is that DENIM only does this well when the migration rate is low and between nonsister species. DENIM usually “explains” migration between sister species by squashing the species tree (like *BEAST). An exception to the general picture is where there is a single migrant or locus introgression at time zero between sister species (B-A-1-allele, B-A-1-mig, F-E-1-allele, F-E-1-mig) where the migration is usually detected. Supplementary Figures S7 and S8 show results for the posterior means of the number of migrations. Results on Empirical Data I reanalyzed the pocket gopher data of Belfiore et al. (2008), as supplied with BEAST. There are seven putative species from the Thomomys genus, and one outgroup Orthogeomys heterodus. Experts are certain this is the true outgroup (Heled and Drummond 2010). There are 26 individuals and 7 noncoding nuclear sequence loci. This data set has been analyzed in various papers, including Heled and Drummond (2010), the paper which introduced *BEAST. This group of species has undergone a rapid radiation, so incomplete lineage sorting is expected to be high. I used the HKY substitution model, linked site models, estimated relative clock rate for all loci except the first, and a strict clock. The results here use the simple model for migration, with an exponential prior with mean 0.001. In the *BEAST analysis, the outgroup species O. heterodus was misplaced (their Fig. 8a), and the authors comment that “The tendency to place the outgroup incorrectly appears to be caused by just one gene,” namely TBO29. The tree from the DENIM analysis is shown in Figure 8. The outgroup is correctly placed, and it is very similar to the *BEAST result with ingroup monophyly enforced (their Fig. 8b). The DENIM tree is somewhat shorter, perhaps due to a different site model or population model. A migration was inferred between O. heterodus and the [(Thomomys bottae, Thomomys townsendi), Thomomys umbrinus] clade, the same clades that *BEAST grouped together. This migration was present in about 95% of the MCMC samples. The other migrations that appear in the posterior samples have much lower posterior probabilities. The next migrations that DENIM analysis suggests (at about 24%) are very recent ones, both ways, of TBO47 between T. bottae and T. umbrinus. This pair is followed (at about 18%) by a very recent one of TBO64 Thomomys talpoides to Thomomys idahoensis (going back in time). Figure 8. View largeDownload slide Gopher tree. Posterior clade probabilities are shown next to branches. The node bars are 95% HPDs for the node heights. The migration of locus TBO64 is also indicated. Figure 8. View largeDownload slide Gopher tree. Posterior clade probabilities are shown next to branches. The node bars are 95% HPDs for the node heights. The migration of locus TBO64 is also indicated. It is interesting that DENIM identifies TBO64, but not TBO29, as a locus with migration. The posterior mean count of migrations for TBO64 was 1.20, for TBO47 it was 0.47, for TBO26 it was 0.11, and the rest, including TBO29, were well under 0.1. In Belfiore et al. (2008), the individual gene trees were estimated separately, and it appears from their Figure 2 that in TBO64, the relative distance between Orthogeomys heterodus and other taxa is considerably smaller than is the case for any other locus. Other settings were also tried. With the simple model, and a prior mean of 0.005, the method broke down, in a way similar to the tests with no data: the species tree height and migration counts were very large. With the simple model, and a prior mean of 0.0002, the result was similar to Figure 8, but the posterior probability for the ingroup decreased to 0.72. With the flexible model, prior means of 0.001, 0.005, and 0.02 produced similar results to Figure 8, but the migration counts increased with the prior mean. Discussion The multispecies coalescent model implies a particular notion of species, as described in Heled and Drummond (2010): “any group of individuals that, after some ‘divergence’ time, have no history of breeding with individuals outside that group.” In the tree-island model used by DENIM, this definition is relaxed to allow some gene flow after the divergence, but it is still supposed that there is a particular time at which two groups of individuals emerge with only a small amount of gene flow between them from that point on. Further research using a variety of approaches will be needed to ascertain how often this model is a good approximation to reality. In the meantime, I would argue that it is preferable to the assumption of no gene flow at all in the “pure” multispecies coalescent model. DENIM should be useful to biologists who would like to estimate a species tree under the assumptions of the multispecies coalescent but are concerned that gene flow may distort the results. Data Requirements The results on simulated data show that DENIM can deal with migration rates up to about 1 in 100 or 1 in 10 migrants per generation. The performance of DENIM in terms of ESSs per hour is similar to that of *BEAST, indicating that DENIM is suitable for up to tens of species and tens of loci. The simulations with 10 species, 37 individuals, and 10 loci took about 80 min to achieve a posterior ESS of $$>$$ 200; similar tests with 40 loci took about 36 h. One should note that more recent implementations of the multispecies coalescent model, with new operators, such as StarBEAST2 (Ogilvie et al. 2017), STACEY (Jones 2016), and BPP (Yang 2015) are likely to be significantly faster, especially when the number of loci is large. The behavior of DENIM with no data is far from ideal, and care must be taken to ensure the signal in the data is sufficient to overcome the potential bias from the prior. In general, the flexible model appears to be preferable in this regard. The embedding scheme means that DENIM can only consider at most one migration per coalescence. If the number of migrations inferred by DENIM is more than a small proportion of the number of coalescences, the approximation is likely to produce bias. The results from the simulations give an indication of the how informative the data needs to be in order to use DENIM successfully. In the simulations, most of the branch lengths had a mean of 0.02, and although the randomness in the scenarios made some branches much shorter, the individual sequences are generally quite informative. Consider a large number of short sequences from a shallow phylogeny. Such data sets are always challenging for methods based on the multispecies coalescent model, since mutational variation produces great uncertainty about the gene trees. However, such data sets are likely to be particularly problematic for DENIM, since the bias in the prior gets worse with larger numbers of empty sequences. It is therefore important that the individual sequences are sufficiently informative. The best solution, if feasible, is longer sequences (for example 10000 bp instead of 1000 bp). This does increase the probability of recombination within a locus, but this seems the lesser evil in the case of shallow phylogenies. Other Guidance Often, the amount of migration will not be known a priori. The posterior predictive approach of Reid et al. (2014) and Gruenstaeudl et al. (2015) may be useful here. If DENIM reports no migration, this provides some assurance that migration (especially between nonsister species) is not interfering with the inference of the topology. It is harder to detect migration between sister species, and you should not expect DENIM to rule out this possibility. If DENIM reports a small amount of migration, the species tree is likely to be more accurate than that reported by software which is based on the multispecies coalescent but ignores gene flow. Alternatively, it may be possible to identify and remove troublesome loci. If DENIM reports a lot of migration, the results from DENIM may be unreliable, and estimating an accurate species tree using any current method may not be feasible, although it is certainly worth considering the species network methods described in the introduction. DENIM identifies loci which are “badly behaved,” rather than those which migrate. That is, it identifies loci with migrations which result in an incompatibility with the species tree. Some migrations do not cause incompatibility, because (going back in time) they do not coalesce with another lineage until the species tree branches have merged; or a lineage may migrate, then migrate back again before coalescing, and so on. Looking Ahead DENIM combines the tree-island model, which is an elegant mathematical model for speciation, coalescence, and migration, with a rather crude approximation for sampling the posterior. The two components are quite independent. The partial sampling of the posterior is a trade-off between accuracy on the one hand and computational effort and simplicity of implementation on the other. An exact sampling from the posterior for large data sets when there is a large amount of migration may remain computationally infeasible for decades. However, there are almost certainly better compromises to be found than the one currently implemented in DENIM. For example, Palczewski and Beerli (2013) provides an approximation for high rates. It is possible to combine the tree-island model with species delimitation and thus coestimate the delimitation and the species tree in the presence of migration. The current implementation of DENIM allows this, using the birth–death-collapse model of Jones et al. (2015), but this possibility has not been explored in any detail. The parameter space become even larger and obtaining useful results in a reasonable amount of time may be very difficult. DENIM uses the usual birth–death model to provide a prior for the species tree, but this only provides a probability density for the reconstructed tree, with all extinct branches removed. In the presence of migration, there may be gene flow from extinct species which could result in unusually deep coalescences and bias the analysis. DENIM could be extended so that the full tree, including the extinct branches is sampled from. There is no upper limit to the number of extinct branches that could exist, so again there are more computational difficulties. In principle, this could allow the detection of some extinct species from genetic data alone. Supplementary Material Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.0jt38. Acknowledgments I thank Adam Leaché for supplying the MCcoal control files used to generate the simulated data. I thank Bengt Oxelman and four anonymous reviewers for constructive comments on earlier versions of this paper. References Beerli P. , Felsenstein J. 2001 . Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach . Proc. Natl. Acad. Sci. USA 98 : 4563 – 4568 . Google Scholar CrossRef Search ADS Belfiore N.M. , Liu L. , Moritz C. 2008 . Multilocus phylogenetics of a rapid radiation in the genus Thomomys (Rodentia: Geomyidae) . Syst. Biol. 57 : 294 – 310 . Google Scholar CrossRef Search ADS PubMed Bouckaert R. , Heled J. , Khnert D. , Vaughan T. , Wu C.-H. , Xie D. , Suchard M.A. , Rambaut A. , Drummond A.J. 2014 . BEAST 2: a software platform for Bayesian evolutionary analysis . PLoS Comput. Biol. 10 : e1003537 . Google Scholar CrossRef Search ADS PubMed Dalquen D.A. , Zhu T. , Yang Z. 2017 . Maximum likelihood implementation of an isolation-with-migration model for three species . Syst. Biol. 66 : 379 – 398 . Google Scholar PubMed Ewing G. , Allen R. 2006 . Estimating population parameters using the structured serial coalescent with Bayesian MCMC inference when some demes are hidden . Evol. Bioinform. 2 : 237 – 235 . Google Scholar CrossRef Search ADS Figueiró H.V. , Li G. , Trindade F.J. , Assis J. , Pais F. , Fernandes G. , Santos S.H.D. , Hughes G.M. , Komissarov A. , Antunes A. , Trinca C.S. , Rodrigues M.R. , Linderoth T. , Bi K. , Silveira L. , Azevedo F.C.C. , Kantek D. , Ramalho E. , Brassaloti R.A. , Villela P.M.S. , Nunes A.L.V. , Teixeira R.H.F. , Morato R.G. , Loska D. , Saragüeta P. , Gabaldón T. , Teeling E.C. , O’Brien S.J. , Nielsen R. , Coutinho L.L. , Oliveira G. , Murphy W.J. , Eizirik E. 2017 . Genome-wide signatures of complex introgression and adaptive evolution in the big cats . Sci. Adv. 3 : e1700299 . Google Scholar CrossRef Search ADS PubMed Gruenstaeudl M. , Reid N.M. , Wheeler G.L. , Carstens B.C. 2015 . Posterior predictive checks of coalescent models: P2c2m, an R package . Mol. Ecol. Resour. 16 : 193 – 205 . Google Scholar CrossRef Search ADS PubMed Hejase H. , VandePol N. , Bonito G.M. , Liu K.J. 2018 . Fastnet: fast and accurate inference of phylogenetic networks using large-scale genomic sequence data . bioRxiv . https://doi.org/10.1101/132795. Heled J. , Drummond A. 2010 . Bayesian inference of species trees from multilocus data . Mol. Biol. Evol. 27 : 570 – 580 . Google Scholar CrossRef Search ADS PubMed Hey J. 2010 . Isolation with migration models for more than two populations . Mol. Biol. Evol. 27 : 905 – 920 . Google Scholar CrossRef Search ADS PubMed Hey J. , Nielsen R. 2004 . Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis . Genetics 167 : 747 – 760 . Google Scholar CrossRef Search ADS PubMed Hey J. , Nielsen R. 2007 . Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics . Proc. Natl. Acad. Sci. USA 104 : 2785 – 2790 . Google Scholar CrossRef Search ADS Hudson R.R. , et al. 1990 . Gene genealogies and the coalescent process . Oxf. Surv. Evol. Biol. 7 : 1 – 44 . Jackson N.D. , Morales A.E. , Carstens B.C. , O’Meara B.C. 2017 . Phrapl: phylogeographic inference using approximate likelihoods . Syst. Biol. 66 : 1045 – 1053 . Google Scholar CrossRef Search ADS PubMed Jones G. 2016 . Algorithmic improvements to species delimitation and phylogeny estimation under the multispecies coalescent . J. Math. Biol. 74 : 447 – 467 . Google Scholar CrossRef Search ADS PubMed Jones G. , Aydin Z. , Oxelman B. 2015 . Dissect: an assignment-free Bayesian discovery method for species delimitation under the multispecies coalescent . Bioinformatics 31 : 991 – 998 . Google Scholar CrossRef Search ADS PubMed Kingman J. 1982 . The coalescent . Stoch. Process. Their Appl. 13 : 235 – 248 . Google Scholar CrossRef Search ADS Leaché A.D. , Harris R.B. , Rannala B. , Yang Z. 2014 . The influence of gene flow on species tree estimation: a simulation study . Syst. Biol. 63 : 17 – 30 . Google Scholar CrossRef Search ADS PubMed Martin S.H. , Dasmahapatra K.K. , Nadeau N.J. , Salazar C. , Walters J.R. , Simpson F. , Jiggins C.D. 2013 . Genome-wide evidence for speciation with gene flow in heliconius butterflies . Genome Res. 23 : 1817 – 1828 . Google Scholar CrossRef Search ADS PubMed Nosil P. 2008 . Speciation with gene flow could be common . Mol. Ecol. 17 : 2103 – 2106 . Google Scholar CrossRef Search ADS PubMed Ogilvie H.A. , Bouckaert R.R. , Drummond A.J. 2017 . StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates . Mol. Biol. Evol. 34 : 2101 – 2114 . Google Scholar CrossRef Search ADS PubMed Palczewski M. , Beerli P. 2013 . A continuous method for gene flow . Genetics 194 : 687 – 696 . Google Scholar CrossRef Search ADS PubMed Plummer M. , Best N. , Cowles K. , Vines K. 2006 . CODA: convergence diagnosis and output analysis for MCMC . R News 6 : 7 – 11 . Reid N.M. , Hird S.M. , Brown J.M. , Pelletier T.A. , McVay J.D. , Satler J.D. , Carstens B.C. 2014 . Poor fit to the multispecies coalescent is widely detectable in empirical data . Syst. Biol. 63 : 322 – 333 . Google Scholar CrossRef Search ADS PubMed Rheindt F.E. , Fujita M.K. , Wilton P.R. , Edwards S.V. 2014 . Introgression and phenotypic assimilation in Zimmerius flycatchers (tyrannidae): population genetic and phylogenetic inferences from genome-wide SNPs . Syst. Biol. 63 : 134 – 152 . Google Scholar CrossRef Search ADS PubMed Romaschenko K. , Garcia-Jacas N. , Peterson P.M. , Soreng R.J. , Vilatersana R. , Susanna A. 2014 . Miocenepliocene speciation, introgression, and migration of Patis and Ptilagrostis (Poaceae: Stipeae) . Mol. Phylogenet. Evol. 70 : 244 – 259 . Google Scholar CrossRef Search ADS PubMed Solís-Lemus C. , Ané C. 2016 . Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting . PLOS Genet. 12 : 1 – 21 . Google Scholar CrossRef Search ADS Sun Y. , Abbott R.J. , Li L. , Li L. , Zou J. , Liu J. 2014 . Evolutionary history of purple cone spruce (Picea purpurea) in the Qinghai Tibet Plateau: homoploid hybrid origin and Pleistocene expansion . Mol. Ecol. 23 : 343 – 359 . Google Scholar CrossRef Search ADS PubMed Tian Y. , Kubatko L.S. 2016 . Distribution of coalescent histories under the coalescent model with gene flow . Mol. Phylogenet. Evol. 105 : 177 – 192 . Google Scholar CrossRef Search ADS PubMed Wen D. , Nakhleh L. 2017 . Coestimating reticulate phylogenies and gene trees from multilocus sequence data . Syst. Biol . doi: https://doi.org/10.1093/sysbio/syx085 . Yang Z. 2015 . The BPP program for species tree estimation and species delimitation . Curr. Zool. 61 : 854 – 865 . Google Scholar CrossRef Search ADS Zhang C. , Ogilvie H.A. , Drummond A.J. , Stadler T. 2018 . Bayesian inference of species networks from multilocus sequence data . Mol. Biol. Evol. 35 : 504 – 517 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Systematic Biology Oxford University Press

# Divergence Estimation in the Presence of Incomplete Lineage Sorting and Migration

, Volume Advance Article – May 24, 2018
13 pages

/lp/ou_press/divergence-estimation-in-the-presence-of-incomplete-lineage-sorting-g5ckwJWbo5
Publisher
Oxford University Press
ISSN
1063-5157
eISSN
1076-836X
D.O.I.
10.1093/sysbio/syy041
Publisher site
See Article on Publisher Site

### Abstract

Abstract This article focuses on the problem of estimating a species tree from multilocus data in the presence of incomplete lineage sorting and migration. I develop a mathematical model similar to IMa2 (Hey 2010) for the relevant evolutionary processes which allows both the population size parameters and the migration rates between pairs of species tree branches to be integrated out. I then describe a BEAST2 package DENIM (Divergence estimation notwithstanding ILS and migration) which is based on this model and which uses an approximation to sample from the posterior. The approximation is based on the assumption that migrations are rare, and it only samples from certain regions of the posterior which seem likely given this assumption. The method breaks down if there is a lot of migration. Using simulations, Leaché et al. (2014) showed that using the standard multispecies coalescent model to infer a species tree can result in poor accuracy if migration is present. I reanalyze this simulated data to explore DENIM’s performance and demonstrate substantial improvements in accuracy over *BEAST. I also reanalyze an empirical data set. Bayesian, incomplete lineage sorting, isolation-with-migration, Markov chain Monte Carlo, multispecies coalescent, phylogenetic analysis, species tree Speciation is probably gradual in most cases (Martin et al. 2013), and gene flow may continue to occur even between good species. It is difficult to infer these phenomena (Nosil 2008), but some large scale studies have demonstrated them in (e.g.,) butterflies (Martin et al. 2013), birds (Rheindt et al. 2014), grasses (Romaschenko et al. 2014), and big cats (Figueiró et al. 2017). Meanwhile, almost all estimates of species trees to date have been conducted under the assumption that no migration occurs after an instantaneous speciation event. Leaché et al. (2014) simulated a large set of data containing migrations and incomplete lineage sorting (ILS) and analyzed it using *BEAST. They showed that migration causes problems for species tree inference using the multispecies coalescent when migration is present but ignored. In general, migration between sister species rarely caused topological errors, although it resulted in biased estimates of node heights and population sizes. Migration between nonsister species resulted in topological errors, even when the migration rates were small. This article describes a BEAST2 package DENIM (Divergence estimation notwithstanding ILS and migration) for inferring a species tree in the presence of ILS and migration. The model in DENIM is an extension of the multispecies coalescent model as used by *BEAST and other programs and has much in common with that model. The addition of migration to the model results in a very hard problem, and no current method deals with it in full generality. For example, the methods of Tian and Kubatko (2016) and Dalquen et al. (2017) are restricted to three species. The method here involves an approximation based on an assumption that the migration rates are small: the results below show that this approximation typically breaks down at rates between one migrant per 100 generations and one migrant per 10 generations. DENIM allows arbitrary numbers of species and allows for different migration rates between each pair of contemporaneous branches. DENIM is available in BEAST2 and can be installed using Beauti. The installation includes a manual and source code. An alternative approach to account for gene flow is to model the situation with a species network instead of a tree. If hybrid speciation occurs, or populations merge, then a network is essential to model the situation properly. These are the sort of situations which the network approach is aimed at, and DENIM would not be appropriate for them. In DENIM, it is assumed that gene flow occurs continuously, and that any gene flow between pairs of branches will eventually become zero, so that there is an underlying species tree. It also is possible to use a species network to model a species tree with gene flow between branches. Instead of a continuous gene flow between two branches, one or more edges can be added to the network, representing gene flow in one or more discrete events. If the gene flow is in both directions, at least two edges would be needed; more might be needed in order to provide a good approximation to gene flow which lasts a long time. These network approaches search for an approximation to reality in a much larger space of topologies than DENIM. There has been a lot of recent interest in species networks. Two recent papers (Wen and Nakhleh 2017; Zhang et al. 2018) use a Bayesian MCMC method and coestimate both the network and the gene trees. These are the most similar to DENIM. There are a number of other network methods which make various approximations in order to deal with larger data sets; generally these rely on estimating the gene trees first instead of coestimating the network and gene trees. For example, Sun et al. (2014) used approximate Bayesian computation, SNaQ (Solís-Lemus and Ané 2016) uses an approximation based on quartets (4-taxon unrooted subnetworks), and FastNet (Hejase et al. 2018) uses a divide-and-conquer approach. PHRAPL (Jackson et al. 2017) simulates gene trees under a set of user-chosen hypotheses in order to choose the best model, but it does not seem feasible to apply it to more than a few species. “Migration” is used to refer to gene flow between species (usually introgression but not restricted to that). I use the term “species” rather than “population” because the method is aimed at situations where gene flow is small. A migration event occurs when an allele comes from a parent from another species. An “embedding” of a gene tree specifies which species tree branch each coalescence belongs to, together with migration events, which specify the times along gene tree branches at which an allele moved between species tree branches, and which species tree branches are involved. I always describe events going back in time from the present, so alleles have parental species to which they “go,” and the “destination” branch in the species tree contains part of a gene tree branch at a more ancient time than the “source” branch. This is because coalescences are easier to model this way and is the same convention as the program IMa2 (Hey and Nielsen 2004, 2007; Hey 2010). There is no upper limit on the number of migration events, and even if this is limited, and the gene tree and species tree are fixed, there can be a huge number of ways in which each gene tree can be embedded into a species tree. It is thus difficult to make inferences if the situation is modeled in full. IMa2 requires that the true population phylogeny (equivalent to species tree here) is known. DENIM uses a model for migration which is similar to that used by Hey (2010) in IMa2. There are two migration rate parameters for each pair of contemporaneous species tree branches. There are $$2(n-1)^2$$ of them for a species tree with $$n$$ tips (Hey 2010). There are three main differences between DENIM and IMa2. The species tree topology is estimated instead of being assumed; the migration rate parameters are integrated out; and an approximation is used to simplify sampling from the posterior. Also, the population size parameters are integrated out in a similar fashion to Jones (2016). The focus is on estimating the species tree despite the presence of small amounts of migration. Since migration rates are integrated out, they cannot be estimated directly, but DENIM does produce Markov chain Monte Carlo (MCMC) samples of species trees and embedded gene trees, so some information about the migrations can be found by postprocessing. Even with migration rate and population size parameters integrated out, there are still an unbounded number of parameters for the gene trees. It appears very difficult to design and implement MCMC operators capable of sampling efficiently from this distribution while estimating the species tree. Instead, DENIM uses an approximation to the posterior by ignoring most of the “unlikely” embeddings. I first describe the underlying evolutionary model for the coalescent and migration processes, and the particular formulation which makes it possible to integrate out the population size parameters and the migration rates. Then, I describe the approximation which makes it possible to sample efficiently from the posterior. The simulated data of Leaché et al. (2014) is reanalyzed to evaluate the proposed method, and the pocket gopher data of Belfiore et al. (2008) provides an empirical demonstration of the method. The Prior Density for a Gene Tree Background Following the introduction of the Kingman coalescent (Kingman 1982), models for coalescence and migration were developed in the 1980s by population geneticists (Hudson et al. 1990). More recent developments include Beerli and Felsenstein (2001), Ewing and Allen (2006), Tian and Kubatko (2016), Dalquen et al. (2017) as well as the aforementioned work of Hey and Nielsen. The methods of Tian and Kubatko (2016) and Dalquen et al. (2017) can estimate the species tree, but are currently restricted to at most three species and three sequences per locus. The underlying evolutionary model used here is the same as that of Hey (2010), except that the species tree $$S$$ is not assumed known but instead follows a birth–death model. When the species tree $$S$$ is estimated, it is important that $$\int \Pr(G|S) dG = 1$$ for any $$S$$, where the integration is over all gene trees $$G$$. I have not found a clear statement to this effect in the literature, so some explanation seems warranted. Between the node heights of the species tree, there is an $$n$$-island model for coalescence and migration (Beerli and Felsenstein 2001), where $$n$$ is the current number of species tree branches. This is a continuous time Markov chain. It could be time-inhomogeneous, to allow for population sizes or migration rates to vary continuously with time, although the application here only uses the time-homogeneous case. In order to define the state space of this Markov chain, we need a few preliminaries. Firstly, each branch in $$G$$ is labeled by the tip labels that descend from the branch. When a coalescence occurs, it should be understood as the merging of two particular labeled gene tree branches. Likewise, when a migration occurs, a particular gene tree branch migrates to a particular species tree branch. Let $$L$$ be the set of tip labels of $$G$$, and let $${\mathcal{P}}(L)$$ be the set of all partitions of $$L$$. At any time, the set of gene tree branches can be regarded as some partition $$P \in {\mathcal{P}}(L)$$, and each branch as a member of $$P$$. The branches of $$S$$ could be labeled in a similar manner to $$G$$, but for convenience, we assume they have been labeled with the numbers $$\{1, \dots, n\}$$ during the period in which there are $$n$$ branches, and that branches $$n$$ and $$n-1$$ merge to form a branch $$n-1$$ when a speciation event is encountered. The state space of the Markov chain during the period with $$n$$ branches consists of all possible assignments of all members of $${\mathcal{P}}_n(L)$$ to the branches of $$S$$. Each state is a pair $$(P,f),$$ where $$P \in {\mathcal{P}}_n(L)$$ and $$f$$ is any map from $$P$$ to $$\{1, \dots, n\}$$, assigning gene tree branches to species tree branches. We use the set theory notation $$X^Y$$ to denote the set of all maps from set $$Y$$ to set $$X$$. So, we can write the state space $${\mathcal{A}}_n$$ as \begin{equation*} {\mathcal{A}}_n = \{ (P,f)\ : \ P \in {\mathcal{P}}(L) \ \wedge \ f \in \{1, \dots, n\}^P \}. \end{equation*} It has size \begin{equation*} |{\mathcal{A}}_n| = \sum_{P \in {\mathcal{P}}(L)} n^{|P|}. \end{equation*} There is an instantaneous rate matrix $$Q$$ of size $$|{\mathcal{A}}_n| \times |{\mathcal{A}}_n|$$. The off-diagonal rows of $$Q$$ are non-negative, the rows of $$Q$$ sum to zero, and the diagonal entries are less than or equal to zero. Note that although $$Q$$ is enormous for large $$|L|$$ and $$n$$, it is sparse, since the number of states which can be reached from a given state by a single migration or coalescence is much smaller than $$|{\mathcal{A}}_n|$$. Basic properties of Markov chains (in particular the fact that rows of $$Q$$ sum to zero) ensure that given a starting distribution over states such that \begin{equation*} \sum_{(P,f) \in {\mathcal{A}}_n} \Pr(P,f) = 1, \end{equation*} this remains true throughout the process, and in particular just before a merging of species tree branches. At such a merge, the partitions $$P$$ are unchanged, but the state space changes. Once we are in the root branch of the species tree, the process reduces to the Kingman coalescent, which is a (normalized) density. Consider the case just above the root, where $$n=2$$. We have \begin{equation*} \sum_{(P,f) \in {\mathcal{A}}_2} \Pr(P,f) = \sum_{P \in {\mathcal{P}}(L)} \ \sum_{f \in \{1,2\}^P} \Pr(P,f). \end{equation*} Each $$P$$ consist of subsets $$L_1, \dots, L_m$$, and as $$f$$ runs over the maps from $$P$$ to $$\{1,2\}$$, it runs over exactly those assignments of these subsets to $$\{1,2\}$$ which result in all of them ending up in the root just after the merge. Thus, \begin{equation*} \sum_{f \in \{1,2\}^P} \Pr(P,f) = \sum_{f \in \{1\}^P} \Pr(P,f), \end{equation*} where the left hand side applies just before the merge and the right hand side applies just after the merge. It follows that \begin{equation*} \sum_{(P,f) \in {\mathcal{A}}_2} \Pr(P,f) = \sum_{(P,f) \in {\mathcal{A}}_1} \Pr(P,f), \end{equation*} where again the left hand side applies just before and the right hand side applies just after the merge. We can apply a similar argument to merges when $$n > 2$$ to establish that $$\int \Pr(G|S) dG = 1$$. This evolutionary model will be referred to as the “tree-island model.” Integrating Out Population and Migration Parameters Suppose the species tree has $$s$$ tips. There are $$2s\,{-}\,1$$ species tree branches, including the root branch. Suppose the migration rate from branch $$b$$ to branch $$d$$ is $$m_{bd}$$. These migration rates follow the same conventions as Hey and Nielsen (backwards from present, scaled by population size). As in Jones (2016), the population size parameter $$\theta_b$$ for branch $$b$$ is equal to $$N_b \mu_b$$, where $$N_b$$ is the effective population size and $$\mu_b$$ is the mutation rate for the branch. For locus $$j$$, the effective number of gene copies is obtained from $$N_b$$ by multiplying by a factor $$p_j$$ (sometimes called the “ploidy”) for gene $$j$$. The time (going back from zero at present) is divided into a number of intervals $$\tau_i$$ ($$i \in {\mathcal{I}}$$) by the times of the events and species tree node heights. The set of species tree branches which exist during the $$i$$th interval is denoted by $${\mathcal{B}}_i$$, and we set $$s_i = |{\mathcal{B}}_i|$$. The number of lineages in gene tree $$j$$ which belong to species tree branch $$b$$ during the $$i$$th interval is $$n_{jbi}$$. The set of intervals which end in a coalescence is $${\mathcal{I}}_{\rm coal}$$, and the set which end in a migration is $${\mathcal{I}}_{\rm mig}$$. See Figure 1, where $$i \in {\mathcal{I}}_{\rm mig}$$, $$i+1$$ is in neither, and $$i+2 \in {\mathcal{I}}_{\rm coal}$$. The rate at which the next event occurs is $$(\kappa_i + \mu_i)$$ where, \begin{equation*}\label{eq:coalrate} \kappa_i = \sum_j \sum_{b \in {\mathcal{B}}_i} \left( \binom{n_{jbi}}{2}p_j^{-1} \theta_b^{-1} \right) \end{equation*} is the total rate for coalescent events and $$\label{eq:migrate} \mu_i = \sum_j \sum_{b \in {\mathcal{B}}_i} \left( n_{jbi} \sum_{d \in {\mathcal{B}}_i\setminus b} m_{bd} \right)$$ (1) is the total rate for migration events. Here, we are summing the nonzero off-diagonal elements of a row of $$Q$$ in order to find $$Q_{x,x} = -(\kappa_i + \mu_i)$$ for the current state $$x$$. We then need $$Q_{x,y}$$ where $$y$$ is the next state. If it is a coalescence, $$Q_{x,y} = p_{j_i}^{-1} \theta_{b_i}^{-1}$$, where $$j_i$$ is the gene tree containing the coalescence, and $$b_i$$ is the species tree branch in which it occurs. If it is a migration, $$Q_{x,y} = m_{b_i d_i}$$, where $$b_i$$ and $$d_i$$ are the source and destination branches. Figure 1. View largeDownload slide Three time intervals. The figure shows a section of a species tree (in pale gray) and some gene tree branches (black lines). The first (uppermost) time interval of length $$\tau_i$$ ends in a migration, the second interval ends with a species tree node, and the third with a coalescence. Figure 1. View largeDownload slide Three time intervals. The figure shows a section of a species tree (in pale gray) and some gene tree branches (black lines). The first (uppermost) time interval of length $$\tau_i$$ ends in a migration, the second interval ends with a species tree node, and the third with a coalescence. Denoting the set of migration rates by $$M$$ and the set of population size parameters by $$\Theta$$, we have the probability density for a gene tree: \begin{align*}\label{eq:coalmigrate} f(G; S, \Theta, M) & = \prod_{i \in {\mathcal{I}}_{\rm coal}} \!\!\! p_{j_i}^{-1} \theta_{b_i}^{-1} \times \\ & \prod_{i \in {\mathcal{I}}_{\rm mig}} \!\!\! m_{b_i d_i} \prod_i \exp (- (\kappa_i + \mu_i) \tau_i). \end{align*} This can be factored into a coalescence part and a migration part. Then, our aim is to rearrange the terms in the coalescence part so that it is a product over species tree branches, and then rearrange the terms in the migration part so that it is a product over ordered pairs of species tree branches. The result will be a product of terms, in which each term contains one population size parameter $$\theta_b$$ or one migration parameter $$m_{bd}$$. As we will see, this enables us to integrate out these parameters if conjugate priors (or mixtures of conjugate priors) are assumed. We put \begin{equation*} f(G; S, \Theta, M) = f_{\rm coal} (G; S, \Theta) \ \ f_{\rm mig} (G; S, M) \end{equation*} where \begin{equation*} f_{\rm coal} (G;S, \Theta) = \prod_{i \in {\mathcal{I}}_{\rm coal}} \!\!\! p_{j_i}^{-1} \theta_{b_i}^{-1} \ \exp \bigg( - \sum_i \tau_i \kappa_i \bigg) \end{equation*} and $$\label{fmig} f_{\rm mig} (G;S, M) = \prod_{i \in {\mathcal{I}}_{\rm mig}} \!\!\! m_{b_i d_i} \ \exp \bigg( - \sum_i \tau_i \mu_i \bigg)$$ (2) First we deal with $$f_{\rm coal}$$. We have \begin{equation*} \prod_{i \in {\mathcal{I}}_{\rm coal}} p_{j_i}^{-1} \theta_{b_i}^{-1} = \prod_j \prod_b (p_j \theta_b)^{-k_{jb}}, \end{equation*} where $$k_{jb}$$ is the number of coalescences in gene tree $$j$$ in branch $$b$$. Next \begin{align*} \sum_i \tau_i \kappa_i & = \sum_i \tau_i \sum_j \sum_{b \in {\mathcal{B}}_i}\binom{n_{jbi}}{2}p_{j_i}^{-1} \theta_b^{-1} \\ & = \sum_{b} \sum_j \sum_{i: b \in {\mathcal{B}}_i} \binom{n_{jbi}}{2}p_{j}^{-1} \tau_i \theta_b^{-1} \end{align*} so \begin{equation*}\label{eq:msc} f_{\rm coal}(G;S, \Theta, M) = \prod_b r_b \theta_b^{-q_b} \exp \big( - \gamma_b \theta_b^{-1} \big) \mathrm{, \ \ where} \end{equation*} \begin{align}\label{eq:qrgamma} q_b &= \sum_j k_{jb} \mathrm{, \ \ \ \ \ \ } r_b = \prod_j p_j^{-k_{jb}} \mathrm{, \ \ \ \ and \ \ } \nonumber \\ \gamma_b &= \sum_j \sum_{i: b \in {\mathcal{B}}_i} \binom{n_{jbi}}{2}p_{j}^{-1} \tau_i. \end{align} (3) As written, there are time intervals in branch $$b$$ for events during which no change occurs in branch $$b$$. For the computation of $$\gamma_b$$, it is only necessary to take into account coalescences within branch $$b$$ and migrations in and out of branch $$b$$, since between these events, $$n_{jbi}$$ is constant. Equation (3) is now of the same form as equation (2) of Jones (2016). The only difference is that $$\gamma_b$$ accounts for migrations in and out of branch $$b$$. This means the population size parameters can be integrated out as in Jones (2016). Now, we turn to the migration part $$f_{\rm mig}$$. Let $${\mathcal{O}}$$ be the set of contemporaneous pairs of branches in $$S$$. We have \begin{align*} \sum_i \tau_i \mu_i &= \sum_i \tau_i \sum_j \sum_{b \in {\mathcal{B}}_i} n_{jbi} \sum_{d \in {\mathcal{B}}_i\setminus b} m_{bd} \\ &= \sum_{(b,d) \in {\mathcal{O}}} \ \sum_{i: b,d \in\ {\mathcal{B}}_i} \!\!\! \tau_i \sum_j n_{jbi} m_{bd}. \end{align*} Thus, \begin{equation*}\label{fmig2} f_{\rm mig} (G;S, \Theta, M) = \prod_{(b,d) \in {\mathcal{O}}} m_{bd}^{n_{bd}} \exp \left( - \zeta_{bd} m_{bd} \right), \end{equation*} where $$n_{bd}$$ is the total number of migrations from $$b$$ to $$d$$ and $$\label{eq:zeta} \zeta_{bd} = \sum_{i: b,d \in\ {\mathcal{B}}_i} \tau_i \sum_j n_{jbi}.$$ (4) The term $$\zeta_{bd}$$ can be interpreted as the total intensity of migrations from $$b$$ to $$d$$ during the time in which both branches $$b$$ and $$d$$ exist. If a conjugate prior is assumed, namely that $$m_{bd} \sim {\mathcal{G}}(\alpha_{bd}, \beta_{bd})$$ for all $$b,d$$ where $${\mathcal{G}}$$ is the gamma distribution, then we get a contribution to the posterior which is $$\label{eq:migIO} \prod_{(b,d) \in {\mathcal{O}}} \int_0^{\infty} \frac{\beta_{bd}^{\alpha_{bd}}}{\Gamma(\alpha_{bd})} m_{bd}^{\alpha_{bd}} \exp(-\beta m_{bd}) \times \\ \ \ m_{bd}^{n_{bd}}\exp \left( - \zeta_{bd} m_{bd} \right) \ {\mathrm{d}} m_{bd} \\ = \prod_{(b,d) \in {\mathcal{O}}} \frac{\Gamma(n_{bd} + \alpha_{bd})}{\Gamma(\alpha_{bd})} \frac{\beta^\alpha_{bd}}{\left(\beta_{bd} + \zeta_{bd} \right)^{n_{bd} + \alpha_{bd}} }.$$ (5) Equations (4) and (5) provide the information needed to implement the calculation for the migration part of the posterior. Each ordered pair of contemporaneous branches $$(b,d)$$ can have a different prior. For example, it is possible to represent the prior expectation that migration rates are lower between more distantly related branches. This model, where migration rates are independent, will be called the “flexible” model. The calculation in equation (4) is slow when the number of tips in the species tree is large. A much simpler model is to assume that $$m_{bd}$$ is the same value $$m$$ for all $$b,d$$. In this case, equation (1) reduces to \begin{equation*}\label{eq:coalmigratesimple} \mu_i = m \sum_j \sum_{b \in {\mathcal{B}}_i} n_{jbi} (s_i-1). \end{equation*} The double sum is equal to the total number of gene tree lineages $$N_i$$ during time interval $$i$$. Then, we have \begin{align*}\label{fmigsimple} f_{\rm mig} (G;S, M) &= m^{N} \exp \bigg( - \sum_i \tau_i \mu_i \bigg) \\ &= m^{N} \exp \bigg( - m \sum_i \tau_i N_i (s_i-1) \bigg), \end{align*} where $$N$$ is the total number of migrations. The parameter $$m$$ can be integrated out. This will be called this model the “simple” model. How the Gene Tree is Embedded This section describes the embedding parameters, and how they are used to embed the gene trees. The embeddings are restricted by ignoring ones which are unlikely when the migration rates are small enough. The hope is that the method will still explore a region of parameter space which includes most of the probability content. Embeddings are restricted by applying the following rules: there is at most one migration in a single gene tree branch at most one of the child branches of a gene tree node contains a migration there are no more migrations than needed (in a sense described below) A pair of child branches of a gene tree node will be called a sister-pair. The embedding parameters $$E_j$$ consist of two values $$\xi_{ji}, \eta_{ji} \in [0,1]$$ for each internal node $$i$$ of the $$j$$th gene tree. See Figure 2. The parameter $$\xi_{ji}$$ determines where along a sister-pair a migration may occur, if a migration is needed in the embedding. Thus it determines which child branch of node $$i$$ is capable of migrating, as well as the time of the migration if there is one. All the nodes in the species tree have their children labeled as “left” and “right,” so that $$[0,1]$$ can be mapped unambiguously onto the sister-pair. The other parameter $$\eta_{ji}$$ specifies which of the node’s child branches to use when choosing a destination species branch for an introgression. If the migration is between sister branches of the species tree, there is only one choice for the destination. It may happen that the sister branch is too ancient, in which case several destination species branches are possible. The possible destination branches are found, and $$\eta_{ji}$$ is used to choose between them by dividing the interval [0,1] equally into the appropriate number of parts. Figure 2. View largeDownload slide This shows $$\xi_{ji}$$ and $$\eta_{ji}$$ for a one gene tree and one node (with subscripts dropped). There is a migration from C into A. The parameter $$\xi$$ determines how far along the sister-pair the migration occurs, and $$\eta$$ determines whether the destination branch is A or B. Figure 2. View largeDownload slide This shows $$\xi_{ji}$$ and $$\eta_{ji}$$ for a one gene tree and one node (with subscripts dropped). There is a migration from C into A. The parameter $$\xi$$ determines how far along the sister-pair the migration occurs, and $$\eta$$ determines whether the destination branch is A or B. The parameters $$\xi_{ji}$$ and $$\eta_{ji}$$ are changed by operators during the MCMC algorithm, regardless of whether or not they are being used to embed a gene tree. This is a simpler alternative to implementing rjMCMC operators which account for changes in dimension. The prior $$\Pr(E_j)$$ for $$E_j$$ is are independent uniform distributions on $$[0,1]$$ for each $$\xi_{ji}$$ and $$\eta_{ji}$$. The first rule above is straightforward. The definition of $$\xi_{ji}$$ enforces the second rule. The third rule is applied recursively from the tips. Suppose $$x$$ is the $$i$$th node of the $$j$$th gene tree, and suppose both child nodes of $$x$$ have been assigned to branches in the species tree. If it is possible to assign $$x$$ to a species tree branch without an migration in either child branch of $$x$$, then this is done. Otherwise $$x$$ is assigned using the species tree branch to which its nonmigrating child has been assigned: it will be the same branch, or an ancestor of that branch, depending on the height of $$x$$. The height of the migration is fixed by $$\xi_{ji}$$. The migrating child branch of $$x$$ starts in the species tree branch that the migrating child has been assigned. It stays in this branch, or an ancestor of it, until the migration height. It will then migrate to the same species tree branch as $$x$$, or a descendant of it. If there is more than one descendant of the species tree branch of $$x$$ at this height, values from $$\eta_{j}$$ are used to choose one. Properties of the Embedding Scheme Different embeddings of the same gene tree in the same species tree are obtained by changing $$\xi_{ji}$$ and $$\eta_{ji}$$ during the MCMC sampling. Figure 3 shows some examples. Case (a) is simple. No migrations are needed to embed the gene tree, so embeddings with one or more migrations are ignored. Case (b) requires one migration, and an embedding with two migrations in the same branch is ignored. Case (c) requires two migrations. The embedding on the left is ignored since it has two sister branches with migrations. The embedding on the right is one of four embeddings that is considered. Figure 3. View largeDownload slide Some examples of embeddings for three gene trees in a, b, c. On the left are embeddings that are ignored. On the right are embeddings that are considered. Figure 3. View largeDownload slide Some examples of embeddings for three gene trees in a, b, c. On the left are embeddings that are ignored. On the right are embeddings that are considered. Proposition. Given any set of particular values for $$\xi$$ and $$\eta$$, and the rules above, any gene tree can be embedded in any species tree. For any $$G_j$$ and $$S$$, the set of embeddings as $$\xi$$ and $$\eta$$ vary include at least one with a minimal number of migrations. Proof: The first claim is straightforward, using recursion starting at the tips, and following the description above (for applying the third rule). For second claim, suppose it is false and consider the set $$M$$ of minimal embeddings (those with a minimal number of migrating branches). Call a node both of whose child branches migrate a “double node.” If there is a member of $$M$$ has no double nodes, this embedding will be chosen for some $$\xi$$ and $$\eta$$, so every member of $$M$$ has at least one double node. Now restrict attention to the subset $$\bar{M}$$ of $$M$$ of embeddings which have as few as possible double nodes. Finally, choose an embedding $$B$$ from $$\bar{M}$$ so that a double node $$x$$ is as near to the root as possible. If $$x$$ is the root, it can be moved into the same branch as one of its children, or an ancestor of that branch, and one migration can be removed, contradicting the definition of $$M$$. If $$x$$ is not the root, it can be again moved into the same branch as one of its children, but now the branch between $$x$$ and its parent may need to become migrating. If the sister branch to $$x$$ is already migrating, we have an embedding with the same number of migrations, but a double node closer to the root than $$x$$, contradicting the definition of $$B$$. If the sister branch to $$x$$ is not migrating, there is an embedding with fewer double nodes than $$B$$, contradicting the definition of $$\bar{M}$$. End of proof. The method does not consider every embedding which has a minimal number of migrations (e.g., Fig. 3c). Some embeddings which are considered are not minimal. For example, consider a species tree (A,B) and a gene tree ((a1,b1),b2) with three tips a1, b1, and b2, where a1 belongs to species A and the others to B. Suppose the species tree has greater height than the gene tree. For some values of $$\xi$$ and $$\eta$$, DENIM will assign the coalescence (a1,b1) to A, which means two migrations are needed to embed the gene tree, although it is possible to embed it with only one migration using other values of $$\xi$$ and $$\eta$$. Implementation Notes For a standard multispecies coalescent analysis, operators which change the species tree and the gene trees in a co-ordinated way are beneficial (Jones 2016; Ogilvie et al. 2017). These operators rely on, and preserve, compatibility betwfeen the species tree and gene trees under the multispecies coalescent. In the presence of migration, any gene tree is compatible with any species tree, and these coordinated operators cannot be used as they are. In the current implementation, DENIM uses the standard tree operators implemented in BEAST2 for the species tree and the gene trees. A couple of simple MCMC operators were implemented for the embedding parameters. As noted above, they are changed by operators regardless of whether or not they are being used to embed a gene tree. In general, applying MCMC operators to unused parameters could be very inefficient, and rjMCMC would be preferable, but here the operators for $$\xi$$ and $$\eta$$ are very fast. DENIM is implemented in the BEAST2 framework (Bouckaert et al. 2014), and so benefits from the flexible site models, substitution models, and others available in BEAST2. An analysis can be set up using the graphical interface Beauti. When using the flexible model, different priors can be used for different pairs of species tree branches. DENIM provides two schemes for specifying these priors. In the first scheme, the prior mean for the migration rate between branches depends on how closely related the species are, with the prior mean decreasing as the number of branches between them increases. In the second, the migration rate decays with the time between the most recent common ancestor of the two branches and the midpoint of the interval during which both branches exist. The details are described in the manual supplied with DENIM. DENIM puts some annotations into the species trees sampled during the MCMC process, which give more detail about the migrations. These can be analyzed using the command line program MigrationAnalyser.jar which can be downloaded from http://indriid.com/software.html. Tests Using Simulated Data The method was tested using the simulated data set from Leaché et al. (2014), and with no data. The same data set was also analyzed using *BEAST as implemented in BEAST2 (and not to be confused with StarBEAST2). In this data set, there are scenarios for 4 species and 10 species. There are 4 sequences per locus for each species, except a single outgroup species, for which there is just one sequence. There are thus 13 sequences per locus in the 4-species scenarios and 37 for the 10-species scenarios. In all cases, there are 10 loci, and the sequences are 1000 bp long. The data set is available in the Supplementary Material available on Dryad at http://dx.doi.org/10.5061/dryad.0jt38. Most of the scenarios consist of a migration pattern specifying which branches are involved, and a migration rate (0.001, 0.01, 0.1, or 1.0, corresponding to one migrant per 1000, 100, 10, or 1 generations). In these cases, migration is in both directions. There are also scenarios which consist of a single allele or a single migrant occurring in one direction at time zero. Figure 4 shows the true species trees and provides a guide to the migration patterns; Leaché et al. (2014) should be consulted for full details. Figure 4. View largeDownload slide The true species trees and guide to the migration patterns. The true species tree topologies for the 4-species scenarios (left) and 10-species scenarios (center, right). Most of the scenarios have migration between one pair of branches and are named accordingly. Thus “B-C” means migration between B and C, as shown on the left, and “AB-C” means migration between AB and C, where AB is the ancestor of A and B. Two migration patterns for the 10-species scenarios are given more convenient names as shown. “DeepAnc” means migration between ABCD and EFGH. “4-island” means migration between all six pairs of branches chosen from E, F, G, and H. Figure 4. View largeDownload slide The true species trees and guide to the migration patterns. The true species tree topologies for the 4-species scenarios (left) and 10-species scenarios (center, right). Most of the scenarios have migration between one pair of branches and are named accordingly. Thus “B-C” means migration between B and C, as shown on the left, and “AB-C” means migration between AB and C, where AB is the ancestor of A and B. Two migration patterns for the 10-species scenarios are given more convenient names as shown. “DeepAnc” means migration between ABCD and EFGH. “4-island” means migration between all six pairs of branches chosen from E, F, G, and H. For all these tests, an exponential prior for the migration rate was used; the mean of this prior was varied in some tests. Other priors could be used, and as long as a prior makes large migration rates very unlikely, the results should be similar. No use was made of the options to use different prior means for different pairs of species tree branches, so within each analysis, the same prior is used for all pairs. The MCcoal program (Yang 2015) was used to generate the sequence files. The MCcoal program was extended slightly to make it produce information about each individual migration in the simulations, instead of the normal summary information. The edited source code is available from http://indriid.com/workingnotes2017.html. The MCcoal control files were the same as those of Leaché et al. (2014), except that only the first 50 replicates were used, and the 10-species scenarios were augmented by adding scenarios for a migration rate of 0.01, making 25 migration patterns in total. In Leaché et al. (2014), only the two highest rates 0.1 and 1.0 were used in the 10-species scenarios, but the results from the 4-species scenarios showed that DENIM generally breaks down between 0.01 and 0.1, so the 0.01 rate is an interesting one for DENIM. Prior Only DENIM was tested by running it with no data. The scenarios used all had six species. The prior on the species tree was a pure birth (Yule) model with growth rate 100. This means the expected height of the species tree is $$0.01(1/2+1/3+1/4+1/5+1/6) = 0.0145$$. The number of individuals $$i$$ was 3 or 9 per species, the number $$g$$ of loci was 3 or 9. The population scaling parameter in DENIM (the value of $$N\mu$$ where $$N$$ is the effective population size and $$\mu$$ is the mutation rate in every branch) was set 0.0005, 0.005, 0.05, producing small, medium, and large amounts of ILS. Both the flexible and simple models were used, resulting in a total of $$2\times 2\times 2\times 3\times 2 = 48$$ scenarios. The analyses were run for 30M ($$i=3,g=3$$), 90M ($$i=3,g=9$$ and $$i=9,g=3$$), and 300M ($$i=9,g=9$$) generations. These long runs (up to around 8 h each) proved necessary to obtain convergence. The Leaché Data Set The simulated data set of Leaché et al. (2014) was used with the simple model. An prior mean of 0.005 was used for the migration rate. For most replicates, the chain length was 20M for the 4-species scenarios and 30M for the 10-species scenarios. When this produced an effective sample size (ESS) for the posterior of less than 200, as estimated by CODA (Plummer et al. 2006), the analysis was extended until an ESS of at least 200 was obtained. States were logged every 5000 generations, and burnin was set to 20%. The settings for site and clock models, for both *BEAST and DENIM, were similar to those used by Leaché et al. (2014). DENIM uses a different population model to *BEAST, so the priors for the population size parameters are somewhat different. Site models were linked. A GTR model of substitution was used, with base frequencies equal. The clock models were strict but unlinked. The first locus had clock rate 1, and the others were estimated. The Yule (pure birth) model was used for the species tree. The priors were set as follows. Substitution rates relative to rateCT: Gamma(0.05,20). Relative clock rates: lognormal(0,1). Growth rate for the species tree: lognormal(5,2). PopPriorScale: lognormal(-5,2). Further experiments were conducted in order to assess the sensitivity of the method to the choice of prior. These just used the 4-species scenarios, to reduce the computational resources required. In a Bayesian approach, if the results are sensitive to the choice of prior, it indicates that there is insufficient information in the data, and that the prior is dominating the posterior. The settings were the same as above, except that the flexible model was used, and the prior means for the migration rate were varied: the set of values (0.00125, 0.005, 0.02, 0.08) were used. The flexible model was used since the experiments with priors suggest that it is likely to be better with high migration rates in the prior. Results on Simulated Data Evaluation Measures Several measures are used for assessing accuracy. One measure is the posterior probability of the true species tree topology. The second is the probability coverage. This is the proportion of replicates where the true species tree topology is in the 95% credible set and was used by Leaché et al. (2014) to evaluate *BEAST. It provides a guide as to how often the method is likely to produce misleading results about topology. Neither of these two measures does not take into account errors in estimated node heights. The third measure is based on the branch score of Kuhner and Felsenstein (1994), adapted for rooted trees. It is defined in the original *BEAST paper (Heled and Drummond 2010, Equation (9)). In this article, the score is not normalized by tree length. It accounts for differences in topology and branch lengths, and the entire posterior is evaluated by finding the mean distance between the MCMC samples of the species tree and the true tree. This provides an overall measure of the method’s accuracy. There are also two measures aimed at evaluating how well DENIM can identify which loci are migrating. Firstly, for each locus, DENIM outputs a statistic which counts the number of migrations in the MCMC sample. From this, the posterior probability of a locus containing a migration can be found, and thresholded at 0.5 to produce a Boolean value which indicates whether the locus contains a migration. Evaluated over the replicates for each scenario, this produces a fraction of false positives, where a true migration is absent but incorrectly inferred; and a fraction of true positives, where a migration is correctly detected. A second measure is the posterior means of the migration counts. This is evaluated for the two cases in which a true migration is either absent or present. Prior Only Figure 5 shows the results from using DENIM with no data. In the upper graph, the values should all be close to zero, corresponding to a ratio (estimated height)/(true height) near 1. Instead, some values are much larger, corresponding to a ratio of 10 or more. The large estimates of species tree height are accompanied by large migration counts. Thus, with a large number of loci and individuals, the method breaks down even with small prior mean for the migration rate. The flexible model behaves better than simple model. This behavior means DENIM must be used with caution when there is little phylogenetic signal in the data. An obvious warning sign is the number of migrations inferred by DENIM is more than a small proportion of the number of coalescences. It is always possible to prevent this by using a prior with a small enough mean, but it may be that the only conclusion that can be drawn is that there is too little signal or too much migration for any sensible estimate of the species tree to be possible using DENIM. Figure 5. View largeDownload slide Results with no data. The y-axis in the top graph is the log-ratio of the estimated species tree height to the true species tree height. i is the number of individuals per species, and g is the number of loci. S, M, and L represent small, medium, and large amounts of ILS. The y-axis in the bottom graph is the proportion of migrations per coalescence (or per sister-pair). Circles represent the simple model, crosses represent the flexible model. Large symbols correspond to a prior mean of 0.005, small ones to a prior mean of 0.001. Figure 5. View largeDownload slide Results with no data. The y-axis in the top graph is the log-ratio of the estimated species tree height to the true species tree height. i is the number of individuals per species, and g is the number of loci. S, M, and L represent small, medium, and large amounts of ILS. The y-axis in the bottom graph is the proportion of migrations per coalescence (or per sister-pair). Circles represent the simple model, crosses represent the flexible model. Large symbols correspond to a prior mean of 0.005, small ones to a prior mean of 0.001. It is not understood why the approximation used by DENIM leads to this behavior, but the following may be an explanation. Consider the case of a species tree $$S$$ and a gene tree $$G$$ with just two tips each. The parameter space can be divided into regions corresponding to 0,1,2,... migrations. If there are none, DENIM will sample correctly from the prior. Given the small prior mean for the migration rate, the regions corresponding to 2 or more migrations are not significant. The main bias arises where there is one migration. For $${\mathrm{height}} (S) > {\mathrm{height}} (G)$$, a migration is needed, and DENIM considers this case. If $${\mathrm{height}} (S) < {\mathrm{height}} (G)$$, DENIM only considers the possibility of no migration. The region of parameter space with $${\mathrm{height}} (S) < {\mathrm{height}} (G)$$ is most affected by the approximation used by DENIM, so that small values of height(S) are discriminated against. The general effect of this sort of bias apparently gets worse as the number of loci and individuals increases. Simulated Data: Effective Sample Sizes Although Leaché et al. (2014) report that chain lengths of 10M were sufficient for *BEAST to achieve ESSs above 200 in the 4-species scenarios, I found that one replicate (replicate 3 for scenarioB-C-1-allele) had only reached an ESS of 30 after 20M generations, and two others were below 100. Leaché et al. (2014) used *BEAST as implemented in BEAST(v1) instead of BEAST2. These should be very similar; the only major difference I am aware of is that in BEAST2, a different initialization for the gene trees and species trees is used. In the 4-species scenarios, of a total of 850 replicates, there were 21 (*BEAST) and 6 (DENIM) with posterior ESSs below 200 after 20M generations. In the 10-species scenarios, of a total of 1250 replicates, there were 167 (*BEAST) and 46 (DENIM) with posterior ESSs below 200 after 30M generations. For both the 4-species and 10-species cases, DENIM achieved about 15% greater ESS values than *BEAST in terms of ESSs per million generations, based on median values over all replicates. In the 4-species scenarios, DENIM was about 30% slower than *BEAST in terms of seconds per generation, whereas in the 10-species scenarios, the times were very similar. The end result in terms of ESSs per hour is that DENIM was about 15% slower in the 4-species scenarios, and 15% faster in the 10-species scenarios. As mentioned earlier, all analyses were continued until an ESS for the posterior of at least 200 was obtained. ESSs for other parameters were sometimes below 200. In particular, with DENIM, the species tree root height often had the worst ESS among all parameters. It is not clear why this should be. The root height was generally estimated fairly accurately, and the operators affecting it appeared to be working satisfactorily. Simulated Data: Posterior Probability of True Topology Figures 6 and 7 show the posterior probability of the true species tree topology for DENIM and *BEAST. For most scenarios, the results are similar, but where there is a small amount of migration between non sister species (B-C_0.001, B-C_0.01, B-C-1-allele, and B-C-1-mig in Fig. 6; and 4-island_0.01, F-G_0.01, H-I_0.01, D-E-1-allele, D-E-1-mig, F-G-1-allele, and F-G-1-mig in Fig. 7) DENIM is better, and *BEAST often fails completely. Figure 6. View largeDownload slide Posterior probability of the true species tree topology for the 4-species scenarios, based on 50 replicates. For each scenario, the results for DENIM are in midgray and below those for *BEAST which are in pale gray. See the text for a description of the scenarios. The simple model was used with prior mean of 0.005. Figure 6. View largeDownload slide Posterior probability of the true species tree topology for the 4-species scenarios, based on 50 replicates. For each scenario, the results for DENIM are in midgray and below those for *BEAST which are in pale gray. See the text for a description of the scenarios. The simple model was used with prior mean of 0.005. Figure 7. View largeDownload slide Posterior probability of the true species tree topology for the 10-species scenarios. Other details as Figure 6. Figure 7. View largeDownload slide Posterior probability of the true species tree topology for the 10-species scenarios. Other details as Figure 6. Simulated Data: Coverage Table 1 shows the coverage probability for all scenarios. This can be compared with Tables 1 and 2 of Leaché et al. (2014). As expected, the two implementations of *BEAST give similar results. As with the previous measure, DENIM performs better than *BEAST in the cases of small amounts of migration between nonsister species, and the migrations at time zero between nonsister species, and is otherwise similar. Table 1. Probability coverage Scenario M Coverage DENIM *BEAST No migration 4 0 1.0 1.0 10 0 0.90 0.94 Isolation–migration (A-B or E-F) 4 0.001 1.0 1.0 4 0.01 1.0 1.0 4 0.1 0.98 1.0 4 1.0 1.0 1.0 10 0.01 0.94 0.94 10 0.1 0.96 0.94 10 1.0 1.0 0.98 4-island 10 0.01 0.90 0.38 10 0.1 0.68 0.30 10 1.0 0.62 0.56 Paraphyly (B-C or F-G) 4 0.001 1.0 0.96 4 0.01 0.98 0.50 4 0.1 0.54 0.28 4 1.0 0.14 0.10 10 0.01 0.92 0.38 10 0.1 0.42 0.16 10 1.0 0.12 0.08 Deep paraphyly (H-I) 10 0.01 0.88 0.08 10 0.1 0.16 0.0 10 1.0 0.02 0.0 Ancestral (AB-C or EF-G) 4 0.001 1.0 0.98 4 0.01 1.0 1.0 4 0.1 0.98 0.98 4 1.0 1.0 0.98 10 0.01 0.98 0.96 10 0.1 0.96 0.96 10 1.0 0.86 0.90 Deep ancestral 10 0.01 0.96 0.94 10 0.1 0.98 0.98 10 1.0 0.90 0.88 Single migrant 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 0.98 0.06 10 Sisters (F-E) 0.90 0.92 10 Nonsisters (F-G) 0.90 0.10 Deep single migrant 10 Nonsisters (D-E) 0.88 0.0 Single locus introgression 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 1.0 0.48 10 Sisters (F-E) 0.96 0.92 10 Nonsisters (F-G) 0.92 0.30 Deep single locus introgression 10 Nonsisters (D-E) 0.90 0.02 Scenario M Coverage DENIM *BEAST No migration 4 0 1.0 1.0 10 0 0.90 0.94 Isolation–migration (A-B or E-F) 4 0.001 1.0 1.0 4 0.01 1.0 1.0 4 0.1 0.98 1.0 4 1.0 1.0 1.0 10 0.01 0.94 0.94 10 0.1 0.96 0.94 10 1.0 1.0 0.98 4-island 10 0.01 0.90 0.38 10 0.1 0.68 0.30 10 1.0 0.62 0.56 Paraphyly (B-C or F-G) 4 0.001 1.0 0.96 4 0.01 0.98 0.50 4 0.1 0.54 0.28 4 1.0 0.14 0.10 10 0.01 0.92 0.38 10 0.1 0.42 0.16 10 1.0 0.12 0.08 Deep paraphyly (H-I) 10 0.01 0.88 0.08 10 0.1 0.16 0.0 10 1.0 0.02 0.0 Ancestral (AB-C or EF-G) 4 0.001 1.0 0.98 4 0.01 1.0 1.0 4 0.1 0.98 0.98 4 1.0 1.0 0.98 10 0.01 0.98 0.96 10 0.1 0.96 0.96 10 1.0 0.86 0.90 Deep ancestral 10 0.01 0.96 0.94 10 0.1 0.98 0.98 10 1.0 0.90 0.88 Single migrant 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 0.98 0.06 10 Sisters (F-E) 0.90 0.92 10 Nonsisters (F-G) 0.90 0.10 Deep single migrant 10 Nonsisters (D-E) 0.88 0.0 Single locus introgression 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 1.0 0.48 10 Sisters (F-E) 0.96 0.92 10 Nonsisters (F-G) 0.92 0.30 Deep single locus introgression 10 Nonsisters (D-E) 0.90 0.02 Notes: Coverage for scenarios with continuous migration (above the line) and with single-locus introgression and migration of a single individual at time zero (below the line). Where the programs produce results which are substantially different, the better result is in bold. Table 1. Probability coverage Scenario M Coverage DENIM *BEAST No migration 4 0 1.0 1.0 10 0 0.90 0.94 Isolation–migration (A-B or E-F) 4 0.001 1.0 1.0 4 0.01 1.0 1.0 4 0.1 0.98 1.0 4 1.0 1.0 1.0 10 0.01 0.94 0.94 10 0.1 0.96 0.94 10 1.0 1.0 0.98 4-island 10 0.01 0.90 0.38 10 0.1 0.68 0.30 10 1.0 0.62 0.56 Paraphyly (B-C or F-G) 4 0.001 1.0 0.96 4 0.01 0.98 0.50 4 0.1 0.54 0.28 4 1.0 0.14 0.10 10 0.01 0.92 0.38 10 0.1 0.42 0.16 10 1.0 0.12 0.08 Deep paraphyly (H-I) 10 0.01 0.88 0.08 10 0.1 0.16 0.0 10 1.0 0.02 0.0 Ancestral (AB-C or EF-G) 4 0.001 1.0 0.98 4 0.01 1.0 1.0 4 0.1 0.98 0.98 4 1.0 1.0 0.98 10 0.01 0.98 0.96 10 0.1 0.96 0.96 10 1.0 0.86 0.90 Deep ancestral 10 0.01 0.96 0.94 10 0.1 0.98 0.98 10 1.0 0.90 0.88 Single migrant 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 0.98 0.06 10 Sisters (F-E) 0.90 0.92 10 Nonsisters (F-G) 0.90 0.10 Deep single migrant 10 Nonsisters (D-E) 0.88 0.0 Single locus introgression 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 1.0 0.48 10 Sisters (F-E) 0.96 0.92 10 Nonsisters (F-G) 0.92 0.30 Deep single locus introgression 10 Nonsisters (D-E) 0.90 0.02 Scenario M Coverage DENIM *BEAST No migration 4 0 1.0 1.0 10 0 0.90 0.94 Isolation–migration (A-B or E-F) 4 0.001 1.0 1.0 4 0.01 1.0 1.0 4 0.1 0.98 1.0 4 1.0 1.0 1.0 10 0.01 0.94 0.94 10 0.1 0.96 0.94 10 1.0 1.0 0.98 4-island 10 0.01 0.90 0.38 10 0.1 0.68 0.30 10 1.0 0.62 0.56 Paraphyly (B-C or F-G) 4 0.001 1.0 0.96 4 0.01 0.98 0.50 4 0.1 0.54 0.28 4 1.0 0.14 0.10 10 0.01 0.92 0.38 10 0.1 0.42 0.16 10 1.0 0.12 0.08 Deep paraphyly (H-I) 10 0.01 0.88 0.08 10 0.1 0.16 0.0 10 1.0 0.02 0.0 Ancestral (AB-C or EF-G) 4 0.001 1.0 0.98 4 0.01 1.0 1.0 4 0.1 0.98 0.98 4 1.0 1.0 0.98 10 0.01 0.98 0.96 10 0.1 0.96 0.96 10 1.0 0.86 0.90 Deep ancestral 10 0.01 0.96 0.94 10 0.1 0.98 0.98 10 1.0 0.90 0.88 Single migrant 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 0.98 0.06 10 Sisters (F-E) 0.90 0.92 10 Nonsisters (F-G) 0.90 0.10 Deep single migrant 10 Nonsisters (D-E) 0.88 0.0 Single locus introgression 4 Sisters (B-A) 1.0 1.0 4 Nonsisters (B-C) 1.0 0.48 10 Sisters (F-E) 0.96 0.92 10 Nonsisters (F-G) 0.92 0.30 Deep single locus introgression 10 Nonsisters (D-E) 0.90 0.02 Notes: Coverage for scenarios with continuous migration (above the line) and with single-locus introgression and migration of a single individual at time zero (below the line). Where the programs produce results which are substantially different, the better result is in bold. Table 2. Migration detection Scenario M Absent Present No migration 4 0 1/500 0/0 10 0 12/500 0/0 Isolation-migration (A-B or E-F) 4 0.001 1/494 3/6 4 0.01 3/403 49/97 4 0.1 2/131 79/369 4 1.0 0/20 2/480 10 0.01 4/396 52/104 10 0.1 2/106 114/394 10 1.0 0/5 5/495 4-island 10 0.01 19/130 355/370 10 0.1 4/4 490/496 10 1.0 0/0 421/500 Paraphyly (B-C or F-G) 4 0.001 1/489 7/11 4 0.01 0/401 81/99 4 0.1 2/127 173/373 4 1.0 2/21 67/479 10 0.01 10/406 80/94 10 0.1 13/114 146/386 10 1.0 5/5 78/495 Deep paraphyly (H-I) 10 0.01 22/309 157/191 10 0.1 10/23 280/477 10 1.0 0/0 133/500 Ancestral (AB-C or EF-G) 4 0.001 0/491 3/9 4 0.01 1/415 20/85 4 0.1 0/161 39/339 4 1.0 0/17 11/483 10 0.01 7/418 17/82 10 0.1 4/173 22/327 10 1.0 0/18 16/482 Deep ancestral 10 0.01 3/422 17/78 10 0.1 2/160 15/340 10 1.0 0/6 23/494 Single migrant 4 Sisters (B-A) 0/0 363/500 4 Nonsisters (B-C) 0/0 472/500 10 Sisters (F-E) 0/0 401/500 10 Nonsisters (F-G) 0/0 488/500 Deep single migrant 10 Nonsisters (D-E) 0/0 500/500 Single locus introgression 4 Sisters (B-A) 1/450 40/50 4 Nonsisters (B-C) 3/450 48/50 10 Sisters (F-E) 10/450 41/50 10 Nonsisters (F-G) 10/450 48/50 Deep single locus introgression 10 Nonsisters (D-E) 11/450 50/50 Scenario M Absent Present No migration 4 0 1/500 0/0 10 0 12/500 0/0 Isolation-migration (A-B or E-F) 4 0.001 1/494 3/6 4 0.01 3/403 49/97 4 0.1 2/131 79/369 4 1.0 0/20 2/480 10 0.01 4/396 52/104 10 0.1 2/106 114/394 10 1.0 0/5 5/495 4-island 10 0.01 19/130 355/370 10 0.1 4/4 490/496 10 1.0 0/0 421/500 Paraphyly (B-C or F-G) 4 0.001 1/489 7/11 4 0.01 0/401 81/99 4 0.1 2/127 173/373 4 1.0 2/21 67/479 10 0.01 10/406 80/94 10 0.1 13/114 146/386 10 1.0 5/5 78/495 Deep paraphyly (H-I) 10 0.01 22/309 157/191 10 0.1 10/23 280/477 10 1.0 0/0 133/500 Ancestral (AB-C or EF-G) 4 0.001 0/491 3/9 4 0.01 1/415 20/85 4 0.1 0/161 39/339 4 1.0 0/17 11/483 10 0.01 7/418 17/82 10 0.1 4/173 22/327 10 1.0 0/18 16/482 Deep ancestral 10 0.01 3/422 17/78 10 0.1 2/160 15/340 10 1.0 0/6 23/494 Single migrant 4 Sisters (B-A) 0/0 363/500 4 Nonsisters (B-C) 0/0 472/500 10 Sisters (F-E) 0/0 401/500 10 Nonsisters (F-G) 0/0 488/500 Deep single migrant 10 Nonsisters (D-E) 0/0 500/500 Single locus introgression 4 Sisters (B-A) 1/450 40/50 4 Nonsisters (B-C) 3/450 48/50 10 Sisters (F-E) 10/450 41/50 10 Nonsisters (F-G) 10/450 48/50 Deep single locus introgression 10 Nonsisters (D-E) 11/450 50/50 Notes: Migration detection for scenarios with continuous migration (above the line) and with single-locus introgression and migration of a single individual at time zero (below the line). The “Absent” column gives the fraction of false positives, and the “Present” column gives the fraction of true positives. These are bolded where DENIM correctly identifies a migration in 75% or more of the replicates. Table 2. Migration detection Scenario M Absent Present No migration 4 0 1/500 0/0 10 0 12/500 0/0 Isolation-migration (A-B or E-F) 4 0.001 1/494 3/6 4 0.01 3/403 49/97 4 0.1 2/131 79/369 4 1.0 0/20 2/480 10 0.01 4/396 52/104 10 0.1 2/106 114/394 10 1.0 0/5 5/495 4-island 10 0.01 19/130 355/370 10 0.1 4/4 490/496 10 1.0 0/0 421/500 Paraphyly (B-C or F-G) 4 0.001 1/489 7/11 4 0.01 0/401 81/99 4 0.1 2/127 173/373 4 1.0 2/21 67/479 10 0.01 10/406 80/94 10 0.1 13/114 146/386 10 1.0 5/5 78/495 Deep paraphyly (H-I) 10 0.01 22/309 157/191 10 0.1 10/23 280/477 10 1.0 0/0 133/500 Ancestral (AB-C or EF-G) 4 0.001 0/491 3/9 4 0.01 1/415 20/85 4 0.1 0/161 39/339 4 1.0 0/17 11/483 10 0.01 7/418 17/82 10 0.1 4/173 22/327 10 1.0 0/18 16/482 Deep ancestral 10 0.01 3/422 17/78 10 0.1 2/160 15/340 10 1.0 0/6 23/494 Single migrant 4 Sisters (B-A) 0/0 363/500 4 Nonsisters (B-C) 0/0 472/500 10 Sisters (F-E) 0/0 401/500 10 Nonsisters (F-G) 0/0 488/500 Deep single migrant 10 Nonsisters (D-E) 0/0 500/500 Single locus introgression 4 Sisters (B-A) 1/450 40/50 4 Nonsisters (B-C) 3/450 48/50 10 Sisters (F-E) 10/450 41/50 10 Nonsisters (F-G) 10/450 48/50 Deep single locus introgression 10 Nonsisters (D-E) 11/450 50/50 Scenario M Absent Present No migration 4 0 1/500 0/0 10 0 12/500 0/0 Isolation-migration (A-B or E-F) 4 0.001 1/494 3/6 4 0.01 3/403 49/97 4 0.1 2/131 79/369 4 1.0 0/20 2/480 10 0.01 4/396 52/104 10 0.1 2/106 114/394 10 1.0 0/5 5/495 4-island 10 0.01 19/130 355/370 10 0.1 4/4 490/496 10 1.0 0/0 421/500 Paraphyly (B-C or F-G) 4 0.001 1/489 7/11 4 0.01 0/401 81/99 4 0.1 2/127 173/373 4 1.0 2/21 67/479 10 0.01 10/406 80/94 10 0.1 13/114 146/386 10 1.0 5/5 78/495 Deep paraphyly (H-I) 10 0.01 22/309 157/191 10 0.1 10/23 280/477 10 1.0 0/0 133/500 Ancestral (AB-C or EF-G) 4 0.001 0/491 3/9 4 0.01 1/415 20/85 4 0.1 0/161 39/339 4 1.0 0/17 11/483 10 0.01 7/418 17/82 10 0.1 4/173 22/327 10 1.0 0/18 16/482 Deep ancestral 10 0.01 3/422 17/78 10 0.1 2/160 15/340 10 1.0 0/6 23/494 Single migrant 4 Sisters (B-A) 0/0 363/500 4 Nonsisters (B-C) 0/0 472/500 10 Sisters (F-E) 0/0 401/500 10 Nonsisters (F-G) 0/0 488/500 Deep single migrant 10 Nonsisters (D-E) 0/0 500/500 Single locus introgression 4 Sisters (B-A) 1/450 40/50 4 Nonsisters (B-C) 3/450 48/50 10 Sisters (F-E) 10/450 41/50 10 Nonsisters (F-G) 10/450 48/50 Deep single locus introgression 10 Nonsisters (D-E) 11/450 50/50 Notes: Migration detection for scenarios with continuous migration (above the line) and with single-locus introgression and migration of a single individual at time zero (below the line). The “Absent” column gives the fraction of false positives, and the “Present” column gives the fraction of true positives. These are bolded where DENIM correctly identifies a migration in 75% or more of the replicates. Supplementary Table S1 available on Dryad shows the coverage probability for the 4-species scenarios as the settings in DENIM are varied. With this data set, there is little difference between the simple and flexible models. As the prior mean is increased, the results for nonsister migration with rates 0.1 and 1.0 improve, while other scenarios become slightly worse. With the smallest prior mean of 0.00125, the result for the scenario with a single migrant between nonsisters becomes worse. Simulated Data: Branch Scores Supplementary Figures S1 and S2 available on Dryad, show the branch scores for the 4 and 10 species scenarios. The general picture is that DENIM produces good results for the two smallest migration rates 0.001 and 0.01, and for the migrations at time zero, but breaks down at higher rates. They also show that in the cases where the migration is between sister species (the A-B, and AB-C scenarios in Supplementary Fig. S1; and the DeepAnc, E-F, and EF-G scenarios in Supplementary Fig. S2) the errors in branch lengths are smaller for DENIM than for *BEAST. Supplementary Figures S3, S4, S5, and S6 show results for the 4-species scenarios with the flexible model, and different prior means 0.00125, 0.005, 0.02, 0.08. The results do not vary greatly with the choice of prior, and follow a similar pattern to the coverage results. All but the high-rate paraphyletic scenarios become worse with a prior mean of 0.08. Simulated Data: Migration Detection Table 2 shows how well DENIM is able to detect migration using the first measure of the measures described in the section “Evaluation measures” for identifying migrating loci. For each scenario, there are 50 replicates and 10 loci, hence 500 cases in which migration may or may not be present. The table shows the number of times that DENIM incorrectly infers migration when it is absent, as a fraction of the number of times it is absent (false positives) and the number of times it correctly infers migration when it is present, as a fraction of the number of times it is present. The general picture is that DENIM only does this well when the migration rate is low and between nonsister species. DENIM usually “explains” migration between sister species by squashing the species tree (like *BEAST). An exception to the general picture is where there is a single migrant or locus introgression at time zero between sister species (B-A-1-allele, B-A-1-mig, F-E-1-allele, F-E-1-mig) where the migration is usually detected. Supplementary Figures S7 and S8 show results for the posterior means of the number of migrations. Results on Empirical Data I reanalyzed the pocket gopher data of Belfiore et al. (2008), as supplied with BEAST. There are seven putative species from the Thomomys genus, and one outgroup Orthogeomys heterodus. Experts are certain this is the true outgroup (Heled and Drummond 2010). There are 26 individuals and 7 noncoding nuclear sequence loci. This data set has been analyzed in various papers, including Heled and Drummond (2010), the paper which introduced *BEAST. This group of species has undergone a rapid radiation, so incomplete lineage sorting is expected to be high. I used the HKY substitution model, linked site models, estimated relative clock rate for all loci except the first, and a strict clock. The results here use the simple model for migration, with an exponential prior with mean 0.001. In the *BEAST analysis, the outgroup species O. heterodus was misplaced (their Fig. 8a), and the authors comment that “The tendency to place the outgroup incorrectly appears to be caused by just one gene,” namely TBO29. The tree from the DENIM analysis is shown in Figure 8. The outgroup is correctly placed, and it is very similar to the *BEAST result with ingroup monophyly enforced (their Fig. 8b). The DENIM tree is somewhat shorter, perhaps due to a different site model or population model. A migration was inferred between O. heterodus and the [(Thomomys bottae, Thomomys townsendi), Thomomys umbrinus] clade, the same clades that *BEAST grouped together. This migration was present in about 95% of the MCMC samples. The other migrations that appear in the posterior samples have much lower posterior probabilities. The next migrations that DENIM analysis suggests (at about 24%) are very recent ones, both ways, of TBO47 between T. bottae and T. umbrinus. This pair is followed (at about 18%) by a very recent one of TBO64 Thomomys talpoides to Thomomys idahoensis (going back in time). Figure 8. View largeDownload slide Gopher tree. Posterior clade probabilities are shown next to branches. The node bars are 95% HPDs for the node heights. The migration of locus TBO64 is also indicated. Figure 8. View largeDownload slide Gopher tree. Posterior clade probabilities are shown next to branches. The node bars are 95% HPDs for the node heights. The migration of locus TBO64 is also indicated. It is interesting that DENIM identifies TBO64, but not TBO29, as a locus with migration. The posterior mean count of migrations for TBO64 was 1.20, for TBO47 it was 0.47, for TBO26 it was 0.11, and the rest, including TBO29, were well under 0.1. In Belfiore et al. (2008), the individual gene trees were estimated separately, and it appears from their Figure 2 that in TBO64, the relative distance between Orthogeomys heterodus and other taxa is considerably smaller than is the case for any other locus. Other settings were also tried. With the simple model, and a prior mean of 0.005, the method broke down, in a way similar to the tests with no data: the species tree height and migration counts were very large. With the simple model, and a prior mean of 0.0002, the result was similar to Figure 8, but the posterior probability for the ingroup decreased to 0.72. With the flexible model, prior means of 0.001, 0.005, and 0.02 produced similar results to Figure 8, but the migration counts increased with the prior mean. Discussion The multispecies coalescent model implies a particular notion of species, as described in Heled and Drummond (2010): “any group of individuals that, after some ‘divergence’ time, have no history of breeding with individuals outside that group.” In the tree-island model used by DENIM, this definition is relaxed to allow some gene flow after the divergence, but it is still supposed that there is a particular time at which two groups of individuals emerge with only a small amount of gene flow between them from that point on. Further research using a variety of approaches will be needed to ascertain how often this model is a good approximation to reality. In the meantime, I would argue that it is preferable to the assumption of no gene flow at all in the “pure” multispecies coalescent model. DENIM should be useful to biologists who would like to estimate a species tree under the assumptions of the multispecies coalescent but are concerned that gene flow may distort the results. Data Requirements The results on simulated data show that DENIM can deal with migration rates up to about 1 in 100 or 1 in 10 migrants per generation. The performance of DENIM in terms of ESSs per hour is similar to that of *BEAST, indicating that DENIM is suitable for up to tens of species and tens of loci. The simulations with 10 species, 37 individuals, and 10 loci took about 80 min to achieve a posterior ESS of $$>$$ 200; similar tests with 40 loci took about 36 h. One should note that more recent implementations of the multispecies coalescent model, with new operators, such as StarBEAST2 (Ogilvie et al. 2017), STACEY (Jones 2016), and BPP (Yang 2015) are likely to be significantly faster, especially when the number of loci is large. The behavior of DENIM with no data is far from ideal, and care must be taken to ensure the signal in the data is sufficient to overcome the potential bias from the prior. In general, the flexible model appears to be preferable in this regard. The embedding scheme means that DENIM can only consider at most one migration per coalescence. If the number of migrations inferred by DENIM is more than a small proportion of the number of coalescences, the approximation is likely to produce bias. The results from the simulations give an indication of the how informative the data needs to be in order to use DENIM successfully. In the simulations, most of the branch lengths had a mean of 0.02, and although the randomness in the scenarios made some branches much shorter, the individual sequences are generally quite informative. Consider a large number of short sequences from a shallow phylogeny. Such data sets are always challenging for methods based on the multispecies coalescent model, since mutational variation produces great uncertainty about the gene trees. However, such data sets are likely to be particularly problematic for DENIM, since the bias in the prior gets worse with larger numbers of empty sequences. It is therefore important that the individual sequences are sufficiently informative. The best solution, if feasible, is longer sequences (for example 10000 bp instead of 1000 bp). This does increase the probability of recombination within a locus, but this seems the lesser evil in the case of shallow phylogenies. Other Guidance Often, the amount of migration will not be known a priori. The posterior predictive approach of Reid et al. (2014) and Gruenstaeudl et al. (2015) may be useful here. If DENIM reports no migration, this provides some assurance that migration (especially between nonsister species) is not interfering with the inference of the topology. It is harder to detect migration between sister species, and you should not expect DENIM to rule out this possibility. If DENIM reports a small amount of migration, the species tree is likely to be more accurate than that reported by software which is based on the multispecies coalescent but ignores gene flow. Alternatively, it may be possible to identify and remove troublesome loci. If DENIM reports a lot of migration, the results from DENIM may be unreliable, and estimating an accurate species tree using any current method may not be feasible, although it is certainly worth considering the species network methods described in the introduction. DENIM identifies loci which are “badly behaved,” rather than those which migrate. That is, it identifies loci with migrations which result in an incompatibility with the species tree. Some migrations do not cause incompatibility, because (going back in time) they do not coalesce with another lineage until the species tree branches have merged; or a lineage may migrate, then migrate back again before coalescing, and so on. Looking Ahead DENIM combines the tree-island model, which is an elegant mathematical model for speciation, coalescence, and migration, with a rather crude approximation for sampling the posterior. The two components are quite independent. The partial sampling of the posterior is a trade-off between accuracy on the one hand and computational effort and simplicity of implementation on the other. An exact sampling from the posterior for large data sets when there is a large amount of migration may remain computationally infeasible for decades. However, there are almost certainly better compromises to be found than the one currently implemented in DENIM. For example, Palczewski and Beerli (2013) provides an approximation for high rates. It is possible to combine the tree-island model with species delimitation and thus coestimate the delimitation and the species tree in the presence of migration. The current implementation of DENIM allows this, using the birth–death-collapse model of Jones et al. (2015), but this possibility has not been explored in any detail. The parameter space become even larger and obtaining useful results in a reasonable amount of time may be very difficult. DENIM uses the usual birth–death model to provide a prior for the species tree, but this only provides a probability density for the reconstructed tree, with all extinct branches removed. In the presence of migration, there may be gene flow from extinct species which could result in unusually deep coalescences and bias the analysis. DENIM could be extended so that the full tree, including the extinct branches is sampled from. There is no upper limit to the number of extinct branches that could exist, so again there are more computational difficulties. In principle, this could allow the detection of some extinct species from genetic data alone. Supplementary Material Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.0jt38. Acknowledgments I thank Adam Leaché for supplying the MCcoal control files used to generate the simulated data. I thank Bengt Oxelman and four anonymous reviewers for constructive comments on earlier versions of this paper. References Beerli P. , Felsenstein J. 2001 . Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach . Proc. Natl. Acad. Sci. USA 98 : 4563 – 4568 . Google Scholar CrossRef Search ADS Belfiore N.M. , Liu L. , Moritz C. 2008 . Multilocus phylogenetics of a rapid radiation in the genus Thomomys (Rodentia: Geomyidae) . Syst. Biol. 57 : 294 – 310 . Google Scholar CrossRef Search ADS PubMed Bouckaert R. , Heled J. , Khnert D. , Vaughan T. , Wu C.-H. , Xie D. , Suchard M.A. , Rambaut A. , Drummond A.J. 2014 . BEAST 2: a software platform for Bayesian evolutionary analysis . PLoS Comput. Biol. 10 : e1003537 . Google Scholar CrossRef Search ADS PubMed Dalquen D.A. , Zhu T. , Yang Z. 2017 . Maximum likelihood implementation of an isolation-with-migration model for three species . Syst. Biol. 66 : 379 – 398 . Google Scholar PubMed Ewing G. , Allen R. 2006 . Estimating population parameters using the structured serial coalescent with Bayesian MCMC inference when some demes are hidden . Evol. Bioinform. 2 : 237 – 235 . Google Scholar CrossRef Search ADS Figueiró H.V. , Li G. , Trindade F.J. , Assis J. , Pais F. , Fernandes G. , Santos S.H.D. , Hughes G.M. , Komissarov A. , Antunes A. , Trinca C.S. , Rodrigues M.R. , Linderoth T. , Bi K. , Silveira L. , Azevedo F.C.C. , Kantek D. , Ramalho E. , Brassaloti R.A. , Villela P.M.S. , Nunes A.L.V. , Teixeira R.H.F. , Morato R.G. , Loska D. , Saragüeta P. , Gabaldón T. , Teeling E.C. , O’Brien S.J. , Nielsen R. , Coutinho L.L. , Oliveira G. , Murphy W.J. , Eizirik E. 2017 . Genome-wide signatures of complex introgression and adaptive evolution in the big cats . Sci. Adv. 3 : e1700299 . Google Scholar CrossRef Search ADS PubMed Gruenstaeudl M. , Reid N.M. , Wheeler G.L. , Carstens B.C. 2015 . Posterior predictive checks of coalescent models: P2c2m, an R package . Mol. Ecol. Resour. 16 : 193 – 205 . Google Scholar CrossRef Search ADS PubMed Hejase H. , VandePol N. , Bonito G.M. , Liu K.J. 2018 . Fastnet: fast and accurate inference of phylogenetic networks using large-scale genomic sequence data . bioRxiv . https://doi.org/10.1101/132795. Heled J. , Drummond A. 2010 . Bayesian inference of species trees from multilocus data . Mol. Biol. Evol. 27 : 570 – 580 . Google Scholar CrossRef Search ADS PubMed Hey J. 2010 . Isolation with migration models for more than two populations . Mol. Biol. Evol. 27 : 905 – 920 . Google Scholar CrossRef Search ADS PubMed Hey J. , Nielsen R. 2004 . Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis . Genetics 167 : 747 – 760 . Google Scholar CrossRef Search ADS PubMed Hey J. , Nielsen R. 2007 . Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics . Proc. Natl. Acad. Sci. USA 104 : 2785 – 2790 . Google Scholar CrossRef Search ADS Hudson R.R. , et al. 1990 . Gene genealogies and the coalescent process . Oxf. Surv. Evol. Biol. 7 : 1 – 44 . Jackson N.D. , Morales A.E. , Carstens B.C. , O’Meara B.C. 2017 . Phrapl: phylogeographic inference using approximate likelihoods . Syst. Biol. 66 : 1045 – 1053 . Google Scholar CrossRef Search ADS PubMed Jones G. 2016 . Algorithmic improvements to species delimitation and phylogeny estimation under the multispecies coalescent . J. Math. Biol. 74 : 447 – 467 . Google Scholar CrossRef Search ADS PubMed Jones G. , Aydin Z. , Oxelman B. 2015 . Dissect: an assignment-free Bayesian discovery method for species delimitation under the multispecies coalescent . Bioinformatics 31 : 991 – 998 . Google Scholar CrossRef Search ADS PubMed Kingman J. 1982 . The coalescent . Stoch. Process. Their Appl. 13 : 235 – 248 . Google Scholar CrossRef Search ADS Leaché A.D. , Harris R.B. , Rannala B. , Yang Z. 2014 . The influence of gene flow on species tree estimation: a simulation study . Syst. Biol. 63 : 17 – 30 . Google Scholar CrossRef Search ADS PubMed Martin S.H. , Dasmahapatra K.K. , Nadeau N.J. , Salazar C. , Walters J.R. , Simpson F. , Jiggins C.D. 2013 . Genome-wide evidence for speciation with gene flow in heliconius butterflies . Genome Res. 23 : 1817 – 1828 . Google Scholar CrossRef Search ADS PubMed Nosil P. 2008 . Speciation with gene flow could be common . Mol. Ecol. 17 : 2103 – 2106 . Google Scholar CrossRef Search ADS PubMed Ogilvie H.A. , Bouckaert R.R. , Drummond A.J. 2017 . StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates . Mol. Biol. Evol. 34 : 2101 – 2114 . Google Scholar CrossRef Search ADS PubMed Palczewski M. , Beerli P. 2013 . A continuous method for gene flow . Genetics 194 : 687 – 696 . Google Scholar CrossRef Search ADS PubMed Plummer M. , Best N. , Cowles K. , Vines K. 2006 . CODA: convergence diagnosis and output analysis for MCMC . R News 6 : 7 – 11 . Reid N.M. , Hird S.M. , Brown J.M. , Pelletier T.A. , McVay J.D. , Satler J.D. , Carstens B.C. 2014 . Poor fit to the multispecies coalescent is widely detectable in empirical data . Syst. Biol. 63 : 322 – 333 . Google Scholar CrossRef Search ADS PubMed Rheindt F.E. , Fujita M.K. , Wilton P.R. , Edwards S.V. 2014 . Introgression and phenotypic assimilation in Zimmerius flycatchers (tyrannidae): population genetic and phylogenetic inferences from genome-wide SNPs . Syst. Biol. 63 : 134 – 152 . Google Scholar CrossRef Search ADS PubMed Romaschenko K. , Garcia-Jacas N. , Peterson P.M. , Soreng R.J. , Vilatersana R. , Susanna A. 2014 . Miocenepliocene speciation, introgression, and migration of Patis and Ptilagrostis (Poaceae: Stipeae) . Mol. Phylogenet. Evol. 70 : 244 – 259 . Google Scholar CrossRef Search ADS PubMed Solís-Lemus C. , Ané C. 2016 . Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting . PLOS Genet. 12 : 1 – 21 . Google Scholar CrossRef Search ADS Sun Y. , Abbott R.J. , Li L. , Li L. , Zou J. , Liu J. 2014 . Evolutionary history of purple cone spruce (Picea purpurea) in the Qinghai Tibet Plateau: homoploid hybrid origin and Pleistocene expansion . Mol. Ecol. 23 : 343 – 359 . Google Scholar CrossRef Search ADS PubMed Tian Y. , Kubatko L.S. 2016 . Distribution of coalescent histories under the coalescent model with gene flow . Mol. Phylogenet. Evol. 105 : 177 – 192 . Google Scholar CrossRef Search ADS PubMed Wen D. , Nakhleh L. 2017 . Coestimating reticulate phylogenies and gene trees from multilocus sequence data . Syst. Biol . doi: https://doi.org/10.1093/sysbio/syx085 . Yang Z. 2015 . The BPP program for species tree estimation and species delimitation . Curr. Zool. 61 : 854 – 865 . Google Scholar CrossRef Search ADS Zhang C. , Ogilvie H.A. , Drummond A.J. , Stadler T. 2018 . Bayesian inference of species networks from multilocus sequence data . Mol. Biol. Evol. 35 : 504 – 517 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

### Journal

Systematic BiologyOxford University Press

Published: May 24, 2018

## You’re reading a free preview. Subscribe to read the entire article.

### DeepDyve is your personal research library

It’s your single place to instantly
that matters to you.

over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month ### Explore the DeepDyve Library ### Search Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly ### Organize Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place. ### Access Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals. ### Your journals are on DeepDyve Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more. All the latest content is available, no embargo periods. DeepDyve ### Freelancer DeepDyve ### Pro Price FREE$49/month
\$360/year

Save searches from
PubMed

Create lists to

Export lists, citations