Background: Horizontal gene transfer (HGT ), a process of acquisition and fixation of foreign genetic material, is an important biological phenomenon. Several approaches to HGT inference have been proposed. However, most of them either rely on approximate, non-phylogenetic methods or on the tree reconciliation, which is computationally intensive and sensitive to parameter values. Results: We investigate the locus tree inference problem as a possible alternative that combines the advantages of both approaches. We present several algorithms to solve the problem in the parsimony framework. We introduce a novel tree mapping, which allows us to obtain a heuristic solution to the problems of locus tree inference and dupli- cation classification. Conclusions: Our approach allows for faster comparisons of gene and species trees and improves known algorithms for duplication inference in the presence of polytomies in the species trees. We have implemented our algorithms in a software tool available at https ://githu b.com/mciac h/Locus TreeI nfere nce. Keywords: Rank, Taxon, Ranked species tree, Speciation, Gene duplication, Gene loss, Horizontal gene transfer and a species tree. A species tree is a schematic repre- Background sentation of an evolutionary history of a set of species, Horizontal gene transfer (HGT) is the process of acquisi- in which a node corresponds to a speciation event (i.e. tion and fixation of foreign genetic material. It can lead separation of a group of organisms into two distinct spe- to substantial changes in the ecology and evolution of cies). Likewise, a gene tree is a schematic representation recipient organism, sometimes leading to the emergence of the evolutionary history of a set of genes. A node of of new pathogens [1]. HGT is interesting both from bio- a gene tree corresponds to an emergence of a new gene, logical and computational perspective. Several methods whether due to a speciation or an evolutionary event like of detecting horizontally transferred genes have been HGT or duplication. The leafs of the gene tree are usually proposed, which can be roughly divided into two catego- labelled by the leafs of the corresponding species tree. ries [2]. So-called surrogate methods are computationally Consequently, while the leaf labels of the species trees efficient, yet often imprecise. The other group consists of are unique, a single label may occur multiple times in the the phylogenetic methods, most notably the tree reconcili- gene tree. ation [3]. Apart from the aforementioned evolutionary events, HGT and gene duplication are examples of evolu- populational effects are a major source of incongru - tionary events in which an organism gains a new locus ence between gene and species trees. These effects arise (referred later on as locus gain events). A locus is a frag- due to the fact that each gene may undergo a mutation, ment of a chromosome with a specific gene. The locus which creates a new nucleotide sequence referred to as gain events cause an incongruence between a gene tree an allele of the gene. It follows that the set of alleles in a single species can itselt exhibit a complex evolution- *Correspondence: m_ciach@student.uw.edu.pl ary relationship. During a speciation, the alleles present Faculty of Mathematics, Informatics and Mechanics, University in the population are sorted into two sets corresponding of Warsaw, Banacha 2, Warsaw, Poland Full list of author information is available at the end of the article © The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creat iveco mmons .org/licen ses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creat iveco mmons .org/ publi cdoma in/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 2 of 18 to the diverging populations (a process termed allele or latter usually needs to be inferred by comparing the gene lineage sorting). It follows that the evolutionary distance tree to the species tree. between alleles may not correspond to the evolutionary When population effects are negligible due to long distance between their respective species, causing an times between speciations, the evolutionary events are incongruence between the species tree and the gene tree the only source of incongruence between the gene and inferred from the nucleotide sequences of those alleles. the species tree. In this case, the evolutionary relation- Such incongruence is referred to as an incomplete lineage ship between loci can be regarded as identical to the sorting (ILS). It can be shown that as the time between evolutionary relationship between alleles. Consequently, speciations increases, the probability of an ILS decreases the locus tree can be represented as the gene tree with exponentially. Therefore, if all the speciations in the con - nodes labeled either as speciations or locus gains. In this sidered species tree are separated by a sufficiently large approach, a locus gain node corresponds to the time point period of time, the populational effects may be consid - in which a new locus is first observed (note that after a ered negligible. For a more detailed discussion of the ILS, duplication, the choice of the “new” versus the “old” locus the reader is referred to e.g. [4–6]. is often arbitrary). Such labeling allows to decompose the The new locus created by one of the aforementioned gene tree into a forest of trees representing independent evolutionary events evolves more or less independently evolutionary histories of different loci by detaching the of other loci. This observation leads to the concept of locus gain nodes (see e.g. Fig. 1). The concept of gene tree a locus tree [4, 7], which serves as a schematic represena- decomposition has been investigated earlier in the con- tion of the evolutionary relationship between different text of tree comparison [8], but, to our knowledge, not in loci. A node of a locus tree corresponds to an emer- the context of inference of evolutionary events or locus gence of a new locus due to either a speciation or an trees. evolutionary event (note that a mutation of the nucleo- Distinguishing between different locus gain events is tide sequence does not create a new locus). The loci are challenging, as their effects on gene trees are topologi - assumed to evolve within the branches of the species tree cally similar. In reconciliation, weights of events have (or, in case of an HGT, between two branches), while the to be specified; these are, however, rarely known. The alleles are assumed to evolve within the branches of the fact that the results depend strongly on those unknown locus tree. Therefore, a locus tree is an intermediate con - parameters may undermine the credibility of biological cept between the gene and the species tree, as it encom- conclusions. To properly estimate the weights, high-qual- passes the evolutionary events like gene duplication or ity training datasets are needed, in which inferred events HGT, but not the populational effects. Another distinc - are biologically supported. tion between the gene and locus trees is that the former Many cases of HGT were found by manual inspec- can be inferred from the nucleotide sequences, while the tion of incongruences in gene trees [9]. Inferring a locus Fig. 1 An example of locus trees for G = ((a , b ), (b , c )) with two decompositions F and F consistent with S = (a, (b, c), d) . These 1 2 3 4 1 2 decompositions are created by cuts indicated with red color. M : G → S is shown for every set of cuts X (for internal nodes). Here, �(F , S) = �(F , S) = 0 , �(F ) = 2 · GAIN and �(F ) = 3 · GAIN , i.e., F is optimal 1 2 1 2 1 Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 3 of 18 tree facilitates such analyses, as it allows to automati- We call a and b comparable if a b or b a , other wise cally detect the incongruences. This approach has sev - a and b are called incomparable. The parent of a node a eral advantages over reconciliation. It allows restricting is denoted as parent(a) . The subtree of T rooted at v is to only two parameters: the locus gain and the locus denoted by T(v). By L(T) we denote the set of all leaves loss weight. It is also more robust to imprecise data, as in a tree T and we use L(v) instead of L(T(v)). By root(T ) improperly placed branches will only be locally detected we denote the root of tree T. A species tree S is a rooted as new loci, without interfering with the global evolu- directed tree in which nodes are called taxa. A gene tree tionary scenario. This allows to disregard the noise when G is a rooted directed binary tree, such that every leaf of analyzing the tree, and instead focus on several chosen G is labeled by a leaf-taxon from S, i.e., an element of L(S). events. The locus tree inference has been addressed in Note that a gene tree can have multiple copies of taxons populational genetics setting [4]. However, this approach (see Fig. 1). For a node g in G, by tax(g) ⊂ L(S) we denote requires several parameters, like speciation times or the set of all labels of leaves from L(g). population sizes, which are often challenging to obtain. The lowest common ancestor mapping, or lca-map- It has also been addressed in parsimony framework in the ping, between G and S is a function M : V → V such G S model of gene duplication and loss [7]. that M(g) = t if g is a leaf labelled by the leaf-taxon Our contribution In this work, we address the prob- t, or M(g) = lca (M(g ), M(g )) if g has two children S 1 2 lem of locus tree inference when populational effects are g and g . An internal node g in G is a duplication if 1 2 negligible. This allows addressing the locus tree infer - M(g) = M(g ) for any child g of g. Every other node, i.e., i i ence problem in a parsimony framework, and to adapt a a leaf or an internal node satisfying M(g) ≻ M(g ) for more general approach than presented in [7]. We assume every child g of g, is called a speciation [11–13]. that incongruence between gene and species trees can be A node with more than two children is called a poly- caused by locus acquisition events of any kind, including tomy. For a polytomy s in a tree S, let H(s) be the set of all duplications and horizontal gene transfers. We propose possible binary trees whose leaves are the children of s. to solve the locus tree inference problem by decompos- For instance, if s is the polytomy node present in S from ing a binary gene tree into a forest of subtrees that can Fig. 2, then H (s) = {(d, (e, f )), (e, (d, f )), (f , (d, e))} . Let be embedded into a possibly polytomic species tree, in a H (S) be the set of all possible binary trees obtained from way that minimizes the weighted sum of the forest size S by replacing each polytomy s with a tree from H(s). An and the number of loss events. We propose two variants element of H (S) is called a binarization of S. of the problem: the Locus Tree Inference, LTI, in which forest elements are subtrees of the species tree, and the Locus gain problems Conditional Locus Tree Inference, CLTI, where each In this section we introduce the parsimony framework forest element is a subtree of some binarization (full for the (conditional) locus tree inference problem and a refinement) of the species tree. We show a dynamic pro - dynamic programming formula for solving the problem. gramming algorithm that solves LTI in O(|G||S|m) time We say that a gene tree G is embeddable (respectively, and O(|G||S|) space, where m is the maximal degree of conditionaly embeddable) into a species tree S, if each a node from the species tree. To solve CLTI, we pro- node of G is a speciation (respectively, a speciation in pose a new mapping, called the highest separating rank. some binarization of S). For instance, (a, (b, c)) is embed- Based on the mapping, we show an O(d|G|+|S|) time dable into (a, (b, c, e), f), while (a, (a, b)) is not. Since the and O(|G|+|S|) space algorithm, where d is the height of polytomies in S can be resolved independently, we get the S, for inferring required and conditional duplications in following result: gene trees, which improves an O(|G|(d + m) + |S|) time solution from [10]. Finally, we propose an efficient heu - Lemma 1 G is conditionally embeddable into S if and ristic to solve CLTI, and present a comparative study on only if there is a binarization S of S such that G is embed- simulated and empirical data. dable into S . Methods Proof (⇐ ) Obvious. (⇒ ) First, observe that for any ′ ′ Definitions two nodes g , g ∈ G , if M(g) and M(g ) are incompara- Let T =�V , E � be a rooted directed tree. For a, b ∈ V , ble, then binarizing any node under M(g) will not change T T T by lca (a, b) we denote the lowest common ancestor of a the mapping of any node below g . Thus, we can binarize and b in T. We also use the binary order relation a b if disjoint parts of S separately. Now, consider a maximal b is a node on the path between a and the root of T (note node g which maps to a non-binary node in S (i.e. a node ′ ′ that a a ). Two nodes a and b are called siblings, which is g such that for any g ≻ g , M(g ) is binary). Since the par- denoted a = sibling(b) , if they are children of lca (a, b). ent of g is a speciation, the mappings of g and sibling(g) T Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 4 of 18 Fig. 2 An example of a gene tree G and a species tree S. Top: The lca-mapping M. Each internal node of S is decorated with its rank based on the height of the corresponding subtree. Each internal node of G is decorated with the value of mapping P. Note that the parent of taxa d, e and f in S is a polytomy. Bottom: G in which each internal node is decorated with a pair of gene leaves that induces the value of mapping P in Algorithm 1 (line 7). For example, for the left child of the root, say x, f e yields P(x) = R(lca (f , e)) = 2 3 9 S are incomparable in S. Since G is conditionally embed- is called a set of cuts if no two edges in X share their top dable, there exists a binary tree T ∈ H (M(g)) such that nodes. g becomes a speciation after replacing M(g) with T. From Let X be a set of cuts from G. We define G to be the the definition of a speciation node it follows that, after graph obtained from G by removing all cuts from G, such replacement, the children of g map onto disjoint contracting all nodes with one parent and one child, and parts of S, and subsequent polytomies can be resolved then removing all roots having exactly one child. independently to obtain the desired binarization. Lemma 2 For every set of cuts X from G, G is a decom- Every internal node g of G induces a set of loss events position of G. defined as nodes of the species tree strictly between M(g) and M(parent(g)) , plus M(g) if M(g) is a polytomy. The Proof Removing all cuts from G leads to a set of dis- above definition yields a notion of the loss cost, denoted joint and connected subgraphs of G. As no two cuts by �(G, S) , and defined as the total number of loss events share the top node, each subgraph contains at least one required to embed G into S. leaf from G, i.e., there is no subgraph composed entirely A gene tree may not be embeddable into a species tree of G’s internal nodes. On the other hand, each leaf from due to duplications or HGTs. Our goal is to decompose a G is in exactly one subgraph. Internal nodes of G can be gene tree into a set of embeddable subtrees in the most divided into four disjoint classes: (I) nodes disjoint with parsimonious way. We say that a forest F is a decompo- any cut, (II) bottom nodes of a cut incident to exactly one sition of G, if L(T ) = L(G) , and for every T ∈ F , cut, (III) top nodes of a cut incident to exactly one cut, T∈F G| = T , where, for A ⊆ L(G), G| is a tree having A and (IV) nodes incident to exactly two cuts. Contracting L(T ) A as the set of leaves and {lca (a, b)|a, b ∈ A} as the set of nodes with exactly one parent and one child [(the group internal nodes with the ancestor relation inherited from (III)] and removing single-child roots (IV) do not affect G. Decompositions can be equivalently obtained by tree the leaves and the nodes from the group (I). Therefore, edit operations as follows. Given a gene tree G, X ⊆ E after these operations, each subgraph becomes either a G Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 5 of 18 From Lemma 2 it follows that for each set of cuts there binary tree or a leaf. It follows that G is a forest such that is a unique decomposition induced by this set. Con- X L(T ) = L(G) . Since each tree in G is obtained T∈G versely, for every decomposition F of G there exists a set from a connected subgraph of G, it inherits the ancestral of cuts X such that F = G . Inferring such a set from a relation from G. Furthermore, note that a node with one given decomposition is straightforward by a bottom-up parent and one child or a root with one child cannot be traversal of the gene tree. Therefore, we can consider the last common ancestor of any set of leaves. It follows decompositions as equivalent to locus trees. From the that for any T ∈ G we have T = G| . L(T ) computational point of view, it is more natural to seek for optimal decompositions rather than sets of cuts. In the context of X every node of G can be uniquely X X In some applications (see e.g. “Results” section), given a associated with a node from G by a mapping σ that locus tree (G, X) we seek for the ≺-maximal nodes, called maps a node g to the first non-removed node below g, X X source nodes, obtained from G after removing X. The set connected by non-cut edges. Formally, σ (g) = σ (g ) of all source nodes uniquely corresponds to the elements if the edge g, g is a cut from X and g is the sib- 2 1 X X of G . For example, in Fig. 1, the top locus tree has two ling of g , and σ (g) = g , otherwise. By M : G → S sources: the root of G and b , while in the bottom tree, we denote the “locus-aware” lca-mapping, given by X ′ X ′ we have three sources: the root of G, a and the parent M (g) = M (σ (g)) , where M is the lca-mapping X X of b . between T ∈ G and S such that σ (g) ∈ T (see Fig. 1). Given a decomposition F, we define the total loss cost Consider a set of cuts X in G. We say that X detaches as �(F , S) = �(T , S) . We can now define the g ∈ G , or is g-detaching, if σ (g) is the root of some T∈F Locus Tree Inference problem (LTI) in the parsimony tree from G . For example, in Fig. 1, the cuts from framework: the bottom example detach the parent of leaf a (as X X σ (parent(a )) = b is a root in G ), while the cuts from 1 2 Problem 4 (Locus Tree Inference, LTI) Given a gene the top one do not ( σ (parent(a )) = a is below the 1 1 tree G and a species tree S. Find the decomposition F root). of G consistent with S having the minimal weighted cost We say that a decomposition is consistent (respectively, ∗ ∗ �(G, S) = GAIN ·|F |+ LOSS · �(F , S) in the set of all conditionally consistent) with a species tree S if for every decompositions of G consistent with S, where GAIN ≥ 0 T ∈ F , T is (respectively, conditionally) embeddable into and LOSS ≥ 0 are the weights of locus gain and locus S. From the definition of σ we have: loss events, resp. Remark 3 Let G be a decomposition consistent with S. Decompositions which satisfy the above conditions are Then, a set of cuts X detaches g ∈ G if and only if every tree referred to as optimal. Note that, from the mathematical in G is either disjoint with or entirely contained in G(g). point of view, one of the weights in the above cost func- tion could be set to 1, and the other adjusted accordingly. Given a species tree S and a gene tree G we define a In the same way, for conditional consistency, we define locus tree with respect to S as a pair (G, X), where X is a the Conditional Locus Tree Inference problem (CLTI). set of cuts such that the decomposition G is consistent The problems are equivalent if the input species tree is with S. Locus trees which induce the same decomposi- binary. From the algorithmic point of view, LTI is similar tions are considered equivalent. A locus tree represents to the reconciliation with DTL (duplication-transfer-loss) the evolutionary history of a set of genes, while a cut cor- scenarios [14] with no duplications. A transfer event cor- responds to the creation of a new locus by gene dupli- responds to the creation of a tree in a decomposition for- cation or horizontal gene transfer. The decomposition est. Additionally, we do not count loss events at the root induced by a set of cuts represents independent evolu- of a new tree. tionary histories of different loci. Our algorithm for solving LTI, described below, con- The definition of a locus tree presented in this work sists of several functions of g ∈ G , s ∈ S and ι ∈{0, 1} is formal, and does not always agree with the biological which denotes whether a set of cuts detaches g: intuition. In “Relationship between decompositions and locus trees” section we show a set of cuts which cannot D1 δ(g, s,0) is the minimal partial cost contribu- be directly explained by duplications and transfers, and tion of G(g) in the set of all g-detaching sets X requires additional evolutionary events to explain the decomposition (so-called non-admissible events). The such that M (g) = s. topic of non-admissible events together with an algo- D2 δ(g, s,1) as above but for non-g-detaching sets rithm for detecting them is covered in detail in “Relation- of cuts. ship between decompositions and locus trees” section. Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 6 of 18 △ ′ know if there is a polynomial time algorithm for CLTI. D3 δ (g , s, ι) is the minimal value of δ(g , s , ι) for However, when the locus gain weight ( GAIN ) is much s � s. greater than the loss weight ( LOSS ), an efficient heuris - D4 δ (g , s, ι) is the minimal partial cost contribu- tic can be constructed, based on a mapping introduced tion of G(g) in the set of all g-detaching sets X in the next section. of cuts X such that M (g) s . For ι = 1 , the cost additionally includes all losses created by Ranked trees and rank‑based mappings σ (g) and associated with every species node Usually, when comparing trees, mappings based on their ′ X ′ s satisfying M (g) ≺ s � s. topologies are used (e.g., the lca-mapping). However, some biological species trees contain additional useful Below we present the dynamic programming formulas structure: the taxonomic ranks, like species, genus, or (DP Algorithm) for solving LTI. Here, c(v) is the set of family. A species tree with taxonomic ranks assigned is children of v ( ∅ for leaves). By 1 we denote the indicator sometimes called a taxonomy. Several major ranks are function, that is, 1[p] is 1 if p is satisfied and 0 otherwise. 0 if g is a leaf and M(g) = s, min{α, γ } if g is not a leaf, δ(g, s, ι) = +∞ otherwise, ↑ ′ ′ ↑ ′′ ′′ α = 1[c(s) ≥ 3] · LOSS · ι + min δ (g , s ,1) + δ (g , s ,1), ′ ′′ ′ ′′ s ,s ∈c(s) and s �=s △ ′ ′ ↑ ′′ △ ′′ ′′ ↑ ′ γ = GAIN + min(δ (g , M(g ),0) + δ (g , s, ι), δ (g , M(g ),0) + δ (g , s, ι)), δ(g, s, ι) if s is a leaf, δ (g, s, ι) = min{δ(g, s, ι), 1[|c(s)| > 1]· LOSS · ι + min δ (g, x, ι)} otherwise, x∈c(s) △ △ δ (g, s, ι) = min{δ(g, s, ι), min δ (g, x, ι)}. x∈c(s) common to almost all living organisms. In this section, Theorem 5 (Solution to LTI) For every G and S: we propose a mathematical formalization of ranks and �(G, S) = min δ (root(G), s,0)) + GAIN. s∈S two rank-based mappings, which are useful in duplica- tion inference and CLTI. Proof The proof is by induction on the structure of G A ranked species tree is a species tree S in which every and S, where the properties D1–D4 of all δ ’s are proved. node s of S has assigned a small positive integer called We omit technical details. ′ ′ rank, denoted R(s) , such that, for every s and s , if s ≺ s then R(s)< R(s ) . We assume that the rank of the root of Theorem 6 The optimal cost can be computed in is d > 0 and every leaf has rank 1. O(|G||S|m) time and O(|G||S|) space, where m is the max- After assigning integer numbers to the taxonomic ranks, imal degree of a node from S. a taxonomy becomes a ranked species tree. However, the latter concept is broader, as any species tree can be con- Proof Time: We show that all values of δ functions can verted to a ranked one, for example by assigning ranks be computed in O(m) time. This is straightforward for equal to the node depths. The theory and algorithms all values except α , where computing min potentially described in the following sections apply to any ranked tree, requires O(m ) time. This can be done, however, in O(m) regardless of the way in which the ranks were assigned. time, by finding for each node g of G, the two children Let G be a gene tree and S be a ranked species tree. For of s with the minimal and the second minimal value of δ a rank r and a leaf t in S, the unique directed path in S and choosing the minimal pair one among all four vari- consisting of all taxa comparable with t having the rank ants. Space: Obvious. lower than r will be called an (evolutionary) r-lineage of t. Note that every 1-lineage is empty. We say that leaf-taxa CLTI can be solved by an algorithm similar to the t and t are separated by the rank r if for every x from the one presented above. It requires an additional case in δ r-lineage of t and every y from the r-lineage of t , x and y for resolving duplications. To model a proper binariza- are incomparable. Observe that every pair of leaf-taxa is tion of a polytomy in M(g), both children of g have to be separated by the rank of 1. Moreover, if r separates t and mapped into disjoint sets of children of M(g). Such solu- ′ ′ t then every rank lower than r also separates t and t . For tion requires extending all δ ’s by a set of species nodes example, in Fig. 2 leaf-taxa a and c are separated by ranks allowed for the mappings. In consequence, this approach 1, 2 and 3, but not by rank 4. has an exponential time and space complexity. We do not Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 7 of 18 Let g be an internal node in G with children of incomparable nodes. We conclude that the rank r sep- ′ ′ ′ g and g . The highest separating rank mapping arates t and t . Note, that for ranks r > r , b oth r -lineages 1 2 P : V →{1, 2, . . . , d} is defined as contain lca (t, t ) . Therefore, r is the highest rank sepa- G S rating t and t . This completes the proof of (A). (B) fol - P(g) = max r : r separates every pair of leaf-taxa lows from the definition of mapping P and (A). The proof of (C) is similar to (B). Both (D) and (E) follow from (A) from tax(g ) × tax(g ) . 1 2 and (B). The lowest common rank mapping I : V →{1, 2, . . . , d} Taxonomic ranks have been used earlier for HGT is defined as I(g) = R(M(g)) . We now present some fun- detection [15]. In this work, the authors decorated nodes damental properties of both mappings. Simple proofs are of the gene tree with the rank of the lowest taxon shared omitted for brevity. by each descendant leaf, equivalent with the I mapping. A high difference between the rank of a node and the one of Lemma 7 Let ρ(t , t ) be the highest rank that separates its parent was one of the premisses for HGT. To the best leaf-taxa t and t and let g be an internal node of G with of our knowledge, no mapping equivalent with the high- two children g and g . Then, est separating rank has been proposed to date. 1 2 Computing mappings ′ ′ ′ Given a species tree S and a gene tree G, to compute I we (A) For every leaf-taxa t and t , ρ(t, t ) = R(lca (t, t )). can use the classical algorithm for lca-queries, in which, (B) P(g) = min ′ ρ(t, t ). �t,t �∈Q after a linear-time preprocessing, computing lca-queries (C) I(g) = max ′ R(lca (t, t )). �t,t �∈Q S can be completed in constant time [16]. We conclude (D) P(g) = min ′ R(lca (t, t )). �t,t �∈Q S that I can be computed in O(|G|+|S|) time. (E) P(g) = 1 if and only if tax(g ) ∩ tax(g ) �= ∅, 1 2 A naïve algorithm for computing P, based on Lemma 7, requires O(|G||S| ) time. Here, we propose an where Q = {�t, t � ∈ tax(g ) × tax(g )}. 1 2 O(d|G|+|S|) time solution. For two distinct leaves l and l 1 2 of G, we write l < l if l is visited earlier than l in prefix Proof For the proof of (A), let r = R(lca (t, t )) . Then, 1 p 2 1 2 traversal of G. For instance, in Fig. 2 the leaves are linearly by the definition, for every element v of r-lineage for t, ′ ′ ordered starting from the left, i.e., d < b < f < . . .. R(v)< R(lca (t, t )) . Hence, v ≺ lca (t, t ) . And the same 1 p 2 p 3 p S S holds for the r-lineage of t . Thus, both r-lineages consists Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 8 of 18 ′ ′ Lemma 8 For a fixed s ∈ S, the sequence of all assign- was another s , processed before s, with R(s ) = r satisfy- ment evaluations in line 9 such that v.smap = s induces ing the same properties as s) or it will be set to r. This a sequence of values v, denoted by v , v , . . . , v such that: 1 2 completes the proof. (I) the assignment s.lastvisited := v is executed only when rank = s.rank, (II) v < v < ··· < v , and (III) 1 p 2 p p Lemma 10 Algorithm 1 requires O(d|G|+|S|) time and −1 {v , v , . . . , v }= M (L(s)). 1 2 O(|G|+|S|) space. Proof (I) is obvious by the condition in the second loop. Proof Time: Lines 1–4 have O(|G|+|S|) time complex- By the condition in the inner loop, the order of leaves ity, while the body of the inner loop needs O(1) time. induced by a sequence of such assignments follows < . p Space: Algorithm 1 uses only a few node attributes plus For every gene leaf v, v.smap is initially set to the label of the lca-query data structure of the size O(|G|). v, i.e., M(v) (see line 3). Thus, if s is a leaf, i.e., s.rank = 1 , then the assignment in line 9 sets the value of v.lastvisited if and only if the label of v is s. Thus for the leaves, (II) is Classification of gene duplications satisfied. For (III), note that the line 10, ensures that every Several methods for reconciliation with non-binary leaf v is assigned once to s.lastvisited of every node s of gene trees have been proposed [17–22]. However, rec- a species tree that is present on the path starting from onciliation with non-binary species trees is harder to −1 M(v) and terminating in the root. Hence, M (L(s)) ⊆ model. This is because a polytomy may represent a lack {v , v , . . . , v } . The other inclusion follows trivially from 1 2 k of knowledge about the order of speciations, and there- the fact that for a leaf v, v.smap is originally set to M(v) fore some duplication nodes may correspond to biologi- and v cannot be assigned to a node incomparable with cal speciations. This motivates a further classification of M(v). duplication nodes into conditional and required duplica- tions [10]. In our model, we assume that the species tree Lemma 9 For every internal node g, P(g) = g .P. is ranked. This approach can be applied to any species tree by assigning ranks, e.g. based on the node depth. ′ ′′ Proof Let g and g be the left and the right child of When reconciling a gene tree G with every binarization g, respectively. The proof is by induction on the rank of S, if g from G is a duplication in every reconciliation, r = 1, 2, . . . , d , where d is the highest rank in S. Let then g is called a required duplication. Similarly, if g is a r = 1 . Assume that P(g) = 1 , we show that g.P = 1 . duplication in at least one but not all reconciliations, we ′ ′′ Let s ∈ tax(g ) ∩ tax(g ) . Then, by Lemma 8, let be say that g is a conditional duplication. Note that G is con- the sequence {v , v , . . . , v } of all leaves assigned to ditionally embeddable in S if and only if each node in G is 1 2 k s.lastvisited such that M(v ) = s and ordered by < . either a speciation or a conditional duplication. i p Clearly, the list has the leaves from both subtrees of g. In this section, we show how P and I can be used to u Th s there is an index j < k , such that v ∈ L(g ) and solve the problem of gene duplication classification when ′′ v ∈ L(g ) . u Th s lca (v , v ) = g . Now, in line 8, the species tree has possible polytomies. j+1 G j j+1 when v is v then v.smap.lastvisited is v . In such a j+1 j case, either g .P is None and g .P will be set to 1, or g .P is Lemma 11 For an internal node g from a gene tree G, already set. However, it can be only 1. This completes the the following conditions are equivalent: (A1) P(g) = I(g) first part of the proof. , (A2) every subtree rooted below M(g) contains taxa from at most one child of g, i.e., for every s ≺ M(g) Assume that P(g) = r and for every q, such that P(q)< r , , if L(s) ∩ tax(g ) �= ∅ then L(s) ∩ tax(g ) =∅, where 1 2 ′ ′′ we have P(q) = q.P . For every v ∈ L(g ) and w ∈ L(g ) , g and g are the children of g, and (A3) for every 1 2 ′ ′ R(lca (M(v), M(w))) ≥ r . u Th s g .P = None , when Algo- S �t, t �∈ tax(g ) × tax(g ), lca (t, t ) = M(g). 1 2 S rithm 1 starts the main loop with rank = r . From Lemma 7, there is a pair taxa �t , t �∈ gˆ such that s = lca (t, t ) 1 2 S Proof (A1) ⇒ (A2). Assume that s ≺ M(g) and there and R(s) = r . Thus, there are two leaves a and a in G 1 2 are two leaves t and t in L(s) such that t ∈ tax(g ) ′ ′ such that for each i, M(a ) = t and lca(a , a ) = g , i.e. i i 1 2 and t ∈ tax(g ) . Hence, �t, t �∈ tax(g ) × tax(g ) and 2 1 2 ′ ′′ a � g and a � g . Similarly, to the first step, the leaves 1 2 lca (t, t ) � s ≺ M(g) . u Th s , P(g)< I(g) , a contradiction. −1 ′ ′ from M (L(s)) are all visited and set to s.lastvisited (A2) ⇒ (A3). Let �t , t �∈ gˆ . Then, t and t are leaves from according to the order < . The sequence contains ele - p two different subtrees rooted below M(g). Therefore, ments a and a , therefore again there is j separating 1 2 lca (t, t ) = M(g) . (A3) ⇒ (A1). It follows immediately leaves from both subtrees of g. The rest of the proof is from Lemma 7. analogous: in line 8 either g.P is already set to r (if there Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 9 of 18 Note that the above Lemma also holds when (C2,⇐ ). If g is a conditional duplication, then it is a dupli- P(g) = I(g) = 1 , i.e. when an internal node g is mapped cation by definition. (C2, ⇒ ). Assume that g is a duplica- to a leaf of S. In such a case the condition (A2) is satisfied tion, then by condition (A2) from Lemma 11, the children trivially. of M(g) can be clustered into three disjoint sets X, X and ′′ X such that every node from X has no taxa present in ′ ′ Lemma 12 Let g be an internal node of G such that tax(g) , every node of X has taxa from tax(g ) but not from ′′ ′′ P(g) = I(g). Then, g is a speciation if and only if M(g) is tax(g ) and analogously every node of X has taxa from ′′ ′ ′ ′′ an internal node and there are exactly two subtrees rooted tax(g ) but not from tax(g ) , where g and g are the chil- at children of M(g) having nodes from tax(g). dren of g. Also, by Lemma 13 at least one among X and ′′ ′ X , say X , has at least two elements. Consider a binary ′ ′′ tree T in H(M(g)), such that all elements of X and X are Proof (⇒) . We have that g is an internal node. In such a located on the left and the right subtree of T, respectively. case I(g)> 1 and M(g) is an internal node. Then, by (A2) ′ ′′ Then, lca (X ) and lca (X ) are incomparable. Thus, in from Lemma 11, every child of M(g) has taxa present in T T ′ ′′ such a binarization of S, g and g maps below M(g), and g at most one child of g. There are at least two children of is a speciation node. Similarly, it can be shown that there M(g) satisfying this property. If there are more than two, exists a tree in H(M(g)) in which g is a duplication. then one child of g, say g , has taxa from at least two chil- dren of M(g). Hence, M(g ) = M(g) and g is a duplication Based on Algorithm 1, classification theorem leads to node, a contradiction. (⇐) . Similarly, if M(g) is an inter- a natural O(d|G|+|S|) time solution for the inference of nal node, then by (A2), the mappings of the children of required and conditional duplications when reconciling a g are incomparable and located below M(g). Therefore g given binary gene tree with a species tree. This improves cannot be a duplication. the known O(|G|(d + m) + |S|) time algorithm from [10], where m is the maximal degree of a node from S. The We have a symmetric property whose proof is similar improvement is beneficial for highly polytomic species to the previous one. trees. For example, as of 04.28.2017, the genus Aspergil- lus has 1950 children species in the NCBI Taxonomy. Lemma 13 Let g be an internal node of G such that P(g) = I(g). Then, g is a duplication node if and only if Heuristic for CLTI either M(g) is a leaf or M(g) is an internal node and there In this section, we propose the heuristic algorithm for are at least three subtrees rooted at a child of M(g) having CLTI when the locus gain weight is much higher than nodes from tax(g). the loss weight. The algorithm is based on the following lemma, which follows directly from Theorem 14: Finally, we have the main property. Lemma 15 Tree G is conditionally embeddable in S if Theorem 14 (Classification Theorem) Let g be an and only if, for all internal nodes g in G, I(g) = P(g)> 1. internal node of G. Then, (C1) If P(g) = I(g) = 1 orP(g)< I(g) then g is a required duplication. (C2) If Algorithm 2 is a greedy approach that iteratively finds P(g) = I(g)> 1, then g is a duplication if and only if g is a the minimal nodes g such that P(g)< I(g) or I(g) = 1 conditional duplication. and detaches an embeddable subtree below each node. Note the following: Proof (C1) If P(g) = I(g) = 1 , then g is mapped to a leaf. Hence, every leaf below g has the same label. Thus, Remark 16 Let G be conditionally embeddable in in every binarization of S, g is a duplication. Assume that S. Let �(G,S) =|L(M(root(G)))| −|L(G)| . Then, P(g)< I(g) . Then M(g) is an internal node in S, having �(G, S) ≤ �(G, S). at least three taxa in L(M(g)) (otherwise, the two chil- dren of M(g) are leaves and P(g) = I(g) = 2 ). We can Let T \ T denote tree T with detached subtree T . assume that there are three leaves t, t ∈ tax(g ) and 1 2 1 2 ′ ′ ′′ ′ ′′ ′ ′′ ˆ ˆ Then, �(G , S) + �(G(g) \ G , S) is an estimate of the t ∈ tax(g ) such that lca (t , t ) ≺ lca (t, t , t ) . This 2 S S partial loss cost induced by detaching subtree G . The property holds for every binarization T of S, where the detached subtree in Algorithm 2 is chosen to minimize possible polytomy M(g) is resolved. Moreover, in every ′ ′′ this estimate. To limit the complexity of a single step, we T, M(g ) � lca (t, t , t ) , thus M(g ) is comparable with 1 T 1 ′′ consider only subtrees rooted at vertices at a close neigh- M(g ) � t . u Th s , M(g) = max(M(g ),M(g )) and g is a 2 1 2 borhood of g. duplication node. Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 10 of 18 A significant advantage of Algorithm 2 is the space Lemma 17 Algorithm 2 returns a decomposition condi- complexity, which is O(|G|+|S|) . This makes the heu - tionally consistent with S in O(ad|G|+ a|S|) time, where ristic suitable for trees with hundreds or thousands of a is the number of recomputations of I and P mappings. nodes. Note that in Lemma 17, a is pessimistically equal to the height of the gene tree, which makes this algorithm Proof Correctness It follows immediately from the fact asymptotically quadratic. However, in applications, a is that each G(w) detached in the inner loop is conditionally expected to be a small integer. This expectation is valid embeddable. Time Let g be an element of Z. Then, as G is e.g. in cases where evolutionary events occur less fre- binary, there are at most six edges v, w adjacent to a child quently than speciations. Note, however, that it may not of g. (Case I) If v = g and w is the child of g, then both trees be valid for genes which exhibit particularly complex G(w) and G(g) \ G(w) are conditionally embeddable by the evolutionary history, like transposons. assumption that g is the minimal node such that G(g) is not conditionally embeddable. This completes the proof for Relationship between decompositions and locus trees v = g . (Case II) For the remaining case, there at most four An evolutionary scenario is the set of evolutionary events grandchild edges connecting a child v with a grandchild w which have shaped the observed gene tree. Usually, a sce- of g. We can arbitrarily index them by 1,2,3 and 4. Again, nario is represented by assigning duplication labels to G(w) is conditionally embeddable, however, it is unclear some nodes of the gene tree, and transfer labels to some for T = G(g) \ G(w) . By Lemma 15, it is sufficient to check of its edges. whether, for the root t of T, I (t) = P (t)> 1 , where the T T Each evolutionary scenario, understood as a set of mappings are from V . I (t) can be computed in O(1) time T T duplication nodes and transfer edges, induces a decom- by the rank of lca (M(v.sibling), M(w.sibling)) , howe ver , position of the gene tree into a set of trees representing for P : V →{1, 2, . . . , d} we need to use Algorithm 1. T T evolutionary histories of different loci. A set of cuts for To avoid quadratic time, consider the following additional such decomposition consists of all transfer edges and steps after line 4. one edge below each duplication. One could reason- ably expect that a converse relation holds, i.e. that each 1. For each i ∈{1, 2, 3, 4} , create a copy G of G. decomposition induces an evolutionary scenario and 2. For each g in Z, remove the i-th grandchild edge a similar set of cuts. This, however, turns out to not be v, w and G(w) from G . true. The decomposition depicted in Fig. 3 requires add- 3. For every i, let P be the mapping P, denoted P , com- i i ing additional duplication nodes with no cuts below puted by Algorithm 1. them. Thus, one of the trees in this decomposition cor - responds to the evolutionary history of two loci. In this Then, when the ith grandchild edge v, w is processed in section, we give preliminary results on the relationship the inner loop, the value P (t) is P (t) . We conclude that T i between decompositions and locus trees. We begin with each step of the main loop (lines 2–6), requires 4 addi- formalizing the notion of an evolutionary scenario. Next, tional runs of Algorithm 1. Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 11 of 18 Fig. 3 An example of a decomposition of a gene tree G into two trees (a , b ) and (a , b ) for which evolutionary scenarios require one additional 1 1 2 2 duplication located on the internal node of the decomposition forest (here the root of G). The scenario with the minimal number of duplications (here 2) is depicted on the left in the form of the embedding of G into S [12] we formally define the non-admissible events induced by The above conditions denote the speciation, duplica - a decomposition, and show and algorithm of detecting tion and horizontal gene transfers events, respectively. An them. The number of non-admissible events measures example of an evolutionary scenario has been depicted how well a decomposition agrees with biological intui- in Fig. 4. In DTL scenarios, a vertical descent is modeled tion behind evolutionary scenarios. by the condition that the mapping of a child is below or The formal definition of a DTL scenario presented equal to the mapping of its parent. The condition holds below is adopted from [23, 24] with the difference that we for the children of speciation and duplication nodes. A focus more on the event based conditions. destination of a transfer node g is defined by the func - tion ξ . Therefore, we require that both the mapping of g and the mapping of ξ(g) are incomparable. Note that, the Definition 18 (DTL scenario, from [ 25]) A DTL sce- above definition of a DTL scenario does not exclude cyclic nario, or scenario, for a binary tree G, and a tree S and a HGT scenarios, i.e. scenarios with at least two conflict - labelling L : L → L is a tuple M,�,�, �, ξ such that G S ing transfer edges. Conflicting edges occur e.g. when the L : L → L is the leaf labelling function, M : V → V G S G S acceptor of the first transfer is above the donor of the sec - is a mapping that extends L , {,,} is a partition of I ond one, and the acceptor of the second transfer is above into speciation, duplication and transfer nodes, respec- the donor of the first one. Such scenarios are impossible tively, and ξ : � → V determines the termination node to occur in nature, because the acceptors and donors need of a transfer in G, subject the following conditions. For to be present at the same time point. If the optimal cost is any g ∈ I such that c and c are the children of g, let G 1 2 defined as the minimal weighted sum of numbers of HGT s = lca (M(c ),M(c )) . Then, S 1 2 and duplication events, then, for a given gene tree and a species tree, it can be computed in O(|G||S|) time [23, 24], while the problem for acyclic scenarios is NP-hard [24]. • We have g ∈ if and only if the mappings of the children of g are incomparable, and s = M(g). Validation of decomposition • If g ∈ then s M(g). Let F be a decomposition of a gene tree G. Then, by the • If g ∈ then ξ(g) is a child of g, definition of decomposition every node of F is a node of M(sibling(ξ(g))) M(g) , and M(g) and M(ξ(g)) G. Such nodes will be called F-internal. are incomparable. The edge �g , ξ(g)�∈ E is called a Now, we can determine how a given decomposition transfer edge. is evolutionarily congruent by searching for the DTL Fig. 4 An example of a DTL scenario E for a gene tree G and a species tree S. E has two HGTs, one duplication, and three speciation events. The scenario is shown with the mapping M depicted for the internal nodes only. While the gene loss events are not formally modeled here, they are required when embedding a gene tree into a species tree according to a given DTL scenario. Therefore, our visualizations of DTL scenarios are extended by the required loss events (see also the scenario E in Fig. 5) 3 Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 12 of 18 scenarios that preserve the largest number of speciation Theorem 20 Given a gene tree G, a species tree S events in the forest and the minimal number of duplica- and a decomposition F of G consistent with S. The tion and HGT events outside of it. Given a scenario E for minimal number of non-admissible events equals G and S, we say that an event g ∈ I is admissible if min σ (root(G), s)). G s∈S • g ∈ and g is not F-internal. Proof The proof is by induction on the structure of G • g ∈ and �g , ξ(g)� ∈ / E . and S, where the properties V1–V3 of all σ ’s are proved. • g ∈ and g is F-internal. We omit technical details. Given that ||+||+||=|G|− 1 , the decomposition Theorem 21 The number of non-admissible events can congruency can be equivalently expressed in terms of be computed inO(|G||S|m) time and O(|G||S|) space, minimizing the number of non-admissible events. where m is the maximal degree of a node from S. Problem 19 (Validation of decomposition) Given a Proof See the proof of Theorem 6. gene tree G, a species tree S and a decomposition F of G consistent with S. Find a DTL scenario for G and S that minimizes the number of non-admissible events. Results We have run numerical simulations to assess the perfor- Examples of such scenarios are depicted in Fig. 5. The mance of the proposed algorithms. First, we have run the above problem can be solved by a dynamic programming heuristic and dynamic programming algorithms on pairs algorithm in O(|G||S|) time similar to the decomposition of random trees to compare the inferred forest sizes and problem and the reconstruction of DTL scenarios [23, loss costs. Next, we have performed simulations of realis- 25]. Similarly to the decomposition formulas for g ∈ G tic evolutionary scenarios to check the correctness of the and s ∈ S we have the following properties of the σ’s: algorithms’ results. The optimal decomposition is seldom unique. There - fore, when analyzing the dynamic programming algo- V1 σ (g , s) is the minimal number of non-admissible rithm, for each gene tree-species tree pair we picked one events located in G(g) for the scenarios between of the optimal decompositions randomly. The heuristic G(g) and S(s). algorithm always returns a single decomposition. V2 δ(g , s) as above but under assumption that g is mapped to s. → Comparison of algorithms V3 δ (g , s) is the minimal number of non-admissible In the case of binary species trees, conditional embed- events located in G(g) in the set of all scenarios for dability is identical to strict embeddability, and both G(g) and S(x) where x is incomparable with s. locus tree inference algorithms can be compared experimentally. Below we present the dynamic programming formulas For each |L(G)|= 1, . . . , 20 and |L(S)|= 10 we have ′ ′′ for the validation of decompositions. Here g and g are generated 100 pairs of random trees under the Yule- the children of g, α represents the case when g is a specia- Harding model. The leaves of G have been assigned to tion, β —a duplication and γ—an HGT. 0 if g is a leaf and M(g) = s, σ(g, s) = min{α, β, γ } if g is not a leaf, +∞ otherwise, where △ ′ ′ △ ′′ ′′ α = 1[g is not F-internal]+ min σ (g , s ) + σ (g , s ), ′ ′′ ′ ′′ s ,s ∈c(s),s �=s △ ′ ′ ′′ β = 1[g is F-internal]+ min σ (g , s ) + σ(g , s), s ∈c(s)∪{s} ′ → ′ △ ′′ ′′ → ′′ △ ′ γ = min 1[�g , g �∈ E ] + σ (g , s) + σ (g , s), 1[�g , g � ∈ E ]+ σ (g , s) + σ (g , s)}, F F σ (g, s) = min σ(g, x), x�s σ (g, s) = min σ(g, x). x and s are incomparable Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 13 of 18 Fig. 5 An example of the validation of a decomposition with one non-admissible event. Top A species tree S and a gene tree G decomposed into 4 trees (a , b ) (blue), (a , c ) (green), a (red) and a (light blue). Bottom (3 rows) DTL scenarios E -E with embeddings. In scenarios E and E there is 1 1 4 1 2 3 1 3 1 2 one non-admissible HGT terminating in a and c , respectively, while E has one non-admissible duplication at the root 4 1 3 Fig. 6 Comparison of DP and heuristic algorithms for binary species trees in terms of forest size |F| and numbers of losses �(F , S) . The brown line depicts the median cost; the grey ribbon depicts the 90% confidence interval. The weights in the DP algorithm have been set to GAIN = 1000, LOSS = 1 . The plots have been smoothed with cubic splines Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 14 of 18 leaves of S randomly. The numbers of losses for heuristic similar to the inferred number of new loci (i.e. forest size algorithm have been computed using a modification of minus one, accounting for the “base” locus at the root the DP algorithm for LTI. The inferred costs are shown of the gene tree). The forest sizes inferred by the heuris - in Fig. 6. tic algorithm have been depicted in Fig. 7. The number The costs are similar for both algorithms. The for - of locus acquisition events has been inferred properly est size is approximately half the number of leaves in G. in 82.6 and 82.7% of the gene trees by the heuristic and Using linear regression, we have determined that, on dynamic programming algorithm, respectively, and dif- average, the inferred forest size is equal to 0.47|L(G)| for fered by at most one event in 98.3 and 98.2% cases. The DP and 0.53|L(G)| for the heuristic. Large forest sizes in dynamic programming algorithm underestimated the these examples can be explained by the fact that both number of evolutionary events slightly more often than gene and species trees are simulated independently, and the heuristic one. The number of inferred events was most trees in the forests contain only one or two leaves. lower than the actual number in 10.7 and 11.9% of trees The number of losses is slightly smaller for the heuristic for the heuristic and dynamic programming algorithms, algorithm (on average 0.98|L(G)| for DP and 0.90|L(G)| respectively. for the heuristic). We hypothesize that the reason for this To further validate our results, we have checked is that the greedy approach tends to detach more concise whether the source nodes inferred by the decomposition trees. algorithms correspond to simulated evolutionary events. Note that, from the definition, two source nodes cannot Detecting evolutionary events be siblings (see “Locus gain problems” section for the To validate our approach under more realistic conditions, definition of a source node). A properly predicted source we have run simulations of evolutionary scenarios using node is either the end of a transfer edge, or a child of a the GenPhyloData tool [26]. The tool simulates species duplication node. We say that a properly predicted source trees under a stochastic birth-and-death model, in which node is a true positive event. Consequently, an event that each leaf gets duplicated or lost with a constant rate. The failed to be predicted is a false negative; a speciation that total rate of duplication (loss) is equal to the duplication is classified as a source is a false positive; and a properly (loss) rate times the number of leaves at a given time predicted speciation is a true negative. point (see [27] for details). When the loss rate is equal to Note that for a transfer edge, a decomposition algo- zero, this process reduces to the Yule-Harding model. In rithm might classify the sibling of the edge’s end as a the case of the species tree, a branch duplication models source node. In this case, we assume that the algorithm a speciation, and branch loss models an extinction. is wrong twice: first, it incorrectly classifies a node as a In each branch of the species tree, a similar birth- source (a false positive), and second, it fails to detect an and-death process models the gene duplication and loss evolutionary event (a false negative). To account for this events. An additional Poisson process models the occur- assumption, we have adopted the following convention. rence of horizontal gene transfer events. A recipient of A duplication event corresponds to: the transferred gene is picked uniformly from species alive at a given time point. 1. One true positive and one true negative event if one Using GenPhyloData, we have simulated 100 species of its children is a source node, trees under a birth-and-death model with the birth rate 2. One true negative and one false negative if none of its 5 and the death rate 1. Trees have been pruned to remove children are source nodes. extinct lineages. This procedure resulted in a set of binary species trees with approximately 75 leaves on average A transfer event corresponds to: (see Fig. 7). For each species tree, we have simulated 10 gene 1. One true positive and one true negative if one of its trees with duplication rate 0.4, transfer rate 0.1 and loss children is a source node, and this child is the end of rate 1.2. Trees have been pruned to remove lost genes. the transfer edge, We have obtained a set of 1000 binary gene trees with 2. One false positive and one false negative if one of its approximately 43 leaves and 4 duplication or transfer children is a source node, and the end of the transfer events on average (see Fig. 7). The maximum number of edge is the sibling of the source, events in a single gene tree was 27. 3. One true negative and one false negative if none of its The simulated gene trees have been decomposed with children are source nodes. respect to the corresponding species trees using our heu- ristic and dynamic programming algorithms. In both The results for both algorithms have been summarized in cases, the numbers of duplication/transfer events were Table 1. In the case of the heuristic algorithm, 93.2% of Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 15 of 18 Applications of evolutionary history decomposition Table 1 The confusion table of gene tree nodes classification We have compared our approach with a state-of-the-art Event Speciation Sum reconciliation program, Notung 2.9 [28]. We have ana- DP lyzed the evolution of the gene family of an aminotrans- Locus 3584 471 4055 ferase from a fungus Penicillium lilacinoechinulatum Speciation 552 78,591 79,143 [GenBank:ABV48733.1] with a history marked by HGT Sum 4136 79,062 83,198 events [9, 29]. Homologs of the protein sequence have Heuristic been found using BLASTp suite. For analysis, we have Locus 3858 229 4087 chosen 45 closest homologs from 20 fungal species. Speciation 278 78,833 79,111 The sequences have been aligned using MAFFT and Sum 4136 79,062 83,198 trimmed with TrimAL [30, 31]. The phylogenetic tree has been created using PhyML with default parameters and Event/locus: duplication or transfer nodes corresponding to new loci (true positive); event/speciation: duplication or transfer nodes not corresponding aLRT branch support, and rooted by setting Amanita to new loci (false negative); speciation/locus: speciation nodes corresponding muscaria as the outgroup [32]. Nodes of the species tree to new loci (false positive); speciation/speciation: speciation nodes not have been collapsed to represent only the following taxo- corresponding to new loci (true negative) nomic ranks: species, genus, family, order, class, phylum, kingdom. The gene tree has been reconciled with NCBI the duplication/transfer nodes were correctly detected. Taxonomy with loss weight 1, duplication weight 1.5, co- The positive predictive value of the heuristic approach divergence weight 0 and transfer weight varying from 3 to (proportion of inferred locus gain events corresponding 8. The result of decomposition by our heuristic approach to true duplication/transfer events) was equal to 94.4% . has been visualized using the Python ete3 package [33] In the case of the dynamic programming algorithm 86.7% and is depicted in Fig. 8. of duplication/transfer nodes were correctly classified as The protein ABV48733.1 has been chosen for analy- corresponding to a new locus, and the positive predictive sis because it exhibits a particularly complex evolution- value was equal to 88.4%. ary history. In the gene tree consisting of 45 homologs, We have further verified the correctness of our our heuristic has inferred 23 locus acquisition events. approach by investigating the numbers of non-admissible Depending on the transfer weight, Notung 2.9 reported events induced by the decompositions. The results are from 1 to 7 transfers and numerous duplications. The depicted in Fig. 7. Overall, there was no non-admissible inference of evolutionary events is further complicated event in 93.2% of decompositions returned by the heuris- by the fact that the sets of transfer edges for transfer tic algorithm and 90.4% of decompositions returned by weights 3 and 6 are disjoint. However, even though in the dynamic programming one. this case it is difficult to explain the whole evolutionary A likely reason for the worse performance of the scenario by reconciliation, the decomposition can still be dynamic programming algorithm is the random choice of helpful in inferring biologically relevant conclusions. optimal decomposition. The number of non-admissible Consider the light blue subtree on Fig. 8 composed events might in future be used as an additional criterion of several Fusarium species. The protein has been for the optimality of the decomposition. Fig. 7 Results of the simulations of evolutionary scenarios. Left: Distribution of numbers of leaves in simulated gene and species trees. Middle top: Numbers of duplication or transfer events in the simulated evolutionary scenarios. Middle bottom: Numbers of locus acquisition events (i.e., forest size minus one) inferred by the heuristic algorithm. Right: Numbers of non-admissible events induced by decompositions inferred by the dynamic programming and the heuristic algorithm Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 16 of 18 Fig. 8 Gene tree of homologs of protein ABV48733.1 and species tree retrieved from the NCBI Taxonomy database. Numbers above branches show the values of I/P mappings before decomposition, numbers below branches show the aLRT support. The separate histories of different loci have been highlited by different colors extensively duplicated in Fusarium oxysporum. Further- node (labeled as 2) is only 0.65. A closer inspection shows more, the light blue subtree branches with Aspergillus that performing a nearest neighbor interchange opera- and Penicillium species. As the support of the “junction” tion resolves the incongruence. This suggests that this node of both subtrees (labeled as 1 on the locus tree) is locus subtree is an effect of an erroneous tree inference. 1.00, we can assume that this is not due to erroneous Reconciliation explains this event by a horizontal gene gene tree inference. The Fusarium species are not pre- transfer for transfer weights from 3 to 5, and as an ances- sent in any other part of the tree, which is indicative of tral duplication for greater weights. a horizontal gene transfer from the ancestor of Aspergil- Now, consider the light green subtree. This subtree lus and closely related Penicillium species to the ances- contains several species which are present in its sister tor of Fusarium species from the light blue subtree. The subtree (A. lentulus, A. udagawae) and numerous closely horizontal gene transfer hypothesis is further supported related species. This is indicative of an ancestral duplica - by the fact that genus Aspergillus is distantly related to tion. The duplication hypothesis is further confirmed by Fusarium. It is estimated that their ancestors separated the values of mappings I and P. For the junction node 300–500 million years ago [34, 35]. (labeled 3), the mapping P is considerably lower than the The branching of two distant groups of closely related mapping I. This indicates a branching of two large, but species in this case is also visible from the values of map- closely related groups of organisms. Comparison of map- pings I and P, as their values at the junction node 1 are pings I and P at junction nodes and their children can considerably higher than the values at the root of the in future serve as a basis for automated classification of light blue subtree and its sibling node. This transfer is locus gain events as transfers or duplications. consistent with reconciliation results for transfer weights Most other locus gain events are ambiguous, both in greater or equal to 3.7. the case of history decomposition analysis and the tree The emergence of another light blue subtree, consist - reconciliation. ing of Penicillium oxalicum and Penicillium arizonense, could similarly be explained by a duplication or a hori- Conclusions zontal gene transfer from an ancestor of Aspergillus uda- In this work, we have investigated two new problems, gawae. However, Aspergillus and Penicillium species are LTI and CLTI, for locus tree inference in a parsimony fairly closely related, and the support of the junction framework, defined as problems of decomposing a gene tree into trees consistent with a species tree. We have Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 17 of 18 Received: 16 November 2017 Accepted: 11 May 2018 proposed a new mapping, called the highest separating rank, which has been applied to improve the classifica - tion of duplications and to solve CLTI. We have proposed two memory and time efficient solutions to the proposed 2 References problems: an O(n ) dynamic programming algorithm for 1. Keeling PJ, Palmer JD. Horizontal gene transfer in eukaryotic evolution. LTI and a near linear time heuristic for CLTI designed Nat Rev Genet. 2008;9(8):605–18. 2. Ravenhall M, Škunca N, Lassalle F, Dessimoz C. Inferring horizontal gene to solve large instances. Next, to verify the evolutionary transfer. PLOS Comput Biol. 2015;11(5):1–16. consistency of the output our algorithms, we have pro- 3. Doyon J-P, Ranwez V, Daubin V, Berry V. Models, algorithms and programs posed a validation method based on the model of evolu- for phylogeny reconciliation. Brief Bioinform. 2011;12(5):392. 4. Rasmussen MD, Kellis M. Unified modeling of gene duplication, loss, and tionary scenarios with HGTs. Finally, we have performed coalescence using a locus tree. Genome Res. 2012;22(4):755–65. a number of tests on simulated data showing that these 5. Rannala B, Yang Z. Bayes estimation of species divergence times and algorithms detect evolutionary events with high accuracy, ancestral population sizes using DNA sequences from multiple loci. Genetics. 2003;164(4):1645–56. and performed a proof-of-concept analysis of an empiri- 6. Kingman JFC. The coalescent. Stochastic Process Appl. 1982;13(3):235–48. cal gene tree. Our results suggest that the new mapping, 7. Wu Y-C, Rasmussen MD, Bansal MS, Kellis M. Most parsimonious reconcili- combined with the lca-mapping, can be used to locate ation in the presence of gene duplication, loss, and deep coalescence using labeled coalescent trees. Genome Res. 2014;24(3):475–86. cases of gene duplications and horizontal gene transfers. 8. Marcet-Houben M, Gabaldón T. Treeko: a duplication-aware algorithm for the comparison of phylogenetic trees. Nucleic Acids Res. 2011;39:e66. Future outlooks 9. Richards TA, Leonard G, Soanes DM, Talbot NJ. Gene transfer into the fungi. Fungal Biol Rev. 2011;25(2):98–110. We plan to extend the solutions to LTI and CLTI to non- 10. Vernot B, Stolzer M, Goldman A, Durand D. Reconciliation with non- binary gene trees, as it would allow to collapse nodes binary species trees. J Comput Biol. 2008;15(8):981–1006. with low support and possibly to decrease the forest size. 11. Page RDM. Maps between trees and cladistic analysis of historical asso- ciations among genes, organisms, and areas. Syst Biol. 1994;43(1):58–77. We will further investigate the properties of the highest 12. Górecki P, Tiuryn J. DLS-trees: a model of evolutionary scenarios. Theor separating rank mapping, especially in the context of Comput Sci. 2006;359(1–3):378–99. supertree inference, gene tree rooting and gene tree cor- 13. Bonizzoni P, Della Vedova G, Dondi R. Reconciling a gene tree to a species tree under the duplication cost model. Theor Comput Sci. rection. Finally, we plan to apply our methods to design 2005;347(1–2):36–53. automated tools for HGT inference. They will serve as a 14. Bansal MS, Alm EJ, Kellis M. Efficient algorithms for the reconciliation preprocessing step in obtaining a manually curated data- problem with gene duplication, horizontal transfer and loss. Bioinformat- ics. 2012;28(12):283–91. set of horizontally transferred genes. 15. Naranjo-Ortíz MA, Brock M, Brunke S, Hube B, Marcet-Houben M, Gabaldón T. Widespread inter-and intra-domain horizontal gene transfer Authors’ contributions of d-amino acid metabolism enzymes in eukaryotes. Front Microbiol. MC and PG worked on the algorithmical part of the manuscript. MC per- 2016;7:2001. formed the computational experiments. AM provided cases of horizontally 16. Bender MA, Farach-Colton M. The lCA problem revisited. In: Latin transferred genes. MC and AM performed the analysis of decomposition of American symposium on theoretical informatics. Berlin: Springer; 2000. p. the gene tree in “Applications of evolutionary history decomposition” section. 88–94. All authors read and approved the final manuscript. 17. Zheng Y, Zhang L. Reconciliation with non-binary gene trees revisited. In: International conference on research in computational molecular biol- Author details ogy. Berlin: Springer; 2014. p. 418–32. Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, 18. Sanderson MJ, McMahon MM. Inferring angiosperm phylogeny from Banacha 2, Warsaw, Poland. Institute of Biochemistry and Biophysics, Polish EST data with widespread gene duplication. BMC Evol Biol. 2007;7(Suppl Academy of Sciences, Pawińskiego 5A, Warsaw, Poland. 1):S3. 19. Warnow T. Models and algorithms for genome evolution. In: Chauve C, El- Acknowledgements Mabrouk N, Tannier E, editors. Large-scale multiple sequence alignment We thank BM for discussions and thoughtful remarks. The support was and phylogeny estimation. London: Springer; 2013. p. 85–146. provided by National Science Centre Grants #2015/19/B/ST6/00726 and 20. Eulenstein O, Huzurbazar S. Reconciling phylogenetic trees. New York: #2017/25/B/NZ2/01880. Wiley; 2010. p. 185–206. 21. Berglund-Sonnhammer A-C, Steffansson P, Betts MJ, Liberles DA. Optimal Competing interests gene trees from sequences and species trees using a soft interpretation The authors declare that they have no competing interests. of parsimony. J Mol Evol. 2006;63(2):240–50. 22. Lafond M, Swenson KM, El-Mabrouk N. An optimal reconciliation Ethics approval and consent to participate algorithm for gene trees with polytomies. In: International workshop on Not applicable. algorithms in bioinformatics. Berlin: Springer; 2012. p. 106–22. 23. Bansal MS, Alm EJ, Kellis M. Efficient algorithms for the reconciliation Implementation problem with gene duplication, horizontal transfer and loss. Bioinformat- The algorithms for gene tree decomposition have been implemented in a ics. 2012;28(12):283–91. prototype script that is publicly available at https ://githu b.com/mciac h/Locus 24. Tofigh A, Hallett M, Lagergren J. Simultaneous identification of duplica- TreeI nfere nce. tions and lateral gene transfers. IEEE/ACM Trans Comput Biol Bioinform. 2011;8(2):517–35. Publisher’s Note 25. Mykowiecka A, Szczesny P, Gorecki P. Inferring gene-species assignments Springer Nature remains neutral with regard to jurisdictional claims in pub- in the presence of horizontal gene transfer. In: IEEE/ACM Trans Comput lished maps and institutional affiliations. Biol Bioinform. 2017. p. 1–1. Ciach et al. Algorithms Mol Biol (2018) 13:11 Page 18 of 18 26. Sjöstrand J, Arvestad L, Lagergren J, Sennblad B. Genphylodata: realistic 32. Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O. New simulation of gene family evolution. BMC Bioinform. 2013;14(1):209. algorithms and methods to estimate maximum-likelihood phylogenies: 27. Kendall DG. On the generalized birth-and-death process. Ann Math Stat- assessing the performance of phyml 3.0. Syst Biol. 2010;59(3):307–21. ist. 1948;19(1):1–15. 33. Huerta-Cepas J, Serra F, Bork P. Ete 3: reconstruction, analysis, and visuali- 28. Stolzer M, Lai H, Xu M, Sathaye D, Vernot B, Durand D. Inferring duplica- zation of phylogenomic data. Mol Biol Evol. 2016;33(6):1635. tions, losses, transfers and incomplete lineage sorting with nonbinary 34. Lucking R, Huhndorf S, Pfister DH, Plata ER, Lumbsch HT. Fungi evolved species trees. Bioinformatics. 2012;28(18):409–15. right on track. Mycologia. 2009;101(6):810–22. 29. Patron NJ, Waller RF, Cozijnsen AJ, Straney DC, Gardiner DM, Nierman WC, 35. Taylor JW, Berbee ML. Dating divergences in the fungal tree of life: review Howlett BJ. Origin and distribution of epipolythiodioxopiperazine (etp) and new analyses. Mycologia. 2006;98(6):838–49. gene clusters in filamentous ascomycetes. BMC Evol Biol. 2007;7(1):174. 30. Katoh K, Standley DM. Mafft multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80. 31. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. Trimal: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25(15):1972–3. Ready to submit your research ? Choose BMC and benefit from: fast, convenient online submission thorough peer review by experienced researchers in your ﬁeld rapid publication on acceptance support for research data, including large and complex data types • gold Open Access which fosters wider collaboration and increased citations maximum visibility for your research: over 100M website views per year At BMC, research is always in progress. Learn more biomedcentral.com/submissions
Algorithms for Molecular Biology – Springer Journals
Published: Jun 4, 2018
It’s your single place to instantly
discover and read the research
that matters to you.
Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.
All for just $49/month
Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly
Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.
Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.
Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.
All the latest content is available, no embargo periods.
“Hi guys, I cannot tell you how much I love this resource. Incredible. I really believe you've hit the nail on the head with this site in regards to solving the research-purchase issue.”
Daniel C.
“Whoa! It’s like Spotify but for academic articles.”
@Phil_Robichaud
“I must say, @deepdyve is a fabulous solution to the independent researcher's problem of #access to #information.”
@deepthiw
“My last article couldn't be possible without the platform @deepdyve that makes journal papers cheaper.”
@JoseServera
DeepDyve Freelancer | DeepDyve Pro | |
---|---|---|
Price | FREE | $49/month |
Save searches from | ||
Create lists to | ||
Export lists, citations | ||
Read DeepDyve articles | Abstract access only | Unlimited access to over |
20 pages / month | ||
PDF Discount | 20% off | |
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.
ok to continue