B. Mau, M. Newton, B. Larget (1999)
Bayesian Phylogenetic Inference via Markov Chain Monte Carlo Methods. Biometrics, 55
F. Jones, M. Burgoon, S. Hoffman, K. Crossin, B. Cunningham, G. Edelman (1988)
A cDNA clone for cytotactin contains sequences similar to epidermal growth factor-like repeats and segments of fibronectin and fibrinogen. Proceedings of the National Academy of Sciences of the United States of America, 85(7)
A. Fornace, David Cummings, C. Comeau, J. Kant, Gerald Crabtree (1984)
Structure of the human gamma-fibrinogen gene. Alternate mRNA splicing near the 3' end of the gene produces gamma A and gamma B forms of gamma-fibrinogen. The Journal of Biological Chemistry, 259(20)
O. Eulenstein, M. Vingron (1998)
On the Equivalence of Two Tree Mapping Measures. Discret. Appl. Math., 88
A. Aguinaldo, J. Turbeville, Lawrence Linford, M. Rivera, J. Garey, R. Raff, J. Lake (1997)
Evidence for a clade of nematodes, arthropods and other moulting animals. Nature, 387
O. Eulenstein, B. Mirkin, M. Vingron (1998)
Duplication-Based Measures of Difference Between Gene and Species Trees. Journal of Computational Biology, 5(1)
B. Mirkin, I. Muchnik, Temple Smith (1995)
A Biologically Consistent Model for Comparing Molecular Phylogenies. Journal of Computational Biology, 2(4)
N. Saitou, M. Nei (1987)
The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4)
M. Goodman, J. Czelusniak, G. Moore, A. Romero-Herrera, G. Matsuda (1979)
Fitting the gene lineage into its species lineage
N. Mulder, R. Apweiler, T. Attwood, A. Bairoch, A. Bateman, David Binns, M. Biswas, Paul Bradley, P. Bork, P. Bucher, R. Copley, E. Courcelle, R. Durbin, L. Falquet, W. Fleischmann, J. Gouzy, S. Griffiths-Jones, D. Haft, H. Hermjakob, N. Hulo, D. Kahn, Alexander Kanapin, Maria Krestyaninova, R. Lopez, Ivica Letunic, S. Orchard, M. Pagni, David Peyruc, C. Ponting, F. Servant, Christian Sigrist (2002)
InterPro: An Integrated Documentation Resource for Protein Families, Domains and Functional Sites. Briefings in Bioinformatics, 3(3)
J. Eisen (1998)
Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Research, 8(3)
S. Barns, C. Delwiche, J. Palmer, N. Pace (1996)
Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proceedings of the National Academy of Sciences of the United States of America, 93(17)
R. Page (1994)
Maps between trees and cladistic analysis of historical associations among genes
J. Felsenstein (1985)
Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39
C. Zmasek, S. Eddy (2001)
ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics, 17(4)
M. Tristem (2000)
Molecular Evolution: A Phylogenetic Approach. Heredity, 84
(2000)
HMMER: profile hidden Markov models for biological sequence analysis
R. Page (1998)
GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics, 14(9)
Oliver Eulenstein (1999)
Vorhersage von Genduplikationen und deren Entwicklung in der Evolution, 20/1998
O. M. (1978)
22 A Model of Evolutionary Change in Proteins
S. Altschul, W. Gish, W. Miller, E. Myers, D. Lipman (1990)
Basic local alignment search tool. Journal of Molecular Biology, 215(3)
(1995)
Biochemistry. Freeman
R. Guigó, I. Muchnik, Temple Smith (1996)
Reconstruction of ancient molecular phylogeny. Molecular Phylogenetics and Evolution, 6(2)
R. Tatusov, D. Natale, I. Garkavtsev, T. Tatusova, U. Shankavaram, B. Rao, B. Kiryutin, Michael Galperin, N. Fedorova, E. Koonin (2001)
The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Research, 29(1)
Rebecca Parr, L. Fung, Jeffrey Reneker, N. Myers-Mason, J. Leibowitz, G. Levy (1995)
Association of mouse fibrinogen-like protein with murine hepatitis virus-induced prothrombinase activity. Journal of Virology, 69
L. Mueller, F. Ayala (1982)
Estimation and interpretation of genetic distance in empirical studies. Genetical Research, 40(2)
(1993)
PHYLIP: Phylogeny Inference Package, Version 3
A. Bairoch, R. Apweiler (2000)
The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research, 28(1)
Nicholas Baker, Marek Mlodzik, Gerald Rubin (1990)
Spacing differentiation in the developing Drosophila eye: a fibrinogen-related lateral inhibitor encoded by scabrous. Science, 250(4986)
B. Schieber, U. Vishkin (1988)
On Finding Lowest Common Ancestors: Simplification and Parallelization. SIAM J. Comput., 17
J. JáJá (1992)
An Introduction to Parallel Algorithms
Robert Finn, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger, Joanne Pollington, O. Gavin, Prasad Gunasekaran, Goran Ceric, Kristoffer Forslund, Liisa Holm, Erik Sonnhammer, Sean Eddy, Alex Bateman (2007)
The Pfam protein families database. Nucleic Acids Research, 38
R. Page, M. Charleston (1996)
Reconciled trees and incongruent gene and species trees
Kevin Chen, D. Durand, Martín Farach-Colton (2000)
Notung: dating gene duplications using gene family trees
Louxin Zhang (1997)
On a Mirkin-Muchnik-Smith Conjecture for Comparing Molecular Phylogenies. Journal of Computational Biology, 4(2)
Vol. 17 no. 9 2001 BIOINFORMATICS Pages 821–828

A simple algorithm to infer gene duplication and speciation events on a gene tree

Christian M. Zmasek and Sean R. Eddy
Howard Hughes Medical Institute, Department of Genetics, Washington University School of Medicine, St Louis, MO 63110, USA
Received on January 24, 2001; revised on April 5, 2001; accepted on April 6, 2001

ABSTRACT
Motivation: When analyzing protein sequences using sequence similarity searches, orthologous sequences (that diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (that diverged by gene duplication), because duplication enables functional diversification. The utility of phylogenetic information in high-throughput genome annotation ('phylogenomics') is widely recognized, but existing approaches are either manual or indirect (e.g. not based on phylogenetic trees). Our goal is to automate phylogenomics using explicit phylogenetic inference. A necessary component is an algorithm to infer speciation and duplication events in a given gene tree.
Results: We give an algorithm to infer speciation and duplication events on a gene tree by comparison to a trusted species tree. This algorithm has a worst-case running time of O(n^2), which is inferior to two previous algorithms that are ∼O(n) for a gene tree of n sequences. However, our algorithm is extremely simple, and its asymptotic worst-case behavior is only realized on pathological data sets. We show empirically, using 1750 gene trees constructed from the Pfam protein family database, that it appears to be a practical (and often superior) algorithm for analyzing real gene trees.
Availability: http://www.genetics.wustl.edu/eddy/forester
Contact: zmasek@genetics.wustl.edu; eddy@genetics.wustl.edu

INTRODUCTION
Automated sequence function prediction has become a necessity due to the enormous amount of sequence data currently produced by the various genome projects. The fact that many proteins belong to large superfamilies that consist of subfamilies with different biological functions complicates such efforts.

Usually, automated sequence function prediction is done using methods based on pairwise sequence similarity, such as BLAST (Altschul et al., 1990). Annotating a new sequence by transferring annotation from its best BLAST hits tends to classify novel sequences too aggressively. Without careful human intervention, it is impossible to detect when a new sequence is not as similar to known homologs as it should be, and in fact represents the first member of a novel functional subfamily in a larger superfamily, which is often an extremely interesting result.

In contrast, analyses using profile search algorithms such as HMMER (Eddy, 2000) and protein family databases such as Pfam (Bateman et al., 2000) and InterPro (Apweiler et al., 2000) classify sequences too conservatively. They recognize that a new sequence belongs to a certain family, but do not subclassify the sequence.

Profile algorithms can be used to align the novel sequence to a curated alignment of the known family members. A human annotator can use this multiple alignment as input for a phylogenetic tree analysis, and from the placement of the new sequence in the tree of known sequences can infer a more specific function. This approach was called 'phylogenomics' by Eisen (1998). This procedure is different from schemes such as the COG database (Tatusov et al., 2001) in that it directly uses phylogenetic trees, whereas COG clusters sequences based on evolutionary relationships indirectly inferred from sequence similarities.

It is impossible to automate this process fully, because it is impossible to precisely define what 'protein function' means. However, a principle of phylogenomics is that orthologous sequences (that diverged by speciation) are more likely to conserve protein function than paralogous sequences (that diverged by gene duplication). Orthology and paralogy are precisely defined and can be inferred from gene and species trees. One simple example of a phylogenomics approach that is reasonable and automatable could thus be stated as follows. If a novel sequence has orthologs, functional annotation can be transferred from them (as in best BLAST analysis); if there are no orthologs, the sequence is classified as just a family member (as in Pfam/InterPro analysis) and flagged as possibly the first representative of a novel subfamily. Other, more sophisticated analyses could be devised.

At the core of such approaches therefore stands the distinction between orthologs and paralogs, and hence the ability to discriminate between duplication and speciation events on a gene tree.

Algorithms to distinguish between duplications and speciations have been employed previously in calculating the dissimilarity between gene trees and species trees, and in inferring parsimonious species trees from gene trees by minimizing the number of duplications and gene losses that must be invoked to reconcile a given gene sequence tree with the inferred species tree (Eulenstein and Vingron, 1995; Goodman et al., 1979; Guigo et al., 1996; Mirkin et al., 1995; Page and Charleston, 1997; Zhang, 1997). Brute force algorithms to solve this problem can have unfavorable O(n^3) running times. Two known algorithms solve the problem efficiently with excellent worst-case running times of ∼O(n) for a gene tree of n sequences (Zhang, 1997; Eulenstein, 1998; the Eulenstein algorithm is implemented in the program 'GeneTree', Page, 1998), but both algorithms are somewhat complex. We describe here a very simple algorithm that appears to solve the problem even more efficiently on realistic data sets, though it has an asymptotic worst-case running time that is less favorable.

ALGORITHM
A gene tree G and the species tree S of the species harboring the genes of G do not necessarily have to exhibit the same topology (Page and Holmes, 1998). Gene duplication, gene loss, and horizontal genetic transfer are some of the forces causing inconsistencies. Gene duplication can be trivially inferred when a species contains two or more homologs belonging to the same gene family (tree G_1 in Figure 1). However, due to gene loss or incomplete sampling of genes in partially sequenced genomes, not all duplications are detectable by simple redundancy in a gene tree (tree G_2 in Figure 1). Reliable assignment of nodes in the gene tree as either duplication events or speciation events requires comparison to the phylogenetic tree of the species (tree S in Figure 1).

Fig. 1. Gene trees and species trees. G_1 and G_2 are gene trees, S is a species tree. Internal tree nodes representing gene duplications are labeled as such, other internal nodes represent speciations. The sequence family in tree G_1 is comprised of three functional subfamilies: α, β and γ. The two duplications in G_1 can be inferred directly from the redundancy of species names. G_2 is a tree of the same family as G_1. In contrast to G_1, some sequences are not present in G_2, either due to gene loss or incomplete sampling. The second duplication in G_2 can only be inferred by comparing it to the species tree S and recognizing the anomaly of placing the human gene closer to yeast than to nematodes.

First let us define how we recognize that a node in a gene tree G should be assigned as a duplication, given species tree S. We use a mapping function M which was first introduced by Goodman et al. (1979) and used elsewhere (Chen et al., 2000; Eulenstein et al., 1998; Eulenstein and Vingron, 1995; Guigo et al., 1996; Mirkin et al., 1995; Page, 1994; Page and Charleston, 1997; Zhang, 1997):

DEFINITION 1. Let G be the set of nodes in a rooted binary gene tree and S the set of nodes in a rooted binary species tree. For any node g ∈ G, let γ(g) be the set of species in which occur the extant genes descendant from g. For any node s ∈ S, let σ(s) be the set of species in the external nodes descendant from s. For any g ∈ G, let M(g) ∈ S be the smallest (lowest) node in S satisfying γ(g) ⊆ σ(M(g)).

That is, M(g) points to the ancestral species in S that (we infer) harbored ancestral gene g. Duplications are then defined using M(g) in Goodman et al. (1979) and formally in Guigo et al. (1996) and Page and Charleston (1997) as follows:

DEFINITION 2. Let g_1 and g_2 be the two child nodes of an internal node g of a rooted binary gene tree G. Node g is a duplication if and only if M(g) = M(g_1) or M(g) = M(g_2).

An example is shown in Figure 2. This approach makes a parsimony assumption. It postulates the minimal number of duplications necessary to reconcile the gene tree with the species tree, and it places those duplications as close to the external nodes as possible. It minimizes the number of unobserved genes (due to gene loss or incomplete sampling) that need to be invoked.

Fig. 2. The mapping function M and the definition of a duplication. M is symbolized by arrows originating at nodes of the gene tree G and pointing to nodes in the species tree S. Letters A to D represent species names. As an example, the mapping for g_3 is computed as follows. According to Definition 1, γ(g_3) = {A, C}, hence M(g_3) = s_2, since the smallest node s ∈ S satisfying γ(g_3) ⊆ σ(s) is s_2, for which σ(s_2) = {A, B, C}. Each external node of G maps to the external node in S that is associated with the same species name. The parent of g_3 is a duplication according to Definition 2, since it and its child g_3 map to the same node s_2.

Given the mapping function M(g), using Definition 2 to assign duplications requires only a linear-time, O(n), traversal of a gene tree G for n genes. What about calculating M(g)? To our knowledge, Page was the first to implement an algorithm for this problem (Page, 1994), but the description given is a brute force approach (for each node g in G, visit each node s in S, compile the sets γ(g) and σ(s), and compare them). This algorithm has a running time of O(n^3), if the number of species in S is O(n). To speed this up, observe that M(g) cannot be lower than M(g_1) or M(g_2) in S. Furthermore, observe that M(g) must in fact be the Last Common Ancestor (LCA) of M(g_1) and M(g_2). Therefore if we are careful to traverse G in the right direction, we can assign M(g) recursively without ever having to explicitly compile or compare the lists γ(g) and σ(s), and without having to traverse all of S for each node g. This recursive algorithm goes as follows:

Input: Rooted binary gene tree G, rooted binary species tree S of all species in G.
Output: G with 'duplication' or 'speciation' assigned to each of its internal nodes.

Initialization
    Number nodes of S in preorder traversal (root = 1, child nodes always larger than parent node);
    For each external node g of G, set M(g) to the number of the external node in S with the matching species name;

Recursion
    Visit each internal node g of G in postorder traversal (from external nodes upwards to root):
        set a = M(g_1); [g_1 is child 1 of the current node g]
        set b = M(g_2); [g_2 is child 2 of the current node g]
        while (a != b):
            if a > b:
                set a = parent of node a;
            else:
                set b = parent of node b;
        set M(g) = a;
        if (M(g) == M(g_1)) or (M(g) == M(g_2)):
            g is a duplication;
        else:
            g is a speciation.

A sketch of the running time analysis of this algorithm is as follows. Initializing M(g) for the external nodes of G is O(n) if species names are looked up in a hash table (Cormen et al., 1990). Initializing the numbering of S is O(n) (again assuming that the number of nodes in S scales linearly with the number of nodes in G; S can be smaller than G but not larger). Thus initialization is O(n) and will not be the rate determining step. In the recursion, we visit each of the n − 1 internal nodes in G individually, and at each node we find the LCA of M(g_1) and M(g_2) simply by brute force, by climbing the tree from both points until we meet. The computational cost of finding LCAs in this manner depends on the topology of G and S. In the best case, G has no duplications and the topologies of G and S are the same; each LCA determination costs O(1), no node in S will be reached more than twice in the whole algorithm, and the overall running time is therefore O(n) (Figure 3A). In a pathological bad case, if M(g) for all internal nodes in G pointed to the root of the species tree (itself a special case of the unusual situation in which all parent nodes of all internal nodes are gene duplication events), and nonetheless no more than one gene in G is found in each species, each LCA determination would require climbing the entire height of tree S, which for a balanced binary tree would be log n, giving an overall running time of O(n log n) (Figure 3B). Finally, in the pathological worst case, not only would each LCA require climbing all of the height of S, but S could also be a maximally unbalanced tree (a tree in which each internal node has at least one external child, also called a 'pectinate' tree) with a height of n, giving an overall running time of O(n^2) (Figure 3C). The space complexity of the algorithm is O(n), since only the two trees and a constant number of auxiliary variables need to be stored.
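The recursion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the FORESTER/SDI Java code: the names Node, infer_events, leaf_species and brute_force_mapping are hypothetical, external gene-tree nodes are assumed to carry their species name in the name field, and every species in the gene tree is assumed to occur in the species tree. The brute_force_mapping helper applies Definition 1 directly and is included only as a cross-check of the climbing procedure.

```python
class Node:
    """A node of a rooted binary tree; `name` holds the species at external nodes."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        self.number = None    # preorder number (used in the species tree)
        self.mapping = None   # M(g), a species-tree node (used in the gene tree)
        self.event = None     # "duplication" or "speciation" (internal gene nodes)
        for child in self.children:
            child.parent = self


def preorder(root):
    """Yield nodes parent-first, without recursion (safe for very deep trees)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.children))


def postorder(root):
    """Return nodes children-first: the reverse of a parent-first walk."""
    return reversed(list(preorder(root)))


def infer_events(gene_root, species_root):
    """Set `mapping` and `event` on gene-tree nodes by climbing the species tree."""
    leaf_of = {}
    for number, s in enumerate(preorder(species_root), start=1):
        s.number = number              # root = 1, children larger than parent
        if not s.children:
            leaf_of[s.name] = s        # hash-table lookup by species name
    for g in postorder(gene_root):
        if not g.children:             # external node: map to its species
            g.mapping = leaf_of[g.name]
            continue
        g1, g2 = g.children
        a, b = g1.mapping, g2.mapping
        while a is not b:              # brute-force LCA: move the higher-numbered node up
            if a.number > b.number:
                a = a.parent
            else:
                b = b.parent
        g.mapping = a
        is_dup = a is g1.mapping or a is g2.mapping   # Definition 2
        g.event = "duplication" if is_dup else "speciation"
    return gene_root


def leaf_species(node):
    """Set of species names at the external nodes below `node` (gamma or sigma)."""
    return {n.name for n in preorder(node) if not n.children}


def brute_force_mapping(g, species_root):
    """Definition 1 applied directly: lowest s with gamma(g) a subset of sigma(s)."""
    gamma = leaf_species(g)
    s = species_root
    while True:
        lower = [c for c in s.children if gamma <= leaf_species(c)]
        if not lower:
            return s
        s = lower[0]


if __name__ == "__main__":
    # Species tree ((human, nematode), yeast); gene tree ((human, yeast), nematode).
    species = Node(children=[Node(children=[Node("human"), Node("nematode")]), Node("yeast")])
    inner = Node(children=[Node("human"), Node("yeast")])
    gene = Node(children=[inner, Node("nematode")])
    infer_events(gene, species)
    print(inner.event, gene.event)     # speciation duplication
```

The example mirrors the situation of tree G_2 in Figure 1: no species name is repeated, yet the root of the gene tree is assigned as a duplication because it and its child both map to the root of the species tree. Nodes are compared by identity and lifted by their preorder number, so the only per-node state needed in S is a parent pointer and an integer.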
Fig. 3. The number of duplications and the topology of the species tree influence the time complexity of our algorithm. G_1 to G_3 are gene trees, S_1 and S_2 are species trees. M is symbolized by arrows originating at nodes of the gene tree and pointing to nodes in the species tree. Letters A to D represent species names. Circled nodes are duplications. Arrows inside the species trees symbolize the movement of variables a and b (see text).

Algorithms with more efficient asymptotic bounds on running time have been published. The closest to ours are those of Zhang (1997) and Chen et al. (2000). Both observe that LCA calculations can be done in O(1) time, for instance using the LCA algorithms described by Schieber and Vishkin (1988) or by JáJá (1991). The trick is that the LCA of any two nodes on a complete binary tree can be calculated by direct arithmetic. The tree S (which in general is not a complete binary tree) is therefore preprocessed in such a way that the nodes of S are associated with nodes in a complete binary tree; this preprocessing takes O(n) time. A quite different algorithm, developed by Eulenstein (1998), calculates M in O(nα(n, n)) time, where α(n, n) is the very slowly growing inverse of Ackermann's function (Cormen et al., 1990). This algorithm visits each node of the species tree S and in the process calculates M for each internal node of the gene tree, using a data structure similar to a disjoint-set forest (Cormen et al., 1990).

Both kinds of algorithm, though asymptotically more efficient than ours, require relatively complex preprocessing. We reasoned that since our algorithm has so few steps, we were likely to have a better constant factor than both. Furthermore, our intuition was that the worst-case bounds of our algorithm were pathological and would never be realized on realistic data sets. Eulenstein comments that his algorithm has a lower constant factor than Zhang's. We decided to implement both our algorithm and Eulenstein's, and compare their performance on real data.

IMPLEMENTATION
Both algorithms were implemented in Java. The Java classes are named SDI for 'Speciation vs Duplication Inference' and are part of our FORESTER classes for working with phylogenetic trees. FORESTER including SDI is freely available at http://www.genetics.wustl.edu/eddy/forester/. It should run on every platform with a Java 1.2 JDK. A preprocessing step deletes external nodes in S that have no genes in G, allowing a single trusted species tree to be used for all gene trees. All timings reported are the average of three runs on a single processor 500 MHz Pentium III system running Red Hat Linux 6.0 and Sun Microsystems' Java 1.2 SDK for Linux.

RESULTS
We first timed the two implementations on synthetic data sets that would exercise the worst-case behavior of our algorithm. We synthesized gene trees with n genes, for a range of values of n, where M(g) for every internal node would map to the root of the corresponding species tree with n species (e.g. the situations in Figure 3B and C). Plots of wall clock time versus n are shown in Figure 4. For a balanced species tree, both algorithms have running times that scale nearly linearly in tree size (our O(n log n) is not appreciably different from linear at first glance), and our algorithm exhibits a lower constant than our implementation of the Eulenstein algorithm. For a maximally unbalanced species tree, we confirm our algorithm's worst-case O(n^2) behavior, but because of our lower overhead, SDI is still more efficient for smaller trees. Over about n = 550 genes and species, our implementation of Eulenstein's algorithm outperforms SDI. If only the actual calculation of M(g) is compared (excluding all preprocessing and initialization steps), Eulenstein's algorithm outperforms SDI for n larger than about 200 taxa (data not shown).

Fig. 4. Timing benchmarks on real trees to determine average-case behavior, and synthetic trees that exercise our algorithm's worst-case behavior. For the synthetic trees, every internal node of the gene tree maps to the root of the corresponding species tree and each gene tree has the same size as the corresponding species tree. Each synthetic data point is the average of three measurements. Curves were fitted using GNUPLOT's nonlinear least squares fitting mechanism (Marquardt–Levenberg algorithm). Real trees are from Pfam alignments and were created as described in the text. In the case of real trees, the species trees usually have fewer taxa than gene trees (each species may contain more than one paralogous gene); hence the smaller times relative to synthetic data tests. Each Pfam data point is the average of 100 measurements.

We then tested both implementations on real data to empirically determine their average-case behavior. We obtained 2478 multiple sequence alignments from the 'full' alignments (as opposed to the smaller 'seed' alignments) in the protein family database Pfam (release 5.5; Bateman et al., 2000).

Gene trees were constructed from these alignments as follows. All sequences not originating from the curated SWISS-PROT database (Bairoch and Apweiler, 2000), and all sequences not from species in our species tree (see below), were removed from the alignments. Alignments with fewer than four or more than 1000 sequences were discarded, leaving 1750 alignments. Columns containing one or more gap symbols were removed from the alignment if the resulting alignment after this filtering was at least 100 amino acids in length. Pairwise distances were calculated based on the Dayhoff PAM matrix (Dayhoff et al., 1978) using the program PROTDIST from Felsenstein's (1993) PHYLIP package. A neighbor-joining tree (Saitou and Nei, 1987) was constructed using the program NEIGHBOR from the PHYLIP package. Roots were placed by the midpoint rooting method (Swofford et al., 1996).

A single master species tree was compiled manually, containing 200 of the most commonly encountered species in Pfam. The topology of this species tree is based on the taxonomy database at NCBI (http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html/), the Tree of Life project (Maddison and Maddison, at http://phylogeny.arizona.edu/tree/phylogeny.html), Barns et al. (1996), and Aguinaldo et al. (1997). This tree is available at http://www.genetics.wustl.edu/eddy/forester/.

The individual running times of the SDI algorithm and of the Eulenstein algorithm for each of these 1750 trees are shown in Figure 4. These data imply that the average-case behavior of our algorithm on real data sets is approximately O(n), and its worst-case behavior is not realized.

As an example of the results from such an analysis, and how they might be useful in sequence annotation, the gene tree for the fibrinogen beta and gamma chain Pfam family (Pfam accession number: PF00147) is presented in Figure 5. The fibrinogen sequence family contains fibrinogen alpha, beta and gamma chains (sequences with FIBA, FIBB, FIBG prefixes) which together form the fibrinogen hexamer (Stryer, 1995). Each chain type appears on the tree as a paralogous subtree. A special case is FIBH_HUMAN (fibrinogen gamma-B chain), which appears to be the result of alternative splicing of the human gamma chain gene (Fornace et al., 1984). In addition, the fibrinogen family also contains various proteins probably involved in adhesion, which share the fibrinogen-like domain with the fibrinogen sequences (e.g. Jones et al., 1988; Baker et al., 1990), such as tenascins (sequences with TENA prefixes). Interestingly, FIBX_MOUSE (also named FGL2_MOUSE), a mouse enzyme with prothrombinase activity (conversion of prothrombin into thrombin), is similar to fibrinogen beta and gamma chains (Parr et al., 1995). Thrombin is an enzyme responsible for cleaving fibrinogen into monomers which in turn polymerize into fibrin (Stryer, 1995). The node connecting FIBX_MOUSE to the rest of the tree is inferred to be a duplication event, since the placement of FIBX_MOUSE contradicts the species tree, and hence FIBX_MOUSE is inferred to be paralogous to the fibrinogen beta chain subfamily (FIBB). In contrast, a naive best BLAST analysis of the FIBX_MOUSE sequence could easily have misannotated it as the mouse fibrinogen beta chain.

Fig. 5. A gene tree for the fibrinogen beta and gamma chain Pfam family. Circled internal nodes represent gene duplication events inferred by SDI. The suffix of each SWISS-PROT sequence name indicates the species (BOVIN, Bos taurus; CHICK, Gallus gallus; DROME, Drosophila melanogaster; HUMAN, Homo sapiens; PIG, Sus scrofa; RAT, Rattus norvegicus; XENLA, Xenopus laevis). Bootstrap values were calculated from 100 replicates and are shown as numbers below the corresponding branch. The tree was rooted by the midpoint rooting method. The figure was produced with our tree display tool ATV (Zmasek and Eddy, 2001, available at http://www.genetics.wustl.edu/eddy/atv/).

DISCUSSION
In this paper we have presented a simple algorithm to infer gene duplication events on a gene tree by comparing it to a species tree.

Computer science textbooks often warn that comparison of asymptotic worst-case running times may be misleading. Our algorithm is O(n^2), yet empirically outperforms at least one more complex algorithm with a superior asymptotic bound close to O(n) (Eulenstein, 1998), at least in our implementation of the two algorithms. Partly this is because our algorithm has very few steps, so it has a low constant. Also, the worst-case behavior of our algorithm is only realized in a pathological case: a gene tree where M(g) for every internal node points to the root of the species tree, and there are no two genes from the same species (i.e. the number of species in S is O(n)), and the species tree is maximally unbalanced. Figure 4 argues that we do not see such cases in real data. In real data our algorithm is nearly linear time. The Zhang (1997) O(n) algorithm has not been analyzed in this work, but we expect that there too, the improved asymptotic bound will not be worth the cost of the extra complexity nor the extra computational overhead. We conclude from our results that we will use SDI for future work.

Our goal is to use SDI as part of a system for automating phylogenomics (Eisen, 1998). SDI gives us a clean, simple computational engine that can become part of that larger goal, but there are additional difficulties that must be faced before we put it to practical use. Most importantly, the algorithm assumes at its peril that the gene tree and species tree are both properly rooted and biologically correct.

Phylogenetic inference algorithms produce unrooted gene trees that will have to be rooted before duplication inference can be performed. Usually trees are rooted either by using a molecular clock assumption or by defining an outgroup. A molecular clock assumption is generally dubious, and will be especially dubious in a sequence family with different paralogous clades with different functions that are under differing selective pressures. Defining an outgroup in a complicated family of paralogous sequences depends on recognizing the paralogies in the first place, so cannot be done independently of duplication inference. Ironically, one plausible approach to root the gene trees might be to minimize the dissimilarity between the gene tree and a species tree as described in Eulenstein and Vingron (1995), Goodman et al. (1979), Guigo et al. (1996), Mirkin et al. (1995), and Zhang (1997), using a duplication inference algorithm.

Phylogenetic inference algorithms also rarely produce completely reliable gene trees. Even a consensus species tree based on all available evidence (from paleontological to molecular) will always have ambiguities. Errors in either tree will produce spurious inferred duplications. This is obviously problematic if duplications are to be used as indicators of potential functional changes. One way to portray uncertainty in phylogenetic trees is lack of resolution (i.e. multifurcations). However, the current algorithms are limited to completely resolved (i.e. completely binary) gene and species trees. In addition, the concept of orthology and paralogy is applicable only to completely resolved gene trees. Instead, we think we can approach this issue using sampling methods, such as bootstrap (Mueller and Ayala, 1982; Felsenstein, 1985) or Markov chain Monte Carlo (Mau et al., 1996), to integrate orthology assignments over tree space. This would allow us to calculate a probability, or at least a bootstrap confidence value, for a particular assertion that a known sequence is orthologous to the new sequence being analyzed, and to rank the inferred orthologs by this confidence. Sampling methods can also help us deal with ambiguities in rooting the trees. Having a fast algorithm for duplication inference ought to help in any sampling procedure that explores large numbers of tree topologies. However, we recognize that the rate limiting step is more likely to be the tree sampling procedure itself, rather than the duplication inference procedure.

ACKNOWLEDGEMENTS
This work was supported primarily by a grant from Monsanto Company, and also by the Howard Hughes Medical Institute and grant HG01363 from the NIH National Human Genome Research Institute.

REFERENCES
Aguinaldo,A.M.A., Turbeville,J.M., Linford,L.S., Rivera,M.C., Garey,J.R., Raff,R.A. and Lake,J.A. (1997) Evidence for a clade of nematodes, arthropods and other moulting animals. Nature, 387, 489–493.
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–
Apweiler,R., Attwood,T.K., Bairoch,A., Bateman,A., Birney,E., Bucher,P., Codani,J.-J., Corpet,F., Croning,M.D.R., Durbin,R., Etzold,T., Fleischmann,W., Gouzy,J., Hermjakob,H., Jonassen,I., Kahn,D., Kanapin,A., Schneider,R., Servant,F. and Zdobnov,E. (2000) InterPro: an integrated documentation resource for protein families, domains and functional sites. CCP11 Newsletter,
Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
Barns,S.M., Delwiche,C.F., Palmer,J.F. and Pace,N.R. (1996) Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proc. Natl Acad. Sci. USA, 93, 9188–9193.
Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and Sonnhammer,E.L. (2000) The Pfam protein families database. Nucleic Acids Res., 28, 263–266.
Chen,K., Durand,D. and Farach-Colton,M. (2000) Notung: dating gene duplications using gene family trees. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology on RECOMB 2000, pp. 96–106.
Cormen,T.H., Leiserson,C.E. and Rivest,R.L. (1990) Introduction to Algorithms. MIT Press, MA.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, (Suppl. 3), National Biomedical Research Foundation, Silver Springs, MD, pp. 345–352.
Eddy,S.R. (2000) HMMER: profile hidden Markov models for biological sequence analysis. Washington University School of Medicine, St Louis, MO (http://hmmer.wustl.edu/).
Eisen,J.A. (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res., 8, 163–167.
Eulenstein,O. (1998) Vorhersage von Genduplikationen und deren Entwicklung in der Evolution. In GMD Research Series, vol. 20, Sankt Augustin, Germany.
Eulenstein,O. and Vingron,M. (1995) On the equivalence of two tree mapping measures. In Arbeitspapiere der GMD, vol. 936, Sankt Augustin, Germany.
Eulenstein,O., Mirkin,B. and Vingron,M. (1998) Duplication-based measures of difference between gene and species trees. J. Comput. Biol., 5, 135–148.
Felsenstein,J. (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39, 783–791.
Felsenstein,J. (1993) PHYLIP: Phylogeny Inference Package, Version 3.5. University of Washington, Seattle, WA.
Fornace,A.J., Cummings,D.E., Comeau,C.M., Kant,J.A. and Crabtree,G.R. (1984) Structure of the human gamma-fibrinogen gene. Alternate mRNA splicing near the 3' end of the gene produces gamma A and gamma B forms of gamma-fibrinogen. J. Biol. Chem., 259, 12826–12830.
Goodman,M., Czelusniak,J., Moore,G.W., Romero-Herrera,A.E. and Matsuda,G. (1979) Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool., 28, 132–168.
Guigo,R., Muchnik,I. and Smith,T.F. (1996) Reconstruction of ancient phylogenies. Mol. Phylogenet. Evol., 6, 189–213.
JáJá,J. (1991) An Introduction to Parallel Algorithms. Addison-Wesley, Reading, MA, pp. 128–136.
Jones,F.S., Burgoon,M.P., Hoffman,S., Crossin,K.L., Cunningham,B.A. and Edelman,G.M. (1988) A cDNA clone for cytotactin contains sequences similar to epidermal growth factor-like repeats and segments of fibronectin and fibrinogen. Proc. Natl Acad. Sci. USA, 85, 2186–2190.
Mau,B., Newton,M.A. and Larget,B. (1996) Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Technical
Baker,N.E., Mlodzik,M. and Rubin,G.M.
(1990) Spacing differenti- Report 961, Statistics Department, University of Wisconsin- ation in the developing Drosophila eye: a fibrinogen-related lat- Madison. eral inhibitor encoded by scabrous. Science, 250, 1370–1377. Mirkin,B., Muchnik,I. and Smith,T.F. (1995) A biologically con- 827 C.M.Zmasek and S.R.Eddy sistent model for comparing molecular phylogenies. J. Comput. Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new Biol., 2, 493–507. method for reconstructing phylogenetic trees. Mol. Biol. Evol., Mueller,L.D. and Ayala,F.J. (1982) Estimation and interpretation of 4, 406–425. genetic distance in empirical studies. Genet. Res., 40, 127–137. Schieber,B. and Vishkin,U. (1988) On finding lowest common Page,R.D.M. (1994) Maps between trees and cladistic analysis of ancestors: simplification and parallelization. SIAM. J. Comput., historical associations among genes, organisms, and areas. Syst. 17, 1253–1262. Biol., 43,58–77. Stryer,L. (1995) Biochemistry. Freeman, New York. Page,R.D.M. (1998) GeneTree: comparing gene and species phylo- Swofford,D.L., Olsen,G.J., Waddell,P.J. and Hillis,D.M. (1996) genies using reconciled trees. Bioinformatics, 14, 819–820. Phylogenetic inference. In Hillis,D.M., Moritz,C. and Page,R.D.M. and Charleston,M.A. (1997) Reconciled trees and in- Mable,B.K. (eds), Molecular Systematics. Sinauer, Sunderland, congruent gene and species trees. In Mirkin,B., McMorris,F.R., MA, pp. 488. Roberts,F.S. and Rzhetsky,A. (eds), Mathematical Hierarchies Tatusov,R.L., Natale,D.A., Garkavtsev,I.V., Tatusova,T.A., in Biology, DIMACS Series in Discrete Mathematics and Theo- Shankavaram,U.T., Rao,B.S., Kiryutin,B., Galperin,M.Y., retical Computer Science, vol. 37, American Mathematical Soci- Fedorova,N.D. and Koonin,E.V. (2001) The COG database: new ety, Providence, RI, pp. 57–70. developments in phylogenetic classification of proteins from Page,R.D.M. and Holmes,E.C. (1998) Molecular Evolution: a complete genomes. 
Nucleic Acids Res., 29,22–28. Phylogenetic Approach. Blackwell Science , Oxford, pp. 30–31. Zhang,L. (1997) On a Mirkin–Muchnik–Smith conjecture for Parr,R.L., Fung,L., Reneker,J., Myers-Mason,N., Leibowitz,J.L. comparing molecular phylogenies. J. Comput. Biol., 4, 177–187. and Levy,G. (1995) Association of mouse fibrinogen-like protein Zmasek,C.M. and Eddy,S.R. (2001) ATV: display and manipulation with murine hepatitis virus-induced prothrombinase activity. J. of annotated phylogenetic trees. Bioinformatics, 17, 383–384. Virol., 69, 5033–5038.
Bioinformatics – Oxford University Press
Published: Sep 1, 2001