SplitsTree: analyzing and visualizing evolutionary data.

D.H.Huson

Abstract

Motivation: Real evolutionary data often contain a number of different and sometimes conflicting phylogenetic signals, and thus do not always clearly support a unique tree. To address this problem, Bandelt and Dress (Adv. Math., 92, 47–105, 1992) developed the method of split decomposition. For ideal data, this method gives rise to a tree, whereas less ideal data are represented by a tree-like network that may indicate evidence for different and conflicting phylogenies.

Results: SplitsTree is an interactive program, for analyzing and visualizing evolutionary data, that implements this approach. It also supports a number of distance transformations, the computation of parsimony splits, spectral analysis and bootstrapping.

Availability: There are two versions of SplitsTree: an interactive Macintosh version (shareware) and a command-line Unix version (public domain). Both are available from: ftp://ftp.uni-bielefeld.de/pub/math/splits/splitstree2. There is a WWW version running at: http://www.bibiserv.techfak.uni-bielefeld.de/splits.

Contact: huson@mathematik.uni-bielefeld.de

Introduction

Evolutionary relationships between taxa are most often represented as phylogenetic trees, and many different algorithms for tree construction have been developed (Swofford et al., 1996). This is, of course, justified by the assumption that evolution is a branching or tree-like process. However, a set of real data often contains a number of different and sometimes conflicting signals, and thus does not always clearly support a unique tree.

To address this problem, Bandelt and Dress (1992a) developed the method of split decomposition. In contrast to methods such as maximum parsimony and maximum likelihood that reconstruct phylogenetic trees by optimizing certain parameters, split decomposition is a transformation-based approach. Essentially, evolutionary data are transformed or, more precisely, 'canonically decomposed', into a sum of 'weakly compatible splits' and then represented by a so-called splits graph. For ideal data, this is a tree, whereas less ideal data will give rise to a tree-like network that can be interpreted as possible evidence for different and conflicting phylogenies. Further, as split decomposition does not attempt to force data onto a tree, it can provide a good indication of how tree-like given data are.

There exist efficient algorithms for performing split decomposition (Bandelt and Dress, 1992a) and for computing splits graphs (Wetzel, 1995; D.H.Huson, in preparation). Dress and Wetzel produced a simple implementation of split decomposition (Wetzel, 1995) as an investigative tool to help develop the general theory. Based on their work, a first public version was developed by Wetzel and Huson (SplitsTree version 1). The program described in this paper (SplitsTree version 2) is a completely new implementation.

In this paper, we first review the concepts of splits, splits graphs and the method of split decomposition, and then discuss the SplitsTree program in detail. For a number of biological applications of the split decomposition method, see, for example, Bandelt and Dress (1992b), Dopazo et al. (1993), Dress and Wetzel (1993), Lockhart et al. (1995), Wetzel (1995), Dress et al. (1996), McLenachan et al. (1996) or P.J.Lockhart et al. (in preparation).

Splits and splits graphs

Evolutionary relationships are generally represented by a phylogenetic tree, T, i.e. a tree whose leaves are labeled by a set X of taxa and whose remaining vertices are unlabeled and of degree at least three. (We only consider unrooted trees in this paper.) Any edge e of T defines a split $S = \{A, \bar{A}\}$ of X, i.e. a partition of X into two non-empty sets $A$ and $\bar{A}$, consisting of all taxa on the one side, or the other, of the edge e. A system Σ of splits is called compatible if, for any two splits $S_1 = \{A_1, \bar{A}_1\}$ and $S_2 = \{A_2, \bar{A}_2\}$ in Σ, one of the four intersections

$$A_1 \cap A_2, \quad A_1 \cap \bar{A}_2, \quad \bar{A}_1 \cap A_2, \quad \text{or} \quad \bar{A}_1 \cap \bar{A}_2$$

is empty. Any phylogenetic tree T gives rise to a compatible split system Σ. In 1971, Buneman established that, vice versa, any compatible split system Σ corresponds to a unique phylogenetic tree T. So, tree reconstruction for a given set of taxa X is equivalent to computing a compatible system of splits Σ for X and determining a weight for each split S that corresponds to the length of the associated edge. Hence, to obtain more general graphs, one must consider less restricted systems of splits.

Let X be a set of taxa. A system of splits Σ of X is called weakly compatible if, for any three splits $S_1, S_2, S_3$ and all $A_i \in S_i$ ($i = 1, 2, 3$), at least one of the four intersections

$$A_1 \cap A_2 \cap A_3, \quad A_1 \cap \bar{A}_2 \cap \bar{A}_3, \quad \bar{A}_1 \cap A_2 \cap \bar{A}_3, \quad \text{or} \quad \bar{A}_1 \cap \bar{A}_2 \cap A_3$$

is empty (Bandelt and Dress, 1992a). So, in particular, any two splits are permitted to be incompatible. Intermediately, Σ is called circular if there exists an ordering $x_1, x_2, \ldots, x_m$ of the taxa such that for every split $S \in \Sigma$ there exists $A \in S$ with $A = \{x_{p(S)}, x_{p(S)+1}, \ldots, x_{q(S)}\}$ and $1 \le p(S) \le q(S) \le m$. One can prove that a circular split system is always weakly compatible and a compatible split system is always circular (Bandelt and Dress, 1992a; Wetzel, 1995).

A splits graph representing a weakly compatible split system Σ is a graph $G(\Sigma) = (V, E)$ whose vertices $v \in V$ are labeled by the set of taxa X and whose edges $e \in E$ are straight-line segments that represent the splits in Σ (see Figure 1). More precisely, each split $S = \{A, \bar{A}\} \in \Sigma$ is represented by a band of parallel edges of equal length in such a way that deleting all edges in such a band partitions the graph into precisely two components: one containing all vertices labeled by taxa in $A$ and the other containing all vertices labeled by taxa in $\bar{A}$. The length of the edges representing a given split S indicates its weight or support and is calculated as the isolation index of S. For algorithms that compute splits graphs, see Wetzel (1995) and D.H.Huson (in preparation).

Fig. 1. The splits graph for the distances listed in Figure 3. Each band of parallel edges indicates a split. For example, the two bold lines represent the split {Euglena, Olithodiscus} versus the other taxa. The distance between any two taxa x and y corresponds to the sum of weights of all splits that separate x and y, i.e. the sum of edge lengths of any shortest path from x to y.

Consider a weakly compatible system of splits Σ of a set X of taxa. If Σ is compatible, then $G(\Sigma)$ is a phylogenetic tree. What is the situation if Σ is merely circular? Then $G(\Sigma)$ can be realized as a planar graph (Wetzel, 1995; D.H.Huson, in preparation). Finally, if Σ is not circular, then in general $G(\Sigma)$ will not be planar. In biological applications, the arising split systems are often either circular or mildly non-circular.

Split decomposition

Split decomposition is a method for obtaining a system of weakly compatible splits with weights from a given set of evolutionary data. So, assume we are given a set of taxa X and a distance map $d: X \times X \to \mathbb{R}$ on X, i.e. a matrix representing the evolutionary distances between pairs of taxa. Bandelt and Dress (1992a) showed that such a distance map $d$ has the following canonical decomposition:

$$d = \sum_{S} \alpha_S \, \delta_S + d_0$$

Here, we sum over all possible splits S of X; the map $\delta_S: X \times X \to \mathbb{R}$ is the split metric on S that equals 1 if x and y lie on different sides of S, and 0 otherwise; the number $\alpha_S \ge 0$ is the weight or isolation index of the split S; and the map $d_0: X \times X \to \mathbb{R}$ is the so-called split-prime residue and cannot be decomposed further. A split S with $\alpha_S > 0$ is called a d-split, and the system Σ of all d-splits is weakly compatible and can be computed efficiently [see Bandelt and Dress (1992a) for details].

If there is no split-prime residue, then the distance between any two taxa x and y is precisely equal to the sum of weights of all d-splits that 'separate' x and y, and thus proportional to the sum of all edge lengths along a shortest path from x to y in the splits graph. However, in general, the split-prime residue will be positive and so the sum of weights will only give an approximation (from below) of the original distances. The fit of the approximation is measured by the sum of all approximated distances divided by the sum of all original distances. In biological applications, the fit is often quite high and a small split-prime residue can be considered as 'noise'.

If we are given a set of aligned sequences, then to apply split decomposition we must first compute a distance matrix $d$ using an appropriate distance transformation. Alternatively, one can compute the so-called parsimony splits, or p-splits, directly from the sequences, as described in Bandelt and Dress (1993). Yet another possibility is to use spectral analysis (Hendy and Penny, 1992; M.D.Hendy and P.J.Waddell, in preparation) to assign a weight (the so-called γ-value) to each possible split of X. One can then greedily extract a weakly compatible (or compatible) system of splits, i.e. by considering all such splits S in decreasing order of weight and inserting the split S into Σ if it is weakly compatible (or compatible) with all splits already in Σ.
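The compatibility tests, the greedy extraction step and the fit described above can be illustrated with a small Python sketch. This is not the SplitsTree implementation: the function names (`sides`, `compatible`, `weakly_compatible`, `greedy_select`, `split_distance`, `fit`), the representation of a split by one of its sides, and the example weights are all assumptions made for illustration. The weights are arbitrary numbers, not isolation indices or γ-values computed from real data.

```python
from itertools import combinations, product

def sides(split, taxa):
    """Return both sides (A, complement of A) of a split given by one side A."""
    a = frozenset(split)
    return (a, frozenset(taxa) - a)

def compatible(s1, s2, taxa):
    """Two splits are compatible if one of the four pairwise intersections is empty."""
    return any(not (x & y) for x in sides(s1, taxa) for y in sides(s2, taxa))

def weakly_compatible(s1, s2, s3, taxa):
    """Three splits are weakly compatible if, for every choice of sides
    A1, A2, A3, at least one of the four triple intersections is empty."""
    options = [(sides(s, taxa), sides(s, taxa)[::-1]) for s in (s1, s2, s3)]
    for (a1, b1), (a2, b2), (a3, b3) in product(*options):
        # b_i is the complement of a_i; all four non-empty violates the definition
        if a1 & a2 & a3 and a1 & b2 & b3 and b1 & a2 & b3 and b1 & b2 & a3:
            return False
    return True

def greedy_select(weighted_splits, taxa, weak=True):
    """Visit splits in decreasing order of weight and keep each split that is
    weakly compatible (weak=True) or compatible (weak=False) with those kept."""
    chosen = []
    for split, weight in sorted(weighted_splits, key=lambda sw: -sw[1]):
        if weak:
            ok = all(weakly_compatible(split, s, t, taxa)
                     for (s, _), (t, _) in combinations(chosen, 2))
        else:
            ok = all(compatible(split, s, taxa) for s, _ in chosen)
        if ok:
            chosen.append((split, weight))
    return chosen

def split_distance(x, y, chosen):
    """Sum of the weights of all chosen splits that separate x and y."""
    return sum(w for a, w in chosen if (x in a) != (y in a))

def fit(d, chosen, taxa):
    """Sum of approximated distances divided by sum of original distances.
    d maps sorted taxon pairs (x, y) to the original distance."""
    pairs = list(combinations(sorted(taxa), 2))
    approx = sum(split_distance(x, y, chosen) for x, y in pairs)
    return approx / sum(d[x, y] for x, y in pairs)

# Hypothetical example: four taxa, the three 2|2 splits and the four trivial splits.
taxa = frozenset("abcd")
weighted = [(frozenset("ab"), 3), (frozenset("ac"), 2), (frozenset("ad"), 1),
            (frozenset("a"), 1), (frozenset("b"), 1),
            (frozenset("c"), 1), (frozenset("d"), 1)]
chosen_weak = greedy_select(weighted, taxa, weak=True)
chosen_tree = greedy_select(weighted, taxa, weak=False)
```

In this toy example the three splits ab|cd, ac|bd and ad|bc are pairwise incompatible and, taken together, not weakly compatible. The weakly compatible selection therefore keeps the two heaviest of them, whereas the compatible selection keeps only the heaviest one and thus yields a tree.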
Description of SplitsTree

SplitsTree is an easy-to-use Macintosh application that takes as input a file containing sequences, distances, or a system of splits, and produces as output a weakly compatible system of splits and a splits graph representing the given data. It contains a number of transformations to obtain distances from sequences and methods for obtaining compatible or weakly compatible split systems from distances or sequences.

Menus

SplitsTree offers the following menus: File, Edit, Layout, Options, Method and Window. The File menu contains the usual items for opening, closing, saving and printing documents. The Edit menu contains items for copying and pasting, etc.

The first group of items in the Layout menu can be used to change the position, orientation and size of the displayed splits graph. The Cycle item allows the user to specify the circular order in which the taxa appear around the outside of the splits graph. This feature can be used to produce the same layout for different splits graphs produced from the same data set by different methods. The Vertex Labels and Edge Labels submenus can be used to decide whether the vertices are to be labeled by the names or numbers of the taxa and whether the edges are to be labeled by weight, number or bootstrap support. The Equal Edges and To Scale items determine whether the edges of the displayed splits graph are drawn all with the same length, or in proportion to the isolation index of the corresponding splits.

The Options menu determines how the given data are preprocessed. The Taxa item enables the user to exclude certain taxa from the analysis. Similarly, the Sites item can be used to exclude certain sites and also codon positions. Moreover, items are available for excluding whole groups of sites: Exclude Gaps, Exclude Missing, Exclude Non Parsimony and Exclude Constant…. In the latter case, one can choose to exclude only a proportion of the constant sites, which can be useful in connection, for example, with the LogDet transformation (see Figure 1), as it provides a way of approximating a more continuous distribution for rates across sites (Adachi and Hasegawa, 1995; Waddell, 1996).

The Options menu also offers a number of distance transformations such as Hamming distances, Kimura 3ST (Kimura, 1981), Jukes Cantor (Jukes and Cantor, 1969) and LogDet (Steel, 1994). The Nei Miller item is for computing distances for restriction site data (Nei and Miller, 1990), and the PAM 250 item applies to protein data (Dayhoff et al., 1983). A user-defined weight matrix can be supplied using the User Matrix item. Moreover, there are two items for determining distances between groups of taxa, both suggested by Mike Steel: the Fitch Sidow… item computes the distances between given groups using a combination of methods from Fitch (1971) and Sidow et al. (1992), whereas the Covarion… item is based on Moulton et al. (1997).

Finally, SplitsTree checks for given distance data whether the triangle inequalities hold. If they do not, then the Force Triangle Inequalities item can be used to enforce them by adding an appropriate offset to all distances.

The Method menu is the most important menu, as it determines which method is applied to produce a split system from the given data. The first group of items all produce weakly compatible split systems. The choices are: Split Decomposition (as described above), Parsimony Splits (Bandelt and Dress, 1993) and Spectral Analysis… (Hendy and Penny, 1992; M.D.Hendy and P.J.Waddell, in preparation, followed by a greedy selection of a weakly compatible split system). The second group of items all produce compatible split systems: Buneman Tree (Buneman, 1971; Bandelt and Dress, 1992a), P-Tree (Bandelt and Dress, 1993) and Spectral Tree (spectral analysis followed by a greedy selection of a compatible split system).

For larger data sets, methods such as split decomposition or computing the 'Buneman tree' tend to produce unresolved split systems. This is because they involve computing the minimum of a certain index over all quartets of taxa that are separated by a given split to determine whether that split should be included in the split system (Bandelt and Dress, 1992a). In an attempt to solve this problem, one can replace the minimum by the average over a given number of quartets with smallest indices to obtain a refined system of splits, as suggested in Moulton et al. (1997). The Refine menu item implements this idea.

The Bootstrap item runs bootstrap sampling from given sequence data (Felsenstein, 1985). This is a way to test the statistical robustness of the computed splits graph. To be precise, the program repeatedly generates new artificial data sets by randomly choosing k (not necessarily distinct) sites in the original data set. The user is prompted to supply the number of times this is done, whereas k usually equals the length of the original sequences. For each such data set, the splits graph is then computed. At the end of this procedure, each split in the original splits graph is labeled by the percentage of computed splits graphs that it occurred in, thus indicating the statistical robustness of each split. The st_bootstrap block contains a full listing of all splits that occurred.

Finally, the Window menu contains a Syntax and Show submenu that can be used to obtain a listing of the syntax or current contents of a selected 'nexus block'. The Get Info item gives general information on the current document. Moreover, the menu contains a list of the currently open windows.

Windows

SplitsTree displays two windows. The SplitsTree Console is used to print messages when reading or computing data. It also accepts typed commands and nexus blocks. Moreover, it is used to present information requested using the menu items described above. The second window, called the document window, displays the splits graph computed for the given data set. The bottom of this window contains a line of information on the current data and how they were computed (see Figure 1).

The splits graph displayed in the document window can be manipulated using the mouse. Clicking on an edge will highlight that edge and all other edges representing the same split. Then, grabbing and dragging any other part of the graph will rotate the selected edges and thus reshape the graph, without changing any of the edge lengths. Moreover, the vertex labels can also be grabbed and dragged.

File format

SplitsTree is based on the new nexus format (Maddison et al., 1995), which was originally developed for the programs PAUP (Swofford, 1997) and MacClade (Maddison and Maddison, 1989). Input data are described in the three standard block types: taxa, characters and distances. More precisely, an input file will typically consist of a taxa block listing the names of the given taxa and either a characters block containing a set of, for example, DNA, RNA, protein or RFLP sequences, or a distances block containing a distance or dissimilarity matrix. In Figure 2, we describe the syntax of these blocks and in Figure 3 an example input file is given.

Fig. 2. Syntax of the three main input blocks. In this figure, square brackets indicate optional items and curly brackets indicate a choice of items. The syntax follows the standard definition of these blocks (Maddison et al., 1995), except for the two additional commands marked by a (*). The CHARWEIGHTS item is used to enter weights when specifying RFLP data. The FORCE_METRIC item can be set by the program when the triangle inequalities do not hold and an offset must be added to enforce them.

Fig. 3. Example of an input file. Typically, either a characters block or a distances block will be specified, but not both. The first token in a file must be '#NEXUS' and the first block must be the taxa block. Comments are enclosed in square brackets and all comments between the '#NEXUS' and 'BEGIN taxa' tokens are passed on to the output file by SplitsTree.

An output file typically contains a number of additional blocks that are computed by SplitsTree and are specific to the program. The names of such blocks all have the prefix 'st_'. The st_splits, st_graph and st_assumptions blocks contain the split system, the splits graph and the assumptions made, respectively. More precisely, the latter block describes how the data were processed, e.g. whether sites were excluded, which distance transformation was applied, and which method was used to compute the splits, in other words, which items from the Options and Method menus were in effect.

Additionally, the program will generate an st_spectra block if spectral analysis was used, an st_bootstrap block if bootstrapping was applied, or an st_extras block if one of the additional computations offered by the program was employed. As mentioned above, the program offers an online description of the syntax of all blocks that it understands.

Implementation

This paper describes the interactive Macintosh version of SplitsTree, which is based on a kernel program that is essentially a nexus interpreter that reads nexus blocks from a file or the keyboard and outputs nexus blocks and PostScript. The kernel is written in C++ and thus can be compiled on any computer, and executables are available for a number of different Unix systems. We plan to develop an interactive Windows version in the future.

Example

The splits graph depicted in Figure 1 was obtained by applying the LogDet transformation and split decomposition to all sites in an rDNA data set (indicated in Figure 3). For these data, the splits graph in Figure 1 reveals that a conflicting relationship exists between the cyanobacterium Anacystis and the chloroplasts of Euglena and Olithodiscus. Previous biological studies suggest that the correct split within this unresolved part of the splits graph should actually put Euglena (a chlorophyll a/b-containing plastid) together with the other chlorophyll a/b-containing taxa (rice, tobacco, Marchantia, Chlamydomonas, Chlorella). That is, Euglena is expected to split away from the outgroup Anacystis and Olithodiscus (a chlorophyll a/c-containing plastid). The suggested reason for the conflicting signal is that the rDNA sequences in Euglena and Olithodiscus have independently and convergently acquired similar base compositions [see discussions in Lockhart et al. (1994), Delwiche et al. (1995) and Van de Peer et al. (1996)]. Hence, in this example, the splits graph indicates both the suggested true phylogenetic signal and a spurious one resulting from base composition effects.

Comparison of Figure 1 with Figure 4 reiterates the point made in Lockhart et al. (1994) that the LogDet correction, which can overcome some such base composition problems, will not work when invariable sites are included in sequence analyses. That is, the expected split is only obtained if one removes the invariable sites from the data, i.e. an appropriate number of constant sites (using the Exclude Constant Sites… item) before applying the LogDet transformation (Figure 4 displays the result for 600 constant sites excluded).

Fig. 4. The splits graph obtained from the RNA sequences indicated in Figure 3 using the LogDet transformation and split decomposition with 600 constant sites excluded. It contains a split that clearly separates Euglena from Olithodiscus and Anacystis nidulans, as discussed in the Example section.

In practice, a number of techniques can be used to estimate the proportion of constant sites that should be removed from the data when accommodating position rate heterogeneity (e.g. Lockhart et al., 1996). Note that the removal of invariable positions in sequences can be important before analyses are carried out using both symmetrical (e.g. Jukes Cantor) and asymmetrical (e.g. LogDet) correction formulae (Lockhart et al., 1996).

Acknowledgements

SplitsTree was developed within the framework of a joint co-operation between researchers at Bielefeld University (Germany), Massey University (Palmerston North, New Zealand) and the University of Canterbury (Christchurch, New Zealand) with support from the German Ministry of Science and Technology (BMFT), the New Zealand Marsden Fund and the University of Canterbury. Thanks to the following people for their support and co-operation: Hans-Jürgen Bandelt, Andreas Dress, Mike Hendy, Pete Lockhart, Holger Paschke, Dave Penny, Mike Steel, Udo Tönges and Rainer Wetzel. The Example section of this paper was written with the help of Pete Lockhart, who also suggested many improvements to the program and this paper. The WWW version of the program was produced with the help of Holger Paschke.

References

Adachi,J. and Hasegawa,M. (1995) Improved dating of the human/chimpanzee separation in the mitochondrial DNA tree: heterogeneity among amino acid sites. J. Mol. Evol., 40, 622–628.
Bandelt,H.-J. and Dress,A.W.M. (1992a) A canonical decomposition theory for metrics on a finite set. Adv. Math., 92, 47–105.
Bandelt,H.-J. and Dress,A.W.M. (1992b) Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Mol. Phylogenet. Evol., 1, 242–252.
Bandelt,H.-J. and Dress,A.W.M. (1993) A relational approach to split decomposition. In Opitz,O., Lausen,B. and Klar,R. (eds), Information and Classification. Springer, Berlin, pp. 123–131.
Buneman,P. (1971) The recovery of trees from measures of dissimilarity. In Mathematics and the Archeological and Historical Sciences. Edinburgh University Press, pp. 387–395.
Dayhoff,M.O., Barker,W.C. and Hunt,L.T. (1983) Establishing homologies in protein sequences. Methods Enzymol., 91, 524–545.
Delwiche,C.F., Kushel,M. and Palmer,J.D. (1995) Phylogenetic analysis of tufA sequences indicates a cyanobacterial origin of all plastids. Mol. Phylogenet. Evol., 4, 110–128.
Dopazo,J., Dress,A.W.M. and von Haeseler,A. (1993) Split decomposition: a new technique to analyse viral evolution. Proc. Natl Acad. Sci. USA, 90, 10320–10324.
Dress,A.W.M. and Wetzel,R. (1993) The human organism - a place to thrive for the immuno-deficiency virus. In Proceedings of IFCS. Paris.
Dress,A.W.M., Huson,D.H. and Moulton,V. (1996) Analyzing and visualizing sequence and distance data using splitstree. Discrete Appl. Math., 71, 95–109.
Felsenstein,J. (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39, 783–791.
Fitch,W. (1971) Towards defining the course of evolution: minimum change for a specific tree topology. Syst. Zool., 20, 406–416.
Hendy,M.D. and Penny,D. (1992) Spectral analysis of phylogenetic data. J. Classif., 10, 5–24.
Jukes,T.H. and Cantor,C.R. (1969) Evolution of protein molecules. In Munro,H.N. (ed.), Mammalian Protein Metabolism. Academic Press, New York, pp. 21–132.
Kimura,M. (1981) Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl Acad. Sci. USA, 78, 454–458.
Lockhart,P.J., Steel,M.A., Hendy,M.D. and Penny,D.P. (1994) Recovering an evolutionary tree under a more realistic model of sequence evolution. Mol. Biol. Evol., 11, 605–612.
Lockhart,P.J., Penny,D. and Meyer,A. (1995) Testing the phylogeny of swordtail fishes using split decomposition and spectral analysis. J. Mol. Evol., 41, 666–674.
Lockhart,P.J., Larkum,A.W.D., Steel,M.A., Waddell,P.J. and Penny,D. (1996) Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proc. Natl Acad. Sci. USA, 93, 1930–1934.
Maddison,W.P. and Maddison,D.R. (1989) Interactive analysis of phylogeny and character evolution using the computer program MacClade. Folia Primatol., 53, 190–202.
Maddison,D.R., Swofford,D.L. and Maddison,W.P. (1995) NEXUS: An extendible file format for systematic information. Syst. Biol., in press.
McLenachan,P.A., Lockhart,P.J., Faber,H.R. and Mansfield,B.C. (1996) Evolutionary analysis of the multigene pregnancy specific β1-glycoprotein family: separation of historical and non historical signals. J. Mol. Evol., 42, 273–280.
Moulton,V., Steel,M.A. and Tuffley,C. (1997) Dissimilarity maps and substitution models: some new results. Proceedings of the DIMACS Workshop on Mathematical Hierarchies and Biology. American Mathematical Society, in press.
Nei,M. and Miller,J.C. (1990) A simple method for estimating average number of nucleotide substitutions within and between populations from restriction data. Genetics, 125, 873–879.
Sidow,A., Nguyen,T. and Speed,T.P. (1992) Estimating the fraction of invariable codons with a capture-recapture method. J. Mol. Evol., 35, 253–260.
Steel,M.A. (1994) Recovering a tree from the leaf colorations it generates under a Markov model. Appl. Math. Lett., 7, 19–24.
Swofford,D.L. (1997) PAUP 5.0. Sinauer Associates, Sunderland, MA.
Swofford,D.L., Olsen,G.J., Waddell,P.J. and Hillis,D.M. (1996) Phylogenetic inference. In Hillis,D.M., Moritz,C. and Mable,B.K. (eds), Molecular Systematics, 2nd edn. Sinauer Associates, Sunderland, MA, pp. 407–514.
Van de Peer,Y., Rensing,S.A., Maier,U.G. and De Wachter,R. (1996) Substitution rate calibration of small ribosomal subunit RNA identifies chlorarachniophyte endosymbionts as remnants of green algae. Proc. Natl Acad. Sci. USA, 93, 7732–7736.
Waddell,P.J. (1996) Statistical methods of phylogenetic analysis, including Hadamard conjugations, LogDet transforms, and maximum likelihood. PhD Thesis, Massey University, New Zealand.
Wetzel,R. (1995) Zur Visualisierung abstrakter Ähnlichkeitsbeziehungen. PhD Thesis, University of Bielefeld.

SplitsTree: analyzing and visualizing evolutionary data.

Bioinformatics, Volume 14 (1): 6 – Jan 1, 1998

Publisher
Oxford University Press
Copyright
© Published by Oxford University Press.
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/14.1.68

Abstract

!    ""# BIOINFORMATICS %#&' SplitsTree: analyzing and visualizing evolutionary data interpreted as possible evidence for different and conflicting Abstract phylogenies. Further, as split decomposition does not at- Motivation: Real evolutionary data often contain a number tempt to force data onto a tree, it can provide a good indica- of different and sometimes conflicting phylogenetic signals, tion of how tree-like given data are. and thus do not always clearly support a unique tree. To There exist efficient algorithms for performing split de- address this problem, Bandelt and Dress (Adv. Math., 92, composition (Bandelt and Dress, 1992a) and for computing 47-05, 1992) developed the method of split decomposition. splits graphs (Wetzel, 1995; D.H.Huson, in preparation). For ideal data, this method gives rise to a tree, whereas less Dress and Wetzel produced a simple implementation of split ideal data are represented by a tree-like network that may decomposition (Wetzel, 1995) as an investigative tool to help indicate evidence for different and conflicting phylogenies. develop the general theory. Based on their work, a first public Results: SplitsTree is an interactive program, for analyzing version was developed by Wetzel and Huson (SplitsTree ver- and visualizing evolutionary data, that implements this sion 1). The program described in this paper (SplitsTree ver- approach. It also supports a number of distances transform- sion 2) is a completely new implementation. ations, the computation of parsimony splits, spectral analy- In this paper, we first review the concepts of splits, splits sis and bootstrapping. graphs and the method of split decomposition, and then dis- Availability: There are two versions of SplitsTree: an cuss the SplitsTree program in detail. For a number of bio- interactive Macintosh version (shareware) and a command- logical applications of the split decomposition method, see, line Unix version (public domain). 
Both are available from: ftp://ftp.uni-bielefeld.de/pub/math/splits/splitstree2. There is a WWW version running at: http://www.bibiserv.techfak.uni-bielefeld.de/splits.

Contact: huson@mathematik.uni-bielefeld.de

Introduction

Evolutionary relationships between taxa are most often represented as phylogenetic trees, and many different algorithms for tree construction have been developed (Swofford et al., 1996). This is, of course, justified by the assumption that evolution is a branching or tree-like process. However, a set of real data often contains a number of different and sometimes conflicting signals, and thus does not always clearly support a unique tree.

To address this problem, Bandelt and Dress (1992a) developed the method of split decomposition. In contrast to methods such as maximum parsimony and maximum likelihood that reconstruct phylogenetic trees by optimizing certain parameters, split decomposition is a transformation-based approach. Essentially, evolutionary data are transformed or, more precisely, 'canonically decomposed', into a sum of 'weakly compatible splits' and then represented by a so-called splits graph. For ideal data, this is a tree, whereas less ideal data will give rise to a tree-like network that can be

for example, Bandelt and Dress (1992b), Dopazo et al. (1993), Dress and Wetzel (1993), Lockhart et al. (1995), Wetzel (1995), Dress et al. (1996), McLenachan et al. (1996) or P.J.Lockhart et al. (in preparation).

Splits and splits graphs

Evolutionary relationships are generally represented by a phylogenetic tree, $T$, i.e. a tree whose leaves are labeled by a set $X$ of taxa and whose remaining vertices are unlabeled and of degree at least three. (We only consider unrooted trees in this paper.) Any edge $e$ of $T$ defines a split $S = \{A, \bar{A}\}$ of $X$, i.e. a partition of $X$ into two non-empty sets $A$ and $\bar{A}$, consisting of all taxa on the one side, or the other, of the edge $e$. Such a system $\Sigma$ of splits is called compatible if, for any two splits $S_1 = \{A_1, \bar{A}_1\}$ and $S_2 = \{A_2, \bar{A}_2\}$ in $\Sigma$, one of the four intersections

$$A_1 \cap A_2, \quad A_1 \cap \bar{A}_2, \quad \bar{A}_1 \cap A_2, \quad \text{or} \quad \bar{A}_1 \cap \bar{A}_2$$

is empty. Any phylogenetic tree $T$ gives rise to a compatible split system $\Sigma$. In 1971, Buneman established that, vice versa, any compatible split system $\Sigma$ corresponds to a unique phylogenetic tree $T$. So, tree reconstruction for a given set of taxa $X$ is equivalent to computing a compatible system of splits $\Sigma$ for $X$ and determining a weight for each split $S$ that corresponds to the length of the associated edge.

Hence, to obtain more general graphs, one must consider less restricted systems of splits. Let $X$ be a set of taxa. A system of splits $\Sigma$ of $X$ is called weakly compatible if, for any three splits $S_1, S_2, S_3$ and all $A_i \in S_i$ ($i = 1, 2, 3$), at least one of the four intersections

$$A_1 \cap A_2 \cap A_3, \quad A_1 \cap \bar{A}_2 \cap \bar{A}_3, \quad \bar{A}_1 \cap A_2 \cap \bar{A}_3, \quad \text{or} \quad \bar{A}_1 \cap \bar{A}_2 \cap A_3$$

is empty (Bandelt and Dress, 1992a). So, in particular, any two splits are permitted to be incompatible. Intermediately, $\Sigma$ is called circular if there exists an ordering $x_1, x_2, \ldots, x_m$ of the taxa such that for every split $S \in \Sigma$ there exists $A \in S$ with $A = \{x_{p(S)}, x_{p(S)+1}, \ldots, x_{q(S)}\}$ and $1 \le p(S) \le q(S) \le m$. One can prove that a circular split system is always weakly compatible and a compatible split system is always circular (Bandelt and Dress, 1992a; Wetzel, 1995).

A splits graph representing a weakly compatible split system $\Sigma$ is a graph $G(\Sigma) = (V, E)$ whose vertices $v \in V$ are labeled by the set of taxa $X$ and whose edges $e \in E$ are straight-line segments that represent the splits in $\Sigma$ (see Figure 1). More precisely, each split $S = \{A, \bar{A}\} \in \Sigma$ is represented by a band of parallel edges of equal length in such a way that deleting all edges in such a band partitions the graph into precisely two components: one containing all vertices labeled by taxa in $A$ and the other containing all vertices labeled by taxa in $\bar{A}$. The length of the edges representing a given split $S$ indicates its weight or support and is calculated as the isolation index of $S$. For algorithms that compute splits graphs, see Wetzel (1995) and D.H.Huson (in preparation).

Fig. 1. The splits graph for the distances listed in Figure 3. Each band of parallel edges indicates a split. For example, the two bold lines represent the split {Euglena, Olithodiscus} versus the other taxa. The distance between any two taxa x and y corresponds to the sum of weights of all splits that separate x and y, i.e. the sum of edge lengths of any shortest path from x to y.

Consider a weakly compatible system of splits $\Sigma$ of a set $X$ of taxa. If $\Sigma$ is compatible, then $G(\Sigma)$ is a phylogenetic tree. What is the situation if $\Sigma$ is merely circular? Then $G(\Sigma)$ can be realized as a planar graph (Wetzel, 1995; D.H.Huson, in preparation). Finally, if $\Sigma$ is not circular, then in general $G(\Sigma)$ will not be planar. In biological applications, the arising split systems are often either circular or mildly non-circular.

Split decomposition

Split decomposition is a method for obtaining a system of weakly compatible splits with weights from a given set of evolutionary data. So, assume we are given a set of taxa $X$ and a distance map $d: X \times X \to \mathbb{R}$ on $X$, i.e. a matrix representing the evolutionary distances between pairs of taxa. Bandelt and Dress (1992a) showed that such a distance map $d$ has the following canonical decomposition:

$$d = \sum_{S} \alpha_S \delta_S + d_0$$

Here, we sum over all possible splits $S$ of $X$; the map $\delta_S: X \times X \to \mathbb{R}$ is the split metric on $S$ that equals 1 if $x$ and $y$ lie on different sides of $S$, and 0 otherwise; the number $\alpha_S \ge 0$ is the weight or isolation index of the split $S$; and the map $d_0: X \times X \to \mathbb{R}$ is the so-called split-prime residue and cannot be decomposed further. A split $S$ with $\alpha_S > 0$ is called a d-split, and the system $\Sigma$ of all d-splits is weakly compatible and can be computed efficiently [see Bandelt and Dress (1992a) for details].

If there is no split-prime residue, then the distance between any two taxa $x$ and $y$ is precisely equal to the sum of weights of all d-splits that 'separate' $x$ and $y$, and thus proportional to the sum of all edge lengths along a shortest path from $x$ to $y$ in the splits graph. However, in general, the split-prime residue will be positive and so the sum of weights will only give an approximation (from below) of the original distances. The fit of the approximation is measured by the sum of all approximated distances divided by the sum of all original distances. In biological applications, the fit is often quite high and a small split-prime residue can be considered as 'noise'.

If we are given a set of aligned sequences, then to apply split decomposition we must first compute a distance matrix $d$ using an appropriate distance transformation. Alternatively, one can compute the so-called parsimony splits, or p-splits, directly from the sequences, as described in Bandelt and Dress (1993). Yet another possibility is to use spectral analysis (Hendy and Penny, 1992; M.D.Hendy and P.J.Waddell, in preparation) to assign a weight (the so-called γ-value) to each possible split of $X$. One can then greedily extract a weakly compatible (or compatible) system of splits, i.e. by considering all such splits $S$ in decreasing order of weight and inserting the split $S$ into $\Sigma$ if it is weakly compatible (or compatible) with all splits already in $\Sigma$.

Description of SplitsTree

SplitsTree is an easy-to-use Macintosh application that takes as input a file containing sequences, distances, or a system of splits, and produces as output a weakly compatible system of splits and a splits graph representing the given data. It contains a number of transformations to obtain distances from sequences and methods for obtaining compatible or weakly compatible split systems from distances or sequences.

Menus

SplitsTree offers the following menus: File, Edit, Layout, Options, Method and Window. The File menu contains the usual items for opening, closing, saving and printing documents. The Edit menu contains items for copying and pasting, etc.

The first group of items in the Layout menu can be used to change the position, orientation and size of the displayed splits graph. The Cycle item allows the user to specify the circular order in which the taxa appear around the outside of the splits graph. This feature can be used to produce the same layout for different splits graphs produced from the same data set by different methods. The Vertex Labels and Edge Labels submenus can be used to decide whether the vertices are to be labeled by the names or numbers of the taxa and whether the edges are to be labeled by weight, number or bootstrap support. The Equal Edges and To Scale items determine whether the edges of the displayed splits graph are drawn all with the same length, or in proportion to the isolation index of the corresponding splits.

The Options menu determines how the given data are preprocessed. The Taxa item enables the user to exclude certain taxa from the analysis. Similarly, the Sites item can be used to exclude certain sites and also codon positions. Moreover, items are available for excluding whole groups of sites: Exclude Gaps, Exclude Missing, Exclude Non Parsimony and Exclude Constant…. In the latter case, one can choose to exclude only a proportion of the constant sites, which can be useful in connection, for example, with the LogDet transformation (see Figure 1), as it provides a way of approximating a more continuous distribution for rates across sites (Adachi and Hasegawa, 1995; Waddell, 1996).

The Options menu also offers a number of distance transformations such as Hamming distances, Kimura 3ST (Kimura, 1981), Jukes Cantor (Jukes and Cantor, 1969) and LogDet (Steel, 1994). The Nei Miller item is for computing distances for restriction site data (Nei and Miller, 1990), and the PAM 250 item applies to protein data (Dayhoff et al., 1983). A user-defined weight matrix can be supplied using the User Matrix item. Moreover, there are two items for determining distances between groups of taxa, both suggested by Mike Steel: the Fitch Sidow… item computes the distances between given groups using a combination of methods from Fitch (1971) and Sidow et al. (1992), whereas the Covarion… item is based on Moulton et al. (1997).

Finally, SplitsTree checks for given distance data whether the triangle inequalities hold. If they do not, then the Force Triangle Inequalities item can be used to force them to, i.e. by adding an appropriate offset to all distances.

The Method menu is the most important menu, as it determines which method is applied to produce a split system from the given data. The first group of items all produce weakly compatible split systems. The choices are: Split Decomposition (as described above), Parsimony Splits (Bandelt and Dress, 1993) and Spectral Analysis… (Hendy and Penny, 1992; M.D.Hendy and P.J.Waddell, in preparation, followed by a greedy selection of a weakly compatible split system). The second group of items all produce compatible split systems: Buneman Tree (Buneman, 1971; Bandelt and Dress, 1992a), P-Tree (Bandelt and Dress, 1993) and Spectral Tree (spectral analysis followed by a greedy selection of a compatible split system).

For larger data sets, methods such as split decomposition or computing the 'Buneman tree' tend to produce unresolved split systems. This is because they involve computing the minimum of a certain index over all quartets of taxa that are separated by a given split to determine whether that split should be included in the split system (Bandelt and Dress, 1992a). In an attempt to solve this problem, one can replace the minimum by the average over a given number of quartets with smallest indices to obtain a refined system of splits, as suggested in Moulton et al. (1997). The Refine menu item implements this idea.

The Bootstrap item runs bootstrap sampling from given sequence data (Felsenstein, 1985). This is a way to test the statistical robustness of the computed splits graph. To be precise, the program repeatedly generates new artificial data sets by randomly choosing k (not necessarily distinct) sites in the original data set. The user is prompted to supply the number of times this is done, whereas k usually equals the length of the original sequences. For each such data set, the splits graph is then computed. At the end of this procedure, each split in the original splits graph is labeled by the percentage of computed splits graphs that it occurred in, thus indicating the statistical robustness of each split. The st_bootstrap block contains a full listing of all splits that occurred.

Finally, the Window menu contains a Syntax and Show submenu that can be used to obtain a listing of the syntax or current contents of a selected 'nexus block'. The Get Info item gives general information on the current document. Moreover, the menu contains a list of the currently open windows.

Windows

SplitsTree displays two windows. The SplitsTree Console is used to print messages when reading or computing data. It also accepts typed commands and nexus blocks. Moreover, it is used to present information requested using the menu items described in the preceding paragraph. The second window, called the document window, displays the splits graph computed for the given data set. The bottom of this window contains a line of information on the current data and how they were computed (see Figure 1).

The splits graph displayed in the document window can be manipulated using the mouse. Clicking on an edge will highlight that edge and all other edges representing the same split. Then, grabbing and dragging any other part of the graph will rotate the selected edges and thus reshape the graph, without changing any of the edge lengths. Moreover, the vertex labels can also be grabbed and dragged.

File format

SplitsTree is based on the new nexus format (Maddison et al., 1995), which was originally developed for the programs PAUP (Swofford, 1997) and MacClade (Maddison and Maddison, 1989). Input data are described in the three standard block types: taxa, characters and distances. More precisely, an input file will typically consist of a taxa block listing the names of the given taxa and either a characters block containing a set of, for example, DNA, RNA, protein or RFLP sequences, or a distances block containing a distance or dissimilarity matrix. In Figure 2, we describe the syntax of these blocks and in Figure 3 an example input file is given.

Fig. 2. Syntax of the three main input blocks. In this figure, square brackets indicate optional items and curly brackets indicate a choice of items. The syntax follows the standard definition of these blocks (Maddison et al., 1995), except for the two additional commands marked by a (*). The CHARWEIGHTS item is used to enter weights when specifying RFLP data. The FORCE_METRIC item can be set by the program when the triangle inequalities do not hold and an offset must be added to force them to.

Fig. 3. Example of an input file. Typically, either a characters block or a distances block will be specified, but not both. The first token in a file must be '#NEXUS' and the first block must be the taxa block. Comments are enclosed in square brackets and all comments between the '#NEXUS' and 'BEGIN taxa' tokens are passed on to the output file by SplitsTree.

An output file typically contains a number of additional blocks that are computed by SplitsTree and are specific to the program. The names of such blocks all have the prefix 'st_'. The st_splits, st_graph and st_assumptions blocks contain the split system, the splits graph and the assumptions made, respectively. More precisely, the latter block describes how the data were processed, e.g. whether sites were excluded, which distance transformation was applied, and which method was used to compute the splits, in other words, which items from the Options and Method menus were in effect.

Additionally, the program will generate an st_spectra block if spectral analysis was used, an st_bootstrap block if bootstrapping was applied, or an st_extras block if one of the additional computations offered by the program was employed. As mentioned above, the program offers an on-line description of the syntax of all blocks that it understands.

Implementation

This paper describes the interactive Macintosh version of SplitsTree, which is based on a kernel program that is essentially a nexus interpreter that reads nexus blocks from a file or the keyboard and outputs nexus blocks and PostScript. The kernel is written in C++ and thus can be compiled on any computer, and executables are available for a number of different Unix systems. We plan to develop an interactive Windows version in the future.

Example

The splits graph depicted in Figure 1 was obtained by applying the LogDet transformation and split decomposition to all sites in an rDNA data set (indicated in Figure 3). For these data, the splits graph in Figure 1 reveals that a conflicting relationship exists between the cyanobacterium Anacystis and the chloroplasts of Euglena and Olithodiscus. Previous biological studies suggest that the correct split within this unresolved part of the splits graph should actually put Euglena (a chlorophyll a/b-containing plastid) together with the other chlorophyll a/b-containing taxa (rice, tobacco, Marchantia, Chlamydomonas, Chlorella). That is, Euglena is expected to split away from the outgroup Anacystis and Olithodiscus (a chlorophyll a/c-containing plastid). The suggested reason for the conflicting signal is that the rDNA sequences in Euglena and Olithodiscus have independently and convergently acquired similar base compositions [see discussions in Lockhart et al. (1994), Delwiche et al. (1995) and Van de Peer et al. (1996)]. Hence, in this example, the splits graph indicates both the suggested true phylogenetic signal and a spurious one resulting from base composition effects.

Comparison of Figure 1 with Figure 4 reiterates the point made in Lockhart et al. (1994) that the LogDet correction, which can overcome some such base composition problems, will not work when invariable sites are included in sequence analyses. That is, the expected split is only obtained if one removes the invariable sites from the data, i.e. an appropriate number of constant sites (using the Exclude Constant Sites… item) before applying the LogDet transformation (Figure 4 displays the result for 600 constant sites excluded).

In practice, a number of techniques can be used to estimate the proportion of constant sites that should be removed from the data when accommodating position rate heterogeneity (e.g. Lockhart et al., 1996). Note that the removal of invariable positions in sequences can be important before analyses are carried out using both symmetrical (e.g. Jukes Cantor) and asymmetrical (e.g. LogDet) correction formulae (Lockhart et al., 1996).

Fig. 4. The splits graph obtained from the RNA sequences indicated in Figure 3 using the LogDet transformation and split decomposition with 600 constant sites excluded. It contains a split that clearly separates Euglena from Olithodiscus and Anacystis nidulans, as discussed in the Example section.

Acknowledgements

SplitsTree was developed within the framework of a joint co-operation between researchers at Bielefeld University (Germany), Massey University (Palmerston North, New Zealand) and the University of Canterbury (Christchurch, New Zealand) with support from the German Ministry of Science and Technology (BMFT), the New Zealand Marsden Fund and the University of Canterbury. Thanks to the following people for their support and co-operation: Hans-Jürgen Bandelt, Andreas Dress, Mike Hendy, Pete Lockhart, Holger Paschke, Dave Penny, Mike Steel, Udo Tönges and Rainer Wetzel. The Example section of this paper was written with the help of Pete Lockhart, who also suggested many improvements to the program and this paper. The WWW version of the program was produced with the help of Holger Paschke.

References

Adachi,J. and Hasegawa,M. (1995) Improved dating of the human/chimpanzee separation in the mitochondrial DNA tree: heterogeneity among amino acid sites. J. Mol. Evol., 40, 622–628.

Bandelt,H.-J. and Dress,A.W.M. (1992a) A canonical decomposition theory for metrics on a finite set. Adv. Math., 92, 47–105.

Bandelt,H.-J. and Dress,A.W.M. (1992b) Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Mol. Phylogenet. Evol., 1, 242–252.

Bandelt,H.-J. and Dress,A.W.M. (1993) A relational approach to split decomposition. In Opitz,O., Lausen,B. and Klar,R. (eds), Information and Classification. Springer, Berlin, pp. 123–131.

Buneman,P. (1971) The recovery of trees from measures of dissimilarity. In Mathematics and the Archeological and Historical Sciences. Edinburgh University Press, pp. 387–395.

Dayhoff,M.O., Barker,W.C. and Hunt,L.T. (1983) Establishing homologies in protein sequences. Methods Enzymol., 91, 524–545.

Delwiche,C.F., Kushel,M. and Palmer,J.D. (1995) Phylogenetic analysis of tufA sequences indicates a cyanobacterial origin of all plastids. Mol. Phylogenet. Evol., 4, 110–128.

Dopazo,J., Dress,A.W.M. and von Haeseler,A. (1993) Split decomposition: a new technique to analyse viral evolution. Proc. Natl Acad. Sci. USA, 90, 10320–10324.

Dress,A.W.M. and Wetzel,R. (1993) The human organism - a place to thrive for the immuno-deficiency virus. In Proceedings of IFCS. Paris.

Dress,A.W.M., Huson,D.H. and Moulton,V. (1996) Analyzing and visualizing sequence and distance data using SplitsTree. Discrete Appl. Math., 71, 95–109.

Felsenstein,J. (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39, 783–791.

Fitch,W. (1971) Towards defining the course of evolution: minimum change for a specific tree topology. Syst. Zool., 20, 406–416.

Hendy,M.D. and Penny,D. (1992) Spectral analysis of phylogenetic data. J. Classif., 10, 5–24.

Jukes,T.H. and Cantor,C.R. (1969) Evolution of protein molecules. In Munro,H.N. (ed.), Mammalian Protein Metabolism. Academic Press, New York, pp. 21–132.

Kimura,M. (1981) Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl Acad. Sci. USA, 78, 454–458.

Lockhart,P.J., Steel,M.A., Hendy,M.D. and Penny,D.P. (1994) Recovering an evolutionary tree under a more realistic model of sequence evolution. Mol. Biol. Evol., 11, 605–612.

Lockhart,P.J., Penny,D. and Meyer,A. (1995) Testing the phylogeny of swordtail fishes using split decomposition and spectral analysis. J. Mol. Evol., 41, 666–674.

Lockhart,P.J., Larkum,A.W.D., Steel,M.A., Waddell,P.J. and Penny,D. (1996) Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proc. Natl Acad. Sci. USA, 93, 1930–1934.

Maddison,W.P. and Maddison,D.R. (1989) Interactive analysis of phylogeny and character evolution using the computer program MacClade. Folia Primatol., 53, 190–202.

Maddison,D.R., Swofford,D.L. and Maddison,W.P. (1995) NEXUS: An extendible file format for systematic information. Syst. Biol., in press.

McLenachan,P.A., Lockhart,P.J., Faber,H.R. and Mansfield,B.C. (1996) Evolutionary analysis of the multigene pregnancy specific β1-glycoprotein family: separation of historical and non historical signals. J. Mol. Evol., 42, 273–280.

Moulton,V., Steel,M.A. and Tuffley,C. (1997) Dissimilarity maps and substitution models: some new results. Proceedings of the DIMACS Workshop on Mathematical Hierarchies and Biology. American Mathematical Society, in press.

Nei,M. and Miller,J.C. (1990) A simple method for estimating average number of nucleotide substitutions within and between populations from restriction data. Genetics, 125, 873–879.

Sidow,A., Nguyen,T. and Speed,T.P. (1992) Estimating the fraction of invariable codons with a capture-recapture method. J. Mol. Evol., 35, 253–260.

Swofford,D.L. (1997) PAUP 5.0. Sinauer Associates, Sunderland, MA.

Swofford,D.L., Olsen,G.J., Waddell,P.J. and Hillis,D.M. (1996) Phylogenetic inference. In Hillis,D.M., Moritz,C. and Mable,B.K. (eds), Molecular Systematics, 2nd edn. Sinauer Associates, Sunderland, MA, pp. 407–514.

Steel,M.A. (1994) Recovering a tree from the leaf colorations it generates under a Markov model. Appl. Math. Lett., 7, 19–24.

Van de Peer,Y., Rensing,S.A., Maier,U.G. and De Wachter,R. (1996) Substitution rate calibration of small ribosomal subunit RNA identifies chlorachniophyte endosymbionts as remnants of green algae. Proc. Natl Acad. Sci. USA, 93, 7732–7736.

Waddell,P.J. (1996) Statistical methods of phylogenetic analysis, including Hadamard conjugations, LogDet transforms, and maximum likelihood. PhD Thesis, Massey University, New Zealand.

Wetzel,R. (1995) Zur Visualisierung abstrakter Ähnlichkeitsbeziehungen. PhD Thesis, University of Bielefeld.
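The relation between split weights and distances described in the section 'Split decomposition' (the distance between x and y is approximated by the sum of the weights of all splits separating them, and the fit is the ratio of summed approximated to summed original distances) can be written out as a short Python sketch. The weights and distances below are made up for illustration and are not the output of an actual split decomposition; the function names are assumptions of this sketch.

```python
from itertools import combinations


def separates(side, x, y):
    """True if taxa x and y lie on different sides of the split."""
    return (x in side) != (y in side)


def split_distance(weighted_splits, x, y):
    """Sum of the weights of all splits that separate x and y."""
    return sum(w for side, w in weighted_splits if separates(side, x, y))


def fit(weighted_splits, original, taxa):
    """Sum of approximated distances divided by sum of original distances."""
    pairs = list(combinations(sorted(taxa), 2))
    approximated = sum(split_distance(weighted_splits, x, y) for x, y in pairs)
    return approximated / sum(original[pair] for pair in pairs)


# Hypothetical weighted splits on four taxa, each given by one side.
weighted_splits = [({"a"}, 1.0), ({"b"}, 2.0), ({"c"}, 3.0),
                   ({"d"}, 4.0), ({"a", "b"}, 5.0)]
# Hypothetical original distances, slightly larger for two of the pairs.
original = {("a", "b"): 3.0, ("a", "c"): 10.0, ("a", "d"): 10.0,
            ("b", "c"): 10.0, ("b", "d"): 12.0, ("c", "d"): 7.0}

print(split_distance(weighted_splits, "a", "c"))
print(fit(weighted_splits, original, "abcd"))
```

With these numbers the approximated distances sum to 50 and the original ones to 52, so the fit is a little below 1, which mirrors the statement that the approximation is from below.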
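The greedy selection mentioned for the Spectral Analysis… and Spectral Tree items (take the splits in decreasing order of weight and keep a split if it is weakly compatible, or compatible, with those already kept) can be sketched as follows. This is an illustrative Python sketch under the definitions given in this paper, not the program's actual implementation; `greedy_select` and its `weakly` flag are hypothetical names.

```python
from itertools import combinations, product


def orientations(side, taxa):
    """Both ways of naming the two sides (A, complement of A) of a split."""
    a = frozenset(side)
    b = frozenset(taxa) - a
    return ((a, b), (b, a))


def compatible(s, t, taxa):
    """Pairwise compatibility: one of the four intersections is empty."""
    (a1, b1), (a2, b2) = orientations(s, taxa)[0], orientations(t, taxa)[0]
    return any(not (p & q) for p in (a1, b1) for q in (a2, b2))


def weakly_compatible(s1, s2, s3, taxa):
    """For every choice of sides, one of the four triple intersections is empty."""
    choices = product(*(orientations(s, taxa) for s in (s1, s2, s3)))
    for (a1, b1), (a2, b2), (a3, b3) in choices:
        if all((a1 & a2 & a3, a1 & b2 & b3, b1 & a2 & b3, b1 & b2 & a3)):
            return False  # all four intersections are non-empty
    return True


def greedy_select(weighted_splits, taxa, weakly=True):
    """Keep splits in decreasing order of weight while the system stays valid."""
    chosen = []
    ordered = sorted(weighted_splits, key=lambda sw: sw[1], reverse=True)
    for side, weight in ordered:
        if weakly:
            ok = all(weakly_compatible(side, s, t, taxa)
                     for (s, _), (t, _) in combinations(chosen, 2))
        else:
            ok = all(compatible(side, s, taxa) for s, _ in chosen)
        if ok:
            chosen.append((frozenset(side), weight))
    return chosen


# The three 2-versus-2 splits of four hypothetical taxa, with made-up weights.
candidates = [({"a", "c"}, 1.0), ({"a", "b"}, 3.0), ({"b", "c"}, 2.0)]
print(greedy_select(candidates, "abcd", weakly=True))
print(greedy_select(candidates, "abcd", weakly=False))
```

On this input the weakly compatible selection keeps the two heaviest splits, while the compatible selection keeps only the heaviest one, since any two of these three splits are incompatible.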
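The bootstrap procedure described for the Bootstrap item (resample k sites with replacement, recompute the splits, and report for each original split the percentage of replicates containing it) can be sketched in Python. This is a hedged illustration only: `bootstrap_support` is a hypothetical name, and `site_splits` is a deliberately simple stand-in that reads splits off two-state sites; it is not split decomposition and not the method used by SplitsTree.

```python
import random
from collections import Counter


def site_splits(sequences):
    """Stand-in split method: one split per site that shows exactly two states.

    Each split is returned as the side containing the first taxon.
    """
    names = list(sequences)
    found = set()
    for i in range(len(sequences[names[0]])):
        if len({sequences[n][i] for n in names}) == 2:
            first = sequences[names[0]][i]
            found.add(frozenset(n for n in names if sequences[n][i] == first))
    return found


def bootstrap_support(sequences, compute_splits, replicates, k=None, seed=0):
    """Percentage of resampled data sets in which each original split occurs."""
    rng = random.Random(seed)
    names = list(sequences)
    length = len(sequences[names[0]])
    k = length if k is None else k  # k usually equals the sequence length
    original = compute_splits(sequences)
    counts = Counter()
    for _ in range(replicates):
        # Choose k (not necessarily distinct) sites of the original alignment.
        cols = [rng.randrange(length) for _ in range(k)]
        resampled = {n: "".join(sequences[n][c] for c in cols) for n in names}
        counts.update(set(compute_splits(resampled)))
    return {s: 100.0 * counts[s] / replicates for s in original}


# Hypothetical toy alignment in which every site supports the same split.
alignment = {"a": "AAAA", "b": "AAAA", "c": "CCCC", "d": "CCCC"}
print(bootstrap_support(alignment, site_splits, replicates=20))
```

Passing the split method as a parameter keeps the resampling loop independent of how splits are computed, so the same loop could in principle wrap any of the methods listed in the Method menu.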

Journal

Bioinformatics, Oxford University Press

Published: Jan 1, 1998
