J. Felsenstein (1978). Cases in which Parsimony or Compatibility Methods will be Positively Misleading. Systematic Biology, 27.
S. Altschul, Thomas Madden, A. Schäffer, Jinghui Zhang, Zheng Zhang, W. Miller, D. Lipman (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17).
A. Edwards (1995). Assessing molecular phylogenies. Science, 267.
D. Hillis, J. Huelsenbeck, D. Swofford (1994). Hobgoblin of phylogenetics? Nature, 369.
W. Miller, P. Waterhouse, W. Gerlach (1988). Sequence and organization of barley yellow dwarf virus genomic RNA. Nucleic Acids Research, 16.
D. Higgins, A. Bleasby, R. Fuchs (1992). CLUSTAL V: improved software for multiple sequence alignment. Computer Applications in the Biosciences: CABIOS, 8(2).
S. Jeffery (1979). Evolution of Protein Molecules.
Mark Gibbs, Georg Weiller (1999). Evidence that a plant virus switched hosts to infect a vertebrate and then recombined with a vertebrate-infecting virus. Proceedings of the National Academy of Sciences of the United States of America, 96(14).
M. Siddall (1998). Success of Parsimony in the Four‐Taxon Case: Long‐Branch Repulsion by Likelihood in the Farris Zone. Cladistics, 14.
D. Hillis, J. Bull (1993). An Empirical Test of Bootstrapping as a Method for Assessing Confidence in Phylogenetic Analysis. Systematic Biology, 42.
G. Weiller (1998). Phylogenetic profiles: a graphical method for detecting genetic recombinations in homologous sequences. Molecular Biology and Evolution, 15(3).
R. Lartey, T. Voss, U. Melcher (1996). Tobamovirus evolution: gene overlaps, recombination, and taxonomic implications. Molecular Biology and Evolution, 13(10).
M. Gibbs, J. Cooper (1995). A recombinational event in the history of luteoviruses probably induced by base-pairing between the genomes of two distinct viruses. Virology, 206(2).
A. Rambaut, N. Grassly (1997). Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer Applications in the Biosciences: CABIOS, 13(3).
F. Gao, E. Bailes, D. Robertson, Yalu Chen, C. Rodenburg, S. Michael, L. Cummins, L. Arthur, M. Peeters, G. Shaw, P. Sharp, B. Hahn (1999). Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature, 397.
T. Jukes (1969). CHAPTER 24 – Evolution of Protein Molecules.
J. Hein (1990). Reconstructing evolution of sequences subject to recombination using parsimony. Mathematical Biosciences, 98(2).
S. Sawyer (1989). Statistical tests for detecting gene conversion. Molecular Biology and Evolution, 6(5).
J. Stephens (1985). Statistical methods of DNA sequence analysis: detection of intragenic recombination or gene conversion. Molecular Biology and Evolution, 2(6).
N. Grassly, E. Holmes (1997). A likelihood method for the detection of selection and recombination using nucleotide sequences. Molecular Biology and Evolution, 14(3).
M. Salminen, J. Carr, D. Burke, F. McCutchan (1995). Identification of breakpoints in intergenotypic recombinants of HIV type 1 by bootscanning. AIDS Research and Human Retroviruses, 11(11).
D. Fitch, M. Goodman (1991). Phylogenetic scanning: a computer-assisted algorithm for mapping gene conversions and other recombinational events. Computer Applications in the Biosciences: CABIOS, 7(2).
Vol. 16 no. 7 2000 BIOINFORMATICS Pages 573–582

Sister-Scanning: a Monte Carlo procedure for assessing signals in recombinant sequences

Mark J. Gibbs, John S. Armstrong and Adrian J. Gibbs

Research School of Biological Sciences, The Australian National University, GPO Box 475, Canberra ACT 2601, Australia

Received on January 13, 2000; revised on March 9, 2000; accepted on March 14, 2000

To whom correspondence should be addressed.

© Oxford University Press 2000

Abstract

Motivation: To devise a method that, unlike available methods, directly measures variations in phylogenetic signals in gene sequences that result from recombination, tests the significance of the signal variations and distinguishes misleading signals.

Results: We have developed a method, that we call ‘sister-scanning’, for assessing phylogenetic and compositional signals in the various patterns of identity that occur between four nucleotide sequences. A Monte Carlo randomization is done for all columns (positions) within a window and Z-scores are obtained for four real sequences or three real sequences with an outlier that is also randomized. The usefulness of the approach is demonstrated using tobamovirus and luteovirus sequences. Contradictory phylogenetic signals were distinguished in both datasets, as were regions of sequence that contained no clear signal or potentially misleading signals related to compositional similarities. In the tobamovirus dataset, contradictory phylogenetic signals were separated by coding sequences up to a kilobase long that contained no clear signal. Our re-analysis of this dataset using sister-scanning also yielded the first evidence known to us of an interspecies recombination site within a viral RNA-dependent RNA polymerase gene together with evidence of an unusual pattern of conservation in the three codon positions.

Availability: A program package, SiScan, for use under MS-DOS can be downloaded from http://life.anu.edu.au/ with test data and instructions.

Contact: mgibbs@rsbs.anu.edu.au; johna@rsbs.anu.edu.au; gibbs@rsbs.anu.edu.au

Introduction

We face three main questions when considering evidence of recombination in a set of aligned nucleotide or amino-acid sequences: (Q1) Is there clear evidence of recombination? (Q2) Where were the recombination sites? (Q3) Which sequences evolved through recombination? A series of papers describing methods attests to the difficulty of answering these questions (Stephens, 1985; Sawyer, 1989; Hein, 1990; Fitch and Goodman, 1991; Maynard Smith, 1992; Robertson et al., 1995; Salminen et al., 1995; Grassly and Holmes, 1997; Weiller, 1998; Gao et al., 1999). Most of these methods attempt to answer the first and second questions together and the last question is dealt with subsequently by comparing phylogenetic trees, but it is clear that all three questions are linked. Phylogenetic signals and trees must be assessed to test the evidence of recombination and to identify recombinant sequences, but regions in the alignment with coherent signals need to be identified first and to do this it is necessary to test regions for possible signals. This preliminary analysis is necessary because if a tree or signal is found using a sequence that contains regions that have different phylogenetic histories, then the tree will probably be wrong or the signal will be contaminated (e.g. Gibbs and Weiller, 1999).

When attempting to detect recombination, the problem is usually simplified by examining small sub-sets from the alignment of the sequences of interest. Evidence of recombination may be detected by comparing as few as two or three sequences (Miller et al., 1988; Maynard Smith, 1992), but most methods compare sets of four sequences. Alignments of four sequences can contain informative sites (where two taxa have the same nucleotide and the other two have a different nucleotide), and, by using a known outlier as the fourth sequence, the root of a cluster of three sequences may be located. Hence, comparisons with a fourth sequence may help identify some misleading signals that arise when sequences are evolving at different rates (Felsenstein, 1978; Siddall, 1998). Robertson et al. (1995) used a method in which an alignment of four sequences was split between two windows and for each position of the boundary of the windows, the significance of the distribution of informative sites between the windows was tested. The method was largely successful, but, like several other early methods, it oversimplified the problem by assuming that one phylogenetic signal would be abruptly replaced by another at a point in the alignment (i.e. at the recombination site). This is equivalent to assuming that the sequences consist of discrete, adjacent regions each containing a clear phylogenetic signal. In the case of the method used by Robertson et al. (1995), the adjacent windows represented these discrete regions. This assumption is probably invalid for many sequences as they contain regions where there is no clear phylogenetic signal or where there are misleading signals (see below). Furthermore, recombination sites may be overprinted by other mutations, so that the signals change gradually and it may not be possible to pinpoint recombination sites (e.g. Gibbs and Weiller, 1999). The methods of Gao et al. (1999) and Salminen et al. (1995) partly addressed some of these problems. In the first, evolutionary distances were calculated for pairs of sequences within the window and by sliding the window across the alignment and producing distance values at each position, localized variations in this measure were detected. Regions with no clear signal could be detected, but Gao et al. (1999) did not directly test the significance of the signal variations. In the method of Salminen et al. (1995), bootstrap samples drawn from the sequences in the window were used to infer the support for the three possible resolved trees and these bootstrap values were plotted. A weakness of this method is that bootstrap estimates of support are strongly influenced by the tree-building method and the substitution model used, and even when these parameters are optimized the level of support may be over- or underestimated (Hillis and Bull, 1993).

Here we present a method for directly measuring several kinds of signals within a window, together with a test of the signals that may be used to assess the evidence for recombination and to detect misleading signals arising from localized compositional similarities. To our knowledge, no other methods directly address this last problem. We have analysed simulated sequences, as well as two sets of viral sequences, to demonstrate the power of the method and investigate its properties.

Rationale

Figure 1 depicts evidence of recombination based on three sequences. On one side of a recombination site, across region X, sequences 1 and 2 are sister taxa, but on the other side of the site, across region Y, sequences 2 and 3 are sister taxa. With reference to Figure 1, we rephrase Q1 (above) as follows: (Q4) Do two or more regions in an alignment contain opposing phylogenetic signals and are the signals significant in those regions regardless of any compositional similarities? We define the opposing signals as exclusive sister pairings of taxa (sequences), as shown in Figure 1, and answer the question by testing the significance of the signals found in different regions of the alignment. The tested regions do not have to be adjacent. Signals are defined as patterns of nucleotide identity in the alignment that support the possible sister pairings. The strength of the evidence of recombination is indicated by the significance of the opposing signals, and, where possible, recombination sites are located by delineating the regions with opposing signals. The effects of compositional similarities are tested by replacing one of the real sequences in the alignment with a randomized sequence with the same composition as one or more of the real sequences. Including such a randomized outlier changes the frequency of some of the patterns of identity, and if a signal is diminished when this is done, so that it no longer appears to be significant, then it is likely that the signal is due to a compositional similarity.

Fig. 1. A diagram illustrating an example of recombination. a. The bars represent three hypothetical sequences including a recombinant, and the similarly shaded regions of the bars represent the most closely related regions in the sequences. b. Rooted trees describing the relationships of the hypothetical sequences on either side of a recombination site. Sequence 4 is an outlier used to root the trees.

Algorithm

As with other methods, we define a region using a window that slides over an alignment of four sequences. An outlier sequence should be included to identify the sister pairings, and in our method this may be a real sequence or a locally randomized sequence (see below).

1. Count the number of positions within a window that conform to each of the patterns shown in Table 1. Pass the window over the alignment with a constant step length and make the same calculation at each window position.

2. For each window, sum the counts of positions with patterns where two sequences are identical, e.g. sequences 1 and 2 are identical in patterns 2, 8, 11 and 12 so the counts of positions with these patterns are summed (see Table 2, S4–9).

3. For each window, sum the counts of each kind of informative site (where there are two pairs of identical nucleotides) and the patterns where two sequences are identical and which differ from an informative site by only one nucleotide substitution (quasi-informative sites), e.g. pattern 8 represents an informative site and patterns 2 and 7 could be obtained from pattern 8 by one substitution, so the counts for patterns 2, 7 and 8 are summed (see Table 2, S1–3).

4. For each window, generate four randomized sequences by assigning a new nucleotide at each position chosen at random, without replacement, from the nucleotides that occur at the position in the real sequences (Monte Carlo sampling to produce vertical randomization). Count the various patterns for the randomized sequences as described in steps 1 and 2, and calculate the sums of patterns. Repeat this step 100 times to create a population of scores from randomized sequences.

5. For each window, calculate Z-scores for each of the patterns and sums of patterns.

6. Plot the Z-scores for each position of the window.

We developed two variants of the method. In the first, four real sequences are analysed. In the second, three real sequences are analysed with a fourth sequence generated by randomization. The randomized sequence is generated for each position of the window by randomizing the positions of the nucleotides within the window in just one of the real sequences (horizontal randomization). Thus, the composition of the randomized sequence depends on the composition of a segment of real sequence defined by the window (local composition). We have also included an option in SiScan version 1.01 that allows the fourth sequence to be generated by horizontally randomizing the positions of nucleotides in two of the real sequences within the window and then selecting a nucleotide for each position at random from the two randomized sequences.

Table 1. Definitions of the nucleotide identity patterns used to classify the positions in an alignment. If two or more taxa have the same nucleotide at a position they are linked by equals signs. If the nucleotide belonging to one taxon differs from those of the other taxa, it is preceded by a tilde sign. For informative sites (see text), the taxa with the same nucleotide are grouped and the two pairs are separated by a tilde sign.

Pattern   Nucleotide identity between sequences (taxa) 1, 2, 3 and 4
P1        1 ∼ 2 ∼ 3 ∼ 4
P2        1 = 2 ∼ 3 ∼ 4
P3        1 = 3 ∼ 2 ∼ 4
P4        1 = 4 ∼ 2 ∼ 3
P5        2 = 3 ∼ 1 ∼ 4
P6        2 = 4 ∼ 1 ∼ 3
P7        3 = 4 ∼ 1 ∼ 2
P8        1 = 2 ∼ 3 = 4
P9        1 = 3 ∼ 2 = 4
P10       1 = 4 ∼ 2 = 3
P11       1 = 2 = 3 ∼ 4
P12       1 = 2 = 4 ∼ 3
P13       1 = 3 = 4 ∼ 2
P14       2 = 3 = 4 ∼ 1
P15       1 = 2 = 3 = 4

Table 2. Definitions of sums of patterns. Note, positions with pattern 11 are often excluded from sums 4, 5 and 7 (see text). Total pair-wise identity scores equate to sums S4 to S9.

Sum of patterns   Patterns from Table 1 added to the sum
S1                2, 7, 8
S2                3, 6, 9
S3                4, 5, 10
S4                2, 8, 11, 12 (1 = 2)
S5                3, 9, 11, 13 (1 = 3)
S6                4, 10, 12, 13 (1 = 4)
S7                5, 10, 11, 14 (2 = 3)
S8                6, 9, 12, 14 (2 = 4)
S9                7, 8, 13, 14 (3 = 4)

Test datasets

1. Sets of simulated sequences were generated using Seq-Gen 1.1 (Rambaut and Grassly, 1997). Each set consisted of four sequences 1000–5000 nucleotides long that had ‘evolved’ using the Jukes and Cantor (1969) model of substitution to match a tree with branch lengths which were specified as the probability of a substitution per site. Recombinant sequences were made by splicing simulated sequences in a word processing program.

2. An alignment of nucleotide sequences was assembled to match the amino-acid alignment [used by Gibbs and Cooper (1995)] of the virion protein and read-through protein sequences of cucurbit aphid-borne yellows polerovirus (CABYV), beet western yellows polerovirus (BWYV) and pea enation mosaic enamovirus (PEMV1).

3. The complete genomic sequences of 20 tobamoviruses were aligned. The sequences were separated into five distinct regions: (i) 5′ terminal untranslated regions (UTRs); (ii) 3′ UTRs; (iii) replicase genes, that include the methyltransferase, helicase and polymerase domain-encoding sequences; (iv) movement protein genes and (v) virion protein genes. Untranslated regions were directly aligned using CLUSTALV. Amino-acid sequences translated from the gene sequences were aligned also using CLUSTALV (Higgins et al., 1992) and gaps were added to the nucleotide sequences so that they matched the amino-acid alignments using the program ADDGAPS (kindly supplied by Dr Georg Weiller). Finally, the gapped nucleotide sequences of the UTRs and genes were rejoined to produce the fully aligned genomic sequences.

The tobamovirus and luteovirus sequence datasets described above are available with the SiScan 1.01 package.

Implementation

SiScan 1.01 operates under MS-DOS. It requires data files containing aligned nucleotide sequences in NBRF (NBRF/PIR) format and produces tables of raw counts of patterns, summed counts of patterns, Z-scores for these counts and sums, compositional data for the sequences and counts of excluded positions. These tables are also produced as comma-separated variables in Microsoft Excel format.

Results and discussion

Simulated sequences

Z-scores were calculated using SiScan for a set of simulated sequences that consisted of a recombinant, the two parental sequences from which it was generated and an outlier (Figure 2). Z-scores < −3 were obtained for some patterns because they occurred less frequently than would be expected given the composition of the sequences. Z-scores >3 were obtained for patterns that supported either one of the two opposing phylogenies. These significant scores were obtained from more than 90% of the windows when they were 50 nucleotides long, from more than 50% of the windows when they were 20 nucleotides long and from 2.6% of the windows when they were 5 nucleotides long. No Z-scores >3 were obtained for patterns that supported relationships other than those expected on either side of the recombination point, except when windows 5–10 nucleotides long were used. When such short windows were used, scores supporting the true phylogenies were found in almost twice as many windows as scores supporting alternative relationships. The artificial recombination point was located to within 10 nucleotides, using windows 10 nucleotides long.

Fig. 2. Changes in Z-score (y-axis) for selected patterns and sums of patterns for four simulated sequences plotted against position in the alignment (x-axis). The four sequences included a recombinant (taxon 1) and the two parental sequences from which it was derived (taxa 2 and 3). Z-scores for the complete set of patterns and sums of patterns are shown in the inset, as is the tree used when generating the sequences. Branch lengths equate to the probability of a substitution per site. A window 100 nucleotides long was used with a step length of 100 positions. Z-scores are plotted at the mid point of each window.

SiScan results were compared with those obtained by bootstrapping using non-recombinant simulated sequences. Bootstrap values were calculated from maximum likelihood trees from 100 bootstrap samples using PAUP version 4.0b2 (written by David Swofford) and the Jukes and Cantor model of substitution. A dataset generated using a tree with terminal branches of 0.5 and an internal branch of 0.4 was analysed using a window 200 nucleotides long. Significant Z-scores from patterns supporting the true tree were obtained for 15 out of 25 windows, whereas bootstrap values of >95 supporting the true tree were found for 17 out of 25 windows. This result was not unexpected, as the bootstrap values were obtained from trees inferred using the substitution model that was used to generate the sequences (Hillis et al., 1994; Nei et al., 1995). However, Z-scores >3 were obtained for some windows that yielded bootstrap values of less than 95, suggesting that the randomization test used in SiScan is sometimes more sensitive than the optimal bootstrapping procedure. We confirmed this characteristic using a second simulated dataset (Figure 3). Z-scores >3 were obtained from patterns that supported the true phylogeny from 8 out of 25 windows, but only 3 out of 25 windows yielded bootstrap values >95.

Fig. 3. Z-scores (y-axis) for pattern 2 plotted against bootstrap values (x-axis) for a set of four simulated sequences. The tree used when the sequences were generated is shown (inset). A window 200 nucleotides long was used with a step length of 200 positions.

As expected, pattern 11 (Table 1) predominates in datasets that include three relatively closely related sequences and a distant or randomized outlier. Pattern 11 is included in three of the sums of counts (Table 2: S4, S5 and S7) and so when a distant or randomized outlier is used, these sums of counts may be dominated by pattern 11 counts. For this reason, we included an option in the program for excluding sites with pattern 11 from an analysis. To test for other possible biases of the method, we analysed sets of simulated sequences that contained no phylogenetic signal. These were generated using an unresolved tree with four branches each one unit long. Z-scores were obtained for each pattern and sum of patterns from these sequences from 100 non-overlapping windows 100 nucleotides long. The average scores for these patterns and sums ranged from 0.13 to −0.18 and their standard deviations ranged from 1.14 to 0.92. Only seven Z-scores > 3 and two < −3 were obtained.

Using a Pentium 2, 333 MHz processor, SiScan 1.01 took approximately 6 s to analyse 50 windows when the window length was set at 100 nucleotide positions and a randomized outlier was generated from one of the sequences. Under the same conditions, it took approximately 8 s to analyse 500 windows when the window length was set at 10 nucleotide positions.

Luteovirus sequences

Gibbs and Cooper (1995) found evidence of recombination in the history of three luteoviruses by analysing the phylogenies of their virion and read-through protein amino-acid sequences. They suggested that CABYV evolved from an ancestral recombinant virus that was produced through two recombinational events and that BWYV and PEMV probably belonged to the parental virus lineages.

A plot of Z-scores calculated with SiScan (Figure 4A) confirmed some of the evidence of recombination. The first recombination site found using SiScan, between positions 550 and 600, was also found by Gibbs and Cooper (1995). SiScan plots showed clear support for grouping CABYV with BWYV on one side of the site but for grouping CABYV with PEMV1 on the other side of this site. The plot shows that the affinities of the sequences changed again at about position 925 (Figure 4A) with support for grouping PEMV1 with BWYV on the 3′ side of the site. However, the signal that supports this grouping (at position 1000 in Figure 4A) was found to be sequence independent. A randomized outlier sequence was generated from the CABYV sequence for the comparisons plotted in Figure 4A. When the randomized outlier was instead generated from the BWYV sequence, no Z-scores >3 were found for patterns supporting the grouping of BWYV with PEMV1 (Figure 4B). Thus, we concluded that the signal shown in Figure 4A was due to a local compositional similarity. Signals at the 3′ end of the alignment represented by patterns 3 and 12, and sum 2, also appeared to be due to local compositional similarities. Gibbs and Cooper (1995) found that CABYV and BWYV sequences were more closely related on the 3′ side of the second possible recombination site. The difference between their results and ours can be explained because we removed sites including gaps from the alignment and in doing this, we deleted the positions across this region that carried the phylogenetic signal. The location of the first recombination point in the luteovirus sequences, between positions 550 and 600 (Figures 4A and B), was not defined more accurately using shorter window lengths, but instead this region was found to contain no phylogenetic signal (Figure 4B inset 1). The region is cytosine rich in the three sequences (Figure 4B inset 2), and this strongly conserved cytosine stretch appears to obscure the exact point of recombination.

Fig. 4. Plots of Z-scores (y-axis) for patterns and sums of patterns for a set of luteovirus sequences and randomized outlier sequences. Luteovirus sequences were used in the order: 1, CABYV; 2, BWYV; 3, PEMV. A window 100 nucleotides long was used with a step length of 50 positions for both plots. Positions that included a gap or that conformed to pattern 1 or 15 were excluded from the analysis. The randomized outlier was derived from the CABYV sequence for the calculations plotted in panel A and was derived from the BWYV sequence for the calculations plotted in panel B. Plots for each pattern or sum of patterns are labeled according to Tables 1 and 2. Z-scores for the complete set of patterns and sums, except patterns 1 and 15, are shown in the inset in A. The two alternative trees and the patterns and sums of patterns that support them are also shown. B inset 1: Z-scores for all patterns between positions 300 and 900 in the alignment found using a window of 20 with a step length of 20. A region between positions 550 and 610 (see text) is marked with a black triangle. B inset 2: The percentage nucleotide composition of the three luteovirus sequences also measured using SiScan. The strongest central peaks occur at about position 600 and represent the percentage cytosine in each sequence (see text). B inset 3: Z-scores for pattern 12 and the patterns and scores that measure the relative similarity between the PEMV sequence and the other real sequences (lower curves) for positions 1–700 of the alignment, i.e. patterns 3, 5, 9, 10 and 11 and sums 2, 3, 5 and 7. The randomized outlier was derived from the PEMV sequence for the calculations plotted in this inset.

Pattern 12 occurred more frequently than expected between positions 200 and 600 in the plots (Figure 4A and B). This could be interpreted as evidence of a compositional similarity between the CABYV and BWYV sequences. However, that was not the case, as we discovered when the randomized outlier was generated from the PEMV1 sequence. Pattern 12 was also significantly supported when the PEMV1-derived outlier was used, even though there was no support for grouping the real PEMV1 sequence with the CABYV and BWYV sequences (Figure 4B inset 3). Thus, the effect was independent of sequence and composition. We then examined the actual counts and found that, given the frequency of identities between the real sequences, we would expect on average 6.7 positions in 100 to have pattern 12 over the region after alignment with a completely random sequence. If all four sequences were random we would expect less than 1 position in 100 to have the pattern. In the real alignment, we found that on average 9.0 positions in 100 had pattern 12 across the region. We concluded that when two of the three real sequences are very similar and these sequences are aligned with a randomized outlier, the randomized outlier is likely to match the closely similar sequences at a significant number of sites.

Tobamovirus sequences

Lartey et al. (1996) found evidence of interspecies recombination in the history of tobamoviruses by inferring phylogenies from five sets of aligned amino-acid sequences from the viruses, i.e. those of the methyltransferase, helicase and polymerase domains within the replicase protein and those of the movement and coat proteins. The replicase sequences of odontoglossum ringspot tobamovirus (ORSV) were grouped with those of the tobamoviruses that infect crucifers (subgroup 3), whereas the movement and coat protein sequences of ORSV were grouped with those of the tobamoviruses that infect Solanaceae (subgroup 1). Bootstrap values for some of the trees supported the opposing phylogenies and Lartey et al. (1996) concluded that ORSV probably had a recombinant ancestor with a single recombination site close to the 3′ end of the replicase gene.

Plots of Z-scores calculated with SiScan concurred with the phylogenetic analysis (Figure 5), with a recombination site located between positions 4375 and 5275, which is close to the terminus of the replicase gene. SiScan plots showed, however, that the tobamovirus sequences contained a complex pattern of signals. Regions that contained significant signals were interspersed with ones that contained no clear signal, and some coding stretches with no clear signal were several hundred nucleotides long. A plot of the number of positions where the subgroups 1 and 3 and ORSV sequences were identical (Figure 5 lower curve in bold) showed that some of the low signal regions were relatively strongly conserved (e.g. the region marked ‘c’) whereas others were not (e.g. the region marked ‘v’). Several regions with sequence-independent signals (compositional similarities) were also found.

Fig. 5. Plots of selected Z-scores (y-axis) for an alignment of the complete genomic sequences of a set of four tobamovirus sequences. The tobamovirus sequences used were those of: 1, ORSV; 2, Chinese rape mosaic tobamovirus (CRMV; subgroup 3); 3, tomato mosaic tobamovirus (ToMV; subgroup 1); 4, sunn-hemp mosaic tobamovirus (an outlier). A window 250 nucleotides long was used with a step length of 50 and positions including gaps were excluded from the analysis. A map of the ORSV genome is shown in which blocks represent genes. The methyltransferase encoding, helicase encoding and polymerase encoding domains of the replicase gene are labeled MT, Hel and Pol. The movement protein and virion protein genes are labeled Move and VP. The two alternative trees and the patterns and sums of patterns that support them are also shown. The lower curve shown in bold is the counts of positions per window in which the ORSV, CRMV and ToMV sequences match, i.e. patterns 11 and 15. The scale for these counts is shown on the right-hand side of the figure. Peaks in the various plots are labeled according to the list of patterns shown in Tables 1 and 2. The two peaks marked with asterisks were not found in plots made using a randomized outlier and hence, were assumed to be due to compositional similarities.

The main opposing signals were found regardless of which combination of sequences from subgroups 1 and 3 was used, but other significant signals were also found when some combinations were tested. Perhaps the most interesting of these signals was found using a subset that included the sequences of pepper mild mottle tobamovirus (PMMV; subgroup 1) and turnip vein-clearing tobamovirus (TVCV; subgroup 3). A Z-score plot (Figure 6a) showed that the ORSV sequence grouped with the subgroup 1 virus (PMMV) sequence rather than the subgroup 3 virus (TVCV) sequence at two regions in the replicase gene (the peaks at positions 550 and 4400). Matching significant signals were found in plots made using a randomized outlier and a database search made using the program BLASTN (Altschul et al., 1997) confirmed the affinities detected around position 4400.

Fig. 6. Plots of selected Z-scores (y-axis) for an alignment of the complete genomic sequences of a set of four tobamovirus sequences. The tobamovirus sequences used were those of: 1, ORSV; 2, pepper mild mottle tobamovirus (PMMV; subgroup 1); 3, turnip vein-clearing tobamovirus (TVCV; subgroup 3); 4, sunn-hemp mosaic tobamovirus (an outlier). A window 150 nucleotides long was used with a step length of 50 and positions including gaps were excluded from the analysis. (A) The ORSV genome map is shown, as are the two alternative trees and the patterns and sums of patterns that support them. Peaks in the various plots are labeled according to the list of patterns shown in Tables 1 and 2. The two peaks marked with black triangles are from patterns that support the right-hand tree. The peak marked with an asterisk was not found in plots made using a randomized outlier and hence, it was assumed to be due to some compositional similarity. (B) Plots of Z-scores for sums 4 and 5 for the same alignment but using only the first, second or third codon positions. Peaks that represent signals from codon positions 1 and 3 are labeled. The two peaks where the signal is predominantly in codon position 3 nucleotides are marked with black triangles.

[...] or third positions in the sequence to be counted independently so that the signals present in the nucleotides at different codon positions may be assessed. Figure 6b shows that the signal around position 4400 is largely carried by nucleotides in the third codon position, and the same tendency occurs around position 550. We confirmed this result by examining the alignment and by bootstrapping. One possibility is that the nucleotide sequences at this position have been selected for some property in addition to that of coding for amino acids. Nucleotide sequences that fold into particular structures or that successfully interact with various proteins may have been favoured, and the signal for this second function may be carried in the third codon position because of its amino-acid coding redundancy. Thus, these signals may be related to function rather than phylogeny and they may result from convergence. An analysis of RNA secondary structures may help determine which of the two possible processes, recombination or convergence, is responsible for the aberrant signals, or whether wet bench experiments may be needed. In any case, it appears that SiScan may be used to detect a second class of potentially misleading signals.

Broader applications

Figure 6b shows that the main signal, grouping the ORSV and TVCV sequences, is largely carried by nucleotides in the first codon position. This may be significant because a tree built using the second codon position, which is usually the most conserved, rather than the first codon position, might contain more errors or weak branches. For similar reasons, it may be important to recognize those regions in the tobamovirus sequences that contain no clear signal or that contain misleading signals related to compositional similarities. Other sequences, including those from cellular organisms, must also contain regions that carry such biases or that lack signal. However, the
kind of analysis we have partly automated using SiScan, The signals around positions 550 and 4400 are distinct where the signals in different regions of an alignment are from the main opposing signals suggesting that they assessed and tested independently, is rarely done when were generated by distinct recombinational events and, sequences are used to infer phylogenies. Thus, we propose on either side of these regions, the ORSV sequence is that SiScan may have broader applications. grouped with the TVCV sequence (Figure 6a) suggesting that both aberrant regions were generated by double References cross-over recombinational events. Thus, it appears Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., that our analysis with SiScan provides evidence for as Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI- many as five recombinational events. The fact that the BLAST: a new generation of protein database search programs. additional evidence of recombination was not found with Nucleic Acids Res., 25, 3389–3402. all combinations of subgroups 1 and 3 sequences suggests Fitch,D.H. A. and Goodman,M. (1991) Phylogenetic scanning: a that some of the events occurred after the subgroups computer-assisted algorithm for mapping gene conversions and diversified. other recombinational events. Bioinformatics, 7, 207–215. Further analysis with SiScan suggested, however, that Felsenstein,J. (1978) Cases in which parsimony or compatibility the signals grouping the ORSV and PMMV sequences in methods will be positively misleading. Syst. Zool., 27, 401–410. the replicase gene might be not be due to a common an- Gao,F., Bailes,E., Robertson,D.L., Chen,Y., Rodenburg,C.M., cestry. Our analysis involved the use of an option in SiS- Michael,S.F., Cummins,L.B., Arthur,L.O., Peeters,M., can 1.01 that permits the patterns from all first, second Shaw,G.M., Sharp,P.M. and Hahn,B.H. (1999) Origin of 581 M.J.Gibbs et al. HIV-1 in the chimpanzee Pan troglodytes trogodytes. 
Nature, Maynard Smith,J.M. (1992) Analysing the mosaic structure of 397, 436–441. genes. J. Mol. Evol., 34, 126–129. Gibbs,M.J. and Cooper,J.I. (1995) A recombinational event in the Miller,W.A., Waterhouse,P.M. and Gerlach,W.L. (1988) Sequence history of luteoviruses probably induced by base-pairing between and organisation of barley yellow dwarf virus genomic RNA. the genomes of two distinct viruses. Virology, 206, 1129–1132. Nucleic Acids Res., 16, 6097–6111. Gibbs,M.J. and Weiller,G.F. (1999) Evidence that a plant virus Nei,M., Takezaki,N. and Sitnikova,T. (1995) Assessing molecular switched hosts to infect a vertebrate and then recombined with a phylogenies. Science, 267, 253–255. vertebrate-infecting virus. Proc. Natl. Acad. Sci. USA, 96, 8022– Rambaut,A. and Grassly,N.C. (1997) Seq-Gen: an application for 8027. the Monte Carlo simulation of DNA sequence evolution along Grassly,N.C. and Holmes,E.C. (1997) A likelihood method for phylogenetic trees. CABIOS, 13, 235–238. the detection of selection and recombination using nucleotide Robertson,D.L., Hahn,B.H. and Sharp,P.M. (1995) Recombination sequences. Mol. Biol. Evol., 14, 239–247. in AIDS viruses. J. Mol. Evol., 40, 249–259. Hein,J. (1990) Reconstructing evolution of sequences subject to re- Salminen,M.O., Carr,J.K., Burke,D.S. and McCutchan,F.E. (1995) combination using parsimony. Math. Biosci., 98, 185–200. Identification of breakpoints in intergenotypic recombinants of Higgins,D.G., Bleasby,A.J. and Fuchs,R. (1992) CLUSTAL V: HIV type 1 by bootscanning. AIDS Res. and Human Retro- improved software for multiple sequence alignment. CABIOS, 8, viruses, 11, 1423–1425. 189–191. Sawyer,S.A. (1989) Statistical tests for detecting gene conversion. Hillis,D.M. and Bull,J.J. (1993) An empirical test of bootstrapping Mol. Biol. Evol., 6, 526–538. as a method for assessing confidence in phylogenetic analysis. Siddall,M.E. (1998) Success of parsimony in the four taxon Syst. Biol., 42, 182–192. 
case: long-branch repulsion by likelihood in the Farris Zone. Hillis,D.M., Huelsenbeck,J.P. and Swofford,D.L. (1994) Hobgoblin Cladistics, 14, 209–220. of phylogenetics? Nature, 369, 363–364. Stephens,J.C. (1985) Statistical methods of DNA sequence analysis: Jukes,T.H. and Cantor,C.R. (1969) Evolution of protein molecules. detection of intragenic recombination or gene conversion. Mol. Biol. Evol., 2, 539–556. In Munro,H. (ed.), Mammalian Protein Metabolism Academic Press, pp. 21–132. Weiller,G.F. (1998) Phylogenetic profiles: a graphical method Lartey,R.T., Voss,T.C. and Melcher,U. (1996) Tobamovirus evolu- for detecting recombinations in homologous sequences. Mol. tion: gene overlaps, recombination and taxonomic implications. Biol. Evol., 15, 326–335. Mol. Biol. Evol., 13, 1327–1338.
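
Supplementary illustration. The windowed analysis described for Figure 6 (a window 150 nucleotides long, a step length of 50, and positions including gaps excluded) can be illustrated with a minimal Python sketch that counts identity patterns among four aligned sequences in each window. This is not the SiScan implementation. The function names, the pattern labels (e.g. '0012' for a column in which only the first two sequences match) and the choice to measure the window in alignment columns are assumptions made for illustration. The labels do not correspond to the pattern numbers of Tables 1 and 2, and the sketch stops at raw counts without computing Z-scores.

```python
from collections import Counter


def column_pattern(column):
    """Label which sequences share a residue in one alignment column.

    Hypothetical encoding: each sequence gets the index of the first
    sequence-group it matches, so 'AAGT' -> '0012' and 'ACAT' -> '0102'.
    """
    seen = {}
    return "".join(str(seen.setdefault(base, len(seen))) for base in column)


def window_pattern_counts(seqs, window=150, step=50, gap="-"):
    """Count identity patterns per sliding window over aligned sequences.

    Assumption: the window is measured in alignment columns, and columns
    containing a gap in any sequence are skipped inside each window.
    Returns a list of (window_start, Counter) pairs.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        counts = Counter()
        for col in zip(*(s[start:start + window] for s in seqs)):
            if gap in col:
                continue  # exclude positions that include gaps
            counts[column_pattern(col)] += 1
        results.append((start, counts))
    return results


if __name__ == "__main__":
    # Toy four-sequence alignment; the last sequence plays the outlier.
    toy = ["AAAACCCC", "AAAACCGG", "AAGGCC-C", "TTTTTTTT"]
    for start, counts in window_pattern_counts(toy, window=4, step=2):
        print(start, dict(counts))
```

Restricting the input to every third column, starting at the first, second or third codon position, before calling the function would give per-codon-position counts in the spirit of Figure 6b, assuming the alignment is in frame.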
Bioinformatics – Oxford University Press
Published: Jul 1, 2000