%" $% BIOINFORMATICS ' ()%&' '$( $ ' *, &')#$) % %#&*)' $ $' $ ! (! $ %%" % $ $' $ $ +'( ), % " %'$ $) '*- Abstract Motivation: Complete forward–backward (Baum–Welch) hidden Markov model training cannot take advantage of the linear space, divide-and-conquer sequence alignment algo- rithms because of the examination of all possible paths rather than the single best path. Results: This paper discusses the implementation and performance of checkpoint-based reduced space sequence alignment in the SAM hidden Markov modeling package. Implementation of the checkpoint algorithm reduced mem- ory usage from O(mn) to O(mn) with only a 10% slowdown for small m and n, and vast speed-up for the larger values, such as m = n = 2000, that cause excessive paging on a 96 Mbyte workstation. The results are applicable to other types of dynamic programming. Availability: A World-Wide Web server, as well as informa- tion on obtaining the Sequence Alignment and Modeling (SAM) software suite, can be found at http://www.cse.ucsc. edu/research/compbio/sam.html. Contact: rph@cse.ucsc.edu Fig. 1. Dynamic programming example to find least cost edit of ‘BACKTRACK’ into ‘TRACEBACK’ using Sellers’ evolutionary distance metric (Sellers, 1974). Below the dynamic programming table are two possible alignments and an illustration of the data Introduction dependencies. Dynamic programming forms the core of many sequence c dist (a b ) match analysis methods, including classic methods for sequence– i1,j1 i, j c dist (a ) insert sequence comparison and alignment (Needleman and i1,j i, c min (1) i,j Wunsch, 1970; Smith and Waterman, 1981), as well as more c dist (, b ) delete i,j1 j recent methods such as profiles and linear hidden Markov models (HMMs) (Gribskov et al., 1990; Krogh et al., 1994). The calculation of this recurrence can be arranged in a table The typical process of training an HMM consists of per- of costs for matching each prefix of a to each prefix of b, and forming many complete dynamic programming calculations the selection information from the minimizations can be used on the model and each of the training set sequences. This is to trace back the optimal alignment (Figure 1). repeated many times until the model converges to a statistical The addition of affine gap costs g to the recurrence (Gotoh, representation of the family of sequences. The model may 1982) produces three interleaved recurrences of similar then be used to create a multiple alignment or to find other form. Here, the comparison or alignment can be thought of related sequences. as being in one of three states: in the midst of a sequence of Perhaps the simplest form of the dynamic programming matches, deletions or insertions, where the cost for changing equation is that of edit distance (Wagner and Fischer, 1974; states provides the constant term in the affine gap cost. To Sellers, 1974): emulate the simplest affine gap cost model, many of the g Oxford University Press 401 C.Tarnas and R.Hughey The forward–backward method of HMM training con- siders probabilities rather than scores over all possible states, replacing the minimizations with the addition of probability, and the additions with multiplication of probability (the probabilities are stored as log probabilities to avoid under- flow). Thus, the forward pass becomes: M M MM I IM f (f g f g i,j i1,j1 j i1,j1 j D DM M f g ) P (a , b ) j j i j i1,j1 I M MI I II f (f g f g i,j i1,j j i1,j j D DI I Fig. 2. 
The addition of affine gap costs g to the recurrence (Gotoh, 1982) produces three interleaved recurrences of similar form. Here, the comparison or alignment can be thought of as being in one of three states: in the midst of a sequence of matches, deletions or insertions, where the cost for changing states provides the constant term in the affine gap cost. To emulate the simplest affine gap cost model, many of the g values are zero, indicating, for example, that there is no extra cost in matching one pair of characters and then matching the next pair of characters. In the most general forms of sequence comparison, such as HMMs and generalized profiles (Bucher and Bairoch, 1994; Krogh et al., 1994), all transition or gap costs (g) between the three states (match, insert or delete) and character distance costs are position dependent:

$$ c^{M}_{i,j} = \min\left(c^{M}_{i-1,j-1} + g^{MM}_{j},\; c^{I}_{i-1,j-1} + g^{IM}_{j},\; c^{D}_{i-1,j-1} + g^{DM}_{j}\right) + \mathrm{dist}(a_i, b_j) $$

$$ c^{I}_{i,j} = \min\left(c^{M}_{i-1,j} + g^{MI}_{j},\; c^{I}_{i-1,j} + g^{II}_{j},\; c^{D}_{i-1,j} + g^{DI}_{j}\right) + \mathrm{dist}(a_i, \phi) $$

$$ c^{D}_{i,j} = \min\left(c^{M}_{i,j-1} + g^{MD}_{j},\; c^{I}_{i,j-1} + g^{ID}_{j},\; c^{D}_{i,j-1} + g^{DD}_{j}\right) + \mathrm{dist}(\phi, b_j) $$

The typical linear HMM (Figure 2) is a chain of match (square), insert (diamond) and delete (circle) nodes, with all transitions between nodes and all character costs in the insert and match nodes trained to specific probabilities. When performing dynamic programming between a sequence and an HMM, the HMM will index one dimension, such as the columns, of the matrix. The single best path through an HMM corresponds to a path from the Start state to the End state in which each character of the sequence is related to a successive match or insertion state along that path (delete states indicate that the sequence has no character corresponding to that position in the HMM). If two sequences are aligned to the model, a multiple alignment between those sequences can be inferred from their alignments to the model, though it must be remembered with HMMs that characters modeled by insert states are not aligned between sequences. This HMM multiple alignment algorithm requires time proportional to the number of sequences or, alternatively, the product of the total number of residues and the model length.

[Figure 2. An example of an HMM with two sequences whose characters are generated by the HMM, and the corresponding alignment. Positions modeled by the HMM's match states are indicated with upper case letters, while those modeled by unaligned insertion states are indicated with lower case letters.]

The forward–backward method of HMM training considers probabilities rather than scores over all possible states, replacing the minimizations with additions of probabilities, and the additions with multiplications of probabilities (the probabilities are stored as log probabilities to avoid underflow). Thus, the forward pass becomes:

$$ f^{M}_{i,j} = \left(f^{M}_{i-1,j-1} g^{MM}_{j} + f^{I}_{i-1,j-1} g^{IM}_{j} + f^{D}_{i-1,j-1} g^{DM}_{j}\right) P_{j}(a_i, b_j) $$

$$ f^{I}_{i,j} = \left(f^{M}_{i-1,j} g^{MI}_{j} + f^{I}_{i-1,j} g^{II}_{j} + f^{D}_{i-1,j} g^{DI}_{j}\right) P_{j}(a_i, \phi) $$

$$ f^{D}_{i,j} = \left(f^{M}_{i,j-1} g^{MD}_{j} + f^{I}_{i,j-1} g^{ID}_{j} + f^{D}_{i,j-1} g^{DD}_{j}\right) P_{j}(\phi, b_j) $$

and, considering the various parameters as probabilities, computes

$$ f_{i,j} = P(a_1 \ldots a_i,\, b_1 \ldots b_j \mid \mathrm{HMM}) $$

where, in this simplified version of the HMM notation, the ranges on a and b indicate that the first i characters of a were generated by a chain of j states of the HMM ending at state $b_j$. A similar recurrence, starting at the end of the sequence and the end of the model, is used to compute the backward values

$$ b_{i,j} = P(a_i \ldots a_n,\, b_j \ldots b_m \mid \mathrm{HMM}) $$

indicating that the last n − i + 1 characters of a can be generated by a chain of m − j + 1 states of the model. The forward and the backward values are combined to yield

$$ P(a_i \text{ is generated by state } b_j \mid \mathrm{HMM}) = f_{i,j} \cdot b_{i,j} $$

This value, the probability that a certain character was generated by a certain state of the HMM, is then used to update the probabilities in the HMM, in association with the values for other sequences in the training set and a regularizer or Dirichlet mixture prior (Sjölander et al., 1996). The notation has been simplified; the reader is referred to the literature for a more detailed treatment (Rabiner, 1989; Krogh et al., 1994) and an HMM review (Eddy, 1996).
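Because the forward values are stored as log probabilities, each addition in the recurrences above becomes a log-of-sum operation. The following is a minimal sketch of one match-state update in log space; the names (logsum, LOG_ZERO, the argument layout) are illustrative assumptions rather than SAM's actual interface, and SAM accelerates this operation with a precomputed table rather than calling log1p and exp directly:

```c
#include <math.h>

#define LOG_ZERO (-1.0e30)   /* stand-in for log(0) */

/* log(exp(x) + exp(y)), computed stably. SAM replaces the
 * log1p(exp(.)) term with a table look-up for speed. */
static double logsum(double x, double y)
{
    if (x < y) { double t = x; x = y; y = t; }
    if (y <= LOG_ZERO) return x;
    return x + log1p(exp(y - x));
}

/* One cell of the log-space forward recurrence for the match state:
 * fM, fI, fD are the (i-1, j-1) forward values, gMM/gIM/gDM the log
 * transition probabilities into match node j, and log_p_match the
 * log probability of the character pair (a_i, b_j) at node j. */
static double forward_match_cell(double fM, double fI, double fD,
                                 double gMM, double gIM, double gDM,
                                 double log_p_match)
{
    double s = logsum(fM + gMM, logsum(fI + gIM, fD + gDM));
    return s + log_p_match;
}
```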
The simplest approach to computing these O(nm) dynamic programming recurrences is to create a large, n × m table in memory to store the values. Unfortunately, this table will not fit entirely in a workstation's memory for large model and sequence lengths. For example, if a family of long molecules is to be modeled, say 2000 nucleotides and 2000 model nodes, the dynamic programming matrix will include 12 × 10^6 entries, one for each state at each index point. To provide sufficient precision, in the Sequence Alignment and Modeling (SAM) software suite each of those entries requires four bytes, so 48 × 10^6 bytes are required just for the dynamic programming table. When added to the memory requirements for the rest of the application, not to mention the operating system and other users, this will consume nearly all the memory of a typical workstation with 64 Mbytes of main memory. If the sequence or model length is larger, or the memory available to the program is lower, virtual memory will be used extensively. That is, blocks of memory (or pages, typically 2–8 kilobytes of data each) will be temporarily stored on disk, with the computer's real memory and cache storing only currently active pages. In a current machine, a cache may require one clock cycle to access, main memory 10–20 clock cycles, and the disk over one million cycles (Hennessy and Patterson, 1996). Thus, the cost of paging into virtual memory is high, and can make runs of a given size effectively uncomputable.

The solution is to use a sequence alignment method that requires less space. In the case of finding the single best path, there is an elegant divide-and-conquer algorithm that requires only O(n + m) space, where n is the sequence length and m is the model length (Hirschberg, 1975). The approach of this algorithm is to find a midpoint of the best path without saving all O(nm) dynamic programming entries, and then to solve two smaller problems, each of approximate size nm/4, using the same algorithm. This algorithm is well known in the computational biology community (Myers and Miller, 1988), and has, for example, been implemented in the HMMer package for sequence alignment to a trained HMM (Eddy et al., 1995).

The divide-and-conquer algorithm does not work with forward–backward training. The efficiency of the divide-and-conquer algorithm lies in its partitioning into subproblems, so that while the entire dynamic programming matrix must be evaluated once, recursive calls consider smaller and smaller segments of the matrix. With forward–backward training, all paths through the matrix are an important part of training the HMM, and thus this gain cannot be used.

One answer is a recently introduced checkpointing algorithm (Grice et al., 1997). In the simplest case, diagonals, rows or columns of the dynamic programming matrix are stored at regular intervals to reduce space use to O(m√n) while increasing runtime by a small constant factor. In the experiments discussed below, the new checkpoint-based HMM training requires ~10% more time than standard dynamic programming up to the point of virtual memory paging, at which point the checkpoint algorithm is considerably faster. The reduced space requirements are particularly valuable in multiple-user environments.

System and methods

The work described herein was performed using the Sequence Alignment and Modeling (SAM) HMM software suite (http://www.cse.ucsc.edu/research/compbio/sam.html). Code development and performance analysis were performed on a DEC Alpha 255 with a 233 MHz clock, 96 Mbytes of main memory and a 1 Mbyte external cache. SAM is written in ANSI C and was compiled using the Alpha's native compiler at its highest optimization level.

Algorithm

The family of checkpoint algorithms asymptotically provides, for an arbitrary integer L, a factor of cL slowdown (c < 1) in exchange for reducing memory use from O(mn) to O(mn^(1/L)). A single-best-path variant of this family with L = log(n) matches the quadratic time and linear space of the divide-and-conquer algorithm. For SAM running on a workstation, we decided that the L = 2 algorithm, which reduces space use by the square root of the sequence length, would be sufficient. We describe this variant in more detail below, and refer the reader to the original work for information on and analysis of the complete family of algorithms (Grice et al., 1997).

2-Level checkpoints

Simply stated, the idea behind 2-level checkpoints is to segment the backward computation. For each of the segments to be computed, a single checkpoint of all values along a row, column or diagonal is saved during a global forward calculation. The backward computation now has two parts: recalculating and saving the forward values of that segment using the checkpoint for initial conditions, and then performing the backward dynamic programming on that segment. Thus, with each (i,j) index point in the dynamic programming matrix, there will be an associated global forward calculation, a local forward calculation and a backward calculation.

While the 2-level checkpoint algorithm performs more computation than the standard method (two forward calculations and one backward calculation for each dynamic programming cell), it gains greatly in memory performance. In the simplest case, it requires space for all the checkpoints and for performing the complete forward–backward calculation on one segment. The result is that the 2-level method requires O(m√n) space, where m is the model length and n is the sequence length. For the example above, this will reduce space use from 48 × 10^6 bytes to 1 × 10^6 bytes. If even longer sequences and models are required, an additional level of checkpointing can be added to the algorithm to reduce space use to O(mn^(1/3)), or 300 × 10^3 bytes in the example.
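The 2-level control flow can be sketched as follows. This is a minimal illustration assuming row checkpoints and hypothetical helper routines (forward_row, backward_row_and_update and the two initializers), not SAM's actual interface; the implementation described in this paper checkpoints diagonals, but rows keep the sketch simple, in line with the authors' own hindsight below:

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

#define M 500   /* model length; an illustrative constant, not SAM's */

/* Hypothetical helpers standing in for SAM's inner loops. */
void forward_row(double dst[M], const double prev[M], int i);
void backward_row_and_update(double dst[M], const double next[M],
                             const double fwd[M], int i);
void init_first_forward_row(double dst[M]);
void init_last_backward_row(double dst[M], int n);

/* 2-level checkpointed forward-backward over rows 0..n of the
 * dynamic programming matrix: O(M * sqrt(n)) space, not O(M * n),
 * at the cost of two forward calculations per cell. */
void checkpointed_fb(int n)
{
    int seg  = (int)ceil(sqrt((double)n + 1.0)); /* rows per segment  */
    int nseg = n / seg + 1;                      /* number of segments */
    double (*ckpt)[M]  = malloc((size_t)nseg * sizeof *ckpt);
    double (*local)[M] = malloc((size_t)seg * sizeof *local);
    double row[2][M], bwd_cur[M], bwd_next[M];
    int s, i;

    /* Global forward pass: retain a single row, saving the first
     * row of every segment as a checkpoint. */
    init_first_forward_row(row[0]);
    memcpy(ckpt[0], row[0], sizeof ckpt[0]);
    for (i = 1; i <= n; i++) {
        forward_row(row[i & 1], row[(i - 1) & 1], i);
        if (i % seg == 0)
            memcpy(ckpt[i / seg], row[i & 1], sizeof ckpt[0]);
    }

    /* Backward pass, one segment at a time, last segment first. */
    for (s = nseg - 1; s >= 0; s--) {
        int lo = s * seg;
        int hi = lo + seg - 1 < n ? lo + seg - 1 : n;

        /* Part 1: local forward recomputation from the checkpoint. */
        memcpy(local[0], ckpt[s], sizeof local[0]);
        for (i = lo + 1; i <= hi; i++)
            forward_row(local[i - lo], local[i - lo - 1], i);

        /* Part 2: backward sweep over the segment, combining the
         * forward and backward values at every cell to accumulate
         * the re-estimation counts. */
        for (i = hi; i >= lo; i--) {
            if (i == n)
                init_last_backward_row(bwd_cur, n);
            else
                backward_row_and_update(bwd_cur, bwd_next,
                                        local[i - lo], i);
            memcpy(bwd_next, bwd_cur, sizeof bwd_next);
        }
    }
    free(ckpt);
    free(local);
}
```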
Viterbi checkpoints

The Viterbi algorithm, used to find the single best path through a dynamic programming matrix (i.e. an alignment of a sequence to a model), is, as mentioned, simpler than the forward–backward algorithm. Rather than performing costly exponentiation and logarithm extraction operations on log probabilities (in SAM, these are greatly speeded up using table look-up), addition and minimization can be used for the forward-going part of the Viterbi algorithm, while a simple traceback of the single best path, using selector bits saved from the minimizations, is all that is required to establish the model-to-sequence correspondence.

When applying checkpointing to the Viterbi algorithm, diagonal checkpoints can be used to greatly reduce the amount of recalculation, making use of the same principle introduced in a parallelization of the divide-and-conquer algorithm (Huang, 1989). That is, if the single best path has been traced back to a point (i, j) on the i + j = k diagonal, and the next checkpoint is l diagonals away at i + j = k − l, then only the triangular region bounded by (i − l, j), (i, j − l) and (i, j), containing l²/2 dynamic programming cells, needs to be recalculated to find the best path between the i + j = k and i + j = k − l diagonals. For the Viterbi algorithm with row checkpoints, a larger, l × m (worst case) strip must be recalculated, depending on the model node back to which the path has currently been traced. For the full forward–backward dynamic programming, an entire l × m strip of the matrix is always required.

For these reasons, we used diagonal checkpoints rather than row or column checkpoints. To speed code development, we decided to use diagonal checkpoints for both the Viterbi algorithm and the forward–backward algorithm. With hindsight, the added programming complexity and runtime overhead of the boundary conditions with diagonal checkpoints (especially for the local and semi-local algorithms discussed next), as well as the difficulty of maintaining the Viterbi and forward–backward routines in parallel, made this a poor choice. Our forward–backward and Viterbi procedures would be significantly simpler had we implemented them with row checkpoints. This simplicity, combined with programmer and compiler optimizations, could reduce or eliminate the theoretical advantage of diagonal checkpoints in the case of Viterbi training.
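A minimal sketch of the Viterbi forward-going cell for the match state follows, using additions and minimizations on costs (negative log probabilities) and recording a 2-bit selector for later traceback. The encoding and argument layout are illustrative assumptions, not SAM's layout:

```c
/* Predecessor codes doubling as 2-bit traceback selectors. */
enum { FROM_MATCH = 0, FROM_INSERT = 1, FROM_DELETE = 2 };

/* One Viterbi match-state cell: costs are negative log
 * probabilities, so products become sums and the probability
 * maximization becomes a cost minimization. The chosen
 * predecessor is saved in *sel for the traceback pass. */
static double viterbi_match_cell(double cM, double cI, double cD,
                                 double gMM, double gIM, double gDM,
                                 double match_cost, unsigned char *sel)
{
    double best = cM + gMM;
    *sel = FROM_MATCH;
    if (cI + gIM < best) { best = cI + gIM; *sel = FROM_INSERT; }
    if (cD + gDM < best) { best = cD + gDM; *sel = FROM_DELETE; }
    return best + match_cost;
}
```

Traceback then walks the saved selectors from the End state back to Start; with diagonal checkpoints, only the l²/2-cell triangle between consecutive checkpoints has to be recomputed to regenerate the selectors that were discarded to save memory.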
Local and semi-local checkpointing

A second goal of SAM's inner loop rewrite was to provide local and semi-local scoring, alignment and training. This added even more complexity to the decision to use diagonal rather than row checkpoints. Fully local alignment allows the matching of a subregion of the sequence to a subregion of the model. Because HMMs are a probabilistic model, this is more complicated than, for example, the zero-thresholding used in the Smith and Waterman algorithm (Smith and Waterman, 1981).

For fully local alignment, a SAM model is flanked by two copies of the null model, in SAM called free-insertion modules, or FIMs (Hughey and Krogh, 1996; Barrett et al., 1997). The null model is a simple probabilistic model, such as the background distribution of amino acids, of the universe of sequences not modeled by the HMM. The logarithm of the ratio of the probability of the sequence being generated by the model to that of it being generated by the null model, the log-odds score, indicates how much better (or worse) the structured HMM models the sequence than the simple unstructured null model (Altschul, 1991).

For SAM, a sequence can make use of fully local alignment as follows. The initial segment of the sequence that does not correspond to any part of the HMM will align to the initial FIM. The net effect of this on the log-odds score is zero, because the FIM is a copy of the null model. From the FIM node (thought of as the first column of the dynamic programming matrix if the model nodes are used to index the columns), we allow a jump directly into the delete state of any position within the model. Thus, in the calculation of the $c_{i,j}$ values, an additional term is added to the minimization:

$$ c^{I}_{i,0} + g^{SD}_{j} $$

where $g^{SD}_{j}$ is the gap cost of skipping over an initial segment of the model, from the Start node to the jth delete node.

In fully local alignment, the correspondence between the sequence and the model can also end at any point within the sequence and the model. This corresponds to allowing jumps from the delete state of any position in the model to the final FIM. Much like the initial FIM, the final FIM will then match any remaining characters of the sequence, with a net effect of zero on the final log-odds score. Thus, for the final FIM's delete state in node n (for simplicity, jumps are only allowed into and out of delete states), the term

$$ \min_{0 < k < n} \left( c^{D}_{i,k} + g^{DE}_{k} \right) $$

is combined with the standard minimization for $c_{i,n}$. For SAM, the costs of jumping into the End node, $g^{DE}$, and jumping out of the Start node, $g^{SD}$, are global and default to zero cost, as with standard Smith and Waterman, primarily for compatibility with existing models. In the more general form, these costs would be position dependent (Bucher and Bairoch, 1994), and could also be trained given sufficient data.

Analogous changes are made to the forward–backward calculation.
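As a sketch of how the Start-to-delete jump enters the recurrence, the fully local delete-state cell below adds the extra candidate $c^{I}_{i,0} + g^{SD}$ to the standard three-way minimization, with the jump cost global (and zero by default) as in SAM. The names and signature are illustrative assumptions, not SAM's code:

```c
/* Delete-state cell with the fully local Start->delete jump added:
 * cM, cI, cD are the (i, j-1) values, ci0 is the initial FIM's
 * value c^I_{i,0}, and gSD is the global jump cost (default zero)
 * for skipping model nodes 1..j-1. delete_cost stands in for the
 * position-dependent dist(phi, b_j) term. */
static double local_delete_cell(double cM, double cI, double cD,
                                double ci0, double gSD,
                                double gMD, double gID, double gDD,
                                double delete_cost)
{
    double best = cM + gMD;
    if (cI + gID < best) best = cI + gID;
    if (cD + gDD < best) best = cD + gDD;
    if (ci0 + gSD < best) best = ci0 + gSD;  /* local-alignment jump */
    return best + delete_cost;
}
```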
Semi-local dynamic programming, which allows a complete sequence to match a subsection of the model, is simpler than fully local alignment. Jumps into the model are only allowed for the first character of the sequence, while jumps out of the model are only allowed for the last character of the sequence. Thus, the special boundary conditions of fully local alignment need to be evaluated for only two rows of the dynamic programming matrix, rather than for all rows.

Implementation of the local and semi-local jumps in combination with the diagonal checkpoints proved time consuming. The jumps occur across values in a row of the dynamic programming matrix. Evaluation of the jumps would be simple and regular with row or column checkpoints, but required more complicated indexing with our chosen diagonal checkpoints.

Results

After implementing and optimizing the checkpoint algorithm, we performed several experiments on the host workstation in which the dynamic programming calculation was allowed to use a variable amount of memory. Our initial thought was that we would see different levels of efficiency as the problem size grew first beyond the primary cache size and then, in the case of the original space-inefficient code, beyond the main memory size. This turned out not to be the case: the code is computation bound because of the log-probability manipulation.

Because we did not see great variation in runtime when performing partial checkpointing (e.g. checkpointing only twice to reduce the dynamic programming table's memory requirements to mn/2 entries, and only having to recalculate half of the forward values), the code now always uses full 2-level checkpointing and O(m√n) space.

Performance is summarized in two sets of graphs. Figure 3 shows runtime versus sequence length for performing four dynamic programming calculations between a length 500 model and sequences of various lengths. The upper graph shows central processing unit (CPU) time, while the lower graph shows real, or wall clock, time. We took data for forward–backward training using both full 2-level checkpointing (top curve) and no checkpointing (middle curve), as well as checkpointed and uncheckpointed implementations of the Viterbi best path algorithm.
[Figure 3. CPU time and wall clock time for performing four dynamic programming calculations with a 500-node model and several protein sequence lengths using the forward–backward (F–B) and Viterbi algorithms. As can be seen by comparing the CPU and wall time graphs, above 8000 amino acids the memory requirements cause the 96 Mbyte workstation to page excessively when the standard algorithm is used.]

The checkpointing forward–backward code is ~10% slower than the uncheckpointed version. The Viterbi method is approximately seven times faster than forward–backward because of its simpler computation. The checkpointed and the uncheckpointed Viterbi algorithms perform similarly until paging begins; this could be due to the Viterbi forward calculation being relatively simple, reducing the penalty of recalculation when compared to the overhead of updating an HMM. Above a sequence length of 7000 residues, the 96 Mbytes of main memory on the test machine are exhausted, and the wall clock time of the uncheckpointed code increases to a factor of five at 10 000 residues.

Data for a 2000-node RNA model show similar results (Figure 4), although in this case, because of the larger model, virtual memory degradation begins around a length 2000 sequence (4 × 10^6 dynamic programming cells, or 48 Mbytes, consistent with the protein results). We were unable to obtain uncheckpointed results beyond sequences of length 2000 because of the excessive virtual memory use.

[Figure 4. Wall clock time for performing four dynamic programming calculations with a 2000-node model and several RNA sequence lengths. Virtual memory problems occur at a similar product of model length and sequence length, in comparison to Figure 3.]

Since the checkpoint method was included in SAM, the code has been used to model 31 sequences of ~9500 bases as part of an effort to determine how well HMMs can align RNA sequences (the folding of which is not modeled by a linear HMM), as well as to form a multiple alignment of 60 16S RNA sequences. Neither of these tasks could be completed with the previous version of SAM.

Discussion

The most interesting result of this work is that the checkpointing method, which performs significantly more computation than the simple method, only slightly slowed the program down for problems that fit in memory, even though the vast majority of SAM's computation time is spent in the dynamic programming calculation. This may be due to the greater cache efficiency of the checkpointing algorithm.

Computing the dynamic programming matrix along diagonals has several advantages. Such computation removes all data dependencies from the inner loop of the dynamic programming calculation, which frees the compiler to perform more extensive code reorganization and optimization within the inner loop (see the sketch below). Diagonal computation is also required when performing dynamic programming on a vector or fine-grain parallel processor (Huang, 1989; Hughey, 1996). With hindsight, though, the added complexity of the boundary conditions when processing diagonals was not worth the potential performance gain. To enhance maintainability, we plan to reimplement SAM's inner loop with row checkpoints, learning from the experiences of this work.
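To illustrate the data-dependence point, here is a minimal sketch of an edit-distance computation arranged along anti-diagonals; the buffer layout, unit costs and the N bound are illustrative assumptions, not SAM's code. Every cell on diagonal k depends only on diagonals k − 1 and k − 2, so the inner loop carries no dependencies and is free for the compiler (or a vector processor) to reorder:

```c
#define N 64   /* illustrative bound: requires m <= N */

/* Edit distance computed along anti-diagonals i + j = k.
 * cur, prev and prev2 hold diagonals k, k-1 and k-2; cell (i,j)
 * is stored at index i within its diagonal's buffer. */
int edit_distance_diag(const char *a, int m, const char *b, int n)
{
    static int d[3][N + 1];
    int *prev2 = d[0], *prev = d[1], *cur = d[2];
    int k, i;

    for (k = 0; k <= m + n; k++) {
        int ilo = k <= n ? 0 : k - n;   /* valid i range on diagonal */
        int ihi = k <= m ? k : m;
        for (i = ilo; i <= ihi; i++) {  /* no dependencies in this loop */
            int j = k - i;
            int best = 1 << 29;
            if (i == 0 && j == 0) { cur[i] = 0; continue; }
            if (i > 0 && j > 0) {       /* match/mismatch from k-2 */
                int v = prev2[i - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                if (v < best) best = v;
            }
            if (i > 0) {                /* insert: (i-1, j) on k-1 */
                int v = prev[i - 1] + 1;
                if (v < best) best = v;
            }
            if (j > 0) {                /* delete: (i, j-1) on k-1 */
                int v = prev[i] + 1;
                if (v < best) best = v;
            }
            cur[i] = best;
        }
        { int *t = prev2; prev2 = prev; prev = cur; cur = t; }
    }
    return prev[m];   /* after the final rotation, prev = diagonal m+n */
}
```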
The 2-level checkpointing algorithm is not as memory efficient as the divide-and-conquer approach, requiring O(m√n) space rather than O(n + m) space. It nevertheless has three advantages that make it a strong alternative to the divide-and-conquer approach. First, it can be used with the forward–backward calculation. Second, coding may be somewhat simpler than the divide-and-conquer approach, especially if row or column checkpoints are used. Third, the constant overhead appears to be less than that of the divide-and-conquer approach, perhaps because it is not a fully recursive algorithm.

Acknowledgements

This work was supported in part by National Science Foundation grant DBI-9408579 and its Research Experiences for Undergraduates supplement, as well as by an equipment donation from Digital Equipment Corporation. The original algorithm development was supported in part by NSF grant MIP-9423985. C.T. is currently with Pangea Systems, Inc., 1999 Harrison Street, Suite 1100, Oakland, CA 94612, USA.

References

Altschul,S.F. (1991) Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol., 219, 555–565.
Barrett,C., Hughey,R. and Karplus,K. (1997) Scoring hidden Markov models. Comput. Applic. Biosci., 13, 191–199.
Bucher,P. and Bairoch,A. (1994) A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In Altman,R. et al. (eds), Proceedings of the International Conference on Intelligent Systems for Molecular Biology. AAAI/MIT Press, Menlo Park, CA, pp. 53–61.
Eddy,S. (1996) Hidden Markov models. Curr. Opin. Struct. Biol., 6, 361–365.
Eddy,S., Mitchison,G. and Durbin,R. (1995) Maximum discrimination hidden Markov models of sequence consensus. J. Comput. Biol., 2, 9–23.
Gotoh,O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol., 162, 705–708.
Gribskov,M., Lüthy,R. and Eisenberg,D. (1990) Profile analysis. Methods Enzymol., 183, 146–159.
Grice,J.A., Hughey,R. and Speck,D. (1997) Reduced space sequence alignment. Comput. Applic. Biosci., 13, 45–53.
Hennessy,J.L. and Patterson,D.A. (1996) Computer Architecture: A Quantitative Approach. Morgan Kaufmann, Los Altos, CA.
Hirschberg,D.S. (1975) A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18, 341–343.
Huang,X. (1989) A space-efficient parallel sequence comparison algorithm for a message-passing multiprocessor. Int. J. Parallel Program., 18, 223–239.
Hughey,R. (1996) Parallel sequence comparison and alignment. Comput. Applic. Biosci., 12, 473–479.
Hughey,R. and Krogh,A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Applic. Biosci., 12, 95–107.
Krogh,A., Brown,M., Mian,I.S., Sjölander,K. and Haussler,D. (1994) Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol., 235, 1501–1531.
Myers,E.W. and Miller,W. (1988) Optimal alignments in linear space. Comput. Applic. Biosci., 4, 11–17.
Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol., 48, 443–453.
Rabiner,L.R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257–286.
Sellers,P.H. (1974) On the theory and computation of evolutionary distances. SIAM J. Appl. Math., 26, 787–793.
Sjölander,K., Karplus,K., Brown,M.P., Hughey,R., Krogh,A., Mian,I.S. and Haussler,D. (1996) Dirichlet mixtures: a method for improving detection of weak but significant protein sequence homology. Comput. Applic. Biosci., 12, 327–345.
Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
Wagner,R.A. and Fischer,M.J. (1974) The string-to-string correction problem. J. ACM, 21, 168–173.
Bioinformatics, Oxford University Press. Published: June 1, 1998.