Chuong Do, Mahathi Mahabhashyam, M. Brudno, S. Batzoglou (2005)
ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research, 15(2)
S. Bernhart, I. Hofacker, S. Will, A. Gruber, P. Stadler (2008)
RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9
S. Needleman, C. Wunsch (1970)
A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3)
Karen Wong, M. Suchard, J. Huelsenbeck (2008)
Alignment Uncertainty and Genomic Analysis. Science, 319
Chuong Do, Samuel Gross, S. Batzoglou (2006)
CONTRAlign: Discriminative Training for Protein Sequence Alignment
H. Kiryu, Taishin Kin, K. Asai (2007)
Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics, 23(4)
S. Lindgreen, P. Gardner, A. Krogh (2007)
MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing. Bioinformatics, 23(24)
A. Harmanci, Gaurav Sharma, D. Mathews (2008)
PARTS: Probabilistic Alignment for RNA joinT Secondary structure prediction. Nucleic Acids Research, 36
(2007)
BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm591 Structural bioinformatics
I. Holmes (2005)
Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics, 6
Ye Ding, C. Chan, C. Lawrence (2005)
RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA, 11(8)
S. Moretti, A. Wilm, D. Higgins, I. Xenarios, C. Notredame (2008)
R-Coffee: a web server for accurately aligning noncoding RNA sequences. Nucleic Acids Research, 36
S. Bernhart, I. Hofacker, Peter Stadler (2006)
Local RNA base pairing probabilities in large sequences. Bioinformatics, 22(5)
A. Wilm, D. Higgins, C. Notredame (2008)
R-Coffee: a method for multiple alignment of non-coding RNA. Nucleic Acids Research, 36(9)
J. McCaskill (1990)
The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29
J. Thompson, F. Plewniak, O. Poch (1999)
A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research, 27(13)
R. Durbin, S. Eddy, A. Krogh, G. Mitchison (1998)
Biological Sequence Analysis
Michiaki Hamada, H. Kiryu, Kengo Sato, Toutai Mituyama, K. Asai (2009)
Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics, 25(4)
Chuong Do, Chuan-Sheng Foo, S. Batzoglou (2008)
A max-margin model for efficient simultaneous alignment and folding of RNA sequences. Bioinformatics, 24
Ariel Schwartz, E. Myers, L. Pachter (2005)
Alignment Metric Accuracy. arXiv: Quantitative Methods
Bruno Contreras-Moreira, Pierre-Alain Branger, Julio Collado-Vides
Tfmodeller: Comparative Modelling of Protein–dna Complexes
Ye Ding, C. Chan, C. Lawrence (2006)
Clustering of RNA secondary structures with application to messenger RNAs. Journal of Molecular Biology, 359(3)
K. Katoh, H. Toh (2008)
Improved accuracy of multiple ncRNA alignment by incorporating structural information into a MAFFT-based framework. BMC Bioinformatics, 9
Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%
S. Miyazawa (1995)
A reliable sequence alignment method based on probabilities of residue correspondences. Protein Engineering, 8(10)
Kengo Sato, Michiaki Hamada, K. Asai, Toutai Mituyama (2009)
CentroidFold: a web server for RNA secondary structure prediction. Nucleic Acids Research, 37
D. Mathews (2005)
Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics, 21(10)
Markus Bauer, G. Klau, K. Reinert (2007)
Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. BMC Bioinformatics, 8
B. Webb-Robertson, L. McCue, C. Lawrence (2008)
Measuring Global Credibility with Application to Local Sequence Alignment. PLoS Computational Biology, 4
G. Lunter, A. Rocco, N. Mimouni, A. Heger, Alexandre Caldeira, J. Hein (2008)
Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Research, 18(2)
D. Sankoff (1985)
Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems. SIAM Journal on Applied Mathematics, 45
R. Bradley, L. Pachter, I. Holmes (2008)
Specific alignment of structured RNA: stochastic grammars and sequence annealing. Bioinformatics, 24(23)
D. Mathews, D. Turner (2002)
Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology, 317(2)
Chuong Do, Daniel Woods, S. Batzoglou (2006)
CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14)
Mohammad Anwar, Truong Nguyen, M. Turcotte (2006)
Identification of consensus RNA secondary structures using suffix arrays. BMC Bioinformatics, 7
Michiaki Hamada, Kengo Sato, H. Kiryu, Toutai Mituyama, Kiyoshi Asai (2009)
Predictions of RNA secondary structure by combining homologous sequence information. Bioinformatics, 25
R. Dowell, S. Eddy (2006)
Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 7
Stefan Washietl, I. Hofacker, P. Stadler (2005)
Fast and reliable prediction of noncoding RNAs. Proceedings of the National Academy of Sciences of the United States of America, 102(7)
P. Gardner, A. Wilm, Stefan Washietl (2005)
A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Research, 33
A. Wilm, Indra Mainz, G. Steger (2006)
An enhanced RNA alignment benchmark for sequence alignment programs. Algorithms for Molecular Biology, 1
I. Holmes, R. Durbin (1998)
Dynamic programming alignment accuracy. Journal of Computational Biology, 5(3)
Temple Smith, M. Waterman (1981)
Identification of common molecular subsequences. Journal of Molecular Biology, 147(1)
Usman Roshan, D. Livesay (2006)
Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics, 22(22)
R. Dowell, S. Eddy (2004)
Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5
H. Kiryu, Yasuo Tabei, Taishin Kin, K. Asai (2007)
Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics, 23(13)
R. Bradley, Adam Roberts, M. Smoot, Sudeep Juvekar, Jaeyoung Do, Colin Dewey, I. Holmes, L. Pachter (2009)
Fast Statistical Alignment. PLoS Computational Biology, 5
Yasuo Tabei, H. Kiryu, Taishin Kin, K. Asai (2008)
A fast structural multiple alignment method for long RNA sequences. BMC Bioinformatics, 9
Tu Phuong, Chuong Do, Robert Edgar, S. Batzoglou (2006)
Multiple alignment of protein sequences with repeats and rearrangements. Nucleic Acids Research, 34
S. Seemann, J. Gorodkin, R. Backofen (2008)
Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Research, 36
Luis Carvalho, C. Lawrence (2008)
Centroid estimation in discrete high-dimensional spaces with applications in biology. Proceedings of the National Academy of Sciences, 105
BIOINFORMATICS ORIGINAL PAPER, Vol. 25 no. 24 2009, pages 3236–3243
doi:10.1093/bioinformatics/btp580
Sequence analysis

CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score

Michiaki Hamada (1,2,*), Kengo Sato (2,3), Hisanori Kiryu (4), Toutai Mituyama (2) and Kiyoshi Asai (2,4)
(1) Mizuho Information & Research Institute, Inc, 2–3 Kanda-Nishikicho, Chiyoda-ku, Tokyo 101–8443, (2) Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2–41–6, Aomi, Koto-ku, Tokyo 135–0064, Japan, (3) Biological Informatics Consortium, 2–45 Aomi, Koto-ku, Tokyo 135–8073 and (4) Graduate School of Frontier Sciences, University of Tokyo, 5–1–5 Kashiwanoha, Kashiwa 277–8562, Japan
(*) To whom correspondence should be addressed.
Received on July 26, 2009; revised on September 14, 2009; accepted on September 30, 2009
Advance Access publication October 6, 2009
Associate Editor: Limsoon Wong
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ABSTRACT
Motivation: The importance of accurate and fast predictions of multiple alignments for RNA sequences has increased due to recent findings about functional non-coding RNAs. Recent studies suggest that maximizing the expected accuracy of predictions will be useful for many problems in bioinformatics.
Results: We designed a novel estimator for multiple alignments of structured RNAs, based on maximizing the expected accuracy of predictions. First, we define the maximum expected accuracy (MEA) estimator for pairwise alignment of RNA sequences. This maximizes the expected sum-of-pairs score (SPS) of a predicted alignment under a probability distribution of alignments given by marginalizing the Sankoff model. Then, by approximating the MEA estimator, we obtain an estimator whose time complexity is $O(L^3 + c^2 d L^2)$ where $L$ is the length of input sequences and both $c$ and $d$ are constants independent of $L$. The proposed estimator can handle uncertainty of secondary structures and alignments that are obstacles in Bioinformatics because it considers all the secondary structures and all the pairwise alignments of the input sequences. Moreover, we integrate the probabilistic consistency transformation (PCT) on alignments into the proposed estimator. Computational experiments using six benchmark datasets indicate that the proposed method achieved a favorable SPS and was the fastest of many state-of-the-art tools for multiple alignments of structured RNAs.
Availability: The software called CentroidAlign, which is an implementation of the algorithm in this article, is freely available on our website: http://www.ncrna.org/software/centroidalign/.
Contact: hamada-michiaki@aist.go.jp
Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION
The importance of accurate and fast prediction of multiple alignment for RNA sequences has increased due to recent findings about functional non-coding RNAs. Not only nucleotide sequences but also secondary structures are closely related to the functions of most functional RNAs, so we should consider secondary structures when aligning RNA sequences. Sankoff (1985) proposed structural alignment, in which we must handle alignments of base pairs in secondary structures. The computational complexities of the Sankoff algorithm are $O(L^{3n})$ for time and $O(L^{2n})$ for space, where $L$ is the length of RNA sequences and $n$ is the number of input sequences, and those are too large for practical applications even when we conduct pairwise alignment. Therefore, a number of approximations to the Sankoff algorithm have been proposed (Anwar et al., 2006; Bradley et al., 2008; Dalli et al., 2006; Do et al., 2008; Dowell and Eddy, 2006; Harmanci et al., 2008; Havgaard et al., 2005; Katoh and Toh, 2008; Kiryu et al., 2007a; Mathews, 2005; Mathews and Turner, 2002; Moretti et al., 2008; Tabei et al., 2008; Wilm et al., 2008). Bauer et al. (2007) proposed an integer linear programming (ILP) approach for structural alignments.

On the other hand, recent studies have indicated that maximizing the expected accuracy of predictions is a useful approach to design powerful estimators [maximum expected accuracy (MEA) estimators] for a number of problems appearing in Bioinformatics, including predictions of secondary structures of RNA (Do et al., 2006a; Hamada et al., 2009a, b), predictions of common secondary structures of RNA sequences (Hamada et al., 2009a; Kiryu et al., 2007b; Seemann et al., 2008) and alignments (Bradley et al., 2008, 2009; Do et al., 2005). Fortunately, MEA estimators have the possibility to address another obstacle in many Bioinformatics problems: the uncertainty of the solutions. For example, there are huge numbers of candidates for both secondary structures of an RNA sequence (Carvalho and Lawrence, 2008) and for alignments of biological sequences (Carvalho and Lawrence, 2008; Lunter et al., 2008; Webb-Robertson et al., 2008; Wong et al., 2008), and the probability of the optimal solution is very low (uncertainty). Therefore, it is important to design an estimator for multiple alignments of RNA sequences based on maximizing expected accuracy.

In this article, we propose a novel estimator for multiple alignment of structured RNA sequences, based on maximizing the expected sum-of-pairs score (SPS) (Thompson et al., 1999), which is a widely used accuracy measure for multiple alignments. First, we design the MEA estimator for pairwise alignment of RNA sequences, based on the Sankoff model (Sankoff, 1985). The MEA estimator maximizes the expected SPS under the probability distribution of alignments given by marginalizing the probability distribution of structural alignments in the Sankoff model; the estimator considers all the suboptimal structural alignments of input RNA sequences. Because the MEA estimator entails a huge computational cost, we introduce an approximation by factorizing a probability space, using an idea similar to that in our previous study (Hamada et al., 2009b). The resulting estimator considers all the suboptimal secondary structures and all the suboptimal pairwise alignments of the input sequences. By using the sparsity of the base-pairing and alignment match probability matrices, we reduced the computational cost of the estimator to $O(L^3 + c^2 d L^2)$ where $L$ is the length of the RNA sequence, and $c$ and $d$ are constants which are independent of $L$. Moreover, we integrated the probabilistic consistency transformation (PCT) of the alignment probability matrix (Do et al., 2005) into the proposed estimator. Finally, the extension to multiple alignment is conducted by a progressive alignment algorithm similar to CONTRAlign (Do et al., 2006b). Computational experiments using six datasets indicate that the proposed method achieved a favorable SPS and is the fastest of the state-of-the-art aligners.

2 METHODS

2.1 Preliminaries
First of all, we summarize the notation used in this article. In the following, let $x$ and $x'$ be RNA sequences. The length of RNA sequence $x$ is denoted by $|x|$ and $x_i$ for $1 \le i \le |x|$ indicates the $i$-th base in $x$.

2.1.1 Binary discrete spaces
(1) $S(x)$ $(\subset \{\theta_{ij} \in \{0,1\} \mid 1 \le i < j \le |x|\})$ is the space of secondary structures of $x$, where $\theta_{ij}=1$ (respectively $\theta_{ij}=0$) for $\theta \in S(x)$ means that $x_i$ and $x_j$ form (respectively do not form) base pairs. In this study, no pseudo-knotted base pairs are allowed in a secondary structure.
(2) $A(x,x')$ $(\subset \{\theta_{uv} \in \{0,1\} \mid 1 \le u \le |x|, 1 \le v \le |x'|\})$ is the space of pairwise alignments of $x$ and $x'$, where $\theta_{uv}=1$ (respectively $\theta_{uv}=0$) for $\theta \in A(x,x')$ means that $x_u$ aligns (respectively does not align) with $x'_v$.
(3) $SA(x,x')$ $(\subset \{(\theta^{p}_{ijkl}, \theta^{l}_{uv}) \in \{0,1\}\times\{0,1\} \mid 1 \le i < j \le |x|, 1 \le k < l \le |x'|, 1 \le u \le |x|, 1 \le v \le |x'|\})$ is a space of structural alignments of $x$ and $x'$, where $\theta^{p}_{ijkl}=1$ (respectively $\theta^{p}_{ijkl}=0$) means that the base pair $(x_i,x_j)$ in $x$ aligns (respectively does not align) with the base pair $(x'_k,x'_l)$ in $x'$, and $\theta^{l}_{uv}=1$ (respectively $\theta^{l}_{uv}=0$) means that the base $x_u$ in a loop region of $x$ aligns (respectively does not align) with the base $x'_v$ in a loop region of $x'$. Note that $\theta \in SA(x,x')$ satisfies a number of constraints for a consistent structural alignment.

2.1.2 Probability distributions
(1) $p^{(s)}(\cdot|x)$ is a probability distribution on $S(x)$, which is given by the McCaskill model (McCaskill, 1990), the CONTRAfold model (Do et al., 2006a) or the SCFG model (Dowell and Eddy, 2004).
(2) $p^{(a)}(\cdot|x,x')$ is a probability distribution on $A(x,x')$, which is given by the ProbCons model (Do et al., 2005), the Miyazawa model (Miyazawa, 1995), the Probalign model (Roshan and Livesay, 2006) or the CONTRAlign model (Do et al., 2006b).
(3) $p^{(sa)}(\cdot|x,x')$ is a probability distribution on $SA(x,x')$, which is given by the pair SCFG model (Holmes, 2005) or the Sankoff model (Sankoff, 1985).

2.1.3 Posterior probabilities
(1) The base-pairing probability matrix of $x$, $\{p^{(s,x)}_{ij}\}_{i<j}$, has entries $p^{(s,x)}_{ij} = \sum_{\theta \in S(x)} \theta_{ij}\, p^{(s)}(\theta|x)$. The computational costs for computing a base-pairing probability matrix using the inside–outside algorithm are $O(|x|^3)$ and $O(|x|^2)$ for time and space, respectively (e.g. see McCaskill, 1990).
(2) $\{q^{(s,x)}_{u}\}_{u}$ are called the loop probabilities of $x$, where $q^{(s,x)}_{u} = 1 - \sum_{i:i<u} p^{(s,x)}_{iu} - \sum_{j:u<j} p^{(s,x)}_{uj}$.
(3) $\{p^{(a,x,x')}_{uv}\}_{u,v}$ is called an alignment match probability matrix of $x$ and $x'$, where $p^{(a,x,x')}_{uv} = \sum_{\theta \in A(x,x')} \theta_{uv}\, p^{(a)}(\theta|x,x')$. Both time and space complexities for computing the matrix using the forward–backward algorithm are $O(|x||x'|)$ (Durbin et al., 1998).
(4) $\{p^{(sa,x,x')}_{ijkl}\}_{i<j,k<l}$ and $\{p^{(sa,x,x')}_{uv}\}_{u,v}$ are called the alignment match probability matrices of a structural alignment of $x$ and $x'$, where $p^{(sa,x,x')}_{ijkl} = \sum_{\theta \in SA(x,x')} \theta^{p}_{ijkl}\, p^{(sa)}(\theta|x,x')$ (i.e. the probability that the base pair $(x_i,x_j)$ aligns with the base pair $(x'_k,x'_l)$) and $p^{(sa,x,x')}_{uv} = \sum_{\theta \in SA(x,x')} \theta^{l}_{uv}\, p^{(sa)}(\theta|x,x')$ (i.e. the probability that $x_u$ aligns with $x'_v$ in the loop region). The time and space complexities for computing the matrices using a variant of the inside–outside algorithm are $O(|x|^3|x'|^3)$ and $O(|x|^2|x'|^2)$, respectively.

2.2 Designing estimators for pairwise alignments of RNA sequences
A number of existing successful methods use the following estimators (e.g. Carvalho and Lawrence, 2008; Ding et al., 2005; Do et al., 2006a; Hamada et al., 2009a, b; Kiryu et al., 2007b):
$$\hat{y} = \arg\max_{y \in Y} E_{\theta|d}[G(\theta,y)] = \arg\max_{y \in Y} \sum_{\theta \in \Theta} G(\theta,y)\, p(\theta|d) \qquad (1)$$
where $Y$ is a space from which we would like to obtain a prediction, referred to as a predictive space, $\Theta$ is the parameter space and is potentially different from the predictive space, $p(\theta|d)$ is a probability distribution on the parameter space $\Theta$ given a dataset $d$ and $G(\theta,y)$ is a gain function relating $\Theta$ and $Y$ according to a measure of the accuracy of predictions. We can design various estimators by defining the gain function and the parameter space. For example, the γ-centroid estimator (for secondary structure prediction) proposed in Hamada et al. (2009a) is represented by (1) as follows: $d = x$ where $x$ is an RNA sequence; $Y = \Theta = S(x)$; $G(\theta,y) = \alpha_1 \mathrm{TP} + \alpha_2 \mathrm{TN} - \alpha_3 \mathrm{FP} - \alpha_4 \mathrm{FN}$ ($\alpha_n > 0$, $n = 1,2,3,4$), where TP (respectively TN, FP, FN) is the number of true positive (respectively true negative, false positive, false negative) base pairs when we consider $y$ as the predicted secondary structure and $\theta$ as a reference secondary structure; and $p(\theta|d) = p^{(s)}(\theta|x)$. Hamada et al. (2009a) proved theoretically that the γ-centroid estimator includes the centroid estimator (Carvalho and Lawrence, 2008) used in Sfold (Ding et al., 2006) as a special case and is superior to the MEA estimator used in CONTRAfold (Do et al., 2006a). Note that Hamada et al. (2009b) recently extended this estimator to secondary structure prediction using homologous sequences.

In this study, our strategy for designing an estimator for pairwise alignment of RNA sequences is to use $A(x,x')$ for the predictive space $Y$, that is, a predicted alignment is not a structural alignment but a sequential alignment. However, we implicitly consider structural alignments by using structural alignment in the parameter space $\Theta$. In the following sections, we introduce two estimators: the MEA estimator that maximizes the expected SPS, and an approximation to the MEA estimator, based on an idea used in Hamada et al. (2009b).

2.2.1 MEA estimator (maximizing expected SPS)
In order to consider all the suboptimal structural alignments between $x$ and $x'$, the parameter space is defined by $\Theta^{(mea)} = SA(x,x')$ and the probability distribution on the parameter space is defined by $p^{(mea)}(\theta|x,x') = p^{(sa)}(\theta|x,x')$ where $p^{(sa)}$ is given by the Sankoff model (Sankoff, 1985).

For $\theta = (\theta^{p},\theta^{l}) \in SA(x,x')$ and positions $u$ in $x$ and $v$ in $x'$, the indicator function $R_{uv}(\theta) := \sum_{j:u<j,\,l:v<l} \theta^{p}_{ujvl}$ is equal to 1 only when there exist bases $x_j$ and $x'_l$ where the base pair $(x_u,x_j)$ aligns with the base pair $(x'_v,x'_l)$ (the left-side of Fig. 1A), and $L_{uv}(\theta) := \sum_{i:i<u,\,k:k<v} \theta^{p}_{iukv}$ is equal to 1 only when there exist bases $x_i$ and $x'_k$ where the base pair $(x_i,x_u)$ aligns with the base pair $(x'_k,x'_v)$. On the other hand, $\theta^{l}_{uv}$ is equal to 1 only when $x_u$ aligns with $x'_v$ as a part of a loop region (the left-side of Fig. 1B). Then, the gain function is defined by
$$G^{(mea)}(\theta,y) = \sum_{1 \le u \le |x|} \sum_{1 \le v \le |x'|} G^{(mea)}_{uv}(\theta,y),$$
$$G^{(mea)}_{uv}(\theta,y) = \gamma\, y_{uv}\left[R_{uv}(\theta) + L_{uv}(\theta) + \theta^{l}_{uv}\right] + (1 - y_{uv})\left[1 - R_{uv}(\theta) - L_{uv}(\theta) - \theta^{l}_{uv}\right]$$
for a prediction $y = \{y_{uv}\} \in A(x,x')$, where $\gamma > 0$ is a parameter which adjusts the balance of sensitivity (SEN) and positive predictive value (PPV) of a predicted pairwise alignment. Note that $R_{uv}(\theta) + L_{uv}(\theta) + \theta^{l}_{uv}$ takes a value in $\{0,1\}$, and $x_u$ and $x'_v$ are aligned with each other in base pairs or a loop region if and only if the value is equal to 1.

Fig. 1. Illustration of the approximations of the gain function in the proposed method. (A) $x_u$ and $x'_v$ are aligned in base pairs and (B) $x_u$ and $x'_v$ are aligned in loop region.

Finally, we obtain an estimator which maximizes the expectation of the gain function $G^{(mea)}(\theta,y)$ on the probability distribution $p^{(mea)}(\theta|x,x')$:
$$\hat{y} = \arg\max_{y \in A(x,x')} \sum_{\theta \in \Theta^{(mea)}} G^{(mea)}(\theta,y)\, p^{(mea)}(\theta|x,x'). \qquad (2)$$
It is clear that this estimator is equivalent to the γ-centroid estimator (Hamada et al., 2009a) on $A(x,x')$ when the probability distribution is defined by the marginalized distribution
$$p(\theta|x,x') = \sum_{\theta' \in \pi^{-1}(\theta)} p^{(sa)}(\theta'|x,x') \qquad (3)$$
where $\pi$ is the projection map from a structural alignment $\theta' \in SA(x,x')$ to an alignment $\theta \in A(x,x')$ (see Section A.2 in the Supplementary Material). In other words, $p(\theta|x,x')$ is a probability distribution on $A(x,x')$ obtained by marginalizing $p^{(sa)}$ into the space $A(x,x')$. Therefore, the MEA estimator considers all the suboptimal structural alignments of the RNA sequences, and has the following useful property.

Property 1. When $\gamma \to \infty$, the estimator (2) maximizes the expectation of the SPS for the pairwise alignment of $x$ and $x'$ under the probability distribution (3).

The predicted alignment of the MEA estimator can be computed by a Needleman–Wunsch (Needleman and Wunsch, 1970) or a Holmes (Holmes and Durbin, 1998) type dynamic programming (DP) algorithm whose recursive equation is
$$M_{u,v} = \max \begin{cases} M_{u-1,v-1} + (\gamma+1)\, p^{(mea)}_{uv} - 1 \\ M_{u-1,v} \\ M_{u,v-1} \end{cases} \qquad (4)$$
where
$$p^{(mea)}_{uv} = \sum_{j:u<j,\,l:v<l} p^{(sa,x,x')}_{ujvl} + \sum_{i:i<u,\,k:k<v} p^{(sa,x,x')}_{iukv} + p^{(sa,x,x')}_{uv}$$
and $M_{u,v}$ is the optimal score of an alignment of the subsequences $x_1 \ldots x_u$ and $x'_1 \ldots x'_v$. If we change $p^{(mea)}_{uv}$ to $p^{(a,x,x')}_{uv}$ in Equation (4), the algorithm is equivalent to γ-centroid alignment (Hamada et al., 2009a); if we change $(\gamma+1)\,p^{(mea)}_{uv} - 1$ to $p^{(a,x,x')}_{uv}$ in Equation (4), the result is equivalent to the estimator proposed by Holmes and Durbin (1998). Given $\{p^{(mea)}_{uv}\}_{u,v}$, this DP algorithm requires $O(L^2)$ time for calculating the optimal alignment, where $|x|,|x'| \approx L$. The estimator (2) is similar to the alignment metric accuracy (AMA) estimator for the structural alignment of RNA sequences (Bradley et al., 2008), which maximizes the expected AMA score (Schwartz et al., 2005) under the probability distribution (3). The relation between the AMA estimator and the estimator in this section is shown in Section A.3 in the Supplementary Material.

Although the MEA estimator has the theoretically desirable properties described above, it comes with a huge computational cost of $O(L^6)$ for calculating alignments because we must compute the alignment probability matrices $\{p^{(sa,x,x')}_{ijkl}\}$ and $\{p^{(sa,x,x')}_{uv}\}$. This computational cost is too large. So, in the following sections, we obtain our proposed method by approximating the estimator.

2.2.2 Proposed estimator
Taking into account the space $SA(x,x')$ and the probability distribution $p^{(sa)}$ on the space entails a huge computational cost, as noted in the previous section. Therefore, we factorize the parameter space $\Theta^{(mea)}$ and the probability distribution $p^{(mea)}$ into
$$\Theta^{(prop)} = A(x,x') \times S(x) \times S(x')$$
and
$$p^{(prop)}(\theta|x,x') = p^{(a)}(\theta^{(a,x,x')}|x,x')\; p^{(s)}(\theta^{(s,x)}|x)\; p^{(s)}(\theta^{(s,x')}|x'),$$
respectively, where $\theta = (\theta^{(a,x,x')}, \theta^{(s,x)}, \theta^{(s,x')}) \in \Theta^{(prop)}$ with $\theta^{(a,x,x')} \in A(x,x')$, $\theta^{(s,x)} \in S(x)$ and $\theta^{(s,x')} \in S(x')$.

In the following, we use the notation
$$\tilde{R}_{uv}(\theta) := \sum_{j:u<j,\,l:v<l} \theta^{(s,x)}_{uj}\, \theta^{(s,x')}_{vl}\, \theta^{(a,x,x')}_{jl}$$
and
$$\tilde{L}_{uv}(\theta) := \sum_{i:i<u,\,k:k<v} \theta^{(s,x)}_{iu}\, \theta^{(s,x')}_{kv}\, \theta^{(a,x,x')}_{ik}.$$
Either of these is equal to 1 when $x_u$ forms a base pair with $x_j$, $x'_v$ forms a base pair with $x'_l$ and $x_j$ aligns with $x'_l$ in $\theta$ (the right-side configuration of Fig. 1A). We also define $\eta^{(x)}_{u} := \prod_{j:u<j} (1 - \theta^{(s,x)}_{uj}) \prod_{j:j<u} (1 - \theta^{(s,x)}_{ju})$, which is equal to 1 when $x_u$ is part of a loop region in $\theta^{(s,x)} \in S(x)$ (the right configuration of the sequence $x$ of Fig. 1B). We consider the following approximation to the elements in the gain function $G^{(mea)}$:

[1] $R_{uv}(\theta) + L_{uv}(\theta) \approx \frac{w_1}{2}\, \theta^{(a,x,x')}_{uv} + w_2 \left[\tilde{R}_{uv}(\theta) + \tilde{L}_{uv}(\theta)\right]$ and
[2] $\theta^{l}_{uv} \approx \frac{w_1}{2}\, \theta^{(a,x,x')}_{uv} + w_3\, \eta^{(x)}_{u}\, \eta^{(x')}_{v}$

where $w_1$, $w_2$ and $w_3$ are positive weights which satisfy $w_1 + w_2 + w_3 = 1$. See Figure 1 for an illustration of the approximations. Then, the new gain function $G^{(prop)}$ is given by
$$G^{(prop)}(\theta,y) = \sum_{1 \le u \le |x|} \sum_{1 \le v \le |x'|} G^{(prop)}_{uv}(\theta,y),$$
$$G^{(prop)}_{uv}(\theta,y) = \gamma\, y_{uv}\, \delta_{uv} + (1 - y_{uv})(1 - \delta_{uv})$$
where
$$\delta_{uv} = w_1\, \theta^{(a,x,x')}_{uv} + w_2 \left[\tilde{R}_{uv}(\theta) + \tilde{L}_{uv}(\theta)\right] + w_3\, \eta^{(x)}_{u}\, \eta^{(x')}_{v}.$$
Finally, we introduce the estimator that maximizes the expectation of $G^{(prop)}(\theta,y)$ under the probability distribution $p^{(prop)}(\theta|x,x')$. By definition, the proposed estimator uses all the suboptimal secondary structures of $x$ and $x'$ and all the pairwise alignments between $x$ and $x'$.
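As a hedged side calculation (the symbol $\bar{\delta}_{uv}$ is notation introduced here for illustration, not taken from the article), the expectation of the per-position gain under the factorized distribution can be written out as follows. It uses only the independence of the three factors of $p^{(prop)}$ and the fact that a base takes part in at most one base pair, so that the expectation of $\eta^{(x)}_{u}$ is the loop probability $q^{(s,x)}_{u}$.

```latex
\begin{align*}
\bar{\delta}_{uv} := E_{\theta}[\delta_{uv}]
  &= w_1\, p^{(a,x,x')}_{uv}
   + w_2 \Big[ \sum_{j:u<j,\; l:v<l} p^{(s,x)}_{uj}\, p^{(s,x')}_{vl}\, p^{(a,x,x')}_{jl}
             + \sum_{i:i<u,\; k:k<v} p^{(s,x)}_{iu}\, p^{(s,x')}_{kv}\, p^{(a,x,x')}_{ik} \Big]
   + w_3\, q^{(s,x)}_{u}\, q^{(s,x')}_{v}, \\
E_{\theta}\big[G^{(prop)}_{uv}(\theta,y)\big]
  &= \gamma\, y_{uv}\, \bar{\delta}_{uv} + (1 - y_{uv})(1 - \bar{\delta}_{uv})
   = y_{uv}\big[(\gamma + 1)\, \bar{\delta}_{uv} - 1\big] + (1 - \bar{\delta}_{uv}).
\end{align*}
```

The last term does not depend on $y$, so maximizing the expected gain amounts to choosing aligned pairs that collect the score $(\gamma+1)\,\bar{\delta}_{uv} - 1$, which has the same form as the match term of the DP recursion in Equation (4).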
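The DP recursion of Equation (4) can be sketched in a few lines. This is a minimal illustration, not the CentroidAlign implementation: the function name `mea_align` and the plain list-of-lists input are assumptions made here, and the match probability matrix is taken as given (it could hold any of the match probabilities discussed in this section).

```python
def mea_align(p, gamma):
    """Needleman-Wunsch-style DP for a recursion of the form of Equation (4).

    p[u][v] is an assumed, precomputed probability that base u of x aligns
    with base v of x' (0-based indices). Returns (score, aligned_pairs).
    """
    n = len(p)
    m = len(p[0]) if n else 0
    # M[u][v]: best score for the prefixes x_1..x_u and x'_1..x'_v
    M = [[0.0] * (m + 1) for _ in range(n + 1)]

    def match_score(u, v):
        # gain for aligning x_u with x'_v on top of the best prefix alignment
        return M[u - 1][v - 1] + (gamma + 1.0) * p[u - 1][v - 1] - 1.0

    for u in range(1, n + 1):
        for v in range(1, m + 1):
            M[u][v] = max(match_score(u, v), M[u - 1][v], M[u][v - 1])

    # Traceback; gap moves are preferred on ties, so zero-gain pairs are skipped.
    pairs = []
    u, v = n, m
    while u > 0 and v > 0:
        if M[u][v] == M[u - 1][v]:
            u -= 1
        elif M[u][v] == M[u][v - 1]:
            v -= 1
        else:
            pairs.append((u - 1, v - 1))
            u -= 1
            v -= 1
    pairs.reverse()
    return M[n][m], pairs
```

With `gamma = 1` a pair contributes a positive score only if its probability exceeds 0.5; larger values of `gamma` lower that threshold to 1/(gamma + 1) and therefore favor sensitivity over positive predictive value.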
We can obtain the alignment of the proposed estimator by replacing $p^{(mea)}_{uv}$ with
$$p^{(prop)}_{uv} = w_1\, p^{(a,x,x')}_{uv} + w_2 \Big[ \sum_{j:u<j,\,l:v<l} p^{(s,x)}_{uj}\, p^{(s,x')}_{vl}\, p^{(a,x,x')}_{jl} + \sum_{i:i<u,\,k:k<v} p^{(s,x)}_{iu}\, p^{(s,x')}_{kv}\, p^{(a,x,x')}_{ik} \Big] + w_3\, q^{(s,x)}_{u}\, q^{(s,x')}_{v}$$
in Equation (4). It is easily seen that $p^{(prop)}_{uv} \in [0,1]$. Note that MAFFT (Katoh and Toh, 2008) used a similar scoring scheme for the optimizing function of iterative alignments.

The total computational cost with respect to time for calculating the proposed estimators is $O(L^4)$ where $|x|,|x'| \approx L$ because it is necessary to compute the alternative alignment probability matrix $\{p^{(prop)}_{uv}\}_{u,v}$. However, by using the sparsity of aligned-base probability matrix and base-pairing probability matrix, we can greatly reduce the computational cost. As described in Do et al. (2008), we can assume $O(c)$ and $O(d)$ bounds on the number of candidate base pairs and alignment pairs per sequence position, respectively, if we use a threshold $\delta$ for the probability. In other words, there are $O(c)$ [respectively $O(d)$] base pairs (respectively aligned pairs) whose probability is more than $\delta$ per sequence position. Under these assumptions, it is easily seen that the time complexity for calculating the matrix $\{p^{(prop)}_{uv}\}$ is $O(c^2 d L^2)$. Since we need $O(L^3)$ time for calculating a base-pairing probability matrix, the total computational cost of our algorithm is $O(L^3 + c^2 d L^2)$, whereas the computational cost of the RNA alignment and folding (RAF) algorithm is $O(L^3 + \min(c,d)\, c\, d^2 L^2)$, as shown in Do et al. (2008).

2.3 Integrating PCT into the proposed estimator
We can easily integrate the PCT (Do et al., 2005) for alignment problems into the proposed estimator when we have a set of sequences which are homologous to $x$ and $x'$. For a set of (homologous) sequences $S$ and $x,x' \in S$, we redefine the parameter space $\Theta^{(prop)}$ as
$$\Theta^{(prop)} = A(x,x') \times S(x) \times S(x') \times \prod_{z \in S \setminus \{x,x'\}} \left[ A(x,z) \times A(z,x') \right]$$
and the probability distribution on the parameter space as
$$p^{(prop)}(\theta|x,x') = p^{(a)}(\theta^{(a,x,x')}|x,x')\; p^{(s)}(\theta^{(s,x)}|x)\; p^{(s)}(\theta^{(s,x')}|x') \times \prod_{z \in S \setminus \{x,x'\}} p^{(a)}(\theta^{(a,x,z)}|x,z)\; p^{(a)}(\theta^{(a,z,x')}|z,x')$$
for $\theta = (\theta^{(a,x,x')}, \theta^{(s,x)}, \theta^{(s,x')}, \{\theta^{(a,x,z)}, \theta^{(a,z,x')}\}_{z \in S \setminus \{x,x'\}})$ where $\theta^{(a,x,x')} \in A(x,x')$, $\theta^{(s,x)} \in S(x)$ and $\theta^{(s,x')} \in S(x')$. Furthermore, we redefine the gain function $G^{(prop)}$ by replacing $\theta^{(a,x,x')}_{ik}$ with the pseudo-count
$$\theta^{(a,x,x',S)}_{ik} = \frac{1}{|S|-1} \Big( \theta^{(a,x,x')}_{ik} + \sum_{z \in S \setminus \{x,x'\}} \sum_{1 \le u \le |z|} \theta^{(a,x,z)}_{iu}\, \theta^{(a,z,x')}_{uk} \Big)$$
where $|S|$ indicates the number of sequences in $S$. Note that $\theta^{(a,x,x',S)}_{ik} \in [0,1]$. Then, we obtain a new estimator that maximizes the expected gain under the probability distribution. In practice, we only replace $p^{(a,x,x')}_{ik}$ with the pseudo-probability
$$p^{(a,x,x',S)}_{ik} = \frac{1}{|S|-1} \Big( p^{(a,x,x')}_{ik} + \sum_{z \in S \setminus \{x,x'\}} \sum_{1 \le u \le |z|} p^{(a,x,z)}_{iu}\, p^{(a,z,x')}_{uk} \Big)$$
in computation of $p^{(prop)}_{uv}$. Note that $p^{(a,x,x',S)}_{ik}$ is called the PCT of the probability $p^{(a,x,x')}_{ik}$ (Do et al., 2005).

In a similar way, we are able to integrate the PCT of the base-pairing probability (Kiryu et al., 2007a) into the proposed estimator.

2.4 Extension to multiple alignments
Multiple alignment is conducted using the progressive alignment algorithm used in ProbconsRNA (Do et al., 2005) and CONTRAlign (Do et al., 2006b). We used the proposed estimator with sufficiently large γ and the PCT for pairwise alignment of two RNA sequences. Note that integrating PCT does not influence the total computational time achieved using the sparsity of the base-pairing probability and alignment match probability matrices.

3 EXPERIMENTS
We conducted all experiments in this section on our Linux cluster machines, each of which has a 2 GHz AMD Opteron(tm) Processor 246 and 4 GB of memory.

3.1 Datasets
We used the following six datasets for benchmarking. (i) Murlet dataset1 (Kiryu et al., 2007a), which contains 85 multiple alignments and reference common secondary structures extracted from the Rfam database. The number of families is 17 and there are 5 subalignments of 10 sequences for each family. (ii) Murlet dataset2 (Kiryu et al., 2007a), which contains 188 multiple alignments and reference common secondary structures of four sequences taken from the Hammerhead_3 ribozyme family in the Rfam database. (iii) MXSCARNA_dataset (Tabei et al., 2008), which contains 1693 multiple alignments and their common secondary structures. (iv) MASTR_dataset (Lindgreen et al., 2007), which contains five families and the total number of alignments is 52. (v) Bralibase2 (Gardner et al., 2005), which contains 599 multiple alignments. (vi) Bralibase2.1 (Wilm et al., 2006): the total number of alignments is 18 990, each of which contains 2, 3, 5, 7, 10 or 15 RNA sequences. Note that the Bralibase2.1 dataset does not contain reference common secondary structures.

3.2 Compared methods and our model
We compared CentroidAlign with the following state-of-the-art aligners. (i) CONTRAlign v2.01 (Do et al., 2006b), (ii) ProbconsRNA (Do et al., 2005) (neither of these aligners consider secondary structures), (iii) RAF v1.00 (Do et al., 2008), (iv) MXSCARNA v2.1 (Tabei et al., 2008), (v) Murlet (Kiryu et al., 2007a), (vi) MAFFT-xinsi 6.626 with SCARNA pair (Katoh and Toh, 2008), (vii) R-coffee v7.81 (Moretti et al., 2008; Wilm et al., 2008), (viii) Stemloc-AMA (Bradley et al., 2008), (ix) STRAL v0.5.4 (Dalli et al., 2006) and (x) (t)LARA v1.3.2 (Bauer et al., 2007). Due to limitations of computational cost, Stemloc-AMA was only applied to the Murlet_dataset2, the MASTR_dataset and the Bralibase2 dataset.

In the following sections, the proposed method ('centroid_align') means the proposed estimator (Section 2.2.2) with PCT for the alignment (Section 2.3). We used the McCaskill model ($p^{(s)}$) with the same parameter in ViennaRNA-1.6.3 and the CONTRAlign model ($p^{(a)}$) with the same parameter in CONTRAlign v2.0.1. Moreover, we set $\delta = 0.01$, $w_1 = 0.45$, $w_2 = 0.5$ and $w_3 = 0.05$. Those parameters were determined by testing on a small dataset.

3.3 Evaluation measures
In order to evaluate a predicted alignment from each aligner, we used the following evaluation measures used in previous research. (i) SPS (Thompson et al., 1999); we used compalignp, which is available from the Bralibase2.1 web site (http://www.biophys.uni-duesseldorf.de/bralibase), for computing SPS of each alignment. (ii) Structure conservation index (SCI) (Washietl et al., 2005) defined by $\mathrm{SCI} = E_A / \bar{E}$ where $E_A$ is the consensus minimum free energy (MFE) computed by RNAalifold and $\bar{E}$ is the average MFE over all RNA sequences in a given alignment; we used scif, which is also available from the Bralibase2.1 database, for calculating SCI. (iii) SEN, PPV and Matthews correlation coefficient (MCC) for a predicted common secondary structure, which are defined as follows: $\mathrm{SEN} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$, $\mathrm{PPV} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$ and
$$\mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$$
where TP is the number of correctly predicted base pairs (true positives), TN is the number of base pairs which were correctly predicted as non-matching (true negatives), FN is the number of base pairs in the correct structure which were not predicted (false negatives) and FP is the number of incorrectly predicted base pairs (false positives). The reference secondary structure of each sequence in alignments is given by a common secondary structure mapped to the sequence. We used RNAalifold (Bernhart et al., 2008) and Centroid(Ali)Fold (Hamada et al., 2009a; Sato et al., 2009) for common secondary structure prediction from a multiple alignment.

Table 1. Murlet dataset1

Aligner           n    SPS   SCI   SEN   PPV   MCC   TIME
contralign        85   0.77  0.41  0.58  0.77  0.65  149
probcons          85   0.76  0.38  0.57  0.79  0.64  71
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align    85   0.78  0.48  0.63  0.75  0.68  196
lara              85   0.75  0.51  0.62  0.73  0.66  12901
mafft-xinsi       85   0.79  0.53  0.67  0.75  0.70  510
mlocarna          85   0.72  0.60  0.66  0.73  0.68  38645
murlet            85   0.77  0.43  0.63  0.76  0.68  59506
mxscarna          85   0.75  0.44  0.64  0.75  0.67  201
raf               85   0.75  0.46  0.69  0.72  0.70  6167
rcoffee           85   0.76  0.42  0.59  0.75  0.65  487
stral             85   0.67  0.37  0.48  0.72  0.56  145

The aligners above the dashed line cannot consider secondary structures when aligning RNA sequences, whereas the ones below the dashed line can consider secondary structures. n means the number of successfully predicted alignments and TIME means the total computational time in seconds. In the original typeset tables, bold values indicate the best score in each evaluation measure and the fastest times in aligners above and below the dashed line. We used Centroid(Ali)Fold (Hamada et al., 2009a; Sato et al., 2009) to compute common secondary structure from an alignment for calculating SEN, PPV and MCC. See the Supplementary Material for the results using RNAalifold (Bernhart et al., 2008).

Table 2. Murlet dataset2

Aligner           n    SPS   SCI   SEN   PPV   MCC   TIME
contralign        188  0.84  0.63  0.76  0.84  0.78  6
probcons          188  0.81  0.60  0.79  0.84  0.80  6
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align    188  0.88  0.82  0.89  0.89  0.88  11
lara              188  0.81  0.85  0.89  0.84  0.86  486
mafft-xinsi       188  0.89  0.90  0.95  0.88  0.91  80
mlocarna          188  0.84  0.94  0.93  0.85  0.89  100
murlet            188  0.84  0.76  0.88  0.86  0.86  838
mxscarna          188  0.86  0.83  0.93  0.88  0.90  10
raf               188  0.86  0.88  0.95  0.89  0.91  68
rcoffee           188  0.84  0.73  0.84  0.87  0.85  328
stemloc-ama       188  0.85  0.75  0.88  0.87  0.87  37557
stral             188  0.77  0.55  0.64  0.71  0.65  13

See the footnote in Table 1.

Table 3. MXSCARNA dataset

Aligner           n     SPS   SCI   SEN   PPV   MCC   TIME
contralign        1693  0.79  0.58  0.64  0.67  0.63  707
probcons          1693  0.78  0.54  0.63  0.66  0.63  487
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align    1693  0.80  0.67  0.70  0.69  0.68  2000
lara              1693  0.77  0.71  0.70  0.68  0.68  61694
mafft-xinsi       1693  0.80  0.71  0.72  0.69  0.69  4316
mlocarna          1693  0.77  0.80  0.75  0.68  0.70  468792
murlet            1693  0.79  0.63  0.71  0.69  0.69  500469
mxscarna          1693  0.78  0.69  0.73  0.70  0.70  2540
raf               1693  0.79  0.72  0.75  0.70  0.71  41078
rcoffee           1693  0.78  0.61  0.67  0.68  0.66  4822
stral             1693  0.74  0.57  0.61  0.63  0.60  2021

See the footnote in Table 1.

Table 4. MASTR dataset

Aligner           n    SPS   SCI   SEN   PPV   MCC   TIME
contralign        52   0.87  0.72  0.64  0.77  0.69  31
probcons          52   0.87  0.72  0.64  0.78  0.69  18
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align    52   0.88  0.75  0.65  0.77  0.70  47
lara              52   0.86  0.77  0.69  0.80  0.73  3620
mafft-xinsi       52   0.88  0.78  0.68  0.78  0.71  133
mlocarna          52   0.85  0.80  0.68  0.75  0.71  2117
murlet            52   0.87  0.74  0.67  0.78  0.71  4149
mxscarna          52   0.86  0.73  0.66  0.77  0.70  50
raf               52   0.86  0.74  0.70  0.77  0.73  272
rcoffee           52   0.87  0.74  0.66  0.78  0.70  223
stemloc-ama       51   0.86  0.72  0.63  0.77  0.68  353453
stral             52   0.81  0.70  0.61  0.75  0.65  32

See the footnote in Table 1.

3.4 Results of computational experiments
We show the results of benchmarking on Murlet dataset1, Murlet dataset2, MXSCARNA dataset, MASTR dataset, Bralibase2 dataset and Bralibase2.1 dataset in Tables 1–6, respectively. These results show that CentroidAlign achieved a great balance of speed and accuracy compared with existing approaches. More precisely,
results because we designed our estimator in order to maximize expected SPS score. Moreover, CentroidAlign was one of the fastest aligners out of all the aligners that consider secondary structures (i.e.
the aligners below dashed line in each table), which confirm (i) CentroidAlign achieved first or second best SPS out of all the effectiveness of our approximations in the MEA estimator. the aligners in all the benchmark datasets. These are desirable However, CentroidAlign sometimes gives worse SCI, SEN, PPV [14:24 9/11/2009 Bioinformatics-btp580.tex] Page: 3240 3236–3243 CentroidAlign and MCC (i.e. evaluation measures related to common secondary expected accuracy. We obtained the proposed estimator by structures) than some tools such as mafft-xinsi, raf, lara and approximating the MEA estimator that maximizes the expected locarna, which shows CentroidAlign is not the optimal estimator SPS under the marginalized distribution of the Sankoff model. Our for those evaluation measures. (ii) On most evaluation measures, estimator considers all the suboptimal secondary structures and all CentroidAlign is better than CONTRAlign and ProbconsRNA, both the suboptimal pairwise alignments of input sequences. of which cannot consider secondary structures of input sequences Stemloc-AMA (Bradley et al., 2008) also adopts an MEA at all. However, CentroidAlign is 2–5 times slower than those estimator that is different from the one in this article. The differences tools. (iii) CentroidAlign has a better SPS and is much faster than between Stemloc-AMA and CentroidAlign can be summarized Stemloc-AMA (Bradley et al., 2008) (which has several similarities as follows: (i) Stemloc-AMA uses the AMA estimator (Schwartz with CentroidAlign; See Section A in the Supplementary Material). et al., 2005) as the gain function, whereas CentroidAlign uses This is because Stemloc-AMA uses the marginalized probability the different gain function described in Section 2.2.2. The relation distribution given by Sankoff model while we used approximations between the gain functions of Stemloc-AMA and CentroidAlign to the distribution. is described in the Supplementary Materials. 
(ii) The probability We also tried other reasonable approximations to the MEA distribution in Stemloc-AMA for pairwise alignments of two RNA estimator, as described in Section B in the Supplementary Material, sequences is the marginalized probability distribution obtained by and their performances were consistently worse than those of the the Sankoff model, that is, Equation (3), whereas CentroidAlign estimator proposed in the main text. uses an approximation to the distribution. Therefore, Stemloc- AMA has a huge computational cost, which was demonstrated in our experiments. The idea used in this article will be applied to 4 DISCUSSION AND FUTURE WORK finding approximations of the AMA estimator used in Stemloc- In this article, we have proposed a theoretically sound estimator AMA, although the approximations will be more complicated than for multiple alignment of RNA sequences, based on maximizing the ones in this article. LARA (Bauer et al., 2007) is able to take into account base-paring Table 5. BRAlibase2 dataset probabilities and subsequently uses an approach similar to Probcons and CentroidAlign to compute the multiple alignment. The main Aligner n SPS SCI SEN PPV MCC TIME difference between LARA and CentroidAlign is that LARA uses an ILP approach rather than a DP used in CentroidAlign. STRAL contralign 599 0.83 0.79 0.74 0.74 0.74 147 (Dalli et al., 2006) also conducted sequential alignment whose score probcons 599 0.83 0.77 0.73 0.74 0.73 108 considers the potential secondary structures by using base-paring centroid_align 599 0.84 0.83 0.77 0.75 0.76 416 probabilities. However, the score of STRAL is less elaborate than lara 599 0.85 0.90 0.80 0.76 0.77 13 710 the one of CentroidAlign (e.g. 
STRAL does not use alignment mafft-xinsi 599 0.84 0.87 0.79 0.75 0.77 870 match probabilities), and our computational experiments indicate mlocarna 599 0.85 0.94 0.82 0.76 0.79 43 697 that STRAL is less accurate than CentroidAlign although it is as murlet 599 0.83 0.82 0.78 0.75 0.76 33 237 mxscarna 599 0.83 0.86 0.80 0.76 0.77 512 fast as CentroidAlign. raf 599 0.84 0.88 0.83 0.76 0.79 2459 Do et al. (2008) proposed the excellent approximations of the rcoffee 599 0.82 0.78 0.73 0.74 0.73 1451 Sankoff algorithm, which were implemented in RAF. However, stemloc-ama 597 0.77 0.72 0.67 0.75 0.69 1 699 068 the predictive space used in RAF remains the space of structural stral 599 0.81 0.80 0.73 0.73 0.73 421 alignments of RNA sequences. So, RAF is theoretically slower than the proposed method—a fact confirmed in our experiments. Stemloc-AMA failed due to the small size of the dataset. See the footnote in Table 1. Table 6. BRAlibase2.1 dataset Aligner k2 k3 k5 k7 k10 k15 SPS SCI SPS SCI SPS SCI SPS SCI SPS SCI SPS SCI n TIME contralign 0.85 0.84 0.86 0.79 0.88 0.78 0.89 0.77 0.90 0.76 0.91 0.75 18990 4110 probcons 0.84 0.83 0.86 0.77 0.88 0.76 0.89 0.75 0.90 0.73 0.91 0.72 18990 2839 centroid_align 0.86 0.89 0.87 0.84 0.89 0.83 0.90 0.81 0.91 0.80 0.92 0.79 18990 8499 lara 0.85 0.95 0.87 0.89 0.90 0.88 0.91 0.87 0.92 0.85 0.93 0.86 18990 377864 mafft-xinsi 0.86 0.91 0.88 0.87 0.90 0.87 0.91 0.86 0.92 0.85 0.93 0.85 18990 20798 mlocarna 0.87 0.96 0.89 0.93 0.90 0.92 0.90 0.90 0.91 0.88 0.92 0.88 18990 1240871 murlet 0.85 0.88 0.87 0.83 0.89 0.81 0.90 0.80 0.91 0.78 0.92 0.76 18980 1371504 mxscarna 0.85 0.91 0.87 0.86 0.88 0.83 0.89 0.81 0.90 0.78 0.91 0.77 18990 9778 raf 0.86 0.95 0.88 0.90 0.89 0.87 0.90 0.83 0.91 0.79 0.91 0.76 18990 67363 rcoffee 0.83 0.82 0.85 0.79 0.88 0.78 0.89 0.77 0.90 0.76 0.91 0.75 18990 38847 stral 0.82 0.84 0.84 0.79 0.85 0.77 0.85 0.75 0.85 0.73 0.85 0.72 18990 7963 This dataset does not contain reference secondary structures, so only SPS and 
SCI are shown in the table. Murlet failed due to the small size of the dataset. The bold values indicates the best score in each evaluation measure and the fastest times in aligners above and below the dashed line. [14:24 9/11/2009 Bioinformatics-btp580.tex] Page: 3241 3236–3243 M.Hamada et al. Note that RAF can predict common secondary structures as well as Do,C. et al. (2005) ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340. alignments, whereas CentroidAlign cannot. Do,C. et al. (2006a) CONTRAfold: RNA secondary structure prediction without An advantage of the proposed method is that it can be easily physics-based models. Bioinformatics, 22, e90–e98. extended to local alignment problems. In that case, we should use Do,C.B. et al. (2006b) Contralign: discriminative training for protein sequence (a) probabilities p obtained using local alignment models such as alignment. In Apostolico,A. et al., eds, RECOMB, vol. 3909 of Lecture Notes in Computer Science, Springer, Berlin/Heidelberg, pp. 160–174. ProDA (Phuong et al., 2006) and LAST (http://last.cbrc.jp/), and Do,C. et al. (2008) A max-margin model for efficient simultaneous alignment and conduct Smith–Waterman-style DP (Smith and Waterman, 1981). folding of RNA sequences. Bioinformatics, 24, i68–i76. Instead of Equation (4) Smith–Waterman DP uses Dowell,R. and Eddy,S. (2004) Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5, 71. Dowell,R. and Eddy,S. (2006) Efficient pairwise RNA structure prediction and (prop) M +(γ +1)p −1 uv alignment using sequence alignment constraints. BMC Bioinformatics, 7,400. u−1,v−1 M =max . u,v M Durbin,R. et al. (1998) Biological Sequence Analysis. Cambridge University press, u−1,v Cambridge. u,v−1 Gardner,P.P. et al. (2005) A benchmark of multiple sequence alignment programs upon While we used a large γ to maximize the expected SPS in this structural RNAs. 
Nucleic Acids Res., 33, 2433–2439. Hamada,M. et al. (2009a) Prediction of RNA secondary structure using generalized article, in Smith–Waterman DP the γ parameter will be important for centroid estimators. Bioinformatics, 25, 465–473. adjusting the sensitivity and specificity of aligned bases. We should Hamada,M. et al. (2009b) Predictions of RNA secondary structure by combining employ Rfold (Kiryu et al., 2008) or RNAplfold (Bernhart et al., homologous sequence information. Bioinformatics, 25, i330–i338. 2006) for calculating the banded base-pairing probability matrix. Harmanci,A. et al. (2008) PARTS: probabilistic alignment for RNA joinT secondary (prop) structure prediction. Nucleic Acids Res., 36, 2406–2417. We determined the parameters w , w and w of p in the 1 2 3 uv Havgaard,J. et al. (2005) Pairwise local structural alignment of RNA sequences with proposed estimator by an ad hoc procedure in this studies. Therefore, sequence similarity less than 40%. Bioinformatics, 21, 1815–1824. the values used might not be the optimal ones. However, it is possible Holmes,I. (2005) Accelerated probabilistic inference of RNA structure evolution. BMC to learn these parameters using the max-margin model (Do et al., Bioinformatics, 6, 73. Holmes,I. and Durbin,R. (1998) Dynamic programming alignment accuracy. J. Comput. 2008). Biol., 5, 493–504. Katoh,K. and Toh,H. (2008) Improved accuracy of multiple ncRNA alignment by incorporating structural information into a MAFFT-based framework. BMC ACKNOWLEDGMENTS Bioinformatics, 9, 212. Kiryu,H. et al. (2007a) Murlet: a practical multiple alignment tool for structural RNA The authors thank L. E. Carvalho and C. E. Lawrence for sequences. Bioinformatics, 23, 1588–1598. valuable comments. The authors are also grateful to our colleagues Kiryu,H. et al. (2007b) Robust prediction of consensus secondary structures using at the Computational Biology Research Center (CBRC) for averaged base pairing probability matrices. 
Bioinformatics, 23, 434–441. fruitful discussions. Also, the constructive comments of the three Kiryu,H. et al. (2008) Rfold: an exact algorithm for computing local base pairing anonymous reviewers are greatly appreciated. probabilities. Bioinformatics, 24, 367–373. Lindgreen,S. et al. (2007) MASTR: multiple alignment and structure prediction of non- Funding: ‘Functional RNA Project’ funded by the New Energy and coding RNAs using simulated annealing. Bioinformatics, 23, 3304–3311. Industrial Technology Development Organization (NEDO) of Japan; Lunter,G. et al. (2008) Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res., 18, 298–309. Grant-in-Aid for Scientific Research on Priority Areas ‘Comparative Mathews,D. (2005) Predicting a set of minimal free energy RNA secondary structures Genomics’ from the Ministry of Education, Culture, Sports, Science common to two sequences. Bioinformatics, 21, 2246–2253. and Technology of Japan. Mathews,D. and Turner,D. (2002) Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol., 317, 191–203. Conflict of Interest: none declared. McCaskill,J.S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119. Miyazawa,S. (1995) A reliable sequence alignment method based on probabilities of REFERENCES residue correspondences. Protein Eng., 8, 999–1009. Moretti,S. et al. (2008) R-Coffee: a web server for accurately aligning noncoding RNA Anwar,M. et al. (2006) Identification of consensus RNA secondary structures using sequences. Nucleic Acids Res., 36, W10–W13. suffix arrays. BMC Bioinformatics, 7, 244. Needleman,S. and Wunsch,C. (1970) A general method applicable to the search Bauer,M. et al. (2007) Accurate multiple sequence-structure alignment of RNA for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, sequences using combinatorial optimization. 
BMC Bioinformatics, 8, 271. 443–453. Bernhart,S. et al. (2008) RNAalifold: improved consensus structure prediction for RNA Phuong,T. et al. (2006) Multiple alignment of protein sequences with repeats and alignments. BMC Bioinformatics, 9, 474. rearrangements. Nucleic Acids Res., 34, 5932–5942. Bernhart,S.H. et al. (2006) Local RNA base pairing probabilities in large sequences. Roshan,U. and Livesay,D. (2006) Probalign: multiple sequence alignment using Bioinformatics, 22, 614–615. partition function posterior probabilities. Bioinformatics, 22, 2715–2721. Bradley,R.K. et al. (2008) Specific alignment of structured RNA: stochastic grammars Sankoff,D. (1985) Simultaneous solution of the RNA folding alignment and and sequence annealing. Bioinformatics, 24, 2677–2683. protosequence problems. SIAM J. Appl. Math, pp. 810–825. Bradley,R.K. et al. (2009) Fast statistical alignment. PLoS Comput. Biol., 5, Sato,K. et al. (2009) CENTROIDFOLD: a web server for RNA secondary structure e1000392. prediction. Nucleic Acids Res., 37, W277–W280. Carvalho,L. and Lawrence,C. (2008) Centroid estimation in discrete high-dimensional Schwartz,A.S. et al. (2005) Alignment metric accuracy. Available at spaces with applications in biology. Proc. Natl Acad. Sci. USA, 105, http://arxiv.org/abs/q-bio.QM/0510052. 3209–3214. Seemann,S. et al. (2008) Unifying evolutionary and thermodynamic information for Dalli,D. et al. (2006) STRAL: progressive alignment of non-coding RNA using base RNA folding of multiple alignments. Nucleic Acids Res., 36, 6355–6362. pairing probability vectors in quadratic time. Bioinformatics, 22, 1593–1599. Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular Ding,Y. et al. (2005) RNA secondary structure prediction by centroids in a Boltzmann subsequences. J. Mol. Biol., 147, 195–197. weighted ensemble. RNA, 11, 1157–1166. Tabei,Y. et al. (2008) A fast structural multiple alignment method for long RNA Ding,Y. et al. 
(2006) Clustering of RNA secondary structures with application to sequences. BMC Bioinformatics, 9, 33. messenger RNAs. J. Mol. Biol., 359, 554–571. [14:24 9/11/2009 Bioinformatics-btp580.tex] Page: 3242 3236–3243 CentroidAlign Thompson,J.D. et al. (1999) A comprehensive comparison of multiple sequence Wilm,A. et al. (2006) An enhanced RNA alignment benchmark for sequence alignment alignment programs. Nucleic Acids Res., 27, 2682–2690. programs. Algorithms Mol. Biol., 1, 19. Washietl,S. et al. (2005) Fast and reliable prediction of noncoding RNAs. Proc. Natl Wilm,A. et al. (2008) R-Coffee: a method for multiple alignment of non-coding RNA. Acad. Sci. USA, 102, 2454–2459. Nucleic Acids Res., 36, e52. Webb-Robertson,B.J. et al. (2008) Measuring global credibility with application to local Wong,K.M. et al. (2008) Alignment uncertainty and genomic analysis. Science, 319, sequence alignment. PLoS Comput. Biol., 4, e1000077. 473–476. [14:24 9/11/2009 Bioinformatics-btp580.tex] Page: 3243 3236–3243
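As a supplementary illustration of the structure-related measures defined in Section 3.3, the following minimal sketch computes SEN, PPV and MCC from sets of base pairs. It is not the evaluation code used in the article. The function names `count_base_pairs` and `structure_scores` are hypothetical, and the sketch assumes that every position pair (i, j) with i < j counts as a candidate base pair when TN is tallied, which the text does not specify.

```python
import math


def count_base_pairs(predicted, reference, length):
    """Return (TP, TN, FP, FN) for two sets of base pairs (i, j), i < j.

    Assumption: all length*(length-1)/2 position pairs are candidates,
    so TN is every candidate pair absent from both sets.
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # correctly predicted base pairs
    fp = len(predicted - reference)   # predicted but not in the reference
    fn = len(reference - predicted)   # in the reference but not predicted
    total = length * (length - 1) // 2
    tn = total - tp - fp - fn
    return tp, tn, fp, fn


def structure_scores(tp, tn, fp, fn):
    """Return (SEN, PPV, MCC); each is 0.0 when its denominator is zero."""
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sen, ppv, mcc


if __name__ == "__main__":
    pred = {(0, 9), (1, 8), (2, 7)}
    ref = {(0, 9), (1, 8), (3, 6)}
    counts = count_base_pairs(pred, ref, length=10)
    print(counts, structure_scores(*counts))
```

Returning 0.0 for an empty denominator is a convention chosen for this sketch; other tools may report such cases differently.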
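The PCT pseudo-probability of Section 2.3 averages the direct match probability between two sequences with the probabilities routed through every third sequence. The sketch below shows one way to compute it with plain Python lists. The function name `pct` and the dictionary layout `p[(a, b)]` (a matrix of match probabilities between positions of sequences a and b) are assumptions made for illustration; they are not taken from the CentroidAlign implementation.

```python
def pct(p, x, y, seqs):
    """Probabilistic consistency transformation for the pair (x, y).

    p[(a, b)][i][k] is the match probability between position i of
    sequence a and position k of sequence b (hypothetical layout).
    Returns (direct + sum over third sequences z) / (|S| - 1).
    """
    direct = p[(x, y)]
    n_i, n_k = len(direct), len(direct[0])
    out = [[direct[i][k] for k in range(n_k)] for i in range(n_i)]
    for z in seqs:
        if z in (x, y):
            continue  # only third sequences contribute to the sum
        p_xz, p_zy = p[(x, z)], p[(z, y)]
        for i in range(n_i):
            for k in range(n_k):
                # sum over positions u of z: p_xz[i][u] * p_zy[u][k]
                out[i][k] += sum(p_xz[i][u] * p_zy[u][k]
                                 for u in range(len(p_zy)))
    scale = 1.0 / (len(seqs) - 1)
    return [[value * scale for value in row] for row in out]


if __name__ == "__main__":
    probs = {
        ("x", "y"): [[0.5, 0.1], [0.2, 0.6]],
        ("x", "z"): [[0.9], [0.0]],
        ("z", "y"): [[0.8, 0.1]],
    }
    print(pct(probs, "x", "y", ["x", "y", "z"]))
```

With only two sequences the sum over third sequences is empty and the scale factor is 1, so the function returns the direct probabilities unchanged.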
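The local-alignment recursion mentioned in the Discussion adds $(\gamma + 1)p_{uv} - 1$ for an aligned position pair and nothing for a gap, so a pair raises the score only when its probability exceeds $1/(\gamma + 1)$. The sketch below fills the DP table and traces back the aligned pairs. It assumes the three-case form of the recursion with a zero-initialised first row and column, and the name `gain_dp` is hypothetical; it is an illustration, not the authors' implementation.

```python
def gain_dp(p, gamma):
    """Fill M[u][v] = max(M[u-1][v-1] + (gamma+1)*p[u-1][v-1] - 1,
                          M[u-1][v], M[u][v-1]) and trace back.

    p is a matrix of match probabilities (list of rows).
    Returns (best score, list of aligned 0-based position pairs).
    Assumption: row 0 and column 0 of M are initialised to zero.
    """
    n = len(p)
    m = len(p[0]) if p else 0
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for u in range(1, n + 1):
        for v in range(1, m + 1):
            diag = M[u - 1][v - 1] + (gamma + 1) * p[u - 1][v - 1] - 1
            M[u][v] = max(diag, M[u - 1][v], M[u][v - 1])
    # Trace back, preferring gap moves on ties so that only pairs that
    # strictly improve the score are reported.
    pairs = []
    u, v = n, m
    while u > 0 and v > 0:
        if M[u][v] == M[u - 1][v]:
            u -= 1
        elif M[u][v] == M[u][v - 1]:
            v -= 1
        else:
            pairs.append((u - 1, v - 1))
            u -= 1
            v -= 1
    pairs.reverse()
    return M[n][m], pairs


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.1, 0.8]]
    print(gain_dp(probs, gamma=1.0))
    print(gain_dp(probs, gamma=0.0))
```

Raising `gamma` lowers the probability threshold and aligns more positions, while lowering it leaves more positions unaligned, which matches the sensitivity and specificity trade-off described in the text.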
Bioinformatics – Oxford University Press
Published: Oct 6, 2009