MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing

Stinus Lindgreen; Paul P. Gardner; Anders Krogh

doi:10.1093/bioinformatics/btm525

MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing

Lindgreen, Stinus; Gardner, Paul P.; Krogh, Anders 2007-11-15 00:00:00 Vol. 23 no. 24 2007, pages 3304–3311 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm525 Sequence analysis MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing 1, 1,2,† 1 Stinus Lindgreen , Paul P. Gardner and Anders Krogh 1 2 Bioinformatics Centre and Molecular Evolution Group, Department of Molecular Biology, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen N, Denmark Received on August 20, 2007; revised on October 11, 2007; accepted on October 13, 2007 Advance Access publication November 15, 2007 Associate Editor: Dmitrij Frishman involved in gene expression and cell specialization, vault RNAs ABSTRACT seem to be involved in multi-drug resistance important for the Motivation: As more non–coding RNAs are discovered, the treatment of cancer, and small nuclear RNAs (snRNA) are key importance of methods for RNA analysis increases. Since the players in the splicing of pre-mRNA. A large number of other structure of ncRNA is intimately tied to the function of the molecule, ncRNA families have yet to be functionally characterized. programs for RNA structure prediction are necessary tools in this It has also become clear that these non-protein coding genes growing field of research. Furthermore, it is known that RNA vary greatly in size, ranging from microRNAs of 20 nt to more structure is often evolutionarily more conserved than sequence. than 10 000 nt in RNAs involved in eukaryotic gene silencing However, few existing methods are capable of simultaneously (Jossinet et al., 2007), and also that they are transcribed in considering multiple sequence alignment and structure prediction. Result: We present a novel solution to the problem of simultaneous different ways: some reside in introns of protein coding genes, structure prediction and multiple alignment of RNA sequences. while others are large transcripts that include introns and the possibility of alternative splicing although they lack the open Using Markov chain Monte Carlo in a simulated annealing frame- reading frame of a protein coding gene (Meyer, 2007). work, the algorithm MASTR (Multiple Alignment of STructural RNAs) Experimental studies show that a huge fraction of the human iteratively improves both sequence alignment and structure predic- genome is transcribed (Cheng et al., 2005), and computational tion for a set of RNA sequences. This is done by minimizing a studies show evidence that thousands of structurally conserved combined cost function that considers sequence conservation, RNAs can be found in the human genome (Pedersen et al., covariation and basepairing probabilities. The results show that the method is very competitive to similar programs available today, 2006; Washietl et al., 2005). There is therefore little doubt that both in terms of accuracy and computational efficiency. RNAs are biologically very important, and the structural Availability: Source code available from http://mastr.binf.ku.dk/ analysis of RNA sequences is a field of growing interest. Contact: stinus@binf.ku.dk Through evolution, the sequences of related RNAs can diverge although the structure remains conserved. Pure sequence comparison methods therefore fail when applied to ncRNAs that have diverged too much (Gardner et al., 2005). It is ultimately the tertiary structure that determines the 1 INTRODUCTION function of the molecule, and advances are being made in In recent years, the amount of evidence that RNAs play a this field (Das and Baker, 2007; Shapiro et al., 2007). However, much more active role in the cell than previously thought has in the case of RNA it is often sufficient to determine the grown dramatically. The view has now shifted away from the secondary structure. The reason is that the formation of assumption that non-coding RNAs (ncRNA) merely helped in secondary structure is fast, and the basepairing interactions the protein synthesis (e.g. tRNA, rRNA), and today a wide are strong. The secondary structure, therefore, contributes variety of catalytically active RNAs or ribozymes have been the major part of the folding energetics, forming a stable characterized. It has also become clear that ncRNA is a very scaffold for the formation of tertiary interactions (Onoa and diverse group of molecules both in terms of function and Tinoco, 2004). structure. There exist methods to fold a single RNA sequence either RNA molecules have been found to play important by maximizing basepairing interactions (Nussinov et al., 1978), and diverse roles (Athanasius F. Bompfu¨ newerer Consortium or by minimizing the free energy of the structure [mfold (Zuker et al., 2007; Bompfu¨ newerer et al., 2005; Meyer, 2007): and Stiegler, 1981); RNAfold (Hofacker, 2003)]. Another the recently discovered family of microRNAs (miRNA) is approach is to use an existing sequence alignment and predict a consensus structure based on this. In RNAalifold (Hofacker et al., 2002), this has been pursued using a combination of free energy and covariation. In Pfold (Knudsen and *To whom correspondence should be addressed. Present address: Wellcome Trust Sanger Institute, Cambridge, UK. Hein, 1999, 2003), a stochastic context-free grammar (SCFG) 3304 The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org MASTR is used to predict a common structure from a multiple extended to multiple sequences by considering all pairs in a alignment. set of multiple RNA sequences. If pseudoknots are disregarded, an RNA structure can be SimulFold (Meyer and Miklos, 2007) is a fully probabilistic represented as a tree. Since comparison of strings can be model using Bayesian Markov chain Monte Carlo. The extended to trees (Tai, 1979; Zhang and Shasha, 1989), program takes as input a set of unaligned sequences D and alignments could be based on the structures directly. samples both multiple alignment A, secondary structure S and In RNAforester (Ho¨ chsmann et al., 2003), the input is a set a phylogenetic tree T from the joint posterior probability of RNA sequences with (possibly predicted) secondary P(S,A,T|D). This very comprehensive program came out structures, and the problem thus becomes a forest alignment very recently, but although it has some methodology in problem. The program performs either local or global common with MASTR (e.g. sampling based on the likelihood alignment of the structures, and the output is an alignment of the solution) it does so in a very different way, which also and predicted consensus structure based on the structural shows in the computational complexity of the program. similarities. MARNA (Siebert and Backofen, 2005) is another Since the exact solution to the problem is too time and heuristic method, where a multiple structural alignment is memory consuming to be pursued, all the methods above are inferred from all pairwise alignments of secondary structures. simplified in one way or another. Furthermore, it has been Due to the tight relationship between sequence and suggested that the optimal minimum free energy structure is structure, the solution to the sequence alignment problem and not necessarily a good solution to the consensus structure the structure prediction is interdependent. Whether aligning problem (Ding and Lawrence, 2003). We therefore pursue a without considering the structure, or folding without consider- heuristic sampling approach where the structure and ing sequence alignment, information is ignored. Ideally, one sequence alignment can be optimized in parallel. In our should therefore perform the sequence alignment and structure approach a cost function (or energy) is defined as a sum of prediction in parallel. In 1985, Sankoff presented an exact three terms: an alignment term, a structure term and a solution to this NPhard problem (Sankoff, 1985), but the covariance term. This cost function is minimized using 3N exponential running time of O(L ) and memory usage of simulated annealing to obtain the combined alignment and 2N O(L ) makes it intractable even for problems of moderate size. structure with minimum cost—the best solution according to Various implementations of the Sankoff algorithm exist. the cost function. This optimization is carried out by changing Foldalign (Gorodkin et al., 1997; Havgaard et al., 2005) and the structure on the basepair level or by moving gaps around Dynalign (Mathews and Turner, 2002) are both limited to local in the sequence alignment. The change is then judged by pairwise alignment. In PMcomp/PMmulti (Hofacker et al., the change in the cost function and either accepted or rejected. 2004), the optimal alignment of two basepairing probability The procedure is implemented in the program called MASTR matrices is found instead of aligning the RNA sequences per se. (Multiple Alignment of STructural RNAs). A multiple alignment can be built using progressive alignment of the basepairing probability matrices. A similar progressive approach is used in FoldalignM (Torarinsson et al., 2007). 2 METHODS LocARNA (Will et al., 2007) is a local alignment tool similar to 2.1 Defining the cost function but more efficient than PMcomp, and this program can also To find a solution to the problem of simultaneous multiple be used to do progressive multiple alignment and structure alignment and structure prediction, we define a cost function that will prediction. In RNAcast (Reeder and Giegerich, 2005), be minimized in order to search for the optimal solution. A solution the consensus structure problem is dealt with in a different should minimize a combined cost function cost(A, S), which incorpo- way: by using abstract shapes (Giegerich et al., 2004), where the rates both the sequence alignment A and the predicted consensus structures can be regarded without all details but only using structure S. The different parameters used in the program (e.g. scaling the layout of the structure, the search space is reduced. parameters and thresholds) have been set using grid optimization. RNAcast predicts the best common shape for all the sequences A small number of low identity RNA datasets have been used to and, for each sequence, the energetically best structure. optimize the parameters by changing the settings slightly and In RNA Sampler (Xu et al., 2007) stems are the core building reevaluating the results. It should be noted that the datasets used for blocks. For each sequence, a list of all possible stems consisting optimizing the parameters are not the same as in the test, and that the datasets do not cover all the families used in the comparison. of consecutive A–U, G–C and G–U pairs is generated. A pairwise alignment is found by aligning all stems from one sequence with all stems from another, and the loop regions 2.1.1 Calculating alignment cost There exist many ways of determining the cost of a multiple alignment: Sum of Pairs using are aligned using ClustalW (Thompson et al., 1994). Since a substitution matrix and minimization of the entropy of the alignment bulges are not allowed in stems, the alignment process can be are two well–known examples (Durbin et al., 1998), and using a done efficiently by sliding one stem along the other. Such phylogenetic tree to sum the pairwise alignments inferred by the edges a block of aligned stems has a conservation score including has also been pursued (Hein, 1989). both nucleotide alignment probabilities and basepairing prob- During the development of the algorithm, various sequence cost abilities. From the set of blocks, a common structure is found functions were examined. Sum of Pairs and different entropy-based by sampling, and the probability of a block being chosen measures were tested using both single nucleotide and dinucleotide depends on the conservation score. The probability matrices domains. We selected the best performing cost function, which proved are then updated based on the sampled structures, and the to be a log-likelihood cost function inspired by Hidden Markov models process is iterated until convergence. This process has been (HMMs) over single nucleotides. 3305 S.Lindgreen et al. The cost function is fully probabilistic in its treatment of both gaps calculated once for each ungapped sequence s ¼ 1,..., N before the and nucleotides. We assume independence between the sites in the optimization starts, and the results are stored in individual probability alignment. When calculating the cost, we have an alignment of length L matrices M . These matrices do not change throughout the algorithm. consisting of N sequences. Let x denote the jth character in sequence i, For a basepair (i, j) in the alignment, we need to correct the indices to be i i and let Pðx Þ denote the probability of seeing character x at this specific able to find the probability for that particular basepair. Let these j j s s s s s position. Assuming the sites are independent, the probability of the gap-corrected indices be denoted (i , j ), where i ¼ i M and M is the multiple alignment A becomes: number of gaps preceding position i in sequence s, and similarly for index j. The probability for the basepair in sequence s is then found as L N YY s s s s s s PðAÞ¼ Pðx Þ M (i , j ). If either i or j is a gap in sequence s, M (i , j ) ¼ 0. A basepair j¼1 i¼1 (i, j) in the alignment is scored by the mean probability: The individual character probabilities need to be determined and N s s s Pðbp Þ¼ 1=N M ði ; j Þ gaps have to be taken into account. If x is a gap we have two cases: let i; j s¼1 P denote the gap open probability, i.e. the probability of having a GO gap at position j given that position j 1 contained a nucleotide. To transform this into a cost function for the basepairs, the negative Similarly, P is the gap extension probability used when both position GE logarithm of the mean probability is taken and a threshold is j and j 1 contain a gap. Both of these probabilities can be estimated introduced. The threshold reflects the background probability P of null from known structural alignments. In the program, they are set to random basepairs found in the probability matrices: P ¼ 0.5 and P ¼ 0.74. GO GE If x is a nucleotide from the alphabet ¼ {A,C,G,U}, the probability cost ði; jÞ¼ log Pðbp Þ þ logðÞ P ð2Þ P null j 2 i; j 2 Pðx Þ is calculated based on the nucleotides that comprise the column. A background probability of P ¼ 0.25 is used based on parameter null Additionally, the probability is dependent on the preceding character. If i optimization (data not shown). x ¼, we have a gap closing, and the probability is multiplied by j1 i Through evolution, related RNA sequences can mutate which leads (1 P ). Similarly, if x 2 , the probability is multiplied by GE j1 to different sequences of nucleotides while the same core secondary (1 P ). Let c (a) be the number of occurrences of nucleotide a2 GO j structure is retained. When a mutation happens at a position that is at position j in the alignment. The probabilities are given as: involved in a basepair, selection favors mutations at the other position i i P x ¼; x 2 GO > j j1 that maintain the structure and molecular function. This is known as i i P x ¼; x ¼ compensatory mutations. Thus, structure is often more conserved than > GE j j1 sequence, and this signal can be measured by a covariation score. c ðaÞ Pðx Þ¼ j i i ð1 P Þ x ¼ a 2 ; x 2 GO > j j1 > cjðbÞ In Lindgreen et al. (2006), we analyzed various measures of b2 > cjðaÞ i i covariation. We refer to this article for details, but here the chosen : P ð1 P Þ x ¼ a 2 ; x ¼ GE j j1 c ðbÞ b2 cost function will be briefly explained. The RNAalifold measure uses a matrix for each sequence , where ¼ 1 if sequence can A simple pseudo-count function is used where c (a) is incremented by i; i; j form a basepair between position i and j, and ¼ 0 otherwise. 1 for each a2 and 1 j L. An IUPAC ambiguity character is i; j The function ðx x ; x x Þ measures the Hamming distance between exchanged with one of the nucleotides it symbolizes with equal i j i j two aligned pairs at positions i and j in sequences and . The goal is probability. For instance, if an N occurs in a sequence, it is replaced to measure the fraction of consistently aligned pairs. A penalty term, by any one of the 4 nt with 25% chance each. Similarly, an S will be q , measures the fraction of sequences with inconsistent pairs in the exchanged with either a C or a G with a 50% chance each.This exchange i, j alignment. The covariation is then found as: is done once in the beginning of the program. Having these probabilities, P(A) can be calculated. The cost function used is the negative log-likelihood based on the alignment probability: B ¼ ðx x ; x x Þ q i;j i; j i j i j i; j i; j 2 5 QðAÞ¼1=N logðÞ PðAÞ ð1Þ To add stacking information, the two basepairs enclosing the pair in question are also considered, but more weight is put on the actual 2.1.2 Calculating structure cost The cost of the structure is pair: defined as the sum of the cost of the individual basepairs. Let S be the structure consisting of basepairs (i, j): B þ 2 B þ B i1; jþ1 i; j iþ1; j1 X Cðbp Þ¼ i; j costðSÞ ¼ costði; jÞ ði; jÞ2S To turn this into a cost function, the same approach is used as for the partition function above. The covariation score is negated and a There are two ways to score the structure: by the free energy of single threshold value added: sequences and by covariation. In the present work, we use the two measures that proved best at predicting true basepairs in our previous cost ¼Cðbp Þþ ð3Þ Cði;j Þ i; j study (Lindgreen et al., 2006): The McCaskill basepair probabilities (McCaskill, 1990), called P(bp ), and a novel version of the covariation i, j A threshold of ¼ 0.25 is used. Using the two cost functions measure used in RNAalifold (Hofacker et al., 2002) extended to include cost (i, j ) and cost (i, j ) [Equations (2) and (3), respectively], P C stacking of basepairs, called C(bp ). i, j the predicted structure can be evaluated and a move either accepted McCaskill showed how to calculate the partition function over all or rejected based on Equation (4) below. possible secondary structures of an RNA sequence. The basepair probabilities are found using the weighted Boltzman ensemble favoring more stable structures. We use RNAfold, which is part of the Vienna 2.1.3 Combined cost Since we simultaneously optimize both package (Hofacker, 2003), to calculate the probability matrices. Since sequence alignment and structure prediction, the cost function is a gaps are added to the sequences as part of the alignment this has to be combination of three terms: the log-likelihood cost in Equation (1), taken into account when indexing the matrices: the partition function is the basepair probability cost in Equation (2), and the covariation cost 3306 MASTR in Equation (3). The cost of the secondary structure is given as a sum initial alignment is 1.06 L where L is the length of the longest max max over all basepairs in the structure S: sequence. The moves through the solution space can either affect the sequence alignment or the structure. Since it makes little sense to try costðA; SÞ ¼ QðAÞþðÞ cost ði; jÞþ cost ði; jÞ P C and deduce a common structure from randomly aligned sequences, ði; jÞ2S the first iterations are purely sequence moves. As the alignment becomes more stable, we start doing a combination of sequence and The parameters and are parameters used to balance the structure moves. contributions from the different terms in the combined cost. As default, The total number of iterations performed depends both on the length they are set to ¼ 1.5 and ¼ 0.6, which are obtained from an initial of the longest sequence, since that affects the number of structure grid search parameter optimization (data not shown). moves needed, and on the size of the alignment, since that affects the number of sequence moves needed. The alignment size is measured as 2.2 Optimizing the solution the total number of nucleotides in the dataset, N . The dependencies total Simulated annealing (Kirkpatrick et al., 1983) is an optimization are denoted N and L , respectively, and the number of iterations is dep dep technique inspired by the physical process of annealing, which describes found as: the slow cooling of material to form a crystal structure. The idea is that I ¼ N N þ L L dep total dep max the positions of the individual atoms can be described as a probability distribution dependent on the temperature of the system: at high We use N ¼ 1000 and L ¼ 1700 as default. After initially only dep dep temperatures the atoms have a high energy and therefore move around, performing sequence moves, a mixture of alignment and structure but as the temperature is lowered, the system becomes more stable. altering moves are performed. The structure moves are initiated either The goal is to form crystals with few defects, and the most stable after a fixed fraction of the total number of iterations or, as per default, crystal structure is the one with the lowest free energy. If the after N N iterations. The remaining iterations are a mix of dep total temperature of the system is decreased too fast, the crystal structure sequence and structure moves. The ratio between the two is set by a becomes brittle since the system becomes stuck in a local energy parameter R. Per default, R ¼ 0.75 of the last iterations are structure minimum. If the temperature is decreased slowly, the local energy moves. minima can be avoided due to the thermal fluctuations, and the Initially, all moves are accepted (i.e. a temperature of infinity is used) structure becomes more ordered and stable, and the minimum free and the first 0.1% of the iterations are used to determine a good starting energy conformation may be reached. temperature. These results are used to estimate the standard deviation Simulated annealing works in analogy to this. In order to escape of the cost distribution. By deciding on the desired initial probability from local minima of a cost or energy function, steps towards worse of acceptance P the temperature T can be determined: 0 0 states (i.e. higher cost) should be taken often in the beginning (at high T ¼ temperature) and occasionally later at lower temperatures. This is logðÞ P done in a Monte Carlo simulation with an artificial temperature We use P ¼ 0.99 as default. The scaling of the temperature has to make parameter that has absolutely no physical meaning. The probability sure that we end up close to T ¼ 0. An exponential scaling is used: of acceptance depends on the change in cost (huge increases should T ¼ T where 0 < < 1 be accordingly improbable) as well as on the number of iterations i i1 (since the system is closer to the ‘stable’ optimum towards the end). Wanting the final temperature to be T ¼ 10 , this yields: final Given an infinite amount of time, it can be shown that simulated 0 1 annealing will approach the optimal solution to any finite problem log @ A ¼ exp (Ha¨ ggstro¨ m, 2002). Simulated annealing can be used to minimize any cost function, and has for instance been used for multiple alignment (Lukashin et al., 1992). 2.2.1 Sequence moves The moves aimed at changing the sequence Simulated annealing depends on an artificial temperature T that alignment do this by moving gaps in the sequences. They can of course decreases over time. Initially the temperature should be high enough be moved without altering the order of the nucleotides. Three different to give an ‘unstable system’—in this case an alignment prone to types of moves are implemented which in combination ensures that the changes. As more and more changes are sampled, the temperature alignment can be reduced, extended and altered locally. decreases to ‘stabilize’ the system. Normally the temperature decreases exponentially (Lukashin et al., 1992), although there is no theoretical Gap block move: local changes are facilitated through gap blocks. reason for this. If the new cost is lower than the previous, the change is A gap block is a subsequence consisting of 1 or more consecutive always accepted. If the change increases the cost, the chance of gaps in 1 or more aligned sequences. To make this move, a random acceptance P depends on the old cost c , the new (larger) cost c OLD NEW gap in a random sequence is picked. Then the gap block is extended and the temperature T: vertically with probability 0.85 through the other sequences c c NEW OLD containing a gap at that position. Afterwards, the gap block is P ¼ exp ð4Þ extended horizontally to both sides with probability 0.85 if all the chosen sequences contain a gap there. Finally, the gap block This is known as the Metropolis–Hastings algorithm (Hastings, 1970, is moved to a randomly chosen new position in the alignment. Kirkpatrick et al., 1983). Using this, the possible states are sampled The procedure is illustrated in Figure 1 and constitutes 85% of the based on the cost of the current state. Since a state only depends on sequence moves. the previous state, this generates a Markov chain. In combination, MCMC using simulated annealing can be used to sample the solution Gap insertion: insertion of gaps has to insert the same number of space of multiple alignments and RNA structures. Changes can be gaps in all sequences. One could insert the gaps at random made by moving the gaps in the alignment and by adding or removing positions in all sequences, but that would greatly affect the cost of basepairs in the structure, and the move is either rejected or accepted the alignment. Instead, the gaps are inserted in either end based on the change in cost. of the alignment. From these positions the gaps can diffuse into The initial alignment is built by adding gaps at random to all the alignment as needed. These moves constitute 10% of the sequences until they have equal length. By default, the length of the sequence moves. 3307 S.Lindgreen et al. Table 1. Detailed comparison of the predictions by MASTR and RNA Sampler Family ID Seqs Length MASTR RNA Sampler Fig. 1. Illustration of the gap block moves used in the sampling MCC SPS MCC SPS approach. (a) choose a gap at random in a sequence, (b) extend vertically, (c) extend horizontally and (d) move to new position in tRNA 35 11 74 0.72 0.59 0.90 0.68 alignment. 40 11 72 0.89 0.84 0.97 0.79 45 12 71 0.89 0.88 0.94 0.88 50 12 73 0.99 0.97 0.85 0.95 Gap deletion: removing gaps is done by locating gap columns, i.e. 44 12 73 1.00 0.98 0.98 0.91 columns in the alignment containing a gap in all of the sequences. 60 11 73 1.00 0.99 1.00 0.99 Using a well–defined cost function, superfluous gaps are placed in 65 11 73 1.00 0.99 0.86 0.99 the same columns of the alignment. These moves constitute 5% 68 7 73 0.89 1.00 0.99 0.99 of the sequence moves. 75 5 72 1.00 0.98 1.00 0.97 80 6 72 1.00 0.99 1.00 0.99 84 6 69 0.14 0.98 0.31 0.98 2.2.2 Structure moves. Structural moves either add or delete 90 7 72 0.83 1.00 0.98 0.98 basepairs. The structure is forced to contain only non–crossing 95 8 73 0.97 1.00 0.97 1.00 basepairs (i.e. prediction of pseudoknots is not yet supported), and a 97 6 73 0.34 1.00 0.48 1.00 minimum loop length of 3 nt is ensured. Using the three simple moves 5S rRNA 36 5 110 0.45 0.62 0.51 0.60 described below, the structure can be built, extended and reduced. 42 6 111 0.66 0.74 0.62 0.59 45 8 116 0.54 0.76 0.41 0.59 Adding a basepair: a new basepair is added by choosing a 50 9 113 0.59 0.74 0.52 0.73 nucleotide pair (i, j) at random and adding it to the structure if it 55 10 113 0.56 0.91 0.40 0.91 does not violate the constraints. These moves constitute 70% of the 60 11 118 0.89 0.96 0.75 0.90 structure moves. 65 11 117 0.83 0.97 0.81 0.97 Extending a stem: a stem is extended by choosing a basepair (i, j ) 69 10 116 0.52 0.95 0.50 0.92 75 9 115 0.58 0.96 0.53 0.95 already in the structure. The stem that includes basepair (i, j) is then 80 11 118 0.93 1.00 0.85 0.98 extended by adding a new basepair to it, with a 50% chance of 85 15 117 0.11 1.00 0.13 0.99 extending the stem either internally or externally. These moves 89 15 117 0.04 0.97 0.01 0.98 constitute 20% of the structure moves. 95 20 117 0.20 1.00 0.12 1.00 Deleting a basepair: deleting a basepair is done by choosing a pair 97 13 117 0.38 1.00 0.10 0.98 (i, j ) in the structure and removing it. This cannot lead to any new U5 36 5 122 0.40 0.33 0.44 0.38 violations of the constraints. These moves constitute 10% of the 41 6 123 0.48 0.37 0.49 0.42 structure moves. 44 9 120 0.46 0.45 0.64 0.49 51 8 120 0.50 0.51 0.63 0.56 55 10 118 0.67 0.61 0.72 0.64 60 9 115 0.71 0.68 0.82 0.65 2.3 Datasets 65 9 114 0.78 0.76 0.52 0.68 Since consensus structures were not available from BRaliBase II 70 10 115 0.87 0.81 0.93 0.81 (Gardner et al., 2005) at the time, we sampled alignments with 75 8 116 0.90 0.82 0.93 0.84 consensus structures in much the same way as in the BRaliBase 80 8 119 0.50 0.83 0.24 0.76 study. MASTR relies on the partition function to calculate basepair 86 7 116 0.96 0.96 0.96 0.94 probability matrices, so we have chosen to use only short (70–250 nt) 90 9 115 0.95 0.98 0.92 0.95 sequences in the test. There are known problems when using the 95 17 115 1.00 1.00 0.98 0.99 partition function to calculate basepair probability matrices for long 97 7 116 0.96 1.00 0.95 1.00 sequences. Hence, the program will not perform well on long sequences TPP 35 8 122 0.33 0.42 0.45 0.40 until this has been dealt with. 41 9 104 0.37 0.65 0.71 0.51 Datasets were generated from large, trusted seed alignments obtained 45 11 105 0.43 0.75 0.61 0.64 from Rfam (Griffiths-Jones et al., 2005). The 5 families chosen are 50 11 103 0.55 0.72 0.65 0.72 tRNA, 5S ribosomal RNA (5S rRNA), U5 spliceosomal RNA (U5), 55 11 107 0.70 0.86 0.74 0.78 Hepatitis C virus internal ribosome entry site (IRES) and TPP 58 7 102 0.63 0.91 0.69 0.81 riboswitch THI element (TPP). 61 4 101 0.66 0.97 0.59 0.85 From each RNA family, a number of datasets were sampled. Each 70 4 90 0.45 0.86 0.31 0.85 dataset has an average pairwise identity within a specified 10% interval. ires 69 2 249 0.40 0.59 0.46 0.86 These intervals go from 30% to 100% with 5% overlap, i.e. 30–40%, 73 4 250 0.54 0.91 0.45 0.86 35–45%, 40–50%, etc. This sampling procedure gave 14 datasets from tRNA, 5S rRNA and U5, 8 datasets from TPP and 2 datasets from For each dataset, the average pairwise identity (ID) is shown along with the IRES. In total, 52 datasets were sampled and details on average number of sequences (Seqs) and the average sequence length (Length). The results pairwise identity, number of sequences and average sequence lengths for MASTR and RNA Sampler are detailed (MCC: structure quality, SPS: can be seen in Table 1. The datasets can be obtained from http:// alignment quality). For each dataset, the best results are highlighted with bold, mastr.binf.ku.dk/. and identical results with italics. 3308 MASTR (a) 3 RESULTS The program MASTR is implemented in Cþþ and tested against the programs FoldalignM, LocARNA and RNA Sampler. RNAcast is used to produce input to RNAforester, and Clustal alignments were used as input to RNAalifold. All programs were used with their default settings except for RNAforester where the clustering cutoff had to be changed in order to produce complete alignments of all sequences. MASTR FoldalignM does not predict a single consensus structure but LocARNA Clustal + Alifold returns a structure for each sequence in the final alignment. FoldalignM RNAsampler Therefore, we define the consensus structure to be the basepairs RNAforester that are predicted for all sequences. The other programs all 30 40 50 60 70 80 90 predict a single multiple alignment and consensus structure. %ID To evaluate the predicted structures, Matthew’s Correlation Coefficient (MCC) is used: (b) TP TN FP FN MCC ¼ pﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ ðTP þ FPÞðTP þ FNÞðTN þ FPÞðTN þ FNÞ Let TP be the number of truly predicted basepairs, FP be the number of predicted basepairs not in the reference structure and FN be the number of basepairs in the reference structure not predicted by the program. TN is defined here as the number of possible basepairing interactions in a sequence that are MASTR not predicted and not in the reference structure, i.e. pairs of LocARNA Clustal + Alifold nucleotide xy that are at least 4 nt apart, and where FoldalignM RNAsampler xy2{AU,UA,CG,GC,UG,GU}. RNAforester To evaluate the quality of the alignment, the Sum of 30 40 50 60 70 80 90 Pairs score (SPS) (Thompson et al., 1999) is used. SPS is a %ID sensitivity–like measure that compares the predicted alignment to a reference. For each pair of aligned sequences, the number (c) of aligned positions that are present in both the prediction and the reference alignment is counted. The total number of correctly aligned positions is then compared to the total number of aligned non–gap pairs present in the reference alignment. This yields a number between 0 and 1 where 1 is perfect correspondence between prediction and reference. In Figure 2, the performance of the programs are compared in terms of structure quality (plot a), alignment quality (plot b) MASTR and running time (plot c) as a function of the average pairwise LocARNA Clustal + Alifold identity of the datasets (%ID). The plots are averages over the FoldalignM RNAsampler different RNA families used for each %ID point. RNAforester The test shows that MASTR can predict consensus struc- 30 40 50 60 70 80 90 tures of a quality comparable to other existing methods. On %ID the lowest identity datasets MASTR is outperformed by RNA Sampler, but after 45% ID the structure predictions of Fig. 2. The performance of the tested programs in terms of structure MASTR are on average better than or comparable to the prediction (a), alignment quality (b) and running time (c) as a function best programs tested. MASTR is consistently better than or of average pairwise identity (%ID). Each plot shows the performance comparable to all other methods regarding alignment quality. as the average over the RNA families used. Note that plot c is on a logarithmic scale. Black: MASTR, Blue: LocARNA, Green: As it can be seen, MASTR is clearly faster than FoldalignM Clustal þ RNAalifold, Red: FoldalignM, Orange: RNA Sampler, by up to an order of magnitude, while LocARNA is even Purple: RNAcast þ RNAforester. faster. Clustal þ RNAalifold is of course by far the fastest method used. RNAforester has a running time comparable to LocARNA but produces consistently worse alignments— be explained by the lack of covariation in these datasets. Most probably due to the fact that all sequences had to be included of the methods rely on some signal from compensatory in the same alignment for comparison. The structure predic- tions are comparable to FoldalignM. mutations, and this signal diminishes as the sequences become The dip in the quality of the structure predictions that too similar. RNA Sampler does not depend on a covariation is visible for all methods in the highest identity range could term that could explain why the dip is less prominent here. log(seconds) SPS MCC 1e-01 1e+00 1e+01 1e+02 1e+03 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 S.Lindgreen et al. Likewise, FoldalignM shows an almost monotone increase in Thus, new sequences can be aligned to reference alignments in the predictions as a function of identity. a structurally sound manner. Since RNA Sampler seems to be the best of the other methods tested here, a more detailed comparison of MASTR and RNA Sampler can be seen in Table 1. In total, 52 datasets ACKNOWLEDGEMENTS are used. The running time is on the same scale for the two S.L. acknowledges support from the Novo Scholarship programs, although MASTR is in general slightly faster. The Program. The work was supported by the Carlsberg Founda- structure predictions are better than or equal to RNA Sampler tion and the Danish National Research Foundation. in 28 cases (54%), and the alignments are better than or equal The authors thank two anonymous referees for their helpful to RNA Sampler in 43 cases (83%). The differences seem to comments; especially one reviewer for interesting comments depend both on the RNA family and on the level of identity. on covariation. Conflict of Interest: none declared. 4 DISCUSSION We have developed a new algorithm for simultaneous align- REFERENCES ment and structure prediction of multiple non-coding RNA sequences. As shown above, MASTR is highly competitive both Athanasius,F. Bompfu¨ newerer Consortium et al. (2007) RNAs everywhere: genome-wide annotation of structured RNAs. J. Exp. Zoolog. Mol. Dev. in terms of structure prediction quality, sequence alignment and Evol., 308, 1–25. running time. The program can also handle larger datasets Bompfu¨ newerer,A.F. et al. (2005) Evolutionary patterns of non-coding RNAs. than, e.g. RNA Sampler or FoldalignM. Theory Biosci., 123, 301–369. Although we have not used it in the above tests, it is also Cheng,J. et al. (2005) Transcriptional maps of 10 human chromosomes at possible to add structural constraints if some knowledge 5-nucleotide resolution. Science, 308, 1149–1154. Das,R. and Baker,D. (2007) Automated de novo prediction of native-like RNA is available about one of the sequences (e.g. known basepairs, tertiary structures. Proc. Natl Acad. Sci., 104, 14664–14669. knowledge about upstream or downstream interactions, Ding,Y. and Lawrence,C.E. (2003) A statistical sampling algorithm for RNA or knowledge about non-basepaired positions). Additionally, secondary structure prediction. Nucleic Acids Res., 31, 7280–7301. already aligned sequences can be used as input with or without Durbin,R. et al. (1998) Biological Sequence Analysis. Probabilistic Models of a consensus structure. Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK. Gardner,P. et al. (2005) A benchmark of multiple sequence alignment programs As the testing of the program showed, MASTR does not upon structural RNAs. Nucleic Acids Res., 33, 2433–2439. have top performance on very dissimilar sequences. In this Giegerich,R. et al. (2004) Abstract shapes of RNA. Nucleic Acids Res., 32, range, one would assume that covariance is important, and it is 4843–4851. therefore interesting that RNA sampler, which does not use Gorodkin,J. et al. (1997) Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res., 25, 3724–3732. covariance, is better. One possible explanation for this is Griffiths-Jones,S. et al. (2005) Rfam: annotating non-coding RNAs in complete that covariation in itself is not enough to deduce structure genomes. Nucleic Acids Res., 33, D121–D124. from alignment. Covariation is only an indicator of conserved Ha¨ ggstro¨ m,O. (2002) Finite Markov Chains and Algorithmic Applications. basepairs, but it is not sufficient to predict pairing columns [this Cambridge University Press, Cambridge, UK. corresponds well with our previous study (Lindgreen et al., Hastings,W.K. (1970) Monte carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109. 2006)]. MASTR builds up the structure in small steps, which Havgaard,J.H. et al. (2005) Pairwise local structure alignment of RNA sequences might make it vulnerable to erronously high covariation, with sequence similarity less than 40%. Bioinformatics, 21, 1815–1824. whereas RNA Sampler makes sure that the alignment is Hein,J. (1989) A new method that simultaneously aligns and reconstructs structurally sound by fixing whole stems. MASTR therefore ancestral sequences for any number of homologous sequences when the phylogeny is given. Mol. Biol. Evol., 6, 649–668. needs to have a relatively stable (and correct) alignment before Ho¨ chsmann,M. et al. (2003) Local similarity of RNA secondary structures. predicting structure. This could explain the relatively poor In Proceedings of the IEEE Computational Systems Bioinformatics Conference performance on low identity datasets, and this should be (CSB 2003). pp. 159–168. explored further. Hofacker,I. (2003) Vienna RNA secondary structure server. Nucleic Acids Res., In future work, we would like to make a local version of the 31, 3429–3431. Hofacker,I. et al. (2004) Alignment of RNA base pairing probability matrices. program. This would be ideal for dealing with long sequences Bioinformatics, 20, 2222–2227. where there are known problems with the standard basepair Hofacker,I.L. et al. (2002) Secondary structure prediction for aligned RNA probability matrices. Since MASTR does not have the same sequences. J. Mol. Biol., 319, 1059–1066. limitations towards crossing basepairs as pure energy-based Jossinet,F. et al. (2007) RNA structure: bioinformatic analysis. Curr. Opin. methods, an extension to include the prediction of pseudoknots Microbiol., 10, 279–285. Kirkpatrick,S. et al. (1983) Optimization by simulated annealing. Science, 220, will also be pursued. 671–680. One of the advantages of MASTR is that the optimization is Knudsen,B. and Hein,J. (1999) RNA secondary structure prediction using decoupled from the cost function, which makes it very easy to stochastic context-free grammars and evolutionary history. Bioinformatics, change the latter. It also makes it reasonably straight forward 15, 446–454. Knudsen,B. and Hein,J. (2003) Pfold: RNA secondary structure prediction using to add phylogenetic prediction to the program, which would stochastic context-free grammars. Nucleic Acids Res., 31, 3423–3428. be similar to the goal of SimulFold (Meyer and Miklos, 2007), Lindgreen,S. et al. (2006) Measuring covariation in RNA alignments: physical but MASTR functions in a very different way. We would also realism improves information measures. Bioinformatics, 22, 2988–2995. like to make it possible to input a set of related and already Lukashin,A.V. et al. (1992) Multiple alignment using simulated annealing: branch aligned sequences together with the set of unaligned sequences. point definition in human mRNA splicing. Nucleic Acids Res., 20, 2511–2516. 3310 MASTR Mathews,D.H. and Turner,D.H. (2002) Dynalign: an algorithm for finding the Siebert,S. and Backofen,R. (2005) MARNA: multiple alignment and consensus secondary structure common to two RNA sequences. J. Mol. Biol., 317, structure prediction of RNAs based on sequence structure comparisons. 191–203. Bioinformatics, 21, 3352–3359. McCaskill,J.S. (1990) The equilibrium partition function and base pair Tai,K. (1979) The tree-to-tree correction problem. J. ACM, 26, 422–433. binding probabilities for RNA secondary structure. Biopolymers, 29, Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive 1105–1119. multiple sequence alignment through sequence weighting, position-specific Meyer,I.M. and Miklos,I. (2007) SimulFold: simultaneously inferring RNA gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680. structures including pseudoknots, alignments and trees using a Bayesian Thompson,J.D. et al. (1999) A comprehensive comparison of multiple sequence MCMC framework. PLoS Comput. Biol., 3, e149. alignment programs. Nucleic Acids Res., 27, 2682–2690. Meyer,I.M. (2007) A practical guide to the art of RNA gene prediction. Brief. Torarinsson,E. et al. (2007) Multiple structural alignment and clustering of RNA Bioinformatics, Epub ahead of print, PMID: 17483123. sequences. Bioinformatics, 23, 926–932. Nussinov,R. et al. (1978) Algorithms for loop matchings. SIAM J. Appl. Math., Washietl,S. et al. (2005) Mapping of conserved RNA secondary structures 35, 68–82. predicts thousands of functional noncoding RNAs in the human genome. Onoa,B. and Tinoco,I. Jr (2004) RNA folding and unfolding. Curr. Opin. Struct. Nat. Biotechnol., 23, 1383–1390. Biol., 14, 374–379. Will,S. et al. (2007) Inferring noncoding RNA families and classes by means of Pedersen,J. et al. (2006) Identification and classification of conserved RNA genome-scale structure-based clustering. PLoS Comput. Biol., 3,e65. secondary structures in the human genome. PLoS Comput. Biol., 2, e33. Xu,X. et al. (2007) RNA Sampler: a new sampling based algorithm for Reeder,J. and Giegerich,R. (2005) Consensus shapes: An alternative to the common RNA secondary structure prediction and structural alignment. Sankoff algorithm for RNA consensus structure prediction. Bioinformatics, Bioinformatics, 23, 1883–1891. 21, 3516–3523. Zhang,K. and Shasha,D. (1989) Simple fast algorithms for the editing distance Sankoff,D. (1985) Simultaneous solution of the RNA folding, alignment and between trees and related problems. SIAM J. Comput., 18, 1245–1262. protosequence problems. SIAM J. Appl. Math., 45, 810–825. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA Shapiro,B.A. et al. (2007) Bridging the gap in RNA structure prediction. Curr. sequences using thermodynamics and auxiliary information. Nucleic Acids Opin. Struct. Biol., 17, 157–165. Res., 9, 133–148. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/mastr-multiple-alignment-and-structure-prediction-of-non-coding-rnas-aQt0V5KFQp

Loading next page...

References (46)

S. Will, K. Reiche, I. Hofacker, P. Stadler, R. Backofen (2007)
Inferring Noncoding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering
PLoS Computational Biology, 3
R. Giegerich, B. Voß, Marc Rehmsmeier (2004)
Abstract shapes of RNA.
Nucleic acids research, 32 16
R. Durbin, S. Eddy, A. Krogh, G. Mitchison (1998)
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
J. Hein (1989)
A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given.
Molecular biology and evolution, 6 6
Irmtraud Meyer, I. Miklós (2007)
SimulFold: Simultaneously Inferring RNA Structures Including Pseudoknots, Alignments, and Trees Using a Bayesian MCMC Framework
PLoS Computational Biology, 3
S. Lindgreen, P. Gardner, A. Krogh, Bioinformatics Centre
Bioinformatics Original Paper Measuring Covariation in Rna Alignments: Physical Realism Improves Information Measures
Irmtraud Meyer (2007)
A practical guide to the art of RNA gene prediction
Briefings in bioinformatics, 8 6
B. Shapiro, Yaroslava Yingling, W. Kasprzak, Eckart Bindewald (2007)
Bridging the gap in RNA structure prediction.
Current opinion in structural biology, 17 2
F. Jossinet, Thomas Ludwig, E. Westhof (2007)
RNA structure: bioinformatic analysis.
Current opinion in microbiology, 10 3
S. Kirkpatrick, C. Gelatt, Mario Vecchi (1983)
Optimization by Simulated Annealing
Science, 220
M. Höchsmann, Thomas Töller, R. Giegerich, S. Kurtz (2003)
Local similarity in RNA secondary structures
Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003
D. Sankoff (1985)
Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems
Siam Journal on Applied Mathematics, 45
Sven Siebert, R. Backofen (2005)
MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons
Bioinformatics, 21 16
I. Hofacker (2003)
Vienna RNA secondary structure server
Nucleic acids research, 31 13
D. Mathews, D. Turner (2002)
Dynalign: an algorithm for finding the secondary structure common to two RNA sequences.
Journal of molecular biology, 317 2
I. Hofacker, Martin Fekete, P. Stadler (2002)
Secondary structure prediction for aligned RNA sequences.
Journal of molecular biology, 319 5
M. Zuker, P. Stiegler (1981)
Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information
Nucleic acids research, 9 1
Bjarne Knudsen, J. Hein (2003)
Pfold: RNA secondary structure prediction using stochastic context-free grammars
Nucleic acids research, 31 13
K. Tai (1979)
The Tree-to-Tree Correction Problem
J. ACM, 26
Stefan Washietl, I. Hofacker, Melanie Lukasser, A. Hüttenhofer, P. Stadler (2005)
Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome
Nature Biotechnology, 23
I. Hofacker, S. Bernhart, P. Stadler (2004)
Alignment of RNA base pairing probability matrices
Bioinformatics, 20 14
Jill Cheng, P. Kapranov, J. Drenkow, S. Dike, Shane Brubaker, Sandeep Patel, J. Long, D. Stern, H. Tammana, G. Helt, V. Sementchenko, A. Piccolboni, S. Bekiranov, D. Bailey, M. Ganesh, Srinka Ghosh, I. Bell, D. Gerhard, T. Gingeras (2005)
Transcriptional Maps of 10 Human Chromosomes at 5-Nucleotide Resolution
Science, 308
Ye Ding, C. Lawrence (2003)
A statistical sampling algorithm for RNA secondary structure prediction.
Nucleic acids research, 31 24
Bibiana Onoa, I. Tinoco (2004)
RNA folding and unfolding.
Current opinion in structural biology, 14 3
J. Thompson, D. Higgins, T. Gibson (1994)
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
Nucleic acids research, 22 22
R. Nussinov, G. Pieczenik, Jerrold Griggs, D. Kleitman (1978)
Algorithms for Loop Matchings
Siam Journal on Applied Mathematics, 35
Alexander Lukashin, J. Engelbrecht, Søren Brunak (1992)
Multiple alignment using simulated annealing: branch point definition in human mRNA splicing.
Nucleic acids research, 20 10
E. Torarinsson, J. Havgaard, Jan Gorodkin (2007)
Multiple structural alignment and clustering of RNA sequences
Bioinformatics, 23 8
BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm272 Sequence analysis RNA Sampler: a new sampling based algorithm for common RNA
W. Hastings (1970)
Monte Carlo Sampling Methods Using Markov Chains and Their Applications
Biometrika, 57
Xing Xu, Yongmei Ji, G. Stormo (2007)
RNA Sampler
Bioinformatics, 23
Rhiju Das, D. Baker (2007)
Automated de novo prediction of native-like RNA tertiary structures
Proceedings of the National Academy of Sciences, 104
J. Gorodkin, Laurie Heyer, G. Stormo (1997)
Finding the most significant common sequence and structure motifs in a set of RNA sequences.
Nucleic acids research, 25 18
P. Gardner, A. Wilm, Stefan Washietl (2005)
A benchmark of multiple sequence alignment programs upon structural RNAs
Nucleic Acids Research, 33
Bjarne Knudsen, J. Hein (1999)
RNA secondary structure prediction using stochastic context-free grammars and evolutionary history
Bioinformatics, 15 6
J. McCaskill (1990)
The equilibrium partition function and base pair binding probabilities for RNA secondary structure
Biopolymers, 29
J. Thompson, F. Plewniak, O. Poch (1999)
A comprehensive comparison of multiple sequence alignment programs
Nucleic acids research, 27 13
Dūrbin (1998)
Biological Sequence Analysis
I. Hofacker (2006)
RNAs everywhere: genome-wide annotation of structured RNAs.
Genome informatics. International Conference on Genome Informatics, 17 2
Athanasius Bompfünewerer, C. Flamm, C. Fried, G. Fritzsch, I. Hofacker, Jörg Lehmann, Kristin Missal, A. Mosig, B. Müller, Sonja Prohaska, B. Stadler, P. Stadler, Andrea Tanzer, Stefan Washietl, C. Witwer (2005)
Evolutionary patterns of non-coding RNAs
Theory in Biosciences, 123
Kaizhong Zhang, D. Shasha (1989)
Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems
SIAM J. Comput., 18
J. Pedersen, G. Bejerano, A. Siepel, K. Rosenbloom, K. Lindblad-Toh, E. Lander, Jim Kent, W. Miller, D. Haussler (2006)
Identification and Classification of Conserved RNA Secondary Structures in the Human Genome
PLoS Computational Biology, 2
Olle Häggström (2002)
Finite Markov Chains and Algorithmic Applications: Computer simulation of Markov chains
Jens Reeder, R. Giegerich (2005)
Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction
Bioinformatics, 21 17
S. Griffiths-Jones, Simon Moxon, M. Marshall, Ajay Khanna, S. Eddy, A. Bateman (2004)
Rfam: annotating non-coding RNAs in complete genomes
Nucleic Acids Research, 33
BIOINFORMATICS ORIGINAL PAPER Sequence analysis Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%

Publisher: Oxford University Press
Copyright: © The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
eISSN: 1367-4811
DOI: 10.1093/bioinformatics/btm525
pmid: 18006551
Publisher site: See Article on Publisher Site

Abstract

Vol. 23 no. 24 2007, pages 3304–3311 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm525 Sequence analysis MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing 1, 1,2,† 1 Stinus Lindgreen , Paul P. Gardner and Anders Krogh 1 2 Bioinformatics Centre and Molecular Evolution Group, Department of Molecular Biology, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen N, Denmark Received on August 20, 2007; revised on October 11, 2007; accepted on October 13, 2007 Advance Access publication November 15, 2007 Associate Editor: Dmitrij Frishman involved in gene expression and cell specialization, vault RNAs ABSTRACT seem to be involved in multi-drug resistance important for the Motivation: As more non–coding RNAs are discovered, the treatment of cancer, and small nuclear RNAs (snRNA) are key importance of methods for RNA analysis increases. Since the players in the splicing of pre-mRNA. A large number of other structure of ncRNA is intimately tied to the function of the molecule, ncRNA families have yet to be functionally characterized. programs for RNA structure prediction are necessary tools in this It has also become clear that these non-protein coding genes growing field of research. Furthermore, it is known that RNA vary greatly in size, ranging from microRNAs of 20 nt to more structure is often evolutionarily more conserved than sequence. than 10 000 nt in RNAs involved in eukaryotic gene silencing However, few existing methods are capable of simultaneously (Jossinet et al., 2007), and also that they are transcribed in considering multiple sequence alignment and structure prediction. Result: We present a novel solution to the problem of simultaneous different ways: some reside in introns of protein coding genes, structure prediction and multiple alignment of RNA sequences. while others are large transcripts that include introns and the possibility of alternative splicing although they lack the open Using Markov chain Monte Carlo in a simulated annealing frame- reading frame of a protein coding gene (Meyer, 2007). work, the algorithm MASTR (Multiple Alignment of STructural RNAs) Experimental studies show that a huge fraction of the human iteratively improves both sequence alignment and structure predic- genome is transcribed (Cheng et al., 2005), and computational tion for a set of RNA sequences. This is done by minimizing a studies show evidence that thousands of structurally conserved combined cost function that considers sequence conservation, RNAs can be found in the human genome (Pedersen et al., covariation and basepairing probabilities. The results show that the method is very competitive to similar programs available today, 2006; Washietl et al., 2005). There is therefore little doubt that both in terms of accuracy and computational efficiency. RNAs are biologically very important, and the structural Availability: Source code available from http://mastr.binf.ku.dk/ analysis of RNA sequences is a field of growing interest. Contact: stinus@binf.ku.dk Through evolution, the sequences of related RNAs can diverge although the structure remains conserved. Pure sequence comparison methods therefore fail when applied to ncRNAs that have diverged too much (Gardner et al., 2005). It is ultimately the tertiary structure that determines the 1 INTRODUCTION function of the molecule, and advances are being made in In recent years, the amount of evidence that RNAs play a this field (Das and Baker, 2007; Shapiro et al., 2007). However, much more active role in the cell than previously thought has in the case of RNA it is often sufficient to determine the grown dramatically. The view has now shifted away from the secondary structure. The reason is that the formation of assumption that non-coding RNAs (ncRNA) merely helped in secondary structure is fast, and the basepairing interactions the protein synthesis (e.g. tRNA, rRNA), and today a wide are strong. The secondary structure, therefore, contributes variety of catalytically active RNAs or ribozymes have been the major part of the folding energetics, forming a stable characterized. It has also become clear that ncRNA is a very scaffold for the formation of tertiary interactions (Onoa and diverse group of molecules both in terms of function and Tinoco, 2004). structure. There exist methods to fold a single RNA sequence either RNA molecules have been found to play important by maximizing basepairing interactions (Nussinov et al., 1978), and diverse roles (Athanasius F. Bompfu¨ newerer Consortium or by minimizing the free energy of the structure [mfold (Zuker et al., 2007; Bompfu¨ newerer et al., 2005; Meyer, 2007): and Stiegler, 1981); RNAfold (Hofacker, 2003)]. Another the recently discovered family of microRNAs (miRNA) is approach is to use an existing sequence alignment and predict a consensus structure based on this. In RNAalifold (Hofacker et al., 2002), this has been pursued using a combination of free energy and covariation. In Pfold (Knudsen and *To whom correspondence should be addressed. Present address: Wellcome Trust Sanger Institute, Cambridge, UK. Hein, 1999, 2003), a stochastic context-free grammar (SCFG) 3304 The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org MASTR is used to predict a common structure from a multiple extended to multiple sequences by considering all pairs in a alignment. set of multiple RNA sequences. If pseudoknots are disregarded, an RNA structure can be SimulFold (Meyer and Miklos, 2007) is a fully probabilistic represented as a tree. Since comparison of strings can be model using Bayesian Markov chain Monte Carlo. The extended to trees (Tai, 1979; Zhang and Shasha, 1989), program takes as input a set of unaligned sequences D and alignments could be based on the structures directly. samples both multiple alignment A, secondary structure S and In RNAforester (Ho¨ chsmann et al., 2003), the input is a set a phylogenetic tree T from the joint posterior probability of RNA sequences with (possibly predicted) secondary P(S,A,T|D). This very comprehensive program came out structures, and the problem thus becomes a forest alignment very recently, but although it has some methodology in problem. The program performs either local or global common with MASTR (e.g. sampling based on the likelihood alignment of the structures, and the output is an alignment of the solution) it does so in a very different way, which also and predicted consensus structure based on the structural shows in the computational complexity of the program. similarities. MARNA (Siebert and Backofen, 2005) is another Since the exact solution to the problem is too time and heuristic method, where a multiple structural alignment is memory consuming to be pursued, all the methods above are inferred from all pairwise alignments of secondary structures. simplified in one way or another. Furthermore, it has been Due to the tight relationship between sequence and suggested that the optimal minimum free energy structure is structure, the solution to the sequence alignment problem and not necessarily a good solution to the consensus structure the structure prediction is interdependent. Whether aligning problem (Ding and Lawrence, 2003). We therefore pursue a without considering the structure, or folding without consider- heuristic sampling approach where the structure and ing sequence alignment, information is ignored. Ideally, one sequence alignment can be optimized in parallel. In our should therefore perform the sequence alignment and structure approach a cost function (or energy) is defined as a sum of prediction in parallel. In 1985, Sankoff presented an exact three terms: an alignment term, a structure term and a solution to this NPhard problem (Sankoff, 1985), but the covariance term. This cost function is minimized using 3N exponential running time of O(L ) and memory usage of simulated annealing to obtain the combined alignment and 2N O(L ) makes it intractable even for problems of moderate size. structure with minimum cost—the best solution according to Various implementations of the Sankoff algorithm exist. the cost function. This optimization is carried out by changing Foldalign (Gorodkin et al., 1997; Havgaard et al., 2005) and the structure on the basepair level or by moving gaps around Dynalign (Mathews and Turner, 2002) are both limited to local in the sequence alignment. The change is then judged by pairwise alignment. In PMcomp/PMmulti (Hofacker et al., the change in the cost function and either accepted or rejected. 2004), the optimal alignment of two basepairing probability The procedure is implemented in the program called MASTR matrices is found instead of aligning the RNA sequences per se. (Multiple Alignment of STructural RNAs). A multiple alignment can be built using progressive alignment of the basepairing probability matrices. A similar progressive approach is used in FoldalignM (Torarinsson et al., 2007). 2 METHODS LocARNA (Will et al., 2007) is a local alignment tool similar to 2.1 Defining the cost function but more efficient than PMcomp, and this program can also To find a solution to the problem of simultaneous multiple be used to do progressive multiple alignment and structure alignment and structure prediction, we define a cost function that will prediction. In RNAcast (Reeder and Giegerich, 2005), be minimized in order to search for the optimal solution. A solution the consensus structure problem is dealt with in a different should minimize a combined cost function cost(A, S), which incorpo- way: by using abstract shapes (Giegerich et al., 2004), where the rates both the sequence alignment A and the predicted consensus structures can be regarded without all details but only using structure S. The different parameters used in the program (e.g. scaling the layout of the structure, the search space is reduced. parameters and thresholds) have been set using grid optimization. RNAcast predicts the best common shape for all the sequences A small number of low identity RNA datasets have been used to and, for each sequence, the energetically best structure. optimize the parameters by changing the settings slightly and In RNA Sampler (Xu et al., 2007) stems are the core building reevaluating the results. It should be noted that the datasets used for blocks. For each sequence, a list of all possible stems consisting optimizing the parameters are not the same as in the test, and that the datasets do not cover all the families used in the comparison. of consecutive A–U, G–C and G–U pairs is generated. A pairwise alignment is found by aligning all stems from one sequence with all stems from another, and the loop regions 2.1.1 Calculating alignment cost There exist many ways of determining the cost of a multiple alignment: Sum of Pairs using are aligned using ClustalW (Thompson et al., 1994). Since a substitution matrix and minimization of the entropy of the alignment bulges are not allowed in stems, the alignment process can be are two well–known examples (Durbin et al., 1998), and using a done efficiently by sliding one stem along the other. Such phylogenetic tree to sum the pairwise alignments inferred by the edges a block of aligned stems has a conservation score including has also been pursued (Hein, 1989). both nucleotide alignment probabilities and basepairing prob- During the development of the algorithm, various sequence cost abilities. From the set of blocks, a common structure is found functions were examined. Sum of Pairs and different entropy-based by sampling, and the probability of a block being chosen measures were tested using both single nucleotide and dinucleotide depends on the conservation score. The probability matrices domains. We selected the best performing cost function, which proved are then updated based on the sampled structures, and the to be a log-likelihood cost function inspired by Hidden Markov models process is iterated until convergence. This process has been (HMMs) over single nucleotides. 3305 S.Lindgreen et al. The cost function is fully probabilistic in its treatment of both gaps calculated once for each ungapped sequence s ¼ 1,..., N before the and nucleotides. We assume independence between the sites in the optimization starts, and the results are stored in individual probability alignment. When calculating the cost, we have an alignment of length L matrices M . These matrices do not change throughout the algorithm. consisting of N sequences. Let x denote the jth character in sequence i, For a basepair (i, j) in the alignment, we need to correct the indices to be i i and let Pðx Þ denote the probability of seeing character x at this specific able to find the probability for that particular basepair. Let these j j s s s s s position. Assuming the sites are independent, the probability of the gap-corrected indices be denoted (i , j ), where i ¼ i M and M is the multiple alignment A becomes: number of gaps preceding position i in sequence s, and similarly for index j. The probability for the basepair in sequence s is then found as L N YY s s s s s s PðAÞ¼ Pðx Þ M (i , j ). If either i or j is a gap in sequence s, M (i , j ) ¼ 0. A basepair j¼1 i¼1 (i, j) in the alignment is scored by the mean probability: The individual character probabilities need to be determined and N s s s Pðbp Þ¼ 1=N M ði ; j Þ gaps have to be taken into account. If x is a gap we have two cases: let i; j s¼1 P denote the gap open probability, i.e. the probability of having a GO gap at position j given that position j 1 contained a nucleotide. To transform this into a cost function for the basepairs, the negative Similarly, P is the gap extension probability used when both position GE logarithm of the mean probability is taken and a threshold is j and j 1 contain a gap. Both of these probabilities can be estimated introduced. The threshold reflects the background probability P of null from known structural alignments. In the program, they are set to random basepairs found in the probability matrices: P ¼ 0.5 and P ¼ 0.74. GO GE If x is a nucleotide from the alphabet ¼ {A,C,G,U}, the probability cost ði; jÞ¼ log Pðbp Þ þ logðÞ P ð2Þ P null j 2 i; j 2 Pðx Þ is calculated based on the nucleotides that comprise the column. A background probability of P ¼ 0.25 is used based on parameter null Additionally, the probability is dependent on the preceding character. If i optimization (data not shown). x ¼, we have a gap closing, and the probability is multiplied by j1 i Through evolution, related RNA sequences can mutate which leads (1 P ). Similarly, if x 2 , the probability is multiplied by GE j1 to different sequences of nucleotides while the same core secondary (1 P ). Let c (a) be the number of occurrences of nucleotide a2 GO j structure is retained. When a mutation happens at a position that is at position j in the alignment. The probabilities are given as: involved in a basepair, selection favors mutations at the other position i i P x ¼; x 2 GO > j j1 that maintain the structure and molecular function. This is known as i i P x ¼; x ¼ compensatory mutations. Thus, structure is often more conserved than > GE j j1 sequence, and this signal can be measured by a covariation score. c ðaÞ Pðx Þ¼ j i i ð1 P Þ x ¼ a 2 ; x 2 GO > j j1 > cjðbÞ In Lindgreen et al. (2006), we analyzed various measures of b2 > cjðaÞ i i covariation. We refer to this article for details, but here the chosen : P ð1 P Þ x ¼ a 2 ; x ¼ GE j j1 c ðbÞ b2 cost function will be briefly explained. The RNAalifold measure uses a matrix for each sequence , where ¼ 1 if sequence can A simple pseudo-count function is used where c (a) is incremented by i; i; j form a basepair between position i and j, and ¼ 0 otherwise. 1 for each a2 and 1 j L. An IUPAC ambiguity character is i; j The function ðx x ; x x Þ measures the Hamming distance between exchanged with one of the nucleotides it symbolizes with equal i j i j two aligned pairs at positions i and j in sequences and . The goal is probability. For instance, if an N occurs in a sequence, it is replaced to measure the fraction of consistently aligned pairs. A penalty term, by any one of the 4 nt with 25% chance each. Similarly, an S will be q , measures the fraction of sequences with inconsistent pairs in the exchanged with either a C or a G with a 50% chance each.This exchange i, j alignment. The covariation is then found as: is done once in the beginning of the program. Having these probabilities, P(A) can be calculated. The cost function used is the negative log-likelihood based on the alignment probability: B ¼ ðx x ; x x Þ q i;j i; j i j i j i; j i; j 2 5 QðAÞ¼1=N logðÞ PðAÞ ð1Þ To add stacking information, the two basepairs enclosing the pair in question are also considered, but more weight is put on the actual 2.1.2 Calculating structure cost The cost of the structure is pair: defined as the sum of the cost of the individual basepairs. Let S be the structure consisting of basepairs (i, j): B þ 2 B þ B i1; jþ1 i; j iþ1; j1 X Cðbp Þ¼ i; j costðSÞ ¼ costði; jÞ ði; jÞ2S To turn this into a cost function, the same approach is used as for the partition function above. The covariation score is negated and a There are two ways to score the structure: by the free energy of single threshold value added: sequences and by covariation. In the present work, we use the two measures that proved best at predicting true basepairs in our previous cost ¼Cðbp Þþ ð3Þ Cði;j Þ i; j study (Lindgreen et al., 2006): The McCaskill basepair probabilities (McCaskill, 1990), called P(bp ), and a novel version of the covariation i, j A threshold of ¼ 0.25 is used. Using the two cost functions measure used in RNAalifold (Hofacker et al., 2002) extended to include cost (i, j ) and cost (i, j ) [Equations (2) and (3), respectively], P C stacking of basepairs, called C(bp ). i, j the predicted structure can be evaluated and a move either accepted McCaskill showed how to calculate the partition function over all or rejected based on Equation (4) below. possible secondary structures of an RNA sequence. The basepair probabilities are found using the weighted Boltzman ensemble favoring more stable structures. We use RNAfold, which is part of the Vienna 2.1.3 Combined cost Since we simultaneously optimize both package (Hofacker, 2003), to calculate the probability matrices. Since sequence alignment and structure prediction, the cost function is a gaps are added to the sequences as part of the alignment this has to be combination of three terms: the log-likelihood cost in Equation (1), taken into account when indexing the matrices: the partition function is the basepair probability cost in Equation (2), and the covariation cost 3306 MASTR in Equation (3). The cost of the secondary structure is given as a sum initial alignment is 1.06 L where L is the length of the longest max max over all basepairs in the structure S: sequence. The moves through the solution space can either affect the sequence alignment or the structure. Since it makes little sense to try costðA; SÞ ¼ QðAÞþðÞ cost ði; jÞþ cost ði; jÞ P C and deduce a common structure from randomly aligned sequences, ði; jÞ2S the first iterations are purely sequence moves. As the alignment becomes more stable, we start doing a combination of sequence and The parameters and are parameters used to balance the structure moves. contributions from the different terms in the combined cost. As default, The total number of iterations performed depends both on the length they are set to ¼ 1.5 and ¼ 0.6, which are obtained from an initial of the longest sequence, since that affects the number of structure grid search parameter optimization (data not shown). moves needed, and on the size of the alignment, since that affects the number of sequence moves needed. The alignment size is measured as 2.2 Optimizing the solution the total number of nucleotides in the dataset, N . The dependencies total Simulated annealing (Kirkpatrick et al., 1983) is an optimization are denoted N and L , respectively, and the number of iterations is dep dep technique inspired by the physical process of annealing, which describes found as: the slow cooling of material to form a crystal structure. The idea is that I ¼ N N þ L L dep total dep max the positions of the individual atoms can be described as a probability distribution dependent on the temperature of the system: at high We use N ¼ 1000 and L ¼ 1700 as default. After initially only dep dep temperatures the atoms have a high energy and therefore move around, performing sequence moves, a mixture of alignment and structure but as the temperature is lowered, the system becomes more stable. altering moves are performed. The structure moves are initiated either The goal is to form crystals with few defects, and the most stable after a fixed fraction of the total number of iterations or, as per default, crystal structure is the one with the lowest free energy. If the after N N iterations. The remaining iterations are a mix of dep total temperature of the system is decreased too fast, the crystal structure sequence and structure moves. The ratio between the two is set by a becomes brittle since the system becomes stuck in a local energy parameter R. Per default, R ¼ 0.75 of the last iterations are structure minimum. If the temperature is decreased slowly, the local energy moves. minima can be avoided due to the thermal fluctuations, and the Initially, all moves are accepted (i.e. a temperature of infinity is used) structure becomes more ordered and stable, and the minimum free and the first 0.1% of the iterations are used to determine a good starting energy conformation may be reached. temperature. These results are used to estimate the standard deviation Simulated annealing works in analogy to this. In order to escape of the cost distribution. By deciding on the desired initial probability from local minima of a cost or energy function, steps towards worse of acceptance P the temperature T can be determined: 0 0 states (i.e. higher cost) should be taken often in the beginning (at high T ¼ temperature) and occasionally later at lower temperatures. This is logðÞ P done in a Monte Carlo simulation with an artificial temperature We use P ¼ 0.99 as default. The scaling of the temperature has to make parameter that has absolutely no physical meaning. The probability sure that we end up close to T ¼ 0. An exponential scaling is used: of acceptance depends on the change in cost (huge increases should T ¼ T where 0 < < 1 be accordingly improbable) as well as on the number of iterations i i1 (since the system is closer to the ‘stable’ optimum towards the end). Wanting the final temperature to be T ¼ 10 , this yields: final Given an infinite amount of time, it can be shown that simulated 0 1 annealing will approach the optimal solution to any finite problem log @ A ¼ exp (Ha¨ ggstro¨ m, 2002). Simulated annealing can be used to minimize any cost function, and has for instance been used for multiple alignment (Lukashin et al., 1992). 2.2.1 Sequence moves The moves aimed at changing the sequence Simulated annealing depends on an artificial temperature T that alignment do this by moving gaps in the sequences. They can of course decreases over time. Initially the temperature should be high enough be moved without altering the order of the nucleotides. Three different to give an ‘unstable system’—in this case an alignment prone to types of moves are implemented which in combination ensures that the changes. As more and more changes are sampled, the temperature alignment can be reduced, extended and altered locally. decreases to ‘stabilize’ the system. Normally the temperature decreases exponentially (Lukashin et al., 1992), although there is no theoretical Gap block move: local changes are facilitated through gap blocks. reason for this. If the new cost is lower than the previous, the change is A gap block is a subsequence consisting of 1 or more consecutive always accepted. If the change increases the cost, the chance of gaps in 1 or more aligned sequences. To make this move, a random acceptance P depends on the old cost c , the new (larger) cost c OLD NEW gap in a random sequence is picked. Then the gap block is extended and the temperature T: vertically with probability 0.85 through the other sequences c c NEW OLD containing a gap at that position. Afterwards, the gap block is P ¼ exp ð4Þ extended horizontally to both sides with probability 0.85 if all the chosen sequences contain a gap there. Finally, the gap block This is known as the Metropolis–Hastings algorithm (Hastings, 1970, is moved to a randomly chosen new position in the alignment. Kirkpatrick et al., 1983). Using this, the possible states are sampled The procedure is illustrated in Figure 1 and constitutes 85% of the based on the cost of the current state. Since a state only depends on sequence moves. the previous state, this generates a Markov chain. In combination, MCMC using simulated annealing can be used to sample the solution Gap insertion: insertion of gaps has to insert the same number of space of multiple alignments and RNA structures. Changes can be gaps in all sequences. One could insert the gaps at random made by moving the gaps in the alignment and by adding or removing positions in all sequences, but that would greatly affect the cost of basepairs in the structure, and the move is either rejected or accepted the alignment. Instead, the gaps are inserted in either end based on the change in cost. of the alignment. From these positions the gaps can diffuse into The initial alignment is built by adding gaps at random to all the alignment as needed. These moves constitute 10% of the sequences until they have equal length. By default, the length of the sequence moves. 3307 S.Lindgreen et al. Table 1. Detailed comparison of the predictions by MASTR and RNA Sampler Family ID Seqs Length MASTR RNA Sampler Fig. 1. Illustration of the gap block moves used in the sampling MCC SPS MCC SPS approach. (a) choose a gap at random in a sequence, (b) extend vertically, (c) extend horizontally and (d) move to new position in tRNA 35 11 74 0.72 0.59 0.90 0.68 alignment. 40 11 72 0.89 0.84 0.97 0.79 45 12 71 0.89 0.88 0.94 0.88 50 12 73 0.99 0.97 0.85 0.95 Gap deletion: removing gaps is done by locating gap columns, i.e. 44 12 73 1.00 0.98 0.98 0.91 columns in the alignment containing a gap in all of the sequences. 60 11 73 1.00 0.99 1.00 0.99 Using a well–defined cost function, superfluous gaps are placed in 65 11 73 1.00 0.99 0.86 0.99 the same columns of the alignment. These moves constitute 5% 68 7 73 0.89 1.00 0.99 0.99 of the sequence moves. 75 5 72 1.00 0.98 1.00 0.97 80 6 72 1.00 0.99 1.00 0.99 84 6 69 0.14 0.98 0.31 0.98 2.2.2 Structure moves. Structural moves either add or delete 90 7 72 0.83 1.00 0.98 0.98 basepairs. The structure is forced to contain only non–crossing 95 8 73 0.97 1.00 0.97 1.00 basepairs (i.e. prediction of pseudoknots is not yet supported), and a 97 6 73 0.34 1.00 0.48 1.00 minimum loop length of 3 nt is ensured. Using the three simple moves 5S rRNA 36 5 110 0.45 0.62 0.51 0.60 described below, the structure can be built, extended and reduced. 42 6 111 0.66 0.74 0.62 0.59 45 8 116 0.54 0.76 0.41 0.59 Adding a basepair: a new basepair is added by choosing a 50 9 113 0.59 0.74 0.52 0.73 nucleotide pair (i, j) at random and adding it to the structure if it 55 10 113 0.56 0.91 0.40 0.91 does not violate the constraints. These moves constitute 70% of the 60 11 118 0.89 0.96 0.75 0.90 structure moves. 65 11 117 0.83 0.97 0.81 0.97 Extending a stem: a stem is extended by choosing a basepair (i, j ) 69 10 116 0.52 0.95 0.50 0.92 75 9 115 0.58 0.96 0.53 0.95 already in the structure. The stem that includes basepair (i, j) is then 80 11 118 0.93 1.00 0.85 0.98 extended by adding a new basepair to it, with a 50% chance of 85 15 117 0.11 1.00 0.13 0.99 extending the stem either internally or externally. These moves 89 15 117 0.04 0.97 0.01 0.98 constitute 20% of the structure moves. 95 20 117 0.20 1.00 0.12 1.00 Deleting a basepair: deleting a basepair is done by choosing a pair 97 13 117 0.38 1.00 0.10 0.98 (i, j ) in the structure and removing it. This cannot lead to any new U5 36 5 122 0.40 0.33 0.44 0.38 violations of the constraints. These moves constitute 10% of the 41 6 123 0.48 0.37 0.49 0.42 structure moves. 44 9 120 0.46 0.45 0.64 0.49 51 8 120 0.50 0.51 0.63 0.56 55 10 118 0.67 0.61 0.72 0.64 60 9 115 0.71 0.68 0.82 0.65 2.3 Datasets 65 9 114 0.78 0.76 0.52 0.68 Since consensus structures were not available from BRaliBase II 70 10 115 0.87 0.81 0.93 0.81 (Gardner et al., 2005) at the time, we sampled alignments with 75 8 116 0.90 0.82 0.93 0.84 consensus structures in much the same way as in the BRaliBase 80 8 119 0.50 0.83 0.24 0.76 study. MASTR relies on the partition function to calculate basepair 86 7 116 0.96 0.96 0.96 0.94 probability matrices, so we have chosen to use only short (70–250 nt) 90 9 115 0.95 0.98 0.92 0.95 sequences in the test. There are known problems when using the 95 17 115 1.00 1.00 0.98 0.99 partition function to calculate basepair probability matrices for long 97 7 116 0.96 1.00 0.95 1.00 sequences. Hence, the program will not perform well on long sequences TPP 35 8 122 0.33 0.42 0.45 0.40 until this has been dealt with. 41 9 104 0.37 0.65 0.71 0.51 Datasets were generated from large, trusted seed alignments obtained 45 11 105 0.43 0.75 0.61 0.64 from Rfam (Griffiths-Jones et al., 2005). The 5 families chosen are 50 11 103 0.55 0.72 0.65 0.72 tRNA, 5S ribosomal RNA (5S rRNA), U5 spliceosomal RNA (U5), 55 11 107 0.70 0.86 0.74 0.78 Hepatitis C virus internal ribosome entry site (IRES) and TPP 58 7 102 0.63 0.91 0.69 0.81 riboswitch THI element (TPP). 61 4 101 0.66 0.97 0.59 0.85 From each RNA family, a number of datasets were sampled. Each 70 4 90 0.45 0.86 0.31 0.85 dataset has an average pairwise identity within a specified 10% interval. ires 69 2 249 0.40 0.59 0.46 0.86 These intervals go from 30% to 100% with 5% overlap, i.e. 30–40%, 73 4 250 0.54 0.91 0.45 0.86 35–45%, 40–50%, etc. This sampling procedure gave 14 datasets from tRNA, 5S rRNA and U5, 8 datasets from TPP and 2 datasets from For each dataset, the average pairwise identity (ID) is shown along with the IRES. In total, 52 datasets were sampled and details on average number of sequences (Seqs) and the average sequence length (Length). The results pairwise identity, number of sequences and average sequence lengths for MASTR and RNA Sampler are detailed (MCC: structure quality, SPS: can be seen in Table 1. The datasets can be obtained from http:// alignment quality). For each dataset, the best results are highlighted with bold, mastr.binf.ku.dk/. and identical results with italics. 3308 MASTR (a) 3 RESULTS The program MASTR is implemented in Cþþ and tested against the programs FoldalignM, LocARNA and RNA Sampler. RNAcast is used to produce input to RNAforester, and Clustal alignments were used as input to RNAalifold. All programs were used with their default settings except for RNAforester where the clustering cutoff had to be changed in order to produce complete alignments of all sequences. MASTR FoldalignM does not predict a single consensus structure but LocARNA Clustal + Alifold returns a structure for each sequence in the final alignment. FoldalignM RNAsampler Therefore, we define the consensus structure to be the basepairs RNAforester that are predicted for all sequences. The other programs all 30 40 50 60 70 80 90 predict a single multiple alignment and consensus structure. %ID To evaluate the predicted structures, Matthew’s Correlation Coefficient (MCC) is used: (b) TP TN FP FN MCC ¼ pﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ ðTP þ FPÞðTP þ FNÞðTN þ FPÞðTN þ FNÞ Let TP be the number of truly predicted basepairs, FP be the number of predicted basepairs not in the reference structure and FN be the number of basepairs in the reference structure not predicted by the program. TN is defined here as the number of possible basepairing interactions in a sequence that are MASTR not predicted and not in the reference structure, i.e. pairs of LocARNA Clustal + Alifold nucleotide xy that are at least 4 nt apart, and where FoldalignM RNAsampler xy2{AU,UA,CG,GC,UG,GU}. RNAforester To evaluate the quality of the alignment, the Sum of 30 40 50 60 70 80 90 Pairs score (SPS) (Thompson et al., 1999) is used. SPS is a %ID sensitivity–like measure that compares the predicted alignment to a reference. For each pair of aligned sequences, the number (c) of aligned positions that are present in both the prediction and the reference alignment is counted. The total number of correctly aligned positions is then compared to the total number of aligned non–gap pairs present in the reference alignment. This yields a number between 0 and 1 where 1 is perfect correspondence between prediction and reference. In Figure 2, the performance of the programs are compared in terms of structure quality (plot a), alignment quality (plot b) MASTR and running time (plot c) as a function of the average pairwise LocARNA Clustal + Alifold identity of the datasets (%ID). The plots are averages over the FoldalignM RNAsampler different RNA families used for each %ID point. RNAforester The test shows that MASTR can predict consensus struc- 30 40 50 60 70 80 90 tures of a quality comparable to other existing methods. On %ID the lowest identity datasets MASTR is outperformed by RNA Sampler, but after 45% ID the structure predictions of Fig. 2. The performance of the tested programs in terms of structure MASTR are on average better than or comparable to the prediction (a), alignment quality (b) and running time (c) as a function best programs tested. MASTR is consistently better than or of average pairwise identity (%ID). Each plot shows the performance comparable to all other methods regarding alignment quality. as the average over the RNA families used. Note that plot c is on a logarithmic scale. Black: MASTR, Blue: LocARNA, Green: As it can be seen, MASTR is clearly faster than FoldalignM Clustal þ RNAalifold, Red: FoldalignM, Orange: RNA Sampler, by up to an order of magnitude, while LocARNA is even Purple: RNAcast þ RNAforester. faster. Clustal þ RNAalifold is of course by far the fastest method used. RNAforester has a running time comparable to LocARNA but produces consistently worse alignments— be explained by the lack of covariation in these datasets. Most probably due to the fact that all sequences had to be included of the methods rely on some signal from compensatory in the same alignment for comparison. The structure predic- tions are comparable to FoldalignM. mutations, and this signal diminishes as the sequences become The dip in the quality of the structure predictions that too similar. RNA Sampler does not depend on a covariation is visible for all methods in the highest identity range could term that could explain why the dip is less prominent here. log(seconds) SPS MCC 1e-01 1e+00 1e+01 1e+02 1e+03 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 S.Lindgreen et al. Likewise, FoldalignM shows an almost monotone increase in Thus, new sequences can be aligned to reference alignments in the predictions as a function of identity. a structurally sound manner. Since RNA Sampler seems to be the best of the other methods tested here, a more detailed comparison of MASTR and RNA Sampler can be seen in Table 1. In total, 52 datasets ACKNOWLEDGEMENTS are used. The running time is on the same scale for the two S.L. acknowledges support from the Novo Scholarship programs, although MASTR is in general slightly faster. The Program. The work was supported by the Carlsberg Founda- structure predictions are better than or equal to RNA Sampler tion and the Danish National Research Foundation. in 28 cases (54%), and the alignments are better than or equal The authors thank two anonymous referees for their helpful to RNA Sampler in 43 cases (83%). The differences seem to comments; especially one reviewer for interesting comments depend both on the RNA family and on the level of identity. on covariation. Conflict of Interest: none declared. 4 DISCUSSION We have developed a new algorithm for simultaneous align- REFERENCES ment and structure prediction of multiple non-coding RNA sequences. As shown above, MASTR is highly competitive both Athanasius,F. Bompfu¨ newerer Consortium et al. (2007) RNAs everywhere: genome-wide annotation of structured RNAs. J. Exp. Zoolog. Mol. Dev. in terms of structure prediction quality, sequence alignment and Evol., 308, 1–25. running time. The program can also handle larger datasets Bompfu¨ newerer,A.F. et al. (2005) Evolutionary patterns of non-coding RNAs. than, e.g. RNA Sampler or FoldalignM. Theory Biosci., 123, 301–369. Although we have not used it in the above tests, it is also Cheng,J. et al. (2005) Transcriptional maps of 10 human chromosomes at possible to add structural constraints if some knowledge 5-nucleotide resolution. Science, 308, 1149–1154. Das,R. and Baker,D. (2007) Automated de novo prediction of native-like RNA is available about one of the sequences (e.g. known basepairs, tertiary structures. Proc. Natl Acad. Sci., 104, 14664–14669. knowledge about upstream or downstream interactions, Ding,Y. and Lawrence,C.E. (2003) A statistical sampling algorithm for RNA or knowledge about non-basepaired positions). Additionally, secondary structure prediction. Nucleic Acids Res., 31, 7280–7301. already aligned sequences can be used as input with or without Durbin,R. et al. (1998) Biological Sequence Analysis. Probabilistic Models of a consensus structure. Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK. Gardner,P. et al. (2005) A benchmark of multiple sequence alignment programs As the testing of the program showed, MASTR does not upon structural RNAs. Nucleic Acids Res., 33, 2433–2439. have top performance on very dissimilar sequences. In this Giegerich,R. et al. (2004) Abstract shapes of RNA. Nucleic Acids Res., 32, range, one would assume that covariance is important, and it is 4843–4851. therefore interesting that RNA sampler, which does not use Gorodkin,J. et al. (1997) Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res., 25, 3724–3732. covariance, is better. One possible explanation for this is Griffiths-Jones,S. et al. (2005) Rfam: annotating non-coding RNAs in complete that covariation in itself is not enough to deduce structure genomes. Nucleic Acids Res., 33, D121–D124. from alignment. Covariation is only an indicator of conserved Ha¨ ggstro¨ m,O. (2002) Finite Markov Chains and Algorithmic Applications. basepairs, but it is not sufficient to predict pairing columns [this Cambridge University Press, Cambridge, UK. corresponds well with our previous study (Lindgreen et al., Hastings,W.K. (1970) Monte carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109. 2006)]. MASTR builds up the structure in small steps, which Havgaard,J.H. et al. (2005) Pairwise local structure alignment of RNA sequences might make it vulnerable to erronously high covariation, with sequence similarity less than 40%. Bioinformatics, 21, 1815–1824. whereas RNA Sampler makes sure that the alignment is Hein,J. (1989) A new method that simultaneously aligns and reconstructs structurally sound by fixing whole stems. MASTR therefore ancestral sequences for any number of homologous sequences when the phylogeny is given. Mol. Biol. Evol., 6, 649–668. needs to have a relatively stable (and correct) alignment before Ho¨ chsmann,M. et al. (2003) Local similarity of RNA secondary structures. predicting structure. This could explain the relatively poor In Proceedings of the IEEE Computational Systems Bioinformatics Conference performance on low identity datasets, and this should be (CSB 2003). pp. 159–168. explored further. Hofacker,I. (2003) Vienna RNA secondary structure server. Nucleic Acids Res., In future work, we would like to make a local version of the 31, 3429–3431. Hofacker,I. et al. (2004) Alignment of RNA base pairing probability matrices. program. This would be ideal for dealing with long sequences Bioinformatics, 20, 2222–2227. where there are known problems with the standard basepair Hofacker,I.L. et al. (2002) Secondary structure prediction for aligned RNA probability matrices. Since MASTR does not have the same sequences. J. Mol. Biol., 319, 1059–1066. limitations towards crossing basepairs as pure energy-based Jossinet,F. et al. (2007) RNA structure: bioinformatic analysis. Curr. Opin. methods, an extension to include the prediction of pseudoknots Microbiol., 10, 279–285. Kirkpatrick,S. et al. (1983) Optimization by simulated annealing. Science, 220, will also be pursued. 671–680. One of the advantages of MASTR is that the optimization is Knudsen,B. and Hein,J. (1999) RNA secondary structure prediction using decoupled from the cost function, which makes it very easy to stochastic context-free grammars and evolutionary history. Bioinformatics, change the latter. It also makes it reasonably straight forward 15, 446–454. Knudsen,B. and Hein,J. (2003) Pfold: RNA secondary structure prediction using to add phylogenetic prediction to the program, which would stochastic context-free grammars. Nucleic Acids Res., 31, 3423–3428. be similar to the goal of SimulFold (Meyer and Miklos, 2007), Lindgreen,S. et al. (2006) Measuring covariation in RNA alignments: physical but MASTR functions in a very different way. We would also realism improves information measures. Bioinformatics, 22, 2988–2995. like to make it possible to input a set of related and already Lukashin,A.V. et al. (1992) Multiple alignment using simulated annealing: branch aligned sequences together with the set of unaligned sequences. point definition in human mRNA splicing. Nucleic Acids Res., 20, 2511–2516. 3310 MASTR Mathews,D.H. and Turner,D.H. (2002) Dynalign: an algorithm for finding the Siebert,S. and Backofen,R. (2005) MARNA: multiple alignment and consensus secondary structure common to two RNA sequences. J. Mol. Biol., 317, structure prediction of RNAs based on sequence structure comparisons. 191–203. Bioinformatics, 21, 3352–3359. McCaskill,J.S. (1990) The equilibrium partition function and base pair Tai,K. (1979) The tree-to-tree correction problem. J. ACM, 26, 422–433. binding probabilities for RNA secondary structure. Biopolymers, 29, Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive 1105–1119. multiple sequence alignment through sequence weighting, position-specific Meyer,I.M. and Miklos,I. (2007) SimulFold: simultaneously inferring RNA gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680. structures including pseudoknots, alignments and trees using a Bayesian Thompson,J.D. et al. (1999) A comprehensive comparison of multiple sequence MCMC framework. PLoS Comput. Biol., 3, e149. alignment programs. Nucleic Acids Res., 27, 2682–2690. Meyer,I.M. (2007) A practical guide to the art of RNA gene prediction. Brief. Torarinsson,E. et al. (2007) Multiple structural alignment and clustering of RNA Bioinformatics, Epub ahead of print, PMID: 17483123. sequences. Bioinformatics, 23, 926–932. Nussinov,R. et al. (1978) Algorithms for loop matchings. SIAM J. Appl. Math., Washietl,S. et al. (2005) Mapping of conserved RNA secondary structures 35, 68–82. predicts thousands of functional noncoding RNAs in the human genome. Onoa,B. and Tinoco,I. Jr (2004) RNA folding and unfolding. Curr. Opin. Struct. Nat. Biotechnol., 23, 1383–1390. Biol., 14, 374–379. Will,S. et al. (2007) Inferring noncoding RNA families and classes by means of Pedersen,J. et al. (2006) Identification and classification of conserved RNA genome-scale structure-based clustering. PLoS Comput. Biol., 3,e65. secondary structures in the human genome. PLoS Comput. Biol., 2, e33. Xu,X. et al. (2007) RNA Sampler: a new sampling based algorithm for Reeder,J. and Giegerich,R. (2005) Consensus shapes: An alternative to the common RNA secondary structure prediction and structural alignment. Sankoff algorithm for RNA consensus structure prediction. Bioinformatics, Bioinformatics, 23, 1883–1891. 21, 3516–3523. Zhang,K. and Shasha,D. (1989) Simple fast algorithms for the editing distance Sankoff,D. (1985) Simultaneous solution of the RNA folding, alignment and between trees and related problems. SIAM J. Comput., 18, 1245–1262. protosequence problems. SIAM J. Appl. Math., 45, 810–825. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA Shapiro,B.A. et al. (2007) Bridging the gap in RNA structure prediction. Curr. sequences using thermodynamics and auxiliary information. Nucleic Acids Opin. Struct. Biol., 17, 157–165. Res., 9, 133–148.

Journal

Bioinformatics – Oxford University Press

Published: Nov 15, 2007

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing

MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing

MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing

References (46)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies