Xiaolu Huang, Hesham Ali (2006)
High sensitivity RNA pseudoknot prediction. Nucleic Acids Research, 35
Yizhu Lin, B. Schmidt, M. Bruchez, C. McManus (2018)
Structural analyses of NEAT1 lncRNAs suggest long-range RNA interactions that may contribute to paraspeckle architecture. Nucleic Acids Research, 46
Ydo Wexler, Chaya Zilberstein, Michal Ziv-Ukelson (2006)
A Study of Accessible Motifs and RNA Folding Complexity. Journal of Computational Biology, 14(6)
W. Melchers, J. Hoenderop, H. Slot, C. Pleij, E. Pilipenko, V. Agol, J. Galama (1997)
Kissing of the two predominant hairpin loops in the coxsackie B virus 3' untranslated region is the essential structural feature of the origin of replication required for negative-strand RNA synthesis. Journal of Virology, 71
Stanislav Bellaousov, D. Mathews (2010)
ProbKnot: fast prediction of RNA secondary structure including pseudoknots. RNA, 16(10)
M. Verheije, R. Olsthoorn, M. Kroese, P. Rottier, J. Meulenberg (2002)
Kissing Interaction between 3′ Noncoding and Coding Sequences Is Essential for Porcine Arterivirus RNA Replication. Journal of Virology, 76
Elena Rivas, S. Eddy (1998)
A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology, 285(5)
Kengo Sato, Yuki Kato, Michiaki Hamada, T. Akutsu, K. Asai (2011)
IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics, 27
R. Nussinov, A. Jacobson (1980)
Fast algorithm for predicting the secondary structure of single-stranded RNA. Proceedings of the National Academy of Sciences of the United States of America, 77(11)
I. Novikova, S. Hennelly, C. Tung, K. Sanbonmatsu (2013)
Rise of the RNA machines: exploring the structure of long non-coding RNAs. Journal of Molecular Biology, 425(19)
KungYao Chang, I. Tinoco (1997)
The structure of an RNA "kissing" hairpin complex of the HIV TAR hairpin loop and its complement. Journal of Molecular Biology, 269(1)
Hajiaghayi (2012)
Analysis of energy-based algorithms for RNA secondary structure prediction. BMC Bioinformatics, 13
S. Will, H. Jabbari (2016)
Sparse RNA folding revisited: space-efficient minimum free energy structure prediction. Algorithms for Molecular Biology, 11
Mirela Andronescu, A. Condon, H. Hoos, D. Mathews, Kevin Murphy (2007)
Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics, 23(13)
Mathias Möhl, R. Salari, S. Will, R. Backofen, S. Sahinalp (2010)
Sparsification of RNA structure prediction including pseudoknots. Algorithms for Molecular Biology, 5
H. Jabbari, A. Condon (2014)
A fast and robust iterative algorithm for prediction of RNA pseudoknotted secondary structures. BMC Bioinformatics, 15
R. Backofen, Dekel Tsur, Shay Zakov, Michal Ziv-Ukelson (2009)
Sparse RNA folding: Time and space efficient algorithms
T. Akutsu (2000)
Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discret. Appl. Math., 104
Jens Reeder, R. Giegerich (2004)
Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5
T. Mercer, M. Dinger, J. Mattick (2009)
Long non-coding RNAs: insights into functions. Nature Reviews Genetics, 10
Jana Sperschneider, A. Datta, M. Wise (2012)
Predicting pseudoknotted structures across two RNA sequences. Bioinformatics, 28
Robert Dirks, N. Pierce (2003)
A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of Computational Chemistry, 24
Mirela Andronescu, Cristina Pop, A. Condon (2010)
Improved free energy parameters for RNA pseudoknotted secondary structure prediction. RNA, 16(1)
H. Jabbari, A. Condon, Shelly Zhao (2008)
Novel and Efficient RNA Secondary Structure Prediction Using Hierarchical Folding. Journal of Computational Biology, 15(2)
Baharak Rastegari, A. Condon (2007)
Parsing Nucleic Acid Pseudoknotted Secondary Structure: Algorithm and Applications. Journal of Computational Biology, 14(1)
R. Chang, T. Hsu, Yen-Lin Chen, Shu-Fan Liu, Yi-Jer Tsai, Yun-Tong Lin, Yi-Shiuan Chen, Yi-Hsin Fan (2013)
Japanese encephalitis virus non-coding RNA inhibits activation of interferon by blocking nuclear translocation of interferon regulatory factor 3. Veterinary Microbiology, 166(1-2)
Ho-Lin Chen, A. Condon, H. Jabbari (2009)
An O(n^5) Algorithm for MFE Prediction of Kissing Hairpins and 4-Chains in Nucleic Acids. Journal of Computational Biology, 16(6)
Andronescu (2008)
RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics, 9
Mirela Andronescu, Zhiyuan Zhang, A. Condon (2005)
Secondary structure prediction of interacting RNA molecules. Journal of Molecular Biology, 345(5)
Y. Uemura, Aki Hasegawa, Satoshi Kobayashi, T. Yokomori (1999)
Tree Adjoining Grammars for RNA Structure Prediction. Theor. Comput. Sci., 210
Motivation: The computational prediction of RNA secondary structure by free energy minimization has become an important tool in RNA research. However in practice, energy minimization is mostly limited to pseudoknot-free structures or rather simple pseudoknots, not covering many biologically important structures such as kissing hairpins. Algorithms capable of predicting sufficiently complex pseudoknots (for sequences of length $n$) used to have extreme complexities, e.g. Pknots has $O(n^6)$ time and $O(n^4)$ space complexity. The algorithm CCJ dramatically improves the asymptotic run time for predicting complex pseudoknots (handling almost all relevant pseudoknots, while being slightly less general than Pknots), but this came at the cost of large constant factors in space and time, which strongly limited its practical application (200 bases already require 256 GB space).

Results: We present a CCJ-type algorithm, Knotty, that handles the same comprehensive pseudoknot class of structures as CCJ with improved space complexity of $\Theta(n^3 + Z)$. Due to the applied technique of sparsification, the number of ‘candidates’, $Z$, appears to grow significantly slower than $n^4$ on our benchmark set (which includes pseudoknotted RNAs up to 400 nt). In terms of run time over this benchmark, Knotty clearly outperforms Pknots and the original CCJ implementation, CCJ 1.0; Knotty’s space consumption fundamentally improves over CCJ 1.0, being on a par with the space-economic Pknots. By comparing to CCJ 2.0, our unsparsified Knotty variant, we demonstrate the isolated effect of sparsification. Moreover, Knotty employs the state-of-the-art energy model of ‘HotKnots DP09’, which results in superior prediction accuracy over Pknots.

Availability and implementation: Our software is available at https://github.com/HosnaJabbari/Knotty.

Contact: jabbari@ualberta.ca or will@tbi.unvie.ac.at

Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

1 Introduction

Computational RNA secondary structure prediction has become an indispensable tool in the research on non-coding RNAs. Besides coding for proteins, RNAs perform various essential roles (most prominently in regulating gene expression) in all kingdoms of life, in many cases mediated by their three-dimensional structures (Mercer et al., 2009). Despite the ubiquity of pseudoknots in these RNAs (Chang et al., 2013; Lin et al., 2018; Novikova et al., 2013), most often pseudoknot-free structure prediction methods are applied in biological research, severely limiting the practical capabilities to correctly predict, recognize and compare pseudoknotted structures.

The fundamental cause of this limitation is the computational complexity of the prediction of general pseudoknots in the nearest neighbor energy model, which is NP-hard (Akutsu, 2000; Lyngso and Pedersen, 2000) and even inapproximable (Sheikh et al., 2012). Thus, the high complexity of pseudoknot prediction can be overcome only by either heuristics without optimality-guarantees or restrictions on the predictable pseudoknot class. In comparison, predicting minimum free energy (MFE) structures without pseudoknots is comparably simple: a pseudoknot-free secondary structure is either closed by a base pair connecting the first and last base in the sequence, or can be partitioned into two independent substructures on a prefix and suffix of the sequence, where energies of substructures add up to the total energy. This simple decomposition scheme allows predicting the MFE pseudoknot-free secondary structure by dynamic programming (DP) in $\Theta(n^3)$ time and $\Theta(n^2)$ space for standard energy models (Nussinov and Jacobson, 1980).

The most general DP algorithm for MFE prediction of pseudoknotted structures, Pknots (Rivas and Eddy, 1999) has complexity of $\Theta(n^6)$ time, and $\Theta(n^4)$ space. To reduce this prohibitively high complexity, several computationally less costly algorithms for MFE pseudoknot prediction have been proposed; for example, running in $\Theta(n^5)$ time and $\Theta(n^4)$ space (Dirks and Pierce, 2003; Lyngso and Pedersen, 2000; Uemura et al., 1999) or even $\Theta(n^4)$ time and $\Theta(n^2)$ space in PknotsRG (Lyngso and Pedersen, 2000; Reeder and Giegerich, 2004). However, all these reductions came at the cost of severely limiting the pseudoknot class over Pknots. Most strikingly, none of the less complex algorithms (compared to Pknots) handles the biologically important class of kissing hairpins with arbitrary nested substructures (Chang and Tinoco, 1997; Melchers et al., 1997; Verheije et al., 2002).

Recently, we introduced the MFE prediction algorithm CCJ (Chen et al., 2009). Compared to the more complex algorithm Pknots, CCJ handles a strictly less general subclass of pseudoknots, but it significantly expands the class of structures that can be handled in $O(n^5)$ time, e.g. including the biologically relevant kissing hairpins. To achieve CCJ’s novel balance of time and pseudoknot complexity, we defined a new class of structures called Three-Groups-of-Band (TGB) structures with at most three groups of bands (see Fig. 1A and B). On this basis, we presented recurrences that handle TGB structures alternatingly decomposing the left, right, middle and outer bands.

Fig. 1. Examples of ‘Three-Groups-of-Bands’ (TGB) and CCJ structures. Each arc represents a set of consecutive pseudoknotted base pairs (referred to as band), that cross the same band. (A) TGB structure with two left, two right and four middle bands; (B) TGB with no left band, one middle and two right bands as well as a nested pseudoknotted substructure; (C) CCJ structure composed of the two TGB structures of subfigures A and B (see decomposition of P)

The CCJ algorithm covers H-type pseudoknotted structures, kissing hairpins and chains of four interleaved stems by overlaying TGB structures (Fig. 1C); this composition can be applied recursively, which defines the class of CCJ structures. Note that this class is comparably general, e.g. comprising all density-2 structures (Jabbari et al., 2008) with up to four interleaved stems. For example, due to Rastegari and Condon (2007), among 1133 pseudoknotted RNA structures, only 9 were not density-2 and 6 were too complex for Pknots.

While CCJ predicts complex pseudoknotted secondary structures in $\Theta(n^5)$ time, its $\Theta(n^4)$ space complexity was still limiting in practice; e.g. for the original CCJ algorithm, even generous memory of 256 GB used to limit the length of RNAs to below 200 bases. A slightly deeper look reveals that CCJ’s substantial improvement in time complexity over Pknots was achieved at the price of using a large number of four-dimensional DP matrices (about 20 matrices in CCJ versus four such matrices in Pknots). Apparently, CCJ’s speedup is tied to a (in practice prohibitively) large space consumption. In this work, we overcome the still unresolved dilemma that expressive pseudoknot prediction used to be limited either by time (Pknots) or space (CCJ). To improve the extreme space requirements of expressive, but comparably fast prediction, we apply the idea of sparsification (Backofen et al., 2011; Wexler et al., 2007) to CCJ. Devising sparsified recurrences of CCJ, resulting in the algorithm Knotty, we improve the space complexity to $\Theta(n^3 + Z)$, where $Z$ is the total number of candidates. This complexity is the result of replacing all four-dimensional DP matrices by (a constant number of) three-dimensional matrix slices and lists of candidates. This suffices to exactly predict the same MFE structures as before (due to inverse triangle inequalities). The number of candidates is expected to be much smaller than the number of replaced matrix entries; moreover it cannot be larger.

The space-efficient retrieval of the optimal structure from this algorithm turned out to be a non-trivial application of the sparsification of MFE RNA structure prediction algorithms. We enable this by implementing our recently presented technique of space-efficient sparse traceback (Will and Jabbari, 2016); this method requires additional space for trace arrows, but keeps the number of trace arrows low.

Notably, previous applications of the sparsification technique to RNA structure prediction discussed either pseudoknot-free methods (Backofen et al., 2011; Wexler et al., 2007) or used pseudoknotted methods with simplified energy models (base pair maximization only) (Möhl et al., 2010). Sparsified RNA–RNA interaction prediction (Salari et al., 2010) is the most complex case of structure prediction so far that was implemented for a realistic interaction energy model. While space-efficient sparsification is discussed in the paper, the space-efficient variant of the implementation could not recover the optimal interaction structure by traceback.

1.1 Contributions

We introduce Knotty, providing a unique combination of comparably low run time at moderate space consumption and high expressivity, covering most biologically relevant pseudoknots including the important class of kissing hairpins and employing the state-of-the-art energy model. Aiming at low computational complexity and high expressivity, we base our work on the original CCJ algorithm. To fundamentally overcome CCJ’s space overhead, we apply space-efficient sparsification to an adapted CCJ algorithm, resulting in the novel algorithm Knotty. By this, we present (to the best of our knowledge) the first sparsified prediction of complex pseudoknots using a realistic energy model.

For studying Knotty’s computational behavior, we benchmark versus the original CCJ and a non-sparse variant of Knotty (which we implemented in parallel, for the purpose of such comparison). We show that sparsification alone substantially reduces the space requirements, whereas we could achieve further significant improvements over the original implementation (even in the non-sparse Knotty variant).

Aiming at high accuracy, our implementation of Knotty supports the state-of-the-art HotKnots DP09 energy model (Andronescu et al., 2010). To analyze Knotty’s biological relevance, we benchmarked the prediction accuracy (for the total structures as well as for the actual pseudoknots) in comparison to the pseudoknot prediction programs HotKnots V2.0 (Andronescu et al., 2010), IPknot (Sato et al., 2011) and the most general MFE pseudoknot prediction method Pknots (Rivas and Eddy, 1999). For control, we moreover compare to Simfold (Andronescu et al., 2005, 2007), a pseudoknot-free prediction method based on the same energy model.
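The pseudoknot-free decomposition and the candidate idea described in the introduction can be tried out on a toy model. The following Python sketch is not Knotty's algorithm and not taken from the Knotty code base. It assumes an energy of -1 per canonical base pair and no minimum loop size, and the names `fold` and `CANONICAL` are hypothetical. It applies candidate-based sparsification only to the simple pseudoknot-free recurrence, so that the dense and the sparse variant can be compared directly.

```python
import math
import random

# Canonical base pairs (C-G, A-U, G-U); each contributes -1 in this toy model.
CANONICAL = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def fold(seq, sparse):
    """Toy pseudoknot-free MFE by dynamic programming.

    Returns (energy, inspected), where `inspected` counts the stored
    (k, V(k, j)) entries visited in the inner minimization.
    """
    n = len(seq)
    W = [[0] * n for _ in range(n)]

    def w(i, j):
        # Energy of the best structure on seq[i..j]; empty fragments cost 0.
        return W[i][j] if i <= j else 0

    inspected = 0
    for j in range(n):
        # Entries (k, V(k, j)) with k > i: all of them in the dense variant,
        # only the candidates in the sparse variant.
        stored = []
        for i in range(j, -1, -1):
            pairs = i < j and (seq[i], seq[j]) in CANONICAL
            v = w(i + 1, j - 1) - 1 if pairs else math.inf  # closed by (i, j)
            other = w(i, j - 1)  # case: j is unpaired
            for k, vk in stored:  # case: prefix structure plus closed (k, j)
                inspected += 1
                other = min(other, w(i, k - 1) + vk)
            W[i][j] = min(v, other)
            # (i, j) is a candidate iff the closed structure is strictly
            # better than every decomposition of the fragment.
            if not sparse or v < other:
                stored.append((i, v))
    return (W[0][n - 1] if n else 0), inspected


energy_dense, work_dense = fold("GGGAAACCC", sparse=False)
energy_sparse, work_sparse = fold("GGGAAACCC", sparse=True)
print(energy_dense, energy_sparse, work_dense, work_sparse)
```

Both variants return the same energy because a non-candidate entry can always be replaced by a decomposition that is at least as good (the inverse triangle inequality). The sparse variant never inspects more entries than the dense one, which mirrors the statement that the number of candidates cannot exceed the number of replaced matrix entries.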
2 Materials and methods

The original CCJ algorithm (Chen et al., 2009) minimizes the free energy over all CCJ structures for a given input sequence $S$ by DP. As stated in the introduction, CCJ structures comprise kissing hairpins and chains of four interleaved stems, which can recursively contain CCJ structures as substructures. The optimal CCJ structure is then determined by standard traceback through the DP matrices.

Generally, DP algorithms can be described by presenting the recurrences that are used to calculate the entries of their DP matrices. In the case of RNA structure prediction, the DP matrices store MFE values for sequence fragments (under specific conditions). The recurrences correspond to decompositions of these fragments such that the matrix entries can be recursively inferred from energies of smaller subproblems. For example, assuming an energy value of $-1$ for each canonical base pair, i.e. C–G, A–U or G–U, the MFE structure of an RNA sequence, $S$, can be found using matrices $W_N$ and $V_N$. The entries $W_N(i, j)$ and $V_N(i, j)$, which hold the respective minimum free energies of general and closed structures of subsequence $S_{i..j}$, are decomposed as

[Equation (1): graphical grammar rules for $W_N(i, j)$ and $V_N(i, j)$]

These grammar-rules represent a complete case distinction of possible structures. Similar graphical notation was introduced by Rivas and Eddy (Rivas and Eddy, 1999) and is now commonly used to present DP algorithms for RNA (e.g. Dirks and Pierce, 2003; Möhl et al., 2010). In our notation, the sequence is represented as a line with bases as index positions; solid arcs indicate base pairs; and dashed arcs enclose areas containing unconsumed structure. Red circles represent fixed end points of a region. Blue squares are free indices (i.e. bases that are pivotal for boundary determinations). To not clutter the representation, we do not mark end points that can be identified from bases directly before or after them. In the example of Equation (1), $W_N(i, j)$ corresponds to the MFE structure of $S_{i..j}$; this structure can be decomposed according to Equation (1), since either $j$ is unpaired (left case) or $j$ is paired to some inner position (recursion to $V_N$, in which a solid arc represents a base pair); the closed structure corresponding to $V_N(i, j)$ is reduced to $W_N(i+1, j-1)$ (rightmost case). Moreover, the grammar rules allow (almost mechanically) inferring the recursion equations for base pair free energy minimization (up to specific energy contributions). Generally, our recursions minimize over the recursion cases and the free indices in their respective range limited by the fixed indices. Thus, we translate the rules of Equation (1) to

$$W_N(i, j) = \min\{\, W_N(i, j-1),\ \min_{i \le k < j} W_N(i, k-1) + V_N(k, j) \,\}$$
$$V_N(i, j) = W_N(i+1, j-1) - 1 \quad \text{if } S_i S_j \text{ is canonical;}$$
$$V_N(i, j) = \infty \quad \text{otherwise.}$$

In our discussion of CCJ and its sparsification, the graphical presentation allows us to focus on the sparsification and avoid distracting details like the exact added energy contributions in each single step, which is not necessary for understanding the algorithmic ideas. Formal recurrences are provided in the Supplementary Material; further details are found in (Jabbari, 2015).

While pseudoknot-free recursions generally use only fragments connected at the sequence level, the CCJ algorithm requires ‘gapped’ fragments (two subsequences disconnected by a gap). Consequently, defining such fragments requires four sequence positions in total.

The MFE of the CCJ structure for subsequence $S_{i..l}$ is calculated in $W(i, l)$, which is decomposed as

[graphical decomposition rule for $W(i, l)$]

Here, the first case corresponds to a scenario in which the last base, i.e. $l$, is unpaired; the second case is when the structure in region $[i, l]$ can be decomposed to two different non-overlapping substructures; the third case corresponds to the case when the end points pair together to create a loop. The recurrence $V(i, l)$ handles non-pseudoknotted loops closed by $i$ and $l$; $P(i, l)$ is the MFE of a CCJ pseudoknot for region $[i, l]$. $P(i, l)$ is decomposed by the rule

[Equation (2): graphical decomposition rule for $P(i, l)$]

into two TGB structures (with MFEs in the $PK$ matrix). Representing TGB structures requires using gapped fragments. As indicated by the three blue boxes (free indices) in Equation (2), each entry of $P$ is minimized over three indices. Thus, this step already bounds the time and space complexity of the CCJ algorithm to $O(n^5)$ and $O(n^4)$ respectively.

$PK(i, j, k, l)$ is the MFE over all TGB structures of the gapped fragment $[i, j] \cup [k, l]$. Note that such structures have additional restrictions: the positions $i$ and $l$ must be involved in some base pairs, which are not part of nested substructures; moreover, some base pair of a TGB structure must span the gap. The recurrence for $PK$ uses terms $P_L$, $P_M$, $P_O$, and $P_R$, which (put informally) handle bands on the left, middle and right groups of the TGB structure, respectively. Both $P_M$ and $P_O$ are needed to handle bands in the middle group. The matrix entries $P_L(i, j, k, l)$, $P_M(i, j, k, l)$, $P_R(i, j, k, l)$, $P_O(i, j, k, l)$ are decomposed as illustrated in Figure 2, in which $WP$ handles nested substructures in a pseudoloop. (For invalid index combinations, the matrix entries are set to $+\infty$, and do not have to be stored.)

Fig. 2. Decompositions for $PK(i, j, k, l)$, $P_L(i, j, k, l)$, $P_M(i, j, k, l)$, $P_R(i, j, k, l)$, $P_O(i, j, k, l)$ in grammar-rule like graphical notation (Color version of this figure is available at Bioinformatics online.)

As seen in Figure 2, each of the matrices $P_L$, $P_R$, $P_M$, and $P_O$ requires a base pair between the two ends at the respective positions left, right, middle or outer. Each such matrix distinguishes the three cases that (i) this base pair closes an interior loop or (ii) a multi-loop or (iii) is the inner border of a band. In the latter case, the respective terms $P_{fromX}$ ($X \in \{L, R, M, O\}$) handle transitions from base pairs in one group to base pairs in other groups (see Fig. 3).

Fig. 3. Decompositions of $P_{fromX}(i, j, k, l)$ for $X \in \{L, R, M, O\}$, which handle pseudoloops closed by an X band as well as the transition to some P (Color version of this figure is available at Bioinformatics online.)

In transition to another band via one of the $P_{fromX}$ matrices, we allow nested substructures around the band. This is handled by the first two cases of the respective recurrence $P_{fromX}$. Moreover, in $P_{fromX}$ we recurse to matrices $P_X(i, j, k, l)$ only if the requirements of these matrices are met.

Note that in $P_{fromL}$, it is not possible to transition to $P_L$. This is because the recurrences are designed so that bands are handled in rounds. Within a round, bands in the left are handled first, if any, then those on the right, if any, and then those on the middle, with bands handled by $P_M$ (if any) handled before those handled by $P_O$. A middle band must be handled in each round; otherwise, for example, two ‘bands’ on the left group, added in different rounds, would collapse into one, causing the recurrences to incorrectly add penalty terms for band ‘borders’ that are not actually borders. For this reason, no row in $P_{fromL}$ has a $P_L$ term, and so a band on the left group cannot be handled directly after a band on the right group. Also, $P_{fromO}$ does not have a row with a $P_M$ term, to ensure that $P_M$ cannot be used twice on the same round.

Interior loops that span a band in the left band are decomposed by the rule

[graphical decomposition rule for $P_{L,iloop}$]

the remaining cases (right, middle, outer) are analogous. Note that, while the decomposition of $P_{L,iloop}$ has two free indices, these indices are constrained by setting a constant maximum size of interior loops (here 30 bases), as is common practice. For handling interior loops that span a band, the original CCJ algorithm introduced the five-ary function $P_{L,iloop5}$ and applied a clever scheme of keeping track of borders by transitioning to different recurrence cases to use only $\Theta(n^4)$ space. However, due to its non-standard form, this recurrence cannot be sparsified directly. Thus, in Knotty, we make the pragmatic choice to simplify the recurrence, which works since even the detailed HotKnots DP09 energy parameter set does not require the full generality supported by the original recursion in CCJ. Note that this modification alone, i.e. without sparsification, does not change the complexity of the algorithm, even if it reduces the overhead.

Moreover, we modify the original handling of multi-loop cases to enable their sparsification. Figure 4 shows our multi-loop handling for the left band. We handle multi-loops by passing through the states $P_{L,mloop1}$ and $P_{L,mloop0}$, which keep track of how many additional inner base pairs have to be enforced (1 or 0). Finally, Figure 5 shows the decompositions of $WB(i, l)$ and $WB'(i, l)$, which represent multi-loop fragments around a band; in contrast to $WB'(i, l)$, $WB(i, l)$ requires a base pair in its range $[i, l]$. The other ‘W’-matrices ($WM$ and $WM'$ for ordinary; $WP$ and $WP'$ for pseudoknot multi-loops) are decomposed analogously.

Fig. 4. Decomposition of multi-loops that span a band in the left band (Color version of this figure is available at Bioinformatics online.)

Fig. 5. Decompositions of $WB(i, l)$ and $WB'(i, l)$ (Color version of this figure is available at Bioinformatics online.)

2.1 The Knotty algorithm

Knotty improves space complexity over the CCJ algorithm utilizing the following idea: due to the technique of sparsification, one can minimize the free energy, while keeping a fraction of the whole DP matrix cells, namely only the entries referred to as candidates. By storing a few candidates (as explained below) we avoid storing all four-dimensional matrices. (i) In recurrences in which the left-most index, $i$, does not change, we store the value of such recurrences in a constant number of three-dimensional matrix slices; we call the collection of these matrix slices i-slices. In many of these recursion cases, we recurse to matrix entries of the same i-slice (e.g. when inferring $PK$ from $P_L$ or, slightly more interestingly, $P_R$ from $P_{fromR}$). In other cases, we recurse to the $(i+1)$-slice (e.g. $P_L$ from $P_{fromL}$) or $(i+c)$-slice, where $c$ is constantly bounded. The latter occurs in the handling of interior loops ($P_{L,iloop}$ and analogously $P_{O,iloop}$), where $c$ does not exceed the maximum interior loop size, MLS. (ii) This leaves us with the recursion cases that recurse to some d-slice, where $d - i$ is not constantly bounded. Instead of storing all slices, we store only certain candidate entries in such slices. These matrix entries (called candidates) are stored in candidate lists for specific recursion cases together with their corresponding second, third, and fourth matrix indices, $j$, $k$, and $l$ to keep track of band borders. In some cases, candidate lists can be shared between recursion cases. We presented more details on candidate list requirements in Möhl et al. (2010).

2.1.1 Space representation of the four-dimensional matrices by Knotty

Only matrices corresponding to $P_L$ and $P_O$ occur in interior loops that span a band, and require to recurse to a different i-slice; for them, we store slices $i..i + \mathrm{MLS}$. For matrices corresponding to $P_{fromL}$, $P_{fromO}$, $P_{L,mloop0}$, $P_{L,mloop1}$, $P_{O,mloop0}$ and $P_{O,mloop1}$, we only need to store slices $i$ and $i + 1$. Matrices corresponding to recurrences of types $P_{X,iloop}$ and $P_{X,mloop}$ ($X \in \{L, R, M, O\}$) do not need to be stored and can be computed when needed. For the remaining matrices, we store only the current i-slice. Note that space is still always reused in the next iteration; in case of ranges of slices, the memory access is ‘rotated’ without copying in memory. So in total, Knotty maintains candidate lists for nine matrices: $PK$, $P_{fromL}$, $P_{fromM}$, $P_{fromR}$, $P_{fromO}$, $P_{L,mloop0}$, $P_{M,mloop0}$, $P_{R,mloop0}$ and $P_{O,mloop0}$.

To finalize sparsification, all recursion cases that recurse to these matrices need to be modified. This affects all recursion cases, which insert any nested substructure to the left of the gapped region. This occurs exactly in the decompositions of $PK$, $P_{fromL}$, $P_{fromO}$, $P_{L,mloop1}$, $P_{L,mloop0}$, $P_{O,mloop0}$ and $P_{O,mloop1}$. Moreover, this affects the decomposition of $P$ into two PK-fragments, where the latter is taken from the respective candidate list. We discuss the single modifications on the three examples of $PK$, $P_{L,mloop1}$, and $P$; the remaining cases are sufficiently similar to be sparsified analogously.

1. Consider the case of the $PK$ recurrence:

$$\min_{i < d \le j} WP(i, d-1) + PK(d, j, k, l).$$

It suffices to minimize only over certain candidates $PK(d, j, k, l)$ that are not optimally decomposable in the following sense:

$$\nexists\, e > d : PK(d, j, k, l) = WP(d, e-1) + PK(e, j, k, l). \quad (3)$$

It can be shown easily that whenever $PK(i, j, k, l)$ is optimally decomposable, there is a candidate (i.e. a smaller, not optimally decomposable fragment) which yields the same minimum value. Remarkably, the candidate criterion can be efficiently checked by the DP algorithm. This is more directly seen from the equivalent candidate criterion

$$PK(d, j, k, l) < \min_{d < e \le j} WP(d, e-1) + PK(e, j, k, l).$$

Also this minimum can be computed by running over candidates only; moreover it must be calculated by the DP algorithm for computing $PK(d, j, k, l)$, such that the check is performed without additional overhead. This idea holds analogously for the other candidate checks.

2. Similarly, the minimization

$$\min_{i < d \le j} WB(i, d-1) + P_{L,mloop0}(d, j, k, l)$$

in the recurrence of $P_{L,mloop0}$ is restricted to candidates that satisfy

$$\nexists\, e > d : P_{L,mloop0}(d, j, k, l) = WB(d, e-1) + P_{L,mloop0}(e, j, k, l). \quad (4)$$

We could even strengthen the criterion, such that candidates must furthermore satisfy

$$\nexists\, e \ge d : P_{L,mloop0}(d, j, k, l) = P_{L,mloop0}(d, e, k, l) + WB(e+1, j). \quad (5)$$

3. In the minimization calculating $P(i, l)$, i.e.

$$\min_{i < j < d < k < l} PK(i, j-1, d+1, k-1) + PK(j, d, k, l),$$

it suffices to consider only candidates $PK(j, d, k, l)$. Entries $PK(j, d, k, l)$ are candidates if and only if they are not optimally decomposable in the following sense:

$$\nexists\, e\ (j \le e < d) : PK(j, d, k, l) = PK(j, e, k, l) + WP(e+1, d). \quad (6)$$

Notably, the candidate lists for the $P_{X,mloop0}$ and $P_{X,mloop1}$ recurrences ($X \in \{L, M, R, O\}$) can be shared, since candidates for decomposing $P_{X,mloop0}$ must be candidates for $P_{X,mloop1}$ due to $WB'(i, j) \le WB(i, j)$ for all $i$, $j$; cf. Figure 5 as well as Equations (4) and (5).

Finally, the Knotty recurrences can be computed based on the (constantly bounded) matrix slices and the candidates alone. Their correctness is a consequence of inverse triangle inequalities; for example in case of $W$ we have $\forall x < y \le z : W(x, z) \le W(x, y-1) + W(y, z)$, which follows from the definition of $W$. Analogous inequalities hold for $WP$, $WB$, and $WB'$.

Theorem 2.1 (Correctness of the Knotty sparsification). The Knotty recurrences are equivalent to the original CCJ recurrences.

Proof of Theorem 2.1 is provided in the Supplementary Material.

Note that the presented sparsification would not have been possible for the multi-loop handling of the original CCJ algorithm (Fig. 6). In the original decomposition of $P_{L,mloop}$, the unbound access to d-slices cannot be easily replaced by candidates; note the index offsets in the graphical notation, indicating that in the recurrence of $P_{L,mloop}(i, j, k, l)$, the $WB$ and $WB'$ fragments both start at $i + 1$; similarly, there is a shift to $j - 1$ in the recurrences of $WB$ and $WB'$ of $P_{L,mloop0}(i, j, k, l)$ and $P_{L,mloop1}(i, j, k, l)$ (see Supplementary Material for further details).

Fig. 6. Decomposition of multi-loops in the left band by the original CCJ algorithm (Color version of this figure is available at Bioinformatics online.)

2.2 Space complexity analysis

In the previous sections, we sparsified all four-dimensional matrices of the Knotty algorithm with the goal of reducing its space complexity. As explained before, our sparsification allows replacing all four-dimensional matrices by three-dimensional matrix slices. In seven recursion cases, we had to rewrite the minimizations, such that they compute equivalent results by recursing only to candidates and entries of the same i-slice. In the multi-loop cases, candidate lists are shared. Even if only a small fraction of the respective four-dimensional fragments are optimally decomposable (i.e. are not candidates), these changes will save space over the non-sparsified version. However, experience from previous sparsification (and our results) shows that a large number of fragments is optimally decomposable, such that the number of candidates is small.

We define $Z$ as the total number of candidates. For traceback, we additionally store trace arrows to optimal, non-candidate inner base pairs of interior loops; denote their number by $T$. Then, the total space complexity of Knotty is $O(n^3 + Z + T)$.

3 Results

In this section we provide implementation and dataset details, and show a comparison of Knotty in terms of time and memory usage to its predecessors, CCJ 1.0 and CCJ 2.0. In addition, we compare Knotty’s prediction accuracy to some of the best existing methods for RNA pseudoknotted secondary structure prediction.

3.1 Implementation

Knotty, CCJ 2.0, as well as the original CCJ algorithm were implemented in C++. Knotty and CCJ 2.0 implement the presented recurrences (respectively, with and without sparsification) utilizing the DP09 energy model of HotKnots V2.0 (Andronescu et al., 2010), while CCJ 1.0 strictly realizes the original CCJ recurrences and uses a slightly different energy model. Notably, all three implementations are reported here for the first time.

In Knotty, we implement the sparsified recurrences and store only (a constant number of) three-dimensional matrix slices instead of four-dimensional matrices. For the (given the sparse information, non-trivial) traceback, we implement a careful adaption of the approach of Will and Jabbari (2016) to compute the traceback by recomputation with the help of (as few as possible) trace arrows. It demonstrates the generality of Will and Jabbari’s approach (cf. Supplementary Material).

3.2 Datasets

We analyzed performance of our algorithm based on a large dataset of over 600 RNA sequences of up to 400 bases length. This dataset was compiled from four non-overlapping datasets with various pseudoknotted and pseudoknot-free structures. HK-PK and HK-PK-free datasets were compiled from RNA STRAND database (Andronescu et al., 2008) and were used in evaluating HotKnots V2.0 (Andronescu et al., 2010); IP-pk168 contains 16 categories of pseudoknotted structures with at most 85% sequence similarities (Huang and Ali, 2007) and was used in evaluating IPknot (Sato et al., 2011); and DK-pk16 contains 16 pseudoknotted structures with strong experimental support (Sperschneider et al., 2012). Table 1 summarizes these datasets.

Table 1. Summary of our benchmark datasets. Reference structures of the pk-type datasets are pseudoknotted, while they are pseudoknot-free in HK-PK-free

Name        #seqs  Type     Seq. lengths  Ref.
HK-PK       88     pk       26–400        (Andronescu et al., 2010)
HK-PK-free  337    pk-free  10–194        (Andronescu et al., 2010)
IP-pk168    168    pk       21–137        (Huang and Ali, 2007)
DK-pk16     16     pk       34–377        (Sperschneider et al., 2012)

3.3 Accuracy

To evaluate accuracy of predicted structures, we use the harmonic mean of sensitivity and positive predictive value (PPV), commonly referred to as the F-measure (Jabbari and Condon, 2014). Its value ranges between 0 and 1, with 0 indicating no common base pair with the reference structure and 1 indicating a perfect prediction (details in Supplementary Material).

3.3.1 Permutation test

We use a two-sided permutation test to assess statistical significance of the observed performance differences between two methods. We follow the procedure explained in (Jabbari and Condon, 2014; Hajiaghayi et al., 2012). We report a significance in prediction accuracy if the difference in P-values is <0.05.

3.4 Benchmark results

Fig. 7. (A) Memory usage (maximum resident set size in MB) versus length (log–log plot) over all benchmark instances. The solid line shows a best ‘asymptotic’ fit of the form $c_1 + c_2 n^x$ for sequence length $n$, constants $c_1$, $c_2$ and exponent $x$; for the fit, we ignore all values $n < 50$. (B) Run-time (s) versus length (log–log plot) over all benchmark instances.
For each tool in both plots, we report (in parenthesis) the exponent x that we estimated from the bench- We benchmarked Knotty, CCJ 1.0, CCJ 2.0 and Pknots on mark results; it describes the observed complexity as Hðn Þ (Supplementary R R IntelV XeonV CPU E5-4657L v2 2.40 GHz; for running jobs in par- Section S5) (Color version of this figure is available at Bioinformatics online.) allel, the compute server was equipped with a total of 1.5 TB avail- able space. Note that all programs run single-threaded. Based on time savings are striking, in particular for larger instances (e.g. 19 instances of our datasets, we compared the run time and memory days compared to about 12 h for the 400 bases long benchmark in- requirements of the programs. Moreover, we verified that CCJ 2.0 stance). Note that the same instance needs unreasonable amount of and Knotty, indeed produced equivalent results (For most of our memory for CCJ 1.0 and still extreme space for CCJ 2.0. For the benchmark instances, the programs produced exactly the same former, we estimate 10.2 TB by extrapolating from our regression results. For only five instances, the predicted structures differed by a (Fig. 7A, Supplementary Table S7); for the latter we measured few shifted base pairs of a single stem in a pseudoloop (while the pro- 136.4 GB; Knotty requires only 13.60 GB. grams still reported identical minimum energies); in these cases, we con- We further investigated whether a specific class of structures (i.e. firmed that the structural differences are indeed energetically neutral). pseudoknotted versus pseudoknot-free) would benefit stronger than Note that this equivalence must hold, since the latter sparsifies the for- the other from sparsification, and found out that both classes benefit mer (Theorem 2.1). Figure 7A shows the memory consumption, our equally from sparsification (see Fig. 
8); in general longer sequences main objective in this work, as a log-log plot versus sequence length; benefit more (as seen in Fig. 7A). A closer look at Figure 7A shows similarly the run times are reported in Figure 7B. We observe significant that, while Knotty has variation in memory usage within the same improvements in space from CCJ 1.0 over CCJ 2.0 to Knotty. length, these variations are minimal. We looked at numbers of can- At the same time, one observes a comparably small run time penalty in Knotty (which at this time does not make full use of didates and trace arrows in Knotty, which together explain the space requirements (see Supplementary Material). sparsification to optimize time). Nevertheless, we emphasize the im- To assess the prediction accuracy of Knotty on our dataset, portance of moderate space requirements, which is particularly rele- vant given today’s heavily parallel computation platforms with we compared it with two of the best existing methods, namely comparably costly main memory. In practice, Knotty’s small mem- HotKnots V2.0, a heuristic method with recently tuned energy ory footprint allows running many more instances in parallel com- parameters (Andronescu et al., 2010), and IPknot, a method based pared to CCJ 1.0 and even CCJ 2.0. Compared to Pknots, the on maximum expected accuracy (Sato et al., 2011). Moreover, we Knotty: efficient and accurate pseudoknot prediction 3855 Fig. 8. Numbers of candidates—as stored by Knotty—for the benchmark instances (in dependency of sequence length). We highlight instances of the four benchmark sets, which does not suggest any obvious bias (Color version of this figure is available at Bioinformatics online.) compared performance of Knotty with Pknots, the most general method for MFE prediction of pseudoknotted structures, as well as Simfold (Andronescu et al., 2005, 2007), a pseudoknot-free MFE prediction tool that employs HotKnots V2.0 energy model. Fig. 9. 
Accuracy of prediction across benchmark instances. (A) Distribution of Figures 9A and 10 summarize average accuracy, as well as average F-measures as box plots (B) Mean F-measure of pseudoknotted base pairs sensitivity and PPV of each method. In addition, we also compared [tool colors as in Subfigure (A)]. While the (total) F-measures of (A) assess the performance of Knotty to PknotsRG (Reeder and Giegerich, prediction accuracy of all base pairs, the mean ‘pseudoknotted’ F-measures 2004) (data in Supplementary Material). Since performance of of (B) assess the accuracy of predicting the pseudoknot (Color version of this figure is available at Bioinformatics online.) PknotsRG and Pknots was similar, we only included comparison results with the more general method, Pknots, in the main text. Figure 9A visualizes the prediction accuracies (F-measures) of the compared methods. Additionally, we analyzed the significance of the observed differences by permutation tests (Section 3.3.1; detailed results in Supplementary Material). For pseudoknot-free structures (HK-PK-free), only Pknots has significantly lower average F-measure compared to the other of methods. For dataset HK-PK, the differences between Knotty, Pknots and HotKnots are not significant, but these tools perform significantly better than IPknot and Simfold. On the IP-pk168 dataset, Knotty significantly outperforms HotKnots, IPknot and Simfold. Finally, the differences on the small DK-pk16 dataset are not significant. Summarizing, Knotty performs highly pseudoknotted structure prediction—on-par or super- ior to its competitors—and shows the most consistent performance. Figure 10 represents average sensitivity (on X axis) and PPV (on Y axis) of each method on our four datasets. In this figure each method has a different symbol and each dataset was represented in a Fig. 10. Average sensitivity and PPV of all methods on all datasets (Color ver- different color. 
Data points below diagonal line represent higher sion of this figure is available at Bioinformatics online.) sensitivity compared to PPV for a specific method, while data points above the diagonal line represent higher PPV compared to sensitivity of the specific method. As seen in Figure 10, Knotty achieves compares average F-measure of Knotty to that of other pseudoknot the highest sensitivity on all our pseudoknotted structures (blue, yel- predictors. In all benchmark sets, Knotty outperforms Pknots and low and red circles); while IPknot’s sensitivity is the highest on IPknot, while HotKnots V2.0 achieves similar accuracy for the pseudoknot-free structures (green symbols). This suggests that sets DK-pk16 and HK-PK. Partly, this shows the superiority of the IPknot favors pseudoknot-free structures, while Knotty has a energy model shared by Knotty and HotKnots V2.0. More inter- higher tendency of predicting pseudoknotted structures. estingly, the significant edge over HotKnots V2.0 on set IP-pk168 demonstrates the benefits of true (i.e. global) energy minimization 3.5 Accuracy of pseudoknot prediction over heuristic pseudoknot prediction. We assessed the accuracy of pseudoknot prediction of each method We further evaluated each tool’s ability to correctly identify pseudoknots. We say a pseudoknot is correctly identified by a by average F-measure of pseudoknotted base pairs predicted by each method (following Bellaousov and Mathews, 2010). Figure 9B method if the method correctly predicts at least one pseudoknotted 3856 H.Jabbari et al. Bellaousov,S. and Mathews,D.H. (2010) ProbKnot: fast prediction of RNA base pair. In this comparison, Knotty outperforms the other meth- secondary structure including pseudoknots. RNA, 16, 1870–1880. ods even more clearly, as it correctly identifies 55% of pseudoknots Chang,K.-Y. and Tinoco,I. 
(1997) The structure of an RNA ‘kissing’ hairpin in our dataset versus HotKnots 39%, IPknot 25% and Pknots complex of the HIV TAR hairpin loop and its complement. J. Mol. Biol., 43% (refer to the Supplementary Material). 269, 52–66. Chang,R.-Y.Y. et al. (2013) Japanese encephalitis virus non-coding RNA inhibits activation of interferon by blocking nuclear translocation of inter- 4 Conclusion feron regulatory factor 3. Vet. Microbiol., 166, 11–21. The novel algorithm Knotty provides comparably fast, highly ac- Chen,H.-L. et al. (2009) An O(n(5)) algorithm for MFE prediction of kissing curate structure prediction for a comprehensive class of biologically hairpins and 4-chains in nucleic acids. JCB, 16, 803–815. Dirks,R.M. and Pierce,N.A. (2003) A partition function algorithm for nucleic acid relevant pseudoknots, while posing only moderate space demands. secondary structure including pseudoknots. J. Comput. Chem., 24, 1664–1677. This unique feature set enables novel applications for computational Hajiaghayi,M. et al. (2012) Analysis of energy-based algorithms for RNA sec- pseudoknot structure prediction by overcoming crucial limitations ondary structure prediction. BMC Bioinformatics, 13, 22. of previous tools. Compared to the classic pseudoknot prediction al- Huang,X. and Ali,H. (2007) High sensitivity RNA pseudoknot prediction. gorithm Pknots, it is asymptotically faster by a linear factor, which Nucleic Acids Res., 35, 656–663. translates to decisive speed-ups already for moderately sized RNAs. Jabbari,H. (2015) Algorithms for prediction of RNA pseudoknotted secondary This gain is possible due to predicting only CCJ structures—in prac- structures. Ph.D. Thesis, University of British Columbia, Vancouver, Canada. tice rarely limiting over Pknots. Jabbari,H. and Condon,A. (2014) A fast and robust iterative algorithm for prediction of RNA pseudoknotted secondary structures. BMC While the original CCJ implementation additionally suffered from Bioinformatics , 15, 147. 
significant space overhead, the CCJ recurrences are intrinsically Jabbari,H. et al. (2008) Novel and efficient RNA secondary structure predic- space-demanding (introducing numerous auxiliary four-dimensional tion using hierarchical folding. J. Comput. Biol., 15, 139–163. DP matrices). Therefore, overcoming these limitations is the major Lin,Y. et al. (2018) Structural analyses of NEAT1 lncRNAs suggest technical and practical breakthrough of this contribution. For this long-range RNA interactions that may contribute to paraspeckle architec- purpose, we applied the technique of sparsification to CCJ-type recur- ture. Nucleic Acids Res., 46, 3742–3752. rences. To the best of our knowledge, this is the most complex appli- Lyngso,R.B. and Pedersen,C.N.S. (2000) Pseudoknots in RNA secondary cation of this technique so far (surpassing the sparsification of RNA– structures. BRICS Report Series RS-00-1. RECOMB00, ACM Press. RNA interaction prediction). In this respect, Knotty serves as a valu- Melchers,W. et al. (1997) Kissing of the two predominant hairpin loops in the able case study of complex space-efficient sparsification. Notably, coxsackie B virus 3’ untranslated region is the essential structural feature of the origin of replication required for negative-strand RNA synthesis. while previous applications of the sparsification technique mainly J. Virol., 71, 686–696. focused on speed improvements, we first of all aimed at reducing the Mercer,T.R. et al. (2009) Long non-coding RNAs: insights into functions. space requirements, which—after presenting CCJ—was the remaining Nat. Rev. Genet., 10, 155–159. limiting factor for the practical applicability of complex RNA pseudo- Mo ¨ hl,M. et al. (2010) Sparsification of RNA structure prediction including knotted secondary structure prediction. pseudoknots. Algorithms Mol. Biol., 5, 39. Our comparison to CCJ’s original implementation and Knotty’s Novikova,I.V. et al. 
(2013) Rise of the RNA machines: exploring the structure unsparsified variant provides detailed insights into general potentials of long non-coding RNAs. J. Mol. Biol., 425, 3731–3746. for space improvements of complex RNA-related algorithms. Nussinov,R. and Jacobson,A.B. (1980) Fast algorithm for predicting the sec- Moreover, benchmarking against state-of-the-art pseudoknot predic- ondary structure of single-stranded RNA. Proceed. Natl. Acad. Sci. USA, 77, 6309–6313. tion methods (Pknots, HotKnots V2.0, PknotsRG and IPknot), Rastegari,B. and Condon,A. (2007) Parsing nucleic acid pseudoknotted sec- we show superior capabilities for pseudoknot prediction and identifica- ondary structure: algorithm and applications. J. Comput. Biol., 14, 16–32. tion. Being fast, space-efficient, expressive and accurate—Knotty Reeder,J. and Giegerich,R. (2004) Design, implementation and evaluation of a opens the door to large scale biologically-relevant applications of pseu- practical pseudoknot folding algorithm based on thermodynamics. BMC doknot structure prediction covering all important pseudoknot classes. Bioinformatics, 5, 104. Rivas,E. and Eddy,S.R. (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. JMB, 285, 2053–2068. Acknowledgements Salari,R. et al. (2010) Time and space efficient RNA-RNA interaction prediction We thank Anne Condon for discussing details of the CCJ algorithm and first via sparse folding. In: Berger,B. (ed.) Proceedings of RECOMB 2010, Volume ideas on space savings. 6044 of Lecture Notes in Computer Science. Springer, Berlin, pp. 473–490. Sato,K. et al. (2011) IPknot: fast and accurate prediction of RNA secondary struc- Conflict of Interest: none declared. tures with pseudoknots using integer programming. Bioinformatics, 27, i85–i93. Sheikh,S. et al. (2012) Impact of the energy model on the complexity of RNA folding with pseudoknots. In: Ka ¨ rkka ¨ inen,J. and Stoye,J. 
(eds), References Combinatorial Pattern Matching, Volume 7354 of Lecture Notes in Akutsu,T. (2000) Dynamic programming algorithms for RNA secondary Computer Science. Springer, Berlin, pp. 321–333. structure prediction with pseudoknots. Discrete Appl. Math., 104, 45–62. Sperschneider,J. et al. (2012) Predicting pseudoknotted structures across two Andronescu,M. et al. (2005) Secondary structure prediction of interacting RNA sequences. Bioinformatics, 28, 3058–3065. RNA molecules. JMB, 345, 987–1001. Uemura,Y. et al. (1999) Tree adjoining grammars for RNA structure predic- Andronescu,M. et al. (2007) Efficient parameter estimation for RNA second- tion. Theor. Comput. Sci., 210, 277–303. ary structure prediction. Bioinformatics, 23, i19–i28. Verheije,M.H. et al. (2002) Kissing interaction between 3? Noncoding and Andronescu,M. et al. (2008) RNA STRAND: the RNA secondary structure coding sequences is essential for porcine arterivirus RNA replication. and statistical analysis database. BMC Bioinformatics, 9, 340. J. Virol., 76, 1521–1526. Andronescu,M.S. et al. (2010) Improved free energy parameters for RNA Wexler,Y. et al. (2007) A study of accessible motifs and RNA folding com- pseudoknotted secondary structure prediction. RNA, 16, 26–42. plexity. JCB, 14, 856–872. Backofen,R. et al. (2011) Sparse RNA folding: time and space efficient algo- Will,S. and Jabbari,H. (2016) Sparse RNA folding revisited: space-efficient rithms. J. Discrete Algorithms, 9, 12–31. minimum free energy structure prediction. Algorithms Mol. Biol., 11, 7–13.
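Illustrative sketch of the accuracy evaluation (Sections 3.3 and 3.3.1). The following Python code is not taken from the Knotty distribution; it is a minimal sketch of how the F-measure of a predicted structure and a two-sided permutation test on per-sequence F-measures could be computed. It assumes that structures are given as sets of base pairs (i, j), and it uses a generic paired sign-flip permutation scheme, which may differ in detail from the procedure of Jabbari and Condon (2014) and Hajiaghayi et al. (2012). The function names `f_measure` and `paired_permutation_test` are hypothetical.

```python
import random


def f_measure(predicted, reference):
    """Harmonic mean of sensitivity and PPV over base pairs.

    Both arguments are iterables of (i, j) base-pair tuples
    (an assumed representation, chosen for this sketch).
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    if true_pos == 0:
        # No common base pair with the reference structure.
        return 0.0
    sensitivity = true_pos / len(reference)
    ppv = true_pos / len(predicted)
    return 2 * sensitivity * ppv / (sensitivity + ppv)


def paired_permutation_test(scores_a, scores_b, rounds=10000, seed=0):
    """Two-sided paired permutation test on the mean score difference.

    In each round, the sign of every per-sequence difference is flipped
    with probability 0.5 (equivalent to swapping the two methods' scores
    for that sequence). Returns an estimated P-value.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(rounds):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted / len(diffs)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (rounds + 1)
```

Under these assumptions, a difference between two methods would be reported as significant if the returned P-value is below 0.05, in line with the threshold stated in Section 3.3.1.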
Bioinformatics – Oxford University Press
Published: Jun 1, 2018