Knotty: efficient and accurate prediction of complex RNA pseudoknot structures

Abstract Motivation The computational prediction of RNA secondary structure by free energy minimization has become an important tool in RNA research. However, in practice, energy minimization is mostly limited to pseudoknot-free structures or rather simple pseudoknots, not covering many biologically important structures such as kissing hairpins. Algorithms capable of predicting sufficiently complex pseudoknots (for sequences of length n) used to have extreme complexities; e.g. Pknots has O(n^6) time and O(n^4) space complexity. The algorithm CCJ dramatically improves the asymptotic run time for predicting complex pseudoknots (handling almost all relevant pseudoknots, while being slightly less general than Pknots), but this came at the cost of large constant factors in space and time, which strongly limited its practical application (∼200 bases already require 256 GB of space). Results We present a CCJ-type algorithm, Knotty, that handles the same comprehensive class of pseudoknotted structures as CCJ with improved space complexity of Θ(n^3 + Z); due to the applied technique of sparsification, the number of 'candidates', Z, appears to grow significantly slower than n^4 on our benchmark set (which includes pseudoknotted RNAs of up to 400 nt). In terms of run time over this benchmark, Knotty clearly outperforms Pknots and the original CCJ implementation, CCJ 1.0; Knotty's space consumption fundamentally improves over CCJ 1.0, being on a par with the space-economic Pknots. By comparing to CCJ 2.0, our unsparsified Knotty variant, we demonstrate the isolated effect of sparsification. Moreover, Knotty employs the state-of-the-art energy model of 'HotKnots DP09', which results in superior prediction accuracy over Pknots. Availability and implementation Our software is available at https://github.com/HosnaJabbari/Knotty. Supplementary information Supplementary data are available at Bioinformatics online.
1 Introduction Computational RNA secondary structure prediction has become an indispensable tool in the research on non-coding RNAs. Besides coding for proteins, RNAs perform various essential roles, most prominently in regulating gene expression, in all kingdoms of life, in many cases mediated by their three-dimensional structures (Mercer et al., 2009). Despite the ubiquity of pseudoknots in these RNAs (Chang et al., 2013; Lin et al., 2018; Novikova et al., 2013), most often pseudoknot-free structure prediction methods are applied in biological research, severely limiting the practical capabilities to correctly predict, recognize and compare pseudoknotted structures. The fundamental cause of this limitation is the computational complexity of predicting general pseudoknots in the nearest neighbor energy model, which is NP-hard (Akutsu, 2000; Lyngso and Pedersen, 2000) and even inapproximable (Sheikh et al., 2012). Thus, the high complexity of pseudoknot prediction can be overcome only by either heuristics without optimality guarantees or restrictions on the predictable pseudoknot class. In comparison, predicting minimum free energy (MFE) structures without pseudoknots is simple: a pseudoknot-free secondary structure is either closed by a base pair connecting the first and last base in the sequence, or can be partitioned into two independent substructures on a prefix and suffix of the sequence, where the energies of the substructures add up to the total energy. This simple decomposition scheme allows predicting the MFE pseudoknot-free secondary structure by dynamic programming (DP) in Θ(n^3) time and Θ(n^2) space for standard energy models (Nussinov and Jacobson, 1980). The most general DP algorithm for MFE prediction of pseudoknotted structures, Pknots (Rivas and Eddy, 1999), has complexity of Θ(n^6) time and Θ(n^4) space.
To reduce this prohibitively high complexity, several computationally less costly algorithms for MFE pseudoknot prediction have been proposed; for example, running in Θ(n^5) time and Θ(n^4) space (Dirks and Pierce, 2003; Lyngso and Pedersen, 2000; Uemura et al., 1999) or even Θ(n^4) time and Θ(n^2) space in PknotsRG (Lyngso and Pedersen, 2000; Reeder and Giegerich, 2004). However, all these reductions came at the cost of severely limiting the pseudoknot class compared to Pknots. Most strikingly, none of the less complex algorithms handles the biologically important class of kissing hairpins with arbitrary nested substructures (Chang and Tinoco, 1997; Melchers et al., 1997; Verheije et al., 2002). Recently, we introduced the MFE prediction algorithm CCJ (Chen et al., 2009). Compared to the more complex algorithm Pknots, CCJ handles a strictly less general subclass of pseudoknots, but it significantly expands the class of structures that can be handled in O(n^5) time, e.g. including the biologically relevant kissing hairpins. To achieve CCJ's novel balance of time and pseudoknot complexity, we defined a new class of structures called Three-Groups-of-Bands (TGB) structures with at most three groups of bands (see Fig. 1A and B). On this basis, we presented recurrences that handle TGB structures, alternatingly decomposing the left, right, middle and outer bands. Fig. 1. Examples of 'Three-Groups-of-Bands' (TGB) and CCJ structures. Each arc represents a set of consecutive pseudoknotted base pairs (referred to as a band) that cross the same band. (A) TGB structure with two left, two right and four middle bands; (B) TGB with no left band, one middle and two right bands as well as a nested pseudoknotted substructure; (C) CCJ structure composed of the two TGB structures of subfigures A and B (see decomposition of P)
The CCJ algorithm covers H-type pseudoknotted structures, kissing hairpins and chains of four interleaved stems by overlaying TGB structures (Fig. 1C); this composition can be applied recursively, which defines the class of CCJ structures. Note that this class is comparably general, e.g. comprising all density-2 structures (Jabbari et al., 2008) with up to four interleaved stems. For example, according to Rastegari and Condon (2007), among 1133 pseudoknotted RNA structures, only 9 were not density-2 and 6 were too complex for Pknots. While CCJ predicts complex pseudoknotted secondary structures in Θ(n^5) time, its Θ(n^4) space complexity was still limiting in practice; e.g., for the original CCJ algorithm, even generous memory of 256 GB used to limit the length of RNAs to below 200 bases. A slightly deeper look reveals that CCJ's substantial improvement in time complexity over Pknots was achieved at the price of using a large number of four-dimensional DP matrices (about 20 such matrices in CCJ versus four in Pknots). Apparently, CCJ's speedup is tied to a space consumption that is prohibitively large in practice. In this work, we overcome the still unresolved dilemma that expressive pseudoknot prediction used to be limited either by time (Pknots) or space (CCJ). To improve the extreme space requirements of expressive, but comparably fast prediction, we apply the idea of sparsification (Backofen et al., 2011; Wexler et al., 2007) to CCJ.
Devising sparsified recurrences of CCJ, resulting in the algorithm Knotty, we improve the space complexity to Θ(n^3 + Z), where Z is the total number of candidates. This complexity is the result of replacing all four-dimensional DP matrices by (a constant number of) three-dimensional matrix slices and lists of candidates. This suffices to exactly predict the same MFE structures as before (due to inverse triangle inequalities). The number of candidates is expected to be much smaller than the number of replaced matrix entries; moreover, it cannot be larger. The space-efficient retrieval of the optimal structure from this algorithm turned out to be a non-trivial problem in the sparsification of MFE RNA structure prediction algorithms. We enable it by implementing our recently presented technique of space-efficient sparse traceback (Will and Jabbari, 2016); this method requires additional space for trace arrows, but keeps the number of trace arrows low. Notably, previous applications of the sparsification technique to RNA structure prediction discussed either pseudoknot-free methods (Backofen et al., 2011; Wexler et al., 2007) or pseudoknotted methods with simplified energy models (base pair maximization only) (Möhl et al., 2010). Sparsified RNA–RNA interaction prediction (Salari et al., 2010) is so far the most complex case of structure prediction that was implemented for a realistic interaction energy model. While space-efficient sparsification is discussed in that paper, the space-efficient variant of the implementation could not recover the optimal interaction structure by traceback. 1.1 Contributions We introduce Knotty, providing a unique combination of comparably low run time at moderate space consumption and high expressivity, covering most biologically relevant pseudoknots, including the important class of kissing hairpins, and employing a state-of-the-art energy model.
Aiming at low computational complexity and high expressivity, we base our work on the original CCJ algorithm. To fundamentally overcome CCJ's space overhead, we apply space-efficient sparsification to an adapted CCJ algorithm, resulting in the novel algorithm Knotty. By this, we present, to the best of our knowledge, the first sparsified prediction of complex pseudoknots using a realistic energy model. For studying Knotty's computational behavior, we benchmark against the original CCJ and a non-sparse variant of Knotty (which we implemented in parallel for the purpose of such comparison). We show that sparsification alone substantially reduces the space requirements, while further significant improvements over the original implementation were achieved even in the non-sparse Knotty variant. Aiming at high accuracy, our implementation of Knotty supports the state-of-the-art HotKnots DP09 energy model (Andronescu et al., 2010). To analyze Knotty's biological relevance, we benchmarked its prediction accuracy (for the total structures as well as for the actual pseudoknots) in comparison to the pseudoknot prediction programs HotKnots V2.0 (Andronescu et al., 2010), IPknot (Sato et al., 2011) and the most general MFE pseudoknot prediction method, Pknots (Rivas and Eddy, 1999). As a control, we moreover compare to Simfold (Andronescu et al., 2005, 2007), a pseudoknot-free prediction method based on the same energy model. 2 Materials and methods The original CCJ algorithm (Chen et al., 2009) minimizes the free energy over all CCJ structures for a given input sequence S by DP. As stated in the introduction, CCJ structures comprise kissing hairpins and chains of four interleaved stems, which can recursively contain CCJ structures as substructures. The optimal CCJ structure is then determined by standard traceback through the DP matrices. Generally, DP algorithms can be described by presenting the recurrences that are used to calculate the entries of their DP matrices.
In the case of RNA structure prediction, the DP matrices store MFE values for sequence fragments (under specific conditions). The recurrences correspond to decompositions of these fragments such that the matrix entries can be recursively inferred from the energies of smaller subproblems. For example, assuming an energy value of −1 for each canonical base pair, i.e. C–G, A–U or G–U, the MFE structure of an RNA sequence, S, can be found using matrices WN and VN. The entries WN(i,j) and VN(i,j), which hold the respective minimum free energies of general and closed structures of the subsequence Si..j, are decomposed as in Equation (1). These grammar rules represent a complete case distinction of possible structures. Similar graphical notation was introduced by Rivas and Eddy (1999) and is now commonly used to present DP algorithms for RNA (e.g. Dirks and Pierce, 2003; Möhl et al., 2010). In our notation, the sequence is represented as a line with bases as index positions; solid arcs indicate base pairs; and dashed arcs enclose areas containing unconsumed structure. Red circles represent fixed end points of a region. Blue squares are free indices (i.e. positions over which the recurrences minimize to determine boundaries). To avoid cluttering the representation, we do not mark end points that can be identified from the bases directly before or after them. In the example of Equation (1), WN(i,j) corresponds to the MFE structure of Si..j; this structure can be decomposed according to Equation (1), since either j is unpaired (left case) or j is paired to some inner position (recursion to VN, in which the solid arc represents a base pair); the closed structure corresponding to VN(i,j) is reduced to WN(i+1,j−1) (rightmost case). Moreover, the grammar rules allow inferring, almost mechanically, the recursion equations for base pair free energy minimization (up to specific energy contributions).
Generally, our recursions minimize over the recursion cases and over the free indices within their respective ranges, limited by the fixed indices. Thus, we translate the rules of Equation (1) to

WN(i,j) = min{ WN(i,j−1), min_{i≤k<j} [WN(i,k−1) + VN(k,j)] }
VN(i,j) = WN(i+1,j−1) − 1 if Si–Sj is canonical; VN(i,j) = ∞ otherwise.

In our discussion of CCJ and its sparsification, the graphical presentation allows us to focus on the sparsification and avoid distracting details like the exact energy contributions added in each single step, which are not necessary for understanding the algorithmic ideas. Formal recurrences are provided in the Supplementary Material; further details are found in Jabbari (2015). While pseudoknot-free recursions generally use only fragments that are connected at the sequence level, the CCJ algorithm requires 'gapped' fragments (two subsequences disconnected by a gap). Consequently, defining such fragments requires four sequence positions in total. The MFE of the CCJ structure for the subsequence Si..l is calculated in W(i, l). Its decomposition distinguishes three cases: in the first case, the last base, l, is unpaired; in the second case, the structure in region [i,l] is decomposed into two non-overlapping substructures; in the third case, the end points pair with each other to close a loop. The recurrence V(i,l) handles non-pseudoknotted loops closed by i and l; P(i,l) is the MFE of a CCJ pseudoknot for region [i,l]. P(i,l) is decomposed by rule (2) into two TGB structures (with MFEs in the PK matrix). Representing TGB structures requires using gapped fragments. As indicated by the three blue boxes (free indices) in Equation (2), each entry of P is minimized over three indices. Thus, this step already bounds the time and space complexity of the CCJ algorithm to O(n^5) and O(n^4), respectively. PK(i,j,k,l) is the MFE over all TGB structures of the gapped fragment [i,j]∪[k,l].
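For illustration, the toy recurrences above can be implemented directly. The following sketch uses only the base-pair energy model of Equation (1) (−1 per canonical pair), not Knotty's HotKnots DP09 model, and for simplicity it ignores minimum hairpin loop lengths; the function name toy_mfe is ours:

```python
# Toy MFE computation for the recurrences of Equation (1):
# WN(i,j) = min{ WN(i,j-1), min_{i<=k<j} WN(i,k-1) + VN(k,j) }
# VN(i,j) = WN(i+1,j-1) - 1 if (Si, Sj) is canonical, else infinity.
from functools import lru_cache

INF = float("inf")
CANONICAL = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def toy_mfe(S: str) -> float:
    """MFE under the toy model: -1 per canonical base pair, no loop terms."""
    n = len(S)

    @lru_cache(maxsize=None)
    def VN(i: int, j: int) -> float:
        # closed structure: i pairs with j; the interior reduces to WN(i+1, j-1)
        if i >= j or (S[i], S[j]) not in CANONICAL:
            return INF
        return WN(i + 1, j - 1) - 1

    @lru_cache(maxsize=None)
    def WN(i: int, j: int) -> float:
        if j <= i:                                   # empty or single-base fragment
            return 0.0
        best = WN(i, j - 1)                          # case: j unpaired
        for k in range(i, j):                        # case: j pairs with some k
            best = min(best, WN(i, k - 1) + VN(k, j))
        return best

    return WN(0, n - 1)
```

For example, toy_mfe("GCAU") evaluates to −2, obtained by the two side-by-side pairs G–C and A–U.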
Note that such structures have additional restrictions: the positions i and l must be involved in some base pairs that are not part of nested substructures; moreover, some base pair of a TGB structure must span the gap. The recurrence for PK uses the terms PL, PM, PO and PR, which (put informally) handle bands in the left, middle and right groups of the TGB structure; both PM and PO are needed to handle bands in the middle group. The matrix entries PL(i,j,k,l), PM(i,j,k,l), PR(i,j,k,l) and PO(i,j,k,l) are decomposed as illustrated in Figure 2, in which WP handles nested substructures in a pseudoloop. (For invalid index combinations, the matrix entries are set to +∞ and do not have to be stored.) Fig. 2. Decompositions for PK(i,j,k,l), PL(i,j,k,l), PM(i,j,k,l), PR(i,j,k,l) and PO(i,j,k,l) in grammar-rule-like graphical notation (Color version of this figure is available at Bioinformatics online.) As seen in Figure 2, each of the matrices PL, PR, PM and PO requires a base pair between the two ends at the respective positions: left, right, middle or outer. Each such matrix distinguishes the three cases that this base pair (i) closes an interior loop, (ii) closes a multi-loop or (iii) is the inner border of a band. In the latter case, the respective terms PfromX (X∈{L,R,M,O}) handle transitions from base pairs in one group to base pairs in other groups (see Fig. 3). Fig. 3. Decompositions of PfromX(i,j,k,l) for X∈{L,R,M,O}, which handle pseudoloops closed by an X band as well as the transition to some PY (Color version of this figure is available at Bioinformatics online.)
In the transition to another band via one of the PfromX matrices, we allow nested substructures around the band. This is handled by the first two cases of the respective recurrence PfromX. Moreover, in PfromX we recurse to matrices PX(i,j,k,l) only if the requirements of these matrices are met. Note that in PfromL, it is not possible to transition to PL. This is because the recurrences are designed so that bands are handled in rounds. Within a round, bands in the left group are handled first, if any, then those in the right group, if any, and then those in the middle, where bands handled by PM (if any) are handled before those handled by PO. A middle band must be handled in each round; otherwise, for example, two 'bands' in the left group, added in different rounds, would collapse into one, causing the recurrences to incorrectly add penalty terms for band 'borders' that are not actually borders. For this reason, no row in PfromL has a PL term, and so a band in the left group cannot be handled directly after a band in the right group. Also, PfromO does not have a row with a PM term, to ensure that PM cannot be used twice in the same round. Interior loops that span a band in the left group are decomposed by the rule for PL,iloop; the remaining cases (right, middle, outer) are analogous. Note that, while the decomposition of PL,iloop has two free indices, these indices are constrained by setting a constant maximum size of interior loops (here 30 bases), as is common practice. For handling interior loops that span a band, the original CCJ algorithm introduced the five-ary function PL,iloop5 and applied a clever scheme of keeping track of borders by transitioning to different recurrence cases, so as to still use only Θ(n^4) space.
However, due to its non-standard form, this recurrence cannot be sparsified directly. Thus, in Knotty, we make the pragmatic choice to simplify the recurrence, which works since even the detailed HotKnots DP09 energy parameter set does not require the full generality supported by the original recursion in CCJ. Note that this modification alone, i.e. without sparsification, does not change the complexity of the algorithm, even though it reduces the overhead. Moreover, we modify the original handling of multi-loop cases to enable their sparsification. Figure 4 shows our multi-loop handling for the left band. We handle multi-loops by passing through the states PL,mloop1 and PL,mloop0, which keep track of how many additional inner base pairs still have to be enforced (1 or 0). Finally, Figure 5 shows the decompositions of WB(i,l) and WB′(i,l), which represent multi-loop fragments around a band; in contrast to WB(i,l), WB′(i,l) requires a base pair in its range [i,l]. The other 'W'-matrices (WM and WM′ for ordinary multi-loops; WP and WP′ for pseudoknot multi-loops) are decomposed analogously. Fig. 4. Decomposition of multi-loops that span a band in the left band (Color version of this figure is available at Bioinformatics online.) Fig. 5. Decompositions of WB(i,l) and WB′(i,l) (Color version of this figure is available at Bioinformatics online.) 2.1 The Knotty algorithm Knotty improves the space complexity over the CCJ algorithm utilizing the following idea: due to the technique of sparsification, one can minimize the free energy while keeping only a fraction of the DP matrix cells, namely the entries referred to as candidates.
By storing a few candidates (as explained below), we avoid storing all four-dimensional matrices. (i) In recurrences in which the left-most index, i, does not change, we store the values of such recurrences in a constant number of three-dimensional matrix slices; we call the collection of these matrix slices i-slices. In many of these recursion cases, we recurse to matrix entries of the same i-slice (e.g. when inferring PK from PL or, slightly more interestingly, PR from PfromR). In other cases, we recurse to the (i+1)-slice (e.g. PL from PfromL) or the (i+c)-slice, where c is constantly bounded. The latter occurs in the handling of interior loops (PL,iloop and, analogously, PO,iloop), where c does not exceed the maximum interior loop size, MLS. (ii) This leaves us with the recursion cases that recurse to some d-slice, where d − i is not constantly bounded. Instead of storing all slices, we store only certain candidate entries of such slices. These matrix entries (called candidates) are stored in candidate lists for specific recursion cases, together with their corresponding second, third and fourth matrix indices, j, k and l, to keep track of band borders. In some cases, candidate lists can be shared between recursion cases. We presented more details on candidate list requirements in Möhl et al. (2010). 2.1.1 Space representation of the four-dimensional matrices by Knotty Only the matrices corresponding to PL and PO occur in interior loops that span a band and require recursing to a different i-slice; for them, we store slices i..i+MLS. For the matrices corresponding to PfromL, PfromO, PL,mloop0, PL,mloop1, PO,mloop0 and PO,mloop1, we only need to store slices i and i + 1. Matrices corresponding to recurrences of types PX,iloop and PX,mloop (X∈{L,R,M,O}) do not need to be stored and can be computed when needed. For the remaining matrices, we store only the current i-slice.
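The slice storage of case (i) can be sketched as a rotating buffer: for a matrix accessed only at slices i..i+MLS, it suffices to keep MLS + 1 three-dimensional slices and address them modulo MLS + 1, so that stepping from row i + 1 down to row i recycles an old slice without copying. This is a hypothetical sketch, not Knotty's actual data structure; the names SliceStore and start_row are ours, and dictionaries stand in for the dense 3D arrays a real implementation would use:

```python
# Rotating storage for a 4D DP matrix that is only ever read at
# slices i .. i+MLS: keep MLS+1 slices, addressed modulo MLS+1.
MLS = 30  # maximum interior loop size, as in the text

class SliceStore:
    def __init__(self, depth: int = MLS + 1):
        self.depth = depth
        # one slice per buffer; each maps (j, k, l) to an energy value
        self.slices = [dict() for _ in range(depth)]

    def slice_for(self, i: int) -> dict:
        return self.slices[i % self.depth]

    def start_row(self, i: int) -> None:
        # entering row i: the buffer that held row i + depth is recycled
        self.slice_for(i).clear()

    def get(self, i: int, j: int, k: int, l: int) -> float:
        return self.slice_for(i).get((j, k, l), float("inf"))

    def set(self, i: int, j: int, k: int, l: int, value: float) -> None:
        self.slice_for(i)[(j, k, l)] = value
```

A filling loop would iterate i from n − 1 down to 0, calling start_row(i) before computing the entries of row i; rows i and i + MLS + 1 share a buffer, which is why no copying is needed.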
Note that space is always reused in the next iteration; in the case of ranges of slices, the memory access is 'rotated' without copying in memory. In total, Knotty thus maintains candidate lists for nine matrices: PK, PfromL, PfromM, PfromR, PfromO, PL,mloop0, PM,mloop0, PR,mloop0 and PO,mloop0. To finalize the sparsification, all recursion cases that recurse to these matrices need to be modified. This affects all recursion cases that insert a nested substructure to the left of the gapped region, which occurs exactly in the decompositions of PK, PfromL, PfromO, PL,mloop1, PL,mloop0, PO,mloop0 and PO,mloop1. Moreover, this affects the decomposition of P into two PK-fragments, where the latter fragment is taken from the respective candidate list. We discuss the individual modifications for the three examples of PK, PL,mloop1 and P; the remaining cases are sufficiently similar to be sparsified analogously. 1. Consider the case of the PK recurrence min_{i<d≤j} WP(i,d−1) + PK(d,j,k,l). It suffices to minimize only over certain candidates PK(d,j,k,l) that are not optimally decomposable in the following sense: ∄ e>d: PK(d,j,k,l) = WP(d,e−1) + PK(e,j,k,l). (3) It can be shown easily that whenever PK(i,j,k,l) is optimally decomposable, there is a candidate (i.e. a smaller, not optimally decomposable fragment) that yields the same minimum value. Remarkably, the candidate criterion can be checked efficiently by the DP algorithm. This is seen more directly from the equivalent candidate criterion PK(d,j,k,l) < min_{d<e≤j} WP(d,e−1) + PK(e,j,k,l). This minimum, too, can be computed by running over candidates only; moreover, it must be calculated by the DP algorithm for computing PK(d,j,k,l) anyway, such that the check is performed without additional overhead. This idea holds analogously for the other candidate checks. 2. Similarly, the minimization min_{i<d≤j} WB(i,d−1) + PL,mloop0(d,j,k,l) in the recurrence of PL,mloop0 is restricted to candidates that satisfy ∄ e>d: PL,mloop0(d,j,k,l) = WB(d,e−1) + PL,mloop0(e,j,k,l).
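The candidate criterion can be illustrated in the simpler pseudoknot-free setting of the toy energy model, with W playing the role of WP and the closed-structure energies V the role of PK. This one-dimensional analogy (the function name sparse_mfe is ours) is not Knotty's gapped recurrence, but it shows the same mechanism: an entry joins the candidate list of column j only if it is not optimally decomposable, and the minimization then runs over candidates alone:

```python
# Candidate-list sparsification in the pseudoknot-free toy model
# (-1 per canonical pair, no minimum hairpin length): V(i, j) becomes a
# candidate only if no split W(i, d-1) + V(d, j) over earlier candidates
# d matches or beats it.
INF = float("inf")
PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def sparse_mfe(S: str) -> float:
    n = len(S)
    W = [[0.0] * n for _ in range(n)]     # W[i][j]; 0 on/below the diagonal
    for j in range(1, n):
        cands = []                         # candidate entries (d, V(d, j)) of column j
        for i in range(j - 1, -1, -1):
            # V(i, j): structure closed by the pair (i, j)
            if (S[i], S[j]) in PAIRS:
                interior = W[i + 1][j - 1] if i + 1 <= j - 1 else 0.0
                v = interior - 1
            else:
                v = INF
            # decompose [i, j] via candidate split points d only
            w = W[i][j - 1]                # case: j unpaired
            for d, vd in cands:            # case: j paired at a candidate d > i
                w = min(w, W[i][d - 1] + vd)
            if v < w:                      # (i, j) is not optimally decomposable
                cands.append((i, v))
            W[i][j] = min(w, v)
    return W[0][n - 1]
```

As in the text, the correctness of restricting the minimization to candidates rests on the inverse triangle inequality W(x,z) ≤ W(x,y−1) + W(y,z): any split through a non-candidate can be rerouted through the candidate that decomposes it.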
(4) We could even strengthen the criterion, such that candidates must furthermore satisfy ∄ e≥d: PL,mloop0(d,j,k,l) = PL,mloop0(d,e,k,l) + WB(e+1,j). (5) 3. In the minimization calculating P(i,l), i.e. min_{i<j<d<k<l} PK(i,j−1,d+1,k−1) + PK(j,d,k,l), it suffices to consider only candidates PK(j,d,k,l). Entries PK(j,d,k,l) are candidates if and only if they are not optimally decomposable in the following sense: ∄ e (j≤e<d): PK(j,d,k,l) = PK(j,e,k,l) + WP(e+1,d). (6) Notably, the candidate lists for the PX,mloop0 and PX,mloop1 recurrences (X∈{L,M,R,O}) can be shared, since candidates for decomposing PX,mloop0 must be candidates for PX,mloop1, due to WB(i,j) ≤ WB′(i,j) for all i, j; cf. Figure 5 as well as Equations (4) and (5). Finally, the Knotty recurrences can be computed based on the (constantly bounded) matrix slices and the candidates alone. Their correctness is a consequence of inverse triangle inequalities; for example, in the case of W we have ∀ x<y≤z: W(x,z) ≤ W(x,y−1) + W(y,z), which follows from the definition of W. Analogous inequalities hold for WP, WB and WB′. Theorem 2.1 (Correctness of the Knotty sparsification). The Knotty recurrences are equivalent to the original CCJ recurrences. The proof of Theorem 2.1 is provided in the Supplementary Material. Note that the presented sparsification would not have been possible for the multi-loop handling of the original CCJ algorithm (Fig. 6). In the original decomposition of PL,mloop, the unbounded access to d-slices cannot easily be replaced by candidates: note the index offsets in the graphical notation, indicating that in the recurrence of PL,mloop(i,j,k,l), the WB′ and WB fragments both start at i + 1; similarly, there is a shift to j − 1 in the recurrences of WB and WB′ of PL,mloop0(i,j,k,l) and PL,mloop1(i,j,k,l) (see Supplementary Material for further details). Fig. 6.
Decomposition of multi-loops in the left band by the original CCJ algorithm (Color version of this figure is available at Bioinformatics online.) 2.2 Space complexity analysis In the previous sections, we sparsified all four-dimensional matrices of the Knotty algorithm with the goal of reducing its space complexity. As explained before, our sparsification allows replacing all four-dimensional matrices by three-dimensional matrix slices. In seven recursion cases, we had to rewrite the minimizations such that they compute equivalent results by recursing only to candidates and entries of the same i-slice. In the multi-loop cases, candidate lists are shared. Even if only a small fraction of the respective four-dimensional fragments are optimally decomposable (i.e. are not candidates), these changes will save space over the non-sparsified version. However, experience from previous sparsifications (and our results) shows that a large number of fragments is optimally decomposable, such that the number of candidates is small. We define Z as the total number of candidates. For the traceback, we additionally store trace arrows to optimal, non-candidate inner base pairs of interior loops; we denote their number by T. Then, the total space complexity of Knotty is O(n^3 + Z + T). 3 Results In this section, we provide implementation and dataset details and show a comparison of Knotty in terms of time and memory usage to its predecessors, CCJ 1.0 and CCJ 2.0. In addition, we compare Knotty's prediction accuracy to some of the best existing methods for RNA pseudoknotted secondary structure prediction. 3.1 Implementation Knotty, CCJ 2.0 and the original CCJ algorithm were implemented in C++.
Knotty and CCJ 2.0 implement the presented recurrences (respectively, with and without sparsification) utilizing the DP09 energy model of HotKnots V2.0 (Andronescu et al., 2010), while CCJ 1.0 strictly realizes the original CCJ recurrences and uses a slightly different energy model. Notably, all three implementations are reported here for the first time. In Knotty, we implement the sparsified recurrences and store only (a constant number of) three-dimensional matrix slices instead of four-dimensional matrices. For the (given the sparse information, non-trivial) traceback, we implement a careful adaptation of the approach of Will and Jabbari (2016) to compute the traceback by recomputation with the help of as few trace arrows as possible. This demonstrates the generality of Will and Jabbari's approach (cf. Supplementary Material). 3.2 Datasets We analyzed the performance of our algorithm on a large dataset of over 600 RNA sequences of up to 400 bases in length. This dataset was compiled from four non-overlapping datasets with various pseudoknotted and pseudoknot-free structures. The HK-PK and HK-PK-free datasets were compiled from the RNA STRAND database (Andronescu et al., 2008) and were used in evaluating HotKnots V2.0 (Andronescu et al., 2010); IP-pk168 contains 16 categories of pseudoknotted structures with at most 85% sequence similarity (Huang and Ali, 2007) and was used in evaluating IPknot (Sato et al., 2011); and DK-pk16 contains 16 pseudoknotted structures with strong experimental support (Sperschneider et al., 2012). Table 1 summarizes these datasets.

Table 1. Summary of our benchmark datasets. Reference structures of the pk-type datasets are pseudoknotted, while they are pseudoknot-free in HK-PK-free

Name        #seqs  Type     Seq. lengths  Ref.
HK-PK       88     pk       26–400        Andronescu et al. (2010)
HK-PK-free  337    pk-free  10–194        Andronescu et al. (2010)
IP-pk168    168    pk       21–137        Huang and Ali (2007)
DK-pk16     16     pk       34–377        Sperschneider et al. (2012)

3.3 Accuracy To evaluate the accuracy of predicted structures, we use the harmonic mean of sensitivity and positive predictive value (PPV), commonly referred to as F-measure (Jabbari and Condon, 2014). Its value ranges between 0 and 1, with 0 indicating no common base pair with the reference structure and 1 indicating a perfect prediction (details in Supplementary Material). 3.3.1 Permutation test We use a two-sided permutation test to assess the statistical significance of the observed performance differences between two methods. We follow the procedure explained in Jabbari and Condon (2014) and Hajiaghayi et al. (2012). We report a difference in prediction accuracy as significant if the P-value is < 0.05. 3.4 Benchmark results We benchmarked Knotty, CCJ 1.0, CCJ 2.0 and Pknots on an Intel® Xeon® CPU E5-4657L v2 at 2.40 GHz; for running jobs in parallel, the compute server was equipped with a total of 1.5 TB of available memory.
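The F-measure of Section 3.3 can be sketched as follows; this is our own minimal formulation over sets of base pairs (the exact definition used in the paper is given in its Supplementary Material):

```python
# F-measure: harmonic mean of sensitivity and PPV over base pairs.
def f_measure(predicted: set, reference: set) -> float:
    """predicted/reference: sets of base pairs (i, j) with i < j."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)       # correctly predicted base pairs
    if tp == 0:
        return 0.0                        # no common base pair
    sensitivity = tp / len(reference)     # fraction of reference pairs found
    ppv = tp / len(predicted)             # fraction of predicted pairs correct
    return 2 * sensitivity * ppv / (sensitivity + ppv)
```

A perfect prediction yields 1.0; a prediction sharing no base pair with the reference yields 0.0, matching the range described in the text.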
Note that all programs run single-threaded. Based on instances of our datasets, we compared the run time and memory requirements of the programs. Moreover, we verified that CCJ 2.0 and Knotty indeed produce equivalent results. (For most of our benchmark instances, the programs produced exactly the same results. For only five instances, the predicted structures differed by a few shifted base pairs of a single stem in a pseudoloop, while the programs still reported identical minimum energies; in these cases, we confirmed that the structural differences are indeed energetically neutral.) Note that this equivalence must hold, since the latter sparsifies the former (Theorem 2.1). Figure 7A shows the memory consumption, our main objective in this work, as a log-log plot versus sequence length; similarly, the run times are reported in Figure 7B. We observe significant improvements in space from CCJ 1.0 over CCJ 2.0 to Knotty.

Fig. 7. (A) Memory usage (maximum resident set size in MB) versus length (log–log plot) over all benchmark instances. The solid line shows a best ‘asymptotic’ fit of the form c₁ + c₂nˣ for sequence length n, constants c₁, c₂ and exponent x; for the fit, we ignore all values n < 50. (B) Run time (s) versus length (log–log plot) over all benchmark instances. For each tool in both plots, we report (in parentheses) the exponent x estimated from the benchmark results; it describes the observed complexity as Θ(nˣ) (Supplementary Section S5).

At the same time, one observes a comparably small run-time penalty in Knotty (which at this time does not make full use of sparsification to optimize time). Nevertheless, we emphasize the importance of moderate space requirements, which is particularly relevant given today’s heavily parallel computation platforms with comparably costly main memory. In practice, Knotty’s small memory footprint allows running many more instances in parallel compared to CCJ 1.0 and even CCJ 2.0. Compared to Pknots, the time savings are striking, in particular for larger instances (e.g. ∼19 days compared to ∼12 h for the 400-bases-long benchmark instance). Note that the same instance needs an unreasonable amount of memory for CCJ 1.0 and still extreme space for CCJ 2.0: for the former, we estimate 10.2 TB by extrapolating from our regression (Fig. 7A, Supplementary Table S7); for the latter, we measured 136.4 GB; Knotty requires only 13.60 GB. We further investigated whether a specific class of structures (i.e. pseudoknotted versus pseudoknot-free) would benefit more strongly than the other from sparsification, and found that both classes benefit equally (see Fig. 8); in general, longer sequences benefit more (as seen in Fig. 7A). A closer look at Figure 7A shows that, while Knotty's memory usage varies among sequences of the same length, these variations are minimal. We examined the numbers of candidates and trace arrows in Knotty, which together explain the space requirements (see Supplementary Material).

Fig. 8. Numbers of candidates—as stored by Knotty—for the benchmark instances (in dependency of sequence length).
We highlight instances of the four benchmark sets, which does not suggest any obvious bias.

To assess the prediction accuracy of Knotty on our dataset, we compared it with two of the best existing methods, namely HotKnots V2.0, a heuristic method with recently tuned energy parameters (Andronescu et al., 2010), and IPknot, a method based on maximum expected accuracy (Sato et al., 2011). Moreover, we compared the performance of Knotty with Pknots, the most general method for MFE prediction of pseudoknotted structures, as well as with Simfold (Andronescu et al., 2005, 2007), a pseudoknot-free MFE prediction tool that employs the HotKnots V2.0 energy model. Figures 9A and 10 summarize the average accuracy, as well as the average sensitivity and PPV, of each method. In addition, we also compared the performance of Knotty to PknotsRG (Reeder and Giegerich, 2004) (data in Supplementary Material). Since the performance of PknotsRG and Pknots was similar, we only included comparison results with the more general method, Pknots, in the main text.

Fig. 9. Accuracy of prediction across benchmark instances. (A) Distribution of F-measures as box plots. (B) Mean F-measure of pseudoknotted base pairs [tool colors as in Subfigure (A)]. While the (total) F-measures of (A) assess the prediction accuracy of all base pairs, the mean ‘pseudoknotted’ F-measures of (B) assess the accuracy of predicting the pseudoknot.

Fig. 10. Average sensitivity and PPV of all methods on all datasets.

Figure 9A visualizes the prediction accuracies (F-measures) of the compared methods. Additionally, we analyzed the significance of the observed differences by permutation tests (Section 3.3.1; detailed results in Supplementary Material). For pseudoknot-free structures (HK-PK-free), only Pknots has a significantly lower average F-measure than the other methods. For dataset HK-PK, the differences between Knotty, Pknots and HotKnots are not significant, but these tools perform significantly better than IPknot and Simfold. On the IP-pk168 dataset, Knotty significantly outperforms HotKnots, IPknot and Simfold. Finally, the differences on the small DK-pk16 dataset are not significant. Summarizing, Knotty predicts pseudoknotted structures with high accuracy—on par with or superior to its competitors—and shows the most consistent performance. Figure 10 shows the average sensitivity (on the X axis) and PPV (on the Y axis) of each method on our four datasets. In this figure, each method has a different symbol and each dataset is represented by a different color. Data points below the diagonal line indicate higher sensitivity than PPV for a method, while data points above the diagonal indicate higher PPV than sensitivity.
As seen in Figure 10, Knotty achieves the highest sensitivity on all our pseudoknotted structures (blue, yellow and red circles), while IPknot’s sensitivity is the highest on pseudoknot-free structures (green symbols). This suggests that IPknot favors pseudoknot-free structures, while Knotty has a higher tendency to predict pseudoknotted structures.

3.5 Accuracy of pseudoknot prediction

We assessed each method's accuracy of pseudoknot prediction by the average F-measure of its predicted pseudoknotted base pairs (following Bellaousov and Mathews, 2010). Figure 9B compares the average F-measure of Knotty to that of the other pseudoknot predictors. In all benchmark sets, Knotty outperforms Pknots and IPknot, while HotKnots V2.0 achieves similar accuracy for the sets DK-pk16 and HK-PK. Partly, this shows the superiority of the energy model shared by Knotty and HotKnots V2.0. More interestingly, the significant edge over HotKnots V2.0 on set IP-pk168 demonstrates the benefits of true (i.e. global) energy minimization over heuristic pseudoknot prediction. We further evaluated each tool’s ability to correctly identify pseudoknots. We say a pseudoknot is correctly identified by a method if the method correctly predicts at least one pseudoknotted base pair. In this comparison, Knotty outperforms the other methods even more clearly, as it correctly identifies 55% of the pseudoknots in our dataset, versus 39% for HotKnots, 25% for IPknot and 43% for Pknots (refer to the Supplementary Material).

4 Conclusion

The novel algorithm Knotty provides comparably fast, highly accurate structure prediction for a comprehensive class of biologically relevant pseudoknots, while posing only moderate space demands. This unique feature set enables novel applications for computational pseudoknot structure prediction by overcoming crucial limitations of previous tools.
Compared to the classic pseudoknot prediction algorithm Pknots, Knotty is asymptotically faster by a linear factor, which translates to decisive speed-ups already for moderately sized RNAs. This gain is possible due to predicting only CCJ structures—in practice, rarely a limitation compared to Pknots. While the original CCJ implementation additionally suffered from significant space overhead, the CCJ recurrences are intrinsically space-demanding (introducing numerous auxiliary four-dimensional DP matrices). Therefore, overcoming these limitations is the major technical and practical breakthrough of this contribution. For this purpose, we applied the technique of sparsification to CCJ-type recurrences. To the best of our knowledge, this is the most complex application of this technique so far (surpassing the sparsification of RNA–RNA interaction prediction). In this respect, Knotty serves as a valuable case study of complex space-efficient sparsification. Notably, while previous applications of the sparsification technique mainly focused on speed improvements, we primarily aimed at reducing the space requirements, which—after presenting CCJ—remained the limiting factor for the practical applicability of complex RNA pseudoknotted secondary structure prediction. Our comparison to CCJ’s original implementation and to Knotty’s unsparsified variant provides detailed insights into the general potential for space improvements of complex RNA-related algorithms. Moreover, by benchmarking against state-of-the-art pseudoknot prediction methods (Pknots, HotKnots V2.0, PknotsRG and IPknot), we show superior capabilities for pseudoknot prediction and identification. Being fast, space-efficient, expressive and accurate, Knotty opens the door to large-scale, biologically relevant applications of pseudoknot structure prediction covering all important pseudoknot classes.

Acknowledgements

We thank Anne Condon for discussing details of the CCJ algorithm and first ideas on space savings.
Conflict of Interest: none declared.

References

Akutsu T. (2000) Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discrete Appl. Math., 104, 45–62.
Andronescu M. et al. (2005) Secondary structure prediction of interacting RNA molecules. J. Mol. Biol., 345, 987–1001.
Andronescu M. et al. (2007) Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics, 23, i19–i28.
Andronescu M. et al. (2008) RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics, 9, 340.
Andronescu M.S. et al. (2010) Improved free energy parameters for RNA pseudoknotted secondary structure prediction. RNA, 16, 26–42.
Backofen R. et al. (2011) Sparse RNA folding: time and space efficient algorithms. J. Discrete Algorithms, 9, 12–31.
Bellaousov S., Mathews D.H. (2010) ProbKnot: fast prediction of RNA secondary structure including pseudoknots. RNA, 16, 1870–1880.
Chang K.-Y., Tinoco I. (1997) The structure of an RNA ‘kissing’ hairpin complex of the HIV TAR hairpin loop and its complement. J. Mol. Biol., 269, 52–66.
Chang R.-Y.Y. et al. (2013) Japanese encephalitis virus non-coding RNA inhibits activation of interferon by blocking nuclear translocation of interferon regulatory factor 3. Vet. Microbiol., 166, 11–21.
Chen H.-L. et al. (2009) An O(n⁵) algorithm for MFE prediction of kissing hairpins and 4-chains in nucleic acids. J. Comput. Biol., 16, 803–815.
Dirks R.M., Pierce N.A. (2003) A partition function algorithm for nucleic acid secondary structure including pseudoknots. J. Comput. Chem., 24, 1664–1677.
Hajiaghayi M. et al. (2012) Analysis of energy-based algorithms for RNA secondary structure prediction. BMC Bioinformatics, 13, 22.
Huang X., Ali H. (2007) High sensitivity RNA pseudoknot prediction. Nucleic Acids Res., 35, 656–663.
Jabbari H. (2015) Algorithms for prediction of RNA pseudoknotted secondary structures. Ph.D. Thesis, University of British Columbia, Vancouver, Canada.
Jabbari H., Condon A. (2014) A fast and robust iterative algorithm for prediction of RNA pseudoknotted secondary structures. BMC Bioinformatics, 15, 147.
Jabbari H. et al. (2008) Novel and efficient RNA secondary structure prediction using hierarchical folding. J. Comput. Biol., 15, 139–163.
Lin Y. et al. (2018) Structural analyses of NEAT1 lncRNAs suggest long-range RNA interactions that may contribute to paraspeckle architecture. Nucleic Acids Res., 46, 3742–3752.
Lyngso R.B., Pedersen C.N.S. (2000) Pseudoknots in RNA secondary structures. BRICS Report Series RS-00-1. RECOMB00, ACM Press.
Melchers W. et al. (1997) Kissing of the two predominant hairpin loops in the coxsackie B virus 3′ untranslated region is the essential structural feature of the origin of replication required for negative-strand RNA synthesis. J. Virol., 71, 686–696.
Mercer T.R. et al. (2009) Long non-coding RNAs: insights into functions. Nat. Rev. Genet., 10, 155–159.
Möhl M. et al. (2010) Sparsification of RNA structure prediction including pseudoknots. Algorithms Mol. Biol., 5, 39.
Novikova I.V. et al. (2013) Rise of the RNA machines: exploring the structure of long non-coding RNAs. J. Mol. Biol., 425, 3731–3746.
Nussinov R., Jacobson A.B. (1980) Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc. Natl Acad. Sci. USA, 77, 6309–6313.
Rastegari B., Condon A. (2007) Parsing nucleic acid pseudoknotted secondary structure: algorithm and applications. J. Comput. Biol., 14, 16–32.
Reeder J., Giegerich R. (2004) Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5, 104.
Rivas E., Eddy S.R. (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol., 285, 2053–2068.
Salari R. et al. (2010) Time and space efficient RNA-RNA interaction prediction via sparse folding. In: Berger B. (ed.) Proceedings of RECOMB 2010, Volume 6044 of Lecture Notes in Computer Science. Springer, Berlin, pp. 473–490.
Sato K. et al. (2011) IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics, 27, i85–i93.
Sheikh S. et al. (2012) Impact of the energy model on the complexity of RNA folding with pseudoknots. In: Kärkkäinen J., Stoye J. (eds), Combinatorial Pattern Matching, Volume 7354 of Lecture Notes in Computer Science. Springer, Berlin, pp. 321–333.
Sperschneider J. et al. (2012) Predicting pseudoknotted structures across two RNA sequences. Bioinformatics, 28, 3058–3065.
Uemura Y. et al. (1999) Tree adjoining grammars for RNA structure prediction. Theor. Comput. Sci., 210, 277–303.
Verheije M.H. et al. (2002) Kissing interaction between 3′ noncoding and coding sequences is essential for porcine arterivirus RNA replication. J. Virol., 76, 1521–1526.
Wexler Y. et al. (2007) A study of accessible motifs and RNA folding complexity. J. Comput. Biol., 14, 856–872.
Will S., Jabbari H. (2016) Sparse RNA folding revisited: space-efficient minimum free energy structure prediction. Algorithms Mol. Biol., 11, 7–13.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/bty420

1 Introduction

Computational RNA secondary structure prediction has become an indispensable tool in the research on non-coding RNAs. Besides coding for proteins, RNAs perform various essential roles—most prominently in regulating gene expression—in all kingdoms of life, in many cases mediated by their three-dimensional structures (Mercer et al., 2009). Despite the ubiquity of pseudoknots in these RNAs (Chang et al., 2013; Lin et al., 2018; Novikova et al., 2013), most often pseudoknot-free structure prediction methods are applied in biological research—severely limiting the practical capabilities to correctly predict, recognize and compare pseudoknotted structures. The fundamental cause of this limitation is the computational complexity of the prediction of general pseudoknots in the nearest neighbor energy model, which is NP-hard (Akutsu, 2000; Lyngso and Pedersen, 2000) and even inapproximable (Sheikh et al., 2012). Thus, the high complexity of pseudoknot prediction can be overcome only by either heuristics without optimality guarantees or restrictions on the predictable pseudoknot class. In comparison, predicting minimum free energy (MFE) structures without pseudoknots is comparably simple: a pseudoknot-free secondary structure is either closed by a base pair connecting the first and last base in the sequence, or can be partitioned into two independent substructures on a prefix and suffix of the sequence, where the energies of the substructures add up to the total energy. This simple decomposition scheme allows predicting the MFE pseudoknot-free secondary structure by dynamic programming (DP) in Θ(n³) time and Θ(n²) space for standard energy models (Nussinov and Jacobson, 1980). The most general DP algorithm for MFE prediction of pseudoknotted structures, Pknots (Rivas and Eddy, 1999), has Θ(n⁶) time and Θ(n⁴) space complexity.
To reduce this prohibitively high complexity, several computationally less costly algorithms for MFE pseudoknot prediction have been proposed; for example, running in Θ(n⁵) time and Θ(n⁴) space (Dirks and Pierce, 2003; Lyngso and Pedersen, 2000; Uemura et al., 1999) or even Θ(n⁴) time and Θ(n²) space in PknotsRG (Lyngso and Pedersen, 2000; Reeder and Giegerich, 2004). However, all these reductions came at the cost of severely limiting the pseudoknot class relative to Pknots. Most strikingly, none of the less complex algorithms (compared to Pknots) handles the biologically important class of kissing hairpins with arbitrary nested substructures (Chang and Tinoco, 1997; Melchers et al., 1997; Verheije et al., 2002). Recently, we introduced the MFE prediction algorithm CCJ (Chen et al., 2009). Compared to the more complex algorithm Pknots, CCJ handles a strictly less general subclass of pseudoknots, but it significantly expands the class of structures that can be handled in O(n⁵) time—e.g. including the biologically relevant kissing hairpins. To achieve CCJ’s novel balance of time and pseudoknot complexity, we defined a new class of structures, called Three-Groups-of-Bands (TGB) structures, with at most three groups of bands (see Fig. 1A and B). On this basis, we presented recurrences that handle TGB structures by alternatingly decomposing the left, right, middle and outer bands.

Fig. 1. Examples of ‘Three-Groups-of-Bands’ (TGB) and CCJ structures. Each arc represents a set of consecutive pseudoknotted base pairs (referred to as a band) that cross the same band. (A) TGB structure with two left, two right and four middle bands; (B) TGB with no left band, one middle and two right bands as well as a nested pseudoknotted substructure; (C) CCJ structure composed of the two TGB structures of subfigures A and B (see decomposition of P).

The CCJ algorithm covers H-type pseudoknotted structures, kissing hairpins and chains of four interleaved stems by overlaying TGB structures (Fig. 1C); this composition can be applied recursively, which defines the class of CCJ structures. Note that this class is comparably general, e.g. comprising all density-2 structures (Jabbari et al., 2008) with up to four interleaved stems. For example, due to Rastegari and Condon (2007), among 1133 pseudoknotted RNA structures, only 9 were not density-2 and 6 were too complex for Pknots. While CCJ predicts complex pseudoknotted secondary structures in Θ(n⁵) time, its Θ(n⁴) space complexity was still limiting in practice; e.g.—for the original CCJ algorithm—even generous memory of 256 GB used to limit the length of RNAs to below 200 bases. A slightly deeper look reveals that CCJ’s substantial improvement in time complexity over Pknots was achieved at the price of using a large number of four-dimensional DP matrices (about 20 such matrices in CCJ versus four in Pknots). Apparently, CCJ’s speedup is tied to a—in practice prohibitively—large space consumption. In this work, we overcome the still unresolved dilemma that expressive pseudoknot prediction used to be limited either by time (Pknots) or space (CCJ). To improve the extreme space requirements of expressive, but comparably fast prediction, we apply the idea of sparsification (Backofen et al., 2011; Wexler et al., 2007) to CCJ.
Devising sparsified recurrences of CCJ, resulting in the algorithm Knotty, we improve the space complexity to Θ(n³ + Z), where Z is the total number of candidates. This complexity is the result of replacing all four-dimensional DP matrices by (a constant number of) three-dimensional matrix slices and lists of candidates. This suffices to exactly predict the same MFE structures as before (due to inverse triangle inequalities). The number of candidates is expected to be much smaller than the number of replaced matrix entries; moreover, it cannot be larger. The space-efficient retrieval of the optimal structure from this algorithm turned out to be a non-trivial application of the sparsification of MFE RNA structure prediction algorithms. We enable it by implementing our recently presented technique of space-efficient sparse traceback (Will and Jabbari, 2016); this method requires additional space for trace arrows, but keeps the number of trace arrows low. Notably, previous applications of the sparsification technique to RNA structure prediction discussed either pseudoknot-free methods (Backofen et al., 2011; Wexler et al., 2007) or pseudoknotted methods with simplified energy models (base pair maximization only) (Möhl et al., 2010). Sparsified RNA–RNA interaction prediction (Salari et al., 2010) is the most complex case of structure prediction so far that was implemented for a realistic interaction energy model; while space-efficient sparsification is discussed in that paper, the space-efficient variant of the implementation could not recover the optimal interaction structure by traceback.

1.1 Contributions

We introduce Knotty, providing a unique combination of comparably low run time at moderate space consumption and high expressivity, covering most biologically relevant pseudoknots including the important class of kissing hairpins and employing a state-of-the-art energy model.
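The candidate idea behind sparsification can be illustrated on a toy pseudoknot-free base-pairing recurrence (energy −1 per canonical pair): a split point k is stored as a candidate for column j only if the closed energy V(k,j) strictly beats every decomposition of [k,j]; any other split point is provably dominated and can be skipped. The sketch below is a drastic simplification of our own making—Knotty's actual candidates live in CCJ-type gapped recurrences—but it shows the mechanism and reports the number Z of stored candidates.

```python
import math

CANONICAL = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def sparse_fold(S):
    """Candidate-list sparsification of a toy W/V recurrence (-1 per pair).

    Returns (MFE of S, total number of stored candidates Z). Only split
    points k with V(k,j) strictly better than all decompositions of [k,j]
    are stored; dominated splits are never enumerated again.
    """
    n = len(S)
    if n < 2:
        return 0.0, 0
    W = [[0.0] * n for _ in range(n)]
    candidates = [[] for _ in range(n)]      # candidates[j] = [(k, V(k,j)), ...]
    for span in range(1, n):                 # increasing fragment length
        for i in range(n - span):
            j = i + span
            v = math.inf                     # V(i,j): (i,j) forms a pair
            if (S[i], S[j]) in CANONICAL:
                inner = W[i + 1][j - 1] if i + 1 <= j - 1 else 0.0
                v = inner - 1
            best = W[i][j - 1]               # case: j unpaired
            for k, vkj in candidates[j]:     # only candidate split points
                if k > i:
                    best = min(best, W[i][k - 1] + vkj)
            if v < best:                     # V(i,j) beats all decompositions:
                candidates[j].append((i, v)) # i becomes a candidate for j
                best = v
            W[i][j] = best
    return W[0][n - 1], sum(len(c) for c in candidates)
```

On uniformly pairable toy inputs Z stays far below the n² split points a dense recurrence would enumerate, mirroring the observation that Z grows much slower than the replaced matrix dimensions.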
Aiming at low computational complexity and high expressivity, we base our work on the original CCJ algorithm. To fundamentally overcome CCJ’s space overhead, we apply space-efficient sparsification to an adapted CCJ algorithm, resulting in the novel algorithm Knotty. By this, we present—to the best of our knowledge—the first sparsified prediction of complex pseudoknots using a realistic energy model. For studying Knotty’s computational behavior, we benchmark versus the original CCJ and a non-sparse variant of Knotty (which we implemented in parallel, for the purpose of such comparison). We show that sparsification alone substantially reduces the space requirements, whereas we could achieve further significant improvements over the original implementation (even in the non-sparse Knotty variant). Aiming at high accuracy, our implementation of Knotty supports the state-of-the-art HotKnots DP09 energy model (Andronescu et al., 2010). To analyze Knotty’s biological relevance, we benchmarked the prediction accuracy (for the total structures as well as for the actual pseudoknots) in comparison to the pseudoknot prediction programs HotKnots V2.0 (Andronescu et al., 2010), IPknot (Sato et al., 2011) and the most general MFE pseudoknot prediction method Pknots (Rivas and Eddy, 1999). For control, we moreover compare to Simfold (Andronescu et al., 2005, 2007), a pseudoknot-free prediction method based on the same energy model. 2 Materials and methods The original CCJ algorithm (Chen et al., 2009) minimizes the free energy over all CCJ structures for a given input sequence S by DP. As stated in the introduction, CCJ structures comprise kissing hairpins and chains of four interleaved stems, which can recursively contain CCJ structures as substructures. The optimal CCJ structure is then determined by standard traceback through the DP matrices. Generally, DP algorithms can be described by presenting the recurrences that are used to calculate the entries of their DP matrices. 
In the case of RNA structure prediction, the DP matrices store MFE values for sequence fragments (under specific conditions). The recurrences correspond to decompositions of these fragments such that the matrix entries can be recursively inferred from the energies of smaller subproblems. For example, assuming an energy value of −1 for each canonical base pair, i.e. C–G, A–U or G–U, the MFE structure of an RNA sequence S can be found using matrices W_N and V_N. The entries W_N(i,j) and V_N(i,j), which hold the respective minimum free energies of general and closed structures of the subsequence S_{i..j}, are decomposed by the grammar rules of Equation (1). These grammar rules represent a complete case distinction of the possible structures. Similar graphical notation was introduced by Rivas and Eddy (1999) and is now commonly used to present DP algorithms for RNA (e.g. Dirks and Pierce, 2003; Möhl et al., 2010). In our notation, the sequence is represented as a line with bases as index positions; solid arcs indicate base pairs; and dashed arcs enclose areas containing unconsumed structure. Red circles represent fixed end points of a region. Blue squares are free indices (i.e. bases that are pivotal for boundary determination). To avoid cluttering the representation, we do not mark end points that can be identified from the bases directly before or after them. In the example of Equation (1), W_N(i,j) corresponds to the MFE structure of S_{i..j}; this structure can be decomposed according to Equation (1), since either j is unpaired (left case) or j is paired to some inner position (recursion to V_N, in which the solid arc represents a base pair); the closed structure corresponding to V_N(i,j) is reduced to W_N(i+1,j−1) (rightmost case). Moreover, the grammar rules allow—almost mechanically—inferring the recursion equations for base pair free energy minimization (up to specific energy contributions).
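Translating such grammar rules into code is indeed largely mechanical. The following is a minimal sketch of the toy W_N/V_N recursion just described (energy −1 per canonical pair, no loop-size restrictions); the function name is ours, and this is of course not Knotty's actual thermodynamic implementation.

```python
import math

CANONICAL = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def nussinov_like_mfe(S):
    """MFE of the toy W_N/V_N model: energy -1 per canonical base pair.

    W[i][j]: MFE of any structure on S[i..j];
    V[i][j]: MFE of structures in which (i, j) is a base pair.
    """
    n = len(S)
    if n < 2:
        return 0.0
    W = [[0.0] * n for _ in range(n)]
    V = [[math.inf] * n for _ in range(n)]
    for span in range(1, n):                 # increasing fragment length
        for i in range(n - span):
            j = i + span
            # V_N(i,j) = W_N(i+1,j-1) - 1 if S_i-S_j is canonical
            if (S[i], S[j]) in CANONICAL:
                inner = W[i + 1][j - 1] if i + 1 <= j - 1 else 0.0
                V[i][j] = inner - 1
            # W_N(i,j) = min( W_N(i,j-1), min_{i<=k<j} W_N(i,k-1) + V_N(k,j) )
            best = W[i][j - 1]               # case: j unpaired
            for k in range(i, j):
                left = W[i][k - 1] if k > i else 0.0
                best = min(best, left + V[k][j])
            W[i][j] = best
    return W[0][n - 1]
```

For instance, the hairpin-forming sequence GGGAAACCC attains energy −3 in this toy model (three stacked G–C pairs). The triple loop over i, j and k directly exhibits the Θ(n³) time and Θ(n²) space behavior stated in the Introduction.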
Generally, our recursions minimize over the recursion cases and over the free indices within their respective ranges, as limited by the fixed indices. Thus, we translate the rules of Equation (1) to

  WN(i,j) = min{ WN(i,j−1), min over i ≤ k < j of WN(i,k−1) + VN(k,j) }
  VN(i,j) = WN(i+1,j−1) − 1   if Si–Sj is canonical
  VN(i,j) = ∞                 otherwise.

In our discussion of CCJ and its sparsification, the graphical presentation allows us to focus on the sparsification and to avoid distracting details, such as the exact energy contributions added in each single step, which are not necessary for understanding the algorithmic ideas. Formal recurrences are provided in the Supplementary Material; further details are found in Jabbari (2015). While pseudoknot-free recursions generally use only fragments that are connected at the sequence level, the CCJ algorithm requires ‘gapped’ fragments (two subsequences disconnected by a gap). Consequently, defining such fragments requires four sequence positions in total. The MFE of the CCJ structure for subsequence Si..l is calculated in W(i,l), whose decomposition distinguishes three cases: in the first, the last base l is unpaired; in the second, the structure in region [i,l] is decomposed into two non-overlapping substructures; in the third, the end points pair with each other to close a loop. The recurrence V(i,l) handles non-pseudoknotted loops closed by i and l; P(i,l) is the MFE of a CCJ pseudoknot for region [i,l]. P(i,l) is decomposed by rule (2) into two TGB structures (with MFEs in the PK matrix). Representing TGB structures requires gapped fragments. As indicated by the three blue boxes (free indices) in Equation (2), each entry of P is minimized over three indices. Thus, this step already bounds the time and space complexity of the CCJ algorithm to O(n5) and O(n4), respectively. PK(i,j,k,l) is the MFE over all TGB structures of the gapped fragment [i,j]∪[k,l].
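The translated WN/VN recurrences above can be implemented directly as a toy dynamic program. The following sketch (our own illustration; function name and representation are not from the paper) uses the simplistic energy of −1 per canonical pair, not Knotty’s HotKnots DP09 loop-based model:

```python
# A direct (toy) implementation of the translated WN/VN recurrences,
# scoring -1 per canonical base pair. This illustrates Equation (1) only;
# Knotty itself minimizes full HotKnots DP09 loop energies.

CANONICAL = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def mfe_base_pairs(S):
    """Return WN(0, n-1), the toy MFE (= minus the maximum number of pairs)."""
    n = len(S)
    INF = float("inf")
    WN = [[0] * n for _ in range(n)]    # WN(i,j): MFE of any structure on S[i..j]
    VN = [[INF] * n for _ in range(n)]  # VN(i,j): MFE of closed structure, i pairs j
    for span in range(1, n):            # fill by increasing fragment length
        for i in range(n - span):
            j = i + span
            # VN(i,j) = WN(i+1, j-1) - 1 if Si-Sj is canonical
            if (S[i], S[j]) in CANONICAL:
                VN[i][j] = (WN[i + 1][j - 1] if i + 1 <= j - 1 else 0) - 1
            # WN(i,j) = min( WN(i,j-1), min over i <= k < j of WN(i,k-1) + VN(k,j) )
            best = WN[i][j - 1]
            for k in range(i, j):
                left = WN[i][k - 1] if k > i else 0
                best = min(best, left + VN[k][j])
            WN[i][j] = best
    return WN[0][n - 1] if n else 0
```

For instance, `mfe_base_pairs("GGGAAACCC")` yields −3, corresponding to the three G–C pairs of a simple hairpin (the toy model imposes no minimum hairpin loop size).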
Note that such structures have additional restrictions: the positions i and l must be involved in base pairs that are not part of nested substructures; moreover, some base pair of a TGB structure must span the gap. The recurrence for PK uses terms PL, PM, PO and PR, which (put informally) handle bands in the left, middle and right groups of the TGB structure. Both PM and PO are needed to handle bands in the middle group. The matrix entries PL(i,j,k,l), PM(i,j,k,l), PR(i,j,k,l) and PO(i,j,k,l) are decomposed as illustrated in Figure 2, in which WP handles nested substructures in a pseudoloop. (For invalid index combinations, the matrix entries are set to +∞ and do not have to be stored.)

Fig. 2. Decompositions for PK(i,j,k,l), PL(i,j,k,l), PM(i,j,k,l), PR(i,j,k,l) and PO(i,j,k,l) in grammar-rule-like graphical notation.

As seen in Figure 2, each of the matrices PL, PR, PM and PO requires a base pair between the two ends at the respective left, right, middle or outer positions. Each such matrix distinguishes three cases: this base pair (i) closes an interior loop, (ii) closes a multi-loop or (iii) is the inner border of a band. In the latter case, the respective terms PfromX (X∈{L,R,M,O}) handle transitions from base pairs in one group to base pairs in other groups (see Fig. 3).

Fig. 3. Decompositions of PfromX(i,j,k,l) for X∈{L,R,M,O}, which handle pseudoloops closed by an X band as well as the transition to some PY.

In the transition to another band via one of the PfromX matrices, we allow nested substructures around the band; this is handled by the first two cases of the respective PfromX recurrence. Moreover, in PfromX we recurse to matrices PX(i,j,k,l) only if the requirements of these matrices are met. Note that in PfromL, it is not possible to transition to PL. This is because the recurrences are designed so that bands are handled in rounds. Within a round, bands in the left group are handled first, if any, then those in the right group, if any, and then those in the middle group, with bands handled by PM (if any) handled before those handled by PO. A middle band must be handled in each round; otherwise, for example, two ‘bands’ in the left group, added in different rounds, would collapse into one, causing the recurrences to incorrectly add penalty terms for band ‘borders’ that are not actually borders. For this reason, no row in PfromL has a PL term, and so a band in the left group cannot be handled directly after a band in the right group. Also, PfromO does not have a row with a PM term, to ensure that PM cannot be used twice in the same round. Interior loops that span a band in the left group are decomposed by the rule PL,iloop; the remaining cases (right, middle, outer) are analogous. Note that, while the decomposition of PL,iloop has two free indices, these indices are constrained by a constant maximum size of interior loops (here 30 bases), as is common practice. For handling interior loops that span a band, the original CCJ algorithm introduced the five-ary function PL,iloop5 and applied a clever scheme of keeping track of borders by transitioning between different recurrence cases, so as to still use only Θ(n4) space.
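As a quick sanity check of the Θ(n4) space argument, one can enumerate the index quadruples that define a gapped fragment. The following sketch is our own illustration, not part of Knotty:

```python
# Count the valid index quadruples i <= j < k <= l that define a gapped
# fragment [i,j] U [k,l] over n positions; the count equals C(n+2, 4),
# i.e. Theta(n^4) -- the reason the unsparsified algorithm needs
# Theta(n^4) space for each four-dimensional matrix.

from math import comb

def count_gapped_fragments(n):
    """Enumerate quadruples 0 <= i <= j < k <= l <= n-1."""
    return sum(1
               for i in range(n)
               for j in range(i, n)
               for k in range(j + 1, n)
               for l in range(k, n))

def closed_form(n):
    # stars and bars: (i, j, k-1, l-1) is a weakly increasing
    # 4-tuple over the n-1 values 0..n-2
    return comb(n + 2, 4)
```

For n = 3 there are exactly five such fragments, matching C(5,4) = 5.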
However, due to its non-standard form, this recurrence cannot be sparsified directly. Thus, in Knotty, we make the pragmatic choice to simplify the recurrence; this works since even the detailed HotKnots DP09 energy parameter set does not require the full generality supported by the original CCJ recursion. Note that this modification alone, i.e. without sparsification, does not change the complexity of the algorithm, even though it reduces the overhead. Moreover, we modify the original handling of multi-loop cases to enable their sparsification. Figure 4 shows our multi-loop handling for the left group. We handle multi-loops by passing through states PL,mloop1 and PL,mloop0, which keep track of how many additional inner base pairs still have to be enforced (1 or 0). Finally, Figure 5 shows the decompositions of WB(i,l) and WB′(i,l), which represent multi-loop fragments around a band; in contrast to WB(i,l), WB′(i,l) requires a base pair in its range [i,l]. The other ‘W’-matrices (WM and WM′ for ordinary multi-loops; WP and WP′ for pseudoknot multi-loops) are decomposed analogously.

Fig. 4. Decomposition of multi-loops that span a band in the left group.

Fig. 5. Decompositions of WB(i,l) and WB′(i,l).

2.1 The Knotty algorithm

Knotty improves the space complexity over the CCJ algorithm utilizing the following idea: due to the technique of sparsification, one can minimize the free energy while keeping only a fraction of the DP matrix cells, namely the entries referred to as candidates.
By storing only a few candidates (as explained below), we avoid storing entire four-dimensional matrices. (i) In recurrences in which the left-most index, i, does not change, we store the values in a constant number of three-dimensional matrix slices; we call the collection of these matrix slices i-slices. In many of these recursion cases, we recurse to matrix entries of the same i-slice (e.g. when inferring PK from PL or, slightly more interestingly, PR from PfromR). In other cases, we recurse to the (i+1)-slice (e.g. PL from PfromL) or to an (i+c)-slice, where c is constantly bounded. The latter occurs in the handling of interior loops (PL,iloop and, analogously, PO,iloop), where c does not exceed the maximum interior loop size, MLS. (ii) This leaves us with the recursion cases that recurse to some d-slice, where d − i is not constantly bounded. Instead of storing all slices, we store only certain candidate entries of such slices. These matrix entries (called candidates) are stored in candidate lists for specific recursion cases, together with their corresponding second, third and fourth matrix indices, j, k and l, to keep track of band borders. In some cases, candidate lists can be shared between recursion cases. More details on candidate list requirements are given in Möhl et al. (2010).

2.1.1 Space representation of the four-dimensional matrices by Knotty

Only the matrices corresponding to PL and PO occur in interior loops that span a band and thus require recursing to a different i-slice; for them, we store slices i..i+MLS. For the matrices corresponding to PfromL, PfromO, PL,mloop0, PL,mloop1, PO,mloop0 and PO,mloop1, we only need to store slices i and i+1. Matrices corresponding to recurrences of types PX,iloop and PX,mloop (X∈{L,R,M,O}) do not need to be stored and can be computed on demand. For the remaining matrices, we store only the current i-slice.
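The i-slice storage can be pictured as a small rotating buffer: when the DP moves from row i+1 to row i, the slice that falls out of the window is overwritten in place. A minimal sketch of this idea (our own simplification; Knotty’s actual C++ implementation differs):

```python
# Illustrative sketch of storing only a constant window of i-slices of a
# conceptually four-dimensional matrix M(i, j, k, l). The DP fills rows in
# decreasing order of i; moving from row i+1 to row i reuses ("rotates")
# the memory of the slice that falls out of the window, without copying.

class RotatingSlices:
    def __init__(self, window):
        self.window = window                        # e.g. MLS + 1 slices
        self.buf = [dict() for _ in range(window)]  # each dict: (j,k,l) -> value
        self.lowest = None                          # smallest i currently stored

    def start_row(self, i):
        """Begin computing slice i; the buffer cell that formerly held
        slice i + window is cleared and reused in place."""
        self.lowest = i
        self.buf[i % self.window].clear()

    def set(self, i, j, k, l, value):
        assert i == self.lowest, "only the current slice is writable"
        self.buf[i % self.window][(j, k, l)] = value

    def get(self, i, j, k, l, default=float("inf")):
        # only slices lowest .. lowest + window - 1 are still available
        assert self.lowest is not None and 0 <= i - self.lowest < self.window
        return self.buf[i % self.window].get((j, k, l), default)
```

In this picture, a matrix such as PL would use a window of MLS + 1 slices, while PfromL needs only a window of 2 (slices i and i+1).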
Note that space is always reused in the next iteration; for ranges of slices, the memory access is ‘rotated’ without copying. In total, Knotty maintains candidate lists for nine matrices: PK, PfromL, PfromM, PfromR, PfromO, PL,mloop0, PM,mloop0, PR,mloop0 and PO,mloop0. To finalize the sparsification, all recursion cases that recurse to these matrices need to be modified. This affects all recursion cases that insert a nested substructure to the left of the gapped region, which occurs exactly in the decompositions of PK, PfromL, PfromO, PL,mloop1, PL,mloop0, PO,mloop0 and PO,mloop1. Moreover, this affects the decomposition of P into two PK-fragments, where the latter is taken from the respective candidate list. We discuss the individual modifications using the three examples of PK, PL,mloop1 and P; the remaining cases are sufficiently similar to be sparsified analogously.

1. Consider the case of the PK recurrence min over i < d ≤ j of WP(i,d−1) + PK(d,j,k,l). It suffices to minimize only over certain candidates PK(d,j,k,l) that are not optimally decomposable in the following sense:

∄ e > d: PK(d,j,k,l) = WP(d,e−1) + PK(e,j,k,l). (3)

It can easily be shown that whenever PK(i,j,k,l) is optimally decomposable, there is a candidate (i.e. a smaller, not optimally decomposable fragment) that yields the same minimum value. Remarkably, the candidate criterion can be checked efficiently by the DP algorithm. This is seen more directly from the equivalent candidate criterion

PK(d,j,k,l) < min over d < e ≤ j of WP(d,e−1) + PK(e,j,k,l).

This minimum can also be computed by running over candidates only; moreover, it must be calculated by the DP algorithm anyway for computing PK(d,j,k,l), such that the check is performed without additional overhead. This idea applies analogously to the other candidate checks.

2. Similarly, the minimization min over i < d ≤ j of WB(i,d−1) + PL,mloop0(d,j,k,l) in the recurrence of PL,mloop0 is restricted to candidates that satisfy

∄ e > d: PL,mloop0(d,j,k,l) = WB(d,e−1) + PL,mloop0(e,j,k,l).
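To make the candidate criterion concrete, the following toy (our own illustration of the technique, cf. Möhl et al. 2010, not Knotty’s actual recurrences) transplants it to the much simpler pseudoknot-free W/V recursion of Equation (1): V(d,j) is kept as a candidate for j only if it is not optimally decomposable, and the split minimization in W then runs over candidates only.

```python
# Toy illustration of an Equation (3)-style candidate criterion in the
# pseudoknot-free W/V recursion with energy -1 per canonical pair:
# V(d,j) becomes a candidate only if V(d,j) < min over W(d,e-1) + V(e,j)
# (here additionally strengthened by pruning against the "j unpaired"
# case W(d,j-1)); W(i,j) then minimizes over candidate splits only.

CANONICAL = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}
INF = float("inf")

def sparse_fold(S):
    n = len(S)
    W = [[0] * n for _ in range(n)]
    cand = [[] for _ in range(n)]  # cand[j]: (d, V(d,j)) entries kept for j
    for j in range(1, n):
        for i in range(j - 1, -1, -1):
            # V(i,j): closed structure, i pairs j
            if (S[i], S[j]) in CANONICAL:
                v = (W[i + 1][j - 1] if i + 1 <= j - 1 else 0) - 1
            else:
                v = INF
            # best value of W(i,j) without pairing i itself:
            # j unpaired, or split at a candidate d that pairs with j
            w = W[i][j - 1]
            for d, vd in cand[j]:          # all stored d satisfy i < d < j
                w = min(w, W[i][d - 1] + vd)
            if v < w:                      # candidate criterion: v is not
                cand[j].append((i, v))     # optimally decomposable
            W[i][j] = min(w, v)
    return W[0][n - 1] if n else 0
```

On this toy model the sparse minimization provably returns the same values as the dense one; as for Knotty, the correctness rests on the inverse triangle inequality W(x,z) ≤ W(x,y−1) + W(y,z).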
(4) We could even strengthen the criterion, such that candidates must furthermore satisfy

∄ e ≥ d: PL,mloop0(d,j,k,l) = PL,mloop0(d,e,k,l) + WB(e+1,j). (5)

3. In the minimization calculating P(i,l), i.e. min over i < j < d < k < l of PK(i,j−1,d+1,k−1) + PK(j,d,k,l), it suffices to consider only candidates PK(j,d,k,l). Entries PK(j,d,k,l) are candidates if and only if they are not optimally decomposable in the following sense:

∄ e (j ≤ e < d): PK(j,d,k,l) = PK(j,e,k,l) + WP(e+1,d). (6)

Notably, the candidate lists for the PX,mloop0 and PX,mloop1 recurrences (X∈{L,M,R,O}) can be shared, since candidates for decomposing PX,mloop0 must also be candidates for PX,mloop1, due to WB(i,j) ≤ WB′(i,j) for all i, j; cf. Figure 5 as well as Equations (4) and (5). Finally, the Knotty recurrences can be computed based on the (constantly bounded) matrix slices and the candidates alone. Their correctness is a consequence of inverse triangle inequalities; for example, in the case of W we have ∀ x < y ≤ z: W(x,z) ≤ W(x,y−1) + W(y,z), which follows from the definition of W. Analogous inequalities hold for WP, WB and WB′.

Theorem 2.1 (Correctness of the Knotty sparsification). The Knotty recurrences are equivalent to the original CCJ recurrences.

The proof of Theorem 2.1 is provided in the Supplementary Material. Note that the presented sparsification would not have been possible for the multi-loop handling of the original CCJ algorithm (Fig. 6). In the original decomposition of PL,mloop, the unbounded access to d-slices cannot easily be replaced by candidates; note the index offsets in the graphical notation, indicating that in the recurrence of PL,mloop(i,j,k,l), the WB′ and WB fragments both start at i+1; similarly, there is a shift to j−1 in the WB and WB′ recurrences of PL,mloop0(i,j,k,l) and PL,mloop1(i,j,k,l) (see Supplementary Material for further details).

Fig. 6. Decomposition of multi-loops in the left group by the original CCJ algorithm.

2.2 Space complexity analysis

In the previous sections, we sparsified all four-dimensional matrices of the Knotty algorithm with the goal of reducing its space complexity. As explained above, our sparsification allows replacing all four-dimensional matrices by three-dimensional matrix slices. In seven recursion cases, we had to rewrite the minimizations such that they compute equivalent results by recursing only to candidates and to entries of the same i-slice. In the multi-loop cases, candidate lists are shared. Even if only a small fraction of the respective four-dimensional fragments were optimally decomposable (i.e. not candidates), these changes would save space over the non-sparsified version. However, experience from previous sparsification work (and our results) shows that a large number of fragments is optimally decomposable, such that the number of candidates is small. We define Z as the total number of candidates. For the traceback, we additionally store trace arrows to optimal, non-candidate inner base pairs of interior loops; we denote their number by T. Then, the total space complexity of Knotty is O(n3+Z+T).

3 Results

In this section, we provide implementation and dataset details, and compare Knotty in terms of time and memory usage to its predecessors, CCJ 1.0 and CCJ 2.0. In addition, we compare Knotty’s prediction accuracy to some of the best existing methods for prediction of RNA pseudoknotted secondary structure.

3.1 Implementation

Knotty, CCJ 2.0 and the original CCJ algorithm were implemented in C++.
Knotty and CCJ 2.0 implement the presented recurrences (respectively, with and without sparsification) utilizing the DP09 energy model of HotKnots V2.0 (Andronescu et al., 2010), while CCJ 1.0 strictly realizes the original CCJ recurrences and uses a slightly different energy model. Notably, all three implementations are reported here for the first time. In Knotty, we implement the sparsified recurrences and store only (a constant number of) three-dimensional matrix slices instead of four-dimensional matrices. For the (given the sparse information, non-trivial) traceback, we implement a careful adaptation of the approach of Will and Jabbari (2016) to compute the traceback by recomputation, with the help of as few trace arrows as possible. This demonstrates the generality of Will and Jabbari’s approach (cf. Supplementary Material).

3.2 Datasets

We analyzed the performance of our algorithm on a large dataset of over 600 RNA sequences of up to 400 bases in length. This dataset was compiled from four non-overlapping datasets with various pseudoknotted and pseudoknot-free structures. The HK-PK and HK-PK-free datasets were compiled from the RNA STRAND database (Andronescu et al., 2008) and were used in evaluating HotKnots V2.0 (Andronescu et al., 2010); IP-pk168 contains 16 categories of pseudoknotted structures with at most 85% sequence similarity (Huang and Ali, 2007) and was used in evaluating IPknot (Sato et al., 2011); and DK-pk16 contains 16 pseudoknotted structures with strong experimental support (Sperschneider et al., 2012). Table 1 summarizes these datasets.

Table 1. Summary of our benchmark datasets. Reference structures of the pk-type datasets are pseudoknotted, while those of HK-PK-free are pseudoknot-free.

Name        #seqs  Type     Seq. lengths  Ref.
HK-PK          88  pk       26–400        Andronescu et al. (2010)
HK-PK-free    337  pk-free  10–194        Andronescu et al. (2010)
IP-pk168      168  pk       21–137        Huang and Ali (2007)
DK-pk16        16  pk       34–377        Sperschneider et al. (2012)

3.3 Accuracy

To evaluate the accuracy of predicted structures, we use the harmonic mean of sensitivity and positive predictive value (PPV), commonly referred to as the F-measure (Jabbari and Condon, 2014). Its value ranges between 0 and 1, with 0 indicating no base pair in common with the reference structure and 1 indicating a perfect prediction (details in Supplementary Material).

3.3.1 Permutation test

We use a two-sided permutation test to assess the statistical significance of observed performance differences between two methods, following the procedure explained in Jabbari and Condon (2014) and Hajiaghayi et al. (2012). We report a difference in prediction accuracy as significant if the P-value is <0.05.

3.4 Benchmark results

We benchmarked Knotty, CCJ 1.0, CCJ 2.0 and Pknots on an Intel® Xeon® CPU E5-4657L v2 at 2.40 GHz; for running jobs in parallel, the compute server was equipped with a total of 1.5 TB of available memory.
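The evaluation machinery described above can be sketched as follows: sensitivity, PPV and their harmonic mean over sets of base pairs, plus a simple two-sided paired permutation test. This is our own illustrative sketch; the number of permutations and the seed are our choices, not taken from the paper.

```python
# F-measure over base-pair sets and a two-sided paired permutation test
# on per-sequence F-measures of two methods.

import random

def f_measure(predicted, reference):
    """predicted, reference: sets of base pairs (i, j) with i < j."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)   # correctly predicted base pairs
    if tp == 0:
        return 0.0
    sensitivity = tp / len(reference)
    ppv = tp / len(predicted)
    return 2 * sensitivity * ppv / (sensitivity + ppv)

def permutation_test(f_a, f_b, n_perm=10000, seed=1):
    """Randomly swap the two methods' scores per sequence and count how
    often the permuted mean difference is at least as extreme as the
    observed one; returns the empirical two-sided P-value."""
    rng = random.Random(seed)
    n = len(f_a)
    observed = abs(sum(a - b for a, b in zip(f_a, f_b))) / n
    hits = 0
    for _ in range(n_perm):
        diff = sum((a - b) if rng.random() < 0.5 else (b - a)
                   for a, b in zip(f_a, f_b))
        if abs(diff) / n >= observed:
            hits += 1
    return hits / n_perm
```

For example, a prediction sharing one of two base pairs with a two-pair reference has sensitivity 0.5, PPV 0.5 and hence F-measure 0.5.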
Note that all programs run single-threaded. Based on instances of our datasets, we compared the run time and memory requirements of the programs. Moreover, we verified that CCJ 2.0 and Knotty indeed produce equivalent results. (For most of our benchmark instances, the programs produced exactly the same results. For only five instances, the predicted structures differed by a few shifted base pairs of a single stem in a pseudoloop, while the programs still reported identical minimum energies; in these cases, we confirmed that the structural differences are indeed energetically neutral.) Note that this equivalence must hold, since the latter sparsifies the former (Theorem 2.1). Figure 7A shows the memory consumption, our main objective in this work, as a log–log plot versus sequence length; the run times are reported analogously in Figure 7B. We observe significant improvements in space from CCJ 1.0 over CCJ 2.0 to Knotty.

Fig. 7. (A) Memory usage (maximum resident set size in MB) versus length (log–log plot) over all benchmark instances. The solid line shows a best ‘asymptotic’ fit of the form c1 + c2·n^x for sequence length n, constants c1, c2 and exponent x; for the fit, we ignore all values n < 50. (B) Run time (s) versus length (log–log plot) over all benchmark instances. For each tool in both plots, we report (in parentheses) the exponent x estimated from the benchmark results; it describes the observed complexity as Θ(n^x) (Supplementary Section S5).

At the same time, one observes a comparably small run-time penalty for Knotty (which at this time does not make full use of sparsification to optimize time). Nevertheless, we emphasize the importance of moderate space requirements, which is particularly relevant given today’s heavily parallel computation platforms with comparably costly main memory. In practice, Knotty’s small memory footprint allows running many more instances in parallel compared to CCJ 1.0 and even CCJ 2.0. Compared to Pknots, the time savings are striking, in particular for larger instances (e.g. ∼19 days compared to ∼12 h for the 400-base benchmark instance). Note that the same instance needs an unreasonable amount of memory for CCJ 1.0 and still extreme space for CCJ 2.0: for the former, we estimate 10.2 TB by extrapolating from our regression (Fig. 7A, Supplementary Table S7); for the latter, we measured 136.4 GB; Knotty requires only 13.60 GB. We further investigated whether a specific class of structures (i.e. pseudoknotted versus pseudoknot-free) would benefit more strongly than the other from sparsification, and found that both classes benefit equally (see Fig. 8); in general, longer sequences benefit more (as seen in Fig. 7A). A closer look at Figure 7A shows that, while Knotty’s memory usage varies among instances of the same length, these variations are minimal. We examined the numbers of candidates and trace arrows in Knotty, which together explain the space requirements (see Supplementary Material).

Fig. 8. Numbers of candidates—as stored by Knotty—for the benchmark instances (in dependence of sequence length).
We highlight instances of the four benchmark sets; this does not suggest any obvious bias.

To assess the prediction accuracy of Knotty on our dataset, we compared it with two of the best existing methods, namely HotKnots V2.0, a heuristic method with recently tuned energy parameters (Andronescu et al., 2010), and IPknot, a method based on maximum expected accuracy (Sato et al., 2011). Moreover, we compared the performance of Knotty with Pknots, the most general method for MFE prediction of pseudoknotted structures, as well as with Simfold (Andronescu et al., 2005, 2007), a pseudoknot-free MFE prediction tool that employs the HotKnots V2.0 energy model. Figures 9A and 10 summarize the average accuracy, as well as the average sensitivity and PPV, of each method. In addition, we compared the performance of Knotty to PknotsRG (Reeder and Giegerich, 2004) (data in Supplementary Material). Since the performance of PknotsRG and Pknots was similar, we only include comparison results for the more general method, Pknots, in the main text.

Fig. 9. Accuracy of prediction across benchmark instances. (A) Distribution of F-measures as box plots. (B) Mean F-measure of pseudoknotted base pairs [tool colors as in Subfigure (A)]. While the (total) F-measures of (A) assess the prediction accuracy of all base pairs, the mean ‘pseudoknotted’ F-measures of (B) assess the accuracy of predicting the pseudoknot.

Fig. 10. Average sensitivity and PPV of all methods on all datasets.

Figure 9A visualizes the prediction accuracies (F-measures) of the compared methods. Additionally, we analyzed the significance of the observed differences by permutation tests (Section 3.3.1; detailed results in Supplementary Material). For pseudoknot-free structures (HK-PK-free), only Pknots has a significantly lower average F-measure than the other methods. For dataset HK-PK, the differences between Knotty, Pknots and HotKnots are not significant, but these tools perform significantly better than IPknot and Simfold. On the IP-pk168 dataset, Knotty significantly outperforms HotKnots, IPknot and Simfold. Finally, the differences on the small DK-pk16 dataset are not significant. Summarizing, Knotty performs highly accurate pseudoknotted structure prediction, on par with or superior to its competitors, and shows the most consistent performance. Figure 10 shows the average sensitivity (x axis) and PPV (y axis) of each method on our four datasets. In this figure, each method has a different symbol and each dataset is represented in a different color. Data points below the diagonal line indicate higher sensitivity than PPV for a method, while data points above the diagonal indicate higher PPV than sensitivity.
As seen in Figure 10, Knotty achieves the highest sensitivity on all our pseudoknotted datasets (blue, yellow and red circles), while IPknot’s sensitivity is the highest on pseudoknot-free structures (green symbols). This suggests that IPknot favors pseudoknot-free structures, while Knotty has a stronger tendency to predict pseudoknotted structures.

3.5 Accuracy of pseudoknot prediction

We assessed the accuracy of pseudoknot prediction of each method by the average F-measure of the pseudoknotted base pairs predicted by each method (following Bellaousov and Mathews, 2010). Figure 9B compares the average F-measure of Knotty to that of the other pseudoknot predictors. On all benchmark sets, Knotty outperforms Pknots and IPknot, while HotKnots V2.0 achieves similar accuracy on the sets DK-pk16 and HK-PK. In part, this shows the superiority of the energy model shared by Knotty and HotKnots V2.0. More interestingly, the significant edge over HotKnots V2.0 on set IP-pk168 demonstrates the benefits of true (i.e. global) energy minimization over heuristic pseudoknot prediction. We further evaluated each tool’s ability to correctly identify pseudoknots. We say a pseudoknot is correctly identified by a method if the method correctly predicts at least one pseudoknotted base pair. In this comparison, Knotty outperforms the other methods even more clearly: it correctly identifies 55% of the pseudoknots in our dataset, versus 39% for HotKnots, 25% for IPknot and 43% for Pknots (refer to the Supplementary Material).

4 Conclusion

The novel algorithm Knotty provides comparably fast, highly accurate structure prediction for a comprehensive class of biologically relevant pseudoknots, while posing only moderate space demands. This unique feature set enables novel applications of computational pseudoknot structure prediction by overcoming crucial limitations of previous tools.
Compared to the classic pseudoknot prediction algorithm Pknots, Knotty is asymptotically faster by a linear factor, which translates to decisive speed-ups already for moderately sized RNAs. This gain is possible because Knotty predicts only CCJ structures, a restriction that is rarely limiting in practice compared to Pknots. While the original CCJ implementation additionally suffered from significant space overhead, the CCJ recurrences are intrinsically space-demanding (introducing numerous auxiliary four-dimensional DP matrices). Overcoming these limitations is therefore the major technical and practical breakthrough of this contribution. For this purpose, we applied the technique of sparsification to CCJ-type recurrences. To the best of our knowledge, this is the most complex application of this technique so far (surpassing the sparsification of RNA–RNA interaction prediction). In this respect, Knotty serves as a valuable case study of complex space-efficient sparsification. Notably, while previous applications of the sparsification technique mainly focused on speed improvements, we first of all aimed at reducing the space requirements, which, after the presentation of CCJ, remained the limiting factor for the practical applicability of complex RNA pseudoknotted secondary structure prediction. Our comparison to CCJ’s original implementation and to Knotty’s unsparsified variant provides detailed insights into the general potential for space improvements of complex RNA-related algorithms. Moreover, benchmarking against state-of-the-art pseudoknot prediction methods (Pknots, HotKnots V2.0, PknotsRG and IPknot), we show superior capabilities for pseudoknot prediction and identification. Being fast, space-efficient, expressive and accurate, Knotty opens the door to large-scale, biologically relevant applications of pseudoknot structure prediction covering all important pseudoknot classes.

Acknowledgements

We thank Anne Condon for discussing details of the CCJ algorithm and first ideas on space savings.
Conflict of Interest: none declared.

References

Akutsu T. (2000) Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discrete Appl. Math., 104, 45–62.
Andronescu M. et al. (2005) Secondary structure prediction of interacting RNA molecules. J. Mol. Biol., 345, 987–1001.
Andronescu M. et al. (2007) Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics, 23, i19–i28.
Andronescu M. et al. (2008) RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics, 9, 340.
Andronescu M.S. et al. (2010) Improved free energy parameters for RNA pseudoknotted secondary structure prediction. RNA, 16, 26–42.
Backofen R. et al. (2011) Sparse RNA folding: time and space efficient algorithms. J. Discrete Algorithms, 9, 12–31.
Bellaousov S., Mathews D.H. (2010) ProbKnot: fast prediction of RNA secondary structure including pseudoknots. RNA, 16, 1870–1880.
Chang K.-Y., Tinoco I. (1997) The structure of an RNA ‘kissing’ hairpin complex of the HIV TAR hairpin loop and its complement. J. Mol. Biol., 269, 52–66.
Chang R.-Y.Y. et al. (2013) Japanese encephalitis virus non-coding RNA inhibits activation of interferon by blocking nuclear translocation of interferon regulatory factor 3. Vet. Microbiol., 166, 11–21.
Chen H.-L. et al. (2009) An O(n5) algorithm for MFE prediction of kissing hairpins and 4-chains in nucleic acids. J. Comput. Biol., 16, 803–815.
Dirks R.M., Pierce N.A. (2003) A partition function algorithm for nucleic acid secondary structure including pseudoknots. J. Comput. Chem., 24, 1664–1677.
Hajiaghayi M. et al. (2012) Analysis of energy-based algorithms for RNA secondary structure prediction. BMC Bioinformatics, 13, 22.
Huang X., Ali H. (2007) High sensitivity RNA pseudoknot prediction. Nucleic Acids Res., 35, 656–663.
Jabbari H. (2015) Algorithms for prediction of RNA pseudoknotted secondary structures. Ph.D. Thesis, University of British Columbia, Vancouver, Canada.
Jabbari H., Condon A. (2014) A fast and robust iterative algorithm for prediction of RNA pseudoknotted secondary structures. BMC Bioinformatics, 15, 147.
Jabbari H. et al. (2008) Novel and efficient RNA secondary structure prediction using hierarchical folding. J. Comput. Biol., 15, 139–163.
Lin Y. et al. (2018) Structural analyses of NEAT1 lncRNAs suggest long-range RNA interactions that may contribute to paraspeckle architecture. Nucleic Acids Res., 46, 3742–3752.
Lyngso R.B., Pedersen C.N.S. (2000) Pseudoknots in RNA secondary structures. BRICS Report Series RS-00-1. RECOMB00, ACM Press.
Melchers W. et al. (1997) Kissing of the two predominant hairpin loops in the coxsackie B virus 3' untranslated region is the essential structural feature of the origin of replication required for negative-strand RNA synthesis. J. Virol., 71, 686–696.
Mercer T.R. et al. (2009) Long non-coding RNAs: insights into functions. Nat. Rev. Genet., 10, 155–159.
Möhl M. et al. (2010) Sparsification of RNA structure prediction including pseudoknots. Algorithms Mol. Biol., 5, 39.
Novikova I.V. et al. (2013) Rise of the RNA machines: exploring the structure of long non-coding RNAs. J. Mol. Biol., 425, 3731–3746.
Nussinov R., Jacobson A.B. (1980) Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc. Natl. Acad. Sci. USA, 77, 6309–6313.
Rastegari B., Condon A. (2007) Parsing nucleic acid pseudoknotted secondary structure: algorithm and applications. J. Comput. Biol., 14, 16–32.
Reeder J., Giegerich R. (2004) Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5, 104.
Rivas E., Eddy S.R. (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol., 285, 2053–2068.
Salari R. et al. (2010) Time and space efficient RNA-RNA interaction prediction via sparse folding. In: Berger B. (ed.) Proceedings of RECOMB 2010, Volume 6044 of Lecture Notes in Computer Science. Springer, Berlin, pp. 473–490.
Sato K. et al. (2011) IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics, 27, i85–i93.
Sheikh S. et al. (2012) Impact of the energy model on the complexity of RNA folding with pseudoknots. In: Kärkkäinen J., Stoye J. (eds), Combinatorial Pattern Matching, Volume 7354 of Lecture Notes in Computer Science. Springer, Berlin, pp. 321–333.
Sperschneider J. et al. (2012) Predicting pseudoknotted structures across two RNA sequences. Bioinformatics, 28, 3058–3065.
Uemura Y. et al. (1999) Tree adjoining grammars for RNA structure prediction. Theor. Comput. Sci., 210, 277–303.
Verheije M.H.
et al. ( 2002 ) Kissing interaction between 3? Noncoding and coding sequences is essential for porcine arterivirus RNA replication . J. Virol ., 76 , 1521 – 1526 . Google Scholar Crossref Search ADS PubMed Wexler Y. et al. ( 2007 ) A study of accessible motifs and RNA folding complexity . JCB , 14 , 856 – 872 . Will S. , Jabbari H. ( 2016 ) Sparse RNA folding revisited: space-efficient minimum free energy structure prediction . Algorithms Mol. Biol ., 11 , 7 – 13 . Google Scholar Crossref Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal: Bioinformatics (Oxford University Press)

Published: Nov 15, 2018
