Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data

SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data Vol. 28 no. 24 2012, pages 3211–3217 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/bts611 Sequence analysis Advance Access publication October 15, 2012 SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data 1,2, 1,2 1,2 * ´ ´ ` Evguenia Kopylova , Laurent Noe and Helene Touzet 1 2 LIFL (UMR CNRS 8022 Universite Lille 1) and Inria Lille Nord-Europe, 59655 Villeneuve d’Ascq, France Associate Editor: Ivo Hofacker first set of programs—Meta-RNA 3 (Huang et al., 2009), SSU- ABSTRACT ALIGN (Nawrocki et al., 2009) and rRNASelector (Lee et al., Motivation: The application of next-generation sequencing (NGS) 2011)—shares a common algorithmic approach to represent an technologies to RNAs directly extracted from a community of organ- rRNA family database using a probabilistic model. Both isms yields a mixture of fragments characterizing both coding and Meta-RNA and rRNASelector use prebuilt Hidden Markov non-coding types of RNAs. The task to distinguish among these and Models (HMM) and consequently sort reads against the data- to further categorize the families of messenger RNAs and ribosomal base with the HMMER3 package (Eddy, 1998), whereas RNAs (rRNAs) is an important step for examining gene expression SSU-ALIGN uses covariance models to support secondary patterns of an interactive environment and the phylogenetic classifi- structure information. An alternative algorithm outside the cation of the constituting species. domain of probabilistic models is riboPicker (Schmieder et al., Results: We present SortMeRNA, a new software designed to rapidly 2012), which uses a modified version of the Burrows-Wheeler filter rRNA fragments from metatranscriptomic data. It is capable of Aligner (Li and Durbin, 2009). Lastly, BLASTN (Altschul handling large sets of reads and sorting out all fragments matching to et al., 1990) is used in numerous home-made workflows for the rRNA database with high sensitivity and low running time. this problem. With BLASTN, however, reads should be com- Availability: http://bioinfo.lifl.fr/RNA/sortmerna pared with all sequences of an rRNA database to achieve a good Contact: [email protected] sensitivity level. In all cases, computational time is still an issue to Supplementary information: Supplementary data are available at handle large collections of reads. Bioinformatics online. In this article, we describe SortMeRNA, an efficient filter Received on May 16, 2012; revised on September 17, 2012; accepted requiring only a representative set for an rRNA database and on October 9, 2012 rapidly sorting through millions of reads. The underlying algo- rithm is analogous to the seeding strategy, focusing on finding many short regions of similarity between an rRNA database and 1 INTRODUCTION a read. SortMeRNA also takes advantage of redundancy be- The application of next-generation sequencing (NGS) technolo- tween homolog sequences, as HMMs do, and builds a com- gies for metatranscriptomic profiling has been a successful ven- pressed model of all rRNA sequences. The generated results ture in practice. Scientists may now gain access to the full set of adhere to the accuracy of the HMM-based programs and are coding and non-coding RNA in a community of organisms, computed in a fraction of the time. which becomes particularly important for samples that cannot be cultivated outside their native environment (Bomar et al., 2011; Shi et al., 2009; Stewart et al., 2011). The initial challenge 2 SYSTEM AND METHODS of metatranscriptomic sequenced data analysis is to sort apart 2.1 Algorithm overview the RNA fragments based on their biological significance. Messenger RNAs (mRNAs) cast a universal glimpse on the We assume having a collection of unassembled reads and a data- gene expression patterns between interactive species. Likewise, base of rRNA sequences, and we want to sort out reads that the ribosomal RNAs (rRNAs) disclose information on the com- match to the database. The general principle behind our algo- munity’s structure, evolution and biodiversity, and prevail in rithm is to search for many short similarity regions between each classification and phylogenetic analyses. The rRNA can com- read and the rRNA database. We scan each read with a sliding prise up to 90% of total RNA. Various prior-to-sequencing pro- window, and the accepted reads are those that have more than a cedures, such as mRNA amplification kits, can help to enrich the threshold number of windows present in the database. For a yield of mRNA (Gilbert and Hughes, 2011). However, these kits given read and a given window on the read, we authorize one are not fully satisfactory, as secondary steps may be required to error (substitution, insertion or deletion) between the window verify if the resulting material is an accurate representative of the and the rRNA database. initial samples (Nygaard et al., 2005). New software has been To achieve this task in an efficient manner, the rRNA data- recently developed to address this issue; this software can identify base is stored in a Burst trie coupled with a lookup table that and isolate rRNA fragments from a set of sequenced reads. The speeds up the access to the Burst trie and takes advantage of conserved regions in the rRNA sequences. For a given read and *To whom correspondence should be addressed. a given window on the read, we find the set of windows present The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected] 3211 E.Kopylova et al. 2.2 The Burst trie to store an rRNA database in Burst trie using the universal Levenshtein automaton. This comparison is done by performing a parallel traversal between The Burst trie (Heinz et al., 2002) is a fast and versatile data the Levenshtein automaton and the Burst trie. structure that effectively stores a large number of strings such as Figure 1 globally illustrates this framework. The length s of the an rRNA database. Unlike the standard trie, the binary search sliding window is a parameter of the algorithm, further discussed tree or other variants, which often adopt an equal rate of in Section 2.5. The acceptance of a read depends also on the ratio memory access among the cache or main memory, the Burst of matched windows. Let r be this parameter. This choice will trie can exploit the modern cache architecture by addressing also be discussed in Section 2.5. memory closest to the CPU. It is capable of reducing the Fig. 1. SortMeRNA algorithm overview. The set of representative rRNA sequences is preprocessed in the following way: (1a) The lookup table stores all of the s 2mers and their number of occurrences that exist in the rRNA database. (1b) The Burst trie is a data structure that stores the rRNA database. (2a) The algorithm takes as input a collection of reads provided by the user, and for each read, a sliding window w of even length s 2½14, 20 moves s s s across the read. (2b) For each window w, the prefix w½1::  and suffix w½s:: þ 1 are translated into a decimal value between 0 and 2  1. (3) If the value 2 2 exists in the lookup table with a high frequency (see Section 1.1 of the Supplementary File), the remaining part of the window is searched in the Burst trie. This is done with a cyclic traversal between the universal Levenshtein automaton and the Burst trie, which determines whether the subpattern is present in the rRNA database with at most one error. For every letter traversed in the Burst trie, a bitvector is passed to the universal Levenshtein automaton to verify whether the number of encountered errors remains 1. (4) After all windows have been traversed, if the number of accepted windows exceeds a certain threshold (see in Section 2.5), then the read is accepted and classified as rRNA 3212 SortMeRNA number of trie nodes by 80% while maintaining performance scattered ones leading to a false alignment (see Section 1.1 of the similar to a hash table (Askitis and Zobel, 2010). Given a se- Supplementary File). quence vz, the Burst trie can store the prefix v as a link of trie nodes and the suffix z as an array of characters appended to the 2.3 The universal Levenshtein automaton last trie node. Normally, subtrees become more sparse in the The classical non-deterministic Levenshtein automaton for a pat- depth of a trie, and representing them as reduced ‘buckets’ of tern p and a number of errors k recognizes the set of strings that contiguous memory preserves space and boosts cache efficiency. are at most edit distance k to p (Fig. 3). This automaton is not When the number of sequences sharing a common prefix v suitable for computation because of the presence of multiple reaches a fixed threshold, the appended bucket of suffixes active states and epsilon transitions. This may be overcome by bursts to form a new trie node and smaller sub-buckets. To op- transforming the automaton into an equivalent deterministic timize memory access during subtree traversal, the threshold size form. However, the resulting automaton may be exponential in of a bucket should be less than the lower-level cache. A system- the length of p and likewise dependent on it. In studies by Schulz atic use of this trie can be observed in the fastest sorting algo- and Mihov (2002) and Mihov and Schulz (2004), a universal rithm for large sets of strings, the Burstsort (Sinha and Zobel, Levenshtein automaton was characterized based on insightful 2004). observations of the classical one. The term universal conveys its Following a similar method of an array-structured trie as one-time construction and independency of p. The intuition described in Sinha et al. (2006), our Burst trie is assembled arises from the symmetry of the non-deterministic automaton, exactly on the nucleotide alphabet {a, c, g, u}. As illustrated in which applies the same set of transition rules to every new input Figure 2, the trie stores every unique (s þ 1)mer substring in an character, and each new set of active states is a subset of a known rRNA database, as we look at windows of length s with at most bounded superset. A set of bitvectors symbolizing the homology one error between any two words. The information on whether of p and a candidate string serve as input to the automaton. In the (s þ 1)mer belongs to a forward strand, the reverse comple- full generality, the size of the automaton is exponential in a func- ment or both (strand), and its origin (hashid) follows each entry tion of k (Mitankin, 2005). In our case, as k ¼ 1, it remains suf- in a bucket. When the exact location of the (s þ 1)mer needs to be ficiently small. The set of bitvectors representing the similarity of found in an rRNA database, the hashid value serves as an index two strings is precomputed using the following definition. in a complementary table storing this information. Nearly Definition 2.1. (Mihov and Schulz, 2004) The characteristic P P one-quarter of the 16S rRNA positions are 99–100% conserved vector  ~ðw, VÞ of a symbol w 2 in a word V ¼ v .. . v 2 1 n (Cannone et al., 2002; Mears et al., 2002), and this moderates the is the bitvector of length n where the ith bit is set to 1 iff w ¼ v . size of the trie, as many identical or closely similar substrings are shared between sequences. The technical details of n  2k þ 2 and the prefix of k symbols We use an additional optimization to improve access into the of ‘$’ appended to the pattern p can be found in the article by Burst trie. Because we consider at most one error between the Mihov and Schulz (2004). window and the database, we have this simple property: for every Example 2.1. Let k ¼ 1, the input word W ¼ acaga and the pat- two words such that the edit distance between them is bounded tern p ¼ $acuaga,then  ða, $acuÞ¼ 0100,  ðc, acuaÞ¼ 0100, 1 2 by 1, there exists a common substring of length s 2, which is ða, cuagÞ¼ 0010,  ðg, uagaÞ¼ 0010,  ða, agaÞ¼ 101 are 3 4 5 either a prefix or a suffix of the two words. We apply this the computed characteristic bitvectors. It follows that property to construct a lookup table storing all s 2mers existing f , ... ,  g is the characteristic bitvector array carrying the simi- 1 5 in the rRNA database. Note that for s in [14, 20], transposing larity information of x and p. the nucleotide alphabet onto a binary equivalent, such that {a, c, Beginning from  to  , the bitvectors are sequentially g, u} ¼ {00, 01, 10, 11}, we can represent each s 2mer in s bits, 1 jsj passed into the universal Levenshtein automaton. Each bitvector which maps to a unique integer value. On completion of the leads to a transition between states (in constant time) corres- forward and reverse Burst tries, a scan of each trie is performed ponding to the number of errors encountered thus far. If some to record the existence of all s 2mers and, if present, associated reaches a failure state, greater than k errors exist between s pointers to the trie node representing the immediate letter fol- and p, and the strings are rejected. The automaton only recog- lowing the prefix. The precomputed lookup table quickly deter- nizes two strings if the input of the last bitvector  leads to a jsj mines whether an exact match of the prefix or suffix exists in the final state. Burst tries, and furthermore it provides us with direct access to the remaining part of the word in the Burst trie. 2.4 Match of a read with the dynamic bitvector table The lookup table also allows us to take into account distribu- tion of s 2mers in the rRNA database. A multiple sequence At this point, matching a window w of length s on the read alignment of an rRNA database can clearly define areas of against the rRNA database amounts to first checking whether high nucleotide conservation and emphasize the evolutionary the prefix or the suffix of length s 2of w is present in the lookup origins shared between organisms. In a similar manner, the table, then determining whether the universal Levenshtein au- lookup table defines highly conserved areas by keeping only fre- tomaton for w recognizes some word in the Burst trie. For the quent s 2mer occurrences in the rRNA database. Before a second step, we have to implement a rapid traversal of the window is traversed in the Burst trie, its prefix or suffix must Levenshtein automaton, which relies on the precomputation of exist in the lookup table. This notion enforces that a read bitvectors for w. At every depth of the Burst trie, we assume that matches closely to one region in a database rather than multiple the symbol q in  ðq, VÞ appears as one of {a, c, g, u}with equal 3213 E.Kopylova et al. Fig. 2. Let s ¼ 16, the Burst trie below is constructed on the first six 17mers of an rRNA sequence. The ‘char flag’ describes whether a pointer is set to a trie node ‘1’, a bucket ‘2’ or neither ‘0’. Additional information on the origin of the 17mer directly follows each element, as shown in the dashed bucket Fig. 3. The non-deterministic Levenshtein automaton for p ¼ acgu and #e Fig. 4. The precomputed bitvector table for pattern p ¼ $acuaga covering k ¼ 1. The s notation for each state corresponds to s number of char- all possibilities of q for k ¼ 1. The first bit in each entry of column i ¼ 0 acters read in the pattern p and e number of errors recorded. The initial #0 #0 #0 #1 represents the $ symbol and is always set to ‘0’ state is 0 , and the three final states are 3 ,4 and 4 .Eachnon-final state has three outgoing arcs, one for each type of edit operation the edit distance between the pattern and a traversed branch probability. Ultimately during traversal, the bitvector of the exceeds k. To further speed up Burst trie traversal for every actual residing nucleotide is chosen. Figure 4 shows the precom- window, a ‘backwards dictionary’ approach as described in putation of bitvectors for p ¼ $acuaga in Example 2.1. If the Mihov and Schulz (2004) was implemented. The original algo- string x ¼ acaga existed in the trie, then the highlighted set of rithm builds two dictionaries, one for the forward strings and the bitvectors {0100, 0100, 0010, 0010, 101} would form the bitvector second for their reverse equivalents. In this manner, the same array (see Section 1.2 of the Supplementary File for a graphic window can be traversed quickly from both ends. example). When the window is shifted by one position, the subsequent pattern p changes simply by the removal of the first character in 2.5 Parameter setting the prefix and the addition of a new character in the suffix. The algorithm depends on two parameters: the length s of the Hence, rather than recomputing the bitvector table for each sliding window, and the minimal proportion r of accepted win- new window, a series of bitwise operations is taken to modify dows in a read. To find a robust choice for s and r, we ran the it, as demonstrated in Figure 5. algorithm for several values of s and r on several rRNA data- Following a preorder path, the traversal of the Burst trie bases and for several sets of reads. begins at the root node. Through knowledge of the nucleotide We purposely designed four databases with distinctive fea- letter and the depth of the node being visited, the coinciding tures: small 16S and large 23S subunit, varying identity percent- bitvector is accessed in the precomputed bitvector table, indiffer- age and from distinct phylogeny tree subparts, ent to whether the node is a trie node or a character in the bucket. Subsequently, the bitvector is passed to the universal Set 1: 16S, 80% identity (2262 rRNA) Levenshtein automaton, which decides whether to continue tra- Set 2: 16S, 80% identity, truncated phylogeny tree (2187 versal of the current subtree or backtrack to the first branching rRNA) point with a non-failure Levenshtein state and recommence tra- versal of a new substree. In this manner, a complete traversal of Set 3: 23S, 95% identity (1969 rRNA)Set 4: 23S, 95% identity, the Burst trie remains unlikely, as backtracking occurs each time truncated phylogeny tree (1906 rRNA). 3214 SortMeRNA OpenMP functions to parallelize filtering of the reads. The input criteria are a fasta/fastq file of letter space reads produced by Roche 454 or Illumina technologies, and a fasta file of rRNA sequences. There are eight rRNA databases included in the soft- ware package covering the small (16S/18S), large (23S/28S) and 5/5.8S ribosomal subunit rRNAs, which were all derived from the SILVA and RFAM databases. Additionally, the user can work with his or her own RNA databases. 4 EXPERIMENTAL EVALUATION The performance of SortMeRNA was measured in terms of sen- sitivity, selectivity and real-data analysis compared with the soft- ware SSU-ALIGN (Nawrocki et al., 2009), Meta-RNA (Huang et al., 2009), rRNASelector (Lee et al., 2011), riboPicker (Schmieder et al., 2012) and BLASTN (Altschul et al., 1990). Fig. 5. The modification of the bitvector table from pattern p ¼ $acuaga to p ¼ $cuagaa for k ¼ 1. Columns 0–2 of p are equal to columns 1–3 of All tests were performed on an Intel(R) Xeon(R) CPU W3520 2 2 p , except for column 0, where the most significant bit (MSB) of every 1 2.67 GHz machine with 8 GB of RAM, L1 cache size of 32 KB, bitvector represents the symbol $ and is set to ‘0’. Column 3 of p equals L2 cache size of 256 KB and L3 cache size of 8192 KB. Because to column 4 of p with an additional bit appended. The appended bit is riboPicker and SSU-ALIGN do not provide a direct option for set to ‘1’ in the bitvector corresponding to the newly appended character; multithreading, all tests were carried out using one thread. otherwise, it is set to 0. Column 4 of p is equal to column 3 of p , 2 2 although the MSB is not considered. The same rule applies to column 4.1 rRNA databases 5of p , where the two MSBs of the column 3 bitvectors are not considered We created two new representative databases: 16S rRNA with 85% identity (7659 sequences) and 23S rRNA with 98% identity (2811 sequences) (see Section 3.2 of the Supplementary File). The These databases were constructed by applying the ARB package 16S rRNA database was used by SortMeRNA, riboPicker, BLASTN and SSU-ALIGN, and the 23S rRNA database was (Ludwig et al., 2004) and UCLUST (Edgar, 2010) to sequences used by SortMeRNA, riboPicker and BLASTN. SSU-ALIGN from SILVA (Pruesse et al., 2007) (see Section 2.1 in the was written for aligning small ribosomal subunits and does not Supplementary File). Next, we constructed datasets of simulated provide models for 23S rRNA. riboPicker was also tested with a rRNA and non-rRNA reads using the software MetaSim more comprehensive database made available from their web (Richter et al., 2008). We used two sequencing error models, site: all 16S and 23S rRNA sequences taken from SILVA, Roche 454 and Illumina, because the errors for Roche 454 RDP-II, Greengenes, NCBI archaeal and bacterial genomes mainly originate as indels and for Illumina as substitutions. and HMP (3 232 371 16S and 1 960 223S unique sequences). The length of the reads differs as well: 200 nt for Roche 454 The results for this larger database are indicated by and 100 nt for Illumina technology. To test the sensitivity on Sets riboPicker* in all subsequent tables. For Meta-RNA and 1 and 3, we constructed 300 000 Roche 454 reads and 1 million rRNASelector, we used the HMMs provided with the software. Illumina reads on the entire SILVA database minus the se- quences used for the representative rRNA database. To measure the sensitivity for discovering new species with Sets 2 and 4, the 4.2 Simulated reads same number of reads was simulated only on the truncated sec- 4.2.1 Sensitivity for 16S rRNA In all, 300 000 Roche 454 tions of the bacteria phylogeny tree. To test the selectivity, the and 1 million Illumina 16S rRNA reads were simulated in the non-rRNA reads were simulated using the NCBI bacterial gen- same manner as described in Section 2.5. The performance results omes library with rRNAs masked (see Section 2.2 in the canbe viewedinTable 1. Allsoftware programsexcept Supplementary File). riboPicker and SSU-ALIGN have a sensitivity level497%, and The parameter values were varied as: s 2½14, 20 and r 2ð0, 1Þ. 499% for BLASTN and SortMeRNA. The sensitivity for The main conclusion is that s ¼ 18, r ¼ 0.15 for Roche 454 reads riboPicker is low (56%) because BWA-SW works well with and s ¼ 18, r ¼ 0.25 for Illumina reads give best sensitivity/select- error rates 2–3% for 100–200 nt reads, and loses sensitivity for ivity balance for all rRNA databases. Moreover, varying r within new species. As expected, the sensitivity increases with a larger short ranges does not significantly affect the results (see Section database (indicated riboPicker*). Considering the computation 2.3 of the Supplementary File). We use these values as default time, SortMeRNA runs in52min,or72 faster than the next settings in all subsequent analyses. fastest tool with proportionate sensitivity (Meta-RNA). Note also that BLASTN executes at a slow speed (several hours) because reads should be compared with all sequences in the representative 3 IMPLEMENTATION database. SortMeRNA is implemented in Cþþ and freely distributed under the GNU general public license. It can be downloaded 4.2.2 Selectivity for 16S rRNA One million Roche 454 and from http://bioinfo.lifl.fr/RNA/sortmerna. The software uses 1 million Illumina non-16S rRNA reads were simulated in the 3215 E.Kopylova et al. same manner as described in Section 2.5. The performance re- in Figures 6 and 7. The results obtained with SortMeRNA are sults can be viewed in Table 2. All programs have a selectivity close to the ones obtained with HMM-based methods. riboPicker level 499.98%. The number of false positives for the finds only a fraction of all potential rRNAs, which confirms its HMM-based programs remains comparable with SortMeRNA low sensitivity for small databases. The majority of 16S reads for both Illumina and Roche 454 reads. The difference in the found only by riboPicker* (1298) map to mRNA. For 23S ana- simulated data results between Meta-RNA and rRNASelector lysis in Table 4 and Figure 7, 99% of the excess reads of can be attributed to the number of bacteria versus archaea Meta-RNA (12 112) and rRNASelector likewise map to 28S, rRNA sequences used in the construction of the HMMs, as along with 83% of the (624) reads found only by BLASTN and well as additional parameter settings in rRNASelector. Meta-RNA. The (537) reads found only by BLASTN map to riboPicker* and BLASTN show the lowest selectivity. mRNA, 16S rRNA and other non-coding RNA. Concerning the running time, the order of the fastest programs is rRNASelector, Meta-RNA and SortMeRNA. Both 5 DISCUSSION rRNASelector and Meta-RNA use the HMMER3 package, which applies a prefilter to quickly reject sequences that would SortMeRNA has shown to be a rapid and efficient filter that can score very low in the HMM. This acceleration heuristic gives sort a large set of metatranscriptomic reads with high accuracy these programs a competitive advantage on the artificial dataset comparable with the HMM-based programs. SortMeRNA im- for selectivity where all sequences are negative. plements seeds with errors (substitution and indel), and this im- Results for 23S rRNAs are analogous in terms of accuracy and portant characteristic renders the algorithm robust to errors of running time. They can be found in Table A and Table B under different types of sequencers while providing the ability to dis- Section 3.3 of the Supplementary File. cover new rRNA sequences from unknown species. The method used by the algorithm is universal and flexible. 4.3 Real data The database can be constructed on any family of sequences The metatranscriptomic datasets SRR106861 of a photosynthetic provided by the user. Moreover, the algorithm does not require microbial community and SRR013513 of a tidal salt marsh creek a multiple sequence alignment file to build the database, as from 454 sequencing were downloaded from the NCBI Sequence HMM-based programs do, and this is an advantage when se- Read Archive. The results for 16S and 23S can be viewed in Tables quences are hard to align or only partial sequences are available. 3 and 4, respectively, and the overlap of the results between tools Another advantage of SortMeRNA is the small number of Table 1. Sensitivity Software Illumina Roche 454 rRNA Run time Latency Memory (%) Sensitivity (%) rRNA Run time Latency Memory (%) Sensitivity (%) SortMeRNA 998615 1 min 39 s 1 8.5 99.861 299979 1 min 43 s 1 6.3 99.993 riboPicker 558607 18 min 45 s 11 6.8 55.860 123024 18 min 36 s 11 5.6 41.008 riboPicker* 999941 6 h 33 min 238 35.3 99.994 299999 9 h 314 34 99.999 BLASTN 995322 23 h 52 min 868 3.0 99.532 299978 18 h 35 min 649 1.4 99.992 Meta-RNA 983332 2 h 72 33.3 98.333 299980 1 h 57 min 68 12.9 99.993 rRNASelector 974118 1 h 47 min 64 17.4 97.411 299976 2 h 70 7 99.992 SSU-ALIGN 971221 6 h 49 min 248 0.1 97.122 299902 5 h 50 min 204 0.1 99.967 One million of MetaSim-simulated Illumina (100 nt) and 300 000 Roche 454 (200 nt) rRNA reads against a representative 16 S rRNA database of 7659 sequences. Table 2. Selectivity Software Illumina Roche 454 rRNA Run time Latency Memory (%) Selectivity (%) rRNA Run time Latency Memory (%) Selectivity (%) SortMeRNA 17 2 min 9 s 2 7.6 99.9983 13 3 min 42 s 1 10.2 99.9987 riboPicker 7 10 min 22 s 8 6.7 99.9993 3 29 min 45 s 9 16.8 99.9997 riboPicker* 158 56 min 37 s 42 35.1 99.9842 53 2 h 43 min 49 45.2 99.9947 BLASTN 33 14 min 22 s 11 0.3 99.9967 33 16 min 12 s 5 0.3 99.9967 Meta-RNA 11 1 min 33 s 1 0.1 99.9989 11 3 min 41 s 1 0.2 99.9989 rRNASelector 10 1 min 20 s 1 0.1 99.9990 11 3 min 21 s 1 0.2 99.9989 SSU-ALIGN 8 3 h 51 min 173 0.1 99.9992 11 10 h 30 min 188 0.1 99.9989 One million of MetaSim-simulated Illumina (100 nt) and 1 million Roche 454 (200 nt) non-rRNA reads against a representative 16S rRNA database of 7659 sequences. 3216 SortMeRNA Table 3. Runtime for the SRR106861 metatranscriptome of 105 873 parameter settings required by the program (see Section 2 of the reads against a 16S rRNA database of 7659 sequences Supplementary Data). Software rRNA Run time Latency Memory (%) ACKNOWLEDGEMENTS Project MAPPI is associated with the Tara Oceans expedition SortMeRNA 27046 34 s 1 4.8 (oceans.taraexpeditions.org), where the principal tasks involve riboPicker 11389 4 min 10 s 7 2.3 the development of new software for mapping and assembling riboPicker* 27195 39 min 3 s 69 30.8 metagenomic and metatranscriptomic data. BLASTN 27061 1 h 29 min 157 0.6 Meta-RNA 27111 10 min 33 s 18 1.8 Funding: This research was supported by the French National rRNASelector 27085 10 min 40 s 18 0.8 Agency for Research (grant ANR-2010-COSI-004). Conflict of Interest: none declared. Table 4. Runtime for the SRR013513 metatranscriptome of 207 368 reads against a 23S rRNA database of 2811 sequences REFERENCES Altschul,S. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, Software rRNA Run time Latency Memory (%) 403–410. Askitis,N. and Zobel,J. (2010) Redesigning the string hash table, burst trie, and bst SortMeRNA 94395 51 s 1 3.8 to exploit cache. ACM JEA, 15,7–8. Bomar,L. et al. (2011) Directed culturing of microorganisms using metatranscrip- riboPicker 71937 10 min 2 s 12 3.9 tomics. MBio, 2, e00012–11. riboPicker* 84152 36 min 27 s 43 5.5 Cannone,J. et al. (2002) The comparative RNA web (CRW) site: an online database BLASTN 94439 3 h 42 min 261 0.9 of comparative sequence and structure information for ribosomal, intron, and Meta-RNA 106698 1 h 33 min 109 4.8 other RNAs. BMC Bioinformatics, 3,15. rRNASelector 107900 1 h 36 min 113 3 Eddy,S. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763. Edgar,R. (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461. Gilbert,J. and Hughes,M. (2011) Gene expression profiling: metatranscriptomics. Methods Mol. Biol., 733, 195–205. Heinz,S. et al. (2002) Burst tries: a fast, efficient data structure for string keys. ACM Trans. Inf. Syst., 20, 192–223. Huang,Y. et al. (2009) Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics, 25, 1338–1340. Lee,J. et al. (2011) rRNASelector: a computer program for selecting ribosomal RNA encoding sequences from metagenomic and metatranscriptomic shotgun libraries. J. Microbiol., 49, 689–691. Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25, 1754–1760. Ludwig,W. et al. (2004) ARB: a software environment for sequence data. Nucleic Acids Res., 32, 1363–1371. Mears,J. et al. (2002) Modeling a minimal ribosome based on comparative sequence analysis. J. Mol. Biol., 321, 215–234. Mihov,S. and Schulz,K. (2004) Fast approximate search in large dictionaries. J. Comput. Ling., 30, 451–477. Fig. 6. Venn diagram for reads classified as 16S rRNA by BLASTN, Mitankin,P. (2005) Universal Levenshtein Automata. Building and Properties. Meta-RNA, SortMeRNA and riboPicker* in the SRR106861 Master’s Thesis. Bulgaria, Sofia University. Nawrocki,E. et al. (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics, metatranscriptome 25, 1335–1337. Nygaard,V. et al. (2005) Limitations of mRNA amplification from small-size cell samples. BMC Genomics, 6, 147. Pruesse,E. et al. (2007) Silva: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res., 35, 7188–7196. Richter,D. et al. (2008) A sequencing simulator for genomics and metagenomics. PLoS One, 3,e3373. Schmieder,R. et al. (2012) Identification and removal of ribosomal RNA sequences from metatranscriptomes. Bioinformatics, 28, 433–435. Schulz,K. and Mihov,S. (2002) Fast string correction with Levenshtein automata. IJDAR, 5, 67–85. Shi,Y. et al. (2009) Metatranscriptomics reveals unique microbial small RNAs in the ocean’s water column. Nature, 459, 266–269. Sinha,R. and Zobel,J. (2004) Cache-conscious sorting of large sets of strings with dynamic tries. ACM JEA, 9, Article No. 1.5, doi: 10.1145/1005813.1041517. Sinha,R. et al. (2006) Cache-efficient string sorting using copying. ACM JEA, 11, Fig. 7. Venn diagram for reads classified as 23S rRNA by BLASTN, Article No. 1.2, doi: 10.1145/1187436.1187439. Meta-RNA, SortMeRNA and riboPicker* in the SRR013513 Stewart,F. et al. (2011) Metatranscriptomics analysis of sulfur oxidation genes in the metatranscriptome endosymbiont of solemnya velum. Front. Microbiol., 2,134. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data

Bioinformatics , Volume 28 (24): 7 – Oct 15, 2012

Loading next page...
 
/lp/oxford-university-press/sortmerna-fast-and-accurate-filtering-of-ribosomal-rnas-in-ZBWeLWyOVv

References (45)

Publisher
Oxford University Press
Copyright
© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected]
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bts611
pmid
23071270
Publisher site
See Article on Publisher Site

Abstract

Vol. 28 no. 24 2012, pages 3211–3217 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/bts611 Sequence analysis Advance Access publication October 15, 2012 SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data 1,2, 1,2 1,2 * ´ ´ ` Evguenia Kopylova , Laurent Noe and Helene Touzet 1 2 LIFL (UMR CNRS 8022 Universite Lille 1) and Inria Lille Nord-Europe, 59655 Villeneuve d’Ascq, France Associate Editor: Ivo Hofacker first set of programs—Meta-RNA 3 (Huang et al., 2009), SSU- ABSTRACT ALIGN (Nawrocki et al., 2009) and rRNASelector (Lee et al., Motivation: The application of next-generation sequencing (NGS) 2011)—shares a common algorithmic approach to represent an technologies to RNAs directly extracted from a community of organ- rRNA family database using a probabilistic model. Both isms yields a mixture of fragments characterizing both coding and Meta-RNA and rRNASelector use prebuilt Hidden Markov non-coding types of RNAs. The task to distinguish among these and Models (HMM) and consequently sort reads against the data- to further categorize the families of messenger RNAs and ribosomal base with the HMMER3 package (Eddy, 1998), whereas RNAs (rRNAs) is an important step for examining gene expression SSU-ALIGN uses covariance models to support secondary patterns of an interactive environment and the phylogenetic classifi- structure information. An alternative algorithm outside the cation of the constituting species. domain of probabilistic models is riboPicker (Schmieder et al., Results: We present SortMeRNA, a new software designed to rapidly 2012), which uses a modified version of the Burrows-Wheeler filter rRNA fragments from metatranscriptomic data. It is capable of Aligner (Li and Durbin, 2009). Lastly, BLASTN (Altschul handling large sets of reads and sorting out all fragments matching to et al., 1990) is used in numerous home-made workflows for the rRNA database with high sensitivity and low running time. this problem. With BLASTN, however, reads should be com- Availability: http://bioinfo.lifl.fr/RNA/sortmerna pared with all sequences of an rRNA database to achieve a good Contact: [email protected] sensitivity level. In all cases, computational time is still an issue to Supplementary information: Supplementary data are available at handle large collections of reads. Bioinformatics online. In this article, we describe SortMeRNA, an efficient filter Received on May 16, 2012; revised on September 17, 2012; accepted requiring only a representative set for an rRNA database and on October 9, 2012 rapidly sorting through millions of reads. The underlying algo- rithm is analogous to the seeding strategy, focusing on finding many short regions of similarity between an rRNA database and 1 INTRODUCTION a read. SortMeRNA also takes advantage of redundancy be- The application of next-generation sequencing (NGS) technolo- tween homolog sequences, as HMMs do, and builds a com- gies for metatranscriptomic profiling has been a successful ven- pressed model of all rRNA sequences. The generated results ture in practice. Scientists may now gain access to the full set of adhere to the accuracy of the HMM-based programs and are coding and non-coding RNA in a community of organisms, computed in a fraction of the time. which becomes particularly important for samples that cannot be cultivated outside their native environment (Bomar et al., 2011; Shi et al., 2009; Stewart et al., 2011). The initial challenge 2 SYSTEM AND METHODS of metatranscriptomic sequenced data analysis is to sort apart 2.1 Algorithm overview the RNA fragments based on their biological significance. Messenger RNAs (mRNAs) cast a universal glimpse on the We assume having a collection of unassembled reads and a data- gene expression patterns between interactive species. Likewise, base of rRNA sequences, and we want to sort out reads that the ribosomal RNAs (rRNAs) disclose information on the com- match to the database. The general principle behind our algo- munity’s structure, evolution and biodiversity, and prevail in rithm is to search for many short similarity regions between each classification and phylogenetic analyses. The rRNA can com- read and the rRNA database. We scan each read with a sliding prise up to 90% of total RNA. Various prior-to-sequencing pro- window, and the accepted reads are those that have more than a cedures, such as mRNA amplification kits, can help to enrich the threshold number of windows present in the database. For a yield of mRNA (Gilbert and Hughes, 2011). However, these kits given read and a given window on the read, we authorize one are not fully satisfactory, as secondary steps may be required to error (substitution, insertion or deletion) between the window verify if the resulting material is an accurate representative of the and the rRNA database. initial samples (Nygaard et al., 2005). New software has been To achieve this task in an efficient manner, the rRNA data- recently developed to address this issue; this software can identify base is stored in a Burst trie coupled with a lookup table that and isolate rRNA fragments from a set of sequenced reads. The speeds up the access to the Burst trie and takes advantage of conserved regions in the rRNA sequences. For a given read and *To whom correspondence should be addressed. a given window on the read, we find the set of windows present The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected] 3211 E.Kopylova et al. 2.2 The Burst trie to store an rRNA database in Burst trie using the universal Levenshtein automaton. This comparison is done by performing a parallel traversal between The Burst trie (Heinz et al., 2002) is a fast and versatile data the Levenshtein automaton and the Burst trie. structure that effectively stores a large number of strings such as Figure 1 globally illustrates this framework. The length s of the an rRNA database. Unlike the standard trie, the binary search sliding window is a parameter of the algorithm, further discussed tree or other variants, which often adopt an equal rate of in Section 2.5. The acceptance of a read depends also on the ratio memory access among the cache or main memory, the Burst of matched windows. Let r be this parameter. This choice will trie can exploit the modern cache architecture by addressing also be discussed in Section 2.5. memory closest to the CPU. It is capable of reducing the Fig. 1. SortMeRNA algorithm overview. The set of representative rRNA sequences is preprocessed in the following way: (1a) The lookup table stores all of the s 2mers and their number of occurrences that exist in the rRNA database. (1b) The Burst trie is a data structure that stores the rRNA database. (2a) The algorithm takes as input a collection of reads provided by the user, and for each read, a sliding window w of even length s 2½14, 20 moves s s s across the read. (2b) For each window w, the prefix w½1::  and suffix w½s:: þ 1 are translated into a decimal value between 0 and 2  1. (3) If the value 2 2 exists in the lookup table with a high frequency (see Section 1.1 of the Supplementary File), the remaining part of the window is searched in the Burst trie. This is done with a cyclic traversal between the universal Levenshtein automaton and the Burst trie, which determines whether the subpattern is present in the rRNA database with at most one error. For every letter traversed in the Burst trie, a bitvector is passed to the universal Levenshtein automaton to verify whether the number of encountered errors remains 1. (4) After all windows have been traversed, if the number of accepted windows exceeds a certain threshold (see in Section 2.5), then the read is accepted and classified as rRNA 3212 SortMeRNA number of trie nodes by 80% while maintaining performance scattered ones leading to a false alignment (see Section 1.1 of the similar to a hash table (Askitis and Zobel, 2010). Given a se- Supplementary File). quence vz, the Burst trie can store the prefix v as a link of trie nodes and the suffix z as an array of characters appended to the 2.3 The universal Levenshtein automaton last trie node. Normally, subtrees become more sparse in the The classical non-deterministic Levenshtein automaton for a pat- depth of a trie, and representing them as reduced ‘buckets’ of tern p and a number of errors k recognizes the set of strings that contiguous memory preserves space and boosts cache efficiency. are at most edit distance k to p (Fig. 3). This automaton is not When the number of sequences sharing a common prefix v suitable for computation because of the presence of multiple reaches a fixed threshold, the appended bucket of suffixes active states and epsilon transitions. This may be overcome by bursts to form a new trie node and smaller sub-buckets. To op- transforming the automaton into an equivalent deterministic timize memory access during subtree traversal, the threshold size form. However, the resulting automaton may be exponential in of a bucket should be less than the lower-level cache. A system- the length of p and likewise dependent on it. In studies by Schulz atic use of this trie can be observed in the fastest sorting algo- and Mihov (2002) and Mihov and Schulz (2004), a universal rithm for large sets of strings, the Burstsort (Sinha and Zobel, Levenshtein automaton was characterized based on insightful 2004). observations of the classical one. The term universal conveys its Following a similar method of an array-structured trie as one-time construction and independency of p. The intuition described in Sinha et al. (2006), our Burst trie is assembled arises from the symmetry of the non-deterministic automaton, exactly on the nucleotide alphabet {a, c, g, u}. As illustrated in which applies the same set of transition rules to every new input Figure 2, the trie stores every unique (s þ 1)mer substring in an character, and each new set of active states is a subset of a known rRNA database, as we look at windows of length s with at most bounded superset. A set of bitvectors symbolizing the homology one error between any two words. The information on whether of p and a candidate string serve as input to the automaton. In the (s þ 1)mer belongs to a forward strand, the reverse comple- full generality, the size of the automaton is exponential in a func- ment or both (strand), and its origin (hashid) follows each entry tion of k (Mitankin, 2005). In our case, as k ¼ 1, it remains suf- in a bucket. When the exact location of the (s þ 1)mer needs to be ficiently small. The set of bitvectors representing the similarity of found in an rRNA database, the hashid value serves as an index two strings is precomputed using the following definition. in a complementary table storing this information. Nearly Definition 2.1. (Mihov and Schulz, 2004) The characteristic P P one-quarter of the 16S rRNA positions are 99–100% conserved vector  ~ðw, VÞ of a symbol w 2 in a word V ¼ v .. . v 2 1 n (Cannone et al., 2002; Mears et al., 2002), and this moderates the is the bitvector of length n where the ith bit is set to 1 iff w ¼ v . size of the trie, as many identical or closely similar substrings are shared between sequences. The technical details of n  2k þ 2 and the prefix of k symbols We use an additional optimization to improve access into the of ‘$’ appended to the pattern p can be found in the article by Burst trie. Because we consider at most one error between the Mihov and Schulz (2004). window and the database, we have this simple property: for every Example 2.1. Let k ¼ 1, the input word W ¼ acaga and the pat- two words such that the edit distance between them is bounded tern p ¼ $acuaga,then  ða, $acuÞ¼ 0100,  ðc, acuaÞ¼ 0100, 1 2 by 1, there exists a common substring of length s 2, which is ða, cuagÞ¼ 0010,  ðg, uagaÞ¼ 0010,  ða, agaÞ¼ 101 are 3 4 5 either a prefix or a suffix of the two words. We apply this the computed characteristic bitvectors. It follows that property to construct a lookup table storing all s 2mers existing f , ... ,  g is the characteristic bitvector array carrying the simi- 1 5 in the rRNA database. Note that for s in [14, 20], transposing larity information of x and p. the nucleotide alphabet onto a binary equivalent, such that {a, c, Beginning from  to  , the bitvectors are sequentially g, u} ¼ {00, 01, 10, 11}, we can represent each s 2mer in s bits, 1 jsj passed into the universal Levenshtein automaton. Each bitvector which maps to a unique integer value. On completion of the leads to a transition between states (in constant time) corres- forward and reverse Burst tries, a scan of each trie is performed ponding to the number of errors encountered thus far. If some to record the existence of all s 2mers and, if present, associated reaches a failure state, greater than k errors exist between s pointers to the trie node representing the immediate letter fol- and p, and the strings are rejected. The automaton only recog- lowing the prefix. The precomputed lookup table quickly deter- nizes two strings if the input of the last bitvector  leads to a jsj mines whether an exact match of the prefix or suffix exists in the final state. Burst tries, and furthermore it provides us with direct access to the remaining part of the word in the Burst trie. 2.4 Match of a read with the dynamic bitvector table The lookup table also allows us to take into account distribu- tion of s 2mers in the rRNA database. A multiple sequence At this point, matching a window w of length s on the read alignment of an rRNA database can clearly define areas of against the rRNA database amounts to first checking whether high nucleotide conservation and emphasize the evolutionary the prefix or the suffix of length s 2of w is present in the lookup origins shared between organisms. In a similar manner, the table, then determining whether the universal Levenshtein au- lookup table defines highly conserved areas by keeping only fre- tomaton for w recognizes some word in the Burst trie. For the quent s 2mer occurrences in the rRNA database. Before a second step, we have to implement a rapid traversal of the window is traversed in the Burst trie, its prefix or suffix must Levenshtein automaton, which relies on the precomputation of exist in the lookup table. This notion enforces that a read bitvectors for w. At every depth of the Burst trie, we assume that matches closely to one region in a database rather than multiple the symbol q in  ðq, VÞ appears as one of {a, c, g, u}with equal 3213 E.Kopylova et al. Fig. 2. Let s ¼ 16, the Burst trie below is constructed on the first six 17mers of an rRNA sequence. The ‘char flag’ describes whether a pointer is set to a trie node ‘1’, a bucket ‘2’ or neither ‘0’. Additional information on the origin of the 17mer directly follows each element, as shown in the dashed bucket Fig. 3. The non-deterministic Levenshtein automaton for p ¼ acgu and #e Fig. 4. The precomputed bitvector table for pattern p ¼ $acuaga covering k ¼ 1. The s notation for each state corresponds to s number of char- all possibilities of q for k ¼ 1. The first bit in each entry of column i ¼ 0 acters read in the pattern p and e number of errors recorded. The initial #0 #0 #0 #1 represents the $ symbol and is always set to ‘0’ state is 0 , and the three final states are 3 ,4 and 4 .Eachnon-final state has three outgoing arcs, one for each type of edit operation the edit distance between the pattern and a traversed branch probability. Ultimately during traversal, the bitvector of the exceeds k. To further speed up Burst trie traversal for every actual residing nucleotide is chosen. Figure 4 shows the precom- window, a ‘backwards dictionary’ approach as described in putation of bitvectors for p ¼ $acuaga in Example 2.1. If the Mihov and Schulz (2004) was implemented. The original algo- string x ¼ acaga existed in the trie, then the highlighted set of rithm builds two dictionaries, one for the forward strings and the bitvectors {0100, 0100, 0010, 0010, 101} would form the bitvector second for their reverse equivalents. In this manner, the same array (see Section 1.2 of the Supplementary File for a graphic window can be traversed quickly from both ends. example). When the window is shifted by one position, the subsequent pattern p changes simply by the removal of the first character in 2.5 Parameter setting the prefix and the addition of a new character in the suffix. The algorithm depends on two parameters: the length s of the Hence, rather than recomputing the bitvector table for each sliding window, and the minimal proportion r of accepted win- new window, a series of bitwise operations is taken to modify dows in a read. To find a robust choice for s and r, we ran the it, as demonstrated in Figure 5. algorithm for several values of s and r on several rRNA data- Following a preorder path, the traversal of the Burst trie bases and for several sets of reads. begins at the root node. Through knowledge of the nucleotide We purposely designed four databases with distinctive fea- letter and the depth of the node being visited, the coinciding tures: small 16S and large 23S subunit, varying identity percent- bitvector is accessed in the precomputed bitvector table, indiffer- age and from distinct phylogeny tree subparts, ent to whether the node is a trie node or a character in the bucket. Subsequently, the bitvector is passed to the universal Set 1: 16S, 80% identity (2262 rRNA) Levenshtein automaton, which decides whether to continue tra- Set 2: 16S, 80% identity, truncated phylogeny tree (2187 versal of the current subtree or backtrack to the first branching rRNA) point with a non-failure Levenshtein state and recommence tra- versal of a new substree. In this manner, a complete traversal of Set 3: 23S, 95% identity (1969 rRNA)Set 4: 23S, 95% identity, the Burst trie remains unlikely, as backtracking occurs each time truncated phylogeny tree (1906 rRNA). 3214 SortMeRNA OpenMP functions to parallelize filtering of the reads. The input criteria are a fasta/fastq file of letter space reads produced by Roche 454 or Illumina technologies, and a fasta file of rRNA sequences. There are eight rRNA databases included in the soft- ware package covering the small (16S/18S), large (23S/28S) and 5/5.8S ribosomal subunit rRNAs, which were all derived from the SILVA and RFAM databases. Additionally, the user can work with his or her own RNA databases. 4 EXPERIMENTAL EVALUATION The performance of SortMeRNA was measured in terms of sen- sitivity, selectivity and real-data analysis compared with the soft- ware SSU-ALIGN (Nawrocki et al., 2009), Meta-RNA (Huang et al., 2009), rRNASelector (Lee et al., 2011), riboPicker (Schmieder et al., 2012) and BLASTN (Altschul et al., 1990). Fig. 5. The modification of the bitvector table from pattern p ¼ $acuaga to p ¼ $cuagaa for k ¼ 1. Columns 0–2 of p are equal to columns 1–3 of All tests were performed on an Intel(R) Xeon(R) CPU W3520 2 2 p , except for column 0, where the most significant bit (MSB) of every 1 2.67 GHz machine with 8 GB of RAM, L1 cache size of 32 KB, bitvector represents the symbol $ and is set to ‘0’. Column 3 of p equals L2 cache size of 256 KB and L3 cache size of 8192 KB. Because to column 4 of p with an additional bit appended. The appended bit is riboPicker and SSU-ALIGN do not provide a direct option for set to ‘1’ in the bitvector corresponding to the newly appended character; multithreading, all tests were carried out using one thread. otherwise, it is set to 0. Column 4 of p is equal to column 3 of p , 2 2 although the MSB is not considered. The same rule applies to column 4.1 rRNA databases 5of p , where the two MSBs of the column 3 bitvectors are not considered We created two new representative databases: 16S rRNA with 85% identity (7659 sequences) and 23S rRNA with 98% identity (2811 sequences) (see Section 3.2 of the Supplementary File). The These databases were constructed by applying the ARB package 16S rRNA database was used by SortMeRNA, riboPicker, BLASTN and SSU-ALIGN, and the 23S rRNA database was (Ludwig et al., 2004) and UCLUST (Edgar, 2010) to sequences used by SortMeRNA, riboPicker and BLASTN. SSU-ALIGN from SILVA (Pruesse et al., 2007) (see Section 2.1 in the was written for aligning small ribosomal subunits and does not Supplementary File). Next, we constructed datasets of simulated provide models for 23S rRNA. riboPicker was also tested with a rRNA and non-rRNA reads using the software MetaSim more comprehensive database made available from their web (Richter et al., 2008). We used two sequencing error models, site: all 16S and 23S rRNA sequences taken from SILVA, Roche 454 and Illumina, because the errors for Roche 454 RDP-II, Greengenes, NCBI archaeal and bacterial genomes mainly originate as indels and for Illumina as substitutions. and HMP (3 232 371 16S and 1 960 223S unique sequences). The length of the reads differs as well: 200 nt for Roche 454 The results for this larger database are indicated by and 100 nt for Illumina technology. To test the sensitivity on Sets riboPicker* in all subsequent tables. For Meta-RNA and 1 and 3, we constructed 300 000 Roche 454 reads and 1 million rRNASelector, we used the HMMs provided with the software. Illumina reads on the entire SILVA database minus the se- quences used for the representative rRNA database. To measure the sensitivity for discovering new species with Sets 2 and 4, the 4.2 Simulated reads same number of reads was simulated only on the truncated sec- 4.2.1 Sensitivity for 16S rRNA In all, 300 000 Roche 454 tions of the bacteria phylogeny tree. To test the selectivity, the and 1 million Illumina 16S rRNA reads were simulated in the non-rRNA reads were simulated using the NCBI bacterial gen- same manner as described in Section 2.5. The performance results omes library with rRNAs masked (see Section 2.2 in the canbe viewedinTable 1. Allsoftware programsexcept Supplementary File). riboPicker and SSU-ALIGN have a sensitivity level497%, and The parameter values were varied as: s 2½14, 20 and r 2ð0, 1Þ. 499% for BLASTN and SortMeRNA. The sensitivity for The main conclusion is that s ¼ 18, r ¼ 0.15 for Roche 454 reads riboPicker is low (56%) because BWA-SW works well with and s ¼ 18, r ¼ 0.25 for Illumina reads give best sensitivity/select- error rates 2–3% for 100–200 nt reads, and loses sensitivity for ivity balance for all rRNA databases. Moreover, varying r within new species. As expected, the sensitivity increases with a larger short ranges does not significantly affect the results (see Section database (indicated riboPicker*). Considering the computation 2.3 of the Supplementary File). We use these values as default time, SortMeRNA runs in52min,or72 faster than the next settings in all subsequent analyses. fastest tool with proportionate sensitivity (Meta-RNA). Note also that BLASTN executes at a slow speed (several hours) because reads should be compared with all sequences in the representative 3 IMPLEMENTATION database. SortMeRNA is implemented in Cþþ and freely distributed under the GNU general public license. It can be downloaded 4.2.2 Selectivity for 16S rRNA One million Roche 454 and from http://bioinfo.lifl.fr/RNA/sortmerna. The software uses 1 million Illumina non-16S rRNA reads were simulated in the 3215 E.Kopylova et al. same manner as described in Section 2.5. The performance re- in Figures 6 and 7. The results obtained with SortMeRNA are sults can be viewed in Table 2. All programs have a selectivity close to the ones obtained with HMM-based methods. riboPicker level 499.98%. The number of false positives for the finds only a fraction of all potential rRNAs, which confirms its HMM-based programs remains comparable with SortMeRNA low sensitivity for small databases. The majority of 16S reads for both Illumina and Roche 454 reads. The difference in the found only by riboPicker* (1298) map to mRNA. For 23S ana- simulated data results between Meta-RNA and rRNASelector lysis in Table 4 and Figure 7, 99% of the excess reads of can be attributed to the number of bacteria versus archaea Meta-RNA (12 112) and rRNASelector likewise map to 28S, rRNA sequences used in the construction of the HMMs, as along with 83% of the (624) reads found only by BLASTN and well as additional parameter settings in rRNASelector. Meta-RNA. The (537) reads found only by BLASTN map to riboPicker* and BLASTN show the lowest selectivity. mRNA, 16S rRNA and other non-coding RNA. Concerning the running time, the order of the fastest programs is rRNASelector, Meta-RNA and SortMeRNA. Both 5 DISCUSSION rRNASelector and Meta-RNA use the HMMER3 package, which applies a prefilter to quickly reject sequences that would SortMeRNA has shown to be a rapid and efficient filter that can score very low in the HMM. This acceleration heuristic gives sort a large set of metatranscriptomic reads with high accuracy these programs a competitive advantage on the artificial dataset comparable with the HMM-based programs. SortMeRNA im- for selectivity where all sequences are negative. plements seeds with errors (substitution and indel), and this im- Results for 23S rRNAs are analogous in terms of accuracy and portant characteristic renders the algorithm robust to errors of running time. They can be found in Table A and Table B under different types of sequencers while providing the ability to dis- Section 3.3 of the Supplementary File. cover new rRNA sequences from unknown species. The method used by the algorithm is universal and flexible. 4.3 Real data The database can be constructed on any family of sequences The metatranscriptomic datasets SRR106861 of a photosynthetic provided by the user. Moreover, the algorithm does not require microbial community and SRR013513 of a tidal salt marsh creek a multiple sequence alignment file to build the database, as from 454 sequencing were downloaded from the NCBI Sequence HMM-based programs do, and this is an advantage when se- Read Archive. The results for 16S and 23S can be viewed in Tables quences are hard to align or only partial sequences are available. 3 and 4, respectively, and the overlap of the results between tools Another advantage of SortMeRNA is the small number of Table 1. Sensitivity Software Illumina Roche 454 rRNA Run time Latency Memory (%) Sensitivity (%) rRNA Run time Latency Memory (%) Sensitivity (%) SortMeRNA 998615 1 min 39 s 1 8.5 99.861 299979 1 min 43 s 1 6.3 99.993 riboPicker 558607 18 min 45 s 11 6.8 55.860 123024 18 min 36 s 11 5.6 41.008 riboPicker* 999941 6 h 33 min 238 35.3 99.994 299999 9 h 314 34 99.999 BLASTN 995322 23 h 52 min 868 3.0 99.532 299978 18 h 35 min 649 1.4 99.992 Meta-RNA 983332 2 h 72 33.3 98.333 299980 1 h 57 min 68 12.9 99.993 rRNASelector 974118 1 h 47 min 64 17.4 97.411 299976 2 h 70 7 99.992 SSU-ALIGN 971221 6 h 49 min 248 0.1 97.122 299902 5 h 50 min 204 0.1 99.967 One million of MetaSim-simulated Illumina (100 nt) and 300 000 Roche 454 (200 nt) rRNA reads against a representative 16 S rRNA database of 7659 sequences. Table 2. Selectivity Software Illumina Roche 454 rRNA Run time Latency Memory (%) Selectivity (%) rRNA Run time Latency Memory (%) Selectivity (%) SortMeRNA 17 2 min 9 s 2 7.6 99.9983 13 3 min 42 s 1 10.2 99.9987 riboPicker 7 10 min 22 s 8 6.7 99.9993 3 29 min 45 s 9 16.8 99.9997 riboPicker* 158 56 min 37 s 42 35.1 99.9842 53 2 h 43 min 49 45.2 99.9947 BLASTN 33 14 min 22 s 11 0.3 99.9967 33 16 min 12 s 5 0.3 99.9967 Meta-RNA 11 1 min 33 s 1 0.1 99.9989 11 3 min 41 s 1 0.2 99.9989 rRNASelector 10 1 min 20 s 1 0.1 99.9990 11 3 min 21 s 1 0.2 99.9989 SSU-ALIGN 8 3 h 51 min 173 0.1 99.9992 11 10 h 30 min 188 0.1 99.9989 One million of MetaSim-simulated Illumina (100 nt) and 1 million Roche 454 (200 nt) non-rRNA reads against a representative 16S rRNA database of 7659 sequences. 3216 SortMeRNA Table 3. Runtime for the SRR106861 metatranscriptome of 105 873 parameter settings required by the program (see Section 2 of the reads against a 16S rRNA database of 7659 sequences Supplementary Data). Software rRNA Run time Latency Memory (%) ACKNOWLEDGEMENTS Project MAPPI is associated with the Tara Oceans expedition SortMeRNA 27046 34 s 1 4.8 (oceans.taraexpeditions.org), where the principal tasks involve riboPicker 11389 4 min 10 s 7 2.3 the development of new software for mapping and assembling riboPicker* 27195 39 min 3 s 69 30.8 metagenomic and metatranscriptomic data. BLASTN 27061 1 h 29 min 157 0.6 Meta-RNA 27111 10 min 33 s 18 1.8 Funding: This research was supported by the French National rRNASelector 27085 10 min 40 s 18 0.8 Agency for Research (grant ANR-2010-COSI-004). Conflict of Interest: none declared. Table 4. Runtime for the SRR013513 metatranscriptome of 207 368 reads against a 23S rRNA database of 2811 sequences REFERENCES Altschul,S. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, Software rRNA Run time Latency Memory (%) 403–410. Askitis,N. and Zobel,J. (2010) Redesigning the string hash table, burst trie, and bst SortMeRNA 94395 51 s 1 3.8 to exploit cache. ACM JEA, 15,7–8. Bomar,L. et al. (2011) Directed culturing of microorganisms using metatranscrip- riboPicker 71937 10 min 2 s 12 3.9 tomics. MBio, 2, e00012–11. riboPicker* 84152 36 min 27 s 43 5.5 Cannone,J. et al. (2002) The comparative RNA web (CRW) site: an online database BLASTN 94439 3 h 42 min 261 0.9 of comparative sequence and structure information for ribosomal, intron, and Meta-RNA 106698 1 h 33 min 109 4.8 other RNAs. BMC Bioinformatics, 3,15. rRNASelector 107900 1 h 36 min 113 3 Eddy,S. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763. Edgar,R. (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461. Gilbert,J. and Hughes,M. (2011) Gene expression profiling: metatranscriptomics. Methods Mol. Biol., 733, 195–205. Heinz,S. et al. (2002) Burst tries: a fast, efficient data structure for string keys. ACM Trans. Inf. Syst., 20, 192–223. Huang,Y. et al. (2009) Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics, 25, 1338–1340. Lee,J. et al. (2011) rRNASelector: a computer program for selecting ribosomal RNA encoding sequences from metagenomic and metatranscriptomic shotgun libraries. J. Microbiol., 49, 689–691. Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25, 1754–1760. Ludwig,W. et al. (2004) ARB: a software environment for sequence data. Nucleic Acids Res., 32, 1363–1371. Mears,J. et al. (2002) Modeling a minimal ribosome based on comparative sequence analysis. J. Mol. Biol., 321, 215–234. Mihov,S. and Schulz,K. (2004) Fast approximate search in large dictionaries. J. Comput. Ling., 30, 451–477. Fig. 6. Venn diagram for reads classified as 16S rRNA by BLASTN, Mitankin,P. (2005) Universal Levenshtein Automata. Building and Properties. Meta-RNA, SortMeRNA and riboPicker* in the SRR106861 Master’s Thesis. Bulgaria, Sofia University. Nawrocki,E. et al. (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics, metatranscriptome 25, 1335–1337. Nygaard,V. et al. (2005) Limitations of mRNA amplification from small-size cell samples. BMC Genomics, 6, 147. Pruesse,E. et al. (2007) Silva: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res., 35, 7188–7196. Richter,D. et al. (2008) A sequencing simulator for genomics and metagenomics. PLoS One, 3,e3373. Schmieder,R. et al. (2012) Identification and removal of ribosomal RNA sequences from metatranscriptomes. Bioinformatics, 28, 433–435. Schulz,K. and Mihov,S. (2002) Fast string correction with Levenshtein automata. IJDAR, 5, 67–85. Shi,Y. et al. (2009) Metatranscriptomics reveals unique microbial small RNAs in the ocean’s water column. Nature, 459, 266–269. Sinha,R. and Zobel,J. (2004) Cache-conscious sorting of large sets of strings with dynamic tries. ACM JEA, 9, Article No. 1.5, doi: 10.1145/1005813.1041517. Sinha,R. et al. (2006) Cache-efficient string sorting using copying. ACM JEA, 11, Fig. 7. Venn diagram for reads classified as 23S rRNA by BLASTN, Article No. 1.2, doi: 10.1145/1187436.1187439. Meta-RNA, SortMeRNA and riboPicker* in the SRR013513 Stewart,F. et al. (2011) Metatranscriptomics analysis of sulfur oxidation genes in the metatranscriptome endosymbiont of solemnya velum. Front. Microbiol., 2,134.

Journal

BioinformaticsOxford University Press

Published: Oct 15, 2012

There are no references for this article.