Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming

Jochen Blom; Tobias Jakobi; Daniel Doppmeier; Sebastian Jaenicke; Jörn Kalinowski; Jens Stoye; Alexander Goesmann

doi:10.1093/bioinformatics/btr151

Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming

Blom, Jochen; Jakobi, Tobias; Doppmeier, Daniel; Jaenicke, Sebastian; Kalinowski, Jörn; Stoye, Jens; Goesmann, Alexander 2011-03-30 00:00:00 Vol. 27 no. 10 2011, pages 1351–1358 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btr151 Sequence analysis Advance Access publication March 30, 2011 Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming 1,†,∗ 1,† 2 1 Jochen Blom , Tobias Jakobi , Daniel Doppmeier , Sebastian Jaenicke , 2 3,4 1,4,5 Jörn Kalinowski , Jens Stoye and Alexander Goesmann 1 2 3 Computational Genomics, Institute for Genome Research and Systems Biology, CeBiTec, Genome Informatics, 4 5 Faculty of Technology, Institute for Bioinformatics and Bioinformatics Resource Facility, CeBiTec, Bielefeld University, Bielefeld, Germany Associate Editor: Alfonso Valencia ABSTRACT increased. These techniques allow for cost-effective sequencing of complete libraries of different bacterial strains that may provide Motivation: The introduction of next-generation sequencing new insights e.g. into microevolution, but experimental data need techniques and especially the high-throughput systems Solexa to be processed before any conclusion can be drawn. Given an (Illumina Inc.) and SOLiD (ABI) made the mapping of short Illumina Solexa setup with a 36-bp read length, even one lane reads to reference sequences a standard application in modern on the ﬂow cell will sufﬁce for deep coverage sequencing of bioinformatics. Short-read alignment is needed for reference based several bacterial genomes. Typically, the generated reads are mapped re-sequencing of complete genomes as well as for gene expression on a closely related reference genome to perform a targeted analysis based on transcriptome sequencing. Several approaches re-sequencing, an in-depth SNP analysis, or to gain knowledge were developed during the last years allowing for a fast alignment about gene expression when performing transcriptomic experiments of short sequences to a given template. Methods available to date using cDNA sequencing. Different tools for mapping reads against use heuristic techniques to gain a speedup of the alignments, thereby reference genomes are available at this time including MAQ missing possible alignment positions. Furthermore, most approaches (Li et al., 2008), BWA (Li and Durbin, 2009), Bowtie (Langmead return only one best hit for every query sequence, thus losing the et al., 2009), PASS (Campagna et al., 2009), SHRiMP (Rumble potentially valuable information of alternative alignment positions et al., 2009) and SOAP2 (Li et al., 2009). MAQ was one of the with identical scores. ﬁrst mapping tools available and uses a hash data structure to keep Results: We developed SARUMAN (Semiglobal Alignment of short reads in memory while traversing the reference genome, resulting Reads Using CUDA and NeedleMAN-Wunsch), a mapping approach in a comparably low memory footprint. But it lacks support for the that returns all possible alignment positions of a read in a reference output of more than one mapping position per read and is not able to sequence under a given error threshold, together with one optimal compete with more recent approaches in terms of speed (Langmead alignment for each of these positions. Alignments are computed in et al., 2009). BWA, Bowtie and SOAP2 are indexing the complete parallel on graphics hardware, facilitating an considerable speedup reference genome using a Burrows-Wheeler-Transformation (BWT) of this normally time-consuming step. Combining our ﬁlter algorithm (Burrows and Wheeler, 1994) and process all the reads sequentially, with CUDA-accelerated alignments, we were able to align reads resulting in a considerable speedup of the mapping process. PASS to microbial genomes in time comparable or even faster than all and SHRiMP also perform an indexing of the reference sequence, published approaches, while still providing an exact, complete and but use a spaced seed approach (Califano and Rigoutsos, 2002) optimal result. At the same time, SARUMAN runs on every standard instead of BWT. All mentioned approaches except SHRiMP are Linux PC with a CUDA-compatible graphics accelerator. heuristic and do not guarantee the mapping of all possible reads. Availability: http://www.cebitec.uni-bielefeld.de/brf/saruman/saruman.html. Especially reads that show insertions or deletions (indels) compared Contact: jblom@cebitec.uni-bielefeld.de to the reference sequence are often missed. Bowtie is unable to Supplementary information: Supplementary data are available at identify such alignments by design, while SOAP2 only supports Bioinformatics online. gapped alignments in paired-end mode. As high accuracy and Received on October 27, 2010; revised on March 15, 2011; accepted conﬁdence of short-read alignment is highly desirable, especially on March 22, 2011 in the analysis of single nucleotide polymorphisms (SNPs) or small-scale structural variations, we decided to design an exact short-read alignment approach that identiﬁes all match positions 1 INTRODUCTION for each read under a given error tolerance, and generates one 1.1 Challenges of next-generation sequencing optimal alignment for each of these positions without any heuristic. To obtain an adequate runtime even for large-scale, whole-genome Together with the advent of new high-throughput sequencing applications, we employ the massively parallel compute power of technologies, the amount of generated biological data steadily modern graphics adapters [Graphic Processing Unit (GPUs)]. A qgram-based (Jokinen and Ukkonen, 1991) ﬁlter algorithm was To whom correspondence should be addressed. designed to ﬁnd auspicious alignment positions of short reads The authors wish it to be known that, in their opinion, the ﬁrst two authors should be regarded as joint First Authors. within the reference sequence. After the localization of possible © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com 1351 [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1351 1351–1358 J.Blom et al. mapping positions, we use a Needleman–Wunsch (Needleman and more general Application Programming Interface (API). Thus, the Wunsch, 1970) algorithm to compute optimal alignments of the focus of development shifts from ﬁtting a given algorithm to candidate sequences to the reference sequence. The alignments are OpenGL operations to the development of the best implementation. processed in parallel on a NVIDIA graphics card using the NVIDIA The CUDA platform developed by the NVIDIA cooperation CUDA (Compute Uniﬁed Device Architecture) framework. This (http://www.nvidia.com/object/cuda_home.html) and ATI’s Stream combination of a ﬁlter step with the parallel computation of optimal framework (http://www.amd.com/stream) are novel approaches to alignments on graphics hardware allows us to compute an exact and use the huge computational power of modern graphics cards not only complete mapping of all reads to the reference in a time comparable for games but also for scientiﬁc applications. Contemporary graphics to existing heuristic approaches. processing units are built as massively parallel computational devices, optimized for ﬂoating point operations. Compared to universal central processing units (CPUs) used in every computer, 1.2 Parallelization in biosequence analysis GPUs are specialized for parallel execution of many small tasks In computational biology, huge amounts of data need to be processed while CPUs are designed to execute fewer large tasks sequentially. and analyzed, therefore suggesting the exploration and evaluation As such, GPUs are also well suited for the highly parallel of new computational technologies like parallel programming. One computation of small-scale alignments. basic idea behind parallel programming is to divide a problem into smaller, easier to solve subproblems, a technique known 2 METHODS as ‘divide and conquer’. Another approach is to solve a huge number of small independent tasks by parallelization such that 2.1 Problem each problem can be solved on a different processing unit, either The short-read matching problem can be deﬁned as follows: we have given a cores of one computer or a compute cluster. Once all calculations short sequencing read f of length |f |= m, a (in most cases genomic) reference are ﬁnished, they are combined into a solution of the original sequence g of length |g|= n, and an error threshold e ≥ 0 deﬁned by the user. problem. While server systems with 32 and more CPU cores are Then we want to calculate all starting positions i in g, such that there exists available today and can be used to efﬁciently speedup multithreaded an alignment of f and a preﬁx of g[i...] with at most e errors (mismatches software, they still pose a signiﬁcant capital investment. Therefore, and/or indels). The algorithm shall be capable to export an optimal alignment specialized hardware for parallel processing has been employed, for every such match position. This problem has been studied widely in the past, and most solutions are such as Celeras GeneMatch™ASIC (application-speciﬁc integrated based on one of the two following principles, or combinations thereof: the circuit) approach which uses specialized processors to accelerate qgram lemma states that two strings P and S with an edit distance of e share several bioinformatics algorithms. A few years later, TimeLogic at least t qgrams, that is substrings of length q, where t = max(|P|,|S|) −q + developed Field Programmable Gate Arrays (FPGAs) to run adapted 1 −q ·e. versions of the Smith–Waterman, BLAST and HMMer software That means that every error may destroy up to q ·e overlapping qgrams. with signiﬁcant performance gains. Unfortunately, the prices for For non-overlapping qgrams, one error can destroy only the qgram in which such optimized special purpose hardware together with appropriate it is located, which results in the applicability of the pigeonhole principle. licenses lie in the same expensive investment range as large servers. The pigeonhole principle states that, if n objects (errors) are to be allocated A more cost-efﬁcient possibility is the use of existing hardware to m containers (segments), then at least one container must hold no fewer which can be employed for scientiﬁc computing through different than objects. Similarly, at least one container must hold no more than n n objects. If n < m, it follows that = 0, which means that at least one frameworks. The MMX technology introduced by Intel in 1997 m m container (segment) has to be empty (free of errors). Moreover, if n < m this is a SIMD (single input multiple data) approach which allows for holds for at least m −n segments. In our algorithm, presented in the following parallel processing of large amounts of data with specialized CPU section, we ﬁrst use a 2-fold application of the pigeonhole principle to ﬁnd instructions. The MMX instruction set as well as its successor regions of interest in the genome, before we apply parameters derived from SSE (Streaming SIMD Extensions) have been used to accelerate the qgram lemma to make our ﬁnal selection of positions to pass the ﬁlter. implementations (Farrar, 2007; Rognes and Seeberg, 2000) of the Smith–Waterman algorithm (Smith and Waterman, 1981). In 2.2 Solution 2008, also a ﬁrst attempt on non-PC hardware has been published. Assumptions and deﬁnitions: as a basic assumption, we require a minimal SWPS3 (Szalkowski et al., 2008) employs the Playstation 3’s cell length of reads in relation to the given error threshold: m > e + 1. Given this processor to speedup an adapted version of the Smith–Waterman assumption, we calculate the length of the qgrams for our ﬁlter algorithm as algorithm. Another trend in the recent past is the use of modern follows. We deﬁne the length q of the qgrams as the largest value below e+1 graphics adapters in scientiﬁc computation due to their immense such that processing power which still increases at a much faster rate than the processing power of general purpose processors. While there m q if (e + 1)q < m, q := , q := are several implementations available (Liu et al., 2009; Manavski e + 1 q − 1 otherwise. and Valle, 2008) that focus on the alignment of one query sequence This guarantees that a read f can be split into e + 1 intervals with an to a database of reference sequences, an algorithmic adaptation of additional non-empty remainder of length R := m − (e + 1)q. massively parallel pairwise alignments, as it could be used for short- Given the calculated qgram length, we create an index I of the starting read alignments, is still missing. First approaches using graphics positions of qgrams in sequence g, such that for each possible qgram x, I (x) cards as hardware accelerators for bioinformatics algorithms (Liu contains the starting positions of x in g. et al., 2006) relied on OpenGL, resulting in a difﬁcult and In the following step of the ﬁlter algorithm, every read is segmented into limited implementation. Today, frameworks simplify software pieces of length q (Fig. 1). We choose a set S of c = segments S = s ...s 1 c development by hiding the layer of 3D programming behind a of length q from f , such that for i = 1,...,c : s = f [(i − 1) ∗q + 1...i ∗q].As in [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1352 1351–1358 GPU short read alignment Fig. 2. A matching segment s ∈ S where no other segment of S has a match in correct distance. Segments of set K are checked in this case. If none of the Fig. 1. The two sets of segments S and K . The two sets are shifted by the segments k ...k matches, the read cannot match at the respective genome i c distance of R. position. most cases q = , it can easily be shown that c converges against e + 1 e+1 be correct, and so must be the ﬁrst q −R positions of k . The remaining R for increasing read length m. We additionally choose a second set K of c positions of k must also be correct because all errors are within segments segments K = k ...k of length q, such that for i = 1,...,c : k = f [(i − 1) ∗q + 1 c i c s ...s . The last R positions of k are not part of one of these segments, so 1 +R...i ∗q +R], a set shifted by the remainder R. 1 c c any error within them would be the e + 1st one. If k does not match, it can The simple version of our algorithm allows for mismatches only, not be excluded that read f can be aligned to the reference sequence g with a for insertion or deletions. In this case, it can be shown by the following maximum of e errors at the actual position. See Figure 2 as illustration of observations that a matching read f must have at least two matching segments this matching process. from the sets S and/or K . There are two cases, in the ﬁrst case two segments Insertions and Deletions: it is possible to allow indels in the matching of set S will match, in the second case one segment from S and one segment process by checking every segment not only on one position, but also on from K will match. The term ‘matching segment’ in the following indicates several positions by shifting the segment to the left or right by up to t that a starting position for the respective segment can be found in the genome positions, where t = e − (i − 1) for an initial matching segment s as at least via exact occurrence within index I . The ﬁrst segment found in the index is i i − 1 errors have already occurred in the previous i − 1 segments. Hence, used as seed, the second ‘matching’ segment has to be listed in the index in we conclude that if there are only e errors in f , it is always possible to appropriate distance to this seed. The algorithm is based on the assumption match (i) two segments of S or (ii) one segment of S and one segment that S and K comprise c = e + 1 segments. In cases where c > e + 1, we know of K . Algorithm 1 identiﬁes reads as candidates for a Needleman–Wunsch by the pigeonhole principle that at least two segments of S must match, which alignment by checking for this condition. fulﬁlls our ﬁlter criterion directly. The segment set K is not needed in such Alignment phase: as soon as a second qgram hit has been found for a cases, and all algorithmic steps after the ﬁrst can be skipped. given start index B, the alignment of f to g[B −e,B +m +e] is enqueued Filter algorithm: by the pigeonhole principle, we know that one segment for alignment and the rest of the ﬁlter phase for this value of B can be of set S ‘must’ match, as we allow e errors and have e + 1 segments. We can identify all match positions of segments s ∈ S for the given read in the skipped. To speed up this alignment procedure in practice, the veriﬁcation by qgram index I and use them as starting points for the ﬁlter algorithm. alignment is computed on graphics hardware. Due to the parallel computation of alignments on graphics hardware, a huge number of possible alignment (1) For every segment s matching at a position b in the genome g positions is collected before being submitted to the graphics card. The actual we check in I if there is another segment s ∈ S,j > i, that starts at number depends on the amount of memory available on the graphics card the expected position b + (j −i) ∗q. If we ﬁnd such a segment, we (VRAM). The alignment of the read sequence f to the possibly matching identiﬁed the two matching segments we expect. part g[B −e,B +m +e] of the reference sequence identiﬁed in the ﬁlter step is computed by a Needleman–Wunsch algorithm. On both sides of this template In the case that only one segment s ∈ S matches, there has to be exactly one sequence, e, additional bases are extracted from the reference sequence to error in every remaining segment s ...s ,s ...s . Otherwise, a second 1 i−1 i+1 c allow for indels in an end-gap free alignment. We use unit edit costs for the segment of set S would match. We can infer that not more than c −i errors are alignment matrix. As soon as the error threshold is exceeded in a complete remaining in the segments to the right of read s , but due to the overlapping column of the distance matrix, the calculation is stopped, the read is discarded construction of our segment sets we have one segment more of K to the and backtracing is omitted. If the complete distance matrix is computed, an right of read s to check. Hence, if read f is a possible hit, one of these optimal alignment path is calculated in the backtracing step and the read is c − (i − 1) segments k ...k has to match, and we can start the checks for set i c reported as hit together with its optimal alignment. SARUMAN does not use K at position k . banded alignments, yet, this is planned for future releases. (2) Check if segment k overlapping s matches. If it does, we have i i successfully matched 2 qgrams and passed the ﬁlter. 3 IMPLEMENTATION If segment k does not match, we know that k overlaps s on q −R positions. i i i These positions are free from errors (as s matched without errors). Thus, the The feasibility of sequence alignments using GPU hardware was error causing k not to match must have been on the last R positions of k i i demonstrated by different tools like SW-Cuda (Manavski and which are the ﬁrst R positions of s . As there is exactly one error in every i+1 Valle, 2008) or CUDASW++ (Liu et al., 2009). But, compared segment of S, we can conclude that the last q −R positions of s are free of i+1 to our solution, existing implementations focused on the search errors, which are the ﬁrst q −R positions of k .Soif k does not match, i+1 i+1 for similar sequences in a huge set of other sequences, which the next error is in the ﬁrst R positions of s and so on. i+2 corresponds to a BLAST-like use. In contrast, SARUMAN searches (3) Iteratively check all remaining segments k ...k until one segment i+1 c for local alignments of millions of short sequences against one long matches and the read f passes the ﬁlter or until k is reached. reference sequence. Employing the ﬁltering algorithm described in the previous section, all possible alignment positions in the reference If we reach k , that means that k did not match, we know that the c c−1 genome can be identiﬁed, and thereby the problem is reduced to error of s was in the ﬁrst R positions. So the last q −R positions of s must c c [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1353 1351–1358 J.Blom et al. Algorithm 1 (Mapper) on demand. The complete workﬂow of SARUMAN is illustrated in Require: A read f of length m Figure 3. Require: A reference sequence g of length n 1. ## Set of candidate positions ## 3.1 Mapping phase 2. C ←∅ The memory usage of the qgram index depends on the qgram length 3. ## q calculated as described in Section 2.2 ## and the number of replicates per qgram number, but is mainly 4. c ← dependent on the size of the reference genome. To process large 5. R ← m −c ·q reference genomes with limited resources, the reference sequence is 6. for i ← 1,...,c do divided into several chunks that are processed iteratively. The size 7. u ← (i − 1) ·q + 1 of a sequence that can be processed in one iteration is restricted 8. ## initial matching segments ## by the available memory. Using this technique, it is possible for 9. for each b in I (s ) do SARUMAN to run on computers with small amounts of RAM by 10. B ← b −u dividing the qgram index into chunks perfectly ﬁtting into available 11. ## check remaining segments of S ## memory. Our tests show that a standard computer with 4 GB of 12. for j ← i + 1,...,c do RAM and a recent dual core CPU is able to read and process even 13. v ← (j − 1) ·q + 1 large bacterial genomes like Sorangium cellulosum (13 033 779 bp) 14. ## shift by up to e − (i − 1) positions ## without any difﬁculties. However, this approach is not feasible 15. for t ←−(e − (i − 1)),...,+(e − (i − 1)) do for large eukaryotic genomes as the number of needed iterations 16. ## if distance ﬁts, add to alignment queue ## would be too high. Supported read input formats are FASTA as well 17. if B +v +t ∈ I (s ) then as FASTQ. While the reads are stored in memory, all sequences 18. add B to C containing more than e ambiguous bases (represented as N) are 19. end if ﬁltered out, due to the fact that an N would nevertheless be treated 20. end for as a mismatch by SARUMAN. Subsequently, perfect matches are 21. end for determined by exact text matching and exported as hits. Preceding 22. if not B ∈ C then the actual ﬁltering step, all reads are preprocessed. During this 23. ## check segments of K ## phase, all located start positions of the 2 ∗c qgrams of a read in 24. for j ← i,...,c do the reference genome are extracted from the qgram hash index. 25. v ← (j − 1) ·q + 1 +R This list of positions is stored in an auxiliary data structure in order 26. for t ←−e − (j − 1),...,e − (j − 1) do to minimize access to the hash index in later stages and therefore 27. if B +v +t ∈ I (k ) then speedup the following ﬁlter step. Reads with perfect hits on the 28. add B to C reference genome are still further processed as there may exist 29. end if imperfect hits elsewhere on the reference, but for such reads the 30. end for starting positions of qgrams representing the perfect hit are removed 31. end for from the auxiliary data structure. After the ﬁrst steps, each read is 32. end if mapped onto the reference genome using the algorithm described 33. end for in Section 2, whereas a read is only mapped once to a given start 34. end for position to avoid redundancy. While a combination of ﬁrst and last 35. ## Send candidates to alignment ## qgram (e.g. s and k ) exactly determines start and end position of 36. for each B ∈ C do the read, two ‘inner’ qgrams (e.g. s and k ) would result in a 2 c−1 37. align f to g[B −e,B +m +e] and report hit if successful longer alignment in order to ﬁnd the correct start and stop position. 38. end for Start positions for possible mappings are stored and transferred to the CUDA module in a later stage. In order to not only exploit the parallel architecture of graphics cards but also the availability global alignments of a read sequence with a short substring of a of multicore CPUs, the matching phase uses two threads. The ﬁrst reference genome. Thus, compared to SW-Cuda, SARUMAN does thread handles mapping on the sense strand, whereas the second not align sequences against a database of templates, but is designed thread processes mapping on the antisense strand. In contrast to as an alignment application to perform thousands of short pairwise many other approaches, which employ a 2-bit encoding for the four global sequence alignments in parallel. DNA letters, SARUMAN is able to handle Ns in reads as well as in The CUDA API is an extension to the C programming language. the reference genome. Since many genomes contain a small number Thus, the mapping part of SARUMAN was implemented in the of bases with unknown identity, it is of advantage to treat these C programming language to simplify the integration of CUDA bases as N. Replacing these positions with random or ﬁxed bases code. Our software is divided into two consecutive phases, namely to maintain the 2-bit encoding may lead to wrong and in case of mapping and aligning. Phase one, the creation of the qgram index random replacements even to irreproducible results. together with the following mapping of reads through qgrams, is completely processed on the host computer. During phase two, 3.2 Alignment phase CUDA is used to compute the edit distance for candidate hits on the graphics card using a modiﬁed Needleman–Wunsch algorithm. If the To efﬁciently use CUDA, it is of great advantage to understand the computed edit distance of read and genome lies below a given error underlying hardware of CUDA capable graphics adapters. A GPU threshold, the optimal alignment is computed in the backtracing step consists of a variable number of multiprocessors reaching from 1 in [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1354 1351–1358 GPU short read alignment Fig. 3. Program ﬂow within SARUMAN. The three different application phases have been highlighted in different shades of gray. After reading the reference genome, a qgram index is created, followed by the preﬁltering of perfect hits. Consecutively, all reads are processed by the ﬁltering algorithm which reports all candidate hits and copies all necessary data to the GPU. Edit distances for candidates are computed and alignments for promising hits calculated, which in the last step are postprocessed and printed out. threadblocks. Each threadblock is a collection of a given number of threads. A threadblock must be executable in any order and therefore must not have any dependencies on other blocks of the same grid. Once a sufﬁcient number of candidate hits has been found by the ﬁlter algorithm, all necessary data for performing the alignments is collected. Read and genome sequences are stored together with auxiliary data structures. Afterwards, all required data for the alignment phase is copied to the GPU as one large unit in order to minimize I/O overhead. The maximal number of alignments ﬁtting in the GPU memory heavily depends on read length. SARUMAN automatically calculates a save value for the number of parallel alignments. As the memory of the graphics Fig. 4. CUDA hardware layout and memory organization. Each adapter is also a limiting factor, SARUMAN does not use the quality multiprocessor consists of eight processors, each with an own set of information of reads as this would double up the memory usage. registers. Shared, constant and texture memory can be used by each of the eight processors, while device memory is globally accessible between For datasets with sufﬁcient coverage and quality, we recommend multiprocessors. to simply remove all reads with low-quality bases. For 36 bp reads, a value of 200 000 alignments (100 000 for each mapping thread entry level graphics adapters up to 60 multiprocessors in high end and direction) can be achieved on a standard GPU with 1 GB video cards. Each of these multiprocessors has access to registers of of graphics memory [Video RAM (VRAM)]. Once all data of 8 or 16 kb size and is divided into 8 small processors. The available the candidate hits has been copied to the GPU for each pair of registers are divided and equally assigned to processors. This small genome and read sequence, the edit distance is computed using the amount of writable memory should be used for data processed in the desired values for match and mismatch positions. By comparing the currently active thread while texture memory and constant memory distance with the supplied maximal error rate, all candidates with are read only and can be used to prefetch data from the much slower values above this threshold are discarded. Complete alignments are device memory. An overview of the CUDA hardware and memory only computed in a second backtracing step for candidates with model is given in Figure 4. Implementing code for the execution on a tolerable edit distance. Typically, the alignment phase for one a GPU is very similar to standard application development. Some batch takes only a few seconds to complete, including the whole additional keywords extending the C programming language are process of copying raw data to and processed alignments from the used to deﬁne on which device a function should be executed. A GPU. Before any output is written, alignments are postprocessed by function on the GPU is mapped to one thread on one processor clipping gaps at the start and end. For each possible start position located on the graphics card, whereas one function can and should of each read, the optimal alignment is reported, in contrast to other be executed on many different datasets in parallel. This execution available tools which only deliver a ﬁxed number of positions or do scheme is called SIMT (single input multiple threads) due to its not even guarantee to report any optimal alignment. SARUMAN relation to the similar SIMD scheme used in classical parallel produces tab separated output by default which includes start programming. Parallel programming with CUDA is transparent and and stop position of the mapping together with the edit distance and the complete alignment. The package includes an easy to (within NVIDIA products) device independent for developers and use conversion tool to generate the widely used SAM alignment users. Launch of a CUDA application is controlled using only a format. The SARUMAN software is available for download at few parameters deﬁning the total number of threads. Those threads are organized hierarchically into grids, which themselves consist of http://www.cebitec.uni-bielefeld.de/brf/saruman/saruman.html. [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1355 1351–1358 J.Blom et al. 4 RESULTS 100 bp reads are provided in the Supplementary Material. The goal was to map all artiﬁcial reads on the genome at the exact position 4.1 Evaluation methods without missing any mappings. As expected, using SARUMAN we The evaluation was accomplished on a standard desktop PC with an were able to map all artiﬁcial reads to the genome. Furthermore, Intel Core2Duo E8400 3 GHz dual core processor with 8 GB DDR2 nearly all reads were mapped to the correct position in the reference RAM and a GeForce GTX280 graphics card with 1 GB of VRAM. genome. In some rare cases (ca. 0.45%), SARUMAN returned All programs were run multithreaded on both processor cores, the optimal alignments that are shifted from the reads original position operating system was Ubuntu Linux. We used four datasets for the by up to e bases (Fig. 5). Such cases cannot be resolved, this performance evaluation, three synthetic read sets of roughly 18 mio. is a general problem of the edit cost function and not a ﬂaw of reads generated from the Escherichia coli K12 MG1655 genome SARUMAN. Additionally, SARUMAN reported a large number (GenBank accession NC_000913), and a real dataset with data from of other matches on different sites of the genome. Among them one lane of a re-sequencing run of Corynebacterium glutamicum were six matches that placed a read to an alternative position in ATCC 13032 (GenBank Accession BX927147) using the Illumina the reference genome with a better score than the alignment to the Solexa GAII sequencer, comprising 18 161 299 reads of 35 bp original position. In this case, the incorporation of errors led to a read length. The settings for all programs used in the comparisons were that just by chance ﬁts better to a wrong genomic position. While adjusted to make the run parameters as comparable as possible. alternative hits can be identiﬁed as misplaced in synthetic data, this To achieve this, we allowed two mismatches/indels for each read behavior is preferable for real data as one can not determine the and set all programs to support multithreading. Furthermore, we correct mapping position among several equally good locations. allowed gapped alignment of reads where possible and adjusted the The evaluation shows a clear separation of the programs into two alignment scoring matrix to simple unit costs. Detailed settings for classes, depending on their ability to handle gaps. Neither SOAP2 all tool runs are given in the Supplementary Material. nor Bowtie are able to perform gapped alignments, both tools were 4.2 Performance evaluation In order to prove the exactness and completeness of the presented approach and to measure the discrepancy between exact and heuristic implementations, we used the synthetic read sets described in Section 4.1. Synthetic reads were generated with 36, 75 and 100 bp length in both directions with up to two errors of different types, i.e. mismatches, insertions, deletions and combinations thereof. About three million of the artiﬁcial reads contained indels. These reads were mapped to the original source genome Escherichia coli K12 Fig. 5. Example for an ambiguous mapping. The read has a deletion in a MG1655. The dataset is available on the project homepage. Table 1 poly-T region compared to the reference. This results in a shift of the mapping compares the mapping ratios of the different tools together with their position by one base as it is not possible to determine in which position the respective running time for 75 bp reads. Data for 36 bp reads and deletion event happened. Table 1. Sensitivity evaluation with an artiﬁcial dataset of 17.980.142 reads (75 bp) generated from Escherichia coli K12 MG1655 with up to two errors SARUMAN SOAP2 Bowtie BWA SHRiMP PASS Reference Mapped 17 980 142 15 142 908 15 123 838 17 746 484 17 980 142 16 873 044 17 980 142 Not mapped 0 2 837 234 2 856 304 233 658 0 1 107 098 0 Perfect 4 999 944 4 999 942 4 999 944 4 999 944 4 999 944 4 999 944 4 999 944 With errors 12 980 198 10 142 966 10 123 894 12 746 540 12 980 198 11 873 100 12 980 198 1 mismatch 4 999 908 4 999 906 4 999 908 4 999 908 4 999 908 4 999 908 4 999 908 2 mismatches 4 999 936 4 999 936 4 999 936 4 999 936 4 999 936 4 999 936 4 999 936 1 Insertion 499 992 34 648 29 900 489 788 499 992 478 754 499 992 2 Insertions 499 998 1243 496 405 563 499 998 496 499 998 1 Deletion 493 560 46 740 46 740 491 454 493 560 475 916 493 560 2 Deletion 493 354 4828 4828 433 605 493 354 452 714 493 354 1Ins,&1Mism, 500 000 21 374 12 308 458 957 500 000 18 334 500 000 1Del,&1Mism, 493 450 34 291 29 778 467 329 493 450 447 042 493 450 MCP 17 899 092 15 070 286 15 061 178 17 432 842 17 785 567 16 577 180 – BMCP 17 899 086 15 070 286 15 061 178 17 430 632 17 784 713 16 577 173 – Total alignments 19 971 674 16 425 886 16 542 168 18 022 317 19 423 932 18 447 418 – Runtime (min) 12:03 06:40 18:56 15:09 95:06 27:42 – RAM usage (kb) 3 375 236 702 964 14 420 117 172 1 313 944 399 278 – MCP (Mapped to Correct Position) denotes the number of mapped reads that had a match to their original position. BMCP (Best Match at Correct Position) denotes the number of reads where the best match was located at the correct position. [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1356 1351–1358 GPU short read alignment Table 2. In-depth analysis of reported mappings of 18 161 299 C.glutamicum re-sequencing reads with an error threshold of two SARUMAN SOAP2 Bowtie BWA SHRiMP Pass Mapped 18 025 584 18 001 767 17 999 961 18 023 109 16 558 248 17 859 475 Not mapped 135 715 159 532 161 338 138 190 1 603 051 301 824 1 alignment 17 406 040 17 467 485 17 388 188 17 732 158 16 057 777 17 284 434 2 alignments 184 224 153 627 181 476 171 385 144 175 167801 3 alignments 72 679 44 237 73 534 58 298 42 699 57630 ≥ 4 alignments 362 641 336 418 356 763 61 268 313 597 349610 Alignments total 20 006 760 19 755 348 19 936 446 18 494 894 17 991 074 19 713 095 Runtime (min) 6:08 9:29 19:40 8:30 32:03 23:59 Matches reported by SOAP2 include reads containing Ns, which are treated as mismatch by other programs. Table 3. Sensitivity of the ﬁlter algorithm and performance gain of the not able to map more than a small portion of sequences generated GPU implementation, tested on E.coli artiﬁcial data with indels to their correct position by using mismatches instead of gaps. The second group of tools consists of BWA, PASS, SHRiMP Read length 36 bp 75 bp 100 bp and SARUMAN, which are all capable of aligning reads containing gaps, although PASS shows a poor mapping ratio for reads with Candidates 23 0405 03 15 405 284 14 991 106 more than one indel. BWA shows a very good performance, but Aligned 14 918 066 14 553 824 14 451 680 still misses more than 200 000 reads. SHRIMP shows a complete Filter step (min) 5:26 4:53 4:49 mapping of all reads, but places less reads than SARUMAN to Alignment on GPU (s) 42 430 1046 the correct position. SARUMAN shows also the best performance Alignment on CPU (s) 1085 3282 5814 on the real C.glutamicum dataset presented in Table 2. As this GPU:CPU 1:25.83 1:7.63 1:5.55 dataset originates from a re-sequencing of the identical strain, only a very low number of errors is expected. Therefore, the differences The upper part shows the runtime of the ﬁlter step, the alignment candidates passing between the tools are quite small. Nevertheless, SARUMAN shows the ﬁlter and the number of reads that where successfully aligned. The lower part the highest mapping ratio of all tools, and furthermore produces the compares the runtime of the alignment step on a GPU versus the runtime on a CPU. highest number of valid alignments. The dataset was evaluated with an error threshold of two: data for one and three allowed errors are candidates resulting from the ﬁlter algorithm with the number of given in the Supplementary Material. alignments successfully veriﬁed by the alignment step. The results Besides sensitivity, another major requirement of short-read are shown in Table 3. The efﬁciency of the ﬁlter algorithm is slightly alignment approaches is the performance in terms of running time. increasing with longer read lengths due to the bigger qgram size. The The runtime evaluation of the artiﬁcial reads shows that SOAP performance of the alignment step is decreasing as the alignments is nearly double as fast as SARUMAN, but maps signiﬁcantly of the longer reads are more memory consuming on the graphics less reads. All other tools are comparably fast (BWA, Bowtie) adapter. But even for 100 bp reads, the GPU implementation shows or considerably slower (SHRIMP, PASS) than our approach. For a more than 5-fold speedup compared to the CPU implementation the C.glutamicum re-sequencing dataset, SARUMAN performs of the same algorithm. best with a running time of 06:08 min, while SOAP and BWA take slightly longer with 9:29 and 8:30 min, respectively. Bowtie, 5 DISCUSSION PASS and SHRiMP are three to ﬁve times slower. Considering all calculated datasets, SOAP2 shows the best performance of all For short-read alignments against microbial genomes, SARUMAN compared approaches, being the only algorithm that is faster than has proven to be as fast or even faster than other available approaches SARUMAN in most cases. BWA shows a runtime comparable to like SOAP2, Bowtie, BWA, SHRiMP and Pass (Tables 1 and 2). SARUMAN, while Bowtie is highly dependent on the used error In contrast to heuristic approaches, SARUMAN guarantees to ﬁnd ratio. For low error rates, it can compete with other approaches, optimal alignments for all possible alignment positions. Compared for higher error rates the runtime rises signiﬁcantly. SHRiMP and to SHRiMP, another exact aproach, SARUMAN shows a better PASS are the slowest of the compared approaches for all evaluated runtime performance. Thus, it is the ﬁrst non-heuristic approach datasets. In summary, SARUMAN is the only approach that provides providing full indel support in competitive computing time. The exact and complete results, while still being nearly as fast or even completeness of the SARUMAN mapping was demonstrated using faster than all compared approaches. Thereby, it shows the best a constructed dataset, where all reads could be mapped correctly overall performance for short-read alignment against prokaryotic (Table 1). SARUMAN is designed as a short-read alignment genomes. approach; nonetheless, it works properly and with reasonable running times for reads up to lengths of 125 bp. The qgram index used in the ﬁlter step has a memory usage that in worst case increases 4.3 Performance of ﬁlter and alignment components linearly with the size of the reference genome. Furthermore, as The performance of the ﬁlter algorithm mainly depends on the SARUMAN provides a proper handling of Ns in the reference read length and the qgram length. We compared the number of sequence as well as in the reads, it cannot use two bit encoding [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1357 1351–1358 J.Blom et al. of nucleotides, which further increases the memory footprint. On REFERENCES a standard desktop PC with 4 GB RAM, it can map reads to all Burrows,M. and Wheeler,D.J. (1994) A block-sorting lossless data compression bacterial genomes in a single iteration, with 8 GB RAM fungal algorithm. Technical Report 124, Digital Systems Research Center, Palo Alto, California. genomes of up to 50 Mb in size were processed in a single run. Califano,A. and Rigoutsos,I. (2002) FLASH: A fast look-up algorithm for string If a reference genome is too large for the available memory, it is homology. In Computer Vision and Pattern Recognition, 1993. Proceedings automatically split into two or more parts as described in Section 3. CVPR’93., 1993 IEEE Computer Society Conference on. IEEE, New York, NY, Of course the running time of the algorithm has to be multiplied USA, pp. 353–359. by the number of iterations that are needed. Due to this limitation, Campagna,D. et al. (2009) PASS: a program to align short sequences. Bioinformatics, 25, 967. we propose SARUMAN as dedicated solution for read mapping Farrar,M. (2007) Striped Smith-Waterman speeds database searches six times over other on microbial reference genomes or other reference sequences of SIMD implementations. Bioinformatics, 23, 156–161. comparable size. SARUMAN does not natively support paired Jokinen,P. and Ukkonen,E. (1991) Two algorithms for approximate string matching in end sequencing data, but as all possible alignments are returned static texts. Lecture Notes in Computer Science, 520/1991, 240–248. Langmead,B. et al. (2009) Ultrafast and memory-efﬁcient alignment of short DNA this can be handled by post-processing the results to ﬂag read sequences to the human genome. Genome Biol., 10, R25. pairs as matching either in compatible distance or not. A post- Liu,W. et al. (2006) Bio-sequence database scanning on a GPU. In Parallel and processing tool is under development. Several further improvements Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, IEEE, to SARUMAN are planned for the future. One idea is to combine Rhodes Island, Greece, p. 8. the CUDA alignment module with other ﬁlter algorithms. These Liu,Y. et al. (2009) CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res. Notes, 2, 73. may be heuristic solutions to establish an even faster short-read Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with Burrows- alignment, or algorithms based on compression techniques like the Wheeler transform. Bioinformatics, 25, 1754–1760. BWT to reduce the memory usage and make the approach more Li,H. et al. (2008) Mapping short DNA sequencing reads and calling variants using applicable to large reference genomes. The highest potential for a mapping quality scores. Genome Res., 18, 1851–1858. Li,R. et al. (2009) SOAP2: an improved ultrafast tool for short read alignment. further speedup has the processing of the ﬁltering algorithm on the Bioinformatics, 25, 1966–1967. graphics adapter. Unfortunately, this is not yet foreseeable as it is Manavski,S.A. and Valle,G. (2008) CUDA compatible GPU cards as efﬁcient hardware quite complicated to handle the reference sequence data with the accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9 limited amount of memory available on the graphics cards, but it (Suppl. 2), S10. may become feasible with future generations of graphics hardware. Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453. Another planned development is a native support for color space Rognes,T. and Seeberg,E. (2000) Six-fold speed-up of Smith-Waterman sequence data as generated by the SOLiD sequencing system. database searches using parallel processing on common microprocessors. Bioinformatics, 16, 699–706. Rumble,S. et al. (2009) SHRiMP: accurate mapping of short color-space reads. PLoS ACKNOWLEDGEMENTS Comput. Biol., 5, e1000386. Smith,T.F. and Waterman,M.S. (1981) Identiﬁcation of common molecular The authors wish to thank the Bioinformatics Resource Facility subsequences. J. Mol. Biol., 147, 195–197. system administrators for expert technical support. Szalkowski,A. et al. (2008) SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2. BMC Res. Notes, 1, 107. Funding: German Federal Ministry of Education and Research (grant 0315599B ‘GenoMik-Transfer’) to J.B. and S.J. Conﬂict of Interest: none declared. [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1358 1351–1358 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/exact-and-complete-short-read-alignment-to-microbial-genomes-using-4j9qR6VD0w

Loading next page...

References (37)

Farrar (2007)
Striped Smith-Waterman speeds database searches six times over other SIMD implementations
Bioinformatics, 23
Langmead (2009)
R25
Genome Biol., 10
Li (2008)
1851
Genome Res., 18
S. Needleman, C. Wunsch (1970)
A general method applicable to the search for similarities in the amino acid sequence of two proteins.
Journal of molecular biology, 48 3
B. Langmead, C. Trapnell, Mihai Pop, S. Salzberg (2009)
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
Genome Biology, 10
Torbjørn Rognes, E. Seeberg (2000)
Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors
Bioinformatics, 16 8
A. Califano, I. Rigoutsos (1993)
FLASH: a fast look-up algorithm for string homology
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
Magdalena Jonikas, R. Radmer, Alain Laederach (2009)
Knowledge-based Potentials and Structural Filters Coarse-grained Modeling of Large Rna Molecules with Material Supplemental Open Access Coarse-grained Modeling of Large Rna Molecules with Knowledge-based Potentials and Structural Filters
Califano (2002)
353
Petteri Jokinen, E. Ukkonen (1991)
Two Algorithms for Approximate String Matching in Static Texts
Liu (2006)
8
Manavski (2008)
S10
BMC Bioinformatics, 9
Natthakan Boongoen, Tossapon Garrett (2004)
BIOINFORMATICS ORIGINAL PAPER
Li (2009)
Fast and accurate short read alignment with Burrows-Wheeler transform
Bioinformatics, 25
Campagna (2009)
967
Bioinformatics, 25
Needleman (1970)
443
J. Mol. Biol., 48
Li (2009)
1754
Bioinformatics, 25
D. Campagna, A. Albiero, A. Bilardi, E. Caniato, C. Forcato, Svetlin Manavski, N. Vitulo, G. Valle (2009)
PASS: a program to align short sequences
Bioinformatics, 25 7
Heng Li, R. Durbin (2009)
Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform
Rognes (2000)
699
Bioinformatics, 16
Rumble (2009)
e1000386
PLoS Comput. Biol., 5
Jokinen (1991)
240
Lecture Notes in Computer Science, 520/1991
Stephen Rumble, P. Lacroute, Adrian Dalca, M. Fiume, A. Sidow, M. Brudno (2009)
SHRiMP: Accurate Mapping of Short Color-space Reads
PLoS Computational Biology, 5
Li (2008)
Mapping short DNA sequencing reads and calling variants using mapping quality scores
Genome Res., 18
A. Szalkowski, C. Ledergerber, Philipp Krähenbühl, C. Dessimoz (2008)
SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and ×86/SSE2
BMC Research Notes, 1
Yongchao Liu, D. Maskell, B. Schmidt (2009)
CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units
BMC Research Notes, 2
Michael Farrar
Bioinformatics Original Paper Striped Smith–waterman Speeds Database Searches Six times over Other Simd Implementations
Liu (2009)
73
BMC Res. Notes, 2
Weiguo Liu, B. Schmidt, G. Voss, Adrian Schröder, W. Müller-Wittig (2006)
Bio-sequence database scanning on a GPU
Proceedings 20th IEEE International Parallel & Distributed Processing Symposium
Temple Smith, M. Waterman (1981)
Identification of common molecular subsequences.
Journal of molecular biology, 147 1
Svetlin Manavski, G. Valle (2008)
CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment
BMC Bioinformatics, 9
Szalkowski (2008)
107
BMC Res. Notes, 1
Smith (1981)
195
J. Mol. Biol., 147
Ruiqiang Li, Chang Yu, Yingrui Li, T. Lam, S. Yiu, K. Kristiansen, Jun Wang (2009)
SOAP2: an improved ultrafast tool for short read alignment
Bioinformatics, 25 15
Farrar (2007)
156
Bioinformatics, 23
Li (2009)
1966
Bioinformatics, 25
M. Burrows, D. L, R. Taylor, D. Wheeler, D. Wheeler (1994)
A Block-sorting Lossless Data Compression Algorithm

Publisher: Oxford University Press
Copyright: © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/btr151
pmid: 21450712
Publisher site: See Article on Publisher Site

Abstract

Vol. 27 no. 10 2011, pages 1351–1358 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btr151 Sequence analysis Advance Access publication March 30, 2011 Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming 1,†,∗ 1,† 2 1 Jochen Blom , Tobias Jakobi , Daniel Doppmeier , Sebastian Jaenicke , 2 3,4 1,4,5 Jörn Kalinowski , Jens Stoye and Alexander Goesmann 1 2 3 Computational Genomics, Institute for Genome Research and Systems Biology, CeBiTec, Genome Informatics, 4 5 Faculty of Technology, Institute for Bioinformatics and Bioinformatics Resource Facility, CeBiTec, Bielefeld University, Bielefeld, Germany Associate Editor: Alfonso Valencia ABSTRACT increased. These techniques allow for cost-effective sequencing of complete libraries of different bacterial strains that may provide Motivation: The introduction of next-generation sequencing new insights e.g. into microevolution, but experimental data need techniques and especially the high-throughput systems Solexa to be processed before any conclusion can be drawn. Given an (Illumina Inc.) and SOLiD (ABI) made the mapping of short Illumina Solexa setup with a 36-bp read length, even one lane reads to reference sequences a standard application in modern on the ﬂow cell will sufﬁce for deep coverage sequencing of bioinformatics. Short-read alignment is needed for reference based several bacterial genomes. Typically, the generated reads are mapped re-sequencing of complete genomes as well as for gene expression on a closely related reference genome to perform a targeted analysis based on transcriptome sequencing. Several approaches re-sequencing, an in-depth SNP analysis, or to gain knowledge were developed during the last years allowing for a fast alignment about gene expression when performing transcriptomic experiments of short sequences to a given template. Methods available to date using cDNA sequencing. Different tools for mapping reads against use heuristic techniques to gain a speedup of the alignments, thereby reference genomes are available at this time including MAQ missing possible alignment positions. Furthermore, most approaches (Li et al., 2008), BWA (Li and Durbin, 2009), Bowtie (Langmead return only one best hit for every query sequence, thus losing the et al., 2009), PASS (Campagna et al., 2009), SHRiMP (Rumble potentially valuable information of alternative alignment positions et al., 2009) and SOAP2 (Li et al., 2009). MAQ was one of the with identical scores. ﬁrst mapping tools available and uses a hash data structure to keep Results: We developed SARUMAN (Semiglobal Alignment of short reads in memory while traversing the reference genome, resulting Reads Using CUDA and NeedleMAN-Wunsch), a mapping approach in a comparably low memory footprint. But it lacks support for the that returns all possible alignment positions of a read in a reference output of more than one mapping position per read and is not able to sequence under a given error threshold, together with one optimal compete with more recent approaches in terms of speed (Langmead alignment for each of these positions. Alignments are computed in et al., 2009). BWA, Bowtie and SOAP2 are indexing the complete parallel on graphics hardware, facilitating an considerable speedup reference genome using a Burrows-Wheeler-Transformation (BWT) of this normally time-consuming step. Combining our ﬁlter algorithm (Burrows and Wheeler, 1994) and process all the reads sequentially, with CUDA-accelerated alignments, we were able to align reads resulting in a considerable speedup of the mapping process. PASS to microbial genomes in time comparable or even faster than all and SHRiMP also perform an indexing of the reference sequence, published approaches, while still providing an exact, complete and but use a spaced seed approach (Califano and Rigoutsos, 2002) optimal result. At the same time, SARUMAN runs on every standard instead of BWT. All mentioned approaches except SHRiMP are Linux PC with a CUDA-compatible graphics accelerator. heuristic and do not guarantee the mapping of all possible reads. Availability: http://www.cebitec.uni-bielefeld.de/brf/saruman/saruman.html. Especially reads that show insertions or deletions (indels) compared Contact: jblom@cebitec.uni-bielefeld.de to the reference sequence are often missed. Bowtie is unable to Supplementary information: Supplementary data are available at identify such alignments by design, while SOAP2 only supports Bioinformatics online. gapped alignments in paired-end mode. As high accuracy and Received on October 27, 2010; revised on March 15, 2011; accepted conﬁdence of short-read alignment is highly desirable, especially on March 22, 2011 in the analysis of single nucleotide polymorphisms (SNPs) or small-scale structural variations, we decided to design an exact short-read alignment approach that identiﬁes all match positions 1 INTRODUCTION for each read under a given error tolerance, and generates one 1.1 Challenges of next-generation sequencing optimal alignment for each of these positions without any heuristic. To obtain an adequate runtime even for large-scale, whole-genome Together with the advent of new high-throughput sequencing applications, we employ the massively parallel compute power of technologies, the amount of generated biological data steadily modern graphics adapters [Graphic Processing Unit (GPUs)]. A qgram-based (Jokinen and Ukkonen, 1991) ﬁlter algorithm was To whom correspondence should be addressed. designed to ﬁnd auspicious alignment positions of short reads The authors wish it to be known that, in their opinion, the ﬁrst two authors should be regarded as joint First Authors. within the reference sequence. After the localization of possible © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com 1351 [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1351 1351–1358 J.Blom et al. mapping positions, we use a Needleman–Wunsch (Needleman and more general Application Programming Interface (API). Thus, the Wunsch, 1970) algorithm to compute optimal alignments of the focus of development shifts from ﬁtting a given algorithm to candidate sequences to the reference sequence. The alignments are OpenGL operations to the development of the best implementation. processed in parallel on a NVIDIA graphics card using the NVIDIA The CUDA platform developed by the NVIDIA cooperation CUDA (Compute Uniﬁed Device Architecture) framework. This (http://www.nvidia.com/object/cuda_home.html) and ATI’s Stream combination of a ﬁlter step with the parallel computation of optimal framework (http://www.amd.com/stream) are novel approaches to alignments on graphics hardware allows us to compute an exact and use the huge computational power of modern graphics cards not only complete mapping of all reads to the reference in a time comparable for games but also for scientiﬁc applications. Contemporary graphics to existing heuristic approaches. processing units are built as massively parallel computational devices, optimized for ﬂoating point operations. Compared to universal central processing units (CPUs) used in every computer, 1.2 Parallelization in biosequence analysis GPUs are specialized for parallel execution of many small tasks In computational biology, huge amounts of data need to be processed while CPUs are designed to execute fewer large tasks sequentially. and analyzed, therefore suggesting the exploration and evaluation As such, GPUs are also well suited for the highly parallel of new computational technologies like parallel programming. One computation of small-scale alignments. basic idea behind parallel programming is to divide a problem into smaller, easier to solve subproblems, a technique known 2 METHODS as ‘divide and conquer’. Another approach is to solve a huge number of small independent tasks by parallelization such that 2.1 Problem each problem can be solved on a different processing unit, either The short-read matching problem can be deﬁned as follows: we have given a cores of one computer or a compute cluster. Once all calculations short sequencing read f of length |f |= m, a (in most cases genomic) reference are ﬁnished, they are combined into a solution of the original sequence g of length |g|= n, and an error threshold e ≥ 0 deﬁned by the user. problem. While server systems with 32 and more CPU cores are Then we want to calculate all starting positions i in g, such that there exists available today and can be used to efﬁciently speedup multithreaded an alignment of f and a preﬁx of g[i...] with at most e errors (mismatches software, they still pose a signiﬁcant capital investment. Therefore, and/or indels). The algorithm shall be capable to export an optimal alignment specialized hardware for parallel processing has been employed, for every such match position. This problem has been studied widely in the past, and most solutions are such as Celeras GeneMatch™ASIC (application-speciﬁc integrated based on one of the two following principles, or combinations thereof: the circuit) approach which uses specialized processors to accelerate qgram lemma states that two strings P and S with an edit distance of e share several bioinformatics algorithms. A few years later, TimeLogic at least t qgrams, that is substrings of length q, where t = max(|P|,|S|) −q + developed Field Programmable Gate Arrays (FPGAs) to run adapted 1 −q ·e. versions of the Smith–Waterman, BLAST and HMMer software That means that every error may destroy up to q ·e overlapping qgrams. with signiﬁcant performance gains. Unfortunately, the prices for For non-overlapping qgrams, one error can destroy only the qgram in which such optimized special purpose hardware together with appropriate it is located, which results in the applicability of the pigeonhole principle. licenses lie in the same expensive investment range as large servers. The pigeonhole principle states that, if n objects (errors) are to be allocated A more cost-efﬁcient possibility is the use of existing hardware to m containers (segments), then at least one container must hold no fewer which can be employed for scientiﬁc computing through different than objects. Similarly, at least one container must hold no more than n n objects. If n < m, it follows that = 0, which means that at least one frameworks. The MMX technology introduced by Intel in 1997 m m container (segment) has to be empty (free of errors). Moreover, if n < m this is a SIMD (single input multiple data) approach which allows for holds for at least m −n segments. In our algorithm, presented in the following parallel processing of large amounts of data with specialized CPU section, we ﬁrst use a 2-fold application of the pigeonhole principle to ﬁnd instructions. The MMX instruction set as well as its successor regions of interest in the genome, before we apply parameters derived from SSE (Streaming SIMD Extensions) have been used to accelerate the qgram lemma to make our ﬁnal selection of positions to pass the ﬁlter. implementations (Farrar, 2007; Rognes and Seeberg, 2000) of the Smith–Waterman algorithm (Smith and Waterman, 1981). In 2.2 Solution 2008, also a ﬁrst attempt on non-PC hardware has been published. Assumptions and deﬁnitions: as a basic assumption, we require a minimal SWPS3 (Szalkowski et al., 2008) employs the Playstation 3’s cell length of reads in relation to the given error threshold: m > e + 1. Given this processor to speedup an adapted version of the Smith–Waterman assumption, we calculate the length of the qgrams for our ﬁlter algorithm as algorithm. Another trend in the recent past is the use of modern follows. We deﬁne the length q of the qgrams as the largest value below e+1 graphics adapters in scientiﬁc computation due to their immense such that processing power which still increases at a much faster rate than the processing power of general purpose processors. While there m q if (e + 1)q < m, q := , q := are several implementations available (Liu et al., 2009; Manavski e + 1 q − 1 otherwise. and Valle, 2008) that focus on the alignment of one query sequence This guarantees that a read f can be split into e + 1 intervals with an to a database of reference sequences, an algorithmic adaptation of additional non-empty remainder of length R := m − (e + 1)q. massively parallel pairwise alignments, as it could be used for short- Given the calculated qgram length, we create an index I of the starting read alignments, is still missing. First approaches using graphics positions of qgrams in sequence g, such that for each possible qgram x, I (x) cards as hardware accelerators for bioinformatics algorithms (Liu contains the starting positions of x in g. et al., 2006) relied on OpenGL, resulting in a difﬁcult and In the following step of the ﬁlter algorithm, every read is segmented into limited implementation. Today, frameworks simplify software pieces of length q (Fig. 1). We choose a set S of c = segments S = s ...s 1 c development by hiding the layer of 3D programming behind a of length q from f , such that for i = 1,...,c : s = f [(i − 1) ∗q + 1...i ∗q].As in [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1352 1351–1358 GPU short read alignment Fig. 2. A matching segment s ∈ S where no other segment of S has a match in correct distance. Segments of set K are checked in this case. If none of the Fig. 1. The two sets of segments S and K . The two sets are shifted by the segments k ...k matches, the read cannot match at the respective genome i c distance of R. position. most cases q = , it can easily be shown that c converges against e + 1 e+1 be correct, and so must be the ﬁrst q −R positions of k . The remaining R for increasing read length m. We additionally choose a second set K of c positions of k must also be correct because all errors are within segments segments K = k ...k of length q, such that for i = 1,...,c : k = f [(i − 1) ∗q + 1 c i c s ...s . The last R positions of k are not part of one of these segments, so 1 +R...i ∗q +R], a set shifted by the remainder R. 1 c c any error within them would be the e + 1st one. If k does not match, it can The simple version of our algorithm allows for mismatches only, not be excluded that read f can be aligned to the reference sequence g with a for insertion or deletions. In this case, it can be shown by the following maximum of e errors at the actual position. See Figure 2 as illustration of observations that a matching read f must have at least two matching segments this matching process. from the sets S and/or K . There are two cases, in the ﬁrst case two segments Insertions and Deletions: it is possible to allow indels in the matching of set S will match, in the second case one segment from S and one segment process by checking every segment not only on one position, but also on from K will match. The term ‘matching segment’ in the following indicates several positions by shifting the segment to the left or right by up to t that a starting position for the respective segment can be found in the genome positions, where t = e − (i − 1) for an initial matching segment s as at least via exact occurrence within index I . The ﬁrst segment found in the index is i i − 1 errors have already occurred in the previous i − 1 segments. Hence, used as seed, the second ‘matching’ segment has to be listed in the index in we conclude that if there are only e errors in f , it is always possible to appropriate distance to this seed. The algorithm is based on the assumption match (i) two segments of S or (ii) one segment of S and one segment that S and K comprise c = e + 1 segments. In cases where c > e + 1, we know of K . Algorithm 1 identiﬁes reads as candidates for a Needleman–Wunsch by the pigeonhole principle that at least two segments of S must match, which alignment by checking for this condition. fulﬁlls our ﬁlter criterion directly. The segment set K is not needed in such Alignment phase: as soon as a second qgram hit has been found for a cases, and all algorithmic steps after the ﬁrst can be skipped. given start index B, the alignment of f to g[B −e,B +m +e] is enqueued Filter algorithm: by the pigeonhole principle, we know that one segment for alignment and the rest of the ﬁlter phase for this value of B can be of set S ‘must’ match, as we allow e errors and have e + 1 segments. We can identify all match positions of segments s ∈ S for the given read in the skipped. To speed up this alignment procedure in practice, the veriﬁcation by qgram index I and use them as starting points for the ﬁlter algorithm. alignment is computed on graphics hardware. Due to the parallel computation of alignments on graphics hardware, a huge number of possible alignment (1) For every segment s matching at a position b in the genome g positions is collected before being submitted to the graphics card. The actual we check in I if there is another segment s ∈ S,j > i, that starts at number depends on the amount of memory available on the graphics card the expected position b + (j −i) ∗q. If we ﬁnd such a segment, we (VRAM). The alignment of the read sequence f to the possibly matching identiﬁed the two matching segments we expect. part g[B −e,B +m +e] of the reference sequence identiﬁed in the ﬁlter step is computed by a Needleman–Wunsch algorithm. On both sides of this template In the case that only one segment s ∈ S matches, there has to be exactly one sequence, e, additional bases are extracted from the reference sequence to error in every remaining segment s ...s ,s ...s . Otherwise, a second 1 i−1 i+1 c allow for indels in an end-gap free alignment. We use unit edit costs for the segment of set S would match. We can infer that not more than c −i errors are alignment matrix. As soon as the error threshold is exceeded in a complete remaining in the segments to the right of read s , but due to the overlapping column of the distance matrix, the calculation is stopped, the read is discarded construction of our segment sets we have one segment more of K to the and backtracing is omitted. If the complete distance matrix is computed, an right of read s to check. Hence, if read f is a possible hit, one of these optimal alignment path is calculated in the backtracing step and the read is c − (i − 1) segments k ...k has to match, and we can start the checks for set i c reported as hit together with its optimal alignment. SARUMAN does not use K at position k . banded alignments, yet, this is planned for future releases. (2) Check if segment k overlapping s matches. If it does, we have i i successfully matched 2 qgrams and passed the ﬁlter. 3 IMPLEMENTATION If segment k does not match, we know that k overlaps s on q −R positions. i i i These positions are free from errors (as s matched without errors). Thus, the The feasibility of sequence alignments using GPU hardware was error causing k not to match must have been on the last R positions of k i i demonstrated by different tools like SW-Cuda (Manavski and which are the ﬁrst R positions of s . As there is exactly one error in every i+1 Valle, 2008) or CUDASW++ (Liu et al., 2009). But, compared segment of S, we can conclude that the last q −R positions of s are free of i+1 to our solution, existing implementations focused on the search errors, which are the ﬁrst q −R positions of k .Soif k does not match, i+1 i+1 for similar sequences in a huge set of other sequences, which the next error is in the ﬁrst R positions of s and so on. i+2 corresponds to a BLAST-like use. In contrast, SARUMAN searches (3) Iteratively check all remaining segments k ...k until one segment i+1 c for local alignments of millions of short sequences against one long matches and the read f passes the ﬁlter or until k is reached. reference sequence. Employing the ﬁltering algorithm described in the previous section, all possible alignment positions in the reference If we reach k , that means that k did not match, we know that the c c−1 genome can be identiﬁed, and thereby the problem is reduced to error of s was in the ﬁrst R positions. So the last q −R positions of s must c c [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1353 1351–1358 J.Blom et al. Algorithm 1 (Mapper) on demand. The complete workﬂow of SARUMAN is illustrated in Require: A read f of length m Figure 3. Require: A reference sequence g of length n 1. ## Set of candidate positions ## 3.1 Mapping phase 2. C ←∅ The memory usage of the qgram index depends on the qgram length 3. ## q calculated as described in Section 2.2 ## and the number of replicates per qgram number, but is mainly 4. c ← dependent on the size of the reference genome. To process large 5. R ← m −c ·q reference genomes with limited resources, the reference sequence is 6. for i ← 1,...,c do divided into several chunks that are processed iteratively. The size 7. u ← (i − 1) ·q + 1 of a sequence that can be processed in one iteration is restricted 8. ## initial matching segments ## by the available memory. Using this technique, it is possible for 9. for each b in I (s ) do SARUMAN to run on computers with small amounts of RAM by 10. B ← b −u dividing the qgram index into chunks perfectly ﬁtting into available 11. ## check remaining segments of S ## memory. Our tests show that a standard computer with 4 GB of 12. for j ← i + 1,...,c do RAM and a recent dual core CPU is able to read and process even 13. v ← (j − 1) ·q + 1 large bacterial genomes like Sorangium cellulosum (13 033 779 bp) 14. ## shift by up to e − (i − 1) positions ## without any difﬁculties. However, this approach is not feasible 15. for t ←−(e − (i − 1)),...,+(e − (i − 1)) do for large eukaryotic genomes as the number of needed iterations 16. ## if distance ﬁts, add to alignment queue ## would be too high. Supported read input formats are FASTA as well 17. if B +v +t ∈ I (s ) then as FASTQ. While the reads are stored in memory, all sequences 18. add B to C containing more than e ambiguous bases (represented as N) are 19. end if ﬁltered out, due to the fact that an N would nevertheless be treated 20. end for as a mismatch by SARUMAN. Subsequently, perfect matches are 21. end for determined by exact text matching and exported as hits. Preceding 22. if not B ∈ C then the actual ﬁltering step, all reads are preprocessed. During this 23. ## check segments of K ## phase, all located start positions of the 2 ∗c qgrams of a read in 24. for j ← i,...,c do the reference genome are extracted from the qgram hash index. 25. v ← (j − 1) ·q + 1 +R This list of positions is stored in an auxiliary data structure in order 26. for t ←−e − (j − 1),...,e − (j − 1) do to minimize access to the hash index in later stages and therefore 27. if B +v +t ∈ I (k ) then speedup the following ﬁlter step. Reads with perfect hits on the 28. add B to C reference genome are still further processed as there may exist 29. end if imperfect hits elsewhere on the reference, but for such reads the 30. end for starting positions of qgrams representing the perfect hit are removed 31. end for from the auxiliary data structure. After the ﬁrst steps, each read is 32. end if mapped onto the reference genome using the algorithm described 33. end for in Section 2, whereas a read is only mapped once to a given start 34. end for position to avoid redundancy. While a combination of ﬁrst and last 35. ## Send candidates to alignment ## qgram (e.g. s and k ) exactly determines start and end position of 36. for each B ∈ C do the read, two ‘inner’ qgrams (e.g. s and k ) would result in a 2 c−1 37. align f to g[B −e,B +m +e] and report hit if successful longer alignment in order to ﬁnd the correct start and stop position. 38. end for Start positions for possible mappings are stored and transferred to the CUDA module in a later stage. In order to not only exploit the parallel architecture of graphics cards but also the availability global alignments of a read sequence with a short substring of a of multicore CPUs, the matching phase uses two threads. The ﬁrst reference genome. Thus, compared to SW-Cuda, SARUMAN does thread handles mapping on the sense strand, whereas the second not align sequences against a database of templates, but is designed thread processes mapping on the antisense strand. In contrast to as an alignment application to perform thousands of short pairwise many other approaches, which employ a 2-bit encoding for the four global sequence alignments in parallel. DNA letters, SARUMAN is able to handle Ns in reads as well as in The CUDA API is an extension to the C programming language. the reference genome. Since many genomes contain a small number Thus, the mapping part of SARUMAN was implemented in the of bases with unknown identity, it is of advantage to treat these C programming language to simplify the integration of CUDA bases as N. Replacing these positions with random or ﬁxed bases code. Our software is divided into two consecutive phases, namely to maintain the 2-bit encoding may lead to wrong and in case of mapping and aligning. Phase one, the creation of the qgram index random replacements even to irreproducible results. together with the following mapping of reads through qgrams, is completely processed on the host computer. During phase two, 3.2 Alignment phase CUDA is used to compute the edit distance for candidate hits on the graphics card using a modiﬁed Needleman–Wunsch algorithm. If the To efﬁciently use CUDA, it is of great advantage to understand the computed edit distance of read and genome lies below a given error underlying hardware of CUDA capable graphics adapters. A GPU threshold, the optimal alignment is computed in the backtracing step consists of a variable number of multiprocessors reaching from 1 in [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1354 1351–1358 GPU short read alignment Fig. 3. Program ﬂow within SARUMAN. The three different application phases have been highlighted in different shades of gray. After reading the reference genome, a qgram index is created, followed by the preﬁltering of perfect hits. Consecutively, all reads are processed by the ﬁltering algorithm which reports all candidate hits and copies all necessary data to the GPU. Edit distances for candidates are computed and alignments for promising hits calculated, which in the last step are postprocessed and printed out. threadblocks. Each threadblock is a collection of a given number of threads. A threadblock must be executable in any order and therefore must not have any dependencies on other blocks of the same grid. Once a sufﬁcient number of candidate hits has been found by the ﬁlter algorithm, all necessary data for performing the alignments is collected. Read and genome sequences are stored together with auxiliary data structures. Afterwards, all required data for the alignment phase is copied to the GPU as one large unit in order to minimize I/O overhead. The maximal number of alignments ﬁtting in the GPU memory heavily depends on read length. SARUMAN automatically calculates a save value for the number of parallel alignments. As the memory of the graphics Fig. 4. CUDA hardware layout and memory organization. Each adapter is also a limiting factor, SARUMAN does not use the quality multiprocessor consists of eight processors, each with an own set of information of reads as this would double up the memory usage. registers. Shared, constant and texture memory can be used by each of the eight processors, while device memory is globally accessible between For datasets with sufﬁcient coverage and quality, we recommend multiprocessors. to simply remove all reads with low-quality bases. For 36 bp reads, a value of 200 000 alignments (100 000 for each mapping thread entry level graphics adapters up to 60 multiprocessors in high end and direction) can be achieved on a standard GPU with 1 GB video cards. Each of these multiprocessors has access to registers of of graphics memory [Video RAM (VRAM)]. Once all data of 8 or 16 kb size and is divided into 8 small processors. The available the candidate hits has been copied to the GPU for each pair of registers are divided and equally assigned to processors. This small genome and read sequence, the edit distance is computed using the amount of writable memory should be used for data processed in the desired values for match and mismatch positions. By comparing the currently active thread while texture memory and constant memory distance with the supplied maximal error rate, all candidates with are read only and can be used to prefetch data from the much slower values above this threshold are discarded. Complete alignments are device memory. An overview of the CUDA hardware and memory only computed in a second backtracing step for candidates with model is given in Figure 4. Implementing code for the execution on a tolerable edit distance. Typically, the alignment phase for one a GPU is very similar to standard application development. Some batch takes only a few seconds to complete, including the whole additional keywords extending the C programming language are process of copying raw data to and processed alignments from the used to deﬁne on which device a function should be executed. A GPU. Before any output is written, alignments are postprocessed by function on the GPU is mapped to one thread on one processor clipping gaps at the start and end. For each possible start position located on the graphics card, whereas one function can and should of each read, the optimal alignment is reported, in contrast to other be executed on many different datasets in parallel. This execution available tools which only deliver a ﬁxed number of positions or do scheme is called SIMT (single input multiple threads) due to its not even guarantee to report any optimal alignment. SARUMAN relation to the similar SIMD scheme used in classical parallel produces tab separated output by default which includes start programming. Parallel programming with CUDA is transparent and and stop position of the mapping together with the edit distance and the complete alignment. The package includes an easy to (within NVIDIA products) device independent for developers and use conversion tool to generate the widely used SAM alignment users. Launch of a CUDA application is controlled using only a format. The SARUMAN software is available for download at few parameters deﬁning the total number of threads. Those threads are organized hierarchically into grids, which themselves consist of http://www.cebitec.uni-bielefeld.de/brf/saruman/saruman.html. [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1355 1351–1358 J.Blom et al. 4 RESULTS 100 bp reads are provided in the Supplementary Material. The goal was to map all artiﬁcial reads on the genome at the exact position 4.1 Evaluation methods without missing any mappings. As expected, using SARUMAN we The evaluation was accomplished on a standard desktop PC with an were able to map all artiﬁcial reads to the genome. Furthermore, Intel Core2Duo E8400 3 GHz dual core processor with 8 GB DDR2 nearly all reads were mapped to the correct position in the reference RAM and a GeForce GTX280 graphics card with 1 GB of VRAM. genome. In some rare cases (ca. 0.45%), SARUMAN returned All programs were run multithreaded on both processor cores, the optimal alignments that are shifted from the reads original position operating system was Ubuntu Linux. We used four datasets for the by up to e bases (Fig. 5). Such cases cannot be resolved, this performance evaluation, three synthetic read sets of roughly 18 mio. is a general problem of the edit cost function and not a ﬂaw of reads generated from the Escherichia coli K12 MG1655 genome SARUMAN. Additionally, SARUMAN reported a large number (GenBank accession NC_000913), and a real dataset with data from of other matches on different sites of the genome. Among them one lane of a re-sequencing run of Corynebacterium glutamicum were six matches that placed a read to an alternative position in ATCC 13032 (GenBank Accession BX927147) using the Illumina the reference genome with a better score than the alignment to the Solexa GAII sequencer, comprising 18 161 299 reads of 35 bp original position. In this case, the incorporation of errors led to a read length. The settings for all programs used in the comparisons were that just by chance ﬁts better to a wrong genomic position. While adjusted to make the run parameters as comparable as possible. alternative hits can be identiﬁed as misplaced in synthetic data, this To achieve this, we allowed two mismatches/indels for each read behavior is preferable for real data as one can not determine the and set all programs to support multithreading. Furthermore, we correct mapping position among several equally good locations. allowed gapped alignment of reads where possible and adjusted the The evaluation shows a clear separation of the programs into two alignment scoring matrix to simple unit costs. Detailed settings for classes, depending on their ability to handle gaps. Neither SOAP2 all tool runs are given in the Supplementary Material. nor Bowtie are able to perform gapped alignments, both tools were 4.2 Performance evaluation In order to prove the exactness and completeness of the presented approach and to measure the discrepancy between exact and heuristic implementations, we used the synthetic read sets described in Section 4.1. Synthetic reads were generated with 36, 75 and 100 bp length in both directions with up to two errors of different types, i.e. mismatches, insertions, deletions and combinations thereof. About three million of the artiﬁcial reads contained indels. These reads were mapped to the original source genome Escherichia coli K12 Fig. 5. Example for an ambiguous mapping. The read has a deletion in a MG1655. The dataset is available on the project homepage. Table 1 poly-T region compared to the reference. This results in a shift of the mapping compares the mapping ratios of the different tools together with their position by one base as it is not possible to determine in which position the respective running time for 75 bp reads. Data for 36 bp reads and deletion event happened. Table 1. Sensitivity evaluation with an artiﬁcial dataset of 17.980.142 reads (75 bp) generated from Escherichia coli K12 MG1655 with up to two errors SARUMAN SOAP2 Bowtie BWA SHRiMP PASS Reference Mapped 17 980 142 15 142 908 15 123 838 17 746 484 17 980 142 16 873 044 17 980 142 Not mapped 0 2 837 234 2 856 304 233 658 0 1 107 098 0 Perfect 4 999 944 4 999 942 4 999 944 4 999 944 4 999 944 4 999 944 4 999 944 With errors 12 980 198 10 142 966 10 123 894 12 746 540 12 980 198 11 873 100 12 980 198 1 mismatch 4 999 908 4 999 906 4 999 908 4 999 908 4 999 908 4 999 908 4 999 908 2 mismatches 4 999 936 4 999 936 4 999 936 4 999 936 4 999 936 4 999 936 4 999 936 1 Insertion 499 992 34 648 29 900 489 788 499 992 478 754 499 992 2 Insertions 499 998 1243 496 405 563 499 998 496 499 998 1 Deletion 493 560 46 740 46 740 491 454 493 560 475 916 493 560 2 Deletion 493 354 4828 4828 433 605 493 354 452 714 493 354 1Ins,&1Mism, 500 000 21 374 12 308 458 957 500 000 18 334 500 000 1Del,&1Mism, 493 450 34 291 29 778 467 329 493 450 447 042 493 450 MCP 17 899 092 15 070 286 15 061 178 17 432 842 17 785 567 16 577 180 – BMCP 17 899 086 15 070 286 15 061 178 17 430 632 17 784 713 16 577 173 – Total alignments 19 971 674 16 425 886 16 542 168 18 022 317 19 423 932 18 447 418 – Runtime (min) 12:03 06:40 18:56 15:09 95:06 27:42 – RAM usage (kb) 3 375 236 702 964 14 420 117 172 1 313 944 399 278 – MCP (Mapped to Correct Position) denotes the number of mapped reads that had a match to their original position. BMCP (Best Match at Correct Position) denotes the number of reads where the best match was located at the correct position. [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1356 1351–1358 GPU short read alignment Table 2. In-depth analysis of reported mappings of 18 161 299 C.glutamicum re-sequencing reads with an error threshold of two SARUMAN SOAP2 Bowtie BWA SHRiMP Pass Mapped 18 025 584 18 001 767 17 999 961 18 023 109 16 558 248 17 859 475 Not mapped 135 715 159 532 161 338 138 190 1 603 051 301 824 1 alignment 17 406 040 17 467 485 17 388 188 17 732 158 16 057 777 17 284 434 2 alignments 184 224 153 627 181 476 171 385 144 175 167801 3 alignments 72 679 44 237 73 534 58 298 42 699 57630 ≥ 4 alignments 362 641 336 418 356 763 61 268 313 597 349610 Alignments total 20 006 760 19 755 348 19 936 446 18 494 894 17 991 074 19 713 095 Runtime (min) 6:08 9:29 19:40 8:30 32:03 23:59 Matches reported by SOAP2 include reads containing Ns, which are treated as mismatch by other programs. Table 3. Sensitivity of the ﬁlter algorithm and performance gain of the not able to map more than a small portion of sequences generated GPU implementation, tested on E.coli artiﬁcial data with indels to their correct position by using mismatches instead of gaps. The second group of tools consists of BWA, PASS, SHRiMP Read length 36 bp 75 bp 100 bp and SARUMAN, which are all capable of aligning reads containing gaps, although PASS shows a poor mapping ratio for reads with Candidates 23 0405 03 15 405 284 14 991 106 more than one indel. BWA shows a very good performance, but Aligned 14 918 066 14 553 824 14 451 680 still misses more than 200 000 reads. SHRIMP shows a complete Filter step (min) 5:26 4:53 4:49 mapping of all reads, but places less reads than SARUMAN to Alignment on GPU (s) 42 430 1046 the correct position. SARUMAN shows also the best performance Alignment on CPU (s) 1085 3282 5814 on the real C.glutamicum dataset presented in Table 2. As this GPU:CPU 1:25.83 1:7.63 1:5.55 dataset originates from a re-sequencing of the identical strain, only a very low number of errors is expected. Therefore, the differences The upper part shows the runtime of the ﬁlter step, the alignment candidates passing between the tools are quite small. Nevertheless, SARUMAN shows the ﬁlter and the number of reads that where successfully aligned. The lower part the highest mapping ratio of all tools, and furthermore produces the compares the runtime of the alignment step on a GPU versus the runtime on a CPU. highest number of valid alignments. The dataset was evaluated with an error threshold of two: data for one and three allowed errors are candidates resulting from the ﬁlter algorithm with the number of given in the Supplementary Material. alignments successfully veriﬁed by the alignment step. The results Besides sensitivity, another major requirement of short-read are shown in Table 3. The efﬁciency of the ﬁlter algorithm is slightly alignment approaches is the performance in terms of running time. increasing with longer read lengths due to the bigger qgram size. The The runtime evaluation of the artiﬁcial reads shows that SOAP performance of the alignment step is decreasing as the alignments is nearly double as fast as SARUMAN, but maps signiﬁcantly of the longer reads are more memory consuming on the graphics less reads. All other tools are comparably fast (BWA, Bowtie) adapter. But even for 100 bp reads, the GPU implementation shows or considerably slower (SHRIMP, PASS) than our approach. For a more than 5-fold speedup compared to the CPU implementation the C.glutamicum re-sequencing dataset, SARUMAN performs of the same algorithm. best with a running time of 06:08 min, while SOAP and BWA take slightly longer with 9:29 and 8:30 min, respectively. Bowtie, 5 DISCUSSION PASS and SHRiMP are three to ﬁve times slower. Considering all calculated datasets, SOAP2 shows the best performance of all For short-read alignments against microbial genomes, SARUMAN compared approaches, being the only algorithm that is faster than has proven to be as fast or even faster than other available approaches SARUMAN in most cases. BWA shows a runtime comparable to like SOAP2, Bowtie, BWA, SHRiMP and Pass (Tables 1 and 2). SARUMAN, while Bowtie is highly dependent on the used error In contrast to heuristic approaches, SARUMAN guarantees to ﬁnd ratio. For low error rates, it can compete with other approaches, optimal alignments for all possible alignment positions. Compared for higher error rates the runtime rises signiﬁcantly. SHRiMP and to SHRiMP, another exact aproach, SARUMAN shows a better PASS are the slowest of the compared approaches for all evaluated runtime performance. Thus, it is the ﬁrst non-heuristic approach datasets. In summary, SARUMAN is the only approach that provides providing full indel support in competitive computing time. The exact and complete results, while still being nearly as fast or even completeness of the SARUMAN mapping was demonstrated using faster than all compared approaches. Thereby, it shows the best a constructed dataset, where all reads could be mapped correctly overall performance for short-read alignment against prokaryotic (Table 1). SARUMAN is designed as a short-read alignment genomes. approach; nonetheless, it works properly and with reasonable running times for reads up to lengths of 125 bp. The qgram index used in the ﬁlter step has a memory usage that in worst case increases 4.3 Performance of ﬁlter and alignment components linearly with the size of the reference genome. Furthermore, as The performance of the ﬁlter algorithm mainly depends on the SARUMAN provides a proper handling of Ns in the reference read length and the qgram length. We compared the number of sequence as well as in the reads, it cannot use two bit encoding [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1357 1351–1358 J.Blom et al. of nucleotides, which further increases the memory footprint. On REFERENCES a standard desktop PC with 4 GB RAM, it can map reads to all Burrows,M. and Wheeler,D.J. (1994) A block-sorting lossless data compression bacterial genomes in a single iteration, with 8 GB RAM fungal algorithm. Technical Report 124, Digital Systems Research Center, Palo Alto, California. genomes of up to 50 Mb in size were processed in a single run. Califano,A. and Rigoutsos,I. (2002) FLASH: A fast look-up algorithm for string If a reference genome is too large for the available memory, it is homology. In Computer Vision and Pattern Recognition, 1993. Proceedings automatically split into two or more parts as described in Section 3. CVPR’93., 1993 IEEE Computer Society Conference on. IEEE, New York, NY, Of course the running time of the algorithm has to be multiplied USA, pp. 353–359. by the number of iterations that are needed. Due to this limitation, Campagna,D. et al. (2009) PASS: a program to align short sequences. Bioinformatics, 25, 967. we propose SARUMAN as dedicated solution for read mapping Farrar,M. (2007) Striped Smith-Waterman speeds database searches six times over other on microbial reference genomes or other reference sequences of SIMD implementations. Bioinformatics, 23, 156–161. comparable size. SARUMAN does not natively support paired Jokinen,P. and Ukkonen,E. (1991) Two algorithms for approximate string matching in end sequencing data, but as all possible alignments are returned static texts. Lecture Notes in Computer Science, 520/1991, 240–248. Langmead,B. et al. (2009) Ultrafast and memory-efﬁcient alignment of short DNA this can be handled by post-processing the results to ﬂag read sequences to the human genome. Genome Biol., 10, R25. pairs as matching either in compatible distance or not. A post- Liu,W. et al. (2006) Bio-sequence database scanning on a GPU. In Parallel and processing tool is under development. Several further improvements Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, IEEE, to SARUMAN are planned for the future. One idea is to combine Rhodes Island, Greece, p. 8. the CUDA alignment module with other ﬁlter algorithms. These Liu,Y. et al. (2009) CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res. Notes, 2, 73. may be heuristic solutions to establish an even faster short-read Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with Burrows- alignment, or algorithms based on compression techniques like the Wheeler transform. Bioinformatics, 25, 1754–1760. BWT to reduce the memory usage and make the approach more Li,H. et al. (2008) Mapping short DNA sequencing reads and calling variants using applicable to large reference genomes. The highest potential for a mapping quality scores. Genome Res., 18, 1851–1858. Li,R. et al. (2009) SOAP2: an improved ultrafast tool for short read alignment. further speedup has the processing of the ﬁltering algorithm on the Bioinformatics, 25, 1966–1967. graphics adapter. Unfortunately, this is not yet foreseeable as it is Manavski,S.A. and Valle,G. (2008) CUDA compatible GPU cards as efﬁcient hardware quite complicated to handle the reference sequence data with the accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9 limited amount of memory available on the graphics cards, but it (Suppl. 2), S10. may become feasible with future generations of graphics hardware. Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453. Another planned development is a native support for color space Rognes,T. and Seeberg,E. (2000) Six-fold speed-up of Smith-Waterman sequence data as generated by the SOLiD sequencing system. database searches using parallel processing on common microprocessors. Bioinformatics, 16, 699–706. Rumble,S. et al. (2009) SHRiMP: accurate mapping of short color-space reads. PLoS ACKNOWLEDGEMENTS Comput. Biol., 5, e1000386. Smith,T.F. and Waterman,M.S. (1981) Identiﬁcation of common molecular The authors wish to thank the Bioinformatics Resource Facility subsequences. J. Mol. Biol., 147, 195–197. system administrators for expert technical support. Szalkowski,A. et al. (2008) SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2. BMC Res. Notes, 1, 107. Funding: German Federal Ministry of Education and Research (grant 0315599B ‘GenoMik-Transfer’) to J.B. and S.J. Conﬂict of Interest: none declared. [17:34 18/4/2011 Bioinformatics-btr151.tex] Page: 1358 1351–1358

Journal

Bioinformatics – Oxford University Press

Published: Mar 30, 2011

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming

Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming

Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming

References (37)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies