Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters

Background: Computing alignments between two or more sequences is a common operation frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for efficient parallel implementations on modern accelerators.

Results: This paper presents new approaches to high-performance biological sequence database scanning with the Smith-Waterman algorithm and to the first stage of progressive multiple sequence alignment based on the ClustalW heuristic, both on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture: cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency.

Conclusions: Evaluations show that our method achieves a peak overall performance of up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes, for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive with optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi.

Keywords: Smith-Waterman, Dynamic programming, Pairwise sequence alignment, Multiple sequence alignment, Xeon Phi clusters

Background

Calculating similarity scores between a given query protein sequence and all sequences of a database, and computing multiple sequence alignments, are two common tasks in bioinformatics. Both tasks include iterative calculations of pairwise local alignments as a basic building block. This can lead to high runtimes for large-scale input data sets. Since biological sequence databases are continuously growing, finding fast solutions is of high importance. An approach to reduce the associated runtimes is the implementation of basic alignment algorithms on parallel computer architectures [1-3]. More recently, the usage of modern massively parallel accelerator architectures such as CUDA-enabled GPUs has gained momentum [4]. In this paper we investigate how a Xeon Phi-based compute cluster can be used as a computational platform to accelerate alignment algorithms based on dynamic programming for two applications:

(i) scanning of protein sequence databases with the Smith-Waterman algorithm, and
(ii) distance matrix computation for multiple sequence alignment (i.e. the first stage of the popular ClustalW heuristic).

Three levels of parallelization are required in order to exploit the compute power available in a cluster of Xeon Phis. Parallelization within a Xeon Phi is usually based on the "scale-and-vectorize" approach: scaling across the up to 61 cores requires the usage of several hundred threads, while exploiting the 512-bit wide vector units requires SIMD vectorization within each core. Recent examples of efficient parallelization on Xeon Phis include scientific computing [5], bioinformatics [6-10], and database operations [11]. Furthermore, parallelization between Xeon Phis adds another level of message-passing-based parallelism. This level needs to consider data partitioning, load balancing, and task scheduling. The accelerator-based approach is motivated by the fact that the performance of many-core architectures keeps growing. For example, the 2nd-generation Xeon Phi processor named "Knights Landing" has already been announced.

The rest of this paper is organized as follows. The "Related work" section provides important background information about the Xeon Phi programming model, pairwise and multiple sequence alignment, and hardware-accelerated alignment algorithms. Our single-node parallel algorithms are presented in the "Algorithms on a single node" section. The "Cluster level data parallelization" section describes our cluster-level parallelization. The "Results and discussion" section evaluates performance. Some conclusions are drawn in the "Conclusion" section.

Related work

Programming models on Xeon Phi coprocessor

Xeon Phi is a coprocessor connected via the PCI Express (PCIe) bus to a host CPU. From a hardware perspective, it contains up to 61 x86-compatible cores. Each core features a 512-bit vector processing unit (VPU) based on a new instruction set. The cache hierarchy contains a 32 KB L1 data cache and a 512 KB L2 cache per core. The cores are connected via a bidirectional ring bus which enables L2 cache coherence based on a directory-based protocol. Each core can execute up to four threads at the same time.

Assuming a Xeon Phi with 61 usable cores running at 1.238 GHz, we can determine the peak performance for 32-bit integer operations (integer arithmetic is commonly used for sequence alignment calculations) as follows: 16 (#SIMD lanes) × 1 integer operation × 1.238 GHz × 61 (#cores) = 1.208 Tera integer operations per second.

From a software perspective, three programming models can be used in order to harness the compute power of the Xeon Phi: (i) the native model, (ii) the offload model, and (iii) the symmetric model. In this paper, we have chosen the offload model. In this model, code sections and data can be offloaded from the host CPU to the Xeon Phi. Offload regions can be specified using OpenMP pragmas. When such a region is encountered during program execution, the necessary data transfers between host and Xeon Phi are performed and the code inside the (parallelized) region is executed on the Xeon Phi.
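As a concrete illustration of the offload model, the sketch below combines the Intel compiler's offload pragma with OpenMP inside the region. This is a minimal, hypothetical example: the function name, variables, and the trivial kernel are ours, not the paper's code.

```cpp
// Minimal sketch of the offload model (illustrative, not the paper's code).
// Data named in the in()/out() clauses is transferred between host and
// Xeon Phi around the region; the region body runs on the coprocessor.
void score_batch_on_phi(char* residues, int num_chars,
                        int* scores, int num_seqs) {
    #pragma offload target(mic:0) in(residues : length(num_chars)) \
                                  out(scores : length(num_seqs))
    {
        // Ordinary OpenMP threads execute on the Xeon Phi inside the region.
        #pragma omp parallel for
        for (int i = 0; i < num_seqs; ++i) {
            scores[i] = 0;  // placeholder for the per-sequence alignment kernel
        }
    }
}
```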
Pairwise sequence alignment and database search

The database search application considered in this paper scans a protein sequence database using a single protein sequence as a query (similar to BLASTP). Different to the BLASTP heuristic, we calculate the score of an optimal local alignment between the query and each subject sequence using the Smith-Waterman algorithm with affine gap penalties (instead of a seed-and-extend approach). The subject sequences are ranked in terms of this score. Actual alignments are only computed for the top-ranked database sequences, which takes a negligible amount of time in comparison to the score-only search procedure. Note that the score-only Smith-Waterman computation can be performed in linear space and quadratic time with respect to the lengths of the alignment targets.

Consider two protein sequences Q and S of length q and s, respectively. We want to compute the score of an optimal local alignment of Q and S with respect to a given scoring scheme consisting of a gap opening penalty α, a gap extension penalty β, and an amino acid substitution matrix sbt(). The well-known Smith-Waterman algorithm solves this problem by computing a dynamic programming matrix iteratively based on the following recurrence relations:

  H_A(i, j) = max{0, E(i, j), F(i, j), H_A(i − 1, j − 1) + sbt(Q[i], S[j])}   (1)
  E(i, j) = max{H_A(i, j − 1) − α, E(i, j − 1) − β}
  F(i, j) = max{H_A(i − 1, j) − α, F(i − 1, j) − β}

The iterative computation of these matrices is started with the initial values H_A(i, 0) = H_A(0, j) = E(i, 0) = F(0, j) = 0 for all 0 ≤ i ≤ q, 0 ≤ j ≤ s.
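A direct scalar rendering of these recurrences is given below as a reference point. It computes the score-only value in linear space, as noted above; the substitution callback is an illustrative stand-in for a BLOSUM lookup, and the vectorized kernels used in the paper are organized differently.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Score-only Smith-Waterman with affine gaps, following recurrence (1).
// Linear space: H and F hold the previous row in place, E is carried along
// the current row. This plain sketch is for clarity, not the paper's kernel.
int smith_waterman_score(const std::string& Q, const std::string& S,
                         int alpha, int beta, int (*sbt)(char, char)) {
    const int q = Q.size(), s = S.size();
    std::vector<int> H(s + 1, 0);  // H(i-1, j), overwritten to H(i, j)
    std::vector<int> F(s + 1, 0);  // F(i-1, j), overwritten to F(i, j)
    int best = 0;
    for (int i = 1; i <= q; ++i) {
        int e = 0;     // E(i, j), built from H(i, j-1) and E(i, j-1)
        int diag = 0;  // H(i-1, j-1); matrix borders are initialized to 0
        for (int j = 1; j <= s; ++j) {
            e = std::max(H[j - 1] - alpha, e - beta);  // H[j-1] is already H(i, j-1)
            F[j] = std::max(H[j] - alpha, F[j] - beta);  // H[j] is still H(i-1, j)
            int h = std::max({0, e, F[j], diag + sbt(Q[i - 1], S[j - 1])});
            diag = H[j];  // save H(i-1, j) before overwriting
            H[j] = h;
            best = std::max(best, h);
        }
    }
    return best;
}
```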
Progressive multiple sequence alignment

The time complexity of computing an optimal multiple alignment of more than two sequences grows exponentially with the number of input sequences. Thus, heuristic approaches with polynomial complexities must be used in practice for large inputs to approximate the (generally unknown) optimal multiple alignment.

The multiple (protein) sequence alignment application considered in this paper is the first stage of the popular ClustalW heuristic [12]. ClustalW is based on the classical progressive alignment approach [13] featuring a 3-step pipeline (see Fig. 1):

(a) Distance matrix: For each input sequence pair, a distance value is computed based on the Smith-Waterman algorithm.
(b) Guide tree: The distance matrix computed in the previous step is taken as input to compute an evolutionary tree using the neighbor-joining method [14].
(c) Progressive alignment: Following the branching order of the tree, a multiple sequence alignment is built progressively.

Fig. 1 Illustration of the three stages of progressive multiple alignment (see text for details)

Hardware accelerated alignment algorithms

We briefly review some previous work on accelerating pairwise alignment (based on Smith-Waterman) and progressive multiple sequence alignment (based on ClustalW) on a number of parallel computer architectures. A number of SIMD implementations have been designed in order to harness the vector units of common multi-core CPUs (e.g. [15-21]) or the Cell/BE (e.g. [22, 23]). Recent years have seen increased interest in the acceleration of sequence alignment on massively parallel GPUs. Initially, programming these graphics chips for bioinformatics applications still required programming with shaders using languages such as OpenGL [24]. The release of CUDA in 2007 made the usage of GPUs for general-purpose computing more accessible, and subsequently a number of CUDA-enabled Smith-Waterman implementations have been presented [4, 25-33]. A number of MPI-based solutions for progressive multiple sequence alignment are targeted towards PC clusters [34-37]. Another attractive architecture for sequence analysis are FPGAs [38-41], which are based on reconfigurable hardware. However, in comparison to the other mentioned architectures, FPGAs are often less accessible and generally more difficult to program.

The solution in this paper is based on a cluster of Xeon Phis. Compared to common CPUs, a Xeon Phi contains significantly more cores and often a wider vector unit. Different from CUDA-enabled GPUs, a Xeon Phi provides x86 compatibility, which often simplifies the implementation process. Nevertheless, achieving near-optimal performance is still a challenge which needs to be addressed by parallel algorithm design and efficient implementation. In this paper we demonstrate how this can be done for protein sequence database search and for distance matrix computation in multiple sequence alignment.

Compared to our previously presented LSDBS [9], we introduce the following new contributions in this paper:

- We have designed new algorithms which can handle search tasks for large-scale protein databases on Xeon Phi clusters.
- We have designed new algorithms for calculating large-scale multiple sequence alignments on Xeon Phi clusters.
- We have implemented our multiple sequence alignment algorithm using the offload model to make full use of the compute power of both the multi-core CPUs and the many-core Xeon Phi hardware.

Methods

Algorithms on a single node

Protein sequence database search

We have observed two facts: (1) protein sequence database search has inherent data parallelism; (2) each VPU on a Xeon Phi can execute multiple integer operations in an SIMD-parallel way efficiently. Based on these two facts, we have partitioned the database search process on a single node into two data-parallel parts: device level and thread level. The device-level data-parallel part is implemented on the host CPU. It splits the subject database into multiple batches that can be distributed to CPU and Xeon Phi devices. The thread-level data-parallel part is used to process data batches locally. In order to support search tasks for large-scale databases, we have designed a dynamic data distribution framework to distribute these batches to both the host CPU device and the Xeon Phi devices. In order to solve the performance loss problem when searching with long query sequences, we have also proposed a multi-pass algorithm where long query sequences are partitioned into multiple short subsequences for consecutive searching passes. We have presented more implementation details of our algorithm in [9].
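The implementation details of the multi-pass scheme are given in [9]; the fragment below only illustrates the basic idea of cutting a long query into short subsequences for consecutive passes. The fixed pass length (and the absence of any overlap handling) is a simplifying assumption, not the paper's actual policy.

```cpp
#include <string>
#include <vector>

// Illustrative sketch of the multi-pass idea: a long query is split into
// short subsequences that are searched in consecutive passes. The fixed
// pass length is an assumption; see [9] for the actual scheme.
std::vector<std::string> split_query(const std::string& query, size_t pass_len) {
    std::vector<std::string> passes;
    for (size_t off = 0; off < query.size(); off += pass_len)
        passes.push_back(query.substr(off, pass_len));
    return passes;
}
```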
MSA

The distance matrix computation stage of ClustalW is typically a major runtime bottleneck. Thus, in our work we have concentrated on designing a parallel algorithm for this stage only. ClustalW bases the distance computation between two protein sequences on the following concept [24]:

Definition 1. Consider two sequences S_i, S_j ∈ S = {S_1, ..., S_n}. The following equation defines their distance d(S_i, S_j):

  d(S_i, S_j) = 1 − nid(S_i, S_j) / min{l_i, l_j}

whereby nid(S_i, S_j) is defined as the number of exact matches in an optimal local alignment between S_i and S_j, and l_i (l_j) is the length of S_i (S_j).

The value nid(S_i, S_j) can be calculated in the Smith-Waterman traceback procedure by counting the number of exact character matches. Figure 2 illustrates this method. However, this direct method does not work well for long sequences and large-scale datasets because it needs to store the whole DP matrix. In order to solve this problem, we have adapted the method presented in [24] to perform the nid-value computation on the Xeon Phi architecture. That is, we have used the following definition and theorem to calculate the nid-value without performing the actual traceback.

Fig. 2 An example of how to compute the nid-value in the traceback procedure. The matrix H_A(i, j) is shown for a linear gap penalty α = 1 and a substitution score of +3 for an exact match and −1 otherwise. The nid-value here is five

Definition 2. Consider two protein sequences S_1 and S_2, affine gap penalties α, β, and substitution matrix sbt. The matrices N_A(i, j), N_E(i, j), and N_F(i, j) (1 ≤ i ≤ l_1, 1 ≤ j ≤ l_2) are defined in terms of the following recurrence relations:

  N_A(i, j) =
    0,                            if H_A(i, j) = 0
    N_A(i − 1, j − 1) + m(i, j),  if H_A(i, j) = H_A(i − 1, j − 1) + sbt(S_1[i], S_2[j])
    N_E(i, j),                    if H_A(i, j) = E(i, j)
    N_F(i, j),                    if H_A(i, j) = F(i, j)

where

  m(i, j) = 1 if S_1[i] = S_2[j], and 0 otherwise

  N_E(i, j) =
    0,               if j = 1
    N_A(i, j − 1),   if E(i, j) = H_A(i, j − 1) − α
    N_E(i, j − 1),   if E(i, j) = E(i, j − 1) − β

  N_F(i, j) =
    0,               if i = 1
    N_A(i − 1, j),   if F(i, j) = H_A(i − 1, j) − α
    N_F(i − 1, j),   if F(i, j) = F(i − 1, j) − β

Then nid(S_1, S_2) = N_A(i_max, j_max), where (i_max, j_max) denote the coordinates of the maximum value in the corresponding pairwise local alignment DP matrix H_A.
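To make Definitions 1 and 2 concrete, the sketch below carries N_A, N_E, and N_F alongside H, E, and F and reads nid off at the maximum-scoring cell, so no traceback is required. Full O(l_1 · l_2) matrices are kept purely for clarity; the paper's kernel is vectorized over 16 sequence channels and does not store full matrices.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Scalar sketch of Definitions 1 and 2: nid is accumulated without a
// traceback. Memory use is quadratic here for readability only.
double distance(const std::string& S1, const std::string& S2,
                int alpha, int beta, int (*sbt)(char, char)) {
    const int l1 = S1.size(), l2 = S2.size();
    typedef std::vector<std::vector<int> > Mat;
    Mat H(l1 + 1, std::vector<int>(l2 + 1, 0)), E = H, F = H;
    Mat NA = H, NE = H, NF = H;
    int best = 0, nid = 0;
    for (int i = 1; i <= l1; ++i) {
        for (int j = 1; j <= l2; ++j) {
            const int m = (S1[i - 1] == S2[j - 1]) ? 1 : 0;  // m(i, j)
            // E and N_E (for j = 1 both operands are 0, so N_E stays 0)
            if (H[i][j - 1] - alpha >= E[i][j - 1] - beta) {
                E[i][j]  = H[i][j - 1] - alpha;
                NE[i][j] = NA[i][j - 1];
            } else {
                E[i][j]  = E[i][j - 1] - beta;
                NE[i][j] = NE[i][j - 1];
            }
            // F and N_F (symmetric case for the vertical gap)
            if (H[i - 1][j] - alpha >= F[i - 1][j] - beta) {
                F[i][j]  = H[i - 1][j] - alpha;
                NF[i][j] = NA[i - 1][j];
            } else {
                F[i][j]  = F[i - 1][j] - beta;
                NF[i][j] = NF[i - 1][j];
            }
            // H and N_A, following the case split of Definition 2
            const int diag = H[i - 1][j - 1] + sbt(S1[i - 1], S2[j - 1]);
            const int h = std::max(std::max(0, diag), std::max(E[i][j], F[i][j]));
            H[i][j] = h;
            if (h == 0)            NA[i][j] = 0;
            else if (h == diag)    NA[i][j] = NA[i - 1][j - 1] + m;
            else if (h == E[i][j]) NA[i][j] = NE[i][j];
            else                   NA[i][j] = NF[i][j];
            if (h > best) { best = h; nid = NA[i][j]; }  // track (i_max, j_max)
        }
    }
    return 1.0 - double(nid) / double(std::min(l1, l2));  // Definition 1
}
```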
Input data set sizes for MSA are typically smaller than for database search (protein sequence databases typically contain many millions of sequences, while large-scale MSAs are computed for a few thousand protein sequences), making the subject sequence set for distance matrix computation comparatively small. In order to design an efficient parallel distance matrix computation algorithm on the Xeon Phi, we have used the task partitioning method shown in Fig. 3. In our method, the sequences are sorted by their lengths and then partitioned into smaller-sized batches. In an alignment task, a query sequence is aligned to the corresponding sequence batch. This procedure continues until all task batches are calculated.

Fig. 3 Illustration of our task partitioning scheme

We have implemented the whole process in two parallel parts: the thread level and the VPU level. On the thread level, the process of aligning S_i to S = {S_(i+1), ..., S_n} is grouped into a task, and each task is processed by a thread. On the VPU level, multiple pairwise comparisons are performed in parallel on the VPUs. In our method, S = {S_(i+1), ..., S_n} is packed into a 2D buffer which has 16 channels, meaning that sequence S_i can be aligned to 16 different sequences in the 16-channel buffer in parallel. We have used Knights Corner instructions to implement this part. Figure 4 shows the pseudo-code of our algorithm framework. In order to take advantage of both the CPUs and the Xeon Phis in a node to process MSA for large-scale datasets, we have implemented our algorithm framework using the offload model.

Fig. 4 The pseudo-code of our MSA algorithm framework on a single computing node

We have implemented the arithmetic operations specified by the equations in Definition 2 using a number of Knights Corner instructions (see Fig. 5) for the Xeon Phis. These instructions are executed on the VPUs to calculate sixteen-residue vectors of the alignment matrices according to Definition 2. On CPUs, the vector units fetch 8 residues each time. The core instructions used on CPUs are identical to those used on Xeon Phis, but they have been implemented using different 256-bit AVX intrinsic instructions.

Fig. 5 Xeon Phi vectorized implementation of pairwise alignment according to Definition 2 by dynamic programming using 25 core instructions. The variables in these instructions can be divided into two classes: vH, vE, vF, and vS are used in the Smith-Waterman algorithm; vN_A, vN_E, vN_F, and vN_S are defined in Definition 2. Here vN_A is the target vector and vN_S holds the value nid(S_i, S_j)

Before performing the alignment process, two temporary score vectors (the sprofile and the mprofile in Fig. 4) are created to help improve the I/O efficiency for loading the substitution matrix values and the m(i, j) values (see Definition 2) in parallel. Figure 6 shows an example of how to create these two temporary vectors. From Fig. 6 we can see that the substitution score matrix, the current database sequence vector, and the query sequence are used to create the sprofile and the mprofile. The VPUs make use of these two score vectors to load substitution values and m(i, j) values quickly. The shuffling procedure in Fig. 6 is used to help the VPUs fetch the corresponding values from the substitution matrix in parallel [7]. In our implementation, the sizes of these two temporary vectors are 16 for the Xeon Phi and 8 for the CPU.

Fig. 6 An example of how to create the sprofile and the mprofile for two sequence vectors to match the 'A' residue

We have designed and implemented a device-level dynamic task distribution framework to distribute tasks to both the CPU device and the Xeon Phi devices. Figure 7 shows our framework. In this framework, the task distributor is implemented as a critical section to prevent concurrent access to the shared tasks. It is also used to perform the dynamic distribution of tasks to the CPUs and Xeon Phis. In Fig. 7, both CPUs and Xeon Phis fetch and process multiple tasks in parallel. After the allocated tasks are processed, both devices send requests for new tasks to the distributor. All new task requests are first identified and queued by the distributor, which then serves the queued requests in order.

Fig. 7 Our device-level dynamic task distribution framework. The black dots denote tasks
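The task distributor just described can be pictured as a shared pool guarded by a critical section. The sketch below is a hypothetical reduction of that idea; the batch size and the data structure are our assumptions, not the paper's implementation.

```cpp
#include <algorithm>

// Sketch of the device-level dynamic distribution idea: CPU- and Phi-side
// worker threads repeatedly claim the next batch of tasks from a shared
// pool. The named critical section serializes access, mirroring the task
// distributor described above.
struct TaskPool {
    int next = 0, total = 0, batch = 64;

    // Returns [first, last) of the claimed batch; empty when the pool is drained.
    void fetch(int& first, int& last) {
        #pragma omp critical(task_pool)
        {
            first = next;
            last  = std::min(next + batch, total);
            next  = last;
        }
    }
};
```

Each worker loops on fetch() until it receives an empty range, which mirrors the request/queue cycle of Fig. 7.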
Cluster level data parallelization

Our approach is based on the fact that both subject database batches (for database searching) and MSA tasks can be scanned in parallel. Thus, we have implemented a cluster-level data-parallel algorithm for these two alignment applications. The cluster-level data-parallel algorithm runs on the master node, which partitions the subject database or the MSA tasks into a number of chunks that are sent to different compute nodes. Our approach is implemented using the following modules:

- Dispatcher (Master): Partitions the subject database or the MSA tasks into a number of chunks in a preprocessing step and sends them to the compute nodes.
- Algorithms on a Single Node (Worker): Receives sequence chunks from the master and performs the corresponding DP calculations.
- Result Collector (Master): Performs the additional operations required to further process the returned results.

Protein sequence database search

In our work, we have implemented a static dispatcher for our cluster-level parallel database searching algorithm. Figure 8 illustrates our method. The static dispatcher in the preprocessing stage first divides the database into several chunks with respect to the total number of nodes. The database chunks are then sent to the corresponding nodes for local searching. Since the compute power of the compute nodes may vary, the size of each database subset can also vary. In order to achieve load balancing among all nodes, we have implemented a sample test method: at the preprocessing stage (see Fig. 8), a sample test is first performed to explore the compute power of all compute nodes, and performance factors for the different nodes are generated automatically. We call this factor the compute power P_i for a node i. With the performance factor P_i, we can then calculate the appropriate size of the database subset allocated to node i.

Fig. 8 Illustration of our method to dispatch database subsets to all nodes. A node with more computing power is dispatched more sequences, which balances the workload at runtime
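Reading the sample test as a proportional rule, the subset sizes can be derived from the factors P_i as sketched below. The proportional split is the natural interpretation of the text; the rounding policy is our assumption.

```cpp
#include <numeric>
#include <vector>

// Sketch: node i receives a share of the database proportional to its
// measured compute power P_i (assumed positive; at least one node).
std::vector<long long> chunk_sizes(const std::vector<double>& P,
                                   long long db_residues) {
    const double total = std::accumulate(P.begin(), P.end(), 0.0);
    std::vector<long long> sizes(P.size());
    long long assigned = 0;
    for (size_t i = 0; i + 1 < P.size(); ++i) {
        sizes[i] = static_cast<long long>(db_residues * (P[i] / total));
        assigned += sizes[i];
    }
    sizes.back() = db_residues - assigned;  // last node absorbs rounding error
    return sizes;
}
```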
MSA

We have designed and implemented a cluster-level dynamic dispatcher to distribute tasks to the compute nodes. Figure 9 illustrates our method. The dynamic dispatcher first divides the dataset into a set of tasks which are organized as a task pool. Then, multiple tasks are sent to each node for local distance matrix computation. After the allocated tasks are processed, each node sends a request to the dispatcher for new tasks to process. This procedure continues until all tasks are processed.

Fig. 9 Illustration of our method to dispatch tasks dynamically to all nodes. The task partition method is illustrated in Fig. 3

Results and discussion

Test platforms

We have implemented the proposed methods using C++ and evaluated them on compute nodes with the following Xeon Phi cards (with ECC enabled) installed:

- Intel Xeon Phi 7110P: 61 hardware cores, 1.1 GHz processor clock speed, 8 GB GDDR5 device memory.
- Intel Xeon Phi 31S1P: 57 hardware cores, 1.1 GHz processor clock speed, 8 GB GDDR5 device memory.

Tests have been conducted on a Xeon Phi cluster with three compute nodes connected by an Ethernet switch. Each compute node has two Xeon E5 CPUs, 16 GB RAM, and an SSD hard disk. The cluster runs CentOS 6.5 with Linux kernel 2.6.32-431.17.1.el6.x86_64. The CPU configuration varies from node to node, as listed in Table 1.

Table 1 Test cluster configurations

  Node  CPU                            Coprocessor
  N1    Xeon E5-2620 (6 cores) × 2     Xeon Phi 7110P × 1
  N2    Xeon E5-2620v2 (6 cores) × 2   Xeon Phi 7110P × 2
  N3    Xeon E5-2650v2 (8 cores) × 2   Xeon Phi 31S1P × 4

Protein sequence database search

A performance measure commonly used in computational biology to evaluate Smith-Waterman implementations is cell updates per second (CUPS). One CUPS corresponds to the complete computation of one entry of the DP matrix per second, including all comparisons, additions, and maximum operations.
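GCUPS values follow directly from this definition: the number of DP cells (query length times total database residues) divided by the runtime, scaled to billions. A hypothetical helper:

```cpp
// GCUPS = (query length x total database residues) / (runtime x 10^9),
// i.e. billions of DP-matrix cell updates per second.
double gcups(long long query_len, long long db_residues, double seconds) {
    return static_cast<double>(query_len) * static_cast<double>(db_residues)
           / (seconds * 1e9);
}
```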
We have scanned three protein sequence databases: (i) the 7.5 GB UniProtKB/Reviewed and Annotated (5,943,361,275 residues in 16,110,751 sequences), (ii) the 18 GB UniProtKB/TrEMBL (13,630,914,768 residues in 42,821,879 sequences), and (iii) the 37 GB merged Non-Redundant plus UniProtKB/TrEMBL (24,323,686,690 residues in 73,401,766 sequences), using query sequences of varying lengths. The query sequences used in our tests have the accession numbers P01008, P42357, P56418, P07756, P19096, P0C6B8, P08519, and Q9UKN1.

Performance on a single node

We have first compared the single-node performance of our methods to SWAPHI [8] and CUDASW++ 3.1 [26]. SWAPHI is another parallel Smith-Waterman algorithm on Xeon Phi-based neo-heterogeneous architectures. It is also implemented using the offload model. However, SWAPHI can only run search tasks on the Xeon Phi; i.e., it does not exploit the computing power of the multi-core CPUs. SWAPHI also cannot handle search tasks for large-scale biological databases: in our tests, we find that the database size supported by SWAPHI is limited by the available RAM size, i.e. 16 GB. CUDASW++ 3.1 is currently the fastest available Smith-Waterman implementation for database searching. It makes use of the compute power of both the CPU and the GPU. On the CPU side, CUDASW++ 3.1 carries out parallel database searching by invoking the SWIPE [18] program. On the GPU side, it employs CUDA PTX SIMD video instructions to exploit data parallelism. The database size supported by CUDASW++ 3.1 is limited by the memory size available on the GPU. Neither SWAPHI nor CUDASW++ 3.1 supports clusters.

For single-node tests, we have used the N2 node (see Table 1) as the test platform. In our experiments, we run our methods with 24 threads on the two Intel E5-2620 v2 six-core 2.0 GHz CPUs and 240 threads on each Intel Xeon Phi 7110P. We execute SWAPHI with 240 threads on each Xeon Phi 7110P. We have executed CUDASW++ 3.1 on another server with the same two Intel E5-2620 v2 six-core 2.0 GHz CPUs plus two Nvidia Tesla Kepler K40 GPUs with ECC enabled; 24 CPU threads are also used for CUDASW++ 3.1. If not specified otherwise, default parameters are used for both SWAPHI and CUDASW++ 3.1, and all available compiler optimizations have been enabled. The parameters α = 10 and β = 2 have been used in our experiments. The substitution matrix used is BLOSUM62.

We have measured the time to compute the similarity matrices in order to calculate the computing CUPS values. Figure 10a shows the corresponding computing GCUPS values of our methods, SWAPHI, and CUDASW++ 3.1 for searching the 7.5 GB UniProtKB/Reviewed and Annotated protein database using different query sequences. From Fig. 10a we can see that the computing GCUPS of our multi-pass method is comparable to CUDASW++ 3.1; both achieve better performance than SWAPHI. SWAPHI and CUDASW++ 3.1 cannot support search tasks for the 18 GB and 37 GB databases. Thus, we only use our methods to search them. Figure 10a also reports the performance of our methods for searching these two databases. The results show that our methods can handle large-scale database search tasks efficiently.

Performance on a cluster

Figure 10b shows the performance of our methods using all three cluster nodes. The results indicate that our methods exhibit good scalability in terms of sequence length and size, and number of compute nodes. Our method achieves a peak overall performance of 730 GCUPS on the Xeon Phi-based cluster.

Fig. 10 (a) Performance comparison on a single node (N2) between our method, CUDASW++ 3.1, and SWAPHI. (b) Performance results of our method using all three compute nodes

MSA

A set of performance tests has been conducted using different protein sequence datasets to evaluate the processing time of the distance matrix computation step of our implementation in comparison to MSA-CUDA [32]. The datasets are extracted from the UniProtKB/Reviewed database; their details are listed in Table 2. We have used two groups of datasets in our tests. Datasets S1 to S6 are used to compare the performance of our method and MSA-CUDA; their sequence numbers are small because MSA-CUDA cannot handle datasets with large sequence numbers. Datasets L1 to L6 are used to evaluate the performance of our method on large-scale datasets; these datasets consist of at least 10,000 sequences.

The workload for computing a distance matrix grows quadratically with respect to the number of input sequences. The average sequence length of the dataset also has a great impact on the computing workload. We have used the following equation to measure the workload needed to process a dataset:

  W = Σ_{i=1}^{n} Σ_{j=i+1}^{n} L_i · L_j

where L_i denotes the length of the i-th sequence in the dataset. Thus, the workload W is the total number of matrix cells to be calculated. As our method uses a constant 25 instructions for calculating each cell (as listed in Fig. 5), the execution time grows linearly with W. Table 2 also lists the workload needed for processing each dataset.

Table 2 Test datasets for MSA

  Dataset  Avg. length  #Sequences  Workload (GCells)
  S1       465          200         4.35
  S2       472          400         17.84
  S3       474          600         40.52
  S4       476          800         72.56
  S5       476          1000        113.54
  S6       480          1200        164.13
  L1       150          30000       10891
  L2       382          16000       18692
  L3       935          10000       39148
  L4       274          40000       60246
  L5       1350         10000       88013
  L6       700          24000       133112
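For completeness, the workload W defined above can be evaluated in a single pass over the sequence lengths; the O(n) rewrite with a running suffix sum is an implementation convenience, not part of the paper.

```cpp
#include <vector>

// Workload W = sum over all pairs i < j of L_i * L_j, i.e. the total number
// of DP cells for the distance matrix computation.
long long workload(const std::vector<long long>& L) {
    long long suffix = 0, W = 0;
    for (int i = static_cast<int>(L.size()) - 1; i >= 0; --i) {
        W += L[i] * suffix;  // L_i times the sum of L_j for all j > i
        suffix += L[i];
    }
    return W;
}
```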
Performance for processing medium-scale datasets

For the medium-scale datasets S1 to S6, MSA-CUDA is benchmarked on a Tesla K40 GPU with default options and all available compiler optimizations enabled. Our implementation runs on an Intel Xeon Phi 7110P with 240 threads. Figure 11 shows the performance comparison between our method and MSA-CUDA. From Fig. 11 we can see that our implementation achieves significantly better performance than MSA-CUDA.

Fig. 11 Runtime (in seconds) for processing datasets S1 to S6. Our method runs on a Xeon Phi 7110P; MSA-CUDA runs on a Tesla K40 GPU

Performance for processing large-scale datasets

For the large-scale datasets L1 to L6, MSA-CUDA cannot work normally. We have run our methods on a single Intel Xeon Phi 7110P, on the N2 node, and on the cluster, respectively. The performance results are shown in Fig. 12. Figure 12 indicates that our methods exhibit very good scalability in terms of workload and number of compute nodes. Although the nodes in our cluster have different compute power, our dynamic task dispatching scheme still works efficiently. Moreover, our method on the cluster is able to process large-scale datasets that are rarely seen in other MSA implementations, while the runtime remains acceptable.

Fig. 12 Runtime (in seconds) for processing datasets L1 to L6. We have run our method on an Intel Xeon Phi 7110P, on the N2 node, and on the cluster, respectively

Conclusion

We have presented two parallel algorithms for protein sequence alignment based on the dynamic programming concept which can be efficiently mapped onto Xeon Phi clusters. Our methods exhibit good performance on a single compute node as well as good scalability in terms of sequence length and size, and number of compute nodes, for both protein sequence database search and the distance matrix computation employed in multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to other optimized Xeon Phi and GPU implementations. Biological sequence databases are continuously growing, establishing the need for even faster parallel solutions in the future. Hence, our results are especially encouraging, since the performance of many-core architectures grows much faster than Moore's law as it applies to CPUs. For instance, a performance improvement of at least a factor of 3 can be expected on the already announced next-generation Xeon Phi product.

Declarations

Publication of this article was funded by the PPP project from CSC and DAAD, Taishan Scholar, and NSFC Grants 61272056 and U1435222. This article has been published as part of BMC Bioinformatics Vol 17 Suppl 9, 2016: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2015: genomics. The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-9.

Availability of data and materials

Project name: LSDBS-mpi
Project homepage: https://github.com/turbo0628/LSDBS-mpi
Operating system: Linux
Programming language: C++

Authors' contributions

HL, BS, and WL designed the study and wrote and revised the manuscript. HL, YC, and KX implemented the algorithm, performed the tests, and analysed the results. BS, SP, and WL contributed the idea of using Knights Corner instructions and Xeon Phi clusters, participated in the algorithm optimization, and analysed the results. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Author details

School of Computer Science and Technology, Shandong University, Shunhua Road 1500, Jinan, Shandong, China. Johannes Gutenberg University, Mainz, Germany. School of Computer Science, National University of Defense Technology, Changsha, Hunan, China.

Published: 19 July 2016

References
1. Schmidt B, Schröder H, Schimmler M. Massively parallel solutions for molecular sequence analysis. International Parallel and Distributed Processing Symposium. IEEE; 2002. p. 0186.
2. Bader DA. Computational biology and high-performance computing. Commun ACM. 2004;47(11):34–41.
3. Rajko S, Aluru S. Space and time optimal parallel sequence alignments. IEEE Trans Parallel Distrib Syst. 2004;15(11):1070–81.
4. Liu Y, Schmidt B. SWAPHI: Smith-Waterman protein database search on Xeon Phi coprocessors. Application-specific Systems, Architectures and Processors (ASAP), 2014 IEEE 25th International Conference on. IEEE; 2014. p. 184–5.
5. Heinecke A, Vaidyanathan K, Smelyanskiy M, et al. Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel Xeon Phi coprocessor. Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE; 2013. p. 126–37.
6. Pennycook SJ, Hughes CJ, Smelyanskiy M, et al. Exploring SIMD for molecular dynamics, using Intel Xeon processors and Intel Xeon Phi coprocessors. Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE; 2013. p. 1085–97.
7. Wang L, Chan Y, Duan X, et al. XSW: Accelerating biological database search on Xeon Phi. Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. IEEE; 2014. p. 950–7.
8. Liu Y, Maskell DL, Schmidt B. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res Notes. 2009;2(1):73.
9. Lan H, Liu W, Schmidt B, et al. Accelerating large-scale biological database search on Xeon Phi-based neo-heterogeneous architectures. Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on. IEEE; 2015. p. 503–10.
10. Rucci E, García C, Botella G, Degiusti A, Naiouf M, Prieto-Matías M. An energy-aware performance analysis of SWIMM: Smith-Waterman implementation on Intel's multicore and manycore architectures. Concurr Comput Pract Experience. 2015;22(6):865–72.
11. Lu M, Zhang L, Huynh HP, et al. Optimizing the MapReduce framework on Intel Xeon Phi coprocessor. Big Data, 2013 IEEE International Conference on. IEEE; 2013. p. 125–30.
12. Thompson J, Higgins D, Gibson T. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–80.
13. Feng D, Doolittle R. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 1987;25:351–60.
14. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–25.
15. Wozniak A. Using video-oriented instructions to speed up sequence comparison. Comput Appl Biosci. 1997;13(2):145–50.
16. Rognes T, Seeberg E. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics. 2000;16(8):699–706.
17. Alpern B, Carter L, Su Gatlin K. Microparallelism and high-performance protein matching. Proceedings of the 1995 ACM/IEEE Conference on Supercomputing. ACM; 1995. p. 24.
18. Rognes T. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinforma. 2011;12:221.
19. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–7.
20. Notredame C, Higgins D, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000;302:205–17.
21. Chaichoompu K, Kittitornkun S, Tongsima S. MT-ClustalW: multithreading multiple sequence alignment. Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International. IEEE; 2006. p. 8.
22. Wirawan A, Kwoh CK, Hieu NT, et al. CBESW: sequence alignment on the PlayStation 3. BMC Bioinforma. 2008;9(1):377.
23. Szalkowski A, Ledergerber C, Krähenbühl P, et al. SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/BE and x86/SSE2. BMC Res Notes. 2008;1(1):107.
24. Liu W, Schmidt B, Voss G, Mueller-Wittig W. Streaming algorithms for biological sequence alignment on GPUs. IEEE Trans Parallel Distrib Syst. 2007;18(9):1270–81.
25. Liu Y, Schmidt B, Maskell DL. CUDASW++ 2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res Notes. 2010;3(1):93.
26. Liu Y, Wirawan A, Schmidt B. CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions. BMC Bioinforma. 2013;14(1):117.
27. Manavski S, Valle G. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinforma. 2008;9(2):1.
28. Ligowski L, Rudnicki W. An efficient implementation of Smith-Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. 2009 International Parallel and Distributed Processing Symposium. IEEE; 2009. p. 1–8.
29. Khajeh-Saeed A, Poole S, Perot JB. Acceleration of the Smith-Waterman algorithm using single and multiple graphics processors. J Comput Phys. 2010;229(11):4247–58.
30. Blazewicz J, Frohmberg W, Kierzynka M, Pesch E, Wojciechowski P. Protein alignment algorithms with an efficient backtracking routine on multiple GPUs. BMC Bioinforma. 2011;12:181.
31. Hains D, Cashero Z, Ottenberg M, et al. Improving CUDASW++, a parallelization of Smith-Waterman for CUDA enabled devices. Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on. IEEE; 2011. p. 490–501.
32. Liu Y, Schmidt B, Maskell DL. MSA-CUDA: multiple sequence alignment on graphics processing units with CUDA. Application-specific Systems, Architectures and Processors, 2009. ASAP 2009. 20th IEEE International Conference on. IEEE; 2009. p. 121–8.
33. Hung CL, Lin YS, Lin CY, Chung YC, Chung YF. CUDA ClustalW: an efficient parallel algorithm for progressive multiple sequence alignment on multi-GPUs. Comput Biol Chem. 2015;58:62–8.
34. Li K. ClustalW analysis using parallel and distributed computing. Bioinformatics. 2003;19:1585–6.
35. Ebedes J, Datta A. Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics. 2004;20:1193–5.
36. Cheetham J, Dehne F, Pitre S, et al. Parallel ClustalW for PC clusters. Computational Science and Its Applications – ICCSA 2003. Berlin Heidelberg: Springer; 2003. p. 300–9.
37. Tan J, Feng S, Sun N. Parallel multiple sequences alignment in SMP cluster. Int Conf High Perform Comput Asia Reg. 2005;20:425–31.
38. Oliver T, Schmidt B, Maskell D. Hyper customized processors for bio-sequence database scanning on FPGAs. Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays. ACM; 2005. p. 229–37.
39. Li ITS, Shum W, Truong K. 160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA). BMC Bioinforma. 2007;8(1):1.
40. Oliver T, Schmidt B, Nathan D, Clemens R, Maskell D. Using reconfigurable hardware to accelerate multiple sequence alignment with ClustalW. Bioinformatics. 2005;21:3431–2.
41. Boukerche A, Correa JM, de Melo ACMA, et al. An FPGA-based accelerator for multiple biological sequence alignment with DIALIGN. High Performance Computing – HiPC 2007. Berlin Heidelberg: Springer; 2007. p. 71–82.

Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters

Loading next page...
 
/lp/springer-journals/parallel-algorithms-for-large-scale-biological-sequence-alignment-on-nVxPMP1rWA

References (26)

Publisher
Springer Journals
Copyright
Copyright © 2016 by The Author(s)
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Algorithms
eISSN
1471-2105
DOI
10.1186/s12859-016-1128-0
pmid
27455061
Publisher site
See Article on Publisher Site

Abstract

Background: Computing alignments between two or more sequences are common operations frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. Results: This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Conclusions: Evaluations show that our method achieves a peak overall performance up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github. com/turbo0628/LSDBS-mpi. Keywords: Smith-Waterman, Dynamic programming, Pairwise sequence alignment, Multiple sequence alignment, Xeon Phi clusters Background approach to reduce associated runtimes is the implemen- Calculating similarity scores between a given query pro- tation of basic alignment algorithms on parallel computer tein sequence and all sequences of a database and comput- architectures [1–3]. More recently, the usage of mod- ing multiple sequence alignments are two common tasks ern massively parallel accelerator architectures such as in bioinformatics. Both tasks include iterative calculations CUDA-enabled GPUs has gained momentum [4]. In this of pairwise local alignments as a basic building block. paper we are investigating how a Xeon Phi-based compute This can lead to high runtimes for large-scale input data cluster can be used as a computational platform to acceler- sets. Since biological sequence databases are continuously ate alignment algorithms based on dynamic programming growing, finding fast solutions is of high importance. An for two applications: (i) databases scanning of protein sequence databases *Correspondence: [email protected] Equal contributors with the Smith-Waterman algorithm, and School of Computer Science and Technology, Shandong University, Shunhua Road 1500, Jinan, Shandong, China Full list of author information is available at the end of the article © 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Lan et al. 
BMC Bioinformatics 2016, 17(Suppl 9):267 Page 12 of 66 (ii) distance matrix computation for multiple sequence (iii) symmetric model. In this paper, we have chosen the alignment (i.e. the first stage of the popular ClustalW offload model. In this model, code sections and data can heuristic). be offloaded from the host CPU to the Xeon Phi. Using OpenMP pragmas, offload regions can be specified. When Three levels of parallelization are required in order to encountering such a region during program execution, the exploit the compute power available in a cluster of Xeon necessary data transfers between host and Xeon Phi are Phis. Parallelization within a Xeon Phi is usually based on performed and the code inside the (parallelized) region is the “scale-and-vectorize” approach: scaling across the up executed on the Xeon Phi. to 61 cores requires the usage of several hundred threads while exploiting the 512-bit wide vector units requires Pairwise sequence alignment and database search SIMD vectorization within each core. Recent examples The database search application considered in this paper of efficient parallelization on Xeon Phis include scientific scans a protein sequence database using a single pro- computing [5], bioinformatics [6–10], and database oper- tein sequence as a query (similar to BLASTP). Differ- ations [11]. Furthermore, parallelization between Xeon ent to the BLASTP heuristic, we calculate the score of Phis adds another level of message passing based paral- an optimal local alignment between the query and each lelism. This level needs to consider data partitioning, load subject sequence using the Smith-Waterman algorithm balancing, and task scheduling. The accelerator-based with affine gap penalties (instead of a seed-and-extend approach is motivated by the fact that the performance of approach). The subject sequences are ranked in terms many-core architectures is growing. For example, the 2nd of this score. Actual alignments are only computed for generation Xeon Phi processor named “Knight’s Landing” the top-ranked database sequences which only takes a has already been announced. negligible amount of time in comparison to the score- The rest of this paper is organized as follows. only search procedure. Note that the score-only Smith- The “Related work” Section provides important back- Waterman computation can be performed in linear space ground information about the Xeon Phi program- and quadratic time with respect to the length of the ming model, pairwise and multiple sequence alignment, alignment targets. and hardware accelerated alignment algorithms. Our Consider two protein sequences Q and S and length q single-node parallel algorithms are presented in the and s, respectively. We want to compute the score of an “Algorithms on a single node” Section. The “Cluster level optimal local alignment of Q and S with respect to a given data parallelization” Section describes our cluster-level scoring scheme consisting of a gap opening penalty α,a parallelization. Section “Results and discussion” evalu- gap extension penalty β and an amino acid substitution ates performance. Some conclusions are drawn in Section matrix sbt(). The well-known Smith-Waterman algorithm “Conclusion”. 
solves this problem by computing a dynamic program- ming matrix iteratively based on the following recurrence Related work relations: Programming models on Xeon Phi coprocessor Xeon Phi is a coprocessor connected via the PCI express H (i, j) = max{0, E(i, j), F(i, j), H (i − 1, j − 1) A A (PCIe) bus to a host CPU. From a hardware perspective, +sbt(Q[ i], S[ j] )} (1) it contains up to 61 86 compatible cores. Each core fea- E(i, j) = max{H (i, j − 1) − α, E(i, j − 1) − β} tures a 512-bit vector processing unit (VPU) based on a F(i, j) = max{H (i − 1, j) − α, F(i − 1, j) − β} new instruction set. The cache hierarchy contains a L1 A data cache of size 32 KB and a 512 KB per core L2 cache. The cores are connected via a bidirectional ring bus which The iterative computation of theses matrices is started enables L2 cache coherence based on a directory based with the initial values: H (i,0) = H (0, j) = E(i,0) = A A protocol. Each core can execute up to four threads at the F(0, j) = 0 for all 0 <= i <= q,0 <= j <= s. same time. Assuming a Xeon Phi with 61 usable cores running at Progressive multiple sequence alignment 1.238 GHz, we can determine the peak performance for The time complexity of computing an optimal multiple 32-bit integer (integer arithmetic is commonly used for alignment of more than two sequences grows exponen- sequence alignment calculations) operations as follows: 16 tially in terms of the number of input sequences. Thus, (#SIMD lanes) × 1 integer operation × 1.238 GHz × 61 heuristic approaches with polynomial complexities must (#cores) = 1.208 Tera integer operations per second. be used in practice for large inputs to approximate the From a software perspective, three programming mod- (generally unknown) optimal multiple alignment. els can be used in order to harness the compute power The multiple (protein) sequence alignment application of the Xeon Phi: (i) native model, (ii) offload model, and considered in this paper is the first stage of the popular Lan et al. BMC Bioinformatics 2016, 17(Suppl 9):267 Page 13 of 66 ClustalW heuristic [12]. ClustalW is based on the classi- Different from CUDA-enabled GPUs, a Xeon Phi provides cal progressive alignment approach [13] featuring a 3-step x86 compatibility, which often simplifies the implemen- pipeline (see Fig. 1): tation process. Nevertheless, achieving near-optimal per- formance is still a challenge which needs to be addressed (a) Distance matrix: For each input sequence pair, a by parallel algorithm design and efficient implementa- distance values is computed based on the tion. In this paper we demonstrate how this can be done Smith-Waterman algorithm for protein sequence database search and distance matrix (b) Guide tree: Using the distance matrix computed in computation for multiple sequence alignment. the previous step is taken as an input to compute an Compared to our previously presented LSBDS [9], we evolutionary tree using the neighbor-joining method introduce the following new contributions in this paper: [14]. (c) Progressive alignment: Following the branching We have designed new algorithms which can handle order of the tree a multiple sequence alignment is searching tasks for large-scale protein databases on build progressively. Xeon Phi clusters. We have designed new algorithms for calculating Hardware accelerated alignment algorithms large-scale multiple sequence alignments on Xeon We briefly review some previous work on accelerat- Phi clusters. 
ing pairwise alignment (based on Smith Waterman) We have implemented our multiple sequence and progressive multiple sequence alignment (based on alignment algorithm using the offload model to make ClustalW) on a number of parallel computer architec- full use of the compute power of both the multi-core tures. A number of SIMD implementations have been CPUs and the many-core Xeon Phi hardware. designed in order to harness the vector units of com- mon multi-core CPUs (e.g. [15–21]) or the the Cell/BE Methods (e.g. [22, 23]). Recent years has seen increased interests Algorithms on a single node in acceleration of sequence alignment on massively par- Protein sequence database search allel GPUs. Initially, programming these graphics chips We have observed two facts: (1) protein sequence for bioinformatics application still required programming database search has inherent data parallelism; (2) each with shaders using languages such as OpenGL [24]. The VPUonXeonPhi canexecute multiple integeroperations release of CUDA in 2007 made the usage GPUs for gen- in an SIMD parallel way efficiently. Based on these two eral purpose computing more accessible and subsequently facts, we have partitioned the database search process on a number of CUDA-enabled Smith-Waterman implemen- a single node into two data parallel parts: device level and tation have been presented in recent years [4, 25–33]. thread level. The device level data parallel part is encoded A number of MPI-based solutions for progressive mul- on the host CPU. It splits the subject database into multi- tiple sequence alignments are targeted towards PC clus- ple batches that can be distributed to CPU and Xeon Phi ters [34–37]. Another attractive architecture for sequence devices. The thread level data parallel part is used to pro- analysis are FPGAs [38–41] which are based on recon- cess data batches locally. In order to support search tasks figurable hardware. However, in comparison to the other for large-scale databases, we have designed a dynamic data mentioned architectures, FPGAs are often less accessible distribution framework to distribute these batches to both and generally more difficult to program. the host CPU device and the Xeon Phi devices. In order The solution in this paper is based on a cluster of Xeon to solve the performance loss problem for searching long Phis. Compared to common CPUs, a Xeon Phi contains query sequences, we have also proposed a multi-pass algo- significantly more cores and often a wider vector unit. rithm where long query sequences are partitioned into ab c Fig. 1 Illustration of the three stages of progressive multiple alignment (see text for details) Lan et al. BMC Bioinformatics 2016, 17(Suppl 9):267 Page 14 of 66 0, if H (i, j) = 0 multiple short subsequences for consecutive searching ⎪ A passes. We have presented more implementation details N (i − 1, j − 1) + m(i, j), of our algorithm in [9]. A if H (i, j) = H (i − 1, j − 1) A A MSA N (i, j) = + sbt(S [ i], S [ j] ) 1 2 The distance matrix computation stage of ClustalW is typically a major runtime bottleneck. Thus, in our work ⎪ N (i, j),if H (i, j) = E(i, j) E A we have only concentrated on designing a parallel algo- rithm for this stage. ClustalW bases the distance compu- N (i, j);if H (i, j) = F(i, j) F A tation between two protein sequences on the following where concept [24]: 1, if S [ i] = S [ j] 1 2 m(i, j) = Definition 1. Consider two sequences S , S ∈ S = 0; otherwise i j {S , ... S }. 
Hardware accelerated alignment algorithms
We briefly review some previous work on accelerating pairwise alignment (based on Smith-Waterman) and progressive multiple sequence alignment (based on ClustalW) on a number of parallel computer architectures. A number of SIMD implementations have been designed in order to harness the vector units of common multi-core CPUs (e.g. [15–21]) or the Cell/BE (e.g. [22, 23]). Recent years have seen increased interest in the acceleration of sequence alignment on massively parallel GPUs. Initially, programming these graphics chips for bioinformatics applications still required programming with shaders using languages such as OpenGL [24]. The release of CUDA in 2007 made the usage of GPUs for general-purpose computing more accessible, and subsequently a number of CUDA-enabled Smith-Waterman implementations have been presented in recent years [4, 25–33]. A number of MPI-based solutions for progressive multiple sequence alignment are targeted towards PC clusters [34–37]. Another attractive architecture for sequence analysis are FPGAs [38–41], which are based on reconfigurable hardware. However, in comparison to the other mentioned architectures, FPGAs are often less accessible and generally more difficult to program.

The solution in this paper is based on a cluster of Xeon Phis. Compared to common CPUs, a Xeon Phi contains significantly more cores and often a wider vector unit.

Methods
Algorithms on a single node
Protein sequence database search
We have observed two facts: (1) protein sequence database search has inherent data parallelism; (2) each VPU on Xeon Phi can execute multiple integer operations in an SIMD-parallel way efficiently. Based on these two facts, we have partitioned the database search process on a single node into two data-parallel parts: device level and thread level. The device-level data-parallel part is encoded on the host CPU. It splits the subject database into multiple batches that can be distributed to CPU and Xeon Phi devices. The thread-level data-parallel part is used to process data batches locally. In order to support search tasks for large-scale databases, we have designed a dynamic data distribution framework to distribute these batches to both the host CPU device and the Xeon Phi devices. In order to solve the performance loss problem for searching long query sequences, we have also proposed a multi-pass algorithm where long query sequences are partitioned into multiple short subsequences for consecutive searching passes. We have presented more implementation details of our algorithm in [9]. A sketch of the batch distribution loop is given below.
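The following hypothetical sketch illustrates such a dynamic batch distribution: workers on all devices pull batches from a shared pool through an atomic counter, so faster devices automatically consume more batches. The types and names are ours, not the paper's data structures.

```cpp
#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

// One precomputed slice of the subject database.
struct Batch { const char* residues; std::size_t size; };

class BatchPool {
public:
    explicit BatchPool(std::vector<Batch> batches)
        : batches_(std::move(batches)), next_(0) {}

    // Hands out the next unprocessed batch; nullptr when exhausted.
    const Batch* fetch() {
        std::size_t i = next_.fetch_add(1, std::memory_order_relaxed);
        return i < batches_.size() ? &batches_[i] : nullptr;
    }

private:
    std::vector<Batch> batches_;
    std::atomic<std::size_t> next_;
};

// Worker loop: CPU workers call the vectorized kernel directly, while
// Phi workers wrap the same loop in an offload region.
void worker(BatchPool& pool /*, query, scores, ... */) {
    while (const Batch* b = pool.fetch()) {
        // scan_batch(query, *b, scores);  // SW kernel from recurrence (1)
    }
}
```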
MSA
The distance matrix computation stage of ClustalW is typically a major runtime bottleneck. Thus, in our work we have only concentrated on designing a parallel algorithm for this stage. ClustalW bases the distance computation between two protein sequences on the following concept [24]:

Definition 1. Consider two sequences S_i, S_j ∈ S = {S_1, ..., S_n}. The following equation defines their distance d(S_i, S_j):

d(S_i, S_j) = 1 − nid(S_i, S_j) / min{l_i, l_j}

whereby nid(S_i, S_j) is defined as the number of exact matches in an optimal local alignment between S_i and S_j, and l_i (l_j) is the length of S_i (S_j).

The value nid(S_i, S_j) can be calculated in the Smith-Waterman traceback procedure by counting the number of exact character matches. Figure 2 illustrates this method. However, this direct method does not work well for long sequences and large-scale datasets because it needs to store the whole DP matrix. In order to solve this problem, we have adapted the method presented in [24] to do the nid-value computation on the Xeon Phi architecture. That is, we have used the following definition and theorem to calculate the nid-value without doing the actual traceback.

[Fig. 2 An example of how to compute the nid-value in the traceback procedure. The matrix H_A(i, j) is shown for a linear gap penalty α = 1 and a substitution score of +3 for an exact match and −1 otherwise. The nid-value here is five]

Definition 2. Consider two protein sequences S_1 and S_2, affine gap penalties α, β, and substitution matrix sbt. The matrix N_A(i, j) (1 ≤ i ≤ l_1, 1 ≤ j ≤ l_2) is defined in terms of the following recurrence relations:

N_A(i, j) =
  0,                            if H_A(i, j) = 0
  N_A(i − 1, j − 1) + m(i, j),  if H_A(i, j) = H_A(i − 1, j − 1) + sbt(S_1[i], S_2[j])
  N_E(i, j),                    if H_A(i, j) = E(i, j)
  N_F(i, j),                    if H_A(i, j) = F(i, j)

where

m(i, j) =
  1,  if S_1[i] = S_2[j]
  0,  otherwise

N_E(i, j) =
  0,              if j = 1
  N_A(i, j − 1),  if E(i, j) = H_A(i, j − 1) − α
  N_E(i, j − 1),  if E(i, j) = E(i, j − 1) − β

N_F(i, j) =
  0,              if i = 1
  N_A(i − 1, j),  if F(i, j) = H_A(i − 1, j) − α
  N_F(i − 1, j),  if F(i, j) = F(i − 1, j) − β

The nid-value is then given by

nid(S_1, S_2) = N_A(i_max, j_max)

where (i_max, j_max) denote the coordinates of the maximum value in the corresponding pairwise local alignment DP matrix H_A.
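A scalar C++ sketch of this traceback-free scheme follows: the counter matrices N_A, N_E, N_F are carried along with H, E, F in rolling rows, so the full DP matrix is never stored. Tie-breaking among equal cases follows the order in which Definition 2 lists them; this is our reading, not the paper's actual vectorized code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns d(S1, S2) = 1 - nid / min{l1, l2} per Definitions 1 and 2.
double distance(const std::string& S1, const std::string& S2,
                int (*sbt)(char, char), int alpha, int beta) {
    const int l2 = static_cast<int>(S2.size());
    std::vector<int> H(l2 + 1, 0), F(l2 + 1, 0);    // scores
    std::vector<int> NA(l2 + 1, 0), NF(l2 + 1, 0);  // match counters
    int bestH = 0, bestN = 0;
    for (std::size_t i = 1; i <= S1.size(); ++i) {
        int E = 0, NE = 0;
        int diagH = 0, diagN = 0;  // H and N_A at (i-1, j-1)
        for (int j = 1; j <= l2; ++j) {
            // E(i,j) with counter N_E(i,j); H[j-1]/NA[j-1] hold row i values
            if (H[j - 1] - alpha >= E - beta) { E = H[j - 1] - alpha; NE = NA[j - 1]; }
            else { E -= beta; }                                  // N_E carried over
            // F(i,j) with counter N_F(i,j); H[j]/NA[j] still hold row i-1
            if (H[j] - alpha >= F[j] - beta) { NF[j] = NA[j]; F[j] = H[j] - alpha; }
            else { F[j] -= beta; }                               // N_F carried over
            const int m = (S1[i - 1] == S2[j - 1]) ? 1 : 0;      // m(i,j)
            const int diagScore = diagH + sbt(S1[i - 1], S2[j - 1]);
            const int h = std::max({0, E, F[j], diagScore});
            int n;
            if (h == 0)              n = 0;
            else if (h == diagScore) n = diagN + m;
            else if (h == E)         n = NE;
            else                     n = NF[j];
            diagH = H[j]; diagN = NA[j];  // becomes (i-1, j) for column j+1
            H[j] = h; NA[j] = n;
            if (h > bestH) { bestH = h; bestN = n; }  // tracks (i_max, j_max)
        }
    }
    const double lmin = static_cast<double>(std::min(S1.size(), S2.size()));
    return 1.0 - bestN / lmin;
}
```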
Input data set sizes for MSA are typically smaller than for database search (protein sequence databases typically contain many millions of sequences, while large-scale MSAs are computed for a few thousand protein sequences), making the subject sequence set for distance matrix computation comparatively small. In order to design an efficient parallel distance matrix computation algorithm on Xeon Phi, we have used the task partitioning method shown in Fig. 3. In our method, the sequences are sorted by their lengths and then partitioned into smaller-sized batches. In an alignment task, a query sequence is aligned to the corresponding sequence batch. This procedure continues until all task batches are calculated. We have implemented the whole process in two parallel parts: the thread level and the VPU level. On the thread level, the process aligning S_i to {S_(i+1), ..., S_n} is grouped into task_i, and each task is processed by a thread. On the VPU level, multiple pairwise comparisons are performed in parallel on the VPUs. In our method, {S_(i+1), ..., S_n} is packed into a 2D buffer which has 16 channels, meaning that sequence S_i can be aligned to 16 different sequences in the 16-channel buffer in parallel. We have used Knights Corner instructions to implement this part. Figure 4 shows the pseudo-code of our algorithm framework; a sketch of the channel packing is given after the figure captions below.

[Fig. 3 Illustration of our task partitioning scheme]

[Fig. 4 The pseudo-code of our MSA algorithm framework on a single computing node]
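The following sketch shows one plausible way to realize the 16-channel packing; the position-major layout, constant names, and padding sentinel are our assumptions, not the paper's actual buffer format.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

constexpr int kChannels = 16;    // 512-bit VPU with 32-bit integer lanes
constexpr int8_t kPad = 23;      // assumed sentinel residue code for padding

// Interleaves up to 16 subject sequences so that SIMD lane k always
// processes sequence k; shorter sequences are padded to the longest one.
std::vector<int8_t> pack16(const std::vector<std::string>& group) {
    std::size_t maxLen = 0;
    for (const std::string& s : group) maxLen = std::max(maxLen, s.size());
    std::vector<int8_t> buf(maxLen * kChannels, kPad);
    for (int c = 0; c < kChannels && c < static_cast<int>(group.size()); ++c)
        for (std::size_t p = 0; p < group[c].size(); ++p)
            buf[p * kChannels + c] = static_cast<int8_t>(group[c][p]);
    return buf;  // position-major: all 16 residues of position p are adjacent
}
```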
In order to take advantage of both the CPUs and the Xeon Phis in a node to process MSA for large-scale datasets, we have implemented our algorithm framework using the offload model. We have implemented the arithmetic operations specified by the equations in Definition 2 using a number of Knights Corner instructions (see Fig. 5) for the Xeon Phis. These instructions are executed on the VPUs to calculate sixteen residue vectors of the alignment matrices according to Definition 2. For the CPUs, the VPUs fetch 8 residues each time; the core instructions used on the CPUs are identical to those for the Xeon Phis, but they have been implemented using different 256-bit AVX intrinsic instructions.

[Fig. 5 Xeon Phi vectorized implementation of pairwise alignment according to Definition 2 by dynamic programming using 25 core instructions. The variables in these instructions can be divided into two classes: one class includes vH, vE, vF and vS, which are used in the Smith-Waterman algorithm; the other class contains vN_A, vN_E, vN_F and vN_S, which are defined in Definition 2. Here vN_A is the target vector holding the value nid(S_i, S_j)]

Before performing the alignment process, two temporary score vectors (the sprofile and the mprofile in Fig. 4) are created to help improve the I/O efficiency of loading the substitution matrix values and the m(i, j) values (see Definition 2) in parallel. Figure 6 shows an example of how to create these two temporary vectors. From Fig. 6 we can see that the substitution score matrix, the current database sequence vector, and the query sequence are used to create the sprofile and the mprofile. The VPUs make use of these two score vectors to load substitution values and m(i, j) values quickly. The shuffling procedure in Fig. 6 is used to help the VPUs fetch the corresponding values from the substitution matrix in parallel [7]. In our implementation, the sizes of these two temporary vectors for Xeon Phi and CPU are 16 and 8, respectively. A scalar sketch of the profile semantics follows the figure caption.

[Fig. 6 An example of how to create the sprofile and the mprofile for two sequence vectors to match the 'A' residue]
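The scalar sketch below is our reconstruction of what the shuffle-based gather computes for one query residue, not the actual Knights Corner code: sprofile holds the substitution scores of the 16 current database residues against the query residue, and mprofile holds the corresponding m(i, j) indicators.

```cpp
#include <array>
#include <cstdint>

// sbt_row: the substitution-matrix row for query residue q (codes 0..31);
// db16: the 16 database residues currently held in the channel buffer.
void build_profiles(const int8_t sbt_row[32], const int8_t db16[16], int8_t q,
                    std::array<int8_t, 16>& sprofile,
                    std::array<int8_t, 16>& mprofile) {
    for (int k = 0; k < 16; ++k) {
        sprofile[k] = sbt_row[db16[k]];         // gathered substitution scores
        mprofile[k] = (db16[k] == q) ? 1 : 0;   // exact-match indicator m(i,j)
    }
}
```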
We have designed and implemented a device-level dynamic task distribution framework to distribute tasks to both the CPU device and the Xeon Phi device. Figure 7 shows our framework. In this framework, the task distributor is implemented as a critical section to prevent concurrent access to the shared tasks. It is also used to perform the dynamic distribution of tasks to the CPUs and Xeon Phis. In Fig. 7, both CPUs and Xeon Phis fetch and process multiple tasks in parallel. After the allocated tasks are processed, both devices send requests to the task distributor to ask for new tasks. All new task requests are first identified and queued by the task distributor, which then distributes tasks from the queue in order. A sketch of the guarded task fetch is given below.

[Fig. 7 Our device-level dynamic task distribution framework. The black dots denote tasks]
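A minimal OpenMP sketch of such a guarded task fetch follows; the chunk size and the task type are assumed tuning parameters, not values from the paper.

```cpp
#include <queue>
#include <vector>

struct Task { int id; };  // placeholder for one distance-matrix task

// Each device team calls fetch_tasks() repeatedly until the pool is empty.
std::vector<Task> fetch_tasks(std::queue<Task>& pool, int chunk) {
    std::vector<Task> mine;
    #pragma omp critical(task_pool)  // serializes access to the shared pool
    {
        for (int i = 0; i < chunk && !pool.empty(); ++i) {
            mine.push_back(pool.front());
            pool.pop();
        }
    }
    return mine;
}
```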
Cluster level data parallelization
Our approach is based on the fact that both subject database batches (for database searching) and MSA tasks can be scanned in parallel. Thus, we have implemented a cluster-level data-parallel algorithm for these two alignment applications. The cluster-level data-parallel algorithm is encoded on the master node. The master node partitions the subject database or the MSA tasks into a number of chunks that are sent to different compute nodes. Our approach is implemented using the following modules:

Dispatcher (Master): Partitions the subject database or the MSA tasks into a number of chunks in a preprocessing step and sends them to the compute nodes.
Algorithms on a Single Node (Worker): Receives sequence chunks from the master and performs the corresponding DP calculations.
Result Collector (Master): Performs additional operations required to further process the returned results.

Protein sequence database search
In our work, we have implemented a static dispatcher for our cluster-level parallel database searching algorithm. Figure 8 illustrates our method. In Fig. 8, the static dispatcher first divides the database into several chunks in the preprocessing stage, with respect to the total number of nodes. The database chunks are then sent to the corresponding nodes for local searching. Since the compute power of the compute nodes may vary, the size of each database subset can also vary. In order to achieve load balancing among all nodes, we have implemented a sample test method. In our method, at the preprocessing stage (see Fig. 8), a sample test is first performed to explore the compute power of all compute nodes. Performance factors of the different nodes are then generated automatically. In our work, we name this factor the compute power P_i of node i. With the performance factor P_i, we can then calculate the appropriate size of the database subset allocated to node i, as in the sketch below.

[Fig. 8 Illustration of our method to dispatch database subsets to all nodes. A node that has more compute power is dispatched more sequences, which balances the workload at runtime]
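The following sketch shows the resulting proportional chunk sizing; the function name is ours, and the performance factors are assumed to come from the short sample test described above (at least one node is assumed).

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Node i receives a database slice proportional to its measured compute
// power P[i]; any rounding remainder goes to the last node.
std::vector<std::size_t> chunk_sizes(std::size_t totalResidues,
                                     const std::vector<double>& P) {
    const double sum = std::accumulate(P.begin(), P.end(), 0.0);
    std::vector<std::size_t> sizes(P.size());
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < P.size(); ++i) {
        sizes[i] = static_cast<std::size_t>(totalResidues * (P[i] / sum));
        assigned += sizes[i];
    }
    sizes.back() = totalResidues - assigned;
    return sizes;
}
```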
MSA
We have designed and implemented a cluster-level dynamic dispatcher to distribute tasks to the compute nodes. Figure 9 illustrates our method. In this method, the dynamic dispatcher first divides the dataset into a set of tasks which are organized as a task pool. Then, multiple tasks are sent to each node for local distance matrix computation. After the allocated tasks are processed, each node sends a request to the dispatcher to ask for new tasks to process. This procedure continues until all tasks are processed.

[Fig. 9 Illustration of our method to dispatch tasks dynamically to all nodes. The task partition method is illustrated in Fig. 3]

Results and discussion
Test platforms
We have implemented the proposed methods using C++ and evaluated them on compute nodes with the following Xeon Phi cards (with ECC enabled) installed:

- Intel Xeon Phi 7110P: 61 hardware cores, 1.1 GHz processor clock speed, 8 GB GDDR5 device memory.
- Intel Xeon Phi 31S1P: 57 hardware cores, 1.1 GHz processor clock speed, 8 GB GDDR5 device memory.

Tests have been conducted on a Xeon Phi cluster with three compute nodes that are connected by an Ethernet switch. There are two Xeon E5 CPUs and 16 GB RAM on each compute node. The cluster runs CentOS 6.5 with the Linux kernel 2.6.32-431.17.1.el6.x86_64. The CPU configuration of each node varies, as listed in Table 1. We also have SSD hard disks installed on each compute node.

Table 1 Test cluster configurations
Node  CPU                            Coprocessor
N1    Xeon E5-2620 (6 cores) × 2     Xeon Phi 7110P × 1
N2    Xeon E5-2620v2 (6 cores) × 2   Xeon Phi 7110P × 2
N3    Xeon E5-2650v2 (8 cores) × 2   Xeon Phi 31S1P × 4

Protein sequence database search
A performance measure commonly used in computational biology to evaluate Smith-Waterman implementations is cell updates per second (CUPS). A CUPS represents the time for a complete computation of one entry of the DP matrix, including all comparisons, additions and maxima operations; GCUPS denotes billions of cell updates per second. A small helper expressing this metric is shown below.
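Assuming the standard GCUPS derivation for database scans, the reported values correspond to the following computation; the helper is ours, for illustration only.

```cpp
#include <cstddef>

// Scanning a database evaluates |query| x (total database residues) DP
// cells; dividing the cell count by the runtime in seconds and by 1e9
// yields billions of cell updates per second (GCUPS).
double gcups(std::size_t queryLen, std::size_t dbResidues, double seconds) {
    return static_cast<double>(queryLen) * static_cast<double>(dbResidues)
           / (seconds * 1e9);
}
```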
We have scanned three protein sequence databases: (i) the 7.5 GB UniProtKB/Reviewed and Annotated (5,943,361,275 residues in 16,110,751 sequences), (ii) the 18 GB UniProtKB/TrEMBL (13,630,914,768 residues in 42,821,879 sequences), and (iii) the 37 GB merged Non-Redundant plus UniProtKB/TrEMBL (24,323,686,690 residues in 73,401,766 sequences), using query sequences of varying lengths. The query sequences used in our tests have the accession numbers P01008, P42357, P56418, P07756, P19096, P0C6B8, P08519, and Q9UKN1.

Performance on a single node
We have first compared the single-node performance of our methods to SWAPHI [8] and CUDASW++ 3.1 [26]. SWAPHI is another parallel Smith-Waterman algorithm for Xeon Phi-based neo-heterogeneous architectures. It is also implemented using the offload model. However, SWAPHI can only run search tasks on the Xeon Phi; i.e. it does not exploit the computing power of the multi-core CPUs. SWAPHI cannot handle search tasks for large-scale biological databases: in our tests, we find that the database size limitation for SWAPHI is less than the available RAM size, i.e. 16 GB. CUDASW++ 3.1 is currently the fastest available Smith-Waterman implementation for database searching. It makes use of the compute power of both the CPU and the GPU. On the CPU side, CUDASW++ 3.1 carries out parallel database searching by invoking the SWIPE [18] program. It employs CUDA PTX SIMD video instructions to gain data parallelism on the GPU side. The database size supported by CUDASW++ 3.1 is limited by the memory size available on the GPU. Neither SWAPHI nor CUDASW++ 3.1 supports clusters.

For the single-node tests, we have used the N2 node (see Table 1) as the test platform. In our experiments, we run our methods with 24 threads on the two Intel E5-2620v2 six-core 2.0 GHz CPUs and 240 threads on each Intel Xeon Phi 7110P. We execute SWAPHI with 240 threads on each Xeon Phi 7110P. We have executed CUDASW++ 3.1 on another server with the same two Intel E5-2620v2 six-core 2.0 GHz CPUs plus two Nvidia Tesla Kepler K40 GPUs with ECC enabled; 24 CPU threads are also used for CUDASW++ 3.1. If not specified otherwise, default parameters are used for both SWAPHI and CUDASW++ 3.1. Furthermore, all available compiler optimizations have been enabled. The gap penalties α = 10 and β = 2 have been used in our experiments. The substitution matrix used is BLOSUM62.

We have measured the time to compute the similarity matrices to calculate the computing CUPS values. Figure 10a shows the corresponding computing GCUPS of our methods, SWAPHI, and CUDASW++ 3.1 for searching the 7.5 GB UniProtKB/Reviewed and Annotated protein database using different query sequences. From Fig. 10a we can see that the computing GCUPS of our multi-pass method is comparable to CUDASW++ 3.1; both achieve better performance than SWAPHI. SWAPHI and CUDASW++ 3.1 cannot support search tasks for the 18 GB and 37 GB databases; thus, we only use our methods to search them. Figure 10a also reports the performance of our methods for searching these two databases. The results show that our methods can handle large-scale database search tasks efficiently.

Performance on a cluster
Figure 10b shows the performance of our methods using all three cluster nodes. The results indicate that our methods exhibit good scalability in terms of sequence length and size, and number of compute nodes. Our method achieves a peak overall performance of 730 GCUPS on the Xeon Phi-based cluster.

[Fig. 10 a Performance comparison on a single node (N2) between our method, CUDASW++ 3.1 and SWAPHI. b Performance results of our method using all three compute nodes]

MSA
A set of performance tests has been conducted using different protein sequence datasets to evaluate the processing time of the distance matrix computation step of our implementation in comparison to MSA-CUDA [32]. The datasets are extracted from the UniProtKB/Reviewed database; their details are listed in Table 2. We have used two groups of datasets in our tests. Datasets S1 to S6, whose sequence numbers are small, are used to compare the performance of our method and MSA-CUDA, since MSA-CUDA cannot handle datasets with large sequence numbers. Datasets L1 to L6 are used to evaluate the performance of our method for handling large-scale datasets; these datasets consist of at least 10,000 sequences each.

Table 2 Test datasets for MSA
Dataset  Avg. length  #Sequences  Workload (GCells)
S1       465          200         4.35
S2       472          400         17.84
S3       474          600         40.52
S4       476          800         72.56
S5       476          1000        113.54
S6       480          1200        164.13
L1       150          30000       10891
L2       382          16000       18692
L3       935          10000       39148
L4       274          40000       60246
L5       1350         10000       88013
L6       700          24000       133112

The workload for computing a distance matrix grows quadratically with respect to the number of input sequences. The average sequence length of the dataset also has a great impact on the computing workload. We have used the following equation to measure the workload needed to process a dataset:

W = Σ_{i=1}^{n} Σ_{j=i+1}^{n} L_i · L_j

where L_i denotes the length of the ith sequence in the dataset. Thus, the workload W is the total number of matrix cells to be calculated. As our method uses a constant number of 25 instructions for calculating each cell (as listed in Fig. 5), the execution time grows linearly with W. Table 2 also lists the workload needed for processing each dataset; the sketch below shows a direct way to evaluate W.
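The workload formula can be evaluated without the quadratic pair loop via the identity Σ_{i<j} L_i·L_j = (T² − Q) / 2, with T = Σ L_i and Q = Σ L_i²; the helper below is ours, for illustration only.

```cpp
#include <cstdint>
#include <vector>

// Total number of DP cells needed for the distance matrix of a dataset.
std::uint64_t workload(const std::vector<std::uint64_t>& L) {
    std::uint64_t T = 0, Q = 0;
    for (std::uint64_t len : L) { T += len; Q += len * len; }
    return (T * T - Q) / 2;  // sum of L_i * L_j over all unordered pairs
}
```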
Performance for processing medium-scale datasets
For the medium-scale datasets S1 to S6, MSA-CUDA is benchmarked on a Tesla K40 GPU with default options and all available compiler optimizations enabled. Our implementation runs on an Intel Xeon Phi 7110P with 240 threads. Figure 11 shows the performance comparison between our method and MSA-CUDA. From Fig. 11 we can see that our implementation achieves significantly better performance than MSA-CUDA.

[Fig. 11 Runtime (in seconds) for processing datasets S1 to S6. Our method runs on a Xeon Phi 7110P; MSA-CUDA runs on a Tesla K40 GPU]

Performance for processing large-scale datasets
For the large-scale datasets L1 to L6, MSA-CUDA cannot work normally. We have therefore run our methods on a single Intel Xeon Phi 7110P, on the N2 node, and on the cluster, respectively. The performance results are shown in Fig. 12. Figure 12 indicates that our methods exhibit very good scalability in terms of workload and number of compute nodes. Although the nodes in our cluster have different compute power, our dynamic task dispatching scheme still works efficiently. Moreover, our method on the cluster is able to process large-scale datasets that are rarely seen in other MSA implementations, while the runtime remains acceptable.

[Fig. 12 Runtime (in seconds) for processing datasets L1 to L6. We have run our method on an Intel Xeon Phi 7110P, on the N2 node, and on the cluster, respectively]

Conclusion
We have presented two parallel algorithms for protein sequence alignment based on the dynamic programming concept which can be efficiently mapped onto Xeon Phi clusters. Our methods exhibit good performance on a single compute node as well as good scalability in terms of sequence length and size, and number of compute nodes, for both protein sequence database search and the distance matrix computation employed in multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to other optimized Xeon Phi and GPU implementations. Biological sequence databases are continuously growing, establishing the need for even faster parallel solutions in the future. Hence, our results are especially encouraging since the performance of many-core architectures grows much faster than Moore's law as it applies to CPUs. For instance, a performance improvement of at least a factor of 3 can be expected on the already announced next-generation Xeon Phi product.

Declarations
Publication of this article was funded by the PPP project from CSC and DAAD, Taishan Scholar, and NSFC Grants 61272056 and U1435222. This article has been published as part of BMC Bioinformatics Vol 17 Suppl 9, 2016: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2015: genomics. The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-9.

Availability of data and materials
Project name: LSDBS-mpi
Project homepage: https://github.com/turbo0628/LSDBS-mpi
Operating system: Linux
Programming language: C++

Authors' contributions
HL, BS, and WL designed the study, wrote and revised the manuscript. HL, YC, and KX implemented the algorithm, performed the tests, and analysed the results. BS, SP, and WL contributed the idea of using Knights Corner instructions and Xeon Phi clusters, participated in the algorithm optimization, and analysed the results. All authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Consent for publication
Not applicable.

Ethics approval and consent to participate
Not applicable.

Author details
School of Computer Science and Technology, Shandong University, Shunhua Road 1500, Jinan, Shandong, China. Johannes Gutenberg University, Mainz, Germany. School of Computer Science, National University of Defense Technology, Changsha, Hunan, China.

Published: 19 July 2016

References
1. Schmidt B, Schröder H, Schimmler M. Massively parallel solutions for molecular sequence analysis. International Parallel and Distributed Processing Symposium. IEEE; 2002. p. 0186.
2. Bader DA. Computational biology and high-performance computing. Commun ACM. 2004;47(11):34–41.
3. Rajko S, Aluru S. Space and time optimal parallel sequence alignments. IEEE Trans Parallel Distrib Syst. 2004;15(11):1070–81.
4. Liu Y, Schmidt B. SWAPHI: Smith-Waterman protein database search on Xeon Phi coprocessors. Application-specific Systems, Architectures and Processors (ASAP), 2014 IEEE 25th International Conference on. IEEE; 2014. p. 184–5.
5. Heinecke A, Vaidyanathan K, Smelyanskiy M, et al. Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel Xeon Phi coprocessor. Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE; 2013. p. 126–37.
6. Pennycook SJ, Hughes CJ, Smelyanskiy M, et al. Exploring SIMD for molecular dynamics, using Intel Xeon processors and Intel Xeon Phi coprocessors. Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE; 2013. p. 1085–97.
7. Wang L, Chan Y, Duan X, et al. XSW: Accelerating biological database search on Xeon Phi. Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. IEEE; 2014. p. 950–7.
8. Liu Y, Maskell DL, Schmidt B. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res Notes. 2009;2(1):73.
9. Lan H, Liu W, Schmidt B, et al. Accelerating large-scale biological database search on Xeon Phi-based neo-heterogeneous architectures. Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on. IEEE; 2015. p. 503–10.
10. Rucci E, García C, Botella G, Degiusti A, Naiouf M, Prieto-Matías M. An energy-aware performance analysis of SWIMM: Smith-Waterman implementation on Intel's multicore and manycore architectures. Concurr Comput Pract Experience. 2015;22(6):865–72.
11. Lu M, Zhang L, Huynh HP, et al. Optimizing the MapReduce framework on Intel Xeon Phi coprocessor. Big Data, 2013 IEEE International Conference on. IEEE; 2013. p. 125–30.
12. Thompson J, Higgins D, Gibson T. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–80.
13. Feng D, Doolittle R. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 1987;25:351–60.
14. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–25.
15. Wozniak A. Using video-oriented instructions to speed up sequence comparison. Comput Appl Biosci. 1997;13(2):145–50.
16. Rognes T, Seeberg E. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics. 2000;16(8):699–706.
17. Alpern B, Carter L, Su Gatlin K. Microparallelism and high-performance protein matching. Proceedings of the 1995 ACM/IEEE Conference on Supercomputing. ACM; 1995. p. 24.
18. Rognes T. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinforma. 2011;12:221.
19. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–7.
20. Notredame C, Higgins D, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000;302:205–17.
21. Chaichoompu K, Kittitornkun S, Tongsima S. MT-ClustalW: multithreading multiple sequence alignment. Parallel and Distributed Processing Symposium (IPDPS), 2006 20th International. IEEE; 2006. p. 8.
22. Wirawan A, Kwoh CK, Hieu NT, et al. CBESW: sequence alignment on the PlayStation 3. BMC Bioinforma. 2008;9(1):377.
23. Szalkowski A, Ledergerber C, Krähenbühl P, et al. SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/BE and x86/SSE2. BMC Res Notes. 2008;1(1):107.
24. Liu W, Schmidt B, Voss G, Mueller-Wittig W. Streaming algorithms for biological sequence alignment on GPUs. IEEE Trans Parallel Distrib Syst. 2007;18(9):1270–81.
25. Liu Y, Schmidt B, Maskell DL. CUDASW++ 2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res Notes. 2010;3(1):93.
26. Liu Y, Wirawan A, Schmidt B. CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions. BMC Bioinforma. 2013;14(1):117.
27. Manavski S, Valle G. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinforma. 2008;9(2):1.
28. Ligowski L, Rudnicki W. An efficient implementation of Smith-Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. 2009 International Parallel and Distributed Processing Symposium. IEEE; 2009. p. 1–8.
29. Khajeh-Saeed A, Poole S, Perot JB. Acceleration of the Smith-Waterman algorithm using single and multiple graphics processors. J Comput Phys. 2010;229(11):4247–58.
30. Blazewicz J, Frohmberg W, Kierzynka M, Pesch E, Wojciechowski P. Protein alignment algorithms with an efficient backtracking routine on multiple GPUs. BMC Bioinforma. 2011;12:181.
31. Hains D, Cashero Z, Ottenberg M, et al. Improving CUDASW++, a parallelization of Smith-Waterman for CUDA enabled devices. Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on. IEEE; 2011. p. 490–501.
32. Liu Y, Schmidt B, Maskell DL. MSA-CUDA: multiple sequence alignment on graphics processing units with CUDA. Application-specific Systems, Architectures and Processors (ASAP), 2009 20th IEEE International Conference on. IEEE; 2009. p. 121–8.
33. Hung CL, Lin YS, Lin CY, Chung YC, Chung YF. CUDA ClustalW: an efficient parallel algorithm for progressive multiple sequence alignment on multi-GPUs. Comput Biol Chem. 2015;58:62–8.
34. Li K. ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics. 2003;19:1585–6.
35. Ebedes J, Datta A. Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics. 2004;20:1193–5.
36. Cheetham J, Dehne F, Pitre S, et al. Parallel ClustalW for PC clusters. Computational Science and Its Applications – ICCSA 2003. Berlin, Heidelberg: Springer; 2003. p. 300–9.
37. Tan J, Feng S, Sun N. Parallel multiple sequences alignment in SMP cluster. International Conference on High Performance Computing Asia Region. 2005;20:425–31.
38. Oliver T, Schmidt B, Maskell D. Hyper customized processors for bio-sequence database scanning on FPGAs. Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays. ACM; 2005. p. 229–37.
39. Li ITS, Shum W, Truong K. 160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA). BMC Bioinforma. 2007;8(1):1.
40. Oliver T, Schmidt B, Nathan D, Clemens R, Maskell D. Using reconfigurable hardware to accelerate multiple sequence alignment with ClustalW. Bioinformatics. 2005;21:3431–2.
41. Boukerche A, Correa JM, de Melo ACMA, et al. An FPGA-based accelerator for multiple biological sequence alignment with DIALIGN. High Performance Computing – HiPC 2007. Berlin, Heidelberg: Springer; 2007. p. 71–82.
