AC-DIAMOND v1: accelerating large-scale DNA–protein alignment

Abstract

Summary

AC-DIAMOND (v1) is a DNA–protein alignment tool designed to tackle the efficiency challenge of aligning large amounts of reads or contigs to protein databases. Compared with DIAMOND, previously the most efficient method, AC-DIAMOND achieves a 6- to 7-fold speed-up while retaining a similar degree of sensitivity. The improvement is rooted in two aspects: first, a compressed index of adaptive-length seeds speeds up the matching between query and reference sequences; second, a compact form of dynamic programming fully utilizes the parallelism of the SIMD capability.

Availability and implementation

Software source code and binaries are available at https://github.com/Maihj/AC-DIAMOND/

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The rapid advancement of sequencing technologies has made it feasible to produce massive sequencing data for microbes, and the analysis of such data often involves aligning contigs or reads to protein databases on a large scale (e.g. complex metagenomic data such as soil can contain up to hundreds of Gbp, and a protein database such as NCBI-nr contains over 20G amino acids). BLASTX (Altschul et al., 1990), in view of its superior sensitivity, has been regarded as the gold standard for DNA–protein alignment, but it is too slow for large-scale alignment. For example, consider aligning 10 Gbp of soil contigs to NCBI-nr using a server with 12 cores; BLASTX would take years. The last decade has witnessed a number of tools that speed up BLASTX. An early such tool is RAPSearch2 (Zhao et al., 2012), which takes only weeks for the above example. Other tools such as PAUDA (Huson and Xie, 2014) and GHOSTZ (Suzuki et al., 2015), by trading sensitivity for speed, can further decrease the alignment time severalfold.
DIAMOND (Buchfink et al., 2015) is a recent tool that drastically improves the speed without sacrificing sensitivity; for the above example, it takes 6+ days. Note that 6+ days is longer than the time a sequencer needs to generate the data. This article presents a more efficient DNA–protein alignment tool, AC-DIAMOND (version v1), which achieves a 6- to 7-fold speed-up over DIAMOND while retaining the same sensitivity as DIAMOND. For the example of aligning 10 Gbp of soil contigs, AC-DIAMOND takes 20+ h.

2 Materials and methods

AC-DIAMOND speeds up DIAMOND by engineering more elaborate data structures and algorithms to locate the alignment seeds and to parallelize the dynamic programming (DP) that computes the alignments from the seeds. DIAMOND locates all spaced seeds occurring in both reference and query sequences by double indexing the sequences; both indexes are computed on the fly, as computing them from scratch is faster than loading them from hard disk. For large-scale DNA–protein alignment, the indexes cannot fit into main memory, so DIAMOND works on the query and reference sequences chunk by chunk. The same chunk of reference sequences is therefore re-loaded and re-indexed multiple times, once for each chunk of query sequences and each spaced seed pattern, which costs a lot of time.

Our first improvement to the seeding process was to compress the representation of the double indexing so as to process more sequences in each round and reduce reference re-loading. This resulted in a preliminary version of AC-DIAMOND (version v0), which was first presented at IWBBIO 2016 and achieved a 4-fold speed-up. To achieve a better speed-up, we avoid query indexing altogether and re-design the reference index as a two-level suffix-array-like data structure, which lets us exploit the fact that a binary search in cache memory is extremely efficient. We also compress the data structure so that it is small enough to be loaded efficiently (i.e. no re-computation is needed). More interestingly, our new indexing scheme provides better resolution of seeds: it starts with seeds of a fixed length (say, 10) and, in those cases where a seed has too many hits, it increases the seed length adaptively to locate more relevant seeds. This improves both efficiency and accuracy.

Once the seeds are located, DP is used to compute the alignments. It is natural to consider using the SIMD capability of each CPU core to work on multiple DP tables in parallel. However, this is difficult to implement because the size of the DP table varies among different pairs of queries and references. We engineered a scheduler inside AC-DIAMOND to dynamically pack DP tables of similar size, which can parallelize the DP for up to 90% of the seeds. To gain further parallelism, AC-DIAMOND reduces the number of bits used to represent an alignment score from 16 to 8. Note that 8 bits are not sufficient to represent a score, but are enough to represent the differences between a score and its neighbors (which are relatively small under a scoring matrix such as BLOSUM62). In short, AC-DIAMOND drastically improves the speed by processing up to 16 DP tables in parallel using one CPU core with 128-bit SIMD capability.

3 Results

3.1 On the speed of DIAMOND and AC-DIAMOND

We benchmarked the time required by AC-DIAMOND (v1) and DIAMOND (version 0.7.9) to align three kinds of data to the protein database NCBI-nr (release 82; 23.8G amino acids) on a server with 12 CPU cores:

(i) contigs assembled from the Iowa Native Prairie soil data (Howe et al., 2014); average length 727 bp;
(ii) contigs assembled from the human stool data (https://www.hmpdacc.org/HMASM/#data); average length 1000 bp;
(iii) MiSeq reads of a bacterium (Chromohalobacter salexigens DSM-3043, NCBI SRA #SRP057274); average length 300 bp.
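Returning briefly to the seeding scheme of Section 2, the adaptive lengthening of seeds can be illustrated with a minimal Python sketch. It is not AC-DIAMOND's implementation: a naive string scan stands in for the compressed two-level index, and the names `find_hits`, `adaptive_seed_hits`, `base_len` and `max_hits`, as well as the hit threshold itself, are assumptions made purely for illustration.

```python
def find_hits(reference, seed):
    """Return every start position of `seed` in `reference`.

    A naive scan; it only stands in for the index lookup and is an
    assumption of this sketch, not the tool's data structure.
    """
    hits = []
    pos = reference.find(seed)
    while pos != -1:
        hits.append(pos)
        pos = reference.find(seed, pos + 1)
    return hits


def adaptive_seed_hits(reference, query, start, base_len=10, max_hits=4):
    """Lengthen the seed at query[start:] until it has at most `max_hits` hits.

    Returns the final seed length and the surviving reference positions.
    `base_len` and `max_hits` are hypothetical parameters.
    """
    length = base_len
    hits = find_hits(reference, query[start:start + length])
    # Extend by one residue at a time while the seed is too ambiguous
    # and the query still has residues left to extend into.
    while len(hits) > max_hits and start + length < len(query):
        length += 1
        next_residue = query[start + length - 1]
        hits = [
            h for h in hits
            if h + length <= len(reference)
            and reference[h + length - 1] == next_residue
        ]
    return length, hits


if __name__ == "__main__":
    # Toy reference: the same 10-residue word followed by six different tails.
    tails = ["AA", "AC", "CA", "CC", "DA", "DC"]
    reference = "".join("MKTAYIAKQR" + t for t in tails)
    print(adaptive_seed_hits(reference, "MKTAYIAKQRCA", 0))
```

In this toy input the length-10 seed occurs six times, so the sketch extends it by one residue, which discards the hits whose next residue does not match the query.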
For each kind of data, we extracted five datasets of different sizes, which, after translation into protein sequences, contained 2G, 4G, 6G, 8G and 10G amino acids. Each dataset was aligned using AC-DIAMOND and DIAMOND in fast mode. Table 1 shows the running time for the smallest and biggest datasets (see Supplementary Tables S1–S3 for the other datasets); AC-DIAMOND consistently achieved a 6- to 7-fold speed-up. The peak memory usage of AC-DIAMOND is similar to that of DIAMOND in each case (specifically, both use 30+ GB). We also benchmarked the software with a bigger block size, using 40+ or 50+ GB of main memory; both tools were found to be faster (specifically, DIAMOND is about 5% faster and AC-DIAMOND is over 20% faster; see Supplementary Tables S5 and S6).

Table 1. Comparison of time for alignment to NCBI-nr

                                      Running time (hours)
Data                  Volume (aa)     AC-DIAMOND    DIAMOND    Speed-up
Soil contigs          2G              2.6           17.4       6.7
Soil contigs          10G             12.0          83.4       7.0
Human stool contigs   2G              2.4           14.4       6.0
Human stool contigs   10G             12.5          78.7       6.3
Bacteria reads        2G              2.3           16.4       7.1
Bacteria reads        10G             10.6          75.3       7.1

Note that BLASTX and other reasonably sensitive tools are much slower. For the smallest soil dataset (2G amino acids), BLASTX would require an estimated 13 000 h, and RAPSearch2 (fast) was measured to complete in 75 h.
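The speed-up column of Table 1 is the ratio of the DIAMOND running time to the AC-DIAMOND running time. As a quick check, the following Python sketch recomputes that ratio from the hours reported in Table 1; the variable and function names are ours, and the numbers are copied from the table.

```python
# Running times in hours copied from Table 1:
# (AC-DIAMOND hours, DIAMOND hours, speed-up as reported in the table).
TABLE_1 = {
    ("Soil contigs", "2G"): (2.6, 17.4, 6.7),
    ("Soil contigs", "10G"): (12.0, 83.4, 7.0),
    ("Human stool contigs", "2G"): (2.4, 14.4, 6.0),
    ("Human stool contigs", "10G"): (12.5, 78.7, 6.3),
    ("Bacteria reads", "2G"): (2.3, 16.4, 7.1),
    ("Bacteria reads", "10G"): (10.6, 75.3, 7.1),
}


def speed_up(ac_hours, diamond_hours):
    """Speed-up of AC-DIAMOND over DIAMOND as a ratio of running times."""
    return diamond_hours / ac_hours


if __name__ == "__main__":
    for (data, volume), (ac, dia, reported) in TABLE_1.items():
        ratio = speed_up(ac, dia)
        print(f"{data:20s} {volume:>3s}  {ratio:.2f}x (reported {reported})")
```

Each recomputed ratio agrees with the reported value to within rounding to one decimal place.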
MMSeqs2 (Steinegger and Söding, 2017), a tool that also supports iterative sequence profile search, was found to take 57 h to align the 2G dataset. See Supplementary Table S4 for a comparison of the speed of RAPSearch2, MMSeqs2, DIAMOND and AC-DIAMOND in different fast and sensitive modes.

3.2 On the sensitivity against BLASTX

BLASTX is commonly regarded as the gold standard for alignment sensitivity. Using BLASTX (2.2.29+) as the control, we evaluated the sensitivity of RAPSearch2 (2.24), DIAMOND (0.7.9) and AC-DIAMOND (v1) when aligning contigs/reads to NCBI-nr. Since BLASTX is too slow, we limited the test to 10 000 queries in each case. Specifically, for each of the above three types of data (i.e. soil, human stool and bacteria), we used a dataset of 10 000 contigs or reads that BLASTX aligned successfully to NCBI-nr to benchmark RAPSearch2, MMSeqs2, DIAMOND and AC-DIAMOND in different fast and sensitive modes. For comparison purposes, we recorded for each query the best 25 alignments with e-value no bigger than 0.001 reported by BLASTX and by each tool. For each tool, we measured (i) the percentage of queries that were aligned by the tool, and (ii) the percentage of BLASTX's alignments that overlapped with the alignments reported by the tool. The results, given in Table 2, show the relative sensitivity of the different tools over the three datasets (namely, soil contigs, human stool contigs and bacteria reads).

Table 2. Comparison of alignment sensitivity against BLASTX

                        (1) Soil contigs        (2) Stool contigs       (3) Bacteria reads
Tool                    % of      % of          % of      % of          % of      % of
                        contigs   alignments    contigs   alignments    reads     alignments
DIAMOND-fast            90.5      78.0          97.5      89.0          99.0      91.5
DIAMOND-sensitive       94.2      89.4          98.1      92.1          99.2      94.5
AC-DIAM.-fast           91.0      78.4          97.6      89.3          99.1      91.8
AC-DIAM.-Sensitive-1    94.5      78.8          98.3      89.4          99.3      91.8
AC-DIAM.-Sensitive-2    94.5      89.2          98.3      93.0          99.3      95.6
RAPSearch2-fast         87.4      54.1          97.3      76.2          99.2      82.2
RAPSearch2-sensitive    94.0      73.6          98.4      86.6          99.4      90.5
MMSeqs2                 95.7      58.4          97.6      63.9          96.6      71.3

Note: For each tool, the table gives the % of BLASTX-aligned contigs/reads that the tool also aligned, and the % of BLASTX-reported alignments that overlapped with the tool's alignments. AC-DIAMOND has two sensitive modes, namely Sensitive-1 and Sensitive-2.

4 Conclusion

AC-DIAMOND (v1) achieves a speed-up of six to seven times over DIAMOND, while retaining a high sensitivity.

Funding

This work was partially supported by the Hong Kong Innovation and Technology Fund [ITS/155/15FP].

Conflict of Interest: none declared.

References

Altschul S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403.
Buchfink B. et al. (2015) Fast and sensitive protein alignment using DIAMOND. Nat. Methods, 12, 59.
Howe A.C. et al. (2014) Tackling soil diversity with the assembly of large, complex metagenomes. In: Proceedings of the National Academy of Science, pp. 111.
Huson D.H., Xie C. (2014) A poor man’s BLASTX high throughput metagenomic protein database search using PAUDA. Bioinformatics, 30, 38.
Suzuki S. et al. (2015) Faster sequence homology searches by clustering subsequences. Bioinformatics, 31, 1183.
Steinegger M., Söding J. (2017) MMseqs enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol., 35, 1206–1208.
Zhao Y. et al. (2012) RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics, 28, 125.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty391


Journal

Bioinformatics, Oxford University Press

Published: Nov 1, 2018
