SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data

Abstract

Quality control (QC) and preprocessing are essential steps for sequencing data analysis to ensure the accuracy of results. However, existing tools cannot provide a satisfying solution with integrated comprehensive functions, proper architectures, and highly scalable acceleration. In this article, we demonstrate SOAPnuke as a tool with abundant functions for a “QC-Preprocess-QC” workflow and MapReduce acceleration framework. Four modules with different preprocessing functions are designed for processing datasets from genomic, small RNA, Digital Gene Expression, and metagenomic experiments, respectively. As a workflow-like tool, SOAPnuke centralizes processing functions into 1 executable and predefines their order to avoid the necessity of reformatting different files when switching tools. Furthermore, the MapReduce framework enables large scalability to distribute all the processing work to an entire compute cluster. We conducted a benchmark in which SOAPnuke and other tools were used to preprocess a ∼30× NA12878 dataset published by GIAB. The standalone operation of SOAPnuke struck a balance between resource occupancy and performance. When accelerated on 16 working nodes with MapReduce, SOAPnuke achieved ∼5.37 times the speed of the fastest of the other tools.

Keywords: high-throughput sequencing; quality control; preprocessing; MapReduce

Received: 17 July 2017; Revised: 18 October 2017; Accepted: 22 November 2017

© The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4689118 by Ed 'DeepDyve' Gillespie user on 16 March 2018

Background

High-throughput sequencing (HTS) instruments have enabled many large-scale studies and generated enormous amounts of data [1–3]. However, the presence of low-quality bases, sequence artifacts, and sequence contamination can have a serious negative impact on downstream analyses. Thus, QC and preprocessing of raw data serve as the critical steps to initiate analysis pipelines [4, 5]. QC investigates several statistics of datasets to ensure data quality, and preprocessing trims off undesirable terminal fragments and filters out substandard reads [6]. We have conducted a survey on 31 existing tools, and widely shared functions are listed in Supplementary Material 1.

Existing tools for QC and preprocessing can be divided into 2 categories according to their structures: toolkit and workflow. Toolkit-like software provides multiple executables such as statistics computer, clipper, and filtrator [7–15]. In practice, raw data are processed by a few individual executables in sequence. Comparatively, workflow-like software offers an integral workflow where functions are performed in predefined order [6, 16–37].

However, both categories have their own demerits. When using toolkit-like software, it is complex and error-prone to write additional scripts to wrap executables. Moreover, generating and reading intermediate files consumes much time and is hard to accelerate. Besides, the same variables may be computed repeatedly. For instance, the average quality score of each read is necessary both for counting the quality score distribution by reads and for filtering reads based on average quality scores. It has to be computed twice if these 2 functions are implemented by different toolkits.

For workflow-like tools, an optimal architecture is required because the order of functions is fixed. Most of the existing tools successively perform QC and preprocessing without complete statistics of preprocessed datasets. If the preprocessing operation is not suitable for a given dataset, the problem can only be revealed by downstream analyses.

Datasets sequenced from various samples may require different processing functions or parameters. Existing workflow-like tools mostly support genomics data processing; only a few of them are developed for other types of studies, such as RNA-seq and metagenomics data. For example, RobiNA [22] provides 4 preprocessing modules that can be combined for different RNA-seq data. PrinSeq [6] offers a QC statistic, dinucleotide odds ratios, to show how the dataset might be related to other viral/microbial metagenomes. However, there is still no single tool supporting multiple data types.

Several tools have made certain progress in overcoming the limitations mentioned above. Galaxy [37] is a web-based platform incorporating various existing toolkit-like software. Users can conveniently concatenate tools into a pipeline on the web interface. NGS QC Toolkit [16] offers a workflow with QC on both raw and preprocessed datasets, though it has few preprocessing functions.

In terms of software acceleration, only multithreading is adopted by existing tools [14–16, 24–28]. This approach only works for standalone operation and is limited by the maximum number of processors in 1 computer server. It may be insufficient when dealing with the huge present and potential volume of sequencing datasets.

To solve these problems, we have developed a workflow-like tool, SOAPnuke, for integrated QC and preprocessing of large HTS datasets. Similar to NGS QC Toolkit, SOAPnuke performs 2-step QC. Trimming, filtering, and other frequently used functions are integrated in our program. Four modules are designed to handle genomic, metagenomic, DGE, and sRNA datasets, respectively. In addition, SOAPnuke is extended to multiple working nodes for parallel computing using the Hadoop MapReduce framework.

Methods

QC and preprocessing

SOAPnuke (SOAPnuke, RRID:SCR_015025) was developed to summarize statistics of both raw and preprocessed data. Basic statistics comprise the number of sequences and bases, base composition, Q20 and Q30, and filtering information. Complex statistics include the distribution of quality score and the base composition distribution for each position. For the quality score distribution, Q20 and Q30 for each position are plotted in a line chart, and the quantiles of the quality are represented in a boxplot. For the base composition distribution, an overlapping histogram is used to display the base composition for each position. These calculations are conducted in C++, and the plots are generated by R 3.3.2 [38]. An example of the 2 plots is shown in Fig. 1. A comprehensive list of statistics available in SOAPnuke is included in Additional file 2.

Figure 1: An example of QC complex statistics. (A) Per-base quality distribution of raw paired-end reads. (B) Per-base Q20 and Q30 of raw and preprocessed paired-end reads. (C) Per-base base composition distribution of raw paired-end reads.

Statistics of preprocessed data are compared with some preset thresholds. A warning message will be issued if the median score of any position in the per-base quality distribution is lower than 25, and a failure will be issued if it is lower than 20. For per-base base composition, a warning will be raised if the difference between A and T, or G and C, in any position is greater than 10%, and a failure will be issued if it is greater than 20%.

In the preprocessing step, undesirable terminal fragments are trimmed off, substandard reads are filtered out, and certain transform operations are applied. On both ends of reads, a specified number of bases, or bases of quality lower than the threshold, will be trimmed off. Sequencing adapters can be located by alignment, which tolerates mismatches but not INDELs, and the read is then cut from the adapter to the 3' end. Filtering can be performed on reads with adapter, short length, too many ambiguous bases, low average quality, or too many low-quality bases. Sequencing batches with unfavorable quality, such as a tile of an Illumina sequencer [39] or a field of view (fov) of a BGI sequencer [40], can be specified so that the corresponding sequences are discarded. In addition, reads with identical nucleotides can be deduplicated to keep only 1 copy. Transformation comprises quality system conversion, interconversion between DNA and RNA, compression of output with gzip, etc. Additional file 3 lists the above preprocessing functions and their parameters.

Module design

To improve processing performance for different types of data, SOAPnuke provides 4 specialized modules: the General, DGE, sRNA, and Meta modules.

(1) The General module can handle most DNA re-sequencing datasets, as described in the “QC and preprocessing” section.
(2) DGE profiling generates a single-end read that has a “CATG” segment neighboring the targeted sequence of 17 base pairs [41]. By default, the DGE module will find the targeted segment and trim off other parts. Moreover, reads with ambiguous bases will be filtered.
(3) The sRNA module incorporates filtering of poly-A tags, as polyadenylation is a feature of mRNA data and sRNA sequences can be contaminated by mRNA during sample preparation [42].
(4) The Metagenomics preprocessing module customizes a few functions from the General module for trimming adapters and low-quality bases on both ends and for dropping reads that are too short or have too many ambiguous bases.

Detailed parameter settings can be accessed in Additional file 3.

Software features

SOAPnuke is written in C++ for good scalability and performance, and it can be run on both Linux and Windows platforms.

Two parallelization strategies are implemented for acceleration. Multithreading is developed for standalone operation. Data are cut into blocks of fixed size, and each block is processed by 1 thread. This design utilizes multiple cores in a working node. In SOAPnuke, the creation and allocation of threads are managed by a threadpool library, which decreases the overhead of creating and destroying threads. More importantly, Hadoop MapReduce is applied to achieve rapid processing in multinode clusters for ultra-large-scale data. In the mapping phase, each read is kept as a key-value pair, where the key is the readID and the value denotes the sequence and quality scores. In the shuffle phase, the key-value pairs are sorted, and each pair of paired-end reads is gathered. During the reducing phase, blocks of fixed size are processed by various threads of multiple nodes, and each block generates an individual result. After that, it is optional to merge the results into integrated fastq files.

To prove the effectiveness of the acceleration design, we have conducted a performance test on SOAPnuke and other alternative tools. A ∼30× human genome dataset published by GIAB [43] was extracted as testing data (see Additional file 4). In terms of the computing environment, up to 16 nodes were used, each of which has 24 cores of Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz and 128 GB of RAM. SOAPnuke operations for testing were set as described in published manuscripts (see the reference list in Additional file 5). Trimming adapters and filtering on length and quality were selected for their universality. We chose other workflow-like tools capable of performing these functions, which are Trimmomatic (Trimmomatic, RRID:SCR_011848) [27], AfterQC [30], BBDuk [31], and AlienTrimmer [36]. The parameter setting is also available in Additional file 4.

Results

In the performance test, we chose 3 indexes for evaluation: elapsed time, CPU usage, and maximum RAM usage. As shown in Table 1, AfterQC is the tool occupying the fewest resources. However, its processing time is too long for practical usage, especially considering that we ran the program with pypy, which is reported to be 3 times as fast as standard Python. Among the remaining tools, SOAPnuke struck an appropriate balance between resource occupancy and performance. Furthermore, users can choose to run SOAPnuke on multiple nodes with the MapReduce framework if high-throughput performance is demanded. In our testing, 16 nodes achieved ∼32 times acceleration compared with standalone operation, which is ∼5.37 times the speed of the fastest of the 4 other tested tools.

Table 1: Evaluation of the data processing performance across SOAPnuke and 4 other tools

Tool                           Time, min   Throughput, reads/s   CPU, %   Max RAM, GB
SOAPnuke (1 node, 1 thread)        302.7              33 947.8      250          0.62
SOAPnuke (16 nodes)                  9.4           1 093 191.1      640         50.10
Trimmomatic (1 thread)              84.7             121 380.1       75          2.98
Trimmomatic (24 threads)            50.5             203 582.1      239         10.28
BBDuk                               57.2             162 230.2      259         11.40
AlienTrimmer                       530.2              19 076.1       99          0.54
AfterQC (pypy)                    2482.7                4319.1       99          0.21

Time, throughput, CPU, and maximum memory occupation are presented. For CPU usage, 100% means full load of a single CPU core. Maximum RAM usage means the highest occupancy of RAM during the whole processing.

After the preprocessing, downstream analyses were performed with the GATK (GATK, RRID:SCR_001876) best practice pipeline (see the description of GATK best practices) [44]. Data were processed by the alignment, rmDup, baseRecal, bamSort, and haplotypeCaller modules in order. For the haplotypeCaller, GIAB high-confidence small variant and reference calls v3.3.2 [45] were used as the gold standard. Details of this testing are available in Additional file 4.

As seen in Table 2, AfterQC achieves the best variant calling result. The F-measures of SOAPnuke and Trimmomatic are the same and are slightly lower than those of AfterQC. AlienTrimmer performs slightly worse, and BBDuk has the worst result, with an INDEL calling result that differs greatly from that of the other tools. In summary, though the variant calling result of AfterQC is optimal, it is not worth considering because of its long processing time. Among the remaining tools, SOAPnuke and Trimmomatic tie for first place.

Discussion and Conclusion

Data quality is critical to downstream analysis, which makes it important to use reliable tools for preprocessing. To omit unnecessary input/output and computation, a workflow-like structure is adopted in SOAPnuke, where QC and preprocessing functions are integrated within an executable program. Compared with most workflow-like tools, such as PrinSeq [6] and RobiNA [22], SOAPnuke adds statistics of preprocessed data for better understanding of data. To cope with datasets generated from different experiments, 4 modules are predefined with tailored functions and parameters. In terms of acceleration approach, multithreading is the sole method adopted by existing tools [14–16, 24–28], but it is only applicable to single-node operations. SOAPnuke utilizes MapReduce to realize concurrent execution on multinode operations, where CPU cores of multiple nodes can be involved in a single task. It improves the scalability of parallel execution and the applicability to mass data. SOAPnuke also includes multithreading for standalone computing. Our test results indicate that SOAPnuke can achieve ∼5.37 times the maximum speed of other tools with multithreading. It is worth mentioning that processing speed is not directly proportional to the number of working nodes, because some procedures, like initialization of MapReduce, cannot be accelerated as nodes increase, and the burden of communication between nodes grows as well.

In future work, we will continue adding functions to feature modules. For example, in the preprocessing of DGE datasets, filtering out singleton reads is frequently included [46–48]. For the sRNA module, screening out reads based on alignment with noncoding RNA databases (such as tRNA, rRNA, and snoRNA) [49, 50] is under development. Adding statistics such as per-read quality distribution and length distribution is also worth consideration. To users without a computing cluster, SOAPnuke might not be an optimal tool in terms of overall performance. Thus, we are refactoring the code to increase the standalone processing speed.

However, we have found 2 problems worth exploring regarding QC and preprocessing. First, in terms of preprocessing, it is difficult to choose optimal parameters for a specific dataset. Datasets from the same experiments and sequencers tend to share features, so users always select the same parameters for those similar data. The parameters are initially defined based on experiments on a specific dataset or just experience, which may already introduce some error and bias. Moreover, even if the parameters are optimal for the tested dataset, they are possibly inappropriate for other data because of random factors. Thus, the current method is a compromise. A possible solution worth considering is to adjust preprocessing settings automatically during the processing. Second, some of the QC statistics are of limited help in judging the usability of data. For example, as the threshold for filtering out low-quality reads is increased from 0 to 40, the mean quality of all reads or of each position will rise accordingly, and the result of variant calling will improve at the very beginning but then get worse. This is because preprocessing must strike a balance between removing noise and keeping useful information, while single QC statistics cannot reflect the global balance. A comprehensive list of QC statistics in SOAPnuke can help solve the problem, as raising the mean-quality threshold alone beyond the balance point might make other statistics worse. Thus, it is worthwhile to explore ways to comprehensively analyze all statistics to evaluate the effect of preprocessing. Currently, this procedure is performed empirically by users. In our future work, these 2 problems will be considered for the development of updated versions.

Availability and requirements

Project name: SOAPnuke
Project home page: https://github.com/BGI-flexlab/SOAPnuke
RRID:SCR_015025
Operating system(s): Linux, Windows
Programming language: C++
Requirements: libraries: boost, zlib, log4cplus, and openssl; R
License: GPL

Availability of supporting data

Snapshots of the code and test data are also stored in the GigaScience repository, GigaDB [51].

Abbreviations

DGE: digital gene expression; HTS: high-throughput sequencing; QC: quality control; sRNA: small RNA.
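As a worked check, the acceleration figures quoted in the Results and Discussion can be recomputed from the elapsed times in Table 1. The following is a minimal sketch in Python; it is not part of SOAPnuke, and the names `ELAPSED_MIN` and `speedup` are illustrative assumptions introduced here.

```python
# Elapsed wall-clock times in minutes, copied from Table 1.
ELAPSED_MIN = {
    "SOAPnuke (1 node, 1 thread)": 302.7,
    "SOAPnuke (16 nodes)": 9.4,
    "Trimmomatic (1 thread)": 84.7,
    "Trimmomatic (24 threads)": 50.5,
    "BBDuk": 57.2,
    "AlienTrimmer": 530.2,
    "AfterQC (pypy)": 2482.7,
}


def speedup(baseline: str, accelerated: str) -> float:
    """Return how many times faster `accelerated` ran than `baseline`."""
    return ELAPSED_MIN[baseline] / ELAPSED_MIN[accelerated]


# Fastest configuration among the tools other than SOAPnuke.
fastest_other = min(
    (name for name in ELAPSED_MIN if not name.startswith("SOAPnuke")),
    key=ELAPSED_MIN.get,
)

vs_standalone = speedup("SOAPnuke (1 node, 1 thread)", "SOAPnuke (16 nodes)")
vs_fastest_other = speedup(fastest_other, "SOAPnuke (16 nodes)")

print(f"16 nodes vs. standalone SOAPnuke: {vs_standalone:.1f}x")
print(f"16 nodes vs. {fastest_other}: {vs_fastest_other:.2f}x")
```

With the Table 1 values, the ratios come out at about 32.2 and 5.37, in line with the ∼32 times and ∼5.37 times figures in the text. The standalone baseline is a single-thread run, whereas each of the 16 nodes has 24 cores, which may explain why the first ratio exceeds the node count.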
Table 2: Variant calling result of SOAPnuke and 4 other tools

Tool           SNPs precision   SNPs sensitivity   SNPs F-measure   INDELs precision   INDELs sensitivity   INDELs F-measure
SOAPnuke               0.9967             0.9811           0.9888             0.9806               0.9575             0.9689
Trimmomatic            0.9966             0.9811           0.9888             0.9806               0.9575             0.9689
BBDuk                  0.9966             0.9797           0.9881             0.9698               0.9184             0.9434
AlienTrimmer           0.9954             0.9810           0.9882             0.9792               0.9540             0.9665
AfterQC                0.9968             0.9811           0.9889             0.9811               0.9586             0.9697

F-measure is a measure considering both the precision and recall of the variant calling result. SNP and INDEL are 2 main categories of variants.

Author contributions

L.F. and Q.C. conceived the project. Yuxin C. and C.S. conducted the survey on existing tools for QC and preprocessing. Yuxin C., Yongsheng C., C.S., Z.H., Y.Z., S.L., J.Y., Z.L., X.Z., J.W., H.Y., L.F., and Q.C. provided feedback on features and functionality. Yongsheng C., Z.H., and S.L. wrote the standalone version of SOAPnuke. Yuxin C. wrote the MapReduce version of SOAPnuke. Yuxin C. and Z.H. performed the above-mentioned test. Yuxin C., Y.L., C.Y., and L.F. wrote the manuscript. All authors read and approved the final manuscript.

Additional files

Supplementary Material 1: Comparison of features and functions of various tools for QC and preprocessing (XLSX 41 kb).
Supplementary Material 2: Details of QC in SOAPnuke (PDF 304 kb).
Supplementary Material 3: Details of preprocessing in SOAPnuke (PDF 1.6 mb).
Supplementary Material 4: Details of preprocessing performance test and downstream analyses (DOCX 38 kb).
Supplementary Material 5: Details of research involving SOAPnuke (XLSX 12 kb).

Competing interests

The authors declare that they have no competing interests.

Open access

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Acknowledgements

This research was supported by Collaborative Innovation Center of High Performance Computing, the Critical Patented Project of the Science and Technology Bureau of Fujian Province, China (Grant No. 2013YZ0002-2), and the Joint Project of the Natural Science and Health Foundation of Fujian Province, China (Grant No. 2015J01397).

References

1. Fox S, Filichkin S, Mockler TC. Applications of ultra-high-throughput sequencing. Methods Mol Biol 2009;553:79–108.
2. Soon WW, Hariharan M, Snyder MP. High-throughput sequencing for biology and medicine. Mol Syst Biol 2014;9(1):640.
3. Stephens ZD, Lee SY, Faghri F et al. Big data: astronomical or genomical? PLoS Biol 2015;13(7):e1002195.
4. Guo Y, Ye F, Sheng Q et al. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinformatics 2014;15(6):879–89.
5. Zhou X, Rokas A. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies. Mol Ecol 2014;23(7):1679–700.
6. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.
7. Moxon S, Schwach F, Dalmay T et al. A toolkit for analysing large-scale plant small RNA datasets. Bioinformatics 2008;24(19):2252–3.
8. Gordon A, Hannon GJ. Fastx-toolkit. FASTQ/A short-reads preprocessing tools. http://hannonlab.cshl.edu/fastx_toolkit. Accessed 1 November 2017.
9. Cox MP, Peterson DA, Biggs PJ. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.
10. Zhang T, Luo Y, Liu K et al. BIGpre: a quality assessment package for next-generation sequencing data. Genomics Proteomics Bioinformatics 2011;9(6):238–44.
11. Aronesty E. ea-utils: Command-Line Tools for Processing Biological Sequencing Data. Durham, NC: Expression Analysis; 2011.
12. Yang X, Liu D, Liu F et al. HTQC: a fast quality control toolkit for Illumina sequencing data. BMC Bioinformatics 2013;14(1):33.
13. Li H. seqtk: toolkit for processing sequences in FASTA/Q formats. https://github.com/lh3/seqtk. Accessed 1 March 2017.
14. Zhou Q, Su X, Wang A et al. QC-Chain: fast and holistic quality control method for next-generation sequencing data. PLoS One 2013;8(4):e60234.
15. Zhou Q, Su X, Jing G et al. Meta-QC-Chain: comprehensive and fast quality control method for metagenomic data. Genomics Proteomics Bioinformatics 2014;12(1):52–56.
16. Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 2012;7(2):e30619.
17. Andrews S. FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 1 November.
18. Schmieder R, Lim YW, Rohwer F et al. TagCleaner: identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics 2010;11(1):341.
19. Falgueras J, Lara AJ, Fernandez-Pozo N et al. SeqTrim: a high-throughput pipeline for preprocessing any type of sequence reads. BMC Bioinformatics 2010;11(1):38.
20. St John J. SeqPrep: tool for stripping adaptors and/or merging paired reads with overlap into single reads. https://github.com/jstjohn/SeqPrep. Accessed 1 November 2017.
21. Kong Y. Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics 2011;98(2):152–3.
22. Lohse M, Bolger AM, Nagel A et al. RobiNA: a user-friendly, integrated software solution for RNA-seq-based transcriptomics. Nucleic Acids Res 2012;40(W1):W622–7.
23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):pp–10.
24. Schubert M, Lindgreen S, Orlando L. AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Res Notes 2016;9(1):88.
25. Dodt M, Roehr JT, Ahmed R et al. FLEXBAR-flexible barcode and adapter processing for next-generation sequencing platforms. Biology (Basel) 2012;1(3):895–905.
26. Li YL, Weng JC, Hsiao CC et al. PEAT: an intelligent and efficient paired-end sequencing adapter trimming algorithm. BMC Bioinformatics 2015;16(Suppl 1):S2.
27. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
28. Sturm M, Schroeder C, Bauer P. SeqPurge: highly-sensitive adapter trimming for paired-end NGS data. BMC Bioinformatics 2016;17(1):208.
29. Jiang H, Lei R, Ding SW et al. Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics 2014;15(1):182.
30. Chen S, Huang T, Zhou Y et al. AfterQC: automatic filtering, trimming, error removing and quality control for fastq data. BMC Bioinformatics 2017;18(S3):80.
31. Bushnell B. BBMap: A Fast, Accurate, Splice-Aware Aligner. Berkeley, CA: Ernest Orlando Lawrence Berkeley National Laboratory; 2014.
32. Joshi NA, Fass JN. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files. https://github.com/najoshi/sickle. Accessed 1 November 2017.
33. Pertea G. fqtrim: trimming&filtering of next-gen reads. https://ccb.jhu.edu/software/fqtrim/. Accessed 1 November 2017.
34. Buffalo V. Scythe: a Bayesian adapter trimmer. https://github.com/vsbuffalo/scythe. Accessed 1 March 2017.
35. Leggett RM, Clavijo BJ, Clissold L et al. NextClip: an analysis and read preparation tool for Nextera long mate pair libraries. Bioinformatics 2014;30(4):566–8.
36. Criscuolo A, Brisse S. AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics 2013;102(5–6):500–6.
37. Goecks J, Nekrutenko A, Taylor J et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010;11(8):R86.
38. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2013.
39. Illumina. NextSeq 500 system overview. https://support.illumina.com/content/dam/illumina-support/courses/nextseq-system-overview/story_content/external_files/NextSeq500_System_Overview_narration.pdf. Accessed 1 November 2017.
40. Huang J, Liang X, Xuan Y et al. A reference human genome dataset of the BGISEQ-500 sequencer. Gigascience 2017;6(5):1–9.
41. Zhang X, Hao L, Meng L et al. Digital gene expression tag profiling analysis of the gene expression patterns regulating the early stage of mouse spermatogenesis. PLoS One 2013;8(3):e58680.
42. Tam S, Tsao MS, McPherson JD. Optimization of miRNA-seq data preprocessing. Brief Bioinformatics 2015;16(6):950–63.
43. Zook JM, Catoe D, McDaniel J et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data 2016;3:160025.
44. GATK best practices. http://www.broadinstitute.org/gatk/guide/best-practices. Accessed 1 November 2017.
45. NISTv3.3.2, NA12878 high-confidence variant calls as a gold standard. GIAB. 2017. ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/. Accessed 1 November 2017.
46. Zhang X, Hao L, Meng L et al. Digital gene expression tag profiling analysis of the gene expression patterns regulating the early stage of mouse spermatogenesis. PLoS One 2013;8(3):e58680.
47. Zhou L, Chen J, Li Z et al. Integrated profiling of microRNAs and mRNAs: microRNAs located on Xq27.3 associate with clear cell renal cell carcinoma. PLoS One 2010;5(12):e15224.
48. Han Y, Zhang X, Wang W et al. The suppression of WRKY44 by GIGANTEA-miR172 pathway is involved in drought response of Arabidopsis thaliana. PLoS One 2013;8(11):e73541.
49. Hall AE, Lu WT, Godfrey JD et al. The cytoskeleton adaptor protein ankyrin-1 is upregulated by p53 following DNA damage and alters cell migration. Cell Death Dis 2016;7(4):e2184.
50. Surbanovski N, Brilli M, Moser M et al. A highly specific microRNA-mediated mechanism silences LTR retrotransposons of strawberry. Plant J 2016;85(1):70–82.
51. Chen Y, Chen Y, Shi C et al. Supporting data for “SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data.” GigaScience Database 2017. http://dx.doi.org/10.5524/100373.

GigaScience, Oxford University Press

Publisher: BGI
Copyright: © The Author(s) 2017. Published by Oxford University Press.
eISSN: 2047-217X
DOI: 10.1093/gigascience/gix120
Publisher site
See Article on Publisher Site

Abstract

Quality control (QC) and preprocessing are essential steps for sequencing data analysis to ensure the accuracy of results. However, existing tools cannot provide a satisfying solution with integrated comprehensive functions, proper architectures, and highly scalable acceleration. In this article, we demonstrate SOAPnuke as a tool with abundant functions for a “QC-Preprocess-QC” workflow and a MapReduce acceleration framework. Four modules with different preprocessing functions are designed for processing datasets from genomic, small RNA, Digital Gene Expression, and metagenomic experiments, respectively. As a workflow-like tool, SOAPnuke centralizes processing functions into 1 executable and predefines their order to avoid the necessity of reformatting different files when switching tools. Furthermore, the MapReduce framework enables large scalability to distribute all the processing work to an entire compute cluster. We conducted a benchmark in which SOAPnuke and other tools were used to preprocess a ∼30× NA12878 dataset published by GIAB. The standalone operation of SOAPnuke struck a balance between resource occupancy and performance. When accelerated on 16 working nodes with MapReduce, SOAPnuke achieved ∼5.37 times the fastest speed of other tools.

Keywords: high-throughput sequencing; quality control; preprocessing; MapReduce

Received: 17 July 2017; Revised: 18 October 2017; Accepted: 22 November 2017. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

High-throughput sequencing (HTS) instruments have enabled many large-scale studies and generated enormous amounts of data [1–3]. However, the presence of low-quality bases, sequence artifacts, and sequence contamination can introduce serious negative impact on downstream analyses. Thus, QC and preprocessing of raw data serve as the critical steps to initiate analysis pipelines [4, 5]. QC investigates several statistics of datasets to ensure data quality, and preprocessing trims off undesirable terminal fragments and filters out substandard reads [6]. We have conducted a survey on 31 existing tools, and widely shared functions are listed in Supplementary Material 1.

Existing tools for QC and preprocessing can be divided into 2 categories according to their structures: toolkit and workflow. Toolkit-like software provides multiple executables such as statistics computer, clipper, and filtrator [7–15]. In practice, raw data are processed by a few individual executables in sequence. Comparatively, workflow-like software offers an integral workflow where functions are performed in a predefined order [6, 16–37].

However, both categories have their own demerits. When using toolkit-like software, it is complex and error-prone to write additional scripts to wrap executables. Moreover, generating and reading intermediate files consumes much time and is hard to accelerate. Besides, the same variables may be computed repeatedly. For instance, the average quality score of each read is necessary both for counting the quality score distribution by reads and for filtering reads based on average quality scores. It has to be computed twice if these 2 functions are implemented by different toolkits.

For workflow-like tools, an optimal architecture is required because the order of functions is fixed. Most of the existing tools successively perform QC and preprocessing without complete statistics of preprocessed datasets. If the preprocessing operation is not suitable for a given dataset, the problem can only be revealed by downstream analyses.

Datasets sequenced from various samples may require different processing functions or parameters. Existing workflow-like tools mostly support genomics data processing; only a few of them are developed for other types of studies, such as RNA-seq and metagenomics data. For example, RObiNA [22] provides 4 preprocessing modules to be combined for different RNA-seq data. PrinSeq [6] offers a QC statistic, dinucleotide odds ratios, to show how the dataset might be related to other viral/microbial metagenomes. However, there is still no single tool supporting multiple data types.

Several tools have made certain progress in overcoming the limitations mentioned above. Galaxy [37] is a web-based platform incorporating various existing toolkit-like software. Users can conveniently concatenate tools into a pipeline on the web interface. NGS QC toolkit [16] offers a workflow with QC on both raw and preprocessed datasets, though there are few preprocessing functions.

In terms of software acceleration, only multithreading is adopted by existing tools [14–16, 24–28]. This approach only works for standalone operation and is limited by the maximum number of processors in 1 computer server. It may be inadequate when dealing with the huge present and potential volume of sequencing datasets.

To solve these problems, we have developed a workflow-like tool, SOAPnuke, for integrated QC and preprocessing of large HTS datasets. Similar to NGS QC toolkit, SOAPnuke performs 2-step QC. Trimming, filtering, and other frequently used functions are integrated in our program. Four modules are designed to handle genomic, metagenomic, DGE, and sRNA datasets, respectively. In addition, SOAPnuke is extended to multiple working nodes for parallel computing using the Hadoop MapReduce framework.

Methods

QC and preprocessing

SOAPnuke (SOAPnuke, RRID:SCR_015025) was developed to summarize statistics of both raw and preprocessed data. Basic statistics comprise the number of sequences and bases, base composition, Q20 and Q30, and filtering information. Complex statistics include the distribution of quality scores and the base composition distribution for each position. For the quality score distribution, Q20 and Q30 for each position are plotted in a line chart, and the quantiles of the quality are represented in a boxplot. For the base composition distribution, an overlapping histogram is used to display the base composition for each position. These calculations are implemented in C++, and the plots are generated by R 3.3.2 [38]. An example of the 2 plots is shown in Fig. 1. A comprehensive list of statistics available in SOAPnuke is included in Additional file 2. Statistics of preprocessed data are compared with some preset thresholds. A warning message will be issued if the median score of any position in the per-base quality distribution is lower than 25, and a failure will be issued if it is lower than 20. For per-base base composition, a warning will be raised if the difference between A and T, or G and C, in any position is greater than 10%, and a failure will be issued if it is greater than 20%.

Figure 1: An example of QC complex statistics. (A) Per-base quality distribution of raw paired-end reads. (B) Per-base Q20 and Q30 of raw and preprocessed paired-end reads. (C) Per-base base composition distribution of raw paired-end reads.

In the step of preprocessing, undesirable terminal fragments are trimmed off, substandard reads are filtered out, and certain transform operations are applied. On both ends of reads, an assigned number of bases, or bases of quality lower than the threshold, will be trimmed off. Sequencing adapters can be aligned, with mismatches supported but no INDELs tolerated, and the read is cut from the adapter to the 3' end. Filtering can be performed on reads with adapter, short length, too many ambiguous bases, low average quality, or too many low-quality bases. Sequencing batches with unfavorable sequencing quality, such as a tile of an Illumina sequencer [39] or a fov (field of view) of a BGI sequencer [40], can be assigned so that the corresponding sequences will be discarded. In addition, reads with identical nucleotides can be deduplicated to keep only 1 copy. Transformation comprises quality system conversion, interconversion between DNA and RNA, compression of output with gzip, etc. Additional file 3 lists the above preprocessing functions and their parameters.

Module design

To improve processing performance for different types of data, 4 modules are specialized in SOAPnuke: the General, DGE, sRNA, and Meta modules. (1) The General module can handle most DNA re-sequencing datasets, as described in the QC and preprocessing section. (2) DGE profiling generates a single-end read that has a “CATG” segment neighboring the targeted sequences of 17 base pairs [41]. By default, the DGE module will find the targeted segment and trim off other parts. Moreover, reads with ambiguous bases will be filtered. (3) The sRNA module incorporates filtering of poly-A tags, as polyadenylation is a feature of mRNA data and sRNA sequences can be contaminated by mRNA during sample preparation [42]. (4) The Metagenomics preprocessing module customizes a few functions from the General module for trimming adapters and low-quality bases on both ends and for dropping reads that are too short or have too many ambiguous bases. Detailed parameter settings can be accessed in Additional file 3.

Software features

SOAPnuke is written in C++ for good scalability and performance, and it can be run on both Linux and Windows platforms. Two parallelization strategies are implemented for acceleration. Multithreading is developed for standalone operation. Data are cut into blocks of fixed size, and each block is processed by 1 thread. This design utilizes multiple cores in a working node. In SOAPnuke, the creation and allocation of threads are managed by a threadpool library, which decreases the overhead of creating and destroying threads. More importantly, Hadoop MapReduce is applied to achieve rapid processing in multinode clusters for ultra-large-scale data. In the mapping phase, each read is kept as a key-value pair, where the key is the readID and the value denotes the sequence and quality scores. In the shuffle phase, the key-value pairs are sorted, and each pair of paired-end reads is gathered. During the reducing phase, blocks of fixed size are processed by various threads of multiple nodes, and each block generates an individual result. After that, it is optional to merge the results into integrated fastq files.

To prove the effectiveness of the acceleration design, we have conducted a performance test on SOAPnuke and other alternative tools. A ∼30× human genome dataset published by GIAB [43] was extracted as testing data (see Additional file 4). In terms of the computing environment, up to 16 nodes were used, each of which has 24 cores of Intel(R) Xeon(R) CPU E5–2620 v4 @ 2.10 GHz and RAM of 128 G. SOAPnuke operations for testing were set as described in published manuscripts (see the reference list in Additional file 5). Trimming adapters and filtering on length and quality were selected for their universality. We chose other workflow-like tools capable of performing these functions, which are Trimmomatic (Trimmomatic, RRID:SCR_011848) [27], AfterQC [30], BBDuk [31], and AlienTrimmer [36]. The parameter setting is also available in Additional file 4.

Results

In the performance test, we chose 3 indexes for evaluation: elapsed time, CPU usage, and maximum RAM usage. As shown in Table 1, AfterQC is the tool occupying the fewest resources. However, its processing time is too long for practical usage, especially considering that we ran the program with pypy, which is reported to be 3 times as fast as standard Python. Among the remaining tools, SOAPnuke struck an appropriate balance between resource occupancy and performance. Furthermore, users can choose to run SOAPnuke on multiple nodes with the MapReduce framework if high-throughput performance is demanded. In our testing, 16 nodes achieved ∼32 times acceleration compared with standalone operation, which is 5.37 times the highest speed of the 4 other tested tools.

Table 1: Evaluation of the data processing performance across SOAPnuke and 4 other tools

Tools                          Time, min   Throughput, reads/s   CPU, %   Max RAM, GB
SOAPnuke (1 node, 1 thread)    302.7       33,947.8              250      0.62
SOAPnuke (16 nodes)            9.4         1,093,191.1           640      50.10
Trimmomatic (1 thread)         84.7        121,380.1             75       2.98
Trimmomatic (24 threads)       50.5        203,582.1             239      10.28
BBDuk                          57.2        162,230.2             259      11.40
AlienTrimmer                   530.2       19,076.1              99       0.54
AfterQC (pypy)                 2482.7      4,319.1               99       0.21

Time, throughput, CPU, and maximum memory occupation are presented. For CPU usage, 100% means full load of a single CPU core. Maximum RAM usage means the highest occupancy of RAM during the whole processing.

After the preprocessing, downstream analyses were performed with the GATK (GATK, RRID:SCR_001876) best practice pipeline (see the description of GATK best practices) [44]. Data were processed by the alignment, rmDup, baseRecal, bamSort, and haplotypeCaller modules in order. For the haplotypeCaller, GIAB high-confidence small variant and reference calls v3.3.2 [45] were used as the gold standard. Details of this testing are available in Additional file 4.

As seen in Table 2, AfterQC achieves the best variant calling result. The F-measures of SOAPnuke and Trimmomatic are the same, and are slightly lower than those of AfterQC. AlienTrimmer performs slightly worse, and BBDuk has the worst result, with an INDEL calling result that differs greatly from that of the other tools. In summary, though the variant calling result of AfterQC is optimal, it is not worth considering because of its long processing time. Among the remaining tools, SOAPnuke and Trimmomatic tie for first place.

Discussion and Conclusion

Data quality is critical to downstream analysis, which makes it important to use reliable tools for preprocessing. To omit unnecessary input/output and computation, a workflow-like structure is adopted in SOAPnuke, where QC and preprocessing functions are integrated within an executable program. Compared with most workflow-like tools, such as PrinSeq [6] and RObiNA [22], SOAPnuke adds statistics of preprocessed data for better understanding of data. To cope with datasets generated from different experiments, 4 modules are predefined with tailored functions and parameters. In terms of acceleration approach, multithreading is the sole method adopted by existing tools [14–16, 24–28], but it is only applicable to single-node operations. SOAPnuke utilizes MapReduce to realize concurrent execution on multinode operations, where CPU cores of multiple nodes can be involved in a single task. It improves the scalability of parallel execution and the applicability to mass data. SOAPnuke also includes multithreading for standalone computing. Our test results indicate that SOAPnuke can achieve ∼5.37 times the maximum speed of other tools with multithreading. It is worth mentioning that processing speed is not directly proportional to the number of working nodes, because some procedures like initialization of MapReduce cannot be accelerated as nodes increase, and the burden of communication between nodes grows as well.

In future work, we will continue adding functions to feature modules. For example, in the preprocessing of DGE datasets, filtering out singleton reads is frequently included [46–48]. For the sRNA module, screening out reads based on alignment with noncoding RNA databases (such as tRNA, rRNA, and snoRNA) [49, 50] is under development. Adding statistics such as per-read quality distribution and length distribution is also worth consideration. To users without a computing cluster, SOAPnuke might not be an optimal tool in terms of overall performance. Thus, we are performing refactoring to increase the standalone processing speed.

However, we have found 2 problems worth exploring regarding QC and preprocessing. First, in terms of preprocessing, it is difficult to choose optimal parameters for a specific dataset. Datasets from the same experiments and sequencers tend to share features, so users always select the same parameters for those similar data. The parameters are initially defined based on experiments on a specific dataset or just experience, which may already introduce some error and bias. Moreover, even if the parameters are optimal for the tested dataset, they are possibly inappropriate for other data because of random factors. Thus, the current method is a compromise. A solution worth considering is to adjust preprocessing settings automatically during the processing. Second, some of the QC statistics are of limited help in judging the usability of data. For example, as the threshold of filtering out low-quality reads is increased from 0 to 40, the mean quality of all reads or each position will rise accordingly, and the result of variant calling will improve at the very beginning but then get worse. This is because preprocessing is a procedure required to strike a balance between removing noise and keeping useful information, while single QC statistics cannot reflect the global balance. A comprehensive list of QC statistics in SOAPnuke can help solve the problem, as raising the threshold of mean quality alone beyond the balance point might make other statistics worse. Thus, it is worthwhile to explore ways to comprehensively analyze all statistics to evaluate the effect of preprocessing. Currently, this procedure is performed empirically by users. In our future work, these 2 problems will be considered for the development of updated versions.

Availability and requirements

Project name: SOAPnuke
Project home page: https://github.com/BGI-flexlab/SOAPnuke
RRID:SCR_015025
Operating system(s): Linux, Windows
Programming language: C++
Requirements: libraries: boost, zlib, log4cplus, and openssl; R
License: GPL

Availability of supporting data

Snapshots of the code and test data are also stored in the GigaScience repository, GigaDB [51].

Abbreviations

DGE: digital gene expression; HTS: high-throughput sequencing; QC: quality control; sRNA: small RNA.
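
Supplementary sketch: QC threshold checks

The warning and failure rules stated in the Methods (median per-base quality below 25 or 20; A/T or G/C difference above 10% or 20%) can be illustrated with a short Python sketch. This is not SOAPnuke code, which is written in C++. The function names, the input layout, and the reading of "difference" as percentage points of base composition are assumptions made for illustration only.

```python
from statistics import median


def check_per_base_quality(quality_by_position):
    """Classify a dataset from per-position quality scores.

    quality_by_position: one list of Phred scores per read position
    (hypothetical layout). Thresholds follow the text: fail below 20,
    warn below 25, judged on the lowest per-position median.
    """
    lowest_median = min(median(scores) for scores in quality_by_position)
    if lowest_median < 20:
        return "fail"
    if lowest_median < 25:
        return "warn"
    return "pass"


def check_base_composition(composition_by_position):
    """Classify a dataset from per-position base composition.

    composition_by_position: one dict per position with the percentage
    of A, C, G, and T (hypothetical layout). The difference is taken in
    percentage points, which is an assumption.
    """
    worst = 0.0
    for comp in composition_by_position:
        worst = max(worst,
                    abs(comp["A"] - comp["T"]),
                    abs(comp["G"] - comp["C"]))
    if worst > 20:
        return "fail"
    if worst > 10:
        return "warn"
    return "pass"
```

Both checks report the worst position, matching the "any position" wording of the thresholds.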
Table 2: Variant calling result of SOAPnuke and other 4 tools

Tools          SNPs precision   SNPs sensitivity   SNPs F-measure   INDELs precision   INDELs sensitivity   INDELs F-measure
SOAPnuke       0.9967           0.9811             0.9888           0.9806             0.9575               0.9689
Trimmomatic    0.9966           0.9811             0.9888           0.9806             0.9575               0.9689
BBDuk          0.9966           0.9797             0.9881           0.9698             0.9184               0.9434
AlienTrimmer   0.9954           0.9810             0.9882           0.9792             0.9540               0.9665
AfterQC        0.9968           0.9811             0.9889           0.9811             0.9586               0.9697

F-measure is a measure considering both the precision and recall of the variant calling result. SNP and INDEL are 2 main categories of variants.

Author contributions

L.F. and Q.C. conceived the project. Yuxin C. and C.S. conducted the survey on existing tools for QC and preprocessing. Yuxin C., Yongsheng C., C.S., Z.H., Y.Z., S.L., J.Y., Z.L., X.Z., J.W., H.Y., L.F., and Q.C. provided feedback on features and functionality. Yongsheng C., Z.H., and S.L. wrote the standalone version of SOAPnuke. Yuxin C. wrote the MapReduce version of SOAPnuke. Yuxin C. and Z.H. performed the above-mentioned test. Yuxin C., Y.L., C.Y., and L.F. wrote the manuscript. All authors read and approved the final manuscript.

Additional files

Supplementary Material 1: Comparison of features and functions of various tools for QC and preprocessing (XLSX 41 kb).
Supplementary Material 2: Details of QC in SOAPnuke (PDF 304 kb).
Supplementary Material 3: Details of preprocessing in SOAPnuke (PDF 1.6 mb).
Supplementary Material 4: Details of preprocessing performance test and downstream analyses (DOCX 38 kb).
Supplementary Material 5: Details of research involving SOAPnuke (XLSX 12 kb).

Competing interests

The authors declare that they have no competing interests.

Open access

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Acknowledgements

This research was supported by Collaborative Innovation Center of High Performance Computing, the Critical Patented Project of the Science and Technology Bureau of Fujian Province, China (Grant No. 2013YZ0002–2), and the Joint Project of the Natural Science and Health Foundation of Fujian Province, China (Grant No. 2015J01397).

References

1. Fox S, Filichkin S, Mockler TC. Applications of ultra-high-throughput sequencing. Methods Mol Biol 2009;553:79–108.
2. Soon WW, Hariharan M, Snyder MP. High-throughput sequencing for biology and medicine. Mol Syst Biol 2014;9(1):640.
3. Stephens ZD, Lee SY, Faghri F et al. Big data: astronomical or genomical? PLoS Biol 2015;13(7):e1002195.
4. Guo Y, Ye F, Sheng Q et al. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinformatics 2014;15(6):879–89.
5. Zhou X, Rokas A. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies. Mol Ecol 2014;23(7):1679–700.
6. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.
7. Moxon S, Schwach F, Dalmay T et al. A toolkit for analysing large-scale plant small RNA datasets. Bioinformatics 2008;24(19):2252–3.
8. Gordon A, Hannon GJ. Fastx-toolkit. FASTQ/A short-reads preprocessing tools. http://hannonlab.cshl.edu/fastx_toolkit. Accessed 1 November 2017.
9. Cox MP, Peterson DA, Biggs PJ. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.
10. Zhang T, Luo Y, Liu K et al. BIGpre: a quality assessment package for next-generation sequencing data. Genomics Proteomics Bioinformatics 2011;9(6):238–44.
11. Aronesty E. ea-utils: Command-Line Tools for Processing Biological Sequencing Data. Durham, NC: Expression Analysis; 2011.
12. Yang X, Liu D, Liu F et al. HTQC: a fast quality control toolkit for Illumina sequencing data. BMC Bioinformatics 2013;14(1):33.
13. Li H. seqtk: toolkit for processing sequences in FASTA/Q formats. https://github.com/lh3/seqtk. Accessed 1 March 2017.
14. Zhou Q, Su X, Wang A et al. QC-Chain: fast and holistic quality control method for next-generation sequencing data. PLoS One 2013;8(4):e60234.
15. Zhou Q, Su X, Jing G et al. Meta-QC-Chain: comprehensive and fast quality control method for metagenomic data. Genomics Proteomics Bioinformatics 2014;12(1):52–56.
16. Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 2012;7(2):e30619.
17. Simon A. FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Accessed 1 November
18. Schmieder R, Lim YW, Rohwer F et al. TagCleaner: identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics 2010;11(1):341.
19. Falgueras J, Lara AJ, Fernandez-Pozo N et al. SeqTrim: a high-throughput pipeline for preprocessing any type of sequence reads. BMC Bioinformatics 2010;11(1):38.
20. St John J. SeqPrep: tool for stripping adaptors and/or merging paired reads with overlap into single reads. https://github.com/jstjohn/SeqPrep Accessed 1 November 2017.
21. Kong Y. Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics 2011;98(2):152–3.
22. Lohse M, Bolger AM, Nagel A et al. RobiNA: a user-friendly, integrated software solution for RNA-seq-based transcriptomics. Nucleic Acids Res 2012;40(W1):W622–7.
23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):pp–10.
24. Schubert M, Lindgreen S, Orlando L. AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Res Notes 2016;9(1):88.
25. Dodt M, Roehr JT, Ahmed R et al. FLEXBAR-flexible barcode and adapter processing for next-generation sequencing platforms. Biology (Basel) 2012;1(3):895–905.
26. Li YL, Weng JC, Hsiao CC et al. PEAT: an intelligent and efficient paired-end sequencing adapter trimming algorithm. BMC Bioinformatics 2015;16(Suppl 1):S2.
27. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
28. Sturm M, Schroeder C, Bauer P. SeqPurge: highly-sensitive adapter trimming for paired-end NGS data. BMC Bioinformatics 2016;17(1):208.
29. Jiang H, Lei R, Ding SW et al. Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics 2014;15(1):182.
30. Chen S, Huang T, Zhou Y et al. AfterQC: automatic filtering, trimming, error removing and quality control for fastq data. BMC Bioinformatics 2017;18(S3):80.
31. Bushnell B. BBMap: A Fast, Accurate, Splice-Aware Aligner. Berkeley, CA: Ernest Orlando Lawrence Berkeley National Laboratory; 2014.
32. Joshi NA, Fass JN. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files. https://github.com/najoshi/sickle. Accessed 1 November 2017.
33. Pertea G. fqtrim: trimming&filtering of next-gen reads. https://ccb.jhu.edu/software/fqtrim/. Accessed 1 November 2017.
34. Vince B. Scythe: a Bayesian adapter trimmer. https://github.com/vsbuffalo/scythe Accessed 1 March 2017.
35. Leggett RM, Clavijo BJ, Clissold L et al. NextClip: an analysis and read preparation tool for Nextera long mate pair libraries. Bioinformatics 2014;30(4):566–8.
36. Criscuolo A, Brisse S. AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics 2013;102(5–6):500–6.
37. Goecks J, Nekrutenko A, Taylor J et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010;11(8):R86.
38. Team RC. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2013.
39. Illumina. NextSeq 500 system overview. https://support.illumina.com/content/dam/illumina-support/courses/nextseq-system-overview/story_content/external_files/NextSeq500_System_Overview_narration.pdf Accessed 1 November 2017.
40. Huang J, Liang X, Xuan Y et al. A reference human genome dataset of the BGISEQ-500 sequencer. Gigascience 2017;6(5):1–9.
41. Zhang X, Hao L, Meng L et al. Digital gene expression tag profiling analysis of the gene expression patterns regulating the early stage of mouse spermatogenesis. PLoS One 2013;8(3):e58680.
42. Tam S, Tsao MS, McPherson JD. Optimization of miRNA-seq data preprocessing. Brief Bioinformatics 2015;16(6):950–63.
43. Zook JM, Catoe D, McDaniel J et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data 2016;3:160025.
44. GATK best practices. http://www.broadinstitute.org/gatk/guide/best-practices. Accessed 1 November 2017.
45. NISTv3.3.2, NA12878 high-confidence variant calls as a gold standard. GIAB. 2017. ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/. Accessed 1 November 2017.
46. Zhang X, Hao L, Meng L et al. Digital gene expression tag profiling analysis of the gene expression patterns regulating the early stage of mouse spermatogenesis. PLoS One 2013;8(3):e58680.
47. Zhou L, Chen J, Li Z et al. Integrated profiling of microRNAs and mRNAs: microRNAs located on Xq27.3 associate with clear cell renal cell carcinoma. PLoS One 2010;5(12):e15224.
48. Han Y, Zhang X, Wang W et al. The suppression of WRKY44 by GIGANTEA-miR172 pathway is involved in drought response of Arabidopsis thaliana. PLoS One 2013;8(11):e73541.
49. Hall AE, Lu WT, Godfrey JD et al. The cytoskeleton adaptor protein ankyrin-1 is upregulated by p53 following DNA damage and alters cell migration. Cell Death Dis 2016;7(4):e2184.
50. Surbanovski N, Brilli M, Moser M et al. A highly specific microRNA-mediated mechanism silences LTR retrotransposons of strawberry. Plant J 2016;85(1):70–82.
51. Chen Y, Chen Y, Shi C et al. Supporting data for “SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data.” GigaScience Database 2017. http://dx.doi.org/10.5524/100373.
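
Supplementary sketch: map, shuffle, and reduce over paired-end reads

The three MapReduce phases described under Software features (reads keyed by readID, pairs gathered in the shuffle, fixed-size blocks processed independently in the reduce) can be mimicked in plain Python. This is only a single-process stand-in and does not use the Hadoop API. The record layout, the "/1" and "/2" name suffixes, the Phred+33 offset, and the mean-quality filter with a threshold of 20 are all assumptions chosen for illustration, not SOAPnuke's actual behavior.

```python
from collections import defaultdict


def map_read(record):
    """Map phase: key each read by its readID.

    record is a hypothetical (name, sequence, quality) tuple whose name
    ends in /1 or /2 to mark the mate.
    """
    name, sequence, quality = record
    read_id, mate = name.rsplit("/", 1)
    return read_id, (int(mate), sequence, quality)


def shuffle(pairs):
    """Shuffle phase: sort by key and gather the mates of each read."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [(key, sorted(groups[key])) for key in sorted(groups)]


def mean_quality(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def reduce_block(block, min_mean_quality=20):
    """Reduce phase: keep read pairs whose mates both pass the filter."""
    kept = []
    for read_id, mates in block:
        complete = len(mates) == 2
        if complete and all(mean_quality(q) >= min_mean_quality
                            for _, _, q in mates):
            kept.append(read_id)
    return kept


def run(records, block_size=2):
    """Process fixed-size blocks independently, one result per block."""
    grouped = shuffle(map_read(record) for record in records)
    blocks = [grouped[i:i + block_size]
              for i in range(0, len(grouped), block_size)]
    return [reduce_block(block) for block in blocks]
```

Each block yields its own result list, mirroring the statement that every block generates an individual result that can optionally be merged afterwards.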

Journal

GigaScience, Oxford University Press

Published: Jan 1, 2018
