Hybrid-denovo: a de novo OTU-picking pipeline integrating single-end and paired-end 16S sequence tags

Hybrid-denovo: a de novo OTU-picking pipeline integrating single-end and paired-end 16S sequence... Background: Illumina paired-end sequencing has been increasingly popular for 16S rRNA gene-based microbiota profiling. It provides higher phylogenetic resolution than single-end reads due to a longer read length. However, the reverse read (R2) often has significant low base quality, and a large proportion of R2s will be discarded after quality control, resulting in a mixture of paired-end and single-end reads. A typical 16S analysis pipeline usually processes either paired-end or single-end reads but not a mixture. Thus, the quantification accuracy and statistical power will be reduced due to the loss of a large amount of reads. As a result, rare taxa may not be detectable with the paired-end approach, or low taxonomic resolution will result in a single-end approach. Results: To have both the higher phylogenetic resolution provided by paired-end reads and the higher sequence coverage by single-end reads, we propose a novel OTU-picking pipeline, hybrid-denovo, that can process a hybrid of single-end and paired-end reads. Using high-quality paired-end reads as a gold standard, we show that hybrid-denovo achieved the highest correlation with the gold standard and performed better than the approaches based on paired-end or single-end reads in terms of quantifying the microbial diversity and taxonomic abundances. By applying our method to a rheumatoid arthritis (RA) data set, we demonstrated that hybrid-denovo captured more microbial diversity and identified more RA-associated taxa than a paired-end or single-end approach. Conclusions: Hybrid-denovo utilizes both paired-end and single-end 16S sequencing reads and is recommended for 16S rRNA gene targeted paired-end sequencing data. Keywords: microbiome; OTU picking; 16S rRNA Introduction used to profile microbiota. Identifying related groups of organ- isms known as operational taxonomic units (OTUs) remains The microbiome plays an important role in global ecology, a central part of the analysis of microbiome data. Both de nutrient cycling, and disease [1]. Targeted sequencing of the novo and reference-based approaches have been proposed for hypervariable region of the 16S rRNA gene is now routinely Received: 23 October 2017; Revised: 28 November 2017; Accepted: 7 December 2017 The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 Chen et al. processing 16S rDNA reads—each with complementary strengths and weaknesses. De novo OTU-picking naively clus- ters reads based on sequence similarity. It has the advantages of not requiring any prior knowledge or reference about the target molecule, and produces OTU groupings that are more naturally aligned to the data. However, de novo approaches require com- parison of the same gene region. Reference-based approaches can get around this limitation, but rely on a preexisting set of OTU representatives that may or may not be appropriate for a particular dataset [2]. To perform a de novo approach, one of the challenges pre- sented by Illumina paired-end reads is that the reverse read (R2) often has a much lower base quality than the forward read (R1). For the 16S datasets generated at the Mayo Clinic Core Facility, only 24% of R2s passed quality control (QC) between 2013 and 2015, as opposed to 83% for R1s (Supplementary Fig. 1). We are then left with a smaller set of high-fidelity paired-end reads (R1- R2) and a deeper set of single-end reads (R1). Thus, we would have to choose between the more accurate taxonomic identifi- cation using R1-R2 or improved detection of rare taxa using R1 [3]. To integrate information from both paired-end and single- end reads, we propose hybrid-denovo, a pipeline that combines paired-end and single-end reads in order to retain the advan- tages of de novo OTU-picking while maximizing the ability to de- tect rare taxa. Methods Hybrid-denovo first constructs an OTU backbone using only paired-end reads. The remaining single-end reads (R1) are mapped to the OTU backbone, creating new OTUs if unmapped (Fig. 1 A). The same quality control and OTU-picking process as implemented in IM-TORNADO is used to build the OTU back- bone [3]. Specifically, quality filtering was performed using Trim- momatic [4] with a hard cutoff of PHRED score Q3 for 5 and 3 ends of the reads, trimming the 3 end with a moving aver- age score of Q15, with a window size of 4 bases, and remov- ing any remaining reads shorter than 75% of the original read length. Reads with any ambiguous base calls were discarded. Surviving read pairs were further trimmed down to specified cutoffs to uniform the length of both reads, then concatenated and sorted by cluster size. Afterwards, a de novo OTU-picking was conducted via UPARSE algorithm [5,6]. Though the UPARSE algo- rithm has performed de novo chimera removal, we additionally used UCHIME [7] to perform a reference-based chimera removal against the GOLD database [8] resulting in a set of high-quality OTU representatives. We then mapped the single-end R1s to the R1 end of the OTU representatives using USEARCH (if there are multiple hits with the same score, the most abundant one will be chosen by default). The remaining unmapped R1s were clus- tered into new OTUs via the UPARSE algorithm and added to the list of OTUs generated by the paired-end reads. Thus, the OTU representatives consist of a mixture of single-end and paired- end reads. We then aligned all the OTU representatives using the structure alignment algorithm Infernal trained on the Ri- bosomal Database Project’s (RDP’s) database [9,10]. OTU repre- sentatives that were not aligned were removed as they hypo- thetically represented nonbacteria. A phylogenetic tree was built from the aligned OTU representatives using FastTree [11]. Fast- Tree has little penalty on end-gaps, which is favorable when pro- cessing a mixture of single-end and paired-end reads. Finally, R1 and R2 reads were stitched together with ambiguous nucleotides (a string of Ns) in between and then assigned a taxonomy by Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 BC WUniFrac UniFrac Paired-end reads A B C 1.00 1.0 R1 R2 FastTree OTU-picking R1 R2 0.75 0.8 R1 R2 ...... OTU1 OTU2 OTU3 OTU4 R1 R1 R2 R2 R1 R2 R1 0.50 Paired 0.6 Hybrid Single-end reads R1 R1 0.25 R1 0.4 R1 0.00 ...... 25% 50% 75% 25% 50% 75% 25% 50% 75% 25% 50% 75% Percent R2 remaining Percent R2 remaining Figure 1: Overview and evaluation of the hybrid-denovo approach. A, hybrid-denovo illustration. B, Mantel correlation of β-diversity distance matrices (unweighted UniFrac, weighted UniFrac, and Bray-Curtis distance) with the gold standard for the 3 approaches at different percentages of good-quality R2 reads. Error bars represent standard errors of the estimate based on 100 bootstrap samples. C, Boxplot of correlations of the relative abundances of 56 prevalent genera with the gold standard. Mapping Mapping Mapping New OTU Correlation with gold standard Correlation with gold standard Hybrid-denovo 3 mothur QIIME Hybrid (100%) Hybrid (75%) Hybrid (50%) Hybrid (25%) 1.0 QIIME 0 1 0.8 30 3 Hybrid (50%) mothur 0.6 Other Genus (<0.5%) Unclassified_Genus Firmicutes;Ruminococcus 0.4 Firmicutes;Roseburia Firmicutes;Oscillospira Firmicutes;Megamonas Firmicutes;Faecalibacterium Firmicutes;Dorea Firmicutes;Coprococcus Firmicutes;Clostridium 0.2 Firmicutes;Christensenella Firmicutes;Blautia Bacteroidetes;Prevotella Bacteroidetes;Paraprevotella Bacteroidetes;Parabacteroides Bacteroidetes;Odoribacter Bacteroidetes;Bacteroides 0.0 Figure 2: Comparison of mothur, QIIME, and hybrid-denovo on genus-level profiles. Hybrid-denovo is run on data sets with different percentages of good-quality R2 reads (100%, 75%, 50%, and 25%). Each column represents the microbiota profile of an individual averaged over all replicates. The overlaps of detected gener a between the 3 pipelines are shown in the Venn diagram. the RDP classifier [ 12] trained on the Greengenes database [13] comparable to that of hybrid-denovo. As we created good-quality OTUs not classified as bacteria and singleton OTUs were re- reads by using Trimmomatic, we reduced potential variation in moved as they were presumed contaminants. Note that this step performance between pipelines by not applying additional read may have lost diversity that is not represented in the database QC filters. An RDP classifier trained on Greengenes v13.5 was and is a tradeoff between accuracy and completeness. The com- used to classify reads for all pipelines. Singletons and nonbacte- plete workflow of our pipeline is given in Supplementary Fig. 2. ria OTUs (based on taxonomy) were filtered out. The major dif- To validate our approach, we created a gold standard data ferences between the 3 pipelines in addition to the commands set with high-quality paired-end reads based on the 837 high- used to reproduce the results are documented in Supplemen- coverage human fecal samples sequenced at the Mayo Core Fa- tary Note 1. We assessed performance by investigating (1) the cility (V3-V5 16S amplicon, 694 nt, 357F/926R primers) [14]. These number of detected genera and percentage of unclassified reads fecal samples were collected from 20 subjects using 6 different at the genus level, (2) Mantel correlation using Bray-Curtis (BC) methods (no additive, RNAlater, 70% ethanol, EDTA, dry swab, matrices, and (3) the intraclass correlation coefficient (ICC) for and fecal occult blood test [FOBT]). The samples were immedi- these core OTUs and genera observed in more than 90% of the ately frozen or stored at room temperature for 4 days to study samples. ICC is a measure of the correlation between the tech- the stability of the microbiota. Each condition had 2–3 tech- nical replicates. A high value indicates less measurement error. nical replicates to assess the reproducibility. We ran Trimmo- ICC was calculated using the R ICC package [17]. matic [4] for quality control and trimmed R1s down to 250 bp Finally, we demonstrated the performance of the proposed and R2s down to 200 bp to ensure high base quality, resulting method on a data set from the study of the stool microbiome in nonoverlapping paired-end reads. For each sample, we re- of rheumatoid arthritis (RA) patients, which consists of 40 RA trieved 8000 high-quality paired-end reads. We then performed patients and 49 controls (V3-V5 16S amplicon, 694 nt) [18]. We OTU-picking and taxonomy assignment based on these paired- applied DESeq2 to the taxa count data for differential abundance end reads using IM-TORNADO. These resulting OTUs and their analysis [19] and compared the RA-associated OTUs and genera associated taxonomy constitute the gold standard data set. We recovered by different approaches. then subset the gold standard with 25%, 50%, and 75% of the R2 reads remaining. The 3 sub–data sets represented differ- ent levels of R2 quality encountered in practice. We compared Results hybrid-denovo with de novo approaches based on single-end R1s or paired-end reads using the sub–data sets. Performance was The correlation of microbial β-diversity with the gold stan- evaluated by calculating the Spearman’s correlation with the dard was generally high for all the 3 approaches (Fig. 1B). How- gold standard in terms of microbial β-diversity (unweighted and ever, the approach based on single-end R1 tended to have a weighted UniFrac, and Bray-Curtis distance) and genus-level rel- lower correlation when BC distance was used (the single-end R1 ative abundances. approach was invariant to the number of R2s). The paired-end We also compared our pipeline with QIIME and mothur (ver- approach, on the other hand, had a much lower correlation sions 1.8.0 and 1.39.3, respectively) [15,16] on the gold standard for unweighted UniFrac when only 25% of R2s remained. This data set. As QIIME and mothur currently do not support de novo was due to the fact that unweighted UniFrac captures com- OTU-picking based on nonoverlapping reads, we ran QIIME and munity membership, which is contributed mainly by rare taxa, mothur on the R1 reads. Parameter settings were chosen to be and many rare taxa are no longer detectable by the paired-end Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Proportion 4 Chen et al. AB Figure 3: Comparison of mothur, QIIME, and hybrid-denovo on intra-class correlation coefficients (ICCs) of the core genera (A) and OTUs (B). ICCs are calculated based on the technical replicates for 6 different fecal collection methods. Hybrid-denovo is run on data sets with different percentages of good-quality R2 reads (100%, 75%, 50%, and 25%). approach due to loss of reads. In contrast, Hybrid-denovo was very ple, application of hybrid-denovo to the data set with 50% good- robust and had the best or close to the best correlation with the quality R2 reads yielded a total of 110 genera, compared with 70 gold standard in both diversity measures. For weighted UniFrac and 84 for QIIME and mothur, respectively (Fig. 2, upper right, distance, the correlation was similarly high for all the 3 meth- Venn diagram). Using BLAST on the paired-end counterparts of ods as the weighted UniFrac is most influenced by dominant the QIIME and mothur-specific genera (classified based on R1 taxa and all the methods quantify these dominant taxa very well reads) against the Greengenes database re-assigns many of the (Fig. 1B). reads to other genera. This indicates that those genera were We next studied the performance of taxonomic profiling of probably misclassified due to shorter reads. Though the genus- the proposed approach. Based on the 56 genera with a preva- level microbiota profiles for the 20 subjects were similar for all lence greater than 10%, hybrid-denovo had a much higher corre- the pipelines (Fig. 2), hybrid-denovo had a much lower proportion lation with the gold standard across all scenarios considered, of reads with unknown genus identity (5%) than mothur and QI- and its performance was not very sensitive to the percentage IME (14% and 18%, respectively). Taken together, these observa- of R2s remaining (Fig. 1C). In contrast, the performance of the tions demonstrated that hybrid-denovo had increased taxonomic paired-end approach depends strongly on the R2 quality and had resolution due to the use of longer reads. Interestingly, all the a much lower correlation when R2 quality was low. The single- pipelines could yield similar intersample relationships as mea- end R1 approach was invariant to the number of R2s expected sured by Mantel correlation coefficients based on Bray-Curtis and performed better than the paired-end approach only when distance matrices (Table 1). The availability of technical repli- R2 quality was low. Supplementary Fig. 3 showed the individ- cates of the data set allows us to compare different pipelines us- ual genus correlations. For the single-end approach, 2 genera ing intraclass correlation coefficients. A high ICC indicates less showed 0 correlation with the gold standard because all of their variability introduced by the bioinformatics pipeline. We calcu- R1 reads were reclassified at the family level due to their short lated the ICCs for different fecal collection methods for the core length (Lachnobacterium mapped to Ruminococcaceae and Erwinia OTUs and genera, which occurred in more than 90% of the sam- mapped to Enterobacteriaceae), indicating the increased phylo- ples. Our pipeline generally had higher ICCs (less variation be- genetic resolution using paired-end reads. For the paired-end tween technical replicates) than mothur and QIIME (Fig. 3). In approach, genera with low abundance exhibited a lower corre- contrast, mothur and QIIME did not perform as well on the core lation, indicating the decreased quantification accuracy due to OTUs and genera, respectively. loss of paired-end reads. We also applied our method to a data set from an RA study We also compared hybrid-denovo with mothur and QIIME, the [18], where about 40% of R2s were discarded after quality control 2 predominant pipelines for 16S data, based on the gold standard (Supplementary Table 1). Hybrid-denovo resulted in the largest data set. Mothur and QIIME took around 24 and 6 hours, respec- number of OTUs and genera, as expected (Fig. 4A), and cov- tively, to complete the analysis of the gold standard dataset (n = ered all genera from the paired-end approach and the major- 837), compared with around 1 hour for our pipeline. Mothur and ity genera from the single-end R1 approach (Fig. 4C). Among QIIME produced a total of 4599 and 2898 nonsingleton OTUs, re- the 5 R1-specific genera, Bacteria Firmicutes Clostridia Clostridi- spectively, while hybrid-denovo produced 1094, 1086, 1079, and ales Clostridiaceae 02d0 and Bacteria Firmicutes Clostridia Clostridi- 1049 nonsingleton OTUs on data sets with different percent- ales Clostridiaceae Sarcina were reclassified to Bacteria Firmicutes ages of good-quality R2 reads (100%, 75%, 50%, and 25%). Though Clostridia Clostridiales Clostridiaceae Clostridium when their paired- our pipeline resulted in a smaller number of OTUs, we detected end counterparts were used, indicating that the R1-specific gen- a larger number of genera than mothur and QIIME. For exam- era were misclassified due to their short read length. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Hybrid-denovo 5 Table 1: Mantel correlations of intersample distances between QIIME, mothur and hybrid-denovo. Mothur QIIME Hybrid (100%) Hybrid (75%) Hybrid (50%) Hybrid (25%) Mothur - <0.001 <0.001 <0.001 <0.001 <0.001 QIIME 0.884 - <0.001 <0.001 <0.001 <0.001 Hybrid (100%) 0.986 0.879 - <0.001 <0.001 <0.001 Hybrid (75%) 0.973 0.909 0.985 - <0.001 <0.001 Hybrid (50%) 0.973 0.928 0.982 0.984 - <0.001 Hybrid (25%) 0.955 0.949 0.960 0.980 0.985 - Bray-Curtis distance matrices on the OTU data are used. Hybrid-denovo is run on data sets with different percentages of good quality R2 reads (100%, 75%, 50% and 25%). Top right: Mantel correlation P-value based on 1000 permutation; bottom left: Mantel correlation coefficients. A B #Genus #OTU #Genus #OTU CD C D Figure 4: Comparison of the R1, Paired, and Hybrid approaches on the RA data set. A, Number of detected OTUs (red) and genera (blue). B, Number of significant OTUs (red) and genera (blue) from differential abundance analysis (FDR ≤ 0.01). C, Venn diagram of the genera detected. D, Venn diagram of significant genera from differential abundance analysis. Apart from the comparison of the detected genera, we also 32 and 34 for the paired-end and single-end R1 approaches demonstrated the advantage of hybrid-denovo in the context of (Fig. 4B). There were 20 significant genera shared by all 3 meth- differential abundance analysis using DESeq2 [19]. We excluded ods (Fig. 4D), many of which were reported by previous studies OTUs that occurred in less than 10% samples from testing. A to- [18,20,21]. For example, Bacteroides is enriched in control sam- tal of 758, 578, and 393 OTUs were tested using hybrid-denovo, ples, while Collinsella, Eggerthella, Prevotella,and Clostridium are paired, and R1 approaches, respectively. Due to higher read enriched in RA samples. Even though the total number of dif- counts and increased phylogenetic resolution, hybrid-denovo re- ferential genera were similar for all the methods, hybrid-denovo covered more differential OTUs (Fig. 4B). We identified a to- identified the most genera (n = 11) that were shared by either 1 of tal of 126 significant OTUs at an FDR-adjusted P value of 0.01 the other 2 approaches, compared with 6 and 9 for the paired- compared with 93 and 80 OTUs for paired-end and single-end end and single-end R1 approaches, indicating that the hybrid- R1 approaches, respectively. Since different methods had their denovo approach was able to identify differential genera that own definition of OTUs and direct comparison of the differ- were otherwise missed by either the paired-end or single-end R1 ential OTUs is difficult, we instead compared the genus iden- approach. Furthermore, hybrid-denovo had the lowest number of tity of the identified OTUs. The differential OTUs identified by method-specific genera (n = 2) in contract to paired-end (n = 6) hybrid-denovo were classified into 33 genera, in comparison with and R1 single-end (n = 5). The method-specific genera might be Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Chen et al. less reliable due to lack of support from other methods. For ex- to unknown sequencing errors. Improving the diversity estimate ample, the R1 approach found Veillonella to be enriched in control from hybrid-denovo will be the focus of our future work. samples, which conflicts with a previous study [ 20]. Interest- ingly, among the 2 hybrid-denovo-specific genera, Klebsiella,which Competing interests was enriched in healthy people, was reported by Zhang et al. [21]. The authors declare that they have no competing interests. Discussion Abbreviations We proposed hybrid-denovo for de novo OTU-picking based on OTU: operational taxonomic unit; QC: quality control; RDP: ribo- paired-end 16S sequence tags. Through simulations and real somal database project; BC: Bray-Curtis; ICC: intra-class correla- data examples, we showed that our approach had better per- tion coefficient; RA: rheumatoid arthritis. formance than the single-end or paired-end approach in quan- tifying the microbial diversity and taxonomic abundance, due to the full use of the information in the paired-end reads. Acknowledgements Based on the size of 16S amplicons and the length of the This study was supported by the Mayo Clinic Center for Individ- paired-end reads, we could have overlapping or nonoverlap- ualized Medicine. ping paired-end reads. For example, sequencing of the V4 region (252 nt, 515F/806R primers) produces overlapping paired-end reads, while sequencing of the V3-V5 region (694 nt, F357/R926 Additional Files primers) results in nonoverlapping paired-end reads using Illu- Additional file 1: Supplementary Figure 1: Percentage of reads re- mina MiSeq (250 bp × 2). As QIIME and mothur currently do not maining after QC 2013-2015 in the Mayo Clinic Sequencing Core support de novo OTU-picking based on nonoverlapping paired- Facility. end reads, the main advantage of our pipeline lies in the abil- Additional file 2: Supplementary Figure 2: Hybrid-denovo wor- ity to process nonoverlapping paired-end reads. However, our fkow. pipeline could also be applied to overlapping paired-end reads Additional file 3: Supplementary Figure 3: Correlations of 54 by using PANDAseq [22] to stitch the paired-end reads together. prevalent genera (>10%) to the gold standard. It is noted that some existing pipelines could also process a mix- Additional file 4: Supplementary Figure 4: Comparison of ture of paired-end and single-end reads with different capac- DADA2, LotuS, and hybrid-denovo on genus-level profiles. All ities. For example, the recently proposed LotuS pipeline uses pipelines are run on data sets with 100% good-quality R2 reads good-quality R1 reads to build OTUs, followed by a postcluster- (gold standard). Each column represents the microbiota profile ing merging of R1 and R2 to increase the accuracy of the taxon- of an individual averaged over all replicates. omy [23]. However, the OTU-level resolution is still determined Additional file 5: Supplementary Figure 5: Comparison of by R1 reads. DADA2, LotuS, and hybrid-denovo on intraclass correlation coef- There are new pipelines that have been developed for 16S ficients of the core genera (A) and OTUs (B). ICCs are calculated data. It is interesting to benchmark hybrid-denovo against these based on the technical replicates for 6 different fecal collection state-of-the-art pipelines. We selected DADA2 and LotuS [23,24] methods. All pipelines are run on data sets with 100% good- for comparison as they have been demonstrated to have an over- quality R2 reads (gold standard). all better performance than QIIME and mothur and have been in- Additional file 6: Supplementary Table 1:Numberofreadsfor creasingly used by the community. We repeated the same anal- the RA data set after quality control. ysis on the gold standard data set with complete read pairs. Additional file 7: Supplementary Note 1: Details of the steps The specific command lines used for DADA2 and LotuS are doc- and parameter settings used for comparing hybrid-denovo, QI- umented in Supplementary Note 1. DADA2 produced 18 389 IME, and mothur. Command lines to run the pipelines including sequence variants (SVs), while LotuS produced 472 OTUs. The DADA2 and LotuS are supplied for transparency. Mantel correlation on the OTU/SV-level Bray-Curtis distance is high between hybrid-denovo and LotuS (ρ = 0.93) but moderate between hybrid-denovo and DADA2 (ρ = 0.71). Interestingly, the Availability and requirements Mantel correlation on the genus-level Bray-Curtis distance is Project name: Hybrid-denovo (https://scicrunch.org/SCR 015866) high between all methods (ρ> 0.97), indicating that all meth- Project home page: http://bioinformaticstools.mayo.edu/ ods could produce similar genus-level profiles (Supplementary research/hybrid-denovo/ Fig. 4). Similar ICC analysis demonstrated that all the methods Operating system(s): Linux (centOS 6 is prefered) had relatively high ICCs, but hybrid-denovo had the overall the Programming language: Python 2.7, Java, and shell script best performance (Supplementary Fig. 5). Other requirements: QIIME and python libraries: biom- One problem for de novo OTU-picking is the potential in- format (v. 1.3.1), bitarray (v. 0.8.1), pyqi (v. 0.2.0), numpy (v. 1.8.1), flated OTU number, which could be due to sources such as se- and biopython (v. 1.66) quencing errors, chimera, and environmental contaminants [6]. License: Modified BSD In hybrid-denovo, we used various quality filtering criteria to re- Any restrictions to use by nonacademics: none. duce the number of spurious OTUs. For example, we applied Trimmomatic [4] to trim and remove reads with low base quality, removed reads with any ambiguous bases, removed singleton Availability of supporting data OTUs, used the Infernal package [9] to remove non-structurally aligned OTUs, and used reference-based UCHIME as an addi- The example files and additional data sets supporting the results tional chimera removal process [6]. However, even these filters of this article are available in the GigaScience Database [25], as might fall short of reducing the inflated diversity estimate due well as from the project home page. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Hybrid-denovo 7 REFERENCES high-throughput community sequencing data. Nat Methods 2010;7:335–6. 1. Cho I, Blaser MJ. The human microbiome: at the interface of 16. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hol- health and disease. Nat Rev Genet 2012;13(4):260–70. lister EB et al. Introducing mothur: open-source, platform- 2. McDonald D, Birmingham A, Knight R. Context and the hu- independent, community-supported software for describing man microbiome. Microbiome 2015;3:52. and comparing microbial communities. Appl Environ Micro- 3. Jeraldo P, Kalari K, Chen X, Bhavsar J, Mangalam A, White B biol 2009;75(23):7537–41. et al. IM-TORNADO: a tool for comparison of 16s reads from 17. Wolak ME, Fairbairn DJ, Paulsen YR. Guidelines for estimat- paired-end libraries. PLoS One 2014;9(12):e114804. ing repeatability. Methods Ecol Evol 2012;3(1):129–37. 4. Bolger MA, Lohse M, Usadel B. Trimmomatic: a flexi- 18. Chen J, Wright K, Davis JM, Jeraldo P, Marietta EV, Mur- ble trimmer for illumina sequence data. Bioinformatics ray J et al. An expansion of rare lineage intestinal microbes 2014;30(15):2114–20. characterizes rheumatoid arthritis. Genome Med 2016;8(1): 5. Edgar RC. Search and clustering orders of magnitude faster than blast. Bioinformatics 2010;26(19):2460–1. 19. Love MI, Huber W, Anders S. Moderated estimation of 6. Edgar RC. Uparse: highly accurate OTU sequences from mi- fold change and dispersion for RNA-seq data with DESeq2. crobial amplicon reads. Nat Methods 2013;10(10):996–8. Genome Biol 2014;15(12):550. 7. Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R. UCHIME 20. Scher JU, Sczesnak A, Longman RS, Segata N, Ubeda C, Biel- improves sensitivity and speed of chimera detection. Bioin- ski C et al. Expansion of intestinal Prevotella copri corre- formatics 2011;10(10): 27(16):2194–200. lates with enhanced susceptibility to arthritis. Elife 2013;5(2): 8. https://drive5.com/uchime/gold.fa. e01202. 9. Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: inference of 21. Zhang X, Zhang D, Jia H, Feng Q, Wang D, Liang D et al. RNA alignments. Bioinformatics 2009;25(10):1335–7. The oral and gut microbiomes are perturbed in rheuma- 10. Cole JR,WangQ,CardenasE,Fish J, ChaiB,FarrisRJetal. toid arthritis and partly normalized after treatment. Nat Med The ribosomal database project: improved alignments and 2015;21(8):895–905. new tools for rRNA analysis. Nucleic Acids Res 2009;37:D141– 22. Masella AP, Bartram AK, Truszkowski JM, Brown DG, Neufeld D145. JD. Pandaseq: paired-end assembler for illumina sequences. 11. Price MN, Dehal PS, Arkin AP. Fasttree 2–approximately BMC Bioinformatics 2012;13:31. maximum-likelihood trees for large alignments. PLoS One 23. Hildebrand F, Tadeo R, Voigt AY, Bork P, Raes J. LotuS: an effi- 2010;5(3):e9490. cient and user-friendly OTU processing pipeline. Microbiome 12. Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian 2014;2:30. classifier for rapid assignment of rRNA sequences into 24. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, John- the new bacterial taxonomy. Appl Environ Microbiol son AJ, Holmes SP. DADA2: high-resolution sample infer- 2007;73(16):5261–7. ence from Illumina amplicon data. Nat Methods 2016;13: 13. http://greengenes.secondgenome.com/. 581–3. 14. Sinha R, Chen J, Amir A, Vogtmann E, Shi J, Inman KS 25. Chen X, Johnson S, Jeraldo P, Wang J, Chia N, Kocher et al. Collecting fecal samples for microbiome analyses in JA, Chen J. Supporting data for “Hybrid-denovo: A de epidemiology studies. Cancer Epidemiol Biomarkers Prev novo OTU-picking pipeline integrating single-end and 2016;25(2):407–16. paired-end 16S sequence tags.” GigaScience Database 2017. 15. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, http://dx.doi.org/10.5524/100388. Bushman FD, Costello EK et al. QIIME allows analysis of Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png GigaScience Oxford University Press

Hybrid-denovo: a de novo OTU-picking pipeline integrating single-end and paired-end 16S sequence tags

Free
7 pages

Loading next page...
 
/lp/ou_press/hybrid-denovo-a-de-novo-otu-picking-pipeline-integrating-single-end-f74HZzQoH2
Publisher
BGI
Copyright
© The Author(s) 2018. Published by Oxford University Press.
eISSN
2047-217X
D.O.I.
10.1093/gigascience/gix129
Publisher site
See Article on Publisher Site

Abstract

Background: Illumina paired-end sequencing has been increasingly popular for 16S rRNA gene-based microbiota profiling. It provides higher phylogenetic resolution than single-end reads due to a longer read length. However, the reverse read (R2) often has significant low base quality, and a large proportion of R2s will be discarded after quality control, resulting in a mixture of paired-end and single-end reads. A typical 16S analysis pipeline usually processes either paired-end or single-end reads but not a mixture. Thus, the quantification accuracy and statistical power will be reduced due to the loss of a large amount of reads. As a result, rare taxa may not be detectable with the paired-end approach, or low taxonomic resolution will result in a single-end approach. Results: To have both the higher phylogenetic resolution provided by paired-end reads and the higher sequence coverage by single-end reads, we propose a novel OTU-picking pipeline, hybrid-denovo, that can process a hybrid of single-end and paired-end reads. Using high-quality paired-end reads as a gold standard, we show that hybrid-denovo achieved the highest correlation with the gold standard and performed better than the approaches based on paired-end or single-end reads in terms of quantifying the microbial diversity and taxonomic abundances. By applying our method to a rheumatoid arthritis (RA) data set, we demonstrated that hybrid-denovo captured more microbial diversity and identified more RA-associated taxa than a paired-end or single-end approach. Conclusions: Hybrid-denovo utilizes both paired-end and single-end 16S sequencing reads and is recommended for 16S rRNA gene targeted paired-end sequencing data. Keywords: microbiome; OTU picking; 16S rRNA Introduction used to profile microbiota. Identifying related groups of organ- isms known as operational taxonomic units (OTUs) remains The microbiome plays an important role in global ecology, a central part of the analysis of microbiome data. Both de nutrient cycling, and disease [1]. Targeted sequencing of the novo and reference-based approaches have been proposed for hypervariable region of the 16S rRNA gene is now routinely Received: 23 October 2017; Revised: 28 November 2017; Accepted: 7 December 2017 The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 Chen et al. processing 16S rDNA reads—each with complementary strengths and weaknesses. De novo OTU-picking naively clus- ters reads based on sequence similarity. It has the advantages of not requiring any prior knowledge or reference about the target molecule, and produces OTU groupings that are more naturally aligned to the data. However, de novo approaches require com- parison of the same gene region. Reference-based approaches can get around this limitation, but rely on a preexisting set of OTU representatives that may or may not be appropriate for a particular dataset [2]. To perform a de novo approach, one of the challenges pre- sented by Illumina paired-end reads is that the reverse read (R2) often has a much lower base quality than the forward read (R1). For the 16S datasets generated at the Mayo Clinic Core Facility, only 24% of R2s passed quality control (QC) between 2013 and 2015, as opposed to 83% for R1s (Supplementary Fig. 1). We are then left with a smaller set of high-fidelity paired-end reads (R1- R2) and a deeper set of single-end reads (R1). Thus, we would have to choose between the more accurate taxonomic identifi- cation using R1-R2 or improved detection of rare taxa using R1 [3]. To integrate information from both paired-end and single- end reads, we propose hybrid-denovo, a pipeline that combines paired-end and single-end reads in order to retain the advan- tages of de novo OTU-picking while maximizing the ability to de- tect rare taxa. Methods Hybrid-denovo first constructs an OTU backbone using only paired-end reads. The remaining single-end reads (R1) are mapped to the OTU backbone, creating new OTUs if unmapped (Fig. 1 A). The same quality control and OTU-picking process as implemented in IM-TORNADO is used to build the OTU back- bone [3]. Specifically, quality filtering was performed using Trim- momatic [4] with a hard cutoff of PHRED score Q3 for 5 and 3 ends of the reads, trimming the 3 end with a moving aver- age score of Q15, with a window size of 4 bases, and remov- ing any remaining reads shorter than 75% of the original read length. Reads with any ambiguous base calls were discarded. Surviving read pairs were further trimmed down to specified cutoffs to uniform the length of both reads, then concatenated and sorted by cluster size. Afterwards, a de novo OTU-picking was conducted via UPARSE algorithm [5,6]. Though the UPARSE algo- rithm has performed de novo chimera removal, we additionally used UCHIME [7] to perform a reference-based chimera removal against the GOLD database [8] resulting in a set of high-quality OTU representatives. We then mapped the single-end R1s to the R1 end of the OTU representatives using USEARCH (if there are multiple hits with the same score, the most abundant one will be chosen by default). The remaining unmapped R1s were clus- tered into new OTUs via the UPARSE algorithm and added to the list of OTUs generated by the paired-end reads. Thus, the OTU representatives consist of a mixture of single-end and paired- end reads. We then aligned all the OTU representatives using the structure alignment algorithm Infernal trained on the Ri- bosomal Database Project’s (RDP’s) database [9,10]. OTU repre- sentatives that were not aligned were removed as they hypo- thetically represented nonbacteria. A phylogenetic tree was built from the aligned OTU representatives using FastTree [11]. Fast- Tree has little penalty on end-gaps, which is favorable when pro- cessing a mixture of single-end and paired-end reads. Finally, R1 and R2 reads were stitched together with ambiguous nucleotides (a string of Ns) in between and then assigned a taxonomy by Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 BC WUniFrac UniFrac Paired-end reads A B C 1.00 1.0 R1 R2 FastTree OTU-picking R1 R2 0.75 0.8 R1 R2 ...... OTU1 OTU2 OTU3 OTU4 R1 R1 R2 R2 R1 R2 R1 0.50 Paired 0.6 Hybrid Single-end reads R1 R1 0.25 R1 0.4 R1 0.00 ...... 25% 50% 75% 25% 50% 75% 25% 50% 75% 25% 50% 75% Percent R2 remaining Percent R2 remaining Figure 1: Overview and evaluation of the hybrid-denovo approach. A, hybrid-denovo illustration. B, Mantel correlation of β-diversity distance matrices (unweighted UniFrac, weighted UniFrac, and Bray-Curtis distance) with the gold standard for the 3 approaches at different percentages of good-quality R2 reads. Error bars represent standard errors of the estimate based on 100 bootstrap samples. C, Boxplot of correlations of the relative abundances of 56 prevalent genera with the gold standard. Mapping Mapping Mapping New OTU Correlation with gold standard Correlation with gold standard Hybrid-denovo 3 mothur QIIME Hybrid (100%) Hybrid (75%) Hybrid (50%) Hybrid (25%) 1.0 QIIME 0 1 0.8 30 3 Hybrid (50%) mothur 0.6 Other Genus (<0.5%) Unclassified_Genus Firmicutes;Ruminococcus 0.4 Firmicutes;Roseburia Firmicutes;Oscillospira Firmicutes;Megamonas Firmicutes;Faecalibacterium Firmicutes;Dorea Firmicutes;Coprococcus Firmicutes;Clostridium 0.2 Firmicutes;Christensenella Firmicutes;Blautia Bacteroidetes;Prevotella Bacteroidetes;Paraprevotella Bacteroidetes;Parabacteroides Bacteroidetes;Odoribacter Bacteroidetes;Bacteroides 0.0 Figure 2: Comparison of mothur, QIIME, and hybrid-denovo on genus-level profiles. Hybrid-denovo is run on data sets with different percentages of good-quality R2 reads (100%, 75%, 50%, and 25%). Each column represents the microbiota profile of an individual averaged over all replicates. The overlaps of detected gener a between the 3 pipelines are shown in the Venn diagram. the RDP classifier [ 12] trained on the Greengenes database [13] comparable to that of hybrid-denovo. As we created good-quality OTUs not classified as bacteria and singleton OTUs were re- reads by using Trimmomatic, we reduced potential variation in moved as they were presumed contaminants. Note that this step performance between pipelines by not applying additional read may have lost diversity that is not represented in the database QC filters. An RDP classifier trained on Greengenes v13.5 was and is a tradeoff between accuracy and completeness. The com- used to classify reads for all pipelines. Singletons and nonbacte- plete workflow of our pipeline is given in Supplementary Fig. 2. ria OTUs (based on taxonomy) were filtered out. The major dif- To validate our approach, we created a gold standard data ferences between the 3 pipelines in addition to the commands set with high-quality paired-end reads based on the 837 high- used to reproduce the results are documented in Supplemen- coverage human fecal samples sequenced at the Mayo Core Fa- tary Note 1. We assessed performance by investigating (1) the cility (V3-V5 16S amplicon, 694 nt, 357F/926R primers) [14]. These number of detected genera and percentage of unclassified reads fecal samples were collected from 20 subjects using 6 different at the genus level, (2) Mantel correlation using Bray-Curtis (BC) methods (no additive, RNAlater, 70% ethanol, EDTA, dry swab, matrices, and (3) the intraclass correlation coefficient (ICC) for and fecal occult blood test [FOBT]). The samples were immedi- these core OTUs and genera observed in more than 90% of the ately frozen or stored at room temperature for 4 days to study samples. ICC is a measure of the correlation between the tech- the stability of the microbiota. Each condition had 2–3 tech- nical replicates. A high value indicates less measurement error. nical replicates to assess the reproducibility. We ran Trimmo- ICC was calculated using the R ICC package [17]. matic [4] for quality control and trimmed R1s down to 250 bp Finally, we demonstrated the performance of the proposed and R2s down to 200 bp to ensure high base quality, resulting method on a data set from the study of the stool microbiome in nonoverlapping paired-end reads. For each sample, we re- of rheumatoid arthritis (RA) patients, which consists of 40 RA trieved 8000 high-quality paired-end reads. We then performed patients and 49 controls (V3-V5 16S amplicon, 694 nt) [18]. We OTU-picking and taxonomy assignment based on these paired- applied DESeq2 to the taxa count data for differential abundance end reads using IM-TORNADO. These resulting OTUs and their analysis [19] and compared the RA-associated OTUs and genera associated taxonomy constitute the gold standard data set. We recovered by different approaches. then subset the gold standard with 25%, 50%, and 75% of the R2 reads remaining. The 3 sub–data sets represented differ- ent levels of R2 quality encountered in practice. We compared Results hybrid-denovo with de novo approaches based on single-end R1s or paired-end reads using the sub–data sets. Performance was The correlation of microbial β-diversity with the gold stan- evaluated by calculating the Spearman’s correlation with the dard was generally high for all the 3 approaches (Fig. 1B). How- gold standard in terms of microbial β-diversity (unweighted and ever, the approach based on single-end R1 tended to have a weighted UniFrac, and Bray-Curtis distance) and genus-level rel- lower correlation when BC distance was used (the single-end R1 ative abundances. approach was invariant to the number of R2s). The paired-end We also compared our pipeline with QIIME and mothur (ver- approach, on the other hand, had a much lower correlation sions 1.8.0 and 1.39.3, respectively) [15,16] on the gold standard for unweighted UniFrac when only 25% of R2s remained. This data set. As QIIME and mothur currently do not support de novo was due to the fact that unweighted UniFrac captures com- OTU-picking based on nonoverlapping reads, we ran QIIME and munity membership, which is contributed mainly by rare taxa, mothur on the R1 reads. Parameter settings were chosen to be and many rare taxa are no longer detectable by the paired-end Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Proportion 4 Chen et al. AB Figure 3: Comparison of mothur, QIIME, and hybrid-denovo on intra-class correlation coefficients (ICCs) of the core genera (A) and OTUs (B). ICCs are calculated based on the technical replicates for 6 different fecal collection methods. Hybrid-denovo is run on data sets with different percentages of good-quality R2 reads (100%, 75%, 50%, and 25%). approach due to loss of reads. In contrast, Hybrid-denovo was very ple, application of hybrid-denovo to the data set with 50% good- robust and had the best or close to the best correlation with the quality R2 reads yielded a total of 110 genera, compared with 70 gold standard in both diversity measures. For weighted UniFrac and 84 for QIIME and mothur, respectively (Fig. 2, upper right, distance, the correlation was similarly high for all the 3 meth- Venn diagram). Using BLAST on the paired-end counterparts of ods as the weighted UniFrac is most influenced by dominant the QIIME and mothur-specific genera (classified based on R1 taxa and all the methods quantify these dominant taxa very well reads) against the Greengenes database re-assigns many of the (Fig. 1B). reads to other genera. This indicates that those genera were We next studied the performance of taxonomic profiling of probably misclassified due to shorter reads. Though the genus- the proposed approach. Based on the 56 genera with a preva- level microbiota profiles for the 20 subjects were similar for all lence greater than 10%, hybrid-denovo had a much higher corre- the pipelines (Fig. 2), hybrid-denovo had a much lower proportion lation with the gold standard across all scenarios considered, of reads with unknown genus identity (5%) than mothur and QI- and its performance was not very sensitive to the percentage IME (14% and 18%, respectively). Taken together, these observa- of R2s remaining (Fig. 1C). In contrast, the performance of the tions demonstrated that hybrid-denovo had increased taxonomic paired-end approach depends strongly on the R2 quality and had resolution due to the use of longer reads. Interestingly, all the a much lower correlation when R2 quality was low. The single- pipelines could yield similar intersample relationships as mea- end R1 approach was invariant to the number of R2s expected sured by Mantel correlation coefficients based on Bray-Curtis and performed better than the paired-end approach only when distance matrices (Table 1). The availability of technical repli- R2 quality was low. Supplementary Fig. 3 showed the individ- cates of the data set allows us to compare different pipelines us- ual genus correlations. For the single-end approach, 2 genera ing intraclass correlation coefficients. A high ICC indicates less showed 0 correlation with the gold standard because all of their variability introduced by the bioinformatics pipeline. We calcu- R1 reads were reclassified at the family level due to their short lated the ICCs for different fecal collection methods for the core length (Lachnobacterium mapped to Ruminococcaceae and Erwinia OTUs and genera, which occurred in more than 90% of the sam- mapped to Enterobacteriaceae), indicating the increased phylo- ples. Our pipeline generally had higher ICCs (less variation be- genetic resolution using paired-end reads. For the paired-end tween technical replicates) than mothur and QIIME (Fig. 3). In approach, genera with low abundance exhibited a lower corre- contrast, mothur and QIIME did not perform as well on the core lation, indicating the decreased quantification accuracy due to OTUs and genera, respectively. loss of paired-end reads. We also applied our method to a data set from an RA study We also compared hybrid-denovo with mothur and QIIME, the [18], where about 40% of R2s were discarded after quality control 2 predominant pipelines for 16S data, based on the gold standard (Supplementary Table 1). Hybrid-denovo resulted in the largest data set. Mothur and QIIME took around 24 and 6 hours, respec- number of OTUs and genera, as expected (Fig. 4A), and cov- tively, to complete the analysis of the gold standard dataset (n = ered all genera from the paired-end approach and the major- 837), compared with around 1 hour for our pipeline. Mothur and ity genera from the single-end R1 approach (Fig. 4C). Among QIIME produced a total of 4599 and 2898 nonsingleton OTUs, re- the 5 R1-specific genera, Bacteria Firmicutes Clostridia Clostridi- spectively, while hybrid-denovo produced 1094, 1086, 1079, and ales Clostridiaceae 02d0 and Bacteria Firmicutes Clostridia Clostridi- 1049 nonsingleton OTUs on data sets with different percent- ales Clostridiaceae Sarcina were reclassified to Bacteria Firmicutes ages of good-quality R2 reads (100%, 75%, 50%, and 25%). Though Clostridia Clostridiales Clostridiaceae Clostridium when their paired- our pipeline resulted in a smaller number of OTUs, we detected end counterparts were used, indicating that the R1-specific gen- a larger number of genera than mothur and QIIME. For exam- era were misclassified due to their short read length. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Hybrid-denovo 5 Table 1: Mantel correlations of intersample distances between QIIME, mothur and hybrid-denovo. Mothur QIIME Hybrid (100%) Hybrid (75%) Hybrid (50%) Hybrid (25%) Mothur - <0.001 <0.001 <0.001 <0.001 <0.001 QIIME 0.884 - <0.001 <0.001 <0.001 <0.001 Hybrid (100%) 0.986 0.879 - <0.001 <0.001 <0.001 Hybrid (75%) 0.973 0.909 0.985 - <0.001 <0.001 Hybrid (50%) 0.973 0.928 0.982 0.984 - <0.001 Hybrid (25%) 0.955 0.949 0.960 0.980 0.985 - Bray-Curtis distance matrices on the OTU data are used. Hybrid-denovo is run on data sets with different percentages of good quality R2 reads (100%, 75%, 50% and 25%). Top right: Mantel correlation P-value based on 1000 permutation; bottom left: Mantel correlation coefficients. A B #Genus #OTU #Genus #OTU CD C D Figure 4: Comparison of the R1, Paired, and Hybrid approaches on the RA data set. A, Number of detected OTUs (red) and genera (blue). B, Number of significant OTUs (red) and genera (blue) from differential abundance analysis (FDR ≤ 0.01). C, Venn diagram of the genera detected. D, Venn diagram of significant genera from differential abundance analysis. Apart from the comparison of the detected genera, we also 32 and 34 for the paired-end and single-end R1 approaches demonstrated the advantage of hybrid-denovo in the context of (Fig. 4B). There were 20 significant genera shared by all 3 meth- differential abundance analysis using DESeq2 [19]. We excluded ods (Fig. 4D), many of which were reported by previous studies OTUs that occurred in less than 10% samples from testing. A to- [18,20,21]. For example, Bacteroides is enriched in control sam- tal of 758, 578, and 393 OTUs were tested using hybrid-denovo, ples, while Collinsella, Eggerthella, Prevotella,and Clostridium are paired, and R1 approaches, respectively. Due to higher read enriched in RA samples. Even though the total number of dif- counts and increased phylogenetic resolution, hybrid-denovo re- ferential genera were similar for all the methods, hybrid-denovo covered more differential OTUs (Fig. 4B). We identified a to- identified the most genera (n = 11) that were shared by either 1 of tal of 126 significant OTUs at an FDR-adjusted P value of 0.01 the other 2 approaches, compared with 6 and 9 for the paired- compared with 93 and 80 OTUs for paired-end and single-end end and single-end R1 approaches, indicating that the hybrid- R1 approaches, respectively. Since different methods had their denovo approach was able to identify differential genera that own definition of OTUs and direct comparison of the differ- were otherwise missed by either the paired-end or single-end R1 ential OTUs is difficult, we instead compared the genus iden- approach. Furthermore, hybrid-denovo had the lowest number of tity of the identified OTUs. The differential OTUs identified by method-specific genera (n = 2) in contract to paired-end (n = 6) hybrid-denovo were classified into 33 genera, in comparison with and R1 single-end (n = 5). The method-specific genera might be Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Chen et al. less reliable due to lack of support from other methods. For ex- to unknown sequencing errors. Improving the diversity estimate ample, the R1 approach found Veillonella to be enriched in control from hybrid-denovo will be the focus of our future work. samples, which conflicts with a previous study [ 20]. Interest- ingly, among the 2 hybrid-denovo-specific genera, Klebsiella,which Competing interests was enriched in healthy people, was reported by Zhang et al. [21]. The authors declare that they have no competing interests. Discussion Abbreviations We proposed hybrid-denovo for de novo OTU-picking based on OTU: operational taxonomic unit; QC: quality control; RDP: ribo- paired-end 16S sequence tags. Through simulations and real somal database project; BC: Bray-Curtis; ICC: intra-class correla- data examples, we showed that our approach had better per- tion coefficient; RA: rheumatoid arthritis. formance than the single-end or paired-end approach in quan- tifying the microbial diversity and taxonomic abundance, due to the full use of the information in the paired-end reads. Acknowledgements Based on the size of 16S amplicons and the length of the This study was supported by the Mayo Clinic Center for Individ- paired-end reads, we could have overlapping or nonoverlap- ualized Medicine. ping paired-end reads. For example, sequencing of the V4 region (252 nt, 515F/806R primers) produces overlapping paired-end reads, while sequencing of the V3-V5 region (694 nt, F357/R926 Additional Files primers) results in nonoverlapping paired-end reads using Illu- Additional file 1: Supplementary Figure 1: Percentage of reads re- mina MiSeq (250 bp × 2). As QIIME and mothur currently do not maining after QC 2013-2015 in the Mayo Clinic Sequencing Core support de novo OTU-picking based on nonoverlapping paired- Facility. end reads, the main advantage of our pipeline lies in the abil- Additional file 2: Supplementary Figure 2: Hybrid-denovo wor- ity to process nonoverlapping paired-end reads. However, our fkow. pipeline could also be applied to overlapping paired-end reads Additional file 3: Supplementary Figure 3: Correlations of 54 by using PANDAseq [22] to stitch the paired-end reads together. prevalent genera (>10%) to the gold standard. It is noted that some existing pipelines could also process a mix- Additional file 4: Supplementary Figure 4: Comparison of ture of paired-end and single-end reads with different capac- DADA2, LotuS, and hybrid-denovo on genus-level profiles. All ities. For example, the recently proposed LotuS pipeline uses pipelines are run on data sets with 100% good-quality R2 reads good-quality R1 reads to build OTUs, followed by a postcluster- (gold standard). Each column represents the microbiota profile ing merging of R1 and R2 to increase the accuracy of the taxon- of an individual averaged over all replicates. omy [23]. However, the OTU-level resolution is still determined Additional file 5: Supplementary Figure 5: Comparison of by R1 reads. DADA2, LotuS, and hybrid-denovo on intraclass correlation coef- There are new pipelines that have been developed for 16S ficients of the core genera (A) and OTUs (B). ICCs are calculated data. It is interesting to benchmark hybrid-denovo against these based on the technical replicates for 6 different fecal collection state-of-the-art pipelines. We selected DADA2 and LotuS [23,24] methods. All pipelines are run on data sets with 100% good- for comparison as they have been demonstrated to have an over- quality R2 reads (gold standard). all better performance than QIIME and mothur and have been in- Additional file 6: Supplementary Table 1:Numberofreadsfor creasingly used by the community. We repeated the same anal- the RA data set after quality control. ysis on the gold standard data set with complete read pairs. Additional file 7: Supplementary Note 1: Details of the steps The specific command lines used for DADA2 and LotuS are doc- and parameter settings used for comparing hybrid-denovo, QI- umented in Supplementary Note 1. DADA2 produced 18 389 IME, and mothur. Command lines to run the pipelines including sequence variants (SVs), while LotuS produced 472 OTUs. The DADA2 and LotuS are supplied for transparency. Mantel correlation on the OTU/SV-level Bray-Curtis distance is high between hybrid-denovo and LotuS (ρ = 0.93) but moderate between hybrid-denovo and DADA2 (ρ = 0.71). Interestingly, the Availability and requirements Mantel correlation on the genus-level Bray-Curtis distance is Project name: Hybrid-denovo (https://scicrunch.org/SCR 015866) high between all methods (ρ> 0.97), indicating that all meth- Project home page: http://bioinformaticstools.mayo.edu/ ods could produce similar genus-level profiles (Supplementary research/hybrid-denovo/ Fig. 4). Similar ICC analysis demonstrated that all the methods Operating system(s): Linux (centOS 6 is prefered) had relatively high ICCs, but hybrid-denovo had the overall the Programming language: Python 2.7, Java, and shell script best performance (Supplementary Fig. 5). Other requirements: QIIME and python libraries: biom- One problem for de novo OTU-picking is the potential in- format (v. 1.3.1), bitarray (v. 0.8.1), pyqi (v. 0.2.0), numpy (v. 1.8.1), flated OTU number, which could be due to sources such as se- and biopython (v. 1.66) quencing errors, chimera, and environmental contaminants [6]. License: Modified BSD In hybrid-denovo, we used various quality filtering criteria to re- Any restrictions to use by nonacademics: none. duce the number of spurious OTUs. For example, we applied Trimmomatic [4] to trim and remove reads with low base quality, removed reads with any ambiguous bases, removed singleton Availability of supporting data OTUs, used the Infernal package [9] to remove non-structurally aligned OTUs, and used reference-based UCHIME as an addi- The example files and additional data sets supporting the results tional chimera removal process [6]. However, even these filters of this article are available in the GigaScience Database [25], as might fall short of reducing the inflated diversity estimate due well as from the project home page. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Hybrid-denovo 7 REFERENCES high-throughput community sequencing data. Nat Methods 2010;7:335–6. 1. Cho I, Blaser MJ. The human microbiome: at the interface of 16. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hol- health and disease. Nat Rev Genet 2012;13(4):260–70. lister EB et al. Introducing mothur: open-source, platform- 2. McDonald D, Birmingham A, Knight R. Context and the hu- independent, community-supported software for describing man microbiome. Microbiome 2015;3:52. and comparing microbial communities. Appl Environ Micro- 3. Jeraldo P, Kalari K, Chen X, Bhavsar J, Mangalam A, White B biol 2009;75(23):7537–41. et al. IM-TORNADO: a tool for comparison of 16s reads from 17. Wolak ME, Fairbairn DJ, Paulsen YR. Guidelines for estimat- paired-end libraries. PLoS One 2014;9(12):e114804. ing repeatability. Methods Ecol Evol 2012;3(1):129–37. 4. Bolger MA, Lohse M, Usadel B. Trimmomatic: a flexi- 18. Chen J, Wright K, Davis JM, Jeraldo P, Marietta EV, Mur- ble trimmer for illumina sequence data. Bioinformatics ray J et al. An expansion of rare lineage intestinal microbes 2014;30(15):2114–20. characterizes rheumatoid arthritis. Genome Med 2016;8(1): 5. Edgar RC. Search and clustering orders of magnitude faster than blast. Bioinformatics 2010;26(19):2460–1. 19. Love MI, Huber W, Anders S. Moderated estimation of 6. Edgar RC. Uparse: highly accurate OTU sequences from mi- fold change and dispersion for RNA-seq data with DESeq2. crobial amplicon reads. Nat Methods 2013;10(10):996–8. Genome Biol 2014;15(12):550. 7. Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R. UCHIME 20. Scher JU, Sczesnak A, Longman RS, Segata N, Ubeda C, Biel- improves sensitivity and speed of chimera detection. Bioin- ski C et al. Expansion of intestinal Prevotella copri corre- formatics 2011;10(10): 27(16):2194–200. lates with enhanced susceptibility to arthritis. Elife 2013;5(2): 8. https://drive5.com/uchime/gold.fa. e01202. 9. Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: inference of 21. Zhang X, Zhang D, Jia H, Feng Q, Wang D, Liang D et al. RNA alignments. Bioinformatics 2009;25(10):1335–7. The oral and gut microbiomes are perturbed in rheuma- 10. Cole JR,WangQ,CardenasE,Fish J, ChaiB,FarrisRJetal. toid arthritis and partly normalized after treatment. Nat Med The ribosomal database project: improved alignments and 2015;21(8):895–905. new tools for rRNA analysis. Nucleic Acids Res 2009;37:D141– 22. Masella AP, Bartram AK, Truszkowski JM, Brown DG, Neufeld D145. JD. Pandaseq: paired-end assembler for illumina sequences. 11. Price MN, Dehal PS, Arkin AP. Fasttree 2–approximately BMC Bioinformatics 2012;13:31. maximum-likelihood trees for large alignments. PLoS One 23. Hildebrand F, Tadeo R, Voigt AY, Bork P, Raes J. LotuS: an effi- 2010;5(3):e9490. cient and user-friendly OTU processing pipeline. Microbiome 12. Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian 2014;2:30. classifier for rapid assignment of rRNA sequences into 24. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, John- the new bacterial taxonomy. Appl Environ Microbiol son AJ, Holmes SP. DADA2: high-resolution sample infer- 2007;73(16):5261–7. ence from Illumina amplicon data. Nat Methods 2016;13: 13. http://greengenes.secondgenome.com/. 581–3. 14. Sinha R, Chen J, Amir A, Vogtmann E, Shi J, Inman KS 25. Chen X, Johnson S, Jeraldo P, Wang J, Chia N, Kocher et al. Collecting fecal samples for microbiome analyses in JA, Chen J. Supporting data for “Hybrid-denovo: A de epidemiology studies. Cancer Epidemiol Biomarkers Prev novo OTU-picking pipeline integrating single-end and 2016;25(2):407–16. paired-end 16S sequence tags.” GigaScience Database 2017. 15. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, http://dx.doi.org/10.5524/100388. Bushman FD, Costello EK et al. QIIME allows analysis of Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4750782 by Ed 'DeepDyve' Gillespie user on 16 March 2018

Journal

GigaScienceOxford University Press

Published: Mar 1, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off