LiveKraken: real-time metagenomic classification of Illumina data

Abstract

Motivation
In metagenomics, Kraken is one of the most widely used tools due to its robustness and speed. Yet, the overall turnaround time of metagenomic analysis is hampered by the sequential paradigm of wet and dry lab. In urgent experiments, it can be crucial to gain timely insight into a dataset.

Results
Here, we present LiveKraken, a real-time read classification tool based on the core algorithm of Kraken. LiveKraken uses streams of raw data from Illumina sequencers to classify reads taxonomically. This way, we are able to produce results identical to those of Kraken the moment the sequencer finishes. We are furthermore able to provide comparable results in early stages of a sequencing run, which can save up to a week of sequencing time on an Illumina HiSeq in High Throughput Mode. While the number of classified reads grows over time, false classifications appear in negligible numbers and the proportions of identified taxa are affected only to a minor extent.

Availability and implementation
LiveKraken is available at https://gitlab.com/rki_bioinformatics/LiveKraken.

Supplementary information
Supplementary data are available at Bioinformatics online.

1 Introduction
Real-time analyses of genome sequencing data have been gaining particular attention in recent years, as they make it possible to analyze data while the sequencer is still running. Yet, the possibilities of live analysis approaches based on MinION sequencers are still limited due to the low throughput rates and sequence qualities of these devices. With HiLive (Lindner et al., 2017) we proposed the first method for real-time analyses of high-throughput sequencing data from Illumina machines, enabling a new field of applications. For metagenomic studies, classification tools such as Kraken (Wood and Salzberg, 2014) have also been used in time-relevant applications.
These are, however, affected by the sequential paradigm of wet and dry lab, which sets the lower limit of the overall duration of an experiment to the runtime of the sequencing machine. To tackle these limitations, we present LiveKraken, a real-time taxonomic classification tool based on the core algorithm of Kraken. We show that it yields results comparable to those of established tools long before the sequencer has finished and that it guarantees results identical to those of Kraken as soon as a sequencing run has ended. LiveKraken has been tested on HiSeq and MiSeq systems and is as robust and easy to use as Kraken. Possible applications range from controlling sample composition and identifying contamination to outbreak detection in real time.

2 Materials and methods
Originally, Kraken has a linear workflow (Wood and Salzberg, 2014). Sequencing reads are read from FASTA or FASTQ files and subsequently classified using a pre-computed database. Since the reads are independent of each other, they can be processed in parallel. The lowest common ancestor (LCA) classification results found for each read are written to Kraken’s tabular report file. To make this workflow fit for live taxonomic classification, and similar to the approach taken in HiLive (Lindner et al., 2017), a new sequence reader module was implemented which reads sequencing data from Illumina’s binary base call (BCL) format. LiveKraken can be used to continuously analyze and refine the metagenomic sample composition, using the same database structure as the original Kraken. Illumina sequencers process all reads in parallel in so-called cycles, appending one base to all reads per cycle. For each cycle, BCL files are produced in Illumina’s BaseCalls directory, which is declared as input for LiveKraken instead of FASTA or FASTQ files.
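To illustrate the cycle-wise input described above, the following Python sketch decodes one per-cycle base call payload and appends the resulting base to every read. It is not LiveKraken's reader module. The byte layout (a 4-byte little-endian cluster count followed by one byte per cluster, with the two low bits encoding the base and a zero byte marking a no-call) is assumed from the commonly documented uncompressed BCL format, and the function names decode_bcl_cycle and append_cycle are hypothetical.

```python
import struct

BASES = "ACGT"  # assumed mapping of the two low bits: 0=A, 1=C, 2=G, 3=T


def decode_bcl_cycle(payload: bytes) -> str:
    """Decode one uncompressed per-cycle BCL payload into one base per cluster.

    Assumed layout: uint32 little-endian cluster count, then one byte per
    cluster (low 2 bits = base, high 6 bits = quality, 0x00 = no-call).
    """
    if len(payload) < 4:
        raise ValueError("payload too short for a BCL header")
    (n_clusters,) = struct.unpack_from("<I", payload, 0)
    calls = payload[4:4 + n_clusters]
    if len(calls) != n_clusters:
        raise ValueError("truncated BCL payload")
    # A zero byte means the base caller made no call; report it as N.
    return "".join("N" if b == 0 else BASES[b & 0b11] for b in calls)


def append_cycle(reads: list[str], payload: bytes) -> None:
    """Append the base called in this cycle to every read, in cluster order."""
    bases = decode_bcl_cycle(payload)
    if len(bases) != len(reads):
        raise ValueError("cluster count does not match number of reads")
    for i, base in enumerate(bases):
        reads[i] += base


if __name__ == "__main__":
    # Two synthetic cycles for three clusters (quality bits are arbitrary).
    cycle1 = struct.pack("<I", 3) + bytes([(1 << 2) | 0, (30 << 2) | 1, 0])
    cycle2 = struct.pack("<I", 3) + bytes([(2 << 2) | 3, (1 << 2) | 0, (30 << 2) | 1])
    reads = ["", "", ""]
    append_cycle(reads, cycle1)
    append_cycle(reads, cycle2)
    print(reads)
```

Every read grows by exactly one base per cycle, which is why partial reads of equal length are available for classification at any point during the run.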
New data is collected by the BCL sequencing reader module in user-specified intervals of j sequencing cycles, starting with the first k-mer of size k. The collected data is sent to the classifier, which refines the stored partial classification with the new sequence information. Temporary data structures of Kraken are stored for each read, such as the LCA list, a list of ambiguous nucleotides and the number of k-mer occurrences in the database. This leads to an overall increase in memory consumption proportional to the number of LCAs found for each read sequence. Additionally, and crucially for the iterative refinement, a variable is stored that holds the position up to which each read has been classified. After each refinement step, output is produced in the same tabular format as known from Kraken. This enables early classification while also ensuring that the classification output after reading the data from the last sequencing cycle is exactly the same as the one Kraken would produce (cf. Fig. 1a).

Fig. 1. Timeline of LiveKraken: the upper part (a) shows the method, the lower part (b) an exemplary result. (a) Method: Raw parts of sequenced reads are streamed directly from the sequencer into Kraken’s classification algorithm. K-mers are taxonomically classified using Kraken’s pre-computed map of each k-mer to the lowest common ancestor of all genomes containing the k-mer, as color coded in the taxonomy tree. The highest scoring path from the pruned sub-tree of the taxonomic tree is selected as the classification of each read (Wood and Salzberg, 2014). (b) Results: In this example, 2 358 788 unmasked reads from SRR062462 were transferred back into raw MiSeq data format. Results are reported after 40, 80, 120, 160 and 200 sequencing cycles or approximately 12, 9, 6, 3 and 0 h on an Illumina MiSeq before the sequencer finishes and data can be prepared for other tools to start. The results are visualized in a Sankey diagram of read classifications on species level after all cycles are reported. The top five groups with the most hits are shown, while groups with fewer hits are conflated as ‘other’. Reads which cannot be assigned on species level are denoted as unclassified. The unclassified nodes are optically narrowed by approximately 1 500 000 reads each for better recognition of relevant groups. Thickness of the flows encodes the number of reads going from one node to another, where blue flows represent unchanged or new classifications and red ones show changed classifications. While the number of unclassified reads decreases, the overall proportions of taxa stay the same. Misclassifications occur in negligible magnitude. The visualization of results as an interactive Sankey plot is part of LiveKraken.

LiveKraken can be installed via the included script install_kraken.sh, analogous to Kraken, with an additional dependency on the boost library. It has been tested with gcc v. 4.9.2 and v. 7.2.0 and boost v. 1.5.8. Furthermore, a Conda package is available (Grüning et al., 2017). LiveKraken uses the same command line interface as Kraken.

3 Results
LiveKraken builds on the well-known tool Kraken. Hence, we show its results in comparison to the classic Kraken approach. While we guarantee results identical to those of Kraken at the end of a sequencing run, we also show that preliminary classifications allow a reliable estimate of the sample composition long before the sequencer has finished. We ran LiveKraken on three datasets from the NIH Human Microbiome Project (NIH HMP Working Group et al., 2009; cf. Table 1), returning results after every 40th sequencing cycle or approximately 12, 9, 6, 3 and 0 h before the sequencer finished, respectively. As reference database we used all bacterial and archaeal sequences from RefSeq (O’Leary et al., 2016) downloaded on June 2nd 2015. We compared the results to the output of Kraken on the full datasets (Table 1). An example is visualized in Figure 1b, showing that the number of unclassified reads decreases over time, but only a minor number of reads is misclassified in earlier stages.
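The comparison against the final Kraken output can be sketched in Python as follows. The exact counting rules behind Table 1 are not spelled out in the text, so the read-level definitions used here are an assumption: a read is a true positive if its early species-level call equals the final call, a false positive if it has an early call that differs from the final one, and a false negative if the final call was not reproduced (either missing or different). The function name recall_precision and the use of taxon ID 0 for unclassified reads are likewise illustrative choices.

```python
UNCLASSIFIED = 0  # placeholder taxon ID for reads without a species-level call


def recall_precision(early: dict[str, int], final: dict[str, int]) -> tuple[float, float]:
    """Compare early per-read calls with the final calls used as ground truth.

    Returns (recall, precision); a ratio with an empty denominator is 0.0.
    """
    tp = fp = fn = 0
    for read_id, truth in final.items():
        call = early.get(read_id, UNCLASSIFIED)
        if call != UNCLASSIFIED and call == truth:
            tp += 1  # early call already matches the final call
            continue
        if call != UNCLASSIFIED:
            fp += 1  # early call exists but disagrees with the final call
        if truth != UNCLASSIFIED:
            fn += 1  # final call not (yet) reproduced at this cycle
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


if __name__ == "__main__":
    # Toy data: read IDs and taxon IDs are made up for illustration only.
    final_calls = {"r1": 562, "r2": 562, "r3": 1280, "r4": UNCLASSIFIED}
    early_calls = {"r1": 562, "r2": UNCLASSIFIED, "r3": 561, "r4": UNCLASSIFIED}
    tpr, ppv = recall_precision(early_calls, final_calls)
    print(f"tpr={tpr:.2f} ppv={ppv:.2f}")
```

Under these definitions, reads that are still unclassified at an early cycle lower recall but leave precision untouched, which matches the pattern of rising tpr and nearly constant ppv reported for the three datasets.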
While the peak memory requirements of LiveKraken increase by <1% compared to Kraken in our experiments, speed decreases by 15% (cf. Supplementary Fig. S1). It is still orders of magnitude faster than the sequencer and therefore not the runtime bottleneck. Our results confirm the hypothesis that a classification is already possible long before classical metagenomic tools can even be started.

Table 1. Recall (tpr) and precision (ppv) of LiveKraken at different time points, based on read classification on species level at each cycle compared to Kraken classification after 200 cycles as ground truth

Cycle      40          80          120         160
Dataset    tpr   ppv   tpr   ppv   tpr   ppv   tpr   ppv
062371     0.85  0.99  0.94  0.99  0.96  0.99  0.99  1
062462     0.80  0.98  0.92  0.98  0.95  0.98  0.99  0.99
062415     0.80  0.98  0.92  0.98  0.95  0.98  0.99  0.99

Acknowledgements
The authors thank Jakob M. Schulze, Kristina Kirsten, Piotr W. Dabrowski (all Robert Koch Institute) and Florian Breitwieser (Johns Hopkins University) for valuable discussions and input and Ursula Erikli for copy-editing.

Funding
The authors gratefully acknowledge financial support from the German Federal Ministry of Health [2515NIK043].

Conflict of Interest: none declared.
References
Grüning B. et al. (2017) Bioconda: a sustainable and comprehensive software distribution for the life sciences. bioRxiv, http://dx.doi.org/10.1101/207092.
Lindner M.S. et al. (2017) HiLive: real-time mapping of illumina reads while sequencing. Bioinformatics, 33, 917–919.
NIH HMP Working Group et al. (2009) The NIH human microbiome project. Genome Res., 19, 2317–2323.
O’Leary N.A. et al. (2016) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res., 44, D733–D745.
Wood D.E., Salzberg S.L. (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol., 15, R46.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty433


Journal

Bioinformatics, Oxford University Press

Published: Nov 1, 2018
