bcSeq: an R package for fast sequence mapping in high-throughput shRNA and CRISPR screens

bcSeq: an R package for fast sequence mapping in high-throughput shRNA and CRISPR screens Abstract Summary CRISPR-Cas9 and shRNA high-throughput sequencing screens have abundant applications for basic and translational research. Methods and tools for the analysis of these screens must properly account for sequencing error, resolve ambiguous mappings among similar sequences in the barcode library in a statistically principled manner, and be computationally efficient. Herein we present bcSeq, an open source R package that implements a fast and parallelized algorithm for mapping high-throughput sequencing reads to a barcode library while tolerating sequencing error. The algorithm uses a Trie data structure for speed and resolves ambiguous mappings by using a statistical sequencing error model based on Phred scores for each read. Availability and implementation The package source code and an accompanying tutorial are available at http://bioconductor.org/packages/bcSeq/. Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction High-throughput sequencing of screening assays using short hairpin RNA interference (shRNA) libraries or single-guide RNA CRISPR/Cas9 knockout (sgRNA) libraries is a powerful method for forward genetic screens. These screens have been used to study gene function in multiple diseases, including many cancer types. In the study of cancer, such screening approaches are very useful to study clonal dynamics of therapy resistance and sensitivity for targeted therapies (Doudna and Charpentier, 2014). A key step in the analysis of the data from shRNA and CRISPR screen assays is to map each observed read in a sequenced library back to its originating barcode in the reference library. Aside from computational considerations, one has to account for sequencing error in a statistically principled manner. One approach is to repurpose tools for mapping short reads to large genomes (e.g. bowtieLangmead et al., 2009). This approach induces large overhead due to the library indexing and determining starting position for the alignment of each read. Other tools designed for mapping short reads to short barcodes are available (e.g. Dai et al., 2014; Li et al., 2014; Mun et al., 2016; Sims et al., 2011). However, these methods have two main limitations: 1. do not explicitly model the reads error distributions; and 2. are unable to properly handle ambiguous mapping (reads mapped to multiple barcodes). In this paper, we present bcSeq an open-source R (R Core Team, 2017) extension package for fast mapping of reads from high-throughput shRNA and sgRNA sequencing assays with multi-thread support. The package resolves ambiguous mappings on the basis of a statistical model that can be customized by the user. 2 Implementation For each read, bcSeq models the distribution of the originating barcode given the observed sequence using the corresponding Phred scores and maps the read to a reference barcode using a Bayes’ classifier. bcSeq uses a C++ implemented Trie (Bieganski et al., 1994) data structure and the pthreads model for implementing multi-threading. It provides the option for using either Hamming or edit distance. The package accepts data in FASTA and FASTQ formats for reference sequences, FASTQ for read sequences, and also interfaces with functions and classes from core Bioconductor (Huber et al., 2015) packages. The user can opt to output a tabulation of the number of reads mapped to each barcode or to output a mapping probability matrix among all the reads and barcodes in sparse format. Technical details are provided in SI. 3 Results We conduct a simulation study to compare the performance of bcSeq to that of a perfect match approach (see Fig. 1). We consider a reference library consisting of 500 barcodes each of length 18 and a minimum pairwise edit distance of 2. We simulate 1000 sequencing libraries each consisting of 106 reads and sample the Phred scores from a real sequenced library. We evaluate the performance across all simulated reads as well as within regions of similar average Phred scores. Performance is quantified by empirical ratio of reads mapped to the library barcodes and by empirical specificity (excluding unmapped reads). We observe that compared to the perfect match approach, bcSeq consistently maps more reads with similar specificity. Furthermore, we observe that the relative gain in reads mapped for our method improves as the quality of the reads decreases. In SI, we provide additional details for the simulation study along with a performance comparison to three published methods on the basis of an L1 norm excluding unmapped reads (see Supplementary Table S2). bcSeq meets or exceeds the performance of each method considered. Fig. 1. View largeDownload slide bcSeq maps more reads to barcodes for realistic data than perfect match without introducing significant mapping error. The upper left panel shows that bcSeq has a higher ratio of reads mapped to library barcodes (number of mapped reads/number of total reads) than perfect match when run on reads simulated using Phred scores sampled from real sequencing data. The upper right panel elucidates the role of sequencing error in bcSeq’s performance. Each read in the real sequencing data was binned based its average Phred score, and bcSeq and perfect match were run on reads simulated using Phred scores sampled from each of these bins. Perfect match’s mapped reads ratio degrades much more rapidly than bcSeq’s as average Phred score decreases. The lower panels compare the specificity (excluding unmapped reads) of bcSeq and perfect match. The lower left panel shows that bcSeq has a similar specificity to perfect match when run on reads simulated using Phred scores sampled from real sequencing data. The lower right panel elucidates the role of sequencing error in bcSeq’s specificity. bcSeq has a similar specificity to perfect match as the average Phred score decreases Fig. 1. View largeDownload slide bcSeq maps more reads to barcodes for realistic data than perfect match without introducing significant mapping error. The upper left panel shows that bcSeq has a higher ratio of reads mapped to library barcodes (number of mapped reads/number of total reads) than perfect match when run on reads simulated using Phred scores sampled from real sequencing data. The upper right panel elucidates the role of sequencing error in bcSeq’s performance. Each read in the real sequencing data was binned based its average Phred score, and bcSeq and perfect match were run on reads simulated using Phred scores sampled from each of these bins. Perfect match’s mapped reads ratio degrades much more rapidly than bcSeq’s as average Phred score decreases. The lower panels compare the specificity (excluding unmapped reads) of bcSeq and perfect match. The lower left panel shows that bcSeq has a similar specificity to perfect match when run on reads simulated using Phred scores sampled from real sequencing data. The lower right panel elucidates the role of sequencing error in bcSeq’s specificity. bcSeq has a similar specificity to perfect match as the average Phred score decreases The completion times for mapping a sequenced library consisting of 107 reads to a reference library consisting of 87 897 barcodes are 60, 31, 16, 12 and 9 min based on 1, 2, 4, 6 and 8 cores respectively on an AMD FX-8350 desktop CPU. We note these times include quality evaluation and exhaustive mapping within the mismatch threshold. 4 Discussion and conclusion For mapping applications, as illustrated by the simulation results presented in this paper and SI, bcSeq has better performance compared to a number of other methods. The scope of its application goes beyond merely mapping barcodes, enabling rigorous statistical inference. In this context, a typical objective is to model the relationship between the relative prevalence of each barcode and a given phenotype or experimental condition. Existing methods for these analyses (e.g. MAGeCK) implicitly assume that each read is correctly mapped to its originating barcode. However, the goal of inference is to model the relationship between a phenotype and the originating, and not the mapped barcode. As discussed in SI, our approach enables rigorous inference using a measurement error model. bcSeq is an open source R package for fast and accurate mapping of reads from high-throughput shRNA and sgRNA sequencing assays based on a default or a user-defined Phred score based probability model. The package vignette provides a worked example for customization of this model. bcSeq accommodates sequences of varying lengths and provides the option of using Hamming or edit distance, and can tolerate mismatches, internal insertions and deletions. It uses a fast Trie data structure coded in C++ and supports multi-threading. bcSeq is available through Bioconductor. The user can interface with the package using standard sequencing data formats or by using Bioconductor data structures. The code and data to reproduce the simulation and to replicate the benchmark results are described in SI. Acknowledgements The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors thank two reviewers for comments and suggestions which substantially improved the paper and software package. Funding This research was supported in part by awards P01CA142538 (AA, KO and JL) and 5P50-CA190991-04 (KO, SF) from the National Cancer Institute. Conflict of Interest: none declared. References Bieganski P. et al. ( 1994 ). Generalized suffix trees for biological sequence data: applications and implementation. In: 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, vol. 5, pp. 35 . Dai Z. et al. ( 2014 ) edgeR: a versatile tool for the analysis of shRNA-seq and CRISPR-Cas9 genetic screens . F1000Research , 3 , 95 . Google Scholar PubMed Doudna J.A. , Charpentier E. ( 2014 ) The new frontier of genome gngineering with CRISPR-Cas9 . Science , 346 , 1258096 . Google Scholar Crossref Search ADS PubMed Huber W. et al. ( 2015 ) Orchestrating high-throughput genomic analysis with Bioconductor . Nat. Methods , 12 , 115 . Google Scholar Crossref Search ADS PubMed Langmead B. et al. ( 2009 ) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome . Genome Biol ., 10 , R25 . Google Scholar Crossref Search ADS PubMed Li W. et al. ( 2014 ) MAGeCK enables robust identification of essential genes from genome-scale CRISPR-Cas9 knockout screens . Genome Biol ., 15 , 554 . Google Scholar Crossref Search ADS PubMed Mun J. et al. ( 2016 ) Genome-wide functional analysis using the barcode sequence alignment and statistical analysis (Barcas) tool . BMC Bioinformatics , 17 , 475 . Google Scholar Crossref Search ADS PubMed R Core Team . ( 2017 ) R: A Language and Environment for Statistical Computing . R Foundation for Statistical Computing , Vienna, Austria . Sims D. et al. ( 2011 ) High-throughput RNA interference screening using pooled shRNA libraries and next generation sequencing . Genome Biol ., 12 , R104 . Google Scholar Crossref Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

bcSeq: an R package for fast sequence mapping in high-throughput shRNA and CRISPR screens

Loading next page...
 
/lp/ou_press/bcseq-an-r-package-for-fast-sequence-mapping-in-high-throughput-shrna-H2NhQCHtWk
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty402
Publisher site
See Article on Publisher Site

Abstract

Abstract Summary CRISPR-Cas9 and shRNA high-throughput sequencing screens have abundant applications for basic and translational research. Methods and tools for the analysis of these screens must properly account for sequencing error, resolve ambiguous mappings among similar sequences in the barcode library in a statistically principled manner, and be computationally efficient. Herein we present bcSeq, an open source R package that implements a fast and parallelized algorithm for mapping high-throughput sequencing reads to a barcode library while tolerating sequencing error. The algorithm uses a Trie data structure for speed and resolves ambiguous mappings by using a statistical sequencing error model based on Phred scores for each read. Availability and implementation The package source code and an accompanying tutorial are available at http://bioconductor.org/packages/bcSeq/. Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction High-throughput sequencing of screening assays using short hairpin RNA interference (shRNA) libraries or single-guide RNA CRISPR/Cas9 knockout (sgRNA) libraries is a powerful method for forward genetic screens. These screens have been used to study gene function in multiple diseases, including many cancer types. In the study of cancer, such screening approaches are very useful to study clonal dynamics of therapy resistance and sensitivity for targeted therapies (Doudna and Charpentier, 2014). A key step in the analysis of the data from shRNA and CRISPR screen assays is to map each observed read in a sequenced library back to its originating barcode in the reference library. Aside from computational considerations, one has to account for sequencing error in a statistically principled manner. One approach is to repurpose tools for mapping short reads to large genomes (e.g. bowtieLangmead et al., 2009). This approach induces large overhead due to the library indexing and determining starting position for the alignment of each read. Other tools designed for mapping short reads to short barcodes are available (e.g. Dai et al., 2014; Li et al., 2014; Mun et al., 2016; Sims et al., 2011). However, these methods have two main limitations: 1. do not explicitly model the reads error distributions; and 2. are unable to properly handle ambiguous mapping (reads mapped to multiple barcodes). In this paper, we present bcSeq an open-source R (R Core Team, 2017) extension package for fast mapping of reads from high-throughput shRNA and sgRNA sequencing assays with multi-thread support. The package resolves ambiguous mappings on the basis of a statistical model that can be customized by the user. 2 Implementation For each read, bcSeq models the distribution of the originating barcode given the observed sequence using the corresponding Phred scores and maps the read to a reference barcode using a Bayes’ classifier. bcSeq uses a C++ implemented Trie (Bieganski et al., 1994) data structure and the pthreads model for implementing multi-threading. It provides the option for using either Hamming or edit distance. The package accepts data in FASTA and FASTQ formats for reference sequences, FASTQ for read sequences, and also interfaces with functions and classes from core Bioconductor (Huber et al., 2015) packages. The user can opt to output a tabulation of the number of reads mapped to each barcode or to output a mapping probability matrix among all the reads and barcodes in sparse format. Technical details are provided in SI. 3 Results We conduct a simulation study to compare the performance of bcSeq to that of a perfect match approach (see Fig. 1). We consider a reference library consisting of 500 barcodes each of length 18 and a minimum pairwise edit distance of 2. We simulate 1000 sequencing libraries each consisting of 106 reads and sample the Phred scores from a real sequenced library. We evaluate the performance across all simulated reads as well as within regions of similar average Phred scores. Performance is quantified by empirical ratio of reads mapped to the library barcodes and by empirical specificity (excluding unmapped reads). We observe that compared to the perfect match approach, bcSeq consistently maps more reads with similar specificity. Furthermore, we observe that the relative gain in reads mapped for our method improves as the quality of the reads decreases. In SI, we provide additional details for the simulation study along with a performance comparison to three published methods on the basis of an L1 norm excluding unmapped reads (see Supplementary Table S2). bcSeq meets or exceeds the performance of each method considered. Fig. 1. View largeDownload slide bcSeq maps more reads to barcodes for realistic data than perfect match without introducing significant mapping error. The upper left panel shows that bcSeq has a higher ratio of reads mapped to library barcodes (number of mapped reads/number of total reads) than perfect match when run on reads simulated using Phred scores sampled from real sequencing data. The upper right panel elucidates the role of sequencing error in bcSeq’s performance. Each read in the real sequencing data was binned based its average Phred score, and bcSeq and perfect match were run on reads simulated using Phred scores sampled from each of these bins. Perfect match’s mapped reads ratio degrades much more rapidly than bcSeq’s as average Phred score decreases. The lower panels compare the specificity (excluding unmapped reads) of bcSeq and perfect match. The lower left panel shows that bcSeq has a similar specificity to perfect match when run on reads simulated using Phred scores sampled from real sequencing data. The lower right panel elucidates the role of sequencing error in bcSeq’s specificity. bcSeq has a similar specificity to perfect match as the average Phred score decreases Fig. 1. View largeDownload slide bcSeq maps more reads to barcodes for realistic data than perfect match without introducing significant mapping error. The upper left panel shows that bcSeq has a higher ratio of reads mapped to library barcodes (number of mapped reads/number of total reads) than perfect match when run on reads simulated using Phred scores sampled from real sequencing data. The upper right panel elucidates the role of sequencing error in bcSeq’s performance. Each read in the real sequencing data was binned based its average Phred score, and bcSeq and perfect match were run on reads simulated using Phred scores sampled from each of these bins. Perfect match’s mapped reads ratio degrades much more rapidly than bcSeq’s as average Phred score decreases. The lower panels compare the specificity (excluding unmapped reads) of bcSeq and perfect match. The lower left panel shows that bcSeq has a similar specificity to perfect match when run on reads simulated using Phred scores sampled from real sequencing data. The lower right panel elucidates the role of sequencing error in bcSeq’s specificity. bcSeq has a similar specificity to perfect match as the average Phred score decreases The completion times for mapping a sequenced library consisting of 107 reads to a reference library consisting of 87 897 barcodes are 60, 31, 16, 12 and 9 min based on 1, 2, 4, 6 and 8 cores respectively on an AMD FX-8350 desktop CPU. We note these times include quality evaluation and exhaustive mapping within the mismatch threshold. 4 Discussion and conclusion For mapping applications, as illustrated by the simulation results presented in this paper and SI, bcSeq has better performance compared to a number of other methods. The scope of its application goes beyond merely mapping barcodes, enabling rigorous statistical inference. In this context, a typical objective is to model the relationship between the relative prevalence of each barcode and a given phenotype or experimental condition. Existing methods for these analyses (e.g. MAGeCK) implicitly assume that each read is correctly mapped to its originating barcode. However, the goal of inference is to model the relationship between a phenotype and the originating, and not the mapped barcode. As discussed in SI, our approach enables rigorous inference using a measurement error model. bcSeq is an open source R package for fast and accurate mapping of reads from high-throughput shRNA and sgRNA sequencing assays based on a default or a user-defined Phred score based probability model. The package vignette provides a worked example for customization of this model. bcSeq accommodates sequences of varying lengths and provides the option of using Hamming or edit distance, and can tolerate mismatches, internal insertions and deletions. It uses a fast Trie data structure coded in C++ and supports multi-threading. bcSeq is available through Bioconductor. The user can interface with the package using standard sequencing data formats or by using Bioconductor data structures. The code and data to reproduce the simulation and to replicate the benchmark results are described in SI. Acknowledgements The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors thank two reviewers for comments and suggestions which substantially improved the paper and software package. Funding This research was supported in part by awards P01CA142538 (AA, KO and JL) and 5P50-CA190991-04 (KO, SF) from the National Cancer Institute. Conflict of Interest: none declared. References Bieganski P. et al. ( 1994 ). Generalized suffix trees for biological sequence data: applications and implementation. In: 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, vol. 5, pp. 35 . Dai Z. et al. ( 2014 ) edgeR: a versatile tool for the analysis of shRNA-seq and CRISPR-Cas9 genetic screens . F1000Research , 3 , 95 . Google Scholar PubMed Doudna J.A. , Charpentier E. ( 2014 ) The new frontier of genome gngineering with CRISPR-Cas9 . Science , 346 , 1258096 . Google Scholar Crossref Search ADS PubMed Huber W. et al. ( 2015 ) Orchestrating high-throughput genomic analysis with Bioconductor . Nat. Methods , 12 , 115 . Google Scholar Crossref Search ADS PubMed Langmead B. et al. ( 2009 ) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome . Genome Biol ., 10 , R25 . Google Scholar Crossref Search ADS PubMed Li W. et al. ( 2014 ) MAGeCK enables robust identification of essential genes from genome-scale CRISPR-Cas9 knockout screens . Genome Biol ., 15 , 554 . Google Scholar Crossref Search ADS PubMed Mun J. et al. ( 2016 ) Genome-wide functional analysis using the barcode sequence alignment and statistical analysis (Barcas) tool . BMC Bioinformatics , 17 , 475 . Google Scholar Crossref Search ADS PubMed R Core Team . ( 2017 ) R: A Language and Environment for Statistical Computing . R Foundation for Statistical Computing , Vienna, Austria . Sims D. et al. ( 2011 ) High-throughput RNA interference screening using pooled shRNA libraries and next generation sequencing . Genome Biol ., 12 , R104 . Google Scholar Crossref Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal

BioinformaticsOxford University Press

Published: Oct 15, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off