BS-virus-finder: virus integration calling using bisulfite sequencing data

BS-virus-finder: virus integration calling using bisulfite sequencing data Background: DNA methylation plays a key role in the regulation of gene expression and carcinogenesis. Bisulfite sequencing studies mainly focus on calling single nucleotide polymorphism, different methylation region, and find allele-specific DNA methylation. Until now, only a few software tools have focused on virus integration using bisulfite sequencing data. Findings: We have developed a new and easy-to-use software tool, named BS-virus-finder (BSVF, Received: 10 February 2017; Revised: 6 September 2017; Accepted: 30 November 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 Gao et al. RRID:SCR 015727), to detect viral integration breakpoints in whole human genomes. The tool is hosted at https://github.com/BGI-SZ/BSVF. Conclusions: BS-virus-finder demonstrates high sensitivity and specificity. It is useful in epigenetic studies and to reveal the relationship between viral integration and DNA methylation. BS-virus-finder is the first software tool to detect virus integration loci by using bisulfite sequencing data. Keywords: bisulfite sequencing; carcinogenesis; virus integration Watson strand, methylated C remains C, and unmethylated C Introduction changes to T. On the Crick strand, the reverse complement hap- DNA methylation plays a crucial role in many areas including pens; i.e., methylated C remains C, but in sequenced reads it is development [1, 2] and X chromosome inactivation [3]byreg- reverse-complemented to G, and unmethylated C changes to T, ulating genetic imprinting and epigenetic modification without leading to the reverse-complement base A in sequenced reads. altering DNA sequences. Previous studies have shown a strong As base C can either be methylated or unmethylated, we can use association of DNA methylation with cancer. The methylation International Union of Pure and Applied Chemistry (IUPAC) nu- status altering related to carcinogenesis [4], cancer recurrence cleotide codes “Y” and “R” to represent C/T and G/A, respectively. [5], and metastasis [6] has already been revealed by emerg- So, after bisulfite treatment, base C changes to Y on the Watson ing bisulfite sequencing (BS) technology. BS technology can in- strand, and base G changes to R on the Crick strand. vestigate DNA methylation changes with single-base accuracy. Whole-genome-based bisulfite sequencing (WGBS) has been Treatment of DNA with bisulfite converts unmethylated cytosine developed to detect DNA methylation. Recent clinical studies residues to uracil, but leaves 5-methylcytosine residues unmod- showed that DNA methylation is associated with viral integra- ified [ 7]. Thus, bisulfite treatment introduces specific changes tion [8, 9]. Whole-genome BS data can be analyzed to investigate in the DNA sequence that depend on the methylation status of the sequence mapping and alignment via BSMAP [10], Bismark individual cytosine residues, yielding single-nucleotide resolu- [11], and bwa-meth [12], to detect different methylation regions tion information about the methylation status of a segment of (DMRs) via the software QDMR [13], DMAP [14], and SMAP [15], DNA (Fig. 1). Various analyses can be performed on the altered to identify single nucleotide polymorphisms (SNPs) via BS-SNPer sequences to retrieve this information. BS technology can re- [16] and Bis-SNP [17], and to find allele-specific DNA methylation veal differences between cytosines and thymidine and sequence viaSMAP[15] and Methy-Pipe [18]. However, none of them can change resulting from bisulfite conversion. For the bases with- be used for virus integration loci calling, and no software tool is out methylation, all Cs will change to Ts on both strands. After currently available to detect virus integration loci by analyzing directional library preparation, we have 2 different conversions: BS data. Therefore, we have developed a software tool to detect the Watson and the Crick strands, as shown in Fig. 1.Onthe the virus integration loci by genome-wide BS analysis. Figure 1: The illustration of bisulfite-altered sequence to the original. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 BS-virus-finder 3 Figure 2: Principal types of mapping reads around the viral integration site. Description of in silico and real data assas, VA, USA) were cultured as previously described [19]. The cell line was validated by STR makers (Fig. S4). We performed Different types of paired-end (PE) reads (50 base pairs [bp], 90 bp, whole-genome sequencing (WGS) and WGBS sequencing of this 150 bp) that include 700 breakpoints in chromosome 1 (chr 1) of cell line (the results are shown in Table S4). Table 1 shows the GRCh38 were simulated in our study. Input fragments of 50 to analysis result for WGS data, which were compared with the 400 bp were randomly selected from chr 1 in the GRCh38 assem- output results analyzed by Vy-per [20], virus-clip [21], and Virus bly of the human genome. The hepatitis B virus (HBV) genome Finder2 [22]. (GenBank: X04615.1) was used in our simulation. Its integration length was between 45 bp and 180 bp. We cut HBV-containing segments with given PE insert size at all possible positions on Methods every integration event. After alignment, mapping accuracy of Sample preparation each of the 17 different types of read mappings was calculated (Fig. 2). Mapping accuracy varied among the 17 types of read PLC/PRF/5 hepatocellular carcinoma cell line was obtained from mappings in our simulation (Figs S1, S2, S3). In summary, the ac- ATCC (Manassas, VA, USA) and was cultured as previously de- curacies of several kinds of the read mappings were low (Tables scribed [19] and validated by STR makers (Fig. S4). In total, 15 S1, S2, S3), which may raise the false-negative rate. Generally, μg of DNA was extracted to perform WGS and WGBS sequenc- however, bwa-meth [12] performed very well. ing. Sample concentration was detected by fluorometer (Qubit- Bisulfite sequencing is a sophisticated technique to study Fluorometer, Invitrogen). Sample integrity and purification was DNA cytosine methylation. Bisulfite treatment followed by poly- determined by Agarose Gel Electrophoresis. merase chain reaction (PCR) amplification specifically converts unmethylated cytosine to thymine. By cooperating with next- Whole-genome sequencing generation sequencing technology, it is able to detect the methy- lation status of every cytosine in the whole genome. Moreover, About 1.5 μg of gDNA was sonicated to 100–300-bp fragment longer reads make it possible to achieve higher accuracy. Be- genome DNA by Sonication (Covaris) and purified with QIAquick sides simulated data, the PLC/PRF/5 hepatocellular carcinoma PCR Purification Kit (Qiagen). Adapter ligation and target insert cell lines (from American Type Culture Collection [ATCC], Man- size fragment recovering and quantifying library by real-time Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 Gao et al. Table 1: The comparison of BS-virus-finder with other software using real data BSVF Vy-per Virus-clip Virus Finder2 Chr HB VB VE HB VB VE HB VB VE HB VB VE chr1 143 272 758 2945 3102 chr2 - - 52 018 758 207 281 chr3 131 451 702 1212 1322 - 131 451 701 1282 1403 131 451 701 1405 chr3 131 453 124 1416 1515 - 131 453 353 1416 1538 chr4 180 586 417 136 378 180 586 416 59 chr4 180 587 608 394 594 180 586 607 167 231 180 587 608 500 632 180 587 607 634 chr5 1 297 478 1174 1315 - 1 297 478 1241 1385 1 297 477 1388 chr7 110 894 616 2739 2748 chr8 35 446 380 2389 2459 35 446 214 2402 2455 35 446 601 2390 2519 35 446 392 2396 2608 chr8 - 106 944 290 698 1077 chr11 65 040 943 2631 2767 - - 65 040 964 2532 chr12 109 573 899 721 815 109 573 677 668 734 109 573 899 705 815 chr13 33 088 123 1521 1603 -- chr13 33 088 561 1917 2066 - 33 088 561 1995 2133 33 088 560 2133 chr16 69 947 046 2055 2826 chr16 70 169 959 2055 2735 70 169 971 2064 2240 chr16 74 425 602 2062 2665 chr17 82 105 786 407 489 82 105 984 368 435 82 105 783 347 489 chr17 82 107 626 2177 2321 - 82 107 710 2048 2159 82 107 625 2045 chr19 41 783 064 687 804 - 41 782 971 761 905 chr20 20 473 566 2415 2565 BSVF used WGBS data, and other software used WGS data. Supported by previous fluorescence in situ hybridization (FISH) experiments [ 8]. Abbreviations: HB: host breakpoint; VB: virus begin is the revealed leftmost position on virus; VE: virus end is the rightmost position on virus. quantitative PCR (QPCR; TaqMan Probe) were then performed. 2. Clustering: After alignment, the result was filtered. We se- The qualified library was sequenced on an Illumina Hiseq X Ten lect read pairs with 1 read match by the following criterion: platform, and 150 bp of PE reads were obtained. In total, around The Phred-scaled mapping quality is larger than 30 (≥30), and 90 G of clean data were generated. at least 1 soft clipping is longer than 5 bp (≥5). The mapped parts of reads, which is marked as “M” by its CIGAR string, cover the human reference genome. For paired reads, we also Whole-genome bisulfite sequencing add the gap between 2 mapped reads to their covered re- About 3 μg of gDNA were sonicated to 100–300 bp by Sonication gion, making read 1 and read 2 continuously covered on the (Covaris) and purified with MiniElute PCR Purification Kit (QIA- human reference. Each continuous region with at least 1 bp GEN). A single “A” nucleotide was added to the 3 ends of the of overlapisdefinedasacluster. Allreadsinvolvedare se- blunt fragments. Methylated adapters were then purified and lected to form the cluster. The remaining soft clippings are addedtothe 5 and 3 ends of each strand in the genomic frag- viral junction candidates. Read pairs with 1 read mapped on ment. Sizes 300–400 bp were selected. DNA was then purified the virus also indicate a potential virus junction between the with QIAquick Gel Extraction Kit (QIAGEN) and bisulfite treated read pairs. with Methylation-Gold Kit (ZYMO). Finally, PCR was conducted 3. Assembling: Within 1 cluster, all soft clipping start sites are and sizes 350–400 bp were selected and purified with QIAquick collected. The position with the most abundance of start Gel Extraction kit (QIAGEN). Qualified library was amplified on sites is identified as the most likely candidate breakpoint. All cBot to generate the cluster on the flowcells (TruSeq PE Clus- clipping sequences in the cluster are extracted and aligned ter Kit V3–cBot–HS, Illumina). The flowcells were sequenced for together. A restore algorithm was used to calculate the most 150 bp of PE reads on the HiSeq X Ten platform, and more than possible base in each position based on the aligned bases 90G of clean data were generated. and their sequencing quality. The algorithm is based on a Bayesian model, where we compute the posteriori probability estimation for A, C, G, T as: Data analysis The read coverage situation for 1 integration is shown in Fig. 3. P (T )P (D|T ) P (T )P (D|T ) Four steps were implemented to detect virus integration: Wi Wi Ci Ci P (T |D) = × s s P (T )P (D|T ) P (T )P (D|T ) Wx Wx Cx Cx 1. Alignment: We use bwa-meth [12] to align bisulfite-treated x=1 x=1 sequencing reads to a hybrid reference that contains both = C × P (D|T ) × P (D|T ) 0 Wi Ci (1) human genome and virus sequences. For chimeric reads from the junction parts, BWA-MEM [23] will align it to 1 or- P (T ) P (T ) Wi Ci C = × ganism and mark the unmapped part as soft clipping, which S S P (T )P (D|T ) P (T )P (D|T ) Wx Wx Cx Cx is in fact from the other organism. This enables us to find x=1 x=1 breakpoints directly from the alignment. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 BS-virus-finder 5 Figure 3: A demo plot of 1 viral integration cluster in its pre-insertion form. Here, D is the observation of the next-generation sequencing 1 − 10 (T ∈{C, G, T }) (NGS) reads on given position. P(Ti|D) is the likelihood compo- P (d |T) = . (4) Ck ⎪ 1 − 10 nent, which can be interpreted as the probability of observing (T ∈{A}) D when the true genotype is T .D is a realization (or observa- i W tion) of the NGS reads in the Watson strand. D is a realization (or observation) of the NGS reads in the Crick strand. P(T |D) is We used “Y” and “R” to represent C/T and G/A, respectively (IU- Wi PAC nucleotide code). If a region is covered by both the Watson the likelihood component, which can be interpreted as the prob- ability of observing D when the true genotype is T .P(T |D) is strand and the Crick strand, we were able to deduce the original Wi Ci the likelihood component, which can be interpreted as the prob- base from Y or R by calculation. ability of observing D when the true genotype is T . At each Ci 4. Detection of viral integrations: The assembled clipping re- virus location, prior probability P(Ti) of each genotype Ti was set gions above were mapped to the given virus reference se- according to Table S5. The likelihood P(D|Ti) for the assumed quence with a Smith-Waterman local alignment tool from genotype Ti was calculated from the observed allele types in the the EMBOSS package [24], which supports IUPAC DNA codes sequencing reads in formula 2. Thus, on the Watson strand, it Y and R. Virus fragment location is extracted from the align- is P(D |T ), and on the Crick strand it is P(D |T ). We defined the W C i i ment results. likelihood of observing allele d in a read for a possible haploid genotype TasP(d |T), on the Watson strand it is P(d |T), and on k Wk the Crick strand it is P(d |T). So, for a set of n observed alleles at Ck Discussion a locus, D = {d ,d , ..., d } on each strand, these probabilities 1 2 n In summary, we have implemented the first software tool to de- are computed as shown by formulas (3) and (4), where Q stands tect virus integration using BS data. Our software is based on for the base quality from the fastaq file. bwa-meth, and by assembling and aligning soft-clip regions, it can find the virus breakpoints. However, accuracy of reads sur- m n rounding the breakpoints needs to be further improved. A virus P (D |T ) = P (d |T), P (D |T ) = P (d |T). (2) W i Wk C i Ck usually integrates into regions that are homologous to both hu- k=1 k=1 man and virus (micro-homologous) [25]. Therefore, we consider breakpoints predicted by our software tool correctly identified if they are within 10 bp of a real breakpoint (Fig. S2). With this definition, the accuracy of our predicted breakpoints can reach over 70%. Our results will be useful for analyzing BS data and re- ⎪ 1 − 10 (T ∈{A, C, G}) lated applications. Some of the results come with only a location P (d |T) = , (3) on human genome, and the virus location missing. This may be Wk − ⎪ 1 − 10 (T ∈{T }) due to the shortage of virus fragments. We simulated 3 kinds of Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Gao et al. reads, PE50, 90, and 150 with various lengths, and further simu- Additional Figure S3: The performance of BS-virus-finder in lated virus-inserted fragment with different lengths as well (Ta- various lengths of virus integration using PE150. ble S6); thus all cases described in Fig. 2 are mimicked here. All Additional Figure S4: The diagram of STR for the PLC/PRF/5 simulations sampled all possible reads, base by base with fixed cell line. insert sizes. As the result in Table S6 showed, the longer the reads, the more accurate a prediction can be achieved. In partic- Abbreviations ular, for read lengths around 50 bp, BS-virus-finder is capable of bp: base pair; BS: bisulfite sequencing; DMR: different methyla- finding the virus integration with an accuracy of more than 70%; tion region; HBV: hepatitis B virus; IUPAC: International Union of for the read lengths between 90 bp and 150 bp, BS-virus-finder Pure and Applied Chemistry; NGS: next-generation sequencing; is capable of finding the virus integration with an accuracy of PCR: polymerase chain reaction; PE: paired-end; SNP: single nu- more than 90%. Apart from simulated data, we have performed cleotide polymorphism; WGBS: Whole-genome-based bisulfite WGS and WGBS sequencing of the PLC/PRF/5 hepatocellular car- sequencing. cinoma cell line (Table S4). As the results show, when the length of input is larger than 150 bp, the analysis result of WGBS is simi- lar to the one of WGS. Additionally, BS-virus-finder is able to find Competing interests breakpoints in 8 out of 9 regions identified by FISH [ 8]. Based The authors declare that they have no competing interests. on these experimental results, we believe that BS-virus-finder is a powerful software tool to analyze virus integration using BS data. Funding This work was funded by the National Natural Science Founda- Availability and requirements tion of China (81602477) and Shenzhen Municipal Government of China (ZDSYS201507301424148). Project Name: BS-virus-finder: virus integration calling using bisulfite-sequencing data Project home page: https://github.com/BGI-SZ/BSVF [26] Authors contributions Operating system: Linux C.P., L.B., and H.Y. conceptualized the project. S.G., X.H., S.L., and Programming language: Perl, Python, C J.W. designed BSVF and developed its accompanying utilities. License: LGPL v3 S.G., X.H., C.G., X.Z., M.W., and S.Z. developed the protocol. F.X., Research Resource Identifier: BSVF, RRID:SCR 015727 D.F., H.C., and J.B. conducted experiments. S.G., X.H., B.L., and S.W. undertook the analysis. K.X., L.M., S.G., X.H., L.B., and C.P. wrote and approved the final version of the manuscript. All au- Availability of supporting data thors read and approved the final manuscript. Data used in this paper are simulated based on random insertion of the HBV sequence into the human chromosome 1 sequence. A Acknowledgements Perl script named “simVirusInserts.pl” is included, and our sim- ulation schema is coded within. We have run the simulation sev- We appreciate the support of Xiaolin Liang and Hengtong Li in eral times, and the result shows no significant difference. The the College of Mathematics and Statistics, Changsha University PLC/PRF/5 hepatocellular carcinoma cell lines were from Amer- of Science and Technology, for contributing advice to our re- ican Type Culture Collection (ATCC, Manassas, VA, USA) and se- search. quenced by HiSeq X Ten System from Novogene company. WGS and WGBA data have been submitted to NCBI SRA project PR- References JNA400455. Supporting data, an archival copy of the code, and the Perl script “simVirusInserts.pl” are also available via the Gi- 1. Wang Y, Shang Y. Epigenetic control of epithelial-to- gaScience repository, GigaDB [27]. mesenchymal transition and cancer metastasis. Experimen- tal Cell Res 2013;319(2):160–9. 2. O’Doherty AM, Magee DA, O’Shea LC et al. DNA methy- Additional files lation dynamics at imprinted genes during bovine pre- implantation embryo development. BMC Dev Biol 2015;15 Additional Table S1: Alignment accuracy rate around the break- 1:13. point region using PE50 data. 3. Cotton AM, Price EM, Jones MJ et al. Landscape of DNA Additional Table S2: Alignment accuracy rate around the methylation on the X chromosome reflects CpG density, breakpoint region using PE90 data. functional chromatin state and X-chromosome inactivation. Additional Table S3: Alignment accuracy rate around the Hum Mol Genet 2015;24(6):1528–39. breakpoint region using PE150 data. 4. Kamdar SN, Ho LT, Kron KJ et al. Dynamic interplay be- Additional Table S4: Mapping statistics of cell line sequencing tween locus-specific DNA methylation and hydroxymethy- data. lation regulates distinct biological pathways in prostate car- Additional Table S5: The prior probability of the Bayesian cinogenesis. Clin Epigenet 2016;8(1):32. model used in the restoring process for bisulfite sequencing of 5. Haldrup C, Mundbjerg K, Vestergaard EM et al. DNA methyla- integrated virus. tion signatures for prediction of biochemical recurrence af- Additional Table S6: The performance of BS-virus-finder in ter radical prostatectomy of clinically localized prostate can- silico with different read lengths and insert sizes. cer. J Clin Oncol 2013;31(26):3250–8. Additional Figure S1: The performance of BS-virus-finder in 6. Kim JH, Dhanasekaran SM, Prensner JR et al. Deep sequenc- various lengths of virus integration using PE50. ing reveals distinct patterns of DNA methylation in prostate Additional Figure S2: The performance of BS-virus-finder in cancer. Genome Res 2011;21(7):1028–41. various lengths of virus integration using PE90. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 BS-virus-finder 7 7. Darst RP, Pardo CE, Ai L et al. Bisulfite sequencing of DNA. 18. Jiang P, Sun K, Lun FM et al. Methy-Pipe: an integrated bioin- Curr Protoc Mol Biol 2010;Chapter 7:Unit 7(9):1–17. formatics pipeline for whole genome bisulfite sequencing 8. Watanabe Y, Yamamoto H, Oikawa R et al. DNA methyla- data analysis. PLoS One 2014;9(6):e100360. tion at hepatitis B viral integrants is associated with methy- 19. Carr BI, Cavallini A, Lippolis C et al. Fluoro-sorafenib (Rego- lation at flanking human genomic sequences. Genome Res rafenib) effects on hepatoma cells: growth inhibition, quies- 2015;25(3):328–37. cence, and recovery. J Cell Physiol 2013;228(2):292–7. 9. Lillsunde Larsson G, Helenius G, Sorbe B et al. Viral load, inte- 20. Forster M, Szymczak S, Ellinghaus D et al. Vy-PER: eliminat- gration and methylation of E2BS3 and 4 in human papilloma ing false positive detection of virus integration events in next virus (HPV) 16-positive vaginal and vulvar carcinomas. PLoS generation sequencing data. Sci Rep 2015;5(1):11534. One 2014;9(11):e112839. 21. Ho DW, Sze KM, Ng IO. Virus-Clip: a fast and memory- 10. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAP- efficient viral integration site detection tool at single- ping program. BMC Bioinformatics 2009;10(1):232. base resolution with annotation capability. Oncotarget 11. Krueger F, Andrews SR. Bismark: a flexible aligner and 2015;6(25):20959–63. methylation caller for Bisulfite-Seq applications. Bioinfor- 22. Wang Q, Jia P, Zhao Z. VERSE: a novel approach to detect virus matics 2011;27(11):1571–2. integration in host genomes through reference genome cus- 12. Pedersen BS, Eyring K, De S et al. Fast and accurate alignment tomization. Genome Med 2015;7(1):2. of long bisulfite-seq reads. 2014, arXiv:1401.1129v2. 23. Li H. Aligning sequence reads, clone sequences and assem- 13. Zhang Y, Liu H, Lv J et al. QDMR: a quantitative method bly contigs with BWA-MEM. 2013, arXiv:1303.3997v2. for identification of differentially methylated regions by en- 24. Rice P, Longden I, Bleasby A. EMBOSS: the European tropy. Nucleic Acids Res 2011, 39(9):e58-. Molecular Biology Open Software Suite. Trends Genet 14. Stockwell PA, Chatterjee A, Rodger EJ et al. DMAP: differen- 2000;16(6):276–7. tial methylation analysis package for RRBS and WGBS data. 25. Hu Z, Zhu D, Wang W et al. Genome-wide profiling of HPV in- Bioinformatics 2014;30(13):1814–22. tegration in cervical cancer identifies clustered genomic hot 15. Gao S, Zou D, Mao L et al. SMAP: a streamlined methyla- spots and a potential microhomology-mediated integration tion analysis pipeline for bisulfite sequencing. Gigascience mechanism. Nat Genet 2015;47(2):158–63. 2015;4(1):1814–22. 26. Bisulfite Sequencing Virus Integration Finder. 16. Gao S, Zou D, Mao L et al. BS-SNPer: SNP calling in bisulfite- https://github.com/BGI-SZ/BSVF. Accessed 16 October seq data. Bioinformatics 2015;31(24):4006–8. 2017. 17. Liu Y, Siegmund KD, Laird PW et al. Bis-SNP: combined DNA 27. Gao S, Hu X, Xu F et al. Supporting data for “BS-virus-finder: methylation and SNP calling for Bisulfite-seq data. Genome virus integration calling using bisulfite-sequencing data.” Gi- Biol 2012;13(7):R61. gaScience Database 2017. http://dx.doi.org/10.5524/100377. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png GigaScience Oxford University Press

Loading next page...
 
/lp/ou_press/bs-virus-finder-virus-integration-calling-using-bisulfite-sequencing-0g9ieJKimx
Publisher
Oxford University Press
Copyright
© The Author(s) 2017. Published by Oxford University Press.
eISSN
2047-217X
D.O.I.
10.1093/gigascience/gix123
Publisher site
See Article on Publisher Site

Abstract

Background: DNA methylation plays a key role in the regulation of gene expression and carcinogenesis. Bisulfite sequencing studies mainly focus on calling single nucleotide polymorphism, different methylation region, and find allele-specific DNA methylation. Until now, only a few software tools have focused on virus integration using bisulfite sequencing data. Findings: We have developed a new and easy-to-use software tool, named BS-virus-finder (BSVF, Received: 10 February 2017; Revised: 6 September 2017; Accepted: 30 November 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 Gao et al. RRID:SCR 015727), to detect viral integration breakpoints in whole human genomes. The tool is hosted at https://github.com/BGI-SZ/BSVF. Conclusions: BS-virus-finder demonstrates high sensitivity and specificity. It is useful in epigenetic studies and to reveal the relationship between viral integration and DNA methylation. BS-virus-finder is the first software tool to detect virus integration loci by using bisulfite sequencing data. Keywords: bisulfite sequencing; carcinogenesis; virus integration Watson strand, methylated C remains C, and unmethylated C Introduction changes to T. On the Crick strand, the reverse complement hap- DNA methylation plays a crucial role in many areas including pens; i.e., methylated C remains C, but in sequenced reads it is development [1, 2] and X chromosome inactivation [3]byreg- reverse-complemented to G, and unmethylated C changes to T, ulating genetic imprinting and epigenetic modification without leading to the reverse-complement base A in sequenced reads. altering DNA sequences. Previous studies have shown a strong As base C can either be methylated or unmethylated, we can use association of DNA methylation with cancer. The methylation International Union of Pure and Applied Chemistry (IUPAC) nu- status altering related to carcinogenesis [4], cancer recurrence cleotide codes “Y” and “R” to represent C/T and G/A, respectively. [5], and metastasis [6] has already been revealed by emerg- So, after bisulfite treatment, base C changes to Y on the Watson ing bisulfite sequencing (BS) technology. BS technology can in- strand, and base G changes to R on the Crick strand. vestigate DNA methylation changes with single-base accuracy. Whole-genome-based bisulfite sequencing (WGBS) has been Treatment of DNA with bisulfite converts unmethylated cytosine developed to detect DNA methylation. Recent clinical studies residues to uracil, but leaves 5-methylcytosine residues unmod- showed that DNA methylation is associated with viral integra- ified [ 7]. Thus, bisulfite treatment introduces specific changes tion [8, 9]. Whole-genome BS data can be analyzed to investigate in the DNA sequence that depend on the methylation status of the sequence mapping and alignment via BSMAP [10], Bismark individual cytosine residues, yielding single-nucleotide resolu- [11], and bwa-meth [12], to detect different methylation regions tion information about the methylation status of a segment of (DMRs) via the software QDMR [13], DMAP [14], and SMAP [15], DNA (Fig. 1). Various analyses can be performed on the altered to identify single nucleotide polymorphisms (SNPs) via BS-SNPer sequences to retrieve this information. BS technology can re- [16] and Bis-SNP [17], and to find allele-specific DNA methylation veal differences between cytosines and thymidine and sequence viaSMAP[15] and Methy-Pipe [18]. However, none of them can change resulting from bisulfite conversion. For the bases with- be used for virus integration loci calling, and no software tool is out methylation, all Cs will change to Ts on both strands. After currently available to detect virus integration loci by analyzing directional library preparation, we have 2 different conversions: BS data. Therefore, we have developed a software tool to detect the Watson and the Crick strands, as shown in Fig. 1.Onthe the virus integration loci by genome-wide BS analysis. Figure 1: The illustration of bisulfite-altered sequence to the original. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 BS-virus-finder 3 Figure 2: Principal types of mapping reads around the viral integration site. Description of in silico and real data assas, VA, USA) were cultured as previously described [19]. The cell line was validated by STR makers (Fig. S4). We performed Different types of paired-end (PE) reads (50 base pairs [bp], 90 bp, whole-genome sequencing (WGS) and WGBS sequencing of this 150 bp) that include 700 breakpoints in chromosome 1 (chr 1) of cell line (the results are shown in Table S4). Table 1 shows the GRCh38 were simulated in our study. Input fragments of 50 to analysis result for WGS data, which were compared with the 400 bp were randomly selected from chr 1 in the GRCh38 assem- output results analyzed by Vy-per [20], virus-clip [21], and Virus bly of the human genome. The hepatitis B virus (HBV) genome Finder2 [22]. (GenBank: X04615.1) was used in our simulation. Its integration length was between 45 bp and 180 bp. We cut HBV-containing segments with given PE insert size at all possible positions on Methods every integration event. After alignment, mapping accuracy of Sample preparation each of the 17 different types of read mappings was calculated (Fig. 2). Mapping accuracy varied among the 17 types of read PLC/PRF/5 hepatocellular carcinoma cell line was obtained from mappings in our simulation (Figs S1, S2, S3). In summary, the ac- ATCC (Manassas, VA, USA) and was cultured as previously de- curacies of several kinds of the read mappings were low (Tables scribed [19] and validated by STR makers (Fig. S4). In total, 15 S1, S2, S3), which may raise the false-negative rate. Generally, μg of DNA was extracted to perform WGS and WGBS sequenc- however, bwa-meth [12] performed very well. ing. Sample concentration was detected by fluorometer (Qubit- Bisulfite sequencing is a sophisticated technique to study Fluorometer, Invitrogen). Sample integrity and purification was DNA cytosine methylation. Bisulfite treatment followed by poly- determined by Agarose Gel Electrophoresis. merase chain reaction (PCR) amplification specifically converts unmethylated cytosine to thymine. By cooperating with next- Whole-genome sequencing generation sequencing technology, it is able to detect the methy- lation status of every cytosine in the whole genome. Moreover, About 1.5 μg of gDNA was sonicated to 100–300-bp fragment longer reads make it possible to achieve higher accuracy. Be- genome DNA by Sonication (Covaris) and purified with QIAquick sides simulated data, the PLC/PRF/5 hepatocellular carcinoma PCR Purification Kit (Qiagen). Adapter ligation and target insert cell lines (from American Type Culture Collection [ATCC], Man- size fragment recovering and quantifying library by real-time Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 Gao et al. Table 1: The comparison of BS-virus-finder with other software using real data BSVF Vy-per Virus-clip Virus Finder2 Chr HB VB VE HB VB VE HB VB VE HB VB VE chr1 143 272 758 2945 3102 chr2 - - 52 018 758 207 281 chr3 131 451 702 1212 1322 - 131 451 701 1282 1403 131 451 701 1405 chr3 131 453 124 1416 1515 - 131 453 353 1416 1538 chr4 180 586 417 136 378 180 586 416 59 chr4 180 587 608 394 594 180 586 607 167 231 180 587 608 500 632 180 587 607 634 chr5 1 297 478 1174 1315 - 1 297 478 1241 1385 1 297 477 1388 chr7 110 894 616 2739 2748 chr8 35 446 380 2389 2459 35 446 214 2402 2455 35 446 601 2390 2519 35 446 392 2396 2608 chr8 - 106 944 290 698 1077 chr11 65 040 943 2631 2767 - - 65 040 964 2532 chr12 109 573 899 721 815 109 573 677 668 734 109 573 899 705 815 chr13 33 088 123 1521 1603 -- chr13 33 088 561 1917 2066 - 33 088 561 1995 2133 33 088 560 2133 chr16 69 947 046 2055 2826 chr16 70 169 959 2055 2735 70 169 971 2064 2240 chr16 74 425 602 2062 2665 chr17 82 105 786 407 489 82 105 984 368 435 82 105 783 347 489 chr17 82 107 626 2177 2321 - 82 107 710 2048 2159 82 107 625 2045 chr19 41 783 064 687 804 - 41 782 971 761 905 chr20 20 473 566 2415 2565 BSVF used WGBS data, and other software used WGS data. Supported by previous fluorescence in situ hybridization (FISH) experiments [ 8]. Abbreviations: HB: host breakpoint; VB: virus begin is the revealed leftmost position on virus; VE: virus end is the rightmost position on virus. quantitative PCR (QPCR; TaqMan Probe) were then performed. 2. Clustering: After alignment, the result was filtered. We se- The qualified library was sequenced on an Illumina Hiseq X Ten lect read pairs with 1 read match by the following criterion: platform, and 150 bp of PE reads were obtained. In total, around The Phred-scaled mapping quality is larger than 30 (≥30), and 90 G of clean data were generated. at least 1 soft clipping is longer than 5 bp (≥5). The mapped parts of reads, which is marked as “M” by its CIGAR string, cover the human reference genome. For paired reads, we also Whole-genome bisulfite sequencing add the gap between 2 mapped reads to their covered re- About 3 μg of gDNA were sonicated to 100–300 bp by Sonication gion, making read 1 and read 2 continuously covered on the (Covaris) and purified with MiniElute PCR Purification Kit (QIA- human reference. Each continuous region with at least 1 bp GEN). A single “A” nucleotide was added to the 3 ends of the of overlapisdefinedasacluster. Allreadsinvolvedare se- blunt fragments. Methylated adapters were then purified and lected to form the cluster. The remaining soft clippings are addedtothe 5 and 3 ends of each strand in the genomic frag- viral junction candidates. Read pairs with 1 read mapped on ment. Sizes 300–400 bp were selected. DNA was then purified the virus also indicate a potential virus junction between the with QIAquick Gel Extraction Kit (QIAGEN) and bisulfite treated read pairs. with Methylation-Gold Kit (ZYMO). Finally, PCR was conducted 3. Assembling: Within 1 cluster, all soft clipping start sites are and sizes 350–400 bp were selected and purified with QIAquick collected. The position with the most abundance of start Gel Extraction kit (QIAGEN). Qualified library was amplified on sites is identified as the most likely candidate breakpoint. All cBot to generate the cluster on the flowcells (TruSeq PE Clus- clipping sequences in the cluster are extracted and aligned ter Kit V3–cBot–HS, Illumina). The flowcells were sequenced for together. A restore algorithm was used to calculate the most 150 bp of PE reads on the HiSeq X Ten platform, and more than possible base in each position based on the aligned bases 90G of clean data were generated. and their sequencing quality. The algorithm is based on a Bayesian model, where we compute the posteriori probability estimation for A, C, G, T as: Data analysis The read coverage situation for 1 integration is shown in Fig. 3. P (T )P (D|T ) P (T )P (D|T ) Four steps were implemented to detect virus integration: Wi Wi Ci Ci P (T |D) = × s s P (T )P (D|T ) P (T )P (D|T ) Wx Wx Cx Cx 1. Alignment: We use bwa-meth [12] to align bisulfite-treated x=1 x=1 sequencing reads to a hybrid reference that contains both = C × P (D|T ) × P (D|T ) 0 Wi Ci (1) human genome and virus sequences. For chimeric reads from the junction parts, BWA-MEM [23] will align it to 1 or- P (T ) P (T ) Wi Ci C = × ganism and mark the unmapped part as soft clipping, which S S P (T )P (D|T ) P (T )P (D|T ) Wx Wx Cx Cx is in fact from the other organism. This enables us to find x=1 x=1 breakpoints directly from the alignment. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 BS-virus-finder 5 Figure 3: A demo plot of 1 viral integration cluster in its pre-insertion form. Here, D is the observation of the next-generation sequencing 1 − 10 (T ∈{C, G, T }) (NGS) reads on given position. P(Ti|D) is the likelihood compo- P (d |T) = . (4) Ck ⎪ 1 − 10 nent, which can be interpreted as the probability of observing (T ∈{A}) D when the true genotype is T .D is a realization (or observa- i W tion) of the NGS reads in the Watson strand. D is a realization (or observation) of the NGS reads in the Crick strand. P(T |D) is We used “Y” and “R” to represent C/T and G/A, respectively (IU- Wi PAC nucleotide code). If a region is covered by both the Watson the likelihood component, which can be interpreted as the prob- ability of observing D when the true genotype is T .P(T |D) is strand and the Crick strand, we were able to deduce the original Wi Ci the likelihood component, which can be interpreted as the prob- base from Y or R by calculation. ability of observing D when the true genotype is T . At each Ci 4. Detection of viral integrations: The assembled clipping re- virus location, prior probability P(Ti) of each genotype Ti was set gions above were mapped to the given virus reference se- according to Table S5. The likelihood P(D|Ti) for the assumed quence with a Smith-Waterman local alignment tool from genotype Ti was calculated from the observed allele types in the the EMBOSS package [24], which supports IUPAC DNA codes sequencing reads in formula 2. Thus, on the Watson strand, it Y and R. Virus fragment location is extracted from the align- is P(D |T ), and on the Crick strand it is P(D |T ). We defined the W C i i ment results. likelihood of observing allele d in a read for a possible haploid genotype TasP(d |T), on the Watson strand it is P(d |T), and on k Wk the Crick strand it is P(d |T). So, for a set of n observed alleles at Ck Discussion a locus, D = {d ,d , ..., d } on each strand, these probabilities 1 2 n In summary, we have implemented the first software tool to de- are computed as shown by formulas (3) and (4), where Q stands tect virus integration using BS data. Our software is based on for the base quality from the fastaq file. bwa-meth, and by assembling and aligning soft-clip regions, it can find the virus breakpoints. However, accuracy of reads sur- m n rounding the breakpoints needs to be further improved. A virus P (D |T ) = P (d |T), P (D |T ) = P (d |T). (2) W i Wk C i Ck usually integrates into regions that are homologous to both hu- k=1 k=1 man and virus (micro-homologous) [25]. Therefore, we consider breakpoints predicted by our software tool correctly identified if they are within 10 bp of a real breakpoint (Fig. S2). With this definition, the accuracy of our predicted breakpoints can reach over 70%. Our results will be useful for analyzing BS data and re- ⎪ 1 − 10 (T ∈{A, C, G}) lated applications. Some of the results come with only a location P (d |T) = , (3) on human genome, and the virus location missing. This may be Wk − ⎪ 1 − 10 (T ∈{T }) due to the shortage of virus fragments. We simulated 3 kinds of Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Gao et al. reads, PE50, 90, and 150 with various lengths, and further simu- Additional Figure S3: The performance of BS-virus-finder in lated virus-inserted fragment with different lengths as well (Ta- various lengths of virus integration using PE150. ble S6); thus all cases described in Fig. 2 are mimicked here. All Additional Figure S4: The diagram of STR for the PLC/PRF/5 simulations sampled all possible reads, base by base with fixed cell line. insert sizes. As the result in Table S6 showed, the longer the reads, the more accurate a prediction can be achieved. In partic- Abbreviations ular, for read lengths around 50 bp, BS-virus-finder is capable of bp: base pair; BS: bisulfite sequencing; DMR: different methyla- finding the virus integration with an accuracy of more than 70%; tion region; HBV: hepatitis B virus; IUPAC: International Union of for the read lengths between 90 bp and 150 bp, BS-virus-finder Pure and Applied Chemistry; NGS: next-generation sequencing; is capable of finding the virus integration with an accuracy of PCR: polymerase chain reaction; PE: paired-end; SNP: single nu- more than 90%. Apart from simulated data, we have performed cleotide polymorphism; WGBS: Whole-genome-based bisulfite WGS and WGBS sequencing of the PLC/PRF/5 hepatocellular car- sequencing. cinoma cell line (Table S4). As the results show, when the length of input is larger than 150 bp, the analysis result of WGBS is simi- lar to the one of WGS. Additionally, BS-virus-finder is able to find Competing interests breakpoints in 8 out of 9 regions identified by FISH [ 8]. Based The authors declare that they have no competing interests. on these experimental results, we believe that BS-virus-finder is a powerful software tool to analyze virus integration using BS data. Funding This work was funded by the National Natural Science Founda- Availability and requirements tion of China (81602477) and Shenzhen Municipal Government of China (ZDSYS201507301424148). Project Name: BS-virus-finder: virus integration calling using bisulfite-sequencing data Project home page: https://github.com/BGI-SZ/BSVF [26] Authors contributions Operating system: Linux C.P., L.B., and H.Y. conceptualized the project. S.G., X.H., S.L., and Programming language: Perl, Python, C J.W. designed BSVF and developed its accompanying utilities. License: LGPL v3 S.G., X.H., C.G., X.Z., M.W., and S.Z. developed the protocol. F.X., Research Resource Identifier: BSVF, RRID:SCR 015727 D.F., H.C., and J.B. conducted experiments. S.G., X.H., B.L., and S.W. undertook the analysis. K.X., L.M., S.G., X.H., L.B., and C.P. wrote and approved the final version of the manuscript. All au- Availability of supporting data thors read and approved the final manuscript. Data used in this paper are simulated based on random insertion of the HBV sequence into the human chromosome 1 sequence. A Acknowledgements Perl script named “simVirusInserts.pl” is included, and our sim- ulation schema is coded within. We have run the simulation sev- We appreciate the support of Xiaolin Liang and Hengtong Li in eral times, and the result shows no significant difference. The the College of Mathematics and Statistics, Changsha University PLC/PRF/5 hepatocellular carcinoma cell lines were from Amer- of Science and Technology, for contributing advice to our re- ican Type Culture Collection (ATCC, Manassas, VA, USA) and se- search. quenced by HiSeq X Ten System from Novogene company. WGS and WGBA data have been submitted to NCBI SRA project PR- References JNA400455. Supporting data, an archival copy of the code, and the Perl script “simVirusInserts.pl” are also available via the Gi- 1. Wang Y, Shang Y. Epigenetic control of epithelial-to- gaScience repository, GigaDB [27]. mesenchymal transition and cancer metastasis. Experimen- tal Cell Res 2013;319(2):160–9. 2. O’Doherty AM, Magee DA, O’Shea LC et al. DNA methy- Additional files lation dynamics at imprinted genes during bovine pre- implantation embryo development. BMC Dev Biol 2015;15 Additional Table S1: Alignment accuracy rate around the break- 1:13. point region using PE50 data. 3. Cotton AM, Price EM, Jones MJ et al. Landscape of DNA Additional Table S2: Alignment accuracy rate around the methylation on the X chromosome reflects CpG density, breakpoint region using PE90 data. functional chromatin state and X-chromosome inactivation. Additional Table S3: Alignment accuracy rate around the Hum Mol Genet 2015;24(6):1528–39. breakpoint region using PE150 data. 4. Kamdar SN, Ho LT, Kron KJ et al. Dynamic interplay be- Additional Table S4: Mapping statistics of cell line sequencing tween locus-specific DNA methylation and hydroxymethy- data. lation regulates distinct biological pathways in prostate car- Additional Table S5: The prior probability of the Bayesian cinogenesis. Clin Epigenet 2016;8(1):32. model used in the restoring process for bisulfite sequencing of 5. Haldrup C, Mundbjerg K, Vestergaard EM et al. DNA methyla- integrated virus. tion signatures for prediction of biochemical recurrence af- Additional Table S6: The performance of BS-virus-finder in ter radical prostatectomy of clinically localized prostate can- silico with different read lengths and insert sizes. cer. J Clin Oncol 2013;31(26):3250–8. Additional Figure S1: The performance of BS-virus-finder in 6. Kim JH, Dhanasekaran SM, Prensner JR et al. Deep sequenc- various lengths of virus integration using PE50. ing reveals distinct patterns of DNA methylation in prostate Additional Figure S2: The performance of BS-virus-finder in cancer. Genome Res 2011;21(7):1028–41. various lengths of virus integration using PE90. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018 BS-virus-finder 7 7. Darst RP, Pardo CE, Ai L et al. Bisulfite sequencing of DNA. 18. Jiang P, Sun K, Lun FM et al. Methy-Pipe: an integrated bioin- Curr Protoc Mol Biol 2010;Chapter 7:Unit 7(9):1–17. formatics pipeline for whole genome bisulfite sequencing 8. Watanabe Y, Yamamoto H, Oikawa R et al. DNA methyla- data analysis. PLoS One 2014;9(6):e100360. tion at hepatitis B viral integrants is associated with methy- 19. Carr BI, Cavallini A, Lippolis C et al. Fluoro-sorafenib (Rego- lation at flanking human genomic sequences. Genome Res rafenib) effects on hepatoma cells: growth inhibition, quies- 2015;25(3):328–37. cence, and recovery. J Cell Physiol 2013;228(2):292–7. 9. Lillsunde Larsson G, Helenius G, Sorbe B et al. Viral load, inte- 20. Forster M, Szymczak S, Ellinghaus D et al. Vy-PER: eliminat- gration and methylation of E2BS3 and 4 in human papilloma ing false positive detection of virus integration events in next virus (HPV) 16-positive vaginal and vulvar carcinomas. PLoS generation sequencing data. Sci Rep 2015;5(1):11534. One 2014;9(11):e112839. 21. Ho DW, Sze KM, Ng IO. Virus-Clip: a fast and memory- 10. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAP- efficient viral integration site detection tool at single- ping program. BMC Bioinformatics 2009;10(1):232. base resolution with annotation capability. Oncotarget 11. Krueger F, Andrews SR. Bismark: a flexible aligner and 2015;6(25):20959–63. methylation caller for Bisulfite-Seq applications. Bioinfor- 22. Wang Q, Jia P, Zhao Z. VERSE: a novel approach to detect virus matics 2011;27(11):1571–2. integration in host genomes through reference genome cus- 12. Pedersen BS, Eyring K, De S et al. Fast and accurate alignment tomization. Genome Med 2015;7(1):2. of long bisulfite-seq reads. 2014, arXiv:1401.1129v2. 23. Li H. Aligning sequence reads, clone sequences and assem- 13. Zhang Y, Liu H, Lv J et al. QDMR: a quantitative method bly contigs with BWA-MEM. 2013, arXiv:1303.3997v2. for identification of differentially methylated regions by en- 24. Rice P, Longden I, Bleasby A. EMBOSS: the European tropy. Nucleic Acids Res 2011, 39(9):e58-. Molecular Biology Open Software Suite. Trends Genet 14. Stockwell PA, Chatterjee A, Rodger EJ et al. DMAP: differen- 2000;16(6):276–7. tial methylation analysis package for RRBS and WGBS data. 25. Hu Z, Zhu D, Wang W et al. Genome-wide profiling of HPV in- Bioinformatics 2014;30(13):1814–22. tegration in cervical cancer identifies clustered genomic hot 15. Gao S, Zou D, Mao L et al. SMAP: a streamlined methyla- spots and a potential microhomology-mediated integration tion analysis pipeline for bisulfite sequencing. Gigascience mechanism. Nat Genet 2015;47(2):158–63. 2015;4(1):1814–22. 26. Bisulfite Sequencing Virus Integration Finder. 16. Gao S, Zou D, Mao L et al. BS-SNPer: SNP calling in bisulfite- https://github.com/BGI-SZ/BSVF. Accessed 16 October seq data. Bioinformatics 2015;31(24):4006–8. 2017. 17. Liu Y, Siegmund KD, Laird PW et al. Bis-SNP: combined DNA 27. Gao S, Hu X, Xu F et al. Supporting data for “BS-virus-finder: methylation and SNP calling for Bisulfite-seq data. Genome virus integration calling using bisulfite-sequencing data.” Gi- Biol 2012;13(7):R61. gaScience Database 2017. http://dx.doi.org/10.5524/100377. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/1/1/4757065 by Ed 'DeepDyve' Gillespie user on 16 March 2018

Journal

GigaScienceOxford University Press

Published: Jan 1, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off