Illumina Library Preparation for Sequencing the GC-Rich Fraction of Heterogeneous Genomic DNA

Abstract

Standard Illumina libraries are biased toward sequences of intermediate GC-content. This results in an underrepresentation of GC-rich regions in sequencing projects of genomes with heterogeneous base composition, such as mammals and birds. We developed a simple, cost-effective protocol to enrich sheared genomic DNA in its GC-rich fraction by subtracting AT-rich DNA. This was achieved by heating DNA up to 90 °C before applying Illumina library preparation. We tested the new approach on chicken DNA and found that heated DNA increased average coverage in the GC-richest chromosomes by a factor of up to six. Using a Taq polymerase supposedly appropriate for PCR amplification of GC-rich sequences had a much weaker effect. Our protocol should greatly facilitate sequencing and resequencing of the GC-richest regions of heterogeneous genomes, in combination with standard short-read and long-read technologies.

Key words: GC content, GC enrichment, high-throughput sequencing, bird.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

Tilak et al. Genome Biol. Evol. 10(2):616–622. doi:10.1093/gbe/evy022. Advance Access publication January 27, 2018.

Introduction

High-throughput sequencing technologies have decreased the cost of sequencing by several orders of magnitude over the last few decades (Reuter et al. 2015). Short-read technologies have increased the depth of coverage to values typically >60× for whole-genome sequencing and 15× for resequencing data (Sims et al. 2014). Unfortunately, depth of coverage is often far from evenly distributed across the sequenced genome. Biases in PCR amplification create uneven genomic representation in classical Illumina libraries (Dohm et al. 2008; Kozarewa et al. 2009; Aird et al. 2011), PCR being sensitive to extreme GC-content variation (Baskaran et al. 1996; Benita et al. 2003; Oyola et al. 2012). In consequence, the GC-rich regions of large, heterogeneous genomes are typically undercovered, and therefore inefficiently assembled, when libraries are prepared following standard protocols (Hillier et al. 2004). A marked heterogeneity in GC-content has been identified in various genomes of relatively large size. In angiosperms, monocots and especially grasses (Poaceae) show a bimodal distribution of GC-content in protein-coding genes, with a class of very GC-rich genes (Yu et al. 2002; Serres-Giardi et al. 2012; Clement et al. 2014; Glemin et al. 2014). Most mammalian genomes, including the human genome, have a local GC-content that varies from 30% to >55% at the kilo-base scale (Lander et al. 2001; Cohen et al. 2005; Duret et al. 2006), and a similar pattern has been reported in honey bee (Apis mellifera) and several species of ants (The Honeybee Genome Sequencing Consortium et al. 2006; Smith et al. 2011).

The genomes of birds are arguably among the most heterogeneous with respect to GC-content, both within and among chromosomes. Birds show a particularly striking negative correlation between GC-content and chromosome size (Hillier et al. 2004): the bird karyotype includes a number of very small chromosomes that are particularly GC-rich, underrepresented in short-read sequence data, and difficult to assemble. The original draft chicken genome assembly, for instance, only included 29 out of the 38 autosomes, with the smallest chromosomes being missing (Hillier et al. 2004). Importantly, gene density is strongly correlated with GC-content in birds (fig. 1). The unassembled GC-rich regions actually contain a substantial portion (probably 15%) of the bird gene complement, which is currently missing from genome annotation databases, as we recently demonstrated from transcriptome analyses (Botero-Castro et al. 2017, see also Hron et al. 2015).

FIG. 1. Gene content and GC-content computed in 100-kb non-overlapping windows across the chicken genome (Gallus_gallus-5.0). Linear and quadratic regression lines are shown in blue and red, respectively. [Scatter plot; x-axis: GC; y-axis: gene density (genes per 100 kb).]

There is, therefore, a clear need for DNA sequencing methods alleviating the GC bias. Single-molecule real-time (SMRT) sequencing technologies that do not rely on PCR have recently contributed significantly to improving genome assembly in large genomes (Davey et al. 2016; Gordon et al. 2016; Bickhart et al. 2017; Korlach et al. 2017; Warren et al. 2017; Weissensteiner et al. 2017). In birds, the chicken, zebra finch (Taeniopygia guttata), Anna's hummingbird (Calypte anna), and hooded crow (Corvus cornix) assemblies have been improved using PacBio technologies with a coverage from 50× to 96× (Korlach et al. 2017; Warren et al. 2017; Weissensteiner et al. 2017). SMRT sequencing, however, remains relatively costly and error prone, and requires high quantity and quality of DNA, so that in many projects sequencing depth is mainly contributed by PCR-dependent technologies. Several attempts have been made to optimize PCR conditions, such as temperature ramp rate, denaturation time, chemical additives, and DNA polymerase, in order to reduce the GC bias during library preparation (Aird et al. 2011; Oyola et al. 2012). Aird et al. (2011), for instance, improved the homogeneity of coverage depth when applying optimized protocols to a mixture of bacterial DNA from three distinct species, but they concluded that no single protocol is appropriate in every situation. GC-rich and GC-poor DNA have distinct optimal PCR conditions, so that amplifying heterogeneous DNA is intrinsically a difficult problem.

Elaborating on this idea, we suggest isolating GC-rich DNA before sequencing it. We investigate a simple method aiming at enriching genomic DNA in its GC-rich fraction prior to library preparation. We show that a simple heat-denaturation and sizing of fragmented DNA before the blunt-end repair step results in a substantial increase in the average GC-content of sequence reads. Applying this protocol to chicken DNA, we achieved a considerable increase in coverage depth of the GC-richest regions of the genome. The new approach is cheap, does not require high quantity or quality of DNA, and is complementary to the shotgun, mate pair and/or SMRT approaches.

Materials and Methods

DNA Extraction and Treatment Post-Illumina Library Preparation

Total genomic DNA was extracted from chicken tissue using the DNeasy Blood and Tissue kit (QIAGEN) following the manufacturer's instructions. About 3 µg of total genomic DNA were sheared for 20 min using an ultrasonic cleaning unit (Elmasonic One). Sheared DNA was separated in six tubes of 50 µl containing 500 ng of DNA each. We applied different temperatures to the sheared DNA in order to denature it. Two samples (CHK2-75 and CHK2-85) were heated for 5 min to 75 °C and 85 °C, respectively. Three samples (CHK3-75, CHK3-85, and CHK-90) were heated to 75 °C, 85 °C, and 90 °C, respectively, and submitted to a second step of shearing in an ultrasonic cleaning unit (Elmasonic One) for 5 min. One control sample (CHK1) was not heated. All samples were sized using AMPure (Agencourt) immediately after treatments (see table 1).

Library Preparation and Sequencing

Illumina library preparation followed the classical protocol involving blunt-end repair, adapter ligation, and adapter fill-in steps as developed by Meyer and Kircher (2010), with slight modifications as explained by Tilak et al. (2015). The full protocol has been deposited in protocols.io (dx.doi.org/10.17504/protocols.io.jxicpke). Libraries were quantified using a Nanodrop ND-8000 spectrophotometer (Nanodrop technologies). About 5 ng of each library (except CHK-90) were PCR indexed using Taq Phusion (Phusion High-Fidelity DNA Polymerase, Thermo Scientific) and KAPA HiFi (2× KAPA HiFi HotStart ReadyMix, KAPABIOSYSTEMS) polymerases because these amplification enzymes could have different GC biases (Quail et al. 2011). CHK-90 was only amplified with KAPA HiFi and 3% DMSO, so that 11 indexed libraries were generated: one for CHK-90 and two for each of the other five conditions. Indexed libraries were purified using AMPure (Agencourt) at a ratio of 1.6, quantified with the Nanodrop ND-8000, and pooled in equimolar ratio. The pool of indexed libraries was single-read sequenced on one lane of Illumina HiSeq 2500 at GATC-Biotech (Konstanz, Germany).
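The equimolar pooling step mentioned above can be illustrated with a short calculation. The following is only a minimal sketch, not the authors' procedure: it assumes that each library is described by a mass concentration (ng/µl, as a spectrophotometer would report) and a mean fragment length, and it uses the common approximation of 660 g/mol per base pair of double-stranded DNA. The function names, library labels, concentrations and fragment lengths are all hypothetical.

```python
# Sketch: volumes needed to pool indexed libraries in equimolar ratio.
# All concentrations and fragment lengths used here are hypothetical placeholders.

BP_MOLAR_MASS = 660.0  # g/mol per base pair of double-stranded DNA (common approximation)


def molarity_nM(conc_ng_per_ul, mean_length_bp):
    """Convert a dsDNA mass concentration (ng/µl) into a molar concentration (nM)."""
    return conc_ng_per_ul * 1e6 / (BP_MOLAR_MASS * mean_length_bp)


def equimolar_volumes(libraries, fmol_each):
    """Return the volume (µl) of each library that contains `fmol_each` femtomoles.

    `libraries` maps a library name to (concentration in ng/µl, mean fragment length in bp).
    """
    volumes = {}
    for name, (conc, length) in libraries.items():
        # 1 nM equals 1 fmol/µl, so volume = amount / molarity.
        volumes[name] = fmol_each / molarity_nM(conc, length)
    return volumes


if __name__ == "__main__":
    # Hypothetical measurements for two of the 11 indexed libraries.
    example = {
        "CHK1_Phusion": (6.6, 500),
        "CHK2-85_Kapa": (3.3, 500),
    }
    for name, vol in equimolar_volumes(example, fmol_each=10).items():
        print(f"{name}: {vol:.2f} µl")
```

Because molarity depends on fragment length as well as mass, two libraries with the same ng/µl reading but different size distributions would contribute different numbers of molecules if pooled by mass alone, which is why the conversion is done per library.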
Table 1
Pretreatments, Melting Temperature (Tm) and GC Content (FastQC) for Each Library

Library    Heating (°C)   Shearing   Polymerase   DMSO (%)   Tm (°C)       GC content (%)
CHK1       No             No         Phusion      0          86            41
CHK1       No             No         Kapa         0          86            41
CHK2-75    75             No         Phusion      0          86            41
CHK2-75    75             No         Kapa         0          86            41
CHK3-75    75             5 min      Phusion      0          86            41
CHK3-75    75             5 min      Kapa         0          86            41
CHK2-85    85             No         Phusion      0          88            51
CHK2-85    85             No         Kapa         0          89.5          52
CHK3-85    85             5 min      Phusion      0          88            51
CHK3-85    85             5 min      Kapa         0          89.5          52
CHK-90     90             5 min      Kapa         3          86, 91, 94    59

Fusion Curves

We generated fusion curves in order to check the effect of pretreatments on the GC-content of the constructed libraries. About 5 ng of each indexed PCR product was mixed with ResoLight ROCHE 20× (fluorescent molecule) for a final volume of 10 µl. The libraries were heated from 65 °C to 98 °C with a ramp rate of 0.02 °C per second and 25 acquisitions per degree using the High Resolution Melting program of the ROCHE Light Cycler 480. The melting curves were obtained for all libraries and their negative first derivatives (-dF/dT) were calculated to estimate the corresponding melting temperatures (Tm).

Sequence Analyses

For a fair comparison between libraries, we generated 11 data sets of exactly eight million 101-bp reads each. This was achieved by randomly subsampling the fastq files prior to any quality control or filtering step (see command line in supplementary material online). The quality and GC-content of the data obtained in this study were assessed using FastQC 0.11.4 (Andrews 2010. Available at: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Reads were cleaned with Trimmomatic (Bolger et al. 2014) using parameters: "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50." Cleaned reads were mapped onto the reference genome Gallus_gallus-5.0 using Bowtie2 with default parameters (Langmead and Salzberg 2012). The number of reads mapped to each chromosome and scaffold was computed using SAMtools. We also computed the number of reads mapped to small contigs that are not associated with any chromosome or linkage group (LG) in the Gallus_gallus-5.0 assembly. The size of these contigs varied from 200 to 209,746 bp, with an average of 8,964 bp. These contigs represent the badly assembled regions of the chicken genome. To analyze the relationship between depth of coverage and GC-content, we sorted the contigs according to GC-content and divided them into 29 bins of 623 contigs. Contigs with the 5% highest coverage were excluded from the analysis.

Results

We first analyzed fusion curves in order to estimate the melting temperature (Tm), which is known to be positively correlated with GC-content (Marmur and Doty 1962). Tm was not notably different between CHK1, CHK2-75, and CHK3-75 regardless of the enzyme used for amplification. These results suggest that GC-content was nearly the same for these libraries. In contrast, the libraries constructed from DNA heated to 85 °C and 90 °C had a significantly increased Tm compared with CHK1, suggesting a GC enrichment (fig. 2). There was no conspicuous difference in Tm between CHK2-85 and CHK3-85, suggesting that an additional 5-min DNA shearing after heating has no strong effect on GC-content.

FIG. 2. Melting curves of a standard (CHK1, blue) and three heated (CHK2-85, CHK3-85, CHK-90) Illumina libraries.

GC-content was estimated for each library using FastQC (table 1). In agreement with the analysis of melting curves, GC-content was significantly increased when DNA was heated to a temperature of 85 °C or higher (table 1): the average GC-content of reads was increased from 41% (unheated) to 52% (85 °C) and up to 59% (90 °C). In contrast, GC-content was similar between CHK1, CHK2-75, and CHK3-75. The choice of DNA polymerase (Taq Phusion or Kapa HiFi) only had a weak effect on GC-content in treatments CHK2-85 and CHK3-85.

Eight million reads from each of the 11 libraries were mapped to the chicken genome Gallus_gallus-5.0. Average expected genome coverage is 0.67× per library. In agreement with the Tm and FastQC results, the number of reads that mapped onto the reference genome was similar between libraries CHK1, CHK2-75, and CHK3-75, on one hand, and between CHK2-85 and CHK3-85, on the other hand. The results for libraries CHK1, CHK2-85, and CHK-90 are shown in table 2. The average GC-content of mapped reads was also considerably higher in the CHK2-85 and, particularly, CHK-90 libraries when compared with that of CHK1, and this was true of all the groups of chromosomes. This indicates that heating improved depth of coverage not only in small, GC-rich chromosomes but also in the GC-richest regions of large, GC-heterogeneous chromosomes. In addition, note that the percentage of mapped reads was higher in heated than in unheated treatments for chromosomes having an average GC-content >42% (table 2). The proportion of reads mapped onto the different chromosomes clearly reflects the increased average GC-content, and more homogeneous coverage, of heated libraries (fig. 3). This result indicates that heating sheared DNA before library preparation makes it possible to sequence GC-rich genomic DNA fragments that are otherwise essentially out of reach when using the standard protocols.

Table 2
Mapping of Reads from Standard (CHK1) and Heated (CHK2-85, CHK-90) Libraries to the Reference Chicken Genome

                      CHK1                  CHK2-85                             CHK-90
Chromosomes (% GC)    Mapped    GC of       Mapped    GC of       Coverage     Mapped    GC of       Coverage
                      reads     mapped      reads     mapped      increase     reads     mapped      increase
                      (%)       reads (%)   (%)       reads (%)   (a)          (%)       reads (%)   (a)
1–5 (40.3%)           58.5      39          43.4      51          <1           41.1      60          <1
6–10 (42.3%)          13.4      41          14        52          1            13.1      66          1
11–15 (42.8%)         7.6       43          10.4      53          1.4          10.9      67          1.4
16–20 (47.7%)         3.4       46          6.9       54          2            7.3       68          2.1
21–25 (50%)           2.2       48          5.2       55          2.4          5.2       68          2.4
26–31 (53%)           1.4       51          3.9       56          2.8          5.8       69          4.1
32–33 (54.9%)         0.06      53          0.4       58          6.6          0.4       70          6.6
W-Z-LGE64 (41.3%)     4.6       40          4.7       52          1            4.27      60          <1

(a) Coverage increase was calculated by dividing the percentage of mapped reads of CHK2-85 (respectively, CHK-90) by that of CHK1.

Calculating the average depth of coverage per group of chromosomes, we found that heated libraries yielded a higher coverage than unheated ones for chromosomes with average GC-content >43%, with up to a 6-fold increase in the GC-richest ones (table 2). Finally, we analyzed the coverage of small chicken contigs. These contigs represent the badly assembled regions of the chicken genome that are not assigned to any specific chromosome; some of them have a very high GC-content. Reads from CHK1 yielded a negative correlation between contig coverage and GC-content: depth of coverage dropped by a factor of 2.5 as GC increased from 33% to 65% (fig. 4). In contrast, with library CHK2-85, contig coverage increased with GC-content and reached a plateau at about 55% GC (fig. 4).

Discussion

Illumina library construction protocols are generally recognized to be biased toward fragments of intermediate
GC-content, the GC-richest fraction of the target DNA being underrepresented (van Dijk et al. 2014). Here, we introduce a simple, cheap protocol that leads to a substantial decrease of this bias. Heating DNA to temperatures of 85 °C or higher prior to library preparation increased coverage in the GC-richest fraction of the chicken genome by a factor of up to 6. We speculate that this happens because 1) AT-rich regions are underrepresented as double-stranded DNA in heated solutions due to their lower melting temperature, and 2) adapter ligation and further steps of library construction specifically target double-stranded DNA.

FIG. 3. (A) Proportion of reads mapped to the various chromosomes of the chicken genome. Colors represent the different libraries (CHK1, CHK2-85, CHK90). Chromosomes are sorted according to average GC-content. (B) Average GC-content of chromosomes. [Panel A y-axis: proportion of reads (# reads mapped / sum of reads); panel B y-axis: GC content; x-axis, in order: chromosomes 2, 3, 4, 1, Z, 5, 7, 6, 8, 11, 9, 10, 12, 13, 15, 14, W, 20, LGE64, 18, 19, 22, 21, 17, 24, 23, 27, 26, 28, 31, 16, 25, 33, 32, 30.]

Our GC-enrichment protocol will complement existing approaches for optimal sequencing of GC-heterogeneous genomes. We suggest that a promising strategy for bird genome sequencing, for example, would involve combining high-coverage, standard Illumina libraries, high-coverage, GC-enriched Illumina libraries, and medium-coverage SMRT reads. Illumina reads would here be used to correct for sequencing errors in SMRT reads (Salmela and Rivals 2014), and the GC-enriched library would ensure accurate correction across all regions of the genome. We expect this approach to substantially improve the efficiency of de novo genome sequencing in birds, but also in mammals, nonavian reptiles, hymenopterans, monocots, and presumably a number of additional taxa with GC-heterogeneous genomes. Our approach should also facilitate the optimization of PCR conditions (Baskaran et al. 1996; Aird et al. 2011; Oyola et al. 2012) by decreasing the heterogeneity of template GC-content.

Gene density is positively correlated with GC-content in birds (Hillier et al. 2004; Axelsson et al. 2005). The unassembled/unannotated GC-rich regions, even if they represent a modest fraction of the genome, contain many genes of interest that so far have been absent from functional and comparative genomic analyses in birds (Botero-Castro et al. 2017) and potentially in other taxa of similarly heterogeneous base composition. Accessing this information requires increasing the coverage in GC-rich regions, which with standard protocols would imply a proportional increase in total sequencing cost. Our approach provides a simple way to alleviate this problem at low cost.

FIG. 4. Relationship between GC-content and coverage recorded on the small contigs of the chicken genome. Contigs are divided into 29 groups (represented by 29 dots with an equal number of contigs) according to their GC-content.

Besides de novo sequencing, our protocol should also be quite helpful in resequencing projects. SNP and, particularly, SNV detection in birds is currently limited by the low depth of coverage typically achieved in GC-rich regions (International Chicken Polymorphism Map Consortium 2004; Rubin et al. 2010; Ellegren et al. 2012; Poelstra et al. 2014). Metagenomics is another potential field of application of this approach. Microbes, particularly bacteria, are characterized by a wide distribution of genome GC-content across species; some species reach a genome average >75% GC (Galtier and Lobry 1997; Lassalle et al. 2015). Environmental samples, which contain a mixture of numerous bacterial species, are therefore typically heterogeneous with respect to GC-content, so that libraries prepared with standard protocols provide a biased sample of the existing microbial communities (Choudhari and Grigoriev 2017). Correcting for this bias implies developing specific enrichment protocols targeting both the GC-rich fraction, as in this study, and the AT-rich fraction of the sampled DNA.

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online.

Acknowledgments

The authors thank Philippe Clair for helpful discussion and the Montpellier GenomiX qPCR core facility of University of Montpellier, France. The analyses benefited from the Montpellier Bioinformatics Biodiversity platform services. This work was supported by Agence Nationale de la Recherche grant ANR-14-CE02-0002-01 "BirdIslandGenomic" to B.N.

Literature Cited

Aird D, et al. 2011. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 12(2):R18.
Axelsson E, Webster MT, Smith NG, Burt DW, Ellegren H. 2005. Comparison of the chicken and turkey genomes reveals a higher rate of nucleotide divergence on microchromosomes than macrochromosomes. Genome Res. 15(1):120–125.
Baskaran N, et al. 1996. Uniform amplification of a mixture of deoxyribonucleic acids with varying GC content. Genome Res. 6(7):633–638.
Benita Y, Oosting RS, Lok MC, Wise MJ, Humphery-Smith I. 2003. Regionalized GC content of template DNA as a predictor of PCR success. Nucleic Acids Res. 31(16):e99.
Bickhart DM, et al. 2017. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet. 49:643.
Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120.
Botero-Castro F, Figuet E, Tilak M-K, Nabholz B, Galtier N. 2017. Avian genomes revisited: hidden genes uncovered and the rates versus traits paradox in birds. Mol Biol Evol. 34:3123–3131.
Choudhari S, Grigoriev A. 2017. Phylogenetic heatmaps highlight composition biases in sequenced reads. Microorganisms 5:4.
Clement Y, Fustier M-A, Nabholz B, Glemin S. 2014. The bimodal distribution of genic GC content is ancestral to monocot species. Genome Biol Evol. 7(1):336–348.
Cohen N, Dagan T, Stone L, Graur D. 2005. GC composition of the human genome: in search of isochores. Mol Biol Evol. 22(5):1260–1272.
Davey JW, et al. 2016. Major improvements to the Heliconius melpomene genome assembly used to confirm 10 chromosome fusion events in 6 million years of butterfly evolution. G3 (Bethesda) 6:695–708.
Dohm JC, Lottaz C, Borodina T, Himmelbauer H. 2008. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36(16):e105.
Duret L, Eyre-Walker A, Galtier N. 2006. A new perspective on isochore evolution. Gene 385:71–74.
Ellegren H, et al. 2012. The genomic landscape of species divergence in Ficedula flycatchers. Nature 491(7426):756–760.
Galtier N, Lobry JR. 1997. Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J Mol Evol. 44(6):632–636.
Glemin S, Clement Y, David J, Ressayre A. 2014. GC content evolution in coding regions of angiosperm genomes: a unifying hypothesis. Trends Genet. 30(7):263–270.
Gordon D, et al. 2016. Long-read sequence assembly of the gorilla genome. Science 352(6281):aae0344.
Hillier LW, et al. 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432:695–716.
Hron T, Pajer P, Paces J, Bartůněk P, Elleder D. 2015. Hidden genes in birds. Genome Biol. 16:164.
International Chicken Polymorphism Map Consortium. 2004. A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature 432:717.
Korlach J, et al. 2017. De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads. GigaScience 6(10):1–16.
Kozarewa I, et al. 2009. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods 6:291–295.
Lander ES, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4):357–359.
Lassalle F, Perian S, Bataillon T, Nesme X. 2015. GC-content evolution in bacterial genomes: the biased gene conversion hypothesis expands. PLoS Genet. 11:e1004941.
Marmur J, Doty P. 1962. Determination of the base composition of deoxyribonucleic acid from its thermal denaturation temperature. J Mol Biol. 5:109–118.
Meyer M, Kircher M. 2010. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc. 2010(6):pdb.prot5448.
Oyola SO, et al. 2012. Optimizing Illumina next-generation sequencing library preparation for extremely AT-biased genomes. BMC Genomics 13(1):1.
Poelstra JW, et al. 2014. The genomic landscape underlying phenotypic integrity in the face of gene flow in crows. Science 344(6190):1410–1414.
Quail MA, et al. 2011. Optimal enzymes for amplifying sequencing libraries. Nat Methods 9(1):10–11.
Reuter JA, Spacek DV, Snyder MP. 2015. High-throughput sequencing technologies. Mol Cell 58(4):586–597.
Rubin C-J, et al. 2010. Whole-genome resequencing reveals loci under selection during chicken domestication. Nature 464(7288):587–591.
Salmela L, Rivals E. 2014. LoRDEC: accurate and efficient long read error correction. Bioinformatics 30:3506–3514.
Serres-Giardi L, Belkhir K, David J, Glemin S. 2012. Patterns and evolution of nucleotide landscapes in seed plants. Plant Cell 24(4):1379–1397.
Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. 2014. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 15(2):121–132.
Smith CD, et al. 2011. Draft genome of the globally widespread and invasive Argentine ant (Linepithema humile). Proc Natl Acad Sci U S A. 108(14):5667–5678.
The Honeybee Genome Sequencing Consortium, others. 2006. Insights into social insects from the genome of the honeybee Apis mellifera. Nature 443:931.
Tilak M-K, et al. 2015. A cost-effective straightforward protocol for shotgun Illumina libraries designed to assemble complete mitogenomes from non-model species. Conserv Genet Resour. 7(1):37–40.
van Dijk EL, Jaszczyszyn Y, Thermes C. 2014. Library preparation methods for next-generation sequencing: tone down the bias. Exp Cell Res. 322(1):12–20.
Warren WC, et al. 2017. A new chicken genome assembly provides insight into avian genome structure. G3 (Bethesda) 7:109–117.
Weissensteiner MH, et al. 2017. Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res. 27(5):697–708.
Yu J, et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296:79–92.

Associate editor: Judith Mank

Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
ISSN
1759-6653
eISSN
1759-6653
D.O.I.
10.1093/gbe/evy022
Publisher site
See Article on Publisher Site

Abstract

Standard Illumina libraries are biased toward sequences of intermediate GC-content. This results in an underrepresentation of GC-rich regions in sequencing projects of genomes with heterogeneous base composition, such as mammals and birds. We developed a simple, cost-effective protocol to enrich sheared genomic DNA in its GC-rich fraction by subtracting AT-rich DNA. This was achieved by heating DNA up to 90 °C before applying Illumina library preparation. We tested the new approach on chicken DNA and found that heated DNA increased average coverage in the GC-richest chromosomes by a factor up to six. Using a Taq polymerase supposedly appropriate for PCR amplification of GC-rich sequences had a much weaker effect. Our protocol should greatly facilitate sequencing and resequencing of the GC-richest regions of heterogeneous genomes, in combination with standard short-read and long-read technologies.

Key words: GC content, GC enrichment, high-throughput sequencing, bird.

Introduction

High-throughput sequencing technologies have decreased the cost of sequencing by several orders of magnitude over the last few decades (Reuter et al. 2015). Short-read technologies have increased the depth of coverage to values typically >60× for whole-genome sequencing and 15× for resequencing data (Sims et al. 2014). Unfortunately, depth of coverage is often far from evenly distributed across the sequenced genome. Biases in PCR amplification create uneven genomic representation in classical Illumina libraries (Dohm et al. 2008; Kozarewa et al. 2009; Aird et al. 2011), PCR being sensitive to extreme GC-content variation (Baskaran et al. 1996; Benita et al. 2003; Oyola et al. 2012). In consequence, the GC-rich regions of large, heterogeneous genomes are typically undercovered, therefore inefficiently assembled, when libraries are prepared following standard protocols (Hillier et al. 2004).

A marked heterogeneity in GC-content has been identified in various genomes of relatively large size. In angiosperms, monocots and especially grasses (Poaceae) show a bimodal distribution of GC-content in protein-coding genes, with a class of very GC-rich genes (Yu et al. 2002; Serres-Giardi et al. 2012; Clement et al. 2014; Glemin et al. 2014). Most mammalian genomes, including the human genome, have a local GC-content that varies from 30% to >55% at the kilo-base scale (Lander et al. 2001; Cohen et al. 2005; Duret et al. 2006), and a similar pattern has been reported in honey bee (Apis mellifera) and several species of ants (The Honeybee Genome Sequencing Consortium et al. 2006; Smith et al. 2011).

The genomes of birds are arguably among the most heterogeneous with respect to GC-content, both within and among chromosomes. Birds show a particularly striking negative correlation between GC-content and chromosome size (Hillier et al. 2004): the bird karyotype includes a number of very small-sized chromosomes that are particularly GC-rich, underrepresented in short-read sequence data, and difficult to assemble. The original draft chicken genome assembly, for instance, only included 29 out of the 38 autosomes with the smallest chromosomes being missing (Hillier et al. 2004). Importantly, gene density is strongly correlated with GC-content in birds (fig. 1). The unassembled GC-rich regions actually contain a substantial portion, probably 15%, of the bird gene complement, which is currently missing from genome annotation databases, as we recently demonstrated from transcriptome analyses (Botero-Castro et al. 2017, see also Hron et al. 2015).

© The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

Genome Biol. Evol. 10(2):616–622. doi:10.1093/gbe/evy022. Advance Access publication January 27, 2018.

There is, therefore, a clear need for DNA sequencing methods alleviating the GC bias. Single-molecule real-time (SMRT) sequencing technologies that do not rely on PCR have recently contributed significantly to improving genome assembly in large genomes (Davey et al. 2016; Gordon et al. 2016; Bickhart et al. 2017; Korlach et al. 2017; Warren et al. 2017; Weissensteiner et al. 2017). In birds, the chicken, zebra finch (Taeniopygia guttata), Anna's hummingbird (Calypte anna), and hooded crow (Corvus cornix) assemblies have been improved using PacBio technologies with a coverage from 50× to 96× (Korlach et al. 2017; Warren et al. 2017; Weissensteiner et al. 2017). SMRT sequencing, however, remains relatively costly and error prone, and requires high quantity and quality of DNA, so that in many projects sequencing depth is mainly contributed by PCR-dependent technologies.

Several attempts have been made to optimize PCR conditions, such as temperature ramp rate, denaturation time, chemical additives, and DNA polymerase, in order to reduce the GC bias during library preparation (Aird et al. 2011; Oyola et al. 2012). Aird et al. (2011), for instance, improved the homogeneity of coverage depth when applying optimized protocols to a mixture of bacterial DNA from three distinct species, but they concluded that no single protocol is appropriate in every situation. GC-rich and GC-poor DNA have distinct optimal PCR conditions, so that amplifying heterogeneous DNA is intrinsically a difficult problem.

Elaborating on this idea, we suggest isolating GC-rich DNA before sequencing it. We investigate a simple method aiming at enriching genomic DNA in its GC-rich fraction prior to library preparation. We show that a simple heat-denaturation and sizing of fragmented DNA before the blunt-end repair step results in a substantial increase in average GC-content of sequence reads. Applying this protocol to chicken DNA, we achieved a considerable increase in coverage depth of the GC-richest regions of the genome. The new approach is cheap, does not require high quantity or quality of DNA, and is complementary to the shotgun, mate pair and/or SMRT approaches.

FIG. 1. Gene content and GC-content computed in 100-kb non-overlapping windows across the chicken genomes (Gallus_gallus-5.0). Linear and quadratic regression lines are shown in blue and red, respectively. (x axis: GC; y axis: gene density, genes per 100 kb.)

Materials and Methods

DNA Extraction and Treatment Post-Illumina Library Preparation

Total genomic DNA was extracted from chicken tissue using the DNeasy Blood and Tissue kit (QIAGEN) following the manufacturer's instructions. About 3 µg of total genomic DNA were sheared for 20 min using an ultrasonic cleaning unit (Elmasonic One). Sheared DNA was separated into six tubes of 50 µl containing 500 ng of DNA each. We applied different temperatures to the sheared DNA in order to denature it. Two samples (CHK2-75 and CHK2-85) were heated for 5 min to 75 °C and 85 °C, respectively. Three samples (CHK3-75, CHK3-85, and CHK-90) were heated to 75 °C, 85 °C, and 90 °C, respectively, and submitted to a second step of shearing in an ultrasonic cleaning unit (Elmasonic One) for 5 min. One control sample (CHK1) was not heated. All samples were sized using AMPure (Agencourt) immediately after treatments (see table 1).

Library Preparation and Sequencing

Illumina library preparation followed the classical protocol involving blunt-end repair, adapter ligation, and adapter fill-in steps as developed by Meyer and Kircher (2010), with slight modifications as explained by Tilak et al. (2015). The full protocol has been deposited in protocols.io (dx.doi.org/10.17504/protocols.io.jxicpke). Libraries were quantified using a Nanodrop ND-8000 spectrophotometer (Nanodrop technologies). About 5 ng of each library (except CHK-90) were PCR indexed using Taq Phusion (Phusion High-Fidelity DNA Polymerase, Thermo Scientific) and KAPA HiFi (2× KAPA HiFi HotStart ReadyMix, KAPABIOSYSTEMS) polymerases because these amplification enzymes could have different GC biases (Quail et al. 2011). CHK-90 was only amplified with KAPA HiFi and 3% DMSO, so that 11 indexed libraries were generated: one for CHK-90 and two for each of the other five conditions. Indexed libraries were purified using AMPure (Agencourt) at ratio 1.6, quantified with the Nanodrop ND-8000, and pooled in equimolar ratio. The pool of indexed libraries was single-read sequenced on one lane of Illumina HiSeq 2500 at GATC-Biotech (Konstanz, Germany).
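The equimolar pooling step mentioned above can be illustrated with a short calculation. The following is only a sketch under stated assumptions, not the authors' procedure: the library names, concentrations, and mean fragment lengths are invented for illustration, and the conversion uses the common approximation of 660 g/mol per base pair of double-stranded DNA.

```python
# Sketch: volumes needed to pool indexed libraries in equimolar ratio.
# Assumptions (not from the paper): the example concentrations and mean
# fragment lengths are invented, and 660 g/mol per base pair is the usual
# approximation for double-stranded DNA.

AVG_BP_MASS = 660.0  # g/mol per base pair (approximation)


def molarity_nM(conc_ng_per_ul: float, mean_length_bp: float) -> float:
    """Convert a dsDNA mass concentration (ng/ul) to molarity (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_length_bp)


def pooling_volumes(libraries: dict, fmol_each: float) -> dict:
    """Return the volume (ul) of each library that contains fmol_each fmol.

    `libraries` maps a name to (concentration in ng/ul, mean length in bp).
    """
    volumes = {}
    for name, (conc, length) in libraries.items():
        # 1 nM equals 1 fmol/ul, so volume = amount / molarity.
        volumes[name] = fmol_each / molarity_nM(conc, length)
    return volumes


if __name__ == "__main__":
    # Hypothetical measurements for three of the 11 indexed libraries.
    example = {
        "CHK1_Phusion": (20.0, 400.0),
        "CHK2-85_Kapa": (12.5, 380.0),
        "CHK-90_Kapa": (8.0, 350.0),
    }
    for name, vol in pooling_volumes(example, fmol_each=50.0).items():
        print(f"{name}: {vol:.2f} ul")
```

Libraries with lower molarity contribute a larger volume, so that each library adds the same number of molecules to the pool.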
Table 1. Pretreatments, Melting Temperature (Tm) and GC Content (FastQC) for Each Library

Library | Heating (°C) | Shearing | Polymerase | DMSO (%) | Tm (°C) | GC Content (%)
CHK1 | No | No | Phusion | 0 | 86 | 41
CHK1 | No | No | Kapa | 0 | 86 | 41
CHK2-75 | 75 | No | Phusion | 0 | 86 | 41
CHK2-75 | 75 | No | Kapa | 0 | 86 | 41
CHK3-75 | 75 | 5 min | Phusion | 0 | 86 | 41
CHK3-75 | 75 | 5 min | Kapa | 0 | 86 | 41
CHK2-85 | 85 | No | Phusion | 0 | 88 | 51
CHK2-85 | 85 | No | Kapa | 0 | 89.5 | 52
CHK3-85 | 85 | 5 min | Phusion | 0 | 88 | 51
CHK3-85 | 85 | 5 min | Kapa | 0 | 89.5 | 52
CHK-90 | 90 | 5 min | Kapa | 3 | 86, 91, 94 | 59

Fusion Curves

We generated fusion (melting) curves in order to check the effect of pretreatments on the GC-content of the constructed libraries. About 5 ng of each indexed PCR was mixed with ResoLight ROCHE 20× (fluorescent molecule) for a final volume of 10 µl. The libraries were heated from 65 °C to 98 °C with a ramp rate of 0.02 °C per second and 25 acquisitions per degree using the High Resolution Melting program of the ROCHE Light Cycler 480. The melting curves were obtained for all libraries and their negative first derivatives (100 dF/dT) were calculated to estimate the corresponding melting temperatures (Tm).

Sequence Analyses

For a fair comparison between libraries, we generated 11 data sets of exactly eight million 101-bp reads each. This was achieved by randomly subsampling in fastq files prior to any quality control or filtering step (see command line in supplementary material online). The quality and GC-content of the data obtained in this study were assessed using FastQC 0.11.4 (Andrews 2010. Available at: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Reads were cleaned with Trimmomatic (Bolger et al. 2014) using parameters: "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50." Cleaned reads were mapped onto the reference genome Gallus_gallus-5.0 using Bowtie2 with default parameters (Langmead and Salzberg 2012). The number of reads mapped to each chromosome and scaffold was computed using SAMtools. We also computed the number of reads mapped to small contigs that are not associated to any chromosome or linkage group (LG) in the Gallus_gallus-5.0 assembly. The size of these contigs varied from 200 to 209,746 bp, with an average of 8,964 bp. These contigs represent the badly assembled regions of the chicken genome. To analyze the relationship between depth of coverage and GC-content, we sorted the contigs according to GC-content and divided them in 29 bins of 623 contigs. Contigs with the 5% highest coverage were excluded from the analysis.

Results

We first analyzed fusion curves in order to estimate the melting temperature (Tm), which is known to be positively correlated to GC-content (Marmur and Doty 1962). Tm was not notably different between CHK1, CHK2-75, and CHK3-75 regardless of the enzyme used for amplification. These results suggest that GC-content was nearly the same for these libraries. In contrast, the libraries constructed from DNA heated to 85 °C and 90 °C had a significantly increased Tm, compared with CHK1, suggesting a GC enrichment (fig. 2). There was no conspicuous difference in Tm between CHK2-85 and CHK3-85, suggesting that an additional 5-min DNA shearing after heating has no strong effect on GC-content.

FIG. 2. Melting curves of a standard (CHK1, blue) and three heated (CHK2-85, CHK3-85, CHK-90) Illumina libraries.

GC-content was estimated for each library using FastQC (table 1). In agreement with the analysis of melting curves, GC-content was significantly increased when DNA was heated to a temperature of 85 °C or higher (table 1): the average GC-content of reads was increased from 41% (unheated) to 52% (85 °C) and up to 59% (90 °C). In contrast, GC-content was similar between CHK1, CHK2-75, and CHK3-75. The choice of DNA polymerase (Taq Phusion or Kapa Hifi) only had a weak effect on GC-content in treatments CHK2-85 and CHK3-85.

Eight million reads from each of the 11 libraries were mapped to the chicken genome Gallus_gallus-5.0. Average expected genome coverage is 0.67× per library. In agreement with the Tm and FastQC results, the number of reads that mapped onto the reference genome was similar between libraries CHK1, CHK2-75, and CHK3-75, on one hand, and between CHK2-85 and CHK3-85, on the other hand. The results for libraries CHK1, CHK2-85, and CHK-90 are shown in table 2. The average GC-content of mapped reads was also considerably higher in the CHK2-85 and, particularly, CHK-90 libraries when compared with that of CHK1, and this was true of all the groups of chromosomes. This indicates that heating has improved depth of coverage not only in small, GC-rich chromosomes but also in the GC-richest regions of large, GC-heterogeneous chromosomes. In addition, note that the percentage of mapped reads was higher in heated than in unheated treatments for chromosomes having an average GC-content >42% (table 2). The proportion of reads mapped onto the different chromosomes clearly reflects the increased average GC-content, and more homogeneous coverage, of heated libraries (fig. 3). This result indicates that heating sheared DNA before library preparation makes it possible to sequence GC-rich genomic DNA fragments that are otherwise essentially out of reach when using the standard protocols.

Table 2. Mapping of Reads from Standard (CHK1) and Heated (CHK2-85, CHK-90) Libraries to the Reference Chicken Genome

Chromosomes (% GC) | CHK1: % mapped reads | CHK1: % GC mapped reads | CHK2-85: % mapped reads | CHK2-85: % GC mapped reads | CHK2-85: coverage increase (a) | CHK-90: % mapped reads | CHK-90: % GC mapped reads | CHK-90: coverage increase (a)
1–5 (40.3%) | 58.5 | 39 | 43.4 | 51 | <1 | 41.1 | 60 | <1
6–10 (42.3%) | 13.4 | 41 | 14 | 52 | 1 | 13.1 | 66 | 1
11–15 (42.8%) | 7.6 | 43 | 10.4 | 53 | 1.4 | 10.9 | 67 | 1.4
16–20 (47.7%) | 3.4 | 46 | 6.9 | 54 | 2 | 7.3 | 68 | 2.1
21–25 (50%) | 2.2 | 48 | 5.2 | 55 | 2.4 | 5.2 | 68 | 2.4
26–31 (53%) | 1.4 | 51 | 3.9 | 56 | 2.8 | 5.8 | 69 | 4.1
32–33 (54.9%) | 0.06 | 53 | 0.4 | 58 | 6.6 | 0.4 | 70 | 6.6
W-Z-LGE64 (41.3%) | 4.6 | 40 | 4.7 | 52 | 1 | 4.27 | 60 | <1

(a) Coverage increase was calculated by dividing the percentage of mapped reads of CHK2-85 (respectively, CHK-90) by that of CHK1.

Calculating the average depth of coverage per group of chromosomes, we found that heated libraries yielded a higher coverage than the unheated one for chromosomes with average GC-content >43%, with up to a 6-fold increase in the GC-richest ones (table 2). Finally, we analyzed the coverage of small chicken contigs. These contigs represent the badly assembled regions of the chicken genome that are not assigned to any specific chromosome; some of them have a very high GC-content. Reads from CHK1 yielded a negative correlation between contig coverage and GC-content: depth of coverage dropped by a factor of 2.5 as GC increased from 33% to 65% (fig. 4). In contrast, with CHK2-85, contig coverage increased with GC-content and reached a plateau at about 55% GC (fig. 4).

FIG. 3. (A) Proportion of reads mapped to the various chromosomes of the chicken genome (y axis: number of reads mapped / sum of reads). Colors represent the different libraries (CHK1, CHK2-85, CHK90). Chromosomes are sorted according to average GC-content. (B) Average GC-content of chromosomes. Chromosome order on the x axis: 2, 3, 4, 1, Z, 5, 7, 6, 8, 11, 9, 10, 12, 13, 15, 14, W, 20, LGE64, 18, 19, 22, 21, 17, 24, 23, 27, 26, 28, 31, 16, 25, 33, 32, 30.

FIG. 4. Relationship between GC-content and coverage recorded on the small contigs of the chicken genome. Contigs are divided into 29 groups (represented by 29 dots with an equal number of contigs) according to their GC-content.

Discussion

Illumina library construction protocols are generally recognized to be biased toward fragments of intermediate GC-content, the GC-richest fraction of the target DNA being underrepresented (van Dijk et al. 2014). Here, we introduce a simple, cheap protocol that leads to a substantial decrease of this bias. Heating DNA to temperatures ≥85 °C prior to library preparation increased coverage in the GC-richest fraction of the chicken genome by a factor of up to 6. We speculate that this happens because 1) AT-rich regions are underrepresented as double-stranded DNA in heated solutions due to their lower melting temperature, and 2) adapter ligation and further steps of library construction specifically target double-stranded DNA.

Our GC-enrichment protocol will complement existing approaches for optimal sequencing of GC-heterogeneous genomes. A promising strategy for bird genome sequencing, for example, would involve combining high-coverage, standard Illumina libraries, high-coverage, GC-enriched Illumina libraries, and medium-coverage SMRT reads. Illumina reads would here be used to correct for sequencing errors in SMRT reads (Salmela and Rivals 2014), and the GC-enriched library would ensure accurate correction across all regions of the genome. We expect this approach to substantially improve the efficiency of de novo genome sequencing in birds, but also in mammals, nonavian reptiles, hymenopterans, monocots, and presumably a number of additional taxa with GC-heterogeneous genomes. Our approach should also facilitate the optimization of PCR conditions (Baskaran et al. 1996; Aird et al. 2011; Oyola et al. 2012) by decreasing the heterogeneity of template GC-content.

Gene density is positively correlated to GC-content in birds (Hillier et al. 2004; Axelsson et al. 2005). The unassembled/unannotated GC-rich regions, even if they represent a modest fraction of the genome, contain many genes of interest that so far have been absent from functional and comparative genomic analyses in birds (Botero-Castro et al. 2017) and potentially in other taxa of similarly heterogeneous base composition. Accessing this information requires increasing the coverage in GC-rich regions, which with standard protocols would imply a proportional increment of total sequencing cost. Our approach provides a simple way to alleviate this problem at low cost.

Besides de novo sequencing, our protocol should also be quite helpful in resequencing projects. SNP and, particularly, SNV detection in birds is currently limited by the low depth of coverage typically achieved in GC-rich regions (International Chicken Polymorphism Map Consortium 2004; Rubin et al. 2010; Ellegren et al. 2012; Poelstra et al. 2014).

Metagenomics is another potential field of application of this approach. Microbes, particularly bacteria, are characterized by a wide distribution of genome GC-content across species: some species reach a genome average >75% GC (Galtier and Lobry 1997; Lassalle et al. 2015). Environmental samples, which contain a mixture of numerous bacterial species, are therefore typically heterogeneous with respect to GC-content, so that libraries prepared with standard protocols provide a biased sample of the existing microbial communities (Choudhari and Grigoriev 2017). Correcting for this bias implies developing specific enrichment protocols targeting both the GC-rich, as in this study, and the AT-rich fraction of the sampled DNA.

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online.

Acknowledgments

The authors thank Philippe Clair for helpful discussion and the Montpellier GenomiX qPCR core facility of University of Montpellier, France. The analyses benefited from the Montpellier Bioinformatics Biodiversity platform services. This work was supported by Agence Nationale de la Recherche grant ANR-14-CE02-0002-01 "BirdIslandGenomic" to B.N.

Literature Cited

Aird D, et al. 2011. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 12(2):R18.
Axelsson E, Webster MT, Smith NG, Burt DW, Ellegren H. 2005. Comparison of the chicken and turkey genomes reveals a higher rate of nucleotide divergence on microchromosomes than macrochromosomes. Genome Res. 15(1):120–125.
Baskaran N, et al. 1996. Uniform amplification of a mixture of deoxyribonucleic acids with varying GC content. Genome Res. 6(7):633–638.
Benita Y, Oosting RS, Lok MC, Wise MJ, Humphery-Smith I. 2003. Regionalized GC content of template DNA as a predictor of PCR success. Nucleic Acids Res. 31(16):e99.
Bickhart DM, et al. 2017. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet. 49:643.
Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120.
Botero-Castro F, Figuet E, Tilak M-K, Nabholz B, Galtier N. 2017. Avian genomes revisited: hidden genes uncovered and the rates versus traits paradox in birds. Mol Biol Evol. 34:3123–3131.
Choudhari S, Grigoriev A. 2017. Phylogenetic heatmaps highlight composition biases in sequenced reads. Microorganisms 5:4.
Clement Y, Fustier M-A, Nabholz B, Glemin S. 2014. The bimodal distribution of genic GC content is ancestral to monocot species. Genome Biol Evol. 7(1):336–348.
Cohen N, Dagan T, Stone L, Graur D. 2005. GC composition of the human genome: in search of isochores. Mol Biol Evol. 22(5):1260–1272.
Davey JW, et al. 2016. Major improvements to the Heliconius melpomene genome assembly used to confirm 10 chromosome fusion events in 6 million years of butterfly evolution. G3 (Bethesda) 6:695–708.
Dohm JC, Lottaz C, Borodina T, Himmelbauer H. 2008. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36(16):e105.
Duret L, Eyre-Walker A, Galtier N. 2006. A new perspective on isochore evolution. Gene 385:71–74.
Ellegren H, et al. 2012. The genomic landscape of species divergence in Ficedula flycatchers. Nature 491(7426):756–760.
Galtier N, Lobry JR. 1997. Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J Mol Evol. 44(6):632–636.
Glemin S, Clement Y, David J, Ressayre A. 2014. GC content evolution in coding regions of angiosperm genomes: a unifying hypothesis. Trends Genet. 30(7):263–270.
Gordon D, et al. 2016. Long-read sequence assembly of the gorilla genome. Science 352(6281):aae0344.
Hillier LW, et al. 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432:695–716.
Hron T, Pajer P, Paces J, Bartůnek P, Elleder D. 2015. Hidden genes in birds. Genome Biol. 16:164.
International Chicken Polymorphism Map Consortium. 2004. A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature 432:717.
Korlach J, et al. 2017. De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads. GigaScience 6(10):1–16.
Kozarewa I, et al. 2009. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods 6:291–295.
Lander ES, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4):357–359.
Lassalle F, Perian S, Bataillon T, Nesme X. 2015. GC-content evolution in bacterial genomes: the biased gene conversion hypothesis expands. PLoS Genet. 11:e1004941.
Marmur J, Doty P. 1962. Determination of the base composition of deoxyribonucleic acid from its thermal denaturation temperature. J Mol Biol. 5:109–118.
Meyer M, Kircher M. 2010. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc. 2010(6):pdb.prot5448.
Oyola SO, et al. 2012. Optimizing Illumina next-generation sequencing library preparation for extremely AT-biased genomes. BMC Genomics 13(1):1.
Poelstra JW, et al. 2014. The genomic landscape underlying phenotypic integrity in the face of gene flow in crows. Science 344(6190):1410–1414.
Quail MA, et al. 2011. Optimal enzymes for amplifying sequencing libraries. Nat Methods 9(1):10–11.
Reuter JA, Spacek DV, Snyder MP. 2015. High-throughput sequencing technologies. Mol Cell 58(4):586–597.
Rubin C-J, et al. 2010. Whole-genome resequencing reveals loci under selection during chicken domestication. Nature 464(7288):587–591.
Salmela L, Rivals E. 2014. LoRDEC: accurate and efficient long read error correction. Bioinformatics 30:3506–3514.
Serres-Giardi L, Belkhir K, David J, Glemin S. 2012. Patterns and evolution of nucleotide landscapes in seed plants. Plant Cell 24(4):1379–1397.
Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. 2014. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 15(2):121–132.
Smith CD, et al. 2011. Draft genome of the globally widespread and invasive Argentine ant (Linepithema humile). Proc Natl Acad Sci U S A. 108(14):5667–5678.
The Honeybee Genome Sequencing Consortium, others. 2006. Insights into social insects from the genome of the honeybee Apis mellifera. Nature 443:931.
Tilak M-K, et al. 2015. A cost-effective straightforward protocol for shotgun Illumina libraries designed to assemble complete mitogenomes from non-model species. Conserv Genet Resour. 7(1):37–40.
van Dijk EL, Jaszczyszyn Y, Thermes C. 2014. Library preparation methods for next-generation sequencing: tone down the bias. Exp Cell Res. 322(1):12–20.
Warren WC, et al. 2017. A new chicken genome assembly provides insight into avian genome structure. G3 (Bethesda) 7:109–117.
Weissensteiner MH, et al. 2017. Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res. 27(5):697–708.
Yu J, et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296:79–92.

Associate editor: Judith Mank
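As a supplementary note to the Sequence Analyses and to table 2, two of the reported quantities can be recomputed with a few lines of arithmetic: the expected genome coverage per library and the coverage increase of heated over unheated libraries. This is a sketch, not the authors' script. The genome size of about 1.2 Gb is an assumption that is not stated in the article, and the percentages are transcribed from table 2 as read from the extracted text.

```python
# Sketch: expected coverage per library and the coverage-increase column
# of table 2. Assumption (not from the article): GENOME_SIZE_BP is a rough
# chicken genome size chosen for illustration.

GENOME_SIZE_BP = 1.2e9  # assumed, approximate


def expected_coverage(n_reads, read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Average depth if reads were spread evenly over the genome."""
    return n_reads * read_length_bp / genome_size_bp


def coverage_increase(pct_heated, pct_control):
    """Ratio of the percentage of mapped reads, heated over control."""
    return pct_heated / pct_control


# Percentage of mapped reads per chromosome group, transcribed from
# table 2 in the order (CHK1, CHK2-85, CHK-90).
TABLE2 = {
    "1-5": (58.5, 43.4, 41.1),
    "6-10": (13.4, 14.0, 13.1),
    "11-15": (7.6, 10.4, 10.9),
    "16-20": (3.4, 6.9, 7.3),
    "21-25": (2.2, 5.2, 5.2),
    "26-31": (1.4, 3.9, 5.8),
    "32-33": (0.06, 0.4, 0.4),
    "W-Z-LGE64": (4.6, 4.7, 4.27),
}

if __name__ == "__main__":
    cov = expected_coverage(8_000_000, 101)
    print(f"expected coverage per library: {cov:.2f}x")
    for group, (chk1, chk2_85, chk90) in TABLE2.items():
        r85 = coverage_increase(chk2_85, chk1)
        r90 = coverage_increase(chk90, chk1)
        print(f"{group}: CHK2-85 {r85:.1f}, CHK-90 {r90:.1f}")
```

Because each library contributes the same number of reads, the ratio of mapped-read percentages is equivalent to a ratio of average depth for a given group of chromosomes, which is how the footnote of table 2 defines the coverage increase.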

Journal

Genome Biology and Evolution, Oxford University Press

Published: Feb 1, 2018
