Quantitative DNA methylation analysis based on four-dye trace data from direct sequencing of PCR amplificates

Jörn Lewin; Armin O. Schmitt; Péter Adorján; Thomas Hildmann; Christian Piepenbrock

doi:10.1093/bioinformatics/bth346

Quantitative DNA methylation analysis based on four-dye trace data from direct sequencing of PCR amplificates

Lewin, Jörn; Schmitt, Armin O.; Adorján, Péter; Hildmann, Thomas; Piepenbrock, Christian 2004-07-09 00:00:00 Vol. 20 no. 17 2004, pages 3005–3012 BIOINFORMATICS doi:10.1093/bioinformatics/bth346 Quantitative DNA methylation analysis based on four-dye trace data from direct sequencing of PCR ampliﬁcates 1,∗ 2 1 Jörn Lewin , Armin O. Schmitt , Péter Adorján , 1 1 Thomas Hildmann and Christian Piepenbrock 1 2 Epigenomics AG, Kleine Präsidentenstrasse 1, 10178 Berlin, Germany and Institut für Nutztierwissenschaften, Humboldt-Universität zu Berlin, Invalidenstrasse 42, 10115 Berlin, Germany Received on March 24, 2004; revised on May 6, 2004; accepted on May 11, 2004 Advance Access publication July 9, 2004 ABSTRACT applicability is proven by identifying CpGs that are differentially Motivation: Methylation of cytosines in DNA plays an import- methylated in real tissue samples. ant role in the regulation of gene expression, and the analysis Contact: [email protected] of methylation patterns is fundamental for the understanding of cell differentiation, aging processes, diseases and cancer INTRODUCTION development. Such analysis has been limited, because tech- DNA methylation is a chemical modiﬁcation of the base nologies for detailed and efﬁcient high-throughput studies have cytosine to 5 -methyl cytosine. In DNA of vertebrates it occurs not been available. We have developed a novel quantitative in the context of cytosines followed by guanine, so-called methylation analysis algorithm and workﬂow based on direct CpGs. CpG methylation in human DNA is a tissue speciﬁc DNA sequencing of PCR products from bisulﬁte-treated DNA layer of information that is involved in the regulation of gene with high-throughput sequencing machines. This technology is expression, genomic imprinting (Reik et al., 2001) and cell a prerequisite for success of the Human Epigenome Project, differentiation (Ehrlich, 2003). Methylation proﬁles undergo the ﬁrst large genome-wide sequencing study for DNA methyl- changes in tumorgenesis (Jones, 2002) and allow differenti- ation in many different tissues. Methylation in tissue samples ation of DNA from healthy versus malignant tissue samples which are compositions of different cells is a quantitative (Adorjan et al., 2002). Differential methylation patterns are information represented by cytosine/thymine proportions after likely to have a large relevance for understanding disease and bisulﬁte conversion of unmethylated cytosines to uracil and diagnostic applications. PCR. Calculation of quantitative methylation information from The main goal of the Human Epigenome Project (Human base proportions represented by different dye signals in four- Epigenome Consortium et al., 2003, http://www.epigenome. dye sequencing trace ﬁles needs a speciﬁc algorithm handling org/) is to characterize the methylation signatures of different imbalanced and overscaled signals, incomplete conversion, tissue types genome wide. This approach requires methods quality problems and basecaller artifacts. that easily detect methylation patterns by automated high- Results: The algorithm we developed has several key prop- throughput technologies. erties: it analyzes trace ﬁles from PCR products of bisulﬁte- Different methods for methylation measurement are treated DNA sequenced directly on ABI machines; it yields described in Dahl and Guldberg (2003), Siegmund and Laird quantitative methylation measurements for individual cytosine (2002). One major group of technologies is based on methyl- positions after alignment with genomic reference sequences, ation sensitive enzymatic restriction of the DNA. The other signal normalization and estimation of effectiveness of bisul- group of technologies is based on bisulﬁte conversion of ﬁte treatment; it works in a fully automated pipeline including unmethylated cytosines (Olek et al., 1996). The Bisulﬁte treat- data quality monitoring; it is efﬁcient and avoids the usual cost ment of DNA leads to a chemical conversion of unmethylated of multiple sequencing runs on subclones to estimate DNA cytosine to uracil. Methylation of cytosines blocks this reac- methylation. The power of our new algorithm is demonstrated tion. In most cases, the PCR is used to amplify regions of with data from two test systems based on mixtures with known interest within the bisulﬁte converted DNA template whereby base compositions and deﬁned methylation. In addition, the positions converted into uracil appear as thymine in the product. Typically, a tissue sample contains a mixture of dif- To whom correspondence should be addressed. ferent cells; therefore, a proper description of methylation Bioinformatics vol. 20 issue 17 © Oxford University Press 2004; all rights reserved. 3005 J.Lewin et al. at a certain CpG requires quantiﬁcation of the proportion of the methylated templates at the investigated CpG. This pro- portion is referred to as the methylation rate of the CpG. After the bisulﬁte conversion and the PCR, the methylation rate at a CpG can be determined by assessing the propor- tion of remaining cytosine relative to the thymine. This can be done, e.g. by hybridization to oligomer probes on DNA chips (Adorjan et al., 2002) or by DNA sequencing (Frommer et al., 1992). Commonly used sequencing methods include the sequencing of a representative number of subclones of the PCR product or direct PCR sequencing by running inde- pendent sequencing reactions for cytosine and thymine using the same dye in different lanes of a sequencing gel (Paul and Clark, 1996). These sequencing methods are expensive and labor intensive. In the Human Epigenome Project, direct PCR sequencing on standard sequencing machines is used to achieve the required throughput in a cost effective way. This technology produces four-dye electropherogram data. The possibility to use such data for quantitative analyses of base compositions within pooled DNA was recently demonstrated for one single nucleotide polymorphism (SNP) (Qiu et al., 2003). The same principle is used here for the measurement of methylation in bisulﬁte-treated DNA product. Quantitative analysis by direct sequencing of the PCR products from bisulﬁte-treated DNA implicates several novel challenges: poor signal quality compared to genomic sequen- Fig. 1. Flow chart of all data processing steps of the methylation cing, overscaled cytosine signals and basecaller artifacts. In estimation algorithm. Detailed description of the single steps is given combination with the overscaled signals incomplete bisulﬁte in the text. Between all data processing steps quality control (QC) conversion, which is a general problem of all bisulﬁte-based is performed. The analysis of a single trace ﬁle is aborted if the ﬁle methylation detection methods, inﬂuences signal proportions itself is corrupted or if the genomic reference sequence is missing or in the trace signiﬁcantly. It was therefore necessary to develop if the length of good quality sequence, as determined by the clipping a speciﬁc algorithm that allows the use of four-dye sequencing procedure, is below a certain threshold (default is 50 bases) or if the trace ﬁles to gain quantitative methylation information. This bisulﬁte conversion rates are below a minimum threshold (default is 65%). newly developed data analysis method allows the use of estab- lished high-throughput sequencing technology for methyl- ation studies. In this paper, we ﬁrst present the algorithms (viii) methylation estimation (Fig. 1). A scheme of the data used for methylation rate estimations based on trace ﬁle data and the inﬂuence of the algorithmic steps (ii), (iii), (iv) and originating from direct sequencing of the PCR products from (vi) is given in Figure 2. Here, we present the algorithms for bisulﬁte-treated DNA. We then assess the two main steps of forward sequencing that aims at the estimation of the propor- our algorithm with real data from two experiments and show tion of cytosine to thymine at the positions of interest. Traces that they improve the accuracy of the methylation estima- that originate from reverse sequencing and show guanine and tion. Finally, we provide a single example based on data from adenine signals at corresponding positions can be analyzed the Human Epigenome Project pilot study to demonstrate the by the same algorithm building the reverse complement of scientiﬁc use of the algorithms with real tissue samples. the trace ﬁles. (i) Entropy-based clipping: We observed that base callers ALGORITHM often generate reads that contain long stretches of called bases The algorithm we present uses four-dye electropherogram with up-scaled background signals after the end of an amp- data preprocessed by the base caller of the sequencing liﬁcate. These artifacts are detected by using the normalized machine manufacturer, e.g. Applied Biosystems ‘.abi’ ﬁles Shannon entropy or the well-described ‘.scf’ ﬁles (Dear and Staden, 1992). The data processing includes the following steps: (i) entropy- S S b b based clipping, (ii) signal detection, (iii) alignment, (iv) trace H =− log S S B B B∈{A,C,G,T } B∈{A,C,G,T } correction, (v) alignment-based clipping, (vi) signal normal- b∈{A,C,G,T } ization, (vii) compensation of incomplete conversion and (1) 3006 DNA methylation quantiﬁcation based on trace data a) trace data b) normalized trace data Waterman, 1981; Barton, 1993) for optimal local alignments A signal norm norm norm C C C allowing for gaps to align the called sequence of the trace ﬁle C signal t C norm norm norm norm G signal T T T T t C T with the a priori known reference sequence. Alignment of t T signal and C in the reference with C or T in the trace are treated as matches. The bisulﬁte-treated DNA contains long stretches of T sig- AA CT G C T C G A AA Y GYT Y G A A t - A GT t C GA AA t G t T C GA nal. In some cases, this is misinterpreted by base callers by inserting too many T s into the called sequence. Accounting for this special situation, we have introduced an additional Fig. 2. Schematic representation of a trace ﬁle electropherogram type of gap cost to guarantee proper mapping of CpGs. Assign- obtained by bisulﬁte PCR sequencing (a) before and (b) after sig- ing costs for gaps between C and G in the reference sequence nal normalization. The upper sequences below the trace curves in (a) represent the sequence called by the standard basecaller and in forces the alignment of CpGs as one functional block to avoid (b) the peak mixture represented using IUPAC code (Y denotes C their mismapping. An example of this is given below: general and/or T ). The sequences at the bottom show the aligned reference costs for all gaps (g) are −19 and higher than costs for mis- sequence whereby t are genomic cytosine positions that are not in matches (−9) (Barton, 1993). For gaps inserted between C and CpG context, and expected to be unmethylated and therefore com- G in the reference sequence special additional gap costs(sg) pletely convertible. Trace curves are shown for all the four bases. For of −20 raise the total costs to −39, a punishment outnumbered every base position in the reference sequence four base intensities only by two gaps (−40) which in most cases leads to CpGs int B ; B ∈{A, C, G, T } are calculated as the area under the trace curve treated as one unit that cannot be split, just misaligned. int int segment that belongs to the base position (only C and T shown in norm a)). Normalized base intensities for cytosine (C ; b ∈{t , T , C}) b trace ATTTTTTTGA ATTTTTTTGA norm and thymine (T ) seen in (b) are used to estimate the bisulﬁte reference ATTTTTC-GA ATTTTT-CGA conversion rate (base intensities at t positions) and the methylation cost(g+sg)=-39 cost(g)=-19 rate at each CpG (base intensities at C positions). (iv) Trace correction: Standard base callers expect one 0 ≤ H ≤ 1 of the four trace curves S , b ∈{A, C, G, T } b homogeneous DNA population to be sequenced, there- in a sliding window of 200 data points. Flanking sequence fore some of them occasionally interpret mixed C and T stretches with an entropy larger than 0.8 are removed. base intensities at a single position of the reference sequence (ii) Signal detection: For each base position in the as two adjacent bases, mostly if there is a small offset of one or int trace ﬁle, we compute corresponding intensities B ; B ∈ two data points between C and T signals. In contrast to stand- {A, C, G, T } that estimate the base proportions in the molecu- ard sequencing, in our experiments we expect signal mixtures lar mixture. As an appropriate measure we have chosen the from different DNA populations. It follows that the separa- areas under the trace corresponding to the respective base for tion of overlaying intensities belonging to one position into each position in the sequence. By default, the trace segment two bases by the base caller has to be corrected. We identify between neighboring local minima is used for the signal area the separated base intensities by searching adjacent T and C estimation. If no local minima are present, then the boundar- positions in the called sequence from which one is aligned ies of the trace segment are estimated as the midpoint between with t or C and the other is introducing a gap into the ref- two neighboring inﬂection points. erence sequence. These base pairs in the called sequence are (iii) Alignment: The base intensities estimated in the then fused into a single base. previous step are then mapped to an underlying refer- (v) Alignment-based clipping: The quality of trace ﬁles ence sequence, available as genomic sequence from database from PCR product sequencing, especially of ampliﬁcates from sources and bisulﬁte converted in silico. The a priori availab- bisulﬁte-treated template containing different molecule pop- ility of the genomic sequence is a prerequisite for our applic- ulations, is lower than sequences from a homogeneous clone ation. To describe an expected bisulﬁte converted reference template. Alignment quality as a natural measure to assess sequence, the commonly used genomic alphabet (A,C,G,T )is sequencing quality is used to identify areas of poor qual- extended by one letter, the lower case t , to distinguish a thym- ity. Flanking regions of the sequence are clipped such that ine derived from uracil by bisulﬁte conversion from a thymine the remaining inner part has <10% alignment error to the that was present already in the genomic sequence. Cytosines reference sequence. in a CpG context in the reference sequence correspond to pos- (vi) Signal normalization: We found that cytosine trace itions where we want to quantify unknown methylation, and curves often are overscaled in direct bisulﬁte sequencing are therefore still denoted by C. For the sake of clarity in traces . Base proportion calculation based on trace curves the notation, these positions should be distinguishable from with different baseline intensities would lead to misleading t , where the sequenced DNA is never methylated and there- fore, expected to have a complete conversion by the bisulﬁte We speculate that this overscaling is a result of the standard basecaller treatment. We use the Smith–Waterman algorithm (Smith and software compensating for the low frequency of C signals. 3007 J.Lewin et al. results. Therefore, we normalize the trace curves prior to sample DNA. It follows that the methylation rate then can calculating the proportions of base intensities to determine be estimated by incorporating a correction for the incomplete bisulphite conversion and methylation rate. The normalized bisulﬁte conversion norm base intensities are denoted by B ; B ∈{A, C, G, T }; norm b ∈{C, t , T } that fulﬁll constraints (2) and (3) based on M = 1 − . (9) norm norm (C + T )R glob C C average base intensities. Signal variance, artifacts or errors in the normalization might norm norm norm T ≡ T + C .(2) T C C lead to negative methylation estimation which is set to 0. norm norm norm T ≡ T + C . (3) t t EXPERIMENTAL SETUP int We performed three series of experiments to assess the ana- Normalization of C is performed by multiplication of a lytical performance of our algorithm. We have investigated global factor F . (i) the estimation of cytosine/thymine signal proportions, (ii) norm int the estimation of methylation rates and (iii) the detection of C = F C , b ∈{C, t , A, G, T } (4) b b differential methylation using real tissue samples. Based on the data we use different strategies for normalization. Test system with known cytosine/thymine int int If there are at least three C positions with C >T normal- C C proportions ization is based on data from these positions [Equation (5) To test how accurate we can measure base proportions in four following from Equation (2)]. Otherwise normalization is dye trace data and if our normalization algorithm improves based on all t positions [Equation (6) following from Equa- measurements, we created an artiﬁcial test system with known tion (3)]. In rare cases when all cytosines were unmethylated int cytosine/thymine proportions. A 669 bp long fragment in the and converted completely (C = 0) normalization of the cytosine trace curve is impossible and unnecessary. promoter region of the gene G6e was ampliﬁed by the PCR after bisulﬁte treatment of the template DNA. The bisulﬁte int int T − T reaction was setup such that the conversion of cytosines were T C F = . (5) not perfect. The PCR product was subcloned into pCR2.1- int Topo vector (invitogen). The 96 clones were sequenced. Out of the 96 clones, 3 showing differences at the most positions int int T − T of genomic cytosine were chosen. The plasmid concentra- F = . (6) int tions of the three stocks were adjusted to the same level. To gain different cytosine/thymine base compositions volumes (vii) Compensation of incomplete conversion. were mixed in all six permutations of the proportions 1:2:4. (viii) Methylation estimation: Cytosine base intensity at These mixtures contain molecules with cytosine and thym- CpG positions can arise from two sources: from a popula- ine at the original genomic cytosine positions with expected tion of methylated cytosines in the sample DNA and from an cytosine/(cytosine + thymine) ratios from 0 to 1 in 1/7 steps. incomplete conversion reaction. It follows that the bisulﬁte Sense strands of the clone mixtures were sequenced ﬁve times conversion rate has to be ﬁrst estimated to obtain a correct using the kit 1.1 on the ABI PRISM 310 (Fig. 3a). Trace ﬁles estimation of the methylation rate in the sample DNA. For an were analyzed by using the ABI basecaller software 310POP4. individual t the conversion rate R is estimated by Our algorithm was then used to estimate base compositions at each original genomic cytosine position. Estimated values norm R = . (7) were binned by their expected cytosine/(cytosine + thymine) norm norm T + C t t ratios to assess their distributions and the mean absolute errors. Local R and global conversion rates R can be determined loc glob Test system with known methylation rates by averaging over R of individual bases within deﬁned ranges. To test our algorithm on data from DNA with deﬁned methyl- Then the methylation rate M,0 ≤ M ≤ 1, at a certain CpG can ation status, unmethylated human genomic DNA (Molecular be estimated by using the following simple linear relationship Staging) was divided into two equal volumes. The DNA norm norm norm in one of the volumes was enzymatically methylated with T = R (1 − M)(T + C ). (8) glob C C C methylase SssI (NEB) following the manufacturer’s protocol. The equation describes the fact that T base intensity at a C Volumes of methylated and unmethylated DNA were mixed norm position T is expected to arise from the unmethylated in 20% steps from 0 to 100%. The PCR for 60 ampliﬁc- portion of the sample DNA that is bisulﬁte converted by rate ates was performed on a Tetrat MJ-research PTC-225. For norm norm R. Furthermore, the sum of the base intensities T +C cycle sequencing, the forward PCR primer was used with C C is assumed to be proportional to the total of cytosines in the ABI kit 1.1 and run on the ABI 3730 DNA analyzer (Fig 3b). 3008 DNA methylation quantiﬁcation based on trace data a) DNA mixtures b) DNA mixtures normalized 0/7 2/7 4/7 6/7 0/7 2/7 4/7 6/7 expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected c) 10 subclones d) 20 subclones Fig. 3. Experimental setup of (a) a test system with known cytosine/thymine proportions (b) a test system with known methyl- ation rates. Steps that are potential sources for variances or biases 0/7 2/7 4/7 6/7 0/7 2/7 4/7 6/7 in the test systems like mixing steps, incomplete enzymatic methyl- expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected ation, PCR bias and variance, incomplete bisulﬁte conversion and variance in the sequencing procedure are typeset in italics. Fig. 4. (a) and (b) Quantitative measurements of C signal propor- tions in data from single sequencing runs of six clone mixtures with Trace ﬁles were called with ABI’s basecaller 3730POP7. Our expected C/(C+T) ratios from 0 to 1 in 1/7 steps. The boxplots show algorithm was then used to estimate the methylation rates at the distribution of the estimated values obtained by our algorithm each CpG position. Methylation rates were binned together without normalization and with normalization, respectively. The by their expected methylation rates and variances and mean estimates are plotted against the expected ratios (1039 data points absolute errors were assessed. total which means a measurement success rate of 89% given 6 mix- tures, 5 repetitions and 39 positions). Dashed graphs show the means Methylation in tissue samples of absolute errors. (c), (d) Simulated data for representations of Methylation was estimated using trace ﬁles from direct PCR mixed DNA in 0 to 1 in 1/7 steps by 10 and 20 subclones based of bisulﬁte-treated DNA from healthy tissue samples. The data on a binomial distribution. are a subset of trace ﬁles from the Human Epigenome Project direct sequencing method with the subcloning method. We pilot study by (Human Epigenome Consortium et al., 2003). calculated the smallest theoretical measurement error inher- ent to the subcloning method by simulating the subsampling RESULTS AND DISCUSSION of 10 and 20 subclones based on binomial distributions with Test system with known cytosine/thymine a certain C:T ratio. Figure 4c and d and Table 1 show that proportions errors in our estimates are comparable with those that could be To assess the effect of our signal normalization step, we obtained by sequencing 20 subclones of a PCR product. From used our algorithms on data from the test system with known this we can conclude that direct sequencing of the bisulﬁte- cytosine/(cytosine + thymine) ratios. Figure 4a and b show the treated DNA is a viable alternative to the subclone sequencing distribution of the estimated ratios against the expected ratios if only the mixture rates are the subject of interest. in the test system without and with normalization, respect- Test system with known methylation rates ively. The results demonstrate that the normalization step decreases the mean absolute error (represented by the dashed Second, we have evaluated the performance of our algorithm line on the ﬁgures) approximately to the half. Sequencing sev- by using the test system with known methylation rates. eral subclones from a PCR product is an alternative method to Figure 5d shows the distribution of the estimated methyl- measure the cytosine:thymine ratios in bisulﬁte-treated DNA. ation rates against the expected methylation rates in the test The measurement error of this method depends mainly on the system using the complete algorithm. For the estimation number of subclones that is sequenced. We benchmarked our of methylation rates, normalization and the correction for estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 J.Lewin et al. Table 1. Comparison of mean SD and absolute errors of C/(C+T) signal Table 2. Test system with known methylation rates: mean absolute errors proportions as estimated in our test system with known cytosine/thymine of methylation estimations in 60 ampliﬁcates with and without signal proportions and simulated representation by subclones normalization and conversion rate correction Mean SD Mean absolute error Method Mean absolute error Signal proportions 0.110 0.130 Raw data 0.27 Normalized signal proportions 0.077 0.055 Normalized 0.17 10 subclones 0.100 0.083 Corrected 0.19 20 subclones 0.072 0.058 Normalized and corrected 0.14 Table 3. Test system with known methylation rates: accuracy of sorting a) nor normalized b) only normalized n’either corrected paired methylation estimates at identical CpGs in 60 ampliﬁcates after normalization and conversion rate correction Expected rate 0.2 0.4 0.6 0.8 1 0 9198979999 0.2 90 98 99 99 0.4 96 97 98 0.6 79 88 0.8 89 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 expected expected expected expected expected expected expected expected expected expected expected expected expected expected d) normalized c) only corrected and corrected from the expected methylation rate. Systematic biases in the real values of all 60 covered regions can arise from incom- plete enzymatic methylation of the DNA or from ampliﬁcate speciﬁc biases in the PCR itself. Systematic biases in the test system would lead to devi- ations from expected values and to a higher variance in the complete data but still allow to detect relative differences in the methylation rates at individual CpG positions. To evaluate the capability of our method to detect differential methyla- 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 tion, we paired data from templates with different methylation expected expected expected expected expected expected expected expected expected expected expected expected expected expected values for each CpG. Table 3 lists the accuracy of classi- ﬁcation of higher versus lower methylated CpGs in the test Fig. 5. Estimation of methylation in the test system with known system. The accuracies for detecting differential methylation methylation rates. The boxplots show the distribution of the estimated in neighboring methylation rates with 20% steps are com- methylation rates as a function of the expected methylation and the pared with those that were obtained without normalization mean absolute error (dashed line). Each box includes data from CpGs and correction for incomplete bisulﬁte conversion (Table 4). of all 60 ampliﬁcates measured at the expected methylation rate. The performance clearly improves by using the normalization bisulﬁte conversion rate play an important role. This is demon- and the conversion rate correction steps. strated on Figure 5a–c and Table 2. If any of these steps Despite the overlap of the distributions of the estimated is omitted from the data analysis, then the mean absolute methylation values (cf. Fig. 5d), we can conclude that the error (dashed line) increases signiﬁcantly. The normaliza- detection of differential methylation is highly accurate. This tion has a major impact mainly on low methylation rates, is in accordance with our hypothesis of having ampliﬁcate where the absence of C signals leads to an overscaling of speciﬁc systematic biases in our reference test system. the C trace. We have evaluated threshold parameters for quality con- The methylation rates estimated in this experiment do not trol. More stringent parameters do not improve the results show as accurate correlation with the expected rates as was signiﬁcantly but lead to lower measurement success rates. For obtained in the previous C:T proportion experiments, where example, raising the threshold for bisulﬁte conversion from a mixture of subclones was used as a test system. One pos- 65 to 80% reduces the mean absolute error by 0.2% and raises sible explanation for this is that the real methylation rate in the accuracy by 1.3% but reduces the number of accessible the mixtures of methylated and unmethylated DNA deviates positions by 16%. estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 DNA methylation quantiﬁcation based on trace data Table 4. Test system with known methylation rates: accuracy of sorting samples by bisulﬁte sequencing the intragenic region of the paired methylation estimates at identical CpGs in 60 ampliﬁcates with 20% gene DAXX. The plot shows clear differential methylation in difference with and without using the normalization and correction for blocks of co-methylated CpGs that distinguishes breast from incomplete bisulﬁte conversion muscle and both from all other tissues. Correct sorting (%) Comparison Raw Norm/corr. CONCLUSION We have presented an algorithm to estimate methylation from 0/0.2 84 91 trace ﬁles generated by direct PCR sequencing of bisulﬁte- 0.2/0.4 71 90 treated DNA. The Results obtained by reference test systems 0.4/0.6 86 96 0.6/0.8 77 79 show that direct PCR sequencing is a viable alternative to 0.8/1 89 89 estimating methylation rates by sequencing subclones from the PCR product. Furthermore, we have demonstrated that our method can detect differences in methylation rates of 20% highly accurately. Applying our algorithms to bisulﬁte (a) sequencing data of DNA obtained from healthy tissue samples CpG pos. Brain Breast Liver Lung Muscle illustrated that by the aid of the method, CpGs with differen- 90 tial methylation rates between different tissue types can be identiﬁed. The algorithm allows to run big DNA methylation studies 60 151 like the Human Epigenome Project based on direct sequencing 50 172 in high-throughput facilities. It will help to gain informa- 40 202 tion about differential methylation in many tissue types and increase our understanding of the epigenetic layer in the complex system of gene expression, cell differentiation and tumorgenesis. (b) ACKNOWLEDGEMENTS Muscle We acknowledge Matthias Schuster, Christoph König and Lung Bülent Genç for experimental planning and Erik Leu for his Liver great laboratory work that provided most data used in this paper. We thank Kurt Berlin, Stephan Beck and Karen Novik Breast for the establishment of the Human Epigenome Project. The ﬁrst author also wants to thank Jörn Walter for his kind support Brain methylation [%] 0 100 200 300 of his PhD thesis. Some data shown in this paper was provided CpG position [bases] by the Human Epigenome Project pilot study supported by the EU (QLRT1999-30417). Fig. 6. (a) Methylation proﬁles of the intragenic region of gene DAXX obtained from DNA samples extracted from brain, breast, liver, lung and muscle tissue samples. The gray shading repres- REFERENCES ents the different methylation rates as indicated on the scale bar. Samples from different individuals are arranged in columns, while Adorjan,P., Distler,J., Lipscher,E., Model,F., Muller,J., Pelet,C., each row represents one CpG within the ampliﬁcate. (b) Physical dis- Braun,A., Florl,A.R., Gutig,D., Grabs,G. et al. (2002) Tumour tances and average methylation of CpGs within the tissues. Blocks class prediction and discovery by microarray-based DNA methyl- of differentially co-methylated CpGs are highlighted. ation analysis.Nucleic Acids Res., 30, e21. Barton,G.J. (1993) An efﬁcient algorithm to locate all locally optimal alignments between two sequences allowing for gaps. Comput. Appl. Biosci., 9, 729–734. Methylation in tissue samples Dahl,C. and Guldberg,P. (2003) DNA methylation analysis tech- Trace ﬁles produced in the The Human Epigenome Project by niques. Biogerontology, 4, 233–250. Human Epigenome Consortium et al. (2003) are processed Dear,S. and Staden,R. (1992) A standard ﬁle format for data from by the algorithm presented here. We present a dataset to DNA sequencing instruments. DNA Seq., 3, 107–110. demonstrate the capability of our method to detect differential Ehrlich,M. (2003) Expression of various genes is controlled by DNA methylation in real tissue samples. Figure 6 shows the methyl- methylation during mammalian development. J. Cell. Biochem., ation proﬁles obtained in brain, breast, liver, lung and muscle 88, 899–910. Methylation [%] 100 80 60 40 20 0 DAXX DAXX DAXX DAXX DAXX DAXX DAXX DAXX DAXX 1 J.Lewin et al. Frommer,M., McDonald,L.E., Millar,D.S., Collis,C.M., Watt,F., Paul,C.L. and Clark,S.J. (1996) Cytosine methylation: quantitation Grigg,G.W., Molloy,P.L. and Paul,C.L. (1992) A genomic sequen- by automated genomic sequencing and GENESCAN analysis. cing protocol that yields a positive display of 5-methylcytosine BioTechniques, 21, 126–133. residues in individual DNA strands.Proc. Natl Acad. Sci., USA, Qiu,P., Soder,G.J., Sanﬁorenzo,V.J., Wang,L., Greene,J.R., 89, 1827–1831. Fritz,M.A. and Cai,X.Y. (2003) Quantiﬁcation of single nucle- Human Epigenome Consortium, Epigenomics AG, The Welcome otide polymorphisms by automated DNA sequencing. Biochem. Trust Sanger Institute and Centre National de Genotypage (2003) Biophys. Res. Commun., 309, 331–338. Human Epigenome Project. Reik,W., Dean,W. and Walter,J. (2001) Epigenetic reprogramming Jones,P.A. (2002) DNA methylation and cancer. Oncogene, 21, in mammalian development. Science, 293, 1089–1093. 5358–5360. Siegmund,K.D. and Laird,P.W. (2002) Analysis of complex methyl- Olek,A., Oswald,J. and Walter,J. (1996) A modiﬁed and improved ation data.Methods, 27, 170–178. method for bisulphite based cytosine methylation analysis. Smith,T.F. and Waterman,M.S. (1981) Identiﬁcation of common Nucleic Acids Res., 24, 5064–5066. molecular subsequences. J. Mol. Biol., 147, 195–197. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/quantitative-dna-methylation-analysis-based-on-four-dye-trace-data-nykrpsTInP

Loading next page...

References (14)

Li Yu (2005)
[DNA methylation and cancer].
Zhonghua nei ke za zhi, 44 6
(2003)
The Welcome Trust Sanger Institute and Centre National de Genotypage
P. Qiu, George Soder, Vincent Sanfiorenzo, Luquan Wang, J. Greene, M. Fritz, Xiao-Yan Cai (2003)
Quantification of single nucleotide polymorphisms by automated DNA sequencing.
Biochemical and biophysical research communications, 309 2
C. Dahl, P. Guldberg (2004)
DNA methylation analysis techniques
Biogerontology, 4
G. Barton (1993)
An efficient algorithm to locate all locally optimal alignments between two sequences allowing for gaps
Computer applications in the biosciences : CABIOS, 9 6
S. Dear, Rodger Staden (1992)
A standard file format for data from DNA sequencing instruments.
DNA sequence : the journal of DNA sequencing and mapping, 3 2
M. Ehrlich (2003)
Expression of various genes is controlled by DNA methylation during mammalian development
Journal of Cellular Biochemistry, 88
K. Siegmund, P. Laird (2002)
Analysis of complex methylation data.
Methods, 27 2
Temple Smith, M. Waterman (1981)
Identification of common molecular subsequences.
Journal of molecular biology, 147 1
A. Olek, J. Oswald, Jörn Walter (1996)
A modified and improved method for bisulphite based cytosine methylation analysis.
Nucleic acids research, 24 24
W. Reik, W. Dean, J. Walter (2001)
Epigenetic Reprogramming in Mammalian Development
Science, 293
P. Adorján, J. Distler, Evelyne Lipscher, F. Model, Jurgen Müller, C. Pelet, A. Braun, A. Florl, David Gütig, G. Grabs, A. Howe, M. Kursar, R. Lesche, Erik Leu, A. Lewin, S. Maier, Volker Müller, T. Otto, C. Scholz, W. Schulz, H. Seifert, I. Schwope, H. Ziebarth, K. Berlin, C. Piepenbrock, A. Olek (2001)
Tumour class prediction and discovery by microarray-based DNA methylation analysis.
Nucleic acids research, 30 5
M. Frommer, L. McDonald, D. Millar, C. Collis, F. Watt, G. Grigg, P. Molloy, C. Paul (1992)
A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands.
Proceedings of the National Academy of Sciences of the United States of America, 89
C. Paul, S. Clark (1996)
Cytosine methylation: quantitation by automated genomic sequencing and GENESCAN analysis.
BioTechniques, 21 1

Publisher: Oxford University Press
Copyright: Bioinformatics vol. 20 issue 17 © Oxford University Press 2004; all rights reserved.
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/bth346
pmid: 15247106
Publisher site: See Article on Publisher Site

Abstract

Vol. 20 no. 17 2004, pages 3005–3012 BIOINFORMATICS doi:10.1093/bioinformatics/bth346 Quantitative DNA methylation analysis based on four-dye trace data from direct sequencing of PCR ampliﬁcates 1,∗ 2 1 Jörn Lewin , Armin O. Schmitt , Péter Adorján , 1 1 Thomas Hildmann and Christian Piepenbrock 1 2 Epigenomics AG, Kleine Präsidentenstrasse 1, 10178 Berlin, Germany and Institut für Nutztierwissenschaften, Humboldt-Universität zu Berlin, Invalidenstrasse 42, 10115 Berlin, Germany Received on March 24, 2004; revised on May 6, 2004; accepted on May 11, 2004 Advance Access publication July 9, 2004 ABSTRACT applicability is proven by identifying CpGs that are differentially Motivation: Methylation of cytosines in DNA plays an import- methylated in real tissue samples. ant role in the regulation of gene expression, and the analysis Contact: [email protected] of methylation patterns is fundamental for the understanding of cell differentiation, aging processes, diseases and cancer INTRODUCTION development. Such analysis has been limited, because tech- DNA methylation is a chemical modiﬁcation of the base nologies for detailed and efﬁcient high-throughput studies have cytosine to 5 -methyl cytosine. In DNA of vertebrates it occurs not been available. We have developed a novel quantitative in the context of cytosines followed by guanine, so-called methylation analysis algorithm and workﬂow based on direct CpGs. CpG methylation in human DNA is a tissue speciﬁc DNA sequencing of PCR products from bisulﬁte-treated DNA layer of information that is involved in the regulation of gene with high-throughput sequencing machines. This technology is expression, genomic imprinting (Reik et al., 2001) and cell a prerequisite for success of the Human Epigenome Project, differentiation (Ehrlich, 2003). Methylation proﬁles undergo the ﬁrst large genome-wide sequencing study for DNA methyl- changes in tumorgenesis (Jones, 2002) and allow differenti- ation in many different tissues. Methylation in tissue samples ation of DNA from healthy versus malignant tissue samples which are compositions of different cells is a quantitative (Adorjan et al., 2002). Differential methylation patterns are information represented by cytosine/thymine proportions after likely to have a large relevance for understanding disease and bisulﬁte conversion of unmethylated cytosines to uracil and diagnostic applications. PCR. Calculation of quantitative methylation information from The main goal of the Human Epigenome Project (Human base proportions represented by different dye signals in four- Epigenome Consortium et al., 2003, http://www.epigenome. dye sequencing trace ﬁles needs a speciﬁc algorithm handling org/) is to characterize the methylation signatures of different imbalanced and overscaled signals, incomplete conversion, tissue types genome wide. This approach requires methods quality problems and basecaller artifacts. that easily detect methylation patterns by automated high- Results: The algorithm we developed has several key prop- throughput technologies. erties: it analyzes trace ﬁles from PCR products of bisulﬁte- Different methods for methylation measurement are treated DNA sequenced directly on ABI machines; it yields described in Dahl and Guldberg (2003), Siegmund and Laird quantitative methylation measurements for individual cytosine (2002). One major group of technologies is based on methyl- positions after alignment with genomic reference sequences, ation sensitive enzymatic restriction of the DNA. The other signal normalization and estimation of effectiveness of bisul- group of technologies is based on bisulﬁte conversion of ﬁte treatment; it works in a fully automated pipeline including unmethylated cytosines (Olek et al., 1996). The Bisulﬁte treat- data quality monitoring; it is efﬁcient and avoids the usual cost ment of DNA leads to a chemical conversion of unmethylated of multiple sequencing runs on subclones to estimate DNA cytosine to uracil. Methylation of cytosines blocks this reac- methylation. The power of our new algorithm is demonstrated tion. In most cases, the PCR is used to amplify regions of with data from two test systems based on mixtures with known interest within the bisulﬁte converted DNA template whereby base compositions and deﬁned methylation. In addition, the positions converted into uracil appear as thymine in the product. Typically, a tissue sample contains a mixture of dif- To whom correspondence should be addressed. ferent cells; therefore, a proper description of methylation Bioinformatics vol. 20 issue 17 © Oxford University Press 2004; all rights reserved. 3005 J.Lewin et al. at a certain CpG requires quantiﬁcation of the proportion of the methylated templates at the investigated CpG. This pro- portion is referred to as the methylation rate of the CpG. After the bisulﬁte conversion and the PCR, the methylation rate at a CpG can be determined by assessing the propor- tion of remaining cytosine relative to the thymine. This can be done, e.g. by hybridization to oligomer probes on DNA chips (Adorjan et al., 2002) or by DNA sequencing (Frommer et al., 1992). Commonly used sequencing methods include the sequencing of a representative number of subclones of the PCR product or direct PCR sequencing by running inde- pendent sequencing reactions for cytosine and thymine using the same dye in different lanes of a sequencing gel (Paul and Clark, 1996). These sequencing methods are expensive and labor intensive. In the Human Epigenome Project, direct PCR sequencing on standard sequencing machines is used to achieve the required throughput in a cost effective way. This technology produces four-dye electropherogram data. The possibility to use such data for quantitative analyses of base compositions within pooled DNA was recently demonstrated for one single nucleotide polymorphism (SNP) (Qiu et al., 2003). The same principle is used here for the measurement of methylation in bisulﬁte-treated DNA product. Quantitative analysis by direct sequencing of the PCR products from bisulﬁte-treated DNA implicates several novel challenges: poor signal quality compared to genomic sequen- Fig. 1. Flow chart of all data processing steps of the methylation cing, overscaled cytosine signals and basecaller artifacts. In estimation algorithm. Detailed description of the single steps is given combination with the overscaled signals incomplete bisulﬁte in the text. Between all data processing steps quality control (QC) conversion, which is a general problem of all bisulﬁte-based is performed. The analysis of a single trace ﬁle is aborted if the ﬁle methylation detection methods, inﬂuences signal proportions itself is corrupted or if the genomic reference sequence is missing or in the trace signiﬁcantly. It was therefore necessary to develop if the length of good quality sequence, as determined by the clipping a speciﬁc algorithm that allows the use of four-dye sequencing procedure, is below a certain threshold (default is 50 bases) or if the trace ﬁles to gain quantitative methylation information. This bisulﬁte conversion rates are below a minimum threshold (default is 65%). newly developed data analysis method allows the use of estab- lished high-throughput sequencing technology for methyl- ation studies. In this paper, we ﬁrst present the algorithms (viii) methylation estimation (Fig. 1). A scheme of the data used for methylation rate estimations based on trace ﬁle data and the inﬂuence of the algorithmic steps (ii), (iii), (iv) and originating from direct sequencing of the PCR products from (vi) is given in Figure 2. Here, we present the algorithms for bisulﬁte-treated DNA. We then assess the two main steps of forward sequencing that aims at the estimation of the propor- our algorithm with real data from two experiments and show tion of cytosine to thymine at the positions of interest. Traces that they improve the accuracy of the methylation estima- that originate from reverse sequencing and show guanine and tion. Finally, we provide a single example based on data from adenine signals at corresponding positions can be analyzed the Human Epigenome Project pilot study to demonstrate the by the same algorithm building the reverse complement of scientiﬁc use of the algorithms with real tissue samples. the trace ﬁles. (i) Entropy-based clipping: We observed that base callers ALGORITHM often generate reads that contain long stretches of called bases The algorithm we present uses four-dye electropherogram with up-scaled background signals after the end of an amp- data preprocessed by the base caller of the sequencing liﬁcate. These artifacts are detected by using the normalized machine manufacturer, e.g. Applied Biosystems ‘.abi’ ﬁles Shannon entropy or the well-described ‘.scf’ ﬁles (Dear and Staden, 1992). The data processing includes the following steps: (i) entropy- S S b b based clipping, (ii) signal detection, (iii) alignment, (iv) trace H =− log S S B B B∈{A,C,G,T } B∈{A,C,G,T } correction, (v) alignment-based clipping, (vi) signal normal- b∈{A,C,G,T } ization, (vii) compensation of incomplete conversion and (1) 3006 DNA methylation quantiﬁcation based on trace data a) trace data b) normalized trace data Waterman, 1981; Barton, 1993) for optimal local alignments A signal norm norm norm C C C allowing for gaps to align the called sequence of the trace ﬁle C signal t C norm norm norm norm G signal T T T T t C T with the a priori known reference sequence. Alignment of t T signal and C in the reference with C or T in the trace are treated as matches. The bisulﬁte-treated DNA contains long stretches of T sig- AA CT G C T C G A AA Y GYT Y G A A t - A GT t C GA AA t G t T C GA nal. In some cases, this is misinterpreted by base callers by inserting too many T s into the called sequence. Accounting for this special situation, we have introduced an additional Fig. 2. Schematic representation of a trace ﬁle electropherogram type of gap cost to guarantee proper mapping of CpGs. Assign- obtained by bisulﬁte PCR sequencing (a) before and (b) after sig- ing costs for gaps between C and G in the reference sequence nal normalization. The upper sequences below the trace curves in (a) represent the sequence called by the standard basecaller and in forces the alignment of CpGs as one functional block to avoid (b) the peak mixture represented using IUPAC code (Y denotes C their mismapping. An example of this is given below: general and/or T ). The sequences at the bottom show the aligned reference costs for all gaps (g) are −19 and higher than costs for mis- sequence whereby t are genomic cytosine positions that are not in matches (−9) (Barton, 1993). For gaps inserted between C and CpG context, and expected to be unmethylated and therefore com- G in the reference sequence special additional gap costs(sg) pletely convertible. Trace curves are shown for all the four bases. For of −20 raise the total costs to −39, a punishment outnumbered every base position in the reference sequence four base intensities only by two gaps (−40) which in most cases leads to CpGs int B ; B ∈{A, C, G, T } are calculated as the area under the trace curve treated as one unit that cannot be split, just misaligned. int int segment that belongs to the base position (only C and T shown in norm a)). Normalized base intensities for cytosine (C ; b ∈{t , T , C}) b trace ATTTTTTTGA ATTTTTTTGA norm and thymine (T ) seen in (b) are used to estimate the bisulﬁte reference ATTTTTC-GA ATTTTT-CGA conversion rate (base intensities at t positions) and the methylation cost(g+sg)=-39 cost(g)=-19 rate at each CpG (base intensities at C positions). (iv) Trace correction: Standard base callers expect one 0 ≤ H ≤ 1 of the four trace curves S , b ∈{A, C, G, T } b homogeneous DNA population to be sequenced, there- in a sliding window of 200 data points. Flanking sequence fore some of them occasionally interpret mixed C and T stretches with an entropy larger than 0.8 are removed. base intensities at a single position of the reference sequence (ii) Signal detection: For each base position in the as two adjacent bases, mostly if there is a small offset of one or int trace ﬁle, we compute corresponding intensities B ; B ∈ two data points between C and T signals. In contrast to stand- {A, C, G, T } that estimate the base proportions in the molecu- ard sequencing, in our experiments we expect signal mixtures lar mixture. As an appropriate measure we have chosen the from different DNA populations. It follows that the separa- areas under the trace corresponding to the respective base for tion of overlaying intensities belonging to one position into each position in the sequence. By default, the trace segment two bases by the base caller has to be corrected. We identify between neighboring local minima is used for the signal area the separated base intensities by searching adjacent T and C estimation. If no local minima are present, then the boundar- positions in the called sequence from which one is aligned ies of the trace segment are estimated as the midpoint between with t or C and the other is introducing a gap into the ref- two neighboring inﬂection points. erence sequence. These base pairs in the called sequence are (iii) Alignment: The base intensities estimated in the then fused into a single base. previous step are then mapped to an underlying refer- (v) Alignment-based clipping: The quality of trace ﬁles ence sequence, available as genomic sequence from database from PCR product sequencing, especially of ampliﬁcates from sources and bisulﬁte converted in silico. The a priori availab- bisulﬁte-treated template containing different molecule pop- ility of the genomic sequence is a prerequisite for our applic- ulations, is lower than sequences from a homogeneous clone ation. To describe an expected bisulﬁte converted reference template. Alignment quality as a natural measure to assess sequence, the commonly used genomic alphabet (A,C,G,T )is sequencing quality is used to identify areas of poor qual- extended by one letter, the lower case t , to distinguish a thym- ity. Flanking regions of the sequence are clipped such that ine derived from uracil by bisulﬁte conversion from a thymine the remaining inner part has <10% alignment error to the that was present already in the genomic sequence. Cytosines reference sequence. in a CpG context in the reference sequence correspond to pos- (vi) Signal normalization: We found that cytosine trace itions where we want to quantify unknown methylation, and curves often are overscaled in direct bisulﬁte sequencing are therefore still denoted by C. For the sake of clarity in traces . Base proportion calculation based on trace curves the notation, these positions should be distinguishable from with different baseline intensities would lead to misleading t , where the sequenced DNA is never methylated and there- fore, expected to have a complete conversion by the bisulﬁte We speculate that this overscaling is a result of the standard basecaller treatment. We use the Smith–Waterman algorithm (Smith and software compensating for the low frequency of C signals. 3007 J.Lewin et al. results. Therefore, we normalize the trace curves prior to sample DNA. It follows that the methylation rate then can calculating the proportions of base intensities to determine be estimated by incorporating a correction for the incomplete bisulphite conversion and methylation rate. The normalized bisulﬁte conversion norm base intensities are denoted by B ; B ∈{A, C, G, T }; norm b ∈{C, t , T } that fulﬁll constraints (2) and (3) based on M = 1 − . (9) norm norm (C + T )R glob C C average base intensities. Signal variance, artifacts or errors in the normalization might norm norm norm T ≡ T + C .(2) T C C lead to negative methylation estimation which is set to 0. norm norm norm T ≡ T + C . (3) t t EXPERIMENTAL SETUP int We performed three series of experiments to assess the ana- Normalization of C is performed by multiplication of a lytical performance of our algorithm. We have investigated global factor F . (i) the estimation of cytosine/thymine signal proportions, (ii) norm int the estimation of methylation rates and (iii) the detection of C = F C , b ∈{C, t , A, G, T } (4) b b differential methylation using real tissue samples. Based on the data we use different strategies for normalization. Test system with known cytosine/thymine int int If there are at least three C positions with C >T normal- C C proportions ization is based on data from these positions [Equation (5) To test how accurate we can measure base proportions in four following from Equation (2)]. Otherwise normalization is dye trace data and if our normalization algorithm improves based on all t positions [Equation (6) following from Equa- measurements, we created an artiﬁcial test system with known tion (3)]. In rare cases when all cytosines were unmethylated int cytosine/thymine proportions. A 669 bp long fragment in the and converted completely (C = 0) normalization of the cytosine trace curve is impossible and unnecessary. promoter region of the gene G6e was ampliﬁed by the PCR after bisulﬁte treatment of the template DNA. The bisulﬁte int int T − T reaction was setup such that the conversion of cytosines were T C F = . (5) not perfect. The PCR product was subcloned into pCR2.1- int Topo vector (invitogen). The 96 clones were sequenced. Out of the 96 clones, 3 showing differences at the most positions int int T − T of genomic cytosine were chosen. The plasmid concentra- F = . (6) int tions of the three stocks were adjusted to the same level. To gain different cytosine/thymine base compositions volumes (vii) Compensation of incomplete conversion. were mixed in all six permutations of the proportions 1:2:4. (viii) Methylation estimation: Cytosine base intensity at These mixtures contain molecules with cytosine and thym- CpG positions can arise from two sources: from a popula- ine at the original genomic cytosine positions with expected tion of methylated cytosines in the sample DNA and from an cytosine/(cytosine + thymine) ratios from 0 to 1 in 1/7 steps. incomplete conversion reaction. It follows that the bisulﬁte Sense strands of the clone mixtures were sequenced ﬁve times conversion rate has to be ﬁrst estimated to obtain a correct using the kit 1.1 on the ABI PRISM 310 (Fig. 3a). Trace ﬁles estimation of the methylation rate in the sample DNA. For an were analyzed by using the ABI basecaller software 310POP4. individual t the conversion rate R is estimated by Our algorithm was then used to estimate base compositions at each original genomic cytosine position. Estimated values norm R = . (7) were binned by their expected cytosine/(cytosine + thymine) norm norm T + C t t ratios to assess their distributions and the mean absolute errors. Local R and global conversion rates R can be determined loc glob Test system with known methylation rates by averaging over R of individual bases within deﬁned ranges. To test our algorithm on data from DNA with deﬁned methyl- Then the methylation rate M,0 ≤ M ≤ 1, at a certain CpG can ation status, unmethylated human genomic DNA (Molecular be estimated by using the following simple linear relationship Staging) was divided into two equal volumes. The DNA norm norm norm in one of the volumes was enzymatically methylated with T = R (1 − M)(T + C ). (8) glob C C C methylase SssI (NEB) following the manufacturer’s protocol. The equation describes the fact that T base intensity at a C Volumes of methylated and unmethylated DNA were mixed norm position T is expected to arise from the unmethylated in 20% steps from 0 to 100%. The PCR for 60 ampliﬁc- portion of the sample DNA that is bisulﬁte converted by rate ates was performed on a Tetrat MJ-research PTC-225. For norm norm R. Furthermore, the sum of the base intensities T +C cycle sequencing, the forward PCR primer was used with C C is assumed to be proportional to the total of cytosines in the ABI kit 1.1 and run on the ABI 3730 DNA analyzer (Fig 3b). 3008 DNA methylation quantiﬁcation based on trace data a) DNA mixtures b) DNA mixtures normalized 0/7 2/7 4/7 6/7 0/7 2/7 4/7 6/7 expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected c) 10 subclones d) 20 subclones Fig. 3. Experimental setup of (a) a test system with known cytosine/thymine proportions (b) a test system with known methyl- ation rates. Steps that are potential sources for variances or biases 0/7 2/7 4/7 6/7 0/7 2/7 4/7 6/7 in the test systems like mixing steps, incomplete enzymatic methyl- expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected ation, PCR bias and variance, incomplete bisulﬁte conversion and variance in the sequencing procedure are typeset in italics. Fig. 4. (a) and (b) Quantitative measurements of C signal propor- tions in data from single sequencing runs of six clone mixtures with Trace ﬁles were called with ABI’s basecaller 3730POP7. Our expected C/(C+T) ratios from 0 to 1 in 1/7 steps. The boxplots show algorithm was then used to estimate the methylation rates at the distribution of the estimated values obtained by our algorithm each CpG position. Methylation rates were binned together without normalization and with normalization, respectively. The by their expected methylation rates and variances and mean estimates are plotted against the expected ratios (1039 data points absolute errors were assessed. total which means a measurement success rate of 89% given 6 mix- tures, 5 repetitions and 39 positions). Dashed graphs show the means Methylation in tissue samples of absolute errors. (c), (d) Simulated data for representations of Methylation was estimated using trace ﬁles from direct PCR mixed DNA in 0 to 1 in 1/7 steps by 10 and 20 subclones based of bisulﬁte-treated DNA from healthy tissue samples. The data on a binomial distribution. are a subset of trace ﬁles from the Human Epigenome Project direct sequencing method with the subcloning method. We pilot study by (Human Epigenome Consortium et al., 2003). calculated the smallest theoretical measurement error inher- ent to the subcloning method by simulating the subsampling RESULTS AND DISCUSSION of 10 and 20 subclones based on binomial distributions with Test system with known cytosine/thymine a certain C:T ratio. Figure 4c and d and Table 1 show that proportions errors in our estimates are comparable with those that could be To assess the effect of our signal normalization step, we obtained by sequencing 20 subclones of a PCR product. From used our algorithms on data from the test system with known this we can conclude that direct sequencing of the bisulﬁte- cytosine/(cytosine + thymine) ratios. Figure 4a and b show the treated DNA is a viable alternative to the subclone sequencing distribution of the estimated ratios against the expected ratios if only the mixture rates are the subject of interest. in the test system without and with normalization, respect- Test system with known methylation rates ively. The results demonstrate that the normalization step decreases the mean absolute error (represented by the dashed Second, we have evaluated the performance of our algorithm line on the ﬁgures) approximately to the half. Sequencing sev- by using the test system with known methylation rates. eral subclones from a PCR product is an alternative method to Figure 5d shows the distribution of the estimated methyl- measure the cytosine:thymine ratios in bisulﬁte-treated DNA. ation rates against the expected methylation rates in the test The measurement error of this method depends mainly on the system using the complete algorithm. For the estimation number of subclones that is sequenced. We benchmarked our of methylation rates, normalization and the correction for estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 J.Lewin et al. Table 1. Comparison of mean SD and absolute errors of C/(C+T) signal Table 2. Test system with known methylation rates: mean absolute errors proportions as estimated in our test system with known cytosine/thymine of methylation estimations in 60 ampliﬁcates with and without signal proportions and simulated representation by subclones normalization and conversion rate correction Mean SD Mean absolute error Method Mean absolute error Signal proportions 0.110 0.130 Raw data 0.27 Normalized signal proportions 0.077 0.055 Normalized 0.17 10 subclones 0.100 0.083 Corrected 0.19 20 subclones 0.072 0.058 Normalized and corrected 0.14 Table 3. Test system with known methylation rates: accuracy of sorting a) nor normalized b) only normalized n’either corrected paired methylation estimates at identical CpGs in 60 ampliﬁcates after normalization and conversion rate correction Expected rate 0.2 0.4 0.6 0.8 1 0 9198979999 0.2 90 98 99 99 0.4 96 97 98 0.6 79 88 0.8 89 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 expected expected expected expected expected expected expected expected expected expected expected expected expected expected d) normalized c) only corrected and corrected from the expected methylation rate. Systematic biases in the real values of all 60 covered regions can arise from incom- plete enzymatic methylation of the DNA or from ampliﬁcate speciﬁc biases in the PCR itself. Systematic biases in the test system would lead to devi- ations from expected values and to a higher variance in the complete data but still allow to detect relative differences in the methylation rates at individual CpG positions. To evaluate the capability of our method to detect differential methyla- 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 tion, we paired data from templates with different methylation expected expected expected expected expected expected expected expected expected expected expected expected expected expected values for each CpG. Table 3 lists the accuracy of classi- ﬁcation of higher versus lower methylated CpGs in the test Fig. 5. Estimation of methylation in the test system with known system. The accuracies for detecting differential methylation methylation rates. The boxplots show the distribution of the estimated in neighboring methylation rates with 20% steps are com- methylation rates as a function of the expected methylation and the pared with those that were obtained without normalization mean absolute error (dashed line). Each box includes data from CpGs and correction for incomplete bisulﬁte conversion (Table 4). of all 60 ampliﬁcates measured at the expected methylation rate. The performance clearly improves by using the normalization bisulﬁte conversion rate play an important role. This is demon- and the conversion rate correction steps. strated on Figure 5a–c and Table 2. If any of these steps Despite the overlap of the distributions of the estimated is omitted from the data analysis, then the mean absolute methylation values (cf. Fig. 5d), we can conclude that the error (dashed line) increases signiﬁcantly. The normaliza- detection of differential methylation is highly accurate. This tion has a major impact mainly on low methylation rates, is in accordance with our hypothesis of having ampliﬁcate where the absence of C signals leads to an overscaling of speciﬁc systematic biases in our reference test system. the C trace. We have evaluated threshold parameters for quality con- The methylation rates estimated in this experiment do not trol. More stringent parameters do not improve the results show as accurate correlation with the expected rates as was signiﬁcantly but lead to lower measurement success rates. For obtained in the previous C:T proportion experiments, where example, raising the threshold for bisulﬁte conversion from a mixture of subclones was used as a test system. One pos- 65 to 80% reduces the mean absolute error by 0.2% and raises sible explanation for this is that the real methylation rate in the accuracy by 1.3% but reduces the number of accessible the mixtures of methylated and unmethylated DNA deviates positions by 16%. estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 DNA methylation quantiﬁcation based on trace data Table 4. Test system with known methylation rates: accuracy of sorting samples by bisulﬁte sequencing the intragenic region of the paired methylation estimates at identical CpGs in 60 ampliﬁcates with 20% gene DAXX. The plot shows clear differential methylation in difference with and without using the normalization and correction for blocks of co-methylated CpGs that distinguishes breast from incomplete bisulﬁte conversion muscle and both from all other tissues. Correct sorting (%) Comparison Raw Norm/corr. CONCLUSION We have presented an algorithm to estimate methylation from 0/0.2 84 91 trace ﬁles generated by direct PCR sequencing of bisulﬁte- 0.2/0.4 71 90 treated DNA. The Results obtained by reference test systems 0.4/0.6 86 96 0.6/0.8 77 79 show that direct PCR sequencing is a viable alternative to 0.8/1 89 89 estimating methylation rates by sequencing subclones from the PCR product. Furthermore, we have demonstrated that our method can detect differences in methylation rates of 20% highly accurately. Applying our algorithms to bisulﬁte (a) sequencing data of DNA obtained from healthy tissue samples CpG pos. Brain Breast Liver Lung Muscle illustrated that by the aid of the method, CpGs with differen- 90 tial methylation rates between different tissue types can be identiﬁed. The algorithm allows to run big DNA methylation studies 60 151 like the Human Epigenome Project based on direct sequencing 50 172 in high-throughput facilities. It will help to gain informa- 40 202 tion about differential methylation in many tissue types and increase our understanding of the epigenetic layer in the complex system of gene expression, cell differentiation and tumorgenesis. (b) ACKNOWLEDGEMENTS Muscle We acknowledge Matthias Schuster, Christoph König and Lung Bülent Genç for experimental planning and Erik Leu for his Liver great laboratory work that provided most data used in this paper. We thank Kurt Berlin, Stephan Beck and Karen Novik Breast for the establishment of the Human Epigenome Project. The ﬁrst author also wants to thank Jörn Walter for his kind support Brain methylation [%] 0 100 200 300 of his PhD thesis. Some data shown in this paper was provided CpG position [bases] by the Human Epigenome Project pilot study supported by the EU (QLRT1999-30417). Fig. 6. (a) Methylation proﬁles of the intragenic region of gene DAXX obtained from DNA samples extracted from brain, breast, liver, lung and muscle tissue samples. The gray shading repres- REFERENCES ents the different methylation rates as indicated on the scale bar. Samples from different individuals are arranged in columns, while Adorjan,P., Distler,J., Lipscher,E., Model,F., Muller,J., Pelet,C., each row represents one CpG within the ampliﬁcate. (b) Physical dis- Braun,A., Florl,A.R., Gutig,D., Grabs,G. et al. (2002) Tumour tances and average methylation of CpGs within the tissues. Blocks class prediction and discovery by microarray-based DNA methyl- of differentially co-methylated CpGs are highlighted. ation analysis.Nucleic Acids Res., 30, e21. Barton,G.J. (1993) An efﬁcient algorithm to locate all locally optimal alignments between two sequences allowing for gaps. Comput. Appl. Biosci., 9, 729–734. Methylation in tissue samples Dahl,C. and Guldberg,P. (2003) DNA methylation analysis tech- Trace ﬁles produced in the The Human Epigenome Project by niques. Biogerontology, 4, 233–250. Human Epigenome Consortium et al. (2003) are processed Dear,S. and Staden,R. (1992) A standard ﬁle format for data from by the algorithm presented here. We present a dataset to DNA sequencing instruments. DNA Seq., 3, 107–110. demonstrate the capability of our method to detect differential Ehrlich,M. (2003) Expression of various genes is controlled by DNA methylation in real tissue samples. Figure 6 shows the methyl- methylation during mammalian development. J. Cell. Biochem., ation proﬁles obtained in brain, breast, liver, lung and muscle 88, 899–910. Methylation [%] 100 80 60 40 20 0 DAXX DAXX DAXX DAXX DAXX DAXX DAXX DAXX DAXX 1 J.Lewin et al. Frommer,M., McDonald,L.E., Millar,D.S., Collis,C.M., Watt,F., Paul,C.L. and Clark,S.J. (1996) Cytosine methylation: quantitation Grigg,G.W., Molloy,P.L. and Paul,C.L. (1992) A genomic sequen- by automated genomic sequencing and GENESCAN analysis. cing protocol that yields a positive display of 5-methylcytosine BioTechniques, 21, 126–133. residues in individual DNA strands.Proc. Natl Acad. Sci., USA, Qiu,P., Soder,G.J., Sanﬁorenzo,V.J., Wang,L., Greene,J.R., 89, 1827–1831. Fritz,M.A. and Cai,X.Y. (2003) Quantiﬁcation of single nucle- Human Epigenome Consortium, Epigenomics AG, The Welcome otide polymorphisms by automated DNA sequencing. Biochem. Trust Sanger Institute and Centre National de Genotypage (2003) Biophys. Res. Commun., 309, 331–338. Human Epigenome Project. Reik,W., Dean,W. and Walter,J. (2001) Epigenetic reprogramming Jones,P.A. (2002) DNA methylation and cancer. Oncogene, 21, in mammalian development. Science, 293, 1089–1093. 5358–5360. Siegmund,K.D. and Laird,P.W. (2002) Analysis of complex methyl- Olek,A., Oswald,J. and Walter,J. (1996) A modiﬁed and improved ation data.Methods, 27, 170–178. method for bisulphite based cytosine methylation analysis. Smith,T.F. and Waterman,M.S. (1981) Identiﬁcation of common Nucleic Acids Res., 24, 5064–5066. molecular subsequences. J. Mol. Biol., 147, 195–197.

Journal

Bioinformatics – Oxford University Press

Published: Jul 9, 2004

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Quantitative DNA methylation analysis based on four-dye trace data from direct sequencing of PCR amplificates

Quantitative DNA methylation analysis based on four-dye trace data from direct sequencing of PCR amplificates

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Quantitative DNA methylation analysis based on four-dye trace data from direct sequencing of PCR amplificates

Quantitative DNA methylation analysis based on four-dye trace data from direct sequencing of PCR amplificates

References (14)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies