Access the full text.
Sign up today, get DeepDyve free for 14 days.
Li Yu (2005)
[DNA methylation and cancer].Zhonghua nei ke za zhi, 44 6
(2003)
The Welcome Trust Sanger Institute and Centre National de Genotypage
P. Qiu, George Soder, Vincent Sanfiorenzo, Luquan Wang, J. Greene, M. Fritz, Xiao-Yan Cai (2003)
Quantification of single nucleotide polymorphisms by automated DNA sequencing.Biochemical and biophysical research communications, 309 2
C. Dahl, P. Guldberg (2004)
DNA methylation analysis techniquesBiogerontology, 4
G. Barton (1993)
An efficient algorithm to locate all locally optimal alignments between two sequences allowing for gapsComputer applications in the biosciences : CABIOS, 9 6
S. Dear, Rodger Staden (1992)
A standard file format for data from DNA sequencing instruments.DNA sequence : the journal of DNA sequencing and mapping, 3 2
M. Ehrlich (2003)
Expression of various genes is controlled by DNA methylation during mammalian developmentJournal of Cellular Biochemistry, 88
K. Siegmund, P. Laird (2002)
Analysis of complex methylation data.Methods, 27 2
Temple Smith, M. Waterman (1981)
Identification of common molecular subsequences.Journal of molecular biology, 147 1
A. Olek, J. Oswald, Jörn Walter (1996)
A modified and improved method for bisulphite based cytosine methylation analysis.Nucleic acids research, 24 24
W. Reik, W. Dean, J. Walter (2001)
Epigenetic Reprogramming in Mammalian DevelopmentScience, 293
P. Adorján, J. Distler, Evelyne Lipscher, F. Model, Jurgen Müller, C. Pelet, A. Braun, A. Florl, David Gütig, G. Grabs, A. Howe, M. Kursar, R. Lesche, Erik Leu, A. Lewin, S. Maier, Volker Müller, T. Otto, C. Scholz, W. Schulz, H. Seifert, I. Schwope, H. Ziebarth, K. Berlin, C. Piepenbrock, A. Olek (2001)
Tumour class prediction and discovery by microarray-based DNA methylation analysis.Nucleic acids research, 30 5
M. Frommer, L. McDonald, D. Millar, C. Collis, F. Watt, G. Grigg, P. Molloy, C. Paul (1992)
A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands.Proceedings of the National Academy of Sciences of the United States of America, 89
C. Paul, S. Clark (1996)
Cytosine methylation: quantitation by automated genomic sequencing and GENESCAN analysis.BioTechniques, 21 1
Vol. 20 no. 17 2004, pages 3005–3012 BIOINFORMATICS doi:10.1093/bioinformatics/bth346 Quantitative DNA methylation analysis based on four-dye trace data from direct sequencing of PCR amplificates 1,∗ 2 1 Jörn Lewin , Armin O. Schmitt , Péter Adorján , 1 1 Thomas Hildmann and Christian Piepenbrock 1 2 Epigenomics AG, Kleine Präsidentenstrasse 1, 10178 Berlin, Germany and Institut für Nutztierwissenschaften, Humboldt-Universität zu Berlin, Invalidenstrasse 42, 10115 Berlin, Germany Received on March 24, 2004; revised on May 6, 2004; accepted on May 11, 2004 Advance Access publication July 9, 2004 ABSTRACT applicability is proven by identifying CpGs that are differentially Motivation: Methylation of cytosines in DNA plays an import- methylated in real tissue samples. ant role in the regulation of gene expression, and the analysis Contact: [email protected] of methylation patterns is fundamental for the understanding of cell differentiation, aging processes, diseases and cancer INTRODUCTION development. Such analysis has been limited, because tech- DNA methylation is a chemical modification of the base nologies for detailed and efficient high-throughput studies have cytosine to 5 -methyl cytosine. In DNA of vertebrates it occurs not been available. We have developed a novel quantitative in the context of cytosines followed by guanine, so-called methylation analysis algorithm and workflow based on direct CpGs. CpG methylation in human DNA is a tissue specific DNA sequencing of PCR products from bisulfite-treated DNA layer of information that is involved in the regulation of gene with high-throughput sequencing machines. This technology is expression, genomic imprinting (Reik et al., 2001) and cell a prerequisite for success of the Human Epigenome Project, differentiation (Ehrlich, 2003). Methylation profiles undergo the first large genome-wide sequencing study for DNA methyl- changes in tumorgenesis (Jones, 2002) and allow differenti- ation in many different tissues. Methylation in tissue samples ation of DNA from healthy versus malignant tissue samples which are compositions of different cells is a quantitative (Adorjan et al., 2002). Differential methylation patterns are information represented by cytosine/thymine proportions after likely to have a large relevance for understanding disease and bisulfite conversion of unmethylated cytosines to uracil and diagnostic applications. PCR. Calculation of quantitative methylation information from The main goal of the Human Epigenome Project (Human base proportions represented by different dye signals in four- Epigenome Consortium et al., 2003, http://www.epigenome. dye sequencing trace files needs a specific algorithm handling org/) is to characterize the methylation signatures of different imbalanced and overscaled signals, incomplete conversion, tissue types genome wide. This approach requires methods quality problems and basecaller artifacts. that easily detect methylation patterns by automated high- Results: The algorithm we developed has several key prop- throughput technologies. erties: it analyzes trace files from PCR products of bisulfite- Different methods for methylation measurement are treated DNA sequenced directly on ABI machines; it yields described in Dahl and Guldberg (2003), Siegmund and Laird quantitative methylation measurements for individual cytosine (2002). One major group of technologies is based on methyl- positions after alignment with genomic reference sequences, ation sensitive enzymatic restriction of the DNA. The other signal normalization and estimation of effectiveness of bisul- group of technologies is based on bisulfite conversion of fite treatment; it works in a fully automated pipeline including unmethylated cytosines (Olek et al., 1996). The Bisulfite treat- data quality monitoring; it is efficient and avoids the usual cost ment of DNA leads to a chemical conversion of unmethylated of multiple sequencing runs on subclones to estimate DNA cytosine to uracil. Methylation of cytosines blocks this reac- methylation. The power of our new algorithm is demonstrated tion. In most cases, the PCR is used to amplify regions of with data from two test systems based on mixtures with known interest within the bisulfite converted DNA template whereby base compositions and defined methylation. In addition, the positions converted into uracil appear as thymine in the product. Typically, a tissue sample contains a mixture of dif- To whom correspondence should be addressed. ferent cells; therefore, a proper description of methylation Bioinformatics vol. 20 issue 17 © Oxford University Press 2004; all rights reserved. 3005 J.Lewin et al. at a certain CpG requires quantification of the proportion of the methylated templates at the investigated CpG. This pro- portion is referred to as the methylation rate of the CpG. After the bisulfite conversion and the PCR, the methylation rate at a CpG can be determined by assessing the propor- tion of remaining cytosine relative to the thymine. This can be done, e.g. by hybridization to oligomer probes on DNA chips (Adorjan et al., 2002) or by DNA sequencing (Frommer et al., 1992). Commonly used sequencing methods include the sequencing of a representative number of subclones of the PCR product or direct PCR sequencing by running inde- pendent sequencing reactions for cytosine and thymine using the same dye in different lanes of a sequencing gel (Paul and Clark, 1996). These sequencing methods are expensive and labor intensive. In the Human Epigenome Project, direct PCR sequencing on standard sequencing machines is used to achieve the required throughput in a cost effective way. This technology produces four-dye electropherogram data. The possibility to use such data for quantitative analyses of base compositions within pooled DNA was recently demonstrated for one single nucleotide polymorphism (SNP) (Qiu et al., 2003). The same principle is used here for the measurement of methylation in bisulfite-treated DNA product. Quantitative analysis by direct sequencing of the PCR products from bisulfite-treated DNA implicates several novel challenges: poor signal quality compared to genomic sequen- Fig. 1. Flow chart of all data processing steps of the methylation cing, overscaled cytosine signals and basecaller artifacts. In estimation algorithm. Detailed description of the single steps is given combination with the overscaled signals incomplete bisulfite in the text. Between all data processing steps quality control (QC) conversion, which is a general problem of all bisulfite-based is performed. The analysis of a single trace file is aborted if the file methylation detection methods, influences signal proportions itself is corrupted or if the genomic reference sequence is missing or in the trace significantly. It was therefore necessary to develop if the length of good quality sequence, as determined by the clipping a specific algorithm that allows the use of four-dye sequencing procedure, is below a certain threshold (default is 50 bases) or if the trace files to gain quantitative methylation information. This bisulfite conversion rates are below a minimum threshold (default is 65%). newly developed data analysis method allows the use of estab- lished high-throughput sequencing technology for methyl- ation studies. In this paper, we first present the algorithms (viii) methylation estimation (Fig. 1). A scheme of the data used for methylation rate estimations based on trace file data and the influence of the algorithmic steps (ii), (iii), (iv) and originating from direct sequencing of the PCR products from (vi) is given in Figure 2. Here, we present the algorithms for bisulfite-treated DNA. We then assess the two main steps of forward sequencing that aims at the estimation of the propor- our algorithm with real data from two experiments and show tion of cytosine to thymine at the positions of interest. Traces that they improve the accuracy of the methylation estima- that originate from reverse sequencing and show guanine and tion. Finally, we provide a single example based on data from adenine signals at corresponding positions can be analyzed the Human Epigenome Project pilot study to demonstrate the by the same algorithm building the reverse complement of scientific use of the algorithms with real tissue samples. the trace files. (i) Entropy-based clipping: We observed that base callers ALGORITHM often generate reads that contain long stretches of called bases The algorithm we present uses four-dye electropherogram with up-scaled background signals after the end of an amp- data preprocessed by the base caller of the sequencing lificate. These artifacts are detected by using the normalized machine manufacturer, e.g. Applied Biosystems ‘.abi’ files Shannon entropy or the well-described ‘.scf’ files (Dear and Staden, 1992). The data processing includes the following steps: (i) entropy- S S b b based clipping, (ii) signal detection, (iii) alignment, (iv) trace H =− log S S B B B∈{A,C,G,T } B∈{A,C,G,T } correction, (v) alignment-based clipping, (vi) signal normal- b∈{A,C,G,T } ization, (vii) compensation of incomplete conversion and (1) 3006 DNA methylation quantification based on trace data a) trace data b) normalized trace data Waterman, 1981; Barton, 1993) for optimal local alignments A signal norm norm norm C C C allowing for gaps to align the called sequence of the trace file C signal t C norm norm norm norm G signal T T T T t C T with the a priori known reference sequence. Alignment of t T signal and C in the reference with C or T in the trace are treated as matches. The bisulfite-treated DNA contains long stretches of T sig- AA CT G C T C G A AA Y GYT Y G A A t - A GT t C GA AA t G t T C GA nal. In some cases, this is misinterpreted by base callers by inserting too many T s into the called sequence. Accounting for this special situation, we have introduced an additional Fig. 2. Schematic representation of a trace file electropherogram type of gap cost to guarantee proper mapping of CpGs. Assign- obtained by bisulfite PCR sequencing (a) before and (b) after sig- ing costs for gaps between C and G in the reference sequence nal normalization. The upper sequences below the trace curves in (a) represent the sequence called by the standard basecaller and in forces the alignment of CpGs as one functional block to avoid (b) the peak mixture represented using IUPAC code (Y denotes C their mismapping. An example of this is given below: general and/or T ). The sequences at the bottom show the aligned reference costs for all gaps (g) are −19 and higher than costs for mis- sequence whereby t are genomic cytosine positions that are not in matches (−9) (Barton, 1993). For gaps inserted between C and CpG context, and expected to be unmethylated and therefore com- G in the reference sequence special additional gap costs(sg) pletely convertible. Trace curves are shown for all the four bases. For of −20 raise the total costs to −39, a punishment outnumbered every base position in the reference sequence four base intensities only by two gaps (−40) which in most cases leads to CpGs int B ; B ∈{A, C, G, T } are calculated as the area under the trace curve treated as one unit that cannot be split, just misaligned. int int segment that belongs to the base position (only C and T shown in norm a)). Normalized base intensities for cytosine (C ; b ∈{t , T , C}) b trace ATTTTTTTGA ATTTTTTTGA norm and thymine (T ) seen in (b) are used to estimate the bisulfite reference ATTTTTC-GA ATTTTT-CGA conversion rate (base intensities at t positions) and the methylation cost(g+sg)=-39 cost(g)=-19 rate at each CpG (base intensities at C positions). (iv) Trace correction: Standard base callers expect one 0 ≤ H ≤ 1 of the four trace curves S , b ∈{A, C, G, T } b homogeneous DNA population to be sequenced, there- in a sliding window of 200 data points. Flanking sequence fore some of them occasionally interpret mixed C and T stretches with an entropy larger than 0.8 are removed. base intensities at a single position of the reference sequence (ii) Signal detection: For each base position in the as two adjacent bases, mostly if there is a small offset of one or int trace file, we compute corresponding intensities B ; B ∈ two data points between C and T signals. In contrast to stand- {A, C, G, T } that estimate the base proportions in the molecu- ard sequencing, in our experiments we expect signal mixtures lar mixture. As an appropriate measure we have chosen the from different DNA populations. It follows that the separa- areas under the trace corresponding to the respective base for tion of overlaying intensities belonging to one position into each position in the sequence. By default, the trace segment two bases by the base caller has to be corrected. We identify between neighboring local minima is used for the signal area the separated base intensities by searching adjacent T and C estimation. If no local minima are present, then the boundar- positions in the called sequence from which one is aligned ies of the trace segment are estimated as the midpoint between with t or C and the other is introducing a gap into the ref- two neighboring inflection points. erence sequence. These base pairs in the called sequence are (iii) Alignment: The base intensities estimated in the then fused into a single base. previous step are then mapped to an underlying refer- (v) Alignment-based clipping: The quality of trace files ence sequence, available as genomic sequence from database from PCR product sequencing, especially of amplificates from sources and bisulfite converted in silico. The a priori availab- bisulfite-treated template containing different molecule pop- ility of the genomic sequence is a prerequisite for our applic- ulations, is lower than sequences from a homogeneous clone ation. To describe an expected bisulfite converted reference template. Alignment quality as a natural measure to assess sequence, the commonly used genomic alphabet (A,C,G,T )is sequencing quality is used to identify areas of poor qual- extended by one letter, the lower case t , to distinguish a thym- ity. Flanking regions of the sequence are clipped such that ine derived from uracil by bisulfite conversion from a thymine the remaining inner part has <10% alignment error to the that was present already in the genomic sequence. Cytosines reference sequence. in a CpG context in the reference sequence correspond to pos- (vi) Signal normalization: We found that cytosine trace itions where we want to quantify unknown methylation, and curves often are overscaled in direct bisulfite sequencing are therefore still denoted by C. For the sake of clarity in traces . Base proportion calculation based on trace curves the notation, these positions should be distinguishable from with different baseline intensities would lead to misleading t , where the sequenced DNA is never methylated and there- fore, expected to have a complete conversion by the bisulfite We speculate that this overscaling is a result of the standard basecaller treatment. We use the Smith–Waterman algorithm (Smith and software compensating for the low frequency of C signals. 3007 J.Lewin et al. results. Therefore, we normalize the trace curves prior to sample DNA. It follows that the methylation rate then can calculating the proportions of base intensities to determine be estimated by incorporating a correction for the incomplete bisulphite conversion and methylation rate. The normalized bisulfite conversion norm base intensities are denoted by B ; B ∈{A, C, G, T }; norm b ∈{C, t , T } that fulfill constraints (2) and (3) based on M = 1 − . (9) norm norm (C + T )R glob C C average base intensities. Signal variance, artifacts or errors in the normalization might norm norm norm T ≡ T + C .(2) T C C lead to negative methylation estimation which is set to 0. norm norm norm T ≡ T + C . (3) t t EXPERIMENTAL SETUP int We performed three series of experiments to assess the ana- Normalization of C is performed by multiplication of a lytical performance of our algorithm. We have investigated global factor F . (i) the estimation of cytosine/thymine signal proportions, (ii) norm int the estimation of methylation rates and (iii) the detection of C = F C , b ∈{C, t , A, G, T } (4) b b differential methylation using real tissue samples. Based on the data we use different strategies for normalization. Test system with known cytosine/thymine int int If there are at least three C positions with C >T normal- C C proportions ization is based on data from these positions [Equation (5) To test how accurate we can measure base proportions in four following from Equation (2)]. Otherwise normalization is dye trace data and if our normalization algorithm improves based on all t positions [Equation (6) following from Equa- measurements, we created an artificial test system with known tion (3)]. In rare cases when all cytosines were unmethylated int cytosine/thymine proportions. A 669 bp long fragment in the and converted completely (C = 0) normalization of the cytosine trace curve is impossible and unnecessary. promoter region of the gene G6e was amplified by the PCR after bisulfite treatment of the template DNA. The bisulfite int int T − T reaction was setup such that the conversion of cytosines were T C F = . (5) not perfect. The PCR product was subcloned into pCR2.1- int Topo vector (invitogen). The 96 clones were sequenced. Out of the 96 clones, 3 showing differences at the most positions int int T − T of genomic cytosine were chosen. The plasmid concentra- F = . (6) int tions of the three stocks were adjusted to the same level. To gain different cytosine/thymine base compositions volumes (vii) Compensation of incomplete conversion. were mixed in all six permutations of the proportions 1:2:4. (viii) Methylation estimation: Cytosine base intensity at These mixtures contain molecules with cytosine and thym- CpG positions can arise from two sources: from a popula- ine at the original genomic cytosine positions with expected tion of methylated cytosines in the sample DNA and from an cytosine/(cytosine + thymine) ratios from 0 to 1 in 1/7 steps. incomplete conversion reaction. It follows that the bisulfite Sense strands of the clone mixtures were sequenced five times conversion rate has to be first estimated to obtain a correct using the kit 1.1 on the ABI PRISM 310 (Fig. 3a). Trace files estimation of the methylation rate in the sample DNA. For an were analyzed by using the ABI basecaller software 310POP4. individual t the conversion rate R is estimated by Our algorithm was then used to estimate base compositions at each original genomic cytosine position. Estimated values norm R = . (7) were binned by their expected cytosine/(cytosine + thymine) norm norm T + C t t ratios to assess their distributions and the mean absolute errors. Local R and global conversion rates R can be determined loc glob Test system with known methylation rates by averaging over R of individual bases within defined ranges. To test our algorithm on data from DNA with defined methyl- Then the methylation rate M,0 ≤ M ≤ 1, at a certain CpG can ation status, unmethylated human genomic DNA (Molecular be estimated by using the following simple linear relationship Staging) was divided into two equal volumes. The DNA norm norm norm in one of the volumes was enzymatically methylated with T = R (1 − M)(T + C ). (8) glob C C C methylase SssI (NEB) following the manufacturer’s protocol. The equation describes the fact that T base intensity at a C Volumes of methylated and unmethylated DNA were mixed norm position T is expected to arise from the unmethylated in 20% steps from 0 to 100%. The PCR for 60 amplific- portion of the sample DNA that is bisulfite converted by rate ates was performed on a Tetrat MJ-research PTC-225. For norm norm R. Furthermore, the sum of the base intensities T +C cycle sequencing, the forward PCR primer was used with C C is assumed to be proportional to the total of cytosines in the ABI kit 1.1 and run on the ABI 3730 DNA analyzer (Fig 3b). 3008 DNA methylation quantification based on trace data a) DNA mixtures b) DNA mixtures normalized 0/7 2/7 4/7 6/7 0/7 2/7 4/7 6/7 expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected c) 10 subclones d) 20 subclones Fig. 3. Experimental setup of (a) a test system with known cytosine/thymine proportions (b) a test system with known methyl- ation rates. Steps that are potential sources for variances or biases 0/7 2/7 4/7 6/7 0/7 2/7 4/7 6/7 in the test systems like mixing steps, incomplete enzymatic methyl- expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected expected ation, PCR bias and variance, incomplete bisulfite conversion and variance in the sequencing procedure are typeset in italics. Fig. 4. (a) and (b) Quantitative measurements of C signal propor- tions in data from single sequencing runs of six clone mixtures with Trace files were called with ABI’s basecaller 3730POP7. Our expected C/(C+T) ratios from 0 to 1 in 1/7 steps. The boxplots show algorithm was then used to estimate the methylation rates at the distribution of the estimated values obtained by our algorithm each CpG position. Methylation rates were binned together without normalization and with normalization, respectively. The by their expected methylation rates and variances and mean estimates are plotted against the expected ratios (1039 data points absolute errors were assessed. total which means a measurement success rate of 89% given 6 mix- tures, 5 repetitions and 39 positions). Dashed graphs show the means Methylation in tissue samples of absolute errors. (c), (d) Simulated data for representations of Methylation was estimated using trace files from direct PCR mixed DNA in 0 to 1 in 1/7 steps by 10 and 20 subclones based of bisulfite-treated DNA from healthy tissue samples. The data on a binomial distribution. are a subset of trace files from the Human Epigenome Project direct sequencing method with the subcloning method. We pilot study by (Human Epigenome Consortium et al., 2003). calculated the smallest theoretical measurement error inher- ent to the subcloning method by simulating the subsampling RESULTS AND DISCUSSION of 10 and 20 subclones based on binomial distributions with Test system with known cytosine/thymine a certain C:T ratio. Figure 4c and d and Table 1 show that proportions errors in our estimates are comparable with those that could be To assess the effect of our signal normalization step, we obtained by sequencing 20 subclones of a PCR product. From used our algorithms on data from the test system with known this we can conclude that direct sequencing of the bisulfite- cytosine/(cytosine + thymine) ratios. Figure 4a and b show the treated DNA is a viable alternative to the subclone sequencing distribution of the estimated ratios against the expected ratios if only the mixture rates are the subject of interest. in the test system without and with normalization, respect- Test system with known methylation rates ively. The results demonstrate that the normalization step decreases the mean absolute error (represented by the dashed Second, we have evaluated the performance of our algorithm line on the figures) approximately to the half. Sequencing sev- by using the test system with known methylation rates. eral subclones from a PCR product is an alternative method to Figure 5d shows the distribution of the estimated methyl- measure the cytosine:thymine ratios in bisulfite-treated DNA. ation rates against the expected methylation rates in the test The measurement error of this method depends mainly on the system using the complete algorithm. For the estimation number of subclones that is sequenced. We benchmarked our of methylation rates, normalization and the correction for estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 J.Lewin et al. Table 1. Comparison of mean SD and absolute errors of C/(C+T) signal Table 2. Test system with known methylation rates: mean absolute errors proportions as estimated in our test system with known cytosine/thymine of methylation estimations in 60 amplificates with and without signal proportions and simulated representation by subclones normalization and conversion rate correction Mean SD Mean absolute error Method Mean absolute error Signal proportions 0.110 0.130 Raw data 0.27 Normalized signal proportions 0.077 0.055 Normalized 0.17 10 subclones 0.100 0.083 Corrected 0.19 20 subclones 0.072 0.058 Normalized and corrected 0.14 Table 3. Test system with known methylation rates: accuracy of sorting a) nor normalized b) only normalized n’either corrected paired methylation estimates at identical CpGs in 60 amplificates after normalization and conversion rate correction Expected rate 0.2 0.4 0.6 0.8 1 0 9198979999 0.2 90 98 99 99 0.4 96 97 98 0.6 79 88 0.8 89 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 expected expected expected expected expected expected expected expected expected expected expected expected expected expected d) normalized c) only corrected and corrected from the expected methylation rate. Systematic biases in the real values of all 60 covered regions can arise from incom- plete enzymatic methylation of the DNA or from amplificate specific biases in the PCR itself. Systematic biases in the test system would lead to devi- ations from expected values and to a higher variance in the complete data but still allow to detect relative differences in the methylation rates at individual CpG positions. To evaluate the capability of our method to detect differential methyla- 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 tion, we paired data from templates with different methylation expected expected expected expected expected expected expected expected expected expected expected expected expected expected values for each CpG. Table 3 lists the accuracy of classi- fication of higher versus lower methylated CpGs in the test Fig. 5. Estimation of methylation in the test system with known system. The accuracies for detecting differential methylation methylation rates. The boxplots show the distribution of the estimated in neighboring methylation rates with 20% steps are com- methylation rates as a function of the expected methylation and the pared with those that were obtained without normalization mean absolute error (dashed line). Each box includes data from CpGs and correction for incomplete bisulfite conversion (Table 4). of all 60 amplificates measured at the expected methylation rate. The performance clearly improves by using the normalization bisulfite conversion rate play an important role. This is demon- and the conversion rate correction steps. strated on Figure 5a–c and Table 2. If any of these steps Despite the overlap of the distributions of the estimated is omitted from the data analysis, then the mean absolute methylation values (cf. Fig. 5d), we can conclude that the error (dashed line) increases significantly. The normaliza- detection of differential methylation is highly accurate. This tion has a major impact mainly on low methylation rates, is in accordance with our hypothesis of having amplificate where the absence of C signals leads to an overscaling of specific systematic biases in our reference test system. the C trace. We have evaluated threshold parameters for quality con- The methylation rates estimated in this experiment do not trol. More stringent parameters do not improve the results show as accurate correlation with the expected rates as was significantly but lead to lower measurement success rates. For obtained in the previous C:T proportion experiments, where example, raising the threshold for bisulfite conversion from a mixture of subclones was used as a test system. One pos- 65 to 80% reduces the mean absolute error by 0.2% and raises sible explanation for this is that the real methylation rate in the accuracy by 1.3% but reduces the number of accessible the mixtures of methylated and unmethylated DNA deviates positions by 16%. estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated estimated 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 DNA methylation quantification based on trace data Table 4. Test system with known methylation rates: accuracy of sorting samples by bisulfite sequencing the intragenic region of the paired methylation estimates at identical CpGs in 60 amplificates with 20% gene DAXX. The plot shows clear differential methylation in difference with and without using the normalization and correction for blocks of co-methylated CpGs that distinguishes breast from incomplete bisulfite conversion muscle and both from all other tissues. Correct sorting (%) Comparison Raw Norm/corr. CONCLUSION We have presented an algorithm to estimate methylation from 0/0.2 84 91 trace files generated by direct PCR sequencing of bisulfite- 0.2/0.4 71 90 treated DNA. The Results obtained by reference test systems 0.4/0.6 86 96 0.6/0.8 77 79 show that direct PCR sequencing is a viable alternative to 0.8/1 89 89 estimating methylation rates by sequencing subclones from the PCR product. Furthermore, we have demonstrated that our method can detect differences in methylation rates of 20% highly accurately. Applying our algorithms to bisulfite (a) sequencing data of DNA obtained from healthy tissue samples CpG pos. Brain Breast Liver Lung Muscle illustrated that by the aid of the method, CpGs with differen- 90 tial methylation rates between different tissue types can be identified. The algorithm allows to run big DNA methylation studies 60 151 like the Human Epigenome Project based on direct sequencing 50 172 in high-throughput facilities. It will help to gain informa- 40 202 tion about differential methylation in many tissue types and increase our understanding of the epigenetic layer in the complex system of gene expression, cell differentiation and tumorgenesis. (b) ACKNOWLEDGEMENTS Muscle We acknowledge Matthias Schuster, Christoph König and Lung Bülent Genç for experimental planning and Erik Leu for his Liver great laboratory work that provided most data used in this paper. We thank Kurt Berlin, Stephan Beck and Karen Novik Breast for the establishment of the Human Epigenome Project. The first author also wants to thank Jörn Walter for his kind support Brain methylation [%] 0 100 200 300 of his PhD thesis. Some data shown in this paper was provided CpG position [bases] by the Human Epigenome Project pilot study supported by the EU (QLRT1999-30417). Fig. 6. (a) Methylation profiles of the intragenic region of gene DAXX obtained from DNA samples extracted from brain, breast, liver, lung and muscle tissue samples. The gray shading repres- REFERENCES ents the different methylation rates as indicated on the scale bar. Samples from different individuals are arranged in columns, while Adorjan,P., Distler,J., Lipscher,E., Model,F., Muller,J., Pelet,C., each row represents one CpG within the amplificate. (b) Physical dis- Braun,A., Florl,A.R., Gutig,D., Grabs,G. et al. (2002) Tumour tances and average methylation of CpGs within the tissues. Blocks class prediction and discovery by microarray-based DNA methyl- of differentially co-methylated CpGs are highlighted. ation analysis.Nucleic Acids Res., 30, e21. Barton,G.J. (1993) An efficient algorithm to locate all locally optimal alignments between two sequences allowing for gaps. Comput. Appl. Biosci., 9, 729–734. Methylation in tissue samples Dahl,C. and Guldberg,P. (2003) DNA methylation analysis tech- Trace files produced in the The Human Epigenome Project by niques. Biogerontology, 4, 233–250. Human Epigenome Consortium et al. (2003) are processed Dear,S. and Staden,R. (1992) A standard file format for data from by the algorithm presented here. We present a dataset to DNA sequencing instruments. DNA Seq., 3, 107–110. demonstrate the capability of our method to detect differential Ehrlich,M. (2003) Expression of various genes is controlled by DNA methylation in real tissue samples. Figure 6 shows the methyl- methylation during mammalian development. J. Cell. Biochem., ation profiles obtained in brain, breast, liver, lung and muscle 88, 899–910. Methylation [%] 100 80 60 40 20 0 DAXX DAXX DAXX DAXX DAXX DAXX DAXX DAXX DAXX 1 J.Lewin et al. Frommer,M., McDonald,L.E., Millar,D.S., Collis,C.M., Watt,F., Paul,C.L. and Clark,S.J. (1996) Cytosine methylation: quantitation Grigg,G.W., Molloy,P.L. and Paul,C.L. (1992) A genomic sequen- by automated genomic sequencing and GENESCAN analysis. cing protocol that yields a positive display of 5-methylcytosine BioTechniques, 21, 126–133. residues in individual DNA strands.Proc. Natl Acad. Sci., USA, Qiu,P., Soder,G.J., Sanfiorenzo,V.J., Wang,L., Greene,J.R., 89, 1827–1831. Fritz,M.A. and Cai,X.Y. (2003) Quantification of single nucle- Human Epigenome Consortium, Epigenomics AG, The Welcome otide polymorphisms by automated DNA sequencing. Biochem. Trust Sanger Institute and Centre National de Genotypage (2003) Biophys. Res. Commun., 309, 331–338. Human Epigenome Project. Reik,W., Dean,W. and Walter,J. (2001) Epigenetic reprogramming Jones,P.A. (2002) DNA methylation and cancer. Oncogene, 21, in mammalian development. Science, 293, 1089–1093. 5358–5360. Siegmund,K.D. and Laird,P.W. (2002) Analysis of complex methyl- Olek,A., Oswald,J. and Walter,J. (1996) A modified and improved ation data.Methods, 27, 170–178. method for bisulphite based cytosine methylation analysis. Smith,T.F. and Waterman,M.S. (1981) Identification of common Nucleic Acids Res., 24, 5064–5066. molecular subsequences. J. Mol. Biol., 147, 195–197.
Bioinformatics – Oxford University Press
Published: Jul 9, 2004
You can share this free article with as many people as you like with the url below! We hope you enjoy this feature!
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
Copy and paste the desired citation format or use the link below to download a file formatted for EndNote
Access the full text.
Sign up today, get DeepDyve free for 14 days.
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.