Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles

methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles DNA methylation is a chemical modification of cytosine bases that is pivotal for gene regulation, cellular specification and cancer development. Here, we describe an R package, methylKit, that rapidly analyzes genome- wide cytosine epigenetic profiles from high-throughput methylation and hydroxymethylation sequencing experiments. methylKit includes functions for clustering, sample quality visualization, differential methylation analysis and annotation features, thus automating and simplifying many of the steps for discerning statistically significant bases or regions of DNA methylation. Finally, we demonstrate methylKit on breast cancer data, in which we find statistically significant regions of differential methylation and stratify tumor subtypes. methylKit is available at http://code.google.com/p/methylkit. Rationale methylation state of a gene is inherited from the parents, DNA methylation is a critical epigenetic modification that but de novo methylation also can occur in the early stages guides development, cellular differentiation and the mani- of development [8,9]. festation of some cancers [1,2]. Specifically, cytosine A common technique for measuring DNA methylation methylation is a widespread modification in the genome, is bisulfite sequencing, which has the advantage of pro- and it most often occurs in CpG dinucleotides, although viding single-base, quantitative cytosine methylation non-CpG cytosines are also methylated in certain tissues levels. In this technique, DNA is treated with sodium such as embryonic stem cells [3]. DNA methylation is one bisulfite, which deaminates cytosine residues to uracil, of the many epigenetic control mechanisms associated but leaves 5-methylcytosine residues unaffected. Single- with gene regulation. Specifically, cytosine methylation can base resolution, %methylation levels are then calculated directly hinder binding of transcription factors and methy- by counting the ratio of C/(C+T) at each base. There are lated bases can also be bound by methyl-binding-domain multiple techniques that leverage high-throughput bisul- proteins that recruit chromatin-remodeling factors [4,5]. fite sequencing such as: reduced representation bisulfite In addition, aberrant DNA methylation patterns have been sequencing (RRBS)[10] and its variants [11], whole- observed in many human malignancies and can also be genome shotgun bisulfite sequencing (BS-seq) [12], used to define the severity of leukemia subtypes [6]. In methylC-Seq [13], and target capture bisulfite sequencing malignant tissues, DNA is either hypo-methylated or [14]. In addition, 5-hydroxymethylcytosine (5hmC) levels hyper-methylated compared to the normal tissue. The can be measured through a modification of bisulfite sequencing techniques [15]. location of hyper- and hypo-methylated sites gives distinct signatures within many diseases [7]. Often, hypomethyla- Yet, as bisulfite sequencing techniques have expanded, tion is associated with gene activation and hypermethyla- there are few computational tools available to analyze the tion is associated with gene repression, although there data. Moreover, there is a need for an end-to-end analysis are many exceptions to this trend [7]. DNA methylation package with comprehensive features and ease of use. To is also involved in genomic imprinting, where the address this, we have created methylKit, a multi-threaded R package that can rapidly analyze and characterize data * Correspondence: [email protected]; [email protected] from many methylation experiments at once. methylKit Department of Physiology and Biophysics, 1305 York Ave., Weill Cornell can read DNA methylation information from a text file Medical College, New York, NY 10065, USA and also from alignment files (for example, SAM files) Full list of author information is available at the end of the article © 2012 Akalin et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Akalin et al. Genome Biology 2012, 13:R87 Page 2 of 9 http://genomebiology.com/2012/13/10/R87 and carry out operations such as differential methylation all analyses, it is a modular package that can be applied analysis, sample clustering and annotation, and visualiza- independent of any aligner. Currently, there are two tion of DNA methylation events (See Figure 1 for a dia- ways that information can be supplied to methylKit:: 1) gram of possible operations). methylKit has open-source methylKit can read per base methylation scores from a code and is available at [16] and as Additional file 1 (see text file (see Table 1 for an example of such a file); and, also Additional file 2 for the user guide and Additional 2) methylKit can read SAM format [21] alignments files file 3 for the package documentation ). Our data frame- obtained from Bismark aligner [22]. If a SAM file is sup- work is also extensible to emerging methods in quantiza- plied, methylkit first processes the alignment file to get tion of other base modifications, such as 5hmC [14], or %methylation scores and then reads that information sites discovered through single molecule sequencing into memory. [17,18]. For clarity, we describe only examples with DNA Most bisulfite experiments have a set of test and control methylation data. samples or samples across multiple conditions, and methylKit can read and store (in memory) methylation Flexible data integration and regional analysis data simultaneously for N-experiments, limited only by High-throughput bisulfite sequencing experiments typi- memory of the node or computer. The default setting of cally yield millions of reads with reduced complexity the processing algorithm requires that there be least 10 due to cytosine conversion, and there are several differ- reads covering a base and each of the bases covering the ent aligners suited for mapping these reads to the gen- genomicbaseposition haveatleast 20 PHREDquality ome(seeFrith et al.[19]andKrueger et al.[20]for score. Also, since DNA methylation can occur in CpG, a review and comparison between aligners). Since CHGand CHHcontexts(H=A,T, orC) [3],usersof methylKit only requires a methylation score per base for methylKit have the option to provide methylation Figure 1 Flowchart of possible operations by methylKit. A summary of the most important methylKit features is shown in a flow chart. It depicts the main features of methylKit and the sequential relationship between them. The functions that could be used for those features are also printed in the boxes. Akalin et al. Genome Biology 2012, 13:R87 Page 3 of 9 http://genomebiology.com/2012/13/10/R87 Table 1 Sample text file that can be read by methylKit. [24,25]. Methylation profiles of these cell lines were mea- sured using reduced RRBS [10]. The R objects contained chrBase chr base strand coverage freqC freqT the methylation information for breast cancer cell lines chr21.9764539 chr21 9764539 R 12 25 75 and functions that produce plots and other results that chr21.9764513 chr21 9764513 R 12 0 100 are shown in the remainder of this manuscript are in chr21.9820622 chr21 9820622 F 13 0 100 Additional file 4. chr21.9837545 chr21 9837545 F 11 0 100 chr21.9849022 chr21 9849022 F 124 72.58 27.42 Whole methylome characterization: descriptive chr21.9853326 chr21 9853326 F 17 70.59 29.41 statistics, sample correlation and clustering methylKit can read tab-delimited text files with the following format: the text Descriptive statistics on DNA methylation profiles file should include a unique.id, chromosome name, base position, strand, read coverage, % of C bases and % of T bases on that location. Read coverage per base and % methylation per base are the basic information contained in the methylKit data structures. methylKit has functions for easy visualization information for all these contexts: CpG, CHG and CHH of such information (Figure 2a and 2b for % methylation from SAM files. and read coverage distributions, respectively - for code see Additional file 4). In normal cells, % methylation will have Summarizing DNA methylation information over a bimodal distribution, which denotes that the majority of pre-defined regions or tiling windows bases have either high or low methylation. The read cover- Although base-pair resolution DNA methylation informa- age distribution is also an important metric that will help tion is obtained through most bisulfite sequencing experi- reveal if experiments suffer from PCR duplication bias ments, it might be desirable to summarize methylation information over tiling windows or over a set of prede- (clonal reads). If such bias occurs, some reads will be fined regions (promoters, CpG islands, introns, and so on). asymmetrically amplified and this will impair accurate For example, Smith et al. [9] investigated methylation pro- determination of % methylation scores for those regions. If files with RRBS experiments on gametes and zygote and there is a high degree of PCR duplication bias, read cover- summarized methylation information on 100bp tiles age distribution will have a secondary peak on the right across the genome. Their analysis revealed a unique set of side. To correct for this issue, methylKit has the option to differentially methylated regions maintained in early filter bases with very high read coverage. embryo. Using tiling windows or predefined regions, such as promoters or CpG islands, is desirable when there is Measuring and visualizing similarity between samples not enough coverage, when bases in close proximity will We have also included methods to assess sample similar- have similar methylation profiles, or where methylation ity. Users can calculate pairwise correlation coefficients (Pearson, Kendall or Spearman) between the %methylation properties of a region as a whole determines its function. profiles across all samples. However, to ensure comparable In accordance with these potential analytic foci, methylKit statistics, a new data structure is formed before these cal- provides functionality to do either analysis on tiling culations, wherein only cytosines covered in all samples windows across the genome or predefined regions of the are stored. Subsequently, pairwise correlations are calcu- genome. After reading the base pair methylation informa- lated, to produce a correlation matrix. This matrix allows tion, users can summarize the methylation information on the user to easily compare correlation coefficients between pre-defined regions they select or on tiling windows cover- pairs of samples and can also be used to perform hierarch- ing the genome (parameter for tiles are user provided). ical clustering using 1- correlation distance. methylKit can Then, subsequent analyses, such as clustering or differen- also further visualize similarities between all pairs of sam- tial methylation analysis, can be carried out with the same ples by creating scatterplots of the %methylation scores functions that are used for base pair resolution analysis. (Figure 3). These functions are essential for detecting sam- Example methylation data set: breast cancer cell lines ple outliers or for functional clustering of samples based We demonstrated the capabilities of methylKit using an on their molecular signatures. example data set from seven breast cancer cell lines from Sun et al.[23].Four ofthe cell linesexpressestrogen Hierarchical clustering of samples receptor-alpha (MCF7, T47D, BT474, ZR75-1), and from methylKit can also be used to cluster samples hierarchi- here on are referred to as ER+. The other three cell lines cally in a variety of ways. The user can specify the (BT20, MDA-MB-231, MDA-MB-468) do not express distance metric between samples (‘1 - correlation’‘Eucli- estrogen receptor-alpha, and from here on are referred to dean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’ or ‘min- as ER-. It has been previously shown that ER+ and ER- kowski’) as well as the agglomeration method to be used tumor samples have divergent gene expression profiles in the hierarchical clustering algorithm (for example, and that those profiles are associated with disease outcome ‘Ward’s method’,or ‘single/complete linkage’, and so on). Akalin et al. Genome Biology 2012, 13:R87 Page 4 of 9 http://genomebiology.com/2012/13/10/R87 Figure 2 Descriptive statistics per sample. (a) Histogram of %methylation per cytosine for ER+ T47D sample. Most of the bases have either high or low methylation. (b) Histogram of read coverage per cytosine for ER+ T47D sample. ER+, estrogen receptor-alpha expressing. available for clustering. An example of such a procedure Results can either be returned as a dendrogram object or a plot. Dendrogram plots will be color coded based on user (k-means clustering) is shown in Additional file 4. defined groupings of samples. For example, we found that most ER+ and ER- samples clustered together except Principal component analysis of samples MDMB231 (Figure 4a). Moreover, the user may be inter- methylKit can be used to perform Principal Component ested in employing other more model-intensive clustering Analysis (PCA) on the samples’ %-methylation profiles algorithms to their data. Users can easily obtain the % (see for example [26]). PCA can reduce the high dimen- methylation data from methylKit object and perform their sionality of a data set by transforming the large number own analysis with the multitude of R-packages already of regions to a few principal components. The principal components are ordered so that the first few retain most of the variation present in the original data and are often used to emphasize grouping structure in the data. For example, a plot of the first two or three princi- pal components could potentially reveal a biologically meaningful clustering of the samples. Before the PCA is performed, a new data matrix is formed, containing the samples and only those cytosines that are covered in all samples. After PCA, methylKit then returns to the user a ‘prcomp’ object, which can be used to extract and plot the principal components. We found that in the breast cancer data set, PCA reveals a similar clustering to the hierarchical clustering where MDMB231 is an outlier. Differential methylation calculation Parallelized methods for detecting significant methylation changes Differential methylation patterns have been previously described in malignancies [27-29] and can be used to dif- ferentiate cancer and normal cells [30]. In addition, nor- mal human tissues harbor unique DNA methylation profiles [7]. Differential DNA methylation is usually calcu- Figure 3 Scatter plots for sample pairs.Scatter plots of% methylation values for each pair in seven breast cancer cell lines. lated by comparing methylation levels between multiple Numbers on upper right corner denote pair-wise Pearson’s conditions, which can reveal important locations of diver- correlation scores. The histograms on the diagonal are % gent changes between a test and a control set. We have methylation histograms similar to Figure 2a for each sample. designed methylKit to implement two main methods for Akalin et al. Genome Biology 2012, 13:R87 Page 5 of 9 http://genomebiology.com/2012/13/10/R87 Figure 4 Sample clustering.(a) Hierarchical clustering of seven breast cancer methylation profiles using 1-Pearson’s correlation distance. (b) Principal Component Analysis (PCA) of seven breast cancer methylation profiles, plot shows principal component 1 and principal component 2 for each sample. Samples closer to each other in principal component space are similar in their methylation profiles. determining differential methylation across all regions: cytosine (DMC) or region (DMR). However, if the null logistic regression and Fisher’s exact test. However, the hypothesis is not rejected it implies no statistically signifi- data frames in methylKit can easily be used with other sta- cant difference in methylation between the two groups. tistical tests and an example is shown in Additional file 4 One important consideration in logistic regression is the (using a moderated t-test, although we maintain that most sample size and in many biological experiments the num- natural tests for this kind of data are Fisher’sexact and ber of biological samples in each group can be quite small. logistic regression based tests). For our example data set However, it is important to keep in mind that the relevant we compared ER+ to ER- samples, with our ‘control sample sizes in logistic regression are not merely the num- group’ being the ER- set. ber of biological samples but rather the total read cov- erages summed over all samples in each group separately. Method #1: logistic regression For our example dataset, we used bases with at least 10 In logistic regression, information from each sample is reads coverage for each biological sample and we advise specified (the number of methylated Cs and number of (at least) the same for other users to improve power to unmethylated Cs at a given region), and a logistic detect DMCs/DMRs. regression test will be applied to compare fraction of In addition, we have designed methylKit such that the logistic regression framework can be generalized to han- methylated Cs across the test and the control groups. More specifically, at a given base/region we model the dle more than two experimental groups or data types. methylation proportion P , for sample i= 1,...,n (where n In such a case, the inclusion of additional treatment is the number of biological samples) through the logistic indicators is analogous to multiple regression when regression model: there are categorical variables with multiple groups. Additional covariates can be incorporated into model log(P /(1 - P )) = β + β ∗ T (1) i i 0 1 i (1) by adding to the right side of the model: where T denotes the treatment indicator for sample i, T i i α ∗ Covariate + ... + α ∗ Covariate 1 1,i K K,i = 1 if sample i is in the treatment group and T = 0 if sam- ple i is in control group. The parameter b denotes the log where Covariate , ..., Covariate denote K measured 0 1,i K,i odds of the control group and b the log oddsratio covariates (continuous or categorical) for sample between the treatment and control group. Therefore, inde- i = 1,...,n and a ,..., a denote the corresponding 1 k pendent tests for all the bases/regions of interest are parameters. against the null hypothesis H : b = 0. If the null hypothesis 0 1 is rejected it implies that the logodds (and hence the Method #2: Fisher’s exact test methylation proportions) are different between the treat- The Fisher’s exact test compares the fraction of methy- ment and the control group and the base/region would lated Cs in test and control samples in the absence of subsequently be classified as a differentially methylated replicates. The main advantage of logistic regression Akalin et al. Genome Biology 2012, 13:R87 Page 6 of 9 http://genomebiology.com/2012/13/10/R87 over Fisher’s exact test is that it allows for the inclusion treatment vector, and hyper-/hypomethylation definitions of sample specific covariates (continuous or categorical) are based on that control group. and the ability to adjust for confounding variables. In Furthermore, DMCs/DMRs can be visualized as hori- practice, the number of samples per group will deter- zontal barplots showing percentage of hyper- and hypo- mine which of the two methods will be used (logistic methylated bases/regions out of covered cytosines over regression or Fisher’s exact test). If there are multiple all chromosomes (Figure 5a). We observed higher levels samples per group, methylKit will employ the logistic of hypomethylation than hypermethylation in the breast regression test. Otherwise, when there is one sample per cancer cell lines, which indicates that ER+ cells have lower levels of methylation. Since another common way group, Fisher’s exact test will be used. Following the differential methylation test and calcula- to visualize differential methylation events is with a gen- tion of P-values, methylKit will use the sliding linear ome browser, methylKit can output bedgraph tracks model (SLIM) method to correct P-values to q-values [31], (Figure 5b) for use with the UCSC Genome Browser or which corrects for the problem of multiple hypothesis test- Integrated Genome Viewer. ing [32,33]. However, we also implemented the standard false discovery rate (FDR)-based method (Benjamini- Annotating differential methylation events Hochberg) as an option for P-value correction, which is Annotation with gene models and CpG islands faster but more conservative. Finally, methylKit can use To discern the biological impact of differential methyla- multi-threading so that differential methylation calcula- tion events, each event must be put into its genomic tions can be parallelized over multiple cores and be com- context for subsequent analysis. Indeed, Hansen et al. pleted faster. [34] showed that most variable regions in terms of methylation in the human genome are CpG island Extraction and visualization of differential methylation shores, rather than CpG islands themselves. Thus, it is events interesting to know the location of differential methyla- We have designed methylKit to allow a user to specify the tion events with regard to CpG islands, their shores, and parameters that define the DMCs/DMRs based on: q- also the proximity to the nearest transcription start site value, %methylation difference, and type of differential (TSS) and gene components. Accordingly, methylKit can methylation (hypo-/hyper-). By default, it will extract annotate differential methylation events with regard to bases/regions with a q-value <0.01 and %methylation dif- the nearest TSS (Figure 6a) and it also can annotate ference >25%. These defaults can easily be changed when regions based on their overlap with CpG islands/shores calling get.methylDiff() function. In addition, users can spe- and regions within genes (Figures 6b and 6c are output cify if they want hyper-methylated bases/regions (bases/ from methylKit). regions with higher methylation compared to control sam- ples) or hypo-methylated bases/regions (bases/regions Annotation with custom regions with lower methylation compared to control samples). In As with most genome-wide assays, the regions of interest the literature, hyper- or hypo-methylated DMCs/DMRs for DNA methylation analysis may be quite numerous. are usually defined relative to a control group. In our For example, several reports show that Alu elements are examples, and in methylKit in general, a control group is aberrantly methylated in cancers [35,36] and enhancers defined when creating the objects through supplied are also differentially methylated [37,38]. Since users may Figure 5 Visualizing differential methylation events.(a) Horizontal bar plots show the number of hyper- and hypomethylation events per chromosome, as a percent of the sites with the minimum coverage and differential. By default this is a 25% change in methylation and all samples with 10X coverage. (b) Example of bedgraph file uploaded to UCSC browser. The bedraph file is for differentially methylated CpGs with at least a 25% difference and q-value <0.01. Hyper- and hypo-methylated bases are color coded. The bar heights correspond to % methylation difference between ER+ and ER- sets. ER+, estrogen receptor-alpha expressing; ER-, estrogen receptor-alpha non-expressing. UCSC, University of California Santa Cruz. Akalin et al. Genome Biology 2012, 13:R87 Page 7 of 9 http://genomebiology.com/2012/13/10/R87 Figure 6 Annotation of differentially methylated CpGs.(a) Distance to TSS for differentially methylated CpGs are plotted from ER+ versus ER- analysis. (b) Pie chart showing percentages of differentially methylated CpGs on promoters, exons, introns and intergenic regions. (c) Pie chart showing percentages of differentially methylated CpGs on CpG islands, CpG island shores (defined as 2kb flanks of CpG islands) and other regions outside of shores and CpG islands. (d) Pie chart showing percentages of differentially methylated CpGs on enhancers and other regions. ER+, estrogen receptor-alpha expressing; ER-, estrogen receptor-alpha non-expressing, TSS, transcription start site. need to focus on specific genomic regions and require Analyzing 5-hydroxymethylcytosine data with customized annotation for capturing differential DNA methylKit methylation events, methylKit can annotate differential 5-Hydroxymethylcytosine is a base modification asso- methylation events using user-supplied regions. As an ciated with pluropotency, hematopoiesis and certain example, we identified differentially methylated bases of brain tissues (reviewed in [40]). It is possible to measure ER+ and ER- cells that overlap with ENCODE enhancer base-pair resolution 5hmC levels using variations of tra- regions [39], and we found a large proportion of differen- ditional bisulfite sequencing. Recently, Yu et al. [41] and tially methylated CpGs overlapping with the enhancer Booth et al. [15] published similar methods for detecting marks, and then plotted them with methylKit (Figure 6d). 5hmC levels in base-pair resolution. Both methods Akalin et al. Genome Biology 2012, 13:R87 Page 8 of 9 http://genomebiology.com/2012/13/10/R87 Abbreviations require measuring 5hmC and 5mC levels simultaneously 5hmC: 5-hydroxymethylcytosine; 5mC: 5-methylcytosine; bp: base pair; BS- and use 5hmC levels as a substrate to deduce real 5mC seq,:bisulfite sequencing; DMC: differentially methylated cytosine; DMR: levels, since traditional bisulfite sequencing cannot dis- differentially methylated region; ER: estrogen receptor alpha; FDR: false discovery rate; PCA: principal component analysis; PCR: polymerase chain tinguish between the two [42]. However, both the 5hmC reaction; RRBS: reduced representation bisulfite sequencing; SLIM: sliding and 5mC data generated by these protocols are bisulfite linear model; TSS: transcription start site. sequencing based, and the alignments and text files of Acknowledgements 5hmC levels can be used directly in methylKit. Further- We wish to acknowledge the invaluable contribution of the WCMC more, methylKit has an adjust.methylC() function to Epigenomics Core Facility. MEF is supported by the Leukemia & Lymphoma adjust 5mC levels based on 5hmC levels as described in Society Special Fellow Award and a Doris Duke Clinical Scientist Development Award. FGB is supported by a Sass Foundation Judah Folkman Booth et al. [15]. Fellowship. AM is supported by an LLS SCOR grant (7132-08) and a Burroughs Wellcome Clinical Translational Scientist Award. AM and CEM are Customizing analysis with convenience functions supported by a Starr Cancer Consortium grant (I4-A442). CEM is supported by the National Institutes of Health (I4-A411, I4-A442, and 1R01NS076465-01). methylKit is dependent on Bioconductor [43] packages such as GenomicRanges and its objects are coercible to Author details GenomicRanges objects and regular R data structures Department of Physiology and Biophysics, 1305 York Ave., Weill Cornell Medical College, New York, NY 10065, USA. The HRH Prince Alwaleed Bin such as data frames via provided convenience functions. Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, 1305 That means users can integrate methylKit objects to York Ave., Weill Cornell Medical College, New York, NY 10065, USA. other Bioconductor and R packages and customize the Department of Public Health, Weill Cornell Medical College, 1300 York Ave., New York, NY 10065, USA. Department of Medicine, Division of analysis according to their needs or extend the analysis Hematology/Oncology, 1300 York Ave., Weill Cornell Medical College, New further by using other packages available in R. York, NY 10065, USA. Department of Pathology, University of Michigan, 109 Zina Pitcher Place, Ann Arbor, MI 48109, USA. Department of Pharmacology, 1300 York Ave., Weill Cornell Medical College, New York, NY 10065, USA. Conclusions Methods for detecting methylation across the genome Authors’ contributions are widely used in research laboratories, and they are AA designed methylKit, developed the first codebase, and added most features. MK designed the logistic regression based statistical test for methylKit also a substantial component of the National Institutes and worked on statistical modeling and initial clustering features. SL wrote of Health’s(NIH’s) EpiGenome roadmap and upcoming some of the features in methylKit and prepared plots for the manuscript. projects such as BLUEPRINT [44]. Thus, tools and tech- MEF, FGB and AM tested the code and provided initial data for development of methylKit. CEM supervised the work, tested code, and coordinated test data niques that enable researchers to process and utilize for validation. All authors have read and approved the manuscript for genome-wide methylation data in an easy and fast man- publication. ner will be of critical utility. Competing interests Here, we show a large set of tools and cross-sample The authors declare that they have no competing interests. analysis algorithms built into methylKit, our open-source, multi-threaded R package that can be used for any base- Received: 30 April 2012 Revised: 12 June 2012 Accepted: 3 October 2012 Published: 3 October 2012 level dataset of DNA methylation or base modifications, including 5hmC. We demonstrate its utility with breast References cancer RRBS samples, provide test data sets, and also 1. Deaton AM, Bird A: CpG islands and the regulation of transcription. Genes provide extensive documentation with the release. Dev 2011, 25:1010-2210. 2. Suzuki MM, Bird A: DNA methylation landscapes: provocative insights from epigenomics. Nat Rev Genet 2008, 9:465-476. Additional material 3. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo Q-M, Edsall L, Antosiewicz-Bourget J, Stewart R, Ruotti V, Millar AH, Thomson JA, Ren B, Ecker JR: Human DNA methylomes Additional file 1: methylKit v0.5.3. This version of methylKit is included at base resolution show widespread epigenomic differences. Nature for archival purposes only. Please download the most recent version 2009, 462:315-322. from [16]. 4. Bird AP, Wolffe AP: Methylation-induced repression–belts, braces, and Additional file 2: methylKit User Guide. A vignette file to accompany chromatin. Cell 1999, 99:451-454. the methylKit software package; the most recent software and vignette 5. Hendrich B, Bird A: Identification and characterization of a family of can be downloaded at [16]. mammalian methyl-CpG binding proteins. Mol Cell Biol 1998, 18:6538-6547. Additional file 3: methylKit documentation. Documentation for 6. Figueroa ME, Abdel-Wahab O, Lu C, Ward PS, Patel J, Shih A, Li Y, functions and classes in the methylKit software package; the most recent Bhagwat N, Vasanthakumar A, Fernandez HF, Tallman MS, Sun Z, Wolniak K, software and documentation can be downloaded at [16]. Peeters JK, Liu W, Choe SE, Fantin VR, Paietta E, Löwenberg B, Licht JD, Additional file 4: R script for example analysis. The file contains R Godley LA, Delwel R, Valk PJM, Thompson CB, Levine RL, Melnick A: commands that are needed to do analysis and to produce graphs used Leukemic IDH1 and IDH2 mutations result in a hypermethylation in this manuscript. The file contains both the commands and detailed phenotype, disrupt TET2 function, and impair hematopoietic comments on how those commands can be used. An up to date version differentiation. Cancer Cell 2010, 18:553-567. of this script will be consistently maintained at [16]. Akalin et al. Genome Biology 2012, 13:R87 Page 9 of 9 http://genomebiology.com/2012/13/10/R87 7. Fernandez AF, Assenov Y, Martin-Subero JI, Balint B, Siebert R, Taniguchi H, 28. Baylin SB, Herman JG: DNA hypermethylation in tumorigenesis: Yamamoto H, Hidalgo M, Tan A-C, Galm O, Ferrer I, Sanchez-Cespedes M, epigenetics joins genetics. Trends Genet 2000, 16:168-174. Villanueva A, Carmona J, Sanchez-Mut JV, Berdasco M, Moreno V, Capella G, 29. Costello JF, Frühwald MC, Smiraglia DJ, Rush LJ, Robertson GP, Gao X, Monk D, Ballestar E, Ropero S, Martinez R, Sanchez-Carbayo M, Prosper F, Wright FA, Feramisco JD, Peltomäki P, Lang JC, Schuller DE, Yu L, Agirre X, Fraga MF, Graña O, Perez-Jurado L, Mora J, Puig S, et al: A DNA Bloomfield CD, Caligiuri MA, Yates A, Nishikawa R, Su Huang H, Petrelli NJ, methylation fingerprint of 1628 human samples. Genome Res 2012, Zhang X, O’Dorisio MS, Held WA, Cavenee WK, Plass C: Aberrant CpG- 22:407-419. island methylation has non-random and tumour-type-specific patterns. 8. Li E, Beard C: Role for DNA methylation in genomic imprinting. Nature Nat Genet 2000, 24:132-138. 1993, 366:362-365. 30. Doi A, Park I-H, Wen B, Murakami P, Aryee MJ, Irizarry R, Herb B, Ladd- 9. Smith ZD, Chan MM, Mikkelsen TS, Gu H, Gnirke A, Regev A, Meissner A: A Acosta C, Rho J, Loewer S, Miller J, Schlaeger T, Daley GQ, Feinberg AP: unique regulatory phase of DNA methylation in the early mammalian Differential methylation of tissue- and cancer-specific CpG island shores embryo. Nature 2012, 484:339-344. distinguishes human induced pluripotent stem cells, embryonic stem 10. Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, Sivachenko A, Zhang X, cells and fibroblasts. Nat Genet 2009, 41:1350-1353. Bernstein BE, Nusbaum C, Jaffe DB, Gnirke A, Jaenisch R, Lander ES: 31. Wang H-Q, Tuominen LK, Tsai C-J: SLIM: a sliding linear model for Genome-scale DNA methylation maps of pluripotent and differentiated estimating the proportion of true null hypotheses in datasets with cells. Nature 2008, 454:766-770. dependence structures. Bioinformatics 2011, 27:225-231. 11. Akalin A, Garrett-Bakelman FE, Kormaksson M, Busuttil J, Zhang L, 32. Storey J: A direct approach to false discovery rates. J R Stat Soc Series B Khrebtukova I, Milne TA, Huang Y, Biswas D, Hess JL, Allis D, Roeder RG, Stat Methodol 2002, 64:479-498. Valk PJM, Lo B, Paietta E, Tallman MS, Schroth GP, Mason CE, Melnick A, 33. Storey JD, Tibshirani R: Statistical significance for genomewide studies. Figueroa ME: Base-pair resolution DNA methylation sequencing reveals Proc Natl Acad Sci USA 2003, 100:9440-9445. profoundly divergent epigenetic landscapes in acute myeloid leukemia. 34. Hansen KD, Timp W, Bravo HC, Sabunciyan S, Langmead B, McDonald OG, PLoS Genet 2012, 8:e1002781. Wen B, Wu H, Liu Y, Diep D, Briem E, Zhang K, Irizarry R a, Feinberg AP: 12. Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, Increased methylation variation in epigenetic domains across cancer Pradhan S, Nelson SF, Pellegrini M, Jacobsen SE: Shotgun bisulphite types. Nat Genet 2011, 43:768-775. sequencing of the Arabidopsis genome reveals DNA methylation 35. Ehrlich M: DNA hypomethylation in cancer cells. Epigenomics 2009, patterning. Nature 2008, 452:215-219. 1:239-259. 13. Lister R, O’Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, 36. Rodriguez J, Vives L, Jordà M, Morales C, Muñoz M, Vendrell E, Peinado MA: Ecker JR: Highly integrated single-base resolution maps of the Genome-wide tracking of unmethylated DNA Alu repeats in normal and epigenome in Arabidopsis. Cell 2008, 133:523-536. cancer cells. Nucleic Acids Res 2008, 36:770-784. 14. Ball MP, Li JB, Gao Y, Lee J-H, LeProust EM, Park I-H, Xie B, Daley GQ, 37. Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, Schöler A, Wirbelauer C, Church GM: Targeted and genome-scale strategies reveal gene-body Oakeley EJ, Gaidatzis D, Tiwari VK, Schübeler D: DNA-binding factors shape methylation signatures in human cells. Nat Biotechnol 2009, 27:361-368. the mouse methylome at distal regulatory regions. Nature 2011, 15. Booth MJ, Branco MR, Ficz G, Oxley D, Krueger F, Reik W, 480:490-495. Balasubramanian S: Quantitative sequencing of 5-methylcytosine and 38. Wiench M, John S, Baek S, Johnson TA, Sung M-H, Escobar T, Simmons CA, 5-hydroxymethylcytosine at single-base resolution. Science 2012, Pearce KH, Biddie SC, Sabo PJ, Thurman RE, Stamatoyannopoulos JA, 336:934-937. Hager GL: DNA methylation status predicts cell type-specific enhancer 16. methylKit. [http://code.google.com/p/methylkit]. activity. EMBO J 2011, 30:3028-3039. 17. Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J, 39. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Turner SW: Direct detection of DNA methylation during single-molecule, Zhang X, Wang L, Issner R, Coyne M, Ku M, Durham T, Kellis M, real-time sequencing. Nat Methods 2010, 7:461-465. Bernstein BE: Mapping and analysis of chromatin state dynamics in nine 18. Cherf GM, Lieberman KR, Rashid H, Lam CE, Karplus K, Akeson M: human cell types. Nature 2011, 473:43-49. Automated forward and reverse ratcheting of DNA in a nanopore at 5-Å 40. Branco MR, Ficz G, Reik W: Uncovering the role of 5- precision. Nat Biotechnol 2012, 30:344-348. hydroxymethylcytosine in the epigenome. Nat Rev Genet 2011, 13:7-13. 19. Frith MC, Mori R, Asai K: A mostly traditional approach improves 41. Yu M, Hon GC, Szulwach KE, Song C-X, Zhang L, Kim A, Li X, Dai Q, Shen Y, alignment of bisulfite-converted DNA. Nucleic Acids Res 2012, 40:e100. Park B, Min J-H, Jin P, Ren B, He C: Base-resolution analysis of 5- 20. Krueger F, Kreck B, Franke A, Andrews SR: DNA methylome analysis using hydroxymethylcytosine in the mammalian genome. Cell 2012, short bisulfite sequencing data. Nat Methods 2012, 9:145-151. 149:1368-1380. 21. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, 42. Huang Y, Pastor WA, Shen Y, Tahiliani M, Liu DR, Rao A: The behaviour of Abecasis G, Durbin R: The Sequence Alignment/Map format and 5-hydroxymethylcytosine in bisulfite sequencing. PloS One 2010, 5:e8888. SAMtools. Bioinformatics 2009, 25:2078-2079. 43. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, 22. Krueger F, Andrews SR: Bismark: a flexible aligner and methylation caller Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, for Bisulfite-Seq applications. Bioinformatics 2011, 27:1571-1572. Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, 23. Sun Z, Asmann YW, Kalari KR, Bot B, Eckel-Passow JE, Baker TR, Carr JM, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development Khrebtukova I, Luo S, Zhang L, Schroth GP, Perez EA, Thompson EA: for computational biology and bioinformatics. Genome Biol 2004, 5:R80. Integrated analysis of gene expression, CpG island methylation, and 44. Adams D, Altucci L, Antonarakis SE, Ballesteros J, Beck S, Bird A, Bock C, gene copy number in breast cancer cells by deep sequencing. PloS One Boehm B, Campo E, Caricasole A, Dahl F, Dermitzakis ET, Enver T, Esteller M, 2011, 6:e17490. Estivill X, Ferguson-Smith A, Fitzgibbon J, Flicek P, Giehl C, Graf T, 24. van ‘t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Grosveld F, Guigo R, Gut I, Helin K, Jarvius J, Küppers R, Lehrach H, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Lengauer T, Lernmark Å, Leslie D, et al: BLUEPRINT to decode the Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene epigenetic signature written in blood. Nat Biotechnol 2012, 30:224-226. expression profiling predicts clinical outcome of breast cancer. Nature doi:10.1186/gb-2012-13-10-R87 2002, 415:530-536. Cite this article as: Akalin et al.: methylKit: a comprehensive R package 25. Sotiriou C, Neo S-Y, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, for the analysis of genome-wide DNA methylation profiles. Genome Biology Fox SB, Harris AL, Liu ET: Breast cancer classification and prognosis based 2012 13:R87. on gene expression profiles from a population-based study. Proc Natl Acad Sci USA 2003, 100:10393-10398. 26. Joliffe I: Principal Component Analysis. 2 edition. New York, USA, Springer; 27. Esteller M, Corn PG, Baylin SB, Herman JG: A gene hypermethylation profile of human cancer. Cancer Res 2001, 61:3225-3229. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Genome Biology Springer Journals

methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles

Loading next page...
 
/lp/springer-journals/methylkit-a-comprehensive-r-package-for-the-analysis-of-genome-wide-PY5bS8zIdd

References (88)

Publisher
Springer Journals
Copyright
2012 Akalin et al.; licensee BioMed Central Ltd.
eISSN
1474-760X
DOI
10.1186/gb-2012-13-10-r87
Publisher site
See Article on Publisher Site

Abstract

DNA methylation is a chemical modification of cytosine bases that is pivotal for gene regulation, cellular specification and cancer development. Here, we describe an R package, methylKit, that rapidly analyzes genome- wide cytosine epigenetic profiles from high-throughput methylation and hydroxymethylation sequencing experiments. methylKit includes functions for clustering, sample quality visualization, differential methylation analysis and annotation features, thus automating and simplifying many of the steps for discerning statistically significant bases or regions of DNA methylation. Finally, we demonstrate methylKit on breast cancer data, in which we find statistically significant regions of differential methylation and stratify tumor subtypes. methylKit is available at http://code.google.com/p/methylkit. Rationale methylation state of a gene is inherited from the parents, DNA methylation is a critical epigenetic modification that but de novo methylation also can occur in the early stages guides development, cellular differentiation and the mani- of development [8,9]. festation of some cancers [1,2]. Specifically, cytosine A common technique for measuring DNA methylation methylation is a widespread modification in the genome, is bisulfite sequencing, which has the advantage of pro- and it most often occurs in CpG dinucleotides, although viding single-base, quantitative cytosine methylation non-CpG cytosines are also methylated in certain tissues levels. In this technique, DNA is treated with sodium such as embryonic stem cells [3]. DNA methylation is one bisulfite, which deaminates cytosine residues to uracil, of the many epigenetic control mechanisms associated but leaves 5-methylcytosine residues unaffected. Single- with gene regulation. Specifically, cytosine methylation can base resolution, %methylation levels are then calculated directly hinder binding of transcription factors and methy- by counting the ratio of C/(C+T) at each base. There are lated bases can also be bound by methyl-binding-domain multiple techniques that leverage high-throughput bisul- proteins that recruit chromatin-remodeling factors [4,5]. fite sequencing such as: reduced representation bisulfite In addition, aberrant DNA methylation patterns have been sequencing (RRBS)[10] and its variants [11], whole- observed in many human malignancies and can also be genome shotgun bisulfite sequencing (BS-seq) [12], used to define the severity of leukemia subtypes [6]. In methylC-Seq [13], and target capture bisulfite sequencing malignant tissues, DNA is either hypo-methylated or [14]. In addition, 5-hydroxymethylcytosine (5hmC) levels hyper-methylated compared to the normal tissue. The can be measured through a modification of bisulfite sequencing techniques [15]. location of hyper- and hypo-methylated sites gives distinct signatures within many diseases [7]. Often, hypomethyla- Yet, as bisulfite sequencing techniques have expanded, tion is associated with gene activation and hypermethyla- there are few computational tools available to analyze the tion is associated with gene repression, although there data. Moreover, there is a need for an end-to-end analysis are many exceptions to this trend [7]. DNA methylation package with comprehensive features and ease of use. To is also involved in genomic imprinting, where the address this, we have created methylKit, a multi-threaded R package that can rapidly analyze and characterize data * Correspondence: [email protected]; [email protected] from many methylation experiments at once. methylKit Department of Physiology and Biophysics, 1305 York Ave., Weill Cornell can read DNA methylation information from a text file Medical College, New York, NY 10065, USA and also from alignment files (for example, SAM files) Full list of author information is available at the end of the article © 2012 Akalin et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Akalin et al. Genome Biology 2012, 13:R87 Page 2 of 9 http://genomebiology.com/2012/13/10/R87 and carry out operations such as differential methylation all analyses, it is a modular package that can be applied analysis, sample clustering and annotation, and visualiza- independent of any aligner. Currently, there are two tion of DNA methylation events (See Figure 1 for a dia- ways that information can be supplied to methylKit:: 1) gram of possible operations). methylKit has open-source methylKit can read per base methylation scores from a code and is available at [16] and as Additional file 1 (see text file (see Table 1 for an example of such a file); and, also Additional file 2 for the user guide and Additional 2) methylKit can read SAM format [21] alignments files file 3 for the package documentation ). Our data frame- obtained from Bismark aligner [22]. If a SAM file is sup- work is also extensible to emerging methods in quantiza- plied, methylkit first processes the alignment file to get tion of other base modifications, such as 5hmC [14], or %methylation scores and then reads that information sites discovered through single molecule sequencing into memory. [17,18]. For clarity, we describe only examples with DNA Most bisulfite experiments have a set of test and control methylation data. samples or samples across multiple conditions, and methylKit can read and store (in memory) methylation Flexible data integration and regional analysis data simultaneously for N-experiments, limited only by High-throughput bisulfite sequencing experiments typi- memory of the node or computer. The default setting of cally yield millions of reads with reduced complexity the processing algorithm requires that there be least 10 due to cytosine conversion, and there are several differ- reads covering a base and each of the bases covering the ent aligners suited for mapping these reads to the gen- genomicbaseposition haveatleast 20 PHREDquality ome(seeFrith et al.[19]andKrueger et al.[20]for score. Also, since DNA methylation can occur in CpG, a review and comparison between aligners). Since CHGand CHHcontexts(H=A,T, orC) [3],usersof methylKit only requires a methylation score per base for methylKit have the option to provide methylation Figure 1 Flowchart of possible operations by methylKit. A summary of the most important methylKit features is shown in a flow chart. It depicts the main features of methylKit and the sequential relationship between them. The functions that could be used for those features are also printed in the boxes. Akalin et al. Genome Biology 2012, 13:R87 Page 3 of 9 http://genomebiology.com/2012/13/10/R87 Table 1 Sample text file that can be read by methylKit. [24,25]. Methylation profiles of these cell lines were mea- sured using reduced RRBS [10]. The R objects contained chrBase chr base strand coverage freqC freqT the methylation information for breast cancer cell lines chr21.9764539 chr21 9764539 R 12 25 75 and functions that produce plots and other results that chr21.9764513 chr21 9764513 R 12 0 100 are shown in the remainder of this manuscript are in chr21.9820622 chr21 9820622 F 13 0 100 Additional file 4. chr21.9837545 chr21 9837545 F 11 0 100 chr21.9849022 chr21 9849022 F 124 72.58 27.42 Whole methylome characterization: descriptive chr21.9853326 chr21 9853326 F 17 70.59 29.41 statistics, sample correlation and clustering methylKit can read tab-delimited text files with the following format: the text Descriptive statistics on DNA methylation profiles file should include a unique.id, chromosome name, base position, strand, read coverage, % of C bases and % of T bases on that location. Read coverage per base and % methylation per base are the basic information contained in the methylKit data structures. methylKit has functions for easy visualization information for all these contexts: CpG, CHG and CHH of such information (Figure 2a and 2b for % methylation from SAM files. and read coverage distributions, respectively - for code see Additional file 4). In normal cells, % methylation will have Summarizing DNA methylation information over a bimodal distribution, which denotes that the majority of pre-defined regions or tiling windows bases have either high or low methylation. The read cover- Although base-pair resolution DNA methylation informa- age distribution is also an important metric that will help tion is obtained through most bisulfite sequencing experi- reveal if experiments suffer from PCR duplication bias ments, it might be desirable to summarize methylation information over tiling windows or over a set of prede- (clonal reads). If such bias occurs, some reads will be fined regions (promoters, CpG islands, introns, and so on). asymmetrically amplified and this will impair accurate For example, Smith et al. [9] investigated methylation pro- determination of % methylation scores for those regions. If files with RRBS experiments on gametes and zygote and there is a high degree of PCR duplication bias, read cover- summarized methylation information on 100bp tiles age distribution will have a secondary peak on the right across the genome. Their analysis revealed a unique set of side. To correct for this issue, methylKit has the option to differentially methylated regions maintained in early filter bases with very high read coverage. embryo. Using tiling windows or predefined regions, such as promoters or CpG islands, is desirable when there is Measuring and visualizing similarity between samples not enough coverage, when bases in close proximity will We have also included methods to assess sample similar- have similar methylation profiles, or where methylation ity. Users can calculate pairwise correlation coefficients (Pearson, Kendall or Spearman) between the %methylation properties of a region as a whole determines its function. profiles across all samples. However, to ensure comparable In accordance with these potential analytic foci, methylKit statistics, a new data structure is formed before these cal- provides functionality to do either analysis on tiling culations, wherein only cytosines covered in all samples windows across the genome or predefined regions of the are stored. Subsequently, pairwise correlations are calcu- genome. After reading the base pair methylation informa- lated, to produce a correlation matrix. This matrix allows tion, users can summarize the methylation information on the user to easily compare correlation coefficients between pre-defined regions they select or on tiling windows cover- pairs of samples and can also be used to perform hierarch- ing the genome (parameter for tiles are user provided). ical clustering using 1- correlation distance. methylKit can Then, subsequent analyses, such as clustering or differen- also further visualize similarities between all pairs of sam- tial methylation analysis, can be carried out with the same ples by creating scatterplots of the %methylation scores functions that are used for base pair resolution analysis. (Figure 3). These functions are essential for detecting sam- Example methylation data set: breast cancer cell lines ple outliers or for functional clustering of samples based We demonstrated the capabilities of methylKit using an on their molecular signatures. example data set from seven breast cancer cell lines from Sun et al.[23].Four ofthe cell linesexpressestrogen Hierarchical clustering of samples receptor-alpha (MCF7, T47D, BT474, ZR75-1), and from methylKit can also be used to cluster samples hierarchi- here on are referred to as ER+. The other three cell lines cally in a variety of ways. The user can specify the (BT20, MDA-MB-231, MDA-MB-468) do not express distance metric between samples (‘1 - correlation’‘Eucli- estrogen receptor-alpha, and from here on are referred to dean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’ or ‘min- as ER-. It has been previously shown that ER+ and ER- kowski’) as well as the agglomeration method to be used tumor samples have divergent gene expression profiles in the hierarchical clustering algorithm (for example, and that those profiles are associated with disease outcome ‘Ward’s method’,or ‘single/complete linkage’, and so on). Akalin et al. Genome Biology 2012, 13:R87 Page 4 of 9 http://genomebiology.com/2012/13/10/R87 Figure 2 Descriptive statistics per sample. (a) Histogram of %methylation per cytosine for ER+ T47D sample. Most of the bases have either high or low methylation. (b) Histogram of read coverage per cytosine for ER+ T47D sample. ER+, estrogen receptor-alpha expressing. available for clustering. An example of such a procedure Results can either be returned as a dendrogram object or a plot. Dendrogram plots will be color coded based on user (k-means clustering) is shown in Additional file 4. defined groupings of samples. For example, we found that most ER+ and ER- samples clustered together except Principal component analysis of samples MDMB231 (Figure 4a). Moreover, the user may be inter- methylKit can be used to perform Principal Component ested in employing other more model-intensive clustering Analysis (PCA) on the samples’ %-methylation profiles algorithms to their data. Users can easily obtain the % (see for example [26]). PCA can reduce the high dimen- methylation data from methylKit object and perform their sionality of a data set by transforming the large number own analysis with the multitude of R-packages already of regions to a few principal components. The principal components are ordered so that the first few retain most of the variation present in the original data and are often used to emphasize grouping structure in the data. For example, a plot of the first two or three princi- pal components could potentially reveal a biologically meaningful clustering of the samples. Before the PCA is performed, a new data matrix is formed, containing the samples and only those cytosines that are covered in all samples. After PCA, methylKit then returns to the user a ‘prcomp’ object, which can be used to extract and plot the principal components. We found that in the breast cancer data set, PCA reveals a similar clustering to the hierarchical clustering where MDMB231 is an outlier. Differential methylation calculation Parallelized methods for detecting significant methylation changes Differential methylation patterns have been previously described in malignancies [27-29] and can be used to dif- ferentiate cancer and normal cells [30]. In addition, nor- mal human tissues harbor unique DNA methylation profiles [7]. Differential DNA methylation is usually calcu- Figure 3 Scatter plots for sample pairs.Scatter plots of% methylation values for each pair in seven breast cancer cell lines. lated by comparing methylation levels between multiple Numbers on upper right corner denote pair-wise Pearson’s conditions, which can reveal important locations of diver- correlation scores. The histograms on the diagonal are % gent changes between a test and a control set. We have methylation histograms similar to Figure 2a for each sample. designed methylKit to implement two main methods for Akalin et al. Genome Biology 2012, 13:R87 Page 5 of 9 http://genomebiology.com/2012/13/10/R87 Figure 4 Sample clustering.(a) Hierarchical clustering of seven breast cancer methylation profiles using 1-Pearson’s correlation distance. (b) Principal Component Analysis (PCA) of seven breast cancer methylation profiles, plot shows principal component 1 and principal component 2 for each sample. Samples closer to each other in principal component space are similar in their methylation profiles. determining differential methylation across all regions: cytosine (DMC) or region (DMR). However, if the null logistic regression and Fisher’s exact test. However, the hypothesis is not rejected it implies no statistically signifi- data frames in methylKit can easily be used with other sta- cant difference in methylation between the two groups. tistical tests and an example is shown in Additional file 4 One important consideration in logistic regression is the (using a moderated t-test, although we maintain that most sample size and in many biological experiments the num- natural tests for this kind of data are Fisher’sexact and ber of biological samples in each group can be quite small. logistic regression based tests). For our example data set However, it is important to keep in mind that the relevant we compared ER+ to ER- samples, with our ‘control sample sizes in logistic regression are not merely the num- group’ being the ER- set. ber of biological samples but rather the total read cov- erages summed over all samples in each group separately. Method #1: logistic regression For our example dataset, we used bases with at least 10 In logistic regression, information from each sample is reads coverage for each biological sample and we advise specified (the number of methylated Cs and number of (at least) the same for other users to improve power to unmethylated Cs at a given region), and a logistic detect DMCs/DMRs. regression test will be applied to compare fraction of In addition, we have designed methylKit such that the logistic regression framework can be generalized to han- methylated Cs across the test and the control groups. More specifically, at a given base/region we model the dle more than two experimental groups or data types. methylation proportion P , for sample i= 1,...,n (where n In such a case, the inclusion of additional treatment is the number of biological samples) through the logistic indicators is analogous to multiple regression when regression model: there are categorical variables with multiple groups. Additional covariates can be incorporated into model log(P /(1 - P )) = β + β ∗ T (1) i i 0 1 i (1) by adding to the right side of the model: where T denotes the treatment indicator for sample i, T i i α ∗ Covariate + ... + α ∗ Covariate 1 1,i K K,i = 1 if sample i is in the treatment group and T = 0 if sam- ple i is in control group. The parameter b denotes the log where Covariate , ..., Covariate denote K measured 0 1,i K,i odds of the control group and b the log oddsratio covariates (continuous or categorical) for sample between the treatment and control group. Therefore, inde- i = 1,...,n and a ,..., a denote the corresponding 1 k pendent tests for all the bases/regions of interest are parameters. against the null hypothesis H : b = 0. If the null hypothesis 0 1 is rejected it implies that the logodds (and hence the Method #2: Fisher’s exact test methylation proportions) are different between the treat- The Fisher’s exact test compares the fraction of methy- ment and the control group and the base/region would lated Cs in test and control samples in the absence of subsequently be classified as a differentially methylated replicates. The main advantage of logistic regression Akalin et al. Genome Biology 2012, 13:R87 Page 6 of 9 http://genomebiology.com/2012/13/10/R87 over Fisher’s exact test is that it allows for the inclusion treatment vector, and hyper-/hypomethylation definitions of sample specific covariates (continuous or categorical) are based on that control group. and the ability to adjust for confounding variables. In Furthermore, DMCs/DMRs can be visualized as hori- practice, the number of samples per group will deter- zontal barplots showing percentage of hyper- and hypo- mine which of the two methods will be used (logistic methylated bases/regions out of covered cytosines over regression or Fisher’s exact test). If there are multiple all chromosomes (Figure 5a). We observed higher levels samples per group, methylKit will employ the logistic of hypomethylation than hypermethylation in the breast regression test. Otherwise, when there is one sample per cancer cell lines, which indicates that ER+ cells have lower levels of methylation. Since another common way group, Fisher’s exact test will be used. Following the differential methylation test and calcula- to visualize differential methylation events is with a gen- tion of P-values, methylKit will use the sliding linear ome browser, methylKit can output bedgraph tracks model (SLIM) method to correct P-values to q-values [31], (Figure 5b) for use with the UCSC Genome Browser or which corrects for the problem of multiple hypothesis test- Integrated Genome Viewer. ing [32,33]. However, we also implemented the standard false discovery rate (FDR)-based method (Benjamini- Annotating differential methylation events Hochberg) as an option for P-value correction, which is Annotation with gene models and CpG islands faster but more conservative. Finally, methylKit can use To discern the biological impact of differential methyla- multi-threading so that differential methylation calcula- tion events, each event must be put into its genomic tions can be parallelized over multiple cores and be com- context for subsequent analysis. Indeed, Hansen et al. pleted faster. [34] showed that most variable regions in terms of methylation in the human genome are CpG island Extraction and visualization of differential methylation shores, rather than CpG islands themselves. Thus, it is events interesting to know the location of differential methyla- We have designed methylKit to allow a user to specify the tion events with regard to CpG islands, their shores, and parameters that define the DMCs/DMRs based on: q- also the proximity to the nearest transcription start site value, %methylation difference, and type of differential (TSS) and gene components. Accordingly, methylKit can methylation (hypo-/hyper-). By default, it will extract annotate differential methylation events with regard to bases/regions with a q-value <0.01 and %methylation dif- the nearest TSS (Figure 6a) and it also can annotate ference >25%. These defaults can easily be changed when regions based on their overlap with CpG islands/shores calling get.methylDiff() function. In addition, users can spe- and regions within genes (Figures 6b and 6c are output cify if they want hyper-methylated bases/regions (bases/ from methylKit). regions with higher methylation compared to control sam- ples) or hypo-methylated bases/regions (bases/regions Annotation with custom regions with lower methylation compared to control samples). In As with most genome-wide assays, the regions of interest the literature, hyper- or hypo-methylated DMCs/DMRs for DNA methylation analysis may be quite numerous. are usually defined relative to a control group. In our For example, several reports show that Alu elements are examples, and in methylKit in general, a control group is aberrantly methylated in cancers [35,36] and enhancers defined when creating the objects through supplied are also differentially methylated [37,38]. Since users may Figure 5 Visualizing differential methylation events.(a) Horizontal bar plots show the number of hyper- and hypomethylation events per chromosome, as a percent of the sites with the minimum coverage and differential. By default this is a 25% change in methylation and all samples with 10X coverage. (b) Example of bedgraph file uploaded to UCSC browser. The bedraph file is for differentially methylated CpGs with at least a 25% difference and q-value <0.01. Hyper- and hypo-methylated bases are color coded. The bar heights correspond to % methylation difference between ER+ and ER- sets. ER+, estrogen receptor-alpha expressing; ER-, estrogen receptor-alpha non-expressing. UCSC, University of California Santa Cruz. Akalin et al. Genome Biology 2012, 13:R87 Page 7 of 9 http://genomebiology.com/2012/13/10/R87 Figure 6 Annotation of differentially methylated CpGs.(a) Distance to TSS for differentially methylated CpGs are plotted from ER+ versus ER- analysis. (b) Pie chart showing percentages of differentially methylated CpGs on promoters, exons, introns and intergenic regions. (c) Pie chart showing percentages of differentially methylated CpGs on CpG islands, CpG island shores (defined as 2kb flanks of CpG islands) and other regions outside of shores and CpG islands. (d) Pie chart showing percentages of differentially methylated CpGs on enhancers and other regions. ER+, estrogen receptor-alpha expressing; ER-, estrogen receptor-alpha non-expressing, TSS, transcription start site. need to focus on specific genomic regions and require Analyzing 5-hydroxymethylcytosine data with customized annotation for capturing differential DNA methylKit methylation events, methylKit can annotate differential 5-Hydroxymethylcytosine is a base modification asso- methylation events using user-supplied regions. As an ciated with pluropotency, hematopoiesis and certain example, we identified differentially methylated bases of brain tissues (reviewed in [40]). It is possible to measure ER+ and ER- cells that overlap with ENCODE enhancer base-pair resolution 5hmC levels using variations of tra- regions [39], and we found a large proportion of differen- ditional bisulfite sequencing. Recently, Yu et al. [41] and tially methylated CpGs overlapping with the enhancer Booth et al. [15] published similar methods for detecting marks, and then plotted them with methylKit (Figure 6d). 5hmC levels in base-pair resolution. Both methods Akalin et al. Genome Biology 2012, 13:R87 Page 8 of 9 http://genomebiology.com/2012/13/10/R87 Abbreviations require measuring 5hmC and 5mC levels simultaneously 5hmC: 5-hydroxymethylcytosine; 5mC: 5-methylcytosine; bp: base pair; BS- and use 5hmC levels as a substrate to deduce real 5mC seq,:bisulfite sequencing; DMC: differentially methylated cytosine; DMR: levels, since traditional bisulfite sequencing cannot dis- differentially methylated region; ER: estrogen receptor alpha; FDR: false discovery rate; PCA: principal component analysis; PCR: polymerase chain tinguish between the two [42]. However, both the 5hmC reaction; RRBS: reduced representation bisulfite sequencing; SLIM: sliding and 5mC data generated by these protocols are bisulfite linear model; TSS: transcription start site. sequencing based, and the alignments and text files of Acknowledgements 5hmC levels can be used directly in methylKit. Further- We wish to acknowledge the invaluable contribution of the WCMC more, methylKit has an adjust.methylC() function to Epigenomics Core Facility. MEF is supported by the Leukemia & Lymphoma adjust 5mC levels based on 5hmC levels as described in Society Special Fellow Award and a Doris Duke Clinical Scientist Development Award. FGB is supported by a Sass Foundation Judah Folkman Booth et al. [15]. Fellowship. AM is supported by an LLS SCOR grant (7132-08) and a Burroughs Wellcome Clinical Translational Scientist Award. AM and CEM are Customizing analysis with convenience functions supported by a Starr Cancer Consortium grant (I4-A442). CEM is supported by the National Institutes of Health (I4-A411, I4-A442, and 1R01NS076465-01). methylKit is dependent on Bioconductor [43] packages such as GenomicRanges and its objects are coercible to Author details GenomicRanges objects and regular R data structures Department of Physiology and Biophysics, 1305 York Ave., Weill Cornell Medical College, New York, NY 10065, USA. The HRH Prince Alwaleed Bin such as data frames via provided convenience functions. Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, 1305 That means users can integrate methylKit objects to York Ave., Weill Cornell Medical College, New York, NY 10065, USA. other Bioconductor and R packages and customize the Department of Public Health, Weill Cornell Medical College, 1300 York Ave., New York, NY 10065, USA. Department of Medicine, Division of analysis according to their needs or extend the analysis Hematology/Oncology, 1300 York Ave., Weill Cornell Medical College, New further by using other packages available in R. York, NY 10065, USA. Department of Pathology, University of Michigan, 109 Zina Pitcher Place, Ann Arbor, MI 48109, USA. Department of Pharmacology, 1300 York Ave., Weill Cornell Medical College, New York, NY 10065, USA. Conclusions Methods for detecting methylation across the genome Authors’ contributions are widely used in research laboratories, and they are AA designed methylKit, developed the first codebase, and added most features. MK designed the logistic regression based statistical test for methylKit also a substantial component of the National Institutes and worked on statistical modeling and initial clustering features. SL wrote of Health’s(NIH’s) EpiGenome roadmap and upcoming some of the features in methylKit and prepared plots for the manuscript. projects such as BLUEPRINT [44]. Thus, tools and tech- MEF, FGB and AM tested the code and provided initial data for development of methylKit. CEM supervised the work, tested code, and coordinated test data niques that enable researchers to process and utilize for validation. All authors have read and approved the manuscript for genome-wide methylation data in an easy and fast man- publication. ner will be of critical utility. Competing interests Here, we show a large set of tools and cross-sample The authors declare that they have no competing interests. analysis algorithms built into methylKit, our open-source, multi-threaded R package that can be used for any base- Received: 30 April 2012 Revised: 12 June 2012 Accepted: 3 October 2012 Published: 3 October 2012 level dataset of DNA methylation or base modifications, including 5hmC. We demonstrate its utility with breast References cancer RRBS samples, provide test data sets, and also 1. Deaton AM, Bird A: CpG islands and the regulation of transcription. Genes provide extensive documentation with the release. Dev 2011, 25:1010-2210. 2. Suzuki MM, Bird A: DNA methylation landscapes: provocative insights from epigenomics. Nat Rev Genet 2008, 9:465-476. Additional material 3. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo Q-M, Edsall L, Antosiewicz-Bourget J, Stewart R, Ruotti V, Millar AH, Thomson JA, Ren B, Ecker JR: Human DNA methylomes Additional file 1: methylKit v0.5.3. This version of methylKit is included at base resolution show widespread epigenomic differences. Nature for archival purposes only. Please download the most recent version 2009, 462:315-322. from [16]. 4. Bird AP, Wolffe AP: Methylation-induced repression–belts, braces, and Additional file 2: methylKit User Guide. A vignette file to accompany chromatin. Cell 1999, 99:451-454. the methylKit software package; the most recent software and vignette 5. Hendrich B, Bird A: Identification and characterization of a family of can be downloaded at [16]. mammalian methyl-CpG binding proteins. Mol Cell Biol 1998, 18:6538-6547. Additional file 3: methylKit documentation. Documentation for 6. Figueroa ME, Abdel-Wahab O, Lu C, Ward PS, Patel J, Shih A, Li Y, functions and classes in the methylKit software package; the most recent Bhagwat N, Vasanthakumar A, Fernandez HF, Tallman MS, Sun Z, Wolniak K, software and documentation can be downloaded at [16]. Peeters JK, Liu W, Choe SE, Fantin VR, Paietta E, Löwenberg B, Licht JD, Additional file 4: R script for example analysis. The file contains R Godley LA, Delwel R, Valk PJM, Thompson CB, Levine RL, Melnick A: commands that are needed to do analysis and to produce graphs used Leukemic IDH1 and IDH2 mutations result in a hypermethylation in this manuscript. The file contains both the commands and detailed phenotype, disrupt TET2 function, and impair hematopoietic comments on how those commands can be used. An up to date version differentiation. Cancer Cell 2010, 18:553-567. of this script will be consistently maintained at [16]. Akalin et al. Genome Biology 2012, 13:R87 Page 9 of 9 http://genomebiology.com/2012/13/10/R87 7. Fernandez AF, Assenov Y, Martin-Subero JI, Balint B, Siebert R, Taniguchi H, 28. Baylin SB, Herman JG: DNA hypermethylation in tumorigenesis: Yamamoto H, Hidalgo M, Tan A-C, Galm O, Ferrer I, Sanchez-Cespedes M, epigenetics joins genetics. Trends Genet 2000, 16:168-174. Villanueva A, Carmona J, Sanchez-Mut JV, Berdasco M, Moreno V, Capella G, 29. Costello JF, Frühwald MC, Smiraglia DJ, Rush LJ, Robertson GP, Gao X, Monk D, Ballestar E, Ropero S, Martinez R, Sanchez-Carbayo M, Prosper F, Wright FA, Feramisco JD, Peltomäki P, Lang JC, Schuller DE, Yu L, Agirre X, Fraga MF, Graña O, Perez-Jurado L, Mora J, Puig S, et al: A DNA Bloomfield CD, Caligiuri MA, Yates A, Nishikawa R, Su Huang H, Petrelli NJ, methylation fingerprint of 1628 human samples. Genome Res 2012, Zhang X, O’Dorisio MS, Held WA, Cavenee WK, Plass C: Aberrant CpG- 22:407-419. island methylation has non-random and tumour-type-specific patterns. 8. Li E, Beard C: Role for DNA methylation in genomic imprinting. Nature Nat Genet 2000, 24:132-138. 1993, 366:362-365. 30. Doi A, Park I-H, Wen B, Murakami P, Aryee MJ, Irizarry R, Herb B, Ladd- 9. Smith ZD, Chan MM, Mikkelsen TS, Gu H, Gnirke A, Regev A, Meissner A: A Acosta C, Rho J, Loewer S, Miller J, Schlaeger T, Daley GQ, Feinberg AP: unique regulatory phase of DNA methylation in the early mammalian Differential methylation of tissue- and cancer-specific CpG island shores embryo. Nature 2012, 484:339-344. distinguishes human induced pluripotent stem cells, embryonic stem 10. Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, Sivachenko A, Zhang X, cells and fibroblasts. Nat Genet 2009, 41:1350-1353. Bernstein BE, Nusbaum C, Jaffe DB, Gnirke A, Jaenisch R, Lander ES: 31. Wang H-Q, Tuominen LK, Tsai C-J: SLIM: a sliding linear model for Genome-scale DNA methylation maps of pluripotent and differentiated estimating the proportion of true null hypotheses in datasets with cells. Nature 2008, 454:766-770. dependence structures. Bioinformatics 2011, 27:225-231. 11. Akalin A, Garrett-Bakelman FE, Kormaksson M, Busuttil J, Zhang L, 32. Storey J: A direct approach to false discovery rates. J R Stat Soc Series B Khrebtukova I, Milne TA, Huang Y, Biswas D, Hess JL, Allis D, Roeder RG, Stat Methodol 2002, 64:479-498. Valk PJM, Lo B, Paietta E, Tallman MS, Schroth GP, Mason CE, Melnick A, 33. Storey JD, Tibshirani R: Statistical significance for genomewide studies. Figueroa ME: Base-pair resolution DNA methylation sequencing reveals Proc Natl Acad Sci USA 2003, 100:9440-9445. profoundly divergent epigenetic landscapes in acute myeloid leukemia. 34. Hansen KD, Timp W, Bravo HC, Sabunciyan S, Langmead B, McDonald OG, PLoS Genet 2012, 8:e1002781. Wen B, Wu H, Liu Y, Diep D, Briem E, Zhang K, Irizarry R a, Feinberg AP: 12. Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, Increased methylation variation in epigenetic domains across cancer Pradhan S, Nelson SF, Pellegrini M, Jacobsen SE: Shotgun bisulphite types. Nat Genet 2011, 43:768-775. sequencing of the Arabidopsis genome reveals DNA methylation 35. Ehrlich M: DNA hypomethylation in cancer cells. Epigenomics 2009, patterning. Nature 2008, 452:215-219. 1:239-259. 13. Lister R, O’Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, 36. Rodriguez J, Vives L, Jordà M, Morales C, Muñoz M, Vendrell E, Peinado MA: Ecker JR: Highly integrated single-base resolution maps of the Genome-wide tracking of unmethylated DNA Alu repeats in normal and epigenome in Arabidopsis. Cell 2008, 133:523-536. cancer cells. Nucleic Acids Res 2008, 36:770-784. 14. Ball MP, Li JB, Gao Y, Lee J-H, LeProust EM, Park I-H, Xie B, Daley GQ, 37. Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, Schöler A, Wirbelauer C, Church GM: Targeted and genome-scale strategies reveal gene-body Oakeley EJ, Gaidatzis D, Tiwari VK, Schübeler D: DNA-binding factors shape methylation signatures in human cells. Nat Biotechnol 2009, 27:361-368. the mouse methylome at distal regulatory regions. Nature 2011, 15. Booth MJ, Branco MR, Ficz G, Oxley D, Krueger F, Reik W, 480:490-495. Balasubramanian S: Quantitative sequencing of 5-methylcytosine and 38. Wiench M, John S, Baek S, Johnson TA, Sung M-H, Escobar T, Simmons CA, 5-hydroxymethylcytosine at single-base resolution. Science 2012, Pearce KH, Biddie SC, Sabo PJ, Thurman RE, Stamatoyannopoulos JA, 336:934-937. Hager GL: DNA methylation status predicts cell type-specific enhancer 16. methylKit. [http://code.google.com/p/methylkit]. activity. EMBO J 2011, 30:3028-3039. 17. Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J, 39. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Turner SW: Direct detection of DNA methylation during single-molecule, Zhang X, Wang L, Issner R, Coyne M, Ku M, Durham T, Kellis M, real-time sequencing. Nat Methods 2010, 7:461-465. Bernstein BE: Mapping and analysis of chromatin state dynamics in nine 18. Cherf GM, Lieberman KR, Rashid H, Lam CE, Karplus K, Akeson M: human cell types. Nature 2011, 473:43-49. Automated forward and reverse ratcheting of DNA in a nanopore at 5-Å 40. Branco MR, Ficz G, Reik W: Uncovering the role of 5- precision. Nat Biotechnol 2012, 30:344-348. hydroxymethylcytosine in the epigenome. Nat Rev Genet 2011, 13:7-13. 19. Frith MC, Mori R, Asai K: A mostly traditional approach improves 41. Yu M, Hon GC, Szulwach KE, Song C-X, Zhang L, Kim A, Li X, Dai Q, Shen Y, alignment of bisulfite-converted DNA. Nucleic Acids Res 2012, 40:e100. Park B, Min J-H, Jin P, Ren B, He C: Base-resolution analysis of 5- 20. Krueger F, Kreck B, Franke A, Andrews SR: DNA methylome analysis using hydroxymethylcytosine in the mammalian genome. Cell 2012, short bisulfite sequencing data. Nat Methods 2012, 9:145-151. 149:1368-1380. 21. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, 42. Huang Y, Pastor WA, Shen Y, Tahiliani M, Liu DR, Rao A: The behaviour of Abecasis G, Durbin R: The Sequence Alignment/Map format and 5-hydroxymethylcytosine in bisulfite sequencing. PloS One 2010, 5:e8888. SAMtools. Bioinformatics 2009, 25:2078-2079. 43. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, 22. Krueger F, Andrews SR: Bismark: a flexible aligner and methylation caller Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, for Bisulfite-Seq applications. Bioinformatics 2011, 27:1571-1572. Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, 23. Sun Z, Asmann YW, Kalari KR, Bot B, Eckel-Passow JE, Baker TR, Carr JM, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development Khrebtukova I, Luo S, Zhang L, Schroth GP, Perez EA, Thompson EA: for computational biology and bioinformatics. Genome Biol 2004, 5:R80. Integrated analysis of gene expression, CpG island methylation, and 44. Adams D, Altucci L, Antonarakis SE, Ballesteros J, Beck S, Bird A, Bock C, gene copy number in breast cancer cells by deep sequencing. PloS One Boehm B, Campo E, Caricasole A, Dahl F, Dermitzakis ET, Enver T, Esteller M, 2011, 6:e17490. Estivill X, Ferguson-Smith A, Fitzgibbon J, Flicek P, Giehl C, Graf T, 24. van ‘t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Grosveld F, Guigo R, Gut I, Helin K, Jarvius J, Küppers R, Lehrach H, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Lengauer T, Lernmark Å, Leslie D, et al: BLUEPRINT to decode the Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene epigenetic signature written in blood. Nat Biotechnol 2012, 30:224-226. expression profiling predicts clinical outcome of breast cancer. Nature doi:10.1186/gb-2012-13-10-R87 2002, 415:530-536. Cite this article as: Akalin et al.: methylKit: a comprehensive R package 25. Sotiriou C, Neo S-Y, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, for the analysis of genome-wide DNA methylation profiles. Genome Biology Fox SB, Harris AL, Liu ET: Breast cancer classification and prognosis based 2012 13:R87. on gene expression profiles from a population-based study. Proc Natl Acad Sci USA 2003, 100:10393-10398. 26. Joliffe I: Principal Component Analysis. 2 edition. New York, USA, Springer; 27. Esteller M, Corn PG, Baylin SB, Herman JG: A gene hypermethylation profile of human cancer. Cancer Res 2001, 61:3225-3229.

Journal

Genome BiologySpringer Journals

Published: Oct 1, 2012

Keywords: Animal Genetics and Genomics; Human Genetics; Plant Genetics and Genomics; Microbial Genetics and Genomics; Bioinformatics; Evolutionary Biology

There are no references for this article.