Array probe density and pathobiological relevant CpG calling bias in human disease and physiological DNA methylation profiling

Array probe density and pathobiological relevant CpG calling bias in human disease and... Abstract The HumanMethylation450 BeadChip array (450K; Infinium) is a widely used tool in epigenomics. A recognized concern in the 450K platform is the potential effect of the number of probes/gene (PG) on ranking differentially methylated (DM) CpGs (DM-CpGs) before testing for enrichment of gene ontology categories. We previously showed in a fatty acid (FA)-induced DNA methylation profiling study that when DM-CpGs are ranked by the number of called DM-CpGs-to-PG ratio, the 150 top-ranking gene list is enriched in pathways that overlap with the corresponding Affymetrix array-based expression data. In this study, a comparative analysis of thirteen 450K-based studies representing FA-stimulated cellular models, aging, diseased and normal tissues, revealed that the 150 top-ranking DM-CpGs are in high PG genes. This points to a significant false-negative rate in the low PG gene set when delta-beta-based ranking is performed. We show that PG is not related to the density of methylation-prone sites, as it does not follow gene length or GC content. Conversely, ranking genes by the number of DM-CpGs-to-PG ratio and analysing the 150 top-ranking entries yields significantly enriched gene disease- or tissue-specific function categories that are increased both in number and in the degree of overlap with expression data compared with delta-beta-only ranking or to the previously published gometh-based pipeline. The 15 top-ranking loci list is also significantly enriched in non-coding RNAs, a greatly underrepresented transcript type in 450K. In summary, the proposed simple normalization method yields pathobiologically relevant DM-CpGs. This method is relevant for the newly developed MethylationEPIC (Infinium) microarray. DNA methylation, epigenomics, microarray, 450K Illumina DNA methylation arrays, 850K (EPIC) Illumina DNA methylation arrays Introduction The HumanMethylation450 BeadChip array (450K; Infinium) quantitatively profiles the methylation of ∼485 000 cytosines. Although specifically designed for CpGs that map to genes implicated in cancer, the 450K has become a popular platform for high-throughput profiling of human DNA methylation in virtually all areas of basic and disease epigenetics, where it has uncovered both potential therapeutic targets and disease markers [1, 2]. This is because of relatively low cost, high reproducibility, the requirement for small amounts of bisulphite-converted genomic DNA and high coverage, i.e. ∼99% of the RefSeq genes and ∼96% of the CpG islands (CGIs) with additional coverage of adjacent sequences. The 450k array has now been replaced with the MethylationEPIC 850K microarray that has maintained >90% of the original 450k CpG sites in addition to 33 325 sites located in enhancer regions [3]. Following normalization, each CpG is assigned a beta value between 0 and 1 (unmethylated and completely methylated, respectively). DNA methylation mosaicism or allele specificity within the sample determines the beta value for a given locus. Although beta gives an intuitive methylation value, a logit transformation of beta (M value) is usually preferred for statistical analysis [4]. Approximately 76% of the cytosines represented in the 450k array lie within proximal and distal promoters, 5ʹ untranslated region (UTR), first exon, gene-body and 3ʹ UTR, representing 20 621 genes. On average, each gene in the array is represented by 18 CpGs; however, while 96.2% (19 837/20 621) of genes have <50 probes/gene (PG), only ∼4% (784 of 20 621) have PG > 50 PG (Supplementary Table S1). In particular, the genes PTPRN2, MAD1L1, PRDM16, TNXB, RPTOR and HDAC4 are each represented by >200 probes in 450K. The mainstream analysis method for 450K data seeks differentially methylated (DM) CpGs (DM-CpGs) defined as showing large, i.e. above a predetermined threshold, and statistically significant differences in methylation among experimental groups. Typically, CpGs are ranked by decreasing differences in beta between groups (delta-beta), where large differences are assumed to be biologically relevant. Subsequently, DM-CpG-harbouring genes are tested for significant enrichment in functional gene categories. However, our recent study pointed to possible pitfalls in this approach. We found that fatty acids (FAs) induce small dose-dependent changes in DNA methylation across several genes, which are particularly obvious in the gene body of genes with PG > 100, suggesting that FA-induced DNA methylation profiles are widespread and coordinated [5]. In turn, the observed sizeable delta-beta variability of FA-induced profiles implies that the probability of detecting a change in DNA methylation in any given gene by mere delta-beta ranking will be affected by the number of probes present for that gene. For instance, the probability of detecting a methylation change in the PTPRN2 gene, represented by 1290 probes on the array, would be 2.65×10−3 (1290 of 485 577). Conversely, for PCDHGB1, which is represented by seven probes, the probability would more than two orders of magnitude lower (i.e. 1.44×10−5). These observations point to a possible array design-driven bias in DM-CpG calling that may results in a significant false-negative frequency because of probe under-representation in pathobiologically relevant genes. Indeed, in the FA-induced DNA methylation profiling study, we found that when DM-CpGs are ranked by the number of called DM-CpGs-to-PG ratio—rather than by delta-beta only—the top-ranking gene list is enriched in pathways that overlap with the corresponding Affymetrix array-based expression data [5]. In the present study, we compared delta-beta-based and PG-based gene ranking methods across a large and biologically diverse set of published 450K-based studies, including normal and pathological human samples. As previously observed, we find that ranking by PG is more effective in identifying significant gene function enrichment relative to delta-beta alone. Notably, despite being under-represent on the methylation array, this method results in non-random enrichment of non-coding RNAs. Finally, we show that delta-beta-only ranking also results in diseased sample DM-CpGs under-representation. Results We reasoned that if any bias in favour of probe-dense genes exists in 450K, genes represented by a high number of CpG probes should be more frequently called as DM loci, in comparison with less probe-dense genes. We addressed that issue by probing 22 published 450K data sets (corresponding to 13 studies), which represent a wide range of disease and normal samples (Supplementary Table S2). Eleven array data sets represent aging, disease, i.e. metabolic (type 2 diabetes, obesity and atherosclerosis), psychiatric (autism and schizophrenia) and cancer (colorectal cancer, hepatocellular carcinoma, leukaemia and breast cancer) diseases; three represent FA-stimulated cellular models, i.e. arachidonic or oleic acid (AA and OA, respectively)-stimulated human THP-1 cells [5] and palmitic acid (PA)-stimulated cultured pancreatic cells (see Supplementary Table S2 for individual study references). In addition, eight array data sets represent normal tissues [6]. In all cases, we used normalized, statistically significant differential methylation data generated by the authors before publication. We did not process the respective raw data to homogenize the normalization method and statistical modelling, reasoning that the detection of any bias in heterogeneous data sets would robustly point to an intrinsic feature of 450K. The data sets will be collectively referred to as 450K-based data. First, we analysed the distribution of DM genes, i.e. genes containing at least one DM-CpG, in each array across four different groups based on PG: PG ≤ 10; 10 < PG ≤ 50; 50 < PG ≤ 100; and PG > 100. We found that the frequency of DM genes—calculated as the percentage of all genes within the corresponding PG group—directly followed probe density in a nearly perfect linear fashion in the vast majority of data sets (Table 1, Supplementary Figure S1). We also analysed a randomly selected 15 gene data set for each PG class (listed in Table 2) and assessed their call frequency as DM loci across the 450K-based data sets. Consistent with the above data, genes with PG > 100 were called as DM loci more frequently than the corresponding set of genes with PG ≤ 10—on average in 9 of 13 and only in 3 of 13 450K-based data sets, respectively (Figure 1). Table 1. Distribution of DM genes classified by PG, in the probed 450K-based studies   PG ≤ 10  10 < PG ≤ 50  50 < PG ≤ 100  PG > 100  FA-stimulated cellular models  Arachidonic acid  73  94  100  100  Oleic acid  67  92  99  100  Palmitic acid  45  81  99  100    Metabolic  Obesity  3  9  37  62  Type 2 diabetes  1  4  19  24  Atherosclerosis  2  5  20  31  Disease  Psychiatric  Autism BA10  6  15  47  71  Autism BA24  15  28  77  96  Schizophrenia  4  16  42  67    Cancer  Breast cancer  5  15  47  74  Colorectal carcinoma  9  17  44  52  Hepatocellular carcinoma  3  8  19  27  Leukaemia  2  12  24  25  Aging    5  13  34  49  Normal tissues  Blood  32  57  93  99  Breast  8  22  63  79  Colon  15  36  81  92  Kidney  12  28  76  90  Liver  18  36  77  93  Lungs  2  8  33  54  Prostate  18  39  84  92  Thyroid  18  40  83  94    PG ≤ 10  10 < PG ≤ 50  50 < PG ≤ 100  PG > 100  FA-stimulated cellular models  Arachidonic acid  73  94  100  100  Oleic acid  67  92  99  100  Palmitic acid  45  81  99  100    Metabolic  Obesity  3  9  37  62  Type 2 diabetes  1  4  19  24  Atherosclerosis  2  5  20  31  Disease  Psychiatric  Autism BA10  6  15  47  71  Autism BA24  15  28  77  96  Schizophrenia  4  16  42  67    Cancer  Breast cancer  5  15  47  74  Colorectal carcinoma  9  17  44  52  Hepatocellular carcinoma  3  8  19  27  Leukaemia  2  12  24  25  Aging    5  13  34  49  Normal tissues  Blood  32  57  93  99  Breast  8  22  63  79  Colon  15  36  81  92  Kidney  12  28  76  90  Liver  18  36  77  93  Lungs  2  8  33  54  Prostate  18  39  84  92  Thyroid  18  40  83  94  Note: All values are percentages. Table 2. List of randomly chosen 15 genes in each of the four PG classes PG ≤ 10   10<PG≤50   50<PG≤100   PG > 100   Gene  PG  Gene  PG  Gene  PG  Gene  PG  CCL25  10  PDGFD  39  ADCY9  84  CASZ1  176  IL22RA2  8  PITPNC1  47  CHST8  63  GALNT9  263  C10orf68  5  FBLN1  14  GLI3  75  INPP5A  395  NAPSB  8  GRAMD1B  32  PLCH2  65  SDK1  305  NRL  9  ZNF518B  22  C3orf21  86  AFAP1  103  TNP1  8  DDX51  19  CARS2  66  BAHCC1  113  MUC13  7  TAS1R1  22  ELMO1  57  CUX1  222  SFRS12IP1  9  ZNF491  12  PPP2R2B  57  DENND3  112  SLC22A24  4  ATPBD4  15  TAF4  51  OSBPL5  103  C20orf165  6  ZDHHC7  38  MAPK8IP3  78  TCERG1L  106  DPRXP4  1  C14orf101  13  MEIS1  58  JARID2  107  MIR614  3  CREBL2  15  ZBTB12  76  CSMD1  109  MIRLET7G  1  EGLN2  18  BAT1  81  ZC3H3  108  SIGMAR1  7  ZNF691  17  CACNA1  57  LPCAT1  116  DMRTC1B  2  ABHD11  15  FCGBP  52  TRIM27  144  PG ≤ 10   10<PG≤50   50<PG≤100   PG > 100   Gene  PG  Gene  PG  Gene  PG  Gene  PG  CCL25  10  PDGFD  39  ADCY9  84  CASZ1  176  IL22RA2  8  PITPNC1  47  CHST8  63  GALNT9  263  C10orf68  5  FBLN1  14  GLI3  75  INPP5A  395  NAPSB  8  GRAMD1B  32  PLCH2  65  SDK1  305  NRL  9  ZNF518B  22  C3orf21  86  AFAP1  103  TNP1  8  DDX51  19  CARS2  66  BAHCC1  113  MUC13  7  TAS1R1  22  ELMO1  57  CUX1  222  SFRS12IP1  9  ZNF491  12  PPP2R2B  57  DENND3  112  SLC22A24  4  ATPBD4  15  TAF4  51  OSBPL5  103  C20orf165  6  ZDHHC7  38  MAPK8IP3  78  TCERG1L  106  DPRXP4  1  C14orf101  13  MEIS1  58  JARID2  107  MIR614  3  CREBL2  15  ZBTB12  76  CSMD1  109  MIRLET7G  1  EGLN2  18  BAT1  81  ZC3H3  108  SIGMAR1  7  ZNF691  17  CACNA1  57  LPCAT1  116  DMRTC1B  2  ABHD11  15  FCGBP  52  TRIM27  144  Figure 1. View largeDownload slide Calling of 15 randomly selected genes as DM loci for each PG class, in the analysed 450K-based studies. (A–D) PG ≤ 10, 10 < PG ≤ 50, 50 < PG ≤ 100, PG > 100, respectively. Genes are arranged by decreasing frequency of DM call. PG, number of 450K PG. Figure 1. View largeDownload slide Calling of 15 randomly selected genes as DM loci for each PG class, in the analysed 450K-based studies. (A–D) PG ≤ 10, 10 < PG ≤ 50, 50 < PG ≤ 100, PG > 100, respectively. Genes are arranged by decreasing frequency of DM call. PG, number of 450K PG. Once established that the frequency of called DM loci reflects the PG, we asked whether the latter is correlated with the genic CpG dinucleotide count. A tight correlation would indicate that DM-CpG calling is more likely in genes harbouring a high number of methylation-prone sites and thus is driven by sequence features, rather than being based on array design criteria. As proxies for CpG count, we used gene length and GC content. The correlations with the PG were not significant in either case (Figure 2A and B), thus indicating that the array design criteria shape the DM-CpG distribution. A potential caveat in our analysis is that the GC content may not accurately reflect the lower than expected CpG frequency outside CGIs [7, 8]. Yet, the GC content correlates with the CpG/GpC ratio, a measure of CpG abundance, in a near-linear fashion [9]. Furthermore, within CGIs, no correlation exists between the number of probes mapping within CGIs and the average number of CpGs within a gene’s CGIs (Figure 2C). We expected a negative correlation, as it is difficult to design specific probes within CpG-dense CGIs. Incidentally, the number of CGI probes is significantly although mildly correlated with the number of CGIs within a gene (Figure 2D), indicating that the 450K design is a balanced representation of genic CGIs. Figure 2. View largeDownload slide Correlations between 450K probe density and genic features. (A and B) Correlations between the number of 450K PG and gene length or genic GC content, respectively. The log scale is used in (A). (C and D) Correlations between the number of probes mapping within a CGI and the average number of CpG/CGI or the number of CGIs. Each data point corresponds to a gene. The respective linear regression R2 is shown. Unless the P-value is indicated, the correlation is not significant. Figure 2. View largeDownload slide Correlations between 450K probe density and genic features. (A and B) Correlations between the number of 450K PG and gene length or genic GC content, respectively. The log scale is used in (A). (C and D) Correlations between the number of probes mapping within a CGI and the average number of CpG/CGI or the number of CGIs. Each data point corresponds to a gene. The respective linear regression R2 is shown. Unless the P-value is indicated, the correlation is not significant. Next, we ranked the 450K-based data in each set in decreasing order by the number of called DM-CpGs per gene and probed for overlap across the data sets. The top 15 genes in this list are protein-coding genes, and the majority of them are represented by >100 probes per gene on 450K (Supplementary Table S3). Also, a relatively large overlap was observed between the 450K-based studies. Examples include PRDM16, PTRN2, TNXB and HDAC4 that were present in >10 different data sets (Figure 3A). Figure 3. View largeDownload slide Distribution of the top 15 genes ranked by called DM-CpGs/gene, in 450K-based studies. (A) Top genes ranked by DM-CpG delta-beta. (B) Top genes ranked by number of called DM-CpGs normalized by the number of 450K probes/gene. The presence of a filled square indicates DM calling. BA, Brodmann area; CRC, colorectal carcinoma; HCC, hepatocellular carcinoma. Figure 3. View largeDownload slide Distribution of the top 15 genes ranked by called DM-CpGs/gene, in 450K-based studies. (A) Top genes ranked by DM-CpG delta-beta. (B) Top genes ranked by number of called DM-CpGs normalized by the number of 450K probes/gene. The presence of a filled square indicates DM calling. BA, Brodmann area; CRC, colorectal carcinoma; HCC, hepatocellular carcinoma. In addition, we noticed that the top delta-beta ranking genes showed on average both a lower number of DM-CpGs and smaller delta-beta values in disease samples compared with normal tissue counterparts (Figure 4). In this analysis, we averaged the 8 normal or 10 of the 11 diseased tissues. The schizophrenia data set was available only as delta-M values and was therefore excluded from the analysis. The number of DM-CpGs/gene was significantly higher in normal relative to diseased tissues in both the hypomethylated and hypermethylated set (17.5 ± 8.9 versus 3.2 ± 3.0, P = 2.0×10−4; 8.8 ± 5.9 versus 3.0 ± 1.6, P = 0.005, respectively; t-test). The extent of differential methylation showed the same trend (P < 1.9×10−5 in all cases). This trend was also observed in comparison between the three normal/diseased tissue pairs available—colon, blood and liver against colorectal carcinoma, hepatocellular carcinoma and breast cancer patient’s peripheral blood, respectively, i.e. normal tissues showed a significant increase in number of hypomethylated and hypermethylated DM-CpGs (P = 0.036 and P = 0.006, respectively; Supplementary Figure S2). These observations imply that the mainstream DNA methylome analysis is particularly prone to a significant false-negative rate in the case of disease sample. Figure 4: View largeDownload slide Tissue and disease-related changes in DNA methylation of selected genes. Gene-specific differences in DNA methylation (number of DM-CpGs and delta-beta values) between normal and diseased tissues. The values shown are averages of the respective 450K-based studies. Figure 4: View largeDownload slide Tissue and disease-related changes in DNA methylation of selected genes. Gene-specific differences in DNA methylation (number of DM-CpGs and delta-beta values) between normal and diseased tissues. The values shown are averages of the respective 450K-based studies. To correct for the bias in favour of high PG genes, we filtered the list of DM-CpGs identified in the 450K-based data sets, by calculating the ratio between the number of called DM-CpGs within a given gene and the corresponding PG, and subsequently ranked the data by decreasing values of that ratio. From this ranked DM-CpG set, we extracted the CpGs corresponding to the top 15 or 150 genes and ranked again by the absolute delta-beta value. The rationale for choosing either one of those two number of genes was the suitability for illustrative analyses or for broader analyses, but in principle, it should be adapted to specific needs. For example, 150 annotated genes are likely to give representative results in functional category enrichment analysis tools such as DAVID. This normalization method is supported by the fact that it identified a significant, convergent enrichment for G protein-coupled receptor (GPCR; GO:0007186) signalling genes both in AA-, OA- and PA-induced DNA methylation profiles [false discovery rate (FDR) < 10−3 in all cases] and in the expression profiles yielded by an Affymetrix-array-based study of AA- or OA-stimulated cells in our previous study [5]. Such a convergence is methodologically and biologically likely; yet, a weak correlation between DNA methylation and expression data is almost invariably observed (for example [10]). In addition, the PG-based method was not biased against low delta-beta CpGs, a feature of disease methylomes compared with normal tissue counterparts [5]. To further test within a larger sample whether the proposed PG-based ranking method improves the identification of pathobiologically relevant CpG methylation profiles, we assessed the number of significant gene function categories yielded in comparison with the delta-beta-based ranking across the 450K-based data sets. Furthermore, we determined the degree of overlap between 450K-based and expression array-based gene function categories for each of the two ranking methods. The PG-based ranking method yielded 34 categories in 10 studies, compared with only nine in six studies for the delta-beta ranking (Supplementary Table S4). Notably, the PG-based ranking identified a number of biologically relevant tissue-specific categories in normal samples—blood, liver and prostate (for the significance of drug metabolism genes in the normal prostate physiology, see: www.proteinatlas.org/humanproteome/prostate). As for concordance with transcription profiles, 11 categories in 10 studies overlapped with expression data after PG-based ranking, compared with only two in six studies in the case of delta-beta ranking. In addition, the majority of the top 15 genes according to this normalization method (Supplementary Table S5) showed an increased specificity for defined disease/experimental data sets (Figure 3B). Strikingly, ∼37% (122 of 330) of the top 15 DM targets were non-coding RNA loci, despite the fact that this transcript class accounts for only the 2.1% (10 064 of 485 764) of all 450K probes. This is a significant enrichment (P < 0.03 in all cases, hypergeometric test). Yet, it is possible that these observations merely reflect the fact that non-coding loci have low PG. Indeed, the majority (1038 of 1309 or 79.3%) of non-coding loci are present in the PG < 10 class in the 450K (Supplementary Figure S3) and have a lower PG than coding genes (four and six, respectively; P < 10−4, t-test). However, PG alone does not explain the abundance of non-coding loci. First, within the PG ≤ 10 class, non-coding loci represent only ∼one-sixth of coding genes (1038 versus 5895; Supplementary Figure S3). Secondly, within the top 15 ranked loci of the 19 analysed data sets from normal (n = 8) and diseased/aging (n = 11) tissues, non-coding loci are significantly less abundant in disease/aging than normal samples (average 4.0 and 5.6, respectively; P = 0.024, t-test), but the number of coding loci is similar between the two sets (average 9.4 and 8.4; P = 0.39). Thirdly, non-coding loci identified by our method are enriched in all data sets within the PG ≤ 10 class (P < 0.05, hypergeometric test) compared with coding genes, with the exception of obesity, breast cancer blood and aging, within the top 15 ranking loci (Supplementary Figure S4). Notably, these latter three data sets are tightly clustered in the independent methylome comparative analysis that we previously performed [5]. Taken together, these results strongly suggest that instructive, pathobiologically relevant mechanisms, rather than bias in favour of low PG loci, underlie the observed non-coding loci enrichment. Finally, we compared our normalization method that corrects for high PG bias before gene enrichment analysis with gometh software, that is also designed to reduce the inherent bias of high PG in downstream enrichment studies [11]. The gometh function takes into account a gene’s probability of being selected based on PG, followed by Wallenius’ non-central hypergeometric test for each gene ontology category. To this end, we analysed the same DM-CpGs from each of the 450k arrays previously described (Supplementary Table S2). As expected, the number of pathways was increased using gometh software relative to our ranked top 150 gene list. We restricted our analysis to the first 10 most significantly enriched terms, which is ∼3-fold higher than the average number of terms per data set yielded by our method. Our normalization method yielded a higher percentage of overlap between methylation and expression data compared with gometh (11 pathways in 10 studies versus 7 pathways in 22 studies, respectively; Supplementary Table S4). Discussion We provide evidence that ranking 450K-generated data for delta-beta only ranks DM-CpGs in favour of high probe density genes. This pattern was consistently observed in a range of 450K-based studies across different laboratories, array batches and normalization methods. The 450K probe density distribution does not reflect genomic features such as the gene length nor the GC content. The main implication of the results is that a significant false-negative rate cannot be ruled out when ranking DM-CpGs solely by delta-beta amplitude. The small number of genes with PG > 50 further underlines the potential importance of the probe density-driven false-negative rate. This bias is expected to be particularly significant in diseases or treatments associated with small and genome-widespread DNA methylation profiles. Indeed, we show that the differential DNA methylation between diseased and normal counterpart tissues is smaller both in the number of loci and magnitude, compared with the corresponding differences between normal tissues, thus making disease samples particularly prone to a significant false-negative rate. Notably, even 450K-based epigenome-wide association studies (EWAS) may be prone to favour high probe density genes, despite the large sample size and strict statistical criteria. For example, recent metabolic disease EWAS identified CpGs that map to genes with higher than average probe density, i.e. CPT1A, ABCG1 and HIF3A with 61, 32 and 25 PG, respectively [12, 13]. The proposed ranking method yields a called DM-CpG distribution that reflects the gene distribution across the probe density classes, not the probe density itself, and offers several advantages compared with the mainstream analysis. First, analysis of the top 150 ranked genes yielded an enrichment in the GPCR pathway that was specific for selected data sets, despite their different biological and laboratory origin, array batch and statistical criteria [5]. Importantly, the proposed ranking method identified a pathway enrichment that overlaps between the 450K-based and expression array-based data, suggesting that it may contribute to the often frustrating efforts to understand the interplay between DNA methylation and expression [5, 10]. Secondly, it could distinguish disease-specific from tissue-specific profiles. At the same time, it identifies a set of transcripts including abundant non-coding RNAs, an under-represented transcript type in 450K. This reflected an instructive mechanism, rather than a bias in favour of low PG-genes, as we showed that within the low PG (PG < 10) class, non-coding loci were enriched in most data sets compared with coding genes. The exceptions were obesity, breast cancer blood and aging. The fact that these data sets were tightly clustered in our previous comparative analysis of DNA methylomes [5] suggests a lesser contribution of non-coding gene differential methylation in these epigenetically related diseases. Also, we document that the number of non-coding/coding loci ratio is significantly lower in the analysed disease and aging tissues compared with normal counterparts. Based on the recognized importance of non-coding RNAs in gene expression regulation, this observation suggests a limitation of the 450K design. The newly developed MethylationEPIC BeadChip (Infinium) array [3], an enhancer element-enriched version of the 450K covering ∼850 000 CpGs, may at least partly correct that limitation in combination with the application of the PG-based ranking method, given that often enhancers are non-coding RNA transcription sites [14]. On the other hand, the 850K array does not show any substantial difference in PG distribution compared with the 450K version; therefore, the bias addressed here applies to the newer platform (Supplementary Figure S5). Finally, the top-ranking DM-CpGs following the PG-based method have a comparatively low delta-beta, a feature of disease methylome in comparison with normal counterparts, thus leading to a better representation of disease-specific profiles. High PG bias has previously been documented using methylation microarray data from lung cancer and ulcerative colitis as examples [15], and the gometh function dedicated to correcting that bias has only recently been published [11]. When compared with the latter, our proposed method yielded a broader overlap with expression data sets. One main difference between our approach and gometh is that we apply the gene ontology analysis on a set of tissue- and disease-specific DM genes, rather than the whole DM gene list. From a practical viewpoint, the proposed normalization method does not require any specialized bioinformatics knowledge. It uses normalized intensity (beta or M values) and annotated data (gene identity; genomic features) rather than being an automatic raw data-to-enriched pathway identification pipeline, thus allowing a full biological knowledge of the data at every step of the analysis. Our approach is only one of several that could be tested in further studies. For example, a gene-by-gene, multi-probe paired comparison between samples should effectively lower false-negative rates. In summary, the proposed PG-based ranking method corrects biases inherent in currently accepted delta-beta-based methods. The latter are effective in identifying large changes in CpG methylation with robust pathobiological significance, but clearly introduce a potentially significant false-negative rate that masks biologically relevant low PG loci. Our PG-based method yields gene sets with significantly enriched gene function categories that are increased in number and in the degree of overlap with expression data, and show increased disease or tissue specificity, in comparison with the delta-beta-only ranking. Furthermore, it is not biased against low delta-beta loci, which are a distinct feature of disease methylomes. Methods This briefing presents a conceptual analysis of a method for 450K data ranking and illustrates its effectiveness by applying it to previously described data sets [5]. Therefore, methods, data set description, accession numbers for publicly available data and statistical tests are as reported in that reference. Key Points The HumanMethylation450 BeadChip array (450K; Infinium) is a widely used epigenomics tool. We show in several 450K-based studies that ranking is biased in favour of genes represented by a high number of probes, if the extent of differential methylation (delta-beta) is the sole ranking criterion. We propose that introducing the number of called DM-CpGs-to-probe density ratio as a simple additional normalizing criterion yields unbiased data. The effectiveness of the method is illustrated in a large set of publicly available 450K-based data. Supplementary Data Supplementary data are available online at https://academic.oup.com/bfg. Funding This work was supported by the Council for Science and Technology of the Guanajuato state, Mexico (CONCyTEG) [grant no. 08-03-K662-020-A01 to G.L.]; and the Mexican Council for Science and Technology (CONACyT) [Basic Science (Ciencia Básica) grant no. 83401 to G.L., Ph.D. Fellowship no. 334370 to G.A.S.-M. Guillermo A. Silva-Martínez is a PhD student co-directed by the two co-authors at the Irapuato Unit of the Centre for Advanced Research (CINVESTAV), a Mexican governmental institution (www.ira.cinvestav.mx). Silvio Zaina is a Professor (level: ‘Titular A’) at the Department of Medical Sciences, University of Guanajuato (www.ugto.mx), Mexico. Personal website: silviozaina.com. Gertrud Lund is a Researcher at CINVESTAV Irapuato Unit (www.ira.cinvestav.mx), Mexico. G.L. and S.Z. have conducted collaborative work in the epigenetics field for >10 years. References 1 Sandoval J, Heyn H, Moran S, et al.   Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics  2011; 6: 692– 702. Google Scholar CrossRef Search ADS PubMed  2 Bibikova M, Barnes B, Tsan C, et al.   High density DNA methylation array with single CpG site resolution. Genomics  2011; 98: 288– 95. Google Scholar CrossRef Search ADS PubMed  3 Moran S, Arribas C, Esteller M. Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics  2016; 8: 389– 99. Google Scholar CrossRef Search ADS PubMed  4 Du P, Zhang X, Huang CC, et al.   Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics  2010; 11: 587. Google Scholar CrossRef Search ADS PubMed  5 Silva-Martínez GA, Rodríguez-Ríos D, Alvarado-Caudillo Y, et al.   Arachidonic and oleic acid exert distinct effects on the DNA methylome. Epigenetics  2016; 11: 321– 34. Google Scholar CrossRef Search ADS PubMed  6 Lowe R, Slodkowicz G, Goldman N, et al.   The human blood DNA methylome displays a highly distinctive profile compared with other somatic tissues. Epigenetics  2015; 10: 274– 81. Google Scholar CrossRef Search ADS PubMed  7 Jabbari K, Bernardi G. CpG doublets, CpG islands and Alu repeats in long human DNA sequences from different isochore families. Gene  1998; 224: 123– 7. Google Scholar CrossRef Search ADS PubMed  8 Bird A, Taggart M, Frommer M, et al.   A fraction of the mouse genome that is derived from islands of nonmethylated, CpG-rich DNA. Cell  1985; 40: 91– 9. Google Scholar CrossRef Search ADS PubMed  9 Fryxell KJ, Zuckerkandl E. Cytosine deamination plays a primary role in the evolution of mammalian isochores. Mol Biol Evol  2000; 17: 1371– 83. Google Scholar CrossRef Search ADS PubMed  10 Xie L, Weichel B, Ohm JE, et al.   An integrative analysis of DNA methylation and RNA-Seq data for human heart, kidney and liver. BMC Syst Biol  2011; 5: S4. Google Scholar CrossRef Search ADS PubMed  11 Phipson B, Maksimovic J, Oshlack A. missMethyl: an R package for analysing methylation data from Illumina’ s HumanMethylation450 platform. Bioinformatics  2016; 32: 286– 8. Google Scholar PubMed  12 Irvin MR, Zhi D, Joehanes R, et al.   Epigenome-wide association study of fasting blood lipids in the genetics of lipid-lowering drugs and diet network study. Circulation  2014; 130: 565– 72. Google Scholar CrossRef Search ADS PubMed  13 Demerath EW, Guan W, Grove ML, et al.   Epigenome-Wide Association Study (EWAS) of BMI, BMI change, and waist circumference in African American adults identifies multiple replicated loci. Hum Mol Genet  2015; 24: 4464– 79. Google Scholar CrossRef Search ADS PubMed  14 Kim TK, Hemberg M, Gray JM, et al.   Widespread transcription at neuronal activity-regulated enhancers. Nature  2010; 465: 182– 7. Google Scholar CrossRef Search ADS PubMed  15 Geeleher P, Hartnett L, Egan LJ, et al.   Gene-set analysis is severely biased when applied to genome-wide methylation data. Bioinformatics  2013; 29: 1851– 7. Google Scholar CrossRef Search ADS PubMed  © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Briefings in Functional Genomics Oxford University Press

Array probe density and pathobiological relevant CpG calling bias in human disease and physiological DNA methylation profiling

Loading next page...
 
/lp/ou_press/array-probe-density-and-pathobiological-relevant-cpg-calling-bias-in-IB7LPtJWyM
Publisher
Oxford University Press
Copyright
© The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com
ISSN
2041-2649
eISSN
2041-2647
D.O.I.
10.1093/bfgp/elx017
Publisher site
See Article on Publisher Site

Abstract

Abstract The HumanMethylation450 BeadChip array (450K; Infinium) is a widely used tool in epigenomics. A recognized concern in the 450K platform is the potential effect of the number of probes/gene (PG) on ranking differentially methylated (DM) CpGs (DM-CpGs) before testing for enrichment of gene ontology categories. We previously showed in a fatty acid (FA)-induced DNA methylation profiling study that when DM-CpGs are ranked by the number of called DM-CpGs-to-PG ratio, the 150 top-ranking gene list is enriched in pathways that overlap with the corresponding Affymetrix array-based expression data. In this study, a comparative analysis of thirteen 450K-based studies representing FA-stimulated cellular models, aging, diseased and normal tissues, revealed that the 150 top-ranking DM-CpGs are in high PG genes. This points to a significant false-negative rate in the low PG gene set when delta-beta-based ranking is performed. We show that PG is not related to the density of methylation-prone sites, as it does not follow gene length or GC content. Conversely, ranking genes by the number of DM-CpGs-to-PG ratio and analysing the 150 top-ranking entries yields significantly enriched gene disease- or tissue-specific function categories that are increased both in number and in the degree of overlap with expression data compared with delta-beta-only ranking or to the previously published gometh-based pipeline. The 15 top-ranking loci list is also significantly enriched in non-coding RNAs, a greatly underrepresented transcript type in 450K. In summary, the proposed simple normalization method yields pathobiologically relevant DM-CpGs. This method is relevant for the newly developed MethylationEPIC (Infinium) microarray. DNA methylation, epigenomics, microarray, 450K Illumina DNA methylation arrays, 850K (EPIC) Illumina DNA methylation arrays Introduction The HumanMethylation450 BeadChip array (450K; Infinium) quantitatively profiles the methylation of ∼485 000 cytosines. Although specifically designed for CpGs that map to genes implicated in cancer, the 450K has become a popular platform for high-throughput profiling of human DNA methylation in virtually all areas of basic and disease epigenetics, where it has uncovered both potential therapeutic targets and disease markers [1, 2]. This is because of relatively low cost, high reproducibility, the requirement for small amounts of bisulphite-converted genomic DNA and high coverage, i.e. ∼99% of the RefSeq genes and ∼96% of the CpG islands (CGIs) with additional coverage of adjacent sequences. The 450k array has now been replaced with the MethylationEPIC 850K microarray that has maintained >90% of the original 450k CpG sites in addition to 33 325 sites located in enhancer regions [3]. Following normalization, each CpG is assigned a beta value between 0 and 1 (unmethylated and completely methylated, respectively). DNA methylation mosaicism or allele specificity within the sample determines the beta value for a given locus. Although beta gives an intuitive methylation value, a logit transformation of beta (M value) is usually preferred for statistical analysis [4]. Approximately 76% of the cytosines represented in the 450k array lie within proximal and distal promoters, 5ʹ untranslated region (UTR), first exon, gene-body and 3ʹ UTR, representing 20 621 genes. On average, each gene in the array is represented by 18 CpGs; however, while 96.2% (19 837/20 621) of genes have <50 probes/gene (PG), only ∼4% (784 of 20 621) have PG > 50 PG (Supplementary Table S1). In particular, the genes PTPRN2, MAD1L1, PRDM16, TNXB, RPTOR and HDAC4 are each represented by >200 probes in 450K. The mainstream analysis method for 450K data seeks differentially methylated (DM) CpGs (DM-CpGs) defined as showing large, i.e. above a predetermined threshold, and statistically significant differences in methylation among experimental groups. Typically, CpGs are ranked by decreasing differences in beta between groups (delta-beta), where large differences are assumed to be biologically relevant. Subsequently, DM-CpG-harbouring genes are tested for significant enrichment in functional gene categories. However, our recent study pointed to possible pitfalls in this approach. We found that fatty acids (FAs) induce small dose-dependent changes in DNA methylation across several genes, which are particularly obvious in the gene body of genes with PG > 100, suggesting that FA-induced DNA methylation profiles are widespread and coordinated [5]. In turn, the observed sizeable delta-beta variability of FA-induced profiles implies that the probability of detecting a change in DNA methylation in any given gene by mere delta-beta ranking will be affected by the number of probes present for that gene. For instance, the probability of detecting a methylation change in the PTPRN2 gene, represented by 1290 probes on the array, would be 2.65×10−3 (1290 of 485 577). Conversely, for PCDHGB1, which is represented by seven probes, the probability would more than two orders of magnitude lower (i.e. 1.44×10−5). These observations point to a possible array design-driven bias in DM-CpG calling that may results in a significant false-negative frequency because of probe under-representation in pathobiologically relevant genes. Indeed, in the FA-induced DNA methylation profiling study, we found that when DM-CpGs are ranked by the number of called DM-CpGs-to-PG ratio—rather than by delta-beta only—the top-ranking gene list is enriched in pathways that overlap with the corresponding Affymetrix array-based expression data [5]. In the present study, we compared delta-beta-based and PG-based gene ranking methods across a large and biologically diverse set of published 450K-based studies, including normal and pathological human samples. As previously observed, we find that ranking by PG is more effective in identifying significant gene function enrichment relative to delta-beta alone. Notably, despite being under-represent on the methylation array, this method results in non-random enrichment of non-coding RNAs. Finally, we show that delta-beta-only ranking also results in diseased sample DM-CpGs under-representation. Results We reasoned that if any bias in favour of probe-dense genes exists in 450K, genes represented by a high number of CpG probes should be more frequently called as DM loci, in comparison with less probe-dense genes. We addressed that issue by probing 22 published 450K data sets (corresponding to 13 studies), which represent a wide range of disease and normal samples (Supplementary Table S2). Eleven array data sets represent aging, disease, i.e. metabolic (type 2 diabetes, obesity and atherosclerosis), psychiatric (autism and schizophrenia) and cancer (colorectal cancer, hepatocellular carcinoma, leukaemia and breast cancer) diseases; three represent FA-stimulated cellular models, i.e. arachidonic or oleic acid (AA and OA, respectively)-stimulated human THP-1 cells [5] and palmitic acid (PA)-stimulated cultured pancreatic cells (see Supplementary Table S2 for individual study references). In addition, eight array data sets represent normal tissues [6]. In all cases, we used normalized, statistically significant differential methylation data generated by the authors before publication. We did not process the respective raw data to homogenize the normalization method and statistical modelling, reasoning that the detection of any bias in heterogeneous data sets would robustly point to an intrinsic feature of 450K. The data sets will be collectively referred to as 450K-based data. First, we analysed the distribution of DM genes, i.e. genes containing at least one DM-CpG, in each array across four different groups based on PG: PG ≤ 10; 10 < PG ≤ 50; 50 < PG ≤ 100; and PG > 100. We found that the frequency of DM genes—calculated as the percentage of all genes within the corresponding PG group—directly followed probe density in a nearly perfect linear fashion in the vast majority of data sets (Table 1, Supplementary Figure S1). We also analysed a randomly selected 15 gene data set for each PG class (listed in Table 2) and assessed their call frequency as DM loci across the 450K-based data sets. Consistent with the above data, genes with PG > 100 were called as DM loci more frequently than the corresponding set of genes with PG ≤ 10—on average in 9 of 13 and only in 3 of 13 450K-based data sets, respectively (Figure 1). Table 1. Distribution of DM genes classified by PG, in the probed 450K-based studies   PG ≤ 10  10 < PG ≤ 50  50 < PG ≤ 100  PG > 100  FA-stimulated cellular models  Arachidonic acid  73  94  100  100  Oleic acid  67  92  99  100  Palmitic acid  45  81  99  100    Metabolic  Obesity  3  9  37  62  Type 2 diabetes  1  4  19  24  Atherosclerosis  2  5  20  31  Disease  Psychiatric  Autism BA10  6  15  47  71  Autism BA24  15  28  77  96  Schizophrenia  4  16  42  67    Cancer  Breast cancer  5  15  47  74  Colorectal carcinoma  9  17  44  52  Hepatocellular carcinoma  3  8  19  27  Leukaemia  2  12  24  25  Aging    5  13  34  49  Normal tissues  Blood  32  57  93  99  Breast  8  22  63  79  Colon  15  36  81  92  Kidney  12  28  76  90  Liver  18  36  77  93  Lungs  2  8  33  54  Prostate  18  39  84  92  Thyroid  18  40  83  94    PG ≤ 10  10 < PG ≤ 50  50 < PG ≤ 100  PG > 100  FA-stimulated cellular models  Arachidonic acid  73  94  100  100  Oleic acid  67  92  99  100  Palmitic acid  45  81  99  100    Metabolic  Obesity  3  9  37  62  Type 2 diabetes  1  4  19  24  Atherosclerosis  2  5  20  31  Disease  Psychiatric  Autism BA10  6  15  47  71  Autism BA24  15  28  77  96  Schizophrenia  4  16  42  67    Cancer  Breast cancer  5  15  47  74  Colorectal carcinoma  9  17  44  52  Hepatocellular carcinoma  3  8  19  27  Leukaemia  2  12  24  25  Aging    5  13  34  49  Normal tissues  Blood  32  57  93  99  Breast  8  22  63  79  Colon  15  36  81  92  Kidney  12  28  76  90  Liver  18  36  77  93  Lungs  2  8  33  54  Prostate  18  39  84  92  Thyroid  18  40  83  94  Note: All values are percentages. Table 2. List of randomly chosen 15 genes in each of the four PG classes PG ≤ 10   10<PG≤50   50<PG≤100   PG > 100   Gene  PG  Gene  PG  Gene  PG  Gene  PG  CCL25  10  PDGFD  39  ADCY9  84  CASZ1  176  IL22RA2  8  PITPNC1  47  CHST8  63  GALNT9  263  C10orf68  5  FBLN1  14  GLI3  75  INPP5A  395  NAPSB  8  GRAMD1B  32  PLCH2  65  SDK1  305  NRL  9  ZNF518B  22  C3orf21  86  AFAP1  103  TNP1  8  DDX51  19  CARS2  66  BAHCC1  113  MUC13  7  TAS1R1  22  ELMO1  57  CUX1  222  SFRS12IP1  9  ZNF491  12  PPP2R2B  57  DENND3  112  SLC22A24  4  ATPBD4  15  TAF4  51  OSBPL5  103  C20orf165  6  ZDHHC7  38  MAPK8IP3  78  TCERG1L  106  DPRXP4  1  C14orf101  13  MEIS1  58  JARID2  107  MIR614  3  CREBL2  15  ZBTB12  76  CSMD1  109  MIRLET7G  1  EGLN2  18  BAT1  81  ZC3H3  108  SIGMAR1  7  ZNF691  17  CACNA1  57  LPCAT1  116  DMRTC1B  2  ABHD11  15  FCGBP  52  TRIM27  144  PG ≤ 10   10<PG≤50   50<PG≤100   PG > 100   Gene  PG  Gene  PG  Gene  PG  Gene  PG  CCL25  10  PDGFD  39  ADCY9  84  CASZ1  176  IL22RA2  8  PITPNC1  47  CHST8  63  GALNT9  263  C10orf68  5  FBLN1  14  GLI3  75  INPP5A  395  NAPSB  8  GRAMD1B  32  PLCH2  65  SDK1  305  NRL  9  ZNF518B  22  C3orf21  86  AFAP1  103  TNP1  8  DDX51  19  CARS2  66  BAHCC1  113  MUC13  7  TAS1R1  22  ELMO1  57  CUX1  222  SFRS12IP1  9  ZNF491  12  PPP2R2B  57  DENND3  112  SLC22A24  4  ATPBD4  15  TAF4  51  OSBPL5  103  C20orf165  6  ZDHHC7  38  MAPK8IP3  78  TCERG1L  106  DPRXP4  1  C14orf101  13  MEIS1  58  JARID2  107  MIR614  3  CREBL2  15  ZBTB12  76  CSMD1  109  MIRLET7G  1  EGLN2  18  BAT1  81  ZC3H3  108  SIGMAR1  7  ZNF691  17  CACNA1  57  LPCAT1  116  DMRTC1B  2  ABHD11  15  FCGBP  52  TRIM27  144  Figure 1. View largeDownload slide Calling of 15 randomly selected genes as DM loci for each PG class, in the analysed 450K-based studies. (A–D) PG ≤ 10, 10 < PG ≤ 50, 50 < PG ≤ 100, PG > 100, respectively. Genes are arranged by decreasing frequency of DM call. PG, number of 450K PG. Figure 1. View largeDownload slide Calling of 15 randomly selected genes as DM loci for each PG class, in the analysed 450K-based studies. (A–D) PG ≤ 10, 10 < PG ≤ 50, 50 < PG ≤ 100, PG > 100, respectively. Genes are arranged by decreasing frequency of DM call. PG, number of 450K PG. Once established that the frequency of called DM loci reflects the PG, we asked whether the latter is correlated with the genic CpG dinucleotide count. A tight correlation would indicate that DM-CpG calling is more likely in genes harbouring a high number of methylation-prone sites and thus is driven by sequence features, rather than being based on array design criteria. As proxies for CpG count, we used gene length and GC content. The correlations with the PG were not significant in either case (Figure 2A and B), thus indicating that the array design criteria shape the DM-CpG distribution. A potential caveat in our analysis is that the GC content may not accurately reflect the lower than expected CpG frequency outside CGIs [7, 8]. Yet, the GC content correlates with the CpG/GpC ratio, a measure of CpG abundance, in a near-linear fashion [9]. Furthermore, within CGIs, no correlation exists between the number of probes mapping within CGIs and the average number of CpGs within a gene’s CGIs (Figure 2C). We expected a negative correlation, as it is difficult to design specific probes within CpG-dense CGIs. Incidentally, the number of CGI probes is significantly although mildly correlated with the number of CGIs within a gene (Figure 2D), indicating that the 450K design is a balanced representation of genic CGIs. Figure 2. View largeDownload slide Correlations between 450K probe density and genic features. (A and B) Correlations between the number of 450K PG and gene length or genic GC content, respectively. The log scale is used in (A). (C and D) Correlations between the number of probes mapping within a CGI and the average number of CpG/CGI or the number of CGIs. Each data point corresponds to a gene. The respective linear regression R2 is shown. Unless the P-value is indicated, the correlation is not significant. Figure 2. View largeDownload slide Correlations between 450K probe density and genic features. (A and B) Correlations between the number of 450K PG and gene length or genic GC content, respectively. The log scale is used in (A). (C and D) Correlations between the number of probes mapping within a CGI and the average number of CpG/CGI or the number of CGIs. Each data point corresponds to a gene. The respective linear regression R2 is shown. Unless the P-value is indicated, the correlation is not significant. Next, we ranked the 450K-based data in each set in decreasing order by the number of called DM-CpGs per gene and probed for overlap across the data sets. The top 15 genes in this list are protein-coding genes, and the majority of them are represented by >100 probes per gene on 450K (Supplementary Table S3). Also, a relatively large overlap was observed between the 450K-based studies. Examples include PRDM16, PTRN2, TNXB and HDAC4 that were present in >10 different data sets (Figure 3A). Figure 3. View largeDownload slide Distribution of the top 15 genes ranked by called DM-CpGs/gene, in 450K-based studies. (A) Top genes ranked by DM-CpG delta-beta. (B) Top genes ranked by number of called DM-CpGs normalized by the number of 450K probes/gene. The presence of a filled square indicates DM calling. BA, Brodmann area; CRC, colorectal carcinoma; HCC, hepatocellular carcinoma. Figure 3. View largeDownload slide Distribution of the top 15 genes ranked by called DM-CpGs/gene, in 450K-based studies. (A) Top genes ranked by DM-CpG delta-beta. (B) Top genes ranked by number of called DM-CpGs normalized by the number of 450K probes/gene. The presence of a filled square indicates DM calling. BA, Brodmann area; CRC, colorectal carcinoma; HCC, hepatocellular carcinoma. In addition, we noticed that the top delta-beta ranking genes showed on average both a lower number of DM-CpGs and smaller delta-beta values in disease samples compared with normal tissue counterparts (Figure 4). In this analysis, we averaged the 8 normal or 10 of the 11 diseased tissues. The schizophrenia data set was available only as delta-M values and was therefore excluded from the analysis. The number of DM-CpGs/gene was significantly higher in normal relative to diseased tissues in both the hypomethylated and hypermethylated set (17.5 ± 8.9 versus 3.2 ± 3.0, P = 2.0×10−4; 8.8 ± 5.9 versus 3.0 ± 1.6, P = 0.005, respectively; t-test). The extent of differential methylation showed the same trend (P < 1.9×10−5 in all cases). This trend was also observed in comparison between the three normal/diseased tissue pairs available—colon, blood and liver against colorectal carcinoma, hepatocellular carcinoma and breast cancer patient’s peripheral blood, respectively, i.e. normal tissues showed a significant increase in number of hypomethylated and hypermethylated DM-CpGs (P = 0.036 and P = 0.006, respectively; Supplementary Figure S2). These observations imply that the mainstream DNA methylome analysis is particularly prone to a significant false-negative rate in the case of disease sample. Figure 4: View largeDownload slide Tissue and disease-related changes in DNA methylation of selected genes. Gene-specific differences in DNA methylation (number of DM-CpGs and delta-beta values) between normal and diseased tissues. The values shown are averages of the respective 450K-based studies. Figure 4: View largeDownload slide Tissue and disease-related changes in DNA methylation of selected genes. Gene-specific differences in DNA methylation (number of DM-CpGs and delta-beta values) between normal and diseased tissues. The values shown are averages of the respective 450K-based studies. To correct for the bias in favour of high PG genes, we filtered the list of DM-CpGs identified in the 450K-based data sets, by calculating the ratio between the number of called DM-CpGs within a given gene and the corresponding PG, and subsequently ranked the data by decreasing values of that ratio. From this ranked DM-CpG set, we extracted the CpGs corresponding to the top 15 or 150 genes and ranked again by the absolute delta-beta value. The rationale for choosing either one of those two number of genes was the suitability for illustrative analyses or for broader analyses, but in principle, it should be adapted to specific needs. For example, 150 annotated genes are likely to give representative results in functional category enrichment analysis tools such as DAVID. This normalization method is supported by the fact that it identified a significant, convergent enrichment for G protein-coupled receptor (GPCR; GO:0007186) signalling genes both in AA-, OA- and PA-induced DNA methylation profiles [false discovery rate (FDR) < 10−3 in all cases] and in the expression profiles yielded by an Affymetrix-array-based study of AA- or OA-stimulated cells in our previous study [5]. Such a convergence is methodologically and biologically likely; yet, a weak correlation between DNA methylation and expression data is almost invariably observed (for example [10]). In addition, the PG-based method was not biased against low delta-beta CpGs, a feature of disease methylomes compared with normal tissue counterparts [5]. To further test within a larger sample whether the proposed PG-based ranking method improves the identification of pathobiologically relevant CpG methylation profiles, we assessed the number of significant gene function categories yielded in comparison with the delta-beta-based ranking across the 450K-based data sets. Furthermore, we determined the degree of overlap between 450K-based and expression array-based gene function categories for each of the two ranking methods. The PG-based ranking method yielded 34 categories in 10 studies, compared with only nine in six studies for the delta-beta ranking (Supplementary Table S4). Notably, the PG-based ranking identified a number of biologically relevant tissue-specific categories in normal samples—blood, liver and prostate (for the significance of drug metabolism genes in the normal prostate physiology, see: www.proteinatlas.org/humanproteome/prostate). As for concordance with transcription profiles, 11 categories in 10 studies overlapped with expression data after PG-based ranking, compared with only two in six studies in the case of delta-beta ranking. In addition, the majority of the top 15 genes according to this normalization method (Supplementary Table S5) showed an increased specificity for defined disease/experimental data sets (Figure 3B). Strikingly, ∼37% (122 of 330) of the top 15 DM targets were non-coding RNA loci, despite the fact that this transcript class accounts for only the 2.1% (10 064 of 485 764) of all 450K probes. This is a significant enrichment (P < 0.03 in all cases, hypergeometric test). Yet, it is possible that these observations merely reflect the fact that non-coding loci have low PG. Indeed, the majority (1038 of 1309 or 79.3%) of non-coding loci are present in the PG < 10 class in the 450K (Supplementary Figure S3) and have a lower PG than coding genes (four and six, respectively; P < 10−4, t-test). However, PG alone does not explain the abundance of non-coding loci. First, within the PG ≤ 10 class, non-coding loci represent only ∼one-sixth of coding genes (1038 versus 5895; Supplementary Figure S3). Secondly, within the top 15 ranked loci of the 19 analysed data sets from normal (n = 8) and diseased/aging (n = 11) tissues, non-coding loci are significantly less abundant in disease/aging than normal samples (average 4.0 and 5.6, respectively; P = 0.024, t-test), but the number of coding loci is similar between the two sets (average 9.4 and 8.4; P = 0.39). Thirdly, non-coding loci identified by our method are enriched in all data sets within the PG ≤ 10 class (P < 0.05, hypergeometric test) compared with coding genes, with the exception of obesity, breast cancer blood and aging, within the top 15 ranking loci (Supplementary Figure S4). Notably, these latter three data sets are tightly clustered in the independent methylome comparative analysis that we previously performed [5]. Taken together, these results strongly suggest that instructive, pathobiologically relevant mechanisms, rather than bias in favour of low PG loci, underlie the observed non-coding loci enrichment. Finally, we compared our normalization method that corrects for high PG bias before gene enrichment analysis with gometh software, that is also designed to reduce the inherent bias of high PG in downstream enrichment studies [11]. The gometh function takes into account a gene’s probability of being selected based on PG, followed by Wallenius’ non-central hypergeometric test for each gene ontology category. To this end, we analysed the same DM-CpGs from each of the 450k arrays previously described (Supplementary Table S2). As expected, the number of pathways was increased using gometh software relative to our ranked top 150 gene list. We restricted our analysis to the first 10 most significantly enriched terms, which is ∼3-fold higher than the average number of terms per data set yielded by our method. Our normalization method yielded a higher percentage of overlap between methylation and expression data compared with gometh (11 pathways in 10 studies versus 7 pathways in 22 studies, respectively; Supplementary Table S4). Discussion We provide evidence that ranking 450K-generated data for delta-beta only ranks DM-CpGs in favour of high probe density genes. This pattern was consistently observed in a range of 450K-based studies across different laboratories, array batches and normalization methods. The 450K probe density distribution does not reflect genomic features such as the gene length nor the GC content. The main implication of the results is that a significant false-negative rate cannot be ruled out when ranking DM-CpGs solely by delta-beta amplitude. The small number of genes with PG > 50 further underlines the potential importance of the probe density-driven false-negative rate. This bias is expected to be particularly significant in diseases or treatments associated with small and genome-widespread DNA methylation profiles. Indeed, we show that the differential DNA methylation between diseased and normal counterpart tissues is smaller both in the number of loci and magnitude, compared with the corresponding differences between normal tissues, thus making disease samples particularly prone to a significant false-negative rate. Notably, even 450K-based epigenome-wide association studies (EWAS) may be prone to favour high probe density genes, despite the large sample size and strict statistical criteria. For example, recent metabolic disease EWAS identified CpGs that map to genes with higher than average probe density, i.e. CPT1A, ABCG1 and HIF3A with 61, 32 and 25 PG, respectively [12, 13]. The proposed ranking method yields a called DM-CpG distribution that reflects the gene distribution across the probe density classes, not the probe density itself, and offers several advantages compared with the mainstream analysis. First, analysis of the top 150 ranked genes yielded an enrichment in the GPCR pathway that was specific for selected data sets, despite their different biological and laboratory origin, array batch and statistical criteria [5]. Importantly, the proposed ranking method identified a pathway enrichment that overlaps between the 450K-based and expression array-based data, suggesting that it may contribute to the often frustrating efforts to understand the interplay between DNA methylation and expression [5, 10]. Secondly, it could distinguish disease-specific from tissue-specific profiles. At the same time, it identifies a set of transcripts including abundant non-coding RNAs, an under-represented transcript type in 450K. This reflected an instructive mechanism, rather than a bias in favour of low PG-genes, as we showed that within the low PG (PG < 10) class, non-coding loci were enriched in most data sets compared with coding genes. The exceptions were obesity, breast cancer blood and aging. The fact that these data sets were tightly clustered in our previous comparative analysis of DNA methylomes [5] suggests a lesser contribution of non-coding gene differential methylation in these epigenetically related diseases. Also, we document that the number of non-coding/coding loci ratio is significantly lower in the analysed disease and aging tissues compared with normal counterparts. Based on the recognized importance of non-coding RNAs in gene expression regulation, this observation suggests a limitation of the 450K design. The newly developed MethylationEPIC BeadChip (Infinium) array [3], an enhancer element-enriched version of the 450K covering ∼850 000 CpGs, may at least partly correct that limitation in combination with the application of the PG-based ranking method, given that often enhancers are non-coding RNA transcription sites [14]. On the other hand, the 850K array does not show any substantial difference in PG distribution compared with the 450K version; therefore, the bias addressed here applies to the newer platform (Supplementary Figure S5). Finally, the top-ranking DM-CpGs following the PG-based method have a comparatively low delta-beta, a feature of disease methylome in comparison with normal counterparts, thus leading to a better representation of disease-specific profiles. High PG bias has previously been documented using methylation microarray data from lung cancer and ulcerative colitis as examples [15], and the gometh function dedicated to correcting that bias has only recently been published [11]. When compared with the latter, our proposed method yielded a broader overlap with expression data sets. One main difference between our approach and gometh is that we apply the gene ontology analysis on a set of tissue- and disease-specific DM genes, rather than the whole DM gene list. From a practical viewpoint, the proposed normalization method does not require any specialized bioinformatics knowledge. It uses normalized intensity (beta or M values) and annotated data (gene identity; genomic features) rather than being an automatic raw data-to-enriched pathway identification pipeline, thus allowing a full biological knowledge of the data at every step of the analysis. Our approach is only one of several that could be tested in further studies. For example, a gene-by-gene, multi-probe paired comparison between samples should effectively lower false-negative rates. In summary, the proposed PG-based ranking method corrects biases inherent in currently accepted delta-beta-based methods. The latter are effective in identifying large changes in CpG methylation with robust pathobiological significance, but clearly introduce a potentially significant false-negative rate that masks biologically relevant low PG loci. Our PG-based method yields gene sets with significantly enriched gene function categories that are increased in number and in the degree of overlap with expression data, and show increased disease or tissue specificity, in comparison with the delta-beta-only ranking. Furthermore, it is not biased against low delta-beta loci, which are a distinct feature of disease methylomes. Methods This briefing presents a conceptual analysis of a method for 450K data ranking and illustrates its effectiveness by applying it to previously described data sets [5]. Therefore, methods, data set description, accession numbers for publicly available data and statistical tests are as reported in that reference. Key Points The HumanMethylation450 BeadChip array (450K; Infinium) is a widely used epigenomics tool. We show in several 450K-based studies that ranking is biased in favour of genes represented by a high number of probes, if the extent of differential methylation (delta-beta) is the sole ranking criterion. We propose that introducing the number of called DM-CpGs-to-probe density ratio as a simple additional normalizing criterion yields unbiased data. The effectiveness of the method is illustrated in a large set of publicly available 450K-based data. Supplementary Data Supplementary data are available online at https://academic.oup.com/bfg. Funding This work was supported by the Council for Science and Technology of the Guanajuato state, Mexico (CONCyTEG) [grant no. 08-03-K662-020-A01 to G.L.]; and the Mexican Council for Science and Technology (CONACyT) [Basic Science (Ciencia Básica) grant no. 83401 to G.L., Ph.D. Fellowship no. 334370 to G.A.S.-M. Guillermo A. Silva-Martínez is a PhD student co-directed by the two co-authors at the Irapuato Unit of the Centre for Advanced Research (CINVESTAV), a Mexican governmental institution (www.ira.cinvestav.mx). Silvio Zaina is a Professor (level: ‘Titular A’) at the Department of Medical Sciences, University of Guanajuato (www.ugto.mx), Mexico. Personal website: silviozaina.com. Gertrud Lund is a Researcher at CINVESTAV Irapuato Unit (www.ira.cinvestav.mx), Mexico. G.L. and S.Z. have conducted collaborative work in the epigenetics field for >10 years. References 1 Sandoval J, Heyn H, Moran S, et al.   Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics  2011; 6: 692– 702. Google Scholar CrossRef Search ADS PubMed  2 Bibikova M, Barnes B, Tsan C, et al.   High density DNA methylation array with single CpG site resolution. Genomics  2011; 98: 288– 95. Google Scholar CrossRef Search ADS PubMed  3 Moran S, Arribas C, Esteller M. Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics  2016; 8: 389– 99. Google Scholar CrossRef Search ADS PubMed  4 Du P, Zhang X, Huang CC, et al.   Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics  2010; 11: 587. Google Scholar CrossRef Search ADS PubMed  5 Silva-Martínez GA, Rodríguez-Ríos D, Alvarado-Caudillo Y, et al.   Arachidonic and oleic acid exert distinct effects on the DNA methylome. Epigenetics  2016; 11: 321– 34. Google Scholar CrossRef Search ADS PubMed  6 Lowe R, Slodkowicz G, Goldman N, et al.   The human blood DNA methylome displays a highly distinctive profile compared with other somatic tissues. Epigenetics  2015; 10: 274– 81. Google Scholar CrossRef Search ADS PubMed  7 Jabbari K, Bernardi G. CpG doublets, CpG islands and Alu repeats in long human DNA sequences from different isochore families. Gene  1998; 224: 123– 7. Google Scholar CrossRef Search ADS PubMed  8 Bird A, Taggart M, Frommer M, et al.   A fraction of the mouse genome that is derived from islands of nonmethylated, CpG-rich DNA. Cell  1985; 40: 91– 9. Google Scholar CrossRef Search ADS PubMed  9 Fryxell KJ, Zuckerkandl E. Cytosine deamination plays a primary role in the evolution of mammalian isochores. Mol Biol Evol  2000; 17: 1371– 83. Google Scholar CrossRef Search ADS PubMed  10 Xie L, Weichel B, Ohm JE, et al.   An integrative analysis of DNA methylation and RNA-Seq data for human heart, kidney and liver. BMC Syst Biol  2011; 5: S4. Google Scholar CrossRef Search ADS PubMed  11 Phipson B, Maksimovic J, Oshlack A. missMethyl: an R package for analysing methylation data from Illumina’ s HumanMethylation450 platform. Bioinformatics  2016; 32: 286– 8. Google Scholar PubMed  12 Irvin MR, Zhi D, Joehanes R, et al.   Epigenome-wide association study of fasting blood lipids in the genetics of lipid-lowering drugs and diet network study. Circulation  2014; 130: 565– 72. Google Scholar CrossRef Search ADS PubMed  13 Demerath EW, Guan W, Grove ML, et al.   Epigenome-Wide Association Study (EWAS) of BMI, BMI change, and waist circumference in African American adults identifies multiple replicated loci. Hum Mol Genet  2015; 24: 4464– 79. Google Scholar CrossRef Search ADS PubMed  14 Kim TK, Hemberg M, Gray JM, et al.   Widespread transcription at neuronal activity-regulated enhancers. Nature  2010; 465: 182– 7. Google Scholar CrossRef Search ADS PubMed  15 Geeleher P, Hartnett L, Egan LJ, et al.   Gene-set analysis is severely biased when applied to genome-wide methylation data. Bioinformatics  2013; 29: 1851– 7. Google Scholar CrossRef Search ADS PubMed  © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com

Journal

Briefings in Functional GenomicsOxford University Press

Published: Jan 1, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off