Identifying mislabeled and contaminated DNA methylation microarray data: an extended quality control toolset with examples from GEO

Background: Mislabeled, contaminated or poorly performing samples can threaten power in methylation microarray analyses or even result in spurious associations. We describe a set of quality checks for the popular Illumina 450K and EPIC microarrays to identify problematic samples and demonstrate their application in publicly available datasets.
Methods: Quality checks implemented here include 17 control metrics defined by the manufacturer, a sex check to detect mislabeled sex-discordant samples, and both an identity check for fingerprinting sample donors and a measure of sample contamination based on probes querying high-frequency SNPs. These checks were tested on 80 datasets comprising 8327 samples run on the 450K microarray from the GEO repository.
Results: Nine hundred forty samples were flagged by at least one control metric and 133 samples from 20 datasets were assigned the wrong sex. In a dataset in which a subset of samples appear contaminated with a single source of DNA, we demonstrate that our measure based on outliers among SNP probes was strongly correlated (> 0.95) with another independent measure of contamination.
Conclusions: A more complete examination of samples that may be mislabeled, contaminated, or have poor performance due to technical problems will improve downstream analyses and replication of findings. We demonstrate that quality control problems are prevalent in a public repository of DNA methylation data. We advocate for a more thorough quality control workflow in epigenome-wide association studies and provide a software package to perform the checks described in this work. Reproducible code and supplementary material are available at https://doi.org/10.5281/zenodo.1172730.
Keywords: DNA methylation, Epigenomics, Infinium, 450K, EPIC, Quality control, Contamination, Mislabeling, Data cleaning

*Correspondence: jonathan.heiss@mssm.edu
Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1057, New York, NY 10029, USA

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Heiss and Just Clinical Epigenetics (2018) 10:73

Background

The number of epigenome-wide association studies in epigenetic epidemiology is growing rapidly, facilitated by popular microarray platforms like the Infinium 450K and EPIC chips, which offer broad coverage and precise quantification of DNA methylation. Whereas the literature about preprocessing and statistical analysis of microarray data is extensive [1, 2], the need for upstream quality control (QC) to ensure robust and reproducible results has received much less attention. Epigenome-wide association studies usually start with a quality control step, often involving discarding individual probes or entire samples with too many high detection p values (resulting from a low signal-to-noise ratio of fluorescence intensities), probes with too few beads, or discarding subsets of probes that are considered unreliable based on design features (e.g., those being cross-reactive or close to SNPs). However, there is a lot of heterogeneity in which checks are undertaken and how criteria are applied, as seen in the methods reported across a large consortium of birth cohort studies [3]. Furthermore, quality checks that go beyond this to catch other types of problematic samples are not yet standard procedure. There are many reasons microarray experiments might fail: starting with low-quality DNA input, an incomplete bisulfite conversion, or a failure of the other experimental steps in the Infinium assay. These issues can result in distorted methylation patterns, which may be more or less apparent. Another common source of errors is sample mislabeling, especially in large multi-center studies where the chain of custody of the sample collection process is long and mistakes can happen at every step. Mislabeling creates mismatches between epi- and phenotype, thereby obfuscating genuine associations or even producing spurious ones. The prevalence of mislabeling was recently demonstrated in 70 publicly available gene expression datasets hosted in the Gene Expression Omnibus (GEO) repository: Toker et al. used genes known to be differentially expressed between sexes to infer the sex of the sample donors and to compare it to the recorded metadata. They found that 32 of the datasets contained discrepancies in a subset of the samples [4]. Finally, sample contamination with foreign DNA may arise accidentally in the laboratory or from complex sampling procedures (e.g., contamination of cord blood or placental tissue with maternal blood [5]), and analytic methods are needed to identify and quantify contamination.

Having found mislabeled and contaminated samples in our own DNA methylation datasets, we developed a software package for the R programming language named ewastools, aiming to facilitate quality control and statistical analysis of datasets generated from the Illumina Infinium BeadChip platforms (both 450K and the newer EPIC). In order to test the package functionality, we decided to apply it to a range of datasets from the public Gene Expression Omnibus (GEO) repository. The results show that low-quality samples, mislabeling, and contamination of samples are widespread issues. It is our hope that the ewastools package will help researchers to extend and improve the quality control workflow in epigenome-wide association studies.

Methods

Datasets

The search for DNA methylation datasets was limited to the popular Gene Expression Omnibus repository. We selected datasets meeting the following criteria: they had to be submitted before January 2018; samples had to be run on the Illumina Infinium HumanMethylation450 BeadChip (GEO Accession GPL13534); the sex of the sample donors had to be provided in the metadata; and raw data had to be provided in the form of .idat files. Datasets containing only preprocessed data were not included as they may no longer contain QC probes and the lack of a standard format makes their analysis difficult to automate. While our ewastools package offers the same functionality for the EPIC chip, our selection was limited to the predecessor 450K chip as the two platforms differ in the number of SNP probes, which are essential for the checks described below. Datasets meeting these criteria were identified using the Entrez Programming Utilities (http://eutils.ncbi.nlm.nih.gov/). From this list we manually excluded datasets based on the description and metadata provided in GEO: we dropped datasets involving samples from tumor tissue or cultivated cell lines because tumor cells can show extensive epigenetic mutations related to defects in the DNA methylation maintenance apparatus [6], whereas for the latter it is unclear to what degree methylation profiles reflect the natural/in vivo state of the cell types rather than an epigenome of manipulation [7]; we dropped placenta samples because they could either be of maternal or fetal origin, as well as sperm samples as the cells are haploid; we dropped FFPE (formalin-fixed, paraffin-embedded) samples as their preparation follows a different procedure than for fresh tissue samples and the DNA is often of lower quality; lastly, we excluded datasets measuring 5-hydroxymethyl-cytosine. Eventually, a total of 80 datasets comprising 8327 samples remained, representing a broad range of tissues and cell lines (peripheral blood, cord blood, saliva, liver, muscle, cartilage, etc.).

Quantification of methylation levels

Treating DNA with sodium bisulfite converts cytosine to uracil (read as thymine after amplification), except for 5-methylcytosine, which is protected by the added methyl group. In combination with subsequent whole-genome amplification, the proportion of unmethylated to methylated DNA strands in the DNA input is translated into differences in abundance of distinct PCR (polymerase chain reaction) products, which, for the sake of simplicity, are here still referred to as (un)methylated. The 450K chip employs probes 50 base pairs long that are complementary to the targeted loci. Unmethylated and methylated strands are targeted by separate probes or color channels. Their abundance is quantified by hybridization with the corresponding probes, a subsequent single-base extension step with dye-linked nucleotides, and measurement of the resulting fluorescence intensity. Thus, two data points are available for each CpG site $i$, the intensity for unmethylated and methylated strands, $U_i$ and $M_i$, respectively. Here, intensities were corrected for dye bias using RELIC [8], but not normalized. The proportion of methylated strands was then estimated as $M_i / (M_i + U_i)$, commonly referred to as the β-value.

Quality checks

Four kinds of quality checks are implemented in the ewastools package: an evaluation of control metrics monitoring the various experimental steps such as bisulfite conversion or staining; a sex check comparing the actual sex of the sample donors to the records; an identity check for fingerprinting sample donors; and detection of contaminated samples using outliers among the 65 probes querying high-frequency SNPs.

The first check, implemented by the function control_metrics, evaluates 17 control metrics calculated from dedicated control probes placed on each assay. A description of these metrics, together with default thresholds to flag problematic samples, is provided in the BeadArray Controls Reporter Software Guide available from the Illumina support website. Similar functions to evaluate the control probes visually are provided in the minfi [9], RnBeads [10], and shinyMethyl [11] R packages. Not all flagged samples necessarily failed, nor do these metrics indicate potential upstream issues, e.g., whether the DNA quality was low to begin with. All 8327 samples were screened in this first check.

There are 65 probes placed on the 450K chip querying high-frequency SNPs (with 59 of these on the EPIC chip; their probe identifiers start with “rs”). Just as for CpG sites, a β-value is calculated for each SNP locus, based on fluorescence intensities from two probes targeting either the wild type or the common mutant variant. These β-values usually fall into one of three disjunct clusters, corresponding to the heterozygous and the two homozygous genotypes (AB, AA, or BB). The specific combination of SNPs across these 65 probes serves as a genetic fingerprint: fingerprints of samples from the same donor match but differ between individuals (with the exception of monozygotic twins), thereby enabling one to check for discrepancies with the metadata. Genotype calling is handled by call_genotypes. This function pools β-values of all 65 SNP probes across samples to train a mixture model with four components: three Beta distributions, each representing one genotype, and one uniform distribution representing outliers. Subsequently, posterior probabilities are computed and forwarded to check_snp_agreement. Using the posterior probabilities as soft classification, pairwise agreement of fingerprints is assessed by counting the number of SNP probes for which two samples possess the same genotype, divided by the total number of SNP probes after those classified as outliers were excluded (consistent with the soft genotype calling, a SNP might be partially classified as outlier and therefore be only partially excluded). Mislabeling manifests either as unexpected disagreement between samples that are supposed to come from the same individual, or as unexpected agreement between samples supposed to come from different individuals. A list of such instances, here termed conflicts, is returned by check_snp_agreement. This second quality check was applied to a dataset comprising 150 pairs of monozygotic twins (in total 300 samples and 10 technical replicates, resulting in 47,895 pairwise comparisons).

The total intensity $T_i = U_i + M_i$ has been shown to be sensitive to copy number aberrations [12]. By exploiting the natural difference in allosomal copy number, this can be used to detect sex mismatches. There are 11,232 and 413 probes on the 450K chip targeting the X and Y chromosome, respectively. The function check_sex computes for each sample $n$ the average total intensities of all probes targeting either chromosome, $\bar{T}^X_n$ and $\bar{T}^Y_n$, respectively. In order to account for differences in post-amplification DNA concentration, $\bar{T}^X_n$ and $\bar{T}^Y_n$ are normalized by the average total intensity across all autosomal probes, which leads to more compact clusters in visualizations. Thresholds to discriminate between both sexes are determined by the Hodges-Lehmann estimator (median of all pairwise male/female averages) for $\bar{T}^X_n$ and $\bar{T}^Y_n$ separately. This robust approach was chosen over other sex determination functions provided in the minfi [9], RnBeads [10], and shinyMethyl [11] packages precisely because sex mismatches and allosomal outliers may be present in the dataset. All 8327 samples were screened in this third check.

When exploring the data, one dataset in particular, here referred to as dataset E, caught our attention. Plotting $\bar{T}^X_n$ against $\bar{T}^Y_n$, most samples, aside from a few mislabeled ones, clustered as expected, but a subset of samples from female donors protruded from the cluster center in the direction of the male cluster center (Fig. 3b). Such a pattern is indicative of sample contamination and, more specifically, is compatible with the hypothesis of the foreign DNA coming from a male source. This does not imply that only female samples were affected: in the case of a male/male DNA mix, both allosomes would still show a methylation profile typical for males; only in the case of a female/male mix would both allosomes show methylation profiles atypical for either males or females. With increasing degree of contamination, such samples would be further away from the female cluster center and closer to the male cluster center.

In order to confirm this hypothesis, we turned again to the 65 SNP probes. Normally, their β-values fall, according to the underlying genotype, into one of three disjunct clusters. In dataset E, however, many β-values fell in between these three clusters (plotting their histogram would show three peaks no longer completely disjunct), a pattern one would expect to see when mixing two genotypes (the same way β-values of heterozygous AB individuals scatter around 0.5, as they possess a 50:50 mixture of A and B alleles). We translated our hypothesis into a generative statistical model with the following likelihood function

$$\prod_{n=1}^{N} \prod_{j=1}^{65} \beta_{nj} \sim \mathcal{N}\left((1-\gamma_n)\cdot\mu_{k_{nj}} + \gamma_n\cdot c_j,\ (1-\gamma_n)\cdot\sigma_{k_{nj}}\right)$$

and parameters $n \in [1, N]$ (sample index), $j \in [1, 65]$ (SNP probe index), $k \in \{AA, AB, BB\}$ (the three possible genotypes), $k_{nj}$ (genotype of sample $n$ for probe $j$), $\beta_{nj}$ (methylation level of sample $n$ for SNP probe $j$), $\sigma_k$ (standard deviation for β-values from genotype $k$), $c_j$ (methylation level of SNP probe $j$ in foreign DNA), and $\gamma_n$ (degree of contamination of sample $n$). The parameters were estimated using the Metropolis algorithm implemented in the Julia programming language (code is provided in the supplement). Comparing $\gamma_n$ (degree of contamination) to $\bar{T}^Y_n$ for female samples allowed us to test if both agreed on ranking samples from least to most contaminated. This check was applied only to dataset E.

Our precise quantification of sample contamination in dataset E required the assumption that the foreign DNA came from a single source, which may be an unusual scenario. A simpler and more general test of sample contamination that can be applied to any dataset is implemented in the function snp_outliers. This function, again using the output of call_genotypes, computes $O_n$, the average log odds from the 65 posterior probabilities of the outlier component of the mixture model described above. Thus, $O_n$ captures how irregular the SNP β-values of sample $n$ are, i.e., how much they deviate from the ideal trimodal distribution. $O_n$ was compared to $\gamma_n$ in dataset E in order to confirm that it did indeed capture sample contamination. Subsequently, all 8327 samples were screened in this fourth check.

Results

Figure 1 shows the distribution of the “Bisulfite Conversion II” control metric, used to monitor successful bisulfite conversion, across the 80 datasets. Three hundred seventy-seven samples from 17 datasets fell below the default threshold specified by Illumina. Overall, 940 samples (11%) from 41 datasets (51%) were flagged by at least one of the 17 metrics in control_metrics, leaving 7387 samples that passed this first check. A summary is provided in Table 1. Many of these problematic samples would be overlooked when filtering out only samples with too many undetected probes or low overall fluorescence intensity, two popular criteria. Out of the 940 flagged samples, 432 had >1% of probes with a detection p value above 0.01, 217 were flagged by the getQC function from the minfi package, and 541 were flagged by at least one of these two criteria.

Figure 2 shows the result of the identity check in the dataset comprising 150 pairs of monozygotic twins. Pairwise agreement of SNP fingerprints between samples ranged from 0.96 to 1.00 for the 187 twin pairs (the number is larger than 150 because of the technical replicates) and from 0.22 to 0.66 for the 47,708 non-twin pairs, thereby perfectly segregating both groups. For comparison, genotype calling by k-means clustering, as implemented in the wateRmelon package [13], produced non-identical fingerprints in 11 out of the 187 twin pairs, with one twin pair showing as many as 9 discordant SNPs.

The function check_sex estimated that 133 of the 8327 samples had been assigned the wrong sex, yielding an error rate of 1.6%, with 13 cases being unclear, as the inferred sex was discordant between $\bar{T}^X_n$ and $\bar{T}^Y_n$ (Fig. 3a). Of the 80 datasets, 20 (25%) contained these sex-mismatched samples. These were unevenly distributed: the three highest error rates per study were 55% (#mistakes 38/#total sample size 69), 45% (5/11), and 34% (15/44). Excluding all samples that failed the control_metrics check and might have technical

[Figure: scatter plot not reproduced; the only surviving labels are axis ticks at 1.0, 5.0, 10.0, and 20.0]
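The threshold rule used by the sex check in the Methods (the median of all pairwise male/female averages, computed separately for the X and Y intensities) can be sketched as follows. This is a minimal illustration in Python, not the R code behind check_sex; the function names hodges_lehmann_threshold and infer_sex, the rule for combining the two thresholds, and all intensity values are assumptions made for the example.

```python
from itertools import product
from statistics import median


def hodges_lehmann_threshold(male_values, female_values):
    """Median of all pairwise male/female averages (Hodges-Lehmann estimator)."""
    return median((m + f) / 2 for m, f in product(male_values, female_values))


def infer_sex(x_norm, y_norm, x_cut, y_cut):
    """Return "f", "m", or None when the X and Y intensities disagree."""
    x_says_female = x_norm > x_cut  # two X copies give a higher X intensity
    y_says_male = y_norm > y_cut    # a Y chromosome gives a higher Y intensity
    if x_says_female and not y_says_male:
        return "f"
    if y_says_male and not x_says_female:
        return "m"
    return None  # discordant evidence, left unclear


# Hypothetical normalized intensities, grouped by the recorded sex.
male_x, female_x = [0.80, 0.82, 0.85], [1.00, 1.02, 1.05]
male_y, female_y = [0.40, 0.45, 0.50], [0.05, 0.06, 0.10]

x_cut = hodges_lehmann_threshold(male_x, female_x)
y_cut = hodges_lehmann_threshold(male_y, female_y)

# A sample recorded as male whose intensities look female would be flagged.
print(infer_sex(1.01, 0.07, x_cut, y_cut))
```

Because the threshold is a median over all cross-group pairs, a handful of samples with a wrong recorded sex shifts it only slightly, which is the robustness property the text relies on.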
+ + + + + + + + + + + + + + + + + + + + +++ ++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 0.5 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 0.1 + #17 #80 GEO datasets sorted by minimum bisulfite conversion ratio Fig. 1 Distribution of a control metric monitoring bisulfite conversion. Samples below the manufacturer’s suggested threshold of 1 might be incompletely converted, leading to inaccurate estimates of methylation levels Bisulfite Conversion II (unitless ratio) Heiss and Just Clinical Epigenetics (2018) 10:73 Page 5 of 9 Table 1 Summary of the number of samples flagged by each of from least to most contaminated. Correlation was weaker the 17 control metrics outlined in the BeadArray controls reporter ¯ for T (− 0.828, 95% CI − 0.898,− 0.748), suggesting that software guide. n.a.—not available ¯ T was the more sensitive metric of contamination in Metric Passed Flagged n.a. this situation. Males were not included in the calcula- tion of these correlation coefficients because contamina- Restoration 8326 0 1 Y X ¯ ¯ tion would not affect T nor T if our hypothesis of n n Staining green 7885 10 432 contamination coming from a single male source were Staining red 7764 14 549 correct. Even though γ was non-zero for most sam- Extension green 8324 3 0 ples, we assume that only a subset of the samples were Extension red 8326 1 0 indeed contaminated, as those with the largest γ tended Hybridization high/medium 8326 1 0 to be allocated next to each other on the 450K chips (not shown). 
γ and O for the same set of samples were Hybridization medium/low 8327 0 0 n n strongly correlated as well (Pearson’s correlation coeffi- Target removal 1 8327 0 0 cient 0.958, 95% bootstrapped CI 0.948,0.966), even when Target removal 2 8327 0 0 including males (0.954, 95% CI 0.944,0.961; Fig. 4b), sug- Bisulfite conversion I green 8317 10 0 gesting that in this dataset O , the average log odds of Bisulfite conversion I red 8208 119 0 SNP probes being outliers, was a proxy for sample con- Bisulfite conversion II 7950 377 0 tamination not contingent on the sex of the sample donor or source of contamination (with the exception of con- Specificity I green 8323 4 0 taminating DNA coming from another tissue of the same Specificity I red 8279 48 0 donor). Specificity II 8326 1 0 While the importance of a metric monitoring critical Non-polymorphic green 7677 558 92 laboratory steps such as bisulfite conversion is obvious, Non-polymorphic red 7917 318 92 the relevance of Illumina provided default thresholds for other metrics is less clear. Figure 5 shows the dis- Any of the above 7387 940 – tribution of O among all samples that were flagged by the control_metrics checks versus the remain- errors in measurement did not change sex-mismatched der that passed. We use O here as a measure of poor error rates substantially: out of the 7387 samples, 122 technical performance, rather than a measure of sam- (1.7%) had been assigned the wrong sex with 10 unclear ple contamination, as the former would also contribute cases. Comparing with the predictions provided from to a deviation of the SNP probes from the ideal tri- minfi, there were, apart from the 13 unclear cases men- modal distribution. Flagged samples had in general higher tioned above, 66 female (according to metadata) sam- values of O , indicating that such samples are indeed ples from six datasets that were correctly classified by of concern. ewastools but misclassified by minfi. 
Estimating the degree of contamination γ for each sam- Discussion ple in dataset E,wefound that γ and T were strongly The topic of 450K data quality has been addressed before. correlated among females (Pearson’s correlation coeffi- Among the issues being discussed are cross-reactive cient 0.953, 95% bootstrapped CI 0.934,0.965; Fig. 4a), probes and probes possibly affected by nearby SNPs confirming that both metrics agreed on ranking samples [14, 15], high detection p values and batch effects [16]. While these publications focus on probes with low fluo- rescence intensities or that are in general unreliable, or issues that affect ensembles of samples, our work is in contrast mainly concerned with the identification of indi- Unrelated vidual problematic samples resulting from failed exper- iments, mislabeling or contamination. Due to the often small effect sizes, epigenome-wide association studies are Twins sensitive to such samples as they often present as out- liers. Robust regression methods mitigate the impact of 0.00 0.25 0.50 0.75 1.00 spurious outliers but are computationally intensive due Agreement score to the high dimension of the data while least squares or Fig. 2 Pairwise agreement scores of genetic fingerprints in a dataset maximum likelihood estimation remain popular choices. of monozygotic twins. There is a perfect segregation between twin Finding and removing problematic samples during qual- and non-twin pairs ity control is therefore an important first step of every Heiss and Just Clinical Epigenetics (2018) 10:73 Page 6 of 9 a b 1.2 1.2 Male Female Correct Mislabeled 1.0 1.0 Unclear 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0.6 0.8 1.0 1.2 0.6 0.8 1.0 1.2 Normalized X chromosome intensities Normalized X chromosome intensities Fig. 3 Average fluorescence intensities of probes targeting the X chromosome (x-axis) or targeting the Y chromosome (y-axis), each normalized versus the average fluorescence of autosomal probe intensities per sample. 
Dotted lines represent the Hodges-Lehmann estimators separating the male and female cluster centers. Samples that are discordant for sex relative to their metadata annotation are considered mislabeled and shown in red. Samples falling in the lower left or upper right quadrant are considered “unclear”. a All 8327 samples from all 80 datasets. b A single dataset (dataset E) with a spread in the female cluster indicating varying degrees of contamination with male DNA epigenome-wide association study. We conducted a sur- As demonstrated on the example of monozygotic twins, vey of publicly available DNA methylation datasets to see the check_snp_agreement function does—at least in whether they suffer from the same quality issues that have the absence of other issues—perfectly predict whether been reported for gene expression datasets. Assuming two samples come from the same person or not. The that all samples included in datasets uploaded to the GEO function is robust against SNP outliers, due to the soft repository were used in associated analyses, our results classification scheme, whereas hard classification of geno- indicate that the current practice of quality control fails to types as produced by k-means clustering or the use of detect many problematic samples that have the potential fixed cutpoints in the β-value distribution can result to severely bias findings. in unexpected genotype mismatches. It is worth point- Eleven percent of samples were flagged by at least one ing out that the genetic fingerprint is the only way to of the 17 control metrics defined by the microarray man- detect mislabeling if a sample swap did not result in ufacturer. This does not mean that every flagged sample any apparent epitype/phenotype mismatch (e.g., two sam- features inaccurate methylation levels and it is unclear ples from the same sex). 
Mislabeling results in conflicts how Illumina’s default thresholds were set and whether (unexpected disagreement between samples that are sup- the resulting dichotomization is appropriate for flagging posed to come from the same individual or unexpected samples in all conditions. However, samples flagged by agreement between samples that are supposed to come one or more criteria in control_metrics had substan- from different individuals), but it might be necessary to tially more outliers among the normally well-behaved SNP build upon further evidence in order to resolve conflicts probes, an indication of low data quality, and therefore and reassign the correct identities or even to narrow down these samples require closer attention. In the specific case which of the samples in conflict are the mislabeled ones. of monitoring bisulfite conversion, Zhou et al. suggested Furthermore, the utility of check_snp_agreement is a more robust alternative to using the dedicated control limited for datasets that feature only a single sample probes [17]. per person. Normalized Y chromosome intensities Heiss and Just Clinical Epigenetics (2018) 10:73 Page 7 of 9 Non−polymorphic Red n = 301 Bisulfite Conversion I Red Non−polymorphic Green 0.75 Bisulfite Conversion I Green Staining Green 8 Bisulfite Conversion II 377 0.50 Specificity I Green 4 Specificity I Red 48 Staining Red 14 Specificity II 1 0.25 Passed 6619 −4 −2 0 2 010 20 30 40 Average log odds b Fig. 5 Boxplots of O (average log odds of being an outlier across the 65 SNP probes) as a measure of low technical performance for samples flagged by any of the 17 Illumina control metrics and −1 samples passing all of them. Metrics are ordered by median O with the number of samples in each non-exclusive category indicated on the right. 
Flagged samples have in general higher values of O than samples that passed all checks, indicating that flagged samples are −2 more likely to have poor performance characteristics −3 features should be conducted. Regarding the few unclear cases in which the sex of the sample donors could not −4 Male conclusively be inferred, we suspect that most of these Female samples suffer from other, possibly upstream issues and Correct Mislabeled should be excluded. A few, however, might represent chro- −5 mosomal disorders, e.g. Klinefelter syndrome, which has an reported incidence around 1 per 576 [18]. Because of their XXY genotype, individuals with Klinefelter syn- 0 10203040 drome, who are phenotypically male, would in Fig. 3 be Estimated contamination in % located in the center of the upper right quadrant (for T Fig. 4 Evidence of contamination in dataset E. a γ , the estimated ¯ on par with females, for T on par with males). degree of contamination, and T are strongly correlated among females with a Pearson’s correlation coefficient r = 0.953. b γ and Some samples showed evidence of contamination, espe- O , the average log odds of SNP probes being outliers, are strongly cially those belonging to dataset E:weconstructed two correlated as well (r = 0.954), including both males and females measures of sample contamination based on two sub- sets of probes, using either the average total intensi- ties of probes targeting the Y chromosome T or the In contrast, check_sex can be applied regardless of β-valuesofthe 65 SNPprobes(γ ). The fact that both the number of samples available for each person. In our measures exhibited very high agreement, even though survey 1.6% of samples coming from 25% of the datasets they were based on independent evidence and completely were assigned the wrong sex. The actual rate of mis- different principles, lends credence to our hypothesis of labeled samples is likely higher because sample swaps a single contamination source. 
In contrast, a strong cor- between donors of the same sex would have gone unde- relation between γ and O (the average log odds of n n tected. Assuming a balanced sex ratio and random mis- being an outlier across all 65 SNP probes) was to be labeling, only half of the potential mistakes would be expected, as both are derived from the same data. O captured by this test alone. If applicable, more compre- is not a perfect proxy of contamination, and deviations hensive checks testing for correct tissue types and other from the trimodal distribution of β-values might also be Normalized Y chromosome intensities Average log odds Heiss and Just Clinical Epigenetics (2018) 10:73 Page 8 of 9 caused by other issues, as evident when comparing sam- Conclusion ples passing and failing the control_metrics check. Beyond the obvious measurement of methylation, a This does however not diminish the value of O as an multitude of information can be inferred from high- overall sample quality indicator. This metric is easy to dimensional DNA methylation data, information that compute using our mixture modeling approach and can can be checked for agreement with recorded covari- be applied regardless of the sex of the sample donors. ates. This includes the use of principal component Unlike a recently proposed metric of contamination of analysis to create lower-dimensional representations for cord blood with maternal blood based on methylation discriminating between tissue types; comparing chrono- probes [5], our measure snp_outliers is not depen- logical and epigenetic age [19]; the estimation of cell dent on the tissue type of the sample and contaminat- proportions for blood samples, which can be as pre- ing DNA. Because the proportion of the 65 SNP probes cise as actual blood cell counts [20]; or any other expected to differ in a contaminated sample is a func- checks that might apply to the specific dataset at hand. 
tion of both the relative proportion of contamination We demonstrated in this work the high prevalence and the number of SNPs for which the contaminating of failed and mislabeled samples in DNA methylation sample differs in genotype, it was not possible to deter- datasets and recommend that epigenome-wide associa- mine a single cutoff for O above which samples should tion studies should start with comprehensive quality con- be classified as contaminated and excluded from fur- trol. With ewastools, we provide a software package ther analysis. Judging by Fig. 4b however, we inter- for the popular R programming language to conduct the pret that removing samples with an average log odds quality checks described here. Existing R scripts do not of − 4 represents a reasonable choice. Other authors have to be changed in order to incorporate these tests into have suggested to use the SNP probes as quality indi- existing analytic workflows. In addition, we recommend cator before: Pidsley et al. proposed a metric quanti- that researchers seeking to make their DNA methylation fying the standard deviation of SNP probes in order data available should also upload raw data in the form to benchmark normalization methods for 450K datasets of .idat files. Taken together these steps will improve the [13], but did not evaluate the quality of individual reproducibility of epigenome-wide association studies. samples. There are some limitations of our work. We did Abbreviations not check the associated publications for whether the 450K: Illumina Infinium HumanMethylation450 BeadChip; CI: Confidence interval; EPIC: Illumina Infinium MethylationEPIC BeadChip; GEO: Gene authors of the original studies mentioned flagging or expression omnibus; QC: Quality control; SNP: Single nucleotide polymorphism exclusion from downstream analyses of any samples from their datasets uploaded to the GEO repository. 
Acknowledgements We advocate the inclusion of a simple tag in the meta- We would like to thank Elena Colicino and Stefanie Busgang for their feedback on the manuscript and Brook Frye for code review. data on GEO to indicate samples excluded in qual- ity control steps, although no such indication was Funding seen in the metadata of the studies we reviewed. Fur- This work was supported by NIH grants R00ES023450 and P30ES023515. thermore, our analysis is demonstrated exclusively on Availability of data and materials the Illumina Infinium HumanMethylation450 BeadChip, R and Julia code to reproduce all analyses and results presented here, including although the functions in our ewastools package work retrieval of public methylation datasets, as well as other supplementary equally well on the newer Illumina Infinium Methyla- material, are available at https://doi.org/10.5281/zenodo.1172730.The ewastools R package can be found at https://github.com/hhhh5/ewastools. tionEPIC BeadChip (commonly called 850K chip) for which far fewer datasets are currently available on Authors’ contributions GEO. While we excluded certain types of tissues in JAH performed analyses and wrote the manuscript. ACJ conceived and order to get a handle on the heterogeneity of datasets, supervised the project and contributed to the manuscript. Both authors read and approved the final manuscript. this does not mean that these QC checks cannot be applied to those tissues, such as checking maternal con- Ethics approval and consent to participate tamination of fetal placenta. The selection of datasets Not applicable. was also restricted to those for which raw data were Competing interests available, as this was needed to automate our QC test- The authors declare that they have no competing interests. ing, and this subset may not be representative of the Publisher’s Note entirety of > 1000 Illumina 450K methylation datasets Springer Nature remains neutral with regard to jurisdictional claims in on GEO. 
Nonetheless, we feel that our results indi- published maps and institutional affiliations. cate the need for additional QC checks as data quality Received: 16 March 2018 Accepted: 16 May 2018 issues appear prevalent in publicly available methylation datasets. Heiss and Just Clinical Epigenetics (2018) 10:73 Page 9 of 9 References 1. Dedeurwaerder S, Defrance M, Bizet M, Calonne E, Bontempi G, Fuks F. A comprehensive overview of Infinium HumanMethylation450 data processing. Brief Bioinform. 2014;15(6):929–41. 2. Morris TJ, Beck S. Analysis pipelines and packages for Infinium HumanMethylation450 BeadChip (450k) data. Methods. 2015;72:3–8. 3. Felix JF, Joubert BR, Baccarelli AA, Sharp GC, Almqvist C, Annesi-Maesano I, et al. Cohort profile: pregnancy and childhood epigenetics (PACE) Consortium. Int J Epidemiol. 2017;47(1):22–23u. 4. Toker L, Feng M, Pavlidis P. Whose sample is it anyway? Widespread misannotation of samples in transcriptomics studies. F1000Res. 2016;5:2103. 5. Morin AM, Gatev E, McEwen LM, MacIsaac JL, Lin DTS, Koen N, et al. Maternal blood contamination of collected cord blood can be identified using DNA methylation at three CpGs. Clin Epigenet. 2017;9:75. 6. Dawson MA. The cancer epigenome: Concepts, challenges, and therapeutic opportunities. Science. 2017;355(6330):1147–52. 7. Nestor CE, Ottaviano R, Reinhardt D, Cruickshanks HA, Mjoseng HK, McPherson RC, et al. Rapid reprogramming of epigenetic and transcriptional profiles in mammalian culture systems. Genome Biol. 2015;16:11. 8. Xu Z, Langie SA, De Boever P, Taylor JA, Niu L. RELIC: a novel dye-bias correction method for Illumina Methylation BeadChip. BMC Genomics. 2017;18(1):4. 9. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30(10):1363–9. 10. Assenov Y, Muller F, Lutsik P, Walter J, Lengauer T, Bock C. 
Comprehensive analysis of DNA methylation data with RnBeads [Journal Article]. Nat Methods. 2014;11(11):1138–40. Available from: https://www. ncbi.nlm.nih.gov/pubmed/25262207. 11. Fortin JP, Fertig E, Hansen K. shinyMethyl: interactive quality control of Illumina 450k DNA methylation arrays in R. F1000Res. 2014;3:175. 12. Feber A, Guilhamon P, Lechner M, Fenton T, Wilson GA, Thirlwell C, et al. Using high-density DNA methylation arrays to profile copy number alterations. Genome Biol. 2014;15(2):R30. 13. Pidsley R, CC YW, Volta M, Lunnon K, Mill J, Schalkwyk LC. A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics. 2013;14:293. 14. Zhang X, Mu W, Zhang W. On the analysis of the Illumina 450k array data: probes ambiguously mapped to the human genome. Front Genet. 2012;3:73. 15. Chen YA, Lemire M, Choufani S, Butcher DT, Grafodatskaya D, Zanke BW, et al. Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics. 2013;8(2):203–9. 16. Lehne B, Drong AW, Loh M, Zhang W, Scott WR, Tan ST, et al. A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies. Genome Biol. 2015;16:37. 17. Zhou W, Laird PW, Shen H. Comprehensive characterization, annotation and innovative use of Infinium DNA methylation BeadChip probes. Nucleic Acids Res. 2017;45(4):e22. Available from: https://www.ncbi.nlm. nih.gov/pubmed/27924034. 18. Nielsen J, Wohlert M. Chromosome abnormalities found among 34,910 newborn children: results from a 13-year incidence study in Arhus, Denmark. Hum Genet. 1991;87(1):81–3. 19. Horvath S. DNA methylation age of human tissues and cell types. Genome Biol. 2013;14(10):R115. 20. Heiss JA, Breitling LP, Lehne B, Kooner JS, Chambers JC, Brenner H. Training a model for estimating leukocyte composition using whole-blood DNA methylation and cell counts as reference. Epigenomics. 
2017;9(1):13–20. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Clinical Epigenetics Springer Journals

Identifying mislabeled and contaminated DNA methylation microarray data: an extended quality control toolset with examples from GEO

Publisher: Springer Journals
Copyright: © 2018 by The Author(s)
Subject: Biomedicine; Human Genetics; Gene Function
ISSN: 1868-7075
eISSN: 1868-7083
DOI: 10.1186/s13148-018-0504-1
Abstract

Background: Mislabeled, contaminated or poorly performing samples can threaten power in methylation microarray analyses or even result in spurious associations. We describe a set of quality checks for the popular Illumina 450K and EPIC microarrays to identify problematic samples and demonstrate their application in publicly available datasets.

Methods: Quality checks implemented here include 17 control metrics defined by the manufacturer, a sex check to detect mislabeled sex-discordant samples, and both an identity check for fingerprinting sample donors and a measure of sample contamination based on probes querying high-frequency SNPs. These checks were tested on 80 datasets comprising 8327 samples run on the 450K microarray from the GEO repository.

Results: Nine hundred forty samples were flagged by at least one control metric and 133 samples from 20 datasets were assigned the wrong sex. In a dataset in which a subset of samples appear contaminated with a single source of DNA, we demonstrate that our measure based on outliers among SNP probes was strongly correlated (> 0.95) with another independent measure of contamination.

Conclusions: A more complete examination of samples that may be mislabeled, contaminated, or have poor performance due to technical problems will improve downstream analyses and replication of findings. We demonstrate that quality control problems are prevalent in a public repository of DNA methylation data. We advocate for a more thorough quality control workflow in epigenome-wide association studies and provide a software package to perform the checks described in this work. Reproducible code and supplementary material are available at https://doi.org/10.5281/zenodo.1172730.
Keywords: DNA methylation, Epigenomics, Infinium, 450K, EPIC, Quality control, Contamination, Mislabeling, Data cleaning Background control (QC) to ensure robust and reproducible results The number of epigenome-wide association studies in has received much less attention. Epigenome-wide asso- epigenetic epidemiology is growing rapidly, facilitated by ciation studies usually start with a quality control step, popular microarray platforms like the Infinium 450K and often involving discarding individual probes or entire EPIC chips, which offer broad coverage and precise quan- samples with too many high detection p values (resulting tification of DNA methylation. Whereas the literature from a low signal-to-noise ratio of fluorescence intensi- about preprocessing and statistical analysis of microar- ties), probes with too few beads, or discarding subsets of raydataisextensive [1, 2], the need for upstream quality probes that are considered unreliable based on design fea- tures (e.g., those being cross-reactive or close to SNPs). *Correspondence: jonathan.heiss@mssm.edu However, there is a lot of heterogeneity in which checks Department of Environmental Medicine and Public Health, Icahn School of are undertaken and how criteria are applied, as seen Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1057, New York, NY 10029, USA in the methods reported across a large consortium of © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. 
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Heiss and Just Clinical Epigenetics (2018) 10:73 Page 2 of 9 birth cohort studies [3]. Furthermore, quality checks that the lack of a standard format makes their analysis diffi- go beyond to catch other types of problematic samples cult to automate. While our ewastools package offers are not yet standard procedure. There are many reasons the same functionality for the EPIC chip, our selection microarray experiments might fail: starting with low- was limited to the predecessor 450K chip as the two plat- quality DNA input, an incomplete bisulfite conversion, or forms differ in the number of SNP probes which are a failure of the other experimental steps in the Infinium essential for the checks described below. Datasets meeting assay. These issues can result in distorted methylation pat- these criteria were identified using the Entrez Program- terns, which may be sometimes more or less apparent. ming Utilities (http://eutils.ncbi.nlm.nih.gov/). From this Another common source of errors is sample mislabeling, list we manually excluded datasets based on the descrip- especially in large multi-center studies where the chain tion and metadata provided in GEO: we dropped datasets of custody of the sample collection process is long and involving samples from tumor tissue or cultivated cell mistakes canhappenatevery step.Mislabeling creates lines because tumor cells can show extensive epigenetic mismatches between epi- and phenotype, thereby obfus- mutations related to defects in the DNA methylation cating genuine associations or even producing spurious maintenance apparatus [6], whereas for the latter it is ones. 
The prevalence of mislabeling was recently demon- unclear to what degree methylation profiles reflect the strated in 70 publicly available gene expression datasets natural/in vivo state of the cell types rather than an hosted in the Gene Expression Omnibus (GEO) reposi- epigenome of manipulation [7]; we dropped placenta sam- tory: Toker et al. used genes known to be differentially ples because they could either be of maternal or fetal expressed between sexes to infer the sex of the sample origin, as well as sperm samples as the cells are haploid; donors and to compare it to the recorded metadata. They we dropped FFPE (formalin-fixed, paraffin-embedded) found that 32 of the datasets contained discrepancies in samples as their preparation follows a different proce- a subset of the samples [4]. Finally, sample contamination dure than for fresh tissue samples and the DNA is often with foreign DNA may arise accidentally in the laboratory of lower quality; lastly, we excluded datasets measuring or from complex sampling procedures (e.g., contamina- 5-hydroxymethyl-cytosine. Eventually, a total of 80 tion of cord blood or placental tissue with maternal blood datasets comprising 8327 samples remained, representing [5]) and analytic methods are needed to identify and a broad range of tissues and cell lines (peripheral blood, quantify contamination. cord blood, saliva, liver, muscle, cartilage, etc.). Having found mislabeled and contaminated samples in our own DNA methylation datasets, we developed a soft- Quantification of methylation levels ware package for the R programming language named Treating DNA with sodium bisulfite converts cytosine to ewastools aiming to facilitate quality control and sta- thymine except for 5-methylcytosine, which is protected tistical analysis of datasets generated from the Illumina by the added methyl-group. In combination with sub- Infinium BeadChip platforms (both 450K and the newer sequent whole-genome amplification, the proportion of EPIC). 
In order to test the package functionality, we unmethylated to methylated DNA strands in the DNA decided to apply it to a range of datasets from the public input is translated into differences in abundance of dis- Gene Expression Omnibus (GEO) repository. The results tinct PCR (polymerase chain reaction) products, which, show that low-quality samples, mislabeling, and contami- for the sake of simplicity, are here still referred to as nation of samples are widespread issues. It is our hope that (un)methylated. The 450K chip employs 50 base pairs the ewastools package will help researchers to extend long probes complimentary to the targeted loci. Unmethy- and improve the quality control workflow in epigenome- lated and methylated strands are targeted by separate wide association studies. probes or color channels. Their abundance is quanti- fied by hybridization with the corresponding probes, a Methods subsequent single-base extension step with dye-linked Datasets nucleotides and measurement of the resulting fluores- The search for DNA methylation datasets was limited cence intensity. Thus, two data points are available for to the popular Gene Expression Omnibus repository. We each CpG site i, the intensity for unmethylated and methy- selected datasets meeting the following criteria: they had lated strands, U and M , respectively. Here, intensities i i to be submitted before January 2018; samples had to were corrected for dye bias using RELIC [8], but not nor- be run on the Illumina Infinium HumanMethylation450 malized. The proportion of methylated strands was then BeadChip (GEO Accession GPL13534); the sex of the estimated as , commonly referred to as the β-value. M +U i i sample donors had to be provided in the metadata; and raw data had to be provided in the form of .idat files. 
Datasets containing only preprocessed data were not included as they may no longer contain QC probes and

Quality checks

Four kinds of quality checks are implemented in the ewastools package: an evaluation of control metrics monitoring the various experimental steps such as bisulfite conversion or staining; a sex check comparing the actual sex of the sample donors to the records; an identity check for fingerprinting sample donors; and detection of contaminated samples using outliers among the 65 probes querying high-frequency SNPs.

The first check, implemented by the function control_metrics, evaluates 17 control metrics calculated from dedicated control probes placed on each assay. A description of these metrics, together with default thresholds to flag problematic samples, is provided in the BeadArray Controls Reporter Software Guide available from the Illumina support website. Similar functions to evaluate the control probes visually are provided in the minfi [9], RnBeads [10] and shinyMethyl [11] R packages. Not all flagged samples have necessarily failed, nor do these metrics indicate potential upstream issues, e.g. whether the DNA quality was low to begin with. All 8327 samples were screened in this first check.

There are 65 probes placed on the 450K chip querying high-frequency SNPs (with 59 of these on the EPIC chip; their probe identifiers start with “rs”). Just as for CpG sites, a β-value is calculated for each SNP locus, based on fluorescence intensities from two probes targeting either the wild type or the common mutant variant. These β-values usually fall into one of three disjunct clusters, corresponding to the heterozygous and the two homozygous genotypes (AB, AA, or BB). The specific combination of SNPs across these 65 probes serves as a genetic fingerprint: fingerprints of samples from the same donor match but differ between individuals (with the exception of monozygotic twins), thereby enabling one to check for discrepancies with the metadata. Genotype calling is handled by call_genotypes. This function pools β-values of all 65 SNP probes across samples to train a mixture model with four components: three Beta distributions, each representing one genotype, and one uniform distribution representing outliers. Subsequently, posterior probabilities are computed and forwarded to check_snp_agreement. Using the posterior probabilities as soft classification, pairwise agreement of fingerprints is assessed by counting the number of SNP probes for which two samples possess the same genotype, divided by the total number of SNP probes after those classified as outliers were excluded (consistent with the soft genotype calling, a SNP might be partially classified as outlier and therefore be only partially excluded). Mislabeling manifests either as unexpected disagreement between samples that are supposed to come from the same individual, or as unexpected agreement between samples supposed to come from different individuals. A list of such instances, here termed conflicts, is returned by check_snp_agreement. This second quality check was applied to a dataset comprised of 150 pairs of monozygotic twins (in total 300 samples and 10 technical replicates resulting in 47,895 pairwise comparisons).

The total intensity $T_i = U_i + M_i$ has been shown to be sensitive to copy number aberrations [12]. By exploiting the natural difference in allosomal copy number, this can be used to detect sex-mismatches. There are 11,232 and 413 probes on the 450K chip targeting the X and Y chromosome, respectively. The function check_sex computes for each sample $n$ the average total intensities of all probes targeting either chromosome, $\bar{T}^X_n$ and $\bar{T}^Y_n$, respectively. In order to account for differences in post-amplification DNA concentration, $\bar{T}^X_n$ and $\bar{T}^Y_n$ are normalized by the average total intensity across all autosomal probes, which leads to more compact clusters in visualizations. Thresholds to discriminate between both sexes are determined by the Hodges-Lehmann estimator (median of all pairwise male/female averages) for $\bar{T}^X_n$ and $\bar{T}^Y_n$ separately. This robust approach was chosen over other sex determination functions provided in the minfi [9], RnBeads [10] and shinyMethyl [11] packages exactly because of the potential of sex-mismatches and allosomal outliers being present in the dataset. All 8327 samples were screened in this third check.

When exploring the data, one dataset in particular, here referred to as dataset E, caught our attention. Plotting $\bar{T}^X_n$ against $\bar{T}^Y_n$, most samples (aside from a few mislabeled ones) clustered as expected, but a subset of samples from female donors were protruding from the cluster center in the direction of the male cluster center (Fig. 3b): such a pattern is indicative of sample contamination and, more specifically, it is compatible with the hypothesis of the foreign DNA coming from a male source. This does not imply that only female samples were affected: in the case of a male/male DNA mix, both allosomes would still show a methylation profile typical for males; only in the case of a female/male mix would both allosomes show methylation profiles atypical for either males or females. With increasing degree of contamination, such samples would be further away from the female cluster center and closer to the male cluster center.

In order to confirm this hypothesis, we turned again to the 65 SNP probes. Normally, their β-values fall, according to the underlying genotype, into one of three disjunct clusters. In dataset E however, many β-values fell in-between these three clusters (plotting their histogram would show three peaks no longer completely disjunct), a pattern one would expect to see when mixing two genotypes (the same way β-values of heterozygous AB individuals scatter around 0.5, as they possess a 50:50 mixture of A and B alleles). We translated our hypothesis into a generative statistical model with the following likelihood function

$$\prod_{n=1}^{N} \prod_{j=1}^{65} \beta_{nj} \sim \mathcal{N}\left( (1 - \gamma_n) \cdot \mu_{k_{nj}} + \gamma_n \cdot c_j,\; (1 - \gamma_n) \cdot \sigma_{k_{nj}} \right)$$

and parameters $n \in [1, N]$ (sample index), $j \in [1, 65]$ (SNP probe index), $k \in \{AA, AB, BB\}$ (the three possible genotypes), $k_{nj}$ (genotype of sample $n$ for probe $j$), $\beta_{nj}$ (methylation level of sample $n$ for SNP probe $j$), $\sigma_k$ (standard deviation for β-values from genotype $k$), $c_j$ (methylation level of SNP probe $j$ in foreign DNA), and $\gamma_n$ (degree of contamination of sample $n$). The parameters were estimated using the Metropolis algorithm implemented in the Julia programming language (code is provided in the supplement). Comparing $\gamma_n$ (degree of contamination) to $\bar{T}^Y_n$ for female samples allowed us to test if both agreed on ranking samples from least to most contaminated. This check was applied only to dataset E.

Our precise quantification of sample contamination in dataset E required the assumption that the foreign DNA came from a single source, which may be an unusual scenario. A simpler and more general test of sample contamination that can be applied to any dataset is implemented in the function snp_outliers. This function, again using the output of call_genotypes, computes $O_n$, the average log odds from the 65 posterior probabilities from the outlier component of the mixture model described above. Thus, $O_n$ captures how irregular the SNP β-values of sample $n$ are, i.e., how much they deviate from the ideal trimodal distribution. $O_n$ was compared to $\gamma_n$ in dataset E in order to confirm that it did indeed capture sample contamination. Subsequently, all 8327 samples were screened in this fourth check.

Results

Figure 1 shows the distribution of the “Bisulfite Conversion II” control metric, used to monitor successful bisulfite conversion, across the 80 datasets. Three hundred seventy-seven samples from 17 datasets fell below the default threshold specified by Illumina. Overall, 940 samples (11%) from 41 datasets (51%) were flagged by at least one of the 17 metrics in control_metrics, leaving 7387 samples that passed this first check. A summary is provided in Table 1. Many of these problematic samples would be overlooked when filtering out only samples with too many undetected probes or low overall fluorescence intensity, two popular criteria. Out of the 940 flagged samples, 432 had >1% of probes with a detection p value above 0.01 and 217 were flagged by the getQC function from the minfi package, and 541 samples when considering both criteria.

Figure 2 shows the result of the identity check in the dataset comprising 150 pairs of monozygotic twins. Pairwise agreement of SNP fingerprints between samples ranged from 0.96 to 1.00 for the 187 twin pairs (number is larger than 150 because of the technical replicates) and from 0.22 to 0.66 for the 47,708 non-twin pairs, thereby perfectly segregating both groups. For comparison, genotype calling by k-means clustering, as implemented in the wateRmelon package [13], produced non-identical fingerprints in 11 out of the 187 twin pairs, with one twin pair showing as many as 9 discordant SNPs.

The function check_sex estimated that 133 of the 8327 samples had been assigned the wrong sex, yielding an error rate of 1.6%, with 13 cases being unclear, as the inferred sex was discordant between $\bar{T}^X_n$ and $\bar{T}^Y_n$ (Fig. 3a). Of the 80 datasets, 20 (25%) contained these sex-mismatched samples. These were unevenly distributed: the three highest error rates per study were 55% (#mistakes 38/#total sample size 69), 45% (5/11), and 34% (15/44). Excluding all samples that failed the control_metrics check and might have technical
errors in measurement did not change sex-mismatched error rates substantially: out of the 7387 samples, 122 (1.7%) had been assigned the wrong sex with 10 unclear cases. Comparing with the predictions provided from minfi, there were, apart from the 13 unclear cases mentioned above, 66 female (according to metadata) samples from six datasets that were correctly classified by ewastools but misclassified by minfi.

Estimating the degree of contamination $\gamma_n$ for each sample in dataset E, we found that $\gamma_n$ and $\bar{T}^Y_n$ were strongly correlated among females (Pearson’s correlation coefficient 0.953, 95% bootstrapped CI 0.934, 0.965; Fig. 4a), confirming that both metrics agreed on ranking samples from least to most contaminated. Correlation was weaker for $\bar{T}^X_n$ (−0.828, 95% CI −0.898, −0.748), suggesting that $\bar{T}^Y_n$ was the more sensitive metric of contamination in this situation. Males were not included in the calculation of these correlation coefficients because contamination would not affect $\bar{T}^Y_n$ nor $\bar{T}^X_n$ if our hypothesis of contamination coming from a single male source were correct. Even though $\gamma_n$ was non-zero for most samples, we assume that only a subset of the samples were indeed contaminated, as those with the largest $\gamma_n$ tended to be allocated next to each other on the 450K chips (not shown). $\gamma_n$ and $O_n$ for the same set of samples were strongly correlated as well (Pearson’s correlation coefficient 0.958, 95% bootstrapped CI 0.948, 0.966), even when including males (0.954, 95% CI 0.944, 0.961; Fig. 4b), suggesting that in this dataset $O_n$, the average log odds of SNP probes being outliers, was a proxy for sample contamination not contingent on the sex of the sample donor or source of contamination (with the exception of contaminating DNA coming from another tissue of the same donor).

While the importance of a metric monitoring critical laboratory steps such as bisulfite conversion is obvious, the relevance of Illumina-provided default thresholds for other metrics is less clear. Figure 5 shows the distribution of $O_n$ among all samples that were flagged by the control_metrics checks versus the remainder that passed. We use $O_n$ here as a measure of poor technical performance, rather than a measure of sample contamination, as the former would also contribute to a deviation of the SNP probes from the ideal trimodal distribution. Flagged samples had in general higher values of $O_n$, indicating that such samples are indeed of concern.

Fig. 1 Distribution of a control metric monitoring bisulfite conversion. Samples below the manufacturer’s suggested threshold of 1 might be incompletely converted, leading to inaccurate estimates of methylation levels. (x-axis: GEO datasets sorted by minimum bisulfite conversion ratio; y-axis: Bisulfite Conversion II, unitless ratio)

Table 1 Summary of the number of samples flagged by each of the 17 control metrics outlined in the BeadArray Controls Reporter Software Guide. n.a.: not available

Metric                          Passed   Flagged   n.a.
Restoration                       8326         0      1
Staining green                    7885        10    432
Staining red                      7764        14    549
Extension green                   8324         3      0
Extension red                     8326         1      0
Hybridization high/medium         8326         1      0
Hybridization medium/low          8327         0      0
Target removal 1                  8327         0      0
Target removal 2                  8327         0      0
Bisulfite conversion I green      8317        10      0
Bisulfite conversion I red        8208       119      0
Bisulfite conversion II           7950       377      0
Specificity I green               8323         4      0
Specificity I red                 8279        48      0
Specificity II                    8326         1      0
Non-polymorphic green             7677       558     92
Non-polymorphic red               7917       318     92
Any of the above                  7387       940      –

Fig. 2 Pairwise agreement scores of genetic fingerprints in a dataset of monozygotic twins. There is a perfect segregation between twin and non-twin pairs. (x-axis: agreement score; groups: Unrelated, Twins)

Fig. 3 Average fluorescence intensities of probes targeting the X chromosome (x-axis) or targeting the Y chromosome (y-axis), each normalized versus the average fluorescence of autosomal probe intensities per sample. Dotted lines represent the Hodges-Lehmann estimators separating the male and female cluster centers. Samples that are discordant for sex relative to their metadata annotation are considered mislabeled and shown in red. Samples falling in the lower left or upper right quadrant are considered “unclear”. a All 8327 samples from all 80 datasets. b A single dataset (dataset E) with a spread in the female cluster indicating varying degrees of contamination with male DNA. (x-axis: normalized X chromosome intensities; y-axis: normalized Y chromosome intensities)

Fig. 4 Evidence of contamination in dataset E. a $\gamma_n$, the estimated degree of contamination, and $\bar{T}^Y_n$ are strongly correlated among females with a Pearson’s correlation coefficient r = 0.953. b $\gamma_n$ and $O_n$, the average log odds of SNP probes being outliers, are strongly correlated as well (r = 0.954), including both males and females. (x-axis: estimated contamination in %)

Fig. 5 Boxplots of $O_n$ (average log odds of being an outlier across the 65 SNP probes) as a measure of low technical performance for samples flagged by any of the 17 Illumina control metrics and samples passing all of them. Metrics are ordered by median $O_n$ with the number of samples in each non-exclusive category indicated on the right. Flagged samples have in general higher values of $O_n$ than samples that passed all checks, indicating that flagged samples are more likely to have poor performance characteristics. (x-axis: average log odds)

Discussion

The topic of 450K data quality has been addressed before. Among the issues being discussed are cross-reactive probes and probes possibly affected by nearby SNPs [14, 15], high detection p values and batch effects [16]. While these publications focus on probes with low fluorescence intensities or that are in general unreliable, or issues that affect ensembles of samples, our work is in contrast mainly concerned with the identification of individual problematic samples resulting from failed experiments, mislabeling or contamination. Due to the often small effect sizes, epigenome-wide association studies are sensitive to such samples as they often present as outliers. Robust regression methods mitigate the impact of spurious outliers but are computationally intensive due to the high dimension of the data, while least squares or maximum likelihood estimation remain popular choices. Finding and removing problematic samples during quality control is therefore an important first step of every epigenome-wide association study. We conducted a survey of publicly available DNA methylation datasets to see whether they suffer from the same quality issues that have been reported for gene expression datasets. Assuming that all samples included in datasets uploaded to the GEO repository were used in associated analyses, our results indicate that the current practice of quality control fails to detect many problematic samples that have the potential to severely bias findings.

Eleven percent of samples were flagged by at least one of the 17 control metrics defined by the microarray manufacturer. This does not mean that every flagged sample features inaccurate methylation levels, and it is unclear how Illumina’s default thresholds were set and whether the resulting dichotomization is appropriate for flagging samples in all conditions. However, samples flagged by one or more criteria in control_metrics had substantially more outliers among the normally well-behaved SNP probes, an indication of low data quality, and therefore these samples require closer attention. In the specific case of monitoring bisulfite conversion, Zhou et al. suggested a more robust alternative to using the dedicated control probes [17].

As demonstrated on the example of monozygotic twins, the check_snp_agreement function does (at least in the absence of other issues) perfectly predict whether two samples come from the same person or not. The function is robust against SNP outliers, due to the soft classification scheme, whereas hard classification of genotypes as produced by k-means clustering or the use of fixed cutpoints in the β-value distribution can result in unexpected genotype mismatches. It is worth pointing out that the genetic fingerprint is the only way to detect mislabeling if a sample swap did not result in any apparent epitype/phenotype mismatch (e.g., two samples from the same sex). Mislabeling results in conflicts (unexpected disagreement between samples that are supposed to come from the same individual or unexpected agreement between samples that are supposed to come from different individuals), but it might be necessary to build upon further evidence in order to resolve conflicts and reassign the correct identities or even to narrow down which of the samples in conflict are the mislabeled ones. Furthermore, the utility of check_snp_agreement is limited for datasets that feature only a single sample per person.

In contrast, check_sex can be applied regardless of the number of samples available for each person. In our survey 1.6% of samples coming from 25% of the datasets were assigned the wrong sex. The actual rate of mislabeled samples is likely higher because sample swaps
features should be conducted. Regarding the few unclear cases in which the sex of the sample donors could not conclusively be inferred, we suspect that most of these samples suffer from other, possibly upstream issues and should be excluded. A few, however, might represent chromosomal disorders, e.g. Klinefelter syndrome, which has a reported incidence around 1 per 576 [18]. Because of their XXY genotype, individuals with Klinefelter syndrome, who are phenotypically male, would in Fig. 3 be located in the center of the upper right quadrant (for $\bar{T}^X_n$ on par with females, for $\bar{T}^Y_n$ on par with males).

Some samples showed evidence of contamination, especially those belonging to dataset E: we constructed two measures of sample contamination based on two subsets of probes, using either the average total intensities of probes targeting the Y chromosome ($\bar{T}^Y_n$) or the β-values of the 65 SNP probes ($\gamma_n$). The fact that both measures exhibited very high agreement, even though they were based on independent evidence and completely different principles, lends credence to our hypothesis of a single contamination source.
In contrast, a strong cor- between donors of the same sex would have gone unde- relation between γ and O (the average log odds of n n tected. Assuming a balanced sex ratio and random mis- being an outlier across all 65 SNP probes) was to be labeling, only half of the potential mistakes would be expected, as both are derived from the same data. O captured by this test alone. If applicable, more compre- is not a perfect proxy of contamination, and deviations hensive checks testing for correct tissue types and other from the trimodal distribution of β-values might also be Normalized Y chromosome intensities Average log odds Heiss and Just Clinical Epigenetics (2018) 10:73 Page 8 of 9 caused by other issues, as evident when comparing sam- Conclusion ples passing and failing the control_metrics check. Beyond the obvious measurement of methylation, a This does however not diminish the value of O as an multitude of information can be inferred from high- overall sample quality indicator. This metric is easy to dimensional DNA methylation data, information that compute using our mixture modeling approach and can can be checked for agreement with recorded covari- be applied regardless of the sex of the sample donors. ates. This includes the use of principal component Unlike a recently proposed metric of contamination of analysis to create lower-dimensional representations for cord blood with maternal blood based on methylation discriminating between tissue types; comparing chrono- probes [5], our measure snp_outliers is not depen- logical and epigenetic age [19]; the estimation of cell dent on the tissue type of the sample and contaminat- proportions for blood samples, which can be as pre- ing DNA. Because the proportion of the 65 SNP probes cise as actual blood cell counts [20]; or any other expected to differ in a contaminated sample is a func- checks that might apply to the specific dataset at hand. 
tion of both the relative proportion of contamination We demonstrated in this work the high prevalence and the number of SNPs for which the contaminating of failed and mislabeled samples in DNA methylation sample differs in genotype, it was not possible to deter- datasets and recommend that epigenome-wide associa- mine a single cutoff for O above which samples should tion studies should start with comprehensive quality con- be classified as contaminated and excluded from fur- trol. With ewastools, we provide a software package ther analysis. Judging by Fig. 4b however, we inter- for the popular R programming language to conduct the pret that removing samples with an average log odds quality checks described here. Existing R scripts do not of − 4 represents a reasonable choice. Other authors have to be changed in order to incorporate these tests into have suggested to use the SNP probes as quality indi- existing analytic workflows. In addition, we recommend cator before: Pidsley et al. proposed a metric quanti- that researchers seeking to make their DNA methylation fying the standard deviation of SNP probes in order data available should also upload raw data in the form to benchmark normalization methods for 450K datasets of .idat files. Taken together these steps will improve the [13], but did not evaluate the quality of individual reproducibility of epigenome-wide association studies. samples. There are some limitations of our work. We did Abbreviations not check the associated publications for whether the 450K: Illumina Infinium HumanMethylation450 BeadChip; CI: Confidence interval; EPIC: Illumina Infinium MethylationEPIC BeadChip; GEO: Gene authors of the original studies mentioned flagging or expression omnibus; QC: Quality control; SNP: Single nucleotide polymorphism exclusion from downstream analyses of any samples from their datasets uploaded to the GEO repository. 
Acknowledgements We advocate the inclusion of a simple tag in the meta- We would like to thank Elena Colicino and Stefanie Busgang for their feedback on the manuscript and Brook Frye for code review. data on GEO to indicate samples excluded in qual- ity control steps, although no such indication was Funding seen in the metadata of the studies we reviewed. Fur- This work was supported by NIH grants R00ES023450 and P30ES023515. thermore, our analysis is demonstrated exclusively on Availability of data and materials the Illumina Infinium HumanMethylation450 BeadChip, R and Julia code to reproduce all analyses and results presented here, including although the functions in our ewastools package work retrieval of public methylation datasets, as well as other supplementary equally well on the newer Illumina Infinium Methyla- material, are available at https://doi.org/10.5281/zenodo.1172730.The ewastools R package can be found at https://github.com/hhhh5/ewastools. tionEPIC BeadChip (commonly called 850K chip) for which far fewer datasets are currently available on Authors’ contributions GEO. While we excluded certain types of tissues in JAH performed analyses and wrote the manuscript. ACJ conceived and order to get a handle on the heterogeneity of datasets, supervised the project and contributed to the manuscript. Both authors read and approved the final manuscript. this does not mean that these QC checks cannot be applied to those tissues, such as checking maternal con- Ethics approval and consent to participate tamination of fetal placenta. The selection of datasets Not applicable. was also restricted to those for which raw data were Competing interests available, as this was needed to automate our QC test- The authors declare that they have no competing interests. ing, and this subset may not be representative of the Publisher’s Note entirety of > 1000 Illumina 450K methylation datasets Springer Nature remains neutral with regard to jurisdictional claims in on GEO. 
Nonetheless, we feel that our results indi- published maps and institutional affiliations. cate the need for additional QC checks as data quality Received: 16 March 2018 Accepted: 16 May 2018 issues appear prevalent in publicly available methylation datasets. Heiss and Just Clinical Epigenetics (2018) 10:73 Page 9 of 9 References 1. Dedeurwaerder S, Defrance M, Bizet M, Calonne E, Bontempi G, Fuks F. A comprehensive overview of Infinium HumanMethylation450 data processing. Brief Bioinform. 2014;15(6):929–41. 2. Morris TJ, Beck S. Analysis pipelines and packages for Infinium HumanMethylation450 BeadChip (450k) data. Methods. 2015;72:3–8. 3. Felix JF, Joubert BR, Baccarelli AA, Sharp GC, Almqvist C, Annesi-Maesano I, et al. Cohort profile: pregnancy and childhood epigenetics (PACE) Consortium. Int J Epidemiol. 2017;47(1):22–23u. 4. Toker L, Feng M, Pavlidis P. Whose sample is it anyway? Widespread misannotation of samples in transcriptomics studies. F1000Res. 2016;5:2103. 5. Morin AM, Gatev E, McEwen LM, MacIsaac JL, Lin DTS, Koen N, et al. Maternal blood contamination of collected cord blood can be identified using DNA methylation at three CpGs. Clin Epigenet. 2017;9:75. 6. Dawson MA. The cancer epigenome: Concepts, challenges, and therapeutic opportunities. Science. 2017;355(6330):1147–52. 7. Nestor CE, Ottaviano R, Reinhardt D, Cruickshanks HA, Mjoseng HK, McPherson RC, et al. Rapid reprogramming of epigenetic and transcriptional profiles in mammalian culture systems. Genome Biol. 2015;16:11. 8. Xu Z, Langie SA, De Boever P, Taylor JA, Niu L. RELIC: a novel dye-bias correction method for Illumina Methylation BeadChip. BMC Genomics. 2017;18(1):4. 9. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30(10):1363–9. 10. Assenov Y, Muller F, Lutsik P, Walter J, Lengauer T, Bock C. 
Comprehensive analysis of DNA methylation data with RnBeads [Journal Article]. Nat Methods. 2014;11(11):1138–40. Available from: https://www. ncbi.nlm.nih.gov/pubmed/25262207. 11. Fortin JP, Fertig E, Hansen K. shinyMethyl: interactive quality control of Illumina 450k DNA methylation arrays in R. F1000Res. 2014;3:175. 12. Feber A, Guilhamon P, Lechner M, Fenton T, Wilson GA, Thirlwell C, et al. Using high-density DNA methylation arrays to profile copy number alterations. Genome Biol. 2014;15(2):R30. 13. Pidsley R, CC YW, Volta M, Lunnon K, Mill J, Schalkwyk LC. A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics. 2013;14:293. 14. Zhang X, Mu W, Zhang W. On the analysis of the Illumina 450k array data: probes ambiguously mapped to the human genome. Front Genet. 2012;3:73. 15. Chen YA, Lemire M, Choufani S, Butcher DT, Grafodatskaya D, Zanke BW, et al. Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics. 2013;8(2):203–9. 16. Lehne B, Drong AW, Loh M, Zhang W, Scott WR, Tan ST, et al. A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies. Genome Biol. 2015;16:37. 17. Zhou W, Laird PW, Shen H. Comprehensive characterization, annotation and innovative use of Infinium DNA methylation BeadChip probes. Nucleic Acids Res. 2017;45(4):e22. Available from: https://www.ncbi.nlm. nih.gov/pubmed/27924034. 18. Nielsen J, Wohlert M. Chromosome abnormalities found among 34,910 newborn children: results from a 13-year incidence study in Arhus, Denmark. Hum Genet. 1991;87(1):81–3. 19. Horvath S. DNA methylation age of human tissues and cell types. Genome Biol. 2013;14(10):R115. 20. Heiss JA, Breitling LP, Lehne B, Kooner JS, Chambers JC, Brenner H. Training a model for estimating leukocyte composition using whole-blood DNA methylation and cell counts as reference. Epigenomics. 
2017;9(1):13–20.
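
As a supplementary illustration of the sex check described in the Discussion and in the caption of Fig. 3, the following minimal Python sketch classifies samples from their normalized X and Y chromosome intensities. It is not the ewastools implementation (which is written in R): the function names, the toy data, and the exact cutoff rule (median of all pairwise male/female midpoints, a Hodges-Lehmann-style estimate) are assumptions made for this example only.

```python
from statistics import median


def midpoint_cutoff(male_values, female_values):
    """Median of all pairwise male/female midpoints.

    Assumption: this Hodges-Lehmann-style rule stands in for the cutoffs
    drawn as dotted lines in Fig. 3; the package may compute them differently.
    """
    return median((m + f) / 2 for m in male_values for f in female_values)


def check_sex_sketch(x, y, recorded):
    """Label each sample as 'correct', 'mislabeled' or 'unclear'.

    x, y     -- normalized X / Y chromosome intensities, one value per sample
    recorded -- recorded sex per sample, "m" or "f"
    """
    males = [i for i, sex in enumerate(recorded) if sex == "m"]
    females = [i for i, sex in enumerate(recorded) if sex == "f"]
    cut_x = midpoint_cutoff([x[i] for i in males], [x[i] for i in females])
    cut_y = midpoint_cutoff([y[i] for i in males], [y[i] for i in females])

    status = []
    for xi, yi, sex in zip(x, y, recorded):
        if xi < cut_x and yi > cut_y:
            predicted = "m"      # upper left quadrant: low X, high Y
        elif xi > cut_x and yi < cut_y:
            predicted = "f"      # lower right quadrant: high X, low Y
        else:
            predicted = None     # lower left or upper right quadrant
        if predicted is None:
            status.append("unclear")
        elif predicted == sex:
            status.append("correct")
        else:
            status.append("mislabeled")
    return status


# Toy data (made up): three males, three females, one male recorded as
# female, and one sample with both high X and high Y intensities.
toy_x = [0.70, 0.72, 0.69, 1.00, 1.02, 0.99, 0.71, 1.01]
toy_y = [1.00, 0.98, 1.02, 0.30, 0.31, 0.29, 1.01, 1.00]
toy_recorded = ["m", "m", "m", "f", "f", "f", "f", "m"]
print(check_sex_sketch(toy_x, toy_y, toy_recorded))
```

Because the cutoffs are medians over all cross-group pairs, a small minority of mislabeled samples barely shifts them, which is why the recorded labels can be used to place the cutoffs even though some of those labels are wrong.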

Journal

Clinical Epigenetics, Springer Journals

Published: Jun 1, 2018
