An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray

An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using... Genome-wide methylation arrays are powerful tools for assessing cell composition of complex mixtures. We compare three approaches to select reference libraries for deconvoluting neutrophil, monocyte, B-lymphocyte, natural killer, and CD4+ and CD8+ T-cell fractions based on blood-derived DNA methylation signatures assayed using the Illumina HumanMethylationEPIC array. The IDOL algorithm identifies a library of 450 CpGs, resulting in an average R =99.2 across cell types when applied to EPIC methylation data collected on artificial mixtures constructed from the above cell types. Of the 450 CpGs, 69% are unique to EPIC. This library has the potential to reduce unintended technical differences across array platforms. Keywords: DNA methylation, Epigenetics, Neutrophils, Monocytes, Natural killer cells, B-cells, Helper T-cells, Cytotoxic T-lymphocytes, Adults, Leukocytes Background assessed [4]. While some of the observed changes in DNA DNA methylation microarrays have become a widely methylation reported in EWAS reflect induced epigenetic utilized tool for epigenome-wide association studies alterations within the constituent cells, others may reflect (EWAS), including expanded use in studies investigating coordinately induced changes in the proportions of the association between DNA methylation with environ- leukocyte subtypes in circulation that underlie or contrib- mental exposures, and in the setting of case-control and ute to the pathophysiologic process. Both reference-based longitudinal studies [1, 2]. Peripheral blood is the most and non-reference-based techniques have been used to commonly used biospecimen for these analyses primarily control the effect of cell heterogeneity, and thus possible because it is easily accessible through a minimally invasive confounding, in different studies, and their specific appli- procedure, although some emerging evidence suggests cations have been detailed elsewhere [5, 6]. Deconvolution that some specific DNA methylation changes in blood techniques, such as constrained projection/quadratic pro- may reflect pathological states in target organs not easily gramming (CP/QP) [7], provide a framework for estimat- or safely accessible by biopsy [3]. Finally, DNA methyla- ing the relative proportions of blood cell types using tion profiles in blood may, in some instances, summarize blood-derived signatures of DNA methylation. So-called information of systemic exposures or diseases where cells “deconvolution estimates” can then be used in down- from a single organ or tissue cannot be specifically stream statistical models to adjust for the potential con- founding effects of cell composition [4, 7–9], or examined * Correspondence: Brock.C.Christensen@dartmouth.edu independently to determine their association with risk or Lucas A. Salas and Devin C. Koestler contributed equally to this work. 1 exposures [10–12]. Indeed, as in other general clinical ap- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Lebanon, NH, USA plications, the ratio of myeloid to lymphoid lineages (neu- Departments of Molecular and Systems Biology, and Community and Family trophil to lymphocyte ratio (NLR)) can be reconstructed Medicine, Geisel School of Medicine, Dartmouth College, Lebanon, NH, USA in archival whole blood DNA samples measured with Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Salas et al. Genome Biology (2018) 19:64 Page 2 of 14 methylation arrays using different deconvolution approaches purity across the six cell subtypes (Additional file 1:Figure [11, 12]. Furthermore, while complete blood cell counts S1). Individual sample cell purity is reported in the pheno- (CBC) are sometimes concurrently collected, deconvolution type table in the FlowSorted.Blood.EPIC file. We explored algorithms allow for estimation of lymphoid-specific sub- potential genetic clustering using the ethnicity and the 59 populations (e.g., B cells, CD4(+) T, CD8(+) T and natural control SNP probes included in the array. A striking killer (NK) lymphocytes), approximating the results of cell clustering was observed when grouping for larger ethnic flow sorting without the need of additional sample input. groups (Additional file 1: Figure S2). Using a principal This strategy helps to correct for inter-individual variation, component regression analysis, the first 20 principal lineage relationships (changes in the myeloid versus lymph- components were tested against potential confounders oid ratio mean), and potential effects of prominent methyla- (Additional file 1: Figure S3). Each of the first five princi- tion differences in general cell subpopulations. pal components were significantly (P < 0.01) associated Current analysis pipelines for estimating the propor- with cell type composition. Other potential confounders, tions of leukocyte subtypes in adult blood are based on including body mass index (BMI; P < 0.01), subject weight samples from six adult males that were purified with (P < 0.01), age (in years; P < 0.01) or sex (P <1 E-04) were flow cytometry and profiled for DNA methylation using only accounted for in the sixth to eleventh principal com- the Illumina HumanMethylation450K platform (450 K ponents, and smoking was significant only with the 12th array) [13]. As the 450 K array is now the predecessor to component (P < 0.05). Cell purity was associated with the the recently released Illumina HumanMethylationEPIC ninth principal component (P < 0.05). In contrast to what array (EPIC array), an outstanding question relates to was observed using the genetic information, ethnicity was accuracy of cell deconvolution of peripheral blood not significantly associated with any of the top 20 principal measured on the EPIC array using existing 450 K refer- components. ence methylation signatures. The EPIC array interrogates We used three deconvolution methods for comparison > 860,000 CpG sites, roughly twice the number of CpGs (see “Methods” for details): 1) the current commonly used contained on the 450 K array, with additional genomic 450 K reference (Reinius et al.) [13] using the automatic content in enhancer regions and DNase hypersensitive selection to identify the L-DMR library in the minfi Bio- sites (DHS), which are important in hematopoietic de- conductor package [19]; 2) our new EPIC reference using velopment and differentiation [14, 15]. In this work, we the automatic selection to identify the L-DMR library in extend the available reference library for deconvolution minfi; 3) our new EPIC reference using IDOL selection to of blood cell proportions using the EPIC array with the identify the L-DMR library [18]. The automatic selection goal of both improving the accuracy of cell composition picks the top 50 hyper- and hypomethylated probes for estimates and overcoming potential technical differences each cell type (600 probes total), while the IDOL method in platforms [16, 17]. Using antibody bead sorted neu- identified 450 probes as the optimal number of probes for trophils, B cells, monocytes, NK cells, CD4+ T cells, and deconvolution (Additional file 1: Figure S4). The probes CD8+ T cells, we measured DNA methylation with the selected by the various methods are compared in Fig. 1. 850 K EPIC DNA methylation array and applied an Only 26 probes overlapped across the different methods. iterative algorithm for Identifying Optimal Libraries A comparison of the selected probes, including the pro- (IDOL) from leukocyte differentially methylated regions portion per genomic context and probes in Phantom5 en- (L-DMR) that improves the accuracy and efficiency of hancers and DHS, is provided in Table 1. The majority of cell composition estimates obtained by cell mixture the probes selected for deconvolution using the new EPIC deconvolution [18]. We then compared the performance array were not present on the previous 450 K array; 66% of cell estimates obtained using the EPIC platform and of the probes selected using the automatic selection optimized library to the now unavailable 450 K array in method and 69% of the probes selected with IDOL were artificial blood mixtures with predefined cell proportions. unique to the EPIC array. As expected, more probes in the open sea were selected using the new reference library Results (80 and 76% using the automatic and IDOL methods, The FlowSorted.Blood.EPIC dataset contains information respectively, compared to 57% using the 450 K reference). from neutrophils (Neu, n = 6), monocytes (Mono, n =6), B In addition, approximately twice as many Phantom5 lymphocytes (Bcells, n = 6), CD4+ T cells (CD4T, n = 7, six enhancer sites were selected using the new EPIC platform samples and one technical replicate), CD8+ T cells (CD8T, with the automatic (18%) and EPIC methods (16%) com- n = 6), natural killer cells (NK, n = 6), and 12 DNA artifi- pared with the Reinius [13] reference probe set from the cial mixtures (labeled as MIX in the dataset). Across sam- 450 K platform (9%). ples, the average purity reported in the control flow Once we determined the probes for cell type estima- sorting (after antibody-linked magnetic bead sorting) was tion, we used the minfi modified Houseman constrained 95% (range 88 to 99%), with Mono having the lowest projection approach [7] to estimate the cell composition Salas et al. Genome Biology (2018) 19:64 Page 3 of 14 Fig. 1 Comparison of L-DMR libraries among automatic selection in minfi and the IDOL algorithm for optimization. a Reinius reference dataset [13] probes from the 450 K array (n = 600 CpGs). b Probes selected from the new reference samples measured with the EPIC array (n = 600 CpGs). c L-DMR library derived from IDOL using the EPIC array (n =450 CpGs). d Overlapping of the probes of the three methods. DHS DNase hypersensitive sites of 12 samples, spread across two sets of artificially re- methylation data, and the variance of our estimates was constructed mixtures. As the specific amount of DNA consistently lower (Fig. 2b,Additionalfile 1: Figure S5). per cell type in each mixture was known, we compared For all the cell types, except CD4T, the R was over 99.7%. our estimate of cell proportions to the amount of DNA The lowest R estimate from applying the IDOL method represented by that cell type in each of the artificial to the EPIC platform data was for CD4T (R =95.5%). mixtures (Fig. 2a, Additional file 1: Table S1). The R The observed versus expected estimate for CD4T was (coefficient of determination) values were > 86% across all slightly better when using the 450 K L-DMR library (R = cell types and across the three tested methods (Additional 98.1%), and performance was worse using automatic selec- file 1: Figure S5). However, we consistently obtained better tion with data from the EPIC platform (R =86.0%). cell type proportion estimates (higher R and lower RMSE Although the results are highly correlated to the actual (root mean square error)) when using the L-DMR library proportion of DNA in the artificial mixtures when using generated with the IDOL method from EPIC platform the Reinius [13] 450 K reference L-DMR library, the Salas et al. Genome Biology (2018) 19:64 Page 4 of 14 Table 1 Genomic context of CpG sites selected for each L-DMR library approach Automatic selection 450 K Automatic selection EPIC IDOL EPIC P Probes (n = 600) (n = 600) (n = 450) N(%) N(%) N(%) CpG present on 450 K array 600 (100) 203 (34) 140 (31) 7.50E-29 Enhancers (Phantom5) 53 (9) 108 (18) 70 (16) 3.13E-69 DNase hypersensitive sites 452 (75) 467 (78) 328 (73) 6.66E-147 Genomic context CpG island 47 (8) 24 (4) 10 (2) 4.35E-62 Shores 116 (19) 55 (9) 63 (14) Shelves 95 (16) 41 (7) 35 (8) Open sea 342 (57) 480 (80) 342 (76) Functional context TSS1500 73 (12) 44 (7) 38 (8) 2.10E-102 TSS200 46 (8) 23 (4) 11 (2) 5′ UTR 76 (13) 69 (12) 46 (10) First exon 20 (3) 16 (3) 7 (2) Body 241 (40) 283 (47) 210 (47) 3′ UTR 29 (5) 11 (2) 9 (2) Intergenic 115 (19) 152 (25) 128 (28) a 2 P is calculated from the χ test comparing the proportions between the three L-DMR selection methods estimates showed increased variability compared to esti- overestimated this cell type (mean differences = 0.6 mates obtained using the EPIC reference L-DMR library and 0.4%, respectively). NK cells were slightly under- (Fig. 3). Importantly, the magnitude of the variance was estimated using CBS (mean difference = − 0.3). All strongly significantly lower using IDOL compared to com- three methods overestimated monocytes by a small panion automatic methods (Bartlett test P =7.60 E-26). percentage that was statistically significant (P < 0.05). As a sensitivity analysis, we compared the results of the The lowest mean difference in monocytes was ob- IDOL library using CP/QP (minfi method) versus two served with CP/QP (0.5%), followed by RPC (0.7%) additional deconvolution methods: 1) CBS-CIBERSORT, a and CBS (1.6%). CBS outperformed RPC and CP/QP support vector machine non-constrained projection; and for neutrophil estimation (P >0.1), and both CP/QP 2) and Robust Partial Correlation (RPC) a linear and RPC underestimated neutrophils (mean differ- non-constrained projection using the methods described ences = − 1.27 and − 1.66%, respectively). by Teschendorff et al. and available in the R package EpiDISH [14]. Using a paired t-test, we compared the true values (fraction of cell DNA in the artificial mixture) Pathways in the selected EPIC IDOL library versus the estimates obtained by the three methods; the The probes present in the new EPIC IDOL L-DMR library global mean differences for all the cells were not statisti- were tested for enrichment using missMethyl and the cally significant (P > 0.05 for the three tests). The paired Gene Ontology (GO) and the Gene Set Enrichment Ana- mean differences were analyzed using Bland-Altman plots lysis (GSEA) set 7 (immune related) v.6.1 pathways. In (Additional file 1: Figure S6). Subtle, less than 2%, mean total, 375 GO pathways (299 biological processes, 31 cell differences were observed in the cell by cell estimate com- components, and 45 molecular features) and 181 GSEA parisons across each of the deconvolution methods. set 7 pathways were statistically significant (false discovery Across the three deconvolution methods, there were no rate < 0.05) after array bias correction (Additional files 2 statistically significant differences between the true values and 3). Among others, several GO pathways were tracked for CD8T cells and their estimated fraction (P >0.05). RPC to the parent GO terms response to wounding (e.g., in- mean estimates were closer to the true values for CD4T, flammatory response, defense response), T-cell activation though CBS underestimated CD4T (mean difference = (e.g., T-cell activation) and leukocyte proliferation. The − 0.8%), and CP/QP overestimated CD4T (mean dif- GSEA pathways included pathways related to the different ference = 1.0%). RPC estimates were also closer to the six cell types and other cells derived from the six cells true Bcell values, whereas both CBS and CP/QP included in the database. Salas et al. Genome Biology (2018) 19:64 Page 5 of 14 Fig. 2 Comparison of estimate cell proportions using constrained projection/quadratic programming (CP/QP) versus the reconstructed (true) DNA fraction in the artificial DNA mixtures using the EPIC IDOL method. a Cell-specific DNA proportions per sample included in the two mixture reconstruction methods (methods A and B). b R and RMSE using the EPIC IDOL method and the two reconstruction methods Potential applications 350 days of observation. Although the measurements Although any EWAS using the IlluminaHumanMethyla- were from a healthy male adult volunteer, we observed tionEPIC array could benefit from estimates derived an important range of variability in the cell subpopula- from the new reference library, one specific setting in tions across the different time points (Fig. 4). Specific- which the use of more precise cell estimates is particu- ally, we observed a potential underestimation of CD8T larly beneficial includes longitudinal and/or repeated (− 5.5% in the 450 K estimates, P = 2.86E-05) and Bcell measurement studies. Using a dataset containing periph- (− 3.84% in the 450 K estimates, P = 1.10E-04). Further, eral blood DNA methylation information from one vol- when examining the cell ratios (lineage relationships) unteer, 11 repeated measurements were obtained across those ratios containing these cell subpopulations, and in Salas et al. Genome Biology (2018) 19:64 Page 6 of 14 Fig. 3 Observed estimates of absolute error by deconvolution method per cell type (top panel) and global per method (bottom panel) particular those including CD8T cells alone, were L-DMR library. The performance of the 450 K-restricted dramatically affected (CD4T/CD8T increased 5.82 350 CpG L-DMR library in 12 artificial reconstructed points, P = 5.76E-05; CD8T + Mono/NK were reduced mixtures (GSE77797) measured with the 450 K platform 4.78 points, P = 1.28E-04). In the CD8T/Bcell ratio, al- (Additional file 1: Figure S7) was consistent with the though the mean change was non-statistically significant, performance of the 450 CpG L-DMR library in the EPIC the global variance of the ratio was affected (Bartlett test samples (Fig. 2). In particular, slightly higher RMSEs and P = 4.64E-05). Importantly, the neutrophil estimates slightly lower R values were observed (Additional file 1: were preserved using any of the reference datasets, Figure S7); however, the RMSEs were lower and the R though the neutrophil to lymphocyte ratio (NLR) esti- values were higher, or similar, to those reported previously mates were slightly higher using the 450 K compared to by Koestler et al. [18] using the Reinius reference [13]. the EPIC IDOL L-DMR library (1.97 points, P = 0.06). Finally, to further validate our estimation using actual Subtle changes are better captured using the new samples, we estimated the cell composition of whole L-DMR library. In addition, the discordance in the blood samples collected from six additional healthy CD8T and Bcell estimates using the 450 K L-DMR li- donors, whose DNA was run on the EPIC platform, and brary compared with the EPIC array library (underesti- compared our estimates to FACS measured cell propor- mation in the reconstructed samples comparisons) was tions collected on the same samples (GSE112618; consistent with the direction and magnitude in the error Fig. 5a). We also deconvoluted six publicly available observed when we compared the performance with the samples with available FACS information arrayed using reconstructed samples (Fig. 3). the 450 K platform (GSE77797; Fig. 5b), and in five of To interrogate deconvolution using datasets from the the 11 samples in our longitudinal dataset with FACS previous Illumina HumanMethylation450K platform, we information at the time of the sample collection (years optimized a second set IDOL L-DMR library from our 2011 and 2012), we compared our estimates of cell com- data, measured on the EPIC platform, but including only position in archival DNA samples arrayed using the probes also present on the 450 K array. Restricting to EPIC platform to the corresponding FACS measured CpGs also on the 450 K array, a set of 350 probes was proportions (GSE110530; Fig. 5c). In all three datasets, identified as the optimal IDOL L-DMR library (similar RMSEs were less than 2.0 for all cell types (Fig. 5). The R and lower RMSE than libraries containing less or least accurate estimations were observed in the granulo- more probes), 60 of which are part of our IDOL EPIC cyte/neutrophil fractions, where increased accuracy was Salas et al. Genome Biology (2018) 19:64 Page 7 of 14 Fig. 4 Comparison of the longitudinal assessment of cell type proportions and cell ratio changes using DNA methylation data and two different reference L-DMR libraries (EPIC IDOL and 450 K) observed when comparing the total granulocytes vs neu- more extended automatic selection approach, the global trophil estimates compared to the total neutrophils vs results are similar, albeit less precise. Furthermore, neutrophil estimates. In particular, in Fig. 5a, the RMSE although the results were statistically similar using the for neutrophils was 1.91, whereas when we instead used previous 450 K reference library, our results suggest that total granulocytes (the sum of neutrophils, eosinophils, this L-DMR library was less precise in their estimations. and basophils), the RMSE was reduced to 0.92. The increased coverage of the new EPIC array may also include additional important genomic areas for Discussion hematopoiesis and immune cell development. Although Here, we offer a new DNA methylation reference library IDOL relies on an algorithmic approach for the selection from the Illumina EPIC array for six adult blood cell of probes, the output L-DMR library contains probes subtypes. Using artificial mixtures with fixed proportions associated with critical biological pathways in immune of purified cell DNA, this library offers more precise re- development and differentiation. Through its iterative sults in terms of the cell estimations obtained through selection of CpG loci that optimize prediction perform- constrained projection compared to those using the pre- ance across the six leukocyte subtypes, the IDOL vious 450 K reference library. Although the statistical algorithm identifies critically sensitive and specific differ- differences are subtle numerically, the global increased entially methylated sites that populate the final deconvo- precision may help to control the confounding of cell lution library. Included among the cell-specific output subpopulations in studies using adult peripheral blood. are loci that figure prominently in established leukocyte In our approach, we suggest that optimized probe selec- biology, as well as others that are less well described. tion using IDOL may also help to increase precision and Examples of the former are loci that reside in genes pre- reduce noise compared to larger probe lists based solely sented in Fig. 6. BLK (B lymphoid tyrosine kinase) is well on t-statistic ranking of cell-specific hyper- and hypo- established in B-cell antigen receptor signaling and methylated CpGs. Nevertheless, even when using the B-cell development [20]. CD8A (CD8 alpha subunit) is a Salas et al. Genome Biology (2018) 19:64 Page 8 of 14 ab Fig. 5 Comparison of the estimated cell proportions using constrained projection/quadratic programming (CP/QP) versus the FACS measured fraction in EPIC and 450 K platforms. a Whole blood cell samples arrayed using the EPIC platform with known (FACS) fractions for the six main cell subtypes. Cell estimates were obtained using the EPIC IDOL method. b Whole blood cell samples arrayed using the Illumina 450 K platform with known (FACS) fractions for the six main cell subtypes. Cell estimates were obtained using the EPIC IDOL 450 K legacy method. c Five out of 11 observations on the longitudinal dataset run with EPIC had FACS information cell defining co-receptor for cytotoxic T-cell receptor– gene (cytoplasmic linker associated protein 1). CLASP MHC–antigen complex response [21]. Although this proteins are important in microtubule organization and molecule is also expressed in approximately 40% of NK vesical transport [28] but a role for this gene in NK biol- cells [22], the use of this marker in conjunction with ogy has not been described. Further experimentation is other probes help to differentiate this specific cell type. required to elucidate the mechanistic connections be- NFIA (nuclear factor 1 A transcription factor) in concert tween specific DNA methylation events and the functional with miR-223 is a crucial player in the molecular characteristics of diverse immune cells. The present re- circuitry controlling human granulopoiesis [23, 24]. sults demonstrate notable improvements in understanding IDOL also selected loci in the RPTOR gene (regulatory the contributions of immune cell compartments to the associated protein of MTOR) to discriminate CD4T substantial variation in DNA methylation observed in per- cells. Metabolic reprogramming mediated by RPTOR is ipheral human blood. emerging as essential in T helper cell differentiation In non-pathological conditions, less common cell sub- [25]. In previous studies, a different set of CpGs related populations will probably be estimated as part of the clos- to RPTOR have been associated with inflammatory est cell in the cell development hierarchy (as observed markers in CD4T [26]. Interestingly, SLFN5 (Schlafen with the estimates of the neutrophils approximating the factor 5 protein) demethylation strongly delineates total granulocytes in our FACS comparisons). However, a monocytes from neutrophils and other cell types. Schla- limitation of the current approach is its potential vulner- fen family proteins are alpha interferon-inducible growth ability in pathologic conditions wherein other cell types or and cell cycle regulatory proteins but have no known cell transition states may appear in the peripheral blood. function in monocyte biology [27]. Finally, a notable The accuracy of reference-based cell deconvolution ap- NK-specific locus was uncovered within the CLASP1 proaches can potentially be affected by the presence of cell Salas et al. Genome Biology (2018) 19:64 Page 9 of 14 Fig. 6 Examples of critical CpGs for cell deconvolution selected by IDOL populations that are unaccounted for in cell reference libraries and optimization procedures would need to be libraries. One example is the presence of nucleated red applied to identify and minimize estimation bias arising blood cells in cord blood samples; this heterogeneous from an additional cell population. These limitations are group of erythroid cells shows a characteristic unmethy- not unique to epigenetic approaches as conventional FACS lated pattern, previous to enucleation [29], that could itself requires prior knowledge of cell characteristics for affect the estimates in that specific age group where they accurate cell profiles. are relatively abundant (about 5% of the nucleated cells) This new reference library has the potential to be but disappear in the first 72 h after birth [30–32]. How- widely used in the newest adult peripheral blood EWAS. ever, in normal conditions, independent of the age of the The use of an EPIC-specific reference library will elimin- subject, it is expected that cell types or states without a ate unintended technical differences arising from apply- direct reference methylome will be accounted for as part ing a reference library from a previous generation array of the closest cell subtype with a reference. In fact, in our which may result in residual confounding or critical data we observed how the FACS information confirmed technical defects which go beyond the cell heterogeneity that most of the variability in a longitudinal assessment of a problem when analyzing blood samples. The use of reli- healthy subject was attributable to changes in the cell popu- able cell reference panels is particularly important for lation proportions. The approach described here can ac- longitudinal assessment of cohort datasets. Using infor- commodate additional normal or pathologically related cell mation from the 450 K datasets, it had been shown that types, but as with all deconvolution methods additional cell temporal trends in longitudinal studies were mainly Salas et al. Genome Biology (2018) 19:64 Page 10 of 14 driven by changes in cell composition across time [33– cells were selected using immunomagnetic labeling through 36]. The expected increase in precision using the new two different protocols: 1) for Neu, leukocytes were sepa- reference would be especially important in this context ratedusing HetaSep followedbydensity gradient separation to control for aging-related effects changing specific sub- and neutrophil negative selection; 2) Mono, Bcells, NK, populations of T lymphocytes in which a higher variability CD4T, and CD8T were negatively isolated from untouched is expected when using the previous library. Indeed, this peripheral mononuclear cells using indirect immunomag- library may also find particular utility when additional netic cell labeling systems (CD14, CD19, CD56, CD4T, and subtypes of leukocytes are added to the current library, as CD8T, respectively). is evidenced by the plethora of EPIC-array-specific Twelve artificial mixtures were reconstructed using L-DMR loci discovered in the new analysis. DNA from the specific cell samples. Two different sets of reconstruction mixtures, each with n = 6, were deter- Conclusions mined by randomly generating proportions from a This new EPIC-specific reference library will reduce re- six-component Dirichlet distribution. The first set of re- sidual confounding arising from the use of a reference li- constructed samples (method A samples) used mixtures brary from a previous array generation when analyzing of purified leukocyte subtype DNA in relatively equiva- adult blood samples. The increased precision of using this lent proportions across the six leukocyte subtypes. For new L-DMR library will help in applications where subtle the second set of six samples, the proportions of DNA changes in specific cell subpopulations may lead to higher for each leukocyte subtype were selected to resemble than expected variability, such as longitudinal studies. their relative fractions in the peripheral blood of normal human adult subjects (method B samples). A mixture Methods containing 1.2 μg of total DNA was estimated using the In this work, we extend the available reference library proportions included in Additional file 1:Table S1.The for deconvolution of blood cell proportions using the DNA from the cell sorted samples and those of the artificial EPIC array with the goal of both improving the accuracy mixtures were randomized in the slide slots of the micro- of cell composition estimates and overcoming potential array. Sample DNA (1 μg) was bisulfite converted and proc- technical differences in platforms. Using magnetic sorted essed according to the Illumina protocols at the Vincent J. neutrophils, B cells, monocytes, NK cells, CD4+ T cells, Coates Genomics Sequencing Laboratory at UC Berkeley. and CD8+ T cells, we measured DNA methylation with The raw idat files from the EPIC methylation array the 850 K EPIC DNA methylation array and applied were pre-processed using minfi [19]and EnMIX[39] IDOL to identify optimal L-DMR libraries. We then for quality control using R v.3.4.3 [40]. To assess data compared the performance of cell estimates obtained quality, we used a detection P value of 1E-06, three using the EPIC platform and optimized the L-DMR standard deviations of the mean bisulfite conversion IDOL library to the now unavailable 450 K array in arti- control probe fluorescence signal intensity, and a mini- ficial blood mixtures with predefined cell proportions. mum of three beads per probe. Only 1897 CpGs had a Six MACS-isolated and FACS-verified purity cell sub- detection P > 1E-06 in 5% or more of the samples; how- types (neutrophils (Neu), monocytes (Mono), B lympho- ever, they were not masked in the raw dataset. No sam- cytes (Bcells), T helper lymphocytes (CD4T), T cytotoxic ples were excluded because of low quality. The IDOL lymphocytes (CD8T), and natural killer lymphocytes (NK)) EPIC L-DMR library is freely available in Bioconductor were purchased from AllCells® corporation (Alameda, CA, as the package FlowSorted.Blood.EPIC [37]tobe USA) and STEMCELL technologies (Vancouver, BC, adopted in downstream analyses in current analyses Canada). Cells were isolated from 31 males and 6 females, pipelines. The package contains a RGChannelSet R ob- all anonymous healthy donors. The donors had a mean age ject generated through minfi containing 49 samples and of 32.6 years (range 19–59 years) and an average weight of information on 1,051,815 probes corresponding to 86 kg (range 65–118 Kg) and were negative for HIV, HBV, 866,091 CpGs using the latest annotation release by and HBC. Women were not pregnant at the time of sample Illumina (MethylationEPIC_v-1-0_B4). It is important collection, and samples were collected from donors with no for the reader to note that the cells were purified using history of heart, lung, or kidney disease, asthma, blood an immunomagnetic procedure; the name “FlowSorted” disorders, autoimmune disorders, cancer, or diabetes. All was kept for easy adoption and integration with previ- donors provided written informed consent before dona- ous minfi pipelines. tion. The full phenotype information is available in the FlowSorted.Blood.EPIC package [37] and in the Gene Ex- IDOL algorithm pression Omnibus (GEO; GSE110554) [38]. For a complete description of the IDOL algorithm please Isolation protocols are available through the commercial refer to Koestler et al. [18]. In brief, the IDOL algorithm websites of AllCells and STEMCELL technologies. In brief, utilizes a training dataset consisting of both blood-derived Salas et al. Genome Biology (2018) 19:64 Page 11 of 14 DNA methylation data and measurements of the fraction the estimateCellCounts function contained in the minfi of each of the underlying cell types (e.g., FACs, artificial Bioconductor package [19, 41]. estimateCellCounts is an mixtures of DNA from purified cell types of pre-specified, adaptation of the Houseman et al. CP/QP method [7], in known proportions, etc.) as a means to identify optimal which a raw reference library is combined and normalized reference libraries for cell mixture deconvolution. A series with a target dataset, followed by cell deconvolution. By of t-tests comparing the mean CpG-specific methylation default, this method uses the FlowSorted.Blood.450 K li- between each leukocyte cell type compared to the mean brary derived from the Reinius dataset [13]as the reference methylation across all the other cell types was conducted dataset. Both the reference and target datasets are normal- to identify discriminating CpGs (e.g., L-DMRs) for each ized together using independent type I and type II probe specific cell type. Based on this analysis, CpGs were then quantile normalization [42]. First, the default library used rank-ordered on the basis of their t-statistics and the L/2 for cell mixture deconvolution consists of 600 CpGs, repre- CpGs with the largest and smallest t-statistic for each K senting the top 50 hyper- and hypomethylated CpGs, cell type were identified and pooled. L is a tuning param- rank-ordered based on the t-statistic obtained in compari- eter representing the number of cell-specific L-DMRs and sons of CpG-specific methylation between each cell type was set to L = 150 in our application, consistent with (i.e., CD4T, CD8T, NK, Bcell, Mono, and Neu) and all other Koestler et al. [18]. A candidate L-DMR library containing cell types. We hereafter refer to this approach as automatic the total L*K unique L-DMRs for each cell type forms the selection 450 K. Second, we used the same estimateCell- search space for the IDOL algorithm, from which L-DMR Counts defaults but substituted FlowSorted.Blood.450 K subsets of size < L*K are sequentially and probabilistically with FlowSorted.Blood.EPIC as the underlying reference selected and examined for their prediction accuracy in dataset. Similar to the previous approach, the top 50 hyper- deconvoluting the samples in the training dataset. The and hypomethylated CpGs were identified for each cell type user needs to preselect the library size in order to balance and used to assemble the library for deconvolution consist- accuracy and precision of cell composition estimates. For ing of 600 total CpGs. We hereafter refer to this approach the application of IDOL presented here, we considered li- as automatic selection EPIC. Finally, we used IDOL for braries ranging from 50 to 800 CpGs by increments of 50, probe selection [18]. This approach dynamically scans a as our previous work has shown that libraries ranging candidate set of cell-specific methylation markers to find li- from 300 to 600 CpGs generally yield accurate and reliable braries that optimize the accuracy of cell fraction estimates deconvolution estimates. In the first iteration of the IDOL obtained from cell mixture deconvolution. Library sizes algorithm, all L*K CpGs constituting the candidate library ranging from 50 to 800 CpGs, in increments of 50, were have an equal probability of being selected to be included considered (see IDOL algorithm above for details). The se- in the DMR subset library. Using the randomly assembled lected probes (n = 450, IDOL optimized L-DMR library), DMR library, the constrained projection/quadratic pro- plus the genomic context information, are supplied as gramming approach [7] is applied to obtain cell compos- Additional file 4. Per each cell type the following number ition estimates for each sample in the training dataset. of probes were selected: Bcell 71, CD4T 70, CD8T 82, Using these predictions, the R and RMSE (root mean Mono 72, Neu 73, and NK 82. As both the reference and square error) were calculated for each of the cell types the target were EPIC datasets, we changed the default (Additional file 1: Figure S4), comparing the cell estimates normalization and only used the methylumi-noob back- to their known proportion in each sample. One-by-one ground correction [41] before the cell projection. This last CpGs are removed from the randomly selected DMR li- method is referred to as IDOL selection EPIC. As this brary, followed by computation of R and RMSE based on method is not included within the estimateCellCounts cell composition estimates obtained using a library con- function, we offer a modified function in our package sisting of only the remaining CpGs. This procedure allows named estimateCellCounts2 which allows all the options assessment of the contribution of each CpG in the library already included in the original function plus the use of in terms of its impact on the accuracy of cell composition IDOL-customized probe selection. The estimates of the estimates and, in doing so, provides a basis for modifying three methods were compared against the proportion of the probability of each CpG being selected in subsequent cell DNA included in the mixture (true value); we report IDOL iterations. This process is repeated at each iteration, the R and the RMSE (residual mean standard error) for with the algorithm eventually converging on an “optimal” the three methods. The absolute mean error was calcu- library for deconvolution. Per Koestler et al. [18], we used lated subtracting the estimated proportions from the 500 iterations in our implementation of IDOL. reconstructed (true) fraction spiked in the sample. As a measure of the global variance variability we used a Deconvolution methods Bartlett test for homogeneity of the variances to com- We used three different deconvolution methods to assess pare the three methods. As a sensitivity analysis we the performance of the new reference library. First, we used compared the results of the EPIC IDOL library using Salas et al. Genome Biology (2018) 19:64 Page 12 of 14 CP/QP (minfi method) versus two additional deconvo- were carried out on whole blood provided by six healthy lution methods: 1) CBS-CIBERSORT, a support vector blood donors using established methods as described in machine non-constrained projection; and 2) RPC (ro- Accomando et al. [46]. In brief, the gating strategy bust partial correlation), a linear non-constrained pro- included counting total leukocytes using CD45(+), granu- jection using the methods described by Teschendorff et locytes based on CD16 and CD15, monocytes based on al. and available in the R package EpiDISH [14]. We CD14, total T lymphocytes marked as CD3(+), CD4T used a paired t-test to compare the true values (fraction were CD3(+) CD4(+), CD8T were CD3(+) CD8(+), B lym- of cell DNA in the artificial mixture) versus the esti- phocytes marked as CD19(+), and NK as CD56(+). We mates obtained by the three deconvolution methods. also compared the performance of our estimations against six additional samples with FACS information available in Longitudinal dataset GEO which were arrayed using the Illumina Human- A repeated measurement dataset (GSE110530) [43]was Methylation450k array (GSE77797) [44]. Finally, for five of from a male adult volunteer who provided 12 samples of the 11 samples of our longitudinal dataset analyzed with blood distributed over a period of 350 days from March the EPIC platform, we had partial FACS information. In 2011 to February 2012. The DNA was extracted from this latter dataset we show the CD3(−) lymphocyte frac- whole blood within 24 h of sampling and archived at tion as the sum of Bcell and NK. DNA isolated from − 80 °C. Total input DNA of 0.75 μg, as measured by donor blood was stored at − 80 °C for approximately Quant-iT Picogreen dsDNA Assay (Invitrogen, Carls- 6 years before being assayed on the EPIC array. Data are bad, CA), was prepared for each time point. Samples available in the GEO (GSE110530) [43]. were randomized across the slots of the microarray. Bisulfite conversion and processing were performed Test for enrichment according to Illumina protocols using the IlluminaHu- The IDOL L-DMR library was tested for enrichment manMethylationEPIC array at the Vincent J. Coates Gen- using the GO database version 3.5.0 with date 11/08/ omics Sequencing Laboratory at UC Berkeley. During 2017, and the immune curated GSEA (set 7) version 6.1 quality control, one of the samples (time point 11) showed using missMethyl to correct for array bias [47]. Only a different SNP content pointing to a potential sample those pathways containing more than ten probes of the mix-up and was excluded from this analysis. We estimated L-DMR library and pathways with less than 2000 genes the cell composition using the 450 K L-DMR and the were selected for this analysis. Pathways with a false EPIC IDOL L-DMR methods and compared the mean dif- discovery rate < 0.05 were considered statistically significant. ference and homogeneity of the variance of the estimates between both methods using t-test and the Bartlett test. Additional files Extension to whole blood samples and application for legacy Additional file 1: Figure S1. Estimated cell purity by flow cytometry per cell type. Figure S2. Heatmap based on a hierarchical cluster of 450 K datasets purified cell types and cell mixtures based on the array SNPs. Figure S3. As a potential extension and validation of our algorithm, Association between the top 20 principal components and potential we used a public dataset (GSE77797) [44]containing12 confounders for DNA methylation. Figure S4. Iterative testing of different L-DMR library sizes using the IDOL optimization algorithm. samples of artificial mixtures and six whole blood samples Table S1. Cell composition percentages for the artificial reconstruction with known flow-sorted fractions for the main six cell samples. Figure S5. Comparison of several probe selection methods and fractions arrayed using the Illumina HumanMethyla- estimated cell proportions using constrained projection/quadratic programming (CP/QP) versus the reconstructed (true) DNA fraction in tion450k platform. We optimized an EPIC IDOL 450 K the artificial DNA mixtures. Figure S6. Bland-Altman plots comparing legacy L-DMR library using the same procedure described the mean differences between the estimated cell fraction using three for the EPIC IDOL L-DMR above. The resulting library deconvolution methods and the true fraction in the artificial mixture per cell type. Figure S7. Comparison of the estimated cell proportions contained 350 probes present on the previous 450 K Illu- using CP/QP using an IDOL-optimized library restricted to the Illumina mina DNA methylation array generation (Additional file 5). HumanMethylation450K k array versus the reconstructed (true) DNA fraction The estimated cell composition using this EPIC IDOL in the artificial DNA mixtures arrayed in the 450 k platform. (PDF 618 kb) 450 K legacy L-DMR library was compared against the Additional file 2: Gene Ontology enrichment of the probes contained in the L-DMR IDOL library. (CSV 28 kb) reconstructed fraction or the FACS measured fraction. Additional file 3: GSEA enrichment using the curated set 7 (immune We report the R and the RMSE for the artificial mixtures profiles) of the probes contained in the L-DMR IDOL library. (CSV 13 kb) and the RMSE for the FACS measured samples. Additional file 4: L-DMR IDOL library. (CSV 113 kb) Additional file 5: L-DMR IDOL 450 K legacy library. (CSV 88 kb) Validation using samples with FACS information Three independent datasets were used for validation. We Abbreviations ran six samples of healthy donors with FACS information Bcells: B lymphocytes; CD4T: CD4+ T cells; CD8T: CD8+ T cells; CP/QP: Constrained using the EPIC platform (GSE112618) [45]. FACS analyses projection/ quadratic programming; EWAS: Epigenome-wide association studies; Salas et al. Genome Biology (2018) 19:64 Page 13 of 14 FACS: Fluorescence activated cell sorting; IDOL: Identifying Optimal Received: 23 February 2018 Accepted: 8 May 2018 Libraries; L-DMR: Leukocyte differentially methylated regions; Mono: Monocytes; Neu: Neutrophils; NK: Natural killer cells; NLR: Neutrophil to lymphocyte ratio; R : Coefficient of determination; RMSE: Root mean square error References 1. Breton CV, Marsit CJ, Faustman E, Nadeau K, Goodrich JM, Dolinoy DC, et al. Acknowledgements Small-magnitude effect sizes in epigenetic end points are important in This work used the Vincent J. Coates Genomics Sequencing Laboratory at UC children’s environmental health studies: the Children’s Environmental Health Berkeley. and Disease Prevention Research Center’s Epigenetics Working Group. Environ Health Perspect. 2017;125:511–26. 2. Christensen BC, Houseman EA, Marsit CJ, Zheng S, Wrensch MR, Wiemels JL, et al. Funding Aging and environmental exposures alter tissue-specific DNA methylation This work was supported by NIH grants (R01CA52689 and P50CA097257 to dependent upon CpG island context. PLoS Genet. 2009;5:e1000602. J.K. Wiencke; R01CA207110 to K.T. Kelsey; R01DE022772, P20GM104416–8189, 3. Levenson VV. DNA methylation as a universal biomarker. Expert Rev Mol and R01CA216265 to B.C. Christensen). The Robert Magnin Newman Diagn. 2010;10:481–8. endowment for Neuro-oncology (JKW). This work was also supported by the 4. Houseman EA, Kim S, Kelsey KT, Wiencke JK. DNA methylation in whole Kansas IDeA Network of Biomedical Research Excellence Bioinformatics Core, blood: uses and challenges. Curr Environ Heal Rep. 2015;2:145–54. supported in part by the National Institute of General Medical Science 5. Titus AJ, Gallimore RM, Salas LA, Christensen BC. Cell-type deconvolution (NIGMS) Award P20GM103418. from DNA methylation: a review of recent applications. Hum Mol Genet. 2017;26:R216–24. 6. Teschendorff AE, Zheng SC. Cell-type deconvolution in epigenome-wide Availability of data and materials association studies: a review and recommendations. Epigenomics. 2017;9:757–68. The datasets generated and/or analyzed during the current study are available 7. Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, in the superSeries GSE110555 in the GEO (https://www.ncbi.nlm.nih.gov/geo/ Nelson HH, et al. DNA methylation arrays as surrogate measures of cell query/acc.cgi?acc=GSE110555)[48]. The specific accession codes are GSE110554 mixture distribution. BMC Bioinformatics. 2012;13:86. (FlowSorted.Blood.EPIC) [38], GSE110530 (longitudinal dataset) [43], and 8. Houseman EA, Kelsey KT, Wiencke JK, Marsit CJ. Cell-composition effects in GSE112618 (validation FACS whole blood samples) [45]. The additional the analysis of DNA methylation array data: a mathematical perspective. validation set including artificial mixtures and FACS whole blood cell BMC Bioinformatics. 2015;16:95. fractions using Illumina HumanMethylation450k is available under the 9. Zheng SC, Beck S, Jaffe AE, Koestler DC, Hansen KD, Houseman AE, et al. accession number GSE77797 [44]. The R package FlowSorted.Blood.EPIC Correcting for cell-type heterogeneity in epigenome-wide association is available in Bioconductor (https://bioconductor.org/packages/ studies: revisiting previous analyses. Nat Methods. 2017;14:216–7. FlowSorted.Blood.EPIC) and the original source code is available 10. Guo S, Diep D, Plongthongkum N, Fung H-L, Zhang K, Zhang K. through https://github.com/immunomethylomics/FlowSorted.Blood.EPIC Identification of methylation haplotype blocks aids in deconvolution of (under licenseGPL-3.0). Forreproducibility thesourcecodehas also heterogeneous tissue samples and tumor tissue-of-origin mapping from been deposited on Zenodo (doi: https://doi.org/10.5281/zenodo.1241199 plasma DNA. Nat Genet. 2017;49:635–42. for the package and doi: https://doi.org/10.5281/zenodo.1243840 for the 11. Koestler DC, Usset J, Christensen BC, Marsit CJ, Karagas MR, Kelsey KT, et al. scripts for the figures and tables) [37, 49–51]. DNA methylation-derived neutrophil-to-lymphocyte ratio: an epigenetic tool to explore cancer inflammation and outcomes. Cancer Epidemiol Biomark Authors’ contributions Prev. 2017;26:328–38. The original idea was proposed by LAS, DCK, JKW, KTK, and BCC. RAB and 12. Wiencke JK, Koestler DC, Salas LA, Wiemels JL, Roy RP, Hansen HM, et al. HMH performed the DNA extractions and DNA reconstruction experiments. Immunomethylomic approach to explore the blood neutrophil lymphocyte LAS and DCK contributed to the processing and bioinformatic analyses of ratio (NLR) in glioma survival. Clin Epigenetics. 2017;9:10. the paper. All authors participated in the interpretation of data for the work. 13. Reinius LE, Acevedo N, Joerink M, Pershagen G, Dahlén SE, Greco D, et al. LAS, DCK, and BCC were responsible for the initial draft of the work. All Differential DNA methylation in purified human blood cells: Implications for authors participated in final drafting and critical revision for important cell lineage and studies on disease susceptibility. PLoS One. 2012;7:e41361. intellectual content. All authors read and approved the final manuscript. 14. Teschendorff AE, Breeze CE, Zheng SC, Beck S. A comparison of reference- based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies. BMC Bioinformatics. 2017;18:105. Ethics approval and consent to participate 15. Goode DK, Obier N, Vijayabaskar MS, Lie-A-Ling M, Lilly AJ, Hannah R, et al. Cells used in these experiments were obtained commercially. All donors are Dynamic gene regulatory networks drive hematopoietic specification and anonymous. All the subjects provided written informed consent before differentiation. Dev Cell. 2016;36:572–87. donation to the commercial houses which provided the commercial cells. 16. Zhou W, Laird PW, Shen H. Comprehensive characterization, annotation and innovative use of Infinium DNA methylation BeadChip probes. Nucleic Acids Res. 2017;45:e22. Competing interests 17. Logue MW, Smith AK, Wolf EJ, Maniates H, Stone A, Schichman SA, et al. The authors declare that they have no competing interests. The correlation of methylation levels measured using Illumina 450K and EPIC BeadChips in blood samples. Epigenomics. 2017;9:1363–71. 18. Koestler DC, Jones MJ, Usset J, Christensen BC, Butler RA, Kobor MS, et al. Improving cell mixture deconvolution by identifying optimal DNA Publisher’sNote methylation libraries (IDOL). BMC Bioinformatics. 2016;17:120. Springer Nature remains neutral with regard to jurisdictional claims in 19. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, published maps and institutional affiliations. et al. Minfi: A flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363–9. Author details Department of Epidemiology, Geisel School of Medicine, Dartmouth 20. Saijo K, Schmedt C, Su I-H, Karasuyama H, Lowell CA, Reth M, et al. Essential College, Lebanon, NH, USA. Department of Biostatistics, University of Kansas role of Src-family protein tyrosine kinases in NF-kappaB activation during B Medical Center, Kansas City, KS, USA. Departments of Epidemiology and cell development. Nat Immunol. 2003;4:274–9. Pathology and Laboratory Medicine, Brown University, Providence, RI, USA. 21. Miceli MC, Parnes JR. Role of CD4 and CD8 in T cell activation and Department of Neurological Surgery, Institute for Human Genetics, differentiation. Adv Immunol. 1993;53:59–122. University of California San Francisco, San Francisco, CA, USA. Departments 22. Addison EG, North J, Bakhsh I, Marden C, Haq S, Al-Sarraj S, et al. Ligation of of Molecular and Systems Biology, and Community and Family Medicine, CD8alpha on human natural killer cells prevents activation-induced Geisel School of Medicine, Dartmouth College, Lebanon, NH, USA. apoptosis and enhances cytolytic activity. Immunology. 2005;116:354–61. Salas et al. Genome Biology (2018) 19:64 Page 14 of 14 23. Fazi F, Rosa A, Fatica A, Gelmetti V, De Marchis ML, Nervi C, et al. A 44. Koestler DC, Christensen BC, Wiencke JK, Kelsey KT. GSE77797: DNA minicircuitry comprised of microRNA-223 and transcription factors NFI-A methylation profiling of whole blood and reconstructed mixtures of purified and C/EBPalpha regulates human granulopoiesis. Cell. 2005;123:819–31. leukocytes isolated from human adult blood. Gene Expression Omnibus. 24. Vian L, Di Carlo M, Pelosi E, Fazi F, Santoro S, Cerio AM, et al. Transcriptional 2016. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77797. fine-tuning of microRNA-223 levels directs lineage choice of human Accessed 4 May 2018. hematopoietic progenitors. Cell Death Differ. 2014;21:290–301. 45. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. GSE112618: FACS validation dataset: An optimized library for reference- 25. Yang K, Shrestha S, Zeng H, Karmaus PWF, Neale G, Vogel P, et al. T based deconvolution of whole-blood biospecimens assayed using the cell exit from quiescence and differentiation into Th2 cells depend on Illumina HumanMethylationEPIC BeadArray (III). Gene Expression Omnibus. Raptor-mTORC1-mediated metabolic reprogramming. Immunity. 2013; 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE112618. 39:1043–56. Accessed 4 May 2018. 26. Yusuf N, Hidalgo B, Irvin MR, Sha J, Zhi D, Tiwari HK, et al. An epigenome- 46. Accomando WP, Wiencke JK, Houseman EA, Nelson HH, Kelsey KT. wide association study of inflammatory response to fenofibrate in the Quantitative reconstruction of leukocyte subsets using DNA methylation. Genetics of Lipid Lowering Drugs and Diet Network. Pharmacogenomics. Genome Biol. 2014;15:R50. 2017;18:1333–41. 47. Phipson B, Maksimovic J, Oshlack A. missMethyl: an R package for 27. Puck A, Aigner R, Modak M, Cejka P, Blaas D, Stöckl J. Expression and analyzing data from Illumina’s HumanMethylation450 platform. regulation of Schlafen (SLFN) family members in primary human monocytes, Bioinformatics. 2016;32:286–8. monocyte-derived dendritic cells and T cells. Results Immunol. 2015;5:23–32. 48. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. 28. Stehbens SJ, Paszek M, Pemble H, Ettinger A, Gierke S, Wittmann T. CLASPs GSE110555: SuperSeries: an optimized library for reference-based deconvolution link focal-adhesion-associated microtubule capture to localized exocytosis of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC and adhesion site turnover. Nat Cell Biol. 2014;16:561–73. BeadArray. Gene Expression Omnibus. 2018. https://www.ncbi.nlm.nih.gov/geo/ 29. de Goede OM, Lavoie PM, Robinson WP. Characterizing the query/acc.cgi?acc=GSE110555. Accessed 4 May 2018. hypomethylated DNA methylation profile of nucleated red blood cells from 49. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. cord blood. Epigenomics. 2016;8:1481–94. FlowSorted.Blood.EPIC. GitHub. 2018. https://github.com/ 30. de Goede OM, Razzaghian HR, Price EM, Jones MJ, Kobor MS, Robinson WP, immunomethylomics/FlowSorted.Blood.EPIC. Accessed 4 May 2018. et al. Nucleated red blood cells impact DNA methylation and expression 50. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. analyses of cord blood hematopoietic cells. Clin Epigenetics. 2015;7:95. Immunomethylomics/FlowSorted.Blood.EPIC: FlowSorted.Blood.EPIC v.0.99.36. 31. Bakulski KM, Feinberg JI, Andrews SV, Yang J, Brown S, L McKenney S, et al. Zenodo. 2018. https://doi.org/10.5281/ZENODO.1241200. Accessed 4 May 2018. DNA methylation of cord blood cell types: Applications for mixed cell birth 51. Salas LA. v.1.0 immunomethylomics/Analysis_FlowSorted.Blood.EPIC: analysis studies. Epigenetics. 2016;11:354–62. scripts. 2018. https://doi.org/10.5281/zenodo.1243840. Accessed 4 May 2018. 32. Gervin K, Page CM, Aass HCD, Jansen MA, Fjeldstad HE, Andreassen BK, et al. Cell type specific DNA methylation in cord blood: a 450K-reference dataset and cell count-based validation of estimated cell type composition. Epigenetics. 2016;2294:00. 33. Shvetsov YB, Song M-A, Cai Q, Tiirikainen M, Xiang Y-B, Shu X-O, et al. Intraindividual variation and short-term temporal trend in DNA methylation of human blood. Cancer Epidemiol Biomark Prev. 2015;24:490–7. 34. Urdinguio RG, Torró MI, Bayón GF, Álvarez-Pitti J, Fernández AF, Redon P, et al. Longitudinal study of DNA methylation during the first 5 years of life. J Transl Med. 2016;14:160. 35. Tan Q, Heijmans BT, Hjelmborg JVB, Soerensen M, Christensen K, Christiansen L. Epigenetic drift in the aging genome: a ten-year follow-up in an elderly twin cohort. Int J Epidemiol. 2016;45:1146–58. 36. Kananen L, Marttila S, Nevalainen T, Kummola L, Junttila I, Mononen N, et al. The trajectory of the blood DNA methylome ageing rate is largely set before adulthood: evidence from two longitudinal studies. Age (Dordr). Age. 2016;38:65. 37. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. FlowSorted.Blood.EPIC. Bioconductor. 2018. https://bioconductor.org/ packages/FlowSorted.Blood.EPIC, https://doi.org/10.18129/B9.bioc. FlowSorted.Blood.EPIC. Accessed 4 May 2018. 38. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. GSE110554: FlowSorted.Blood.EPIC: An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray (II). Gene Expression Omnibus. 2018. https:// www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110554. [cited 2018 May 4]. 39. Xu Z, Niu L, Li L, Taylor JA. ENmix: a novel background correction method for Illumina HumanMethylation450 BeadChip. Nucleic Acids Res. 2016;44:e20. 40. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2017. 41. Fortin J-P, Triche TJ, Hansen KD. Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array with minfi. Bioinformatics. 2017;33:558–60. 42. Touleimat N, Tost J. Complete pipeline for Infinium(®) Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics. 2012;4:325–41. 43. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. GSE110530: Longitudinal dataset: An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray (I). Gene Expression Omnibus. 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc= GSE110530. Accessed 4 May 2018. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Genome Biology Springer Journals

An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray

Free
14 pages

Loading next page...
 
/lp/springer_journal/an-optimized-library-for-reference-based-deconvolution-of-whole-blood-ZsYf40vBnP
Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s).
Subject
Life Sciences; Animal Genetics and Genomics; Human Genetics; Plant Genetics and Genomics; Microbial Genetics and Genomics; Bioinformatics; Evolutionary Biology
eISSN
1474-760X
D.O.I.
10.1186/s13059-018-1448-7
Publisher site
See Article on Publisher Site

Abstract

Genome-wide methylation arrays are powerful tools for assessing cell composition of complex mixtures. We compare three approaches to select reference libraries for deconvoluting neutrophil, monocyte, B-lymphocyte, natural killer, and CD4+ and CD8+ T-cell fractions based on blood-derived DNA methylation signatures assayed using the Illumina HumanMethylationEPIC array. The IDOL algorithm identifies a library of 450 CpGs, resulting in an average R =99.2 across cell types when applied to EPIC methylation data collected on artificial mixtures constructed from the above cell types. Of the 450 CpGs, 69% are unique to EPIC. This library has the potential to reduce unintended technical differences across array platforms. Keywords: DNA methylation, Epigenetics, Neutrophils, Monocytes, Natural killer cells, B-cells, Helper T-cells, Cytotoxic T-lymphocytes, Adults, Leukocytes Background assessed [4]. While some of the observed changes in DNA DNA methylation microarrays have become a widely methylation reported in EWAS reflect induced epigenetic utilized tool for epigenome-wide association studies alterations within the constituent cells, others may reflect (EWAS), including expanded use in studies investigating coordinately induced changes in the proportions of the association between DNA methylation with environ- leukocyte subtypes in circulation that underlie or contrib- mental exposures, and in the setting of case-control and ute to the pathophysiologic process. Both reference-based longitudinal studies [1, 2]. Peripheral blood is the most and non-reference-based techniques have been used to commonly used biospecimen for these analyses primarily control the effect of cell heterogeneity, and thus possible because it is easily accessible through a minimally invasive confounding, in different studies, and their specific appli- procedure, although some emerging evidence suggests cations have been detailed elsewhere [5, 6]. Deconvolution that some specific DNA methylation changes in blood techniques, such as constrained projection/quadratic pro- may reflect pathological states in target organs not easily gramming (CP/QP) [7], provide a framework for estimat- or safely accessible by biopsy [3]. Finally, DNA methyla- ing the relative proportions of blood cell types using tion profiles in blood may, in some instances, summarize blood-derived signatures of DNA methylation. So-called information of systemic exposures or diseases where cells “deconvolution estimates” can then be used in down- from a single organ or tissue cannot be specifically stream statistical models to adjust for the potential con- founding effects of cell composition [4, 7–9], or examined * Correspondence: Brock.C.Christensen@dartmouth.edu independently to determine their association with risk or Lucas A. Salas and Devin C. Koestler contributed equally to this work. 1 exposures [10–12]. Indeed, as in other general clinical ap- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Lebanon, NH, USA plications, the ratio of myeloid to lymphoid lineages (neu- Departments of Molecular and Systems Biology, and Community and Family trophil to lymphocyte ratio (NLR)) can be reconstructed Medicine, Geisel School of Medicine, Dartmouth College, Lebanon, NH, USA in archival whole blood DNA samples measured with Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Salas et al. Genome Biology (2018) 19:64 Page 2 of 14 methylation arrays using different deconvolution approaches purity across the six cell subtypes (Additional file 1:Figure [11, 12]. Furthermore, while complete blood cell counts S1). Individual sample cell purity is reported in the pheno- (CBC) are sometimes concurrently collected, deconvolution type table in the FlowSorted.Blood.EPIC file. We explored algorithms allow for estimation of lymphoid-specific sub- potential genetic clustering using the ethnicity and the 59 populations (e.g., B cells, CD4(+) T, CD8(+) T and natural control SNP probes included in the array. A striking killer (NK) lymphocytes), approximating the results of cell clustering was observed when grouping for larger ethnic flow sorting without the need of additional sample input. groups (Additional file 1: Figure S2). Using a principal This strategy helps to correct for inter-individual variation, component regression analysis, the first 20 principal lineage relationships (changes in the myeloid versus lymph- components were tested against potential confounders oid ratio mean), and potential effects of prominent methyla- (Additional file 1: Figure S3). Each of the first five princi- tion differences in general cell subpopulations. pal components were significantly (P < 0.01) associated Current analysis pipelines for estimating the propor- with cell type composition. Other potential confounders, tions of leukocyte subtypes in adult blood are based on including body mass index (BMI; P < 0.01), subject weight samples from six adult males that were purified with (P < 0.01), age (in years; P < 0.01) or sex (P <1 E-04) were flow cytometry and profiled for DNA methylation using only accounted for in the sixth to eleventh principal com- the Illumina HumanMethylation450K platform (450 K ponents, and smoking was significant only with the 12th array) [13]. As the 450 K array is now the predecessor to component (P < 0.05). Cell purity was associated with the the recently released Illumina HumanMethylationEPIC ninth principal component (P < 0.05). In contrast to what array (EPIC array), an outstanding question relates to was observed using the genetic information, ethnicity was accuracy of cell deconvolution of peripheral blood not significantly associated with any of the top 20 principal measured on the EPIC array using existing 450 K refer- components. ence methylation signatures. The EPIC array interrogates We used three deconvolution methods for comparison > 860,000 CpG sites, roughly twice the number of CpGs (see “Methods” for details): 1) the current commonly used contained on the 450 K array, with additional genomic 450 K reference (Reinius et al.) [13] using the automatic content in enhancer regions and DNase hypersensitive selection to identify the L-DMR library in the minfi Bio- sites (DHS), which are important in hematopoietic de- conductor package [19]; 2) our new EPIC reference using velopment and differentiation [14, 15]. In this work, we the automatic selection to identify the L-DMR library in extend the available reference library for deconvolution minfi; 3) our new EPIC reference using IDOL selection to of blood cell proportions using the EPIC array with the identify the L-DMR library [18]. The automatic selection goal of both improving the accuracy of cell composition picks the top 50 hyper- and hypomethylated probes for estimates and overcoming potential technical differences each cell type (600 probes total), while the IDOL method in platforms [16, 17]. Using antibody bead sorted neu- identified 450 probes as the optimal number of probes for trophils, B cells, monocytes, NK cells, CD4+ T cells, and deconvolution (Additional file 1: Figure S4). The probes CD8+ T cells, we measured DNA methylation with the selected by the various methods are compared in Fig. 1. 850 K EPIC DNA methylation array and applied an Only 26 probes overlapped across the different methods. iterative algorithm for Identifying Optimal Libraries A comparison of the selected probes, including the pro- (IDOL) from leukocyte differentially methylated regions portion per genomic context and probes in Phantom5 en- (L-DMR) that improves the accuracy and efficiency of hancers and DHS, is provided in Table 1. The majority of cell composition estimates obtained by cell mixture the probes selected for deconvolution using the new EPIC deconvolution [18]. We then compared the performance array were not present on the previous 450 K array; 66% of cell estimates obtained using the EPIC platform and of the probes selected using the automatic selection optimized library to the now unavailable 450 K array in method and 69% of the probes selected with IDOL were artificial blood mixtures with predefined cell proportions. unique to the EPIC array. As expected, more probes in the open sea were selected using the new reference library Results (80 and 76% using the automatic and IDOL methods, The FlowSorted.Blood.EPIC dataset contains information respectively, compared to 57% using the 450 K reference). from neutrophils (Neu, n = 6), monocytes (Mono, n =6), B In addition, approximately twice as many Phantom5 lymphocytes (Bcells, n = 6), CD4+ T cells (CD4T, n = 7, six enhancer sites were selected using the new EPIC platform samples and one technical replicate), CD8+ T cells (CD8T, with the automatic (18%) and EPIC methods (16%) com- n = 6), natural killer cells (NK, n = 6), and 12 DNA artifi- pared with the Reinius [13] reference probe set from the cial mixtures (labeled as MIX in the dataset). Across sam- 450 K platform (9%). ples, the average purity reported in the control flow Once we determined the probes for cell type estima- sorting (after antibody-linked magnetic bead sorting) was tion, we used the minfi modified Houseman constrained 95% (range 88 to 99%), with Mono having the lowest projection approach [7] to estimate the cell composition Salas et al. Genome Biology (2018) 19:64 Page 3 of 14 Fig. 1 Comparison of L-DMR libraries among automatic selection in minfi and the IDOL algorithm for optimization. a Reinius reference dataset [13] probes from the 450 K array (n = 600 CpGs). b Probes selected from the new reference samples measured with the EPIC array (n = 600 CpGs). c L-DMR library derived from IDOL using the EPIC array (n =450 CpGs). d Overlapping of the probes of the three methods. DHS DNase hypersensitive sites of 12 samples, spread across two sets of artificially re- methylation data, and the variance of our estimates was constructed mixtures. As the specific amount of DNA consistently lower (Fig. 2b,Additionalfile 1: Figure S5). per cell type in each mixture was known, we compared For all the cell types, except CD4T, the R was over 99.7%. our estimate of cell proportions to the amount of DNA The lowest R estimate from applying the IDOL method represented by that cell type in each of the artificial to the EPIC platform data was for CD4T (R =95.5%). mixtures (Fig. 2a, Additional file 1: Table S1). The R The observed versus expected estimate for CD4T was (coefficient of determination) values were > 86% across all slightly better when using the 450 K L-DMR library (R = cell types and across the three tested methods (Additional 98.1%), and performance was worse using automatic selec- file 1: Figure S5). However, we consistently obtained better tion with data from the EPIC platform (R =86.0%). cell type proportion estimates (higher R and lower RMSE Although the results are highly correlated to the actual (root mean square error)) when using the L-DMR library proportion of DNA in the artificial mixtures when using generated with the IDOL method from EPIC platform the Reinius [13] 450 K reference L-DMR library, the Salas et al. Genome Biology (2018) 19:64 Page 4 of 14 Table 1 Genomic context of CpG sites selected for each L-DMR library approach Automatic selection 450 K Automatic selection EPIC IDOL EPIC P Probes (n = 600) (n = 600) (n = 450) N(%) N(%) N(%) CpG present on 450 K array 600 (100) 203 (34) 140 (31) 7.50E-29 Enhancers (Phantom5) 53 (9) 108 (18) 70 (16) 3.13E-69 DNase hypersensitive sites 452 (75) 467 (78) 328 (73) 6.66E-147 Genomic context CpG island 47 (8) 24 (4) 10 (2) 4.35E-62 Shores 116 (19) 55 (9) 63 (14) Shelves 95 (16) 41 (7) 35 (8) Open sea 342 (57) 480 (80) 342 (76) Functional context TSS1500 73 (12) 44 (7) 38 (8) 2.10E-102 TSS200 46 (8) 23 (4) 11 (2) 5′ UTR 76 (13) 69 (12) 46 (10) First exon 20 (3) 16 (3) 7 (2) Body 241 (40) 283 (47) 210 (47) 3′ UTR 29 (5) 11 (2) 9 (2) Intergenic 115 (19) 152 (25) 128 (28) a 2 P is calculated from the χ test comparing the proportions between the three L-DMR selection methods estimates showed increased variability compared to esti- overestimated this cell type (mean differences = 0.6 mates obtained using the EPIC reference L-DMR library and 0.4%, respectively). NK cells were slightly under- (Fig. 3). Importantly, the magnitude of the variance was estimated using CBS (mean difference = − 0.3). All strongly significantly lower using IDOL compared to com- three methods overestimated monocytes by a small panion automatic methods (Bartlett test P =7.60 E-26). percentage that was statistically significant (P < 0.05). As a sensitivity analysis, we compared the results of the The lowest mean difference in monocytes was ob- IDOL library using CP/QP (minfi method) versus two served with CP/QP (0.5%), followed by RPC (0.7%) additional deconvolution methods: 1) CBS-CIBERSORT, a and CBS (1.6%). CBS outperformed RPC and CP/QP support vector machine non-constrained projection; and for neutrophil estimation (P >0.1), and both CP/QP 2) and Robust Partial Correlation (RPC) a linear and RPC underestimated neutrophils (mean differ- non-constrained projection using the methods described ences = − 1.27 and − 1.66%, respectively). by Teschendorff et al. and available in the R package EpiDISH [14]. Using a paired t-test, we compared the true values (fraction of cell DNA in the artificial mixture) Pathways in the selected EPIC IDOL library versus the estimates obtained by the three methods; the The probes present in the new EPIC IDOL L-DMR library global mean differences for all the cells were not statisti- were tested for enrichment using missMethyl and the cally significant (P > 0.05 for the three tests). The paired Gene Ontology (GO) and the Gene Set Enrichment Ana- mean differences were analyzed using Bland-Altman plots lysis (GSEA) set 7 (immune related) v.6.1 pathways. In (Additional file 1: Figure S6). Subtle, less than 2%, mean total, 375 GO pathways (299 biological processes, 31 cell differences were observed in the cell by cell estimate com- components, and 45 molecular features) and 181 GSEA parisons across each of the deconvolution methods. set 7 pathways were statistically significant (false discovery Across the three deconvolution methods, there were no rate < 0.05) after array bias correction (Additional files 2 statistically significant differences between the true values and 3). Among others, several GO pathways were tracked for CD8T cells and their estimated fraction (P >0.05). RPC to the parent GO terms response to wounding (e.g., in- mean estimates were closer to the true values for CD4T, flammatory response, defense response), T-cell activation though CBS underestimated CD4T (mean difference = (e.g., T-cell activation) and leukocyte proliferation. The − 0.8%), and CP/QP overestimated CD4T (mean dif- GSEA pathways included pathways related to the different ference = 1.0%). RPC estimates were also closer to the six cell types and other cells derived from the six cells true Bcell values, whereas both CBS and CP/QP included in the database. Salas et al. Genome Biology (2018) 19:64 Page 5 of 14 Fig. 2 Comparison of estimate cell proportions using constrained projection/quadratic programming (CP/QP) versus the reconstructed (true) DNA fraction in the artificial DNA mixtures using the EPIC IDOL method. a Cell-specific DNA proportions per sample included in the two mixture reconstruction methods (methods A and B). b R and RMSE using the EPIC IDOL method and the two reconstruction methods Potential applications 350 days of observation. Although the measurements Although any EWAS using the IlluminaHumanMethyla- were from a healthy male adult volunteer, we observed tionEPIC array could benefit from estimates derived an important range of variability in the cell subpopula- from the new reference library, one specific setting in tions across the different time points (Fig. 4). Specific- which the use of more precise cell estimates is particu- ally, we observed a potential underestimation of CD8T larly beneficial includes longitudinal and/or repeated (− 5.5% in the 450 K estimates, P = 2.86E-05) and Bcell measurement studies. Using a dataset containing periph- (− 3.84% in the 450 K estimates, P = 1.10E-04). Further, eral blood DNA methylation information from one vol- when examining the cell ratios (lineage relationships) unteer, 11 repeated measurements were obtained across those ratios containing these cell subpopulations, and in Salas et al. Genome Biology (2018) 19:64 Page 6 of 14 Fig. 3 Observed estimates of absolute error by deconvolution method per cell type (top panel) and global per method (bottom panel) particular those including CD8T cells alone, were L-DMR library. The performance of the 450 K-restricted dramatically affected (CD4T/CD8T increased 5.82 350 CpG L-DMR library in 12 artificial reconstructed points, P = 5.76E-05; CD8T + Mono/NK were reduced mixtures (GSE77797) measured with the 450 K platform 4.78 points, P = 1.28E-04). In the CD8T/Bcell ratio, al- (Additional file 1: Figure S7) was consistent with the though the mean change was non-statistically significant, performance of the 450 CpG L-DMR library in the EPIC the global variance of the ratio was affected (Bartlett test samples (Fig. 2). In particular, slightly higher RMSEs and P = 4.64E-05). Importantly, the neutrophil estimates slightly lower R values were observed (Additional file 1: were preserved using any of the reference datasets, Figure S7); however, the RMSEs were lower and the R though the neutrophil to lymphocyte ratio (NLR) esti- values were higher, or similar, to those reported previously mates were slightly higher using the 450 K compared to by Koestler et al. [18] using the Reinius reference [13]. the EPIC IDOL L-DMR library (1.97 points, P = 0.06). Finally, to further validate our estimation using actual Subtle changes are better captured using the new samples, we estimated the cell composition of whole L-DMR library. In addition, the discordance in the blood samples collected from six additional healthy CD8T and Bcell estimates using the 450 K L-DMR li- donors, whose DNA was run on the EPIC platform, and brary compared with the EPIC array library (underesti- compared our estimates to FACS measured cell propor- mation in the reconstructed samples comparisons) was tions collected on the same samples (GSE112618; consistent with the direction and magnitude in the error Fig. 5a). We also deconvoluted six publicly available observed when we compared the performance with the samples with available FACS information arrayed using reconstructed samples (Fig. 3). the 450 K platform (GSE77797; Fig. 5b), and in five of To interrogate deconvolution using datasets from the the 11 samples in our longitudinal dataset with FACS previous Illumina HumanMethylation450K platform, we information at the time of the sample collection (years optimized a second set IDOL L-DMR library from our 2011 and 2012), we compared our estimates of cell com- data, measured on the EPIC platform, but including only position in archival DNA samples arrayed using the probes also present on the 450 K array. Restricting to EPIC platform to the corresponding FACS measured CpGs also on the 450 K array, a set of 350 probes was proportions (GSE110530; Fig. 5c). In all three datasets, identified as the optimal IDOL L-DMR library (similar RMSEs were less than 2.0 for all cell types (Fig. 5). The R and lower RMSE than libraries containing less or least accurate estimations were observed in the granulo- more probes), 60 of which are part of our IDOL EPIC cyte/neutrophil fractions, where increased accuracy was Salas et al. Genome Biology (2018) 19:64 Page 7 of 14 Fig. 4 Comparison of the longitudinal assessment of cell type proportions and cell ratio changes using DNA methylation data and two different reference L-DMR libraries (EPIC IDOL and 450 K) observed when comparing the total granulocytes vs neu- more extended automatic selection approach, the global trophil estimates compared to the total neutrophils vs results are similar, albeit less precise. Furthermore, neutrophil estimates. In particular, in Fig. 5a, the RMSE although the results were statistically similar using the for neutrophils was 1.91, whereas when we instead used previous 450 K reference library, our results suggest that total granulocytes (the sum of neutrophils, eosinophils, this L-DMR library was less precise in their estimations. and basophils), the RMSE was reduced to 0.92. The increased coverage of the new EPIC array may also include additional important genomic areas for Discussion hematopoiesis and immune cell development. Although Here, we offer a new DNA methylation reference library IDOL relies on an algorithmic approach for the selection from the Illumina EPIC array for six adult blood cell of probes, the output L-DMR library contains probes subtypes. Using artificial mixtures with fixed proportions associated with critical biological pathways in immune of purified cell DNA, this library offers more precise re- development and differentiation. Through its iterative sults in terms of the cell estimations obtained through selection of CpG loci that optimize prediction perform- constrained projection compared to those using the pre- ance across the six leukocyte subtypes, the IDOL vious 450 K reference library. Although the statistical algorithm identifies critically sensitive and specific differ- differences are subtle numerically, the global increased entially methylated sites that populate the final deconvo- precision may help to control the confounding of cell lution library. Included among the cell-specific output subpopulations in studies using adult peripheral blood. are loci that figure prominently in established leukocyte In our approach, we suggest that optimized probe selec- biology, as well as others that are less well described. tion using IDOL may also help to increase precision and Examples of the former are loci that reside in genes pre- reduce noise compared to larger probe lists based solely sented in Fig. 6. BLK (B lymphoid tyrosine kinase) is well on t-statistic ranking of cell-specific hyper- and hypo- established in B-cell antigen receptor signaling and methylated CpGs. Nevertheless, even when using the B-cell development [20]. CD8A (CD8 alpha subunit) is a Salas et al. Genome Biology (2018) 19:64 Page 8 of 14 ab Fig. 5 Comparison of the estimated cell proportions using constrained projection/quadratic programming (CP/QP) versus the FACS measured fraction in EPIC and 450 K platforms. a Whole blood cell samples arrayed using the EPIC platform with known (FACS) fractions for the six main cell subtypes. Cell estimates were obtained using the EPIC IDOL method. b Whole blood cell samples arrayed using the Illumina 450 K platform with known (FACS) fractions for the six main cell subtypes. Cell estimates were obtained using the EPIC IDOL 450 K legacy method. c Five out of 11 observations on the longitudinal dataset run with EPIC had FACS information cell defining co-receptor for cytotoxic T-cell receptor– gene (cytoplasmic linker associated protein 1). CLASP MHC–antigen complex response [21]. Although this proteins are important in microtubule organization and molecule is also expressed in approximately 40% of NK vesical transport [28] but a role for this gene in NK biol- cells [22], the use of this marker in conjunction with ogy has not been described. Further experimentation is other probes help to differentiate this specific cell type. required to elucidate the mechanistic connections be- NFIA (nuclear factor 1 A transcription factor) in concert tween specific DNA methylation events and the functional with miR-223 is a crucial player in the molecular characteristics of diverse immune cells. The present re- circuitry controlling human granulopoiesis [23, 24]. sults demonstrate notable improvements in understanding IDOL also selected loci in the RPTOR gene (regulatory the contributions of immune cell compartments to the associated protein of MTOR) to discriminate CD4T substantial variation in DNA methylation observed in per- cells. Metabolic reprogramming mediated by RPTOR is ipheral human blood. emerging as essential in T helper cell differentiation In non-pathological conditions, less common cell sub- [25]. In previous studies, a different set of CpGs related populations will probably be estimated as part of the clos- to RPTOR have been associated with inflammatory est cell in the cell development hierarchy (as observed markers in CD4T [26]. Interestingly, SLFN5 (Schlafen with the estimates of the neutrophils approximating the factor 5 protein) demethylation strongly delineates total granulocytes in our FACS comparisons). However, a monocytes from neutrophils and other cell types. Schla- limitation of the current approach is its potential vulner- fen family proteins are alpha interferon-inducible growth ability in pathologic conditions wherein other cell types or and cell cycle regulatory proteins but have no known cell transition states may appear in the peripheral blood. function in monocyte biology [27]. Finally, a notable The accuracy of reference-based cell deconvolution ap- NK-specific locus was uncovered within the CLASP1 proaches can potentially be affected by the presence of cell Salas et al. Genome Biology (2018) 19:64 Page 9 of 14 Fig. 6 Examples of critical CpGs for cell deconvolution selected by IDOL populations that are unaccounted for in cell reference libraries and optimization procedures would need to be libraries. One example is the presence of nucleated red applied to identify and minimize estimation bias arising blood cells in cord blood samples; this heterogeneous from an additional cell population. These limitations are group of erythroid cells shows a characteristic unmethy- not unique to epigenetic approaches as conventional FACS lated pattern, previous to enucleation [29], that could itself requires prior knowledge of cell characteristics for affect the estimates in that specific age group where they accurate cell profiles. are relatively abundant (about 5% of the nucleated cells) This new reference library has the potential to be but disappear in the first 72 h after birth [30–32]. How- widely used in the newest adult peripheral blood EWAS. ever, in normal conditions, independent of the age of the The use of an EPIC-specific reference library will elimin- subject, it is expected that cell types or states without a ate unintended technical differences arising from apply- direct reference methylome will be accounted for as part ing a reference library from a previous generation array of the closest cell subtype with a reference. In fact, in our which may result in residual confounding or critical data we observed how the FACS information confirmed technical defects which go beyond the cell heterogeneity that most of the variability in a longitudinal assessment of a problem when analyzing blood samples. The use of reli- healthy subject was attributable to changes in the cell popu- able cell reference panels is particularly important for lation proportions. The approach described here can ac- longitudinal assessment of cohort datasets. Using infor- commodate additional normal or pathologically related cell mation from the 450 K datasets, it had been shown that types, but as with all deconvolution methods additional cell temporal trends in longitudinal studies were mainly Salas et al. Genome Biology (2018) 19:64 Page 10 of 14 driven by changes in cell composition across time [33– cells were selected using immunomagnetic labeling through 36]. The expected increase in precision using the new two different protocols: 1) for Neu, leukocytes were sepa- reference would be especially important in this context ratedusing HetaSep followedbydensity gradient separation to control for aging-related effects changing specific sub- and neutrophil negative selection; 2) Mono, Bcells, NK, populations of T lymphocytes in which a higher variability CD4T, and CD8T were negatively isolated from untouched is expected when using the previous library. Indeed, this peripheral mononuclear cells using indirect immunomag- library may also find particular utility when additional netic cell labeling systems (CD14, CD19, CD56, CD4T, and subtypes of leukocytes are added to the current library, as CD8T, respectively). is evidenced by the plethora of EPIC-array-specific Twelve artificial mixtures were reconstructed using L-DMR loci discovered in the new analysis. DNA from the specific cell samples. Two different sets of reconstruction mixtures, each with n = 6, were deter- Conclusions mined by randomly generating proportions from a This new EPIC-specific reference library will reduce re- six-component Dirichlet distribution. The first set of re- sidual confounding arising from the use of a reference li- constructed samples (method A samples) used mixtures brary from a previous array generation when analyzing of purified leukocyte subtype DNA in relatively equiva- adult blood samples. The increased precision of using this lent proportions across the six leukocyte subtypes. For new L-DMR library will help in applications where subtle the second set of six samples, the proportions of DNA changes in specific cell subpopulations may lead to higher for each leukocyte subtype were selected to resemble than expected variability, such as longitudinal studies. their relative fractions in the peripheral blood of normal human adult subjects (method B samples). A mixture Methods containing 1.2 μg of total DNA was estimated using the In this work, we extend the available reference library proportions included in Additional file 1:Table S1.The for deconvolution of blood cell proportions using the DNA from the cell sorted samples and those of the artificial EPIC array with the goal of both improving the accuracy mixtures were randomized in the slide slots of the micro- of cell composition estimates and overcoming potential array. Sample DNA (1 μg) was bisulfite converted and proc- technical differences in platforms. Using magnetic sorted essed according to the Illumina protocols at the Vincent J. neutrophils, B cells, monocytes, NK cells, CD4+ T cells, Coates Genomics Sequencing Laboratory at UC Berkeley. and CD8+ T cells, we measured DNA methylation with The raw idat files from the EPIC methylation array the 850 K EPIC DNA methylation array and applied were pre-processed using minfi [19]and EnMIX[39] IDOL to identify optimal L-DMR libraries. We then for quality control using R v.3.4.3 [40]. To assess data compared the performance of cell estimates obtained quality, we used a detection P value of 1E-06, three using the EPIC platform and optimized the L-DMR standard deviations of the mean bisulfite conversion IDOL library to the now unavailable 450 K array in arti- control probe fluorescence signal intensity, and a mini- ficial blood mixtures with predefined cell proportions. mum of three beads per probe. Only 1897 CpGs had a Six MACS-isolated and FACS-verified purity cell sub- detection P > 1E-06 in 5% or more of the samples; how- types (neutrophils (Neu), monocytes (Mono), B lympho- ever, they were not masked in the raw dataset. No sam- cytes (Bcells), T helper lymphocytes (CD4T), T cytotoxic ples were excluded because of low quality. The IDOL lymphocytes (CD8T), and natural killer lymphocytes (NK)) EPIC L-DMR library is freely available in Bioconductor were purchased from AllCells® corporation (Alameda, CA, as the package FlowSorted.Blood.EPIC [37]tobe USA) and STEMCELL technologies (Vancouver, BC, adopted in downstream analyses in current analyses Canada). Cells were isolated from 31 males and 6 females, pipelines. The package contains a RGChannelSet R ob- all anonymous healthy donors. The donors had a mean age ject generated through minfi containing 49 samples and of 32.6 years (range 19–59 years) and an average weight of information on 1,051,815 probes corresponding to 86 kg (range 65–118 Kg) and were negative for HIV, HBV, 866,091 CpGs using the latest annotation release by and HBC. Women were not pregnant at the time of sample Illumina (MethylationEPIC_v-1-0_B4). It is important collection, and samples were collected from donors with no for the reader to note that the cells were purified using history of heart, lung, or kidney disease, asthma, blood an immunomagnetic procedure; the name “FlowSorted” disorders, autoimmune disorders, cancer, or diabetes. All was kept for easy adoption and integration with previ- donors provided written informed consent before dona- ous minfi pipelines. tion. The full phenotype information is available in the FlowSorted.Blood.EPIC package [37] and in the Gene Ex- IDOL algorithm pression Omnibus (GEO; GSE110554) [38]. For a complete description of the IDOL algorithm please Isolation protocols are available through the commercial refer to Koestler et al. [18]. In brief, the IDOL algorithm websites of AllCells and STEMCELL technologies. In brief, utilizes a training dataset consisting of both blood-derived Salas et al. Genome Biology (2018) 19:64 Page 11 of 14 DNA methylation data and measurements of the fraction the estimateCellCounts function contained in the minfi of each of the underlying cell types (e.g., FACs, artificial Bioconductor package [19, 41]. estimateCellCounts is an mixtures of DNA from purified cell types of pre-specified, adaptation of the Houseman et al. CP/QP method [7], in known proportions, etc.) as a means to identify optimal which a raw reference library is combined and normalized reference libraries for cell mixture deconvolution. A series with a target dataset, followed by cell deconvolution. By of t-tests comparing the mean CpG-specific methylation default, this method uses the FlowSorted.Blood.450 K li- between each leukocyte cell type compared to the mean brary derived from the Reinius dataset [13]as the reference methylation across all the other cell types was conducted dataset. Both the reference and target datasets are normal- to identify discriminating CpGs (e.g., L-DMRs) for each ized together using independent type I and type II probe specific cell type. Based on this analysis, CpGs were then quantile normalization [42]. First, the default library used rank-ordered on the basis of their t-statistics and the L/2 for cell mixture deconvolution consists of 600 CpGs, repre- CpGs with the largest and smallest t-statistic for each K senting the top 50 hyper- and hypomethylated CpGs, cell type were identified and pooled. L is a tuning param- rank-ordered based on the t-statistic obtained in compari- eter representing the number of cell-specific L-DMRs and sons of CpG-specific methylation between each cell type was set to L = 150 in our application, consistent with (i.e., CD4T, CD8T, NK, Bcell, Mono, and Neu) and all other Koestler et al. [18]. A candidate L-DMR library containing cell types. We hereafter refer to this approach as automatic the total L*K unique L-DMRs for each cell type forms the selection 450 K. Second, we used the same estimateCell- search space for the IDOL algorithm, from which L-DMR Counts defaults but substituted FlowSorted.Blood.450 K subsets of size < L*K are sequentially and probabilistically with FlowSorted.Blood.EPIC as the underlying reference selected and examined for their prediction accuracy in dataset. Similar to the previous approach, the top 50 hyper- deconvoluting the samples in the training dataset. The and hypomethylated CpGs were identified for each cell type user needs to preselect the library size in order to balance and used to assemble the library for deconvolution consist- accuracy and precision of cell composition estimates. For ing of 600 total CpGs. We hereafter refer to this approach the application of IDOL presented here, we considered li- as automatic selection EPIC. Finally, we used IDOL for braries ranging from 50 to 800 CpGs by increments of 50, probe selection [18]. This approach dynamically scans a as our previous work has shown that libraries ranging candidate set of cell-specific methylation markers to find li- from 300 to 600 CpGs generally yield accurate and reliable braries that optimize the accuracy of cell fraction estimates deconvolution estimates. In the first iteration of the IDOL obtained from cell mixture deconvolution. Library sizes algorithm, all L*K CpGs constituting the candidate library ranging from 50 to 800 CpGs, in increments of 50, were have an equal probability of being selected to be included considered (see IDOL algorithm above for details). The se- in the DMR subset library. Using the randomly assembled lected probes (n = 450, IDOL optimized L-DMR library), DMR library, the constrained projection/quadratic pro- plus the genomic context information, are supplied as gramming approach [7] is applied to obtain cell compos- Additional file 4. Per each cell type the following number ition estimates for each sample in the training dataset. of probes were selected: Bcell 71, CD4T 70, CD8T 82, Using these predictions, the R and RMSE (root mean Mono 72, Neu 73, and NK 82. As both the reference and square error) were calculated for each of the cell types the target were EPIC datasets, we changed the default (Additional file 1: Figure S4), comparing the cell estimates normalization and only used the methylumi-noob back- to their known proportion in each sample. One-by-one ground correction [41] before the cell projection. This last CpGs are removed from the randomly selected DMR li- method is referred to as IDOL selection EPIC. As this brary, followed by computation of R and RMSE based on method is not included within the estimateCellCounts cell composition estimates obtained using a library con- function, we offer a modified function in our package sisting of only the remaining CpGs. This procedure allows named estimateCellCounts2 which allows all the options assessment of the contribution of each CpG in the library already included in the original function plus the use of in terms of its impact on the accuracy of cell composition IDOL-customized probe selection. The estimates of the estimates and, in doing so, provides a basis for modifying three methods were compared against the proportion of the probability of each CpG being selected in subsequent cell DNA included in the mixture (true value); we report IDOL iterations. This process is repeated at each iteration, the R and the RMSE (residual mean standard error) for with the algorithm eventually converging on an “optimal” the three methods. The absolute mean error was calcu- library for deconvolution. Per Koestler et al. [18], we used lated subtracting the estimated proportions from the 500 iterations in our implementation of IDOL. reconstructed (true) fraction spiked in the sample. As a measure of the global variance variability we used a Deconvolution methods Bartlett test for homogeneity of the variances to com- We used three different deconvolution methods to assess pare the three methods. As a sensitivity analysis we the performance of the new reference library. First, we used compared the results of the EPIC IDOL library using Salas et al. Genome Biology (2018) 19:64 Page 12 of 14 CP/QP (minfi method) versus two additional deconvo- were carried out on whole blood provided by six healthy lution methods: 1) CBS-CIBERSORT, a support vector blood donors using established methods as described in machine non-constrained projection; and 2) RPC (ro- Accomando et al. [46]. In brief, the gating strategy bust partial correlation), a linear non-constrained pro- included counting total leukocytes using CD45(+), granu- jection using the methods described by Teschendorff et locytes based on CD16 and CD15, monocytes based on al. and available in the R package EpiDISH [14]. We CD14, total T lymphocytes marked as CD3(+), CD4T used a paired t-test to compare the true values (fraction were CD3(+) CD4(+), CD8T were CD3(+) CD8(+), B lym- of cell DNA in the artificial mixture) versus the esti- phocytes marked as CD19(+), and NK as CD56(+). We mates obtained by the three deconvolution methods. also compared the performance of our estimations against six additional samples with FACS information available in Longitudinal dataset GEO which were arrayed using the Illumina Human- A repeated measurement dataset (GSE110530) [43]was Methylation450k array (GSE77797) [44]. Finally, for five of from a male adult volunteer who provided 12 samples of the 11 samples of our longitudinal dataset analyzed with blood distributed over a period of 350 days from March the EPIC platform, we had partial FACS information. In 2011 to February 2012. The DNA was extracted from this latter dataset we show the CD3(−) lymphocyte frac- whole blood within 24 h of sampling and archived at tion as the sum of Bcell and NK. DNA isolated from − 80 °C. Total input DNA of 0.75 μg, as measured by donor blood was stored at − 80 °C for approximately Quant-iT Picogreen dsDNA Assay (Invitrogen, Carls- 6 years before being assayed on the EPIC array. Data are bad, CA), was prepared for each time point. Samples available in the GEO (GSE110530) [43]. were randomized across the slots of the microarray. Bisulfite conversion and processing were performed Test for enrichment according to Illumina protocols using the IlluminaHu- The IDOL L-DMR library was tested for enrichment manMethylationEPIC array at the Vincent J. Coates Gen- using the GO database version 3.5.0 with date 11/08/ omics Sequencing Laboratory at UC Berkeley. During 2017, and the immune curated GSEA (set 7) version 6.1 quality control, one of the samples (time point 11) showed using missMethyl to correct for array bias [47]. Only a different SNP content pointing to a potential sample those pathways containing more than ten probes of the mix-up and was excluded from this analysis. We estimated L-DMR library and pathways with less than 2000 genes the cell composition using the 450 K L-DMR and the were selected for this analysis. Pathways with a false EPIC IDOL L-DMR methods and compared the mean dif- discovery rate < 0.05 were considered statistically significant. ference and homogeneity of the variance of the estimates between both methods using t-test and the Bartlett test. Additional files Extension to whole blood samples and application for legacy Additional file 1: Figure S1. Estimated cell purity by flow cytometry per cell type. Figure S2. Heatmap based on a hierarchical cluster of 450 K datasets purified cell types and cell mixtures based on the array SNPs. Figure S3. As a potential extension and validation of our algorithm, Association between the top 20 principal components and potential we used a public dataset (GSE77797) [44]containing12 confounders for DNA methylation. Figure S4. Iterative testing of different L-DMR library sizes using the IDOL optimization algorithm. samples of artificial mixtures and six whole blood samples Table S1. Cell composition percentages for the artificial reconstruction with known flow-sorted fractions for the main six cell samples. Figure S5. Comparison of several probe selection methods and fractions arrayed using the Illumina HumanMethyla- estimated cell proportions using constrained projection/quadratic programming (CP/QP) versus the reconstructed (true) DNA fraction in tion450k platform. We optimized an EPIC IDOL 450 K the artificial DNA mixtures. Figure S6. Bland-Altman plots comparing legacy L-DMR library using the same procedure described the mean differences between the estimated cell fraction using three for the EPIC IDOL L-DMR above. The resulting library deconvolution methods and the true fraction in the artificial mixture per cell type. Figure S7. Comparison of the estimated cell proportions contained 350 probes present on the previous 450 K Illu- using CP/QP using an IDOL-optimized library restricted to the Illumina mina DNA methylation array generation (Additional file 5). HumanMethylation450K k array versus the reconstructed (true) DNA fraction The estimated cell composition using this EPIC IDOL in the artificial DNA mixtures arrayed in the 450 k platform. (PDF 618 kb) 450 K legacy L-DMR library was compared against the Additional file 2: Gene Ontology enrichment of the probes contained in the L-DMR IDOL library. (CSV 28 kb) reconstructed fraction or the FACS measured fraction. Additional file 3: GSEA enrichment using the curated set 7 (immune We report the R and the RMSE for the artificial mixtures profiles) of the probes contained in the L-DMR IDOL library. (CSV 13 kb) and the RMSE for the FACS measured samples. Additional file 4: L-DMR IDOL library. (CSV 113 kb) Additional file 5: L-DMR IDOL 450 K legacy library. (CSV 88 kb) Validation using samples with FACS information Three independent datasets were used for validation. We Abbreviations ran six samples of healthy donors with FACS information Bcells: B lymphocytes; CD4T: CD4+ T cells; CD8T: CD8+ T cells; CP/QP: Constrained using the EPIC platform (GSE112618) [45]. FACS analyses projection/ quadratic programming; EWAS: Epigenome-wide association studies; Salas et al. Genome Biology (2018) 19:64 Page 13 of 14 FACS: Fluorescence activated cell sorting; IDOL: Identifying Optimal Received: 23 February 2018 Accepted: 8 May 2018 Libraries; L-DMR: Leukocyte differentially methylated regions; Mono: Monocytes; Neu: Neutrophils; NK: Natural killer cells; NLR: Neutrophil to lymphocyte ratio; R : Coefficient of determination; RMSE: Root mean square error References 1. Breton CV, Marsit CJ, Faustman E, Nadeau K, Goodrich JM, Dolinoy DC, et al. Acknowledgements Small-magnitude effect sizes in epigenetic end points are important in This work used the Vincent J. Coates Genomics Sequencing Laboratory at UC children’s environmental health studies: the Children’s Environmental Health Berkeley. and Disease Prevention Research Center’s Epigenetics Working Group. Environ Health Perspect. 2017;125:511–26. 2. Christensen BC, Houseman EA, Marsit CJ, Zheng S, Wrensch MR, Wiemels JL, et al. Funding Aging and environmental exposures alter tissue-specific DNA methylation This work was supported by NIH grants (R01CA52689 and P50CA097257 to dependent upon CpG island context. PLoS Genet. 2009;5:e1000602. J.K. Wiencke; R01CA207110 to K.T. Kelsey; R01DE022772, P20GM104416–8189, 3. Levenson VV. DNA methylation as a universal biomarker. Expert Rev Mol and R01CA216265 to B.C. Christensen). The Robert Magnin Newman Diagn. 2010;10:481–8. endowment for Neuro-oncology (JKW). This work was also supported by the 4. Houseman EA, Kim S, Kelsey KT, Wiencke JK. DNA methylation in whole Kansas IDeA Network of Biomedical Research Excellence Bioinformatics Core, blood: uses and challenges. Curr Environ Heal Rep. 2015;2:145–54. supported in part by the National Institute of General Medical Science 5. Titus AJ, Gallimore RM, Salas LA, Christensen BC. Cell-type deconvolution (NIGMS) Award P20GM103418. from DNA methylation: a review of recent applications. Hum Mol Genet. 2017;26:R216–24. 6. Teschendorff AE, Zheng SC. Cell-type deconvolution in epigenome-wide Availability of data and materials association studies: a review and recommendations. Epigenomics. 2017;9:757–68. The datasets generated and/or analyzed during the current study are available 7. Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, in the superSeries GSE110555 in the GEO (https://www.ncbi.nlm.nih.gov/geo/ Nelson HH, et al. DNA methylation arrays as surrogate measures of cell query/acc.cgi?acc=GSE110555)[48]. The specific accession codes are GSE110554 mixture distribution. BMC Bioinformatics. 2012;13:86. (FlowSorted.Blood.EPIC) [38], GSE110530 (longitudinal dataset) [43], and 8. Houseman EA, Kelsey KT, Wiencke JK, Marsit CJ. Cell-composition effects in GSE112618 (validation FACS whole blood samples) [45]. The additional the analysis of DNA methylation array data: a mathematical perspective. validation set including artificial mixtures and FACS whole blood cell BMC Bioinformatics. 2015;16:95. fractions using Illumina HumanMethylation450k is available under the 9. Zheng SC, Beck S, Jaffe AE, Koestler DC, Hansen KD, Houseman AE, et al. accession number GSE77797 [44]. The R package FlowSorted.Blood.EPIC Correcting for cell-type heterogeneity in epigenome-wide association is available in Bioconductor (https://bioconductor.org/packages/ studies: revisiting previous analyses. Nat Methods. 2017;14:216–7. FlowSorted.Blood.EPIC) and the original source code is available 10. Guo S, Diep D, Plongthongkum N, Fung H-L, Zhang K, Zhang K. through https://github.com/immunomethylomics/FlowSorted.Blood.EPIC Identification of methylation haplotype blocks aids in deconvolution of (under licenseGPL-3.0). Forreproducibility thesourcecodehas also heterogeneous tissue samples and tumor tissue-of-origin mapping from been deposited on Zenodo (doi: https://doi.org/10.5281/zenodo.1241199 plasma DNA. Nat Genet. 2017;49:635–42. for the package and doi: https://doi.org/10.5281/zenodo.1243840 for the 11. Koestler DC, Usset J, Christensen BC, Marsit CJ, Karagas MR, Kelsey KT, et al. scripts for the figures and tables) [37, 49–51]. DNA methylation-derived neutrophil-to-lymphocyte ratio: an epigenetic tool to explore cancer inflammation and outcomes. Cancer Epidemiol Biomark Authors’ contributions Prev. 2017;26:328–38. The original idea was proposed by LAS, DCK, JKW, KTK, and BCC. RAB and 12. Wiencke JK, Koestler DC, Salas LA, Wiemels JL, Roy RP, Hansen HM, et al. HMH performed the DNA extractions and DNA reconstruction experiments. Immunomethylomic approach to explore the blood neutrophil lymphocyte LAS and DCK contributed to the processing and bioinformatic analyses of ratio (NLR) in glioma survival. Clin Epigenetics. 2017;9:10. the paper. All authors participated in the interpretation of data for the work. 13. Reinius LE, Acevedo N, Joerink M, Pershagen G, Dahlén SE, Greco D, et al. LAS, DCK, and BCC were responsible for the initial draft of the work. All Differential DNA methylation in purified human blood cells: Implications for authors participated in final drafting and critical revision for important cell lineage and studies on disease susceptibility. PLoS One. 2012;7:e41361. intellectual content. All authors read and approved the final manuscript. 14. Teschendorff AE, Breeze CE, Zheng SC, Beck S. A comparison of reference- based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies. BMC Bioinformatics. 2017;18:105. Ethics approval and consent to participate 15. Goode DK, Obier N, Vijayabaskar MS, Lie-A-Ling M, Lilly AJ, Hannah R, et al. Cells used in these experiments were obtained commercially. All donors are Dynamic gene regulatory networks drive hematopoietic specification and anonymous. All the subjects provided written informed consent before differentiation. Dev Cell. 2016;36:572–87. donation to the commercial houses which provided the commercial cells. 16. Zhou W, Laird PW, Shen H. Comprehensive characterization, annotation and innovative use of Infinium DNA methylation BeadChip probes. Nucleic Acids Res. 2017;45:e22. Competing interests 17. Logue MW, Smith AK, Wolf EJ, Maniates H, Stone A, Schichman SA, et al. The authors declare that they have no competing interests. The correlation of methylation levels measured using Illumina 450K and EPIC BeadChips in blood samples. Epigenomics. 2017;9:1363–71. 18. Koestler DC, Jones MJ, Usset J, Christensen BC, Butler RA, Kobor MS, et al. Improving cell mixture deconvolution by identifying optimal DNA Publisher’sNote methylation libraries (IDOL). BMC Bioinformatics. 2016;17:120. Springer Nature remains neutral with regard to jurisdictional claims in 19. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, published maps and institutional affiliations. et al. Minfi: A flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363–9. Author details Department of Epidemiology, Geisel School of Medicine, Dartmouth 20. Saijo K, Schmedt C, Su I-H, Karasuyama H, Lowell CA, Reth M, et al. Essential College, Lebanon, NH, USA. Department of Biostatistics, University of Kansas role of Src-family protein tyrosine kinases in NF-kappaB activation during B Medical Center, Kansas City, KS, USA. Departments of Epidemiology and cell development. Nat Immunol. 2003;4:274–9. Pathology and Laboratory Medicine, Brown University, Providence, RI, USA. 21. Miceli MC, Parnes JR. Role of CD4 and CD8 in T cell activation and Department of Neurological Surgery, Institute for Human Genetics, differentiation. Adv Immunol. 1993;53:59–122. University of California San Francisco, San Francisco, CA, USA. Departments 22. Addison EG, North J, Bakhsh I, Marden C, Haq S, Al-Sarraj S, et al. Ligation of of Molecular and Systems Biology, and Community and Family Medicine, CD8alpha on human natural killer cells prevents activation-induced Geisel School of Medicine, Dartmouth College, Lebanon, NH, USA. apoptosis and enhances cytolytic activity. Immunology. 2005;116:354–61. Salas et al. Genome Biology (2018) 19:64 Page 14 of 14 23. Fazi F, Rosa A, Fatica A, Gelmetti V, De Marchis ML, Nervi C, et al. A 44. Koestler DC, Christensen BC, Wiencke JK, Kelsey KT. GSE77797: DNA minicircuitry comprised of microRNA-223 and transcription factors NFI-A methylation profiling of whole blood and reconstructed mixtures of purified and C/EBPalpha regulates human granulopoiesis. Cell. 2005;123:819–31. leukocytes isolated from human adult blood. Gene Expression Omnibus. 24. Vian L, Di Carlo M, Pelosi E, Fazi F, Santoro S, Cerio AM, et al. Transcriptional 2016. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77797. fine-tuning of microRNA-223 levels directs lineage choice of human Accessed 4 May 2018. hematopoietic progenitors. Cell Death Differ. 2014;21:290–301. 45. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. GSE112618: FACS validation dataset: An optimized library for reference- 25. Yang K, Shrestha S, Zeng H, Karmaus PWF, Neale G, Vogel P, et al. T based deconvolution of whole-blood biospecimens assayed using the cell exit from quiescence and differentiation into Th2 cells depend on Illumina HumanMethylationEPIC BeadArray (III). Gene Expression Omnibus. Raptor-mTORC1-mediated metabolic reprogramming. Immunity. 2013; 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE112618. 39:1043–56. Accessed 4 May 2018. 26. Yusuf N, Hidalgo B, Irvin MR, Sha J, Zhi D, Tiwari HK, et al. An epigenome- 46. Accomando WP, Wiencke JK, Houseman EA, Nelson HH, Kelsey KT. wide association study of inflammatory response to fenofibrate in the Quantitative reconstruction of leukocyte subsets using DNA methylation. Genetics of Lipid Lowering Drugs and Diet Network. Pharmacogenomics. Genome Biol. 2014;15:R50. 2017;18:1333–41. 47. Phipson B, Maksimovic J, Oshlack A. missMethyl: an R package for 27. Puck A, Aigner R, Modak M, Cejka P, Blaas D, Stöckl J. Expression and analyzing data from Illumina’s HumanMethylation450 platform. regulation of Schlafen (SLFN) family members in primary human monocytes, Bioinformatics. 2016;32:286–8. monocyte-derived dendritic cells and T cells. Results Immunol. 2015;5:23–32. 48. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. 28. Stehbens SJ, Paszek M, Pemble H, Ettinger A, Gierke S, Wittmann T. CLASPs GSE110555: SuperSeries: an optimized library for reference-based deconvolution link focal-adhesion-associated microtubule capture to localized exocytosis of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC and adhesion site turnover. Nat Cell Biol. 2014;16:561–73. BeadArray. Gene Expression Omnibus. 2018. https://www.ncbi.nlm.nih.gov/geo/ 29. de Goede OM, Lavoie PM, Robinson WP. Characterizing the query/acc.cgi?acc=GSE110555. Accessed 4 May 2018. hypomethylated DNA methylation profile of nucleated red blood cells from 49. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. cord blood. Epigenomics. 2016;8:1481–94. FlowSorted.Blood.EPIC. GitHub. 2018. https://github.com/ 30. de Goede OM, Razzaghian HR, Price EM, Jones MJ, Kobor MS, Robinson WP, immunomethylomics/FlowSorted.Blood.EPIC. Accessed 4 May 2018. et al. Nucleated red blood cells impact DNA methylation and expression 50. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. analyses of cord blood hematopoietic cells. Clin Epigenetics. 2015;7:95. Immunomethylomics/FlowSorted.Blood.EPIC: FlowSorted.Blood.EPIC v.0.99.36. 31. Bakulski KM, Feinberg JI, Andrews SV, Yang J, Brown S, L McKenney S, et al. Zenodo. 2018. https://doi.org/10.5281/ZENODO.1241200. Accessed 4 May 2018. DNA methylation of cord blood cell types: Applications for mixed cell birth 51. Salas LA. v.1.0 immunomethylomics/Analysis_FlowSorted.Blood.EPIC: analysis studies. Epigenetics. 2016;11:354–62. scripts. 2018. https://doi.org/10.5281/zenodo.1243840. Accessed 4 May 2018. 32. Gervin K, Page CM, Aass HCD, Jansen MA, Fjeldstad HE, Andreassen BK, et al. Cell type specific DNA methylation in cord blood: a 450K-reference dataset and cell count-based validation of estimated cell type composition. Epigenetics. 2016;2294:00. 33. Shvetsov YB, Song M-A, Cai Q, Tiirikainen M, Xiang Y-B, Shu X-O, et al. Intraindividual variation and short-term temporal trend in DNA methylation of human blood. Cancer Epidemiol Biomark Prev. 2015;24:490–7. 34. Urdinguio RG, Torró MI, Bayón GF, Álvarez-Pitti J, Fernández AF, Redon P, et al. Longitudinal study of DNA methylation during the first 5 years of life. J Transl Med. 2016;14:160. 35. Tan Q, Heijmans BT, Hjelmborg JVB, Soerensen M, Christensen K, Christiansen L. Epigenetic drift in the aging genome: a ten-year follow-up in an elderly twin cohort. Int J Epidemiol. 2016;45:1146–58. 36. Kananen L, Marttila S, Nevalainen T, Kummola L, Junttila I, Mononen N, et al. The trajectory of the blood DNA methylome ageing rate is largely set before adulthood: evidence from two longitudinal studies. Age (Dordr). Age. 2016;38:65. 37. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. FlowSorted.Blood.EPIC. Bioconductor. 2018. https://bioconductor.org/ packages/FlowSorted.Blood.EPIC, https://doi.org/10.18129/B9.bioc. FlowSorted.Blood.EPIC. Accessed 4 May 2018. 38. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. GSE110554: FlowSorted.Blood.EPIC: An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray (II). Gene Expression Omnibus. 2018. https:// www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110554. [cited 2018 May 4]. 39. Xu Z, Niu L, Li L, Taylor JA. ENmix: a novel background correction method for Illumina HumanMethylation450 BeadChip. Nucleic Acids Res. 2016;44:e20. 40. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2017. 41. Fortin J-P, Triche TJ, Hansen KD. Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array with minfi. Bioinformatics. 2017;33:558–60. 42. Touleimat N, Tost J. Complete pipeline for Infinium(®) Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics. 2012;4:325–41. 43. Salas LA, Koestler DC, Butler RA, Hansen HM, Wiencke JK, Kelsey KT, et al. GSE110530: Longitudinal dataset: An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray (I). Gene Expression Omnibus. 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc= GSE110530. Accessed 4 May 2018.

Journal

Genome BiologySpringer Journals

Published: May 29, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off