Motif signatures in stretch enhancers are enriched for disease-associated genetic variants

Motif signatures in stretch enhancers are enriched for disease-associated genetic variants Background: Stretch enhancers (SEs) are large chromatin-defined regulatory elements that are at least 3,000 base pairs (bps) long, in contrast to the median enhancer length of 800 bps. SEs tend to be cell-type specific, regulate cell- type specific gene expression, and are enriched in disease-associated genetic variants in disease-relevant cell types. Transcription factors ( TFs) can bind to enhancers to modulate enhancer activity, and their sequence specificity can be represented by motifs. We hypothesize motifs can provide a biological context for how genetic variants contribute to disease. Results: We integrated chromatin state, gene expression, and chromatin accessibility [measured as DNase I Hyper- sensitive Sites (DHSs)] maps across nine different cell types. Motif enrichment analyses of chromatin-defined enhancer sequences identify several known cell-type specific “master” factors. Furthermore, de novo motif discovery not only recovers many of these motifs, but also identifies novel non-canonical motifs, providing additional insight into TF binding preferences. Across the length of SEs, motifs are most enriched in DHSs, though relative enrichment is also observed outside of DHSs. Interestingly, we show that single nucleotide polymorphisms associated with diseases or quantitative traits significantly overlap motif occurrences located in SEs, but outside of DHSs. Conclusions: These results reinforce the role of SEs in influencing risk for diseases and suggest an expanded regu- latory functional role for motifs that occur outside highly accessible chromatin. Furthermore, the motif signatures generated here expand our understanding of the binding preference of well-characterized TFs. Background (referred to as SEs in this paper) [3]. By our definition, Chromatin immunoprecipitation combined with high- SEs have a length of at least 3,000 DNA base pairs (bps) throughput sequencing (ChIP-seq) can identify the and are much larger than typical enhancers (TEs), which genome-wide locations of target proteins, including tran- we defined to be any enhancer less than or equal to the scription factors (TFs), RNA Polymerase II, and cova- median enhancer length of 800 bps. SEs are generally cell lently modified histones [ 1]. ChIP-seq datasets for several type specific, associated with increased cell-specific gene chromatin marks and the sequence-specific factor CTCF expression, and tend to harbor disease-relevant genetic can be computationally integrated to discover combina- variants derived from genome-wide association stud- torial and spatial patterns that produce a consistent anno- ies (GWASs). In our previous paper [3], we showed that tation of promoter, enhancer, insulator, transcribed, and enrichment for GWAS variants increases with the length repressed chromatin states. In a recent study, we profiled of enhancers, but we did not try to define the precise the chromatin states of several cell lines using Chrom- relationship of GWAS variants to motifs located within HMM [2], which revealed the presence of large gene the enhancers—that is the goal of this paper. SEs share control elements that we designated “stretch enhancers” several traits with another recently defined class of large enhancers designated super-enhancers by Young and colleagues [4, 5]. Like SEs, super-enhancers drive cell- *Correspondence: scjp@umich.edu type-specific gene expression; however, super-enhancers Stephen CJ Parker and Francis S Collins contributed equally Departments of Computational Medicine and Bioinformatics have been defined by the disproportionate abundance of and Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA Mediator or histone 3 lysine 27 acetylation (H3K27ac) Full list of author information is available at the end of the article © 2015 Quang et al. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 2 of 14 signal [6], whereas SEs are defined by patterns of histone Of the ten cell types profiled, nine also have DNase-seq modifications, or chromatin states. Enhancers can func - data available. DNase-seq is a method used to identify tion independently of their endogenous spatial contexts, the genome-wide locations of DHSs, which are regions which is a property exploited in luciferase assays to meas- of the genome that are highly sensitive to cleavage by ure enhancer activity by placing enhancers upstream DNase I and mark regulatory elements such as enhancers of reporter genes. This suggests that the information and promoters [8]. DHSs generally mark regions that are required for the enhancer activity is encoded in the more accessible for TF binding and are enriched for TF underlying DNA sequences. We, therefore, hypothesize binding motifs. that the sequence content of the enhancers can provide Figure  1a displays the accessible chromatin, chromatin additional insight into the relationship between enhancer state, and gene expression profiles across the nine cell function and enhancer length, which we address by stud- types around the ABCC8 locus. Gene expression RNA- ying how SEs differ from TEs. seq [9] tracks clearly show that the ABCC8 transcript is Enhancer sequences are known to be enriched for exclusively expressed in the islet sample, and the chroma- transcription factor binding sites, which contribute to tin state tracks show this gene body is covered by several enhancer activity. Upon binding to their recognition islet-specific SEs. This integrative approach can identify motifs, some TF proteins can form complexes with other cell-specific chromatin and expression profiles to provide proteins, which can alter the 3D conformation of chro- a basis for further understanding the functional effects of matin and recruit RNA Polymerase II to promote the SNPs in common, complex diseases. In the ABCC8 exam- transcription of target genes located in cis, sometimes ple, lead type 2 diabetes (T2D) GWAS SNPs (red arrow at considerable distances. The motif, or sequence bind - heads) and several linked SNPs (r   ≥  0.8) (green bars) ing specificity, for a TF can be represented as a posi - overlap islet-specific chromatin states. Enhancers have tion weight matrix (PWM) that specifies the nucleotide been shown to overlap multiple linked SNPs more often frequency at each position along the binding sequence. than expected at random [3, 10], suggesting that multiple Recently, less complex nucleotide patterns—like dinu- enhancer variants work together in concert to alter gene cleotide repeats—were shown to contribute to enhancer expression and contribute to disease susceptibility. activity [7]. From the ABCC8 example, we can infer some properties In this study, we analyze the motif signatures of SEs about the interplay between chromatin states and DNase and how they differ from those of TEs. We scan enhancer I hypersensitivity in SEs. First, we notice that the SEs sequences using known motifs from databases to iden- encompass several DHSs, but the entire SE does not dis- tify TFs that are characteristic of each cell type studied, play DNase hypersensitivity. Unsurprisingly due to their which we are also able to recover using de novo motif length, SEs contain proportionately more DHSs than TEs discovery. We investigate how GWAS single nucleotide do (Additional file  2: Figure S1). Second, strong enhancer polymorphisms (SNPs) can affect TF motif signatures states in islets in ABCC8, for example (Figure  1a), tend in SEs, which can provide important clues about how to overlap spikes in the islet DNase-seq signal. By aggre- genetic variations contribute to disease risk. gating the DNase-seq tag density relative to DNase-seq peak summits in either SEs or TEs, we find there is rela - Results and discussion tively little difference in DNase I hypersensitivity between Systematic chromatin state, DNase hypersensitivity, the classes of enhancers (Figure  1b). However, there is a and gene expression profiling across nine diverse cell large difference in the H3K27ac ChIP-seq signal (Fig - types ure  1c). Notably, there is a dip in the H3K27ac signal in In our previous study, we used the ChromHMM algo- the center of the DNase peak summit, which is a reflec - rithm to systematically integrate ChIP-seq histone tion of bound TFs that can displace histones. This TF- modification and CTCF datasets and uniformly profile induced configuration of chromatin architecture is much chromatin states across ten diverse cell types. These more pronounced in SEs than it is in TEs. Of relevance, ChromHMM segmentations are used to profile SEs, this H3K27ac dip feature recently provided the central which are defined as regions of at least 3,000 bps con - motivation for a novel computational algorithm to detect taining contiguous segments marked as enhancer states. factor binding sites [11]. Given the pronounced dip in SEs TEs are defined similarly, but are less than or equal to versus TEs, this recent computational algorithm likely 800 bps in length. Although SEs only constitute the top picked up mostly on SE-mediated factor binding sites. 10% of enhancers in terms of length, they represent a In generating the signal histogram plots, we accounted disproportionately large percentage of the total number for the minor difference in mappability between SEs of nucleotides among all enhancers (Additional file  1: and TEs. Generally, SEs are slightly more mappable than Table S1). TEs (Additional file  2: Figure S2), which suggests that SE Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 3 of 14 Figure 1 Systematic profiling of DNase I hypersensitivity, chromatin states, and gene expression across nine cell types. a Chromatin states and DNase I hypersensitivity density tracks in and around the ABCC8 locus. Human pancreatic islet chromatin states are similar to some of the other ENCODE cell types at the commonly expressed flanking gene NCR3LG1 and unique at the islet-specific expressed gene ABCC8. (Upper) DNase I Hypersensitivity Density Signal from ENCODE/Duke in dense format for each of nine human cell types (islets; GM12878, lymphoblastoid cells; H1 ES, embryonic stem cells; HepG2, hepatocellular carcinoma; HMEC, mammary epithelial cells; HSMM, smooth muscle myoblasts; HUVEC, umbilical vein endothelial cells, K562, erythroleukemia cells, NHEK, keratinocytes). Density graphs (wiggles) of signal enrichment calculated using F-Seq are displayed in grayscale. Scale is from 0 to 0.1. (Middle) ChromHMM-defined chromatin states. Chromatin state assignments are indicated in the top- leftmost key. (Lower) RNA-seq-based expression for each cell type is measured in reads per million mapped reads (RPM) per base pair. Scale is from 0 to 2 for each cell type. Index and tightly linked (r ≥ 0.8) SNPs associated with T2D are indicated in green in the T2D GWAS SNPs track and primarily reside in islet-specific SEs. Index SNPs rs5215 and rs5219 are marked with red arrows. The black box highlights a portion of the SE previously shown to direct tissue-specific expression patterns in a spatial and temporal manner in vivo [3]. All processed results are browsable and downloadable at http://research.nhgri.nih.gov/manuscripts/Collins/isletchromatin/. b–d Aggregate DNase-seq tag density (b), H3K27ac ChIP-seq tag density, (c), and (d) CG dinucleotide frequency profiles of 3kbp sequences centered on DNase-seq peak summits—the location of the highest DNase-seq signal— −100 located within SEs or TEs. e DHSs within SEs are much closer together than they are within TEs (two-tailed Wilcoxon rank sum test, p < 10 for all cell types). Boxplots show the distance, for each SE or TE DNase-seq summit, to the nearest SE or TE DNase-seq summit, respectively, for all cell types. −5 f DHSs within SEs are moderately longer than DHSs within TEs (two-tailed Wilcoxon rank sum test, p < 10 for all cell types). Boxplots show the length of each DHS whose summit overlaps a SE or TE. Boxplot whiskers extend to 1.5× the interquartile range and outliers are shown as block dots, but the y-axis is truncated so that the boxplots can remain in view. Enhancer classes for b–f are indicated in the top-rightmost key. sequences are less repetitive and more complex than TEs. after accounting for differences in mappability, there is Such sequence composition differences could be charac - still a large statistical difference in H3K27ac signal across teristic of the two enhancer classes. Nevertheless, even the length of the enhancers (Additional file 2 : Figure S3). Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 4 of 14 To examine functional and sequence composition dif- enrichment of its binding motif in SE DHS sequences ferences between the two enhancer classes more compre- across the nine cell types. For instance, HNF1A, an acti- hensively, we plotted the average Genomic Evolutionary vator highly expressed in the liver, shows a very strong Rate Profiling (GERP) score [12] (Additional file  2: Figure positive relationship between its expression across the S4), a metric that estimates position-specific evolution - nine cell types and the enrichment of its binding motif ary constraint across a multi-species sequence alignment, in SE DHS sequences (Figure  2a). In contrast, repres- and dinucleotide frequencies (Figure 1d, Additional file  2: sors such as GFI1 tend to have a negative correlation Figure S5) at each position relative to DNase-seq peak (Figure  2b). These observations reflect the general con - summits in either SEs or TEs. Generally, SEs and TEs cept of TF binding cooperativity on the DNA scaffold show similar patterns of evolutionary constraint around whereby increases in the number of binding sites results DNase-seq peak summits. GERP scores are highest at the in increased enhancer activity [14, 15]. We hypothesize summit and decrease as one moves away from the sum- that enhancer sequences are organized in a way such mit. At the summits, SEs have a slightly higher average that an increase in expression of the relevant activator is GERP score than TEs. SE and TE sequences are par- accommodated by an increase in available binding sites, ticularly different in terms of their CG dinucleotide fre - while reducing potential binding sites for any present quencies: both SEs and TEs display a large spike in CG repressors. To explore this idea more comprehensively, frequency in the DHS summits, but TE sequences are we studied the distribution of motif enrichment versus much more CG-rich and overlap CpG islands more often gene expression correlations across motif-TF pairs for SE (Additional file 2: Figure S6). Notably, TFs tend to bind to DHS sequences, as well as SE non-DHS, TE DHS, and TE CG-rich DNase I accessible regions, but the CG richness non-DHS sequences (Figure  2c). All four observed dis- of some motifs do not account for this spike in CG fre- tributions (red) display a significant positive correlation quency [13]. The variation in dinucleotide frequencies is bias relative to the null expectation (blue), indicating that largely a function of where these enhancers are located in SEs and TEs have well-organized motif architecture both the genome relative to gene models (Additional file  2: Fig- inside and outside of DHSs. To determine whether the ure S7). When restricted to transcription start site (TSS) relative motif enrichments are the same across different distal regions, the dinucleotide differences are mitigated, enhancer regions, we examine the correlation between but the difference in H3K27ac ChIP-seq signal persists a motif ’s enrichment in one genomic region against its (Additional file  2: Figure S8). In some of these histogram enrichment in a different genomic region (e.g. CTCF plots, we note that the differences can persist several motif enrichment in SE DHS sequences across cell types hundred base pairs away from the center. One possible versus its enrichment in TE DHS sequences across cell reason for this phenomenon is the proximity between types). We then measure the distribution of correlations DHSs in SEs. Although SEs are much longer than TEs, between enrichments for all motifs in different pairs of DHSs in SEs are still spaced comparatively close together regions (Figure  2d). Despite the difference in absolute (Figure 1e). DHSs in SEs are also significantly longer than motif enrichments in different enhancer regions, we DHSs in TEs (Figure  1f ), which suggests that individual find that the relative motif enrichments are strongly pre - DHSs in SEs can accommodate more TF binding sites served. Together, these findings indicate that enhancer than DHSs in TEs can. motif architecture is linked to TF gene expression and preserved within and outside DHSs in SEs and TEs. Enhancer sequences are enriched for known motifs Next, we investigate how motif enrichments vary Different cell types are regulated by sets of TFs that are across different motifs, cell types, and enhancer regions. important for establishing and maintaining cell identity. We perform agglomerative clustering on the log fold Enhancers are expected to be enriched for motifs that enrichment in SE DHS sequences of activator motifs that serve as putative TF binding sites. We hypothesize that are significantly and differentially enriched in enhancer DHSs in enhancers should be especially enriched for sequences, in order to group motifs in an unsupervised motif sites, because these regions are more accessible fashion (“Methods”). This analysis results in clusters of for protein–DNA interactions. In particular, enhancers motifs that are grouped corresponding to TF families would be expected to be bound by a specific class of TFs known to play important roles in the cell types consid- called activators, which increase gene transcription. ered, which we visualize with heatmaps of motif enrich- To identify activator motifs, we searched for a rela- ment across four different enhancer categories (Figure  3, tionship between motif enrichment and gene expres- Additional file  2: Figure S9). The heatmaps highlight sion, similar in nature to a previous approach [2], and the relative motif enrichments in enhancers between demonstrated that the gene expression of activators (by and within the cell types, capturing known biologically mRNA quantification) correlates positively with the relevant cell-TF associations. For example, the GATA Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 5 of 14 ab cd Figure 2 Motif enrichments in enhancers are correlated with the expression of the TFs that bind to these motifs. a, b TFs’ expressions are correlated against the enrichment of their respective binding motifs in SE DHS sequences across the nine cell types. The master activator HNF1A (a) has a posi- tive relationship between gene expression and motif enrichment, while the repressor GFI1 (b) has a negative relationship. The name of the motif, Spearman’s correlation (ρ), and sequence logo of the motif are displayed in the corner of the plots. c Boxplots of Spearman’s correlations of enrich- ments in four different regions for all database motifs against gene expression in the nine cell types. d Boxplots of Spearman’s correlations of motif enrichments between two different regions (listed in the strip titles at the top of the facets). For each motif and each pair of regions, we computed the enrichments of the motif in both sets of sequences and then computed the Spearman’s correlation between the two sets of enrichments. Within each facet, the boxplot of Spearman’s correlations (red, left) is also displayed alongside a boxplot of a null distribution (blue, right) generated by recalculating the Spearman’s correlations after shuffling cell assignments for one of the variables. P-values below the boxplots represent the significance of the distribution compared to the respective null distributions and are calculated with the Wilcoxon rank sum test. Gene expression is measured in reads per kilobase per million (RPKM). Motif sites were identified with FIMO, a tool for searching occurrences of known motifs in biological sequences [34]. Motif enrichment is calculated as the ratio of FIMO hits in the positive sequence set to FIMO hits in dinucleotide shuffled negative control (“Methods”). cluster is K562-specific, which is appropriate because enrichment clusterings in the other enhancer regions are K562 is an erythroleukemia cell line, so we expect the preserved (Figure  3, black boxes), further supporting the motif of the erythroid fate determining TF GATA-1 [16] concept of common motif architecture across the differ - to be over-represented in K562 enhancers. As expected, ent enhancer regions. motifs are most enriched in DHS sequences. Non-DHS We next investigate how motifs are spatially distributed regions in enhancers are also enriched for some motifs, in SEs and TEs. For each cell type, we generated a set of but to a much lesser extent than their DHS counterparts. activator motifs that are highly enriched in the respec- Despite how clustering was performed on enrichment tive SE DHS sequences and computed the density histo- values focused on the SE DHS sequences, many of the grams of motif occurrences in SEs or TEs relative to the Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 6 of 14 Figure 3 Heatmaps of enrichment of the binding motifs of activator TFs across nine cell types and four regions. Shading indicates log2 enrich- ment of motifs in sequences of the specified cell and region. (Clustering was done using SE DHS motifs). To improve visualization of the heatmaps, the original set of motifs is pruned through a strategy that includes removing any motifs for which the corresponding TF fails to be significantly expressed in any of the nine cell types. This pruning strategy reduces the motif set to primarily include motifs of activator and master factors. The remaining motifs are clustered and ordered by exemplar-based agglomerative clustering on the log2 enrichment values across the nine cell types (“Methods”). Groups or families of motifs are manually labeled on the left side. Black boxes highlight cell-specific motif enrichment clusters. summits of DNase-seq peaks. Motif density is propor- enhancers by a proportional increase in putative binding tional to DNase I hypersensitivity, being greatest at the sites, but not by differences in motif density or enrich - summits of DNase I hypersensitivity peaks and decreas- ment within or close by to individual DHSs. ing away from the summits (Figure  4a, Additional file  2: To examine how motifs are distributed across indi- Figure S10). Based on these aggregate plots and heat- vidual enhancers instead of aggregating across groups of maps, SEs and TEs display similar motif enrichment and enhancers, we generated tracks that display motif densi- density patterns in their DHSs. However, SEs represent a ties in enhancers (Figure 4b, Additional file  2: Figure S11) disproportionately large fraction of enhancers by nucleo- for motifs that are significantly and differentially enriched tide count (Additional file  1: Table S1). Furthermore, SEs in enhancer sequences of a given cell type, which we refer contain more DHSs per enhancer (Additional file  2: Fig- to as cell identity motifs. Again, we observe that cell iden- ure S1). Therefore, SEs are further set apart from shorter tity motifs cluster in DHSs, but we also find that these Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 7 of 14 Figure 4 Motifs are most enriched around the summits of DNase-seq peaks. a Aggregate motif density plots of 3kbp sequences centered on DNase-seq peak summits within SEs or TEs. Line plots are colored as in the enhancer class key in Figure 1. Cell-specific motif sets are generated by selecting motifs that are significantly enriched in SE DHS sequences and correspond to TFs that are significantly expressed. Individual motif density plots are generated by the Homer annotate Peaks tool [39] with motifs called at a threshold of 40% of the maximum log likelihood score. These individual motif density plots are summed together position-wise to generate the aggregate motif density plots (“Methods”). Motif densities are generally similar between the SE and TE sequences. b UCSC Genome browser view of the ZBED3-AS1 locus. Motif density tracks (Upper) identify portions of the genome assigned an enhancer state that are relatively rich in motifs. These tracks measure the number of non-overlapping motif sites in 150 bp windows at 10 bp steps and are auto-scaled for each cell type. Potentially functional SNPs that overlap SEs may not overlap any DHS, but may overlap portions of the SEs that are dense in motifs, such as the rightmost T2D tightly linked SNP. Chromatin state colors and scales for the other tracks are assigned as in Figure 1a. The black box highlights a strong overlap of cell-specific accessible chromatin and enhancer chromatin states that is also dense in motifs. These tracks are downloadable at http://fusion.nhgri.nih.gov/files/se-motifs/motifsBedgraphs.tar . Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 8 of 14 motif clusters often appear outside of DHSs. In this par- discovery to DNase I “footprints” within all enhancers. ticular example, none of the linked T2D SNPs overlapped Footprints are local dips in the DNase I cleavage signal, any DHSs, but they did overlap portions of the islet-spe- which are predicted to demarcate TF occupancy because cific SE that is relatively dense in islet cell identity acti - TFs protect accessible chromatin from DNase I cleavage vator motifs. Interestingly, the RNA-seq tracks show a [23–25]. Therefore, footprint sequences should be much lack of expression of the surrounding ZBED3-AS1 gene, more enriched for motifs than the larger DHS sequences implying that the SE is likely acting on a distant gene. and can significantly improve the quality of de novo motif The lack of DNase hypersensitivity does not necessarily discovery. Due to the large number of nucleotides in the exclude the possibility that TFs are binding to these parts sequences we scan, we use EXTREME [19], a fast motif of SEs, since the specific chromatin marks extending discovery algorithm designed for large sequence datasets. across an SE suggest that the entire region is consider- As a demonstration of the effectiveness of this approach, ably more open than the average genomic segment. One EXTREME is able to recover motifs that significantly possibility is that DNase-seq may not be able to identify match known signatures (Figure  5, Additional file  2: Fig- chromatin accessibility in these particular locations. For ure S12). These motifs often correspond to TFs known example, in a comparative study with FAIRE-seq, another to be associated with cellular differentiation and repro - assay for mapping open chromatin sites, some open chro- gramming (Additional file 3). matin sites are unique to either DNase-seq or FAIRE-seq. The de novo motif discovery analysis does not yield Multiple lines of evidence suggest that these DNase- any prominent examples of novel motif families, possi- only and FAIRE-only sites correspond to real chromatin bly because the systems we consider have already been features [17]. It is also of note that some TFs are able to studied extensively. However, we do find that de novo localize partially or predominantly within inaccessible motif discovery can provide novel insight into the spa- chromatin [18]. tial arrangement of motif combinations at nucleotide resolution. Some of the motifs discovered in footprint De novo motif discovery in enhancers sequences appear as combinations of two known motifs Computing enrichment of known motifs in enhancer sequences can provide insight into the motif landscape of these sequences, but it does not necessarily identify the actual set of TFs that bind to the enhancers. This is because different TFs can bind to very similar motifs, leading to ambiguity. Moreover, motif scanning analy- sis is limited to motifs that are available in databases, which are incomplete. Even for the TFs that are repre- sented, databases often neglect infrequent non-canonical motifs. For example, one study showed that the protein Neuron Restrictive Silencer Factor can have, in addition to its prominent canonical motif, nine infrequent non- canonical motifs that are not present in any motif data- base [19]. In fact, motifs in databases may not accurately represent the in  vivo binding context of TFs. For exam- ple, many of the database motifs we consider in this study were generated by high-throughput SELEX on single TFs [20], an in  vitro assay that can miss important in  vivo binding contexts. Additionally, DNA shape-based read- out by TFs is not well captured by traditional PWMs [21] and can, therefore, result in missed target sites. Finally, Figure 5 De novo motif discovery accurately recovers known motifs. results from a recent study show that disease-associated Examples of sequence motifs that are enriched in Stretch Enhancer non-coding SNPs are not well captured by PWMs and (SE) DNase Hypersensitive Site (DHS) sequences or within DNase footprints (FP) across all ChromHMM concatenated enhancers (CEs). may instead be better explained by non-conical sequence Sequences of motifs derived from EXTREME (bottom) are aligned to determinants [22]. known motifs from two databases [20, 37] (top). Motif similarity was To further explore the motif landscape of enhancers, measured with TOMTOM, a tool that reports the significance of a we apply de novo motif discovery on DHS sequences in match between a query motif and a database motif [43]. Below each SEs, which we showed are highly enriched for cell spe- sequence logo, the cell type, sequence set, motif name, and signifi- cance of the TOMTOM match (E ) are displayed. cific regulatory motifs. We also apply de novo motif T Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 9 of 14 in close spatial proximity (Figure  6). In HUVEC enhanc- uncover some of the non-canonical sequence determi- ers, for example, we find a significant number of activa - nants that underly disease-associated SNPs. Similarly, tor protein 1 (AP-1) and ERG motif matches. AP-1 is a another previous study showed the sequence-specific heterodimeric protein that recognizes and binds to the constraints of some TFs can decrease as a function of enhancer heptamer motif 5′-TGA[CG]TCA-3′. ERG the number of co-occupying factors [27]. Although these is a subfamily of the ETS family of TFs, which have a motifs contain binding preferences of well-character- strong 5′-GGAA-3′ core binding sequence within their ized TFs, most of them are novel, lacking any database binding motifs. The ERG subfamily includes TFs such matches. As a resource to the community, we provide all as ERG and FLI1, which are known to be functionally de novo discovered motifs in MEME Minimal Motif For- active in HUVEC. Through our de novo motif analysis, mat (Additional file 4). we find these two classes of motifs are significantly co- enriched, but the frequency of the combination depends GWAS SNPs significantly overlap SEs and alter motifs on the relative orientation of these two motifs. Further- outside of DHSs more, sequence-specific constraints for the ERG binding Motivated by the co-occurrence of T2D GWAS SNP loci motif are relaxed when an AP-1 motif is nearby. These and islet SEs, we test whether GWAS SNPs associated results suggest a motif regulatory “grammar” governed with several diseases and traits are generally enriched by physical constraints that dictate the in  vivo spatial in SEs. A SNP locus consists of a lead SNP and all arrangements and frequencies of combinations of motifs, SNPs in strong LD with that lead SNP (r   ≥  0.8), care- which is consistent with a previous report [26], and may fully accounting for the possibility that the lead SNP is not causative but is instead in LD with the true, causa- tive SNP. Indeed, SNP loci are enriched both inside and outside of DHSs in SEs, replicating previous disease- and trait-associated SNP enrichment in cell-specific enhancer states [2, 28], including rheumatoid arthritis in GM12878 (Figure 7, Additional file 2: Figure S13). While we know that SEs are enriched for SNPs associated with complex diseases, the mechanism by which these SNPs contribute to disease risk is not clear. One reasonable mechanism is through altering TF binding sites in SEs. To assess this idea, we also test for GWAS SNP enrich- ment within cell identity motif sites in SEs. Interestingly, GWAS SNP loci are much more enriched and abundant in SE non-DHS cell identity motif sites than they are in SE DHS cell identity motif sites. We posit that our earlier proposition that SNPs are disrupting putative TF binding sites in less accessible chromatin portions of SEs may in fact be a prevalent mechanism for driving common dis- eases. A previous study suggested that most non-coding GWAS SNP loci intersect DHS regions [29], but did not consider motifs in SEs that reside outside DHSs, as we Figure 6 De novo motif discovery in enhancer footprint sequences do here. Our findings are consistent with another study reveals novel binding patterns of well-characterized TFs. Motifs of known activators in the HUVEC, K562, HSMM, and HepG2 cell lines that found enhancer-associated chromatin marks can be can co-occur together. For example, in the HUVEC enhancer footprint more informative for tissue-specific disease SNP enrich - sequences, the ERG motif, a member of the ETS family that is char- ments than DHSs can [30]. acterized by a “GGAA” binding paper, often co-occurs with the AP-1 Although our analysis provides a possible mecha- motif. In the presence of the AP-1 motif, the degree of resemblance nism for many disease SNPs, several SNPs are left unac- of a predicted site to the ERG motif is weaker. SPI1, another member of the ETS family, shares a similar relationship with the GATA1 motif in counted for. Notably, our motif sets are limited, including K562. In other examples, activator TFs appear to often homodimerize only motifs from two databases, which are far from and form palindromic motifs. Sequence logos of examples of de novo complete and are missing motifs that were discovered motifs in the cell types are displayed alongside, if available, CentriMo through our de novo motif discovery analysis, for exam- E-values and number of matches in SE DHS sequences (“Meth- ple. Our GWAS analysis also focuses on motifs of activa- ods”). For two of the HepG2 examples, the motifs are so infrequent that CentriMo failed to find a significant number of matches in SE tors, which excludes TFs like CTCF whose binding motif sequences. is slightly enriched in enhancers. We also do not consider Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 10 of 14 Figure 7 SEs show significant enrichment of GWAS SNPs associated with diseases or quantitative in a cell-specific manner. Positions of index and tightly linked (r ≥ 0.8) SNPs for different diseases or traits (y-axis) are overlapped with those of DHSs within SEs, non-DHSs with SEs, and motif sites in either DHSs or non-DHSs within SEs for each cell type (x-axis). Only motifs from the cell-specific motif sets used to generate the motif density −4 tracks in Figure 4 are considered for each cell type. Motif sites were identified with FIMO at the default threshold of p < 10 . Text in the boxes indi- cates the number of overlapping SNP loci in each cell type. Shading indicates the significance of SNP locus enrichment relative to a null distribution −8 (“Methods”). Only SNPs that meet the genome-wide significance threshold (p < 5 × 10 ) are considered. variants that generate new motif sites. A recent study, for the punctate DHSs, SEs contain dense clusters of motifs example, showed that a small insertion introducing bind- outside of their DHSs. Notably, the SE motif architecture ing motifs can lead to the formation of an aberrant onco- within and outside DHSs is significantly correlated, sug - genic super-enhancer [31]. Including these other pieces gesting an orchestrated mechanism of regulation across of information may account for the remaining SNP loci. the entire length of the element. Through de novo motif discovery analysis in the motif- Conclusions rich DNase I footprint sequences, we identified non- In this study, we analyzed the motif architecture of SEs, canonical binding sites and predicted which pairs of TFs which we hypothesized can distinguish SEs from TEs. bind adjacently more than expected by chance. Many of In general, enhancers are highly enriched for TF binding these composite motifs are not present in any motif data- sites, especially those corresponding to activators. SEs are base, having only been identified in this study, revealing characterized by multiple motif-rich DHSs in close spa- large gaps in current databases. tial proximity, unlike shorter TEs which typically have at Disease-associated SNPs identified through GWAS are most one DHS. SEs also display much higher H3K27ac known to co-localize with SEs in a cell-specific manner, ChIP-seq signal, a marker for active enhancers. This result but the exact mechanism by which these SNPs perturb complements a recent study which found that enhancer genome function is not well characterized. One pos- reporter activity from sequences in H3K27ac peaks sibility is that these SNPs are disrupting TF binding in within super enhancers is considerably stronger than regions of open chromatin. Indeed, we find that DHSs enhancer reporter activity from sequences in H3K27ac within SEs are modestly enriched for GWAS SNPs. Inter- peaks outside of super enhancers [32]. We conjecture that estingly, however, our most notable finding is that GWAS the exceptional length of SEs is in part due to the spatial SNPs more often co-localize outside of the SE DHSs, and coordination of accessible chromatin where clusters of motif sites that are within SEs but outside of DHSs cap- activators can bind. Expression of tissue-specific TFs, par - ture many of these SNPs. This finding may explain why ticularly activators, also correlates with presence of their a recent large study reported better GWAS SNP asso- binding sites in enhancers that are active in that same tis- ciations with chromatin-marked enhancers versus DHSs sue. Although these TF binding sites are most enriched in [30]. Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 11 of 14 The functional role of these motif sites in less accessible sequences were generated using the dinucleotide shuf- chromatin is not well-understood. They could represent fling script in the MEME Suite [35]. This measure of actual binding sites, despite lacking the extreme chro- enrichment allows direct comparison between sequence matin openness marked by DHSs. From an evolutionary sets of varying number of nucleotides. perspective, it is possible that common disease SNPs may We also performed PWM scanning using CentriMo be preferentially found in such sites because disruption [36], which computes the central enrichment of motifs. of TF binding in the most highly accessible chromatin For each motif, CentriMo finds an optimal score above would not be normally tolerated in the population due the minimum threshold 5 bits at which to call motifs. to selective pressure. Another possibility is based on the 1  kbp sequences centered on DNase-seq peak summits idea that SEs are lineage-specifying regulatory hubs that in SEs or TEs were extracted from the hg19 reference orchestrate a chromatin environment that is permissive genome as input sequences to CentriMo. CentriMo out- to rapid cellular response. The GWAS SNPs enriched in puts log adjusted p-values measuring the significance of motifs in SEs but outside DHSs could represent response motif enrichment in the center of sequences. elements that become active (and potentially hypersen- When scanning sequences with known motifs, we take sitive to DNase I) in different developmental stages or human or mouse motifs from the JASPAR 2014 [37] or under different physiologic conditions. Such a scenario the high-throughput SELEX [20] databases. We selected would be an efficient solution to tune dynamic changes these two databases because of the quality of their bind- rapidly and precisely in lineage-specific gene expres - ing models and their coverage of TFs. These two data - sion, consistent with the rheostat model of SE function bases contain 943 motifs altogether. that was recently proposed [33]. Testing these hypotheses will require additional TF profiling experiments, such as Motif enrichment heatmaps ChIP-seq and DNase-seq, across diverse environmental, Heatmaps plotting the enrichment of motifs in enhancer developmental, and genetic backgrounds. sequences were generated on a subset of the 943 motifs to focus on motifs that likely play an important role in Methods enhancers and improve the visualization of the heatmaps. DNase I hypersensitivity, chromatin state and gene From the original 943 database motifs, we selected motifs expression profiling that are highly (log fold enrichment >1.5 in at least one Chromatin states and gene expression were integrated cell type) and differentially (log fold enrichment range as previously described [3]. Sources of sequencing reads >0.75) enriched in SE DHS sequences. We also select from ChIP-seq and RNA-seq experiments used in the motifs that correspond to TFs that are expressed in any integrative analysis are found in the supplementary mate- one of the nine cell types (>2 RPKM). Finally, we further rials of [3]. Single-hit DNase-seq data from the ENCODE condense the motifs by selecting motifs whose enrich- Duke University group were used for the genome ment in SE DHS sequences correlates positively (ρ  >  0) browser shots and calling DHSs. DHSs were based on with the expression of the TF it corresponds to (Fig- narrow peak calls of single-hit DNase-seq data from the ure 2a), which we found to be an indication of activators. ENCODE Duke University group. Narrow peak files also The remaining motifs are ordered using agglomerative contain the genome coordinates of the summits. Wiggle clustering on the log fold enrichment values in SE DHSs. signal tracks for these single-hit data were also used for Agglomerative clustering was implemented using the the UCSC genome browser shots. aggExCluster method in the apcluster R package with the mutual pairwise similarities of data vectors as negative Motif scanning and calculating motif enrichment distances [38]. Visualization and clustering were based We performed position weight matrix (PWM) motif on log fold enrichment values instead of directly on fold scanning of FASTA sequence sets using FIMO [34]. Motif enrichment values due to the relatively large range of fold occurrences were called at the default p value thresh- enrichment values across all motifs (typically between −4 old of 10 . Based on empirical results, we found the 1 and 8). Using log values makes direct comparisons default threshold to be a good compromise between a between motifs more manageable across this range. stringent threshold that called very few motif sites and a relaxed threshold that called too many motif sites, as Assigning motifs to cells well as computationally efficient. Motif enrichment in a For each cell type, we generated a list of activator motifs set of sequences is calculated as the ratio of the number that are important in enhancers. These are the same of occurrences of the motif in the set of sequences rela- motifs that we used for our motif density tracks (Fig- tive to the number of motif occurrences in dinucleotide ure  4b, Additional file 2: Figure S11) and GWAS enrich - shuffled versions of the sequences. Dinucleotide shuffled ment analysis (Figure 7, Additional file 2: Figure S13). The Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 12 of 14 −5 selection procedure is similar to the filtering steps for at a threshold of p  <  10 , except for the GM12878 cell −10 the motif enrichment heatmaps. A motif is assigned to a line for which we used a threshold of p < 10 due to the cell type if its respective TF is expressed in that cell type deeper sequencing depth available. For the remaining cell (>2 RPKM), it is highly enriched in the central region line, NHEK, the sequencing depth of the available data- of the cell’s SE DHS sequences (CentriMo log adjusted set was too shallow to reliably call footprints with Wel- p-value < −50), and its enrichments in SE DHS sequences lington, so we extracted only high confidence footprints correlate positively with gene expression across the nine previously called by a Hidden Markov Model (HMM) cell types (ρ  >  0). If a motif has multiple versions in the algorithm [23], which were originally based in the hg18 databases, such as the DNA binding domain and full genome but later lifted over to the hg19 genome. transcript versions in the high-throughput SELEX data- Footprints are further processed prior to downstream base [20] we only select the motif with the lowest Cent- analysis by extending footprint genome coordinates riMo log adjusted p-value. The number of motifs in each equally on both ends by 5  bps (20  bps for HMM-based set ranges from as few as 25 for H1 ES to as many as 56 footprints) and then merging overlapping coordinates for HMEC and NHEK. using the BEDTools [42] mergeBed tool with parameter d = 10 (d = 40 for HMM-based footprints). Aggregate histogram plots Of the nine cell types considered in this study, only We used the Homer annotatePeaks tool [39] to generate the K562 cell line had enough data to apply all four foot- aggregate histogram plots documenting the dinucleotide printing methods on. Hence, we applied all four foot- frequency, motif density, and DNase-seq and H3K27ac printing methods on the K562 for comparison in the ChIP-seq tag density ±1,500  bps with 10  bp bins rela- downstream de novo motif discovery (see next subsec- tive to DNase-seq narrow peak summits overlapping tion). The thresholds and parameters were selected to SEs or TEs. ChIP tags were centered based on their esti- yield sequence datasets of 5–10 million bps, which we mated ChIP-fragment length. DNase tags were kept in found to be optimal for motif discovery. their original positions by fixing the estimated DNase- fragment length to 1. Mappability (36 bp) and GERP [12] De novo motif discovery aggregate histogram plots were similarly generated with De novo motif discovery was applied to SE DHS the bwtool aggregate command [40] instead. sequences and enhancer footprint sequences in all enhancers using the EXTREME algorithm [19]. FASTA De novo footprinting sequence sets were generated by extracting hg19 masked De novo footprinting requires deep sequenced DNase- genome sequences from BED file coordinates using seq data. Unfortunately, the data available are highly het- the BEDTools fastafrombed command [42]. SE DHS erogeneous. Not only are the deep sequence DNase-seq FASTA sequences were further preprocessed by replac- datasets available at varying sequence depth, they are ing instances of AAAAAAAA, ACACACAC, and their also generated by two different experimental protocols: reverse complements with capital N’s. Such repetitive a single-hit version [8] and a double-hit version [25] of subsequences are ubiquitous throughout the genome DNase-seq. Therefore, we adopted four different meth - and are often missed in the genome masking process. ods to call footprints de novo. Masked sequences were inputted to the EXTREME pipe- Four of the nine cell types considered in this study line with the parameters l  =  5 and q  =  0.02 (all other have deep sequenced double-hit DNase-seq data (K562, parameters were set to the default or recommended val- HSMM, HepG2, HUVEC). Of these four, three were pre- ues). We selected these parameters empirically based on viously processed by a de novo footprinting algorithm the quality of motifs generated. Although footprints were that searches DHSs for locations with high footprint generated from a variety of algorithms and experimen- occupancy scores, and genome coordinates of the foot- tal protocols, EXTREME can still find a consistent set prints were made openly available [24]. For the remaining of high-quality enhancer-associated motifs (Additional cell type, HUVEC, footprints were called on the double- file 2: Figure S12). Discovered motifs were compared to hit data with Wellington [41] at a stringent threshold of known motifs from the JASPAR 2014 [37] or the high- −20 p < 10 . throughput SELEX [20] databases using TOMTOM [43] For the remaining five cell types, footprints were called keeping only significant matches (E < 0.1). on deep sequenced single-hit DNase-seq data, but these datasets were not sequenced as deeply as the double-hit GWAS variant enrichment for the other four cell types. We called footprints in four We calculated GWAS variant enrichment exactly as we of the single-hit DNase-seq data with the “1D” version of did in our previous study of SEs [3] on an updated set Wellington on pooled replicates. Footprints were called of SNPs from the NHGRI GWAS catalog (http://www. Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 13 of 14 and R00DK099240-02 (to SCJP), the American Diabetes Association Pathway genome.gov/gwastudies/; downloaded on October 9, to Stop Diabetes Grant 1-14-INI-07 (to SCJP), and the National Institute of 2014). Briefly, we calculated enrichment by performing Biomedical Imaging and Bioengineering, National Research Service Award a permutation test that measures SNP loci and enhancer (EB009418) from the University of California, Irvine, Center for Complex Biological Systems (NIH R01HG006870, to DXQ). This material is based upon overlaps as previously described [2]. A SNP locus con- work supported by the National Science Foundation Graduate Research sists of a lead SNP and all SNPs in strong LD with that Fellowship under Grant No. (DGE-1321846, to DXQ). Any opinion, findings, lead SNP. SNPs in LD with the lead SNP were defined as and conclusions or recommendations expressed in this material are those of the authors(s) and do not necessarily reflect the views of the National Science those with r  ≥ 0.8. We ran 10,000 iterations of the per- Foundation. This research was supported (in part) by the Intramural Research mutation test and estimated the maximal P-value as the Program of the National Human Genome Research Institute, National Insti- number of permutations equal to or greater than the tutes of Health. observed overlap value plus one divided by the number Compliance with ethical guidelines of iterations plus one (10,001). Our enrichment analysis was performed on both the entire set of NHGRI GWAS Competing interests The authors declare that they have no competing interests. catalog SNPs per trait and a filtered subset of genome- −8 wide significant SNPs (p < 5 × 10 ) per trait. Received: 25 June 2015 Accepted: 1 July 2015 Additional files Additional file 1: Table S1. Staitistics of enhancers. Additional file 2: Supplementary Figures S1 to S13. References 1. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of Additional file 3: CentriMo scans of DHS sequences with database in vivo protein-DNA interactions. Science. 2007;316(5830):1497–502. motifs. Zip archive of text outputs from CentriMo scans on DHS sequences 2. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, et al. within SEs or TEs using database motifs. File names specify the cell type Mapping and analysis of chromatin state dynamics in nine human cell and enhancer class. Can be downloaded at http://fusion.nhgri.nih.gov/ types. Nature. 2011;473(7345):43–9. files/se-motifs/AdditionalFile3.zip . 3. Parker SCJ, Stitzel ML, Taylor DL, Orozco JM, Erdos MR, Akiyama JA, et al. Additional file 4: De novo motif disocvery results. Zip archive of de novo Chromatin stretch enhancer states drive cell-specific gene regula- motif discovery results. Discovered motifs are condensed in MEME Mini- tion and harbor human disease risk variants. Proc Natl Acad Sci USA. mal Motif Format. TOMTOM output HTML files documenting the compari- 2013;110(44):17921–6. son of the discovered motifs to known motifs and CentriMo output text 4. Loven J, Hoke HA, Lin CY, Lau A, Orlando DA, Vakoc CR, et al. Selective files detailing the enrichment of the discovered motifs in SE or TE DHS inhibition of tumor oncogenes by disruption of super-enhancers. Cell. sequences are also included. File names indicate the type of sequences 2013;153(2):320–34. used for the motif discovery, including the method of footprinting. Can be 5. Whyte WA, Orlando DA, Hnisz D, Abraham BJ, Lin CY, Kagey MH, et al. downloaded at http://fusion.nhgri.nih.gov/files/se-motifs/AdditionalFile4. Master transcription factors and mediator establish super-enhancers at zip. key cell identity genes. Cell. 2013;153(2):307–19. 6. Hnisz D, Abraham BJ, Lee TI, Lau A, Saint-Andre V, Sigova AA, et al. Super-enhancers in the control of cell identity and disease. Cell. 2013;155(4):934–47. Abbreviations 7. Yanez-Cuna JO, Arnold CD, Stampfel G, Boryn LM, Gerlach D, Rath M, SE: stretch enhancer; TE: typical enhancer; CE: concatenated enhancer; DHS: et al. Dissection of thousands of cell type-specific enhancers identities DNase I hypersensitive site; FP: footprint; TF: transcription factor; TSS: transcrip- dinucleotide repeat motifs as general enhancer features. Genome Res. tion start site; bp: base pair; T2D: type 2 diabetes; H3K27ac: histone 3 lysine 27 2014;24(7):1147–56. acetylation; GWAS: genome-wide association study; SNP: single nucleotide 8. Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, et al. High- polymorphism; LD: linkage disequilibrium; PWM: position weight matrix; resolution mapping and characterization of open chromatin across the HMM: Hidden Markov Model; GERP: Genomic Evolutionary Rate Profiling. genome. Cell. 2008;132(2):311–22. 9. Mortazavi A, Williams BA, McCue K, Schaeer L, Wold B. Mapping and Authors’ contributions quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. DXQ participated in study design, performed computational data analysis, and 2008;5(7):621–8. drafted the manuscript. MRE participated in study design and coordination. 10. Corradin O, Saiakhova A, Akhtar-Zaidi B, Myero L, Willis J, Cowper-Sallari SCJP helped conceive of the study, participated in study design and coor- R, et al. Combinatorial effects of multiple enhancer variants in linkage dination, helped perform computational data analysis, and helped draft the disequilibrium dictate levels of gene expression to confer susceptibility manuscript. FSC helped conceive the study, participated in study design and to common traits. Genome Res. 2014;24(1):1–13. coordination, and helped draft the manuscript. All authors read and approved 11. Ziller MJ, Edri R, Yae Y, Donaghey J, Pop R, Mallard W, et al. Dissecting neu- the final manuscript. ral differentiation regulatory networks through epigenetic footprinting. Nature. 2015;518(7539):355–9. Author details 12. Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Pro- Department of Computer Science, University of California, Irvine, Irvine, CA gram, Green ED, Batzoglou S, et al. Distribution and intensity of constraint 92697, USA. Center for Complex Biological Systems, University of California, in mammalian genomic sequence. Genome Res. 2005;15(7):901–13. Irvine, Irvine, CA 92697, USA. National Human Genome Research Institute, 13. Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, et al. Sequence National Institutes of Health, Bethesda, MD 20892, USA. Departments features and chromatin structure around the genomic regions bound by of Computational Medicine and Bioinformatics and Human Genetics, Univer- 119 human transcription factors. Genome Res. 2012;22(9):1798–812. sity of Michigan, Ann Arbor, MI 48109, USA. 14. Lelli KM, Slattery M, Mann RS. Disentangling the many layers of eukaryotic transcriptional regulation. Annu Rev Genet. 2012;46:43–68. Acknowledgements 15. Carey M, Lin YS, Green MR, Ptashne M. A mechanism for synergis- We thank Peter S. Chines, Narisu Narisu, and Brooke N. Wolford for helpful tic activation of a mammalian gene by GAL4 derivatives. Nature. advice. This work was supported by NIH grants 1-ZIA-HG000024 (to FSC) 1990;345(6273):361–4. Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 14 of 14 16. Rhodes J, Hagen A, Hsu K, Deng M, Liu TX, Look AT, et al. Interplay of PU.1 30. Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, and GATA1 determines myelo-erythroid progenitor cell fate in zebrafish. Bilenky M, Yen A, et al. Integrative analysis of 111 reference human epig- Dev Cell. 2005;8(1):97–108. enomes. Nature. 2015;518(7539):317–30. 17. Song L, Zhang Z, Grasfeder LL, Boyle AP, Giresi PG, Lee BK, et al. Open 31. Mansour MR, Abraham BJ, Anders L, Berezovskaya A, Gutierrez A, Durbin chromatin defined by DNaseI and FAIRE identifies regulatory elements AD, et al. Oncogene regulation. An oncogenic super-enhancer formed that shape cell-type identity. Genome Res. 2011;21(10):1757–67. through somatic mutation of a noncoding intergenic element. Science. 18. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, et al. 2014;346(6215):1373–7. The accessible chromatin landscape of the human genome. Nature. 32. Vanhille L, Griffon A, Maqbool MA, Zacarias-Cabeza J, Dao LTM, Fernandez 2012;489(7414):75–82. N, et al. High-throughput and quantitative assessment of enhancer activ- 19. Quang D, Xie X. EXTREME: an online EM algorithm for motif discovery. ity in mammals by CapStarr-seq. Nat Commun. 2015;6:6905. Bioinformatics. 2014;30(12):1667–73. 33. Hah N, Benner C, Chong LW, Yu RT, Downes M, Evans RM. Inflammation- 20. Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, et al. sensitive super enhancers form domains of coordinately regulated DNA-binding specificities of human transcription factors. Cell. enhancer RNAs. Proc Natl Acad Sci USA. 2015;112(3):E297–302. 2013;152(1–2):327–39. 34. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given 21. Zhou T, Shen N, Yang L, Abe N, Horton J, Mann RS, et al. Quantitative motif. Bioinformatics. 2011;27(7):1017–8. modeling of transcription factor binding specificities using DNA shape. 35. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L et al. MEME Proc Natl Acad Sci USA. 2015;112(15):4654–9. SUITE: tools for motif discovery and searching. Nucleic Acids Res. 22. Farh KKH, Marson A, Zhu J, Kleinewietfeld M, Housley WJ, Beik S, et al. 2009;37( Web Server issue):W202-8. Genetic and epigenetic fine mapping of causal autoimmune disease 36. Bailey TL, Machanick P. Inferring direct DNA binding from ChIP-seq. variants. Nature. 2015;518(7539):337–43. Nucleic Acids Res. 2012;40(17):e128. 23. Boyle AP, Song L, Lee BK, London D, Keefe D, Birney E, et al. High-resolu- 37. Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, tion genome-wide in vivo footprinting of diverse transcription factors in et al. JASPAR 2014: an extensively expanded and updated open-access human cells. Genome Res. 2011;21(3):456–64. database of transcription factor binding profiles. Nucleic Acids Res. 24. Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, et al. 2014;42(Database issue):D142–7. An expansive human regulatory lexicon encoded in transcription factor 38. Bodenhofer U, Kothmeier A, Hochreiter S. APCluster: an R package for footprints. Nature. 2012;489(7414):83–90. affinity propagation clustering. Bioinformatics. 2011;27(17):2463–4. 25. Hesselberth JR, Chen X, Zhang Z, Sabo PJ, Sandstrom R, Reynolds AP, et al. 39. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, et al. Simple Global mapping of protein-DNA interactions in vivo by digital genomic combinations of lineage-determining transcription factors prime cis- footprinting. Nat Methods. 2009;6(4):283–9. regulatory elements required for macrophage and B cell identities. Mol 26. Guo Y, Mahony S, Gifford DK. High resolution genome wide binding Cell. 2010;38(4):576–89. event finding and motif discovery reveals transcription factor spatial 40. Pohl A, Beato M. bwtool: a tool for bigWiggles. Bioinformatics. binding constraints. PLoS Comput Biol. 2012;8(8):e1002638. 2014;30(11):1618–9. 27. Siersbaek R, Baek S, Rabiee A, Nielsen R, Traynor S, Clark N, et al. Molecular 41. Piper J, Elze MC, Cauchy P, Cockerill PN, Bonifer C, Ott S. Wellington: a architecture of transcription factor hotspots in early adipogenesis. Cell novel method for the accurate identification of digital genomic foot - Rep. 2014;7(5):1434–42. prints from DNase-seq data. Nucleic Acids Res. 2013;41(21):e201. 28. ENCODE Project Consortium. An integrated encyclopedia of DNA ele- 42. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing ments in the human genome. Nature. 2012;489(7414):57–74. genomic features. Bioinformatics. 2010;26(6):841–2. 29. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, et al. 43. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying Systematic localization of common disease-associated variation in regu- similarity between motifs. Genome Biol. 2007;8(2):R24. latory DNA. Science. 2012;337(6099):1190–5. Submit your next manuscript to BioMed Central and take full advantage of: • Convenient online submission • Thorough peer review • No space constraints or color figure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution Submit your manuscript at www.biomedcentral.com/submit http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Epigenetics & Chromatin Springer Journals

Motif signatures in stretch enhancers are enriched for disease-associated genetic variants

Free
14 pages
Loading next page...
 
/lp/springer_journal/motif-signatures-in-stretch-enhancers-are-enriched-for-disease-M9Y1hw9CjA
Publisher
BioMed Central
Copyright
Copyright © 2015 by Quang et al.
Subject
Life Sciences; Animal Genetics and Genomics; Human Genetics; Plant Genetics and Genomics; Cell Biology; Gene Expression; Gene Function
eISSN
1756-8935
D.O.I.
10.1186/s13072-015-0015-7
Publisher site
See Article on Publisher Site

Abstract

Background: Stretch enhancers (SEs) are large chromatin-defined regulatory elements that are at least 3,000 base pairs (bps) long, in contrast to the median enhancer length of 800 bps. SEs tend to be cell-type specific, regulate cell- type specific gene expression, and are enriched in disease-associated genetic variants in disease-relevant cell types. Transcription factors ( TFs) can bind to enhancers to modulate enhancer activity, and their sequence specificity can be represented by motifs. We hypothesize motifs can provide a biological context for how genetic variants contribute to disease. Results: We integrated chromatin state, gene expression, and chromatin accessibility [measured as DNase I Hyper- sensitive Sites (DHSs)] maps across nine different cell types. Motif enrichment analyses of chromatin-defined enhancer sequences identify several known cell-type specific “master” factors. Furthermore, de novo motif discovery not only recovers many of these motifs, but also identifies novel non-canonical motifs, providing additional insight into TF binding preferences. Across the length of SEs, motifs are most enriched in DHSs, though relative enrichment is also observed outside of DHSs. Interestingly, we show that single nucleotide polymorphisms associated with diseases or quantitative traits significantly overlap motif occurrences located in SEs, but outside of DHSs. Conclusions: These results reinforce the role of SEs in influencing risk for diseases and suggest an expanded regu- latory functional role for motifs that occur outside highly accessible chromatin. Furthermore, the motif signatures generated here expand our understanding of the binding preference of well-characterized TFs. Background (referred to as SEs in this paper) [3]. By our definition, Chromatin immunoprecipitation combined with high- SEs have a length of at least 3,000 DNA base pairs (bps) throughput sequencing (ChIP-seq) can identify the and are much larger than typical enhancers (TEs), which genome-wide locations of target proteins, including tran- we defined to be any enhancer less than or equal to the scription factors (TFs), RNA Polymerase II, and cova- median enhancer length of 800 bps. SEs are generally cell lently modified histones [ 1]. ChIP-seq datasets for several type specific, associated with increased cell-specific gene chromatin marks and the sequence-specific factor CTCF expression, and tend to harbor disease-relevant genetic can be computationally integrated to discover combina- variants derived from genome-wide association stud- torial and spatial patterns that produce a consistent anno- ies (GWASs). In our previous paper [3], we showed that tation of promoter, enhancer, insulator, transcribed, and enrichment for GWAS variants increases with the length repressed chromatin states. In a recent study, we profiled of enhancers, but we did not try to define the precise the chromatin states of several cell lines using Chrom- relationship of GWAS variants to motifs located within HMM [2], which revealed the presence of large gene the enhancers—that is the goal of this paper. SEs share control elements that we designated “stretch enhancers” several traits with another recently defined class of large enhancers designated super-enhancers by Young and colleagues [4, 5]. Like SEs, super-enhancers drive cell- *Correspondence: scjp@umich.edu type-specific gene expression; however, super-enhancers Stephen CJ Parker and Francis S Collins contributed equally Departments of Computational Medicine and Bioinformatics have been defined by the disproportionate abundance of and Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA Mediator or histone 3 lysine 27 acetylation (H3K27ac) Full list of author information is available at the end of the article © 2015 Quang et al. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 2 of 14 signal [6], whereas SEs are defined by patterns of histone Of the ten cell types profiled, nine also have DNase-seq modifications, or chromatin states. Enhancers can func - data available. DNase-seq is a method used to identify tion independently of their endogenous spatial contexts, the genome-wide locations of DHSs, which are regions which is a property exploited in luciferase assays to meas- of the genome that are highly sensitive to cleavage by ure enhancer activity by placing enhancers upstream DNase I and mark regulatory elements such as enhancers of reporter genes. This suggests that the information and promoters [8]. DHSs generally mark regions that are required for the enhancer activity is encoded in the more accessible for TF binding and are enriched for TF underlying DNA sequences. We, therefore, hypothesize binding motifs. that the sequence content of the enhancers can provide Figure  1a displays the accessible chromatin, chromatin additional insight into the relationship between enhancer state, and gene expression profiles across the nine cell function and enhancer length, which we address by stud- types around the ABCC8 locus. Gene expression RNA- ying how SEs differ from TEs. seq [9] tracks clearly show that the ABCC8 transcript is Enhancer sequences are known to be enriched for exclusively expressed in the islet sample, and the chroma- transcription factor binding sites, which contribute to tin state tracks show this gene body is covered by several enhancer activity. Upon binding to their recognition islet-specific SEs. This integrative approach can identify motifs, some TF proteins can form complexes with other cell-specific chromatin and expression profiles to provide proteins, which can alter the 3D conformation of chro- a basis for further understanding the functional effects of matin and recruit RNA Polymerase II to promote the SNPs in common, complex diseases. In the ABCC8 exam- transcription of target genes located in cis, sometimes ple, lead type 2 diabetes (T2D) GWAS SNPs (red arrow at considerable distances. The motif, or sequence bind - heads) and several linked SNPs (r   ≥  0.8) (green bars) ing specificity, for a TF can be represented as a posi - overlap islet-specific chromatin states. Enhancers have tion weight matrix (PWM) that specifies the nucleotide been shown to overlap multiple linked SNPs more often frequency at each position along the binding sequence. than expected at random [3, 10], suggesting that multiple Recently, less complex nucleotide patterns—like dinu- enhancer variants work together in concert to alter gene cleotide repeats—were shown to contribute to enhancer expression and contribute to disease susceptibility. activity [7]. From the ABCC8 example, we can infer some properties In this study, we analyze the motif signatures of SEs about the interplay between chromatin states and DNase and how they differ from those of TEs. We scan enhancer I hypersensitivity in SEs. First, we notice that the SEs sequences using known motifs from databases to iden- encompass several DHSs, but the entire SE does not dis- tify TFs that are characteristic of each cell type studied, play DNase hypersensitivity. Unsurprisingly due to their which we are also able to recover using de novo motif length, SEs contain proportionately more DHSs than TEs discovery. We investigate how GWAS single nucleotide do (Additional file  2: Figure S1). Second, strong enhancer polymorphisms (SNPs) can affect TF motif signatures states in islets in ABCC8, for example (Figure  1a), tend in SEs, which can provide important clues about how to overlap spikes in the islet DNase-seq signal. By aggre- genetic variations contribute to disease risk. gating the DNase-seq tag density relative to DNase-seq peak summits in either SEs or TEs, we find there is rela - Results and discussion tively little difference in DNase I hypersensitivity between Systematic chromatin state, DNase hypersensitivity, the classes of enhancers (Figure  1b). However, there is a and gene expression profiling across nine diverse cell large difference in the H3K27ac ChIP-seq signal (Fig - types ure  1c). Notably, there is a dip in the H3K27ac signal in In our previous study, we used the ChromHMM algo- the center of the DNase peak summit, which is a reflec - rithm to systematically integrate ChIP-seq histone tion of bound TFs that can displace histones. This TF- modification and CTCF datasets and uniformly profile induced configuration of chromatin architecture is much chromatin states across ten diverse cell types. These more pronounced in SEs than it is in TEs. Of relevance, ChromHMM segmentations are used to profile SEs, this H3K27ac dip feature recently provided the central which are defined as regions of at least 3,000 bps con - motivation for a novel computational algorithm to detect taining contiguous segments marked as enhancer states. factor binding sites [11]. Given the pronounced dip in SEs TEs are defined similarly, but are less than or equal to versus TEs, this recent computational algorithm likely 800 bps in length. Although SEs only constitute the top picked up mostly on SE-mediated factor binding sites. 10% of enhancers in terms of length, they represent a In generating the signal histogram plots, we accounted disproportionately large percentage of the total number for the minor difference in mappability between SEs of nucleotides among all enhancers (Additional file  1: and TEs. Generally, SEs are slightly more mappable than Table S1). TEs (Additional file  2: Figure S2), which suggests that SE Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 3 of 14 Figure 1 Systematic profiling of DNase I hypersensitivity, chromatin states, and gene expression across nine cell types. a Chromatin states and DNase I hypersensitivity density tracks in and around the ABCC8 locus. Human pancreatic islet chromatin states are similar to some of the other ENCODE cell types at the commonly expressed flanking gene NCR3LG1 and unique at the islet-specific expressed gene ABCC8. (Upper) DNase I Hypersensitivity Density Signal from ENCODE/Duke in dense format for each of nine human cell types (islets; GM12878, lymphoblastoid cells; H1 ES, embryonic stem cells; HepG2, hepatocellular carcinoma; HMEC, mammary epithelial cells; HSMM, smooth muscle myoblasts; HUVEC, umbilical vein endothelial cells, K562, erythroleukemia cells, NHEK, keratinocytes). Density graphs (wiggles) of signal enrichment calculated using F-Seq are displayed in grayscale. Scale is from 0 to 0.1. (Middle) ChromHMM-defined chromatin states. Chromatin state assignments are indicated in the top- leftmost key. (Lower) RNA-seq-based expression for each cell type is measured in reads per million mapped reads (RPM) per base pair. Scale is from 0 to 2 for each cell type. Index and tightly linked (r ≥ 0.8) SNPs associated with T2D are indicated in green in the T2D GWAS SNPs track and primarily reside in islet-specific SEs. Index SNPs rs5215 and rs5219 are marked with red arrows. The black box highlights a portion of the SE previously shown to direct tissue-specific expression patterns in a spatial and temporal manner in vivo [3]. All processed results are browsable and downloadable at http://research.nhgri.nih.gov/manuscripts/Collins/isletchromatin/. b–d Aggregate DNase-seq tag density (b), H3K27ac ChIP-seq tag density, (c), and (d) CG dinucleotide frequency profiles of 3kbp sequences centered on DNase-seq peak summits—the location of the highest DNase-seq signal— −100 located within SEs or TEs. e DHSs within SEs are much closer together than they are within TEs (two-tailed Wilcoxon rank sum test, p < 10 for all cell types). Boxplots show the distance, for each SE or TE DNase-seq summit, to the nearest SE or TE DNase-seq summit, respectively, for all cell types. −5 f DHSs within SEs are moderately longer than DHSs within TEs (two-tailed Wilcoxon rank sum test, p < 10 for all cell types). Boxplots show the length of each DHS whose summit overlaps a SE or TE. Boxplot whiskers extend to 1.5× the interquartile range and outliers are shown as block dots, but the y-axis is truncated so that the boxplots can remain in view. Enhancer classes for b–f are indicated in the top-rightmost key. sequences are less repetitive and more complex than TEs. after accounting for differences in mappability, there is Such sequence composition differences could be charac - still a large statistical difference in H3K27ac signal across teristic of the two enhancer classes. Nevertheless, even the length of the enhancers (Additional file 2 : Figure S3). Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 4 of 14 To examine functional and sequence composition dif- enrichment of its binding motif in SE DHS sequences ferences between the two enhancer classes more compre- across the nine cell types. For instance, HNF1A, an acti- hensively, we plotted the average Genomic Evolutionary vator highly expressed in the liver, shows a very strong Rate Profiling (GERP) score [12] (Additional file  2: Figure positive relationship between its expression across the S4), a metric that estimates position-specific evolution - nine cell types and the enrichment of its binding motif ary constraint across a multi-species sequence alignment, in SE DHS sequences (Figure  2a). In contrast, repres- and dinucleotide frequencies (Figure 1d, Additional file  2: sors such as GFI1 tend to have a negative correlation Figure S5) at each position relative to DNase-seq peak (Figure  2b). These observations reflect the general con - summits in either SEs or TEs. Generally, SEs and TEs cept of TF binding cooperativity on the DNA scaffold show similar patterns of evolutionary constraint around whereby increases in the number of binding sites results DNase-seq peak summits. GERP scores are highest at the in increased enhancer activity [14, 15]. We hypothesize summit and decrease as one moves away from the sum- that enhancer sequences are organized in a way such mit. At the summits, SEs have a slightly higher average that an increase in expression of the relevant activator is GERP score than TEs. SE and TE sequences are par- accommodated by an increase in available binding sites, ticularly different in terms of their CG dinucleotide fre - while reducing potential binding sites for any present quencies: both SEs and TEs display a large spike in CG repressors. To explore this idea more comprehensively, frequency in the DHS summits, but TE sequences are we studied the distribution of motif enrichment versus much more CG-rich and overlap CpG islands more often gene expression correlations across motif-TF pairs for SE (Additional file 2: Figure S6). Notably, TFs tend to bind to DHS sequences, as well as SE non-DHS, TE DHS, and TE CG-rich DNase I accessible regions, but the CG richness non-DHS sequences (Figure  2c). All four observed dis- of some motifs do not account for this spike in CG fre- tributions (red) display a significant positive correlation quency [13]. The variation in dinucleotide frequencies is bias relative to the null expectation (blue), indicating that largely a function of where these enhancers are located in SEs and TEs have well-organized motif architecture both the genome relative to gene models (Additional file  2: Fig- inside and outside of DHSs. To determine whether the ure S7). When restricted to transcription start site (TSS) relative motif enrichments are the same across different distal regions, the dinucleotide differences are mitigated, enhancer regions, we examine the correlation between but the difference in H3K27ac ChIP-seq signal persists a motif ’s enrichment in one genomic region against its (Additional file  2: Figure S8). In some of these histogram enrichment in a different genomic region (e.g. CTCF plots, we note that the differences can persist several motif enrichment in SE DHS sequences across cell types hundred base pairs away from the center. One possible versus its enrichment in TE DHS sequences across cell reason for this phenomenon is the proximity between types). We then measure the distribution of correlations DHSs in SEs. Although SEs are much longer than TEs, between enrichments for all motifs in different pairs of DHSs in SEs are still spaced comparatively close together regions (Figure  2d). Despite the difference in absolute (Figure 1e). DHSs in SEs are also significantly longer than motif enrichments in different enhancer regions, we DHSs in TEs (Figure  1f ), which suggests that individual find that the relative motif enrichments are strongly pre - DHSs in SEs can accommodate more TF binding sites served. Together, these findings indicate that enhancer than DHSs in TEs can. motif architecture is linked to TF gene expression and preserved within and outside DHSs in SEs and TEs. Enhancer sequences are enriched for known motifs Next, we investigate how motif enrichments vary Different cell types are regulated by sets of TFs that are across different motifs, cell types, and enhancer regions. important for establishing and maintaining cell identity. We perform agglomerative clustering on the log fold Enhancers are expected to be enriched for motifs that enrichment in SE DHS sequences of activator motifs that serve as putative TF binding sites. We hypothesize that are significantly and differentially enriched in enhancer DHSs in enhancers should be especially enriched for sequences, in order to group motifs in an unsupervised motif sites, because these regions are more accessible fashion (“Methods”). This analysis results in clusters of for protein–DNA interactions. In particular, enhancers motifs that are grouped corresponding to TF families would be expected to be bound by a specific class of TFs known to play important roles in the cell types consid- called activators, which increase gene transcription. ered, which we visualize with heatmaps of motif enrich- To identify activator motifs, we searched for a rela- ment across four different enhancer categories (Figure  3, tionship between motif enrichment and gene expres- Additional file  2: Figure S9). The heatmaps highlight sion, similar in nature to a previous approach [2], and the relative motif enrichments in enhancers between demonstrated that the gene expression of activators (by and within the cell types, capturing known biologically mRNA quantification) correlates positively with the relevant cell-TF associations. For example, the GATA Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 5 of 14 ab cd Figure 2 Motif enrichments in enhancers are correlated with the expression of the TFs that bind to these motifs. a, b TFs’ expressions are correlated against the enrichment of their respective binding motifs in SE DHS sequences across the nine cell types. The master activator HNF1A (a) has a posi- tive relationship between gene expression and motif enrichment, while the repressor GFI1 (b) has a negative relationship. The name of the motif, Spearman’s correlation (ρ), and sequence logo of the motif are displayed in the corner of the plots. c Boxplots of Spearman’s correlations of enrich- ments in four different regions for all database motifs against gene expression in the nine cell types. d Boxplots of Spearman’s correlations of motif enrichments between two different regions (listed in the strip titles at the top of the facets). For each motif and each pair of regions, we computed the enrichments of the motif in both sets of sequences and then computed the Spearman’s correlation between the two sets of enrichments. Within each facet, the boxplot of Spearman’s correlations (red, left) is also displayed alongside a boxplot of a null distribution (blue, right) generated by recalculating the Spearman’s correlations after shuffling cell assignments for one of the variables. P-values below the boxplots represent the significance of the distribution compared to the respective null distributions and are calculated with the Wilcoxon rank sum test. Gene expression is measured in reads per kilobase per million (RPKM). Motif sites were identified with FIMO, a tool for searching occurrences of known motifs in biological sequences [34]. Motif enrichment is calculated as the ratio of FIMO hits in the positive sequence set to FIMO hits in dinucleotide shuffled negative control (“Methods”). cluster is K562-specific, which is appropriate because enrichment clusterings in the other enhancer regions are K562 is an erythroleukemia cell line, so we expect the preserved (Figure  3, black boxes), further supporting the motif of the erythroid fate determining TF GATA-1 [16] concept of common motif architecture across the differ - to be over-represented in K562 enhancers. As expected, ent enhancer regions. motifs are most enriched in DHS sequences. Non-DHS We next investigate how motifs are spatially distributed regions in enhancers are also enriched for some motifs, in SEs and TEs. For each cell type, we generated a set of but to a much lesser extent than their DHS counterparts. activator motifs that are highly enriched in the respec- Despite how clustering was performed on enrichment tive SE DHS sequences and computed the density histo- values focused on the SE DHS sequences, many of the grams of motif occurrences in SEs or TEs relative to the Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 6 of 14 Figure 3 Heatmaps of enrichment of the binding motifs of activator TFs across nine cell types and four regions. Shading indicates log2 enrich- ment of motifs in sequences of the specified cell and region. (Clustering was done using SE DHS motifs). To improve visualization of the heatmaps, the original set of motifs is pruned through a strategy that includes removing any motifs for which the corresponding TF fails to be significantly expressed in any of the nine cell types. This pruning strategy reduces the motif set to primarily include motifs of activator and master factors. The remaining motifs are clustered and ordered by exemplar-based agglomerative clustering on the log2 enrichment values across the nine cell types (“Methods”). Groups or families of motifs are manually labeled on the left side. Black boxes highlight cell-specific motif enrichment clusters. summits of DNase-seq peaks. Motif density is propor- enhancers by a proportional increase in putative binding tional to DNase I hypersensitivity, being greatest at the sites, but not by differences in motif density or enrich - summits of DNase I hypersensitivity peaks and decreas- ment within or close by to individual DHSs. ing away from the summits (Figure  4a, Additional file  2: To examine how motifs are distributed across indi- Figure S10). Based on these aggregate plots and heat- vidual enhancers instead of aggregating across groups of maps, SEs and TEs display similar motif enrichment and enhancers, we generated tracks that display motif densi- density patterns in their DHSs. However, SEs represent a ties in enhancers (Figure 4b, Additional file  2: Figure S11) disproportionately large fraction of enhancers by nucleo- for motifs that are significantly and differentially enriched tide count (Additional file  1: Table S1). Furthermore, SEs in enhancer sequences of a given cell type, which we refer contain more DHSs per enhancer (Additional file  2: Fig- to as cell identity motifs. Again, we observe that cell iden- ure S1). Therefore, SEs are further set apart from shorter tity motifs cluster in DHSs, but we also find that these Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 7 of 14 Figure 4 Motifs are most enriched around the summits of DNase-seq peaks. a Aggregate motif density plots of 3kbp sequences centered on DNase-seq peak summits within SEs or TEs. Line plots are colored as in the enhancer class key in Figure 1. Cell-specific motif sets are generated by selecting motifs that are significantly enriched in SE DHS sequences and correspond to TFs that are significantly expressed. Individual motif density plots are generated by the Homer annotate Peaks tool [39] with motifs called at a threshold of 40% of the maximum log likelihood score. These individual motif density plots are summed together position-wise to generate the aggregate motif density plots (“Methods”). Motif densities are generally similar between the SE and TE sequences. b UCSC Genome browser view of the ZBED3-AS1 locus. Motif density tracks (Upper) identify portions of the genome assigned an enhancer state that are relatively rich in motifs. These tracks measure the number of non-overlapping motif sites in 150 bp windows at 10 bp steps and are auto-scaled for each cell type. Potentially functional SNPs that overlap SEs may not overlap any DHS, but may overlap portions of the SEs that are dense in motifs, such as the rightmost T2D tightly linked SNP. Chromatin state colors and scales for the other tracks are assigned as in Figure 1a. The black box highlights a strong overlap of cell-specific accessible chromatin and enhancer chromatin states that is also dense in motifs. These tracks are downloadable at http://fusion.nhgri.nih.gov/files/se-motifs/motifsBedgraphs.tar . Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 8 of 14 motif clusters often appear outside of DHSs. In this par- discovery to DNase I “footprints” within all enhancers. ticular example, none of the linked T2D SNPs overlapped Footprints are local dips in the DNase I cleavage signal, any DHSs, but they did overlap portions of the islet-spe- which are predicted to demarcate TF occupancy because cific SE that is relatively dense in islet cell identity acti - TFs protect accessible chromatin from DNase I cleavage vator motifs. Interestingly, the RNA-seq tracks show a [23–25]. Therefore, footprint sequences should be much lack of expression of the surrounding ZBED3-AS1 gene, more enriched for motifs than the larger DHS sequences implying that the SE is likely acting on a distant gene. and can significantly improve the quality of de novo motif The lack of DNase hypersensitivity does not necessarily discovery. Due to the large number of nucleotides in the exclude the possibility that TFs are binding to these parts sequences we scan, we use EXTREME [19], a fast motif of SEs, since the specific chromatin marks extending discovery algorithm designed for large sequence datasets. across an SE suggest that the entire region is consider- As a demonstration of the effectiveness of this approach, ably more open than the average genomic segment. One EXTREME is able to recover motifs that significantly possibility is that DNase-seq may not be able to identify match known signatures (Figure  5, Additional file  2: Fig- chromatin accessibility in these particular locations. For ure S12). These motifs often correspond to TFs known example, in a comparative study with FAIRE-seq, another to be associated with cellular differentiation and repro - assay for mapping open chromatin sites, some open chro- gramming (Additional file 3). matin sites are unique to either DNase-seq or FAIRE-seq. The de novo motif discovery analysis does not yield Multiple lines of evidence suggest that these DNase- any prominent examples of novel motif families, possi- only and FAIRE-only sites correspond to real chromatin bly because the systems we consider have already been features [17]. It is also of note that some TFs are able to studied extensively. However, we do find that de novo localize partially or predominantly within inaccessible motif discovery can provide novel insight into the spa- chromatin [18]. tial arrangement of motif combinations at nucleotide resolution. Some of the motifs discovered in footprint De novo motif discovery in enhancers sequences appear as combinations of two known motifs Computing enrichment of known motifs in enhancer sequences can provide insight into the motif landscape of these sequences, but it does not necessarily identify the actual set of TFs that bind to the enhancers. This is because different TFs can bind to very similar motifs, leading to ambiguity. Moreover, motif scanning analy- sis is limited to motifs that are available in databases, which are incomplete. Even for the TFs that are repre- sented, databases often neglect infrequent non-canonical motifs. For example, one study showed that the protein Neuron Restrictive Silencer Factor can have, in addition to its prominent canonical motif, nine infrequent non- canonical motifs that are not present in any motif data- base [19]. In fact, motifs in databases may not accurately represent the in  vivo binding context of TFs. For exam- ple, many of the database motifs we consider in this study were generated by high-throughput SELEX on single TFs [20], an in  vitro assay that can miss important in  vivo binding contexts. Additionally, DNA shape-based read- out by TFs is not well captured by traditional PWMs [21] and can, therefore, result in missed target sites. Finally, Figure 5 De novo motif discovery accurately recovers known motifs. results from a recent study show that disease-associated Examples of sequence motifs that are enriched in Stretch Enhancer non-coding SNPs are not well captured by PWMs and (SE) DNase Hypersensitive Site (DHS) sequences or within DNase footprints (FP) across all ChromHMM concatenated enhancers (CEs). may instead be better explained by non-conical sequence Sequences of motifs derived from EXTREME (bottom) are aligned to determinants [22]. known motifs from two databases [20, 37] (top). Motif similarity was To further explore the motif landscape of enhancers, measured with TOMTOM, a tool that reports the significance of a we apply de novo motif discovery on DHS sequences in match between a query motif and a database motif [43]. Below each SEs, which we showed are highly enriched for cell spe- sequence logo, the cell type, sequence set, motif name, and signifi- cance of the TOMTOM match (E ) are displayed. cific regulatory motifs. We also apply de novo motif T Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 9 of 14 in close spatial proximity (Figure  6). In HUVEC enhanc- uncover some of the non-canonical sequence determi- ers, for example, we find a significant number of activa - nants that underly disease-associated SNPs. Similarly, tor protein 1 (AP-1) and ERG motif matches. AP-1 is a another previous study showed the sequence-specific heterodimeric protein that recognizes and binds to the constraints of some TFs can decrease as a function of enhancer heptamer motif 5′-TGA[CG]TCA-3′. ERG the number of co-occupying factors [27]. Although these is a subfamily of the ETS family of TFs, which have a motifs contain binding preferences of well-character- strong 5′-GGAA-3′ core binding sequence within their ized TFs, most of them are novel, lacking any database binding motifs. The ERG subfamily includes TFs such matches. As a resource to the community, we provide all as ERG and FLI1, which are known to be functionally de novo discovered motifs in MEME Minimal Motif For- active in HUVEC. Through our de novo motif analysis, mat (Additional file 4). we find these two classes of motifs are significantly co- enriched, but the frequency of the combination depends GWAS SNPs significantly overlap SEs and alter motifs on the relative orientation of these two motifs. Further- outside of DHSs more, sequence-specific constraints for the ERG binding Motivated by the co-occurrence of T2D GWAS SNP loci motif are relaxed when an AP-1 motif is nearby. These and islet SEs, we test whether GWAS SNPs associated results suggest a motif regulatory “grammar” governed with several diseases and traits are generally enriched by physical constraints that dictate the in  vivo spatial in SEs. A SNP locus consists of a lead SNP and all arrangements and frequencies of combinations of motifs, SNPs in strong LD with that lead SNP (r   ≥  0.8), care- which is consistent with a previous report [26], and may fully accounting for the possibility that the lead SNP is not causative but is instead in LD with the true, causa- tive SNP. Indeed, SNP loci are enriched both inside and outside of DHSs in SEs, replicating previous disease- and trait-associated SNP enrichment in cell-specific enhancer states [2, 28], including rheumatoid arthritis in GM12878 (Figure 7, Additional file 2: Figure S13). While we know that SEs are enriched for SNPs associated with complex diseases, the mechanism by which these SNPs contribute to disease risk is not clear. One reasonable mechanism is through altering TF binding sites in SEs. To assess this idea, we also test for GWAS SNP enrich- ment within cell identity motif sites in SEs. Interestingly, GWAS SNP loci are much more enriched and abundant in SE non-DHS cell identity motif sites than they are in SE DHS cell identity motif sites. We posit that our earlier proposition that SNPs are disrupting putative TF binding sites in less accessible chromatin portions of SEs may in fact be a prevalent mechanism for driving common dis- eases. A previous study suggested that most non-coding GWAS SNP loci intersect DHS regions [29], but did not consider motifs in SEs that reside outside DHSs, as we Figure 6 De novo motif discovery in enhancer footprint sequences do here. Our findings are consistent with another study reveals novel binding patterns of well-characterized TFs. Motifs of known activators in the HUVEC, K562, HSMM, and HepG2 cell lines that found enhancer-associated chromatin marks can be can co-occur together. For example, in the HUVEC enhancer footprint more informative for tissue-specific disease SNP enrich - sequences, the ERG motif, a member of the ETS family that is char- ments than DHSs can [30]. acterized by a “GGAA” binding paper, often co-occurs with the AP-1 Although our analysis provides a possible mecha- motif. In the presence of the AP-1 motif, the degree of resemblance nism for many disease SNPs, several SNPs are left unac- of a predicted site to the ERG motif is weaker. SPI1, another member of the ETS family, shares a similar relationship with the GATA1 motif in counted for. Notably, our motif sets are limited, including K562. In other examples, activator TFs appear to often homodimerize only motifs from two databases, which are far from and form palindromic motifs. Sequence logos of examples of de novo complete and are missing motifs that were discovered motifs in the cell types are displayed alongside, if available, CentriMo through our de novo motif discovery analysis, for exam- E-values and number of matches in SE DHS sequences (“Meth- ple. Our GWAS analysis also focuses on motifs of activa- ods”). For two of the HepG2 examples, the motifs are so infrequent that CentriMo failed to find a significant number of matches in SE tors, which excludes TFs like CTCF whose binding motif sequences. is slightly enriched in enhancers. We also do not consider Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 10 of 14 Figure 7 SEs show significant enrichment of GWAS SNPs associated with diseases or quantitative in a cell-specific manner. Positions of index and tightly linked (r ≥ 0.8) SNPs for different diseases or traits (y-axis) are overlapped with those of DHSs within SEs, non-DHSs with SEs, and motif sites in either DHSs or non-DHSs within SEs for each cell type (x-axis). Only motifs from the cell-specific motif sets used to generate the motif density −4 tracks in Figure 4 are considered for each cell type. Motif sites were identified with FIMO at the default threshold of p < 10 . Text in the boxes indi- cates the number of overlapping SNP loci in each cell type. Shading indicates the significance of SNP locus enrichment relative to a null distribution −8 (“Methods”). Only SNPs that meet the genome-wide significance threshold (p < 5 × 10 ) are considered. variants that generate new motif sites. A recent study, for the punctate DHSs, SEs contain dense clusters of motifs example, showed that a small insertion introducing bind- outside of their DHSs. Notably, the SE motif architecture ing motifs can lead to the formation of an aberrant onco- within and outside DHSs is significantly correlated, sug - genic super-enhancer [31]. Including these other pieces gesting an orchestrated mechanism of regulation across of information may account for the remaining SNP loci. the entire length of the element. Through de novo motif discovery analysis in the motif- Conclusions rich DNase I footprint sequences, we identified non- In this study, we analyzed the motif architecture of SEs, canonical binding sites and predicted which pairs of TFs which we hypothesized can distinguish SEs from TEs. bind adjacently more than expected by chance. Many of In general, enhancers are highly enriched for TF binding these composite motifs are not present in any motif data- sites, especially those corresponding to activators. SEs are base, having only been identified in this study, revealing characterized by multiple motif-rich DHSs in close spa- large gaps in current databases. tial proximity, unlike shorter TEs which typically have at Disease-associated SNPs identified through GWAS are most one DHS. SEs also display much higher H3K27ac known to co-localize with SEs in a cell-specific manner, ChIP-seq signal, a marker for active enhancers. This result but the exact mechanism by which these SNPs perturb complements a recent study which found that enhancer genome function is not well characterized. One pos- reporter activity from sequences in H3K27ac peaks sibility is that these SNPs are disrupting TF binding in within super enhancers is considerably stronger than regions of open chromatin. Indeed, we find that DHSs enhancer reporter activity from sequences in H3K27ac within SEs are modestly enriched for GWAS SNPs. Inter- peaks outside of super enhancers [32]. We conjecture that estingly, however, our most notable finding is that GWAS the exceptional length of SEs is in part due to the spatial SNPs more often co-localize outside of the SE DHSs, and coordination of accessible chromatin where clusters of motif sites that are within SEs but outside of DHSs cap- activators can bind. Expression of tissue-specific TFs, par - ture many of these SNPs. This finding may explain why ticularly activators, also correlates with presence of their a recent large study reported better GWAS SNP asso- binding sites in enhancers that are active in that same tis- ciations with chromatin-marked enhancers versus DHSs sue. Although these TF binding sites are most enriched in [30]. Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 11 of 14 The functional role of these motif sites in less accessible sequences were generated using the dinucleotide shuf- chromatin is not well-understood. They could represent fling script in the MEME Suite [35]. This measure of actual binding sites, despite lacking the extreme chro- enrichment allows direct comparison between sequence matin openness marked by DHSs. From an evolutionary sets of varying number of nucleotides. perspective, it is possible that common disease SNPs may We also performed PWM scanning using CentriMo be preferentially found in such sites because disruption [36], which computes the central enrichment of motifs. of TF binding in the most highly accessible chromatin For each motif, CentriMo finds an optimal score above would not be normally tolerated in the population due the minimum threshold 5 bits at which to call motifs. to selective pressure. Another possibility is based on the 1  kbp sequences centered on DNase-seq peak summits idea that SEs are lineage-specifying regulatory hubs that in SEs or TEs were extracted from the hg19 reference orchestrate a chromatin environment that is permissive genome as input sequences to CentriMo. CentriMo out- to rapid cellular response. The GWAS SNPs enriched in puts log adjusted p-values measuring the significance of motifs in SEs but outside DHSs could represent response motif enrichment in the center of sequences. elements that become active (and potentially hypersen- When scanning sequences with known motifs, we take sitive to DNase I) in different developmental stages or human or mouse motifs from the JASPAR 2014 [37] or under different physiologic conditions. Such a scenario the high-throughput SELEX [20] databases. We selected would be an efficient solution to tune dynamic changes these two databases because of the quality of their bind- rapidly and precisely in lineage-specific gene expres - ing models and their coverage of TFs. These two data - sion, consistent with the rheostat model of SE function bases contain 943 motifs altogether. that was recently proposed [33]. Testing these hypotheses will require additional TF profiling experiments, such as Motif enrichment heatmaps ChIP-seq and DNase-seq, across diverse environmental, Heatmaps plotting the enrichment of motifs in enhancer developmental, and genetic backgrounds. sequences were generated on a subset of the 943 motifs to focus on motifs that likely play an important role in Methods enhancers and improve the visualization of the heatmaps. DNase I hypersensitivity, chromatin state and gene From the original 943 database motifs, we selected motifs expression profiling that are highly (log fold enrichment >1.5 in at least one Chromatin states and gene expression were integrated cell type) and differentially (log fold enrichment range as previously described [3]. Sources of sequencing reads >0.75) enriched in SE DHS sequences. We also select from ChIP-seq and RNA-seq experiments used in the motifs that correspond to TFs that are expressed in any integrative analysis are found in the supplementary mate- one of the nine cell types (>2 RPKM). Finally, we further rials of [3]. Single-hit DNase-seq data from the ENCODE condense the motifs by selecting motifs whose enrich- Duke University group were used for the genome ment in SE DHS sequences correlates positively (ρ  >  0) browser shots and calling DHSs. DHSs were based on with the expression of the TF it corresponds to (Fig- narrow peak calls of single-hit DNase-seq data from the ure 2a), which we found to be an indication of activators. ENCODE Duke University group. Narrow peak files also The remaining motifs are ordered using agglomerative contain the genome coordinates of the summits. Wiggle clustering on the log fold enrichment values in SE DHSs. signal tracks for these single-hit data were also used for Agglomerative clustering was implemented using the the UCSC genome browser shots. aggExCluster method in the apcluster R package with the mutual pairwise similarities of data vectors as negative Motif scanning and calculating motif enrichment distances [38]. Visualization and clustering were based We performed position weight matrix (PWM) motif on log fold enrichment values instead of directly on fold scanning of FASTA sequence sets using FIMO [34]. Motif enrichment values due to the relatively large range of fold occurrences were called at the default p value thresh- enrichment values across all motifs (typically between −4 old of 10 . Based on empirical results, we found the 1 and 8). Using log values makes direct comparisons default threshold to be a good compromise between a between motifs more manageable across this range. stringent threshold that called very few motif sites and a relaxed threshold that called too many motif sites, as Assigning motifs to cells well as computationally efficient. Motif enrichment in a For each cell type, we generated a list of activator motifs set of sequences is calculated as the ratio of the number that are important in enhancers. These are the same of occurrences of the motif in the set of sequences rela- motifs that we used for our motif density tracks (Fig- tive to the number of motif occurrences in dinucleotide ure  4b, Additional file 2: Figure S11) and GWAS enrich - shuffled versions of the sequences. Dinucleotide shuffled ment analysis (Figure 7, Additional file 2: Figure S13). The Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 12 of 14 −5 selection procedure is similar to the filtering steps for at a threshold of p  <  10 , except for the GM12878 cell −10 the motif enrichment heatmaps. A motif is assigned to a line for which we used a threshold of p < 10 due to the cell type if its respective TF is expressed in that cell type deeper sequencing depth available. For the remaining cell (>2 RPKM), it is highly enriched in the central region line, NHEK, the sequencing depth of the available data- of the cell’s SE DHS sequences (CentriMo log adjusted set was too shallow to reliably call footprints with Wel- p-value < −50), and its enrichments in SE DHS sequences lington, so we extracted only high confidence footprints correlate positively with gene expression across the nine previously called by a Hidden Markov Model (HMM) cell types (ρ  >  0). If a motif has multiple versions in the algorithm [23], which were originally based in the hg18 databases, such as the DNA binding domain and full genome but later lifted over to the hg19 genome. transcript versions in the high-throughput SELEX data- Footprints are further processed prior to downstream base [20] we only select the motif with the lowest Cent- analysis by extending footprint genome coordinates riMo log adjusted p-value. The number of motifs in each equally on both ends by 5  bps (20  bps for HMM-based set ranges from as few as 25 for H1 ES to as many as 56 footprints) and then merging overlapping coordinates for HMEC and NHEK. using the BEDTools [42] mergeBed tool with parameter d = 10 (d = 40 for HMM-based footprints). Aggregate histogram plots Of the nine cell types considered in this study, only We used the Homer annotatePeaks tool [39] to generate the K562 cell line had enough data to apply all four foot- aggregate histogram plots documenting the dinucleotide printing methods on. Hence, we applied all four foot- frequency, motif density, and DNase-seq and H3K27ac printing methods on the K562 for comparison in the ChIP-seq tag density ±1,500  bps with 10  bp bins rela- downstream de novo motif discovery (see next subsec- tive to DNase-seq narrow peak summits overlapping tion). The thresholds and parameters were selected to SEs or TEs. ChIP tags were centered based on their esti- yield sequence datasets of 5–10 million bps, which we mated ChIP-fragment length. DNase tags were kept in found to be optimal for motif discovery. their original positions by fixing the estimated DNase- fragment length to 1. Mappability (36 bp) and GERP [12] De novo motif discovery aggregate histogram plots were similarly generated with De novo motif discovery was applied to SE DHS the bwtool aggregate command [40] instead. sequences and enhancer footprint sequences in all enhancers using the EXTREME algorithm [19]. FASTA De novo footprinting sequence sets were generated by extracting hg19 masked De novo footprinting requires deep sequenced DNase- genome sequences from BED file coordinates using seq data. Unfortunately, the data available are highly het- the BEDTools fastafrombed command [42]. SE DHS erogeneous. Not only are the deep sequence DNase-seq FASTA sequences were further preprocessed by replac- datasets available at varying sequence depth, they are ing instances of AAAAAAAA, ACACACAC, and their also generated by two different experimental protocols: reverse complements with capital N’s. Such repetitive a single-hit version [8] and a double-hit version [25] of subsequences are ubiquitous throughout the genome DNase-seq. Therefore, we adopted four different meth - and are often missed in the genome masking process. ods to call footprints de novo. Masked sequences were inputted to the EXTREME pipe- Four of the nine cell types considered in this study line with the parameters l  =  5 and q  =  0.02 (all other have deep sequenced double-hit DNase-seq data (K562, parameters were set to the default or recommended val- HSMM, HepG2, HUVEC). Of these four, three were pre- ues). We selected these parameters empirically based on viously processed by a de novo footprinting algorithm the quality of motifs generated. Although footprints were that searches DHSs for locations with high footprint generated from a variety of algorithms and experimen- occupancy scores, and genome coordinates of the foot- tal protocols, EXTREME can still find a consistent set prints were made openly available [24]. For the remaining of high-quality enhancer-associated motifs (Additional cell type, HUVEC, footprints were called on the double- file 2: Figure S12). Discovered motifs were compared to hit data with Wellington [41] at a stringent threshold of known motifs from the JASPAR 2014 [37] or the high- −20 p < 10 . throughput SELEX [20] databases using TOMTOM [43] For the remaining five cell types, footprints were called keeping only significant matches (E < 0.1). on deep sequenced single-hit DNase-seq data, but these datasets were not sequenced as deeply as the double-hit GWAS variant enrichment for the other four cell types. We called footprints in four We calculated GWAS variant enrichment exactly as we of the single-hit DNase-seq data with the “1D” version of did in our previous study of SEs [3] on an updated set Wellington on pooled replicates. Footprints were called of SNPs from the NHGRI GWAS catalog (http://www. Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 13 of 14 and R00DK099240-02 (to SCJP), the American Diabetes Association Pathway genome.gov/gwastudies/; downloaded on October 9, to Stop Diabetes Grant 1-14-INI-07 (to SCJP), and the National Institute of 2014). Briefly, we calculated enrichment by performing Biomedical Imaging and Bioengineering, National Research Service Award a permutation test that measures SNP loci and enhancer (EB009418) from the University of California, Irvine, Center for Complex Biological Systems (NIH R01HG006870, to DXQ). This material is based upon overlaps as previously described [2]. A SNP locus con- work supported by the National Science Foundation Graduate Research sists of a lead SNP and all SNPs in strong LD with that Fellowship under Grant No. (DGE-1321846, to DXQ). Any opinion, findings, lead SNP. SNPs in LD with the lead SNP were defined as and conclusions or recommendations expressed in this material are those of the authors(s) and do not necessarily reflect the views of the National Science those with r  ≥ 0.8. We ran 10,000 iterations of the per- Foundation. This research was supported (in part) by the Intramural Research mutation test and estimated the maximal P-value as the Program of the National Human Genome Research Institute, National Insti- number of permutations equal to or greater than the tutes of Health. observed overlap value plus one divided by the number Compliance with ethical guidelines of iterations plus one (10,001). Our enrichment analysis was performed on both the entire set of NHGRI GWAS Competing interests The authors declare that they have no competing interests. catalog SNPs per trait and a filtered subset of genome- −8 wide significant SNPs (p < 5 × 10 ) per trait. Received: 25 June 2015 Accepted: 1 July 2015 Additional files Additional file 1: Table S1. Staitistics of enhancers. Additional file 2: Supplementary Figures S1 to S13. References 1. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of Additional file 3: CentriMo scans of DHS sequences with database in vivo protein-DNA interactions. Science. 2007;316(5830):1497–502. motifs. Zip archive of text outputs from CentriMo scans on DHS sequences 2. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, et al. within SEs or TEs using database motifs. File names specify the cell type Mapping and analysis of chromatin state dynamics in nine human cell and enhancer class. Can be downloaded at http://fusion.nhgri.nih.gov/ types. Nature. 2011;473(7345):43–9. files/se-motifs/AdditionalFile3.zip . 3. Parker SCJ, Stitzel ML, Taylor DL, Orozco JM, Erdos MR, Akiyama JA, et al. Additional file 4: De novo motif disocvery results. Zip archive of de novo Chromatin stretch enhancer states drive cell-specific gene regula- motif discovery results. Discovered motifs are condensed in MEME Mini- tion and harbor human disease risk variants. Proc Natl Acad Sci USA. mal Motif Format. TOMTOM output HTML files documenting the compari- 2013;110(44):17921–6. son of the discovered motifs to known motifs and CentriMo output text 4. Loven J, Hoke HA, Lin CY, Lau A, Orlando DA, Vakoc CR, et al. Selective files detailing the enrichment of the discovered motifs in SE or TE DHS inhibition of tumor oncogenes by disruption of super-enhancers. Cell. sequences are also included. File names indicate the type of sequences 2013;153(2):320–34. used for the motif discovery, including the method of footprinting. Can be 5. Whyte WA, Orlando DA, Hnisz D, Abraham BJ, Lin CY, Kagey MH, et al. downloaded at http://fusion.nhgri.nih.gov/files/se-motifs/AdditionalFile4. Master transcription factors and mediator establish super-enhancers at zip. key cell identity genes. Cell. 2013;153(2):307–19. 6. Hnisz D, Abraham BJ, Lee TI, Lau A, Saint-Andre V, Sigova AA, et al. Super-enhancers in the control of cell identity and disease. Cell. 2013;155(4):934–47. Abbreviations 7. Yanez-Cuna JO, Arnold CD, Stampfel G, Boryn LM, Gerlach D, Rath M, SE: stretch enhancer; TE: typical enhancer; CE: concatenated enhancer; DHS: et al. Dissection of thousands of cell type-specific enhancers identities DNase I hypersensitive site; FP: footprint; TF: transcription factor; TSS: transcrip- dinucleotide repeat motifs as general enhancer features. Genome Res. tion start site; bp: base pair; T2D: type 2 diabetes; H3K27ac: histone 3 lysine 27 2014;24(7):1147–56. acetylation; GWAS: genome-wide association study; SNP: single nucleotide 8. Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, et al. High- polymorphism; LD: linkage disequilibrium; PWM: position weight matrix; resolution mapping and characterization of open chromatin across the HMM: Hidden Markov Model; GERP: Genomic Evolutionary Rate Profiling. genome. Cell. 2008;132(2):311–22. 9. Mortazavi A, Williams BA, McCue K, Schaeer L, Wold B. Mapping and Authors’ contributions quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. DXQ participated in study design, performed computational data analysis, and 2008;5(7):621–8. drafted the manuscript. MRE participated in study design and coordination. 10. Corradin O, Saiakhova A, Akhtar-Zaidi B, Myero L, Willis J, Cowper-Sallari SCJP helped conceive of the study, participated in study design and coor- R, et al. Combinatorial effects of multiple enhancer variants in linkage dination, helped perform computational data analysis, and helped draft the disequilibrium dictate levels of gene expression to confer susceptibility manuscript. FSC helped conceive the study, participated in study design and to common traits. Genome Res. 2014;24(1):1–13. coordination, and helped draft the manuscript. All authors read and approved 11. Ziller MJ, Edri R, Yae Y, Donaghey J, Pop R, Mallard W, et al. Dissecting neu- the final manuscript. ral differentiation regulatory networks through epigenetic footprinting. Nature. 2015;518(7539):355–9. Author details 12. Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Pro- Department of Computer Science, University of California, Irvine, Irvine, CA gram, Green ED, Batzoglou S, et al. Distribution and intensity of constraint 92697, USA. Center for Complex Biological Systems, University of California, in mammalian genomic sequence. Genome Res. 2005;15(7):901–13. Irvine, Irvine, CA 92697, USA. National Human Genome Research Institute, 13. Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, et al. Sequence National Institutes of Health, Bethesda, MD 20892, USA. Departments features and chromatin structure around the genomic regions bound by of Computational Medicine and Bioinformatics and Human Genetics, Univer- 119 human transcription factors. Genome Res. 2012;22(9):1798–812. sity of Michigan, Ann Arbor, MI 48109, USA. 14. Lelli KM, Slattery M, Mann RS. Disentangling the many layers of eukaryotic transcriptional regulation. Annu Rev Genet. 2012;46:43–68. Acknowledgements 15. Carey M, Lin YS, Green MR, Ptashne M. A mechanism for synergis- We thank Peter S. Chines, Narisu Narisu, and Brooke N. Wolford for helpful tic activation of a mammalian gene by GAL4 derivatives. Nature. advice. This work was supported by NIH grants 1-ZIA-HG000024 (to FSC) 1990;345(6273):361–4. Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 14 of 14 16. Rhodes J, Hagen A, Hsu K, Deng M, Liu TX, Look AT, et al. Interplay of PU.1 30. Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, and GATA1 determines myelo-erythroid progenitor cell fate in zebrafish. Bilenky M, Yen A, et al. Integrative analysis of 111 reference human epig- Dev Cell. 2005;8(1):97–108. enomes. Nature. 2015;518(7539):317–30. 17. Song L, Zhang Z, Grasfeder LL, Boyle AP, Giresi PG, Lee BK, et al. Open 31. Mansour MR, Abraham BJ, Anders L, Berezovskaya A, Gutierrez A, Durbin chromatin defined by DNaseI and FAIRE identifies regulatory elements AD, et al. Oncogene regulation. An oncogenic super-enhancer formed that shape cell-type identity. Genome Res. 2011;21(10):1757–67. through somatic mutation of a noncoding intergenic element. Science. 18. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, et al. 2014;346(6215):1373–7. The accessible chromatin landscape of the human genome. Nature. 32. Vanhille L, Griffon A, Maqbool MA, Zacarias-Cabeza J, Dao LTM, Fernandez 2012;489(7414):75–82. N, et al. High-throughput and quantitative assessment of enhancer activ- 19. Quang D, Xie X. EXTREME: an online EM algorithm for motif discovery. ity in mammals by CapStarr-seq. Nat Commun. 2015;6:6905. Bioinformatics. 2014;30(12):1667–73. 33. Hah N, Benner C, Chong LW, Yu RT, Downes M, Evans RM. Inflammation- 20. Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, et al. sensitive super enhancers form domains of coordinately regulated DNA-binding specificities of human transcription factors. Cell. enhancer RNAs. Proc Natl Acad Sci USA. 2015;112(3):E297–302. 2013;152(1–2):327–39. 34. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given 21. Zhou T, Shen N, Yang L, Abe N, Horton J, Mann RS, et al. Quantitative motif. Bioinformatics. 2011;27(7):1017–8. modeling of transcription factor binding specificities using DNA shape. 35. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L et al. MEME Proc Natl Acad Sci USA. 2015;112(15):4654–9. SUITE: tools for motif discovery and searching. Nucleic Acids Res. 22. Farh KKH, Marson A, Zhu J, Kleinewietfeld M, Housley WJ, Beik S, et al. 2009;37( Web Server issue):W202-8. Genetic and epigenetic fine mapping of causal autoimmune disease 36. Bailey TL, Machanick P. Inferring direct DNA binding from ChIP-seq. variants. Nature. 2015;518(7539):337–43. Nucleic Acids Res. 2012;40(17):e128. 23. Boyle AP, Song L, Lee BK, London D, Keefe D, Birney E, et al. High-resolu- 37. Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, tion genome-wide in vivo footprinting of diverse transcription factors in et al. JASPAR 2014: an extensively expanded and updated open-access human cells. Genome Res. 2011;21(3):456–64. database of transcription factor binding profiles. Nucleic Acids Res. 24. Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, et al. 2014;42(Database issue):D142–7. An expansive human regulatory lexicon encoded in transcription factor 38. Bodenhofer U, Kothmeier A, Hochreiter S. APCluster: an R package for footprints. Nature. 2012;489(7414):83–90. affinity propagation clustering. Bioinformatics. 2011;27(17):2463–4. 25. Hesselberth JR, Chen X, Zhang Z, Sabo PJ, Sandstrom R, Reynolds AP, et al. 39. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, et al. Simple Global mapping of protein-DNA interactions in vivo by digital genomic combinations of lineage-determining transcription factors prime cis- footprinting. Nat Methods. 2009;6(4):283–9. regulatory elements required for macrophage and B cell identities. Mol 26. Guo Y, Mahony S, Gifford DK. High resolution genome wide binding Cell. 2010;38(4):576–89. event finding and motif discovery reveals transcription factor spatial 40. Pohl A, Beato M. bwtool: a tool for bigWiggles. Bioinformatics. binding constraints. PLoS Comput Biol. 2012;8(8):e1002638. 2014;30(11):1618–9. 27. Siersbaek R, Baek S, Rabiee A, Nielsen R, Traynor S, Clark N, et al. Molecular 41. Piper J, Elze MC, Cauchy P, Cockerill PN, Bonifer C, Ott S. Wellington: a architecture of transcription factor hotspots in early adipogenesis. Cell novel method for the accurate identification of digital genomic foot - Rep. 2014;7(5):1434–42. prints from DNase-seq data. Nucleic Acids Res. 2013;41(21):e201. 28. ENCODE Project Consortium. An integrated encyclopedia of DNA ele- 42. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing ments in the human genome. Nature. 2012;489(7414):57–74. genomic features. Bioinformatics. 2010;26(6):841–2. 29. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, et al. 43. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying Systematic localization of common disease-associated variation in regu- similarity between motifs. Genome Biol. 2007;8(2):R24. latory DNA. Science. 2012;337(6099):1190–5. Submit your next manuscript to BioMed Central and take full advantage of: • Convenient online submission • Thorough peer review • No space constraints or color figure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution Submit your manuscript at www.biomedcentral.com/submit

Journal

Epigenetics & ChromatinSpringer Journals

Published: Jul 16, 2015

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off