Haystack: systematic analysis of the variation of epigenetic states and cell-type specific regulatory elements

Luca Pinello; Rick Farouni; Guo-Cheng Yuan

doi:10.1093/bioinformatics/bty031

Motivation: With the increasing amount of genomic and epigenomic data in the public domain, a pressing challenge is to integrate these data to investigate the role of epigenetic mechanisms in regulating gene expression and maintenance of cell-identity. To this end, we have implemented a computational pipeline to systematically study epigenetic variability and uncover regulatory DNA sequences.

Results: Haystack is a bioinformatics pipeline to identify hotspots of epigenetic variability across different cell-types, cell-type specific cis-regulatory elements, and associated transcription factors. Haystack is generally applicable to any epigenetic mark and provides an important tool to investigate the mechanisms underlying epigenetic switches during development. This software is accompanied by a set of precomputed tracks, which may be used as a valuable resource for functional annotation of the human genome.

Availability and implementation: The Haystack pipeline is implemented as an open-source, multiplatform, Python package called haystack_bio freely available at https://github.com/pinellolab/haystack_bio.

Contact: lpinello@mgh.harvard.edu or gcyuan@jimmy.harvard.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Epigenetic patterns are highly cell-type specific, and influence gene expression programs (Jenuwein and Allis, 2001). Recently, a large amount of epigenomic data across many cell types has been generated and deposited in the public domain, in part thanks to large consortia such as the Roadmap Epigenomics Project (Bernstein et al., 2010) and ENCODE (Dunham et al., 2012). These data sources offer unprecedented opportunities for systematic integration and comparison. In an earlier work (Pinello et al., 2014), we developed and validated a computational strategy to systematically evaluate cross-cell-type epigenetic variability and to identify the underlying regulatory factors of such variability. Here we provide an implementation of this strategy that automatically integrates multiple data types in an easy-to-use command line software. Our goal is to facilitate biologists' efforts at analyzing epigenetic data without the burden of coding, and to enable researchers to integrate their own sequencing data with information from the public domain.

2 Description

Haystack takes as input the genome-wide distributions of an epigenetic mark across multiple cell types or subjects (measured by ChIP-seq, DNase-seq, ATAC-seq or similar assays) as well as gene expression profiles quantified by microarray or RNA-seq. Users can start with publicly available preprocessed data or integrate their own data in the pipeline by providing BAM or bigWig files that can be generated by existing tools such as the ENCODE Uniform Processing Pipelines. Haystack's entire computational pipeline can be executed with a single command (i.e. haystack_pipeline). The pipeline is composed of three modules: haystack_hotspots, haystack_motifs, and haystack_tf_activity_plane. Each module is designed to carry out a distinct but related task (Fig. 1A), as described below.

2.1 Module 1. Discovery of hotspots and cell-type specific regions

haystack_hotspots identifies the hotspots of epigenetic variability, i.e. those regions that are highly variable for a given epigenetic mark among different cell types. The algorithm for identifying the hotspots was described previously in Pinello et al. (2014). Briefly, the input for the pipeline is a set of genome-aligned sequencing tracks for a given epigenetic mark in different cell types, in BAM or bigWig format. The haystack_hotspots module first assigns the sequence reads to non-overlapping bins of predetermined size (500 bp by default), and normalizes the data using a variance stabilization method followed by quantile normalization. It then quantifies the variability of the processed data signal in each bin using the variance-to-mean ratio. The most variable regions, according to this measure, are selected as hotspots (originally termed Highly Plastic Regions in Pinello et al. [2014]). The subsets of hotspot regions that have specific activity in a particular cell type are next identified, based on a z-score metric. Finally, an IGV (http://www.broadinstitute.org/igv/) XML session file is created to enable easy visualization of the results (Fig. 1B, Supplementary Fig. S1).

2.2 Module 2. Analysis of transcription factor motif

haystack_motifs identifies transcription factors (TFs) whose binding sequence motifs are enriched in a cell-type specific subset of hotspots. This module takes the output of haystack_hotspots as its input. Alternatively, the input may be a generic set of genomic regions; e.g. promoters for a set of genes of interest or cell-type specific enhancers. A motif database can also be specified (JASPAR [Mathelier et al., 2016] by default) to look for motif enrichment (the basic counting of each motif is based on the FIMO software; Grant et al., 2011), with use of random or C+G content matched genomic sequences as background. We find that the latter option is more appropriate for histone modifications. The output of this module consists of an HTML page (Supplementary Fig. S2) that reports each enriched motif, a series of informative parameters including the target/background ratio, the P-value (calculated with Fisher's exact test) and q-value, the motif logo, the central enrichment score, the average profile in the target regions containing the motif, and the closest genes for each region (Fig. 1C, Supplementary Fig. S2).

2.3 Module 3. Integration of gene expression data

Because different TFs may share similar sequence binding patterns, the exact regulator cannot be determined by motif enrichment analysis alone. Spurious association may also occur due to, e.g. overabundance of motif sequences. haystack_tf_activity_plane provides an additional filter to select for the most relevant TFs by further integrating gene expression data; it is based on the assumption that the expression level of a functional TF is correlated with the expression level of the target genes of hotspot regions. Such a relationship is visualized with the use of an activity plane representation (Fig. 1D). A detailed description of the TF activity plane plot and how it is generated is provided in Supplementary Material Section 3. Briefly, for each cell type, an activity plane plot (Supplementary Fig. S3) is generated for each enriched motif identified in that cell type by haystack_motifs. In this representation, the cell type of interest is marked with a red star. The farther the star is from the origin, the more cell-type specific is either the expression of the TF (x-axis) or its effect on nearby genes (y-axis). This allows us to capture how informative (as measured by the gene expression level of the TF) a particular TF is for a given cell type compared with other cell types. However, not all possible plots are generated by default; only those passing the following filters are reported: (i) the activity of the TF recapitulates changes in gene expression such that the value of the correlation of the TF with nearby genes exceeds a given threshold (default rho = 0.3) and (ii) the average gene expression is greater in the considered cell type such that the standardized gene expression values are positive (i.e. default z-score > 0). Earlier we showed that these filters are important for identifying factors that truly play a key role in mediating poised enhancer activities (Pinello et al., 2014).

Fig. 1. (A) Haystack overview: modules and corresponding functions. (B) Hotspot analysis on H3K27ac: signal tracks, variability track and hotspots of variability are computed from the ChIP-seq aligned data; the regions specific for a given cell type are also extracted. (C) Motif analysis on the regions specific for the H1hesc cell line: Pou5f1::Sox2 is significant; P- and q-value, motif logo and average profile are calculated. (D) TF activity for Sox2 in H1hesc (star) compared to the other cell types (circles); x-axis: specificity of Sox2 expression (z-score); y-axis: effect (z-score) on the genes near the regions containing the Sox2 motif.

3 Related methods

Several epigenomics software packages already exist that share Haystack's goals of identifying functional regulatory sequences or regulators involved in gene regulation. The main contribution of Haystack can be summarized by the following three general aspects of the pipeline:

(i) Haystack takes as input not just epigenomic data, but also genomic and transcriptomic data. The majority of available epigenomic tools are designed to work with one or two types of data. DeepChrome (Singh et al., 2016), by contrast, is an example of an integrative deep learning method that takes histone modification signal as input and gene expression as the output to be predicted. However, DNA sequence is not incorporated, and the histone signal is constrained to a small window around the transcription start site.

(ii) Haystack takes as input epigenomic data for a single epigenetic mark across multiple cell types and generates cell-type specific hotspot annotation tracks. In contrast, chromatin state annotation methods such as ChromHMM (Ernst and Kellis, 2012), Segway (Hoffman et al., 2012), diHMM (Marco et al., 2017) and Spectacle (Song and Chen, 2015) take as input epigenomic data for a single cell type across multiple epigenetic marks and annotate genomic regions into discrete chromatin states (e.g. enhancers, promoters) based on the patterns of marks in a single cell type. These annotated regions are not necessarily variable across cell types.

(iii) By computing cell-type specific enriched motifs using a central enrichment filter and incorporating gene expression data, Haystack generates a list of cell-type specific TFs. In contrast, Homer (Heinz et al., 2010) can find enriched or de novo motifs from a set of sequences but cannot perform central enrichment filtering, and DREME (Bailey, 2011) can be used only for de novo motif discovery and cannot calculate enrichment of known motifs. Neither method incorporates gene expression data.

A detailed comparison of related methods is presented in Supplementary Table S1.

4 Results

4.1 Analysis of H3K27ac data

To demonstrate Haystack's utility, we analyzed six ChIP-seq datasets from the ENCODE project (Dunham et al., 2012) for the histone modification H3K27ac (Fig. 1B). H3K27ac often marks active enhancers that promote the expression of nearby genes. We also integrated six RNA-seq assays to quantify gene expression for the same cell types. Figure 1 shows the output of the pipeline: Haystack not only recovers regions that are highly dynamic (variability and hotspots tracks in Fig. 1), but also regions that are specifically active in each cell type. Additionally, Haystack detects several TFs that are likely to play an important regulatory role in those regions (Supplementary Fig. S3). For example, for regions that are specifically active in the embryonic stem cell line (H1hesc), we found that the Pou5f1::Sox2 composite motif was highly enriched, and the expression of Sox2, a fundamental TF for embryonic stem cell identity, was highly specific and positively correlated with activity of the target genes.

4.2 Analysis of Roadmap Epigenomics Project

We applied the Haystack pipeline to data from the Roadmap Epigenomics Project using the maximal number of non-redundant cell types for which gene expression and epigenetic data were available (Supplementary Material Section 4). We provide precomputed analysis for H3K27ac (41 cell types), H3K27me3 (41 cell types), H3K4me3 (41 cell types) and DNase I hypersensitivity (25 cell types). These precomputed tracks provide a valuable resource for researchers interested in identifying functional elements in the human genome, exploring how epigenetic variability is controlled in different cell types, and uncovering regulatory sequences.

4.3 Reproducible results through Cloud and Docker support

To facilitate the use of Haystack without the need to access an intensive computational facility, we provide detailed instructions in the Supplementary Material on how to deploy and test Haystack on the Amazon Web Services cloud or similar services. We also provide a Docker image to make our tool more user-friendly and reproducible (see Supplementary Material Section 5).

5 Usage

The entire pipeline can be executed simply by running a single command. By default, the users need only to create a single description file that contains information about the data file paths (e.g. samples_names.txt) and the reference genome used in the analysis (e.g. hg19):

    haystack_pipeline samples_names.txt hg19

If the specified genome information and annotations are not available locally, they will be automatically downloaded from the internet. The haystack_pipeline command is equivalent to running haystack_hotspots followed by haystack_motifs and haystack_tf_activity_plane. A detailed description of the settings is provided in Supplementary Material Section 9. To illustrate the Haystack workflow, we also provide a walk-through example (see Supplementary Material Section 3) that reproduces the results described in Section 4.

Acknowledgements

We would like to thank Dr Stuart Orkin, Dr Jian Xu, Dr Nadin Rohland, Dr Kimberly Glass, Dr Eugenio Marco Rubio, Dr Jialiang Huang, Dr Assieh Saadatpour, Dr Jennifer Wu and Dr Kendell Clement for their helpful discussions and/or for testing the software.

Funding

This work was supported by National Institutes of Health award [R00HG008399 to L.P. and R01HG009663 to G.-C.Y.].

Conflict of Interest: none declared.

References

Bailey,T.L. (2011) DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics, 27, 1653–1659.
Bernstein,B.E. et al. (2010) The NIH roadmap epigenomics mapping consortium. Nat. Biotechnol., 28, 1045–1048.
Dunham,I. et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74.
Ernst,J. and Kellis,M. (2012) ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods, 9, 215–216.
Grant,C.E. et al. (2011) FIMO: scanning for occurrences of a given motif. Bioinformatics, 27, 1017–1018.
Heinz,S. et al. (2010) Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell, 38, 576–589.
Hoffman,M.M. et al. (2012) Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods, 9, 473–476.
Jenuwein,T. and Allis,C.D. (2001) Translating the histone code. Science, 293, 1074–1080.
Mathelier,A. et al. (2016) JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res., 44, D110–D115.
Marco,E. et al. (2017) Multi-scale chromatin state annotation using a hierarchical hidden Markov model. Nat. Commun., 8, 15011.
Pinello,L. et al. (2014) Analysis of chromatin-state plasticity identifies cell-type-specific regulators of H3K27me3 patterns. Proc. Natl. Acad. Sci. USA, 111, E344–E353.
Singh,R. et al. (2016) DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics, 32, i639–i648.
Song,J. and Chen,K.C. (2015) Spectacle: fast chromatin state annotation using spectral learning. Genome Biol., 16, 33.
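Illustrative note on Module 1. The hotspot selection summarized in Section 2.1 (variance-to-mean ratio per bin, then a z-score to find the cell types in which a hotspot is specifically active) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the haystack_hotspots implementation: it assumes the per-bin signal has already been normalized, it uses the population variance, and the function names, the top_fraction cutoff and the z_threshold value are hypothetical choices made for illustration.

```python
from statistics import mean, pstdev, pvariance


def variance_to_mean_ratio(values):
    """Variance-to-mean ratio of one bin's signal across cell types.

    Returns 0.0 when the mean is not positive, to avoid dividing by zero.
    """
    m = mean(values)
    return pvariance(values) / m if m > 0 else 0.0


def select_hotspots(bins, top_fraction=0.1):
    """Rank bins by variance-to-mean ratio and keep the top fraction.

    `bins` maps a bin name to its list of normalized signal values,
    one value per cell type. `top_fraction` is a hypothetical cutoff.
    """
    ranked = sorted(
        bins.items(),
        key=lambda item: variance_to_mean_ratio(item[1]),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * top_fraction))
    return [name for name, _ in ranked[:n_keep]]


def specific_cell_types(values, cell_types, z_threshold=1.5):
    """Cell types whose signal in this bin has a z-score above the threshold."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return []  # flat signal: no cell type stands out
    return [ct for ct, v in zip(cell_types, values) if (v - m) / s > z_threshold]


# Toy data: three bins, four placeholder cell types.
cell_types = ["A", "B", "C", "D"]
bins = {
    "bin1": [1.0, 1.0, 1.0, 1.0],  # constant signal
    "bin2": [8.0, 0.5, 0.5, 1.0],  # strongly variable, high in "A"
    "bin3": [2.0, 2.5, 1.5, 2.0],  # mildly variable
}
hotspots = select_hotspots(bins, top_fraction=0.34)
print(hotspots)
print(specific_cell_types(bins[hotspots[0]], cell_types))
```

In this toy input only bin2 varies strongly across cell types, so it is the one bin retained, and its signal is specific to cell type "A".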

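Illustrative note on Module 2. Section 2.2 reports, for each motif, a target/background ratio and a P-value calculated with Fisher's exact test. The sketch below shows how such a test can be computed from four counts using only the Python standard library. It is an assumption-laden illustration rather than the haystack_motifs code: the text does not say whether the test is one- or two-sided, so a one-sided (enrichment) test is assumed here, and the function names and example counts are hypothetical.

```python
from math import comb


def fisher_exact_enrichment(target_hits, target_total, background_hits, background_total):
    """One-sided Fisher's exact test for motif enrichment in target regions.

    Returns P(X >= target_hits), where X is hypergeometric: the number of
    motif-containing regions that fall in the target set when the
    motif-containing regions are distributed at random.
    """
    total = target_total + background_total
    hits = target_hits + background_hits
    denominator = comb(total, target_total)
    upper = min(hits, target_total)
    numerator = sum(
        comb(hits, k) * comb(total - hits, target_total - k)
        for k in range(target_hits, upper + 1)
    )
    return numerator / denominator


def target_background_ratio(target_hits, target_total, background_hits, background_total):
    """Ratio of motif frequency in target regions to that in background regions."""
    return (target_hits / target_total) / (background_hits / background_total)


# Hypothetical counts: motif found in 10 of 20 target regions
# and in 2 of 20 background regions.
p_value = fisher_exact_enrichment(10, 20, 2, 20)
ratio = target_background_ratio(10, 20, 2, 20)
print(f"ratio={ratio:.1f}, p={p_value:.4f}")
```

A q-value, as reported by the module, would additionally require a multiple-testing correction across all motifs tested; that step is not shown here.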
Publisher: Oxford University Press
Copyright: © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/bty031

Journal

Bioinformatics – Oxford University Press

Published: Jan 17, 2018
