Genome-wide prediction of cis-regulatory regions using supervised deep learning methods

Genome-wide prediction of cis-regulatory regions using supervised deep learning methods Background: In the human genome, 98% of DNA sequences are non-protein-coding regions that were previously disregarded as junk DNA. In fact, non-coding regions host a variety of cis-regulatory regions which precisely control the expression of genes. Thus, Identifying active cis-regulatory regions in the human genome is critical for understanding gene regulation and assessing the impact of genetic variation on phenotype. The developments of high-throughput sequencing and machine learning technologies make it possible to predict cis-regulatory regions genome wide. Results: Based on rich data resources such as the Encyclopedia of DNA Elements (ENCODE) and the Functional Annotation of the Mammalian Genome (FANTOM) projects, we introduce DECRES based on supervised deep learning approaches for the identification of enhancer and promoter regions in the human genome. Due to their ability to discover patterns in large and complex data, the introduction of deep learning methods enables a significant advance in our knowledge of the genomic locations of cis-regulatory regions. Using models for well-characterized cell lines, we identify key experimental features that contribute to the predictive performance. Applying DECRES, we delineate locations of 300,000 candidate enhancers genome wide (6.8% of the genome, of which 40,000 are supported by bidirectional transcription data), and 26,000 candidate promoters (0.6% of the genome). Conclusion: The predicted annotations of cis-regulatory regions will provide broad utility for genome interpretation from functional genomics to clinical applications. The DECRES model demonstrates potentials of deep learning technologies when combined with high-throughput sequencing data, and inspires the development of other advanced neural network models for further improvement of genome annotations. Keywords: cis-regulatory region, Enhancer, Promoter, Deep learning Background transcription factors (TFs), help specify the formation of In this article, we apply deep supervised analysis methods diverse cell types and respond to changing physiological to identify the positions of active cis-regulatory regions conditions. While gene expression is ultimately a reflec- (CRRs), including both enhancers and promoters, across tion of regulation across multiple processes, the key role the human genome. CRRs play a crucial role in precise of promoters and enhancers has been a central focus of control of gene expression. Promoters and enhancers act genome annotation for the past decade. The investment via complex interactions across time and space in the in generating informative data for the detection of these nucleus to control when, where and at what magnitude regions has been immense, in part motivated by the antici- genes are active. CRRs, through interactions with proteins pation that advanced computational approaches would be such as histones and sequence-specific DNA-binding able to transform the data into a reliable annotation of the genome. Promoters and enhancers were early discoveries during *Correspondence: wyeth@cmmt.ubc.ca Centre for Molecular Medicine and Therapeutics, BC Children’s Hospital the molecular characterization of genes. While promot- Research Institute, Department of Medical Genetics, University of British ers specify and enable the positioning of RNA polymerase Columbia, Rm 3109, 950 West 28th Avenue, V5Z 4H4 Vancouver, Canada machinery at transcription initiation sites, enhancers Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Li et al. BMC Bioinformatics (2018) 19:202 Page 2 of 14 modulate the activity of promoters from linearly distal prediction of enhancers that are defined by p300 bind- locations away from transcript initiation sites [1, 2]. The ing sites overlapping with DNase-I hypersensitive sites delineation between the classes has become increasingly and distal to annotated TSS. Chen et al. applied multi- challenging, with some literature suggesting the two cate- nomial logistic regression with LASSO regularization gories are the edges of a continuous spectrum of CRRs [3]. to find key features for the classification of stem cell- specific functional enhancer regions [18]. Using STARR- Indeed, it has long been observed that sequences flanking transcription initiation regions can function as enhancers seq data, a new experimental approach for screening (promoter-proximal regions), and in recent years, it has candidate enhancer sequences [19], dinucleotide repeat been observed that there are transcripts initiated at the motifs (DRMs) were found to be enriched in broadly edgesofactiveenhancers [4, 5]. For the purpose of this active enhancers, leading to a proposition that a small set report, we address the two as distinct classes, but discuss of TF binding site motifs and DRMs might be sufficient the relationship between our findings and the continuous for enhancer prediction [20]. class model. New laboratory methods are emerging, providing a The use of computational methods to detect the loca- refined resolution of CRR locations. The majority of tions of promoters and enhancers has been a key focus of human DNA is transcribed, producing diverse types of bioinformatics for twenty years (see reviews [6, 7]). With RNA. In particular, transcripts generated at the edges the advances of experimental procedures for profiling the of enhancers, enhancer RNAs (eRNAs), allow for the properties of chromatin and RNA transcripts, a new wave experimental readout of active regulatory regions. Global of methods has arrived. Given the small set of reliable run-on and sequencing (GRO-seq) protocols [21]mea- enhancer annotations, it was appropriate that the first sure the 5’-end of nascent RNAs revealing the divergent among these methods used unsupervised learning. For transcriptional signature of both transcriptionally active instance, both ChromHMM [8]andSegway [9]segment promoters and enhancers [5]. Using GRO-seq signals, a the genome into sequence classes based on ENCODE support vector regression model (dReg) was developed project data [10], such as histone modification ChIP-seq to predict active transcriptional regulatory elements [22]. (chromatin immunoprecipitation followed by sequencing The cap analysis of gene expression (CAGE) technique [11]) signals. Such unsupervised methods infer hidden [23] captures the 5’-end of RNA transcripts, enabling a states based on observed signals, and then associate an precise determination of transcript initiation sites. Using element to each hidden state. The states are subsequently CAGE, the FANTOM5 Consortium has identified an atlas labelled with biological functions based on enrichment of transcriptionally active promoters [24] and a permissive for known examples. A test of predicted Enhancers for set of 43,011 transcriptionally active enhancers character- the K562 leukemia cell line by the Combined method ized by bidirectional eRNAs [4] across hundreds of human (unifying ChromHMM and Segway annotations) [12] cell types and tissues. These enhancers were validated using a high-throughput reporter gene assay [13]revealed with high success rates ranging from 67.4 to 73.9% [4]. that only 26% of predicted enhancers have regulatory Compared to protein-coding RNAs, eRNAs are believed activity [14]. The assessment showed that the predicted to degenerate quickly, and only a small number of tis- Weak Enhancers, a class associated with lower H3K27ac sues have been explored with sufficient depth to reveal and H3K36me3 signals, unexpectedly drove higher gene eRNAs. While the FANTOM enhancer set is therefore expression than the predicted Enhancers. It is evident that incomplete, it provides a uniquely large inventory of high- improvements are needed, potentially involving the use of quality enhancers to use for the training of machine learn- additional experimental features and alternative machine ing approaches. An ensemble support vector machine learning approaches. method suggested the potential to distinguish enhancers Despite the limited set of precisely annotated active based on such data [25]. enhancers, supervised machine learning models have We have previously proposed and herein present the been attempted to predict enhancer regions. In each case, use of a deep feature selection (DFS) model for the a distinct definition of a suitable positive training set of supervised prediction of CRRs [26]. Deep learning is a enhancers was taken. A random-forest method was used dramatic advance in the frontier of artificial intelligence in [15] to classify TF bound regions with a focus on [27–29]. Unlike widely used linear models, deep learning observed binding patterns, generating sets of two-class approaches model complex systems and capture high- classifiers to distinguish regions based on binding activity level knowledge from data. Driven by big and rich data, and position relative to promoter regions. A random- deep learning has been successfully applied in various forest based enhancer classification method was devised areas such as automatic image annotation and speech in [16] with histone modification ChIP-seq data as fea- language processing [30]. Bioinformaticians have started tures, using p300 bound regions as the basis for training. using this powerful tool for next-generation sequencing An AdaBoost-based model was proposed in [17]for the data mining, such as predicting the impact of variations Li et al. BMC Bioinformatics (2018) 19:202 Page 3 of 14 on exon splicing [31] and the effects of noncoding variants transcript initiation events are observed in the tissue of on chromatin [32], detecting TF binding patterns [33], and focus, while inactive refers to regions detected in other tis- predicting protein secondary structures [34]. sues, but not in the focus tissue. We recorded the mean Our study stands on three important legs. First, the class-wise rate (i.e. averaged sensitivities of all classes), precisely annotated FANTOM promoters and enhancers, area under the receiver operating characteristic curve which provide the largest experimentally defined collec- (auROC), and the area under the precision-recall curve tion of CRRs. Second, the ENCODE project genome-wide (auPRC) in Fig. 1 and Additional file 1:FigureS1. feature data, such as histone modifications, TF binding, There are four aspects of the results that we highlight, RNA transcripts, chromatin accessibility, and chromatin which affirm the capacity of our supervised deep learn- interactions. Third, deep learning methods to distinguish ing approach to distinguish between classes of CRRs and CRRs based on the available data. We unite the three background. First, we are able to distinguish between components to create the DECRES (DEep learning for active enhancers and promoters (A-E versus A-P) (Fig. 1a). identifying Cis-Regulatory ElementS and other applica- We used A-E and A-P as positive and negative train- tions) model, with which we identify the most compre- ing classes, respectively. Overall, we found that A-E and hensive collection of CRRs across the human genome yet A-P are highly separable. Second, we can distinguish compiled. active and inactive CRRs (either enhancers or promot- ers). From Fig. 1b and Additional file 1:FigureS1A,it Results can be observed that mean auPRCs on GM12878, HelaS3, Deep learning accurately distinguishes active enhancers HepG2, and K562, which have the largest training sets, are and promoters from background above 0.95 with small variances for both enhancers and We investigated the capacity of deep learning models promoters. In the rest of this paper, we exclude A549 and to separate enhancers and promoters, and to distinguish MCF7 cell lines in most analyses due to limited data avail- them from other regions and between activity states. We ability. Third, not unexpectedly, it is difficult to distinguish trained a deep feedforward neural network over our bal- between inactive enhancers and promoters (Additional anced labelled training sets to predict our (unbalanced) file 1: Figure S1B). Seven of the mean class-wise rates for test sets from each well-characterized cell type, repeating the eight cell types were lower than 0.80. While there are the procedure 100 times. The deep model takes experi- some indications that a portion of inactive promoters have some machinery present, it was our expectation that such mentally derived features over genomic regions as inputs regions will largely not exhibit strong transcription fac- and outputs class labels of these regions with probabili- tor binding or appropriate epigenetic signatures to inform ties (see Additional file 1: Table S1 for the total number of a model. Fourth, we tested the applicability of predicting samples of each class and Additional file 1:Table S2 forthe number of available features; see Methods). For narrative A-E and A-P from the super background (BG) class merg- convenience, hereafter we refer to active enhancer, active ingI-E,I-P,A-X,I-X,and UK(Fig. 1c). The results on promoter, active exon, inactive enhancer, inactive pro- six cell types were promising, all exceeded 0.80 auPRC. moter, inactive exon, and unknown (or uncharacterized) If A-E and A-P are merged further to form a super class region as A-E, A-P, A-X, I-E, I-P, I-X, and UK, respectively. (A-E+A-P), higher performance is achieved (Additional Under the assumption that active CRRs are undergoing file 1: Figure S1C). All auPRCs on these six cell types transcription, active applies to regions in which CAGE went beyond 0.89 auPRC. Furthermore, we also tested a ab c Fig. 1 Mean performance and standard deviation of 100 runs using the MLP model on our respectively sampled train-test partitions of eight cell types. a Classification performances of A-E versus A-P. b Classification performances of A-E versus I-E. c Classification performances of A-E versus A-P versus BG. MLP: Multilayer Perception, RF: Random Forest, A-E: Active Enhancer, A-P: Active Promoter, A-X: Active Exon, I-E: Inactive Enhancer, I-P: Inactive Promoter, I-X: Inactive Exon, UK: Unknown or Uncharacterized, BG: I-E+I-P+A-X+I-X+UK Li et al. BMC Bioinformatics (2018) 19:202 Page 4 of 14 random forest method, another state-of-the-art classifier, unsupervised ChromHMM and ChromHMM-Segway on our labelled data. Similar performance was obtained on Combined methods [8, 12] using FANTOM annotations all six experimental settings. The random forest method on five available cell types as reference. They were com- exhibited slightly better performance for A549 and MCF7 pared on unbalanced sets reflecting the true genomic datasets, which both have low numbers of enhancers. In background. The results are compared in Fig. 2a which expectation that more annotated enhancers are becoming displays radar charts where the larger and more con- available, we will continue using MLP and exploring other vex the area is, the better the performance. It is intuitive deep learning approaches such as convolutional neural that supervised approaches are preferred when labelled networks and recurrent neural networks. training data is sufficient. Furthermore, both unsuper- vised methods were developed prior to public release DECRES gives higher sensitivity and precision on FANTOM of the FANTOM5 data and are therefore at a disad- annotated regions vantage. However, these annotations are widely used by To assess the relative utility of our supervised deep the community and hence the relative performance of method for CRR prediction, we compared it with the DECRES to the standard is of interest. Overall, we observe Fig. 2 Comparison of the supervised method (DECRES) and unsupervised methods (ChromHMM and Combined) on five FANTOM annotated test sets in radar charts (a) and significance tests (b). The ENCODE segmentations were downloaded from [66]. We relabelled the annotations of ChromHMM and Combined. For ChromHMM segmentations, the Tss, TssF, and PromF classes were merged to A-P; the Enh, EnhF, EnhW, EnhWF classes were merged to A-E; and the rest were denoted by BG. When processing the Combined annotations, TSS and PF were relabelled to A-P; E and WE were relabelled to A-E; and the rest to BG. The p-values in (b) were obtained from two-tailed Student’s t-test on all cell types. The signs of statistic values are indicated in brackets Li et al. BMC Bioinformatics (2018) 19:202 Page 5 of 14 that DECRES outperforms ChromHMM and Combined DECRES predicted all to be negative (including the 16 methods which in turn deliver similar performance. that were active in the CRE-seq experiment). Importantly, These unsupervised methods consistently have lower as the DECRES scores rise, the quality of the predictions sensitivities for active enhancer detection (p = 5.57E-5 increase. We drew the histogram of DECRES membership and 9.90E-5 for DECRES versus ChromHMM and Com- scores of 254 and 433 experimentally positive and neg- ative Combined enhancers that were predicted as A-Es bined respectively, two tailed Student’s t-test; see Fig. 2b) and lower precision for active promoter detection (p = by DECRES (Additional file 1: Figure S2). The distri- 7.36E-5 and 2.33E-4 for DECRES versus ChromHMM butions are significantly different (p = 0.014, two-sided and Combined respectively, two tailed Student’s t-test; Mann-Whitney rank test). see Fig. 2b). Using ChromHMM, the active enhancer The second independent collection, in which K562 sensitivity ranges from 16.5% to 48.4% (numbers are con- and HepG2-specific “strong enhancer” (as predicted by sistent with the test on ENCODE predicted enhancers ChromHMM) containing predicted TF binding sites for reported in [14]), while our deep model ranges from cell-selective TFs were tested using a massively parallel 69% (K562) to 88.8% (GM12878). Moreover, ChromHMM reporter assay (MPRA) [35]. Only 41% of the enhancers achieves a maximum precision of 49.8% for active pro- were detected to be significantly expressed (p = 0.05, moter prediction, while the maximum for DECRES two-sided Mann-Whitney rank test). We used DECRES to is of 84.3%. predict the classes of the MPRA positive and MPRA neg- ative enhancers. Our result in Additional file 1:Table S3 Evaluation of DECRES performance with independent shows that 98.4% (120/122) and 97.8% (182/186) of the experimental data MPRA positive enhancers were respectively predicted to As the initial evaluation focused on FANTOM eRNA- be A-Es by DECRES for K562 and HepG2 cells, while based annotation of CRRs, the type of data used to train 92.3% (179/194) and 81.3% (217/267) of the MPRA nega- our supervised model, we sought to assess performance tive enhancers were still predicted as A-Es for K562 and on data generated by alternative methods. We identi- HepG2, respectively, but with different distributions of fied two independent collections of laboratory validated DECRES scores (p = 4.8E-6 and p = 2.3E-6 for K562 and enhancers to further assess the performance of DECRES: HepG2 respectively, two-sided Mann-Whitney rank test) a CRE-seq collection of regions tested in K562 cells [14] (Additional file 1: Figure S2). Consistent with the other independent data, the higher the DECRES scores the more and MPRA (massively parallel reporter assay) collections likelythey aretobepositive. tested in K562 and HepG2 cells [35]. In both instances, the set of regions that fail to direct expression may be falsely predicted by the assessed methods, but may also reflect Assessing the utility of DNA sequence properties on the the facts that the experimental procedures only include performance of DECRES a small segment of regulatory DNA and that plasmid- Recent studies confirmed that DNA sequence proper- based assays do not recapitulate chromatin properties. ties can be useful for the recognition of promoters and Given the nature of the data, we anticipate a portion of the enhancers [3, 5, 25], and the discrimination between experimental negatives to be bona fide regulatory regions. active and inactive regulatory sequences [36, 37]using In the first independent set, subsets of predicted K562 string sequence kernels. This builds on the long- enhancers and negative regions (as predicted by the Com- recognized capacity for the inclusion of CpG islands as bined ChromHMM and Segway method) were assessed features to improve promoter prediction [38]. We sought in the laboratory using CRE-seq [14]. In that study, only to determine if DNA sequence features can be informa- 33% of the “Combined” predicted regulatory regions were tive to distinguish between promoters and enhancers, found to be positive in the experiment, compared to 7% and between active and inactive classes. We trained the for the negative set. Using DECRES trained on all available model with 351 sequence features (originally used in [25]) active regulatory regions of K562 cells, we therefore vali- in multiple scenarios. Results are displayed in Fig. 3 and dated our method on 386 regions showing active enhancer Additional file 1: Figure S3. First, a deep method restricted activity in K562 as validated by CRE-seq compared to the to sequence features for discriminating A-E and A-P 298 control regions (Additional file 1: Table S3). Highly (Fig. 3a) delivered auPRCs from 0.8567 to 0.9370, con- consistent with the results above, a sensitivity of 65.5% firming that sequence attributes are indeed informative. (254/386) for the experimentally validated regions were Second, sequence features have a limited utility for distin- successfully predicted as A-E; the remaining 132 regions guishing between active and inactive states of enhancers were predicted as background (none were classified as and promoters, which is logical; while the experi- promoters). For the 812 tested predictions that were inac- mentally derived features could highly separate them tive in the CRE-seq experiment, DECRES classified 53.3% (p = 1.90E-08 and 5.06E-08 for enhancers and promoters (433/812) as positive. For the 298 negative control regions, respectively, two-tailed Student’s t-test; see Fig. 3b and Li et al. BMC Bioinformatics (2018) 19:202 Page 6 of 14 ab c Fig. 3 Comparing the mean auPRCs over 100 resampling and retraining on our labelled regions using different feature sets. “Experimental” means our experimentally derived next generation sequencing feature set. “Sequence” means the set of 351 sequence properties used in [25]. “Experimental+Sequence” means the combination of these two sets. a. Comparison of the three feature sets in A-E versus A-P. b.Comparisonofthe three feature sets in A-E versus I-E. c. Comparison of the three feature sets in A-E versus A-P versus BG. The p-values in each legend were obtained using two-tailed Student’s t-test to compare “Experimental”-based results with “Experimental+Sequence”-based and “Sequence”-based results, respectively Additional file 1: Figure S3A). Using sequence features and promoters from the background in GM12878 cell in the absence of experimental features has a lower line. In some cases, the different measures complement performance in classifying A-E, A-P and BG across all each other. For instance, H3K4me2 and H4K20me1 are eight cell types (p = 1.86E-09, two-tailed Student’s t-test; marked as key features by the randomized DFS, which see Fig. 3c). Finally, better results were not achieved is convincing as indicated by the box plots in Additional by combining experimental and sequence features file 1: Figure S4B and Figure S6-S13, but are overlooked (p = 2.79E-01, 6.56E-01 and 1.17E-01 in Fig. 3, two-tailed by random forest. Tbp was highlighted by random forest Student’s t-test). in GM12878 and HelaS3 cells, but was not picked up by randomized DFS. Examining the box plots of this feature Key features for DECRES performance in Additional file 1: Figures S6 and S7 reveals that this As experimental data can be time consuming and expen- feature is discriminative to distinguish active enhancers sive to produce, we sought to determine the minimal set and promoters from background, but there is not a dra- of features most informative for CRR prediction from matic difference between active enhancers and promoters. a computational perspective. We used randomized deep Important features incorporated into a random forest model may not be incorporated until a latter stage of the feature selection (randomized DFS or RDFS) and ran- DFS process. For instance, in K562 cell line, C-Myc was dom forest (RF) models (see Methods) for two-class emphasized by random forest, which is indeed reason- [A-E+A-P (or CRR) versus BG] and three-class (A-E ver- sus A-P versus BG) classifications on four cell types able as shown in Additional file 1: Figure S12 and was not (GM12878, HelaS3, HepG2, and K562) which have 72-135 selected as an initial feature in the DFS process. features available. For the development of machine learning methods in Figure 4a and Additional file 1: Figure S4A display the genome annotation, minimizing the number of features feature importance scores discovered by randomized DFS required decreases cost and increases the capacity for and random forest for the three-class classification. The biological interpretation. Figure 4b and Additional file 1: feature importance scores produced by these methods Figure S5B show the changes of test auPRCs as the num- should be interpreted differently. Similar to a forward bers of selected features increase for the three-class and selection, the feature importance scores from randomized two-class classifications, respectively. In both cases, test DFS reflect which features are preferred in the early stage auPRCs increase dramatically for the initial features, then of the sparse model, while the importance score of a fea- performance plateaus. Comparing the randomized DFS ture by random forest indicates the role of this feature in curves with the random forest curves, we can see that the context of its use with all other features. Thus, using thereisnosingleoptimal curve. Afewkeyfeaturesare both methods in this study enables us to gain different sufficient for a good prediction performance. To define insights into the data. In our experiments, both meth- an optimal number of features needed, we fit the curves ods can capture the most important features as indicated in Fig. 4b and Additional file 1: Figure S5B and selected by importance scores across all four cell lines. For exam- the intersection point for a line with slope of 0.5 on ple, both methods agree that Pol2, H3K4me1, Taf1, and the randomized DFS curves (see Methods). Fewer fea- H3K27ac are useful for distinguishing active enhancers tures are needed for two-class CRR prediction (6 features) Li et al. BMC Bioinformatics (2018) 19:202 Page 7 of 14 Fig. 4 Feature importance and classification performance in the 3-class (A-E versus A-P versus BG) scenario. a Feature importance discovered by randomized DFS (RDFS) and random forest (RF) on GM12878. The random forest’s feature importance scores were normalized to [0,1] for better comparison with randomized DFS. b auPRC versus the number of features incorporated into the RDFS and RF. The annotated points indicate where a line with slope 0.5 intersects a fitted curve compared to three-class models intended to distinguish three-class models, in agreement with existing knowledge between A-E, A-P and background (10 features). [2, 3, 39, 40]. Among transcription factors (including co- The distributions of the top ten features for three-class factors), Taf1 and p300, as well as RNA polymerase II predictions (A-E, A-P, and BG) are given in Additional (Pol2), are frequently selected, which is also consistent file 1: Figure S4B. Using the top ten features for each with existing knowledge [41, 42]. cell, auPRCs of 0.9022, 0.9156, 0.8651, and 0.8565 were Additional file 1: Figure S5C shows box plots of the top achieved on GM12878, HelaS3, HepG2, and K562, respec- six selected features by randomized DFS for two-class pre- tively. Half of these top features are histone modifica- dictions. Using these features, auPRCs of 0.9561, 0.9627, tions, of which H3K4me1, H3K4me2, H3K4me3, and 0.926, and 0.9555 were obtained on the four cell types, H3K27me3 were commonly selected features for the respectively. For most features, the ranges of values are Li et al. BMC Bioinformatics (2018) 19:202 Page 8 of 14 elevated in A-E and A-P relative to the background cat- We investigated how many among our genome-wide egories. Half of the selected features are DNase-seq and predictions are supported by the VISTA enhancer set [43]. histone modification ChIP-seq data including H3K4me2, Despite the fact that the majority of the VISTA enhancers H3K27ac, and H3K27me3. The box plots of these fea- are extremely conserved across development, we still find tures indicate that they distinguish A-E and A-P from that 37.1% (850/2,293) of experimentally confirmed and unconfirmed VISTA enhancers overlap with the predicted background [2, 39, 40]. A-Es, while merely 4.8% (110/2,293) of these VISTA The majority of DECRES’s genome-wide predictions are enhancers overlap with the predicted A-Ps. Results for supported by other methods experimentally confirmed VISTA enhancers are similar We trained 2- and 3-class multilayer perceptron (MLP) (482/1,196 = 40.30% and 60/1,196 = 5.02% overlap A-Es models (see Methods) using all reference (labelled) data and A-Ps, respectively), which suggests that our predicted for training, in order to predict CRRs across the entire active enhancers have real enhancer functions. A propor- genome for six cell types (A549 and MCF7 were excluded). tion of the VISTA enhancers not overlapping our predic- The 2-class model identified 227,332 CRRs (adjacent tions could be active specifically during development or in regions were merged), which occupy 4.8% of the genome other cell types than our focus cell lines. (Additional file 1: Table S4). A total of 9153 CRRs were ubiquitously predicted across all six cell types. For the DECRES extends the FANTOM enhancer atlas 3-class prediction, we obtained 301,650 A-E regions (6.8% Due to the limited depth of CAGE signals for eRNAs, a of the genome) and 26,555 A-P regions (0.6% of the portion of active (or transcribed) enhancers will not have genome) together with 11,886 ubiquitous A-Es and 3678 been detected in the original compilation of the enhancer ubiquitous A-Ps. The genome-wide predictions for all six atlas. Hence, we sought to identify additional partially cell types are available in Additional file 2. supported enhancers for which eRNA signals were below Next, we examined the overlap of our predicted CRRs the original atlas threshold settings [4]. In the previous with the Combined [12]anddReg[22] predictions on work, a total of 200,171 bidirectionally transcribed (BDT) GM12878, HelaS3, and K562. The majority of CRRs loci were detected across the human genome, using CAGE predicted by DECRES overlap with the results from tags of 808 cell types and tissues. After excluding BDT loci either Combined or dReg, specifically 86.13%, 76.13%, within exons, a partially supported set of 102,021 BDT and 83.63% for GM12878, HelaS3, and K562, respec- regions remained, of which 43,011 balanced loci (simi- tively (Fig. 5). A subset (13.87% on GM12878, 23.87% lareRNAlevelsonbothsides) constitute theFANTOM enhancer atlas [4]. In order to investigate whether more on HelaS3, and 16.37% on K562) of DECRES predic- tions do not overlap with predictions from the other active enhancer candidates can be detected for each of the two tools. Notably, a large portion of the Combined pre- six cell types, we trained a MLP on its active atlas regions, dictions (56.78% on HelaS3, 55.99% on GM12878, and and predicted classes for all 102,021 BDT sites. Among 36.36% on K562) do not overlap with those from the the 102,021 BDT loci, most were classified as negative supervised methods, which is consistent with its low regions in a given cell (Additional file 1: Table S5), while observed validation rate [14]. Furthermore, DECRES pre- on average 13,316 were predicted as A-Es and only 834 dictions tend to have a finer resolution for both A-P were predicted as A-Ps per cell type. A substantial num- and A-E regions (see Additional file 1:FigureS14 foran ber (6535 on average) of inactive enhancers in the original example). enhancer atlas were predicted as active by our model ab c Fig. 5 Agreements of the DECRES CRRs with the Combined and dReg CRRs on three cell types (a: GM12878, b:HelaS3, c: K562), respectively. The TSS, PF, E, and WE segmentations from Combined were relabelled to CRRs. The active transcriptional regulatory elements (TREs) predicted by dReg were renamed to CRRs Li et al. BMC Bioinformatics (2018) 19:202 Page 9 of 14 (Additional file 1: Table S6), consistent with the assump- file 1: Figures S17 and S18) [50]. HNF1A, HNF1B, FOXA1, tion that BDT data is incomplete for any given sample. On FOXA2, HNF4A, and HNF4G factors regulate liver- average 5514 BDT loci excluded by the original atlas, were specific genes (Additional file 1: Figures S19 and S20) predicted as A-Es per cell type. Over the six analyzed cell [51, 52]. NFY factors cooperate with GATA1 to mediate types, a total of 38,601 BDT loci were predicted as A-Es erythroid-specific transcription in K562 (Additional file 1: Figures S25 and S26) [53]. (Additional file 3), of which 16,988 represent an expan- sion of the original FANTOM enhancer atlas. Note that We performed functional and enrichment analysis on 21,398 out of 43,011 enhancers from the original FAN- the A-E and A-P predictions from the Combined method TOM enhancer atlas are not predicted as active in the six [12], and report the results in Additional file 1:Figures cells analyzed here, but these regions may be active in the S27-S30. Most of the predicted promoters by the Com- other 802 cells for which there are inadequate features to bined method are distal to known gene TSSs, which is analyze. similar to enhancers. For instance on cell line GM12878, only 22% of the Combined promoters are located less than Computational validation of DECRES’s prediction using 5 kbp to the annotated gene TSSs, compared to 47% of the functional and motif enrichment analysis DECRES promoters. Moreover, functional analysis on the We performed functional enrichment analysis on the CRRs predicted by the Combined method returned much genome-wide predicted A-Es and A-Ps using GREAT [44]. less or zero significant terms for GO biological process, For GM12878 cells, 79% of predicted enhancer regions are MSigDB pathway, and disease ontology than the DECRES more than 5 kilobase pairs (kbps) away from gene TSSs predictions. The motif analysis results of both methods (Additional file 1: Figure S15A), while 47% of predicted are consistent. promoters are less than 5 kbps to the annotated gene TSSs (Additional file 1: Figure S15B). Similar statistics Discussion were obtained for the remaining five cell types. Annota- Our study brings together a large collection of high- tion analyses of the GM12878-specific CRRs show that throughput data from global projects to allow for super- proximal genes are associated to: immune response from vised annotation. One key challenge in such analysis is the gene ontology (GO) annotations (Additional file 1:Figure depth of validation. In this report, validation is assessed S15C); B cell signalling pathways from MSigDB Path- using existing collections of reliable enhancers, including way annotations (Additional file 1: Figure S15D); and CAGE [4], and laboratory validated sets from CRE-seq leukemia from disease ontology annotations (Additional [14], and, on a small-scale, transgenic mouse assays [43]), showing that the supervised approach nears 89% sensi- file 1: Figure S15E). Results are consistent with the lym- phoblastoid lineage of the cells. Next, we performed tivity. While we compare to multiple laboratory validated functional enrichment analysis on the BDT-supported sets retrospectively, a prospective assessment would have predicted enhancers not previously reported in the broad value. In light of the recent advances in both big FANTOM enhancer atlas (“not in atlas”). Results are data analysis methods and genome-scale data generation, fully consistent with the above analysis (Additional file 1: we believe it is opportune to launch a global prospective Figure S16). assessment, such as enabled within the DREAM Chal- We further carried out motif enrichment analysis lenges program [54]. Such a test for annotation of cis- on the predicted cell-specific CRRs and not-in-atlas regulatory regions in the human genome would inspire enhancers using HOMER [45]. The predicted regions are the machine learning community to push the perfor- enriched for motifs similar to JASPAR binding profiles mance limit of supervised CRR-prediction methods, and [46] (Additional file 1: Figure S15F and Figures S16-S26) would encourage laboratory biologists to accelerate cell both associated to TFs maintaining general cell processes type-specific data generation. and TFs with selective roles in cell-related functions. For Enhancers and promoters have both common and dis- instance, motifs for Jun-, Fos-, and Ets-related factors were tinct characteristics. In our cross-validations, we show enriched in regions from all six cell types. These TFs that A-E and A-P are highly separable (Fig. 1a), while regulate general cellular progresses such as differentia- better performance can be obtained if A-E and A-P are tion, proliferation, or apoptosis [47, 48]. Cell-appropriate treated as a single class (Additional file 1:FigureS1C). TF enrichments were observed for each cell (summa- Both continuous (merging enhancers and promoters rized in Additional file 1: Table S7). For example, RUNX1 together) and distinct models (treating enhancers and and other Runt-related factors, which play crucial roles promoters separately) have limitations. While a continu- in haematopoiesis, are observed in GM12878 (Additional ous model may overlook functional differences, a distinct model may overemphasize such differences. A potentially file 1: Figure S15F and Figure S16) [49]. C/EBP-related fac- tors that regulate genes involved in immune and inflam- better prediction model might require two hierarchical matory responses are expressed in cervix (Additional steps. It could first distinguish CRRs from the background Li et al. BMC Bioinformatics (2018) 19:202 Page 10 of 14 genome, then assign a continuous score to each candidate for the GTEx project [62]. The current predictions using region indicating the likelihood of being an enhancer. Fur- MLPs and future annotations using CNNs and RNNs can ther clustering and subtyping may be necessary. It is worth integrate sequence variations (captured in alignment of mentioning that the CAGE-defined enhancers used in this short sequence reads of ChIP-seq and other sequencing study may introduce some bias towards capturing a spe- techniques) and RNA-seq gene expression data of a cell type of interest, so that the impact of genetic variations in cific class of enhancers which exhibit reasonably strong and detectable transcription. To further investigate the non-coding regions can be prioritized. characteristics of enhancers and improve genome-wide prediction, enhancers detected by other techniques, such Conclusions as GRO-seq, will need to be considered in the future. Using FANTOM data for training, we show that super- Our predicted CRRs take a substantial but small por- vised deep learning methods are able to accurately pre- tion of the non-coding regions, previously known as “junk dict active enhancers and promoters across the human DNA”. It may be because only six cell types are consid- genome. Models incorporating cell-specific data outper- ered in this work. Nevertheless, we have already seen that form models restricted to universal data (e.g. sequence), the non-coding regions exhibit regulatory functionality. and highlight key experimental features that tend to be It would be interesting in the next phase to collect data incorporated into predictive models when available. We from a large number of cell types and examine the cover- explore the relative performance of 2- and 3-class mod- age, which will unveil whether regulatory regions have an els that either group or separate enhancers and pro- oasis pattern. It may also imply that certain fragments of moters. Finally, we deliver a comprehensive collection of the non-coding regions play other partially known (such annotations, that label 6.8% of the genome as enhancers as suppression, domain boundary, and development) and and 0.6% as promoters in one or more of six well- unknown roles. characterized cells. As already advocated in our review [7], two other deep Accurate annotation of regulatory regions across the learning models might be well suited to improve annota- human genome is essential for genome interpretation. tions of non-coding regions. One method is convolutional With genome sequencing transitioning to a standard clini- neural networks (CNNs), which can take into account cal test, the ability to move beyond the analysis of protein- the topological properties of features. The other is bidi- coding alterations has the potential to expand clinical rectional recurrent neural networks (RNNs), which can diagnostic capacity to explain observed genetic disorders. consider the information from adjacent regions (i.e. the By demonstrating the suitability of supervised deep learn- context). Such an approach can be potentially applied to ing methods to label regulatory regions, we now enter into annotate regulatory domains or complexes where exons, a new stage of genome annotation. In the next few years, introns, promoters, enhancers, silencers, and insulators we anticipate that characterization of regulatory prop- form cohorts for specific functionalities. Bidirectional erties in specific cell populations will accelerate, using RNNs have a smoothing effect, making the predictions both chromatin-based and sequencing-based methods. context-dependent. Development of CNN- and RNN- As demonstrated in this report, deep learning methods based models for prediction of enhancers using sequence are well suited for the challenge of using the expanded data information has just emerged [55]. We foresee more for reliable annotation of the genome. sophisticateddeeplearningmodelsinthe nearfuture We anticipate that the collection of regulatory region for comprehensive genome annotations. To prevent pre- annotations provided in this study will have broad util- dictions from jumping between states, smoothing has ity for genome interpretation, and that the demonstra- been taken into consideration in a deep neural network tion of the sufficiency of training data and the utility of combined with hidden Markov model [56, 57]. Com- deep learning supervised methods for CRR prediction will bined with MLPs, CNNs, or RNNs, other newly published move the discussion to a highly applied period of high- deep feature selection techniques, such as layer-wise rele- quality annotation. Understanding how CRRs interact and vance propagation [58] and class saliency extraction [59], how they link to their target genes is the key to decipher might be useful to identify informative signal peaks for the cis-regulatory mechanism. We expect that further cis-regulatory elements of focus. Furthermore, transfer development of integrative machine learning methods learning [60] and multi-task learning [37] techniques [63, 64] is crucial to reconstruct such a gene regulatory might be useful in the design of deep predictive mod- system. els, particularly when the number of learning examples of one cell type is limited or a region allows several anno- Methods tations. Assessing the impact of sequence variations in Data non-coding regions on gene expression and phenotypes is For the purpose of supervised analysis, we collected fea- of high clinical interest [32, 61], which was one motivation ture data from ENCODE [10] along with the trans- Li et al. BMC Bioinformatics (2018) 19:202 Page 11 of 14 criptionally active enhancers and promoters from eight learning rate could reduce gradually as the number of iter- matched cell types catalogued by the FANTOM effort ations increases. Using batch-size 100, stochastic gradient [4, 24]. These cell types include A549, GM12878, HelaS3, descent was employed to optimize the model param- HepG2, HMEC, HUVEC, K562, and MCF7. For each cell eters. Rectified linear unit (ReLU) activation function type, we defined seven classes of labelled regions, includ- [72] was used. When evaluating the classification perfor- ingA-E,I-E,A-P,I-P,A-X,I-X,and UK.The librariesof mance of various experiments, we randomly sampled a enhancers and promoters were downloaded from [65]. A- balanced training set with maximally 70% of examples Es and I-Es were defined as FANTOM enhancers with and maximally 3000 examples in each class, the remaining TPM>0 (tags per million) and TPM=0, respectively. A- data were further randomly sampled to generate a corre- Ps and I-Ps were randomly selected FANTOM promoters sponding test set with proportion of A-E:A-P:A-X:I-E:I- with TPM>5andTPM=0, respectively. A-X and I-X were P:I-X:UK=1:1:1:2:2:1:10 to mimic the true genome-wide defined based on exons’ transcription levels measured by background among the classes (Same ratio was used in RNA-seq [66]. An exon with peak-max greater than 400 the comparison with ChromHMM and Combined meth- (equal to 0) was defined as A-X (I-X). The UK regions were ods). A training set was further partitioned into a training sampled from the genome regions excluding all FANTOM subset for model learning and a validation subset for CAGE tags, all exons, and DNaseI open regions. The num- early termination. We scaled each feature in the training bers of labelled regions used in this study are listed in subset to [0,1], and applied the same estimated scaling Additional file 1: Table S1. For each cell type, we built factors to the validation subset and test set. Class-wise a comprehensive feature set, integrating histone modifi- rate (CWR, i.e. averaged sensitivity and specificity for cation and TF binding ChIP-seq, DNase-seq, FAIRE-seq, two classes, and averaged sensitivities for more than two and ChIA-PET data from the ENCODE project [66, 67]. classes), area under the receiver operating characteris- These features characterize the activities of enhancers tic curve (auROC), and area under the precision-recall and promoters in cell-specific aspects. Additionally, CpG curve (auPRC) were calculated to measure the classifica- islands and phastCons evolutionary conservation scores tion performance. The above procedure was repeated 100 were included, because it is well recognized that some times to determine the means and variations of CWRs, regulatory regions are highly GC-rich and extremely con- auROCs, and auPRCs. Random forest [73]was compared served. For each labelled region, the mean value of feature on the same training-test splits in our experiments. Before signals fall within a bin centered at the region was taken predicting regulatory regions in the whole genome, all as the feature value using bwtool [68]. We tried different labelled A-E samples and 3000 samples in each of the bin sizes including 200, 500, 1,000, 2,000, and 4,000 bps. otherclasses of acelltypewereusedtotrainthenetwork. Since 200 bps worked the best in cross-validation tests, we used it throughout our analyses. The numbers of fea- Feature selection tures used for each cell type and the numbers of common Based on our newly devised deep feature selection (DFS) features between any pair of them are given in Additional model [26], we designed the randomized DFS (RDFS), file 1: Table S2. A combined list of features is provided which is a deep extension of randomized LASSO [74], for in Additional file 4. Our labelled data are downloadable stably selecting subsets of discriminative features. from [69]. Addressing the limitations of sparse linear models for feature selection, DFS is able to model the non-linearity Deep learning for classification ofthefeaturesandselect asinglesubsetoffeaturesfor Based on the Deep Learning Tutorials [70]and multi-class data. The main idea of DFS is to add a one- Theano [71], we implemented a deep learning pack- to-one linear layer (named feature-selection layer) to the age named DECRES (DEep learning for identifying Cis- above described feedforward neural network. For the i-th Regulatory ElementS and other applications) which is input feature x , the output of the feature-selection layer available at [69]. We applied a supervised deep model – becomes w x . Thus, the parameter of this layer is a vector i i feedforward neural network (also known as multilayer w. By shrinking w,someofitselements turn to zeros,such perceptrons or MLP) for the detection of regulatory that the corresponding features do not contribute to the regions. For each experiment, we conducted a model classification at all. The upper hidden layers of the model search from no hidden layers upto three hidden lay- have the capability of modelling non-linear interactions ers with maximally 256, 128, and 64 units in the first, in the data. The feature selection layer allows to select a second, and third hidden layers. Initial learning rate, single subset of features for multi-class problem. l -regularization amount (to control model complex- The randomized DFS procedure is similar to ran- ity), and momentum (to stabilize the optimization) were dom forest, both implementing the philosophy of the searched across ranges of values. The maximum num- wisdom of crowds. In randomized DFS, DFS with per- ber of allowed iterations was set to 1000. The initial turbed feature-wise sparsity control parameters runs on Li et al. BMC Bioinformatics (2018) 19:202 Page 12 of 14 a training subset which is a randomly sampled portion Acknowledgements We thank our colleagues in the Wasserman Laboratory for discussions and (usually ) from the entire training set, and selects the top feedback improving the quality of this paper. We thank Dr. Anshul Kundaje K features based on feature weights. This random proce- (Stanford) for making the processed ChIP-seq fold-change data and script dure is repeated, say 100, times to generate the empirical available. probability of each feature being selected. The empiri- Funding cal probabilities are used as feature importance. In our This research was launched by supports from the Genome Canada/Genome BC: 174DE (ABC4DE project), Canadian Institutes of Health Research (CIHR): experiment, we used ReLUs as activation functions and MOP-82875, the Natural Sciences and Engineering Research Council of set K =10 in this procedure. We scaled each feature to Canada (NSERC): RGPIN355532-10, and the National Institutes of Health (USA) [0,1] in the training set, and applied the estimated scal- grant: 1R01GM084875. WWW was supported by scholar awards from CIHR and the Michael Smith Foundation for Health Research (MSFHR) for directing this ing factors from the training set to the test set. From research. YL was supported by NSERC Postdoctoral Fellowship: PDF-471767- the most important to the least important features, we 2015 in the second stage of this project for in-depth analysis and writing of kept adding features to a MLP to evaluate how many this article. The computer systems of the Gene Regulation Bioinformatics Laboratory were funded by the Canada Foundation for Innovation and the BC features are sufficient for the prediction. Randomized Knowledge Development Fund, and were used for collecting and processing DFS was compared to random forest on the same parti- data in this project. WestGrid clusters of Compute Canada were used for tions of training and test sets in this study. The DFS and conducting computational experiments and analysis of this project. randomized DFS models were included in the DECRES Availability of data and materials package. The labelled datasets generated and analysed during the current study and Aiming at trading off the size of feature subset and clas- the DECRES package are available in https://github.com/yifeng-li/DECRES. sification performance (auPRC), we designed a method Authors’ contributions based on curving fitting that was applied in Fig. 4. YL developed the machine learning methods, compiled the data sets, performed computational analysis, and drafted this manuscript. WS Denoting a size of feature subset and corresponding participated in the evaluation of the findings. WWW initialized and supervised test auPRC by x and y respectively, we first fit function the project, and finished the final manuscript. All authors read and approved 2s y = arctan(kx) where k and s are scale parameters. the final manuscript. Once done, a point can be chosen on the curve given a Ethics approval and consent to participate 1 2ks−tπ Not applicable. proper tangent value (say t)using x = and k tπ 2s y = arctan(kx ). Since the values come from different t t Competing interests The authors declare that they have no competing interests. scales on x and y axes, we qualitatively used t = 0.5. Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Additional files Author details Additional file 1: This file contains supplemental Tables S1-S7,and Centre for Molecular Medicine and Therapeutics, BC Children’s Hospital supplemental Figures S1-S30. (PDF 27600 kb) Research Institute, Department of Medical Genetics, University of British Columbia, Rm 3109, 950 West 28th Avenue, V5Z 4H4 Vancouver, Canada. Additional file 2: Genome-wide predictions of cis-regulatory regions for Digital Technologies Research Centre, National Research Council Canada, all six cell types. (ZIP 20400 kb) Building M-50, 1200 Montreal Road, K1A 0R6 Ottawa, Canada. Additional file 3: Predictions on CAGE supported bidirectional loci. AiA: Active in the FANTOM Enhancer Atlas; IiA: Inactive in the FANTOM Received: 13 April 2017 Accepted: 4 May 2018 Enhancer Atlas; NiA: Not included in the FANTOM Enhancer Atlas; Specific: Predicted cell-specific A-Es. (ZIP 4820 kb) Additional file 4: List of features used for each cell type. (ZIP 1.36 kb) References 1. Pennacchio LA, Bickmore W, Dean A, Nobrega MA, Bejerano G. Enhancers: Five essential questions. Nat Rev Genet. 2013;14(2):288–95. Abbreviations 2. Shlyueva D, Stampfel G, Stark A. Transcriptional enhancers: From A-E: Active enhancer; A-P: Active promoter; A-X: Active exon; properties to genome-wide predictions. Nat Rev Genet. 2014;15:272–86. BDT: Bidirectionally transcribed; BG: Class merging I-E, I-P, A-X, I-X, and UK; 3. Andersson R, Sandelin A, Danko CG. A unified architecture of auPRC: Area under the precision-recall curve; auROC: Area under the receiver transcriptional regulatory elements. Trends Genet. 2015;31(8):426–33. operating characteristic curve; CAGE: Cap analysis of gene expression; 4. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, ChIP-seq: Chromatin immunoprecipitation followed by sequencing; Chen Y, Zhao C, Schmidl C, Suzuki T, Ntini E, Arner E, Valen E, Li K, CNN: Convolutional neural network; CRR: cis-regulatory region; Schwarzfischer L, Glatz D, Raithel J, Lilje B, Rapin N, Bagger FO, CWR: Class-wise rate, i.e. averaged sensitivity and specificity for two classes, Jørgensen M, Andersen PR, Bertin N, Rackham O, Burroughs AM, and averaged sensitivities for more than two classes; DECRES: Deep learning Baillie JK, Ishizu Y, Shimizu Y, Furuhata E, Maeda S, Negishi Y, Mungall CJ, for identifying cis-regulatory elements and other applications; DFS: Deep Meehan TF, Lassmann T, Itoh M, Kawaji H, Kondo N, Kawai J, feature selection; DRM: Dinucleotide repeat motif; eRNA: Enhancer RNA; Lennartsson A, Daub CO, Heutink P, Hume DA, Jensen TH, Suzuki H, ENCODE: Encyclopedia of DNA Elements; FANTOM: Functional Annotation of Hayashizaki Y, Müller F, FANTOM Consortium, Forrest AR, Carninci P, the Mammalian Genome; GRO-seq: Global run-on and sequencing; Rehli M, Sandelin A. An atlas of active enhancers across human cell types I-E: Inactive enhancer; I-P: Inactive promoter; I-X: Inactive exon; MLP: Multilayer and tissues. Nature. 2014;507:455–61. perceptron; MPRA: Massively parallel reporter assay; Pol2: RNA polymerase II; 5. Core LJ, Martins AL, Danko CG, Waters CT, Siepel A, Lis JT. Analysis of RDFS: Randomized deep feature selection; RF: Random forest; RNN: Recurrent nascent RNA identifies a unified architecture of initiation regions at neural network; ReLU: Rectified linear unit; TF: Transcription factor; TPM: tags mammalian promoters and enhancers. Nat Genet. 2014;46(12):1311–20. per million; UK: Unknown/uncharacterized Li et al. BMC Bioinformatics (2018) 19:202 Page 13 of 14 6. Wasserman WW, Sandelin A. Applied bioinformatics for the identification Jojic N, Scherer S, Blencowe B, Frey B. The human splicing code reveals of regulatory elements. Nat Rev Genet. 2004;5(4):276–87. new insights into the genetic determinants of disease. Science. 7. Li Y, Chen C, Kaye AM, Wasserman WW. The identification of 2015;347(6218):1254806. cis-regulatory elements: A review from a machine learning perspective. 32. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with BioSystems. 2015;138:6–17. deep learning-based sequence model. Nat Methods. 2015;12(10):931–4. 8. Ernst J, Kellis M. ChromHMM: Automating chromatin-state discovery and 33. Alipanhi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence characterization. Nat Methods. 2012;9(3):215–6. specificities of DNA- and RNA-binding proteins by deep learning. 9. Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Nobel WS. Nat Biotechnol. 2015;33(8):831–8. Unsupervised pattern discovery in human chromatin structure through 34. Spencer M, Eickholt J, Cheng J. A deep learning network approach to ab genomic segmentation. Nat Methods. 2012;9(5):473–6. initio protein secondary structure prediction. IEEE/ACM Trans Comput 10. The ENCODE Project Consortium. An integrated encyclopedia of DNA Biol Bioinforma. 2015;12(1):103–12. elements in the human genome. Nature. 2012;489:57–74. 35. Kheradpour P, Ernst J, Mlenikov A, Rogov P, Wang L, Zhang X, Alston J, 11. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of Mikkelsen TS, Kellis M. Systematic dissection of regulatory motifs in 2000 in vivo protein-DNA interactions. Science. 2007;316(5830):447–55. predicted human enhancers using a massively parallel reporter assay. 12. Hoffman MM, Ernst J, Wilder SP, Kundaje A, Harris RS, Libbrecht M, Genome Res. 2013;23(5):800–11. Giardine B, Ellenbogen PM, Bilmes JA, Birney E, Hardison RC, Dunham I, 36. Fletez-Brant C, Lee D, McCallion AS, Beer MA. kmer-SVM: A web server Kellis M, Noble WS. Integrative annotation of chromatin elements from for identifying predictive regulatory sequence features in genomic data ENCODE data. Nucleic Acids Res. 2013;41(2):827–41. sets. Nucleic Acids Res. 2013;41:544–6. 13. Kwasnieski JC, Mogno I, Myers CA, Corbo JC, Cohen BA. Complex 37. Setty M, Leslie CS. SeqGL identifies context-dependent binding signals in effects of nucleotide variants in a mammalian cis-regulatory element. genome-wide regulatory element maps. PLoS Comput Biol. 2015;11(5): Proc Natl Acad Sci. 2012;109(27):19498–503. 1004271. 14. Kwasnieski JC, Fiore C, Chaudhari HG, Cohen BA. High-throughput 38. Deaton AM, Bird A. CpG islands and the regulation of transcription. functional testing of ENCODE segmentation predictions. Genome Res. Genes Dev. 2011;25(10):1010–22. 2014;24:1595–602. 39. Wang Y, Li X, Hua H. H3K4me2 reliably defines transcription factor 15. Yip KY, Cheng C, Bhardwaj N, Brown JB, Leng J, Kundaje A, Rozowsky J, binding regions in different cells. Genomics. 2014;103(2-3):222–8. Birney E, Bickel P, Snyder M, Gerstein M. Classification of human 40. Zhou VW, Goren A, Bernstein BE. Charting histone modifications and the genomic regions based on experimentally determined binding sites of functional organization of mammalian genomes. Nat Rev Genet. more than 100 transcription-related factors. Genome Biol. 2012;13:48. 2011;12:7–18. 16. Rajagopal N, Xie W, Li Y, Wagner U, Wang W, Stamatoyannopoulos J, 41. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D. GeneCards: Integrating Ernst J, Kellis M, Ren B. RFECS: A random-forest based algorithm for information about genes, proteins and diseases. Trends Genet. enhancer identification from chromatin state. PLoS Comput Biol. 1997;13(4):163. 2013;9(3):1002968. 42. Witte S, Bradley A, Enright AJ, Muljo SA. High-density P300 enhancers 17. Lu Y, Qu W, Shan G, Zhang C. DELTA: A distal enhancer locating tool control cell state transitions. BMC Genomics. 2015;16:903. based on AdaBoost algorithm and shape features of chromatin 43. Visel A, Minovitsky S, Dubchak I, Pennacchio LA. VISTA Enhancer modifications. PLoS ONE. 2015;10(6):0130622. Browser – a database of tissue-specific human enhancers. Nucleic Acids 18. Chen C, Morris Q, Mitchell JA. Enhancer identification in mouse Res. 2007;35:88–92. embryonic stem cell using integrative modeling of chromatin and 44. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, genomic features. BMC Genomics. 2012;13:152. Wenger AM, Bejerano G. GREAT improves functional interpretation of 19. Arnold CD, Gerlach D, Stelzer C, Boryn LM, Rath M, Stark A. cis-regulatory regions. Nat Biotechnol. 2010;28(5):495–501. Genome-wise quantitative enhancer activity maps identified by 45. Heinz S, Benner C, Spann N, Bertolino E, et al. Simple combinations of STARR-seq. Science. 2013;339:1074–7. lineage-determining transcription factors prime cis-regulatory elements 20. Yanez-Cuna JO, Arnold CD, Stampfel G, Boryn LM, Gerlach D, Rath M, required for macrophage and B cell identities. Mol Cell. 2010;38(4):576–89. Stark A. Dissection of thousands of cell type-specific enhancers identifies 46. Mathelier A, Fornes O, Arenillas DJ, Chen CY, Denay G, Lee J, Shi W, dinucleotide repeat motifs as general enhancer features. Genome Res. Shyr C, Tan G, Worsley-Hunt R, Zhang AW, Parcy F, Lenhard B, 2014;24:1147–56. Sandelin A, Wasserman WW. JASPAR 2016: A major expansion and 21. Core LJ, Waterfall JJ, Lis JT. Nascent RNA sequencing reveals widespread update of the open-access database of transcription factor binding pausing and divergent initiation at human promoters. Science. 2008;322: profiles. Nucleic Acids Res. 2016;44(D1):110–5. 1845–8. 47. Ameyar M, Wisniewska M, Weitzman JB. A role for AP-1 in apoptosis: The 22. Danko CG, Hyland SL, Core LJ, Martins AL, Waters CT, Lee HW, case for and against. Biochimie. 2003;85(8):747–52. Cheung VG, Kraus WL, Lis JT, Siepel A. Identification of active 48. Sharrocks AD. The ETS-domain transcription factor family. Nat Rev Mol transcriptional regulatory elements from GRO-seq data. Nat Methods. Cell Biol. 2001;2(11):827–37. 2015;12:433–8. 49. Okuda T, Nishimura M, Nakao M, Fujita Y. RUNX1/AML1: A central player 23. Kodzius R, Kojima M, Nishiyori H, Nakamura M, Fukuda S, Tagami M, in hematopoiesis. Int J Hematol. 2001;74(3):252–7. Sasaki D, Imamura K, Kai C, Harbers M, Hayashizaki Y, Carninci P. CAGE: 50. Arnett B, Soisson P, Ducatman BS, Zhang P. Expression of CAAT Cap analysis of gene expression. Nat Methods. 2006;3:211–22. enhancer binding protein beta (C/EBP beta) in cervix and endometrium. 24. The FANTOM Consortium, The RIKEN PMI, CLST (DGT). A promoter-level Mol Cancer. 2003;2:21. mammalian expression atlas. Nature. 2014;507:462–70. 51. Costa RH, Kalinichenko VV, Holterman AX, Wang X. Transcription factors 25. Kleftogiannis D, Kalnis P, Bajic VB. DEEP: A general compuational in liver development, differentiation, and regeneration. Hepatology. framework for predicting enhancers. Nucleic Acids Res. 2015;43(1):6. 2003;38(6):1331–47. 26. Li Y, Chen C, Wasserman WW. Deep feature selection: Theory and 52. Wang Z, Bishop EP, Burke PA. Expression profile analysis of the application to identify enhancers and promoters. J Comput Biol. inflammatory response regulated by hepatocyte nuclear factor 4α. 2016;23(5):322–36. BMC Genomics. 2011;12:128. 27. Hinton GE, Osindero S, Teh Y. A fast learning algorithm for deep belief 53. Fleming JD, Pavesi G, Benatti P, Imbriano C, Mantovani R, Struhl K. NF-Y nets. Neural Comput. 2006;18:1527–54. coassociates with FOS at promoters, enhancers, repetitive elements, and 28. Hinton G, Salakhutdinov R. Reducing the dimensionality of data with inactive chromatin regions, and is stereo-positioned with neural networks. Science. 2006;313:504–7. growth-controlling transcription factors. Genome Res. 2013;23(8): 29. Bengio Y, Courville A, Vincent P. Representation learning: A review and 1195–209. new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35(8): 54. DREAM Challenges. http://dreamchallenges.org. 1798–828. 30. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44. 55. Yang B, Liu F, Ren C, Ouyang Z, Xie Z, Bo X, Shu W. BiRen: Predicting 31. Xiong HY, Alipanahi B, Lee L, Bretschneider H, Merico D, Yuen R, Hua Y, enhancers with a deep-learning-based model using the DNA sequence Gueroussov S, Najafabadi H, Hughes T, Morris Q, Barash Y, Krainer A, alone. Bioinformatics. 2017;33(13):1930–6. Li et al. BMC Bioinformatics (2018) 19:202 Page 14 of 14 56. Liu F, Ren C, Li H, Zhou P, Bo X, Shu W. De novo identification of replication-timing domains in the human genome by deep learning. Bioinformatics. 2016;32(5):641–9. 57. Liu F, Ren C, Bo X, Shu W. PEDLA predicting enhancers with a deep learning-based algorithmic framework. Sci Rep. 2016;6:28517. 58. Bach S, Binder A, Montavon G, Klauschen F, Muller K-R, Samek W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE. 2015;10(7):0130140. 59. Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. In: International Conference on Learning Representations Workshop. 2014. https://iclr.cc/archive/2014/workshop-proceedings. 60. Pan S. J, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2010;22(10):1345–59. 61. Kelley DR, Snoek J, Rinn JL. Basset: Learning the regulatory code of the accessible genome wide deep convolutional neural networks. Genome Res. 2016;26:990–9. 62. GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat Genet. 2013;45(6):580–5. 63. Li Y, Wu FX, Ngom A. A review on machine learning principles for multi-view biological data integration. Brief Bioinforma. 2018;19(2):325–40. 64. Eser U, Churchman L. S. FIDDLE: An integrative deep learning framework for functional genomic data inference. bioRxiv. https://doi.org/10.1101/ 65. FANTOM5 Data. http://fantom.gsc.riken.jp/5/data. 66. ENCODE Data. ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/ encodeDCC. 67. ENCODE Fold-Change Data. https://sites.google.com/site/anshulkundaje. 68. Pohl A, Beato M. bwtool: A tool for bigWig files. Bioinformatics. 2014;30(11):1618–9. 69. DECRES: Deep Learning Methods for Identifying Cis-Regulatory Elements and Other Applications. https://github.com/yifeng-li/DECRES. 70. Deep Learning Tutorials. http://deeplearning.net/tutorial. 71. Theano. http://deeplearning.net/software/theano. 72. Nair V, Hinton G. Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning (ICML). 2010. p. 807–14. 73. Breiman L. Random forests. Mach Learn. 2001;45:5–32. 74. Meinshausen U, Buhlmann P. Stability selection. J R Stat Soc Ser B Stat Methodol. 2010;72(4):417–73. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

Genome-wide prediction of cis-regulatory regions using supervised deep learning methods

Free
14 pages

Loading next page...
 
/lp/springer_journal/genome-wide-prediction-of-cis-regulatory-regions-using-supervised-deep-7habyVofi9
Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s)
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Algorithms
eISSN
1471-2105
D.O.I.
10.1186/s12859-018-2187-1
Publisher site
See Article on Publisher Site

Abstract

Background: In the human genome, 98% of DNA sequences are non-protein-coding regions that were previously disregarded as junk DNA. In fact, non-coding regions host a variety of cis-regulatory regions which precisely control the expression of genes. Thus, Identifying active cis-regulatory regions in the human genome is critical for understanding gene regulation and assessing the impact of genetic variation on phenotype. The developments of high-throughput sequencing and machine learning technologies make it possible to predict cis-regulatory regions genome wide. Results: Based on rich data resources such as the Encyclopedia of DNA Elements (ENCODE) and the Functional Annotation of the Mammalian Genome (FANTOM) projects, we introduce DECRES based on supervised deep learning approaches for the identification of enhancer and promoter regions in the human genome. Due to their ability to discover patterns in large and complex data, the introduction of deep learning methods enables a significant advance in our knowledge of the genomic locations of cis-regulatory regions. Using models for well-characterized cell lines, we identify key experimental features that contribute to the predictive performance. Applying DECRES, we delineate locations of 300,000 candidate enhancers genome wide (6.8% of the genome, of which 40,000 are supported by bidirectional transcription data), and 26,000 candidate promoters (0.6% of the genome). Conclusion: The predicted annotations of cis-regulatory regions will provide broad utility for genome interpretation from functional genomics to clinical applications. The DECRES model demonstrates potentials of deep learning technologies when combined with high-throughput sequencing data, and inspires the development of other advanced neural network models for further improvement of genome annotations. Keywords: cis-regulatory region, Enhancer, Promoter, Deep learning Background transcription factors (TFs), help specify the formation of In this article, we apply deep supervised analysis methods diverse cell types and respond to changing physiological to identify the positions of active cis-regulatory regions conditions. While gene expression is ultimately a reflec- (CRRs), including both enhancers and promoters, across tion of regulation across multiple processes, the key role the human genome. CRRs play a crucial role in precise of promoters and enhancers has been a central focus of control of gene expression. Promoters and enhancers act genome annotation for the past decade. The investment via complex interactions across time and space in the in generating informative data for the detection of these nucleus to control when, where and at what magnitude regions has been immense, in part motivated by the antici- genes are active. CRRs, through interactions with proteins pation that advanced computational approaches would be such as histones and sequence-specific DNA-binding able to transform the data into a reliable annotation of the genome. Promoters and enhancers were early discoveries during *Correspondence: wyeth@cmmt.ubc.ca Centre for Molecular Medicine and Therapeutics, BC Children’s Hospital the molecular characterization of genes. While promot- Research Institute, Department of Medical Genetics, University of British ers specify and enable the positioning of RNA polymerase Columbia, Rm 3109, 950 West 28th Avenue, V5Z 4H4 Vancouver, Canada machinery at transcription initiation sites, enhancers Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Li et al. BMC Bioinformatics (2018) 19:202 Page 2 of 14 modulate the activity of promoters from linearly distal prediction of enhancers that are defined by p300 bind- locations away from transcript initiation sites [1, 2]. The ing sites overlapping with DNase-I hypersensitive sites delineation between the classes has become increasingly and distal to annotated TSS. Chen et al. applied multi- challenging, with some literature suggesting the two cate- nomial logistic regression with LASSO regularization gories are the edges of a continuous spectrum of CRRs [3]. to find key features for the classification of stem cell- specific functional enhancer regions [18]. Using STARR- Indeed, it has long been observed that sequences flanking transcription initiation regions can function as enhancers seq data, a new experimental approach for screening (promoter-proximal regions), and in recent years, it has candidate enhancer sequences [19], dinucleotide repeat been observed that there are transcripts initiated at the motifs (DRMs) were found to be enriched in broadly edgesofactiveenhancers [4, 5]. For the purpose of this active enhancers, leading to a proposition that a small set report, we address the two as distinct classes, but discuss of TF binding site motifs and DRMs might be sufficient the relationship between our findings and the continuous for enhancer prediction [20]. class model. New laboratory methods are emerging, providing a The use of computational methods to detect the loca- refined resolution of CRR locations. The majority of tions of promoters and enhancers has been a key focus of human DNA is transcribed, producing diverse types of bioinformatics for twenty years (see reviews [6, 7]). With RNA. In particular, transcripts generated at the edges the advances of experimental procedures for profiling the of enhancers, enhancer RNAs (eRNAs), allow for the properties of chromatin and RNA transcripts, a new wave experimental readout of active regulatory regions. Global of methods has arrived. Given the small set of reliable run-on and sequencing (GRO-seq) protocols [21]mea- enhancer annotations, it was appropriate that the first sure the 5’-end of nascent RNAs revealing the divergent among these methods used unsupervised learning. For transcriptional signature of both transcriptionally active instance, both ChromHMM [8]andSegway [9]segment promoters and enhancers [5]. Using GRO-seq signals, a the genome into sequence classes based on ENCODE support vector regression model (dReg) was developed project data [10], such as histone modification ChIP-seq to predict active transcriptional regulatory elements [22]. (chromatin immunoprecipitation followed by sequencing The cap analysis of gene expression (CAGE) technique [11]) signals. Such unsupervised methods infer hidden [23] captures the 5’-end of RNA transcripts, enabling a states based on observed signals, and then associate an precise determination of transcript initiation sites. Using element to each hidden state. The states are subsequently CAGE, the FANTOM5 Consortium has identified an atlas labelled with biological functions based on enrichment of transcriptionally active promoters [24] and a permissive for known examples. A test of predicted Enhancers for set of 43,011 transcriptionally active enhancers character- the K562 leukemia cell line by the Combined method ized by bidirectional eRNAs [4] across hundreds of human (unifying ChromHMM and Segway annotations) [12] cell types and tissues. These enhancers were validated using a high-throughput reporter gene assay [13]revealed with high success rates ranging from 67.4 to 73.9% [4]. that only 26% of predicted enhancers have regulatory Compared to protein-coding RNAs, eRNAs are believed activity [14]. The assessment showed that the predicted to degenerate quickly, and only a small number of tis- Weak Enhancers, a class associated with lower H3K27ac sues have been explored with sufficient depth to reveal and H3K36me3 signals, unexpectedly drove higher gene eRNAs. While the FANTOM enhancer set is therefore expression than the predicted Enhancers. It is evident that incomplete, it provides a uniquely large inventory of high- improvements are needed, potentially involving the use of quality enhancers to use for the training of machine learn- additional experimental features and alternative machine ing approaches. An ensemble support vector machine learning approaches. method suggested the potential to distinguish enhancers Despite the limited set of precisely annotated active based on such data [25]. enhancers, supervised machine learning models have We have previously proposed and herein present the been attempted to predict enhancer regions. In each case, use of a deep feature selection (DFS) model for the a distinct definition of a suitable positive training set of supervised prediction of CRRs [26]. Deep learning is a enhancers was taken. A random-forest method was used dramatic advance in the frontier of artificial intelligence in [15] to classify TF bound regions with a focus on [27–29]. Unlike widely used linear models, deep learning observed binding patterns, generating sets of two-class approaches model complex systems and capture high- classifiers to distinguish regions based on binding activity level knowledge from data. Driven by big and rich data, and position relative to promoter regions. A random- deep learning has been successfully applied in various forest based enhancer classification method was devised areas such as automatic image annotation and speech in [16] with histone modification ChIP-seq data as fea- language processing [30]. Bioinformaticians have started tures, using p300 bound regions as the basis for training. using this powerful tool for next-generation sequencing An AdaBoost-based model was proposed in [17]for the data mining, such as predicting the impact of variations Li et al. BMC Bioinformatics (2018) 19:202 Page 3 of 14 on exon splicing [31] and the effects of noncoding variants transcript initiation events are observed in the tissue of on chromatin [32], detecting TF binding patterns [33], and focus, while inactive refers to regions detected in other tis- predicting protein secondary structures [34]. sues, but not in the focus tissue. We recorded the mean Our study stands on three important legs. First, the class-wise rate (i.e. averaged sensitivities of all classes), precisely annotated FANTOM promoters and enhancers, area under the receiver operating characteristic curve which provide the largest experimentally defined collec- (auROC), and the area under the precision-recall curve tion of CRRs. Second, the ENCODE project genome-wide (auPRC) in Fig. 1 and Additional file 1:FigureS1. feature data, such as histone modifications, TF binding, There are four aspects of the results that we highlight, RNA transcripts, chromatin accessibility, and chromatin which affirm the capacity of our supervised deep learn- interactions. Third, deep learning methods to distinguish ing approach to distinguish between classes of CRRs and CRRs based on the available data. We unite the three background. First, we are able to distinguish between components to create the DECRES (DEep learning for active enhancers and promoters (A-E versus A-P) (Fig. 1a). identifying Cis-Regulatory ElementS and other applica- We used A-E and A-P as positive and negative train- tions) model, with which we identify the most compre- ing classes, respectively. Overall, we found that A-E and hensive collection of CRRs across the human genome yet A-P are highly separable. Second, we can distinguish compiled. active and inactive CRRs (either enhancers or promot- ers). From Fig. 1b and Additional file 1:FigureS1A,it Results can be observed that mean auPRCs on GM12878, HelaS3, Deep learning accurately distinguishes active enhancers HepG2, and K562, which have the largest training sets, are and promoters from background above 0.95 with small variances for both enhancers and We investigated the capacity of deep learning models promoters. In the rest of this paper, we exclude A549 and to separate enhancers and promoters, and to distinguish MCF7 cell lines in most analyses due to limited data avail- them from other regions and between activity states. We ability. Third, not unexpectedly, it is difficult to distinguish trained a deep feedforward neural network over our bal- between inactive enhancers and promoters (Additional anced labelled training sets to predict our (unbalanced) file 1: Figure S1B). Seven of the mean class-wise rates for test sets from each well-characterized cell type, repeating the eight cell types were lower than 0.80. While there are the procedure 100 times. The deep model takes experi- some indications that a portion of inactive promoters have some machinery present, it was our expectation that such mentally derived features over genomic regions as inputs regions will largely not exhibit strong transcription fac- and outputs class labels of these regions with probabili- tor binding or appropriate epigenetic signatures to inform ties (see Additional file 1: Table S1 for the total number of a model. Fourth, we tested the applicability of predicting samples of each class and Additional file 1:Table S2 forthe number of available features; see Methods). For narrative A-E and A-P from the super background (BG) class merg- convenience, hereafter we refer to active enhancer, active ingI-E,I-P,A-X,I-X,and UK(Fig. 1c). The results on promoter, active exon, inactive enhancer, inactive pro- six cell types were promising, all exceeded 0.80 auPRC. moter, inactive exon, and unknown (or uncharacterized) If A-E and A-P are merged further to form a super class region as A-E, A-P, A-X, I-E, I-P, I-X, and UK, respectively. (A-E+A-P), higher performance is achieved (Additional Under the assumption that active CRRs are undergoing file 1: Figure S1C). All auPRCs on these six cell types transcription, active applies to regions in which CAGE went beyond 0.89 auPRC. Furthermore, we also tested a ab c Fig. 1 Mean performance and standard deviation of 100 runs using the MLP model on our respectively sampled train-test partitions of eight cell types. a Classification performances of A-E versus A-P. b Classification performances of A-E versus I-E. c Classification performances of A-E versus A-P versus BG. MLP: Multilayer Perception, RF: Random Forest, A-E: Active Enhancer, A-P: Active Promoter, A-X: Active Exon, I-E: Inactive Enhancer, I-P: Inactive Promoter, I-X: Inactive Exon, UK: Unknown or Uncharacterized, BG: I-E+I-P+A-X+I-X+UK Li et al. BMC Bioinformatics (2018) 19:202 Page 4 of 14 random forest method, another state-of-the-art classifier, unsupervised ChromHMM and ChromHMM-Segway on our labelled data. Similar performance was obtained on Combined methods [8, 12] using FANTOM annotations all six experimental settings. The random forest method on five available cell types as reference. They were com- exhibited slightly better performance for A549 and MCF7 pared on unbalanced sets reflecting the true genomic datasets, which both have low numbers of enhancers. In background. The results are compared in Fig. 2a which expectation that more annotated enhancers are becoming displays radar charts where the larger and more con- available, we will continue using MLP and exploring other vex the area is, the better the performance. It is intuitive deep learning approaches such as convolutional neural that supervised approaches are preferred when labelled networks and recurrent neural networks. training data is sufficient. Furthermore, both unsuper- vised methods were developed prior to public release DECRES gives higher sensitivity and precision on FANTOM of the FANTOM5 data and are therefore at a disad- annotated regions vantage. However, these annotations are widely used by To assess the relative utility of our supervised deep the community and hence the relative performance of method for CRR prediction, we compared it with the DECRES to the standard is of interest. Overall, we observe Fig. 2 Comparison of the supervised method (DECRES) and unsupervised methods (ChromHMM and Combined) on five FANTOM annotated test sets in radar charts (a) and significance tests (b). The ENCODE segmentations were downloaded from [66]. We relabelled the annotations of ChromHMM and Combined. For ChromHMM segmentations, the Tss, TssF, and PromF classes were merged to A-P; the Enh, EnhF, EnhW, EnhWF classes were merged to A-E; and the rest were denoted by BG. When processing the Combined annotations, TSS and PF were relabelled to A-P; E and WE were relabelled to A-E; and the rest to BG. The p-values in (b) were obtained from two-tailed Student’s t-test on all cell types. The signs of statistic values are indicated in brackets Li et al. BMC Bioinformatics (2018) 19:202 Page 5 of 14 that DECRES outperforms ChromHMM and Combined DECRES predicted all to be negative (including the 16 methods which in turn deliver similar performance. that were active in the CRE-seq experiment). Importantly, These unsupervised methods consistently have lower as the DECRES scores rise, the quality of the predictions sensitivities for active enhancer detection (p = 5.57E-5 increase. We drew the histogram of DECRES membership and 9.90E-5 for DECRES versus ChromHMM and Com- scores of 254 and 433 experimentally positive and neg- ative Combined enhancers that were predicted as A-Es bined respectively, two tailed Student’s t-test; see Fig. 2b) and lower precision for active promoter detection (p = by DECRES (Additional file 1: Figure S2). The distri- 7.36E-5 and 2.33E-4 for DECRES versus ChromHMM butions are significantly different (p = 0.014, two-sided and Combined respectively, two tailed Student’s t-test; Mann-Whitney rank test). see Fig. 2b). Using ChromHMM, the active enhancer The second independent collection, in which K562 sensitivity ranges from 16.5% to 48.4% (numbers are con- and HepG2-specific “strong enhancer” (as predicted by sistent with the test on ENCODE predicted enhancers ChromHMM) containing predicted TF binding sites for reported in [14]), while our deep model ranges from cell-selective TFs were tested using a massively parallel 69% (K562) to 88.8% (GM12878). Moreover, ChromHMM reporter assay (MPRA) [35]. Only 41% of the enhancers achieves a maximum precision of 49.8% for active pro- were detected to be significantly expressed (p = 0.05, moter prediction, while the maximum for DECRES two-sided Mann-Whitney rank test). We used DECRES to is of 84.3%. predict the classes of the MPRA positive and MPRA neg- ative enhancers. Our result in Additional file 1:Table S3 Evaluation of DECRES performance with independent shows that 98.4% (120/122) and 97.8% (182/186) of the experimental data MPRA positive enhancers were respectively predicted to As the initial evaluation focused on FANTOM eRNA- be A-Es by DECRES for K562 and HepG2 cells, while based annotation of CRRs, the type of data used to train 92.3% (179/194) and 81.3% (217/267) of the MPRA nega- our supervised model, we sought to assess performance tive enhancers were still predicted as A-Es for K562 and on data generated by alternative methods. We identi- HepG2, respectively, but with different distributions of fied two independent collections of laboratory validated DECRES scores (p = 4.8E-6 and p = 2.3E-6 for K562 and enhancers to further assess the performance of DECRES: HepG2 respectively, two-sided Mann-Whitney rank test) a CRE-seq collection of regions tested in K562 cells [14] (Additional file 1: Figure S2). Consistent with the other independent data, the higher the DECRES scores the more and MPRA (massively parallel reporter assay) collections likelythey aretobepositive. tested in K562 and HepG2 cells [35]. In both instances, the set of regions that fail to direct expression may be falsely predicted by the assessed methods, but may also reflect Assessing the utility of DNA sequence properties on the the facts that the experimental procedures only include performance of DECRES a small segment of regulatory DNA and that plasmid- Recent studies confirmed that DNA sequence proper- based assays do not recapitulate chromatin properties. ties can be useful for the recognition of promoters and Given the nature of the data, we anticipate a portion of the enhancers [3, 5, 25], and the discrimination between experimental negatives to be bona fide regulatory regions. active and inactive regulatory sequences [36, 37]using In the first independent set, subsets of predicted K562 string sequence kernels. This builds on the long- enhancers and negative regions (as predicted by the Com- recognized capacity for the inclusion of CpG islands as bined ChromHMM and Segway method) were assessed features to improve promoter prediction [38]. We sought in the laboratory using CRE-seq [14]. In that study, only to determine if DNA sequence features can be informa- 33% of the “Combined” predicted regulatory regions were tive to distinguish between promoters and enhancers, found to be positive in the experiment, compared to 7% and between active and inactive classes. We trained the for the negative set. Using DECRES trained on all available model with 351 sequence features (originally used in [25]) active regulatory regions of K562 cells, we therefore vali- in multiple scenarios. Results are displayed in Fig. 3 and dated our method on 386 regions showing active enhancer Additional file 1: Figure S3. First, a deep method restricted activity in K562 as validated by CRE-seq compared to the to sequence features for discriminating A-E and A-P 298 control regions (Additional file 1: Table S3). Highly (Fig. 3a) delivered auPRCs from 0.8567 to 0.9370, con- consistent with the results above, a sensitivity of 65.5% firming that sequence attributes are indeed informative. (254/386) for the experimentally validated regions were Second, sequence features have a limited utility for distin- successfully predicted as A-E; the remaining 132 regions guishing between active and inactive states of enhancers were predicted as background (none were classified as and promoters, which is logical; while the experi- promoters). For the 812 tested predictions that were inac- mentally derived features could highly separate them tive in the CRE-seq experiment, DECRES classified 53.3% (p = 1.90E-08 and 5.06E-08 for enhancers and promoters (433/812) as positive. For the 298 negative control regions, respectively, two-tailed Student’s t-test; see Fig. 3b and Li et al. BMC Bioinformatics (2018) 19:202 Page 6 of 14 ab c Fig. 3 Comparing the mean auPRCs over 100 resampling and retraining on our labelled regions using different feature sets. “Experimental” means our experimentally derived next generation sequencing feature set. “Sequence” means the set of 351 sequence properties used in [25]. “Experimental+Sequence” means the combination of these two sets. a. Comparison of the three feature sets in A-E versus A-P. b.Comparisonofthe three feature sets in A-E versus I-E. c. Comparison of the three feature sets in A-E versus A-P versus BG. The p-values in each legend were obtained using two-tailed Student’s t-test to compare “Experimental”-based results with “Experimental+Sequence”-based and “Sequence”-based results, respectively Additional file 1: Figure S3A). Using sequence features and promoters from the background in GM12878 cell in the absence of experimental features has a lower line. In some cases, the different measures complement performance in classifying A-E, A-P and BG across all each other. For instance, H3K4me2 and H4K20me1 are eight cell types (p = 1.86E-09, two-tailed Student’s t-test; marked as key features by the randomized DFS, which see Fig. 3c). Finally, better results were not achieved is convincing as indicated by the box plots in Additional by combining experimental and sequence features file 1: Figure S4B and Figure S6-S13, but are overlooked (p = 2.79E-01, 6.56E-01 and 1.17E-01 in Fig. 3, two-tailed by random forest. Tbp was highlighted by random forest Student’s t-test). in GM12878 and HelaS3 cells, but was not picked up by randomized DFS. Examining the box plots of this feature Key features for DECRES performance in Additional file 1: Figures S6 and S7 reveals that this As experimental data can be time consuming and expen- feature is discriminative to distinguish active enhancers sive to produce, we sought to determine the minimal set and promoters from background, but there is not a dra- of features most informative for CRR prediction from matic difference between active enhancers and promoters. a computational perspective. We used randomized deep Important features incorporated into a random forest model may not be incorporated until a latter stage of the feature selection (randomized DFS or RDFS) and ran- DFS process. For instance, in K562 cell line, C-Myc was dom forest (RF) models (see Methods) for two-class emphasized by random forest, which is indeed reason- [A-E+A-P (or CRR) versus BG] and three-class (A-E ver- sus A-P versus BG) classifications on four cell types able as shown in Additional file 1: Figure S12 and was not (GM12878, HelaS3, HepG2, and K562) which have 72-135 selected as an initial feature in the DFS process. features available. For the development of machine learning methods in Figure 4a and Additional file 1: Figure S4A display the genome annotation, minimizing the number of features feature importance scores discovered by randomized DFS required decreases cost and increases the capacity for and random forest for the three-class classification. The biological interpretation. Figure 4b and Additional file 1: feature importance scores produced by these methods Figure S5B show the changes of test auPRCs as the num- should be interpreted differently. Similar to a forward bers of selected features increase for the three-class and selection, the feature importance scores from randomized two-class classifications, respectively. In both cases, test DFS reflect which features are preferred in the early stage auPRCs increase dramatically for the initial features, then of the sparse model, while the importance score of a fea- performance plateaus. Comparing the randomized DFS ture by random forest indicates the role of this feature in curves with the random forest curves, we can see that the context of its use with all other features. Thus, using thereisnosingleoptimal curve. Afewkeyfeaturesare both methods in this study enables us to gain different sufficient for a good prediction performance. To define insights into the data. In our experiments, both meth- an optimal number of features needed, we fit the curves ods can capture the most important features as indicated in Fig. 4b and Additional file 1: Figure S5B and selected by importance scores across all four cell lines. For exam- the intersection point for a line with slope of 0.5 on ple, both methods agree that Pol2, H3K4me1, Taf1, and the randomized DFS curves (see Methods). Fewer fea- H3K27ac are useful for distinguishing active enhancers tures are needed for two-class CRR prediction (6 features) Li et al. BMC Bioinformatics (2018) 19:202 Page 7 of 14 Fig. 4 Feature importance and classification performance in the 3-class (A-E versus A-P versus BG) scenario. a Feature importance discovered by randomized DFS (RDFS) and random forest (RF) on GM12878. The random forest’s feature importance scores were normalized to [0,1] for better comparison with randomized DFS. b auPRC versus the number of features incorporated into the RDFS and RF. The annotated points indicate where a line with slope 0.5 intersects a fitted curve compared to three-class models intended to distinguish three-class models, in agreement with existing knowledge between A-E, A-P and background (10 features). [2, 3, 39, 40]. Among transcription factors (including co- The distributions of the top ten features for three-class factors), Taf1 and p300, as well as RNA polymerase II predictions (A-E, A-P, and BG) are given in Additional (Pol2), are frequently selected, which is also consistent file 1: Figure S4B. Using the top ten features for each with existing knowledge [41, 42]. cell, auPRCs of 0.9022, 0.9156, 0.8651, and 0.8565 were Additional file 1: Figure S5C shows box plots of the top achieved on GM12878, HelaS3, HepG2, and K562, respec- six selected features by randomized DFS for two-class pre- tively. Half of these top features are histone modifica- dictions. Using these features, auPRCs of 0.9561, 0.9627, tions, of which H3K4me1, H3K4me2, H3K4me3, and 0.926, and 0.9555 were obtained on the four cell types, H3K27me3 were commonly selected features for the respectively. For most features, the ranges of values are Li et al. BMC Bioinformatics (2018) 19:202 Page 8 of 14 elevated in A-E and A-P relative to the background cat- We investigated how many among our genome-wide egories. Half of the selected features are DNase-seq and predictions are supported by the VISTA enhancer set [43]. histone modification ChIP-seq data including H3K4me2, Despite the fact that the majority of the VISTA enhancers H3K27ac, and H3K27me3. The box plots of these fea- are extremely conserved across development, we still find tures indicate that they distinguish A-E and A-P from that 37.1% (850/2,293) of experimentally confirmed and unconfirmed VISTA enhancers overlap with the predicted background [2, 39, 40]. A-Es, while merely 4.8% (110/2,293) of these VISTA The majority of DECRES’s genome-wide predictions are enhancers overlap with the predicted A-Ps. Results for supported by other methods experimentally confirmed VISTA enhancers are similar We trained 2- and 3-class multilayer perceptron (MLP) (482/1,196 = 40.30% and 60/1,196 = 5.02% overlap A-Es models (see Methods) using all reference (labelled) data and A-Ps, respectively), which suggests that our predicted for training, in order to predict CRRs across the entire active enhancers have real enhancer functions. A propor- genome for six cell types (A549 and MCF7 were excluded). tion of the VISTA enhancers not overlapping our predic- The 2-class model identified 227,332 CRRs (adjacent tions could be active specifically during development or in regions were merged), which occupy 4.8% of the genome other cell types than our focus cell lines. (Additional file 1: Table S4). A total of 9153 CRRs were ubiquitously predicted across all six cell types. For the DECRES extends the FANTOM enhancer atlas 3-class prediction, we obtained 301,650 A-E regions (6.8% Due to the limited depth of CAGE signals for eRNAs, a of the genome) and 26,555 A-P regions (0.6% of the portion of active (or transcribed) enhancers will not have genome) together with 11,886 ubiquitous A-Es and 3678 been detected in the original compilation of the enhancer ubiquitous A-Ps. The genome-wide predictions for all six atlas. Hence, we sought to identify additional partially cell types are available in Additional file 2. supported enhancers for which eRNA signals were below Next, we examined the overlap of our predicted CRRs the original atlas threshold settings [4]. In the previous with the Combined [12]anddReg[22] predictions on work, a total of 200,171 bidirectionally transcribed (BDT) GM12878, HelaS3, and K562. The majority of CRRs loci were detected across the human genome, using CAGE predicted by DECRES overlap with the results from tags of 808 cell types and tissues. After excluding BDT loci either Combined or dReg, specifically 86.13%, 76.13%, within exons, a partially supported set of 102,021 BDT and 83.63% for GM12878, HelaS3, and K562, respec- regions remained, of which 43,011 balanced loci (simi- tively (Fig. 5). A subset (13.87% on GM12878, 23.87% lareRNAlevelsonbothsides) constitute theFANTOM enhancer atlas [4]. In order to investigate whether more on HelaS3, and 16.37% on K562) of DECRES predic- tions do not overlap with predictions from the other active enhancer candidates can be detected for each of the two tools. Notably, a large portion of the Combined pre- six cell types, we trained a MLP on its active atlas regions, dictions (56.78% on HelaS3, 55.99% on GM12878, and and predicted classes for all 102,021 BDT sites. Among 36.36% on K562) do not overlap with those from the the 102,021 BDT loci, most were classified as negative supervised methods, which is consistent with its low regions in a given cell (Additional file 1: Table S5), while observed validation rate [14]. Furthermore, DECRES pre- on average 13,316 were predicted as A-Es and only 834 dictions tend to have a finer resolution for both A-P were predicted as A-Ps per cell type. A substantial num- and A-E regions (see Additional file 1:FigureS14 foran ber (6535 on average) of inactive enhancers in the original example). enhancer atlas were predicted as active by our model ab c Fig. 5 Agreements of the DECRES CRRs with the Combined and dReg CRRs on three cell types (a: GM12878, b:HelaS3, c: K562), respectively. The TSS, PF, E, and WE segmentations from Combined were relabelled to CRRs. The active transcriptional regulatory elements (TREs) predicted by dReg were renamed to CRRs Li et al. BMC Bioinformatics (2018) 19:202 Page 9 of 14 (Additional file 1: Table S6), consistent with the assump- file 1: Figures S17 and S18) [50]. HNF1A, HNF1B, FOXA1, tion that BDT data is incomplete for any given sample. On FOXA2, HNF4A, and HNF4G factors regulate liver- average 5514 BDT loci excluded by the original atlas, were specific genes (Additional file 1: Figures S19 and S20) predicted as A-Es per cell type. Over the six analyzed cell [51, 52]. NFY factors cooperate with GATA1 to mediate types, a total of 38,601 BDT loci were predicted as A-Es erythroid-specific transcription in K562 (Additional file 1: Figures S25 and S26) [53]. (Additional file 3), of which 16,988 represent an expan- sion of the original FANTOM enhancer atlas. Note that We performed functional and enrichment analysis on 21,398 out of 43,011 enhancers from the original FAN- the A-E and A-P predictions from the Combined method TOM enhancer atlas are not predicted as active in the six [12], and report the results in Additional file 1:Figures cells analyzed here, but these regions may be active in the S27-S30. Most of the predicted promoters by the Com- other 802 cells for which there are inadequate features to bined method are distal to known gene TSSs, which is analyze. similar to enhancers. For instance on cell line GM12878, only 22% of the Combined promoters are located less than Computational validation of DECRES’s prediction using 5 kbp to the annotated gene TSSs, compared to 47% of the functional and motif enrichment analysis DECRES promoters. Moreover, functional analysis on the We performed functional enrichment analysis on the CRRs predicted by the Combined method returned much genome-wide predicted A-Es and A-Ps using GREAT [44]. less or zero significant terms for GO biological process, For GM12878 cells, 79% of predicted enhancer regions are MSigDB pathway, and disease ontology than the DECRES more than 5 kilobase pairs (kbps) away from gene TSSs predictions. The motif analysis results of both methods (Additional file 1: Figure S15A), while 47% of predicted are consistent. promoters are less than 5 kbps to the annotated gene TSSs (Additional file 1: Figure S15B). Similar statistics Discussion were obtained for the remaining five cell types. Annota- Our study brings together a large collection of high- tion analyses of the GM12878-specific CRRs show that throughput data from global projects to allow for super- proximal genes are associated to: immune response from vised annotation. One key challenge in such analysis is the gene ontology (GO) annotations (Additional file 1:Figure depth of validation. In this report, validation is assessed S15C); B cell signalling pathways from MSigDB Path- using existing collections of reliable enhancers, including way annotations (Additional file 1: Figure S15D); and CAGE [4], and laboratory validated sets from CRE-seq leukemia from disease ontology annotations (Additional [14], and, on a small-scale, transgenic mouse assays [43]), showing that the supervised approach nears 89% sensi- file 1: Figure S15E). Results are consistent with the lym- phoblastoid lineage of the cells. Next, we performed tivity. While we compare to multiple laboratory validated functional enrichment analysis on the BDT-supported sets retrospectively, a prospective assessment would have predicted enhancers not previously reported in the broad value. In light of the recent advances in both big FANTOM enhancer atlas (“not in atlas”). Results are data analysis methods and genome-scale data generation, fully consistent with the above analysis (Additional file 1: we believe it is opportune to launch a global prospective Figure S16). assessment, such as enabled within the DREAM Chal- We further carried out motif enrichment analysis lenges program [54]. Such a test for annotation of cis- on the predicted cell-specific CRRs and not-in-atlas regulatory regions in the human genome would inspire enhancers using HOMER [45]. The predicted regions are the machine learning community to push the perfor- enriched for motifs similar to JASPAR binding profiles mance limit of supervised CRR-prediction methods, and [46] (Additional file 1: Figure S15F and Figures S16-S26) would encourage laboratory biologists to accelerate cell both associated to TFs maintaining general cell processes type-specific data generation. and TFs with selective roles in cell-related functions. For Enhancers and promoters have both common and dis- instance, motifs for Jun-, Fos-, and Ets-related factors were tinct characteristics. In our cross-validations, we show enriched in regions from all six cell types. These TFs that A-E and A-P are highly separable (Fig. 1a), while regulate general cellular progresses such as differentia- better performance can be obtained if A-E and A-P are tion, proliferation, or apoptosis [47, 48]. Cell-appropriate treated as a single class (Additional file 1:FigureS1C). TF enrichments were observed for each cell (summa- Both continuous (merging enhancers and promoters rized in Additional file 1: Table S7). For example, RUNX1 together) and distinct models (treating enhancers and and other Runt-related factors, which play crucial roles promoters separately) have limitations. While a continu- in haematopoiesis, are observed in GM12878 (Additional ous model may overlook functional differences, a distinct model may overemphasize such differences. A potentially file 1: Figure S15F and Figure S16) [49]. C/EBP-related fac- tors that regulate genes involved in immune and inflam- better prediction model might require two hierarchical matory responses are expressed in cervix (Additional steps. It could first distinguish CRRs from the background Li et al. BMC Bioinformatics (2018) 19:202 Page 10 of 14 genome, then assign a continuous score to each candidate for the GTEx project [62]. The current predictions using region indicating the likelihood of being an enhancer. Fur- MLPs and future annotations using CNNs and RNNs can ther clustering and subtyping may be necessary. It is worth integrate sequence variations (captured in alignment of mentioning that the CAGE-defined enhancers used in this short sequence reads of ChIP-seq and other sequencing study may introduce some bias towards capturing a spe- techniques) and RNA-seq gene expression data of a cell type of interest, so that the impact of genetic variations in cific class of enhancers which exhibit reasonably strong and detectable transcription. To further investigate the non-coding regions can be prioritized. characteristics of enhancers and improve genome-wide prediction, enhancers detected by other techniques, such Conclusions as GRO-seq, will need to be considered in the future. Using FANTOM data for training, we show that super- Our predicted CRRs take a substantial but small por- vised deep learning methods are able to accurately pre- tion of the non-coding regions, previously known as “junk dict active enhancers and promoters across the human DNA”. It may be because only six cell types are consid- genome. Models incorporating cell-specific data outper- ered in this work. Nevertheless, we have already seen that form models restricted to universal data (e.g. sequence), the non-coding regions exhibit regulatory functionality. and highlight key experimental features that tend to be It would be interesting in the next phase to collect data incorporated into predictive models when available. We from a large number of cell types and examine the cover- explore the relative performance of 2- and 3-class mod- age, which will unveil whether regulatory regions have an els that either group or separate enhancers and pro- oasis pattern. It may also imply that certain fragments of moters. Finally, we deliver a comprehensive collection of the non-coding regions play other partially known (such annotations, that label 6.8% of the genome as enhancers as suppression, domain boundary, and development) and and 0.6% as promoters in one or more of six well- unknown roles. characterized cells. As already advocated in our review [7], two other deep Accurate annotation of regulatory regions across the learning models might be well suited to improve annota- human genome is essential for genome interpretation. tions of non-coding regions. One method is convolutional With genome sequencing transitioning to a standard clini- neural networks (CNNs), which can take into account cal test, the ability to move beyond the analysis of protein- the topological properties of features. The other is bidi- coding alterations has the potential to expand clinical rectional recurrent neural networks (RNNs), which can diagnostic capacity to explain observed genetic disorders. consider the information from adjacent regions (i.e. the By demonstrating the suitability of supervised deep learn- context). Such an approach can be potentially applied to ing methods to label regulatory regions, we now enter into annotate regulatory domains or complexes where exons, a new stage of genome annotation. In the next few years, introns, promoters, enhancers, silencers, and insulators we anticipate that characterization of regulatory prop- form cohorts for specific functionalities. Bidirectional erties in specific cell populations will accelerate, using RNNs have a smoothing effect, making the predictions both chromatin-based and sequencing-based methods. context-dependent. Development of CNN- and RNN- As demonstrated in this report, deep learning methods based models for prediction of enhancers using sequence are well suited for the challenge of using the expanded data information has just emerged [55]. We foresee more for reliable annotation of the genome. sophisticateddeeplearningmodelsinthe nearfuture We anticipate that the collection of regulatory region for comprehensive genome annotations. To prevent pre- annotations provided in this study will have broad util- dictions from jumping between states, smoothing has ity for genome interpretation, and that the demonstra- been taken into consideration in a deep neural network tion of the sufficiency of training data and the utility of combined with hidden Markov model [56, 57]. Com- deep learning supervised methods for CRR prediction will bined with MLPs, CNNs, or RNNs, other newly published move the discussion to a highly applied period of high- deep feature selection techniques, such as layer-wise rele- quality annotation. Understanding how CRRs interact and vance propagation [58] and class saliency extraction [59], how they link to their target genes is the key to decipher might be useful to identify informative signal peaks for the cis-regulatory mechanism. We expect that further cis-regulatory elements of focus. Furthermore, transfer development of integrative machine learning methods learning [60] and multi-task learning [37] techniques [63, 64] is crucial to reconstruct such a gene regulatory might be useful in the design of deep predictive mod- system. els, particularly when the number of learning examples of one cell type is limited or a region allows several anno- Methods tations. Assessing the impact of sequence variations in Data non-coding regions on gene expression and phenotypes is For the purpose of supervised analysis, we collected fea- of high clinical interest [32, 61], which was one motivation ture data from ENCODE [10] along with the trans- Li et al. BMC Bioinformatics (2018) 19:202 Page 11 of 14 criptionally active enhancers and promoters from eight learning rate could reduce gradually as the number of iter- matched cell types catalogued by the FANTOM effort ations increases. Using batch-size 100, stochastic gradient [4, 24]. These cell types include A549, GM12878, HelaS3, descent was employed to optimize the model param- HepG2, HMEC, HUVEC, K562, and MCF7. For each cell eters. Rectified linear unit (ReLU) activation function type, we defined seven classes of labelled regions, includ- [72] was used. When evaluating the classification perfor- ingA-E,I-E,A-P,I-P,A-X,I-X,and UK.The librariesof mance of various experiments, we randomly sampled a enhancers and promoters were downloaded from [65]. A- balanced training set with maximally 70% of examples Es and I-Es were defined as FANTOM enhancers with and maximally 3000 examples in each class, the remaining TPM>0 (tags per million) and TPM=0, respectively. A- data were further randomly sampled to generate a corre- Ps and I-Ps were randomly selected FANTOM promoters sponding test set with proportion of A-E:A-P:A-X:I-E:I- with TPM>5andTPM=0, respectively. A-X and I-X were P:I-X:UK=1:1:1:2:2:1:10 to mimic the true genome-wide defined based on exons’ transcription levels measured by background among the classes (Same ratio was used in RNA-seq [66]. An exon with peak-max greater than 400 the comparison with ChromHMM and Combined meth- (equal to 0) was defined as A-X (I-X). The UK regions were ods). A training set was further partitioned into a training sampled from the genome regions excluding all FANTOM subset for model learning and a validation subset for CAGE tags, all exons, and DNaseI open regions. The num- early termination. We scaled each feature in the training bers of labelled regions used in this study are listed in subset to [0,1], and applied the same estimated scaling Additional file 1: Table S1. For each cell type, we built factors to the validation subset and test set. Class-wise a comprehensive feature set, integrating histone modifi- rate (CWR, i.e. averaged sensitivity and specificity for cation and TF binding ChIP-seq, DNase-seq, FAIRE-seq, two classes, and averaged sensitivities for more than two and ChIA-PET data from the ENCODE project [66, 67]. classes), area under the receiver operating characteris- These features characterize the activities of enhancers tic curve (auROC), and area under the precision-recall and promoters in cell-specific aspects. Additionally, CpG curve (auPRC) were calculated to measure the classifica- islands and phastCons evolutionary conservation scores tion performance. The above procedure was repeated 100 were included, because it is well recognized that some times to determine the means and variations of CWRs, regulatory regions are highly GC-rich and extremely con- auROCs, and auPRCs. Random forest [73]was compared served. For each labelled region, the mean value of feature on the same training-test splits in our experiments. Before signals fall within a bin centered at the region was taken predicting regulatory regions in the whole genome, all as the feature value using bwtool [68]. We tried different labelled A-E samples and 3000 samples in each of the bin sizes including 200, 500, 1,000, 2,000, and 4,000 bps. otherclasses of acelltypewereusedtotrainthenetwork. Since 200 bps worked the best in cross-validation tests, we used it throughout our analyses. The numbers of fea- Feature selection tures used for each cell type and the numbers of common Based on our newly devised deep feature selection (DFS) features between any pair of them are given in Additional model [26], we designed the randomized DFS (RDFS), file 1: Table S2. A combined list of features is provided which is a deep extension of randomized LASSO [74], for in Additional file 4. Our labelled data are downloadable stably selecting subsets of discriminative features. from [69]. Addressing the limitations of sparse linear models for feature selection, DFS is able to model the non-linearity Deep learning for classification ofthefeaturesandselect asinglesubsetoffeaturesfor Based on the Deep Learning Tutorials [70]and multi-class data. The main idea of DFS is to add a one- Theano [71], we implemented a deep learning pack- to-one linear layer (named feature-selection layer) to the age named DECRES (DEep learning for identifying Cis- above described feedforward neural network. For the i-th Regulatory ElementS and other applications) which is input feature x , the output of the feature-selection layer available at [69]. We applied a supervised deep model – becomes w x . Thus, the parameter of this layer is a vector i i feedforward neural network (also known as multilayer w. By shrinking w,someofitselements turn to zeros,such perceptrons or MLP) for the detection of regulatory that the corresponding features do not contribute to the regions. For each experiment, we conducted a model classification at all. The upper hidden layers of the model search from no hidden layers upto three hidden lay- have the capability of modelling non-linear interactions ers with maximally 256, 128, and 64 units in the first, in the data. The feature selection layer allows to select a second, and third hidden layers. Initial learning rate, single subset of features for multi-class problem. l -regularization amount (to control model complex- The randomized DFS procedure is similar to ran- ity), and momentum (to stabilize the optimization) were dom forest, both implementing the philosophy of the searched across ranges of values. The maximum num- wisdom of crowds. In randomized DFS, DFS with per- ber of allowed iterations was set to 1000. The initial turbed feature-wise sparsity control parameters runs on Li et al. BMC Bioinformatics (2018) 19:202 Page 12 of 14 a training subset which is a randomly sampled portion Acknowledgements We thank our colleagues in the Wasserman Laboratory for discussions and (usually ) from the entire training set, and selects the top feedback improving the quality of this paper. We thank Dr. Anshul Kundaje K features based on feature weights. This random proce- (Stanford) for making the processed ChIP-seq fold-change data and script dure is repeated, say 100, times to generate the empirical available. probability of each feature being selected. The empiri- Funding cal probabilities are used as feature importance. In our This research was launched by supports from the Genome Canada/Genome BC: 174DE (ABC4DE project), Canadian Institutes of Health Research (CIHR): experiment, we used ReLUs as activation functions and MOP-82875, the Natural Sciences and Engineering Research Council of set K =10 in this procedure. We scaled each feature to Canada (NSERC): RGPIN355532-10, and the National Institutes of Health (USA) [0,1] in the training set, and applied the estimated scal- grant: 1R01GM084875. WWW was supported by scholar awards from CIHR and the Michael Smith Foundation for Health Research (MSFHR) for directing this ing factors from the training set to the test set. From research. YL was supported by NSERC Postdoctoral Fellowship: PDF-471767- the most important to the least important features, we 2015 in the second stage of this project for in-depth analysis and writing of kept adding features to a MLP to evaluate how many this article. The computer systems of the Gene Regulation Bioinformatics Laboratory were funded by the Canada Foundation for Innovation and the BC features are sufficient for the prediction. Randomized Knowledge Development Fund, and were used for collecting and processing DFS was compared to random forest on the same parti- data in this project. WestGrid clusters of Compute Canada were used for tions of training and test sets in this study. The DFS and conducting computational experiments and analysis of this project. randomized DFS models were included in the DECRES Availability of data and materials package. The labelled datasets generated and analysed during the current study and Aiming at trading off the size of feature subset and clas- the DECRES package are available in https://github.com/yifeng-li/DECRES. sification performance (auPRC), we designed a method Authors’ contributions based on curving fitting that was applied in Fig. 4. YL developed the machine learning methods, compiled the data sets, performed computational analysis, and drafted this manuscript. WS Denoting a size of feature subset and corresponding participated in the evaluation of the findings. WWW initialized and supervised test auPRC by x and y respectively, we first fit function the project, and finished the final manuscript. All authors read and approved 2s y = arctan(kx) where k and s are scale parameters. the final manuscript. Once done, a point can be chosen on the curve given a Ethics approval and consent to participate 1 2ks−tπ Not applicable. proper tangent value (say t)using x = and k tπ 2s y = arctan(kx ). Since the values come from different t t Competing interests The authors declare that they have no competing interests. scales on x and y axes, we qualitatively used t = 0.5. Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Additional files Author details Additional file 1: This file contains supplemental Tables S1-S7,and Centre for Molecular Medicine and Therapeutics, BC Children’s Hospital supplemental Figures S1-S30. (PDF 27600 kb) Research Institute, Department of Medical Genetics, University of British Columbia, Rm 3109, 950 West 28th Avenue, V5Z 4H4 Vancouver, Canada. Additional file 2: Genome-wide predictions of cis-regulatory regions for Digital Technologies Research Centre, National Research Council Canada, all six cell types. (ZIP 20400 kb) Building M-50, 1200 Montreal Road, K1A 0R6 Ottawa, Canada. Additional file 3: Predictions on CAGE supported bidirectional loci. AiA: Active in the FANTOM Enhancer Atlas; IiA: Inactive in the FANTOM Received: 13 April 2017 Accepted: 4 May 2018 Enhancer Atlas; NiA: Not included in the FANTOM Enhancer Atlas; Specific: Predicted cell-specific A-Es. (ZIP 4820 kb) Additional file 4: List of features used for each cell type. (ZIP 1.36 kb) References 1. Pennacchio LA, Bickmore W, Dean A, Nobrega MA, Bejerano G. Enhancers: Five essential questions. Nat Rev Genet. 2013;14(2):288–95. Abbreviations 2. Shlyueva D, Stampfel G, Stark A. Transcriptional enhancers: From A-E: Active enhancer; A-P: Active promoter; A-X: Active exon; properties to genome-wide predictions. Nat Rev Genet. 2014;15:272–86. BDT: Bidirectionally transcribed; BG: Class merging I-E, I-P, A-X, I-X, and UK; 3. Andersson R, Sandelin A, Danko CG. A unified architecture of auPRC: Area under the precision-recall curve; auROC: Area under the receiver transcriptional regulatory elements. Trends Genet. 2015;31(8):426–33. operating characteristic curve; CAGE: Cap analysis of gene expression; 4. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, ChIP-seq: Chromatin immunoprecipitation followed by sequencing; Chen Y, Zhao C, Schmidl C, Suzuki T, Ntini E, Arner E, Valen E, Li K, CNN: Convolutional neural network; CRR: cis-regulatory region; Schwarzfischer L, Glatz D, Raithel J, Lilje B, Rapin N, Bagger FO, CWR: Class-wise rate, i.e. averaged sensitivity and specificity for two classes, Jørgensen M, Andersen PR, Bertin N, Rackham O, Burroughs AM, and averaged sensitivities for more than two classes; DECRES: Deep learning Baillie JK, Ishizu Y, Shimizu Y, Furuhata E, Maeda S, Negishi Y, Mungall CJ, for identifying cis-regulatory elements and other applications; DFS: Deep Meehan TF, Lassmann T, Itoh M, Kawaji H, Kondo N, Kawai J, feature selection; DRM: Dinucleotide repeat motif; eRNA: Enhancer RNA; Lennartsson A, Daub CO, Heutink P, Hume DA, Jensen TH, Suzuki H, ENCODE: Encyclopedia of DNA Elements; FANTOM: Functional Annotation of Hayashizaki Y, Müller F, FANTOM Consortium, Forrest AR, Carninci P, the Mammalian Genome; GRO-seq: Global run-on and sequencing; Rehli M, Sandelin A. An atlas of active enhancers across human cell types I-E: Inactive enhancer; I-P: Inactive promoter; I-X: Inactive exon; MLP: Multilayer and tissues. Nature. 2014;507:455–61. perceptron; MPRA: Massively parallel reporter assay; Pol2: RNA polymerase II; 5. Core LJ, Martins AL, Danko CG, Waters CT, Siepel A, Lis JT. Analysis of RDFS: Randomized deep feature selection; RF: Random forest; RNN: Recurrent nascent RNA identifies a unified architecture of initiation regions at neural network; ReLU: Rectified linear unit; TF: Transcription factor; TPM: tags mammalian promoters and enhancers. Nat Genet. 2014;46(12):1311–20. per million; UK: Unknown/uncharacterized Li et al. BMC Bioinformatics (2018) 19:202 Page 13 of 14 6. Wasserman WW, Sandelin A. Applied bioinformatics for the identification Jojic N, Scherer S, Blencowe B, Frey B. The human splicing code reveals of regulatory elements. Nat Rev Genet. 2004;5(4):276–87. new insights into the genetic determinants of disease. Science. 7. Li Y, Chen C, Kaye AM, Wasserman WW. The identification of 2015;347(6218):1254806. cis-regulatory elements: A review from a machine learning perspective. 32. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with BioSystems. 2015;138:6–17. deep learning-based sequence model. Nat Methods. 2015;12(10):931–4. 8. Ernst J, Kellis M. ChromHMM: Automating chromatin-state discovery and 33. Alipanhi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence characterization. Nat Methods. 2012;9(3):215–6. specificities of DNA- and RNA-binding proteins by deep learning. 9. Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Nobel WS. Nat Biotechnol. 2015;33(8):831–8. Unsupervised pattern discovery in human chromatin structure through 34. Spencer M, Eickholt J, Cheng J. A deep learning network approach to ab genomic segmentation. Nat Methods. 2012;9(5):473–6. initio protein secondary structure prediction. IEEE/ACM Trans Comput 10. The ENCODE Project Consortium. An integrated encyclopedia of DNA Biol Bioinforma. 2015;12(1):103–12. elements in the human genome. Nature. 2012;489:57–74. 35. Kheradpour P, Ernst J, Mlenikov A, Rogov P, Wang L, Zhang X, Alston J, 11. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of Mikkelsen TS, Kellis M. Systematic dissection of regulatory motifs in 2000 in vivo protein-DNA interactions. Science. 2007;316(5830):447–55. predicted human enhancers using a massively parallel reporter assay. 12. Hoffman MM, Ernst J, Wilder SP, Kundaje A, Harris RS, Libbrecht M, Genome Res. 2013;23(5):800–11. Giardine B, Ellenbogen PM, Bilmes JA, Birney E, Hardison RC, Dunham I, 36. Fletez-Brant C, Lee D, McCallion AS, Beer MA. kmer-SVM: A web server Kellis M, Noble WS. Integrative annotation of chromatin elements from for identifying predictive regulatory sequence features in genomic data ENCODE data. Nucleic Acids Res. 2013;41(2):827–41. sets. Nucleic Acids Res. 2013;41:544–6. 13. Kwasnieski JC, Mogno I, Myers CA, Corbo JC, Cohen BA. Complex 37. Setty M, Leslie CS. SeqGL identifies context-dependent binding signals in effects of nucleotide variants in a mammalian cis-regulatory element. genome-wide regulatory element maps. PLoS Comput Biol. 2015;11(5): Proc Natl Acad Sci. 2012;109(27):19498–503. 1004271. 14. Kwasnieski JC, Fiore C, Chaudhari HG, Cohen BA. High-throughput 38. Deaton AM, Bird A. CpG islands and the regulation of transcription. functional testing of ENCODE segmentation predictions. Genome Res. Genes Dev. 2011;25(10):1010–22. 2014;24:1595–602. 39. Wang Y, Li X, Hua H. H3K4me2 reliably defines transcription factor 15. Yip KY, Cheng C, Bhardwaj N, Brown JB, Leng J, Kundaje A, Rozowsky J, binding regions in different cells. Genomics. 2014;103(2-3):222–8. Birney E, Bickel P, Snyder M, Gerstein M. Classification of human 40. Zhou VW, Goren A, Bernstein BE. Charting histone modifications and the genomic regions based on experimentally determined binding sites of functional organization of mammalian genomes. Nat Rev Genet. more than 100 transcription-related factors. Genome Biol. 2012;13:48. 2011;12:7–18. 16. Rajagopal N, Xie W, Li Y, Wagner U, Wang W, Stamatoyannopoulos J, 41. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D. GeneCards: Integrating Ernst J, Kellis M, Ren B. RFECS: A random-forest based algorithm for information about genes, proteins and diseases. Trends Genet. enhancer identification from chromatin state. PLoS Comput Biol. 1997;13(4):163. 2013;9(3):1002968. 42. Witte S, Bradley A, Enright AJ, Muljo SA. High-density P300 enhancers 17. Lu Y, Qu W, Shan G, Zhang C. DELTA: A distal enhancer locating tool control cell state transitions. BMC Genomics. 2015;16:903. based on AdaBoost algorithm and shape features of chromatin 43. Visel A, Minovitsky S, Dubchak I, Pennacchio LA. VISTA Enhancer modifications. PLoS ONE. 2015;10(6):0130622. Browser – a database of tissue-specific human enhancers. Nucleic Acids 18. Chen C, Morris Q, Mitchell JA. Enhancer identification in mouse Res. 2007;35:88–92. embryonic stem cell using integrative modeling of chromatin and 44. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, genomic features. BMC Genomics. 2012;13:152. Wenger AM, Bejerano G. GREAT improves functional interpretation of 19. Arnold CD, Gerlach D, Stelzer C, Boryn LM, Rath M, Stark A. cis-regulatory regions. Nat Biotechnol. 2010;28(5):495–501. Genome-wise quantitative enhancer activity maps identified by 45. Heinz S, Benner C, Spann N, Bertolino E, et al. Simple combinations of STARR-seq. Science. 2013;339:1074–7. lineage-determining transcription factors prime cis-regulatory elements 20. Yanez-Cuna JO, Arnold CD, Stampfel G, Boryn LM, Gerlach D, Rath M, required for macrophage and B cell identities. Mol Cell. 2010;38(4):576–89. Stark A. Dissection of thousands of cell type-specific enhancers identifies 46. Mathelier A, Fornes O, Arenillas DJ, Chen CY, Denay G, Lee J, Shi W, dinucleotide repeat motifs as general enhancer features. Genome Res. Shyr C, Tan G, Worsley-Hunt R, Zhang AW, Parcy F, Lenhard B, 2014;24:1147–56. Sandelin A, Wasserman WW. JASPAR 2016: A major expansion and 21. Core LJ, Waterfall JJ, Lis JT. Nascent RNA sequencing reveals widespread update of the open-access database of transcription factor binding pausing and divergent initiation at human promoters. Science. 2008;322: profiles. Nucleic Acids Res. 2016;44(D1):110–5. 1845–8. 47. Ameyar M, Wisniewska M, Weitzman JB. A role for AP-1 in apoptosis: The 22. Danko CG, Hyland SL, Core LJ, Martins AL, Waters CT, Lee HW, case for and against. Biochimie. 2003;85(8):747–52. Cheung VG, Kraus WL, Lis JT, Siepel A. Identification of active 48. Sharrocks AD. The ETS-domain transcription factor family. Nat Rev Mol transcriptional regulatory elements from GRO-seq data. Nat Methods. Cell Biol. 2001;2(11):827–37. 2015;12:433–8. 49. Okuda T, Nishimura M, Nakao M, Fujita Y. RUNX1/AML1: A central player 23. Kodzius R, Kojima M, Nishiyori H, Nakamura M, Fukuda S, Tagami M, in hematopoiesis. Int J Hematol. 2001;74(3):252–7. Sasaki D, Imamura K, Kai C, Harbers M, Hayashizaki Y, Carninci P. CAGE: 50. Arnett B, Soisson P, Ducatman BS, Zhang P. Expression of CAAT Cap analysis of gene expression. Nat Methods. 2006;3:211–22. enhancer binding protein beta (C/EBP beta) in cervix and endometrium. 24. The FANTOM Consortium, The RIKEN PMI, CLST (DGT). A promoter-level Mol Cancer. 2003;2:21. mammalian expression atlas. Nature. 2014;507:462–70. 51. Costa RH, Kalinichenko VV, Holterman AX, Wang X. Transcription factors 25. Kleftogiannis D, Kalnis P, Bajic VB. DEEP: A general compuational in liver development, differentiation, and regeneration. Hepatology. framework for predicting enhancers. Nucleic Acids Res. 2015;43(1):6. 2003;38(6):1331–47. 26. Li Y, Chen C, Wasserman WW. Deep feature selection: Theory and 52. Wang Z, Bishop EP, Burke PA. Expression profile analysis of the application to identify enhancers and promoters. J Comput Biol. inflammatory response regulated by hepatocyte nuclear factor 4α. 2016;23(5):322–36. BMC Genomics. 2011;12:128. 27. Hinton GE, Osindero S, Teh Y. A fast learning algorithm for deep belief 53. Fleming JD, Pavesi G, Benatti P, Imbriano C, Mantovani R, Struhl K. NF-Y nets. Neural Comput. 2006;18:1527–54. coassociates with FOS at promoters, enhancers, repetitive elements, and 28. Hinton G, Salakhutdinov R. Reducing the dimensionality of data with inactive chromatin regions, and is stereo-positioned with neural networks. Science. 2006;313:504–7. growth-controlling transcription factors. Genome Res. 2013;23(8): 29. Bengio Y, Courville A, Vincent P. Representation learning: A review and 1195–209. new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35(8): 54. DREAM Challenges. http://dreamchallenges.org. 1798–828. 30. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44. 55. Yang B, Liu F, Ren C, Ouyang Z, Xie Z, Bo X, Shu W. BiRen: Predicting 31. Xiong HY, Alipanahi B, Lee L, Bretschneider H, Merico D, Yuen R, Hua Y, enhancers with a deep-learning-based model using the DNA sequence Gueroussov S, Najafabadi H, Hughes T, Morris Q, Barash Y, Krainer A, alone. Bioinformatics. 2017;33(13):1930–6. Li et al. BMC Bioinformatics (2018) 19:202 Page 14 of 14 56. Liu F, Ren C, Li H, Zhou P, Bo X, Shu W. De novo identification of replication-timing domains in the human genome by deep learning. Bioinformatics. 2016;32(5):641–9. 57. Liu F, Ren C, Bo X, Shu W. PEDLA predicting enhancers with a deep learning-based algorithmic framework. Sci Rep. 2016;6:28517. 58. Bach S, Binder A, Montavon G, Klauschen F, Muller K-R, Samek W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE. 2015;10(7):0130140. 59. Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. In: International Conference on Learning Representations Workshop. 2014. https://iclr.cc/archive/2014/workshop-proceedings. 60. Pan S. J, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2010;22(10):1345–59. 61. Kelley DR, Snoek J, Rinn JL. Basset: Learning the regulatory code of the accessible genome wide deep convolutional neural networks. Genome Res. 2016;26:990–9. 62. GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat Genet. 2013;45(6):580–5. 63. Li Y, Wu FX, Ngom A. A review on machine learning principles for multi-view biological data integration. Brief Bioinforma. 2018;19(2):325–40. 64. Eser U, Churchman L. S. FIDDLE: An integrative deep learning framework for functional genomic data inference. bioRxiv. https://doi.org/10.1101/ 65. FANTOM5 Data. http://fantom.gsc.riken.jp/5/data. 66. ENCODE Data. ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/ encodeDCC. 67. ENCODE Fold-Change Data. https://sites.google.com/site/anshulkundaje. 68. Pohl A, Beato M. bwtool: A tool for bigWig files. Bioinformatics. 2014;30(11):1618–9. 69. DECRES: Deep Learning Methods for Identifying Cis-Regulatory Elements and Other Applications. https://github.com/yifeng-li/DECRES. 70. Deep Learning Tutorials. http://deeplearning.net/tutorial. 71. Theano. http://deeplearning.net/software/theano. 72. Nair V, Hinton G. Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning (ICML). 2010. p. 807–14. 73. Breiman L. Random forests. Mach Learn. 2001;45:5–32. 74. Meinshausen U, Buhlmann P. Stability selection. J R Stat Soc Ser B Stat Methodol. 2010;72(4):417–73.

Journal

BMC BioinformaticsSpringer Journals

Published: May 31, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off