Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics, and epigenetics data

Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics, and... Genome sequences for hundreds of mammalian species are available, but an understanding of their genomic regulatory regions, which control gene expression, is only beginning. A comprehensive prediction of potential active regulatory regions is necessary to functionally study the roles of the majority of genomic variants in evolution, domestication, and animal production. We developed a computational method to predict regulatory DNA sequences (promoters, enhancers, and transcription factor binding sites) in production animals (cows and pigs) and extended its broad applicability to other mammals. The method utilizes human regulatory features identified from thousands of tissues, cell lines, and experimental assays to find homologous regions that are conserved in sequences and genome organization and are enriched for regulatory elements in the genome sequences of other mammalian species. Importantly, we developed a filtering strategy, including a machine learning classification method, to utilize a very small number of species-specific experimental datasets available to select for the likely active regulatory regions. The method finds the optimal combination of sensitivity and accuracy to unbiasedly predict regulatory regions in mammalian species. Furthermore, we demonstrated the utility of the predicted regulatory datasets in cattle for prioritizing variants associated with multiple production and climate change adaptation traits and identifying potential genome editing targets. Keywords: regulatory genomics; mammalian genome; cattle; pigs; enhancers; promoters; transcription factors; SNP; PLAG1, Poll Received: 8 June 2017; Revised: 7 November 2017; Accepted: 22 December 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 Nguyen et al. identify a part of the regulatory repertory in the targeted species. Background We recognize that species-specific regulatory elements may be Predicting functional features of the genome beyond protein- underrepresented in this process. However, we note that the coding regions has been the primary focus of the post-genome fundamental biology of, e.g., that encompassing developmental sequencing era [1, 2]. More than 90% of common genetic vari- programs, response to stimuli, reproduction, energy homeosta- ants associated with phenotypic variation of complex traits are sis, and many other systems shows considerable conservation located in intergenic and intronic regions that regulate gene ex- of components and processes across species [17, 19, 22]. pression but do not change protein structure [3–5]. Moreover, In the current research, we developed the Human Projec- SNPs associated with diseases such as autoimmune diseases, tion of Regulatory Regions (HPRS) method to utilize results from multiple sclerosis, Crohn’s disease, rheumatoid arthritis, and thousands of biochemical assays in human samples to compu- type 1 diabetes are strikingly enriched in promoters and en- tationally predict equivalent information in other mammalian hancers [4, 6, 7]. Annotation of functional regions of the genome species. The method exploits the conservation of regulatory ele- that harbour SNPs identified by genome-wide association stud- ments at the DNA sequence and genome organizational levels to ies (GWAS) to be significantly associated with variation in phe- map these elements to other mammalian species. It then uses notype will contribute to the identification of functional SNPs species-specific data to filter these mapped sequences, which and causative mutations, thereby suggesting genetic targets and are enriched for regulatory sequence features, to predict a set markers for numerous applications in human health care and of high-confidence regulatory regions. We selected cattle as the agricultural livestock production [8–11]. target species to build the HPRS pipeline and then used the pig However, in mammalian species other than the human and as a test species to validate the pipeline. The 2 species are im- mouse, there are few data available at the genome level for dis- portant agricultural ruminant and nonruminant species, respec- covery of regulatory elements. The recently established Func- tively, with genomes sequenced but with little information avail- tional Annotation of ANimal Genomes (FAANG) consortium has able about genomic regulatory regions [12]. We also applied the begun to address this deficiency in a coordinated fashion [ 12, method to the genomes of 8 additional mammals. We demon- 13]. It is expected that core assays identifying regulatory ele- strated that the predicted regulatory dataset produced by the ments for key tissues in a number of production animals will be HPRS pipeline is useful for selecting more likely functional SNPs produced by the FAANG consortium and collaborators. However, before (e.g., for SNP chip design) and after (e.g., for prioritiz- the information generated in the foreseeable future for livestock ing significant SNPs) GWAS analysis, genomic prediction mod- is likely to remain far less comprehensive for coverage of tis- els, and the understanding of biological mechanisms underlying sues, sampling conditions, and breadth of annotation of regu- noncoding genomic variant effects to potentially identify regu- latory elements compared with the human and mouse. The de- latory targets for genome editing. ficiency in the genome-wide prediction of regulatory elements is far greater for most other mammalian species. We have de- veloped a computational method to utilize thousands of human Results and Discussion regulatory datasets to predict regulatory elements in important A pipeline for the projection of human genomic mammalian species. features to other mammals Transcriptional regulatory DNA elements (RDEs) are defined as genomic regions that are binding sites for 1, or usually a The 4 key elements of the HPRS pipeline (Fig. 1) include (1) se- combination of, transcription factors (TFs) and transcriptional lection of suitable regulatory datatypes (biochemical assays) and coregulators [14–16]. Across distant species from C. elegans to tissues in humans; (2) mapping the selected features to the tar- D. melanogaster to humans, the architecture of gene regulatory get species by utilizing conservation of genome organization networks, organization of chromatin topological domains, chro- and sequence identity to maximize coverage without compro- matin context at enhancer and promoter regions, and nucle- mising specificity; (3) first round filtering of the mapped regions osome positioning are remarkably conserved [17, 18]. For ex- to retain high-confidence mapped features, which had strict 1- ample, the majority of co-associations of transcription factors to-1 forward and reciprocal mapping and where human fea- (i.e., combinations of different transcription factors binding to tures have multiple mappings to the target genome, keeping the same genomic region) at proximal transcription start site only those with high sequence identity; and (4) second round regions in humans remain proximal in the worm (80%) and filtering by applying a pipeline to utilize available (often limited fly (100%). Large-scale comparisons between the human and in scale and coverage) species-specific data to prioritize regions mouse (M. musculus) in the Encyclopedia of DNA Elements (EN- likely to be functional in the target species. CODE) project found a high level of conservation of binding mo- tifs and activities, including TF binding to different chromatin states (r = 0.9), proportion of enhancers in TF binding regions Optimizing parameters for mapping sequence features (r = 0.7), DNA methylation preferences within TF-occupied re- across genomes gions (hypomethylated regions in both species), and TFs sharing To identify regions that were likely to be orthologous between a conserved primary binding motif sequence (∼94% of studied TFs) [19]. The human Encyclopedia of DNA Elements (ENCODE), genomes, we deployed the liftOver tool [23]and theprecom- puted alignment files available from UCSC to map regulatory Functional Annotation of the Mammalian genome (FANTOM), Roadmap Epigenomics Mapping Consortium (ROADMAP), and regions in the human genome to the cattle genome based on sequence similarity and genome location. First, we optimized related projects have generated large volumes of data relevant to the identification of promoters, enhancers, and other RDEs [ 6, the minMatch mapping threshold of the liftOver tool, which is the minimum proportion of bases to the total length of a region 20, 21]. However, these data have not been utilized for predict- ing regulatory genomic regions in other mammalian species—a mappable to contiguous aligned segments in the target genome. strategy that can produce more comprehensive predictions than The minMatch parameter was thoroughly tested with a range alternative options using a small set of experimental assays to from high stringency 0.95 down to 0.1 (Fig. 2). The minMatch Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 3 Figure 1: A streamlined workflow for the prediction of regulatory regions. Four key steps include mapping human regulatory regions to a target genome (creating a universal dataset), filtering the mapped regions by 7 epigenomic, transcriptomic, and genomic criteria to keep only regions with potential regulat ory functions, validating the predicted regions by comparing with the known reference dataset, and translating the findings to potential applications in genomic te chnology. parameter values were assessed using 7 diverse datasets (Fig. mapped SSS were within 25 bp of the boundaries of the original 2,Table 1). SSS in the human genome. The percentage of regions mappable to the target genome In all 5 enhancer datasets tested as shown in Fig. 2A, the ra- was compared with the total number of elements in the hu- tio of mapped regions increased steadily when the minMatch man regulatory databases (Fig. 2A). For cattle, mappable regions parameter was reduced from 0.95 to 0.55, with a much slower were defined as: (1) a small sequence segment (SSS) that can be increase when the minMatch was reduced from 0.55 to 0.10 mapped from the human to the bovine genome; (2) the resulting (Fig. 2A). The accuracy of the sequence projection was assessed SSS can be mapped back (reciprocally mapped) from the bovine as the percentage of mapped regions that overlapped with a to the human genome; and (3) the boundaries of the reciprocally feature present in a reference cattle liver enhancer dataset, Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 Nguyen et al. Figure 2: Optimization of mapping parameters using 7 input databases. The input databases included 5 human enhancer databases (ENSEMBL, ENCODE, ROADMAP liver tissue, Vista, and FANTOM enhancers), 1 human promoter database (FANTOM promoters), and 1 annotated human exon database (UCSC hg19) [6, 18, 21, 56, 57]. The numbers of regions from each dataset used to optimize parameters are shown in Table S4. We used the UCSC pair-wise whole-genome alignment chain file s between the human genome (hg19) and the bovine genome (UMD3.1) and performed mapping from the human genome to the bovine genome (minMatch 0.1 to 0.95 as shown in the x-axis) and then reciprocal mapping from the bovine genome back to the human genome [52, 58–60]. A, Recovered rate, defined as the percentage of the number of mapped regions with exact reciprocal mapping to the total number of original regions in humans. B, Confirmation rate, defined as the percen tage of reference regions covered by predicted regions to the total number in reference regions (Villar reference enhancers, Villar reference promoters, and cattle GENCODE genes V19). C, Specificity, defined as the percentage of matched reference (true positive for the reference dataset) compared with the total number of p redicted regions. Table 1: Summary information for the optimized set of human regulatory datasets used for HPRS mapping Total No. of Mean Dataset Tissues/cell lines regions Region types length, bp ENCODE Distal TFs [19, 26] 0/91 1 122 364 Binding sites 151.2 for 163 TFs ENCODE Proximal TFs [19, 26] 0/91 384 343 Binding sites 151.4 for 163 TFs ROADMAP [21] 24 primary cells (e.g., blood cells, immune cells, and breast 9 102 278 Enhancers 970.8 myoepithelial cells), 14 primary culture (e.g., skin, muscle satellite, neurosphere, bone marrow), and 50 primary tissues (e.g., thymus, spleen, lung, fetal stomach) FANTOM Enhancers [6, 25] 135/673 43 011 Enhancers 289 FANTOM Promoters [20, 25] 152/823 201 802 Promoters 21.5 Information on data types and models is described in Table S3. SeeTable S5 for sample source details. identified experimentally by histone 3 lysine 27 acetylation mappable regions was obtained when decreasing the minMatch (H3K27Ac—a marker for active enhancers) and histone 3 lysine from 0.3 to 0.2. Therefore, the parameter testing indicated that 4 trimethylation (H3K4me3—a marker for active promoters near the optimal minMatch threshold was 0.2. transcription start sites) assays (hereafter referred to as the Vil- We also developed the method to detect regions possi- lar reference datasets) (Fig. 2B) [22]. The coverage of the rele- bly from gene duplication events (Supplementary Methods). To vant reference datasets (Villar reference promoters, Villar ref- identify regions possibly resulting from duplication events (Fig. erence enhancers, and UCSC exons) also increased when the S1A), the HPRS mapping pipeline pooled unmapped regions in minMatch was reduced for some, but not all databases (Fig. 2B). the human datasets (with minMatch = 0.2) and mapped re- Importantly, the reduction in mapping threshold did not lead gions with no exact reciprocal matches for a second round map- to a loss of specificity, which is defined as the percentage of ping with different parameters (allowing multiple mappings and predicted enhancers that matched Villar reference enhancers keeping only results with similarity higher than 80%) to rescue (true positive for the reference dataset) compared with the to- regions with multiple map targets. tal number of enhancers predicted using the particular input dataset (Fig. 2C). The combined results shown in Fig. 2Aand Optimized use of human regulatory datasets Fig. 2B suggest that reducing minMatch to lower than 0.55 still increases (at a slower rate) the number of mapped regions (for Regulatory regions can be active or quiescent, depending on the the ROADMAP, ENSEMBL, FANTOM, and ENCODE datasets) (Fig. cell type and the biological state, and therefore prediction us- 2B) and increases the chance of detecting more reference en- ing a single tissue/cell line, or a single assay type, is unlikely hancers (for the ROADMAP, ENSEMBL, and ENCODE datasets (Fig. to produce a high coverage of all possible regulatory sequences 2A). No significant difference was observed when lowering the of a species [24]. Therefore, we investigated the effect of us- minMatch from 0.2 to 0.1, but a slight gain in the percentage of ing different databases on the predictive capacity of HPRS. First, Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 5 Figure 3: Effects of combining databases. A, Application of the HPRS mapping method to the genomes of 10 mammalian species. The input for the HPRS mapping was 1 of the 42 ROADMAP human datasets (38 adult tissues and 4 cell lines as shown in the x-axis) [22] to 10 mammalian species. The Villar reference enhancer datasets determined by H3K27Ac and H3K4me3 assays for liver tissue in each species were used to estimate the coverage of experimental enhancers by the predicted dataset (shown in the y-axis as species-specific enhancer dataset). For each species, the coverage was the percentage of the Villar reference enhancer datase t that overlapped with the HRPS prefiltered enhancers. B, The combination of all 40 tissues in each species was used. C and D, The optimal combination of 5 databases for enh ancers and promoters, respectively. The reference datasets include ROADMAP enhancers (42 tissues), ENCODE distal TFs, ENCODE proximal TFs, FANTOM enhancers, and FANTOM promoters. The numbers shown in the intersections are the number of common regulatory regions between the HPRS mapped regions and the Villar reference datasets. we compared the mapping coverage of enhancers from 42 hu- features that define different types of regulatory regions, e.g., man ROADMAP datasets [21] with the reference liver enhancer those that are specific to promoters or enhancers. In general, datasets, which were experimentally identified (by H3K27Ac as- species with closer evolutionary distance to humans had more say for liver tissues) for 10 mammalian species reported in Vil- HPRS-predicted enhancers matching the relevant Villar refer- lar et al. (Fig. 3Aand B, Tables 1 and 5)[22]. Fig. 3A shows the ence liver datasets (Fig. 3A). For each tissue, the relative map- percentage of Villar reference enhancers (e.g., enhancers de- ping rates were similar between species. Between different tis- tected in liver tissues in cats) that overlap with HPRS-predicted sues across the 42 ROADMAP datasets, thymus enhancers had regions by mapping each of the original 42 human tissues to the lowest mapping rate and liver enhancers the highest map- the targeted species (e.g., to the cat genome). Fig. 3B shows the ping rate in most species (Fig. 3B). Notably, the tissue specificity percentage overlapping with the results from using the com- effect, exemplified by the higher mapping rate for ROADMAP bined 42 tissues. Comparing results from each tissue, or from liver datasets to the relevant species Villar reference datasets combined tissues in each species, enabled assessment of vari- than for other ROADMAP tissues (Fig. 3B), was reduced substan- ation due to evolutionary distance or tissue specificity. Second, tially if the 2 primates that were more evolutionarily related we evaluated the predictions from human to bovine based on to humans (macaque and marmoset) were removed from the different datatypes, including promoter databases (FANTOM), comparison. enhancer databases (FANTOM and ROADMAP), and transcrip- As the coverage of the reference cattle liver enhancer dataset tion factor binding site databases (ENCODE proximal and distal was not significantly higher with human liver enhancers than TFs) (Fig. 3C and D). Each of the datatypes has unique sequence with enhancers from many of the other human ROADMAP Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Nguyen et al. Table 2: Summary of promoter predictions No. within Fold Total regions in cattle Overlap with Villar Fold enrichment of 200 bp of TSS enrichment a c Dataset (MB, %) dataset (%) Villar dataset (%) of TSS Total No. of CAGE regions 154 377 (3.68, 0.138) 11 606 (84.1) 609 13 676 (51.0) 370 Filtered set CAGE regions 145 912 (3.46, 0.129) 11 203 (81.2) 629 13 011 (48.7) 377 Total all regulatory regions 542 756 (937.39, 35.11) 13 329 (96.6) 3 20 759 (77.6) 2 (Universal Dataset) Filtered regulatory regions 245 384 (356.1, 13.33) 13 104 (95.0) 7 17 715 (66.2) 5 (Filtered Dataset) Villar reference promoters 13 796 (32.90, 1.23) 13 796 (100) NA 10 212 (38.2) 31 ROADMAP promoters 81 892 (135.6, 5.08) 12 677 (91.9) 18 14 388 (53.8) 11 The total number of regions with liftOver at minMatch 0.2 and exact reciprocal matches, combined with regions that had multiple matches (no 1-to-1 relationship) but had high conservation (80% identity). The original human regions are from the FANTOM promoter dataset. The percentage was calculated for total genome size. Villar promoter dataset for cattle [22]. Promoter count within 200 bp of the Ensembl annotated UMD3.1 TSS Ensembl build 85 (total 26 740). tissue enhancer datasets [21], we asked whether combining tis- inference results in a small number of promoters. Approxi- sues would increase coverage. By combining the predictions mately 26 740 cattle genes (coding, lncRNAs, miRNAs, etc.) in from the 42 ROADMAP datasets, 2- to 4-fold higher coverage than the reference dataset used (Ensembl Build 85) have annotated from 1 tissue alone (at least 60% total coverage) could be ob- TSS. This dataset is far from comprehensive because of the ex- tained across a variety of species, with coverage being lowest for pected underrepresentation of noncoding genes and of alter- the rat and highest for macaque (Fig. 3A and B). Furthermore, native promoters (APs). The 1 gene-one promoter and 1 gene- we found that separate databases constructed using different one protein concepts are no longer appropriate to describe the models and biochemical assays were complementary, and com- diverse transcriptome [28]. APs are common and are function- bining them significantly increased coverage compared with a ally important. A number of APs were found associated with single database alone (Fig. 3C and D). For example, prediction us- complex traits [29]. While 51% of the Ensembl cattle TSS are ing the ENCODE distal TF dataset and the ROADMAP enhancer covered by mapped human CAGE transcription initiation peaks dataset covered the highest number of Villar cattle reference (3.7 Mb), only 38.4% are covered by the experimentally defined enhancers, while prediction using the FANTOM promoter and promoters (32.9 Mb) in Villar et al. [22], suggesting that HPRS the ENCODE proximal TFBS databases covered more Villar cat- predictions based on human CAGE data could enrich promoter tle reference promoters, and each dataset could add a num- coverage in the cow by more than 12 times compared with ber of unique regulatory regions not found in other datasets the standard promoter assay (H3K4me3 ChIP-Seq) (Table 2). Ac- (Fig. 3C and D). The combination of 88 ROADMAP datasets [21], tive TSS regions from 88 human tissues in the ROADMAP were the FANTOM enhancer and promoter datasets [25], and the EN- mapped to 81 892 putative promoters in cattle [21], with a total CODE distal and proximal TF datasets [26] generated a maxi- length of 135.6 Mb. Noticeably, the average number of Ensembl mum enhancer coverage of 95% (for macaque) and promoter reference TSS that overlapped every 1 Mb of predicted promoters coverage of 98% (for marmoset). Therefore, we selected an op- based on the ROADMAP database was 37-fold lower than those timal combination of human input databases for the HPRS based on the CAGE database (Table 2). pipeline on the basis that they represent promoters, enhancers, HPRS using the CAGE dataset can predict many TSS at single- and TFBS from a large combination of human tissues and pri- nucleotide resolution and can accurately predict transcriptional mary cells and were generated by different methods (Table 1). orientation. TSS are presented in the Ensembl database as single nucleotide genomic positions. HPRS-predicted promoters based on CAGE had exact overlap to the 7191 Ensembl TSS for cat- Predicting promoters tle. While promoter prediction by using histone marks (such as One of the most comprehensive human promoter datasets is those used by ROADMAP) cannot directly define transcriptional the FANTOM5 promoter atlas generated experimentally by CAGE orientation, this information, predicted by HPRS using human data from almost 1000 tissues and cell lines [20]. CAGE is a sen- CAGE data, is highly accurate [20]. Consistently, we found that of sitive methodology for the detection of transcription start sites 13 676 genes that have TSS within 500 bp of mapped CAGE peaks, (TSS) and hence defines core promoter regions where there is 96.9% (13 257) of genes had the same transcriptional orientation binding of the transcriptional machinery [27]. Promoters gener- in the Ensembl annotation and predicted by human CAGE data. ally have a high concentration of TFBS, typically within 300 bp We therefore assigned promoter orientation using the predic- upstream and 100 bp downstream of the TSS [20]. Promoter se- tions from the CAGE dataset. quences are more evolutionarily conserved than enhancer se- quences, and therefore a larger proportion can be mapped from Mapping enhancer datasets human to other mammal genomes [22]. Of 201 802 CAGE transcription initiation peaks in the FAN- Prediction of enhancers is likely to be more challenging than TOM5 human promoter atlas [20], 154 377 (76.5% of the to- predicting promoters because (1) enhancers are less conserved tal) were mappable to the bovine genome (Table 2). The HPRS in DNA sequence; (2) enhancer locations evolve faster [19, 22]; using CAGE predicted new TSS not present within the exist- and (3) enhancer effects are usually independent of the distance, ing bovine genome annotation (Ensembl Build 85). Although orientation, and relative location (upstream or downstream) of a promoter dataset for cattle can be inferred by defining up- gene targets [14]. To predict a broad set of sequences in a species stream sequences of genes with annotated TSS, this indirect that are active in 1 or more tissues or conditions, we expanded Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 7 Table 3: Summary of mapped and filtered regulatory sequences No. of mapped regions (%) Genome coverage, Mb (%) Datasets Human Cow Pig Human Cow Pig Total genome size, Mb NA NA NA 3137.2 (100) 2670.4 (100) 2808.5 (100) b b b ROADMAP enhancers, % mapped 9 102 278 (100) 5 917 129 (65) 5 620 417 (62) 8836.6 (NA) 6142.4 (NA) 5809.5 (NA) to target species ROADMAP enhancers (overlapping 494 583 (100) 371 295 (75) 361 682 (73) 1123.2 (35.8) 885.6 (33.2) 826.2 (29.4) regions were merged) FANTOM CAGE enhancers 43 011 (100) 34 303 (80) 27 558 (64) 12.4 (0.40) 12.2 (4.6) 9.6 (0.34) ENCODE distal TFs 1 122 364 (100) 749 572 (67) 716 515 (64) 169.7 (5.4) 132.0 (4.9) 124.4 (4.4) FANTOM CAGE promoter peaks 201 802 (100) 154 377 (76) 153 893 (76) 4.3 (0.14) 3.7 (0.14) 3.7 (0.13) ENCODE proximal TFs 384 343 (100) 298 554 (78) 279 774 (73) 58.2 (1.9) 48.0 (1.8) 48.9 (1.7) Merged ROADMAP, ENCODE, and 760 702 542 756 (86.1 519 913 (89.2 1165.7 (37.2) 919.5 (34.4) 857.8 (30.5) d d FANTOM datasets (Universal and 96.6) and 97.1) Dataset) Filtered Dataset NA 245 384 (73.5 151 523 (69.8 NA 356.1 (13.3) 311.5 (11.1) d d and 95.0) and 95.6) NA, not applicable. The percentage was not calculated for these 3 values because overlapping regions are present in the different enhancer datasets. The total size is bigger than the genome size because overlapping regions are included. Total size of nonoverlapping regions in the Universal Dataset (before filtering); the percent overlapping Villar reference enhancers (the former) a nd promoters (the later) in the targeted species. Percent overlap Villar reference liver enhancers and promoters in the filtered datasets. the human enhancer datasets to include 88 tissues, primary hancers and promoters (Table 1)[30]. The use of these TFBS cell lines, and primary cell cultures generated by the ROADMAP datasets not only supported predictions from using the en- project (Table 1)[21]; all human active enhancers defined by hancer and promoter datasets, but more importantly, added CAGE data from hundreds of tissues and cell lines in the FAN- other regulatory categories into the combined prediction of reg- TOM project [6]; and all the Villar experimentally defined refer- ulatory regions. For example, the binding targets of the CCCTC- ence cattle liver enhancers (Table 1)[22]. Cumulatively, the HPRS binding factor (CTCF) are likely insulator regions [31], while en- pipeline mapped more than 9.1 million human enhancer se- hancer of zeste homolog 2 (EZH2) binding sites may mark poly- quences to more than 5.9 million regions in the bovine genome, comb repressor complex 2 (PRC2) regions [32]. These ENCODE which were then merged into 542 756 nonoverlapping regions TFBS were identified as binding regions of TFs to nucleosome- (Table 3). The merged dataset (Universal Dataset) covered 86% free regions (∼151 bp per region), which are more biologi- (excluding merged regions resulting from the original Villar ref- cally relevant than de novo scanning of genome sequence for erence enhancers) of the Villar enhancer reference dataset (Ta- TFBS based on short position weight matrices (PWMs; typically ble 3). The term “Universal” reflects the initial pooling of all rel- 6–12 bp) because the later method only uses DNA sequence evant human regulatory datatypes and datasets to form a large and does not take into account the biological chromatin con- collection of genomic regions to be mapped to the target species. text, which is essential for transcription factor binding [33, 34]. Regulatory sequences are often active in certain conditions, and In total, from the ENCODE TFBS dataset [26, 34], 298 554 proxi- remain inactive in most other cases. Therefore, pooling active mal TFBS (total 47.97 Mb) and 749 572 distal TFBS (total 132.04 regulatory regions from a large number of datasets can likely Mb) were projected by HPRS onto the bovine genome. We also cover most active and inactive regulatory sequences, thus en- show that the HPRS prediction using ENCODE transcription fac- abling the prediction of a Universal Dataset. tor datasets was supported by 2 other independent prediction The HPRS mapping of the enhancer datasets predicted a large approaches (Supplementary Methods). set of homologous regions that are potentially regulatory regions in cattle (the Universal Dataset). We noted that the alignability of The filtering pipeline for a high-confidence regulatory DNA sequence does not automatically imply functionality [22], region dataset and therefore we applied a filtering pipeline to incorporate other types of cattle-specific data to prioritize regions more likely to The predictions produced by HPRS were optimized so that they have transcription regulation functions. The filtering pipeline occupied a relatively small part of the whole genome, but can used a combination of sequence features and epigenetics marks universally predict regulatory regions in different cell types and to enrich for likely regulatory enhancers and promoters, as dis- tissues. Applying HPRS for selected datasets (Fig. 3,Table 1), cussed in the filtering section. we first produced a preliminary Universal Dataset and then refined it to generate a Filtered Dataset (Table 3). To remove redundancies, overlapping mapped ROADMAP enhancers (ini- Mapping transcription factor binding site datasets tially mapped separately for each of the 88 ROADMAP datasets) To include potential regulatory regions beyond typical promoter were merged (Table 3). Similarly, all mapped regions for pro- and enhancer classifications, we performed HPRS mapping of moters, merged enhancers, and TFBS with overlapping coordi- human experimentally defined ENCODE TFBS (ENCODE anno- nates were merged into larger regions to form the final Universal tation version 2) to the bovine genome. The ENCODE TFBS Dataset (UD), containing 542 756 nonoverlapping regions. These database contains binding sites for 163 key TFs, some of which regions covered 937.4 Mb (35.1%) of the bovine genome. The high represent additional types of regulatory regions other than en- coverage (35.1%) of the UD was due to the large collection of Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 8 Nguyen et al. Figure 4. Enrichment of the enhancers and promoters by the filters in the HPRS filtering process. A, A pipeline to filter predicted regulatory regions from the Univ ersal Dataset with 542 756 regions, covering 937.4 Mb of the genome (35.1%). The initial number of experimentally defined Villar reference datasets include d 31 971 enhancers (E) and 12 257 promoters (P). The number of reference E and P, total number of predicted regulatory regions, and total length (in Mb) for all promoters and enhancers passing each filtering layer are shown. The Ratio (total enhancers overlapping Villar reference enhancers/total length) and Ratio (total promoters overlapping Villar E P reference promoters/total length) were used as criteria to assess enrichment for each filter. The 7 filters are described in Table 4 and in the Supplementary Methods. B, Two bar graphs showing enrichment results (using the same starting set) of using each of the 7 filtering steps in comparison with the baseline (whole g enome) as shown in the dashed lines, and the Universal Dataset (mapped regions, not filtered). The y-axis shows the average number of reference promoters or en hancers in every 1 MB of the genome. The density of regulatory regions predicted is an indicator of the prediction coverage and accuracy. The higher values indicate more experimentally validated enhancers and promoters are enriched after filtering, suggestive of a more efficient filter. Each filter was tested independe ntly, using the same Universal Dataset as the input, to compare the enrichment levels that resulted from each of the 7 filters. human datasets used as inputs for mapping to bovine (37.2% of predict a small set of potential transcription regulatory genomic the human genome) so that the UD covered almost all possible regions in the bovine genome (Fig. 4,Table 4). promoters, enhancers, and TFBS (Table 3). Importantly, the HPRS The filtering pipeline reduced the UD to the much smaller pipeline improves the specificity of the UD by applying a filter- Filtered Dataset (FD; the same as filtered UD), which covered ing step, which incorporates the power of cattle-specific data to a smaller part of the whole genome, but which still predicted Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 9 fi Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Table 4: Filters with species-specific data for selecting regulatory regions (refer to the Supplementary Materials and Methods) Length, Mb Ratio of enhancers, count/Mb No. of ratio promoters, count/Mb Filter Filtering parameters Cattle Pig Mouse Cattle Pig Mouse Cattle Pig Mouse Whole Genome baseline, enhancers/Mb 2670.4 2808.5 2730.8 31 971 (12.0) 23 804 (8.5) 18 396 (6.7) 13 796 (5.2) 11 114 (4.0) 15 164 (5.6) genome Universal Universal baseline, enhancers/Mb 937.4 882.4 699.7 31 971 (34.1) 23 804 (27.0) 18 396 (26.3) 13 796 (14.7) 11 114 (12.6) 15 164 (21.7) Dataset CAGE CAGE ≥ 2orCAGE = 1and RNAseq > 201.9 194.7 147.6 9628 (47.7) 6679 (34.3) 3935 (26.7) 10 152 (50.3) 9476 (48.7) 10 951 (74.2) mean(VillarRef) CAGE ≥ 1 250.7 248 189.8 11 318 (45.2) 8214 (33.0) 5199 (27.0) 12 103 (48.3) 9936 (39.9) 12 722 (67.0) H3K27Ac Log2(H3K27Ac ≥ median(log2(VillarRef)) 89.6 103.8 47.0 16 124 (180.0) 11 985 (115.4) 6736 (143.3) 3927 (43.8) 9324 (89.8) 10 570 (225.0) Log2(H3K27Ac) ≥ mean(log2(VillarRef)) 91.0 102.0 50.4 16 366 (179.8) 11 670 (114.4) 7255 (143.9) 3966 (43.6) 9305 (91.2) 10 707 (212.4) RNAseq Log2(RNAseq) ≥ 3rd 156.1 85.3 105.8 6999 (44.8) 3162 (37.1) 2726 (25.8) 5473 (35.1) 6412 (75.2) 4375 (41.3) quartile(log2(VillarRef)) Log2(RNAseq) ≥ median(log2(VillarRef)) 278.1 184.0 193.1 12 147 (43.7) 6748 (36.6) 4626 (24.0) 8442 (30.4) 8709 (47.3) 6346 (32.9) Log2(RNAseq) ≥ mean(log2(VillarRef)) 319.4 197.7 218.7 13 746 (43.0) 7268 (36.8) 5175 (23.7) 9249 (29.0) 8874 (44.9) 6839 (31.3) gkm-SVM Length < 3000 & SVM ≥ 85.9 4.7 77.2 3603 (41.9) 261 (55.6) 3525 (45.7) 9208 (107.1) 359 (76.4) 8570 (110.0) median(VillarRef) Length < 3000 & SVM ≥ mean(VillarRef) 87.4 3.7 74.5 3645 (41.7) 200 (53.9) 3446 (46.3) 9230 (105.6) 287 (77.3) 8555 (114.9) Length < 5000 & SVM ≥ mean(VillarRef) 133.4 49.3 110.8 5673 (42.5) 1858 (37.6) 4285 (38.7) 9766 (73.2) 1333 (27.0) 9100 (82.1) Annotation AnnCount ≥ 3rd quartile (VillarRef) 72.8 41.4 362.4 3308 (45.5) 893 (21.6) 9183 (25.3) 9618 (132.2) 7489 (181.0) 11 317 (31.2) count AnnCount ≥ median (VillarRef) 109.1 21.2 614.3 5173 (47.4) 887 (21.5) 13 366 (21.7) 10 599 (97.2) 7486 (181.7) 13 835 (22.5) AnnCount ≥ mean (VillarRef) 273.2 239.6 318.2 12 433 (45.5) 7792 (32.5) 8321 (26.2) 12 391 (45.4) 10 039 (41.9) 10 633 (33.4) PhastCons PhastCons ≥ 95th percentile (VillarRef) 28.0 26.7 31.1 939 (33.5) 722 (27.1) 851 (27.3) 1504 (53.7) 1165 (43.7) 2177 (69.9) PhastCons ≥ median (VillarRef) 383.6 351.3 354.9 13 068 (34.1) 9425 (26.8) 8319 (24.4) 9929 (25.9) 7746 (22.1) 11 016 (31.0) PhastCons ≥ mean (VillarRef) 247.4 227.0 248.0 8415 (34.0) 6098 (26.8) 6039 (24.3) 8000 (32.3) 6249 (27.5) 9456 (38.1) TFBS count TFBScount ≥ median (VillarRef) 16.3 12.9 213.3 4 (0.2) 2 (0.2) 6151 (28.8) 6700 (411.3) 3253 (252.0) 9739 (45.7) TFBScount ≥ mean (VillarRef) 379.9 882.4 295.9 12 933 (34.0) 23 804 (27.0) 7995 (27.0) 9956 (26.2) 10 788 (12.2) 10 729 (36.3) Regions with at least 2 CAGE peaks or with 1 CAGE peak, and the number of mapped reads from RNAseq data is higher than the mean mapped reads of the reference Villar enhancer dataset. The results are compared with results from applying another filtering criterion, which is CAGE peak above 1. The comparisons are based on total length of all selected regions, and the averag e count of the ltered regions normalized by length (regulatory regions per 1-Mb genome length). Regions with more of the H3K27Ac mapped reads than the median or mean mapped reads to reference enhancers in the Villar dataset (log2 scale). Regions with more of the RNAseq mapped reads than the 3rd quartile, median, or mean mapped reads to reference enhancers in the Villar dataset (log2 scale). Regions with length shorter than 3000 bp (or 5000 bp) and gkmSVM scores ≥ median or mean scores for the Villar enhancer dataset. Regions with more annotation terms (diversity of regulatory features) mapped to the regions, compared with those mapped to reference regions in the Villar enhancer dataset. Regions with higher conservation PhastCons scores compared with the 95th percentile, median, or mean scores to enhancers in the Villar dataset. Regions with more transcription factor binding sites than those in the Villar enhancer dataset. 10 Nguyen et al. Table 5: HPRS predicted regulatory datasets for 10 species most active enhancers and promoters (Table 4,Fig. 4). De- tailed discussion on rationale for selecting each filter is in the Total Enhancer Promoter Supplementary Materials and Methods. Briefly, the pipeline uti- No. of length, coverage, coverage, lized both biological data in the target species (86 RNA-Seq a a Species regions Mb % % datasets representing 79 cattle tissues [35], cattle H3K27Ac sig- nal [22], and DNA sequence conservation scores) and computa- Unfiltered datasets tionally estimated criteria (gapped k-mers support vector ma- Cattle (bTau6) 545 748 919.5 86.1 96.6 chine (gkm-SVM) scores, number of overlapping annotations, Pig (susScr3) 519 913 882.4 89.2 97.1 and number of CB-predicted TFBS) (Fig. 4A). Marmoset (CalJac3) 642 144 1106.4 93.1 98.4 Before filtering, the Universal Dataset had approximately Rhesus Macaque 693 312 1158.2 94.5 97.6 2.84 times higher Ratio (number of Villar reference enhancers (RheMac3) Dog (CanFam3) 570 317 877.5 89.4 97.6 by predicted regions divided by the length in Mb of predicted Cat (FelCat5) 570 282 903.9 90.8 97.1 regions) and 2.82 times higher Ratio (similar to Ratio , but for P E Guinea pig (CavPor3) 523 273 761.6 81.1 92.7 promoters) than the total genome baseline, and each filtering Rabbit (OryCun2) 531 109 819.4 86.8 96.8 step in the pipeline increased Ratio and Ratio compared with E P Mouse (Mm10) 478 974 699.7 79.6 93.2 the baseline (Fig. 4B, Table 4). At the end of the pipeline, a set of Rat (Rn5) 453 017 620.5 75.3 89.5 high-confidence regulatory regions (the FD), containing 245 384 Filtered datasets sequences (with a total length of 356.1 Mb, equivalent to 13.3% Cattle (bTau6) 245 358 356.1 73.5 95.0 of the whole genome) was obtained. The filtering reduced the Pig (susScr3) 151 523 311.5 69.8 95.6 number of regions by 2.2 times and the genome coverage by 2.6 Mouse (mm10) 281 071 308.4 68.9 91.4 times (Table 3,Fig. 4A), while still including most of the cattle liver reference enhancers and promoters (73.5% and 95.0%, re- The datasets were generated for each species using the same human data spectively) (Table 4,Fig. 3A). Importantly, the filtered dataset had sources, including: 88 ROADMAP tissues/primary cell lines, FANTOM promoters a 5.5 and 7.1 times higher Ratio and Ratio , respectively, than E P and enhancers, and ENCODE proximal and distal TFs (Table S2) and combined the genome baseline (Fig. 4). The size and coverage of the bovine with the Villar reference enhancer promoter dataset. The prediction results for each species are available as part of Additional file 2. genome (356.1 Mb, 13.3%) by HPRS predicted regulatory regions Coverage of the relevant Villar reference datasets [22]. were comparable to the published Fig. 1 for the mouse, which Reference genomes are from UCSC [61]. is 12.6% of the mouse genome, as predicted by ENCODE DNAse The relevant Villar reference species enhancer datasets were added prior to fil- I accessibility data and transcription factor ChIP-Seq (using an- tering. tibodies for 37 TFs on 33 tissues/cell lines) and histone modifi- cation ChIP-Seq data [2]. Similarly, applying the HPRS pipeline to the mouse genome, without using mouse-specific datasets trons), more than 92% are within intronic regions [3, 5]. To test from ENCODE or other sources (except for the reference Vil- the enrichment of potential causal SNPs within predicted regu- lar dataset), predicted potential regulatory regions that occupy latory regions in cattle, we explored the overlap between SNPs 11.3% of the whole mouse genome. in regulatory regions and pleiotropic SNPs, which are SNPs sig- nificantly associated with multiple traits. The pleiotropic SNPs were identified by an independent GWAS study for 32 cattle feed Validating and extending the HPRS pipeline in 9 other intake, growth, body composition, and reproduction traits [37]. mammalian species The GWAS used 10 191 beef cattle, with data (including imputed The performance of the HPRS pipeline was evaluated using the data) for 729 068 SNPs (Fig. 5). We observed a substantial fold en- porcine (pig) genome (susScr3) [36]. HPRS had been developed richment (∼2–4 times) of SNPs with –log (P-value) from3to20in based on the bovine genome, and the pig was then selected as a the Filtered Dataset compared with all other sets of commonly species for step-by-step comparison throughout the pipeline be- classifying SNPs in different genomic regions, including the set cause of the availability of experimentally defined porcine pro- of SNPs 5 kb upstream of protein coding genes. We also observed moter and enhancer reference datasets [22] and because the higher counts (for 6 out of 10 traits) of associated SNPs within pig is an evolutionarily divergent nonruminant production an- regulatory regions in a study on 10 climatic adaptation traits in imal. We obtained similar results in the pig compared with 2112 Brahman beef cattle (Fig. S1) [38]. Similarly, we found en- cattle on numbers of putative regulatory regions, percentage to richment of regulatory SNPs in a study of 5 major production total genome length, and coverage of the reference datasets (Ta- and functional traits in 17 925 Holstein and Jersey dairy cattle (P bles 3 and 4). Importantly, we extended the application of the < 0.05 for 3 out of 5 traits) (Table S1) [39]. These observations are HPRS mapping data from the human to 8 additional mammalian consistent with the pipeline identifying regulatory SNPs from species, which had reference promoter and enhancer datasets millions of SNPs in the genome and suggest that the predicted from the Villar et al. study. We generated HPRS mapped unfil- regulatory database is useful for prioritizing SNPs likely to be tered UDs and observed consistently high coverage of the refer- contributing to phenotypic variation of complex traits. ence enhancer and promoter datasets, and the coverages were comparable between all 10 mammalian species (Table 5). Thus, The regulatory region datasets can be used to guide the pipeline appears to have general utility, not just for livestock identification of potential causative SNPs and their species, but also for mammals in general. gene targets As examples of the application of our resources to identify likely SNPs in regulatory regions are enriched for significant causative mutations from a large list of significantly associ- GWAS SNPs ated SNPs, we applied the HPRS approach to analyse 2 well- More than 90% of significant GWAS SNPs lie outside gene-coding studied genetic variants in cattle that were known to contribute regions, and for those within the gene body (from the start to to phenotypic variation, but their mechanisms of action were the termination site of the complete transcript, including in- not known because they were located within noncoding regions. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 11 The bovine Pleomorphic adenoma gene 1 (PLAG1) locus has been identified in the control of stature (weight and height) by several independent GWAS studies in cattle [40, 41]. The study by Karim et al. [40] fine-mapped 14 SNPs associated with stature. The 14 SNPs are in the vicinity of PLAG1 and the Coiled- coil-helix-coiled-coil-helix domain containing 7 (CHCHD7)gene, which are 540 bp apart (Fig. 6A). The 14 candidate SNPs are showninFig. 6A with coordinate locations relative to HPRS- predicted regulatory regions. The HPRS database suggests a strategy for further filtering these fine-mapped SNPs in 2 ways, first to prioritize gene targets and second to prioritize SNPs. The design of the validation experiment by Karim et al. [40]did not separate the 2 SNPs (rs209821678 and rs210030313) in the pro- moter region because both the long and short fragments used for activity assays in the study contained both SNPs. The HPRS prediction separates the 2 SNPs into 2 core CAGE peaks (Fig. 6B). The 2 peaks suggest 2 potentially separate binding sites of the transcriptional machinery. HPRS resolves the shared 540-bp pro- moter region into separate core promoter regions and suggests a new validation design in which 3 short, directional fragments fo- Figure 5: Enrichment of significant pleiotropic SNPs in regulatory genomic re- cusing more specifically on core CAGE regions (2 near PLAG1 and gions. Count of significant pleiotropic GWAS SNPs ( P-values are from multi-trait 1 near CHCHD7) can be used for functional assays of SNP geno- meta-analysis chi-square test statistic for 32 traits) [37]inaset of ∼729 100 SNPs type. Measuring promoter activity of these 3 constructs by using genotyped using the Illumina HD Bovine SNP chip or imputed from genotyping the similar promoter luciferase assay and transcription factor data from smaller Illumina Bovine SNP chips. Legend labels, from top to bottom: binding assay employed by Karim et al. [40] may confirm which “AllHDchip”: 43 130 SNPs randomly selected (from all 692 529 SNPs on the HD chip, excluding those from chromosome X); “100kbUpstream”: 43 130 SNPs ran- of the 2 SNPs is causative and which gene is affected. domly selected (from 325 227 SNPs within 100 kb of upstream regions of coding Furthermore, by applying a scoring model for regulatory vari- genes); “5kbUpstream”: all 30 384 SNPs within the 5-kb upstream regions of cod- ants, we generated deltaSVM score for each of 97 million known ing genes (results scaled to 43k SNPs); “Genes”: 43 130 SNPs randomly selected bovine SNPs (Supplementary Materials and Methods). The SNP (from 240 160 SNPs in coding genes); “Exons”: all 10 003 SNPs in exons of cod- rs209821678 had a deltaSVM score of –5.99. The score was be- ing genes (results scaled to 43k SNPs); “HPRS regions”: 43 130 SNPs in regulatory yond the 95th percentile range of SVM scores for 97 million regions. SNPs, suggesting that it may play an important regulatory role. Figure 6: Application of the regulatory database to prioritize significant bovine SNPs identified by GWAS studies for functional validation . Overview of 13 significant SNPs fine-mapped by Karim et al. [ 40] is shown in the left panel. Among those SNPs, only 3 overlap regulatory regions and promoter regions in the predicted database. The right panel is a detailed view of the 2 SNPs validated as causative in Karim et al. [40]. Both SNPs are within promoter regions of the PLAG1 gene, but not the CHCHD7 gene. The regulatory (enhancers, promoters, and transcription factor binding sites) and promoter (only promoters) tracks display HPRS-predicted regions. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 12 Nguyen et al. Figure 7: A potential model for effect of the Celtic mutation. Using human Hi-C (chromosome conformation capture) data and scanning of transcription factor binding sites, we generated a hypothesis to predict cattle regulatory targets for polled mutation (HiC target, HiC anchor, and HAND1 tracks). The regulatory (enhancers, promoters, and transcription factor binding sites) tracks display the HPRS-predicted region. Two common mutations on chromosome 1 in cattle have been associated with polled cattle. One is a 202-bp indel (“Celtic mutation”). The other is an 80-kb duplication ∼300 kb away. Purple arrows on the top link the Hi-C anchor to multiple targets mapped from the human genome to the cattle genome. Bottom of figure is a map with locations of the regulatory regions and shows potential effects of two HAND1 sites on the expression of the lincRNA2. Notably, the rs209821678 deletion of the (CCG)x11 to (CCG)x9 is a 202-bp indel, where the duplication of a 212-bp region trinucleotide repeats lies in a predicted G-quadruplex and may (chr1: 1 705 834–1 706 045) replaces the 10 bp (chr1: 1 706 051– cause changes in its structure, an event that could alter tran- 1 706 060) (Fig. 7)[46–48]. The mechanism for the Celtic mu- scriptional activity [42]. In contrast, the SNPs rs210030313 and tation is unknown, although it may affect the expression of rs109815800 did not have significant deltaSVM scores (0.51 and OLIGO1, OLIGO2, CH1H21orf62, and 2 long noncoding RNAs (lin- 3.2, respectively). cRNA1 and lincRNA2)[46, 47]. We found that the whole 10-base We then asked if the regions containing the SNPs interact deletion, but not the upstream 212-base duplication, is within an with additional genes distant from the PLAG1 locus. We applied HPRS-predicted enhancer sequence (chr1: 1 706 046–1 706 182, HPRS for mapping interactions defined by chromatin conforma- UMD3.1). A detailed transcription factor binding motif analysis tion capture data (5C and Hi-C in the ENCODE human datasets) of the polled mutation site suggests that a binding site for the to predict distal targets of the promoter regions in the PLAG1 Heart and neural crest derivatives expressed 1 (TF HAND1)is locus [43, 44], and we found that rs209821678 and rs210030313 lost due to the 10-bp deletion in animals containing the Celtic are within the anchor A 447 043 (chr14: 25 044 319–25 054 287, mutation (Fig. 7C). The neural crest cells give rise to the cran- UMD3.1), with a predicted target region (chr14: 25 478 861–25 iofacial cartilage and bone [49], suggesting that the loss of the 497 096) near the Inositol monophosphatase domain contain- HAND1 putative binding site is a plausible explanation for the ing1(IMPAD1). Variants within IMPAD1 have been implicated altered craniofacial development in polled animals. Addition- in short stature and chondrodysplasia (Table S2). Interestingly, ally, using information from Hi-C in the human genome [44], we the leading SNP identified in an analysis of pleiotropic genes found the mutation is within a mapped interaction target of the affecting carcase traits in Nellore cattle, rs136543212 at chr14: regions Hi-C A 264 635 (chr1: 1 706 078–1 714 122, UMD3.1) and 25 502 915, is slightly closer to IMPAD1 [45]. The rs109815800 A 264 636 (chr1:1 698 252–1 706 077, UMD3.1) and interacts with SNP, on the other hand, does not lie in any mapped Hi-C region. genes hundreds of Kb away (Fig. 7, bottom panel; Table S2). Al- Together, the HPRS-predicted results strongly suggest that the though the above hypothesis requires experimental validation, rs209821678 variant is the causative SNP among the 14 candi- it shows that applying the HPRS approach could lead to a biolog- dates fine-mapped by Karim et al. [ 40]. ical hypothesis for the underlying effects of causative mutations Another example of applying the HPRS databases for anal- within noncoding regions. ysis of noncoding mutations is for the case of the “Celtic Therefore, from the 2 examples described above (and from mutation,” which causes the polled phenotype. The mutation the Callipyge example described in the Supplementary Data), Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 13 we found that the HPRS regulatory database can be used to pri- from the FAANG consortium, when they become available, with oritize SNPs and genetic variants that were identified by GWAS complementary data from human research. The immediate studies and to draw hypotheses about biological mechanisms of application of the regulatory database is to complement the cur- a causative SNP. rent species-specific GWAS analysis by (1) discovery of poten- tial regulatory mechanisms of SNPs lying outside gene coding regions, (2) prioritizing SNPs that are statistically significant at Limitations of the methods a genome-wide level but located within regulatory regions, (3) The main aim of the HPRS pipeline is to predict as many prioritizing SNPs that are at low allele frequency but have po- regulatory regions and as accurately as possible, so that the tential for large effects, and (4) suggesting possible causative dataset could be applied for functional SNP analysis in the target SNPs as targets for precise genome editing or selective breeding species. However, given the uncertain nature of promoter and practices. enhancer identification, the rate of false positives and negatives by HPRS is difficult to determine. In our analysis, all of the ref- Methods erence cattle liver enhancers were included in the initial unfil- tered datasets, although ∼25% were lost during the filtering pro- The complete HPRS pipeline is divided into 3 modules: mapping, cess. Similarly, 96% of reference cattle liver dataset promoters filtering, and SNP analysis. The whole pipeline and documenta- were covered by the unfiltered dataset, with less than 3% lost tion are available from the CSIRO BitBucket [50]. in the filtering process. A limitation of the HPRS filtering pro- cess is the requirement to use a species-specific dataset. Nev- HPRS mapping pipeline ertheless, compared with the large number of datasets and bio- chemical assay types that are required to create comprehensive We developed a mapping strategy based on 4 elements: (1) se- coverage of regulatory regions, the number of species-specific lecting a suitable combination of human databases as HPRS datasets needed for HPRS is small. In this paper, for each of inputs; (2) finding an optimal sequence identity threshold in the 3 species (mouse, cattle, and pig), we used data from only the target genome; (3) finding options to remove less confi- 3 biological replicates of the H3K27Ac assay, which was gener- dent mapped results; and (4) adding multiple mapped regions ated within a scope of 1 project, as reported in Villar et al. [22], that meet a high sequence similarity threshold. Depending on to successfully filter the Universal Dataset. In addition, the ap- the species, targeted tissues, or regulatory categories of in- proach cannot predict promoters and enhancers that are unique terest, users can select suitable human databases using the to the species, e.g., promoters and enhancers that are present following suggested criteria: types of regulatory regions (pro- in the cow but not present in humans. These unique promot- moters, enhancers, and TFBS), biochemical assays, computa- ers/enhancers are likely to be a small proportion of the total tional models for combining data, and data sources (tissues, cell promoter/enhancer set. Indeed, the lineage-specific promoters lines, traits). Second, by applying the UCSC liftOver tool [23], and enhancers across 20 mammalian species were around 5% regions that were aligned at the genome scale (by LastZ pair- of the total promoters and enhancers [22]. Of note, relevant hu- wise genome alignment [52]) were fine-mapped to identify tar- man input datasets can be integrated depending on the aim of get regions with proportion of sequence identity to the original an analysis. For example, if the focus is to study milk produc- regions (minMatch) higher than a selected cutoff. We recom- tion, the HPRS pipeline can be applied for more relevant tissues, mend an optimal minMatch of 0.20 and not allowing multiple such as the mammary gland. Future cattle-specific datasets can mapping for this step. Users can vary input parameters (min- be incorporated into the HPRS pipeline to address the tissue and MatchMain and minMatchMulti) in the HPRS mapping script species specificity issues. (Main Mapping Pipeline.py) to optimize the minMatch suitable In contrast to the HPRS pipeline prediction of regulatory re- to specific datasets that may have different features such as gions, the prediction of causative genetic variation within regu- sequence length and conservation. Third, mapped regions re- latory regions is much more challenging. The current approach sulting from using a low minMatch cutoff (0.20) were filtered relies on the enrichment of sequence motifs within regulatory to retain only regions with exact reciprocal mapping back to regions relative to nonregulatory regions. At least some of the the human genome, with the condition that both the left and motifs are TFBS, but there are likely to be other types of motifs, right borders of the reciprocally mapped regions were within such as G-quadraplexes, present in regulatory regions. While 25-bp windows of the original regions. Fourth, to accommo- the predicted datasets can be useful for generating relevant hy- date regions possibly resulting from duplication events, the potheses, the identification of causal variants still requires con- HPRS mapping pipeline added a step to remap regions that are siderable future refinement and validation. unmapped or are not reciprocally mapped by allowing mul- tiple mapped results to be included while setting a high se- quence similarity threshold (specified by the minMatchMulti pa- Conclusions rameter ≥0.80). Fig. S1A shows some of the expected mapping We have developed the HPRS pipeline using a large collection of scenarios. existing human genomics data and a limited number of cattle- In addition to the customized minMatchMain and min- specific datasets to predict a database of cattle regulatory re- MatchMulti parameter inputs, the Main Mapping Pipeline.py gions that covers a large number of active promoters, enhancers, script also takes user-specified chain files for target species, and TFBS. The database generated here is not a final product be- which can be any of the mammalian species with chain files cause HPRS is capable of readily integrating new cattle-specific available from the UCSC databases or generated in-house. datasets into its mapping and filtering pipeline to expand, re- The HPRS mapping pipeline enables fast mapping of as many fine, and validate the databases. Moreover, the HPRS pipeline databases as necessary. The script PostHPRSMapping Merge can be applied to data of other mammalian species and by DifferentDatabaseTypes.py (available in the CSIRO BitBucket scientists without computer programming skills. We anticipate [50]) can be used to combine resulting datasets into 1 that the pipeline will be used to integrate large-scale datasets dataset containing nonoverlapping regions. For example, we Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 14 Nguyen et al. merged enhancer databases from 88 ROADMAP tissues/primary Data availability cell lines and 5 additional promoter, enhancer, and TFBS We have made all HPRS Python and R scripts publically avail- databases. The script also collapses names of overlapping able with usage instruction from the CSIRO BitBucket [50]. These regions into a comma-separated field that can be used to codes can be used to perform all steps from mapping to filtering count the total number of annotations for each merged and scoring regulatory SNPs. Supporting data are also available region. via the GigaScience database, GigaDB [51]. All human databases used for prediction are publically avail- HPRS filtering pipeline able (Table S5). Results of predicted regulatory regions, includ- ing the Universal Datasets and the Filtered Datasets for cattle Detailed description of the 7 filters is presented in the Supple- and pig, are available in the Supplementary Data for this arti- mentary Materials and Methods section. Briefly, the HPRS fil- cle. For cattle, we provide deltaSVM scores for ∼97 million SNPs, tering pipeline was written in R and contains 7 filtering steps which can be used as 1 of the parameters for assessing po- (Fig. 4,Table 4). The input file is a merged metadata file, in tential SNP effects. Additionally, we share predicted Universal which each region was calculated for the number of CAGE Datasets (not yet filtered) for 10 other mammalian species in a peaks mapped, the RNA-Seq signal from 86 cattle RNA-Seq format compatible for uploading to the UCSC genome browser datasets, the Villar H3K27Ac signal, the SVM enhancer scores (Table 5; Fig. S7). These 10 additional datasets can be useful for (enhancer activity predicted by a machine learning classifica- exploring potential regulatory effects from noncoding genomic tion method, gkmSVM) [53], the number of overlapping anno- regions. tations, the conservation score based on the UCSC 100-way vertebrate alignment [54], and the number of TFBS based on Cluster-Buster scanning [55]. The main filtering pipeline was HPRS Filtering pipeline.Rmd. We tested a range of parameters Additional files and recommend using the parameters set in the script. In addition, prior to running this main script, users can choose Additional file 1: Figure S1: Assessing mapping results for dif- to optimize parameters suitable to specific datasets using ferent human tissues and cell line datasets onto other species. the script HPRS Filtering optimize FilterOrder.Rmd, which cal- The counts and percentages are for mapped regions before the culates Ratio and Ratio (average number of enhancers and pro- HPRS filtering pipeline. A, Possible cross-species mapping re- P E moters per Mb of the total length of all predicted enhancers and sults, with 3 scenarios: (i) reciprocal mapping with a low iden- promoters) for each filter and for a range of filter parameters tity threshold (minMatch = 0.20) but that requires exact back so that the optimal parameters are used in the main filtering mapping to the human genome; (ii) multiple mapping allowing pipeline. The filtering pipeline was written in a such a way that multiple targets, but requiring a stringent minMatch = 0.80; and it is simple to add or remove filter layers depending on availabil- (iii) features that do not fall into the 2 categories above, such as ity of species-specific data. species-specific enhancers, are not included in the prediction. B, Coverage of the cattle Villar enhancer reference dataset by predicted and random feature datasets. Features were mapped Methods to apply HPRS dataset for regulatory SNP from 42 human enhancer datasets or 42 random datasets (equal analysis number of regions and region length distribution) to the bovine The HPRS dataset can be applied for the selection of top can- genome and then compared for percent overlap with the cat- didate SNPs in regulatory regions, which are present in exist- tle Villar reference liver enhancer dataset. For all 42 datasets, the regulatory datasets produced 5–10 times higher coverage of ing genotyping SNP chips. The selected SNPs form a small set of SNPs that are more likely to be causal or associated to phe- the reference bovine liver enhancers than the random datasets. C, Coverage of predictions by 12 common ENCODE human cell notypes. Using these SNPs for GWAS analysis may reduce noise compared with using a large number of SNPs that are noncausal lines (x-axis). The numbers of recovered regions using minMatch = 0.2 and minMatch = 0.95 are shown. D, Percentage of addi- but in high linkage disequilibrium to causal SNPs. The top can- didate SNPs can be selected by the identification of SNPs be- tional features that had multiple mapped targets but met the high sequence similarity threshold for reciprocal mapping from longing or not belonging to the following categories: the Uni- versal Dataset, the Filtered Dataset, the TFBS of the predicted the bovine genome to the human genome (minMatch = 0.80). Additional file 2: Figure S2: Enrichment of GWAS-associated regulatory regions, and regulatory regions active in tissues re- lated to the trait of interest. In addition, deltaSVM scores can SNPs in HPRS filtered regulatory regions. Data are from a GWAS dataset for 2112 cattle measured for 10 different climatic change be used as 1 of the indicators for potential SNP effects, as dis- cussed in the Supplementary Methods section. Alternatively, the adaptation–related phenotypes [6]. A, Number of GWAS signifi- dataset can be used for post-GWAS analysis, in which signifi- cant SNPs, with –log(P-values) ≥7inany of the10separatephe- -7 notypes. GWAS P-values for each trait ≤10 are considered sig- cant SNPs in noncoding regions that are identified from GWAS can be assessed for potential effect on gene regulatory activ- nificant with multiple test correction (with Bonferroni-corrected P ≤0.05, and the number of SNP is ∼500 000 SNPs). SNPs were se- ity. We have discussed examples of applications for the cases of pleiotropic SNPs, climatic adaptation–associated SNPs, and lected for each of the 10 phenotypes separately and were pooled into 1 set, and density plots of SNP counts and corresponding – associated SNPs’ milk-production traits (Fig. S1, Table S1), and of post-GWAS analysis for the stature phenotype and Callipyge log(P-values) for each phenotype are shown. The x-axis shows – log(P-values) values from 0 to 20, and the y-axis shows density of phenotype (Fig. 6;TablesS2and S3). We developed an implementation pipeline of the gkm-SVM the SNP counts according to the –logP distribution. B, Significant SNPS were selected based on combined criteria: –log(P-values) model to estimate SNP effects on enhancer activities in cattle by adapting the model to the case where very limited species- >2 and abs(effect size) greater than or equal to the third quartile effect size value for each of the 10 phenotypes. The x-axis shows specific ChIP-Seq data are available for model training (Supple- mentary Materials and Methods). the name IDs of the 10 phenotypes. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 15 Additional file 3: Figure S3: Promoter prediction. A, We se- in cattle. Single nucleotide resolution scores within the ALDOB lected a random large region of the chromosome to evaluate enhancer are shown. Negative scores indicate loss of function promoter prediction. We observed consistent overlapping of pre- (or TF binding), while positive scores indicate increases in ac- dicted promoters with known transcription start sites (TSS). The tivities. Computational predictions of transcription factor bind- higher and denser numbers of predicted promoters compared ing sites (by FIMO [29] and JASPAR position weight matrices) are with annotated TSS suggest that the HPRS prediction poten- shown in the lower panels. Transcription factor IDs and SNP IDs tially led to the identification of unannotated promoters, in- are shown next to the predicted regions. The ALDOB enhancer cluding alternative promoters within annotated transcripts and was mapped from humans to cattle. Vertical dashed lines show promoters of unannotated transcripts such as those for long the locations of the deltaSVM peaks, where SNPs most likely re- noncoding RNAs. B, HPRS promoters also predict bidirectional duce the enhancer activity, compared with the locations of pre- promoters with high accuracy (- for antisense, + for sense). C, dicted TFBS. The deltaSVM score prediction was consistent with HPRS-predicted alternative promoters are supported by cattle luciferase activity measurement (in humans) and prediction of expression sequencing tag (EST) data. The predicted promoters TFBS (in humans and cattle). overlap the start sites of EST transcripts within the full-length Additional file 7: Figure S7: An example of a simple view of ZC3H14 gene. the datasets generated for 10 mammalian species. The exam- Additional file 4: Figure S4: Enrichment of TFBS within en- ple is from the dog (canFam3) genome. Predicted regulatory re- hancers and promoters. The promoters and enhancers were gions are shown in blue, with annotations (enhancer, promoter, mapped from the human FANTOM enhancer [1]and thehu- and transcription factor IDs) marked on the left. For regions with man FANTOM promoter databases onto the bovine genome [2]. multiple annotations, users can display the annotations by se- Mapped regions were compared with the Villar bovine enhancer lecting the region on the browser. The example shows the ENPP1 and promoter reference datasets for liver tissues [7]. Three cat- gene. egories of overlapping to the reference datasets were compared Additional file 8: Table S1: Enrichment of GWAS SNPs in com- (x-axis): (i) mapped regions overlapping the Villar reference mon phenotypes measured for dairy cattle. dataset (LOinLiver); (ii) mapped regions not in the Villar dataset Additional file 9: Table S2: Hi-C targets and gkm-SVM predict (LOnotinLiver); and (iii) regions in reference datasets not covered causative SNPs and gene targets. by mapped regions (LiverNotinLO). The TFBS were derived from Additional file 10: Table S3: Predicted Hi-C interaction regions whole–bovine genome scanning using the Cluster Buster pro- at the putative Callipyge locus (chr21:67339968:67340027). gram [13] and 3 major transcription factor position weight ma- Additional file 11: Table S4: Regulatory datasets used for op- trix databases (TRANSFAC, JASPAR, and ENCODE) [21–23]. timizing mapping parameters. Additional file 5: Figure S5: Tissue specificity of predicted reg- Additional file 12: Table S5: Data sources for publically avail- ulatory regions. A, Counts of HPRS-predicted enhancers (using able human datasets used as inputs for the HPRA pipeline in this 88 ROADMAP human enhancer datasets) that overlap with the paper. Villar cattle reference enhancers. B, We then defined a tissue- specific enhancer dataset by identifying HPRS regions that over- lap with Villar reference enhancers for cattle and are unique to Abbreviations each of the 88 tissues. The datasets that yielded the highest over- bp: base pair; CAGE: Cap Analysis of Gene Expression; CDS: cod- lap are those from the liver cell line (liver hepatocellular cells - ing sequence; ENCODE: Encyclopedia of DNA Elements; FAN- HepG2) and the human liver tissue. C, We mapped 101 RNA se- TOM: Functional Annotation of the Mammalian genome; FD: quencing datasets, collected from more than 79 tissues (Table Filtered Dataset; Gb: Giga base; GWAS: Genome-Wide Associa- S5), to the predicted regulatory regions. The mapped RNA signal tion Studies; HPRS: Human Projection of Regulatory Regions; kb: was used to compare the similarity between different tissues. kilo base; Mb: Mega base pair; RDEs: Regulatory DNA Elements; Strong enrichment of brain, muscle, and liver tissues was ob- ROADMAP: Roadmap Epigenomics Mapping Consortium; SNP: served. Single Nucleotide Polymorphism; SRA: Sequence Read Archive; Additional file 6: Figure S6: Large-scale gapped k-mer sup- SVM: Support Vector Machine; TE: Transposable Elements; TFs: port vector machine (LS-gkm-SVM) scores for enhancers and Transcription Factors; UCSC: University of California Santa Cruz; deltaSVM scores for SNPs. A, The LS-gkm-SVM model was used ENSEMBL: Ensemnbl database; TSS: Transcription Start Sites; to calculate the gkm-SVM scores for all enhancers in the Vil- TFBS: Transcription Binding Sites; PWM: Position Weight Matrix; lar dataset. Red, enhancers scored on “enhancers versus back- UD: Universal Dataset. ground matrix”; green, random regions (selected by shuffling through the genomes to sample genomic regions of the same length as the Villar reference bovine enhancers) scored on “en- Competing interests hancers versus background matrix”; blue, enhancers scored on a “background versus background” matrix. The positive The authors declare that they have no competing interests. background was selected from the Villar reference enhancer dataset, as described in the Supplementary Materials and Meth- ods section. Training datasets using human (HHb) and cat- Funding tle (BBb) and liftOver enhancer regions from human to cattle Q.N. and M.N.S. were supported by CSIRO OCE Postdoctoral Fel- (LOBHb) yielded consistent and comparable results, which pre- lowships. dicted higher scores for enhancer regions (BBb Enh, HHb Enh, LOBHb Enh) than the predictions for promoter (pmtr Enh) and random regions (BBb Neg, HHb Neg, LOBHb Neg, and Pmtr neg). Author contributions B, deltaSVM for scoring SNP effects on enhancer activity. The LS- gkm-SVM model was used to score every possible SNP across the B.D., R.T., J.K., B.B., and Q.N. conceived the project. Q.N. and enhancer of the aldolase B fructose bisphosphate (ALDOB)gene B.D. designed the HPRS algorithm. Q.N. wrote the pipeline. Q.N., Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 16 Nguyen et al. M.N.S., L.P.N., A.R., and B.H. contributed to the analysis. Q.N. and 15. Lelli KM, Slattery M, Mann RS. Disentangling the many lay- B.D. wrote the manuscript, with input from all other co-authors. ers of eukaryotic transcriptional regulation. Annu Rev Genet 2012;46(1):43–68. 16. Spitz F, Furlong EEM. Transcription factors: from en- Acknowledgements hancer binding to developmental control. Nat Rev Genet 2012;13(9):613–26. We thank Dr Tony Vuocolo (CSIRO) for insightful discussions, Dr 17. Boyle AP, Araya CL, Brdlik C et al. Comparative analysis of Li Congjun (ARS, USDA) for sharing the update of a published regulatory information and circuits across distant species. ChIP-Seq dataset, and Dr Derek Bickhart (ARS, USDA) for provid- Nature 2014;512(7515):453–6. ing us with the updated TFBS dataset. We thank the generators 18. Ho JWK, Jung YL, Liu T et al. Comparative analysis of meta- of the human genomics, transcriptomics, and epigenomics re- zoan chromatin organization. Nature 2014;512(7515):449–52. sources, which are highly valuable to the broader research com- 19. Cheng Y, Ma Z, Kim BH et al. Principles of regulatory in- munity. formation conservation between mouse and human. Nature 2014;515(7527):371–5. 20. The Fantom Consortium, Riken PMI, and CLST, Forrest AR, References Kawaji H et al. A promoter-level mammalian expression at- 1. The ENCODE Project Consortium. An integrated encyclo- las. Nature 2014;507(7493):462–70. pedia of DNA elements in the human genome. Nature 21. Roadmap Epigenomics Consortium, Meuleman W, Ernst 2012;489(7414):57–74. J et al. Integrative analysis of 111 reference human 2. Yue F, Cheng Y, Breschi A et al. A comparative ency- epigenomes. Nature 2015;518(7539):317–30. clopedia of DNA elements in the mouse genome. Nature 22. Villar D, Berthelot C, Aldridge S et al. Enhancer evolution 2014;515(7527):355–64. across 20 mammalian species. Cell 2015;160(3):554–66. 3. Ward LD, Kellis M. HaploReg v4: systematic mining of puta- 23. Kent WJ, Sugnet CW, Furey TS et al. The Human Genome tive causal variants, cell types, regulators and target genes Browser at UCSC. Genome Res 2002;12(6):996–1006. for human complex traits and disease. Nucleic Acids Res 24. Kleftogiannis D, Kalnis P, Bajic VB. Progress and challenges 2016;44(D1):D877–81. in bioinformatics approaches for enhancer identification. 4. Farh KK-H, Marson A, Zhu J et al. Genetic and epigenetic Brief Bioinformatics 2016;17(6):967–79. fine mapping of causal autoimmune disease variants. Na- 25. Lizio M, Harshbarger J, Shimoji H et al. Gateways to the FAN- ture 2015;518(7539):337–43. TOM5 promoter level mammalian expression atlas. Genome 5. Li MJ, Liu Z, Wang P et al. GWASdb v2: an update database Biol 2016;16(22). doi: 10.1186/s13059-014-0560-6. for human genetic variants identified by genome-wide asso- 26. Sloan CA, Chan ET, Davidson JM et al. ENCODE data ciation studies. Nucleic Acids Res 2016;44(D1):D869–76. at the ENCODE portal. Nucleic Acids Res 2016;44(D1): 6. Andersson R, Gebhard C, Miguel-Escalada I et al. An atlas of D726–32. active enhancers across human cell types and tissues. Na- 27. Takahashi H, Lassmann T, Murata M et al. 5’ end–centered ture 2014;507(7493):455–61. expression profiling using cap-analysis gene expression 7. Corradin O, Cohen AJ, Luppino JM et al. Modeling disease risk and next-generation sequencing. Nat Protoc 2012;7(3): through analysis of physical interactions between genetic 542–61. variants within chromatin regulatory circuitry. Nat Genet 28. Strausberg RL, Levy S. Promoting transcriptome diversity. 2016;48(11):1313–20. Genome Res 2007;17(7):965–8. 8. MacLeod IM, Bowman PJ, Vander Jagt CJ et al. Exploiting bi- 29. Carninci P, Sandelin A, Lenhard B et al. Genome-wide analy- ological priors and sequence variants enhances QTL discov- sis of mammalian promoter architecture and evolution. Nat ery and genomic prediction of complex traits. BMC Genomics Genet 2006;38(6):626–35. 2017;17(144). doi: 10.1186/s12864-016-2443-6. 30. Gerstein MB, Kundaje A, Hariharan M et al. Architecture of 9. Fragomeni BO, Lourenco DAL, Masuda Y et al. Incorpora- the human regulatory network derived from ENCODE data. tion of causative quantitative trait nucleotides in single-step Nature 2012;489(7414):91–100. GBLUP. Genet Sel Evol 2017;49(134):463–71. 31. Hnisz D, Day DS, Young RA. Insulated neighborhoods: struc- 10. Wang M, Hancock TP, MacLeod IM et al. Putative enhancer tural and functional units of mammalian gene control. Cell sites in the bovine genome are enriched with variants 167(5):1188–200. affecting complex traits. Genet Sel Evol 2017;49(56). doi: 32. Moritz LE, Trievel RC. Structure, mechanism, and regula- 10.1186/s12711-017-0331-4. tion of polycomb repressive complex 2. J Biol Chem 2017; 11. Fang L, Sahana G, Ma P et al. Use of biological priors en- doi:10.1074/jbc.R117.800367. hances understanding of genetic architecture and genomic 33. Narlikar L, Ovcharenko I. Identifying regulatory elements prediction of complex traits within and between dairy cattle in eukaryotic genomes. Brief Funct Genomics Proteomics breeds. BMC Genomics 2017;18(1):604. 2009;8(4):215–30. 12. Andersson L, Archibald AL, Bottema CD et al. Coordinated 34. Wang J, Zhuang J, Iyer S et al. Sequence features and chro- international action to accelerate genome-to-phenome with matin structure around the genomic regions bound by 119 FAANG, the Functional Annotation of Animal Genomes human transcription factors. Genome Res 2012;22(9):1798– project. Genome Biol 2015;16(1):57. 13. Tuggle CK, Giuffra E, White SN et al. GO-FAANG meeting: 35. Elsik CG, Unni DR, Diesh CM et al. Bovine Genome Database: a gathering on functional annotation of animal genomes. new tools for gleaning function from the Bos taurus genome. Anim Genet 2016;47(5):528–33. Nucleic Acids Res 2016;44(D1):D834–9. 14. Shlyueva D, Stampfel G, Stark A. Transcriptional enhancers: 36. Groenen MAM, Archibald AL, Uenishi H et al. Analyses of pig from properties to genome-wide predictions. Nat Rev Genet genomes provide insight into porcine demography and evo- 2014;15(4):272–86. lution. Nature 2012;491(7424):393–8. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 17 37. Bolormaa S, Pryce JE, Reverter A et al. A multi-trait, 50. Nguyen et al.. The Commonwealth Scientific and In- meta-analysis for detecting pleiotropic polymorphisms for dustrial Research Organisation (CSIRO). HPRS pipeline stature, fatness and reproduction in beef cattle. PLoS Genet CSIRO BitBucket. 2018. https://bitbucket.csiro.au/ 2014;10(3):e1004198. users/ngu121/repos/hprs/browse (10 February; 2018, date 38. Porto-Neto LR, Reverter A, Prayaga KC et al. The genetic ar- last accessed). chitecture of climatic adaptation of tropical cattle. PLoS One 51. Nguyen QH, Tellam RL, Naval-Sanchez M. et al. Support- 2014;9(11):e113284. ing data for “Mammalian genomic regulatory regions pre- 39. Raven L-A, Cocks BG, Hayes BJ. Multibreed genome wide as- dicted by utilizing human genomics, transcriptomics and sociation can improve precision of mapping causative vari- epigenetics data.” GigaScience Database 2017. http://dx.doi. ants underlying milk production in dairy cattle. BMC Ge- org/10.5524/100390. nomics 2014;15(1):62–62. 52. Harris RS. Improved pairwise alignment of genomic DNA. 40. Karim L, Takeda H, Lin L et al. Variants modulating PhD Thesis. College of Engineering, Pennsylvania, USA: the expression of a chromosome domain encompass- Pennsylvania State University; 2007. ing PLAG1 influence bovine stature. Nat Genet 2011; 43(5): 53. Lee D, Gorkin DU, Baker M et al. A method to predict the im- 405–13. pact of regulatory variants from DNA sequence. Nat Genet 41. Takasuga A. PLAG1 and NCAPG-LCORL in livestock. Anim Sci 2015;47(8):955–61. J 2016;87(2):159–67. 54. Siepel A, Bejerano G, Pedersen JS et al. Evolutionarily con- 42. Hansel-Hertsch R, Beraldi D, Lensing SV et al. G-quadruplex served elements in vertebrate, insect, worm, and yeast structures mark human regulatory chromatin. Nat Genet genomes. Genome Res 2005;15(8):1034–50. 2016;48(10):1267–72. 55. Frith MC, Li MC, Weng Z. Cluster-Buster: finding dense 43. de Wit E, de Laat W. A decade of 3C technologies: insights clusters of motifs in DNA sequences. Nucleic Acids Res into nuclear organization. Genes Dev 2012;26(1):11–24. 2003;31(13):3666–8. 44. Jin F, Li Y, Dixon JR et al. A high-resolution map of the three- 56. Zerbino D, Wilder SP, Johnson N et al. The Ensembl Regula- dimensional chromatin interactome in human cells. Nature tory Build. Genome Biol 2015;16(1):56. 2013;503(7475):290–4. 57. Visel A, Minovitsky S, Dubchak I et al. VISTA Enhancer 45. Pereira AGA, Utsunomiya YT, Milanesi M et al. Pleiotropic Browser—a database of tissue-specific human enhancers. genes affecting carcass traits in Bos indicus (Nellore) cattle Nucleic Acids Res 2007;35(Database):D88–92. are modulators of growth. PLoS One 2016;11(7):e0158165. 58. Schwartz S, Kent WJ, Smit A et al. Human-mouse alignments 46. Allais-Bonnet A, Grohs C, Medugorac I et al. Novel insights with BLASTZ. Genome Res 2003;13(1):103–7. into the bovine polled phenotype and horn ontogenesis in 59. Hinrichs AS, Karolchik D, Baertsch R et al. The UCSC bovidae. PLoS One 2013;8(5):e63512. Genome Browser Database: update 2006. Nucleic Acids Res 47. Wiedemar N, Tetens J, Jagannathan V et al. Independent 2006;34(90001):D590–8. polled mutations leading to complex gene expression differ- 60. Kent WJ, Baertsch R, Hinrichs A et al. Evolution’s caul- ences in cattle. PLoS One 2014;9(3):e93435. dron: duplication, deletion, and rearrangement in the 48. Carlson DF, Lancto CA, Zang B et al. Production of hornless mouse and human genomes. Proc Natl Acad Sci U S A dairy cattle from genome-edited cell lines. Nat Biotechnol 2003;100(20):11484–9. 2016;34(5):479–81. 61. Hindrichs et al. UCSC Table Browser. Santa Cruz: Uni- 49. Santagati F, Rijli FM. Cranial neural crest and the building of versity of California 2006. http://hgdownload.cse.ucsc.edu/ the vertebrate head. Nat Rev Neurosci 2003;4(10):806–18. downloads.html (25 August 2016, date last accessed). Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png GigaScience Oxford University Press

Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics, and epigenetics data

Free
17 pages

Loading next page...
 
/lp/ou_press/mammalian-genomic-regulatory-regions-predicted-by-utilizing-human-shK0xvYOSw
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press.
eISSN
2047-217X
D.O.I.
10.1093/gigascience/gix136
Publisher site
See Article on Publisher Site

Abstract

Genome sequences for hundreds of mammalian species are available, but an understanding of their genomic regulatory regions, which control gene expression, is only beginning. A comprehensive prediction of potential active regulatory regions is necessary to functionally study the roles of the majority of genomic variants in evolution, domestication, and animal production. We developed a computational method to predict regulatory DNA sequences (promoters, enhancers, and transcription factor binding sites) in production animals (cows and pigs) and extended its broad applicability to other mammals. The method utilizes human regulatory features identified from thousands of tissues, cell lines, and experimental assays to find homologous regions that are conserved in sequences and genome organization and are enriched for regulatory elements in the genome sequences of other mammalian species. Importantly, we developed a filtering strategy, including a machine learning classification method, to utilize a very small number of species-specific experimental datasets available to select for the likely active regulatory regions. The method finds the optimal combination of sensitivity and accuracy to unbiasedly predict regulatory regions in mammalian species. Furthermore, we demonstrated the utility of the predicted regulatory datasets in cattle for prioritizing variants associated with multiple production and climate change adaptation traits and identifying potential genome editing targets. Keywords: regulatory genomics; mammalian genome; cattle; pigs; enhancers; promoters; transcription factors; SNP; PLAG1, Poll Received: 8 June 2017; Revised: 7 November 2017; Accepted: 22 December 2017 The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 2 Nguyen et al. identify a part of the regulatory repertory in the targeted species. Background We recognize that species-specific regulatory elements may be Predicting functional features of the genome beyond protein- underrepresented in this process. However, we note that the coding regions has been the primary focus of the post-genome fundamental biology of, e.g., that encompassing developmental sequencing era [1, 2]. More than 90% of common genetic vari- programs, response to stimuli, reproduction, energy homeosta- ants associated with phenotypic variation of complex traits are sis, and many other systems shows considerable conservation located in intergenic and intronic regions that regulate gene ex- of components and processes across species [17, 19, 22]. pression but do not change protein structure [3–5]. Moreover, In the current research, we developed the Human Projec- SNPs associated with diseases such as autoimmune diseases, tion of Regulatory Regions (HPRS) method to utilize results from multiple sclerosis, Crohn’s disease, rheumatoid arthritis, and thousands of biochemical assays in human samples to compu- type 1 diabetes are strikingly enriched in promoters and en- tationally predict equivalent information in other mammalian hancers [4, 6, 7]. Annotation of functional regions of the genome species. The method exploits the conservation of regulatory ele- that harbour SNPs identified by genome-wide association stud- ments at the DNA sequence and genome organizational levels to ies (GWAS) to be significantly associated with variation in phe- map these elements to other mammalian species. It then uses notype will contribute to the identification of functional SNPs species-specific data to filter these mapped sequences, which and causative mutations, thereby suggesting genetic targets and are enriched for regulatory sequence features, to predict a set markers for numerous applications in human health care and of high-confidence regulatory regions. We selected cattle as the agricultural livestock production [8–11]. target species to build the HPRS pipeline and then used the pig However, in mammalian species other than the human and as a test species to validate the pipeline. The 2 species are im- mouse, there are few data available at the genome level for dis- portant agricultural ruminant and nonruminant species, respec- covery of regulatory elements. The recently established Func- tively, with genomes sequenced but with little information avail- tional Annotation of ANimal Genomes (FAANG) consortium has able about genomic regulatory regions [12]. We also applied the begun to address this deficiency in a coordinated fashion [ 12, method to the genomes of 8 additional mammals. We demon- 13]. It is expected that core assays identifying regulatory ele- strated that the predicted regulatory dataset produced by the ments for key tissues in a number of production animals will be HPRS pipeline is useful for selecting more likely functional SNPs produced by the FAANG consortium and collaborators. However, before (e.g., for SNP chip design) and after (e.g., for prioritiz- the information generated in the foreseeable future for livestock ing significant SNPs) GWAS analysis, genomic prediction mod- is likely to remain far less comprehensive for coverage of tis- els, and the understanding of biological mechanisms underlying sues, sampling conditions, and breadth of annotation of regu- noncoding genomic variant effects to potentially identify regu- latory elements compared with the human and mouse. The de- latory targets for genome editing. ficiency in the genome-wide prediction of regulatory elements is far greater for most other mammalian species. We have de- veloped a computational method to utilize thousands of human Results and Discussion regulatory datasets to predict regulatory elements in important A pipeline for the projection of human genomic mammalian species. features to other mammals Transcriptional regulatory DNA elements (RDEs) are defined as genomic regions that are binding sites for 1, or usually a The 4 key elements of the HPRS pipeline (Fig. 1) include (1) se- combination of, transcription factors (TFs) and transcriptional lection of suitable regulatory datatypes (biochemical assays) and coregulators [14–16]. Across distant species from C. elegans to tissues in humans; (2) mapping the selected features to the tar- D. melanogaster to humans, the architecture of gene regulatory get species by utilizing conservation of genome organization networks, organization of chromatin topological domains, chro- and sequence identity to maximize coverage without compro- matin context at enhancer and promoter regions, and nucle- mising specificity; (3) first round filtering of the mapped regions osome positioning are remarkably conserved [17, 18]. For ex- to retain high-confidence mapped features, which had strict 1- ample, the majority of co-associations of transcription factors to-1 forward and reciprocal mapping and where human fea- (i.e., combinations of different transcription factors binding to tures have multiple mappings to the target genome, keeping the same genomic region) at proximal transcription start site only those with high sequence identity; and (4) second round regions in humans remain proximal in the worm (80%) and filtering by applying a pipeline to utilize available (often limited fly (100%). Large-scale comparisons between the human and in scale and coverage) species-specific data to prioritize regions mouse (M. musculus) in the Encyclopedia of DNA Elements (EN- likely to be functional in the target species. CODE) project found a high level of conservation of binding mo- tifs and activities, including TF binding to different chromatin states (r = 0.9), proportion of enhancers in TF binding regions Optimizing parameters for mapping sequence features (r = 0.7), DNA methylation preferences within TF-occupied re- across genomes gions (hypomethylated regions in both species), and TFs sharing To identify regions that were likely to be orthologous between a conserved primary binding motif sequence (∼94% of studied TFs) [19]. The human Encyclopedia of DNA Elements (ENCODE), genomes, we deployed the liftOver tool [23]and theprecom- puted alignment files available from UCSC to map regulatory Functional Annotation of the Mammalian genome (FANTOM), Roadmap Epigenomics Mapping Consortium (ROADMAP), and regions in the human genome to the cattle genome based on sequence similarity and genome location. First, we optimized related projects have generated large volumes of data relevant to the identification of promoters, enhancers, and other RDEs [ 6, the minMatch mapping threshold of the liftOver tool, which is the minimum proportion of bases to the total length of a region 20, 21]. However, these data have not been utilized for predict- ing regulatory genomic regions in other mammalian species—a mappable to contiguous aligned segments in the target genome. strategy that can produce more comprehensive predictions than The minMatch parameter was thoroughly tested with a range alternative options using a small set of experimental assays to from high stringency 0.95 down to 0.1 (Fig. 2). The minMatch Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 3 Figure 1: A streamlined workflow for the prediction of regulatory regions. Four key steps include mapping human regulatory regions to a target genome (creating a universal dataset), filtering the mapped regions by 7 epigenomic, transcriptomic, and genomic criteria to keep only regions with potential regulat ory functions, validating the predicted regions by comparing with the known reference dataset, and translating the findings to potential applications in genomic te chnology. parameter values were assessed using 7 diverse datasets (Fig. mapped SSS were within 25 bp of the boundaries of the original 2,Table 1). SSS in the human genome. The percentage of regions mappable to the target genome In all 5 enhancer datasets tested as shown in Fig. 2A, the ra- was compared with the total number of elements in the hu- tio of mapped regions increased steadily when the minMatch man regulatory databases (Fig. 2A). For cattle, mappable regions parameter was reduced from 0.95 to 0.55, with a much slower were defined as: (1) a small sequence segment (SSS) that can be increase when the minMatch was reduced from 0.55 to 0.10 mapped from the human to the bovine genome; (2) the resulting (Fig. 2A). The accuracy of the sequence projection was assessed SSS can be mapped back (reciprocally mapped) from the bovine as the percentage of mapped regions that overlapped with a to the human genome; and (3) the boundaries of the reciprocally feature present in a reference cattle liver enhancer dataset, Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 4 Nguyen et al. Figure 2: Optimization of mapping parameters using 7 input databases. The input databases included 5 human enhancer databases (ENSEMBL, ENCODE, ROADMAP liver tissue, Vista, and FANTOM enhancers), 1 human promoter database (FANTOM promoters), and 1 annotated human exon database (UCSC hg19) [6, 18, 21, 56, 57]. The numbers of regions from each dataset used to optimize parameters are shown in Table S4. We used the UCSC pair-wise whole-genome alignment chain file s between the human genome (hg19) and the bovine genome (UMD3.1) and performed mapping from the human genome to the bovine genome (minMatch 0.1 to 0.95 as shown in the x-axis) and then reciprocal mapping from the bovine genome back to the human genome [52, 58–60]. A, Recovered rate, defined as the percentage of the number of mapped regions with exact reciprocal mapping to the total number of original regions in humans. B, Confirmation rate, defined as the percen tage of reference regions covered by predicted regions to the total number in reference regions (Villar reference enhancers, Villar reference promoters, and cattle GENCODE genes V19). C, Specificity, defined as the percentage of matched reference (true positive for the reference dataset) compared with the total number of p redicted regions. Table 1: Summary information for the optimized set of human regulatory datasets used for HPRS mapping Total No. of Mean Dataset Tissues/cell lines regions Region types length, bp ENCODE Distal TFs [19, 26] 0/91 1 122 364 Binding sites 151.2 for 163 TFs ENCODE Proximal TFs [19, 26] 0/91 384 343 Binding sites 151.4 for 163 TFs ROADMAP [21] 24 primary cells (e.g., blood cells, immune cells, and breast 9 102 278 Enhancers 970.8 myoepithelial cells), 14 primary culture (e.g., skin, muscle satellite, neurosphere, bone marrow), and 50 primary tissues (e.g., thymus, spleen, lung, fetal stomach) FANTOM Enhancers [6, 25] 135/673 43 011 Enhancers 289 FANTOM Promoters [20, 25] 152/823 201 802 Promoters 21.5 Information on data types and models is described in Table S3. SeeTable S5 for sample source details. identified experimentally by histone 3 lysine 27 acetylation mappable regions was obtained when decreasing the minMatch (H3K27Ac—a marker for active enhancers) and histone 3 lysine from 0.3 to 0.2. Therefore, the parameter testing indicated that 4 trimethylation (H3K4me3—a marker for active promoters near the optimal minMatch threshold was 0.2. transcription start sites) assays (hereafter referred to as the Vil- We also developed the method to detect regions possi- lar reference datasets) (Fig. 2B) [22]. The coverage of the rele- bly from gene duplication events (Supplementary Methods). To vant reference datasets (Villar reference promoters, Villar ref- identify regions possibly resulting from duplication events (Fig. erence enhancers, and UCSC exons) also increased when the S1A), the HPRS mapping pipeline pooled unmapped regions in minMatch was reduced for some, but not all databases (Fig. 2B). the human datasets (with minMatch = 0.2) and mapped re- Importantly, the reduction in mapping threshold did not lead gions with no exact reciprocal matches for a second round map- to a loss of specificity, which is defined as the percentage of ping with different parameters (allowing multiple mappings and predicted enhancers that matched Villar reference enhancers keeping only results with similarity higher than 80%) to rescue (true positive for the reference dataset) compared with the to- regions with multiple map targets. tal number of enhancers predicted using the particular input dataset (Fig. 2C). The combined results shown in Fig. 2Aand Optimized use of human regulatory datasets Fig. 2B suggest that reducing minMatch to lower than 0.55 still increases (at a slower rate) the number of mapped regions (for Regulatory regions can be active or quiescent, depending on the the ROADMAP, ENSEMBL, FANTOM, and ENCODE datasets) (Fig. cell type and the biological state, and therefore prediction us- 2B) and increases the chance of detecting more reference en- ing a single tissue/cell line, or a single assay type, is unlikely hancers (for the ROADMAP, ENSEMBL, and ENCODE datasets (Fig. to produce a high coverage of all possible regulatory sequences 2A). No significant difference was observed when lowering the of a species [24]. Therefore, we investigated the effect of us- minMatch from 0.2 to 0.1, but a slight gain in the percentage of ing different databases on the predictive capacity of HPRS. First, Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 5 Figure 3: Effects of combining databases. A, Application of the HPRS mapping method to the genomes of 10 mammalian species. The input for the HPRS mapping was 1 of the 42 ROADMAP human datasets (38 adult tissues and 4 cell lines as shown in the x-axis) [22] to 10 mammalian species. The Villar reference enhancer datasets determined by H3K27Ac and H3K4me3 assays for liver tissue in each species were used to estimate the coverage of experimental enhancers by the predicted dataset (shown in the y-axis as species-specific enhancer dataset). For each species, the coverage was the percentage of the Villar reference enhancer datase t that overlapped with the HRPS prefiltered enhancers. B, The combination of all 40 tissues in each species was used. C and D, The optimal combination of 5 databases for enh ancers and promoters, respectively. The reference datasets include ROADMAP enhancers (42 tissues), ENCODE distal TFs, ENCODE proximal TFs, FANTOM enhancers, and FANTOM promoters. The numbers shown in the intersections are the number of common regulatory regions between the HPRS mapped regions and the Villar reference datasets. we compared the mapping coverage of enhancers from 42 hu- features that define different types of regulatory regions, e.g., man ROADMAP datasets [21] with the reference liver enhancer those that are specific to promoters or enhancers. In general, datasets, which were experimentally identified (by H3K27Ac as- species with closer evolutionary distance to humans had more say for liver tissues) for 10 mammalian species reported in Vil- HPRS-predicted enhancers matching the relevant Villar refer- lar et al. (Fig. 3Aand B, Tables 1 and 5)[22]. Fig. 3A shows the ence liver datasets (Fig. 3A). For each tissue, the relative map- percentage of Villar reference enhancers (e.g., enhancers de- ping rates were similar between species. Between different tis- tected in liver tissues in cats) that overlap with HPRS-predicted sues across the 42 ROADMAP datasets, thymus enhancers had regions by mapping each of the original 42 human tissues to the lowest mapping rate and liver enhancers the highest map- the targeted species (e.g., to the cat genome). Fig. 3B shows the ping rate in most species (Fig. 3B). Notably, the tissue specificity percentage overlapping with the results from using the com- effect, exemplified by the higher mapping rate for ROADMAP bined 42 tissues. Comparing results from each tissue, or from liver datasets to the relevant species Villar reference datasets combined tissues in each species, enabled assessment of vari- than for other ROADMAP tissues (Fig. 3B), was reduced substan- ation due to evolutionary distance or tissue specificity. Second, tially if the 2 primates that were more evolutionarily related we evaluated the predictions from human to bovine based on to humans (macaque and marmoset) were removed from the different datatypes, including promoter databases (FANTOM), comparison. enhancer databases (FANTOM and ROADMAP), and transcrip- As the coverage of the reference cattle liver enhancer dataset tion factor binding site databases (ENCODE proximal and distal was not significantly higher with human liver enhancers than TFs) (Fig. 3C and D). Each of the datatypes has unique sequence with enhancers from many of the other human ROADMAP Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 6 Nguyen et al. Table 2: Summary of promoter predictions No. within Fold Total regions in cattle Overlap with Villar Fold enrichment of 200 bp of TSS enrichment a c Dataset (MB, %) dataset (%) Villar dataset (%) of TSS Total No. of CAGE regions 154 377 (3.68, 0.138) 11 606 (84.1) 609 13 676 (51.0) 370 Filtered set CAGE regions 145 912 (3.46, 0.129) 11 203 (81.2) 629 13 011 (48.7) 377 Total all regulatory regions 542 756 (937.39, 35.11) 13 329 (96.6) 3 20 759 (77.6) 2 (Universal Dataset) Filtered regulatory regions 245 384 (356.1, 13.33) 13 104 (95.0) 7 17 715 (66.2) 5 (Filtered Dataset) Villar reference promoters 13 796 (32.90, 1.23) 13 796 (100) NA 10 212 (38.2) 31 ROADMAP promoters 81 892 (135.6, 5.08) 12 677 (91.9) 18 14 388 (53.8) 11 The total number of regions with liftOver at minMatch 0.2 and exact reciprocal matches, combined with regions that had multiple matches (no 1-to-1 relationship) but had high conservation (80% identity). The original human regions are from the FANTOM promoter dataset. The percentage was calculated for total genome size. Villar promoter dataset for cattle [22]. Promoter count within 200 bp of the Ensembl annotated UMD3.1 TSS Ensembl build 85 (total 26 740). tissue enhancer datasets [21], we asked whether combining tis- inference results in a small number of promoters. Approxi- sues would increase coverage. By combining the predictions mately 26 740 cattle genes (coding, lncRNAs, miRNAs, etc.) in from the 42 ROADMAP datasets, 2- to 4-fold higher coverage than the reference dataset used (Ensembl Build 85) have annotated from 1 tissue alone (at least 60% total coverage) could be ob- TSS. This dataset is far from comprehensive because of the ex- tained across a variety of species, with coverage being lowest for pected underrepresentation of noncoding genes and of alter- the rat and highest for macaque (Fig. 3A and B). Furthermore, native promoters (APs). The 1 gene-one promoter and 1 gene- we found that separate databases constructed using different one protein concepts are no longer appropriate to describe the models and biochemical assays were complementary, and com- diverse transcriptome [28]. APs are common and are function- bining them significantly increased coverage compared with a ally important. A number of APs were found associated with single database alone (Fig. 3C and D). For example, prediction us- complex traits [29]. While 51% of the Ensembl cattle TSS are ing the ENCODE distal TF dataset and the ROADMAP enhancer covered by mapped human CAGE transcription initiation peaks dataset covered the highest number of Villar cattle reference (3.7 Mb), only 38.4% are covered by the experimentally defined enhancers, while prediction using the FANTOM promoter and promoters (32.9 Mb) in Villar et al. [22], suggesting that HPRS the ENCODE proximal TFBS databases covered more Villar cat- predictions based on human CAGE data could enrich promoter tle reference promoters, and each dataset could add a num- coverage in the cow by more than 12 times compared with ber of unique regulatory regions not found in other datasets the standard promoter assay (H3K4me3 ChIP-Seq) (Table 2). Ac- (Fig. 3C and D). The combination of 88 ROADMAP datasets [21], tive TSS regions from 88 human tissues in the ROADMAP were the FANTOM enhancer and promoter datasets [25], and the EN- mapped to 81 892 putative promoters in cattle [21], with a total CODE distal and proximal TF datasets [26] generated a maxi- length of 135.6 Mb. Noticeably, the average number of Ensembl mum enhancer coverage of 95% (for macaque) and promoter reference TSS that overlapped every 1 Mb of predicted promoters coverage of 98% (for marmoset). Therefore, we selected an op- based on the ROADMAP database was 37-fold lower than those timal combination of human input databases for the HPRS based on the CAGE database (Table 2). pipeline on the basis that they represent promoters, enhancers, HPRS using the CAGE dataset can predict many TSS at single- and TFBS from a large combination of human tissues and pri- nucleotide resolution and can accurately predict transcriptional mary cells and were generated by different methods (Table 1). orientation. TSS are presented in the Ensembl database as single nucleotide genomic positions. HPRS-predicted promoters based on CAGE had exact overlap to the 7191 Ensembl TSS for cat- Predicting promoters tle. While promoter prediction by using histone marks (such as One of the most comprehensive human promoter datasets is those used by ROADMAP) cannot directly define transcriptional the FANTOM5 promoter atlas generated experimentally by CAGE orientation, this information, predicted by HPRS using human data from almost 1000 tissues and cell lines [20]. CAGE is a sen- CAGE data, is highly accurate [20]. Consistently, we found that of sitive methodology for the detection of transcription start sites 13 676 genes that have TSS within 500 bp of mapped CAGE peaks, (TSS) and hence defines core promoter regions where there is 96.9% (13 257) of genes had the same transcriptional orientation binding of the transcriptional machinery [27]. Promoters gener- in the Ensembl annotation and predicted by human CAGE data. ally have a high concentration of TFBS, typically within 300 bp We therefore assigned promoter orientation using the predic- upstream and 100 bp downstream of the TSS [20]. Promoter se- tions from the CAGE dataset. quences are more evolutionarily conserved than enhancer se- quences, and therefore a larger proportion can be mapped from Mapping enhancer datasets human to other mammal genomes [22]. Of 201 802 CAGE transcription initiation peaks in the FAN- Prediction of enhancers is likely to be more challenging than TOM5 human promoter atlas [20], 154 377 (76.5% of the to- predicting promoters because (1) enhancers are less conserved tal) were mappable to the bovine genome (Table 2). The HPRS in DNA sequence; (2) enhancer locations evolve faster [19, 22]; using CAGE predicted new TSS not present within the exist- and (3) enhancer effects are usually independent of the distance, ing bovine genome annotation (Ensembl Build 85). Although orientation, and relative location (upstream or downstream) of a promoter dataset for cattle can be inferred by defining up- gene targets [14]. To predict a broad set of sequences in a species stream sequences of genes with annotated TSS, this indirect that are active in 1 or more tissues or conditions, we expanded Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 7 Table 3: Summary of mapped and filtered regulatory sequences No. of mapped regions (%) Genome coverage, Mb (%) Datasets Human Cow Pig Human Cow Pig Total genome size, Mb NA NA NA 3137.2 (100) 2670.4 (100) 2808.5 (100) b b b ROADMAP enhancers, % mapped 9 102 278 (100) 5 917 129 (65) 5 620 417 (62) 8836.6 (NA) 6142.4 (NA) 5809.5 (NA) to target species ROADMAP enhancers (overlapping 494 583 (100) 371 295 (75) 361 682 (73) 1123.2 (35.8) 885.6 (33.2) 826.2 (29.4) regions were merged) FANTOM CAGE enhancers 43 011 (100) 34 303 (80) 27 558 (64) 12.4 (0.40) 12.2 (4.6) 9.6 (0.34) ENCODE distal TFs 1 122 364 (100) 749 572 (67) 716 515 (64) 169.7 (5.4) 132.0 (4.9) 124.4 (4.4) FANTOM CAGE promoter peaks 201 802 (100) 154 377 (76) 153 893 (76) 4.3 (0.14) 3.7 (0.14) 3.7 (0.13) ENCODE proximal TFs 384 343 (100) 298 554 (78) 279 774 (73) 58.2 (1.9) 48.0 (1.8) 48.9 (1.7) Merged ROADMAP, ENCODE, and 760 702 542 756 (86.1 519 913 (89.2 1165.7 (37.2) 919.5 (34.4) 857.8 (30.5) d d FANTOM datasets (Universal and 96.6) and 97.1) Dataset) Filtered Dataset NA 245 384 (73.5 151 523 (69.8 NA 356.1 (13.3) 311.5 (11.1) d d and 95.0) and 95.6) NA, not applicable. The percentage was not calculated for these 3 values because overlapping regions are present in the different enhancer datasets. The total size is bigger than the genome size because overlapping regions are included. Total size of nonoverlapping regions in the Universal Dataset (before filtering); the percent overlapping Villar reference enhancers (the former) a nd promoters (the later) in the targeted species. Percent overlap Villar reference liver enhancers and promoters in the filtered datasets. the human enhancer datasets to include 88 tissues, primary hancers and promoters (Table 1)[30]. The use of these TFBS cell lines, and primary cell cultures generated by the ROADMAP datasets not only supported predictions from using the en- project (Table 1)[21]; all human active enhancers defined by hancer and promoter datasets, but more importantly, added CAGE data from hundreds of tissues and cell lines in the FAN- other regulatory categories into the combined prediction of reg- TOM project [6]; and all the Villar experimentally defined refer- ulatory regions. For example, the binding targets of the CCCTC- ence cattle liver enhancers (Table 1)[22]. Cumulatively, the HPRS binding factor (CTCF) are likely insulator regions [31], while en- pipeline mapped more than 9.1 million human enhancer se- hancer of zeste homolog 2 (EZH2) binding sites may mark poly- quences to more than 5.9 million regions in the bovine genome, comb repressor complex 2 (PRC2) regions [32]. These ENCODE which were then merged into 542 756 nonoverlapping regions TFBS were identified as binding regions of TFs to nucleosome- (Table 3). The merged dataset (Universal Dataset) covered 86% free regions (∼151 bp per region), which are more biologi- (excluding merged regions resulting from the original Villar ref- cally relevant than de novo scanning of genome sequence for erence enhancers) of the Villar enhancer reference dataset (Ta- TFBS based on short position weight matrices (PWMs; typically ble 3). The term “Universal” reflects the initial pooling of all rel- 6–12 bp) because the later method only uses DNA sequence evant human regulatory datatypes and datasets to form a large and does not take into account the biological chromatin con- collection of genomic regions to be mapped to the target species. text, which is essential for transcription factor binding [33, 34]. Regulatory sequences are often active in certain conditions, and In total, from the ENCODE TFBS dataset [26, 34], 298 554 proxi- remain inactive in most other cases. Therefore, pooling active mal TFBS (total 47.97 Mb) and 749 572 distal TFBS (total 132.04 regulatory regions from a large number of datasets can likely Mb) were projected by HPRS onto the bovine genome. We also cover most active and inactive regulatory sequences, thus en- show that the HPRS prediction using ENCODE transcription fac- abling the prediction of a Universal Dataset. tor datasets was supported by 2 other independent prediction The HPRS mapping of the enhancer datasets predicted a large approaches (Supplementary Methods). set of homologous regions that are potentially regulatory regions in cattle (the Universal Dataset). We noted that the alignability of The filtering pipeline for a high-confidence regulatory DNA sequence does not automatically imply functionality [22], region dataset and therefore we applied a filtering pipeline to incorporate other types of cattle-specific data to prioritize regions more likely to The predictions produced by HPRS were optimized so that they have transcription regulation functions. The filtering pipeline occupied a relatively small part of the whole genome, but can used a combination of sequence features and epigenetics marks universally predict regulatory regions in different cell types and to enrich for likely regulatory enhancers and promoters, as dis- tissues. Applying HPRS for selected datasets (Fig. 3,Table 1), cussed in the filtering section. we first produced a preliminary Universal Dataset and then refined it to generate a Filtered Dataset (Table 3). To remove redundancies, overlapping mapped ROADMAP enhancers (ini- Mapping transcription factor binding site datasets tially mapped separately for each of the 88 ROADMAP datasets) To include potential regulatory regions beyond typical promoter were merged (Table 3). Similarly, all mapped regions for pro- and enhancer classifications, we performed HPRS mapping of moters, merged enhancers, and TFBS with overlapping coordi- human experimentally defined ENCODE TFBS (ENCODE anno- nates were merged into larger regions to form the final Universal tation version 2) to the bovine genome. The ENCODE TFBS Dataset (UD), containing 542 756 nonoverlapping regions. These database contains binding sites for 163 key TFs, some of which regions covered 937.4 Mb (35.1%) of the bovine genome. The high represent additional types of regulatory regions other than en- coverage (35.1%) of the UD was due to the large collection of Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 8 Nguyen et al. Figure 4. Enrichment of the enhancers and promoters by the filters in the HPRS filtering process. A, A pipeline to filter predicted regulatory regions from the Univ ersal Dataset with 542 756 regions, covering 937.4 Mb of the genome (35.1%). The initial number of experimentally defined Villar reference datasets include d 31 971 enhancers (E) and 12 257 promoters (P). The number of reference E and P, total number of predicted regulatory regions, and total length (in Mb) for all promoters and enhancers passing each filtering layer are shown. The Ratio (total enhancers overlapping Villar reference enhancers/total length) and Ratio (total promoters overlapping Villar E P reference promoters/total length) were used as criteria to assess enrichment for each filter. The 7 filters are described in Table 4 and in the Supplementary Methods. B, Two bar graphs showing enrichment results (using the same starting set) of using each of the 7 filtering steps in comparison with the baseline (whole g enome) as shown in the dashed lines, and the Universal Dataset (mapped regions, not filtered). The y-axis shows the average number of reference promoters or en hancers in every 1 MB of the genome. The density of regulatory regions predicted is an indicator of the prediction coverage and accuracy. The higher values indicate more experimentally validated enhancers and promoters are enriched after filtering, suggestive of a more efficient filter. Each filter was tested independe ntly, using the same Universal Dataset as the input, to compare the enrichment levels that resulted from each of the 7 filters. human datasets used as inputs for mapping to bovine (37.2% of predict a small set of potential transcription regulatory genomic the human genome) so that the UD covered almost all possible regions in the bovine genome (Fig. 4,Table 4). promoters, enhancers, and TFBS (Table 3). Importantly, the HPRS The filtering pipeline reduced the UD to the much smaller pipeline improves the specificity of the UD by applying a filter- Filtered Dataset (FD; the same as filtered UD), which covered ing step, which incorporates the power of cattle-specific data to a smaller part of the whole genome, but which still predicted Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 9 fi Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Table 4: Filters with species-specific data for selecting regulatory regions (refer to the Supplementary Materials and Methods) Length, Mb Ratio of enhancers, count/Mb No. of ratio promoters, count/Mb Filter Filtering parameters Cattle Pig Mouse Cattle Pig Mouse Cattle Pig Mouse Whole Genome baseline, enhancers/Mb 2670.4 2808.5 2730.8 31 971 (12.0) 23 804 (8.5) 18 396 (6.7) 13 796 (5.2) 11 114 (4.0) 15 164 (5.6) genome Universal Universal baseline, enhancers/Mb 937.4 882.4 699.7 31 971 (34.1) 23 804 (27.0) 18 396 (26.3) 13 796 (14.7) 11 114 (12.6) 15 164 (21.7) Dataset CAGE CAGE ≥ 2orCAGE = 1and RNAseq > 201.9 194.7 147.6 9628 (47.7) 6679 (34.3) 3935 (26.7) 10 152 (50.3) 9476 (48.7) 10 951 (74.2) mean(VillarRef) CAGE ≥ 1 250.7 248 189.8 11 318 (45.2) 8214 (33.0) 5199 (27.0) 12 103 (48.3) 9936 (39.9) 12 722 (67.0) H3K27Ac Log2(H3K27Ac ≥ median(log2(VillarRef)) 89.6 103.8 47.0 16 124 (180.0) 11 985 (115.4) 6736 (143.3) 3927 (43.8) 9324 (89.8) 10 570 (225.0) Log2(H3K27Ac) ≥ mean(log2(VillarRef)) 91.0 102.0 50.4 16 366 (179.8) 11 670 (114.4) 7255 (143.9) 3966 (43.6) 9305 (91.2) 10 707 (212.4) RNAseq Log2(RNAseq) ≥ 3rd 156.1 85.3 105.8 6999 (44.8) 3162 (37.1) 2726 (25.8) 5473 (35.1) 6412 (75.2) 4375 (41.3) quartile(log2(VillarRef)) Log2(RNAseq) ≥ median(log2(VillarRef)) 278.1 184.0 193.1 12 147 (43.7) 6748 (36.6) 4626 (24.0) 8442 (30.4) 8709 (47.3) 6346 (32.9) Log2(RNAseq) ≥ mean(log2(VillarRef)) 319.4 197.7 218.7 13 746 (43.0) 7268 (36.8) 5175 (23.7) 9249 (29.0) 8874 (44.9) 6839 (31.3) gkm-SVM Length < 3000 & SVM ≥ 85.9 4.7 77.2 3603 (41.9) 261 (55.6) 3525 (45.7) 9208 (107.1) 359 (76.4) 8570 (110.0) median(VillarRef) Length < 3000 & SVM ≥ mean(VillarRef) 87.4 3.7 74.5 3645 (41.7) 200 (53.9) 3446 (46.3) 9230 (105.6) 287 (77.3) 8555 (114.9) Length < 5000 & SVM ≥ mean(VillarRef) 133.4 49.3 110.8 5673 (42.5) 1858 (37.6) 4285 (38.7) 9766 (73.2) 1333 (27.0) 9100 (82.1) Annotation AnnCount ≥ 3rd quartile (VillarRef) 72.8 41.4 362.4 3308 (45.5) 893 (21.6) 9183 (25.3) 9618 (132.2) 7489 (181.0) 11 317 (31.2) count AnnCount ≥ median (VillarRef) 109.1 21.2 614.3 5173 (47.4) 887 (21.5) 13 366 (21.7) 10 599 (97.2) 7486 (181.7) 13 835 (22.5) AnnCount ≥ mean (VillarRef) 273.2 239.6 318.2 12 433 (45.5) 7792 (32.5) 8321 (26.2) 12 391 (45.4) 10 039 (41.9) 10 633 (33.4) PhastCons PhastCons ≥ 95th percentile (VillarRef) 28.0 26.7 31.1 939 (33.5) 722 (27.1) 851 (27.3) 1504 (53.7) 1165 (43.7) 2177 (69.9) PhastCons ≥ median (VillarRef) 383.6 351.3 354.9 13 068 (34.1) 9425 (26.8) 8319 (24.4) 9929 (25.9) 7746 (22.1) 11 016 (31.0) PhastCons ≥ mean (VillarRef) 247.4 227.0 248.0 8415 (34.0) 6098 (26.8) 6039 (24.3) 8000 (32.3) 6249 (27.5) 9456 (38.1) TFBS count TFBScount ≥ median (VillarRef) 16.3 12.9 213.3 4 (0.2) 2 (0.2) 6151 (28.8) 6700 (411.3) 3253 (252.0) 9739 (45.7) TFBScount ≥ mean (VillarRef) 379.9 882.4 295.9 12 933 (34.0) 23 804 (27.0) 7995 (27.0) 9956 (26.2) 10 788 (12.2) 10 729 (36.3) Regions with at least 2 CAGE peaks or with 1 CAGE peak, and the number of mapped reads from RNAseq data is higher than the mean mapped reads of the reference Villar enhancer dataset. The results are compared with results from applying another filtering criterion, which is CAGE peak above 1. The comparisons are based on total length of all selected regions, and the averag e count of the ltered regions normalized by length (regulatory regions per 1-Mb genome length). Regions with more of the H3K27Ac mapped reads than the median or mean mapped reads to reference enhancers in the Villar dataset (log2 scale). Regions with more of the RNAseq mapped reads than the 3rd quartile, median, or mean mapped reads to reference enhancers in the Villar dataset (log2 scale). Regions with length shorter than 3000 bp (or 5000 bp) and gkmSVM scores ≥ median or mean scores for the Villar enhancer dataset. Regions with more annotation terms (diversity of regulatory features) mapped to the regions, compared with those mapped to reference regions in the Villar enhancer dataset. Regions with higher conservation PhastCons scores compared with the 95th percentile, median, or mean scores to enhancers in the Villar dataset. Regions with more transcription factor binding sites than those in the Villar enhancer dataset. 10 Nguyen et al. Table 5: HPRS predicted regulatory datasets for 10 species most active enhancers and promoters (Table 4,Fig. 4). De- tailed discussion on rationale for selecting each filter is in the Total Enhancer Promoter Supplementary Materials and Methods. Briefly, the pipeline uti- No. of length, coverage, coverage, lized both biological data in the target species (86 RNA-Seq a a Species regions Mb % % datasets representing 79 cattle tissues [35], cattle H3K27Ac sig- nal [22], and DNA sequence conservation scores) and computa- Unfiltered datasets tionally estimated criteria (gapped k-mers support vector ma- Cattle (bTau6) 545 748 919.5 86.1 96.6 chine (gkm-SVM) scores, number of overlapping annotations, Pig (susScr3) 519 913 882.4 89.2 97.1 and number of CB-predicted TFBS) (Fig. 4A). Marmoset (CalJac3) 642 144 1106.4 93.1 98.4 Before filtering, the Universal Dataset had approximately Rhesus Macaque 693 312 1158.2 94.5 97.6 2.84 times higher Ratio (number of Villar reference enhancers (RheMac3) Dog (CanFam3) 570 317 877.5 89.4 97.6 by predicted regions divided by the length in Mb of predicted Cat (FelCat5) 570 282 903.9 90.8 97.1 regions) and 2.82 times higher Ratio (similar to Ratio , but for P E Guinea pig (CavPor3) 523 273 761.6 81.1 92.7 promoters) than the total genome baseline, and each filtering Rabbit (OryCun2) 531 109 819.4 86.8 96.8 step in the pipeline increased Ratio and Ratio compared with E P Mouse (Mm10) 478 974 699.7 79.6 93.2 the baseline (Fig. 4B, Table 4). At the end of the pipeline, a set of Rat (Rn5) 453 017 620.5 75.3 89.5 high-confidence regulatory regions (the FD), containing 245 384 Filtered datasets sequences (with a total length of 356.1 Mb, equivalent to 13.3% Cattle (bTau6) 245 358 356.1 73.5 95.0 of the whole genome) was obtained. The filtering reduced the Pig (susScr3) 151 523 311.5 69.8 95.6 number of regions by 2.2 times and the genome coverage by 2.6 Mouse (mm10) 281 071 308.4 68.9 91.4 times (Table 3,Fig. 4A), while still including most of the cattle liver reference enhancers and promoters (73.5% and 95.0%, re- The datasets were generated for each species using the same human data spectively) (Table 4,Fig. 3A). Importantly, the filtered dataset had sources, including: 88 ROADMAP tissues/primary cell lines, FANTOM promoters a 5.5 and 7.1 times higher Ratio and Ratio , respectively, than E P and enhancers, and ENCODE proximal and distal TFs (Table S2) and combined the genome baseline (Fig. 4). The size and coverage of the bovine with the Villar reference enhancer promoter dataset. The prediction results for each species are available as part of Additional file 2. genome (356.1 Mb, 13.3%) by HPRS predicted regulatory regions Coverage of the relevant Villar reference datasets [22]. were comparable to the published Fig. 1 for the mouse, which Reference genomes are from UCSC [61]. is 12.6% of the mouse genome, as predicted by ENCODE DNAse The relevant Villar reference species enhancer datasets were added prior to fil- I accessibility data and transcription factor ChIP-Seq (using an- tering. tibodies for 37 TFs on 33 tissues/cell lines) and histone modifi- cation ChIP-Seq data [2]. Similarly, applying the HPRS pipeline to the mouse genome, without using mouse-specific datasets trons), more than 92% are within intronic regions [3, 5]. To test from ENCODE or other sources (except for the reference Vil- the enrichment of potential causal SNPs within predicted regu- lar dataset), predicted potential regulatory regions that occupy latory regions in cattle, we explored the overlap between SNPs 11.3% of the whole mouse genome. in regulatory regions and pleiotropic SNPs, which are SNPs sig- nificantly associated with multiple traits. The pleiotropic SNPs were identified by an independent GWAS study for 32 cattle feed Validating and extending the HPRS pipeline in 9 other intake, growth, body composition, and reproduction traits [37]. mammalian species The GWAS used 10 191 beef cattle, with data (including imputed The performance of the HPRS pipeline was evaluated using the data) for 729 068 SNPs (Fig. 5). We observed a substantial fold en- porcine (pig) genome (susScr3) [36]. HPRS had been developed richment (∼2–4 times) of SNPs with –log (P-value) from3to20in based on the bovine genome, and the pig was then selected as a the Filtered Dataset compared with all other sets of commonly species for step-by-step comparison throughout the pipeline be- classifying SNPs in different genomic regions, including the set cause of the availability of experimentally defined porcine pro- of SNPs 5 kb upstream of protein coding genes. We also observed moter and enhancer reference datasets [22] and because the higher counts (for 6 out of 10 traits) of associated SNPs within pig is an evolutionarily divergent nonruminant production an- regulatory regions in a study on 10 climatic adaptation traits in imal. We obtained similar results in the pig compared with 2112 Brahman beef cattle (Fig. S1) [38]. Similarly, we found en- cattle on numbers of putative regulatory regions, percentage to richment of regulatory SNPs in a study of 5 major production total genome length, and coverage of the reference datasets (Ta- and functional traits in 17 925 Holstein and Jersey dairy cattle (P bles 3 and 4). Importantly, we extended the application of the < 0.05 for 3 out of 5 traits) (Table S1) [39]. These observations are HPRS mapping data from the human to 8 additional mammalian consistent with the pipeline identifying regulatory SNPs from species, which had reference promoter and enhancer datasets millions of SNPs in the genome and suggest that the predicted from the Villar et al. study. We generated HPRS mapped unfil- regulatory database is useful for prioritizing SNPs likely to be tered UDs and observed consistently high coverage of the refer- contributing to phenotypic variation of complex traits. ence enhancer and promoter datasets, and the coverages were comparable between all 10 mammalian species (Table 5). Thus, The regulatory region datasets can be used to guide the pipeline appears to have general utility, not just for livestock identification of potential causative SNPs and their species, but also for mammals in general. gene targets As examples of the application of our resources to identify likely SNPs in regulatory regions are enriched for significant causative mutations from a large list of significantly associ- GWAS SNPs ated SNPs, we applied the HPRS approach to analyse 2 well- More than 90% of significant GWAS SNPs lie outside gene-coding studied genetic variants in cattle that were known to contribute regions, and for those within the gene body (from the start to to phenotypic variation, but their mechanisms of action were the termination site of the complete transcript, including in- not known because they were located within noncoding regions. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 11 The bovine Pleomorphic adenoma gene 1 (PLAG1) locus has been identified in the control of stature (weight and height) by several independent GWAS studies in cattle [40, 41]. The study by Karim et al. [40] fine-mapped 14 SNPs associated with stature. The 14 SNPs are in the vicinity of PLAG1 and the Coiled- coil-helix-coiled-coil-helix domain containing 7 (CHCHD7)gene, which are 540 bp apart (Fig. 6A). The 14 candidate SNPs are showninFig. 6A with coordinate locations relative to HPRS- predicted regulatory regions. The HPRS database suggests a strategy for further filtering these fine-mapped SNPs in 2 ways, first to prioritize gene targets and second to prioritize SNPs. The design of the validation experiment by Karim et al. [40]did not separate the 2 SNPs (rs209821678 and rs210030313) in the pro- moter region because both the long and short fragments used for activity assays in the study contained both SNPs. The HPRS prediction separates the 2 SNPs into 2 core CAGE peaks (Fig. 6B). The 2 peaks suggest 2 potentially separate binding sites of the transcriptional machinery. HPRS resolves the shared 540-bp pro- moter region into separate core promoter regions and suggests a new validation design in which 3 short, directional fragments fo- Figure 5: Enrichment of significant pleiotropic SNPs in regulatory genomic re- cusing more specifically on core CAGE regions (2 near PLAG1 and gions. Count of significant pleiotropic GWAS SNPs ( P-values are from multi-trait 1 near CHCHD7) can be used for functional assays of SNP geno- meta-analysis chi-square test statistic for 32 traits) [37]inaset of ∼729 100 SNPs type. Measuring promoter activity of these 3 constructs by using genotyped using the Illumina HD Bovine SNP chip or imputed from genotyping the similar promoter luciferase assay and transcription factor data from smaller Illumina Bovine SNP chips. Legend labels, from top to bottom: binding assay employed by Karim et al. [40] may confirm which “AllHDchip”: 43 130 SNPs randomly selected (from all 692 529 SNPs on the HD chip, excluding those from chromosome X); “100kbUpstream”: 43 130 SNPs ran- of the 2 SNPs is causative and which gene is affected. domly selected (from 325 227 SNPs within 100 kb of upstream regions of coding Furthermore, by applying a scoring model for regulatory vari- genes); “5kbUpstream”: all 30 384 SNPs within the 5-kb upstream regions of cod- ants, we generated deltaSVM score for each of 97 million known ing genes (results scaled to 43k SNPs); “Genes”: 43 130 SNPs randomly selected bovine SNPs (Supplementary Materials and Methods). The SNP (from 240 160 SNPs in coding genes); “Exons”: all 10 003 SNPs in exons of cod- rs209821678 had a deltaSVM score of –5.99. The score was be- ing genes (results scaled to 43k SNPs); “HPRS regions”: 43 130 SNPs in regulatory yond the 95th percentile range of SVM scores for 97 million regions. SNPs, suggesting that it may play an important regulatory role. Figure 6: Application of the regulatory database to prioritize significant bovine SNPs identified by GWAS studies for functional validation . Overview of 13 significant SNPs fine-mapped by Karim et al. [ 40] is shown in the left panel. Among those SNPs, only 3 overlap regulatory regions and promoter regions in the predicted database. The right panel is a detailed view of the 2 SNPs validated as causative in Karim et al. [40]. Both SNPs are within promoter regions of the PLAG1 gene, but not the CHCHD7 gene. The regulatory (enhancers, promoters, and transcription factor binding sites) and promoter (only promoters) tracks display HPRS-predicted regions. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 12 Nguyen et al. Figure 7: A potential model for effect of the Celtic mutation. Using human Hi-C (chromosome conformation capture) data and scanning of transcription factor binding sites, we generated a hypothesis to predict cattle regulatory targets for polled mutation (HiC target, HiC anchor, and HAND1 tracks). The regulatory (enhancers, promoters, and transcription factor binding sites) tracks display the HPRS-predicted region. Two common mutations on chromosome 1 in cattle have been associated with polled cattle. One is a 202-bp indel (“Celtic mutation”). The other is an 80-kb duplication ∼300 kb away. Purple arrows on the top link the Hi-C anchor to multiple targets mapped from the human genome to the cattle genome. Bottom of figure is a map with locations of the regulatory regions and shows potential effects of two HAND1 sites on the expression of the lincRNA2. Notably, the rs209821678 deletion of the (CCG)x11 to (CCG)x9 is a 202-bp indel, where the duplication of a 212-bp region trinucleotide repeats lies in a predicted G-quadruplex and may (chr1: 1 705 834–1 706 045) replaces the 10 bp (chr1: 1 706 051– cause changes in its structure, an event that could alter tran- 1 706 060) (Fig. 7)[46–48]. The mechanism for the Celtic mu- scriptional activity [42]. In contrast, the SNPs rs210030313 and tation is unknown, although it may affect the expression of rs109815800 did not have significant deltaSVM scores (0.51 and OLIGO1, OLIGO2, CH1H21orf62, and 2 long noncoding RNAs (lin- 3.2, respectively). cRNA1 and lincRNA2)[46, 47]. We found that the whole 10-base We then asked if the regions containing the SNPs interact deletion, but not the upstream 212-base duplication, is within an with additional genes distant from the PLAG1 locus. We applied HPRS-predicted enhancer sequence (chr1: 1 706 046–1 706 182, HPRS for mapping interactions defined by chromatin conforma- UMD3.1). A detailed transcription factor binding motif analysis tion capture data (5C and Hi-C in the ENCODE human datasets) of the polled mutation site suggests that a binding site for the to predict distal targets of the promoter regions in the PLAG1 Heart and neural crest derivatives expressed 1 (TF HAND1)is locus [43, 44], and we found that rs209821678 and rs210030313 lost due to the 10-bp deletion in animals containing the Celtic are within the anchor A 447 043 (chr14: 25 044 319–25 054 287, mutation (Fig. 7C). The neural crest cells give rise to the cran- UMD3.1), with a predicted target region (chr14: 25 478 861–25 iofacial cartilage and bone [49], suggesting that the loss of the 497 096) near the Inositol monophosphatase domain contain- HAND1 putative binding site is a plausible explanation for the ing1(IMPAD1). Variants within IMPAD1 have been implicated altered craniofacial development in polled animals. Addition- in short stature and chondrodysplasia (Table S2). Interestingly, ally, using information from Hi-C in the human genome [44], we the leading SNP identified in an analysis of pleiotropic genes found the mutation is within a mapped interaction target of the affecting carcase traits in Nellore cattle, rs136543212 at chr14: regions Hi-C A 264 635 (chr1: 1 706 078–1 714 122, UMD3.1) and 25 502 915, is slightly closer to IMPAD1 [45]. The rs109815800 A 264 636 (chr1:1 698 252–1 706 077, UMD3.1) and interacts with SNP, on the other hand, does not lie in any mapped Hi-C region. genes hundreds of Kb away (Fig. 7, bottom panel; Table S2). Al- Together, the HPRS-predicted results strongly suggest that the though the above hypothesis requires experimental validation, rs209821678 variant is the causative SNP among the 14 candi- it shows that applying the HPRS approach could lead to a biolog- dates fine-mapped by Karim et al. [ 40]. ical hypothesis for the underlying effects of causative mutations Another example of applying the HPRS databases for anal- within noncoding regions. ysis of noncoding mutations is for the case of the “Celtic Therefore, from the 2 examples described above (and from mutation,” which causes the polled phenotype. The mutation the Callipyge example described in the Supplementary Data), Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 13 we found that the HPRS regulatory database can be used to pri- from the FAANG consortium, when they become available, with oritize SNPs and genetic variants that were identified by GWAS complementary data from human research. The immediate studies and to draw hypotheses about biological mechanisms of application of the regulatory database is to complement the cur- a causative SNP. rent species-specific GWAS analysis by (1) discovery of poten- tial regulatory mechanisms of SNPs lying outside gene coding regions, (2) prioritizing SNPs that are statistically significant at Limitations of the methods a genome-wide level but located within regulatory regions, (3) The main aim of the HPRS pipeline is to predict as many prioritizing SNPs that are at low allele frequency but have po- regulatory regions and as accurately as possible, so that the tential for large effects, and (4) suggesting possible causative dataset could be applied for functional SNP analysis in the target SNPs as targets for precise genome editing or selective breeding species. However, given the uncertain nature of promoter and practices. enhancer identification, the rate of false positives and negatives by HPRS is difficult to determine. In our analysis, all of the ref- Methods erence cattle liver enhancers were included in the initial unfil- tered datasets, although ∼25% were lost during the filtering pro- The complete HPRS pipeline is divided into 3 modules: mapping, cess. Similarly, 96% of reference cattle liver dataset promoters filtering, and SNP analysis. The whole pipeline and documenta- were covered by the unfiltered dataset, with less than 3% lost tion are available from the CSIRO BitBucket [50]. in the filtering process. A limitation of the HPRS filtering pro- cess is the requirement to use a species-specific dataset. Nev- HPRS mapping pipeline ertheless, compared with the large number of datasets and bio- chemical assay types that are required to create comprehensive We developed a mapping strategy based on 4 elements: (1) se- coverage of regulatory regions, the number of species-specific lecting a suitable combination of human databases as HPRS datasets needed for HPRS is small. In this paper, for each of inputs; (2) finding an optimal sequence identity threshold in the 3 species (mouse, cattle, and pig), we used data from only the target genome; (3) finding options to remove less confi- 3 biological replicates of the H3K27Ac assay, which was gener- dent mapped results; and (4) adding multiple mapped regions ated within a scope of 1 project, as reported in Villar et al. [22], that meet a high sequence similarity threshold. Depending on to successfully filter the Universal Dataset. In addition, the ap- the species, targeted tissues, or regulatory categories of in- proach cannot predict promoters and enhancers that are unique terest, users can select suitable human databases using the to the species, e.g., promoters and enhancers that are present following suggested criteria: types of regulatory regions (pro- in the cow but not present in humans. These unique promot- moters, enhancers, and TFBS), biochemical assays, computa- ers/enhancers are likely to be a small proportion of the total tional models for combining data, and data sources (tissues, cell promoter/enhancer set. Indeed, the lineage-specific promoters lines, traits). Second, by applying the UCSC liftOver tool [23], and enhancers across 20 mammalian species were around 5% regions that were aligned at the genome scale (by LastZ pair- of the total promoters and enhancers [22]. Of note, relevant hu- wise genome alignment [52]) were fine-mapped to identify tar- man input datasets can be integrated depending on the aim of get regions with proportion of sequence identity to the original an analysis. For example, if the focus is to study milk produc- regions (minMatch) higher than a selected cutoff. We recom- tion, the HPRS pipeline can be applied for more relevant tissues, mend an optimal minMatch of 0.20 and not allowing multiple such as the mammary gland. Future cattle-specific datasets can mapping for this step. Users can vary input parameters (min- be incorporated into the HPRS pipeline to address the tissue and MatchMain and minMatchMulti) in the HPRS mapping script species specificity issues. (Main Mapping Pipeline.py) to optimize the minMatch suitable In contrast to the HPRS pipeline prediction of regulatory re- to specific datasets that may have different features such as gions, the prediction of causative genetic variation within regu- sequence length and conservation. Third, mapped regions re- latory regions is much more challenging. The current approach sulting from using a low minMatch cutoff (0.20) were filtered relies on the enrichment of sequence motifs within regulatory to retain only regions with exact reciprocal mapping back to regions relative to nonregulatory regions. At least some of the the human genome, with the condition that both the left and motifs are TFBS, but there are likely to be other types of motifs, right borders of the reciprocally mapped regions were within such as G-quadraplexes, present in regulatory regions. While 25-bp windows of the original regions. Fourth, to accommo- the predicted datasets can be useful for generating relevant hy- date regions possibly resulting from duplication events, the potheses, the identification of causal variants still requires con- HPRS mapping pipeline added a step to remap regions that are siderable future refinement and validation. unmapped or are not reciprocally mapped by allowing mul- tiple mapped results to be included while setting a high se- quence similarity threshold (specified by the minMatchMulti pa- Conclusions rameter ≥0.80). Fig. S1A shows some of the expected mapping We have developed the HPRS pipeline using a large collection of scenarios. existing human genomics data and a limited number of cattle- In addition to the customized minMatchMain and min- specific datasets to predict a database of cattle regulatory re- MatchMulti parameter inputs, the Main Mapping Pipeline.py gions that covers a large number of active promoters, enhancers, script also takes user-specified chain files for target species, and TFBS. The database generated here is not a final product be- which can be any of the mammalian species with chain files cause HPRS is capable of readily integrating new cattle-specific available from the UCSC databases or generated in-house. datasets into its mapping and filtering pipeline to expand, re- The HPRS mapping pipeline enables fast mapping of as many fine, and validate the databases. Moreover, the HPRS pipeline databases as necessary. The script PostHPRSMapping Merge can be applied to data of other mammalian species and by DifferentDatabaseTypes.py (available in the CSIRO BitBucket scientists without computer programming skills. We anticipate [50]) can be used to combine resulting datasets into 1 that the pipeline will be used to integrate large-scale datasets dataset containing nonoverlapping regions. For example, we Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 14 Nguyen et al. merged enhancer databases from 88 ROADMAP tissues/primary Data availability cell lines and 5 additional promoter, enhancer, and TFBS We have made all HPRS Python and R scripts publically avail- databases. The script also collapses names of overlapping able with usage instruction from the CSIRO BitBucket [50]. These regions into a comma-separated field that can be used to codes can be used to perform all steps from mapping to filtering count the total number of annotations for each merged and scoring regulatory SNPs. Supporting data are also available region. via the GigaScience database, GigaDB [51]. All human databases used for prediction are publically avail- HPRS filtering pipeline able (Table S5). Results of predicted regulatory regions, includ- ing the Universal Datasets and the Filtered Datasets for cattle Detailed description of the 7 filters is presented in the Supple- and pig, are available in the Supplementary Data for this arti- mentary Materials and Methods section. Briefly, the HPRS fil- cle. For cattle, we provide deltaSVM scores for ∼97 million SNPs, tering pipeline was written in R and contains 7 filtering steps which can be used as 1 of the parameters for assessing po- (Fig. 4,Table 4). The input file is a merged metadata file, in tential SNP effects. Additionally, we share predicted Universal which each region was calculated for the number of CAGE Datasets (not yet filtered) for 10 other mammalian species in a peaks mapped, the RNA-Seq signal from 86 cattle RNA-Seq format compatible for uploading to the UCSC genome browser datasets, the Villar H3K27Ac signal, the SVM enhancer scores (Table 5; Fig. S7). These 10 additional datasets can be useful for (enhancer activity predicted by a machine learning classifica- exploring potential regulatory effects from noncoding genomic tion method, gkmSVM) [53], the number of overlapping anno- regions. tations, the conservation score based on the UCSC 100-way vertebrate alignment [54], and the number of TFBS based on Cluster-Buster scanning [55]. The main filtering pipeline was HPRS Filtering pipeline.Rmd. We tested a range of parameters Additional files and recommend using the parameters set in the script. In addition, prior to running this main script, users can choose Additional file 1: Figure S1: Assessing mapping results for dif- to optimize parameters suitable to specific datasets using ferent human tissues and cell line datasets onto other species. the script HPRS Filtering optimize FilterOrder.Rmd, which cal- The counts and percentages are for mapped regions before the culates Ratio and Ratio (average number of enhancers and pro- HPRS filtering pipeline. A, Possible cross-species mapping re- P E moters per Mb of the total length of all predicted enhancers and sults, with 3 scenarios: (i) reciprocal mapping with a low iden- promoters) for each filter and for a range of filter parameters tity threshold (minMatch = 0.20) but that requires exact back so that the optimal parameters are used in the main filtering mapping to the human genome; (ii) multiple mapping allowing pipeline. The filtering pipeline was written in a such a way that multiple targets, but requiring a stringent minMatch = 0.80; and it is simple to add or remove filter layers depending on availabil- (iii) features that do not fall into the 2 categories above, such as ity of species-specific data. species-specific enhancers, are not included in the prediction. B, Coverage of the cattle Villar enhancer reference dataset by predicted and random feature datasets. Features were mapped Methods to apply HPRS dataset for regulatory SNP from 42 human enhancer datasets or 42 random datasets (equal analysis number of regions and region length distribution) to the bovine The HPRS dataset can be applied for the selection of top can- genome and then compared for percent overlap with the cat- didate SNPs in regulatory regions, which are present in exist- tle Villar reference liver enhancer dataset. For all 42 datasets, the regulatory datasets produced 5–10 times higher coverage of ing genotyping SNP chips. The selected SNPs form a small set of SNPs that are more likely to be causal or associated to phe- the reference bovine liver enhancers than the random datasets. C, Coverage of predictions by 12 common ENCODE human cell notypes. Using these SNPs for GWAS analysis may reduce noise compared with using a large number of SNPs that are noncausal lines (x-axis). The numbers of recovered regions using minMatch = 0.2 and minMatch = 0.95 are shown. D, Percentage of addi- but in high linkage disequilibrium to causal SNPs. The top can- didate SNPs can be selected by the identification of SNPs be- tional features that had multiple mapped targets but met the high sequence similarity threshold for reciprocal mapping from longing or not belonging to the following categories: the Uni- versal Dataset, the Filtered Dataset, the TFBS of the predicted the bovine genome to the human genome (minMatch = 0.80). Additional file 2: Figure S2: Enrichment of GWAS-associated regulatory regions, and regulatory regions active in tissues re- lated to the trait of interest. In addition, deltaSVM scores can SNPs in HPRS filtered regulatory regions. Data are from a GWAS dataset for 2112 cattle measured for 10 different climatic change be used as 1 of the indicators for potential SNP effects, as dis- cussed in the Supplementary Methods section. Alternatively, the adaptation–related phenotypes [6]. A, Number of GWAS signifi- dataset can be used for post-GWAS analysis, in which signifi- cant SNPs, with –log(P-values) ≥7inany of the10separatephe- -7 notypes. GWAS P-values for each trait ≤10 are considered sig- cant SNPs in noncoding regions that are identified from GWAS can be assessed for potential effect on gene regulatory activ- nificant with multiple test correction (with Bonferroni-corrected P ≤0.05, and the number of SNP is ∼500 000 SNPs). SNPs were se- ity. We have discussed examples of applications for the cases of pleiotropic SNPs, climatic adaptation–associated SNPs, and lected for each of the 10 phenotypes separately and were pooled into 1 set, and density plots of SNP counts and corresponding – associated SNPs’ milk-production traits (Fig. S1, Table S1), and of post-GWAS analysis for the stature phenotype and Callipyge log(P-values) for each phenotype are shown. The x-axis shows – log(P-values) values from 0 to 20, and the y-axis shows density of phenotype (Fig. 6;TablesS2and S3). We developed an implementation pipeline of the gkm-SVM the SNP counts according to the –logP distribution. B, Significant SNPS were selected based on combined criteria: –log(P-values) model to estimate SNP effects on enhancer activities in cattle by adapting the model to the case where very limited species- >2 and abs(effect size) greater than or equal to the third quartile effect size value for each of the 10 phenotypes. The x-axis shows specific ChIP-Seq data are available for model training (Supple- mentary Materials and Methods). the name IDs of the 10 phenotypes. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 15 Additional file 3: Figure S3: Promoter prediction. A, We se- in cattle. Single nucleotide resolution scores within the ALDOB lected a random large region of the chromosome to evaluate enhancer are shown. Negative scores indicate loss of function promoter prediction. We observed consistent overlapping of pre- (or TF binding), while positive scores indicate increases in ac- dicted promoters with known transcription start sites (TSS). The tivities. Computational predictions of transcription factor bind- higher and denser numbers of predicted promoters compared ing sites (by FIMO [29] and JASPAR position weight matrices) are with annotated TSS suggest that the HPRS prediction poten- shown in the lower panels. Transcription factor IDs and SNP IDs tially led to the identification of unannotated promoters, in- are shown next to the predicted regions. The ALDOB enhancer cluding alternative promoters within annotated transcripts and was mapped from humans to cattle. Vertical dashed lines show promoters of unannotated transcripts such as those for long the locations of the deltaSVM peaks, where SNPs most likely re- noncoding RNAs. B, HPRS promoters also predict bidirectional duce the enhancer activity, compared with the locations of pre- promoters with high accuracy (- for antisense, + for sense). C, dicted TFBS. The deltaSVM score prediction was consistent with HPRS-predicted alternative promoters are supported by cattle luciferase activity measurement (in humans) and prediction of expression sequencing tag (EST) data. The predicted promoters TFBS (in humans and cattle). overlap the start sites of EST transcripts within the full-length Additional file 7: Figure S7: An example of a simple view of ZC3H14 gene. the datasets generated for 10 mammalian species. The exam- Additional file 4: Figure S4: Enrichment of TFBS within en- ple is from the dog (canFam3) genome. Predicted regulatory re- hancers and promoters. The promoters and enhancers were gions are shown in blue, with annotations (enhancer, promoter, mapped from the human FANTOM enhancer [1]and thehu- and transcription factor IDs) marked on the left. For regions with man FANTOM promoter databases onto the bovine genome [2]. multiple annotations, users can display the annotations by se- Mapped regions were compared with the Villar bovine enhancer lecting the region on the browser. The example shows the ENPP1 and promoter reference datasets for liver tissues [7]. Three cat- gene. egories of overlapping to the reference datasets were compared Additional file 8: Table S1: Enrichment of GWAS SNPs in com- (x-axis): (i) mapped regions overlapping the Villar reference mon phenotypes measured for dairy cattle. dataset (LOinLiver); (ii) mapped regions not in the Villar dataset Additional file 9: Table S2: Hi-C targets and gkm-SVM predict (LOnotinLiver); and (iii) regions in reference datasets not covered causative SNPs and gene targets. by mapped regions (LiverNotinLO). The TFBS were derived from Additional file 10: Table S3: Predicted Hi-C interaction regions whole–bovine genome scanning using the Cluster Buster pro- at the putative Callipyge locus (chr21:67339968:67340027). gram [13] and 3 major transcription factor position weight ma- Additional file 11: Table S4: Regulatory datasets used for op- trix databases (TRANSFAC, JASPAR, and ENCODE) [21–23]. timizing mapping parameters. Additional file 5: Figure S5: Tissue specificity of predicted reg- Additional file 12: Table S5: Data sources for publically avail- ulatory regions. A, Counts of HPRS-predicted enhancers (using able human datasets used as inputs for the HPRA pipeline in this 88 ROADMAP human enhancer datasets) that overlap with the paper. Villar cattle reference enhancers. B, We then defined a tissue- specific enhancer dataset by identifying HPRS regions that over- lap with Villar reference enhancers for cattle and are unique to Abbreviations each of the 88 tissues. The datasets that yielded the highest over- bp: base pair; CAGE: Cap Analysis of Gene Expression; CDS: cod- lap are those from the liver cell line (liver hepatocellular cells - ing sequence; ENCODE: Encyclopedia of DNA Elements; FAN- HepG2) and the human liver tissue. C, We mapped 101 RNA se- TOM: Functional Annotation of the Mammalian genome; FD: quencing datasets, collected from more than 79 tissues (Table Filtered Dataset; Gb: Giga base; GWAS: Genome-Wide Associa- S5), to the predicted regulatory regions. The mapped RNA signal tion Studies; HPRS: Human Projection of Regulatory Regions; kb: was used to compare the similarity between different tissues. kilo base; Mb: Mega base pair; RDEs: Regulatory DNA Elements; Strong enrichment of brain, muscle, and liver tissues was ob- ROADMAP: Roadmap Epigenomics Mapping Consortium; SNP: served. Single Nucleotide Polymorphism; SRA: Sequence Read Archive; Additional file 6: Figure S6: Large-scale gapped k-mer sup- SVM: Support Vector Machine; TE: Transposable Elements; TFs: port vector machine (LS-gkm-SVM) scores for enhancers and Transcription Factors; UCSC: University of California Santa Cruz; deltaSVM scores for SNPs. A, The LS-gkm-SVM model was used ENSEMBL: Ensemnbl database; TSS: Transcription Start Sites; to calculate the gkm-SVM scores for all enhancers in the Vil- TFBS: Transcription Binding Sites; PWM: Position Weight Matrix; lar dataset. Red, enhancers scored on “enhancers versus back- UD: Universal Dataset. ground matrix”; green, random regions (selected by shuffling through the genomes to sample genomic regions of the same length as the Villar reference bovine enhancers) scored on “en- Competing interests hancers versus background matrix”; blue, enhancers scored on a “background versus background” matrix. The positive The authors declare that they have no competing interests. background was selected from the Villar reference enhancer dataset, as described in the Supplementary Materials and Meth- ods section. Training datasets using human (HHb) and cat- Funding tle (BBb) and liftOver enhancer regions from human to cattle Q.N. and M.N.S. were supported by CSIRO OCE Postdoctoral Fel- (LOBHb) yielded consistent and comparable results, which pre- lowships. dicted higher scores for enhancer regions (BBb Enh, HHb Enh, LOBHb Enh) than the predictions for promoter (pmtr Enh) and random regions (BBb Neg, HHb Neg, LOBHb Neg, and Pmtr neg). Author contributions B, deltaSVM for scoring SNP effects on enhancer activity. The LS- gkm-SVM model was used to score every possible SNP across the B.D., R.T., J.K., B.B., and Q.N. conceived the project. Q.N. and enhancer of the aldolase B fructose bisphosphate (ALDOB)gene B.D. designed the HPRS algorithm. Q.N. wrote the pipeline. Q.N., Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 16 Nguyen et al. M.N.S., L.P.N., A.R., and B.H. contributed to the analysis. Q.N. and 15. Lelli KM, Slattery M, Mann RS. Disentangling the many lay- B.D. wrote the manuscript, with input from all other co-authors. ers of eukaryotic transcriptional regulation. Annu Rev Genet 2012;46(1):43–68. 16. Spitz F, Furlong EEM. Transcription factors: from en- Acknowledgements hancer binding to developmental control. Nat Rev Genet 2012;13(9):613–26. We thank Dr Tony Vuocolo (CSIRO) for insightful discussions, Dr 17. Boyle AP, Araya CL, Brdlik C et al. Comparative analysis of Li Congjun (ARS, USDA) for sharing the update of a published regulatory information and circuits across distant species. ChIP-Seq dataset, and Dr Derek Bickhart (ARS, USDA) for provid- Nature 2014;512(7515):453–6. ing us with the updated TFBS dataset. We thank the generators 18. Ho JWK, Jung YL, Liu T et al. Comparative analysis of meta- of the human genomics, transcriptomics, and epigenomics re- zoan chromatin organization. Nature 2014;512(7515):449–52. sources, which are highly valuable to the broader research com- 19. Cheng Y, Ma Z, Kim BH et al. Principles of regulatory in- munity. formation conservation between mouse and human. Nature 2014;515(7527):371–5. 20. The Fantom Consortium, Riken PMI, and CLST, Forrest AR, References Kawaji H et al. A promoter-level mammalian expression at- 1. The ENCODE Project Consortium. An integrated encyclo- las. Nature 2014;507(7493):462–70. pedia of DNA elements in the human genome. Nature 21. Roadmap Epigenomics Consortium, Meuleman W, Ernst 2012;489(7414):57–74. J et al. Integrative analysis of 111 reference human 2. Yue F, Cheng Y, Breschi A et al. A comparative ency- epigenomes. Nature 2015;518(7539):317–30. clopedia of DNA elements in the mouse genome. Nature 22. Villar D, Berthelot C, Aldridge S et al. Enhancer evolution 2014;515(7527):355–64. across 20 mammalian species. Cell 2015;160(3):554–66. 3. Ward LD, Kellis M. HaploReg v4: systematic mining of puta- 23. Kent WJ, Sugnet CW, Furey TS et al. The Human Genome tive causal variants, cell types, regulators and target genes Browser at UCSC. Genome Res 2002;12(6):996–1006. for human complex traits and disease. Nucleic Acids Res 24. Kleftogiannis D, Kalnis P, Bajic VB. Progress and challenges 2016;44(D1):D877–81. in bioinformatics approaches for enhancer identification. 4. Farh KK-H, Marson A, Zhu J et al. Genetic and epigenetic Brief Bioinformatics 2016;17(6):967–79. fine mapping of causal autoimmune disease variants. Na- 25. Lizio M, Harshbarger J, Shimoji H et al. Gateways to the FAN- ture 2015;518(7539):337–43. TOM5 promoter level mammalian expression atlas. Genome 5. Li MJ, Liu Z, Wang P et al. GWASdb v2: an update database Biol 2016;16(22). doi: 10.1186/s13059-014-0560-6. for human genetic variants identified by genome-wide asso- 26. Sloan CA, Chan ET, Davidson JM et al. ENCODE data ciation studies. Nucleic Acids Res 2016;44(D1):D869–76. at the ENCODE portal. Nucleic Acids Res 2016;44(D1): 6. Andersson R, Gebhard C, Miguel-Escalada I et al. An atlas of D726–32. active enhancers across human cell types and tissues. Na- 27. Takahashi H, Lassmann T, Murata M et al. 5’ end–centered ture 2014;507(7493):455–61. expression profiling using cap-analysis gene expression 7. Corradin O, Cohen AJ, Luppino JM et al. Modeling disease risk and next-generation sequencing. Nat Protoc 2012;7(3): through analysis of physical interactions between genetic 542–61. variants within chromatin regulatory circuitry. Nat Genet 28. Strausberg RL, Levy S. Promoting transcriptome diversity. 2016;48(11):1313–20. Genome Res 2007;17(7):965–8. 8. MacLeod IM, Bowman PJ, Vander Jagt CJ et al. Exploiting bi- 29. Carninci P, Sandelin A, Lenhard B et al. Genome-wide analy- ological priors and sequence variants enhances QTL discov- sis of mammalian promoter architecture and evolution. Nat ery and genomic prediction of complex traits. BMC Genomics Genet 2006;38(6):626–35. 2017;17(144). doi: 10.1186/s12864-016-2443-6. 30. Gerstein MB, Kundaje A, Hariharan M et al. Architecture of 9. Fragomeni BO, Lourenco DAL, Masuda Y et al. Incorpora- the human regulatory network derived from ENCODE data. tion of causative quantitative trait nucleotides in single-step Nature 2012;489(7414):91–100. GBLUP. Genet Sel Evol 2017;49(134):463–71. 31. Hnisz D, Day DS, Young RA. Insulated neighborhoods: struc- 10. Wang M, Hancock TP, MacLeod IM et al. Putative enhancer tural and functional units of mammalian gene control. Cell sites in the bovine genome are enriched with variants 167(5):1188–200. affecting complex traits. Genet Sel Evol 2017;49(56). doi: 32. Moritz LE, Trievel RC. Structure, mechanism, and regula- 10.1186/s12711-017-0331-4. tion of polycomb repressive complex 2. J Biol Chem 2017; 11. Fang L, Sahana G, Ma P et al. Use of biological priors en- doi:10.1074/jbc.R117.800367. hances understanding of genetic architecture and genomic 33. Narlikar L, Ovcharenko I. Identifying regulatory elements prediction of complex traits within and between dairy cattle in eukaryotic genomes. Brief Funct Genomics Proteomics breeds. BMC Genomics 2017;18(1):604. 2009;8(4):215–30. 12. Andersson L, Archibald AL, Bottema CD et al. Coordinated 34. Wang J, Zhuang J, Iyer S et al. Sequence features and chro- international action to accelerate genome-to-phenome with matin structure around the genomic regions bound by 119 FAANG, the Functional Annotation of Animal Genomes human transcription factors. Genome Res 2012;22(9):1798– project. Genome Biol 2015;16(1):57. 13. Tuggle CK, Giuffra E, White SN et al. GO-FAANG meeting: 35. Elsik CG, Unni DR, Diesh CM et al. Bovine Genome Database: a gathering on functional annotation of animal genomes. new tools for gleaning function from the Bos taurus genome. Anim Genet 2016;47(5):528–33. Nucleic Acids Res 2016;44(D1):D834–9. 14. Shlyueva D, Stampfel G, Stark A. Transcriptional enhancers: 36. Groenen MAM, Archibald AL, Uenishi H et al. Analyses of pig from properties to genome-wide predictions. Nat Rev Genet genomes provide insight into porcine demography and evo- 2014;15(4):272–86. lution. Nature 2012;491(7424):393–8. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Predicting mammalian genomic regulatory regions 17 37. Bolormaa S, Pryce JE, Reverter A et al. A multi-trait, 50. Nguyen et al.. The Commonwealth Scientific and In- meta-analysis for detecting pleiotropic polymorphisms for dustrial Research Organisation (CSIRO). HPRS pipeline stature, fatness and reproduction in beef cattle. PLoS Genet CSIRO BitBucket. 2018. https://bitbucket.csiro.au/ 2014;10(3):e1004198. users/ngu121/repos/hprs/browse (10 February; 2018, date 38. Porto-Neto LR, Reverter A, Prayaga KC et al. The genetic ar- last accessed). chitecture of climatic adaptation of tropical cattle. PLoS One 51. Nguyen QH, Tellam RL, Naval-Sanchez M. et al. Support- 2014;9(11):e113284. ing data for “Mammalian genomic regulatory regions pre- 39. Raven L-A, Cocks BG, Hayes BJ. Multibreed genome wide as- dicted by utilizing human genomics, transcriptomics and sociation can improve precision of mapping causative vari- epigenetics data.” GigaScience Database 2017. http://dx.doi. ants underlying milk production in dairy cattle. BMC Ge- org/10.5524/100390. nomics 2014;15(1):62–62. 52. Harris RS. Improved pairwise alignment of genomic DNA. 40. Karim L, Takeda H, Lin L et al. Variants modulating PhD Thesis. College of Engineering, Pennsylvania, USA: the expression of a chromosome domain encompass- Pennsylvania State University; 2007. ing PLAG1 influence bovine stature. Nat Genet 2011; 43(5): 53. Lee D, Gorkin DU, Baker M et al. A method to predict the im- 405–13. pact of regulatory variants from DNA sequence. Nat Genet 41. Takasuga A. PLAG1 and NCAPG-LCORL in livestock. Anim Sci 2015;47(8):955–61. J 2016;87(2):159–67. 54. Siepel A, Bejerano G, Pedersen JS et al. Evolutionarily con- 42. Hansel-Hertsch R, Beraldi D, Lensing SV et al. G-quadruplex served elements in vertebrate, insect, worm, and yeast structures mark human regulatory chromatin. Nat Genet genomes. Genome Res 2005;15(8):1034–50. 2016;48(10):1267–72. 55. Frith MC, Li MC, Weng Z. Cluster-Buster: finding dense 43. de Wit E, de Laat W. A decade of 3C technologies: insights clusters of motifs in DNA sequences. Nucleic Acids Res into nuclear organization. Genes Dev 2012;26(1):11–24. 2003;31(13):3666–8. 44. Jin F, Li Y, Dixon JR et al. A high-resolution map of the three- 56. Zerbino D, Wilder SP, Johnson N et al. The Ensembl Regula- dimensional chromatin interactome in human cells. Nature tory Build. Genome Biol 2015;16(1):56. 2013;503(7475):290–4. 57. Visel A, Minovitsky S, Dubchak I et al. VISTA Enhancer 45. Pereira AGA, Utsunomiya YT, Milanesi M et al. Pleiotropic Browser—a database of tissue-specific human enhancers. genes affecting carcass traits in Bos indicus (Nellore) cattle Nucleic Acids Res 2007;35(Database):D88–92. are modulators of growth. PLoS One 2016;11(7):e0158165. 58. Schwartz S, Kent WJ, Smit A et al. Human-mouse alignments 46. Allais-Bonnet A, Grohs C, Medugorac I et al. Novel insights with BLASTZ. Genome Res 2003;13(1):103–7. into the bovine polled phenotype and horn ontogenesis in 59. Hinrichs AS, Karolchik D, Baertsch R et al. The UCSC bovidae. PLoS One 2013;8(5):e63512. Genome Browser Database: update 2006. Nucleic Acids Res 47. Wiedemar N, Tetens J, Jagannathan V et al. Independent 2006;34(90001):D590–8. polled mutations leading to complex gene expression differ- 60. Kent WJ, Baertsch R, Hinrichs A et al. Evolution’s caul- ences in cattle. PLoS One 2014;9(3):e93435. dron: duplication, deletion, and rearrangement in the 48. Carlson DF, Lancto CA, Zang B et al. Production of hornless mouse and human genomes. Proc Natl Acad Sci U S A dairy cattle from genome-edited cell lines. Nat Biotechnol 2003;100(20):11484–9. 2016;34(5):479–81. 61. Hindrichs et al. UCSC Table Browser. Santa Cruz: Uni- 49. Santagati F, Rijli FM. Cranial neural crest and the building of versity of California 2006. http://hgdownload.cse.ucsc.edu/ the vertebrate head. Nat Rev Neurosci 2003;4(10):806–18. downloads.html (25 August 2016, date last accessed). Downloaded from https://academic.oup.com/gigascience/article-abstract/7/3/1/4867972 by Ed 'DeepDyve' Gillespie user on 16 March 2018

Journal

GigaScienceOxford University Press

Published: Mar 1, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off