Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Using hexamers to predict cis-regulatory motifs in Drosophila

Using hexamers to predict cis-regulatory motifs in Drosophila Background: Cis-regulatory modules (CRMs) are short stretches of DNA that help regulate gene expression in higher eukaryotes. They have been found up to 1 megabase away from the genes they regulate and can be located upstream, downstream, and even within their target genes. Due to the difficulty of finding CRMs using biological and computational techniques, even well-studied regulatory systems may contain CRMs that have not yet been discovered. Results: We present a simple, efficient method (HexDiff) based only on hexamer frequencies of known CRMs and non-CRM sequence to predict novel CRMs in regulatory systems. On a data set of 16 gap and pair-rule genes containing 52 known CRMs, predictions made by HexDiff had a higher correlation with the known CRMs than several existing CRM prediction algorithms: Ahab, Cluster Buster, MSCAN, MCAST, and LWF. After combining the results of the different algorithms, 10 putative CRMs were identified and are strong candidates for future study. The hexamers used by HexDiff to distinguish between CRMs and non-CRM sequence were also analyzed and were shown to be enriched in regulatory elements. Conclusion: HexDiff provides an efficient and effective means for finding new CRMs based on known CRMs, rather than known binding sites. Background of CRMs – promoters and enhancers. Promoters are The development of eukaryotic organisms is tightly regu- located immediately upstream of a gene's transcriptional lated by a variety of mechanisms. The initial step of regu- start site and often contain a variety of sequence signals lation is carried out by transcription factors interacting such as the TATA box, CCAAT box, and different TFBS. with cis-regulatory sequences, also known as transcription These characteristics have been used in approaches for factor binding sites (TFBS). In eukaryotes, multiple TFBS finding promoters [2]. In contrast, enhancers do not share are often clustered together into cis-regulatory modules these signals and operate in a manner that is relatively (CRMs). The TFBS can be thought of as inputs into an independent of orientation or distance from their target information processing element, with the output being gene [3]. In fact, one enhancer, Dct, has been found the level of expression of the gene controlled by the CRM almost a megabase away from Sox9, the gene it regulates [1]. [4]. Because of the lack of common signals and because the search for enhancers cannot be limited to the few hun- One of the major challenges for understanding eukaryotic dred base pairs upstream of the transcriptional start site, gene regulation is finding CRMs. There are two main types finding enhancers is a more difficult problem. Page 1 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 Table 1: Key aspects of HexDiff and other algorithms. The table shows the knowledge used and the parameters required by the different algorithms. Algorithm Knowledge Used Parameters HexDiff CRM Locations Number of hexamers in H Window size Window score threshold Ahab PWMs Window size Free energy cutoff Order of background model Cluster Buster PWMs Motif score threshold Gap parameter Cluster score threshold Residue abundance range MSCAN PWMs Motif score threshold Window size Minimum hits Maximum hits MCAST PWMs Motif score threshold Maximum allowed distance between adjacent hits Pseudocount weight LWF CRM Locations String length Number of mismatches Detection window size Maximum number of channels Channels equalized Profile cutoff Peak width cutoff Smoothing window Methods for predicting CRMs can be classified by the type Methods based on locations of known CRMs have been of information they use to make the predictions – known rarer. Methods of this type tend to look for statistical prop- binding sites of regulatory proteins, homologous erties of DNA sequence that distinguish CRMs from non- sequences, or known CRMs. Binding sites for the first type regulatory DNA. One group has developed a statistical test of method are generally modeled using position weight called the "fluffy-tail test" that looks for differences in matrices (PWMs) or consensus sequences. These models nucleotide composition, particularly in lists of words of are used to search for statistically significant clusters of various lengths [28]. From the related field of promoter predicted TFBS. Examples of methods based on binding prediction comes PromFind, an algorithm for finding pro- sites of multiple transcription factor proteins include one moters that uses hexamer frequencies of known promot- developed for human skeletal muscle [5], a logistic regres- ers to search for DNA with similar frequencies [29]. sion analysis model for liver-specific transcription factors Because PromFind was developed for promoters, the [6], CIS-ANALYST [7], MCAST [8], Ahab [9], Stubb [10], author could assume that every sequence being tested Cluster Buster [11], MSCAN [12], and EMCMODULE contained one promoter, and that the strand containing [13]. Methods based on binding sites of single transcrip- the promoter was known – assumptions that are not true tion factors have also been developed – SCORE [14], Fly for enhancers. Another recent algorithm developed to pre- Enhancer [15], and a method of searching for homotypic dict CRMs is based on the exhaustive analysis of local clusters [16]. word frequencies (LWF) [30]. Unlike PromFind where the algorithm is based solely on hexamer words, the LWF Methods based on homologous sequences assume that algorithm considers the pattern of word frequencies in a areas of the DNA involved in regulating gene transcription sliding window. are under selective pressure and are therefore more likely to be conserved than non-functional DNA [17]. These While the LWF algorithm was shown to perform well at methods can be categorized by whether they search for the task of predicting CRMs [30], it is difficult to put a bio- conserved DNA by aligning homologous regions from logical meaning to the results since the algorithm depends multiple species [18-20], homologous regions from two on word frequencies, not on the words themselves. In species [21-24], or homologous regions from related contrast, the PromFind algorithm generates lists of hex- genes in a single species (also referred to as co-regulated amers that are important for distinguishing promoters genes) [25-27]. from non-promoter sequences. In the paper describing Page 2 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 Table 2: Correlation between predicted and known CRMs. The performance of six different algorithms on a common data set is compared in this table. For each sequence, the Matthews correlation coefficient is calculated by checking whether each position is a TP, TN, FP, or FN and using the equation listed in the Methods section. The sum of the correlation coefficients gives a cumulative score for each algorithm on this data set. Gene CRMs HexDiff Ahab Cluster Buster MSCAN MCAST LWF btd 1 0.70 0.57 0.19 0.01 0.07 0.10 ems 3 0.00 0.00 -0.03 0.12 -0.01 -0.01 eve 6 0.55 0.63 0.65 0.50 0.41 0.06 fkh 1 -0.03 -0.02 -0.02 -0.04 -0.02 -0.01 ftz 5 0.40 0.28 0.28 0.07 0.16 0.08 gt 1 0.27 0.42 0.33 0.35 0.15 0.03 h 5 0.71 0.63 0.53 0.30 0.37 0.08 hb 2 0.35 0.63 0.39 0.34 0.24 0.04 hkb 1 0.51 0.00 -0.02 -0.02 -0.08 0.09 kni 3 0.55 0.55 0.39 0.37 0.23 -0.05 kr 3 0.43 0.00 0.77 0.20 0.11 -0.03 oc 2 0.70 -0.02 0.00 0.11 0.02 0.07 prd 7 0.01 -0.07 0.16 0.07 -0.04 0.05 run 6 0.27 0.16 0.08 0.08 0.02 0.07 slp1 3 -0.07 0.15 -0.04 0.00 0.07 0.01 tll 3 0.35 0.56 0.58 0.19 0.12 -0.04 Total 52 5.71 4.48 4.24 2.64 1.81 0.52 PromFind, the author analyzes the hexamers used by the the known CRMs were provided as FASTA-formatted DNA algorithm for their CpG dinucleotide content and their sequences, their positions relative to the extracted similarity to various known promoter signals such as the sequences were confirmed using BLAST [see Additional Sp1, TATA-box, and CCAAT-box motifs [29]. This type of file 2]. analysis cannot be performed on the local word frequency distributions of the LWF algorithm, due to the way it was The HexDiff algorithm designed. The HexDiff algorithm is designed to discriminate between CRMs and non-CRM sequence by using hexamer We developed the HexDiff algorithm to solve the same frequencies. For evaluation, we use the leave-one-out problem as the LWF algorithm – predicting the location of cross-validation (LOOCV) methodology. In this case, 15 CRMs – while being as biologically meaningful as the of the 16 sequences in the data set are used as a training th PromFind algorithm. The performance of HexDiff was set and the 16 sequence is used as the test set. The process compared to the LWF algorithm and several other CRM is repeated 16 times; leaving out one sequence each time. prediction programs using a common data set. The hex- During training, hexamers that appear more frequently in amers used by HexDiff were then examined to see if CRMs are selected. In order to predict CRMs in the test set known TFBS were recovered. sequence, a window is slid across the sequence and the set of selected hexamers (H ) is used to calculate a score that is used to predict whether each position is either CRM or Results Data set non-CRM sequence. The early development of the Drosophila embryo is well- studied, both biologically and computationally. Data sets Performance comparison detailing the locations of known CRMs have been pub- The HexDiff algorithm was compared to five other CRM lished by several groups [7,9,30,31], but we chose the one prediction algorithms: Ahab, Cluster Buster, MSCAN, compiled by Schroeder et al. as it clearly defined the regu- MCAST, and LWF (see Table 1). The results for Ahab were latory networks involved. In their analysis, Schroeder et al. obtained from the supplementary files of Schroeder et al., examined 29 genes with gap and pair-rule patterns. How- while Cluster Buster, MSCAN, MCAST, and LWF are all ever, only 16 of those genes were associated with known available as downloadable software or public web servers. CRMs [31]. Therefore, we focused our study on those 16 An important criterion for the inclusion of an algorithm genes, which contained a total of 52 CRMs. Sequences for was that it accepted user-defined sequences and PWMs as the genes were obtained from the 4.1 Release of the Dro- input. Ahab, Cluster Buster, MSCAN, and MCAST are sophila genome and included the 20 kb upstream as well algorithms based on binding sites of known transcription as downstream of each gene [see Additional file 1]. Since factors and were given the same set of nine PWMs for the Page 3 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 Table 3: Sensitivities and positive predictive values (PPVs) of HexDiff and other algorithms. A known CRM was considered recovered if a predicted CRM overlapped it by at least 50 bp. The PPVs in this table are italicized because they are estimates of the true PPVs. Without complete knowledge of all CRMs that are present in the 16 sequences, it is possible that some of the predicted CRMs that are labeled as false positives are actually true positives. Algorithm CRMs Recovered Num CRMs Sensitivity TP/(TP + FN) True Positives CRMs Predicted PPV TP/(TP + FP) HexDiff 36 52 69.23% 35 104 33.65% Ahab 23 52 44.23% 20 35 57.14% Cluster Buster 31 52 59.62% 23 88 26.14% MSCAN 34 52 65.38% 42 226 18.58% MCAST 43 52 82.69% 53 499 10.62% LWF 27 52 51.92% 48 433 11.09% transcription factors as described in Schroeder et al.: the narrow down the candidates, the 68 potential CRMs were maternal factors Bicoid (Bcd), Hunchback (Hb), Caudal compared to predictions made by Ahab, Cluster Buster, (Cad), the Torso-response element (TorRE), and Stat92E MSCAN, MCAST, and LWF. 37 of the 68 matched predic- (D-Stat), and the gap factors Kruppel (Kr), Knirps (Kni), tions made by at least one other method, while 18 Giant (Gt), and Tailless (Tll). LWF was given the same matched predictions made by at least two other methods, positive and negative training sets as HexDiff. The default 10 matched predictions made by at least three other meth- parameters were used for Cluster Buster, MSCAN, MCAST, ods, and 6 matched predictions made by at least four and LWF. The complete list of predictions can be found in other methods. the Additional materials section [see Additional file 3]. While our analysis was focused on the gap and pair-rule As shown in Table 2, each algorithm was assigned a cumu- regulatory networks, it's possible that some of the pre- lative score, calculated by summing the Matthews correla- dicted CRMs are actually known CRMs from other regula- tion coefficients for each of the sequences in the data set. tory networks involved in the early development of The predictions made by HexDiff have the highest corre- Drosophila. Therefore, the list of 18 predicted CRMs was lation with the known CRMs (5.71), followed by Ahab compared to a more comprehensive compilation of 124 (4.48) and Cluster Buster (4.24). An interesting result is CRMs [32]. 8 of the 18 predicted CRMs matched CRMs that although four of the six algorithms used the same from the compilation, leaving 10 that do not correspond PWMs, their performance varied widely – from 1.81 to to any known CRMs. Given the specificities of the individ- 4.48. ual methods and the breadth of approaches, it seems very likely that some of the 18 predictions listed in Table 4 While the Matthews correlation coefficient was the pri- constitute novel CRMs. mary performance measure for the six algorithms, a closer look at two other measures offers more information about Meaning of differential hexamers the characteristics of the individual algorithms. Table 3 One advantage of the HexDiff algorithm is that the set of hexamers (H ) used to distinguish between CRMs and shows sensitivities and positive predictive values (PPVs) for each of the algorithms. A known CRM was considered non-CRM sequence can be analyzed for further insights recovered if the overlap between it and a predicted CRM into the mechanisms of gene regulation. Since the H hex- exceeded 50 bp. amers were selected based on their overrepresentation in CRMs relative to non-CRM sequence, it would be reason- One caveat about the PPVs in Table 3 is that they are not able to expect that some of the hexamers would be similar true PPVs, but estimates of the true PPVs. This is due to the to gap and pair-rule regulatory sites. Therefore, the top 80 fact that the 16 sequences in the data set may contain H hexamers and their reverse complements were com- more CRMs than the 52 that have been characterized so pared to the list of binding sites used to build the 9 PWMs far, which would mean that some of the predicted CRMs from Schroeder et al [see Additional file 4]. In the 16 that are labeled as false positives could actually be true rounds of cross-validation, an average of 59.6 hexamers positives. was found within the binding sites. Predicted CRMs Simulations were carried out to estimate the likelihood of Of the 104 predictions made by HexDiff, 36 overlapped this result. 100,000 random sets of 80 hexamers and their known CRMs by at least 50 bp, leaving 68 potential reverse complements were compared to the binding sites. CRMs. Some of these may have been false positives, so to The probability of 59 or more hexamers being found Page 4 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 Table 4: Potential novel CRMs predicted by HexDiff and other algorithms. All of the predicted CRMs listed in this table were predicted by HexDiff and at least two other algorithms. The column labeled "Gene" lists the gene involved in the early development of Drosophila that is closest to the predicted CRM. The columns labeled 1–5 are the different algorithms whose predictions matched the CRMs predicted by HexDiff: 1 – Ahab, 2 – Cluster Buster, 3 – MSCAN, 4 – MCAST, and 5 – LWF. The predicted CRMs were also compared to a compilation of 124 CRMs [32] – matching CRMs are listed in the last column. Gene Arm Begin End Length 1 2 3 4 5 Matched btd X 9534921 9535192 271 * * eve 2R 5492385 5493575 1190 * * eve_late2_mel fkh 3R 24421705 24422385 680 * * ftz 3R 2683060 2683406 346 * * gt X 2268347 2270179 1832 * * * gt X 2290228 2290685 457 * * * * * gt_23-bcd_mel hb 3R 4503375 4503962 587 * * * hb 3R 4519805 4520172 367 * * kni 3L 20628230 20628504 274 * * * * kni_+1_mel prd 2L 12080435 12082316 1881 * * * prd_bcd_mel prd 2L 12089627 12089847 220 * * prd_1_mel run X 20488169 20488643 474 * * * * run X 20524260 20524722 462 * * * * slp1 2L 3811050 3812092 1042 * * slp1 2L 3822581 3823049 468 * * slp1 2L 3824891 3825039 148 * * * * * slp_A-bcd_mel slp1 2L 3833433 3834671 1238 * * * * slp2_-3_mel tll 3R 26680559 26683175 2616 * * * tll_bcd_mel within the binding sites was 0.019, indicating that the algorithm to five other algorithms: Ahab, Cluster Buster, hexamers selected by HexDiff were enriched in binding MSCAN, MCAST, and LWF. Hexdiff's predictions corre- sites of known regulatory proteins. lated best with biological knowledge, and its sensitivity and specificity were comparable to the other algorithms. A similar study was performed using regulatory sites This result was encouraging considering that HexDiff is a obtained from TRANSFAC [33], but the results were not type of machine learning algorithm, which tend to do bet- statistically significant. TRANSFAC contains binding ter in problems where the items being classified are sepa- regions obtained through a variety of biological methods. rated into two roughly equal groups. In this case, the For instance, R02491 contains the following binding CRMs made up just 9.36% of the total data set, with non- region obtained using DNase I footprinting: CRM sequence making up the other 90.64%. Even with a GACTTTATTGCAGCATCTTGAACAATCGTCGCAGTTT- data set where the negative data outweighed the positive GGTAACAC. On average, TRANSFAC binding sites are data by a factor of 9, HexDiff still performed well. much longer than the binding sites used by Schroeder et al .and are therefore probably much less specific. While the While the HexDiff algorithm has a fairly high specificity in TRANSFAC binding sites do contain regulatory DNA, it isolation, it would still be prudent to compare its results seems that the signal is washed out by extraneous to the predictions made by other algorithms, considering sequence. the time and effort required to biologically confirm a computational prediction. The 18 CRMs listed in Table 4 Discussion were predicted by at least three algorithms and did not The HexDiff algorithm is designed to solve the difficult match any of the 52 CRMs provided by Schroeder et al. A task of distinguishing CRMs from non-CRM sequence. It comparison with a more comprehensive list of 124 CRMs is first trained on sequences containing known CRMs by revealed that 8 of the 18 predicted CRMs corresponded to selecting hexamers that discriminate between the two cat- known CRMs from other regulatory networks involved in egories. These hexamers are then applied to novel the early development of Drosophila. The 10 remaining sequences to search for predicted CRMs. The requirements predicted CRMs are strong candidates for future study. of the training process mean that HexDiff works best in a well-defined regulatory system where some CRMs are A further attempt to understand the meaning of the H already known. hexamers was made by comparing them to the known binding sites used to generate the 9 PWMs provided by Using a data set of 16 sequences containing 52 CRMs Schroeder et al. Simulations showed that the number of obtained from Schroeder et al., we compared the HexDiff H hexamers found within the binding sites was signifi- Page 5 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 cantly more than would be expected of a randomly f (h) and negative f (h) training sets and used to calculate p n selected set of hexamers of the same size. While this result a ratio, R(h). is not unexpected, it is a confirmation that we retrieved the binding sites of relevant regulatory proteins using only the locations of known CRMs. fh () Rh () = fh () While recovering the known binding sites is important, it is important to note that they accounted for less than half The next step is to select hexamers that discriminate of the hexamers in the H sets. When 80 hexamers were between CRMs and non-CRM sequence. The hexamers selected for H , their reverse complements would increase with the highest values for R(h) are then placed into H . the total number of hexamers to 160, barring palin- As a result of this selection process, H contains hexamers dromes. And yet, the average number of hexamers that that are much more common in CRMs than in non-CRM were found within the known binding sites was 59.6, leav- sequence. ing another 100 hexamers whose identities are unknown. This result suggests that there are novel sequence features Once the hexamers for H are chosen, the final step is to besides the known binding sites that are important for use them to classify each position in an unknown distinguishing between CRMs and non-CRM sequence. sequence as either CRM or non-CRM sequence. A window is slid across the unknown sequence 1 bp at a time. At Conclusion each position i, a score S is calculated for the window by One of the major questions in studying eukaryotic gene increasing the score by the product of R(h ) and the regulation is how regulatory proteins with relatively number of times that hexamer appeared n(h ) for each degenerate binding sequences can precisely regulate many hexamer from H : genes. The discovery of cis-regulatory modules, short stretches of DNA that contain multiple binding sites for multiple proteins, has provided at least a partial explana- Sn = [(h )R(h )] tion for the regulatory specificity observed in eukaryotes, ∑ id d hH ∈ and motivated a search for ways to predict CRMs compu- dd tationally. Any positions where the window scores exceed a thresh- old score are labeled as potential CRMs. As the shortest We have developed a simple and effective algorithm for known CRM in the data set was 66 bp, we filtered out any predicting CRMs. In our study of the gap and pair-rule genes in Drosophila melanogaster, the results of the HexDiff predicted CRMs shorter than 50 bp. The entire process takes less than 10 seconds per gene on an Athlon 64 algorithm correlated best with biological knowledge, and 3200+. the sensitivity and specificity of the algorithm were com- parable to other algorithms. Our predictions were com- Evaluation of predictions pared to those made by other methods and resulted in a Because there are only 16 sequences in the data set, list of 10 putative CRMs with strong computational sup- LOOCV was used to assess the performance of the HexDiff port. Analysis of the H hexamers revealed that not only were we rediscovering the known binding sites, but also algorithm. In LOOCV, the first of the 16 sequences was discovering new signals that distinguished between CRMs withheld from the data set and used as the test set. The HexDiff algorithm was trained on the 15 remaining and non-CRM sequence. sequences and predictions were made on the test set sequence. Those predictions were then compared to the Methods Differential hexamer frequency algorithm locations of the known CRMs in the test set sequence. The same process was repeated with the second sequence, etc., The HexDiff algorithm is based on the idea of distinguish- until each of the 16 sequences had been used as a test set. ing between two types of DNA sequence: CRMs, and non- CRM sequence. In order to accomplish this task, a model is built using sequences where the CRMs are known – the The accuracy of the predictions for each sequence was measured using the Matthews correlation coefficient, training set. The training set is split into positive and neg- which has been used extensively to evaluate the perform- ative training sets by consolidating all of the known CRMs ance of various prediction algorithms [34]. It combines into a positive training set, and the remainder of the train- ing set into a negative training set. On average, the ~1 Mb both sensitivity and specificity into one measure and relies on four values that satisfy TP + TN + FP + FN = N training set is split into a ~50 kb positive training set and (length of the sequence): TP (the number of base pairs a ~950 kb negative training set. The frequency of each hex- where known CRMs overlap with predicted CRMs), TN amer h on both strands is then calculated for the positive Page 6 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 (the number of base pairs that are not in known CRMs or Choices in algorithm design predicted CRMs), FP (the number of base pairs where pre- Two important choices were made during the design of dicted CRMs did not overlap known CRMs), and FN (the the HexDiff algorithm: the model used to represent the number of base pairs where known CRMs did not overlap training sets, and the type of negative training set used. In predicted CRMs). The Matthews correlation coefficient is order to choose the correct model, a balance had to be calculated as follows: struck between the expressivity of the model and the amount of training data. Since the amount of positive training data was fixed at 52 known CRMs, we tried pen- tamers, hexamers, and heptamers with 0, 1, or 2 mis- TP** TN − FP FN CC = matches (data not shown). Shorter n-mers would have (TP++ FP)() TP FN() TN+ FP(TN+ FN) resulted in a model that was insufficiently expressive to capture the difference between CRMs and non-CRM The Matthews correlation coefficient ranges from -1 to +1, sequences, while longer n-mers would have resulted in a like the better-known Pearson correlation coefficient. A model whose parameters would have had high variance. value of 0 signifies that the prediction is equivalent to a Allowing mismatches would have consolidated n-mers completely random prediction, while +1 signifies a perfect into groups and therefore would have had the effect of prediction. simplifying the model. In the end, hexamers with no mis- matches turned out to have the best performance. Choosing parameters The HexDiff algorithm was designed so that the number For the LWF algorithm, Nazina and Papatsenko tried three of parameters needed to be set by the user was minimized. different negative training sets drawn from the Drosophila This reduced the complexity of the model and helped to genome: random samples from the whole genome, ran- avoid overfitting. Overfitting is a problem often faced by dom samples of coding sequence, and random samples of machine learning algorithms, where a model that is too non-coding sequence. They found that samples from the complex will not only learn the signal in the training set whole genome and samples from non-coding sequence but will also fit the noise, reducing the algorithm's per- resulted in better agreement between CRM predictions formance on the test set [35]. The three parameters from the LWF algorithm and the biologically determined HexDiff does require are: the number of hexamers in H , locations. Since we were predicting CRMs in a well- the size of the window that is slid across the sequence of defined set of 16 genes, we used a local negative training interest, and the threshold score that determines whether set – the portions of the 16 genes that were not labeled as each position is predicted as a CRM or not. known CRMs. We found this approach to give higher accuracy than the negative training sets used by Nazina These values were chosen during LOOCV using an empir- and Papatsenko (data not shown). ical method. During each round of cross-validation, the 16 sequences were split into a 15-sequence training set Availability and requirements and a single-sequence test set. The HexDiff algorithm was Project name: HexDiffProject home page: http:// trained on the training set and its performance evaluated www.ics.uci.edu/~bobc/hexdiff.html on the training set, using various combinations of the three parameters. The combination that gave the best per- Operating system: Platform independent formance on the training set was then used to predict CRMs in the test set. Programming language: Perl In general, we observed that the HexDiff algorithm was Other requirements: Perl 5.6 or higher relatively insensitive to the precise values for the parame- ters. We ran the HexDiff algorithms with 6 values for the License: GNU GPL number of hexamers (30, 40, 50, 60, 70, 80) and 11 val- ues for the size of the sliding window (1000, 1100, 1200, Any restrictions to use by non-academics: licence needed 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000). For each pair of parameters, the threshold score that gave the Authors' contributions best performance on the training set was selected. For BC carried out the computational work in this study and these 66 analyses, the average cumulative Matthews corre- drafted the manuscript. DK participated in the design of lation coefficient for the test set sequences was 5.37 with the study and helped revise the manuscript. Both authors a standard deviation of 0.51. The average number of mod- read and approved the final manuscript. ules recovered was 33.8 with a standard deviation of 3.29. Page 7 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 Additional material References 1. Davidson EH, McClay DR, Hood L: Regulatory gene networks and the properties of the developmental process. Proc Natl Acad Sci U S A 2003, 100(4):1475-1480. Additional File 1 2. Qiu P: Recent advances in computational promoter analysis Sequences comprising the data set. FASTA-formatted sequences that in understanding the transcriptional regulatory network. Bio- include the 20 kb upstream and downstream of the 16 gap and pair-rule chem Biophys Res Commun 2003, 309(3):495-501. genes that were studied. Sequences were extracted from Release 4.1 of the 3. Laimins LA, Gruss P, Pozzatti R, Khoury G: Characterization of enhancer elements in the long terminal repeat of Moloney Drosophila genome using FlyBase's "GBrowse" Genome Browser. murine sarcoma virus. J Virol 1984, 49(1):183-189. Click here for file 4. Qin Y, Kong LK, Poirier C, Truong C, Overbeek PA, Bishop CE: [http://www.biomedcentral.com/content/supplementary/1471- Long-range activation of Sox9 in Odd Sex (Ods) mice. Hum 2105-6-262-S1.TXT] Mol Genet 2004, 13(12):1213-1218. 5. Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Additional File 2 Biol 1998, 278(1):167-181. Coordinates of known CRMs. Coordinates of known CRMs, relative to 6. Krivan W, Wasserman WW: A predictive model for regulatory the sequences in Additional file 2. sequences directing liver-specific transcription. Genome Res Click here for file 2001, 11(9):1559-1566. [http://www.biomedcentral.com/content/supplementary/1471- 7. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor binding 2105-6-262-S2.TXT] site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Additional File 3 Sci U S A 2002, 99(2):757-762. 8. Bailey TL, Noble WS: Searching for statistically significant reg- Coordinates of CRMs predicted by the different algorithms. Coordi- ulatory modules. Bioinformatics 2003, 19 Suppl 2:II16-II25. nates of predicted CRMs for each of the algorithms compared in this study, 9. Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational relative to the sequences in Additional file 2. detection of genomic cis-regulatory modules applied to body Click here for file patterning in the early Drosophila embryo. BMC Bioinformatics [http://www.biomedcentral.com/content/supplementary/1471- 2002, 3(1):30. 2105-6-262-S3.TXT] 10. Sinha S, van Nimwegen E, Siggia ED: A probabilistic method to detect regulatory modules. Bioinformatics 2003, 19 Suppl 1:i292-301. Additional File 4 11. Frith MC, Li MC, Weng Z: Cluster-Buster: Finding dense clus- Hd hexamers. Top 80 H hexamers calculated by the HexDiff algorithm ters of motifs in DNA sequences. Nucleic Acids Res 2003, for each round of cross-validation. 31(13):3666-3668. 12. Johansson O, Alkema W, Wasserman WW, Lagergren J: Identifica- Click here for file tion of functional clusters of transcription factor binding [http://www.biomedcentral.com/content/supplementary/1471- motifs in genome sequences: the MSCAN algorithm. Bioinfor- 2105-6-262-S4.TXT] matics 2003, 19 Suppl 1:i169-76. 13. Gupta M, Liu JS: De novo cis-regulatory module elicitation for Additional File 5 eukaryotic genomes. Proc Natl Acad Sci U S A 2005, 102(20):7079-7084. ROC curves for the HexDiff algorithm. The three curves in this plot were 14. Rebeiz M, Reeves NL, Posakony JW: SCORE: a computational made by taking the combination of parameters that gave the best perform- approach to the identification of cis-regulatory modules and ance on the training sets (number of nmers: 80, window size: 1700 bp, target genes in whole-genome sequence data. Site clustering threshold: 170) and holding two parameters constant while varying the over random expectation. Proc Natl Acad Sci U S A 2002, third. 99(15):9888-9893. 15. Markstein M, Markstein P, Markstein V, Levine MS: Genome-wide Click here for file analysis of clustered Dorsal binding sites identifies putative [http://www.biomedcentral.com/content/supplementary/1471- target genes in the Drosophila embryo. Proc Natl Acad Sci U S A 2105-6-262-S5.pdf] 2002, 99(2):763-768. 16. Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA: Homotypic regulatory clusters in Drosophila. Genome Res 2003, Additional File 6 13(4):579-588. Sensitivities and specificities for HexDiff and the other algorithms. 17. Tagle DA, Koop BF, Goodman M, Slightom JL, Hess DL, Jones RT: Sensitivities and specificities for HexDiff and the other algorithms were Embryonic epsilon and gamma globin genes of a prosimian calculated by checking whether each position was a TP, FP, TN, or FN primate (Galago crassicaudatus). Nucleotide and amino acid and using the appropriate formulas. sequences, developmental regulation and phylogenetic foot- prints. J Mol Biol 1988, 203(2):439-455. Click here for file 18. Blanchette M, Tompa M: Discovery of regulatory elements by a [http://www.biomedcentral.com/content/supplementary/1471- computational method for phylogenetic footprinting. 2105-6-262-S6.pdf] Genome Res 2002, 12(5):739-748. 19. Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM: Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 2003, 299(5611):1391-1394. Acknowledgements 20. Sumiyama K, Kim CB, Ruddle FH: An efficient cis-element discov- ery method using multiple sequence comparisons based on This investigation was supported by National Institutes of Health, National evolutionary relationships. Genomics 2001, 71(2):260-262. Research Service Award 5 T15 LM00744, from the National Library of 21. Grad YH, Roth FP, Halfon MS, Church GM: Prediction of similarly Medicine. acting cis-regulatory modules by subsequence profiling and comparative genomics in Drosophila melanogaster and D.pseudoobscura. Bioinformatics 2004, 20(16):2738-2750. Page 8 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 22. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasser- man WW: Identification of conserved regulatory elements by comparative genome analysis. J Biol 2003, 2(2):13. 23. Loots GG, Ovcharenko I, Pachter L, Dubchak I, Rubin EM: rVista for comparative sequence-based discovery of functional tran- scription factor binding sites. Genome Res 2002, 12(5):832-839. 24. Yuh CH, Brown CT, Livi CB, Rowen L, Clarke PJ, Davidson EH: Patchy interspecific sequence similarities efficiently identify positive cis-regulatory elements in the sea urchin. Dev Biol 2002, 246(1):148-161. 25. Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 1999, 15(7-8):563-577. 26. Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cere- visiae. J Mol Biol 2000, 296(5):1205-1214. 27. van Helden J, Andre B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computa- tional analysis of oligonucleotide frequencies. J Mol Biol 1998, 281(5):827-842. 28. Abnizova I, te Boekhorst R, Walter K, Gilks WR: Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffy-tail test. BMC Bioinformatics 2005, 6(1):109. 29. Hutchinson GB: The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Com- put Appl Biosci 1996, 12(5):391-398. 30. Nazina AG, Papatsenko DA: Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency. BMC Bioinformatics 2003, 4(1):65. 31. Schroeder MD, Pearce M, Fak J, Fan H, Unnerstall U, Emberly E, Rajewsky N, Siggia ED, Gaul U: Transcriptional control in the segmentation gene network of Drosophila. PLoS Biol 2004, 2(9):E271. 32. [http://webdisk.berkeley.edu/~dap5/] 33. Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, Krull M, Matys V, Michael H, Ohnhauser R, Pruss M, Schacherer F, Thiele S, Urbach S: The TRANSFAC system on gene expression regu- lation. Nucleic Acids Res 2001, 29(1):281-283. 34. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16(5):412-424. 35. Mitchell TM: Machine Learning. In McGraw-Hill series in computer science New York , McGraw-Hill; 1997:xvii, 414 p.. Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 9 of 9 (page number not for citation purposes) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

Using hexamers to predict cis-regulatory motifs in Drosophila

BMC Bioinformatics , Volume 6 (1) – Oct 27, 2005

Loading next page...
 
/lp/springer-journals/using-hexamers-to-predict-cis-regulatory-motifs-in-drosophila-B9ZMY1Vl4j

References (39)

Publisher
Springer Journals
Copyright
Copyright © 2005 by Chan and Kibler; licensee BioMed Central Ltd.
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN
1471-2105
DOI
10.1186/1471-2105-6-262
pmid
16253142
Publisher site
See Article on Publisher Site

Abstract

Background: Cis-regulatory modules (CRMs) are short stretches of DNA that help regulate gene expression in higher eukaryotes. They have been found up to 1 megabase away from the genes they regulate and can be located upstream, downstream, and even within their target genes. Due to the difficulty of finding CRMs using biological and computational techniques, even well-studied regulatory systems may contain CRMs that have not yet been discovered. Results: We present a simple, efficient method (HexDiff) based only on hexamer frequencies of known CRMs and non-CRM sequence to predict novel CRMs in regulatory systems. On a data set of 16 gap and pair-rule genes containing 52 known CRMs, predictions made by HexDiff had a higher correlation with the known CRMs than several existing CRM prediction algorithms: Ahab, Cluster Buster, MSCAN, MCAST, and LWF. After combining the results of the different algorithms, 10 putative CRMs were identified and are strong candidates for future study. The hexamers used by HexDiff to distinguish between CRMs and non-CRM sequence were also analyzed and were shown to be enriched in regulatory elements. Conclusion: HexDiff provides an efficient and effective means for finding new CRMs based on known CRMs, rather than known binding sites. Background of CRMs – promoters and enhancers. Promoters are The development of eukaryotic organisms is tightly regu- located immediately upstream of a gene's transcriptional lated by a variety of mechanisms. The initial step of regu- start site and often contain a variety of sequence signals lation is carried out by transcription factors interacting such as the TATA box, CCAAT box, and different TFBS. with cis-regulatory sequences, also known as transcription These characteristics have been used in approaches for factor binding sites (TFBS). In eukaryotes, multiple TFBS finding promoters [2]. In contrast, enhancers do not share are often clustered together into cis-regulatory modules these signals and operate in a manner that is relatively (CRMs). The TFBS can be thought of as inputs into an independent of orientation or distance from their target information processing element, with the output being gene [3]. In fact, one enhancer, Dct, has been found the level of expression of the gene controlled by the CRM almost a megabase away from Sox9, the gene it regulates [1]. [4]. Because of the lack of common signals and because the search for enhancers cannot be limited to the few hun- One of the major challenges for understanding eukaryotic dred base pairs upstream of the transcriptional start site, gene regulation is finding CRMs. There are two main types finding enhancers is a more difficult problem. Page 1 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 Table 1: Key aspects of HexDiff and other algorithms. The table shows the knowledge used and the parameters required by the different algorithms. Algorithm Knowledge Used Parameters HexDiff CRM Locations Number of hexamers in H Window size Window score threshold Ahab PWMs Window size Free energy cutoff Order of background model Cluster Buster PWMs Motif score threshold Gap parameter Cluster score threshold Residue abundance range MSCAN PWMs Motif score threshold Window size Minimum hits Maximum hits MCAST PWMs Motif score threshold Maximum allowed distance between adjacent hits Pseudocount weight LWF CRM Locations String length Number of mismatches Detection window size Maximum number of channels Channels equalized Profile cutoff Peak width cutoff Smoothing window Methods for predicting CRMs can be classified by the type Methods based on locations of known CRMs have been of information they use to make the predictions – known rarer. Methods of this type tend to look for statistical prop- binding sites of regulatory proteins, homologous erties of DNA sequence that distinguish CRMs from non- sequences, or known CRMs. Binding sites for the first type regulatory DNA. One group has developed a statistical test of method are generally modeled using position weight called the "fluffy-tail test" that looks for differences in matrices (PWMs) or consensus sequences. These models nucleotide composition, particularly in lists of words of are used to search for statistically significant clusters of various lengths [28]. From the related field of promoter predicted TFBS. Examples of methods based on binding prediction comes PromFind, an algorithm for finding pro- sites of multiple transcription factor proteins include one moters that uses hexamer frequencies of known promot- developed for human skeletal muscle [5], a logistic regres- ers to search for DNA with similar frequencies [29]. sion analysis model for liver-specific transcription factors Because PromFind was developed for promoters, the [6], CIS-ANALYST [7], MCAST [8], Ahab [9], Stubb [10], author could assume that every sequence being tested Cluster Buster [11], MSCAN [12], and EMCMODULE contained one promoter, and that the strand containing [13]. Methods based on binding sites of single transcrip- the promoter was known – assumptions that are not true tion factors have also been developed – SCORE [14], Fly for enhancers. Another recent algorithm developed to pre- Enhancer [15], and a method of searching for homotypic dict CRMs is based on the exhaustive analysis of local clusters [16]. word frequencies (LWF) [30]. Unlike PromFind where the algorithm is based solely on hexamer words, the LWF Methods based on homologous sequences assume that algorithm considers the pattern of word frequencies in a areas of the DNA involved in regulating gene transcription sliding window. are under selective pressure and are therefore more likely to be conserved than non-functional DNA [17]. These While the LWF algorithm was shown to perform well at methods can be categorized by whether they search for the task of predicting CRMs [30], it is difficult to put a bio- conserved DNA by aligning homologous regions from logical meaning to the results since the algorithm depends multiple species [18-20], homologous regions from two on word frequencies, not on the words themselves. In species [21-24], or homologous regions from related contrast, the PromFind algorithm generates lists of hex- genes in a single species (also referred to as co-regulated amers that are important for distinguishing promoters genes) [25-27]. from non-promoter sequences. In the paper describing Page 2 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 Table 2: Correlation between predicted and known CRMs. The performance of six different algorithms on a common data set is compared in this table. For each sequence, the Matthews correlation coefficient is calculated by checking whether each position is a TP, TN, FP, or FN and using the equation listed in the Methods section. The sum of the correlation coefficients gives a cumulative score for each algorithm on this data set. Gene CRMs HexDiff Ahab Cluster Buster MSCAN MCAST LWF btd 1 0.70 0.57 0.19 0.01 0.07 0.10 ems 3 0.00 0.00 -0.03 0.12 -0.01 -0.01 eve 6 0.55 0.63 0.65 0.50 0.41 0.06 fkh 1 -0.03 -0.02 -0.02 -0.04 -0.02 -0.01 ftz 5 0.40 0.28 0.28 0.07 0.16 0.08 gt 1 0.27 0.42 0.33 0.35 0.15 0.03 h 5 0.71 0.63 0.53 0.30 0.37 0.08 hb 2 0.35 0.63 0.39 0.34 0.24 0.04 hkb 1 0.51 0.00 -0.02 -0.02 -0.08 0.09 kni 3 0.55 0.55 0.39 0.37 0.23 -0.05 kr 3 0.43 0.00 0.77 0.20 0.11 -0.03 oc 2 0.70 -0.02 0.00 0.11 0.02 0.07 prd 7 0.01 -0.07 0.16 0.07 -0.04 0.05 run 6 0.27 0.16 0.08 0.08 0.02 0.07 slp1 3 -0.07 0.15 -0.04 0.00 0.07 0.01 tll 3 0.35 0.56 0.58 0.19 0.12 -0.04 Total 52 5.71 4.48 4.24 2.64 1.81 0.52 PromFind, the author analyzes the hexamers used by the the known CRMs were provided as FASTA-formatted DNA algorithm for their CpG dinucleotide content and their sequences, their positions relative to the extracted similarity to various known promoter signals such as the sequences were confirmed using BLAST [see Additional Sp1, TATA-box, and CCAAT-box motifs [29]. This type of file 2]. analysis cannot be performed on the local word frequency distributions of the LWF algorithm, due to the way it was The HexDiff algorithm designed. The HexDiff algorithm is designed to discriminate between CRMs and non-CRM sequence by using hexamer We developed the HexDiff algorithm to solve the same frequencies. For evaluation, we use the leave-one-out problem as the LWF algorithm – predicting the location of cross-validation (LOOCV) methodology. In this case, 15 CRMs – while being as biologically meaningful as the of the 16 sequences in the data set are used as a training th PromFind algorithm. The performance of HexDiff was set and the 16 sequence is used as the test set. The process compared to the LWF algorithm and several other CRM is repeated 16 times; leaving out one sequence each time. prediction programs using a common data set. The hex- During training, hexamers that appear more frequently in amers used by HexDiff were then examined to see if CRMs are selected. In order to predict CRMs in the test set known TFBS were recovered. sequence, a window is slid across the sequence and the set of selected hexamers (H ) is used to calculate a score that is used to predict whether each position is either CRM or Results Data set non-CRM sequence. The early development of the Drosophila embryo is well- studied, both biologically and computationally. Data sets Performance comparison detailing the locations of known CRMs have been pub- The HexDiff algorithm was compared to five other CRM lished by several groups [7,9,30,31], but we chose the one prediction algorithms: Ahab, Cluster Buster, MSCAN, compiled by Schroeder et al. as it clearly defined the regu- MCAST, and LWF (see Table 1). The results for Ahab were latory networks involved. In their analysis, Schroeder et al. obtained from the supplementary files of Schroeder et al., examined 29 genes with gap and pair-rule patterns. How- while Cluster Buster, MSCAN, MCAST, and LWF are all ever, only 16 of those genes were associated with known available as downloadable software or public web servers. CRMs [31]. Therefore, we focused our study on those 16 An important criterion for the inclusion of an algorithm genes, which contained a total of 52 CRMs. Sequences for was that it accepted user-defined sequences and PWMs as the genes were obtained from the 4.1 Release of the Dro- input. Ahab, Cluster Buster, MSCAN, and MCAST are sophila genome and included the 20 kb upstream as well algorithms based on binding sites of known transcription as downstream of each gene [see Additional file 1]. Since factors and were given the same set of nine PWMs for the Page 3 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 Table 3: Sensitivities and positive predictive values (PPVs) of HexDiff and other algorithms. A known CRM was considered recovered if a predicted CRM overlapped it by at least 50 bp. The PPVs in this table are italicized because they are estimates of the true PPVs. Without complete knowledge of all CRMs that are present in the 16 sequences, it is possible that some of the predicted CRMs that are labeled as false positives are actually true positives. Algorithm CRMs Recovered Num CRMs Sensitivity TP/(TP + FN) True Positives CRMs Predicted PPV TP/(TP + FP) HexDiff 36 52 69.23% 35 104 33.65% Ahab 23 52 44.23% 20 35 57.14% Cluster Buster 31 52 59.62% 23 88 26.14% MSCAN 34 52 65.38% 42 226 18.58% MCAST 43 52 82.69% 53 499 10.62% LWF 27 52 51.92% 48 433 11.09% transcription factors as described in Schroeder et al.: the narrow down the candidates, the 68 potential CRMs were maternal factors Bicoid (Bcd), Hunchback (Hb), Caudal compared to predictions made by Ahab, Cluster Buster, (Cad), the Torso-response element (TorRE), and Stat92E MSCAN, MCAST, and LWF. 37 of the 68 matched predic- (D-Stat), and the gap factors Kruppel (Kr), Knirps (Kni), tions made by at least one other method, while 18 Giant (Gt), and Tailless (Tll). LWF was given the same matched predictions made by at least two other methods, positive and negative training sets as HexDiff. The default 10 matched predictions made by at least three other meth- parameters were used for Cluster Buster, MSCAN, MCAST, ods, and 6 matched predictions made by at least four and LWF. The complete list of predictions can be found in other methods. the Additional materials section [see Additional file 3]. While our analysis was focused on the gap and pair-rule As shown in Table 2, each algorithm was assigned a cumu- regulatory networks, it's possible that some of the pre- lative score, calculated by summing the Matthews correla- dicted CRMs are actually known CRMs from other regula- tion coefficients for each of the sequences in the data set. tory networks involved in the early development of The predictions made by HexDiff have the highest corre- Drosophila. Therefore, the list of 18 predicted CRMs was lation with the known CRMs (5.71), followed by Ahab compared to a more comprehensive compilation of 124 (4.48) and Cluster Buster (4.24). An interesting result is CRMs [32]. 8 of the 18 predicted CRMs matched CRMs that although four of the six algorithms used the same from the compilation, leaving 10 that do not correspond PWMs, their performance varied widely – from 1.81 to to any known CRMs. Given the specificities of the individ- 4.48. ual methods and the breadth of approaches, it seems very likely that some of the 18 predictions listed in Table 4 While the Matthews correlation coefficient was the pri- constitute novel CRMs. mary performance measure for the six algorithms, a closer look at two other measures offers more information about Meaning of differential hexamers the characteristics of the individual algorithms. Table 3 One advantage of the HexDiff algorithm is that the set of hexamers (H ) used to distinguish between CRMs and shows sensitivities and positive predictive values (PPVs) for each of the algorithms. A known CRM was considered non-CRM sequence can be analyzed for further insights recovered if the overlap between it and a predicted CRM into the mechanisms of gene regulation. Since the H hex- exceeded 50 bp. amers were selected based on their overrepresentation in CRMs relative to non-CRM sequence, it would be reason- One caveat about the PPVs in Table 3 is that they are not able to expect that some of the hexamers would be similar true PPVs, but estimates of the true PPVs. This is due to the to gap and pair-rule regulatory sites. Therefore, the top 80 fact that the 16 sequences in the data set may contain H hexamers and their reverse complements were com- more CRMs than the 52 that have been characterized so pared to the list of binding sites used to build the 9 PWMs far, which would mean that some of the predicted CRMs from Schroeder et al [see Additional file 4]. In the 16 that are labeled as false positives could actually be true rounds of cross-validation, an average of 59.6 hexamers positives. was found within the binding sites. Predicted CRMs Simulations were carried out to estimate the likelihood of Of the 104 predictions made by HexDiff, 36 overlapped this result. 100,000 random sets of 80 hexamers and their known CRMs by at least 50 bp, leaving 68 potential reverse complements were compared to the binding sites. CRMs. Some of these may have been false positives, so to The probability of 59 or more hexamers being found Page 4 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 Table 4: Potential novel CRMs predicted by HexDiff and other algorithms. All of the predicted CRMs listed in this table were predicted by HexDiff and at least two other algorithms. The column labeled "Gene" lists the gene involved in the early development of Drosophila that is closest to the predicted CRM. The columns labeled 1–5 are the different algorithms whose predictions matched the CRMs predicted by HexDiff: 1 – Ahab, 2 – Cluster Buster, 3 – MSCAN, 4 – MCAST, and 5 – LWF. The predicted CRMs were also compared to a compilation of 124 CRMs [32] – matching CRMs are listed in the last column. Gene Arm Begin End Length 1 2 3 4 5 Matched btd X 9534921 9535192 271 * * eve 2R 5492385 5493575 1190 * * eve_late2_mel fkh 3R 24421705 24422385 680 * * ftz 3R 2683060 2683406 346 * * gt X 2268347 2270179 1832 * * * gt X 2290228 2290685 457 * * * * * gt_23-bcd_mel hb 3R 4503375 4503962 587 * * * hb 3R 4519805 4520172 367 * * kni 3L 20628230 20628504 274 * * * * kni_+1_mel prd 2L 12080435 12082316 1881 * * * prd_bcd_mel prd 2L 12089627 12089847 220 * * prd_1_mel run X 20488169 20488643 474 * * * * run X 20524260 20524722 462 * * * * slp1 2L 3811050 3812092 1042 * * slp1 2L 3822581 3823049 468 * * slp1 2L 3824891 3825039 148 * * * * * slp_A-bcd_mel slp1 2L 3833433 3834671 1238 * * * * slp2_-3_mel tll 3R 26680559 26683175 2616 * * * tll_bcd_mel within the binding sites was 0.019, indicating that the algorithm to five other algorithms: Ahab, Cluster Buster, hexamers selected by HexDiff were enriched in binding MSCAN, MCAST, and LWF. Hexdiff's predictions corre- sites of known regulatory proteins. lated best with biological knowledge, and its sensitivity and specificity were comparable to the other algorithms. A similar study was performed using regulatory sites This result was encouraging considering that HexDiff is a obtained from TRANSFAC [33], but the results were not type of machine learning algorithm, which tend to do bet- statistically significant. TRANSFAC contains binding ter in problems where the items being classified are sepa- regions obtained through a variety of biological methods. rated into two roughly equal groups. In this case, the For instance, R02491 contains the following binding CRMs made up just 9.36% of the total data set, with non- region obtained using DNase I footprinting: CRM sequence making up the other 90.64%. Even with a GACTTTATTGCAGCATCTTGAACAATCGTCGCAGTTT- data set where the negative data outweighed the positive GGTAACAC. On average, TRANSFAC binding sites are data by a factor of 9, HexDiff still performed well. much longer than the binding sites used by Schroeder et al .and are therefore probably much less specific. While the While the HexDiff algorithm has a fairly high specificity in TRANSFAC binding sites do contain regulatory DNA, it isolation, it would still be prudent to compare its results seems that the signal is washed out by extraneous to the predictions made by other algorithms, considering sequence. the time and effort required to biologically confirm a computational prediction. The 18 CRMs listed in Table 4 Discussion were predicted by at least three algorithms and did not The HexDiff algorithm is designed to solve the difficult match any of the 52 CRMs provided by Schroeder et al. A task of distinguishing CRMs from non-CRM sequence. It comparison with a more comprehensive list of 124 CRMs is first trained on sequences containing known CRMs by revealed that 8 of the 18 predicted CRMs corresponded to selecting hexamers that discriminate between the two cat- known CRMs from other regulatory networks involved in egories. These hexamers are then applied to novel the early development of Drosophila. The 10 remaining sequences to search for predicted CRMs. The requirements predicted CRMs are strong candidates for future study. of the training process mean that HexDiff works best in a well-defined regulatory system where some CRMs are A further attempt to understand the meaning of the H already known. hexamers was made by comparing them to the known binding sites used to generate the 9 PWMs provided by Using a data set of 16 sequences containing 52 CRMs Schroeder et al. Simulations showed that the number of obtained from Schroeder et al., we compared the HexDiff H hexamers found within the binding sites was signifi- Page 5 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 cantly more than would be expected of a randomly f (h) and negative f (h) training sets and used to calculate p n selected set of hexamers of the same size. While this result a ratio, R(h). is not unexpected, it is a confirmation that we retrieved the binding sites of relevant regulatory proteins using only the locations of known CRMs. fh () Rh () = fh () While recovering the known binding sites is important, it is important to note that they accounted for less than half The next step is to select hexamers that discriminate of the hexamers in the H sets. When 80 hexamers were between CRMs and non-CRM sequence. The hexamers selected for H , their reverse complements would increase with the highest values for R(h) are then placed into H . the total number of hexamers to 160, barring palin- As a result of this selection process, H contains hexamers dromes. And yet, the average number of hexamers that that are much more common in CRMs than in non-CRM were found within the known binding sites was 59.6, leav- sequence. ing another 100 hexamers whose identities are unknown. This result suggests that there are novel sequence features Once the hexamers for H are chosen, the final step is to besides the known binding sites that are important for use them to classify each position in an unknown distinguishing between CRMs and non-CRM sequence. sequence as either CRM or non-CRM sequence. A window is slid across the unknown sequence 1 bp at a time. At Conclusion each position i, a score S is calculated for the window by One of the major questions in studying eukaryotic gene increasing the score by the product of R(h ) and the regulation is how regulatory proteins with relatively number of times that hexamer appeared n(h ) for each degenerate binding sequences can precisely regulate many hexamer from H : genes. The discovery of cis-regulatory modules, short stretches of DNA that contain multiple binding sites for multiple proteins, has provided at least a partial explana- Sn = [(h )R(h )] tion for the regulatory specificity observed in eukaryotes, ∑ id d hH ∈ and motivated a search for ways to predict CRMs compu- dd tationally. Any positions where the window scores exceed a thresh- old score are labeled as potential CRMs. As the shortest We have developed a simple and effective algorithm for known CRM in the data set was 66 bp, we filtered out any predicting CRMs. In our study of the gap and pair-rule genes in Drosophila melanogaster, the results of the HexDiff predicted CRMs shorter than 50 bp. The entire process takes less than 10 seconds per gene on an Athlon 64 algorithm correlated best with biological knowledge, and 3200+. the sensitivity and specificity of the algorithm were com- parable to other algorithms. Our predictions were com- Evaluation of predictions pared to those made by other methods and resulted in a Because there are only 16 sequences in the data set, list of 10 putative CRMs with strong computational sup- LOOCV was used to assess the performance of the HexDiff port. Analysis of the H hexamers revealed that not only were we rediscovering the known binding sites, but also algorithm. In LOOCV, the first of the 16 sequences was discovering new signals that distinguished between CRMs withheld from the data set and used as the test set. The HexDiff algorithm was trained on the 15 remaining and non-CRM sequence. sequences and predictions were made on the test set sequence. Those predictions were then compared to the Methods Differential hexamer frequency algorithm locations of the known CRMs in the test set sequence. The same process was repeated with the second sequence, etc., The HexDiff algorithm is based on the idea of distinguish- until each of the 16 sequences had been used as a test set. ing between two types of DNA sequence: CRMs, and non- CRM sequence. In order to accomplish this task, a model is built using sequences where the CRMs are known – the The accuracy of the predictions for each sequence was measured using the Matthews correlation coefficient, training set. The training set is split into positive and neg- which has been used extensively to evaluate the perform- ative training sets by consolidating all of the known CRMs ance of various prediction algorithms [34]. It combines into a positive training set, and the remainder of the train- ing set into a negative training set. On average, the ~1 Mb both sensitivity and specificity into one measure and relies on four values that satisfy TP + TN + FP + FN = N training set is split into a ~50 kb positive training set and (length of the sequence): TP (the number of base pairs a ~950 kb negative training set. The frequency of each hex- where known CRMs overlap with predicted CRMs), TN amer h on both strands is then calculated for the positive Page 6 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 (the number of base pairs that are not in known CRMs or Choices in algorithm design predicted CRMs), FP (the number of base pairs where pre- Two important choices were made during the design of dicted CRMs did not overlap known CRMs), and FN (the the HexDiff algorithm: the model used to represent the number of base pairs where known CRMs did not overlap training sets, and the type of negative training set used. In predicted CRMs). The Matthews correlation coefficient is order to choose the correct model, a balance had to be calculated as follows: struck between the expressivity of the model and the amount of training data. Since the amount of positive training data was fixed at 52 known CRMs, we tried pen- tamers, hexamers, and heptamers with 0, 1, or 2 mis- TP** TN − FP FN CC = matches (data not shown). Shorter n-mers would have (TP++ FP)() TP FN() TN+ FP(TN+ FN) resulted in a model that was insufficiently expressive to capture the difference between CRMs and non-CRM The Matthews correlation coefficient ranges from -1 to +1, sequences, while longer n-mers would have resulted in a like the better-known Pearson correlation coefficient. A model whose parameters would have had high variance. value of 0 signifies that the prediction is equivalent to a Allowing mismatches would have consolidated n-mers completely random prediction, while +1 signifies a perfect into groups and therefore would have had the effect of prediction. simplifying the model. In the end, hexamers with no mis- matches turned out to have the best performance. Choosing parameters The HexDiff algorithm was designed so that the number For the LWF algorithm, Nazina and Papatsenko tried three of parameters needed to be set by the user was minimized. different negative training sets drawn from the Drosophila This reduced the complexity of the model and helped to genome: random samples from the whole genome, ran- avoid overfitting. Overfitting is a problem often faced by dom samples of coding sequence, and random samples of machine learning algorithms, where a model that is too non-coding sequence. They found that samples from the complex will not only learn the signal in the training set whole genome and samples from non-coding sequence but will also fit the noise, reducing the algorithm's per- resulted in better agreement between CRM predictions formance on the test set [35]. The three parameters from the LWF algorithm and the biologically determined HexDiff does require are: the number of hexamers in H , locations. Since we were predicting CRMs in a well- the size of the window that is slid across the sequence of defined set of 16 genes, we used a local negative training interest, and the threshold score that determines whether set – the portions of the 16 genes that were not labeled as each position is predicted as a CRM or not. known CRMs. We found this approach to give higher accuracy than the negative training sets used by Nazina These values were chosen during LOOCV using an empir- and Papatsenko (data not shown). ical method. During each round of cross-validation, the 16 sequences were split into a 15-sequence training set Availability and requirements and a single-sequence test set. The HexDiff algorithm was Project name: HexDiffProject home page: http:// trained on the training set and its performance evaluated www.ics.uci.edu/~bobc/hexdiff.html on the training set, using various combinations of the three parameters. The combination that gave the best per- Operating system: Platform independent formance on the training set was then used to predict CRMs in the test set. Programming language: Perl In general, we observed that the HexDiff algorithm was Other requirements: Perl 5.6 or higher relatively insensitive to the precise values for the parame- ters. We ran the HexDiff algorithms with 6 values for the License: GNU GPL number of hexamers (30, 40, 50, 60, 70, 80) and 11 val- ues for the size of the sliding window (1000, 1100, 1200, Any restrictions to use by non-academics: licence needed 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000). For each pair of parameters, the threshold score that gave the Authors' contributions best performance on the training set was selected. For BC carried out the computational work in this study and these 66 analyses, the average cumulative Matthews corre- drafted the manuscript. DK participated in the design of lation coefficient for the test set sequences was 5.37 with the study and helped revise the manuscript. Both authors a standard deviation of 0.51. The average number of mod- read and approved the final manuscript. ules recovered was 33.8 with a standard deviation of 3.29. Page 7 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 Additional material References 1. Davidson EH, McClay DR, Hood L: Regulatory gene networks and the properties of the developmental process. Proc Natl Acad Sci U S A 2003, 100(4):1475-1480. Additional File 1 2. Qiu P: Recent advances in computational promoter analysis Sequences comprising the data set. FASTA-formatted sequences that in understanding the transcriptional regulatory network. Bio- include the 20 kb upstream and downstream of the 16 gap and pair-rule chem Biophys Res Commun 2003, 309(3):495-501. genes that were studied. Sequences were extracted from Release 4.1 of the 3. Laimins LA, Gruss P, Pozzatti R, Khoury G: Characterization of enhancer elements in the long terminal repeat of Moloney Drosophila genome using FlyBase's "GBrowse" Genome Browser. murine sarcoma virus. J Virol 1984, 49(1):183-189. Click here for file 4. Qin Y, Kong LK, Poirier C, Truong C, Overbeek PA, Bishop CE: [http://www.biomedcentral.com/content/supplementary/1471- Long-range activation of Sox9 in Odd Sex (Ods) mice. Hum 2105-6-262-S1.TXT] Mol Genet 2004, 13(12):1213-1218. 5. Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Additional File 2 Biol 1998, 278(1):167-181. Coordinates of known CRMs. Coordinates of known CRMs, relative to 6. Krivan W, Wasserman WW: A predictive model for regulatory the sequences in Additional file 2. sequences directing liver-specific transcription. Genome Res Click here for file 2001, 11(9):1559-1566. [http://www.biomedcentral.com/content/supplementary/1471- 7. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor binding 2105-6-262-S2.TXT] site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Additional File 3 Sci U S A 2002, 99(2):757-762. 8. Bailey TL, Noble WS: Searching for statistically significant reg- Coordinates of CRMs predicted by the different algorithms. Coordi- ulatory modules. Bioinformatics 2003, 19 Suppl 2:II16-II25. nates of predicted CRMs for each of the algorithms compared in this study, 9. Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational relative to the sequences in Additional file 2. detection of genomic cis-regulatory modules applied to body Click here for file patterning in the early Drosophila embryo. BMC Bioinformatics [http://www.biomedcentral.com/content/supplementary/1471- 2002, 3(1):30. 2105-6-262-S3.TXT] 10. Sinha S, van Nimwegen E, Siggia ED: A probabilistic method to detect regulatory modules. Bioinformatics 2003, 19 Suppl 1:i292-301. Additional File 4 11. Frith MC, Li MC, Weng Z: Cluster-Buster: Finding dense clus- Hd hexamers. Top 80 H hexamers calculated by the HexDiff algorithm ters of motifs in DNA sequences. Nucleic Acids Res 2003, for each round of cross-validation. 31(13):3666-3668. 12. Johansson O, Alkema W, Wasserman WW, Lagergren J: Identifica- Click here for file tion of functional clusters of transcription factor binding [http://www.biomedcentral.com/content/supplementary/1471- motifs in genome sequences: the MSCAN algorithm. Bioinfor- 2105-6-262-S4.TXT] matics 2003, 19 Suppl 1:i169-76. 13. Gupta M, Liu JS: De novo cis-regulatory module elicitation for Additional File 5 eukaryotic genomes. Proc Natl Acad Sci U S A 2005, 102(20):7079-7084. ROC curves for the HexDiff algorithm. The three curves in this plot were 14. Rebeiz M, Reeves NL, Posakony JW: SCORE: a computational made by taking the combination of parameters that gave the best perform- approach to the identification of cis-regulatory modules and ance on the training sets (number of nmers: 80, window size: 1700 bp, target genes in whole-genome sequence data. Site clustering threshold: 170) and holding two parameters constant while varying the over random expectation. Proc Natl Acad Sci U S A 2002, third. 99(15):9888-9893. 15. Markstein M, Markstein P, Markstein V, Levine MS: Genome-wide Click here for file analysis of clustered Dorsal binding sites identifies putative [http://www.biomedcentral.com/content/supplementary/1471- target genes in the Drosophila embryo. Proc Natl Acad Sci U S A 2105-6-262-S5.pdf] 2002, 99(2):763-768. 16. Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA: Homotypic regulatory clusters in Drosophila. Genome Res 2003, Additional File 6 13(4):579-588. Sensitivities and specificities for HexDiff and the other algorithms. 17. Tagle DA, Koop BF, Goodman M, Slightom JL, Hess DL, Jones RT: Sensitivities and specificities for HexDiff and the other algorithms were Embryonic epsilon and gamma globin genes of a prosimian calculated by checking whether each position was a TP, FP, TN, or FN primate (Galago crassicaudatus). Nucleotide and amino acid and using the appropriate formulas. sequences, developmental regulation and phylogenetic foot- prints. J Mol Biol 1988, 203(2):439-455. Click here for file 18. Blanchette M, Tompa M: Discovery of regulatory elements by a [http://www.biomedcentral.com/content/supplementary/1471- computational method for phylogenetic footprinting. 2105-6-262-S6.pdf] Genome Res 2002, 12(5):739-748. 19. Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM: Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 2003, 299(5611):1391-1394. Acknowledgements 20. Sumiyama K, Kim CB, Ruddle FH: An efficient cis-element discov- ery method using multiple sequence comparisons based on This investigation was supported by National Institutes of Health, National evolutionary relationships. Genomics 2001, 71(2):260-262. Research Service Award 5 T15 LM00744, from the National Library of 21. Grad YH, Roth FP, Halfon MS, Church GM: Prediction of similarly Medicine. acting cis-regulatory modules by subsequence profiling and comparative genomics in Drosophila melanogaster and D.pseudoobscura. Bioinformatics 2004, 20(16):2738-2750. Page 8 of 9 (page number not for citation purposes) BMC Bioinformatics 2005, 6:262 http://www.biomedcentral.com/1471-2105/6/262 22. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasser- man WW: Identification of conserved regulatory elements by comparative genome analysis. J Biol 2003, 2(2):13. 23. Loots GG, Ovcharenko I, Pachter L, Dubchak I, Rubin EM: rVista for comparative sequence-based discovery of functional tran- scription factor binding sites. Genome Res 2002, 12(5):832-839. 24. Yuh CH, Brown CT, Livi CB, Rowen L, Clarke PJ, Davidson EH: Patchy interspecific sequence similarities efficiently identify positive cis-regulatory elements in the sea urchin. Dev Biol 2002, 246(1):148-161. 25. Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 1999, 15(7-8):563-577. 26. Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cere- visiae. J Mol Biol 2000, 296(5):1205-1214. 27. van Helden J, Andre B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computa- tional analysis of oligonucleotide frequencies. J Mol Biol 1998, 281(5):827-842. 28. Abnizova I, te Boekhorst R, Walter K, Gilks WR: Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffy-tail test. BMC Bioinformatics 2005, 6(1):109. 29. Hutchinson GB: The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Com- put Appl Biosci 1996, 12(5):391-398. 30. Nazina AG, Papatsenko DA: Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency. BMC Bioinformatics 2003, 4(1):65. 31. Schroeder MD, Pearce M, Fak J, Fan H, Unnerstall U, Emberly E, Rajewsky N, Siggia ED, Gaul U: Transcriptional control in the segmentation gene network of Drosophila. PLoS Biol 2004, 2(9):E271. 32. [http://webdisk.berkeley.edu/~dap5/] 33. Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, Krull M, Matys V, Michael H, Ohnhauser R, Pruss M, Schacherer F, Thiele S, Urbach S: The TRANSFAC system on gene expression regu- lation. Nucleic Acids Res 2001, 29(1):281-283. 34. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16(5):412-424. 35. Mitchell TM: Machine Learning. In McGraw-Hill series in computer science New York , McGraw-Hill; 1997:xvii, 414 p.. Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 9 of 9 (page number not for citation purposes)

Journal

BMC BioinformaticsSpringer Journals

Published: Oct 27, 2005

There are no references for this article.