Access the full text.
Sign up today, get DeepDyve free for 14 days.
Carola Kanz, P. Aldebert, N. Althorpe, W. Baker, A. Baldwin, Kirsty Bates, Paul Browne, Alexandra Broek, Matias Castro, G. Cochrane, Karyn Duggan, R. Eberhardt, Nadeem Faruque, John Gamble, Federico Diez, Nicola Harte, T. Kulikova, Quan Lin, Vincent Lombard, R. Lopez, Renato Mancuso, Michelle McHale, Francesco Nardone, Ville Silventoinen, S. Sobhany, P. Stoehr, M. Tuli, Katerina Tzouvara, Robert Vaughan, Dan Wu, Weimin Zhu, R. Apweiler (2004)
The EMBL Nucleotide Sequence DatabaseNucleic Acids Research, 33
S. Aerts, P. Loo, G. Thijs, H. Mayer, R. Martin, Y. Moreau, B. Moor (2005)
TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysisNucleic Acids Research, 33
E. Gößling, O. Kel-Margoulis, A. Kel, E. Wingender (2003)
MATCHTM - a tool for searching transcription factor binding sites in DNA sequences. Application for the analysis of human chromosomes
T. Werner (1999)
Models for prediction and recognition of eukaryotic promotersMammalian Genome, 10
E. Xing, Wei Wu, Michael Jordan, R. Karp (2004)
Logos: a Modular Bayesian Model for de Novo Motif DetectionJournal of bioinformatics and computational biology, 2 1
BY Chan, D Kibler (2005)
Using hexamers to predict cis-regulatory motifs in DrosophilaBMC Bioinformatics, 6
W. Wasserman, J. Fickett (1998)
Identification of regulatory regions which confer muscle-specific gene expression.Journal of molecular biology, 278 1
G. Sandve, F. Drabløs (2006)
A survey of motif discovery methods in an integrated frameworkBiology Direct, 1
S. Sze, M. Gelfand, P. Pevzner (2001)
Finding Weak Motifs in DNA SequencesPacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
T. Bailey, C. Elkan (1994)
Fitting a Mixture Model By Expectation Maximization To Discover Motifs In BiopolymerProceedings. International Conference on Intelligent Systems for Molecular Biology, 2
A. Sandelin, W. Alkema, P. Engström, W. Wasserman, B. Lenhard (2004)
JASPAR: an open-access database for eukaryotic transcription factor binding profilesNucleic acids research, 32 Database issue
A. Kel, E. Gößling, I. Reuter, E. Cheremushkin, O. Kel-Margoulis, E. Wingender (2003)
MATCH: A tool for searching transcription factor binding sites in DNA sequences.Nucleic acids research, 31 13
GK Sandve, F Drabløs (2005)
In: Proceedings of the 9th International Conference on Knowledge-Based Intelligent Information and Engineering Systems
G. Wray, Matthew Hahn, E. Abouheif, J. Balhoff, M. Pizer, M. Rockman, L. Romano (2003)
The evolution of transcriptional regulation in eukaryotes.Molecular biology and evolution, 20 9
William Krivan, W. Wasserman (2001)
A predictive model for regulatory sequences directing liver-specific transcription.Genome research, 11 9
S. Kiełbasa, D. Gonze, H. Herzel (2005)
Measuring similarities between transcription factor binding sitesBMC Bioinformatics, 6
Jianjun Hu, Bin Li, D. Kihara (2005)
Limitations and potentials of current motif discovery algorithmsNucleic Acids Research, 33
A. Kel, T. Konovalova, T. Waleev, E. Cheremushkin, O. Kel-Margoulis, E. Wingender (2006)
Composite Module Analyst: a fitness-based tool for identification of transcription factor binding site combinationsBioinformatics, 22 10
Liver model , supplementary material
Catalogue of Regulatory Elements
P. Perco, A. Kainz, G. Mayer, A. Lukas, R. Oberbauer, B. Mayer (2005)
Detection of coregulation in differential gene expression profiles.Bio Systems, 82 3
M. Burset, R. Guigó (1996)
Evaluation of gene structure prediction programs.Genomics, 34 3
G. Sandve, F. Drabløs (2005)
Generalized Composite Motif Discovery
S. Sinha, E. Nimwegen, E. Siggia (2003)
A probabilistic method to detect regulatory modulesBioinformatics, 19 Suppl 1
A. Brazma, J. Vilo, E. Ukkonen (1997)
Finding transcription factor binding site combinations in the yeast genome
M. Tompa, Nan Li, T. Bailey, G. Church, B. Moor, E. Eskin, A. Favorov, M. Frith, Yutao Fu, W. Kent, V. Makeev, A. Mironov, William Noble, G. Pavesi, G. Pesole, M. Régnier, Nicolas Simonis, S. Sinha, G. Thijs, J. Helden, Mathias Vandenbogaert, Z. Weng, C. Workman, Chun Ye, Zhou Zhu (2005)
Assessing computational tools for the discovery of transcription factor binding sitesNature Biotechnology, 23
T. Schneider, G. Stormo, L. Gold, A. Ehrenfeucht (1986)
Information content of binding sites on nucleotide sequences.Journal of molecular biology, 188 3
B. Lenhard, W. Wasserman (2002)
TFBS: Computational framework for transcription factor binding site analysisBioinformatics, 18 8
S. Aerts, P. Loo, G. Thijs, Y. Moreau, B. Moor (2003)
Computational detection of cis-regulatory modulesBioinformatics, 19 Suppl 2
(2005)
BMC Bioinformatics Methodology article Using hexamers to predict cis-regulatory motifs in Drosophila
G. Stormo (2000)
DNA binding sites: representation and discoveryBioinformatics, 16 1
TL Bailey, C Elkan (1994)
In: Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology
Debraj Thakurta, G. Stormo (2001)
Identifying target sites for cooperatively binding factorsBioinformatics, 17 7
SH Sze, MS Gelfand, PA Pevzner (2002)
In: Proceedings of the Pacific Symposium on Biocomputing
Qing Zhou, W. Wong (2004)
CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling.Proceedings of the National Academy of Sciences of the United States of America, 101 33
T Waleev, D Shtokalo, T Konovalova, N Voss, E Cheremushkin, P Stegmaier, O Kel-Margoulis, E Wingender, A Kel (2006)
Composite Module Analyst: identification of transcription factor binding site combinations using genetic algorithmNucleic Acids Res, 34
Ö. Johansson, W. Alkema, W. Wasserman, J. Lagergren (2003)
Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithmBioinformatics, 19 Suppl 1
T. Bailey, William Noble (2003)
Searching for statistically significant regulatory modulesBioinformatics, 19 Suppl 2
R. Sharan, I. Ovcharenko, A. Ben-Hur, R. Karp (2003)
CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segmentsBioinformatics, 19 Suppl 1
Laurent Duret, Philipp Bucher (1997)
Searching for regulatory elements in human noncoding sequences.Current opinion in structural biology, 7 3
M. Frith, U. Hansen, Z. Weng (2001)
Detection of cis -element clusters in higher eukaryotic DNABioinformatics, 17 10
V. Matys, O. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev, M. Krull, K. Hornischer, N. Voss, P. Stegmaier, B. Lewicki-Potapov, H. Saxel, A. Kel, E. Wingender (2005)
TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotesNucleic Acids Research, 34
M. Frith, Michael Li, Z. Weng (2003)
Cluster-Buster: finding dense clusters of motifs in DNA sequencesNucleic acids research, 31 13
K. Frech, J. Danescu-Mayer, T. Werner (1997)
A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter.Journal of molecular biology, 270 5
Background: Computational discovery of regulatory elements is an important area of bioinformatics research and more than a hundred motif discovery methods have been published. Traditionally, most of these methods have addressed the problem of single motif discovery – discovering binding motifs for individual transcription factors. In higher organisms, however, transcription factors usually act in combination with nearby bound factors to induce specific regulatory behaviours. Hence, recent focus has shifted from single motifs to the discovery of sets of motifs bound by multiple cooperating transcription factors, so called composite motifs or cis- regulatory modules. Given the large number and diversity of methods available, independent assessment of methods becomes important. Although there have been several benchmark studies of single motif discovery, no similar studies have previously been conducted concerning composite motif discovery. Results: We have developed a benchmarking framework for composite motif discovery and used it to evaluate the performance of eight published module discovery tools. Benchmark datasets were constructed based on real genomic sequences containing experimentally verified regulatory modules, and the module discovery programs were asked to predict both the locations of these modules and to specify the single motifs involved. To aid the programs in their search, we provided position weight matrices corresponding to the binding motifs of the transcription factors involved. In addition, selections of decoy matrices were mixed with the genuine matrices on one dataset to test the response of programs to varying levels of noise. Conclusion: Although some of the methods tested tended to score somewhat better than others overall, there were still large variations between individual datasets and no single method performed consistently better than the rest in all situations. The variation in performance on individual datasets also shows that the new benchmark datasets represents a suitable variety of challenges to most methods for module discovery. Page 1 of 16 (page number not for citation purposes) BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 tify novel module instances. Searching for new matches to Background A key step in the process of gene regulation is the binding a previously defined model might be considered a special of transcription factors to specific cis-regulatory regions of case of module discovery and is often referred to as mod- the genome, usually located in the proximal promoter ule scanning. Programs that specialize in searching for upstream of target genes or in distal enhancer regions modules this way without inferring the models them- [1,2]. Each transcription factor recognizes and binds to a selves include ModuleInspector [9] and ModuleScanner more or less distinct nucleotide pattern – a motif – thereby [10]. The general problem of module discovery, however, regulating the expression of the nearby gene. Determining usually involves inferring both a model representation of the location and specificity of each transcription factor the modules and to find their locations in the sequences. binding site in the genome is thus an important prerequi- site for reconstructing the gene regulatory network of an Most module discovery methods require users to supply a organism. set of candidate single motif models in the form of IUPAC consensus strings or position weight matrices (PWM) Since establishing these binding sites experimentally is a [11]. These are used to discover putative transcription fac- rather laborious process, much effort has been made to tor binding sites in the sequences, and the programs then develop methods that can automatically discover such search for significant combinations of such binding sites binding sites and motifs directly from genomic sequence to report as modules. data. More than a hundred methods have already been proposed [3], and new methods are published nearly What constitutes a significant combination varies every month. There is a large diversity in the algorithms between methods. MSCAN [12], for instance, searches for and models used, and the field has not yet reached agree- regions within sequences that have unusually high densi- ment on the optimal approach. Most methods search for ties of binding sites, more so than would be expected from short, statistically overrepresented patterns in a set of chance alone. The types of the binding motifs are irrele- sequences believed to be enriched in binding sites for par- vant, however, and each potential module instance is ana- ticular transcription factors, such as promoter sequences lyzed independently from the rest. Other tools, like from coregulated genes or orthologous genes in distantly ModuleSearcher [10], Composite Module Analyst [13] related species. and CREME [14], search for specific combinations of motifs that co-occur multiple times in regulatory regions In higher organism, however, transcription factors seldom of related genes. function in isolation, but act in concert with nearby bound factors in a combinatorial manner to induce spe- With an increasing number of programs available, both cific regulatory behaviours. A set of binding motifs associ- for single and composite motif discovery, there is a grow- ated with a cooperating set of transcription factors is ing need among end users for reliable and unbiased infor- called a composite motif or cis-regulatory module. In recent mation regarding the comparative merits of different years, the field of computational motif discovery has approaches. A few independent investigations have been therefore shifted from the detection of single motifs undertaken to assess the performance of selected single towards the discovery of entire regulatory modules. motif discovery methods, for instance by Sze et al. [15] and Hu et al. [16]. The most comprehensive benchmark The diversity of approaches to module discovery is even study to date was carried out by Tompa et al. and included greater than for single motif discovery, and methods vary thirteen of the most popular single motif discovery meth- widely in what they expect as input and what they provide ods [17]. The authors of this study also provided a web as output. For instance, methods like Co-Bind [4], service to enable new methods to be assessed and com- LOGOS [5] and CisModule [6] expect only a set of coreg- pared to the original methods using the same datasets. ulated or orthologous promoter sequences as input and are able to infer both the location and the structure of However, in spite of the increased interest in regulatory modules with few prior assumptions regarding their modules, we are not aware of any similar independent nature. These programs infer an internal model that benchmarking efforts that have been undertaken with includes a representation of each individual transcription respect to composite motif discovery. factor binding motif as well as constraints on the distances between them. On the other hand, programs such as LRA Results [7] and Hexdiff [8] demand as input a collection of We have developed a framework for assessing and com- already known module sites to serve as training data. The paring the performance of methods for the discovery of known positive sites are used along with negative composite motifs. Sequence sets containing real, experi- sequence examples to build a model representation which mentally verified modules are made available for down- can then be compared to new sequences in order to iden- load through our web service, and users can test programs Page 2 of 16 (page number not for citation purposes) BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 of their own choice on these datasets and submit the aspects of modules separately, our framework can equally results back to the web service to get the predictions eval- well be used with programs that predict only one or the uated. Results are presented both as tabulated values and other. in graphical format, and performances of different meth- ods can be compared. Since most module discovery tools To measure prediction accuracy of methods with respect require users to input candidate motifs, each sequence to module location, we have used the nucleotide-level corre- dataset is supplemented by a set of PWMs capable of lation coefficient (nCC). This statistic has been widely used detecting the binding sites involved in the modules. To before, among others, for coding region identification and test how programs respond to varying levels of noise in gene structure prediction [21]. It was also adopted by the PWM sets, we created extended PWM sets for one of Tompa et al. to evaluate binding site predictions in their our datasets where the genuine matrices were mixed with single motif discovery benchmark study. The value of nCC various decoy matrices. lies in the range -1 to +1. A score of +1 indicates that a pre- diction is coincident with the correct answer; whereas a Scoring predictions score of -1 means that the prediction is exactly the inverse We adopted a simple and general definition of a module: of the correct answer. Random predictions will generally a module is a cis-regulatory element consisting of a collec- result in nCC-values close to zero. tion of single binding sites for transcription factors. A module is thus characterized by only two aspects in our TP⋅− TN FN⋅FP nCC = framework: its location in a sequence and its composition, () TP++ FN() TN FP(TP+ FP)(TN+ FN) that is, the set of transcription factor binding motifs involved. A module's location is further defined as the Here, TP is the number of nucleotides in a sequence that smallest contiguous sequence segment encompassing all are correctly predicted by a program as belonging to a the single binding sites in the module, including also the module, while TN is the number of nucleotides correctly intervening bases. For our purpose, the composition of a identified as background. FN is the number of true mod- module is represented by a set of PWM identifiers. Differ- ule nucleotides incorrectly classified as background, and ent modules that share the same composition are said to FP is the number of background nucleotides incorrectly belong to the same module class. Module class definitions classified as belonging to a module. may also be limited by structural constraints. These are rules governing, among others, the strand bias, order and A similar statistic, the motif-level correlation coefficient distances between the transcription factor binding sites of (mCC), was used to evaluate prediction accuracy with modules of the same class. Since it requires a substantial respect to module composition. The definition of mCC effort to determine these constraints experimentally, this follows that of nCC, except that instead of counting the kind of information is available for a very limited number number of nucleotides, we count the number of single of classes. Few methods also report such module con- motifs (or PWMs) correctly or incorrectly classified as straints explicitly. Consequently, we have chosen not to being part of a module or not. Hence, for mCC, TP is the consider this aspect of modules further in our framework, number of PWMs correctly identified as constituents of at least for the time being. the module, while FP is the number of PWMs incorrectly predicted as being part of a module. Note that the correla- Module discovery programs are requested to predict both tion statistics, as defined here, are only applicable when the location of modules and to identify the motifs both the datasets and the predictions made by a program involved by naming the proper PWMs. However, not all contain a combination of module and non-module programs are able to perform both these tasks. The instances, if not, the divisor will be zero and the value of MCAST program [18], for instance, only reports the loca- the statistic will be undefined. Consequently, the mCC- tion of predicted modules, even though it uses a set of score is only informative when the set of PWMs supplied PWMs to detect single binding sites internally. On the to a module discovery program contains false positives, other hand, programs that discover single motifs de novo i.e. additional matrices besides those that are actually without relying on pre-constructed matrices have, of involved in the modules. Final scores for each dataset are course, no way of correctly naming the motifs involved. obtained by summing up TP, FP, TN and FN over all Methods like that of Perco et al. [19] and GCMD [20] sequences before calculating the correlation scores. If no identify modules by looking for groups of PWMs whose module predictions are made on a set of sequences, the binding sites consistently appear together in multiple resulting scores for nCC and mCC are assigned a value of sequences, but disregard any further information about zero rather than being left undefined. In addition to CC the precise position of these sites. Hence, such programs scores, several other statistics mentioned in [17] such as only report the composition of modules but not their sensitivity, specificity, positive predictive value, performance location. By assessing the location and composition Page 3 of 16 (page number not for citation purposes) BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 coefficient (phi-score) and average site performance are cal- TRANSFAC sets and the customized PWM sets independ- culated for both nucleotide- and motif-level. ently. The motivation for using two different PWM sets was to test the stability of methods and examine how the Datasets specific representations used for single motifs might influ- We compiled three datasets from sequences containing ence the ability of methods to find the correct modules. experimentally verified regulatory modules. The first and the last two datasets have different characteristics and The two last datasets were based on combinations of TFBS were chosen to complement each other to test methods found in the regulatory regions of genes specifically under different conditions. expressed in liver [23] and muscle [7] cells. The modules here are usually larger compared to the TRANSCompel Our main dataset was based on annotated composite modules, containing up to nine binding sites for four dif- motifs from the TRANSCompel database [22]. The mod- ferent motifs in the liver regulatory regions and up to eight ules selected for this dataset are small, each consisting of sites for five motifs in the muscle regions. PWMs for these exactly two single binding sites for different transcription motifs were taken from the respective publications. The factors (TFs), but we specifically chose modules that had composition of the modules in these two datasets is vari- multiple similar instances in several sequences. Sequences able; modules can contain multiple binding sites for the containing modules from the same class were grouped same motifs and not all motifs are present in every mod- together producing ten sequence sets named after their ule. constituent single motifs as shown in Table 1. Each of the sequences in a set contained at least one copy of the mod- While most programs require candidate PWMs to be ule with the same two motifs, but the order, orientation entered, this can pose a problem for users who might not and distance between the TFBS could vary between always know in advance the kind of modules that should sequences. Separate PWM collections, with matrices for be present in a sequence or which transcription factors the two single motifs involved, were constructed for each that might bind. It could be the case, for instance, that a of the sequence sets. All in all there were eleven distinct researcher has only a set of promoters from a coregulated single TF binding motifs in our full TRANSCompel data- set of genes and is interested in identifying the hitherto set, and PWMs representing these motifs were collected unknown module that controls the common expression from the companion TRANSFAC database [22]. Since of these genes. A popular strategy then is to employ an TRANSFAC often contains several different PWMs for excessive set of PWMs which, hopefully, also includes the each motif, we grouped all the matrices corresponding to appropriate matrices. An extreme, but not unlikely, sce- a particular motif into an equivalence set, essentially treat- nario would be to use all the matrices available from a ing these PWMs as if they were one and the same with published compilation like TRANSFAC (774 matrices in respect to prediction and scoring. In addition to the release 9.4) or Jaspar [24] (123 core matrices). Although TRANSFAC matrix sets, we also constructed eleven custom this approach will inevitably lead to lots of false positive matrices that were specifically tailored to the particular PWM matches that might thwart the module discovery motifs and binding sites present in the sequences (see process, good module discovery tools should nonetheless Methods). Assessment of module discovery programs on be able to report the true module instances without simul- the TRANSCompel dataset was conducted using both the taneously predicting too many spurious occurrences. Table 1: Datasets Sequence set Sequences Modules Total size (bp) Module size, min-max (avg) AP1-Ets 16 17 14860 14 – 99 (27) AP1-NFAT 8 11 6893 14 – 19 (16) AP1-NFκB 7 8 6532 18 – 135 (53) CEBP-NFκB 8 8 7308 44 – 118 (84) Ebox-Ets 4 6 3489 16 – 50 (25) Ets-AML 5 5 4053 13 – 30 (19) IRF-NFκB 6 6 5344 23 – 71 (43) NFκB-HMGIY 6 7 5393 10 – 32 (13) PU1-IRF 5 5 4530 12 – 14 (13) Sp1-Ets 7 8 5787 16 – 117 (37) Liver 12 14 11943 26 – 176 (112) Muscle 24 24 20427 14 – 294 (120) A brief overview of the ten TRANSCompel sequence sets and the liver and muscle datasets used in the assessment. Further information can be found in Additional File 1. Page 4 of 16 (page number not for citation purposes) BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 To simulate these conditions and test methods' response We generally relied on default parameter settings for all to noisy PWM sets, each PWM set under the TRANSCom- programs. However, since choosing the proper parameter pel dataset was issued in multiple versions with progres- values can sometimes prove crucial for a method's per- sively more decoy matrices added to the set of true formance, we decided to provide the programs with a few annotated motifs. Decoy matrices were randomly sam- general clues where applicable; specifically, that the size of pled from the complete TRANSFAC compilation after modules should not exceed 200 bp (300 bp in the muscle removing the matrices corresponding to the true motifs dataset) and that the modules should consist of exactly for a sequence set. Decoy sets are available at 50%, 75%, two single binding sites for different TFs in the TRANS- 90%, 95% and 99% levels, where the percentage number Compel dataset but possibly up to ten binding sites for relates the amount of decoy matrices in the set. Thus, a four and five different TFs on the liver and muscle sets custom PWM set at the 90% level includes 2 genuine respectively. Furthermore, binding sites could potentially matrices and 18 decoy matrices. The number of decoy overlap and the composition of the modules in liver and matrices in the TRANSFAC PWM sets varies with each muscle sets should be allowed to vary between sequences. module class but is always higher than for the custom sets at the same percentage level. Information on the exact Figures 1a and 1b show the resulting nucleotide-level cor- number of PWMs in each set is available in Additional File relation scores on each sequence set in the TRANSCompel 1. The 99% sets include as decoys all of the matrices from dataset when methods were supplied with TRANSFAC TRANSFAC which do not correspond to the correct matrices and custom matrices respectively. The scores vary motifs. They are called "99%" for consistency, although widely between individual sequence sets but are generally the actual percentage of decoys ranges between 95% and fairly well correlated between methods, so that most 99% depending on the module class. To avert artefacts methods tend to get high (or low) scores on the same sets. stemming from possibly biased selections of decoys, all The notable exception is CisModule which performs decoy sets (except at the 99% level) consist of ten inde- poorly on all sequence sets. The correlation suggests that pendently sampled decoy collections, and the final corre- some sequence sets are inherently more easy (or difficult) lation statistics for a decoy level are calculated by to tackle than others. Scores for CEBP-NFκB and IRF- averaging prediction scores made from using each collec- NFκB are the highest overall. The reasons why these sets tion in turn. This also means that variation due to any sto- are generally easy to predict might be that their modules chastic nature of algorithms will be averaged over ten are quite long and the matrices representing the single independent runs. binding motifs have high information content (see Table 3 and Additional File 1). Conversely, the short size of the Benchmark of module discovery methods modules and the low information content of PWMs for Using our assessment framework, we benchmarked eight AP1-NFAT would make this a hard sequence set. We also published methods for module discovery: CisModule [6], calculated combined scores for the whole TRANSCompel Cister [25], Cluster-Buster [26], Composite Module Analyst dataset which are shown in the inset legends of Figure 1 (CMA) [13], MCAST [18], ModuleSearcher [10], MSCAN and graphically in Figure 2. These combined scores were [12] and Stubb [27]. See Table 2 for brief descriptions of obtained by summing up TP, TN, FP, FN over all sequence each of these methods. CisModule, CMA and Module- sets when calculating the score measures. The highest Searcher process all the sequences in a dataset simultane- combined nCC scores achieved were 0.388 with the ously and look for instances of similar modules across TRANSFAC matrices (MSCAN) and 0.38 with custom multiple sequences. The other methods examine the matrices (MCAST). The average performances across all sequences individually, although Stubb considers multi- methods were also about the same with the two PWM ple instances of similar modules within the same sets. Some methods performed quite differently depend- sequence. Except for MCAST, which does not report mod- ing on the PWMs, however. For instance, MCAST scored ule composition, all the programs report both the loca- much better using custom matrices than with TRANSFAC tion and composition of modules. CisModule, however, matrices, while MSCAN and Cluster-Buster did a better predicts modules de novo without relying on supplied with job with TRANSFAC. The rank order of methods is PWM sets and so does not name the single motifs thus somewhat altered between the two cases. Still, some involved the way we require. Hence, motif-level scores tendencies remain: CMA, Cluster-Buster, MCAST, Mod- were not calculated for MCAST and CisModule. Cluster- uleSearcher and MSCAN occupy the top five positions in Buster and MCAST report the full module segments, while both cases, followed by Cister and Stubb and then finally the rest of the methods list the positions of the PWM hits CisModule which consistently scored lowest. in the modules. In these cases we extracted the start posi- tion of the first reported binding site and the end position Figure 3 shows the results of mixing the PWM sets with an of the last binding site and used these as the boundaries equal proportion of decoy matrices. The addition of decoy of a module prediction. PWMs leads to a drop in score values for almost all meth- Page 5 of 16 (page number not for citation purposes) BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 Table 2: Description of module discovery tools CisModule CisModule models the structure of sequences with a two-level hierarchical mixture-model and uses a Bayesian approach with Gibbs sampling to simultaneously infer the modules, TFBSs and PWMs based on their joint posterior distribution, which is the probability of a model given the input sequence set. At the first level, sequences are viewed as a mixture of module instances and background. At the second level, modules are modelled as a mixture of motifs and inter-module background. Parameters of the model include the widths and representations (PWMs) of single motifs and parameters related to distances between modules and between TFBS within modules. From a random initialization, CisModule iteratively cycles through steps of parameter update and module-motif detection. New parameter values are sampled from their conditional posterior distributions based on the currently predicted modules and motifs, and new predictions of modules and TFBSs are then sampled based on these updated parameter values. Positions in the sequences where the marginal posterior probability of being sampled within modules was greater than 0.5 were output as module predictions. Cister Given a set of PWMs and parameters specifying the expected number of motifs in modules, the expected distances between motifs in modules and the expected distance between modules, Cister builds a Hidden Markov Model (HMM) with three basic states: motif, intra-module background and inter-module background. Transition probabilities between these states follow geometric distributions according to the expected values input by the user. In the motif state, one of the PWMs is chosen uniformly at random and used to decide the probabilities of outputting nucleotides. Background-state emission probabilities are estimated from a sliding window centered on the current base in the query sequence. From this HMM, the posterior probability that each base in the input sequence was generated from a module state as opposed to the inter-module state can be calculated. Predicted modules are defined to occur at local maxima of this posterior probability curve where the value is at least 0.5 and no larger value is observed within 1200 bp. Cluster-Buster Cluster-Buster is developed by the same group that made Cister and is designed to search for clusters of pre- specified motifs in nucleotide sequences. Like Cister, Cluster-Buster constructs a HMM-model based on the user-supplied PWMs, an expected distance between motifs in clusters and background distributions estimated from the input sequence over sliding windows. Log likelihood ratios are used to determine whether a sequence is more likely to be generated by a "cluster-model" or a "background-model". Cluster-Buster uses a linear time heuristic to rapidly estimate log likelihood ratios for all subsequences of the input sequence and outputs those subsequences with ratios above a specified threshold that do not overlap with other higher scoring subsequences. Composite Module Analyst (CMA) The promoter model in CMA is expressed as a Boolean combination of one or more composite modules (CM), each of which consist of a set of single, independent motifs as well as pairs of motifs that must obey certain constraints on distance and orientation. Given a candidate promoter model, the method searches for potential matches to the CMs in the sequences, and a final promoter score is calculated after the presence or absence of each CM is established. CMA employs a Genetic Algorithm to search for the promoter model which best discriminates between a set of positive (co-regulated) and a set of negative sequences. The fitness function is based on a linear combination of several properties of the distribution of the promoter scores and of the individual CM scores in the two sequence sets. MCAST MCAST builds a HMM-model consisting of an intra-module state, an inter-module state and motif-states based on the supplied PWMs. The score for a motif-state is called a p-score and is the negative logarithm of the p- value of a log-odds score based on the probability of a segment in the target sequence being generated either by the PWM or a fixed, user-specified zero-order Markov background model. MCAST forbids transitions into motif-states that result in p-scores lower than some chosen threshold. Some state transitions are associated with certain costs. For instance, entering the inter-module state from a motif-state incurs a large one-time penalty while cycling through the intra-module state incurs smaller penalties for each nucleotide emitted. The Viterbi algorithm is used to find the highest scoring path through the HMM with respect to the input sequence, classifying each position in the sequence as either belonging to a module or to the background. Potential module segments are scored according to the number of motifs in the module and the p-scores of these motifs and are penalized by the number of intra-module background bases. Finally, modules are ranked according to the estimated E-values of these scores. ModuleSearcher Given a list of PWM hits with match scores for putative TFBSs in a sequence set, ModuleSearcher finds the module model (set of k PWMs) that best fits the sequences. The score of a module model is calculated as the sum of scores over all sequences, and the score function for a single sequence is based on the best scoring set of TFBSs in the sequence that corresponds to the PWMs in the module model. To be considered a valid TFBS set the binding sites must all lie within a short window, and the user can choose to ignore TFBS sets with overlapping binding sites or penalize sets that lack sites for some PWMs. An A*-algorithm (or alternatively a Genetic Algorithm) is employed to search the space of possible subsets of k motifs from the full PWM library in order to find the highest scoring module model. MSCAN MSCAN discovers modules by evaluating the combined statistical significance of sets of potential non- overlapping TF binding sites in a sliding window along the input sequence. PWMs are compared against each position within the window to obtain match scores, and p-values are calculated as the probability of obtaining similar or higher scores at a specific position in a random sequence with nucleotide distribution similar to the distribution in the window. MSCAN proceeds by calculating significance scores for all combinations of up to k binding sites in the window and then selects the optimal combination (the one with the lowest score). A prediction is output if a final p-value computed from this score is less than some user-specified threshold. Page 6 of 16 (page number not for citation purposes) BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 Table 2: Description of module discovery tools (Continued) Stubb The HMM used by Stubb consists of motif states based on supplied PWMs and a single background state based on a kth-order Markov model with probability distribution estimated from a sliding window. The scoring function is the log likelihood ratio that the sequence within a limited window was more likely generated by the full model than with a HMM consisting of only the background state. Unlike the other HMM methods presented here, the transition probabilities between states in Stubb are not based on user-input expectancies, but are estimated from the sequence using the Baum-Welch algorithm. This procedure finds the set of transition probabilities that maximizes the scoring function. If Stubb finds that some motifs are highly correlated with respect to order, it can make use of correlated transition probabilities. This means that the probability of entering a specific motif state will dependent on which previous motif was output. Stubb can also utilize phylogenetic comparisons between sequences from multiple species to highlight potentially regulatory modules. The table contains short descriptions of the eight methods included in the assessment. All methods except for CisModule rely on supplied PWMs and consider matches on both strands, usually with equal probability (however, Stubb can estimate strand biases for all PWMs in a preprocessing step). Not all methods are able to consider overlapping single binding sites, which do occur in a few modules. ods. The drop is greater for the TRANSFAC PWMs, pre- CEBP-NFκB explains the higher drop seen in nCC for this sumably because these sets contain more genuine set when decoys were added. CMA and ModuleSearcher matrices and therefore also more decoys. Contrary to are by far the best methods at predicting the correct com- expectation, some methods actually score slightly better position of modules with both TRANSFAC and custom on certain sequence sets when decoys are in use. Examples PWMs as input, although CMA does perform notably are Cister on Ets-AML and Stubb on Ebox-Ets with custom poor on two specific sequence sets. The mCC score for the matrices. One explanation for this could be that these third best method, Cluster-Buster, is less than half of that methods make use of decoy motifs that just happen to of ModuleSearcher. have a high degree of overlap with genuine modules. To examine whether the modules are predicted with the cor- Figures 5 and 6 show score tendencies as increasingly rect motifs or not, we can look at the corresponding motif- more decoys are added to the PWM sets. The nucleotide- level correlation scores as shown in Figure 4. The generally level performances of CMA and ModuleSearcher are only high mCC scores obtained for IRF-NFκB support the slightly affected by the larger amounts of decoys, whereas notion that this is an easy sequence set, while the diffi- the scores for the other methods steadily decline. At the culty for most methods in selecting the correct motifs for motif-level we clearly see a division into two groups with Cluster−Buster: 0.35 Cluster−Buster: 0.282 Cister: 0.218 Cister: 0.231 MSCAN: 0.388 MSCAN: 0.288 a) b) ModuleSearcher: 0.347 ModuleSearcher: 0.326 MCAST: 0.274 MCAST: 0.38 Stubb: 0.212 Stubb: 0.166 CMA: 0.324 CMA: 0.337 CisModule: −0.001 CisModule: −0.001 Nucleotide-level correlation scor Figure 1 es on the TRANSCompel dataset Nucleotide-level correlation scores on the TRANSCompel dataset. The graphs show nCC scores for each of the ten sequence sets in the TRANSCompel dataset when methods are supplied with TRANSFAC PWM sets (a) and custom matrices (b). Page 7 of 16 (page number not for citation purposes) nCC 0 0.2 0.4 0.6 0.8 1.0 AP1−Ets AP1−NFAT AP1−NFkB CEBP−NFkB Ebox−Ets Ets−AML IRF−NFkB NFkB−HMGIY PU1−IRF Sp1−Ets nCC 0 0.2 0.4 0.6 0.8 1.0 AP1−Ets AP1−NFAT AP1−NFkB CEBP−NFkB Ebox−Ets Ets−AML IRF−NFkB NFkB−HMGIY PU1−IRF Sp1−Ets BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 Table 3: Correlations between dataset properties and nCC scores TRANSFAC PWMs Custom PWMs Average nCC Highest nCC Average nCC Highest nCC Number of sequences -0.23 -0.16 -0.23 -0.05 Length of shortest sequence 0.30 0.18 0.30 0.13 Average sequence length 0.40 0.33 0.42 0.43 Total sequence set length -0.19 -0.12 -0.18 -0.02 Number of module instances -0.38 -0.32 -0.40 -0.19 Size of smallest module 0.61 0.69 0.67 0.73 Size of largest module 0.26 0.34 0.19 0.35 Average module size 0.60 0.68 0.59 0.70 Module size standard deviation 0.23 0.29 0.13 0.29 IC-content (lowest) 0.46 0.45 0.73 0.47 IC-content (total) 0.75 0.73 0.78 0.54 Module/background-ratio 0.53 0.61 0.51 0.63 We conducted a simple correlation analysis to examine which properties of the TRANSCompel sequence sets and PWMs correlated best with the highest and average nCC scores obtained by the methods on these sets. "IC-content (lowest)" is the information content (IC) of the PWM with the lowest IC of the two involved in each sequence set. The information content of a PWM is inversely related to the amount of variability in the binding patterns from which the PWM is constructed [38]. PWMs with higher information content are more specific and match only sites with a high degree of similarity to the consensus motif. "IC-content (total)" is the sum of IC-contents for the two motifs (for TRANSFAC PWMs we used the PWM with the highest IC in each equivalence set to represent the motif). The three highest values are highlighted in each column. The properties that seem to correlate best with methods' performances are the minimum and average size of modules (in basepairs) and the total IC- content, which would imply that module discovery is harder for datasets containing short and degenerate modules. CMA and ModuleSearcher performing significantly better liver- and five muscle-PWMs respectively, and no decoy than the rest. Additional graphs detailing the effects of matrices were used. Since the modules in these datasets do added noise with respect to each individual sequence set not necessarily include binding sites for all of these motifs and the variations due to different decoy selections can be however, we could calculate motif-level scores by treating found at our web site. the PWMs for the missing motifs as false instances. All methods, except CisModule, did a better job on locating Results for the liver and muscle datasets are shown in Fig- the modules in the liver dataset than in the TRANSCom- ures 7 and 8. For these datasets we supplied only four pel dataset. Cluster-Buster scored highest, but Stubb Cluster−Buster Cluster−Buster Cister Cister a) b MSCAN MSCAN MCAST MCAST Stubb Stubb ModuleSearcher ModuleSearcher CisModule CisModule CMA CMA CC Sn SP PPV PC ASP CC Sn SP PPV PC ASP Combined performance scores on Figure 2 the full TRANSCompel dataset Combined performance scores on the full TRANSCompel dataset. Combined nucleotide-level scores obtained for different performance measures when using TRANSFAC PWM sets (a) and custom matrices (b). Page 8 of 16 (page number not for citation purposes) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 Cluster−Buster: 0.267 Cluster−Buster: 0.265 Cister: 0.179 Cister: 0.206 a) MSCAN: 0.241 b MSCAN: 0.237 ModuleSearcher: 0.34 ModuleSearcher: 0.318 MCAST: 0.154 MCAST: 0.303 Stubb: 0.127 Stubb: 0.168 CMA: 0.314 CMA: 0.328 Nucleotide-level correlation scores Figure 3 with 50% noise in the PWM sets Nucleotide-level correlation scores with 50% noise in the PWM sets. The graphs show nCC scores when using TRANSFAC PWM sets (a) and custom matrices (b) with an equal proportion of decoy matrices added. Each value represents the average score over ten runs with different decoy selections. showed the largest improvement in nCC score. The motif- in the case of CMA and underprediction for MSCAN. level scores, on the other hand, were not very high, which Results on the muscle dataset display the same main ten- can most likely be attributed to overprediction of motifs dencies as the other two datasets, but for the first time, Cluster−Buster: 0.347 Cluster−Buster: 0.332 Cister: 0.27 Cister: 0.182 a) MSCAN: 0.258 b MSCAN: 0.254 ModuleSearcher: 0.717 ModuleSearcher: 0.688 Stubb: 0.199 Stubb: 0.209 CMA: 0.771 CMA: 0.792 Motif-level correlation sc Figure 4 ores with 50% noise in the PWM sets Motif-level correlation scores with 50% noise in the PWM sets. The graphs show mCC scores when using TRANSFAC PWM sets (a) and custom matrices (b) with an equal proportion of decoy matrices added. Each value represents the average score over ten runs with different decoy selections. Page 9 of 16 (page number not for citation purposes) mCC nCC 0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.4 0.6 0.8 1.0 AP1−Ets AP1−Ets AP1−NFAT AP1−NFAT AP1−NFkB AP1−NFkB CEBP−NFkB CEBP−NFkB Ebox−Ets Ebox−Ets Ets−AML Ets−AML IRF−NFkB IRF−NFkB NFkB−HMGIY NFkB−HMGIY PU1−IRF PU1−IRF Sp1−Ets Sp1−Ets mCC nCC 0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.4 0.6 0.8 1.0 AP1−Ets AP1−Ets AP1−NFAT AP1−NFAT AP1−NFkB AP1−NFkB CEBP−NFkB CEBP−NFkB Ebox−Ets Ebox−Ets Ets−AML Ets−AML IRF−NFkB IRF−NFkB NFkB−HMGIY NFkB−HMGIY PU1−IRF PU1−IRF Sp1−Ets Sp1−Ets BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 Cluster−Buster: avg = 0.185 Cluster−Buster: avg = 0.166 Cister: avg = 0.105 Cister: avg = 0.119 a) MSCAN: avg = 0.169 b) MSCAN: avg = 0.121 MCAST: avg = 0.081 MCAST: avg = 0.189 Stubb: avg = 0.08 Stubb: avg = 0.118 ModuleSearcher: avg = 0.308 ModuleSearcher: avg = 0.296 CMA: avg = 0.28 CMA: avg = 0.293 Nucleotide-level correlation sc Figure 5 ores at different noise levels Nucleotide-level correlation scores at different noise levels. Plot of nCC scores at increasing noise levels when meth- ods are supplied with TRANSFAC PWM sets (a) and custom matrices (b). Scores shown are averages over all sequence sets and decoy selections at each noise level. MCAST was unable to function properly with very large PWM sets and was therefore assigned a score of zero at the 99% level. CisModule obtains an nCC score above zero and actually lishing the methodological frontier with respect to bypasses one the other methods. bioinformatics techniques. In this study we wanted to explore benchmarking in the context of module discovery and to investigate related design issues such as dataset Discussion Objective benchmarking efforts are important for provid- construction and performance evaluation. ing unbiased reviews of published methods and for estab- Cluster−Buster: avg = 0.147 Cluster−Buster: avg = 0.175 Cister: avg = 0.124 Cister: avg = 0.1 a) MSCAN: avg = 0.121 b) MSCAN: avg = 0.097 Stubb: avg = 0.1 Stubb: avg = 0.171 ModuleSearcher: avg = 0.613 ModuleSearcher: avg = 0.634 CMA: avg = 0.638 CMA: avg = 0.613 Motif-level correlation sc Figure 6 ores at different noise levels Motif-level correlation scores at different noise levels. Plot of mCC scores at increasing noise levels when methods are supplied with TRANSFAC PWM sets (a) and custom matrices (b). Scores shown are averages over all sequence sets and decoy selections at each noise level. Page 10 of 16 (page number not for citation purposes) mCC nCC −0.2 0.0 0.2 0.4 0.6 0.8 1.0 −0.2 0.0 0.2 0.4 0.6 0.8 1.0 50% 0% 50% 75% 75% 90% 90% 95% 95% 99% 99% mCC nCC −0.2 0.0 0.2 0.4 0.6 0.8 1.0 −0.2 0.0 0.2 0.4 0.6 0.8 1.0 50% 0% 50% 75% 75% 90% 90% 95% 95% 99% 99% BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 Cluster−Buster Cluster−Buster Cister Cister a) b MSCAN MSCAN MCAST Stubb Stubb ModuleSearcher ModuleSearcher CMA CisModule CMA CC Sn SP PPV PC ASP CC Sn SP PPV PC ASP Perfo Figure 7 rmances on the liver dataset Performances on the liver dataset. Scores obtained on the liver dataset for different performance measures at nucleotide- level (a) and motif-level (b). Benchmarking of tools for composite motif discovery is that programs might ask for, and not all module discovery harder than benchmarking of single motif discovery tools, tools can be fairly assessed with our system. since the former methods are more diverse with respect to input requirements and the type of predictions they make. To construct the benchmark datasets we relied on real We have aimed at creating a simple and general frame- genomic sequences containing experimentally verified work that can be used with a wide range of methods. Nev- modules, rather than creating synthetic datasets with fab- ertheless, we do not provide every kind of information ricated and planted modules. The motivation for only Cluster−Buster Cluster−Buster Cister Cister a) b) MSCAN MSCAN MCAST Stubb Stubb ModuleSearcher ModuleSearcher CMA CisModule CMA CC Sn SP PPV PC ASP CC Sn SP PPV PC ASP Perfo Figure 8 rmances on the muscle dataset Performances on the muscle dataset. Scores obtained on the muscle dataset for different performance measures at nucleotide-level (a) and motif-level (b). Page 11 of 16 (page number not for citation purposes) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 using real data was to avoid introducing artificial bias module discovery programs can be unduly penalized for related to the composition and constraints of modules. selecting an incorrect PWM at the motif level, even though Good benchmark datasets should be diverse enough to the predicted PWM is very similar to the target. We have discriminate the behaviour of different methods, when tried to remedy this situation by grouping together PWMs possible, and provide them with a wide range of realistic that correspond to the same TFs and consider these as the challenges. For module discovery these challenges could same motif with respect to scoring. However, there might include discovering modules with few or many single still be other matrices in the decoy sets that can match motifs, tightly clustered or widely spaced motifs and mod- with the annotated binding sites. ules with highly conserved or degenerate binding sites. Ideally, benchmark datasets should also be novel to the Since we are using real genomic sequences, some of the methods tested. Currently the amount of experimental predicted modules that we label as false positives can in data available is too limited to achieve all of these goals. fact represent unannotated true positives, and so the The particular dataset we have constructed based on actual performance of methods might very well be better TRANSCompel data is novel in terms of performance test- than indicated, especially at high noise levels. ing. The modules in TRANSCompel are short, however, and to include larger modules we were forced to rely on a It should be noted that while the annotated length of a TF few well-known datasets from liver and muscle regulatory motif may vary from binding site to binding site, the regions that have been used extensively in the past for test- length of a standard PWM is fixed, and PWMs do not ing and possibly for designing and developing module always match the locations of their corresponding bind- discovery methods. Some methods might therefore be ing sites precisely. Perfect nCC scores can therefore be dif- intrinsically biased to perform well on these sets. It is con- ficult or even impossible to obtain. The nCC score also spicuous, for instance, that CisModule – which was tested drops fast if a method predicts a larger module region with muscle data in its original publication – scored com- than what is annotated, even though the target module is parably well to the other methods on our muscle set, yet correctly covered by the predicted region. This can close to zero on both the TRANSCompel and liver data- severely penalize methods that tend to predict large mod- sets. ule regions, especially on the TRANSCompel dataset where most modules are rather short. We chose the correlation coefficient as our main statistic for evaluating and comparing module discovery methods Some programs can utilize additional information to because it captures aspects of two of the most commonly strengthen confidence in predictions and improve their used performance measures – sensitivity and specificity – performance. For instance, Stubb is a sensitive method into a single score value. However, since different statistics and the predictions it makes usually include the correct often favour different methods, it is prudent to consider modules, especially when using large PWM sets; yet, its several measures to get a better comprehension of each CC-scores are generally low because it simultaneously pre- method's qualities. The sensitivity measure (Sn), for dicts a lot of false positives. Stubb can employ a phyloge- instance, tells us to what extent a method's predictions netic footprinting [29] strategy to filter out many of these include the true module instances. At the nucleotide level, false predictions, but it requires that orthologous MCAST seems overall to be the most sensitive method sequences from related species are supplied along with the among those tested here, while CMA shows high sensitiv- regular sequences. However, in order to make the tests as ity on the TRANSCompel dataset. Yet, to achieve these comparable as possible, we have not made such addi- high sensitivity scores the methods at the same time make tional information available to the programs in our a lot of false positive predictions, as can be seen from the benchmark test, unless the type of information can be lower positive predictive values (PPV). MSCAN and Module- expected to be readily obtained for any dataset. Searcher, on the other hand, generally have the highest nucleotide-level PPV scores, which tells us that the posi- Caution should thus always be taken when interpreting tive predictions made by these two programs are more score values, since the reported scores might not accu- trustworthy than predictions made by the other programs. rately reflect the optimal capabilities of the methods. Also, we have run the programs using mostly their default PWMs from the TRANSFAC database were used to repre- parameter settings. We are fully aware that adjusting the sent both the true motifs and the decoys for the TRANS- parameters can greatly affect the performance of a pro- Compel dataset. A potential problem when using gram, however, selecting the most appropriate parameter TRANSFAC is that many of the matrices are quite similar values be can be tricky and running methods with default to each other [28]. This is partly due to some TFs being settings is probably closer to typical usage. represented by several PWMs, but also because different TFs might bind to similar-looking motifs. As a result, Page 12 of 16 (page number not for citation purposes) BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 It is inherently difficult to conduct an assessment that is Conclusion fair to all methods. Even the most minute design choice While improvements can still be made to our systems, we can influence the outcome if it unintentionally favours have taken a first step towards developing a comprehen- some methods over others. For instance, limiting the size sive testing workbench for composite motif discovery of input sequences will be beneficial for most module dis- tools. The assessment system is based on two established covery tools since it improves the signal-to-noise ratio. On datasets for module discovery plus a novel dataset we con- the other hand, using too short sequences can disadvan- structed from TRANSCompel module annotations. The tage methods that require substantial amounts of data in performance of methods on our novel set is comparable order to derive elaborate background models. The best to the previous two, demonstrating its utility as a bench- solution, then, is to try to balance the scales by subjecting mark set. Together these datasets challenge methods to methods to several different situations with datasets discover modules with different characteristics and vary- exhibiting a range of characteristics. This will make it ing levels of difficulty. harder still to declare a winner, since it will inevitably lead to even greater variation in the results. Then again, the Not surprisingly, trying to discover composite motifs de purpose of benchmarks tests need not be to identify a sin- novo proves to be much more challenging than relying on gle program that can be recommended for all needs, but PWMs as an aid to detect potential single binding sites. rather to determine how different methods behave under With large and noisy PWM sets, however, it becomes cru- different conditions, thus enabling us to select the most cial to consider multiple instances of conserved motif appropriate tool to use in specific situations. combinations in order to identify true modules. In gen- eral, our study shows that there are still advances to be The results from our assessment of eight published mod- made in computational module discovery. ule discovery tools show that the top scoring method does vary a lot between datasets. On the TRANSCompel data- Methods set, for instance, all methods save Stubb and CisModule TRANSCompel dataset score better than the others on at least one sequence set. Our main dataset was based on modules annotated in the But there is also a tendency for some methods to perform TRANSCompel database [22], which is one of very few consistently better or worse across several datasets. Cis- databases that contain entries for composite elements Module performed poorly on most sequence sets, Cister whose combinatorial binding effects have been verified and Stubb usually scored somewhere in the middle, while through biological experiments. It comes in both a profes- CMA, ModuleSearcher, MSCAN and Cluster-Buster were sional licensed version and a smaller public version. Our often found among the top scoring methods on each set. dataset was selected from TRANSCompel Professional CMA and ModuleSearcher were clearly best at identifying version 9.4 which contains 421 annotated module sites the correct motif types involved in the modules, and they from 152 different module classes. The largest modules were also the only methods capable of coping with large registered in TRANSCompel are triplets (34 entries) with and noisy PWM sets. The other PWM-reliant methods the remaining being pairs of binding sites (387 entries). appear to be more suited for detecting modules with some To ensure a minimum of support for each module class, prior expected composition than for discovering com- we considered only classes that had at least five annotated pletely new and uncharacterized modules. module sites. Unfortunately, this requirement excluded all triplets and left us with only pairs. After further discard- There was some variation when using custom PWMs as ing a few modules that were too weak to be detected with opposed to TRANSFAC PWM sets. The average perform- stringent PWM-thresholds, we ended up with ten ance over all methods on the whole TRANSCompel data- sequence sets encompassing 81 module binding sites in set was about the same in both cases, but there were a lot 63 different sequences. The longest module spanned 135 of local differences between sequence sets. The most bp with the average being 33 bp. The binding sequences extreme example can be seen on the Ebox-Ets sequence set of modules are specified in TRANSCompel by using where MSCAN scores highest of all with TRANSFAC uppercase letters to indicate bases of the constituent single matrices, yet completely fails to find any true modules motifs and lowercase letters for the intra-module back- with custom matrices. The average deviation in scores ground. We used the supplied references to the EMBL when using either PWM set was about 0.11 and the effect database [30] to obtain additional sequence bases flank- could go both ways. MCAST was the only method which ing these module sites but set an upper limit of 1000 bp almost consistently scored better with one set, namely on the length of the sequences used. Most of the custom matrices. sequences were from human or mouse but also some other mammalian and a few viral sequences were included. Each sequence set was constructed around mod- ules of one particular class made up of two single motifs Page 13 of 16 (page number not for citation purposes) BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 from the following set of eleven: AML, AP-1, C/EBP, E- could be detected with an 85% relative cut-off threshold. box, Ets, IRF, HMGIY, NF-AT, NF-κB, Sp1 and PU.1. The When we lowered the cut-off to 75%, all sites could be sequence sets contained between 4 and 16 sequences and detected. Single binding sites were considered to be the sequences themselves ranged in length from 294 to detected if a predicted match to the corresponding PWM 1000 bp (average 884 bp). All sequences contained at overlapped with the annotated binding site. For the least one module instance, but sometimes up to three, of TRANSFAC matrices, we regarded it as sufficient if any one the designated class. Some sequences also included anno- of the matrices in the equivalence set made a prediction tated modules of other classes. This will usually not be a that overlapped with the annotated site. problem at low noise-levels, because the other modules will only interfere if the set of PWMs supplied to a pro- Liver and muscle datasets gram contains decoy matrices corresponding to the motifs The liver dataset was based on a set of regulatory regions involved in these modules. As the noise-level approaches used as a positive training set to develop a model of liver 99%, however, this will inevitably happen because the specific regulation in the paper by Krivan and Wasserman PWM sets then include the complete TRANSFAC collec- [23]. Sequence data as well as PWM models of four TFs tion. Since we use real genomic data, there is also always implied in liver specific regulation (C/EBP, HNF-1, HNF- a possibility that additional unknown modules are 3 and HNF-4) was downloaded from their supplementary present in the sequences. Even so, for a particular web site [33]. After inspection of the sequence annota- sequence set, only module sites corresponding to the des- tions, we discarded from further consideration those reg- ignated class of that set were considered true positives. ulatory regions that only contained a single TFBS and also smaller annotated regions that were completely over- Although the TRANSCompel database itself does not pro- lapped by larger regions. Furthermore, we ignored a small vide matrix representations for the motifs involved in set of TFs that only had one binding site each in the whole modules, its companion database TRANSFAC does [22]. dataset. This left us with regulatory regions consisting of Unfortunately, there is not a one-to-one correspondence two or more binding sites for the four TFs previously men- between transcription factors and matrices in TRANSFAC, tioned. The start position of the first TFBS and the end and a single factor (or family of factors that recognize the position of the last TFBS in each region were used as mod- same motif) can be represented by several different ule boundaries, and the modules thus obtained varied in PWMs. Instead of selecting just one canonical PWM to use length from 26 to 176 bp with and average of 112 bp. for each motif, we collected all matrices related to a spe- Long sequences were cropped to a maximum of 1000 bp. cific motif and treated the whole set as an equivalence The resulting dataset after curation consisted of 14 mod- class. Thus, a motif can be represented by either one of the ules in 12 sequences with 51 binding sites for 4 different PWMs in the corresponding set, and predicted binding TFs. Eight of the sequences were human, two were from sites in the sequences are considered to be instances of the rat and the last two from mouse and chicken. same motif even if the binding sites are predicted by dif- ferent PWMs from the equivalence set. For the muscle dataset we selected a subset of the regula- tory regions from the paper by Wasserman and Fickett [7] As an alternative to these TRANSFAC sets, we also con- obtained from their web site [34]. Five motifs (Mef-2, Myf, structed custom PWMs for the eleven motifs involved in Sp1, SRF and Tef) were reported as important in muscle our module classes. For each motif we extracted the corre- regulation, and PWMs for these motifs were downloaded sponding annotated binding sites plus four flanking bases from the same site. We chose regions that had at least two on each side from our sequences and used MEME [31] to annotated binding sites for motifs in this set and used the align them and infer a PWM model for the motif. Con- first and last binding site in the regions to delimit the structing matrices from the same binding sites they will modules. Since most of the sequences at the website were later on be used to detect introduces a circularity which excerpts and rather short, we tried to extend them where will probably make these sites easier to find than if the possible by obtaining the original sequences from EMBL, PWMs had been constructed from independent though limiting the sequences to a maximum of 1000 bp sequences. This was intentional, however. Since the pur- as usual. The final muscle dataset used contained 24 pose of our study was to assess the methods' abilities to sequences with one module in each and a total of 84 TFBS find significant combinations of binding sites rather than for 5 motifs. The smallest module spanned 14 bp and the individual sites, we wanted the individual sites to be easily longest 294 bp (average 120 bp). 10 sequences were from detectable. To verify that the annotated single binding the mouse genome, 6 from human, 5 from rat, 2 from sites in the TRANSCompel dataset were indeed detectable chicken and 1 from cow. by our matrices, we used an algorithm from the "TFBS" package [32] to match the PWMs against the sequences. Further statistics on the datasets and PWMs used are sum- Of the 81 single binding sites in the dataset, all but ten marized in Table 1 and Additional File 1. Page 14 of 16 (page number not for citation purposes) BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 Running the programs was the project supervisor. All authors helped revise and Most of the methods tested could be run directly from the approved the final manuscript. input sequences and a set of PWMs. Both CMA and Mod- uleSearcher, however, rely on separate programs to match Additional material the PWMs against the sequences in a preprocessing step. For ModuleSearcher we used the program MotifScanner Additional File 1 since both of these methods are part of the Toucan tools Dataset statistics. This supplementary table includes information about suite for regulatory sequence analysis [35]. MotifScanner the datasets and modules therein, the matrices used to represent the true was run with a third order background model based on motifs and the number of matrices in the PWM sets at various noise levels vertebrate promoter sequences, which was also available on the TRANSCompel dataset. Click here for file with Toucan. CMA comes bundled with Match [36] for [http://www.biomedcentral.com/content/supplementary/1471- PWM scanning. Match utilizes two different threshold val- 2105-9-123-S1.xls] ues which should be individually fitted for each specific PWM. Preconstructed cut-off profiles for TRANSFAC matrices are available for different conditions, for instance to minimize either the false positive or false negative dis- Acknowledgements covery rate or to minimize the sum of these two rates. As Kjetil Klepper, Jostein Johansen and Finn Drabløs were all supported by suggested in the CMA publication, we used cut-off profiles The National Programme for Research in Functional Genomics in Norway designed to minimize the false negative discovery rate. (FUGE) in The Research Council of Norway. Finn Drabløs was additionally Similar cut-off profiles for the liver, muscle and custom supported by The Svanhild and Arne Must Fund for Medical Research. matrices were estimated according to the procedure Osman Abul has been fully supported by an ERCIM fellowship. described for Match [36]. For each PWM we generated References 50000 random oligos by sampling from the PWM distri- 1. Werner T: Models for prediction and recognition of eukaryo- bution. The PWM was then scored against these oligos tic promoters. Mammalian Genome 1999, 10(2):168-175. with Match, and a cut-off threshold was chosen so that 2. Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA: The Evolution of Transcriptional Regulation in 90% of the oligos obtained a match score above this Eukaryotes. Mol Biol Evol 2003, 20(9):1377-1419. threshold. Since CMA is based on a discriminative model, 3. Sandve GK, Drabløs F: A survey of motif discovery methods in it also requires a set of negative sequences along with the an integrated framework. Biol Direct 2006, 1:11. 4. GuhaThakurta D, Stormo GD: Identifying target sites for coop- positive dataset. As negative data we selected 1000 bp pro- eratively binding factors. Bioinformatics 2001, 17(7):608-621. moter segments from 50 random housekeeping genes that 5. Xing EP, Wu W, Jordan MI, Karp RM: Logos: a modular bayesian were part of the default negative gene set included with model for de novo motif detection. J Bioinform Comput Biol 2004, 2(1):127-154. the method's web service [37]. 6. Zhou Q, Wong WH: CisModule: de novo discovery of cis-regu- latory modules by hierarchical mixture modeling. Proc Natl Acad Sci USA 2004, 101(33):12114-12119. Availability and requirements 7. Wasserman WW, Fickett JW: Identification of regulatory The web service for assessing composite motif discovery regions which confer muscle-specific gene expression. J Mol tools, as well as all the results from our benchmark test, is Biol 1998, 278(1):167-181. 8. Chan BY, Kibler D: Using hexamers to predict cis-regulatory available at http://tare.medisin.ntnu.no/composite. motifs in Drosophila. BMC Bioinformatics 2005, 6:262. 9. Frech K, Danescu-Mayer J, Werner T: A novel method to develop highly specific models for regulatory units detects a new LTR Abbreviations in GenBank which contains a functional promoter. J Mol Biol ASP, average site performance (defined as (Sn + PPV)/2); 1997, 270(5):674-687. bp, base pair; FN, false negative; FP, false positive; HMM, 10. Aerts S, Van Loo P, Thijs G, Moreau Y, De Moor B: Computational detection of cis-regulatory modules. Bioinformatics 2003, hidden Markov model; mCC, motif-level correlation coef- 19(suppl 2):ii5-14. ficient; nCC, nucleotide-level correlation coefficient; PC, 11. Stormo GD: DNA binding sites: representation and discovery. performance coefficient (defined as TP/(TP + FN + FP)); Bioinformatics 2000, 16(1):16-23. 12. Johansson , Alkema WBL, Wasserman WW, Lagergren J: Identifica- PPV, positive predictive value (defined as TP/(TP + FP)); tion of functional clusters of transcription factor binding PWM, position weight matrix; Sn, sensitivity (defined as motifs in genome sequences: the MSCAN algorithm. Bioinfor- matics 2003, 19(Suppl. 1):i169-i176. TP/(TP + FN)); Sp, specificity. (defined as TN/(TN + FP)); 13. Kel AE, Konovalova T, Waleev T, Cheremushkin E, Kel-Margoulis TF, transcription factor; TFBS, transcription factor binding OV, Wingender E: Composite Module Analyst: a fitness-based site; TN, true negative; TP, true positive. tool for identification of transcription factor binding site combinations. Bioinformatics 2006, 22(10):1190-1197. 14. Sharan R, Ovcharenko I, Ben-Hur A, Karp RM: CREME: a frame- Authors' contributions work for identifying cis-regulatory modules in human-mouse GKS and OA conceived of the study. All authors partici- conserved segments. Bioinformatics 2003, 19(suppl 1):i283-291. 15. Sze SH, Gelfand MS, Pevzner PA: Finding weak motifs in DNA pated in the design of the study. KK and GKS assembled sequences. In: Proceedings of the Pacific Symposium on Biocomputing the datasets. JJ implemented the web service and ran all 2002:235-246 [http://helix-web.stanford.edu/psb02/sze.pdf]. Lihue, Hawaii the tests together with KK. KK drafted the manuscript. FD Page 15 of 16 (page number not for citation purposes) BMC Bioinformatics 2008, 9:123 http://www.biomedcentral.com/1471-2105/9/123 16. Hu J, Li B, Kihara D: Limitations and potentials of current motif 38. Schneider TD, Stormo GD, Gold L, Ehrenfeucht A: Information discovery algorithms. Nucl Acids Res 2005, 33(15):4899-4913. content of binding sites on nucleotide sequences. J Mol Biol 17. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov 1986, 188(3):415-431. AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005, 23(1):137-144. 18. Bailey TL, Noble WS: Searching for statistically significant reg- ulatory modules. Bioinformatics 2003, 19(Suppl. 2):ii16-ii25. 19. Perco P, Kainz A, Mayer G, Lukas A, Oberbauer R, Mayer B: Detec- tion of coregulation in differential gene expression profiles. Biosystems 2005, 82(3):235-247. 20. Sandve GK, Drabløs F: Generalized composite motif discovery. In In: Proceedings of the 9th International Conference on Knowledge-Based Intelligent Information and Engineering Systems Melbourne, Australia; 2005:763-769. 21. Burset M, Guigó R: Evaluation of gene structure prediction programs. Genomics 1996, 34(3):353-367. 22. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel AE, Wingender E: TRANSFAC® and its module TRANSCompel®: transcriptional gene regu- lation in eukaryotes. Nucl Acids Res 2006, 34:D108-D110. 23. Krivan W, Wasserman WW: A predictive model for regulatory sequences directing liver-specific transcription. Genome Res 2001, 11(9):1559-1566. 24. Sandelin A, Alkema WBL, Engström P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcrip- tion factor binding profiles. Nucl Acids Res 2004, 32:D91-D94. 25. Frith MC, Hansen U, Weng Z: Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 2001, 17(10):878-889. 26. Frith MC, Li MC, Weng Z: Cluster-buster: finding dense clusters of motifs in DNA sequences. Nucl Acids Res 2003, 31(13):3666-3668. 27. Sinha S, van Nimwegen E, Siggia ED: A probabilistic method to detect regulatory modules. Bioinformatics 2003, 19(Suppl.1):i292-i301. 28. Kielbasa SM, Gonze D, Herzel H: Measuring similarities between transcription factor binding sites. BMC Bioinformatics 2005, 6:237. 29. Duret L, Bucher P: Searching for regulatory elements in human noncoding sequences. Curr Opin Struct Biol 1997, 7(3):399-406. 30. Kanz C, Aldebert P, Althorpe N, Baker W, Baldwin A, Bates K, Browne P, van den Broek A, Castro M, Cochrane G, Duggan K, Eber- hardt R, Faruque N, Gamble J, Diez FG, Harte N, Kulikova T, Lin Q, Lombard V, Lopez R, Mancuso R, McHale M, Nardone F, Silventoinen V, Sobhany S, Stoehr P, Tuli MA, Tzouvara K, Vaughan R, Wu D, Zhu W, Apweiler R: The EMBL nucleotide sequence database. Nucl Acids Res 2005, 33:D29-D33. 31. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceed- ings of the Second International Conference on Intelligent Systems for Molecular Biology 1994:28-36 [http://research1t.imb.uq.edu.au/bailey/ t.bailey/papers/ismb94.pdf]. Stanford, California 32. Lenhard B, Wasserman WW: TFBS: computational framework for transcription factor binding site analysis. Bioinformatics 2002, 18(8):1135-1136. 33. Krivan W, Wasserman WW: Liver model, supplementary mate- rial. [http://www.cisreg.ca/tjkwon]. Publish with Bio Med Central and every 34. Wasserman WW, Fickett JW: Catalogue of Regulatory Ele- scientist can read your work free of charge ments. [http://www.cbil.upenn.edu/MTIR/HomePage.html]. 35. Aerts S, Van Loo P, Thijs G, Mayer H, de Martin R, Moreau Y, De "BioMed Central will be the most significant development for Moor B: TOUCAN 2: the all-inclusive open source workbench disseminating the results of biomedical researc h in our lifetime." for regulatory sequence analysis. Nucl Acids Res 2005, Sir Paul Nurse, Cancer Research UK 33:W393-W396. 36. Kel AE, Gößling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Your research papers will be: Wingender E: MATCH: a tool for searching transcription fac- available free of charge to the entire biomedical community tor binding sites in DNA sequences. Nucl Acids Res 2003, 31(13):3576-3579. peer reviewed and published immediately upon acceptance 37. Waleev T, Shtokalo D, Konovalova T, Voss N, Cheremushkin E, Steg- cited in PubMed and archived on PubMed Central maier P, Kel-Margoulis O, Wingender E, Kel A: Composite Module Analyst: identification of transcription factor binding site yours — you keep the copyright combinations using genetic algorithm. Nucleic Acids Res 2006, BioMedcentral 34(Suppl 2):W541-545. Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 16 of 16 (page number not for citation purposes)
BMC Bioinformatics – Springer Journals
Published: Feb 26, 2008
You can share this free article with as many people as you like with the url below! We hope you enjoy this feature!
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
Copy and paste the desired citation format or use the link below to download a file formatted for EndNote
Access the full text.
Sign up today, get DeepDyve free for 14 days.
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.