Experimental data of a single promoter can be used for in silico
detection of genes with related regulation in the absence of
Institute of Experimental Genetics, GSF-National Research Center for Environment and Health, Ingolstaedter Landstrasse 1,
D-85764 Neuherberg, Germany
Genomatix Software GmbH, Karlstrasse 55, D-80333 Muenchen, Germany
Received: 18 April 2000 / Accepted: 17 August 2000
Abstract. Gene expression is presently a major focus in genome
analysis, and the experimental data on regulatory mechanisms and
functional transcription factor binding sites are steadily growing.
However, the annotation of transcriptional regulation of sequences
cannot keep pace with the exponential growth of sequence data-
bases. Employing detailed experimental data of a single promoter
or enhancer to predict genes with similar regulation would provide
a powerful method to link the literature about transcriptional regu-
lation and sequence databases. To this end, we used information on
individual functional transcription factor binding sites to compose
in silico promoter and enhancer models of muscle-specific genes
and to analyze the rodents section of EMBL with these models.
Exhaustive evaluation of all hits revealed every second to third
match to be a muscle-associated gene. Moreover, functionally re-
lated regulatory regions were detected by our model-based ap-
proach even in the absence of sequence similarity. We believe that
this new approach is a substantial extension to database analysis by
BLAST or FASTA, which are restricted to sequence similarity.
Temporal and spatial gene expression is regulated by transcrip-
tional control and mediated by a complex cis-regulatory system.
Transcription factors activate or repress gene expression by bind-
ing to their respective binding sites (cis elements) on regulatory
sequences (e.g., promoter and enhancer sequences). The response
to specific environmental or developmental signals is mediated by
distinct combinatorial interactions of transcription factors binding
to their corresponding cis elements (Grayson et al. 1995; Smart et
al. 1996; Puente et al. 1996). These cis elements are organized on
the DNA sequence as regulatory modules (Firulli and Olson 1997,
review about muscle-specific gene expression; Arnone and Dav-
idson 1997; Yuh et al. 1998; Wasserman and Fickett 1998; Werner
1999, review about modular organization of promoters).
Existing bioinformatics tools for the analysis of regulatory re-
gions using transcription factor binding sites either are based on
statistical evaluation of the occurrence of such sites (Prestridge
1995; Zhang 1998; Wasserman and Fickett 1998) or incorporate
the modular organization of regulatory regions into promoter mod-
els generated by an in silico approach (Frech et al. 1997, 1998;
Werner 1999). The most basic regulatory modules are
composite elements consisting of pairs of functional transcription
factor binding sites, which act synergistically to activate or repress
promoter activity (Kel et al. 1995; Lavorgna et al. 1998; Klingen-
hoff et al. 1999; Kel et al. 1999). They were successfully used for
database searches that were independent of direct sequence simi-
larity (Klingenhoff et al. 1999).
In general, the information about the functional organization of
a regulatory sequence of a gene is fragmentary, i.e., not all of the
binding sites are known or experimentally verified. Can this frag-
mentary information be utilized to search whole databases for
genes that are similarly regulated? We used experimental data
about individual transcription factor binding sites of muscle-
specific genes to generate organizational models of the respective
Despite the absence of significant nucleotide sequence simi-
larity, we demonstrate with several examples that fragmentary
information of functional regulatory regions can indeed be utilized
to detect similarly regulated genes in whole databases.
Materials and methods
Generation of models.
Transcription factor (TF) binding sites that have
been experimentally verified to be functional (e.g., by site-directed muta-
genesis, deletion analysis, DNase I footprinting, cotransfection/
overexpression of transcription factor, etc.) were collected from the litera-
ture and used to build promoter models. Such information from one or
more sequences (e.g., promoter sequences of orthologous genes) was used
to generate models.
The models were developed by using the program FastM professional
(Genomatix Software GmbH, Munich, Germany; Klingenhoff et al. 1999).
Models are generated based on user-supplied data about transcription fac-
tor binding sites, their strand orientation, their order, and their distance.
The transcription factor binding sites can be either selected from the Mat-
Inspector library (Quandt et al. 1995) or provided by the user as IUPAC
strings. For the models we used matrix and core similarities whose scores
were 0.02 lower than the actual scores of the TF sites in the training
sequence. If the model was derived from more than one sequence, the
matrix and core similarities were lowered according to the TF site with the
lowest scores in one of the training sequences. In case IUPAC strings were
used, we did not allow mismatches. In general, we allowed a flexibility of
±7 bp for the spacing of the TF sites. If more than one training sequence
was used, the spacing was adjusted according to the different distances of
the TF sites in the sequences. Each model has an individual threshold,
which determines its specificity. Model thresholds were set so that more
than 50% of the individual elements of a model must be found in the
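The parameter choices described above can be illustrated with a minimal Python sketch. It is not part of FastM and does not reproduce its interface; the class and function names (SiteObservation, site_thresholds, spacing_range, min_elements_required) are hypothetical. Two points are assumptions on our part: that, with several training sequences, the 0.02 margin is subtracted from the lowest observed score, and that the allowed spacing spans the smallest to the largest observed distance with ±7 bp added.

```python
from dataclasses import dataclass

SCORE_MARGIN = 0.02  # from the text: similarities set 0.02 below observed scores
SPACING_FLEX = 7     # from the text: +/- 7 bp flexibility for site spacing


@dataclass
class SiteObservation:
    """Scores of one TF site in one training sequence (hypothetical container)."""
    matrix_sim: float
    core_sim: float


def site_thresholds(observations):
    """Return (matrix, core) similarity thresholds for one TF site.

    Assumption: with several training sequences, the lowest observed
    score is taken and the margin is subtracted from it.
    """
    matrix = min(o.matrix_sim for o in observations) - SCORE_MARGIN
    core = min(o.core_sim for o in observations) - SCORE_MARGIN
    return round(matrix, 3), round(core, 3)


def spacing_range(distances):
    """Return the allowed (min, max) distance in bp between two sites.

    Assumption: the range covers every observed distance, widened by
    the spacing flexibility on both sides.
    """
    return min(distances) - SPACING_FLEX, max(distances) + SPACING_FLEX


def min_elements_required(n_elements):
    """Smallest number of elements that is strictly more than 50% of the model."""
    return n_elements // 2 + 1


# One TF site observed in two training sequences (illustrative scores only).
observed = [SiteObservation(0.95, 1.00), SiteObservation(0.91, 0.98)]
print(site_thresholds(observed))   # thresholds derived from the lower scores
print(spacing_range([20, 26]))     # spacing window covering both distances
print(min_elements_required(4))    # elements needed in a four-element model
```

Under these assumptions, a model with four elements would need at least three of them to match, since two of four is exactly 50% and does not exceed it.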
Sequences that were used to generate organizational models: (1)
AF042092, skeletal muscle type 1 sodium channel (SKM1, rat); (2)
AF051909, acetylcholine receptor alpha (AChR, chicken); (3) D17553,
caldesmon (chicken); (4) L21905, troponin I slow/skeletal (human); (5)
M33834, SERCA2 cardiac/slow-twitch (sarcoplasmic reticulum Ca2+
ATPase, rabbit); (6) M36684, AChR delta (mouse); (7) M57409, alpha
actin, vascular (mouse), D00618, alpha actin, vascular (human), M13756,
alpha actin, vascular (chicken); (8) J04971, troponin C slow/cardiac
(mouse), M37984, troponin C slow/cardiac (human); (9) U49920, troponin
I slow/skeletal (rat), M12132, troponin I fast/skeletal (quail); (10) X73887,
nicotinic AChR beta (rat).
Correspondence to: T. Werner; E-mail: Werner@gsf.de
Mammalian Genome 12, 67–72 (2001).
© Springer-Verlag New York Inc. 2001