Computational analysis of composite regulatory elements
Ping Qiu, Wei Ding, Ying Jiang, Jonathan R. Greene, Luquan Wang
Bioinformatics Group and Human Genomic Research Department at Schering-Plough Research Institute, 2015 Galloping Hill Road,
Kenilworth, New Jersey 07033, USA
Received: 5 November 2001 / Accepted: 30 January 2002
Abstract. Combinatorial regulation is a powerful mechanism for
generating specificity in gene expression, and it is thought to play
a pivotal role in the formation of the complex gene regulatory
networks found in higher eukaryotes. The term “Composite Element”
(CE) refers to a minimal functional unit where protein–DNA and
protein–protein interactions contribute to a highly specific pattern
of gene transcriptional regulation. Identification of composite
elements will help to better understand gene regulation networks.
Experimentally identified CEs are limited in number, and the
currently available CE database COMPEL is based on such published
information. Here, based on the statistical analysis of
over-represented adjacent transcription factor binding sites, we
describe a computational method to predict composite regulatory
elements in genomic sequences. The algorithm proved to be efficient
for extracting composite elements that had been experimentally
confirmed and documented in the COMPEL database. Furthermore,
putative new composite elements are predicted based on this method,
and we have been able to confirm some of our predictions which are
not included in the COMPEL database by searching published
information.
Eukaryotic gene regulation involves the assembly of an initiation
complex at the core promoter region and regulatory complexes at
promoter-enhancer regions. The promoter region is usually located
just proximal to or overlapping the transcription initiation site and
consists of several sequence elements with which transcription
factors (TFs) interact in a sequence-specific manner. When recruited,
these TFs serve as molecular switches, which turn the transcription
of the gene on or off. The combinations of the TF-binding elements in
promoters vary depending on the gene, which provides the molecular
basis of temporal and spatial gene expression (Mitchell and Tjian
1989; Novina and Roy 1996).
Over the last few years, growing evidence has suggested that the
complex differential expression of genes in higher organisms is
achieved through combinatorial regulation of transcription by a
specific combination of transcription factors binding to their target
sites in the regulatory regions of these genes. Just a few
tissue-specific transcription factors with distinct tissue
distributions have the potential to act in different combinations to
direct many different patterns of gene expression (Chen 1999;
Wolberger 1998). One of the best-studied examples is that of
composite NFAT/AP-1 sites, in which these two factors were shown to
bind cooperatively to activate cytokine gene expression (Jain et al.
1993; Rao 1994; Rao et al. 1997; Northrop et al. 1993; Crabtree
1999; Lee et al. 1995; Cockerill et al. 1993, 1995). For genome-wide
analysis, microarray data have been used to uncover novel
combinatorial functional motifs in the promoters of Saccharomyces
cerevisiae (Pilpel et al. 2001).
Composite Elements (CEs) were first introduced by Diamond et al.
(1990) in their study of the interaction between a glucocorticoid
receptor binding site and its adjacent AP-1 site in the mouse
proliferin promoter. The CE model was defined further by
Kel-Margoulis et al. (2000) as pairs of closely situated binding
sites, corresponding transcription factors, protein–protein
interaction between them, and expression patterns provided by this
combinatorial regulation. There are two main types of CEs:
synergistic and antagonistic. In synergistic CEs, simultaneous
interactions of two factors with closely situated target sites result
in a high level of transcriptional activation. In an antagonistic CE,
two factors interfere with each other, in some cases resulting in
mutually exclusive binding. There are other examples where factors
can bind to DNA simultaneously, but binding of a repressing factor
may mask an activation domain of an activator (Wingender et al.
1997). Computational analysis and prediction of regulatory elements
(Scherf et al. 2000; Werner 1999; Frech et al. 1997, 1998; Fickett
and Hatzigeorgiou 1997) as well as CEs has been an active research
area. Most studies in this direction have focused either on target
gene identification (Wagner 1999) or on a particular transcription
factor (Kel et al. 1999). A recent study utilized a Gibbs sampling
strategy to model the cooperativity between two transcription factors
and defined position weight matrices for the binding sites
(GuhaThakurta and Stormo 2001).
Even with the completed working draft of the human genome sequence,
the functions of more than half of the human genes are still unknown.
It would be beneficial to be able to identify the regulatory regions
that confer temporal and spatial expression patterns for the
uncharacterized genes. Additionally, it would be advantageous to
identify regulatory regions within genes of known expression pattern
without performing the costly and time-consuming laboratory studies
now required. To achieve these goals, the wealth of case studies
performed over the past years will have to be collected. One such
ongoing effort is the COMPEL database. Kel-Margoulis et al. developed
the COMPEL database (http://compel.bionet.nsc.ru/compel/search.html),
in which they have collected published information on composite
regulatory elements (Kel et al. 1995; Kel-Margoulis et al. 2000;
Wingender et al. 1997). Yet COMPEL 3.0 still contains only a very
limited number of entries (178).
In this study, we describe a novel computational approach to detect
possible composite elements in genomic sequence. The method is based
on the detection of over-represented adjacent transcription factor
binding sites. Such over-represented composite binding sites are very
unlikely to occur by chance alone, as opposed to individual sites,
which are often abundant in promoter regions as well as in other
regions of the genome.
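The idea of testing adjacent site pairs for over-representation can be illustrated with a minimal Python sketch. It is not the statistical procedure used in this study, which is described in the methods. It assumes that binding-site hits are already available as (factor name, position) tuples for each promoter, that site positions are uniformly distributed under the null model, and that a Poisson tail gives an adequate p-value. The function names adjacent_pair_enrichment and poisson_tail and the max_gap parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def adjacent_pair_enrichment(promoters, max_gap=50):
    """Score factor pairs by how often their sites lie within max_gap bp.

    promoters: list of (length, hits); hits is a list of (tf_name, position).
    Returns {(tf_a, tf_b): (observed, expected, p_value)} with tf_a < tf_b.
    Assumption: under the null model, site positions are uniform, so any
    two sites in a promoter of the given length are adjacent with
    probability of about (2 * max_gap + 1) / length (edge effects ignored).
    """
    observed = defaultdict(int)
    expected = defaultdict(float)
    for length, hits in promoters:
        by_tf = defaultdict(list)
        for tf, pos in hits:
            by_tf[tf].append(pos)
        p_adjacent = min(1.0, (2 * max_gap + 1) / length)
        for tf_a, tf_b in combinations(sorted(by_tf), 2):
            pair = (tf_a, tf_b)
            # Count site pairs of the two factors that are close together.
            observed[pair] += sum(
                1
                for a in by_tf[tf_a]
                for b in by_tf[tf_b]
                if abs(a - b) <= max_gap
            )
            # Expected count if the sites were placed independently.
            expected[pair] += len(by_tf[tf_a]) * len(by_tf[tf_b]) * p_adjacent
    return {
        pair: (observed[pair], expected[pair],
               poisson_tail(observed[pair], expected[pair]))
        for pair in expected
    }
```

A pair whose observed count of adjacent sites greatly exceeds the expected count, and therefore receives a small p-value, would be a candidate composite element under these assumptions. Individually abundant sites raise the expected count, so frequent single sites alone do not produce a low p-value.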
Materials and methods
Resources for databases and computer programs.
120 was downloaded from ftp://ncbi.nlm.nih.gov. TRANSFAC (Wingender
et al. 1996, 2001) and MatInspector (Quandt et al. 1995) were licensed
from Biobase. TRANSFAC is a database on transcription factors, their
Correspondence to: P. Qiu; E-mail: email@example.com
Mammalian Genome 13, 327–332 (2002).
© Springer-Verlag New York Inc. 2002
Incorporating Mouse Genome