Cis-regulatory element based targeted gene finding: genome-wide identification of abscisic acid- and abiotic stress-responsive genes in Arabidopsis thaliana

BIOINFORMATICS ORIGINAL PAPER, Vol. 21 no. 14 2005, pages 3074–3081
doi:10.1093/bioinformatics/bti490

Genome analysis

Weixiong Zhang (1,2,*), Jianhua Ruan (1), Tuan-hua David Ho (3), Youngsook You (3), Taotao Yu (1) and Ralph S. Quatrano (3)

(1) Department of Computer Science and Engineering, (2) Department of Genetics and (3) Department of Biology, Washington University in Saint Louis, Saint Louis, MO 63130, USA

Received on January 20, 2005; revised on May 5, 2005; accepted on May 6, 2005
Advance Access publication May 12, 2005

(*) To whom correspondence should be addressed.

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

ABSTRACT

Motivation: A fundamental problem of computational genomics is identifying the genes that respond to certain endogenous cues and environmental stimuli. This problem can be referred to as targeted gene finding. Since gene regulation is mainly determined by the binding of transcription factors and cis-regulatory DNA sequences, most existing gene annotation methods, which exploit the conservation of open reading frames, are not effective in finding target genes.

Results: A viable approach to targeted gene finding is to exploit the cis-regulatory elements that are known to be responsible for the transcription of target genes. Given such cis-elements, putative target genes whose promoters contain the elements can be identified. As a case study, we apply the above approach to predict the genes in model plant Arabidopsis thaliana which are inducible by a phytohormone, abscisic acid (ABA), and abiotic stress, such as drought, cold and salinity. We first construct and analyze two ABA specific cis-elements, ABA-responsive element (ABRE) and its coupling element (CE), in A.thaliana, based on their conservation in rice and other cereal plants. We then use the ABRE–CE module to identify putative ABA-responsive genes in A.thaliana. Based on RT–PCR verification and the results from literature, this method has an accuracy rate of 67.5% for the top 40 predictions. The cis-element based targeted gene finding approach is expected to be widely applicable since a large number of cis-elements in many species are available.

Contact: zhang@cse.wustl.edu

Supplementary information: Supplementary data for this paper are available at Bioinformatics online.

1 INTRODUCTION

1.1 Targeted gene finding

It is fundamentally important, yet difficult, to identify the genes that respond to certain endogenous cues and/or environmental stimuli. For example, it is of great importance to find genes in plants that are responsive to abiotic stress to enhance the genomic makeup of plants to combat harsh stress, such as drought and low temperature. We call the genes of particular interest target genes and the problem of identifying target genes targeted gene finding.

One possible approach to targeted gene finding is to use the knowledge of experimentally verified target genes in closely related species and utilize gene conservation across different species to identify putative target genes [see Zhang (2002) for a review]. By focusing on the conservation of open reading frames (ORFs) of genes, this strategy has been used to annotate many genomes (The Arabidopsis Genome Initiative, 2000; Lander et al., 2001; Yu et al., 2002). Despite its great success, this method is able to discover only a small number of genes of particular functions, partly because the number of experimentally determined genes is limited. As a result, a large portion of the predicted genes of many species does not have any functional annotation at all. For example, about half of the genes of plant Arabidopsis thaliana currently do not have any definitive functional annotation.

Furthermore, an ORF-centric gene finding method may not be effective in discovering target genes that express under specific conditions. Although gene functions may be indicative of a gene's responsiveness to certain stimuli, there is no direct correlation between gene function and gene expression. Gene expression is controlled mainly at the transcription level, where the binding between transcription factors (TFs) and cis-regulatory DNA sequences (or cis-elements) in the upstream regions of genes plays an important role (Brivanlou and Darnell, 2002). In other words, a gene's responsiveness to certain conditions is 'hard-wired' by its cis-elements. Therefore, if some cis-elements are known to be directly involved in gene transcription regulation in responding to specific stimuli, we should be able to use the cis-elements to identify the genes of interest. When combined with experimental verification, this constitutes an effective approach to genome-wide targeted gene finding and function annotation. This approach is supported by the fact that a large number of TFs and their binding cis-elements have been identified over the years. For example, we know most of the TFs and their corresponding cis-binding elements in the yeast Saccharomyces cerevisiae (Harbison et al., 2004). Moreover, TRANSFAC, a database of experimentally verified and computationally predicted cis-elements in many species, has been established and has been widely used for many years (Matys et al., 2003). There are also databases of plant-specific cis-elements, including PLACE (Higo et al., 1999) and PlantCARE (Lescot et al., 2002). Furthermore, there is abundant information on cis-elements in literature that has not been archived as yet.

In this paper, we investigate the targeted gene finding approach and demonstrate its validity and efficacy by identifying the genes inducible by abscisic acid (ABA) and abiotic stress in A.thaliana.

1.2 ABA in plants and abiotic stress

ABA is an important phytohormone that is prevalent in a plant's developmental stages. It plays many key roles in the synthesis of seed storage, the promotion of seed desiccation tolerance and dormancy, and the inhibition of the phase transitions from embryonic to germinative growth and from vegetative to reproductive growth (Finkelstein et al., 2002). Many genes have been identified as ABA-responsive; these include the genes encoding seed storage proteins, late embryogenesis abundant (LEA) proteins, and various other proteins and protein families. Examples of these genes are EM in wheat (Marcotte et al., 1989), EM and RAB16 in rice (Hattori et al., 1995; Mundy et al., 1990), HVA1, HVA22 and dehydrins in barley (Shen and Ho, 1997; Shen et al., 1996; Xu et al., 1996), and EM and RD29 in A.thaliana (Carles et al., 2002; Narusaka et al., 2003).

Most ABA-responsive genes have two eminent characteristics. First, they contain conserved ABA-responsive elements (ABREs) in their promoters (Hattori et al., 1995, 2002; Marcotte et al., 1989; Mundy et al., 1990; Shen and Ho, 1997; Shen et al., 1996; Xu et al., 1996). ABREs are the binding sites of TFs, such as EmBP-1 (Guiltinan et al., 1990), TAF-1 (Oeda et al., 1991) and ABFs (Choi et al., 2000). Second, the ABREs need to be accompanied by some coupling elements (CEs) in order to be functional (Shen and Ho, 1997; Shen et al., 1996; Xu et al., 1996). The sequence specificity of CEs may be lower than that of ABREs. Moreover, other functional elements, such as ABREs themselves and dehydration-responsive element (DRE), can also function as CEs (Guiltinan et al., 1990; Hobo et al., 1999; Narusaka et al., 2003).

Importantly, ABA mediates many aspects of physiological responses to environmental stress, such as drought, cold and salinity. Many experiments have shown that abiotic stress also activates the processes underlying ABA (Finkelstein et al., 2002; Shinozaki and Yamaguchi-Shinozaki, 2000; Zhu, 2002). Specifically, a large number of genes that respond to abiotic stress are also inducible directly by ABA treatment (Seki et al., 2002a,b), providing direct evidence that ABA must be involved in the processes responding to these environmental stresses.

In brief, we have gained a substantial amount of knowledge of the critical roles that ABA plays in plant development and stress response. It is clear that ABA is an essential element in gene transcription regulation in responding to abiotic stress.

In this paper, we study the cis-element based targeted gene finding method in the context of transcription regulation under ABA and abiotic stress in plants. We are particularly interested in the effectiveness of this method in finding, on a genomic scale, the genes that are inducible by ABA and abiotic stress in A.thaliana.

2 MATERIALS AND METHODS

2.1 Plant material, RNA preparation and RT–PCR

For plant material, A.thaliana ecotype Columbia seeds were grown at 24°C for 10 days with 16/8 h light/dark period on Murashige and Skoog modified basal medium with Gamborg vitamins (PhytoTechnology Laboratories, Prod. No. M404). About 20 seedlings were transferred to each MS medium plate with or without 100 µM ABA. After 24 h RNA was extracted from the two groups using TRIzol® reagent (Invitrogen, Cat. No. 15596-026) and further purified by using RNA clean-up columns from RNeasy® Plant Mini kit (Qiagen, Cat. No. 74904). The total RNA was then treated with DNase I (Invitrogen, Cat. No. 18068-015).

RT–PCR analysis was done as follows. First-strand cDNA was synthesized from 1.5 µg of total RNA using ThermoScript™ RNase H⁻ reverse transcriptase (Invitrogen, Cat. No. 12236-014) with Oligo(dT)12–18 primer following the manufacturer's recommendation. Amplification of the cDNA was optimized using 0.5–2 µl of the cDNA in a total of 25 µl reaction volume and carried out at 94°C for 2 min, 30 cycles of 94°C for 1 min, 60°C for 1 min and 72°C for 1 min, and then 72°C for 5 min. Expression analysis of each gene was confirmed in at least three independent RT-reactions using forward and reverse primers, which are listed in Supplementary Table 1.

2.2 Genomic sequences

Predicting promoters is at least as difficult as predicting genes. The key is the identification of transcription start sites (TSSs). To predict TSSs, we combined an A.thaliana cDNA database and a software, TSSP (SoftBerry, http://www.softberry.com). As of January, 2004, there were 26 213 predicted A.thaliana genes (excluding pseudogenes and RNA genes) in GenBank, among which 12 359 (47%) had cDNA sequences (with annotated 5′-UTR >50 bp). For each gene, we retrieved a segment from the gene's start codon to 1500 bases upstream or its farthest 5′ annotated gene boundary. We then applied the TSSP software to each upstream sequence to identify TSS. When multiple TSSs were predicted on a gene, the one closest to the ORF was chosen. The TSSs that were >50 bases downstream of the cDNA start point were considered false positive. Overall, TSSs of 18 343 (70%) genes were predicted and used in further analysis and prediction, whereas 3133 (12%) genes with cDNA and 4737 (18%) genes without cDNA did not have TSSs predicted. Given a TSS, we retrieved the sequence from 350 bp upstream to 50 bp downstream of the TSS as the proximal promoter. The intron and exon regions were retrieved from The Arabidopsis Information Resource (TAIR) at http://www.arabidopsis.org

2.3 Microarray gene expression data

Data on gene transcription profiling were from Hoth et al. (2002), Kreps et al. (2002) and Seki et al. (2002a,b). We took the datasets in Seki et al. (2002a,b) for motif analysis. Using the promoter prediction method mentioned above, we obtained 366 promoters of unique genes upregulated after ABA-treatment and/or under abiotic stress (drought, cold and salinity) for motif analysis.

2.4 Scoring

In order to search for instances of a known degenerate motif in promoters, we scored each short sequence of length $w$ in a given sequence based on the motif and a background model. The degenerate motif $W$ of length $w$ is represented by a PWM $\Theta_W = (q_{i,b})$, where $q_{i,b}$ is the probability of finding base $b$ at position $i$ in the motif. The background model $B_m$ is an $m$-th order Markov model, which can be estimated from background sequences. The probability that a $w$-mer starting from the $j$-th position of a sequence is generated from the background model was calculated as

$$P(j \mid B_m) = \prod_{l=1}^{w} P(b_{j+l-1} \mid b_{j+l-2} \cdots b_{j+l-m-1}),$$

where $b_j$ is the $j$-th base of the sequence. The conditional probability $P(b_{j+l-1} \mid b_{j+l-2} \cdots b_{j+l-m-1})$ is the frequency of observing base $b_{j+l-1}$ following a particular $m$-mer $b_{j+l-2} \cdots b_{j+l-m-1}$ in background sequences. With a 0-th order Markov model, the conditional probability is reduced to $P(b_{j+l-1})$, which is the frequency of base $b_{j+l-1}$ in the background sequences. The probability that a $w$-mer was generated from the motif model $\Theta_W$ is

$$P(j \mid \Theta_W) = \prod_{l=1}^{w} q_{l,b_{j+l-1}},$$

where $q_{l,b_{j+l-1}}$ is the probability of seeing base $b_{j+l-1}$ at the $l$-th position of the motif. In our final model, we used a 0-th order model since it has the highest accuracy for $m$-th order models with $m$ ranging from 0 to 5.

Based on these two probabilities, a log-ratio score $A_{j,\Theta_W,B_m}$ was assigned to each position $j$ in the sequence, which is computed as

$$A_{j,\Theta_W,B_m} = \ln \frac{P(j \mid \Theta_W)}{P(j \mid B_m)}.$$
To score a sequence $S$ by a motif module consisting of two motifs $\Theta_M$ and $\Theta_N$ (e.g. ABRE and CE motifs), we considered all their possible positions $i$ and $j$ within $S$, as long as they did not overlap and were within a certain distance $d$. The combined score for positions $i$ and $j$ is the sum of the two log-ratio scores, and the highest combined score among all positions was assigned to the sequence:

$$A_{S,\Theta_M,\Theta_N,B_m} = \max_{i,j} \left( A_{i,\Theta_M,B_m} + A_{j,\Theta_N,B_m} \right).$$

3 RESULTS

In our research, we began with an analysis of the cis-elements, i.e. ABREs and CEs, of the target genes in A.thaliana. We then predicted candidate genes, and verified them using RT–PCR experiments and published microarray profiling results.

3.1 ABREs

To reiterate, abiotic stress, such as drought, cold and salinity, can trigger ABA. Many genes can be induced by ABA and/or one of these stress conditions. It is also known that ABA-responsive genes in rice, maize, barley and other cereals typically have ABREs and CEs as their determinant cis-elements. An ABRE for A.thaliana has also been identified (Zhu, 2002).

Our analysis showed that the ABREs are also significant cis-elements for genes responsive to abiotic stress in A.thaliana. In our analysis, we used the expression profiling results published in Hoth et al. (2002), Kreps et al. (2002) and Seki et al. (2002a,b). We first considered the promoters of the genes that are upregulated separately under ABA, cold, drought and salinity. MEME (Bailey and Elkan, 1994), which is one of the best motif finding algorithms (http://meme.sdsc.edu/meme/website/intro.html), was used to find statistically significant degenerate motifs from these four sets of promoters. They contain many common motifs. Specifically, ACGT-containing motifs, termed as G-box, are over-represented in all these sets of promoters. They are consistently ranked among the top motifs identified: they are the first for ABA-inducible genes, second for drought-responsive genes and third for both cold- and salinity-inducible genes. The ACGT-containing motifs from the three stress-induced genes are very similar to the ACGT-containing motif in ABA-induced genes. They were directly compared by a computer program CompareACE (downloadable at http://atlas.med.harvard.edu/download/). The similarity score used by CompareACE is the Pearson correlation coefficient between the nucleotide base frequencies of the alignment of two motifs. The scores vary between −1 and 1, where 1 means a perfect match. The three ABRE motifs from stress genes have similarity scores >0.99 as compared with those of the ABRE motif from ABA genes. The four ABRE motifs are shown in Supplementary Figure 1. Similar results were obtained using AlignACE motif finding algorithm (Hughes et al., 2000) (data not shown).

There are significant overlaps among the genes induced by ABA and one of the three stress conditions. We further analyzed the cis-elements in the set of genes exclusively induced by one of these stresses, resulting in three sets of genes not overlapping with ABA-inducible genes. The analysis of the motifs from these three sets of genes led to similar conclusions, except that the ACGT-containing motifs were now slightly degenerate and ranked slightly lower than in the previous analysis. They ranked as the third, second and third for the genes unique to cold, drought and salinity, respectively. Nevertheless, their similarities to the ACGT-containing motif from ABA genes were still significant, i.e. >0.98. These ACGT-containing motifs are shown in online Supplementary Figure 2.

This specificity analysis indicated that ABREs are also determinant cis-elements for stress-related transcription regulations. This implies that the genes responsive to ABA and abiotic stress are difficult to separate from one another merely based on these cis-elements. It is possible that other cis-elements cooperate with ABREs to differentiate various stress regulations, a topic beyond the scope of this paper.

3.2 CEs

Cis-elements are generally degenerate. Even though the ACGT-core in ABREs is well conserved, the flanking sequences beyond the ACGT-core vary. For example, a rice ABRE is CGTACGTGTC (Hobo et al., 1999), whereas an ABRE for maize is GACGTG (Busk et al., 1997) and an ABRE for A.thaliana is CCACGTGG (Zhu, 2002).

When compared with ABREs, CEs are less conserved (Busk and Pages, 1997; Hobo et al., 1999; Shen and Ho, 1997). The EM gene of rice has a CE (CE3) of GACGCGTGTC (Hobo et al., 1999); maize RAB28 has CE3 of ACGCGCCTCCTC (Busk et al., 1997), and barley HVA1 has CE3 of ACGCGTGTCCTC and HVA22 has CE1 of TGCCACCGG (Shen and Ho, 1997). These CEs are more diverged than ABREs, whereas the CE3s have a relatively conserved CGCG core. To our knowledge, no experiment has been done to characterize the CEs of A.thaliana.

To obtain an accurate, as well as degenerate, pair of ABRE and CE for predicting ABA-inducible and stress-inducible genes in A.thaliana, we combined computational methods with the knowledge of known motifs from other plants to obtain the following three types of degenerate ABRE and CE motifs. The first type is based on the experimentally identified ABRE and CE motifs. We constructed PWMs (Stormo, 2000) for ABREs based on motifs from rice [OSEM (Hattori et al., 2002), RAB16A/B/D (Mundy et al., 1990; Ono et al., 1996)], maize [RAB28 (Busk and Pages, 1997)], barley [HVA1 (Shen and Ho, 1997)], and A.thaliana [RD29A (Zhu, 2002)]. The resulting ABRE is MGTACGTGKC. To obtain CEs, we considered the ones from monocots (as no CE from A.thaliana is known), which resulted in CEs of GMCGCGTGKC. The logos for these ABRE and CE are shown in Figure 1a. Since we used information from monocots, we refer to this type of motifs as Monocots-based motifs for convenience. Note that the number of experimentally verified genes is small; hence the accuracy of the motifs may be low.

The second type of degenerate ABRE and CE motif comes from a refinement to the first type through an iterative procedure that combined the experimentally verified motifs, i.e. the first-type motifs, and the results from expression profiling. We applied a motif scan program, which we developed (see Section 2), to the 366 upregulated genes that were identified by gene expression profiling in Seki et al. (2002a,b). We explicitly took the first-type motifs as seeds to scan the promoters. Matched sequences were ranked by their scores (Section 2); the top 15 matched motifs were chosen to construct new PWMs. The new PWMs must be similar but not identical to the monocots-based PWMs. This step was repeated until the PWMs did not change or their specificity decreased. The refined motifs are SRTACGTGTC for ABRE and GACRCGTGKC for CE, respectively, whose sequence logos are shown in Figure 1b. Compared with the monocot-based motifs (Fig. 1a), there are substantial changes in the flanking sequences beyond the ACGT-core. Besides, A.thaliana seems to have a less conserved GCGT-core in its CE component; the first G in particular can be A about 30% of the time. As a result, the refined ABREs and CEs are similar to each other and become almost palindromic. We refer to these motifs as Arabidopsis-specific motifs.

A third type of ABRE and CE motifs was computationally inferred, as a reference, from the promoters of the 366 stress-responsive genes from microarray experiments under ABA treatment and abiotic stress (Seki et al., 2002a,b). Two motifs, the second and seventh, from the top 10 motifs produced by MEME, appear to be meaningful and the rest seem to be repeats. The motifs for ABREs and CEs are YKMCACGTGKC and MCGCGTCRNYYWCK, respectively, whose sequence logos are shown in Figure 1c. These two motifs are significantly different from the previous two types. For ABREs, the prefix before the ACGT-core is less conserved. The base immediately before the ACGT-core is a relatively conserved C, whereas it is T or A in the previous two types in Figure 1a and b. For CEs, the suffix sequence of the GCGT element does not at all match those in Figure 1a and b.

[Figure 1: sequence logos for panels (a), (b) and (c); y-axis in bits.]
Fig. 1. ABRE–CE modules, where ABREs are to the left and CEs to the right. (a) The monocot-based module was constructed using known ones from monocots; (b) the Arabidopsis-specific module was a refinement to the monocot-based module by a repeated search on A.thaliana promoters; (c) the MEME-derived module was inferred by MEME motif algorithm.

3.3 Motif modules as target gene indicators

The three types of motifs discussed above were constructed differently and have different degeneracies. One problem with the monocot-based motifs is that they may not be necessarily good indicators of ABA-inducible and stress-inducible genes in dicot A.thaliana. One problem with MEME-inferred motifs is that the ABREs and CEs were identified individually rather than as a module.

The Arabidopsis-specific motif module seems to be a good indicator for ABA-responsive and stress-responsive genes. To quantify its prediction power, we compared it with the other two modules, measured by the number of predicted genes that were also identified by microarray experiments (Hoth et al., 2002; Kreps et al., 2002; Seki et al., 2002a,b). The results are shown in Figure 2. The monocot-based module has the worst prediction accuracy, and the Arabidopsis-specific module is superior to the MEME-derived module, except for the first eight predictions. This result partially supports the iterative approach we used to obtain refined ABRE and CE motifs.

In the rest of this paper, we use the Arabidopsis-specific motif module (shown in Figure 1b) for our analysis and prediction.

[Figure 2: x-axis, number of top hits (0–250); y-axis, percent of verified hits; curves: Arabidopsis-specific, MEME-derived, Monocot-based.]
Fig. 2. Prediction accuracies of three ABRE–CE modules in Figure 1. The x-axis is the number of motif hits, ordered by their scores matching to the modules; the y-axis is the percentage of scored motifs that are expressed in Hoth et al. (2002), Kreps et al. (2002) and Seki et al. (2002a,b).

3.4 ABRE and DRE as coupling elements

A close examination of the CE in the Arabidopsis-specific module shows that it is very often a G-box (ACGT-core) or has a GCGT-core. The GCGT-core has a strong conservation in the CEs for many monocots, as shown in Figure 1a. It is also known that ABREs can act as CEs, and so can dehydration responsive elements (DREs) (Guiltinan et al., 1990; Hobo et al., 1999; Narusaka et al., 2003).

We examined the utilities of ABREs, DREs and GCGT-containing motifs as CEs in ABRE–CE module. The results in Figure 3 show that the prediction accuracies using DREs and GCGT-containing motifs as CEs are significantly lower than those with ACGT-core as CE. The figure indicates that unlike most monocots, which very often have GCGT-core in their CEs, A.thaliana tends to have ABREs as CEs. The results also show that DREs are less effective as CEs than ABREs. One reason may be that the DREs have only six bases (RCCGAC), which may give rise to a relatively large number of false positive matches.

[Figure 3: x-axis, number of top hits (0–250); y-axis, percent of verified hits; curves: ABRE–CE (ACGT), ABRE–CE, ABRE–CE (GCGT), ABRE–DRE.]
Fig. 3. Prediction accuracies using ABREs (ACGT-core), GCGT-containing motifs and DREs as CEs. ABRE–CE is Arabidopsis-specific module. ABRE–CE(ACGT), ABRE–CE(GCGT) and ABRE–DRE use ABREs, GCGT-containing motifs and DREs as CEs, respectively. The x-axis and y-axis are the same as those of Figure 2.

3.5 Where ABRE–CE module locate

Before applying the Arabidopsis-specific ABRE–CE module, we need to assess if this module is indeed unique to ABA-inducible and stress-inducible genes. For this purpose, we analyzed the distributions of high-quality matches of the module in various regions of A.thaliana genome. We considered the promoters, the intron regions and the second exon regions of all genes, and the promoters of the upregulated and downregulated genes under ABA and stress conditions (Hoth et al., 2002; Kreps et al., 2002; Seki et al., 2002a,b). We also included randomly constructed sequences using the frequencies of nucleotides in all promoters, to evaluate the possibility that the module appears by chance. The results in Figure 4 show that the ABRE–CE module occurs more often in the promoters of the target genes than in the coding and other non-coding regions.

[Figure 4: x-axis, score threshold (12–26); y-axis, number of hits / kb; curves: exon, intron, random promoter, stress_down, stress_up, all promoter.]
Fig. 4. Location specificities of ABRE–CE module in second exon regions, intron regions, random promoters, promoters of downregulated (stress_down) and upregulated (stress_up) genes identified in Hoth et al. (2002), Kreps et al. (2002) and Seki et al. (2002a,b), and all promoters.

3.6 Prediction and experimental verification

Using the Arabidopsis-specific ABRE–CE module, we detected a large number of putative ABA-responsive and stress-responsive genes. We closely examined the highest scored 40 predictions, listed in Table 1. We tested 27 genes using RT–PCR on 10-day-old seedlings (Section 2). Among these 27 genes, 17 (63.0%) are verified as upregulated, 3 (11.1%) have no transcripts detected and 7 (25.9%) have no significant expression change. Some of the RT–PCR results are shown in Supplementary Figure 3. In combination with the results from previously published microarray results (Hoth et al., 2002; Kreps et al., 2002; Seki et al., 2002a,b), we found that 27 of the top 40 genes were confirmed as ABA and/or abiotic-stress responsive, giving a prediction accuracy of 67.5%.

Three genes (At3g33131, At5g52290 and At1g65200) from the set of 27 genes were not detected by RT–PCR. Based on the GenBank annotation as of May 2004, two of these three genes (At3g33131 and At5g52290) were annotated as hypothetical proteins. In addition, seven genes (At1g79040, At1g77450, At1g28530, At1g52230, At2g18700, At4g21270 and At4g21280) did not show significant expression changes in the RT–PCR experiment. It is possible that these genes express in different developmental stages or tissues other than the conditions of the RT–PCR experiments. Indeed, one of these genes (At1g77450) was verified as upregulated in Hoth et al. (2002). One of the differences between the experimental conditions used in Hoth et al. (2002) and our RT–PCR is the age of the seedlings [4-week-old seedlings in Hoth et al. (2002) versus 10-day-old seedlings in our experiments]. Combining these observations, the 67.5% prediction accuracy is, apparently, a lower bound on the prediction accuracy.

Although we used the degenerate ABRE–CE module, which allows an ACGT component in the CE motif, only one gene among the top 40 predictions actually has the GCGT-core in its CE. In other words, most of the top candidates match to ABRE–ABRE module well, in agreement with our analysis of CEs in Section 3.4.

Table 1. Top 40 predicted ABA-inducible and stress-inducible genes in A.thaliana and their experimental verification

Accession no. | ABRE | CE | Strand | Function annotation in GenBank | Verification
At5g07920 | cctacgtggc | ggcacgtggc | + | Diacylglycerol kinase (ATDGK1) | 4
At1g79040 | cctacgtggc | gccacgtgtc | + | Photosystem II polypeptide-related | −1
At4g37220 | cctacgtggc | gccacgtgtc | + | Cold acclimation protein homolog | 1
At3g33131 | cgaacgtgtc | gacgcgtggc | + | Hypothetical protein | −2
At5g24155 | cacacgtggc | gccacgtggc | − | Squalene monooxygenase | 4
At2g44660 | gctacgtggc | gacacgtggc | + | ALG6, ALG8 glycosyltransferase family | 4
At4g12680 | tgtacgtggc | gacacgtggc | − | Expressed protein | 4
At5g66580 | agaacgtggc | gccacgtggc | − | Expressed protein | 1
At5g50360 | cgcacgtggc | gccacgtctc | + | Expressed protein | 4
At5g51210 | cgtacgtgtc | gacacgtgac | + | Glycine-rich protein oleosin | 4
At1g77450 | cgaacgtgtc | gccacgtgtc | + | No apical meristem (NAM) protein family | 1, −1
At1g17120 | cgaacgtggc | gtcacgtggc | + | Amino acid permease family protein |
At5g52290 | cgtacgtgtc | gagacgtggc | − | Hypothetical protein | −2
At5g52300 | cgtacgtgtc | gagacgtggc | + | Desiccation-responsive protein 29B (RD29B) | 2, 3, 4
At5g58650 | catacgtggc | gacacgtgtc | + | Expressed protein |
At2g38820 | ggtacgtgtc | ggcacgtgtc | − | Expressed protein | 2, 4
At1g28530 | tgcacgtgtc | gccacgtggc | − | Expressed protein | −1
At1g28540 | tgcacgtgtc | gccacgtggc | + | Expressed protein |
At1g54130 | gccacgtggc | gacacgtgtc | − | RSH3 (RelA/SpoT homolog) | 4
At1g32550 | tctacgtggc | gacacgtggc | − | Ferredoxin family protein | 4
At1g32560 | tctacgtggc | gacacgtggc | + | LEA group 1 protein | 3
At1g52220 | tccacgtggc | gccacgtggc | − | Expressed protein | 1
At1g52230 | tccacgtggc | gccacgtggc | + | Photosystem I subunit VI precursor | −1
At4g21270 | accacgtgtc | gccacgtggc | + | Kinesin-like protein A (katA) | −1
At4g21280 | accacgtgtc | gccacgtggc | − | Oxygen-evolving enhancer protein 3 (PSBQ) | −1
At3g03680 | ccgacgtggc | gtcacgtggc | + | C2 domain-containing protein | 4
At2g22240 | ccaacgtgtc | gccacgtgtc | + | Myo-inositol 1-phosphate synthase-related | 2
At3g19590 | cccacgtgtc | gccacgtgac | − | Mitotic checkpoint protein | 4
At1g58520 | acaacgtggc | gacacgtggc | + | Early-responsive to dehydration (ERD4) | 4
At3g18290 | agcacgtggc | ggcacgtgac | − | Zinc finger protein-related | 1
At1g65200 | cgtacgtgac | gtcacgtggc | + | Ubiquitin carboxyl-terminal hydrolase-related | −2
At2g18700 | gaaacgtggc | gccacgtggc | − | Glycosyltransferase family 20 | −1
At2g36270 | cacacgtgtc | gacacgtgtc | + | ABA insensitive 5 (ABI5) | 4
At1g02660 | ctgacgtggc | gccacgtgtc | + | Lipase (class 3) family | 1
At1g02670 | ctgacgtggc | gccacgtgtc | − | DNA repair protein, putative | 4
At1g74450 | agcacgtgga | gccacgtggc | − | Expressed protein | 4
At5g05220 | caaacgtgtc | gacacgtggc | + | Expressed protein | 3
At3g62260 | gccacgtgtc | gacacgtgtc | + | Protein phosphatase 2C (PP2C) |
At5g65890 | tccacgtgtc | gccacgtggc | − | ACT domain-containing protein (ACR1) |
At5g62490 | aacacgtgtc | gccacgtggc | − | ABA-responsive protein (HVA22b) | 3, 4

Verification codes: 1: Hoth et al. microarray (Hoth et al., 2002); 2: Kreps et al. microarray (Kreps et al., 2002); 3: Seki et al. microarray (Seki et al., 2002a,b); 4: upregulated, verified by RT–PCR; −1: no ABA response detected by RT–PCR; −2: no transcripts detected by RT–PCR.

3.7 Where ABREs and CEs locate

It is worthwhile to know the locations of ABRE and CE within promoters. We examined two position statistics. The first is the gap between the ABRE and CE in a module. Figure 5a shows the results in the promoters of all genes and of the ABA-inducible and stress-inducible genes identified in Hoth et al. (2002), Kreps et al. (2002) and Seki et al. (2002a,b) whose promoters contain these ABRE–CE modules. The gaps are typically <150 bases in these genes, although a few gaps beyond 150 bases exist. The most likely gap between these two elements is ∼40–50 bases.

The second statistic is the start position of an ABRE–CE module from the transcription start site (TSS) of a gene. The results on all genes and ABA-inducible and stress-inducible genes are shown in Figure 5b. ABRE–CE modules are usually within 200 bases from TSSs; the majority of them are <120 bases from TSSs. Note that a few ABRE–CE modules start within 5′-UTRs.

3.8 Functional categories

The functions of ABA-responsive genes are diverse, reflected by the large number of functional categories these genes may be involved in. We carried out a functional analysis on two sets of genes, the ABA-inducible and stress-inducible genes reported in Hoth et al. (2002), Kreps et al. (2002) and Seki et al. (2002a,b), and the top 150 putative genes predicted in our study. This was done using the MIPS functional category classification from http://mips.gsf.de/projects/plants

Among the 1825 stress-inducible genes from the microarray experiments, 1530 (83.8%) can be assigned to at least one functional category. Among our top 150 predicted genes, 126 (84.0%) have a functional category. Moreover, these two sets of genes have similar distributions across a wide range of functional categories, as depicted in Supplementary Figures 4 and 5. Except the unclassified proteins, the three largest categories are transcription, metabolism and binding proteins. This result suggests that there may be a lot of gene regulation activities after ABA treatment and stress.

[...] huge amount of previous experimental efforts devoted to identifying ABA-responsive and stress-responsive genes and elucidating their regulatory mechanisms.

The idea of using cis-elements to identify genes responsive to certain stimuli is intuitive, and was pursued in at least two previous studies. The first study was reported in Markstein et al. (2002), on a genome-wide analysis of the binding sites for Dorsal, one of the best-characterized sequence-specific TFs in Drosophila. It was known that many Dorsal targeted genes contain a cluster of multiple Dorsal binding sites in a small vicinity of their promoter regions. Using the 20 known Dorsal binding motifs, fifteen promoters that contain clusters of Dorsal binding motifs were identified from Drosophila genome.
Among the fifteen genes, three are known Dorsal target genes. Using in situ localization assays, two other genes were shown to be upreg- ulated in the presumptive mesoderm of early embryos, leading to a total prediction accuracy of ∼34% (5 positive of 15 putative ones). 0 50 100 150 200 250 300 350 The second study was on interneurons called AIY in Caenorhabditis gap length elegans (Wenick and Hobert, 2004). Using the newly sequenced Caenorhabditis briggsae genome, another nematode diverged from (b) 35 C.elegans ∼70–100 million years ago. Wenick and Hobert found all genes stress inducible genes eight genes in AIY in C.elegans that are also conserved in C.briggsae. Using a standard promoter–dissection approach, they discovered cis- elements that are necessary and sufficient for AIY transcriptome. Moreover, they carried out a genome-wide screening using the dis- covered cis-elements to predict genes in C.elegans that may express in AIY. They experimentally tested 15 of the top 26 predictions and confirmed 14 of them expressed in AIY, giving a prediction accuracy of 14/15 or 94%. Overall, the verified AIY hit rate was 41 of 57 or 72%. As a comparison, we achieved a comparable prediction accuracy of 67.5% for the top 40 predictions in our study. Our approach in this paper and the approach taken in Markstein et al. (2002) and Wenick and Hobert (2004) are similar and comple- ment one another. These studies used similar genome-wide search strategies to predict genes that may have certain expression pro- 350 300 250 200 150 100 50 0 50 files. They did not make strong assumptions about where within binding site start position promoters the motif matches should be. However, these approaches differ in the way that cis-elements were derived. Markstein seemed Fig. 5. Position statistics of Arabidopsis-specific ABRE–CE modules. to use exact known Dorsal binding motifs in the analysis. Wen- (a) Distribution of the gaps between ABREs and CEs; (b) distribution of ick et al. 
relied on the conservation of genes in C.elegans and the start positions of ABRE–CE modules relative to TSSs. C.briggsae, to infer cis-elements that are characteristic to the tar- get genes. We obtained cis-elements based on previous experimental analyses. 4 DISCUSSION In summary, based on the previous studies and the results in We advocated and investigated a cis-element based method for find- this paper, it is evident that the cis-element based targeted gene ing genes that are responsive to certain conditions, which can be finding approach is effective and general; it has a high prediction referred to as targeted gene finding. This method is orthogonal and accuracy and is applicable to different organisms and different type complementary to conventional approaches to gene finding and func- of genes. With more information of TFs and their DNA binding tion annotation that are based on the conservation of ORFs. The information becoming freely available, including those in TRANS- cis-element based targeted gene finding method explicitly utilizes FAC database, we expect this cost-effective and accurate approach the information of transcription regulations. By exploiting exper- to be widely applied to various targeted gene finding problems in the imentally verified cis-elements in a genome-wide screening, it future. naturally combines the fidelity of gene functions elucidated in exper- imental analyses with computational efficiency of a genome-scale ACKNOWLEDGEMENTS search. In this study, we focused on ABA-responsive and abiotic stress- This research was supported in part by NSF grant EIA-0113618 and responsive genes in A.thaliana and their cis-elements, i.e. ABREs a grant from Monsanto Corporation to W.Z. and in part by a grant and CEs. By employing the experimentally identified cis-elements, from Monsanto Corporation to R.S.Q. We thank the other members we are able to leverage genome-wide targeted gene finding with a of W.Z. 
and R.S.Q.’s groups for the helpful discussions.

REFERENCES

Bailey,T. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the 2nd ISMB Conference, Palo Alto, CA, pp. 28–36.
Brivanlou,A. and Darnell,J. (2002) Signal transduction and the control of gene expression. Science, 295, 813–818.
Busk,P. and Pages,M. (1997) Protein binding to the abscisic acid-responsive element is independent of viviparous1 in vivo. Plant Cell, 9, 2261–2270.
Busk,P. et al. (1997) Regulatory elements in vivo in the promoter of the abscisic acid responsive gene rab17 from maize. Plant J., 11, 1285–1295.
Carles,C. et al. (2002) Regulation of Arabidopsis thaliana Em genes: role of ABI5. Plant J., 30, 373–383.
Choi,H.I. et al. (2000) ABFs, a family of ABA-responsive element binding factors. J. Biol. Chem., 275, 1723–1730.
Finkelstein,R. et al. (2002) Abscisic acid signaling in seeds and seedlings. Plant Cell, (suppl.), S15–S45.
Guiltinan,M. et al. (1990) A plant leucine zipper protein that recognizes an abscisic acid response element. Science, 250, 267–271.
Harbison,C. et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99–104.
Hattori,T. et al. (1995) Regulation of the Osem gene by abscisic acid and the transcriptional activator VP1: analysis of cis-acting promoter elements required for regulation by abscisic acid and VP1. Plant J., 7, 913–925.
Hattori,T. et al. (2002) Experimentally determined sequence requirement of ACGT-containing abscisic acid response element. Plant Cell Physiol., 43, 136–140.
Higo,K. et al. (1999) Plant cis-acting regulatory DNA elements (PLACE) database. Nucleic Acids Res., 27, 297–300.
Hobo,T. et al. (1999) ACGT-containing abscisic acid response element (ABRE) and coupling element 3 (CE3) are functionally equivalent. Plant J., 19, 679–689.
Hoth,S. et al. (2002) Genome-wide gene expression profiling in Arabidopsis thaliana reveals new targets of abscisic acid and largely impaired gene regulation in the abi1-1 mutant. J. Cell Sci., 115, 4891–4900.
Hughes,J. et al. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol., 296, 1205–1214.
Kreps,J. et al. (2002) Transcriptome changes for Arabidopsis in response to salt, osmotic, and cold stress. Plant Physiol., 130, 2129–2141.
Lander,E. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
Lescot,M. et al. (2002) PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences. Nucleic Acids Res., 30, 325–327.
Marcotte,W. et al. (1989) Abscisic acid-responsive sequence from the Em gene of wheat. Plant Cell, 1, 969–976.
Markstein,M. et al. (2002) Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl Acad. Sci. USA, 22, 763–768.
Matys,V. et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res., 31, 374–378.
Mundy,J. et al. (1990) Nuclear proteins bind conserved elements in the abscisic acid-responsive promoter of a rice rab gene. Proc. Natl Acad. Sci. USA, 87, 1406–1410.
Narusaka,Y. et al. (2003) Interaction between two cis-acting elements, ABRE and DRE, in ABA-dependent expression of Arabidopsis rd29A gene in response to dehydration and high-salinity stresses. Plant J., 34, 137–148.
Oeda,K. et al. (1991) A tobacco bZip transcription activator (TAF-1) binds to a G-box-like motif conserved in plant genes. EMBO J., 10, 1793–1802.
Ono,A. et al. (1996) The rab16b promoter of rice contains two distinct abscisic acid-responsive elements. Plant Physiol., 112, 483–491.
Seki,M. et al. (2002a) Monitoring the expression pattern of around 7000 Arabidopsis genes under ABA treatments using a full-length cDNA microarray. Funct. Integr. Genomics, 2, 282–291.
Seki,M. et al. (2002b) Monitoring the expression profiles of 7000 Arabidopsis genes under drought, cold and high-salinity stresses using a full-length cDNA microarray. Plant J., 31, 279–292.
Shen,Q. and Ho,T.H. (1997) Promoter switches specific for abscisic acid (ABA)-induced gene expression in cereals. Physiol. Plantarum, 101, 653–664.
Shen,Q. et al. (1996) Modular nature of abscisic acid (ABA) response complexes: composite promoter units that are necessary and sufficient for ABA induction of gene expression in barley. Plant Cell, 8, 1107–1119.
Shinozaki,K. and Yamaguchi-Shinozaki,K. (2000) Molecular responses to dehydration and low temperature: differences and cross-talk between two stress signaling pathways. Curr. Opin. Plant Biol., 3, 217–223.
Stormo,G. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 16–23.
The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.
Wenick,A. and Hobert,O. (2004) Genomic cis-regulatory architecture and trans-acting regulators of a single interneuron-specific gene battery in C.elegans. Dev. Cell, 6, 757–770.
Xu,D. et al. (1996) Expression of a late embryogenesis abundant protein gene, HVA1, from barley confers tolerance to water deficit and salt stress in transgenic rice. Plant Physiol., 110, 249–257.
Yu,J. et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science, 296, 79–91.
Zhang,M. (2002) Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet., 3, 698–709.
Zhu,J.K. (2002) Salt and drought stress signal transduction in plants. Annu. Rev. Plant Biol., 53, 247–273.
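
As a rough illustration of the position statistics reported with Figure 5 (the gap between an ABRE and a CE, and the module start relative to the TSS), the following Python sketch locates one ABRE/CE pair in a promoter and reports both quantities. It is a simplification under stated assumptions, not the scoring procedure used in the paper: exact string matching stands in for the PWM-based module score, strand is ignored, the 150-base gap limit is taken from the "typically <150 bases" observation, and the promoter sequence, the TSS index and the function names `find_all` and `module_hits` are hypothetical. The two motif instances are the ABRE and CE listed for At5g52300 (RD29B) in Table 1.

```python
# Toy illustration only: exact-string matching stands in for the paper's
# PWM-based scoring, and the promoter sequence below is made up.


def find_all(seq, motif):
    """Return every 0-based start index of motif in seq (overlaps allowed)."""
    hits = []
    start = seq.find(motif)
    while start != -1:
        hits.append(start)
        start = seq.find(motif, start + 1)
    return hits


def module_hits(promoter, abre, ce, tss_index, max_gap=150):
    """List non-overlapping ABRE/CE pairs whose gap is at most max_gap bases.

    Each hit reports the gap between the two elements and the offset of the
    module start relative to the TSS (negative means upstream).
    """
    hits = []
    for a in find_all(promoter, abre):
        for c in find_all(promoter, ce):
            # Whichever element starts first defines the module start.
            if a < c:
                first_start, first_len = a, len(abre)
            else:
                first_start, first_len = c, len(ce)
            second_start = max(a, c)
            gap = second_start - (first_start + first_len)
            if 0 <= gap <= max_gap:
                hits.append({"abre": a, "ce": c, "gap": gap,
                             "start_rel_tss": first_start - tss_index})
    return hits


ABRE = "cgtacgtgtc"  # ABRE instance listed for At5g52300 (RD29B) in Table 1
CE = "gagacgtggc"    # CE instance listed for the same gene in Table 1
TSS = 350            # hypothetical: TSS placed 350 bases into a 400-base window

# Hypothetical promoter: filler bases around one ABRE and one CE, 45 bases apart.
toy_promoter = "a" * 230 + ABRE + "t" * 45 + CE + "a" * 105

print(module_hits(toy_promoter, ABRE, CE, TSS))
# [{'abre': 230, 'ce': 285, 'gap': 45, 'start_rel_tss': -120}]
```

In this toy case the module has a 45-base gap and starts 120 bases upstream of the assumed TSS. A real screen would replace the exact matches with degenerate motif scores and scan both strands.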

Cis-regulatory element based targeted gene finding: genome-wide identification of abscisic acid- and abiotic stress-responsive genes in Arabidopsis thaliana

Bioinformatics, Volume 21 (14): 8 – May 12, 2005


References (38)

Publisher
Oxford University Press
Copyright
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bti490
pmid
15890746

Abstract

Motivation: A fundamental problem of computational genomics is identifying the genes that respond to certain endogenous cues and environmental stimuli. This problem can be referred to as targeted gene finding. Since gene regulation is mainly determined by the binding of transcription factors and cis-regulatory DNA sequences, most existing gene annotation methods, which exploit the conservation of open reading frames, are not effective in finding target genes.

Results: A viable approach to targeted gene finding is to exploit the cis-regulatory elements that are known to be responsible for the transcription of target genes. Given such cis-elements, putative target genes whose promoters contain the elements can be identified. As a case study, we apply the above approach to predict the genes in model plant Arabidopsis thaliana which are inducible by a phytohormone, abscisic acid (ABA), and abiotic stress, such as drought, cold and salinity. We first construct and analyze two ABA specific cis-elements, ABA-responsive element (ABRE) and its coupling element (CE), in A.thaliana, based on their conservation in rice and other cereal plants. We then use the ABRE–CE module to identify putative ABA-responsive genes in A.thaliana. Based on RT–PCR verification and the results from literature, this method has an accuracy rate of 67.5% for the top 40 predictions. The cis-element based targeted gene finding approach is expected to be widely applicable since a large number of cis-elements in many species are available.

Contact: zhang@cse.wustl.edu

Supplementary information: Supplementary data for this paper are available at Bioinformatics online.
ABRE– intron regions, random promoters, promoters of downregulated (stress_down) CE(ACGT), ABRE–CE(GCGT) and ABRE–DRE use ABREs, GCGT- and upregulated (stress_up) genes identified in Hoth et al. (2002), Kreps et al. containing motifs and DREs as CEs, respectively. The x-axis and y-axis (2002) and Seki et al. (2002a,b), and all promoters. are the same as those of Figure 2. ABA and/or abiotic-stress responsive, giving a prediction accuracy often have GCGT-core in their CEs, A.thaliana tends to have ABREs of 67.5%. as CEs. The results also show that DREs are less effective as CEs Three genes (At3g33131, At5g52290 and At1g65200) from the set than ABREs. One reason may be that the DREs have only six bases of 27 genes were not detected by RT–PCR. Based on the GenBank (RCCGAC), which may give rise to a relatively large number of false annotation as of May 2004, two of these three genes (At3g33131 positive matches. and At5g52290) were annotated as hypothetical proteins. In addi- tion, seven genes (At1g79040, At1g77450, At1g28530, At1g52230, 3.5 Where ABRE–CE module locate At2g18700, At4g21270 and At4g21280) did not show significant Before applying the Arabidopsis-specific ABRE–CE module, we expression changes in the RT–PCR experiment. It is possible that need to assess if this module is indeed unique to ABA-inducible these genes express in different developmental stages or tissues other and stress-inducible genes. For this purpose, we analyzed the distri- than the conditions of the RT–PCR experiments. Indeed, one of butions of high-quality matches of the module in various regions these genes (At1g77450) was verified as upregulated in Hoth et al. of A.thaliana genome. We considered the promoters, the intron (2002). One of the differences between the experimental conditions regions and the second exon regions of all genes, and the pro- used in Hoth et al. 
(2002) and our RT–PCR is the age of the seed- moters of the upregulated and downregulated genes under ABA lings [4-week-old seedlings in Hoth et al. (2002)] versus 10-day-old and stress conditions (Hoth et al., 2002; Kreps et al., 2002; Seki seedlings in our experiments). Combining these observations, the et al., 2002a,b). We also included randomly constructed sequences 67.5% prediction accuracy is, apparently, a lower bound on the using the frequencies of nucleotides in all promoters, to evaluate prediction accuracy. the possibility that the module appears by chance. The results in Although we used the degenerate ABRE–CE module, which Figure 4 show that the ABRE–CE module occurs more often in the allows an ACGT component in the CE motif, only one gene among promoters of the target genes than in the coding and other non-coding the top 40 predictions actually has the GCGT-core in its CE. In other regions. words, most of the top candidates match to ABRE–ABRE module well, in agreement with our analysis of CEs in Section 3.4. 3.6 Prediction and experimental verification Using the Arabidopsis-specific ABRE–CE module, we detected 3.7 Where ABREs and CEs locate a large number of putative ABA-responsive and stress-responsive genes. We closely examined the highest scored 40 predictions, It is worthwhile to know the locations of ABRE and CE within pro- listed in Table 1. We tested 27 genes using RT–PCR on 10-day- moters. We examined two position statistics. The first is the gap old seedlings (Section 2). Among these 27 genes, 17 (63.0%) between the ABRE and CE in a module. Figure 5a shows the results are verified as upregulated, 3 (11.1%) have no transcripts detec- in the promoters of all genes and ABA-inducible and stress-inducible ted and 7 (25.9%) have no significant expression change. Some genes identified in (Hoth et al., 2002; Kreps et al., 2002; Seki et al., of the RT–PCR results are shown in Supplementary Figure 3. 
In 2002a,b) whose promoters contain these ABRE–CE module. The combination with the results from previously published microar- gaps are typically <150 bases in these genes, although a few gaps ray results (Hoth et al., 2002; Kreps et al., 2002; Seki et al., beyond 150 bases exist. The most possible gap between these two 2002a,b), we found that 27 of the top 40 genes were confirmed as elements are ∼40–50 bases. percent of verified hits number of hits / kb Cis-regulatory element based targeted gene finding Table 1. Top 40 predicted ABA-inducible and stress-inducible genes in A.thaliana and their experimental verification Assession no. ABRE CE Strand Function annotation in GenBank Verification At5g07920 cctacgtggc ggcacgtggc + Diacylglycerol kinase (ATDGK1) 4 At1g79040 cctacgtggc gccacgtgtc + Photosystem II polypeptide-related −1 At4g37220 cctacgtggc gccacgtgtc + Cold acclimation protein homolog 1 At3g33131 cgaacgtgtc gacgcgtggc + Hypothetical protein −2 At5g24155 cacacgtggc gccacgtggc − Squalene monooxygenase 4 At2g44660 gctacgtggc gacacgtggc + ALG6, ALG8 glycosyltransferase family 4 At4g12680 tgtacgtggc gacacgtggc − Expressed protein 4 At5g66580 agaacgtggc gccacgtggc − Expressed protein 1 At5g50360 cgcacgtggc gccacgtctc + Expressed protein 4 At5g51210 cgtacgtgtc gacacgtgac + Glycine-rich protein oleosin 4 At1g77450 cgaacgtgtc gccacgtgtc + No apical meristem (NAM) protein family 1,-1 At1g17120 cgaacgtggc gtcacgtggc + Amino acid permease family protein At5g52290 cgtacgtgtc gagacgtggc − Hypothetical protein −2 At5g52300 cgtacgtgtc gagacgtggc + Desiccation-responsive protein 29B (RD29B) 2, 3, 4 At5g58650 catacgtggc gacacgtgtc + Expressed protein At2g38820 ggtacgtgtc ggcacgtgtc − Expressed protein 2, 4 At1g28530 tgcacgtgtc gccacgtggc − Expressed protein −1 At1g28540 tgcacgtgtc gccacgtggc + Expressed protein At1g54130 gccacgtggc gacacgtgtc − RSH3 (RelA/SpoT homolog) 4 At1g32550 tctacgtggc gacacgtggc − Ferredoxin family protein 4 At1g32560 tctacgtggc gacacgtggc + LEA group 
1 protein 3 At1g52220 tccacgtggc gccacgtggc − Expressed protein 1 At1g52230 tccacgtggc gccacgtggc + Photosystem I subunit VI precursor −1 At4g21270 accacgtgtc gccacgtggc + Kinesin-like protein A (katA) −1 At4g21280 accacgtgtc gccacgtggc − Oxygen-evolving enhancer protein 3 (PSBQ) −1 At3g03680 ccgacgtggc gtcacgtggc + C2 domain-containing protein 4 At2g22240 ccaacgtgtc gccacgtgtc + Myo-inositol 1-phosphate synthase-related 2 At3g19590 cccacgtgtc gccacgtgac − Mitotic checkpoint protein 4 At1g58520 acaacgtggc gacacgtggc + Early-responsive to dehydration (ERD4) 4 At3g18290 agcacgtggc ggcacgtgac − Zinc finger protein-related 1 At1g65200 cgtacgtgac gtcacgtggc + Ubiquitin carboxyl-terminal hydrolase-related −2 At2g18700 gaaacgtggc gccacgtggc − Glycosyltransferase family 20 −1 At2g36270 cacacgtgtc gacacgtgtc + ABA insensitive 5 (ABI5) 4 At1g02660 ctgacgtggc gccacgtgtc + Lipase (class 3) family 1 At1g02670 ctgacgtggc gccacgtgtc − DNA repair protein, putative 4 At1g74450 agcacgtgga gccacgtggc − Expressed protein 4 At5g05220 caaacgtgtc gacacgtggc + Expressed protein 3 At3g62260 gccacgtgtc gacacgtgtc + Protein phosphatase 2C (PP2C) At5g65890 tccacgtgtc gccacgtggc − ACT domain-containing protein (ACR1) At5g62490 aacacgtgtc gccacgtggc − ABA-responsive protein (HVA22b) 3, 4 1: Hoth et al. microarray (Hoth et al., 2002); 2: Kreps et al. microarray (Kreps et al., 2002); 3: Seki et al. microarray (Seki et al., 2002a,b); 4: Upregulated, verified by RT–PCR; −1: no ABA response detected by RT–PCR; −2: no transcripts detected by RT–PCR. The second statistic is the start position of an ABRE–CE module 150 putative genes predicted in our study. This was done using from the transcription start site (TSS) of a gene. The results on all the MIPS functional category classification from http://mips.gsf.de/ genes and ABA-inducible and stress-inducible genes are shown in projects/plants Figure 5b. 
ABRE–CE modules are usually within 200 bases from Among the 1825 stress-inducible genes from the microarray exper- TSSs; the majority of them are <120 bases from TSSs. Note that a iments, 1530 (83.8%) can be assigned to at least one functional few ABRE–CE modules start within 5 -UTRs. category. Among our top 150 predicted genes, 126 (84.0%) have a functional category. Moreover, these two sets of genes have 3.8 Functional categories similar distributions across a wide range of functional categor- ies, as depicted in Supplementary Figures 4 and 5. Except the The function of ABA-responsive genes are diverse, reflected by the unclassified proteins, the three largest categories are transcription, large number of functional categories these genes may be involved metabalism and binding proteins. This result suggests that there in. We carried out a functional analysis on two sets of genes, the may be a lot of gene regulation activities after ABA treatment and ABA-inducible and stress-inducible genes reported in Hoth et al. stress. (2002), Kreps et al. (2002) and Seki et al. (2002a,b), and the top 3079 W.Zhang et al. (a) 60 huge amount of previous experimental efforts devoted to identifying all genes ABA-responsive and stress-responsive genes and elucidating their stress–inducible genes regulatory mechanisms. The idea of using cis-elements to identify genes responsive to cer- tain stimuli is intuitive, and was pursued in at least two previous studies. The first study was reported in Markstein et al. (2002), on a genome-wide analysis of the binding sites for Dorsal, one of the best- characterized sequence-specific TFs in Drosophila. It was known that many Dorsal targeted genes contain a cluster of multiple Dorsal binding sites in a small vicinity of their promoter regions. Using the 20 known Dorsal binding motifs, fifteen promoters that contains clusters of Dorsal binding motifs were identified from Drosophila genome. 
Among the fifteen genes, three are known Dorsal target genes. Using in situ localization assays, two other genes were shown to be upreg- ulated in the presumptive mesoderm of early embryos, leading to a total prediction accuracy of ∼34% (5 positive of 15 putative ones). 0 50 100 150 200 250 300 350 The second study was on interneurons called AIY in Caenorhabditis gap length elegans (Wenick and Hobert, 2004). Using the newly sequenced Caenorhabditis briggsae genome, another nematode diverged from (b) 35 C.elegans ∼70–100 million years ago. Wenick and Hobert found all genes stress inducible genes eight genes in AIY in C.elegans that are also conserved in C.briggsae. Using a standard promoter–dissection approach, they discovered cis- elements that are necessary and sufficient for AIY transcriptome. Moreover, they carried out a genome-wide screening using the dis- covered cis-elements to predict genes in C.elegans that may express in AIY. They experimentally tested 15 of the top 26 predictions and confirmed 14 of them expressed in AIY, giving a prediction accuracy of 14/15 or 94%. Overall, the verified AIY hit rate was 41 of 57 or 72%. As a comparison, we achieved a comparable prediction accuracy of 67.5% for the top 40 predictions in our study. Our approach in this paper and the approach taken in Markstein et al. (2002) and Wenick and Hobert (2004) are similar and comple- ment one another. These studies used similar genome-wide search strategies to predict genes that may have certain expression pro- 350 300 250 200 150 100 50 0 50 files. They did not make strong assumptions about where within binding site start position promoters the motif matches should be. However, these approaches differ in the way that cis-elements were derived. Markstein seemed Fig. 5. Position statistics of Arabidopsis-specific ABRE–CE modules. to use exact known Dorsal binding motifs in the analysis. Wen- (a) Distribution of the gaps between ABREs and CEs; (b) distribution of ick et al. 
relied on the conservation of genes in C.elegans and the start positions of ABRE–CE modules relative to TSSs. C.briggsae, to infer cis-elements that are characteristic to the tar- get genes. We obtained cis-elements based on previous experimental analyses. 4 DISCUSSION In summary, based on the previous studies and the results in We advocated and investigated a cis-element based method for find- this paper, it is evident that the cis-element based targeted gene ing genes that are responsive to certain conditions, which can be finding approach is effective and general; it has a high prediction referred to as targeted gene finding. This method is orthogonal and accuracy and is applicable to different organisms and different type complementary to conventional approaches to gene finding and func- of genes. With more information of TFs and their DNA binding tion annotation that are based on the conservation of ORFs. The information becoming freely available, including those in TRANS- cis-element based targeted gene finding method explicitly utilizes FAC database, we expect this cost-effective and accurate approach the information of transcription regulations. By exploiting exper- to be widely applied to various targeted gene finding problems in the imentally verified cis-elements in a genome-wide screening, it future. naturally combines the fidelity of gene functions elucidated in exper- imental analyses with computational efficiency of a genome-scale ACKNOWLEDGEMENTS search. In this study, we focused on ABA-responsive and abiotic stress- This research was supported in part by NSF grant EIA-0113618 and responsive genes in A.thaliana and their cis-elements, i.e. ABREs a grant from Monsanto Corporation to W.Z. and in part by a grant and CEs. By employing the experimentally identified cis-elements, from Monsanto Corporation to R.S.Q. We thank the other members we are able to leverage genome-wide targeted gene finding with a of W.Z. 
and R.S.Q.’s groups for the helpful discussions.

REFERENCES

Bailey,T. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the 2nd ISMB Conference, Palo Alto, CA, pp. 28–36.
Brivanlou,A. and Darnell,J. (2002) Signal transduction and the control of gene expression. Science, 295, 813–818.
Busk,P. and Pages,M. (1997) Protein binding to the abscisic acid-responsive element is independent of viviparous1 in vivo. Plant Cell, 9, 2261–2270.
Busk,P. et al. (1997) Regulatory elements in vivo in the promoter of the abscisic acid responsive gene rab17 from maize. Plant J., 11, 1285–1295.
Carles,C. et al. (2002) Regulation of Arabidopsis thaliana Em genes: role of ABI5. Plant J., 30, 373–383.
Choi,H.I. et al. (2000) ABFs, a family of ABA-responsive element binding factors. J. Biol. Chem., 275, 1723–1730.
Finkelstein,R. et al. (2002) Abscisic acid signaling in seeds and seedlings. Plant Cell, (suppl.), S15–S45.
Guiltinan,M. et al. (1990) A plant leucine zipper protein that recognizes an abscisic acid response element. Science, 250, 267–271.
Harbison,C. et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99–104.
Hattori,T. et al. (1995) Regulation of the Osem gene by abscisic acid and the transcriptional activator VP1: analysis of cis-acting promoter elements required for regulation by abscisic acid and VP1. Plant J., 7, 913–925.
Hattori,T. et al. (2002) Experimentally determined sequence requirement of ACGT-containing abscisic acid response element. Plant Cell Physiol., 43, 136–140.
Higo,K. et al. (1999) Plant cis-acting regulatory DNA elements (PLACE) database. Nucleic Acids Res., 27, 297–300.
Hobo,T. et al. (1999) ACGT-containing abscisic acid response element (ABRE) and coupling element 3 (CE3) are functionally equivalent. Plant J., 19, 679–689.
Hoth,S. et al. (2002) Genome-wide gene expression profiling in Arabidopsis thaliana reveals new targets of abscisic acid and largely impaired gene regulation in the abi1-1 mutant. J. Cell Sci., 115, 4891–4900.
Hughes,J. et al. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol., 296, 1205–1214.
Kreps,J. et al. (2002) Transcriptome changes for Arabidopsis in response to salt, osmotic, and cold stress. Plant Physiol., 130, 2129–2141.
Lander,E. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
Lescot,M. et al. (2002) PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences. Nucleic Acids Res., 30, 325–327.
Marcotte,W. et al. (1989) Abscisic acid-responsive sequence from the Em gene of wheat. Plant Cell, 1, 969–976.
Markstein,M. et al. (2002) Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl Acad. Sci. USA, 22, 763–768.
Matys,V. et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res., 31, 374–378.
Mundy,J. et al. (1990) Nuclear proteins bind conserved elements in the abscisic acid-responsive promoter of a rice rab gene. Proc. Natl Acad. Sci. USA, 87, 1406–1410.
Narusaka,Y. et al. (2003) Interaction between two cis-acting elements, ABRE and DRE, in ABA-dependent expression of Arabidopsis rd29A gene in response to dehydration and high-salinity stresses. Plant J., 34, 137–148.
Oeda,K. et al. (1991) A tobacco bZip transcription activator (TAF-1) binds to a G-box-like motif conserved in plant genes. EMBO J., 10, 1793–1802.
Ono,A. et al. (1996) The rab16b promoter of rice contains two distinct abscisic acid-responsive elements. Plant Physiol., 112, 483–491.
Seki,M. et al. (2002a) Monitoring the expression pattern of around 7000 Arabidopsis genes under ABA treatments using a full-length cDNA microarray. Funct. Integr. Genomics, 2, 282–291.
Seki,M. et al. (2002b) Monitoring the expression profiles of 7000 Arabidopsis genes under drought, cold and high-salinity stresses using a full-length cDNA microarray. Plant J., 31, 279–292.
Shen,Q. and Ho,T.H. (1997) Promoter switches specific for abscisic acid (ABA)-induced gene expression in cereals. Physiol. Plantarum, 101, 653–664.
Shen,Q. et al. (1996) Modular nature of abscisic acid (ABA) response complexes: composite promoter units that are necessary and sufficient for ABA induction of gene expression in barley. Plant Cell, 8, 1107–1119.
Shinozaki,K. and Yamaguchi-Shinozaki,K. (2000) Molecular responses to dehydration and low temperature: differences and cross-talk between two stress signaling pathways. Curr. Opin. Plant Biol., 3, 217–223.
Stormo,G. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 16–23.
The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.
Wenick,A. and Hobert,O. (2004) Genomic cis-regulatory architecture and trans-acting regulators of a single interneuron-specific gene battery in C.elegans. Dev. Cell, 6, 757–770.
Xu,D. et al. (1996) Expression of a late embryogenesis abundant protein gene, HVA1, from barley confers tolerance to water deficit and salt stress in transgenic rice. Plant Physiol., 110, 249–257.
Yu,J. et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science, 296, 79–91.
Zhang,M. (2002) Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet., 3, 698–709.
Zhu,J.K. (2002) Salt and drought stress signal transduction in plants. Annu. Rev. Plant Biol., 53, 247–273.
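
Illustrative sketch: motif similarity as a Pearson correlation. The text describes the CompareACE similarity score as the Pearson correlation coefficient between the nucleotide base frequencies of two aligned motifs, ranging from −1 to 1. The Python sketch below shows that calculation under simplifying assumptions: the two motifs are given as equal-width sets of already aligned sites, and no alignment search is performed. It is not the CompareACE program, and the function names (`base_frequencies`, `pearson`, `motif_similarity`) are hypothetical.

```python
from math import sqrt

BASES = "ACGT"

def base_frequencies(sites):
    """Column-wise A/C/G/T frequencies for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    freqs = []
    for col in range(width):
        column = [s[col].upper() for s in sites]
        freqs.append([column.count(b) / len(sites) for b in BASES])
    return freqs

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(vx * vy)

def motif_similarity(sites_a, sites_b):
    """Correlate the flattened frequency matrices of two aligned motifs."""
    fa = [f for col in base_frequencies(sites_a) for f in col]
    fb = [f for col in base_frequencies(sites_b) for f in col]
    if len(fa) != len(fb):
        raise ValueError("motifs must have the same width")
    return pearson(fa, fb)
```

A motif compared with itself scores 1, matching the statement that 1 means a perfect match; the three sites used for a quick check could be, for example, ABRE entries from Table 1.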
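
Illustrative sketch: scanning for an ABRE–CE pair and its gap. Section 3.7 reports the gap between the ABRE and the CE in a module. The Python sketch below locates exact matches to the degenerate consensus strings given in the text (SRTACGTGTC for ABRE, GACRCGTGKC for CE) and reports the gap for each pair. Several points are assumptions rather than the method of the paper: matching is by IUPAC consensus instead of PWM scoring, only the given strand is scanned, either element order is accepted, and the gap is taken as the number of bases between the end of the first element and the start of the second. The function names and the `max_gap` default are hypothetical.

```python
import re

# IUPAC ambiguity codes used by the consensus strings in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def find_sites(consensus, seq):
    """Start positions of all (possibly overlapping) consensus matches."""
    body = "".join(IUPAC[c] for c in consensus.upper())
    pattern = re.compile("(?=(" + body + "))")  # lookahead keeps overlaps
    return [m.start() for m in pattern.finditer(seq.upper())]

def find_modules(seq, abre="SRTACGTGTC", ce="GACRCGTGKC", max_gap=150):
    """Return (abre_start, ce_start, gap) for non-overlapping pairs."""
    hits = []
    for a in find_sites(abre, seq):
        for c in find_sites(ce, seq):
            if a < c:
                gap = c - (a + len(abre))
            else:
                gap = a - (c + len(ce))
            if 0 <= gap <= max_gap:
                hits.append((a, c, gap))
    return hits
```

For a synthetic sequence with an ABRE at position 2 and a CE 40 bases downstream of the ABRE's end, `find_modules` returns a single hit with gap 40.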
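
Illustrative sketch: iterative PWM refinement from seed motifs. The Arabidopsis-specific motifs are described as the result of scanning promoters with seed motifs, ranking matches by score, rebuilding PWMs from the top 15 matches, and repeating until the PWMs stop changing or their specificity decreases. The Python sketch below shows only the core loop, under stated assumptions: a log-odds PWM with a pseudocount of 1 and a uniform background (the scoring actually used is defined in Section 2 and is not reproduced here), convergence tested by an unchanged site set, and no check on specificity or on similarity to the seed PWM. All function names are hypothetical.

```python
from math import log2

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PWM from equal-length sites (assumed scoring scheme)."""
    width = len(sites[0])
    pwm = []
    for col in range(width):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[col].upper()] += 1
        total = sum(counts.values())
        pwm.append({b: log2((counts[b] / total) / background) for b in BASES})
    return pwm

def score(pwm, window):
    """Sum of per-column log-odds for one window."""
    return sum(col[base] for col, base in zip(pwm, window.upper()))

def best_windows(pwm, promoters, top_k):
    """Top-scoring windows over all promoters; ties broken alphabetically."""
    width = len(pwm)
    scored = []
    for seq in promoters:
        seq = seq.upper()
        for i in range(len(seq) - width + 1):
            window = seq[i:i + width]
            if set(window) <= set(BASES):  # skip windows with N etc.
                scored.append((score(pwm, window), window))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [w for _, w in scored[:top_k]]

def refine(seed_sites, promoters, top_k=15, max_rounds=20):
    """Rebuild the PWM from its own top hits until the site set is stable."""
    sites = sorted(s.upper() for s in seed_sites)
    for _ in range(max_rounds):
        new_sites = sorted(best_windows(build_pwm(sites), promoters, top_k))
        if new_sites == sites:
            break
        sites = new_sites
    return build_pwm(sites), sites
```

The `top_k=15` default mirrors the 15 top matches mentioned in the text; `max_rounds` is only a safety cap for this sketch.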
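
Illustrative sketch: random promoters from nucleotide frequencies. Section 3.5 uses randomly constructed sequences, built from the nucleotide frequencies of all promoters, to estimate how often the module appears by chance. The Python sketch below shows one simple way to build such a background, assuming each base is drawn independently from the pooled single-nucleotide frequencies (a zero-order model). The paper does not state sequence lengths, counts or a random seed, so the parameters `n`, `length` and `seed` are placeholders, and the function names are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"

def nucleotide_frequencies(promoters):
    """Pooled A/C/G/T frequencies over all promoter sequences."""
    counts = Counter()
    for seq in promoters:
        counts.update(b for b in seq.upper() if b in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return {b: counts[b] / total for b in BASES}

def random_promoters(promoters, n, length, seed=None):
    """Draw n sequences, each base sampled independently (zero-order model)."""
    freqs = nucleotide_frequencies(promoters)
    rng = random.Random(seed)  # a fixed seed makes a run repeatable
    weights = [freqs[b] for b in BASES]
    return [
        "".join(rng.choices(BASES, weights=weights, k=length))
        for _ in range(n)
    ]
```

Scanning these sequences with the same module and score threshold as the real promoters gives a chance-level hit rate to compare against, which is the role of the random promoters in Figure 4.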

Journal

Bioinformatics, Oxford University Press

Published: May 12, 2005
