Genotyping and annotation of Affymetrix SNP arrays
Lamy, Philippe; Andersen, Claus L.; Wikman, Friedrik P.; Wiuf, Carsten
doi: 10.1093/nar/gkl475; pmid: 16899450
ABSTRACT
In this paper we develop a new method for genotyping Affymetrix single nucleotide polymorphism (SNP) arrays. The method is based on (i) using multiple arrays at the same time to determine the genotypes and (ii) a model that relates intensities of individual SNPs to each other. The latter point allows us to annotate SNPs that have poor performance, either because of poor experimental conditions or because, for one of the alleles, the probes do not behave in a dose–response manner. Generally, our method agrees well with a method developed by Affymetrix. When both methods make a call they agree in 99.25% of the cases (using standard settings), using a sample of 113 Affymetrix 10k SNP arrays. In the majority of cases where the two methods disagree, our method makes a genotype call, whereas the method by Affymetrix makes a no call, i.e. the genotype of the SNP is not determined. Visualization indicates that our method is likely to be correct in the majority of these cases. In addition, we demonstrate that our method produces more SNPs that are in concordance with Hardy–Weinberg equilibrium than the method by Affymetrix. Finally, we have validated our method on HapMap data and shown that the performance of our method is comparable to other methods.
INTRODUCTION
To date, whole-genome scans of polymorphic genetic markers [e.g. single nucleotide polymorphisms (SNPs)] are routinely performed with high-throughput technologies such as Affymetrix SNP array technology. Genome scans provide comprehensive information about the genetic background of individuals and have been used, among other things, to (i) study linkage disequilibrium in human populations and populations of other species, (ii) perform association mapping and linkage studies of common complex diseases, and (iii) conduct analysis of the genetic content in tumor cells, where the assumption of diploidy, as found in normal cells, often is violated. Affymetrix SNP arrays have become popular and are widely used.
Originally, an array with 1500 SNPs was released; later the 10k SNP array followed, and quite recently arrays with up to 500k SNPs have been made available. The array technique is based on genomic hybridization to synthetic high-density oligonucleotide microarrays [see (1) and references therein]. Each of the two alleles of an SNP is represented by 10 or 14 oligonucleotides (together called a probe set) and hybridization (probe) intensities are measured for all probes in a probe set. Affymetrix has developed software (GDAS) for genotyping SNPs based on the intensities and, subsequently, the derived genotypes can be used for further analysis of the data [such as (i–iii)]. GDAS genotypes SNPs arraywise, one SNP at a time. For the larger arrays Affymetrix has developed a new dynamic model-based algorithm (DM) that is also based on arraywise genotyping (2). Here we present an alternative algorithm (PBG, pool-based genotyping) for genotyping Affymetrix SNP arrays. If allele intensities (probe intensities combined into one value for each allele) (Materials and Methods) are plotted for a typical SNP, three distinct clusters are generated that correspond well to the three possible genotypes (Figure 1). Naturally, this suggests that the genotype of an SNP could be derived from the distribution of allele intensities by choosing the genotype of the cluster that statistically (in some sense) is closest to the observed allele intensities. PBG builds on this observation.
Figure 1. Genotype clusters. Plotted are the allele intensities for a typical SNP. Blue triangles denote the heterozygous genotype, red crosses and green circles the two homozygous genotypes. In this case PBG and GDAS agree.
In addition, we base PBG on a model that allows identification and annotation of SNPs that either are difficult to genotype correctly for experimental reasons or have probes that are not suited for copy number analysis. Identification of chromosomal regions with abnormal copy numbers (i.e. deviations from two copies) is an important undertaking in cancer research (3). Potentially, these regions harbor oncogenes or other genomic elements that are involved in the progression of tumors. Exclusion of SNPs that are not suited for copy number analysis is thus likely to increase the power to infer copy numbers correctly. We return to these issues in Results and Discussion. At the time of writing this paper we became aware of another method (RLMM) that takes an approach to genotyping similar to ours, although it is not based on a model that relates the intensities of different SNPs to each other (4). This approach will also be taken up in Results and Discussion.
MATERIALS AND METHODS
10k Early Access Array
For this study we used 113 samples collected at Aarhus University Hospital, Skejby. The GeneChip® Mapping 10k Early Access Array was applied to all 113 samples. The Single Primer Assay Protocol (labeling, hybridization, washing, staining and scanning) was performed according to the manufacturer's instructions (Affymetrix, Santa Clara, CA, USA) (1). Of the 113 samples, 32 are from unrelated Danish individuals, 40 from unrelated Cuban individuals and 41 from Cuban families. The Early Access Array has 10 126 SNPs. Of these, 9600 (9430 autosomal and 170 X chromosomal) mapped to a unique position in the genome [using the April 2003 genome assembly (hg15), http://www.genome.ucsc.edu]. The remaining SNPs (526) were excluded from further analysis. Genotypes were derived using (unnormalized) probe set intensities with GeneChip DNA Analysis Software (GDAS) from Affymetrix.
Subsequently, the probe set intensities were normalized using the dChipSNP software (6) and the allele intensities as defined in equation (1) were calculated.
HapMap data
To evaluate PBG on an externally validated dataset, we used a dataset where both HapMap calls and Affymetrix calls are available. We downloaded HapMap genotype data from 30 CEPH trios (90 samples in total) from http://www.hapmap.org/downloads/index.html.en and Affymetrix genotype data from the same samples from http://www.affymetrix.com/support/technical/sample_data/hapmap_trio_data.affx. There are 15 589 SNPs for which both calls exist in the 90 samples. All SNPs are on the Affymetrix Xba array and were genotyped with DM.
Notation and definitions
Let α denote an arbitrary allele, α = A or B, and let |$\overline{\alpha}$| denote the complementary allele of α, i.e. if α = A then |$\overline{\alpha} = \mathrm{B}$|, and if α = B then |$\overline{\alpha} = \mathrm{A}$|. Further, let γ denote an arbitrary genotype, γ = AA, AB, BB, AY or BY. Genotypes AY and BY denote male genotypes for X chromosome SNPs. Also let αα denote the homozygous genotype for the α allele. The probe intensities are combined into two values by taking the logarithm of the average over all probes for the α allele, α = A or B, i.e.
\begin{equation} I_{ij}(\alpha) = \log\left(\frac{1}{p}\sum_{k=1}^{p} \mathrm{PM}_{ijk}(\alpha)\right), \tag{1} \end{equation}
where |$\mathrm{PM}_{ijk}(\alpha)$| is the intensity of the k-th probe of allele α for SNP j in array i. Here k = 1, … , p, where p = 10 or 14, i = 1, … , 113, j = 1, … , 9600, and PM is short for perfect match (1). We do not use the mismatch probes in this approach. We use dChipSNP normalized probe intensities because they appear to have better statistical properties than the unnormalized probe intensities [cf. (5,6)]. The values |$I_{ij}(\mathrm{A})$| and |$I_{ij}(\mathrm{B})$| are referred to as allele intensities, or the A- and B-intensity, respectively.
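As an illustration of equation (1), the following minimal sketch computes an allele intensity from perfect-match probe intensities. The function name, the array layout and the toy values are hypothetical, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import numpy as np

def allele_intensity(pm):
    """Log of the mean perfect-match intensity over the probes of one allele.

    pm: array-like whose last axis runs over the p probes (p = 10 or 14 in
    the text). Natural log is assumed; the base is not given in the source.
    """
    pm = np.asarray(pm, dtype=float)
    return np.log(pm.mean(axis=-1))

# Toy example: two arrays, three probes each (illustrative values only).
pm_a = np.array([[100.0, 200.0, 300.0],
                 [50.0, 50.0, 50.0]])
print(allele_intensity(pm_a))
```

Averaging before taking the logarithm follows the order of operations in equation (1); taking the mean of log intensities would give a different quantity.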
Further, for SNP j, let |$M_j(\alpha \mid \gamma)$| be the empirical average of α-intensities for samples with genotype γ. For example, for an SNP on the X chromosome, |$M_j(\mathrm{A} \mid \mathrm{AY})$| is the average of the A-intensity in male samples with genotype AY, and for an autosomal SNP, |$M_j(\mathrm{B} \mid \mathrm{AA})$| is the average of the B-intensity in samples with genotype AA.
Allelic cross hybridization
The genotypes derived by GDAS were used for this part of the study. To investigate cross hybridization we focused on SNPs on the X chromosome (170 SNPs) and compared the allele intensities found in the male samples to those found in the female samples. For each SNP on the X chromosome, the samples were divided into groups according to genotype (excluding no calls, i.e. SNPs where a genotype call has not been assigned). Only groups comprising at least five samples were included in the analysis. The average of the α-intensity was calculated in each group and a straight line was fitted to the points |$(M_j(\alpha \mid \alpha\mathrm{Y}), M_j(\alpha \mid \mathrm{AB}))$|, α = A, B, j = 1, … , 170 (Notation and definitions). Note that each SNP contributes at most two points: one for the A allele (α = A), if there are at least five male samples with genotype AY and at least five female samples with genotype AB, and similarly one for the B allele (α = B), if there are at least five male samples with genotype BY and at least five female samples with genotype AB.
The model
We assume the following model for the 9430 autosomal SNPs. The theoretical expectation of the intensity |$I_{ij}(\alpha \mid \gamma)$| is denoted |$\mu_j(\alpha \mid \gamma)$|, j = 1, … , 9430, and the following relations are assumed:
\begin{equation} \mu_j(\alpha \mid \alpha\alpha) = c_1 + c_2\,\mu_j(\alpha \mid \mathrm{AB}), \tag{2} \end{equation}
\begin{equation} \mu_j(\alpha \mid \overline{\alpha}\,\overline{\alpha}) = d_1 + d_2\,\mu_j(\alpha \mid \mathrm{AB}) + d_3\,\mu_j(\alpha \mid \mathrm{AB})^2, \tag{3} \end{equation}
where |$c_1$|, |$c_2$|, |$d_1$|, |$d_2$| and |$d_3$| are unknown parameters.
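The global relations in equations (2) and (3) can be illustrated with a short sketch that fits a straight line and a second-order polynomial to per-SNP mean intensities. The function name, argument names and synthetic values are hypothetical, and ordinary least squares via `numpy.polyfit` is used here purely for illustration, not as a reproduction of the authors' software.

```python
import numpy as np

def fit_global_relations(mu_het, mu_hom_same, mu_hom_other):
    """Fit equations (2) and (3) across SNPs by ordinary least squares.

    mu_het:       mean alpha-intensity for genotype AB, one value per point
    mu_hom_same:  mean alpha-intensity for genotype alpha-alpha
    mu_hom_other: mean alpha-intensity for the other homozygous genotype
    Returns ((c1, c2), (d1, d2, d3)).
    """
    # np.polyfit returns coefficients from the highest degree downwards.
    c2, c1 = np.polyfit(mu_het, mu_hom_same, deg=1)
    d3, d2, d1 = np.polyfit(mu_het, mu_hom_other, deg=2)
    return (c1, c2), (d1, d2, d3)

# Synthetic, noise-free means generated from known (made-up) parameters.
mu_het = np.linspace(8.0, 11.0, 20)
mu_hom_same = 0.5 + 1.02 * mu_het
mu_hom_other = 2.0 + 0.3 * mu_het + 0.02 * mu_het**2

(c1, c2), (d1, d2, d3) = fit_global_relations(mu_het, mu_hom_same, mu_hom_other)
print(c1, c2, d1, d2, d3)
```

Because the parameters are shared by all SNPs, every SNP with enough samples per genotype contributes points to the same two fits.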
The model is motivated by initial plotting of the empirical means |$M_j(\alpha \mid \gamma)$| (Figure 2). Including a quadratic term in Equation 2 does not improve the model; the term is not statistically different from zero (p ≈ 0). The model postulates that the α-intensity of the heterozygous genotype is related to the α-intensity of the homozygous genotypes and, further, that this relation is ‘global’ in the sense that the parameters |$c_1$|, |$c_2$|, |$d_1$|, |$d_2$| and |$d_3$| are not SNP specific, but the same for all SNPs.
Figure 2. Mean intensities. Plotted are the mean allele intensities for female samples against the mean intensities for male samples. Only cases in which at least five samples have a given genotype are included. Intensity on the x-axis: |$M_j(\mathrm{A} \mid \mathrm{AY})$| and |$M_j(\mathrm{B} \mid \mathrm{BY})$|. Intensity on the y-axis depends on the color. Red circles, |$M_j(\mathrm{A} \mid \mathrm{AA})$| and |$M_j(\mathrm{B} \mid \mathrm{BB})$|; blue triangles, |$M_j(\mathrm{A} \mid \mathrm{AB})$| and |$M_j(\mathrm{B} \mid \mathrm{AB})$|; green crosses, |$M_j(\mathrm{B} \mid \mathrm{AA})$| and |$M_j(\mathrm{A} \mid \mathrm{BB})$|.
Note that the labeling of alleles by A and B is arbitrary, such that the set of A alleles is not expected to behave differently or have different chemical properties than the set of B alleles. The same holds for the probe sets. The unknown parameters in Equations 2 and 3 should therefore not depend on α. This observation also has the consequence that the covariance matrix of |$(I_{ij}(\mathrm{A}), I_{ij}(\mathrm{B}))$| for samples with genotype αα does not depend on α. We further assume that the covariance matrices are independent of the means, i.e. are constant. Testing shows that this is not strictly true; in particular, the variance of |$I_{ij}(\overline{\alpha})$| for samples with genotype αα depends somewhat on |$\mu_j(\overline{\alpha} \mid \alpha\alpha)$| (data not shown). This provides an additional five parameters: the covariance matrix
\begin{equation} \Sigma_{\mathrm{hom}} = \left(\begin{array}{cc} \sigma_{\mathrm{hom}}^{2} & \tau_{\mathrm{hom}} \\ \tau_{\mathrm{hom}} & \overline{\sigma}_{\mathrm{hom}}^{2} \end{array}\right) \tag{4} \end{equation}
for samples with homozygous genotypes, where |$\sigma_{\mathrm{hom}}^{2}$| is the variance of the α-intensity and |$\overline{\sigma}_{\mathrm{hom}}^{2}$| the variance of the |$\overline{\alpha}$|-intensity, and the covariance matrix
\begin{equation} \Sigma_{\mathrm{het}} = \left(\begin{array}{cc} \sigma_{\mathrm{het}}^{2} & \tau_{\mathrm{het}} \\ \tau_{\mathrm{het}} & \sigma_{\mathrm{het}}^{2} \end{array}\right) \tag{5} \end{equation}
for samples with the heterozygous genotype. In total there are 10 + number of SNPs = 9440 parameters in the model: five regression parameters, five covariance parameters and 9430 mean value parameters.
Parameter fitting
An iterative procedure is used to estimate the model parameters. Assume that an assignment of genotypes to SNPs is given (Genotyping). For each SNP the samples were divided into three groups according to genotype (excluding no calls). The empirical mean |$M_j(\alpha \mid \gamma)$| was calculated for each α and each genotype with at least five sample points; all other cases were excluded. In the following, superscript k indicates the estimates after the k-th iteration, k ≥ 0.
Initialization.
Define |$\widehat{\mu}_j^{0}(\alpha \mid \gamma) = M_j(\alpha \mid \gamma)$| and let |$\widehat{\Sigma}_{\mathrm{hom}}^{0}$| and |$\widehat{\Sigma}_{\mathrm{het}}^{0}$| be the empirical covariances with |$\mu_j(\alpha \mid \gamma)$| replaced by |$\widehat{\mu}_j^{0}(\alpha \mid \gamma)$|.
Update 1. Use linear regression to fit a straight line to the points |$\left(\widehat{\mu}_j^{k}(\alpha \mid \mathrm{AB}), \widehat{\mu}_j^{k}(\alpha \mid \alpha\alpha)\right)$|, α = A, B, j = 1, … , 9430, resulting in two fitted parameters |$\widehat{c}_1^{k}$| and |$\widehat{c}_2^{k}$|. Note that each SNP contributes at most two points.
Update 2. Similarly, fit a second-order polynomial to the points |$\left(\widehat{\mu}_j^{k}(\alpha \mid \mathrm{AB}), \widehat{\mu}_j^{k}(\alpha \mid \overline{\alpha}\,\overline{\alpha})\right)$|, α = A, B, j = 1, … , 9430, to obtain three estimated parameters |$\widehat{d}_1^{k}$|, |$\widehat{d}_2^{k}$| and |$\widehat{d}_3^{k}$|. Again each SNP contributes at most two points.
Update 3. Use weighted least squares to re-estimate the parameters |$\mu_j(\alpha \mid \mathrm{AB})$| and |$\mu_j(\alpha \mid \alpha\alpha)$| assuming the relationship (2) and weights |$\widehat{\Sigma}_{\mathrm{hom}}^{k}$| and |$\widehat{\Sigma}_{\mathrm{het}}^{k}$|. (Only the relevant entries in the covariance matrices are used.) The re-estimated parameters are denoted |$\widehat{\mu}_j^{k+1}(\alpha \mid \mathrm{AB})$| and |$\widehat{\mu}_j^{k+1}(\alpha \mid \alpha\alpha)$|.
Update 4. Use least squares to re-estimate the parameter |$\mu_j(\alpha \mid \overline{\alpha}\,\overline{\alpha})$| assuming the relationship (3). The re-estimated parameter is denoted |$\widehat{\mu}_j^{k+1}(\alpha \mid \overline{\alpha}\,\overline{\alpha})$|.
Update 5.
Re-estimate the covariance matrices with |$\mu_j(\alpha \mid \gamma)$| replaced by |$\widehat{\mu}_j^{k+1}(\alpha \mid \gamma)$|. Only intensities for which |$\mu_j(\alpha \mid \gamma)$| is estimated are included.
Iteration step. Repeat Updates 1–5 a number of times. Here, three iterations were used.
Updates 3 and 4 are two separate steps rather than a single step. If all parameters are estimated in one step using least squares, |$I_j(\alpha \mid \overline{\alpha}\,\overline{\alpha})$| tends to dominate the least squares equation, with the consequence that the estimated values become less accurate (data not shown).
Genotyping
The procedure for genotyping is iterative, starting with an initial clustering for each SNP of the points |$(I_{ij}(\mathrm{A}), I_{ij}(\mathrm{B}))$| for all samples into at most four clusters corresponding to the genotypes AA, AB, BB and NC (no call; both GDAS and our method genotype an SNP as NC if the confidence in all proper genotypes is low). For this study GDAS genotyping was used as the initial clustering; |$\gamma_{ij}^{0}$| denotes the GDAS genotype of array i, SNP j. The procedure is continued until no more (or only a few) changes in genotypes occur. For the k-th (k > 0) iteration the following is performed.
Parameter estimation. Estimate the parameters |$\mu_j(\alpha \mid \gamma)$| and the two covariance matrices (as described in Parameter fitting) using all observations for which the estimated genotype |$\gamma_{ij}^{k-1}$| has confidence higher than |$C > 0$| in iteration step k − 1. If k = 1, all proper GDAS genotypes (only excluding no calls) are used.
Calculation of confidence. Denote the estimated densities for genotype γ by |$f_k(x,y \mid \gamma)$| in iteration k. Calculate the weight for genotype γ using the following equation:
\begin{equation} W_k(x,y \mid \gamma) = \frac{f_k(x,y \mid \gamma)}{f_k(x,y \mid \mathrm{AA}) + f_k(x,y \mid \mathrm{AB}) + f_k(x,y \mid \mathrm{BB}) + \epsilon}, \end{equation}
where |$(x,y) = (I_{ij}(\mathrm{A}), I_{ij}(\mathrm{B}))$| and ε > 0 is a constant.
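The weight defined above can be sketched as follows. This is an illustration under stated assumptions: the densities are taken to be bivariate normal with the estimated means and covariance matrices (the text only speaks of "estimated densities"), and the function names, cluster means, covariance values and the value passed for ε are hypothetical toy choices.

```python
import numpy as np

def bivariate_normal_pdf(point, mean, cov):
    """Density of a bivariate normal distribution (assumed form of f_k)."""
    diff = np.asarray(point, dtype=float) - np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def genotype_weights(point, means, covs, eps):
    """Weights W(x, y | genotype) for AA, AB and BB, with eps in the denominator."""
    dens = {g: bivariate_normal_pdf(point, means[g], covs[g])
            for g in ("AA", "AB", "BB")}
    total = sum(dens.values()) + eps
    return {g: d / total for g, d in dens.items()}

# Toy cluster centres in (A-intensity, B-intensity) space; values are made up.
means = {"AA": (10.0, 7.0), "AB": (9.5, 9.5), "BB": (7.0, 10.0)}
cov = [[0.05, 0.01], [0.01, 0.05]]
covs = {"AA": cov, "AB": cov, "BB": cov}

near_aa = genotype_weights((10.0, 7.0), means, covs, eps=1e-10)
far_away = genotype_weights((0.0, 0.0), means, covs, eps=1e-10)
print(near_aa)
print(far_away)
```

A point that lies far from all three clusters has negligible density under every genotype, so the constant in the denominator keeps all three weights small instead of letting one of them approach 1.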
If none of the genotypes provides good support for (x,y), i.e. if |$f_k(x,y \mid \gamma) \ll \epsilon$| for all γ, then all genotypes get low confidence. Here |$\epsilon = 10^{-10}$| was used.
Genotyping. If |$\max_{\gamma} W_k(x,y \mid \gamma) > C$|, then genotype |$\gamma_{ij}^{k} = \arg\max_{\gamma} W_k(x,y \mid \gamma)$| is assigned to the SNP; otherwise NC is assigned.
Iteration step. Repeat the three steps a number of times. Here, six iterations were used.
SNP performance measures
Affymetrix provides a list of SNPs that were excluded/replaced in the commercial version of the 10k SNP array. It comprises 998 SNPs out of the 9430 autosomal SNPs that are used in this study. The reasons for excluding the SNPs include low call rate, low confidence, poor reproducibility and visual criteria. The SNPs in the list were compared to the SNPs found by the SNP performance measures described below. The measures are designed to identify SNPs that are difficult to genotype correctly, or to flag SNPs or alleles where the probes of one or both alleles do not show a dose–response behavior, as postulated by the model. The flagged SNPs and alleles might not be suitable for copy number analysis.
Hardy–Weinberg equilibrium
For all SNPs it was tested whether the genotype assignments complied with Hardy–Weinberg equilibrium. To avoid issues of inbreeding and admixture, two groups of arrays were defined: (i) a group of unrelated Danish individuals (32 arrays) and (ii) a group of unrelated Cuban individuals (40 arrays). For each SNP (j), the total number of genotypes (|$n_j$|; not including no calls) and the numbers of A (|$a_j$|) and B (|$b_j$|) alleles were calculated. SNPs where all arrays in a group are homozygous for the same allele are excluded (they trivially comply with Hardy–Weinberg equilibrium). Subsequently, a permutation test was conducted where the |$a_j$| and |$b_j$| alleles were randomly re-distributed among the |$n_j$| individuals. It was counted how often a value higher than the observed value of the chi-squared statistic was obtained in the permuted samples.
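The permutation test just described can be sketched as below. This is a minimal illustration, not the authors' code: the function names are hypothetical, the chi-squared statistic is assumed to be the usual Pearson goodness-of-fit statistic against Hardy–Weinberg proportions (the text does not spell out the formula), and the number of permutations is left as a parameter.

```python
import numpy as np

def hwe_chi2(n_aa, n_ab, n_bb):
    """Pearson chi-squared statistic against Hardy-Weinberg proportions.

    Assumes both alleles are present (monomorphic SNPs are excluded).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1.0 - p
    expected = np.array([n * p * p, 2 * n * p * q, n * q * q])
    observed = np.array([n_aa, n_ab, n_bb], dtype=float)
    return float(((observed - expected) ** 2 / expected).sum())

def hwe_permutation_pvalue(n_aa, n_ab, n_bb, n_perm, seed=None):
    """Fraction of permuted samples with a statistic higher than observed."""
    rng = np.random.default_rng(seed)
    n = n_aa + n_ab + n_bb
    # Pool of alleles: 0 codes allele A, 1 codes allele B.
    alleles = np.array([0] * (2 * n_aa + n_ab) + [1] * (2 * n_bb + n_ab))
    observed = hwe_chi2(n_aa, n_ab, n_bb)
    exceed = 0
    for _ in range(n_perm):
        # Re-distribute the alleles at random, two per individual.
        pairs = rng.permutation(alleles).reshape(n, 2)
        counts = np.bincount(pairs.sum(axis=1), minlength=3)  # [AA, AB, BB]
        if hwe_chi2(*counts) > observed:
            exceed += 1
    return exceed / n_perm

# Toy genotype counts: one SNP in exact HW proportions, one with only heterozygotes.
p_balanced = hwe_permutation_pvalue(10, 20, 10, n_perm=1000, seed=0)
p_all_het = hwe_permutation_pvalue(0, 40, 0, n_perm=1000, seed=0)
print(p_balanced, p_all_het)
```

Permuting alleles keeps the allele counts fixed, so the reference distribution of the statistic is conditional on the observed allele frequencies.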
A total of 1000 permutations were performed.
Distance measure
A weighted Euclidean distance was calculated between the observed means |$M_j(\alpha \mid \gamma)$| and the estimated means |$\widehat{\mu}_j(\alpha \mid \gamma)$|. This was performed for the A- and the B-intensities alone and jointly for both. A probe set that does not perform according to expectations is likely to have a higher distance value than a probe set that does perform according to expectations. Thus, a large distance indicates that the observed intensities do not fit well to the model. For the α probe, the following distance was calculated:
\begin{equation} \frac{1}{\widehat{\mu}_j(\alpha \mid \mathrm{AB})^{2}} \sum_{\gamma} D(\alpha \mid \gamma)^{2}\, \widehat{\sigma}(\alpha \mid \gamma)^{-2}, \end{equation}
where |$D(\alpha \mid \gamma) = M_j(\alpha \mid \gamma) - \widehat{\mu}_j(\alpha \mid \gamma)$| if |$\widehat{\mu}_j(\alpha \mid \gamma)$| lies between |$M_j(\mathrm{A} \mid \gamma)$| and |$M_j(\mathrm{B} \mid \gamma)$|, and |$D(\alpha \mid \gamma) = 0$| otherwise. Generally, SNPs can be genotyped correctly if |$\widehat{\mu}_j(\alpha \mid \gamma)$| lies outside the interval spanned by |$M_j(\mathrm{A} \mid \gamma)$| and |$M_j(\mathrm{B} \mid \gamma)$|. Here |$\widehat{\sigma}(\alpha \mid \gamma)$| is the average of the squared residuals |$M_j(\alpha \mid \gamma) - \widehat{\mu}_j(\alpha \mid \gamma)$|. The factor |$\widehat{\mu}_j(\alpha \mid \mathrm{AB})$| is motivated by the model; it roughly scales intensities for different SNPs to the same range. Thus, distances become comparable between SNPs. Alleles were flagged if the distance was >0.15. SNPs were flagged if the A and B distances both were >0.15.
RESULTS
Allelic cross hybridization
The genotypes derived by GDAS were used for this part of the study. Figure 2 shows the relationship between the mean of the allele intensities for the male and the female samples for all 170 SNPs on the X chromosome.
To investigate cross hybridization we focused on SNPs on the X chromosome (170 SNPs) and compared the allele intensities found in the male samples to those found in the female samples. Specifically, we compared the α-intensity for samples with genotype AB (females) to the α-intensity for samples with genotype αY (males). The curve of |${M}_{j}(\alpha \mid \hbox{ AB })-{M}_{j}(\alpha \mid \alpha \hbox{ Y })$| plotted against |${M}_{j}(\alpha \mid \hbox{ AB })$| is not statistically different from the constant line with intercept 0 (P = 0.71). For definition of Mj(α∣γ) see Notation and definitions. It is concluded that the presence of the B allele for genotype AB (in females) does not affect hybridization of the A allele and vice versa. Also, the curve |${M}_{j}(\alpha \mid \overline{\alpha }\overline{\alpha })-$||${M}_{j}(\alpha \mid \overline{\alpha }\hbox{ Y })$| plotted against |${M}_{j}(\alpha \mid \overline{\alpha }\overline{\alpha })$| is not statistically different from the constant line with intercept 0 (P = 0.32). In consequence, hybridization of the A allele is not affected by the copy number of the B allele (1 or 2) and vice versa. Thus, it appears that the effect of allelic cross hybridization generally is insignificant. This observation does not have immediate consequences for our genotyping method but will have consequences for copy number analysis. It will be taken up in Discussion. Genotyping Initially, we selected all arrays for which the arraywise residuals after iteration 1 of parameter fitting (Parameter fitting) were <0.27 (Supplementary Figure S1). A total of 10 arrays were excluded in this way, leaving 103 arrays. The 10 arrays were used for testing. We performed the genotyping method on the 103 arrays as described in Parameter fitting and Genotyping. Subsequently, the 10 arrays were genotyped with the parameters estimated from the 103 arrays, and thus served as an independent test of the method. 
In the majority of cases PBG agrees with GDAS and only one round of iteration is necessary for PBG to stabilize. In some cases PBG improves with the number of iterations because the starting point (GDAS genotypes) is inaccurate and/or the variation in the data requires extra iterations before stabilization occurs. Figure 3 shows examples of SNPs genotyped with GDAS and PBG after five rounds of iteration and confidence level C = 0.90. This level of confidence gives fewer no calls than GDAS (PBG 1%, GDAS 6.6%) (Table 1) but we have found from studying plots of the estimated genotypes that this level of confidence appears reasonable. With confidence C = 0.998 the same number of NCs are made (∼6.5%) but the two methods only agree on ≈26% (=1.3/5.0) of these.
Figure 3. Genotyping examples: PBG versus GDAS. Shown are eight SNPs genotyped with GDAS and PBG, respectively. Red crosses, homozygous for AA; blue triangles, heterozygous; green circles, homozygous for BB; yellow rhombuses, no call. SNPs 1–3 show cases where PBG outperforms GDAS; SNP 4 shows an example where the A allele does not reflect the number of A alleles in the genotype, but still GDAS and PBG genotype correctly; SNPs 5 and 8 show cases where none of the probes apparently functions correctly; and SNPs 6–7 show cases where GDAS outperforms PBG.
Table 1. PBG versus GDAS (% of SNPs)

                        GDAS
                 103 Arrays      10 Arrays       113 Arrays
Conf.    PBG     Call    NC      Call    NC      Call    NC
0.9      Call    93.1    6.0     88.6    8.8     92.7    6.3
         NC       0.6    0.3      1.7    0.9      0.7    0.3
0.95     Call    92.8    5.9     87.9    8.5     92.4    6.2
         NC       0.8    0.4      2.4    1.3      1.0    0.5
0.99     Call    91.6    5.6     85.6    7.6     91.1    5.7
         NC       2.1    0.8      4.6    2.2      2.3    0.9
0.998    Call    89.0    5.0     82.5    6.5     88.4    5.1
         NC       4.7    1.3      7.8    3.3      5.0    1.5

Shown is how often PBG and GDAS make a genotype call and an NC using different confidence levels for PBG. Standard settings were applied for GDAS. The additional 10 arrays were genotyped using the parameters fitted when genotyping the 103 arrays. In total 970 466 SNPs in 103 samples were available for genotyping, and 94 220 in the remaining 10 arrays.
Table 2 shows how often the two methods agree on genotype when a call has been made for different levels of confidence. A mere 392 SNPs (not listed in the table) in 113 arrays (total of 1 064 686 SNPs) were homozygous AA (BB) with PBG and BB (AA) with GDAS when using confidence 0.90. This number was reduced to 263 SNPs when using confidence 0.998.
Table 2. Percentage agreement (%-agrm) between PBG and GDAS

           103 Arrays        10 Arrays         113 Arrays
Conf.      %-agrm   NC       %-agrm   NC       %-agrm   NC
0.9        99.25    0.8      98.98     2.6     99.23    1.0
0.95       99.30    1.2      99.09     3.6     99.28    1.4
0.99       99.41    2.8      99.31     6.8     99.40    3.2
0.998      99.52    6.1      99.46    11.0     99.51    6.5

Shown is how often the two methods agree when both methods make a call, and the percentage of no calls obtained with PBG.
The examples in Figure 3 illustrate some of the differences between PBG and GDAS. In many cases where PBG outperforms GDAS, GDAS does not provide a call to one cluster of points or to a group of points located within a cluster. This is not just a question of the level of confidence.
Even with a higher level of confidence, e.g. C = 0.95, PBG is able to genotype these SNPs (data not shown). There are also a few cases where GDAS outperforms PBG; one case is shown in Figure 3. In other cases, there is only one cluster and PBG might also fail here. The presence of just one cluster is either (i) because one allele has low population frequency and by chance is not represented in the sample, or (ii) because of poor experimental conditions that make it statistically impossible to distinguish the genotypes from one another. In these cases PBG often provides an overrepresentation of heterozygous genotypes. For 31 SNPs, 60% or more of the genotypes (excluding NC) are heterozygous. In a panmictic population 50% is the maximum theoretical heterozygosity level, and an overrepresentation of heterozygous genotypes is thus in conflict with expectations and results in strong violation of Hardy–Weinberg equilibrium. In contrast, for the 31 SNPs, GDAS shows a scatter of different genotypes; whether GDAS genotyping is correct in these cases requires experimental verification. For 440 SNPs we found four or more differences between PBG and GDAS, only counting cases where both methods make a proper call (i.e. excluding cases where one or both methods assign NC). The group of 440 SNPs is referred to as Group B, the remaining SNPs as Group A. It is difficult to get exact numbers for when one method performs better on a particular SNP than the other, because we do not know the true genotypes. A manual inspection indicates that for ∼150 SNPs PBG does a better job than GDAS, and for ∼40 SNPs the opposite is true. These all appear in Group B, and ∼90% of the cases where we believe GDAS is superior are spotted by the performance distance measure introduced in the following section. Although PBG is based on a model which assumes that the intensity rises with the number of alleles present, PBG successfully genotypes SNPs where one probe does not perform according to the model.
To illustrate this point we genotyped the 170 SNPs on chromosome X using only female samples. Subsequently, the male samples were genotyped (see Table 3).
Table 3. Genotyping of SNPs on chromosome X

           Female            Male              All
Conf.      %-agrm   NC       %-agrm   NC       %-agrm   NC
0.9        99.39    0.70     98.32    1.80     98.92    1.20
0.95       99.41    1.00     98.54    2.51     99.02    1.68
0.99       99.59    2.41     99.07    3.82     99.36    3.05
0.998      99.65    4.53     99.25    4.84     99.47    4.67

Out of 113 samples, 62 are females and 51 males. For comparison, GDAS produces 5.15% NC in females, 7.29% in males and 6.11% in total.
SNP performance measures
Individual SNPs that are not in Hardy–Weinberg equilibrium after genotyping are likely to be falsely genotyped. Thus, failure to pass a test for Hardy–Weinberg equilibrium is an indication of poor SNP quality. If Hardy–Weinberg equilibrium generally is not fulfilled it indicates poor performance of the method. We performed a permutation test for Hardy–Weinberg equilibrium for all SNPs in two populations: the sample of unrelated Danish individuals (32 arrays) and the sample of unrelated Cuban individuals (40 arrays). Table 4 summarizes the findings.
SNPs were excluded if all samples were homozygous for the same genotype or were no calls. These SNPs trivially comply with Hardy–Weinberg equilibrium. In general, PBG provides more genotypes that are in concordance with Hardy–Weinberg equilibrium and with statistical expectations.

Table 4. Test for Hardy–Weinberg equilibrium

         Group A           Group B
         Mean    Var       Mean    Var
PBG      0.50    0.081     0.37    0.096
GDAS     0.48    0.084     0.34    0.094

Group A is defined as SNPs where PBG and GDAS disagree on the genotype (excluding NC) in <4 cases, and Group B (440 SNPs) is defined as those where there are ≥4 disagreements. The P-values of the test are expected to follow a uniform distribution, i.e. their mean should be 0.50 and their variance 0.083. Both methods have problems with Group B. The GDAS mean 0.48 is significantly different from 0.50 (P < 0.001).

An overview of the results of the performance measures is collected in Table 5. Lists of the SNPs selected by the performance measures are provided in Supplementary Table S1. Generally, we find that flagged SNPs are overrepresented in Affymetrix's list of rejected SNPs compared to the list of non-rejected SNPs.
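The permutation test for Hardy–Weinberg equilibrium is specified in Materials and Methods; purely as an illustration, one common way to set up such a test is sketched here. The function name `hwe_permutation_pvalue`, the heterozygote-count statistic, the number of permutations and the seed are assumptions made for this sketch and are not taken from the PBG implementation.

```python
import random


def hwe_permutation_pvalue(genotypes, n_perm=2000, seed=0):
    """Two-sided permutation P-value for Hardy-Weinberg equilibrium.

    genotypes: list of 'AA', 'AB' or 'BB' calls for one SNP (no calls
    are assumed to have been removed beforehand).
    Statistic (an assumption of this sketch): deviation of the number of
    heterozygotes from its expectation under random pairing of alleles.
    """
    rng = random.Random(seed)
    n = len(genotypes)
    alleles = [allele for g in genotypes for allele in g]  # 2n alleles
    n_b = alleles.count("B")
    n_a = 2 * n - n_b
    # Expected heterozygote count when the 2n alleles are paired at random.
    expected_het = n_a * n_b / (2 * n - 1)
    observed_het = sum(g[0] != g[1] for g in genotypes)
    observed_dev = abs(observed_het - expected_het)

    hits = 0
    for _ in range(n_perm):
        rng.shuffle(alleles)
        # Consecutive alleles form one permuted individual.
        het = sum(alleles[2 * i] != alleles[2 * i + 1] for i in range(n))
        if abs(het - expected_het) >= observed_dev - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    balanced = ["AA"] * 5 + ["AB"] * 10 + ["BB"] * 5
    all_het = ["AB"] * 20
    print(hwe_permutation_pvalue(balanced))  # large P-value expected
    print(hwe_permutation_pvalue(all_het))   # small P-value expected
```

A SNP for which nearly all calls are heterozygous, as described for the 31 SNPs above, would receive a very small P-value under a test of this kind.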
The distance measure was calculated for each allele intensity and jointly for both, as described in Materials and Methods. Plots of all SNPs with a distance >0.15 are shown in Supplementary Figure S2. The distance measure is adequate to identify SNPs that are difficult to genotype, i.e. SNPs with experimentally poor performance, or SNPs where the probes for one or both alleles do not react in a dose–response manner, i.e. probes that are not suitable for copy number analysis. Even if a combination of measures is used, the frequency of flagged SNPs in Affymetrix's list of rejected SNPs does not exceed 32% (out of 998). To achieve this, SNPs are flagged if the number of NC exceeds 5% (NC 5% in Table 5) or the P-value for the test for Hardy–Weinberg equilibrium is <1% (HW 1% in Table 5). Conversely, the frequency of rejected SNPs out of all flagged SNPs does not exceed 36% (combining distance with NC 5% and HW 0.1%). A table is provided in Supplementary Table S2.

Table 5. Comparison of different performance measures

            Distance         NC               HW               Group B
            One     Both     5%      10%      0.1%    1%
Total       485     34       408     125      57      384      440
Rejected    158     14       185     70       24      107      191

The list of rejected SNPs comprises 998 SNPs. NC 10% (5%) is the group of SNPs with a no call rate of at least 10% (5%). If an SNP obtains a P-value <1% (0.1%) in the test for Hardy–Weinberg equilibrium in either of the two populations it counts in HW 1% (0.1%). Generally, we find an overrepresentation of SNPs in the list of rejected SNPs compared with the list of non-rejected SNPs.

Comparison with other methods on HapMap data

In the previous sections we have evaluated PBG on a large dataset and compared PBG to GDAS. To evaluate PBG on an externally validated dataset, we followed the procedure in (4) closely. This procedure also allows us to compare PBG with DM (2) and RLMM (4). We selected 15 589 SNPs from the Affymetrix Xba array where both HapMap and DM calls were available (Materials and Methods). We ran PBG and DM on this dataset. Unfortunately, we could not get RLMM to run on our computers and we therefore used results from (4) to compare with RLMM. These results are based on 15 910 SNPs selected in the same way as our dataset. The reason for the discrepancy between the sizes of the two datasets is unknown to us. For each SNP (in both datasets) calls are made for 90 individuals. Table 6 summarizes the results.

Table 6. Comparison with other methods on HapMap data

No. of SNPs    PBG       DM        No. of SNPs    RLMM
15 589         99.50%    99.60%    15 910         ?
14 509         99.57%    99.65%    11 446         99.86%

Shown is the percentage agreement with HapMap calls for different methods. PBG and DM are run on the same dataset, RLMM on a different, although similar, dataset (see text). The SNPs in the second row form a subset of the SNPs in the first row. SNPs are excluded if there is at most one member in two of the genotype groups (based on HapMap calls). Results are not available for RLMM on the full dataset.

Interestingly, PBG genotypes are identical to HapMap genotypes in all 90 samples for 81.4% of the SNPs in the full dataset and for 87.7% of the 1080 SNPs excluded by the criterion used in (4) (see also Table 6). The criterion excludes SNPs if there is at most one member in two of the genotype groups (based on HapMap calls). This shows one strength of PBG: it is able to genotype accurately even though some genotypes are sparsely represented in the data. It is not shown in (4) how RLMM performs on the excluded set of SNPs. It is also worth pointing out that RLMM used HapMap calls to train the algorithm, whereas PBG used Affymetrix calls (inferred by DM). Affymetrix calls are always available for Affymetrix arrays, whereas HapMap calls naturally are not. Thus, training with HapMap calls is not generally possible. In general use, this might lead to lower performance of RLMM than reported in Table 6, because HapMap calls are believed to be highly accurate.

DISCUSSION

We have developed a new method for genotyping Affymetrix SNP arrays and compared the performance of our method (PBG) to that of Affymetrix (GDAS). PBG is based on analyzing multiple arrays at the same time, in contrast to GDAS, which analyzes SNPs arraywise, one SNP at a time.
Generally, the two methods agree, but PBG appears to be able to genotype correctly with a lower no call rate and also appears to produce more genotypes than GDAS that comply with Hardy–Weinberg equilibrium. In addition, PBG is based on a model that relates allele intensities from different SNPs to each other. We use this relationship to annotate SNPs and alleles. The plots provided in Supplementary Figure S2 show that we are able to annotate poorly performing SNPs and alleles. We also compared PBG to two other recently published methods, DM and RLMM. Overall the methods seem to have similar performances; some of the differences are explained below. Our method is based on dChipSNP normalized probe intensities. One array is selected as the reference array and all other arrays are normalized relative to the reference array. This has the advantage that new arrays (a test set) can be genotyped using fitted parameters obtained from a training set. If the test set is normalized relative to the reference array of the training set, the fitted parameters of the training set can be used to genotype the test set. In particular, this should be useful when genotyping only a few arrays, provided the fitted parameters of the training set and the reference array are publicly available. We showed that this approach is feasible by analyzing an additional 10 arrays that were not used for fitting (Table 1). Our model has 10 + number of SNPs = 9440 parameters. For some SNPs only one or two genotypes are observed. In these cases, we use the model to estimate the mean intensity of the non-observed genotypes. In contrast, the RLMM model proposed in (4) has 15 × number of SNPs = 139 450 parameters (if applied to the 10k array), because their model does not assume a relationship between parameters for different SNPs. If a genotype is not observed or sparsely represented, the parameters for that genotype are predicted using estimated parameters from other SNPs.
Note that it is not known how RLMM performs on SNPs where only one genotype is present (or some genotypes are sparsely represented). In (4) results are not shown for these SNPs, even though they comprise ∼28.7% of the SNPs in their dataset. Naturally, the structure of the data can be modeled more accurately with a high-dimensional parameter than with a low-dimensional one (in the sense that 9440 is small compared to 139 450). PBG is thus likely to fail in genotyping some SNPs that might be correctly genotyped by RLMM. However, since these SNPs do not fit the model, PBG will flag them as ‘poor’ and they can be excluded from the analysis. Flagging or annotation of poorly performing SNPs offers a two-sided advantage. First, SNPs that perform poorly for experimental reasons can be excluded from the analysis. Second, SNPs can perform poorly, as illustrated in Figure 3 and Supplementary Figure S1, because for one or both alleles the probes do not behave in a dose–response manner, and should therefore be excluded. These SNPs might still be genotyped correctly, but are not suitable for copy number analysis. Several research groups have demonstrated that a typical SNP shows a linear relationship between the log-copy number and the log-intensity, and have used the intensity levels in diploid samples to infer copy numbers in abnormal samples, e.g. in tumor samples (3,6,8,9). This relationship is documented both with the data normalization procedure introduced by Affymetrix and with dChipSNP's procedure, which is used in PBG. Analysis of SNP arrays often requires correction for multiple testing. To avoid too many false positives, the significance level of a single test should be chosen low. Excluding SNPs that perform poorly for experimental reasons should reduce the number of false positives and thus increase the power.
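As a small worked illustration of the linear relationship between log-copy number and log-intensity, the sketch here draws a straight line through the mean log-intensities at one and two copies of an allele and evaluates it at other copy numbers. The function name and the input values are hypothetical, and the two-point fit is a simplification chosen for this sketch rather than the procedure used in (3,6,8,9) or in PBG.

```python
import math


def predict_log_intensity(log_i_one_copy, log_i_two_copies, copies):
    """Evaluate a line in (log copy number, log intensity) at `copies`.

    log_i_one_copy:   mean log-intensity with one allele copy (e.g. AB)
    log_i_two_copies: mean log-intensity with two allele copies (e.g. AA)
    copies:           allele copy number >= 1 (log(0) is undefined)
    """
    if copies < 1:
        raise ValueError("copy number must be at least 1 on the log scale")
    # Slope of the line through (log 1, log_i_one_copy) and (log 2, log_i_two_copies).
    slope = (log_i_two_copies - log_i_one_copy) / math.log(2)
    return log_i_one_copy + slope * math.log(copies)


if __name__ == "__main__":
    # Hypothetical mean log-intensities for one and two copies of allele A.
    for c in (1, 2, 3, 4):
        print(c, round(predict_log_intensity(6.0, 6.5, c), 3))
```

Because the line is fixed by only two points, any departure from a dose–response behaviour of the probes would make such an extrapolation unreliable, which is one reason for flagging those SNPs.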
It appears that GDAS genotypes tumor samples reliably, at the cost of an increased no call rate compared to normal samples. Our initial investigations show that PBG seems to make more errors while genotyping tumor samples (data not shown). This is expected because we explicitly apply a model which assumes that two copies of the DNA are present for each SNP, whereas this assumption is often violated in tumor samples. Whether the method in (4) can genotype samples with abnormal DNA content correctly is presently unknown. In (3,6,8,9), genotyping and copy number analysis are separate issues; i.e. if genotypes are used in a copy number analysis, the genotypes are obtained before the copy number analysis is conducted. It would be natural to combine the two into a single analysis. We showed in Allele Cross Hybridization that the level of the A-intensity is not affected by the copy number of the B allele, and vice versa. This leads us to speculate that cross hybridization can generally be ignored, in the sense that the level of the A-intensity is only affected by the copy number of the A allele, not the copy number of the B allele. Assuming a linear relationship between log-copy number and log-intensity, the intensity levels for higher allele copy numbers could be extrapolated from the observations made in this paper. A version of PBG implemented in Perl is available from the authors upon request.

ACKNOWLEDGEMENTS

P.L. and C.W. are supported by the Danish Cancer Society and by the Fraenkel Foundation, Denmark. C.L.A. is supported by the Danish Research Councils and the John and Birte Meyer Foundation. Ole Mors, Center for Basic Psychiatric Research, is thanked for providing 81 of the SNP arrays. Funding to pay the Open Access publication charges for this article was provided by the Fraenkel Foundation, Denmark. Conflict of interest statement. None declared.

REFERENCES

1. Kennedy G.C., Matsuzaki H., Dong S., Liu W.M., Huang J., Liu G., Su X., Cao M., Chen W., Zhang J., et al. (2003) Large-scale genotyping of complex DNA. Nat. Biotechnol., 21, 1233–1237.
2. Di X., Matsuzaki H., Webster T.A., Hubbell E., Liu G., Dong S., Bartell D., Huang J., Chiles R., Yang G., et al. (2005) Dynamic model based algorithms for screening and genotyping over 100K SNPs on oligonucleotide microarrays. Bioinformatics, 21, 1958–1963.
3. Bignell G.R., Huang J., Greshock J., Watt S., Butler A., West S., Grigorova M., Jones K.W., Wei W., Stratton M.R., et al. (2004) High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res., 14, 287–295.
4. Rabbee N. and Speed T.P. (2006) A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics, 22, 7–12.
5. Li C., Tseng G.C., Wong W.H. (2003) Model-based analysis of oligonucleotide arrays and issues in cDNA microarray analysis. In Speed T. (Ed.), Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall, NY, pp. 1–34.
6. Zhao X., Li C., Paez J.G., Chin K., Jänne P.A., Chen T.-H., Girard L., Minna J., Christiani D., Leo C., et al. (2004) An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res., 64, 3060–3071.
7. Irizarry R.A., Hobbs B., Collin F., Beazer-Barclay Y.D., Antonellis K.J., Scherf U., Speed T.P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264.
8. Huang J., Wei W., Zhang J., Liu G., Bignell G.R., Stratton M.R., Futreal P.A., Wooster R., Jones K.W., Shapero M.H. (2004) Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Hum. Genomics, 4, 287–299.
9. Nannya Y., Sanada M., Nakazaki K., Hosoya N., Wang L., Hangaishi A., Kurokawa M., Chiba S., Bailey D.K., Kennedy G.C., et al. (2005) A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res., 65, 6071–6079.
10. Ming L., Wei L.-J., Sellars L.R., Lieberfarb M., Wong W.H., Li C. (2004) dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics, 20, 1233–1240.

© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Information-theoretic identification of predictive SNPs and supervised visualization of genome-wide association studies
Bhasi, Kavitha; Zhang, Li; Brazeau, Daniel; Zhang, Aidong; Ramanathan, Murali
doi: 10.1093/nar/gkl520 pmid: 16899448
ABSTRACT

The size, dimensionality and limited range of the data values make visualization of single nucleotide polymorphism (SNP) datasets challenging. The purpose of this study is to evaluate the usefulness of 3D VizStruct, a novel multi-dimensional data visualization technique for SNP datasets capable of identifying informative SNPs in genome-wide association studies. VizStruct is an interactive visualization technique that reduces multi-dimensional data to three dimensions using a combination of the discrete Fourier transform and the Kullback–Leibler divergence. The performance of 3D VizStruct was challenged with several diverse, biologically relevant published datasets including the human lipoprotein lipase (LPL) gene locus, the human Y-chromosome in several populations and a multi-locus genotype dataset of coral samples from four populations. In every case, the SNPs and/or polymorphic markers identified by the 3D VizStruct mapping were predictive of the underlying biology.

INTRODUCTION

Technologies capable of simultaneously genotyping thousands of single nucleotide polymorphisms (SNPs) are now widely employed in basic biomedical research for investigating the genetic basis of complex diseases, cancer risk and drug response (1–4). Presently the public SNP database (dbSNP) contains 27 million entries (Build 125, available September 2005), 10 million of which have been identified as unique to the database (‘rs’ SNPs). Approximately 3 million contain genotype information and >500 000 entries also have frequency data. Many techniques have been developed to explore these multivariate datasets, but one of the key obstacles to exploring genome-wide SNP data is the high dimensionality, both in terms of the number of genes involved and the number of polymorphisms within each gene.
Additional challenges include the massive size of the datasets (typically data on >10 000–500 000 SNPs can be obtained from a single sample) and the limited range of the data values (the data are typically sequences of ordinal numbers and the number of values taken by each SNP is very limited: each SNP is typically called as heterozygous or one of two homozygous states). Data analysis is further complicated by the presence of correlated markers delimiting haplotypes. Visualization algorithms can provide effective tools to summarize and interpret datasets, describe the contents and expose features in genome-wide SNP datasets. Although genotyping technologies have advanced considerably and a variety of sequence analysis and alignment algorithms and tools have been developed, analytical visualization of SNP datasets, the primary focus of this research, has not been extensively investigated in the context of SNP data analysis. Fast, efficient, effective and easy-to-use analytical visualization tools are essential for identifying and interpreting patterns in large SNP datasets in order to generate hypotheses and direct subsequent research.

METHODOLOGY AND RESULTS

The VizStruct mapping

At the core of VizStruct is a radial projection that maps the n-dimensional vectors into 2D points while retaining correlation similarity in the original input space (5,6). If the vector x[n] = (x[0], x[1], … , x[n − 1]) represents a data item in n-dimensional space, $R^n$, its mapping to a point $F_1(\mathbf{x}[n])$ in the complex plane C is given by the following equation:

\begin{equation}
F_{1}\left(\mathbf{x}[n]\right) = \sum_{j=0}^{n-1} x[j]\, e^{-2\pi i j/n}. \tag{1}
\end{equation}

The real and imaginary components of $F_1(\mathbf{x}[n])$ are used for creating the 2D mapping. In Equation 1, $i=\sqrt{-1}$ and the complex exponential has the effect of dividing the circle of display into equally spaced sectors.
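Equation 1 can be illustrated with a few lines of code. The sketch uses only the Python standard library; the function name `vizstruct_2d` is hypothetical and is not part of any published VizStruct software.

```python
import cmath


def vizstruct_2d(x):
    """Map an n-dimensional data item to a 2D point as in Equation 1.

    Returns the real and imaginary parts of
    F1(x) = sum_j x[j] * exp(-2*pi*i*j/n).
    """
    n = len(x)
    if n == 0:
        raise ValueError("x must contain at least one value")
    f1 = sum(x[j] * cmath.exp(-2j * cmath.pi * j / n) for j in range(n))
    return f1.real, f1.imag


if __name__ == "__main__":
    # A constant vector carries no directional information: it maps to the origin.
    print(vizstruct_2d([2, 2, 2, 2]))
    # A single non-zero dimension maps onto the ray of its own sector.
    print(vizstruct_2d([0, 1, 0, 0]))
```

Each dimension j contributes along the direction with angle −2πj/n, so a data item is pulled towards the sectors of the dimensions in which it has large values.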
Equation 1 represents a substantive reformulation of the usual radial visualization mapping, and the use of complex number notation has significant advantages: it allows easier derivations of the theoretical underpinnings and an intuitive geometric interpretation of the mapping (7–9). The mapping $F_1(\mathbf{x}[n])$ is equivalent to the first harmonic of the discrete Fourier transform (DFT). The relationship between the DFT and the radial visualization mapping, which was first identified by our group (7–9), allows the computationally efficient fast Fourier transform algorithm [complexity of O(n log n), where n is the number of dimensions] to be used. It also allows a wide range of enhancements, including higher harmonic analysis, that have been described previously (7–9). For 3D analysis (3D VizStruct), we included the Kullback–Leibler divergence (KLD) as the third dimension or z-coordinate; the complex number corresponding to the first Fourier harmonic is used for the x- and y-axes. The KLD between two probability mass functions p(x) and q(x) is denoted by D(p‖q) and is also known as the relative entropy. The KLD is defined by (10):

\begin{equation}
\mathrm{KLD} = \sum_{x\in X} p(x) \log\left(\frac{p(x)}{q(x)}\right). \tag{2}
\end{equation}

The base of the logarithm was taken to be 2. The KLD is a measure of the distance between two distributions or, equivalently, it is the inefficiency of assuming that the distribution is q when the true distribution is p. The KLD always takes non-negative values, KLD ≥ 0, and is zero only if p = q (11). As the first step, a contingency table containing the frequencies of the SNP (or polymorphic locus) genotypes in each class was obtained. The frequencies in each cell of the contingency table were normalized using the sample size in the table, and these normalized frequencies comprised the probability distribution p.
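The KLD of Equation 2 can be sketched for a single SNP as follows. The function name `kld_for_snp` is hypothetical. Two choices in the sketch are assumptions rather than details taken from the text at this point: the reference distribution q is taken as the product of the row and column sums of the normalized table (the independence reference), and missing genotypes, passed as `None`, are simply skipped. Empty cells contribute zero, following the usual convention 0 · log 0 = 0.

```python
import math
from collections import Counter


def kld_for_snp(genotypes, classes):
    """Kullback-Leibler divergence (base 2) of a genotype-by-class table.

    genotypes: coded genotype per sample (e.g. 1, 2, 3), None if missing
    classes:   class label per sample (e.g. population of origin)
    """
    pairs = [(g, c) for g, c in zip(genotypes, classes) if g is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no genotyped samples")
    joint = Counter(pairs)                    # contingency table cells
    row = Counter(g for g, _ in pairs)        # genotype totals
    col = Counter(c for _, c in pairs)        # class totals
    kld = 0.0
    for (g, c), count in joint.items():
        p = count / n                         # normalized cell frequency
        q = (row[g] / n) * (col[c] / n)       # independence reference
        kld += p * math.log2(p / q)
    return kld


if __name__ == "__main__":
    classes = ["JMS"] * 2 + ["RMN"] * 2
    print(kld_for_snp([1, 1, 3, 3], classes))  # genotype separates the classes
    print(kld_for_snp([1, 3, 1, 3], classes))  # genotype unrelated to class
```

With q defined this way the quantity is the mutual information between genotype and class, so a SNP whose genotypes separate the classes receives a high z-coordinate and an uninformative SNP a value near zero.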
The reference probability distribution, q, for each cell was computed as the product of the corresponding row and column sums of the normalized frequencies table; this is equivalent to using the assumption of independence. The performance of the 3D VizStruct method was measured by calculating the percentage of samples that were misclassified.

Coding of SNP datasets

An ordinal scale was used to code the SNP genotype sequences: the numbers 1, 2 and 3 were used for genotypes that were homozygous in the major allele, heterozygous or homozygous in the minor allele, respectively. A systematic, sequential approach was used for missing data. Individuals in whom >75% of the SNP genotypes were missing were excluded. SNP locations consisting entirely of a combination of missing data and a single genotype were excluded from the analysis because of the absence of information. In computation of the Fourier dimensions of 3D VizStruct, the remaining missing data points were replaced by the sample mean for that SNP location. The coding and computations were conducted in Microsoft Excel (Microsoft, Bellevue, WA). The 2D and 3D plots were obtained with Kaleidagraph (Synergy Software, Malvern, PA) and MATLAB (The MathWorks, Natick, MA), respectively.

Evaluation of the VizStruct approach

Analysis of the human lipoprotein lipase genotypes

The human lipoprotein lipase (LPL) gene is involved in lipid metabolism and has been characterized in detail for its associations with cardiovascular disease. The LPL dataset from http://droog.gs.washington.edu/mdecode/data/lpl/lpl.prettybase.txt.wo_n wherein LPL was genotyped at 88 polymorphic sites in 48 individuals (12,13) was analyzed to assess the suitability of the 3D VizStruct approach for supervised visualization of a densely characterized candidate gene.
The dataset contains genotypes of 24 Americans of African ancestry from Jackson, Mississippi (JMS), who participated in the Family Blood Pressure Program, a hypertension study, and 24 Americans of European ancestry from Rochester, Minnesota (RMN), who participated in the Rochester Family Heart Study. The haplotype phase is available in this dataset, but we intentionally coded each SNP location as being either homozygous in the major allele, heterozygous or homozygous in the minor allele for visualization, because haplotype phase information is not available in the majority of experimental situations. Figure 1A shows the VizStruct mapping of the SNPs from the LPL dataset; each point in Figure 1A corresponds to a single SNP. The SNPs with the highest values of KLD were identified and their ability to individually classify the JMS and the RMN samples was investigated. The results for the three SNPs with the highest KLD values are summarized in Figure 1B–D. Each of the SNPs shown was strongly associated with and informative of the JMS versus RMN class distinction: e.g. at SNP-24 (Figure 1B), 22 of 24 JMS subjects were homozygous for the major allele and all the RMN subjects had the minor allele (i.e. they were heterozygous or homozygous for the minor allele); the percent error was 4.2%. At SNP-40 (Figure 1C), all 19 JMS subjects (genotypes for 5 subjects were missing at this locus) were homozygous for the major allele and 23 of 24 RMN subjects had the minor allele; only 1 RMN subject had the major allele and could be considered as ‘misclassified’ (1 of 43 or 2.3% error) by the KLD approach. At SNP-79 (Figure 1D), all 24 RMN subjects were homozygous for the major allele and 21 of 24 JMS subjects had the minor allele; 3 JMS subjects had the major allele and could be considered as ‘misclassifications’ (3 of 48 or 6.3% error).

Figure 1. (A) (Upper left panel) shows the 3D VizStruct mapping of the LPL genotypes. The x- and y-axes are the real and imaginary components of the first harmonic of the DFT and the z-axis is the KLD; each point corresponds to a SNP and the SNPs with the highest KLD values are highlighted with the open triangles. (B–D) show the distribution of the genotypes for the three SNPs with the highest values of the KLD in the African-American patients from Jackson, MS (closed circles) and Caucasian-American patients from Rochester, MN (open circles). The x-axis in (B–D) is the sample number and the y-axis shows the genotypes, with the homozygous genotypes coded as 1 and 3 for the major and minor allele, respectively, and the heterozygous genotype coded as 2.

These results demonstrate that the supervised 3D VizStruct approach is effective for identifying informative SNPs from datasets of densely genotyped candidate genes obtained in 2-class study designs.

The Y-chromosome dataset

Polymorphisms of the Y-chromosome are of practical interest in forensic identification, paternity testing and in the study of human migration, since the chromosome is present only in males and transmitted from father to son (14–16).
Figure 2A shows the VizStruct mapping of the SNPs from the Y-chromosome dataset from the Perlegen Sciences database (http://genome.perlegen.com/browser/download.html), wherein the Y-chromosome was genotyped at 334 polymorphic sites in 33 males differing in race: 11 African-Americans, 13 European-Americans and 9 Han Chinese (17). Figure 2B–D summarizes the descriptive capabilities of the three SNPs with the highest KLD values. The alleles for SNP7271524 shown in Figure 2B differentiate between the Han Chinese group and the European-American and African-American groups. For SNP15795329 (Figure 2C), all the African-American subjects have the major allele whereas all the Han Chinese subjects have the minor allele; the European-Americans have both alleles. The pattern for SNP1733733 (Figure 2D) is also clearly distinctive: all the European-American and Han Chinese subjects have the major allele whereas the majority of African-Americans have the minor allele. This demonstrates the scalability of supervised 3D VizStruct to a chromosome-wide SNP dataset containing more than two classes.

Figure 2. (A) (Upper left panel) shows the 3D VizStruct mapping of the Y-chromosome SNPs. The x- and y-axes are the real and imaginary components of the first harmonic of the DFT and the z-axis is the KLD; each point corresponds to a SNP and the SNPs with the highest KLD values are highlighted with the open triangles. (B–D) show the distribution of the genotypes for the three SNPs with the highest values of the KLD in the African-American (closed circles), Caucasian-American (open circles) and Han Chinese (open triangles) subjects. The x-axis in (B–D) is the sample number and the y-axis shows the genotypes, with the genotypes coded as 1 and 2 for the major and minor allele, respectively.

Analysis of the coral dataset

In the next analysis, we analyzed a dataset obtained by genotyping individual corals from four coral reef populations (18). These authors used the amplified fragment length polymorphism (AFLP) assay, a multi-locus technique employed for obtaining a genetic fingerprint of organisms with limited available sequence information (19,20). In the AFLP method, a restriction enzyme digest of genomic DNA is annealed with oligonucleotide primers containing short flanking sequences in addition to the adaptor sequences of the restriction enzyme used. The flanking sequences ensure selective PCR amplification of only those restriction fragments that contain the reverse complement of the flanking sequence. After PCR, the amplified fragments (typically, 50–100 in number) are separated according to size by denaturing gel electrophoresis (19,20). It should be noted that the AFLP method is susceptible to confounding by size homoplasy, because the presence of AFLP bands of the same size in different samples is not always sufficient to assure high sequence similarity (21).
The names and locations of the reefs are summarized in Figure 3A: DNA samples from individual coral specimens from geographical locations in the Bahamas (23°28′N, 75°42′W), the Crocker and Conch reefs (two sites separated by 12 km at 24°55′N, 80°31′W near the Key Largo, FL, area) and the Flower Gardens Banks (27°55′N, 93°36′W, 110 km south-southeast of Galveston, TX in the Gulf of Mexico) were analyzed using two separate sets of primers. The number of samples, n, from the Bahamas, Flower Gardens Banks, Crocker and Conch reefs were 22, 28, 17 and 14, respectively. Coral larvae can derive from either local adult populations or immigrate from distant locations. A total of 11 samples from coral larvae (referred to as recruits) from the Flower Gardens Banks reef were also analyzed. The object of the study was to determine the likely source from which the recruits migrated; the authors used discriminant analysis to assign all but one of the recruits to the Flower Gardens banks. The data were nominal variables indicating the presence/absence of PCR products of given lengths generated from two different sets of primers. There were 45 polymorphic markers used in this study. For this dataset, the dimensions were ordered so that the mean across all the samples approximated a cosine-like function; this was achieved by sorting the results for one primer in increasing order and the other primer in decreasing order. Figure 3 Open in new tabDownload slide (A) (Upper panel) is a map of the southeastern United States (made using M. Weinelt, Online Map Creation: http://www.aquarius.geomar.de/omc/make_map.html) showing the locations of the Bahamas (BAH), Crocker and Conch (CC) and Flower Gardens Banks (FGB) coral reefs from which the samples were derived. The grid on the map indicates latitude north and longitude west. (B) Shows the 3D VizStruct mapping of the genotyping results from AFLP analysis of the coral samples. 
The x- and y-axes are the real and imaginary components of the first harmonic of the DFT and the z-axis is the KLD; each point corresponds to a marker and the markers with the highest KLD values are highlighted with the open triangles. (C–E) Show the distribution of the genotypes for three amplification fragments with the highest values of the KLD in the samples from the Bahamas (open circles, n = 22), the Flower Garden Banks (filled circles, n = 28), Crocker (open triangles, n = 17), Conch (filled triangles, n = 14) and the recruits from the Flower Garden Banks (open diamonds, n = 11). The x-axis is the sample number and the y-axis is the genotype; the genotype was coded as 1 if the fragment was absent and 2 if the fragment was present. In the VizStruct results shown in Figure 3B, each point corresponds to a single marker. The three markers with the highest values of KLD were examined in detail (Figure 3C–E). Figure 3C demonstrates that all but one of the samples from the Bahamas (21 of 22) are negative for Marker-8; in contrast, 26 of 28 samples from the Flower Gardens Banks reef, 13 of 17 samples from Crocker reef, 11 of 14 samples from Conch reef and all 11 samples from the recruits were positive for this marker. Figure 3D shows the results for Marker-19, which is associated with a class distinction different from that of Marker-8: Marker-19 is present in all the Flower Gardens Banks reef and recruit samples, but it is absent in the majority of samples from the Bahamas (absent in 17 of 22 samples), Crocker (absent in 9 of 17 samples) and Conch (absent in 6 of 11 samples) reefs. Figure 3E shows the results for Marker-35, which is associated with yet another class distinction: it is absent in all the Crocker reef and 9 of 11 Conch reef samples; however, Marker-35 is present in the majority of Bahamas (present in 15 of 22 samples), Flower Gardens Banks (present in 22 of 28 samples) and recruit samples (present in 10 of 11 samples). These representative results demonstrate that the KLD is capable of identifying informative genetic polymorphisms when multiple classes are present. The 3D VizStruct analysis also indicates that the recruits are most similar to the samples from the Flower Gardens Banks, which is consistent with the findings of Brazeau et al. (18). Discriminant analysis, which was used by Brazeau et al. (18), is the conventional ‘gold standard’ methodology for identifying predictive markers from multi-dimensional datasets. 
We therefore challenged 3D VizStruct by comparing the three markers with the highest KLD in the 3D VizStruct visualization to the markers with the highest weights in the discriminant function: there was an exact concordance between the top three markers identified by both methods. These results extend the useful range of 3D VizStruct visualization capabilities to include multi-class data generated by multi-locus genotyping techniques that are used for poorly characterized genomes. Kullback–Leibler divergence and linkage disequilibrium The relationship between the KLD and linkage disequilibrium (LD) was analyzed to provide further justification for the use of the KLD in VizStruct for SNP visualization. A variety of normalized metrics (e.g. R2 and Lewontin's D′), all of which are based on linkage disequilibrium, D, are widely used in genetic mapping. We therefore first investigated the relationship between the KLD and LD. The starting point for the definition of the measures of linkage disequilibrium is the standard 2 × 2 haplotype frequency table shown in Table 1. Consider two loci, each of which has two alleles. Let A and a denote the major and minor alleles at the first locus and B and b denote the major and minor alleles at the second locus. The proportions of the A, a, B and b alleles are denoted by pA, pa, pB and pb, respectively. Similarly, denote the proportions of the AB, Ab, aB and ab haplotypes by pAB, pAb, paB and pab, respectively. Linkage disequilibrium D is defined by the following equivalent equations (22):
\begin{equation} D = p_{AB}\,p_{ab} - p_{Ab}\,p_{aB} = p_{AB} - p_{A}\,p_{B} = p_{ab} - p_{a}\,p_{b} = -p_{Ab} + p_{A}\,p_{b} = -p_{aB} + p_{a}\,p_{B}. \tag{3} \end{equation}
Because the reference distribution q in the definition of KLD (Equation 2) is based on the assumption of independence, the KLD of the haplotype frequency table is given by the following equation:
\begin{equation} \mathrm{KLD} = p_{AB}\log\frac{p_{AB}}{p_{A}\,p_{B}} + p_{Ab}\log\frac{p_{Ab}}{p_{A}\,p_{b}} + p_{aB}\log\frac{p_{aB}}{p_{a}\,p_{B}} + p_{ab}\log\frac{p_{ab}}{p_{a}\,p_{b}}. \tag{4} \end{equation}
Using Equation 3, Equation 4 can be rewritten in terms of the linkage disequilibrium and the allele frequency terms alone:
\begin{equation} \begin{aligned} \mathrm{KLD} ={}& (D + p_{A}\,p_{B})\log\frac{D + p_{A}\,p_{B}}{p_{A}\,p_{B}} + (p_{A}\,p_{b} - D)\log\frac{p_{A}\,p_{b} - D}{p_{A}\,p_{b}} \\ &+ (p_{a}\,p_{B} - D)\log\frac{p_{a}\,p_{B} - D}{p_{a}\,p_{B}} + (D + p_{a}\,p_{b})\log\frac{D + p_{a}\,p_{b}}{p_{a}\,p_{b}}. \end{aligned} \tag{5} \end{equation}
Table 1. Haplotype table
        B      b      Sum
A       pAB    pAb    pA
a       paB    pab    pa
Sum     pB     pb     1
Equation 5 formally defines the relationship between linkage disequilibrium and KLD: the KLD depends only on the linkage disequilibrium and on the allele frequencies. Figure 4 shows the results from numerical experiments designed to investigate the dependence of KLD on allele frequency. The experiment employed 87 simulated datasets wherein the allele frequencies at one locus (locus B) were kept constant at (pB = 0.9, pb = 0.1) whereas the allele frequency at the other locus was varied from (pA = 0.99, pa = 0.01) to (pA = 0.6, pa = 0.4). The numerical values of the KLD were calculated for various values of linkage disequilibrium (D). 
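The dependence of the KLD on D in Equation 5 can be sketched numerically as follows. This is an illustrative sketch and not the authors' code: the function names `kld_from_d` and `loglog_slope` are hypothetical, natural logarithms are assumed because the text does not state the base, and the least-squares slope of log KLD against log D is only one possible way to summarize the dependence.

```python
import numpy as np

def kld_from_d(d, p_a_major, p_b_major):
    """KLD of a 2 x 2 haplotype table relative to independence (Equation 5).

    d          -- linkage disequilibrium D
    p_a_major  -- frequency of the major allele A at the first locus
    p_b_major  -- frequency of the major allele B at the second locus
    Natural logarithms are an assumption of this sketch.
    """
    pA, pa = p_a_major, 1.0 - p_a_major
    pB, pb = p_b_major, 1.0 - p_b_major
    # Haplotype frequencies expected under independence: AB, Ab, aB, ab
    expected = np.array([pA * pB, pA * pb, pa * pB, pa * pb])
    # Equation 3: D is added to AB and ab and subtracted from Ab and aB
    observed = expected + d * np.array([1.0, -1.0, -1.0, 1.0])
    if np.any(observed < 0):
        raise ValueError("D is outside the range allowed by the allele frequencies")
    mask = observed > 0  # the term 0 * log(0) is taken as 0
    return float(np.sum(observed[mask] * np.log(observed[mask] / expected[mask])))

def loglog_slope(d_values, p_a_major, p_b_major):
    """Least-squares slope of log(KLD) against log(D) for positive D values."""
    d_values = np.asarray(d_values, dtype=float)
    kld = np.array([kld_from_d(d, p_a_major, p_b_major) for d in d_values])
    slope, _intercept = np.polyfit(np.log(d_values), np.log(kld), 1)
    return float(slope)
```

For example, `kld_from_d(0.0, 0.6, 0.9)` is zero because the haplotype table then equals the independence table, and `loglog_slope(np.linspace(1e-4, 1e-3, 10), 0.6, 0.9)` gives a slope close to 2 for small D.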
Figure 4 summarizes the relationship between the linkage disequilibrium and KLD with allele frequency as a parameter on linear (Figure 4A) and logarithmic axes (Figure 4B); the curves in Figure 4A appear to ‘end’ because there are maximum limits to D for a given set of allele frequencies. Figure 4A and B demonstrate that there is a direct relationship between the KLD and D despite their disparate underlying formulations: the KLD is based on an information-theoretical framework, whereas D is the determinant of the haplotype probability table. Over the range examined, for a given allele frequency at one locus, the dependence of the KLD on D is approximated by a power-law relationship of the form $\mathrm{KLD} \propto D^{n}$ (Figure 4B); the exponent of the power-law relationship varied from 1.78 for (pA = 0.99, pa = 0.01) to 2.00 for (pA = 0.6, pa = 0.4). Figure 4 Relationship between the KLD and the linkage disequilibrium D for a range of allele frequencies. The allele frequencies at one locus were kept constant at 0.9 for the major allele (B) and 0.1 for the minor allele (b). The major allele frequencies at the other locus were varied as indicated and were 0.99 (filled circles), 0.95 (open circles), 0.9 (filled triangles) or 0.6 (open triangles). The solid lines are a power-law fit to the results. Figure 4A uses linear axes and Figure 4B shows the same data on logarithmic axes. DISCUSSION The objective of this report was to evaluate 3D VizStruct, a multi-dimensional visualization approach that combines radial visualization with the KLD for visualizing SNP data and for identifying predictive SNPs from large association studies. We analyzed several datasets to demonstrate the usefulness of the 3D VizStruct approach for mining SNP data obtained from studies of large datasets including a densely genotyped candidate gene, LPL, the Y-chromosome and samples of reef coral individuals obtained from the wild. We highlighted the effectiveness of identifying predictive SNPs from 3D VizStruct visualization in two-class and multi-class study designs. Our results demonstrate that the 3D VizStruct approach, with the KLD in particular, is effective for the supervised detection of informative SNPs. The KLD is valuable for identifying class distinctions because it is order-insensitive; however, the availability of the complex Fourier harmonic dimensions for visualization is also important because it spreads the large number of SNPs efficiently in the visualization field, which allows different categories of class distinctions of comparable quality to clearly emerge in multi-class datasets. The KLD is also easy for users to interpret because it represents a ‘distance’ between two distributions, i.e. two SNP distributions that are similar are placed near each other and those that are dissimilar are placed far from each other. In this context, we are currently conducting theoretical studies to examine the relationships between our supervised KLD formulation and the commonly used measures of linkage disequilibrium, e.g. 
the correlation coefficient Δ, Lewontin's D′ (23,24), Yule's Q (25), Kaplan and Weir's proportional difference d (26) and the population attributable risk δ (27), because linkage disequilibrium measures remain the fundamental approach used by population geneticists and genetic epidemiologists to identify disease loci (22) and are frequently used for fine mapping by molecular geneticists; identifying these analytical relationships will facilitate acceptance of the 3D VizStruct visualization method. A common feature of linkage disequilibrium measures is that they assess the difference between the observed and expected frequencies of haplotypes between a ‘disease’ locus and a marker locus of interest (22). Our working hypothesis is that differences between two loci in the KLD dimension are directly related to measures of linkage disequilibrium between the loci. Additional relationships will be systematically derived in further research but preliminary analyses are promising; e.g. the KLD of two loci is zero when the two loci are completely independent of the class distinction and is larger for larger values of linkage disequilibrium. One important and useful difference is that the 3D VizStruct approach does not require computation of pairwise linkage disequilibrium values across loci: each SNP vector is individually projected on the visualization field and this reduces the computation needed. An additional advantage of the Fourier and KLD components of 3D VizStruct is that both are generalizable to multiple SNPs and across a wide range of data types. For example, the same approaches could potentially be extended to visualize the associations between SNPs and quantitative traits (e.g. gene expression, clinical and laboratory parameters). The KLD approach can also be used to identify statistically predictive SNPs because the log-likelihood ratio is equal to the product of the KLD and sample size. 
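The link between the KLD and the log-likelihood ratio can be illustrated with a short sketch. It is not taken from the original work: the function name `g_statistic` and the count-table layout are hypothetical, natural logarithms are assumed, and the P-value uses the standard large-sample result that twice the log-likelihood ratio (2N·KLD) is approximately χ2-distributed with one degree of freedom for a 2 × 2 table under independence.

```python
import math

def g_statistic(counts):
    """Return (KLD, G, P-value) for a 2 x 2 table of haplotype counts.

    counts -- [[n_AB, n_Ab], [n_aB, n_ab]] (hypothetical layout for this sketch)
    KLD is computed against the independence table built from the margins.
    G = 2 * N * KLD is twice the log-likelihood ratio.
    """
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [counts[0][j] + counts[1][j] for j in range(2)]
    kld = 0.0
    for i in range(2):
        for j in range(2):
            p = counts[i][j] / n                       # observed haplotype frequency
            q = row_totals[i] * col_totals[j] / n**2   # expected under independence
            if p > 0:                                  # 0 * log(0) is taken as 0
                kld += p * math.log(p / q)
    g = max(2.0 * n * kld, 0.0)  # guard against tiny negative rounding errors
    # Survival function of chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(g / 2.0))
    return kld, g, p_value
```

With this sketch, a table such as `[[82, 8], [8, 2]]` gives a small KLD and a P-value well above conventional significance thresholds, whereas larger samples with the same frequencies give proportionally larger G values.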
For a 2 × 2 haplotype table with sample size N, twice the log-likelihood ratio, i.e. 2N·KLD with natural logarithms, is asymptotically distributed as χ2 with one degree of freedom under independence. With these simple relationships and the desired level of significance at hand, predictive genes can be identified easily along the KLD axis. The 3D VizStruct approach represents a computationally efficient means to visually examine large complex datasets from diverse areas of research. With the ever-expanding numbers of massive datasets, there is a real need for visualization tools that can quickly encapsulate the information, allowing researchers to more easily interpret data and to identify and summarize the most important features in genome-wide SNP datasets. ACKNOWLEDGEMENTS This work was supported in part by grants from the Kapoor Foundation, National Science Foundation (Research Grant 0234895) and the National Institutes of Health (P20-GM 067650). Funding to pay the Open Access publication charges for this article was provided by the National Institutes of Health. Conflict of interest statement. None declared. REFERENCES
1. Mir K.U. and Southern E.M. (2000) Sequence variation in genes and genomic DNA: methods for large-scale analysis. Annu. Rev. Genomics Hum. Genet., 1, 329–360.
2. Erichsen H.C. and Chanock S.J. (2004) SNPs in cancer research and treatment. Br. J. Cancer, 90, 747–751.
3. Suh Y. and Vijg J. (2005) SNP discovery in associating genetic variation with human disease phenotypes. Mutat. Res., 573, 41–53.
4. Xu H., Gregory S.G., Hauser E.R., Stenger J.E., Pericak-Vance M.A., Vance J.M., Zuchner S., Hauser M.A. (2005) SNPselector: a web tool for selecting SNPs for genetic association studies. Bioinformatics, 21, 4181–4186.
5. Bhadra D. and Garg A. (2001) An Interactive Visual Framework for Detecting Clusters of a Multidimensional Dataset. Technical Report 2001–03, State University of New York, Buffalo.
6. Hoffman P.E., Grinstein G.G., Marx K. (1997) DNA visual and analytic data mining. Proceedings of the IEEE Visualization '97, Phoenix, AZ, pp. 437–441.
7. Zhang L., Zhang A., Ramanathan M. (2002) Visualized classification of multiple sample types. Proceedings of the 2nd Workshop on Data Mining in Bioinformatics (BIOKDD 2002), The ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada.
8. Zhang L., Zhang A., Ramanathan M. (2003) Enhanced visualization of time series through higher Fourier harmonics. Proceedings of the Third ACM SIGKDD Workshop on data mining in bioinformatics (BIOKDD03), pp. 49–56.
9. Zhang L., Zhang A., Ramanathan M. (2004) VizStruct: exploratory visualization for gene expression profiling. Bioinformatics, 20, 85–92.
10. Cover T.M. and Thomas J.A. (1991) Elements of Information Theory. Wiley, NY.
11. Haykin S. (1999) Neural Networks: A Comprehensive Foundation. NY College Publishing Co.
12. Nickerson D.A., Taylor S.L., Weiss K.M., Clark A.G., Hutchinson R.G., Stengard J., Salomaa V., Vartiainen E., Boerwinkle E., Sing C.F. (1998) DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nature Genet., 19, 233–240.
13. Clark A.G., Weiss K.M., Nickerson D.A., Taylor S.L., Buchanan A., Stengard J., Salomaa V., Vartiainen E., Perola M., Boerwinkle E., et al. (1998) Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet., 63, 595–612.
14. Butler J.M. (2005) Forensic DNA Typing: Biology, Technology and Genetics of STR Markers, 2nd edn. Elsevier Press, New York, NY.
15. Hammer M.F., Karafet T.M., Redd A.J., Jarjanazi H., Santachiara-Benerecetti S., Soodyall H., Zegura S.L. (2001) Hierarchical patterns of global human Y-chromosome diversity. Mol. Biol. Evol., 18, 1189–1203.
16. Rosser Z.H., Zerjal T., Hurles M.E., Adojaan M., Alavantic D., Amorim A., Amos W., Armenteros M., Arroyo E., Barbujani G., et al. (2000) Y-chromosomal diversity in Europe is clinal and influenced primarily by geography, rather than by language. Am. J. Hum. Genet., 67, 1526–1543.
17. Hinds D.A., Stuve L.L., Nilsen G.B., Halperin E., Eskin E., Ballinger D.G., Frazer K.A., Cox D.R. (2005) Whole-genome patterns of common DNA variation in three human populations. Science, 307, 1072–1079.
18. Brazeau D.A., Sammarco P.W., Gleason D.F. (2005) A multi-locus genetic assignment technique to assess sources of Agaricia agaricites larvae on coral reefs. Marine Biol., 147, 1141–1148.
19. Vos P. (1998) AFLP fingerprinting of Arabidopsis. Methods Mol. Biol., 82, 147–155.
20. Vos P., Hogers R., Bleeker M., Reijans M., van de Lee T., Hornes M., Frijters A., Pot J., Peleman J., Kuiper M., et al. (1995) AFLP: a new technique for DNA fingerprinting. Nucleic Acids Res., 23, 4407–4414.
21. Peakall R., Gilmore S., Keys W., Morgante M., Rafalski A. (1998) Cross-species amplification of soybean (Glycine max) simple sequence repeats (SSRs) within the genus and other legume genera: implications for the transferability of SSRs in plants. Mol. Biol. Evol., 15, 1275–1287.
22. Devlin B. and Risch N. (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics, 29, 311–322.
23. Lewontin R.C. (1988) On measures of gametic disequilibrium. Genetics, 120, 849–852.
24. Lewontin R.C. and Matsuo Y. (1963) Interaction of genotypes determining viability in Drosophila busckii. Proc. Natl Acad. Sci. USA, 49, 270–278.
25. Yule G.U. (1900) On the association of attributes in statistics. Philos. Trans. R Soc. Lond., 194, 257–319.
26. Kaplan N. and Weir B.S. (1992) Expected behavior of conditional linkage disequilibrium. Am. J. Hum. Genet., 51, 333–343.
27. Levin M.L. and Bertell R. (1978) RE: ‘simple estimation of population attributable risk from case-control studies’. Am. J. Epidemiol., 108, 78–79.
© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Protein–protein interaction analysis by C-terminally specific fluorescence labeling and fluorescence cross-correlation spectroscopy
Oyama, Rieko; Takashima, Hideaki; Yonezawa, Masato; Doi, Nobuhide; Miyamoto-Sato, Etsuko; Kinjo, Masataka; Yanagawa, Hiroshi
doi: 10.1093/nar/gkl477 pmid: 16914444
ABSTRACT Here, we describe novel puromycin derivatives conjugated with iminobiotin and a fluorescent dye that can be linked covalently to the C-terminus of full-length proteins during cell-free translation. The iminobiotin-labeled proteins can be highly purified by affinity purification with streptavidin beads. We confirmed that the purified fluorescence-labeled proteins are useful for quantitative protein–protein interaction analysis based on fluorescence cross-correlation spectroscopy (FCCS). The apparent dissociation constants of model protein pairs such as proto-oncogenes c-Fos/c-Jun and archetypes of the family of Ca2+-modulated calmodulin/related binding proteins were in accordance with the reported values. Further, detailed analysis of the interactions of the components of polycomb group complex, Bmi1, M33, Ring1A and RYBP, was successfully conducted by means of interaction assay for all combinatorial pairs. The results indicate that FCCS analysis with puromycin-based labeling and purification of proteins is effective and convenient for in vitro protein–protein interaction assay, and the method should contribute to a better understanding of protein functions by using the resource of available nucleotide sequences. INTRODUCTION An understanding of the rate and specificity of assembly of biomolecular complexes is essential for a full appreciation of the mechanisms of biological events. Further, currently available information on genome sequences of various organisms can be exploited as a resource for characterizing novel functions of proteins or hypothetical proteins. For this purpose, a high-throughput method is required for functional protein analysis. Fluorescence correlation spectroscopy (FCS) and fluorescence cross-correlation spectroscopy (FCCS) have recently been applied to such important biological problems (1–11). FCS allows monitoring of the individual movements of fluorescence-labeled molecules through a very tiny area (1,2). 
The time-dependent fluorescence autocorrelation function allows us to analyze the relative proportions of species involved in the diffusion. Changes of the proportions can be used to calculate the binding kinetics (3,4,8). FCCS utilizes separate channels to detect two distinct fluorophores, as well as the cross-correlated signals, in real time (5). With FCCS, bound molecules can be detected even if the differences of diffusion are not great. So far, FCCS has been applied to the studies of DNA hybridization (5), PCR (9), enzymatic cleavage of a DNA substrate by EcoRI endonuclease (6,10) and protein–DNA interactions (11). Fluorescence labeling of proteins is a key step for the FCS and FCCS analysis of protein interactions. So far, chemical modifications (12,13) and recombinant fusion tagging with fluorescent proteins (14–17) have been used for fluorescence labeling of proteins. These methods are often useful, but the modifications of internal amino acid residues or the addition of relatively large fluorescent proteins may affect the functions of labeled proteins. As an alternative approach, we have previously developed a puromycin-based method for fluorescence labeling of proteins (18,19). By using this method, various fluorophores can be incorporated into full-length proteins in the presence of a low concentration of fluorophore-conjugated puromycin in a cell-free translation system (11). Small fluorescent probes are expected to be less likely to interfere with the structure or biological function of proteins and cell-free protein synthesis is suitable for a high-throughput format owing to its simplicity. We have previously reported the FCCS analysis of protein–DNA interactions between RhG (rhodamine green)-labeled proteins and Cy5-labeled DNA (11). 
Although high-throughput analysis of protein–protein interactions in solution using FCCS is of great interest, detection of cross-correlations between differently labeled proteins has been difficult, because the labeling efficiency of our method ranges from only 10 to 30% (11), and the remaining unlabeled proteins in solution inhibit the formation of the protein–protein complex carrying both RhG and Cy5. In this study, we have improved the purification process of fluorescence-labeled proteins by using novel iminobiotin-conjugated fluorescent puromycin derivatives to aid the removal of unlabeled proteins, thereby making protein–protein interaction assay using FCCS practically feasible. We used three model systems, proto-oncogenes c-Fos and c-Jun, archetypes of the family of Ca2+-modulated calmodulin (CaM) and CaM-related binding proteins, and the polycomb group (PcG) complex proteins to confirm the usefulness of our method. MATERIALS AND METHODS Synthesis of fluorescent puromycin derivatives NHS-iminobiotin trifluoroacetamide was purchased from Pierce. Iminobiotin-T(Cy5)-dC-puromycin (Figure 1A) and iminobiotin-T(RhG)-dC-puromycin (data not shown) were synthesized and purified as described previously (11), with some modifications (see Supplementary Data). The structural identity of the synthesized fluorescent puromycin analogs was confirmed by MALDI-TOF mass spectrometry (Voyager; Perceptive Biosystems). Figure 1 Materials for fluorescence labeling. (A) The structure of a fluorescent puromycin derivative. A fluorophore (Cy5 or RhG) and iminobiotin were chemically conjugated to puromycin through a linker. (B) DNA construction for fluorescence labeling of proteins. Template DNA consists of SP6 promoter, Omega sequence and an open reading frame (ORF) with a T7·tag at the N-terminus and a polyhistidine tag at the C-terminus, followed by a XhoI restriction enzyme site. 
Preparation of templates for translation In a template DNA, two tags were added to the 5′- and 3′-termini of the open reading frame (11,19) (Figure 1B) by PCR and the fragment was subcloned into a pCR2.1Topo vector (Invitrogen). The DNA template was amplified from the clone by PCR and cleaved with XhoI. The purified DNA was transcribed in an SP6 large-scale RNA production system (Promega). Fluorescence labeling and purification Fluorescence labeling was carried out using the wheat germ extract translation system ‘Proteios’ (Toyobo, Japan) as described in the manufacturer's protocol, except that a fluorophore-conjugated puromycin was added. The translation was terminated by the addition of RNase A (1 μg/0.3 ml; Ambion). The purification of fluorescently labeled proteins was performed at 4°C. The mixture was dialyzed against nickel binding buffer (50 mM phosphate, 150 mM NaCl, 1 mM DTT and 0.05% NP-40, pH 8.0), followed by centrifugation at 16 000 g for 20 min. The supernatant was mixed with 10 μl of Ni-NTA agarose (20) (SuperFlow; Qiagen) for 1 h. The supernatant was removed, and the beads were washed three times, with agitation, in nickel binding buffer (1.0 ml) containing 2.5 mM imidazole and 300 mM NaCl. Proteins were eluted with 50 μl of buffer containing 0.5 M imidazole, pH 8.0. 
The fraction was mixed with 9 vol of 50 mM phosphate buffer (pH 8.0) containing 300 mM NaCl, 5 mM DTT and 0.05% NP-40, then 10 μl of streptavidin–Sepharose (21) (Amersham Pharmacia) was added and the mixture was rocked for 1 h. The beads were washed with the buffer three times. Protein was eluted with 50 μl of buffer (240 mM Tris–HCl, 150 mM NaCl, 0.1 M biotin, 5 mM DTT and 0.1% NP-40, pH 8.0). The protein fraction was mixed with 10 mM DTT and kept at 4°C before use. Immunodetection and fluorescence determination The proteins were detected by enhanced immunoblotting (22) with mouse anti-T7·tag antibody (Novagen) and horseradish peroxidase (HRP)-labeled goat anti-mouse IgG (Transduction Laboratories). The blot was determined semiquantitatively with the T7·tag positive control recombinant protein (Novagen), an ECL detection kit (Amersham Pharmacia) and a CCD camera (ChemiDoc; Bio-Rad). Proteins separated by SDS–PAGE were stained with SyproOrange (Molecular Probes) and detected using a fluorescence image analyzer (excitation at 488 nm and emission at 515–545 nm, Molecular Imager FX; Bio-Rad). The fluorescence yield was spectrophotometrically determined using the fluorescence image analyzer (Cy5 was detected with excitation at 635 nm and emission at 670–720 nm; RhG with excitation at 488 nm and emission at 515–545 nm) and a standard dye with molecular extinction coefficients of ε505 (RhG) = 68 000 cm−1 M−1 (measured at pH 8.0) and ε647 (Cy5) = 250 000 cm−1 M−1. FCCS measurement FCCS measurement was performed on a ConfoCor2 system (Carl Zeiss) as described previously (11). The two pinholes and the cross-correlated volume element were adjusted by measurement (5). All solutions were prepared in water (fluorescence analysis grade; Dojindo, Japan) and filtered through an Ultrafree-MC filter unit (Millipore). Fluorescently labeled proteins were dialyzed against 50 mM phosphate, 150 mM NaCl, 0.1% NP-40 and 1 mM DTT, pH 7.4. 
After centrifugation at 16 000 g for 20 min, differently labeled proteins were mixed in a Lab-Tek 8-well chamber (Nalge Nunc) and kept for 10 min. Interaction of c-Fos with c-Jun was also analyzed in the presence of DNA, poly(dI–dC)·poly(dI–dC) (2 μg/ml; Amersham Pharmacia) and the AP-1 synthetic oligonucleotides of 30 bp (Dateconcept, Sapporo, Japan) (23). CaM interactions were analyzed in the presence of 0.5 mM CaCl2. Two autocorrelation curves and the cross-correlation curve of FCCS data were analyzed by using the fitting algorithm described below in the software package for ConfoCor2 (Carl Zeiss). Theory and data calibration The theoretical background of FCCS analysis has been described by Eigen et al. and Rigler et al. (5,6,9). The fluorescence autocorrelation function and the cross-correlation function were acquired from an online system-controlling computer software package. The normalized cross-correlation function G(τ) is given by
\begin{equation} G_{\mathrm{gr}}(\tau) = 1 + \frac{\langle \delta I_{\mathrm{g}}(t) \cdot \delta I_{\mathrm{r}}(t+\tau) \rangle}{\langle I_{\mathrm{g}} \rangle \cdot \langle I_{\mathrm{r}} \rangle}, \tag{1} \end{equation}
where the indices refer to one or two measured fluorescence signals, Ig and/or Ir. In the case of one fluorescent species, Equation 1 (r = g) defines the normalized autocorrelation function in a single detection channel. ω1 is the radius and ω2 is half of the long axis of the confocal volume element. The structural parameter S is the ratio ω2/ω1. 
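Equation 1 can be illustrated with a direct estimator applied to two sampled intensity traces. This is only a sketch under stated assumptions: the function name `cross_correlation` is hypothetical, the traces are assumed to be equally spaced in time, lags are expressed in sample steps, and the commercial ConfoCor2 software mentioned in the text may compute the correlation differently (e.g. with a hardware or multiple-tau correlator).

```python
import numpy as np

def cross_correlation(i_g, i_r, max_lag):
    """Estimate G_gr(tau) of Equation 1 for lags 0..max_lag (in sample steps).

    i_g, i_r -- equally sampled intensity traces of the green and red channels
    Passing the same trace twice gives the autocorrelation (the r = g case).
    """
    i_g = np.asarray(i_g, dtype=float)
    i_r = np.asarray(i_r, dtype=float)
    if i_g.ndim != 1 or i_g.shape != i_r.shape:
        raise ValueError("traces must be one-dimensional and of equal length")
    n = i_g.size
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(trace)")
    mean_g, mean_r = i_g.mean(), i_r.mean()
    d_g, d_r = i_g - mean_g, i_r - mean_r  # fluctuations delta I(t)
    g = np.empty(max_lag + 1)
    for lag in range(max_lag + 1):
        # average of delta I_g(t) * delta I_r(t + lag) over the overlapping samples
        g[lag] = 1.0 + np.mean(d_g[: n - lag] * d_r[lag:]) / (mean_g * mean_r)
    return g
```

A constant trace in either channel has no fluctuations, so the estimate is 1 at every lag; correlated fluctuations in the two channels raise the amplitude at zero lag above 1.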
Two-component model of the autocorrelation function for translational diffusion in a 3D Gaussian volume element is described as follows:
\begin{equation} G(\tau) = 1 + \frac{1}{N} \cdot \left[ (1-Y) \cdot \left(1+\frac{\tau}{\tau_{D_1}}\right)^{-1} \cdot \left(1+\frac{\tau}{S^{2}\tau_{D_1}}\right)^{-1/2} + Y \cdot \left(1+\frac{\tau}{\tau_{D_2}}\right)^{-1} \cdot \left(1+\frac{\tau}{S^{2}\tau_{D_2}}\right)^{-1/2} \right], \tag{2} \end{equation}
where $\tau_{D_1}$ and $\tau_{D_2}$ are the diffusion times of the faster component and slower component in the assay. Y represents the fraction of fluorescent protein with the diffusion time $\tau_{D_2}$ in the total number of fluorescent particles N. The values of ω1,i (i = g or r) were determined from the diffusion times of rhodamine 6G (Sigma Aldrich; diffusion coefficient D = 2.8 × 10−10 m2 s−1) and Cy5 (mono-reactive dye, Amersham Pharmacia; D = 3.16 × 10−10 m2 s−1).
\begin{equation} \omega_{1,i} = \sqrt{4D \cdot \tau_{D_i}} \tag{3} \end{equation}
The volume elements V are calculated according to
\begin{equation} V_{i} = \pi^{3/2} \cdot \omega_{1,i}^{2} \cdot \omega_{2,i} \tag{4} \end{equation}
\begin{equation} V_{\mathrm{gr}} = \left(\frac{\pi}{2}\right)^{3/2} \left(\omega_{1,\mathrm{g}}^{2} + \omega_{1,\mathrm{r}}^{2}\right) \left(\omega_{2,\mathrm{g}}^{2} + \omega_{2,\mathrm{r}}^{2}\right)^{1/2} \tag{5} \end{equation}
The measured total number of autocorrelated particles NAC,i and complex cross-correlated particles Ncc is given by
\begin{equation} N = \frac{1}{G_{i}(0) - 1} \tag{6} \end{equation}
where in the case of Ncc, Gi(0) indicates Ggr(0). The red emission excited by the green laser Qg (cross-talk fraction) was calculated from the mean count rates of the red channel when excited by both lasers (Cgr) and only the red laser (Cr), using a modification of the method in the application manual of ConfoCor2. 
\begin{equation} Q_{\mathrm{g}} = \frac{C_{\mathrm{gr}} - C_{\mathrm{r}}}{C_{\mathrm{gr}}} \tag{7} \end{equation}
Calculated free molecules Ni and calculated complex molecules Ngr are as follows:
\begin{equation} N_{\mathrm{AC},\mathrm{g}} = N_{\mathrm{g}} + N_{\mathrm{gr}} \tag{8} \end{equation}
\begin{equation} N_{\mathrm{AC},\mathrm{r}} = N_{\mathrm{r}} + Q_{\mathrm{g}} \cdot N_{\mathrm{g}} + \left(1 + Q_{\mathrm{g}}\right) \cdot N_{\mathrm{gr}} \tag{9} \end{equation}
\begin{equation} N_{\mathrm{gr}} = \frac{N_{\mathrm{AC},\mathrm{g}} \cdot \left(N_{\mathrm{AC},\mathrm{r}} + Q_{\mathrm{g}} \cdot N_{\mathrm{AC},\mathrm{g}}\right)}{N_{\mathrm{cc}} - Q_{\mathrm{g}} \cdot N_{\mathrm{AC},\mathrm{g}}}. \tag{10} \end{equation}
The concentrations of each fluorescent protein were calculated with the use of A (Avogadro's number) as follows:
\begin{equation} c_{i} = \frac{N_{i} \cdot Y_{i}}{V_{i} \cdot A}. \tag{11} \end{equation}
The dissociation constants (Kds) are given by
\begin{equation} K_{\mathrm{d}} = \frac{\left(c_{\mathrm{r}} - c_{\mathrm{gr}}\right) \cdot \left(c_{\mathrm{g}} - c_{\mathrm{gr}}\right)}{c_{\mathrm{gr}}}. \tag{12} \end{equation}

RESULTS

Tandem affinity purification of fluorescently labeled proteins

c-Fos(118–211) and c-Jun(216–318) were translated in the presence of iminobiotin–fluorophore-conjugated puromycin, whose structure is presented in Figure 1A. The optimal concentrations of the puromycin derivatives were 12.5 μM as RhG and 25 μM as Cy5, respectively (data not shown). The reaction mixtures were purified by two steps of affinity purification with nickel-chelate beads (Figure 2A) and streptavidin-conjugated beads (Figure 2B). Excess unincorporated dyes and lower molecular weight proteins were not retained on nickel-chelate beads (Figure 2A, lane 2). The fractions eluted with 0.5 M imidazole (Figure 2A, lane 3) were further purified using streptavidin beads.
The flow-through fraction contained <5% of the total fluorescence intensity, but ∼30% of the immunoblotting signal (Figure 2B, lane 2). The biotin-eluted fraction (Figure 2B, lane 3) showed a weaker signal than that of the applied fraction (Figure 2B, lane 1) in immunodetection. These results indicate that unlabeled proteins were successfully removed by the second step of affinity purification. The purified Cy5–protein fraction was identical to one band detected in SDS–PAGE by protein staining with SyproOrange (Figure 2C, lane 2). Similarly, CaM, CaM-binding proteins and PcG proteins were labeled and highly purified (data not shown).

Figure 2. Purification of fluorescence-labeled proteins. Subscript Cy5 or RhG indicates a fluorophore linked to puromycin derivative. Proteins were separated on 15–25% continuous gradient SDS–PAGE and detected using a fluorescence imager (Cy5 or RhG) or αT7·tag antibody. (A) Affinity purification with nickel-chelate resin. Lane 1, in vitro translation products; lane 2, flow-through fractions; and lane 3, eluates with 0.5 M imidazole. (B) Affinity purification with streptavidin resin. Lane 1, nickel-chelate affinity-purified fractions; lane 2, flow-through; and lane 3, eluates with 0.1 M biotin. (C) Protein staining with SyproOrange. Lane 1, in vitro translation products; and lane 2, purified fractions.

Conditions of FCCS analysis with ConfoCor2

The pinhole diameters were adjusted to 70 μm in the green channel and 48 μm in the red channel to provide a sufficient observation volume for our system. The overlap of the excitation volumes between red and green laser lines was achieved by exciting the Cy5 dye with both wavelengths (5). The autocorrelation curve of the red channel with only the 633 nm laser (red line in Supplementary Figure 1) was coincident with the curve of the channel with only the 488 nm laser (blue line in Supplementary Figure 1). The particle numbers detected in the red channel and in the green channel were 24.1 and 23.7, respectively. The structural parameter S was calculated as 5 for the green channel and as 7 for the red channel. The diffusion time of ∼10 nM rhodamine 6G in the green channel was 30 μs and that of ∼10 nM Cy5 dye in the red channel was 44 μs in the laser power range used. The effective volume elements were Vg = ∼0.19 fl in the green channel, Vr = ∼0.41 fl in the red channel and Vgr = ∼0.26 fl in the cross-correlated channel (see Equations 4 and 5 in Materials and Methods for definitions). The differences between these detection volumes were accounted for in the data analysis according to Equation 11. The calculated diffusion coefficients of iminobiotin-T(RhG)-dC-puromycin and iminobiotin-T(Cy5)-dC-puromycin were 2.1 and 2.3 × 10−10 m2 s−1, respectively, using FCS analysis. The cross-talk from red to green was zero whereas the cross-talk from green to red was ∼10%: this was accounted for in the calculation of complex concentration according to Equation 10.
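The calibration chain of Equations 3–5 (diffusion time to radius to detection volume) can be sketched in Python as below. This is only an illustration under stated assumptions: the function names are ours, ω2 is taken as S·ω1, and the inputs are the rounded values quoted in the text, so the outputs fall in the same sub-femtolitre range as the reported volumes but need not reproduce them exactly.

```python
import math

M3_TO_FL = 1e18  # 1 m^3 = 1e18 fl


def omega1(diff_coeff: float, tau_d: float) -> float:
    """Lateral radius in m from Equation 3: omega_1 = sqrt(4 * D * tau_D)."""
    return math.sqrt(4.0 * diff_coeff * tau_d)


def volume(w1: float, s: float) -> float:
    """Single-channel volume in m^3 from Equation 4, assuming omega_2 = S * omega_1."""
    return math.pi ** 1.5 * w1 ** 2 * (s * w1)


def cross_volume(w1_g: float, s_g: float, w1_r: float, s_r: float) -> float:
    """Cross-correlation volume in m^3 from Equation 5."""
    w2_g, w2_r = s_g * w1_g, s_r * w1_r
    return (math.pi / 2.0) ** 1.5 * (w1_g ** 2 + w1_r ** 2) * math.sqrt(w2_g ** 2 + w2_r ** 2)


# Rounded calibration values quoted in the text (rhodamine 6G, green; Cy5, red)
w1_g = omega1(2.8e-10, 30e-6)
w1_r = omega1(3.16e-10, 44e-6)
v_g_fl = volume(w1_g, 5.0) * M3_TO_FL
v_r_fl = volume(w1_r, 7.0) * M3_TO_FL
v_gr_fl = cross_volume(w1_g, 5.0, w1_r, 7.0) * M3_TO_FL
print(f"Vg = {v_g_fl:.2f} fl, Vr = {v_r_fl:.2f} fl, Vgr = {v_gr_fl:.2f} fl")
```

Because V scales with the cube of ω1, small rounding differences in the diffusion times or in S shift the computed volumes noticeably, which is one plausible reason for deviations from the reported figures.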
FCCS analysis of c-Fos and c-Jun

The fractions of fluorescently labeled proteins to fluorescent particles were c-FosCy5 71%, c-JunCy5 68%, c-FosRhG 69% and c-JunRhG 67% when the functions were fitted to two-component models with diffusion times corresponding to those of the fluorescent derivatives using FCS analysis (Figure 3). Diffusion coefficients of c-FosRhG, c-FosCy5, c-JunRhG and c-JunCy5 were calculated to be 7.6–8.1 × 10−11 m2 s−1. As shown in Supplementary Figure 2, the diffusion coefficients of fluorescent-puromycin-labeled proteins were consistent with the predictions of the Stokes–Einstein theory (24). Concentrations of fluorescently labeled proteins were calculated from the autocorrelation functions in the FCCS analysis (Figure 3A–C, upper panels). The apparent Kd values calculated with the equilibrium data are summarized in Table 1. The translational diffusion time of c-Fos homodimer was determined after the addition of AP-1 DNA by using the cross-correlation function (Figure 3A, lower panel). In the presence of the AP-1 DNA sequence, the Kd of the heterodimer was ∼80-fold lower than that of c-Fos homodimer (Table 1). In the case of the heterodimer and c-Jun homodimer, the Kd decreased to ∼70% after the addition of the AP-1 sequence. The cross-correlation function of c-FosRhG/c-JunCy5 gave the diffusion coefficient of the cross-correlated complex as 7.0 × 10−11 m2 s−1 in the absence of the AP-1 sequence and 4.4 × 10−11 m2 s−1 in its presence (Figure 3C, lower panel). The Kd of c-FosCy5/c-JunRhG was determined to be 7 × 10−8 M in the absence of the AP-1 sequence and 5 × 10−8 M in its presence (data not shown).

Figure 3. FCCS analysis of AP-1-binding proteins. The autocorrelation function (upper panels) and cross-correlation function (lower panels) of c-FosCy5 and c-FosRhG (A), c-JunCy5 and c-JunRhG (B), and c-JunCy5 and c-FosRhG (C). In the autocorrelation plot, red and green lines represent Cy5 and RhG. The dashed curves (blue in cross-correlation function) represent data obtained after the addition of 50 nM AP-1 oligonucleotides.

FCCS analysis of CaM and CaM-binding proteins

The diffusion coefficients of CaM(1–149)RhG, calcineurin A(328–521)Cy5, Rab3A(1–219)Cy5 and caldesmon(302–564)Cy5 were calculated to be 7.6, 7.3, 6.8 and 7.0 × 10−11 m2 s−1, respectively. Variations of the cross-correlation function between CaM and CaM-binding proteins were observed in the presence of Ca2+ (see Figure 4A–C, solid curves). The diffusion times of the cross-correlated functions were determined, except for that of Rab3A, using the analysis software. The amplitudes of the cross-correlation functions were reduced by the addition of EGTA (Figure 4B and C, dashed blue curves), indicating the involvement of Ca2+-mediated interactions. The calculated Kds after the addition of EGTA indicated a non-specific-binding interaction or the background of the detection procedure. The significant Kd values were determined to be 2–5 × 10−7 M in the assay (Table 1).

Figure 4. Cross-correlation function between CaMRhG and CaM-binding proteins: Rab3ACy5 (A), caldesmonCy5 (B) and calcineurin AαCy5 (C). The dashed blue curves represent data obtained after the addition of 5 mM EGTA.
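The two-component fits referred to in the FCCS sections above rest on Equation 2, and particle numbers are read from the zero-lag amplitude via Equation 6. A minimal Python sketch of both relations follows; the function and argument names are ours, and the 150 μs diffusion time in the usage note is a hypothetical value chosen only for illustration. The actual fitting was done in the ConfoCor2 software, which this sketch does not reproduce.

```python
def autocorr_two_component(tau: float, n: float, y: float,
                           tau_d1: float, tau_d2: float, s: float) -> float:
    """Equation 2: two-component 3D Gaussian diffusion model.

    n: total particle number N; y: fraction Y of the slower component;
    tau_d1, tau_d2: diffusion times of the faster and slower components;
    s: structural parameter S = omega_2 / omega_1.
    """
    def diffusion_term(tau_d: float) -> float:
        # (1 + tau/tau_D)^-1 * (1 + tau/(S^2 tau_D))^-1/2
        return (1.0 + tau / tau_d) ** -1 * (1.0 + tau / (s ** 2 * tau_d)) ** -0.5

    return 1.0 + ((1.0 - y) * diffusion_term(tau_d1) + y * diffusion_term(tau_d2)) / n


def particle_number(g_zero: float) -> float:
    """Equation 6: N = 1 / (G(0) - 1)."""
    return 1.0 / (g_zero - 1.0)
```

For example, `particle_number(autocorr_two_component(0.0, 24.1, 0.7, 44e-6, 150e-6, 7.0))` recovers the particle number 24.1 passed in, because both diffusion terms equal 1 at zero lag regardless of Y.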
Table 1. Concentrations of fluorescent proteins and apparent Kd values determined using FCCS in this study

Cy5-labeled protein (nM)   RhG-labeled protein (nM)   Addition    Kd (nM)
Fos (18.4)                 Fos (18.0)                 none        ND
Fos (19.2)                 Fos (17.9)                 AP-1 DNA    3720
Jun (8.1)                  Jun (15.2)                 none        270
Jun (7.8)                  Jun (14.6)                 AP-1 DNA    190
Jun (4.8)                  Fos (12.3)                 none        69
Jun (4.5)                  Fos (11.5)                 AP-1 DNA    45
Rab3A (4.8)                CaM (8.1)                  none        ND
Rab3A (4.5)                CaM (7.8)                  EGTA        ND
Caldesmon (6.1)            CaM (8.3)                  none        500
Caldesmon (5.7)            CaM (7.5)                  EGTA        2300
Calcineurin (11.2)         CaM (7.8)                  none        160
Calcineurin (11.5)         CaM (7.9)                  EGTA        2200
M33 (5.6)                  Bmi1 (7.8)                 none        92
M33 (4.6)                  RYBP (5.3)                 none        70
M33 (5.2)                  Ring1A (7.7)               none        51
Ring1A (4.6)               RYBP (16.3)                none        74
Bmi1 (4.4)                 RYBP (14.7)                none        2300
Bmi1 (4.1)                 Ring1A (5.5)               none        2000

ND, not determined.

FCCS analysis of PcG complex proteins

The diffusion coefficients of fluorescently labeled proteins M33(1–519)Cy5, Bmi1(1–326)Cy5, Bmi1(1–326)RhG, Ring1A(201–377)Cy5, Ring1A(201–377)RhG, RYBP(92–228)Cy5 and RYBP(92–228)RhG were 4.1, 6.1, 6.6, 7.2, 7.4, 7.3 and 7.5 × 10−11 m2 s−1, respectively. Variations of cross-correlation function were observed for Bmi1RhG/M33Cy5, M33Cy5/Ring1ARhG, M33Cy5/RYBPRhG and RYBPRhG/Ring1ACy5 (solid curves shown in Figure 5A–D and Table 1). The significant interactions are shown schematically in Figure 6. M33 appeared to mediate the association.
To confirm the role of M33, we examined the association with the mediator using FCCS. Interestingly, the amplitude of the cross-correlation function of Bmi1Cy5/Ring1ARhG was increased by the addition of non-labeled M33 (dashed blue curves shown in Figure 5F). The diffusion coefficient of the cross-correlated complex was 2.7 × 10−11 m2 s−1, corresponding to ∼120 kDa. The molecular brightness (C/M) was not altered by the addition of a non-labeled protein (19.5–20.0 kHz in the red channel and 12.7–11.3 kHz in the green channel). In contrast, the effect of the addition of M33 on the interactions of Ring1A/RYBP and Bmi1/RYBP was not significant (dashed blue curves in Figure 5D and E).

Figure 5. Cross-correlation function of M33Cy5 and Bmi1RhG (A), M33Cy5 and Ring1ARhG (B), M33Cy5 and RYBPRhG (C), Ring1ACy5 and RYBPRhG (D), Bmi1Cy5 and RYBPRhG (E), and Bmi1RhG and Ring1ACy5 (F). Dashed blue curves represent data obtained after the addition of non-labeled M33 (2 nM).

Figure 6. Schematic diagram of association of polycomb gene complex proteins. Arrows indicate interactions between the proteins as judged from the apparent Kd values in this study. Gray areas indicate triplet interaction detected using FCCS.
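The apparent Kd values reported in Table 1 derive from Equations 11 and 12 in Materials and Methods. The Python sketch below shows the arithmetic only; the function names and the numbers in the usage note are hypothetical and are not data from this study, and the volume is assumed to be supplied in litres so that the concentration comes out in mol/l.

```python
AVOGADRO = 6.02214076e23  # particles per mole


def concentration(n_particles: float, labeled_fraction: float, volume_l: float) -> float:
    """Equation 11: c = N * Y / (V * A), with V in litres, giving mol/l."""
    return n_particles * labeled_fraction / (volume_l * AVOGADRO)


def dissociation_constant(c_r: float, c_g: float, c_gr: float) -> float:
    """Equation 12: Kd = (c_r - c_gr) * (c_g - c_gr) / c_gr.

    c_r, c_g: total concentrations of the red- and green-labeled proteins;
    c_gr: concentration of the doubly labeled complex (all in mol/l).
    """
    if c_gr <= 0.0:
        raise ValueError("complex concentration must be positive")
    return (c_r - c_gr) * (c_g - c_gr) / c_gr


# Hypothetical illustration: 10 nM of each partner, 2 nM of complex
kd_example = dissociation_constant(10e-9, 10e-9, 2e-9)
```

With these illustrative inputs the free concentrations are 8 nM each, so Kd = 8 × 8 / 2 = 32 nM. As a scale check, a single particle in a 1 fl volume corresponds to roughly 1.7 nM, which is why particle numbers of order 10 in sub-femtolitre volumes translate into the nanomolar concentrations listed in Table 1.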
DISCUSSION Purification of fluorescently labeled proteins by using a secondary affinity tag, iminobiotin, introduced on to fluorescent puromycin as described here, improved the sensitivity for FCCS analysis of interactions between two distinct fluorescence-labeled proteins. Indeed, the c-JunRhG/c-JunCy5 interactions both with and without non-labeled AP-1 oligonucleotide could be detected in this study, whereas the interaction among c-JunRhG/Cy5-labeled AP-1/non-labeled Jun was not detected in the previous study (11). The apparent Kd of c-Fos/c-Jun/AP-1 found in this study was in good agreement with reported values (25,26). The Kd of c-Jun homodimer and AP-1 sequence also coincided with the value of 140 nM determined previously (25). Further, the Kd of CaM and caldesmon was in agreement with the reported value of 550 nM (27). The apparent Kd was independent of the concentrations of fluorescence-labeled proteins (data not shown). These results indicate that FCCS analysis with puromycin-based labeling of proteins is effective and convenient for protein–protein interaction assay, and that the puromycin derivatives and affinity tags did not interfere substantially with the protein interactions. It should be noted that the Kd values obtained from FCCS are minimum estimates because small amounts of unlabeled proteins may remain. The interaction of c-Fos homodimer and CaM/Rab3A mediated by Ca2+ could not be identified in this study. The Kd of c-Fos homodimer and AP-1 sequence was previously reported to be ∼6 μM (28). The Kd of CaM/Rab3A was also reported to be 20−50 μM (29,30). The interaction of c-Fos homodimer (and c-Jun homodimer) in this study might include interactions between single-colored proteins, but the molecular brightness was not greater than that of other probed proteins (data not shown). Such weak interactions might be detected if the concentrations of fluorescently labeled proteins were increased. 
A surface plasmon resonance (SPR) biosensor allows real-time analysis of specific interactions on a solid phase, whereas FCS and FCCS detect interactions in solution. Schubert et al. (31) compared the entropic contribution to the free energy between SPR and FCS and concluded that the reaction entropy determined from an SPR experiment was lower than that from an FCS experiment. Indeed, the Kd between CaM and calcineurin was determined as 1.7 × 10−8 M by means of an SPR biosensor (32), and this is 10 times lower than our value using FCCS. Similarly, interaction assay of c-Fos/c-Jun heterodimer immobilized on a polystyrene tray gave a Kd of 1 nM (33), whereas our FCCS analysis gave 70 nM. Although the immobilizing method may be advantageous for the detection of protein interactions with low affinity, we believe that Kd values in living cells are likely to be more similar to those determined using FCCS in solution than to those determined on a solid phase. The PcG proteins form multimeric complexes that bind to specific genomic sites of polycomb repressive elements (34). We applied FCCS to analyze in detail the individual associations of some PcG proteins by interaction assay of the pairs under homogeneous conditions. As shown in Table 1, significant interactions were found among M33/Bmi1, M33/Ring1A, M33/RYBP and Ring1A/RYBP, respectively, as previously confirmed by the yeast two-hybrid method and protein pulldown assay (35,36). It appears that M33 is a mediator in the association of these proteins (Figure 6), but only the association of Bmi1/M33/Ring1A was confirmed (Figure 5F). The association of Bmi1/M33/Ring1A was also supported by applying a three-component model to fit the autocorrelation function of Bmi1 after the addition of non-labeled M33 (data not shown). Bmi1, M33 and Ring1A are components of a stable core PcG repressive complex, according to a biochemical study (37). 
Interestingly, our results suggest that RYBP may interact with M33 or Ring1A in the free form without the formation of a core complex. This is consistent with the idea that RYBP plays a role in recruiting PcG components (38). The FCCS analysis of the components of PcG complex proteins presented here should be a good model for detailed analysis of other protein complexes. For example, use of puromycin-based fluorescently labeled proteins would allow FCCS analysis, as well as FCS analysis, of the dynamics of complex formation of retinoblastoma tumor suppressor complex (39). The range of detectable interactions should be improved by using FCCS. The tandem affinity purification method using a polyhistidine tag and an iminobiotin tag was further applied to over 30 proteins and all but three were sufficiently purified for FCCS analysis. We also observed the interactions between IgG and its binding domain ZZ region (40), and between Smac (second mitochondria-derived activator of caspase or DIABLO) and XIAP (X-linked inhibitor of apoptosis protein, data not shown) (41). Combinations of two affinity tags are expected to help high-throughput purification of the fluorescently labeled proteins, because nickel-chelate beads and streptavidin beads for high-throughput robotic systems are already available from several vendors. Thus, the method presented in this paper should be applicable to a large-scale analysis of protein–protein interactions and should also contribute to the elucidation of protein functions in the post-genomic era. ACKNOWLEDGEMENTS We thank Megumi Nakamura for the preparation of puromycin derivatives and Yuko Oishi for the preparation of CaM-binding protein plasmid DNAs. This work was supported by Special Coordination Funds of the Science and Technology Agency (Ministry of Education, Culture, Sports, Science and Technology) of the Japanese Government. Funding to pay the Open Access publication charges for this article was provided by Keio University. 
Conflict of interest statement. None declared.

REFERENCES

1. Magde D., Elson E.L., Webb W.W. (1974) Fluorescence correlation spectroscopy. II. An experimental realization. Biopolymers, 13, 29–61.
2. Eigen M., Rigler R. (1994) Sorting single molecules: application to diagnostics and evolutionary biotechnology. Proc. Natl Acad. Sci. USA, 91, 5740–5747.
3. Pack C.G., Nishimura G., Tamura M., Aoki K., Taguchi H., Yoshida M., Kinjo M. (1999) Analysis of interaction between chaperonin GroEL and its substrate using fluorescence correlation spectroscopy. Cytometry, 36, 247–253.
4. Wolcke J., Reimann M., Klumpp M., Gohler T., Kim E., Deppert W. (2003) Analysis of p53 ‘latency’ and ‘activation’ by fluorescence correlation spectroscopy. Evidence for different modes of high affinity DNA binding. J. Biol. Chem., 278, 32587–32595.
5. Schwille P., Meyer-Almes F.J., Rigler R. (1997) Dual-color fluorescence cross-correlation spectroscopy for multicomponent diffusional analysis in solution. Biophys. J., 72, 1878–1886.
6. Kettling U., Koltermann A., Schwille P., Eigen M. (1998) Real-time enzyme kinetics monitored by dual-color fluorescence cross-correlation spectroscopy. Proc. Natl Acad. Sci. USA, 95, 1416–1420.
7. Koltermann A., Kettling U., Bieschke J., Winkler T., Eigen M. (1998) Rapid assay processing by integration of dual-color fluorescence cross-correlation spectroscopy: high throughput screening for enzyme activity. Proc. Natl Acad. Sci. USA, 95, 1421–1426.
8. Kinjo M., Nishimura G., Koyama T., Mets Ü., Rigler R. (1998) Single-molecule analysis of restriction DNA fragments using fluorescence correlation spectroscopy. Anal. Biochem., 260, 166–172.
9. Rigler R., Foldes-Papp Z., Meyer-Almes F.J., Sammet C., Volcker M., Schnetz A. (1998) Fluorescence cross-correlation: a new concept for polymerase chain reaction. J. Biotechnol., 63, 97–109.
10. Winkler T., Kettling U., Koltermann A., Eigen M. (1999) Confocal fluorescence coincidence analysis: an approach to ultra high-throughput screening. Proc. Natl Acad. Sci. USA, 96, 1375–1378.
11. Doi N., Takashima H., Kinjo M., Sakata K., Kawahashi Y., Oishi Y., Oyama R., Miyamoto-Sato E., Sawasaki T., Endo Y., et al. (2002) Novel fluorescence labeling and high-throughput assay technologies for in vitro analysis of protein interactions. Genome Res., 12, 487–492.
12. Patel L.R., Curran T., Kerppola T.K. (1994) Energy transfer analysis of Fos-Jun dimerization and DNA binding. Proc. Natl Acad. Sci. USA, 91, 7360–7364.
13. Diebold R.J., Rajaram N., Leonard D.A., Kerppola T.K. (1998) Molecular basis of cooperative DNA bending and oriented heterodimer binding in the NFAT1–Fos–Jun–ARRE2 complex. Proc. Natl Acad. Sci. USA, 95, 7915–7920.
14. Haupts U., Maiti S., Schwille P., Webb W.W. (1998) Dynamics of fluorescence fluctuations in green fluorescent protein observed by fluorescence correlation spectroscopy. Proc. Natl Acad. Sci. USA, 95, 13573–13578.
15. Kohl T., Heinze K.G., Kuhlemann R., Koltermann A., Schwille P. (2002) A protease assay for two-photon crosscorrelation and FRET analysis based solely on fluorescent proteins. Proc. Natl Acad. Sci. USA, 99, 12161–12166.
16. Kim S.A., Heinze K.G., Waxham M.N., Schwille P. (2004) Intracellular calmodulin availability accessed with two-photon cross-correlation. Proc. Natl Acad. Sci. USA, 101, 105–110.
17. Kogure T., Karasawa S., Araki T., Saito K., Kinjo M., Miyawaki A. (2006) A fluorescent variant of a protein from the stony coral Montipora facilitates dual-color single-laser fluorescence cross-correlation spectroscopy. Nat. Biotechnol., 24, 577–581.
18. Nemoto N., Miyamoto-Sato E., Yanagawa H. (1999) Fluorescence labeling of the C-terminus of proteins with a puromycin analogue in cell-free translation systems. FEBS Lett., 462, 43–46.
19. Miyamoto-Sato E., Takashima H., Fuse S., Sue K., Ishizaka M., Tateyama S., Horisawa K., Sawasaki T., Endo Y., Yanagawa H. (2003) Highly stable and efficient mRNA templates for mRNA–protein fusions and C-terminally labeled proteins. Nucleic Acids Res., 31, e78.
20. Hochuli E., Dobeli H., Schacher A. (1987) New metal chelate adsorbent selective for proteins and peptides containing neighbouring histidine residues. J. Chromatogr., 411, 177–184.
21. Hofmann K., Wood S.W., Brinton C.C., Montibeller J.A., Finn F.M. (1980) Iminobiotin affinity columns and their application to retrieval of streptavidin. Proc. Natl Acad. Sci. USA, 77, 4666–4668.
22. Oyama R., Yamamoto H., Titani K. (2000) Glutamine synthetase, hemoglobin alpha-chain, and macrophage migration inhibitory factor binding to amyloid beta-protein: their identification in rat brain by a novel affinity chromatography and in Alzheimer's disease brain by immunoprecipitation. Biochim. Biophys. Acta, 1479, 91–102.
23. Yen J., Wisdom R.M., Tratner I., Verma I.M. (1991) An alternative spliced form of FosB is a negative regulator of transcriptional activation and transformation by Fos proteins. Proc. Natl Acad. Sci. USA, 88, 5077–5081.
24. Krouglova T., Vercammen J., Engelborghs Y. (2004) Correct diffusion coefficients of proteins in fluorescence correlation spectroscopy. Application to tubulin oligomers induced by Mg2+ and Paclitaxel. Biophys. J., 87, 2635–2646.
25. John M., Leppik R., Busch S.J., Granger-Schnarr M., Schnarr M. (1996) DNA binding of Jun and Fos bZip domains: homodimers and heterodimers induce a DNA conformational change in solution. Nucleic Acids Res., 24, 4487–4494.
26. Kwon H., Park S., Lee S., Lee D.K., Yang C.H. (2001) Determination of binding constant of transcription factor AP-1 and DNA. Application of inhibitors. Eur. J. Biochem., 268, 565–572.
27. Shirinsky V.P., Bushueva T.L., Frolova S.I. (1988) Caldesmon–calmodulin interaction. Study by the method of protein intrinsic tryptophan fluorescence. Biochem. J., 255, 203–208.
28. O'Shea E.K., Rutkowski R., Stafford W.F. III, Kim P.S. (1989) Preferential heterodimer formation by isolated leucine zippers from fos and jun. Science, 245, 646–648.
29. Park J.B., Farnsworth C.C., Glomset J.A. (1997) Ca2+/calmodulin causes Rab3A to dissociate from synaptic membranes. J. Biol. Chem., 272, 20857–20865.
30. Coppola T., Perret-Menoud V., Luthi S., Farnsworth C.C., Glomset J.A., Regazzi R. (1999) Disruption of Rab3–calmodulin interaction, but not other effector interactions, prevents Rab3 inhibition of exocytosis. EMBO J., 18, 5885–5891.
31. Schubert F., Zettl H., Hafner W., Krauss G., Krausch G. (2003) Comparative thermodynamic analysis of DNA–protein interactions using surface plasmon resonance and fluorescence correlation spectroscopy. Biochemistry, 42, 10288–10294.
32. Takano E., Hatanaka M., Maki M. (1994) Real-time-analysis of the calcium-dependent interaction between calmodulin and a synthetic oligopeptide of calcineurin by a surface plasmon resonance biosensor. FEBS Lett., 352, 247–250.
33. Heuer K.H., Mackay J.P., Podzebenko P., Bains N.P., Weiss A.S., King G.F., Easterbrook-Smith S.B. (1996) Development of a sensitive peptide-based immunoassay: application to detection of the Jun and Fos oncoproteins. Biochemistry, 35, 9069–9075.
34. Levine S.S., King I.F., Kingston R.E. (2004) Division of labor in polycomb group repression. Trends Biochem. Sci., 29, 478–485.
35. Hashimoto N., Brock H.W., Nomura M., Kyba M., Hodgson J., Fujita Y., Takihara Y., Shimada K., Higashinakagawa T. (1998) RAE28, BMI1, and M33 are members of heterogeneous multimeric mammalian polycomb group complexes. Biochem. Biophys. Res. Commun., 245, 356–365.
36. Alkema M.J., Bronk M., Verhoeven E., Otte A., van't Veer L.J., Berns A., van Lohuizen M. (1997) Identification of Bmi1-interacting proteins as constituents of a multimeric mammalian polycomb complex. Genes Dev., 11, 226–240.
37. Levine S.S., Weiss A., Erdjument-Bromage H., Shao Z., Tempst P., Kingston R.E. (2002) The core of the polycomb repressive complex is compositionally and functionally conserved in flies and humans. Mol. Cell. Biol., 22, 6070–6078.
38. García E., Marcos-Gutiérrez C., del Mar Lorente M., Moreno J.C., Vidal M. (1999) RYBP, a new repressor protein that interacts with components of the mammalian polycomb complex, and with the transcription factor YY1. EMBO J., 18, 3404–3418.
39. Angus S.P., Solomon D.A., Kuschel L., Hennigan R.F., Knudsen E.S. (2003) Retinoblastoma tumor suppressor: analyses of dynamic behavior in living cells reveal multiple modes of regulation. Mol. Cell. Biol., 23, 8172–8188.
40. Nilsson B., Moks T., Jansson B., Abrahmsen L., Elmblad A., Holmgren E., Henrichson C., Jones T.A., Uhlen M. (1987) A synthetic IgG-binding domain based on staphylococcal protein A. Protein Eng., 1, 107–113.
41. Du C., Fang M., Li Y., Li L., Wang X. (2000) Smac, a mitochondrial protein that promotes cytochrome c-dependent caspase activation by eliminating IAP inhibition. Cell, 102, 33–42.

Author notes
Present addresses: Rieko Oyama, RIKEN, Genome Science Laboratory, 2-1 Hirosawa, Wako 351-0198, Japan; Masato Yonezawa, Research Institute of Molecular Pathology (IMP), Vienna, Austria.

© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
NAR launches new Advance Access publication model
Saxby, Claire
doi: 10.1093/nar/gkl523
pmid: N/A
With issue 14, Nucleic Acids Research launches a new rapid ‘online publication ahead of print’ model which we refer to as ‘Advance Access’. Once papers have been copyedited, typeset and corrected, they will be published online on the journal's Advance Access page, prior to pagination and appearance in an issue of the journal. This will enable us to offer authors rapid publication online three weeks after acceptance just as we do at present, with papers then appearing in an issue of the journal shortly afterwards. However, Advance Access will also bring greater flexibility for authors and readers than the previous continuous publication model. For example, it will enable us to publish individual papers for special issues, including the Web Server and Database issues, online as soon as they are ready, whereas at present these papers cannot be made available online (and therefore cannot be read or cited) until the entire issue of which they are part is ready for publication. Based on feedback from users, we will also provide readers with the option of viewing Advance Access articles either by date published (most recent first) or by subject category. In contrast to the continuous publication model, the launch of Advance Access means that, for the short time before their publication in an issue, papers will be citable by a Digital Object Identifier (DOI) rather than page numbers. Across the publishing community, an automatically generated DOI is already attached to each article once it is accepted for publication, providing both a unique means of identification and a persistent link to its location on the internet. Indeed DOIs already appear on every version of an NAR article, including the final versions in print and online, and reprints. Importantly an article's DOI remains the same even if different versions of recognisably the same article appear successively. 
The DOI will always point to the latest version of the article; previous versions will be available via the latest version, along with information about how the versions differ. The online version of NAR is highly used and is increasingly considered the definitive version of the journal. Currently on average the journal web site attracts over 460,000 full-text downloads per month. In an online world we foresee traditional print page numbers becoming increasingly less relevant, while other means of identifying and citing articles, such as the DOI, become the norm. Of course, while we continue to publish a print version of the journal, we will also paginate papers when compiling them for an issue. At this time, a significant number of libraries still subscribe to the print version of NAR and while there remains a desire within the community for a printed version of the journal to be available, we want to continue to provide this option. To clarify, papers published in NAR Advance Access will be citable using the DOI, e.g. Jovanovic,M. and Dynan,W.S. (2006). Terminal DNA structure and ATP influence binding parameters of the DNA-dependent protein kinase at an early step prior to DNA synapsis. Nucleic Acids Res., doi:10.1093/nar/gkj504. The same paper when published in an issue can be cited as follows: Jovanovic,M. and Dynan,W.S. (2006). Terminal DNA structure and ATP influence binding parameters of the DNA-dependent protein kinase at an early step prior to DNA synapsis. Nucleic Acids Res., 34, 1112–1120. The adoption of Advance Access is the latest development made by NAR with the aim of better serving our authors and readers. We are proud of NAR's history of experimenting with new models and functionality. For example, in 1999 we launched an innovative online-only Methods section which today has high impact and is popular with authors. More recently, experimentation with the Database and Web Server issues in 2004 led us to adopt a full open access model in 2005. 
Many of our readers and authors are aware that feedback from the journal's community has been as vital as our experimentation in informing how NAR moves forward. We encourage all of you to contact us, with comments on any aspect of the journal, including the new Advance Access model, our open access initiative, or with suggestions for online functionality that you would find useful. As you and your colleagues have done in the past, tell us how you use the journal, and how it could better serve your needs, and we will take these into account as we plan NAR's path for the future. © 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Induction of single chain tetracycline repressor requires the binding of two inducers
Kamionka, Annette; Majewski, Marius; Roth, Karin; Bertram, Ralph; Kraft, Christine; Hillen, Wolfgang
doi: 10.1093/nar/gkl316 pmid: 16899452
ABSTRACT In this article we report the in vivo and in vitro characterization of single chain tetracycline repressor (scTetR) variants in Escherichia coli. ScTetR is genetically and proteolytically stable and exhibits the same regulatory properties as dimeric TetR in E.coli. Urea-dependent denaturation of scTetR is independent of the protein concentration and follows the two-state model with a monophasic transition. Contrary to dimeric TetR, scTetR allows the construction of scTetR mutants, in which one subunit contains a defective inducer binding site while the other is functional. We have used this approach to establish that scTetR needs occupation of both inducer binding sites for in vivo and in vitro induction. Single mutations causing loss of induction in dimeric TetR lead to non-inducible scTetR when inserted into one half-side. The construction of scTetR H64K S135L S138I (scTetRi2) in which one half-side is specific for 4-dedimethylamino-anhydrotetracycline (4-ddma-atc) and the other for tetracycline (tc) leads to a protein which is only inducible by the mixture of tc and 4-ddma-atc. Fluorescence titration of scTetRi2 with both inducers revealed distinct occupancy with each of these inducers yielding roughly a 1:1 stoichiometry of each inducer per scTetRi2. The properties of this gain of function mutant clearly demonstrate that scTetR requires the binding of two inducers for induction of transcription. INTRODUCTION Gene regulation in bacteria occurs mostly at the level of transcription mediated by DNA binding proteins like activators or repressors (1). Most of them are homodimers, as was demonstrated for members of the LacI/GalR, TetR/CamR, FadR, MerR or LysR families (2–4). They exert their regulatory function by binding effectors upon which a conformational change is triggered which allosterically affects DNA binding. 
The bacterial tetracycline (tc) responsive repressor, TetR, exhibits high affinity (5.6 ± 2 × 10⁹ M⁻¹) for the tet operator (tetO) and is very sensitive to induction by tc and its analogues (5–10). These features combined with the capacity of tc to penetrate cell membranes by passive diffusion have led to its wide use as a tool for gene regulation in prokaryotes and eukaryotes (11–14). Crystal structures of TetR revealed that two identical subunits each consisting of 10 α-helices form the active homodimer. Two DNA reading heads are connected to a core domain in which dimerization takes place and which harbours two inducer binding pockets. The [tc–Mg]+ complex binds into each pocket and triggers a movement of helix 6 which is transferred via helices 4 and 1 to the DNA binding heads (9,15) moving them apart by ∼3 Å. As a consequence, the affinity to the tetO is lowered by four orders of magnitude (16) and TetR leaves the DNA. When bound to TetR, each [tc–Mg]+ complex is contacted by 13 residues from one monomer and 4 from the other. Since one inducer contacts residues from both subunits it is conceivable that the allosteric change may be triggered by binding of a single tc molecule. Very little information is available regarding the number of occupied inducer binding sites necessary for induction in regulatory proteins. This has been discussed for MerR, a member of the MerR/SoxR family (17) involved in the regulation of bacterial mercury resistance. Preliminary data would be in agreement with induction by binding of one Hg+ to the MerR dimer (18,19). The multi-drug resistance repressor QacR is, like TetR, a member of the TetR/CamR family of regulators (20). In vitro binding studies revealed a 1:2 stoichiometry of inducer per QacR dimer, and it has been suggested that this is sufficient for induction in vivo (21). 
Addressing this question in vivo is generally hampered by the exchange of subunits in a dimer as has been clearly demonstrated for TetR (7). Thus, a strain containing a wild type and an induction deficient allele of tetR would contain a mixture of all three possible dimers. We describe here single chain TetR (scTetR) variants, in which the two subunits of TetR are linked by a 25 amino acid (SG4)5 sequence and cannot form heterodimers (22), and their use to construct and analyze TetR variants with mutations in just one half-side in Escherichia coli. We clearly establish that TetR needs to bind two inducers to accomplish induction. MATERIALS AND METHODS General materials and methods Anhydrotetracycline (atc) was purchased from Acros (Geel, Belgium). All other chemicals were from Merck (Darmstadt, Germany), Roth (Karlsruhe, Germany) or Sigma (Munich, Germany) at the highest purity available. Enzymes for DNA restriction and modification were from New England Biolabs (Frankfurt/Main, Germany) or Roche (Mannheim, Germany). Isolation and manipulation of DNA as well as strain transformation were performed as described previously (23). All plasmids constructed were sequenced between the restriction sites employed for cloning. Oligonucleotides for PCR and sequencing were obtained from MWG Biotech (Ebersberg, Germany) unless stated otherwise. Sequencing was carried out according to the protocol provided by Applied Biosystems for cycle sequencing and analyzed with an ABI PRISM 310 Genetic Analyzer (Applied Biosystems, Weiterstadt, Germany). For β-galactosidase (β-gal) assays, E.coli WH207λtet50 bearing a Tn10 tetA–lacZ transcriptional fusion was used. Cells were grown in Luria–Bertani (LB) medium supplemented with the required antibiotics to an OD600 of 0.4. β-gal activity was determined as published previously (24). Culture and growth conditions E.coli was generally grown in LB medium at 37°C. 
Antibiotics for selection were added to the following final concentrations: ampicillin (Ap) 100 mg/l, kanamycin (Kan) 30 mg/l. When necessary, the cultures were adjusted to final concentrations of atc, tc or 4-dedimethylamino-anhydrotetracycline (4-ddma-atc) as indicated. All atc and 4-ddma-atc solutions were protected from light. Tc and atc were dissolved in 70% ethanol; 4-ddma-atc was dissolved in dimethyl sulfoxide. Design and construction of genes and plasmids Components for construction of scTetR The following DNA fragments were already available: wt-tetR(BD) and a (SG4)5 encoding linker sequence together with the first 50 codons of the synthetic tetR(B) sequence [termed tetR(sB)] originating from sctetR(BsB) (22). TetR(BD) is a chimera consisting of the first 50 codons of tetR(B) fused to the last 158 codons of tetR(D). We designed the DNA sequence of the synthetic class (D) portion de novo by in silico back-translation of the primary protein sequence online at http://www.entelechon.com/eng/backtranslation.html. Settings were attuned to human codon usage and putative eukaryotic splicing-sites were eliminated. For use in E.coli rare arginine codons were silently exchanged (as found at www.kazusa.or.jp/codon). Care was taken that the length of identical DNA stretches to tetR(BD) did not exceed 20 bp to minimize recombination events (25). The final sequence of sctetR(BDsBsD) was deposited in the NCBI GenBank (accession no. DQ392985). A total of 20 oligonucleotides, 39–48 bp in length (MWG Biotech, Ebersberg, Germany), entirely covering both strands of the desired ‘tetR(D) gene fragment (471 bp in size), with an overlap of 12 bp to complementary oligos were used. Assembly was performed by a PCR-like method, as described previously (26). A DNA-sequence alignment of the gene portions ‘tetR(D)’ and ‘tetR(sD)’ displayed an identity of ∼76%. 
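The two design criteria stated above (overall identity of ∼76% and no identical stretch longer than 20 bp) can be checked with a few lines of code. The following is a minimal sketch, not the procedure used by the authors: it assumes the two gene portions are supplied as an already aligned pair of equal-length strings with `-` for gaps, and the function names and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free aligned columns in which both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a != "-" and b != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def longest_identical_stretch(aligned_a: str, aligned_b: str) -> int:
    """Length of the longest run of consecutive identical aligned positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    best = run = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == b and a != "-":
            run += 1
            best = max(best, run)
        else:
            run = 0  # a mismatch or a gap interrupts the identical stretch
    return best


# Toy 18-nt sequences (hypothetical, not taken from tetR).
natural = "ATGTCTAGATTAGATAAA"
recoded = "ATGAGCCGCCTGGACAAG"
print(percent_identity(natural, recoded))           # 50.0
print(longest_identical_stretch(natural, recoded))  # 3
```

For a real design one would run both functions on the aligned ‘tetR(D)’ and ‘tetR(sD)’ portions and confirm that the second value stays at or below the 20 bp limit named in the text.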
The assembled product was amplified with primers syn(D)c_fw1 (5′-ctagcagtcgagatactggcccggcaccatgactacagt-3′) and syn(D)c_rev10 (5′-catggcctcacacaatctgaaggagggctgtaagctgga-3′) and temporarily subcloned into pCR™2.1-Topo™ (Invitrogen, Karlsruhe, Germany) as recommended by the manufacturer. Construction of scTetR(BDsBsD) Construction of sctetR(BD) was performed in pWH1926 [like pWH1925 (6), but with tetR(BD) running in the opposite direction]. The 3′ end of tetR(BD) was modified by silently inserting a ClaI restriction site upstream of the stop codon (PCR primers were sc0_fw and sc0_rev, restriction was done with MluI and NcoI). After amplification and modification by PCR using primers sc1_fw (5′-gtgaaagtatcgattcaggaggcggtgg-3′) and sc1_rev (5′-gtatgatccatggccagcatctcgattgctagcgcatcgag-3′), the (SG4)5 linker sequence together with the first 50 codons of the synthetic tetR(sB) sequence was fused to the 3′ end of tetR(BD) via ClaI and NcoI, simultaneously inserting a new NheI site directly in front of NcoI. The tetR(sD) portion of the pCR™2.1-Topo™ derivative described above was finally inserted as an NheI/NcoI fragment into the likewise cut pWH1926 derivative, resulting in a full-length sctetR(BDsBsD) gene, which gave rise to pWH1926(BDsBsD). The gene was entirely sequenced in both directions. Finally sctetR(BDsBsD) was cloned into a pWH1925 derivative to enable exact comparison with other variants, most of which are encoded on this vector. All tetR and sctetR variants were cloned and analyzed as pWH1925 derivatives. For overexpression of wild-type scTetR, the synthetic part was set in front of the gene. For the sake of convenience all tetR(BD) genes and proteins are termed tetR or TetR, while sctetR(BDsBsD) and the corresponding protein are denoted sctetR or scTetR, respectively. Lysozyme treatment of cells and western blot analysis Cells from a 20 ml culture were harvested at an OD600 of 0.4 by centrifugation at 8000 r.p.m. 
at 4°C and the pellet was resuspended in 500 μl buffer containing 10 mM Tris–HCl (pH 8.0), 200 mM NaCl, 0.4 mM EDTA and 20 g/l lysozyme. The lysate was incubated at ambient temperature for 60 min followed by centrifugation at 13 000 r.p.m., 4°C for 60 min. The protein concentration of the soluble fraction was determined using the Bio-Rad assay (Bio-Rad Laboratories, Munich, Germany). TetR was detected by SDS–PAGE of 35 μg E.coli cell extract on a 10% polyacrylamide gel. Proteins were transferred by electroblotting onto a PhotoGene™ nylon membrane (GibcoBRL, Karlsruhe, Germany). Further steps were performed as described previously (27). Overexpression and purification of proteins For overexpression we used the plasmid pWH610 and the strain RB791. Proteins were purified as described previously (28) except scTetRi2, which was overproduced at 22°C. After harvesting, cell pellets were stored on ice instead of freezing. The temperature of all subsequent steps was kept below 8°C. Electrophoretic mobility shift assays and fluorescence measurements For electrophoretic mobility shift assays (EMSA), the complementary synthetic 40-base tetO-containing oligonucleotides designated tetO1 and tetO2 were hybridized. Equimolar amounts of each oligonucleotide were mixed in water, heated at 96°C for 5 min and allowed to cool to room temperature within 2 h. A double stranded oligonucleotide of the same size without palindromic sequence was used as a control. TetR proteins were added at indicated amounts. All samples were incubated in a complex buffer containing 0.02 M Tris–HCl (pH 8.0) and 5 mM MgCl2. Atc and 4-ddma-atc were added to final concentrations of 0.1 mM to the sample. After incubation for 10 min at ambient temperature, the DNA was electrophoresed on an 8% polyacrylamide gel at 50 V in TBM buffer (0.09 M Tris, 0.09 M boric acid and 5 mM MgCl2), and the DNA was detected by ethidium bromide staining. 
Fluorescence titrations were performed as described previously (16) with excitation/emission wavelengths of 370/515 nm to observe tc- and 420/540 nm to observe 4-ddma-atc-binding. Denaturation of scTetR All measurements were performed in F-buffer [100 mM Tris–HCl (pH 7.5), 100 mM NaCl, 5 mM MgCl2, 1 mM EDTA and 1 mM dithiothreitol]. Equilibrium denaturation was carried out by incubation of protein samples overnight at given urea concentrations. Protein samples which had been incubated for 3 h at 20°C in F-buffer containing 8 M urea were renatured by diluting them 200-fold in F-buffer without urea. All measurements were performed at 25°C. Fluorescence intensities were measured at an excitation wavelength of 280 and 295 nm, respectively; emission intensity was recorded at 324 nm and corrected by subtraction of the buffer fluorescence. Thermodynamic calculations For scTetR, a two-state model of denaturation was applied in which a native monomeric protein N exists in equilibrium with a denatured monomer, U:
\begin{equation} N \rightleftharpoons U \tag{1} \end{equation}
The equilibrium constant $K_{\mathrm{u}}$ for every measured point within the linear transition was calculated according to the two-state denaturation of a monomeric protein, in which $f_{\mathrm{u}}$ is the fraction of unfolded protein:
\begin{equation} K_{\mathrm{u}} = \frac{[U]}{[N]} = \frac{f_{\mathrm{u}}}{1 - f_{\mathrm{u}}} \tag{2} \end{equation}
The linear extrapolation method (33) was used to derive protein stability at zero urea concentration:
\begin{equation} \Delta G_{\mathrm{u}}^{\circ} = \Delta G_{\mathrm{u}}^{\circ}(\mathrm{H_2O}) - m[\mathrm{urea}] = -RT \ln K_{\mathrm{u}} \tag{3} \end{equation}
where $\Delta G_{\mathrm{u}}^{\circ}$ stands for Gibbs free energy of unfolding at a 1 M concentration of all reactants, $\Delta G_{\mathrm{u}}^{\circ}(\mathrm{H_2O})$ is the extrapolated Gibbs free energy in buffer without urea and m is the negative of the slope of the straight line in the plot of $\Delta G_{\mathrm{u}}^{\circ}$ versus denaturant concentration. At the midpoint of the unfolding transition, $[\mathrm{urea}]_{1/2}$, $\Delta G_{\mathrm{u}}^{\circ} = 0$ and $K_{\mathrm{u}} = 1$. To determine $[\mathrm{urea}]_{1/2}$, we rearranged Equation 3 to Equation 4:
\begin{equation} [\mathrm{urea}]_{1/2} = \frac{\Delta G_{\mathrm{u}}^{\circ}(\mathrm{H_2O})}{m} \tag{4} \end{equation}
RESULTS Construction and in vivo stability of scTetR The scTetR variant was constructed as described for the mammalian applications (22) except that a sequence variant with optimal regulatory properties in bacteria was used (see Figure 1). Since the in vivo properties of scTetR with mutations in one inducer binding pocket may be masked by small amounts of dimeric TetR which could result from recombination of the repeat tetR sequence or from proteolysis within the peptide linker, it was important to exclude such events. To minimize recombination, we constructed a synthetic tetR allele (see Materials and Methods for details) so that the DNA sequence identity is only ∼76%. Identical sequences longer than 20 bp were avoided to further minimize recombination. The genetic stability of the resulting sctetR in E.coli DH5α harbouring pWH1925sctetR was determined by growing at 37°C overnight on LB/Ap plates. Colonies were streaked out three successive times and then used to inoculate a liquid culture. Plasmids isolated from these cultures gave restriction patterns indicating the presence of only the full-length sctetR which was also confirmed by sequencing (data not shown). 
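The thermodynamic calculations of Equations 1–4 in Materials and Methods can be sketched numerically as follows. This is an illustrative sketch under stated assumptions, not the authors' analysis code: the function names are hypothetical, the fit is an ordinary least-squares line, and the demonstration data are synthetic, generated from assumed values of 26 kJ/mol for the free energy in water and 5.5 kJ/(mol·M) for m (m is not reported in the text).

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol*K)
T_K = 298.15           # 25 °C, the measurement temperature given in the text


def fraction_unfolded(signal, native_signal, denatured_signal):
    """Fraction unfolded from a signal lying between two flat baselines (assumption)."""
    return (native_signal - signal) / (native_signal - denatured_signal)


def delta_g_unfolding(f_u, temp_k=T_K):
    """Equations 2 and 3: K_u = f_u / (1 - f_u), dG = -RT ln K_u (kJ/mol)."""
    if not 0.0 < f_u < 1.0:
        raise ValueError("f_u must lie strictly between 0 and 1")
    k_u = f_u / (1.0 - f_u)
    return -R_KJ * temp_k * math.log(k_u)


def linear_extrapolation(urea_molar, delta_g):
    """Least-squares fit of dG = dG_H2O - m*[urea]; returns (dG_H2O, m, urea_half)."""
    n = len(urea_molar)
    mean_x = sum(urea_molar) / n
    mean_y = sum(delta_g) / n
    sxx = sum((x - mean_x) ** 2 for x in urea_molar)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(urea_molar, delta_g))
    slope = sxy / sxx
    dg_h2o = mean_y - slope * mean_x   # intercept at zero urea
    m = -slope                         # m is the negative of the fitted slope
    return dg_h2o, m, dg_h2o / m       # Equation 4 gives the midpoint


# Synthetic transition-region data from the assumed parameters above.
true_dg_h2o, true_m = 26.0, 5.5
ureas = [4.4, 4.6, 4.8, 5.0]
f_us = []
for conc in ureas:
    k = math.exp(-(true_dg_h2o - true_m * conc) / (R_KJ * T_K))
    f_us.append(k / (1.0 + k))

dgs = [delta_g_unfolding(f) for f in f_us]
dg_h2o, m, urea_half = linear_extrapolation(ureas, dgs)
print(round(dg_h2o, 3), round(m, 3), round(urea_half, 2))  # 26.0 5.5 4.73
```

Only points inside the transition region should enter the fit, because outside it the fraction unfolded approaches 0 or 1 and the logarithm in Equation 3 becomes numerically unstable.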
Figure 1. Schematic presentation of dimeric and scTetR mutants. The protein arrangements possible with different genetic situations are shown. (A) The TetR dimers arising from two different tetR alleles expressed in the same cell. The filled star indicates a mutation. The three possible dimers are two homodimers and one heterodimer. (B) The scTetR gene (sctetR) in which one half-side carries the mutation (filled star). The first tetR sequence (white arrow) contains the desired mutation (filled star), followed by the linker encoding sequence designated (SG4)5 and the second synthetic tetR sequence (grey arrow). The resulting monomeric protein contains the mutation (filled star) in one half-side.
The proteolytic stability of scTetR was examined by western blots. As presented in Figure 2, a single band of TetR at ∼23 kDa is observed in extracts from cells containing tetR after sonication or lysozyme treatment (lanes 2 and 5). The scTetR band appears at ∼46 kDa in the extract from the respective strain containing sctetR (lanes 3 and 6). 
A smear of additional bands indicating fragmentation is visible below the scTetR signal when cells are broken by sonication (lane 3), while these are absent when cells are treated with lysozyme (lane 6). Taken together, these results indicate that scTetR exhibits sufficient genetic and proteolytic stability for functional analysis in vivo.
Figure 2. Western blots of TetR and scTetR. The influence of different cell disruption methods on the integrity of TetR proteins is shown. Soluble proteins from crude cell extracts (35 μg) were loaded in lanes 1–6. Proteins from E.coli WH207λtet50 cells carrying pWH1925ΔtetR were loaded in lanes 1 and 4; proteins from cells carrying pWH1925 (tetR) were loaded in lanes 2 and 5, and proteins from cells carrying pWH1925sc (sctetR) were loaded in lanes 3 and 6. Lane 7 contains 60 ng of purified TetR, and lane 8 contains 20 ng of purified scTetR. The lysis methods employed are indicated below the respective lanes.
ScTetR with a single inducer binding site is induction deficient
We constructed three scTetR variants, each with the single mutation N82A or T103A or E147A in only one half-side. The residues N82, T103 and E147 in TetR are essential for tc and atc binding, and each mutation to A leads to induction deficiency of the respective TetR mutant (29,30). 
The encoded scTetR proteins will thus contain one deficient and one functional inducer binding pocket (see Figure 1). The inducibility of these half-side scTetR mutants was determined by β-gal measurements with and without tc or atc (Figure 3A). β-gal expression in the absence of tetR defines the 100% level (data not shown). TetR and all single chain variants repress lacZ in the absence of inducer to <1% (data not shown). Tc leads to 45 and 22%, while atc leads to 92 and 78% induction of TetR and scTetR, respectively. The three TetR variants carrying mutations in both subunits (black bars) are only weakly inducible with the more efficient inducer atc, resulting in β-gal activities between 4 and 9%, while the corresponding single chain half-side mutants are inducible to ∼12%. None of the three TetR and scTetR half-side mutants is inducible with tc (values <1% β-gal activity; not shown). Thus, the inactivation of one inducer binding site in scTetR reduces induction to about the same level as the elimination of both inducer binding sites.
Figure 3. Induction efficiencies of sctetR with mutations in one inducer binding pocket. (A) β-gal activities of E.coli WH207λtet50 transformed with plasmids expressing tetR (black columns) or sctetR (white columns) with mutations causing induction deficiency in tetR (designated N82A, T103A or E147A) in one half-side are shown. The β-gal activity in the absence of tetR was ∼8000 Miller Units and was set to 100% (data not shown). β-gal activities in the absence of inducer were <1% for all variants (data not shown). β-gal expression in the presence of inducers is shown for 0.4 μM tc and 0.4 μM atc as indicated in the figure. (B) EMSA with purified TetR and scTetR N82A carrying a mutation in one inducer binding pocket are shown in the insert. The compounds and the amounts present in the respective reaction mixtures are listed in the table above the lanes. The positions of bands corresponding to free (f) and complexed (c) DNA are indicated on the right side.
To analyse TetR- and scTetR-tetO interactions in vitro, we purified scTetR N82A and performed EMSA (Figure 3B). This mutant binds tetO similarly to dimeric TetR (compare lanes 1–3 with lanes 4–6) in the absence of effector, where a 5-fold molar excess of protein is needed for complete binding of tetO. ScTetR N82A is not released from tetO upon addition of atc, since no band of free tetO DNA is visible in lane 7, while dimeric wild-type TetR is efficiently removed from tetO by atc, and the free operator DNA appears in lane 8. Neither the dimer nor scTetR binds a control fragment which contains no tetO sequence (data not shown). We conclude that the in vitro results are consistent with the non-inducible phenotype of scTetR with mutations in one inducer binding pocket. 
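The percentages quoted for the β-gal assays follow from normalizing each measurement to the unrepressed level. A minimal sketch of that arithmetic is given below, assuming only the ∼8000 Miller Units reference stated in the Figure 3 legend; the function name is hypothetical, and the 3600 and 80 Miller Unit readings are back-calculated illustrations rather than reported measurements.

```python
def percent_of_unrepressed(miller_units, unrepressed_units=8000.0):
    """Express a beta-gal activity as a percentage of the level without tetR."""
    if unrepressed_units <= 0:
        raise ValueError("the unrepressed reference must be positive")
    return 100.0 * miller_units / unrepressed_units


# Hypothetical readings chosen to reproduce two percentages quoted in the text.
print(percent_of_unrepressed(3600.0))  # 45.0, the level given for TetR induced by tc
print(percent_of_unrepressed(80.0))    # 1.0, the repression threshold quoted as <1%
```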
ScTetR variants with different half-side inducer specificities In order to investigate whether the lack of inducibility can be restored, we constructed a scTetR variant with different inducer specificities in each half-side. The TetRi2 mutant H64K S135L S138I is specifically induced by 4-ddma-atc and not by tc or atc (31). We introduced these mutations into one half-side of scTetR and expected the resulting variant to harbour two distinct inducer specificities combined in one protein called scTetRi2. The regulatory properties of this mutant were determined by β-gal measurements with tc, atc, 4-ddma-atc and combinations of them (Figure 4A). β-gal expression in the absence of tetR defines the 100% level (data not shown). ScTetR is not inducible with 4-ddma-atc, while β-gal activities in the presence of tc and atc are ∼35 and 80%, respectively, regardless of the presence of 4-ddma-atc. ScTetRi2 is not inducible with tc, atc or 4-ddma-atc alone. Induction of 40% β-gal expression is observed in the presence of tc and 4-ddma-atc, and induction of 55% in the presence of atc and 4-ddma-atc. Although the β-gal activities are not as high as those of scTetR in the presence of atc, scTetRi2 is apparently induced in the presence of both effectors. This is in agreement with in vitro results obtained by EMSA (Figure 4B). ScTetRi2 binds to tetO in the absence of inducer (lanes 1–3), whereas the control DNA is not bound (lane 4). The bulk of tetO remains bound in the presence of atc or 4-ddma-atc, as indicated by the strong complex band, and only a small amount of free DNA is visible in the presence of atc (lane 5). No band representing the complexed DNA is visible in the presence of atc and 4-ddma-atc, and the band corresponding to free tetO is more intense (lane 7). The free operator DNA appears in lane 0. Bands migrating above the complex appear only in lanes loaded with samples containing inducer and scTetR. 
Their fluorescence exhibits a different colour than the DNA stained with ethidium bromide and they are visible in control lanes where only scTetR and inducer without DNA have been loaded (data not shown). We conclude that they originate from the scTetR-inducer complex and visualize the well-known tetracycline fluorescence which is enhanced in complex with TetR. These results clearly indicate that two occupied binding pockets are required for induction.
Figure 4. Induction efficiencies of scTetR harbouring different inducer specificities. (A) β-gal activities of E.coli WH207λtet50 transformed with plasmids expressing sctetR or sctetRi2 (H64K, S135L, S138I in one half-side) are depicted. The β-gal activity in the absence of tetR is ∼8000 Miller Units and was set to 100% (not shown). Measurements were performed in the absence and presence of 0.4 μM of each inducer as indicated in the figure. (B) EMSA were performed with purified scTetRi2 and the compounds and amounts indicated in the table above the lanes. ‘Random’ refers to a non-tetO containing DNA. The positions of bands corresponding to free (f) and complexed (c) DNA are indicated on the right side.
Binding pockets with different specificities are bound only by their cognate inducers Distinct occupancy of the effector-specific binding pockets with the respective inducers was assayed by fluorescence titration of purified scTetRi2 first with tc and subsequently with 4-ddma-atc in the same cuvette (Figure 5A and B). As shown in Figure 5A, tc binding to TetR can clearly be detected based upon the excitation/emission wavelengths specific for tc. The subsequent addition of 4-ddma-atc leads to only a small further change of fluorescence intensity. An increase in fluorescence cannot be detected upon addition of tc when observed at excitation/emission wavelengths specific for 4-ddma-atc (Figure 5B), although tc binds to the first wild-type binding pocket (cf. Figure 5A). Thus, the increase in fluorescence during the following addition of 4-ddma-atc must result from binding of 4-ddma-atc in its respective binding pocket. We analyzed a concentration of 2 nmol of protein harbouring 4 nmol of binding pockets. As shown in Figure 5, the equivalence point occurs at 1.3 nmol of tc and 1.8 nmol of 4-ddma-atc. Thus, roughly half of the available inducer binding sites are occupied by each of the two effectors. Figure 5 Open in new tabDownload slide Titration of scTetRi2 with tc and 4-ddma-atc. (A) The fluorescence intensity of 2 nmol of purified scTetRi2 titrated with up to 8.5 nmol of tc (corresponding to a concentration of 8.5 μM) followed by titration with up to 8.5 nmol of 4-ddma-atc (corresponding to a concentration of 8.5 μM) is shown. Since fluorescence was excited at 370 nm and emission was measured at 515 nm only the binding of tc to the protein is detected. (B) The same experiment except that the excitation and emission wavelengths were 420 and 540 nm, repectively, where only 4-ddma-atc binding is observed. The equivalence points are indicated by the dotted lines. Figure 5 Open in new tabDownload slide Titration of scTetRi2 with tc and 4-ddma-atc. 
ScTetR stability
A bimolecular single transition denaturation reaction has been shown for dimeric TetR (32,33), which was the basis for the two-state model of unfolding. We investigated urea-dependent denaturation of scTetR. Conformational changes during denaturation were monitored by the fluorescence of W43 and W75, located in the helix–turn–helix motif and in the core domain, respectively. The fluorescence quantum yield decreases with increasing urea concentration, and the emission maximum is shifted from 342 to 363 nm upon denaturation. Fluorescence emission was determined at 324 nm where the highest change in fluorescence occurs (Figure 6A). The urea-dependent unfolding of scTetR shows a sigmoidal, monophasic decrease in fluorescence which is independent of the scTetR concentration (Figure 6B). Thus, there is no indication of any stable intermediate, suggesting that scTetR also denatures in an all-or-none process.
Figure 6 Urea-dependent denaturation of scTetR. (A) Fluorescence spectra of native (solid line), denatured (long-dashed line) and renatured (dashed–dotted line) scTetR are shown. The difference spectrum between native and denatured forms is shown by the dotted line. (B) The figure shows the change of fluorescence as a function of the urea concentration. The fluorescence in the absence of urea was set to 100% of folded protein.
The denaturation curves were determined at different concentrations of scTetR (filled triangle, 5 μM; circle, 1 μM; open triangle, 0.4 μM).
The midpoint of transition does not depend on the protein concentration between 0.4 and 5 μM scTetR: the urea1/2 values (Equation 4) are 4.7, 4.7 and 4.8 M urea for 0.4, 1 and 5 μM scTetR, respectively (Table 1). Since the transition is not dependent on protein concentration, Gibbs free energy was calculated employing Equation 2. The $\Delta G_{\mathrm{u}}^{\circ}$ (H2O) values determined by Equation 3 for each scTetR concentration are identical within the standard deviations and range from 26 to 27 kJ/mol (Table 1).
Table 1. Thermodynamic stability of scTetR and TetR(D) in urea-dependent denaturation determined by fluorescence (a)
Protein | 0.4 μM: $\Delta G_{\mathrm{u}}^{\circ}$ (kJ/mol) | 0.4 μM: Urea1/2 (M) | 1 μM: $\Delta G_{\mathrm{u}}^{\circ}$ (kJ/mol) | 1 μM: Urea1/2 (M) | 5 μM: $\Delta G_{\mathrm{u}}^{\circ}$ (kJ/mol) | 5 μM: Urea1/2 (M)
scTetR | 26 ± 1.1 | 4.7 | 27 ± 3.8 | 4.7 | 26 ± 0.8 | 4.8
TetR(D) (b) | 60 ± 3 | 3.8 | n.d. | n.d. | 61 | 4.2
n.d., not determined. (a) Unfolding was followed by the change of the fluorescence signal at 330 nm using protein concentrations of 0.1, 1 (excitation at 280 nm) and 5 μM (excitation at 295 nm) in 1 cm cells. (b) Values taken from Schubert et al. (33).
DISCUSSION
We have constructed a bacterial scTetR and characterized the modified protein in vivo and in vitro. Since TetR mutations always occur in each subunit of the native dimer, it has not been possible until now to determine whether induction requires occupation of one or both inducer binding pockets. We have overcome this limitation by ‘monomerizing’ TetR to scTetR (Figure 1). The sctetR gene is genetically as stable and the encoded protein is as functional in E.coli as dimeric TetR. Apparently, the human codon usage implemented into the synthetic part of sctetR did not notably reduce the translational efficiency in E.coli. It is surprising that this is possible with the (SG4)5 linker without any loss of activity.
In fact, we consistently observed a slightly increased repression exerted by scTetR compared with the dimer (data not shown), and induction is slightly impaired, which is probably due to an increased amount of intracellular scTetR compared with dimeric TetR (Figure 2). Another reason could be the de facto duplication of the tetR gene dosage in sctetR. Although the linker was chosen to be long and flexible enough to allow proper assembly of both domains (22), it could influence the entrance of inducer into the binding pocket. However, increased regulatory effects have also been observed when similar sctetR constructs were used in eukaryotes (22). Other dimeric repressors, like lambda Cro, P22 Arc or the N-terminal domains of the bacteriophage 434 repressor cI, have been successfully modified such that they are also expressed as a single-chain protein to investigate their stability and DNA binding (34–36). However, the allosteric conformational change occurring upon effector binding has to our knowledge not yet been analyzed in a transcriptional regulator. To clarify whether one or two occupied binding pockets are necessary for induction, we introduced into one half-side of scTetR three different mutations that confer a non-inducible phenotype owing to lack of inducer binding. The respective dimeric TetR mutants exhibit a massive drop of inducer affinity (29,30). We assume that they exert a similar influence on the respective scTetR half-side containing the alteration. Indeed, half-side induction-deficient mutants are efficient non-inducible repressors as demonstrated in vivo and in vitro (Figure 3). While the tetO binding clearly underscores that the half-side mutant proteins are functional, the lack of induction is nevertheless a negative result, and could thus be attributed to local folding problems around the inducer binding pocket.
To obtain a gain-of-function mutation we have examined the inducibility of scTetR with two different effector specificities, one in each half-side. The TetRi2 protein shows specificity for 4-ddma-atc (5) in one half-side and for tc or atc in the other. As a result, scTetRi2 is only inducible by mixtures of tc/4-ddma-atc or atc/4-ddma-atc. This clearly indicates that (i) both binding pockets need to be occupied with suitable inducers to lead to the allosteric conformational change necessary for induction and (ii) both half-sides are folded properly and are active. EMSA and tc- and 4-ddma-atc-dependent titrations verified the distinct occupancy of the two binding pockets by the suitable inducer as well as a 1:2 stoichiometry for each of them. Furthermore, the titrations demonstrate that the specificity of each binding pocket, although each pocket is formed by residues of both monomers, is entirely independent of its counterpart's sequence and specificity (Figure 5). Despite the participation of residues of both monomers in binding of one inducer, it is not likely in the light of the crystal structure (9) that the second monomer undergoes major conformational changes when the [tc–Mg]+ complex binds to the pocket mainly formed by the other monomer. Hence, a suitable movement of both binding heads requires the allosteric process to occur in both half-sides. It is shown here that this requires occupation of both binding pockets. It is not clear whether induction by different effectors in both half-sides differs from that of wild-type TetR. As shown for another homodimeric repressor, LacI (37), an asymmetric mechanism of induction could be possible. The TetR family of bacterial regulators is mainly defined by sequence similarities in the DNA reading heads (20). Hence it is conceivable that members of this diverse family may follow different induction modes.
This seems to be the case for TetR, which requires two inducers bound per dimer, and QacR, for which one inducer bound per dimer seems to be sufficient for induction. One possibility would be that TetR contacts tetO more strongly than QacR binds its respective cognate DNA. In addition, there is a very remarkable difference in DNA binding between the two repressors, as QacR uses a tetramer to bind its operator (21). Hence, there might also be a very different allosteric process underlying induction. Given the lack of sequence similarity within the inducer binding regions of the TetR family proteins, there might be a number of further induction pathways present in this family.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
The authors thank Christian Berens, Oliver Scholz and Martin Köstner for helpful discussions and Irina Pimenta for titration experiments. This work was supported by the Deutsche Forschungsgemeinschaft through SFB 473 and the Fonds der Chemischen Industrie. Funding to pay the Open Access publication charges for this article was provided by SFB 473. Conflict of interest statement. None declared.
REFERENCES
1. Browning D.F. and Busby S.J. (2004) The regulation of bacterial transcription initiation. Nature Rev. Microbiol., 2, 57–65.
2. Huffman J.L. and Brennan R.G. (2002) Prokaryotic transcription regulators: more than just the helix-turn-helix motif. Curr. Opin. Struct. Biol., 12, 98–106.
3. Zaim J. and Kierzek A.M. (2003) The structure of full-length LysR-type transcriptional regulators. Modeling of the full-length OxyR transcription factor dimer. Nucleic Acids Res., 31, 1444–1454.
4. Busby S. and Ebright R.H. (1999) Transcription activation by catabolite activator protein (CAP). J. Mol. Biol., 293, 199–213.
5. Henssler E.M., Bertram R., Wisshak S., Hillen W. (2005) Tet repressor mutants with altered effector binding and allostery. FEBS J., 272, 4487–4496.
6. Scholz O., Henssler E.M., Bail J., Schubert P., Bogdanska-Urbaniak J., Sopp S., Reich M., Wisshak S., Köstner M., Bertram R., et al. (2004) Activity reversal of Tet repressor caused by single amino acid exchanges. Mol. Microbiol., 53, 777–789.
7. Schnappinger D., Schubert P., Pfleiderer K., Hillen W. (1998) Determinants of protein–protein recognition by four helix bundles: changing the dimerization specificity of Tet repressor. EMBO J., 17, 535–543.
8. Schubert P., Pfleiderer K., Hillen W. (2004) Tet repressor residues indirectly recognizing anhydrotetracycline. Eur. J. Biochem., 271, 2144–2152.
9. Orth P., Schnappinger D., Hillen W., Saenger W., Hinrichs W. (2000) Structural basis of gene regulation by the tetracycline inducible Tet repressor-operator system. Nature Struct. Biol., 7, 215–219.
10. Kedracka-Krok S., Gorecki A., Bonarek P., Wasylewski Z. (2005) Kinetic and thermodynamic studies of tet repressor-tetracycline interaction. Biochemistry, 44, 1037–1046.
11. Ehrt S., Guo X.V., Hickey C.M., Ryou M., Monteleone M., Riley L.W., Schnappinger D. (2005) Controlling gene expression in mycobacteria with anhydrotetracycline and Tet repressor. Nucleic Acids Res., 33, e21.
12. Gossen M. and Bujard H. (2001) Tetracyclines in the control of gene expression in eukaryotes. In Nelson M., Hillen W. and Greenwald R.A. (eds), Tetracyclines in Biology, Chemistry and Medicine. Birkhäuser Verlag, Switzerland, pp. 139–158.
13. Weinmann P., Gossen M., Hillen W., Bujard H., Gatz C. (1994) A chimeric transactivator allows tetracycline-responsive gene expression in whole plants. Plant J., 5, 559–569.
14. Berens C. and Hillen W. (2003) Gene regulation by tetracyclines. Constraints of resistance regulation in bacteria shape TetR for application in eukaryotes. Eur. J. Biochem., 270, 3109–3121.
15. Kisker C., Hinrichs W., Tovar K., Hillen W., Saenger W. (1995) The complex formed between Tet repressor and tetracycline-Mg2+ reveals mechanism of antibiotic resistance. J. Mol. Biol., 247, 260–280.
16. Kamionka A., Bogdanska-Urbaniak J., Scholz O., Hillen W. (2004) Two mutations in the tetracycline repressor change the inducer anhydrotetracycline to a corepressor. Nucleic Acids Res., 32, 842–847.
17. Brown N.L., Stoyanov J.V., Kidd S.P., Hobman J.L. (2003) The MerR family of transcriptional regulators. FEMS Microbiol. Rev., 27, 145–163.
18. Dieckmann G.R., McRorie D.K., Lear J.D., Sharp K.A., DeGrado W.F., Pecoraro V.L. (1998) The role of protonation and metal chelation preferences in defining the properties of mercury-binding coiled coils. J. Mol. Biol., 280, 897–912.
19. Shewchuk L.M., Verdine G.L., Nash H., Walsh C.T. (1989) Mutagenesis of the cysteines in the metalloregulatory protein MerR indicates that a metal-bridged dimer activates transcription. Biochemistry, 28, 6140–6145.
20. Ramos J.L., Martinez-Bueno M., Molina-Henares A.J., Teran W., Watanabe K., Zhang X., Gallegos M.T., Brennan R., Tobes R. (2005) The TetR family of transcriptional repressors. Microbiol. Mol. Biol. Rev., 69, 326–356.
21. Schumacher M.A., Miller M.C., Grkovic S., Brown M.H., Skurray R.A., Brennan R.G. (2001) Structural mechanisms of QacR induction and multidrug recognition. Science, 294, 2158–2163.
22. Krueger C., Berens C., Schmidt A., Schnappinger D., Hillen W. (2003) Single-chain Tet transregulators. Nucleic Acids Res., 31, 3050–3056.
23. Sambrook J. (2001) Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
24. Miller J.H. (1972) Experiments in Molecular Genetics. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
25. Watt V.M., Ingles C.J., Urdea M.S., Rutter W.J. (1985) Homology requirements for recombination in Escherichia coli. Proc. Natl Acad. Sci. USA, 82, 4768–4772.
26. Heinz C., Karosi S., Niederweis M. (2003) High-level expression of the mycobacterial porin MspA in Escherichia coli and purification of the recombinant protein. J. Chromatogr. B. Analyt. Technol. Biomed. Life Sci., 790, 337–348.
27. Kamionka A., Bertram R., Hillen W. (2005) Tetracycline-dependent conditional gene knockout in Bacillus subtilis. Appl. Environ. Microbiol., 71, 728–733.
28. Ettner N., Müller G., Berens C., Backes H., Schnappinger D., Schreppel T., Pfleiderer K., Hillen W. (1996) Fast large-scale purification of tetracycline repressor variants from overproducing Escherichia coli strains. J. Chromatogr. A, 742, 95–105.
29. Müller G., Hecht B., Helbl V., Hinrichs W., Saenger W., Hillen W. (1995) Characterization of non-inducible Tet repressor mutants suggests conformational changes necessary for induction. Nature Struct. Biol., 2, 693–703.
30. Scholz O., Schubert P., Kintrup M., Hillen W. (2000) Tet repressor induction without Mg2+. Biochemistry, 39, 10914–10920.
31. Henssler E.M., Scholz O., Lochner S., Gmeiner P., Hillen W. (2004) Structure-based design of Tet repressor to optimize a new inducer specificity. Biochemistry, 43, 9512–9518.
32. Backes H., Berens C., Helbl V., Walter S., Schmid F.X., Hillen W. (1997) Combinations of the alpha-helix–turn–alpha-helix motif of TetR with respective residues from LacI or 434Cro: DNA recognition, inducer binding, and urea-dependent denaturation. Biochemistry, 36, 5311–5322.
33. Schubert P., Schnappinger D., Pfleiderer K., Hillen W. (2001) Identification of a stability determinant on the edge of the tet repressor four-helix bundle dimerization motif. Biochemistry, 40, 3257–3263.
34. Simoncsits A., Chen J., Percipalle P., Wang S., Toro I., Pongor S. (1997) Single-chain repressors containing engineered DNA-binding domains of the phage 434 repressor recognize symmetric or asymmetric DNA operators. J. Mol. Biol., 267, 118–131.
35. Robinson C.R. and Sauer R.T. (1998) Optimizing the stability of single-chain proteins by linker length and composition mutagenesis. Proc. Natl Acad. Sci. USA, 95, 5929–5934.
36. Jana R., Hazbun T.R., Fields J.D., Mossing M.C. (1998) Single-chain lambda Cro repressors confirm high intrinsic dimer-DNA affinity. Biochemistry, 37, 6446–6455.
37. Flynn T.C., Swint-Kruse L., Kong Y., Booth C., Matthews K.S., Ma J. (2003) Allosteric transition pathways in the lactose repressor protein core domains: asymmetric motions in a homodimer. Protein Sci., 12, 2523–2541.
© 2006 The Author(s).
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Identification of small non-coding RNAs from mitochondria and chloroplasts
Lung, Birgit; Zemann, Anja; Madej, Monika J.; Schuelke, Markus; Techritz, Sandra; Ruf, Stephanie; Bock, Ralph; Hüttenhofer, Alexander
doi: 10.1093/nar/gkl448; pmid: 16899451
ABSTRACT Small non-protein-coding RNAs (ncRNAs) have been identified in a wide spectrum of organisms ranging from bacteria to humans. In eukarya, systematic searches for ncRNAs have so far been restricted to the nuclear or cytosolic compartments of cells. Whether or not small stable non-coding RNA species also exist in cell organelles, in addition to tRNAs or ribosomal RNAs, is unknown. We have thus generated cDNA libraries from size-selected mammalian mitochondrial RNA and plant chloroplast RNA and searched for small ncRNA species in these two types of DNA-containing cell organelles. In total, we have identified 18 novel candidates for organellar ncRNAs in these two cellular compartments and confirmed expression of six of them by northern blot analysis or RNase A protection assays. Most candidate ncRNA genes map to intergenic regions of the organellar genomes. As found previously in bacteria, the presumptive ancestors of present-day chloroplasts and mitochondria, we also observed examples of antisense ncRNAs that potentially could target organelle-encoded mRNAs. The structural features of the identified ncRNAs as well as their possible cellular functions are discussed. The absence from our libraries of abundant small RNA species that are not encoded by the organellar genomes suggests that the import of RNAs into cell organelles is of very limited significance or does not occur at all. INTRODUCTION Cells from all organisms contain two different kinds of RNAs: mRNAs, which are translated into proteins and non-protein-coding RNAs (ncRNAs). The latter ones function at the level of the RNA and are not translated into proteins (1–6). Reported sizes of ncRNAs range from very large as, for example, the ∼17 kb human Xist RNA (7), to extremely small, as the 21–23 nt microRNAs (8,9). However, the sizes of the vast majority of ncRNAs known to date lie between 20 and 500 nt, well below the size of the majority of mRNAs (2). 
NcRNAs fulfill vital functions in many cellular processes, namely in transcription, translation, splicing, DNA replication or RNA processing (3). The evolutionary origin of eukaryotic cells is marked by the endosymbiotic uptake of two eubacteria (an α-proteobacterium and a cyanobacterium) and their gradual conversion into the DNA-containing cell organelles of present-day eukaryotes, mitochondria and chloroplasts (10–12). Genetically, the evolutionary optimization of endosymbiosis was accompanied by the loss of dispensable and redundant genetic information, and the large-scale translocation of information from the genome of the endosymbiont to that of the host cell (13,14). Consequently, contemporary organellar genomes are greatly reduced and, as compared to the nuclear genome, contain relatively little information. The chloroplast genome of higher plants is a circular double-stranded molecule of 120–160 kb which harbors ∼120 genes (15,16). Most of these chloroplast-encoded genes belong to one of two major gene classes: photosynthesis-related genes and genetic system genes (e.g. genes for rRNAs, tRNAs, ribosomal proteins, subunits of an RNA polymerase) (17). Plant mitochondrial genomes can be significantly larger in size (180–2400 kb), but typically contain only half as many genes as chloroplast genomes (18,19). In contrast, animal mitochondria harbor much smaller, but highly compact and gene-dense genomes (20,21). For example, the human mitochondrial genome contains 37 genes in a 16.5 kb circular genome. Of these mitochondrial genes 13 specify protein products (subunits of respiratory chain complexes), the remaining 24 encode RNA products (a complete set of ribosomal and transfer RNAs). An important feature of all organellar genomes is their remarkably high ploidy level, i.e. copy number per cell (22,23).
A single tobacco leaf cell, for example, may contain as many as 100 chloroplasts, each harboring ∼100 identical copies of the plastid genome, resulting in an extraordinarily high ploidy level of up to 10 000 plastid genomes per cell. Clearly, the limited coding capacity of organellar genomes is far from sufficient to provide all of the many components required to support their own gene expression systems. The organelles are therefore highly dependent on the products of nuclear genes that are synthesized on cytoplasmic ribosomes and post-translationally imported into the organelle (24). Chloroplasts, for example, import >90% of their proteins from the cytosol. Consequently, the temporal and spatial expression of organellar and nuclear genes must be regulated in a highly coordinated fashion. Eubacterial genomes encode a large number of ncRNAs and it seems conceivable that ncRNAs also could exist and possibly play regulatory roles in cell organelles. Indeed, the identification of a first ncRNA candidate in tobacco chloroplasts [sprA; (25)] points to the presence of ncRNAs in present-day cell organelles. In this study, we sought to systematically identify ncRNA species in chloroplasts and mitochondria. Since the identification of ncRNAs solely by biocomputational approaches is severely hampered by their lack of an open reading frame, we employed an experimental approach designated as ‘experimental RNomics’ which is based on the generation of specialized cDNA libraries encoding potential novel ncRNA species (26,27). We thus generated cDNA libraries from fractions of small RNAs isolated from organellar RNA which was extracted from purified chloroplasts (model organism: tobacco, Nicotiana tabacum) and mitochondria (model organism: mouse, Mus musculus).
In this analysis of the small RNA component of the transcriptome of chloroplasts and mitochondria, we identified a number of novel ncRNA candidates which might be involved in the regulation of organellar gene expression and thus possibly exert similar functions as ncRNAs in the bacterial ancestors of present-day organelles.
MATERIALS AND METHODS
Generation of a cDNA library from tobacco chloroplasts and mouse mitochondria
For the construction of a chloroplast library, chloroplasts were isolated from young leaves of N.tabacum plants grown in the light and plants grown in the dark by using a Percoll gradient-based method (28). Mitochondria were isolated from mouse liver and kidney and purified by a Percoll gradient as described previously (29). Total RNA was extracted from cell organelles by the TRIzol method (Gibco-BRL) or by directly extracting the chloroplast pellet with phenol/chloroform (1:1). Subsequently, 100 µg of total RNA was size-fractionated by denaturing 8% PAGE (7 M urea, 1× TBE buffer). RNAs in the size range between ∼20 and 500 nt were excised from the gel, passively eluted and ethanol-precipitated. For the chloroplast library, RNAs were ligated to 5′- and 3′-oligonucleotide linkers by T4 RNA ligase, as described previously (26). For the mitochondrial library, RNAs were poly(C)-tailed employing poly(A) polymerase (Invitrogen). C-tailed RNAs were ligated to a 5′-oligonucleotide linker as described previously (27). RNAs from both libraries were subsequently converted into cDNAs by RT–PCR as described previously, employing complementary primers to 5′- and 3′-linkers or the poly(C) tail (27), followed by cloning into pGEM-T or pGEM-T-Easy vector (Promega).
Sequence analysis of cDNA libraries from chloroplasts and mitochondria
cDNA clones were sequenced using the M13 reverse primer and the BigDye terminator cycle sequencing reaction kit (PE Applied Biosystems). Sequencing reactions were run on an ABI Prism 3100 (Perkin Elmer) capillary sequencer.
Subsequently, sequences were analyzed with the LASERGENE sequence analysis program package (DNASTAR, Madison, USA). cDNA sequences were compared with one another using the Lasergene Seqman II program package to identify identical sequences (DNASTAR). Following a BLASTN search against the GenBank database (NCBI), all RNA sequences which were not annotated in the database were treated as potential candidates for novel ncRNAs.
Northern blot analysis
Total organellar RNA (5–10 µg) was denatured for 1 min at 95°C, separated on an 8% denaturing polyacrylamide gel (7 M urea, 1× TBE buffer) and transferred onto a nylon membrane (Qiabrane Nylon Plus, Qiagen) using the Bio-Rad semi-dry blotting apparatus (Trans-blot SD; Bio-Rad). After immobilizing the RNAs using the STRATAGENE UV crosslinker, we pre-hybridized the nylon membrane for 1 h in 1 M sodium phosphate buffer (pH 6.2) with 7% SDS. Oligonucleotides from 20 to 26 nt in size, complementary to potentially novel RNA species, were end-labeled with [γ-32P]ATP and T4 polynucleotide kinase. Hybridization was carried out at 58°C in 1 M sodium phosphate buffer (pH 6.2), 7% SDS for 12 h. Blots were washed twice at room temperature in 2× SSC buffer (20 mM sodium phosphate, pH 7.4; 0.3 M NaCl; 2 mM EDTA), 0.1% SDS for 15 min and subsequently at 58°C in 0.1× SSC, 0.5% SDS for 1 min. Membranes were exposed to Kodak MS-1 film from 12 h to 5 days.
RESULTS AND DISCUSSION
cDNA library containing ncRNAs from chloroplasts
Library construction and sequence analysis of cDNA clones
Tobacco plants from the species N.tabacum were either grown under normal light conditions (16 h light, 8 h dark) or kept under constant darkness for 3 days prior to isolation of chloroplasts. As chloroplast gene expression is strongly regulated by light (30,31), we decided to comparatively analyze small RNA species in light-grown versus dark-grown plants.
From purified chloroplasts, cDNA libraries were generated encoding RNAs sized from 20 to 500 nt (Materials and Methods). Subsequently, a total of 5500 cDNA sequences were analyzed by first grouping identical cDNA clones, followed by bioinformatic analysis of their location on the chloroplast genome (Materials and Methods). About 90% of the sequenced clones represented known ncRNAs from the chloroplast genome (Figure 1A). We thereby identified the full set of chloroplast-encoded tRNAs and small rRNAs (data not shown), consistent with our screen being saturating with respect to known ncRNAs. A minor fraction (0.4%) was derived from coding regions of chloroplast mRNAs, presumably representing mRNA degradation products. This low amount of putative mRNA degradation intermediates in the library indicates a high degree of intactness of the isolated organellar RNA population. About 2.7% of sequences represented novel potential candidates for chloroplast ncRNAs by two criteria: they were located mainly in intergenic regions and had not been assigned to a known chloroplast gene (Figure 1A; note that among the novel ncRNAs, Ntc-12 ncRNA is part of a previously assigned chloroplast gene, i.e. 16S rRNA, and thus is annotated within the number of cDNA clones from this gene). Finally, 4% of the sequenced cDNA clones were encoded by the nuclear genome (Figure 1A). Among the nuclear genome-derived RNAs, 11 novel, previously not annotated candidates for small nucleolar RNAs (snoRNAs) were identified based on sequence and structural motifs. In addition, the nuclear-encoded 5.8S rRNA and some tRNAs were found (see below).
Figure 1 Sequence analysis and genomic location of ncRNA candidates from the chloroplast library of N.tabacum. (A) Sequence analysis of 5500 cDNA clones from the chloroplast library. cDNA clones representing different RNA species or categories are shown as percent of total clones.
The red sector denotes candidates for novel ncRNAs in chloroplasts; note that Ntc-12 ncRNA is part of a previously assigned chloroplast gene, i.e. 16S rRNA, and thus is annotated within the number of cDNA clones from this gene. (B) Location of ncRNA candidates on the chloroplast genome, not drawn to scale. Novel candidates for ncRNA genes are indicated by green arrows, genes flanking the ncRNA candidate are indicated by black arrows. The distance of the ncRNA to 5′ and 3′ ends of flanking genes (in nt) is shown below. For ncRNAs mapping in sense or antisense orientation to introns/UTRs, the distance to the neighboring exon is shown (in nt); for ncRNAs that overlap with the open reading frame of a gene, the number of overlapping nucleotides is indicated by a negative number.
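The two bookkeeping steps behind Figure 1A, grouping identical cDNA clones and expressing each category as a percentage of all clones, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the analysis described here was carried out with the Lasergene package and BLASTN, the function names are invented, the `annotation` dictionary is a hypothetical stand-in for a database lookup, and the fourth read is made up. The first three reads use the Ntc-1 and Ntc-2 sequences listed in Table 1.

```python
from collections import Counter


def group_identical_clones(sequences):
    """Collapse identical cDNA reads and count how often each one occurs."""
    return Counter(seq.strip().upper() for seq in sequences if seq.strip())


def category_percentages(clone_counts, category_of):
    """Sum clone counts per category and express them as percent of all clones."""
    totals = Counter()
    for seq, count in clone_counts.items():
        totals[category_of(seq)] += count
    n_clones = sum(clone_counts.values())
    return {category: 100.0 * n / n_clones for category, n in totals.items()}


# Toy input: Ntc-1 (twice, once in lower case), Ntc-2 and one invented read.
reads = [
    "GGTAGTTCGATCGTGGAATTTC",   # Ntc-1
    "ggtagttcgatcgtggaatttc",   # Ntc-1 again, different letter case
    "AGTTACTAATTCATGATCTGGC",   # Ntc-2
    "ACGTACGTACGTACGTACGT",     # hypothetical read, not from the paper
]

# Hypothetical annotation table standing in for a database search result.
annotation = {
    "GGTAGTTCGATCGTGGAATTTC": "novel candidate",
    "AGTTACTAATTCATGATCTGGC": "novel candidate",
}

counts = group_identical_clones(reads)
shares = category_percentages(counts, lambda s: annotation.get(s, "unassigned"))
print(counts.most_common(1))
print(shares)
```

Counting clones per distinct sequence first, and only then assigning categories, mirrors the order described in the text and keeps highly abundant species such as Ntc-12 from being annotated once per clone.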
Novel candidate ncRNAs encoded by the chloroplast genome

Ntc-12 RNA

The most abundant cDNA clone from our chloroplast library (found in ∼900 cDNA clones) was designated as Ntc-12 (N.tabacum chloroplast-12). It is encoded by the chloroplast genome and is derived from the chloroplast 16S ribosomal RNA (Table 1 and Figure 1B). The cDNA represents a 53 nt long portion of domain III of the 16S rRNA which folds into a stable hairpin structure (Supplementary Figure S1). Although fragments of ribosomal RNA regularly appear as contaminants in cDNA libraries encoding ncRNAs, the presence of such a highly abundant specific rRNA fragment has not been observed previously. The fact that all 900 clones are derived from one particular region of the 16S rRNA sequence and, moreover, exhibit defined 5′ and 3′ ends makes it highly unlikely that this small RNA species reflects contamination with non-specific rRNA degradation products. Consistent with this interpretation, we detect a specific signal of 53 nt in size in northern blot analyses of total RNA from light and dark grown plants (Figure 2), indicating that significant amounts of Ntc-12 accumulate in a light-independent manner. It was interesting to compare light inducibility of Ntc-12 with that of the 16S rRNA from which Ntc-12 is likely to originate by processing. Previous work has shown that a dark–light shift results in only a moderate increase in transcription from the 16S rRNA promoter [of about 30% (32)]. In contrast, no increase in the level of Ntc-12 ncRNA can be observed in northern blot experiments under light conditions (Figure 2).

Figure 2. Northern blot analysis of selected ncRNAs from the chloroplast library. Clone names for each ncRNA are indicated above each lane; sizes of RNAs, as estimated by comparison with an internal RNA marker, are indicated by arrows on the right. RNAs isolated from light or dark grown plants (Materials and Methods) are indicated below by L or D, respectively.

Table 1. Candidates for novel ncRNAs from a N.tabacum chloroplast cDNA library derived from RNAs sized 10–500 nt

Name | Nr. | Sequence | cDNA (nt) | N. blot (nt) | Remarks
Ntc-12 | 890 | CCGGGACAAAGGGTCGCGATCCCGCGAGGGTGAGCTAACCCCAAAAACCCGTC | 53 | 53 | Localized in 16S rDNA gene, 100% conserved in 24 different chloroplast genomes, stable stem–loop structure
Ntc-1 | 36 | GGTAGTTCGATCGTGGAATTTC | 22 | 22* | Localized in intergenic region, 100% conserved in 10 different chloroplast genomes
Ntc-2 | 20 | AGTTACTAATTCATGATCTGGC | 22 | 22 | Intergenic region, 100% conserved in 960 different chloroplast genomes; homologous sequence found in A.thaliana chloroplast cDNA library
Ntc-3 | 18 | TCTGCCCTCCCTCTCTATCTATCCAAGGGATGGAAGGGCAGAGG | 44 | 45 | Intergenic region, 100% conserved in 19 different chloroplast genomes, stable stem–loop structure, sncRNA in the same region identified in A.thaliana
Ntc-4 | 18 | ACGTCCCCATGTTCCCCCCGTGTGGCGACATGGGGGCGAA | 40 | 55 | Intergenic region, 100% conserved in six different chloroplast genomes, stable stem–loop structure
Ntc-5 | 4 | AAACTTATTAGATACCAGAGTCAATGGTATCTAATAAGGTTT | 42 | 40 | Antisense to 3′-UTR of atpE gene, stable stem–loop structure, 100% conserved in five chloroplast genomes
Ntc-6 | 2 | TGAGAGGCGGTGGTTTACC | 19 | - | Localized in tRNA-Ala intron, 100% conserved in 975 different chloroplast genomes, putative part of a miRNA precursor sequence
Ntc-7 | 1 | GCATTCACAAGTTCCGTC | 18 | - | Antisense to intron in gene for ribosomal protein S16 (rps16), 100% conserved in nine different chloroplast genomes
Ntc-8 | 1 | AGAAATCAAAGTATTTTGGCCCTCTCTC | 28 | - | 100% conserved in three different Nicotiana chloroplast genomes
Ntc-9 | 1 | CAACCAATGACTATTCATGATTC | 23 | - | Intergenic region/promoter region of gene for ribosomal protein S16, 100% conserved in 19 different chloroplast genomes
Ntc-10 | 1 | AACCGGCCCAAAAGGGAAGTACCTTTCCCTCTGGGGGTAGGA | 42 | - | Intergenic region, 100% conserved in 10 different chloroplast genomes
Ntc-11 | 1 | ATCCATTCGAAAGGTTAGA | 19 | - | Intergenic region, 100% conserved in three different chloroplast genomes

Nr., number of independent cDNA clones identified from each RNA species; Sequence, sequence of cDNA; cDNA (nt), length of cDNA encoding a ncRNA candidate as assessed by sequencing; N. blot (nt), length of RNAs as assessed by northern blot analysis or by an RNase protection assay (indicated by asterisk).

The high abundance of Ntc-12 clones in our library might imply that a significant portion of the 16S ribosomal RNA is fragmented within chloroplasts. However, by a poisoned primer extension analysis (33), we demonstrated that <1% of the ribosomal 16S rRNA in chloroplasts is fragmented and thus lacking the Ntc-12 sequence (data not shown). Thus, the high abundance of Ntc-12 cDNA clones in our library is likely due to the high abundance of rRNA in general and/or to preferential cloning of the Ntc-12 RNA compared with the remaining sequences. We cannot entirely exclude the possibility that Ntc-12 is not derived from the full-length 16S rRNA but, instead, is derived from a shorter RNA produced by internal transcription initiation. However, the rRNA operon is probably the best-characterized transcription unit in chloroplasts and has been extensively used to map and dissect promoter elements [for a review see (34)]. No evidence for the presence of an internal promoter has been found, making it less likely that Ntc-12 is produced by internal transcription initiation.

Ntc-1 RNA

The next most abundant ncRNA candidate identified in our screen (designated as Ntc-1) was present in 36 identical cDNA clones.
Ntc-1 is located in an intergenic region of the chloroplast genome and maps between the psbH (encoding a small subunit of the photosystem II complex) and petB (encoding the cytochrome b subunit of the chloroplast cytochrome b6f complex) genes (Table 1 and Figure 1B). As assessed by its cDNA sequence, the Ntc-1 RNA has a size of 22 nt and thus is in the same size range as miRNAs present in the nucleocytosolic compartment of plant cells (8). Although Ntc-1 is the second most abundant novel ncRNA candidate in our library, we were unable to confirm its expression by northern blot analysis. This might be due to the design of the oligonucleotide probe used for detection of Ntc-1 RNA, an explanation that would be consistent with previous results from analysis of a mouse brain cDNA library (35). Owing to the small size of Ntc-1 (22 nt), it was not possible to design a probe to a different region of the RNA. However, we were able to confirm expression of Ntc-1 RNA by an alternative method, employing an RNase A protection assay and using a radiolabeled RNA antisense probe directed against the Ntc-1 RNA (data not shown).

Ntc-2 RNA

Ntc-2 RNA is represented by 20 cDNA clones and exhibits a predominant size of 22 nt (Table 1 and Figure 1B). Expression of the RNA can be confirmed by northern blot analysis (Figure 2) and the size of the northern blot signal is in good agreement with the length of the cloned cDNAs. The DNA sequence specifying the Ntc-2 RNA is located 3′ of a ribosomal protein gene (rps7) and overlaps with the initiation codon of the ndhB gene following immediately downstream (Figure 1B). Interestingly, the ndhB gene encoding a subunit of the chloroplast NAD(P)H dehydrogenase complex is not an essential gene in chloroplasts, but seems to play a functional role under certain stress conditions (36–40).
A similar sequence to Ntc-2 was detected recently in a cDNA library encoding small RNAs (sized between 20 and 30 nt) from N.tabacum generated from total cellular RNA (41). In this case, the ncRNA was found to be 2 nt longer at the 3′ end. However, among the 20 cDNA clones from our library, only 2 contained this 2 nt extension while the remaining 18 clones were 22 nt in size. Homologs to Ntc-2 can be found in the chloroplast genomes from >900 different plant species, indicating that the Ntc-2 sequence is evolutionarily highly conserved. Since Ntc-2 overlaps with the open reading frame of the ndhB gene, two scenarios can be envisioned for how this RNA is generated: (i) alternative transcription producing distinct transcripts for Ntc-2 and the ndhB gene, respectively, or (ii) alternative processing of a primary transcript spanning Ntc-2 and ndhB RNAs.

Ntc-3 RNA

Next abundant in our library, represented by 18 independently identified cDNA clones, is Ntc-3, a 44 nt long ncRNA candidate mapping to the intergenic region between two ribosomal RNA genes, the 4.5S rRNA and the 5S rRNA (Table 1 and Figure 1B). The ncRNA candidate can be detected as a distinct band in a northern blot analysis, pointing to a rather high expression level (Figure 2). A homolog of Ntc-3 RNA has been identified previously in Arabidopsis thaliana, designated as Ath-243. Ntc-3 and Ath-243 share 70% sequence identity and can both be folded into a similar stable stem–loop structure by employing the M-fold program (for secondary structures of selected ncRNAs from chloroplasts see Supplementary Figure S2). Interestingly, the A.thaliana homolog Ath-243, whose RNA is of similar size (as assessed by northern blotting), was shown to be expressed tissue-specifically and detected at significant levels only in roots (42). Ntc-3, however, was identified in a cDNA library produced from leaf chloroplasts, suggesting that, at least in tobacco, the occurrence of this small RNA is not strictly confined to roots.
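A figure such as the sequence identity quoted above for Ntc-3 and Ath-243 depends on how identity is defined. The sketch below shows one common convention, assumed here for illustration only: matches divided by the number of aligned, ungapped columns of a pre-computed pairwise alignment. The function name `percent_identity` and the two short input strings are placeholders of ours; they are not the Ntc-3 or Ath-243 sequences, and the text does not state which alignment method or denominator was used.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of a pairwise alignment.

    Both inputs must already be aligned (equal length, gaps as `gap`).
    Columns with a gap in either sequence are ignored (an assumption;
    other conventions divide by the full alignment length instead).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a.upper(), aligned_b.upper())
               if x != gap and y != gap]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

# Placeholder sequences: 10 aligned positions, 7 of them identical.
toy_a = "ACGTACGTAC"
toy_b = "ACGTACGAGG"
print(percent_identity(toy_a, toy_b))
```

Because the result changes with the choice of denominator, identity values from different sources are only comparable when the same convention is used.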
Ntc-4 RNA

Like Ntc-3, Ntc-4 maps to an intergenic region of the chloroplast genome, flanked by two known ncRNA genes, 16S rRNA and tRNAIle (Table 1 and Figure 1B). We have isolated 18 independent cDNA clones from our library that contained this RNA, all of them exhibiting a size of 40 nt. As for Ntc-12 and Ntc-3, Ntc-4 folds into a very stable stem–loop structure (Supplementary Figure S2). Expression of Ntc-4 can be verified by northern blot analysis resulting in a signal of the expected size (i.e. at ∼40 nt) as well as two larger bands, which might represent precursors of Ntc-4 RNA (Figure 2).

Ntc-5 RNA

Ntc-5 is represented by four independent cDNA clones. The sequence is located immediately adjacent to the tRNAMet gene with no spacer region present between the two ncRNAs (Table 1 and Figure 1B). Previously, it has been shown that a novel endonuclease, termed tRNase Z, is able to process the mature 3′ ends of tRNAs (43). 5′-Processing of Ntc-5 as well as 3′-processing of tRNAMet thus might be simultaneously exerted by the same tRNase Z enzyme. In fact, by an in vitro assay employing recombinant tRNase Z, we could show that Ntc-5 is processed by tRNase Z from a longer precursor RNA including tRNAMet (A. Hüttenhofer and A. Marchfelder, unpublished data). Ntc-5 can fold into a stable stem–loop structure (Supplementary Figure S2) and is transcribed in opposite orientation to the 3′-untranslated region (3′-UTR) of the atpE gene located on the complementary strand. A possible function of Ntc-5 could be the regulation of gene expression of the atpE gene, in analogy to miRNAs, which target 3′-UTRs in eukaryal mRNAs. Ntc-5 RNA and the 3′-UTR of the atpE gene form stem–loop structures with complementary loop sequences, a prerequisite for the formation of a so-called ‘kissing complex’ (Supplementary Figure S3). This spatial arrangement has been shown previously to be a hallmark of bona fide regulatory sense/antisense interactions in bacteria (44).
However, so far no miRNA-like gene regulation mechanism has been described in chloroplasts which would resemble the one observed for cytoplasmic mRNAs of eukaryotic cells (8). Alternatively, Ntc-5 might function in analogy to an antisense RNA identified previously in bacteria, designated as GadY, which is transcribed in opposite orientation to the 3′-UTR of the GadX gene. It was shown that expression of GadY resulted in an increase in GadX mRNA levels and that this increase is dependent on the complementarity to the 3′-UTR of the GadX gene (45). Thus, Ntc-5 RNA might have a similar, positive effect on atpE expression. Since Ntc-5 is presumably co-transcribed with the tRNAMet gene (see above), it is tempting to speculate that this mechanism could couple protein synthesis (by tRNAMet transcription) to ATP synthesis (by atpE transcription) in chloroplasts. Alternatively, Ntc-5 RNA might have a destabilizing effect on atpE mRNA, since antisense RNAs to 3′-UTRs of mRNAs that trigger mRNA decay by an RNase III-dependent mechanism have been reported in bacteria (46). Based on the structure of identified cleavage sites in chloroplast RNAs, the existence of an RNase III-like enzyme activity in chloroplasts has been suggested (25,47). However, no such enzyme has been unambiguously identified to date. The expression of the Ntc-5 ncRNA could be verified by northern blot analysis; Ntc-5 is the most strongly expressed ncRNA candidate from our screen (Figure 2).

Ntc-6 RNA

Two independent cDNA clones have been identified for the Ntc-6 RNA, which is located in the intron of the tRNAAla gene and exhibits a size of 19 nt. A northern blot signal with a band at the expected size could not be observed. The tRNA intron can be folded into a hairpin precursor structure, somewhat reminiscent of a miRNA precursor, from which the Ntc-6 RNA could be processed (Supplementary Figure S4).
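The antisense relationship discussed above for Ntc-5 and the atpE 3′-UTR amounts to a reverse-complement test between the candidate and the opposite-strand region. A minimal sketch follows, assuming that the Table 1 sequence is written in the sense of the RNA using DNA letters. The helper names and the flanking bases in `toy_region` are hypothetical; the real atpE 3′-UTR sequence is not reproduced here.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return dna.translate(_COMPLEMENT)[::-1]

def is_antisense_to(candidate, sense_region):
    """True if `candidate` is fully complementary to a stretch of `sense_region`."""
    return reverse_complement(candidate.upper()) in sense_region.upper()

# Ntc-5 cDNA sequence as listed in Table 1 (42 nt).
NTC5 = "AAACTTATTAGATACCAGAGTCAATGGTATCTAATAAGGTTT"

# Sequence of the opposite strand across the Ntc-5 locus.
opposite_strand = reverse_complement(NTC5)

# Hypothetical region: the opposite-strand stretch with made-up flanks.
toy_region = "GGG" + opposite_strand + "CCC"

print(opposite_strand)
print(is_antisense_to(NTC5, toy_region))
```

Perfect complementarity over the full length follows automatically for transcripts from opposite strands of the same locus; this differs from trans-acting antisense RNAs, which often pair only partially with their targets.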
Ntc-7, Ntc-8, Ntc-9, Ntc-10 and Ntc-11 RNAs

For each of the ncRNA candidates Ntc-7, Ntc-8, Ntc-9, Ntc-10 and Ntc-11, only a single cDNA clone was isolated. In addition, we could not verify their expression by northern blot analysis, pointing to a very low expression level as compared with most of the ncRNA candidates discussed above. All of these low-abundant ncRNAs map to intergenic regions of the chloroplast genome (Table 1 and Figure 1B). Except for Ntc-10, they do not fold into extended stem–loop structures. Owing to their limited abundance (inferred from missing northern blot signals and single cDNA clone occurrence), these candidates are less likely to represent functional ncRNAs in chloroplasts. However, we cannot rule out the possibility that at least some of them might still be functional despite their low expression level.

Nuclear-encoded ncRNAs from the chloroplast cDNA library

We also identified cDNA clones from the chloroplast library encoding RNA transcripts from the nuclear genome of N.tabacum. Of these, 94 cDNA clones were assigned to various nuclear tRNAs, 38 cDNA clones to nuclear 5.8S rRNA and 25 cDNA clones represented 11 novel candidates for C/D box snoRNAs (Supplementary Table S1). To address the possibility that these nuclear-encoded ncRNAs are post-transcriptionally imported into the chloroplast compartment, northern blot analyses were conducted to compare signal intensities in purified chloroplast RNA versus total cellular RNA. However, for none of the selected ncRNAs of nuclear origin, including snoRNAs and nuclear-encoded tRNAs, did these experiments provide evidence for import into chloroplasts at significant levels (data not shown). Nevertheless, we cannot exclude the possibility that these RNAs are imported into chloroplasts at low levels. To date, the import of RNAs into chloroplasts has not been directly demonstrated, although indirect evidence may suggest that at least tRNAs can be imported into chloroplasts (48–50).
Nonetheless, in the absence of expression data demonstrating enrichment of nuclear-encoded ncRNAs inside chloroplasts, we tentatively explain the presence of these sequences in our chloroplast library as RNA contaminants, which might stick to the outer membrane of chloroplasts during organelle purification and thus were co-isolated with the endogenous chloroplast RNA. We have considered pre-treatment of isolated chloroplasts with RNases as a possibility to eliminate cytosolic RNAs associated with the outer membrane of the chloroplast. However, nuclease treatment of chloroplasts is known to be problematic and often results in complete degradation of the chloroplast nucleic acids (51). This is consistent with earlier findings that isolated chloroplasts are not impermeable to exogenously added enzymes, including restriction endonucleases (52).

cDNA library containing ncRNAs from mouse mitochondria

Library construction and sequence analysis of cDNA clones

A mouse cDNA library encoding mitochondrial ncRNAs was generated as described for the chloroplast library (see above and Materials and Methods). About 1700 cDNA clones were analyzed from this library. The majority of clones were identified as mitochondrial-encoded rRNAs (i.e. 16S or 12S rRNAs) or mitochondrial tRNAs (Figure 3A). As for the chloroplast library, we could identify the full set of known mitochondrial ncRNAs, e.g. mitochondrial tRNAs and ribosomal RNAs. Only 1.4% of the cDNA clones are derived from mitochondrial mRNA fragments, which is consistent with an extremely low abundance of mRNA degradation intermediates in the RNA population. About 1% of the total cDNA clones could be assigned as putative novel ncRNA candidates, since they did not contain any sequence or structural motifs which allowed classification as any of the known mitochondrial ncRNAs (Table 2). In addition, 6.6% of cDNA clones were nuclear-encoded RNAs, such as rRNAs, mRNA fragments as well as miRNAs (see below).
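The fractions quoted for novel and nuclear-encoded clones in the mitochondrial library can be checked with simple arithmetic. The sketch below uses the per-candidate clone numbers listed in Table 2 and the 112 nuclear-derived cDNAs reported for this library. It assumes a total of exactly 1700 clones, whereas the text says "about 1700", so the results are approximate; the variable and function names are ours.

```python
TOTAL_CLONES = 1700  # assumption: the text gives "about 1700" clones

# Clone numbers per novel candidate, as listed in Table 2.
novel_counts = {"Mt-1": 15, "Mt-2": 1, "Mt-3": 1, "Mt-4": 1, "Mt-5": 1, "Mt-6": 1}

# Nuclear-derived cDNAs reported for the mitochondrial library.
NUCLEAR_CLONES = 112

def percent(part, whole):
    """Share of `part` in `whole`, in percent, rounded to one decimal."""
    return round(100.0 * part / whole, 1)

novel_total = sum(novel_counts.values())
novel_pct = percent(novel_total, TOTAL_CLONES)
nuclear_pct = percent(NUCLEAR_CLONES, TOTAL_CLONES)

print(f"novel candidates: {novel_total} clones = {novel_pct}%")
print(f"nuclear-encoded:  {NUCLEAR_CLONES} clones = {nuclear_pct}%")
```

Under this assumption the 20 novel-candidate clones correspond to roughly 1.2% and the 112 nuclear-derived clones to roughly 6.6% of the library, in line with the rounded values given in the text.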
Figure 3. Sequence analysis and genomic location of ncRNA candidates from the mitochondrial library of M.musculus. (A) Sequence analysis of 1700 cDNA clones from the mitochondrial library. cDNA clones representing different RNA species or categories are shown as percent of total clones. The red sector denotes candidates for novel ncRNAs in mitochondria. (B) Location of ncRNA candidates on the mitochondrial genome, not drawn to scale. Novel candidates for ncRNA genes are indicated by red arrows; genes flanking the ncRNA candidate are indicated by black arrows. Upper panel: mitochondrial D-loop region involved in genome replication. Locations of Mt-1, Mt-2, Mt-3 and Mt-4 RNAs are indicated by red arrows; the location of MBI-44 (see text) is shown by a purple arrow. Conserved sequence boxes CSB I, II and III, which are characteristic of mitochondrial origins of replication, are also shown. L, light-strand transcription initiation site; H1, H2 and H3 represent three heavy-strand transcription initiation sites. OH indicates the origin of replication of the H-strand. RNase MRP: the arrow points to the potential RNase MRP cleavage site involved in RNA primer processing. Lower panel: location of Mt-5 and Mt-6 ncRNAs on the mitochondrial genome. The distance of the RNAs to the 5′ or 3′ ends of the reading frames of genes ND-4 and ND-6, respectively (which are located on the opposite strand), is indicated in nt.

Table 2. Candidates for novel ncRNAs from a M.musculus mitochondrial cDNA library derived from RNAs sized 10–500 nt

Name | Nr. | Sequence | cDNA (nt) | Remarks
Mt-1 | 15 | GAATTGATCAGGACATAGGGTTTGATAGTTAATATTATATGTCTTTCAAGTTCTTAGTGTTTTTGGGG (A)4–37 | 68 | Maps to D-loop, L-strand encoded; four sequences are partly polyadenylated
Mt-2 | 1 | ATAGTTTAATGTACGATATACATAAATGTACTGTTGTACTATGTAAATTTATGTACT | 57 | Maps to D-loop, L-strand encoded
Mt-3 | 1 | CACCCCCTCCTCTTAATGCCAAA | 23 | Maps to D-loop, H-strand encoded
Mt-4 | 1 | CATTTGGTCTATTAATCTACCATCCTCCGTGAAACCAACAACCCGCCCACCAATG | 55 | Maps to D-loop, H-strand encoded
Mt-5 | 1 | TTGGGATTAAGGTTGCTTCAAATAAAATATAAAATATAATTAG | 43 | Antisense to ND-4 mRNA
Mt-6 | 1 | CAACATCGTCAACCTCATATATCAATCAAT | 30 | Antisense to ND-6 mRNA; three mismatches to mitochondrial sequence, two mismatches to a nuclear-encoded pseudogene

Nr., number of independent cDNA clones identified from each RNA species; Sequence, sequence of cDNA; cDNA (nt), length of cDNA encoding a ncRNA candidate as assessed by sequencing.

Novel candidates for ncRNAs encoded by the mitochondrial genome

Mt-1, Mt-2, Mt-3 and Mt-4 RNAs

We cloned several RNA species derived from the D-loop region of the mitochondrial genome. The mitochondrial D-loop region contains the major transcription initiation sites of the genome as well as the origin of heavy strand (H-strand) DNA replication (53,54). Synthesis of an RNA primer required for DNA replication of the H-strand and transcription of the entire L-strand polycistronic transcript both originate at the same initiation site on the L-strand (Figure 3B) (55). Before replication, the RNA primer for H-strand replication is thought to undergo processing by cleavage with RNase MRP (Figure 3B). We have identified altogether 15 cDNA clones that encode the ncRNA candidate Mt-1 (Table 2 and Figure 3B). The Mt-1 RNA exhibits an identical 5′ end to the predicted RNA primer for DNA replication. The 3′ end of the Mt-1 RNA is heterogeneous by 4 nt and coincides with the CSB III motif, a conserved sequence motif that, together with CSB I and II, is found in mitochondrial D-loop regions (56).
The Mt-1 RNA sequences, however, terminate shortly before the putative cleavage site by RNase MRP within CSB III. Thus it seems feasible that premature transcription termination of the RNA primer for DNA replication, which in turn prevents cleavage by RNase MRP, regulates replication of the mitochondrial genome. This would be consistent with the previous hypothesis that regulation of replication of the mitochondrial genome is exerted at the level of RNA priming, based on the observation that the rate of transcription initiation at the D-loop exceeds that of mitochondrial DNA replication (57). Remarkably, 4 out of the 15 Mt-1 sequences identified in our library were polyadenylated, with poly(A) tails ranging from 4 to 37 adenine residues. It has been demonstrated recently that polyadenylation in animal mitochondria serves a dual role in that it can either produce stable mRNAs or mark transcripts for rapid degradation (58). Thus, whether Mt-1 is stabilized or destabilized by poly(A) tail addition remains to be determined.

A second RNA species from the D-loop region is represented by a single cDNA clone, designated as Mt-2 RNA (Table 2 and Figure 3B). The 3′ end of Mt-2 RNA is separated by 1 nt from the 5′ end of tRNAPro and thus might reflect a processing product of mitochondrial RNase P cleaving the tRNAPro precursor.

An interesting aspect of Mt-3, a short, 23 nt long RNA from the mitochondrial D-loop, is that it is encoded by the H-strand, is transcribed in antisense orientation to the RNA primer for DNA replication and spans the predicted RNase MRP cleavage site (Table 2 and Figure 3B) (59). Potentially, the Mt-3 RNA could regulate the activity of the catalytic RNA component of RNase MRP by an antisense mechanism and thereby influence the rate of transcription and/or replication of the H-strand.
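The poly(A) tail lengths noted above for the polyadenylated Mt-1 clones can be read off a cDNA sequence by counting trailing adenines. The sketch below is a simplification under stated assumptions: it treats every 3′-terminal A as part of the tail, which is safe for Mt-1 because its encoded sequence ends in G, but it would miscount sequences whose genomic 3′ end is itself A-rich. The function name is ours, and the tailed reads are constructed for the demo rather than taken from the library.

```python
def trailing_a_count(seq):
    """Number of consecutive A residues at the 3' end of `seq`."""
    seq = seq.strip().upper()
    return len(seq) - len(seq.rstrip("A"))

# Mt-1 and Mt-3 cDNA sequences as listed in Table 2 (without tails).
MT1 = "GAATTGATCAGGACATAGGGTTTGATAGTTAATATTATATGTCTTTCAAGTTCTTAGTGTTTTTGGGG"
MT3 = "CACCCCCTCCTCTTAATGCCAAA"

# Constructed examples spanning the reported tail range of 4 to 37 residues.
short_tail = trailing_a_count(MT1 + "A" * 4)
long_tail = trailing_a_count(MT1 + "A" * 37)

# Caveat: Mt-3 ends in AAA, so the same count here says nothing about
# polyadenylation unless the read is compared with the genomic sequence.
mt3_terminal_a = trailing_a_count(MT3)

print(short_tail, long_tail, mt3_terminal_a)
```

A more careful approach would align each read to the mitochondrial genome and count only adenines that extend beyond the templated sequence.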
Lastly, the Mt-4 RNA, with a size of 55 nt, also maps to the D-loop region of the mitochondrial genome and is presumably transcribed from the H3 promoter of the H-strand (Figure 3B). In a cDNA library generated from total cellular RNA of mouse brain, we previously identified a clone (MBI-44) that maps close to the mitochondrial D-loop and exhibits a size of 97 nt (35). The Mt-1, Mt-2, Mt-3 and Mt-4 RNAs, however, map to a different region than MBI-44, with the latter RNA species being in antisense orientation to tRNAPro (Figure 3B).
Mt-5 and Mt-6 RNAs
We have also identified two antisense RNA species transcribed in opposite orientation to two different mitochondrial mRNAs encoding NADH dehydrogenase subunits. While Mt-5 RNA is transcribed in antisense orientation to the NADH dehydrogenase subunit 4 gene (ND-4), Mt-6 RNA is transcribed in antisense orientation to the NADH dehydrogenase subunit 6 gene (ND-6; Table 2 and Figure 3B). Mt-5 and Mt-6 cDNA clones display sizes of 43 and 30 nt, respectively. Interestingly, the Mt-6 RNA exhibits three mismatches to the published mitochondrial genome sequence (60), which might be due to polymerization errors of the reverse transcriptase in the process of library construction and/or sequencing errors. Remarkably, the Mt-6 RNA exhibits only two mismatches to a nuclear-encoded pseudogene of ND-6. The Mt-5 and Mt-6 ncRNA species could potentially be involved in regulation of gene expression of the ND-4 and ND-6 genes by an antisense mechanism, as observed previously for numerous bacterial antisense RNAs (44). This would be compatible with the eubacterial origin of mitochondria and the evolutionary conservation of the mechanisms regulating gene expression in both systems. We were unable to detect a clear and unambiguous northern blot signal for any of the above ncRNA candidates from mouse mitochondria, presumably pointing to their low accumulation level.
This is consistent with the observation that for all but one ncRNA (Mt-1), only a single cDNA clone has been isolated. The low abundance of ncRNA candidates in mouse mitochondria is not necessarily indicative of their lack of functional significance. It is well established that ncRNAs can be functional even at very low levels, as observed for some miRNAs which also cannot be detected by northern blot analysis (8,9).
Nuclear-encoded ncRNAs identified in a cDNA library from mitochondria
As observed for the chloroplast cDNA library, we also found clones derived from the nuclear genome (112 sequenced cDNAs; Figure 3A). Among them, about half of the clones (i.e. 2.5% of all cDNA clones sequenced) represent 5.8S rRNA sequences (with a size of 160 nt) while roughly the other half encode RNA fragments of the 28S rRNA (varying in size between 25 and 132 nt). The remaining sequences represent clones of 5S rRNA and fragments of the 18S rRNA. In addition, we identified 12 cDNA clones representing nuclear-encoded mRNA fragments as well as four nuclear-encoded miRNAs, namely let7f-1, let-7g, miR-122a,b and miR-101b (8). Only one cDNA clone was identified from each miRNA species. The predominant occurrence of 5.8S rRNA clones in our library, but not of the similar sized and equally abundant 5S rRNA, could hint towards a mitochondrial import of this RNA species, as observed for 5S rRNA in human mitochondria (61). However, in a previous study on ncRNAs from mouse, analyzing total cellular ncRNAs (35), we noted a preferential occurrence of 5.8S rRNA cDNA clones compared to 5S rRNA clones in our library (ratio ∼4:1); a similar ratio was observed in the mitochondrial library, indicating that the predominant occurrence of 5.8S rRNA might be due to preferential cloning of this RNA species compared to 5S rRNA. Interestingly, we could not detect two nuclear-encoded RNAs, MRP RNA and RNase P, in our library, which have been predicted to be imported.
However, the import of these two RNA species still remains to be proven and their presence within mitochondria, especially of MRP RNA, is still a subject of debate. Thus, the lack of clones from these species in our mitochondrial library, while identifying all other mitochondrial-encoded ncRNAs (see above), might shed some new light on this debate.
CONCLUSION
By an experimental RNomics approach, we have investigated the small transcriptome representing RNAs, sized 20–500 nt, from the two DNA-containing cell organelles, mitochondria (from M. musculus) and chloroplasts (from N. tabacum). Whereas mouse mitochondria have a small genome of 16.6 kb, chloroplasts from N. tabacum have a considerably larger genome of 156 kb (62,63). For eukaryal nuclear genomes, it has been speculated previously that a considerable portion of the genome (up to 50%) might code for novel regulatory RNA transcripts, amounting to many thousands of regulatory ncRNAs in the nucleus or cytoplasm (3,5,6). In our analysis of the small transcriptome of the two cell organelles we do not find evidence for a similarly large number of small ncRNA candidates within these cellular compartments. It might be argued that a considerable number of ncRNAs might have escaped detection due to the experimental strategy used in our screen; e.g. some RNAs might not be reverse transcribed into cDNAs because of their structure and/or modification and thus could not be identified. Although we cannot completely exclude this possibility, we would like to point out that in both libraries we could detect the full set of all known organellar ncRNAs, including mitochondrial tRNAs that are modified as well as highly structured. Thus, we are confident that our RNomics screen has detected the majority of small stable RNA transcripts in cell organelles ranging from 20 to 500 nt. In the chloroplast library from the plant N. tabacum, we have identified 12 candidates for ncRNAs.
For six of these ncRNAs, we could confirm expression by northern blot analyses while the remaining five appear to be expressed at low levels. Unlike animal mitochondrial genomes, chloroplast genomes possess many promoters of widely different strength, and multiple transcription initiation sites are often found even for a single gene or operon (64–67). Another level of complexity is added by the presence of two different types of RNA polymerases in chloroplasts: a chloroplast-encoded eubacterial-type RNA polymerase depending on sigma factors and a nuclear-encoded bacteriophage-type RNA polymerase (68,69). The presence of a large number of promoters and alternative transcription initiation sites as well as the interplay of the two different RNA polymerase activities result in a highly complex transcript pattern in chloroplasts with different transcripts displaying widely different abundances and stabilities (70–72). Thus, the strong differences observed in the accumulation levels of the identified candidate ncRNAs are unsurprising. The majority of ncRNAs from the chloroplast genome map to intergenic regions (nine ncRNAs), two map to intronic sequences and one is transcribed in antisense orientation to the 3′-UTR of a chloroplast gene (atpE). Most tobacco ncRNA candidates exhibit sequence homology to other chloroplast genomes, including the genomes of rather distantly related species, which may hint at a conserved function. However, at present we cannot exclude the possibility that some of the identified RNA species are by-products of chloroplast RNA processing, which may or may not be functionally relevant. From the mouse mitochondrial library, six ncRNA candidates could be identified. Four of these map to the mitochondrial D-loop region involved in genome replication, while two others are transcribed in antisense orientation to known mitochondrial mRNAs (ND-4 and ND-6, respectively).
For none of the candidates could expression be confirmed by northern blot analysis, probably pointing to their low expression levels. Since transcription for both strands of the mitochondrial genome starts from the D-loop region and results in the production of large polycistronic transcripts, the identified ncRNA candidates very likely reflect processing products of these primary polycistronic transcripts. The strong bias of expressed RNA sequences towards the mitochondrial D-loop region may reflect an increased transcriptional activity in this region of the genome. The larger genome size of chloroplasts compared to animal mitochondria may in part account for the larger number of novel ncRNA candidates found in this organelle. Nonetheless, the generally rather low total number of novel ncRNA candidates in cell organelles seems somewhat surprising, considering their eubacterial descent. In bacteria, regulatory ncRNAs are widespread. For example, to date, well over 60 regulatory ncRNAs have been identified in E. coli (4,73). Bacteria have to react fast to constantly changing environments and this might require numerous ncRNAs as fast genetic switches which do not have to be translated into proteins before exerting their functions. This is demonstrated by independent studies which identified numerous ncRNAs in various bacterial genomes (4,74,75) as well as a cyanobacterial genome (76), which is commonly considered as an ancestor to chloroplasts and their genomes. Owing to the evolutionary distance between bacteria and cyanobacteria on the one side and chloroplasts and mitochondria on the other, no sequence homologs could be identified between those species. The general location of ncRNA genes in cell organelles is, however, similar to that in bacterial species, where ncRNA genes are mainly intergenic, with the exception of a considerable number of antisense RNAs transcribed in cis from the opposite strand of protein coding or ncRNA genes (75).
Although one could argue that the environment of cell organelles within eukaryotic cells might be more stable, and thus the demand for regulatory ncRNAs might be lower than in bacteria and cyanobacteria, the main functions of chloroplasts and mitochondria also require very fast adaptation responses. Most chloroplast-encoded genes are involved in photosynthesis, a key bioenergetic pathway that, when not adjusted properly to changing light conditions, can result in massive photooxidative stress and the concomitant release of highly cytotoxic free radicals and reactive oxygen species (77,78). Similar considerations hold true for mitochondria with most mitochondrial gene products functioning in the respiratory electron transport chain. Thus, at least in theory, post-transcriptional regulation via ncRNAs would provide a fast and efficient mechanism for the rapid adjustment of organellar gene expression to changing environmental conditions and/or metabolic demands of the cell. At present, our study provides no experimental evidence for a pathway that would promote import of nuclear-encoded RNAs into cell organelles to a significant extent. Most of the nuclear-derived RNAs found in our cDNA libraries from chloroplast and mitochondria were highly abundant RNAs, such as rRNAs or snoRNAs, and moreover, were not enriched in purified organellar RNAs compared to total cellular RNA preparations. Thus, these RNA species might represent nuclear contaminations, which were co-isolated and cloned during library construction. We cannot exclude, however, that at least some of the identified nuclear-encoded ncRNAs are imported into mitochondria or chloroplasts at low levels. In summary, we present here the first comprehensive analysis of the small RNA component of the transcriptome from mitochondria and chloroplasts. We could identify a total of 18 novel candidates for ncRNAs in these cellular compartments.
Functional studies on these organellar ncRNA candidate genes will be required to assign biological functions to the presumptive novel ncRNA species. Unfortunately, the production of mutants or the targeted inactivation of ncRNA genes in chloroplasts and mitochondria is far from trivial. No transgenic technologies suitable to generate gene knockouts are currently available for animal mitochondria, making it difficult to directly test the identified ncRNA candidates for their biological functions. In contrast, recent progress with the genetic transformation of chloroplasts has facilitated reverse genetics approaches in chloroplast genomes (79–82). Although the procedures involved in the genetic transformation of higher plant chloroplasts are demanding and time consuming, the targeted knockout of the candidate ncRNA genes identified here certainly represents the most promising approach towards determining ncRNA functions in chloroplasts.
ACKNOWLEDGEMENTS
We thank Stefanie Seeger, Annett Kaßner (MPI für Molekulare Pflanzenphysiologie) for plant cultivation and help with chloroplast isolations and Daniel Karcher (MPI für Molekulare Pflanzenphysiologie) for the analysis of chloroplast ncRNA precursors. We also thank A. Marchfelder for critically reading the manuscript. This work was supported by an Austrian grant FWF 171370 and a German DFG grant 457-1/2 to A.H., by the Max Planck Society (R.B.) and by a grant from the DFG, SFB 577 TP B4 ‘Genetic variability of mitochondrial disorders’ to M.S. A.Z. was supported by a grant from the Nationales Genomforschungsnetz (NGFN #0313358A). Funding to pay the Open Access publication charges for this article was provided by the Austrian FWF (Fonds zur Förderung der wissenschaftlichen Forschung) grant 171370. Conflict of interest statement. None declared.
REFERENCES
1. Eddy S.R. (2001) Non-coding RNA genes and the modern RNA world. Nature Rev. Genet., 2, 919–929.
2. Huttenhofer A., Brosius J., Bachellerie J.P. (2002) RNomics: identification and function of small, non-messenger RNAs. Curr. Opin. Chem. Biol., 6, 835–843.
3. Huttenhofer A., Schattner P., Polacek N. (2005) Non-coding RNAs: hope or hype? Trends Genet., 21, 289–297.
4. Kawano M., Reynolds A.A., Miranda-Rios J., Storz G. (2005) Detection of 5′- and 3′-UTR-derived small RNAs and cis-encoded antisense RNAs in Escherichia coli. Nucleic Acids Res., 33, 1040–1050.
5. Mattick J.S. (2004) RNA regulation: a new genetics? Nature Rev. Genet., 5, 316–323.
6. Mattick J.S. and Makunin I.V. (2005) Small regulatory RNAs in mammals. Hum. Mol. Genet., 14, R121–R132.
7. Plath K., Mlynarczyk-Evans S., Nusinow D.A., Panning B. (2002) Xist RNA and the mechanism of X chromosome inactivation. Annu. Rev. Genet., 36, 233–278.
8. Bartel D.P. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116, 281–297.
9. Bartel D.P. and Chen C.Z. (2004) Micromanagers of gene expression: the potentially widespread influence of metazoan microRNAs. Nature Rev. Genet., 5, 396–400.
10. Gray M.W. (1993) Origin and evolution of organelle genomes. Curr. Opinion Genet. Dev., 3, 884–890.
11. Gray M.W., Burger G., Lang B.F. (1999) Mitochondrial evolution. Science, 283, 1476–1481.
12. Szathmary E. and Smith J.M. (1995) The major evolutionary transitions. Nature, 374, 227–232.
13. Bock R. (2005) Extranuclear inheritance: functional genomics in chloroplasts. Prog. Bot., 67, 75–98.
14. Timmis J.N., Ayliffe M.A., Huang C.Y., Martin W. (2004) Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes. Nature Rev. Genet., 5, 123–135.
15. Sugiura M. (1992) The chloroplast genome. Plant Mol. Biol., 19, 149–168.
16. Wakasugi T., Tsudzuki T., Sugiura M. (2001) The genomics of land plant chloroplasts: gene content and alteration of genomic information by RNA editing. Photosynthesis Res., 70, 107–118.
17. Shimada H. and Sugiura M. (1991) Fine structural features of the chloroplast genome: comparison of the sequenced chloroplast genomes. Nucleic Acids Res., 19, 983–995.
18. Brennicke A., Klein M., Binder S., Knoop V., Grohmann L., Malek O., Marchfelder A., Marienfeld J., Unseld M. (1996) Molecular biology of plant mitochondria. Naturwissenschaften, 83, 339–346.
19. Oda K., Yamato K., Ohta E., Nakamura Y., Takemura M., Nozato N., Akashi K., Kanegae T., Ogura Y., Kohchi T., et al. (1992) Gene organization deduced from the complete sequence of liverwort Marchantia polymorpha mitochondrial DNA. A primitive form of plant mitochondrial genome. J. Mol. Biol., 223, 1–7.
20. Anderson S., Bankier A.T., Barrell B.G., de Bruijn M.H., Coulson A.R., Drouin J., Eperon I.C., Nierlich D.P., Roe B.A., Sanger F., et al. (1981) Sequence and organization of the human mitochondrial genome. Nature, 290, 457–465.
21. Taanman J.-W. (1999) The mitochondrial genome: structure; transcription; translation and replication. Biochim. Biophys. Acta, 1410, 103–123.
22. Bendich A.J. (1987) Why do chloroplasts and mitochondria contain so many copies of their genome? Bioessays, 6, 279–282.
23. Lightowlers R.N., Chinnery P.F., Turnbull D.M., Howell N. (1997) Mammalian mitochondrial genetics: heredity, heteroplasmy and disease. Trends Genet., 13, 450–455.
24. Abdallah F., Salamini F., Leister D. (2000) A prediction of the size and evolutionary origin of the proteome of chloroplasts of Arabidopsis. Trends Plant Sci., 5, 141–142.
25. Vera A. and Sugiura M. (1994) A novel RNA gene in the tobacco plastid genome: its possible role in the maturation of 16S rRNA. EMBO J., 13, 2211–2217.
26. Huttenhofer A., Cavaille J., Bachellerie J.P. (2004) Experimental RNomics: a global approach to identifying small nuclear RNAs and their targets in different model organisms. Methods Mol. Biol., 265, 409–428.
27. Huttenhofer A. and Vogel J. (2006) Experimental approaches to identify non-coding RNAs. Nucleic Acids Res., 34, 635–646.
28. Bock R. (1998) Analysis of RNA editing in plastids. Methods, 15, 75–83.
29. Xie J., Techritz S., Haebel S., Horn A., Neitzel H., Klose J., Schuelke M. (2005) A two-dimensional electrophoretic map of human mitochondrial proteins from immortalized lymphoblastoid cell lines: a prerequisite to study mitochondrial disorders in patients. Proteomics, 5, 2981–2999.
30. Barkan A. and Goldschmidt-Clermont M. (2000) Participation of nuclear genes in chloroplast gene expression. Biochimie, 82, 559–572.
31. Fey V., Wagner R., Brautigam K., Pfannschmidt T. (2005) Photosynthetic redox control of nuclear gene expression. J. Exp. Bot., 56, 1491–1498.
32. Klein R.R. and Mullet J.E. (1987) Control of gene expression during higher plant chloroplast biogenesis. Protein synthesis and transcript levels of psbA, psaA-psaB, and rbcL in dark-grown and illuminated barley seedlings. J. Biol. Chem., 262, 4341–4348.
33. Sigmund C.D., Ettayebi M., Borden A., Morgan E.A. (1988) Antibiotic resistance mutations in ribosomal RNA genes of Escherichia coli. Methods Enzymol., 164, 673–690.
34. Lerbs-Mache S. (2000) Regulation of rDNA transcription in plastids of higher plants. Biochimie, 82, 525–535.
35. Huttenhofer A., Kiefmann M., Meier-Ewert S., O'Brien J., Lehrach H., Bachellerie J.P., Brosius J. (2001) RNomics: an experimental approach that identifies 201 candidates for novel, small, non-messenger RNAs in mouse. EMBO J., 20, 2943–2953.
36. Endo T., Shikanai T., Takabayashi A., Asada K., Sato F. (1999) The role of chloroplastic NAD(P)H dehydrogenase in photoprotection. FEBS Lett., 457, 5–8.
37. Horvath E.M., Peter S.O., Joet T., Rumeau D., Cournac L., Horvath G.V., Kavanagh T.A., Schafer C., Peltier G., Medgyesy P. (2000) Targeted inactivation of the plastid ndhB gene in tobacco results in an enhanced sensitivity of photosynthesis to moderate stomatal closure. Plant Physiol., 123, 1337–1350.
38. Joet T., Cournac L., Horvath E.M., Medgyesy P., Peltier G. (2001) Increased sensitivity of photosynthesis to antimycin A induced by inactivation of the chloroplast ndhB gene. Evidence for a participation of the NADH-dehydrogenase complex to cyclic electron flow around photosystem I. Plant Physiol., 125, 1919–1929.
39. Li X.G., Duan W., Meng Q.W., Zou Q., Zhao S.J. (2004) The function of chloroplastic NAD(P)H dehydrogenase in tobacco during chilling stress under low irradiance. Plant Cell Physiol., 45, 103–108.
40. Shikanai T., Endo T., Hashimoto T., Yamada Y., Asada K., Yokota A. (1998) Directed disruption of the tobacco ndhB gene impairs cyclic electron flow around photosystem I. Proc. Natl Acad. Sci. USA, 95, 9705–9709.
41. Billoud B., De Paepe R., Baulcombe D., Boccara M. (2005) Identification of new small non-coding RNAs from tobacco and Arabidopsis. Biochimie, 87, 905–910.
42. Marker C., Zemann A., Terhorst T., Kiefmann M., Kastenmayer J.P., Green P., Bachellerie J.P., Brosius J., Huttenhofer A. (2002) Experimental RNomics: identification of 140 candidates for small non-messenger RNAs in the plant Arabidopsis thaliana. Curr. Biol., 12, 2002–2013.
43. Vogel A., Schilling O., Spath B., Marchfelder A. (2005) The tRNase Z family of proteins: physiological functions, substrate specificity and structural properties. Biol. Chem., 386, 1253–1264.
44. Wagner E.G., Altuvia S., Romby P. (2002) Antisense RNAs in bacteria and their genetic elements. Adv. Genet., 46, 361–398.
45. Opdyke J.A., Kang J.G., Storz G. (2004) GadY, a small-RNA regulator of acid response genes in Escherichia coli. J. Bacteriol., 186, 6698–6705.
46. Krinke L. and Wulff D.L. (1990) RNase III-dependent hydrolysis of lambda cII-O gene mRNA mediated by lambda OOP antisense RNA. Genes Dev., 4, 2223–2233.
47. Strittmatter G., Gozdzicka-Jozefiak A., Kossel H. (1985) Identification of an rRNA operon promoter from Zea mays chloroplasts which excludes the proximal tRNAValGAC from the primary transcript. EMBO J., 4, 599–604.
48. Wolfe K.H., Morden C.W., Ems S.C., Palmer J.D. (1992) Rapid evolution of the plastid translational apparatus in a non-photosynthetic plant: loss or accelerated sequence evolution of tRNA and ribosomal protein genes. J. Mol. Evol., 35, 304–317.
49. Bungard R.A. (2004) Photosynthetic evolution in parasitic plants: insight from the chloroplast genome. Bioessays, 26, 235–247.
50. Morden C.W., Wolfe K.H., dePamphilis C.W., Palmer J.D. (1991) Plastid translation and transcription genes in a non-photosynthetic plant: intact, missing and pseudo genes. EMBO J., 10, 3281–3288.
51. Li W., Ruf S., Bock R. (2006) Constancy of organellar genome copy numbers during leaf development and senescence in higher plants. Mol. Genet. Genomics, 275, 185–192.
52. Atchison B.A., Whitfeld P.R., Bottomley W. (1976) Comparison of chloroplast DNAs by specific fragmentation with EcoRI endonuclease. Mol. Gen. Genet., 148, 263–269.
53. Chang D.D., Hauswirth W.W., Clayton D.A. (1985) Replication priming and transcription initiate from precisely the same site in mouse mitochondrial DNA. EMBO J., 4, 1559–1567.
54. Chang D.D. and Clayton D.A. (1985) Priming of human mitochondrial DNA replication occurs at the light-strand promoter. Proc. Natl Acad. Sci. USA, 82, 351–355.
55. Montoya J., Christianson T., Levens D., Rabinowitz M., Attardi G. (1982) Identification of initiation sites for heavy-strand and light-strand transcription in human mitochondrial DNA. Proc. Natl Acad. Sci. USA, 79, 7195–7199.
56. Ojala D., Crews S., Montoya J., Gelfand R., Attardi G. (1981) A small polyadenylated RNA (7S RNA), containing a putative ribosome attachment site, maps near the origin of human mitochondrial DNA replication. J. Mol. Biol., 150, 303–314.
57. Clayton D.A. (1983) Replication of animal mitochondrial DNA. Cell, 28, 693–705.
58. Slomovic S., Laufer D., Geiger D., Schuster G. (2005) Polyadenylation and degradation of human mitochondrial RNA: the prokaryotic past leaves its mark. Mol. Cell. Biol., 25, 6427–6435.
59. Garesse R. and Vallejo C.G. (2001) Animal mitochondrial biogenesis and function: a regulatory cross-talk between two genomes. Gene, 263, 1–16.
60. Bayona-Bafaluy M.P., Acin-Perez R., Mullikin J.C., Park J.S., Moreno-Loshuertos R., Hu P., Perez-Martos A., Fernandez-Silva P., Bai Y., Enriquez J.A. (2003) Revisiting the mouse mitochondrial DNA. Nucleic Acids Res., 31, 5349–5355.
61. Entelis N.S., Kolesnikova O.A., Dogan S., Martin R.P., Tarassov I.A. (2001) 5S rRNA and tRNA import into human mitochondria. Comparison of in vitro requirements. J. Biol. Chem., 276, 45642–45653.
62. Shinozaki K., Ohme M., Tanaka M., Wakasugi T., Hayashida N., Matsubayashi T., Zaita N., Chunwongse J., Obokata J., Yamaguchi-Shinozaki K., et al. (1986) The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression. EMBO J., 5, 2043–2049.
63. Wakasugi T., Sugita M., Tsudzuki T., Sugiura M. (1998) Updated gene map of tobacco chloroplast DNA. Plant Mol. Biol. Rep., 16, 231–241.
64. Gruissem W. and Tonkyn J.C. (1993) Control mechanisms of plastid gene expression. Crit. Rev. Plant Sci., 12, 19–55.
65. Haley J. and Bogorad L. (1990) Alternative promoters are used for genes within maize chloroplast polycistronic transcription units. Plant Cell, 2, 323–333.
66. Igloi G.L. and Kössel H. (1992) The transcriptional apparatus of chloroplasts. Crit. Rev. Plant Sci., 10, 525–558.
67. Sugita M. and Sugiura M. (1996) Regulation of gene expression in chloroplasts of higher plants. Plant Mol. Biol., 32, 315–326.
68. Hedtke B., Borner T., Weihe A. (1997) Mitochondrial and chloroplast phage-type RNA polymerases in Arabidopsis. Science, 277, 809–811.
69. Hess W.R. and Borner T. (1999) Organellar RNA polymerases of higher plants. Int. Rev. Cytol., 190, 1–59.
70. Mullet J.E. and Klein R.R. (1987) Transcription and RNA stability are important determinants of higher plant chloroplast RNA levels. EMBO J., 6, 1571–1579.
71. Rapp J.C., Baumgartner B.J., Mullet J. (1992) Quantitative analysis of transcription and RNA levels of 15 barley chloroplast genes. Transcription rates and mRNA levels vary over 300-fold; predicted mRNA stabilities vary 30-fold. J. Biol. Chem., 267, 21404–21411.
72. Mullet J.E. (1993) Dynamic regulation of chloroplast transcription. Plant Physiol., 103, 309–313.
73. Vogel J., Argaman L., Wagner E.G., Altuvia S. (2004) The small RNA IstR inhibits synthesis of an SOS-induced toxic peptide. Curr. Biol., 14, 2271–2276.
74. Vogel J., Bartels V., Tang T.H., Churakov G., Slagter-Jager J.G., Huttenhofer A., Wagner E.G. (2003) RNomics in Escherichia coli detects new sRNA species and indicates parallel transcriptional output in bacteria. Nucleic Acids Res., 31, 6435–6443.
75. Willkomm D.K., Minnerup J., Huttenhofer A., Hartmann R.K. (2005) Experimental RNomics in Aquifex aeolicus: identification of small non-coding RNAs and the putative 6S RNA homolog. Nucleic Acids Res., 33, 1949–1960.
76. Axmann I.M., Kensche P., Vogel J., Kohl S., Herzel H., Hess W.R. (2005) Identification of cyanobacterial non-coding RNAs by comparative genome analysis. Genome Biol., 6, R73.
77. Holt N.E., Fleming G.R., Niyogi K.K. (2004) Toward an understanding of the mechanism of nonphotochemical quenching in green plants. Biochemistry, 43, 8281–8289.
78. Scheibe R., Backhausen J.E., Emmerlich V., Holtgrefe S. (2005) Strategies to maintain redox homeostasis during photosynthesis under changing conditions. J. Exp. Bot., 56, 1481–1489.
79. Bock R. (2001) Transgenic plastids in basic research and plant biotechnology. J. Mol. Biol., 312, 425–438.
80. Bock R. and Hippler M. (2002) Extranuclear inheritance: functional genomics in chloroplasts. Prog. Bot., 63, 106–131.
81. Maliga P. (2004) Plastid transformation in higher plants. Annu. Rev. Plant Biol., 55, 289–313.
82. Bock R. (2004) Taming plastids for a green future. Trends Biotechnol., 22, 311–318.
© 2006 The Author(s).
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Two uniquely arranged thyroid hormone response elements in the far upstream 5′ flanking region confer direct thyroid hormone regulation to the murine cholesterol 7α hydroxylase gene
Shin, Dong-Ju; Plateroti, Michelina; Samarut, Jacques; Osborne, Timothy F.
doi: 10.1093/nar/gkl506 pmid: 16899449
ABSTRACT Cholesterol 7α hydroxylase (CYP7A1) is a key enzyme in cholesterol catabolism to bile acids and its activity is important for maintaining appropriate cholesterol levels. The murine CYP7A1 gene is highly inducible by thyroid hormone in vivo and there is an inverse relationship between thyroid hormone and serum cholesterol. Even though gene expression has been shown to be upregulated, whether the induction is mediated through a direct effect of thyroid hormone on the CYP7A1 promoter has never been established. Using gene-targeted mice, we show that either of the two TR isoforms is sufficient to maintain normal hepatic CYP7A1 expression but that loss of both results in a significant decrease in expression. We also identified two new functional thyroid hormone receptor-binding sites in the CYP7A1 5′ flanking sequence located 3 kb upstream from the transcription start site. One site is a DR-0, which is an unusual type of TR response element, and the other consists of only a single recognizable half site that is required for TR/retinoid X receptor (RXR) binding. These two independent TR-binding sites are closely spaced and both are required for full induction of the CYP7A1 promoter by thyroid hormone, although the DR-0 site was more crucial. INTRODUCTION Cholesterol 7α hydroxylase (CYP7A1) catalyzes the initial and rate-limiting step in the neutral synthetic pathway of bile acids from cholesterol. Because the bile acid synthetic pathway is a major route to remove excess cholesterol from the body, CYP7A1 is considered an important enzyme in cholesterol homeostasis (1). CYP7A1 is exclusively expressed in the liver and the gene is subject to metabolic regulation by oxysterols, bile acids, hormones, nutrients and cytokines.
In mice and rats, CYP7A1 is activated by cholesterol excess through by-product oxysterols that function as ligand agonists for the liver X receptor (LXR)/retinoid X receptor (RXR) heterodimer, which binds to direct repeats of half sites separated by 4 nt (DR-4, LXRE) in the CYP7A1 promoter, increasing expression of CYP7A1 (2–4). Conversely, bile acids inhibit CYP7A1 gene expression through a negative feedback mechanism operating through several molecular pathways. In one pathway, bile acids activate the farnesoid X receptor (FXR), which in turn induces expression of the small heterodimer partner (SHP). SHP binds to and interferes with the activity of the α1-fetoprotein transcription factor (FTF-1, also called LRH-1), leading to an inhibition of CYP7A1 expression (5,6). Hepatocyte nuclear factor 4 (HNF-4) has also been shown to mediate bile acid-induced repression of CYP7A1 (7). In fasted mouse livers and in type I diabetic mice, PPAR-γ-coactivator one alpha (PGC-1α) plays an important role in activating CYP7A1 gene expression (8). Additionally, CYP7A1 expression is regulated by thyroid hormone through a direct effect on gene transcription (9–11). Thyroid hormone mediates its action through the thyroid hormone receptor (TR), a member of the nuclear receptor superfamily of ligand-dependent transcription factors (12,13). There are two major isoforms of TR generated by different genes, TRβ and TRα. Each isoform exhibits a distinct pattern of tissue and developmental expression and there are multiple transcripts from each TR gene. TRβ is the primary isoform in the liver (14). TR binds to specific DNA sequences, TR response elements (TREs), as monomers, homodimers, or with RXR in a heterodimer. Since RXR enhances the binding affinity of TR to TRE, TR/RXR heterodimers have been suggested to be major protein complexes that mediate thyroid hormone responses in vivo (12,15). 
Through the use of synthetic DNA and in vitro DNA-binding studies, a high affinity consensus element for TR was determined to be the DR-4 motif (16). However, naturally occurring TREs diverge significantly from this consensus and many consist of different orientations and configurations of repeats of the nuclear receptor half site AGGTCA. These can vary from a single half site (17) to a DR-0 (18), palindromes (17,19) and multiple separate and variably spaced direct repeats (20,21). The association of hypothyroidism with hypercholesterolemia was first recognized in 1930 (22,23). This thyroid hormone effect is thought to be through direct regulation of target genes of cholesterol metabolism at the transcriptional level (9–11). Cholesterol homeostasis is maintained through cooperative regulation of cholesterol uptake and de novo synthesis together with cholesterol catabolism to bile acids (1). Accordingly, our previous studies have shown that thyroid hormone directly up-regulates expression of sterol regulatory element binding protein-2 (SREBP-2), which in turn increases expression of low density lipoprotein receptor (LDLR), resulting in a decrease in plasma cholesterol levels (24). Cholesterol catabolism is also modulated by thyroid hormone primarily through changes in CYP7A1 mRNA levels. CYP7A1 mRNA is induced rapidly within 1 h of triiodothyronine (T3) treatment in hypophysectomized rats (10,11) and T3 treatment also increases the rate of CYP7A1 gene transcription (9). This rapid induction suggests that the increase in CYP7A1 mRNA may be directly mediated by thyroid hormone at the transcriptional level. In addition, induction of CYP7A1 expression by T3 was blunted in TRβ knockout mice (25) and in knock-in mice where a mutant TRβ was inserted that has a defect in ligand binding (26). These studies argue strongly for a direct action of TR on CYP7A1. 
Indeed, protein–DNA-binding studies have shown direct binding of TR/RXR to the proximal DR-4 of the CYP7A1 promoter which has been characterized as an LXRE (26). However, whether this proximal site or any other TR-binding site(s) in the promoter is responsible for TR responsiveness has not been established. In the current study, we have identified two closely spaced TREs in the far upstream region of the CYP7A1 promoter that are responsible for the CYP7A1 stimulation by T3. Both of these TREs were required for full induction by T3 and, unlike some TREs, these CYP7A1 T3 response elements do not respond to LXR signaling. MATERIALS AND METHODS Animal treatments All animals used in this study were acquired and maintained in accordance with the NIH Guide for the Care and Use of Laboratory Animals. For studies at UC Irvine, the protocols were approved by the campus IACUC committee (approval 97-1545). For studies in Lyon, mice were housed, maintained and sacrificed with approval from the animal experimental committee of the Ecole Normale Supérieure de Lyon and in accordance with the ‘Commission de Génie Génétique’ (Agreement number 12837). Four-week-old B6129 male mice were obtained from Taconic and maintained on a 12 h light/dark cycle with free access to food and water. The mice were allowed to adapt to the new environment for 1 week before experiments. Thyroid hormone deficiency was induced by feeding a low-iodine diet supplemented with 0.15% propylthiouracil (PTU) (Harlan Teklad) for 3 weeks, as described previously (24). For the thyroid hormone-supplemented group, mice were given 1 µg of T3 (ICN) per gram of body weight on the 18th day of the low-iodine diet with PTU by intraperitoneal injections daily for 4 days. The control group was fed ad libitum with normal chow diet. The mice were sacrificed between 8:00 and 10:00 a.m. Livers were removed and frozen in liquid nitrogen and stored at −80°C until RNA was isolated. 
For the studies in TR knockout mice, 8- to 10-week-old TRα0/0 (27), TRβ−/− (28), TRα0/0/TRβ−/− (28) and the respective wild-type control animals were used. They were maintained on a 12 h light/12 h dark schedule (light on at 7 a.m.) and fed standard mouse chow and water ad libitum. For the experiments, animals were sacrificed at 2 p.m. after 4 h of starving. The liver was quickly removed and frozen in liquid nitrogen and used for RNA extraction. RNA isolation and northern blot analysis Total RNA was isolated from mouse livers using TRIzol (Invitrogen). Total RNA (20 µg) from two to four individual animals per feeding condition was subjected to northern analysis with 32P-labeled probes as described previously (24). Expression of ribosomal protein L32 was measured as a control to normalize signals from different lanes. The following cDNA probes were used: CYP7A1 (a gift from G. Gil, Virginia Commonwealth University), a 0.7 kb AccI/EcoRI fragment from the pBSK7a; 5′DI, a 0.8 kb BamHI/EcoRI fragment from pCR2.1-mouse 5′DI; and an 80 base HindIII/EcoRI fragment of rat ribosomal protein L32 cDNA (24). Cell culture and transient transfection assay HepG2 cells were cultured at 37°C and 5% CO2 in minimal essential medium supplemented with 10% fetal bovine serum (FBS) and penicillin/streptomycin. Transient transfections were performed by the calcium phosphate co-precipitation method, as described previously (24). Briefly, cells were seeded in 6-well dishes at 350 000 cells/well one day before transfection. The next day (16 h later), cells were transfected with 2 µg/well of the indicated CYP7A1 luciferase reporter and 2 µg/well of cytomegalovirus β-galactosidase plasmid as an internal control for transfection efficiency. Expression vectors for CMX-hTR-β (2 µg), CMX-LXRα (0.5 µg) and CMX-hRXRα (0.5 µg) were included as indicated in the figures. Equal amounts of DNA were used for all transfection reactions by adding empty vector DNA. 
After 4–6 h, cells were treated with 10% glycerol for 2 min, washed three times with phosphate-buffered saline (PBS), and refed with a serum-free medium supplemented with 5 µg/ml insulin, 5 µg/ml transferrin, 5 ng/ml selenite, 5 µl/ml defined lipid mix and 0.1% de-lipidated BSA (Sigma) in the presence or absence of 1 µM 3,3′T3. Cells were incubated for an additional 40 h and harvested for luciferase and β-galactosidase assays, and normalized expression was determined as described previously (24). Values represent the mean of duplicates ± SD. Each experiment was repeated at least three times. Plasmids The region corresponding to −7454/+59 of the rat CYP7A1 promoter was obtained from pR7α-Cat9 (a gift from G. Gil, Virginia Commonwealth University) by digestion with SalI and XbaI and inserted into the XhoI and NheI sites of pGL3 basic to generate pGL3R7α-7454. pGL3R7α-3640 was constructed by digestion of pGL3R7α-7454 with SacI followed by re-ligation. To construct pGL3R7α-1667, pGL3R7α-3640 was digested with SacI and PstI, blunted by T4 polymerase, and re-ligated. A series of 5′-deletion mutants of pGL3R7α-3640 was constructed by PCR-based amplification. The following forward primers were used: pGL3R7α-3382, 5′-GTGAACTTTCCTGTATGGGT-3′; pGL3R7α-3132, 5′-TGGTATGCCAGGACTTTGGA-3′; and pGL3R7α-3008, 5′-ACTTCAGTGCCCACCATGCA-3′. The reverse primer for all was 5′-ACAAGTAGACTGCAAGGGGA-3′. The SacI site was added to the forward primer for cloning purposes. Because the PCR fragments include the SmaI site at −2770 bases, they were digested with SacI and SmaI and inserted into the SacI and SmaI sites of pGL3R7α-3640 to generate pGL3R7α-3382, pGL3R7α-3132 and pGL3R7α-3008. pGL3R7α-3640mTRE1, pGL3R7α-3132mTRE1, pGL3R7α-3132mTRE2 and pGL3R7α-3132mTRE1/2 were constructed by site-directed mutagenesis (QuikChange, Stratagene). 
The sequences of one strand of the complementary primers are shown below: mTRE1, 5′-CAATAATAACCCTGTCTTTTCAAAGCATCTATCTGTACTGCTGC-3′; and mTRE2, 5′-CTGCTGCAATAGAAACTCCACAGGTCAAAATCACAGCTGTTGTGT-3′. To generate pGL3R7α-3132/342, a fragment from −3132 to −3008 was produced by PCR. The forward primer was the same as the one for pGL3R7α-3132. The reverse primer was 5′-AATCCTGGGGACACTGTGTA-3′ and included the XbaI site for cloning purposes. The PCR fragment was digested with SacI and XbaI and inserted into the SacI and NheI sites of pGL3R7α-342. CMX-hTR-α and CMX-hTR-β were from Dr Bruce Blumberg, University of California, Irvine, and CMX-hRXR-α and CMX-LXRα were from Dr Peter Tontonoz (UCLA). Electrophoretic mobility shift assays Human TRα, TRβ and RXRα proteins were synthesized using the TNT-coupled transcription/translation system (Promega). The sequences of one strand of the complementary oligonucleotide probes are as follows: wild-type TRE1, 5′-AACCCTGTCTTTTCAGGGCATCTATCTGTACTGCTGCAATAGAAA-3′; mutant TRE1, 5′-AACCCTGTCTTTTCAAAGCATCTATCTGTACTGCTGCAATAGAAA-3′; wild-type TRE2, 5′-AATAGAAACTCCACAGGTCAGGGTCACAGCTGTTGTGTTTTACACA-3′; mutant TRE2, 5′-AATAGAAACTCCACAGGTCAAAATCACAGCTGTTGTGTTTTACACA-3′; and DR-4, 5′-CTAGAGCTTCAGGTCACAGGAGGTCAGAGAGCT-3′. The complementary oligonucleotides were annealed and, where indicated, labeled with [γ-32P]ATP by T4 polynucleotide kinase. Protein–DNA-binding assays were performed as described previously (24). For competition experiments, unlabeled probes were included in the binding buffer at the concentration indicated in the legend to Figure 3B and C. Chromatin immunoprecipitation assays (ChIP) Rat hepatoma H4IIE cells were grown to 70% confluence in minimum essential medium supplemented with 10% FBS at 37°C and 5% CO2. Nuclear proteins were crosslinked to DNA by adding formaldehyde to the culture medium to a final concentration of 1% for 5 min at room temperature. 
Crosslinking was stopped by adding glycine to a final concentration of 0.125 M for 10 min. Cells were washed three times and collected in ice-cold PBS supplemented with protease inhibitors. Cell pellets were resuspended in lysis buffer (5 mM HEPES, pH 8.0, 85 mM KCl, 0.5% NP-40 and protease inhibitors) and homogenized with five strokes of a dounce homogenizer with a B pestle to release nuclei. After centrifugation at 313 × g for 10 min at 4°C, nuclear pellets were resuspended in nuclear lysis buffer (50 mM Tris–HCl, pH 7.6, 10 mM EDTA, 1% SDS and protease inhibitors) and sonicated for 5 min with 30 s on/off intervals to reduce the size of chromatin to ∼500 bases in length (monitored by gel electrophoresis). Lysates were diluted 1:20 in IP dilution buffer (0.01% SDS, 1.1% Triton X-100, 1.2 mM EDTA, 16.7 mM Tris–HCl, pH 7.6, 167 mM NaCl and protease inhibitors). Crosslinked chromatin samples were precleared by adding salmon sperm DNA–protein G agarose slurry and mouse IgG for 2 h at 4°C with rotation. Supernatants were recovered and incubated with 5 µl of anti-TRβ (Affinity BioReagents, Golden, CO) or mouse IgG overnight at 4°C with rotation. The immune complexes were then mixed with salmon sperm DNA–protein G agarose slurry for 2 h at 4°C with rotation. The beads were collected by microspin column and washed sequentially with the following buffers: buffer B (20 mM Tris–HCl, pH 7.6, 2 mM EDTA, 0.05% SDS, 1% Triton X-100 and 150 mM NaCl), buffer D (20 mM Tris–HCl, pH 7.6, 2 mM EDTA, 0.05% SDS, 1% Triton X-100 and 500 mM NaCl), buffer 3 (10 mM Tris–HCl, pH 7.6, 250 mM LiCl, 1 mM EDTA, 1% NP-40 and 1% deoxycholate) and IP wash buffer (20 mM Tris–HCl, pH 7.6, 150 mM NaCl, 2 mM EDTA and 1% Triton X-100). The immune complexes were eluted with IP elution buffer (40 mM sodium bicarbonate and 1% SDS). After reversing the protein–DNA crosslinks in the chromatin, DNA was recovered with a Qiagen PCR purification kit and used as a template for quantitative PCR with an I-cycler (Bio-Rad). 
Supernatants from immunoprecipitation with IgG were saved and used to prepare control samples representing the unenriched starting material or ‘input’ DNA. A series of dilutions of input DNA with known concentrations (OD260) was used as a template to monitor the efficiency of quantitative PCR and to quantify DNA crosslinked to TR protein. The following primers were used for quantitative PCR: rCYP7A1 TREs forward 5′-AAACTCAACTTGGTATGCCAGGAC-3′; rCYP7A1 TREs reverse 5′-TCACACACATGCACACAAGCAC-3′; rCYP7A1 DR-4(LXRE) forward 5′-AGCACATGAGGGACAGACCTTCAG-3′; rCYP7A1 DR-4(LXRE) reverse 5′-TGCACAGGACCATGATCCAATAAC-3′; rYY1 exon 4 forward 5′-GCTGCACAAAGATGTTCAGGGATAA-3′; and rYY1 exon 4 reverse 5′-CTGAAAGGGCTTTTCTCCAGTATG-3′. RESULTS Thyroid hormone induces CYP7A1 mRNA in vivo Expression of the murine CYP7A1 gene fluctuates in response to thyroid hormone status (9–11). This is demonstrated in Figure 1A where CYP7A1 mRNA was barely detectable in livers of thyroid hormone depleted mice but was significantly induced when T3 was added back after the depletion. The 5′ deiodinase gene, a well-known target of thyroid hormone (29), displayed a similar response to T3. As the mice were treated with supraphysiological doses of T3 to induce hyperthyroid conditions (30), mRNA levels for CYP7A1 and 5′ deiodinase in mice treated with T3 were hyperactivated relative to control mice. To investigate the regulation of CYP7A1 by T3 further we evaluated expression in knockout mice that totally lack each or both of the TR isoforms. When these animals were fed normal chow diets without manipulating the thyroid hormone content, CYP7A1 mRNA was elevated in livers from TRα knockout mice and only when both TR isoforms were disrupted was there a notable reduction in basal CYP7A1 mRNA (Figure 1B, P = 0.026). In contrast, expression of the 5′ deiodinase gene was sensitive to the specific loss of the TRβ isoform. 
These results indicate that either TR isoform can activate CYP7A1 whereas the 5′ deiodinase gene is preferentially activated by TRβ (Figure 1B).
Figure 1. Thyroid hormone regulation of the CYP7A1 gene. (A) Total RNA was isolated from livers of mice that were fed a normal chow (C), an iodine-deficient diet supplemented with PTU (P) or an iodine-deficient diet supplemented with PTU and injected with T3 (P+T3). Equal amounts (20 µg) of total RNA from individual animals were loaded in separate lanes and analyzed by northern analysis to measure mRNA levels for CYP7A1 and 5′DI. Signals were quantified using Quantity One software from Bio-Rad using densitometric scans from autoradiograms, and the intensities were normalized relative to the control ribosomal protein L32 mRNA for each lane. Results are expressed as a fold change relative to the value from the control chow-fed animals. The mean values obtained from individual measurements from six animals in each group are shown with error bars. (B) Total RNA was isolated from livers of WT, TRα(0/0), TRβ(−/−) and TRα(0/0)/TRβ(−/−) mice that were fed a normal chow.
Taken together with previous studies, these results suggest that CYP7A1 is a direct thyroid hormone target gene; however, a T3 response element had not been located previously. Therefore, a fragment of the 5′-flanking region of the rat CYP7A1 promoter from −3640 to +59 (pGL3R7α-3640) was fused to the luciferase reporter gene and tested for T3 responsiveness in co-transfection assays in HepG2 cells. The HepG2 cell-based co-transfection assay has been established to evaluate T3 regulation of other genes (24). In this assay, expression vectors for TRβ and RXRα were co-transfected with pGL3R7α-3640 into HepG2 cells that were subsequently cultured in the presence or absence of T3 and the activity of the reporter gene was measured. As shown in Figure 2A, T3 treatment resulted in a 21-fold increase in the activity of pGL3R7α-3640. In contrast, when T3 was added without the expression constructs or if the TRβ and RXRα expression vectors were added without T3, there was only minimal promoter activity. As a control, the activity of murine leukemia virus luciferase (MLVLuc) reporter, a known T3 responsive promoter (31), was also activated significantly in response to T3 in these studies. These data suggest that the CYP7A1 promoter is T3 responsive and that TRE(s) directly mediating the response are located within the 3640 bases flanking the 5′ end of the transcription start site.
Figure 2. CYP7A1 promoter is activated by TR in response to T3. (A) HepG2 cells were transfected with the indicated promoter–luciferase fusion construct and where indicated expression vectors were added for TRβ and RXRα. Cells were cultured in serum-free minimal medium and 1 µM T3 was added as indicated and described in Materials and Methods. Results are expressed as corrected luciferase light units divided by the internal control signal for β-galactosidase activity. (B) Similar experiments were performed in HepG2 cells with the indicated promoter–luciferase fusion construct along with expression vectors for TRβ and RXRα. The fold (x) change in the promoter activity by T3 relative to cells transfected with each luciferase reporter alone is shown beside each bar. The data from (A) and (B) represent the mean of duplicates for three individual experiments and include error bars (SEM). RLU, relative light units.
To identify the TRE(s), a series of progressive 5′ deletions of the promoter was prepared and tested for T3 responsiveness as above. Truncation from −3640 to −3382, or to −3132 caused no significant change in T3 responsiveness (Figure 2B). However, deletion to −3008 resulted in a marked decrease in T3 activation from 24.8- to 3.9-fold. 
Further deletion down to −1667 showed a modest drop of T3 activation to 2- to 3-fold of the basal promoter activity observed in pGL3R7α-342. A modest increase in promoter activity by 2- to 3-fold has been shown for the empty pGL3-basic vector in the presence and absence of T3 (24) and was not considered to represent a specific T3 response. These results suggest that a T3 responsive region is located about 3000 bp 5′ to the CYP7A1 mRNA start site. Identification of thyroid hormone receptor-binding sites in the CYP7A1 promoter Using synthetic response and binding elements, it was shown that TR preferentially binds to a site containing two direct repeats of hexameric half sites of AGGTCA spaced by 4 nt. However, naturally occurring TR-binding sites are diverse with respect to the sequence and orientation of the half sites, and the number of nucleotides in the spacer (12,15). When we scanned the −3000 region of the promoter for nuclear receptor half sites, two DR motifs were found: one consisting of a single half site, designated TRE1, and the other consisting of two direct repeats of the half site with no spacer, designated TRE2 (Figure 3A). Importantly, although the sequences of the mouse and rat promoters diverge significantly in this region of the promoter, TRE1 is moderately conserved and TRE2 is identical.
Figure 3. Identification of TR/RXR-binding sites in the CYP7A1 promoter. (A) The 5′-flanking sequence alignment of the rat and mouse CYP7A1 promoter is shown. Arrows indicate putative TREs. (B) In the upper panel, the sequence of the wild-type TRE1 is presented as TRE1/WT. The sequence of mutations in TRE1 is indicated with lower case lettering with mTRE1. The full sequences for the oligonucleotide probes are detailed in Materials and Methods. An autoradiogram from a representative gel shift assay is shown in the lower panel. 32P-labeled probes were incubated with in vitro translated TRα, TRβ and RXRα, as indicated. Where indicated, a 100-fold molar excess of the indicated unlabeled probe (Comp.) was included in the binding reactions with the labeled probe. D4 denotes the consensus DR-4. WT and M1 denote the wild-type TRE1 and mutant TRE1, respectively. The arrow denotes the position of specific protein–DNA complexes. (C) Similar experiments were performed for TRE2 as described in (B).
To evaluate whether these putative TREs directly bind TR, gel shift assays were performed using 32P-labeled oligonucleotides containing either TRE1 or TRE2 with in vitro translated TRα, TRβ and RXRα protein. Both TRα/RXRα and TRβ/RXRα heterodimers formed complexes with oligonucleotide probes containing either TRE1 or TRE2 (Figure 3B and C, lanes 5 and 6). A 100-fold molar excess of unlabeled wild-type TRE1 or a consensus TRE, DR-4 (32), efficiently competed out TRβ/RXRα for binding to TRE1 (lanes 7 and 8). 
A 2-base mutation in the 5′ half site of either TRE disrupted binding of TRβ/RXRα with DNA (lanes 13 and 14). Consistently, unlabeled mutant probe (mTRE1 or mTRE2) was not able to compete out TRβ/RXRα for the binding to the respective TRE (lane 9) and similar protein–DNA complexes were observed for the consensus DR-4 site that was used as a positive control. Taken together, these results suggest that TR/RXR binds independently to both TRE1 and TRE2. TRE1 and TRE2 are responsible for the TR response of CYP7A1 To determine whether TRE1 or TRE2 contribute to the T3 stimulation of the CYP7A1 promoter, point mutations identical to the bases mutated in the gel shift assays were introduced into either TRE1 or TRE2, or both simultaneously, in the pGL3R7α-3640 and pGL3R7α-3132 reporter plasmids by site-directed mutagenesis. The mutant constructs were then examined for T3 responsiveness as described in Figure 2. A mutation in TRE1 of pGL3R7α-3640 caused a 61% decrease in T3 response (Figure 4A), compared to its respective wild-type promoter, but it retained a significant amount of the T3 response. Similarly, mutation of TRE1 of pGL3R7α-3132 reduced T3-mediated promoter activity by 46%, compared to its respective wild-type promoter. When a mutation was introduced into TRE2, the mutant construct displayed almost no response to T3. The complete loss of T3 responsiveness was also observed for the TRE1/TRE2 double mutant. These data suggest that both of the TREs contribute to the T3 responsiveness, although TRE2 appears to be more critical.
Figure 4. TRE1 and TRE2 are responsible for the TR response of CYP7A1. (A) Transfection assays with wild-type and the indicated mutant luciferase reporter constructs were performed in cultured HepG2 cells as described in the legend to Figure 3. The data represent the mean of duplicate samples for three individual experiments and include error bars. M1 and M2 denote mutants of TRE1 and TRE2, respectively, where the identical nucleotide substitutions used to disrupt TR binding in Figure 3B and C were introduced into TRE1 and/or TRE2 of the luciferase reporter constructs. (B) The sequence from −3132 to −3008 was fused to the −342 sequence of the truncated CYP7A1 promoter reporter construct and compared to the full-length and −342 truncated promoter construct for T3 responsiveness as described in (A).
To test the ability of these specific TREs to confer T3 stimulation to an otherwise non-responsive promoter, the short region from −3132 to −3008 that harbors both TREs was ligated to pGL3R7α-342 to generate pGL3R7α-3132/342. As shown in Figure 4B, addition of this small fragment containing both TREs to the −342 promoter resulted in an 18-fold increase in response to T3, which was similar to the full-length −3640 construct. As expected, expression from the simple pGL3R7α-342 construct was insensitive to T3. 
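The arithmetic behind the reporter results reported above (luciferase signal divided by the β-galactosidase internal control, a fold change over a reference condition, and a mutant response expressed relative to wild type) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and all numeric readings are hypothetical and are not taken from the study, and the exact correction applied to the raw light units is described in reference (24) rather than here.

```python
from statistics import mean


def normalized_activity(luciferase, beta_gal):
    """Luciferase light units divided by the beta-galactosidase internal control."""
    if beta_gal <= 0:
        raise ValueError("beta-galactosidase signal must be positive")
    return luciferase / beta_gal


def fold_change(condition, reference):
    """Mean normalized activity of a condition divided by that of the reference."""
    return mean(condition) / mean(reference)


def percent_of_wild_type(mutant_fold, wild_type_fold):
    """Express a mutant's fold response as a percentage of the wild-type response."""
    return 100.0 * mutant_fold / wild_type_fold


# Hypothetical duplicate wells as (luciferase, beta-gal) readings; not data from the paper.
reference_wells = [normalized_activity(1000, 50), normalized_activity(1100, 50)]
t3_wells = [normalized_activity(21000, 50), normalized_activity(21000, 50)]

wild_type_fold = fold_change(t3_wells, reference_wells)
mutant_fold = 10.0  # hypothetical fold response of a mutant reporter
relative_response = percent_of_wild_type(mutant_fold, wild_type_fold)

print(f"wild-type fold change: {wild_type_fold:.1f}")
print(f"mutant response: {relative_response:.0f}% of wild type")
```

Dividing by the co-transfected β-galactosidase signal before taking ratios keeps well-to-well differences in transfection efficiency from being mistaken for a T3 effect, which is why the normalization is applied per well rather than to the averaged raw signals.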
TRE1/TRE2 do not mediate an LXR response Because TR and LXR both prefer binding to DR-4 elements (33) and because there is a DR-4 in the proximal murine CYP7A1 promoter that is responsive to LXR signaling (3), we evaluated whether these newly identified TREs might also contribute to LXR activation of CYP7A1. pGL3R7α-3132/WT, pGL3R7α-3132/mTRE1,2 or pGL3R7α-3008/WT were co-transfected with expression vectors for LXRα and RXRα in the presence or absence of GW3695, a synthetic agonist for LXR. The results from these experiments were compared with those for T3 responsiveness of the same constructs. As shown in Figure 5, the potent stimulation by LXR was not affected by mutations in or loss of either TRE1 or TRE2.
Figure 5. The TRE1/TRE2 do not mediate an LXR response. HepG2 cells were transfected with the indicated promoter–luciferase fusion construct and where indicated expression vectors for TRβ/RXRα or LXRα/RXRα were also included. Cells were cultured in serum-free minimal medium in the presence or absence of 1 µM T3 or 5 µM GW3695. Results are expressed as normalized luciferase light units divided by β-galactosidase activity. The 25-fold activation by T3 for the pGL3R7α-3132/WT construct was set to 100%. The magnitude of the T3 response of pGL3R7α-3132/mTRE1,2 and pGL3R7α-3008/WT is presented relative to that of pGL3R7α-3132/WT. The 30-fold activation of pGL3R7α-3132/WT by GW3695 relative to cells transfected with luciferase reporter alone was set to 100%. The magnitude of the GW3695 response of pGL3R7α-3132/mTRE1,2 and pGL3R7α-3008/WT is presented relative to that of pGL3R7α-3132/WT. The data represent the mean of duplicates for three individual experiments and include error bars.
Recruitment of TRβ into the TREs in H4IIE cells To directly evaluate if TRβ is capable of binding the TREs of the CYP7A1 promoter on the endogenous CYP7A1 gene in cellular chromatin, a ChIP study was performed in H4IIE cells. These cells have been used as a system to study regulation of gene expression by thyroid hormone (34) and have been used successfully to demonstrate that TR binds to TR target genes using the ChIP technique. Because TR has been shown to bind in chromatin in the presence or absence of its ligand (35), we made chromatin from H4IIE cells cultured under normal conditions without T3. After crosslinking with formaldehyde, aliquots were immunoprecipitated with anti-TRβ or a control mouse IgG fraction and binding to specific chromatin sites was analyzed by a quantitative PCR protocol. As shown in Figure 6, TRβ was efficiently recruited to the −3000 region of the CYP7A1 promoter (8-fold compared to a non-specific IgG control). Although the proximal DR-4 does not mediate a T3 response, TR was shown to be recruited to this proximal LXR responsive DR-4 in a previous study (26,36). 
Therefore, we also evaluated TR binding to the proximal promoter, where a modest 2-fold enrichment of TR binding was observed. As a negative control, TR binding to a non-relevant region of the genome (the YY1 locus) was also evaluated and shown to be negative. These data demonstrate that TRβ is specifically recruited in cellular chromatin to the −3000 region of the CYP7A1 promoter, where the two TREs identified in our study are located.
Figure 6. Recruitment of TRβ to the CYP7A1 TREs in H4IIE cells. Chromatin was prepared from H4IIE cells grown under normal conditions. The ChIP assays were performed with anti-TRβ as described in Materials and Methods. Immunoprecipitates were analyzed by quantitative PCR using primers that flanked the distal TREs or proximal DR-4 (LXRE) as indicated at the top of the figure. Results are expressed as the fold change in the level of DNA amplified from material precipitated by the TRβ antibody relative to that precipitated by a normal mouse IgG control. The recruitment of TRβ to a non-relevant region of the genome in the YY1 locus is shown as an additional negative control.
The data represent the mean of triplicates for two individual experiments and include error bars.
DISCUSSION
Thyroid hormone influences many aspects of hepatic metabolism, and the rate-controlling enzyme of neutral bile acid metabolism, CYP7A1, was shown to be activated at the transcriptional level by thyroid hormone in vivo in previous studies (9–11). However, no functional TRE(s) capable of conferring T3 responsiveness to CYP7A1 had been identified previously. Using transient transfection assays and DNA-binding studies, we have identified two TREs that are responsible for the T3 stimulation of the murine CYP7A1 gene. The two TREs are closely spaced and located in a far upstream region of the promoter, 3 kb upstream of the transcription start site. TREs are generally found in the proximal region of promoters, as for the rat growth hormone (17,19), rat malic enzyme (20), myelin basic protein (37) and α-myosin heavy chain genes (38). However, some TREs are located in a far upstream 5′ region. For example, the TREs of the rat Spot 14 and chicken malic enzyme genes are located 2.7 and 3.8 kb upstream of their respective transcription start sites (39,40). These natural TREs are often clustered and arranged with multiple repeats of the half site, functioning synergistically or cooperatively. Similarly, the two closely spaced TREs are located 3 kb upstream from the CYP7A1 mRNA start site and are both necessary for full responsiveness to T3, although TRE2 plays a more crucial role in T3 stimulation (Figure 4). We also examined the recruitment of TRβ to the TREs and DR-4 (LXRE) of the endogenous CYP7A1 in H4IIE cells. Our finding of greater recruitment of TRβ to the −3 kb TREs compared to the proximal DR-4 (LXRE) is consistent with the data from transfection assays where the distal TREs mediate a strong T3 response.
This would be expected if the affinity of TR binding simply reflects the magnitude of the functional T3 response. It is also possible that LXR may be more abundant than TR in H4IIE cells such that the DR-4 (LXRE) may be preferentially occupied by LXR/RXR rather than TR/RXR. Regardless, the presence of the proximal DR-4 is not sufficient to provide a significant response to T3 stimulation. We have also shown that either TRα or TRβ appears sufficient for normal expression of CYP7A1 when animals are euthyroid and maintained on a chow diet (Figure 1). However, when both receptors are absent there is a significant decrease in CYP7A1 mRNA. In contrast, expression of the 5′ deiodinase gene specifically requires TRβ under these conditions. Thus, either TR isoform appears capable of sustaining basal CYP7A1 expression. It is important to note that earlier reports showed that TRβ was key to activating CYP7A1 under dietary and pharmacological manipulations that dramatically alter thyroid hormone levels (25,41). There is a significant increase in CYP7A1 mRNA in the TRα knockout mice, suggesting that this receptor may function as a negative regulator of CYP7A1 when TRβ is also present. Alternatively, this result may be due to other secondary effects that result from the loss of TRα. Using synthetically designed promoter constructs, a consensus DR-4 element was shown to be a potent TRE (16). However, as mentioned above, TREs found in natural promoters are quite diverse in sequence, and the number and orientation of the individual half sites are variable (17–21). TRE1 contains only a single mutation-sensitive TR half site. In vitro protein–DNA-binding assays in the present studies revealed that both of the TREs bind to TR/RXR heterodimers and that no significant binding of monomers or homodimers of TR or RXR was detected (Figure 3). An extensive mutagenesis analysis failed to identify a crucial second half site for TRE1 (data not shown).
TRE2 is a DR0, which is similar to the TRE of the PEPCK gene (18). Although TR can bind to a TRE as a monomer, homodimer or heterodimer with RXR (15,42–46), it is likely that the TR/RXR heterodimer binds to the TRE preferentially to mediate the T3 response in vivo, because both TR and RXR were required for efficient binding and both are required for activation as shown in other studies (18,21). The two TREs identified here in the rat CYP7A1 are conserved in the mouse promoter even though the surrounding sequence is not highly conserved, suggesting there is evolutionary pressure to preserve the T3 response in these two species. In an earlier study, expression of the human CYP7A1 gene was not stimulated by T3 (47). Additionally, the distal TR-binding sites we identified in the murine promoters do not seem to be conserved in the human gene, and expression of a human transgene containing a large segment of 5′ flanking DNA in mice was not regulated by T3 (48). Moreover, the proximal DR-4 (LXRE) is not conserved in the human gene (49) and, if anything, the human CYP7A1 may in fact be negatively regulated by T3 (50) through a putative negative TRE at −227 to −247 in the promoter. Taken together, the data suggest there are significant species differences in the effects of thyroid hormone on lipid metabolism that need to be explored further.
ACKNOWLEDGEMENTS
We thank Dr Roy Weiss (U. Chicago) for his assistance in the early stages of this project, Drs Bruce Blumberg, Gregorio Gil and Peter Tontonoz for plasmid constructs and Timothy Willson (GlaxoSmithKline) for GW3695. We also appreciate the technical assistance of Saro Saroyan at UCI and Nadine Aguilera for animal handling at the Plateau de Biologie Experimentale de la Souris of ENS-Lyon. This work was supported by National Institutes of Health Grant DK71021 (T.F.O.). D.J.S. was supported by a postdoctoral fellowship from the American Heart Association.
Funding to pay the Open Access publication charges for this article was provided by NIH DK71021.
Conflict of interest statement. None declared.
REFERENCES
1. Chiang J.Y. (2002) Bile acid regulation of gene expression: roles of nuclear hormone receptors. Endocr. Rev., 23, 443–463.
2. Peet D.J., Turley S.D., Ma W., Janowski B.A., Lobaccaro J.M., Hammer R.E., Mangelsdorf D.J. (1998) Cholesterol and bile acid metabolism are impaired in mice lacking the nuclear oxysterol receptor LXR alpha. Cell, 93, 693–704.
3. Lehmann J.M., Kliewer S.A., Moore L.B., Smith-Oliver T.A., Oliver B.B., Su J.L., Sundseth S.S., Winegar D.A., Blanchard D.E., Spencer T.A., et al. (1997) Activation of the nuclear receptor LXR by oxysterols defines a new hormone response pathway. J. Biol. Chem., 272, 3137–3140.
4. Janowski B.A., Willy P.J., Devi T.R., Falck J.R., Mangelsdorf D.J. (1996) An oxysterol signalling pathway mediated by the nuclear receptor LXR alpha. Nature, 383, 728–731.
5. Lu T.T., Makishima M., Repa J.J., Schoonjans K., Kerr T.A., Auwerx J., Mangelsdorf D.J. (2000) Molecular basis for feedback regulation of bile acid synthesis by nuclear receptors. Mol. Cell, 6, 507–515.
6. Goodwin B., Jones S.A., Price R.R., Watson M.A., McKee D.D., Moore L.B., Galardi C., Wilson J.G., Lewis M.C., Roth M.E., et al. (2000) A regulatory cascade of the nuclear receptors FXR, SHP-1, and LRH-1 represses bile acid biosynthesis. Mol. Cell, 6, 517–526.
7. De Fabiani E., Mitro N., Anzulovich A.C., Pinelli A., Galli G., Crestani M. (2001) The negative effects of bile acids and tumor necrosis factor-alpha on the transcription of cholesterol 7alpha-hydroxylase gene (CYP7A1) converge to hepatic nuclear factor-4: a novel mechanism of feedback regulation of bile acid synthesis mediated by nuclear receptors. J. Biol. Chem., 276, 30708–30716.
8. Shin D.J., Campos J.A., Gil G., Osborne T.F. (2003) PGC-1alpha activates CYP7A1 and bile acid biosynthesis. J. Biol. Chem., 278, 50047–50052.
9. Ness G.C. and Lopez D. (1995) Transcriptional regulation of rat hepatic low-density lipoprotein receptor and cholesterol 7 alpha hydroxylase by thyroid hormone. Arch. Biochem. Biophys., 323, 404–408.
10. Ness G.C., Pendelton L.C., Zhao Z. (1994) Thyroid hormone rapidly increases cholesterol 7 alpha-hydroxylase mRNA levels in hypophysectomized rats. Biochim. Biophys. Acta, 1214, 229–233.
11. Ness G.C., Pendleton L.C., Li Y.C., Chiang J.Y. (1990) Effect of thyroid hormone on hepatic cholesterol 7 alpha hydroxylase, LDL receptor, HMG-CoA reductase, farnesyl pyrophosphate synthetase and apolipoprotein A-I mRNA levels in hypophysectomized rats. Biochem. Biophys. Res. Commun., 172, 1150–1156.
12. Yen P.M. (2001) Physiological and molecular basis of thyroid hormone action. Physiol. Rev., 81, 1097–1142.
13. Lazar M.A. (1993) Thyroid hormone receptors: multiple forms, multiple possibilities. Endocr. Rev., 14, 184–193.
14. Hodin R.A., Lazar M.A., Chin W.W. (1990) Differential and tissue-specific regulation of the multiple rat c-erbA messenger RNA species by thyroid hormone. J. Clin. Invest., 85, 101–105.
15. Lazar M.A., Berrodin T.J., Harding H.P. (1991) Differential DNA binding by monomeric, homodimeric, and potentially heteromeric forms of the thyroid hormone receptor. Mol. Cell. Biol., 11, 5005–5015.
16. Umesono K., Murakami K.K., Thompson C.C., Evans R.M. (1991) Direct repeats as selective response elements for the thyroid hormone, retinoic acid, and vitamin D3 receptors. Cell, 65, 1255–1266.
17. Brent G.A., Harney J.W., Chen Y., Warne R.L., Moore D.D., Larsen P.R. (1989) Mutations of the rat growth hormone promoter which increase and decrease response to thyroid hormone define a consensus thyroid hormone response element. Mol. Endocrinol., 3, 1996–2004.
18. Park E.A., Jerden D.C., Bahouth S.W. (1995) Regulation of phosphoenolpyruvate carboxykinase gene transcription by thyroid hormone involves two distinct binding sites in the promoter. Biochem. J., 309, 913–919.
19. Norman M.F., Lavin T.N., Baxter J.D., West B.L. (1989) The rat growth hormone gene contains multiple thyroid response elements. J. Biol. Chem., 264, 12063–12073.
20. Desvergne B., Petty K.J., Nikodem V.M. (1991) Functional characterization and receptor binding studies of the malic enzyme thyroid hormone response element. J. Biol. Chem., 266, 1008–1013.
21. Liu H.C. and Towle H.C. (1994) Functional synergism between multiple thyroid hormone response elements regulates hepatic expression of the rat S14 gene. Mol. Endocrinol., 8, 1021–1037.
22. Mason R.L., Hunt H.M., Hurxthal L. (1930) Blood cholesterol values in hyperthyroidism and hypothyroidism—their significance. N. Engl. J. Med., 203, 1273–1278.
23. Underwood A.H., Emmett J.C., Ellis D., Flynn S.B., Leeson P.D., Benson G.M., Novelli R., Pearce N.J., Shah V.P. (1986) A thyromimetic that decreases plasma cholesterol levels without increasing cardiac activity. Nature, 324, 425–429.
24. Shin D.J. and Osborne T.F. (2003) Thyroid hormone regulation and cholesterol metabolism are connected through sterol regulatory element-binding protein-2 (SREBP-2). J. Biol. Chem., 278, 34114–34118.
25. Gullberg H., Rudling M., Forrest D., Angelin B., Vennstrom B. (2000) Thyroid hormone receptor beta-deficient mice show complete loss of the normal cholesterol 7alpha-hydroxylase (CYP7A) response to thyroid hormone but display enhanced resistance to dietary cholesterol. Mol. Endocrinol., 14, 1739–1749.
26. Hashimoto K., Cohen R.N., Yamada M., Markan K.R., Monden T., Satoh T., Mori M., Wondisford F.E. (2006) Cross-talk between thyroid hormone receptor and liver X receptor regulatory pathways is revealed in a thyroid hormone resistance mouse model. J. Biol. Chem., 281, 295–302.
27. Gauthier K., Plateroti M., Harvey C.B., Williams G.R., Weiss R.E., Refetoff S., Willott J.F., Sundin V., Roux J.P., Malaval L., et al. (2001) Genetic analysis reveals different functions for the products of the thyroid hormone receptor alpha locus. Mol. Cell. Biol., 21, 4748–4760.
28. Gauthier K., Chassande O., Plateroti M., Roux J.P., Legrand C., Pain B., Rousset B., Weiss R., Trouillas J., Samarut J. (1999) Different functions for the thyroid hormone receptors TRalpha and TRbeta in the control of thyroid hormone production and post-natal development. EMBO J., 18, 623–631.
29. Berry M.J., Kates A.L., Larsen P.R. (1990) Thyroid hormone regulates type I deiodinase messenger RNA in rat liver. Mol. Endocrinol., 4, 743–748.
30. Weiss R.E., Murata Y., Cua K., Hayashi Y., Seo H., Refetoff S. (1998) Thyroid hormone action on liver, heart, and energy expenditure in thyroid hormone receptor beta-deficient mice. Endocrinology, 139, 4945–4952.
31. Wang H., Chen J., Hollister K., Sowers L.C., Forman B.M. (1999) Endogenous bile acids are ligands for the nuclear receptor FXR/BAR. Mol. Cell, 3, 543–553.
32. Perlmann T., Rangarajan P.N., Umesono K., Evans R.M. (1993) Determinants for selective RAR and TR recognition of direct repeat HREs. Genes Dev., 7, 1411–1422.
33. Apfel R., Benbrook D., Lernhardt E., Ortiz M.A., Salbert G., Pfahl M. (1994) A novel orphan receptor specific for a subset of thyroid hormone-responsive elements and its interaction with the retinoid/thyroid hormone receptor subfamily. Mol. Cell. Biol., 14, 7025–7035.
34. Chang E. and Perlman A.J. (1987) Multiple hormones regulate angiotensinogen messenger ribonucleic acid levels in a rat hepatoma cell line. Endocrinology, 121, 513–519.
35. Spindler B.J., MacLeod K.M., Ring J., Baxter J.D. (1975) Thyroid hormone receptors. Binding characteristics and lack of hormonal dependency for nuclear localization. J. Biol. Chem., 250, 4113–4119.
36. Liu Y., Xia X., Fondell J.D., Yen P.M. (2006) Thyroid hormone-regulated target genes have distinct patterns of coactivator recruitment and histone acetylation. Mol. Endocrinol., 20, 483–490.
37. Farsetti A., Mitsuhashi T., Desvergne B., Robbins J., Nikodem V.M. (1991) Molecular basis of thyroid hormone regulation of myelin basic protein gene expression in rodent brain. J. Biol. Chem., 266, 23226–23232.
38. Izumo S. and Mahdavi V. (1988) Thyroid hormone receptor alpha isoforms generated by alternative splicing differentially activate myosin HC gene transcription. Nature, 334, 539–542.
39. Hodnett D.W., Fantozzi D.A., Thurmond D.C., Klautky S.A., MacPhee K.G., Estrem S.T., Xu G., Goodridge A.G. (1996) The chicken malic enzyme gene: structural organization and identification of triiodothyronine response elements in the 5′-flanking DNA. Arch. Biochem. Biophys., 334, 309–324.
40. Zilz N.D., Murray M.B., Towle H.C. (1990) Identification of multiple thyroid hormone response elements located far upstream from the rat S14 promoter. J. Biol. Chem., 265, 8136–8143.
41. Gullberg H., Rudling M., Salto C., Forrest D., Angelin B., Vennstrom B. (2002) Requirement for thyroid hormone receptor beta in T3 regulation of cholesterol metabolism in mice. Mol. Endocrinol., 16, 1767–1777.
42. Forman B.M., Casanova J., Raaka B.M., Ghysdael J., Samuels H.H. (1992) Half-site spacing and orientation determines whether thyroid hormone and retinoic acid receptors and related factors bind to DNA response elements as monomers, homodimers, or heterodimers. Mol. Endocrinol., 6, 429–442.
43. Kliewer S.A., Umesono K., Mangelsdorf D.J., Evans R.M. (1992) Retinoid X receptor interacts with nuclear receptors in retinoic acid, thyroid hormone and vitamin D3 signalling. Nature, 355, 446–449.
44. Leid M., Kastner P., Lyons R., Nakshatri H., Saunders M., Zacharewski T., Chen J.Y., Staub A., Garnier J.M., Mader S., et al. (1992) Purification, cloning, and RXR identity of the HeLa cell factor with which RAR or TR heterodimerizes to bind target sequences efficiently. Cell, 68, 377–395.
45. Wahlstrom G.M., Sjoberg M., Andersson M., Nordstrom K., Vennstrom B. (1992) Binding characteristics of the thyroid hormone receptor homo- and heterodimers to consensus AGGTCA repeat motifs. Mol. Endocrinol., 6, 1013–1022.
46. Zhang X.K., Hoffmann B., Tran P.B., Graupner G., Pfahl M. (1992) Retinoid X receptor is an auxiliary protein for thyroid hormone and retinoic acid receptors. Nature, 355, 441–446.
47. Wang D.P., Stroup D., Marrapodi M., Crestani M., Galli G., Chiang J.Y. (1996) Transcriptional regulation of the human cholesterol 7 alpha-hydroxylase gene (CYP7A) in HepG2 cells. J. Lipid Res., 37, 1831–1841.
48. Drover V.A. and Agellon L.B. (2004) Regulation of the human cholesterol 7alpha-hydroxylase gene (CYP7A1) by thyroid hormone in transgenic mice. Endocrinology, 145, 574–581.
49. Chen J., Cooper A.D., Levy-Wilson B. (1999) Hepatocyte nuclear factor 1 binds to and transactivates the human but not the rat CYP7A1 promoter. Biochem. Biophys. Res. Commun., 260, 829–834.
50. Drover V.A., Wong N.C., Agellon L.B. (2002) A distinct thyroid hormone response element mediates repression of the human cholesterol 7alpha-hydroxylase (CYP7A1) gene promoter. Mol. Endocrinol., 16, 14–23.
© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Identification of the REST regulon reveals extensive transposable element-mediated binding site duplication
Johnson, Rory; Gamblin, Richard J.; Ooi, Lezanne; Bruce, Alexander W.; Donaldson, Ian J.; Westhead, David R.; Wood, Ian C.; Jackson, Richard M.; Buckley, Noel J.
doi: 10.1093/nar/gkl525 pmid: 16899447
ABSTRACT
The genome-wide mapping of gene-regulatory motifs remains a major goal that will facilitate the modelling of gene-regulatory networks and their evolution. The repressor element 1 (RE1) is a long, conserved transcription factor-binding site which recruits the transcriptional repressor REST to numerous neuron-specific target genes. REST plays important roles in multiple biological processes and disease states. To map RE1 sites and target genes, we created a position-specific scoring matrix representing the RE1 and used it to search the human and mouse genomes. We identified 1301 and 997 RE1s in the human and mouse genomes, respectively, of which >40% are novel. By employing an ontological analysis we show that REST target genes are significantly enriched in a number of functional classes. Taking the novel REST target gene CACNA1A as an experimental model, we show that it can be regulated by multiple RE1s of different binding affinities, which are only partially conserved between human and mouse. A novel BLAST methodology indicated that many RE1s belong to closely related families. Most of these sequences are associated with transposable elements, leading us to propose that transposon-mediated duplication and insertion of RE1s has led to the acquisition of novel target genes by REST during evolution.
INTRODUCTION
It is clear that much of what was once termed ‘junk’ DNA represents highly evolved, functional sequence containing, amongst other things, numerous transcriptional regulatory motifs. A major challenge of post-genome biology is to identify these motifs and the genes they regulate, and to incorporate these results into accurate models of the transcriptional regulatory networks underlying processes such as development and disease. In addition to helping us understand gene regulation within a species, identification of transcription factor-binding sites (TFBSs) is an important tool in understanding phenotypic differences between species.
Although the majority of regulatory sequences are phylogenetically conserved (1), a significant fraction of TFBSs are not conserved among closely related species, indicating that regulatory DNA can experience high rates of evolutionary change (2–6). This finding provides important support for the notion that changes in gene-regulatory networks, brought about by mutations to TFBSs, are an important contributor to phenotypic evolution (7). Little is known about the mechanisms responsible for such TFBS ‘turnover’. Computational and theoretical studies have found that short binding sites can appear rapidly through random DNA mutation (8,9); however, this rate of generation falls exponentially with motif length, suggesting that additional mechanisms may be responsible for the generation of longer binding sites. Therefore, in addition to providing insights into the role of transcription factors in gene regulation, whole-genome maps of TFBSs in multiple species may yield important insight into how gene-regulatory networks evolve. The past decade has seen rapid advances in the complexity of motif-identification techniques: consensus sequences have given way to position-specific scoring matrices (PSSMs) (10), Bayesian-based interdependency models (11) and non-parametric approaches (12) amongst others. Such techniques must confront a serious hurdle to TFBS discovery, namely the short, degenerate nature of most binding motifs. Although this property no doubt confers evolutionary advantages (13), in terms of the high rates of site generation by random sequence mutation, it presents grave challenges to the confident identification of bona fide binding sites against a background of similar, non-functional motifs. A number of approaches have been developed to improve motif identification, most recently and successfully phylogenetic footprinting (14,15) which has been made possible by the sequencing of increasing numbers of genomes. 
This approach weights putative regulatory DNA sequences by their degree of conservation amongst related species, with the assumption that evolutionary conservation is indicative of functional importance. However, it is likely that those regulatory motifs which are not strongly conserved between species are those that are responsible for their phenotypic differences. In light of these considerations, a contemporary project to make a comparative map of TFBSs and the genes they regulate across multiple species might avoid such challenges by selecting a regulatory motif which is long and highly conserved enough to be unambiguously identified using current, non-phylogenetic methodologies. REST (also known as neuron-restrictive silencing factor) is an essential vertebrate transcriptional repressor (16) involved in multiple biological processes and disease states (17–20). REST is the sole transcription factor to bind the highly conserved repressor element 1 (RE1, also known as neuron-restrictive silencing element, NRSE), to which it recruits various histone-modifying and chromatin-remodelling complexes (21–24). At 21 bp long, the RE1 represents an ideal model for bioinformatic regulatory element prediction, and in the past consensus sequence models have successfully predicted novel REST target genes de novo in a number of vertebrate species (25,26), most recently when Bruce et al. (26) published the first exhaustive TFBS search of the newly published human genome. In comparison to well-constructed probabilistic models of TFBSs, consensus sequences suffer from relatively high false negative and false positive rates (27). For example, the RE1 consensus falsely identifies large numbers of human endogenous retrovirus (hERV) Class I sequences as RE1s [(26) and R. Johnson, unpublished data].
Consensus sequences give no indication of a sequence's degree of similarity to the RE1 motif; consequently, the user has no control of the stringency with which searches are carried out, and no indication of a test sequence's likely affinity for REST. For similar reasons, the inability to identify less well-conserved sites rules out more exhaustive searches for weak affinity sites. REST's target genes include many necessary for terminally differentiated neuronal function, such as synapse formation (SYN1) (28), neurotransmitter secretion (SNAP25) (26) and signalling (CHRM4) (29). Calcium signalling is also an essential process in neurons where it mediates transduction of electrical signals into cellular responses (30), and regulation of the voltage-gated calcium channel subunit gene CACNA1H by REST is necessary for normal heart function in mouse (19). Voltage-gated calcium channels are composed of multiple subunits. The α1 subunits, encoded by the CACNA1 family of genes, are responsible for pore formation and this subunit defines the pharmacological properties of the channel (31). The Cav2.1 subunit confers a P/Q type calcium current, initiating rapid synaptic transmission and the secretion of neurotransmitters and neuropeptides. The CACNA1A gene encoding Cav2.1 is highly expressed in the Purkinje cells of the cerebellum (32) and mutations in CACNA1A are responsible for a number of cerebellar disorders including migraine (33), epilepsy (34) and ataxias (35) but little is known about the transcriptional regulation of this gene. The aim of this study was to accurately map the RE1s of human and mouse, as well as their target genes. To this effect, we develop and characterize an RE1 PSSM which is capable of predicting functional RE1s in genomic sequence with high sensitivity and selectivity. Using the RE1s identified in this way, we present a thorough analysis of the genomic distribution of this regulatory motif and its target genes. 
We also apply the RE1 PSSM to the exhaustive analysis of the REST-regulatory apparatus of a model gene, the novel target CACNA1A. We find that CACNA1A is regulated by a combination of binding sites of various affinities and degrees of phylogenetic conservation. Finally, we identify widespread duplication of functional RE1s, principally located within or beside transposable elements (TEs), which leads us to propose that transposon-mediated duplication has been an important mechanism of evolutionary expansion in the REST regulon.
MATERIALS AND METHODS
RE1 identification and database construction
A PSSM was created by combining 93 experimentally verified REST-binding sequences from the primary literature and personal communications, the ‘positive training set’ (Figure 1A). Using the C program SeqScan (R. J. Gamblin and R. M. Jackson, manuscript in preparation), based on the scoring function of Stormo and Hartzell III (36,37), this PSSM was used to score both the positive training set and 57 RE1-like sequences known not to bind REST in electrophoretic mobility shift assay (EMSA), the ‘negative training set’. Query scores fall between 0 (no similarity to PSSM) and 1 (full identity). True and false positive rates at incremental cutoff scores were plotted in a receiver operating characteristic (ROC) curve, which indicated that a cutoff of 0.91 produced an optimal sensitivity and selectivity for this set (Figure 1C). Unmasked NCBI builds 35 and 34 of the human and mouse genomes, respectively, were downloaded from Ensembl and searched for RE1s using SeqScan. All 21mers scoring >0.83 were saved to an updated version of the relational database described in Ref. (26) along with Swiss-Prot and TrEMBL annotations of nearest genes (accessible at http://bioinformatics.leeds.ac.uk/RE1db_mkII/).
Figure 1. (A) The RE1 PSSM. Ninety-three sequences known to bind REST were combined to form a PSSM, here shown below a Weblogo recording the sequence conservation at each nucleotide position of this set (43). The height of each letter is proportional to its information content. (B) The hERV Class I PSSM. Thirty-six hERV Class I sequences containing conserved RE1-like sequences were used to create a PSSM. (C) Measuring the performance of the RE1 PSSM using a ROC curve. A ROC curve was generated by scoring the positive and negative training sets with the RE1 PSSM. The true positive (number of true sites to score above cutoff/total number of true sites) and false positive (number of non-binding sites to score above cutoff/total number of non-binding sites) rates were plotted at incremental cutoff scores (circles). The optimal combination of 88.2% true positive and 17.5% false positive rates was achieved at a cutoff of 0.91 (arrow). A similar analysis was carried out using the RE1 consensus sequence (square).
hERV Class I motifs that fell within the RE1 consensus were collected by submitting a representative set of sequences from the consensus RE1 database (http://www.bioinformatics.leeds.ac.uk/group/online/RE1db/re1db_home.htm) to the motif-finding tool MEME (http://bioweb.pasteur.fr/seqanal/motif/meme/). Approximately 40% were found to have highly similar flanking regions, identified as hERV Class I elements by RepeatMasker (http://www.repeatmasker.org/). Inspection of their core motif showed that they corresponded to the common non-binding RE1 motif identified in Ref. (26). A PSSM was constructed using the RE1-like motif of 36 of these examples (Figure 1B). By scoring the contents of the RE1 database it was clear that all hERV sequences had hERV PSSM scores >0.9, while genuine RE1s lay below this score. All RE1s identified in the whole-genome PSSM scan were also scored using the hERV PSSM. To assess the number of RE1 motifs expected by chance alone, six control PSSMs were constructed by shuffling the RE1 PSSM, and used to re-search the genome. The randomization process was constrained such that shuffled PSSM motifs had a similar propensity for CG dinucleotides as the RE1, in light of the under-representation of this pair in genomic DNA.
Adenoviral infection
Recombinant adenoviruses expressing transgenes for the REST DNA-binding domain (Ad DN:REST), full-length REST (Ad REST) or no transgene (Ad) were amplified in HEK293 cells and purified by centrifugation in a CsCl gradient as described in Ref. (38). Viruses also contained a GFP transgene. Virus was added to cell media such that after 48 h >90% of cells displayed green fluorescence, at which point cells were harvested.
Electrophoretic mobility shift assay
Nuclear protein extract from rat fibroblast JTC-19 cells was prepared as described previously (39).
Protein was mixed with unlabelled competitor 28 bp oligonucleotides and 32P-labelled rat SCN2A2 promoter double-stranded DNA (dsDNA) sequences containing a strong RE1 sequence (40), then run on a 4% non-denaturing polyacrylamide gel. Unlabelled rat CHRM4 RE1 was used as positive control, while a non-binding, RE1-like sequence was used as non-specific dsDNA control. Competitor DNA was added in molar ratios of 100:1, 10:1 and 1:1 to labelled probe. Anti-REST (P18; Santa Cruz) antibodies were used to super-shift REST. A complete list of oligonucleotides used can be found in Supplementary Data. RT–PCR RT–PCR was carried out on RNA harvested from human HeLa cells using the protocol described in Ref. (26). cDNA samples were interrogated by quantitative PCR using the real-time iQ system (Bio-Rad), with primers to the coding regions of CACNA1A and cyclophilin genes. The specificity of PCR was verified by melt curve analysis of products obtained from cDNA, as well as controls in which the reverse transcriptase was omitted. Expression changes were inferred using the ΔΔCt method (41). The sequences of all primer sets used in this study are available in Supplementary Data. Chromatin immunoprecipitation (ChIP) ChIP assays were carried out essentially as described in Ref. (42) on chromatin from HeLa cells. IPs were performed using 10 μg of anti-REST (P18; Santa Cruz) or the same amount of non-specific goat IgG. ChIP DNA was interrogated by real-time quantitative PCR, using primers designed adjacent to RE1 sites. Starting concentrations of ChIP DNA were calculated with reference to a standard curve. Multispecies alignment Whole-genome multispecies alignments were obtained from the UCSC Genome Browser (http://genome.ucsc.edu/). Pairwise alignments were carried out using ClustalW provided by the EBI (www.ebi.ac.uk) at default settings. For each case where aligned sequence could not be found in the other species, the result was verified by BLAST search. 
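As an illustration of the ΔΔCt calculation referred to under RT–PCR above, the following minimal Python sketch computes the relative expression of a target gene normalised to a reference gene. It assumes the standard form of the method with perfect amplification efficiency (product doubling every cycle). The function name and the Ct values are hypothetical and are not data from this study.

```python
def ddct_fold_change(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Relative expression (treated vs control) by the delta-delta-Ct method.

    Assumes 100% PCR efficiency, i.e. the product doubles every cycle.
    """
    # Normalise the target gene to the reference gene within each sample.
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    # Compare the normalised values between samples.
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values: the target crosses threshold 6 cycles earlier
# in the treated sample, with an unchanged reference gene.
fold = ddct_fold_change(24.0, 18.0, 30.0, 18.0)
print(fold)  # 64.0
```

Under the same assumption, each cycle of difference corresponds to a twofold change, so a fold change of F corresponds to a ΔΔCt of −log2(F).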
RESULTS Identification of RE1s and target genes in human and mouse genomes The RE1 is unusually long (21 bp) and highly conserved compared to a typical TFBS of 5–8 bp. We sought to improve on previous bioinformatic studies of the REST regulon by employing a more sensitive probabilistic PSSM model to identify RE1 sites, and by applying it to whole-genome mapping of REST's binding sites. PSSMs are representations of sequence motifs, constructed using a set of known sequences, the ‘training set’, which can be used in conjunction with an appropriate scoring algorithm to identify other similar sequences (10). Query sequences are scored by the product of the similarity of each individual nucleotide to that in the corresponding position of the PSSM. The contribution of each nucleotide to the final score is weighted by its degree of conservation in the training set. For these reasons, PSSMs are both more sensitive and more selective than a comparable consensus sequence at identifying DNA motifs. In order to construct an RE1 PSSM, we compiled 93 RE1 sequences which have been shown to bind REST either in vitro (by EMSA) or in vivo (by ChIP). These sequences were combined to yield a probability value for each nucleotide at each position in the RE1 motif, resulting in a 21 bp motif with 63% GC content [represented as a Weblogo (43) in Figure 1A]. The composition of this RE1 differs in several important respects from that used in previous consensus sequences, having stronger constraint in positions 1, 3 and 7, as well as including additional conserved positions at the 3′ end (25,26). PSSMs assign all query sequences a score, regardless of whether they are bona fide binding sites or not; therefore it is necessary to empirically determine an optimal cutoff score for each PSSM, to include the maximum number of true binding sites while excluding spurious sequences. Using the program SeqScan (R. J. Gamblin and R. M. Jackson, manuscript in preparation), based on the method of Stormo et al. 
(36,37), the RE1 PSSM was used to score each sequence of the positive training set. A negative training set, composed of sequences that were shown to be incapable of binding REST in EMSA was scored in a similar way. We created a ROC by counting the numbers of true positives (TP) and false positives (FP) identified in both training sets at incremental cutoff scores; optimal sensitivity and selectivity occurred at a score of 0.91, which we subsequently defined as the RE1 PSSM cutoff score (Figure 1C). Any sequence having a score >0.91 is henceforth defined as an RE1, while sequences scoring below this value will be designated ‘below-cutoff’ RE1. Nevertheless, a population of bona fide RE1s exists in the training set below the cutoff (including functional RE1s in the human NPPA and rat SYN1 genes). The sequences of both training sets were similarly examined for identity with the RE1 consensus sequence, and by plotting the resulting TP/FP values on a ROC curve we confirmed that the RE1 PSSM is both more sensitive and selective at identifying RE1s from this training set (Figure 1C). We next used the RE1 PSSM to identify potential REST-binding sites in genomic sequence of human and mouse. The full, unmasked genome sequence of both species was scanned using the RE1 PSSM with SeqScan. This search identified 1301 RE1s above cutoff in the human genome, including 551 which do not conform to the RE1 consensus and therefore principally constitute novel REST-binding sites (Figure 2A). Overall, we identified 995 REST target genes (defined as the closest Ensembl-annotated gene within 100 kb of an RE1), including 418 which were not identified in previous searches and hence represent novel REST target genes (a full list of novel target genes and disease-associated target genes identified in this study can be found in Supplementary Data). A similar search of the mouse genome identified 1039 RE1s and 822 target genes. 
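The scoring, scanning and cutoff evaluation described above can be sketched in Python as follows. This is not the SeqScan implementation, whose scoring function is not reproduced in the text. It is a minimal stand-in that assumes a log-odds matrix with pseudocounts, a uniform background, and a summed score rescaled linearly so that 0 is the worst possible match and 1 the best. All function names and the toy 4 bp sequences are hypothetical, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix (one dict per position) from aligned sites."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pssm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return pssm


def normalised_score(pssm, window):
    """Rescale the summed log-odds to 0 (worst possible) .. 1 (best possible)."""
    raw = sum(column[base] for column, base in zip(pssm, window))
    worst = sum(min(column.values()) for column in pssm)
    best = sum(max(column.values()) for column in pssm)
    return (raw - worst) / (best - worst)


def scan(pssm, sequence, cutoff):
    """Return (start, window, score) for forward-strand windows above cutoff."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if set(window) <= set(BASES):  # skip windows with ambiguous bases (N)
            score = normalised_score(pssm, window)
            if score > cutoff:
                hits.append((start, window, score))
    return hits


def rates_at_cutoff(pssm, positives, negatives, cutoff):
    """True and false positive rates at one cutoff (one point on a ROC curve)."""
    tpr = sum(normalised_score(pssm, s) > cutoff for s in positives) / len(positives)
    fpr = sum(normalised_score(pssm, s) > cutoff for s in negatives) / len(negatives)
    return tpr, fpr


# Toy 4 bp training sets (hypothetical; the RE1 PSSM in the text is 21 bp).
toy_positives = ["ACGT", "ACGT", "ACGA", "ACGT"]
toy_negatives = ["TACG", "CGTT"]
toy_pssm = build_pssm(toy_positives)
hits = scan(toy_pssm, "TTACGTTTNACGA", cutoff=0.91)
tpr, fpr = rates_at_cutoff(toy_pssm, toy_positives, toy_negatives, cutoff=0.91)
```

In this toy case the window ACGT at offset 2 is the only hit, while the training sequence ACGA scores just under the cutoff, mirroring the observation that some bona fide sites in a training set can fall below the chosen threshold. Repeating `rates_at_cutoff` over a range of cutoffs gives the points of a ROC curve.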
The majority of consensus RE1 sequences failed to meet the RE1 PSSM cutoff, including 99.5% of contaminating hERV Class I RE1-like sequences (Figure 1B), which do not bind REST. The results of both searches, including the score, location and sequence of all RE1s and their target genes, are available in a searchable online database RE1db Mk II (http://bioinformatics.leeds.ac.uk/RE1db_mkII).
Figure 2 (A) Identification of many novel RE1s by RE1 PSSM. The Venn diagram shows the number of RE1s identified in the human genome (Ensembl Build 35) by the RE1 PSSM, using the empirically determined cutoff score of 0.91, as well as those found by the RE1 consensus sequence. Non-functional, RE1-like hERV Class I sequences identified by each technique are also shown. (B) RE1s identified by PSSM in genomic DNA can bind REST. Randomly selected consensus RE1s with PSSM scores below cutoff, and non-consensus RE1s with PSSM scores above cutoff, were tested for the ability to compete REST off a radiolabelled RE1 in EMSA. Unlabelled test sequences were tested at 100:1 excess over probe. Single asterisk indicates the REST-bound probe and double asterisks indicate unbound probe. Controls: 1, no competitor; 2, CHRM4 RE1 site; 3, non-specific oligonucleotide; 4, Anti-REST antibody (P18; Santa Cruz). (C) The RE1 PSSM identifies more sequences in genomic DNA than expected by chance. The number of RE1s in the human genome was plotted at 0.01 score increments (circles). The same sequence was also scanned with six shuffled matrices, for which the mean count in each score bin is also shown (squares). Error bars represent standard deviation, and the PSSM cutoff score is indicated by a dotted line. (D) Excess of RE1s over shuffled sequences in the human genome. The data from (C) were re-plotted to emphasize the excess number of sequences discovered by the RE1 PSSM over shuffled PSSMs. 
We next confirmed that the RE1 PSSM could identify functional RE1s in genomic sequence, and that such prediction was more effective than the RE1 consensus sequence. From human Chromosome 1 we randomly selected five non-consensus, PSSM-predicted RE1s, and five consensus RE1s with PSSM scores below cutoff, and tested their ability to interact with REST in vitro by EMSA (Figure 2B). 
Nuclear protein extracts containing REST protein were incubated with a radiolabelled promoter fragment containing the rat SCN2A2 RE1 (40), as well as unlabelled competitor oligonucleotides representing test sequences. We found that no oligonucleotides representing consensus RE1s with below-cutoff PSSM scores were capable of fully competing REST from the radiolabelled probe at a ratio of 100:1, while one competed partially. Conversely, sequences representing PSSM-predicted, non-consensus RE1s completely competed in three cases, with one competing partially. We concluded that the RE1 PSSM is capable of correctly identifying functional REST-binding sites in genomic DNA in the majority of cases, and that it has greater predictive power than previous techniques. In order to identify the maximum number of potential RE1s, the whole-genome RE1 scans were carried out with minimum score cutoffs of 0.88, yielding ∼4000 hits in each species. The number of sequences identified for each PSSM score range in the human genome is shown in Figure 2C. To gauge the background probability of finding sequences with conservation and length similar to those of the RE1, the RE1 PSSM was repeatedly shuffled and used to scan the same genomic sequence. The shuffled matrices identified on average 58 sequences with scores >0.91 in the human genome, indicating that the 1301 sequences with scores >0.91 identified by the RE1 PSSM are highly significant. Interestingly, the RE1 PSSM also identifies significantly more sequences than the shuffled PSSMs at the lower score range 0.88–0.91. We re-plotted these data to emphasize the ratio of discovered RE1s to random sequences (Figure 2D); this clearly shows that there is an excess of RE1s over shuffled sequences at all measured score ranges, and that this excess increases steeply with score until 0.93, above which the background count is negligible. 
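The shuffled-matrix control can be sketched as below. This is a simplified illustration only: it permutes the positions of a matrix at random and does not implement the constraint on CG dinucleotide propensity described in the Methods. The function names and the toy matrix are hypothetical; the two counts passed to `fold_excess` are the ones quoted above for scores >0.91 in the human genome.

```python
import random


def shuffle_positions(pssm, rng):
    """Return a copy of the PSSM with its columns (positions) in random order."""
    shuffled = list(pssm)
    rng.shuffle(shuffled)
    return shuffled


def fold_excess(real_count, mean_shuffled_count):
    """How many times more hits the real PSSM finds than the shuffled controls."""
    return real_count / mean_shuffled_count


# Hypothetical 4-position matrix; each tuple holds the weights for A, C, G, T.
toy_pssm = [(0.9, 0.0, 0.1, 0.0), (0.0, 1.0, 0.0, 0.0),
            (0.1, 0.1, 0.8, 0.0), (0.2, 0.0, 0.0, 0.8)]
control = shuffle_positions(toy_pssm, random.Random(0))

# 1301 RE1s found by the RE1 PSSM versus a mean of 58 for shuffled matrices.
excess = fold_excess(1301, 58)  # roughly 22-fold
```

Shuffling whole columns preserves the per-position information content of the matrix, so the control motifs have the same length and degree of conservation as the original while representing a different sequence.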
We performed similar searches on the genome of Caenorhabditis elegans, which has no known REST homologue: for this genome, the number of sequences identified by the RE1 PSSM and shuffled PSSMs were similar and low (data not shown). Therefore, in addition to a well-defined population of above-cutoff RE1s, the human genome contains an excess of below-cutoff RE1-like sequences, the majority of which might be expected to be unable to recruit REST. REST is known to participate in diverse biological processes, including neuronal development (44,45), axonal pathfinding (46), heart development and function (19) and smooth muscle cell proliferation (42). This is reflected in the functions of REST's known target genes which predominantly encode ion channels, neurotransmitter transporters and receptors, SNARE proteins and transcription factors. We hypothesized that biological processes in which REST plays a role might be overrepresented in the ontology classifications of its target genes. To test this, we compared the prevalence of gene ontology (GO) terms associated with RE1-containing genes to that of all Ensembl genes. This analysis identified a large number of GO terms that were significantly enriched in REST target genes (P < 0.01, Fisher's exact test), of which a selection is shown in Table 1. For most cases, multiple related terms corresponding to the same underlying process were identified; e.g. the terms ‘synaptosome’, ‘synapse’, ‘synaptic vesicle’ and ‘synaptic transmission’ are all enriched in RE1-bearing genes, suggesting that REST regulates synaptic transmission at multiple levels. Similarly, enrichment was found for terms corresponding to various classes of ion channel (potassium, sodium, calcium channels) and neurotransmitter receptor (glutamate, GABA A, GABA B, glycine) involved in cell excitability and neurotransmission. In such cases, a single representative term is shown in Table 1. 
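The enrichment comparison described above can be illustrated with a short Python sketch using only the standard library. It assumes a one-sided Fisher's exact test, computed as the upper tail of the hypergeometric distribution; whether the original analysis was one-sided or two-sided is not stated, and the total number of Ensembl genes used is not given, so it is left as a parameter. The function name and the toy numbers are hypothetical.

```python
from math import comb


def go_enrichment(n_term, n_targets_with_term, n_targets, n_genes):
    """Expected count, fold enrichment and one-sided Fisher's exact P-value.

    n_term              genes annotated with the GO term
    n_targets_with_term target genes annotated with the GO term
    n_targets           all target genes
    n_genes             all genes in the background set
    """
    expected = n_term * n_targets / n_genes
    fold = n_targets_with_term / expected
    # P(X >= observed) when n_targets genes are drawn without replacement.
    upper = min(n_term, n_targets)
    tail = sum(
        comb(n_term, k) * comb(n_genes - n_term, n_targets - k)
        for k in range(n_targets_with_term, upper + 1)
    )
    p_value = tail / comb(n_genes, n_targets)
    return expected, fold, p_value


# Toy numbers: 20 genes, 5 targets, 4 genes carry the term, 3 of them are targets.
expected, fold, p_value = go_enrichment(4, 3, 5, 20)
```

For the toy input the expected count is 1.0, the fold enrichment is 3.0, and the P-value is 496/15504, about 0.032.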
Overall, the enriched ontology terms correspond to most of the specialized functions of differentiated neurons, such as synaptic transmission, ion conductance and transport, neurotransmitter secretion and reception, axonal guidance, cell structure and cell adhesion. Interestingly, the term ‘sugar metabolism’ is significantly enriched, which may reflect specialized programmes of metabolic gene expression in neurons, as has been observed in previous studies (47). A number of terms including ‘cell differentiation’, ‘central nervous system development’ and ‘neurogenesis’ support REST's documented role in neuronal development (44,45). Target genes were also enriched for multiple terms related to calcium signalling (‘calcium ion binding’, ‘calcium ion transport’, ‘calmodulin binding’, ‘voltage-gated calcium channel complex’, ‘voltage-gated calcium channel activity’), a process in which REST has been implicated previously in relation to regulation of normal heart function through the repression of the α1H voltage-gated calcium channel subunit gene, CACNA1H (19). REST's role in regulating gene expression in the vasculature is reflected in enrichment of the terms ‘regulation of blood vessel size’ and ‘regulation of blood pressure’. Not all ontology terms were enriched in the REST target set, however: a number of GO terms such as ‘protein biosynthesis’ and ‘ribosome’ were underrepresented amongst RE1-containing genes, indicating that the enrichment of particular GO terms described above is genuine. Table 1. 
Enrichment of GO terms in REST target genes

| GO code | GO term | Nᵃ | Obs.ᵇ | Exp.ᶜ | Foldᵈ | Pᵉ |
|---|---|---|---|---|---|---|
| GO:0007268 | Synaptic transmission | 187 | 32 | 5.4 | 5.9 | 6.1E−16 |
| GO:0005509 | Calcium ion binding | 1008 | 72 | 29.2 | 2.5 | 2.9E−12 |
| GO:0048699 | Neurogenesis | 275 | 29 | 8.0 | 3.6 | 2.7E−09 |
| GO:0006811 | Ion transport | 347 | 33 | 10.1 | 3.3 | 3.0E−09 |
| GO:0007155 | Cell adhesion | 455 | 28 | 13.2 | 2.1 | 1.8E−04 |
| GO:0005529 | Sugar binding | 194 | 15 | 5.6 | 2.7 | 5.8E−04 |
| GO:0050880 | Regulation of blood vessel size | 4 | 2 | 0.1 | 17.2 | 4.9E−03 |
| GO:0007611 | Learning and/or memory | 14 | 3 | 0.4 | 7.4 | 7.0E−03 |
| GO:0005200 | Structural constituent of cytoskeleton | 101 | 8 | 2.9 | 2.7 | 9.2E−03 |

ᵃ Number of Ensembl genes associated with this ontology term. ᵇ Number of REST target genes associated with this ontology term. ᶜ Expected number of REST target genes. ᵈ Fold enrichment of REST target genes over expectation. ᵉ Probability, by Fisher's exact test.

Genic and genomic distributions of RE1s Using the positional information from the RE1 PSSM search, we next measured how RE1s are distributed on a number of scales. Conventional models have transcription factors regulating target genes from proximal upstream promoter regions or distal enhancer modules. Recent in vivo measurements of transcription factor occupancy have demonstrated recruitment to a variety of positions relative to target genes, including introns (48,49). 
To investigate whether this is the case for REST, we next used the human whole-genome RE1 data to define the most prevalent locations of an RE1 relative to its target gene. We found that the greatest number of RE1s are located in the introns of genes (29.4%) (Figure 3A); indeed, this figure is almost certainly an underestimate given the large number of transcripts that remain to be properly annotated, and the uncertainty of the transcription start site (TSS) of many genes. RE1s showed a weak preference for locations upstream over downstream of target genes (29.3 and 17.5%, respectively), while one-fifth of RE1s are located >100 kb from the nearest annotated gene (20.7%), suggesting either that REST can have long-range (>100 kb) interactions with target genes, or that many target genes remain to be annotated. Surprisingly, intronic RE1s are rather uniformly distributed within their target genes, with a small excess situated within 0.1 gene lengths of the TSS (Figure 3B). Together these data suggest that RE1s are only weakly constrained in their position relative to target genes.
Figure 3 (A) RE1s are most frequently located in the introns of target genes. The fraction of strong RE1s at each position relative to human target genes was calculated. 5′, <100 kb upstream of annotated TSS; 3′, <100 kb downstream of gene end. (B) RE1s have little preference for location within genes. The proportion of internal (i.e. intronic and exonic) RE1s was plotted according to its relative position along the target gene's length (circles), defining the TSS to be 0 and the gene end to be 1. (C) Peaks of RE1 and gene density do not correlate on Chromosome 1. The number of RE1s (circles, left scale), and the number of Ensembl genes (squares, right scale) are plotted for 1 Mb windows along Chromosome 1. The approximate position of the centromere is indicated by a dashed line. (D) RE1s are clustered over distinct distance scales in the human genome. 
The distance from each RE1 to the next, moving in a positive direction along each chromosome, is plotted as a histogram for human RE1s. As a reference, equivalent data were generated for five sets of equivalent random genomic coordinates. Error bars represent standard deviation.
Genes have a heterogeneous distribution on chromosomal scales, tending to be highly clustered in G-C rich regions near the ends of chromosomes (50); we next tested whether REST target genes show a similar tendency. By plotting the density of PSSM RE1s and Ensembl genes along each chromosome, we found that while the RE1 density generally mirrors that of annotated genes, there are regions where the RE1 density is markedly lower or higher than corresponding gene density. 
This phenomenon is well illustrated on Chromosome 1, where one gene-rich region towards the tip of the p-arm is highly enriched for RE1s, while another at the centromeric end of the q-arm is markedly under-enriched (Figure 3C). We further sought to quantify the degree of clustering of RE1s by calculating the distance from each RE1 to its nearest neighbour (Figure 3D). Comparison of the RE1 distance distribution to that of randomly generated sets clearly demonstrates that the distribution of RE1s in the human genome is significantly non-random over all distance scales (at most P < 0.001, Student's t-test). In particular, RE1s are clustered on scales of both 10–100 kb and 100–1000 kb; this distance is similar to, or smaller than, the length of a typical gene, suggesting that some genes might be regulated by multiple RE1s. Subsequent inspection of the RE1 PSSM data revealed that 101 human genes are closest to, and within 100 kb of, two or more RE1s. Nevertheless the majority of RE1s are separated by 1–10 Mb, indicative of typical gene–gene distance or greater. A number of REST target genes have been found to contain pairs of RE1s arranged in tandem: the tandem sites of human SNAP25 and L1CAM recruit REST more effectively than single sites in the same cells (26), while KCNN4 contains a strong, well-conserved RE1 together with a second, more weakly conserved site that is capable of interacting with REST in mouse but not human (42). The nearest-neighbour analysis identified 27 examples of pairs of RE1s within 100 bp of each other, compared to none in the random sets. A number of these tandem RE1s were in fact members of groups of up to five motifs. All of these sequences are aligned in a ‘head to tail’ configuration (P = 7 × 10⁻⁹, Binomial distribution). Subsequent inspection of the flanking regions of human RE1s revealed an additional 32 RE1-like sequences with scores in the range 0.88–0.91, compared to zero found by six shuffled matrices, indicating that RE1s tend to colocalize. 
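The neighbour-distance analysis can be sketched in Python as follows. The sketch measures the distance from each site to the next one along the same chromosome and counts the distances in power-of-ten bins. The function names, the bin definition and the coordinates are hypothetical, and the comparison with random coordinate sets is omitted.

```python
import math
from collections import Counter


def next_site_distances(sites):
    """Distances from each site to the next site on the same chromosome.

    sites: iterable of (chromosome, position) pairs, in any order.
    """
    by_chrom = {}
    for chrom, pos in sites:
        by_chrom.setdefault(chrom, []).append(pos)
    distances = []
    for positions in by_chrom.values():
        positions.sort()
        distances.extend(b - a for a, b in zip(positions, positions[1:]))
    return distances


def decade_histogram(distances):
    """Count distances per decade: key k covers 10**k <= d < 10**(k + 1)."""
    return Counter(math.floor(math.log10(d)) for d in distances if d > 0)


# Hypothetical coordinates, not data from the study.
toy_sites = [("chr1", 100), ("chr1", 150), ("chr1", 50150),
             ("chr2", 10), ("chr2", 2000010)]
distances = next_site_distances(toy_sites)
histogram = decade_histogram(distances)

# Chance that 27 independent pairs all share one orientation, if each
# orientation were equally likely (an assumed null model).
p_all_same = 0.5 ** 27
```

Under the assumed null model, `p_all_same` is about 7.5 × 10⁻⁹, which is of the same order as the binomial P-value quoted for the head-to-tail arrangement of the 27 tandem pairs. This is offered as one plausible reading of that value, not as a statement of how it was derived.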
A list of all human tandem RE1s can be found in Supplementary Data. Dissecting the REST-regulatory sequences of a model gene, CACNA1A We next wished to test the predictive power of the RE1 PSSM by making an in-depth study of the REST-regulatory sequences within a single model gene. We intended for this approach to shed light on the functional significance of below-cutoff RE1s, and by performing the study in both human and mouse we hoped to measure the degree of evolutionary conservation of RE1s. We selected as a model the CACNA1A gene, which was found to contain multiple high-scoring RE1s by the PSSM search in human and mouse. The gene is classified in the RE1-enriched ‘voltage-gated calcium channel complex’ ontology classification, and is a paralogue of the REST target CACNA1H. CACNA1A is a large, multi-exon gene encoding the Cav2.1 neuron-specific voltage-gated calcium channel subunit. Little is known about the regulatory mechanisms governing CACNA1A expression, although the mouse homologue does have two functional Sp1 sites (51), while the human gene contains numerous clusters of potential TFBSs, including Sp1, Pax4 and Myc (R. Johnson, unpublished data). We first confirmed that expression of CACNA1A can be regulated by REST in human cells. HeLa cells are non-neuronal and express relatively high levels of REST [L. Ooi, unpublished data]. Virally mediated overexpression of a dominant-negative REST construct resulted in 75-fold de-repression of CACNA1A mRNA levels in HeLa cells (Figure 4A), indicating that REST is capable of strongly repressing this gene. In addition to the two strong RE1s located in introns 3 and 5 of the gene, the RE1 PSSM identified 13 more RE1-like sequences with scores in the range 0.83–0.91 in this gene (Figure 4B). We tested all potential RE1s in EMSA, and identified four which were capable of interacting with REST (data not shown). 
We further investigated the relative in vitro binding affinity of these sequences by testing their ability to compete in EMSA at decreasing molar ratios to labelled probe (100:1, 10:1, 1:1) (Figure 4C). Both RE1s with scores above cutoff strongly competed: Site A (PSSM score 0.96) at 1:1, and Site B (0.97) at 10:1. In addition, two below-cutoff sites displayed detectable affinity for REST: Site C (0.88) competed at 100:1, and Site D (0.83) competed partially at 100:1. These results underlined the accuracy of using a 0.91 cutoff in predicting high-affinity RE1s, while demonstrating that a minority of below-cutoff RE1s can interact with REST, albeit more weakly.
Figure 4 (A) REST regulates transcription of CACNA1A in HeLa. mRNA from HeLa cells infected with either empty adenovirus (Ad), or adenovirus expressing the REST DNA-binding domain (DN), was harvested and reverse transcribed. These cDNAs were interrogated in real-time quantitative PCR using primers specific to the coding sequence of CACNA1A and the housekeeping gene, cyclophilin. Axis indicates expression level relative to cyclophilin ×10⁷. (B) Identification of RE1s in the human CACNA1A gene. The RE1 PSSM was used to scan the human CACNA1A gene, and the scores of putative RE1s were plotted as a function of position. Upper and lower lines represent scores of 0.91 and 0.83, respectively. A cartoon of the gene is shown below, with exons represented by wide lines. (C) Human and mouse CACNA1A orthologues contain multiple functional RE1s that are only partially conserved. Sequences identified by the RE1 PSSM in the human (upper panel) and mouse (lower panel) CACNA1A genes were tested in EMSA competition assay. Only those sequences that displayed detectable affinity for REST are shown. Decreasing molar ratios (100:1, 10:1, 1:1) of unlabelled oligonucleotides were used in EMSA competition assay against a radiolabelled RE1. 
Single asterisk indicates the REST-bound probe and double asterisks indicate unbound probe. Controls: 1, no competitor; 2, CHRM4 RE1 site; 3, non-specific oligonucleotide; 4, anti-REST antibody (P18; Santa Cruz). The sequence of each RE1 is shown with its RE1 PSSM score in brackets. Boxes denote homologous pairs of RE1s, as determined by a combination of whole-genome alignment from the UCSC Genome Browser and BLAST. Underlined bases indicate those mouse positions which are not conserved in human. (D) REST is recruited to CACNA1A RE1s in vivo. DNA was immunoprecipitated with anti-REST (P18; Santa Cruz) and control non-specific IgG antibodies from wild-type HeLa (WT), as well as cells infected with empty adenovirus (Ad) and adenovirus carrying the full-length REST gene (Ad REST). ChIP DNA was interrogated in quantitative real-time PCR using primers flanking the RE1s of CACNA1A, as well as the coding region of the CHRM4 gene, which is distal to any RE1. Values represent the fold enrichment of anti-REST immunoprecipitate over IgG.
We next wished to ask whether the CACNA1A RE1 sites recruit REST in vivo, and whether the degree of that recruitment reflects their in vitro binding affinity. We assayed REST occupancy at the four functional CACNA1A RE1s in HeLa by ChIP assay. In wild-type HeLa both above-cutoff RE1s, Sites A and B, were strongly immunoprecipitated by an anti-REST antibody compared to a non-specific IgG, indicating their occupancy by REST (Figure 4D). Neither Site C nor D was enriched >2-fold. 
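The fold-enrichment values used in the ChIP analysis rest on quantifying each sample against a standard curve, as noted in the Methods. A minimal Python sketch of that calculation is given below. It assumes a linear relationship between Ct and log10 of the starting quantity, fitted by ordinary least squares. The function names, the dilution series and the Ct values are hypothetical and are not data from this study.

```python
import math


def fit_standard_curve(quantities, cts):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to get a starting quantity from a Ct."""
    return 10 ** ((ct - intercept) / slope)


def fold_enrichment(ct_ip, ct_igg, slope, intercept):
    """Starting quantity in the specific IP relative to the IgG control."""
    return (quantity_from_ct(ct_ip, slope, intercept)
            / quantity_from_ct(ct_igg, slope, intercept))


# Hypothetical tenfold dilution series and sample Ct values.
slope, intercept = fit_standard_curve([1, 10, 100, 1000],
                                      [30.0, 26.7, 23.4, 20.1])
enrichment = fold_enrichment(25.0, 28.3, slope, intercept)  # about 10-fold
```

With these hypothetical values the fitted slope is about −3.3 cycles per tenfold change, so a specific IP that crosses threshold 3.3 cycles before the IgG control corresponds to roughly tenfold enrichment.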
In order to investigate whether these weaker sites can recruit REST at higher cellular concentrations, similar ChIP assays were carried out on HeLa cells overexpressing virally delivered, full-length REST protein. In contrast to wild-type cells, all four CACNA1A RE1s were found to be occupied in cells overexpressing REST. This effect was not observed in cells infected with control adenovirus, nor was overexpressed REST recruited non-specifically to DNA lacking an RE1. Therefore REST is only recruited to the high-affinity RE1s of the CACNA1A gene in HeLa cells under normal conditions, but weaker RE1s retain the ability to specifically recruit REST at elevated concentrations.
With the publication of the genomes of multiple species, interest has focussed on phylogenetic comparison as a means of identifying conserved gene-regulatory elements (52), and for understanding variations in gene expression characteristics between species (53). A number of studies have demonstrated gain and loss (turnover) of TFBSs regulating orthologous genes in closely related species (2–6). Having mapped the regulatory sites through which REST can regulate the CACNA1A gene in human cells, we wished to determine the degree to which these elements are conserved in mouse, both in terms of their DNA sequence and affinity for REST. The RE1 PSSM identified three RE1s in the mouse CACNA1A orthologue. Examination of the predicted RE1s by global alignment in UCSC genome browser (54,55) and by local alignment in ClustalW (http://www.ebi.ac.uk/clustalw/index.html) showed that three mouse sequences are homologous to human Sites A, B and C (designated α, β, γ for mouse) (Figure 4C) (Note: Human Site B cannot be aligned to mouse in UCSC; however, the similarity of both RE1 sequences, as well as ClustalW alignment of their respective flanking regions suggests they are indeed homologous, and will be considered as such).
This was not the case for human site D and mouse site δ, which appear to bear no evolutionary relationship to each other. The two homologous pairs of high-affinity sites, A/α and B/β, are highly conserved both in PSSM score and sequence, a fact reflected in their identical capacity to bind REST in EMSA (Figure 4C). On the other hand, although Site C in human has detectable affinity for REST in EMSA, its mouse homologue of similar PSSM score does not. Separate experiments showed that the homologous sequence in dog is similarly incapable of interacting with REST (data not shown). Finally, each species has one RE1 (Site D/Site δ) which has no homologue in the other, and appears to reside in inserted/deleted sequence blocks. The mouse sequence, Site δ, has a high RE1 PSSM score and in vitro affinity for REST. ChIP assays carried out on the mouse neural stem cell line, NS5 (56) found Site δ to be specifically occupied by REST (R. Johnson, unpublished data). Together, these data indicate that conservation of functional REST-binding sites between human and mouse is not total. Non-conserved functional sites come in two distinct types, suggestive of alternative mechanisms of generation: human Site C can be aligned with non-functional mouse sequence, whereas mouse Site δ cannot be aligned with any human sequence and is only shared by one other mammal, rat. The former example is suggestive of RE1 creation or loss through random DNA mutation, while the latter appears to be the result of an insertion or deletion event. These hypotheses are supported by the degree of conservation of these sequences across multiple vertebrate species (Supplementary Data): Sites A/α and B/β have high-conservation scores by PhastCons (52) (notwithstanding alignment issues for Site B discussed above), while Sites C/γ, D and δ do not. Non-conservation of RE1s between human and mouse is not confined to the CACNA1A gene. 
We investigated the degree of conservation of PSSM-predicted RE1s in other voltage-gated calcium channel genes of the α1, β, γ, α2δ subunit families. We identified an RE1 sequence proximal to the CACNA1G gene that is distinct from the one identified by Kuwahara et al. (19). Altogether we identified 10 human voltage-gated calcium channel subunit genes associated with 14 strong RE1s. By inspection of the alignment of these RE1s to genomic sequence of mouse, we found that just seven (50%) were well conserved in mouse in terms of RE1 PSSM score, while four had strongly reduced score (mouse sequence below cutoff) and three could not be aligned (Table 2). The three non-conserved human RE1s we tested by ChIP (CACNB2, CACNA1G, CACNG7) recruit REST in vivo, suggesting that these are functional sites (Figure 5D). We infer that non-conservation of RE1 sequence and affinity is a more general feature of REST target genes.
Figure 5. (A) The human genome contains families of related RE1 sequences. All above-cutoff human RE1s, including 100 bp of flanking sequence, were used in a BLAST search of the remaining RE1s. Sequences identified with similarity to at least one other with P < 10^−30 are considered ‘related’. The proportions of RE1s which are unique (One), in pairs of high similarity (Two) or in groups of three or more highly related sequences (More), are shown. (B) Dispersal of RE1s from a single parent site. The largest family of related RE1s contains 28 members, all sharing significant homology with the RE1 hum3 located on Chromosome 1. The locations of these sites in the human genome were plotted (red arrowheads) using the Ensembl tool Karyoview. The presumptive parent site, hum3, is boxed. A number of duplicated RE1s cannot be distinguished in this view due to their close proximity. (C) RE1s associated with transposon sequences can bind REST (Table 3). A selection of duplicated RE1 sequences were tested for their ability to interact with REST by EMSA competition assay. ‘Not Rep.’ indicates sequences that have no detectable repetitive characteristics, as judged by the tool RepeatMasker (www.repeatmasker.org). Single asterisk indicates the REST-bound probe and double asterisks indicate unbound probe. Controls: 1, no competitor; 2, CHRM4 RE1 site; 3, non-specific oligonucleotide; 4, anti-REST antibody (P18; Santa Cruz). (D) Transposon-associated and duplicated RE1s recruit REST in vivo (Tables 2 and 3). ChIP assay was carried out on endogenous HeLa cells using an anti-REST antibody (P18; Santa Cruz) and non-specific IgG. Enrichments were measured as in Figure 4D, using primers flanking the indicated RE1s. TEs associated with each RE1 are indicated in brackets.
Table 2. Conservation of RE1s between human and mouse CACNA1 family genes
Gene | Human Ensembl ID | Human RE1 score | RE1 ID | TE? | Mouse Ensembl ID | Mouse RE1 score | RE1 ID | TE?
CACNA1A | ENSG00000141837 | 0.96 | hum39611 | No | ENSMUSG00000034656 | 0.92 | mus16973 | No
 | | 0.97 | hum39604 | No | | 0.95 | mus16983 | No
 | | — | — | — | | 0.94 | mus16981 | No
CACNA1B | ENSG00000148408 | 0.92 | hum23208 | No | ENSMUSG00000004113 | — | — | —
 | | 0.97 | hum23209 | No | | <0.83 | — | —
 | | 0.98 | hum23210 | No | | 0.96 | mus2523 | No
CACNA1D | ENSG00000157388 | 0.92 | hum8374 | No | ENSMUSG00000015968 | <0.83 | — | —
CACNA1H | ENSG00000196557 | 0.97 | hum33973 | No | ENSMUSG00000024112 | 0.98 | mus31150 | No
CACNA1G | ENSG00000006283 | 0.93 | hum37140 | LINE2 | ENSMUSG00000020866 | <0.83 | — | —
CACNA2D2 | ENSG00000007402 | 0.95 | hum8255 | No | ENSMUSG00000010066 | 0.88 | mus19450 | No
 | | 0.91 | hum8267 | No | | 0.93 | mus19435 | No
CACNA2D3 | ENSG00000157445 | 0.97 | hum8379 | No | ENSMUSG00000021991 | 0.96 | mus27091 | No
CACNB2 | ENSG00000165995 | 0.93 | hum23514 | Alu | ENSMUSG00000057914 | — | — | —
CACNG2 | ENSG00000166862 | 0.93 | hum43594 | No | ENSMUSG00000019146 | 0.94 | mus29100 | No
 | | — | — | — | | 0.93 | mus29059 | No
CACNG7 | ENSG00000105605 | 0.93 | hum40713 | Alu | ENSMUSG00000069806 | — | — | —
Dash (—) indicates no aligned sequence exists. TE?: The identity of transposable elements overlapping or in the immediate flanking region of RE1s is indicated.
Table 3. Target genes of duplicated RE1s
Ensembl ID | Description | RE1 ID | Element | EMSA lane (a) | Conserved?
No mouse homologue exists
ENSG00000196164 | Q96HX1_HUMAN | hum21648 | LINE2 | H | m
ENSG00000101825 | Adlican | hum44251 | Not rep. | B | m
ENSG00000185164 | Nodal modulator 2 precursor (pM5 protein 2) | hum34541 | LINE2 | N/A | m
ENSG00000154608 | KIAA0470L protein | hum11432 | LINE2 | E | c
ENSG00000189281 | PREDICTED: hypothetical protein XP_375668 | hum11438 | LINE2 | E | c
ENSG00000182053 | PREDICTED: similar to tripartite motif-containing 51 | hum26460 | Not rep. | N/A | m
ENSG00000182111 | PREDICTED: similar to Zinc finger protein 479 | hum17621 | LINE2 | F | c, d
ENSG00000197123 | Zinc finger protein 679 | hum17654 | Not rep. | N/A | m
ENSG00000163040 | NP_620125.1 | hum6034 | LINE1 | D | c
Mouse homologue has no RE1
ENSG00000106078 | Cordon-bleu homologue | hum17527 | LINE2 | N/A | c
ENSG00000105198 | Galactoside-binding soluble lectin 13 (PP13) | hum40246 | Not rep. | N/A | c
ENSG00000006659 | Placental protein 13-like (CLC2) | hum40243 | Not rep. | N/A | c
ENSG00000184330 | S100 calcium-binding protein A7-like 1 | hum2528 | Alu | C | c
ENSG00000143556 | S100 calcium-binding protein A7 (Psoriasin) | hum2515 | Alu | N/A | c
ENSG00000196396 | Tyrosine-protein phosphatase, non-receptor type 1 (PTP-1B) | hum41830 | LINE2 | N/A | c, d
Mouse homologue has RE1
ENSG00000152953 | STK32B | hum10317 | Not rep. | A | c, d
ENSG00000007171 | NOS2 | hum36473 | LINE2 | I | m
Not rep., flanking sequence has no repetitive characteristics; m, sequence can be aligned to multiple species; c, sequence can be aligned to chimp; d, sequence can be aligned to dog. (a) See Figure 5C.
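The conservation summary quoted in the text for Table 2 (seven human RE1s conserved, four with reduced mouse score, three unalignable) can be re-derived from the table itself. The sketch below is illustrative only: it assumes that each table row pairs a human RE1 with its aligned mouse sequence, and it takes 0.91 as the score cutoff, which is inferred from the ‘below-cutoff score range 0.88–0.91’ mentioned in the Discussion. The variable and function names are hypothetical.

```python
from collections import Counter

CUTOFF = 0.91  # assumed RE1 PSSM score cutoff (inferred from the text)

# (gene, human RE1 ID, mouse RE1 score) transcribed from Table 2.
# None -> no aligned mouse sequence (a dash in the table)
# 0.83 -> table entry "<0.83", stored as its upper bound
TABLE2_HUMAN_RE1S = [
    ("CACNA1A", "hum39611", 0.92),
    ("CACNA1A", "hum39604", 0.95),
    ("CACNA1B", "hum23208", None),
    ("CACNA1B", "hum23209", 0.83),
    ("CACNA1B", "hum23210", 0.96),
    ("CACNA1D", "hum8374", 0.83),
    ("CACNA1H", "hum33973", 0.98),
    ("CACNA1G", "hum37140", 0.83),
    ("CACNA2D2", "hum8255", 0.88),
    ("CACNA2D2", "hum8267", 0.93),
    ("CACNA2D3", "hum8379", 0.96),
    ("CACNB2", "hum23514", None),
    ("CACNG2", "hum43594", 0.94),
    ("CACNG7", "hum40713", None),
]


def classify(mouse_score):
    """Assign a human RE1 to a conservation class from its mouse score."""
    if mouse_score is None:
        return "not aligned"
    return "conserved" if mouse_score >= CUTOFF else "reduced score"


counts = Counter(classify(score) for _, _, score in TABLE2_HUMAN_RE1S)
genes = {gene for gene, _, _ in TABLE2_HUMAN_RE1S}

print(len(genes), "genes,", len(TABLE2_HUMAN_RE1S), "human RE1s")
print(dict(counts))
```

Rows that list only a mouse RE1 (dashes in the human columns) are left out, because the summary in the text counts human sites.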
Duplication of RE1s
The identification of RE1s which are not conserved between human and mouse, as well as the greater number of RE1s in the genome of the former, suggested to us that novel RE1s had arisen since both species diverged. We hypothesized that the duplication and insertion of sequence blocks containing RE1s might be a mechanism of novel TFBS creation, perhaps leading to the acquisition of novel target genes by REST over time. Pairs of RE1s which had been duplicated recently would be characterized not only by the similarity of their core RE1 sequences, but also by that of their flanking regions. To test whether this was the case, each RE1 identified by the RE1 PSSM in the human genome was sequentially used to search for similar sequences in the RE1 database using the BLAST algorithm. Searches were carried out using the RE1 itself and 100 bp of flanking sequence, with a stringent P-value cutoff of 1 × 10^−30 to identify only those sequences with highly significant homology. In this way we found that at least 10% (126/1301) of RE1s in the human genome belong to evolutionarily related groups (Figure 5A). (A full list of the identifiers and locations of duplicated RE1s is available from the authors on request.) A similar effect was observed in mouse (data not shown). This analysis excluded all instances where the RE1 in a duplicated sequence had an RE1 PSSM score below cutoff, as well as the tandem RE1s mentioned earlier.
We reasoned that duplication and insertion by TEs might be a potential mechanism of RE1 duplication. We therefore tested the duplicated RE1s for repetitive or transposon characteristics. We submitted the flanking sequences of duplicated RE1s to the online tool RepeatMasker (www.repeatmasker.org), which indicated that the majority of duplicated RE1s are located in TEs of most major classes, including long interspersed repeats (LINEs, principally LINE2s), short interspersed repeats (SINEs, principally Alus) and hERV sequences.
In addition, a number of duplicated RE1s are located in sequence with no characteristics of TEs. The largest single family of RE1s, located in the coding region of a LINE2 element, had 28 members with significant similarity to the hum3 RE1 sequence located in the subtelomeric region of the Chromosome 1 p-arm (Figure 6). These sites had an apparently non-random distribution, with 29% (8/28) located within 1 Mb of a telomere, 25% (7/28) within 1 Mb of a centromere and 43% (12/28) located on Chromosome 7 (Figure 5B).
To confirm that duplicated RE1s are functional binding sites, we tested the ability of a selection of these RE1s to interact with REST in vitro by EMSA competition assay (Figure 5C). Most of those sequences tested, including those associated with Alu, LINE1 and LINE2 sequences, as well as two pairs residing in non-repetitive DNA, were capable of interacting with REST. The most common LINE2-derived hum3 sequence was amongst this group. In agreement with previous findings (26), neither hERV Class I RE1 showed detectable affinity for REST. In addition to binding REST in vitro, we found that all four of the duplicated RE1s we tested (including two associated with LINE2s and one with an Alu) could be enriched by an anti-REST antibody in ChIP (Figure 5D). We conclude that functional RE1s have been duplicated and inserted at new positions in the human genome by both transposon-dependent and independent processes, and that a high proportion recruit REST in vivo.
Figure 6. Location of hum3 RE1 within the LINE2 element. The RE1 sequence is boxed. ‘-’ indicates an insertion/deletion, ‘i’ a transition (G↔A, C↔T) and ‘v’ a transversion (all other substitutions).
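The grouping of RE1s into related families described above can be sketched as a clustering of pairwise BLAST hits. The text specifies the criterion (similarity to at least one other RE1 with P < 1 × 10^−30) but not how families were assembled from the hits; the sketch below assumes single-linkage grouping, in which any chain of significant hits joins sequences into one family. The identifiers, P-values and function names are hypothetical.

```python
from collections import Counter, defaultdict

P_CUTOFF = 1e-30  # significance threshold quoted in the text


def re1_families(ids, hits, p_cutoff=P_CUTOFF):
    """Group RE1 identifiers into families linked by significant hits.

    `hits` is an iterable of (query_id, subject_id, p_value) tuples;
    every identifier used in `hits` must also appear in `ids`.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, p_value in hits:
        if query != subject and p_value < p_cutoff:
            parent[find(query)] = find(subject)

    groups = defaultdict(set)
    for i in ids:
        groups[find(i)].add(i)
    return list(groups.values())


def size_classes(families):
    """Count RE1s that are unique (One), in pairs (Two) or in larger groups (More)."""
    counts = Counter()
    for family in families:
        label = "One" if len(family) == 1 else "Two" if len(family) == 2 else "More"
        counts[label] += len(family)
    return counts


# Hypothetical identifiers and P-values, for illustration only.
ids = ["r1", "r2", "r3", "r4", "r5", "r6"]
hits = [("r1", "r2", 1e-45), ("r2", "r3", 1e-33), ("r4", "r5", 1e-60), ("r5", "r6", 1e-12)]

families = re1_families(ids, hits)
print(sorted(sorted(f) for f in families))
print(dict(size_classes(families)))
```

The One/Two/More labels follow the categories in the Figure 5A caption; the hit between r5 and r6 fails the threshold, so r6 remains a singleton.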
We next investigated the target genes of duplicated RE1s; in particular, we used the Ensembl database to check whether such genes had a mouse homologue, and if so, whether the homologue is also a REST target (defined as being the closest gene within 100 kb of an RE1 with PSSM score >0.88). A list of human targets of duplicated RE1s is shown in Table 3. We identified at least six human REST target genes whose mouse orthologues contain no identifiable RE1, as well as seven for which no mouse orthologue has been identified. TEs have gone through bursts of active transposition during distinct periods of evolutionary history: although LINE2 elements were thought to be active ∼200 million years ago and before human–mouse divergence, LINE1 and Alu elements continue to retrotranspose in humans (57). This is reflected in the phylogenetic conservation of human TE-associated RE1s: those associated with Alu and LINE1 elements have no aligned sequences other than in chimp, while a number of ancient LINE2 elements are conserved amongst multiple species (Table 3). Interestingly, multispecies alignment of LINE2 RE1s is only possible in a minority of cases, again suggesting that significant gain and loss of RE1s has taken place since human–mouse divergence.
If duplication and insertion of RE1s by TEs has contributed to the current population of human RE1s, one might expect to observe a statistically significant association of the two sequence features. The proximal flanking regions of all above-cutoff human RE1s were searched for TEs using RepeatMasker. The same operation was performed on the control sets of sequences identified by shuffled RE1 matrices using the same cutoff score (Figure 2C). These data, presented in Figure 7, clearly show that the association of RE1s with TEs is non-random. RE1s are under-associated with most classes of TE; in the case of MaLRs and MIRs this effect is statistically significant (P < 0.05, Student's t-test).
In contrast, LINE2 elements are highly significantly associated with RE1s (P < 0.001), with approximately one in seven RE1s overlapping or flanking a LINE2. This finding suggests that LINE2 retrotransposition in particular has been an important driver of RE1 generation and insertion in the human lineage. We tested this idea by compiling the putative target genes of the 190 LINE2-associated RE1s in the human genome, and testing their GO classifications for significantly overrepresented terms. We found that this set of genes is significantly enriched for a number of important terms identified for the set of all human REST target genes (Supplementary Data). We consider this to be strong evidence for the functional importance of gene regulation by LINE2-associated RE1s.
Figure 7. Association of RE1s with TE classes. The numbers of TEs identified in the proximal flanking region (<100 bp) of all human RE1s were measured using the tool RepeatMasker. As a reference, the sequences identified by the six shuffled RE1 PSSMs (Figure 2C) were analysed in the same way. Statistical significance was calculated using Student's t-test (∗P < 0.05, ∗∗∗P < 0.001).
DISCUSSION
The genomic population of RE1s is open-ended
We have mapped the REST-regulatory sequences and target genes in the human and mouse genomes with greater accuracy than possible previously.
Although the approach we have used is not applicable to most TFBSs, owing to their low information content, nevertheless conclusions from this study on REST will be applicable to transcription factors in general. By searching for RE1s over a wide range of PSSM scores, we showed that instead of a well-defined population of unambiguous, high-affinity binding sites, the human genome may be more accurately considered to have an open-ended continuum of RE1 sites of varying conservation, similarity to the RE1 motif, and affinity. This resonates with the emerging view of dynamically evolving gene-regulatory networks based on a constantly changing set of genomic-binding sites. Therefore statements of absolute numbers of binding sites discovered might better be replaced by ratios of discovered sites to those expected by chance for each PSSM score range. By this regime, there is a 23-fold (1301/57) excess of RE1s above cutoff in the human genome, and a 5-fold (2890/600) excess of RE1s in the below-cutoff score range 0.88–0.91. Furthermore, a distinct change in this ‘RE1 excess’ occurs at scores >0.93, suggesting that 706 sequences in the human genome above this score are highly significant. One might expect that as score constraints are progressively relaxed, the ratio of sequences identified by RE1 and shuffled PSSMs should approach 1; the fact that there is still a 5:1 excess of RE1s between 0.88 and 0.91, strongly suggests that the population of below-cutoff RE1s has genuine biological significance, despite evidence that the majority cannot bind REST. There may be several reasons for this excess. 
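The excess ratios quoted above follow directly from the counts given in the text, taking the second number in each pair as the count expected by chance (the sequences found by shuffled PSSMs). A minimal sketch of the arithmetic, with a hypothetical function name and dictionary:

```python
def excess(observed: int, expected_by_chance: float) -> float:
    """Ratio of RE1s found to sequences expected by chance in a score range."""
    return observed / expected_by_chance


# Counts as quoted in the text: (RE1 PSSM hits, shuffled-PSSM hits).
score_ranges = {
    "above cutoff": (1301, 57),
    "0.88-0.91": (2890, 600),
}

for label, (observed, expected) in score_ranges.items():
    print(f"{label}: {excess(observed, expected):.1f}-fold excess")
```

Rounded to whole numbers, the two ratios give the 23-fold and 5-fold figures stated in the text.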
First, the RE1 PSSM no doubt falsely rejected a number of high-affinity sites by scoring them below cutoff; the fact that a PSSM is incapable of perfectly predicting REST-RE1 affinity may indicate that the training set is incomplete or skewed by too many RE1s discovered by consensus search, or more fundamentally, that significant interdependency occurs between nucleotides of the RE1, which cannot be represented by a PSSM. As we have shown, some below-cutoff sites represent low-affinity RE1s which are only capable of recruiting REST at elevated concentrations and/or permissive chromatin states, such as occurs in ischemic neurons (17) or neural stem cells (44,45), respectively. Furthermore, a significant fraction of below-cutoff RE1s represent a sequence motif of hERV Class I elements which we know to be unable to interact with REST. Finally, and perhaps most intriguingly, a number of below-cutoff RE1 sequences might represent evolutionarily turned-over RE1s, i.e. sites which were functional at some point during history, but which are no longer under selection pressure. Over time, random mutation of such RE1s would give rise to RE1-like sequences which are degenerate and non-functional, but nevertheless bear vestigial similarity to the RE1 motif and are detected below the score cutoff by the RE1 PSSM. Future studies of genome-wide phylogenetic conservation of RE1s should identify turned-over RE1s if they exist. The number and state of mutation of such an evolutionary ‘footprint’ of turned-over TFBSs in vertebrate genomes would be an important basis for attempting to estimate the rate of evolution of gene-regulatory DNA.
REST regulation of CACNA1A through multiple binding sites
The RE1 PSSM was also an effective tool in understanding how REST regulates individual genes. Our thorough analysis of the CACNA1A REST-regulatory DNA showed that the gene contains a number of functional REST-binding sites, with a range of affinities and degrees of evolutionary conservation in mouse.
This suggests that, in some cases, regulation of REST target genes may be more complex than previous models suggest. It is conceivable that the number of occupied RE1s in CACNA1A, and hence the degree of transcriptional repression of the gene, can assume a number of well-defined states, determined by the concentration of REST relative to the Kd of each RE1. This resonates with the model of Ballas et al. (45) regarding the progressive loss of REST from target genes during neuronal differentiation, where REST recruitment is lost from those genes with weaker RE1s first, as REST levels in the nucleus drop during development. Alternatively, variation in REST protein levels through organs such as the brain might lead to finely patterned levels of CACNA1A transcription (58). In a similar way, the existence of closely spaced tandem RE1 clusters strongly suggests that this arrangement of sites has biological relevance, perhaps through their ability to recruit REST at low concentrations, or by simultaneous recruitment of multiple REST complexes to a target gene. Tandem RE1s were originally identified in the SNAP25 gene (26), which could recruit REST at low cellular concentrations such as that found in the U373 astrocytoma cell line. Although the head-to-tail orientation of tandem RE1s seems to be ubiquitous, the distance separating the tandem sites is not, suggesting that REST–REST interactions are not an important element of binding. Rather, we propose that the tandem configuration might instead reflect the process by which the site was generated through a tandem duplication event (59,60).
Evolutionary turnover of RE1s through mutation and sequence duplication
Comparison of the RE1s of orthologous human and mouse genes of the voltage-gated calcium channel subunit family provided strong evidence of evolutionary turnover in RE1s. We showed that a large proportion of human RE1s from this set are not conserved in aligned genomic sequence of mouse.
Such non-conserved RE1s fall into two categories; first, there are sites which have aligned sequence in mouse that cannot bind REST (e.g. human/mouse Site C of CACNA1A). Second, there are RE1s which are aligned to gaps in the other species’ genome sequence (e.g. human Site D of CACNA1A). What mechanisms might be responsible for the observed differences in the RE1s of human and mouse? The generation of novel regulatory sequences by random DNA mutation is thought to be important in instances where the motif is short enough that it occurs at high frequency in a given stretch of DNA (13). However, the exponential relationship between ‘waiting time’ (i.e. the average time it takes for a particular motif to appear through random DNA mutation) for a sequence in a promoter-sized DNA sequence and the motif length (8,9) suggests that other mechanisms might be necessary to explain the generation of motifs as long as the RE1. Nevertheless, random DNA mutation would appear to be responsible for the appearance of Site C in the human orthologue of CACNA1A. We also addressed possible mechanisms of creation of the second type of non-conserved RE1, which appear to be due to insertion. We identified a significant population of duplicated RE1s, principally but not exclusively associated with TEs, which are capable of recruiting REST and have been inserted throughout the human genome. Indeed, a number of such inserted RE1s are to be found proximal to annotated genes. The insertion of functional gene-regulatory motifs by Alu elements has been observed previously (61), and it is likely that insertion of novel gene-regulatory sequences by TEs has been important in the evolution of gene regulation (62). Although we identified RE1s associated with all the major classes of TE, including Alus, LINE1 and LINE2 elements, high-scoring RE1s are more strongly associated with ancient LINE2 elements than expected by chance and occur in its coding sequence. 
We infer that LINE2-mediated RE1 duplication was an important agent of RE1 creation in the human lineage. In support of this, the genome of the fish Fugu rubripes, from which the human lineage diverged before the historical period of LINE2 activity 100–200 million years ago, contains ∼3-fold fewer consensus RE1s than either human or mouse (554 versus 1892 or 1894, respectively) (26). Such results point to episodic transposon-mediated duplication as an important mechanism by which the REST regulon has acquired new targets over evolutionary history.
Through the identification of RE1s in two mammalian species, we have produced evidence for the evolutionary change in the regulon of an essential transcription factor. Since the majority of REST target genes are neuronal-specific, this provides an insight into how genome evolution may have given rise to the differential gene expression regimes observed in human brain compared to closely related species (63,64). Intriguingly, many neuronal genes with rapidly evolving coding sequences have been identified as REST targets by this study (65), while evolutionary changes in brain developmental processes, in which REST plays a central role, have been important drivers of brain evolution (66). Therefore it is conceivable that the mutation and insertion of RE1s we have observed have played important roles in vertebrate brain evolution.
ACKNOWLEDGEMENTS
The authors would like to thank Dr Gos Micklem (University of Cambridge) for reviewing the manuscript, Dr Michael I. Sadowski (University College, London) for help with the RE1 database, as well as Professor Haig H. Kazazian (University of Pennsylvania), Megan L. Cooper (University of Leeds) and Dr Nikolai D. Belyaev (University of Leeds) for advice and discussions. This work was supported by the Wellcome Trust. R.J. and L.O. are Wellcome Trust PhD students. R.G. is a BBSRC PhD student.
Funding to pay the Open Access publication charges for this article was provided by the Wellcome Trust.
Conflict of interest statement. None declared.
REFERENCES
1. Cooper G.M., Stone E.A., Asimenos G., NISC Comparative Sequencing Program, Green E.D., Batzoglou S., Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15, 901–913.
2. Dermitzakis E.T. and Clark A.G. (2002) Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol. Biol. Evol. 19, 1114–1121.
3. Dermitzakis E.T., Bergman C.M., Clark A.G. (2003) Tracing the evolutionary history of Drosophila regulatory regions with models that identify transcription factor binding sites. Mol. Biol. Evol. 20, 703–714.
4. Costas J., Casares F., Vieira J. (2003) Turnover of binding sites for transcription factors involved in early Drosophila development. Gene 310, 215–220.
5. Smith N.G.C., Brandstrom M., Ellegren H. (2004) Evidence for turnover of functional noncoding DNA in mammalian genome evolution. Genomics 84, 806–813.
6. Wittkopp P.J., Haerum B.K., Clark A.G. (2004) Evolutionary changes in cis and trans gene regulation. Nature 430, 85–88.
7. Wray G.A., Hahn M.W., Abouheif E., Balhoff J.P., Pizer M., Rockman M.V., Romano L.A. (2003) The evolution of transcriptional regulation in eukaryotes. Mol. Biol. Evol. 20, 1377–1419.
8. Berg J., Willmann S., Lassig M. (2004) Adaptive evolution of transcription factor binding sites. BMC Evol. Biol. 4, 42.
9. Stone J.R. and Wray G.A. (2001) Rapid evolution of cis-regulatory sequences via local point mutations. Mol. Biol. Evol. 18, 1764–1770.
10. Stormo G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics 16, 16–23.
11. Barash Y., Elidan G., Friedman N., Kaplan T. (2003) Modeling dependencies in protein–DNA binding sites. Proceedings of the Seventh Annual International Conference on Computational Biology, 10–13 April, Berlin, Germany. ACM Press, pp. 28–37.
12. King O.D. and Roth F.P. (2003) A non-parametric model for transcription factor binding sites. Nucleic Acids Res. 31, e116.
13. Carroll S.B., Grenier J.K., Weatherbee S.D. (2004) From DNA To Diversity: Molecular Genetics And The Evolution Of Animal Design, 2nd edn. Blackwell, Oxford.
14. Berezikov E., Guryev V., Plasterk R.H.A., Cuppen E. (2004) CONREAL: conserved regulatory elements anchored alignment algorithm for identification of transcription factor binding sites by phylogenetic footprinting. Genome Res. 14, 170–178.
15. Xie X., Lu J., Kulbokas E.J., Golub T.R., Mootha V., Lindblad-Toh K., Lander E.S., Kellis M. (2005) Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature 434, 338–345.
16. Chen Z.-F., Paquette A.J., Anderson D.J. (1998) NRSF/REST is required in vivo for repression of multiple neuronal target genes during embryogenesis. Nature Genet. 20, 136–142.
17. Calderone A., Jover T., Noh K.-m., Tanaka H., Yokota H., Lin Y., Grooms S.Y., Regis R., Bennett M.V.L., Zukin R.S. (2003) Ischemic insults derepress the gene silencer REST in neurons destined to die. J. Neurosci. 23, 2112–2121.
18. Zuccato C., Tartari M., Crotti A., Goffredo D., Valenza M., Conti L., Cataudella T., Leavitt B.R., Hayden M.R., Timmusk T., et al. (2003) Huntingtin interacts with REST/NRSF to modulate the transcription of NRSE-controlled neuronal genes. Nature Genet. 35, 76–83.
19. Kuwahara K., Saito Y., Takano M., Arai Y., Yasuno S., Nakagawa Y., Takahashi N., Adachi Y., Takemura G., Horie M., et al. (2003) NRSF regulates the fetal cardiac gene program and maintains normal cardiac structure and function. EMBO J. 22, 6310–6321.
20. Cheong A.B.A., Li J., Kumar B., Sukumar P., Munsch C., Buckley N.J., Neylon C.B., Porter K.E., Beech D.J., Wood I.C. (2005) Downregulated REST transcription factor is a switch enabling critical potassium channel expression and cell proliferation. Mol. Cell 20, 45–52.
21. Roopra A., Sharling L., Wood I.C., Briggs T., Bachfischer U., Paquette A.J., Buckley N.J. (2000) Transcriptional repression by neuron-restrictive silencer factor is mediated via the Sin3–histone deacetylase complex. Mol. Cell. Biol. 20, 2147–2157.
22. Andres M.E., Burger C., Peral-Rubio M.J., Battaglioli E., Anderson M.E., Grimes J., Dallman J., Ballas N., Mandel G. (1999) CoREST: a functional corepressor required for regulation of neural-specific gene expression. Proc. Natl Acad. Sci. USA 96, 9873–9878.
23. Battaglioli E., Andres M.E., Rose D.W., Chenoweth J.G., Rosenfeld M.G., Anderson M.E., Mandel G. (2002) REST repression of neuronal genes requires components of the hSWI SNF complex. J. Biol. Chem. 277, 41038–41045.
24. Roopra A., Qazi R., Schoenike B., Daley T.J., Morrison J.F. (2004) Localized domains of G9a-mediated histone methylation are required for silencing of neuronal genes. Mol. Cell 14, 727–738.
25. Schoenherr C.J., Paquette A.J., Anderson D.J. (1996) Identification of potential target genes for the neuron-restrictive silencer factor. Proc. Natl Acad. Sci. USA 93, 9881–9886.
26. Bruce A.W., Donaldson I.J., Wood I.C., Yerbury S.A., Sadowski M.I., Chapman M., Gottgens B., Buckley N.J. (2004) Genome-wide analysis of repressor element 1 silencing transcription factor/neuron-restrictive silencing factor (REST/NRSF) target genes. Proc. Natl Acad. Sci. USA 101, 10458–10463.
27. Osada R., Zaslavsky E., Singh M. (2004) Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics 20, 3516–3525.
28. Schoenherr C.J. and Anderson D.J. (1995) The neuron-restrictive silencing factor (NRSF): a coordinate repressor of multiple neuron-specific genes. Science 267, 1360–1363.
29. Wood I.C., Roopra A., Buckley N.J. (1996) Neural specific expression of the m4 muscarinic acetylcholine receptor gene is mediated by a RE1/NRSE-type silencing element. J. Biol. Chem. 271, 14221–14225.
30. Catterall W.A., Striessnig J., Snutch T.P., Perez-Reyes E. (2003) International Union of Pharmacology. XL. Compendium of voltage-gated ion channels: calcium channels. Pharmacol. Rev. 55, 579–581.
31. Diriong S., Lory P., Williams M.E., Ellis S.B., Harpold M.M., Taviaux S. (1995) Chromosomal localization of the human genes for α1A, α1B, and α1E voltage-dependent Ca²⁺ channel subunits. Genomics 30, 605–609.
32. Ishikawa K., Fujigasaki H., Saegusa H., Ohwada K., Fujita T., Iwamoto H., Komatsuzaki Y., Toru S., Toriyama H., Watanabe M., et al. (1999) Abundant expression and cytoplasmic aggregations of α1A voltage-dependent calcium channel protein associated with neurodegeneration in spinocerebellar ataxia type 6. Hum. Mol. Genet. 8, 1185–1193.
33. Terwindt G.M., Ophoff R.A., van Eijk R., Vergouwe M.N., Haan J., Frants R.R., Sandkuijl L.A., Ferrari M.D. (2001) Involvement of the CACNA1A gene containing region on 19p13 in migraine with and without aura. Neurology 56, 1028–1032.
34. Chioza B., Wilkie H., Nashef L., Blower J., McCormick D., Sham P., Asherson P., Makoff A.J. (2001) Association between the α1a calcium channel gene CACNA1A and idiopathic generalized epilepsy. Neurology 56, 1245–1246.
35. Jodice C., Mantuano E., Veneziano L., Trettel F., Sabbadini G., Calandriello L., Francia A., Spadaro M., Pierelli F., Salvi F., et al. (1997) Episodic ataxia type 2 (EA2) and spinocerebellar ataxia type 6 (SCA6) due to CAG repeat expansion in the CACNA1A gene on chromosome 19p. Hum. Mol. Genet. 6, 1973–1978.
36. Stormo G.D. and Hartzell G.W. III (1989) Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Acad. Sci. USA 86, 1183–1187.
37. Hertz G.Z., Hartzell G.W. III, Stormo G.D. (1990) Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 6, 81–92.
38. Wood I.C., Belyaev N.D., Bruce A.W., Jones C., Mistry M., Roopra A., Buckley N.J. (2003) Interaction of the repressor element 1-silencing transcription factor (REST) with target genes. J. Mol. Biol. 334, 863–874.
39. Andrews N.C. and Faller D.V. (1991) A rapid micropreparation technique for extraction of DNA-binding proteins from limiting numbers of mammalian cells. Nucleic Acids Res. 19, 2499.
40. Kraner S.D., Chong J.A., Tsay H.-J., Mandel G. (1992) Silencing the Type II sodium channel gene: a model for neural-specific gene regulation. Neuron 9, 37–44.
41. Livak K.J. and Schmittgen T.D. (2001) Analysis of relative gene expression data using real-time quantitative PCR and the 2^(−ΔΔCT) Method. Methods 25, 402–408.
42. Cheong A., Bingham A.J., Li J., Kumar B., Sukumar P., Munsch C., Buckley N.J., Neylon C.B., Porter K.E., Beech D.J., et al. (2005) Downregulated REST transcription factor is a switch enabling critical potassium channel expression and cell proliferation. Mol. Cell 20, 45–52.
43. Crooks G.E., Hon G., Chandonia J.-M., Brenner S.E. (2004) WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190.
44. Sun Y.-M., Greenway D.J., Johnson R., Street M., Belyaev N.D., Deuchars J., Bee T., Wilde S., Buckley N.J. (2005) Distinct Profiles of REST Interactions with Its Target Genes at Different Stages of Neuronal Development. Mol. Biol. Cell 16, 5630–5638.
45. Ballas N., Grunseich C., Lu D.D., Speh J.C., Mandel G. (2005) REST and its corepressors mediate plasticity of neuronal gene chromatin throughout Neurogenesis. Cell 121, 645–657.
46. Paquette A.J., Perez S.E., Anderson D.J. (2000) Constitutive expression of the neuron-restrictive silencer factor (NRSF)/REST in differentiating neurons disrupts neuronal gene expression and causes axon pathfinding errors in vivo. Proc. Natl Acad. Sci. USA 97, 12318–12323.
47. Sugino K., Hempel C.M., Miller M.N., Hattox A.M., Shapiro P., Wu C., Huang Z.J., Nelson S.B. (2006) Molecular taxonomy of major neuronal classes in the adult mouse forebrain. Nature Neurosci. 9, 99–107.
48. Zhang X., Odom D.T., Koo S.-H., Conkright M.D., Canettieri G., Best J., Chen H., Jenner R., Herbolsheimer E., Jacobsen E., et al. (2005) Genome-wide analysis of cAMP-response element binding protein occupancy, phosphorylation, and target gene activation in human tissues. Proc. Natl Acad. Sci. USA 102, 4459–4464.
49. Martone R., Euskirchen G., Bertone P., Hartman S., Royce T.E., Luscombe N.M., Rinn J.L., Nelson F.K., Miller P., Gerstein M., et al. (2003) Distribution of NF-κB-binding sites across human chromosome 22. Proc. Natl Acad. Sci. USA 100, 12247–12252.
50. Lander E., Linton M., Birren B., Nusbaum C., Zody M., Baldwin J. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921.
51. Takahashi E., Murata Y., Oki T., Miyamoto N., Mori Y., Takada N., Wanifuchi H., Wanifuchi N., Yagami K., Niidome T. (1999) Isolation and functional characterization of the 5′-upstream region of mouse P/Q-type Ca²⁺ channel alpha1A subunit gene. Biochem. Biophys. Res. Commun. 260, 54–59.
52. Siepel A., Bejerano G., Pedersen J.S., Hinrichs A.S., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L.W., Richards S., et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050.
53. Rockman M.V., Hahn M.W., Soranzo N., Zimprich F., Goldstein D.B., Wray G.A. (2005) Ancient and recent positive selection transformed Opioid cis-regulation in humans. PLoS Biol. 3, e387.
54. Blanchette M., Kent W.J., Riemer C., Elnitski L., Smit A.F.A., Roskin K.M., Baertsch R., Rosenbloom K., Clawson H., Green E.D., et al. (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708–715.
55. Schwartz S., Kent W.J., Smit A., Zhang Z., Baertsch R., Hardison R.C., Haussler D., Miller W. (2003) Human–mouse alignments with BLASTZ. Genome Res. 13, 103–107.
56. Conti L., Pollard S.M., Gorba T., Reitano E., Toselli M., Biella G., Sun Y., Sanzone S., Ying Q.-L., Cattaneo E., et al. (2005) Niche-independent symmetrical self-renewal of a mammalian tissue stem cell. PLoS Biol. 3, e283.
57. Kapitonov V.V., Pavlicek A., Jurka J. (2004) Anthology of Human Repetitive DNA. Wiley, NY.
58. Palm K., Belluardo N., Metsis M., Timmusk T.o. (1998) Neuronal expression of zinc finger transcription factor REST/NRSF/XBR gene. J. Neurosci. 18, 1280–1296.
59. Thomas E.E., Srebro N., Sebat J., Navin N., Healy J., Mishra B., Wigler M. (2004) Distribution of short paired duplications in mammalian genomes. Proc. Natl Acad. Sci. USA 101, 10349–10354.
60. Cheng Z., Ventura M., She X., Khaitovich P., Graves T., Osoegawa K., Church D., DeJong P., Wilson R.K., Paabo S., et al. (2005) A genome-wide comparison of recent chimpanzee and human segmental duplications. Nature 437, 88–93.
61. Shankar R., Grover D., Brahmachari S., Mukerji M. (2004) Evolution and distribution of RNA polymerase II regulatory sites from RNA polymerase III dependant mobile Alu elements. BMC Evol. Biol. 4, 37.
62. Hedges D.J. and Batzer M.A. (2005) From the margins of the genome: mobile elements shape primate evolution. BioEssays 27, 785–794.
63. Enard W., Khaitovich P., Klose J., Zollner S., Heissig F., Giavalisco P., Nieselt-Struwe K., Muchmore E., Varki A., Ravid R., et al. (2002) Intra- and interspecific variation in primate gene expression patterns. Science 296, 340–343.
64. Caceres M., Lachuer J., Zapala M.A., Redmond J.C., Kudo L., Geschwind D.H., Lockhart D.J., Preuss T.M., Barlow C. (2003) Elevated gene expression levels distinguish human from non-human primate brains. Proc. Natl Acad. Sci. USA 100, 13030–13035.
65. Dorus S., Vallender E.J., Evans P.D., Anderson J.R., Gilbert S.L., Mahowald M., Wyckoff G.J., Malcom C.M., Lahn B.T. (2004) Accelerated evolution of nervous system genes in the origin of Homo sapiens. Cell 119, 1027–1040.
66. Gilbert S.L., Dobyns W.B., Lahn B.T. (2005) Genetic links between brain development and brain evolution. Nature Rev. Genet. 6, 581–590.
Author notes
The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors.
© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Complexed crystal structure of replication restart primosome protein PriB reveals a novel single-stranded DNA-binding mode
Huang, Cheng-Yang; Hsu, Che-Hsiung; Sun, Yuh-Ju; Wu, Huey-Nan; Hsiao, Chwan-Deng
doi: 10.1093/nar/gkl536 pmid: 16899446
ABSTRACT
PriB is a primosomal protein required for replication restart in Escherichia coli. PriB stimulates PriA helicase activity via interaction with single-stranded DNA (ssDNA), but the molecular details of this interaction remain unclear. Here, we report the crystal structure of PriB complexed with a 15-base oligonucleotide (dT15) at 2.7 Å resolution. PriB shares structural similarity with the E.coli ssDNA-binding protein (EcoSSB). However, the structure of the PriB–dT15 complex reveals that PriB binds ssDNA differently. Results from filter-binding assays show that the PriB–ssDNA interaction is salt-sensitive and cooperative. Mutational analysis suggests that the loop L45 plays an important role in ssDNA binding. Based on the crystal structure and biochemical analyses, we propose a cooperative mechanism for the binding of PriB to ssDNA and a model for the assembly of the PriA–PriB–ssDNA complex. This report presents the first structure of a replication restart primosomal protein complexed with DNA, and a novel model that explains the interactions between a dimeric oligonucleotide-binding-fold protein and ssDNA.
INTRODUCTION
The ability to restart replication after encountering DNA damage is essential for bacterial survival (1,2). The ϕX-type primosome, or ‘replication restart’ primosome (3–5), is a protein–DNA complex that re-activates stalled DNA replication at forks after DNA damage (6). PriB is one of the Escherichia coli primosomal proteins. Together with PriA, PriC, DnaT, DnaB, DnaC and DnaG, PriB is required for the assembly of the ϕX-type primosome (7). Although the order of assembly during ϕX-type primosome formation (PriB is the second to assemble) has been well studied (3,7), the role played by PriB is poorly understood at the molecular level. PriB can bind both ssDNA and ssRNA (8–10). It also stabilizes the binding of PriA to DNA hairpins and thereby facilitates the association of DnaT with the primosome (7).
In addition, a recent study suggests that upon forming the PriA–PriB–ssDNA complex, PriB induces a conformational alteration in PriA resulting in stimulated PriA helicase activity (11). PriB exists as a homodimer (8–10), and each polypeptide has 104 residues. The PriB monomer has an oligonucleotide/oligosaccharide-binding (OB)-fold structure with three flexible β-hairpin loops: L12 (residues 20–24), L23 (residues 37–44) and L45 (residues 81–88). It shares structural similarity with the DNA-binding domain of E.coli ssDNA-binding protein (EcoSSB) (1,2,12). The structural resemblance suggests PriB may bind ssDNA in a manner similar to EcoSSB. However, several lines of evidence indicate that they have different ssDNA-binding modes. First, the amino acid sequences of PriB and EcoSSB share only 11% identity and 27% similarity (Figure 1A). Second, EcoSSB exists as a homotetramer (13), while PriB is a homodimer. Third, in vitro assays have shown that EcoSSB inhibits whereas PriB stimulates PriA helicase activity (11).
Figure 1. Structure of the PriB–dT15 complex. (A) Sequence alignment of PriB and EcoSSB. Identical residues between PriB and EcoSSB are indicated in yellow, and the conserved lysine residues (Lys82 in PriB and Lys87 in EcoSSB) involved in ssDNA binding are indicated in cyan. The secondary structural elements of PriB are shown below the sequences. (B) Periodic interactions between the PriB dimers and dT15 oligonucleotides in the complex crystal. An asymmetric unit contains a PriB dimer and one dT15. The dT15 (magenta trace) is sandwiched by monomer A (green ribbon) and monomer B′ (yellow ribbon) from the symmetrically related dimer. For clarity, the remaining symmetrical molecules are shown in gray. (C) A stereo view of a PriB dimer interacting with two dT15 oligonucleotides. The two oligonucleotides (magenta and cyan stick models) are related by crystallographic 2₁ symmetry. (D) Structural overlay of PriB dimers in apo (blue) and dT15-bound (yellow) forms. The two structures are shown as Cα traces, and the L45 loops are labeled.
To develop a mechanistic model of ϕX-type primosome assembly, it is important to elucidate the structure of the PriB–ssDNA complex and understand the ssDNA-binding properties of PriB. In this study, we present the crystal structure of PriB complexed with a 15mer oligodeoxythymidylate (dT15) at 2.7 Å resolution. This structural model is compared with that of the EcoSSB–ssDNA complex (13). We also conducted ssDNA-binding assays with wild-type PriB and PriB mutants to investigate the nature of the PriB–ssDNA interaction.
MATERIALS AND METHODS
Protein expression and purification
The coding regions of wild-type PriB and the PriB mutants were cloned into pET-21b expression vectors and expressed with a His6 affinity tag at the C-terminus of the recombinant proteins.
Details of the construction and protein purification have been described previously (8). The PriB mutants were generated according to the Stratagene QuikChange mutagenesis protocol (Stratagene, La Jolla, CA) using the pET21b-PriB plasmid as template (8). Based on secondary structure measurements by circular dichroism spectroscopy, the mutated proteins appeared to be correctly folded. These mutants showed chromatographic behavior identical to that of wild-type PriB on a size-exclusion column (data not shown). Therefore, the amino acid substitutions in these mutants do not affect PriB dimer formation under the chromatographic conditions we used.
Nucleic acids
ssDNA oligonucleotides of various lengths were custom synthesized by MdBio, Inc. (Frederick, MD). The nucleic acid homopolymers were 5′ end labeled with T4 polynucleotide kinase (Promega, Madison, WI) and [γ-³²P]ATP (6000 Ci/mmol; PerkinElmer Life Sciences).
Filter-binding assay
The affinity of PriB for ssDNA was examined by a double-filter-binding assay (14,15). Briefly, ssDNA–PriB complexes were generated by incubating 1 nM of ³²P-labeled oligonucleotide with various concentrations of PriB (10⁻⁵ to 10⁻⁹ M) for 30 min at 25°C in a binding buffer containing 50 mM HEPES, pH 7.0, and 40 μg/ml BSA. The reaction mixture, in a total volume of 50 μl, was filtered through a nitrocellulose membrane overlaid on a Hybond N+ nylon membrane (Amersham Pharmacia Biotech). The membranes had been pre-soaked for 10 min in a washing buffer containing 50 mM HEPES, pH 7.0, and 10 mM NaCl, before being framed into a dot-blotting apparatus. The slots were washed with 100 μl of washing buffer immediately before and after the sample filtering step. The radioactivity on both filters was quantified with a PhosphorImager (Molecular Dynamics), and the fraction of bound ssDNA was estimated.
Apparent dissociation constants were determined by plotting the fraction of ssDNA bound at each protein concentration and then fitting the data to the following equation:
θ = [P]/([P] + Kd),
in which θ is the fraction of ssDNA bound, [P] is the concentration of total protein, and Kd is the apparent dissociation constant. Cooperative binding to ssDNA sites was assessed by plotting the fraction of ssDNA bound over a range of protein concentrations, and the binding data were analyzed by fitting the data to the following equation:
log(θ/(1 − θ)) = h log[P] − h log Kd,
where h is the Hill coefficient (16).
Mobility shift assays with agarose gel electrophoresis
The affinity of PriB protein for ϕX ssDNA was examined with a published method used for the analysis of the SSB–ϕX ssDNA complex (17). Briefly, PriB proteins in various concentrations as specified in the figure legends were incubated in 40 mM HEPES, pH 7.0, 80 mM NaCl and 100 nM of circular ϕX ssDNA (Biolab) at 25°C for 30 min. Aliquots (5 μl) were removed from each reaction solution and mixed with 1 μl of loading dye (0.25% bromophenol blue and 40% sucrose). The samples were analyzed by electrophoresis on 0.8, 1, 2 and 3% agarose gels using a Tris–borate–EDTA buffer (45 mM Tris–borate and 1 mM EDTA, pH 8.5). Bands corresponding to unbound ϕX ssDNA and PriB–ϕX ssDNA complexes were visualized by ethidium bromide (0.5 μg/ml) staining.
Crystallization and data collection
Before crystallization, PriB was concentrated to 6 mg/ml in 20 mM sodium citrate and 50 mM NaCl (pH 5.0), and ssDNA was added to a molar ratio of 1:2.5 (PriB-dimer:ssDNA). The samples were then incubated at 37°C for 30 min. Crystals of PriB–dT15 and PriB–dT30 were grown by the hanging drop vapor diffusion method at 20°C. Both complex crystals grew within 1 week after mixing 1 μl of the protein–ssDNA complex solution with 1 μl of reservoir solution containing 25% (w/v) PEG 3350, 50 mM Bis–Tris, pH 6.5.
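The binding analysis described under 'Filter-binding assay' (fraction bound, apparent Kd and Hill coefficient) can be sketched in a few lines. This is a minimal illustration, not the authors' fitting procedure: the function names are hypothetical, the search bounds for Kd are assumptions, the data are synthetic, and it assumes, as is usual for double-filter assays, that protein-bound DNA is retained on the nitrocellulose filter and free DNA on the nylon filter.

```python
import math


def fraction_bound(bound_counts, free_counts):
    """Fraction of ssDNA bound, from counts on the protein-retaining
    filter (bound) and the DNA-retaining filter (free)."""
    return bound_counts / (bound_counts + free_counts)


def fit_kd(protein, theta, lo=-12.0, hi=-3.0, iters=100):
    """Least-squares fit of theta = [P]/([P] + Kd) by golden-section
    search over log10(Kd). The bounds lo and hi are assumptions."""
    def sse(log_kd):
        kd = 10.0 ** log_kd
        return sum((t - p / (p + kd)) ** 2 for p, t in zip(protein, theta))

    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if sse(c) < sse(d):
            b = d
        else:
            a = c
    return 10.0 ** ((a + b) / 2.0)


def fit_hill(protein, theta):
    """Linear regression of log10(theta/(1 - theta)) on log10[P].
    Slope = Hill coefficient h; intercept = -h * log10(Kd)."""
    pts = [(math.log10(p), math.log10(t / (1.0 - t)))
           for p, t in zip(protein, theta) if 0.0 < t < 1.0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    h = sxy / sxx
    intercept = my - h * mx
    return h, 10.0 ** (-intercept / h)


if __name__ == "__main__":
    # Synthetic, noise-free data made up for illustration (Kd = 50 nM, h = 2).
    conc = [1e-9, 1e-8, 3e-8, 1e-7, 3e-7, 1e-6, 1e-5]
    synth = [c ** 2 / (c ** 2 + (5e-8) ** 2) for c in conc]
    h, kd = fit_hill(conc, synth)
    print(f"Hill coefficient = {h:.2f}, apparent Kd = {kd:.2e} M")
```

The Hill fit uses the linearized form given in the text, so points with θ equal to 0 or 1 are skipped because their logit is undefined; with real, noisy data a weighted or non-linear fit may be preferable.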
Both PriB–dT15 and PriB–dT30 crystals grew as clusters of thin plates with dimensions of ∼0.3 mm × 0.1 mm × 0.01 mm. Parafilm oil was used as a cryoprotectant before the crystals were flash frozen. Each dataset was collected on a Rigaku R-AXIS IV++ image-plate detector (Rigaku, MSC) using a synchrotron radiation X-ray source at Beamline 17B2 of the National Synchrotron Radiation Research Center in Taiwan. Data integration and scaling were performed using the HKL package (18).
Structure determination and refinement
The structure of PriB bound to dT15 was solved by molecular replacement with the program AMoRe (19), using DNA-unbound PriB [Protein Data Bank (PDB) accession no. 1V1Q] with its flexible L45 and L12 loops trimmed off as the search model. The clearest solution was found at an R-factor of 46% and a correlation coefficient of 61.3%. Following molecular replacement, model building was performed using the program XtalView (20). The loops were gradually built as the quality of the map improved. After the loops were almost entirely built, electron density corresponding to DNA was observed in both σA-weighted 2Fo–Fc and Fo–Fc maps (21). The DNA structure was built into a 2Fo–Fc electron density map 1 nt at a time to avoid preconceived notions of strand topology. Molecular dynamics refinement was performed using the program CNS (22) with a 20–2.7 Å resolution range, and 10% of the data was selected to calculate the Rfree factor to monitor refinement. The B-factors were higher in the ssDNA (57.89 Å²) than in the protein (38.66 Å²). The final structure was refined to an R-factor of 25.0% and an Rfree of 28.4%. The ligand occupancies were estimated from alternating cycles of B-factor and occupancy refinement, which resulted in a value of 0.7. Partial occupancy of ssDNA-binding sites has been observed previously. For example, occupancy of ssDNA ligand in a 2.8 Å crystal structure of EcoSSB is 0.67 (13).
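For readers less familiar with the refinement statistics quoted above, the conventional definitions can be written as follows. These are the standard crystallographic definitions, given here as an assumption because the text does not spell them out; W denotes the working set of reflections used in refinement, T the test set excluded from it (here 10% of the data), and k a scale factor.

```latex
R = \frac{\sum_{hkl \in W} \bigl|\, |F_o(hkl)| - k\,|F_c(hkl)| \,\bigr|}
         {\sum_{hkl \in W} |F_o(hkl)|},
\qquad
R_{\mathrm{free}} = \frac{\sum_{hkl \in T} \bigl|\, |F_o(hkl)| - k\,|F_c(hkl)| \,\bigr|}
                         {\sum_{hkl \in T} |F_o(hkl)|}
```

With the reported values, the difference between Rfree and R is 28.4% − 25.0% = 3.4 percentage points.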
The stereochemical quality was checked by a Ramachandran plot generated using the program PROCHECK (23). The statistics for structure refinements of the PriB–dT15 and the PriB–dT30 complexes are listed in Table 1.
Table 1. Data collection and refinement statistics
Dataset                        PriB–dT15       PriB–dT30
Data collection
  Space group                  P2₁2₁2₁         P2₁2₁2₁
  a (Å)                        45.54           45.92
  b (Å)                        51.15           51.36
  c (Å)                        99.10           100.46
  Resolution (Å)               20–2.7          20–4.5
  Rsym (%)a                    6.9 (41.8)b     13.3 (65.4)
  I/σ(I)                       18.5 (3.6)      19.2 (4.5)
  Completeness (%)             99.6 (100.0)    78.5 (66.3)
  Redundancy                   4.9             3.2
Refinement
  Resolution (Å)               20–2.7
  R/Rfree (%)                  25.0/28.4
  Number of atoms
    Protein                    1763
    Nucleic acid               297
    Water                      119
  B-factors (Å²)
    Protein                    38.66
    Nucleic acid               57.89
    Water                      36.69
  Root mean square deviations
    Bond lengths (Å)           0.012
    Bond angles (°)            2.000
a $R_{\mathrm{merge}}(I) = \sum_h \sum_i |I_i - \langle I \rangle| \,/\, \sum_h \sum_i I_i$, where $\langle I \rangle$ is the mean intensity of the i observations of reflection h.
b Numbers in parentheses are for data with a high-resolution cutoff at 2.7 Å.
Electron microscopy
Electron microscopy was used to examine the PriB–ssDNA complexes. Complexes of PriB molecules (46 μl; 600 μg/ml) and intact circular ϕX ssDNA (4 μl; 20 μg/ml) were formed by mixing the solutions. They were then diluted directly into 0.01 M ammonium acetate (pH 7.0) and incubated for 20 min at 25°C. The complexes were adsorbed to a carbon film that had been made hydrophilic by exposure to a high-voltage glow discharge. The adsorbed complexes were exposed to 1% (w/v) aqueous uranyl acetate, dried, and then imaged with a goniometer stage in a Zeiss EM10CA electron microscope. Images on films were scanned with a Nikon LS4500 film scanner.
RESULTS

Overall structure of the PriB dimer in complex with dT15

To investigate the molecular details of the interaction between PriB and ssDNA, crystals of the PriB–dT15 and the PriB–dT30 complexes were subjected to X-ray diffraction studies. Both crystals belong to space group P2₁2₁2₁ with similar cell dimensions (Table 1); the PriB–dT15 and PriB–dT30 complexes diffracted to 2.7 and 4.5 Å resolution, respectively. Owing to the resolution limit and data quality, we focused on the PriB–dT15 complex structure in this study. The majority of the electron density for PriB and dT15 was of good quality, but a discontinuity was observed for T9 to T11 of dT15, suggesting that this region is dynamic. Each asymmetric unit contains one PriB dimer and one dT15 oligonucleotide. Although PriB dimers made few contacts with each other, through their interaction with dT15 they packed as a thread with crystallographic 2₁ symmetry along the b-axis (Figure 1B). Owing to periodic interactions between PriB dimers and dT15 oligonucleotides, every oligonucleotide is sandwiched by monomer A from one dimer and the adjacent monomer B′ from a symmetry-related dimer. Consequently, every PriB dimer contacts two symmetry-related dT15 oligonucleotides (Figure 1C). In the complex, two ssDNA-binding surfaces from two adjacent PriB dimers confine the DNA-binding path, and the bound dT15 adopts an Ω-shaped conformation.

PriB–dT15 interactions

The occupancy of bound dT15 in the crystal is 0.7. Partial occupancy has also been found in the crystal of EcoSSB tetramers complexed with two 35-base oligodeoxycytidylates (dC35) (13). Single-stranded DNA-binding proteins (SSBs) act as sequence-independent ssDNA chaperones. Hence, it has been suggested that SSBs do not limit the conformation of the bound ssDNA to the extent observed for other known DNA-binding proteins (24), so that the largely unstructured ssDNA can slide freely through the ssDNA-binding domain of SSB (25).
Consequently, this high DNA mobility causes the bound DNAs to be either disordered (24) or to have a low occupancy (13). A recent genomic study indicates that PriB evolved from EcoSSB via gene duplication with subsequent rapid sequence divergence (26). Thus, PriB may have inherited its ssDNA-binding nature from its ancestor, EcoSSB. Although PriB, like EcoSSB, binds ssDNA on the surface of its OB folds, the two proteins are likely to differ in their ssDNA-binding mechanisms because of differences in the extent of oligomerization and in the conformation of the L23 loop. PriB forms a dimer, and the L23 loop of each subunit makes close contact with the β-barrel core. EcoSSB, however, forms a tetramer, and its longer L23 loops protrude away from the β-barrel core in the presence or absence of ssDNA (13,27). The extended L23 loops greatly increase the interactions between EcoSSB and ssDNA. A long stretch of ssDNA wraps around the outside of the homotetramer. In contrast, owing to the closed conformation of the L23 loops, PriB has a relatively shallow DNA-binding surface on the two OB folds of the dimer, and ssDNA wraps around the L45 loops (Figure 1C). The structure of the PriB dimer in the ssDNA-bound state is largely similar to that of the apo form, with significant conformational changes only in the L45 loops (Figure 1D). In the apo form (8–10), the L45 loops of the PriB dimer are remarkably flexible. In the PriB–dT15 complex, however, both L45 loops are stabilized by strong interactions with the ssDNA. The L45 loop in PriB is shorter than that in EcoSSB and has different protein–protein interacting abilities. One of the two stabilized L45 loops contacts another L45 loop in a symmetry-related dimer (Figure 1B). The contact surface area between PriB dimers is small (∼288 Å²). The thread-like ultrastructure of the PriB dimers found in the crystal therefore appears to be mainly attributable to the association of PriB with ssDNA.
In EcoSSB, the L45 loops are probably important for the continuous assembly of homotetramers that wrap the ssDNA. They pair intermolecularly via antiparallel β-sheets and bring the tetramers together in the crystals of both the apo (27) and the ssDNA-complexed forms (13). Because PriB dimers and dT15 oligonucleotides interact periodically in the crystal, we will, for clarity, mainly address the interactions among monomer A, dT15, and the adjacent monomer B′ from the symmetry-related dimer (Figure 1B). Usually, an OB fold interacts with only a certain span of ssDNA (28). As shown in Figure 2A, a single dT15 oligonucleotide interacts with two OB folds from two symmetry-related PriB dimers. The 5′ and 3′ termini of dT15 interact mainly with monomer A, while the central region of dT15 makes many contacts with monomer B′. The dT15 adopts an Ω-shaped conformation to accommodate the two ssDNA-binding surfaces. We postulate that dT15 primarily binds to monomer A, but that the central region of dT15 is partially dissociated from monomer A owing to competitive binding by monomer B′. This competition between monomers A and B′ yields a variety of conformers that are relatively isoenergetic, resulting in weaker electron density in the central region of dT15 compared with that at the two ends.

Figure 2. PriB–ssDNA interactions. (A) Schematic diagram of the protein–ssDNA interactions in the PriB–dT15 complex. The monomer that contains each amino acid is given in parentheses. (B) Stacking interactions between Trp47 and the T3 base of ssDNA. (C) Basic residues from the L45 loop interact with the T7 and T8 bases. The 2Fo–Fc electron density maps, contoured at 0.8 σ, cover the T3 base in (B) and the T7 and T8 bases in (C).

In the EcoSSB–dC35 complex, aromatic residues (Trp40, Trp54 and Phe60) on the DNA-binding surface make extensive stacking interactions with the ssDNA (13). Trp54 is located at the entrance for the 5′ terminus of the ssDNA, whereas Trp40 and Phe60 function together like a clamp and are located where the 3′ terminus of the ssDNA exits. Accordingly, they define the DNA-binding path and promote wrapping of the ssDNA around the homotetramer. In the PriB–dT15 structure, the L23 loops have a closed conformation and affect the topology of the DNA-binding surface (Figure 1C). Trp47 of monomer A, the functional equivalent of EcoSSB Trp54, interacts with the 3′ terminus of ssDNA (Figure 2B). The clamp-like dyad (Trp40 and Phe60) found in EcoSSB is missing from PriB. Phe77, the functional equivalent of EcoSSB Phe60, is buried inside the protein. Phe42, the functional equivalent of EcoSSB Trp40, does not interact with DNA in the crystal structure. The 5′ terminus of dT15 interacts with Trp47 of monomer B instead. Disregarding the first two bases (T1 and T2), we propose that the interaction of the dT15 5′ terminus with Trp47 of monomer B directs and maximizes the interactions between the ssDNA and the OB fold of the proteins, leading the ssDNA to wrap around the L45 loops of the PriB dimer. The basic residues on the PriB DNA-binding surface, those of the L45 loop in particular, play a major role in ssDNA interaction (Figure 2). Although it lacks aromatic residues like Trp88 in EcoSSB (13), the L45 loop of PriB uses Lys82, Lys84 and Lys89 to make contacts with the ssDNA. These interactions were not observed in the EcoSSB–dC35 complex.
By cooperating with Arg13 and Lys18 on the opposite side of molecules A and B′ (Figures 1A and 2), Lys82, Lys84 and Lys89 stabilize nucleotides T5 to T12 by making electrostatic interactions with the sugar-phosphate backbone (Figure 2C).

The ssDNA-binding properties of PriB

The ssDNA-binding ability of PriB was estimated with a filter-binding assay using dA and dT oligonucleotides of various lengths. Since dT homopolymers of 20 bases or longer give high background noise upon binding to nitrocellulose filters, they were excluded from the assays. The titration curves of PriB with dA and dT homopolymers (Supplementary Figure S1) show that the affinity of PriB for the oligonucleotides increased with length. Binding of PriB to dA5 or dT5 was negligible. For all other homopolymers, >90% of the oligonucleotide bound to PriB, and the estimated apparent Kd values are presented in Table 2. The binding of PriB to ssDNA increased sharply within a narrow range of protein concentration, indicating that the formation of PriB–ssDNA complexes is a positively cooperative process (Supplementary Figure S1). The Hill coefficients (h) for PriB–ssDNA binding were determined (Table 2). The h values for dT15, dT20, dA15, dA20 and dA25 are ∼1.5, suggesting cooperative binding of PriB to these homopolymers. Furthermore, a cooperativity transition occurs between dA25 and dA30: from dA30 onwards the h values are >2.5. The results indicate highly cooperative binding of PriB to homopolymers of 30 bases or longer (Table 2). The cooperative binding of PriB to ssDNA has important implications for the nature of the protein–protein interactions within the complex and the position of the ssDNA-binding sites on PriB.

Table 2. ssDNA-binding parameters of PriB

| ssDNA | Apparent Kd (nM) | h |
| --- | --- | --- |
| dT10 | 740 ± 70 | 1.2 ± 0.1 |
| dT15 | 100 ± 20 | 1.5 ± 0.1 |
| dT20 | 30 ± 10 | 1.6 ± 0.1 |
| dA10 | 1280 ± 100 | 1.2 ± 0.1 |
| dA15 | 490 ± 40 | 1.3 ± 0.1 |
| dA20 | 290 ± 30 | 1.6 ± 0.1 |
| dA25 | 210 ± 20 | 1.4 ± 0.2 |
| dA30 | 120 ± 20 | 2.6 ± 0.2 |
| dA35 | 110 ± 20 | 3.0 ± 0.6 |
| dA40 | 70 ± 20 | 2.8 ± 0.5 |
| dA45 | 70 ± 20 | 2.7 ± 0.4 |
| dA50 | 40 ± 10 | 2.6 ± 0.4 |
| dA55 | 40 ± 10 | 2.8 ± 0.1 |
| dA60 | 40 ± 10 | 2.8 ± 0.2 |
| dA65 | 40 ± 10 | 2.9 ± 0.3 |

The errors are standard deviations determined using 2–4 independent titration experiments.

The ssDNA-binding surface of PriB is highly electropositive and interacts directly with both the bases and the phosphate backbone of the ssDNA.
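As an aside on how to read the Hill coefficients in Table 2, the short Python sketch below evaluates a binding curve for two of the tabulated oligonucleotides. It assumes the common Hill form θ = [P]^h / (Kd,app^h + [P]^h), in which Kd,app is the protein concentration at half-saturation; the source does not state the exact fitting equation, so this is an illustration rather than the authors' procedure, and the function names are hypothetical.

```python
def hill_fraction_bound(protein_nM, kd_app_nM, h):
    """Fraction of ssDNA bound under the assumed Hill model."""
    if protein_nM <= 0:
        return 0.0
    ratio = (protein_nM / kd_app_nM) ** h
    return ratio / (1.0 + ratio)


def transition_width(h):
    """Fold increase in protein concentration needed to go from 10% to 90% bound."""
    return 81.0 ** (1.0 / h)


# (apparent Kd in nM, Hill coefficient) taken from Table 2.
TABLE2 = {"dT15": (100.0, 1.5), "dA30": (120.0, 2.6)}

for oligo, (kd, h) in TABLE2.items():
    curve = [round(hill_fraction_bound(f * kd, kd, h), 2) for f in (0.5, 1.0, 2.0)]
    print(oligo, "fraction bound at 0.5x, 1x, 2x Kd:", curve,
          "| 10-90% width:", round(transition_width(h), 1), "fold")
```

Under this model a non-cooperative site (h = 1) needs an 81-fold increase in protein concentration to move from 10% to 90% saturation, and the required range shrinks as h rises. That is the sense in which binding that rises "within a narrow range of protein concentration" signals positive cooperativity.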
To investigate whether electrostatic interactions with this electropositive surface play an important role in ssDNA binding, we examined the binding of ssDNA to PriB at varying salt concentrations. The binding affinity of PriB for dT15 or dA30 is salt dependent (Table 3 and Supplementary Figure S2). At 200 mM NaCl, the binding affinities of PriB for dT15 and dA30 are ∼13- and 25-fold lower, respectively, than those measured in the absence of salt. Furthermore, <50% of dA30 was bound by PriB, even at micromolar concentrations of PriB (Supplementary Figure S2). These results indicate that PriB binds to ssDNA mainly through electrostatic interactions.

Table 3. Salt effect on dA30 or dT15 binding affinities of PriB

| [NaCl] (mM) | dT15 Kd,app (nM) | dA30 Kd,app (nM) |
| --- | --- | --- |
| 0 | 100 ± 20 | 120 ± 20 |
| 50 | 170 ± 30 | 160 ± 30 |
| 100 | 300 ± 30 | 330 ± 30 |
| 150 | 560 ± 60 | 560 ± 60 |
| 200 | 1300 ± 200 | 3000 ± 500 |

The errors are standard deviations determined using 2–4 independent titration experiments.

PriB amino acid residues crucial to ssDNA binding

To investigate the contribution of individual amino acid residues to ssDNA binding, alanine substitution and deletion mutants were constructed and analyzed (Supplementary Figure S3). Substitution at Arg13 or Lys18 had a slight effect on ssDNA binding compared to that of wild-type PriB.
Interestingly, substitutions of residues in the L45 loop have a greater effect on ssDNA binding. The K82A, K84A and K89A mutants have Kd values that are 4- to 6-fold (∼400–600 nM) higher than that of the wild-type PriB (Table 4). Deletion of K82 or K89 (dK82 and dK89 mutants) also decreases the ssDNA-binding ability. A single mutation of a positively charged residue may not be sufficient to abolish the ssDNA-binding activity. A triple mutant, K82A/K84A/K89A, was therefore generated, and its ability to bind dT15 or dA30 was dramatically impaired. The Kd values for dT15 and dA30 are 5500 and 7800 nM, respectively, which are 55- to 65-fold higher than those of the wild-type PriB. These data indicate that the highly electropositive region of PriB, especially within the L45 loop (which includes Lys82, Lys84 and Lys89), plays a crucial role in ssDNA binding.

Table 4. ssDNA dissociation constants for PriB variants binding to dA30 or dT15

| PriB variant | dT15 Kd,app (nM) | dA30 Kd,app (nM) |
| --- | --- | --- |
| Wild-type | 100 ± 20 | 120 ± 20 |
| W47A | 300 ± 30 | 310 ± 30 |
| R13A | 270 ± 30 | 200 ± 30 |
| K18A | 210 ± 30 | 210 ± 30 |
| K82A | 500 ± 50 | 450 ± 40 |
| K84A | 400 ± 40 | 420 ± 40 |
| K89A | 410 ± 40 | 400 ± 40 |
| dK82 | 610 ± 60 | 550 ± 50 |
| dK89 | 460 ± 50 | 520 ± 50 |
| K82A/K84A/K89A | 5500 ± 1000 | 7800 ± 1000 |

The errors are standard deviations determined using 2–4 independent titration experiments.

Previous investigations of EcoSSB demonstrated that aromatic stacking plays an important role in ssDNA binding (13,27). Based on our structural data, only Trp47 of PriB is involved in ssDNA binding. To further test the relative contributions of aromatic and basic amino acids to ssDNA binding, Trp47 was mutated to alanine. The ssDNA-binding ability of the W47A mutant is only 2.6-fold lower than that of the wild-type PriB. This finding is consistent with the results of a recent report on the binding ability of the W47A mutant (as assayed using fluorescence anisotropy) (9). Hence, the aromatic residue in PriB, Trp47, appears to be involved in but not critical for ssDNA binding. Despite the presence of an EcoSSB-like fold in PriB, PriB appears to bind ssDNA in a different manner from EcoSSB.

Binding of PriB protein to circular ϕX ssDNA

In the PriB–dT15 complex, the PriB dimers form a long chain along the ssDNA (Figure 3A and C). This arrangement is consistent with the morphology observed in negatively stained electron micrographs of the PriB–circular ϕX ssDNA complex (Figure 3B). These structural images thus reveal a novel ssDNA-binding mode that may explain how a dimeric OB-fold protein binds to ssDNA.
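The fold changes quoted for the PriB variants follow directly from the apparent Kd values in Table 4. The Python sketch below recomputes them as the ratio of each variant's Kd,app to the wild-type value; the dictionary and function names are hypothetical, and the reported standard deviations are ignored for simplicity.

```python
# Apparent Kd values in nM from Table 4, as (dT15, dA30) pairs.
KD_APP_NM = {
    "Wild-type": (100, 120),
    "W47A": (300, 310),
    "R13A": (270, 200),
    "K18A": (210, 210),
    "K82A": (500, 450),
    "K84A": (400, 420),
    "K89A": (410, 400),
    "dK82": (610, 550),
    "dK89": (460, 520),
    "K82A/K84A/K89A": (5500, 7800),
}


def fold_change(variant):
    """Return (dT15, dA30) Kd,app of a variant relative to wild-type PriB."""
    wt_dt15, wt_da30 = KD_APP_NM["Wild-type"]
    dt15, da30 = KD_APP_NM[variant]
    return dt15 / wt_dt15, da30 / wt_da30


for name in KD_APP_NM:
    dt15_fold, da30_fold = fold_change(name)
    print(f"{name:>16}: dT15 {dt15_fold:5.1f}-fold, dA30 {da30_fold:5.1f}-fold")
```

With these inputs the triple mutant comes out at 55-fold (dT15) and 65-fold (dA30), and W47A at 3.0-fold (dT15) and about 2.6-fold (dA30), which matches the figures given in the text.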
Figure 3. Topology of binding among PriB dimers and ssDNA. (A) Stereo diagram of crystal packing of PriB–dT15 complexes. The 2Fo–Fc electron density map contoured at 0.8 σ shows the bound ssDNA (dT15) in blue. The PriB monomers are shown as green and yellow ribbons. (B) Enlarged electron micrograph of intact ϕX ssDNA covered by PriB dimers. The arrangement of these PriB–ϕX ssDNA complexes presented in the electron micrograph image is similar to the PriB–dT15 complexes shown in the crystal diagram in (C).

To examine whether the crystal structure of PriB–dT15 and the results from functional analysis of the protein with synthetic oligonucleotides are relevant to natural events, the binding of PriB protein to circular ϕX ssDNA (5386 nt) was analyzed by agarose gel electrophoresis. As the amount of PriB added increased, a gradual decrease in the mobility of the ϕX ssDNA was detected (Figure 4, lanes 1–5). This observation probably reflects additional copies of PriB binding to the ϕX ssDNA and slowing the mobility of the multicomponent complex.

Figure 4. Binding of wild-type or PriB mutants to ϕX ssDNA. The reaction solutions contained 100 nM circular ϕX ssDNA and PriB proteins at the indicated protein/nucleotide concentration ratio (P/N ratio). The protein concentrations used were 100 μM (lane 1), 50 μM (lane 2), 10 μM (lane 3), 1 μM (lane 4) and without PriB protein (lane 5). The PriB mutant protein concentrations used were 100 μM (lanes 6–10). The reaction solutions were incubated at 25°C for 30 min and then analyzed by agarose gel electrophoresis mobility shift assays. Bands corresponding to unbound ϕX ssDNA and various PriB–ϕX ssDNA complexes were visualized by ethidium bromide staining. The minor band that is visible in the absence of PriB protein (lane 5) is contributed by a small amount of linearized ssDNA in the commercial ϕX ssDNA preparation.

Residues on PriB that are crucial to dT15 and dA30 binding are also important for ϕX ssDNA binding. The R13A, K18A and W47A mutations (Figure 4, lanes 6–8) have a minor effect, whereas the K82A mutation has a greater effect on ϕX ssDNA binding (Figure 4, lane 9). The mobility of the ϕX ssDNA in the presence of the triple mutant (K82A/K84A/K89A) is nearly identical to that of the control, indicating that this protein binds ϕX ssDNA poorly even at a P/N ratio of 1000 (Figure 4, lane 10).
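The P/N ratios behind Figure 4 can be reproduced with simple arithmetic. The sketch below assumes that the stated 100 nM ϕX ssDNA concentration is expressed in nucleotides and that the P/N ratio is the protein concentration divided by that value; this reading is an assumption, chosen because it returns the ratio of 1000 quoted for the 100 μM lanes. The names are hypothetical.

```python
SSDNA_NUCLEOTIDE_NM = 100.0  # assumed to be a nucleotide concentration

# Wild-type PriB concentration per lane in micromolar, from the Figure 4 legend.
LANE_PROTEIN_UM = {1: 100.0, 2: 50.0, 3: 10.0, 4: 1.0, 5: 0.0}


def pn_ratio(protein_uM, ssdna_nM=SSDNA_NUCLEOTIDE_NM):
    """Protein/nucleotide ratio with both concentrations converted to nM."""
    return protein_uM * 1000.0 / ssdna_nM


for lane, conc in LANE_PROTEIN_UM.items():
    print(f"lane {lane}: {conc:6.1f} uM PriB -> P/N = {pn_ratio(conc):.0f}")
```

On this assumption lanes 1 to 4 correspond to P/N ratios of 1000, 500, 100 and 10, and the mutant lanes (6–10, all at 100 μM) sit at 1000.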
DISCUSSION

A novel ssDNA-binding mode is revealed by the structure of the PriB–ssDNA complex

Although both PriB and EcoSSB have a classical OB-fold ssDNA-binding surface (28,29), they bind DNA differently and yield structurally distinct DNA-bound complexes. PriB dimers bind dT15 with the highly electropositive L45 loop surface and create a long chain of protein surrounding a DNA strand (Figure 5, left panel). In contrast, a 35 nt homopolymer wraps around the EcoSSB tetramer in the EcoSSB–ssDNA complex (Figure 5, middle panel). Differences in oligomeric structure offer one apparent explanation for this observation. Moreover, the L23 and L45 loops of EcoSSB are longer than those of PriB. The L23 loop of EcoSSB has three aromatic residues (Trp40, Trp54 and Phe60) that are strongly involved in ssDNA binding and guide the ssDNA to wrap around the tetramer through base stacking interactions (13). The L23 loop of PriB also has three aromatic residues, but only Trp47 is involved in ssDNA binding. This interaction involving a single amino acid is probably insufficient to guide the ssDNA across the L23 loop of PriB in a manner similar to that observed for EcoSSB. The much shorter L23 and L45 loops also preclude the possibility that PriB binds ssDNA by wrapping the DNA around a tetrameric arrangement of OB folds in a manner similar to EcoSSB. A large number of non-specific DNA-binding proteins (28,29) have an OB fold. Our most important finding is that PriB uses an ssDNA-binding strategy different from that of other SSBs, including human replication protein A (RPA) (Figure 5, right panel) (30). EcoSSB and RPA have conserved aromatic residues in the L45 loop of the OB fold. These residues are replaced with positively charged amino acids in PriB. These differences support our contention that the ssDNA-binding mechanisms of these OB-fold proteins might differ significantly.
Indeed, our structural and functional studies reveal clear differences between the ssDNA-binding modes of these highly homologous OB-fold proteins.

Figure 5. Ribbon diagrams of three OB-fold proteins. (A) Structural comparison of three aligned OB-fold domains: E.coli PriB (PDB code 2CCZ); E.coli SSB (PDB code 1EYG); human RPA70 (PDB code 1JMC). Residues involved in ssDNA binding are labeled. (B) Distinct ssDNA-binding modes of these OB-fold proteins. ssDNA is colored in magenta.

Cooperative binding

Many SSB proteins bind to ssDNA with some degree of positive cooperativity. But this type of cooperativity varies considerably among the SSB proteins. For example, EcoSSB binds to long ssDNA in different manners that can be grouped into the (SSB)35 and the (SSB)65-binding modes (31). The (SSB)35-binding mode has ‘unlimited’ cooperative binding while the (SSB)65-binding mode promotes the ‘limited’ type of intertetramer cooperativity (31). In addition, negative cooperativity has also been observed for EcoSSB binding to ssDNA. The third and the fourth subunits of the SSB tetramer have reduced affinity to ssDNA (31). However, negative cooperativity has not been observed for PriB in this study, probably due to the dimeric structure of PriB and an ssDNA-binding mode that differs significantly from that of EcoSSB. PriB binding to ssDNA is a positive cooperative process (Table 2). Cooperativity can result from direct protein–protein interactions between nearest neighbors, such as the LAST motif in the T4 gene 32 protein (32).
Cooperativity can also result from protein-induced distortions of adjacent DNA as demonstrated by the Sulfolobus SSB (33). In the case of PriB, the binding activities and cooperativities are increased with longer ssDNA homopolymers (Table 2), and structural data indicate that PriB dimers cooperatively bind to the same ssDNA molecule (Figure 1B). Possibly, a PriB dimer binds and distorts the ssDNA structure. A second PriB then comes in and binds to the ssDNA. The binding of the second PriB dimer is likely the key step in forming the stable complex. Other forms of incompletely occupied PriB–ssDNA complexes may not be stable. Binding of PriB to various ssDNA, partial duplex DNA and forked DNA is inefficient at low protein concentrations except in the presence of PriA helicase (11). Previous data also indicate that PriB can stabilize the binding of PriA to ssDNA (7). These results suggest that cooperation between PriB and PriA helicase may be necessary for PriB and/or PriA to form stable complexes with ssDNA. Interestingly, PriA has a highly electropositive ssDNA-binding region (amino acids 1–198) containing 8 Lys and 14 Arg residues (34). It might serve a role similar to monomer B′ of the PriB dimer in our crystal to stabilize the partially disordered ssDNA. Possibly, PriA and PriB bind to ssDNA cooperatively, thereby decreasing the dissociation rate of PriA from the DNA during helix unwinding. Therefore, we propose that our in vitro assembled PriB–ssDNA complex may mimic the structure of the PriA–ssDNA–PriB complex. However, this speculation needs to be confirmed by further crystallographic study and biochemical experiments. In conclusion, the structure of the PriB–dT15 complex and the binding properties of PriB to oligonucleotides of various lengths provide molecular details on how PriB interacts with ssDNA. 
Moreover, our discovery that the ssDNA-binding properties of PriB differ from those of EcoSSB suggests a basis for the distinct roles of these proteins in DNA replication.

PDB accession code

Atomic coordinates and structure factors for the PriB–ssDNA complex have been deposited in the PDB under accession code 2CCZ.

ACKNOWLEDGEMENTS

We thank Dr Ming F. Tam for discussion and critical reading of the manuscript. We gratefully acknowledge access to the synchrotron radiation beamline 17B2 at the National Synchrotron Radiation Research Center (NSRRC) in Taiwan. This work was supported by research grants from Academia Sinica and the National Science Council (NSC94-2311-B-001-015 to C.-D.H.), Republic of China. Funding to pay the Open Access publication charges for this article was provided by National Science Council, Taiwan, Republic of China.

Conflict of interest statement. None declared.

REFERENCES

1. Cox M.M., Goodman M.F., Kreuzer K.N., Sherratt D.J., Sandler S.J., Marians K.J. 2000 The importance of repairing stalled replication forks. Nature 404, 37–41.
2. McGlynn P. and Lloyd R.G. 2002 Recombinational repair and restart of damaged replication forks. Nature Rev. Mol. Cell Biol. 3, 859–870.
3. Marians K.J. 2000 PriA-directed replication fork restart in Escherichia coli. Trends Biochem. Sci. 25, 185–189.
4. Sandler S.J. 2000 Multiple genetic pathways for restarting DNA replication forks in Escherichia coli K-12. Genetics 155, 487–497.
5. Sandler S.J. and Marians K.J. 2000 Role of PriA in replication fork reactivation in Escherichia coli. J. Bacteriol. 182, 9–13.
6. Heller R.C. and Marians K.J. 2005 The disposition of nascent strands at stalled replication forks dictates the pathway of replisome loading during restart. Mol. Cell 17, 733–743.
7. Liu J., Nurse P., Marians K.J. 1996 The ordered assembly of the ϕX174-type primosome. III. PriB facilitates complex formation between PriA and DnaT. J. Biol. Chem. 271, 15656–15661.
8. Liu J.H., Chang T.W., Huang C.Y., Chen S.U., Wu H.N., Chang M.C., Hsiao C.D. 2004 Crystal structure of PriB, a primosomal DNA replication protein of Escherichia coli. J. Biol. Chem. 279, 50465–50471.
9. Lopper M., Holton J.M., Keck J.L. 2004 Crystal structure of PriB, a component of the Escherichia coli replication restart primosome. Structure 12, 1967–1975.
10. Shioi S., Ose T., Maenaka K., Shiroishi M., Abe Y., Kohda D., Katayama T., Ueda T. 2005 Crystal structure of a biologically functional form of PriB from Escherichia coli reveals a potential single-stranded DNA-binding site. Biochem. Biophys. Res. Commun. 326, 766–776.
11. Cadman C.J., Lopper M., Moon P.B., Keck J.L., McGlynn P. 2005 PriB stimulates PriA helicase via an interaction with single-stranded DNA. J. Biol. Chem. 280, 39693–39700.
12. Chase J.W. and Williams K.R. 1986 Single-stranded DNA binding proteins required for DNA replication. Annu. Rev. Biochem. 55, 103–136.
13. Raghunathan S., Kozlov A.G., Lohman T.M., Waksman G. 2000 Structure of the DNA binding domain of E.coli SSB bound to ssDNA. Nature Struct. Biol. 7, 648–652.
14. Wang C.C., Chang T.C., Lin C.W., Tsui H.L., Chu P.B., Chen B.S., Huang Z.S., Wu H.N. 2003 Nucleic acid binding properties of the nucleic acid chaperone domain of hepatitis delta antigen. Nucleic Acids Res. 31, 6481–6492.
15. Wong I. and Lohman T.M. 1993 A double-filter method for nitrocellulose-filter binding: application to protein–nucleic acid interactions. Proc. Natl Acad. Sci. USA 90, 5428–5432.
16. Abbani M., Iwahara M., Clubb R.T. 2005 The structure of the excisionase (Xis) protein from conjugative transposon Tn916 provides insights into the regulation of heterobivalent tyrosine recombinases. J. Mol. Biol. 347, 11–25.
17. Grove D.E., Willcox S., Griffith J.D., Bryant F.R. 2005 Differential single-stranded DNA binding properties of the paralogous SsbA and SsbB proteins from Streptococcus pneumoniae. J. Biol. Chem. 280, 11067–11073.
18. Otwinowski Z. and Minor W. 1997 Processing of X-ray diffraction data collected in oscillation mode. Methods Enzymol. 276, 307–326.
19. Navaza J. 1994 AMoRe: an automated package for molecular replacement. Acta Crystallogr. D50, 157–163.
20. McRee D.E. 1999 XtalView/Xfit—a versatile program for manipulating atomic coordinates and electron density. J. Struct. Biol. 125, 156–165.
21. Read R.J. 1986 Improved Fourier coefficients for maps using phases from partial structures with errors. Acta Crystallogr. A42, 140–149.
22. Brünger A.T., Adams P.D., Clore G.M., DeLano W.L., Gros P., Grosse-Kunstleve R.W., Jiang J.S., Kuszewski J., Nilges M., Pannu N.S., et al. 1998 Crystallography and NMR system: a new software suite for macromolecular structure determination. Acta Crystallogr. D54, 905–921.
23. Laskowski R.A., MacArthur M.W., Moss D.S., Thornton J.M. 1993 PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Crystallogr. 26, 283–291.
24. Shamoo Y., Friedman A.M., Parsons M.R., Konigsberg W.H., Steitz T.A. 1995 Crystal structure of a replication fork single-stranded DNA binding protein (T4 gp32) complexed to DNA. Nature 376, 362–366.
25. Romer R., Schomburg U., Krauss G., Maass G. 1984 Escherichia coli single-stranded DNA binding protein is mobile on DNA: 1H NMR study of its interaction with oligo- and polynucleotides. Biochemistry 23, 6132–6137.
26. Ponomarev V.A., Makarova K.S., Aravind L., Koonin E.V. 2003 Gene duplication with displacement and rearrangement: origin of the bacterial replication protein PriB from the single-stranded DNA-binding protein Ssb. J. Mol. Microbiol. Biotechnol. 5, 225–229.
27. Raghunathan S., Ricard C.S., Lohman T.M., Waksman G. 1997 Crystal structure of the homo-tetrameric DNA binding domain of Escherichia coli single-stranded DNA-binding protein determined by multiwavelength x-ray diffraction on the selenomethionyl protein at 2.9-Å resolution. Proc. Natl Acad. Sci. USA 94, 6652–6657.
28. Theobald D.L., Mitton-Fry R.M., Wuttke D.S. 2003 Nucleic acid recognition by OB-fold proteins. Annu. Rev. Biophys. Biomol. Struct. 32, 115–133.
29. Murzin A.G. 1993 OB(oligonucleotide/oligosaccharide binding)-fold: common structural and functional solution for non-homologous sequences. EMBO J. 12, 861–867.
30. Bochkarev A., Pfuetzner R.A., Edwards A.M., Frappier L. 1997 Structure of the single-stranded-DNA-binding domain of replication protein A bound to DNA. Nature 385, 176–181.
31. Lohman T.M. and Ferrari M.E.
1994 Escherichia coli single-stranded DNA-binding protein: multiple DNA-binding modes and cooperativities Annu. Rev. Biochem . 63 527 – 570 Google Scholar Crossref Search ADS PubMed WorldCat 32. Casas-Finet J.R. , Fischer K.R., Karpel R.L. 1992 Structural basis for the nucleic acid binding cooperativity of bacteriophage T4 gene 32 protein: the (Lys/Arg)3(Ser/Thr)2 (LAST) motif Proc. Natl Acad. Sci. USA 89 1050 – 1054 Google Scholar Crossref Search ADS WorldCat 33. Kerr I.D. , Wadsworth R.I., Cubeddu L., Blankenfeldt W., Naismith J.H., White M.F. 2003 Insights into ssDNA recognition by the OB fold from a structural and thermodynamic study of Sulfolobus SSB protein EMBO J . 22 2561 – 2570 Google Scholar Crossref Search ADS PubMed WorldCat 34. Chen H.W. , North S.H., Nakai H. 2004 Properties of the PriA helicase domain and its role in binding PriA to specific DNA structures J. Biol. Chem . 279 38503 – 38512 Google Scholar Crossref Search ADS PubMed WorldCat Author notes " The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors " Protein Data Bank accession no. 2ccz © 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Gene function correlates with potential for G4 DNA formation in the human genome
Eddy, Johanna; Maizels, Nancy
doi: 10.1093/nar/gkl529
pmid: 16914419
ABSTRACT
G-rich genomic regions can form G4 DNA upon transcription or replication. We have quantified the potential for G4 DNA formation (G4P) of the 16 654 genes in the human RefSeq database, and then correlated gene function with G4P. We have found that very low and very high G4P correlates with specific functional classes of genes. Notably, tumor suppressor genes have very low G4P and proto-oncogenes have very high G4P. G4P of these genes is evenly distributed between exons and introns, and it does not reflect enrichment for CpG islands or local chromosomal environment. These results show that genomic structure undergoes selection based on gene function. Selection based on G4P could promote genomic stability (or instability) of specific classes of genes; or reflect mechanisms for global regulation of gene expression.
INTRODUCTION
Eukaryotic genomes contain characteristically G-rich regions, including single-copy genes; the rDNA; and repetitive sequences, such as the telomeres and the immunoglobulin heavy chain switch (S) regions of higher vertebrates. G-rich nucleic acids have the potential to form G-quadruplex or ‘G4 DNA’, a structure in which intra- or inter-strand interactions are stabilized by G-quartets, planar arrays of four guanines, paired by Hoogsteen bonding (1,2). G-quartets can stabilize a remarkable diversity of structures, in which the lengths and positions of the G-runs and the ‘loops’ separating them both contribute to overall topology (3,4). In the human genome, the number of distinct sites with potential to form G4 DNA is estimated at more than 300 000, and specific loop sequences are prominent at some of these sites (5,6). Key cellular processes are identified with repetitive G-rich chromosomal regions, where regulated formation of G4 DNA may contribute to biological function.
At the G-rich telomere tails, the presence of G4 DNA inhibits extension by telomerase, and proteins that bind specifically to telomeric sequences regulate the formation and resolution of G4 DNA (7–15). The G-rich immunoglobulin switch regions are sites of recombination that is critical to B-cell development and the immune response, and regulated transcription of the switch regions induces the formation of DNA structures targeted by factors essential to class switch recombination (16,17). G-rich regions can also be sites of unprogrammed genomic instability. Many B-cell lymphomas carry a translocation of the MYC proto-oncogene to the immunoglobulin heavy chain switch region (18), and the common translocation breakpoints map to G-rich regions of MYC that form structures similar to those formed by transcribed G-rich switch regions (19,20). Some of the most unstable human minisatellites are G-rich sequences predicted to form G4 DNA (21); and G4 DNA formation in vitro has been directly confirmed for two G-rich VNTRs, D4S43, and the insulin-linked hypervariable repeat (22). Reporter constructs carrying interstitial telomeric repeats display high levels of instability (23), which may be analogous to the instability of G-rich VNTRs. Specialized mechanisms may regulate the expression of G-rich genes at the levels of transcription, RNA processing and translation. Cotranscriptional RNA:DNA hybrid formation occurs readily within G-rich regions (19,24,25). Factors associated with RNA processing pathways, including THO/TREX and ASF/SF2, normally prevent cotranscriptional RNA:DNA hybrid formation, and promote gene expression; and genomic instability ensues in their absence (26,27). Factors involved in translational regulation may target RNA transcripts that contain G-quartets (28,29). Regions with the potential to form G4 DNA have been identified in the promoters of several proto-oncogenes, including c-MYC, VEGF, c-KIT and BCL2 (30–33). 
This has led to suggestions that formation or resolution of specific quadruplex structures may contribute to the regulation of gene expression, and prompted the design of therapeutics targeted to these structures, but the biological specificity of such compounds is yet to be established rigorously (34–39). Conserved and ubiquitous repair factors recognize G4 DNA, including the human RecQ family helicases BLM and WRN (40,41); the Saccharomyces cerevisiae RecQ family helicase Sgs1 (42); and the mismatch repair factor MutSα, a heterodimer of MSH2/MSH6 (16). RecQ family helicases maintain G-rich regions during replication. Sgs1 is required for nucleolar stability and replication of the G-rich rDNA (43,44); and in the absence of WRN helicase, telomeric sequence is lost due to impaired replication of the G-rich strand (45). The mismatch repair factor, MutSα, may cooperate with BLM helicase to promote the resolution of G4 DNA during replication (46). In immunoglobulin switch recombination, MutSα recognizes G4 DNA formed during transcription of the G-rich switch regions to promote their synapsis and recombination (16). Genomic regions with potential to form G4 DNA have been enumerated (4,5), but they have not been correlated with specific gene functions. The link between potential for G4 DNA formation and genomic instability suggests that the identification of human genes with relatively high or low potential to form G4 DNA might provide insights into the evolution of genomic structure, or identify mechanisms that could account for genomic instability in human malignancies. The possibility that G-richness can contribute to shared regulation suggests that genes with similar or related functions may share features of genomic structure. We therefore set out to determine the prevalence of G-rich sequences capable of forming G4 DNA among human genes, and to determine if particular functional classes of genes might be characterized by the presence or absence of G-rich regions. 
METHODS
G4P Calculator software
We developed a software program, ‘G4P Calculator’, which computes G4 DNA potential based on the density of runs of guanines in a sequence. The program evaluates runs of guanines in a sliding window and calculates the percentage of windows searched that meet the specified criteria. The criteria used are as follows: G-run length, ≥3; number of G-runs per window, ≥4; window length, 100 nt; and sliding interval length, 20 nt. The requirement for four or more runs of three or more guanines is based on studies of oligonucleotide folding [reviewed in (2)]. The 100 nt window size facilitates rapid analysis and is easily reduced (or enlarged) for rescan of specific genes of interest. The 20 nt sliding interval length is set such that sequences with the potential to form more than a single G4 DNA structure make correspondingly higher contributions to the total G4P. The last four windows analyzed are processed as windows of progressively smaller size (80, 60, 40 and 20 nt), and this does not affect the results as gene length is much larger than window size. Each DNA strand is evaluated independently. G4P is scored as a percentage, making it independent of sequence length. These criteria are similar to those used by others (4,5), and although not absolute, provide a means to compare G4P between different sequences. The software is written in C# to run on the Microsoft Windows XP operating system. The program and instructions are available on our laboratory website (http://depts.washington.edu/maizels9/). The source code is available upon request.
Sequence and GO data
Sequence data for the human RefSeq genes (NCBI 35 assembly) and associated GO terms were downloaded from the Ensembl database v.32 using BioMart (47) on 4/9/05. Additional Gene Ontology (GO) data were obtained from the GO website (www.geneontology.org) on 3/5/06. Flanking sequence data were downloaded from Ensembl v.34 on 14/10/05.
The cDNA sequences were downloaded from Ensembl v.37 on 19/3/06. For genes with multiple transcript variants, the first listed variant for each gene was evaluated to assess G4P of the cDNA. The median G4 DNA potential was calculated from 16 654 RefSeq genes. Since each gene may have several GO classifications, a total of 77 968 GO term assignments were sorted above or below the median, corresponding to 50.5% or 49.5% of the total. The 4524 RefSeq genes that have no GO classification (27% of the total) are equally distributed below and above the median.
Flanking sequence analysis
Analysis of ΔG4P included only genes for which both gene and flanking sequence were complete. Excluded from that analysis were 101 genes for which sequence determination was incomplete (more than four unidentified consecutive bases). The excluded genes are identified in Supplementary Table 1; none of the excluded genes was a known proto-oncogene or a tumor suppressor gene.
Statistical analysis
The Wilcoxon rank sum test was applied by using the statistics program R 2.2.1 (Wilcoxon test parameters: alternative = ‘two-sided’, paired = FALSE). The linear regression, single-factor ANOVA, and standard error analyses were performed using Microsoft Office Excel 2003. Owing to the skewed distribution of G4P, data were subjected to a natural log transformation before linear regression analysis. Genes with G4P = 0 were therefore not included. This resulted in the exclusion of 1152 genes from the correlation between DNA strands (Figure 1B); and of 905 genes from the analysis of G4P versus GC content (Figure 3A).
Figure 1. Potential for G4 DNA formation of human genes. (A) Distribution of genes across G4 DNA formation potential (G4P). The distribution of 16 654 RefSeq genes is illustrated by vertical bars (gray). Median G4P for the RefSeq genes at 5.0% is indicated by a dotted line.
The distribution of the 4396 GO terms assigned to 73% of the RefSeq genes is outlined (black). (B) Positive correlation of G4P of template and nontemplate DNA strands. Linear regression analysis of G4P of the nontemplate (y-axis) and the template (x-axis) strand. Owing to the skewed distribution of G4P, the data were subjected to a natural log transformation before linear regression analysis; therefore, a small number of genes with G4P equal to zero were not included. The slope determined by linear regression analysis (0.83) is represented by the solid line; and slope of unity by the dotted line.
CpG islands
The NewCpGReport software (48) was accessed and run from the website http://csc-fserve.hh.med.ic.ac.uk/emboss/newcpgreport.html. The tumor suppressor and proto-oncogene sequences were processed using the program's default settings: window size = 100; shift increment = 1; minimum length = 200; minimum observed/expected = 0.6; and minimum percent = 50.
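The scoring rule given under ‘G4P Calculator software’ can be illustrated with a short Python sketch. This is not the authors' C# program. It assumes that a G-run is a maximal stretch of guanines counted once regardless of length, and that windows are simply truncated at the end of the sequence, which yields final windows of 80, 60, 40 and 20 nt when the sequence length is a multiple of 20. The function names `g4p` and `reverse_complement` and the toy sequence are hypothetical.

```python
import re


def g4p(seq, window=100, step=20, min_run=3, min_runs=4):
    """Return G4P: the percentage of windows holding >= min_runs G-runs."""
    seq = seq.upper()
    # Assumption: a G-run is a maximal stretch of guanines, counted once
    # however long it is (GGGGGG is one run, not two).
    g_run = re.compile("G{%d,}" % min_run)
    # Windows start every `step` nt; slicing truncates the windows at the
    # end of the sequence, so the final windows become progressively smaller.
    starts = range(0, len(seq), step)
    if len(starts) == 0:
        return 0.0
    hits = sum(
        1 for s in starts
        if len(g_run.findall(seq[s:s + window])) >= min_runs
    )
    return 100.0 * hits / len(starts)


def reverse_complement(seq):
    """Reverse complement, so the opposite strand can be scored separately."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


if __name__ == "__main__":
    toy = "GGGA" * 4 + "A" * 84  # made-up 100 nt sequence, not real data
    print(g4p(toy))                      # one of the five windows is a hit
    print(g4p(reverse_complement(toy)))  # opposite strand has no G-runs
```

Scoring each strand with a separate call mirrors the statement that each DNA strand is evaluated independently; passing `window=40` would correspond to a rescan with a smaller window.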
RESULTS
G4 DNA formation potential of human genes
To score potential for G4 DNA formation, we developed software that analyzes overlapping windows of sequence, and scores each window that contains four or more runs of three or more guanines as a ‘hit’, then quantifies G4 DNA formation potential (G4P) as the percentage of hits in the total number of windows searched. We used this program, ‘G4P Calculator’, to evaluate the G4P of the entire transcribed sequence (exons and introns) of 16 654 human Reference Sequence (RefSeq) genes. Nontemplate and template strands were analyzed separately, to distinguish contributions of transcription-induced structure formation, which affects only the nontemplate strand; and replication-induced structure formation, which affects both strands. G4P of the nontemplate strand ranged from 0 to 79%, with a median of 5.0% and an average of 9.1% (Figure 1A and Supplementary Table 1). Linear regression analysis (Figure 1B) showed that G4P of the nontemplate and template strands is positively correlated (R² = 0.73), with a slope of less than unity (0.83). Thus, for most genes, there is a slightly lower potential for the formation of transcription-induced structures than replication-induced structures; although potential for the formation of structures on either strand is closely correlated.
Gene function corresponds with G4P
The skewed distribution of G4P over the RefSeq genes (Figure 1A) suggests that most genes cannot readily form G4 DNA structures, but that some genes may be highly susceptible. To identify functional classes of genes with high and low potential for G4 DNA formation, we evaluated the distribution of terms defined by the GO Consortium (49) for each RefSeq gene across the spectrum of G4 DNA potentials. In this classification scheme, 27% of genes currently have no GO terms assigned; and others have been assigned multiple GO terms and are represented in several different categories.
The distribution of the 4396 GO terms assigned to the human RefSeq genes across G4 DNA potential proved to be nearly identical to the distribution of genes (Figure 1A). Restricting further analysis to the 218 GO terms associated with 50 or more genes, 61 GO terms were identified for which 60% or more of the genes were below or above the RefSeq median G4P (24 low G4P and 37 high G4P); and application of the Wilcoxon rank sum test confirmed that these criteria were robust (Table 1 and Supplementary Table 2). Functions characterized by low G4P include G-protein-coupled receptors, sensory perception (especially olfaction), nucleosome assembly, nucleic acid binding, ubiquitin cycle, cell adhesion and cell division; whereas functions characterized by high G4P include transcription factor activity, development, cell signaling, muscle contraction, growth factors and cytokines. Figure 2 represents, for a subset of the GO terms identified with very low and very high G4P, the median and range of G4P relative to all RefSeq genes and all GO terms. In each case, the difference in distribution relative to the RefSeq genes was highly significant (Figure 2). This establishes a relationship between specific gene functions and potential for G4 DNA formation.

Table 1. Gene Ontology (GO) terms with low and high G4P

Low G4P, biological process
GO ID       GO description                                        No. of genes  P
GO:0007186  G-protein-coupled receptor protein signaling pathway  674           < 2E−16
GO:0007600  Sensory perception                                    436           < 2E−16
GO:0007608  Perception of smell                                   244           < 2E−16
GO:0007001  Chromosome organization and biogenesis                99            8E−11
GO:0006334  Nucleosome assembly                                   95            9E−11
GO:0006511  Ubiquitin-dependent protein catabolism                96            5E−05
GO:0006512  Ubiquitin cycle                                       210           0.0002
GO:0007156  Homophilic cell adhesion                              87            0.02
GO:0051301  Cell division                                         119           0.03
GO:0006470  Protein amino acid dephosphorylation                  115           0.08

Low G4P, molecular function
GO ID       GO description                        No. of genes  P
GO:0004984  Olfactory receptor activity           316           < 2E−16
GO:0003676  Nucleic acid binding                  615           1E−06
GO:0005488  Binding                               427           5E−06
GO:0016874  Ligase activity                       163           0.003
GO:0004197  Cysteine-type endopeptidase activity  55            0.006
GO:0004842  Ubiquitin–protein ligase activity     327           0.02
GO:0017111  Nucleoside triphosphatase activity    54            0.03
GO:0008026  ATP-dependent helicase activity       63            0.03

High G4P, biological process
GO ID       GO description                                                    No. of genes  P
GO:0007275  Development                                                       391           < 2E−16
GO:0006955  Immune response                                                   272           3E−09
GO:0007267  Cell–cell signaling                                               263           2E−06
GO:0006936  Muscle contraction                                                63            6E−05
GO:0006817  Phosphate transport                                               80            7E−05
GO:0007010  Cytoskeleton organization and biogenesis                          53            0.0002
GO:0009653  Morphogenesis                                                     104           0.002
GO:0007517  Muscle development                                                103           0.002
GO:0007218  Neuropeptide signaling pathway                                    66            0.01
GO:0006814  Sodium ion transport                                              85            0.02
GO:0006968  Cellular defense response                                         61            0.02
GO:0006954  Inflammatory response                                             163           0.02
GO:0006091  Generation of precursor metabolites and energy                    59            0.02
GO:0001501  Skeletal development                                              78            0.03
GO:0007169  Transmembrane receptor protein tyrosine kinase signaling pathway  67            0.04
GO:0008544  Epidermis development                                             62            0.04
GO:0006816  Calcium ion transport                                             63            0.08
GO:0006869  Lipid transport                                                   53            0.08
GO:0009887  Organogenesis                                                     50            0.09

High G4P, molecular function
GO ID       GO description                            No. of genes  P
GO:0003700  Transcription factor activity             752           < 2E−16
GO:0004295  Trypsin activity                          94            2E−07
GO:0004263  Chymotrypsin activity                     91            9E−07
GO:0030528  Transcription regulator activity          74            8E−05
GO:0030955  Potassium ion binding                     100           0.0005
GO:0008083  Growth factor activity                    116           0.003
GO:0005179  Hormone activity                          80            0.004
GO:0008289  Lipid binding                             75            0.008
GO:0003774  Motor activity                            58            0.02
GO:0015293  Symporter activity                        71            0.02
GO:0005249  Voltage-gated potassium channel activity  65            0.02
GO:0005125  Cytokine activity                         79            0.05
GO:0020037  Heme binding                              78            0.06

GO term ID number, description and number of genes to which this term applies, for each GO term associated with Biological Processes and Molecular Functions, and containing genes with a distribution that is significantly lower or higher in G4P than the RefSeq genes. Terms are sorted by ascending P-value (shown) as calculated by the Wilcoxon rank sum test. Additional data can be found in Supplementary Table 2.

Figure 2. G4P correlates with gene function. Ranges of G4P for all RefSeq genes (top line) compared with five GO terms overrepresented in low or high G4P. Boxes represent the percentage of genes in each GO category characterized by G4P in the range 0–1.25, 1.25–2.5, 2.5–5.0, 5–10, 10–20% and >20% (colors as indicated).
P-values shown on the right represent significance of the difference in distribution between each GO term and the RefSeq genes, as calculated by the Wilcoxon rank sum test.
Tumor suppressor genes are characterized by low G4P and proto-oncogenes by high G4P
Some of the gene functions characterized by low and high G4P (Figure 2 and Table 1) are associated with tumor suppressor genes and proto-oncogenes, respectively. This led us to interrogate the distribution of G4P with respect to genes in these two categories. A list of 55 tumor suppressor genes and 95 proto-oncogenes was compiled (Supplementary Table 3), using the Online Mendelian Inheritance in Man (OMIM) database as a primary source and confirming gene classification by search of the published literature. Comparison of G4P for tumor suppressor genes and proto-oncogenes established a clear and highly significant difference in the range of G4P observed (Wilcoxon rank sum test, P = 10⁻⁸; Figure 3A), and in the distribution of G4P for genes in these two categories relative to the 16 654 genes in the RefSeq database (Figure 3B). The distribution of tumor suppressor genes was shifted from the RefSeq median of 5.0% towards low G4 DNA potential with a median of 2.4% (Wilcoxon rank sum test, P = 4 × 10⁻⁵); and the distribution of proto-oncogenes was shifted towards high G4P with a median of 11.0% (Wilcoxon rank sum test, P = 7 × 10⁻⁵).
Figure 3. Contrasting G4P of tumor suppressor genes and proto-oncogenes. (A) Ranges of G4P for 55 tumor suppressor genes, 95 proto-oncogenes and all 16 654 RefSeq genes. Boxes represent the percentage of genes in each category characterized by G4P in the range 0–1.25, 1.25–2.5, 2.5–5.0, 5–10, 10–20% and >20% (colors as indicated). P-value represents significance of the difference in distribution between the tumor suppressor genes and proto-oncogenes, as calculated by the Wilcoxon rank sum test.
(B) Distribution of tumor suppressor genes and proto-oncogenes across G4P. Bars represent the G4P distribution of 55 tumor suppressor genes (blue) and 95 proto-oncogenes (red). The black outline shows the distribution of all 16 654 RefSeq genes (as in Figure 1A). P-values represent significance of the difference in distribution between each group of genes and the RefSeq genes, as calculated by the Wilcoxon rank sum test.

Table 2 shows the top 10 genes in each category, ranked according to G4P, using a high stringency 40 nt search window. Analysis of G4P using a 40 nt rather than a 100 nt search window decreased the numerical value of G4P for each individual gene, as expected (Table 2 and Supplementary Table 3), but did not affect the relative differences in distribution of the potentials, and further supported the significance of the difference between tumor suppressor genes and proto-oncogenes (Wilcoxon rank sum test, P = 2 × 10^−7).
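The rank sum comparisons reported above can be illustrated with a minimal sketch. It assumes a two-sided test using the large-sample normal approximation with a tie correction and no continuity correction; the text does not state which implementation or options were used, so the function names and the toy G4P values below are hypothetical, and the output is not expected to reproduce any reported P-value.

```python
from collections import Counter
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation, tie-corrected).

    Returns (U statistic for x, approximate two-sided P-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    rank_sum_x = sum(ranks[:n1])
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Tie correction: sum of (t^3 - t) over groups of t tied values.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value identical: no evidence of a shift
        return u, 1.0
    z = (u - mean_u) / var_u ** 0.5
    return u, 2 * NormalDist().cdf(-abs(z))


# Hypothetical G4P values (%) for two small gene groups, for illustration only.
low_group = [1.0, 2.0, 2.5, 3.0, 4.0]
high_group = [6.0, 9.0, 12.0, 15.0, 20.0]
u_stat, p_value = rank_sum_test(low_group, high_group)
print(f"U = {u_stat}, two-sided P = {p_value:.4f}")
```

For groups as small as the toy example, an exact test would be preferable to the normal approximation; the approximation is more appropriate for group sizes like the 55 and 95 genes compared in the text.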
Table 2 also shows representative GO terms assigned to each gene, many of which correspond to GO terms overrepresented in low or high G4P (Figure 2).

Table 2. Tumor suppressor genes with low G4P and proto-oncogenes with high G4P

Tumor suppressor genes
HGNC symbol | G4P, 100 nt window (%) | G4P, 40 nt window (%) | Representative GO terms
FBXW7 | 0.5 | 0.00 | Protein ubiquitination
MAD2L1 | 0.8 | 0.00 | Cell cycle, cell division
SMARCA3 | 0.8 | 0.00 | Ubiquitin–protein ligase activity, DNA binding
APC | 1.1 | 0.00 | Cell adhesion, negative regulation of cell cycle
BLM | 1.4 | 0.00 | DNA binding, DNA repair
THBS1 | 1.5 | 0.00 | Cell adhesion, cell motility
VHL | 2.7 | 0.00 | Protein ubiquitination, negative regulation of cell cycle
CDKN2B | 6.6 | 0.00 | Negative regulation of cell cycle
MLL3 | 0.9 | 0.02 | Ubiquitin–protein ligase activity, DNA binding
BRCA2 | 1.5 | 0.05 | Nucleic acid binding, DNA repair, regulation of cell cycle

Proto-oncogenes
HGNC symbol | G4P, 100 nt window (%) | G4P, 40 nt window (%) | Representative GO terms
FGF4 | 46.6 | 6.8 | Cell–cell signaling, growth factor activity
AKT1 | 45.9 | 5.3 | Anti-apoptosis, signal transduction
HRAS | 37.6 | 4.3 | Organogenesis, GTPase activity
IGF2 | 31.6 | 4.1 | Development, growth factor activity, hormone activity
BCL3 | 25.4 | 4.1 | Transcription, regulation of cell cycle
NOTCH1 | 39.2 | 3.9 | Transcription factor activity, epidermis development
NFKB2 | 27.6 | 3.5 | Transcription factor activity, signal transduction
FURIN | 27.8 | 3.4 | Cell–cell signaling
JUNB | 25.6 | 3.4 | Transcription factor activity
GLI1 | 20.3 | 2.8 | Development, transcription, signal transduction

The table lists the top 10 genes in each group, sorted by G4P (40 nt search window). The HGNC symbol and values for G4P (both 40 and 100 nt search windows) are shown, along with representative GO terms for each of the genes. The complete list of 55 tumor suppressor genes and 95 proto-oncogenes is available as Supplementary Table 3.

Simulations were carried out to verify the significance of the differences between G4P of tumor suppressor genes and proto-oncogenes. G4P distributions of 95 or 55 genes picked at random were not significantly different from the RefSeq set (P > 0.1), in each of 20 iterations. Furthermore, statistical significance of observed differences was robust to misclassification of up to 10 genes per category, as tested by the addition of 10 randomly selected RefSeq genes to either category (P < 0.002), or by elimination of 10 randomly selected genes from either category (P < 0.0002). These simulations confirmed the significance of the differences between G4P of tumor suppressor genes and proto-oncogenes.

Tumor suppressor genes and proto-oncogenes have similar numbers of CpG islands

CpG islands are associated with a majority of promoters of human genes (50), and CpG dinucleotides are targets for methylation leading to gene silencing [reviewed in (51)].
G4P is positively correlated with GC-content (Figure 4A), which could in principle reflect a local enrichment of CpG methylation sites. We tested this possibility by analyzing tumor suppressor genes and proto-oncogenes with the EMBOSS program ‘NewCpGReport’ (48), which identifies CpG-rich regions. The number of CpG-rich regions does not differ significantly between these two categories of genes (Wilcoxon rank sum test, P = 0.4; Figure 4B). Thus, the density of potential methylation sites does not distinguish tumor suppressor genes from proto-oncogenes, or account for the differences in G4P we have documented.

Figure 4. G4P correlates with GC-content but not CpG islands. (A) Correlation of G4P with GC-content. Linear regression analysis of G4P relative to total GC-content (left); the portion of GC-content contributed from G-runs or C-runs (center); the remaining GC-content contributed from Gs and Cs outside of G-runs or C-runs (right). The data were subjected to a natural log transformation before linear regression analysis; therefore, a small number of genes with G4P equal to zero were not included. The slopes determined by linear regression analysis are represented by solid lines. G4P correlates most closely with Gs and Cs within runs (center). (B) Distribution of tumor suppressor genes and proto-oncogenes relative to number of CpG islands. Closed bars, tumor suppressor genes; open bars, proto-oncogenes. P-value was determined by the Wilcoxon rank sum test comparing tumor suppressor genes to proto-oncogenes, and shows that there is not a significant relationship between gene function and number of CpG islands.

Both exon and intron sequences contribute to G4P

In human genes, the ratio of exon length to intron length is typically well below unity (52), so the measurement of G4P for an entire gene will largely reflect the contribution of intronic sequences. To distinguish contributions of exons and introns to G4P, we analyzed the G4P of one representative cDNA for each tumor suppressor gene and proto-oncogene (Figure 5 and Supplementary Table 3). The median G4P of cDNA sequences is 1.9% for tumor suppressor genes, and 7.6% for proto-oncogenes (Figure 5), in each case slightly lower than G4P for the entire gene (2.4 and 11%, respectively; Figure 3B). The difference between G4P for cDNAs in the two functional categories is highly significant (Wilcoxon rank sum test, P = 6 × 10^−6). Thus, both exon and intron sequences contribute to the significant differences in G4P characteristic of tumor suppressor genes and proto-oncogenes.

Figure 5. Differences in G4P of tumor suppressor and proto-oncogene cDNAs.
Distribution of tumor suppressor gene and proto-oncogene cDNA sequences across G4P. Bars represent tumor suppressor genes (closed bars) and proto-oncogenes (open bars). P-value was determined from the Wilcoxon rank sum test comparing tumor suppressor genes to proto-oncogenes.

The G4P of genes contrasts with their genomic environment

The human genome consists of large segments of fairly homogeneous GC-content, defined as isochores (53). With the availability of the human genome sequence, this definition has been honed further, and 100 kb segments of DNA sequence can be sorted into five isochore families with an average SD of ∼1% GC (54). Since G4P is positively correlated with GC-content (Figure 4A), we asked if G4P for each gene reflects its local genomic environment. To do this, we computed the difference in G4P for each of the RefSeq genes and its flanking sequences, ΔG4P = G4P − G4PFLANK, calculating G4PFLANK as the average G4P for 20 kb upstream and downstream of each gene (Supplementary Table 1 and Figure 6). The average ΔG4P for all RefSeq genes is 1.6%; thus on average, genes have greater G4P than their flanking sequences. Comparison of ΔG4P of the RefSeq genes, the proto-oncogenes and the tumor suppressor genes by a single-factor ANOVA showed that the three groups are distinct (ANOVA, P = 10^−5; Figure 6). The average ΔG4P for the set of tumor suppressor genes is −2.2%, much lower than that of the RefSeq genes (ANOVA, P = 5 × 10^−5).
In contrast, the average ΔG4P for the set of proto-oncogenes is 3.4%, higher than that of the RefSeq genes (ANOVA, P = 0.01), and considerably higher than that of the tumor suppressor genes (ANOVA, P = 5 × 10^−7). Thus, on average, tumor suppressor genes have lower G4P than their flanking sequences, and proto-oncogenes have higher G4P than their flanking sequences. Potential for G4 DNA formation therefore correlates with gene function, rather than local genomic environment, for both tumor suppressor genes and proto-oncogenes.

Figure 6. G4P of genes differs from G4P of genomic environment. Average G4P for genes; 20 kb flanking sequences (G4PFLANK); and ΔG4P, the difference between G4P for each gene and its flank. Gray bars, RefSeq genes; closed bars, tumor suppressor genes; and open bars, proto-oncogenes. Standard errors were determined by ANOVA for each analysis of the three groups of genes.

The prototypical tumor suppressor and proto-oncogenes, TP53 and MYC [reviewed in (55)], illustrate the relationship between G4P of genes and their flanking sequences. For TP53, G4P is 7.6%, slightly higher than the RefSeq median of 5%; and G4PFLANK is 12.1% (10.0% upstream and 14.1% downstream). Thus, ΔG4P of TP53 is −4.4%, low even among tumor suppressor genes. In contrast, for MYC, G4P is 18.6%, well above the RefSeq median; and G4PFLANK is 2.8% (3.3% upstream and 2.2% downstream). Thus, ΔG4P of MYC is 15.9%, considerably above the average RefSeq ΔG4P of 1.6%.
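The ΔG4P arithmetic for TP53 and MYC can be retraced with a short sketch. It assumes that G4PFLANK is the simple mean of the upstream and downstream flank values, which is consistent with the figures quoted in the text but is not spelled out there; the function name `delta_g4p` is hypothetical. Because the inputs below are the rounded percentages from the text, the results (about −4.45 and 15.85) differ in the last digit from the reported −4.4% and 15.9%, which were presumably computed from unrounded values.

```python
def delta_g4p(gene_g4p, upstream_g4p, downstream_g4p):
    """Return (delta, flank) where flank is the mean of the two flank values.

    Assumption: G4PFLANK is the unweighted mean of the upstream and
    downstream 20 kb G4P values; delta = gene G4P - G4PFLANK.
    All values are percentages.
    """
    flank = (upstream_g4p + downstream_g4p) / 2
    return gene_g4p - flank, flank


# Rounded values quoted in the text (percent).
genes = {
    "TP53": (7.6, 10.0, 14.1),
    "MYC": (18.6, 3.3, 2.2),
}

for name, (gene, upstream, downstream) in genes.items():
    delta, flank = delta_g4p(gene, upstream, downstream)
    print(f"{name}: G4PFLANK = {flank:.2f}%, deltaG4P = {delta:.2f}%")
```

A negative result means the gene has lower G4P than its surroundings (as for TP53), and a positive result means it has higher G4P (as for MYC).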
DISCUSSION

We have investigated the relationship between potential to form G4 DNA and gene function for the 16 654 human RefSeq genes. We find that there is a highly skewed distribution of G4P among human genes, and that there are robust correlations between G4P and gene function. Interrogation of the subset of 218 GO terms assigned to 50 or more genes showed that low G4P corresponds with functions including G-protein-coupled receptors, olfaction, nucleosome assembly, nucleic acid binding, ubiquitin cycle, cell adhesion and cell division; and high G4P with functions including transcription factor activity, development, cell signaling, growth factors and cytokines. These findings motivated interrogation of two contrasting gene categories defined by the OMIM database, tumor suppressor genes and proto-oncogenes, which showed that genes in these categories are distinguished by low and high G4P, respectively (Figure 3).

In contrast to the robust relationship between G4P and gene function, G4P did not correspond to any of several well-established parameters used to characterize genomic structure. G4P does correlate with GC-content, but not with the number of CpG islands (Figure 4). Both exons and introns contribute to the difference in G4P between tumor suppressor genes and proto-oncogenes (Figure 5). Furthermore, G4P does not reflect the local genomic environment (Figure 6). In fact, tumor suppressor genes have much lower G4P than would be predicted by their genomic environment as compared to the RefSeq genes, whereas proto-oncogenes have higher G4P than would be predicted. The most straightforward interpretation of these results is that genes with specific functions have undergone selection based on G4P.

One source of selective pressure that could contribute to determining G4P is suggested by the association between G-rich regions and genomic instability. Transcription-induced or replication-induced DNA structures can form within regions of high G4P, and if these structures are not faithfully resolved, the result may be genomic instability and impaired gene function. In this view, the low G4P of tumor suppressor genes could reflect evolution that minimized potential instability of genes which function to maintain genomic stability. There is considerable evidence for haploinsufficiency of tumor suppressor genes [reviewed in (56)], and this would contribute to pressure to minimize genomic instability.

Conversely, the high G4P that characterizes the proto-oncogenes would be predicted to contribute to their destabilization. Could instability provide a selective advantage? Under some circumstances, it may. Proto-oncogenes are transcribed in rapidly dividing cells and tissues. Transcription-induced structures have considerable potential to contribute to genomic instability (25,26), but they can form only within genes, which represent a relatively small fraction of genomic DNA. The high G4P of the proto-oncogenes would make them targets for transcription-induced destabilization. Proto-oncogenes encode key factors that promote cell proliferation and development, and impaired expression of a proto-oncogene could in turn diminish or prevent cell proliferation, either by decreasing expression of an essential factor, or by signaling cell death via apoptosis. Proto-oncogenes may therefore carry out a passive surveillance function, monitoring instability that specifically affects the transcribed fraction of the genome. This surveillance function would necessarily be vested in genes, rather than in the vast landscape of nontranscribed sequences, consistent with the clear differences between G4P of genes and their flanking sequences.

Another mechanism that may contribute to selection based on G4P is shared regulation. Sequences within promoter regions of several proto-oncogenes have been shown to form G4 DNA in vitro (29–32), and factors that bind G4 DNA have been implicated in both transcriptional and translational regulation (28,57). However, regulatory factors typically exert their effects within limited genomic regions, so commonality of short cis-regulatory elements is unlikely to provide a complete explanation for a feature of sequence composition that distinguishes both exons and introns, and extends throughout a gene (Figure 5). Similarly, G4P is unlikely to reflect selection for coding capacity, as this sort of selection would affect exons alone. Nonetheless, there does appear to be some selection against regions of high G4P within exons: in both gene categories, the median G4P of exons (cDNA sequences) was lower than that of the entire gene, which largely reflects introns: 1.9% versus 2.4% for tumor suppressor genes; and 7.6% versus 11% for proto-oncogenes (Figures 3B and 5). Thus high G4P may be disfavored in mature RNAs, as has been proposed previously (4); or incompatible with efficient translation or effective coding.

Several lines of evidence suggest that GC-content may broadly correlate with gene expression levels (58–60); in particular, GC-richness correlates with open chromatin structure, which may in turn facilitate transcription (61). Proto-oncogenes are rapidly transcribed during early development and in response to cell activation, and the high G4P of the proto-oncogenes might reflect GC-richness that contributes to high transcription levels of genes in this group.

The finding that potential for G4 DNA formation correlates robustly with specific gene functions suggests that G4P may be a useful parameter to include in global analyses of gene expression, regulation and interactions. Systems-based analyses of this sort should establish whether regulation could contribute to selection based on G4P.
ACKNOWLEDGEMENTS

We thank Audrey Qiuyan Fu and Paul Sampson for assistance with statistical analysis; John Newman for independent testing of the G4P Calculator software; and Evan Eichler for comments on the manuscript. This work was supported by NIH R01 GM65988 to N.M.; J.E. was supported by T32 CA009537. Funding to pay the Open Access publication charges for this article was provided by NIH R01 GM65988.

Conflict of interest statement. None declared.

REFERENCES

1. Sen D. and Gilbert W. (1988) Formation of parallel four-stranded complexes by guanine-rich motifs in DNA and its implications for meiosis. Nature, 334, 364–366.
2. Gellert M., Lipsett M.N., Davies D.R. (1962) Helix formation by guanylic acid. Proc. Natl Acad. Sci. USA, 48, 2013–2018.
3. Phan A.T., Kuryavyi V., Patel D.J. (2006) DNA architecture: from G to Z. Curr. Opin. Struct. Biol., 16, 288–298.
4. Hazel P., Parkinson G.N., Neidle S. (2006) Predictive modelling of topology and loop variations in dimeric DNA quadruplex structures. Nucleic Acids Res., 34, 2117–2127.
5. Huppert J.L. and Balasubramanian S. (2005) Prevalence of quadruplexes in the human genome. Nucleic Acids Res., 33, 2908–2916.
6. Todd A.K., Johnston M., Neidle S. (2005) Highly prevalent putative quadruplex sequence motifs in human DNA. Nucleic Acids Res., 33, 2901–2907.
7. Zahler A.M., Williamson J.R., Cech T.R., Prescott D.M. (1991) Inhibition of telomerase by G-quartet DNA structures. Nature, 350, 718–720.
8. Ishikawa F., Matunis M.J., Dreyfuss G., Cech T.R. (1993) Nuclear proteins that bind the pre-mRNA 3′ splice site sequence r(UUAG/G) and the human telomeric DNA sequence d(TTAGGG)n. Mol. Cell. Biol., 13, 4301–4310.
9. Fletcher T.M., Sun D., Salazar M., Hurley L.H. (1998) Effect of DNA secondary structure on human telomerase activity. Biochemistry, 37, 5536–5541.
10. LaBranche H., Dupuis S., Ben-David Y., Bani M.-R., Wellinger R.J., Chabot B. (1998) Telomere elongation by hnRNP A1 and a derivative that interacts with telomeric repeats and telomerase. Nature Genet., 19, 1–4.
11. Eversole A. and Maizels N. (2000) In vitro properties of the conserved mammalian protein hnRNP D suggest a role in telomere maintenance. Mol. Cell. Biol., 20, 5425–5432.
12. Enokizono Y., Konishi Y., Nagata K., Ouhashi K., Uesugi S., Ishikawa F., Katahira M. (2005) Structure of hnRNP D complexed with single-stranded telomere DNA and unfolding of the quadruplex by heterogeneous nuclear ribonucleoprotein D. J. Biol. Chem., 280, 18862–18870.
13. Paeschke K., Simonsson T., Postberg J., Rhodes D., Lipps H.J. (2005) Telomere end-binding proteins control the formation of G-quadruplex DNA structures in vivo. Nature Struct. Mol. Biol., 12, 847–854.
14. Zaug A.J., Podell E.R., Cech T.R. (2005) Human POT1 disrupts telomeric G-quadruplexes allowing telomerase extension in vitro. Proc. Natl Acad. Sci. USA, 102, 10864–10869.
15. Zhang Q.S., Manche L., Xu R.M., Krainer A.R. (2006) hnRNP A1 associates with telomere ends and stimulates telomerase activity. RNA, 12, 1116–1128.
16. Maizels N. (2005) Immunoglobulin gene diversification. Annu. Rev. Genet., 39, 23–46.
17. Larson E.D., Duquette M.L., Cummings W.J., Streiff R.J., Maizels N. (2005) MutSalpha binds to and promotes synapsis of transcriptionally activated immunoglobulin switch regions. Curr. Biol., 15, 470–474.
18. Pasqualucci L., Neumeister P., Goossens T., Nanjangud G., Chaganti R.S., Kuppers R., Dalla-Favera R. (2001) Hypermutation of multiple proto-oncogenes in B-cell diffuse large-cell lymphomas. Nature, 412, 341–346.
19. Duquette M.L., Handa P., Vincent J.A., Taylor A.F., Maizels N. (2004) Intracellular transcription of G-rich DNAs induces formation of G-loops, novel structures containing G4 DNA. Genes Dev., 18, 1618–1629.
20. Duquette M.L., Pham P., Goodman M.F., Maizels N. (2005) AID binds to transcription-induced structures in c-MYC that map to regions associated with translocation and hypermutation. Oncogene, 24, 5791–5798.
21. Wong Z., Wilson V., Patel I., Povey S., Jeffreys A.J. (1987) Characterization of a panel of highly variable minisatellites cloned from human DNA. Ann. Hum. Genet., 51, 269–288.
22. Weitzmann M.N., Woodford K.J., Usdin K. (1997) DNA secondary structures and the evolution of hypervariable tandem arrays. J. Biol. Chem., 272, 9517–9523.
23. Kilburn A.E., Shea M.J., Sargent R.G., Wilson J.H. (2001) Insertion of a telomere repeat sequence into a mammalian gene causes chromosome instability. Mol. Cell. Biol., 21, 126–135.
24. Reaban M.E. and Griffin J.A. (1990) Induction of RNA-stabilized DNA conformers by transcription of an immunoglobulin switch region. Nature, 348, 342–344.
25. Mizuta R., Iwai K., Shigeno M., Mizuta M., Ushiki T., Kitamura D. (2002) Molecular visualization of immunoglobulin switch region RNA/DNA complex by atomic force microscope. J. Biol. Chem., 278, 4431–4434.
26. Huertas P. and Aguilera A. (2003) Cotranscriptionally formed DNA:RNA hybrids mediate transcription elongation impairment and transcription-associated recombination. Mol. Cell, 12, 711–721.
27. Li X. and Manley J.L. (2005) Inactivation of the SR protein splicing factor ASF/SF2 results in genomic instability. Cell, 122, 365–378.
28. Darnell J.C., Fraser C.E., Mostovetsky O., Stefani G., Jones T.A., Eddy S.R., Darnell R.B. (2005) Kissing complex RNAs mediate interaction between the Fragile-X mental retardation protein KH2 domain and brain polyribosomes. Genes Dev., 19, 903–918.
29. Darnell J.C., Jensen K.B., Jin P., Brown V., Warren S.T., Darnell R.B. (2001) Fragile X mental retardation protein targets G quartet mRNAs important for neuronal function. Cell, 107, 489–499.
30. Simonsson T., Pecinka P., Kubista M. (1998) DNA tetraplex formation in the control region of c-myc. Nucleic Acids Res., 26, 1167–1172.
31. Sun D., Guo K., Rusche J.J., Hurley L.H. (2005) Facilitation of a structural transition in the polypurine/polypyrimidine tract within the proximal promoter region of the human VEGF gene by the presence of potassium and G-quadruplex-interactive agents. Nucleic Acids Res., 33, 6070–6080.
32. Rankin S., Reszka A.P., Huppert J., Zloh M., Parkinson G.N., Todd A.K., Ladame S., Balasubramanian S., Neidle S. (2005) Putative DNA quadruplex formation within the human c-kit oncogene. J. Am. Chem. Soc., 127, 10584–10589.
33. Dai J., Dexheimer T.S., Chen D., Carver M., Ambrus A., Jones R.A., Yang D. (2006) An intramolecular G-quadruplex structure with mixed parallel/antiparallel G-strands formed in the human BCL-2 promoter region in solution. J. Am. Chem. Soc., 128, 1096–1098.
34. Rezler E.M., Bearss D.J., Hurley L.H. (2003) Telomere inhibition and telomere disruption as processes for drug targeting. Annu. Rev. Pharmacol. Toxicol., 43, 359–379.
35. Gomez D., Lemarteleur T., Lacroix L., Mailliet P., Mergny J.L., Riou J.F. (2004) Telomerase downregulation induced by the G-quadruplex ligand 12459 in A549 cells is mediated by hTERT RNA alternative splicing. Nucleic Acids Res., 32, 371–379.
36. Burger A.M., Dai F., Schultes C.M., Reszka A.P., Moore M.J., Double J.A., Neidle S. (2005) The G-quadruplex-interactive molecule BRACO-19 inhibits tumor growth, consistent with telomere targeting and interference with telomerase function. Cancer Res., 65, 1489–1496.
37. Tauchi T., Shin-Ya K., Sashida G., Sumi M., Okabe S., Ohyashiki J.H., Ohyashiki K. (2006) Telomerase inhibition with a novel G-quadruplex-interactive agent, telomestatin: in vitro and in vivo studies in acute leukemia. Oncogene (in press).
38. Siddiqui-Jain A., Grand C.L., Bearss D.J., Hurley L.H. (2002) Direct evidence for a G-quadruplex in a promoter region and its targeting with a small molecule to repress c-MYC transcription. Proc. Natl Acad. Sci. USA, 99, 11593–11598.
39. Cogoi S. and Xodo L.E. (2006) G-quadruplex formation within the promoter of the KRAS proto-oncogene and its effect on transcription. Nucleic Acids Res., 34, 2536–2549.
40. Huber M.D., Lee D.C., Maizels N. (2002) G4 DNA unwinding by BLM and Sgs1p: substrate specificity and substrate-specific inhibition. Nucleic Acids Res., 30, 3954–3961.
41. Fry M. and Loeb L.A. (1999) Human werner syndrome DNA helicase unwinds tetrahelical structures of the fragile X syndrome repeat sequence d(CGG)n. J. Biol. Chem., 274, 12797–12802.
42. Sun H., Bennett R.J., Maizels N. (1999) The S. cerevisiae Sgs1 helicase efficiently unwinds G–G paired DNAs. Nucleic Acids Res., 27, 1978–1984.
43. Sinclair D.A. and Guarente L. (1997) Extrachromosomal rDNA circles – a cause of aging in yeast. Cell, 91, 1033–1042.
44. Versini G., Comet I., Wu M., Hoopes L., Schwob E., Pasero P. (2003) The yeast Sgs1 helicase is differentially required for genomic and ribosomal DNA replication. EMBO J., 22, 1939–1949.
45. Crabbe L., Verdun R.E., Haggblom C.I., Karlseder J. (2004) Defective telomere lagging strand synthesis in cells lacking WRN helicase activity. Science, 306, 1951–1953.
46. Yang Q., Zhang R., Wang X.W., Linke S.P., Sengupta S., Hickson I.D., Pedrazzi G., Perrera C., Stagljar I., Littman S.J., et al. (2004) The mismatch DNA repair heterodimer, hMSH2/6, regulates BLM helicase. Oncogene, 23, 3749–3756.
47. Hubbard T., Andrews D., Caccamo M., Cameron G., Chen Y., Clamp M., Clarke L., Coates G., Cox T., Cunningham F., et al. (2005) Ensembl 2005. Nucleic Acids Res., 33, D447–D453.
48. Rice P., Longden I., Bleasby A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16, 276–277.
49. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet., 25, 25–29.
50. Carninci P., Sandelin A., Lenhard B., Katayama S., Shimokawa K., Ponjavic J., Semple C.A., Taylor M.S., Engstrom P.G., Frith M.C., et al. (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nature Genet., 38, 626–635.
51. Klose R.J. and Bird A.P. (2006) Genomic DNA methylation: the mark and its mediators. Trends Biochem. Sci., 31, 89–97.
52. Lander E.S., Linton L.M., Birren B., Nusbaum C., Zody M.C., Baldwin J., Devon K., Dewar K., Doyle M., FitzHugh W., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
53. Bernardi G., Olofsson B., Filipski J., Zerial M., Salinas J., Cuny G., Meunier-Rotival M., Rodier F. (1985) The mosaic genome of warm-blooded vertebrates. Science, 228, 953–958.
54. Costantini M., Clay O., Auletta F., Bernardi G. (2006) An isochore map of human chromosomes. Genome Res., 16, 536–541.
55. Hanahan D. and Weinberg R.A. (2000) The hallmarks of cancer. Cell, 100, 57–70.
56. Payne S.R. and Kemp C.J. (2005) Tumor suppressor genetics. Carcinogenesis, 26, 2031–2045.
57. Lew A., Rutter W.J., Kennedy G.C. (2000) Unusual DNA structure of the diabetes susceptibility locus IDDM2 and its effect on transcription by the insulin promoter factor Pur-1/MAZ. Proc. Natl Acad. Sci. USA, 97, 12508–12512.
58. Versteeg R., van Schaik B.D., van Batenburg M.F., Roos M., Monajemi R., Caron H., Bussemaker H.J., van Kampen A.H. (2003) The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res., 13, 1998–2004.
59. Lercher M.J., Urrutia A.O., Pavlicek A., Hurst L.D. (2003) A unification of mosaic structures in the human genome. Hum. Mol. Genet., 12, 2411–2415.
60. Semon M., Mouchiroud D., Duret L. (2005) Relationship between gene expression and GC-content in mammals: statistical significance and biological relevance. Hum. Mol. Genet., 14, 421–427.
61. Gilbert N., Boyle S., Fiegler H., Woodfine K., Carter N.P., Bickmore W.A. (2004) Chromatin architecture of the human genome: gene-rich domains are enriched in open chromatin fibers. Cell, 118, 555–566.

© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.