Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE

Barrett C. Foat; Alexandre V. Morozov; Harmen J. Bussemaker

doi:10.1093/bioinformatics/btl223

Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE

Foat, Barrett C.; Morozov, Alexandre V.; Bussemaker, Harmen J. 2006-07-15 00:00:00 Vol. 22 no. 14 2006, pages e141–e149 BIOINFORMATICS doi:10.1093/bioinformatics/btl223 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE 1 2 1,3, Barrett C. Foat , Alexandre V. Morozov and Harmen J. Bussemaker 1 2 Department of Biological Sciences, Columbia University, New York, NY 10027, USA, Center for Studies in Physics and Biology, The Rockefeller University, New York, NY 10021, USA and Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10032, USA sidered to be bound by the TF (for review see Stormo, 2000). These ABSTRACT methods use the information content of nucleotide patterns as a Motivation: Regulation of gene expression by a transcription factor proxy for the free energy contributions of the bases found in the requires physical interaction between the factor and the DNA, which TF binding site (Berg and von Hippel, 1987; Stormo and Fields, can be described by a statistical mechanical model. Based on this 1998). Other computational methods infer physically-based TF model, we developed the MatrixREDUCE algorithm, which uses binding speciﬁcities from measured TF binding afﬁnities for a genome-wide occupancy data for a transcription factor (e.g. ChIP- small set of oligonucleotides (Liu and Clarke, 2002) or from struc- chip) and associated nucleotide sequences to discover the sequence- tural modeling of protein-DNA interaction (Paillard and Lavery, specific binding affinity of the transcription factor. Advantages of our 2004; Endres et al., 2004; Morozov et al., 2005). However, genome- approach are that the information for all probes on the microarray is scale, quantitative measurements of TF occupancies of intergenic efficiently utilized because there is no need to delineate ‘‘bound’’ and regions are now available due to the advent of in vivo chromatin ‘‘unbound’’ sequences, and that, unlike information content-based immunoprecipitation microarrays (Ren et al., 2000; Iyer et al., methods, it does not require a background sequence model. 2001; Lieb et al., 2001; Simon et al., 2001; Lee et al., 2002; Results: We validated the performance of MatrixREDUCE by inferring Harbison et al., 2004), in vitro protein binding microarrays the sequence-specific binding affinities for several transcription factors (PBM; Mukherjee et al., 2004), and DNA immunoprecipitation in S. cerevisiae and comparing the results with three other independent microarrays (DIP-chip; Liu et al., 2005). Thus, it is no longer sources of transcription factor sequence-specific affinity information: necessary to rely on small data sets, availability of protein-DNA (i) experimental measurement of transcription factor binding affinities structures, or the analogy between information content and statis- for specificoligonucleotides,(ii) reporter geneassays forpromoterswith tical mechanics to infer free energy representations of transcription systematically mutated binding sites, and (iii) relative binding affinities factor binding sites. obtained by modeling transcription factor-DNA interactions based on We have developed a method, implemented as the program co-crystal structures of transcription factors bound to DNA substrates. MatrixREDUCE (Foat et al., 2005), that infers the sequence spe- We show that transcription factor binding affinities inferred by ciﬁcity of a TF directly and accurately from genome-wide TF occu- MatrixREDUCE are in good agreement with all three validating pancy data by ﬁtting a statistical mechanical model for TF-DNA methods. interaction (Figure 1). The sequence speciﬁcity of the TF’s DNA- Availability: MatrixREDUCE source code is freely available for binding domain is modeled using a position-speciﬁc afﬁnity matrix non-commercial use at http://www.bussemakerlab.org/. The software (PSAM), representing the change in the binding afﬁnity (K ) when- runs on Linux, Unix, and Mac OS X. d ever a speciﬁc position within a reference binding sequence is Contact: [email protected] mutated. To validate the physical model of MatrixREDUCE, we discovered the PSAMs for several TFs in S. cerevisiae and com- 1 INTRODUCTION pared the results with three other independent sources of TF sequence-speciﬁc afﬁnity information: (i) experimentally measured The sequence-speciﬁc regulatory activity of a transcription factor K ’s as determined by in vitro methods (Gailus-Durner et al., 1996; (TF) is the result of energetically favorable interactions between the Liu and Clarke, 2002; Pierce et al., 2003), (ii) lacZ reporter assays amino acids exposed in the DNA binding domain and portions of for promoters with systematically mutated binding sites (Gailus- nucleic acid bases exposed in the grooves of the DNA. A compu- Durner et al., 1996; Pierce et al., 2003), and (iii) relative K ’s tational method for discovering the binding speciﬁcity of a TF obtained by using a physical model of protein-DNA interaction cannot provide a quantitative description of TF binding unless it that makes binding afﬁnity predictions starting from a co-crystal considers the physical underpinnings of the TF-DNA interaction. structure of the protein-DNA complex (Morozov et al., 2005). We Most physically motivated computational methods discover over- ﬁnd a surprising level of agreement between MatrixREDUCE- represented patterns in a set of nucleotide sequences that are con- predicted TF binding afﬁnities, experimental measurements, and To whom correspondence should be addressed. structural predictions, suggesting that MatrixREDUCE is a The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] B.C.Foat et al. Fig. 1. The flow of data. A microarray measurement of TF occupancies (ChIP-chip, PBM, DIP-chip, or differential mRNA expression data) and relevant nucleotide sequences for each microarray feature are used as input to MatrixREDUCE. MatrixREDUCE performs a least-squares fit to a statistical-mechanical model of TF-DNA interaction to discover the relative contributions to the free energy of binding for each nucleotide at each position in the generalized TF binding site. These contributions are represented as a position specific affinity matrix (PSAM) containing the relative equilibrium constants of the TF-DNA interaction, with the highest affinity nucleotide at each position scaled to a value of one (DDG ¼ 0). The PSAM can be converted into an affinity logo that graphically represents the DDG’s for each nucleotide at each position relative to the averageDDG at the respective positions. The PSAM can also be used to predict the relative TF occupancy of any nucleotide sequence, allowing the PSAMs inferred by MatrixREDUCE to be compared with experimental measurements of TF binding affinities for particular oligonucleotides. powerful and accurate tool for the elucidation of physically accurate data were b-galactosidase activities for genes containing mutated TF sequence-speciﬁc binding afﬁnities. binding sites of the E. coli su2 amber stop codon suppressor. A similar type of analysis was performed by Liu and Clarke (2002) who ﬁt a physical model for transcription factor binding to elec- 2 RELATED WORK trophoretic mobility shift assay (EMSA) data measuring afﬁnity of In contrast to information theory-based methods of deﬁning the S. cerevisiae Leu3 TF for several oligonucleotides. The physical nucleotide-binding protein speciﬁcities, MatrixREDUCE belongs model behind MatrixREDUCE is the same as that employed by to a small but growing class of methods that infer binding afﬁnities Stormo et al. (1986) and Liu and Clarke (2002). However, our by directly ﬁtting a physical model to experimental data. The ﬁrst ‘‘quantitative data’’ are microarray probe intensities, which mea- such method was introduced by Stormo et al. (1986) who noted, sure TF occupancy over long chromosomal regions with unknown ‘‘When quantitative data are known for many sequences one can binding site locations. Thus, the MatrixREDUCE model integrates the binding signal over the entire length of the sequence. The solve for the matrix elements that give the best ﬁt between the GOMER method of Granek and Clarke (2005) performs a similar sequences and those data.’’ For Stormo et al. (1986) the quantitative e142 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE The occupancy NðU Þ for the entire promoter region U of gene g equals the integration of signal over long regulatory sequences relevant to g g sum of occupancies for each binding site window of length L at each measured microarray intensities. However, GOMER was only position i over the length L of the sequence U : g g used to test hypotheses about the regulatory mechanisms of TFs L L þ1 for which a binding site weight matrix had already been deﬁned by g w L X Y NðU Þ¼ ½PK ðS Þ w ‚ ð9Þ g a ref jU ði þ j 1Þ other methods. Granek and Clarke (2005) did not attempt to ﬁt the g i¼1 j¼1 GOMER model directly to experimental data to infer the binding afﬁnities of TFs. Finally, also of note are the QPMEME algorithm of where U ðiÞ is the base at position i in sequence U . g g Djordjevic et al. (2003) and the work of Djordjevic and Sengupta (2006) which use maximum likelihood procedures to infer PSAMs 3.2 Modeling genome-wide TF occupancy data by ﬁtting physical models to known TF binding sites and SELEX Recent technologies such as ChIP-chip (Ren et al., 2000; Iyer et al., 2001; data, respectively, but rely on the prior delineation of ‘‘bound’’ Lieb et al., 2001; Simon et al., 2001; Lee et al., 2002; Harbison et al., 2004), sequences. PBM (Mukherjee et al., 2004), and DIP-chip (Liu et al., 2005) provide indirect but quantitative information about the TF occupancy of large genomic regions. For each segment of DNA there are two microarray inten- 3 METHODS test test sities. The test intensity I is equal to a background intensity a plus a 3.1 Modeling TF-DNA interaction term that, to ﬁrst approximation, is proportional (g) to the occupancy NðU Þ by the TF, either because the amount of TF bound to the probe contributes We will develop the statistical-mechanical model used by MatrixREDUCE directly to the signal intensity (PBM) or because it determines the proportion starting with a transcription factor P that binds to a DNA sequence S to form at which an immunoprecipitated TF-DNA fragment is present in the sample the TF-DNA complex PS: control (ChIP-chip or DIP-chip). The control intensity I is only the result of on control background signal a . Allowing for experimental noise e , we obtain: P þ S PS ð1Þ test test off gNðU Þþ a g g ¼ þ e bNðU Þþ C þ e ð10Þ g g g control control The afﬁnity of the TF for the sequence can be expressed in terms of its a equilibrium dissociation constant K ðSÞ: Using Equation 9 for the occupancy NðU Þ, we obtain: ½P½S k off DG/RT K ðSÞ¼ ¼ ¼ e ‚ ð2Þ L L þ1 test g w L ½PS k w on X Y ¼ F w þ C þ e ‚ ð11Þ jU ði þ j 1Þ g control g which is directly related to DG, the Gibbs free energy of binding per mole g i¼1 j¼1 (R is the gas constant and T is temperature). The occupancy N(S) of sequence S by transcription factor P can be expressed as the concentration of TF-DNA where complex divided by the total concentration of DNA (bound or unbound): F ¼ b½PK ðS Þ: ð12Þ a ref ½PS ½P NðSÞ¼ ¼ : ð3Þ Note that b,[P], and K ðS Þ cannot be determined separately without a ref ½PSþ½S ½Pþ K ðSÞ additional information such as the real protein concentration or K ðS Þ. a ref For simplicity, we will assume that the TF concentration [P] is much smaller MatrixREDUCE discovers the set of w elements as well as F and C by jb than K ðSÞ. This assumption seems physiologically plausible because in this performing a least squares ﬁt to the measured intensity ratios: regime, the highest afﬁnity binding sites in the genome will be the most responsive to a change in the nuclear concentration of active TF. Thus, the ðC‚F‚fw gÞ ¼ argmin jb C‚ F‚ fw g jb occupancy becomes: ½P NðSÞ ¼½PK ðSÞ‚ ð4Þ L L þ1 a test g w L X X Y K ðSÞ F w C : ð13Þ jU ði þ j 1Þ control g where g g i¼1 j¼1 K ðSÞ K ðSÞ: ð5Þ The 4 · L matrix of K ratios w (3L parameters plus L reference w a jb w w Consider a single point mutation from the original reference sequence S ref nucleotide values) for all nucleotides at all positions in the binding site is to base b at position j resulting in the mutated sequence S . Such a mutation mut referred to as the position speciﬁc afﬁnity matrix (PSAM). Each position j in will give rise to an additive change DDG in the free energy of binding or, the PSAM is rescaled such that the largest w is equal to unity, without loss jb equivalently, a multiplicative change w in K ðS Þ: jb a ref of generality. Differential mRNA expression microarray data, which measures the K ðS Þ a mut DDG/RT w ¼ ¼ e ‚ ð6Þ jb change in mRNA concentrations in cells from two different experimental K ðS Þ a ref conditions, can be used in place of genome-wide TF occupancy data. where This substitution is reasonable since, to ﬁrst approximation, the transcrip- tion rate of genes is proportional to the total TF occupancy along the asso- DDG ¼ DGðS Þ DGðS Þ: ð7Þ ref mut ciated promoter regions. Genome-wide occupancy data is preferable, To be able to generalize the binding of transcription factor P to a sequence however, since it is a more direct measure of TF-DNA interaction and S with more than one point mutation, we assume that the free energy mut since the design of the experiments provides the TF identities for the contributions for each position in the binding site are independent (Benos discovered PSAMs. et al., 2002) and therefore additive. Equivalently, we can multiply the w ’s jb for any nucleotide sequence to obtain the overall K ðS Þ/K ðS Þ ratio. a mut a ref 3.3 MatrixREDUCE implementation and Thus, the occupancy of a particular binding site S of length L with mut w parameters nucleotide sequence S ð1‚2‚.. . ‚L Þ¼ðb ‚b ‚.. . ‚b Þ is: mut w 1 2 Lw w MatrixREDUCE was implemented in Perl and C as outlined above and as NðS Þ¼½PK ðS Þ w : ð8Þ mut a ref jS ðjÞ previously described (Foat et al., 2005) with some modiﬁcations. mut j¼1 Brieﬂy, MatrixREDUCE takes microarray intensities and corresponding e143 B.C.Foat et al. nucleotide sequence data as input. It ﬁrst ﬁnds a gapped dyad motif (e.g. Leu3: standards’’ to assess the validity of our MatrixREDUCE model. The CCG-4nt-CGG), out of all possible dyad motifs of a ﬁxed number of electrophoretic mobility shift assay (EMSA) is able to provide direct nucleotides and a range of gap sizes, whose occurrences best correlate estimates of K ’s for a TF binding to particular oligonucleotides (Fried with the measured intensities for the same sequences. The best dyad motif and Crothers, 1981). The ratio of the EMSA-measured K of a reference is then converted into a seed matrix by ﬁlling in the gap with N’s and oligonucleotide S to the K of one of the other tested oligonucleotides S ref d mut extending out a user deﬁned number of ﬂanking N’s on either side of the provides the same information as the product across the MatrixREDUCE best-scoring dyad. In the 4 · L seed matrix, acceptable nucleotides (all PSAM over the same sequence for the same TF. In the simplifying scenario nucleotides for N’s, a single speciﬁc nucleotide at positions within the top where the length of the oligonucleotides is the same as the length L of the scoring motif) are given K ratios of one and unacceptable nucleotides are PSAM, we have given a very small K ratio w . This seed matrix serves as the starting point a min Lw K ðS Þ d ref for a quasi-Newton numerical minimization of Equation 13 to ﬁnd the optimal ¼ w : ð14Þ jS ðjÞ mut K ðS Þ d mut PSAM. The new version of MatrixREDUCE uses a k-fold cross-validation to j¼1 determine the signiﬁcance of each discovered PSAM. After converging on a While the biological processes involved are considerably more complex, PSAM, the input data is split into k random subsets of array features with lacZ expression data can be employed to the same end. If we assume that associated sequences. The optimal PSAM is then used to seed each of k re-- b-galactosidase activity, concentration of b-galactosidase, the amount of optimizations of the PSAM. A t-value (Pearson correlation) for the goodness mRNA expressed, the speciﬁc recruitment of RNA polymerase to the of ﬁt is calculated for the optimal PSAM of each subset. Finally, the P-value promoter, and the promoter occupancy by the TF are all proportional to corresponding to the average t-value for the k re-optimizations is used to test each other, then relative K ’s are reﬂected in the ratio of b-galactosidase whether the originally optimized PSAM should be kept. This procedure does activities between the assay using the reference binding site and another not test the signiﬁcance of the optimal PSAM itself, but rather it tests whether assay using a different binding site. Thus, we used lacZ reporter expression the data contains widely distributed, explainable variance. Thus, false PSAMs assay data in a similar manner to EMSA-derived K data to conﬁrm the due to a few outliers are prevented. While not relevant to the current study, results of MatrixREDUCE. MatrixREDUCE can iteratively build a linear model of multiple PSAMs that Experimentally determined in vitro binding afﬁnities and lacZ reporter best explain a particular data set (see Foat et al., 2005). expression activity data were gathered from publications. The K data and The parameters for the runs of MatrixREDUCE were as follows: For lacZ expression data for Abf1 are from Gailus-Durner et al. (1996); K data all runs, the length of each of the two dyads of the seed motifs was three, for Leu3 are from Liu and Clarke (2002); and K data and lacZ expression the length of the added ﬂanks on each side of the dyad was three, the minimum data for Ndt80 and Sum1 are from Pierce et al. (2003). gap was zero, the k cross-validations were two, and w was 10 . For all runs min To compare the experimental K measurements with MatrixREDUCE on ChIP-chip and PBM data, the maximum acceptable P-value was 10 and PSAMs, all experimental K and lacZ expression data was ﬁrst converted to the maximum dyad gap was twenty. For all runs on DIP-chip data, the K ratios by normalizing with respect to the value of the highest afﬁnity maximum acceptable P-value was 10 and the maximum dyad gap was oligonucleotide. The K ratios were then log-transformed to obtain the DDG ten. For all runs on differential mRNA expression data, the maximum accept- values. MatrixREDUCE PSAMs for each TF were converted to DDG’s able P-value was 10 and the maximum dyad gap was eleven. relative to the highest afﬁnity oligonucleotide from the respective experiment. The sum of the DDG values was calculated for the best 3.4 Microarray and sequence data PSAM-matching window in each of the experimentally tested sequences. All microarray data was gathered from publication supplements. We chose If a sequence was shorter than the PSAM, the sum was taken over only the speciﬁc TFs to analyze based on the availability of experimental K data or best matching positions within the PSAM. All experimental DDG’s were crystal structure data. PSAMs were inferred by MatrixREDUCE for chro- then compared to the PSAM DDG’s by plotting and by calculating Pearson matin immunoprecipitation microarrays (ChIP-chip) using the microarray correlations. data and microarray feature sequences from Harbison et al. (2004). These BioProspector (Liu et al., 2001) and MDscan (Liu et al., 2002) are popular ChIP-chip experiments were performed under a variety of culture conditions, information theory-based methods for determination of TF binding speci- including rich media (YPD); sulfometuron methyl (SM), an inhibitor of ﬁcities. To compare the quality of the results from these methods with amino acid biosynthesis; and treatment with rapamycin (RAPA). PSAMs MatrixREDUCE results, position-speciﬁc scoring matrices (PSSMs) were were inferred for PBM experiments using the microarray data from Mukher- derived from BioProspector and MDscan outputs by calculating the frequen- jee et al. (2004) and the feature sequence data from Harbison et al. (2004) as cies of each base at each position in the putative binding sites and then the two studies used the same array features. PSAMs were inferred for Leu3 dividing by a background frequency for each respective base. Two different using the DIP-chip microarray data and feature sequences from Liu et al. background frequencies were tested: equal nucleotide probabilities and nuc- (2005). Liu et al. (2005) performed DIP-chip experiments using two different leotide probabilities for intergenic sequences in S. cerevisiae. Once the concentrations of Leu3, 4nM and 40nM, and PSAMs were inferred for each PSSMs had been created, they were tested against experimental EMSA concentration. The PSAM for Ndt80 was inferred from differential mRNA and lacZ data in the same manner as the MatrixREDUCE PSAMs above. expression microarray data measuring the sporulation response in a ndt80 deletion strain versus a wild-type strain (Chu et al., 1998). The sequence data 3.6 Structural modeling for the Ndt80 PSAM inference was the 800 bp upstream of every yeast gene, DNA binding afﬁnities and speciﬁcities of TFs are determined by the forces retrieved from the Saccharomyces Genome Database (Issel-Tarver et al., of electrostatics, solvation, the hydrogen bonding patterns, and shape 2002) and purged of redundant sequences as previously described (Foat complementarity at the binding interface. The magnitude of these contribu- et al., 2005). All microarray intensities were analyzed as the ratio of the tions to the binding free energy can in principle be calculated given a experimental sample intensity to the control sample intensity with the excep- structure of the protein bound to its cognate DNA site. Therefore, it should tion of the ndt80 deletion data, which was analyzed as the log -ratio. All be possible to predict PSAMs starting from the experimentally available microarray data was purged of extreme outliers before being analyzed by structure of the protein-DNA complex (solved by either X-ray diffraction or MatrixREDUCE (Grubbs’ test, P-value ¼ 10 ; Grubbs, 1969). NMR), or, in the absence of the exact structure, from a suitably constructed homology model. Under the assumption that the base pair energies 3.5 Gel shift and lacZ expression data contribute approximately independently to the total binding afﬁnity While prone to their own inaccuracies, experimentally measured in vitro (Benos et al., 2002), all one-point base pair mutations are introduced binding afﬁnities and changes in lacZ expression served as our ‘‘gold into the DNA binding site. Protein-DNA binding energies DG ¼ e144 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE G G G are then evaluated for each mutation. Mutations in number of contacts above which the maximum energy penalty E is protdna prot dna max the reference binding site result in changes of protein-DNA binding energies imposed. f ðN Þ and f ðN Þ are ﬁxed prefactors deﬁned in Morozov max max 1 2 (DDG; Equation 7). A table of DDG values can be used to construct a PSAM et al. (2005). E together with N constitute the free parameters of the max max that is directly comparable with MatrixREDUCE predictions. contact model and are adjusted simultaneously to maximize the fraction of We have previously developed two alternative approaches for predicting correct predictions and minimize the average error over the DDG data set TF binding afﬁnities and speciﬁcities starting from the protein-DNA identical to that used in ﬁtting the all-atom model. The fraction of correct structure (Morozov et al., 2005). In one approach, the ‘‘all atom model’’ predictions is based on a binary function: a prediction is considered to be (which builds on the ROSETTA protein-nucleic acid interaction model of correct if both computational and experimental DDG’s are less than 1.0 kcal/ Havranek et al., 2004), both direct and indirect readout mechanisms con- mol, or greater than 1.0 kcal/mol, or else separated by less than 0.3 kcal/mol. tribute to the recognition of the DNA binding site: The global minimum for the ﬁt is found by exhaustive search; the best ﬁt is DG ¼ DG þ DG . Direct readout is mediated by protein amino obtained with N ¼ 15, E ¼ 3:0 kcal/mol. direct indirect max max acid-DNA base interactions, while indirect readout is encoded in the 3.7 Affinity logos shape of the DNA site imparted by the bound protein, primarily through non-speciﬁc amino acid-DNA phosphate backbone contacts. Direct protein- Information content-based weight matrices are usually displayed as DNA interactions are modeled as a linear combination of the repulsive and sequence logos (Schneider and Stephens, 1990) However, MatrixREDUCE attractive parts of the Lennard-Jones potential, the orientation-dependent weight matrices are discovered without a background sequence model. Thus, hydrogen bonding potential (Kortemme et al., 2003), and the Generalized an appropriate logo should display the actual relative free energies of binding Born electrostatics and solvation model (Onufriev et al., 2004): for each nucleotide at each position rather than information content. There- fore, we created afﬁnity logos, which are constructed as follows: For each DG ¼ w E þ w E direct LJrep LJrep LJattr LJattr ð15Þ position in the PSAM, the average DDG is calculated. Then, the difference þ w E þ w G ‚ hb hb el el between each individual DDG and the average DDG at that position is where each term is a sum over all protein-DNA and protein-protein atomic computed; the absolute value of this difference is the height of the character pairs, and {w} is a set of ﬁtting weights. Indirect readout is modeled using an representing that nucleotide. If the difference is positive (more favorable effective harmonic representation of the DNA conformational energy (Olson than average), the letter is placed above a horizontal black line through the et al., 1998): center of the logo. If the difference is negative (less favorable than average) X X the letter is placed below the black line. Larger letters are stacked on smaller ab ab DG ¼ w E þ w E ‚ ð16Þ indirect dnabp dnabs dnabp dnabs letters moving outward from the black line. The height of the letter can be bp bs interpreted as free energy difference from the average in units of RT. Thus, where the ﬁrst sum is over all base pairs in the DNA site (a, b denote bases in an intuitive high amplitude is given to the nucleotide positions that most a base pair), and the second sum is over all consecutively stacked base pair contribute to the sequence speciﬁcity of the TF. To highlight that the steps (a, b denote base pairs in a base step). Base pairs and base steps are characters representing the high afﬁnity nucleotides are above the black 0 0 counted once in the 5 to 3 direction. The ﬁrst term penalizes deviations line, the characters representing the low afﬁnity nucleotides are made from canonical base pairing, while the second term captures base stacking partially transparent. However, maintaining the representation of the poor energies. The quadratic energy terms are given by: afﬁnity nucleotides below the center line allows the viewer to immediately see which nucleotide substitutions are most unfavorable to binding. 6 6 X X ab ab a b E ¼ f d d ‚ ð17Þ dnabs/dnabp ij i j i¼1 j¼1 3.8 PSAM to PSAM alignments and correlations By inspection of afﬁnity logos, one can make qualitative observations about where the sums run over six geometric degrees of freedom (Twist, Tilt, the similarity between any two PSAMs. However, a quantitative measure of Roll, Shift, Slide and Rise for base pair steps; Opening, Buckle, Propeller, similarity allows for more objective comparisons. Before two PSAMs can be Shear, Stretch and Stagger for base pairs; Lu and Olson, 2003). The DNA compared, they must ﬁrst be aligned. Pearson correlations were calculated potential is a quadratic expansion in d (deviations of the degrees of free- between the DDG values for each nucleotide at each position for every dom from their average values computed using a set of non-homologous possible overlap of the two PSAMs for both the forward and the reverse protein-DNA complexes). The force constants f are evaluated by inverting ij complement alignments. After the best overlap position and strand was the covariance matrix of d obtained with the same protein-DNA dataset: determined from the best correlation P-value, the DDG’s of the two f ¼hd d i. All six weights are simultaneously ﬁt to experimental DDG i j ij PSAMs were recentered relative to a common reference consensus data using a generalized linear model (implemented in the statistical soft- ware package R): ðw ‚w ‚w ‚w ‚w ‚w Þ¼ sequence. Finally, the P-value for the Pearson correlation between the LJrep LJattr hb el dnabp dnabs ð0:00‚0:46‚0:77‚0:27‚ 0:03‚0:03Þ. No conformational ﬂexibility is allowed two optimally aligned and transformed PSAMs was calculated and subjected at the protein-DNA interface. Further details on the ﬁtting procedure and to a Bonferroni correction for the number of alignments that were tested. comprehensive tests of the all-atom free energy function can be found in Morozov et al. (2005). 4 RESULTS In another approach, we developed a ‘‘contact model’’ that exploits the structure of the protein-DNA complex bound to a high afﬁnity reference 4.1 PSAMs inferred by MatrixREDUCE agree well DNA sequence but does not require detailed predictions of protein-DNA with experimental measurements of TF binding interaction energies. In the contact model each mutated base in the PSAM affinity column i incurs equal energy cost relative to the consensus base from the We discovered the position speciﬁc afﬁnity matrices (PSAMs) reference sequence: for the Saccharomyces cerevisiae TFs Rap1, Ndt80, Gcn4, E ½f ðN Þ log ð1 N/N Þ < max max max Leu3, Abf1, and Sum1 by applying MatrixREDUCE to genome- DDG ðNÞ¼ f ðN Þ log ð1 þ 3N/N Þ ðN < N Þ ð18Þ max max max wide TF occupancy data and, in the case of Ndt80, differential E ðN N Þ max max mRNA expression microarray data (Figure 2A). Experimental measurements of relative K ’s (EMSA or lacZ expression) for Here, N is the number of protein amino acid-DNA base atomic contacts d summed over the base pair i (atomic contact is deﬁned by a distance of less speciﬁc oligonucleotides were available for Abf1, Leu3, Ndt80, than 4.5 s; hydrogen atoms are excluded from the counts), and N is the and Sum1. EMSA has long been employed to determine the max e145 B.C.Foat et al. Fig. 2. Comparison of PSAMs—affinity logos and correlations. (A) The PSAMs represented in the columns with blue headers were inferred by Matrix-REDUCE from ChIP-chip, PBM, DIP-chip, or mRNA differential expression microarray data. YPD (rich media), SM (sulfometuron methyl), and RAPA (rapamycin) refer to the environmental conditions to which the test sample was exposed before the ChIP-chip experiment. The DIP-chip experiments were performed with two different concentrations of Leu3, 4nM and 40nM. An ndt80 deletion (ndt80D) versus wild-type mRNA expression experiment (mRNA) was used to obtain the Ndt80 PSAM. The PSAMs represented in the columns green headers were inferred by modeling TF-DNA interactions based on crystal structures of the TFs using two different methods, a contact-only model and an all atom model. (B) All PSAMs for each TF were aligned pairwise and the Pearson correlation between the DDG values of both PSAMs for the best alignment was calculated. The P-value for this correlation is a measure of similarity between the PSAMs. Again, blue labels indicate PSAMs inferred by MatrixREDUCE PSAMs and green labels indicate structurally inferred PSAMs. DNA-binding afﬁnities of TFs in vitro. Likewise, the lacZ reporter experimental method (EMSA or lacZ), and TF, we compared the assay has long been used to measure the difference in activities experimental DDG with the DDG predicted from a PSAM for the of TF binding sites. We claim that a PSAM inferred by same TF (Figure 3). In every case, the experimental DDG values MatrixREDUCE from genome-wide TF occupancy data can be strongly correlated with the PSAM-predicted DDG values, with R ’s used to predict the relative binding afﬁnities of the measured TF ranging from 0.36 to 0.88. Thus, PSAMs inferred by MatrixRE- to any sequence. Therefore, EMSA and lacZ expression data pro- DUCE seem to be good models of the true relative DNA binding vide nearly ideal data sets for validation of the MatrixREDUCE afﬁnities of the corresponding TFs. Unexpectedly, all of the regres- approach. For each combination of experimentally tested sequence, sions of experimental DDG’s on MatrixREDUCE DDG’s have e146 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE Fig. 3. Comparison of experimentally measured DDG’s with MatrixREDUCE PSAM-predicted DDG’s. Experimental measurements of DDG’s were derived from EMSA (A) and lacZ reporter assays (B). The experimental DDG values are plotted along the vertical axes. Predicted DDG’s were calculated from the PSAM for each tested TF for the same oligonucleotide sequences that were measured in each experiment. The MatrixREDUCE-predicted DDG values are plotted along the horizontal axes. In this representation, the higher affinity oligonucleotides have more positive DDG’s. The diagonal dashed line represents experimental DDG equal to MatrixREDUCE DDG. DDG’s are in units of RT, where R is the gas constant and T is the temperature. The R and P-values for the Pearson correlations between the experimental and predicted DDG’s are presented for each PSAM-experimental data pair. slopes less than one (range: 0.31 to 0.94). It seems that MatrixRE- of the binding interface based only on the genomic sequence and DUCE produces a slightly larger range of predicted DDG’s than is genome-wide TF occupancy data. realized in experiments. Nonetheless, the MatrixREDUCE PSAM- Gcn4 is a TF of the bZIP class. It is a homodimer with the basic predicted DDG’s are close to the experimentally inferred DDG’s in region mediating sequence speciﬁc DNA binding and the leucine most cases, especially among the highest afﬁnity sequences. zipper region required for dimerization (O’Shea et al., 1991). For deriving the Gcn4 structural PSAMs, we used a 2.9 s crystal structure of the TF bound to the ATGAGTCAT site (Ellenberger 4.2 PSAMs inferred by MatrixREDUCE agree well et al., 1992). The symmetry of the binding site (two reverse with PSAMs inferred by structural models complement 4 bp half-sites separated by G in the middle) is a reﬂection of the homodimeric binding and is captured well in Both genome-wide TF occupancy data and crystal structures of MatrixREDUCE predictions. While contact model and Matrix- protein-DNA complexes are available for Ndt80, Gcn4, and REDUCE predictions are similar, the all-atom model is less Rap1. Thus, we were able to compare MatrixREDUCE PSAMs successful, probably due to the low resolution of the crystal struc- with those based on ab initio structural models (see Methods; ture, which leads to considerable uncertainty in side chain positions Figure 2A). The structurally inferred PSAMs for Ndt80 were obtained from its co-crystal structure bound to a high afﬁnity with respect to the neighboring DNA bases. GACACAAAA site, solved at 1.4 s resolution (Lamoureux Finally, Rap1 binds DNA as a homodimer in a way that makes its et al., 2002). Figure 2A shows a reasonable agreement between DNA site a tandem repeat. The crystal structure of the Rap1 homod- DDG predictions carried out with MatrixREDUCE and structural imer in complex with a telomeric DNA site has been solved to models. The close correspondence with the contact model, which is 2.25 s resolution (Konig et al., 1996). Comparison of Matrix- a function of the number of protein side chains in contact with DNA REDUCE PSAMs and structural PSAMs reveals good agreement base pairs, is especially remarkable, showing that the Matrix- with the all atom model. The contact model overpredicts binding REDUCE approach is capable of reproducing structural details speciﬁcity at the intermediate positions in the binding site (located e147 B.C.Foat et al. bases provide an estimate of a PSAM in the form of a position- speciﬁc scoring matrix (PSSM). Since we had already compiled the EMSA and lacZ expression data, we had the opportunity to experi- mentally verify the results of these PSSMs. We gathered the BioProspector and MDscan results from the original, published analyses, transformed them into PSSMs, and used them to predict DDG’s for the EMSA and lacZ experimentally tested sequences. We performed this comparison using two different ‘‘background’’ nucleotide frequency models: one using equal nuc- leotide probabilities and one using nucleotide probabilities derived from S. cerevisiae intergenic sequences. The R and P-values for the Fig. 4. Correlations of experimentally measured DDG’s with information correlations between these predicted DDG’s and the experimental theory-predicted DDG’s. Experimental measurements of DDG’s were derived 2 results are displayed in Figure 4. Overall, the quality of the results from EMSA and lacZ reporter assays. The R and P-values for the Pearson from the information theory PSSMs and the MatrixREDUCE correlations between the experimental and predicted DDG’s are presented for PSAMs were similar. However, the results for the PSSMs are dif- each PSSM-experimental data pair. PSSMs were derived and tested using ferent depending on the choice of equal or intergenic nucleotide two different background nucleotide frequencies: equal probabilities and frequencies. While we did not test this scenario, the information intergenic probabilities. theory results would also change depending on the probe intensity threshold chosen to label genes as ‘‘bound.’’ Thus, while Matrix- between tandem repeats), likely because it assigns similar REDUCE performs comparably with existing information theory speciﬁcities to protein-DNA contacts in the loop region and in methods, it conveniently avoids having to choose several ad hoc the DNA binding domains. parameters required by the other methods. 4.3 PSAM to PSAM correlations 5 DISCUSSION Upon visual inspection of Figure 2A, the similarities are immedi- Overall, position speciﬁc afﬁnity matrices (PSAMs) as inferred by ately apparent between afﬁnity logos for the same factor inferred MatrixREDUCE from genome-wide TF occupancy data are good using different experimental and computational methods. However, approximations of the real sequence-speciﬁc DNA binding afﬁni- a quantitative measure of these similarities can be obtained by ties. Discrepancies between the computationally predicted and the aligning the PSAMs (see Methods) and calculating the correlation experimentally inferred binding afﬁnities may be due to either the of their DDG values. The P-value for this correlation between two computational or the experimental methods. EMSA has known PSAMs serves as our similarity metric (Figure 2B). Overall, the problems with ‘‘caging’’ of the TFs by the gel while electrophoresis similarity between the PSAMs from MatrixREDUCE are the most is proceeding (Fried and Crothers, 1981). This could lead to inferred signiﬁcant. There is extreme similarity between the Rap1 PSAMs DDG’s of smaller magnitude. Likewise, lacZ reporter assays are a inferred from ChIP-chip and PBM data. The PSAMs inferred for very indirect way of measuring relative binding afﬁnities as they Gcn4 for three different ChIP-chip conditions are all very similar as require transcription, translation, and b-galactosidase reactions in well. The Leu3 PSAMs inferred from the DIP-chip data are much order to make measurements, and noise could be introduced at each more similar to each other than they are to the Leu3 PSAM inferred step. Structural model predictions are strongly dependent on the from the ChIP-chip data, but they are still both signiﬁcantly similar quality of input structures and are affected by errors in the energy (P-value < 0.01) to the ChIP-chip Leu3 PSAM. The signiﬁcance of function. The current MatrixREDUCE model may also give rise to the correlations between MatrixREDUCE PSAMs and structurally systematic biases. First, it makes the approximation that nucleotides inferred PSAMs is more variable. Both the all atom model PSAM contribute independently to the free energy of TF binding (Benos and the contact model PSAM for Rap1 and Ndt80 have signiﬁcant et al., 2002). Second, it makes the assumption that the concentration similarities with the respective MatrixREDUCE PSAMs (P-value < of TF is much smaller than the K , which may not be correct for 0.01). However for Gcn4, while the contact model PSAM has strong some TFs. Finally, all consecutive positions in the PSAM are cur- similarities with all of the other PSAMs, the all atom PSAM has rently treated as parameters to be estimated, which may lead to insigniﬁcant similarities with all other PSAMs. overﬁtting. We plan to address these issues in a future version of the algorithm. 4.4 How good is the information theory Despite these current limitations, PSAMs discovered using the approximation? current implementation of MatrixREDUCE are good approx- In the original papers describing the ChIP-chip (Harbison et al., imations of the relative nucleotide binding afﬁnities of assayed 2004), PBM (Mukherjee et al., 2004), and DIP-chip (Liu et al., TFs. Especially for microarray methods like PBM and DIP-chip, 2005) data, the authors used BioProspector (Liu et al., 2001) or where the objective is to deﬁne nucleotide-binding speciﬁcities, MDscan (Liu et al., 2002) to deﬁne weight matrix representations of MatrixREDUCE may be the most physically accurate method TF binding sites. These two methods use the set of sequences that available to analyze the data. Even for less direct reﬂections of the experimenters label as ‘‘bound’’ to produce a list of potential TF binding afﬁnities like ChIP-chip or differential mRNA expres- binding sites. When interpreted through information theory, the sion data, it will still provide good approximations of the sequence- nucleotide frequencies at each individual position in the binding speciﬁc binding afﬁnities of TFs relevant to those data. Preliminary sites divided by the ‘‘background’’ frequencies for the respective results also suggest that MatrixREDUCE performs well on data e148 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE from higher eukaryotes including D. melanogaster and mammals. Konig,P., Giraldo,R., Chapman,L. and Rhodes,D. (1996) The crystal structure of the DNA-binding domain of yeast RAP1 in complex with telomeric DNA. Cell, 85, Finally, MatrixREDUCE has two key advantages over most other 125–136. computational methods for deﬁning nucleotide binding speciﬁc- Kortemme,T., Morozov,A.V. and Baker,D. (2003) An orientation-dependent hydrogen ities: (i) it uses the information for all probes in genome-wide bonding potential improves prediction of speciﬁcity and structure for proteins and TF occupancy data, and (ii) it does not require a background protein-protein complexes. J. Mol. Biol., 326, 1239–1259. Lamoureux,J.S., Stuart,D., Tsang,R., Wu,C. and Glover,J.N.M. (2002) Structure of sequence model. the sporulation-speciﬁc transcription factor Ndt80 bound to DNA. EMBO J., 21, 5721–5732. ACKNOWLEDGMENTS Lee,T.I., Rinaldi,N.J., Robert,F., Odom,D.T., Bar-Joseph,Z., Gerber,G.K., Hannett,N.M., Harbison,C.T., Thompson,C.M., Simon,I., Zeitlinger,J., This work was supported by National Institutes of Health grants Jennings,E.G., Murray,H.L., Gordon,D.B., Ren,B., Wyrick,J.J., Tagne,J.B., HG003008 (to H.J.B) and CA121852. A.V.M. is a fellow of the Volkert,T.L., Fraenkel,E., Gifford,D.K. and Young,R.A. (2002) Transcriptional Leukemia and Lymphoma Society. The authors thank Jason Lieb, regulatory networks in Saccharomyces cerevisiae. Science, 298, Neil Clarke, Martha Bulyk, Ernest Fraenkel, Xiao Liu, and Michael 799–804. Lieb,J.D., Liu,X., Botstein,D. and Brown,P.O. (2001) Promoter-speciﬁc binding of Berger for graciously answering questions and supplying additional Rap1 revealed by genome-wide maps of protein-DNA association. Nat. Genet., data from their publications. The authors also thank Gabor Halasz, 28, 327–334. Ronald Tepper, Junbai Wang, Xiang-Jun Lu, and Lucas Ward for Liu,X.S., Brutlag,D.L. and Liu,J.S. (2002) An algorithm for ﬁnding protein-dna valuable conversations, and Xiang-Jun Lu for reimplementing the binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol., 20, 835–839. MatrixREDUCE software for distribution. Liu,X., Brutlag,D.L. and Liu,J.S. (2001) Bioprospector: discovering conserved dna motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. REFERENCES Biocomput. 127–138. Liu,X. and Clarke,N.D. (2002) Rationalization of gene regulation by a eukaryotic Benos,P.V., Bulyk,M.L. and Stormo,G.D. (2002) Additivity in protein-DNA interac- transcription factor: calculation of regulatory region occupancy from predicted tions: how good an approximation is it? Nucleic Acids Res., 30, 4442–4451. binding afﬁnities. J. Mol. Biol., 323, 1–8. Berg,O.G. and von Hippel,P.H. (1987) Selection of DNA binding sites by regulatory Liu,X., Noll,D.M., Lieb,J.D. and Clarke,N.D. (2005) DIP-chip: rapid and accurate proteins. Statistical-mechanical theory and application to operators and promoters. determination of DNA-binding speciﬁcity. Genome Res., 15, 421–427. J. Mol. Biol., 193, 723–750. Lu,X.J. and Olson,W.K. (2003) 3DNA: a software package for the analysis, rebuilding Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O. and Herskowitz,I. and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res., (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 31, 5108–5121. 699–705. Morozov,A.V., Havranek,J.J., Baker,D. and Siggia,E.D. (2005) Protein-DNA Djordjevic,M., Sengupta,A.M. and Shraiman,B.I. (2003) A biophysical approach to binding speciﬁcity predictions with structural models. Nucleic Acids Res., 33, transcription factor binding site discovery. Genome Res., 13, 2381–2390. 5781–5798. Djordjevic,M. and Sengupta,A.M. (2006) Quantitative modeling and data analysis of Mukherjee,S., Berger,M.F., Jona,G., Wang,X.S., Muzzey,D., Snyder,M., Young,R.A. SELEX experiments. Phys. Biol., 3, 13–28. and Bulyk,M.L. (2004) Rapid analysis of the DNA-binding speciﬁcities of Ellenberger,T.E., Brandl,C.J., Struhl,K. and Harrison,S.C. (1992) The GCN4 basic transcription factors with DNA microarrays. Nat. Genet., 36, 1331–1339. region leucine zipper binds DNA as a dimer of uninterrupted alpha helices: crystal Olson,W.K., Gorin,A.A., Lu,X.J., Hock,L.M. and Zhurkin,V.B. (1998) DNA sequence- structure of the protein-DNA complex. Cell, 71, 1223–1237. dependent deformability deduced from protein-DNA crystal complexes. Proc. Endres,R.G., Schulthess,T.C. and Wingreen,N.S. (2004) Toward an atomistic model Natl. Acad. Sci. USA, 95, 11163–11168. for predicting transcription-factor binding sites. Proteins, 57, 262–268. Onufriev,A., Bashford,D. and Case,D.A. (2004) Exploring protein native states and Foat,B.C., Houshmandi,S.S., Olivas,W.M. and Bussemaker,H.J. (2005) Proﬁling large-scale conformational changes with a modiﬁed Generalized Born model. condition-speciﬁc, genome-wide regulation of mRNA stability in yeast. Proc. Proteins, 55, 383–394. Natl. Acad. Sci. USA, 102, 17675–17680. O’Shea,E.K., Klemm,J.D., Kim,P.S. and Alber,T. (1991) X-ray structure of the GCN4 Fried,M. and Crothers,D.M. (1981) Equilibria and kinetics of lac repressor-operator leucine zipper, a two-stranded, parallel coiled coil. Science, 254, 539–544. interactions by polyacrylamide gel electrophoresis. Nucleic Acids Res., 9, Paillard,G. and Lavery,R. (2004) Analyzing protein-DNA recognition mechanisms. 6505–6525. Structure, 12, 113– 122. Gailus-Durner,V., Xie,J., Chintamaneni,C. and Vershon,A.K. (1996) Participation of Pierce,M., Benjamin,K.R., Montano,S.P., Georgiadis,M.M., Winter,E. and the yeast activator Abf1 in meiosis-speciﬁc expression of the HOP1 gene. Mol. Vershon,A.K. (2003) Sum1 and Ndt80 proteins compete for binding to middle Cell. Biol., 16, 2777–2786. sporulation element sequences that control meiotic gene expression. Mol. Cell. Granek,J.A. and Clarke,N.D. (2005) Explicit equilibrium modeling of transcription- Biol., 23, 4814–4825. factor binding and gene regulation. Genome Biol., 6, R87. Ren,B., Robert,F., Wyrick,J.J., Aparicio,O., Jennings,E.G., Simon,I., Zeitlinger,J., Grubbs,F. (1969) Procedures for detecting outlying observations in samples. Techno- Schreiber,J., Hannett,N., Kanin,E., Volkert,T.L., Wilson,C.J., Bell,S.P. and metrics, 11, 1–21. Young,R.A. (2000) Genome-wide location and function of DNA binding proteins. Harbison,C.T., Gordon,D.B., Lee,T.I., Rinaldi,N.J., Macisaac,K.D., Danford,T.W., Science, 290, 2306–2309. Hannett,N.M., Tagne,J.B., Reynolds,D.B., Yoo,J., Jennings,E.G., Zeitlinger,J., Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display Pokholok,D.K., Kellis,M., Rolfe,P.A., Takusagawa,K.T., Lander,E.S., consensus sequences. Nucleic Acids Res., 18, 6097–6100. Gifford,D.K., Fraenkel,E. and Young,R.A. (2004) Transcriptional regulatory Simon,I., Barnett,J., Hannett,N., Harbison,C.T., Rinaldi,N.J., Volkert,T.L., Wyrick,J.J., code of a eukaryotic genome. Nature, 431, 99–104. Zeitlinger,J., Gifford,D.K., Jaakkola,T.S. and Young,R.A. (2001) Serial regulation Havranek,J.J., Duarte,C.M. and Baker,D. (2004) A simple physical model for the of transcriptional regulators in the yeast cell cycle. Cell, 106, 697–708. prediction and design of protein-DNA interactions. J. Mol. Biol., 344,59–70. Stormo,G.D. and Fields,D.S. (1998) Speciﬁcity, free energy and information content in Issel-Tarver,L., Christie,K.R., Dolinski,K., Andrada,R., Balakrishnan,R., Ball,C.A., protein-DNA interactions. Trends Biochem. Sci., 23, 109–113. Binkley,G., Dong,S., Dwight,S.S., Fisk,D.G., Harris,M., Schroeder,M., Stormo,G.D., Schneider,T.D. and Gold,L. (1986) Quantitative analysis of the Sethuraman,A., Tse,K., Weng,S., Botstein,D. and Cherry,J.M. (2002) Saccha- relationship between nucleotide sequence and functional activity. Nucleic Acids romyces Genome Database. Methods Enzymol., 350, 329–346. Res., 14, 6661–6679. Iyer,V.R., Horak,C.E., Scafe,C.S., Botstein,D., Snyder,M. and Brown,P.O. (2001) Stormo,G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, Genomic binding sites of the yeast cell-cycle transcription factors SBF and 16, 16–23. MBF. Nature, 409, 533–538. e149 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/statistical-mechanical-modeling-of-genome-wide-transcription-factor-ox4weE5YPQ

Loading next page...

References (40)

M. Djordjevic, Anirvan Sengupta, B. Shraiman (2003)
A biophysical approach to transcription factor binding site discovery.
Genome research, 13 11
V. Iyer, C. Horak, C. Scafe, D. Botstein, M. Snyder, P. Brown (2001)
Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF
Nature, 409
X. Liu, D. Brutlag, J. Liu (2002)
An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments
Nature Biotechnology, 20
Otto Berg (1988)
Selection of DNA binding sites by regulatory proteins: the LexA protein and the arginine repressor use different strategies for functional specificity.
Nucleic acids research, 16 11
P. König, R. Giraldo, L. Chapman, D. Rhodes (1996)
The Crystal Structure of the DNA-Binding Domain of Yeast RAP1 in Complex with Telomeric DNA
Cell, 85
Christopher Harbison, D. Gordon, Tong Lee, Nicola Rinaldi, K. MacIsaac, Timothy Danford, N. Hannett, J. Tagne, David Reynolds, Jane Yoo, E. Jennings, J. Zeitlinger, Dmitry Pokholok, Manolis Kellis, P. Rolfe, K. Takusagawa, E. Lander, D. Gifford, E. Fraenkel, R. Young (2004)
Transcriptional regulatory code of a eukaryotic genome
Nature, 431
Bing Ren, F. Robert, John Wyrick, O. Aparicio, E. Jennings, I. Simon, J. Zeitlinger, J. Schreiber, N. Hannett, Elenita Kanin, T. Volkert, Christopher Wilson, S. Bell, R. Young (2000)
Genome-wide location and function of DNA binding proteins.
Science, 290 5500
Xiao Liu, N. Clarke (2002)
Rationalization of gene regulation by a eukaryotic transcription factor: calculation of regulatory region occupancy from predicted binding affinities.
Journal of molecular biology, 323 1
J. Lieb, Xiaole Liu, D. Botstein, P. Brown (2001)
Promoter-specific binding of Rap1 revealed by genome-wide maps of protein–DNA association
Nature Genetics, 28
Xiao Liu, D. Noll, J. Lieb, N. Clarke (2005)
DIP-chip: rapid and accurate determination of DNA-binding specificity.
Genome research, 15 3
V. Gailus-Durner, J. Xie, C. Chintamaneni, A. Vershon (1996)
Participation of the yeast activator Abf1 in meiosis-specific expression of the HOP1 gene
Molecular and Cellular Biology, 16
J. Lamoureux, D. Stuart, R. Tsang, Cynthia Wu, J. Glover (2002)
Structure of the sporulation‐specific transcription factor Ndt80 bound to DNA
The EMBO Journal, 21
T. Schneider, R. Stephens (1990)
Sequence logos: a new way to display consensus sequences.
Nucleic acids research, 18 20
M. Fried, D. Crothers (1981)
Equilibria and kinetics of lac repressor-operator interactions by polyacrylamide gel electrophoresis.
Nucleic acids research, 9 23
P. Benos, M. Bulyk, G. Stormo (2002)
Additivity in protein-DNA interactions: how good an approximation is it?
Nucleic acids research, 30 20
G. Stormo, T. Schneider, L. Gold (1986)
Quantitative analysis of the relationship between nucleotide sequence and functional activity.
Nucleic acids research, 14 16
O. Berg, P. Hippel (1987)
Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters.
Journal of molecular biology, 193 4
R. Endres, T. Schulthess, N. Wingreen (2004)
Toward an atomistic model for predicting transcription‐factor binding sites
Proteins: Structure, 57
Xiang-Jun Lu, W. Olson (2003)
3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures.
Nucleic acids research, 31 17
S. Chu, J. Derisi, M. Eisen, J. Mulholland, D. Botstein, P. Brown, I. Herskowitz (1998)
The transcriptional program of sporulation in budding yeast.
Science, 282 5389
Barrett Foat, S. Houshmandi, W. Olivas, H. Bussemaker (2005)
Profiling condition-specific, genome-wide regulation of mRNA stability in yeast.
Proceedings of the National Academy of Sciences of the United States of America, 102 49
T. Ellenberger, C. Brandl, K. Struhl, S. Harrison (1992)
The GCN4 basic region leucine zipper binds DNA as a dimer of uninterrupted α Helices: Crystal structure of the protein-DNA complex
Cell, 71
A. Morozov, J. Havranek, D. Baker, E. Siggia (2005)
Protein–DNA binding specificity predictions with structural models
Nucleic Acids Research, 33
A. Onufriev, D. Bashford, D. Case (2004)
Exploring protein native states and large‐scale conformational changes with a modified generalized born model
Proteins: Structure, 55
J. Havranek, Carlos Duarte, D. Baker (2004)
A simple physical model for the prediction and design of protein-DNA interactions.
Journal of molecular biology, 344 1
T. Kortemme, A. Morozov, D. Baker (2003)
An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes.
Journal of molecular biology, 326 4
I. Simon, J. Barnett, N. Hannett, Christopher Harbison, Nicola Rinaldi, T. Volkert, John Wyrick, J. Zeitlinger, D. Gifford, T. Jaakkola, R. Young (2001)
Serial Regulation of Transcriptional Regulators in the Yeast Cell Cycle
Cell, 106
X. Liu, D. Brutlag, Jun Liu (2000)
BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
G. Stormo (2000)
DNA binding sites: representation and discovery
Bioinformatics, 16 1
G. Paillard, R. Lavery (2004)
Analyzing protein-DNA recognition mechanisms.
Structure, 12 1
M. Pierce, K. Benjamin, Sherwin Montaño, M. Georgiadis, E. Winter, A. Vershon (2003)
Sum1 and Ndt80 Proteins Compete for Binding to Middle Sporulation Element Sequences That Control Meiotic Gene Expression
Molecular and Cellular Biology, 23
J. Granek, N. Clarke (2005)
Explicit equilibrium modeling of transcription-factor binding and gene regulation
Genome Biology, 6
L. Issel-Tarver, K. Christie, K. Dolinski, R. Andrada, R. Balakrishnan, C. Ball, G. Binkley, S. Dong, S. Dwight, D. Fisk, M. Harris, Mark Schroeder, A. Sethuraman, Kane Tse, S. Weng, D. Botstein, J. Cherry (2002)
Saccharomyces Genome Database.
Methods in enzymology, 350
E. O’Shea, J. Klemm, P. Kim, T. Alber (1991)
X-ray structure of the GCN4 leucine zipper, a two-stranded, parallel coiled coil.
Science, 254 5031
M. Djordjevic, Anirvan Sengupta (2005)
Quantitative modeling and data analysis of SELEX experiments
Physical Biology, 3
Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE
G. Stormo, D. Fields (1998)
Specificity, free energy and information content in protein-DNA interactions.
Trends in biochemical sciences, 23 3
F. Grubbs (1969)
Procedures for Detecting Outlying Observations in Samples
Technometrics, 11
Sonali Mukherjee, M. Berger, G. Jona, Xun Wang, D. Muzzey, M. Snyder, R. Young, M. Bulyk (2004)
Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays
Nature Genetics, 36
W. Olson, A. Gorin, Xiang-Jun Lu, L. Hock, V. Zhurkin (1998)
DNA sequence-dependent deformability deduced from protein-DNA crystal complexes.
Proceedings of the National Academy of Sciences of the United States of America, 95 19

Publisher: Oxford University Press
Copyright: © The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/btl223
pmid: 16873464
Publisher site: See Article on Publisher Site

Abstract

Vol. 22 no. 14 2006, pages e141–e149 BIOINFORMATICS doi:10.1093/bioinformatics/btl223 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE 1 2 1,3, Barrett C. Foat , Alexandre V. Morozov and Harmen J. Bussemaker 1 2 Department of Biological Sciences, Columbia University, New York, NY 10027, USA, Center for Studies in Physics and Biology, The Rockefeller University, New York, NY 10021, USA and Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10032, USA sidered to be bound by the TF (for review see Stormo, 2000). These ABSTRACT methods use the information content of nucleotide patterns as a Motivation: Regulation of gene expression by a transcription factor proxy for the free energy contributions of the bases found in the requires physical interaction between the factor and the DNA, which TF binding site (Berg and von Hippel, 1987; Stormo and Fields, can be described by a statistical mechanical model. Based on this 1998). Other computational methods infer physically-based TF model, we developed the MatrixREDUCE algorithm, which uses binding speciﬁcities from measured TF binding afﬁnities for a genome-wide occupancy data for a transcription factor (e.g. ChIP- small set of oligonucleotides (Liu and Clarke, 2002) or from struc- chip) and associated nucleotide sequences to discover the sequence- tural modeling of protein-DNA interaction (Paillard and Lavery, specific binding affinity of the transcription factor. Advantages of our 2004; Endres et al., 2004; Morozov et al., 2005). However, genome- approach are that the information for all probes on the microarray is scale, quantitative measurements of TF occupancies of intergenic efficiently utilized because there is no need to delineate ‘‘bound’’ and regions are now available due to the advent of in vivo chromatin ‘‘unbound’’ sequences, and that, unlike information content-based immunoprecipitation microarrays (Ren et al., 2000; Iyer et al., methods, it does not require a background sequence model. 2001; Lieb et al., 2001; Simon et al., 2001; Lee et al., 2002; Results: We validated the performance of MatrixREDUCE by inferring Harbison et al., 2004), in vitro protein binding microarrays the sequence-specific binding affinities for several transcription factors (PBM; Mukherjee et al., 2004), and DNA immunoprecipitation in S. cerevisiae and comparing the results with three other independent microarrays (DIP-chip; Liu et al., 2005). Thus, it is no longer sources of transcription factor sequence-specific affinity information: necessary to rely on small data sets, availability of protein-DNA (i) experimental measurement of transcription factor binding affinities structures, or the analogy between information content and statis- for specificoligonucleotides,(ii) reporter geneassays forpromoterswith tical mechanics to infer free energy representations of transcription systematically mutated binding sites, and (iii) relative binding affinities factor binding sites. obtained by modeling transcription factor-DNA interactions based on We have developed a method, implemented as the program co-crystal structures of transcription factors bound to DNA substrates. MatrixREDUCE (Foat et al., 2005), that infers the sequence spe- We show that transcription factor binding affinities inferred by ciﬁcity of a TF directly and accurately from genome-wide TF occu- MatrixREDUCE are in good agreement with all three validating pancy data by ﬁtting a statistical mechanical model for TF-DNA methods. interaction (Figure 1). The sequence speciﬁcity of the TF’s DNA- Availability: MatrixREDUCE source code is freely available for binding domain is modeled using a position-speciﬁc afﬁnity matrix non-commercial use at http://www.bussemakerlab.org/. The software (PSAM), representing the change in the binding afﬁnity (K ) when- runs on Linux, Unix, and Mac OS X. d ever a speciﬁc position within a reference binding sequence is Contact: [email protected] mutated. To validate the physical model of MatrixREDUCE, we discovered the PSAMs for several TFs in S. cerevisiae and com- 1 INTRODUCTION pared the results with three other independent sources of TF sequence-speciﬁc afﬁnity information: (i) experimentally measured The sequence-speciﬁc regulatory activity of a transcription factor K ’s as determined by in vitro methods (Gailus-Durner et al., 1996; (TF) is the result of energetically favorable interactions between the Liu and Clarke, 2002; Pierce et al., 2003), (ii) lacZ reporter assays amino acids exposed in the DNA binding domain and portions of for promoters with systematically mutated binding sites (Gailus- nucleic acid bases exposed in the grooves of the DNA. A compu- Durner et al., 1996; Pierce et al., 2003), and (iii) relative K ’s tational method for discovering the binding speciﬁcity of a TF obtained by using a physical model of protein-DNA interaction cannot provide a quantitative description of TF binding unless it that makes binding afﬁnity predictions starting from a co-crystal considers the physical underpinnings of the TF-DNA interaction. structure of the protein-DNA complex (Morozov et al., 2005). We Most physically motivated computational methods discover over- ﬁnd a surprising level of agreement between MatrixREDUCE- represented patterns in a set of nucleotide sequences that are con- predicted TF binding afﬁnities, experimental measurements, and To whom correspondence should be addressed. structural predictions, suggesting that MatrixREDUCE is a The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] B.C.Foat et al. Fig. 1. The flow of data. A microarray measurement of TF occupancies (ChIP-chip, PBM, DIP-chip, or differential mRNA expression data) and relevant nucleotide sequences for each microarray feature are used as input to MatrixREDUCE. MatrixREDUCE performs a least-squares fit to a statistical-mechanical model of TF-DNA interaction to discover the relative contributions to the free energy of binding for each nucleotide at each position in the generalized TF binding site. These contributions are represented as a position specific affinity matrix (PSAM) containing the relative equilibrium constants of the TF-DNA interaction, with the highest affinity nucleotide at each position scaled to a value of one (DDG ¼ 0). The PSAM can be converted into an affinity logo that graphically represents the DDG’s for each nucleotide at each position relative to the averageDDG at the respective positions. The PSAM can also be used to predict the relative TF occupancy of any nucleotide sequence, allowing the PSAMs inferred by MatrixREDUCE to be compared with experimental measurements of TF binding affinities for particular oligonucleotides. powerful and accurate tool for the elucidation of physically accurate data were b-galactosidase activities for genes containing mutated TF sequence-speciﬁc binding afﬁnities. binding sites of the E. coli su2 amber stop codon suppressor. A similar type of analysis was performed by Liu and Clarke (2002) who ﬁt a physical model for transcription factor binding to elec- 2 RELATED WORK trophoretic mobility shift assay (EMSA) data measuring afﬁnity of In contrast to information theory-based methods of deﬁning the S. cerevisiae Leu3 TF for several oligonucleotides. The physical nucleotide-binding protein speciﬁcities, MatrixREDUCE belongs model behind MatrixREDUCE is the same as that employed by to a small but growing class of methods that infer binding afﬁnities Stormo et al. (1986) and Liu and Clarke (2002). However, our by directly ﬁtting a physical model to experimental data. The ﬁrst ‘‘quantitative data’’ are microarray probe intensities, which mea- such method was introduced by Stormo et al. (1986) who noted, sure TF occupancy over long chromosomal regions with unknown ‘‘When quantitative data are known for many sequences one can binding site locations. Thus, the MatrixREDUCE model integrates the binding signal over the entire length of the sequence. The solve for the matrix elements that give the best ﬁt between the GOMER method of Granek and Clarke (2005) performs a similar sequences and those data.’’ For Stormo et al. (1986) the quantitative e142 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE The occupancy NðU Þ for the entire promoter region U of gene g equals the integration of signal over long regulatory sequences relevant to g g sum of occupancies for each binding site window of length L at each measured microarray intensities. However, GOMER was only position i over the length L of the sequence U : g g used to test hypotheses about the regulatory mechanisms of TFs L L þ1 for which a binding site weight matrix had already been deﬁned by g w L X Y NðU Þ¼ ½PK ðS Þ w ‚ ð9Þ g a ref jU ði þ j 1Þ other methods. Granek and Clarke (2005) did not attempt to ﬁt the g i¼1 j¼1 GOMER model directly to experimental data to infer the binding afﬁnities of TFs. Finally, also of note are the QPMEME algorithm of where U ðiÞ is the base at position i in sequence U . g g Djordjevic et al. (2003) and the work of Djordjevic and Sengupta (2006) which use maximum likelihood procedures to infer PSAMs 3.2 Modeling genome-wide TF occupancy data by ﬁtting physical models to known TF binding sites and SELEX Recent technologies such as ChIP-chip (Ren et al., 2000; Iyer et al., 2001; data, respectively, but rely on the prior delineation of ‘‘bound’’ Lieb et al., 2001; Simon et al., 2001; Lee et al., 2002; Harbison et al., 2004), sequences. PBM (Mukherjee et al., 2004), and DIP-chip (Liu et al., 2005) provide indirect but quantitative information about the TF occupancy of large genomic regions. For each segment of DNA there are two microarray inten- 3 METHODS test test sities. The test intensity I is equal to a background intensity a plus a 3.1 Modeling TF-DNA interaction term that, to ﬁrst approximation, is proportional (g) to the occupancy NðU Þ by the TF, either because the amount of TF bound to the probe contributes We will develop the statistical-mechanical model used by MatrixREDUCE directly to the signal intensity (PBM) or because it determines the proportion starting with a transcription factor P that binds to a DNA sequence S to form at which an immunoprecipitated TF-DNA fragment is present in the sample the TF-DNA complex PS: control (ChIP-chip or DIP-chip). The control intensity I is only the result of on control background signal a . Allowing for experimental noise e , we obtain: P þ S PS ð1Þ test test off gNðU Þþ a g g ¼ þ e bNðU Þþ C þ e ð10Þ g g g control control The afﬁnity of the TF for the sequence can be expressed in terms of its a equilibrium dissociation constant K ðSÞ: Using Equation 9 for the occupancy NðU Þ, we obtain: ½P½S k off DG/RT K ðSÞ¼ ¼ ¼ e ‚ ð2Þ L L þ1 test g w L ½PS k w on X Y ¼ F w þ C þ e ‚ ð11Þ jU ði þ j 1Þ g control g which is directly related to DG, the Gibbs free energy of binding per mole g i¼1 j¼1 (R is the gas constant and T is temperature). The occupancy N(S) of sequence S by transcription factor P can be expressed as the concentration of TF-DNA where complex divided by the total concentration of DNA (bound or unbound): F ¼ b½PK ðS Þ: ð12Þ a ref ½PS ½P NðSÞ¼ ¼ : ð3Þ Note that b,[P], and K ðS Þ cannot be determined separately without a ref ½PSþ½S ½Pþ K ðSÞ additional information such as the real protein concentration or K ðS Þ. a ref For simplicity, we will assume that the TF concentration [P] is much smaller MatrixREDUCE discovers the set of w elements as well as F and C by jb than K ðSÞ. This assumption seems physiologically plausible because in this performing a least squares ﬁt to the measured intensity ratios: regime, the highest afﬁnity binding sites in the genome will be the most responsive to a change in the nuclear concentration of active TF. Thus, the ðC‚F‚fw gÞ ¼ argmin jb C‚ F‚ fw g jb occupancy becomes: ½P NðSÞ ¼½PK ðSÞ‚ ð4Þ L L þ1 a test g w L X X Y K ðSÞ F w C : ð13Þ jU ði þ j 1Þ control g where g g i¼1 j¼1 K ðSÞ K ðSÞ: ð5Þ The 4 · L matrix of K ratios w (3L parameters plus L reference w a jb w w Consider a single point mutation from the original reference sequence S ref nucleotide values) for all nucleotides at all positions in the binding site is to base b at position j resulting in the mutated sequence S . Such a mutation mut referred to as the position speciﬁc afﬁnity matrix (PSAM). Each position j in will give rise to an additive change DDG in the free energy of binding or, the PSAM is rescaled such that the largest w is equal to unity, without loss jb equivalently, a multiplicative change w in K ðS Þ: jb a ref of generality. Differential mRNA expression microarray data, which measures the K ðS Þ a mut DDG/RT w ¼ ¼ e ‚ ð6Þ jb change in mRNA concentrations in cells from two different experimental K ðS Þ a ref conditions, can be used in place of genome-wide TF occupancy data. where This substitution is reasonable since, to ﬁrst approximation, the transcrip- tion rate of genes is proportional to the total TF occupancy along the asso- DDG ¼ DGðS Þ DGðS Þ: ð7Þ ref mut ciated promoter regions. Genome-wide occupancy data is preferable, To be able to generalize the binding of transcription factor P to a sequence however, since it is a more direct measure of TF-DNA interaction and S with more than one point mutation, we assume that the free energy mut since the design of the experiments provides the TF identities for the contributions for each position in the binding site are independent (Benos discovered PSAMs. et al., 2002) and therefore additive. Equivalently, we can multiply the w ’s jb for any nucleotide sequence to obtain the overall K ðS Þ/K ðS Þ ratio. a mut a ref 3.3 MatrixREDUCE implementation and Thus, the occupancy of a particular binding site S of length L with mut w parameters nucleotide sequence S ð1‚2‚.. . ‚L Þ¼ðb ‚b ‚.. . ‚b Þ is: mut w 1 2 Lw w MatrixREDUCE was implemented in Perl and C as outlined above and as NðS Þ¼½PK ðS Þ w : ð8Þ mut a ref jS ðjÞ previously described (Foat et al., 2005) with some modiﬁcations. mut j¼1 Brieﬂy, MatrixREDUCE takes microarray intensities and corresponding e143 B.C.Foat et al. nucleotide sequence data as input. It ﬁrst ﬁnds a gapped dyad motif (e.g. Leu3: standards’’ to assess the validity of our MatrixREDUCE model. The CCG-4nt-CGG), out of all possible dyad motifs of a ﬁxed number of electrophoretic mobility shift assay (EMSA) is able to provide direct nucleotides and a range of gap sizes, whose occurrences best correlate estimates of K ’s for a TF binding to particular oligonucleotides (Fried with the measured intensities for the same sequences. The best dyad motif and Crothers, 1981). The ratio of the EMSA-measured K of a reference is then converted into a seed matrix by ﬁlling in the gap with N’s and oligonucleotide S to the K of one of the other tested oligonucleotides S ref d mut extending out a user deﬁned number of ﬂanking N’s on either side of the provides the same information as the product across the MatrixREDUCE best-scoring dyad. In the 4 · L seed matrix, acceptable nucleotides (all PSAM over the same sequence for the same TF. In the simplifying scenario nucleotides for N’s, a single speciﬁc nucleotide at positions within the top where the length of the oligonucleotides is the same as the length L of the scoring motif) are given K ratios of one and unacceptable nucleotides are PSAM, we have given a very small K ratio w . This seed matrix serves as the starting point a min Lw K ðS Þ d ref for a quasi-Newton numerical minimization of Equation 13 to ﬁnd the optimal ¼ w : ð14Þ jS ðjÞ mut K ðS Þ d mut PSAM. The new version of MatrixREDUCE uses a k-fold cross-validation to j¼1 determine the signiﬁcance of each discovered PSAM. After converging on a While the biological processes involved are considerably more complex, PSAM, the input data is split into k random subsets of array features with lacZ expression data can be employed to the same end. If we assume that associated sequences. The optimal PSAM is then used to seed each of k re-- b-galactosidase activity, concentration of b-galactosidase, the amount of optimizations of the PSAM. A t-value (Pearson correlation) for the goodness mRNA expressed, the speciﬁc recruitment of RNA polymerase to the of ﬁt is calculated for the optimal PSAM of each subset. Finally, the P-value promoter, and the promoter occupancy by the TF are all proportional to corresponding to the average t-value for the k re-optimizations is used to test each other, then relative K ’s are reﬂected in the ratio of b-galactosidase whether the originally optimized PSAM should be kept. This procedure does activities between the assay using the reference binding site and another not test the signiﬁcance of the optimal PSAM itself, but rather it tests whether assay using a different binding site. Thus, we used lacZ reporter expression the data contains widely distributed, explainable variance. Thus, false PSAMs assay data in a similar manner to EMSA-derived K data to conﬁrm the due to a few outliers are prevented. While not relevant to the current study, results of MatrixREDUCE. MatrixREDUCE can iteratively build a linear model of multiple PSAMs that Experimentally determined in vitro binding afﬁnities and lacZ reporter best explain a particular data set (see Foat et al., 2005). expression activity data were gathered from publications. The K data and The parameters for the runs of MatrixREDUCE were as follows: For lacZ expression data for Abf1 are from Gailus-Durner et al. (1996); K data all runs, the length of each of the two dyads of the seed motifs was three, for Leu3 are from Liu and Clarke (2002); and K data and lacZ expression the length of the added ﬂanks on each side of the dyad was three, the minimum data for Ndt80 and Sum1 are from Pierce et al. (2003). gap was zero, the k cross-validations were two, and w was 10 . For all runs min To compare the experimental K measurements with MatrixREDUCE on ChIP-chip and PBM data, the maximum acceptable P-value was 10 and PSAMs, all experimental K and lacZ expression data was ﬁrst converted to the maximum dyad gap was twenty. For all runs on DIP-chip data, the K ratios by normalizing with respect to the value of the highest afﬁnity maximum acceptable P-value was 10 and the maximum dyad gap was oligonucleotide. The K ratios were then log-transformed to obtain the DDG ten. For all runs on differential mRNA expression data, the maximum accept- values. MatrixREDUCE PSAMs for each TF were converted to DDG’s able P-value was 10 and the maximum dyad gap was eleven. relative to the highest afﬁnity oligonucleotide from the respective experiment. The sum of the DDG values was calculated for the best 3.4 Microarray and sequence data PSAM-matching window in each of the experimentally tested sequences. All microarray data was gathered from publication supplements. We chose If a sequence was shorter than the PSAM, the sum was taken over only the speciﬁc TFs to analyze based on the availability of experimental K data or best matching positions within the PSAM. All experimental DDG’s were crystal structure data. PSAMs were inferred by MatrixREDUCE for chro- then compared to the PSAM DDG’s by plotting and by calculating Pearson matin immunoprecipitation microarrays (ChIP-chip) using the microarray correlations. data and microarray feature sequences from Harbison et al. (2004). These BioProspector (Liu et al., 2001) and MDscan (Liu et al., 2002) are popular ChIP-chip experiments were performed under a variety of culture conditions, information theory-based methods for determination of TF binding speci- including rich media (YPD); sulfometuron methyl (SM), an inhibitor of ﬁcities. To compare the quality of the results from these methods with amino acid biosynthesis; and treatment with rapamycin (RAPA). PSAMs MatrixREDUCE results, position-speciﬁc scoring matrices (PSSMs) were were inferred for PBM experiments using the microarray data from Mukher- derived from BioProspector and MDscan outputs by calculating the frequen- jee et al. (2004) and the feature sequence data from Harbison et al. (2004) as cies of each base at each position in the putative binding sites and then the two studies used the same array features. PSAMs were inferred for Leu3 dividing by a background frequency for each respective base. Two different using the DIP-chip microarray data and feature sequences from Liu et al. background frequencies were tested: equal nucleotide probabilities and nuc- (2005). Liu et al. (2005) performed DIP-chip experiments using two different leotide probabilities for intergenic sequences in S. cerevisiae. Once the concentrations of Leu3, 4nM and 40nM, and PSAMs were inferred for each PSSMs had been created, they were tested against experimental EMSA concentration. The PSAM for Ndt80 was inferred from differential mRNA and lacZ data in the same manner as the MatrixREDUCE PSAMs above. expression microarray data measuring the sporulation response in a ndt80 deletion strain versus a wild-type strain (Chu et al., 1998). The sequence data 3.6 Structural modeling for the Ndt80 PSAM inference was the 800 bp upstream of every yeast gene, DNA binding afﬁnities and speciﬁcities of TFs are determined by the forces retrieved from the Saccharomyces Genome Database (Issel-Tarver et al., of electrostatics, solvation, the hydrogen bonding patterns, and shape 2002) and purged of redundant sequences as previously described (Foat complementarity at the binding interface. The magnitude of these contribu- et al., 2005). All microarray intensities were analyzed as the ratio of the tions to the binding free energy can in principle be calculated given a experimental sample intensity to the control sample intensity with the excep- structure of the protein bound to its cognate DNA site. Therefore, it should tion of the ndt80 deletion data, which was analyzed as the log -ratio. All be possible to predict PSAMs starting from the experimentally available microarray data was purged of extreme outliers before being analyzed by structure of the protein-DNA complex (solved by either X-ray diffraction or MatrixREDUCE (Grubbs’ test, P-value ¼ 10 ; Grubbs, 1969). NMR), or, in the absence of the exact structure, from a suitably constructed homology model. Under the assumption that the base pair energies 3.5 Gel shift and lacZ expression data contribute approximately independently to the total binding afﬁnity While prone to their own inaccuracies, experimentally measured in vitro (Benos et al., 2002), all one-point base pair mutations are introduced binding afﬁnities and changes in lacZ expression served as our ‘‘gold into the DNA binding site. Protein-DNA binding energies DG ¼ e144 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE G G G are then evaluated for each mutation. Mutations in number of contacts above which the maximum energy penalty E is protdna prot dna max the reference binding site result in changes of protein-DNA binding energies imposed. f ðN Þ and f ðN Þ are ﬁxed prefactors deﬁned in Morozov max max 1 2 (DDG; Equation 7). A table of DDG values can be used to construct a PSAM et al. (2005). E together with N constitute the free parameters of the max max that is directly comparable with MatrixREDUCE predictions. contact model and are adjusted simultaneously to maximize the fraction of We have previously developed two alternative approaches for predicting correct predictions and minimize the average error over the DDG data set TF binding afﬁnities and speciﬁcities starting from the protein-DNA identical to that used in ﬁtting the all-atom model. The fraction of correct structure (Morozov et al., 2005). In one approach, the ‘‘all atom model’’ predictions is based on a binary function: a prediction is considered to be (which builds on the ROSETTA protein-nucleic acid interaction model of correct if both computational and experimental DDG’s are less than 1.0 kcal/ Havranek et al., 2004), both direct and indirect readout mechanisms con- mol, or greater than 1.0 kcal/mol, or else separated by less than 0.3 kcal/mol. tribute to the recognition of the DNA binding site: The global minimum for the ﬁt is found by exhaustive search; the best ﬁt is DG ¼ DG þ DG . Direct readout is mediated by protein amino obtained with N ¼ 15, E ¼ 3:0 kcal/mol. direct indirect max max acid-DNA base interactions, while indirect readout is encoded in the 3.7 Affinity logos shape of the DNA site imparted by the bound protein, primarily through non-speciﬁc amino acid-DNA phosphate backbone contacts. Direct protein- Information content-based weight matrices are usually displayed as DNA interactions are modeled as a linear combination of the repulsive and sequence logos (Schneider and Stephens, 1990) However, MatrixREDUCE attractive parts of the Lennard-Jones potential, the orientation-dependent weight matrices are discovered without a background sequence model. Thus, hydrogen bonding potential (Kortemme et al., 2003), and the Generalized an appropriate logo should display the actual relative free energies of binding Born electrostatics and solvation model (Onufriev et al., 2004): for each nucleotide at each position rather than information content. There- fore, we created afﬁnity logos, which are constructed as follows: For each DG ¼ w E þ w E direct LJrep LJrep LJattr LJattr ð15Þ position in the PSAM, the average DDG is calculated. Then, the difference þ w E þ w G ‚ hb hb el el between each individual DDG and the average DDG at that position is where each term is a sum over all protein-DNA and protein-protein atomic computed; the absolute value of this difference is the height of the character pairs, and {w} is a set of ﬁtting weights. Indirect readout is modeled using an representing that nucleotide. If the difference is positive (more favorable effective harmonic representation of the DNA conformational energy (Olson than average), the letter is placed above a horizontal black line through the et al., 1998): center of the logo. If the difference is negative (less favorable than average) X X the letter is placed below the black line. Larger letters are stacked on smaller ab ab DG ¼ w E þ w E ‚ ð16Þ indirect dnabp dnabs dnabp dnabs letters moving outward from the black line. The height of the letter can be bp bs interpreted as free energy difference from the average in units of RT. Thus, where the ﬁrst sum is over all base pairs in the DNA site (a, b denote bases in an intuitive high amplitude is given to the nucleotide positions that most a base pair), and the second sum is over all consecutively stacked base pair contribute to the sequence speciﬁcity of the TF. To highlight that the steps (a, b denote base pairs in a base step). Base pairs and base steps are characters representing the high afﬁnity nucleotides are above the black 0 0 counted once in the 5 to 3 direction. The ﬁrst term penalizes deviations line, the characters representing the low afﬁnity nucleotides are made from canonical base pairing, while the second term captures base stacking partially transparent. However, maintaining the representation of the poor energies. The quadratic energy terms are given by: afﬁnity nucleotides below the center line allows the viewer to immediately see which nucleotide substitutions are most unfavorable to binding. 6 6 X X ab ab a b E ¼ f d d ‚ ð17Þ dnabs/dnabp ij i j i¼1 j¼1 3.8 PSAM to PSAM alignments and correlations By inspection of afﬁnity logos, one can make qualitative observations about where the sums run over six geometric degrees of freedom (Twist, Tilt, the similarity between any two PSAMs. However, a quantitative measure of Roll, Shift, Slide and Rise for base pair steps; Opening, Buckle, Propeller, similarity allows for more objective comparisons. Before two PSAMs can be Shear, Stretch and Stagger for base pairs; Lu and Olson, 2003). The DNA compared, they must ﬁrst be aligned. Pearson correlations were calculated potential is a quadratic expansion in d (deviations of the degrees of free- between the DDG values for each nucleotide at each position for every dom from their average values computed using a set of non-homologous possible overlap of the two PSAMs for both the forward and the reverse protein-DNA complexes). The force constants f are evaluated by inverting ij complement alignments. After the best overlap position and strand was the covariance matrix of d obtained with the same protein-DNA dataset: determined from the best correlation P-value, the DDG’s of the two f ¼hd d i. All six weights are simultaneously ﬁt to experimental DDG i j ij PSAMs were recentered relative to a common reference consensus data using a generalized linear model (implemented in the statistical soft- ware package R): ðw ‚w ‚w ‚w ‚w ‚w Þ¼ sequence. Finally, the P-value for the Pearson correlation between the LJrep LJattr hb el dnabp dnabs ð0:00‚0:46‚0:77‚0:27‚ 0:03‚0:03Þ. No conformational ﬂexibility is allowed two optimally aligned and transformed PSAMs was calculated and subjected at the protein-DNA interface. Further details on the ﬁtting procedure and to a Bonferroni correction for the number of alignments that were tested. comprehensive tests of the all-atom free energy function can be found in Morozov et al. (2005). 4 RESULTS In another approach, we developed a ‘‘contact model’’ that exploits the structure of the protein-DNA complex bound to a high afﬁnity reference 4.1 PSAMs inferred by MatrixREDUCE agree well DNA sequence but does not require detailed predictions of protein-DNA with experimental measurements of TF binding interaction energies. In the contact model each mutated base in the PSAM affinity column i incurs equal energy cost relative to the consensus base from the We discovered the position speciﬁc afﬁnity matrices (PSAMs) reference sequence: for the Saccharomyces cerevisiae TFs Rap1, Ndt80, Gcn4, E ½f ðN Þ log ð1 N/N Þ < max max max Leu3, Abf1, and Sum1 by applying MatrixREDUCE to genome- DDG ðNÞ¼ f ðN Þ log ð1 þ 3N/N Þ ðN < N Þ ð18Þ max max max wide TF occupancy data and, in the case of Ndt80, differential E ðN N Þ max max mRNA expression microarray data (Figure 2A). Experimental measurements of relative K ’s (EMSA or lacZ expression) for Here, N is the number of protein amino acid-DNA base atomic contacts d summed over the base pair i (atomic contact is deﬁned by a distance of less speciﬁc oligonucleotides were available for Abf1, Leu3, Ndt80, than 4.5 s; hydrogen atoms are excluded from the counts), and N is the and Sum1. EMSA has long been employed to determine the max e145 B.C.Foat et al. Fig. 2. Comparison of PSAMs—affinity logos and correlations. (A) The PSAMs represented in the columns with blue headers were inferred by Matrix-REDUCE from ChIP-chip, PBM, DIP-chip, or mRNA differential expression microarray data. YPD (rich media), SM (sulfometuron methyl), and RAPA (rapamycin) refer to the environmental conditions to which the test sample was exposed before the ChIP-chip experiment. The DIP-chip experiments were performed with two different concentrations of Leu3, 4nM and 40nM. An ndt80 deletion (ndt80D) versus wild-type mRNA expression experiment (mRNA) was used to obtain the Ndt80 PSAM. The PSAMs represented in the columns green headers were inferred by modeling TF-DNA interactions based on crystal structures of the TFs using two different methods, a contact-only model and an all atom model. (B) All PSAMs for each TF were aligned pairwise and the Pearson correlation between the DDG values of both PSAMs for the best alignment was calculated. The P-value for this correlation is a measure of similarity between the PSAMs. Again, blue labels indicate PSAMs inferred by MatrixREDUCE PSAMs and green labels indicate structurally inferred PSAMs. DNA-binding afﬁnities of TFs in vitro. Likewise, the lacZ reporter experimental method (EMSA or lacZ), and TF, we compared the assay has long been used to measure the difference in activities experimental DDG with the DDG predicted from a PSAM for the of TF binding sites. We claim that a PSAM inferred by same TF (Figure 3). In every case, the experimental DDG values MatrixREDUCE from genome-wide TF occupancy data can be strongly correlated with the PSAM-predicted DDG values, with R ’s used to predict the relative binding afﬁnities of the measured TF ranging from 0.36 to 0.88. Thus, PSAMs inferred by MatrixRE- to any sequence. Therefore, EMSA and lacZ expression data pro- DUCE seem to be good models of the true relative DNA binding vide nearly ideal data sets for validation of the MatrixREDUCE afﬁnities of the corresponding TFs. Unexpectedly, all of the regres- approach. For each combination of experimentally tested sequence, sions of experimental DDG’s on MatrixREDUCE DDG’s have e146 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE Fig. 3. Comparison of experimentally measured DDG’s with MatrixREDUCE PSAM-predicted DDG’s. Experimental measurements of DDG’s were derived from EMSA (A) and lacZ reporter assays (B). The experimental DDG values are plotted along the vertical axes. Predicted DDG’s were calculated from the PSAM for each tested TF for the same oligonucleotide sequences that were measured in each experiment. The MatrixREDUCE-predicted DDG values are plotted along the horizontal axes. In this representation, the higher affinity oligonucleotides have more positive DDG’s. The diagonal dashed line represents experimental DDG equal to MatrixREDUCE DDG. DDG’s are in units of RT, where R is the gas constant and T is the temperature. The R and P-values for the Pearson correlations between the experimental and predicted DDG’s are presented for each PSAM-experimental data pair. slopes less than one (range: 0.31 to 0.94). It seems that MatrixRE- of the binding interface based only on the genomic sequence and DUCE produces a slightly larger range of predicted DDG’s than is genome-wide TF occupancy data. realized in experiments. Nonetheless, the MatrixREDUCE PSAM- Gcn4 is a TF of the bZIP class. It is a homodimer with the basic predicted DDG’s are close to the experimentally inferred DDG’s in region mediating sequence speciﬁc DNA binding and the leucine most cases, especially among the highest afﬁnity sequences. zipper region required for dimerization (O’Shea et al., 1991). For deriving the Gcn4 structural PSAMs, we used a 2.9 s crystal structure of the TF bound to the ATGAGTCAT site (Ellenberger 4.2 PSAMs inferred by MatrixREDUCE agree well et al., 1992). The symmetry of the binding site (two reverse with PSAMs inferred by structural models complement 4 bp half-sites separated by G in the middle) is a reﬂection of the homodimeric binding and is captured well in Both genome-wide TF occupancy data and crystal structures of MatrixREDUCE predictions. While contact model and Matrix- protein-DNA complexes are available for Ndt80, Gcn4, and REDUCE predictions are similar, the all-atom model is less Rap1. Thus, we were able to compare MatrixREDUCE PSAMs successful, probably due to the low resolution of the crystal struc- with those based on ab initio structural models (see Methods; ture, which leads to considerable uncertainty in side chain positions Figure 2A). The structurally inferred PSAMs for Ndt80 were obtained from its co-crystal structure bound to a high afﬁnity with respect to the neighboring DNA bases. GACACAAAA site, solved at 1.4 s resolution (Lamoureux Finally, Rap1 binds DNA as a homodimer in a way that makes its et al., 2002). Figure 2A shows a reasonable agreement between DNA site a tandem repeat. The crystal structure of the Rap1 homod- DDG predictions carried out with MatrixREDUCE and structural imer in complex with a telomeric DNA site has been solved to models. The close correspondence with the contact model, which is 2.25 s resolution (Konig et al., 1996). Comparison of Matrix- a function of the number of protein side chains in contact with DNA REDUCE PSAMs and structural PSAMs reveals good agreement base pairs, is especially remarkable, showing that the Matrix- with the all atom model. The contact model overpredicts binding REDUCE approach is capable of reproducing structural details speciﬁcity at the intermediate positions in the binding site (located e147 B.C.Foat et al. bases provide an estimate of a PSAM in the form of a position- speciﬁc scoring matrix (PSSM). Since we had already compiled the EMSA and lacZ expression data, we had the opportunity to experi- mentally verify the results of these PSSMs. We gathered the BioProspector and MDscan results from the original, published analyses, transformed them into PSSMs, and used them to predict DDG’s for the EMSA and lacZ experimentally tested sequences. We performed this comparison using two different ‘‘background’’ nucleotide frequency models: one using equal nuc- leotide probabilities and one using nucleotide probabilities derived from S. cerevisiae intergenic sequences. The R and P-values for the Fig. 4. Correlations of experimentally measured DDG’s with information correlations between these predicted DDG’s and the experimental theory-predicted DDG’s. Experimental measurements of DDG’s were derived 2 results are displayed in Figure 4. Overall, the quality of the results from EMSA and lacZ reporter assays. The R and P-values for the Pearson from the information theory PSSMs and the MatrixREDUCE correlations between the experimental and predicted DDG’s are presented for PSAMs were similar. However, the results for the PSSMs are dif- each PSSM-experimental data pair. PSSMs were derived and tested using ferent depending on the choice of equal or intergenic nucleotide two different background nucleotide frequencies: equal probabilities and frequencies. While we did not test this scenario, the information intergenic probabilities. theory results would also change depending on the probe intensity threshold chosen to label genes as ‘‘bound.’’ Thus, while Matrix- between tandem repeats), likely because it assigns similar REDUCE performs comparably with existing information theory speciﬁcities to protein-DNA contacts in the loop region and in methods, it conveniently avoids having to choose several ad hoc the DNA binding domains. parameters required by the other methods. 4.3 PSAM to PSAM correlations 5 DISCUSSION Upon visual inspection of Figure 2A, the similarities are immedi- Overall, position speciﬁc afﬁnity matrices (PSAMs) as inferred by ately apparent between afﬁnity logos for the same factor inferred MatrixREDUCE from genome-wide TF occupancy data are good using different experimental and computational methods. However, approximations of the real sequence-speciﬁc DNA binding afﬁni- a quantitative measure of these similarities can be obtained by ties. Discrepancies between the computationally predicted and the aligning the PSAMs (see Methods) and calculating the correlation experimentally inferred binding afﬁnities may be due to either the of their DDG values. The P-value for this correlation between two computational or the experimental methods. EMSA has known PSAMs serves as our similarity metric (Figure 2B). Overall, the problems with ‘‘caging’’ of the TFs by the gel while electrophoresis similarity between the PSAMs from MatrixREDUCE are the most is proceeding (Fried and Crothers, 1981). This could lead to inferred signiﬁcant. There is extreme similarity between the Rap1 PSAMs DDG’s of smaller magnitude. Likewise, lacZ reporter assays are a inferred from ChIP-chip and PBM data. The PSAMs inferred for very indirect way of measuring relative binding afﬁnities as they Gcn4 for three different ChIP-chip conditions are all very similar as require transcription, translation, and b-galactosidase reactions in well. The Leu3 PSAMs inferred from the DIP-chip data are much order to make measurements, and noise could be introduced at each more similar to each other than they are to the Leu3 PSAM inferred step. Structural model predictions are strongly dependent on the from the ChIP-chip data, but they are still both signiﬁcantly similar quality of input structures and are affected by errors in the energy (P-value < 0.01) to the ChIP-chip Leu3 PSAM. The signiﬁcance of function. The current MatrixREDUCE model may also give rise to the correlations between MatrixREDUCE PSAMs and structurally systematic biases. First, it makes the approximation that nucleotides inferred PSAMs is more variable. Both the all atom model PSAM contribute independently to the free energy of TF binding (Benos and the contact model PSAM for Rap1 and Ndt80 have signiﬁcant et al., 2002). Second, it makes the assumption that the concentration similarities with the respective MatrixREDUCE PSAMs (P-value < of TF is much smaller than the K , which may not be correct for 0.01). However for Gcn4, while the contact model PSAM has strong some TFs. Finally, all consecutive positions in the PSAM are cur- similarities with all of the other PSAMs, the all atom PSAM has rently treated as parameters to be estimated, which may lead to insigniﬁcant similarities with all other PSAMs. overﬁtting. We plan to address these issues in a future version of the algorithm. 4.4 How good is the information theory Despite these current limitations, PSAMs discovered using the approximation? current implementation of MatrixREDUCE are good approx- In the original papers describing the ChIP-chip (Harbison et al., imations of the relative nucleotide binding afﬁnities of assayed 2004), PBM (Mukherjee et al., 2004), and DIP-chip (Liu et al., TFs. Especially for microarray methods like PBM and DIP-chip, 2005) data, the authors used BioProspector (Liu et al., 2001) or where the objective is to deﬁne nucleotide-binding speciﬁcities, MDscan (Liu et al., 2002) to deﬁne weight matrix representations of MatrixREDUCE may be the most physically accurate method TF binding sites. These two methods use the set of sequences that available to analyze the data. Even for less direct reﬂections of the experimenters label as ‘‘bound’’ to produce a list of potential TF binding afﬁnities like ChIP-chip or differential mRNA expres- binding sites. When interpreted through information theory, the sion data, it will still provide good approximations of the sequence- nucleotide frequencies at each individual position in the binding speciﬁc binding afﬁnities of TFs relevant to those data. Preliminary sites divided by the ‘‘background’’ frequencies for the respective results also suggest that MatrixREDUCE performs well on data e148 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE from higher eukaryotes including D. melanogaster and mammals. Konig,P., Giraldo,R., Chapman,L. and Rhodes,D. (1996) The crystal structure of the DNA-binding domain of yeast RAP1 in complex with telomeric DNA. Cell, 85, Finally, MatrixREDUCE has two key advantages over most other 125–136. computational methods for deﬁning nucleotide binding speciﬁc- Kortemme,T., Morozov,A.V. and Baker,D. (2003) An orientation-dependent hydrogen ities: (i) it uses the information for all probes in genome-wide bonding potential improves prediction of speciﬁcity and structure for proteins and TF occupancy data, and (ii) it does not require a background protein-protein complexes. J. Mol. Biol., 326, 1239–1259. Lamoureux,J.S., Stuart,D., Tsang,R., Wu,C. and Glover,J.N.M. (2002) Structure of sequence model. the sporulation-speciﬁc transcription factor Ndt80 bound to DNA. EMBO J., 21, 5721–5732. ACKNOWLEDGMENTS Lee,T.I., Rinaldi,N.J., Robert,F., Odom,D.T., Bar-Joseph,Z., Gerber,G.K., Hannett,N.M., Harbison,C.T., Thompson,C.M., Simon,I., Zeitlinger,J., This work was supported by National Institutes of Health grants Jennings,E.G., Murray,H.L., Gordon,D.B., Ren,B., Wyrick,J.J., Tagne,J.B., HG003008 (to H.J.B) and CA121852. A.V.M. is a fellow of the Volkert,T.L., Fraenkel,E., Gifford,D.K. and Young,R.A. (2002) Transcriptional Leukemia and Lymphoma Society. The authors thank Jason Lieb, regulatory networks in Saccharomyces cerevisiae. Science, 298, Neil Clarke, Martha Bulyk, Ernest Fraenkel, Xiao Liu, and Michael 799–804. Lieb,J.D., Liu,X., Botstein,D. and Brown,P.O. (2001) Promoter-speciﬁc binding of Berger for graciously answering questions and supplying additional Rap1 revealed by genome-wide maps of protein-DNA association. Nat. Genet., data from their publications. The authors also thank Gabor Halasz, 28, 327–334. Ronald Tepper, Junbai Wang, Xiang-Jun Lu, and Lucas Ward for Liu,X.S., Brutlag,D.L. and Liu,J.S. (2002) An algorithm for ﬁnding protein-dna valuable conversations, and Xiang-Jun Lu for reimplementing the binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol., 20, 835–839. MatrixREDUCE software for distribution. Liu,X., Brutlag,D.L. and Liu,J.S. (2001) Bioprospector: discovering conserved dna motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. REFERENCES Biocomput. 127–138. Liu,X. and Clarke,N.D. (2002) Rationalization of gene regulation by a eukaryotic Benos,P.V., Bulyk,M.L. and Stormo,G.D. (2002) Additivity in protein-DNA interac- transcription factor: calculation of regulatory region occupancy from predicted tions: how good an approximation is it? Nucleic Acids Res., 30, 4442–4451. binding afﬁnities. J. Mol. Biol., 323, 1–8. Berg,O.G. and von Hippel,P.H. (1987) Selection of DNA binding sites by regulatory Liu,X., Noll,D.M., Lieb,J.D. and Clarke,N.D. (2005) DIP-chip: rapid and accurate proteins. Statistical-mechanical theory and application to operators and promoters. determination of DNA-binding speciﬁcity. Genome Res., 15, 421–427. J. Mol. Biol., 193, 723–750. Lu,X.J. and Olson,W.K. (2003) 3DNA: a software package for the analysis, rebuilding Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O. and Herskowitz,I. and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res., (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 31, 5108–5121. 699–705. Morozov,A.V., Havranek,J.J., Baker,D. and Siggia,E.D. (2005) Protein-DNA Djordjevic,M., Sengupta,A.M. and Shraiman,B.I. (2003) A biophysical approach to binding speciﬁcity predictions with structural models. Nucleic Acids Res., 33, transcription factor binding site discovery. Genome Res., 13, 2381–2390. 5781–5798. Djordjevic,M. and Sengupta,A.M. (2006) Quantitative modeling and data analysis of Mukherjee,S., Berger,M.F., Jona,G., Wang,X.S., Muzzey,D., Snyder,M., Young,R.A. SELEX experiments. Phys. Biol., 3, 13–28. and Bulyk,M.L. (2004) Rapid analysis of the DNA-binding speciﬁcities of Ellenberger,T.E., Brandl,C.J., Struhl,K. and Harrison,S.C. (1992) The GCN4 basic transcription factors with DNA microarrays. Nat. Genet., 36, 1331–1339. region leucine zipper binds DNA as a dimer of uninterrupted alpha helices: crystal Olson,W.K., Gorin,A.A., Lu,X.J., Hock,L.M. and Zhurkin,V.B. (1998) DNA sequence- structure of the protein-DNA complex. Cell, 71, 1223–1237. dependent deformability deduced from protein-DNA crystal complexes. Proc. Endres,R.G., Schulthess,T.C. and Wingreen,N.S. (2004) Toward an atomistic model Natl. Acad. Sci. USA, 95, 11163–11168. for predicting transcription-factor binding sites. Proteins, 57, 262–268. Onufriev,A., Bashford,D. and Case,D.A. (2004) Exploring protein native states and Foat,B.C., Houshmandi,S.S., Olivas,W.M. and Bussemaker,H.J. (2005) Proﬁling large-scale conformational changes with a modiﬁed Generalized Born model. condition-speciﬁc, genome-wide regulation of mRNA stability in yeast. Proc. Proteins, 55, 383–394. Natl. Acad. Sci. USA, 102, 17675–17680. O’Shea,E.K., Klemm,J.D., Kim,P.S. and Alber,T. (1991) X-ray structure of the GCN4 Fried,M. and Crothers,D.M. (1981) Equilibria and kinetics of lac repressor-operator leucine zipper, a two-stranded, parallel coiled coil. Science, 254, 539–544. interactions by polyacrylamide gel electrophoresis. Nucleic Acids Res., 9, Paillard,G. and Lavery,R. (2004) Analyzing protein-DNA recognition mechanisms. 6505–6525. Structure, 12, 113– 122. Gailus-Durner,V., Xie,J., Chintamaneni,C. and Vershon,A.K. (1996) Participation of Pierce,M., Benjamin,K.R., Montano,S.P., Georgiadis,M.M., Winter,E. and the yeast activator Abf1 in meiosis-speciﬁc expression of the HOP1 gene. Mol. Vershon,A.K. (2003) Sum1 and Ndt80 proteins compete for binding to middle Cell. Biol., 16, 2777–2786. sporulation element sequences that control meiotic gene expression. Mol. Cell. Granek,J.A. and Clarke,N.D. (2005) Explicit equilibrium modeling of transcription- Biol., 23, 4814–4825. factor binding and gene regulation. Genome Biol., 6, R87. Ren,B., Robert,F., Wyrick,J.J., Aparicio,O., Jennings,E.G., Simon,I., Zeitlinger,J., Grubbs,F. (1969) Procedures for detecting outlying observations in samples. Techno- Schreiber,J., Hannett,N., Kanin,E., Volkert,T.L., Wilson,C.J., Bell,S.P. and metrics, 11, 1–21. Young,R.A. (2000) Genome-wide location and function of DNA binding proteins. Harbison,C.T., Gordon,D.B., Lee,T.I., Rinaldi,N.J., Macisaac,K.D., Danford,T.W., Science, 290, 2306–2309. Hannett,N.M., Tagne,J.B., Reynolds,D.B., Yoo,J., Jennings,E.G., Zeitlinger,J., Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display Pokholok,D.K., Kellis,M., Rolfe,P.A., Takusagawa,K.T., Lander,E.S., consensus sequences. Nucleic Acids Res., 18, 6097–6100. Gifford,D.K., Fraenkel,E. and Young,R.A. (2004) Transcriptional regulatory Simon,I., Barnett,J., Hannett,N., Harbison,C.T., Rinaldi,N.J., Volkert,T.L., Wyrick,J.J., code of a eukaryotic genome. Nature, 431, 99–104. Zeitlinger,J., Gifford,D.K., Jaakkola,T.S. and Young,R.A. (2001) Serial regulation Havranek,J.J., Duarte,C.M. and Baker,D. (2004) A simple physical model for the of transcriptional regulators in the yeast cell cycle. Cell, 106, 697–708. prediction and design of protein-DNA interactions. J. Mol. Biol., 344,59–70. Stormo,G.D. and Fields,D.S. (1998) Speciﬁcity, free energy and information content in Issel-Tarver,L., Christie,K.R., Dolinski,K., Andrada,R., Balakrishnan,R., Ball,C.A., protein-DNA interactions. Trends Biochem. Sci., 23, 109–113. Binkley,G., Dong,S., Dwight,S.S., Fisk,D.G., Harris,M., Schroeder,M., Stormo,G.D., Schneider,T.D. and Gold,L. (1986) Quantitative analysis of the Sethuraman,A., Tse,K., Weng,S., Botstein,D. and Cherry,J.M. (2002) Saccha- relationship between nucleotide sequence and functional activity. Nucleic Acids romyces Genome Database. Methods Enzymol., 350, 329–346. Res., 14, 6661–6679. Iyer,V.R., Horak,C.E., Scafe,C.S., Botstein,D., Snyder,M. and Brown,P.O. (2001) Stormo,G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, Genomic binding sites of the yeast cell-cycle transcription factors SBF and 16, 16–23. MBF. Nature, 409, 533–538. e149

Journal

Bioinformatics – Oxford University Press

Published: Jul 15, 2006

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE

Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE

Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE

References (40)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies