Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE

Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE Vol. 22 no. 14 2006, pages e141–e149 BIOINFORMATICS doi:10.1093/bioinformatics/btl223 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE 1 2 1,3, Barrett C. Foat , Alexandre V. Morozov and Harmen J. Bussemaker 1 2 Department of Biological Sciences, Columbia University, New York, NY 10027, USA, Center for Studies in Physics and Biology, The Rockefeller University, New York, NY 10021, USA and Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10032, USA sidered to be bound by the TF (for review see Stormo, 2000). These ABSTRACT methods use the information content of nucleotide patterns as a Motivation: Regulation of gene expression by a transcription factor proxy for the free energy contributions of the bases found in the requires physical interaction between the factor and the DNA, which TF binding site (Berg and von Hippel, 1987; Stormo and Fields, can be described by a statistical mechanical model. Based on this 1998). Other computational methods infer physically-based TF model, we developed the MatrixREDUCE algorithm, which uses binding specificities from measured TF binding affinities for a genome-wide occupancy data for a transcription factor (e.g. ChIP- small set of oligonucleotides (Liu and Clarke, 2002) or from struc- chip) and associated nucleotide sequences to discover the sequence- tural modeling of protein-DNA interaction (Paillard and Lavery, specific binding affinity of the transcription factor. Advantages of our 2004; Endres et al., 2004; Morozov et al., 2005). However, genome- approach are that the information for all probes on the microarray is scale, quantitative measurements of TF occupancies of intergenic efficiently utilized because there is no need to delineate ‘‘bound’’ and regions are now available due to the advent of in vivo chromatin ‘‘unbound’’ sequences, and that, unlike information content-based immunoprecipitation microarrays (Ren et al., 2000; Iyer et al., methods, it does not require a background sequence model. 2001; Lieb et al., 2001; Simon et al., 2001; Lee et al., 2002; Results: We validated the performance of MatrixREDUCE by inferring Harbison et al., 2004), in vitro protein binding microarrays the sequence-specific binding affinities for several transcription factors (PBM; Mukherjee et al., 2004), and DNA immunoprecipitation in S. cerevisiae and comparing the results with three other independent microarrays (DIP-chip; Liu et al., 2005). Thus, it is no longer sources of transcription factor sequence-specific affinity information: necessary to rely on small data sets, availability of protein-DNA (i) experimental measurement of transcription factor binding affinities structures, or the analogy between information content and statis- for specificoligonucleotides,(ii) reporter geneassays forpromoterswith tical mechanics to infer free energy representations of transcription systematically mutated binding sites, and (iii) relative binding affinities factor binding sites. obtained by modeling transcription factor-DNA interactions based on We have developed a method, implemented as the program co-crystal structures of transcription factors bound to DNA substrates. MatrixREDUCE (Foat et al., 2005), that infers the sequence spe- We show that transcription factor binding affinities inferred by cificity of a TF directly and accurately from genome-wide TF occu- MatrixREDUCE are in good agreement with all three validating pancy data by fitting a statistical mechanical model for TF-DNA methods. interaction (Figure 1). The sequence specificity of the TF’s DNA- Availability: MatrixREDUCE source code is freely available for binding domain is modeled using a position-specific affinity matrix non-commercial use at http://www.bussemakerlab.org/. The software (PSAM), representing the change in the binding affinity (K ) when- runs on Linux, Unix, and Mac OS X. d ever a specific position within a reference binding sequence is Contact: [email protected] mutated. To validate the physical model of MatrixREDUCE, we discovered the PSAMs for several TFs in S. cerevisiae and com- 1 INTRODUCTION pared the results with three other independent sources of TF sequence-specific affinity information: (i) experimentally measured The sequence-specific regulatory activity of a transcription factor K ’s as determined by in vitro methods (Gailus-Durner et al., 1996; (TF) is the result of energetically favorable interactions between the Liu and Clarke, 2002; Pierce et al., 2003), (ii) lacZ reporter assays amino acids exposed in the DNA binding domain and portions of for promoters with systematically mutated binding sites (Gailus- nucleic acid bases exposed in the grooves of the DNA. A compu- Durner et al., 1996; Pierce et al., 2003), and (iii) relative K ’s tational method for discovering the binding specificity of a TF obtained by using a physical model of protein-DNA interaction cannot provide a quantitative description of TF binding unless it that makes binding affinity predictions starting from a co-crystal considers the physical underpinnings of the TF-DNA interaction. structure of the protein-DNA complex (Morozov et al., 2005). We Most physically motivated computational methods discover over- find a surprising level of agreement between MatrixREDUCE- represented patterns in a set of nucleotide sequences that are con- predicted TF binding affinities, experimental measurements, and To whom correspondence should be addressed. structural predictions, suggesting that MatrixREDUCE is a The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] B.C.Foat et al. Fig. 1. The flow of data. A microarray measurement of TF occupancies (ChIP-chip, PBM, DIP-chip, or differential mRNA expression data) and relevant nucleotide sequences for each microarray feature are used as input to MatrixREDUCE. MatrixREDUCE performs a least-squares fit to a statistical-mechanical model of TF-DNA interaction to discover the relative contributions to the free energy of binding for each nucleotide at each position in the generalized TF binding site. These contributions are represented as a position specific affinity matrix (PSAM) containing the relative equilibrium constants of the TF-DNA interaction, with the highest affinity nucleotide at each position scaled to a value of one (DDG ¼ 0). The PSAM can be converted into an affinity logo that graphically represents the DDG’s for each nucleotide at each position relative to the averageDDG at the respective positions. The PSAM can also be used to predict the relative TF occupancy of any nucleotide sequence, allowing the PSAMs inferred by MatrixREDUCE to be compared with experimental measurements of TF binding affinities for particular oligonucleotides. powerful and accurate tool for the elucidation of physically accurate data were b-galactosidase activities for genes containing mutated TF sequence-specific binding affinities. binding sites of the E. coli su2 amber stop codon suppressor. A similar type of analysis was performed by Liu and Clarke (2002) who fit a physical model for transcription factor binding to elec- 2 RELATED WORK trophoretic mobility shift assay (EMSA) data measuring affinity of In contrast to information theory-based methods of defining the S. cerevisiae Leu3 TF for several oligonucleotides. The physical nucleotide-binding protein specificities, MatrixREDUCE belongs model behind MatrixREDUCE is the same as that employed by to a small but growing class of methods that infer binding affinities Stormo et al. (1986) and Liu and Clarke (2002). However, our by directly fitting a physical model to experimental data. The first ‘‘quantitative data’’ are microarray probe intensities, which mea- such method was introduced by Stormo et al. (1986) who noted, sure TF occupancy over long chromosomal regions with unknown ‘‘When quantitative data are known for many sequences one can binding site locations. Thus, the MatrixREDUCE model integrates the binding signal over the entire length of the sequence. The solve for the matrix elements that give the best fit between the GOMER method of Granek and Clarke (2005) performs a similar sequences and those data.’’ For Stormo et al. (1986) the quantitative e142 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE The occupancy NðU Þ for the entire promoter region U of gene g equals the integration of signal over long regulatory sequences relevant to g g sum of occupancies for each binding site window of length L at each measured microarray intensities. However, GOMER was only position i over the length L of the sequence U : g g used to test hypotheses about the regulatory mechanisms of TFs L L þ1 for which a binding site weight matrix had already been defined by g w L X Y NðU Þ¼ ½PK ðS Þ w ‚ ð9Þ g a ref jU ði þ j  1Þ other methods. Granek and Clarke (2005) did not attempt to fit the g i¼1 j¼1 GOMER model directly to experimental data to infer the binding affinities of TFs. Finally, also of note are the QPMEME algorithm of where U ðiÞ is the base at position i in sequence U . g g Djordjevic et al. (2003) and the work of Djordjevic and Sengupta (2006) which use maximum likelihood procedures to infer PSAMs 3.2 Modeling genome-wide TF occupancy data by fitting physical models to known TF binding sites and SELEX Recent technologies such as ChIP-chip (Ren et al., 2000; Iyer et al., 2001; data, respectively, but rely on the prior delineation of ‘‘bound’’ Lieb et al., 2001; Simon et al., 2001; Lee et al., 2002; Harbison et al., 2004), sequences. PBM (Mukherjee et al., 2004), and DIP-chip (Liu et al., 2005) provide indirect but quantitative information about the TF occupancy of large genomic regions. For each segment of DNA there are two microarray inten- 3 METHODS test test sities. The test intensity I is equal to a background intensity a plus a 3.1 Modeling TF-DNA interaction term that, to first approximation, is proportional (g) to the occupancy NðU Þ by the TF, either because the amount of TF bound to the probe contributes We will develop the statistical-mechanical model used by MatrixREDUCE directly to the signal intensity (PBM) or because it determines the proportion starting with a transcription factor P that binds to a DNA sequence S to form at which an immunoprecipitated TF-DNA fragment is present in the sample the TF-DNA complex PS: control (ChIP-chip or DIP-chip). The control intensity I is only the result of on control background signal a . Allowing for experimental noise e , we obtain: P þ S  PS ð1Þ test test off gNðU Þþ a g g ¼ þ e  bNðU Þþ C þ e ð10Þ g g g control control The affinity of the TF for the sequence can be expressed in terms of its a equilibrium dissociation constant K ðSÞ: Using Equation 9 for the occupancy NðU Þ, we obtain: ½P½S k off DG/RT K ðSÞ¼ ¼ ¼ e ‚ ð2Þ L L þ1 test g w L ½PS k w on X Y ¼ F w þ C þ e ‚ ð11Þ jU ði þ j  1Þ g control g which is directly related to DG, the Gibbs free energy of binding per mole g i¼1 j¼1 (R is the gas constant and T is temperature). The occupancy N(S) of sequence S by transcription factor P can be expressed as the concentration of TF-DNA where complex divided by the total concentration of DNA (bound or unbound): F ¼ b½PK ðS Þ: ð12Þ a ref ½PS ½P NðSÞ¼ ¼ : ð3Þ Note that b,[P], and K ðS Þ cannot be determined separately without a ref ½PSþ½S ½Pþ K ðSÞ additional information such as the real protein concentration or K ðS Þ. a ref For simplicity, we will assume that the TF concentration [P] is much smaller MatrixREDUCE discovers the set of w elements as well as F and C by jb than K ðSÞ. This assumption seems physiologically plausible because in this performing a least squares fit to the measured intensity ratios: regime, the highest affinity binding sites in the genome will be the most responsive to a change in the nuclear concentration of active TF. Thus, the ðC‚F‚fw gÞ ¼ argmin jb C‚ F‚ fw g jb occupancy becomes: ½P NðSÞ ¼½PK ðSÞ‚ ð4Þ L L þ1 a test g w L X X Y K ðSÞ F w C : ð13Þ jU ði þ j  1Þ control g where g g i¼1 j¼1 K ðSÞ K ðSÞ: ð5Þ The 4 · L matrix of K ratios w (3L parameters plus L reference w a jb w w Consider a single point mutation from the original reference sequence S ref nucleotide values) for all nucleotides at all positions in the binding site is to base b at position j resulting in the mutated sequence S . Such a mutation mut referred to as the position specific affinity matrix (PSAM). Each position j in will give rise to an additive change DDG in the free energy of binding or, the PSAM is rescaled such that the largest w is equal to unity, without loss jb equivalently, a multiplicative change w in K ðS Þ: jb a ref of generality. Differential mRNA expression microarray data, which measures the K ðS Þ a mut DDG/RT w ¼ ¼ e ‚ ð6Þ jb change in mRNA concentrations in cells from two different experimental K ðS Þ a ref conditions, can be used in place of genome-wide TF occupancy data. where This substitution is reasonable since, to first approximation, the transcrip- tion rate of genes is proportional to the total TF occupancy along the asso- DDG ¼ DGðS Þ DGðS Þ: ð7Þ ref mut ciated promoter regions. Genome-wide occupancy data is preferable, To be able to generalize the binding of transcription factor P to a sequence however, since it is a more direct measure of TF-DNA interaction and S with more than one point mutation, we assume that the free energy mut since the design of the experiments provides the TF identities for the contributions for each position in the binding site are independent (Benos discovered PSAMs. et al., 2002) and therefore additive. Equivalently, we can multiply the w ’s jb for any nucleotide sequence to obtain the overall K ðS Þ/K ðS Þ ratio. a mut a ref 3.3 MatrixREDUCE implementation and Thus, the occupancy of a particular binding site S of length L with mut w parameters nucleotide sequence S ð1‚2‚.. . ‚L Þ¼ðb ‚b ‚.. . ‚b Þ is: mut w 1 2 Lw w MatrixREDUCE was implemented in Perl and C as outlined above and as NðS Þ¼½PK ðS Þ w : ð8Þ mut a ref jS ðjÞ previously described (Foat et al., 2005) with some modifications. mut j¼1 Briefly, MatrixREDUCE takes microarray intensities and corresponding e143 B.C.Foat et al. nucleotide sequence data as input. It first finds a gapped dyad motif (e.g. Leu3: standards’’ to assess the validity of our MatrixREDUCE model. The CCG-4nt-CGG), out of all possible dyad motifs of a fixed number of electrophoretic mobility shift assay (EMSA) is able to provide direct nucleotides and a range of gap sizes, whose occurrences best correlate estimates of K ’s for a TF binding to particular oligonucleotides (Fried with the measured intensities for the same sequences. The best dyad motif and Crothers, 1981). The ratio of the EMSA-measured K of a reference is then converted into a seed matrix by filling in the gap with N’s and oligonucleotide S to the K of one of the other tested oligonucleotides S ref d mut extending out a user defined number of flanking N’s on either side of the provides the same information as the product across the MatrixREDUCE best-scoring dyad. In the 4 · L seed matrix, acceptable nucleotides (all PSAM over the same sequence for the same TF. In the simplifying scenario nucleotides for N’s, a single specific nucleotide at positions within the top where the length of the oligonucleotides is the same as the length L of the scoring motif) are given K ratios of one and unacceptable nucleotides are PSAM, we have given a very small K ratio w . This seed matrix serves as the starting point a min Lw K ðS Þ d ref for a quasi-Newton numerical minimization of Equation 13 to find the optimal ¼ w : ð14Þ jS ðjÞ mut K ðS Þ d mut PSAM. The new version of MatrixREDUCE uses a k-fold cross-validation to j¼1 determine the significance of each discovered PSAM. After converging on a While the biological processes involved are considerably more complex, PSAM, the input data is split into k random subsets of array features with lacZ expression data can be employed to the same end. If we assume that associated sequences. The optimal PSAM is then used to seed each of k re-- b-galactosidase activity, concentration of b-galactosidase, the amount of optimizations of the PSAM. A t-value (Pearson correlation) for the goodness mRNA expressed, the specific recruitment of RNA polymerase to the of fit is calculated for the optimal PSAM of each subset. Finally, the P-value promoter, and the promoter occupancy by the TF are all proportional to corresponding to the average t-value for the k re-optimizations is used to test each other, then relative K ’s are reflected in the ratio of b-galactosidase whether the originally optimized PSAM should be kept. This procedure does activities between the assay using the reference binding site and another not test the significance of the optimal PSAM itself, but rather it tests whether assay using a different binding site. Thus, we used lacZ reporter expression the data contains widely distributed, explainable variance. Thus, false PSAMs assay data in a similar manner to EMSA-derived K data to confirm the due to a few outliers are prevented. While not relevant to the current study, results of MatrixREDUCE. MatrixREDUCE can iteratively build a linear model of multiple PSAMs that Experimentally determined in vitro binding affinities and lacZ reporter best explain a particular data set (see Foat et al., 2005). expression activity data were gathered from publications. The K data and The parameters for the runs of MatrixREDUCE were as follows: For lacZ expression data for Abf1 are from Gailus-Durner et al. (1996); K data all runs, the length of each of the two dyads of the seed motifs was three, for Leu3 are from Liu and Clarke (2002); and K data and lacZ expression the length of the added flanks on each side of the dyad was three, the minimum data for Ndt80 and Sum1 are from Pierce et al. (2003). gap was zero, the k cross-validations were two, and w was 10 . For all runs min To compare the experimental K measurements with MatrixREDUCE on ChIP-chip and PBM data, the maximum acceptable P-value was 10 and PSAMs, all experimental K and lacZ expression data was first converted to the maximum dyad gap was twenty. For all runs on DIP-chip data, the K ratios by normalizing with respect to the value of the highest affinity maximum acceptable P-value was 10 and the maximum dyad gap was oligonucleotide. The K ratios were then log-transformed to obtain the DDG ten. For all runs on differential mRNA expression data, the maximum accept- values. MatrixREDUCE PSAMs for each TF were converted to DDG’s able P-value was 10 and the maximum dyad gap was eleven. relative to the highest affinity oligonucleotide from the respective experiment. The sum of the DDG values was calculated for the best 3.4 Microarray and sequence data PSAM-matching window in each of the experimentally tested sequences. All microarray data was gathered from publication supplements. We chose If a sequence was shorter than the PSAM, the sum was taken over only the specific TFs to analyze based on the availability of experimental K data or best matching positions within the PSAM. All experimental DDG’s were crystal structure data. PSAMs were inferred by MatrixREDUCE for chro- then compared to the PSAM DDG’s by plotting and by calculating Pearson matin immunoprecipitation microarrays (ChIP-chip) using the microarray correlations. data and microarray feature sequences from Harbison et al. (2004). These BioProspector (Liu et al., 2001) and MDscan (Liu et al., 2002) are popular ChIP-chip experiments were performed under a variety of culture conditions, information theory-based methods for determination of TF binding speci- including rich media (YPD); sulfometuron methyl (SM), an inhibitor of ficities. To compare the quality of the results from these methods with amino acid biosynthesis; and treatment with rapamycin (RAPA). PSAMs MatrixREDUCE results, position-specific scoring matrices (PSSMs) were were inferred for PBM experiments using the microarray data from Mukher- derived from BioProspector and MDscan outputs by calculating the frequen- jee et al. (2004) and the feature sequence data from Harbison et al. (2004) as cies of each base at each position in the putative binding sites and then the two studies used the same array features. PSAMs were inferred for Leu3 dividing by a background frequency for each respective base. Two different using the DIP-chip microarray data and feature sequences from Liu et al. background frequencies were tested: equal nucleotide probabilities and nuc- (2005). Liu et al. (2005) performed DIP-chip experiments using two different leotide probabilities for intergenic sequences in S. cerevisiae. Once the concentrations of Leu3, 4nM and 40nM, and PSAMs were inferred for each PSSMs had been created, they were tested against experimental EMSA concentration. The PSAM for Ndt80 was inferred from differential mRNA and lacZ data in the same manner as the MatrixREDUCE PSAMs above. expression microarray data measuring the sporulation response in a ndt80 deletion strain versus a wild-type strain (Chu et al., 1998). The sequence data 3.6 Structural modeling for the Ndt80 PSAM inference was the 800 bp upstream of every yeast gene, DNA binding affinities and specificities of TFs are determined by the forces retrieved from the Saccharomyces Genome Database (Issel-Tarver et al., of electrostatics, solvation, the hydrogen bonding patterns, and shape 2002) and purged of redundant sequences as previously described (Foat complementarity at the binding interface. The magnitude of these contribu- et al., 2005). All microarray intensities were analyzed as the ratio of the tions to the binding free energy can in principle be calculated given a experimental sample intensity to the control sample intensity with the excep- structure of the protein bound to its cognate DNA site. Therefore, it should tion of the ndt80 deletion data, which was analyzed as the log -ratio. All be possible to predict PSAMs starting from the experimentally available microarray data was purged of extreme outliers before being analyzed by structure of the protein-DNA complex (solved by either X-ray diffraction or MatrixREDUCE (Grubbs’ test, P-value ¼ 10 ; Grubbs, 1969). NMR), or, in the absence of the exact structure, from a suitably constructed homology model. Under the assumption that the base pair energies 3.5 Gel shift and lacZ expression data contribute approximately independently to the total binding affinity While prone to their own inaccuracies, experimentally measured in vitro (Benos et al., 2002), all one-point base pair mutations are introduced binding affinities and changes in lacZ expression served as our ‘‘gold into the DNA binding site. Protein-DNA binding energies DG ¼ e144 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE G  G  G are then evaluated for each mutation. Mutations in number of contacts above which the maximum energy penalty E is protdna prot dna max the reference binding site result in changes of protein-DNA binding energies imposed. f ðN Þ and f ðN Þ are fixed prefactors defined in Morozov max max 1 2 (DDG; Equation 7). A table of DDG values can be used to construct a PSAM et al. (2005). E together with N constitute the free parameters of the max max that is directly comparable with MatrixREDUCE predictions. contact model and are adjusted simultaneously to maximize the fraction of We have previously developed two alternative approaches for predicting correct predictions and minimize the average error over the DDG data set TF binding affinities and specificities starting from the protein-DNA identical to that used in fitting the all-atom model. The fraction of correct structure (Morozov et al., 2005). In one approach, the ‘‘all atom model’’ predictions is based on a binary function: a prediction is considered to be (which builds on the ROSETTA protein-nucleic acid interaction model of correct if both computational and experimental DDG’s are less than 1.0 kcal/ Havranek et al., 2004), both direct and indirect readout mechanisms con- mol, or greater than 1.0 kcal/mol, or else separated by less than 0.3 kcal/mol. tribute to the recognition of the DNA binding site: The global minimum for the fit is found by exhaustive search; the best fit is DG ¼ DG þ DG . Direct readout is mediated by protein amino obtained with N ¼ 15, E ¼ 3:0 kcal/mol. direct indirect max max acid-DNA base interactions, while indirect readout is encoded in the 3.7 Affinity logos shape of the DNA site imparted by the bound protein, primarily through non-specific amino acid-DNA phosphate backbone contacts. Direct protein- Information content-based weight matrices are usually displayed as DNA interactions are modeled as a linear combination of the repulsive and sequence logos (Schneider and Stephens, 1990) However, MatrixREDUCE attractive parts of the Lennard-Jones potential, the orientation-dependent weight matrices are discovered without a background sequence model. Thus, hydrogen bonding potential (Kortemme et al., 2003), and the Generalized an appropriate logo should display the actual relative free energies of binding Born electrostatics and solvation model (Onufriev et al., 2004): for each nucleotide at each position rather than information content. There- fore, we created affinity logos, which are constructed as follows: For each DG ¼ w E þ w E direct LJrep LJrep LJattr LJattr ð15Þ position in the PSAM, the average DDG is calculated. Then, the difference þ w E þ w G ‚ hb hb el el between each individual DDG and the average DDG at that position is where each term is a sum over all protein-DNA and protein-protein atomic computed; the absolute value of this difference is the height of the character pairs, and {w} is a set of fitting weights. Indirect readout is modeled using an representing that nucleotide. If the difference is positive (more favorable effective harmonic representation of the DNA conformational energy (Olson than average), the letter is placed above a horizontal black line through the et al., 1998): center of the logo. If the difference is negative (less favorable than average) X X the letter is placed below the black line. Larger letters are stacked on smaller ab ab DG ¼ w E þ w E ‚ ð16Þ indirect dnabp dnabs dnabp dnabs letters moving outward from the black line. The height of the letter can be bp bs interpreted as free energy difference from the average in units of RT. Thus, where the first sum is over all base pairs in the DNA site (a, b denote bases in an intuitive high amplitude is given to the nucleotide positions that most a base pair), and the second sum is over all consecutively stacked base pair contribute to the sequence specificity of the TF. To highlight that the steps (a, b denote base pairs in a base step). Base pairs and base steps are characters representing the high affinity nucleotides are above the black 0 0 counted once in the 5 to 3 direction. The first term penalizes deviations line, the characters representing the low affinity nucleotides are made from canonical base pairing, while the second term captures base stacking partially transparent. However, maintaining the representation of the poor energies. The quadratic energy terms are given by: affinity nucleotides below the center line allows the viewer to immediately see which nucleotide substitutions are most unfavorable to binding. 6 6 X X ab ab a b E ¼ f d d ‚ ð17Þ dnabs/dnabp ij i j i¼1 j¼1 3.8 PSAM to PSAM alignments and correlations By inspection of affinity logos, one can make qualitative observations about where the sums run over six geometric degrees of freedom  (Twist, Tilt, the similarity between any two PSAMs. However, a quantitative measure of Roll, Shift, Slide and Rise for base pair steps; Opening, Buckle, Propeller, similarity allows for more objective comparisons. Before two PSAMs can be Shear, Stretch and Stagger for base pairs; Lu and Olson, 2003). The DNA compared, they must first be aligned. Pearson correlations were calculated potential is a quadratic expansion in d (deviations of the degrees of free- between the DDG values for each nucleotide at each position for every dom  from their average values computed using a set of non-homologous possible overlap of the two PSAMs for both the forward and the reverse protein-DNA complexes). The force constants f are evaluated by inverting ij complement alignments. After the best overlap position and strand was the covariance matrix of d obtained with the same protein-DNA dataset: determined from the best correlation P-value, the DDG’s of the two f ¼hd d i. All six weights are simultaneously fit to experimental DDG i j ij PSAMs were recentered relative to a common reference consensus data using a generalized linear model (implemented in the statistical soft- ware package R): ðw ‚w ‚w ‚w ‚w ‚w Þ¼ sequence. Finally, the P-value for the Pearson correlation between the LJrep LJattr hb el dnabp dnabs ð0:00‚0:46‚0:77‚0:27‚ 0:03‚0:03Þ. No conformational flexibility is allowed two optimally aligned and transformed PSAMs was calculated and subjected at the protein-DNA interface. Further details on the fitting procedure and to a Bonferroni correction for the number of alignments that were tested. comprehensive tests of the all-atom free energy function can be found in Morozov et al. (2005). 4 RESULTS In another approach, we developed a ‘‘contact model’’ that exploits the structure of the protein-DNA complex bound to a high affinity reference 4.1 PSAMs inferred by MatrixREDUCE agree well DNA sequence but does not require detailed predictions of protein-DNA with experimental measurements of TF binding interaction energies. In the contact model each mutated base in the PSAM affinity column i incurs equal energy cost relative to the consensus base from the We discovered the position specific affinity matrices (PSAMs) reference sequence: for the Saccharomyces cerevisiae TFs Rap1, Ndt80, Gcn4, E ½f ðN Þ log ð1  N/N Þ < max max max Leu3, Abf1, and Sum1 by applying MatrixREDUCE to genome- DDG ðNÞ¼  f ðN Þ log ð1 þ 3N/N Þ ðN < N Þ ð18Þ max max max wide TF occupancy data and, in the case of Ndt80, differential E ðN  N Þ max max mRNA expression microarray data (Figure 2A). Experimental measurements of relative K ’s (EMSA or lacZ expression) for Here, N is the number of protein amino acid-DNA base atomic contacts d summed over the base pair i (atomic contact is defined by a distance of less specific oligonucleotides were available for Abf1, Leu3, Ndt80, than 4.5 s; hydrogen atoms are excluded from the counts), and N is the and Sum1. EMSA has long been employed to determine the max e145 B.C.Foat et al. Fig. 2. Comparison of PSAMs—affinity logos and correlations. (A) The PSAMs represented in the columns with blue headers were inferred by Matrix-REDUCE from ChIP-chip, PBM, DIP-chip, or mRNA differential expression microarray data. YPD (rich media), SM (sulfometuron methyl), and RAPA (rapamycin) refer to the environmental conditions to which the test sample was exposed before the ChIP-chip experiment. The DIP-chip experiments were performed with two different concentrations of Leu3, 4nM and 40nM. An ndt80 deletion (ndt80D) versus wild-type mRNA expression experiment (mRNA) was used to obtain the Ndt80 PSAM. The PSAMs represented in the columns green headers were inferred by modeling TF-DNA interactions based on crystal structures of the TFs using two different methods, a contact-only model and an all atom model. (B) All PSAMs for each TF were aligned pairwise and the Pearson correlation between the DDG values of both PSAMs for the best alignment was calculated. The P-value for this correlation is a measure of similarity between the PSAMs. Again, blue labels indicate PSAMs inferred by MatrixREDUCE PSAMs and green labels indicate structurally inferred PSAMs. DNA-binding affinities of TFs in vitro. Likewise, the lacZ reporter experimental method (EMSA or lacZ), and TF, we compared the assay has long been used to measure the difference in activities experimental DDG with the DDG predicted from a PSAM for the of TF binding sites. We claim that a PSAM inferred by same TF (Figure 3). In every case, the experimental DDG values MatrixREDUCE from genome-wide TF occupancy data can be strongly correlated with the PSAM-predicted DDG values, with R ’s used to predict the relative binding affinities of the measured TF ranging from 0.36 to 0.88. Thus, PSAMs inferred by MatrixRE- to any sequence. Therefore, EMSA and lacZ expression data pro- DUCE seem to be good models of the true relative DNA binding vide nearly ideal data sets for validation of the MatrixREDUCE affinities of the corresponding TFs. Unexpectedly, all of the regres- approach. For each combination of experimentally tested sequence, sions of experimental DDG’s on MatrixREDUCE DDG’s have e146 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE Fig. 3. Comparison of experimentally measured DDG’s with MatrixREDUCE PSAM-predicted DDG’s. Experimental measurements of DDG’s were derived from EMSA (A) and lacZ reporter assays (B). The experimental DDG values are plotted along the vertical axes. Predicted DDG’s were calculated from the PSAM for each tested TF for the same oligonucleotide sequences that were measured in each experiment. The MatrixREDUCE-predicted DDG values are plotted along the horizontal axes. In this representation, the higher affinity oligonucleotides have more positive DDG’s. The diagonal dashed line represents experimental DDG equal to MatrixREDUCE DDG. DDG’s are in units of RT, where R is the gas constant and T is the temperature. The R and P-values for the Pearson correlations between the experimental and predicted DDG’s are presented for each PSAM-experimental data pair. slopes less than one (range: 0.31 to 0.94). It seems that MatrixRE- of the binding interface based only on the genomic sequence and DUCE produces a slightly larger range of predicted DDG’s than is genome-wide TF occupancy data. realized in experiments. Nonetheless, the MatrixREDUCE PSAM- Gcn4 is a TF of the bZIP class. It is a homodimer with the basic predicted DDG’s are close to the experimentally inferred DDG’s in region mediating sequence specific DNA binding and the leucine most cases, especially among the highest affinity sequences. zipper region required for dimerization (O’Shea et al., 1991). For deriving the Gcn4 structural PSAMs, we used a 2.9 s crystal structure of the TF bound to the ATGAGTCAT site (Ellenberger 4.2 PSAMs inferred by MatrixREDUCE agree well et al., 1992). The symmetry of the binding site (two reverse with PSAMs inferred by structural models complement 4 bp half-sites separated by G in the middle) is a reflection of the homodimeric binding and is captured well in Both genome-wide TF occupancy data and crystal structures of MatrixREDUCE predictions. While contact model and Matrix- protein-DNA complexes are available for Ndt80, Gcn4, and REDUCE predictions are similar, the all-atom model is less Rap1. Thus, we were able to compare MatrixREDUCE PSAMs successful, probably due to the low resolution of the crystal struc- with those based on ab initio structural models (see Methods; ture, which leads to considerable uncertainty in side chain positions Figure 2A). The structurally inferred PSAMs for Ndt80 were obtained from its co-crystal structure bound to a high affinity with respect to the neighboring DNA bases. GACACAAAA site, solved at 1.4 s resolution (Lamoureux Finally, Rap1 binds DNA as a homodimer in a way that makes its et al., 2002). Figure 2A shows a reasonable agreement between DNA site a tandem repeat. The crystal structure of the Rap1 homod- DDG predictions carried out with MatrixREDUCE and structural imer in complex with a telomeric DNA site has been solved to models. The close correspondence with the contact model, which is 2.25 s resolution (Konig et al., 1996). Comparison of Matrix- a function of the number of protein side chains in contact with DNA REDUCE PSAMs and structural PSAMs reveals good agreement base pairs, is especially remarkable, showing that the Matrix- with the all atom model. The contact model overpredicts binding REDUCE approach is capable of reproducing structural details specificity at the intermediate positions in the binding site (located e147 B.C.Foat et al. bases provide an estimate of a PSAM in the form of a position- specific scoring matrix (PSSM). Since we had already compiled the EMSA and lacZ expression data, we had the opportunity to experi- mentally verify the results of these PSSMs. We gathered the BioProspector and MDscan results from the original, published analyses, transformed them into PSSMs, and used them to predict DDG’s for the EMSA and lacZ experimentally tested sequences. We performed this comparison using two different ‘‘background’’ nucleotide frequency models: one using equal nuc- leotide probabilities and one using nucleotide probabilities derived from S. cerevisiae intergenic sequences. The R and P-values for the Fig. 4. Correlations of experimentally measured DDG’s with information correlations between these predicted DDG’s and the experimental theory-predicted DDG’s. Experimental measurements of DDG’s were derived 2 results are displayed in Figure 4. Overall, the quality of the results from EMSA and lacZ reporter assays. The R and P-values for the Pearson from the information theory PSSMs and the MatrixREDUCE correlations between the experimental and predicted DDG’s are presented for PSAMs were similar. However, the results for the PSSMs are dif- each PSSM-experimental data pair. PSSMs were derived and tested using ferent depending on the choice of equal or intergenic nucleotide two different background nucleotide frequencies: equal probabilities and frequencies. While we did not test this scenario, the information intergenic probabilities. theory results would also change depending on the probe intensity threshold chosen to label genes as ‘‘bound.’’ Thus, while Matrix- between tandem repeats), likely because it assigns similar REDUCE performs comparably with existing information theory specificities to protein-DNA contacts in the loop region and in methods, it conveniently avoids having to choose several ad hoc the DNA binding domains. parameters required by the other methods. 4.3 PSAM to PSAM correlations 5 DISCUSSION Upon visual inspection of Figure 2A, the similarities are immedi- Overall, position specific affinity matrices (PSAMs) as inferred by ately apparent between affinity logos for the same factor inferred MatrixREDUCE from genome-wide TF occupancy data are good using different experimental and computational methods. However, approximations of the real sequence-specific DNA binding affini- a quantitative measure of these similarities can be obtained by ties. Discrepancies between the computationally predicted and the aligning the PSAMs (see Methods) and calculating the correlation experimentally inferred binding affinities may be due to either the of their DDG values. The P-value for this correlation between two computational or the experimental methods. EMSA has known PSAMs serves as our similarity metric (Figure 2B). Overall, the problems with ‘‘caging’’ of the TFs by the gel while electrophoresis similarity between the PSAMs from MatrixREDUCE are the most is proceeding (Fried and Crothers, 1981). This could lead to inferred significant. There is extreme similarity between the Rap1 PSAMs DDG’s of smaller magnitude. Likewise, lacZ reporter assays are a inferred from ChIP-chip and PBM data. The PSAMs inferred for very indirect way of measuring relative binding affinities as they Gcn4 for three different ChIP-chip conditions are all very similar as require transcription, translation, and b-galactosidase reactions in well. The Leu3 PSAMs inferred from the DIP-chip data are much order to make measurements, and noise could be introduced at each more similar to each other than they are to the Leu3 PSAM inferred step. Structural model predictions are strongly dependent on the from the ChIP-chip data, but they are still both significantly similar quality of input structures and are affected by errors in the energy (P-value < 0.01) to the ChIP-chip Leu3 PSAM. The significance of function. The current MatrixREDUCE model may also give rise to the correlations between MatrixREDUCE PSAMs and structurally systematic biases. First, it makes the approximation that nucleotides inferred PSAMs is more variable. Both the all atom model PSAM contribute independently to the free energy of TF binding (Benos and the contact model PSAM for Rap1 and Ndt80 have significant et al., 2002). Second, it makes the assumption that the concentration similarities with the respective MatrixREDUCE PSAMs (P-value < of TF is much smaller than the K , which may not be correct for 0.01). However for Gcn4, while the contact model PSAM has strong some TFs. Finally, all consecutive positions in the PSAM are cur- similarities with all of the other PSAMs, the all atom PSAM has rently treated as parameters to be estimated, which may lead to insignificant similarities with all other PSAMs. overfitting. We plan to address these issues in a future version of the algorithm. 4.4 How good is the information theory Despite these current limitations, PSAMs discovered using the approximation? current implementation of MatrixREDUCE are good approx- In the original papers describing the ChIP-chip (Harbison et al., imations of the relative nucleotide binding affinities of assayed 2004), PBM (Mukherjee et al., 2004), and DIP-chip (Liu et al., TFs. Especially for microarray methods like PBM and DIP-chip, 2005) data, the authors used BioProspector (Liu et al., 2001) or where the objective is to define nucleotide-binding specificities, MDscan (Liu et al., 2002) to define weight matrix representations of MatrixREDUCE may be the most physically accurate method TF binding sites. These two methods use the set of sequences that available to analyze the data. Even for less direct reflections of the experimenters label as ‘‘bound’’ to produce a list of potential TF binding affinities like ChIP-chip or differential mRNA expres- binding sites. When interpreted through information theory, the sion data, it will still provide good approximations of the sequence- nucleotide frequencies at each individual position in the binding specific binding affinities of TFs relevant to those data. Preliminary sites divided by the ‘‘background’’ frequencies for the respective results also suggest that MatrixREDUCE performs well on data e148 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE from higher eukaryotes including D. melanogaster and mammals. Konig,P., Giraldo,R., Chapman,L. and Rhodes,D. (1996) The crystal structure of the DNA-binding domain of yeast RAP1 in complex with telomeric DNA. Cell, 85, Finally, MatrixREDUCE has two key advantages over most other 125–136. computational methods for defining nucleotide binding specific- Kortemme,T., Morozov,A.V. and Baker,D. (2003) An orientation-dependent hydrogen ities: (i) it uses the information for all probes in genome-wide bonding potential improves prediction of specificity and structure for proteins and TF occupancy data, and (ii) it does not require a background protein-protein complexes. J. Mol. Biol., 326, 1239–1259. Lamoureux,J.S., Stuart,D., Tsang,R., Wu,C. and Glover,J.N.M. (2002) Structure of sequence model. the sporulation-specific transcription factor Ndt80 bound to DNA. EMBO J., 21, 5721–5732. ACKNOWLEDGMENTS Lee,T.I., Rinaldi,N.J., Robert,F., Odom,D.T., Bar-Joseph,Z., Gerber,G.K., Hannett,N.M., Harbison,C.T., Thompson,C.M., Simon,I., Zeitlinger,J., This work was supported by National Institutes of Health grants Jennings,E.G., Murray,H.L., Gordon,D.B., Ren,B., Wyrick,J.J., Tagne,J.B., HG003008 (to H.J.B) and CA121852. A.V.M. is a fellow of the Volkert,T.L., Fraenkel,E., Gifford,D.K. and Young,R.A. (2002) Transcriptional Leukemia and Lymphoma Society. The authors thank Jason Lieb, regulatory networks in Saccharomyces cerevisiae. Science, 298, Neil Clarke, Martha Bulyk, Ernest Fraenkel, Xiao Liu, and Michael 799–804. Lieb,J.D., Liu,X., Botstein,D. and Brown,P.O. (2001) Promoter-specific binding of Berger for graciously answering questions and supplying additional Rap1 revealed by genome-wide maps of protein-DNA association. Nat. Genet., data from their publications. The authors also thank Gabor Halasz, 28, 327–334. Ronald Tepper, Junbai Wang, Xiang-Jun Lu, and Lucas Ward for Liu,X.S., Brutlag,D.L. and Liu,J.S. (2002) An algorithm for finding protein-dna valuable conversations, and Xiang-Jun Lu for reimplementing the binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol., 20, 835–839. MatrixREDUCE software for distribution. Liu,X., Brutlag,D.L. and Liu,J.S. (2001) Bioprospector: discovering conserved dna motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. REFERENCES Biocomput. 127–138. Liu,X. and Clarke,N.D. (2002) Rationalization of gene regulation by a eukaryotic Benos,P.V., Bulyk,M.L. and Stormo,G.D. (2002) Additivity in protein-DNA interac- transcription factor: calculation of regulatory region occupancy from predicted tions: how good an approximation is it? Nucleic Acids Res., 30, 4442–4451. binding affinities. J. Mol. Biol., 323, 1–8. Berg,O.G. and von Hippel,P.H. (1987) Selection of DNA binding sites by regulatory Liu,X., Noll,D.M., Lieb,J.D. and Clarke,N.D. (2005) DIP-chip: rapid and accurate proteins. Statistical-mechanical theory and application to operators and promoters. determination of DNA-binding specificity. Genome Res., 15, 421–427. J. Mol. Biol., 193, 723–750. Lu,X.J. and Olson,W.K. (2003) 3DNA: a software package for the analysis, rebuilding Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O. and Herskowitz,I. and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res., (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 31, 5108–5121. 699–705. Morozov,A.V., Havranek,J.J., Baker,D. and Siggia,E.D. (2005) Protein-DNA Djordjevic,M., Sengupta,A.M. and Shraiman,B.I. (2003) A biophysical approach to binding specificity predictions with structural models. Nucleic Acids Res., 33, transcription factor binding site discovery. Genome Res., 13, 2381–2390. 5781–5798. Djordjevic,M. and Sengupta,A.M. (2006) Quantitative modeling and data analysis of Mukherjee,S., Berger,M.F., Jona,G., Wang,X.S., Muzzey,D., Snyder,M., Young,R.A. SELEX experiments. Phys. Biol., 3, 13–28. and Bulyk,M.L. (2004) Rapid analysis of the DNA-binding specificities of Ellenberger,T.E., Brandl,C.J., Struhl,K. and Harrison,S.C. (1992) The GCN4 basic transcription factors with DNA microarrays. Nat. Genet., 36, 1331–1339. region leucine zipper binds DNA as a dimer of uninterrupted alpha helices: crystal Olson,W.K., Gorin,A.A., Lu,X.J., Hock,L.M. and Zhurkin,V.B. (1998) DNA sequence- structure of the protein-DNA complex. Cell, 71, 1223–1237. dependent deformability deduced from protein-DNA crystal complexes. Proc. Endres,R.G., Schulthess,T.C. and Wingreen,N.S. (2004) Toward an atomistic model Natl. Acad. Sci. USA, 95, 11163–11168. for predicting transcription-factor binding sites. Proteins, 57, 262–268. Onufriev,A., Bashford,D. and Case,D.A. (2004) Exploring protein native states and Foat,B.C., Houshmandi,S.S., Olivas,W.M. and Bussemaker,H.J. (2005) Profiling large-scale conformational changes with a modified Generalized Born model. condition-specific, genome-wide regulation of mRNA stability in yeast. Proc. Proteins, 55, 383–394. Natl. Acad. Sci. USA, 102, 17675–17680. O’Shea,E.K., Klemm,J.D., Kim,P.S. and Alber,T. (1991) X-ray structure of the GCN4 Fried,M. and Crothers,D.M. (1981) Equilibria and kinetics of lac repressor-operator leucine zipper, a two-stranded, parallel coiled coil. Science, 254, 539–544. interactions by polyacrylamide gel electrophoresis. Nucleic Acids Res., 9, Paillard,G. and Lavery,R. (2004) Analyzing protein-DNA recognition mechanisms. 6505–6525. Structure, 12, 113– 122. Gailus-Durner,V., Xie,J., Chintamaneni,C. and Vershon,A.K. (1996) Participation of Pierce,M., Benjamin,K.R., Montano,S.P., Georgiadis,M.M., Winter,E. and the yeast activator Abf1 in meiosis-specific expression of the HOP1 gene. Mol. Vershon,A.K. (2003) Sum1 and Ndt80 proteins compete for binding to middle Cell. Biol., 16, 2777–2786. sporulation element sequences that control meiotic gene expression. Mol. Cell. Granek,J.A. and Clarke,N.D. (2005) Explicit equilibrium modeling of transcription- Biol., 23, 4814–4825. factor binding and gene regulation. Genome Biol., 6, R87. Ren,B., Robert,F., Wyrick,J.J., Aparicio,O., Jennings,E.G., Simon,I., Zeitlinger,J., Grubbs,F. (1969) Procedures for detecting outlying observations in samples. Techno- Schreiber,J., Hannett,N., Kanin,E., Volkert,T.L., Wilson,C.J., Bell,S.P. and metrics, 11, 1–21. Young,R.A. (2000) Genome-wide location and function of DNA binding proteins. Harbison,C.T., Gordon,D.B., Lee,T.I., Rinaldi,N.J., Macisaac,K.D., Danford,T.W., Science, 290, 2306–2309. Hannett,N.M., Tagne,J.B., Reynolds,D.B., Yoo,J., Jennings,E.G., Zeitlinger,J., Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display Pokholok,D.K., Kellis,M., Rolfe,P.A., Takusagawa,K.T., Lander,E.S., consensus sequences. Nucleic Acids Res., 18, 6097–6100. Gifford,D.K., Fraenkel,E. and Young,R.A. (2004) Transcriptional regulatory Simon,I., Barnett,J., Hannett,N., Harbison,C.T., Rinaldi,N.J., Volkert,T.L., Wyrick,J.J., code of a eukaryotic genome. Nature, 431, 99–104. Zeitlinger,J., Gifford,D.K., Jaakkola,T.S. and Young,R.A. (2001) Serial regulation Havranek,J.J., Duarte,C.M. and Baker,D. (2004) A simple physical model for the of transcriptional regulators in the yeast cell cycle. Cell, 106, 697–708. prediction and design of protein-DNA interactions. J. Mol. Biol., 344,59–70. Stormo,G.D. and Fields,D.S. (1998) Specificity, free energy and information content in Issel-Tarver,L., Christie,K.R., Dolinski,K., Andrada,R., Balakrishnan,R., Ball,C.A., protein-DNA interactions. Trends Biochem. Sci., 23, 109–113. Binkley,G., Dong,S., Dwight,S.S., Fisk,D.G., Harris,M., Schroeder,M., Stormo,G.D., Schneider,T.D. and Gold,L. (1986) Quantitative analysis of the Sethuraman,A., Tse,K., Weng,S., Botstein,D. and Cherry,J.M. (2002) Saccha- relationship between nucleotide sequence and functional activity. Nucleic Acids romyces Genome Database. Methods Enzymol., 350, 329–346. Res., 14, 6661–6679. Iyer,V.R., Horak,C.E., Scafe,C.S., Botstein,D., Snyder,M. and Brown,P.O. (2001) Stormo,G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, Genomic binding sites of the yeast cell-cycle transcription factors SBF and 16, 16–23. MBF. Nature, 409, 533–538. e149 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE

Loading next page...
 
/lp/oxford-university-press/statistical-mechanical-modeling-of-genome-wide-transcription-factor-ox4weE5YPQ

References (40)

Publisher
Oxford University Press
Copyright
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btl223
pmid
16873464
Publisher site
See Article on Publisher Site

Abstract

Vol. 22 no. 14 2006, pages e141–e149 BIOINFORMATICS doi:10.1093/bioinformatics/btl223 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE 1 2 1,3, Barrett C. Foat , Alexandre V. Morozov and Harmen J. Bussemaker 1 2 Department of Biological Sciences, Columbia University, New York, NY 10027, USA, Center for Studies in Physics and Biology, The Rockefeller University, New York, NY 10021, USA and Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10032, USA sidered to be bound by the TF (for review see Stormo, 2000). These ABSTRACT methods use the information content of nucleotide patterns as a Motivation: Regulation of gene expression by a transcription factor proxy for the free energy contributions of the bases found in the requires physical interaction between the factor and the DNA, which TF binding site (Berg and von Hippel, 1987; Stormo and Fields, can be described by a statistical mechanical model. Based on this 1998). Other computational methods infer physically-based TF model, we developed the MatrixREDUCE algorithm, which uses binding specificities from measured TF binding affinities for a genome-wide occupancy data for a transcription factor (e.g. ChIP- small set of oligonucleotides (Liu and Clarke, 2002) or from struc- chip) and associated nucleotide sequences to discover the sequence- tural modeling of protein-DNA interaction (Paillard and Lavery, specific binding affinity of the transcription factor. Advantages of our 2004; Endres et al., 2004; Morozov et al., 2005). However, genome- approach are that the information for all probes on the microarray is scale, quantitative measurements of TF occupancies of intergenic efficiently utilized because there is no need to delineate ‘‘bound’’ and regions are now available due to the advent of in vivo chromatin ‘‘unbound’’ sequences, and that, unlike information content-based immunoprecipitation microarrays (Ren et al., 2000; Iyer et al., methods, it does not require a background sequence model. 2001; Lieb et al., 2001; Simon et al., 2001; Lee et al., 2002; Results: We validated the performance of MatrixREDUCE by inferring Harbison et al., 2004), in vitro protein binding microarrays the sequence-specific binding affinities for several transcription factors (PBM; Mukherjee et al., 2004), and DNA immunoprecipitation in S. cerevisiae and comparing the results with three other independent microarrays (DIP-chip; Liu et al., 2005). Thus, it is no longer sources of transcription factor sequence-specific affinity information: necessary to rely on small data sets, availability of protein-DNA (i) experimental measurement of transcription factor binding affinities structures, or the analogy between information content and statis- for specificoligonucleotides,(ii) reporter geneassays forpromoterswith tical mechanics to infer free energy representations of transcription systematically mutated binding sites, and (iii) relative binding affinities factor binding sites. obtained by modeling transcription factor-DNA interactions based on We have developed a method, implemented as the program co-crystal structures of transcription factors bound to DNA substrates. MatrixREDUCE (Foat et al., 2005), that infers the sequence spe- We show that transcription factor binding affinities inferred by cificity of a TF directly and accurately from genome-wide TF occu- MatrixREDUCE are in good agreement with all three validating pancy data by fitting a statistical mechanical model for TF-DNA methods. interaction (Figure 1). The sequence specificity of the TF’s DNA- Availability: MatrixREDUCE source code is freely available for binding domain is modeled using a position-specific affinity matrix non-commercial use at http://www.bussemakerlab.org/. The software (PSAM), representing the change in the binding affinity (K ) when- runs on Linux, Unix, and Mac OS X. d ever a specific position within a reference binding sequence is Contact: [email protected] mutated. To validate the physical model of MatrixREDUCE, we discovered the PSAMs for several TFs in S. cerevisiae and com- 1 INTRODUCTION pared the results with three other independent sources of TF sequence-specific affinity information: (i) experimentally measured The sequence-specific regulatory activity of a transcription factor K ’s as determined by in vitro methods (Gailus-Durner et al., 1996; (TF) is the result of energetically favorable interactions between the Liu and Clarke, 2002; Pierce et al., 2003), (ii) lacZ reporter assays amino acids exposed in the DNA binding domain and portions of for promoters with systematically mutated binding sites (Gailus- nucleic acid bases exposed in the grooves of the DNA. A compu- Durner et al., 1996; Pierce et al., 2003), and (iii) relative K ’s tational method for discovering the binding specificity of a TF obtained by using a physical model of protein-DNA interaction cannot provide a quantitative description of TF binding unless it that makes binding affinity predictions starting from a co-crystal considers the physical underpinnings of the TF-DNA interaction. structure of the protein-DNA complex (Morozov et al., 2005). We Most physically motivated computational methods discover over- find a surprising level of agreement between MatrixREDUCE- represented patterns in a set of nucleotide sequences that are con- predicted TF binding affinities, experimental measurements, and To whom correspondence should be addressed. structural predictions, suggesting that MatrixREDUCE is a The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] B.C.Foat et al. Fig. 1. The flow of data. A microarray measurement of TF occupancies (ChIP-chip, PBM, DIP-chip, or differential mRNA expression data) and relevant nucleotide sequences for each microarray feature are used as input to MatrixREDUCE. MatrixREDUCE performs a least-squares fit to a statistical-mechanical model of TF-DNA interaction to discover the relative contributions to the free energy of binding for each nucleotide at each position in the generalized TF binding site. These contributions are represented as a position specific affinity matrix (PSAM) containing the relative equilibrium constants of the TF-DNA interaction, with the highest affinity nucleotide at each position scaled to a value of one (DDG ¼ 0). The PSAM can be converted into an affinity logo that graphically represents the DDG’s for each nucleotide at each position relative to the averageDDG at the respective positions. The PSAM can also be used to predict the relative TF occupancy of any nucleotide sequence, allowing the PSAMs inferred by MatrixREDUCE to be compared with experimental measurements of TF binding affinities for particular oligonucleotides. powerful and accurate tool for the elucidation of physically accurate data were b-galactosidase activities for genes containing mutated TF sequence-specific binding affinities. binding sites of the E. coli su2 amber stop codon suppressor. A similar type of analysis was performed by Liu and Clarke (2002) who fit a physical model for transcription factor binding to elec- 2 RELATED WORK trophoretic mobility shift assay (EMSA) data measuring affinity of In contrast to information theory-based methods of defining the S. cerevisiae Leu3 TF for several oligonucleotides. The physical nucleotide-binding protein specificities, MatrixREDUCE belongs model behind MatrixREDUCE is the same as that employed by to a small but growing class of methods that infer binding affinities Stormo et al. (1986) and Liu and Clarke (2002). However, our by directly fitting a physical model to experimental data. The first ‘‘quantitative data’’ are microarray probe intensities, which mea- such method was introduced by Stormo et al. (1986) who noted, sure TF occupancy over long chromosomal regions with unknown ‘‘When quantitative data are known for many sequences one can binding site locations. Thus, the MatrixREDUCE model integrates the binding signal over the entire length of the sequence. The solve for the matrix elements that give the best fit between the GOMER method of Granek and Clarke (2005) performs a similar sequences and those data.’’ For Stormo et al. (1986) the quantitative e142 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE The occupancy NðU Þ for the entire promoter region U of gene g equals the integration of signal over long regulatory sequences relevant to g g sum of occupancies for each binding site window of length L at each measured microarray intensities. However, GOMER was only position i over the length L of the sequence U : g g used to test hypotheses about the regulatory mechanisms of TFs L L þ1 for which a binding site weight matrix had already been defined by g w L X Y NðU Þ¼ ½PK ðS Þ w ‚ ð9Þ g a ref jU ði þ j  1Þ other methods. Granek and Clarke (2005) did not attempt to fit the g i¼1 j¼1 GOMER model directly to experimental data to infer the binding affinities of TFs. Finally, also of note are the QPMEME algorithm of where U ðiÞ is the base at position i in sequence U . g g Djordjevic et al. (2003) and the work of Djordjevic and Sengupta (2006) which use maximum likelihood procedures to infer PSAMs 3.2 Modeling genome-wide TF occupancy data by fitting physical models to known TF binding sites and SELEX Recent technologies such as ChIP-chip (Ren et al., 2000; Iyer et al., 2001; data, respectively, but rely on the prior delineation of ‘‘bound’’ Lieb et al., 2001; Simon et al., 2001; Lee et al., 2002; Harbison et al., 2004), sequences. PBM (Mukherjee et al., 2004), and DIP-chip (Liu et al., 2005) provide indirect but quantitative information about the TF occupancy of large genomic regions. For each segment of DNA there are two microarray inten- 3 METHODS test test sities. The test intensity I is equal to a background intensity a plus a 3.1 Modeling TF-DNA interaction term that, to first approximation, is proportional (g) to the occupancy NðU Þ by the TF, either because the amount of TF bound to the probe contributes We will develop the statistical-mechanical model used by MatrixREDUCE directly to the signal intensity (PBM) or because it determines the proportion starting with a transcription factor P that binds to a DNA sequence S to form at which an immunoprecipitated TF-DNA fragment is present in the sample the TF-DNA complex PS: control (ChIP-chip or DIP-chip). The control intensity I is only the result of on control background signal a . Allowing for experimental noise e , we obtain: P þ S  PS ð1Þ test test off gNðU Þþ a g g ¼ þ e  bNðU Þþ C þ e ð10Þ g g g control control The affinity of the TF for the sequence can be expressed in terms of its a equilibrium dissociation constant K ðSÞ: Using Equation 9 for the occupancy NðU Þ, we obtain: ½P½S k off DG/RT K ðSÞ¼ ¼ ¼ e ‚ ð2Þ L L þ1 test g w L ½PS k w on X Y ¼ F w þ C þ e ‚ ð11Þ jU ði þ j  1Þ g control g which is directly related to DG, the Gibbs free energy of binding per mole g i¼1 j¼1 (R is the gas constant and T is temperature). The occupancy N(S) of sequence S by transcription factor P can be expressed as the concentration of TF-DNA where complex divided by the total concentration of DNA (bound or unbound): F ¼ b½PK ðS Þ: ð12Þ a ref ½PS ½P NðSÞ¼ ¼ : ð3Þ Note that b,[P], and K ðS Þ cannot be determined separately without a ref ½PSþ½S ½Pþ K ðSÞ additional information such as the real protein concentration or K ðS Þ. a ref For simplicity, we will assume that the TF concentration [P] is much smaller MatrixREDUCE discovers the set of w elements as well as F and C by jb than K ðSÞ. This assumption seems physiologically plausible because in this performing a least squares fit to the measured intensity ratios: regime, the highest affinity binding sites in the genome will be the most responsive to a change in the nuclear concentration of active TF. Thus, the ðC‚F‚fw gÞ ¼ argmin jb C‚ F‚ fw g jb occupancy becomes: ½P NðSÞ ¼½PK ðSÞ‚ ð4Þ L L þ1 a test g w L X X Y K ðSÞ F w C : ð13Þ jU ði þ j  1Þ control g where g g i¼1 j¼1 K ðSÞ K ðSÞ: ð5Þ The 4 · L matrix of K ratios w (3L parameters plus L reference w a jb w w Consider a single point mutation from the original reference sequence S ref nucleotide values) for all nucleotides at all positions in the binding site is to base b at position j resulting in the mutated sequence S . Such a mutation mut referred to as the position specific affinity matrix (PSAM). Each position j in will give rise to an additive change DDG in the free energy of binding or, the PSAM is rescaled such that the largest w is equal to unity, without loss jb equivalently, a multiplicative change w in K ðS Þ: jb a ref of generality. Differential mRNA expression microarray data, which measures the K ðS Þ a mut DDG/RT w ¼ ¼ e ‚ ð6Þ jb change in mRNA concentrations in cells from two different experimental K ðS Þ a ref conditions, can be used in place of genome-wide TF occupancy data. where This substitution is reasonable since, to first approximation, the transcrip- tion rate of genes is proportional to the total TF occupancy along the asso- DDG ¼ DGðS Þ DGðS Þ: ð7Þ ref mut ciated promoter regions. Genome-wide occupancy data is preferable, To be able to generalize the binding of transcription factor P to a sequence however, since it is a more direct measure of TF-DNA interaction and S with more than one point mutation, we assume that the free energy mut since the design of the experiments provides the TF identities for the contributions for each position in the binding site are independent (Benos discovered PSAMs. et al., 2002) and therefore additive. Equivalently, we can multiply the w ’s jb for any nucleotide sequence to obtain the overall K ðS Þ/K ðS Þ ratio. a mut a ref 3.3 MatrixREDUCE implementation and Thus, the occupancy of a particular binding site S of length L with mut w parameters nucleotide sequence S ð1‚2‚.. . ‚L Þ¼ðb ‚b ‚.. . ‚b Þ is: mut w 1 2 Lw w MatrixREDUCE was implemented in Perl and C as outlined above and as NðS Þ¼½PK ðS Þ w : ð8Þ mut a ref jS ðjÞ previously described (Foat et al., 2005) with some modifications. mut j¼1 Briefly, MatrixREDUCE takes microarray intensities and corresponding e143 B.C.Foat et al. nucleotide sequence data as input. It first finds a gapped dyad motif (e.g. Leu3: standards’’ to assess the validity of our MatrixREDUCE model. The CCG-4nt-CGG), out of all possible dyad motifs of a fixed number of electrophoretic mobility shift assay (EMSA) is able to provide direct nucleotides and a range of gap sizes, whose occurrences best correlate estimates of K ’s for a TF binding to particular oligonucleotides (Fried with the measured intensities for the same sequences. The best dyad motif and Crothers, 1981). The ratio of the EMSA-measured K of a reference is then converted into a seed matrix by filling in the gap with N’s and oligonucleotide S to the K of one of the other tested oligonucleotides S ref d mut extending out a user defined number of flanking N’s on either side of the provides the same information as the product across the MatrixREDUCE best-scoring dyad. In the 4 · L seed matrix, acceptable nucleotides (all PSAM over the same sequence for the same TF. In the simplifying scenario nucleotides for N’s, a single specific nucleotide at positions within the top where the length of the oligonucleotides is the same as the length L of the scoring motif) are given K ratios of one and unacceptable nucleotides are PSAM, we have given a very small K ratio w . This seed matrix serves as the starting point a min Lw K ðS Þ d ref for a quasi-Newton numerical minimization of Equation 13 to find the optimal ¼ w : ð14Þ jS ðjÞ mut K ðS Þ d mut PSAM. The new version of MatrixREDUCE uses a k-fold cross-validation to j¼1 determine the significance of each discovered PSAM. After converging on a While the biological processes involved are considerably more complex, PSAM, the input data is split into k random subsets of array features with lacZ expression data can be employed to the same end. If we assume that associated sequences. The optimal PSAM is then used to seed each of k re-- b-galactosidase activity, concentration of b-galactosidase, the amount of optimizations of the PSAM. A t-value (Pearson correlation) for the goodness mRNA expressed, the specific recruitment of RNA polymerase to the of fit is calculated for the optimal PSAM of each subset. Finally, the P-value promoter, and the promoter occupancy by the TF are all proportional to corresponding to the average t-value for the k re-optimizations is used to test each other, then relative K ’s are reflected in the ratio of b-galactosidase whether the originally optimized PSAM should be kept. This procedure does activities between the assay using the reference binding site and another not test the significance of the optimal PSAM itself, but rather it tests whether assay using a different binding site. Thus, we used lacZ reporter expression the data contains widely distributed, explainable variance. Thus, false PSAMs assay data in a similar manner to EMSA-derived K data to confirm the due to a few outliers are prevented. While not relevant to the current study, results of MatrixREDUCE. MatrixREDUCE can iteratively build a linear model of multiple PSAMs that Experimentally determined in vitro binding affinities and lacZ reporter best explain a particular data set (see Foat et al., 2005). expression activity data were gathered from publications. The K data and The parameters for the runs of MatrixREDUCE were as follows: For lacZ expression data for Abf1 are from Gailus-Durner et al. (1996); K data all runs, the length of each of the two dyads of the seed motifs was three, for Leu3 are from Liu and Clarke (2002); and K data and lacZ expression the length of the added flanks on each side of the dyad was three, the minimum data for Ndt80 and Sum1 are from Pierce et al. (2003). gap was zero, the k cross-validations were two, and w was 10 . For all runs min To compare the experimental K measurements with MatrixREDUCE on ChIP-chip and PBM data, the maximum acceptable P-value was 10 and PSAMs, all experimental K and lacZ expression data was first converted to the maximum dyad gap was twenty. For all runs on DIP-chip data, the K ratios by normalizing with respect to the value of the highest affinity maximum acceptable P-value was 10 and the maximum dyad gap was oligonucleotide. The K ratios were then log-transformed to obtain the DDG ten. For all runs on differential mRNA expression data, the maximum accept- values. MatrixREDUCE PSAMs for each TF were converted to DDG’s able P-value was 10 and the maximum dyad gap was eleven. relative to the highest affinity oligonucleotide from the respective experiment. The sum of the DDG values was calculated for the best 3.4 Microarray and sequence data PSAM-matching window in each of the experimentally tested sequences. All microarray data was gathered from publication supplements. We chose If a sequence was shorter than the PSAM, the sum was taken over only the specific TFs to analyze based on the availability of experimental K data or best matching positions within the PSAM. All experimental DDG’s were crystal structure data. PSAMs were inferred by MatrixREDUCE for chro- then compared to the PSAM DDG’s by plotting and by calculating Pearson matin immunoprecipitation microarrays (ChIP-chip) using the microarray correlations. data and microarray feature sequences from Harbison et al. (2004). These BioProspector (Liu et al., 2001) and MDscan (Liu et al., 2002) are popular ChIP-chip experiments were performed under a variety of culture conditions, information theory-based methods for determination of TF binding speci- including rich media (YPD); sulfometuron methyl (SM), an inhibitor of ficities. To compare the quality of the results from these methods with amino acid biosynthesis; and treatment with rapamycin (RAPA). PSAMs MatrixREDUCE results, position-specific scoring matrices (PSSMs) were were inferred for PBM experiments using the microarray data from Mukher- derived from BioProspector and MDscan outputs by calculating the frequen- jee et al. (2004) and the feature sequence data from Harbison et al. (2004) as cies of each base at each position in the putative binding sites and then the two studies used the same array features. PSAMs were inferred for Leu3 dividing by a background frequency for each respective base. Two different using the DIP-chip microarray data and feature sequences from Liu et al. background frequencies were tested: equal nucleotide probabilities and nuc- (2005). Liu et al. (2005) performed DIP-chip experiments using two different leotide probabilities for intergenic sequences in S. cerevisiae. Once the concentrations of Leu3, 4nM and 40nM, and PSAMs were inferred for each PSSMs had been created, they were tested against experimental EMSA concentration. The PSAM for Ndt80 was inferred from differential mRNA and lacZ data in the same manner as the MatrixREDUCE PSAMs above. expression microarray data measuring the sporulation response in a ndt80 deletion strain versus a wild-type strain (Chu et al., 1998). The sequence data 3.6 Structural modeling for the Ndt80 PSAM inference was the 800 bp upstream of every yeast gene, DNA binding affinities and specificities of TFs are determined by the forces retrieved from the Saccharomyces Genome Database (Issel-Tarver et al., of electrostatics, solvation, the hydrogen bonding patterns, and shape 2002) and purged of redundant sequences as previously described (Foat complementarity at the binding interface. The magnitude of these contribu- et al., 2005). All microarray intensities were analyzed as the ratio of the tions to the binding free energy can in principle be calculated given a experimental sample intensity to the control sample intensity with the excep- structure of the protein bound to its cognate DNA site. Therefore, it should tion of the ndt80 deletion data, which was analyzed as the log -ratio. All be possible to predict PSAMs starting from the experimentally available microarray data was purged of extreme outliers before being analyzed by structure of the protein-DNA complex (solved by either X-ray diffraction or MatrixREDUCE (Grubbs’ test, P-value ¼ 10 ; Grubbs, 1969). NMR), or, in the absence of the exact structure, from a suitably constructed homology model. Under the assumption that the base pair energies 3.5 Gel shift and lacZ expression data contribute approximately independently to the total binding affinity While prone to their own inaccuracies, experimentally measured in vitro (Benos et al., 2002), all one-point base pair mutations are introduced binding affinities and changes in lacZ expression served as our ‘‘gold into the DNA binding site. Protein-DNA binding energies DG ¼ e144 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE G  G  G are then evaluated for each mutation. Mutations in number of contacts above which the maximum energy penalty E is protdna prot dna max the reference binding site result in changes of protein-DNA binding energies imposed. f ðN Þ and f ðN Þ are fixed prefactors defined in Morozov max max 1 2 (DDG; Equation 7). A table of DDG values can be used to construct a PSAM et al. (2005). E together with N constitute the free parameters of the max max that is directly comparable with MatrixREDUCE predictions. contact model and are adjusted simultaneously to maximize the fraction of We have previously developed two alternative approaches for predicting correct predictions and minimize the average error over the DDG data set TF binding affinities and specificities starting from the protein-DNA identical to that used in fitting the all-atom model. The fraction of correct structure (Morozov et al., 2005). In one approach, the ‘‘all atom model’’ predictions is based on a binary function: a prediction is considered to be (which builds on the ROSETTA protein-nucleic acid interaction model of correct if both computational and experimental DDG’s are less than 1.0 kcal/ Havranek et al., 2004), both direct and indirect readout mechanisms con- mol, or greater than 1.0 kcal/mol, or else separated by less than 0.3 kcal/mol. tribute to the recognition of the DNA binding site: The global minimum for the fit is found by exhaustive search; the best fit is DG ¼ DG þ DG . Direct readout is mediated by protein amino obtained with N ¼ 15, E ¼ 3:0 kcal/mol. direct indirect max max acid-DNA base interactions, while indirect readout is encoded in the 3.7 Affinity logos shape of the DNA site imparted by the bound protein, primarily through non-specific amino acid-DNA phosphate backbone contacts. Direct protein- Information content-based weight matrices are usually displayed as DNA interactions are modeled as a linear combination of the repulsive and sequence logos (Schneider and Stephens, 1990) However, MatrixREDUCE attractive parts of the Lennard-Jones potential, the orientation-dependent weight matrices are discovered without a background sequence model. Thus, hydrogen bonding potential (Kortemme et al., 2003), and the Generalized an appropriate logo should display the actual relative free energies of binding Born electrostatics and solvation model (Onufriev et al., 2004): for each nucleotide at each position rather than information content. There- fore, we created affinity logos, which are constructed as follows: For each DG ¼ w E þ w E direct LJrep LJrep LJattr LJattr ð15Þ position in the PSAM, the average DDG is calculated. Then, the difference þ w E þ w G ‚ hb hb el el between each individual DDG and the average DDG at that position is where each term is a sum over all protein-DNA and protein-protein atomic computed; the absolute value of this difference is the height of the character pairs, and {w} is a set of fitting weights. Indirect readout is modeled using an representing that nucleotide. If the difference is positive (more favorable effective harmonic representation of the DNA conformational energy (Olson than average), the letter is placed above a horizontal black line through the et al., 1998): center of the logo. If the difference is negative (less favorable than average) X X the letter is placed below the black line. Larger letters are stacked on smaller ab ab DG ¼ w E þ w E ‚ ð16Þ indirect dnabp dnabs dnabp dnabs letters moving outward from the black line. The height of the letter can be bp bs interpreted as free energy difference from the average in units of RT. Thus, where the first sum is over all base pairs in the DNA site (a, b denote bases in an intuitive high amplitude is given to the nucleotide positions that most a base pair), and the second sum is over all consecutively stacked base pair contribute to the sequence specificity of the TF. To highlight that the steps (a, b denote base pairs in a base step). Base pairs and base steps are characters representing the high affinity nucleotides are above the black 0 0 counted once in the 5 to 3 direction. The first term penalizes deviations line, the characters representing the low affinity nucleotides are made from canonical base pairing, while the second term captures base stacking partially transparent. However, maintaining the representation of the poor energies. The quadratic energy terms are given by: affinity nucleotides below the center line allows the viewer to immediately see which nucleotide substitutions are most unfavorable to binding. 6 6 X X ab ab a b E ¼ f d d ‚ ð17Þ dnabs/dnabp ij i j i¼1 j¼1 3.8 PSAM to PSAM alignments and correlations By inspection of affinity logos, one can make qualitative observations about where the sums run over six geometric degrees of freedom  (Twist, Tilt, the similarity between any two PSAMs. However, a quantitative measure of Roll, Shift, Slide and Rise for base pair steps; Opening, Buckle, Propeller, similarity allows for more objective comparisons. Before two PSAMs can be Shear, Stretch and Stagger for base pairs; Lu and Olson, 2003). The DNA compared, they must first be aligned. Pearson correlations were calculated potential is a quadratic expansion in d (deviations of the degrees of free- between the DDG values for each nucleotide at each position for every dom  from their average values computed using a set of non-homologous possible overlap of the two PSAMs for both the forward and the reverse protein-DNA complexes). The force constants f are evaluated by inverting ij complement alignments. After the best overlap position and strand was the covariance matrix of d obtained with the same protein-DNA dataset: determined from the best correlation P-value, the DDG’s of the two f ¼hd d i. All six weights are simultaneously fit to experimental DDG i j ij PSAMs were recentered relative to a common reference consensus data using a generalized linear model (implemented in the statistical soft- ware package R): ðw ‚w ‚w ‚w ‚w ‚w Þ¼ sequence. Finally, the P-value for the Pearson correlation between the LJrep LJattr hb el dnabp dnabs ð0:00‚0:46‚0:77‚0:27‚ 0:03‚0:03Þ. No conformational flexibility is allowed two optimally aligned and transformed PSAMs was calculated and subjected at the protein-DNA interface. Further details on the fitting procedure and to a Bonferroni correction for the number of alignments that were tested. comprehensive tests of the all-atom free energy function can be found in Morozov et al. (2005). 4 RESULTS In another approach, we developed a ‘‘contact model’’ that exploits the structure of the protein-DNA complex bound to a high affinity reference 4.1 PSAMs inferred by MatrixREDUCE agree well DNA sequence but does not require detailed predictions of protein-DNA with experimental measurements of TF binding interaction energies. In the contact model each mutated base in the PSAM affinity column i incurs equal energy cost relative to the consensus base from the We discovered the position specific affinity matrices (PSAMs) reference sequence: for the Saccharomyces cerevisiae TFs Rap1, Ndt80, Gcn4, E ½f ðN Þ log ð1  N/N Þ < max max max Leu3, Abf1, and Sum1 by applying MatrixREDUCE to genome- DDG ðNÞ¼  f ðN Þ log ð1 þ 3N/N Þ ðN < N Þ ð18Þ max max max wide TF occupancy data and, in the case of Ndt80, differential E ðN  N Þ max max mRNA expression microarray data (Figure 2A). Experimental measurements of relative K ’s (EMSA or lacZ expression) for Here, N is the number of protein amino acid-DNA base atomic contacts d summed over the base pair i (atomic contact is defined by a distance of less specific oligonucleotides were available for Abf1, Leu3, Ndt80, than 4.5 s; hydrogen atoms are excluded from the counts), and N is the and Sum1. EMSA has long been employed to determine the max e145 B.C.Foat et al. Fig. 2. Comparison of PSAMs—affinity logos and correlations. (A) The PSAMs represented in the columns with blue headers were inferred by Matrix-REDUCE from ChIP-chip, PBM, DIP-chip, or mRNA differential expression microarray data. YPD (rich media), SM (sulfometuron methyl), and RAPA (rapamycin) refer to the environmental conditions to which the test sample was exposed before the ChIP-chip experiment. The DIP-chip experiments were performed with two different concentrations of Leu3, 4nM and 40nM. An ndt80 deletion (ndt80D) versus wild-type mRNA expression experiment (mRNA) was used to obtain the Ndt80 PSAM. The PSAMs represented in the columns green headers were inferred by modeling TF-DNA interactions based on crystal structures of the TFs using two different methods, a contact-only model and an all atom model. (B) All PSAMs for each TF were aligned pairwise and the Pearson correlation between the DDG values of both PSAMs for the best alignment was calculated. The P-value for this correlation is a measure of similarity between the PSAMs. Again, blue labels indicate PSAMs inferred by MatrixREDUCE PSAMs and green labels indicate structurally inferred PSAMs. DNA-binding affinities of TFs in vitro. Likewise, the lacZ reporter experimental method (EMSA or lacZ), and TF, we compared the assay has long been used to measure the difference in activities experimental DDG with the DDG predicted from a PSAM for the of TF binding sites. We claim that a PSAM inferred by same TF (Figure 3). In every case, the experimental DDG values MatrixREDUCE from genome-wide TF occupancy data can be strongly correlated with the PSAM-predicted DDG values, with R ’s used to predict the relative binding affinities of the measured TF ranging from 0.36 to 0.88. Thus, PSAMs inferred by MatrixRE- to any sequence. Therefore, EMSA and lacZ expression data pro- DUCE seem to be good models of the true relative DNA binding vide nearly ideal data sets for validation of the MatrixREDUCE affinities of the corresponding TFs. Unexpectedly, all of the regres- approach. For each combination of experimentally tested sequence, sions of experimental DDG’s on MatrixREDUCE DDG’s have e146 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE Fig. 3. Comparison of experimentally measured DDG’s with MatrixREDUCE PSAM-predicted DDG’s. Experimental measurements of DDG’s were derived from EMSA (A) and lacZ reporter assays (B). The experimental DDG values are plotted along the vertical axes. Predicted DDG’s were calculated from the PSAM for each tested TF for the same oligonucleotide sequences that were measured in each experiment. The MatrixREDUCE-predicted DDG values are plotted along the horizontal axes. In this representation, the higher affinity oligonucleotides have more positive DDG’s. The diagonal dashed line represents experimental DDG equal to MatrixREDUCE DDG. DDG’s are in units of RT, where R is the gas constant and T is the temperature. The R and P-values for the Pearson correlations between the experimental and predicted DDG’s are presented for each PSAM-experimental data pair. slopes less than one (range: 0.31 to 0.94). It seems that MatrixRE- of the binding interface based only on the genomic sequence and DUCE produces a slightly larger range of predicted DDG’s than is genome-wide TF occupancy data. realized in experiments. Nonetheless, the MatrixREDUCE PSAM- Gcn4 is a TF of the bZIP class. It is a homodimer with the basic predicted DDG’s are close to the experimentally inferred DDG’s in region mediating sequence specific DNA binding and the leucine most cases, especially among the highest affinity sequences. zipper region required for dimerization (O’Shea et al., 1991). For deriving the Gcn4 structural PSAMs, we used a 2.9 s crystal structure of the TF bound to the ATGAGTCAT site (Ellenberger 4.2 PSAMs inferred by MatrixREDUCE agree well et al., 1992). The symmetry of the binding site (two reverse with PSAMs inferred by structural models complement 4 bp half-sites separated by G in the middle) is a reflection of the homodimeric binding and is captured well in Both genome-wide TF occupancy data and crystal structures of MatrixREDUCE predictions. While contact model and Matrix- protein-DNA complexes are available for Ndt80, Gcn4, and REDUCE predictions are similar, the all-atom model is less Rap1. Thus, we were able to compare MatrixREDUCE PSAMs successful, probably due to the low resolution of the crystal struc- with those based on ab initio structural models (see Methods; ture, which leads to considerable uncertainty in side chain positions Figure 2A). The structurally inferred PSAMs for Ndt80 were obtained from its co-crystal structure bound to a high affinity with respect to the neighboring DNA bases. GACACAAAA site, solved at 1.4 s resolution (Lamoureux Finally, Rap1 binds DNA as a homodimer in a way that makes its et al., 2002). Figure 2A shows a reasonable agreement between DNA site a tandem repeat. The crystal structure of the Rap1 homod- DDG predictions carried out with MatrixREDUCE and structural imer in complex with a telomeric DNA site has been solved to models. The close correspondence with the contact model, which is 2.25 s resolution (Konig et al., 1996). Comparison of Matrix- a function of the number of protein side chains in contact with DNA REDUCE PSAMs and structural PSAMs reveals good agreement base pairs, is especially remarkable, showing that the Matrix- with the all atom model. The contact model overpredicts binding REDUCE approach is capable of reproducing structural details specificity at the intermediate positions in the binding site (located e147 B.C.Foat et al. bases provide an estimate of a PSAM in the form of a position- specific scoring matrix (PSSM). Since we had already compiled the EMSA and lacZ expression data, we had the opportunity to experi- mentally verify the results of these PSSMs. We gathered the BioProspector and MDscan results from the original, published analyses, transformed them into PSSMs, and used them to predict DDG’s for the EMSA and lacZ experimentally tested sequences. We performed this comparison using two different ‘‘background’’ nucleotide frequency models: one using equal nuc- leotide probabilities and one using nucleotide probabilities derived from S. cerevisiae intergenic sequences. The R and P-values for the Fig. 4. Correlations of experimentally measured DDG’s with information correlations between these predicted DDG’s and the experimental theory-predicted DDG’s. Experimental measurements of DDG’s were derived 2 results are displayed in Figure 4. Overall, the quality of the results from EMSA and lacZ reporter assays. The R and P-values for the Pearson from the information theory PSSMs and the MatrixREDUCE correlations between the experimental and predicted DDG’s are presented for PSAMs were similar. However, the results for the PSSMs are dif- each PSSM-experimental data pair. PSSMs were derived and tested using ferent depending on the choice of equal or intergenic nucleotide two different background nucleotide frequencies: equal probabilities and frequencies. While we did not test this scenario, the information intergenic probabilities. theory results would also change depending on the probe intensity threshold chosen to label genes as ‘‘bound.’’ Thus, while Matrix- between tandem repeats), likely because it assigns similar REDUCE performs comparably with existing information theory specificities to protein-DNA contacts in the loop region and in methods, it conveniently avoids having to choose several ad hoc the DNA binding domains. parameters required by the other methods. 4.3 PSAM to PSAM correlations 5 DISCUSSION Upon visual inspection of Figure 2A, the similarities are immedi- Overall, position specific affinity matrices (PSAMs) as inferred by ately apparent between affinity logos for the same factor inferred MatrixREDUCE from genome-wide TF occupancy data are good using different experimental and computational methods. However, approximations of the real sequence-specific DNA binding affini- a quantitative measure of these similarities can be obtained by ties. Discrepancies between the computationally predicted and the aligning the PSAMs (see Methods) and calculating the correlation experimentally inferred binding affinities may be due to either the of their DDG values. The P-value for this correlation between two computational or the experimental methods. EMSA has known PSAMs serves as our similarity metric (Figure 2B). Overall, the problems with ‘‘caging’’ of the TFs by the gel while electrophoresis similarity between the PSAMs from MatrixREDUCE are the most is proceeding (Fried and Crothers, 1981). This could lead to inferred significant. There is extreme similarity between the Rap1 PSAMs DDG’s of smaller magnitude. Likewise, lacZ reporter assays are a inferred from ChIP-chip and PBM data. The PSAMs inferred for very indirect way of measuring relative binding affinities as they Gcn4 for three different ChIP-chip conditions are all very similar as require transcription, translation, and b-galactosidase reactions in well. The Leu3 PSAMs inferred from the DIP-chip data are much order to make measurements, and noise could be introduced at each more similar to each other than they are to the Leu3 PSAM inferred step. Structural model predictions are strongly dependent on the from the ChIP-chip data, but they are still both significantly similar quality of input structures and are affected by errors in the energy (P-value < 0.01) to the ChIP-chip Leu3 PSAM. The significance of function. The current MatrixREDUCE model may also give rise to the correlations between MatrixREDUCE PSAMs and structurally systematic biases. First, it makes the approximation that nucleotides inferred PSAMs is more variable. Both the all atom model PSAM contribute independently to the free energy of TF binding (Benos and the contact model PSAM for Rap1 and Ndt80 have significant et al., 2002). Second, it makes the assumption that the concentration similarities with the respective MatrixREDUCE PSAMs (P-value < of TF is much smaller than the K , which may not be correct for 0.01). However for Gcn4, while the contact model PSAM has strong some TFs. Finally, all consecutive positions in the PSAM are cur- similarities with all of the other PSAMs, the all atom PSAM has rently treated as parameters to be estimated, which may lead to insignificant similarities with all other PSAMs. overfitting. We plan to address these issues in a future version of the algorithm. 4.4 How good is the information theory Despite these current limitations, PSAMs discovered using the approximation? current implementation of MatrixREDUCE are good approx- In the original papers describing the ChIP-chip (Harbison et al., imations of the relative nucleotide binding affinities of assayed 2004), PBM (Mukherjee et al., 2004), and DIP-chip (Liu et al., TFs. Especially for microarray methods like PBM and DIP-chip, 2005) data, the authors used BioProspector (Liu et al., 2001) or where the objective is to define nucleotide-binding specificities, MDscan (Liu et al., 2002) to define weight matrix representations of MatrixREDUCE may be the most physically accurate method TF binding sites. These two methods use the set of sequences that available to analyze the data. Even for less direct reflections of the experimenters label as ‘‘bound’’ to produce a list of potential TF binding affinities like ChIP-chip or differential mRNA expres- binding sites. When interpreted through information theory, the sion data, it will still provide good approximations of the sequence- nucleotide frequencies at each individual position in the binding specific binding affinities of TFs relevant to those data. Preliminary sites divided by the ‘‘background’’ frequencies for the respective results also suggest that MatrixREDUCE performs well on data e148 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE from higher eukaryotes including D. melanogaster and mammals. Konig,P., Giraldo,R., Chapman,L. and Rhodes,D. (1996) The crystal structure of the DNA-binding domain of yeast RAP1 in complex with telomeric DNA. Cell, 85, Finally, MatrixREDUCE has two key advantages over most other 125–136. computational methods for defining nucleotide binding specific- Kortemme,T., Morozov,A.V. and Baker,D. (2003) An orientation-dependent hydrogen ities: (i) it uses the information for all probes in genome-wide bonding potential improves prediction of specificity and structure for proteins and TF occupancy data, and (ii) it does not require a background protein-protein complexes. J. Mol. Biol., 326, 1239–1259. Lamoureux,J.S., Stuart,D., Tsang,R., Wu,C. and Glover,J.N.M. (2002) Structure of sequence model. the sporulation-specific transcription factor Ndt80 bound to DNA. EMBO J., 21, 5721–5732. ACKNOWLEDGMENTS Lee,T.I., Rinaldi,N.J., Robert,F., Odom,D.T., Bar-Joseph,Z., Gerber,G.K., Hannett,N.M., Harbison,C.T., Thompson,C.M., Simon,I., Zeitlinger,J., This work was supported by National Institutes of Health grants Jennings,E.G., Murray,H.L., Gordon,D.B., Ren,B., Wyrick,J.J., Tagne,J.B., HG003008 (to H.J.B) and CA121852. A.V.M. is a fellow of the Volkert,T.L., Fraenkel,E., Gifford,D.K. and Young,R.A. (2002) Transcriptional Leukemia and Lymphoma Society. The authors thank Jason Lieb, regulatory networks in Saccharomyces cerevisiae. Science, 298, Neil Clarke, Martha Bulyk, Ernest Fraenkel, Xiao Liu, and Michael 799–804. Lieb,J.D., Liu,X., Botstein,D. and Brown,P.O. (2001) Promoter-specific binding of Berger for graciously answering questions and supplying additional Rap1 revealed by genome-wide maps of protein-DNA association. Nat. Genet., data from their publications. The authors also thank Gabor Halasz, 28, 327–334. Ronald Tepper, Junbai Wang, Xiang-Jun Lu, and Lucas Ward for Liu,X.S., Brutlag,D.L. and Liu,J.S. (2002) An algorithm for finding protein-dna valuable conversations, and Xiang-Jun Lu for reimplementing the binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol., 20, 835–839. MatrixREDUCE software for distribution. Liu,X., Brutlag,D.L. and Liu,J.S. (2001) Bioprospector: discovering conserved dna motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. REFERENCES Biocomput. 127–138. Liu,X. and Clarke,N.D. (2002) Rationalization of gene regulation by a eukaryotic Benos,P.V., Bulyk,M.L. and Stormo,G.D. (2002) Additivity in protein-DNA interac- transcription factor: calculation of regulatory region occupancy from predicted tions: how good an approximation is it? Nucleic Acids Res., 30, 4442–4451. binding affinities. J. Mol. Biol., 323, 1–8. Berg,O.G. and von Hippel,P.H. (1987) Selection of DNA binding sites by regulatory Liu,X., Noll,D.M., Lieb,J.D. and Clarke,N.D. (2005) DIP-chip: rapid and accurate proteins. Statistical-mechanical theory and application to operators and promoters. determination of DNA-binding specificity. Genome Res., 15, 421–427. J. Mol. Biol., 193, 723–750. Lu,X.J. and Olson,W.K. (2003) 3DNA: a software package for the analysis, rebuilding Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O. and Herskowitz,I. and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res., (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 31, 5108–5121. 699–705. Morozov,A.V., Havranek,J.J., Baker,D. and Siggia,E.D. (2005) Protein-DNA Djordjevic,M., Sengupta,A.M. and Shraiman,B.I. (2003) A biophysical approach to binding specificity predictions with structural models. Nucleic Acids Res., 33, transcription factor binding site discovery. Genome Res., 13, 2381–2390. 5781–5798. Djordjevic,M. and Sengupta,A.M. (2006) Quantitative modeling and data analysis of Mukherjee,S., Berger,M.F., Jona,G., Wang,X.S., Muzzey,D., Snyder,M., Young,R.A. SELEX experiments. Phys. Biol., 3, 13–28. and Bulyk,M.L. (2004) Rapid analysis of the DNA-binding specificities of Ellenberger,T.E., Brandl,C.J., Struhl,K. and Harrison,S.C. (1992) The GCN4 basic transcription factors with DNA microarrays. Nat. Genet., 36, 1331–1339. region leucine zipper binds DNA as a dimer of uninterrupted alpha helices: crystal Olson,W.K., Gorin,A.A., Lu,X.J., Hock,L.M. and Zhurkin,V.B. (1998) DNA sequence- structure of the protein-DNA complex. Cell, 71, 1223–1237. dependent deformability deduced from protein-DNA crystal complexes. Proc. Endres,R.G., Schulthess,T.C. and Wingreen,N.S. (2004) Toward an atomistic model Natl. Acad. Sci. USA, 95, 11163–11168. for predicting transcription-factor binding sites. Proteins, 57, 262–268. Onufriev,A., Bashford,D. and Case,D.A. (2004) Exploring protein native states and Foat,B.C., Houshmandi,S.S., Olivas,W.M. and Bussemaker,H.J. (2005) Profiling large-scale conformational changes with a modified Generalized Born model. condition-specific, genome-wide regulation of mRNA stability in yeast. Proc. Proteins, 55, 383–394. Natl. Acad. Sci. USA, 102, 17675–17680. O’Shea,E.K., Klemm,J.D., Kim,P.S. and Alber,T. (1991) X-ray structure of the GCN4 Fried,M. and Crothers,D.M. (1981) Equilibria and kinetics of lac repressor-operator leucine zipper, a two-stranded, parallel coiled coil. Science, 254, 539–544. interactions by polyacrylamide gel electrophoresis. Nucleic Acids Res., 9, Paillard,G. and Lavery,R. (2004) Analyzing protein-DNA recognition mechanisms. 6505–6525. Structure, 12, 113– 122. Gailus-Durner,V., Xie,J., Chintamaneni,C. and Vershon,A.K. (1996) Participation of Pierce,M., Benjamin,K.R., Montano,S.P., Georgiadis,M.M., Winter,E. and the yeast activator Abf1 in meiosis-specific expression of the HOP1 gene. Mol. Vershon,A.K. (2003) Sum1 and Ndt80 proteins compete for binding to middle Cell. Biol., 16, 2777–2786. sporulation element sequences that control meiotic gene expression. Mol. Cell. Granek,J.A. and Clarke,N.D. (2005) Explicit equilibrium modeling of transcription- Biol., 23, 4814–4825. factor binding and gene regulation. Genome Biol., 6, R87. Ren,B., Robert,F., Wyrick,J.J., Aparicio,O., Jennings,E.G., Simon,I., Zeitlinger,J., Grubbs,F. (1969) Procedures for detecting outlying observations in samples. Techno- Schreiber,J., Hannett,N., Kanin,E., Volkert,T.L., Wilson,C.J., Bell,S.P. and metrics, 11, 1–21. Young,R.A. (2000) Genome-wide location and function of DNA binding proteins. Harbison,C.T., Gordon,D.B., Lee,T.I., Rinaldi,N.J., Macisaac,K.D., Danford,T.W., Science, 290, 2306–2309. Hannett,N.M., Tagne,J.B., Reynolds,D.B., Yoo,J., Jennings,E.G., Zeitlinger,J., Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display Pokholok,D.K., Kellis,M., Rolfe,P.A., Takusagawa,K.T., Lander,E.S., consensus sequences. Nucleic Acids Res., 18, 6097–6100. Gifford,D.K., Fraenkel,E. and Young,R.A. (2004) Transcriptional regulatory Simon,I., Barnett,J., Hannett,N., Harbison,C.T., Rinaldi,N.J., Volkert,T.L., Wyrick,J.J., code of a eukaryotic genome. Nature, 431, 99–104. Zeitlinger,J., Gifford,D.K., Jaakkola,T.S. and Young,R.A. (2001) Serial regulation Havranek,J.J., Duarte,C.M. and Baker,D. (2004) A simple physical model for the of transcriptional regulators in the yeast cell cycle. Cell, 106, 697–708. prediction and design of protein-DNA interactions. J. Mol. Biol., 344,59–70. Stormo,G.D. and Fields,D.S. (1998) Specificity, free energy and information content in Issel-Tarver,L., Christie,K.R., Dolinski,K., Andrada,R., Balakrishnan,R., Ball,C.A., protein-DNA interactions. Trends Biochem. Sci., 23, 109–113. Binkley,G., Dong,S., Dwight,S.S., Fisk,D.G., Harris,M., Schroeder,M., Stormo,G.D., Schneider,T.D. and Gold,L. (1986) Quantitative analysis of the Sethuraman,A., Tse,K., Weng,S., Botstein,D. and Cherry,J.M. (2002) Saccha- relationship between nucleotide sequence and functional activity. Nucleic Acids romyces Genome Database. Methods Enzymol., 350, 329–346. Res., 14, 6661–6679. Iyer,V.R., Horak,C.E., Scafe,C.S., Botstein,D., Snyder,M. and Brown,P.O. (2001) Stormo,G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, Genomic binding sites of the yeast cell-cycle transcription factors SBF and 16, 16–23. MBF. Nature, 409, 533–538. e149

Journal

BioinformaticsOxford University Press

Published: Jul 15, 2006

There are no references for this article.