Substitution Rates Predicted by Stability-Constrained Models of Protein Evolution Are Not Consistent with Empirical Data

Substitution Rates Predicted by Stability-Constrained Models of Protein Evolution Are Not... Abstract Protein structures strongly influence molecular evolution. In particular, the evolutionary rate of a protein site depends on the number of its native contacts. Stability-constrained models of protein evolution consider this influence of protein structure on evolution by predicting the effect of mutations on the stability of the native state, but they currently neglect how mutations affect the protein structure. These models predict that buried protein sites with more native contacts are more constrained by natural selection and less variable, as observed. Nevertheless, previous work did not consider the stability against compact misfolded conformations, although it is known that the negative design that destabilizes these misfolded conformations influences protein evolution significantly. Here, we show that stability-constrained models that consider misfolding predict that site-specific sequence entropy and substitution rate peak at amphiphilic sites with an intermediate number of contacts, as these sites are less constrained than exposed sites with few contacts whose hydrophobicity must be limited. This result holds both for a mean-field model with independent sites and for a pairwise model that takes as a reference the wild-type sequence, but it contrasts with the observations that indicate that the entropy and the substitution rate decrease monotonically with the number of contacts. Our work suggests that stability-constrained models overestimate the tolerance of amphiphilic sites against mutations, either because of the limits of the free energy function or, more importantly in our opinion, because they do not consider how mutations perturb the native protein structure. stability-constrained protein evolution, structure-constrained protein evolution, substitution rates, site-specific sequence entropy Introduction One of the main determinants of the variation of the substitution rates across the sites of a protein is the number of native contacts. Buried sites tend to evolve more slowly than exposed sites (Franzosa and Xia 2009). Later studies found that the number of native contacts is a better predictor of the evolutionary rate than the solvent accessibility (Yeh et al. 2014). This tendency was interpreted as an indication that natural selection imposes stronger constraints on sites with many native contacts (Echave et al. 2016). To explain the observed negative correlation between substitution rates and number of contacts, two kinds of models of the fitness effect of a mutation have been adopted (Echave et al. 2015). The first kind of model, which we call here stability-constrained fitness model, computes the fitness from the predicted effect of the mutation on the folding free energy ΔG that determines the stability of the folded native state (see Goldstein 2011; Serohijos and Shakhnovich 2014; Bastolla et al. 2017). Typically, the model adopts a mesoscopic representation of the protein structure such as the contact matrix and it estimates the change of free energy of the native contact matrix due to the simulated mutation, assuming that the native structure does not change. This type of model is sometimes referred in the literature as structurally constrained protein evolution model, but we prefer to call it stability-constrained model because it does not consider that mutations can modify the protein structure. We reserve the name “structurally constrained” to the second type of model of protein evolution, which estimates how mutations affect the structure of the native state and computes the fitness from this predicted structural change. In the original version proposed by Echave (2008), the mutation was modelled as a perturbation applied to the wild-type structure and the mutated structure was estimated through linear response theory adopting the elastic network model (ENM; Tirion 1996). As the ENM does not represent the protein sequence, in this framework it is not possible to compute the change of folding stability, thus the fitness is assumed to depend only on the predicted structural change, not on the stability change. A modified version of this model was described in terms of the stability of the active conformation (Huang et al. 2014), but in fact the change of folding stability ΔΔG cannot be computed in the ENM approximation and the fitness only depends on the structural change, which is the reason why we prefer the name “structure-constrained fitness model.” Of course mutations modify both the stability and the precise structure of the native state, but current models of fitness cannot compute both effects. Both the structure-constrained fitness model and the stability-constrained fitness model qualitatively reproduce the observed correlation between number of contacts and substitution rate (Scherrer et al. 2012; Huang et al. 2014; Echave et al. 2015). Nevertheless, the stability-constrained model that was adopted in Echave et al. (2015) has two important limitations. First of all, it is a neutral model in which the fitness is zero if the stability is below a threshold and one otherwise, which does not allow to simulate the evolutionary effect of small increases or decreases in stability. Second and more important, the model only considers the stability against unfolding (i.e., the difference between the free energy of the native state and the stability of the unfolded state, which is often considered independent of the protein sequence), without considering the effect of the mutation on the free energy of the misfolded ensemble. In contrast, several empirical observations support the view that the stability of the misfolded state is subject to selection through negative design that tends to destabilize misfolded conformations (Berezovsky et al. 2007; Noivirt-Brik et al. 2009; Minning et al. 2013). Here, we revisit the site-dependence of the substitution rates and its relationship with protein stability against both unfolding and misfolding adopting a recently proposed mean-field (MF) model of protein evolution (Arenas et al. 2015). In line with a previous model (Bastolla et al. 2006), the MF model assumes that all sites of the protein evolve independently in a site-specific manner, and it determines their site-specific properties by imposing a global constraint on the thermodynamic stability of the known native state against both unfolding and misfolding. We have shown that the MF model significantly improves the likelihood of inferred evolutionary events with respect to empirical models that do not take into account the structural properties of each site (Arenas et al. 2015). Furthermore, it improves the reconstruction of the stability properties of ancestral sequences (Arenas et al. 2017). Nevertheless, we shall show here that the site-specific entropy and substitution rates predicted through this model are in qualitative disagreement with observed data for sites with an intermediate number of contacts, neither exposed nor buried. In order to investigate how the assumption of site-independence and the misfolding stability influence the substitution rates, we also considered alternative models. In addition to the independent-site MF model in which the effects of a mutation on stability are estimated through the mean-field approximation, we considered a model in which the effect of all point mutations that start from the wild-type sequence is explicitly computed considering all pairwise interactions in the native state and the misfolded ensemble. We call this model the wild-type (WT) model. The WT model is similar to the computation performed by Echave et al. (2015) to model stability effects on fitness, although a different free energy function is used. Second, we considered two different stability-constrained models: 1) The complete stability model that considers the stability of the nonnative state against both unfolding and misfolding, and 2) The native stability model that computes the folding free energy ΔG as the difference between the free energy of the native state and the one of the unfolded state, neglecting the contribution of the misfolded ensemble. In both cases, the stability of the unfolded state is assumed to be independent of the protein sequence, but we do not expect any qualitative change if this assumption is relaxed (see Discussion). Results We measured the site-specific properties predicted by the different versions of the stability-constrained model using a test set of 213 monomeric proteins with structures determined through X-ray crystallography. These proteins have been used in a previous study (Yeh et al. 2014), and for each of them a multiple sequence alignment (MSA) had been obtained, which allowed us to determine the site-specific sequence entropy and the site-specific substitution rate, the latter estimated through the program rate4site (Pupko et al. 2002). Sequence Entropy and Hydrophobicity Predicted from Stability Constraints The site-specific sequence entropy Si=−∑a=120Pi(a) log⁡(Pi(a)) is inversely related to the selective constraints acting on the site. Consistent with a previous model based on similar ideas (Bastolla et al. 2006), the MF model predicts that the site-specific sequence entropy is strongly dependent on the number of native contacts of the site and it has a maximum for amphiphilic sites with intermediate number of contacts and mixed exposed-buried character. Figure 1 shows the sequence entropy versus the number of native contacts for sites in a single protein. Each point represents a site, and a quadratic fit is shown for clarity. Fig. 1. View largeDownload slide Site-specific sequence entropy as a function of the number of contacts for the MF model with independent sites (left) and the WT model with pairwise interactions (right). Each point represents one site of the protein with PDB code 2thi. Quadratic fits are shown as a guide to the eye. Fig. 1. View largeDownload slide Site-specific sequence entropy as a function of the number of contacts for the MF model with independent sites (left) and the WT model with pairwise interactions (right). Each point represents one site of the protein with PDB code 2thi. Quadratic fits are shown as a guide to the eye. The MF model can easily explain this behavior of the sequence entropy. We can approximate the contact interaction energy adopted in the MF model through the hydrophobic approximation, U(a,b)≈−ϵh(a)h(b), where U(a, b) is the effective interaction energy gained when amino acids a and b form a contact, ϵ>0 is a parameter, and h(a) is the main eigenvector of the matrix U(a, b), which is related with the hydrophobicity of the residue a, and which we shall call hydrophobicity in the following. Thus, we expect that the selection for folding stability mainly affects the average hydrophobicity of each site in such a way that sites with more native contacts tend to have larger hydrophobicity averaged over the mean-field distribution (Porto et al. 2005). In other words, buried sites with many contacts tend to be more hydrophobic, whereas exposed sites with few contacts tend to be more polar, consistent with qualitative ideas on protein folding. Sites whose average hydrophobicity is equal to the average hydrophobicity of an equiprobable amino acid distribution are minimally constrained, and they have maximal entropy. Thus, according to the MF model, selective constraints on folding stability are weakest at amphiphilic sites with intermediate exposure, not at exposed sites. It can be hypothesized that this result stems from the assumption of independent sites used to build the MF model, whereas under a pairwise model the sites that form more interactions are subjected to stronger constraints. To address this important concern, we adopted a simple model in which the fitness effect of all point mutations from the wild-type sequence is computed with the full pairwise stability model. In this model, which we call WT, the frequency of amino acid a at position i is assumed to be proportional to the background distribution multiplied times the exponential of the fitness of the corresponding mutation in which the wild-type amino acid AiWT is substituted by the new amino acid a, PWT,i(a)∝Pmut(a) exp ⁡(Λf(mut(AiWT→a))), where the fitness of a sequence is computed as in equation (7) and the selection parameter Λ is determined in such a way to maximize the likelihood of the wild-type sequence, ∑i log ⁡(PWT,i(AiWT)). We note that this model is only valid one substitution away from the wild-type sequence and its use is questionable for longer substitution processes, but it allows us assessing the effect of pairwise versus mean-field computations of stability effects on fitness. The results obtained with this WT model are shown in figure 2 (left) together with the results of the MF model. As expected, the entropy of sites with many contacts is considerably reduced with respect to the mean-field value, indicating that the pairwise interactions increase the selective constraints for buried sites. Nevertheless, we still observe a maximum in the entropy curve, that is, amphiphilic sites are less constrained than fully exposed sites. Fig. 2. View largeDownload slide Left: Site-specific sequence entropy as a function of the number of contacts for the MF model with independent sites and for the WT model that takes into account pairwise interactions between sites, both considering stability against misfolding and nonconsidering it. Each point represents the average over 213 proteins in the data set. Right: Average hydrophobicity as a function of the number of contacts for the MF model with independent sites and the WT model with pairwise interactions, either considering or neglecting stability against misfolding. The thick line represents the average entropy of the residues of the PDB sequence. Fig. 2. View largeDownload slide Left: Site-specific sequence entropy as a function of the number of contacts for the MF model with independent sites and for the WT model that takes into account pairwise interactions between sites, both considering stability against misfolding and nonconsidering it. Each point represents the average over 213 proteins in the data set. Right: Average hydrophobicity as a function of the number of contacts for the MF model with independent sites and the WT model with pairwise interactions, either considering or neglecting stability against misfolding. The thick line represents the average entropy of the residues of the PDB sequence. Our WT model is similar to the model adopted by Echave, Jackson, and Wilke (2005, EJW). Nevertheless, the EJW model predicts that the sequence entropy decreases monotonically with the number of contacts. We reason that this discrepancy may originate from the fact that the EJW model does not consider stability against misfolding. In fact, in our models sites with few native contacts are constrained to be more polar than sites drawn from the background distribution. This fact may be attributed to the stability against misfolding. If exposed sites were more hydrophobic than what our model predicts, the stability against unfolding would remain almost the same, but the stability against misfolding would be hindered, and our selection model would limit the hydrophobicity of exposed sites. To test this reasoning, we predict the amino acid distributions evaluating the folding stability without considering the misfolded ensemble (see Materials and Methods). Figure 2 (right) represents the average hydrophobicity, defined as ⟨hi⟩=∑aPi(a)h(a), versus the number of contacts. Both for the MF and for the WT models, the hydrophobicity of exposed sites with few contacts is substantially lower for the model that takes into account misfolding. This is expected according to the understanding that the hydrophobicity of sites with few contacts is constrained by negative design, whereas models that neglect misfolding generate higher hydrophobicity at these sites. The relaxed selective pressure for polarity at exposed sites also induces relaxed selective pressure for hydrophobicity at buried sites, so that in the absence of selection against misfolding the average hydrophobicity is less dependent on the number of contacts and more similar to what it would be under mutation alone. We also show in figure 2 (right) empirical hydrophobicity values obtained from the MSA. At exposed sites, empirical hydrophobicity values are even smaller than those of models that consider misfolding, and they strongly disagree with models that neglect misfolding, supporting the importance of stability against misfolding. From the empirical hydrophobicity curve, we can see that the predictions based on the complete model overestimate the hydrophobicity of sites with 12 or more contacts. This discrepancy affects less than 8% of sites, and it is due to the fact that in real sequences a large number of contacts select large aromatic amino acids whose hydrophobicity is intermediate, whereas in our free energy model the size of the amino acid is not properly considered. A stronger correlation with the empirical hydrophobicity is observed for the principal eigenvector of the contact matrix (Bastolla et al. 2005), which is correlated with the number of contacts but is less dependent on the size of the amino acid. Altogether, empirical data clearly support the importance of selection against misfolding. The behavior of the average hydrophobicity versus the number of contacts directly affects the sequence entropy, as previously shown (Porto et al. 2005). In agreement with the above analysis, we find that the sequence entropy is substantially higher for the models that do not take into account misfolding than for the complete stability models (fig. 2 left), showing that stability against misfolding is an important selective constraint in both the MF and the WT models. The difference between the complete and the native model is most important for sites with minimum and maximum number of contacts. Of particular importance, the native model that does not consider stability against misfolding predicts that the entropy monotonically decreases with the number of contacts, in agreement with the EJW model (see fig. 2 left). Therefore, the existence of a maximum of the entropy in the MF and the WT models is a consequence of considering stability with respect to misfolding. The predictions of the evolutionary model that considers misfolding are robust with respect to introducing the contribution to protein stability of the propensities between amino acids and secondary structure. This contribution is modelled by adding to the free energy function the term ∑iϵSS(si,Ai), where si is the secondary structure, Ai is the amino acid, and ϵSS is a statistical potential for secondary structure whose parameters have been determined in Bastolla et al. (2008). These parameters have the property that they can be combined with contact interactions as they do not count the propensity between secondary structure and amino acids induced by the number of contacts. We found that this addition slightly reduces the sequence entropy, as expected, but it does not modify its overall shape versus the number of contacts (supplementary fig. S1, Supplementary Material online). Site-Specific and Pooled Entropy from Empirical Data We now examine site-specific sequence entropies measured from MSA for proteins with known X-ray structure. For each site, we measure the site-specific amino acid distribution and compute the corresponding entropy. The entropies of sites with the same number of contacts nc are averaged and the average entropy is represented as a function of nc in figure 3 (A). One can see that the MSA entropy is much smaller than the site-specific entropy predicted even by the model displaying the strongest stability constraints, that is, the WT model with selection on misfolding. Importantly, the empirical entropy monotonically decreases with the number of contacts. However, the sequences in the MSA are evolutionarily related and they are not independent of each other. This can lead to underestimate the entropy because there was not sufficient time to explore sequence space. Because of this reason, we do not attach much importance to the absolute value of the site-specific entropy. Nevertheless, the fact that the entropy is a decreasing function of the number of contacts is quite robust. In order to test it, we computed the MSA entropy iteratively removing from each alignment the most divergent sequence whose average identity with the other sequences is lowest. By increasing the number of removed sequences from 0 to 10, the value of the entropy slightly decreases, but the functional dependence on the number of contacts does not change (see supplementary fig. S2, Supplementary Material online). Therefore, the stability-constrained model of protein evolution that takes into account stability against misfolding is not supported by empirical data, neither at the quantitative level (the value of the entropy) nor, more importantly, at the qualitative level (the dependence of the entropy with the number of contacts). This is the main result of this work. Fig. 3. View largeDownload slide A: Site-specific sequence entropy as a function of the number of contacts for the empirical sequence entropy obtained from MSAs. B: Sequence entropy of the pooled amino acid distribution of all PDB residues present at sites with nc contacts as a function of nc. Fig. 3. View largeDownload slide A: Site-specific sequence entropy as a function of the number of contacts for the empirical sequence entropy obtained from MSAs. B: Sequence entropy of the pooled amino acid distribution of all PDB residues present at sites with nc contacts as a function of nc. Next, we pool together all sites with nc contacts, we compute the entropy of the corresponding amino acid distribution, and we plot it as a function of nc. Figure 3 (B) shows that pooled entropies are much higher than site-specific entropies, and they present a maximum versus the number of contacts, as predicted by stability-constrained models. These results confirm those obtained in Porto et al. (2005), who found that the sequence entropy versus the number of contacts has a maximum when sites with the same number of contacts are grouped together, but not when they are examined separately (Ramsey et al. 2011) as in figure 1. As disucssed by Ramsey et al., the fact that the pooled entropy is much larger than the site-specific entropy suggests that the latter responds to stronger structural constraints, which are averaged away when sites are pooled, leaving only weaker and less specific constraints that are captured by the stability-constrained model. Importantly, exposed sites with few contacts have smaller pooled entropy, that is, they are more selective for the amino acid types, than amphiphilic sites with an intermediate number of contacts. This reduced entropy of exposed sites is a further indication that these sites are not likely to contain hydrophobic residues, which supports the model with misfolding. Note in fact that the pooled entropy of sites with zero contacts is 2.5, which agrees better with the entropy predicted by the complete stability models (2.7) than with the entropy predicted by the native stability models (2.9). Consistent with the analysis of the site-specific average hydrophobicity, these data suggest that selection against misfolding is even stronger in real sequences than in sequences generated with stability-constrained models. Substitution Rates The results reported above have direct implications for the substitution rates. In fact, evolutionary arguments suggest that more constrained sites with smaller sequence entropy evolve slower. This behavior is clear from empirical data from MSA, which show a monotonic increase of the substitution rate with sequence entropy, as shown in figure 4 (A). The plot is obtained by computing the sequence entropy and the substitution rate for each individual site and then grouping together sites with similar sequence entropy. As a result of the monotonic relationships between number of contacts and sequence entropy and between sequence entropy and substitution rate, the substitution rate monotonically decreases with the number of contacts as found in several previous studies (Echave et al. 2016) (see fig. 4 B). Fig. 4. View largeDownload slide Substitution rate versus the sequence entropy (A) and versus the number of contacts (B) for empirical data derived from MSAs. Fig. 4. View largeDownload slide Substitution rate versus the sequence entropy (A) and versus the number of contacts (B) for empirical data derived from MSAs. The monotonic increase of the substitution rate with the sequence entropy is consistent with the Halpern–Bruno (HB) model of the substitution process (Halpern and Bruno 1998). This mathematical model is derived from the population genetics formula of the fixation probabilities in the limit of very small mutation rate (Kimura 1962). Briefly, the HB model considers the independent sites approximation, P(A1⋯AL)=∏iPi(Ai) and models the evolutionary process at each site as a Moran’s process. The HB model is based on the observation that the stationary distribution of the Moran’s process at site i is an exponential distribution of the fitness of amino acid a at site i, Pi(a)∝Pmut(a) exp ⁡(Λfi(a)), where Pmut(a) is a background distribution that represents the mutation process and Λ is a selection parameter related with the effective population size. The same relationship between stationary distribution and fitness was later recognized by Sella and Hirsch (2005) and by Mustonen and Lässig (2005), without invoking the independent sites approximation. These authors deeply discussed the analogy between population genetics and statistical physics, where the inverse of population size plays the role of the evolutionary temperature and the fitness plays the role of minus the energy. For a didactic discussion of this subject, see also Nido et al. (2016). The MF and WT models are also based on the same principle, but they explicitly compute the fitness and the site-specific amino acid frequencies from the folding stability model, whereas in the HB approach the site-specific amino acid frequencies are obtained from empirical data. The HB approach allows computing the site-specific substitution processes with rate matrices Qabi that fulfill the following properties: 1) They admit as stationary distributions the Pi(a) and 2) they can be expressed as Qabi=QabmutPfix(fai,fbi), where Qmut is a global mutation process and Pfix is the fixation probability derived from the Moran’s model whose stationary distribution is Pai. Without loss of generality, we can parametrize the rate matrix of the mutation process as Qabmut=EabmutPbmut, where Pamut is the stationary matrix of the mutation process and Eabmut is its exchangeability matrix. To simplify formulas, here we assume detailed balance, that is, we assume that Eabmut is a symmetric matrix. Halpern and Bruno showed that the conditions (1) and (2) above imply that the site-specific substitution processes are given by   Qabi=EabiPbi, (1)  Eabi=Eabmut(ln⁡(Fbsel,i)−ln⁡(Fasel,i)Fbsel,i−Fasel,i) (2)  with  Fasel,i=PaiPamut. (3) As the exchangeability matrix Eabi is symmetric, detailed balance holds and Pai is the stationary distribution. The selective factors Fasel,i quantify the deviation of the site-specific distribution Pai from the background distribution Pamut induced by mutation alone. For substitutions with Fasel,i=Fbsel,i, and in particular for synonymous substitutions a = b, applying l’Hopital’s rule we find Eabi=Eabmut/Fbsel,i and Qabi=Qabmut, that is, the rate of synonymous substitutions, which here are considered selectively neutral, equals the mutation rate, in agreement with Kimura’s theory. If the amino acid b is favored by selection with respect to amino acid a, Fbsel,i>Fasel,i, then the substitution rate is enhanced with respect to the neutral rate, and it is decreased in the opposite case. Because of detailed balance, these effects are exactly compensated by the ratio of the initial frequencies of a and b, and the flux in one direction and the other are equal, Rabi=PaiPbiEabi=Rbai. In the HB model, the flux between two amino acids a and b is given by   Rabi=PaiPbiEabi=(PamutPbmutEabmut)Fasel,iFbsel,iln⁡(Fbsel,i)−ln⁡(Fasel,i)Fbsel,i−Fasel,i. (4) In the above equation, the flux is partitioned into a global component that derives from the mutation process (superscript mut) and a site-specific component that derives from selection (superscript sel) and is function of Fasel,i=Pai/Pamut. The flux is maximal for substitutions ab that have large mutational flux PamutPbmutEabmut and large and almost equal selective factors Fasel,i≈Fbsel,i. The flux decreases when the mutational flux decreases, when the smaller selection factor decreases and when the fitness difference between a and b increases. Based on the latter property, Halpern and Bruno (1998) argued that the substitution rate Ri=∑abPaiEabiPbi is higher at position with higher sequence entropy. This expectation is fulfilled on the average but it is not a rigorous inequality, and in fact we observe that the substitution rate is not a strictly increasing function of sequence entropy (see fig. 5 A). Fig. 5. View largeDownload slide Substitution rate predicted by the four versions of the stability-constrained model versus the sequence entropy (A) and versus the number of contacts (B). Fig. 5. View largeDownload slide Substitution rate predicted by the four versions of the stability-constrained model versus the sequence entropy (A) and versus the number of contacts (B). We determine the global exchangeability matrix Eabmut through the condition that the average flux between any pair of amino acids equals the flux measured in an empirical model of the substitution process, ∑iRabi/L=Rabemp. In this way, the effect of the selection process represented in the MF model disappears from the matrix Eabmut, and this matrix does not contain any free parameter. We note here that in our original mean-field paper (Arenas et al. 2015) we assumed that the exchangeability matrix is the same at all sites, Eabi=Eabmut. This assumption implies the condition (1) above that the distribution Pai is the stationary distribution of the substitution process Qabi=EabmutPbi, but it is inconsistent with the Moran’s fixation probability. Moreover, this assumption would imply that the substitution rate Ri=∑abEabmutPaiPbi is high at position with low sequence entropy, as the highest possible value is attained for a distribution in which the two amino acids a¯ and b¯ with largest exchangeability have half of the weight, Pa¯i=Pb¯i=0.5. We verified numerically that our previous model with global exchangeability matrix predicts an inverse relationship between sequence entropy and substitution rate that is contradicted by empirical data and is inconsistent with the Moran’s model. Therefore, to amend this point, here and in our next works we shall adopt for the MF model the exchangeability matrix Eabi of the HB model (eq. 2). Figure 5 (A) shows the substitution rate as a function of the sequence entropy obtained with the four combinations of models that we considered. The substitution rate shows an almost monotonic increase with the sequence entropy, consistent with empirical data and with the HB arguments. We then show in figure 5 (B) the substitution rate as a function of the number of contacts. In this case, as expected, the results differ qualitatively for two classes of models. Models that neglect stability against misfolding and predict a monotonic decrease of sequence entropy with the number of contacts also predict that the substitution rate decreases with the number of contacts. In contrast, models that consider stability against misfolding and present a maximum of sequence entropy versus the number of contacts also feature the same maximum in the curve of the substitution rate versus the number of contacts. Discussion The main conclusion of this study is that the stability-constrained fitness models that take into account stability against misfolding disagree qualitatively with empirical data obtained from MSA, because they predict that both site-specific sequence entropy and site-specific substitution rates present a maximum as a function of the number of contacts whereas in empirical data both quantities decrease monotonically with the number of contacts (Echave et al. 2016). In contrast, stability-constrained fitness models that do not consider stability against misfolding predict that both the sequence entropy and the substitution rate monotonically decrease with the number of contacts, as observed in empirical data (Echave et al. 2015). Nevertheless, several lines of evidence indicate that models that ignore stability against misfolding are less supported by empirical data. We briefly review this evidence below. First of all, from the physical point of view, contact interactions, which only consider interresidues interactions, do not consider the free energy of solvation, whose contribution would penalize the exposure of hydrophobic residues. Solvation of exposed residues does not contribute to stability against unfolding, as residues that are exposed in the native state are also exposed in the unfolded state, but it contributes to stability against misfolding, as these residues are buried in some misfolded conformations. In contact energy functions, solvation is implicitly considered when alternative conformations contribute to the free energy, and in this case hydrophobic residues are most stabilized when they are buried in the interior of the protein. Thus, contact-based models of stability cannot consider only the native structure. From an evolutionary point of view, the analysis of protein sequences show that the features that reduce the stability of the misfolded ensemble, the so-called negative design (Berezovsky et al. 2007), are significantly stronger in evolved sequences than in their reshuffled randomizations. These features include the destabilization of contacts between residues close in the protein sequence that are frequently formed in misfolded conformations (Noivirt-Brik et al. 2009; Bastolla et al. 2012) and the destabilization of pairs of correlated contacts that are frequently formed together in the misfolded ensemble (Minning et al. 2013). As the stability of the misfolded ensemble is predicted to increase with protein length and hydrophobicity, long and hydrophobic proteins are expected to face more misfolding problems and their sequences are expected to be more strongly selected for negative design. Consistently with this expectation, the more hydrophobic and the longer is the protein sequence, the more its negative design scores diverge from those of reshuffled sequences, indicating that the selective force favoring negative design increases with protein hydrophobicity and protein length (Minning et al. 2013) and supporting the idea that this is a real selective force and not an artefact. Furthermore, the amino acid compositions of evolved protein sequences are not random. If we consider three classes of residues, hydrophobic, polar, and neutral, we observe that sequences with a large fraction of hydrophobic residues have a significantly larger than expected fraction of polar residues (Minning et al. 2013). This property is expected under negative design, as repulsive interactions between polar residues compensate the attraction between hydrophobic residues, and it is stronger for longer proteins. Another evidence that supports negative design is the recent observation that point mutations that increase the surface hydrophobicity of symmetric complexes of Escherichia coli trigger the formation of large supramolecular assemblies, and that in wild-type sequences hot-spots of supramolecular assembly are buffered by hydrophilic residues, suggesting that negative design is preventing the misassembly of these regions (Garcia-Seisdedos et al. 2017). A second line of evidence comes from the study of the MF model of protein evolution with selection on the stability of the native state (Arenas et al. 2015). The variant of the model that neglects stability against misfolding has several unsatisfactory properties: 1) The average hydrophobicity of the sequences that it generates is much larger than the average hydrophobicity of natural sequences, 2) the likelihood of wild-type sequence in the PDB with respect to the model is low, and 3) the sequences generated by the model are on the average unstable against misfolding. All these aspects are significantly improved when stability against misfolding is considered in the model. In particular, the average hydrophobicity of the model becomes equal to that of the wild-type sequence, the likelihood of wild-type sequences becomes much larger, and the average stability against misfolding becomes positive. Globally, these observations strongly support the need to incorporate stability against misfolding in models of protein evolution. In this article, we have shown that the model that considers misfolding and the model that neglects it predict rather different hydrophobicity profiles. If misfolding is not considered, sites with few contacts are predicted to be more hydrophobic than they actually are and at the same time sites with many contacts are predicted to be less hydrophobic (see fig. 2 right). Thus, releasing the selective constraints imposed by negative design results in a site-specific hydrophobicity that is less dependent on the protein structure and is closer to the one induced by the mutation process. Consequently, the predicted sequence entropy is much larger if misfolding is ignored (fig. 2 left), except for the sites with an intermediate number of contacts that are minimally constrained by natural selection in both models. These results hold both for the MF model and for an independent sites model that considers pairwise interactions explicitly (WT model). Recently, Goldstein and Pollock (2017) computed site-specific substitution rates in a model with epistatic interactions (sites are not considered independent). Consistent with our results, they found that sites with many contacts tend to evolve more slowly, but unfortunately they did not present the comparison between exposed and amphiphilic sites. The comparison with evolved MSA (fig. 3 A) shows that stability-constrained models overestimate the site-specific sequence entropy, but the overestimation is much larger for the model that neglects misfolding. However, when MSA sites with the same number of contacts are pooled together to get contact-specific distributions of amino acids that respond to similar stability constraints and average out structural constraints, the resulting sequence entropy is very similar to what is predicted by the model with selection on misfolding, both qualitatively (there is a maximum between six and seven contacts, in agreement with the misfolding model, which predicts a maximum between five and six contacts, whereas the model that neglects misfolding predicts a monotonic trend with maximum at 0 contacts) and quantitatively (the entropy of the model overestimates by approximately 0.2 units the pooled entropy of both exposed sites with nc = 0 and buried sites with nc = 11, above which the statistical error in the MSA becomes very large, whereas the discrepancy is 0.9 and 1.4 with respect to site-specific entropies). Thus, we conclude that our model with selection on misfolding stability captures reasonably well the site-specific selective constraints on amino acid hydrophobicities although other aspects such as amino acid size are less well represented, whereas the model that neglects misfolding presents large qualitative and quantitative discrepancies with observed data. The main result of this work is that stability-constrained models of protein evolution that consider misfolding do not agree with the empirical observation that the sequence entropy and the substitution rate decrease monotonically with the number of native contacts. The previous finding that stability-constrained models predict a monotonic decrease of the sequence entropy and the substitution rate (Echave et al. 2015) was likely a consequence of neglecting misfolding stability, which has the consequence of increasing the hydrophobicity of sites with few contacts with respect to evolved sequences, releasing the selective constraint that these sites cannot be too hydrophobic, and overestimating their sequence entropy. Although the empirical site-specific entropies are likely underestimated, as the sampled sequences are evolutionarily related and it is possible that they did not explore sequence space sufficiently, the fact that the entropy decreases with the number of contacts is quite robust, as we demonstrated eliminating the most distance sequences from each alignment. The qualitative disagreement that we observed between the stability-constrained model and the empirical observations may be attributed to the limitations of the stability prediction. We note that contact energy functions are only appropriate when proteins are in a native-like geometry, as they only consider soft interresidue interactions and neglect all the hard interactions that define the protein geometry, such as steric repulsion, hydrogen bonds, and atomic packing. These interactions are properly satisfied in the native state of wild-type proteins but they cannot be taken for granted in mutants, which can present atomic clashes or cavities that would be not adequately penalized by the contact free energy function. Unfortunately, the strategy to include these hard terms in the free energy function and run explicit simulations is computationally not affordable, which explains why most stability-constrained models of protein evolution adopt contact energy functions. In any case, the energy function that we adopted was tested against 195 experimental measures of ΔΔG of proteins that fold with two-state thermodynamics, mostly involving hydrophobic residues, yielding a squared correlation coefficient between measured and predicted values r2=0.52 (fig. 1 in Bastolla 2014), that is, contact interactions account for half of the experimental effect, which is comparable to other more complex free energy functions. Finally, provided that stability against misfolding is considered, the qualitative predictions of the stability-constrained model are robust if we consider the contribution to protein stability due to the propensity of amino acids for secondary structures. Modelling the influence of the protein sequence on the stability of the unfolded state would result in a mathematical model similar to the model with secondary structure propensity, as the configurational entropy of the side chains can be modelled as ∑iϵentr(Ai), with parameters ϵentr(a) correlated with the propensity of the amino acid a in the loop secondary structure. Thus, the robustness of the stability-constrained model under addition of secondary structure propensities suggests that the qualitative shape of the sequence entropy curve would not change if we consider the influence of the protein sequence on the stability of the unfolded state. These arguments support the view that the limitations of the stability model, that we openly recognize, are not responsible of the qualitative disagreement between the predicted and observed site-specific variability. In our opinion, the responsible of the disagreement is another characteristic of the model that is clearly unrealistic: The assumption that mutations do not modify the protein structure. Indeed, mutations can create native structures that are thermodynamically stable but are different from the wild type. These mutations would be accepted by stability-constrained protein evolution models, but probably they would be discarded by purifying selection. It is common evolutionary wisdom that protein structures are more conserved than protein sequences (Illegard et al. 2009), which means that, for small evolutionary divergence, the fractional change of protein structures is smaller than the corresponding fractional change of protein sequences (Pascual-Garcia et al. 2010). This structural conservation is likely attributable in great part to function conservation, as the pairs of homologous proteins that maintain the same function, as indicated by Gene Ontology terms, evolve up to a limit value of structural divergence, whereas the pairs that change function can diverge much more in structure (Pascual-Garcia et al. 2010). These results, obtained by measuring the structural divergence as the divergence of contact matrices, were recently generalized using the divergence of the TM score (Zhang and Skolnick, 2004). In this recent work (Pascual-Garcia A, Arenas M, and Bastolla M, submitted), we found that the violations of the molecular clock are stronger and more systematic in structure evolution than sequence evolution, in particular when protein function changes. We argued that these violations of the molecular clock are indicative of strong positive selection acting on protein structures, whereas strong purifying selection acts when the protein function is conserved. The idea that protein structures are under strong purifying selection is consistent with the observation that the dynamics of the protein in its native state can be predicted solely from structure through structure-based models such as the ENM (Tirion 1996), and that the predicted collective dynamics reproduces functional conformation changes and allosteric communication. This view of the functional importance of the native structure is also supported by a recent work that found that functional sites induce long-range evolutionary constraints in enzymes (Jack et al. 2016), underlying the evolutionary importance not only of the active site but also of sites that are close to it in the native state. Sites with an intermediate number of contacts are minimally constrained by stability-constrained models. Mutations there have hardly a significant effect on stability, and an even smaller one on fitness. In fact, the fitness function f=1/(1+exp ⁡(ΔG/T)) is a sigmoidal function of stability −ΔG, thus when the wild-type sequence is very stable ( ΔG≪−1) changes of stability ΔΔG have a reduced effect on the fitness and they are effectively neutral (Serohijos and Shakhnovich 2014). According to the stability-constrained models, these amphiphilic sites hardly show any preference for particular amino acids, and their sequence entropy is almost the same as for the background distribution. However, empirical data show that these sites are more conserved than completely exposed sites. A possible explanation is that the mutations at these sites do not modify the stability but they modify the native structure and they may have a deep influence on the protein function. This explanation agrees with the structure-constrained models of protein evolution (Echave 2008), which predict that both the entropy and the substitution rate decrease with the number of contacts (Huang et al. 2014), as mutations at amphiphilic sites perturb the protein structure more than mutations at exposed sites, and consequently they are more often rejected. Thus, empirical data support the structure-constrained models of protein evolution more than the stability-constrained models, at least as far as amphiphilic sites are concerned. Of course, mutations modify both the structure of the native state and its stability ΔG. However, current models of protein evolution cannot predict both changes and they focus on only one aspect, considering one of two extreme types of mutations: Those that change the stability of the native state but have little effect on its structure and those that change the structure of the native state but have a weak effect on its stability. It would be rather useful to integrate both types of models into one that predicts the effect of mutations both on the structure and on the stability of the native state. Despite such a model does not exist yet, in particular due to the fact that the ENM used to predict the structural effect of a mutation does not differentiate between different sequences, this will be a very interesting direction of the future research. Materials and Methods Modelling Stability against Unfolding and Misfolding We estimate the stability of the experimentally known native state of a protein adopting the contact matrix representation of its structure. For each pair of residues at positions i and j along the polypeptidic chain, Cij equals one if the residues are in contact and zero otherwise. We define two residues to be in contact if any pair of their heavy atoms, either in the side chain or in the main chain, is closer than 4.5 Å. As contacts with |i−j|≤2 are formed in almost all structures, they do not contribute to the free energy difference between the native and the misfolded ensemble, and we set Cij = 0 if |i−j|≤2. The free energy of a protein in the mesoscopic structure described by Cij is modelled as a sum of contact interactions, E(C,A)=∑i<jCijU(Ai,Aj), which depends on the type of amino acids in contact Ai and Aj and on 210 contact interaction parameters U(a, b), for which we adopt the parameters determined in Bastolla et al. (2000). For simplicity, we neglect the conformational entropy of the folded native state and estimate its free energy as Gnat(Cnat,A)≈∑i<jCijnatU(Ai,Aj). Regarding the unfolded state, we neglect their contact interactions and estimate its free energy as GU≈−TLSU, where T is the temperature in units in which kB = 1, L is chain length and SU is the conformational entropy per residue of an unfolded chain. We compute the free energy of the misfolded state from the partition function of the contact energy E(C, A) over a set of compact contact matrices C of L residues that are obtained from the PDB. In agreement with previous studies (Garel and Orland 1988; Shakhnovich and Gutin 1989), the resulting free energy is approximately described by the random energy model (REM) (Derrida 1981), but with the addition of the third moment of the contact energy (Minning et al. 2013):   Gmisf≡−T log ⁡(∑Ce−∑i<jCijU(Ai,Aj)/T+S(C))≈⟨E⟩−⟨(E−⟨E⟩)2⟩2T+⟨(E−⟨E⟩)3⟩6T2−LSCT , (5) where LSC is the logarithm of the number of compact contact matrices, ⟨.⟩ represents the average over the set of alternative compact contact matrices of L residues. We assume for simplicity that the conformational entropy, S(Cij), is approximately the same for all compact structures including the native one, and it can be neglected for computing free energy differences. The above estimate only holds above the freezing temperature of the REM (Derrida 1981), whereas the free energy is kept constant below the freezing temperature. Note that, although the free energy is defined in terms of Boltzmann averages, the result of the calculation depends only on the first three moments of the contact energies. These moments are computed from the corresponding moments of the contacts. For instance, the first moment is ⟨E⟩=∑i<j⟨Cij⟩U(Ai,Aj). The moments of the contacts such as ⟨Cij⟩ are computed preliminarly in order to accelerate the computation. This is crucial, as the program must iteratively compute averages over the site-specific distributions and it is very important that the coefficients are precalculated. For this reason, we do not use specific misfolded states. The moments of the contacts are computed over the set of compact conformations of a chain of L residues that are obtained by threading the chain over a nonredundant representation of the PDB. We consider only compact submatrices, discarding those whose number of contacts is smaller than a threshold that depends on L. To reduce the number of parameters that have to be estimated, we perform some approximations (Minning et al. 2013). We assume that ⟨Cij⟩ depends only on the contact range |i−j| and on chain length L, ⟨Cij⟩=a(L)f(|i−j|) and determine the factor a(L) so that the average number of contacts for structures of length L is the same as for native structures. For the contact correlation term, (⟨CijCkl⟩−⟨Cij⟩⟨Ckl⟩), we consider three situations: 1) ij = kl, in which case we adopt a model similar to the above that depends only on L and |i−j|; 2) a pair of residue is equal, say i = l, and we average over j and k both the contacts and the energies in order to obtain a quantity that only depends on i; and 3) all four residues are different, in which case we approximate the contact correlation with the average contact correlation obtained summing over ij and kl, which sums to ⟨Nc2⟩−⟨Nc⟩2, and removing the contribution of groups of three residues. Details on the computations can be found in Arenas et al. (2015). Putting together these free energy estimates, we obtain the free energy difference between the native and the nonnative states as   ΔG(Cnat,A)=Gnat−kT(e−Gmisf/kT+e−GU/kT) , (6) where the free energy of the nonnative state is computed as a Boltzmann average, which is essentially equal to Gmisf when the sequence is hydrophobic ( Gmisf−GU/kT≪−kT) and is essentially equal to GU when the sequence is hydrophilic ( Gmisf−GU/kT≫kT). When we neglect stability against misfolding, we estimate ΔG=Gnat(Cnat,A)+LSU as if all sequences were hydrophilic. Stability-Constrained Fitness Model Stability-constrained models of protein evolution assume that the fitness is proportional to the fraction of protein that is in the native state, which can be computed from the folding free energy as (Goldstein 2011; Serohijos and Shakhnovich 2014)   f=e−ΔG/kT/(1+e−ΔG/kT) . (7) The computation is performed assuming that the native contact matrix Cnat does not change in evolution, whereas the sequence changes one amino acid at a time, producing the free energy ΔGmut=ΔGwt+ΔΔG. In contrast, structure-constrained models of protein evolution adopt the ENM representation of the native state. The fraction of folded protein is assumed to be one by convention irrespective of the protein sequence, but the native structure may change upon mutation (Echave 2008). MF Model of Protein Evolution The MF model assumes that the amino acid distribution is the product of independent distributions at each protein site,   P(A1,⋯AL)=∏i=1LPi(Ai) . (8) The mean-field distribution is determined by minimizing its Kullback–Leibler divergence (distance between distributions) with respect to a global mutational distribution Pmut(a) (where a denotes any of the 20 amino acids), ∑iaPi(a) log ⁡(Pi(a)/Pmut(a)) with the normalization constraints ∑aPi(a)=1 and the constraint on the average fitness, which is transformed into a constraint on the folding free energy ΔG that is imposed through the Lagrange multiplier Λ that represents the strength of selection. As the contact energy parameters U(a,b),T,SU,SC are fixed, the only free parameters of the model are Λ and Pmut(a). These parameters are determined maximizing the log-likelihood of the PDB sequence, ∑i log ⁡(Pi(AiPDB)). The precomputation of the moments of the contacts makes the computation very fast, it runs in a few minutes even for proteins of several hundreds of amino acids. For further computational details see Arenas et al. (2015). WT Model of Protein Evolution In the WT model, the frequency of amino acid a at position i is assumed to be proportional to the background distribution Pmut(a) multiplied by the exponential of the fitness of the corresponding mutation in which the wild-type amino acid in the PDB AiWT is substituted by the new amino acid a:   PWT,i(a)∝Pmut(a) exp ⁡(Λf(mut(AiWT→a))) , (9) where the fitness of a sequence is computed as in equation (7) and the parameter Λ is determined in such a way to maximize the likelihood of the wild-type sequence, ∑i log ⁡(PWT,i(AiWT)). Sequence Entropy The sequence entropy at position i measures the variability of this position as   Si=−∑a=120Pi(a) log ⁡(Pi(a)) , (10) where Pi(a) is obtained either from the evolutionary model (MF or WT) or from an MSA or from pooled amino acids at equivalent structural positions with the same number of contacts. Evolutionary Rates The site-specific substitution rates of the evolutionary models are computed as the weighted average of the substitution rate matrix Qab=EabiPi(b),   Ri=∑a≠bPi(a)Pi(b)Eabi. (11) Eabi is the exchangeability matrix proposed by Halpern and Bruno (1998), equation (2), in which Pi(a) are the site-specific amino acid frequencies of the evolutionary model and Pmut(a) is the background distribution. The global exchangeability matrix Eabmut is computed imposing that the average flux between any pair of amino acids a and b is equal to their flux measured in an empirical substitution model, either the one by Whelan and Goldman (2001) or the one by Jones et al. (1992), that is, we compute Eabmut from the following equation (Arenas et al. 2015):   PaempPbempEabemp=(PamutPbmutEabmut)1L∑iFasel,iFbsel,iln⁡(Fbsel,i)−ln⁡(Fasel,i)Fbsel,i−Fasel,i . (12) Substitution Rates from MSAs The empirical substitution rates of 213 proteins were estimated in Yeh et al. (2014) from the MSA of homologous sequences through the program Rate4Site (Pupko et al. 2002), which builds the phylogenetic tree using a neighbor-joining algorithm (Saitou and Nei 1987) and estimates rates with an empirical Bayesian approach adopting the JTT model of sequence evolution (Jones et al. 1992). Supplementary Material Supplementary data are available at Molecular Biology and Evolution online. Acknowledgments We acknowledge interesting discussions with Julián Echave, Nick Goldman, Alberto Pascual-García, and Grzegorz Slodkowicz and thank two anonymous reviewers for their useful comments and suggestions. This work was supported by the Spanish Ministery of Economy through grants BIO2016-79043 and BFU2012-40020. M.A. was supported by the Ramón y Cajal Grant RYC-2015-18241 from the Spanish Government. Research at CBMSO is facilitated by Fundación Ramón Areces. References Arenas M, Sanchez-Cobos A, Bastolla U. 2015. Maximum likelihood phylogenetic inference with selection on protein folding stability. Mol Biol Evol . 32( 8): 2195– 2207. http://dx.doi.org/10.1093/molbev/msv085 Google Scholar CrossRef Search ADS PubMed  Arenas M, Weber CC, Liberles DA, Bastolla U. 2017. ProtASR: an evolutionary framework for ancestral protein reconstruction with selection on folding stability. Syst Biol . 66( 6): 1054– 1064. Google Scholar PubMed  Bastolla U. 2014. Detecting selection on protein stability through statistical mechanical models of folding and evolution. Biomolecules  4( 1): 291–231. http://dx.doi.org/10.3390/biom4010291 Bastolla U, Bruscolini P, Velasco JL. 2012. Sequence determinants of protein folding rates: positive correlation between contact energy and contact range indicates selection for fast folding. Proteins  80( 9): 2287– 2304. http://dx.doi.org/10.1002/prot.24118 Google Scholar CrossRef Search ADS PubMed  Bastolla U, Dehouck Y, Echave J. 2017. What evolution tells us about protein physics, and protein physics tells us about evolution. Curr Opin Struct Biol . 42: 59– 66. http://dx.doi.org/10.1016/j.sbi.2016.10.020 Google Scholar CrossRef Search ADS PubMed  Bastolla U, Porto M, Ortiz AR. 2008. Local interactions in protein folding determined through an inverse folding model. Proteins  71( 1): 278– 299. http://dx.doi.org/10.1002/prot.21730 Google Scholar CrossRef Search ADS PubMed  Bastolla U, Porto M, Roman HE, Vendruscolo M. 2005. Principal eigenvector of contact matrices and hydrophobicity profiles in proteins. Proteins  58( 1): 22– 30. Google Scholar CrossRef Search ADS PubMed  Bastolla U, Porto M, Roman HE, Vendruscolo M. 2006. A protein evolution model with independent sites that reproduces site-specific amino acid distributions from the Protein Data Bank. BMC Evol Biol . 6: 43. Google Scholar CrossRef Search ADS PubMed  Bastolla U, Vendruscolo M, Knapp EW. 2000. A statistical mechanical method to optimize energy functions for protein folding. Proc Natl Acad Sci U S A.  97( 8): 3977– 3981. http://dx.doi.org/10.1073/pnas.97.8.3977 Google Scholar CrossRef Search ADS PubMed  Berezovsky IN, Zeldovich KB, Shakhnovich EI. 2007. Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput Biol.  3( 3): e52. Google Scholar CrossRef Search ADS PubMed  Derrida B. 1981. Random Energy Model: an exactly solvable model of disordered systems. Phys Rev B.  24( 5): 2613. http://dx.doi.org/10.1103/PhysRevB.24.2613 Google Scholar CrossRef Search ADS   Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett.  457( 4–6): 413– 416. Google Scholar CrossRef Search ADS   Echave J, Jackson EL, Wilke CO. 2015. Relationship between protein thermodynamic constraints and variation of evolutionary rates among sites. Phys Biol . 12( 2): 025002. http://dx.doi.org/10.1088/1478-3975/12/2/025002 Google Scholar CrossRef Search ADS PubMed  Echave J, Spielman SJ, Wilke CO. 2016. Causes of evolutionary rate variation among protein sites. Nat Rev Genet . 17( 2): 109– 121. http://dx.doi.org/10.1038/nrg.2015.18 Google Scholar CrossRef Search ADS PubMed  Franzosa EA, Xia Y. 2009. Structural determinants of protein evolution are context-sensitive at the residue level. Mol Biol Evol . 26( 10): 2387– 2395. http://dx.doi.org/10.1093/molbev/msp146 Google Scholar CrossRef Search ADS PubMed  Garcia-Seisdedos H, Empereur-Mot C, Elad N, Levy ED. 2017. Proteins evolve on the edge of supramolecular self-assembly. Nature  548( 7666): 244– 247. Google Scholar PubMed  Garel T, Orland H. 1988. Mean-field model for protein folding. Europhys Lett . 6( 4): 307– 310. http://dx.doi.org/10.1209/0295-5075/6/4/005 Google Scholar CrossRef Search ADS   Goldstein RA. 2011. The evolution and evolutionary consequences of marginal thermostability in proteins. Proteins  79( 5): 1396– 1407. http://dx.doi.org/10.1002/prot.22964 Google Scholar CrossRef Search ADS PubMed  Goldstein RA, Pollock DD. 2017. Sequence entropy of folding and the absolute rate of amino acid substitutions. Nat Ecol Evol.  1( 12): 1923– 1930. http://dx.doi.org/10.1038/s41559-017-0338-9 Google Scholar CrossRef Search ADS PubMed  Halpern A, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol . 15( 7): 910– 917. http://dx.doi.org/10.1093/oxfordjournals.molbev.a025995 Google Scholar CrossRef Search ADS PubMed  Huang TT, del Valle Marcos ML, Hwang JK, Echave J. 2014. A mechanistic stress model of protein evolution accounts for site-specific evolutionary rates and their relationship with packing density and flexibility. BMC Evol Biol . 14: 78. Google Scholar CrossRef Search ADS PubMed  Illergard K, Ardell DH, Elofsson A. 2009. Structure is three to ten times more conserved than sequence–a study of structural response in protein cores. Proteins  77( 3): 499– 508. http://dx.doi.org/10.1002/prot.22458 Google Scholar CrossRef Search ADS PubMed  Jack BR, Meyer AG, Echave J, Wilke CO. 2016. Functional sites induce long-range evolutionary constraints in enzymes. PLoS Biol . 14( 5): e1002452. Google Scholar CrossRef Search ADS PubMed  Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci.  8( 3): 275– 282. Google Scholar PubMed  Kimura M. 1962. On the probability of fixation of mutant genes in a population. Genetics  4: 713– 719. Minning J, Porto M, Bastolla U. 2013. Detecting selection for negative design in proteins through an improved model of the misfolded state. Proteins  81( 7): 1102– 1112. http://dx.doi.org/10.1002/prot.24244 Google Scholar CrossRef Search ADS PubMed  Mustonen V, Lässig M. 2005. Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc Natl Acad Sci U S A.  102( 44): 15936– 15941. Google Scholar CrossRef Search ADS PubMed  Nido GS, Bachschmid-Romano L, Bastolla U, Pascual-García A. 2016. Learning structural bioinformatics and evolution with a snake puzzle. PeerJ Comput Sci.  2: e100. Google Scholar CrossRef Search ADS   Noivirt-Brik O, Horovitz A, Unger R. 2009. Trade-off between positive and negative design of protein stability: from lattice models to real proteins. PLoS Comput Biol.  5( 12): e1000592. Google Scholar CrossRef Search ADS PubMed  Pascual-Garcia A, Abia D, Mendez R, Nido GS, Bastolla U. 2010. Quantifying the evolutionary divergence of protein structures: the role of function change and function conservation. Proteins  78( 1): 181– 196. Google Scholar CrossRef Search ADS PubMed  Porto M, Roman HE, Vendruscolo M, Bastolla U. 2005. Prediction of site-specific amino acid distributions and limits of divergent evolutionary changes in protein sequences. Mol Biol Evol.  22( 3): 630– 638. http://dx.doi.org/10.1093/molbev/msi048 Google Scholar CrossRef Search ADS PubMed  Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N. 2002. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics  18( Suppl 1): S71– S77. Google Scholar CrossRef Search ADS PubMed  Ramsey DC, Scherrer MP, Zhou T, Wilke CO. 2011. The relationship between relative solvent accessibility and evolutionary rate in protein evolution. Genetics  188( 2): 479– 488. http://dx.doi.org/10.1534/genetics.111.128025 Google Scholar CrossRef Search ADS PubMed  Saitou N, Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol.  4( 4): 406– 425. Google Scholar PubMed  Scherrer MP, Meyer AG, Wilke CO. 2012. Modeling coding-sequence evolution within the context of residue solvent accessibility. BMC Evol Biol . 12: 179. Google Scholar CrossRef Search ADS PubMed  Sella G, Hirsh AE. 2005. The application of statistical physics to evolutionary biology. Proc Natl Acad Sci U S A.  102( 27): 9541– 9546. Google Scholar CrossRef Search ADS PubMed  Serohijos AW, Shakhnovich EI. 2014. Merging molecular mechanism and evolution: theory and computation at the interface of biophysics and evolutionary population genetics. Curr Opin Struct Biol.  26: 84– 91. http://dx.doi.org/10.1016/j.sbi.2014.05.005 Google Scholar CrossRef Search ADS PubMed  Shakhnovich EI, Gutin AM. 1989. Formation of unique structure in polypeptide chains. Biophys Chem . 34( 3): 187. http://dx.doi.org/10.1016/0301-4622(89)80058-4 Google Scholar CrossRef Search ADS PubMed  Tirion MM. 1996. Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys Rev Lett . 77( 9): 1905– 1908. http://dx.doi.org/10.1103/PhysRevLett.77.1905 Google Scholar CrossRef Search ADS PubMed  Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol.  18( 5): 691– 699. http://dx.doi.org/10.1093/oxfordjournals.molbev.a003851 Google Scholar CrossRef Search ADS PubMed  Yeh SW, Liu JW, Yu SH, Shih CH, Hwang JK, Echave J. 2014. Site-specific structural constraints on protein sequence evolutionary divergence: local packing density versus solvent exposure. Mol Biol Evol . 31( 1): 135– 139. http://dx.doi.org/10.1093/molbev/mst178 Google Scholar CrossRef Search ADS PubMed  Zhang Y, Skolnick J. 2004. Scoring function for automated assessment of protein structure template quality. Proteins  57( 4): 702– 710. http://dx.doi.org/10.1002/prot.20264 Google Scholar CrossRef Search ADS PubMed  © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Molecular Biology and Evolution Oxford University Press

Substitution Rates Predicted by Stability-Constrained Models of Protein Evolution Are Not Consistent with Empirical Data

Loading next page...
 
/lp/ou_press/substitution-rates-predicted-by-stability-constrained-models-of-TcKvzEbIZk
Publisher
Oxford University Press
Copyright
© The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ISSN
0737-4038
eISSN
1537-1719
D.O.I.
10.1093/molbev/msx327
Publisher site
See Article on Publisher Site

Abstract

Abstract Protein structures strongly influence molecular evolution. In particular, the evolutionary rate of a protein site depends on the number of its native contacts. Stability-constrained models of protein evolution consider this influence of protein structure on evolution by predicting the effect of mutations on the stability of the native state, but they currently neglect how mutations affect the protein structure. These models predict that buried protein sites with more native contacts are more constrained by natural selection and less variable, as observed. Nevertheless, previous work did not consider the stability against compact misfolded conformations, although it is known that the negative design that destabilizes these misfolded conformations influences protein evolution significantly. Here, we show that stability-constrained models that consider misfolding predict that site-specific sequence entropy and substitution rate peak at amphiphilic sites with an intermediate number of contacts, as these sites are less constrained than exposed sites with few contacts whose hydrophobicity must be limited. This result holds both for a mean-field model with independent sites and for a pairwise model that takes as a reference the wild-type sequence, but it contrasts with the observations that indicate that the entropy and the substitution rate decrease monotonically with the number of contacts. Our work suggests that stability-constrained models overestimate the tolerance of amphiphilic sites against mutations, either because of the limits of the free energy function or, more importantly in our opinion, because they do not consider how mutations perturb the native protein structure. stability-constrained protein evolution, structure-constrained protein evolution, substitution rates, site-specific sequence entropy Introduction One of the main determinants of the variation of the substitution rates across the sites of a protein is the number of native contacts. Buried sites tend to evolve more slowly than exposed sites (Franzosa and Xia 2009). Later studies found that the number of native contacts is a better predictor of the evolutionary rate than the solvent accessibility (Yeh et al. 2014). This tendency was interpreted as an indication that natural selection imposes stronger constraints on sites with many native contacts (Echave et al. 2016). To explain the observed negative correlation between substitution rates and number of contacts, two kinds of models of the fitness effect of a mutation have been adopted (Echave et al. 2015). The first kind of model, which we call here stability-constrained fitness model, computes the fitness from the predicted effect of the mutation on the folding free energy ΔG that determines the stability of the folded native state (see Goldstein 2011; Serohijos and Shakhnovich 2014; Bastolla et al. 2017). Typically, the model adopts a mesoscopic representation of the protein structure such as the contact matrix and it estimates the change of free energy of the native contact matrix due to the simulated mutation, assuming that the native structure does not change. This type of model is sometimes referred in the literature as structurally constrained protein evolution model, but we prefer to call it stability-constrained model because it does not consider that mutations can modify the protein structure. We reserve the name “structurally constrained” to the second type of model of protein evolution, which estimates how mutations affect the structure of the native state and computes the fitness from this predicted structural change. In the original version proposed by Echave (2008), the mutation was modelled as a perturbation applied to the wild-type structure and the mutated structure was estimated through linear response theory adopting the elastic network model (ENM; Tirion 1996). As the ENM does not represent the protein sequence, in this framework it is not possible to compute the change of folding stability, thus the fitness is assumed to depend only on the predicted structural change, not on the stability change. A modified version of this model was described in terms of the stability of the active conformation (Huang et al. 2014), but in fact the change of folding stability ΔΔG cannot be computed in the ENM approximation and the fitness only depends on the structural change, which is the reason why we prefer the name “structure-constrained fitness model.” Of course mutations modify both the stability and the precise structure of the native state, but current models of fitness cannot compute both effects. Both the structure-constrained fitness model and the stability-constrained fitness model qualitatively reproduce the observed correlation between number of contacts and substitution rate (Scherrer et al. 2012; Huang et al. 2014; Echave et al. 2015). Nevertheless, the stability-constrained model that was adopted in Echave et al. (2015) has two important limitations. First of all, it is a neutral model in which the fitness is zero if the stability is below a threshold and one otherwise, which does not allow to simulate the evolutionary effect of small increases or decreases in stability. Second and more important, the model only considers the stability against unfolding (i.e., the difference between the free energy of the native state and the stability of the unfolded state, which is often considered independent of the protein sequence), without considering the effect of the mutation on the free energy of the misfolded ensemble. In contrast, several empirical observations support the view that the stability of the misfolded state is subject to selection through negative design that tends to destabilize misfolded conformations (Berezovsky et al. 2007; Noivirt-Brik et al. 2009; Minning et al. 2013). Here, we revisit the site-dependence of the substitution rates and its relationship with protein stability against both unfolding and misfolding adopting a recently proposed mean-field (MF) model of protein evolution (Arenas et al. 2015). In line with a previous model (Bastolla et al. 2006), the MF model assumes that all sites of the protein evolve independently in a site-specific manner, and it determines their site-specific properties by imposing a global constraint on the thermodynamic stability of the known native state against both unfolding and misfolding. We have shown that the MF model significantly improves the likelihood of inferred evolutionary events with respect to empirical models that do not take into account the structural properties of each site (Arenas et al. 2015). Furthermore, it improves the reconstruction of the stability properties of ancestral sequences (Arenas et al. 2017). Nevertheless, we shall show here that the site-specific entropy and substitution rates predicted through this model are in qualitative disagreement with observed data for sites with an intermediate number of contacts, neither exposed nor buried. In order to investigate how the assumption of site-independence and the misfolding stability influence the substitution rates, we also considered alternative models. In addition to the independent-site MF model in which the effects of a mutation on stability are estimated through the mean-field approximation, we considered a model in which the effect of all point mutations that start from the wild-type sequence is explicitly computed considering all pairwise interactions in the native state and the misfolded ensemble. We call this model the wild-type (WT) model. The WT model is similar to the computation performed by Echave et al. (2015) to model stability effects on fitness, although a different free energy function is used. Second, we considered two different stability-constrained models: 1) The complete stability model that considers the stability of the nonnative state against both unfolding and misfolding, and 2) The native stability model that computes the folding free energy ΔG as the difference between the free energy of the native state and the one of the unfolded state, neglecting the contribution of the misfolded ensemble. In both cases, the stability of the unfolded state is assumed to be independent of the protein sequence, but we do not expect any qualitative change if this assumption is relaxed (see Discussion). Results We measured the site-specific properties predicted by the different versions of the stability-constrained model using a test set of 213 monomeric proteins with structures determined through X-ray crystallography. These proteins have been used in a previous study (Yeh et al. 2014), and for each of them a multiple sequence alignment (MSA) had been obtained, which allowed us to determine the site-specific sequence entropy and the site-specific substitution rate, the latter estimated through the program rate4site (Pupko et al. 2002). Sequence Entropy and Hydrophobicity Predicted from Stability Constraints The site-specific sequence entropy Si=−∑a=120Pi(a) log⁡(Pi(a)) is inversely related to the selective constraints acting on the site. Consistent with a previous model based on similar ideas (Bastolla et al. 2006), the MF model predicts that the site-specific sequence entropy is strongly dependent on the number of native contacts of the site and it has a maximum for amphiphilic sites with intermediate number of contacts and mixed exposed-buried character. Figure 1 shows the sequence entropy versus the number of native contacts for sites in a single protein. Each point represents a site, and a quadratic fit is shown for clarity. Fig. 1. View largeDownload slide Site-specific sequence entropy as a function of the number of contacts for the MF model with independent sites (left) and the WT model with pairwise interactions (right). Each point represents one site of the protein with PDB code 2thi. Quadratic fits are shown as a guide to the eye. Fig. 1. View largeDownload slide Site-specific sequence entropy as a function of the number of contacts for the MF model with independent sites (left) and the WT model with pairwise interactions (right). Each point represents one site of the protein with PDB code 2thi. Quadratic fits are shown as a guide to the eye. The MF model can easily explain this behavior of the sequence entropy. We can approximate the contact interaction energy adopted in the MF model through the hydrophobic approximation, U(a,b)≈−ϵh(a)h(b), where U(a, b) is the effective interaction energy gained when amino acids a and b form a contact, ϵ>0 is a parameter, and h(a) is the main eigenvector of the matrix U(a, b), which is related with the hydrophobicity of the residue a, and which we shall call hydrophobicity in the following. Thus, we expect that the selection for folding stability mainly affects the average hydrophobicity of each site in such a way that sites with more native contacts tend to have larger hydrophobicity averaged over the mean-field distribution (Porto et al. 2005). In other words, buried sites with many contacts tend to be more hydrophobic, whereas exposed sites with few contacts tend to be more polar, consistent with qualitative ideas on protein folding. Sites whose average hydrophobicity is equal to the average hydrophobicity of an equiprobable amino acid distribution are minimally constrained, and they have maximal entropy. Thus, according to the MF model, selective constraints on folding stability are weakest at amphiphilic sites with intermediate exposure, not at exposed sites. It can be hypothesized that this result stems from the assumption of independent sites used to build the MF model, whereas under a pairwise model the sites that form more interactions are subjected to stronger constraints. To address this important concern, we adopted a simple model in which the fitness effect of all point mutations from the wild-type sequence is computed with the full pairwise stability model. In this model, which we call WT, the frequency of amino acid a at position i is assumed to be proportional to the background distribution multiplied times the exponential of the fitness of the corresponding mutation in which the wild-type amino acid AiWT is substituted by the new amino acid a, PWT,i(a)∝Pmut(a) exp ⁡(Λf(mut(AiWT→a))), where the fitness of a sequence is computed as in equation (7) and the selection parameter Λ is determined in such a way to maximize the likelihood of the wild-type sequence, ∑i log ⁡(PWT,i(AiWT)). We note that this model is only valid one substitution away from the wild-type sequence and its use is questionable for longer substitution processes, but it allows us assessing the effect of pairwise versus mean-field computations of stability effects on fitness. The results obtained with this WT model are shown in figure 2 (left) together with the results of the MF model. As expected, the entropy of sites with many contacts is considerably reduced with respect to the mean-field value, indicating that the pairwise interactions increase the selective constraints for buried sites. Nevertheless, we still observe a maximum in the entropy curve, that is, amphiphilic sites are less constrained than fully exposed sites. Fig. 2. View largeDownload slide Left: Site-specific sequence entropy as a function of the number of contacts for the MF model with independent sites and for the WT model that takes into account pairwise interactions between sites, both considering stability against misfolding and nonconsidering it. Each point represents the average over 213 proteins in the data set. Right: Average hydrophobicity as a function of the number of contacts for the MF model with independent sites and the WT model with pairwise interactions, either considering or neglecting stability against misfolding. The thick line represents the average entropy of the residues of the PDB sequence. Fig. 2. View largeDownload slide Left: Site-specific sequence entropy as a function of the number of contacts for the MF model with independent sites and for the WT model that takes into account pairwise interactions between sites, both considering stability against misfolding and nonconsidering it. Each point represents the average over 213 proteins in the data set. Right: Average hydrophobicity as a function of the number of contacts for the MF model with independent sites and the WT model with pairwise interactions, either considering or neglecting stability against misfolding. The thick line represents the average entropy of the residues of the PDB sequence. Our WT model is similar to the model adopted by Echave, Jackson, and Wilke (2005, EJW). Nevertheless, the EJW model predicts that the sequence entropy decreases monotonically with the number of contacts. We reason that this discrepancy may originate from the fact that the EJW model does not consider stability against misfolding. In fact, in our models sites with few native contacts are constrained to be more polar than sites drawn from the background distribution. This fact may be attributed to the stability against misfolding. If exposed sites were more hydrophobic than what our model predicts, the stability against unfolding would remain almost the same, but the stability against misfolding would be hindered, and our selection model would limit the hydrophobicity of exposed sites. To test this reasoning, we predict the amino acid distributions evaluating the folding stability without considering the misfolded ensemble (see Materials and Methods). Figure 2 (right) represents the average hydrophobicity, defined as ⟨hi⟩=∑aPi(a)h(a), versus the number of contacts. Both for the MF and for the WT models, the hydrophobicity of exposed sites with few contacts is substantially lower for the model that takes into account misfolding. This is expected according to the understanding that the hydrophobicity of sites with few contacts is constrained by negative design, whereas models that neglect misfolding generate higher hydrophobicity at these sites. The relaxed selective pressure for polarity at exposed sites also induces relaxed selective pressure for hydrophobicity at buried sites, so that in the absence of selection against misfolding the average hydrophobicity is less dependent on the number of contacts and more similar to what it would be under mutation alone. We also show in figure 2 (right) empirical hydrophobicity values obtained from the MSA. At exposed sites, empirical hydrophobicity values are even smaller than those of models that consider misfolding, and they strongly disagree with models that neglect misfolding, supporting the importance of stability against misfolding. From the empirical hydrophobicity curve, we can see that the predictions based on the complete model overestimate the hydrophobicity of sites with 12 or more contacts. This discrepancy affects less than 8% of sites, and it is due to the fact that in real sequences a large number of contacts select large aromatic amino acids whose hydrophobicity is intermediate, whereas in our free energy model the size of the amino acid is not properly considered. A stronger correlation with the empirical hydrophobicity is observed for the principal eigenvector of the contact matrix (Bastolla et al. 2005), which is correlated with the number of contacts but is less dependent on the size of the amino acid. Altogether, empirical data clearly support the importance of selection against misfolding. The behavior of the average hydrophobicity versus the number of contacts directly affects the sequence entropy, as previously shown (Porto et al. 2005). In agreement with the above analysis, we find that the sequence entropy is substantially higher for the models that do not take into account misfolding than for the complete stability models (fig. 2 left), showing that stability against misfolding is an important selective constraint in both the MF and the WT models. The difference between the complete and the native model is most important for sites with minimum and maximum number of contacts. Of particular importance, the native model that does not consider stability against misfolding predicts that the entropy monotonically decreases with the number of contacts, in agreement with the EJW model (see fig. 2 left). Therefore, the existence of a maximum of the entropy in the MF and the WT models is a consequence of considering stability with respect to misfolding. The predictions of the evolutionary model that considers misfolding are robust with respect to introducing the contribution to protein stability of the propensities between amino acids and secondary structure. This contribution is modelled by adding to the free energy function the term ∑iϵSS(si,Ai), where si is the secondary structure, Ai is the amino acid, and ϵSS is a statistical potential for secondary structure whose parameters have been determined in Bastolla et al. (2008). These parameters have the property that they can be combined with contact interactions as they do not count the propensity between secondary structure and amino acids induced by the number of contacts. We found that this addition slightly reduces the sequence entropy, as expected, but it does not modify its overall shape versus the number of contacts (supplementary fig. S1, Supplementary Material online). Site-Specific and Pooled Entropy from Empirical Data We now examine site-specific sequence entropies measured from MSA for proteins with known X-ray structure. For each site, we measure the site-specific amino acid distribution and compute the corresponding entropy. The entropies of sites with the same number of contacts nc are averaged and the average entropy is represented as a function of nc in figure 3 (A). One can see that the MSA entropy is much smaller than the site-specific entropy predicted even by the model displaying the strongest stability constraints, that is, the WT model with selection on misfolding. Importantly, the empirical entropy monotonically decreases with the number of contacts. However, the sequences in the MSA are evolutionarily related and they are not independent of each other. This can lead to underestimate the entropy because there was not sufficient time to explore sequence space. Because of this reason, we do not attach much importance to the absolute value of the site-specific entropy. Nevertheless, the fact that the entropy is a decreasing function of the number of contacts is quite robust. In order to test it, we computed the MSA entropy iteratively removing from each alignment the most divergent sequence whose average identity with the other sequences is lowest. By increasing the number of removed sequences from 0 to 10, the value of the entropy slightly decreases, but the functional dependence on the number of contacts does not change (see supplementary fig. S2, Supplementary Material online). Therefore, the stability-constrained model of protein evolution that takes into account stability against misfolding is not supported by empirical data, neither at the quantitative level (the value of the entropy) nor, more importantly, at the qualitative level (the dependence of the entropy with the number of contacts). This is the main result of this work. Fig. 3. View largeDownload slide A: Site-specific sequence entropy as a function of the number of contacts for the empirical sequence entropy obtained from MSAs. B: Sequence entropy of the pooled amino acid distribution of all PDB residues present at sites with nc contacts as a function of nc. Fig. 3. View largeDownload slide A: Site-specific sequence entropy as a function of the number of contacts for the empirical sequence entropy obtained from MSAs. B: Sequence entropy of the pooled amino acid distribution of all PDB residues present at sites with nc contacts as a function of nc. Next, we pool together all sites with nc contacts, we compute the entropy of the corresponding amino acid distribution, and we plot it as a function of nc. Figure 3 (B) shows that pooled entropies are much higher than site-specific entropies, and they present a maximum versus the number of contacts, as predicted by stability-constrained models. These results confirm those obtained in Porto et al. (2005), who found that the sequence entropy versus the number of contacts has a maximum when sites with the same number of contacts are grouped together, but not when they are examined separately (Ramsey et al. 2011) as in figure 1. As disucssed by Ramsey et al., the fact that the pooled entropy is much larger than the site-specific entropy suggests that the latter responds to stronger structural constraints, which are averaged away when sites are pooled, leaving only weaker and less specific constraints that are captured by the stability-constrained model. Importantly, exposed sites with few contacts have smaller pooled entropy, that is, they are more selective for the amino acid types, than amphiphilic sites with an intermediate number of contacts. This reduced entropy of exposed sites is a further indication that these sites are not likely to contain hydrophobic residues, which supports the model with misfolding. Note in fact that the pooled entropy of sites with zero contacts is 2.5, which agrees better with the entropy predicted by the complete stability models (2.7) than with the entropy predicted by the native stability models (2.9). Consistent with the analysis of the site-specific average hydrophobicity, these data suggest that selection against misfolding is even stronger in real sequences than in sequences generated with stability-constrained models. Substitution Rates The results reported above have direct implications for the substitution rates. In fact, evolutionary arguments suggest that more constrained sites with smaller sequence entropy evolve slower. This behavior is clear from empirical data from MSA, which show a monotonic increase of the substitution rate with sequence entropy, as shown in figure 4 (A). The plot is obtained by computing the sequence entropy and the substitution rate for each individual site and then grouping together sites with similar sequence entropy. As a result of the monotonic relationships between number of contacts and sequence entropy and between sequence entropy and substitution rate, the substitution rate monotonically decreases with the number of contacts as found in several previous studies (Echave et al. 2016) (see fig. 4 B). Fig. 4. View largeDownload slide Substitution rate versus the sequence entropy (A) and versus the number of contacts (B) for empirical data derived from MSAs. Fig. 4. View largeDownload slide Substitution rate versus the sequence entropy (A) and versus the number of contacts (B) for empirical data derived from MSAs. The monotonic increase of the substitution rate with the sequence entropy is consistent with the Halpern–Bruno (HB) model of the substitution process (Halpern and Bruno 1998). This mathematical model is derived from the population genetics formula of the fixation probabilities in the limit of very small mutation rate (Kimura 1962). Briefly, the HB model considers the independent sites approximation, P(A1⋯AL)=∏iPi(Ai) and models the evolutionary process at each site as a Moran’s process. The HB model is based on the observation that the stationary distribution of the Moran’s process at site i is an exponential distribution of the fitness of amino acid a at site i, Pi(a)∝Pmut(a) exp ⁡(Λfi(a)), where Pmut(a) is a background distribution that represents the mutation process and Λ is a selection parameter related with the effective population size. The same relationship between stationary distribution and fitness was later recognized by Sella and Hirsch (2005) and by Mustonen and Lässig (2005), without invoking the independent sites approximation. These authors deeply discussed the analogy between population genetics and statistical physics, where the inverse of population size plays the role of the evolutionary temperature and the fitness plays the role of minus the energy. For a didactic discussion of this subject, see also Nido et al. (2016). The MF and WT models are also based on the same principle, but they explicitly compute the fitness and the site-specific amino acid frequencies from the folding stability model, whereas in the HB approach the site-specific amino acid frequencies are obtained from empirical data. The HB approach allows computing the site-specific substitution processes with rate matrices Qabi that fulfill the following properties: 1) They admit as stationary distributions the Pi(a) and 2) they can be expressed as Qabi=QabmutPfix(fai,fbi), where Qmut is a global mutation process and Pfix is the fixation probability derived from the Moran’s model whose stationary distribution is Pai. Without loss of generality, we can parametrize the rate matrix of the mutation process as Qabmut=EabmutPbmut, where Pamut is the stationary matrix of the mutation process and Eabmut is its exchangeability matrix. To simplify formulas, here we assume detailed balance, that is, we assume that Eabmut is a symmetric matrix. Halpern and Bruno showed that the conditions (1) and (2) above imply that the site-specific substitution processes are given by   Qabi=EabiPbi, (1)  Eabi=Eabmut(ln⁡(Fbsel,i)−ln⁡(Fasel,i)Fbsel,i−Fasel,i) (2)  with  Fasel,i=PaiPamut. (3) As the exchangeability matrix Eabi is symmetric, detailed balance holds and Pai is the stationary distribution. The selective factors Fasel,i quantify the deviation of the site-specific distribution Pai from the background distribution Pamut induced by mutation alone. For substitutions with Fasel,i=Fbsel,i, and in particular for synonymous substitutions a = b, applying l’Hopital’s rule we find Eabi=Eabmut/Fbsel,i and Qabi=Qabmut, that is, the rate of synonymous substitutions, which here are considered selectively neutral, equals the mutation rate, in agreement with Kimura’s theory. If the amino acid b is favored by selection with respect to amino acid a, Fbsel,i>Fasel,i, then the substitution rate is enhanced with respect to the neutral rate, and it is decreased in the opposite case. Because of detailed balance, these effects are exactly compensated by the ratio of the initial frequencies of a and b, and the flux in one direction and the other are equal, Rabi=PaiPbiEabi=Rbai. In the HB model, the flux between two amino acids a and b is given by   Rabi=PaiPbiEabi=(PamutPbmutEabmut)Fasel,iFbsel,iln⁡(Fbsel,i)−ln⁡(Fasel,i)Fbsel,i−Fasel,i. (4) In the above equation, the flux is partitioned into a global component that derives from the mutation process (superscript mut) and a site-specific component that derives from selection (superscript sel) and is function of Fasel,i=Pai/Pamut. The flux is maximal for substitutions ab that have large mutational flux PamutPbmutEabmut and large and almost equal selective factors Fasel,i≈Fbsel,i. The flux decreases when the mutational flux decreases, when the smaller selection factor decreases and when the fitness difference between a and b increases. Based on the latter property, Halpern and Bruno (1998) argued that the substitution rate Ri=∑abPaiEabiPbi is higher at position with higher sequence entropy. This expectation is fulfilled on the average but it is not a rigorous inequality, and in fact we observe that the substitution rate is not a strictly increasing function of sequence entropy (see fig. 5 A). Fig. 5. View largeDownload slide Substitution rate predicted by the four versions of the stability-constrained model versus the sequence entropy (A) and versus the number of contacts (B). Fig. 5. View largeDownload slide Substitution rate predicted by the four versions of the stability-constrained model versus the sequence entropy (A) and versus the number of contacts (B). We determine the global exchangeability matrix Eabmut through the condition that the average flux between any pair of amino acids equals the flux measured in an empirical model of the substitution process, ∑iRabi/L=Rabemp. In this way, the effect of the selection process represented in the MF model disappears from the matrix Eabmut, and this matrix does not contain any free parameter. We note here that in our original mean-field paper (Arenas et al. 2015) we assumed that the exchangeability matrix is the same at all sites, Eabi=Eabmut. This assumption implies the condition (1) above that the distribution Pai is the stationary distribution of the substitution process Qabi=EabmutPbi, but it is inconsistent with the Moran’s fixation probability. Moreover, this assumption would imply that the substitution rate Ri=∑abEabmutPaiPbi is high at position with low sequence entropy, as the highest possible value is attained for a distribution in which the two amino acids a¯ and b¯ with largest exchangeability have half of the weight, Pa¯i=Pb¯i=0.5. We verified numerically that our previous model with global exchangeability matrix predicts an inverse relationship between sequence entropy and substitution rate that is contradicted by empirical data and is inconsistent with the Moran’s model. Therefore, to amend this point, here and in our next works we shall adopt for the MF model the exchangeability matrix Eabi of the HB model (eq. 2). Figure 5 (A) shows the substitution rate as a function of the sequence entropy obtained with the four combinations of models that we considered. The substitution rate shows an almost monotonic increase with the sequence entropy, consistent with empirical data and with the HB arguments. We then show in figure 5 (B) the substitution rate as a function of the number of contacts. In this case, as expected, the results differ qualitatively for two classes of models. Models that neglect stability against misfolding and predict a monotonic decrease of sequence entropy with the number of contacts also predict that the substitution rate decreases with the number of contacts. In contrast, models that consider stability against misfolding and present a maximum of sequence entropy versus the number of contacts also feature the same maximum in the curve of the substitution rate versus the number of contacts. Discussion The main conclusion of this study is that the stability-constrained fitness models that take into account stability against misfolding disagree qualitatively with empirical data obtained from MSA, because they predict that both site-specific sequence entropy and site-specific substitution rates present a maximum as a function of the number of contacts whereas in empirical data both quantities decrease monotonically with the number of contacts (Echave et al. 2016). In contrast, stability-constrained fitness models that do not consider stability against misfolding predict that both the sequence entropy and the substitution rate monotonically decrease with the number of contacts, as observed in empirical data (Echave et al. 2015). Nevertheless, several lines of evidence indicate that models that ignore stability against misfolding are less supported by empirical data. We briefly review this evidence below. First of all, from the physical point of view, contact interactions, which only consider interresidues interactions, do not consider the free energy of solvation, whose contribution would penalize the exposure of hydrophobic residues. Solvation of exposed residues does not contribute to stability against unfolding, as residues that are exposed in the native state are also exposed in the unfolded state, but it contributes to stability against misfolding, as these residues are buried in some misfolded conformations. In contact energy functions, solvation is implicitly considered when alternative conformations contribute to the free energy, and in this case hydrophobic residues are most stabilized when they are buried in the interior of the protein. Thus, contact-based models of stability cannot consider only the native structure. From an evolutionary point of view, the analysis of protein sequences show that the features that reduce the stability of the misfolded ensemble, the so-called negative design (Berezovsky et al. 2007), are significantly stronger in evolved sequences than in their reshuffled randomizations. These features include the destabilization of contacts between residues close in the protein sequence that are frequently formed in misfolded conformations (Noivirt-Brik et al. 2009; Bastolla et al. 2012) and the destabilization of pairs of correlated contacts that are frequently formed together in the misfolded ensemble (Minning et al. 2013). As the stability of the misfolded ensemble is predicted to increase with protein length and hydrophobicity, long and hydrophobic proteins are expected to face more misfolding problems and their sequences are expected to be more strongly selected for negative design. Consistently with this expectation, the more hydrophobic and the longer is the protein sequence, the more its negative design scores diverge from those of reshuffled sequences, indicating that the selective force favoring negative design increases with protein hydrophobicity and protein length (Minning et al. 2013) and supporting the idea that this is a real selective force and not an artefact. Furthermore, the amino acid compositions of evolved protein sequences are not random. If we consider three classes of residues, hydrophobic, polar, and neutral, we observe that sequences with a large fraction of hydrophobic residues have a significantly larger than expected fraction of polar residues (Minning et al. 2013). This property is expected under negative design, as repulsive interactions between polar residues compensate the attraction between hydrophobic residues, and it is stronger for longer proteins. Another evidence that supports negative design is the recent observation that point mutations that increase the surface hydrophobicity of symmetric complexes of Escherichia coli trigger the formation of large supramolecular assemblies, and that in wild-type sequences hot-spots of supramolecular assembly are buffered by hydrophilic residues, suggesting that negative design is preventing the misassembly of these regions (Garcia-Seisdedos et al. 2017). A second line of evidence comes from the study of the MF model of protein evolution with selection on the stability of the native state (Arenas et al. 2015). The variant of the model that neglects stability against misfolding has several unsatisfactory properties: 1) The average hydrophobicity of the sequences that it generates is much larger than the average hydrophobicity of natural sequences, 2) the likelihood of wild-type sequence in the PDB with respect to the model is low, and 3) the sequences generated by the model are on the average unstable against misfolding. All these aspects are significantly improved when stability against misfolding is considered in the model. In particular, the average hydrophobicity of the model becomes equal to that of the wild-type sequence, the likelihood of wild-type sequences becomes much larger, and the average stability against misfolding becomes positive. Globally, these observations strongly support the need to incorporate stability against misfolding in models of protein evolution. In this article, we have shown that the model that considers misfolding and the model that neglects it predict rather different hydrophobicity profiles. If misfolding is not considered, sites with few contacts are predicted to be more hydrophobic than they actually are and at the same time sites with many contacts are predicted to be less hydrophobic (see fig. 2 right). Thus, releasing the selective constraints imposed by negative design results in a site-specific hydrophobicity that is less dependent on the protein structure and is closer to the one induced by the mutation process. Consequently, the predicted sequence entropy is much larger if misfolding is ignored (fig. 2 left), except for the sites with an intermediate number of contacts that are minimally constrained by natural selection in both models. These results hold both for the MF model and for an independent sites model that considers pairwise interactions explicitly (WT model). Recently, Goldstein and Pollock (2017) computed site-specific substitution rates in a model with epistatic interactions (sites are not considered independent). Consistent with our results, they found that sites with many contacts tend to evolve more slowly, but unfortunately they did not present the comparison between exposed and amphiphilic sites. The comparison with evolved MSA (fig. 3 A) shows that stability-constrained models overestimate the site-specific sequence entropy, but the overestimation is much larger for the model that neglects misfolding. However, when MSA sites with the same number of contacts are pooled together to get contact-specific distributions of amino acids that respond to similar stability constraints and average out structural constraints, the resulting sequence entropy is very similar to what is predicted by the model with selection on misfolding, both qualitatively (there is a maximum between six and seven contacts, in agreement with the misfolding model, which predicts a maximum between five and six contacts, whereas the model that neglects misfolding predicts a monotonic trend with maximum at 0 contacts) and quantitatively (the entropy of the model overestimates by approximately 0.2 units the pooled entropy of both exposed sites with nc = 0 and buried sites with nc = 11, above which the statistical error in the MSA becomes very large, whereas the discrepancy is 0.9 and 1.4 with respect to site-specific entropies). Thus, we conclude that our model with selection on misfolding stability captures reasonably well the site-specific selective constraints on amino acid hydrophobicities although other aspects such as amino acid size are less well represented, whereas the model that neglects misfolding presents large qualitative and quantitative discrepancies with observed data. The main result of this work is that stability-constrained models of protein evolution that consider misfolding do not agree with the empirical observation that the sequence entropy and the substitution rate decrease monotonically with the number of native contacts. The previous finding that stability-constrained models predict a monotonic decrease of the sequence entropy and the substitution rate (Echave et al. 2015) was likely a consequence of neglecting misfolding stability, which has the consequence of increasing the hydrophobicity of sites with few contacts with respect to evolved sequences, releasing the selective constraint that these sites cannot be too hydrophobic, and overestimating their sequence entropy. Although the empirical site-specific entropies are likely underestimated, as the sampled sequences are evolutionarily related and it is possible that they did not explore sequence space sufficiently, the fact that the entropy decreases with the number of contacts is quite robust, as we demonstrated eliminating the most distance sequences from each alignment. The qualitative disagreement that we observed between the stability-constrained model and the empirical observations may be attributed to the limitations of the stability prediction. We note that contact energy functions are only appropriate when proteins are in a native-like geometry, as they only consider soft interresidue interactions and neglect all the hard interactions that define the protein geometry, such as steric repulsion, hydrogen bonds, and atomic packing. These interactions are properly satisfied in the native state of wild-type proteins but they cannot be taken for granted in mutants, which can present atomic clashes or cavities that would be not adequately penalized by the contact free energy function. Unfortunately, the strategy to include these hard terms in the free energy function and run explicit simulations is computationally not affordable, which explains why most stability-constrained models of protein evolution adopt contact energy functions. In any case, the energy function that we adopted was tested against 195 experimental measures of ΔΔG of proteins that fold with two-state thermodynamics, mostly involving hydrophobic residues, yielding a squared correlation coefficient between measured and predicted values r2=0.52 (fig. 1 in Bastolla 2014), that is, contact interactions account for half of the experimental effect, which is comparable to other more complex free energy functions. Finally, provided that stability against misfolding is considered, the qualitative predictions of the stability-constrained model are robust if we consider the contribution to protein stability due to the propensity of amino acids for secondary structures. Modelling the influence of the protein sequence on the stability of the unfolded state would result in a mathematical model similar to the model with secondary structure propensity, as the configurational entropy of the side chains can be modelled as ∑iϵentr(Ai), with parameters ϵentr(a) correlated with the propensity of the amino acid a in the loop secondary structure. Thus, the robustness of the stability-constrained model under addition of secondary structure propensities suggests that the qualitative shape of the sequence entropy curve would not change if we consider the influence of the protein sequence on the stability of the unfolded state. These arguments support the view that the limitations of the stability model, that we openly recognize, are not responsible of the qualitative disagreement between the predicted and observed site-specific variability. In our opinion, the responsible of the disagreement is another characteristic of the model that is clearly unrealistic: The assumption that mutations do not modify the protein structure. Indeed, mutations can create native structures that are thermodynamically stable but are different from the wild type. These mutations would be accepted by stability-constrained protein evolution models, but probably they would be discarded by purifying selection. It is common evolutionary wisdom that protein structures are more conserved than protein sequences (Illegard et al. 2009), which means that, for small evolutionary divergence, the fractional change of protein structures is smaller than the corresponding fractional change of protein sequences (Pascual-Garcia et al. 2010). This structural conservation is likely attributable in great part to function conservation, as the pairs of homologous proteins that maintain the same function, as indicated by Gene Ontology terms, evolve up to a limit value of structural divergence, whereas the pairs that change function can diverge much more in structure (Pascual-Garcia et al. 2010). These results, obtained by measuring the structural divergence as the divergence of contact matrices, were recently generalized using the divergence of the TM score (Zhang and Skolnick, 2004). In this recent work (Pascual-Garcia A, Arenas M, and Bastolla M, submitted), we found that the violations of the molecular clock are stronger and more systematic in structure evolution than sequence evolution, in particular when protein function changes. We argued that these violations of the molecular clock are indicative of strong positive selection acting on protein structures, whereas strong purifying selection acts when the protein function is conserved. The idea that protein structures are under strong purifying selection is consistent with the observation that the dynamics of the protein in its native state can be predicted solely from structure through structure-based models such as the ENM (Tirion 1996), and that the predicted collective dynamics reproduces functional conformation changes and allosteric communication. This view of the functional importance of the native structure is also supported by a recent work that found that functional sites induce long-range evolutionary constraints in enzymes (Jack et al. 2016), underlying the evolutionary importance not only of the active site but also of sites that are close to it in the native state. Sites with an intermediate number of contacts are minimally constrained by stability-constrained models. Mutations there have hardly a significant effect on stability, and an even smaller one on fitness. In fact, the fitness function f=1/(1+exp ⁡(ΔG/T)) is a sigmoidal function of stability −ΔG, thus when the wild-type sequence is very stable ( ΔG≪−1) changes of stability ΔΔG have a reduced effect on the fitness and they are effectively neutral (Serohijos and Shakhnovich 2014). According to the stability-constrained models, these amphiphilic sites hardly show any preference for particular amino acids, and their sequence entropy is almost the same as for the background distribution. However, empirical data show that these sites are more conserved than completely exposed sites. A possible explanation is that the mutations at these sites do not modify the stability but they modify the native structure and they may have a deep influence on the protein function. This explanation agrees with the structure-constrained models of protein evolution (Echave 2008), which predict that both the entropy and the substitution rate decrease with the number of contacts (Huang et al. 2014), as mutations at amphiphilic sites perturb the protein structure more than mutations at exposed sites, and consequently they are more often rejected. Thus, empirical data support the structure-constrained models of protein evolution more than the stability-constrained models, at least as far as amphiphilic sites are concerned. Of course, mutations modify both the structure of the native state and its stability ΔG. However, current models of protein evolution cannot predict both changes and they focus on only one aspect, considering one of two extreme types of mutations: Those that change the stability of the native state but have little effect on its structure and those that change the structure of the native state but have a weak effect on its stability. It would be rather useful to integrate both types of models into one that predicts the effect of mutations both on the structure and on the stability of the native state. Despite such a model does not exist yet, in particular due to the fact that the ENM used to predict the structural effect of a mutation does not differentiate between different sequences, this will be a very interesting direction of the future research. Materials and Methods Modelling Stability against Unfolding and Misfolding We estimate the stability of the experimentally known native state of a protein adopting the contact matrix representation of its structure. For each pair of residues at positions i and j along the polypeptidic chain, Cij equals one if the residues are in contact and zero otherwise. We define two residues to be in contact if any pair of their heavy atoms, either in the side chain or in the main chain, is closer than 4.5 Å. As contacts with |i−j|≤2 are formed in almost all structures, they do not contribute to the free energy difference between the native and the misfolded ensemble, and we set Cij = 0 if |i−j|≤2. The free energy of a protein in the mesoscopic structure described by Cij is modelled as a sum of contact interactions, E(C,A)=∑i<jCijU(Ai,Aj), which depends on the type of amino acids in contact Ai and Aj and on 210 contact interaction parameters U(a, b), for which we adopt the parameters determined in Bastolla et al. (2000). For simplicity, we neglect the conformational entropy of the folded native state and estimate its free energy as Gnat(Cnat,A)≈∑i<jCijnatU(Ai,Aj). Regarding the unfolded state, we neglect their contact interactions and estimate its free energy as GU≈−TLSU, where T is the temperature in units in which kB = 1, L is chain length and SU is the conformational entropy per residue of an unfolded chain. We compute the free energy of the misfolded state from the partition function of the contact energy E(C, A) over a set of compact contact matrices C of L residues that are obtained from the PDB. In agreement with previous studies (Garel and Orland 1988; Shakhnovich and Gutin 1989), the resulting free energy is approximately described by the random energy model (REM) (Derrida 1981), but with the addition of the third moment of the contact energy (Minning et al. 2013):   Gmisf≡−T log ⁡(∑Ce−∑i<jCijU(Ai,Aj)/T+S(C))≈⟨E⟩−⟨(E−⟨E⟩)2⟩2T+⟨(E−⟨E⟩)3⟩6T2−LSCT , (5) where LSC is the logarithm of the number of compact contact matrices, ⟨.⟩ represents the average over the set of alternative compact contact matrices of L residues. We assume for simplicity that the conformational entropy, S(Cij), is approximately the same for all compact structures including the native one, and it can be neglected for computing free energy differences. The above estimate only holds above the freezing temperature of the REM (Derrida 1981), whereas the free energy is kept constant below the freezing temperature. Note that, although the free energy is defined in terms of Boltzmann averages, the result of the calculation depends only on the first three moments of the contact energies. These moments are computed from the corresponding moments of the contacts. For instance, the first moment is ⟨E⟩=∑i<j⟨Cij⟩U(Ai,Aj). The moments of the contacts such as ⟨Cij⟩ are computed preliminarly in order to accelerate the computation. This is crucial, as the program must iteratively compute averages over the site-specific distributions and it is very important that the coefficients are precalculated. For this reason, we do not use specific misfolded states. The moments of the contacts are computed over the set of compact conformations of a chain of L residues that are obtained by threading the chain over a nonredundant representation of the PDB. We consider only compact submatrices, discarding those whose number of contacts is smaller than a threshold that depends on L. To reduce the number of parameters that have to be estimated, we perform some approximations (Minning et al. 2013). We assume that ⟨Cij⟩ depends only on the contact range |i−j| and on chain length L, ⟨Cij⟩=a(L)f(|i−j|) and determine the factor a(L) so that the average number of contacts for structures of length L is the same as for native structures. For the contact correlation term, (⟨CijCkl⟩−⟨Cij⟩⟨Ckl⟩), we consider three situations: 1) ij = kl, in which case we adopt a model similar to the above that depends only on L and |i−j|; 2) a pair of residue is equal, say i = l, and we average over j and k both the contacts and the energies in order to obtain a quantity that only depends on i; and 3) all four residues are different, in which case we approximate the contact correlation with the average contact correlation obtained summing over ij and kl, which sums to ⟨Nc2⟩−⟨Nc⟩2, and removing the contribution of groups of three residues. Details on the computations can be found in Arenas et al. (2015). Putting together these free energy estimates, we obtain the free energy difference between the native and the nonnative states as   ΔG(Cnat,A)=Gnat−kT(e−Gmisf/kT+e−GU/kT) , (6) where the free energy of the nonnative state is computed as a Boltzmann average, which is essentially equal to Gmisf when the sequence is hydrophobic ( Gmisf−GU/kT≪−kT) and is essentially equal to GU when the sequence is hydrophilic ( Gmisf−GU/kT≫kT). When we neglect stability against misfolding, we estimate ΔG=Gnat(Cnat,A)+LSU as if all sequences were hydrophilic. Stability-Constrained Fitness Model Stability-constrained models of protein evolution assume that the fitness is proportional to the fraction of protein that is in the native state, which can be computed from the folding free energy as (Goldstein 2011; Serohijos and Shakhnovich 2014)   f=e−ΔG/kT/(1+e−ΔG/kT) . (7) The computation is performed assuming that the native contact matrix Cnat does not change in evolution, whereas the sequence changes one amino acid at a time, producing the free energy ΔGmut=ΔGwt+ΔΔG. In contrast, structure-constrained models of protein evolution adopt the ENM representation of the native state. The fraction of folded protein is assumed to be one by convention irrespective of the protein sequence, but the native structure may change upon mutation (Echave 2008). MF Model of Protein Evolution The MF model assumes that the amino acid distribution is the product of independent distributions at each protein site,   P(A1,⋯AL)=∏i=1LPi(Ai) . (8) The mean-field distribution is determined by minimizing its Kullback–Leibler divergence (distance between distributions) with respect to a global mutational distribution Pmut(a) (where a denotes any of the 20 amino acids), ∑iaPi(a) log ⁡(Pi(a)/Pmut(a)) with the normalization constraints ∑aPi(a)=1 and the constraint on the average fitness, which is transformed into a constraint on the folding free energy ΔG that is imposed through the Lagrange multiplier Λ that represents the strength of selection. As the contact energy parameters U(a,b),T,SU,SC are fixed, the only free parameters of the model are Λ and Pmut(a). These parameters are determined maximizing the log-likelihood of the PDB sequence, ∑i log ⁡(Pi(AiPDB)). The precomputation of the moments of the contacts makes the computation very fast, it runs in a few minutes even for proteins of several hundreds of amino acids. For further computational details see Arenas et al. (2015). WT Model of Protein Evolution In the WT model, the frequency of amino acid a at position i is assumed to be proportional to the background distribution Pmut(a) multiplied by the exponential of the fitness of the corresponding mutation in which the wild-type amino acid in the PDB AiWT is substituted by the new amino acid a:   PWT,i(a)∝Pmut(a) exp ⁡(Λf(mut(AiWT→a))) , (9) where the fitness of a sequence is computed as in equation (7) and the parameter Λ is determined in such a way to maximize the likelihood of the wild-type sequence, ∑i log ⁡(PWT,i(AiWT)). Sequence Entropy The sequence entropy at position i measures the variability of this position as   Si=−∑a=120Pi(a) log ⁡(Pi(a)) , (10) where Pi(a) is obtained either from the evolutionary model (MF or WT) or from an MSA or from pooled amino acids at equivalent structural positions with the same number of contacts. Evolutionary Rates The site-specific substitution rates of the evolutionary models are computed as the weighted average of the substitution rate matrix Qab=EabiPi(b),   Ri=∑a≠bPi(a)Pi(b)Eabi. (11) Eabi is the exchangeability matrix proposed by Halpern and Bruno (1998), equation (2), in which Pi(a) are the site-specific amino acid frequencies of the evolutionary model and Pmut(a) is the background distribution. The global exchangeability matrix Eabmut is computed imposing that the average flux between any pair of amino acids a and b is equal to their flux measured in an empirical substitution model, either the one by Whelan and Goldman (2001) or the one by Jones et al. (1992), that is, we compute Eabmut from the following equation (Arenas et al. 2015):   PaempPbempEabemp=(PamutPbmutEabmut)1L∑iFasel,iFbsel,iln⁡(Fbsel,i)−ln⁡(Fasel,i)Fbsel,i−Fasel,i . (12) Substitution Rates from MSAs The empirical substitution rates of 213 proteins were estimated in Yeh et al. (2014) from the MSA of homologous sequences through the program Rate4Site (Pupko et al. 2002), which builds the phylogenetic tree using a neighbor-joining algorithm (Saitou and Nei 1987) and estimates rates with an empirical Bayesian approach adopting the JTT model of sequence evolution (Jones et al. 1992). Supplementary Material Supplementary data are available at Molecular Biology and Evolution online. Acknowledgments We acknowledge interesting discussions with Julián Echave, Nick Goldman, Alberto Pascual-García, and Grzegorz Slodkowicz and thank two anonymous reviewers for their useful comments and suggestions. This work was supported by the Spanish Ministery of Economy through grants BIO2016-79043 and BFU2012-40020. M.A. was supported by the Ramón y Cajal Grant RYC-2015-18241 from the Spanish Government. Research at CBMSO is facilitated by Fundación Ramón Areces. References Arenas M, Sanchez-Cobos A, Bastolla U. 2015. Maximum likelihood phylogenetic inference with selection on protein folding stability. Mol Biol Evol . 32( 8): 2195– 2207. http://dx.doi.org/10.1093/molbev/msv085 Google Scholar CrossRef Search ADS PubMed  Arenas M, Weber CC, Liberles DA, Bastolla U. 2017. ProtASR: an evolutionary framework for ancestral protein reconstruction with selection on folding stability. Syst Biol . 66( 6): 1054– 1064. Google Scholar PubMed  Bastolla U. 2014. Detecting selection on protein stability through statistical mechanical models of folding and evolution. Biomolecules  4( 1): 291–231. http://dx.doi.org/10.3390/biom4010291 Bastolla U, Bruscolini P, Velasco JL. 2012. Sequence determinants of protein folding rates: positive correlation between contact energy and contact range indicates selection for fast folding. Proteins  80( 9): 2287– 2304. http://dx.doi.org/10.1002/prot.24118 Google Scholar CrossRef Search ADS PubMed  Bastolla U, Dehouck Y, Echave J. 2017. What evolution tells us about protein physics, and protein physics tells us about evolution. Curr Opin Struct Biol . 42: 59– 66. http://dx.doi.org/10.1016/j.sbi.2016.10.020 Google Scholar CrossRef Search ADS PubMed  Bastolla U, Porto M, Ortiz AR. 2008. Local interactions in protein folding determined through an inverse folding model. Proteins  71( 1): 278– 299. http://dx.doi.org/10.1002/prot.21730 Google Scholar CrossRef Search ADS PubMed  Bastolla U, Porto M, Roman HE, Vendruscolo M. 2005. Principal eigenvector of contact matrices and hydrophobicity profiles in proteins. Proteins  58( 1): 22– 30. Google Scholar CrossRef Search ADS PubMed  Bastolla U, Porto M, Roman HE, Vendruscolo M. 2006. A protein evolution model with independent sites that reproduces site-specific amino acid distributions from the Protein Data Bank. BMC Evol Biol . 6: 43. Google Scholar CrossRef Search ADS PubMed  Bastolla U, Vendruscolo M, Knapp EW. 2000. A statistical mechanical method to optimize energy functions for protein folding. Proc Natl Acad Sci U S A.  97( 8): 3977– 3981. http://dx.doi.org/10.1073/pnas.97.8.3977 Google Scholar CrossRef Search ADS PubMed  Berezovsky IN, Zeldovich KB, Shakhnovich EI. 2007. Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput Biol.  3( 3): e52. Google Scholar CrossRef Search ADS PubMed  Derrida B. 1981. Random Energy Model: an exactly solvable model of disordered systems. Phys Rev B.  24( 5): 2613. http://dx.doi.org/10.1103/PhysRevB.24.2613 Google Scholar CrossRef Search ADS   Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett.  457( 4–6): 413– 416. Google Scholar CrossRef Search ADS   Echave J, Jackson EL, Wilke CO. 2015. Relationship between protein thermodynamic constraints and variation of evolutionary rates among sites. Phys Biol . 12( 2): 025002. http://dx.doi.org/10.1088/1478-3975/12/2/025002 Google Scholar CrossRef Search ADS PubMed  Echave J, Spielman SJ, Wilke CO. 2016. Causes of evolutionary rate variation among protein sites. Nat Rev Genet . 17( 2): 109– 121. http://dx.doi.org/10.1038/nrg.2015.18 Google Scholar CrossRef Search ADS PubMed  Franzosa EA, Xia Y. 2009. Structural determinants of protein evolution are context-sensitive at the residue level. Mol Biol Evol . 26( 10): 2387– 2395. http://dx.doi.org/10.1093/molbev/msp146 Google Scholar CrossRef Search ADS PubMed  Garcia-Seisdedos H, Empereur-Mot C, Elad N, Levy ED. 2017. Proteins evolve on the edge of supramolecular self-assembly. Nature  548( 7666): 244– 247. Google Scholar PubMed  Garel T, Orland H. 1988. Mean-field model for protein folding. Europhys Lett . 6( 4): 307– 310. http://dx.doi.org/10.1209/0295-5075/6/4/005 Google Scholar CrossRef Search ADS   Goldstein RA. 2011. The evolution and evolutionary consequences of marginal thermostability in proteins. Proteins  79( 5): 1396– 1407. http://dx.doi.org/10.1002/prot.22964 Google Scholar CrossRef Search ADS PubMed  Goldstein RA, Pollock DD. 2017. Sequence entropy of folding and the absolute rate of amino acid substitutions. Nat Ecol Evol.  1( 12): 1923– 1930. http://dx.doi.org/10.1038/s41559-017-0338-9 Google Scholar CrossRef Search ADS PubMed  Halpern A, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol . 15( 7): 910– 917. http://dx.doi.org/10.1093/oxfordjournals.molbev.a025995 Google Scholar CrossRef Search ADS PubMed  Huang TT, del Valle Marcos ML, Hwang JK, Echave J. 2014. A mechanistic stress model of protein evolution accounts for site-specific evolutionary rates and their relationship with packing density and flexibility. BMC Evol Biol . 14: 78. Google Scholar CrossRef Search ADS PubMed  Illergard K, Ardell DH, Elofsson A. 2009. Structure is three to ten times more conserved than sequence–a study of structural response in protein cores. Proteins  77( 3): 499– 508. http://dx.doi.org/10.1002/prot.22458 Google Scholar CrossRef Search ADS PubMed  Jack BR, Meyer AG, Echave J, Wilke CO. 2016. Functional sites induce long-range evolutionary constraints in enzymes. PLoS Biol . 14( 5): e1002452. Google Scholar CrossRef Search ADS PubMed  Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci.  8( 3): 275– 282. Google Scholar PubMed  Kimura M. 1962. On the probability of fixation of mutant genes in a population. Genetics  4: 713– 719. Minning J, Porto M, Bastolla U. 2013. Detecting selection for negative design in proteins through an improved model of the misfolded state. Proteins  81( 7): 1102– 1112. http://dx.doi.org/10.1002/prot.24244 Google Scholar CrossRef Search ADS PubMed  Mustonen V, Lässig M. 2005. Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc Natl Acad Sci U S A.  102( 44): 15936– 15941. Google Scholar CrossRef Search ADS PubMed  Nido GS, Bachschmid-Romano L, Bastolla U, Pascual-García A. 2016. Learning structural bioinformatics and evolution with a snake puzzle. PeerJ Comput Sci.  2: e100. Google Scholar CrossRef Search ADS   Noivirt-Brik O, Horovitz A, Unger R. 2009. Trade-off between positive and negative design of protein stability: from lattice models to real proteins. PLoS Comput Biol.  5( 12): e1000592. Google Scholar CrossRef Search ADS PubMed  Pascual-Garcia A, Abia D, Mendez R, Nido GS, Bastolla U. 2010. Quantifying the evolutionary divergence of protein structures: the role of function change and function conservation. Proteins  78( 1): 181– 196. Google Scholar CrossRef Search ADS PubMed  Porto M, Roman HE, Vendruscolo M, Bastolla U. 2005. Prediction of site-specific amino acid distributions and limits of divergent evolutionary changes in protein sequences. Mol Biol Evol.  22( 3): 630– 638. http://dx.doi.org/10.1093/molbev/msi048 Google Scholar CrossRef Search ADS PubMed  Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N. 2002. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics  18( Suppl 1): S71– S77. Google Scholar CrossRef Search ADS PubMed  Ramsey DC, Scherrer MP, Zhou T, Wilke CO. 2011. The relationship between relative solvent accessibility and evolutionary rate in protein evolution. Genetics  188( 2): 479– 488. http://dx.doi.org/10.1534/genetics.111.128025 Google Scholar CrossRef Search ADS PubMed  Saitou N, Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol.  4( 4): 406– 425. Google Scholar PubMed  Scherrer MP, Meyer AG, Wilke CO. 2012. Modeling coding-sequence evolution within the context of residue solvent accessibility. BMC Evol Biol . 12: 179. Google Scholar CrossRef Search ADS PubMed  Sella G, Hirsh AE. 2005. The application of statistical physics to evolutionary biology. Proc Natl Acad Sci U S A.  102( 27): 9541– 9546. Google Scholar CrossRef Search ADS PubMed  Serohijos AW, Shakhnovich EI. 2014. Merging molecular mechanism and evolution: theory and computation at the interface of biophysics and evolutionary population genetics. Curr Opin Struct Biol.  26: 84– 91. http://dx.doi.org/10.1016/j.sbi.2014.05.005 Google Scholar CrossRef Search ADS PubMed  Shakhnovich EI, Gutin AM. 1989. Formation of unique structure in polypeptide chains. Biophys Chem . 34( 3): 187. http://dx.doi.org/10.1016/0301-4622(89)80058-4 Google Scholar CrossRef Search ADS PubMed  Tirion MM. 1996. Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys Rev Lett . 77( 9): 1905– 1908. http://dx.doi.org/10.1103/PhysRevLett.77.1905 Google Scholar CrossRef Search ADS PubMed  Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol.  18( 5): 691– 699. http://dx.doi.org/10.1093/oxfordjournals.molbev.a003851 Google Scholar CrossRef Search ADS PubMed  Yeh SW, Liu JW, Yu SH, Shih CH, Hwang JK, Echave J. 2014. Site-specific structural constraints on protein sequence evolutionary divergence: local packing density versus solvent exposure. Mol Biol Evol . 31( 1): 135– 139. http://dx.doi.org/10.1093/molbev/mst178 Google Scholar CrossRef Search ADS PubMed  Zhang Y, Skolnick J. 2004. Scoring function for automated assessment of protein structure template quality. Proteins  57( 4): 702– 710. http://dx.doi.org/10.1002/prot.20264 Google Scholar CrossRef Search ADS PubMed  © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

Journal

Molecular Biology and EvolutionOxford University Press

Published: Mar 1, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 12 million articles from more than
10,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Unlimited reading

Read as many articles as you need. Full articles with original layout, charts and figures. Read online, from anywhere.

Stay up to date

Keep up with your field with Personalized Recommendations and Follow Journals to get automatic updates.

Organize your research

It’s easy to organize your research with our built-in tools.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve Freelancer

DeepDyve Pro

Price
FREE
$49/month

$360/year
Save searches from
Google Scholar,
PubMed
Create lists to
organize your research
Export lists, citations
Read DeepDyve articles
Abstract access only
Unlimited access to over
18 million full-text articles
Print
20 pages/month
PDF Discount
20% off