Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Exact methods for lattice protein models

Exact methods for lattice protein models Lattice protein models are well-studied abstractions of globular proteins. By discretizing the structure space and simplifying the energy model over regular proteins, they enable detailed studies of protein structure formation and evolution. However, even in the simplest lattice protein models, the prediction of optimal structures is computationally difficult. Therefore, often, heuristic approaches are applied to find such conformations. Commonly, heuristic methods find only locally optimal solutions. Nevertheless, there exist methods that guarantee to predict globally optimal structures. Currently, only one such exact approach is publicly available, namely the constraint-based protein structure prediction method and variants. Here, we review exact approaches and derived methods. We discuss fundamental concepts like hydrophobic core construction and their use in optimal structure prediction, as well as possible applications like combinations of different energy models. Keywords: CPSP; HP-model; lattice protein; optimal structure; protein structure prediction. DOI 10.1515/bams-2014-0014 Received August 21, 2014; accepted October 17, 2014 Introduction Almost all cellular processes are governed or guided by proteins [1]. Their functionality is always coupled to the formation of a specific three-dimensional structure named *Corresponding authors: Martin Mann, Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany, E-mail: [email protected]; and Rolf Backofen, Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany; Center for Biological Signaling Studies (BIOSS), University of Freiburg, Freiburg, Germany; Center for Biological Systems Analysis (ZBSA), University of Freiburg, Freiburg, Germany; and Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg C, Denmark, E-mail: [email protected] functional, native, or biological fold/conformation. It was shown through refolding experiments that the functional fold of a protein is mainly encoded by its sequence of amino acids [2, 3]. Nevertheless, there is a large number of processes, like crowding effects [4], co-factor binding [5], or chaperons [6], that influence or support the structure formation process within living cells [7]. Still, the folding process is mainly governed by non-covalent intramolecular interactions [8] and misfolding can result in reduced or broken functionality [9]. Several diseases [10­12], and even cancer [13], are caused by misfolded proteins. Two connected but independent problems arise. (i) To study the function of a protein, its native fold has to be identified. This is known as the protein structure prediction (PSP) problem. (ii) To understand the mechanistic details of misfolding, the folding process itself has to be investigated, e.g., using folding simulations. Here, we are focusing on the PSP problem, which resembles the result of the structure formation process. To study the PSP problem, various protein models have been developed. They range from full atom representations in three dimensions used in molecular dynamics simulations [14], through off-lattice bead models [15], and down to very-coarse-grained two-dimensional (2D) lattice models [16]. Three more or less independent layers of abstraction can be found: structure space, sequence space, and energy function. The structure space dictates possible spatial properties of a protein, while sequence space and energy function describe the distinction of different amino acids and their interacting forces, respectively. Assuming the system to be in thermodynamic equilibrium [17], the native fold will be the structure with minimal free energy [2]. Thus, structure prediction can be approached with optimization methods for a given model. Within this work, we will focus on PSP for lattice protein models. We first concentrate on the widely studied hydrophobic-polar model (HP model) by Lau and Dill [16]. Even in this simplistic model, the identification of energyminimal structures is a difficult computational problem [18, 19]. As exhaustive structure enumeration approaches [20] are restricted to very short sequence lengths, usually heuristic methods are applied. They apply various techniques, e.g., simulated annealing [21, 22], quantum 214Mann and Backofen: Exact methods for lattice protein models annealing [23], ant colonization optimization [24], evolutionary algorithms [25], energy landscape paving [26], large neighborhood search [27, 28], constraint programming [29­31], or dedicated search heuristics [32­35]. They enable performance-guaranteed approximations close to the optimum for both backbone-only [36, 37] and sidechain models [38, 39]. For a detailed overview of approximating methods, refer to reviews by Hart and Newman [40] or Istrail and Lam [41]. Here, we review exact methods for the PSP problem for lattice protein models, i.e., approaches that identify energy optimal structures only. Given the large amount of the mentioned recent local search PSP research, it is surprising that currently only two exact methods are known, both targeting the HP model in three dimensions. The first approach was introduced by Yue and Dill [42] and named constrained hydrophobic core construction (CHCC). It is based on the observation that globular proteins show a densely packed hydrophobic core both in reality as well as in the HP model. Thus, energy-optimal structures show an almost maximally dense packing of hydrophobic amino acids. This is used in the CHCC method by considering only (decreasingly) compact hydrophobic packings when searching for optimal structures. While the CHCC approach was the first exact PSP method, it was proven to be incomplete when enumerating all optimal structures [43]. In contrast, the constraint-based protein structure prediction (CPSP) approach [43­45] was shown to be exact and complete when calculating optimal structures. CPSP uses a similar strategy as CHCC and is thus far the only exact and complete method for protein structure prediction in lattice protein models. It is often used as reference when testing local search methods and is discussed in detail within this review. The CPSP approach facilitates methods for more sophisticated energy models. Among them are its extension and application to the HPNX model [46], which also takes polar interactions into account [47]. Furthermore, HP-optimal predictions can be used within a hierarchical prediction approach to search for energy-optimal structures within full 20 0 pairwise potential models [21, 22, ×2 48]. The latter requires the identification of HP-optimal conformations that are structurally diverse. This can be achieved by the introduction of an equivalence relation based on hydrophobic cores and the restricted enumeration of representatives for each equivalence class [49]. In the following, we will introduce the necessary formalisms to discuss lattice protein models and subsequently the CPSP approach. Afterward, we will review the construction of (sub)optimal hydrophobic core packings, a central prerequisite for both the CHCC and CPSP methods. Hydrophobic core information can be used to define equivalence classes of structures within the HP-model, and we will present the enumeration of class representatives based on an extension of the CPSP method. Finally, the extension and application of the CPSP approach to enhanced lattice protein models is illustrated. Lattice protein models The strongest abstraction within a lattice protein model is the allowed structure space, where conformations are discretized on the basis of an underlying regular lattice. Such a lattice L is a set of 3D coordinates that form an additive group for any two points u, v L, i.e., it holds ( u ± v ) L. The neighborhood NLL is the minimal subset of vectors to encode the whole lattice by a linear combination of these vectors using positive integers only, i.e., uL : u = cx x with cx N. xN L (1) Furthermore, we require NL to also contain the reverse vectors, i.e., x N L - x N L , and only vectors of equal lengths, i.e., u, v N L :| u | =| v |. For instance, a 3D cubic lattice is defined by Ncubic={±(1,0,0), ±(0,1,0), ±(0,0,1)}. A wide variety of lattices have been studied in the context of lattice protein models. Common 3D lattices are the cubic (Figure 1) and face-centered-cubic (FCC) lattice (12 neighbors per point, Figure 2), as well as the diamond lattice. For further lattices and details, refer, e.g., to Refs. [50, 51]. The coordinates of a given lattice define possible placements for protein monomers. Within backboneonly models, a protein is represented by its backbone C-atom positions only, while successive C-monomers have to be neighbored in the lattice according to NL. Thus, a backbone-only lattice protein structure P of length n is defined by P P1, ..., Pn)Ln with (P(i+1)­Pi)NL. In addi=( tion, P has to be self-avoiding, i.e., ij: PiPj. Side-chain models extend the amino acid abstraction with a second side-chain monomer, which has to be neighbored to the corresponding C-monomer. For examples, see Figure 1. The modeling quality of real protein conformations by lattice protein structures depends strongly on the underlying lattice used [51­53]. Figure 1 depicts the exponential lattice-dependent growth of the structure space that is detailed in Ref. [50] for different lattices. The sequence space abstraction and energy function are the final determinants for a lattice protein model. Mann and Backofen: Exact methods for lattice protein models215 2 1e+09 2 Number of possible structures 2 1e+06 2 F 2 1e+03 F C C S S F C S F C C S S S 2 F F C S S C C S S S F C S S C C S S S S S 1e+09 Number of possible structures 2 1e+07 2 S F C F C F C F 1e+03 2 C FS C S 10 15 20 C S S S S C F 2 C S S S S Side chain models Lattice - SQR - CUB - FCC - 210 25 S S C S S S S 1e+05 Backbone-only models lattice S C F 2 - SQR - CUB - FCC - 210 25 1e+00 2 C S F 2 F CS CS FCS 2 S C S F C S 1e+01 Protein length n = 5 Protein length n = 5 Figure 1Examples for (left) backbone-only and (right) side-chain lattice protein models in the 3D cubic lattice. Colors encode an HP model: hydrophobic (green), polar (gray), and backbone monomers (pink). (Center) Sequence-length and latticedependent exponential growth of the symmetry-free structure space for different lattices (SQR ­ 2D square, CUB ­ 3D cubic, FCC ­ 3D facecentered cubic, 210 ­ chess knight). Figures are taken from Ref. [50]. Figure 2(Left) Coherence between an optimal structure (top) and the optimally dense packing of H-monomers, i.e., its H-core (bottom) in the 3D FCC HP model. (Right) CPSP workflow. Start and end are marked by open and filled circles, respectively. The number of hydrophobic amino acids within the sequence is denoted by nH, and m is initialized by the maximal number of hydrophobic contacts possible for nH monomers in the underlying lattice. Figures are from Ref. [50]. Various reductions of the amino acid alphabet of proteins are known, which have been combined with corresponding energy functions. Distance-based energy functions either incorporate sequence-based pairwise potentials scaled by the distance [54] or apply special distancedependent potentials [55]. For details, refer to the review by Hart and Newman [40]. In the following, we will discuss contact-based energy models, which can be used to define exact PSP methods. Therein, the energy E of a lattice protein structure P with sequence S is determined by the summation of all pairwise sequence-dependent contact potentials e(Si, Sj). Two non-successive monomers, Pi and Pj, are in contact, if they are neighbored within the lattice. E( S, P ) = 1 if Pi - Pj N L . ( Pi , Pj ) = 0 else. 1i+ 1< jn ( Pi , Pj ) e( Si , S j ), (2) Most abstract is the HP model [16], which models the central impact of hydrophobic forces within structure formation [56]. Here, hydrophobic amino acids are repelled 216Mann and Backofen: Exact methods for lattice protein models from water owing to their non-polar nature, resulting in a crowding of hydrophobic residues within the protein core [57] (see Figure 1). This so-called hydrophobic core is present in almost all globular water-solved protein structures. It is central to the CPSP approach and was also used in recent local search methods [22, 34]. Beneath the suggested energy potentials by Dill and co-workers [58] (e(Si, Sj) 1, if Si and Sj are hydrophobic; 0, otherwise), =­ other potentials have been applied [47, 59]. Banavar and co-workers [60] introduced the THP model to incorporate context-specific hydrophobic contact contributions. Another extension, the HPNX model introduced in Refs. [46, 47] with different potentials, distinguishes four different amino acid groups, namely (H)ydrophobic, (P)ositively charged, (N)egatively charged, and X for all remaining neutral residues. Still, hydrophobic contacts are the strongest potentials. An improved version, the hHPNX model [61], follows the amino acid grouping of the YhHX model suggested by Crippen [3] where alanine and valine are treated as a special group (h). Bornberg-Bauer [47] introduced an integer conversion of the real-valued YhHX model that maintains approximately the same ratios of entries. His potentials were corrected by Hoque and co-workers [61]. Full 20 0 pairwise potentials were pioneered by ×2 Miyazawa and Jernigan [62, 63] and derived from real protein structure information. Simplified potentials have been suggested in Ref. [64]. In first studies, Chan and Dill [65] found contactbased potentials sufficient to enable realistic distributions of secondary structure elements in structure space by the study of small molecules in the 3D cubic lattice. While it is clear to see that very simple lattice models, e.g., in 2D, are difficult to map into the real protein structure space, Vendruscolo and Domany [66] showed general limitations of contact-based potentials to mimic real protein structures. Thus, to enable lattice proteins to predict real native structures, additional constraints are needed. For instance, the prediction and incorporation of secondary structure information yields very promising results [30, 67]. However, even without such extensions, lattice protein models enable studies of general features of protein structure formation and related problems. They have been used to investigate the folding process [68], native structure properties [40], sequence evolution [69], cooperative/competitive folding [70], and co-translational folding [71, 72], to name but a few. In the following, we will first focus on structure prediction of the HP model in 3D lattices. Later on, exact PSP approaches for enhanced energy models will be discussed. Constraint-based protein structure prediction As discussed in the Introduction, the CPSP approach is currently the only exact and complete PSP method for lattice protein models. Its basic version is tailored for 3D backbone-only HP models [43, 45, 73, 74] but was extended to other models. Among them are the extension to the HPNX model [46, 75] or to side-chain structure models [44], both later discussed in detail. In contrast to most tools and approaches in the field, a CPSP implementation is available for local installation [45, 76] and ad hoc web usage [44, 77]. Thus, it has spawned the compilation of extensive benchmark data sets of protein-like sequences with various folding features [78]. Among them are the existence of a unique optimal fold, proven through CPSP, and the accessibility of this structure through unrestricted or co-translational folding using the LatPack package [79]. Within the HP model, only hydrophobic contacts are contributing to the energy. It is therefore sufficient to consider and optimize H-monomer interactions only. The CPSP approach follows, as the CHCC method, the observation that an HP-optimal structure features an (almost) optimal packing of H-monomers, and thus maximizing their contacts (see Figure 2). This spawns the central idea of CPSP: if a structure shows an optimal packing of H-monomers, it has to be an HP-optimal structure. Following this idea, the CPSP screens (sub)optimal packings of H-monomers, the so-called H-cores, for their compatibility with the given sequence. If it is possible to identify a structure that confines the H-monomers of the sequence to the current optimal H-core, an optimal structure was found. This structure threading is done through constraint programming techniques. The overall CPSP workflow is given in Figure 2. In the following, we will first sketch the structure threading step for a given sequence and H-core. In the next section, the computation of H-cores is discussed. A detailed presentation of the whole approach if provided in Ref. [74]. Optimal structure identification The CPSP approach uses constraint-programming techniques to solve the PSP problem. Constraint programming enables the definition of constraint satisfaction problems (CSP) and offers a generic and efficient framework for satisfiability checks [80]. A CSP is defined by a set of variables X, their value domains D, and the set of constraints Mann and Backofen: Exact methods for lattice protein models217 C on X, which define valid variable assignments, i.e., solutions. A sophisticated iterative reasoning and pruning of violating values from the variable domains, combined with tuned search strategies, enable an efficient identification and enumeration of constraint-conform solutions. For further details on general constraint programming techniques, e.g., refer to Ref. [80]. The CSP formulated for the optimal structure identification step in the CPSP approach is based on both the protein's sequence S of length n and the current H-core of interest H in the underlying lattice L. An H-core HL is a set of lattice points with maximally compact positioning (details on their generation are discussed later). Only S-compatible H-cores are considered, i.e., the size of the core equals the number of H-monomers in the sequence. In the following, for simplicity, we will first discuss the CSP for backbone-only models [43, 45, 73, 74]. For each backbone position, a lattice position variable Xi is defined. The domain Di of Xi depends on the corresponding amino acid Si: if hydrophobic (Si ), the variable domain is =H defined by the H-core positions (Di ). Otherwise, the =H monomer has to be placed outside of the core (Di \H). =L As discussed above, this ensures the HP-optimality of any produced structure as long as the H-core shows an optimal packing. To encode valid structures only, additional constraints are enforced: first, self-avoidingness of the chain through global difference of assigned values [81] and, second, the chaining of successive monomers given the neighborhood of the lattice, i.e., x Di : y Di+1 : x - y N L . While formulated here for domains on lattice coordinates, a coordinate integer encoding can be used in order to apply standard finite domain constraint solvers [50, 74, 75]. Provided the H-core shows a maximal number of contacts, any solution satisfying these constraints will represent an optimal structure according to the HP model. If no solution was found, the next H-core with similar compactness is tried. If no optimally packed H-core yields a solution, it is proven that no optimal structure with lower or equal number of hydrophobic contacts exists. Thus, the CPSP approach relates to suboptimal H-cores, which show the highest non-optimal number of hydrophobic contacts and iterates the procedure. Therefore, solutions for such suboptimal H-cores are still globally optimal, as no solution for more compact H-cores was found. Usually no or only a few suboptimality iterations are necessary. Note that the constraint programming framework enables the enumeration of all valid solutions for a CSP. That is, the set of all optimal structures can be enumerated by the CPSP approach when screening all corresponding H-cores. The full workflow is depicted in Figure 2. The complete enumeration of optimal structures has to exclude symmetric conformation. That is, identical structures resulting from rotation or reflection in the lattice have to be avoided. The CPSP approach employs efficient symmetry-breaking techniques within the solution search [82], resulting in a fast and symmetry-free solution enumeration. To this end, for each solution, the CSP is enhanced by lattice-specific constraints that exclude symmetric assignments within the remaining enumeration [82, 83]. The symmetry-free number of optimal structures, known as a sequence's degeneracy, is often taken as a measure of structural stability of the native fold [84]. When extending the approach to side-chain lattice protein models [44, 50], only an adoption of the CSP is needed while the overall workflow is kept. The contact-based energy computation in side-chain models is restricted to side-chain monomers only [50]. In the HP model, only hydrophobic side-chain interactions are contributing to the energy. All other contacts are neutral. Therefore, the CSP is adopted as follows: for each amino acid, two variables are defined, one for the backbone and one for the side-chain monomer position. The backbone monomer is confined to positions outside of the current H-core, while a sequence-specific domain assignment, analogous to the backbone model, is used for the side-chain monomers. In addition to the connectivity constraints applied to the backbone monomers, sidechain-backbone connectivity for each amino acid also has to be ensured. Self-avoidance is enforced for all variables, i.e., all monomer positions. This CPSP modeling extension enables, for the first time, the exact identification of optimal structures in the side-chain HP model in threedimensional lattices [44, 50]. The complete enumeration of optimal side-chain structures is also enabled; however, the enormous structure space growth (see Figure 1) and the high degeneracy of the HP model yield often hundreds to millions optimal structures. This problem is faced when extending the CPSP approach for the enumeration of representatives for H-core equivalence classes, discussed in the "Hydrophobic core equivalence" section. Hydrophobic core construction An H-core can be generated from an optimal conformation by removing all bonds and considering only H-monomers (see Figure 3A). As introduced, the CPSP approach depends on the availability of (sub)optimal H-cores for the lattice of interest. This is based on the relation of the number of contacts in the H-core and the number HH-contacts in a 218Mann and Backofen: Exact methods for lattice protein models Layer 1 n1 = 3 Hs 1 6 7 5 3 4 6 7 5 3 4 2 3 2 surface contacts 2 surface contacts 2 surface contacts 2 surface contacts 5 4 + Interlayer contacts + Layer 2 n2 = 4 Hs 6 7 Figure 3H-core construction. (A) Optimal conformation for the sequence PPHPHHPPPPHHHHPPP in the 3D cubic lattice together with the associated hydrophobic core and its decomposition into layers. (B) Relation between HH-contacts (blue) and surface contacts (red) exemplified for layer 1. conformation featuring this H-core. Formally, the contact number of an H-core H is defined by I ( H ) =|{{ x , y } | x , y H x - y N L } |. (3) Now, let a conformation P for a sequence S have an associated H-core H. Then, the number of HH-contacts E(S, P) relates to I(H) by E ( S , P ) = I ( H ) - number of HH-bonds. (4) Here, HH-bonds denote the bonds between successive H-monomers in the sequence, which are not contributing to the energy. Thus, optimal conformations also have optimal or near-optimal H-cores, where we define an H-core HL to be optimal if its contact number I(H) is maximal for this H-core size |H|. Figure 3A shows the basic idea of using a hydrophobic core construction, as first introduced by Yue and Dill [42]. It also depicts the decomposition of H-cores in order to efficiently determine optimal ones. Here, an H-core is broken into individual layers (right part of Figure 3A), such that we can dissemble a core contact I(H) into layer and interlayer contacts. Thus, the hydrophobic core construction reverses direction (indicated with blue arrows). On the basis of branch-and-bound techniques, (sub) optimal layers are generated, which are then composed into (sub)optimal hydrophobic cores. Finally, conformations are threaded onto the optimal cores as described above. Thus, the success of the CPSP approach relies on good bounds for the layer generation to effectively generate hydrophobic cores. Therein, the basic question is the following: given a specific distribution of H-monomers onto each layer (e.g., in Figure 3A: n1 and n2 ), what is the maximal contact =3 =4 number of an associated H-core? As indicated before, one distinguishes between layer and interlayer contacts for this purpose, which we will illustrate with the example from Figure 3. Therefore, let us concentrate first on the number of layer contacts of the three H-monomers in layer 1 (Figure 3B). This placement produces two HH-contacts. As shown by Yue and Dill [42], the number of layer contacts is related to the surface of the minimal rectangle around all H-monomers. For this, we need to identify the surface contacts of a specific layer, which is the number of unoccupied neighboring positions of H-monomers. Figure 3B highlights the eight surface contacts for layer 1 in red. As every H-monomer has four neighbors in a 3D cubic layer, we know that four times the number of H-monomers must be equal to twice the number of HH-contacts plus the number of surface contacts (4 H+surfaces). Thus, ×n=2×H we get the equation 4 +8. This implies that mini×3=2×2 mizing the surface in a layer is maximizing the number of HH-contacts. Furthermore, the number of surface contacts is exactly the perimeter of the minimal rectangle around all H-monomers (both 8 in Figure 3B). For that reason, a placement of n H-monomers that optimizes the number of HH-contacts is found by using the minimal rectangle (in the following, called frame) with height a = n and n width b = , or vice versa. a This upper bound can be improved in several ways. If one considers four H-monomers to be placed in one layer, then the optimal frame has the dimension 2 . However, ×2 four H-monomers cannot be placed into this frame if the protein's subsequence containing these four H-monomers is HHPHH. The reason is that the enclosed P, which is called P-singlet [42], cannot be placed outside the 2 ×2 frame, which contains all H-monomers, while being in contact to its flanking H-monomers. Thus, the optimal frame is not compatible with the sequence at hand. This can be solved by including these P-singlets in the number of H-monomers to be placed when calculating the optimal surrounding frame. Another improvement was introduced in Ref. [85] by using a special property of the cubic lattice. Mann and Backofen: Exact methods for lattice protein models219 Given a point p = ( x , y , z ) in the cubic lattice, we define a point to be even if x+y+z is even, and odd otherwise. As the neighbors of any point in the cubic lattice differ by one in exactly one coordinate, an even point has only odd neighbors, and vice versa. In a sequence, we also have even and odd monomers, and this parity also translates to the parity of occupied positions. Thus, instead of considering a distribution of the number of H-monomers per layer, one can consider distributions of even and odd H-monomers to layers [85]. If there are two even and two odd monomers, as in the case of Figure 3A for layer 2, then the minimal frame has the dimension 2 , which yields four ×2 HH-contacts. If, instead, three even and one odd (or vice versa) monomers are to be placed, then the optimal frame has dimension 2 with only three HH-contacts [85]. ×3 Concerning the number of interlayer contacts, the problem is relatively easy for the cubic lattice, as each H-monomer in a layer can have at most one HH-contact to exactly one H-monomer in the following layer. Thus, given two successive layer frames, one possible upper bound for the number of interlayer contacts is given by the maximal number of overlap positions between both frames. The dimensions of this overlap are defined by the minimum of the heights and widths of the associated frames (see Figure 4). The problem becomes more complicated for the 3D FCC lattice. The layers, when using a corresponding splitting, are again equivalent to positions in the square lattice. Thus, most of the results for layers described above for the cubic lattice can be transferred to the layers in the FCC. However, the upper bound for the interlayer contacts is much more complex. To the best of our knowledge, only one upper bound for the interlayer contacts in the FCC lattice has been developed thus far [86, 87]. The problem is that an H-monomer can have zero to four contacts to H-monomers in the next layer. For that reason, Backofen [87] developed bounds for the number of positions in the next layer having one to four contacts (called 1-, 2-, 3-, and 4-points), given a frame with additional properties in the current layer. To give an example, it was shown that the number of 3-points for a given placement of H-monomers can be determined by considering the longest intersection of 45° diagonals with unoccupied positions in the frame [87] (see Figure 4B). This implies that the *-point composition of the layers has to be considered for determining a bound on the interlayer contacts. Once upper bounds for layer and interlayer contacts are found for a given H-core size, dynamic programming is used to identify optimal frame sequences for a contact bound [43, 74, 88]. These frame sequences are then instantiated to optimal H-cores through constraint programming. This frame sequence and H-core enumeration and its extension to suboptimal H-core generation are beyond the scope of this review, and we refer to the relevant literature [43, 74, 88, 89]. Hydrophobic core equivalence The hydrophobicity-focusing energy function in HP models results, on average, in a vast number of optimal structures. As polar residues do not contribute to the energy, optimal structures usually show a much higher variation in the placement of polar than hydrophobic residues. While the latter form a compact core, polar residues are placed arbitrarily loose around it. This is even more severe in side-chain models where most of the monomers is unconstrained by the energy function, namely both backbone and polar side-chain monomers. When considering only the relative positioning of a protein's hydrophobic monomers, only few distinct patterns occur, as depicted in Figure 5. These H-core placements enable the definition of an equivalence relation H ~ in order to group optimal structures accordingly. Formally, two backbone-only structures, P P1, ..., Pn) and =( b1=3 Layer 1 Layer 2 Interaction area 3-points in next layer a1=2 2=min(a1,a2) Figure 4(A) Two successive layers in the 3D cubic lattice filled with six H-monomers each. The associated frames have dimensions (a1; b1) and (a2; b2). The maximal number of interlayer contacts is 4 in(a1; a2) in(b1; b2). (B) Placement of H-monomers in a layer of the FCC =m ×m lattice. The longest intersection of 45° diagonals with unoccupied positions are indicated with yellow diagonals. The sum of the associated quantities is three, which is the number of positions in the next FCC layer that have exactly three H-neighbors in the current layer when filled with an H-monomer. These points are called 3-points and are indicated with blue circles, together with their possible contacts. a2=3 2=min(b1,b2) b2=2 220Mann and Backofen: Exact methods for lattice protein models Figure 5Different optimal structures (A, C) for the sequence PHHHPHPPPP in the 2D square lattice that show the same relative H-monomer placement (B), i.e., are within one equivalence class. The structure on the right highlighted in red (D) is also part of this equivalence class when considering reflection. This exemplifies the symmetry problem for equivalence detection. Figures taken from Ref. [50]. ^ ^ ^ P = ( P1 , ..., Pn ), for a sequence S are said to be equivalent if they show identical H-monomer placements according to some symmetrical shift. This is given by H ^ ^ P ~ P rR t L : i|S =H : Pi = PRr + t , i (5) where denotes the set of all symmetry functions (according to rotation and reflection) within the underlying lattice L, while Rr represents the rotation/reflection matrix for a symmetry r (see to Refs. [50, 74] for details). The translation vector t L shifts the symmetric structure into mapping. See Figure 5 for examples. This equivalence definition can be extended to side-chain lattice protein models (see Refs. [49, 50]). This results in corresponding equivalence classes within the structure space for a given sequence. The CPSP approach can be used to enumerate all equivalence classes comprising optimal structures for a given protein [49, 50]. To this end, one representative structure per class is identified. It is based on the observation that only optimal structures confined to the same H-core can be equivalent. Thus, the overall CPSP workflow can be kept, and only the CSP solution identification has to be adopted to avoid the enumeration of equivalent structures. This is achieved by a dedicated variable assignment strategy. Only for hydrophobic monomers, a complete solution enumeration is done excluding symmetries. For each H-monomer assignment, a satisfiability check for the placement of the remaining monomers is done, which results in a single representative optimal structure for the equivalence class, if possible. Note that the symmetry problem is already solved within the CPSP workflow as discussed above. Figure 6 compares the number of optimal structures (a sequence's degeneracy) with the number of equivalence classes for a large set of sequences. It reveals that the number of equivalence classes is several orders of magnitude smaller than a sequence's degeneracy. Furthermore, it reveals the extreme increase of optimal structures in side-chain HP models compared with backbone models due to energetically unconstrained polar and backbone monomers. Here, the difference to equivalence classes Optimal structures Representatives 8000 Number of sequences 800 6000 Number of sequences Optimal structures Representatives 0 1 10 100 1000 10000 1e+05 >10^6 Number of structures 1e+05 >10^6 Number of structures Figure 6Distribution of the number of representatives (green) vs. the overall number of optimal structures (degeneracy, red). They are exemplified for HP models in the 3D cubic lattice for sequence length 27. (Left) Results correspond to backbone-only predictions for 100,000 random sequences; (right) side-chain model predictions for 10,000 random sequences. The distributions are only shown with exact values for degeneracy below 106; all sequences with a higher degeneracy are collected in the rightmost bar in pink representing " 06." Figures taken from Ref. [50]. >1 Mann and Backofen: Exact methods for lattice protein models221 is even more extreme. The distributions of equivalence classes are similar when comparing backbone-only and side-chain models, while backbone-only models show a slightly higher number on average. This results from the increased freedom in hydrophobic side-chain monomer positioning in side-chain models, as only the neighborhood to the backbone monomer has to be ensured. In backbone-only models, each H-monomer has to be neighbored to the preceding and succeeding backbone monomer in the chain. Thus, optimal side-chain model structures show, on average, a more compact H-core compared with optimal backbone-only structures for the same sequence. As the number of H-cores decreases for increasing compactness, fewer H-cores and thus less equivalence classes are found for side-chain structure models. A sequence's degeneracy, which can be computed by the standard CPSP approach, is a measure of structural stability [84]. However, as discussed, it is flawed in the HP model owing to the missing energetic constraints on the non-hydrophobic monomers. Thus, the number of equivalence structures was suggested as a new measure of structural stability in the HP model as an alternative to degeneracy [49, 50]. N-P-contacts have an energy contribution of -1, while P-Pand N-N-contacts are penalized by +1. Contacts to X are neutral (contact potential is 0). Backofen and Will [46, 74, 75, 83] have introduced an exact constraint programming-based PSP approach for the HPNX model. The approach applies a branch-and-bound search on a constraint optimization problem, which encodes feasible lattice protein structures (similar to the CPSP model) in concert with the described HPNX-energy function to be minimized. The performance is based on sophisticated lower-energy bounds for partially defined solution structures, in order to prune energetically unfavorable parts of the search space. This way, the authors yield an efficient and exact PSP method that allows optimal structure prediction and enumeration [46, 74, 75]. Optimized prediction in full potential models For more sophisticated energy models, as, e.g., the 20 0 ×2 pairwise potentials by Miyazawa and Jernigan [62, 63], no efficient and exact PSP approach is known thus far. Here, usually, local search strategies are applied. Local search is a generic optimization scheme that minimizes a given objective function. Given a neighborhood relation within the search space to traverse, local search is able to follow the gradient to identify local or even global minima [90, 91]. Local search approaches work well in practice for more sophisticated energy functions but usually require a large number of steps to converge. Each search usually starts with a random start conformation to enable a good coverage of the structure space. Ullah and coworkers [21] introduced a protein-folding simulation procedure that employs two stages of optimization to find structures of minimal energy. The method mimics the hydrophobic collapse during protein folding, i.e., the hydrophobicity-driven, fast formation of an initial structure and the following folding into the functional fold. Thus, the protein sequence is first collapsed into a compact HP-optimal structure using the CPSP approach. Successively, the CPSP output is given as input to a simulated annealing-based local search procedure that employs the pairwise 20 0 energy poten×2 tials introduced by Berrera and coworkers [64]. Thus, the method combines the efficiency and performance of exact structure prediction in the simpler HP model with established local search schemes in the enhanced energy model using pull moves [92, 93]. When comparing this two-step optimization pipeline with standard simulated annealing procedures based on random start structures, faster convergence to energetically lower structures Enhanced models The HP model implements a strong abstraction of the energetics underlying the protein folding process, as it only models hydrophobic forces. As discussed above, several more fine-grained energy models have been introduced in the literature. While the CPSP is currently the only exact method for protein structure prediction in lattice protein models, it is intrinsically tailored toward the HP model. Nevertheless, it can be extended to HP-related energy models as the HPNX model [46, 74, 75]. Furthermore, it was shown that the HP model can be used within hierarchical approaches to enhance the prediction of optimal structures in more sophisticated energy models providing 20 0-potentials [21, 22]. Both directions are discussed in ×2 the following. Prediction in HP-type models The HPNX model [47] is an extension of the HP model, where polar amino acids are split into (P)ositively and (N)egatively charged amino acids and neutral (X) monomers. Still, it focuses on hydrophobic interactions, such that H-H-contacts have the strongest energy contribution of 4, while H-contacts to all other amino acids are neutral. 222Mann and Backofen: Exact methods for lattice protein models is observed. This results from the energetically lower start conformations, as HP-optimal structures show, on average, about two orders of magnitude lower energies in the enhanced energy model compared with random structures [21, 50]. The impact of such an HP-optimal initialization scheme seems to be strongly connected to the subsequently applied optimization procedure, as Rashid and colleagues [48] found their genetic algorithm to be more efficient with random initializations. in side-chain lattice models. Given the properties of the hydrophobicity-focusing energy function, the number of equivalence classes was suggested as an alternative measure of structural stability besides the commonly used degeneracy. Both measures can only be exactly identified using the CPSP method. The existence of a unique native fold, i.e., lowest energy conformation, is a common criterion to name a model sequence protein-like. Thus far, the CPSP approach is the only method to enable the identification of such sequences. Thus, it was used to generate benchmark sets of protein-like sequences showing different folding properties [78]. Besides the unique native fold, its accessibility through co-translational and unrestricted folding is taken into account. Furthermore, the CPSP approach is often used as a reference to determine the lowest energy accessible for a given sequence to evaluate local search schemes [27, 32]. The use of HP-optimal structure samples was shown to boost protein structure prediction in more sophisticated models [21, 22, 48, 50]. Here, no exact methods are available and thus local search schemes are applied. It was shown that the initialization of local search with HPoptimal structures increases convergence and enhances prediction results. Such a two-step optimization scheme resembles the hydrophobic collapse during globular protein folding. Besides protein structure prediction, the CPSP approach was also used to tackle the inverse folding problem [45, 50], i.e., to find a sequence that has a given structure as its native fold. Here, the CPSP method was needed (i) to confirm that the targeted structure is among the optimal ones and (ii) to check if a sequence's degeneracy is within a targeted upper bound (usually 1). The method can be easily extended to be constrained by equivalence classes if needed. With such a tool at hand, neutral evolution studies are enabled [69, 94, 95]. Therein, the sequence space is screened for mutation-connected subsets that show identical native folds [45, 74]. Summary Lattice protein models enable detailed studies of proteinfolding processes in a discretized, finite but yet exponentially large structure space. The latter renders exhaustive structure enumeration useless for interesting protein lengths. To find native, i.e., energy optimal, structures, mainly local search schemes are applied, which neither ensure to find optimal structure nor enable a thorough investigation of the lowest energy spectrum. Only for hydrophobicity-focusing energy functions, namely the HP and HPNX models, exact methods are known. They are based on the CPSP approach [43­45, 73] applying efficient constraint programming techniques [82, 83]. The CPSP approach uses precomputed sets of maximally compact H-cores to identify optimal structures only. The computation of (sub)optimal H-cores is a difficult computational problem on its own and was solved on the basis of dynamic programming and constraint programming [74, 83, 87, 88]. The sequence independence of H-cores enables a precomputation of an H-core database and its recomputation-free use within the CPSP workflow. This ensures a fast and reliable prediction of HP-optimal structures in 3D lattices. While tailored for the HP model, several extensions of the CPSP approach have produced exact methods for other lattice protein models and applications. Besides the mentioned extension to the HPNX model [45, 46, 74, 75], side-chain models were also targeted [44, 50]. The CPSP approach is the only method enabling a complete and exclusive enumeration of optimal structures. This revealed an extremely high degeneracy, i.e., number of optimal structures, within the HP model. Most of the optimal structures show an equivalent H-monomer placement, while they are only differing in the energetically unconstrained monomers. An extension of the CPSP approach enables the enumeration of corresponding equivalence classes [49, 50], revealing a dramatically lower number of distinct H-monomer positions among optimal structures. This phenomenon is most prominent Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission. Research funding: None declared. Employment or leadership: None declared. Honorarium: None declared. Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication. Mann and Backofen: Exact methods for lattice protein models223 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bio-Algorithms and Med-Systems de Gruyter

Exact methods for lattice protein models

Bio-Algorithms and Med-Systems , Volume 10 (4) – Dec 19, 2014

Loading next page...
 
/lp/de-gruyter/exact-methods-for-lattice-protein-models-RJXuS60smQ

References (131)

Publisher
de Gruyter
Copyright
Copyright © 2014 by the
ISSN
1895-9091
eISSN
1896-530X
DOI
10.1515/bams-2014-0014
Publisher site
See Article on Publisher Site

Abstract

Lattice protein models are well-studied abstractions of globular proteins. By discretizing the structure space and simplifying the energy model over regular proteins, they enable detailed studies of protein structure formation and evolution. However, even in the simplest lattice protein models, the prediction of optimal structures is computationally difficult. Therefore, often, heuristic approaches are applied to find such conformations. Commonly, heuristic methods find only locally optimal solutions. Nevertheless, there exist methods that guarantee to predict globally optimal structures. Currently, only one such exact approach is publicly available, namely the constraint-based protein structure prediction method and variants. Here, we review exact approaches and derived methods. We discuss fundamental concepts like hydrophobic core construction and their use in optimal structure prediction, as well as possible applications like combinations of different energy models. Keywords: CPSP; HP-model; lattice protein; optimal structure; protein structure prediction. DOI 10.1515/bams-2014-0014 Received August 21, 2014; accepted October 17, 2014 Introduction Almost all cellular processes are governed or guided by proteins [1]. Their functionality is always coupled to the formation of a specific three-dimensional structure named *Corresponding authors: Martin Mann, Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany, E-mail: [email protected]; and Rolf Backofen, Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany; Center for Biological Signaling Studies (BIOSS), University of Freiburg, Freiburg, Germany; Center for Biological Systems Analysis (ZBSA), University of Freiburg, Freiburg, Germany; and Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg C, Denmark, E-mail: [email protected] functional, native, or biological fold/conformation. It was shown through refolding experiments that the functional fold of a protein is mainly encoded by its sequence of amino acids [2, 3]. Nevertheless, there is a large number of processes, like crowding effects [4], co-factor binding [5], or chaperons [6], that influence or support the structure formation process within living cells [7]. Still, the folding process is mainly governed by non-covalent intramolecular interactions [8] and misfolding can result in reduced or broken functionality [9]. Several diseases [10­12], and even cancer [13], are caused by misfolded proteins. Two connected but independent problems arise. (i) To study the function of a protein, its native fold has to be identified. This is known as the protein structure prediction (PSP) problem. (ii) To understand the mechanistic details of misfolding, the folding process itself has to be investigated, e.g., using folding simulations. Here, we are focusing on the PSP problem, which resembles the result of the structure formation process. To study the PSP problem, various protein models have been developed. They range from full atom representations in three dimensions used in molecular dynamics simulations [14], through off-lattice bead models [15], and down to very-coarse-grained two-dimensional (2D) lattice models [16]. Three more or less independent layers of abstraction can be found: structure space, sequence space, and energy function. The structure space dictates possible spatial properties of a protein, while sequence space and energy function describe the distinction of different amino acids and their interacting forces, respectively. Assuming the system to be in thermodynamic equilibrium [17], the native fold will be the structure with minimal free energy [2]. Thus, structure prediction can be approached with optimization methods for a given model. Within this work, we will focus on PSP for lattice protein models. We first concentrate on the widely studied hydrophobic-polar model (HP model) by Lau and Dill [16]. Even in this simplistic model, the identification of energyminimal structures is a difficult computational problem [18, 19]. As exhaustive structure enumeration approaches [20] are restricted to very short sequence lengths, usually heuristic methods are applied. They apply various techniques, e.g., simulated annealing [21, 22], quantum 214Mann and Backofen: Exact methods for lattice protein models annealing [23], ant colonization optimization [24], evolutionary algorithms [25], energy landscape paving [26], large neighborhood search [27, 28], constraint programming [29­31], or dedicated search heuristics [32­35]. They enable performance-guaranteed approximations close to the optimum for both backbone-only [36, 37] and sidechain models [38, 39]. For a detailed overview of approximating methods, refer to reviews by Hart and Newman [40] or Istrail and Lam [41]. Here, we review exact methods for the PSP problem for lattice protein models, i.e., approaches that identify energy optimal structures only. Given the large amount of the mentioned recent local search PSP research, it is surprising that currently only two exact methods are known, both targeting the HP model in three dimensions. The first approach was introduced by Yue and Dill [42] and named constrained hydrophobic core construction (CHCC). It is based on the observation that globular proteins show a densely packed hydrophobic core both in reality as well as in the HP model. Thus, energy-optimal structures show an almost maximally dense packing of hydrophobic amino acids. This is used in the CHCC method by considering only (decreasingly) compact hydrophobic packings when searching for optimal structures. While the CHCC approach was the first exact PSP method, it was proven to be incomplete when enumerating all optimal structures [43]. In contrast, the constraint-based protein structure prediction (CPSP) approach [43­45] was shown to be exact and complete when calculating optimal structures. CPSP uses a similar strategy as CHCC and is thus far the only exact and complete method for protein structure prediction in lattice protein models. It is often used as reference when testing local search methods and is discussed in detail within this review. The CPSP approach facilitates methods for more sophisticated energy models. Among them are its extension and application to the HPNX model [46], which also takes polar interactions into account [47]. Furthermore, HP-optimal predictions can be used within a hierarchical prediction approach to search for energy-optimal structures within full 20 0 pairwise potential models [21, 22, ×2 48]. The latter requires the identification of HP-optimal conformations that are structurally diverse. This can be achieved by the introduction of an equivalence relation based on hydrophobic cores and the restricted enumeration of representatives for each equivalence class [49]. In the following, we will introduce the necessary formalisms to discuss lattice protein models and subsequently the CPSP approach. Afterward, we will review the construction of (sub)optimal hydrophobic core packings, a central prerequisite for both the CHCC and CPSP methods. Hydrophobic core information can be used to define equivalence classes of structures within the HP-model, and we will present the enumeration of class representatives based on an extension of the CPSP method. Finally, the extension and application of the CPSP approach to enhanced lattice protein models is illustrated. Lattice protein models The strongest abstraction within a lattice protein model is the allowed structure space, where conformations are discretized on the basis of an underlying regular lattice. Such a lattice L is a set of 3D coordinates that form an additive group for any two points u, v L, i.e., it holds ( u ± v ) L. The neighborhood NLL is the minimal subset of vectors to encode the whole lattice by a linear combination of these vectors using positive integers only, i.e., uL : u = cx x with cx N. xN L (1) Furthermore, we require NL to also contain the reverse vectors, i.e., x N L - x N L , and only vectors of equal lengths, i.e., u, v N L :| u | =| v |. For instance, a 3D cubic lattice is defined by Ncubic={±(1,0,0), ±(0,1,0), ±(0,0,1)}. A wide variety of lattices have been studied in the context of lattice protein models. Common 3D lattices are the cubic (Figure 1) and face-centered-cubic (FCC) lattice (12 neighbors per point, Figure 2), as well as the diamond lattice. For further lattices and details, refer, e.g., to Refs. [50, 51]. The coordinates of a given lattice define possible placements for protein monomers. Within backboneonly models, a protein is represented by its backbone C-atom positions only, while successive C-monomers have to be neighbored in the lattice according to NL. Thus, a backbone-only lattice protein structure P of length n is defined by P P1, ..., Pn)Ln with (P(i+1)­Pi)NL. In addi=( tion, P has to be self-avoiding, i.e., ij: PiPj. Side-chain models extend the amino acid abstraction with a second side-chain monomer, which has to be neighbored to the corresponding C-monomer. For examples, see Figure 1. The modeling quality of real protein conformations by lattice protein structures depends strongly on the underlying lattice used [51­53]. Figure 1 depicts the exponential lattice-dependent growth of the structure space that is detailed in Ref. [50] for different lattices. The sequence space abstraction and energy function are the final determinants for a lattice protein model. Mann and Backofen: Exact methods for lattice protein models215 2 1e+09 2 Number of possible structures 2 1e+06 2 F 2 1e+03 F C C S S F C S F C C S S S 2 F F C S S C C S S S F C S S C C S S S S S 1e+09 Number of possible structures 2 1e+07 2 S F C F C F C F 1e+03 2 C FS C S 10 15 20 C S S S S C F 2 C S S S S Side chain models Lattice - SQR - CUB - FCC - 210 25 S S C S S S S 1e+05 Backbone-only models lattice S C F 2 - SQR - CUB - FCC - 210 25 1e+00 2 C S F 2 F CS CS FCS 2 S C S F C S 1e+01 Protein length n = 5 Protein length n = 5 Figure 1Examples for (left) backbone-only and (right) side-chain lattice protein models in the 3D cubic lattice. Colors encode an HP model: hydrophobic (green), polar (gray), and backbone monomers (pink). (Center) Sequence-length and latticedependent exponential growth of the symmetry-free structure space for different lattices (SQR ­ 2D square, CUB ­ 3D cubic, FCC ­ 3D facecentered cubic, 210 ­ chess knight). Figures are taken from Ref. [50]. Figure 2(Left) Coherence between an optimal structure (top) and the optimally dense packing of H-monomers, i.e., its H-core (bottom) in the 3D FCC HP model. (Right) CPSP workflow. Start and end are marked by open and filled circles, respectively. The number of hydrophobic amino acids within the sequence is denoted by nH, and m is initialized by the maximal number of hydrophobic contacts possible for nH monomers in the underlying lattice. Figures are from Ref. [50]. Various reductions of the amino acid alphabet of proteins are known, which have been combined with corresponding energy functions. Distance-based energy functions either incorporate sequence-based pairwise potentials scaled by the distance [54] or apply special distancedependent potentials [55]. For details, refer to the review by Hart and Newman [40]. In the following, we will discuss contact-based energy models, which can be used to define exact PSP methods. Therein, the energy E of a lattice protein structure P with sequence S is determined by the summation of all pairwise sequence-dependent contact potentials e(Si, Sj). Two non-successive monomers, Pi and Pj, are in contact, if they are neighbored within the lattice. E( S, P ) = 1 if Pi - Pj N L . ( Pi , Pj ) = 0 else. 1i+ 1< jn ( Pi , Pj ) e( Si , S j ), (2) Most abstract is the HP model [16], which models the central impact of hydrophobic forces within structure formation [56]. Here, hydrophobic amino acids are repelled 216Mann and Backofen: Exact methods for lattice protein models from water owing to their non-polar nature, resulting in a crowding of hydrophobic residues within the protein core [57] (see Figure 1). This so-called hydrophobic core is present in almost all globular water-solved protein structures. It is central to the CPSP approach and was also used in recent local search methods [22, 34]. Beneath the suggested energy potentials by Dill and co-workers [58] (e(Si, Sj) 1, if Si and Sj are hydrophobic; 0, otherwise), =­ other potentials have been applied [47, 59]. Banavar and co-workers [60] introduced the THP model to incorporate context-specific hydrophobic contact contributions. Another extension, the HPNX model introduced in Refs. [46, 47] with different potentials, distinguishes four different amino acid groups, namely (H)ydrophobic, (P)ositively charged, (N)egatively charged, and X for all remaining neutral residues. Still, hydrophobic contacts are the strongest potentials. An improved version, the hHPNX model [61], follows the amino acid grouping of the YhHX model suggested by Crippen [3] where alanine and valine are treated as a special group (h). Bornberg-Bauer [47] introduced an integer conversion of the real-valued YhHX model that maintains approximately the same ratios of entries. His potentials were corrected by Hoque and co-workers [61]. Full 20 0 pairwise potentials were pioneered by ×2 Miyazawa and Jernigan [62, 63] and derived from real protein structure information. Simplified potentials have been suggested in Ref. [64]. In first studies, Chan and Dill [65] found contactbased potentials sufficient to enable realistic distributions of secondary structure elements in structure space by the study of small molecules in the 3D cubic lattice. While it is clear to see that very simple lattice models, e.g., in 2D, are difficult to map into the real protein structure space, Vendruscolo and Domany [66] showed general limitations of contact-based potentials to mimic real protein structures. Thus, to enable lattice proteins to predict real native structures, additional constraints are needed. For instance, the prediction and incorporation of secondary structure information yields very promising results [30, 67]. However, even without such extensions, lattice protein models enable studies of general features of protein structure formation and related problems. They have been used to investigate the folding process [68], native structure properties [40], sequence evolution [69], cooperative/competitive folding [70], and co-translational folding [71, 72], to name but a few. In the following, we will first focus on structure prediction of the HP model in 3D lattices. Later on, exact PSP approaches for enhanced energy models will be discussed. Constraint-based protein structure prediction As discussed in the Introduction, the CPSP approach is currently the only exact and complete PSP method for lattice protein models. Its basic version is tailored for 3D backbone-only HP models [43, 45, 73, 74] but was extended to other models. Among them are the extension to the HPNX model [46, 75] or to side-chain structure models [44], both later discussed in detail. In contrast to most tools and approaches in the field, a CPSP implementation is available for local installation [45, 76] and ad hoc web usage [44, 77]. Thus, it has spawned the compilation of extensive benchmark data sets of protein-like sequences with various folding features [78]. Among them are the existence of a unique optimal fold, proven through CPSP, and the accessibility of this structure through unrestricted or co-translational folding using the LatPack package [79]. Within the HP model, only hydrophobic contacts are contributing to the energy. It is therefore sufficient to consider and optimize H-monomer interactions only. The CPSP approach follows, as the CHCC method, the observation that an HP-optimal structure features an (almost) optimal packing of H-monomers, and thus maximizing their contacts (see Figure 2). This spawns the central idea of CPSP: if a structure shows an optimal packing of H-monomers, it has to be an HP-optimal structure. Following this idea, the CPSP screens (sub)optimal packings of H-monomers, the so-called H-cores, for their compatibility with the given sequence. If it is possible to identify a structure that confines the H-monomers of the sequence to the current optimal H-core, an optimal structure was found. This structure threading is done through constraint programming techniques. The overall CPSP workflow is given in Figure 2. In the following, we will first sketch the structure threading step for a given sequence and H-core. In the next section, the computation of H-cores is discussed. A detailed presentation of the whole approach if provided in Ref. [74]. Optimal structure identification The CPSP approach uses constraint-programming techniques to solve the PSP problem. Constraint programming enables the definition of constraint satisfaction problems (CSP) and offers a generic and efficient framework for satisfiability checks [80]. A CSP is defined by a set of variables X, their value domains D, and the set of constraints Mann and Backofen: Exact methods for lattice protein models217 C on X, which define valid variable assignments, i.e., solutions. A sophisticated iterative reasoning and pruning of violating values from the variable domains, combined with tuned search strategies, enable an efficient identification and enumeration of constraint-conform solutions. For further details on general constraint programming techniques, e.g., refer to Ref. [80]. The CSP formulated for the optimal structure identification step in the CPSP approach is based on both the protein's sequence S of length n and the current H-core of interest H in the underlying lattice L. An H-core HL is a set of lattice points with maximally compact positioning (details on their generation are discussed later). Only S-compatible H-cores are considered, i.e., the size of the core equals the number of H-monomers in the sequence. In the following, for simplicity, we will first discuss the CSP for backbone-only models [43, 45, 73, 74]. For each backbone position, a lattice position variable Xi is defined. The domain Di of Xi depends on the corresponding amino acid Si: if hydrophobic (Si ), the variable domain is =H defined by the H-core positions (Di ). Otherwise, the =H monomer has to be placed outside of the core (Di \H). =L As discussed above, this ensures the HP-optimality of any produced structure as long as the H-core shows an optimal packing. To encode valid structures only, additional constraints are enforced: first, self-avoidingness of the chain through global difference of assigned values [81] and, second, the chaining of successive monomers given the neighborhood of the lattice, i.e., x Di : y Di+1 : x - y N L . While formulated here for domains on lattice coordinates, a coordinate integer encoding can be used in order to apply standard finite domain constraint solvers [50, 74, 75]. Provided the H-core shows a maximal number of contacts, any solution satisfying these constraints will represent an optimal structure according to the HP model. If no solution was found, the next H-core with similar compactness is tried. If no optimally packed H-core yields a solution, it is proven that no optimal structure with lower or equal number of hydrophobic contacts exists. Thus, the CPSP approach relates to suboptimal H-cores, which show the highest non-optimal number of hydrophobic contacts and iterates the procedure. Therefore, solutions for such suboptimal H-cores are still globally optimal, as no solution for more compact H-cores was found. Usually no or only a few suboptimality iterations are necessary. Note that the constraint programming framework enables the enumeration of all valid solutions for a CSP. That is, the set of all optimal structures can be enumerated by the CPSP approach when screening all corresponding H-cores. The full workflow is depicted in Figure 2. The complete enumeration of optimal structures has to exclude symmetric conformation. That is, identical structures resulting from rotation or reflection in the lattice have to be avoided. The CPSP approach employs efficient symmetry-breaking techniques within the solution search [82], resulting in a fast and symmetry-free solution enumeration. To this end, for each solution, the CSP is enhanced by lattice-specific constraints that exclude symmetric assignments within the remaining enumeration [82, 83]. The symmetry-free number of optimal structures, known as a sequence's degeneracy, is often taken as a measure of structural stability of the native fold [84]. When extending the approach to side-chain lattice protein models [44, 50], only an adoption of the CSP is needed while the overall workflow is kept. The contact-based energy computation in side-chain models is restricted to side-chain monomers only [50]. In the HP model, only hydrophobic side-chain interactions are contributing to the energy. All other contacts are neutral. Therefore, the CSP is adopted as follows: for each amino acid, two variables are defined, one for the backbone and one for the side-chain monomer position. The backbone monomer is confined to positions outside of the current H-core, while a sequence-specific domain assignment, analogous to the backbone model, is used for the side-chain monomers. In addition to the connectivity constraints applied to the backbone monomers, sidechain-backbone connectivity for each amino acid also has to be ensured. Self-avoidance is enforced for all variables, i.e., all monomer positions. This CPSP modeling extension enables, for the first time, the exact identification of optimal structures in the side-chain HP model in threedimensional lattices [44, 50]. The complete enumeration of optimal side-chain structures is also enabled; however, the enormous structure space growth (see Figure 1) and the high degeneracy of the HP model yield often hundreds to millions optimal structures. This problem is faced when extending the CPSP approach for the enumeration of representatives for H-core equivalence classes, discussed in the "Hydrophobic core equivalence" section. Hydrophobic core construction An H-core can be generated from an optimal conformation by removing all bonds and considering only H-monomers (see Figure 3A). As introduced, the CPSP approach depends on the availability of (sub)optimal H-cores for the lattice of interest. This is based on the relation of the number of contacts in the H-core and the number HH-contacts in a 218Mann and Backofen: Exact methods for lattice protein models Layer 1 n1 = 3 Hs 1 6 7 5 3 4 6 7 5 3 4 2 3 2 surface contacts 2 surface contacts 2 surface contacts 2 surface contacts 5 4 + Interlayer contacts + Layer 2 n2 = 4 Hs 6 7 Figure 3H-core construction. (A) Optimal conformation for the sequence PPHPHHPPPPHHHHPPP in the 3D cubic lattice together with the associated hydrophobic core and its decomposition into layers. (B) Relation between HH-contacts (blue) and surface contacts (red) exemplified for layer 1. conformation featuring this H-core. Formally, the contact number of an H-core H is defined by I ( H ) =|{{ x , y } | x , y H x - y N L } |. (3) Now, let a conformation P for a sequence S have an associated H-core H. Then, the number of HH-contacts E(S, P) relates to I(H) by E ( S , P ) = I ( H ) - number of HH-bonds. (4) Here, HH-bonds denote the bonds between successive H-monomers in the sequence, which are not contributing to the energy. Thus, optimal conformations also have optimal or near-optimal H-cores, where we define an H-core HL to be optimal if its contact number I(H) is maximal for this H-core size |H|. Figure 3A shows the basic idea of using a hydrophobic core construction, as first introduced by Yue and Dill [42]. It also depicts the decomposition of H-cores in order to efficiently determine optimal ones. Here, an H-core is broken into individual layers (right part of Figure 3A), such that we can dissemble a core contact I(H) into layer and interlayer contacts. Thus, the hydrophobic core construction reverses direction (indicated with blue arrows). On the basis of branch-and-bound techniques, (sub) optimal layers are generated, which are then composed into (sub)optimal hydrophobic cores. Finally, conformations are threaded onto the optimal cores as described above. Thus, the success of the CPSP approach relies on good bounds for the layer generation to effectively generate hydrophobic cores. Therein, the basic question is the following: given a specific distribution of H-monomers onto each layer (e.g., in Figure 3A: n1 and n2 ), what is the maximal contact =3 =4 number of an associated H-core? As indicated before, one distinguishes between layer and interlayer contacts for this purpose, which we will illustrate with the example from Figure 3. Therefore, let us concentrate first on the number of layer contacts of the three H-monomers in layer 1 (Figure 3B). This placement produces two HH-contacts. As shown by Yue and Dill [42], the number of layer contacts is related to the surface of the minimal rectangle around all H-monomers. For this, we need to identify the surface contacts of a specific layer, which is the number of unoccupied neighboring positions of H-monomers. Figure 3B highlights the eight surface contacts for layer 1 in red. As every H-monomer has four neighbors in a 3D cubic layer, we know that four times the number of H-monomers must be equal to twice the number of HH-contacts plus the number of surface contacts (4 H+surfaces). Thus, ×n=2×H we get the equation 4 +8. This implies that mini×3=2×2 mizing the surface in a layer is maximizing the number of HH-contacts. Furthermore, the number of surface contacts is exactly the perimeter of the minimal rectangle around all H-monomers (both 8 in Figure 3B). For that reason, a placement of n H-monomers that optimizes the number of HH-contacts is found by using the minimal rectangle (in the following, called frame) with height a = n and n width b = , or vice versa. a This upper bound can be improved in several ways. If one considers four H-monomers to be placed in one layer, then the optimal frame has the dimension 2 . However, ×2 four H-monomers cannot be placed into this frame if the protein's subsequence containing these four H-monomers is HHPHH. The reason is that the enclosed P, which is called P-singlet [42], cannot be placed outside the 2 ×2 frame, which contains all H-monomers, while being in contact to its flanking H-monomers. Thus, the optimal frame is not compatible with the sequence at hand. This can be solved by including these P-singlets in the number of H-monomers to be placed when calculating the optimal surrounding frame. Another improvement was introduced in Ref. [85] by using a special property of the cubic lattice. Mann and Backofen: Exact methods for lattice protein models219 Given a point p = ( x , y , z ) in the cubic lattice, we define a point to be even if x+y+z is even, and odd otherwise. As the neighbors of any point in the cubic lattice differ by one in exactly one coordinate, an even point has only odd neighbors, and vice versa. In a sequence, we also have even and odd monomers, and this parity also translates to the parity of occupied positions. Thus, instead of considering a distribution of the number of H-monomers per layer, one can consider distributions of even and odd H-monomers to layers [85]. If there are two even and two odd monomers, as in the case of Figure 3A for layer 2, then the minimal frame has the dimension 2 , which yields four ×2 HH-contacts. If, instead, three even and one odd (or vice versa) monomers are to be placed, then the optimal frame has dimension 2 with only three HH-contacts [85]. ×3 Concerning the number of interlayer contacts, the problem is relatively easy for the cubic lattice, as each H-monomer in a layer can have at most one HH-contact to exactly one H-monomer in the following layer. Thus, given two successive layer frames, one possible upper bound for the number of interlayer contacts is given by the maximal number of overlap positions between both frames. The dimensions of this overlap are defined by the minimum of the heights and widths of the associated frames (see Figure 4). The problem becomes more complicated for the 3D FCC lattice. The layers, when using a corresponding splitting, are again equivalent to positions in the square lattice. Thus, most of the results for layers described above for the cubic lattice can be transferred to the layers in the FCC. However, the upper bound for the interlayer contacts is much more complex. To the best of our knowledge, only one upper bound for the interlayer contacts in the FCC lattice has been developed thus far [86, 87]. The problem is that an H-monomer can have zero to four contacts to H-monomers in the next layer. For that reason, Backofen [87] developed bounds for the number of positions in the next layer having one to four contacts (called 1-, 2-, 3-, and 4-points), given a frame with additional properties in the current layer. To give an example, it was shown that the number of 3-points for a given placement of H-monomers can be determined by considering the longest intersection of 45° diagonals with unoccupied positions in the frame [87] (see Figure 4B). This implies that the *-point composition of the layers has to be considered for determining a bound on the interlayer contacts. Once upper bounds for layer and interlayer contacts are found for a given H-core size, dynamic programming is used to identify optimal frame sequences for a contact bound [43, 74, 88]. These frame sequences are then instantiated to optimal H-cores through constraint programming. This frame sequence and H-core enumeration and its extension to suboptimal H-core generation are beyond the scope of this review, and we refer to the relevant literature [43, 74, 88, 89]. Hydrophobic core equivalence The hydrophobicity-focusing energy function in HP models results, on average, in a vast number of optimal structures. As polar residues do not contribute to the energy, optimal structures usually show a much higher variation in the placement of polar than hydrophobic residues. While the latter form a compact core, polar residues are placed arbitrarily loose around it. This is even more severe in side-chain models where most of the monomers is unconstrained by the energy function, namely both backbone and polar side-chain monomers. When considering only the relative positioning of a protein's hydrophobic monomers, only few distinct patterns occur, as depicted in Figure 5. These H-core placements enable the definition of an equivalence relation H ~ in order to group optimal structures accordingly. Formally, two backbone-only structures, P P1, ..., Pn) and =( b1=3 Layer 1 Layer 2 Interaction area 3-points in next layer a1=2 2=min(a1,a2) Figure 4(A) Two successive layers in the 3D cubic lattice filled with six H-monomers each. The associated frames have dimensions (a1; b1) and (a2; b2). The maximal number of interlayer contacts is 4 in(a1; a2) in(b1; b2). (B) Placement of H-monomers in a layer of the FCC =m ×m lattice. The longest intersection of 45° diagonals with unoccupied positions are indicated with yellow diagonals. The sum of the associated quantities is three, which is the number of positions in the next FCC layer that have exactly three H-neighbors in the current layer when filled with an H-monomer. These points are called 3-points and are indicated with blue circles, together with their possible contacts. a2=3 2=min(b1,b2) b2=2 220Mann and Backofen: Exact methods for lattice protein models Figure 5Different optimal structures (A, C) for the sequence PHHHPHPPPP in the 2D square lattice that show the same relative H-monomer placement (B), i.e., are within one equivalence class. The structure on the right highlighted in red (D) is also part of this equivalence class when considering reflection. This exemplifies the symmetry problem for equivalence detection. Figures taken from Ref. [50]. ^ ^ ^ P = ( P1 , ..., Pn ), for a sequence S are said to be equivalent if they show identical H-monomer placements according to some symmetrical shift. This is given by H ^ ^ P ~ P rR t L : i|S =H : Pi = PRr + t , i (5) where denotes the set of all symmetry functions (according to rotation and reflection) within the underlying lattice L, while Rr represents the rotation/reflection matrix for a symmetry r (see to Refs. [50, 74] for details). The translation vector t L shifts the symmetric structure into mapping. See Figure 5 for examples. This equivalence definition can be extended to side-chain lattice protein models (see Refs. [49, 50]). This results in corresponding equivalence classes within the structure space for a given sequence. The CPSP approach can be used to enumerate all equivalence classes comprising optimal structures for a given protein [49, 50]. To this end, one representative structure per class is identified. It is based on the observation that only optimal structures confined to the same H-core can be equivalent. Thus, the overall CPSP workflow can be kept, and only the CSP solution identification has to be adopted to avoid the enumeration of equivalent structures. This is achieved by a dedicated variable assignment strategy. Only for hydrophobic monomers, a complete solution enumeration is done excluding symmetries. For each H-monomer assignment, a satisfiability check for the placement of the remaining monomers is done, which results in a single representative optimal structure for the equivalence class, if possible. Note that the symmetry problem is already solved within the CPSP workflow as discussed above. Figure 6 compares the number of optimal structures (a sequence's degeneracy) with the number of equivalence classes for a large set of sequences. It reveals that the number of equivalence classes is several orders of magnitude smaller than a sequence's degeneracy. Furthermore, it reveals the extreme increase of optimal structures in side-chain HP models compared with backbone models due to energetically unconstrained polar and backbone monomers. Here, the difference to equivalence classes Optimal structures Representatives 8000 Number of sequences 800 6000 Number of sequences Optimal structures Representatives 0 1 10 100 1000 10000 1e+05 >10^6 Number of structures 1e+05 >10^6 Number of structures Figure 6Distribution of the number of representatives (green) vs. the overall number of optimal structures (degeneracy, red). They are exemplified for HP models in the 3D cubic lattice for sequence length 27. (Left) Results correspond to backbone-only predictions for 100,000 random sequences; (right) side-chain model predictions for 10,000 random sequences. The distributions are only shown with exact values for degeneracy below 106; all sequences with a higher degeneracy are collected in the rightmost bar in pink representing " 06." Figures taken from Ref. [50]. >1 Mann and Backofen: Exact methods for lattice protein models221 is even more extreme. The distributions of equivalence classes are similar when comparing backbone-only and side-chain models, while backbone-only models show a slightly higher number on average. This results from the increased freedom in hydrophobic side-chain monomer positioning in side-chain models, as only the neighborhood to the backbone monomer has to be ensured. In backbone-only models, each H-monomer has to be neighbored to the preceding and succeeding backbone monomer in the chain. Thus, optimal side-chain model structures show, on average, a more compact H-core compared with optimal backbone-only structures for the same sequence. As the number of H-cores decreases for increasing compactness, fewer H-cores and thus less equivalence classes are found for side-chain structure models. A sequence's degeneracy, which can be computed by the standard CPSP approach, is a measure of structural stability [84]. However, as discussed, it is flawed in the HP model owing to the missing energetic constraints on the non-hydrophobic monomers. Thus, the number of equivalence structures was suggested as a new measure of structural stability in the HP model as an alternative to degeneracy [49, 50]. N-P-contacts have an energy contribution of -1, while P-Pand N-N-contacts are penalized by +1. Contacts to X are neutral (contact potential is 0). Backofen and Will [46, 74, 75, 83] have introduced an exact constraint programming-based PSP approach for the HPNX model. The approach applies a branch-and-bound search on a constraint optimization problem, which encodes feasible lattice protein structures (similar to the CPSP model) in concert with the described HPNX-energy function to be minimized. The performance is based on sophisticated lower-energy bounds for partially defined solution structures, in order to prune energetically unfavorable parts of the search space. This way, the authors yield an efficient and exact PSP method that allows optimal structure prediction and enumeration [46, 74, 75]. Optimized prediction in full potential models For more sophisticated energy models, as, e.g., the 20 0 ×2 pairwise potentials by Miyazawa and Jernigan [62, 63], no efficient and exact PSP approach is known thus far. Here, usually, local search strategies are applied. Local search is a generic optimization scheme that minimizes a given objective function. Given a neighborhood relation within the search space to traverse, local search is able to follow the gradient to identify local or even global minima [90, 91]. Local search approaches work well in practice for more sophisticated energy functions but usually require a large number of steps to converge. Each search usually starts with a random start conformation to enable a good coverage of the structure space. Ullah and coworkers [21] introduced a protein-folding simulation procedure that employs two stages of optimization to find structures of minimal energy. The method mimics the hydrophobic collapse during protein folding, i.e., the hydrophobicity-driven, fast formation of an initial structure and the following folding into the functional fold. Thus, the protein sequence is first collapsed into a compact HP-optimal structure using the CPSP approach. Successively, the CPSP output is given as input to a simulated annealing-based local search procedure that employs the pairwise 20 0 energy poten×2 tials introduced by Berrera and coworkers [64]. Thus, the method combines the efficiency and performance of exact structure prediction in the simpler HP model with established local search schemes in the enhanced energy model using pull moves [92, 93]. When comparing this two-step optimization pipeline with standard simulated annealing procedures based on random start structures, faster convergence to energetically lower structures Enhanced models The HP model implements a strong abstraction of the energetics underlying the protein folding process, as it only models hydrophobic forces. As discussed above, several more fine-grained energy models have been introduced in the literature. While the CPSP is currently the only exact method for protein structure prediction in lattice protein models, it is intrinsically tailored toward the HP model. Nevertheless, it can be extended to HP-related energy models as the HPNX model [46, 74, 75]. Furthermore, it was shown that the HP model can be used within hierarchical approaches to enhance the prediction of optimal structures in more sophisticated energy models providing 20 0-potentials [21, 22]. Both directions are discussed in ×2 the following. Prediction in HP-type models The HPNX model [47] is an extension of the HP model, where polar amino acids are split into (P)ositively and (N)egatively charged amino acids and neutral (X) monomers. Still, it focuses on hydrophobic interactions, such that H-H-contacts have the strongest energy contribution of 4, while H-contacts to all other amino acids are neutral. 222Mann and Backofen: Exact methods for lattice protein models is observed. This results from the energetically lower start conformations, as HP-optimal structures show, on average, about two orders of magnitude lower energies in the enhanced energy model compared with random structures [21, 50]. The impact of such an HP-optimal initialization scheme seems to be strongly connected to the subsequently applied optimization procedure, as Rashid and colleagues [48] found their genetic algorithm to be more efficient with random initializations. in side-chain lattice models. Given the properties of the hydrophobicity-focusing energy function, the number of equivalence classes was suggested as an alternative measure of structural stability besides the commonly used degeneracy. Both measures can only be exactly identified using the CPSP method. The existence of a unique native fold, i.e., lowest energy conformation, is a common criterion to name a model sequence protein-like. Thus far, the CPSP approach is the only method to enable the identification of such sequences. Thus, it was used to generate benchmark sets of protein-like sequences showing different folding properties [78]. Besides the unique native fold, its accessibility through co-translational and unrestricted folding is taken into account. Furthermore, the CPSP approach is often used as a reference to determine the lowest energy accessible for a given sequence to evaluate local search schemes [27, 32]. The use of HP-optimal structure samples was shown to boost protein structure prediction in more sophisticated models [21, 22, 48, 50]. Here, no exact methods are available and thus local search schemes are applied. It was shown that the initialization of local search with HPoptimal structures increases convergence and enhances prediction results. Such a two-step optimization scheme resembles the hydrophobic collapse during globular protein folding. Besides protein structure prediction, the CPSP approach was also used to tackle the inverse folding problem [45, 50], i.e., to find a sequence that has a given structure as its native fold. Here, the CPSP method was needed (i) to confirm that the targeted structure is among the optimal ones and (ii) to check if a sequence's degeneracy is within a targeted upper bound (usually 1). The method can be easily extended to be constrained by equivalence classes if needed. With such a tool at hand, neutral evolution studies are enabled [69, 94, 95]. Therein, the sequence space is screened for mutation-connected subsets that show identical native folds [45, 74]. Summary Lattice protein models enable detailed studies of proteinfolding processes in a discretized, finite but yet exponentially large structure space. The latter renders exhaustive structure enumeration useless for interesting protein lengths. To find native, i.e., energy optimal, structures, mainly local search schemes are applied, which neither ensure to find optimal structure nor enable a thorough investigation of the lowest energy spectrum. Only for hydrophobicity-focusing energy functions, namely the HP and HPNX models, exact methods are known. They are based on the CPSP approach [43­45, 73] applying efficient constraint programming techniques [82, 83]. The CPSP approach uses precomputed sets of maximally compact H-cores to identify optimal structures only. The computation of (sub)optimal H-cores is a difficult computational problem on its own and was solved on the basis of dynamic programming and constraint programming [74, 83, 87, 88]. The sequence independence of H-cores enables a precomputation of an H-core database and its recomputation-free use within the CPSP workflow. This ensures a fast and reliable prediction of HP-optimal structures in 3D lattices. While tailored for the HP model, several extensions of the CPSP approach have produced exact methods for other lattice protein models and applications. Besides the mentioned extension to the HPNX model [45, 46, 74, 75], side-chain models were also targeted [44, 50]. The CPSP approach is the only method enabling a complete and exclusive enumeration of optimal structures. This revealed an extremely high degeneracy, i.e., number of optimal structures, within the HP model. Most of the optimal structures show an equivalent H-monomer placement, while they are only differing in the energetically unconstrained monomers. An extension of the CPSP approach enables the enumeration of corresponding equivalence classes [49, 50], revealing a dramatically lower number of distinct H-monomer positions among optimal structures. This phenomenon is most prominent Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission. Research funding: None declared. Employment or leadership: None declared. Honorarium: None declared. Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication. Mann and Backofen: Exact methods for lattice protein models223

Journal

Bio-Algorithms and Med-Systemsde Gruyter

Published: Dec 19, 2014

There are no references for this article.