Prediction of protein–protein interactions by combining structure and sequence conservation in protein interfaces

BIOINFORMATICS ORIGINAL PAPER
Vol. 21 no. 12 2005, pages 2850–2855
doi:10.1093/bioinformatics/bti443

Structural bioinformatics

A. Selim Aytuna, Attila Gursoy∗ and Ozlem Keskin∗
Koc University, Center for Computational Biology and Bioinformatics, College of Engineering, Rumelifeneri Yolu 34450 Sariyer, Istanbul, Turkey

Received on February 2, 2005; revised on April 1, 2005; accepted on April 7, 2005
Advance Access publication April 26, 2005

∗To whom correspondence should be addressed.

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

ABSTRACT

Motivation: Elucidation of the full network of protein–protein interactions is crucial for understanding the principles of biological systems and processes. Thus, there is a need for in silico methods for predicting interactions. We present a novel algorithm for automated prediction of protein–protein interactions that employs a unique bottom-up approach combining structure and sequence conservation in protein interfaces.

Results: Running the algorithm on a template dataset of 67 interfaces and a sequentially non-redundant dataset of 6170 protein structures, 62 616 potential interactions are predicted. These interactions are compared with the ones in two publicly available interaction databases (Database of Interacting Proteins and Biomolecular Interaction Network Database) and also the Protein Data Bank. A significant number of predictions are verified in these databases. The unverified ones may correspond to (1) interactions that are not covered in these databases but known in literature, (2) unknown interactions that actually occur in nature and (3) interactions that do not occur naturally but may possibly be realized synthetically in laboratory conditions. Some unverified interactions, supported significantly with studies found in the literature, are discussed.

Availability: http://gordion.hpc.eng.ku.edu.tr/prism

Contact: agursoy@ku.edu.tr; okeskin@ku.edu.tr

1 INTRODUCTION

Proteins rarely act in isolation; different levels of complexity of biological systems arise not only from the number of the proteins (genes) of the organism but also from the combinatorial interactions among them (Valencia and Pazos, 2002; Ferrer and Harrison, 1999). One of the primary objectives of the post-genomic era is the elucidation of the interactions in cellular systems. The detailed knowledge of the full network of protein–protein interactions, i.e. the distribution and the number of interactions as well as the presence of key nodes in these networks, is expected to provide new insights into the structures and properties of biological systems. Thus, bioinformatics and computational approaches are becoming increasingly important venues as large amounts of data become available. Despite the ongoing effort to decipher the complex nature of protein interactions, they are still not entirely understood (Kortemme and Baker, 2004; Chakrabarti and Janin, 2002; LoConte et al., 1999; Jones and Thornton, 1997). The broad recognition of the importance of characterizing the set of all protein interactions in a cell is reflected in the development of various experimental and computational techniques. These attempts shed light on both the global features and the specifics of the interactions for different types of interactions.

Various experimental methods have been developed to identify protein–protein interactions in various organisms. These involve (1) the traditional top-down proteomic approach, where the experiments have been individually designed to identify and validate a small number of specifically targeted interactions or (2) the bottom-up genomic approach, the recently developed high-throughput experiments designed to probe all the potential interactions within an entire genome exhaustively. The latter approach makes use of high-throughput mass spectrometry (Gavin et al., 2002), the yeast two-hybrid system (Ito et al., 2001) and phage display libraries (Ferrer and Harrison, 1999; Wu et al., 1999). These methods have so far yielded a considerable amount of data on protein–protein associations and their relative binding strengths. However, many false positives and false negatives identified in these high-throughput experiments highlight the need for caution when interpreting their results. Still, binary interaction results of these experiments are invaluable to interpret protein–protein interactions and construct protein–protein networks (Salwinski and Eisenberg, 2003; Lu et al., 2002). Experimentally verified interactions have been compiled in various large scale protein–protein interaction datasets (Gavin et al., 2002; Ito et al., 2001; Xenarios et al., 2002; Bader et al., 2003).

Computational methods can address protein–protein interactions at different levels. They may focus on in-depth analysis or carry out a broad scale analysis across large datasets. Through genomic and protein sequence analysis, they may infer whether proteins do interact (Valencia and Pazos, 2002; Marcotte et al., 1999; Salwinski and Eisenberg, 2003; Lu et al., 2002). Or, through structural analysis of proteins and their complexes, they may provide interaction details, essential for understanding processes at the microscopic level (Kortemme and Baker, 2004; Salwinski and Eisenberg, 2003; Chakrabarti and Janin, 2002; LoConte et al., 1999; Jones and Thornton, 1997). Methods using genomic and protein sequence data include analysis of presence or absence of genes in related species, conservation of gene neighborhood, gene fusion events, similarity of phylogenetic trees, correlated mutations on protein surfaces and co-occurrence of sequence domains (Valencia and Pazos, 2002; Salwinski and Eisenberg, 2003). Methods making use of structural data usually strive to identify functional protein interfaces and rely on considerations of the solvent accessible surface area buried upon association (Janin, 1997), free energy changes upon alanine-scanning mutations (Thorn and Bogan, 2001), in silico two-hybrid systems (Pazos and Valencia, 2002), scoring functions based on statistical potentials (Ponsting et al., 2000), physicochemical and geometric properties of the surface, such as electrostatics, hydrophobicity, amino acid composition, shape complementarity and planarity (Jones and Thornton, 1997; Keskin et al., 2004) and evolutionary conservation (Pazos et al., 1997; Lichtarge and Sowa, 2002; Keskin et al., 2005).

Computational and experimental methods concentrate on the protein–protein interaction problem from different aspects. Therefore, no single method can adequately discover the interactome fully. Converging toward an ideal solution will involve unification of different methods that take up the problem from different, innovative perspectives. This will provide a more complete picture of living cells, leading to a better understanding of biological processes.

1.1 Protein interfaces

Proteins associate through binding sites. These sites are believed to contribute to the biomolecular recognition and binding of proteins by providing specific chemical and physical properties necessary for these processes. There have been many studies on protein–protein interaction and binding regions. These studies aim to provide a deeper insight into the nature and mechanism of protein recognition (Jones and Thornton, 1996). It has been found that binding regions usually bury surface areas <2000 Å², and these sites usually form a single patch, in contrast to larger multipatch interfaces (Chakrabarti and Janin, 2002). In a recent study, it has been shown that different protein folds may combinatorially assemble to yield similar local interface motifs (Keskin et al., 2004).

Alanine scanning mutagenesis is a very powerful method to analyze the contributions of individual amino acids to protein–protein binding by systematic replacement of protein interface residues by alanine and by measuring the drop in the resultant binding free energy. These experiments show that the residues at protein–protein interfaces do not contribute equally to the binding free energy. Rather, there are only small sets of hotspot residues at interfaces that contribute significantly to the binding free energy of the interaction (Clackson and Wells, 1995). Many subsequent studies suggest that the presence of a few hotspots may be a general characteristic of most protein–protein interfaces (Thorn and Bogan, 2001). These generally polar residues are found to be highly correlated with the residues that are structurally conserved through evolution to optimize function, structure and stability of the protein complexes and to enhance the feasibility of protein–protein associations (Keskin et al., 2005).

Many of the residues on interfaces that are critical for binding are likely to be evolutionarily conserved. This is because the pace of evolution at interfaces is slower than in the rest of the protein (Fraser et al., 2002). The cause of this slower pace of evolution at interfaces can be explained by the phenomenon of co-evolution, in which substitutions in one protein result in selection pressure for reciprocal changes in interacting partners (Pazos et al., 1997; Fraser et al., 2002). If mutations accumulated during the evolution of an interacting partner are not compensated by correlated mutations in the other partner, the interface, and consequently the interaction, is likely to be disrupted. The alanine scanning mutagenesis method is actually based on this principle. Supportive arguments for co-evolution at protein–protein interfaces have been documented in two different studies. In the first one, corresponding phylogenetic trees of interacting proteins were argued to display, in certain cases, a greater degree of similarity than do non-interacting proteins, due to co-evolution (Auerbach et al., 2003; Fryxell, 1996). In the second one, evolutionarily convergent binding sites were found to correspond to the energetically most favorable states (Kortemme and Baker, 2004).

Through time, differences in paces of evolution result in accumulation of similar interfaces across different complexes, accomplishing different functions. In a way, evolution has reused ‘good’ favorable interface structural scaffolds and adapted them to different functions (Keskin et al., 2004).

In this paper, we present a novel, efficient algorithm to predict potential protein–protein interactions and complexes. We start with a set of structurally known protein interfaces, then seek pairs of proteins that share structure and evolutionarily conserved residue (hotspot) similarity to our known interface dataset. A list of potentially interacting protein pairs is obtained as a final result. Some of these interacting pairs are verified in the Biomolecular Interaction Network Database (BIND) (Dandekar et al., 1998) and Database of Interacting Proteins (DIP) (Xenarios et al., 2002) and the Protein Data Bank (PDB) (Berman et al., 2000) itself.

The approach and the implementation of the algorithm are elaborated in Section 2. Discussions on prediction results and some case studies are presented in Section 3.

2 SYSTEMS AND METHODS

The rationale of our protein–protein prediction algorithm is that, if any two structures contain particular regions on their surfaces that resemble the complementary partners of a known interface, they ‘possibly interact’ through these regions. In other words, if A is known to interact with B, a shares similarity with the binding site of A and b shares similarity with the binding site of B, then we predict that a interacts with b. This resemblance indicates the ability of these structures to structurally and evolutionarily complement each other along an interface, as chains of any template interface do. Figures 1 and 2 show the top level pseudocode and schematic outline of our algorithm, respectively.

The algorithm requires a ‘template’ dataset, i.e. the representative dataset of ‘available’ interfaces; and a ‘target’ dataset, to seek every potential binary interaction between its members. The template dataset handles structure and sequence conservation by combining two previously generated datasets: the structurally non-redundant dataset of protein–protein interfaces extracted from the PDB and the set of conserved residues on these interfaces (computational hotspots). The target dataset is a sequentially non-redundant set of all protein complexes and chains in the PDB.

    for all proteins in target dataset do
        surface ← extract surface of protein
        for all interfaces in template interface do
            for all partners in interface do
                if (size of surface) ≥ 0.7 x (size of partner) then
                    alignments ← align surface with partner
                    best alignment ← sim_score(alignments)
                    if similarity score(best alignment) ≥ threshold then
                        similarity list ← flag protein for prediction partner
    proceed to verification ← similarity lists

Fig. 1. Top level pseudocode of our protein–protein interaction prediction algorithm.

Fig. 2. Top level schematic outline of our protein–protein interaction prediction algorithm.

2.1 The template interface dataset

Keskin et al. (2004) describe a method for finding a structurally and sequentially non-redundant subset of all existing interfaces formed between two protein chains in dimers, trimers or higher complexes of proteins in PDB. They apply this method to get a set of 103 clusters of structurally related interfaces and their representatives.

In generation of this dataset, first, all existing interfaces formed between two protein chains in dimers, trimers or higher complexes of proteins were extracted from the PDB. Interfaces were defined as the set of residues representing a region through which two polypeptide chains bind to each other through non-covalent interactions. This set consisted of contacting residues between the chains (interacting residues), and those that are in their vicinity within a certain distance threshold (neighboring residues), representing the scaffold of the interface. Two residues from the opposite chains were marked as interacting, if there was at least a pair of atoms, one from each residue, at a distance smaller than the sum of their van der Waals radii plus a threshold of 0.5 Å. If the C-α of a non-interacting residue lay at a distance of at most 6.0 Å from a C-α of an already assigned interface residue in the same chain, it was flagged as a neighboring residue. After the interfaces were extracted, they were clustered with respect to their structural similarities. The dataset is available at http://gordion.hpc.eng.ku.edu.tr/prism

Ma et al. (2003) discovered that particular residues are conserved on structurally similar interfaces, to an extent that suffices to distinguish between binding sites and exposed protein surfaces. Moreover, they found that these conserved residues were highly correlated with polar residue hotspots, residues that bear more importance than others in defining affinity and stability of an interaction.

The subsequent work of Keskin et al. (2005) describes a method to find structurally conserved residues on clusters of structurally related interfaces. They have applied this method on the resulting dataset of Keskin et al. (2004) and enhanced it with sequence conservation data, which they call computational hotspots. In their method, they structurally aligned members of a given non-redundant interface cluster along their spatially recurring substructural motifs. Then, they considered the frequencies of identically matched residues along the multiply aligned substructures. If a residue matched identically on >50% of the multiply aligned structures, it qualified as a hotspot.

This procedure resulted in 67 interfaces that contained at least one hotspot. The final set contained members as diverse as enzymes, antibodies, viral capsids, etc. We import this dataset as our template interface dataset. We assume that this non-redundant dataset both structurally and evolutionarily (through computational hotspots) represents a subset of ‘available’ interfaces in the PDB. The complete list of these 67 interfaces can be accessed through the URL: http://gordion.hpc.eng.ku.edu.tr/prism

2.2 The target dataset

The target dataset consists of the list of monomers and complexes that will be compared with the template dataset for structural and evolutionary similarities. Our algorithm predicts interactions by identifying pairs of proteins that may potentially interact in this dataset. The dataset is generated in two steps.

The first step involves the extraction of a non-homologous set of proteins obtained by applying a sequence identity filter of 50% to all existing protein structures in PDB [online service is available at http://www.pdb.org (Li et al., 2002)]. This preliminary list contains 5427 proteins, as of January 27, 2004.

This dataset is then expanded in the second step by splitting multimeric proteins into their constituent chains. But to avoid disturbing the non-redundant nature of the dataset, pairwise sequence alignments are carried out before splitting [by invoking FASTA (Pearson and Lipman, 1988)] and identical partner chains within the complexes are removed (i.e. homodimers) by grouping chains into sets and choosing a representative for each of them.

After these processes, the target dataset becomes a non-homologous subset of all the polypeptide chains and complexes existing in PDB. The polypeptide chains may be in the form of monomers or in the form of isolated constituent chains of multimeric complexes. As of January 27, 2004, the target dataset consists of 6170 structures. Of these structures, 1981 are multimeric and 4189 are monomeric. Of the monomeric structures, 2483 are derived from complexes.

2.3 The algorithm

To find every possible binary interaction between pairs of structures in the target dataset, we need a method to measure the similarity between partners of these representative interfaces and surfaces of target proteins. Accordingly, we extract the surfaces of target proteins and perform successive structural alignments between these surfaces and the partner chains of interfaces in the template interface dataset, in an all-against-all manner. This enables us to measure the ‘similarity’ of a target structure to a template interface partner. If the surfaces of two target proteins (A and B) contain regions ‘similar’ to complementary partner chains of a template interface, we say A and B may interact through these ‘similar’ regions. Figures 1 and 2 show the top level pseudocode and the schematic flow of our algorithm, respectively.

The algorithm starts by extracting the surfaces of target structures by invoking the NACCESS program (Hubbard and Thornton, 1993). Along with the atomic accessible surface, NACCESS calculates the relative surface accessibilities (RSA) of residues. Jones and Thornton (1997) argue that residues whose RSAs (percentage of accessibility compared with the accessibility of that residue type X in an extended ALA-X-ALA tripeptide) are >5% can be considered to be on the surface. We adopt the same criterion to qualify surface residues.

The algorithm then checks whether particular regions on the target surfaces resemble the complementary partners of representative interfaces in the template dataset. This necessitates a defined way to measure the structural and evolutionary similarities between a target surface and a representative interface partner. But before the similarities can be measured, the structures need to be structurally aligned. First, each representative interface picked from the template dataset is split into its constituent partners. Since the template dataset comprises only two-chain interfaces, this process always results in two partners per interface. These individual partners are then structurally aligned with the target surface, by invoking MULTIPROT (Shatsky et al., 2004). MULTIPROT detects common geometrical cores between given protein structures in a sequence-order-independent way. This feature makes MULTIPROT a favorable selection for the task, since protein surfaces and protein–protein interfaces have sequence discontinuity. MULTIPROT returns 10 best substructural matches resulting from every possible alignment. Each substructure corresponds to different regions on the surface, bearing

Table 1.
A selected set of verified and unverified predictions Left Right Verified Prediction Template Left function Right function partner partner Dbase score 1cov1 1h8tC 4.192 1cov13 Coxsackievirus coat protein Echovirus 11 coat protein 1dgi 1ncqC 3.867 1cov13 Poliovirus receptor Coat protein Vp3 1lq8{AECG} 1jjo{EF} P 3.453 1as4AB Plasma serine protease inhibitor Neuroserpin 2ae2{AB} 1e7w{AB} D,B,P 2.873 1e92AC Tropinone reductase-II Pteridine reductase 2sicE 1lw6I P 2.749 2sniEI Subtilisin BPN Subtilisin-chymotrypsin inhibitor-2A 1mho 1psb{AB} D,B,P 2.484 1mr8AB S-100 protein S-100 protein, Beta chain 1hj9 1jbl 2.469 1sbwAI Beta-trypsin Cyclic trypsin inhibitor 2tnf{ABC} 1dg6 D,P 2.225 1cdaAB TNF TNF related apoptosis inducing ligand 1fxkC 1jm7B 2.110 1jm7AB Prefoldin Brca1-associated ring domain protein 1 1gk6{AB} 2ebo{ABC} P 2.088 1cosAB Vimentin Ebola virus envelope glycoprotein 1kb9K 1n8v 2.077 1hezCE Light chain (VI) of Fv- fragment Chemosensory protein 1i4k1 1m5q{A..Z12} 2.074 1i4k12 Putative Snrnp Sm-like protein Small nuclear ribonucleoprotein homolog 1l8d{AB} 1c17 2.036 1jgcAC RAD50 Atpase ATP synthase subunit C 1mso{AC} 1mso{BD} P 1.981 6rlxAB Insulin like growth factor A-chain Insulin like growth factor B-chain 1ixm{AB} 1k75{AB} 1.953 1fuuAB Sporulation response regulatory protein l-Histidinol dehydrogenase 1iesB 1ecm{AB} 1.952 1iesAB Ferritin Endo-oxabicyclic transition state analogue 1ju5C 1uff B 1.947 1azeAB Abl Intersectin 2 1osh 1fm6E 1.930 1fm6DE Bile acid receptor Steroid receptor coactivator The characters B, D and P in verified column corresponds to verfication in BIND, DIP and PDB databases. different levels of structural similarity to the interface partner. Among these 2.4 Implementation alignments, the algorithm seeks the most favorable alignment that maxim- Both prediction and verification algorithms were implemented in Python izes our similarity scoring function. 
The similarity scoring function is defined Language, due to its powerful attributes regarding Bioinformatics related as αf + (1 − α)f , where f and f are evolution- evolution structure evolution structure tasks. Both algorithms take a fairly long time for completion, i.e. on a ary and structural similarity scoring functions, respectively. The coefficient Linux machine with 2.4 GHz Pentium processor and 1GB memory, the pre- α, represents the relative importance of evolutionary similarity to structural diction algorithm needs about a week and the verification algorithm needs similarity. The first function reflects the number of identically matched hot- about a month. This limitation necessitates parallelization for more reason- spots, the second function reflects the size and quality of the alignment along able response times. Parallelized version of the both algorithms have proven the target–template alignment. We assume that hotspots bear greater import- to achieve almost linear speed ups, prediction algorithm was observed to ance in defining an interface than geometrical complementarity. Therefore perform 29.39 times faster at a 32 node Beowulf cluster. we select α as 0.6. The condition prior to alignment restrains that interface partner size be at least 0.7 times the target surface size. (Size of a structure is 3 RESULTS AND DISCUSSION defined as the number of residues it contains.) This condition keeps relatively small interfaces out of computations. Such relatively small interfaces are Prediction results contain various interaction pairs, some of which likely to align perfectly with target surfaces and yield high similarity scores, are verified in DIP and BIND interaction databases as well as causing biased and unselective results. PDB. Starting from 67 template interfaces we found 62 616 pair- After the completion of successive structural alignments, a similarity list wise interactions among the 6170 target proteins. 
Of these, 31 980 for each interface partner is obtained. If the similarity lists of corresponding interactions are between the monomeric structures, and 25 448 of partners of a template interface contain N and M target structures, respect- them are between a monomeric protein and a complex structure. ively, we obtain N ×M predictions for that interface. A prediction is uniquely The remaining 5188 are between two complex structures. Most of represented by (a, b, c) triplets, where a and b are predicted targets and c is these predictions are heterodimers; only 284 are homodimers (100% the template interface via which the interaction was predicted. The extent of sequence identity between partners). This number contains predic- favorableness of the predicted interaction (prediction score) is quantified by tions with partners having identical sequences, within the same simply the sum of the similarity scores of the target pairs. complex. But we would expect these to be low in number, after These predicted interactions are finally verified for existence in two pub- licly available interaction databases, BIND, DIP and of course, PDB itself. the 50% sequence identity removal phase (Section 2.2). Table 1 Structures in our target dataset are referenced by PDB codes. However, entries displays a selected set of predictions with high scores. The first 4 in the interaction databases have their own referencing nomenclature. There- characters in columns 1, 2 and 5 are PDB representations of proteins fore, there is a need to identify cross references of targets in the respective and the following characters are PDB chain identifiers. In columns interaction databases. This is performed by finding homologous sequences in 1 and 2, multiple chains are enclosed in curly brackets, to indic- the interaction databases using FASTA and alignments yielding expectation ate that the chains are identical and the prediction applies to all of values ≥10 are considered homologous. 
Notice that this process may res- them. In column 5, these two characters indicate the chains of the ult in more than one homolog per database. Once this ‘translation’ is done, structures between which the template interface exists. Column 3 predicted interactions are checked for existence in the domains of interaction specifies if the interaction is verified in B (BIND), D (DIP) and P databases. In the case of PDB, the prediction is checked for its presence in the entire list of two-chain interfaces existing in the PDB, generated in Keskin (PDB) databases, whereas an empty entry means an unverified inter- et al. (2004). action. Column 4 is the similarity score of the prediction. Columns 6 2853 A.S.Aytuna et al. P D Fig. 3. Left: Surface illustration of the binding site between PTH (cyan) and DBP (purple). Right: Wire (backbone only) illustration of the binding site between PTH (P, orange) and DBP (D, red). The template interface Fig. 4. Left: Surface illustration of the binding site between BRCA1 (cyan) 1cos AC (T and T , yellow) is included to highlight the quality of L R and RAD50 (purple). Right: Wire (backbone only) illustration of the alignments. binding site between BRCA1 (B, orange) and RAD50 (R, red. The tem- plate interface laqd5AC (yellow) is included to highlight the quality of alignments. and 7 are the respective functions of SWISSPROT cross references of target partners, queried via SWISSPROT Sequence Retrieval System (SRS). 395–434 in RAD50 ATPase (PDB reference: A or B chain of 1l8d, SWISSPROT reference: RA50_PYRFU). This prediction had a score 3.1 Biological evidence of some predicted binary of 1.989. The potentially docked structure of the complex is shown protein interactions: case studies in Figure 4. In this section, we discuss two examples in detail. Neither of the BRCA1 protein, as a tumor suppressor, plays an important role cases has been verified in DIP/BIND or in PDB, but the literature in maintaining genomic stability. 
Through the several functional search strongly suggests that such interactions exist. domains it contains, BRCA1 has the ability to interact with numerous proteins and to form complexes. It has been reported that disruption 3.1.1 Vitamin D binding protein–parathyroid hormone In this of the potential of BRCA1 to form complexes with RAD50 (via case, residues 383–411 in vitamin D binding protein (DBP) (PDB ref- inherited mutations or epigenetic mechanisms in sporadic cancers) erence: D chain of 1kxp, SWISSPROT reference: VTDB_HUMAN) leads to loss of DNA repair ability. This is on account of some were observed to bind to the residues 1–27 of parathyroid hormone proteins among the binding partners being responsible for the recog- (PTH) (PDB reference: A or B chain of 1et1, SWISSPROT refer- nition and repair of DNA, such as the DNA damage repair protein ence: PTHY_HUMAN). The prediction had a score of 2.011. The RAD50. RAD50 repairs DNA double-strand breaks by end join- potentially docked structure of the complex is shown in Figure 3. ing (non-homologous recombination) and meiosis-specific double- PTH regulates calcium and phosphorus levels in blood by inducing strand break formation. It is an essential protein for cell growth and transport of an inactive form of vitamin D (calcidiol) from liver viability (Jhanwar-Uniyal, 2003; Deng and Brodie, 2000). to kidney and its conversion into active form (calcitriol) in prox- imal tubules. Calcitriol, in turn, is transported to small intestine, where it acts to raise the calcium level through increased intest- 3.2 Summary of the verified interactions inal absorption of calcium. Like all forms of vitamin D, calcidiol We predict 62 616 binary interactions starting from 6170 target binds to DBP prior to transportation by blood to the kidney. In proteins. Reasonable amount of these predictions were verified in the kidney, the cellular uptake of DBP–calcidiol complex and PTH interaction databases. 
In the kidney, the cellular uptake of DBP–calcidiol complex and PTH are both mediated by an endocytic receptor protein termed megalin, in proximal tubules. Under the regulation of PTH calcitriol is also synthesized in the proximal tubules (Christensen and Birn, 2001; Bikle, 2004, http://www.endotext.org/parathyroid/parathyroid3/parathyroid3.htm). Although an interaction has not been reported in literature, during megalin-mediated uptake, PTH may be interacting with the DBP–calcidiol complex through DBP while exerting its regulatory action on calcitriol synthesis. We believe that this prediction may provide new insights into vitamin D metabolism studies.

3.1.2 BRCA1–RAD50 ATPase

In this case, residues 2846–2882 in BRCA1 (PDB reference: A chain of 1miu, SWISSPROT reference: BRC2_MOUSE) were observed to bind to the residues 395–434 in RAD50 ATPase (PDB reference: A or B chain of 1l8d, SWISSPROT reference: RA50_PYRFU).

Table 2 displays the number of verified interactions out of cross referenced interactions for three interaction databases. The results display a good balance of verified and unverified predictions. The higher verification ratio for the PDB database (1094 out of 1497) is because the template interfaces used in prediction have been derived from the PDB database. However, not all the cross referenced interactions have been verified, because the structurally conserved hot spot residues (evolutionary data) have a dominant effect in the evaluation of similarity between interface templates and PDB interactions. The verified interactions support the reliability of our algorithm, whereas unverified ones may correspond to unobserved interactions that actually occur in nature or may be synthetically realized in laboratory conditions. We believe these unverified predictions may have important implications regarding drug design.

Table 2. Number of verified predictions

| Interaction database | Unique verifications | Interaction database size |
|---|---|---|
| DIP | 597 of 4107 | 43 892 |
| BIND | 431 of 1739 | 31 243 |
| PDB | 1094 of 1497 | 21 686 |

4 CONCLUSION

As large amounts of protein structure data become available, predictive methods to detect and characterize protein–protein interactions are becoming increasingly important in systems biology. Such knowledge aids researchers in identifying nodes in biochemical or signaling pathways that cause disorders, and in designing drugs that exert their therapeutic action on these nodes, instead of modulating the complete set of functions connected with the pathway. An ability to predict possible interaction partners of proteins through identification of their binding sites can provide valuable information on the interaction networks and pathways. In the light of this trend, we have developed a novel algorithm for automated prediction of protein–protein interactions that employs a bottom-up approach combining structure and sequence conservation in protein interfaces. Starting from a previously extracted non-redundant dataset that represents the structurally available interfaces in protein–protein interactions, we devise a method to measure the similarity between partners of these representative interfaces and surfaces of target proteins. The algorithm resulted in some 60 000 predictions, some of which were verified in interaction databases and in the redundant interface dataset of Keskin et al. (2004). These verified interactions favor the reliability of our approach, whereas unverified ones may point to undiscovered interactions.

ACKNOWLEDGEMENT

We thank Maxim Shatsky for his assistance on using MULTIPROT.

REFERENCES

Auerbach,D. et al. (2003) Proteomic approaches for generating comprehensive protein interaction maps. Targets, 2, 85–92.
Bader,G.D. et al. (2003) BIND: the biomolecular interaction network database. Nucleic Acids Res., 31, 248–250.
Berman,H.M. et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.
Bikle,D.D. (2004) Vitamin D: production, metabolism and mechanisms of action.
Chakrabarti,P. and Janin,J. (2002) Dissecting protein–protein recognition sites. Proteins, 47, 334–343.
Christensen,E.I. and Birn,H. (2001) Megalin and cubilin: synergistic endocytic receptors in renal proximal tubule. Am. Physiol. Renal. Physiol., 280, F562–F573.
Clackson,T.J. and Wells,A. (1995) A hot spot of binding energy in a hormone–receptor interface. Science, 267, 383–386.
Dandekar,T. et al. (1998) Trends Biochem. Sci., 23, 324–328.
Deng,C.X. and Brodie,S.G. (2000) Roles of BRCA1 and its interacting proteins. Bioessays, 22, 728–737.
Ferrer,M. and Harrison,S.C. (1999) Peptide ligands to human immunodeficiency virus type 1 gp120 identified from phage display libraries. J. Virol., 73, 5795–5802.
Fraser,H.B. et al. (2002) Evolutionary rate in the protein interaction network. Science, 296, 750–752.
Fryxell,K.J. (1996) The co-evolution of gene family trees. Trends Genet., 12, 364–369.
Gavin,A.C. et al. (2002) Functional organization of the yeast genome by systematic analysis of protein complexes. Nature, 415, 141–147.
Hubbard,S.J. and Thornton,J.M. (1993) NACCESS Computer Program. Department of Biochemistry and Molecular Biology, University College London.
Ito,T. et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569–4574.
Janin,J. (1997) Specific vs. non-specific contacts in protein crystals. Nat. Struct. Biol., 4, 973–974.
Jhanwar-Uniyal,M. (2003) BRCA1 in cancer, cell cycle and genomic stability. Front Biosci., 8, 1107–1117.
Jones,S. and Thornton,J. (1996) Principles of protein–protein interactions. Proc. Natl Acad. Sci. USA, 93, 13–20.
Jones,S. and Thornton,J. (1997) Analysis of protein–protein interaction sites using surface patches. J. Mol. Biol., 272, 121–132.
Keskin,O. et al. (2004) A new, structurally non-redundant, diverse data set of protein–protein interfaces and its implications. Protein. Sci., 13, 1043–1055.
Keskin,O. et al. (2005) Hot regions in protein–protein interactions: the organization and contribution of structurally conserved hot spot residues. J. Mol. Biol., 345, 1281–1294.
Kortemme,T. and Baker,D. (2004) Computational design of protein–protein interactions. Curr. Opin. Struct. Biol., 8, 91–97.
Li,W. et al. (2002) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282–283.
Lichtarge,O. and Sowa,M.E. (2002) Evolutionary predictions of binding surfaces and interactions. Curr. Opin. Struct. Biol., 12, 21–27.
LoConte,L. et al. (1999) The atomic structure of protein–protein recognition sites. J. Mol. Biol., 285, 2177–2198.
Lu,L. et al. (2002) MULTIPROSPECTOR: an algorithm for the prediction of protein–protein interactions by multimeric threading. Proteins, 49, 350–364.
Ma,B. et al. (2003) Protein–protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc. Natl Acad. Sci. USA, 100, 5772–5777.
Marcotte,E.M. et al. (1999) Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753.
Pazos,F. and Valencia,A. (2002) In silico two hybrid system for the selection of physically interacting protein pairs. Proteins, 47, 219–227.
Pazos,F. et al. (1997) Correlated mutations contain information about protein–protein interaction. J. Mol. Biol., 271, 511–523.
Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
Ponsting,H. et al. (2000) Discriminating between homodimeric and monomeric proteins in the crystalline state. Proteins, 41, 47–57.
Salwinski,L. and Eisenberg,D. (2003) Computational methods of analysis of protein–protein interactions. Curr. Opin. Struct. Biol., 13, 377–382.
Shatsky,M. et al. (2004) A method for simultaneous alignment of multiple protein structures. Proteins, 56, 143–156.
Thorn,K.S. and Bogan,A.A. (2001) ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics, 17, 284–285.
Valencia,A. and Pazos,F. (2002) Computational methods for the prediction of protein interactions. Curr. Opin. Struct. Biol., 12, 368–373.
Wu,S.J. et al. (1999) Randomization of the receptor alpha chain recruitment epitope reveals a functional interleukin-5 with charge depletion in the CD loop. J. Biol. Chem., 274, 20479–20488.
Xenarios,I. et al. (2002) DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res., 30, 303–305.

Prediction of protein–protein interactions by combining structure and sequence conservation in protein interfaces

Bioinformatics, Volume 21 (12): 6 – Apr 26, 2005


References (38)

Publisher
Oxford University Press
Copyright
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bti443
pmid
15855251

Abstract

Vol. 21 no. 12 2005, pages 2850–2855
BIOINFORMATICS ORIGINAL PAPER
doi:10.1093/bioinformatics/bti443
Structural bioinformatics

Prediction of protein–protein interactions by combining structure and sequence conservation in protein interfaces

A. Selim Aytuna, Attila Gursoy∗ and Ozlem Keskin∗
Koc University, Center for Computational Biology and Bioinformatics, College of Engineering, Rumelifeneri Yolu 34450 Sariyer, Istanbul, Turkey
∗To whom correspondence should be addressed.

Received on February 2, 2005; revised on April 1, 2005; accepted on April 7, 2005
Advance Access publication April 26, 2005

ABSTRACT

Motivation: Elucidation of the full network of protein–protein interactions is crucial for understanding of the principles of biological systems and processes. Thus, there is a need for in silico methods for predicting interactions. We present a novel algorithm for automated prediction of protein–protein interactions that employs a unique bottom-up approach combining structure and sequence conservation in protein interfaces.

Results: Running the algorithm on a template dataset of 67 interfaces and a sequentially non-redundant dataset of 6170 protein structures, 62 616 potential interactions are predicted. These interactions are compared with the ones in two publicly available interaction databases (Database of Interacting Proteins and Biomolecular Interaction Network Database) and also the Protein Data Bank. A significant number of predictions are verified in these databases. The unverified ones may correspond to (1) interactions that are not covered in these databases but known in literature, (2) unknown interactions that actually occur in nature and (3) interactions that do not occur naturally but may possibly be realized synthetically in laboratory conditions. Some unverified interactions, supported significantly with studies found in the literature, are discussed.

Availability: http://gordion.hpc.eng.ku.edu.tr/prism

Contact: agursoy@ku.edu.tr; okeskin@ku.edu.tr

1 INTRODUCTION

Proteins rarely act in isolation; different levels of complexity of biological systems arise not only from the number of the proteins (genes) of the organism but also from the combinatorial interactions among them (Valencia and Pazos, 2002; Ferrer and Harrison, 1999). One of the primary objectives of the post-genomic era is the elucidation of the interactions in cellular systems. The detailed knowledge of the full network of protein–protein interactions, i.e. the distribution and the number of interactions as well as the presence of key nodes in these networks, is expected to provide new insights into the structures and properties of biological systems. Thus, bioinformatics and computational approaches are becoming increasingly important venues as large amounts of data become available. Despite the ongoing effort to decipher the complex nature of protein interactions, they are still not entirely understood (Kortemme and Baker, 2004; Chakrabarti and Janin, 2002; LoConte et al., 1999; Jones and Thornton, 1997). The broad recognition of importance of characterizing the set of all protein interactions in a cell has rendered itself in the development of various experimental and computational techniques. These attempts shed light on both the global features and the specifics of the interactions for different types of interactions.

Various experimental methods have been developed to identify protein–protein interactions in various organisms. These involve (1) the traditional top-down proteomic approach where the experiments have been individually designed to identify and validate a small number of specifically targeted interactions or (2) the bottom-up genomic approach, the recently developed high-throughput experiments designed to probe all the potential interactions within an entire genome exhaustively. The latter approach makes use of high throughput mass spectrometry (Gavin et al., 2002), the yeast two-hybrid system (Ito et al., 2001) and phage display libraries (Ferrer and Harrison, 1999; Wu et al., 1999). These methods have so far yielded a considerable amount of data on protein–protein associations and their relative binding strengths. However, many false positives and false negatives identified in these high-throughput experiments highlight the need for caution when interpreting their results. Still, binary interaction results of these experiments are invaluable to interpret protein–protein interactions and construct protein–protein networks (Salwinski and Eisenberg, 2003; Lu et al., 2002). Experimentally verified interactions have been compiled in various large scale protein–protein interaction datasets (Gavin et al., 2002; Ito et al., 2001; Xenarios et al., 2002; Bader et al., 2003).

Computational methods can address protein–protein interactions at different levels. They may focus on in-depth analysis or carry out a broad scale analysis across large datasets. Through genomic and protein sequence analysis, they may infer whether proteins do interact (Valencia and Pazos, 2002; Marcotte et al., 1999; Salwinski and Eisenberg, 2003; Lu et al., 2002). Or, through structural analysis of proteins and their complexes, they may provide interaction details, essential for understanding processes at the microscopic level (Kortemme and Baker, 2004; Salwinski and Eisenberg, 2003; Chakrabarti and Janin, 2002; LoConte et al., 1999; Jones and Thornton, 1997). Methods using genomic and protein sequence data include analysis of presence or absence of genes in related species, conservation of gene neighborhood, gene fusion events, similarity of phylogenetic trees, correlated mutations on protein surfaces and co-occurrence of sequence domains (Valencia and Pazos, 2002; Salwinski and Eisenberg, 2003). Methods making use of structural data usually strive to identify functional protein interfaces and rely on considerations of the solvent accessible surface area buried upon association (Janin, 1997), free energy changes upon alanine-scanning mutations (Thorn and Bogan, 2001), in silico two-hybrid systems (Pazos and Valencia, 2002), scoring functions based on statistical potentials (Ponsting et al., 2000), physicochemical and geometric properties of the surface, such as electrostatics, hydrophobicity, amino acid composition, shape complementarity and planarity (Jones and Thornton, 1997; Keskin et al., 2004) and evolutionary conservation (Pazos et al., 1997; Lichtarge and Sowa, 2002; Keskin et al., 2005).

Computational and experimental methods concentrate on the protein–protein interaction problem from different aspects. Therefore, no single method can adequately discover the interactome fully. Converging toward an ideal solution will involve unification of different methods that take up the problem from different, innovative perspectives. This will provide a more complete picture of living cells, leading to a better understanding of biological processes.

1.1 Protein interfaces

Proteins associate through binding sites. These sites are believed to contribute to the biomolecular recognition and binding of proteins by providing specific chemical and physical properties necessary for these processes. There have been many studies on the protein–protein interaction and binding regions. These studies aim to provide deeper insight into the nature and mechanism of protein recognition (Jones and Thornton, 1996). It has been found that binding regions usually bury surface areas <2000 Å², and these sites usually form a single patch, in contrast to larger multipatch interfaces (Chakrabarti and Janin, 2002). In a recent study, it has been shown that different protein folds may combinatorially assemble to yield similar local interface motifs (Keskin et al., 2004).

Alanine scanning mutagenesis is a very powerful method to analyze the contributions of individual amino acids to protein–protein binding by systematic replacement of protein interface residues by alanine and by measuring the drop in the resultant binding free energy. These experiments show that each residue at protein–protein interfaces does not contribute to the binding free energy equally. Rather, there are only small sets of hotspot residues at interfaces that contribute significantly to binding free energy of the interaction (Clackson and Wells, 1995). Many subsequent studies suggest that the presence of a few hotspots may be a general characteristic of most protein–protein interfaces (Thorn and Bogan, 2001). These generally polar residues are found to be highly correlated with the structurally conserved residues through evolution to optimize function, structure and stability of the protein complexes and enhance feasibility of protein–protein associations (Keskin et al., 2005).

Many of the residues on interfaces that are critical for binding are likely to be evolutionarily conserved. This is because the pace of evolution at interfaces is slower than the rest of the protein (Fraser et al., 2002). The cause of this slower pace of evolution at interfaces can be explained by the phenomenon of co-evolution, in which substitutions in one protein result in selection pressure for reciprocal changes in interacting partners (Pazos et al., 1997; Fraser et al., 2002). If mutations accumulated during the evolution of an interacting partner are not compensated by correlated mutations in the other partner, the interface, and consequently the interaction, is likely to be disrupted. The alanine scanning mutagenesis method is actually based on this principle. Supportive arguments for co-evolution at protein–protein interfaces have been documented in two different studies. In the first one, corresponding phylogenetic trees of interacting proteins were argued to display, in certain cases, a greater degree of similarity than do non-interacting proteins, due to co-evolution (Auerbach et al., 2003; Fryxell, 1996). In the second one, evolutionarily convergent binding sites were found to correspond to the energetically most favorable states (Kortemme and Baker, 2004).

Through time, differences in paces of evolution result in accumulation of similar interfaces across different complexes, accomplishing different functions. In a way, evolution has reused 'good' favorable interface structural scaffolds and adapted them to different functions (Keskin et al., 2004).

In this paper, we present a novel, efficient algorithm to predict potential protein–protein interactions and complexes. We start with a set of structurally known protein interfaces, then seek pairs of proteins that share structure and evolutionarily conserved residue (hotspot) similarity to our known interface dataset. A list of potentially interacting protein pairs is obtained as a final result. Some of these interacting pairs are verified in the Biomolecular Interaction Network Database (BIND) (Dandekar et al., 1998) and Database of Interacting Proteins (DIP) (Xenarios et al., 2002) and the Protein Data Bank (PDB) (Berman et al., 2000) itself.

The approach and the implementation of the algorithm are elaborated in Section 2. Discussions on prediction results and some case studies are presented in Section 3.

2 SYSTEMS AND METHODS

The rationale of our protein–protein prediction algorithm is that, if any two structures contain particular regions on their surfaces that resemble the complementary partners of a known interface, they 'possibly interact', through these regions. In other words, if A is known to interact with B, a shares similarity with the binding site of A, b shares similarity with the binding site of B, then we predict that a interacts with b. This resemblance indicates the ability of these structures to structurally and evolutionarily complement each other along an interface, as chains of any template interface do. Figures 1 and 2 show the top level pseudocode and schematic outline of our algorithm, respectively.

    for all proteins in target dataset do
        surface ← extract surface of protein
        for all interfaces in template interface do
            for all partners in interface do
                if (size of surface) ≥ 0.7 × (size of partner) then
                    alignments ← align surface with partner
                    best alignment ← sim_score(alignments)
                    if similarity score(best alignment) ≥ threshold then
                        similarity list ← flag protein for prediction partner
    proceed to verification ← similarity lists

Fig. 1. Top level pseudocode of our protein–protein interaction prediction algorithm.

Fig. 2. Top level schematic outline of the protein–protein interaction prediction algorithm.

The algorithm requires a 'template' dataset, i.e. the representative dataset of 'available' interfaces; and a 'target' dataset, to seek every potential binary interaction between its members. The template dataset handles structure and sequence conservation by combining two previously generated datasets: the structurally non-redundant dataset of protein–protein interfaces extracted from the PDB and the set of conserved residues on these interfaces (computational hotspots). The target dataset is a sequentially non-redundant set of all protein complexes and chains in the PDB.

2.1 The template interface dataset

Keskin et al. (2004) describe a method for finding a structurally and sequentially non-redundant subset of all existing interfaces formed between two protein chains in dimers, trimers or higher complexes of proteins in PDB. They apply this method to get a set of 103 clusters of structurally related interfaces and their representatives.

In generation of this dataset, first, all existing interfaces formed between two protein chains in dimers, trimers or higher complexes of proteins were extracted from the PDB. Interfaces were defined as the set of residues representing a region through which two polypeptide chains bind to each other through non-covalent interactions. This set consisted of contacting residues between the chains (interacting residues), and those that are in their vicinity with a certain distance threshold (neighboring residues), representing the scaffold of the interface. Two residues from the opposite chains were marked as interacting, if there was at least a pair of atoms, one from each residue, at a distance smaller than the sum of their van der Waals radii plus a threshold of 0.5 Å. If the C-α of a non-interacting residue lay at a distance of at most 6.0 Å from a C-α of an already assigned interface residue in the same chain, it was flagged as a neighboring residue. After the interfaces were extracted, they were clustered with respect to their structural similarities. The dataset is available at http://gordion.hpc.eng.ku.edu.tr/prism

Ma et al. (2003) discovered that particular residues are conserved on structurally similar interfaces, to an extent that suffices to distinguish between binding sites and exposed protein surfaces. Moreover, they found that these conserved residues were highly correlated with polar residue hotspots, residues that bear more importance than others in defining affinity and stability of an interaction.

The subsequent work of Keskin et al. (2005) describes a method to find structurally conserved residues on clusters of structurally related interfaces. They have applied this method on the resulting dataset of Keskin et al. (2004) and enhanced it with sequence conservation data, which they call computational hotspots. In their method, they structurally aligned members of a given non-redundant interface cluster along their spatially recurring substructural motifs. Then, they considered the frequencies of identically matched residues along the multiply aligned substructures. If a residue matched identically on >50% of the multiply aligned structures, it qualified as a hotspot.

This procedure resulted in 67 interfaces that contained at least one hotspot. The final set contained members as diverse as enzymes, antibodies, viral capsids, etc. We import this dataset as our template interface dataset. We assume that this non-redundant dataset both structurally and evolutionarily (through computational hotspots) represents a subset of 'available' interfaces in the PDB. The complete list of these 67 interfaces can be accessed through the URL: http://gordion.hpc.eng.ku.edu.tr/prism

2.2 The target dataset

The target dataset consists of the list of monomers and complexes that will be compared with the template dataset for structural and evolutionary similarities. Our algorithm predicts interactions by identifying pairs of proteins that may potentially interact in this dataset. The dataset is generated in two steps.

The first step involves the extraction of a non-homologous set of proteins obtained by applying a sequence identity filter of 50% to all existing protein structures in PDB [online service is available at http://www.pdb.org (Li et al., 2002)]. This preliminary list contains 5427 proteins, as of January 27, 2004. This dataset is then expanded in the second step by splitting multimeric proteins into their constituent chains. But to avoid disturbing the non-redundant nature of the dataset, pairwise sequence alignments are carried out before splitting [by invoking FASTA (Pearson and Lipman, 1988)] and identical partner chains within the complexes are removed (i.e. homodimers) by grouping chains into sets and choosing a representative for each of them.

After these processes, the target dataset becomes a non-homologous subset of all the polypeptide chains and complexes existing in PDB. The polypeptide chains may be in the form of monomers or in the form of isolated constituent chains of multimeric complexes. As of January 27, 2004, the target dataset consists of 6170 structures. Of these structures, 1981 are multimeric and 4189 are monomeric. Of the monomeric structures, 2483 are derived from complexes.

2.3 The algorithm

To find every possible binary interaction between pairs of structures in the target dataset, we need a method to measure the similarity between partners of these representative interfaces and surfaces of target proteins. Accordingly, we extract the surfaces of target proteins and perform successive structural alignments between these surfaces and the partner chains of interfaces in the template interface dataset, in an all-against-all manner. This enables us to measure the 'similarity' of a target structure to a template interface partner. If the surfaces of two target proteins (A and B) contain regions 'similar' to complementary partner chains of a template interface, we say A and B may interact through these 'similar' regions. Figures 1 and 2 show the top level pseudocode and the schematic flow of our algorithm, respectively.

The algorithm starts by extracting the surfaces of target structures by invoking the NACCESS program (Hubbard and Thornton, 1993). Along with the atomic accessible surface, NACCESS calculates the relative surface accessibilities (RSA) of residues. Jones and Thornton (1997) argue that residues whose RSAs (percentage of accessibility compared with the accessibility of that residue type X in an extended ALA-X-ALA tripeptide) are >5% can be considered to be on the surface. We adopt the same criterion to qualify surface residues.

The algorithm then checks whether particular regions on the target surfaces resemble the complementary partners of representative interfaces in the template dataset. This necessitates a defined way to measure the structural and evolutionary similarities between a target surface and a representative interface partner. But before the similarities can be measured, the structures need to be structurally aligned. First, each representative interface picked from the template dataset is split into its constituent partners. Since the template dataset comprises only two-chain interfaces, this process always results in two partners per interface. These individual partners are then structurally aligned with the target surface, by invoking MULTIPROT (Shatsky et al., 2004). MULTIPROT detects common geometrical cores between given protein structures in a sequence-order-independent way. This feature makes MULTIPROT a favorable selection for the task, since protein surfaces and protein–protein interfaces have sequence discontinuity. MULTIPROT returns 10 best substructural matches resulting from every possible alignment. Each substructure corresponds to different regions on the surface, bearing different levels of structural similarity to the interface partner.
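The loop printed in Fig. 1 can be sketched in Python roughly as follows. This is a minimal illustration and not the authors' implementation: surface extraction (NACCESS in the paper), structural alignment (MULTIPROT) and scoring are passed in as callables, and every name below (`build_similarity_lists`, `toy_align`, the toy sequences and the threshold value) is a hypothetical stand-in. The size precondition follows Fig. 1 as printed.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-ins: a "structure" is simply a sequence of residue labels.
Structure = Sequence[str]
# (template interface id, partner index 0 or 1) identifies one interface partner.
PartnerKey = Tuple[str, int]


def build_similarity_lists(
    targets: Dict[str, Structure],
    templates: Dict[str, Tuple[Structure, Structure]],
    extract_surface: Callable[[Structure], Structure],
    align: Callable[[Structure, Structure], List[object]],
    sim_score: Callable[[object], float],
    threshold: float,
    size_ratio: float = 0.7,
) -> Dict[PartnerKey, List[Tuple[str, float]]]:
    """Return, for each interface partner, the targets whose surface resembles it."""
    similarity_lists: Dict[PartnerKey, List[Tuple[str, float]]] = {}
    for target_id, protein in targets.items():
        surface = extract_surface(protein)
        for interface_id, partners in templates.items():
            for idx, partner in enumerate(partners):
                # Size precondition exactly as printed in Fig. 1.
                if len(surface) < size_ratio * len(partner):
                    continue
                alignments = align(surface, partner)
                if not alignments:
                    continue
                # Keep the score of the best-scoring alignment.
                best = max(sim_score(a) for a in alignments)
                if best >= threshold:
                    key = (interface_id, idx)
                    similarity_lists.setdefault(key, []).append((target_id, best))
    return similarity_lists


def toy_align(surface: Structure, partner: Structure) -> List[object]:
    # Toy "alignment": the number of residue labels the two structures share.
    return [len(set(surface) & set(partner))]


demo = build_similarity_lists(
    targets={"t1": "ACDEFG", "t2": "AC", "t3": "KLMNPQ"},
    templates={"i1": ("ACDE", "KLMN")},
    extract_surface=lambda protein: protein,  # toy: the whole chain is "surface"
    align=toy_align,
    sim_score=float,
    threshold=3.0,  # arbitrary toy threshold; the paper does not state its value here
)
```

Passing the external tools in as callables keeps the control flow testable without NACCESS or MULTIPROT installed; a real pipeline would wrap those programs behind the same three functions.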
Among these alignments, the algorithm seeks the most favorable alignment that maximizes our similarity scoring function. The similarity scoring function is defined as $\alpha f_{\mathrm{evolution}} + (1 - \alpha) f_{\mathrm{structure}}$, where $f_{\mathrm{evolution}}$ and $f_{\mathrm{structure}}$ are evolutionary and structural similarity scoring functions, respectively. The coefficient $\alpha$ represents the relative importance of evolutionary similarity to structural similarity. The first function reflects the number of identically matched hotspots, the second function reflects the size and quality of the alignment along the target–template alignment. We assume that hotspots bear greater importance in defining an interface than geometrical complementarity. Therefore we select $\alpha$ as 0.6. The condition prior to alignment restrains that interface partner size be at least 0.7 times the target surface size. (Size of a structure is defined as the number of residues it contains.) This condition keeps relatively small interfaces out of computations. Such relatively small interfaces are likely to align perfectly with target surfaces and yield high similarity scores, causing biased and unselective results.

After the completion of successive structural alignments, a similarity list for each interface partner is obtained. If the similarity lists of corresponding partners of a template interface contain N and M target structures, respectively, we obtain N × M predictions for that interface. A prediction is uniquely represented by (a, b, c) triplets, where a and b are predicted targets and c is the template interface via which the interaction was predicted. The extent of favorableness of the predicted interaction (prediction score) is quantified by simply the sum of the similarity scores of the target pairs.

These predicted interactions are finally verified for existence in two publicly available interaction databases, BIND and DIP, and of course in the PDB itself. Structures in our target dataset are referenced by PDB codes. However, entries in the interaction databases have their own referencing nomenclature. Therefore, there is a need to identify cross references of targets in the respective interaction databases. This is performed by finding homologous sequences in the interaction databases using FASTA and alignments yielding expectation values ≥10 are considered homologous. Notice that this process may result in more than one homolog per database. Once this 'translation' is done, predicted interactions are checked for existence in the domains of interaction databases. In the case of PDB, the prediction is checked for its presence in the entire list of two-chain interfaces existing in the PDB, generated in Keskin et al. (2004).

2.4 Implementation

Both prediction and verification algorithms were implemented in Python Language, due to its powerful attributes regarding Bioinformatics related tasks. Both algorithms take a fairly long time for completion, i.e. on a Linux machine with 2.4 GHz Pentium processor and 1GB memory, the prediction algorithm needs about a week and the verification algorithm needs about a month. This limitation necessitates parallelization for more reasonable response times. Parallelized versions of both algorithms have proven to achieve almost linear speed-ups; the prediction algorithm was observed to perform 29.39 times faster on a 32-node Beowulf cluster.

3 RESULTS AND DISCUSSION

Prediction results contain various interaction pairs, some of which are verified in DIP and BIND interaction databases as well as PDB. Starting from 67 template interfaces we found 62 616 pairwise interactions among the 6170 target proteins. Of these, 31 980 interactions are between the monomeric structures, and 25 448 of them are between a monomeric protein and a complex structure. The remaining 5188 are between two complex structures. Most of these predictions are heterodimers; only 284 are homodimers (100% sequence identity between partners). This number contains predictions with partners having identical sequences, within the same complex. But we would expect these to be low in number, after the 50% sequence identity removal phase (Section 2.2). Table 1 displays a selected set of predictions with high scores. The first 4 characters in columns 1, 2 and 5 are PDB representations of proteins and the following characters are PDB chain identifiers. In columns 1 and 2, multiple chains are enclosed in curly brackets, to indicate that the chains are identical and the prediction applies to all of them. In column 5, these two characters indicate the chains of the structures between which the template interface exists. Column 3 specifies if the interaction is verified in B (BIND), D (DIP) and P (PDB) databases, whereas an empty entry means an unverified interaction. Column 4 is the similarity score of the prediction. Columns 6 and 7 are the respective functions of SWISSPROT cross references of target partners, queried via SWISSPROT Sequence Retrieval System (SRS).

Table 1. A selected set of verified and unverified predictions

| Left partner | Right partner | Verified Dbase | Prediction score | Template | Left function | Right function |
|---|---|---|---|---|---|---|
| 1cov1 | 1h8tC | | 4.192 | 1cov13 | Coxsackievirus coat protein | Echovirus 11 coat protein |
| 1dgi | 1ncqC | | 3.867 | 1cov13 | Poliovirus receptor | Coat protein Vp3 |
| 1lq8{AECG} | 1jjo{EF} | P | 3.453 | 1as4AB | Plasma serine protease inhibitor | Neuroserpin |
| 2ae2{AB} | 1e7w{AB} | D,B,P | 2.873 | 1e92AC | Tropinone reductase-II | Pteridine reductase |
| 2sicE | 1lw6I | P | 2.749 | 2sniEI | Subtilisin BPN | Subtilisin-chymotrypsin inhibitor-2A |
| 1mho | 1psb{AB} | D,B,P | 2.484 | 1mr8AB | S-100 protein | S-100 protein, Beta chain |
| 1hj9 | 1jbl | | 2.469 | 1sbwAI | Beta-trypsin | Cyclic trypsin inhibitor |
| 2tnf{ABC} | 1dg6 | D,P | 2.225 | 1cdaAB | TNF | TNF related apoptosis inducing ligand |
| 1fxkC | 1jm7B | | 2.110 | 1jm7AB | Prefoldin | Brca1-associated ring domain protein 1 |
| 1gk6{AB} | 2ebo{ABC} | P | 2.088 | 1cosAB | Vimentin | Ebola virus envelope glycoprotein |
| 1kb9K | 1n8v | | 2.077 | 1hezCE | Light chain (VI) of Fv-fragment | Chemosensory protein |
| 1i4k1 | 1m5q{A..Z12} | | 2.074 | 1i4k12 | Putative Snrnp Sm-like protein | Small nuclear ribonucleoprotein homolog |
| 1l8d{AB} | 1c17 | | 2.036 | 1jgcAC | RAD50 Atpase | ATP synthase subunit C |
| 1mso{AC} | 1mso{BD} | P | 1.981 | 6rlxAB | Insulin like growth factor A-chain | Insulin like growth factor B-chain |
| 1ixm{AB} | 1k75{AB} | | 1.953 | 1fuuAB | Sporulation response regulatory protein | l-Histidinol dehydrogenase |
| 1iesB | 1ecm{AB} | | 1.952 | 1iesAB | Ferritin | Endo-oxabicyclic transition state analogue |
| 1ju5C | 1uff | B | 1.947 | 1azeAB | Abl | Intersectin 2 |
| 1osh | 1fm6E | | 1.930 | 1fm6DE | Bile acid receptor | Steroid receptor coactivator |

The characters B, D and P in the verified column correspond to verification in BIND, DIP and PDB databases.

3.1 Biological evidence of some predicted binary protein interactions: case studies

In this section, we discuss two examples in detail. Neither of the cases has been verified in DIP/BIND or in PDB, but the literature search strongly suggests that such interactions exist.

3.1.1 Vitamin D binding protein–parathyroid hormone

In this case, residues 383–411 in vitamin D binding protein (DBP) (PDB reference: D chain of 1kxp, SWISSPROT reference: VTDB_HUMAN) were observed to bind to the residues 1–27 of parathyroid hormone (PTH) (PDB reference: A or B chain of 1et1, SWISSPROT reference: PTHY_HUMAN). The prediction had a score of 2.011. The potentially docked structure of the complex is shown in Figure 3. PTH regulates calcium and phosphorus levels in blood by inducing transport of an inactive form of vitamin D (calcidiol) from liver to kidney and its conversion into active form (calcitriol) in proximal tubules. Calcitriol, in turn, is transported to small intestine, where it acts to raise the calcium level through increased intestinal absorption of calcium. Like all forms of vitamin D, calcidiol binds to DBP prior to transportation by blood to the kidney. In the kidney, the cellular uptake of DBP–calcidiol complex and PTH are both mediated by an endocytic receptor protein termed megalin, in proximal tubules. Under the regulation of PTH calcitriol is also synthesized in the proximal tubules (Christensen and Birn, 2001; Bikle, 2004, http://www.endotext.org/parathyroid/parathyroid3/parathyroid3.htm). Although an interaction has not been reported in literature, during megalin-mediated uptake, PTH may be interacting with the DBP–calcidiol complex through DBP while exerting its regulatory action on calcitriol synthesis. We believe that this prediction may provide new insights into vitamin D metabolism studies.

Fig. 3. Left: Surface illustration of the binding site between PTH (cyan) and DBP (purple). Right: Wire (backbone only) illustration of the binding site between PTH (P, orange) and DBP (D, red). The template interface 1cosAC (T_L and T_R, yellow) is included to highlight the quality of alignments.

3.1.2 BRCA1–RAD50 ATPase

In this case, residues 2846–2882 in BRCA1 (PDB reference: A chain of 1miu, SWISSPROT reference: BRC2_MOUSE) were observed to bind to the residues 395–434 in RAD50 ATPase (PDB reference: A or B chain of 1l8d, SWISSPROT reference: RA50_PYRFU). This prediction had a score of 1.989. The potentially docked structure of the complex is shown in Figure 4.

BRCA1 protein, as a tumor suppressor, plays an important role in maintaining genomic stability. Through the several functional domains it contains, BRCA1 has the ability to interact with numerous proteins and to form complexes. It has been reported that disruption of the potential of BRCA1 to form complexes with RAD50 (via inherited mutations or epigenetic mechanisms in sporadic cancers) leads to loss of DNA repair ability. This is on account of some proteins among the binding partners being responsible for the recognition and repair of DNA, such as the DNA damage repair protein RAD50. RAD50 repairs DNA double-strand breaks by end joining (non-homologous recombination) and meiosis-specific double-strand break formation. It is an essential protein for cell growth and viability (Jhanwar-Uniyal, 2003; Deng and Brodie, 2000).

Fig. 4. Left: Surface illustration of the binding site between BRCA1 (cyan) and RAD50 (purple). Right: Wire (backbone only) illustration of the binding site between BRCA1 (B, orange) and RAD50 (R, red). The template interface laqd5AC (yellow) is included to highlight the quality of alignments.

3.2 Summary of the verified interactions

We predict 62 616 binary interactions starting from 6170 target proteins. A reasonable number of these predictions were verified in interaction databases.
Table 2 displays the number of verified inter- are both mediated by an endocytic receptor protein termed mega- actions out of cross referenced interactions for three interaction lin, in proximal tubules. Under the regulation of PTH calcitriol is databases. The results display a good balance of verified and unveri- also synthesized in the proximal tubules (Christensen and Birn, 2001; fied predictions. The higher verification ratio for the PDB database Bikle, 2004, http://www.endotext.org/parathyroid/parathyroid3/ (1094 out of 1497) is because the template interfaces used in predic- parathyroid3.htm). Although an interaction has not been reported tion have been derived from the PDB database. However, not all the in literature, during megalin-mediated uptake, PTH may be interact- cross referenced interactions have been verified, because the structur- ing with the DBP–calcidiol complex through DBP while exerting ally conserved hot spot residues (evolutionary data) have dominant its regulatory action on calcitriol synthesis. We believe that this effect in the evaluation of similarity between interface templates and prediction may provide new insights into vitamin D metabolism PDB interactions. The verified interactions prove the reliability of our studies. algorithm, whereas unverified ones may correspond to unobserved 3.1.2 BRCA1–RAD50 ATPase In this case, residues 2846–2882 interactions that actually occur in nature or may be synthetically real- in BRCA1 (PDB reference: A chain of 1miu, SWISSPROT ref- ized in laboratory conditions. We believe these unverified predictions erence: BRC2_MOUSE) were observed to bind to the residues may have important implications regarding drug design. 2854 Prediction of protein–protein interactions Table 2. Number of verified predictions Deng,C.X. and Brodie,S.G. (2000) Roles of BRCA1 and its interacting proteins. Bioessays, 22, 728–737. Ferrer,M. and Harrison,S.C. 
(1999) Peptide ligands to human immunodeficiency virus type 1 gp120 identified from phage display libraries. J. Virol., 73, 5795–5802. Interaction database Unique verifications Interation database size Fraser,H.B. et al. (2002) Evolutionary rate in the protein interaction network. Science, 296, 750–752. DIP 597 of 4107 43 892 Fryxell,K.J. (1996) The co-evolution of gene family trees. Trends Genet., 12, 364–369. BIND 431 of 1739 31 243 Gavin,A.C. et al. (2002) Functional organization of the yeast genome by systematic PDB 1094 of 1497 21 686 analysis of protein complexes. Nature, 415, 141–147. Hubbard,S.J. and Thornton,J.M. (1993) NACCESS Computer Program. Department of Biochemistry and Molecular Biology, University College London. Ito,T. et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569–4574. Janin,J. (1997) Specific vs. non-specific contacts in protein crystals Nat. Struct. Biol., 4, 4 CONCLUSION 973–974. As large amount of protein structure data become available, predict- Jhanwar-Uniyal,M. (2003) BRCA1 in cancer, cell cycle and genomic stability. Front ive methods to detect and characterize protein–protein interactions Biosci., 8, 1107–1117. Jones,S. and Thornton,J. (1996) Principles of protein–protein interactions. Proc. Natl are becoming increasingly important in systems biology. Such know- Acad. Sci. USA, 93, 13–20. ledge aid to researchers in identifying nodes in biochemical or Jones,S. and Thornton,J. (1997) Analysis of protein–protein interaction sites using signaling pathways that cause disorders, and designing drugs that surface patches. J. Mol. Biol., 272, 121–132. exert their therapeutic action on these nodes, instead of modulating Keskin,O. et al. (2004) A new, structurally non-redundant, diverse data set of protein– protein interfaces and its implications. Protein. Sci., 13, 1043–1055. the complete set of functions connected with the pathway. An ability Keskin,O. et al. 
(2005) Hot regions in protein–protein interactions: the organization to predict possible interaction partners of proteins through identific- and contribution of structurally conserved hot spot residues. J. Mol. Biol., 345, ation of their binding sites can provide valuable information on the 1281–1294. interaction networks and pathways. In the light of this trend, we have Kortemme,T. and Baker,D. (2004) Computational design of protein–protein interactions. Curr. Opin. Struct. Biol., 8, 91–97. developed a novel algorithm for automated prediction of protein– Li,W. et al. (2002) Clustering of highly homologous sequences to reduce the size of protein interactions that employs a bottom-up approach combining large protein databases. Bioinformatics, 17, 282–283. structure and sequence conservation in protein interfaces. Starting Lichtarge,O. and Sowa,M.E. (2002) Evolutionary predictions of binding surfaces and from a previously extracted non-redundant dataset that represents interactions. Curr. Opin. Struct. Biol., 12, 21–27. of structurally available interfaces in protein–protein interactions, LoConte,L. et al. (1999) The atomic structure of protein–protein recognition sites. J. Mol. Biol., 285, 2177–2198. we devise a method to measure the similarity between partners of Lu,L. et al. (2002) MULTIPROSPECTOR: an algorithm for the prediction of protein– these representative interfaces and surfaces of target proteins. The protein interactions by multimeric threading. Proteins, 49, 350–364. algorithm resulted in some 60 000 predictions, some of which were Ma,B. et al. (2003) Protein–protein interactions: structurally conserved residues verified in interaction databases and redundant dataset of interface distinguish between binding sites and exposed protein surfaces. Proc. Natl Acad. dataset of Keskin et al. (2004). These verified interactions favor the Sci. USA, 100, 5772–5777. Marcotte,E.M. et al. 
(1999) Detecting protein function and protein–protein interactions reliability of our approach, whereas unverified ones may point to from genome sequences. Science, 285, 751–753. undiscovered interactions. Pazos,F. and Valencia,A. (2002) In silico two hybrid system for the selection of physically interacting protein pairs. Proteins, 47, 219–227. Pazos,F. et al. (1997) Correlated mutations contain information about protein–protein ACKNOWLEDGEMENT interaction. J. Mol. Biol., 271, 511–523. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence compar- We thank Maxim Shatsky for his assistance on using MULTIPROT. ison. Proc. Natl Acad. Sci. USA, 85, 2444–2448. Ponsting,H. et al. (2000) Discriminating between homodimeric and monomeric proteins in the crystalline state. Proteins, 41, 47–57. REFERENCES Salwinski,L. and Eisenberg,D. (2003) Computational methods of analysis of protein– Auerbach,D. et al. (2003) Proteomic approaches for generating comprehensive protein protein interactions. Curr. Opin. Struct. Biol., 13, 377–382. interaction maps. Targets, 2, 85–92. Shatsky,M. et al. (2004) A method for simultaneous alignment of multiple protein Bader,G.D. et al. (2003) BIND: the biomolecular interaction network database. Nucleic structures. Proteins, 56, 143–156. Acids Res., 31, 248–250. Thorn,K.S and Bogan,A.A. (2001) ASEdb: a database of alanine mutations and their Berman,H.M. et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242. effects on the free energy of binding in protein interactions. Bioinformatics, 17, Bikle,D.D. (2004) Vitamin D: production, metabolism and mechanisms of action. 284–285. Chakrabarti,P. and Janin,J. (2002) Dissecting protein–protein recognition sites. Proteins, Valencia,A. and Pazos,F. (2002) Computational methods for the prediction of protein 47, 334–343. interactions. Curr. Opin. Struct. Biol., 12, 368–373. Christensen,E.I. and Birn,H. (2001) Megalin and cubilin: synergistic endocytic receptors Wu,S.J. et al. 
(1999) Randomization of the receptor alpha chain recruitment epitote in renal proximal tubule. Am. Physiol. Renal. Physiol., 280, F562–F573. reveals a functional interleukin-5 with charge depletion in the CD loop. J. Biol. Clackson,T.J. and Wells,A. (1995) A hot spot of binding energy in a hormone–receptor Chem., 274, 20479–20488. interface. Science, 267, 383–386. Xenarios,I. et al. (2002) DIP, the database of interacting proteins: a research tool for Dandekar,T. et al. (1998) Trends Biochem. Sci., 23, 324–328. studying cellular networks of protein interactions. Nucleic Acids Res., 30, 303–305.
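Supplementary sketch. To make the scoring and pairing scheme from the methods text concrete (the α-weighted similarity score with α = 0.6, the 0.7 size condition, and the N × M pairing of the two similarity lists), the following minimal Python sketch may help. It is an illustration only, not the authors' implementation: the function names, the `Prediction` class, and the representation of a similarity list as a dictionary from target identifier to similarity score are all assumptions made here, and the computation of the two component scores from structural alignments is not reproduced.

```python
from dataclasses import dataclass
from itertools import product

ALPHA = 0.6           # weight of evolutionary similarity, as selected in the text
MIN_SIZE_RATIO = 0.7  # partner size must be at least 0.7 x target surface size


def similarity(f_evolution: float, f_structure: float, alpha: float = ALPHA) -> float:
    """Combined score: alpha * f_evolution + (1 - alpha) * f_structure."""
    return alpha * f_evolution + (1.0 - alpha) * f_structure


def passes_size_condition(partner_size: int, target_size: int) -> bool:
    """Sizes are residue counts; relatively small partners are skipped before alignment."""
    return partner_size >= MIN_SIZE_RATIO * target_size


@dataclass(frozen=True)
class Prediction:
    a: str        # first predicted target
    b: str        # second predicted target
    c: str        # template interface through which the pair was predicted
    score: float  # sum of the two targets' similarity scores


def predict(template: str,
            left_hits: dict[str, float],
            right_hits: dict[str, float]) -> list[Prediction]:
    """Pair every target on one partner's list with every target on the other's (N x M)."""
    predictions = [
        Prediction(a, b, template, score_a + score_b)
        for (a, score_a), (b, score_b) in product(left_hits.items(), right_hits.items())
    ]
    # Highest prediction score first.
    return sorted(predictions, key=lambda p: p.score, reverse=True)
```

With hypothetical similarity lists of two and three targets, `predict` returns 2 × 3 = 6 triplets, each scored by the sum of its two similarity scores, which mirrors how a template with lists of size N and M yields N × M predictions.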

Journal

Bioinformatics, Oxford University Press

Published: Apr 26, 2005
