Sequence-based prediction of protein interaction sites with an integrative method

Xue-wen Chen; Jong Cheol Jeong

doi:10.1093/bioinformatics/btp039

BIOINFORMATICS ORIGINAL PAPER, Vol. 25 no. 5 2009, pages 585–591
Sequence analysis

Xue-wen Chen* and Jong Cheol Jeong

Bioinformatics and Computational Life Sciences Laboratory, Information and Telecommunication Technology Center and Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA

*To whom correspondence should be addressed.

Received on June 2, 2008; revised on January 14, 2009; accepted on January 15, 2009
Advance Access publication January 19, 2009
Associate Editor: Limsoon Wong

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ABSTRACT

Motivation: Identification of protein interaction sites has a significant impact on understanding protein function, elucidating signal transduction networks and drug design studies. With protein sequence data growing exponentially, predictive methods that use sequence information only for protein interaction site prediction have drawn increasing interest. In this article, we propose a predictive model for identifying protein interaction sites. Without using any structure data, the proposed method extracts a wide range of features from protein sequences. A random forest-based integrative model is developed to effectively utilize these features and to deal with the imbalanced data classification problem commonly encountered in binding site predictions.

Results: We evaluate the predictive method using 2829 interface residues and 24 616 non-interface residues extracted from 99 polypeptide chains in the Protein Data Bank. The experimental results show that the proposed method performs significantly better than two other sequence-based predictive methods and can reliably predict residues involved in protein interaction sites. Furthermore, we apply the method to predict interaction sites and to construct three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which provide new insight into the sequence–function relationship. We show that the predicted interaction sites can be valuable as a first approach for guiding experimental methods investigating protein–protein interactions and localizing the specific interface residues.

Availability: Datasets and software are available at http://ittc.ku.edu/~xwchen/bindingsite/prediction.

Contact: xwchen@ku.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Protein–protein interaction plays an essential role in nearly all cell functions, such as promoting chemical reactions and acting as antibodies. Consequently, identification of protein interaction sites is critical for understanding protein function and for elucidating metabolic and signal transduction networks. It could also help in rational drug design studies (Gallet et al., 2000). In silico methods are commonly used to identify protein interaction sites; they are of great importance in molecular recognition and are considered a good starting point for forming hypotheses when searching for potential pharmacological targets for drug design (Gallet et al., 2000).

Roughly speaking, computational methods can be categorized into two groups: molecular docking of two proteins with known structures, and the identification of putative interaction sites on an isolated protein without knowing the structure of its partner or complex (Gallet et al., 2000). While a number of computational methods for predicting protein interaction sites have been developed over the years, most of them require known protein structure information (Aytuna et al., 2005; Bradford and Westhead, 2005; Chen and Zhou, 2005; Chung et al., 2006; Fariselli et al., 2002; Gabb et al., 1997; Helmer-Citterich and Tramontano, 1994; Jiang and Kim, 1991; Jones and Thornton, 1997a, b; Katchalski-Katzir et al., 1992; Keskin et al., 2005; Kuntz et al., 1982; Norel et al., 1995; Palma et al., 2000; Salemme, 1976; Shoichet and Kuntz, 1991; Walls and Sternberg, 1992; Warwicker, 1989; Wodak and Janin, 1978; Zhou and Shan, 2001). Despite much effort in structural genomics, the number of protein structures, which are determined by time-consuming and expensive experimental technologies, is significantly smaller than the number of protein sequences produced by large-scale DNA sequencing methods. For example, as of July 29, 2008, there were 392 667 identified protein sequences in Uniprot/Swissprot (reviewed, manually annotated) (Uniprot, 2008) and only 47 978 known protein structures in PDB (Berman et al., 2000). Thus, it is now more important than ever to identify protein interaction sites from amino acid sequences only, without knowing structural data.

Several studies have attempted to address the sequence-based interaction site prediction problem. Kini and Evans (1996) observed that proline is the most common residue in a large number of protein interaction sites. Pazos et al. (1997) used multiple sequence alignment to detect correlated changes in a group of interacting protein domains for predicting contacting pairs of residues. Gallet et al. (2000) analyzed hydrophobicity distribution and amino acid frequencies in known interaction sites for identifying linear stretches of sequences. More recently, more complicated machine learning methods have been applied to predict interaction sites. Yan et al. (2003) applied support vector machines (SVMs) to predict interface sites with features extracted from sequence neighbors for each target residue. Wang et al. (2006) also employed SVMs as classifiers, with features extracted from spatial sequence and evolutionary conservation scores based on a phylogenetic tree.
Sequence-based prediction of protein interaction sites, although increasingly important, is still in its infancy. Several issues make prediction from sequences a very difficult task. The two main problems are: (i) the biological properties that are responsible for protein–protein interactions are not fully understood, which makes it difficult to extract informative features common to all the binding sites; and (ii) the number of interacting sites of a protein is much smaller than that of non-interacting sites, which leads to a very challenging problem, the so-called imbalanced data classification problem. Our article addresses these problems by extracting a wide variety of features from amino acid sequences and by developing a random forest-based method to effectively integrate these features and at the same time deal with imbalanced-data problems. The extracted features can be grouped into three categories: physicochemical properties and evolutionary conservation score, residue-based distance matrix and sequence profile. While each group of features may represent common characteristics for a certain number of interaction sites, none of the features is a dominant factor capable of describing the common effect among all the interface residues. For example, hydrophobicities may be useful for predicting interaction sites in homodimers; they have, however, only moderate power to predict interaction sites in other types of complexes (Jones and Thornton, 1996; Lo Conte et al., 1999). Thus, effectively integrating a large number of features is critical for a reliable predictive model. To utilize all the extracted features, an integrative random forest framework is developed. Our results evaluated on 99 polypeptide chains show that the proposed method outperforms two other sequence-based methods. Furthermore, we apply this method to identify potential interaction sites for three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which may provide new insight into the sequence–function relationship.

2 METHODS

2.1 Extracting a large number of features

To build a predictor that can distinguish interface residues from non-interface sites, we extract features based on physicochemical property, evolutionary conservation score, amino acid distances, and position-specific score matrix (PSSM). Instead of using one long feature vector, we divide the features into three groups based on their sources, as features extracted from different sources may have different distributions (e.g. hydrophobicity in the physicochemical feature set versus the shortest distance of amino acid residues to a target residue in distance features) (see Supplementary Material for detailed discussions).

Group 1, physicochemical features and evolutionary conservation score: the first two features are hydrophobicity and hydrophobic moments, which were initially used to distinguish membrane α-helix proteins from soluble proteins (Eisenberg et al., 1982, 1984) and later to predict protein binding sites in the apolipoprotein E sequence (De Loof et al., 1986; Gallet et al., 2000). For each amino acid, a sliding window centered at this amino acid is moved along the protein sequence, and the mean hydrophobicity and mean hydrophobic moment are calculated as follows and assigned to the center amino acid:

$$\langle H_i \rangle = \frac{1}{2N+1} \sum_{n=-N}^{N} h_n^{(i)} \qquad (1)$$

$$\langle \mu_i \rangle = \frac{1}{2N+1} \sqrt{\left[ \sum_{n=-N}^{N} h_n^{(i)} \sin(\delta n) \right]^2 + \left[ \sum_{n=-N}^{N} h_n^{(i)} \cos(\delta n) \right]^2} \qquad (2)$$

where (2N + 1) is the size of the sliding window centered around amino acid i, $h_n^{(i)}$ is the hydrophobicity of the amino acid (AA) that is n AAs away from AA i, and δ is the gyration angle between two consecutive residues in the sequence. Gallet et al. (2000) found the method to be most successful when they used N = 5 and δ = 100°. The hydrophobicities of each amino acid are taken from the scale developed by Eisenberg et al. (1984).

We also extract seven other physicochemical properties: hydrophilicity, hydrophilic moment, propensity, propensity moment, isoelectric point, isoelectric moment and mass (Jones and Thornton, 1997a; Voet and Voet, 2004). The residue interface propensities quantify whether an amino acid is possibly exposed to solvent or buried in an interface (Jones and Thornton, 1997b). Furthermore, an evolutionary conservation score for each residue is calculated by HSSP (Schneider and Sander, 1996) and used as a feature. Thus, for each residue, we can extract nine physicochemical features and one evolutionary conservation score.

Group 2, amino acid distance: the frequency of occurrence of proline residues was initially used for analyzing interaction sites by Kini and Evans (1996). They examined 1600 protein–protein interaction sequences and found proline residues on at least one side of 88.2% of the binding sites. The proline residues generally occurred within four residues of the binding site and often within two residues. Inspired by this idea, we examine the shortest distance from the current residue to each of the 20 amino acid residue types. This creates a vector of size 20 for each residue.

Group 3, PSSM: the PSSM is calculated by HSSP using multiple sequence alignment. The likelihood of the 20 amino acid substitutions at a given alignment position is used as our PSSM features.

For each target residue, its features are extracted by using a sliding window of size 21 centered on the target residue, i.e. the feature vector for the central residue consists of features extracted from itself and 10 amino acids on each side. Consequently, the total numbers of features for each target residue are 210, 420 and 420 for groups 1, 2 and 3, respectively. While no single group of features may be the dominant factor describing the common effect among all the interface residues, collectively they are capable of effectively characterizing protein interface sites. Since the total number of features is 1050, a carefully designed classifier is needed to make effective use of such a large feature set. Next, we describe the random forest method.

2.2 Constructing an integrative random forest model

To effectively utilize the large number of extracted features and to deal with the imbalanced data classification problem, we describe an integrative random forest method for predicting interaction sites. Random forests have been applied to protein–protein interaction prediction in our recent work (Chen and Liu, 2005), but not to binding site problems. When the input space is extraordinarily large, as in our application, the random subspace feature selection introduced by Ho (1998) can improve classifier diversity. A random forest consists of an ensemble of decision trees built from randomly sampled subspaces of the input features, and the final classification is obtained by combining results from the trees via voting (Breiman, 2001). It is crucial to produce a large number of sufficiently different trees when using the combined power of multiple trees to increase accuracy. The use of randomization in feature selection is a way to explore various possibilities of subspaces. While most classification methods suffer from the curse of dimensionality, the random subspace feature selection method can take advantage of the high dimensionality. In contrast to Occam's Razor, the method improves accuracy as it grows in complexity (Ho, 1998).

The random forest can also deal with imbalanced data problems. It constructs many decision trees, each grown from a different subset of the training data. To construct an individual decision tree, training samples are randomly selected with replacement from the original training dataset. In our application, to build each tree, we randomly select the same number of samples for each class, which converts the imbalanced data problem into multiple balanced data classification problems. If the number of positive samples (minority class) in the original training set is N (a sample count, distinct from the window half-width N in Section 2.1), then N samples are randomly drawn with replacement for each class. At each splitting or decision node, the best splitting feature is chosen from a randomly selected subspace of m features, where m is much smaller than M, the total number of features. Each tree in the forest is grown to the largest extent possible without pruning. To classify a new object, each tree in the forest gives a classification, which is interpreted as the tree 'voting' for that class. The final classification of the object is determined by majority vote among the classes decided by the forest of trees.

Furthermore, for each group of features, we generate a random forest classifier. This is because features from different groups have different distributions. Building a forest classifier for each group of features can effectively integrate all the features for better performance.
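The balanced sampling, random subspace and voting steps of Section 2.2 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `balanced_bootstrap`, `random_subspace` and `majority_vote` are hypothetical, labels are assumed to be 1 for interface and 0 for non-interface residues, the subspace size m = 20 is an arbitrary illustrative value, and ties in the vote go to the negative class, a choice the text does not specify.

```python
import random

def balanced_bootstrap(labels, rng):
    """Sample indices for one tree: N positives and N negatives drawn
    with replacement, where N is the size of the minority class."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return rng.choices(pos, k=n) + rng.choices(neg, k=n)

def random_subspace(n_features, m, rng):
    """Pick m distinct feature indices out of M = n_features (m << M)."""
    return rng.sample(range(n_features), m)

def majority_vote(votes):
    """Forest decision from per-tree votes (1 = interface).
    Assumption: a tie is resolved as non-interface."""
    return 1 if 2 * sum(votes) > len(votes) else 0

rng = random.Random(0)
labels = [1] * 3 + [0] * 27               # toy data, 9:1 negatives to positives
sample = balanced_bootstrap(labels, rng)  # 3 positive + 3 negative indices
features = random_subspace(420, 20, rng)  # 420 = size of one feature group; m is hypothetical
```

Because every tree sees an equal number of positives and negatives, no single tree is trained on the 9:1 skew, while the forest as a whole still uses many different negative examples.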
For each feature group, we generate 100 trees: each tree is built using 100 randomly selected features from that feature group and the same number of positives and negatives. The final decision is made by majority vote.

3 EXPERIMENTAL RESULTS

3.1 Data sources

The proteins used in this article were extracted from a set of 70 protein–protein heterocomplexes used in the studies of Chakrabarti and Janin (2002) and Yan et al. (2004). Redundant proteins, molecules with fewer than 10 residues and proteins with sequence identity ≥30% were removed. Some proteins that are not available in the HSSP and DSSP programs (Kabsch and Sander, 1983) were also omitted. Finally, we end up with 54 heterocomplexes for our studies. Table 1 lists the 99 polypeptide chains extracted from the 54 heterocomplexes downloaded from the PDB, which can be grouped into six categories: antibody–antigen, protease–inhibitor, enzyme complexes, large protease complexes, G-proteins and miscellaneous.

Among 27 445 residues in the 99 polypeptide chains, we extract 13 774 surface residues based on their relative solvent accessible surface areas (RASA) calculated by the DSSP program: a residue is considered a surface residue if its RASA is >25%. Furthermore, a surface residue is defined as an interface residue if the difference in accessible surface area (ASA) between its unbound molecule and bound complex is >1 Å². These definitions of surface residues and interface residues are commonly used in the literature (Gong et al., 2005; Jones and Thornton, 1996; Jones and Thornton, 1997a; Nguyen et al., 2006; Rost and Sander, 1994; Wang et al., 2006; Yan et al., 2003). Among the 13 774 surface residues, 2829 residues are defined as interface residues (positive class). Thus, the number of non-interface residues, including both non-binding surface residues and non-surface residues (negative class), is 24 616. This is an imbalanced data classification problem in which the ratio of negative to positive samples is about 9:1.

3.2 Evaluation criteria

To measure the performance of each predictor, we use leave-one-out cross-validation (LOOCV) and the following criterion functions, where true positive (TP) is the number of true interface residues that are predicted correctly; true negative (TN) is the number of true non-interface residues that are predicted correctly; false positive (FP) is the number of true non-interface residues that are predicted to be interface residues; and false negative (FN) is the number of true interface residues that are predicted to be non-interface residues.

$$\text{Overall accuracy} = \frac{TN + TP}{TN + FP + FN + TP}$$

$$\text{Sensitivity (positive accuracy)} = \frac{TP}{FN + TP}$$

$$\text{Specificity (negative accuracy)} = \frac{TN}{FP + TN}$$

$$\text{Balanced accuracy} = \sqrt{\text{Positive accuracy} \times \text{Negative accuracy}}$$

$$\text{Correlation coefficient (CC)} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$

The overall accuracy is the ratio of the number of correctly predicted residues (both positive and negative) to the total number of residues. It measures the overall performance of a classifier. In our application, since the number of positive samples is much smaller than that of negative samples, the overall accuracy may not be a good measure for evaluating the performance of a predictor. For imbalanced data classification, balanced accuracy and receiver operating characteristic (ROC) curves are typically used, where balanced accuracy is the square root of the product of positive accuracy and negative accuracy, and ROC curves are generated in terms of sensitivity and specificity. Additionally, CC, ranging from −1 to +1, is also a good measure. Its value is −1 for the worst possible predictor, +1 for the best possible predictor and 0 for a random predictor.

3.3 Leave-one-out test results

To evaluate the performance, we compare the proposed method to two sequence-based methods. The first method, introduced by Yan et al. (2003), uses PSSM with 11 neighbor residues. The second method, proposed by Wang et al. (2006), uses PSSM and evolutionary conservation score with 11 neighbor residues. Both methods use support vector machines (SVMs) for prediction. We implement the same methods and procedures as described in their papers.
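The evaluation criteria of Section 3.2 can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name `evaluate` is hypothetical, the counts are made-up toy values with a 9:1 negative-to-positive ratio (they are not results from this study), and returning a CC of 0 when the denominator vanishes is an assumed convention.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Criteria from Section 3.2, computed from confusion-matrix counts."""
    sens = tp / (tp + fn)              # positive accuracy
    spec = tn / (tn + fp)              # negative accuracy
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sens,
        "specificity": spec,
        "balanced_accuracy": math.sqrt(sens * spec),
        # Assumption: CC is reported as 0 when a row or column of the
        # confusion matrix is empty and the denominator is 0.
        "cc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Hypothetical counts: 100 positives and 900 negatives.
scores = evaluate(tp=70, tn=630, fp=270, fn=30)

# A predictor that labels every residue as non-interface.
always_negative = evaluate(tp=0, tn=900, fp=0, fn=100)
```

The second case shows why overall accuracy is a poor guide on imbalanced data: labelling everything as non-interface reaches 90% overall accuracy, yet its balanced accuracy and CC are both zero.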
Table 1. Protein categories and polypeptide chains with PDB ID

Antibody-antigen: 1AO7_A, 1AO7_B, 1AO7_D, 1AO7_E, 1DVF_AB, 1DVF_CD, 1IAI_LH, 1IAI_MI, 1JH1_A, 1KB5_AB, 1KB5_LH, 1NCA_LH, 1NCA_N, 1NFD_ABCD, 1NFD_EFGH, 1NMB_LH, 1NMB_N, 1NSN_LH, 1NSN_S, 1OSP_LH, 1OSP_O, 1QFU_A, 1QFU_B, 1QFU_H, 1QFU_L, 1YQV_LH, 2JEL_LH, 2JEL_P, 3HFM_LH

Protease-inhibitor: 1ACB_E, 1ACB_I, 1AVW_A, 1AVW_B, 1CHO_I, 1FLE_E, 1FLE_I, 1HIA_ABXY, 1HIA_IJ, 1MCT_A, 1STF_E, 1STF_I, 1TGS_I, 1TGS_Z, 2SIC_I, 2SNI_E, 2SNI_I, 3SGB_E, 4CPA_I

Enzyme: 1BRS_ABC, 1BRS_DEF, 1DFJ_E, 1DFJ_I, 1DHK_A, 1DHK_B, 1FSS_A, 1FSS_B, 1GLA_F, 1GLA_G, 1UDI_E, 1UDI_I, 1YDR_E, 1YDR_I

Large-protease: 1BTH_PQ, 1DAN_LH, 1DAN_TU, 1TBQ_LHJK, 1TBQ_RS, 1TOC_ABCDEFGH, 1TOC_RSTU, 4HTC_I

G-proteins: 1AGR_AD, 1AGR_EH, 1GG2_A, 1GG2_B, 1GG2_G, 1GOT_A, 1GOT_B, 1GOT_G, 1GUA_A, 1GUA_B, 1TX4_A, 1TX4_B, 2TRC_P

Miscellaneous: 1AK4_AB, 1ATN_A, 1ATN_D, 1DKG_AB, 1EFN_AC, 1FC2_C, 1FC2_D, 1HWG_A, 1HWG_BC, 1IGC_A, 1IGC_LH, 1SEB_ABEF, 1YCS_A, 1YCS_B, 2BTF_A, 2BTF_P

All the methods are trained and tested on the same datasets. To evaluate the performance, LOOCV is used: each time, one of the 99 polypeptide chains (including all the interface and non-interface residues in this polypeptide chain) is used as test data and the remaining 98 chains are used as training data; this process is repeated 99 times and the final results are averaged over the test results.

Fig. 1. The ROC curves for three sequence-based predictors (x-axis: 1 − specificity; y-axis: true positive rate (sensitivity); curves: Our-All, Yan-All, Wang-All).

Figure 1 shows the ROC curves for the three sequence-based predictors. An ROC curve is a plot of the sensitivity versus (1 − specificity) for a binary classifier as its decision boundary is moved. Sensitivity measures the capability of predicting positive samples (interface residues) correctly, and specificity determines whether any non-interface residues are incorrectly predicted as interface residues. The ROC curve of our method is constructed by changing the threshold we place on the majority vote of decision trees. Typically, the majority vote wins, which corresponds to a threshold of zero. A threshold of five implies that at least five more votes for binding site than for non-binding site are necessary to classify a residue as a binding site. Otherwise, the residue is predicted as a non-binding site. Therefore, with different thresholds, our model produces different values of specificity and sensitivity. The ROC curve for Yan's method is constructed by varying the threshold (bias) of the SVM decision boundary. Wang's method consists of five SVM models. Thus, the ROC curve for Wang's method is constructed by varying both the decision boundary of each SVM (with the same bias) and the threshold for the majority vote of the SVMs. The proposed method significantly outperforms Yan's and Wang's methods in terms of ROC curves: for example, with a specificity rate of 70%, the sensitivities of Yan's, Wang's and our methods are 30%, 39% and 73%, respectively.

Another function we use is CC, which measures how predicted results correlate with actual data. The CC values range from negative one (worst possible prediction) to positive one (perfect prediction). For a random predictor, the CC value is zero. With a specificity rate of 70%, the CC values are 0.00, 0.06 and 0.28 for Yan's, Wang's and our methods, respectively. With a sensitivity rate of 70%, the CC values are 0.02, 0.05 and 0.28 for Yan's, Wang's and our methods, respectively. Thus, the proposed method outperforms the other two methods and is significantly better than random guessing.

We also compare the best results on balanced accuracies (the square root of the product of positive accuracy and negative accuracy), which are commonly used for imbalanced data classification problems. Our method improves balanced accuracy by 23% and 17% compared with Yan's and Wang's methods, respectively (Fig. 2). The results clearly demonstrate that the proposed method is capable of predicting protein interaction sites with significantly better performance than these previous sequence-based methods.

Fig. 2. Comparison of prediction performance in terms of the best balanced accuracy for three sequence-based methods.

We further examine the predicted results using Jmol (Jmol) and VMD software (Humphrey et al., 1996). For the results presented in Figures 3 and 4, the model is trained using training data and decisions are made without changing the threshold (e.g. in our method, simple majority vote is used). In Figures 3 and 4, each sphere represents an atom. Green spheres denote true positives (true interface residues that are correctly predicted), blue spheres represent false negatives (interface residues that are predicted as non-interface residues) and red spheres indicate false positives (non-interface residues that are predicted as interface residues). Figure 3 shows the predicted interaction sites for four chains, L, H, M and I, in the idiotype-anti-idiotype Fab complex (Ban et al., 1994), using our method (Fig. 3a) and Wang's method (Fig. 3b), obtained by leaving out each chain and training the predictive models on the remaining chains. The result from Yan's method is similar to Figure 3b. Figure 4 shows the predicted interaction sites for two chains of 2CIO. Note that the structure of 2CIO was not available when we originally trained the model. Thus, it was not included in the group of 99 chains. We use 2CIO as a third, independent dataset for validating the three models. We also observe that our method identifies significantly more interaction residues for some complexes than the other two methods (more results can be found in the Supplementary Material: Figs S3 and S4).

Fig. 3. Predicted results of chains 1IAI_LH and 1IAI_MI in 1IAI using (a) our method, and (b) Wang's method.

Fig. 4. Predicted results of two chains in 2CIO using (a) our method, (b) Wang's method and (c) Yan's method.

In conclusion, we showed that the proposed method outperformed two other sequence-based methods. To understand whether the improvement is due to the choice of the predictive model or the use of new features, we conducted tests using our random forest with the same PSSM features as those used in Wang's method. The areas under the ROC curve (AUC) for the random forest and Wang's classifier with PSSM features are 0.75 and 0.58, respectively. The AUC of our method is 0.80. Thus, both the classifier and the new features contribute to the improvement: the AUCs obtained from the random forest method with PSSM features only and with all the new features increase by 0.17 and 0.22, respectively, compared with Wang's method. As expected, the random forest method is capable of dealing with imbalanced data classification problems.

We also used the Student's t-test to rank the features: among the top 50 features, the best four are PSSM features, and the remaining features are physicochemical properties and distance profiles (Supplementary Table S4).
3.4 Blind test

To show the applicability of the proposed method, three blind tests are conducted. Without knowing the true binding sites, the blind tests evaluate the capability of predicting interface residues with peptide complexes. To construct 3D structures, we used the VMD software (Humphrey et al., 1996). Figures 5–7 show the results, where again each sphere represents an atom, purple spheres denote the atoms in the predicted potential interface residues, and the orange spheres in Figure 7 denote the atoms also shown in Figure 5.

Fig. 5. DnaK molecular chaperone system: (a) DnaJ (PDB ID 1XBL), (b) the structure of DnaK C-terminal and (c) the structure of DnaK N-terminal. Orange structure denotes ATPase domain.

First, we test three structural components of the DnaK (eukaryotic Hsp70) molecular chaperone system, which is used in another study (Fariselli et al., 2002). A chaperone system aids protein folding/unfolding. The first two components are two DnaK domains: a C-terminal domain (1DKX, PDB ID) and an N-terminal domain (1DKG, PDB ID), which bind and release together. The third component is DnaJ (1XBL, PDB ID). The Hsp70 DnaK proteins are aided by the so-called J-domain cochaperones (Hsp40 proteins in eukaryotes, and DnaJ in prokaryotes), which dramatically increase the ATPase activity of the Hsp70s. ATP hydrolysis causes locking in of substrates into the substrate-binding cavity of Hsp70, and the cochaperone DnaJ modulates this ATP hydrolysis and substrate binding and is associated with conformational changes in DnaK (Gassler et al., 1998; Greene et al., 1998; Suh et al., 1998).

DnaJ structures (PDB ID 1XBL) consist of four α-helices and a loop region containing the HPD motif (a tripeptide of histidine, proline and aspartic acid residues) between two α-helices (Fig. 5a). This HPD motif is highly conserved and present in almost all known J domains. It is critical for stimulating Hsp70 ATPase activity, and mutations in the conserved tripeptide HPD of the J-domain abolish the ability of proteins to function with Hsp70 proteins; therefore, the HPD tripeptide could mediate specific interactions between Hsp40 and Hsp70 proteins (Fariselli et al., 2002; Gassler et al., 1998; Greene et al., 1998; Hennessy et al., 2000; Suh et al., 1998). Our method predicted two amino acid residues, 30-MET and 36-ARG, which are near the HPD motif (33-HIS, 34-PRO and 35-ASP). This is similar to the results of Greene et al. (1998), who found that the potential binding sites are residues between 1 and 35 and that binding sites are concentrated on the outer surface of helix II, a right-side α-helix where our prediction, 30-MET, is located; therefore, our predictions on DnaJ are reasonable. In addition, 71-HIS and 74-PHE are also shown as conserved residues based on the consensus of amino acid position (Hennessy et al., 2000).

For the Hsp70 DnaK N-terminal (PDB ID 1DKG) in Figure 5c, our prediction detected three residues (13-ASN, 116-SER, 174-ALA) on the ATPase domain (orange color in Fig. 5c). Other studies (Davis et al., 1999; Gassler et al., 1998) showed that most of the mutants that affect interaction with the C-terminal domain are located at the bottom of the ATPase domain. Our predictions are spatially very close to the mutants observed in Davis's and Gassler's studies.

For the Hsp70 DnaK C-terminal (PDB ID 1DKX) in Figure 5b, our method predicted 6-THR, 405-GLY, 447-ARG and 483-ILE as interface residues. Mutants observed by Davis and Montgomery are located in the loops on the sandwich sub-domain, which are close to the peptide-binding site; therefore, our predictions are in agreement with their experimental results (Davis et al., 1999; Montgomery et al., 1999). We believe that the predicted binding sites will provide new insights into the interaction.

Fig. 6. Predicted results of 1YUW using our method. Black color denotes C-terminal (amino acid 395–554), blue color denotes N-terminal (amino acid 1–383) of DnaK, and atoms in predicted binding sites are shown as purple spheres.

Figure 6 shows the direct binding between the DnaK C-terminal and N-terminal on the protein 1YUW in Bos taurus (Jiang et al., 2005). Notice that 1YUW is different from 1DKX in Escherichia coli (Zhu et al., 1996) (also shown in Fig. 5) in terms of both C-terminal sequences and their binding sites. Our results show that most predicted interface residues are concentrated in the alpha helix of the C-terminal, which is in agreement with the findings of Jiang et al. (2005). We also observe some predicted binding sites in the N-terminal that are not in the binding regions between the C-terminal and N-terminal. Some of these may be interaction sites between the N-terminal and other chains (e.g. GrpE).

Finally, we examined another DnaK effector, GrpE, by using 1DKG (Harrison et al., 1997), in which the DnaK N-terminal is co-crystallized with GrpE. GrpE is a nucleotide-exchange factor that binds sub-stoichiometrically to the ATPase unit, which is the DnaK N-terminal. Figure 7a shows the predicted binding sites in the DnaK N-terminal, which are very close to the reported binding residues in regions III, V and VI (Harrison et al., 1997). In Figure 7b, the predicted interaction residues on chain A of GrpE are also located in the reported binding regions II, III, IV, V and VI. We also predicted some binding sites in the upper right corner on chain B of GrpE, as a result of the fact that GrpE exists as a dimer.

Fig. 7. Predicted results of three chains in 1DKG using our method. (a) DnaK N-terminal chain D of 1DKG; orange spheres are the atoms in the predicted interface residues shown in Figure 5c. (b) GrpE, chains A and B of 1DKG. All spheres show our predicted atoms: purple spheres denote the atoms in the predicted interface residues and orange spheres denote the atoms also shown in Figure 5.

4 CONCLUSIONS

As genome-sequencing projects provide biologists with ready access to the rapidly increasing pool of protein sequences, there is a growing demand for advanced computational methods that predict potential protein binding sites by using sequence information only. In this article, we demonstrate a predictive system that can reliably identify protein interface sites for protein complexes. The proposed predictive system is based on the analysis of protein sequence information, without knowing protein structures. A wide variety of physicochemical properties and sequence profiling properties are effectively integrated using a random forest framework.

The predicted interaction sites can be valuable as a first approach for guiding experimental methods investigating protein–protein interactions and localizing the specific interface residues. We illustrate the usefulness of the proposed method for predicting putative binding sites for the DnaK molecular chaperone system, 1YUW and 1DKG. In our future work, we will evaluate the relative importance of the features, which should help to understand the underlying binding process.

ACKNOWLEDGEMENTS

We wish to thank Ozlen Keskin for helping us in modeling 3D protein structures with VMD software. We also thank the reviewers for their valuable suggestions.

Funding: National Science Foundation award (IIS-0644366).

Conflicts of Interest: none declared.

REFERENCES

Aytuna,A.S. et al. (2005) Prediction of protein-protein interactions by combining structure and sequence conservation in protein interfaces. Bioinformatics, 21, 2850–2855.
Ban,N. et al. (1994) Crystal structure of an idiotype-anti-idiotype Fab complex. Proc. Natl Acad. Sci. USA, 91, 1604–1608.
Berman,H.M. et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.
Bradford,J.R. and Westhead,D.R. (2005) Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics, 21, 1487–1494.
Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32.
Chakrabarti,P. and Janin,J. (2002) Dissecting protein-protein recognition sites. Proteins, 47, 334–343.
Chen,H. and Zhou,H.X. (2005) Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. Proteins, 61, 21–35.
Chen,X.W. and Liu,M. (2005) Prediction of protein-protein interactions using random decision forest framework. Bioinformatics, 21, 4394–4400.
Chung,J.L. et al. (2006) Exploiting sequence and structure homologs to identify protein-protein binding sites. Proteins, 62, 630–640.
Davis,J.E. et al. (1999) Intragenic suppressors of Hsp70 mutants: interplay between the ATPase- and peptide-binding domains. Proc. Natl Acad. Sci. USA, 96, 9269–9276.
De Loof,H. et al. (1986) Use of hydrophobicity profiles to predict receptor binding domains on apolipoprotein E and the low density lipoprotein apolipoprotein B-E receptor. Proc. Natl Acad. Sci. USA, 83, 2295–2299.
Eisenberg,D. et al. (1982) The helical hydrophobic moment: a measure of the amphiphilicity of a helix. Nature, 299, 371–374.
Eisenberg,D. et al. (1984) Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J. Mol. Biol., 179, 125–142.
Fariselli,P. et al. (2002) Prediction of protein–protein interaction sites in heterocomplexes with neural networks. Eur. J. Biochem. FEBS, 269, 1356–1361.
Gabb,H.A. et al. (1997) Modelling protein docking using shape complementarity, electrostatics and biochemical information. J. Mol. Biol., 272, 106–120.
Gallet,X. et al. (2000) A fast method to predict protein interaction sites from sequences. J. Mol. Biol., 302, 917–926.
Gassler,C.S. et al. (1998) Mutations in the DnaK chaperone affecting interaction with the DnaJ cochaperone. Proc. Natl Acad. Sci.
Kuntz,I.D. et al. (1982) A geometric approach to macromolecule-ligand interactions. J. Mol. Biol., 161, 269–288.
Lo Conte,L. et al. (1999) The atomic structure of protein-protein recognition sites. J. Mol. Biol., 285, 2177–2198.
Montgomery,D.L. et al. (1999) Mutations in the substrate binding domain of the Escherichia coli 70 kDa molecular chaperone, DnaK, which alter substrate affinity or interdomain coupling. J. Mol. Biol., 286, 915–932.
Nguyen,N. et al. (2006) Protein-protein interface residue prediction with SVM
USA, 95, 15229–15234. using evolutionary proﬁles and accessible surface areas. In Proceedings of IEEE Gong,S. et al. (2005) A protein domain interaction interface database: InterPare. BMC Symposium on Computational Intellegence Bioinformatics Computation Biology. Bioinformatics, 6, 207. pp. 1–5. Greene,M.K. et al. (1998) Role of the J-domain in the cooperation of Hsp40 with Hsp70. Norel,R. et al. (1995) Molecular surface complementarity at protein-protein interfaces: Proc. Natl Acad. Sci. USA, 95, 6108–6113. the critical role played by surface normals at well placed, sparse, points in docking. Harrison,C.J. et al. (1997) Crystal structure of the nucleotide exchange factor GrpE J. Mol. Biol., 252, 263–273. bound to the ATPase domain of the molecular chaperone DnaK. Science, 276, Palma,P.N. et al. (2000) BiGGER: a new (soft) docking algorithm for predicting protein 431–435. interactions. Proteins, 39, 372–384. Helmer-Citterich,M. and Tramontano,A. (1994) PUZZLE: a new method for automated Pazos,F. et al. (1997) Correlated mutations contain information about protein-protein protein docking based on surface shape complementarity. J. Mol. Biol., 235, interaction. J. Mol. Biol., 271, 511–523. 1021–1031. Rost,B. and Sander,C. (1994) Conservation and prediction of solvent accessibility in Hennessy,F. et al. (2000) Analysis of the levels of conservation of the J domain among protein families. Proteins, 20, 216–226. the various types of DnaJ-like proteins. Cell Stress Chaperones, 5, 347–358. Salemme,F.R. (1976) An hypothetical structure for an intermolecular electron transfer Ho,T.K. (1998) The random subspace method for constructing decision forests. IEEE complex of cytochromes c and b5. J. Mol. Biol., 102, 563–568. Trans. Pattern Anal. Mach. Intell., 20, 832–844. Schneider,R. and Sander,C. (1996) The HSSP database of protein structure-sequence Humphrey,W. et al. (1996) VMD: visual molecular dynamics. J. Mol. Graph, 14, 33–38, alignments. Nucleic Acids Res., 24, 201–205. 
27–38. Shoichet,B.K. and Kuntz,I.D. (1991) Protein docking and complementarity. J. Mol. Jiang,F. and Kim,S.H. (1991) “Soft docking”: matching of molecular surface cubes. Biol., 221, 327–346. J. Mol. Biol., 219, 79–102. Suh,W.C. et al. (1998) Interaction of the Hsp70 molecular chaperone, DnaK, with its Jiang,J. et al. (2005) Structural basis of interdomain communication in the Hsc70 cochaperone DnaJ. Proc. Natl Acad. Sci. USA, 95, 15223–15228. chaperone. Mol. cell, 20, 513–524. Uniprot (2008) The Universal Protein Resource (UniProt). Nucleic Acids Res., 36, Jmol. Jmol: an open-source Java viewer for chemical structures in 3D. Available at D190–D195. http://www.jmol.org. Voet,D. and Voet,J.G. (2004) Biochemistry. J. Wiley & Sons, Hoboken, NJ. Jones,S. and Thornton,J.M. (1996) Principles of protein-protein interactions. Proc. Natl Walls,P.H. and Sternberg,M.J. (1992) New algorithm to model protein-protein Acad. Sci. USA, 93, 13–20. recognition based on surface complementarity. Applications to antibody-antigen Jones,S. and Thornton,J.M. (1997a) Analysis of protein-protein interaction sites using docking. J. Mol. Biol., 228, 277–297. surface patches. J. Mol. Biol., 272, 121–132. Wang,B. et al. (2006) Predicting protein interaction sites from residue spatial sequence Jones,S. and Thornton,J.M. (1997b) Prediction of protein-protein interaction sites using proﬁle and evolution rate. FEBS Lett., 580, 380–384. patch analysis. J. Mol. Biol., 272, 133–143. Warwicker,J. (1989) Investigating protein-protein interaction surfaces using a reduced Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern stereochemical and electrostatic model. J. Mol. Biol., 206, 381–395. recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, Wodak,S.J. and Janin,J. (1978) Computer analysis of protein-protein interaction. J. Mol. 2577–2637. Biol., 124, 323–342. Katchalski-Katzir,E. et al. (1992) Molecular surface recognition: determination of Yan,C. et al. 
(2003) Identiﬁcation of surface residues involved in protein-protein geometric ﬁt between proteins and their ligands by correlation techniques. Proc. interaction-a support vector machine approach. In Proceedings of the Conference Natl Acad. Sci. USA, 89, 2195–2199. on Intellegence System Design Application. pp. 53–62. Keskin,O. et al. (2005) Hot regions in protein–protein interactions: the organization Yan,C. et al. (2004) A two-stage classiﬁer for identiﬁcation of protein-protein interface and contribution of structurally conserved hot spot residues. J. Mol. Biol., 345, residues. Bioinformatics, 20(Suppl. 1), i371–i378. 1281–1294. Zhou,H.X. and Shan,Y. (2001) Prediction of protein interaction sites from sequence Kini,R.M. and Evans,H.J. (1996) Prediction of potential protein-protein interaction sites proﬁle and residue neighbor list. Proteins, 44, 336–343. from amino acid sequence. Identiﬁcation of a ﬁbrin polymerization site. FEBS Lett., Zhu,X. et al. (1996) Structural analysis of substrate binding by the molecular chaperone 385, 81–86. DnaK. Science, 272, 1606–1614. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/sequence-based-prediction-of-protein-interaction-sites-with-an-gwjI0mE6W4


Publisher: Oxford University Press
Copyright: © The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/btp039
pmid: 19153136

Abstract

Vol. 25 no. 5 2009, pages 585–591
BIOINFORMATICS ORIGINAL PAPER
doi:10.1093/bioinformatics/btp039
Sequence analysis

Sequence-based prediction of protein interaction sites with an integrative method

Xue-wen Chen (1,2,*) and Jong Cheol Jeong (1)
(1) Bioinformatics and Computational Life Sciences Laboratory, Information and Telecommunication Technology Center and (2) Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA
(*) To whom correspondence should be addressed.

Received on June 2, 2008; revised on January 14, 2009; accepted on January 15, 2009
Advance Access publication January 19, 2009
Associate Editor: Limsoon Wong

ABSTRACT

Motivation: Identification of protein interaction sites has significant impact on understanding protein function, elucidating signal transduction networks and drug design studies. With the exponentially growing protein sequence data, predictive methods using sequence information only for protein interaction site prediction have drawn increasing interest. In this article, we propose a predictive model for identifying protein interaction sites. Without using any structure data, the proposed method extracts a wide range of features from protein sequences. A random forest-based integrative model is developed to effectively utilize these features and to deal with the imbalanced data classification problem commonly encountered in binding site predictions.

Results: We evaluate the predictive method using 2829 interface residues and 24 616 non-interface residues extracted from 99 polypeptide chains in the Protein Data Bank. The experimental results show that the proposed method performs significantly better than two other sequence-based predictive methods and can reliably predict residues involved in protein interaction sites. Furthermore, we apply the method to predict interaction sites and to construct three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which provide new insight into the sequence–function relationship. We show that the predicted interaction sites can be valuable as a first approach for guiding experimental methods investigating protein–protein interactions and localizing the specific interface residues.

Availability: Datasets and software are available at http://ittc.ku.edu/~xwchen/bindingsite/prediction.

Contact: xwchen@ku.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Protein–protein interaction plays an essential role in nearly all cell functions, such as promoting chemical reactions and acting as antibodies. Consequently, identification of protein interaction sites is critical for understanding protein function and for elucidating metabolic and signal transduction networks. It could also help in rational drug design studies (Gallet et al., 2000). A commonly used technique in identifying protein interaction sites is in silico methods, which are of great importance in molecular recognition and are considered a good starting point to form hypotheses in searching for potential pharmacological targets for the design of drugs (Gallet et al., 2000).

Roughly speaking, computational methods can be categorized into two groups: molecular docking of two proteins with known structures and the identification of putative interaction sites on an isolated protein without knowing the structure of its partner or complex (Gallet et al., 2000). While a number of computational methods for predicting protein interaction sites have been developed over the years, most of them require known protein structure information (Aytuna et al., 2005; Bradford and Westhead, 2005; Chen and Zhou, 2005; Chung et al., 2006; Fariselli et al., 2002; Gabb et al., 1997; Helmer-Citterich and Tramontano, 1994; Jiang and Kim, 1991; Jones and Thornton, 1997a, b; Katchalski-Katzir et al., 1992; Keskin et al., 2005; Kuntz et al., 1982; Norel et al., 1995; Palma et al., 2000; Salemme, 1976; Shoichet and Kuntz, 1991; Walls and Sternberg, 1992; Warwicker, 1989; Wodak and Janin, 1978; Zhou and Shan, 2001). Despite much effort in structural genomics, the number of protein structures, determined by time-consuming and expensive experimental technologies, is significantly smaller than that of protein sequences produced by large-scale DNA sequencing methods. For example, by July 29, 2008, there were 392 667 identified protein sequences in Uniprot/Swissprot (reviewed, manually annotated) (Uniprot, 2008) and only 47 978 known protein structures in PDB (Berman et al., 2000). Thus, it is now more important than ever to identify protein interaction sites from amino acid sequences only, without knowing structural data.

Several studies have attempted to address the sequence-based interaction site prediction problem. Kini and Evans (1996) observed that proline is the most common residue in a large number of protein interaction sites. Pazos et al. (1997) used multiple sequence alignment to detect correlated changes to a group of interacting protein domains for predicting contacting pairs of residues. Gallet et al. (2000) analyzed hydrophobicity distribution and amino acid frequencies in known interaction sites for identifying linear stretches of sequences. Most recently, more complicated machine learning methods have been applied to predict interaction sites. Yan et al. (2003) applied support vector machines (SVMs) to predict interface sites with features extracted from sequence neighbors for each target residue. Wang et al. (2006) also employed SVMs as classifiers with features extracted from spatial sequence and evolutionary conservation scores based on a phylogenetic tree.
The sequence-based method, increasingly important in protein interaction site prediction, is still in its infancy. Several issues exist that make the prediction from sequences a very difficult task. The two main problems are: (i) the biological properties that are responsible for protein–protein interactions are not fully understood, which leads to the difficulty of extracting informative features common to all the binding sites; and (ii) the number of interacting sites of a protein is much smaller than that of non-interacting sites, which leads to a very challenging problem, the so-called imbalanced data classification problem. Our article addresses these problems by extracting a wide variety of features from amino acid sequences and by developing a random forest-based method to effectively integrate these features and at the same time deal with imbalanced-data problems. The extracted features can be grouped into three categories: physicochemical properties and evolutionary conservation score, residue-based distance matrix and sequence profile. While each group of features may represent common characteristics for a certain number of interaction sites, none of the features is the dominant factor that is capable of describing the common effect among all the interface residues. For example, although hydrophobicities may be useful for predicting interaction sites in homodimers, they have only moderate power to predict interaction sites in other types of complexes (Jones and Thornton, 1996; Lo Conte et al., 1999). Thus, effectively integrating a large number of features is critical for a reliable predictive model. To utilize all the extracted features, an integrative random forest framework is developed. Our results evaluated on 99 polypeptide chains show that the proposed method outperforms two other sequence-based methods. Furthermore, we apply this method to identify potential interaction sites for three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which may provide new insight into the sequence–function relationship.

2 METHODS

2.1 Extracting a large number of features

To build a predictor that can distinguish interface residues from non-interface sites, we extract features based on physicochemical property, evolutionary conservation score, amino acid distances, and position-specific score matrix (PSSM). Instead of using one long feature vector, we divide the features into three groups based on their sources, as features extracted from different sources may have different distributions (e.g. hydrophobicity in the physicochemical feature set versus the shortest distance of amino acid residues to a target residue in distance features) (see Supplementary Material for detailed discussions).

Group 1 (physicochemical features and evolutionary conservation score): the first two features are hydrophobicity and hydrophobic moments, which were initially used to distinguish membrane α-helix proteins from soluble proteins (Eisenberg et al., 1982, 1984) and later to predict protein binding sites in the apolipoprotein E sequence (De Loof et al., 1986; Gallet et al., 2000). For each amino acid, a sliding window centered at this amino acid is moved along the protein sequence and the mean hydrophobicity and mean hydrophobic moment are calculated as follows and assigned to the center amino acid.

$$\langle H_i \rangle = \frac{1}{2N+1} \sum_{n=-N}^{N} h_n^{(i)} \qquad (1)$$

$$\langle \mu_i \rangle = \frac{1}{2N+1} \sqrt{\left[\sum_{n=-N}^{N} h_n^{(i)} \sin(\delta n)\right]^2 + \left[\sum_{n=-N}^{N} h_n^{(i)} \cos(\delta n)\right]^2} \qquad (2)$$

where $2N+1$ is the size of the sliding window centered around amino acid $i$, $h_n^{(i)}$ is the hydrophobicity of the amino acid (AA) that is $n$ AAs away from AA $i$, and $\delta$ is the gyration angle between two consecutive residues in the sequence. Gallet et al. (2000) found the method to be most successful when they used $N = 5$ and $\delta = 100^\circ$. The hydrophobicities of each amino acid are taken from the scale developed by Eisenberg et al. (1984).

We also extract seven other physicochemical properties including hydrophilicity, hydrophilic moment, propensity, propensity moment, isoelectric point, isoelectric moment and mass (Jones and Thornton, 1997a; Voet and Voet, 2004). The residue interface propensities quantify whether an amino acid is possibly exposed to solvent or buried in an interface (Jones and Thornton, 1997b). Furthermore, the evolutionary conservation score for each residue is also calculated by HSSP (Schneider and Sander, 1996) and used as a feature. Thus, for each residue, we can extract nine physicochemical features and one evolutionary conservation score.

Group 2 (amino acid distance): the frequency of occurrence of proline residues was initially used for analyzing interaction sites by Kini and Evans (1996). They examined 1600 protein–protein interaction sequences and found proline residues on at least one side of 88.2% of the binding sites. The proline residues generally occurred within four residues of the binding site and often within two residues. Inspired by this idea, we examine the shortest distance from the current residue to each of the 20 amino acid residues. This creates a vector with a size of 20 for each residue.

Group 3 (PSSM): PSSM is calculated by HSSP using multiple sequence alignment. The likelihood of 20 amino acid substitutions at a given alignment position is used as our PSSM features.

For each target residue we are considering, its features are extracted by using a sliding window with a size of 21 centered on this target residue, i.e. the feature vector for the central residue consists of features extracted from itself and 10 amino acids on each side of this residue. Consequently, the total numbers of features for each target residue are 210, 420 and 420 for groups 1, 2 and 3, respectively. While each group of features may not be the dominant factor to describe the common effect among all the interface residues, collectively, they are capable of effectively characterizing protein interface sites. Since the total number of features is 1050, a carefully designed classifier is needed to effectively utilize the large number of features. Next, we describe the random forest method.

2.2 Constructing an integrative random forest model

To effectively utilize the large number of extracted features and to deal with the imbalanced data classification problems, we herein describe an integrative random forest method for predicting interaction sites. The random forest has been applied to protein–protein interaction prediction in our recent work (Chen and Liu, 2005), but not to binding site problems. When the input space is extraordinarily large as in our application, random subspace feature selection introduced by Ho (1998) can improve classifier diversity. A random forest consists of an ensemble of decision trees from randomly sampled subspaces of the input features, and the final classification is obtained by combining results from the trees via voting (Breiman, 2001). It is crucial to produce a large number of sufficiently different trees when using the combined power of multiple trees to increase accuracy. The use of randomization in feature selection is a way to explore various possibilities of subspaces. While most classification methods suffer from the curse of dimensionality, the random subspace feature selection method can take advantage of the high dimensionality. In contrast to Occam's razor, the method improves accuracy as it grows in complexity (Ho, 1998).

The random forest can also deal with imbalanced data problems. It constructs many decision trees and each is grown from a different subset of training data. To construct each individual decision tree, training samples are randomly selected with replacement from the original training dataset. In our application, to build each tree, we randomly select the same number of samples for each class, which converts the imbalanced data problem to multiple balanced data classification problems. If the number of positive samples (minority class) in the original training set is N, then N samples are randomly drawn with replacement for each class. At each splitting or decision node, the best splitting feature is chosen from a randomly selected subspace of m features, where m is much smaller than M, the total number of features. Each tree in the forest is grown to the largest extent possible without pruning. To classify a new object, each tree in the forest gives a classification which is interpreted as the tree 'voting' for that class. The final classification of the object is determined by majority votes among the classes decided by the forest of trees.

Furthermore, for each group of features, we generate a random forest classifier. This is because features from different groups have different distributions. Building a forest classifier for each group of features can effectively integrate all the features for better performance. For each feature group, we generate 100 trees: each tree is built using 100 randomly selected features from each feature group and the same number of positives and negatives. The final decision is made by majority vote.

3 EXPERIMENTAL RESULTS

3.1 Data sources

The proteins used in this article were extracted from a set of 70 protein–protein heterocomplexes used in the studies of Chakrabarti and Janin (2002) and Yan et al. (2004). Redundant proteins and molecules with fewer than 10 residues and proteins with sequence identity ≥30% were removed. Some proteins which are not available in HSSP and DSSP programs (Kabsch and Sander, 1983) were also omitted. Finally, we end up with 54 heterocomplexes for our studies. Table 1 lists the 99 polypeptide chains extracted from the 54 heterocomplexes downloaded from PDB, which can be grouped into six categories: antibody–antigen, protease–inhibitor, enzyme complexes, large protease complexes, G-proteins and miscellaneous.

is an imbalanced data classification problem where the ratio of negative to positive samples is about 9:1.

3.2 Evaluation criteria

To measure the performance of each predictor, we use leave-one-out cross-validation (LOOCV) and the following criterion functions, where true positive (TP) is the number of true interface residues that are predicted correctly; true negative (TN) is the number of true non-interface residues that are predicted correctly; false positive (FP) is the number of true non-interface residues that are predicted to be interface residues; and false negative (FN) is the number of true interface residues that are predicted to be non-interface residues.

$$\text{Overall accuracy} = \frac{TN + TP}{TN + FP + FN + TP}$$

$$\text{Sensitivity (positive accuracy)} = \frac{TP}{FN + TP}$$

$$\text{Specificity (negative accuracy)} = \frac{TN}{FP + TN}$$

$$\text{Balanced accuracy} = \text{Positive accuracy} \times \text{Negative accuracy}$$

$$\text{Correlation coefficient (CC)} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FN)(TP+FP)(TN+FP)(TN+FN)}}$$

The overall accuracy is the ratio of the number of correctly predicted residues (both positive and negative) to the total number of residues. It measures the overall performance of a classifier. In our application, since the number of positive samples is much smaller than that of negative samples, the overall accuracy may not be a good measure for evaluating the performance of a predictor. For imbalanced data classification, balanced accuracy and receiver operating characteristic (ROC) curves are typically used, where
balanced accuracy is related to the product of both positive accuracy Among 27 445 residues in the 99 polypeptide chains, we extract and negative accuracy and ROC curves are generated in terms of 13 774 surface residues based on their relative solvent accessible sensitivity and speciﬁcity. Additionally, CC, ranging from −1to+1, surface areas (RASA) calculated by the DSSP program: a residue is is also a good measure. Its value is -1 for a worst possible predictor, considered as a surface residue if its RASA is >25%. Furthermore, +1 for a best possible predictor and 0 for a random predictor. a surface residue is deﬁned as an interface residue if the difference of accessible surface areas (ASA) between its unbound molecule 3.3 Leave-one-out test results and bounded complex is >1Å . The deﬁnitions for surface residues and interface residues are commonly used in other literatures (Gong To evaluate the performance, we compare the proposed method et al., 2005; Jones and Thornton, 1996; Jones and Thornton, 1997a; to two sequence-based methods. The ﬁrst method, introduced by Nguyen et al., 2006; Rost and Sander, 1994; Wang et al., 2006; Yan Yan et al. (2003) uses PSSM with 11 neighbor residues. The et al., 2003). Among the 13 774 surface residues, 2829 residues are second method, proposed by Wang et al. (2006) uses PSSM deﬁned as interface residues (positive class). Thus, the number of and evolutionary conservation score with 11 neighbor residues. non-interface residues including both non-binding surface residues Both methods use support vector machines (SVMs) for prediction. and non-surface residues (negative class) is 24 616. Apparently, this We implement the same methods and procedures as described in Table 1. 
Protein categories and polypeptide chains with PDB ID Antibody-antigen 1AO7_A, 1AO7_B, 1AO7_D, 1AO7_E, 1DVF_AB, 1DVF_CD, 1IAI_LH, 1IAI_MI, 1JH1_A, 1KB5_AB 1KB5_LH, 1NCA_LH, 1NCA_N, 1NFD_ABCD, 1NFD_EFGH, 1NMB_LH, 1NMB_N, 1NSN_LH, 1NSN_S 1OSP_LH, 1OSP_O, 1QFU_A , 1QFU_B, 1QFU_H, 1QFU_L, 1YQV_LH, 2JEL_LH, 2JEL_P, 3HFM_LH Protease-inhibitor 1ACB_E, 1ACB_I, 1AVW_A, 1AVW_B, 1CHO_I, 1FLE_E, 1FLE_I, 1HIA_ABXY, 1HIA_IJ 1MCT_A, 1STF_E, 1STF_I, 1TGS_I, 1TGS_Z, 2SIC_I, 2SNI_E, 2SNI_I, 3SGB_E, 4CPA_I Enzyme 1BRS_ABC, 1BRS_DEF, 1DFJ_E, 1DFJ_I, 1DHK_A, 1DHK_B, 1FSS_A 1FSS_B, 1GLA_F, 1GLA_G, 1UDI_E, 1UDI_I, 1YDR_E, 1YDR_I Large-protease 1BTH_PQ, 1DAN_LH, 1DAN_TU, 1TBQ_LHJK, 1TBQ_RS, 1TOC_ABCDEFGH, 1TOC_RSTU, 4HTC_I G-proteins 1AGR_AD, 1AGR_EH, 1GG2_A, 1GG2_B, 1GG2_G, 1GOT_A, 1GOT_B 1GOT_G, 1GUA_A, 1GUA_B, 1TX4_A, 1TX4_B, 2TRC_P Miscellaneous 1AK4_AB, 1ATN_A, 1ATN_D, 1DKG_AB, 1EFN_AC, 1FC2_C, 1FC2_D, 1HWG_A 1HWG_BC, 1IGC_A, 1IGC_LH, 1SEB_ABEF, 1YCS_A, 1YCS_B, 2BTF_A, 2BTF_P 587 X.-w.Chen and J.C.Jeong 0.9 0.8 0.7 0.6 0.5 0.4 0.3 Our-All 0.2 Yan-All Wang-All 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Fig. 2. Comparison of prediction performance in terms of the best balanced 1-specificity accuracy for three sequence-based methods. Fig. 1. The ROC curves for three sequence-based predictors. their papers. All the methods are trained and tested on the same datasets. To evaluate the performance, a LOOCV is used: each time, one of the 99 polypeptide chairs (including all the interface and non- interface residues in this polypeptide chain) is used as test data and (a) (b) the remaining 98 chains are used as training data; this process is repeated 99 times and the ﬁnal results are averaged over the test Fig. 3. Predicted results of chains 1IAI_LH and 1IAI_MI in 1IAI using (a) results. our method, and (b) Wang’s method. Figure 1 shows the ROC curves for three sequence-based predictors. 
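The criterion functions of Section 3.2 depend only on the four confusion-matrix counts, so they can be checked with a few lines of code. The following is a minimal sketch rather than the authors' implementation: the function name `interface_metrics` and the convention of returning 0 when a denominator vanishes are assumptions made here for illustration.

```python
import math


def interface_metrics(tp, tn, fp, fn):
    """Compute the Section 3.2 criteria from confusion-matrix counts.

    Hypothetical helper: the name and the zero-denominator convention
    (return 0.0) are illustrative assumptions, not taken from the paper.
    """
    total = tp + tn + fp + fn
    # Sensitivity (positive accuracy) and specificity (negative accuracy).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Denominator of the correlation coefficient (CC).
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Balanced accuracy: square root of the product of both accuracies.
        "balanced_accuracy": math.sqrt(sensitivity * specificity),
        "cc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# A trivial predictor that labels all 27 445 residues as non-interface,
# using the class sizes reported in Section 3.1 (2829 positives, 24 616 negatives).
trivial = interface_metrics(tp=0, tn=24616, fp=0, fn=2829)
print(trivial)
```

With the class sizes of Section 3.1, the trivial all-negative predictor reaches an overall accuracy of about 0.897 while its balanced accuracy and CC are both 0 under the convention above, which illustrates why overall accuracy alone is not a suitable measure for this dataset.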
An ROC curve is a plot of the sensitivity versus (1 − specificity) for a binary classifier as its decision boundary is moved. Sensitivity measures the capability of predicting positive samples (interface residues) correctly, and specificity measures how well non-interface residues are kept from being incorrectly predicted as interface residues. The ROC curve of our method is constructed by changing the threshold we place on the majority vote of the decision trees. Typically, the simple majority wins, which corresponds to a threshold of zero. A threshold of five implies that at least five more votes for binding site than for non-binding site are necessary to classify a residue as a binding site; otherwise, the residue is predicted as a non-binding site. Therefore, with different thresholds, our model produces different values of specificity and sensitivity. The ROC curve for Yan's method is constructed by varying the threshold (bias) of the SVM decision boundary. Wang's method consists of five SVM models. Thus, the ROC curve for Wang's method is constructed by varying both the decision boundary of each SVM, with the same bias, and the threshold for the majority vote of the SVMs. The proposed method significantly outperforms Yan's and Wang's methods in terms of ROC curves: for example, at a specificity of 70%, the sensitivities of Yan's, Wang's and our methods are 30%, 39% and 73%, respectively. Another function we use is CC, which measures how predicted results correlate with actual data. The CC values range from negative one (worst possible prediction) to positive one (perfect prediction). For a random predictor, the CC value is zero. At a specificity of 70%, the CC values are 0.00, 0.06 and 0.28 for Yan's, Wang's and our methods, respectively. At a sensitivity of 70%, the CC values are 0.02, 0.05 and 0.28 for Yan's, Wang's and our methods, respectively. Thus, the proposed method outperforms the other two methods and is significantly better than random guessing. We also compare the best results on balanced accuracy (the square root of the product of positive accuracy and negative accuracy), which is commonly used for imbalanced data classification problems. Our method improves balanced accuracy by 23% and 17% compared with Yan's and Wang's methods, respectively (Fig. 2). The results clearly demonstrate that the proposed method is capable of predicting protein interaction sites with significantly better performance than these previous sequence-based methods.

We further examine the predicted results using Jmol (Jmol) and VMD software (Humphrey et al., 1996). For the results presented in Figures 3 and 4, the model is trained using training data and decisions are made without changing the threshold (e.g. in our method, simple majority vote is used). In Figures 3 and 4, each sphere represents an atom. Green spheres denote true positives (true interface residues that are correctly predicted), blue spheres represent false negatives (interface residues that are predicted as non-interface residues) and red spheres indicate false positives (non-interface residues that are predicted as interface residues). Figure 3 shows the predicted interaction sites for four chains, L, H, M and I, in the idiotype-anti-idiotype Fab complex (Ban et al., 1994), using our method (Fig. 3a) and Wang's method (Fig. 3b), obtained by leaving out each chain and training the predictive models on the remaining chains. The result from Yan's method is similar to Figure 3b. Figure 4 shows the predicted interaction sites for two chains of 2CIO. Note that the structure of 2CIO was not available when we originally trained the model. Thus, it was not included in the group of 99 chains. We use 2CIO as independent data for validating the three models. We also observe that our method identifies significantly more interaction residues for some complexes than the other two methods (more results can be found in the Supplementary Material: Figs S3 and S4).

Fig. 4. Predicted results of two chains in 2CIO using (a) our method, (b) Wang's method and (c) Yan's method.

In conclusion, we showed that the proposed method outperformed two other sequence-based methods. To understand whether the improvement is due to the choice of the predictive model or the use of new features, we conducted tests using our random forest with the same PSSM features as those used in Wang's method. The areas under the ROC curve (AUC) for the random forest and Wang's classifier together with PSSM features are 0.75 and 0.58, respectively. The AUC of our method is 0.80. Thus, both the classifier and the new features contribute to the improvement: the AUCs obtained from the random forest method with PSSM features only and with all the new features increase by 0.17 and 0.22, respectively, compared with Wang's method. As expected, the random forest method is capable of dealing with imbalanced data classification problems. We also used Student's t-test to rank the features: among the top 50 features, the best four are PSSM features, and the remaining features are physicochemical properties and the distance profiles (Supplementary Table S4).

3.4 Blind test

To show the applicability of the proposed method, three blind tests are conducted. Without knowing the true binding sites, the blind tests evaluate the capability of predicting interface residues with peptide complexes. To construct 3D structures, we used the VMD software (Humphrey et al., 1996). Figures 5–7 show the results, where again each sphere represents an atom, purple spheres denote the atoms in the predicted potential interface residues, and the orange spheres in Figure 7 denote the atoms also shown in Figure 5.

Fig. 5. DnaK molecular chaperone system: (a) DnaJ (PDB ID 1XBL), (b) the structure of DnaK C-terminal and (c) the structure of DnaK N-terminal. Orange structure denotes ATPase domain.

First, we test three structural components of the DnaK (the bacterial homolog of eukaryotic Hsp70) molecular chaperone system, which is used in another study (Fariselli et al., 2002). A chaperone system aids protein folding/unfolding. The first two components are two DnaK domains: a C-terminal domain (1DKX, PDB ID) and an N-terminal domain (1DKG, PDB ID), which are binding and releasing together. The third component is DnaJ (1XBL, PDB ID). The Hsp70 DnaK proteins are aided by the so-called J-domain cochaperones (Hsp40 proteins in eukaryotes, and DnaJ in prokaryotes), which dramatically increase the ATPase activity of the Hsp70s. ATP hydrolysis causes locking in of substrates into the substrate-binding cavity of Hsp70, and the cochaperone DnaJ modulates ATP hydrolysis and substrate binding and is associated with conformational changes in DnaK (Gassler et al., 1998; Greene et al., 1998; Suh et al., 1998).

DnaJ structures (PDB ID 1XBL) consist of four α-helices and a loop region containing the HPD motif (a tripeptide of histidine, proline and aspartic acid residues) between two α-helices (Fig. 5a). This HPD motif is highly conserved and present in almost all known J domains. It is critical for stimulating Hsp70 ATPase activity, and mutations in the conserved HPD tripeptide of the J-domain abolish the ability of the proteins to function with Hsp70 proteins; therefore, the HPD tripeptide could mediate specific interactions between Hsp40 and Hsp70 proteins (Fariselli et al., 2002; Gassler et al., 1998; Greene et al., 1998; Hennessy et al., 2000; Suh et al., 1998). Our method predicted two amino acid residues, 30-MET and 36-ARG, which are near the HPD motif (33-HIS, 34-PRO and 35-ASP). This is similar to the results of Greene et al. (1998), who found that the potential binding sites are residues between 1 and 35 and that binding sites are concentrated on the outer surface of helix II, which is a right-side α-helix where our prediction, 30-MET, is located; therefore, our predictions on DnaJ are reasonable. In addition, 71-HIS and 74-PHE are also shown as conserved residues based on the consensus of amino acid position (Hennessy et al., 2000).

For the Hsp70 DnaK N-terminal (PDB ID 1DKG) in Figure 5c, our prediction detected three residues (13-ASN, 116-SER, 174-ALA) on the ATPase domain (orange color in Fig. 5c). Other studies (Davis et al., 1999; Gassler et al., 1998) showed that most of the mutants that affect interaction with the C-terminal domain are located at the bottom of the ATPase domain. Our predictions are spatially very close to the mutants observed in Davis's and Gassler's studies.

For the Hsp70 DnaK C-terminal (PDB ID 1DKX) in Figure 5b, our method predicted 6-THR, 405-GLY, 447-ARG and 483-ILE as interface residues. Mutants observed by Davis and Montgomery are located in the loops on the sandwich sub-domain, which are close to the peptide-binding site; therefore, our predictions are in agreement with their experimental results (Davis et al., 1999; Montgomery et al., 1999). We believe that the predicted binding sites will provide new insights into the interaction.

Fig. 6. Predicted results of 1YUW using our method. Black color denotes C-terminal (amino acid 395–554), blue color denotes N-terminal (amino acid 1–383) of DnaK, and atoms in predicted binding sites are shown as purple spheres.

Figure 6 shows the direct binding between the DnaK C-terminal and N-terminal on the protein 1YUW in Bos taurus (Jiang et al., 2005). Notice that 1YUW is different from 1DKX in Escherichia coli (Zhu et al., 1996) (also shown in Fig. 5) in terms of both C-terminal sequences and their binding sites. Our results show that most predicted interface residues are concentrated in the alpha helix of the C-terminal, which is in agreement with those in Jiang et al. (2005). We also observe some predicted binding sites in the N-terminal that are not in the binding regions between the C-terminal and N-terminal. Some of these may be the interaction sites between the N-terminal and other chains (e.g. GrpE).

Finally, we examined another DnaK effector, GrpE, by using 1DKG (Harrison et al., 1997), in which the DnaK N-terminal is co-crystallized with GrpE. GrpE is a nucleotide-exchange factor that binds sub-stoichiometrically to the ATPase unit, which is the DnaK N-terminal. Figure 7a shows the predicted binding sites in the DnaK N-terminal, which are very close to the reported binding residues in regions III, V and VI (Harrison et al., 1997). In Figure 7b, the predicted interaction residues on chain A of GrpE are also located in the reported binding regions II, III, IV, V and VI. We also predicted some binding sites in the upper right corner on chain B in GrpE, as a result of the fact that GrpE exists as a dimer.

Fig. 7. Predicted results of three chains in 1DKG using our method. (a) DnaK N-terminal chain D of 1DKG; orange spheres are the atoms in the predicted interface residues shown in Figure 5c. (b) GrpE, chains A and B of 1DKG. All spheres show our predicted atoms: purple spheres denote the atoms in the predicted interface residues and orange spheres denote the atoms also shown in Figure 5.

4 CONCLUSIONS

As genome-sequencing projects provide biologists with ready access to the rapidly increasing pool of protein sequences, there is a growing demand for developing advanced computational methods for predicting potential protein binding sites by using sequence information only. In this article, we demonstrate a predictive system that can reliably identify protein interface sites for protein complexes. The proposed predictive system is based on the analysis of protein sequence information, without knowing protein structures. A wide variety of physicochemical properties and sequence profiling properties are effectively integrated using a random forest framework.

The predicted interaction sites can be valuable as a first approach for guiding experimental methods investigating protein–protein interactions and localizing the specific interface residues. We illustrate the usefulness of the proposed method for predicting putative binding sites for the DnaK molecular chaperone system, 1YUW and 1DKG. In our future work, we will evaluate the relative importance of the features, which should help to understand the underlying binding process.

ACKNOWLEDGEMENTS

We wish to thank Ozlen Keskin for helping us in modeling 3D protein structures with VMD software. We also thank the reviewers for their valuable suggestions.

Funding: National Science Foundation award (IIS-0644366).

Conflicts of Interest: none declared.

REFERENCES

Aytuna,A.S. et al. (2005) Prediction of protein-protein interactions by combining structure and sequence conservation in protein interfaces. Bioinformatics, 21, 2850–2855.
Ban,N. et al. (1994) Crystal structure of an idiotype-anti-idiotype Fab complex. Proc. Natl Acad. Sci. USA, 91, 1604–1608.
Berman,H.M. et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.
Bradford,J.R. and Westhead,D.R. (2005) Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics, 21, 1487–1494.
Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32.
Chakrabarti,P. and Janin,J. (2002) Dissecting protein-protein recognition sites. Proteins, 47, 334–343.
Chen,H. and Zhou,H.X. (2005) Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. Proteins, 61, 21–35.
Chen,X.W. and Liu,M. (2005) Prediction of protein-protein interactions using random decision forest framework. Bioinformatics, 21, 4394–4400.
Chung,J.L. et al. (2006) Exploiting sequence and structure homologs to identify protein-protein binding sites. Proteins, 62, 630–640.
Davis,J.E. et al. (1999) Intragenic suppressors of Hsp70 mutants: interplay between the ATPase- and peptide-binding domains. Proc. Natl Acad. Sci. USA, 96, 9269–9276.
De Loof,H. et al. (1986) Use of hydrophobicity profiles to predict receptor binding domains on apolipoprotein E and the low density lipoprotein apolipoprotein B-E receptor. Proc. Natl Acad. Sci. USA, 83, 2295–2299.
Eisenberg,D. et al. (1982) The helical hydrophobic moment: a measure of the amphiphilicity of a helix. Nature, 299, 371–374.
Eisenberg,D. et al. (1984) Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J. Mol. Biol., 179, 125–142.
Fariselli,P. et al. (2002) Prediction of protein–protein interaction sites in heterocomplexes with neural networks. Eur. J. Biochem. FEBS, 269, 1356–1361.
Gabb,H.A. et al. (1997) Modelling protein docking using shape complementarity, electrostatics and biochemical information. J. Mol. Biol., 272, 106–120.
Gallet,X. et al. (2000) A fast method to predict protein interaction sites from sequences. J. Mol. Biol., 302, 917–926.
Gassler,C.S. et al. (1998) Mutations in the DnaK chaperone affecting interaction with the DnaJ cochaperone. Proc. Natl Acad. Sci. USA, 95, 15229–15234.
Gong,S. et al. (2005) A protein domain interaction interface database: InterPare. BMC Bioinformatics, 6, 207.
Greene,M.K. et al. (1998) Role of the J-domain in the cooperation of Hsp40 with Hsp70. Proc. Natl Acad. Sci. USA, 95, 6108–6113.
Harrison,C.J. et al. (1997) Crystal structure of the nucleotide exchange factor GrpE bound to the ATPase domain of the molecular chaperone DnaK. Science, 276, 431–435.
Helmer-Citterich,M. and Tramontano,A. (1994) PUZZLE: a new method for automated protein docking based on surface shape complementarity. J. Mol. Biol., 235, 1021–1031.
Hennessy,F. et al. (2000) Analysis of the levels of conservation of the J domain among the various types of DnaJ-like proteins. Cell Stress Chaperones, 5, 347–358.
Ho,T.K. (1998) The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell., 20, 832–844.
Humphrey,W. et al. (1996) VMD: visual molecular dynamics. J. Mol. Graph, 14, 33–38, 27–38.
Jiang,F. and Kim,S.H. (1991) "Soft docking": matching of molecular surface cubes. J. Mol. Biol., 219, 79–102.
Jiang,J. et al. (2005) Structural basis of interdomain communication in the Hsc70 chaperone. Mol. Cell, 20, 513–524.
Jmol. Jmol: an open-source Java viewer for chemical structures in 3D. Available at http://www.jmol.org.
Jones,S. and Thornton,J.M. (1996) Principles of protein-protein interactions. Proc. Natl Acad. Sci. USA, 93, 13–20.
Jones,S. and Thornton,J.M. (1997a) Analysis of protein-protein interaction sites using surface patches. J. Mol. Biol., 272, 121–132.
Jones,S. and Thornton,J.M. (1997b) Prediction of protein-protein interaction sites using patch analysis. J. Mol. Biol., 272, 133–143.
Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.
Katchalski-Katzir,E. et al. (1992) Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc. Natl Acad. Sci. USA, 89, 2195–2199.
Keskin,O. et al. (2005) Hot regions in protein–protein interactions: the organization and contribution of structurally conserved hot spot residues. J. Mol. Biol., 345, 1281–1294.
Kini,R.M. and Evans,H.J. (1996) Prediction of potential protein-protein interaction sites from amino acid sequence. Identification of a fibrin polymerization site. FEBS Lett., 385, 81–86.
Kuntz,I.D. et al. (1982) A geometric approach to macromolecule-ligand interactions. J. Mol. Biol., 161, 269–288.
Lo Conte,L. et al. (1999) The atomic structure of protein-protein recognition sites. J. Mol. Biol., 285, 2177–2198.
Montgomery,D.L. et al. (1999) Mutations in the substrate binding domain of the Escherichia coli 70 kDa molecular chaperone, DnaK, which alter substrate affinity or interdomain coupling. J. Mol. Biol., 286, 915–932.
Nguyen,N. et al. (2006) Protein-protein interface residue prediction with SVM using evolutionary profiles and accessible surface areas. In Proceedings of IEEE Symposium on Computational Intelligence Bioinformatics Computation Biology. pp. 1–5.
Norel,R. et al. (1995) Molecular surface complementarity at protein-protein interfaces: the critical role played by surface normals at well placed, sparse, points in docking. J. Mol. Biol., 252, 263–273.
Palma,P.N. et al. (2000) BiGGER: a new (soft) docking algorithm for predicting protein interactions. Proteins, 39, 372–384.
Pazos,F. et al. (1997) Correlated mutations contain information about protein-protein interaction. J. Mol. Biol., 271, 511–523.
Rost,B. and Sander,C. (1994) Conservation and prediction of solvent accessibility in protein families. Proteins, 20, 216–226.
Salemme,F.R. (1976) An hypothetical structure for an intermolecular electron transfer complex of cytochromes c and b5. J. Mol. Biol., 102, 563–568.
Schneider,R. and Sander,C. (1996) The HSSP database of protein structure-sequence alignments. Nucleic Acids Res., 24, 201–205.
Shoichet,B.K. and Kuntz,I.D. (1991) Protein docking and complementarity. J. Mol. Biol., 221, 327–346.
Suh,W.C. et al. (1998) Interaction of the Hsp70 molecular chaperone, DnaK, with its cochaperone DnaJ. Proc. Natl Acad. Sci. USA, 95, 15223–15228.
Uniprot (2008) The Universal Protein Resource (UniProt). Nucleic Acids Res., 36, D190–D195.
Voet,D. and Voet,J.G. (2004) Biochemistry. J. Wiley & Sons, Hoboken, NJ.
Walls,P.H. and Sternberg,M.J. (1992) New algorithm to model protein-protein recognition based on surface complementarity. Applications to antibody-antigen docking. J. Mol. Biol., 228, 277–297.
Wang,B. et al. (2006) Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett., 580, 380–384.
Warwicker,J. (1989) Investigating protein-protein interaction surfaces using a reduced stereochemical and electrostatic model. J. Mol. Biol., 206, 381–395.
Wodak,S.J. and Janin,J. (1978) Computer analysis of protein-protein interaction. J. Mol. Biol., 124, 323–342.
Yan,C. et al. (2003) Identification of surface residues involved in protein-protein interaction-a support vector machine approach. In Proceedings of the Conference on Intelligence System Design Application. pp. 53–62.
Yan,C. et al. (2004) A two-stage classifier for identification of protein-protein interface residues. Bioinformatics, 20(Suppl. 1), i371–i378.
Zhou,H.X. and Shan,Y. (2001) Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins, 44, 336–343.
Zhu,X. et al. (1996) Structural analysis of substrate binding by the molecular chaperone DnaK. Science, 272, 1606–1614.

Journal

Bioinformatics – Oxford University Press

Published: Jan 19, 2009
