References

E. Sprinzak, Shmuel Sattath, H. Margalit (2003). How reliable are experimental protein–protein interaction data? Journal of Molecular Biology, 327(5).
R. Jansen, Haiyuan Yu, D. Greenbaum, Y. Kluger, N. Krogan, Sambath Chung, A. Emili, M. Snyder, J. Greenblatt, M. Gerstein (2003). A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science, 302.
D. Goldberg, F. Roth (2003). Assessing experimentally derived interactions in a small world. Proceedings of the National Academy of Sciences of the United States of America, 100.
Gary Bader, I. Donaldson, Cheryl Wolting, B. Ouellette, T. Pawson, C. Hogue (2001). BIND: the Biomolecular Interaction Network Database. Nucleic Acids Research, 31(1).
E. Marcotte, M. Pellegrini, H. Ng, Danny Rice, T. Yeates, D. Eisenberg (1999). Detecting protein function and protein–protein interactions from genome sequences. Science, 285(5428).
H. Mewes, K. Heumann, A. Kaps, K. Mayer, F. Pfeiffer, S. Stocker, D. Frishman (1999). MIPS: a database for genomes and protein sequences. Nucleic Acids Research, 28(1).
Haidong Wang, E. Segal, A. Ben-Hur, D. Koller, D. Brutlag (2004). Identifying protein–protein interaction sites on a genome-wide scale.
M. Ashburner, C. Ball, J. Blake, D. Botstein, Heather Butler, J. Cherry, A. Davis, K. Dolinski, S. Dwight, J. Eppig, M. Harris, D. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. Matese, J. Richardson, M. Ringwald, G. Rubin, G. Sherlock (2000). Gene Ontology: tool for the unification of biology. Nature Genetics, 25.
B. Schölkopf, K. Tsuda, Jean-Philippe Vert (2004). Support vector machine applications in computational biology.
Gert Lanckriet, Minghua Deng, N. Cristianini, Michael Jordan, William Noble (2003). Kernel-based data fusion and its application to protein function prediction in yeast. Pacific Symposium on Biocomputing.
S. Gomez, William Noble, A. Rzhetsky (2003). Learning to predict protein–protein interactions from protein sequences. Bioinformatics, 19(15).
E. Sonnhammer, S. Eddy, R. Durbin (1997). Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins, 28.
Arun Ramani, E. Marcotte (2003). Exploiting the co-evolution of interacting proteins to discover interaction specificity. Journal of Molecular Biology, 327(1).
A. Ben-Hur, D. Brutlag (2003). Remote homology detection: a motif based approach. Bioinformatics, 19(Suppl 1).
E. Sprinzak, H. Margalit (2001). Correlated sequence-signatures as markers of protein–protein interaction. Journal of Molecular Biology, 311(4).
Haiyuan Yu, N. Luscombe, Haoxin Lu, Xiaowei Zhu, Yu Xia, J. Han, N. Bertin, Sambath Chung, M. Vidal, M. Gerstein (2004). Annotation transfer between genomes: protein–protein interologs and protein–DNA regulogs. Genome Research, 14(6).
C. Leslie, E. Eskin, William Noble (2001). The spectrum kernel: a string kernel for SVM protein classification. Pacific Symposium on Biocomputing.
I. Xenarios, L. Salwínski, X. Duan, Patrick Higney, Sul-Min Kim, D. Eisenberg (2002). DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research, 30(1).
Lan Zhang, Sharyl Wong, O. King, F. Roth (2004). Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics, 5.
Shawn Martin, D. Roe, J. Faulon (2005). Predicting protein–protein interactions using signature products. Bioinformatics, 21(2).
B. Schölkopf, K. Tsuda, Jean-Philippe Vert (2005). Kernel Methods in Computational Biology.
Nan Lin, Baolin Wu, R. Jansen, M. Gerstein, Hongyu Zhao (2004). Information assessment on predicting protein–protein interactions. BMC Bioinformatics, 5.
F. Pazos, A. Valencia (2002). In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins, 47.
L. Hunter, T. Klein (1996). Proceedings of the Pacific Symposium on Biocomputing '96, Hawaii, USA, 3–6 January 1996.
Minghua Deng, Shipra Mehta, Fengzhu Sun, Ting Chen (2002). Inferring domain–domain interactions from protein–protein interactions. Genome Research, 12(10).
C. Nevill-Manning, Komal Sethi, Thomas Wu, D. Brutlag (1997). Enumerating and ranking discrete motifs. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5.
C. von Mering, R. Krause, B. Snel, M. Cornell, S. Oliver, S. Fields, P. Bork (2002). Comparative assessment of large-scale data sets of protein–protein interactions. Nature, 417.
C. Deane, Łukasz Salwiński, I. Xenarios, D. Eisenberg (2002). Protein interactions. Molecular & Cellular Proteomics, 1.
BIOINFORMATICS, Vol. 21 Suppl. 1 2005, pages i38–i46. doi:10.1093/bioinformatics/bti1016

Kernel methods for predicting protein–protein interactions

Asa Ben-Hur(1,*) and William Stafford Noble(1,2)
(1) Department of Genome Sciences and (2) Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA
* To whom correspondence should be addressed.

Received on January 15, 2005; accepted on March 27, 2005

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

ABSTRACT
Motivation: Despite advances in high-throughput methods for discovering protein–protein interactions, the interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions.
Results: We present a kernel method for predicting protein–protein interactions using a combination of data sources, including protein sequences, Gene Ontology annotations, local properties of the network, and homologous interactions in other species. Whereas protein kernels proposed in the literature provide a similarity between single proteins, prediction of interactions requires a kernel between pairs of proteins. We propose a pairwise kernel that converts a kernel between single proteins into a kernel between pairs of proteins, and we illustrate the kernel's effectiveness in conjunction with a support vector machine classifier. Furthermore, we obtain improved performance by combining several sequence-based kernels based on k-mer frequency, motif and domain content, and by further augmenting the pairwise sequence kernel with features based on other sources of data.
We apply our method to predict physical interactions in yeast using data from the BIND database. At a false positive rate of 1% the classifier retrieves close to 80% of a set of trusted interactions. We thus demonstrate the ability of our method to make accurate predictions despite the sizeable fraction of false positives that are known to exist in interaction databases.
Availability: The classification experiments were performed using PyML, available at http://pyml.sourceforge.net. Data are available at http://noble.gs.washington.edu/proj/sppi
Contact: [email protected]

1 INTRODUCTION
Most proteins perform their functions by interacting with other proteins. Therefore, information about the network of interactions that occur in a cell can greatly increase our understanding of protein function. Several experimental assays that probe interactions in a high-throughput manner are now available. These methods include the yeast two-hybrid screen and methods based on mass spectrometry (see von Mering et al., 2002 and references therein). The data obtained by these methods are partial: each experimental assay can identify only a subset of the interactions, and it has been estimated that for the organism with the most complete interaction network, namely yeast, only about half of the complete 'interactome' has been discovered (von Mering et al., 2002). In view of the very small overlap between interactions discovered by various high-throughput studies, some of them using the same method, the actual number of interactions is likely to be much higher. Computational methods are therefore required for discovering interactions that are not accessible to high-throughput methods. These computational predictions can then be verified by more labor-intensive methods.
A number of methods have been proposed for predicting protein–protein interactions from sequence. Sprinzak and Margalit (2001) noted that many pairs of structural domains are over-represented in interacting proteins and that this information can be used to predict interactions. Several authors have proposed Bayesian network models that use the domain or motif content of a sequence to predict interactions (Deng et al., 2002; Gomez et al., 2003; Wang et al., 2005). The pairwise sequence kernel was independently proposed in a recent paper (Martin et al., 2005) with a sequence representation by 3-mers. Other sequence-based methods exploit the coevolution of interacting proteins, detected by comparing phylogenetic trees (Ramani and Marcotte, 2003) or correlated mutations (Pazos and Valencia, 2002), or use gene fusion, which works at the genome level (Marcotte et al., 1999). An alternative approach is to combine multiple sources of genomic information, such as gene expression, Gene Ontology (GO) annotations and transcriptional regulation, to predict comembership in a complex (Zhang et al., 2004; Lin et al., 2004).
One can consider two variants of the interaction prediction problem: predicting comembership in a complex, or predicting direct physical interaction. In this work we focus on the latter task, and use interactions derived from the BIND database (Bader et al., 2001), which makes a distinction between experimental results that yield comembership in a complex and interactions that are more likely to be direct ones.
Kernel methods, and in particular support vector machines (SVMs) (Schölkopf and Smola, 2002), have proven useful in many difficult classification problems in bioinformatics (Noble, 2004). The learning task we are addressing involves a relationship between pairs of protein sequences: whether a given pair of proteins interacts or not.
The standard sequence kernels described in the literature (a kernel is a measure of similarity that satisfies the additional condition of being a dot product in some feature space; see Schölkopf and Smola, 2002 for details) measure similarity between single proteins. We propose a method for converting a kernel defined on single proteins into a pairwise kernel, and we describe the feature space produced by that kernel.
Our basic method uses motif, domain and k-mer composition to form a pairwise kernel, and achieves better performance than simple methods based on BLAST or PSI-BLAST. However, because it is difficult to predict interactions from sequence alone, we incorporate additional sources of data. These include kernels based on the similarity of GO annotations, a similarity score to interacting homologs in other species, and the mutual clustering coefficient (Goldberg and Roth, 2003), which measures the tendency of neighbors of interacting proteins to interact as well. Adding these additional data sources significantly improves our method's performance relative to a method trained using only the pairwise sequence kernel. Using kernel methods to combine data from heterogeneous sources allows us to use high-dimensional sequence data, whereas other studies on predicting protein–protein interactions (Zhang et al., 2004; Lin et al., 2004) use low-dimensional representations that are appropriate for any type of classifier.

2 KERNELS FOR PROTEIN–PROTEIN INTERACTIONS
SVMs and other kernel methods derive much of their power from their ability to incorporate prior knowledge via the kernel function. Furthermore, the kernel approach offers the ability to easily apply kernels to diverse types of data, including fixed-length vectors (e.g. microarray expression data), variable-length strings (DNA and protein sequences), graphs and trees. In this work, we employ the diverse collection of kernels described in this section.

2.1 Pairwise kernels
The kernels proposed in the literature for handling genomic information, e.g. sequence kernels such as the motif and Pfam kernels presented later in this section, provide a similarity between pairs of sequences or, more generally, between representations of two proteins. Therefore, such kernels are not directly applicable to the task of predicting protein–protein interactions, which requires a similarity between two pairs of proteins. Thus, we want a function K((X1, X2), (X1', X2')) that returns the similarity between the pair of proteins (X1, X2) and the pair (X1', X2'). We call a kernel that operates on individual genes or proteins a 'genomic kernel', and a kernel that compares pairs of genes or proteins a 'pairwise kernel'. Pairwise kernels can be computed either indirectly, by way of an intermediate genomic kernel, or directly, using features that characterize pairs of proteins.
The most straightforward way to construct a pairwise kernel is to express the similarity between pairs of proteins in terms of similarities between individual proteins. In this approach, we consider two pairs to be similar to one another when each protein of one pair is similar to one protein of the other pair. For example, if protein X1 is similar to protein X1', and X2 is similar to X2', then we can say that the pairs (X1, X2) and (X1', X2') are similar. We can translate these intuitions into the following pairwise kernel:

    K((X1, X2), (X1', X2')) = K'(X1, X1') K'(X2, X2') + K'(X1, X2') K'(X2, X1'),

where K'(·, ·) is any genomic kernel. This kernel takes into account the fact that X1 can be similar to either X1' or X2'.
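The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the genomic kernel here is a plain dot product on toy feature vectors standing in for any kernel on single proteins.

```python
def pairwise_kernel(K, x1, x2, y1, y2):
    """K_p((X1, X2), (X1', X2')) built from a genomic kernel K.

    The two products cover both ways of matching the members of one
    pair against the members of the other, so X1 may match either
    X1' or X2'.
    """
    return K(x1, y1) * K(x2, y2) + K(x1, y2) * K(x2, y1)

def dot(u, v):
    # Toy genomic kernel: linear kernel on feature vectors
    return sum(a * b for a, b in zip(u, v))

a, b, c, d = (1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (1.0, 3.0)
# The kernel is symmetric under swapping the members of a pair:
assert pairwise_kernel(dot, a, b, c, d) == pairwise_kernel(dot, b, a, c, d)
```

Because both orientations are summed, the kernel is well defined on unordered pairs, which is what an interaction is.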
An alternative to the above approach is to represent a pair of sequences (X1, X2) explicitly in terms of the domain or motif pairs that appear in it. This representation is motivated by the observation that some domains are significantly over-represented in interacting proteins (Sprinzak and Margalit, 2001); a similar observation holds for sequence motifs as well. Given a pair of sequences X1, X2 represented by vectors x1, x2, we form the vector x12 whose (i, j) component is x1i x2j + x2i x1j. We can now define the explicit pairwise kernel:

    K((X1, X2), (X1', X2')) = K'(x12, x1'2'),    (1)

where x12 is the pairwise representation of the pair (X1, X2), and K'(·, ·) is any kernel that operates on vector data. It is straightforward to check that for a linear kernel function, the pairwise and explicit pairwise kernels are identical. The explicit representation can be used to rank the relevance of motif pairs with respect to the classification task; this ranking is accomplished, e.g., by sorting the motif pairs according to the magnitude of the corresponding weight vector components.
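The equivalence for a linear kernel can be checked numerically with the sketch below (our own hypothetical helper, not the paper's code). With this symmetrized outer product the dot product of explicit representations reproduces the pairwise kernel up to a constant factor of 2, which does not affect an SVM.

```python
def pair_features(x1, x2):
    """Explicit pairwise representation: component (i, j) is
    x1[i]*x2[j] + x2[i]*x1[j], i.e. a symmetrized outer product."""
    n = len(x1)
    return [x1[i] * x2[j] + x2[i] * x1[j]
            for i in range(n) for j in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x1, x2, x3, x4 = [1, 2], [0, 1], [1, 0], [2, 1]
explicit = dot(pair_features(x1, x2), pair_features(x3, x4))
pairwise = dot(x1, x3) * dot(x2, x4) + dot(x1, x4) * dot(x2, x3)
assert explicit == 2 * pairwise  # identical up to a constant factor
```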
2.2 Sequence kernels
We use three sequence kernels in this work: the spectrum kernel (Leslie et al., 2002), the motif kernel (Ben-Hur and Brutlag, 2003) and the Pfam kernel (Gomez et al., 2003). The feature space of each of these kernels is a set of sequence models, and each component of the feature space representation measures the extent to which a given sequence fits the model. The spectrum kernel models a sequence in the space of all k-mers, and its features count the number of times each k-mer appears in the sequence.
The sequence models for our motif kernel are discrete sequence motifs, and each feature counts how many times a discrete sequence motif matches a sequence. To compute the motif kernel we used discrete sequence motifs from the eMotif database (Nevill-Manning et al., 1997). Yeast ORFs contain occurrences of 17,768 motifs out of a set of 42,718 motifs.
Finally, the Pfam kernel uses a set of hidden Markov models (HMMs) to represent the domain structure of a protein, and is computed by comparing each protein sequence with every HMM in the Pfam database (Sonnhammer et al., 1997). Every such protein–HMM comparison yields an E-value statistic. Pfam version 10.0 contains 6190 domain HMMs; therefore, each protein is represented by a vector of 6190 log E-values. This Pfam kernel has been used previously to predict protein–protein interactions (Gomez et al., 2003), though not in conjunction with the pairwise kernel described above.
For all three sequence kernels we use a normalized linear kernel, K(x, y)/sqrt(K(x, x) K(y, y)); in the case of the Pfam kernel we first performed an initial step of centering the kernel.
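The spectrum kernel's feature map is simple to state in code. The sketch below is ours, not the authors' implementation: it counts k-mers and applies the normalized linear kernel described above.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k=3):
    # Spectrum-kernel feature map: count of every k-mer in the sequence
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k=3):
    a, b = kmer_counts(s, k), kmer_counts(t, k)
    dot = sum(a[m] * b[m] for m in a)
    # Normalized linear kernel: K(x, y) / sqrt(K(x, x) * K(y, y))
    norm = sqrt(sum(v * v for v in a.values()) *
                sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

assert abs(spectrum_kernel("MKTAYIAK", "MKTAYIAK") - 1.0) < 1e-12
assert spectrum_kernel("AAAA", "CCCC") == 0.0
```

Normalization makes a sequence maximally similar to itself regardless of its length, which is why identical strings score exactly 1.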
2.3 Non-sequence kernels
An alternative to using the pairwise kernel is the following:

    K((X1, X2), (X1', X2')) = K'(X1, X2) K'(X1', X2').    (2)

This kernel is appropriate when similarity within a pair is directly related to the likelihood that the pair of proteins interacts. In fact, this is a valid kernel even if K' is not a kernel, because in this formulation K' is simply a feature of the pair of proteins. Consider GO annotations, for example: a pair of proteins is more likely to interact if the two proteins share similar annotations. In addition to GO annotations we also consider local properties of the interaction network and homologous interactions in other species. We summarize these properties as a vector of scores s(X1, X2), such that the kernel for the non-sequence data can be any kernel appropriate for vector data:

    K_non-seq((X1, X2), (X1', X2')) = K'(s(X1, X2), s(X1', X2')),    (3)

where we chose to use a Gaussian kernel for K'.

2.3.1 A GO kernel. Proteins that are not present in the same cellular component, or that participate in different biological processes, are less likely to interact. We represent this prior knowledge using a kernel that measures the similarity of the GO (Gene Ontology Consortium, 2000) annotations of a pair of proteins, one kernel for each of the three GO hierarchies. The feature space for the GO kernel is a vector space with one component for each node in the directed acyclic graph in which GO annotations are represented. Let the annotations (nodes in the GO graph) assigned to protein p be denoted by A_p. Note that, in GO, a single protein can be assigned several annotations. A component of the vector corresponding to node a is non-zero if a or a descendant of a is in A_p, i.e. if a is an annotation of p or an ancestor of one.
We consider two ways to define the dot product in this space. When the non-zero components are set equal to 1, each protein has a single annotation, and the annotations are on a tree, the dot product between two proteins is the depth of the lowest common ancestor of the two annotation nodes. An alternative approach assigns annotation a a score of −log p(a), where p(a) is the fraction of proteins that have annotation a. We then score the similarity of annotations a, a' as

    max over a'' in ancestors(a) ∩ ancestors(a') of −log p(a'').

In a tree topology, this score is the score of the deepest common ancestor of a and a', because the node frequencies decrease along any path from the root. The score is a dot product with respect to the infinity norm on the annotation vector space. This also holds when the proteins have more than one annotation and the similarity between their annotations is defined as the maximum similarity between any pair of annotations. When one of the proteins has an unknown GO annotation, the kernel value is set to 0.
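The second scoring rule can be sketched as follows on a toy hierarchy; `parents` and `freq` are illustrative stand-ins for the GO graph and the observed annotation frequencies, not real GO data.

```python
from math import log

def ancestors(a, parents):
    """All nodes reachable from annotation a by following parent links,
    including a itself."""
    seen, stack = {a}, [a]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def go_similarity(a1, a2, parents, freq):
    # Score of the most informative (rarest) common ancestor: max -log p(a)
    common = ancestors(a1, parents) & ancestors(a2, parents)
    return max((-log(freq[a]) for a in common), default=0.0)

# Toy hierarchy: root -> {c1, c2}, c1 -> leaf
parents = {"leaf": ["c1"], "c1": ["root"], "c2": ["root"]}
freq = {"root": 1.0, "c1": 0.5, "c2": 0.5, "leaf": 0.1}
assert abs(go_similarity("leaf", "c1", parents, freq) - log(2)) < 1e-12
assert go_similarity("c1", "c2", parents, freq) == 0.0  # only the root is shared
```

Annotations whose only common ancestor is the root score 0, since every protein carries the root annotation (p = 1).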
2.3.2 Interactions in other species. It has been shown that interactions in other species can be used to validate or infer interactions (Yu et al., 2004): the existence of interacting homologs of a given pair of proteins implies that the original proteins are more likely to interact. We quantify this observation with the following homology score for a pair of proteins (X1, X2):

    h(X1, X2) = max over i in H(X1), j in H(X2) of I(i, j) min(l(X1, Xi), l(X2, Xj)),

where H(X) is the set of non-yeast proteins that are significant BLAST hits of X, I(i, j) is an indicator variable for the interaction between proteins i and j, and l(Xk, Xi) is the negative of the log E-value reported by BLAST when comparing protein k with protein i in the context of a given sequence database. We used interactions in human, mouse, nematode and fruit fly to score the interactions in yeast.
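The homology score can be sketched as below. `hits` and `interacts` are illustrative toy inputs: in the paper they would come from BLAST searches against other species and from those species' interaction data.

```python
def homology_score(X1, X2, hits, interacts):
    """h(X1, X2): best min(l(X1, i), l(X2, j)) over homolog pairs (i, j)
    that are known to interact. hits[X] maps each significant BLAST hit
    of X to l, the negative log E-value (toy values here)."""
    best = 0.0
    for i, li in hits.get(X1, {}).items():
        for j, lj in hits.get(X2, {}).items():
            if (i, j) in interacts or (j, i) in interacts:
                best = max(best, min(li, lj))
    return best

hits = {"YA": {"h1": 30.0, "h2": 5.0}, "YB": {"h3": 20.0}}
interacts = {("h1", "h3")}
assert homology_score("YA", "YB", hits, interacts) == 20.0
```

Taking the minimum of the two similarities means a homologous interaction only supports the yeast pair if both yeast proteins have a strong hit.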
2.3.3 Mutual clustering coefficient. Protein–protein interaction networks tend to be 'cliquish'; i.e. the neighbors of interacting proteins tend to interact. Goldberg and Roth (2003) quantified this cohesiveness using the mutual clustering coefficient (MCC). Given two proteins u, v, their MCC can be quantified by the Jaccard coefficient |N(u) ∩ N(v)| / |N(u) ∪ N(v)|, where N(x) is the set of neighbors of protein x in an interaction network. In our classification experiments we performed cross-validation in which the MCC for each cross-validation fold is computed with respect to the interactions that occur in the training set of that particular fold.
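The Jaccard form of the MCC is a one-liner over neighbor sets; a minimal sketch:

```python
def mutual_clustering_coefficient(u, v, neighbors):
    """Jaccard coefficient of the neighbor sets: |N(u) & N(v)| / |N(u) | N(v)|."""
    nu, nv = neighbors.get(u, set()), neighbors.get(v, set())
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

# u and v share one of three distinct neighbors:
net = {"u": {"a", "b"}, "v": {"b", "c"}}
assert mutual_clustering_coefficient("u", "v", net) == 1 / 3
```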
2.4 Combining kernels
Given a genomic kernel K, we denote by K_p(K) the pairwise kernel that uses K. When several genomic kernels K_i are available, the final kernel can be defined either as sum_i K_p(K_i) or as K_p(sum_i K_i). Using K_p(sum_i K_i) mixes features between the individual kernels, while the feature space of sum_i K_p(K_i) includes only pairs of features that originate from the same genomic kernel. In practice, the results from these two approaches were very close, and the mixing approach was used because of its lower memory requirement. A Gaussian or polynomial kernel can be introduced at several stages: in place of the linear genomic kernel, or at the level of the pairwise kernel as exp(−γ(K_p(P, P) − 2 K_p(P, P') + K_p(P', P'))), where P, P' are two pairs of proteins. We have not tried introducing a non-linear kernel at the level of the genomic kernel; a Gaussian kernel at the level of the pairwise kernel performed similarly to the 'linear' pairwise kernel, despite the high dimensionality of the resulting feature space. The results reported in this paper are computed using 'linear' pairwise kernels.
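Both combination steps are mechanical; here is a sketch (our own hypothetical helpers, with a linear toy kernel) of summing genomic kernels and of turning any kernel into a Gaussian via the squared distance it induces in feature space.

```python
from math import exp

def sum_kernels(kernels):
    # K(x, y) = sum_i K_i(x, y); the result can be fed to the pairwise kernel
    return lambda x, y: sum(K(x, y) for K in kernels)

def gaussian_of(K, gamma):
    # exp(-gamma * (K(p, p) - 2 K(p, q) + K(q, q))): a Gaussian kernel
    # built from the squared feature-space distance defined by K
    return lambda p, q: exp(-gamma * (K(p, p) - 2 * K(p, q) + K(q, q)))

dot = lambda u, v: sum(a * b for a, b in zip(u, v))
g = gaussian_of(dot, 1.0)
assert g((1.0, 2.0), (1.0, 2.0)) == 1.0  # zero distance maps to 1
assert abs(g((0.0,), (1.0,)) - exp(-1.0)) < 1e-12
```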
2.5 Incorporating interaction reliability in training
Several studies of protein–protein interaction data have noted that different experimental assays produce varying levels of false positives, and methods have been proposed for identifying which interactions are likely to be reliable (von Mering et al., 2002; Sprinzak et al., 2003; Deane et al., 2002) (see Section 3.1 for details). We incorporate this knowledge about the reliability of protein–protein interactions into the training procedure using the SVM soft-margin parameter C (Schölkopf and Smola, 2002). This parameter puts a penalty on patterns that are misclassified or are close to the SVM decision boundary. Each training example receives a value of C that depends on its reliability. For a training set with an equal number of positive and negative examples we use two values: C_high for interactions believed to be reliable and for negative examples, and C_low for positive examples that are not known to be reliable.
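The assignment rule is simple enough to state directly; a sketch (the C_low = 0.01 C_high ratio used later in the paper is taken as the default):

```python
def soft_margin_values(labels, reliable, C_high=1.0, C_low=0.01):
    """Per-example soft-margin parameter C: C_high for negative examples
    and for reliable positives, C_low for positives of unknown reliability."""
    return [C_high if (y < 0 or r) else C_low
            for y, r in zip(labels, reliable)]

# reliable positive, unreliable positive, negative:
assert soft_margin_values([+1, +1, -1], [True, False, False]) == [1.0, 0.01, 1.0]
```

The low C on unreliable positives lets the SVM treat a suspect interaction as noise rather than forcing the decision boundary to accommodate it.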
3 METHODS
3.1 Interaction data
We focus on the prediction of physical interactions in yeast and use interaction data from the BIND database (Bader et al., 2001). BIND includes published interaction data from high-throughput experiments as well as curated entries derived from published papers. The advantage of BIND is that it provides an explicit distinction between direct physical interactions and comembership in a complex.
3.1.1 Positive and negative examples. We use physical interactions from BIND as positive examples, for a dataset comprising 10,517 interactions among 4233 yeast proteins (downloaded July 9, 2004). We eliminated self-interactions from the dataset, since such interactions do not require a pairwise kernel, and the GO and MCC features are not appropriate in this case. As negative examples we select random, non-interacting pairs from the 4233 interacting proteins; the number of negative examples was taken to be equal to the number of positive examples. In view of the large number of protein pairs compared with the number of interactions, such a set of negative examples is likely to contain very few pairs of proteins that actually interact.
High-throughput protein–protein interaction data contain a large fraction of false positives, estimated to be up to 50% in some experiments (von Mering et al., 2002). Therefore, we prepared a set of BIND interactions that are expected to have a low rate of false positives. We use these reliable interactions in two ways. First, we evaluate the performance of our method on the reliable interactions, because they are more likely to reflect the true performance of the classifier. Second, we use reliability to set the value of the SVM soft-margin parameter, as discussed in Section 2.5. 'Gold standard' interactions can be derived from several sources:
• Interactions corroborated by interacting yeast paralogs. Deane et al. (2002) find 2829 interactions from the DIP database that are supported by their paralogous verification method (PVM). The estimated false positive rate of this method is 1%.
• Interactions supported by interacting homologs in multiple species, which are likely to be correct (Yu et al., 2004).
• Interactions discovered by different experimental assays, which were estimated to be correct 95% of the time (Sprinzak et al., 2003).
• Interactions detected by highly reliable methods, e.g. interactions derived from crystallized complexes.
We do not use the PVM-validated interactions because they introduce several biases:
• The test set would be biased toward interactions that can be easily discovered by sequence similarity.
• The list of PVM-validated interactions cannot be used as-is to set the SVM soft-margin parameter in training, because this may incorporate information about interactions that are in the test set.
Also, we do not include interactions validated by interacting homologs in other species, since that information is included in the data as a feature. Therefore, for the purpose of assessing performance we use a list of 750 interactions that were validated by high-quality or multiple assays. For setting the SVM soft-margin parameter we augment the 750 interactions with PVM-validated interactions that are computed on the basis of the training data alone. Training is performed on all interactions so that sensitivity is not sacrificed.
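Sampling the negative examples can be sketched as follows (a simplified version of the procedure described above; protein names are placeholders):

```python
import random

def sample_negatives(proteins, positives, n, seed=0):
    """Draw n distinct random protein pairs that are neither self pairs
    nor known interactions. Assumes n is small relative to the number
    of possible pairs, as in the paper."""
    rng = random.Random(seed)
    known = {frozenset(p) for p in positives}
    negatives = set()
    while len(negatives) < n:
        pair = frozenset(rng.sample(proteins, 2))  # two distinct proteins
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)

negs = sample_negatives(list("ABCDE"), [("A", "B")], 3)
assert len(negs) == 3
assert all(a != b for a, b in negs)       # no self pairs
assert ("A", "B") not in negs             # no known positives
```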
3.2 BLAST/PSI-BLAST based ranking
We compare our method with a simple ranking method that assigns a candidate interaction a score based on its similarity to the interacting pairs in the training set. Specifically, let l(X, X') denote the negative log of the E-value assigned by PSI-BLAST (or BLAST) when searching X against X' in the context of a large database of sequences, and let I(i, j) be an indicator variable for the interaction between proteins i and j. l(X, X') is positive for significant matches and increases as the quality of the match increases. The score for a query pair (X1, X2) is defined as:

    max over i in P, j in P of I(i, j) min(l(X1, Xi), l(X2, Xj)),    (4)

where P is the set of all proteins in the training set. In these experiments, we use PSI-BLAST scores computed in the context of the Swiss-Prot database (version 40, containing 101,602 proteins).

Table 1. ROC scores for the various methods computed using 5-fold cross-validation

Method                    Kernel                                     ROC score   ROC50 score
BLAST                     (none)                                     0.74        0.18
PSI-BLAST                 (none)                                     0.78        0.11
Non-sequence              K_non-seq                                  0.95        0.37
Motif                     K_p(K_motif)                               0.76        0.17
Pfam                      K_p(K_Pfam)                                0.78        0.20
Spectrum (k = 3)          K_p(K_spec)                                0.81        0.05
Motif + Pfam              K_p(K_motif + K_Pfam)                      0.82        0.22
Motif + Pfam + spectrum   K_p(K_motif + K_Pfam + K_spec)             0.86        0.17
All kernels               K_feat + K_p(K_motif + K_Pfam + K_spec)    0.97        0.44
All + reliability         K_feat + K_p(K_motif + K_Pfam + K_spec)    0.97        0.58

Training data include all BIND physical interactions. ROC scores are computed on reliable interactions that do not include PVM-validated interactions. The BLAST and PSI-BLAST methods rank interactions according to Equation (4). The 'Kernel' column shows which kernel was used in conjunction with the SVM classifier; the notation K_p(K_g) denotes the pairwise kernel derived from a genomic kernel K_g. K_non-seq is a Gaussian kernel over the non-sequence features; in each method in which it participates, the width of the Gaussian was determined by cross-validation as part of the classifier's training. The 'All + reliability' method uses information on reliability to set the SVM soft-margin parameter as described in Section 2.5.
3.3 Figures of merit
Throughout this paper we evaluate the quality of a predictive method using two different metrics: the area under the receiver operating characteristic curve (ROC score), and the normalized area under that curve up to the first 50 false positives (ROC50 score). Both metrics aim to measure sensitivity and specificity simultaneously by integrating over a curve that plots the true positive rate as a function of the false positive rate. We include both metrics in order to account for two different scenarios in which a protein–protein interaction prediction method might be employed.
In the first scenario, imagine that you have developed a low-throughput method for detecting whether a given pair of proteins interacts. Rather than testing your method on randomly selected pairs of proteins, you could use a predictive algorithm to identify likely candidates. In this case, you would start from the top of the ranked list of predictions, testing pairs until you ran out of time or money, or until the success rate of the predictor was too low to be useful. In this scenario, a predictor that maximizes the quality of the high-confidence interactions, i.e. one that maximizes the ROC50 score, is going to be most useful.
In the second, more common scenario, you are interested in a particular biological system. You run the predictive algorithm, and you check your favorite set of proteins to see whether they participate in any predicted interactions. In this case, you do not care only about the high-confidence interactions; instead, you would like to be sure that the complete set of predictions is of high quality. Here you are interested in the ROC score of the classifier.

4 RESULTS
In this section we report the results of experiments in predicting protein–protein interactions using an SVM classifier with various kernels, and compare them with a simple method based on BLAST or PSI-BLAST. All the experiments were performed using the PyML machine learning framework, available at http://pyml.sourceforge.net. We begin with results obtained using the various kernels and kernel combinations, followed by a discussion of the choice of negative examples and a section that shows the effects of choosing a non-redundant set of proteins.
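The ROC50 score can be sketched as follows; this is our reading of the definition above (area under the ROC curve up to the 50th false positive, normalized so that a perfect ranking scores 1), not the authors' evaluation code.

```python
def roc_n_score(labels, scores, n=50):
    """Normalized area under the ROC curve up to the first n false
    positives; labels are +1/-1, and higher scores rank first."""
    ranked = sorted(zip(scores, labels), reverse=True)
    num_pos = sum(1 for _, y in ranked if y > 0)
    tp = fp = 0
    area = 0.0
    for _, y in ranked:
        if y > 0:
            tp += 1
        else:
            fp += 1
            area += tp / num_pos  # ROC-curve height at this false positive
            if fp == n:
                break
    return area / n

# A perfect ranking gives 1.0; one swap lowers the score:
assert roc_n_score([+1, +1, -1, -1], [4, 3, 2, 1], n=2) == 1.0
assert roc_n_score([+1, -1, +1, -1], [4, 3, 2, 1], n=2) == 0.75
```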
4.1 Main results
We report results computed using 5-fold cross-validation on all BIND physical interactions. The SVM soft-margin parameter was not optimized; we used the default value for this parameter to account for the noise in the data. The ROC/ROC50 curves are then computed for those reliable interactions that were not obtained using the PVM method, as discussed in Section 3.1. The ROC statistics that summarize these experiments are reported in Table 1, and selected ROC curves are shown in Figure 1.

[Fig. 1. ROC (a) and ROC50 (b) curves for several methods. Best performance is obtained using a kernel that combines all of the kernels presented in the paper. Additional results are summarized in Table 1, along with a description of the methods.]

Our basic method uses a pairwise kernel based on one of several sequence kernels: the motif, Pfam and spectrum kernels. The performance of the motif and Pfam kernels is comparable, with a slight advantage for the Pfam kernel (the ROC scores are 0.76 and 0.78, and the ROC50 scores are 0.17 and 0.20, respectively). The spectrum kernel (using k-mers of length 3) achieves a higher ROC score of 0.81, but its ROC50 score is significantly lower than that of the Pfam and motif kernels. The higher ROC score can be explained by the fact that the motif and Pfam methods are limited in their sensitivity by the motif and domain models available. However, when such models offer a good description of a sequence, their predictions are likely to be more accurate, which is reflected in the much higher ROC50 scores of these methods. Each of the pairwise kernels by itself does not do much better than BLAST or PSI-BLAST, but once they are combined, they offer improved performance. We note that using a spectrum kernel with k-mers of length 4 did not improve the performance of the method.
We note that using a spectrum kernel with k-mers of length 4 did not improve the performance of the method.

Combining the non-sequence features with the pairwise sequence kernel yielded better performance than any method by itself on both performance metrics. Furthermore, setting the soft-margin parameter of the SVM according to the reliability of the interactions provided another significant boost: the resulting ROC and ROC50 scores were 0.98 and 0.58, respectively, and at a false positive rate of 1% the classifier retrieves ~80% of the trusted interactions. In this experiment we did not try to optimize the ratio between the two soft-margin constants, and used C_low = 0.01 C_high.

The main contribution to the gain in performance comes from the GO-process kernel feature. Its ROC score by itself is 0.68 on all the BIND interactions and 0.95 when limiting to the reliable positive examples. The difference between the two numbers is probably due to the sizable fraction of false interactions in the BIND dataset. In the following subsection we point out scenarios in which the GO data are not useful. The ROC score for the MCC feature was 0.68 on all BIND interactions and 0.53 when computed on the reliable interactions. The large difference for the MCC feature results from the fact that the MCC requires a large number of interactions to be useful: at a BLAST cutoff of 1e-10, 329 interactions from BIND were supported by interactions from other species, as opposed to 49 negative examples. The ROC score for this feature by itself is low since it is sparse, i.e. it is informative for only a small number of interactions.

4.2 The role of GO annotations

In order to understand the difference between the roles of the sequence kernels and the non-sequence kernels, we compared the two on the task of distinguishing physically interacting protein pairs from pairs that are members of the same complex. In this case, the negative examples are chosen as protein pairs that are known to belong to the same complex but are not known to physically interact. This set of negative examples is likely to be noisier than the non-interacting set, because complexes that are not accessible by yeast two-hybrid probably contain many physical interactions. Still, the motif-pairwise method achieves an ROC score of 0.78, very close to the value obtained with non-interacting negative examples. In this task, a classifier based on the non-sequence kernel fails, with an ROC score of 0.5. This is because cocomplexed proteins, like physically interacting proteins, tend to have similar GO annotations and network properties, whereas the motif and Pfam kernels rely on a signal that is often directly related to the interaction site itself (Wang et al., 2005). Similar observations can be made for other features used to predict cocomplexed proteins, such as gene-expression data.

4.3 Choosing negative examples

Recall that examples of non-interacting proteins were chosen as random pairs of proteins. To test the stability of our results with respect to the choice of negative examples, we ran a set of experiments using 10 different randomly selected sets of non-interacting proteins. Predictions were made using the motif kernel. The standard deviation of the resulting ROC scores was 0.003, showing good stability.

Significant attention has been paid to the problem of selecting gold-standard interacting protein pairs for the purposes of training and validating predictive computational methods (Jansen et al., 2003). However, less emphasis has been placed on the choice of non-interacting protein pairs. In this study, we selected negatives uniformly at random; we find that this strategy leads to consistent behavior and avoids bias. The possibility of bias due to the method of constructing negative examples is evidenced by results reported in a related paper (Martin et al., 2005). In that work, the authors report that a pairwise spectrum kernel provides highly accurate predictions of yeast interactions using a dataset studied in Jansen et al. (2003). The positive examples in this dataset satisfy our criteria for trusted interactions, and one might conclude that the use of highly reliable interactions is the reason for the success of the predictive method. However, we found that the method of choosing negative examples has a strong effect on performance: the negative examples in Jansen et al. (2003) were chosen as pairs of proteins that are known to be localized in different cellular compartments. This makes these protein pairs much less likely to interact than randomly selected pairs, but the selection constraints impose a bias on the resulting distribution that makes the overall learning task easier [note that this is less likely to affect the results of non-sequence-based methods, such as the one used by Jansen et al. (2003)]. To illustrate this effect, we created datasets with negative examples taken as pairs whose GO component similarity, as measured by our kernel, is below a given threshold. The performance of the resulting classifier varied as we varied this threshold (Table 2). This constrained selection method was tested with the spectrum and motif kernels, using both the BIND interaction data and a set of trusted interactions, similar to the one used by Martin et al. (2005), extracted from DIP and MIPS (Mewes et al., 2000; Xenarios et al., 2002). For the spectrum kernel, the ROC (ROC50) scores varied from 0.87 (0.08) to 0.97 (0.46) on the DIP/MIPS data and from 0.77 (0.04) to 0.95 (0.36) on the BIND data as the threshold was lowered from 0.5 to 0.04. Similar, although slightly less pronounced, results were obtained for the motif pairwise kernel.

4.4 The dependence on interacting paralogs

The yeast genome contains a large number of duplicated genes. Since we are using a sequence-based method to predict interactions, we need to determine to what extent the performance depends on the presence of interacting paralogs. We therefore performed an experiment in which no protein in the training set has a BLAST E-value more significant than a given threshold to any protein in the test set. In this case we performed 2-fold cross-validation instead of 5-fold cross-validation. For the pairwise motif–Pfam–spectrum kernel, the ROC score decreased from 0.86 with no constraint to 0.81 when the training and test sets did not share proteins whose BLAST E-values were better than 0.1. The ROC score for the PSI-BLAST (BLAST) method went down from 0.78 (0.74) to 0.62 (0.62). This illustrates that the kernel combination is less dependent on the presence of interacting paralogs than BLAST or PSI-BLAST.
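A homology-constrained evaluation of this kind requires splitting the proteins so that no protein in one fold has a BLAST hit better than the threshold in the other fold. One way to sketch such a split, assuming pairwise E-values have been precomputed (the data layout and the greedy cluster assignment are our own assumptions, not the paper's exact procedure):

```python
from collections import defaultdict

def homology_clusters(proteins, evalues, threshold=0.1):
    """Single-linkage clustering: link two proteins whenever their BLAST
    E-value is better (smaller) than `threshold`. `evalues` maps
    frozenset({p, q}) -> E-value (assumed precomputed)."""
    adj = defaultdict(set)
    for pair, e in evalues.items():
        if e < threshold:
            p, q = tuple(pair)
            adj[p].add(q)
            adj[q].add(p)
    seen, clusters = set(), []
    for p in proteins:
        if p in seen:
            continue
        stack, comp = [p], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

def two_fold_split(proteins, evalues, threshold=0.1):
    """Assign whole homology clusters to folds so that no cross-fold
    protein pair has an E-value better than the threshold."""
    folds = [set(), set()]
    for comp in sorted(homology_clusters(proteins, evalues, threshold),
                       key=len, reverse=True):
        # Greedy balancing: put each cluster into the smaller fold.
        folds[0 if len(folds[0]) <= len(folds[1]) else 1] |= comp
    return folds
```

Because entire clusters move together, tightening the threshold merges more proteins into shared clusters and makes a balanced split harder, which is one reason the 2-fold (rather than 5-fold) setup is convenient here.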
Table 2. The dependence of the performance of the spectrum pairwise method on the similarity between localization annotations in negative examples

Dataset     Threshold   ROC    ROC50
BIND        0.50        0.77   0.04
            0.10        0.89   0.15
            0.07        0.91   0.21
            0.05        0.92   0.25
            0.04        0.95   0.36
DIP/MIPS    0.50        0.87   0.08
            0.10        0.94   0.22
            0.07        0.95   0.32
            0.05        0.96   0.34
            0.04        0.97   0.46

Note: Enforcing the condition that the two proteins in each negative example have a GO similarity below a given threshold puts a constraint on the distribution of negative examples. This constraint makes it easy for the classifier to distinguish between positive and negative examples, and the effect gets stronger as the threshold becomes smaller. We performed the experiment on the BIND interaction dataset and on a dataset of reliable interactions derived from DIP and MIPS.

5 DISCUSSION

In this paper we presented several kernels for the prediction of protein–protein interactions and used them in combination for improved performance. One concern regarding the pairwise kernel is the high dimensionality of its feature space, which is quadratic in the number of features of the underlying kernel. We considered an alternative kernel that uses summation instead of the multiplication used in the expression for the pairwise kernel, similar to the work of Gomez et al. (2003). The performance of the summation kernel is not as good as that of the corresponding pairwise kernel, showing the advantage of using pairs of features.

When training a classifier to predict protein–protein interactions, there is a balance between placing only trusted interactions in the training set and trying to maximize the number of positive examples by adding interactions about which we are less sure. When using a sequence-based approach, as we have done here, the sensitivity of the method may depend on the richness of the training set. We have shown in this paper that we are able to use a larger set of noisy data while still achieving good performance. As an alternative to training on a dataset that includes false positive interactions, we plan to first apply a filtering step to the interaction data, based on features of trusted interactions, in order to maximize the number of interactions that can be considered reliable.

We also made no attempt to purge from our dataset examples that contain missing data (e.g. missing GO annotations). When making predictions on unseen data, those data will also contain missing values, so the method is more likely to generalize if it is presented with examples containing missing data during training.

During the writing of this paper we found that the pairwise approach had also been proposed by Martin et al. (2005). They used only the spectrum kernel, whereas here we considered several sequence kernels. We found that the spectrum kernel works better than the motif and Pfam kernels according to the ROC metric, but not as well according to the ROC50 metric. Apparently, the signal that the spectrum kernel generates is not as specific as that of the other kernels.

In addition, we have illustrated that pairwise sequence kernels can be successfully combined with non-sequence data. In this work we have not attempted to learn the weights of the various kernels, as done by Lanckriet et al. (2004). This is an avenue for future work, although solving the resulting semidefinite programming problem promises to be computationally expensive, owing to the large training sets involved. We also plan to consider additional sources of data, such as gene expression and transcription factor binding data, which have also been shown to be informative in predicting protein–protein interactions (Zhang et al., 2004).

ACKNOWLEDGEMENTS

The authors thank Doug Brutlag, David Baker, Ora Schueler-Furman and Trisha Davis for helpful discussions. This work is funded by NCRR NIH award P41 RR11823, by NHGRI NIH award R33 HG003070 and by NSF award BDI-0243257. W.S.N. is an Alfred P. Sloan Research Fellow.

REFERENCES

Bader,G.D., Donaldson,I., Wolting,C., Ouellette,B.F., Pawson,T. and Hogue,C.W. (2001) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res., 29, 242–245.
Ben-Hur,A. and Brutlag,D. (2003) Remote homology detection: a motif based approach. Bioinformatics, 19 (Suppl. 1), i26–i33.
Deane,C., Salwinski,L., Xenarios,I. and Eisenberg,D. (2002) Two methods for assessment of the reliability of high throughput observations. Mol. Cell. Proteomics, 1, 349–356.
Deng,M., Mehta,S., Sun,F. and Chen,T. (2002) Inferring domain–domain interactions from protein–protein interactions. Genome Res., 12, 1540–1548.
Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.
Goldberg,D. and Roth,F. (2003) Assessing experimentally derived interactions in a small world. Proc. Natl Acad. Sci. USA, 100, 4372–4376.
Gomez,S.M., Noble,W.S. and Rzhetsky,A. (2003) Learning to predict protein–protein interactions from protein sequences. Bioinformatics, 19, 1875–1881.
Jansen,R., Yu,H., Greenbaum,D., Kluger,Y., Krogan,N.J., Chung,S., Emili,A., Snyder,M., Greenblatt,J.F. and Gerstein,M. (2003) A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science, 302, 449–453.
Lanckriet,G.R.G., Deng,M., Cristianini,N., Jordan,M.I. and Noble,W.S. (2004) Kernel-based data fusion and its application to protein function prediction in yeast. In Altman,R.B., Dunker,A.K., Hunter,L., Jung,T.A. and Klein,T.E. (eds), Proceedings of the Pacific Symposium on Biocomputing. World Scientific, Singapore, pp. 300–311.
Leslie,C., Eskin,E. and Noble,W.S. (2002) The spectrum kernel: a string kernel for SVM protein classification. In Altman,R.B., Dunker,A.K., Hunter,L., Lauderdale,K. and Klein,T.E. (eds), Proceedings of the Pacific Symposium on Biocomputing. World Scientific, Singapore, pp. 564–575.
Lin,N., Wu,B., Jansen,R., Gerstein,M. and Zhao,H. (2004) Information assessment on predicting protein–protein interactions. BMC Bioinformatics, 5, 154.
Marcotte,E.M., Pellegrini,M., Ng,H.-L., Rice,D.W., Yeates,T.O. and Eisenberg,D. (1999) Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753.
Martin,S., Roe,D. and Faulon,J.-L. (2005) Predicting protein–protein interactions using signature products. Bioinformatics, 21, 218–226.
Mewes,H.W., Frishman,D., Gruber,C., Geier,B., Haase,D., Kaps,A., Lemcke,K., Mannhaupt,G., Pfeiffer,F., Schüller,C., Stocker,S. and Weil,B. (2000) MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 28, 37–40.
Nevill-Manning,C.G., Sethi,K.S., Wu,T.D. and Brutlag,D.L. (1997) Enumerating and ranking discrete motifs. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, pp. 202–209.
Noble,W.S. (2004) Support vector machine applications in computational biology. In Schölkopf,B., Tsuda,K. and Vert,J.-P. (eds), Kernel Methods in Computational Biology. MIT Press, Cambridge, MA, pp. 71–92.
Pazos,F. and Valencia,A. (2002) In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins, 47, 219–227.
Ramani,A. and Marcotte,E. (2003) Exploiting the co-evolution of interacting proteins to discover interaction specificity. J. Mol. Biol., 327, 273–284.
Schölkopf,B. and Smola,A. (2002) Learning with Kernels. MIT Press, Cambridge, MA.
Sonnhammer,E., Eddy,S. and Durbin,R. (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins, 28, 405–420.
Sprinzak,E. and Margalit,H. (2001) Correlated sequence-signatures as markers of protein–protein interaction. J. Mol. Biol., 311, 681–692.
Sprinzak,E., Sattath,S. and Margalit,H. (2003) How reliable are experimental protein–protein interaction data? J. Mol. Biol., 327, 919–923.
von Mering,C., Krause,R., Snel,B., Cornell,M., Oliver,S.G., Fields,S. and Bork,P. (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature, 417, 399–403.
Wang,H., Segal,E., Ben-Hur,A., Koller,D. and Brutlag,D.L. (2005) Identifying protein–protein interaction sites on a genome-wide scale. In Saul,L.K., Weiss,Y. and Bottou,L. (eds), Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, pp. 1465–1472.
Xenarios,I., Salwinski,L., Duan,X.Q.J., Higney,P., Kim,S.M. and Eisenberg,D. (2002) DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res., 30, 303–305.
Yu,H., Luscombe,N., Lu,H., Zhu,X., Xia,Y., Han,J., Bertin,N., Chung,S., Vidal,M. and Gerstein,M. (2004) Annotation transfer between genomes: protein–protein interologs and protein–DNA regulogs. Genome Res., 14, 1107–1118.
Zhang,L., Wong,S., King,O. and Roth,F. (2004) Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics, 5, 38.
Bioinformatics – Oxford University Press
Published: Jun 1, 2005