W. Valdar, J. Thornton (2001). Conservation helps to identify biologically relevant crystal contacts. Journal of Molecular Biology, 313(2).
C. Sander, R. Schneider (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics, 9.
D. Eisenberg, R. Weiss, T. Terwilliger (1984). The hydrophobic moment detects periodicity in protein hydrophobicity. Proceedings of the National Academy of Sciences of the United States of America, 81(1).
T. Zhang (2001). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. AI Magazine, 22.
B. Rost, C. Sander (1994). Conservation and prediction of solvent accessibility in protein families. Proteins: Structure, Function, and Genetics, 20.
D. Rennell, S. Bouvier, L. Hardy, A. Poteete (1991). Systematic mutation of bacteriophage T4 lysozyme. Journal of Molecular Biology, 222(1).
W. Valdar, J. Thornton (2001). Protein–protein interfaces: analysis of amino acid conservation in homodimers. Proteins: Structure, Function, and Genetics, 42.
A. Chakravarti (2001). Single nucleotide polymorphisms: . . .to a future of genetic medicine. Nature, 409.
J. Quinlan (1992). C4.5: Programs for Machine Learning.
P. Markiewicz, L. Kleina, C. Cruz, S. Ehret, J. Miller (1994). Genetic studies of the lac repressor. XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as "spacers" which do not require a specific sequence. Journal of Molecular Biology, 240(5).
W. Pearson (1990). Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology, 183.
S. Hong (1997). Data mining. Future Generation Computer Systems, 13.
J. Thompson, D. Higgins, T. Gibson (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22).
B. Boeckmann, A. Bairoch, R. Apweiler, M. Blatter, A. Estreicher, E. Gasteiger, M. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout, M. Schneider (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, 31(1).
D. Loeb, R. Swanstrom, L. Everitt, M. Manchester, S. Stamper, C. Hutchison (1989). Complete mutagenesis of the HIV-1 protease. Nature, 340.
Z. Wang, J. Moult (2001). SNPs, protein structure, and disease. Human Mutation, 17.
S. Altschul, W. Gish, W. Miller, E. Myers, D. Lipman (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3).
P. Ng, S. Henikoff (2001). Predicting deleterious amino acid substitutions. Genome Research, 11(5).
S. Wicks, R. Yeh, W. Gish, R. Waterston, R. Plasterk (2001). Rapid gene mapping in Caenorhabditis elegans using a high density polymorphism map. Nature Genetics, 28.
C. Saunders, D. Baker (2002). Evaluation of structural and evolutionary contributions to deleterious mutation prediction. Journal of Molecular Biology, 322(4).
D. Chasman, R. Adams (2001). Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. Journal of Molecular Biology, 307(2).
S. Sunyaev, V. Ramensky, I. Koch, W. Lathe, A. Kondrashov, P. Bork (2001). Prediction of deleterious human alleles. Human Molecular Genetics, 10(6).
J. Suckow, P. Markiewicz, L. Kleina, J. Miller, B. Kisters-Woike, B. Müller-Hill (1996). Genetic studies of Lac repressor: 4000 single amino acid substitutions and analysis of the resulting phenotypes on the basis of the protein structure. Journal of Molecular Biology, 261(4).
V. Ramensky, P. Bork, S. Sunyaev (2002). Human non-synonymous SNPs: server and survey. Nucleic Acids Research, 30(17).
T. Alber, D. Sun, J. Nye, D. Muchmore, B. Matthews (1987). Temperature-sensitive mutations of bacteriophage T4 lysozyme occur at sites with low mobility and low solvent accessibility in the folded protein. Biochemistry, 26(13).
W. Pearson, D. Lipman (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 85(8).
V. Vapnik (1998). Statistical Learning Theory.
E. Hunt, J. Marin, P. Stone (1966). Experiments in Induction.
B. Rost, C. Sander (1993). Improved prediction of protein secondary structure by use of sequence profiles and neural networks. Proceedings of the National Academy of Sciences of the United States of America, 90(16).
S. Sherry, M. Ward, M. Kholodov, J. Baker, L. Phan, E. Smigielski, K. Sirotkin (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Research, 29(1).
S. Salzberg, A. Segre (1994). Programs for Machine Learning.
Vol. 19 no. 17 2003, pages 2199–2209
BIOINFORMATICS
DOI: 10.1093/bioinformatics/btg297

A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function

V.G. Krishnan and D.R. Westhead
School of Biochemistry and Molecular Biology, University of Leeds, Leeds LS2 9JT, UK
To whom correspondence should be addressed.

Received on February 26, 2003; revised on May 9, 2003; accepted on May 21, 2003
Bioinformatics 19(17) © Oxford University Press 2003; all rights reserved.

ABSTRACT
Motivation: The large volume of single nucleotide polymorphism data now available motivates the development of methods for distinguishing neutral changes from those which have real biological effects. Here, two different machine-learning methods, decision trees and support vector machines (SVMs), are applied for the first time to this problem. In common with most other methods, only non-synonymous changes in protein coding regions of the genome are considered.
Results: In detailed cross-validation analysis, both learning methods are shown to compete well with existing methods, and to out-perform them in some key tests. SVMs show better generalization performance, but decision trees have the advantage of generating interpretable rules with robust estimates of prediction confidence. It is shown that the inclusion of protein structure information produces more accurate methods, in agreement with other recent studies, and the effect of using predicted rather than actual structure is evaluated.
Availability: Software is available on request from the authors.
Contact: [email protected]

INTRODUCTION
An important aspect of the post-genome biology of model organisms and human is to understand the biological effects of inherited variations between individuals. For instance, a key problem for the pharmaceutical industry is to understand variations in drug treatment responses among individuals at the molecular level. Among these variations, single nucleotide polymorphisms (SNPs) have received much attention recently. SNPs are subtle variations, such as insertions, deletions and substitutions, observed in the genomic DNA sequences of individuals of the same species. An enormous volume of SNP data are available in the public databases (http://snp.cshl.org and http://www.ncbi.nlm.nih.gov/SNP; Sherry et al., 2001). SNPs in protein coding exons are classified as synonymous or non-synonymous according to whether or not they alter the protein sequence. Non-synonymous SNPs (nsSNPs) can affect gene function through their effect on the structure and function of the encoded protein. There are many examples of SNPs in coding regions that have a relationship with disease phenotypes (Chakravarti, 2001; Licinio and Wong, 2002). Equally, synonymous SNPs, and those outside protein coding regions, can affect gene function through altered regulation, splicing and levels of protein expression. The volumes of SNP data now available pose a key question: can we predict which SNPs are likely to be neutral and which are likely to affect gene function?

Several recent studies have considered how deleterious and neutral nsSNPs might be distinguished using sequence and structural aspects of the proteins in which they occur. A study by Wang and Moult (2001) showed that most of the detrimental nsSNPs affect protein function indirectly through effects on protein structural stability, for instance by disruption to the protein hydrophobic core, and these authors provided a set of empirical rules to predict deleterious SNPs. Following this, other workers (Chasman and Adams, 2001; Sunyaev et al., 2001; Ramensky et al., 2002; Saunders and Baker, 2002) have asserted the importance of protein structural considerations, and developed prediction methods that depend on mapping SNPs to positions in (homologous) three-dimensional (3D) protein structures, as well as using information from multiple sequence alignments.

In contrast to the above studies, which use mapping to 3D structures, Ng and Henikoff (2001) have developed the SIFT (Sorting Intolerant From Tolerant) method, based on sequence conservation and scores from position-specific scoring matrices. These authors assert that their method performs 'similarly' to the structure-based methods. However, fair comparison of these tools is fraught with difficulties, and the intimate relationship of 3D structure with protein function and stability would suggest that the use of explicit structural information alongside sequence conservation will be found to improve performance, as supported by the recent study of Saunders and Baker (2002).

Of the tools described above, the Chasman and Adams method is the only one that involves automated learning from training data. While the other methods depend on rules derived empirically, this method uses a training data set to estimate the probability that a particular nsSNP will affect protein function. It is based on the description of a mutation or SNP in terms of a set of attributes, including sequence conservation and structural features. The probability of an effect is estimated from the proportion of training set mutations with matching attributes in which an effect on function is known to occur. Effects are predicted if the probability is >0.5. The probability also serves as an estimate of the confidence level of the prediction. Attributes were chosen by a detailed statistical analysis of their effect on function, and the method was trained and cross-validated using the extensive systematic mutation data sets available for lysozyme (Alber et al., 1987; Rennell et al., 1991) and the lac repressor (Markiewicz et al., 1994; Suckow et al., 1996).
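The attribute-matching estimate that drives such a probabilistic predictor can be sketched in a few lines. This is an illustrative reconstruction, not the published Chasman and Adams implementation: the attribute encoding and the toy data below are invented for the example.

```python
from collections import defaultdict

def train_counts(training_set):
    """Tally, for every attribute combination seen in training, how many
    mutations carry it and how many of those have a functional effect.
    `training_set` holds (attributes, has_effect) pairs, where `attributes`
    is a tuple such as (conservation_bin, accessibility, residue_class)."""
    counts = defaultdict(lambda: [0, 0])   # key -> [n_effect, n_total]
    for attributes, has_effect in training_set:
        counts[attributes][0] += int(has_effect)
        counts[attributes][1] += 1
    return counts

def predict(counts, attributes):
    """Predict 'effect' if the proportion of matching training mutations
    with an effect exceeds 0.5; that proportion doubles as a confidence
    estimate.  Returns None when no training mutation matches."""
    n_effect, n_total = counts.get(attributes, (0, 0))
    if n_total == 0:
        return None
    p = n_effect / n_total
    return ('effect' if p > 0.5 else 'no effect', p)

# Invented toy data: (conservation, accessibility, class) -> effect observed?
train = [(('high', 'b', 'hydrophobic'), True),
         (('high', 'b', 'hydrophobic'), True),
         (('high', 'b', 'hydrophobic'), False),
         (('low', 'e', 'polar'), False)]
counts = train_counts(train)
print(predict(counts, ('high', 'b', 'hydrophobic')))
```

Because the estimate is a simple proportion over matching training mutations, combinations never seen in training yield no prediction at all, which is one motivation for the learning methods introduced next.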
Automated learning from training data is an attractive alternative to manual tuning of empirical rules. After the set of descriptive attributes has been defined for each mutation, automated methods are able to explore much more fully how these attributes can be used to produce a prediction method that is, in some sense, approximately optimal. It is also much easier to perform rigorous cross-validation of such methods. Here, we report the application of two machine-learning methods: decision trees, as implemented in C4.5 (Quinlan, 1993), and support vector machines (SVMs) (Cristianini and Shawe-Taylor, 2000). The principal difference between these methods lies in the type of classifying function they attempt to learn. The decision tree represents the classifier as a tree structure in which each node represents a decision based on an attribute value, and it leads to a set of predictive rules that can be interpreted easily. On the other hand, the SVM relies on a mapping of the input attributes to a feature space that can be of very high dimension, where the classifier takes the form of a linear function (hyperplane). These methods have been found to be effective in many diverse fields (Mitchell, 1997). Here, we provide a comparison of the two methods and show that they have a contribution to make in SNP analysis.

The attributes used by our methods include both sequence- and structure-based information, but in contrast to other methods we investigate the possibility of using only structural attributes that can be predicted with sufficient accuracy from sequence (secondary structure and solvent accessibility), rather than relying on mapping mutations to (homologous) 3D structures. By removing the need for a homologous structure, the applicability of our method is extended significantly. It is not clear from the literature which of the currently available methods performs the best, but the availability of detailed cross-validation data and prediction confidence estimates for the Chasman and Adams method (above) is very convenient for comparison with our learning methods. Accordingly, we adopt their training sets (unbiased mutation data for lysozyme and lac repressor proteins), and replicate and extend their cross-validation techniques. Throughout this paper, the Chasman and Adams method is referred to as 'the probabilistic method'. As an example application, we report the application of our method to the SNPs observed to occur between two strains of the nematode worm Caenorhabditis elegans.

SYSTEM AND METHODS
Here, we provide only brief descriptions of the decision tree and SVM methods. More detail can be found in the references cited.

Decision trees
Decision tree learning (Mitchell, 1997; Witten and Frank, 2000) is a means for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Each instance (in this case a mutation or a SNP) is sorted down the tree from the root according to the values of its attributes (e.g. types of residues involved, sequence conservation, structural features) until it reaches a classifying leaf node ('effect' or 'no effect') where the prediction is made. This process is illustrated in the (fictitious) example shown in Figure 1.

Fig. 1. A simplified example decision tree. Acc is the solvent accessibility (b = buried, e = exposed, i = intermediate), Mut_res is the mutated residue identity (R = Arginine), Cons_value is the conservation score of the original residue. In this case the final decision is binary, Y (effect) or N (no effect).

Here we used the C4.5 decision tree software, which is derived from Hunt's method (Hunt et al., 1966) for constructing a decision tree. The software can be downloaded freely (http://www.cse.unsw.edu.au/~quinlan/). First, the decision tree was obtained for the training data using the program c4.5, and rules were generated by the program called c4.5rules, which uses the decision tree constructed by c4.5. Experiments were conducted to optimize the input parameters of the software (for both prediction accuracy and generalization), but the default values were found to be approximately optimal in most cases, and were used throughout this paper.

The decision tree software gives an estimated accuracy for each rule, which is derived from the training data. These estimated accuracies were used to assign confidence levels to the predictions. Rules with estimated accuracies of x% were taken to have a confidence level of x/100 (e.g. a rule with estimated accuracy of 90% was assigned an estimated confidence level of 0.9). The confidence level can be viewed as an estimate of the probability that a prediction from the rule is correct.
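A tree of the kind in Figure 1 amounts to nested attribute tests; the sketch below hand-codes one such tree to make the control flow concrete. The split values are invented for illustration only; in the real method the tree and its rules are learned from training data by c4.5 and c4.5rules.

```python
def classify(mutation):
    """Walk a (fictitious) decision tree of the Figure 1 kind.
    `mutation` maps 'acc' -> solvent accessibility ('b'/'i'/'e'),
    'mut_res' -> one-letter code of the mutated residue, and
    'cons_value' -> conservation score of the original residue (0..1)."""
    if mutation['acc'] == 'b':              # buried position
        # invented threshold: mutating a conserved buried residue -> effect
        return 'Y' if mutation['cons_value'] > 0.7 else 'N'
    if mutation['mut_res'] == 'R':          # mutation to arginine
        return 'Y'
    return 'N'

print(classify({'acc': 'b', 'mut_res': 'A', 'cons_value': 0.9}))  # Y
print(classify({'acc': 'e', 'mut_res': 'G', 'cons_value': 0.2}))  # N
```

Each root-to-leaf path corresponds to one extractable rule, which is what makes the classifier directly interpretable.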
In order to facilitate comparison with other published methods, we report (see Results) error rates for predictions made from rules with confidence levels above a defined threshold (e.g. error rates at the confidence level threshold of 0.6 cover predictions from all rules whose confidence level is at least 0.6).

SVMs
Currently, SVMs (Vapnik, 1998) are gaining great attention in the field of bioinformatics. Here, a classic two-class problem is addressed: SNPs have to be divided into two classes, 'effect' or 'no effect'. Like decision trees, SVMs use an input vector of attributes for each instance. Using a kernel function, the input vectors are mapped to a feature space of high dimension in which the SVM method constructs a hyperplane that optimally separates instances from the two classes. Here, we constructed SVMs using mySVM (Vapnik, 1998; http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/). After various trials of different parameters for better performance (data not shown), we chose the polynomial kernel function of degree d = 2 given by

K(x, y) = (x · y + 1)^d

The values of the other parameters for the mySVM software were ε = 1.0 × 10^−12 and C = 1.

The mySVM software does not provide any estimate of confidence in classifications, and SVM theory in this area is currently not well developed. In contrast to our analysis of decision trees, therefore, we do not provide confidence levels for predictions made by SVMs.

Data sets
The systematic unbiased mutagenesis data sets of lac repressor (Markiewicz et al., 1994; Suckow et al., 1996) and T4 lysozyme (Alber et al., 1987; Rennell et al., 1991) were used to train and validate the prediction methods. Mutations in the first 62 residues of lac repressor were omitted, because they are missing from the protein databank structure of this protein (Berman et al., 2000). The number of mutations taken for the analysis was 3303 for lac repressor and 1990 for lysozyme. The experimental results for both proteins are given as four-valued expressions of the effect of each mutation on the protein function. In the case of lysozyme, plaque-forming ability was rated as ++ (no effect), + (slight effect), +/− (larger effect) and − (complete absence). Taking ++ as a neutral mutation and all the rest as effects gives a data set in which 38% of mutations have an effect on function. In the case of the lac repressor the four values given were + (no effect), +− (slight effect), −+ (larger effect) and − (complete absence). Here, '+' was considered as neutral and the rest as effects, resulting in 45% of the mutations having an effect on the protein function. These definitions were adopted by Chasman and Adams.

It is not clear that the method above is the best way to convert the four experimental effect classes into a binary classification. Here, it was employed in order to give comparability to previous studies, which have adopted the same definition, and because it leads to a similar rate of mutations causing effects (38–45%) in each data set, suggesting that it defines a similar degree of effect in each protein. It is not clear how this definition relates to observable phenotypic effects on an organism, a point that we will discuss later.

In order to investigate these issues further we used a third, smaller data set (336 mutations) for the HIV protease (Loeb et al., 1989). This data set contains at least one mutation at every sequence position. In this case the experimentalists define only three degrees of effect: + (no effect), +/− (small effect) and − (complete absence). Taking a definition analogous to the one adopted for the other data, where everything other than + is considered as an effect, resulted in 67% of mutations being considered as effects. This percentage is significantly different to the 38–45% observed in the lac repressor and lysozyme data. Therefore, in addition, we investigated how an alternative definition, in which both + and +/− were treated as neutral, would change the performance of the learning methods. With this latter definition 47% of protease mutations have an effect on function.

The SNP data for the C.elegans genome were obtained from the St Louis Washington University web site (Wicks et al., 2001; http://www.genome.wustl.edu/projects/celegans/index.php?snp=1) and the six chromosome data from the Sanger centre website (http://www.sanger.ac.uk/Projects/C_elegans/WORMBASE/GFF_files.shtml). The SNPs are between CB4856, an isolate from the Hawaiian Islands, and the completely sequenced reference N2 strain (from Bristol, UK). The SNP data are from part (5.4 Mb) of the C.elegans genome. The SNPs had to be associated to positions in the chromosome data by Fasta (Pearson and Lipman, 1988; Pearson, 1990) alignment. Using the exon, intron and intergenic information provided by the Sanger centre web site, the protein coding regions were translated to get the corresponding protein sequences.

Attributes
The attributes of SNPs used for predictions were chosen from the following set: the residue identities of the original and mutated residue, the physicochemical classes of these residues (hydrophobic, polar, charged, glycine), sequence conservation score at the mutated position, molecular mass shift on mutation, hydrophobicity difference, secondary structure, solvent accessibility and buried charge. This set is based on other attribute sets from the literature; changes in these quantities on mutation are likely to affect protein function (e.g. by changing a key conserved functional residue) or protein structural stability (e.g. by disruption of the hydrophobic core through a residue size change reflected in a large molecular mass shift). Only the latter three attributes require information from protein structure rather than sequence, but these have been included because they can be predicted (see below) and so our method does not require mapping to a homologous 3D structure. In contrast to other methods, we do not use the crystallographic B factor, which would limit the applicability of our method to sequences that can be mapped to homologous X-ray structures. In the Results section, we investigate the subset of these attributes that produces optimal cross-validated predictions.
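The mass-shift, hydrophobicity-difference and buried-charge attributes are simple arithmetic once residue tables are fixed. The sketch below uses a handful of average residue masses and Kyte–Doolittle hydropathy values by way of example; the paper says only that hydrophobicity values were taken from the literature, so the particular scale here is an assumption.

```python
# Partial illustrative tables: average residue masses (Da) and
# Kyte-Doolittle hydropathy values for a few residues.
MASS = {'G': 57.05, 'A': 71.08, 'R': 156.19, 'D': 115.09, 'W': 186.21}
HYDRO = {'G': -0.4, 'A': 1.8, 'R': -4.5, 'D': -3.5, 'W': -0.9}
CHARGED = set('KRDEH')

def structure_attributes(wild, mut, accessibility):
    """Mass shift (mutated minus original), hydrophobicity difference,
    and buried charge, with `accessibility` in the three-state code
    'b' (buried) / 'i' (intermediate) / 'e' (exposed)."""
    return {
        'mass_shift': MASS[mut] - MASS[wild],
        'hydro_diff': HYDRO[mut] - HYDRO[wild],
        'buried_charge': wild in CHARGED and accessibility == 'b',
    }

print(structure_attributes('D', 'W', 'b'))
```

For a buried aspartate mutated to tryptophan, for instance, this yields a large positive mass shift and a buried-charge flag, exactly the kind of hydrophobic-core disruption the attribute set is designed to capture.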
The secondary structure and solvent accessibility information for the lysozyme and lac repressor proteins was extracted from homology-derived secondary structure of proteins (Sander and Schneider, 1991) files, as 3D structures are available for these proteins. A three-state description of solvent accessibility (buried, intermediate and exposed) was used. In order to study the effect of using predicted rather than actual structure, and in cases of proteins of unknown structure, PHD (Rost and Sander, 1993, 1994) was used to predict secondary structure and solvent accessibility. The hydrophobicity values were taken from the literature. The sequence conservation score was calculated using the ScoreCons program (Valdar and Thornton, 2001a,b), using multiple sequence alignments of proteins extracted by BLAST (Altschul et al., 1990) searches of the SWALL database (Boeckmann et al., 2003) with an E-value cut-off of 0.01 and aligned with Clustal W (Thompson et al., 1994). The mutation mass shift was calculated as the difference between the relative molecular mass of the mutated residue and the original residue. The wild type residue was deemed to be a buried charge if it was one of K, R, D, E, H and its solvent accessibility was in the buried class.

Cross-validation methods
Machine learning methods are generally evaluated by a statistical technique called cross-validation. The data are divided into two sets randomly. The first ('training set') is used in training the learning method; the second ('test set') is used for subsequent evaluation of the accuracy of the trained method. This tests the ability of the method to generalize and make predictions on unknown data.

We report three types of cross-validation: homogeneous, heterogeneous (after Chasman and Adams) and mixed. For homogeneous cross-validation, each protein data set was taken separately and cross-validation performed on that set in isolation. For heterogeneous cross-validation, the data set of one protein (e.g. lysozyme) was used as training set and that of the other protein (e.g. lac repressor) was used as test set. For mixed cross-validation, the data from each protein were pooled as a single data set and cross-validation performed on this pooled set.

In the case of homogeneous and mixed cross-validations, the data were randomized and split into 10 equal parts. One part was used as test set and the remainder as training set. This procedure was repeated 10 times so that each case or example (here, each mutation) was used exactly once for testing. This is called 10-fold cross-validation, and has been shown to give good estimated error rates (Witten and Frank, 2000). In 10-fold cross-validation, the central tendency and spread of results were assessed as median and interquartile range (these were found to be very similar to the alternative mean and SD).

RESULTS
The results presented in this section concern error rates, or misclassification rates, observed in predictions of functional effects of mutations (nsSNPs). Predictions are binary valued, 'effect' or 'no effect', indicating whether or not a given SNP is predicted to have a deleterious effect on protein function. The error rate is the proportion of the total number of predictions that were wrong. In all cases, three different error rates are reported: the overall error rate, and separate error rates for positive (effect) predictions and negative (no effect) predictions. With some methods it is possible to attach an estimated confidence level to each prediction. This can be viewed as an estimate of the probability that the prediction is correct. If a threshold is set so that only predictions above a certain confidence level are accepted, then the number of predictions made usually decreases as this threshold is increased (i.e. if higher confidence is required then methods generally make fewer predictions). Therefore, we report the number of predictions made as well as error rates: the better of two methods compared at the same confidence level or error rate is the one able to make the larger number of predictions.

Optimization of the set of attributes
Here, optimization means finding an attribute set that maximizes the total number of predictions while minimizing the overall error rate. Initially the sequence-based attributes, including conservation score and the identities of wild type and mutated residues and their physicochemical classes, were chosen. Following this, attributes were added sequentially to this basic set to test their effect on the quality of the predictions. The performance of the decision tree method at a confidence level of 0.5 using mixed (lysozyme and lac repressor) cross-validation for an expanding attribute set is shown in Figure 2. It is clear from Figure 2 that each addition to the attribute set prompts a fall in the overall error rate and a slight increase in the number of predictions made. Also, the addition of structural attributes, such as buried charge, solvent accessibility and secondary structure information, reduces error rates, in agreement with the conclusions drawn in a previous study (Saunders and Baker, 2002). Given these observations, the full set of attributes (set 5 in Fig. 2) was used for learning in all the subsequent studies reported here.

Fig. 2. The effect of including extra attributes on error rates (gray line and square markers) and prediction numbers (black line and diamond markers) in mixed cross-validation using decision tree learning (predictions with a confidence level of 0.5 or greater). Attribute set 1: the identities and physicochemical classes of wild type and mutated residues and also the sequence conservation score. Set 2: set 1 plus mass and hydrophobicity differences. Set 3: set 2 plus buried charge. Set 4: set 3 plus solvent accessibility. Set 5: set 4 plus secondary structure. Error rates are significantly lower (95% level) with set 5 compared to any other set (Wilcoxon rank sum test).

The error rates in Figure 2 are all in the range 0.29–0.21. These are significantly lower than the best error rates that could be achieved with naïve prediction methods. A naïve method predicting either class randomly with equal probability would have an error rate of 0.5 on any test set, while one with knowledge of the composition of the test set could use this optimally by predicting the dominant class ('no effect' in this case) exclusively. In this case, the latter method would have an error rate of 0.42 (the proportion of 'effect' mutations in the data set).
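The 10-fold procedure described under 'Cross-validation methods' can be sketched generically. Here `train_fn` and `predict_fn` stand in for any of the learners discussed, and the majority-class demo learner with its synthetic examples is invented purely for illustration.

```python
import random

def ten_fold_errors(data, train_fn, predict_fn, folds=10, seed=0):
    """Shuffle the labelled data, split it into `folds` parts, and use
    each part exactly once as the test set with the remainder for
    training.  Returns one error rate per fold; these can then be
    summarized as median and interquartile range as in the text."""
    data = list(data)
    random.Random(seed).shuffle(data)
    parts = [data[i::folds] for i in range(folds)]
    rates = []
    for i in range(folds):
        test = parts[i]
        train = [ex for j, part in enumerate(parts) if j != i for ex in part]
        model = train_fn(train)
        wrong = sum(predict_fn(model, x) != y for x, y in test)
        rates.append(wrong / len(test))
    return rates

# Demo: a majority-class learner on invented labelled examples.
examples = [((i % 3,), i % 2 == 0) for i in range(50)]
majority = lambda train: sum(y for _, y in train) * 2 >= len(train)
rates = ten_fold_errors(examples, majority, lambda model, x: model)
print(len(rates))  # 10
```

Mixed cross-validation corresponds to running this routine on the pooled data set, while heterogeneous cross-validation skips the splitting entirely and uses one protein's data as `train` and the other's as `test`.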
Performance of decision tree learning
The results for homogeneous cross-validation are given in Table 1A. Results from the decision tree method are compared with the published results of the probabilistic method of Chasman and Adams (2001). Note that the data in this table are cumulative: each confidence level threshold includes predictions made at a confidence level equal to or higher than the threshold. Both methods show the expected increase in both the overall number of predictions and observed error rates with decreasing confidence level threshold. The overall error rates of the decision tree method are generally significantly lower than those of the probabilistic method for both proteins. The exceptions to this are the error rates at high confidence levels (0.8 and 0.9) for the lac repressor. Viewing the 'effect' predictions separately, the error rates follow the same trend as the overall error rates, with the decision tree performing best at lower confidence thresholds. At higher thresholds (e.g. 0.9), error rates from the probabilistic method are lower, but these rates are achieved at the expense of making fewer predictions. For instance, in the case of the lac repressor at a confidence threshold of 0.9, the probabilistic method makes 10 effect predictions with no errors (repeat cross-validations were not reported for this method), while (in multiple cross-validations) the decision tree makes on average 21.5 predictions and 1.5 errors. In the case of 'no effect' predictions the decision tree performs best on the lysozyme data in terms of both error rate and prediction numbers at each confidence level, but performance is more comparable on the lac repressor data.

It is interesting to compare observed error rates with the estimated confidence levels of predictions. For instance, in the case of the decision tree method the error rate at confidence threshold 0.9 (Table 1A) is close to the approximate expected value of 0.1 (= 1 − 0.9) in the case of both proteins, reflecting the use of rules with confidence value 0.9 or above. Comparing error rates to confidence levels for thresholds lower than 0.9 in Table 1A is more difficult because the data are given cumulatively. For instance, the predictions at the threshold 0.5 include all predictions of confidence 0.5 and above, including many of much higher confidence, and the corresponding error rate is therefore significantly <0.5.

Homogeneous cross-validation tests the ability of a method to learn rules applicable to a single protein. Heterogeneous cross-validation is a much more stringent and realistic test. It examines the ability of a method to learn rules that generalize from one protein to another. The results of heterogeneous cross-validation are shown in Table 1B. In this case, the overall error rates at all confidence levels are higher than in homogeneous cross-validation, as expected for a more difficult test. However, in contrast with homogeneous cross-validation, the decision tree error rates are higher than those of the probabilistic method at all but the highest confidence level thresholds. This is an indication that while the decision tree performs best in homogeneous cross-validation, it is more prone to learning protein-specific rules that do not generalize well to other protein examples. The effect seems to be particularly marked for 'effect' predictions, with performance of the methods being more comparable for 'no effect' predictions.

Although extensive in the cases of lysozyme and lac repressor, the mutation data for the two proteins are still a very small sample of naturally occurring proteins, and it is almost certainly unreasonable to expect rules learned from a single protein to be universally applicable. To form our most accurate prediction method, we therefore used all the data in training. Error rates for such a method can be estimated by mixed cross-validation. The results of this for various confidence levels are presented in Table 2 (numbers not in parentheses). In this case, the observed error rates at each confidence level are much more similar to those observed in homogeneous cross-validation, indicating that when both data sets are used for training the decision tree method is able to learn rules applicable to both proteins. Mixed cross-validation was not performed by Chasman and Adams (2001), so no comparison can be made with the probabilistic method in this case.

It is noticeable in several of the above cases that the cumulative number of 'effect' predictions tends to increase for each successive decrease in the confidence level threshold, while the number of 'no effect' predictions often reaches a plateau. For instance, in homogeneous cross-validation (Table 1A) the number of 'no effect' predictions made with the decision tree method does not increase between confidence level thresholds of 0.7–0.5 in the case of either protein. This indicates that the decision tree rules predicting 'no effect' are all of a
Table 1. Prediction results and error rates from homogeneous (A) and heterogeneous (B) cross-validation at several confidence level thresholds. Entries are given at thresholds 0.9 / 0.8 / 0.7 / 0.6 / 0.5; decision tree results outside square brackets, probabilistic method in square brackets.

(A) Homogeneous

Lysozyme:
  Predicted effect, actual effect: 11±4 [3] / 32.5±3.3 [8] / 42±2.5 [24] / 50.5±3.7 [37] / 52.5±2.5 [43]
  Predicted effect, actual no effect: 2±9 [0] / 3.5±0.5 [2] / 10±2.5 [9] / 16.5±2.5 [14] / 18.5±2.3 [24]
  Predicted no effect, actual no effect: 5.5±3.4 [2] / 76±9.2 [15] / 86±4 [37] / 86±4.4 [46] / 86±4.4 [63]
  Predicted no effect, actual effect: 0±0.38 [1] / 9.5±1.3 [7] / 11±1 [10] / 11±1 [18] / 11±1 [28]
  Overall error rate: 0.11±0.02 [0.17] / 0.11±0.02 [0.28] / 0.15±0.02 [0.24] / 0.20±0.03 [0.28] / 0.2±0.02 [0.33]
  Effect error rate: 0.14 [0.00] / 0.10±0.04 [0.20] / 0.18±0.04 [0.27] / 0.25±0.04 [0.27] / 0.25±0.04 [0.36]
  No effect error rate: 0±0.03 [0.33] / 0.12±0.01 [0.32] / 0.12±0.02 [0.21] / 0.12±0.02 [0.28] / 0.12±0.02 [0.31]

Lac repressor:
  Predicted effect, actual effect: 21.5±10 [10] / 52.5±7 [34] / 78±6 [52] / 95±5.5 [66] / 95±2.9 [78]
  Predicted effect, actual no effect: 1.5±1.3 [0] / 6.5±1.9 [8] / 15±2.2 [17] / 23.5±2.7 [24] / 24±2.4 [27]
  Predicted no effect, actual no effect: 72±35.8 [90] / 127±20.6 [122] / 146±17.9 [142] / 146±18.7 [156] / 146±18.6 [169]
  Predicted no effect, actual effect: 4.5±3 [3] / 12±2.2 [5] / 17.5±4.9 [22] / 17.5±5.3 [32] / 17.5±5.3 [43]
  Overall error rate: 0.08±0.02 [0.03] / 0.12±0.03 [0.08] / 0.14±0.03 [0.17] / 0.16±0.01 [0.20] / 0.16±0.01 [0.22]
  Effect error rate: 0.07±0.06 [0.00] / 0.12±0.03 [0.19] / 0.16±0.02 [0.25] / 0.20±0.01 [0.27] / 0.19±0.01 [0.26]
  No effect error rate: 0.07±0.02 [0.03] / 0.10±0.03 [0.04] / 0.13±0.03 [0.13] / 0.13±0.03 [0.17] / 0.13±0.03 [0.20]

(B) Heterogeneous

Training: lysozyme, test: lac repressor:
  Predicted effect, actual effect: 74 [68] / 459 [227] / 531 [358] / 792 [483] / 851 [551]
  Predicted effect, actual no effect: 6 [10] / 301 [33] / 405 [101] / 632 [233] / 708 [345]
  Predicted no effect, actual no effect: 0 [101] / 998 [259] / 1107 [515] / 1107 [666] / 1107 [786]
  Predicted no effect, actual effect: 0 [9] / 243 [27] / 365 [109] / 365 [132] / 365 [182]
  Overall error rate: 0.07 [0.10] / 0.27 [0.11] / 0.32 [0.19] / 0.34 [0.24] / 0.35 [0.28]
  Effect error rate: 0.07 [0.13] / 0.40 [0.13] / 0.43 [0.22] / 0.44 [0.33] / 0.45 [0.39]
  No effect error rate: NA [0.08] / 0.2 [0.09] / 0.25 [0.17] / 0.25 [0.17] / 0.25 [0.19]

Training: lac repressor, test: lysozyme:
  Predicted effect, actual effect: 104 [30] / 220 [70] / 291 [156] / 295 [271] / 299 [341]
  Predicted effect, actual no effect: 23 [8] / 81 [13] / 172 [49] / 174 [108] / 176 [166]
  Predicted no effect, actual no effect: 225 [232] / 388 [368] / 483 [436] / 483 [534] / 483 [644]
  Predicted no effect, actual effect: 79 [90] / 177 [135] / 232 [183] / 232 [233] / 232 [328]
  Overall error rate: 0.24 [0.27] / 0.30 [0.25] / 0.34 [0.28] / 0.34 [0.30] / 0.34 [0.33]
  Effect error rate: 0.18 [0.21] / 0.27 [0.16] / 0.37 [0.24] / 0.37 [0.28] / 0.37 [0.33]
  No effect error rate: 0.26 [0.28] / 0.31 [0.27] / 0.32 [0.30] / 0.32 [0.30] / 0.32 [0.34]

Decision tree results (outside square brackets) are compared with the probabilistic method (in square brackets). The upper rows give total prediction numbers and the lower rows give error rates. Data for the probabilistic method taken from Chasman and Adams (2001), Tables 4 and 5. Decision tree results are given as median ± interquartile range (repeat cross-validations not available for the probabilistic method, or for heterogeneous cross-validation with either method).

Machine learning for SNP predictions
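The three cross-validation schemes compared in Tables 1 and 2 (homogeneous, heterogeneous and mixed) can be sketched as data-splitting logic. This is an illustrative sketch of the general idea only; the split sizes, fold counts and names are assumptions, not the authors' exact protocol.

```python
# Sketch of the three cross-validation schemes discussed in the text.
# 'lysozyme' and 'lac' stand for the two mutation data sets.
import random

def homogeneous_folds(data, k=5):
    """Train and test folds drawn from a single protein's data."""
    d = data[:]
    random.shuffle(d)
    return [(d[:i * len(d) // k] + d[(i + 1) * len(d) // k:],   # train
             d[i * len(d) // k:(i + 1) * len(d) // k])          # test
            for i in range(k)]

def heterogeneous_split(train_protein, test_protein):
    """Train on one protein's data, test on the other's."""
    return train_protein, test_protein

def mixed_folds(lysozyme, lac, k=5):
    """Pool both proteins, then cross-validate on the pooled set."""
    return homogeneous_folds(lysozyme + lac, k)
```

Heterogeneous splitting is the hardest test because nothing about the test protein is seen in training, which is where protein-specific rules are exposed.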
Table 2. Prediction results and error rates from mixed cross-validation at several confidence level thresholds. Entries are given at thresholds 0.9 / 0.8 / 0.7 / 0.6 / 0.5; results using actual structure outside parentheses, predicted structure in parentheses.

  Predicted effect, actual effect: 28.5±12.1 (6.5±2.6) / 70±7.4 (32±4.4) / 105±3.4 (80±4.8) / 133±4.1 (110±7.8) / 135±4.1 (124±7.9)
  Predicted effect, actual no effect: 2±0.5 (0±3.8) / 9±2.7 (4±0.9) / 26.5±4.7 (16.5±4.8) / 42.5±2.6 (31.5±7.8) / 44.5±3.1 (42.5±6.8)
  Predicted no effect, actual no effect: 32±13.4 (17.5±2.4) / 197±37.4 (115±39.4) / 228±41.9 (149±45.8) / 233±37.4 (157±43.6) / 233±37.4 (157±43.6)
  Predicted no effect, actual effect: 1.5±1.4 (1.5±1) / 28±3.8 (17.5±5.5) / 40.5±8.3 (32±4.9) / 41.5±7.3 (37.5±4) / 41.5±7.3 (37.5±4)
  Overall error rate: 0.08±0.04 (0.1±0.04) / 0.13±0.02 (0.14±0.02) / 0.18±0.01 (0.2±0.03) / 0.2±0.01 (0.21±0.03) / 0.21±0.01 (0.23±0.02)
  Effect error rate: 0.09±0.02 (0±0.03) / 0.11±0.02 (0.12±0.00) / 0.20±0.02 (0.17±0.04) / 0.24±0.01 (0.23±0.03) / 0.25±0.02 (0.25±0.03)
  No effect error rate: 0.08±0.05 (0.09±0.05) / 0.13±0.01 (0.15±0.02) / 0.16±0.01 (0.17±0.03) / 0.16±0.02 (0.18±0.04) / 0.16±0.02 (0.18±0.04)

Decision tree results using actual structure (outside parentheses) are compared with those using predicted structure (in parentheses). The upper rows give total prediction numbers and the lower rows give error rates. Results are given as median ± interquartile range.

Effect of predicted data

Here we compare decision tree performance using predictions for secondary structure and solvent accessibility instead of experimentally determined values. The results from mixed cross-validation using decision trees for various confidence levels are given in Table 2. It is encouraging that observed error rates are generally similar, or very slightly higher, when predicted structure is used. However, the real effect of using predicted structure is seen when numbers of predictions are considered. When predicted structure is used it is clear that fewer predictions are possible at each confidence level threshold (e.g. at the confidence threshold of 0.9, on average 28.5 successful predictions of effects are made with actual structures and only 6.5 with predicted). The degradation of performance is therefore evident in prediction numbers rather than error rates, and this suggests that the predicted confidence levels of the decision tree rules are at least robust enough to recognize when lower quality data leads to less certain predictions.

The three-state secondary structure predictions used have accuracies of 74% (lysozyme) and 80% (lac repressor), and the three-state solvent accessibility predictions have accuracies of 51% (lysozyme) and 57% (lac repressor). It would be expected that our methods might perform less well on proteins for which these predictions were less accurate. However, with only two proteins available there is insufficient data to enable assessment of any trend relating the performance of our methods to the accuracy of these predictions.

Rules derived from decision trees

An advantage of the decision tree method is that it produces intelligible rules, and attaches a confidence level to each rule. For example, using the pooled data the final predictions are made on the basis of 50 rules predicting ‘effect’ and 39 rules predicting ‘no effect’. It is not practical to analyse all the rules in this paper, but some illustrative examples are given below.

Rule 1: residue = L; mut_res = P; Obs_Acc = b → class ‘effect’ [90.2%].
Rule 2: residue = G; Obs_Acc = b; Cons_value > 0.252; Cons_value ≤ 0.352 → class ‘effect’ [77.1%].
Rule 3: residue = A; mut_res = G → class ‘no effect’ [96.9%].

Here residue is the original residue, mut_res represents the mutated residue, Obs_Acc is the observed solvent accessibility of the original residue, and Cons_value is the conservation score of the original residue. The number in square brackets is the estimated percentage accuracy of the rule. The first rule indicates that changing a buried leucine residue to a proline tends to affect function, and can be understood in the light of our knowledge of the effect of proline on secondary structure. The second reflects the special nature of glycine (small side chain and high flexibility); replacing glycine in a buried and conserved position tends to affect the function. This second rule gives both lower and upper limits on the conservation score, but we regard the upper limit as a feature reflecting learning on what is still a relatively small data set. On the other hand, rule 3 shows that substituting residues with similar properties tends not to affect the function. It is noteworthy that the first two ‘effect’ rules relate to changes in the stability of the structure, rather than specific effects on key functional residues.

Table 3. Comparison of the performance of decision trees (C4.5), SVMs and the probabilistic method for homogeneous (A) and heterogeneous (B) cross-validation. Entries are given as C4.5 / SVM / probabilistic method.

(A) Homogeneous

Lysozyme:
  Predicted effect, actual effect: 52.5±2.5 / 44±5.1 / 43
  Predicted effect, actual no effect: 18.5±2.3 / 23±1.6 / 24
  Predicted no effect, actual no effect: 86±4.4 / 100±4.8 / 63
  Predicted no effect, actual effect: 11±1 / 35±2 / 28
  Overall error rate: 0.2±0.02 / 0.29±0.01 / 0.33
  Effect error rate: 0.25±0.04 / 0.37±0.04 / 0.36
  No effect error rate: 0.12±0.02 / 0.26±0.02 / 0.31

Lac repressor:
  Predicted effect, actual effect: 95±2.9 / 92±2.9 / 78
  Predicted effect, actual no effect: 24±2.4 / 53±4.5 / 27
  Predicted no effect, actual no effect: 146±18.6 / 144±6.9 / 169
  Predicted no effect, actual effect: 17.5±5.3 / 39±3.4 / 43
  Overall error rate: 0.16±0.01 / 0.27±0.01 / 0.22
  Effect error rate: 0.19±0.01 / 0.37±0.02 / 0.26
  No effect error rate: 0.13±0.03 / 0.20±0.02 / 0.20

(B) Heterogeneous

Training: lysozyme, test: lac repressor:
  Predicted effect, actual effect: 851 / 858 / 551
  Predicted effect, actual no effect: 708 / 475 / 345
  Predicted no effect, actual no effect: 1107 / 1503 / 786
  Predicted no effect, actual effect: 365 / 467 / 182
  Overall error rate: 0.35 / 0.28 / 0.28
  Effect error rate: 0.45 / 0.36 / 0.39
  No effect error rate: 0.25 / 0.24 / 0.19

Training: lac repressor, test: lysozyme:
  Predicted effect, actual effect: 299 / 323 / 341
  Predicted effect, actual no effect: 176 / 186 / 166
  Predicted no effect, actual no effect: 483 / 1042 / 644
  Predicted no effect, actual effect: 232 / 439 / 328
  Overall error rate: 0.34 / 0.31 / 0.33
  Effect error rate: 0.37 / 0.36 / 0.33
  No effect error rate: 0.32 / 0.30 / 0.34

The upper rows give total prediction numbers and the lower rows give error rates. Data for decision trees and the probabilistic method taken from Tables 1 and 2 at confidence level threshold 0.5.

Table 4. Comparison of decision trees (C4.5, confidence level threshold 0.5) and SVM in mixed cross-validation. Entries are given as C4.5 / SVM; results using actual structure outside parentheses, predicted structure in parentheses.

  Predicted effect, actual effect: 135±4.1 (124±7.9) / 131±4 (62±4.5)
  Predicted effect, actual no effect: 44.5±3.1 (42.5±6.8) / 70±4.3 (29±6.6)
  Predicted no effect, actual no effect: 233±37.4 (157±43.6) / 251±6.9 (291±5.1)
  Predicted no effect, actual effect: 41.5±7.3 (37.5±4) / 76±5 (151±6.4)
  Overall error rate: 0.21±0.01 (0.23±0.02) / 0.28±0.01 (0.33±0.02)
  Effect error rate: 0.25±0.02 (0.25±0.03) / 0.36±0.01 (0.36±0.05)
  No effect error rate: 0.16±0.02 (0.18±0.04) / 0.23±0.02 (0.34±0.01)

Results using actual structure (outside parentheses) are compared to those using predicted structure (in parentheses). The upper rows give total prediction numbers and the lower rows give error rates. Results are given as median ± interquartile range.

Performance of SVMs

The results of our second learning method, SVMs, are given in detail in Tables 3 and 4. Here comparative results are taken from the 50% confidence threshold predictions of decision trees and the probabilistic method. It is not possible to provide confidence levels for SVM predictions (see System and Methods), but since SVMs provide a prediction for every data point, it is most meaningful to compare results with the other methods where they make the largest number of predictions, i.e. including all predictions from the 0.5 confidence level upwards.
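The three illustrative C4.5 rules quoted above can be read directly as a tiny rule-based classifier. The sketch below transcribes them; the attribute names follow the text (residue, mut_res, Obs_Acc, Cons_value), but the rule ordering and the fall-through default are our own assumptions, not part of the published rule set.

```python
# The three illustrative rules from the text as executable checks.
# Confidence values are the rule accuracies quoted in the article.

def classify(residue, mut_res, obs_acc, cons_value):
    # Rule 1: buried ('b') leucine mutated to proline tends to disrupt function
    if residue == "L" and mut_res == "P" and obs_acc == "b":
        return "effect", 0.902
    # Rule 2: buried glycine at a moderately conserved position
    if residue == "G" and obs_acc == "b" and 0.252 < cons_value <= 0.352:
        return "effect", 0.771
    # Rule 3: conservative alanine-to-glycine substitution
    if residue == "A" and mut_res == "G":
        return "no effect", 0.969
    # Assumed default: the real rule set has 89 rules; unmatched cases here
    # simply yield no prediction.
    return "no prediction", None
```

Note how the first two rules encode structural-stability reasoning (burial, conservation) rather than anything specific to a catalytic or binding residue, which is the point the text makes about them.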
The results of homogeneous cross-validation in Table 3A indicate that decision trees tend to have the lowest error rates. The performances of the SVM and probabilistic methods are similar, with the SVM performing better on lysozyme and the probabilistic method performing better on lac repressor. Prediction numbers vary from method to method but are broadly comparable. However, in heterogeneous cross-validation (Table 3B), it is clear that the decision tree method has higher error rates, particularly when the lysozyme data are used for training and lac repressor for testing (we earlier attributed this effect to the learning of protein-specific rules). In contrast with the decision tree, the SVM produces performance in heterogeneous cross-validation that is better than the probabilistic method. The error rates from the two methods are very similar, but at these rates the SVM is able to make significantly more successful predictions: with lysozyme training and lac repressor test the SVM makes 858 successful ‘effect’ predictions and 1503 successful ‘no effect’ predictions, to be compared with 551 and 786 for the probabilistic method. This indicates that the SVM is less susceptible than the decision tree to protein-specific effects in the small learning set associated with a single protein.

In Table 4, we give a comparison of decision trees and SVMs in mixed cross-validation using both actual and predicted (numbers in parentheses) secondary structure and solvent accessibility. In this test, the decision tree out-performs the SVM in both cases. It is also interesting to note that while the effect of predicted data on the decision tree is easy to understand (error rates remain broadly similar but prediction numbers fall), it is more complicated for the SVM. The ‘no effect’ error rate increases significantly (from 0.23 to 0.34) when predicted data are used. This is probably related to the fact that our decision tree predictions are limited to those with a confidence level of 0.5 or greater. There is no estimate of confidence for the SVM, and low confidence predictions (confidence level lower than 0.5) cannot be filtered out as they can with decision trees.

Difference in results depending on definitions of effect and no effect

The effect of a mutation on protein function takes a continuum of possible values, from complete abolition of function through degrees of loss of functional efficiency to no observable effect. Yet this type of study requires conversion of this data to a binary-valued ‘effect’ or ‘no effect’. For the process of training a machine learning method to work it is clearly important that this conversion process be approximately consistent, i.e. that it defines an equivalent level of functional effect between different proteins. In the heterogeneous cross-validation using lysozyme and lac repressor data we found no evidence of inconsistency, and this was reinforced by the similar proportion of ‘effect’ mutations in each data set (38–45%, see System and Methods for details). However, this was not the case with the much smaller data set from the HIV protease.

In Table 5, we show results obtained for predictions on the HIV data by methods trained on the combined lysozyme and lac repressor data. The numbers not in parentheses use a conversion analogous to that used for the training data, i.e. where any experimentally detected loss in activity is regarded as an effect (see Methods). With the first conversion, it is clear that both machine-learning methods produce very high ‘no effect’ error rates, in excess of 0.45 and close to the expected 0.5 for random predictions. Many of the mutations predicted to have no effect actually do have an effect according to this definition. This led us to suspect inconsistency of the conversion of experimental observations to binary values, and the observed high proportion of ‘effect’ mutations in the data set with this conversion (67%, much greater than that in the training data) was further evidence for this possibility. With this in mind, we re-assessed the HIV predictions using the alternative conversion, where only complete abolition of function was considered an effect (reducing the proportion of effect mutations in the HIV data to 47%, which is more consistent with the training data from lysozyme and lac repressor). This alternative conversion was applied to the HIV test data only and not to the training data, so there is no issue of unbalanced training. The results are shown in parentheses in Table 5. Changing to this conversion method clearly improves the ‘no effect’ error rate, but also significantly increases the error rate observed in ‘effect’ predictions. We interpret this to indicate that neither of these conversions is really consistent with the level of functional effects defined in the training data: the first defines too many minor functional changes as effects, and the second requires too great a functional change to define an effect.

Table 5. Prediction results for the HIV test set from methods trained on lysozyme and lac repressor data. Entries are given as C4.5 / SVM; numbers outside parentheses treat the HIV mutations as ‘effect’ if any effect on function was detected, those in parentheses require complete abolition of function for an ‘effect’.

  Predicted effect, actual effect: 173 (135) / 160 (140)
  Predicted effect, actual no effect: 42 (80) / 33 (94)
  Predicted no effect, actual no effect: 25 (37) / 78 (83)
  Predicted no effect, actual effect: 21 (9) / 65 (19)
  Overall error rate: 0.24 (0.34) / 0.29 (0.34)
  Effect error rate: 0.20 (0.37) / 0.17 (0.40)
  No effect error rate: 0.46 (0.20) / 0.45 (0.19)

The upper rows give total prediction numbers and the lower rows give error rates. Decision tree (C4.5) results use predictions from confidence level threshold 0.5.

Applications of methods to C.elegans SNPs

As an illustration, we have applied our methods to predictions of the functional effects of a set of 803 nsSNPs between two C.elegans strains. Using SVMs or decision trees trained on the combined lysozyme and lac repressor data resulted in the prediction that around 300 (37%) of these might affect protein function (see Discussion).

DISCUSSION

We have made a thorough study of the use of two machine learning methods to predict the functional effects of SNPs, and compared the results with those from an existing probabilistic method (Chasman and Adams, 2001). Our results suggest that the machine learning methods we use are competitive with the probabilistic method and perform significantly better in some circumstances. Decision trees are able to provide predictions with significantly lower error rates in homogeneous cross-validation, but seem to do less well in the more difficult and realistic test of heterogeneous cross-validation. However, in this more difficult test our results show that the SVM was able to perform at the same error rates as the probabilistic method, while out-performing it in providing a significantly greater number of predictions.

In comparison with the SVM, and also with the probabilistic method, we found that decision tree learning was more susceptible to learning protein-specific rules, resulting in very low error rates in homogeneous cross-validation, but significantly higher error rates in heterogeneous cross-validation. This might suggest that decision trees are not the method of choice for this problem. Nevertheless, decision trees do have advantages. First, they produce interpretable rules, and we have shown that these often make sense from a protein structure and stability perspective. Second, confidence levels can be derived for decision tree rules. Apart from the obvious utility of a confidence estimate to go with each prediction, we showed that these confidence estimates are generally very robust. When we moved from actual structural data to lower quality predicted data, this was recognized in the derivation of decision tree rules with reduced confidence. This effect was manifested as falling prediction numbers at each confidence level, while the observed error rates were maintained at approximately constant values. It is hardly surprising that decision trees learn protein-specific rules when faced with training data from a single protein, but this effect is clearly reduced when training is on data from more than one protein (see the Results for mixed cross-validation). In time, it is likely that suitable training sets for other proteins will become available, which should lead to the production of even higher quality decision trees.

The lack of confidence level estimates for SVM learning is a disadvantage of that method. It would seem likely that more confident SVM predictions would be those from data points located further from the optimal separating hyperplane, and that it should be possible to fit suitable probability distributions to the data to provide confidence estimates based on this distance. However, this theory is not well developed, and such calculations are not available in the software we used, or other commonly available software to our knowledge.

The inclusion of protein structure data in the attribute set has been discussed much in the literature (see Introduction). We find that the use of structural attributes like secondary structure, solvent accessibility and buried charge produces machine-learning methods that have lower error rates than those based on sequence features alone. It is likely that most mutations affecting protein function actually affect it indirectly through changes in structural stability, and therefore structural information should be valuable. An interesting further observation from decision tree learning is that rules predicting ‘no effect’ seem to have higher confidence levels on average. It would seem to be easier to predict if a mutation does not affect stability than to predict if it does.

It is important to appreciate that the way functional effects are defined can seriously affect predictions. For instance, it might be required to predict all observable effects on function in some applications, but just complete abolition of function in others. Methods are trained to predict a certain level of effect, and if applied to data sets where different levels of effect need to be predicted they will perform badly, as we illustrated with the HIV protease test data. It is very difficult to define equivalent levels of functional effect between two completely different proteins, and this highlights a general problem with methods of this type. However, the heterogeneous cross-validation results we report here, and also those from the probabilistic method, suggest that the definitions we adopted for lysozyme and the lac repressor (an enzyme and a regulatory protein) are approximately equivalent. Based on this observation, it would seem reasonable to accept that a definition of effect resulting in a similar proportion of ‘effect’ mutations in unbiased mutation data sets for two proteins would indicate approximately equivalent definitions. However, application of this rule in the case of the HIV data did not produce a conclusive answer. More systematic mutation data sets with functional effects defined for different proteins are now required to assess the generalizability of both definitions and prediction methods.

This leads to the question of what level of effect should be predicted. The methods reported here were trained to predict any observable effect on protein function. In application to the C.elegans SNP data this leads to the prediction that 37% of nsSNPs might affect protein function. This number is quite large, given that the SNPs in this case are between two healthy strains, but it is not out of line with estimates made using other methods applied to human SNP data (e.g. Chasman and Adams, 2001; Sunyaev et al., 2001). The degree of effect on protein function needed to produce observable phenotypic consequences or diseases will vary from gene to gene, but one possible explanation of these relatively large numbers is that some of the protein functional consequences we predicted are too minor to cause major phenotypic effects. However, it is not possible to rule out errors in the SNP data or the annotation of the C.elegans genome as alternative explanations.

In conclusion, we have shown that machine-learning methods can make a useful contribution to SNP prediction problems, and compete well with currently available methods. The generalization capability of the SVM is clearly a great advantage, but we have shown that decision trees too have significant advantages. A clear limitation of this study is the availability of only two really systematic and extensive mutation data sets for different proteins, but as more become available the power of all learning methods is sure to increase.

ACKNOWLEDGEMENTS

We thank Ian Hope, Matthew Woodwark and Cary O'Donnell for valuable discussions. V.G.K. would like to thank the ORS (Overseas Research Students) awards scheme, Tetley Lupton and AstraZeneca Healthcare for support.

REFERENCES

Alber,T., Sun,D.P., Nye,J.A., Muchmore,D.C. and Matthews,B.W. (1987) Temperature-sensitive mutations of bacteriophage T4 lysozyme occur at sites with low mobility and low solvent accessibility in the folded protein. Biochemistry, 26, 3754–3758.
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
Chakravarti,A. (2001) To a future of genetic medicine. Nature, 409, 822–823.
Chasman,D. and Adams,R.M. (2001) Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol., 307, 683–706.
Cristianini,N. and Shawe-Taylor,J. (2000) An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK.
Eisenberg,D., Weiss,R.M. and Terwilliger,T.C. (1984) The hydrophobic moment detects periodicity in protein hydrophobicity. Proc. Natl Acad. Sci. USA, 81, 140–144.
Hunt,E.B., Martin,J. and Stone,P.J. (1966) Experiments in Induction. Academic Press, New York.
Licinio,J. and Wong,M. (2002) Pharmacogenomics. Wiley-VCH Verlag GmbH, Weinheim, Germany.
Loeb,D.D., Swanstrom,R., Everitt,L., Manchester,M., Stamper,S.E. and Hutchison,C.A.,III (1989) Complete mutagenesis of the HIV-1 protease. Nature, 340, 397–400.
Markiewicz,P., Kleina,L.G., Cruz,C., Ehret,S. and Miller,J.H. (1994) Genetic studies of the lac repressor. XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as 'spacers' which do not require a specific sequence. J. Mol. Biol., 240, 421–433.
Mitchell,T.M. (1997) Machine Learning. McGraw-Hill, US.
Ng,P.C. and Henikoff,S. (2001) Predicting deleterious amino acid substitutions. Genome Res., 11, 863–874.
Pearson,W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol., 183, 63–98.
Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
Quinlan,J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
Ramensky,V., Bork,P. and Sunyaev,S. (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res., 30, 3894–3900.
Rennell,D., Bouvier,S.E., Hardy,L.W. and Poteete,A.R. (1991) Systematic mutation of bacteriophage T4 lysozyme. J. Mol. Biol., 222, 67–88.
Rost,B. and Sander,C. (1993) Improved prediction of protein secondary structure by use of sequence profiles and neural networks. Proc. Natl Acad. Sci. USA, 90, 7558–7562.
Rost,B. and Sander,C. (1994) Conservation and prediction of solvent accessibility in protein families. Proteins, 20, 216–226.
Sander,C. and Schneider,R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68.
Saunders,C.T. and Baker,D. (2002) Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J. Mol. Biol., 322, 891–901.
Sherry,S.T., Ward,M.H., Kholodov,M., Baker,J., Phan,L., Smigielski,E.M. and Sirotkin,K. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.
Suckow,J., Markiewicz,P., Kleina,L.G., Miller,J., Kisters-Woike,B. and Muller-Hill,B. (1996) Genetic studies of the Lac repressor. XV: 4000 single amino acid substitutions and analysis of the resulting phenotypes on the basis of the protein structure. J. Mol. Biol., 261, 509–523.
Sunyaev,S., Ramensky,V., Koch,I., Lathe,W.,III, Kondrashov,A.S. and Bork,P. (2001) Prediction of deleterious human alleles. Hum. Mol. Genet., 10, 591–597.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Valdar,W.S. and Thornton,J.M. (2001a) Conservation helps to identify biologically relevant crystal contacts. J. Mol. Biol., 313, 399–416.
Valdar,W.S. and Thornton,J.M. (2001b) Protein–protein interfaces: analysis of amino acid conservation in homodimers. Proteins, 42, 108–124.
Vapnik,V. (1998) Statistical Learning Theory. John Wiley and Sons, Inc., New York.
Wang,Z. and Moult,J. (2001) SNPs, protein structure, and disease. Hum. Mutat., 17, 263–270.
Wicks,S.R., Yeh,R.T., Gish,W.R., Waterston,R.H. and Plasterk,R.H. (2001) Rapid gene mapping in Caenorhabditis elegans using a high density polymorphism map. Nat. Genet., 28, 160–164.
Witten,I. and Frank,E. (2000) Data Mining. Morgan Kaufmann, Academic Press, USA.
Bioinformatics – Oxford University Press
Published: Nov 22, 2003