Deleterious SNP prediction: be mindful of your training data!

Matthew A. Care; Chris J. Needham; Andrew J. Bulpitt; David R. Westhead

doi:10.1093/bioinformatics/btl649

Bioinformatics, Vol. 23 no. 6 2007, pages 664–672
ORIGINAL PAPER
Genome analysis

Matthew A. Care (1), Chris J. Needham (2), Andrew J. Bulpitt (2) and David R. Westhead (1,*)

(1) Institute of Molecular and Cellular Biology, University of Leeds, Leeds, LS2 9JT, UK and (2) School of Computing, University of Leeds, Leeds, LS2 9JT, UK

Received on September 29, 2006; revised on November 22, 2006; accepted on December 18, 2006
Advance Access publication January 18, 2007
Associate Editor: Dmitrij Frishman

*To whom correspondence should be addressed.
The Author 2007. Published by Oxford University Press. All rights reserved.

ABSTRACT

Motivation: To predict which of the vast number of human single nucleotide polymorphisms (SNPs) are deleterious to gene function or likely to be disease associated is an important problem, and many methods have been reported in the literature. All methods require data sets of mutations classified as ‘deleterious’ or ‘neutral’ for training and/or validation. While different workers have used different data sets there has been no study of which is best. Here, the three most commonly used data sets are analysed. We examine their contents and relate this to classifiers, with the aims of revealing the strengths and pitfalls of each data set, and recommending a best approach for future studies.

Results: The data sets examined are shown to be substantially different in content, particularly with regard to amino acid substitutions, reflecting the different ways in which they are derived. This leads to differences in classifiers and reveals some serious pitfalls of some data sets, making them less than ideal for non-synonymous SNP prediction.

Availability: Software is available on request from the authors.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation, accounting for approximately 90% of the DNA polymorphism in humans (Collins et al., 1998). It is estimated that there is a SNP of >1% frequency for every 290 base-pairs (Kruglyak and Nickerson, 2001). Within coding regions there are on average four SNPs per gene with a frequency above 1%. About half of these cause amino acid substitutions: termed non-synonymous SNPs (nsSNPs) (Cargill et al., 1999).

Deleterious SNP prediction tries to ascertain if an nsSNP will affect a protein’s function and possibly contribute to genetic disease. Methods in the existing literature have used a large range of structure- and sequence-based attributes to separate deleterious from neutral SNPs (see supplementary Tables 1 and 2 for information). Structural attributes provide more understanding of effect mechanisms, but are not available for all SNPs. Sequence attributes usually identify important residues using information from homologous proteins. With enough homologues (10), sequence attributes can often compete effectively with structural approaches (Bao and Cui, 2005; Saunders and Baker, 2002; Yue and Moult, 2006).

Efforts to classify SNPs have used these attributes in a variety of prediction methods from sets of empirical rules (Herrgard et al., 2003; Ng and Henikoff, 2001; Ramensky et al., 2002; Sunyaev et al., 2001; Wang and Moult, 2001), probabilistic prediction (Chasman and Adams, 2001) to a variety of machine-learning techniques including decision trees (DT) (Dobson et al., 2006; Krishnan and Westhead, 2003), support vector machines (Bao and Cui, 2005; Krishnan and Westhead, 2003; Yue et al., 2005; Yue and Moult, 2006), neural networks (Ferrer-Costa et al., 2004, 2005), Bayesian networks (Cai et al., 2004; Needham et al., 2006), random forests (Bao and Cui, 2005) and Bayesian multivariate adaptive regression splines (Verzilli et al., 2005). Although these different approaches derive prediction rules in a variety of ways they almost all require a data set of classified mutations for both model building (training) and error rate estimation (validation).

For machine-learning methods to generalize well to target data, it is imperative that the right training data is chosen; the training and validation data should be drawn from the same (usually unknown) distribution as the target data. However, this is not easy to arrange for the problem concerned, and a number of very different data sets have been employed. Some workers have used deleterious and neutral nsSNPs data based on systematic mutation studies on particular proteins (Cai et al., 2004; Chasman and Adams, 2001; Krishnan and Westhead, 2003; Ng and Henikoff, 2001; Verzilli et al., 2005; Wang and Moult, 2001). Others have used annotated disease variants from protein sequence databases as deleterious data, and have generated neutral data sets either from annotated sequence variants not known to be associated with disease (Bao and Cui, 2005), or by using pseudo mutations between orthologous proteins in closely related species (Ferrer-Costa et al., 2002; Ferrer-Costa et al., 2004, 2005; Ramensky et al., 2002; Sunyaev et al., 2001; Yue et al., 2005; Yue and Moult, 2006). These approaches yield data sets that are different in content and character, with different properties when used to train machine-learning methods, and give rise to classifiers with varying error rates.
This article is the first attempt to quantify the aforementioned effects. We begin by comparing the contents of the data sets to what might be expected in the target prediction data (real human SNPs). Then, using a simple decision tree method to produce easily interpretable classifiers, we study the relationships between data set, classifiers and estimated accuracy. Finally, we quantify the transferability of classifiers between data sets, thus quantifying the effect of, for instance, training a method on systematic mutagenesis data and applying it to human SNPs. This leads to detailed understanding of the advantages and potential pitfalls of each data set in training and validating nsSNP prediction methods.

2 SYSTEMS AND METHODS

2.1 Decision trees

Decision trees (DT) are predictive models, displayed as a top down tree structure. Every node in the tree represents a decision point, where a test is carried out upon an attribute. For every possible outcome of the test there will be a child node, until the final decision node is reached, which branches to a set of leaf nodes giving the final classification. Here, we used the ‘yet another decision tree’ (YaDT) algorithm (Ruggieri, 2004) for constructing trees, using default parameters (confidence cut-off of 0.5; accepting all predictions) and with no optimization.

2.2 Evaluation of accuracy

All experiments were carried out multiple times (see cross-validation section) with balanced data sets using evaluation measures including the overall error (OE) [(FP + FN)/(TP + FP + TN + FN)] (where TP = true positive, TN = true negative, FP = false positive and FN = false negative), the false positive rate [FPR = FP/(TN + FP)] and the false negative rate [FNR = FN/(TP + FN)].

2.3 Data sets

Three different types of data sets were used for deleterious SNP prediction, shown in Table 1 (see supplementary Table 3 for information on data set usage in other studies, and Table 4 for detailed information on data sets used here):

Table 1. Data sets. Showing the origin of the data sets along with the names assigned to each

Data set type    Deleterious    Neutral
Mutagenesis      Lac/Lyso       Lac/Lyso
Swiss-Prot       Disease        Polymorphism
Divergent        –              NeutralAH/NeutralBH

2.3.1 Mutagenesis data sets

The mutagenesis data sets consist of systematic unbiased mutations of the T4 lysozyme (Alber et al., 1987; Rennell et al., 1991) and lac repressor (Markiewicz et al., 1994; Suckow et al., 1996) proteins. The subset of mutations used here (from Krishnan and Westhead, 2003) has 1990 mutations for the T4 lysozyme and 3303 for the lac repressor protein. The original mutagenesis experiments classified each mutation into four effect categories which were reduced to a binary classification by Chasman and Adams (2001), yielding data sets with 40 and 38% of deleterious mutations for the lac and lysozyme, respectively.

2.3.2 Swiss-Prot data set

Another type of data set used for deleterious SNP prediction is derived from the Swiss-Prot variant webpage (Yip et al., 2004). Approximately 20% of the human proteins contained in the Swiss-Prot knowledgebase have one or more single amino acid polymorphism (SAP) (Boeckmann et al., 2003). Each SAP is manually annotated in the feature table of the Swiss-Prot variant database with the label ‘disease’ (SAP with disease association), ‘polymorphism’ (SAP with no known disease association) or ‘unclassified’ (SAP which has too little information to classify). Parsing this data gave a total of 12911 disease SAPs on 1055 proteins and 8302 polymorphism SAPs (deemed neutral) on 3388 proteins.

2.3.3 Divergent data set

An alternative source of neutral SAPs is the divergence data set, created by noting the changes between human proteins and their related mammalian orthologs. It is assumed that almost all of the variation fixed between closely related species is non-deleterious. There is a variation in the exact method used to create a divergent data set. Some research groups accept proteins with >90% sequence identity (SI) and >80% coverage allowing all matches per species (Yue and Moult, 2006), whilst others only accept >95% SI over 100% coverage and to avoid paralogs only use the best match per species (Sunyaev et al., 2001).

As is the normal practice, the proteins containing disease SAPs (Swiss-Prot ‘disease’) were used to generate a divergence data set. Each protein was searched against the NCBI non-redundant (NR) database using BLASTP (Altschul et al., 1997). All non-mammalian matches were discarded and the remaining matches processed using two different methods. For both methods each match was aligned with its corresponding disease protein and all amino acid differences were noted along with the SI of the alignment. This resulted in a set of pseudo mutations separated into SI categories from 30% to 95% SI. Furthermore, one of the methods used all of the mammalian matches (neutralAH) generated by BLASTP whilst the other only used the best match per mammalian species (neutralBH, as Sunyaev et al., 2001) to avoid possible paralogs.

2.4 Attributes

To allow for predictions to be made on all available SNPs a set of attributes was selected that could be generated without any requirement for structural information:

(1) Original and mutated amino acid residue identity
(2) Original and mutated amino acid physicochemical class (Hydrophobic, Polar, Charged, Glycine)
(3) Hydrophobicity difference between original and mutated residues
(4) Mass shift upon mutation
(5) Predicted secondary structure at mutation site: (Loop, Helix, Strand)
(6) Predicted solvent accessibility at mutation site: (0→9; buried→exposed)
(7) Scorecons value: sequence conservation score at mutation site: (0→1; not→fully conserved)
(8) Buried charge at mutation site: (Residue is one of K, R, D, E, H and has an accessibility of 0 or 1)
(9) Position specific scoring matrix (PSSM) value for amino acid substitution
(10) Log-odds score of amino acid substitution

Attributes 1–8 are the same as those used by Krishnan and Westhead (2003), with the exception that only predicted secondary-structure and solvent accessibility were used and these were generated using the Sable program (Adamczak et al., 2004) rather than PHD (Rost and Sander, 1993).

The attributes were generated as follows: Each protein sequence was submitted to Sable for secondary-structure and solvent accessibility prediction. Sable carries out a PSIBLAST (Altschul et al., 1997) search against the NCBI NR database with 3 iterations. The resultant alignment profile and PSSM were retained for later use. The proteins in the PSIBLAST alignment profile with E-score values <10 were pulled out of the NR database using fastacmd and then aligned with the human query protein using Muscle (Edgar, 2004). The produced multiple alignment was submitted to Scorecons (Valdar, 2002) to calculate the sequence conservation. The log-odds score was calculated as the log ratio of amino acid substitution probabilities in the neutral and deleterious data sets, respectively.

2.5 Cross-validation and data set randomization

All data sets were sampled to give an equal number of positive and negative examples, as it has been shown that balanced data sets give the best accuracy with decision trees (Dobson et al., 2006).

For the homogeneous cross-validation experiments (training and validation data drawn from the same data set) 4000 SAPs were randomly sampled from each data set 10 times (e.g. 4000 deleterious and 4000 neutral) and used to carry out 10-fold cross-validation. To remove any possible training bias multiple SAPs on a given protein were not split between training and testing sections. In addition, the level of homologous proteins within the training data is too low to cause bias (data not shown).

Heterogeneous cross-validation involves using one data set for training and another for testing. For some of the experiments part of the training set was from the same data set type as the test set (e.g. train on disease/polymorphism, test on disease/divergent) and, therefore, the data sets had to be split into training and test groups. Thus, 4000 SAPs were randomly sampled 10 times from each data set and then split into training and test parts, as mentioned earlier in the article. This gave training and test data sets of 4000 mutations (2000 deleterious and 2000 neutral).

The exception to the above regards the experiments using the mutagenesis data sets which owing to their limited size had to be sampled differently. These data sets were each initially split into the two classes of mutation (lac: 1325 ‘deleterious’, 1978 ‘neutral’; lysozyme: 762 ‘deleterious’, 1228 ‘neutral’). From these 762 mutations were randomly sampled 10 times from each part (maximum size limited by lysozyme ‘deleterious’). These samples were then used to carry out 10-fold homogeneous cross-validation, with training and testing sizes of 1372 and 152 mutations, respectively. In addition, the lac and lysozyme data sets were merged to make a combined mutagenesis data set containing 3048 mutations per sample.

2.6 Construction of HEAT matrix

A matrix of human expected amino acid transitions (HEAT) was constructed, consisting of the expected rates of amino acid substitutions in human protein coding genes, in the absence of selection. It was constructed in a similar fashion to Vitkup et al. (2003) using a matrix of neighbour-dependent substitution rates (Hess et al., 1994). These rates were generated by aligning 10 Mb of human gene-pseudogene pairs, resulting in 20 200 pseudo mutations. From this the relative substitution rates (X→Y) were calculated for the four nucleotide bases (X,Y) starting in all possible 3 nucleotide neighbourhoods (*X*), giving a matrix of 96 neighbourhood dependent substitution rates ([12 × 16]/2; 12 possible substitutions in 16 possible 3 base contexts with data aggregated for complementary substitutions) with a 65-fold variation of relative rates.

Here, we used this matrix of relative rates along with calculated average human-codon-usage to calculate the expected rates of all the amino acid substitutions resulting from single nucleotide mutations (SNM). The resulting HEAT matrix, shown in Figure 1a, is based on the average codon usage across all known coding sequences in the human genome and, thus, is more general than those produced by Vitkup (2003), which were created from a smaller sample of genes.

3 RESULTS

3.1 Data set comparisons

Single nucleotide mutations (SNM) within codons can give rise to 150 possible amino acid substitutions (see Fig. 1a for relative rates in humans). The remaining 230 amino acid substitutions require multiple nucleotide mutations (MNM) to occur within a codon. Figure 2 shows the percentage of amino acid substitutions in each data set that result from MNM. The two mutagenesis data sets have a very high percentage of MNM (Lac = 57%, Lyso = 59%). The Swiss-Prot data sets, in contrast, have almost no MNM with the disease and polymorphism having only 0.2 and 0.1%, respectively. The divergent ‘neutral’ data sets have 5–40% MNM, depending on the SI threshold, a lower level than the mutagenesis data sets but still far greater than observed in the Swiss-Prot data sets.

Even with this rudimentary data set analysis, it becomes apparent that a large percentage of the mutations in the mutagenesis data sets (Lac/Lyso), and a significant proportion in the divergent data sets (neutralAH/BH), are very unlikely to be observed in the short evolutionary distance associated with real human mutations. One possible result of this is that irrelevant rules will be generated by learning methods, with significant effects on prediction accuracy (see later sections).

A more sophisticated method for comparing data sets is to observe their relative content of amino acid substitutions. Here we compare the amino acid substitution rates in each data set using HEAT as a reference distribution. Thus, the Swiss-Prot data set, consisting of human SAPs, would be expected to be similar to this distribution, with any significant differences attributable to natural selection within the human population. The divergent data sets, consisting of pseudo-mutations between man and related mammals, are also likely to be similar, but with the deviation from HEAT attributable to longer evolutionary distances.

HEAT is shown in Figure 1a, displaying the expected relative rates of amino acid substitutions. The amino acids are arranged by similarity, so that substitutions lying close to the leading (top-left-to-bottom-right) diagonal are between chemically similar amino acids. The non-uniform nature of the matrix is due to a variety of factors; the differing number of codons for each amino acid, the codon usage pattern in the human genome and the high rate of mutations caused by CpG deamination in certain codons, notably Arg in which four of the six codons contain CpG sites.

The HEAT matrix only contains amino acid substitutions resulting from SNM and has no values for the other less likely amino acid substitutions (from MNM; marked by a separate symbol in Fig. 1). Owing to the number of MNM present in the mutagenesis data sets, they were set aside for this analysis. For the other data sets count matrices were created with the counts of all amino acid substitutions resulting from SNM. Then, to see which substitutions were over/under represented in each data set, the log-odds score was calculated for each amino acid substitution [log (P(data set substitution)/P(HEAT substitution))]. In addition the Pearson’s correlation between each data set and the HEAT matrix was calculated and is given along with an estimate of significance. The results for three of the data sets are shown in Figure 1.

Fig. 1. Deviation of data sets from expected mutations. (a) Human expected amino acid transitions (HEAT); displaying the expected relative rates of amino acid substitutions under no selection pressure (intensity of greyscale depicting expected rate of amino acid substitution; a separate symbol marks substitutions requiring multiple nucleotide mutations, which are not present in HEAT). (b, c, d) Deviation of data sets from expected (HEAT); blue = under-represented, red = over-represented. (b) Swiss-Prot annotated ‘disease’. (c) Swiss-Prot annotated ‘polymorphism’. (d) Divergent neutralBH SI90.

For the Swiss-Prot disease data set (Fig. 1b; R = 0.81, P < 0.0001) the squares lying close to the leading diagonal, displaying substitutions between chemically similar amino acids, are under-represented.
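The comparison against HEAT described in this section (a log-odds score per substitution plus a Pearson correlation) can be sketched roughly as follows. This is a minimal illustration and not the authors' software: the function names, the use of the natural logarithm, the choice to skip substitutions with zero observed count, and the toy numbers are all assumptions made for the example.

```python
import math


def to_probabilities(values):
    """Scale non-negative counts or rates so that they sum to 1."""
    total = float(sum(values.values()))
    return {key: value / total for key, value in values.items()}


def log_odds_vs_reference(observed_counts, reference_rates):
    """Return log(P(data set substitution) / P(reference substitution)).

    Keys are (original, mutated) residue pairs. Only substitutions present
    in the reference are scored, mirroring the restriction to SNM
    substitutions. Pairs with a zero observed count are skipped because the
    log is undefined there (an assumption of this sketch).
    """
    observed = to_probabilities(observed_counts)
    reference = to_probabilities(reference_rates)
    scores = {}
    for substitution, p_ref in reference.items():
        p_obs = observed.get(substitution, 0.0)
        if p_obs > 0.0 and p_ref > 0.0:
            scores[substitution] = math.log(p_obs / p_ref)
    return scores


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy numbers, invented purely for illustration (not values from the paper).
toy_reference = {("R", "K"): 2.0, ("R", "C"): 1.0, ("C", "Y"): 1.0}
toy_counts = {("R", "K"): 1, ("R", "C"): 2, ("C", "Y"): 1}

toy_scores = log_odds_vs_reference(toy_counts, toy_reference)
# Positive score: over-represented relative to the reference;
# negative score: under-represented.

shared = sorted(toy_reference)
obs_p = to_probabilities(toy_counts)
ref_p = to_probabilities(toy_reference)
toy_r = pearson_r([obs_p[s] for s in shared], [ref_p[s] for s in shared])
```

In this sketch a positive score corresponds to a red (over-represented) square in Figure 1 and a negative score to a blue (under-represented) one. The significance estimate reported alongside R is not reproduced here.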
Squares not lying close to this diagonal, that is substitutions between chemically dissimilar amino acids, are over-represented. For the Swiss-Prot polymorphism data set (Fig. 1c; R = 0.91, P < 0.0001) there is the opposite trend, with the squares close to the leading diagonal over-represented, or near to expected, and the squares not lying close to the leading diagonal under-represented. As expected this data set is more strongly correlated with HEAT than the disease mutations.

Fig. 2. The percentage of multiple nucleotide mutations present in each data set. Mutagenesis (Lac/Lyso), Swiss-Prot (disease/polymorphism) and divergent (neutralBH/neutralAH; sequence identity cut-off 30%–95%). [Bar chart; x-axis: data set; y-axis: percentage (%) multiple nucleotide mutations.]

In the Swiss-Prot disease data set (Fig. 1b) the most over-represented substitutions are from the amino acids Cys, Gly, Trp, Arg and Tyr; this agrees with the findings of Vitkup et al. (2003). The differences seen in the data sets are mainly governed by the types of substitutions an amino acid can undergo by SNM. In some cases, such as Cys and Trp, the substitutions resulting from SNM are all disfavoured, while for others, such as Gly, the substitutions resulting from SNM are a mixture of favoured, neutral and disfavoured. Thus, substitutions from Cys and Trp are very likely to be deleterious, not only because these amino acids play important structural roles but also because their likely substitutions are all disfavoured.

Most of the substitutions that are over-represented in the disease data set are under-represented in the polymorphism data set (Fig. 1c) with Cys and Tyr having the strongest divergence from HEAT, suggesting that even the relatively simple attribute of amino acid substitution would separate these data sets to some extent.

The divergent (neutralBH90) data set (Fig. 1d; R = 0.74, P < 0.0001) is similar to the polymorphism, except that the divergence from HEAT is generally greater, owing to longer evolutionary distances, particularly for example with substitutions from Cys and Trp. A notable exception is Arg, which is slightly over-represented in the polymorphism data set and strongly under-represented in the divergent data set. In this case, substitutions over short evolutionary distances are strongly influenced by the coding sequence. As the evolutionary distance increases, selection begins to reflect the constraints imposed by protein structural stability and function, favouring substitutions between amino acids with similar chemical properties (the only over-represented substitution in this latter case is Arg to the related basic amino acid Lys) (Benner et al., 1994).

Overall, the comparison with the HEAT matrix has highlighted differences between the data sets, showing the potential to discriminate deleterious from neutral using only the parameter of amino acid substitution. It has also emphasized significant differences in the distribution of amino acid substitutions present in the Swiss-Prot polymorphism and divergent data sets. The polymorphism data set has a high level of correlation with the HEAT matrix (R = 0.91, P < 0.0001), while the divergent data set’s correlation (R = 0.74, P < 0.0001) is actually less than that of Swiss-Prot disease (R = 0.81, P < 0.0001).

This has important consequences for machine-learning methods: rules learned using the divergent data sets for neutral data are likely to give accurate rules to separate deleterious from neutral. However, there is a danger that the basis of these rules would be simply the differing evolutionary distances for the mutations in the Swiss-Prot disease set compared with the divergent data sets. Such rules may be of little use in distinguishing human disease mutations from neutral mutations occurring on the same evolutionary time scale.

3.2 Decision tree homogeneous cross-validation

The results from homogeneous 10-fold cross-validation are shown in Table 2 (see Systems and Methods for information on accuracy measures). The results are split according to the attributes used for prediction, giving a comparison of accuracy using ‘All’ of the attributes with that from some important individual attributes used independently (‘PSSM only’, position specific scoring matrix value; ‘amino acid substitution only’, amino acid substitution; ‘Scorecons only’, conservation value).

Table 2. Homogeneous cross-validation. Showing the average false positive rate (FPR), false negative rate (FNR) and overall-error (OE) for 10-fold cross-validation trained on ‘All’ attributes, ‘PSSM only’, ‘amino acid substitution only’ and ‘Scorecons only’

All attributes
Data set        FPR    FNR    OE(%)   MCC
Lac             0.28   0.20   24.36   0.51
Lyso            0.23   0.29   26.28   0.48
Lac Lyso        0.30   0.29   30.05   0.40
DiseasePoly     0.26   0.31   28.42   0.43
DiseaseNAH90    0.20   0.21   20.41   0.59
DiseaseNBH90    0.20   0.20   19.88   0.60

PSSM only
Data set        FPR    FNR    OE(%)   MCC
Lac             0.25   0.43   34.19   0.32
Lyso            0.31   0.40   35.93   0.28
Lac Lyso        0.33   0.39   35.85   0.28
DiseasePoly     0.23   0.41   32.22   0.36
DiseaseNAH90    0.27   0.27   27.05   0.46
DiseaseNBH90    0.25   0.27   26.17   0.48

Amino acid substitution only
Data set        FPR    FNR    OE(%)   MCC
Lac             0.44   0.37   41.46   0.17
Lyso            0.30   0.33   32.18   0.36
Lac Lyso        0.35   0.42   39.24   0.22
DiseasePoly     0.32   0.42   36.71   0.27
DiseaseNAH90    0.26   0.27   26.55   0.47
DiseaseNBH90    0.25   0.27   26.06   0.48

Scorecons only
Data set        FPR    FNR    OE(%)   MCC
Lac             0.27   0.31   29.27   0.41
Lyso            0.12   0.61   36.51   0.31
Lac Lyso        0.27   0.40   34.01   0.32
DiseasePoly     0.23   0.48   35.56   0.26
DiseaseNAH90    0.32   0.37   34.61   0.31
DiseaseNBH90    0.28   0.42   35.25   0.30

When ‘All’ attributes are used for prediction the overall-error (OE) ranges from 19.88% to 30.05% across the different data sets, showing that even under homogeneous cross-validation some data sets are far easier to classify than others. This range of accuracy across data sets is greatly influenced by the level of distinction between the ‘deleterious’/‘neutral’ parts of each data set. The comparison made with the HEAT matrix showed that the substitutions in the divergent data sets (neutralAH/BH) deviate further from HEAT than those in the Swiss-Prot polymorphism data set, explaining the greater prediction accuracy when using the former (20% OE) compared with the latter (28.42% OE) as neutral data. The divergent data sets are also easier to separate from disease due to their MNM (Fig. 2), which are almost completely absent in the Swiss-Prot (polymorphism/disease) data sets. These are effectively ‘easy’ predictions as the DT can correctly classify 10% of the divergent (SI90) data set on these alone.

‘PSSM only’ encodes position specific evolutionary information and leads to OE ranging from 26.17% to 35.93%, with the T4 lysozyme proving the hardest to classify and the diseaseDivergent the easiest. The OE for the ‘amino acid substitution only’ attribute ranges from 26.06% to 41.46%, a larger range than for PSSM, yet for the diseaseDivergent data sets the ‘amino acid substitution only’ is the most accurate single attribute. ‘Scorecons only’ is an alternative measure of position specific evolutionary information and gives rise to overall errors in the range 29.27%–36.51%. By contrast with the PSSM, Scorecons encodes only conservation while the PSSM contains information on the likelihood of specific amino acid substitutions, yet this only makes a significant difference to the OE in the cases of the diseaseDivergent data sets.

The clear interpretation emerging from these observations, and the previous data set analysis, is that the simplest attribute, amino acid substitution, contains useful information to separate all data sets, and is highly predictive when the divergent data set is used for negative data [diseaseN(A/B)H in Table 2]. Nevertheless this should be viewed with substantial caution, since, as previously noted, the effect may be due not so much to distinguishing deleterious from neutral mutations as to distinguishing data sets that differ in content of amino acid substitutions, owing to variations in evolutionary distance and systematic mutation.

Figure 3 shows the effect upon overall accuracy of homogeneous cross-validation (disease/divergent) when changing the minimum SI level for accepting homologs in the divergent data sets (neutralAH/BH). Using ‘All’ attributes produces the highest accuracy, followed by ‘amino acid substitution only’ and then ‘PSSM only’. Again these results are strongly influenced by data set content. As the SI level increases the level of MNM (Fig. 2) decreases, and the divergent data sets share more similarity with the Swiss-Prot disease data set, thus increasing the observed error rate for the ‘amino acid substitution only’ attribute. In addition the ‘amino acid substitution only’ attribute is little affected by the method used to create the divergent data set. In contrast, the ‘PSSM only’ attribute is strongly affected, with a larger variation in the OE for the neutralAH (4.06%) compared with the neutralBH (1.45%).

Fig. 3. Effect of divergent data set’s minimum sequence identity on overall error. Displaying homogeneous cross-validation overall error percent for ‘All’ attributes, ‘amino acid substitution only’ and ‘PSSM only’ with increasing sequence identity cut-off; data set disease/divergent.

To optimize the generation of divergent data sets, we note that with increasing SI levels the discrepancy between the errors for the two divergent data sets diminishes, but the neutralBH gives consistently lower error rates. With ‘All’ attributes, lower apparent error rates result from data sets at lower SI thresholds. However, this is again misleading. It is unlikely that these data sets give better SNP classification methods; rather, the lower error rates are artefacts caused by different data set contents in terms of amino acid substitutions. The effect is clear with ‘amino acid substitution only’, but when ‘PSSM only’ is used the trend of error rate increasing with SI disappears. Both ‘amino acid substitution only’ and ‘PSSM only’ show a small increase in error between SI values of 90 and 95%, suggesting that if these data sets were to be used to train methods, a 90% cut-off would be preferred. This may be caused by limited data available at very high sequence identity.

3.3 Heterogeneous cross-validation

Heterogeneous cross-validation measures the ability of a classifier trained on one data set to predict on another. Table 3 shows the results for heterogeneous and homogeneous (on the diagonal) cross-validation for a selection of attributes, along with their corresponding average error rates per attribute type. In addition, for each attribute the average deviation from the homogeneous overall-error is shown, indicating how well the homogeneous cross-validation gauges predictive ability on other data sets.

Table 3. Heterogeneous cross-validation. Showing the average homogeneous (on diagonal) and heterogeneous overall-error (OE) for data sets trained and tested on ‘All’ attributes, ‘PSSM only’, ‘amino acid substitution only’ and ‘Scorecons only’. Rows give the training data set, columns the test data set.

All attributes
Train / Test     disPoly   disNBH90   LacLyso
DiseasePoly      28.42     24.45      42.13
DiseaseNBH90     30.82     19.88      45.27
Lac Lyso         37.99     35.66      30.05
Homogeneous average OE (a): 26.12 ± 4.46
Heterogeneous average OE (b): 36.05 ± 6.93
Av. deviation from homogeneous OE (c): 9.94 ± 8.85

PSSM only
Train / Test     disPoly   disNBH90   LacLyso
DiseasePoly      32.22     27.20      37.02
DiseaseNBH90     33.32     26.17      40.15
Lac Lyso         36.33     32.55      35.85
Homogeneous average OE (a): 31.41 ± 4.00
Heterogeneous average OE (b): 34.43 ± 4.08
Av. deviation from homogeneous OE (c): 3.01 ± 6.47

Amino acid substitution only
Train / Test     disPoly   disNBH90   LacLyso
DiseasePoly      36.71     32.61      43.73
DiseaseNBH90     37.66     26.06      44.52
Lac Lyso         45.78     42.90      39.24
Homogeneous average OE (a): 34.00 ± 5.71
Heterogeneous average OE (b): 41.20 ± 4.61
Av. deviation from homogeneous OE (c): 7.20 ± 6.92

Scorecons only
Train / Test     disPoly   disNBH90   LacLyso
DiseasePoly      35.56     35.24      41.70
DiseaseNBH90     38.34     35.25      43.49
Lac Lyso         41.00     39.04      34.01
Homogeneous average OE (a): 34.94 ± 0.67
Heterogeneous average OE (b): 39.80 ± 2.65
Av. deviation from homogeneous OE (c): 4.86 ± 2.82

(a) The average homogeneous OE for each attribute type. (b) The average heterogeneous OE for each attribute type. (c) The average deviation of heterogeneous OE from the predicted (homogeneous) attribute accuracy.

First, considering ‘All’ attributes it is clear that error rates in the heterogeneous cross-validation are generally significantly higher than the corresponding values for the homogeneous case. This would be expected, but is an important effect in a field where training data can be substantially different to the final target data for prediction. An exception is that training on diseasePoly data has a homogeneous OE of 28% but predicts on diseaseNBH90 with OE of 24%. The explanation here is that this latter data set is easier to separate, for reasons previously discussed. Otherwise, transfer of rules derived from one data set results in significantly larger error rates, and perhaps most notably rules learned from the LacLyso data tend to transfer poorly to the other data sets, and vice versa. Rules based on this systematic mutagenesis data may be a poor choice for SNP prediction, and the most likely cause of this is that the amino acid substitution content of the data set, particularly the large level of MNMs, leads to rules of little relevance for human SNPs.

In contrast, rules derived from the other two data sets are more interchangeable, as might be expected since they share the same deleterious (disease) data. As before, differences between these data sets stem from the relationship of the attributes used to basic differences in amino acid substitution content in the neutral data. Notably, using ‘amino acid substitution only’, a homogeneous OE of 26% is obtained for diseaseNBH90, while the same DT rules have an OE almost 12% higher on the diseasePoly data.

It is interesting, yet intuitive, that when prediction methods are based purely on the evolutionary attributes they tend to transfer better between the different data sets. Compared with the 12% difference noted above for ‘amino acid substitution only’, with ‘PSSM only’ the error rate rises by only 7% between homogeneous cross-validation (diseaseNBH90) and heterogeneous cross-validation (tested on diseasePoly). Similarly, with ‘Scorecons only’ the OE rises by only 3%. These attribute-dependent effects are also clear in the figures for average deviation from homogeneous OE, showing smaller deviations for the evolutionary attributes. It is also apparent that ‘PSSM only’ is a better predictor in general than ‘Scorecons only’; the latter only encodes conservation while the former contains information about possible neutral residue replacements.

4 DISCUSSION

It is an appealing idea to use data from gene mutations, known to cause disease or affect protein function, to train machine-learning methods for predictions on observed human nsSNPs. It is nevertheless vitally important to consider the selection of training data very carefully. In this article we have shown that the choice of training data has significant effects on classifiers and estimated error rates.

Our results suggest that the use of mutagenesis data, with a significantly higher content of MNMs than would be expected for nsSNPs, may lead to largely irrelevant rules for SNP predictions. They remain, however, good unbiased data sets for the prediction of the effects of general protein mutations. Equally, the generation of neutral data from pseudo-mutations between orthologous proteins (divergent data set) produces data sets that can be distinguished from known disease mutations at reasonable error rates, solely on the basis of the amino acid substitutions. But such classifiers are unlikely to perform with the same low error rates in distinguishing human deleterious and neutral SNPs. The rules may have some predictive power for SNPs, but a significant contribution to their apparent homogeneous cross-validation accuracy results from separation of the training data on the basis of content of amino acid substitutions, caused by different evolutionary distances in the deleterious and neutral parts of the training data.

… (e.g. support vector machines or neural networks). The training data is fundamental, it affects all methods and it is important to get it right first.

5 CONCLUSIONS

We have raised some important issues regarding training data for nsSNP prediction methods and recommended a best data set (Swiss-Prot disease/polymorphism). We believe that effects described here have affected several studies, including our own, and whatever view is taken on the best data set it is important that workers in the field be aware of them.

ACKNOWLEDGEMENTS

We would also like to acknowledge the comments of Fyodor Kondrashov and one anonymous reviewer who helped us improve this manuscript. Work carried out by M. Care with technical support from C. Needham. Supervision and support provided by D. Westhead and A. Bulpitt. All authors approved the final manuscript. Funding was provided by the BBSRC: studentship (BBS/S/A/2004/10974) and grant number (BBS/B/16585).

Conflict of Interest: none declared.

REFERENCES
One potential way of improving the divergent data sets is Adamczak,R. et al. (2004) Accurate prediction of solvent accessibility using to limit the aligned orthologous proteins to primates, thus, neural networks-based regression. Proteins, 56, 753–767. reducing the evolutionary distance. This results in a neutral Alber,T. et al. (1987) Temperature-sensitive mutations of bacteriophage T4 lysozyme occur at sites with low mobility and low solvent accessibility in the data set that has a higher correlation with HEAT (R¼ 0.81, folded protein. Biochemistry, 26, 3754–3758. P50.0001) than the mammalian derived data set (NBH90; Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of R¼ 0.74, P50.0001) yet still much lower than the Swiss-Prot protein database search programs. Nucleic Acids Res., 25, 3389–3402. polymorphism (R¼ 0.91, P50.0001). When combined with Bao,L. and Cui,Y. (2005) Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary informa- Swiss-Prot disease this new neutral data set produces a tion. Bioinformatics, 21, 2185–2190. homogeneous OE of 23.23%, placing it between Benner,S.A. et al. (1994) Amino acid substitution during functionally constrained diseaseNBH90 (OE 19.88%) and diseasePoly (OE 28.42%). divergent evolution of protein sequences. Protein Eng., 7, 1323–1332. Thus, while this data set is closer to HEAT than NBH90 it is Boeckmann,B. et al. (2003) The SWISS-PROT protein knowledgebase and its still clearly over a longer evolutionary distance than the supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370. Cai,Z. et al. (2004) Bayesian approach to discovering pathogenic SNPs in Swiss-Prot polymorphism data. conserved protein domains. Hum. Mutat., 24, 178–184. Therefore, we suggest that the best training data for human Cargill,M. et al. 
(1999) Characterization of single-nucleotide polymorphisms in nsSNP predictions is the Swiss-Prot annotated ‘disease’ and coding regions of human genes. Nat. Genet., 22, 231–238. ‘polymorphism’ variants of known human proteins. This is not Chasman,D. and Adams,R.M. (2001) Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assess- without problems: variants annotated as neutral polymorph- ment of amino acid variation. J. Mol. Biol., 307, 683–706. isms may have an unknown association with disease. Collins,F.S. et al. (1998) A DNA polymorphism discovery resource for research Nevertheless the differences in Figures 1b and c, and the fact on human genetic variation. Genome Res., 8, 1229–1231. that learning methods can successfully separate disease and Dobson,R. et al. (2006) Predicting deleterious nsSNPs: an analysis of sequence and structural attributes. BMC Bioinformatics, 7, 217. polymorphism classes, suggests that this is unlikely to be the Edgar,R.C. (2004) MUSCLE: a multiple sequence alignment method with case for the majority of the data. reduced time and space complexity. BMC Bioinformatics, 5, 113. Equally it might be suggested that other data sets could be Ferrer-Costa,C. et al. (2005) Use of bioinformatics tools for the used if appropriate attributes were chosen. Rules based on annotation of disease-associated mutations in animal models. Proteins, 61, 878–887. evolutionary attributes are more transferable between data sets Ferrer-Costa,C. et al. (2004) Sequence-based prediction of pathological muta- than amino acid substitutions. However, any good learning tions. Proteins, 57, 811–819. method will separate the data sets using the most informative Ferrer-Costa,C. et al. (2002) Characterization of disease-associated single amino attributes, and it can be difficult to completely remove effects acid polymorphisms in terms of sequence and structure properties. J. Mol. Biol., 315, 771–786. 
such as those reported here. For instance, the apparently purely Herrgard,S. et al. (2003) Prediction of deleterious functional effects of amino acid physicochemical attributes hydrophobicity and molecular- mutations using a library of structure-based function descriptors. Proteins, 53, mass-difference contain information sufficient to identify the 806–816. amino acid substitution involved. Such effects are even harder Hess,S.T. et al. (1994) Wide variations in neighbor-dependent substitution rates. to tease out with methods less interpretable then decision trees J. Mol. Biol., 236, 1022–1033. 671 M.A.Care et al. Krishnan,V.G. and Westhead,D.R. (2003) A comparative study of machine- Saunders,C.T. and Baker,D. (2002) Evaluation of structural and evolutionary learning methods to predict the effects of single nucleotide polymorphisms on contributions to deleterious mutation prediction. J. Mol. Biol., 322, 891–901. protein function. Bioinformatics, 19, 2199–2209. Suckow,J. et al. (1996) Genetic studies of the Lac repressor. XV: 4000 single Kruglyak,L. and Nickerson,D.A. (2001) Variation is the spice of life. Nat. Genet., amino acid substitutions and analysis of the resulting phenotypes on the basis 27, 234–236. of the protein structure. J. Mol. Biol., 261, 509–523. Markiewicz,P. et al. (1994) Genetic studies of the lac repressor. XIV. Analysis of Sunyaev,S. et al. (2001) Prediction of deleterious human alleles. Hum. Mol. 4000 altered Escherichia coli lac repressors reveals essential and non-essential Genet., 10, 591–597. residues, as well as ‘‘spacers’’ which do not require a specific sequence. J. Mol. Valdar,W.S. (2002) Scoring residue conservation. Proteins, 48, 227–241. Biol., 240, 421–433. Verzilli,C.J. et al. (2005) A hierarchical Bayesian model for predicting the Needham,C.J. et al. (2006) Predicting the effect of missense mutations on functional consequences of amino-acid polymorphisms. J. R. Stat. Soc. Ser. protein function: analysis with Bayesian networks. 
BMC Bioinformatics, 7,405. C-Appl. Stat., 54, 191–206. Ng,P.C. and Henikoff,S. (2001) Predicting deleterious amino acid substitutions. Vitkup,D. et al. (2003) The amino-acid mutational spectrum of human genetic Genome Res., 11, 863–874. disease. Genome Biol., 4, R72. Ramensky,V. et al. (2002) Human non-synonymous SNPs: server and survey. Wang,Z. and Moult,J. (2001) SNPs, protein structure, and disease. Hum. Mutat., Nucleic Acids Res., 30, 3894–3900. 17, 263–270. Rennell,D. et al. (1991) Systematic mutation of bacteriophage T4 lysozyme. J. Yip,Y.L. et al. (2004) The Swiss-Prot variant page and the ModSNP database: a Mol. Biol., 222, 67–88. resource for sequence and structure information on human protein variants. Rost,B. and Sander,C. (1993) Prediction of protein secondary structure at better Hum. Mutat., 23, 464–470. than 70% accuracy. J. Mol. Biol., 232, 584–599. Yue,P. et al. (2005) Loss of protein structure stability as a major causative factor Ruggieri,S. (2004) YaDT: Yet another Decision Tree builder. Proceedings of the in monogenic disease. J. Mol. Biol., 353, 459–473. 16th International Conference on Tools with Artificial Intelligence. IEEE Yue,P. and Moult,J. (2006) Identification and Analysis of Deleterious Human Press, 0, 260–265. SNPs. J. Mol. Biol., 356, 1263–1274. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/deleterious-snp-prediction-be-mindful-of-your-training-data-MLBhgLGfeK
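Note on Table 3: the summary rows of Table 3 (average homogeneous OE, average heterogeneous OE, and average deviation from homogeneous OE) can be recomputed from the nine error rates listed for each attribute set. The following is a minimal sketch, not the authors' code. It assumes that rows are training sets and columns are test sets, that each deviation is a heterogeneous OE minus the homogeneous OE of the same training set, and that the second figure in each summary cell is a population standard deviation. These readings are inferred because they reproduce the published figures; the paper does not state them explicitly. The names `TABLE3` and `summarise` are illustrative only.

```python
from statistics import mean, pstdev

# Overall error (%) copied from Table 3, rounded to two decimals.
# For each attribute set the 3x3 matrix has training sets as rows and
# test sets as columns, both in the order diseasePoly, diseaseNBH90, LacLyso.
TABLE3 = {
    "All": [
        [28.42, 24.45, 42.13],
        [30.82, 19.88, 45.27],
        [37.99, 35.66, 30.05],
    ],
    "PSSM only": [
        [32.22, 27.20, 37.02],
        [33.32, 26.17, 40.15],
        [36.33, 32.55, 35.85],
    ],
    "Amino acid substitution only": [
        [36.71, 32.61, 43.73],
        [37.66, 26.06, 44.52],
        [45.78, 42.90, 39.24],
    ],
    "Scorecons only": [
        [35.56, 35.24, 41.70],
        [38.34, 35.25, 43.49],
        [41.00, 39.04, 34.01],
    ],
}


def summarise(matrix):
    """Return (mean, population SD) for the three summary rows of Table 3."""
    n = len(matrix)
    off_diagonal = [(i, j) for i in range(n) for j in range(n) if i != j]
    # Diagonal entries: train and test on the same data set.
    homogeneous = [matrix[i][i] for i in range(n)]
    # Off-diagonal entries: train on one data set, test on another.
    heterogeneous = [matrix[i][j] for i, j in off_diagonal]
    # Assumption: deviation is measured against the homogeneous OE
    # of the same training set (same row).
    deviation = [matrix[i][j] - matrix[i][i] for i, j in off_diagonal]
    return {
        "homogeneous": (mean(homogeneous), pstdev(homogeneous)),
        "heterogeneous": (mean(heterogeneous), pstdev(heterogeneous)),
        "deviation": (mean(deviation), pstdev(deviation)),
    }


for attributes, matrix in TABLE3.items():
    for row, (avg, sd) in summarise(matrix).items():
        print(f"{attributes:30s} {row:14s} {avg:6.2f} +/- {sd:.2f}")
```

Under these assumptions, the large mean deviation for ‘All’ attributes and the smaller ones for the evolutionary attributes follow directly from the off-diagonal entries, which is the transferability argument made in Section 3.3.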



Publisher: Oxford University Press
Copyright: © The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
eISSN: 1367-4811
DOI: 10.1093/bioinformatics/btl649
pmid: 17234639

Deviation of data sets from expected mutations. (a) human expected amino acid transitions (HEAT); displaying the expected relative rates of amino acid substitutions under no selection pressure (Intensity of greyscale depicting expected rate of amino acid substitution. ‘’¼ substitutions requiring multiple nucleotide mutations; not present in HEAT). (b, c, d) deviation of data sets from expected (HEAT), blue¼ under-represented, red¼ over-represented ‘’¼ not present in HEAT. (b) Swiss-Prot annotated ‘disease’. (c) Swiss-Prot annotated ‘polymorphism’. (d) Divergent neutralBH SI90. sets, they were set aside for this analysis. For the other data HEAT matrix was calculated and is given along with an sets count matrices were created with the counts of all amino estimate of significance. The results for three of the data sets are shown in Figure 1. acid substitutions resulting from SNM. Then, to see which substitutions were over/under represented in each data set, the For the Swiss-Prot disease data set (Fig. 1b; R¼ 0.81, log-odds score was calculated for each amino acid substitution P50.0001) the squares lying close to the leading diagonal, [log (P(datasetSubstitution)/P(HEAT Substitution))]. In addi- displaying substitutions between chemically similar amino tion the Pearson’s correlation between each data set and the acids, are under-represented whilst those not lying close to 667 M.A.Care et al. 70% 59% 57% 60% 50% 40% 40% 33% 30% 28% 30% 26% 23% 23% 20% 18% 20% 16% 15% 13% 11% 9% 7% 10% 5% 0.2% 0.1% 0% Data set Fig. 2. The percentage of multiple nucleotide mutations present in each data set. Mutagenesis (Lac/Lyso), Swiss-Prot (disease/polymorphism) and divergent (neutralBH/neutralAH; sequence identity cut-off 30%–95%). this diagonal, substitutions between chemically dissimilar case, substitutions over short evolutionary distances are amino acids, are over-represented. For the Swiss-Prot poly- strongly influenced by the coding sequence. 
As the evolutionary morphism data set (Fig. 1c; R¼ 0.91, P50.0001) there is the distance increases, selection begins to reflect the constraints opposite trend, with the squares close to the leading diagonal imposed by protein structural stability and function, favouring over-represented, or near to expected, and the squares not lying substitutions between amino acids with similar chemical close to the leading diagonal under-represented. As expected properties (the only over-represented substitution in this latter this data set is more strongly correlated with HEAT than the case is Arg to the related basic amino acid Lys) (Benner disease mutations. et al., 1994). In the Swiss-Prot disease data set (Fig. 1b) the most over- Overall, the comparison with the HEAT matrix has high- represented substitutions are from the amino acids Cys, Gly, lighted differences between the data sets, showing the potential Trp, Arg and Tyr; this agrees with the findings of Vitkup et al. to discriminate deleterious from neutral using only the (2003). The differences seen in the data sets are mainly parameter of amino acid substitution. It has also emphasized governed by the types of substitutions an amino acid can significant differences in the distribution of amino acid undergo by SNM. In some cases, such as Cys and Trp, the substitutions present in the Swiss-Prot polymorphism and substitutions resulting from SNM are all disfavoured, while for divergent data sets. The polymorphism data set has a high level others, such as Gly, the substitutions resulting from SNM are a of correlation with the HEAT matrix (R¼ 0.91, P50.0001), mixture of favored, neutral, and disfavoured. Thus, substitu- while the divergent data set’s correlation (R¼ 0.74, P50.0001) tions from Cys and Trp are very likely to be deleterious, is actually less than that of Swiss-Prot disease (R¼ 0.81, not only because these amino acids play important structural P50.0001). 
roles but also because their likely substitutions are all This has important consequences for machine-learning disfavoured. methods: rules learned using the divergent data sets for neutral Most of the substitutions that are over-represented in the data are likely to give accurate rules to separate deleterious disease data set are under-represented in the polymorphism from neutral. However, there is a danger that the basis of these data set (Fig. 1c) with Cys and Tyr having the strongest rules would be simply the differing evolutionary distances for divergence from HEAT, suggesting that even the relatively the mutations in the Swiss-Prot disease set compared with simple attribute of amino acid substitution would separate the divergent data sets. Such rules may be of little use these data sets to some extent. in distinguishing human disease mutations from neutral The divergent (neutralBH90) data set (Fig. 1d; R¼ 0.74, mutations occurring on the same evolutionary time scale. P50.0001) is similar to the polymorphism, except that the divergence from HEAT is generally greater, owing to longer 3.2 Decision tree homogeneous cross-validation evolutionary distances, particularly for example with substitu- tions from Cys and Trp. A notable exception is Arg, which is The results from homogeneous 10-fold cross-validation are slightly over-represented in the polymorphism data set and shown in Table 2 (see Systems and Methods for information on strongly under-represented in the divergent data set. In this accuracy measures). The results are split according to the Lac Lyso Disease Polymorphism NeutralBH95 NeutralBH90 NeutralBH80 NeutralBH70 NeutralBH60 NeutralBH50 NeutralBH40 NeutralBH30 NeutralAH95 NeutralAH90 NeutralAH80 NeutralAH70 NeutralAH60 NeutralAH50 NeutralAH40 NeutralAH30 Percentage (%) multiple nucleotide mutations Deleterious SNP prediction Table 2. Homogenous cross-validation. 
Showing the average false positive rate (FPR), false negative rate (FNR) and overall-error (OE) for 10-fold cross-validation trained on ‘All’ attributes, ‘PSSM only’, ‘amino acid substitution only’ and ‘Scorecons only’ Data set Attributes All PSSM only Amino acid substitution only Scorecons only FPR FNR OE(%) MCC FPR FNR OE(%) MCC FPR FNR OE(%) MCC FPR FNR OE(%) MCC Lac 0.28 0.20 24.36 0.51 0.25 0.43 34.19 0.32 0.44 0.37 41.46 0.17 0.27 0.31 29.27 0.41 Lyso 0.23 0.29 26.28 0.48 0.31 0.40 35.93 0.28 0.30 0.33 32.18 0.36 0.12 0.61 36.51 0.31 Lac Lyso 0.30 0.29 30.05 0.40 0.33 0.39 35.85 0.28 0.35 0.42 39.24 0.22 0.27 0.40 34.01 0.32 DiseasePoly 0.26 0.31 28.42 0.43 0.23 0.41 32.22 0.36 0.32 0.42 36.71 0.27 0.23 0.48 35.56 0.26 DiseaseNAH90 0.20 0.21 20.41 0.59 0.27 0.27 27.05 0.46 0.26 0.27 26.55 0.47 0.32 0.37 34.61 0.31 DiseaseNBH90 0.20 0.20 19.88 0.60 0.25 0.27 26.17 0.48 0.25 0.27 26.06 0.48 0.28 0.42 35.25 0.30 attributes used for prediction, giving a comparison of separate all data sets, and is highly predictive when the accuracy using ‘All’ of the attributes with that from some divergent data set is used for negative data [diseaseN(A/B)H important individual attributes used independently (‘PSSM in Table 2]. Nevertheless this should be viewed with substantial only’, position specific scoring matrix value; ‘amino acid caution, since, as previously noted, the effect may not be due to substitution only’, amino acid substitution; ‘Scorecons only’, distinguishing deleterious from neutral mutations as distin- conservation value). guishing data sets differing in content of amino acid substitu- When ‘All’ attributes are used for prediction the overall-error tions, owing to variations in evolutionary distance and (OE) ranges from 19.88 to 30.05 across the different data sets, systematic mutation. showing that even under homogeneous cross-validation some Figure 3 shows the effect upon overall accuracy of data sets are far easier to classify than others. 
This range of homogeneous cross-validation (disease/divergent) when chan- accuracy across data sets is greatly influenced by the level of ging the minimum SI level for accepting homologs in the distinction between the ‘deleterious’/‘neutral’parts of each data divergent data sets (neutralAH/BH). Using ‘All’ attributes set. The comparison made with the HEAT matrix showed that produces the highest accuracy, followed by ‘amino acid the substitutions in the divergent data sets (neutralAH/BH) substitution only’ and then ‘PSSM only’. Again these results deviate further from HEAT than those in the Swiss-Prot are strongly influenced by data set content. As the SI level polymorphism data set, explaining the greater prediction increases the level of MNM (Fig. 2) decreases, and the accuracy when using the former (20% OE) compared with divergent data sets share more similarity with the Swiss-Prot the latter (28.42% OE) as neutral data. The divergent data sets disease data set, thus increasing the observed error rate for the are also easier to separate from disease due to their MNM ‘amino acid substitution only’ attribute. In addition the ‘amino (Fig. 2), which are almost completely absent in the Swiss-Prot acid substitution only’ attribute is little affected by the method used to create the divergent data set. In contrast, the ‘PSSM (polymorphism/disease) data sets. These are effectively ‘easy’ predictions as the DT can correctly classify 10% of the only’ attribute is strongly affected, with a larger variation in the divergent (SI90) data set on these alone. OE for the neutralAH (4.06%) compared with the neutralBH ‘PSSM only’ encodes position specific evolutionary informa- (1.45%). tion and leads to OE ranging from 26.17%–35.93%, with the To optimize the generation of divergent data sets, we note T4 lysozyme proving the hardest to classify and the that with increasing SI levels the discrepancy between the errors diseaseDivergent the easiest. 
The OE for the ‘amino acid for the two divergent data sets diminishes, but the neutralBH substitution only’ attribute ranges from 26.06%–41.46%, a gives consistently lower error rates. With ‘All’ attributes, lower larger range than for PSSM, yet for the diseaseDivergent data apparent error rates result from data sets at lower SI thresh- sets the ‘amino acid substitution only’ is the most accurate olds. However, this is again misleading. It is unlikely that these single attribute. ‘Scorecons only’ is an alternative measure of data sets give better SNP classification methods, rather, the position specific evolutionary information and gives rise to lower error rates are artefacts caused by different data set overall errors in the range 29.27%–36.51%. By contrast with contents in terms of amino acid substitutions. The effect is clear the PSSM, scorecons encodes only conservation while the with ‘amino acid substitution only’, but when ‘PSSM only’ is PSSM contains information on the likelihood of specific amino used the trend of error rate increasing with SI disappears. acid substitutions, yet this only makes a significant difference to Both ‘amino acid substitution only’ and ‘PSSM only’ show a the OE in the cases of the diseaseDivergent data sets. small increase in error between SI values of 90 and 95%, The clear interpretation emerging from these observations, suggesting that if these data sets were to be used to train and the previous data set analysis, is that the simplest attribute, methods, a 90% cut-off would be preferred. This may be caused amino acid substitution, contains useful information to by limited data available at very high sequence identity. 669 M.A.Care et al. field where training data can be substantially different to the 3.3 Heterogeneous cross-validation final target data for prediction. 
An exception is that training on Heterogeneous cross-validation measures the ability of a diseasePoly data has a homogeneous OE of 28% but predicts classifier trained on one data set to predict on another. on diseaseNBH90 with OE of 24%. The explanation here is Table 3 shows the results for heterogeneous and homogeneous that this latter data set is easier to separate, for reasons (on the diagonal) cross-validation for a selection of attributes, previously discussed. Otherwise, transfer of rules derived from along with their corresponding average error rates per attribute one data set results in significantly larger error rates, and type. In addition, for each attribute the average deviation from perhaps most notably rules learned from the LacLyso data tend the homogeneous overall-error is shown, indicating how well to transfer poorly to the other data sets, and vice versa. Rules the homogeneous cross-validation gauges predictive ability on based on this systematic mutagenesis data may be a poor choice other data sets. for SNP prediction, and the most likely cause of this is that the First, considering ‘All’ attributes it is clear that error rates in amino acid substitution content of the data set, particularly the heterogeneous cross validation are generally significantly large level of MNMs, leads to rules of little relevance for human higher than the corresponding values for the homogeneous SNPs. case. This would be expected, but is an important effect in a In contrast, rules derived from the other two data sets are more interchangeable, as might be expected since they share the same deleterious (disease) data. As before differences between these data sets stem from the relationship of the attributes used, to basic differences in amino acid substitution content in the neutral data. Notably, using ‘amino acid substitution only’, a homogeneous OE of 26% is obtained for diseaseNBH90, while the same DT rules have an OE almost 12% higher on the diseasePoly data. 
It is interesting, yet intuitive, that when prediction methods are based purely on the evolutionary attributes they tend to transfer better between the different data sets. Compared with the 12% difference noted above for ‘amino acid substitution only’, with ‘PSSM only’ the error rate rises by only 7% between homogeneous cross-validation (diseaseNBH90) and heteroge- neous cross-validation (tested on diseasePoly). Similarly, with ‘Scorecons only’ the OE rises by only 3%. These attribute- dependent effects are also clear in the figures for average deviation from homogeneous OE, showing smaller deviations Fig. 3. Effect of divergent data set’s minimum sequence identity on for the evolutionary attributes. It is also apparent that ‘PSSM overall error. Displaying homogeneous cross-validation overall only’ is a better predictor in general than ‘Scorecons only’; the error percent for ‘All’ attributes, ‘amino acid substitution only’ and latter only encodes conservation while the former contains ‘PSSM only’ with increasing sequence identity cut-off; data set disease/ information about possible neutral residue replacements. divergent. Table 3. Heterogeneous cross-validation. Showing the average homogeneous (on diagonal) and heterogeneous overall-error (OE) for data sets trained and tested on ‘All’ attributes, ‘PSSM only’, ‘amino acid substitution only’ and ‘Scorecons only’ Train Test All PSSM only Amino acid substitution only Scorecons only disPoly disNBH90 LacLyso disPoly disNBH90 LacLyso disPoly disNBH90 LacLyso disPoly disNBH90 LacLyso DiseasePoly 28.42 24.45 42.13 32.22 27.20 37.02 36.71 32.61 43.73 35.56 35.24 41.70 diseaseNBH90 30.82 19.88 45.27 33.32 26.17 40.15 37.66 26.06 44.52 38.34 35.254 43.49 Lac Lyso 37.99 35.66 30.05 36.33 32.55 35.85 45.78 42.90 39.24 41.00 39.04 34.01 Homogeneous Average OE 26.12 4.46 31.41 4.00 34.00 5.71 34.94 0.67 Heterogeneous Average OE 36.05 6.93 34.43 4.08 41.20 4.61 39.80 2.65 Av. 
deviation from 9.94 8.85 3.01 6.47 7.20 6.92 4.86 2.82 homogenous OE The average homogeneous OE for each attribute type. The average heterogeneous OE for each attribute type. The average deviation of heterogeneous OE from the predicted (homogeneous) attribute accuracy. 670 Deleterious SNP prediction (e.g. support vector machines or neural networks). The training 4 DISCUSSION data is fundamental, it affects all methods and it is important to It is an appealing idea to use data from gene mutations, known get it right first. to cause disease or affect protein function, to train machine- learning methods for predictions on observed human nsSNPs. It is nevertheless vitally important to consider the selection of 5 CONCLUSIONS training data very carefully. In this article we have shown that We have raised some important issues regarding training data the choice of training data has significant effects on classifiers for nsSNP prediction methods and recommended a best data and estimated error rates. set (Swiss-Prot disease/polymorphism). We believe that effects Our results suggest that the use of mutagenesis data, with a described here have affected several studies, including our own, significantly higher content of MNMs than would be expected and whatever view is taken on the best data set it is important for nsSNPs, may lead to largely irrelevant rules for SNP that workers in the field be aware of them. predictions. They remain, however, good unbiased data sets for the prediction of the effects of general protein mutations. ACKNOWLEDGEMENTS Equally, the generation of neutral data from pseudo-mutations between orthologous proteins (divergent data set), produces We would also like to acknowledge the comments of Fyodor data sets that can be distinguished from known disease Kondrashov and one anonymous reviewer who helped us mutations at reasonable error rates, solely on the basis of the improve this manuscript. Work carried out by M. Care with amino acid substitutions. 
But such classifiers are unlikely to technical support from C. Needham. Supervision and support perform with the same low error rates in distinguishing human provided by D. Westhead and A. Bulpitt. All authors approved deleterious and neutral SNPs. The rules may have some the final manuscript. BBSRC for funding – studentship (BBS/S/ predictive power for SNPs, but a significant contribution to A/2004/10974) and grant number (BBS/B/16585). their apparent homogeneous cross-validation accuracy results Conflict of Interest: none declared. from separation of the training data on the basis of content of amino acid substitutions, caused by different evolutionary distances in the deleterious and neutral parts of the training REFERENCES data. One potential way of improving the divergent data sets is Adamczak,R. et al. (2004) Accurate prediction of solvent accessibility using to limit the aligned orthologous proteins to primates, thus, neural networks-based regression. Proteins, 56, 753–767. reducing the evolutionary distance. This results in a neutral Alber,T. et al. (1987) Temperature-sensitive mutations of bacteriophage T4 lysozyme occur at sites with low mobility and low solvent accessibility in the data set that has a higher correlation with HEAT (R¼ 0.81, folded protein. Biochemistry, 26, 3754–3758. P50.0001) than the mammalian derived data set (NBH90; Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of R¼ 0.74, P50.0001) yet still much lower than the Swiss-Prot protein database search programs. Nucleic Acids Res., 25, 3389–3402. polymorphism (R¼ 0.91, P50.0001). When combined with Bao,L. and Cui,Y. (2005) Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary informa- Swiss-Prot disease this new neutral data set produces a tion. Bioinformatics, 21, 2185–2190. homogeneous OE of 23.23%, placing it between Benner,S.A. et al. 
(1994) Amino acid substitution during functionally constrained diseaseNBH90 (OE 19.88%) and diseasePoly (OE 28.42%). divergent evolution of protein sequences. Protein Eng., 7, 1323–1332. Thus, while this data set is closer to HEAT than NBH90 it is Boeckmann,B. et al. (2003) The SWISS-PROT protein knowledgebase and its still clearly over a longer evolutionary distance than the supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370. Cai,Z. et al. (2004) Bayesian approach to discovering pathogenic SNPs in Swiss-Prot polymorphism data. conserved protein domains. Hum. Mutat., 24, 178–184. Therefore, we suggest that the best training data for human Cargill,M. et al. (1999) Characterization of single-nucleotide polymorphisms in nsSNP predictions is the Swiss-Prot annotated ‘disease’ and coding regions of human genes. Nat. Genet., 22, 231–238. ‘polymorphism’ variants of known human proteins. This is not Chasman,D. and Adams,R.M. (2001) Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assess- without problems: variants annotated as neutral polymorph- ment of amino acid variation. J. Mol. Biol., 307, 683–706. isms may have an unknown association with disease. Collins,F.S. et al. (1998) A DNA polymorphism discovery resource for research Nevertheless the differences in Figures 1b and c, and the fact on human genetic variation. Genome Res., 8, 1229–1231. that learning methods can successfully separate disease and Dobson,R. et al. (2006) Predicting deleterious nsSNPs: an analysis of sequence and structural attributes. BMC Bioinformatics, 7, 217. polymorphism classes, suggests that this is unlikely to be the Edgar,R.C. (2004) MUSCLE: a multiple sequence alignment method with case for the majority of the data. reduced time and space complexity. BMC Bioinformatics, 5, 113. Equally it might be suggested that other data sets could be Ferrer-Costa,C. et al. 
(2005) Use of bioinformatics tools for the used if appropriate attributes were chosen. Rules based on annotation of disease-associated mutations in animal models. Proteins, 61, 878–887. evolutionary attributes are more transferable between data sets Ferrer-Costa,C. et al. (2004) Sequence-based prediction of pathological muta- than amino acid substitutions. However, any good learning tions. Proteins, 57, 811–819. method will separate the data sets using the most informative Ferrer-Costa,C. et al. (2002) Characterization of disease-associated single amino attributes, and it can be difficult to completely remove effects acid polymorphisms in terms of sequence and structure properties. J. Mol. Biol., 315, 771–786. such as those reported here. For instance, the apparently purely Herrgard,S. et al. (2003) Prediction of deleterious functional effects of amino acid physicochemical attributes hydrophobicity and molecular- mutations using a library of structure-based function descriptors. Proteins, 53, mass-difference contain information sufficient to identify the 806–816. amino acid substitution involved. Such effects are even harder Hess,S.T. et al. (1994) Wide variations in neighbor-dependent substitution rates. to tease out with methods less interpretable then decision trees J. Mol. Biol., 236, 1022–1033. 671 M.A.Care et al. Krishnan,V.G. and Westhead,D.R. (2003) A comparative study of machine- Saunders,C.T. and Baker,D. (2002) Evaluation of structural and evolutionary learning methods to predict the effects of single nucleotide polymorphisms on contributions to deleterious mutation prediction. J. Mol. Biol., 322, 891–901. protein function. Bioinformatics, 19, 2199–2209. Suckow,J. et al. (1996) Genetic studies of the Lac repressor. XV: 4000 single Kruglyak,L. and Nickerson,D.A. (2001) Variation is the spice of life. Nat. Genet., amino acid substitutions and analysis of the resulting phenotypes on the basis 27, 234–236. of the protein structure. J. Mol. 
Biol., 261, 509–523. Markiewicz,P. et al. (1994) Genetic studies of the lac repressor. XIV. Analysis of Sunyaev,S. et al. (2001) Prediction of deleterious human alleles. Hum. Mol. 4000 altered Escherichia coli lac repressors reveals essential and non-essential Genet., 10, 591–597. residues, as well as ‘‘spacers’’ which do not require a specific sequence. J. Mol. Valdar,W.S. (2002) Scoring residue conservation. Proteins, 48, 227–241. Biol., 240, 421–433. Verzilli,C.J. et al. (2005) A hierarchical Bayesian model for predicting the Needham,C.J. et al. (2006) Predicting the effect of missense mutations on functional consequences of amino-acid polymorphisms. J. R. Stat. Soc. Ser. protein function: analysis with Bayesian networks. BMC Bioinformatics, 7,405. C-Appl. Stat., 54, 191–206. Ng,P.C. and Henikoff,S. (2001) Predicting deleterious amino acid substitutions. Vitkup,D. et al. (2003) The amino-acid mutational spectrum of human genetic Genome Res., 11, 863–874. disease. Genome Biol., 4, R72. Ramensky,V. et al. (2002) Human non-synonymous SNPs: server and survey. Wang,Z. and Moult,J. (2001) SNPs, protein structure, and disease. Hum. Mutat., Nucleic Acids Res., 30, 3894–3900. 17, 263–270. Rennell,D. et al. (1991) Systematic mutation of bacteriophage T4 lysozyme. J. Yip,Y.L. et al. (2004) The Swiss-Prot variant page and the ModSNP database: a Mol. Biol., 222, 67–88. resource for sequence and structure information on human protein variants. Rost,B. and Sander,C. (1993) Prediction of protein secondary structure at better Hum. Mutat., 23, 464–470. than 70% accuracy. J. Mol. Biol., 232, 584–599. Yue,P. et al. (2005) Loss of protein structure stability as a major causative factor Ruggieri,S. (2004) YaDT: Yet another Decision Tree builder. Proceedings of the in monogenic disease. J. Mol. Biol., 353, 459–473. 16th International Conference on Tools with Artificial Intelligence. IEEE Yue,P. and Moult,J. (2006) Identification and Analysis of Deleterious Human Press, 0, 260–265. 
SNPs. J. Mol. Biol., 356, 1263–1274.
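
Worked sketch: SNM versus MNM substitutions. Section 3.1 states that single nucleotide mutations can produce 150 amino acid substitutions and that the remaining 230 require multiple nucleotide mutations. The following is a minimal sketch of how that split can be checked, and of how the MNM percentage of a data set (the quantity plotted in Fig. 2) could be computed. It assumes the standard genetic code and counts ordered (wild type, mutant) pairs; the function names `snm_substitutions` and `percent_mnm` and the toy substitution list are illustrative only and are not the authors' software.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def snm_substitutions():
    """Return the ordered (wild type, mutant) amino acid pairs reachable by one nucleotide change."""
    pairs = set()
    for codon, wild_type in CODON_TABLE.items():
        if wild_type == "*":
            continue  # stop codons are not substitution start points
        for position, base in product(range(3), BASES):
            if base == codon[position]:
                continue
            mutant = CODON_TABLE[codon[:position] + base + codon[position + 1:]]
            # Ignore nonsense (stop) and synonymous changes.
            if mutant != "*" and mutant != wild_type:
                pairs.add((wild_type, mutant))
    return pairs


def percent_mnm(substitutions):
    """Percentage of substitutions in a data set that need multiple nucleotide mutations."""
    snm = snm_substitutions()
    mnm_count = sum(1 for pair in substitutions if pair not in snm)
    return 100.0 * mnm_count / len(substitutions)


if __name__ == "__main__":
    snm = snm_substitutions()
    print(len(snm), 20 * 19 - len(snm))  # 150 230
    # Hypothetical toy data set: only Lys -> Asp needs more than one nucleotide change.
    print(percent_mnm([("R", "K"), ("C", "Y"), ("K", "D"), ("W", "C")]))  # 25.0
```

Applied to a real data set, `percent_mnm` would take one (wild type, mutant) pair per SAP, using one-letter amino acid codes.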
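
Worked sketch: Table 3 summary rows. The summary statistics reported with Table 3 can be recomputed from the train/test overall-error values. The sketch below does this for the ‘All’ attributes block. Two points are assumptions inferred from the numbers rather than stated in the text: that the second value in each summary cell is a population standard deviation, and that each deviation is a heterogeneous OE minus the homogeneous OE of the same training set. The name `summarise` is illustrative.

```python
from statistics import fmean, pstdev

# Overall error (%) with 'All' attributes, taken from Table 3.
# Rows = training set, columns = test set, both ordered
# diseasePoly, diseaseNBH90, LacLyso.
OE_ALL = [
    [28.42, 24.45, 42.13],
    [30.82, 19.88, 45.27],
    [37.99, 35.66, 30.05],
]


def summarise(oe):
    """Return (mean, population SD), rounded to 2 d.p., for each summary row."""
    n = len(oe)
    off_diagonal = [(i, j) for i in range(n) for j in range(n) if i != j]
    groups = {
        # Diagonal: trained and tested on the same data set.
        "homogeneous": [oe[i][i] for i in range(n)],
        # Off-diagonal: trained on one data set, tested on another.
        "heterogeneous": [oe[i][j] for i, j in off_diagonal],
        # Assumed definition: error on another data set minus the
        # homogeneous error of the data set used for training.
        "deviation": [oe[i][j] - oe[i][i] for i, j in off_diagonal],
    }
    return {
        name: (round(fmean(values), 2), round(pstdev(values), 2))
        for name, values in groups.items()
    }


if __name__ == "__main__":
    for name, (mean, spread) in summarise(OE_ALL).items():
        print(f"{name}: {mean} +/- {spread}")
```

Under these assumptions the output matches the reported ‘All’ values (26.12 and 4.46, 36.05 and 6.93, 9.94 and 8.85), which is the basis for reading the paired numbers as a mean and a spread.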

Journal

Bioinformatics – Oxford University Press

Published: Jan 18, 2007
