Prediction of recursive convex hull class assignments for protein residues

Prediction of recursive convex hull class assignments for protein residues Vol. 24 no. 7 2008, pages 916–923 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btn050 Structural bioinformatics Prediction of recursive convex hull class assignments for protein residues 1 1,2 3 1, Michael Stout , Jaume Bacardit , Jonathan D. Hirst and Natalio Krasnogor 1 2 Automated Scheduling, Optimization and Planning research group, School of Computer Science, Multi-disciplinary Centre for Integrative Biology, School of Biosciences and School of Chemistry, University of Nottingham, UK Received on November 6, 2007; revised on January 28, 2008; accepted on January 30, 2008 Advance Access publication February 5, 2008 Associate Editor: Burkhard Rost ABSTRACT de novo methods (Baldi and Pollastri, 2002). Whilst classifying residue neighbourhood density as high or low will generally Motivation: We introduce a new method for designating the location assign the high class to residues buried within the structure and of residues in folded protein structures based on the recursive the low class to residues exposed on the surface, residues lining convex hull (RCH) of a point set of atomic coordinates. The RCH can cavities in the structure that may be functionally significant be calculated with an efficient and parameterless algorithm. (Chen et al., 2007) can have a low coordination number even Results: We show that residue RCH class contains information when located far from the surface. Incorporation of comple- complementary to widely studied measures such as solvent accessibility (SA), residue depth (RD) and to the distance of residues mentary residue solvent accessibility and residue depth from the centroid of the chain, the residues’ exposure (Exp). RCH is information improves fold recognition (Liu et al., 2007). more conserved for related structures across folds and correlates A range of measures of residue location have been studied. better with changes in thermal stability of mutants than the other Lee and Richards (1971) used a spherical probe method to measures. Further, we assess the predictability of these measures measure the solvent accessible surface of residues and recently using three types of machine-learning technique: decision trees Kawabata and Go (2007) have used adjustable probe param- (C4.5), Naive Bayes and Learning Classifier Systems (LCS) showing eters to identify putative ligand binding pockets on protein that RCH is more easily predicted than the other measures. As an surfaces. Solvent accessibility, however, is difficult to compute exemplar application of predicted RCH class (in combination with and does not distinguish between residues below the surface. other measures), we show that RCH is potentially helpful in Hence, atom/residue depth (RD), the distance of an atom/ improving prediction of residue contact numbers (CN). residue from its nearest solvent accessible neighbour, was Contact: nxk@cs.nott.ac.uk introduced (Chakravarty and Varadarajan, 1999) and efficient Supplementary Information: For Supplementary data please refer algorithms are available to compute RD for a given structure to Datasets: www.infobiotic.net/datasets, RCH Prediction Servers: (Pintar et al., 2003; Vlahovicek et al., 2005). Whilst SA www.infobiotic.net emphasises burial, RD emphasises exposure and depends on the method used to identify surface atoms/residues. Hence, Half Sphere Exposure (HSE), has been recently proposed 1 INTRODUCTION (Hamelryck, 2005). 
HSE, like CN, counts neighbouring residues but distinguishes two regions (half spheres) around Prediction of the three-dimensional structure of proteins from each residue based on the C –C vector, i.e. a 2D measure of their constituent amino acid sequences continues to be one residue location. In addition, the distance (exposure) of residues of the key goals of structural biology and a wide range of predictive strategies has been investigated. Steady improve- from the chain centroid is a potentially interesting measure ments in predictive accuracy have resulted from decomposition being related to the location of catalytic residues in enzyme of the problem into subproblems, such as prediction of structures (Ben-shimon and Eisenstein, 2005). Measures of secondary structural elements [approaching a theoretical atom/residue location typically depend on specific parameters prediction limit of 80% (Dor and Zhou, 2007; Wood and such as probe size for SA or contact radius for CN. Hirst, 2005)], of residue coordination number [at over 80% In this paper, we introduce a new approach to stratifying (Bacardit et al., 2006)] and of residue solvent accessibility [at residues in protein structures by recursively identifying the over 77% using consensus predictors (Gianese and Pascarella, convex hull layer to which each residue belongs. The convex hull 2006)]. Burial of hydrophobic groups within the protein core of a set of points is a parameterless, mathematically rigorous is a primary driving force for protein structure formation. and unambiguous approach to identifying the points on the Characterizations of residue accessibility to solvent are, there- exterior of a point set, analogous to identifying those points that fore, important for protein structure prediction (PSP), poten- contact the enclosing surface when the point set is tightly tially helping to constrain the search space to be explored using wrapped. The convex hull is simple and efficient (O(n  log n)) to compute (Preparata and Hong, 1977). The recursive convex hull *To whom correspondence should be addressed. (RCH) of a point set is obtained by identification of the minimal 916  The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org Residue RCH prediction show that although not totally unrelated, these properties are indeed complementary. We show that RCH correlates better with structural conservation than the other measures of residue location and that RCH is also better correlated with changes in protein thermal stability in the presence of cavity forming mutations. We turn, in Part 2, to the question of how easy/ difficult it is, in practical terms, to learn to predict these measures. The relative predictability of RCH, RD, SA and Exp using four different machine-learning algorithms was assessed Fig. 1. Left: RCH of a 2D off-lattice protein model. The backbone is using six different, progressively richer, sets of input attributes represented by coloured circles joined by solid black lines. Residues on at three levels of precision. The relative benefits of using these the outermost RCH are coloured red, subsequent recursive convex hulls various inputs are described. C4.5 (Quinlan, 1992), Naive Bayes are coloured blue, green, and yellow, with residues on the innermost (John and Langley, 1995), GAssist (Bacardit, 2004) and recursive convex hull coloured purple. 
Right: A graphical representa- tion of the outer RCH of residues in a 3D model of a natural protein BioHEL (Bacardit et al., 2007) are the machine-learning chain (PDB Id. 1P4X). methods employed in this article. Finally, we demonstrate the usefulness of RCH by using the predicted RCH class of residues point set that generates the convex hull (the vertices) and as input for prediction of residue coordination number (CN) removal of these points from the point set followed by showing that, in combination with predicted residue SA and recursively applying these steps to the remaining points to Exp class, predicted RCH information increases predictive identify subsequent hulls. Applied to the point set of coordinates accuracy for CN. of residues in a protein chain, a series of hulls is obtained that groups the residues by their distance from the convex surface of 2 MATERIALS AND METHODS the structure. The recursive convex hulls of a 2D off-lattice 2.1 Datasets and features studied protein model are shown in Figure 1 along with a representation Next, we describe the datasets and algorithms employed to assess the of the outer convex hull of a 3D point set derived from the C novelty of RCH and its relation to previously studied measures. All of atomic coordinates of residues in a real protein chain. the measures studied are based on atomic coordinates. Two polypep- Convex hulls have found a wide range of applications in tides that have similar structures when represented using C coordinates studies of molecular structure. Here we give a brief, by no means may have distinct structures when represented using C coordinate complete, review. Badel-chagnon and colleagues introduced (Eidhammer et al., 2003). Throughout this article C atom coordinates a notion of the ‘molecular surface convex hull’ to define the depth are used (C for glycyl residues) as these are sensitive to the orientation of any molecular surface point (Badel-chagnon et al., 1994) and of side-chain atoms. Lin and colleagues used convex hulls to align 11 randomly Protein dataset: The dataset used here are those described by Bacardit generated bio-active tachykinin peptides, finding that 3D convex et al. (2006), originally proposed by Kinjo et al. (2005). Protein chains were selected from PDB-REPRDB [a non-redundant curated subset of hulls can be used to align even these flexible structures (Lin et al., the Protein Data Bank (PDB) (Noguchi et al., 2001), covering the space of 1999; Lin and Lin, 2001). Meier et al. (1995) proposed a convex possible folds] using the following criteria:less than 30% sequence hull-based segmentation technique (that makes few assumptions identity, sequence length greater than 50 residues, no membrane proteins, about the underlying surface) to find characteristically shaped no non-standard residues, no chain breaks, resolution better than 2 A and regions of molecular surfaces for prediction of possible protein a crystallographic R factor better than 20%. Chains that had no entry in docking sites. Liang and Dill (2001) used convex hulls to define the HSSP (Sander and Schneider, 1991) database were discarded. The the boundaries of surface pockets and depressions in studies of final dataset contains 1050 protein chains (257 560 residues). packing densities in proteins. Holmes and Tsai tackled protein Identification of residue RCH: Convex hulls were identified from the residue C atomic coordinates using the QHull package (Barber et al., side-chain packing and interactions by measuring variation in 1996). 
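As a concrete illustration of the hull-peeling idea just described, the following is a minimal sketch (not the authors' code) of how RCH numbers could be assigned from residue coordinates. It assumes an (N, 3) NumPy array of coordinates per chain and uses scipy.spatial.ConvexHull, a Qhull wrapper, in place of a direct call to the QHull package; function and variable names are illustrative only.

```python
# Minimal sketch of recursive convex hull (RCH) peeling, assuming residue
# coordinates (e.g. C-beta positions) for one chain are given as an (N, 3)
# NumPy array. scipy.spatial.ConvexHull wraps Qhull and stands in here for
# the authors' direct use of the QHull package.
import numpy as np
from scipy.spatial import ConvexHull, QhullError


def assign_rch(coords):
    """Return an array of hull numbers per residue (1 = outermost hull)."""
    coords = np.asarray(coords, dtype=float)
    rch = np.zeros(len(coords), dtype=int)
    remaining = np.arange(len(coords))   # residues not yet assigned to a hull
    hull_number = 1
    while len(remaining) > 0:
        # Qhull needs at least dim + 1 non-degenerate points; when too few
        # (or coplanar) points remain, assign them all to the innermost shell.
        if len(remaining) < 4:
            rch[remaining] = hull_number
            break
        try:
            hull = ConvexHull(coords[remaining])
        except QhullError:
            rch[remaining] = hull_number
            break
        on_hull = remaining[hull.vertices]      # residues on the current hull
        rch[on_hull] = hull_number
        remaining = np.setdiff1d(remaining, on_hull)
        hull_number += 1
    return rch
```

Under this sketch, the reverse numbering RCHr described below would simply be rch.max() - rch + 1 within each chain.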
convex hulls constructed around these groups (Holmes and Tsai, 2005). Coleman and Sharp (2006) introduced the notion of travel depth (the physical distance a solvent molecule would have to travel from a surface point to a suitably defined reference surface) using convex hulls of surface points. Recently, Lee and colleagues have employed 3D convex hulls around complementarity regions of antibodies to analyse binding sites (Lee et al., 2006), and Wang et al. (2006) have used convex hulls of protein backbones in neural network-based classification of protein structures. However, dissection of protein structures by recursively assigning convex hull numbers to residues, as we propose here, does not appear to have been previously reported.

This article has two parts. In the first part we analyse RCH as a new computable property of proteins. We compare the information content of RCH to that of residue solvent accessibility (SA), residue depth (RD) and exposure (Exp), and show that, although not totally unrelated, these properties are indeed complementary.

Hulls were iteratively identified: the residues on the current hull were assigned a hull number and removed from the point set, and this was repeated until all residues had been assigned a hull number. The mean RCH number in this dataset was 2.6 (SD 2.3). Assignment of RCH numbers to the 1050 chains took 52 min. We term this numbering of hulls, from the outermost inward, residue RCH. An alternative numbering scheme, from the innermost hull outward, termed RCHr, is given in the Supplementary Material (Section 2.1). The mean RCHr number in this dataset was 5.1 (SD 2.7). Assigning RCHr numbers to all chains took 58 min.

Calculation of residue solvent accessibility (SA): Solvent accessible surface values for each residue were extracted from the DSSP (Holm and Sander, 1993) file for each structure. These values were divided by the solvent accessible surface values for each amino acid as defined in Rost and Sander (1994) to obtain the relative solvent accessibility of each residue. The mean SA value in this dataset was 0.27 (SD 0.27).

Calculation of residue exposure (Exp): In this study, we characterize residue exposure as the distance of residues from the centroid of each chain (Ben-shimon and Eisenstein, 2005). The chain centroid was determined from the coordinates of the residues, and the Euclidean distance of each residue from this point was calculated to obtain the residue's exposure value. The mean Exp value in this dataset was 19.1 Å (SD 7.8). Determination of Exp values for the whole dataset took less than 2 minutes.

Calculation of residue depth (RD): Residue depth (RD) values were obtained from the DPX server (Pintar et al., 2003) using default settings. RD values were positively skewed, with a mean RD of 0.86 (SD 1.41).

Normalization: In Section 2.2, both unnormalized and normalized values are reported for characterization of the measures studied using box plots (Fig. 2), correlation coefficients (Table 1), structural conservation (Table 2), thermal stability (Table 3) and mutual information between class assignments (Table 4). The value for each residue was divided by the maximum value for that measure in the corresponding chain to obtain the normalized value. Histograms of unnormalized and normalized measures are shown in the Supplementary Materials (Figures 5 and 6). After normalization, RCH and RCHr are symmetric.

Fig. 2. Box and whisker plots of RD against RCH for 257 560 residues from 1050 proteins. Black dots indicate median values. Values were normalized and rounded to one decimal place.

Table 1. Correlation coefficients between measures studied

         SA     RD     Exp    RCH    RCHr
SA       1.00   0.51   0.39   0.62   0.41
  Norm.  1.00   0.50   0.55   0.68   0.68
RD              1.00   0.26   0.43   0.30
  Norm.         1.00   0.34   0.48   0.48
Exp                    1.00   0.41   0.85
  Norm.                1.00   0.81   0.81
RCH                           1.00   0.42
  Norm.                       1.00   1.00
RCHr                                 1.00
  Norm.                              1.00

Norm. indicates coefficients based on normalized measures.

Table 2. Conservation of measures

         RD     Exp    RCH    RCHr   SA
         0.37   0.38   0.46   0.48   0.52
Norm.    0.37   0.46   0.55   0.55   0.50

Correlation of the measures studied between aligned residues in related structures. Norm. indicates coefficients based on normalized measures.

Table 3. Correlation of structural features with thermal stability

         RD     Exp    RCH    RCHr   ASA
         0.22   0.29   0.38   0.29   0.34
Norm.    0.20   0.44   0.35   0.35   0.37

Correlation of the measures studied with changes in thermal stability of mutant proteins. Norm. indicates coefficients based on normalized measures.

Table 4. Pairwise mutual information

         SA     RD     Exp    RCHr   RCH
SA       1.00   0.21   0.06   0.08   0.26
  Norm.  1.00   0.21   0.12   0.26   0.26
RD              0.91   0.04   0.05   0.14
  Norm.         0.91   0.06   0.14   0.14
Exp                    1.00   0.38   0.07
  Norm.                1.00   0.29   0.29
RCHr                          0.99   0.08
  Norm.                       1.00   1.00
RCH                                  0.99
  Norm.                              1.00

MI between two-class (Q2) assignments for pairs of measures. Norm. indicates MI for class assignments based on normalized measures.

2.2 Comparison between RCH and other measures of residue location

BoxPlots: Figure 2 plots RD versus RCH for each residue in the dataset using the statistically robust box and whisker technique. Boxes cover 50% of the data points and whiskers extend to 1.5 times the interquartile range, with outliers plotted as blue dots and median values indicated with black dots. Median values for RD are positively correlated with RCH, yet RCH makes finer distinctions between degrees of burial and exposure. Further box plots for these measures are available in the Supplementary Materials (Fig. 3).

Correlation coefficients: Pairs of measures that have a low correlation coefficient are likely to be unrelated and potentially provide complementary information for PSP. Table 1 shows the Pearson correlation coefficients between the measures studied. RD has low correlation with the other measures. RCH is most highly anti-correlated with SA (0.62) and has a higher correlation with SA and Exp than RD does. RCH is not highly correlated with RD, suggesting that these are distinct characterizations of residue location. RCH appears to be the measure that correlates most closely with many of the other measures. Hence, we would like to determine whether it is relatively more learnable than these other measures.

Conservation of RCH: For related proteins, aligned residues are potentially conserved even in the absence of strong sequence homology. Measures that have relatively high correlation for aligned residue pairs potentially reflect conserved aspects of protein structure. We, therefore, assess to what degree these measures are correlated between aligned residues in pairs of superimposed structures from a range of folds. Following Hamelryck (2005), the conservation of RCH and the other

Mutual information: The degree to which the classes assigned to residues using these measures are mutually informative was assessed using mutual information (MI) (Cover and Thomas, 2006). For discrete data, MI is defined as

    I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}    (1)

where p(x) and p(y) are the probabilities of x and y occurring in the dataset, and p(x,y) is the probability of the combination of x and y occurring together in the dataset. MI is used here to measure the quantity of information that one measure (e.g. SA) tells us about another (e.g. RCH).
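The following is a minimal sketch of Equation (1) applied to two discrete class assignments over the same residues (for instance, the Q2 classes of SA and RCH). Names are illustrative, and a base-2 logarithm is assumed here, consistent with a self-MI of 1.00 for a well-balanced two-class assignment as reported in Table 4.

```python
# Minimal sketch of Equation (1) for two discrete class assignments.
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """I(X;Y) = sum_x sum_y p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    n = len(xs)
    assert n == len(ys) and n > 0
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c_xy in count_xy.items():
        pxy = c_xy / n
        px = count_x[x] / n
        py = count_y[y] / n
        mi += pxy * log2(pxy / (px * py))
    return mi


# Identical balanced labels give MI = 1.0; independent labels give 0.0.
print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))  # 1.0
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```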
Table 4 shows the MI between pairs of measures for all 257 560 residues studied. When the MI between the class assignments for a pair of measures is high they represent closely related problems (the MI between a measure and itself is maximal, and is 1.00 if the classes assigned to the measure are well balanced). SA shares 0,26 MI with RCH whilst Exp shares 0.38 MI with RCHr and all other pairwise MI values are less than 0.10. This indicates that the RCH class of residues provides information distinct to SA, RD and Exp class information. MI for Q3 and Q5 class assignments is given in the Supplementary Materials (Table 3) along with a detailed pairwise examination of the Q5 class assignments for SA versus RCH, and RCHr versus Exp, where increased levels of MI were observed (Supplementary Materials, Tables Fig. 3. Space filling C atom models of proteins coloured by RCH and 4 and 6) along with RD versus RCH (and in Table 5). Frequent RD. ‘Core’ residues are coloured red/yellow and ‘surface’ residues blue/ differences in class assignments are observed for measures with greater green (rendered using RasMol). than 0.20 MI. To further highlight the distinction between RD and RCH, measures was calculated for 15 621 aligned residues (BLAST E-value visualisations of two space filling C atom models of protein structures 4¼ 1.0) in 218 pairs of structures from the SABmark version 1.63 are shown in Figure 3. The values for each measure were normalized and Twilight Zone database (Van Walle et al., 2005). This dataset comprises the colour assigned, in both measures, to indicate values from ‘exposed’ pairs of superimposed structures covering 236 folds. These pairs are (blue) to ‘buried’ (red). These models provide visual confirmation that structurally similar, yet are without probable common evolutionary residue RCH assignments are distinct to those for RD. Further examples origin, effectively, a hard dataset to predict. Table 2 reports the are available in the Supplementary Materials (Figs. 1 and 2). correlation coefficients for both unnormalized and normalized measures. RCH and RCHr have higher conservation correlation coefficients than 3 LEARNABILITY OF RCH AND OTHER RD, Exp and SA indicating that, for such aligned residues, RCH is more MEASURES highly correlated with structurally conserved locations than RD, Exp and (after normalization) SA. As we used C coordinates, values for RD Having demonstrated that residue RCH is a new and distinct and SA are around 0.1 lower than those previously reported (Hamelryck, characterization of residue location, we turn to the predict- 2005). ability of these measures and assess, in practical terms, which Relationship of RCH to changes in thermal stability of mutant proteins: of these characterizations of residue location is easier to learn. Changes in thermal stability of proteins after mutations of core Hence, potentially more useful for PSP. hydrophobic residues (that potentially lead to cavity formation) has been correlated with changes in SA and residue depth [for references see 3.1 Prediction experiments Hamelryck (2005)]. For such residues, measures that correlate relatively Inputs to predictions: For each measure (RCH, RCHr, RD, high with changes in the proteins thermal stability reflect structurally SA and Exp) predictions were made using six types of input important features. 
We, therefore, assess for these residues the degree to information and three levels of precision: two, three and five which these measures are correlated with changes in protein thermal class partitions (Q2, Q3 and Q5). Table 5 summarizes the six stability. The correlation of these measures of residue location (both different types of input information used for predictions of the normalized and unnormalized) with changes in the thermal stability (G in kcal/mol) of 91 Ile/Leu/Val to Ala point mutations was measures studied. Combinations of both local (neighbourhood measured. Sixteen protein structures from the Protherm database (Bava of the target in the chain) and global (protein-wise) information et al., 2004; Gromiha et al., 1999; Kumar et al., 2006) were employed, were used. A window of four residues either side of the target again following the approach of Hamelryck (2005). The correlation residue has been shown to lead to high CN predictive accuracy coefficients for RD, SA, Exp, RCH, RCHr and ASA (related to the using LCS (Bacardit et al., 2006) and was used in this study also to change in accessible surface upon folding) are shown in Table 3. RD facilitate comparison of results. For each representation (RCH, values were similar to those previously reported. RCH is more highly SA etc.) these inputs were labeled 1–6 in the rest of this article, correlated with changes in thermal stability upon mutation than the e.g. RCH-3 denotes RCH predicted using input dataset 3. other measures. Exp and ASA showed higher correlation when For each measure a total of 18 datasets was evaluated (six sets the data was normalized. RD showed the lowest correlation of of input attributes each at three levels of class assignment). A the measures studied. This data indicates that (unnormalized) RCH is detailed description of these inputs appears in Stout et al. (2007). correlated more strongly with residues in the hydrophobic core (that are In order to determine the degree to which RCH and RCHr vary related to structural stability) than are the other measures. in their learnability and capture properties of protein structures, Mutual information: The degree to which the classes assigned to residues using these measures are mutually informative was assessed in what follows we use their unnormalized versions. 919 M.Stout et al. Table 5. Datasets such as SA, a class boundary that leads to more balanced classes is traditionally chosen, e.g. for SA a cut point of 25% is widely used. We apply class balancing for all measures and levels of Scope Input information Dataset discretization (Q2, Q3 and Q5) in this study, adopting a uniform frequency classification procedure. For our data, balanced classes for SA were obtained using, e.g. a cut point of 18%. Class boundaries were determined individually for each training/ Local AA Types in window of target4 residues test set pair using the corresponding training fold. Details of the Pred. secondary cut points used are given in the Supplementary Table 1. structure of target Definition of the training and tests sets: Datasets were divided randomly into 10 training and test set pairs (950 Chain length chains for training and 100 for testing), using bootstrap Global Residue frequencies (Kohavi, 1995). We have placed a copy of the datasets at Pred. 
average of www.infobiotic.net target measure Performance measures: Different protein chains have differ- ent lengths and it is prediction accuracy on chains that is For each dataset (1–6) the input information type included in that dataset is typically reported (Kinjo et al., 2005; Jones, 1999). Therefore, indicated by . The two types of local (target and its closest neighbours) and three types of global (proteinwise) input information were investigated are shown. prediction accuracies for each chain were averaged to obtain the protein-wise accuracy reported here. Table 6. Summary of the highest predictive accuracies for each measure Machine-learning methods: We use four different machine- studied, in descending order of accuracy learning methods. The first two are popular machine-learning systems taken from the WEKA package (Witten and Frank, 2005): C4.5 (Quinlan, 1992), a decision tree rule induction Alg C4.5 BioHEL GAssist Naive Bayes system, Naive Bayes (John and Langley, 1995), a Bayesian- a a RCHr 79.8  1.5 78.5  1.5 78.4 1.5 77.9 1.7 learning algorithm. Learning systems belonging to the Learning a a RCH 77.3  1.0 75.9  1.2 75.7 1.1 76.1 1.1 Classifier Systems (LCS) (Holland and Reitman, 1978) class of a a RD 76.0  0.4 75.3  0.3 75.2 0.3 75.1 0.4 ML techniques were also studied. These systems are rule-based a a Exp 73.9  1.4 72.8  1.6 72.5 1.4 73.4 1.3 machine-learning systems that employ evolutionary computa- a a a SA 73.3  0.3 72.2  0.4 72.2 0.4 72.3 0.4 tion (Holland, 1975) as the search mechanism. Two LCS methods have been employed: GAssist (Bacardit, 2004) and Mean  SD for 10 fold cross validated predictions based on the input datasets a BioHEL (Bacardit et al., 2007) that implement different rule that gave the best results for each measure: namely type 4 or 6 (indicated by ). induction paradigms. A detailed description of both systems is included in the Supplementary Material (Section 3.1). Predicted secondary structure information of the target Analysis of results: For each experiment, the mean prediction residue was obtained using the PSI-PRED predictor (Jones, accuracy (as defined previously in this section) over the test sets 1999). This consists the secondary structure type (helix, strand or is reported. Student t-tests were applied to the 10 results from coil) and a confidence level of the prediction (ranging from 0 to 9). each experiment to determine the best method for each dataset For each measure, the average value of that measure was at a confidence level of 95%. Standard deviations and any determined for each chain and 10 pairs of training and test folds significant differences are indicated in each table. The were prepared. For each instance, the inputs were the chain conservative Bonferroni correction (Miller, 1981) for multiple length (one integer value) and amino acid composition of each pair-wise comparisons was applied. In addition, the contribu- chain (20 real values), and the target class was the measured tions of global input information were assessed as follows: average value for the particular measure (partitioned into 10 for each learning system and precision (Q2, Q3 and Q5), the classes using a uniform frequency cut-point strategy). Cut- maximum of (Dataset4, Dataset6)-Dataset2 was computed. As points were determined separately for each training fold a base for the performance gap, the dataset with predSS was and used to assign classes to the values in the corresponding used, because in certain situations the Dataset1 performed training and test folds. 
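To make the class-balancing step concrete, the following is a minimal sketch, under assumed names, of uniform-frequency discretization in which quantile cut points are taken from the training fold only and then applied to both the training and test folds.

```python
# Minimal sketch of uniform-frequency (class-balancing) discretization:
# cut points come from the training fold only and are reused on the test fold.
import numpy as np


def uniform_frequency_cutpoints(train_values, n_classes):
    """Quantile cut points splitting the training values into n_classes
    bins of (approximately) equal size, e.g. n_classes = 2, 3 or 5."""
    qs = np.linspace(0.0, 1.0, n_classes + 1)[1:-1]   # interior quantiles only
    return np.quantile(train_values, qs)


def assign_classes(values, cutpoints):
    """Map raw values to integer class labels 0 .. n_classes - 1."""
    return np.searchsorted(cutpoints, values, side="right")


# Example: a Q2 split of relative SA. The resulting cut point plays the role
# of the ~18% threshold mentioned in the text; its exact value depends on the
# training fold, and the numbers below are illustrative only.
train_sa = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.60])
cuts = uniform_frequency_cutpoints(train_sa, 2)
test_sa = np.array([0.02, 0.25])
print(cuts, assign_classes(test_sa, cuts))
```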
The predicted average value of the poorly, distorting the comparisons. Finally, the contribution of measure under consideration (termed PredAveRCH, predicted secondary structure was also assessed as follows: for PredAveRD, etc.) was predicted using the GAssist LCS (details each learning system and number of states the value of the below) prior to preparation of the datasets for the full measure maximum (Dataset2-Dataset1, Dataset4-Dataset3, Dataset6- predictions. Nine hundred fifty instances (chains) were used Dataset5) was determined. for training and 100 instances for testing. Ten iterations were performed for each prediction using different random 3.2 Prediction results number seeds and the 10 rule sets generated were combined as an ensemble using a majority vote to predict the measure. For each measure studied, Table 6 summarizes the best Q2 Class assignments: In order to predict measures using predictive accuracy (in descending order) for each measure classification techniques, the calculated values for each measure (using the best possible input dataset in each case). Detailed were partitioned into two, three and five classes (bins) here results for the predictions are given in the Supplementary termed Q2, Q3 and Q5, respectively. For imbalanced measures, Materials (Tables 7 –10). Predictive accuracy was higher on 920 Residue RCH prediction Table 7. Coordination number prediction (by BioHEL) using amino- the two RCH based representations than on the SA, RD or Exp acid sequence and various combinations of the predicted measures representation. The predictive accuracies for RCHr being statistically significantly higher than those for the other measures (P-value ¼ 0.5). Dataset Proteinwise acc. For all representations, higher predictive accuracies were seen when fewer classes were predicted (lower precision—Q2). CN1 77.2  0.8 Q5 predictive accuracy for RCH was between 30% and 40%, CN1 þ RD 77.4  0.8 Q3 was approximately 20% higher, between 55% and 60% CN1 þ RCHr 77.6  0.7 CN1 þ Exp 77.7  0.8 whilst for Q2 prediction accuracies exceeded 77%. The LCS’s CN1 þ Expþ RCHr 77.7  0.7 performed best on the RCHr representation when using input CN1 þ RCH 78.5  0.9 dataset RCHr-4. This dataset combines local information CN1 þ RCH þ RCHr 78.8  0.7 (a window of residues around the target and its predicted CN1 þ Expþ RCH 78.9  0.8 secondary structure) with global chain information (chain a CN1 þ SA 78.9  0.8 length and chain residue composition). The more compact CN1 þ Expþ RCHþ RCHr 78.9  0.7 a,b RCHr-6 was frequently the most learnable dataset for C4.5 and CN1 þ Expþ SA 79.1  0.8 a,b Naive Bayes. This dataset comprises local information (window CN1 þ Expþ SA þ RCHr 79.1  0.8 a,b and predicted secondary structure) and global information CN1 þ SA þ RCHr 79.1  0.8 a,b CN1 þ SA þ RCH 79.7  0.8 (predicted average RCHr of the chain). a,b CN1 þ Expþ SA þ RCH 79.8  0.8 a,b CN1 þ SA þ RCH þ RCHr 79.8  0.8 a,b 3.3 Predicted RCH improves CN prediction CN1 þ Expþ SA þ RCH þ RCHr 79.8  0.7 Finally, we assess the utility of predicted RCH as an input indicates input information that leads to statistically significant increases in to prediction of other aspects of protein structure, specifically predictive accuracy compared to the baseline CN1 inputs. The group of best coordination number (CN). For each of the measures studied, performing methods all have statistically similar performance. 
the Q5 predictions (using input dataset 4) made by BioHEL (which was, in general, the best performing method) are fed measures are mostly redundant (consistent with the observed back into prediction of CN (Bacardit et al., 2006). The CN of increase in MI between these two measures, Table 4). On the a residue is a count of the number of other residues from the other hand, combining SA and RCH resulted in much better chain that are located within a certain threshold distance. accuracy, showing that these measures complement each other. Specifically, we have used the Kinjo et al. (2005) definition of The best combinations of measures (ExpþSAþRCH, SAþ CN. We predict whether the CN of a residue is above or below RCHþRCHr and ExpþSAþ RCHþRCHr) lead to a perfor- the midpoint of the CN domain, using as input information mance increase of 2.6% over the baseline CN1 dataset. the AA type of a window of 4 residues around the target (equivalent to the first set of input attributes used to predict the other features), CN1. 4 DISCUSSION The contribution of SA, Exp, RCH and RCHr to CN 4.1 Prediction results prediction (individually and in combination with one another) was evaluated by extending the CN1 dataset with 16 combina- The results show some general trends across all measures tions of input attributes that correspond to all combinations studied and all learning methods. Predictive accuracy is of these measures. Using predicted RD as input gave the lowest increased when richer input information is employed. improvement (0.2%) over the CN1 (local window) input alone Inclusion of local information in the form of predicted and was, therefore, not included in predictions made with secondary structure typically leads to an increase in Q2 combinations of inputs. Table 7 shows the results of these predictive accuracy of 2–3% on most datasets for the learning experiments. As a baseline, the performance of the original systems used, whilst using global protein information (chain CN1 is included. The table has been sorted by accuracy which length and composition) can boost Q2 predictive accuracy by helps to identify the combinations of predicted measures that more than 10% (in the case of RCH and RCHr). The type 3 give the biggest performance boost. and 4 datasets, containing 21 real-valued global protein The results of these experiments were analyzed using paired attributes, in particular presents a considerably larger search t-test with 95% confidence level and the Bonferroni correction. space, and the mixture of real-valued and nominal attributes Two types of results were identified: the datasets in which makes the learning problem more difficult than for purely BioHEL performed significantly better than when learning nominal knowledge representations. The type 5 and 6 datasets from CN1 (marked with a ) and the (statistically indistinguish- use less attributes than the type 3 and 4 datasets enabling the able) group of datasets that resulted in the highest predictive LCS’s to generate rule sets that are more easily interpreted. The accuracies are indicated ( ). RCHr representation was the most predictable measure, this There are two groups of measures: those that only provide a resulting from use of input dataset 4 (chain composition and small performance boost over CN1 (RD, Exp and RCHr), and length information) and the BioHEL LCS. C4.5 performed best others that provide a larger boost (SA and RCH). 
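For illustration, the following is a minimal sketch of a residue contact-number calculation and the two-state target described above. It uses a sharp distance cutoff and should be read as a simplification of the Kinjo et al. (2005) definition used in the article; the 10 Å cutoff and all names are assumptions made for the example.

```python
# Minimal sketch of a contact-number (CN) calculation and its two-state class.
import numpy as np


def contact_numbers(coords, cutoff=10.0):
    """CN of each residue: how many other residues lie within `cutoff` of it."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    within = dist < cutoff
    np.fill_diagonal(within, False)          # a residue is not its own contact
    return within.sum(axis=1)


def two_state_cn_class(cn):
    """1 if CN is above the midpoint of the observed CN range, else 0."""
    midpoint = (cn.min() + cn.max()) / 2.0
    return (cn > midpoint).astype(int)
```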
Furthermore, when using datasets 4 and 6, benefiting from predicted local combining Exp and RCHr (together and with other measures) and global information. Naive Bayes generally performed best only marginally improves the performance, indicating that these when using the more compact dataset 6. 921 M.Stout et al. There were interesting differences between these representa- (4) If PredSS 2 {E} and PredAveAtt44.5 and Res 2 = {C, tions of residue location; RCH was easier to learn compared K, x} and Res 2 = R and Res 2 = {K, R} and Res 2 = 3 2 1 {D, Q, S, x} and Res2 = {E, F, K, M, R, T} and Res 2 = to other representations because this representation correctly þ1 {F, L, x} and Res 2 = {K} and Res 2 = {C} and assigned classes to residues for anisotropic (elongated) struc- þ2 þ3 Res 2 = {x} ! class is 1 tures. For such structures, the Exp representation may assign þ4 .... some surface residues to the buried class and some buried Default class is 0 residues to the exposed class. Similarly, RCH was more predictable than SA; assignment of residues was deep within The rule set structure is a decision list, rules are tested in order structures, but on the surface of cavities (high solvent starting with the first and rule evaluation continues until a match accessibility) to the exposed class may have lowered the is found or the default (last) rule is reached. This rule set predictability of the SA representation. comprised 15 rules, the first four rules are shown (the full rule set RD was more predictable than SA and Exp but was not as is shown in the Supplementary Materials, Section 4.3). In the first easily predicted as RCH and RCHr. As SA emphasises buried rule, the confidence level of the secondary structure prediction is tested then the predicted average value for the property under residues, RD emphasises exposed residues. The highly imbal- consideration (predicted average RCH 48.64). Subsequent anced nature of the RD measure (Supplementary Materials, predicates test the amino acid types of residues surrounding the Figs. 5 and 6) leads to imbalanced class assignments even when target in the sequence. AA M means the amino acid type of the using a uniform frequency cut-point strategy (Supplementary residue at position  M in respect to the target residue. Amino Materials, Table 2). This imbalance is likely to have made this acids are represented by their one letter code, plus the symbol x measure relatively more difficult to learn. Furthermore, when fed back as inputs for prediction of CN, RD in particular provided representing positions after the start/end of the chain, for cases little additional information over the CN1 inputs resulting in where the window of neighbouring residues overlaps the only marginal improvements in CN prediction. Prediction beginning or the end of the chain. For positions relative to the target residue (e.g. Res ) these predicates restrict the amino acid accuracy for all measures is likely to be boosted by including 2 types for the residue at that position (membership or non- additional (e.g. non-local) input information. For example, membership of a set). In subsequent rules, the predicted 79.3% 10-fold cross validated accuracy has been reported for Secondary Structure class (PredSS) of the target residue; either two-state SA prediction at a 25% cut point using an integrated Helix (H), Sheet (E) or Coil (C) is tested. 
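To illustrate how such a rule set is applied, the following is a minimal sketch of decision-list evaluation: the first matching rule decides the class, otherwise the default class is returned. The rule encoding and the single example rule are simplified illustrations in the spirit of the rules shown above, not the actual GAssist representation.

```python
# Minimal sketch of decision-list evaluation (first matching rule wins).
def evaluate_decision_list(rules, default_class, instance):
    """rules: list of (predicate, class) pairs; predicate is a callable taking
    the instance (here a dict of attributes) and returning True or False."""
    for predicate, cls in rules:
        if predicate(instance):
            return cls
    return default_class


# Illustrative rule: predict the above-average RCH class (1) when the target
# is predicted to be in a strand and the residue at position -1 is not K or R.
rules = [
    (lambda r: r["PredSS"] == "E" and r["Res-1"] not in {"K", "R"}, 1),
]
print(evaluate_decision_list(rules, 0, {"PredSS": "E", "Res-1": "L"}))  # 1
print(evaluate_decision_list(rules, 0, {"PredSS": "C", "Res-1": "L"}))  # 0
```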
In many cases, the target system of neural networks (Dor and Zhou, 2007) with position residue type is largely restricted to hydrophobic residues that are specific scoring matrices and a range of other input data. often found buried in the protein core and, therefore, have higher RCH numbers. The LCS has correctly identified this, predicting 4.2 ‘White box’ prediction: interpretable analysis and the above average RCH class (1) for these residues. performance considerations The accuracy, simplicity and interpretability of the rule sets generated by the LCS’s must, however, be balanced against the Understanding the basis on which a prediction is made may be computational expense needed to generate them. Run times for more valuable than making relatively accurate predictions in the learning phase of the BioHEL algorithm ranged from 3 min a blind manner. Decision trees can be interpreted, however, on the smaller input datasets to almost 7 h on the larger ones. In on these problems C4.5, produced pruned trees that ranged in contrast, the worst case for C4.5 was 32 min and for Naive Bayes mean size from 1068 (SD ¼ 331) to 138 977 (SD ¼ 2735) nodes was less than 1 min. Moreover, for each problem set, the LCS limiting their explanatory value. Naive Bayes, on the other hand, algorithms, BioHEL and GAssist, were run multiple times to generates compact solutions, however, these are probabilistic produce 10 classifiers for input to the ensemble procedure. Once models that have no immediate physico-chemical interpretation. trained, however, run times for the resulting classifiers on the In contrast, it is much easier to relate the compact rule sets entire test set was around two minutes for C4.5 and Naive Bayes evolved by the LCS algorithms to the underlying physical and but less then 1 min for the LCS’s, an indication, for instance, of chemical properties of proteins. The following is an example of a how these LCS evolved classifiers would perform as part of a rule set evolved by the GAssist LCS that produced 71.2% two- prediction web server. state predictive accuracy on the RCH-6 dataset. (1) If PredSSConf57.5 and PredAveAtt48.64 and 5 CONCLUSIONS Res 2 = {K,x} and Res 2 = x and Res2 = {D,E,K,N,Q} and 4 2 In this article, a new measure of residue location in folded Res 2 = {x} and Res 2 = {K} and Res 2 = {x} ! class is 1 þ1 þ2 þ4 protein chains, the RCH, was introduced. RCH is a param- (2) If PredSS2 = {C} and PredAveAtt44.8 and Res 2 = {E, Q} eterless, simple to compute and mathematically rigorous and Res 2 = {D, T} and Res 2 = E and Res 2 = {D, P,V} 3 2 1 method that situates residues in layers within protein structures. and Res2 = {D,E,H,K,N,P,Q,R,T} and Res 2 = {x} and þ1 We show that RCH is distinct to other widely studied measures Res 2 = {H} and Res 2 = {K} and Res 2 = {E,K,Q,R,x} ! þ2 þ3 þ4 of residue location and that RCH distinguishes a range class is 1 of degrees of residue burial/exposure, correlates better with (3) If PredSS2 = {C} and PredAveAtt44.8 and Res 2 = {E, R} residue conservation and changes in protein stability under and Res 2 = {D, E} and Res 2 = {P, W} and Res2 = mutation than measures such as solvent accessibility, residue 3 1 {A,D,E,K,N,P,Q,R} and Res 2 = {W, x} and Res 2 = {V, depth or residue distance from chain centroid. Further, we þ1 þ2 W} and Res 2 = {K, R} and Res 2 = {K, x} ! class is 1 assess the predictability of these measures using three types of þ3 þ4 922 Residue RCH prediction Holmes,J.B. and Tsai,J. 
(2005) Characterizing conserved structural contacts by ML technique: decision trees (C4.5), Naive Bayes and LCS, pair-wise relative contacts and relative packing groups. J. Mol. Biol., 354, employing a range of predictive inputs. We show that an LCS 706–721. that employs iterative rule learning, BioHEL, predicts RCH at Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of 77.3, 60.6 and 39.0% accuracy for Q2, Q3 and Q5, respectively. distance matrices. J. Mol. Biol., 233, 123–138. John,G.H. and Langley,P. (1995) Estimating continuous distributions in Bayesia We present examples of the competent yet simple and classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial interpretable LCS classification rules, showing how they Intelligence. Morgan Kaufmann Publishers, San Mateo, pp. 338–345. relate to the underlying physical and chemical properties of Jones,D.T. (1999) Protein secondary structure prediction based on position- the residues. As an exemplar application of predicted RCH specific scoring matrices. J. Mol. Biol., 292, 195–202. class (in combination with other measures), we show that Kawabata,T. and Go,N. (2007) Detection of pockets on protein surfaces using small and large probe spheres to find putative ligand binding sites. Proteins, prediction of contact number can be improved up to 2.6%. 68, 516–529. Kinjo,A.R. et al. (2005) Predicting absolute contact numbers of native protein ACKNOWLEDGEMENTS structure from amino acid sequence. Proteins, 58, 158–165. Kohavi,R. (1995) A study of cross-validation and bootstrap for accuracy We acknowledge the support of the UK Engineering and Physical estimation and model selection. In Mellish,C.S. (ed.) Proceedings of the Sciences Research Council (EPSRC) under grant GR/ T07534/01. Fourteenth International Joint Conference on Artificial Intelligence. Morgan We are grateful for the use of the University of Nottingham’s Kaufmann, San Mateo, pp. 1137–1145. High-Performance Computer. We are grateful to the anonymous Kumar,M.D. et al. (2006) Protherm and pronit: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucl. Acids Res., 34 (Database reviewers whose insightful comments have helped to improve issue), D204–D206. this manuscript. Lee,B. and Richards,F.M. (1971) The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol., 55, 379–400. Conflict of Interest: none declared. Lee,M. et al. (2006) Shapes of antibody binding sites: qualitative and quantitative analyses based on a geomorphic classification scheme. J Org Chem, 71, 5082–5092. REFERENCES Liang,J. and Dill,K.A. (2001) Are proteins well-packed? Biophys. J., 81, 751–766. Bacardit,J. et al. (2004) Coordination number predication using learning classifier Lin,T.H. and Lin,J.J. (2001) Three-dimensional quantitative structure-activity systems: Performance and interpretability In Proceedings of the 8th Anuual relationship for several bioactive peptides searched by a convex hull- Conference on Genetic and Evolutionary Computation (GECCO 06). ACM comparative molecular field analysis approach. Comput. Chem., 25, 489–498. Press, New York, pp. 247–254. Lin,T.H. et al. (1999) A comparative molecular field analysis study on several Bacardit,J. et al. (2007) Automated alphabet reduction method with evolutionary bioactive peptides using the alignment rules derived from identification of algorithms for protein structure prediction. GECCO ’07: Proceedings of the commonly exposed groups. 
Biochim Biophys Acta, 1429, 476–485. 9th annual conference on Genetic and evolutionary computation. Acm Press, Liu,S. et al. (2007) Fold recognition by concurrent use of solvent accessibility and London, Vol 1. pp. 346–353. residue depth. Proteins, 68, 636–645. Bacardit,J. (2004) Pittsburgh Genetics-Based Machine Learning in the Data Meier,R. et al. (1995) Segmentation of molecular surfaces based on their mining era: Representations, generalization, and run-time. PhD Thesis. convex hull. ICIP 95: Proceedings of the 1995 International Conference on Ramon Llull University, Barcelona. Catalonia Spain. Image Processing. IEEE Computer Society, Washington DC, Vol. 3, Badel-chagnon,A. et al. (1994) ‘‘Iso-depth contour map’’ of a molecular surface. pp. 552–555. J. Mol. Graph, 12, 162–168, 193. Miller,R.G. (1981) Simultaneous Statistical Inference (Springer Series in Baldi,P. and Pollastri,G. (2002) A machine-learning strategy for protein analysis. Statistics). Springer-Verlag Inc., New York. IEEE Intel. Sys., 17, 28–35. Noguchi,T. et al. (2001) Pdb-reprdb: a database of representative protein chains Barber,C.B. et al. (1996) The quickhull algorithm for convex hulls. ACM Trans. from the protein data bank (pdb). Nucl. Acids Res, 29, 219–220. Math. Software, 22, 469–483. Pintar,A. et al. (2003) Dpx: for the analysis of the protein core. Bioinformatics, 19, Bava,K.A. et al. (2004) Protherm, version 4.0: thermodynamic database for 313–314. proteins and mutants. Nucl. Acids Res., 32 (Database issue), D120–D121. Preparata,F.P. and Hong,S.J. (1977) Convex hulls of finite sets of points in two Ben-shimon,A. and Eisenstein,M. (2005) Looking at enzymes from the inside out: and three dimensions. Commun. ACM, 20, 87–93. the proximity of catalytic residues to the molecular centroid can be used for Quinlan,J.R. (1992) C4.5: Programs for Machine Learning. Morgan Kaufmann detection of active sites and enzyme-ligand interfaces. J. Mol. Biol., 351, Publishers, San Mateo, CA. 309–326. Rost,B. and Sander,C. (1994) Conservation and prediction of solvent accessibility Chakravarty,S. and Varadarajan,R. (1999) Residue depth: a novel parameter for in protein families. Proteins, 20, 216–226. the analysis of protein structure and stability. Structure, 7, 724–732. Sander,C. and Schneider,R. (1991) Database of homology-derived protein Chen,B.Y. et al. (2007) Cavity scaling: automated refinement of cavity-aware structures and the structural meaning of sequence alignment. Proteins, 9, 56–68. motifs in protein function prediction. J. Bioinform. Comput. Biol., 5, 353–382. Stout,M. et al. (2008) Prediction of topological contacts in proteins using learning Coleman,R.G. and Sharp,K.A. (2006) Travel depth, a new shape descriptor for classifier systems. Soft Computing, Special Issue on Evolutionary and macromolecules: application to ligand binding. J. Mol. Biol., 362, 441–458. Metaheuristic-based Data Mining (EMBDM), (in press). Cover,T. and Thomas,J.A. (2006) Elements of Information Theory (Wiley Series in Van Walle,I. et al. (2005) Sabmark–a benchmark for sequence alignment that Telecommunications and Signal Processing). Wiley-Interscience, Hoboken, NJ. covers the entire known fold space. Bioinformatics, 21, 1267–1268. Dor,O. and Zhou,Y. (2007) Achieving 80% ten-fold cross-validated accuracy for Vlahovicek,K. et al. (2005) Cx, dpx and pride: Www servers for the analysis and secondary structure prediction by large-scale training. Proteins, 66, 838–845. comparison of protein 3d structures. Nucl. 
Acids Res., 33 (Web Server issue), W252–W254.
Eidhammer,I. et al. (2003) Protein Bioinformatics: An Algorithmic Approach to Sequence and Structure Analysis. J. Wiley & Sons Ltd., Chichester, New York.
Gianese,G. and Pascarella,S. (2006) A consensus procedure improving solvent accessibility prediction. J. Comput. Chem., 27, 621–626.
Gromiha,M.M. et al. (1999) Protherm: thermodynamic database for proteins and mutants. Nucl. Acids Res., 27, 286–288.
Hamelryck,T. (2005) An amino acid has two sides: a new 2d measure provides a different view of solvent exposure. Proteins, 59, 38–48.
Holland,J.H. (1975) Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. MIT Press, Cambridge, MA, pp. 313–329.
Wang,Y. et al. (2006) Automatic classification of protein structures based on convex hull representation by integrated neural network. Theory and Applications of Models of Computation, Third International Conference, TAMC 2006, Beijing, China, May 15–20, 2006. Lecture Notes in Computer Science, Vol. 3959. Springer, pp. 505–514.
Witten,I.H. and Frank,E. (2005) Data Mining: Practical Machine Learning Tools and Techniques. 2nd edn. (The Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann, Amsterdam, Boston, MA.
Wood,M.J. and Hirst,J.D. (2005) Protein secondary structure prediction with dihedral angles. Proteins, 59, 476–481.

Prediction of recursive convex hull class assignments for protein residues

Loading next page...
 
/lp/oxford-university-press/prediction-of-recursive-convex-hull-class-assignments-for-protein-23072rN3lE

References (55)

Publisher
Oxford University Press
Copyright
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btn050
pmid
18252738
Publisher site
See Article on Publisher Site

Abstract

Vol. 24 no. 7 2008, pages 916–923 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btn050 Structural bioinformatics Prediction of recursive convex hull class assignments for protein residues 1 1,2 3 1, Michael Stout , Jaume Bacardit , Jonathan D. Hirst and Natalio Krasnogor 1 2 Automated Scheduling, Optimization and Planning research group, School of Computer Science, Multi-disciplinary Centre for Integrative Biology, School of Biosciences and School of Chemistry, University of Nottingham, UK Received on November 6, 2007; revised on January 28, 2008; accepted on January 30, 2008 Advance Access publication February 5, 2008 Associate Editor: Burkhard Rost ABSTRACT de novo methods (Baldi and Pollastri, 2002). Whilst classifying residue neighbourhood density as high or low will generally Motivation: We introduce a new method for designating the location assign the high class to residues buried within the structure and of residues in folded protein structures based on the recursive the low class to residues exposed on the surface, residues lining convex hull (RCH) of a point set of atomic coordinates. The RCH can cavities in the structure that may be functionally significant be calculated with an efficient and parameterless algorithm. (Chen et al., 2007) can have a low coordination number even Results: We show that residue RCH class contains information when located far from the surface. Incorporation of comple- complementary to widely studied measures such as solvent accessibility (SA), residue depth (RD) and to the distance of residues mentary residue solvent accessibility and residue depth from the centroid of the chain, the residues’ exposure (Exp). RCH is information improves fold recognition (Liu et al., 2007). more conserved for related structures across folds and correlates A range of measures of residue location have been studied. better with changes in thermal stability of mutants than the other Lee and Richards (1971) used a spherical probe method to measures. Further, we assess the predictability of these measures measure the solvent accessible surface of residues and recently using three types of machine-learning technique: decision trees Kawabata and Go (2007) have used adjustable probe param- (C4.5), Naive Bayes and Learning Classifier Systems (LCS) showing eters to identify putative ligand binding pockets on protein that RCH is more easily predicted than the other measures. As an surfaces. Solvent accessibility, however, is difficult to compute exemplar application of predicted RCH class (in combination with and does not distinguish between residues below the surface. other measures), we show that RCH is potentially helpful in Hence, atom/residue depth (RD), the distance of an atom/ improving prediction of residue contact numbers (CN). residue from its nearest solvent accessible neighbour, was Contact: nxk@cs.nott.ac.uk introduced (Chakravarty and Varadarajan, 1999) and efficient Supplementary Information: For Supplementary data please refer algorithms are available to compute RD for a given structure to Datasets: www.infobiotic.net/datasets, RCH Prediction Servers: (Pintar et al., 2003; Vlahovicek et al., 2005). Whilst SA www.infobiotic.net emphasises burial, RD emphasises exposure and depends on the method used to identify surface atoms/residues. Hence, Half Sphere Exposure (HSE), has been recently proposed 1 INTRODUCTION (Hamelryck, 2005). 
HSE, like CN, counts neighbouring residues but distinguishes two regions (half spheres) around Prediction of the three-dimensional structure of proteins from each residue based on the C –C vector, i.e. a 2D measure of their constituent amino acid sequences continues to be one residue location. In addition, the distance (exposure) of residues of the key goals of structural biology and a wide range of predictive strategies has been investigated. Steady improve- from the chain centroid is a potentially interesting measure ments in predictive accuracy have resulted from decomposition being related to the location of catalytic residues in enzyme of the problem into subproblems, such as prediction of structures (Ben-shimon and Eisenstein, 2005). Measures of secondary structural elements [approaching a theoretical atom/residue location typically depend on specific parameters prediction limit of 80% (Dor and Zhou, 2007; Wood and such as probe size for SA or contact radius for CN. Hirst, 2005)], of residue coordination number [at over 80% In this paper, we introduce a new approach to stratifying (Bacardit et al., 2006)] and of residue solvent accessibility [at residues in protein structures by recursively identifying the over 77% using consensus predictors (Gianese and Pascarella, convex hull layer to which each residue belongs. The convex hull 2006)]. Burial of hydrophobic groups within the protein core of a set of points is a parameterless, mathematically rigorous is a primary driving force for protein structure formation. and unambiguous approach to identifying the points on the Characterizations of residue accessibility to solvent are, there- exterior of a point set, analogous to identifying those points that fore, important for protein structure prediction (PSP), poten- contact the enclosing surface when the point set is tightly tially helping to constrain the search space to be explored using wrapped. The convex hull is simple and efficient (O(n  log n)) to compute (Preparata and Hong, 1977). The recursive convex hull *To whom correspondence should be addressed. (RCH) of a point set is obtained by identification of the minimal 916  The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org Residue RCH prediction show that although not totally unrelated, these properties are indeed complementary. We show that RCH correlates better with structural conservation than the other measures of residue location and that RCH is also better correlated with changes in protein thermal stability in the presence of cavity forming mutations. We turn, in Part 2, to the question of how easy/ difficult it is, in practical terms, to learn to predict these measures. The relative predictability of RCH, RD, SA and Exp using four different machine-learning algorithms was assessed Fig. 1. Left: RCH of a 2D off-lattice protein model. The backbone is using six different, progressively richer, sets of input attributes represented by coloured circles joined by solid black lines. Residues on at three levels of precision. The relative benefits of using these the outermost RCH are coloured red, subsequent recursive convex hulls various inputs are described. C4.5 (Quinlan, 1992), Naive Bayes are coloured blue, green, and yellow, with residues on the innermost (John and Langley, 1995), GAssist (Bacardit, 2004) and recursive convex hull coloured purple. 
Convex hulls have found a wide range of applications in studies of molecular structure; here we give a brief, by no means complete, review. Badel-chagnon and colleagues introduced the notion of the 'molecular surface convex hull' to define the depth of any molecular surface point (Badel-chagnon et al., 1994), and Lin and colleagues used convex hulls to align 11 randomly generated bio-active tachykinin peptides, finding that 3D convex hulls can be used to align even these flexible structures (Lin et al., 1999; Lin and Lin, 2001). Meier et al. (1995) proposed a convex hull-based segmentation technique (one that makes few assumptions about the underlying surface) to find characteristically shaped regions of molecular surfaces for prediction of possible protein docking sites. Liang and Dill (2001) used convex hulls to define the boundaries of surface pockets and depressions in studies of packing densities in proteins. Holmes and Tsai tackled protein side-chain packing and interactions by measuring variation in convex hulls constructed around these groups (Holmes and Tsai, 2005). Coleman and Sharp (2006) introduced the notion of travel depth (the physical distance a solvent molecule would have to travel from a surface point to a suitably defined reference surface) using convex hulls of surface points. Recently, Lee and colleagues have employed 3D convex hulls around complementarity regions of antibodies to analyse binding sites (Lee et al., 2006), and Wang et al. (2006) have used convex hulls of protein backbones in neural network-based classification of protein structures. However, dissection of protein structures by recursively assigning convex hull numbers to residues, as we propose here, does not appear to have been previously reported.

This article has two parts. In the first part we analyse RCH as a new computable property of proteins. We compare the information content of RCH to that of residue solvent accessibility (SA), residue depth (RD) and exposure (Exp), and show that, although not totally unrelated, these properties are indeed complementary. We show that RCH correlates better with structural conservation than the other measures of residue location and that RCH is also better correlated with changes in protein thermal stability in the presence of cavity-forming mutations. We turn, in Part 2, to the question of how easy or difficult it is, in practical terms, to learn to predict these measures. The relative predictability of RCH, RD, SA and Exp using four different machine-learning algorithms was assessed using six different, progressively richer, sets of input attributes at three levels of precision, and the relative benefits of using these various inputs are described. C4.5 (Quinlan, 1992), Naive Bayes (John and Langley, 1995), GAssist (Bacardit, 2004) and BioHEL (Bacardit et al., 2007) are the machine-learning methods employed in this article. Finally, we demonstrate the usefulness of RCH by using the predicted RCH class of residues as input for prediction of residue coordination number (CN), showing that, in combination with predicted residue SA and Exp class, predicted RCH information increases predictive accuracy for CN.

2 MATERIALS AND METHODS

2.1 Datasets and features studied

Next, we describe the datasets and algorithms employed to assess the novelty of RCH and its relation to previously studied measures. All of the measures studied are based on atomic coordinates. Two polypeptides that have similar structures when represented using Cα coordinates may have distinct structures when represented using Cβ coordinates (Eidhammer et al., 2003). Throughout this article Cβ atom coordinates are used (Cα for glycyl residues), as these are sensitive to the orientation of side-chain atoms.

Protein dataset: The datasets used here are those described by Bacardit et al. (2006), originally proposed by Kinjo et al. (2005). Protein chains were selected from PDB-REPRDB [a non-redundant curated subset of the Protein Data Bank (PDB) (Noguchi et al., 2001), covering the space of possible folds] using the following criteria: less than 30% sequence identity, sequence length greater than 50 residues, no membrane proteins, no non-standard residues, no chain breaks, resolution better than 2 Å and a crystallographic R factor better than 20%. Chains that had no entry in the HSSP (Sander and Schneider, 1991) database were discarded. The final dataset contains 1050 protein chains (257,560 residues).

Identification of residue RCH: Convex hulls were identified from the residue Cβ atomic coordinates using the QHull package (Barber et al., 1996). Hulls were identified iteratively: the residues on the current hull were assigned a hull number and removed from the point set, and this was repeated until all residues had been assigned a hull number. The mean RCH number in this dataset was 2.6 (SD 2.3). Assignment of RCH numbers to the 1050 chains took 52 min. We term this numbering of hulls, from the outermost inward, residue RCH. An alternative numbering scheme, from the innermost hull outward, termed RCHr, is given in the Supplementary Material (Section 2.1). The mean RCHr number in this dataset was 5.1 (SD 2.7). Assigning RCHr numbers to all chains took 58 min.
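The hull-peeling procedure just described is straightforward to sketch. The following Python fragment is an illustration written for this text, not the pipeline used in the study: it assumes a chain's Cβ coordinates are already available as an (N, 3) NumPy array and uses scipy.spatial.ConvexHull (a wrapper around the Qhull library) in place of a direct call to the QHull package; the function name and the handling of degenerate leftovers are ours, and scipy >= 1.8 is assumed for the QhullError import.

    import numpy as np
    from scipy.spatial import ConvexHull, QhullError


    def recursive_convex_hull_layers(coords):
        """Assign each point a hull layer: 1 = outermost hull, 2 = next, and so on."""
        coords = np.asarray(coords, dtype=float)
        layer = np.zeros(len(coords), dtype=int)
        remaining = np.arange(len(coords))          # indices not yet assigned a hull
        current = 1
        while len(remaining) > 0:
            if len(remaining) <= 4:                 # too few points left for a 3D hull
                layer[remaining] = current
                break
            try:
                hull = ConvexHull(coords[remaining])
            except QhullError:                      # degenerate (e.g. coplanar) leftovers
                layer[remaining] = current
                break
            vertices = remaining[hull.vertices]     # points lying on the current hull
            layer[vertices] = current               # outermost hull receives number 1
            remaining = np.setdiff1d(remaining, vertices)
            current += 1
        return layer

With this numbering (RCH, outermost inward), the reversed scheme (RCHr, innermost outward) for a chain is simply layer.max() - layer + 1.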
Calculation of residue solvent accessibility (SA): Solvent accessible surface values for each residue were extracted from the DSSP (Holm and Sander, 1993) file for each structure. These values were divided by the solvent accessible surface values for each amino acid as defined in Rost and Sander (1994) to obtain the relative solvent accessibility of each residue. The mean SA value in this dataset was 0.27 (SD 0.27).

Calculation of residue exposure (Exp): In this study, we characterize residue exposure as the distance of residues from the centroid of each chain (Ben-shimon and Eisenstein, 2005). The chain centroid was determined from the coordinates of the residues, and the Euclidean distance of each residue from this point was calculated to obtain the residue's exposure value. The mean Exp value in this dataset was 19.1 Å (SD 7.8). Determination of Exp values for the whole dataset took less than 2 min.

Calculation of residue depth (RD): Residue depth (RD) values were obtained from the DPX server (Pintar et al., 2003) using default settings. RD values were positively skewed, with a mean RD of 0.86 (SD 1.41).

Normalization: In Section 2.2, both unnormalized and normalized values are reported for characterization of the measures studied using box plots (Fig. 2), correlation coefficients (Table 1), structural conservation (Table 2), thermal stability (Table 3) and mutual information between class assignments (Table 4). The value for each residue was divided by the maximum value for that measure in the corresponding chain to obtain the normalized value. Histograms of unnormalized and normalized measures are shown in the Supplementary Materials (Figures 5 and 6). After normalization, RCH and RCHr are symmetric.
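As an illustration of the Exp measure and the per-chain normalization described above, the following sketch (ours, with illustrative function names) computes the centroid distances for one chain and divides a per-residue measure by its maximum value within that chain.

    import numpy as np


    def exposure(coords):
        """Euclidean distance (in Angstroms) of each residue from the chain centroid."""
        coords = np.asarray(coords, dtype=float)
        centroid = coords.mean(axis=0)
        return np.linalg.norm(coords - centroid, axis=1)


    def normalize_per_chain(values):
        """Divide each residue's value by the maximum value of that measure in its chain."""
        values = np.asarray(values, dtype=float)
        return values / values.max()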
2.2 Comparison between RCH and other measures of residue location

Box plots: Figure 2 plots RD versus RCH for each residue in the dataset using the statistically robust box-and-whisker technique. Boxes cover 50% of the data points, whiskers extend to 1.5 times the interquartile range, outliers are plotted as blue dots and median values are indicated with black dots. Median values for RD are positively correlated with RCH, yet RCH makes finer distinctions between degrees of burial and exposure. Further box plots for these measures are available in the Supplementary Materials (Fig. 3).

Fig. 2. Box and whisker plots of RD against RCH for 257,560 residues from 1050 proteins. Black dots indicate median values. Values were normalized and rounded to one decimal place.

Correlation coefficients: Pairs of measures that have a low correlation coefficient are likely to be unrelated and potentially provide complementary information for PSP. Table 1 shows the Pearson correlation coefficients between the measures studied. RD has low correlation with the other measures. RCH is most highly anti-correlated with SA (0.62) and has a higher correlation with SA and Exp than RD does. RCH is not highly correlated with RD, suggesting that these are distinct characterizations of residue location. RCH appears to be the measure that correlates most closely with many of the other measures; hence, we would like to determine whether it is also relatively more learnable than these other measures.

Table 1. Correlation coefficients between the measures studied

            SA      RD      Exp     RCH     RCHr
SA          1.00    0.51    0.39    0.62    0.41
  Norm.     1.00    0.50    0.55    0.68    0.68
RD                  1.00    0.26    0.43    0.30
  Norm.             1.00    0.34    0.48    0.48
Exp                         1.00    0.41    0.85
  Norm.                     1.00    0.81    0.81
RCH                                 1.00    0.42
  Norm.                             1.00    1.00
RCHr                                        1.00
  Norm.                                     1.00

Norm. indicates coefficients based on normalized measures.

Conservation of RCH: For related proteins, aligned residues are potentially conserved even in the absence of strong sequence homology. Measures that have relatively high correlation for aligned residue pairs potentially reflect conserved aspects of protein structure. We, therefore, assess to what degree these measures are correlated between aligned residues in pairs of superimposed structures from a range of folds. Following Hamelryck (2005), the conservation of RCH and the other measures was calculated for 15,621 aligned residues (BLAST E-value ≥ 1.0) in 218 pairs of structures from the SABmark version 1.63 Twilight Zone database (Van Walle et al., 2005). This dataset comprises pairs of superimposed structures covering 236 folds; these pairs are structurally similar yet without probable common evolutionary origin, effectively a hard dataset to predict. Table 2 reports the correlation coefficients for both unnormalized and normalized measures. RCH and RCHr have higher conservation correlation coefficients than RD, Exp and SA, indicating that, for such aligned residues, RCH is more highly correlated with structurally conserved locations than RD, Exp and (after normalization) SA. As we used Cβ coordinates, values for RD and SA are around 0.1 lower than those previously reported (Hamelryck, 2005).

Table 2. Conservation of measures

            RD      Exp     RCH     RCHr    SA
            0.37    0.38    0.46    0.48    0.52
  Norm.     0.37    0.46    0.55    0.55    0.50

Correlation of the measures studied between aligned residues in related structures. Norm. indicates coefficients based on normalized measures.
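A Table 1-style correlation matrix can be obtained directly from the per-residue values. The sketch below is ours and assumes five equal-length 1D arrays holding the measures for all residues; np.corrcoef returns the matrix of Pearson coefficients.

    import numpy as np


    def measure_correlations(sa, rd, exp_dist, rch, rchr):
        """Pearson correlation coefficients between the five per-residue measures."""
        data = np.vstack([sa, rd, exp_dist, rch, rchr])
        labels = ["SA", "RD", "Exp", "RCH", "RCHr"]
        return labels, np.corrcoef(data)    # 5 x 5 symmetric matrix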
Relationship of RCH to changes in thermal stability of mutant proteins: Changes in the thermal stability of proteins after mutation of core hydrophobic residues (mutations that potentially lead to cavity formation) have been correlated with changes in SA and residue depth [for references see Hamelryck (2005)]. For such residues, measures that correlate relatively highly with changes in the protein's thermal stability reflect structurally important features. We, therefore, assess for these residues the degree to which the measures studied here are correlated with changes in protein thermal stability. The correlation of these measures of residue location (both normalized and unnormalized) with changes in thermal stability (ΔΔG in kcal/mol) of 91 Ile/Leu/Val to Ala point mutations was measured. Sixteen protein structures from the ProTherm database (Bava et al., 2004; Gromiha et al., 1999; Kumar et al., 2006) were employed, again following the approach of Hamelryck (2005). The correlation coefficients for RD, SA, Exp, RCH, RCHr and ASA (related to the change in accessible surface upon folding) are shown in Table 3. RD values were similar to those previously reported. RCH is more highly correlated with changes in thermal stability upon mutation than the other measures. Exp and ASA showed higher correlation when the data were normalized. RD showed the lowest correlation of the measures studied. These data indicate that (unnormalized) RCH is correlated more strongly with residues in the hydrophobic core (which are related to structural stability) than are the other measures.

Table 3. Correlation of structural features with thermal stability

            RD      Exp     RCH     RCHr    ASA
            0.22    0.29    0.38    0.29    0.34
  Norm.     0.20    0.44    0.35    0.35    0.37

Correlation of the measures studied with changes in thermal stability of mutant proteins. Norm. indicates coefficients based on normalized measures.

Mutual information: The degree to which the classes assigned to residues using these measures are mutually informative was assessed using mutual information (MI) (Cover and Thomas, 2006). For discrete data, MI is defined as:

    I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}    (1)

where p(x) and p(y) are the probabilities of x and y occurring in the dataset, and p(x,y) is the probability of the combination of x and y occurring together in the dataset. MI is used here to measure the quantity of information that one measure (e.g. SA) tells us about another (e.g. RCH).

Table 4 shows the MI between pairs of measures for all 257,560 residues studied. When the MI between the class assignments for a pair of measures is high, they represent closely related problems (the MI between a measure and itself is maximal, and is 1.00 if the classes assigned to the measure are well balanced). SA shares 0.26 MI with RCH, whilst Exp shares 0.38 MI with RCHr, and all other pairwise MI values are less than 0.10. This indicates that the RCH class of residues provides information distinct from SA, RD and Exp class information. MI for Q3 and Q5 class assignments is given in the Supplementary Materials (Table 3), along with a detailed pairwise examination of the Q5 class assignments for SA versus RCH and RCHr versus Exp, where increased levels of MI were observed (Supplementary Materials, Tables 4 and 6), and for RD versus RCH (Table 5). Frequent differences in class assignments are observed for measures sharing greater than 0.20 MI. To further highlight the distinction between RD and RCH, visualisations of two space-filling Cβ atom models of protein structures are shown in Figure 3. The values for each measure were normalized and a colour assigned, for both measures, to indicate values from 'exposed' (blue) to 'buried' (red). These models provide visual confirmation that residue RCH assignments are distinct from those for RD. Further examples are available in the Supplementary Materials (Figs. 1 and 2).

Table 4. Pairwise mutual information

            SA      RD      Exp     RCHr    RCH
SA          1.00    0.21    0.06    0.08    0.26
  Norm.     1.00    0.21    0.12    0.26    0.26
RD                  0.91    0.04    0.05    0.14
  Norm.             0.91    0.06    0.14    0.14
Exp                         1.00    0.38    0.07
  Norm.                     1.00    0.29    0.29
RCHr                                0.99    0.08
  Norm.                             1.00    1.00
RCH                                         0.99
  Norm.                                     1.00

MI between two-class (Q2) assignments for pairs of measures. Norm. indicates MI for class assignments based on normalized measures.

Fig. 3. Space-filling Cβ atom models of proteins coloured by RCH and RD. 'Core' residues are coloured red/yellow and 'surface' residues blue/green (rendered using RasMol).
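Equation (1) can be evaluated directly from two vectors of class labels. The sketch below is ours; base-2 logarithms are assumed, which reproduces the behaviour noted above that a well-balanced two-class measure shares 1.00 MI with itself.

    import numpy as np


    def mutual_information(x, y):
        """I(X;Y) in bits for two equal-length sequences of discrete class labels."""
        x, y = np.asarray(x), np.asarray(y)
        mi = 0.0
        for xv in np.unique(x):
            px = np.mean(x == xv)
            for yv in np.unique(y):
                py = np.mean(y == yv)
                pxy = np.mean((x == xv) & (y == yv))
                if pxy > 0.0:
                    mi += pxy * np.log2(pxy / (px * py))
        return mi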
3 LEARNABILITY OF RCH AND OTHER MEASURES

Having demonstrated that residue RCH is a new and distinct characterization of residue location, we turn to the predictability of these measures and assess, in practical terms, which of these characterizations of residue location is easier to learn and, hence, potentially more useful for PSP.

3.1 Prediction experiments

Inputs to predictions: For each measure (RCH, RCHr, RD, SA and Exp) predictions were made using six types of input information and three levels of precision: two, three and five class partitions (Q2, Q3 and Q5). Table 5 summarizes the six different types of input information used for predictions of the measures studied. Combinations of both local (neighbourhood of the target in the chain) and global (protein-wise) information were used. A window of four residues either side of the target residue has been shown to lead to high CN predictive accuracy using LCS (Bacardit et al., 2006) and was also used in this study, to facilitate comparison of results. For each representation (RCH, SA, etc.) these inputs are labelled 1–6 in the rest of this article; e.g. RCH-3 denotes RCH predicted using input dataset 3. For each measure a total of 18 datasets was evaluated (six sets of input attributes, each at three levels of class assignment). A detailed description of these inputs appears in Stout et al. (2007). In order to determine the degree to which RCH and RCHr vary in their learnability and capture properties of protein structures, in what follows we use their unnormalized versions.

Table 5. Datasets

                                                          Dataset
Scope     Input information                               1   2   3   4   5   6
Local     AA types in window of target ±4 residues        •   •   •   •   •   •
Local     Pred. secondary structure of target                 •       •       •
Global    Chain length                                            •   •
Global    Residue frequencies                                     •   •
Global    Pred. average of target measure                                 •   •

For each dataset (1–6), the input information types it includes are indicated by •. The two types of local (target and its closest neighbours) and three types of global (protein-wise) input information investigated are shown.
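The local part of these input datasets, the amino-acid types in a window of ±4 residues around the target, can be sketched as follows. This is an illustration written for this text; the symbol 'x' marks window positions that fall before the start or after the end of the chain.

    def window_features(sequence, width=4):
        """For each residue, return the 2*width+1 amino-acid letters of its window."""
        padded = "x" * width + sequence + "x" * width
        return [list(padded[i:i + 2 * width + 1]) for i in range(len(sequence))]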
Class assignments: In order to predict measures using classification techniques, the calculated values for each measure were partitioned into two, three and five classes (bins), here termed Q2, Q3 and Q5, respectively. For imbalanced measures, such as SA, a class boundary that leads to more balanced classes is traditionally chosen; e.g. for SA a cut point of 25% is widely used. We apply class balancing for all measures and levels of discretization (Q2, Q3 and Q5) in this study, adopting a uniform-frequency classification procedure. For our data, balanced classes for SA were obtained using, for example, a cut point of 18%. Class boundaries were determined individually for each training/test set pair using the corresponding training fold. Details of the cut points used are given in Supplementary Table 1.

Definition of the training and test sets: Datasets were divided randomly into 10 training and test set pairs (950 chains for training and 100 for testing), using bootstrap (Kohavi, 1995). We have placed a copy of the datasets at www.infobiotic.net.

Performance measures: Different protein chains have different lengths, and it is prediction accuracy on chains that is typically reported (Kinjo et al., 2005; Jones, 1999). Therefore, prediction accuracies for each chain were averaged to obtain the protein-wise accuracy reported here.

Machine-learning methods: We use four different machine-learning methods. The first two are popular machine-learning systems taken from the WEKA package (Witten and Frank, 2005): C4.5 (Quinlan, 1992), a decision tree rule induction system, and Naive Bayes (John and Langley, 1995), a Bayesian learning algorithm. Learning systems belonging to the Learning Classifier Systems (LCS) (Holland and Reitman, 1978) class of ML techniques were also studied. These are rule-based machine-learning systems that employ evolutionary computation (Holland, 1975) as the search mechanism. Two LCS methods have been employed, GAssist (Bacardit, 2004) and BioHEL (Bacardit et al., 2007), which implement different rule induction paradigms. A detailed description of both systems is included in the Supplementary Material (Section 3.1).

Predicted secondary structure information for the target residue was obtained using the PSI-PRED predictor (Jones, 1999). This consists of the secondary structure type (helix, strand or coil) and a confidence level for the prediction (ranging from 0 to 9).

For each measure, the average value of that measure was determined for each chain and 10 pairs of training and test folds were prepared. For each instance, the inputs were the chain length (one integer value) and the amino acid composition of the chain (20 real values), and the target class was the measured average value for the particular measure (partitioned into 10 classes using a uniform-frequency cut-point strategy). Cut points were determined separately for each training fold and used to assign classes to the values in the corresponding training and test folds. The predicted average value of the measure under consideration (termed PredAveRCH, PredAveRD, etc.) was predicted using the GAssist LCS (details above) prior to preparation of the datasets for the full measure predictions. Nine hundred and fifty instances (chains) were used for training and 100 instances for testing. Ten iterations were performed for each prediction using different random number seeds, and the 10 rule sets generated were combined as an ensemble using a majority vote to predict the measure.
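The uniform-frequency discretization used for class balancing amounts to taking quantile cut points from the training fold and then applying them to both folds. The sketch below is ours, with illustrative function names.

    import numpy as np


    def uniform_frequency_cuts(train_values, n_classes):
        """Quantile cut points giving roughly equal-sized classes on the training fold."""
        qs = np.linspace(0.0, 1.0, n_classes + 1)[1:-1]
        return np.quantile(np.asarray(train_values, dtype=float), qs)


    def assign_classes(values, cuts):
        """Map real values to class indices 0..n_classes-1 using the training cut points."""
        return np.searchsorted(cuts, np.asarray(values, dtype=float), side="right")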
Analysis of results: For each experiment, the mean prediction accuracy (as defined previously in this section) over the test sets is reported. Student t-tests were applied to the 10 results from each experiment to determine the best method for each dataset at a confidence level of 95%. Standard deviations and any significant differences are indicated in each table. The conservative Bonferroni correction (Miller, 1981) for multiple pairwise comparisons was applied. In addition, the contribution of global input information was assessed as follows: for each learning system and precision (Q2, Q3 and Q5), max(Dataset 4, Dataset 6) - Dataset 2 was computed. The dataset with predicted secondary structure (Dataset 2) was used as the base for this performance gap because, in certain situations, Dataset 1 performed poorly and would have distorted the comparisons. Finally, the contribution of predicted secondary structure was assessed as follows: for each learning system and number of classes, the maximum of (Dataset 2 - Dataset 1, Dataset 4 - Dataset 3, Dataset 6 - Dataset 5) was determined.

3.2 Prediction results

For each measure studied, Table 6 summarizes the best Q2 predictive accuracy (in descending order), using the best possible input dataset in each case. Detailed results for the predictions are given in the Supplementary Materials (Tables 7–10). Predictive accuracy was higher on the two RCH-based representations than on the SA, RD or Exp representations, and the predictive accuracies for RCHr were statistically significantly higher than those for the other measures (P-value = 0.05). For all representations, higher predictive accuracies were seen when fewer classes were predicted (lower precision, Q2). Q5 predictive accuracy for RCH was between 30% and 40%; Q3 was approximately 20% higher, between 55% and 60%; whilst for Q2, prediction accuracies exceeded 77%. The LCSs performed best on the RCHr representation when using input dataset RCHr-4; this dataset combines local information (a window of residues around the target and its predicted secondary structure) with global chain information (chain length and chain residue composition). The more compact RCHr-6 was frequently the most learnable dataset for C4.5 and Naive Bayes; this dataset comprises local information (window and predicted secondary structure) and global information (the predicted average RCHr of the chain).

Table 6. Summary of the highest predictive accuracies for each measure studied, in descending order of accuracy

         C4.5          BioHEL        GAssist       Naive Bayes
RCHr     79.8 ± 1.5    78.5 ± 1.5    78.4 ± 1.5    77.9 ± 1.7
RCH      77.3 ± 1.0    75.9 ± 1.2    75.7 ± 1.1    76.1 ± 1.1
RD       76.0 ± 0.4    75.3 ± 0.3    75.2 ± 0.3    75.1 ± 0.4
Exp      73.9 ± 1.4    72.8 ± 1.6    72.5 ± 1.4    73.4 ± 1.3
SA       73.3 ± 0.3    72.2 ± 0.4    72.2 ± 0.4    72.3 ± 0.4

Mean ± SD for 10-fold cross-validated predictions based on the input datasets that gave the best results for each measure, namely type 4 or 6.
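The per-dataset significance testing described under 'Analysis of results' can be sketched as a paired t-test on the fold-wise accuracies of two methods followed by a Bonferroni adjustment. This fragment is ours and only illustrates the idea; it assumes two arrays of 10 accuracies and the number of pairwise comparisons performed.

    from scipy import stats


    def compare_methods(acc_a, acc_b, n_comparisons):
        """Bonferroni-corrected p-value for a paired t-test on fold-wise accuracies."""
        t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
        return min(1.0, p_value * n_comparisons)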
3.3 Predicted RCH improves CN prediction

Finally, we assess the utility of predicted RCH as an input to the prediction of other aspects of protein structure, specifically coordination number (CN). For each of the measures studied, the Q5 predictions (using input dataset 4) made by BioHEL (which was, in general, the best performing method) are fed back into the prediction of CN (Bacardit et al., 2006). The CN of a residue is a count of the number of other residues from the chain that are located within a certain threshold distance; specifically, we have used the Kinjo et al. (2005) definition of CN. We predict whether the CN of a residue is above or below the midpoint of the CN domain, using as input information the AA types of a window of ±4 residues around the target (equivalent to the first set of input attributes used to predict the other features), termed CN1.

The contribution of SA, Exp, RCH and RCHr to CN prediction (individually and in combination with one another) was evaluated by extending the CN1 dataset with 16 combinations of input attributes that correspond to all combinations of these measures. Using predicted RD as input gave the lowest improvement (0.2%) over the CN1 (local window) input alone and RD was, therefore, not included in predictions made with combinations of inputs. Table 7 shows the results of these experiments. As a baseline, the performance of the original CN1 is included. The table has been sorted by accuracy, which helps to identify the combinations of predicted measures that give the biggest performance boost.

The results of these experiments were analysed using a paired t-test at the 95% confidence level with the Bonferroni correction. Two types of results were identified: the datasets on which BioHEL performed significantly better than when learning from CN1 alone (marked a), and the statistically indistinguishable group of datasets that resulted in the highest predictive accuracies (marked b).

Table 7. Coordination number prediction (by BioHEL) using amino-acid sequence and various combinations of the predicted measures

Dataset                                Protein-wise acc.
CN1                                    77.2 ± 0.8
CN1 + RD                               77.4 ± 0.8
CN1 + RCHr                             77.6 ± 0.7
CN1 + Exp                              77.7 ± 0.8
CN1 + Exp + RCHr                       77.7 ± 0.7
CN1 + RCH                              78.5 ± 0.9
CN1 + RCH + RCHr                       78.8 ± 0.7
CN1 + Exp + RCH                        78.9 ± 0.8
CN1 + SA (a)                           78.9 ± 0.8
CN1 + Exp + RCH + RCHr                 78.9 ± 0.7
CN1 + Exp + SA (a,b)                   79.1 ± 0.8
CN1 + Exp + SA + RCHr (a,b)            79.1 ± 0.8
CN1 + SA + RCHr (a,b)                  79.1 ± 0.8
CN1 + SA + RCH (a,b)                   79.7 ± 0.8
CN1 + Exp + SA + RCH (a,b)             79.8 ± 0.8
CN1 + SA + RCH + RCHr (a,b)            79.8 ± 0.8
CN1 + Exp + SA + RCH + RCHr (a,b)      79.8 ± 0.7

a indicates input information that leads to a statistically significant increase in predictive accuracy compared to the baseline CN1 inputs; b marks the group of best performing combinations, which all have statistically similar performance.

There are two groups of measures: those that provide only a small performance boost over CN1 (RD, Exp and RCHr), and others that provide a larger boost (SA and RCH). Furthermore, combining Exp and RCHr (together and with other measures) only marginally improves the performance, indicating that these measures are mostly redundant (consistent with the observed increase in MI between these two measures, Table 4). On the other hand, combining SA and RCH resulted in much better accuracy, showing that these measures complement each other. The best combinations of measures (Exp+SA+RCH, SA+RCH+RCHr and Exp+SA+RCH+RCHr) lead to a performance increase of 2.6% over the baseline CN1 dataset.
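For completeness, the quantity being predicted here can be sketched as follows. This illustration (ours) counts, for each residue, the other residues of the chain whose Cβ atoms lie within a hard distance threshold; the threshold value is illustrative, and the Kinjo et al. (2005) definition used in the paper is more refined than this simple cut-off.

    import numpy as np


    def coordination_numbers(coords, threshold=10.0):
        """Number of other residues within `threshold` Angstroms of each residue."""
        coords = np.asarray(coords, dtype=float)
        dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        contacts = (dist < threshold) & ~np.eye(len(coords), dtype=bool)
        return contacts.sum(axis=1)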
4 DISCUSSION

4.1 Prediction results

The results show some general trends across all measures studied and all learning methods. Predictive accuracy increases when richer input information is employed. Inclusion of local information in the form of predicted secondary structure typically leads to an increase in Q2 predictive accuracy of 2–3% on most datasets for the learning systems used, whilst using global protein information (chain length and composition) can boost Q2 predictive accuracy by more than 10% (in the case of RCH and RCHr). The type 3 and 4 datasets, containing 21 real-valued global protein attributes, present a considerably larger search space, and the mixture of real-valued and nominal attributes makes the learning problem more difficult than for purely nominal knowledge representations. The type 5 and 6 datasets use fewer attributes than the type 3 and 4 datasets, enabling the LCSs to generate rule sets that are more easily interpreted. The RCHr representation was the most predictable measure, this resulting from use of input dataset 4 (chain composition and length information) and the BioHEL LCS. C4.5 performed best when using datasets 4 and 6, benefiting from predicted local and global information. Naive Bayes generally performed best when using the more compact dataset 6.

There were interesting differences between these representations of residue location. RCH was easier to learn than the other representations because it correctly assigned classes to residues in anisotropic (elongated) structures; for such structures, the Exp representation may assign some surface residues to the buried class and some buried residues to the exposed class. Similarly, RCH was more predictable than SA; the assignment of residues that lie deep within structures but on the surface of cavities (and therefore have high solvent accessibility) to the exposed class may have lowered the predictability of the SA representation.

RD was more predictable than SA and Exp, but was not as easily predicted as RCH and RCHr. Whilst SA emphasises buried residues, RD emphasises exposed residues. The highly imbalanced nature of the RD measure (Supplementary Materials, Figs. 5 and 6) leads to imbalanced class assignments even when using a uniform-frequency cut-point strategy (Supplementary Materials, Table 2). This imbalance is likely to have made this measure relatively more difficult to learn. Furthermore, when fed back as an input for prediction of CN, RD in particular provided little additional information over the CN1 inputs, resulting in only marginal improvements in CN prediction. Prediction accuracy for all measures is likely to be boosted by including additional (e.g. non-local) input information; for example, 79.3% 10-fold cross-validated accuracy has been reported for two-state SA prediction at a 25% cut point using an integrated system of neural networks with position-specific scoring matrices and a range of other input data (Dor and Zhou, 2007).
4.2 'White box' prediction: interpretable analysis and performance considerations

Understanding the basis on which a prediction is made may be more valuable than making relatively accurate predictions in a blind manner. Decision trees can be interpreted; however, on these problems C4.5 produced pruned trees that ranged in mean size from 1068 (SD 331) to 138,977 (SD 2735) nodes, limiting their explanatory value. Naive Bayes, on the other hand, generates compact solutions, but these are probabilistic models that have no immediate physico-chemical interpretation. In contrast, it is much easier to relate the compact rule sets evolved by the LCS algorithms to the underlying physical and chemical properties of proteins. The following is an example of a rule set evolved by the GAssist LCS that produced 71.2% two-state predictive accuracy on the RCH-6 dataset.

(1) If PredSSConf < 7.5 and PredAveAtt > 8.64 and Res-4 ∉ {K, x} and Res-2 ∉ {x} and Res ∉ {D, E, K, N, Q} and Res+1 ∉ {x} and Res+2 ∉ {K} and Res+4 ∉ {x} → class is 1
(2) If PredSS ∉ {C} and PredAveAtt > 4.8 and Res-4 ∉ {E, Q} and Res-3 ∉ {D, T} and Res-2 ∉ {E} and Res-1 ∉ {D, P, V} and Res ∉ {D, E, H, K, N, P, Q, R, T} and Res+1 ∉ {x} and Res+2 ∉ {H} and Res+3 ∉ {K} and Res+4 ∉ {E, K, Q, R, x} → class is 1
(3) If PredSS ∉ {C} and PredAveAtt > 4.8 and Res-4 ∉ {E, R} and Res-3 ∉ {D, E} and Res-1 ∉ {P, W} and Res ∉ {A, D, E, K, N, P, Q, R} and Res+1 ∉ {W, x} and Res+2 ∉ {V, W} and Res+3 ∉ {K, R} and Res+4 ∉ {K, x} → class is 1
(4) If PredSS ∈ {E} and PredAveAtt > 4.5 and Res-4 ∉ {C, K, x} and Res-3 ∉ {R} and Res-2 ∉ {K, R} and Res-1 ∉ {D, Q, S, x} and Res ∉ {E, F, K, M, R, T} and Res+1 ∉ {F, L, x} and Res+2 ∉ {K} and Res+3 ∉ {C} and Res+4 ∉ {x} → class is 1
....
Default class is 0

The rule set structure is a decision list: rules are tested in order, starting with the first, and rule evaluation continues until a match is found or the default (last) rule is reached. This rule set comprised 15 rules; the first four are shown (the full rule set is given in the Supplementary Materials, Section 4.3). In the first rule, the confidence level of the secondary structure prediction is tested, followed by the predicted average value for the property under consideration (predicted average RCH > 8.64). Subsequent predicates test the amino acid types of residues surrounding the target in the sequence. AA±M means the amino acid type of the residue at position ±M with respect to the target residue. Amino acids are represented by their one-letter code, plus the symbol x representing positions before the start or after the end of the chain, for cases where the window of neighbouring residues overlaps the beginning or the end of the chain. For positions relative to the target residue (e.g. Res-2), these predicates restrict the amino acid types allowed at that position (membership or non-membership of a set). In subsequent rules, the predicted secondary structure class (PredSS) of the target residue, either Helix (H), Sheet (E) or Coil (C), is tested. In many cases, the target residue type is largely restricted to hydrophobic residues, which are often found buried in the protein core and, therefore, have higher RCH numbers. The LCS has correctly identified this, predicting the above-average RCH class (1) for these residues.

The accuracy, simplicity and interpretability of the rule sets generated by the LCSs must, however, be balanced against the computational expense needed to generate them. Run times for the learning phase of the BioHEL algorithm ranged from 3 min on the smaller input datasets to almost 7 h on the larger ones. In contrast, the worst case for C4.5 was 32 min, and for Naive Bayes it was less than 1 min. Moreover, for each problem set, the LCS algorithms BioHEL and GAssist were run multiple times to produce 10 classifiers for input to the ensemble procedure. Once trained, however, run times for the resulting classifiers on the entire test set were around 2 min for C4.5 and Naive Bayes but less than 1 min for the LCSs, an indication, for instance, of how these LCS-evolved classifiers would perform as part of a prediction web server.
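The way such a decision list is applied at prediction time is easy to make concrete. The sketch below is ours: each rule is a set of attribute tests that must all pass, rules are tried in order, and the default class is returned when none match. The example rule encodes only part of rule (1) above and is for illustration, not the published rule set.

    def matches(rule, instance):
        """True if every (attribute, test) pair of the rule accepts the instance."""
        return all(test(instance[attr]) for attr, test in rule["conditions"])


    def predict(decision_list, instance, default_class=0):
        """Evaluate rules in order; the first matching rule assigns the class."""
        for rule in decision_list:
            if matches(rule, instance):
                return rule["class"]
        return default_class


    # A partial, illustrative encoding of rule (1):
    rule1 = {
        "class": 1,
        "conditions": [
            ("PredSSConf", lambda v: v < 7.5),
            ("PredAveRCH", lambda v: v > 8.64),
            ("Res-4", lambda v: v not in {"K", "x"}),
            ("Res", lambda v: v not in {"D", "E", "K", "N", "Q"}),
        ],
    }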
5 CONCLUSIONS

In this article, a new measure of residue location in folded protein chains, the RCH, was introduced. RCH is a parameterless, simple-to-compute and mathematically rigorous method that situates residues in layers within protein structures. We show that RCH is distinct from other widely studied measures of residue location, that it distinguishes a range of degrees of residue burial/exposure, and that it correlates better with residue conservation and with changes in protein stability under mutation than measures such as solvent accessibility, residue depth or residue distance from the chain centroid. Further, we assess the predictability of these measures using three types of ML technique, decision trees (C4.5), Naive Bayes and LCS, employing a range of predictive inputs. We show that an LCS that employs iterative rule learning, BioHEL, predicts RCH at 77.3, 60.6 and 39.0% accuracy for Q2, Q3 and Q5, respectively. We present examples of competent yet simple and interpretable LCS classification rules, showing how they relate to the underlying physical and chemical properties of the residues. As an exemplar application of predicted RCH class (in combination with other measures), we show that prediction of contact number can be improved by up to 2.6%.

ACKNOWLEDGEMENTS

We acknowledge the support of the UK Engineering and Physical Sciences Research Council (EPSRC) under grant GR/T07534/01. We are grateful for the use of the University of Nottingham's High-Performance Computer. We are grateful to the anonymous reviewers whose insightful comments have helped to improve this manuscript.

Conflict of Interest: none declared.

REFERENCES

Bacardit,J. (2004) Pittsburgh Genetics-Based Machine Learning in the Data Mining Era: Representations, Generalization, and Run-time. PhD Thesis, Ramon Llull University, Barcelona, Spain.
Bacardit,J. et al. (2006) Coordination number prediction using learning classifier systems: performance and interpretability. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO '06). ACM Press, New York, pp. 247–254.
Bacardit,J. et al. (2007) Automated alphabet reduction method with evolutionary algorithms for protein structure prediction. In GECCO '07: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation. ACM Press, London, Vol. 1, pp. 346–353.
Badel-chagnon,A. et al. (1994) 'Iso-depth contour map' of a molecular surface. J. Mol. Graph., 12, 162–168, 193.
Baldi,P. and Pollastri,G. (2002) A machine-learning strategy for protein analysis. IEEE Intell. Syst., 17, 28–35.
Barber,C.B. et al. (1996) The quickhull algorithm for convex hulls. ACM Trans. Math. Software, 22, 469–483.
Bava,K.A. et al. (2004) ProTherm, version 4.0: thermodynamic database for proteins and mutants. Nucl. Acids Res., 32 (Database issue), D120–D121.
Ben-shimon,A. and Eisenstein,M. (2005) Looking at enzymes from the inside out: the proximity of catalytic residues to the molecular centroid can be used for detection of active sites and enzyme-ligand interfaces. J. Mol. Biol., 351, 309–326.
Chakravarty,S. and Varadarajan,R. (1999) Residue depth: a novel parameter for the analysis of protein structure and stability. Structure, 7, 724–732.
Chen,B.Y. et al. (2007) Cavity scaling: automated refinement of cavity-aware motifs in protein function prediction. J. Bioinform. Comput. Biol., 5, 353–382.
Coleman,R.G. and Sharp,K.A. (2006) Travel depth, a new shape descriptor for macromolecules: application to ligand binding. J. Mol. Biol., 362, 441–458.
Cover,T. and Thomas,J.A. (2006) Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, Hoboken, NJ.
Dor,O. and Zhou,Y. (2007) Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins, 66, 838–845.
Eidhammer,I. et al. (2003) Protein Bioinformatics: An Algorithmic Approach to Sequence and Structure Analysis. J. Wiley & Sons Ltd., Chichester.
Gianese,G. and Pascarella,S. (2006) A consensus procedure improving solvent accessibility prediction. J. Comput. Chem., 27, 621–626.
Gromiha,M.M. et al. (1999) ProTherm: thermodynamic database for proteins and mutants. Nucl. Acids Res., 27, 286–288.
Hamelryck,T. (2005) An amino acid has two sides: a new 2D measure provides a different view of solvent exposure. Proteins, 59, 38–48.
Holland,J.H. (1975) Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. MIT Press, Cambridge, MA.
Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138.
Holmes,J.B. and Tsai,J. (2005) Characterizing conserved structural contacts by pairwise relative contacts and relative packing groups. J. Mol. Biol., 354, 706–721.
John,G.H. and Langley,P. (1995) Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers, San Mateo, pp. 338–345.
Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
Kawabata,T. and Go,N. (2007) Detection of pockets on protein surfaces using small and large probe spheres to find putative ligand binding sites. Proteins, 68, 516–529.
Kinjo,A.R. et al. (2005) Predicting absolute contact numbers of native protein structure from amino acid sequence. Proteins, 58, 158–165.
Kohavi,R. (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In Mellish,C.S. (ed.) Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Mateo, pp. 1137–1145.
Kumar,M.D. et al. (2006) ProTherm and ProNIT: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucl. Acids Res., 34 (Database issue), D204–D206.
Lee,B. and Richards,F.M. (1971) The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol., 55, 379–400.
Lee,M. et al. (2006) Shapes of antibody binding sites: qualitative and quantitative analyses based on a geomorphic classification scheme. J. Org. Chem., 71, 5082–5092.
Liang,J. and Dill,K.A. (2001) Are proteins well-packed? Biophys. J., 81, 751–766.
Lin,T.H. and Lin,J.J. (2001) Three-dimensional quantitative structure-activity relationship for several bioactive peptides searched by a convex hull-comparative molecular field analysis approach. Comput. Chem., 25, 489–498.
Lin,T.H. et al. (1999) A comparative molecular field analysis study on several bioactive peptides using the alignment rules derived from identification of commonly exposed groups. Biochim. Biophys. Acta, 1429, 476–485.
Liu,S. et al. (2007) Fold recognition by concurrent use of solvent accessibility and residue depth. Proteins, 68, 636–645.
Meier,R. et al. (1995) Segmentation of molecular surfaces based on their convex hull. In ICIP '95: Proceedings of the 1995 International Conference on Image Processing. IEEE Computer Society, Washington, DC, Vol. 3, pp. 552–555.
Miller,R.G. (1981) Simultaneous Statistical Inference (Springer Series in Statistics). Springer-Verlag, New York.
Noguchi,T. et al. (2001) PDB-REPRDB: a database of representative protein chains from the Protein Data Bank (PDB). Nucl. Acids Res., 29, 219–220.
Pintar,A. et al. (2003) DPX: for the analysis of the protein core. Bioinformatics, 19, 313–314.
Preparata,F.P. and Hong,S.J. (1977) Convex hulls of finite sets of points in two and three dimensions. Commun. ACM, 20, 87–93.
Quinlan,J.R. (1992) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
Rost,B. and Sander,C. (1994) Conservation and prediction of solvent accessibility in protein families. Proteins, 20, 216–226.
Sander,C. and Schneider,R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68.
Stout,M. et al. (2008) Prediction of topological contacts in proteins using learning classifier systems. Soft Computing, Special Issue on Evolutionary and Metaheuristic-based Data Mining (EMBDM), in press.
Van Walle,I. et al. (2005) SABmark: a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 21, 1267–1268.
Vlahovicek,K. et al. (2005) CX, DPX and PRIDE: WWW servers for the analysis and comparison of protein 3D structures. Nucl. Acids Res., 33 (Web Server issue), W252–W254.
Wang,Y. et al. (2006) Automatic classification of protein structures based on convex hull representation by integrated neural network. In Theory and Applications of Models of Computation, Third International Conference (TAMC 2006), Beijing, China, May 15–20, 2006. Lecture Notes in Computer Science, Vol. 3959. Springer, pp. 505–514.
Witten,I.H. and Frank,E. (2005) Data Mining: Practical Machine Learning Tools and Techniques. 2nd edn (The Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann, Amsterdam; Boston, MA.
Wood,M.J. and Hirst,J.D. (2005) Protein secondary structure prediction with dihedral angles. Proteins, 59, 476–481.
