
PROTEIN FOLD CLASSIFICATION BASED ON MACHINE LEARNING PARADIGM - A REVIEW

Protein fold recognition using machine learning-based methods is crucial in protein structure discovery, especially when the traditional sequence comparison methods fail because structurally similar proteins share little sequence homology. Many different machine learning-based fold classification methods have been proposed, with still increasing accuracy, and the main aim of this article is to cover all the major results in this field.

KEYWORDS: supervised learning algorithm, classifier, features, protein fold recognition

1. Introduction

Proteins are indispensable for the existence and proper functioning of biological organisms [37]. Proteins are biochemical compounds consisting of one or more polypeptides, each a single linear polymer chain of amino acids bonded together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues. The sequence of amino acids in a protein, known as the primary structure, is defined by the sequence of a gene, which is encoded in the genetic code that, in general, specifies 20 standard amino acids. One of the most distinguishing features of polypeptides is their ability to fold, typically into a globular or fibrous state, or "structure", that is the 3D (three-dimensional) or tertiary structure. According to Anfinsen [2], proteins can fold to their native structures spontaneously, so he stated that the protein fold is coded in the amino acid sequence itself; however, it is still not clear how structure is encoded in a sequence, and this remains an open problem of much scientific interest in computational biology. The secondary structure of a protein is its characterization with respect to certain local structural conformations such as α-helices and β-sheets (strands) and others such as loops, turns and coils. A fold can be defined as a three-dimensional pattern characterised by a set of major secondary structure conformations with a certain arrangement and topological connections. The structure of a protein serves as a medium through which to regulate either the function of a protein or the activity of an enzyme. Understanding how proteins fold in three-dimensional space can reveal significant information about how they function in biological reactions. A protein's function is strongly influenced by its structure ([9], [37], [49], [50]). Currently, sequencing projects rapidly produce protein sequences, but the number of 3D protein structures increases slowly due to the expensive and time-consuming conventional laboratory methods, namely X-ray crystallography and nuclear magnetic resonance (NMR). Moreover, not all proteins are amenable to experimental structure determination. Protein sequence data banks such as the Universal Protein Resource (UniProtKB/TrEMBL) [3] now contain more than 16 000 000 protein sequence entries, while the number of stored protein structures in the Protein Data Bank (PDB) [5] is about 74 000. This leads to the necessary alternative to experimental determination of 3D protein structures: computational methods such as ab initio and homology modeling. Ab initio methods seek to build 3D protein models "from scratch", i.e., based on physical principles [23], [61]. There are many possible procedures that either attempt to mimic protein folding or apply some stochastic method to search the space of possible solutions.
The two major problems here are the calculation of protein free energy and finding the global minimum of this energy; both require vast computational resources and have thus only been carried out for tiny proteins. These problems can be partially bypassed in the homology-based methods [23], [61], where the search space is pruned by the assumption that the protein in question adopts a structure that is close to the experimentally determined structure of another homologous protein. Because a protein's fold is more evolutionarily conserved than its amino acid sequence, a target sequence can be modeled with reasonable accuracy on a very distantly related template, provided that the relationship between target and template can be discerned through sequence alignment. But a hurdle exists when the query protein does not have any structure-known homologous protein in the existing databases. Facing this kind of situation, predicting the 3D structure of a protein is converted into a problem of protein fold recognition, i.e. identifying which fold pattern it belongs to. Protein fold recognition methods have taken center stage, as fold information could facilitate the identification of a protein's tertiary structure and function. Many methods have been developed to assign folds to protein sequences. They can be broadly classified into three groups: 1) sequence-structure homology recognition methods (for example [57]), 2) threading methods (for example Threader [36]), and 3) machine learning-based (ml-based) methods (also called taxonomy-based). Sequence-structure homology and threading methods align the target sequence onto known structural templates and calculate their sequence-structure compatibilities (scores) using, for example, environment-specific substitution tables or pseudo-energy-based functions, and the template with the best score is assumed to be the fold of the target sequence. While each of these methods is effective in certain cases, both approaches have drawbacks. The first will fail when two proteins are structurally similar but share little in the way of sequence homology. Threading methods rely on data derived from solved structures, but as mentioned, the number of proteins whose structure has been solved is much smaller than the number of proteins that have been sequenced. These methods have not been able to achieve accuracies greater than 30%. In recent years, machine learning-based methods have attracted great attention due to their encouraging performance. Ml-based methods for protein fold recognition assume that the number of protein folds in the universe is limited, according to [17] about 1000, and therefore protein fold recognition can be viewed as a fold classification problem. The latter can be formulated as the construction of a classifier using a learning (training) set, i.e. the sequence-derived features (properties) of proteins whose structure (fold) is known. The procedure for constructing a classifier is called supervised learning or classifier training. Its role in the fold classification task is to induce a mapping from primary sequences to folding classes. Such a trained classifier can then be used to assign a structure-based label (fold class) to an unknown protein (i.e. a protein whose structure has not yet been solved).
To implement a classification task, two major procedures are generally required: 1) feature generation, which refers to a procedure by which we "obtain" features from a query amino acid sequence so as to represent the underlying protein as a fixed-length numerical vector; 2) a machine learning classifier, which is a function (implemented in some software) assigning a numerical vector (here representing the protein with unknown structure) to one of the existing classes (here fold types). Many different ml-based fold classification methods have been proposed, with still increasing accuracy, and the main aim of this article is to cover all the major results in this exciting field. The organization of the rest of this paper is as follows. The next, second section explains the main ideas and principles of ml-based classification methods and shortly describes the four most popular ones: Gaussian, nearest neighbors, support vector machine and multi-layer perceptron classifiers. The third section deals with protein feature generation methods (for the purpose of fold classification). In the fourth section, after describing the datasets of protein folds used in the experiments from the literature, all the most important fold classification methods are presented. In the last section the conclusions drawn from the current study are given, as well as directions for future research in the field of ml-based fold prediction.

2. Supervised machine learning classification techniques

Machine learning is a branch of artificial intelligence concerned with the development of learning algorithms that allow computers to evolve their behavior based on empirical data (examples). Based on the examples, a learning algorithm captures characteristics of interest, for example the underlying probability distribution, to automatically learn to recognize complex patterns and make intelligent decisions based on data. Because the examples compose only a small subset of the whole population, the learning algorithm must generalize from the given examples so as to produce a useful output on future, new cases. The objective of statistical classification is to assign a new observation $x$ to one of the pre-specified $c$ classes. The supervised machine learning classification problem [6] can be formally stated as follows. Suppose we have a dataset of $n$ examples (training dataset):

$$U_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$

where each $x_i = (x_{i1}, \ldots, x_{id})$ represents an observation and $y_i \in \{1, \ldots, c\}$ is a categorical variable, a class label. We seek a function $d(x)$ such that the value of $d(x)$ can be evaluated for any new observation $x$ (i.e. not included in the training dataset) and such that the label $\hat{y} = d(x)$ predicted for that new observation $x$ is as close as possible to the true class label $y$ of $x$. The function $d(x)$, known as a classifier, is an element of some space of possible functions, usually called the hypothesis space. In the Bayesian decision framework, in order to measure how well a function fits the training data, a loss function $L(y, d(x))$ for penalizing errors in prediction is defined. By far the most common is the 0-1 loss function, where all misclassifications are charged a single unit. This leads to a criterion for choosing $d(x)$: the expected prediction error $E(L(y, d(x)))$, where the expectation is taken with respect to the joint probability distribution $f(x, y)$.
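Before the optimal classifier under this criterion is derived in the next paragraph, the short sketch below illustrates the objects just defined: a training set of (observation, label) pairs, a classifier $d(x)$, the 0-1 loss and its empirical average as an estimate of the prediction error. The toy data and the thresholding "classifier" are illustrative assumptions only, not taken from the paper.

```python
# A minimal sketch of the classification setup above; the toy data and the
# thresholding "classifier" are illustrative assumptions, not from the paper.
from typing import Callable, List, Sequence, Tuple

Observation = Sequence[float]   # x_i = (x_i1, ..., x_id)
Label = int                     # y_i in {1, ..., c}

def zero_one_loss(y_true: Label, y_pred: Label) -> int:
    """0-1 loss L(y, d(x)): every misclassification is charged a single unit."""
    return 0 if y_true == y_pred else 1

def empirical_error(d: Callable[[Observation], Label],
                    dataset: List[Tuple[Observation, Label]]) -> float:
    """Average 0-1 loss over a dataset; an empirical estimate of E(L(y, d(x)))."""
    return sum(zero_one_loss(y, d(x)) for x, y in dataset) / len(dataset)

# Toy usage: two classes, a classifier that thresholds the first feature.
train = [((0.2, 1.0), 1), ((0.9, 0.1), 2), ((0.8, 0.3), 2)]
d = lambda x: 1 if x[0] < 0.5 else 2
print(empirical_error(d, train))   # 0.0 on this toy training set
```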
Conditioning on $x$ and using the 0-1 loss function we obtain a solution, a function $d(x)$ of the form:

$$d(x) = \arg\max_i P(i \mid x), \qquad i = 1, \ldots, c$$

or equivalently, using the Bayes rule:

$$d(x) = \arg\max_i f(x \mid i) P(i)$$

where $f(x \mid i)$ is a class-conditional density and $P(i)$ is the a priori probability of class $i$. The obtained classifier $d(x)$, known as the Bayes classifier, says that we classify to the most probable class using the conditional distribution. Classifiers come in a great diversity of techniques and algorithms. Each classifier can be uniformly defined by a set of $c$ discriminant functions. Each class $i$ has its own discriminant function $d_i(x)$, designed in such a way that for each object from class $i$ the value of the corresponding discriminant function is (should be) the largest among all $c$ values:

$$d_i(x) > d_j(x), \qquad j = 1, \ldots, c, \; j \neq i$$

Discriminant functions are determined based on the training set using different algorithms, depending on the particular classifier. This procedure is known as classifier training or learning. The procedure of building a classifier for a particular application typically comprises the following steps [58]: 1) data collection (on appropriate features), 2) data preprocessing (for example normalization, outlier detection), 3) feature selection/extraction (to avoid the curse of dimensionality [6]), 4) classifier training and validation of its internal parameters, 5) classifier testing to estimate its performance. Most often, performance is estimated on a separate testing set (if available) or using the resampling technique of N-fold cross-validation (N-CV). The N-CV method consists of the following steps (a code sketch is given below): 1) divide randomly all the training examples into N equal-sized subsets (usually N=10, but in general this depends on the size of the training dataset), 2) use all but one subset of examples to train the classifier, 3) measure the classification performance on the remaining one by means of percentage accuracy, 4) repeat steps 2 and 3 for each subset, 5) average the results to get an estimate of the performance of the classifier. The built classifier can then be used for making predictions on new, unknown observations. An unknown observation $x$ is usually assigned to the class whose discriminant function $d_i(x)$ has the largest value.
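The following is a minimal sketch of the N-fold cross-validation procedure just described. It is an illustration only; `train_fn` and `predict_fn` stand for the training and prediction routines of whatever classifier is being evaluated and are not part of any reviewed method.

```python
# A minimal sketch of N-fold cross-validation (steps 1-5 above). `train_fn`
# and `predict_fn` are assumed placeholders for a classifier's training and
# prediction routines.
import random

def n_fold_cv(X, y, train_fn, predict_fn, n_folds=10, seed=0):
    """Estimate percentage classification accuracy by N-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)                    # step 1: random partition
    folds = [idx[k::n_folds] for k in range(n_folds)]   # N (roughly) equal subsets
    accuracies = []
    for k in range(n_folds):
        test_idx = set(folds[k])
        train_idx = [i for i in idx if i not in test_idx]
        model = train_fn([X[i] for i in train_idx],     # step 2: train on N-1 subsets
                         [y[i] for i in train_idx])
        correct = sum(predict_fn(model, X[i]) == y[i]   # step 3: accuracy on the
                      for i in test_idx)                #         held-out subset
        accuracies.append(100.0 * correct / len(test_idx))
    return sum(accuracies) / n_folds                    # steps 4-5: repeat and average
```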
2.1 Gaussian classifiers

The discriminant functions of the above-described Bayes classifier are of the form:

$$d_i(x) = f(x \mid i) P(i), \qquad i = 1, \ldots, c$$

Many classification methods are based on the Bayes classifier, including parametric and nonparametric ones, according to the estimation method used. The most popular are parametric Gaussian classifiers, which use the Gaussian (normal) density:

$$f(x \mid i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)$$

where $\mu_i$ and $\Sigma_i$ are the parameters (the mean vector and the covariance matrix of the distribution) and $d$ is the space dimensionality. In real situations, the parameters $\mu_i$ and $\Sigma_i$ as well as the a priori class probabilities $P(i)$ are replaced by their maximum likelihood (ML) estimates computed from the training set:

$$\hat{\mu}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_j, \qquad \hat{\Sigma}_i = \frac{S_i}{n_i} = \frac{1}{n_i}\sum_{j=1}^{n_i} (x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^T, \qquad \hat{P}(i) = \frac{n_i}{n}, \qquad i = 1, \ldots, c$$

Plugging the above expressions into the discriminant function of the Bayes classifier results in the quadratic discriminant function of a Gaussian classifier:

$$d_i(x) = -\frac{1}{2}(x - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x - \hat{\mu}_i) - \frac{1}{2}\ln|\hat{\Sigma}_i| + \ln \hat{P}(i)$$

In the simplest, special case where the covariance matrices in all $c$ classes are identical ($\Sigma_i = \Sigma$, $i = 1, \ldots, c$), the discriminant function of a Gaussian classifier is linear:

$$d_i(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_i - \frac{1}{2}\hat{\mu}_i^T \hat{\Sigma}^{-1} \hat{\mu}_i + \ln \hat{P}(i)$$

However, when the number of training examples is small compared to the number of dimensions $d$, there may be a problem in obtaining good ML estimates of the class covariance matrices. One solution is to use the regularized estimators proposed in [30]:

$$\hat{\Sigma}_i(\alpha) = (1 - \alpha)\hat{\Sigma}_i + \alpha\hat{\Sigma}, \qquad 0 \le \alpha \le 1$$

where the parameter $\alpha$ determines the amount of "shrinkage" of the individual matrices $\hat{\Sigma}_i$ towards the pooled one, and $\hat{\Sigma} = S/n$, with $S = \sum_{i=1}^{c} S_i$ and $S_i = \sum_{j=1}^{n_i}(x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^T$, is the pooled (average) covariance matrix. Such a classifier is called a regularized Gaussian classifier.

2.2 Nearest neighbors classifiers

Another possibility is to use nonparametric density estimators, for example kernel or nearest neighbors ones [58], which leads to different nonparametric Bayes classifiers. Based on the nearest neighbor density estimator and using the discriminant function of the Bayes classifier we obtain the very popular k nearest neighbor (knn) classifier. This classifier is based on a distance function for pairs of observations, such as the Euclidean distance, and proceeds as follows to classify test set observations on the basis of the training set. For each element in the test set: 1) find the k closest observations in the training set, and 2) predict the class label by majority vote, i.e., choose the class that is most common among the k neighbors. The number of neighbors k is usually chosen in a validation step using the n-fold cross-validation (n-CV) procedure described earlier in this section.
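A minimal sketch of the knn rule just described (Euclidean distance plus majority vote) is given below; the function and variable names are my own illustration.

```python
# A minimal sketch of the k nearest neighbor rule from Section 2.2:
# find the k closest training observations, then take a majority vote.
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Assign x the class label most common among its k nearest neighbours."""
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y))
    k_labels = [yi for _, yi in dists[:k]]          # labels of the k closest points
    return Counter(k_labels).most_common(1)[0][0]   # majority vote

# Toy usage with two classes in the plane
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
train_y = [1, 1, 2, 2]
print(knn_predict(train_X, train_y, (0.2, 0.1), k=3))   # -> 1
```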
2.3 Support vector machine

The linear support vector machine (SVM) binary classifier [6], [58] is defined by the optimal separating hyperplane (OSH), i.e., the one which maximizes the separation margin, the distance between the hyperplane and the closest training observations (called support vectors). When the data are not linearly separable, a nonlinear transformation is used to map (indirectly) the input data vectors $x \in X$ from the original feature space into a higher-dimensional Hilbert space $F$ using a kernel function $K(x, z)$, which is a function such that:

$$K(x, z) = \langle \Phi(x), \Phi(z) \rangle$$

for all vectors $x, z \in X$ ($\langle \cdot, \cdot \rangle$ is an inner product and $\Phi$ the induced mapping). We also consider the kernel matrix:

$$K = \left(K(x_i, x_j)\right)_{i,j=1}^{n}$$

It is a symmetric positive definite matrix, and since it specifies the inner products between all pairs of points $\{x_i\}_{i=1}^{n}$, it completely determines the relative positions between those points in the Hilbert space $F$. The solution sought by kernel-based learning algorithms such as the SVM is a discriminant function that is linear in the Hilbert space $F$:

$$d(x) = \mathrm{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right)$$

where $0 \le \alpha_i \le C$ (i = 1,...,n) are Lagrange multipliers, $C$ is a regularization parameter and $b$ is a constant, both obtained through numerical optimization during learning. An important issue in applications is that of choosing a kernel $K$ for a given learning task (i.e. one that induces the right metric in the space). One of the most widely used "standard" kernel functions is the Gaussian kernel:

$$K(x_i, x_j) = \exp\left(-\frac{1}{2\sigma^2}\,\|x_i - x_j\|^2\right)$$

where the parameter $\sigma$ is the width of the kernel. The originally defined SVM is a binary classifier, and one way of using it in a multi-class classification problem is to adopt standard techniques for combining the results of binary classifiers. The most popular are one versus all (1-all) and one versus one (1-1) [6]. With the 1-all approach, a binary classifier is constructed to decide between two classes: the class in question and the rest. Given c classes, c different classifiers are constructed and an unknown observation is assigned the label of whichever classifier returns a yes vote. In the case of multiple 'yes' votes, a number of different tie-breaking solutions have been proposed. Using the 1-1 strategy and a dataset with c classes, a classifier is constructed for every possible pair of classes, resulting in c(c-1)/2 different binary classifiers. Given an input observation, it is tested with each classifier, and the class returning the largest number of 'yes' votes is assigned to the observation.

2.4 Multi-layer perceptron

The multi-layer perceptron (MLP) [6], also termed a feedforward neural network, is a generalization of the single-layer perceptron. In fact, just three layers (including the input layer) are enough to approximate any continuous function. The input nodes form the input layer of the network. The outputs are taken from the output nodes, forming the output layer. The middle layer of nodes, visible to neither the inputs nor the outputs, is termed the hidden layer. The discriminant functions of a 3-layer perceptron with M neurons in the hidden layer are of the following form:

$$d_i(x) = f_2\left(\sum_{j=0}^{M} w_{ij}^{2}\, f_1\left(\sum_{r=0}^{d} w_{jr}^{1}\, x_r^{0}\right)\right), \qquad i = 1, \ldots, c$$

where $x_r^{0}$ are the inputs, $w_{ij}^{2}$ and $w_{jr}^{1}$ are the components of the two layers of network weights, d is the dimensionality of the input pattern, and the univariate functions $f_1$ and $f_2$ are typically each set to the sigmoid:

$$f(x) = \frac{1}{1 + e^{-x}}$$

The parameters of the network (i.e. the weights) are modified during learning to optimize the match between outputs and targets, typically by minimizing the total square error using a variant of gradient descent, which is conveniently organized as a backpropagation of errors [6].

3. Features of the amino acid sequence

For ml-based protein fold classification it is necessary to represent the underlying protein as a feature vector, i.e. a vector composed of the values of features representing the protein. To realize that, one of the keys is to find an effective model to represent a protein sample, because the performance of a fold classifier critically depends on the features used. Several methods for the extraction of features of amino acid sequences for protein fold classification have been developed. The most straightforward, sequential model relies on representing a query protein as a series of successive amino acid symbols in a certain order, but it would fail when the query protein did not have significant homology to proteins of known characteristics. The simplest non-sequential or discrete model of a protein $P = R_1 R_2 \cdots R_L$ with $L$ amino acid residues $R_i$ is its amino acid composition (AAC):

$$P = [f_1, \ldots, f_{20}]$$

where $f_j$ (j = 1,...,20) are the normalized occurrence frequencies of the 20 native amino acids in P (a minimal code sketch is given below).
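The sketch below is my own small illustration of the discrete AAC model just defined: a sequence is turned into its 20-dimensional composition vector (the extra features discussed next, such as sequence length, would simply be appended).

```python
# A minimal sketch (my own illustration) of the AAC model above: a protein is
# reduced to the normalized occurrence frequencies of the 20 amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues, one-letter codes

def amino_acid_composition(sequence: str) -> list:
    """Return the 20-dimensional AAC vector (f_1, ..., f_20) of a sequence."""
    sequence = sequence.upper()
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]

# Example on a short, hypothetical fragment
print(amino_acid_composition("MKTAYIAKQR"))
```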
Dubchak et al. [27], [29] first proposed a way to extract global physical-chemical propensities of an amino acid sequence as fold discriminatory features. Together with the AAC, a protein sequence is represented by a set of the following 126 parameters divided into six groups: 1) AAC plus sequence length (21 features collectively denoted by the letter 'C'), 2) predicted secondary structure (21 features denoted by 'S'), 3) hydrophobicity (21 features denoted by 'H'), 4) normalized van der Waals volume (21 features denoted by 'V'), 5) polarity (21 features denoted by 'P') and 6) polarizability (21 features denoted by 'Z'). Secondary structural information based on the three-state model (helix, strand and coil) can be obtained using one of the existing methods for secondary structure prediction, for example PSI-PRED [35]. Apart from the AAC characteristics (the 'C' set of features), all other features were extracted based on the classification of all amino acids into three classes (for example polar, neutral and hydrophobic for the hydrophobicity attribute, see Table 1) in the following way. The descriptors a-composition, transition and distribution were calculated for each attribute to describe, respectively, the global percent composition of each of the three groups in a protein, the percent frequencies with which the attribute changes its index along the entire length of the protein, and the distribution pattern of the attribute along the sequence. In the case of hydrophobicity, for example, the a-composition descriptor 'aC' consists of three numbers: the global percent compositions of polar, neutral and hydrophobic residues in the protein (because, regarding the hydrophobicity attribute, all amino acids are divided into three groups: polar, neutral and hydrophobic). The transition descriptor 'T' consists of three numbers: the percent frequency with which a polar residue is followed by a neutral one or a neutral by a polar residue, and similarly for the other two pairs of residue types. The distribution descriptor 'D' consists of five numbers for each of the three groups: the fractions of the entire sequence where the first residue of a given group is located, and where 25, 50, 75 and 100 percent of the residues of that group are contained. The complete parameter vector for one attribute therefore contains 3 ('aC') + 3 ('T') + 5×3 ('D') = 21 components, so the full feature vector (C, S, H, V, P, Z) counts 6 × 21 = 126 features. A code sketch of these descriptors for the hydrophobicity attribute is given after Table 1.

Table 1. Amino acid attributes and corresponding groups

Attribute            | Group 1                    | Group 2                       | Group 3
Secondary structure  | Helix                      | Strand                        | Coil
Hydrophobicity       | Polar: R,K,E,D,Q,N         | Neutral: G,A,S,T,P,H,Y        | Hydrophobic: C,V,L,I,M,F,W
Van der Waals volume | 0-2.78: G,A,S,C,T,P,D      | 2.95-4.0: N,V,E,Q,I,L         | 4.43-8.08: M,H,K,F,R,Y,W
Polarity             | 4.9-6.2: L,I,F,W,C,M,V,Y   | 8.0-9.2: P,A,T,G,S            | 10.4-13.0: H,Q,R,K,N,E,D
Polarizability       | 0-0.108: G,A,S,D,T         | 0.128-0.186: C,P,N,V,E,Q,I,L  | 0.219-0.409: K,M,H,F,R,Y,W
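The code sketch referred to above illustrates, for the hydrophobicity attribute and the three residue groups of Table 1, how the 3 + 3 + 15 = 21 composition, transition and distribution numbers can be computed. It is my own illustration of the scheme, not the authors' implementation, and assumes a sequence containing only the 20 standard residues.

```python
# A minimal sketch (my own illustration, not the original implementation) of
# the composition ('aC'), transition ('T') and distribution ('D') descriptors
# for one attribute, hydrophobicity, using the residue groups from Table 1.
GROUPS = {
    1: set("RKEDQN"),    # polar
    2: set("GASTPHY"),   # neutral
    3: set("CVLIMFW"),   # hydrophobic
}

def ctd_hydrophobicity(seq: str) -> list:
    """Return the 21 = 3 + 3 + 15 hydrophobicity CTD features of a sequence."""
    labels = [g for aa in seq for g, members in GROUPS.items() if aa in members]
    L = len(labels)
    # composition: global percent of each group in the protein
    comp = [100.0 * labels.count(g) / L for g in (1, 2, 3)]
    # transition: percent frequency of a change between two given groups
    pairs = list(zip(labels, labels[1:]))
    trans = [100.0 * sum(1 for a, b in pairs if {a, b} == {g1, g2}) / (L - 1)
             for g1, g2 in ((1, 2), (1, 3), (2, 3))]
    # distribution: sequence positions (as % of length) of the first residue of
    # each group and of the residues at 25, 50, 75 and 100% of that group
    dist = []
    for g in (1, 2, 3):
        pos = [i + 1 for i, lab in enumerate(labels) if lab == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if pos:
                k = max(1, round(frac * len(pos)))
                dist.append(100.0 * pos[k - 1] / L)
            else:
                dist.append(0.0)
    return comp + trans + dist   # 21 features for this attribute
```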
Pseudo amino acid composition (PseAA) was originally proposed [19] to avoid completely losing the sequence-order information, as happens in the AAC discrete model. In the PseAA model the first 20 factors represent the components of the AAC, while the additional ones incorporate some of the sequence-order information via various modes (i.e., as a series of rank-different correlation factors along a protein chain). The PseAA discrete model can be formulated as [18]:

$$p_u = \frac{f_u}{\sum_{k=1}^{20} f_k + w \sum_{k=1}^{\lambda} \tau_k}, \qquad 1 \le u \le 20$$

$$p_u = \frac{w\, \tau_{u-20}}{\sum_{k=1}^{20} f_k + w \sum_{k=1}^{\lambda} \tau_k}, \qquad 20 + 1 \le u \le 20 + \lambda$$

where $w$ is the weight factor and $\tau_k$ is the k-th tier correlation factor, which reflects the sequence-order correlation between all the k-th most contiguous residues:

$$\tau_k = \frac{1}{L-k} \sum_{i=1}^{L-k} J_{i,i+k}, \qquad k < L$$

with

$$J_{i,i+k} = \frac{1}{3}\left\{ \left[H_1(R_{i+k}) - H_1(R_i)\right]^2 + \left[H_2(R_{i+k}) - H_2(R_i)\right]^2 + \left[M(R_{i+k}) - M(R_i)\right]^2 \right\}$$

where $H_1(R_i)$, $H_2(R_i)$ and $M(R_i)$ are respectively the hydrophobicity, hydrophilicity and side chain mass values for the amino acid $R_i$, and $\lambda$ is a parameter (before substituting these values a special normalization is used; for details see [18]). The n-th order amino acid pair composition proposed by Shamim et al. [53] is calculated using the following formula:

$$f(D^{i,i+n})_j = \frac{N(D^{i,i+n})_j}{L - n}$$

where $N(D^{i,i+n})_j$ is the number of occurrences of the n-th order amino acid pair j (j = 1,...,400) in a protein sequence of length L. These features encapsulate the interaction between the i-th and (i+n)-th amino acid residues and give local order information in a protein. A special case of these features are the bigram and spaced-bigram features proposed by Huang et al. [34], both derived from the N-gram concept. Besides the features extracted directly from amino acid sequences, some features are constructed by exploiting information such as predicted secondary structure, predicted solvent accessibility, functional domains and sequence evolution. Secondary structure-based features are generated from the (predicted) secondary structure profile, for example generated by the PSI-PRED method [35]. Such a profile comprises a state sequence, i.e. a sequence of the three possible symbols representing the states helix (H), strand (E) and coil (C), and three probability sequences, one for each state, giving the probability values with which the states occur along the query amino acid sequence. The first examples of these features are those used by Dubchak et al. [27], [29], as described at the beginning of this section. Chen and Kurgan [12] proposed two new features: 1) the number of different secondary structure segments (DSSS), being the numbers of occurrences of distinct helix, strand and coil structures whose length is above a certain threshold, and 2) the arrangement of DSSS: there are 3^3 = 27 possible segment arrangements, i.e. class-class-class where class = 'H', 'E' or 'C'. Similarly to the n-th order amino acid pair composition features, Shamim et al. [53] defined the secondary structural state frequencies of amino acid pairs, which are calculated as:

$$f(D_k^{i,i+n})_j = \frac{N(D_k^{i,i+n})_j}{L - n}$$

where $N(D_k^{i,i+n})_j$ is the number of occurrences of the n-th order amino acid pair j (j = 1,...,400) found in the state k = (H, E, C). Treating amino acid sequences as time series, Yang and Chen [60] proposed the following procedure for the extraction of new features from the PSI-PRED profile. For each of the three state sequences of secondary structural elements they first applied the chaos game representation, analyzed it by a nonlinear technique, the recurrence quantification analysis (for details see [60]), and then applied the autocovariance (AC) transformation, which is the covariance of the sequence against a time-shifted version of itself:

$$AC_{l,t} = \sum_{i=1}^{L-l} (t_i - \bar{t})(t_{i+l} - \bar{t}) / (L - l), \qquad l = 1, \ldots, l_{max}$$

where $t = (t_1, \ldots, t_L)$ is the input sequence, $\bar{t}$ is the average of all $t_j$, $l$ is the distance between two positions along the sequence, and $l_{max}$ is the maximum value of the shift $l$.
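A minimal sketch of this AC transformation applied to a generic numeric sequence, such as one of the PSI-PRED state probability sequences, is given below; the small value of lmax is purely illustrative.

```python
# A minimal sketch of the autocovariance (AC) transformation above for a
# numeric sequence t = (t_1, ..., t_L); l is the lag (shift) between positions.
def autocovariance(t, l_max=4):
    """Return [AC_{1,t}, ..., AC_{l_max,t}] for the input sequence t."""
    L = len(t)
    mean_t = sum(t) / L
    return [sum((t[i] - mean_t) * (t[i + l] - mean_t) for i in range(L - l)) / (L - l)
            for l in range(1, l_max + 1)]

# Toy usage on a short probability-like sequence
print(autocovariance([0.1, 0.4, 0.8, 0.7, 0.3, 0.2], l_max=2))
```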
Shamim et al. [53] also proposed the solvent accessibility state frequencies of amino acids, calculated as follows:

$$f_i^k = \frac{N_i^k}{L}$$

where k = (B, E) are the solvent accessibility states (B - buried, E - exposed) and $N_i^k$ is the number of occurrences of amino acid i in solvent accessibility state k. For calculating the frequencies they used solvent accessibility states predicted by the method of [13]. Similarly to the n-th order amino acid pair composition features, they defined the solvent accessibility state frequencies of amino acid pairs:

$$f(D_k^{i,i+n})_j = \frac{N(D_k^{i,i+n})_j}{L - n}$$

where $N(D_k^{i,i+n})_j$ is the number of occurrences of the n-th order amino acid pair j (j = 1,...,400) found in the accessibility state k = (B, E, I) (I - partially buried state) or in the secondary structural state k = (H, E, C). Proteins often contain several modules (domains), each with a distinct evolutionary origin and function. Several databases were developed to capture this kind of information, for example the CDD database (version 2.11) [43], which covers 17402 common protein domains and families. In [56] the functional domain (FunD) composition vector for representing a given protein sample was proposed. It is extracted through the following procedure: 1) use RPS-BLAST (reverse PSI-BLAST [52]) to compare the protein sequence with each of the 17402 domain sequences in the CDD database, 2) if the significance threshold value is less than 0.001 for the i-th profile, a hit is found and the i-th component of the protein in the 17402-dimensional space is assigned 1, otherwise 0. Evolutionary-based features are mainly extracted from the position-specific scoring matrix (PSSM) profile generated by the PSI-BLAST program [1]. PSI-BLAST aligns a given query amino acid sequence to the NCBI non-redundant database. Using the multiple sequence alignment, PSI-BLAST counts the frequency of each amino acid at each position of the query sequence and generates a 20-dimensional vector of amino acid frequencies for each position in the query sequence; thus the element $S_{ij}$ of the PSSM matrix reflects the probability of amino acid i occurring at position j. More often than the absolute frequencies, the relative frequencies are tabulated in a profile (i.e. relative to the probability of a sequence in a random functional site). The generated profile contains evolutionary information, i.e. it can be used to identify key positions of conserved amino acids and positions that undergo mutations. Chen and Kurgan [12] extracted from the 20-dimensional PSSM profile a profile-based composition vector (PCV) in such a way that the negative elements of the PSSM profile are first replaced by zero, and then each column is averaged. However, in such a representation valuable evolutionary information would definitely be lost. To avoid this, Shen and Chou [56] proposed the pseudo position-specific scoring matrix (PsePSSM) by adding to the profile-based composition vector the correlation factors defined as:

$$\theta_j^{\lambda} = \frac{1}{L - \lambda} \sum_{i=1}^{L - \lambda} \left[ S_{ij} - S_{(i+\lambda)j} \right]^2, \qquad j = 1, 2, \ldots, 20$$

where $\theta_j^{1}$ is the correlation factor coupling the most contiguous PSSM scores along the protein chain for the amino acid type j, $\theta_j^{2}$ is the same but for the second-most contiguous PSSM scores, and so forth.
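The sketch below is my own illustration of the two profile-derived representations just described: the PCV of Chen and Kurgan and the PsePSSM correlation factors of Shen and Chou. The PSSM is assumed here to be an L x 20 list of rows (one row per sequence position, one column per amino acid type), already normalized; this layout is an assumption made only for the illustration.

```python
# Minimal sketches (my own illustrations) of the profile-based composition
# vector (PCV) and of the PsePSSM correlation factors described above. The
# PSSM is assumed to be an L x 20 list of rows: one row per sequence position,
# one column per amino acid type, with scores already normalized.
def profile_composition_vector(pssm):
    """20-dim PCV: negative scores are set to zero, then each column is averaged."""
    L = len(pssm)
    clipped = [[max(0.0, s) for s in row] for row in pssm]
    return [sum(row[j] for row in clipped) / L for j in range(20)]

def psepssm_correlation(pssm, lag=1):
    """Correlation factors coupling PSSM scores of residues `lag` positions apart."""
    L = len(pssm)
    return [sum((pssm[i][j] - pssm[i + lag][j]) ** 2 for i in range(L - lag)) / (L - lag)
            for j in range(20)]
```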
Another approach is proposed in [60]. Global features are extracted from the PSSM matrix by first using a special normalization followed by the consensus sequence (CS) transformation:

$$\alpha(i) = \arg\max_j \{ f_{ij} : 1 \le j \le 20 \}, \qquad 1 \le i \le L$$

where $f_{ij}$ denotes the normalized value of the element $S_{ij}$ of the PSSM, and then computing:

$$AAC_{CS}(j) = \frac{n(j)}{L}, \qquad 1 \le j \le 20$$

where n(j) is the number of occurrences of the amino acid j in the CS. Two additional global features represent the entropy of the feature set:

$$E_{CS} = -\sum_{j=1}^{20} AAC_{CS}(j) \ln AAC_{CS}(j)$$

$$E_{FM} = -\frac{1}{L} \sum_{i=1}^{L} \sum_{j=1}^{20} f_{ij} \ln f_{ij}$$

the last computed on the raw, normalized PSSM. To extract local features, they first divide the raw, normalized PSSM into non-overlapping fragments of equal length. Then, for each fragment s, 20 features are computed as the average occurrence frequency of the amino acid j in the fragment s during the evolution process (for details see [60]). Each residue in an amino acid sequence has many physical-chemical properties, so a sequence may be viewed as a time series of the corresponding properties. In [28] Dong et al. proposed features extracted from the PSSM using the AC transformation. The result measures the correlation of the same property between two residues separated by a distance l along the sequence:

$$AC(i, l) = \sum_{j=1}^{L-l} (S_{ij} - \bar{S}_i)(S_{i(j+l)} - \bar{S}_i) / (L - l)$$

where i is one of the residues, L is the protein sequence length, $S_{ij}$ is the PSSM score of amino acid i at position j, and $\bar{S}_i$ is the average score of amino acid i along the whole sequence. They also proposed a transformation measuring the correlation of two different properties between two residues separated by l along the sequence:

$$CC(i_1, i_2, l) = \sum_{j=1}^{L-l} (S_{i_1 j} - \bar{S}_{i_1})(S_{i_2 (j+l)} - \bar{S}_{i_2}) / (L - l)$$

where $i_1$ and $i_2$ are two different amino acids. Slightly different from the methods described above are the feature extraction methods based on kernels. A core component of each kernel method (for example the SVM described above) is the kernel function, which measures the similarity between any pair of examples. Different kernels correspond to different notions of similarity and can lead to discriminant functions with different performance. One of the early approaches for deriving a kernel function for protein classification was the SVM-pairwise scheme [39], which represents each sequence as a vector of pairwise similarities to all sequences in the training set. A relatively simpler feature space that contains all possible short subsequences ranging from 3 to 8 amino acids (kmers) is explored in [38]. A sequence x is represented here as a vector in which a particular dimension u (kmer) is present in the x vector (has non-zero weight) if x contains a substring that differs from u in at most a predefined number of positions (mismatches). An alternative to measuring pairwise similarity through a dot-product of vector representations is to calculate an explicit protein similarity measure. The method of [51] measures the similarity between a pair of protein sequences by taking into account all the optimal local alignment scores with gaps between all of their possible subsequences. In the work described in [48], new kernel functions were developed that are derived directly from explicit similarity measures, utilize sequence profiles constructed automatically via PSI-BLAST [1] and employ a profile-to-profile scoring scheme developed by extending the profile alignment method of [52]. The first kernel function, window-based, determines the similarity between a pair of sequences by using different schemes to combine ungapped alignment scores of certain fixed-length subsequences. The second, local alignment-based, determines the similarity between a pair of sequences using Smith-Waterman alignments and a position-independent affine gap model, optimized for the characteristics of the scoring system. Experiments with the fold classification problem show that these kernels together with SVM [48] are capable of producing excellent results; the overall performance measured on the DD dataset is 67.8%.
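To make the kernel-based representations more concrete, the sketch below illustrates the idea behind the SVM-pairwise scheme: a sequence is described by its similarities to all training sequences. The similarity function here is a deliberately crude placeholder (fraction of identical positions over the shorter length); the original scheme uses pairwise alignment scores, so this is an illustration of the representation only, not of the method in [39].

```python
# A rough sketch of the pairwise-similarity representation behind the
# SVM-pairwise scheme. The similarity used here is a crude placeholder;
# the original scheme uses pairwise alignment scores instead.
def toy_similarity(a: str, b: str) -> float:
    """Fraction of identical positions over the shorter of the two sequences."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n if n else 0.0

def pairwise_feature_vector(seq: str, training_seqs) -> list:
    """Represent `seq` as a vector of its similarities to every training sequence."""
    return [toy_similarity(seq, t) for t in training_seqs]

# Toy usage
print(pairwise_feature_vector("MKTAYI", ["MKTAYI", "MKSAYL", "GGGGGG"]))
```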
4. Protein fold machine learning-based classification methods

4.1 Datasets used in the described experiments

Most implementations of ml-based protein fold classification methods have adopted the SCOP (Structural Classification of Proteins) architecture [42], with which a query protein is classified into one of the known folds. Most of these methods use, for the construction of a protein fold classifier, the dataset (training and testing) developed by Ding and Dubchak [27], [29] (the DD dataset). The DD dataset contains 311 and 383 proteins for training and testing, respectively. This dataset was formed such that, in the training set, no two proteins have more than 35% sequence identity to each other and each fold has seven or more proteins. In the test dataset, proteins have no more than 40% sequence identity to each other and no more than 35% identity to proteins of the training set. The proteins from the training and testing datasets belong to 27 different folds (according to SCOP [42]), representing all major structural classes: α, β, α+β and α/β. These are the following 27 fold types: 1) globin-like, 2) cytochrome c, 3) DNA-binding 3-helical bundle, 4) 4-helical up-and-down bundle, 5) 4-helical cytokines, 6) EF-hand, 7) immunoglobulin-like β-sandwich, 8) cupredoxins, 9) viral coat and capsid proteins, 10) ConA-like lectins/glucanases, 11) SH3-like barrel, 12) OB-fold, 13) beta-trefoil, 14) trypsin-like serine proteases, 15) lipocalins, 16) (TIM) β/α-barrel, 17) FAD (also NAD)-binding motif, 18) flavodoxin-like, 19) NAD(P)-binding Rossmann fold, 20) P-loop, 21) thioredoxin-like, 22) ribonuclease H-like motif, 23) hydrolases, 24) periplasmic binding protein-like, 25) β-grasp, 26) ferredoxin-like, 27) small inhibitors, toxins, lectins. Of the above 27 fold types, types 1-6 belong to the all-α structural class, types 7-15 to the all-β class, types 16-24 to the α/β class and types 25-27 to the α+β class. Later, researchers (see for example [60]) found some duplicate pairs between the training and testing sequences in the DD dataset. After excluding such sequences, a new dataset called the revised DD dataset (RDD) was created. Another, extended DD dataset (called EDD) was constructed by adding further protein samples. It is based on the Astral SCOP (http://astral.berkeley.edu/), in which any two sequences have less than 40% identity. To cover more folds, further datasets comprising 86, 95, 194 and 199 folds were constructed (F86, F95, F194, F199); for a detailed description see [28], [60].

4.2 Methods

Supervised ml-based methods for protein fold prediction have gained great interest since the work described in Craven et al. [22]. Craven et al. obtained several sequence-derived features, i.e., average residue volume, charge and polarity composition, predicted secondary structure composition, isoelectric point and the Fourier transform of the hydrophobicity function, from a set of 211 proteins belonging to 16 folds and used these sequence attributes to train and test the following popular classifiers in the 16-class fold assignment problem: decision trees, k nearest neighbor and neural network classifiers.
Ding and Dubchak [27], [29] first experimented with the one-versus-others, unique one-versus-others and all-versus-all methods, using neural networks or SVMs as classifiers in multiple binary classification tasks on the DD dataset of proteins (using the global description of the amino acid sequence described in the previous section). They were able to recognize the correct fold with an accuracy of approximately 56%. Here, the accuracy refers to the percentage of proteins whose fold has been correctly identified on the test set. Other researchers have tried to improve prediction performance by either incorporating new features (as described in the previous section) or developing novel algorithms for multi-class classification (for example fusion of different classifiers). A modified nearest neighbor algorithm called K-local hyperplane (HKNN) was used by Okun [46] (with an overall accuracy of 57.4% on the DD dataset). Classifying the same dataset and input features as in Dubchak [29], and employing a Bayesian network-based approach [32], Chinnasamy et al. [14] improved the average fold recognition results reported by Dubchak [29] to 60%. Nanni [45] proposed a specialized ensemble (SE) of K-local hyperplane classifiers based on random subspace and feature selection and achieved 61.1% total accuracy on the DD dataset. Classifiers in this ensemble can be built on different subsets of features, either disjoint or overlapping. Feature subsets for a given classifier with a "favourite" class are found as those that best discriminate this class from the others (i.e. in the context of the defined distance measure). For the prediction of protein folding patterns, Shen and Chou proposed in [56] the ensemble classifier known as PFP-Pred, constructed from nine individual ET-KNN [25] (evidence-theoretic k-nearest neighbors) classifiers, each operating on only one of the inputs (in order not to reduce the cluster-tolerant capacity), and obtained an accuracy of 62.1%. The ET-KNN rule is a pattern classification method based on the Dempster-Shafer theory of belief functions. Near-optimal parameters of each such component classifier were obtained using the optimization procedure from [64], resulting in the optimized OET-KNN classifier. As a protein representation they used the features from Dubchak et al. [27], [29] (except the composition) as well as different dimensions of the pseudo-amino acid composition, i.e. with four different values of the parameter λ (see the description in the previous section), together nine groups of features. Rather than using a combined correlation function, they proposed the alternate correlation function between the hydrophobicity and hydrophilicity of the constituent amino acids to reflect sequence-order effects (for details see [56]). The outcomes of the individual classifiers were combined through weighted voting to give a final determination for classifying a query protein. Chmielnicki and Stąpor [15], [16] proposed a hybrid classifier of protein folds composed of a regularized Gaussian classifier and an SVM. Using a feature selection algorithm to select the most informative features from those designed by Dubchak et al. [27], they obtained an accuracy of 62.6% on the DD dataset. In [33] Guo and Gao presented the hierarchical ensemble classifier named GAOEC (Genetic-Algorithm Optimized Ensemble Classifier) for protein fold recognition. As the component classifier they proposed a novel optimized GAET-KNN classifier, which uses a GA to generate the optimum parameters of ET-KNN to maximize classification accuracy.
Two layers of GAET-KNNs are used to classify query proteins into the 27 folds. As in Dubchak et al. [27], six kinds of features are extracted from every protein in the DD dataset. Six component GAET-KNN classifiers in the first layer are used to get a potential class index for every query protein. According to the result of the first layer, every component classifier of the second layer generates a 27-dimensional vector whose elements represent the confidence degrees of the 27 folds. A genetic algorithm is used to generate weights for the outputs of the second layer to get the final classification result. The overall accuracy of GAOEC is 64.7%. In [28] a protein fold recognition approach is presented that uses an SVM and features containing evolutionary information extracted from the PSSM by the AC transformation described in the previous section. Two kinds of AC transformation were proposed, resulting in two kinds of features: 1) measuring the correlation between the same property and 2) between two different properties. Two versions of the classifier were examined: with features 1) and with the combination of 1) and 2), resulting in performances of 68.6% and 70.1%, respectively (on the DD dataset using 2-fold cross-validation). Using the EDD, F86 and F199 datasets, the performances computed by 5-fold cross-validation for the combination of the 1) and 2) feature sets reach 87.6%, 80.9% and 77.2%, respectively. Kernel matrices encode the similarity between data objects within a given input space. Using kernel-based learning methods, the problem of integrating heterogeneous data sources can be transformed into the problem of learning the most appropriate combination of their kernel matrices. The approach proposed in [24] utilizes four of the state-of-the-art string kernels built for proteins and combines them into an overall composite kernel on which a multinomial probit kernel machine operates. The approach is based on the ability to embed each object description via the kernel trick [54] into a kernel space. This produces a similarity measure between proteins in every feature space and then, having a common measure, these similarities are informatively combined into a composite kernel space on which a single multi-class kernel machine can operate effectively. The performance obtained using this method on the DD dataset is 68.1%. Chen and Kurgan [12] proposed the fold recognition method PFRES using features generated from PSI-BLAST and PSI-PRED profiles (as described in the previous section) and a voting-based ensemble of three different classifiers: SVM, Random Forest [8] and Kstar [20]. Using an entropy-based feature selection algorithm resulting in a compact representation (36 features) [63], they obtained 68.4% accuracy on the DD dataset. In [40] a two-level classification strategy called hierarchical learning architecture (HLA) using neural networks for protein fold recognition was proposed. It relies on two indirect coding features (based on the bigram and spaced-bigram features described in the previous section) as well as a combinatorial fusion technique to facilitate feature selection and combination. The resulting accuracy is 69.6% for the 27 folding classes of the DD dataset. One of the novelties is the notion of a diversity score function between a pair of features. This parameter is used to select appropriate and diverse features for combination. It may be possible to achieve better results with a combination of more than two features.
Shamim et al. [53] developed a method for protein fold recognition based on an SVM classifier (with three different multi-class methods: one versus all, one versus one and the Crammer/Singer method [21]) that uses the secondary structural state and solvent accessibility state frequencies of amino acids and amino acid pairs as feature vectors (as described in the previous section). Among the feature combinations, the combination of the secondary structural state and solvent accessibility state frequencies of amino acids and first-order amino acid pairs gave the highest accuracy, 70.5% (measured using 5-fold cross-validation) on the EDD dataset. Shen and Chou developed the new method PFP-FunD for protein fold pattern recognition using the functional domain (FunD) composition vector and features extracted from the PsePSSM matrix (as described in the previous section) with the previously designed OET-KNN ensemble classifier, obtaining an accuracy of 70.5%. Recently Yang and Chen [60] developed the fold recognition method TAXFOLD, which extensively exploits the sequence evolution information from PSI-BLAST profiles and the secondary structure information from PSI-PRED profiles. A comprehensive set of 137 features is constructed as described in the previous section, which allows for the depiction of both global and local characteristics. They tested different combinations of the extracted features. It follows that PSI-BLAST and PSI-PRED features make complementary contributions to each other, and it is important to use both kinds of features for enhanced protein fold recognition. The consensus sequences contain much more evolution information than the amino acid sequences, thereby leading to more accurate protein fold recognition. The best accuracy of TAXFOLD is 71.5% on the DD dataset (83.2%, 86.29%, 76.5% and 72.6% on the RDD, EDD, F95 and F194 datasets, respectively). In the kernel-based learning method of [62] a novel information-theoretic approach is proposed to learn a linear combination of kernel matrices through the use of a Kullback-Leibler divergence between the output kernel matrix and the input one. Depending on the position of the input and output kernel matrices in the divergence, there are two formulations of the resulting optimization problem: one solved by a difference-of-convex programming method and one by a projected gradient descent algorithm. The method improves the fold discrimination accuracy to 73.3% (on the DD dataset). Another group of methods used for protein fold recognition, different from those described in Section 2, are those based on the Hidden Markov Model (HMM) [32]. The most promising result here was achieved by Deschavanne and Tuffery [26], who employed a hidden Markov structure alphabet as an additional feature and obtained an accuracy of 78% (on the EDD database).

5. Discussion, conclusions and future work

This paper is not a comparison among the existing protein fold classification methods; it is only a review. To make such a comparison between the performance of different classifiers, one should implement the described methods and use special, dedicated statistical tests to make sure that the differences are statistically significant. But some conclusions of a general nature can be drawn. First, it should be noted that the reported accuracies of the classifiers were estimated using different methods: by applying an independent test dataset or 2-fold (5-fold) cross-validation on the training dataset, and sometimes using different datasets. The accuracies obtained using different estimation methods have different bias and variance.
Moreover, these are only the values of point estimators; confidence intervals, for example, would give valuable information on the standard deviation of the prediction error. Considering the nature of the protein fold prediction problem, where the fold type of a protein can depend on a large number of protein characteristics, and also noting that the number of fold types approaches 1000, it is straightforward to see the need for a methodological framework that can cope with a large number of classes and can incorporate as many feature spaces as are available. As mentioned above, the existing protein fold classification methods have produced several fold discriminatory data sources (i.e. groups of attributes such as amino acid composition, predicted secondary structure, and selected structural and physicochemical properties of the constituent amino acids). One of the problems is how to integrate many fold discriminatory data sources systematically and efficiently, without resorting to ad hoc ensemble learning. One solution is the kernel methods, which have been successfully used for data fusion in many biological applications. But the integration of heterogeneous data sources is still a challenging problem (problem 1). It is known that the performance of many classifiers (like, for example, the widely used SVM) depends on the size of the dataset used: a look at the DD dataset reveals that many folds are sparsely represented. Training in such a case becomes skewed towards populous folds labeled as positive rather than less populated folds labeled as negative (problem 2). An alternative class structure should be developed, for example by re-evaluation of the current class structure to determine classes which should be aggregated or discarded, or by incorporating larger sets of folding classes. Another source of error in the described methods are inappropriate features of a protein sequence (i.e. features with small discriminatory power). Moreover, incorrectly predicted features, like the secondary structure or solvent accessibility ones, can also decrease classification performance. Extracting a set of highly discriminative features from amino acid sequences remains a challenging problem (problem 3), though the newest work [60] shows that even using a single classifier, but with carefully designed (most discriminative) features, it is possible to obtain a very good classification performance, even greater than 80%. This is a very acceptable accuracy for the 27-class classification problem: a random classifier would achieve only about 3.7% (1/27 x 100). The main source of the achieved improvement is attributed to the application of features extracted from the PSI-BLAST profile, which contains evolutionary information, with suitable transformations. Finally, employing an appropriate classifier or a method for combining classifiers into an ensemble also has a critical effect on the problem. Using weak classifiers on separate feature spaces together with a carefully designed combining strategy (learning algorithm) could give results much higher than 70% (problem 4).

PROTEIN FOLD CLASSIFICATION BASED ON MACHINE LEARNING PARADIGM - A REVIEW

Loading next page...
 
/lp/de-gruyter/protein-fold-classification-based-on-machine-learning-paradigm-a-LkIDms023Z
Publisher
de Gruyter
Copyright
Copyright © 2012 by the
ISSN
1895-9091
eISSN
1896-530X
DOI
10.2478/bams-2012-0003
Publisher site
See Article on Publisher Site

Abstract

Protein fold recognition using machine learning-based methods is crucial in the protein structure discovery, especially when the traditional sequence comparison methods fail because the structurally-similar proteins share little in the way of sequence homology. Many different machine learning-based fold classification methods have been proposed with still increasing accuracy and the main aim of this article is to cover all the major results in this field. KEYWORDS: supervised learning algorithm, classifier, features, protein fold recognition 1. Introduction Proteins are indispensable for the existence and proper functioning of biological organisms [37]. Proteins are biochemical compounds consisting of one or more polypeptides which are single linear polymer chain of amino acids bonded together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues. The sequence of amino acids in a protein known as primary structure is defined by the sequence of genes which is encoded in the genetic code which, in general, specifies 20 standard amino acids. One of the most distinguishing features of polypeptides is their ability to fold typically into a globular or fibrous state, or "structure", that means 3D (three-dimensional) or tertiary structure. According to Anfinsen [2], the proteins can fold to their native structures spontaneously, therefore he stated that protein fold is coded in the amino acid sequence itself, but it is still not clear as to how structure is encoded in a sequence and, therefore, it is an open problem of much scientific interest in computational biology. The secondary structure of proteins is the characterization of a protein with respect to certain local structural conformations like -helices, -sheets (strands) and other such as loops, turns and coils. Fold can be defined as a three-dimensional pattern characterised by a set of major secondary structure conformations with certain arrangement and their topological connections. The structure of a protein serves as a medium through which to regulate either the function of a protein or activity of an enzyme. Understanding of how proteins fold in three-dimensional space can reveal significant information of how they function in biological reactions. Protein's function is strongly influenced by its structure ([9], [37], [49], [50]). Currently, sequencing projects rapidly produce protein sequences, but the number of 3D protein structures increases slowly due to the expensive and time-consuming conventionaaboratory methods, namely X-ray crystallography and nuclear magnetic resonance (NMR). Moreover, not all proteins are amenable to experimental structure determination. The protein sequence data banks such as Universal Protein Resource (UniProtKB/TrEMBL) [3] contains now more than 16 000 000 protein sequence entries, while the number of stored protein structures in Protein Data Bank (PDB) [5] is about 74 000. This leads to the necessary alternative to experimental determination of 3D protein structures, the computational methods like ab initio and homology modeling ones. Ab initio methods seek to build 3D protein models "from scratch", i.e., based on physical principles [23],[61]. There are many possible procedures that either attempt to mimic protein folding or apply some stochastic method to search the space of possible solutions. 
The two major problems here are calculation of protein free energy and finding the global minimum of this energy which require vast computational resources, and have thus only been carried out for tiny proteins. These problems can be partially bypassed in the homology-based methods [23], [61], when the search space is pruned by the assumption that the protein in question adopts a structure that is close to the experimentally determined structure of another homologous protein. Because a protein's fold is more evolutionarily conserved than its amino acid sequence, a target sequence can be modeled with reasonable accuracy on a very distantly related template, provided that the relationship between target and template can be discerned through sequence alignment. But the hurdle exists when the query protein does not have any structure-known homologous protein in the existing databases. Facing this kind of situation, predicting 3D structure of a protein was converted to a problem of protein fold recognition, i.e. identyfying which fold pattern it belongs to. Protein fold recognition methods have taken central stage as fold information could facilitate the identification of a protein tertiary structure and function. Many methods have been developed which are used to assigning folds to protein sequences. They can be broadly classified into three groups: 1) sequence-structure homology recogntion methods (for example [57]), 2) threading methods (for example Threader [36])) machine learning-based (ml-based) methods (also called taxonomy-based ). Sequence-structure homology and threading methods align target sequence onto known structural templates and calculate their sequencestructure compatibilities (scores) using for example environment-specific substitution tables or pseudo-energy-based functions, and the template with the best score is assumed to be the fold of the target sequence. While these methods each are effective in certain cases, there are drawbacks of both approaches. The first will fail when two proteins are structurally-similar but share little in the way of sequence homology. Threading methods rely on data derived from solved structures, but as we mentioned, the number of proteins whose structure has been solved is much smaller then the number of proteins that have been sequenced. These methods have not been able to achieve accuracies greater than 30%. In recent years, the machine learning-based methods have attracted great attention due to its encouraging performance. Ml-based methods for protein fold recognition assume that the number of protein folds in the universe is limited, according to [17] about 1000 and therefore, the protein fold recognition can be viewed as a fold classification problem. The last one can be formulated as the construction of a classifier using the learning (training) set, i.e. the sequence-derived features (properties) of proteins whose structure (fold) is known. The procedure for construction of a classifier is called supervised learning or classifier training. Its role in the fold classification task is to induce a mappings from primary sequences to folding classes. Such a trained classifier can then be used to assign a structure-based label (class of fold) to an unknown protein (i.e. a protein whose structure has yet not been solved). 
To implement a classification task, two major procedures are generally required: 1) feature generation, which refers to a procedure by which features are "obtained" from a query amino acid sequence so as to represent the underlying protein as a fixed-length numerical vector, and 2) a machine learning classifier, which is a function (implemented in software) assigning a numerical vector (here representing the protein with unknown structure) to one of the existing classes (here fold types). Many different ML-based fold classification methods have been proposed with still increasing accuracy, and the main aim of this article is to cover all the major results in this field.

The organization of the rest of this paper is as follows. The second section explains the main ideas and principles of ML-based classification methods and briefly describes the four most popular ones: the Gaussian, nearest neighbors, support vector machine and multi-layer perceptron classifiers. The third section deals with protein feature generation methods (for the purpose of fold classification). In the fourth section, after describing the datasets of protein folds used in the experiments from the literature, the most important fold classification methods are presented. The last section contains the conclusions drawn from the current study as well as directions for future research in the field of ML-based fold prediction.

2. Supervised machine learning classification techniques

Machine learning is a branch of artificial intelligence concerned with the development of learning algorithms that allow computers to evolve their behavior based on empirical data (examples). From the examples a learning algorithm captures characteristics of interest, for example the underlying probability distribution, in order to automatically recognize complex patterns and make intelligent decisions based on data. Because the examples compose only a small subset of the whole population, the learning algorithm must generalize from the given examples, so as to be able to produce a useful output for future, new cases.

The objective of statistical classification is to assign a new observation $x$ to one of $c$ pre-specified classes. The supervised machine learning classification problem [6] can be formally stated as follows. Suppose we have a dataset of $n$ examples (the training dataset) $U = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where each $x_i = (x_{i1}, \ldots, x_{id})$ represents an observation and $y_i \in \{1, \ldots, c\}$ is a categorical variable, the class label. We seek a function $d(x)$ whose value can be evaluated for any new observation $x$ (i.e. one not included in the training dataset) and such that the label $\hat{y} = d(x)$ predicted for that new observation is as close as possible to its true class label $y$. The function $d(x)$, known as a classifier, is an element of some space of possible functions, usually called the hypothesis space. In the Bayesian decision framework, in order to measure how well a function fits the training data, a loss function $L(y, d(x))$ for penalizing errors in prediction is defined. By far the most common is the 0-1 loss function, where all misclassifications are charged a single unit. This leads to a criterion for choosing $d(x)$: the expected prediction error $E[L(y, d(x))]$, where the expectation is taken with respect to the joint probability distribution $f(x, y)$. Conditioning on $x$ and using the 0-1 loss function we obtain a solution, a function $d(x)$ of the form:

$$d(x) = \arg\max_{i} P(i \mid x), \quad i = 1, \ldots, c$$

or equivalently, using the Bayes rule:

$$d(x) = \arg\max_{i} f(x \mid i) P(i)$$

where $f(x \mid i)$ is the class-conditional density and $P(i)$ is the a priori probability of class $i$. The obtained classifier $d(x)$, known as the Bayes classifier, says that we classify to the most probable class using the conditional distribution.
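To make the decision rule concrete, the following toy sketch applies it to two artificial one-dimensional Gaussian classes; the class means, standard deviations and priors are invented purely for illustration and are not taken from any of the cited works.

```python
# A toy illustration of the Bayes classifier d(x) = argmax_i f(x|i) P(i):
# two hypothetical one-dimensional Gaussian classes with known densities.
from scipy.stats import norm

class_params = {1: dict(mean=0.0, std=1.0, prior=0.7),
                2: dict(mean=2.5, std=1.0, prior=0.3)}

def bayes_classify(x):
    # assign x to the class with the largest f(x|i) * P(i)
    scores = {i: norm.pdf(x, p["mean"], p["std"]) * p["prior"]
              for i, p in class_params.items()}
    return max(scores, key=scores.get)

print(bayes_classify(0.2), bayes_classify(2.4))   # -> 1 2
```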
Classifiers come in a great diversity of techniques and algorithms. Each classifier can be uniformly defined by a set of $c$ discriminant functions. Each class $i$ has its own discriminant function $d_i(x)$, designed in such a way that for each object from class $i$ the value of the corresponding discriminant function is (should be) the largest among all $c$ values:

$$d_i(x) > d_j(x), \quad j = 1, \ldots, c, \; j \neq i$$

Discriminant functions are determined from the training set using different algorithms, depending on the particular classifier. This procedure is known as classifier training or learning. The procedure of building a classifier for a particular application typically comprises the following steps [58]: 1) data collection (on appropriate features), 2) data preprocessing (for example normalization, outlier detection), 3) feature selection/extraction (to avoid the curse of dimensionality [6]), 4) classifier training and validation of its internal parameters, 5) classifier testing to estimate its performance. Most often, performance is estimated on a separate testing set (if available) or using the resampling technique of N-fold cross-validation (N-CV). The N-CV method consists of the following steps: 1) randomly divide all the training examples into N equal-sized subsets (usually N = 10, but in general N depends on the size of the training dataset), 2) use all but one subset of examples to train the classifier, 3) measure the classification performance on the remaining subset by means of the percentage accuracy, 4) repeat steps 2 and 3 for each subset, 5) average the results to obtain an estimate of the performance of the classifier. The built classifier can then be used for making predictions on new, unknown observations. An unknown observation $x$ is usually assigned to the class whose discriminant function $d_i(x)$ has the largest value.

2.1 Gaussian classifiers

The discriminant functions of the Bayes classifier described above are of the form:

$$d_i(x) = f(x \mid i) P(i), \quad i = 1, \ldots, c$$

Many classification methods are based on the Bayes classifier, including parametric and nonparametric ones, according to the estimation method used. The most popular are the parametric Gaussian classifiers, which use the Gaussian (normal) density:

$$f(x \mid i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)$$

where $\mu_i$ and $\Sigma_i$ are the parameters of the distribution (the mean vector and the covariance matrix) and $d$ is the space dimensionality. In real situations, the parameters $\mu_i$ and $\Sigma_i$ as well as the a priori class probabilities $P(i)$ are replaced by their maximum likelihood (ML) estimates based on a training set:

$$\hat{\mu}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_j, \qquad \hat{\Sigma}_i = \frac{S_i}{n_i} = \frac{1}{n_i} \sum_{j=1}^{n_i} (x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^T, \qquad \hat{P}(i) = \frac{n_i}{n}, \quad i = 1, \ldots, c$$

where the sums run over the $n_i$ training examples of class $i$.
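The following minimal sketch illustrates these maximum likelihood estimates on a randomly generated placeholder training set; the data, the number of classes and the dimensionality are arbitrary assumptions made only for the example.

```python
# Per-class ML estimates for the Gaussian classifier: mean vector, covariance
# matrix (divided by n_i, as in the ML estimate) and prior probability.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))          # 90 placeholder training vectors, d = 4
y = rng.integers(1, 4, size=90)       # placeholder labels from c = 3 classes

estimates = {}
for i in np.unique(y):
    Xi = X[y == i]
    estimates[int(i)] = {
        "mu": Xi.mean(axis=0),                         # hat{mu}_i
        "sigma": np.cov(Xi, rowvar=False, bias=True),  # hat{Sigma}_i = S_i / n_i
        "prior": len(Xi) / len(X),                     # hat{P}(i) = n_i / n
    }

print(len(estimates), estimates[1]["sigma"].shape)     # 3 (4, 4)
```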
Plugging these ML estimates into the discriminant function of the Bayes classifier results in the quadratic discriminant function of the Gaussian classifier:

$$d_i(x) = -\frac{1}{2} (x - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x - \hat{\mu}_i) - \frac{1}{2} \ln |\hat{\Sigma}_i| + \ln \hat{P}(i)$$

In the simplest, special case where the covariance matrices in all $c$ classes are identical ($\Sigma_i = \Sigma$, $i = 1, \ldots, c$), the discriminant function of the Gaussian classifier is linear:

$$d_i(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_i - \frac{1}{2} \hat{\mu}_i^T \hat{\Sigma}^{-1} \hat{\mu}_i + \ln \hat{P}(i)$$

However, when the number of training examples is small compared to the number of dimensions $d$, there may be a problem in obtaining good ML estimates of the class covariance matrices. One solution is to use the regularized estimators proposed in [30]:

$$\hat{\Sigma}_i(\lambda) = (1 - \lambda)\, \hat{\Sigma}_i + \lambda\, \hat{\Sigma}, \quad 0 \leq \lambda \leq 1$$

where the parameter $\lambda$ determines the amount of "shrinkage" of the individual matrices $\hat{\Sigma}_i$ towards the pooled (average) covariance matrix $\hat{\Sigma}$:

$$\hat{\Sigma} = \frac{S}{n}, \qquad S = \sum_{i=1}^{c} S_i, \qquad S_i = \sum_{j=1}^{n_i} (x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^T$$

Such a classifier is called a regularized Gaussian classifier.

2.2 Nearest neighbors classifiers

Another possibility is to use nonparametric density estimators, for example kernel or nearest neighbors estimators [58], which lead to different nonparametric Bayes classifiers. Based on the nearest neighbor density estimator and using the discriminant function of the Bayes classifier we obtain the very popular k nearest neighbor (kNN) classifier. This classifier is based on a distance function for pairs of observations, such as the Euclidean distance, and proceeds as follows to classify test set observations on the basis of the training set. For each element of the test set: 1) find the k closest observations in the training set, and 2) predict the class label by majority vote, i.e., choose the class that is most common among the k neighbors. The number of neighbors k is usually chosen in a validation step using the N-fold cross-validation procedure described above.

2.3 Support Vector Machine

The linear support vector machine (SVM) binary classifier [6], [58] is defined by the optimal separating hyperplane (OSH), i.e., the one which maximizes the separation margin, defined as the distance between the hyperplane and the closest training observations (called support vectors). When the data are not linearly separable, a non-linear transformation is used to map (indirectly) the input data vectors $x \in X$ from the original feature space into a higher-dimensional Hilbert space $F$ using a kernel function $K(x, z)$, which is a function such that:

$$K(x, z) = \langle \Phi(x), \Phi(z) \rangle$$

for all vectors $x, z \in X$ (where $\langle \cdot, \cdot \rangle$ denotes an inner product). We also consider the kernel matrix:

$$K = \left( K(x_i, x_j) \right)_{i,j=1}^{n}$$

It is a symmetric positive definite matrix, and since it specifies the inner products between all pairs of points $\{x_i\}_{i=1}^{n}$, it completely determines their relative positions in the Hilbert space $F$. The solution sought by kernel-based learning algorithms such as the SVM is a discriminant function that is linear in the Hilbert space $F$:

$$d(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right)$$

where $0 \leq \alpha_i \leq C$ ($i = 1, \ldots, n$) are Lagrange multipliers, $C$ is a regularization parameter and $b$ is a constant, both obtained through numerical optimization during learning. An important issue in applications is that of choosing a kernel $K$ for a given learning task (i.e. one that induces the right metric in the space). One of the most widely used "standard" kernel functions is the Gaussian kernel:

$$K(x_i, x_j) = \exp\left( -\frac{1}{2\sigma^2} \| x_i - x_j \|^2 \right)$$

where the parameter $\sigma$ is the width of the kernel.
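The sketch below computes the Gaussian kernel matrix and evaluates the decision function in the form given above; the multipliers and the offset would normally be produced by the SVM optimization, so the random values used here are placeholders for illustration only.

```python
# Gaussian (RBF) kernel, kernel matrix and the form of the SVM decision function.
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))          # placeholder training vectors x_1..x_n
y = rng.choice([-1, 1], size=30)       # placeholder binary labels
alpha = rng.uniform(0.0, 1.0, size=30) # placeholder multipliers, 0 <= alpha_i <= C
b = 0.1                                # placeholder offset

K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])  # kernel matrix

def svm_decision(x):
    # d(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b )
    return np.sign(sum(alpha[i] * y[i] * gaussian_kernel(X[i], x)
                       for i in range(len(X))) + b)

print(K.shape, svm_decision(rng.normal(size=10)))   # (30, 30) and a label in {-1, +1}
```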
The originally defined SVM is a binary classifier, and one way to use it in a multi-class classification problem is to adopt standard techniques for combining the results of binary classifiers. The most popular are one-versus-all (1-all) and one-versus-one (1-1) [6]. With the 1-all approach, a binary classifier is constructed to decide between two classes: the class in question and the rest. Given $c$ classes, $c$ different classifiers are constructed, and an unknown observation is assigned the label of whichever classifier returns a `yes' vote. In the case of multiple `yes' votes, a number of different tie-breaking solutions have been proposed. Using the 1-1 strategy and a dataset with $c$ classes, a classifier is constructed for every possible pair of classes, resulting in $c(c-1)/2$ different binary classifiers. Given an input observation, it is tested with each classifier, and the class returning the largest number of `yes' votes is assigned to the observation.

2.4 Multi-layer perceptron

The multi-layer perceptron (MLP) [6], also termed a feedforward neural network, is a generalization of the single-layer perceptron. In fact, just three layers (including the input layer) are enough to approximate any continuous function. The input nodes form the input layer of the network. The outputs are taken from the output nodes, forming the output layer. The middle layer of nodes, visible to neither the inputs nor the outputs, is termed the hidden layer. The discriminant functions of a 3-layer perceptron with $M$ neurons in the hidden layer are of the following form:

$$d_i(x) = f_2\left( \sum_{j=0}^{M} w_{ij}^{(2)} f_1\left( \sum_{r=0}^{d} w_{jr}^{(1)} x_r^{(0)} \right) \right), \quad i = 1, \ldots, c$$

where $x_r^{(0)}$ are the inputs, $w_{ij}^{(2)}$ and $w_{jr}^{(1)}$ are the components of the two layers of network weights, $d$ is the dimensionality of the input pattern, and the univariate functions $f_1$ and $f_2$ are typically each set to the sigmoid:

$$f(x) = \frac{1}{1 + e^{-x}}$$

The parameters of the network (i.e. the weights) are modified during learning to optimize the match between outputs and targets, typically by minimizing the total squared error using a variant of gradient descent, conveniently organized as backpropagation of errors [6].

3. Features of the amino acid sequence

For ML-based protein fold classification it is necessary to represent the underlying protein as a feature vector, i.e. a vector composed of the values of features describing the protein. One of the keys is to find an effective model to represent a protein sample, because the performance of a fold classifier critically depends on the features used. Several methods for the extraction of features of amino acid sequences for protein fold classification have been developed. The most straightforward, sequential model relies on representing a query protein as a series of successive amino acid symbols in a certain order, but it fails when the query protein does not have significant homology to proteins of known characteristics. The simplest non-sequential, or discrete, model of a protein $P = R_1 R_2 \cdots R_L$ with $L$ amino acid residues $R_i$ is its amino acid composition (AAC):

$$P = (f_1, \ldots, f_{20})$$

where $f_j$ ($j = 1, \ldots, 20$) are the normalized occurrence frequencies of the 20 native amino acids in the protein.
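A minimal sketch of the AAC representation is given below; the example sequence is arbitrary and it is assumed to contain only the 20 standard amino acids.

```python
# Amino acid composition (AAC): the 20 normalized occurrence frequencies
# of the standard amino acids in a protein sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

features = aac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")   # arbitrary example sequence
print(len(features), round(sum(features), 6))          # 20 1.0
```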
Dubchak et al. [27], [29] first proposed a way to extract global physical-chemical propensities of an amino acid sequence as fold discriminatory features. Together with the AAC, a protein sequence is represented by a set of 126 parameters divided into six groups: 1) AAC plus sequence length (21 features collectively denoted by the letter `C'), 2) predicted secondary structure (21 features denoted by `S'), 3) hydrophobicity (21 features denoted by `H'), 4) normalized van der Waals volume (21 features denoted by `V'), 5) polarity (21 features denoted by `P') and 6) polarizability (21 features denoted by `Z'). Secondary structural information based on the three-state model (helix, strand and coil) can be obtained using one of the existing methods for secondary structure prediction, for example PSI-PRED [35].

Apart from the AAC characteristics (the `C' set of features), all other features are extracted based on the classification of all amino acids into three groups per attribute (for example polar, neutral and hydrophobic for the hydrophobicity attribute, see Table 1) in the following way. The descriptors a-composition (`aC'), transition (`T') and distribution (`D') are calculated for each attribute to describe, respectively, the global percent composition of each of the three groups in the protein, the percent frequencies with which the attribute changes its group index along the entire length of the protein, and the distribution pattern of the attribute along the sequence. In the case of hydrophobicity, for example, the a-composition descriptor `aC' consists of three numbers: the global percent compositions of polar, neutral and hydrophobic residues in the protein (because, with respect to the hydrophobicity attribute, all amino acids are divided into the three groups polar, neutral and hydrophobic). The transition descriptor `T' consists of three numbers: the percent frequency with which a polar residue is followed by a neutral one or a neutral by a polar residue, and similarly for the other two pairs of residue types. The distribution descriptor `D' consists of five numbers for each of the three groups: the fractions of the entire sequence at which the first residue of a given group is located, and at which 25, 50, 75 and 100 percent of the residues of that group are contained. The complete parameter vector for one attribute thus contains 3 (`aC') + 3 (`T') + 5 x 3 (`D') = 21 components, and the full feature vector (C, S, H, V, P, Z) counts 6 x 21 = 126 features.

Table 1. Amino acid attributes and corresponding groups
- Secondary structure: Group 1: helix; Group 2: strand; Group 3: coil
- Hydrophobicity: Group 1 (polar): R,K,E,D,Q,N; Group 2 (neutral): G,A,S,T,P,H,Y; Group 3 (hydrophobic): C,V,L,I,M,F,W
- Normalized van der Waals volume: Group 1 (0-2.78): G,A,S,C,T,P,D; Group 2 (2.95-4.0): N,V,E,Q,I,L; Group 3 (4.43-8.08): M,H,K,F,R,Y,W
- Polarity: Group 1 (4.9-6.2): L,I,F,W,C,M,V,Y; Group 2 (8.0-9.2): P,A,T,G,S; Group 3 (10.4-13.0): H,Q,R,K,N,E,D
- Polarizability: Group 1 (0-0.108): G,A,S,D,T; Group 2 (0.128-0.186): C,P,N,V,E,Q,I,L; Group 3 (0.219-0.409): K,M,H,F,R,Y,W
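As an illustration, the sketch below computes the 21 composition/transition/distribution features for the hydrophobicity attribute using the grouping from Table 1; the exact scaling (percentages) and the handling of empty groups are assumptions made only for this example.

```python
# Composition ('aC'), transition ('T') and distribution ('D') descriptors for
# the hydrophobicity attribute, following the three groups of Table 1.
import numpy as np

GROUPS = {1: set("RKEDQN"),      # polar
          2: set("GASTPHY"),     # neutral
          3: set("CVLIMFW")}     # hydrophobic

def group_index(residue):
    for g, members in GROUPS.items():
        if residue in members:
            return g
    raise ValueError(f"unknown residue {residue}")

def ctd_hydrophobicity(seq):
    idx = [group_index(r) for r in seq]
    L = len(idx)
    # 'aC': global percent composition of each group
    comp = [100.0 * idx.count(g) / L for g in (1, 2, 3)]
    # 'T': percent frequency of transitions between each pair of groups
    pairs = list(zip(idx, idx[1:]))
    trans = [100.0 * sum(1 for a, b in pairs if {a, b} == {g1, g2}) / (L - 1)
             for g1, g2 in ((1, 2), (1, 3), (2, 3))]
    # 'D': relative positions of the first residue of each group and of the
    # residues containing 25, 50, 75 and 100 percent of that group
    dist = []
    for g in (1, 2, 3):
        pos = [i + 1 for i, v in enumerate(idx) if v == g]
        if not pos:                      # group absent: features set to zero (assumption)
            dist.extend([0.0] * 5)
            continue
        marks = [pos[0]] + [pos[int(np.ceil(q * len(pos))) - 1]
                            for q in (0.25, 0.5, 0.75, 1.0)]
        dist.extend(100.0 * m / L for m in marks)
    return comp + trans + dist           # 3 + 3 + 15 = 21 features

print(len(ctd_hydrophobicity("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")))   # 21
```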
Pseudo amino acid composition (PseAA) was originally proposed [19] to avoid completely losing the sequence-order information, as happens in the AAC discrete model. In the PseAA model the first 20 factors represent the components of the AAC, while the additional ones incorporate some of its sequence-order information via various modes (i.e., as a series of rank-different correlation factors along the protein chain). The PseAA discrete model can be formulated as [18]:

$$p_u = \frac{f_u}{\sum_{k=1}^{20} f_k + w \sum_{k=1}^{\lambda} \theta_k}, \quad 1 \leq u \leq 20$$

$$p_u = \frac{w\, \theta_{u-20}}{\sum_{k=1}^{20} f_k + w \sum_{k=1}^{\lambda} \theta_k}, \quad 20 + 1 \leq u \leq 20 + \lambda$$

where $w$ is the weight factor and $\theta_k$ is the k-th tier correlation factor that reflects the sequence-order correlation between all the k-th most contiguous residues:

$$\theta_k = \frac{1}{L-k} \sum_{i=1}^{L-k} J_{i,i+k}, \quad k < L$$

with

$$J_{i,i+k} = \frac{1}{3} \left\{ [H_1(R_{i+k}) - H_1(R_i)]^2 + [H_2(R_{i+k}) - H_2(R_i)]^2 + [M(R_{i+k}) - M(R_i)]^2 \right\}$$

where $H_1(R_i)$, $H_2(R_i)$ and $M(R_i)$ are, respectively, the hydrophobicity, hydrophilicity and side-chain mass values of the amino acid $R_i$, and $\lambda$ is a parameter (before substituting these values a special normalization is used; for details see [18]).

The n-th order amino acid pair composition proposed by Shamim et al. [53] is calculated using the following formula:

$$f(D^{i,i+n})_j = \frac{N(D^{i,i+n})_j}{L-n}$$

where $N(D^{i,i+n})_j$ is the number of occurrences of the n-th order amino acid pair $j$ ($j = 1, \ldots, 400$) in a protein sequence of length $L$. These features encapsulate the interaction between the i-th and (i+n)-th amino acid residues and capture local order information in a protein. A special case of these features are the bigram and spaced-bigram features proposed by Huang et al. [34], both derived from the N-gram concept.

Besides the features extracted directly from amino acid sequences, some features are constructed by exploiting information such as predicted secondary structure, predicted solvent accessibility, functional domains and sequence evolution. Secondary structure-based features are generated from the (predicted) secondary structure profile, for example the one generated by the PSI-PRED method [35]. Such a profile comprises a state sequence, i.e. a sequence over the three possible state symbols helix (H), strand (E) and coil (C), and three probability sequences, one per state, containing the probability values with which the states occur along the query amino acid sequence. The first examples of such features are those used by Dubchak et al. [27], [29], described at the beginning of this section. Chen and Kurgan [12] proposed two new features: 1) the number of distinct secondary structure segments (DSSS), i.e. the numbers of occurrences of distinct helix, strand and coil segments whose length is above a certain threshold, and 2) the arrangement of DSSS: there are 3^3 = 27 possible segment arrangements of the form class-class-class, where class = `H', `E' or `C'. Similarly to the n-th order amino acid pair composition features, Shamim et al. [53] defined the secondary structural state frequencies of amino acid pairs, calculated as:

$$f(D_k^{i,i+n})_j = \frac{N(D_k^{i,i+n})_j}{L-n}$$

where $N(D_k^{i,i+n})_j$ is the number of n-th order amino acid pairs $j$ ($j = 1, \ldots, 400$) found in state $k \in \{H, E, C\}$.

Treating amino acid sequences as time series, Yang and Chen [60] proposed the following procedure for extracting new features from the PSI-PRED profile. For each of the three state sequences of secondary structural elements they first apply chaos game representation, analyze the result with a nonlinear technique, recurrence quantification analysis (for details see [60]), and then apply the autocovariance (AC) transformation, which is the covariance of the sequence against a time-shifted version of itself:

$$AC_{l,t} = \sum_{i=1}^{L-l} (t_i - \bar{t})(t_{i+l} - \bar{t}) / (L - l), \quad l = 1, \ldots, l_{max}$$

where $t = (t_1, \ldots, t_L)$ is the input sequence, $\bar{t}$ is the average of all $t_i$, $l$ is the distance between two positions along the sequence, and $l_{max}$ is the maximum value of the shift $l$.
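The n-th order pair composition defined earlier in this section translates directly into code; the sketch below, with an arbitrary example sequence and n = 1, produces the 400-dimensional vector of ordered-pair frequencies.

```python
# n-th order amino acid pair composition: the frequency of every ordered pair
# of residues separated by n positions along the sequence (400 features).
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]   # 400 ordered pairs

def pair_composition(seq, n=1):
    L = len(seq)
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(L - n):
        counts[seq[i] + seq[i + n]] += 1           # N(D^{i,i+n})_j
    return [counts[p] / (L - n) for p in PAIRS]    # f(D^{i,i+n})_j

features = pair_composition("MKTAYIAKQRQISFVKSHFSRQLEERLG", n=1)
print(len(features), round(sum(features), 6))      # 400 1.0
```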
Shamim et al. [53] also proposed the solvent accessibility state frequencies of amino acids, calculated as:

$$f_i^k = \frac{N_i^k}{L}$$

where $k \in \{B, E\}$ are the solvent accessibility states (B: buried, E: exposed) and $N_i^k$ is the number of occurrences of amino acid $i$ in solvent accessibility state $k$. For calculating the frequencies they used solvent accessibility states predicted by the method of [13]. Similarly to the n-th order amino acid pair composition features, they defined the solvent accessibility state frequencies of amino acid pairs:

$$f(D_k^{i,i+n})_j = \frac{N(D_k^{i,i+n})_j}{L-n}$$

where $N(D_k^{i,i+n})_j$ is the number of n-th order amino acid pairs $j$ ($j = 1, \ldots, 400$) found in accessibility state $k \in \{B, E, I\}$ (I: partially buried) or in secondary structural state $k \in \{H, E, C\}$.

Proteins often contain several modules (domains), each with a distinct evolutionary origin and function. Several databases have been developed to capture this kind of information, for example the CDD database (version 2.11) [43], which covers 17402 common protein domains and families. In [56] the functional domain (FunD) composition vector for representing a given protein sample was proposed. It is extracted through the following procedure: 1) use RPS-BLAST (reverse PSI-BLAST [52]) to compare the protein sequence with each of the 17402 domain sequences in the CDD database, 2) if the significance threshold value is less than 0.001 for the i-th profile, a hit is found and the i-th component of the protein in the 17402-dimensional space is set to 1, otherwise to 0.

Evolutionary features are mainly extracted from the position-specific scoring matrix (PSSM) profile generated by the PSI-BLAST program [1]. PSI-BLAST aligns a given query amino acid sequence to the NCBI non-redundant database. Using the multiple sequence alignment, PSI-BLAST counts the frequency of each amino acid at each position of the query sequence and generates a 20-dimensional vector of amino acid frequencies for each position; thus each element of the PSSM reflects the probability of a given amino acid occurring at a given position of the query sequence. More often than the absolute frequencies, relative frequencies are tabulated in the profile (i.e. frequencies relative to the probability of a sequence in a random functional site). The generated profile carries evolutionary information, i.e. it can be used to identify key positions of conserved amino acids and positions that undergo mutations. Chen and Kurgan [12] extracted from the 20-dimensional PSSM profile a profile-based composition vector (PCV): the negative elements of the PSSM are first replaced by zero, and then each of the 20 columns is averaged. However, in such a representation valuable evolutionary information is lost. To avoid this, Shen and Chou [56] proposed the pseudo position-specific scoring matrix (PsePSSM), obtained by adding to the profile-based composition vector the correlation factors defined as:

$$\theta_j^{\xi} = \frac{1}{L-\xi} \sum_{i=1}^{L-\xi} \left[ S_{ij} - S_{(i+\xi)j} \right]^2, \quad j = 1, 2, \ldots, 20; \; \xi < L$$

where $S_{ij}$ denotes the PSSM score at sequence position $i$ for amino acid type $j$; $\theta_j^{1}$ is the correlation factor obtained by coupling the most contiguous PSSM scores along the protein chain for amino acid type $j$, $\theta_j^{2}$ the same for the second-most contiguous PSSM scores, and so forth.

Another approach is proposed in [60]. Global features are extracted from the PSSM by first applying a special normalization, followed by the consensus sequence (CS) transformation:

$$\alpha(i) = \arg\max_{j} \{ f_{ij} : 1 \leq j \leq 20 \}, \quad 1 \leq i \leq L$$

where $f_{ij}$ denotes the normalized value of the PSSM element $S_{ij}$, and then computing:

$$AAC_{CS}(j) = \frac{n(j)}{L}, \quad 1 \leq j \leq 20$$

where $n(j)$ is the number of occurrences of amino acid $j$ in the consensus sequence.
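The sketch below illustrates three of these PSSM-derived representations: the profile-based composition vector, the PsePSSM correlation factors and the consensus-sequence composition. The PSSM is a random placeholder standing in for a real PSI-BLAST profile, and the normalization step of [60] is omitted for brevity.

```python
# Profile-based composition vector (PCV), PsePSSM correlation factors and
# consensus-sequence composition AAC_CS, all computed from a placeholder PSSM.
import numpy as np

rng = np.random.default_rng(0)
L = 120
pssm = rng.normal(size=(L, 20))          # placeholder L x 20 profile (rows: positions)

def pcv(pssm):
    # negative elements replaced by zero, then each of the 20 columns averaged
    return np.clip(pssm, 0, None).mean(axis=0)

def psepssm_theta(pssm, xi):
    # theta_j^xi = 1/(L-xi) * sum_i (S_ij - S_(i+xi)j)^2, one value per amino acid type j
    diff = pssm[:-xi, :] - pssm[xi:, :]
    return (diff ** 2).mean(axis=0)

def aac_cs(pssm):
    # consensus sequence: highest-scoring amino acid at each position,
    # then its composition over the 20 amino acid types
    consensus = pssm.argmax(axis=1)
    return np.bincount(consensus, minlength=20) / pssm.shape[0]

features = np.concatenate([pcv(pssm), psepssm_theta(pssm, 1), aac_cs(pssm)])
print(features.shape)                    # (60,) = 20 + 20 + 20
```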
Two additional global features in [60] represent the entropy of these representations:

$$E_{CS} = -\sum_{j=1}^{20} AAC_{CS}(j) \ln AAC_{CS}(j), \qquad E_{FM} = -\frac{1}{L} \sum_{i=1}^{L} \sum_{j=1}^{20} f_{ij} \ln f_{ij}$$

the latter computed on the raw, normalized PSSM. To extract local features, the raw, normalized PSSM is first divided into non-overlapping fragments of equal length; then, for each fragment s, 20 features are computed as the average occurrence frequencies of the amino acids in fragment s during the evolution process (for details see [60]).

Each residue in an amino acid sequence has many physical-chemical properties, so a sequence may be viewed as a time series of the corresponding properties. In [28] Dong et al. proposed features extracted from the PSSM using the AC transformation. The result measures the correlation of the same property between two residues separated by a distance $l$ along the sequence:

$$AC(i, l) = \sum_{j=1}^{L-l} (S_{ij} - \bar{S}_i)(S_{i(j+l)} - \bar{S}_i) / (L - l)$$

where $i$ is one of the 20 amino acid types, $L$ is the protein sequence length, $S_{ij}$ is the PSSM score of amino acid $i$ at position $j$, and $\bar{S}_i$ is the average score of amino acid $i$ along the whole sequence. They also proposed a cross-covariance (CC) transformation for two different properties of two residues separated by $l$ along the sequence:

$$CC(i_1, i_2, l) = \sum_{j=1}^{L-l} (S_{i_1 j} - \bar{S}_{i_1})(S_{i_2 (j+l)} - \bar{S}_{i_2}) / (L - l)$$

where $i_1$ and $i_2$ are two different amino acids.

Slightly different from the methods described above are the feature extraction methods based on kernels. A core component of each kernel method (for example the SVM described above) is the kernel function, which measures the similarity between any pair of examples. Different kernels correspond to different notions of similarity and can lead to discriminant functions with different performance. One of the early approaches to deriving a kernel function for protein classification was the SVM-pairwise scheme [39], which represents each sequence as a vector of pairwise similarities to all sequences in the training set. A relatively simpler feature space, containing all possible short subsequences of 3 to 8 amino acids (k-mers), is explored in [38]. A sequence x is represented there as a vector in which a particular dimension u (k-mer) is present (has non-zero weight) if x contains a substring that differs from u in at most a predefined number of positions (mismatches). An alternative to measuring pairwise similarity through a dot product of vector representations is to calculate an explicit protein similarity measure. The method of [51] measures the similarity between a pair of protein sequences by taking into account all the optimal local alignment scores with gaps between all of their possible subsequences. In the work described in [48], new kernel functions were developed that are derived directly from explicit similarity measures, utilize sequence profiles constructed automatically via PSI-BLAST [1] and employ a profile-to-profile scoring scheme developed by extending the profile alignment method of [52]. The first kernel function, window-based, determines the similarity between a pair of sequences by using different schemes to combine ungapped alignment scores of certain fixed-length subsequences. The second, local alignment-based, determines the similarity between a pair of sequences using Smith-Waterman alignments and a position-independent affine gap model, optimized for the characteristics of the scoring system.
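Returning to the covariance-style PSSM transformations of Dong et al. [28] described above, a minimal sketch is given below; the PSSM is again a random placeholder and the number of lags is an arbitrary choice made only for the example.

```python
# Autocovariance (AC) and cross-covariance (CC) transformations of the PSSM.
import numpy as np

def ac(pssm, i, lag):
    # AC(i, l): covariance of the column of amino acid type i with itself at distance l
    col = pssm[:, i]
    centered = col - col.mean()
    return np.sum(centered[:-lag] * centered[lag:]) / (len(col) - lag)

def cc(pssm, i1, i2, lag):
    # CC(i1, i2, l): covariance between the columns of amino acids i1 and i2 at distance l
    c1 = pssm[:, i1] - pssm[:, i1].mean()
    c2 = pssm[:, i2] - pssm[:, i2].mean()
    return np.sum(c1[:-lag] * c2[lag:]) / (len(c1) - lag)

rng = np.random.default_rng(1)
pssm = rng.normal(size=(150, 20))                  # placeholder L x 20 profile
ac_features = [ac(pssm, i, lag) for i in range(20) for lag in (1, 2, 3, 4)]
print(len(ac_features))                            # 20 amino acid types x 4 lags = 80
```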
Experiments with the fold classification problem show that these profile-based kernels together with an SVM [48] are capable of producing excellent results; the overall performance measured on the DD dataset is 67.8%.

4. Protein fold machine learning-based classification methods

4.1 Datasets used in the described experiments

Most implementations of ML-based protein fold classification methods have adopted the SCOP (Structural Classification of Proteins) architecture [42], with which a query protein is classified into one of the known folds. Most of these methods use for the construction of a protein fold classifier the dataset (training and testing parts) developed by Ding and Dubchak [27], [29] (the DD dataset). The DD dataset contains 311 and 383 proteins for training and testing, respectively. The dataset was formed such that, in the training set, no two proteins have more than 35% sequence identity to each other and each fold has seven or more proteins. In the test set, proteins have no more than 40% sequence identity to each other and no more than 35% identity to the proteins of the training set. The proteins of the training and testing datasets belong to 27 different folds (according to SCOP [42]), representing all major structural classes: α, β, α+β and α/β. These are the following 27 fold types: 1) globin-like, 2) cytochrome c, 3) DNA-binding 3-helical bundle, 4) 4-helical up-and-down bundle, 5) 4-helical cytokines, 6) EF-hand, 7) immunoglobulin-like β-sandwich, 8) cupredoxins, 9) viral coat and capsid proteins, 10) ConA-like lectins/glucanases, 11) SH3-like barrel, 12) OB-fold, 13) beta-trefoil, 14) trypsin-like serine proteases, 15) lipocalins, 16) (TIM)-barrel, 17) FAD (also NAD)-binding motif, 18) flavodoxin-like, 19) NAD(P)-binding Rossmann fold, 20) P-loop, 21) thioredoxin-like, 22) ribonuclease H-like motif, 23) hydrolases, 24) periplasmic binding protein-like, 25) β-grasp, 26) ferredoxin-like, 27) small inhibitors, toxins, lectins. Of these 27 fold types, types 1-6 belong to the all-α structural class, types 7-15 to the all-β class, types 16-24 to the α/β class and types 25-27 to the α+β class.

Later, researchers (see for example [60]) found some duplicate pairs between the training and testing sequences of the DD dataset. After excluding such sequences, a new dataset called the revised DD dataset (RDD) was created. Another, extended DD dataset (called EDD) was constructed by adding further protein samples. It is based on the Astral SCOP database (http://astral.berkeley.edu/), in which any two sequences have less than 40% identity. To cover more folds, additional datasets comprising 86, 95, 194 and 199 folds (F86, F95, F194 and F199, respectively) were constructed; for detailed descriptions see [28], [60].

4.2 Methods

Supervised ML-based methods for protein fold prediction have gained great interest since the work described in Craven et al. [22]. Craven et al. extracted several sequence-derived features, i.e., average residue volume, charge and polarity composition, predicted secondary structure composition, isoelectric point and the Fourier transform of the hydrophobicity function, from a set of 211 proteins belonging to 16 folds, and used these sequence attributes to train and test the following popular classifiers in the 16-class fold assignment problem: decision trees, k nearest neighbor and neural network classifiers.
Ding and Dubchak [27], [29] first experimented with the unique one-versus-others, one-versus-others and all-versus-all methods, using neural networks or SVMs as classifiers in multiple binary classification tasks on the DD dataset, with the global description of the amino acid sequence presented in the previous section. They were able to recognize the correct fold with an accuracy of approximately 56%. Here, accuracy refers to the percentage of test-set proteins whose fold has been correctly identified. Other researchers have tried to improve prediction performance either by incorporating new features (as described in the previous section) or by developing novel algorithms for multi-class classification (for example, fusion of different classifiers).

A modified nearest neighbor algorithm called K-local hyperplane (HKNN) was used by Okun [46], with an overall accuracy of 57.4% on the DD dataset. Classifying the same dataset and input features as Dubchak [29] and employing a Bayesian network-based approach [32], Chinnasamy et al. [14] improved the average fold recognition results reported by Dubchak [29] to 60%. Nanni [45] proposed a specialized ensemble (SE) of K-local hyperplane classifiers based on random subspace and feature selection and achieved 61.1% total accuracy on the DD dataset. Classifiers in this ensemble can be built on different subsets of features, either disjoint or overlapping. Feature subsets for a given classifier with a "favourite" class are found as those that best discriminate this class from the others (i.e. in the context of the defined distance measure).

For the prediction of protein folding patterns, Shen and Chou proposed in [56] the ensemble classifier known as PFP-Pred, constructed from nine individual ET-KNN (evidence-theoretic k-nearest neighbors) [25] classifiers, each operating on only one of the inputs (in order not to reduce the cluster-tolerant capacity), and obtained an accuracy of 62.1%. The ET-KNN rule is a pattern classification method based on the Dempster-Shafer theory of belief functions. Near-optimal parameters of each such component classifier were obtained using the optimization procedure from [64], resulting in the optimized OET-KNN classifier. As the protein representation they used the features of Dubchak et al. [27], [29] (except the composition) as well as pseudo-amino acid compositions of different dimensions, i.e. with four different values of the parameter λ (see the description in the previous section), together nine groups of features. Rather than using a combined correlation function, they proposed an alternate correlation function between the hydrophobicity and hydrophilicity of the constituent amino acids to reflect sequence-order effects (for details see [56]). The outcomes of the individual classifiers were combined through weighted voting to give the final fold assignment for a query protein.

Chmielnicki and Stąpor [15], [16] proposed a hybrid protein fold classifier composed of a regularized Gaussian classifier and an SVM. Using a feature selection algorithm to select the most informative of the features designed by Dubchak et al. [27], they obtained an accuracy of 62.6% on the DD dataset. In [33] Guo and Gao presented a hierarchical ensemble classifier named GAOEC (Genetic-Algorithm Optimized Ensemble Classifier) for protein fold recognition. As the component classifier they proposed a novel optimized GAET-KNN classifier, which uses a genetic algorithm (GA) to generate the optimum parameters of ET-KNN so as to maximize classification accuracy.
Two layers of GAET-KNNs are used to classify query proteins into the 27 folds. As in Dubchak et al. [27], six kinds of features are extracted from every protein of the DD dataset. Six component GAET-KNN classifiers in the first layer are used to obtain a potential class index for every query protein. Based on the result of the first layer, every component classifier of the second layer generates a 27-dimensional vector whose elements represent the confidence degrees of the 27 folds. A genetic algorithm is then used to generate weights for the outputs of the second layer and obtain the final classification result. The overall accuracy of GAOEC is 64.7%.

In [28] a protein fold recognition approach is presented that uses an SVM and features containing evolutionary information extracted from the PSSM by means of the AC transformation described in the previous section. Two kinds of AC transformation were proposed, resulting in two kinds of features, measuring the correlation of 1) the same property and 2) two different properties. Two versions of the classifier were examined, with feature set 1) and with the combination of 1) and 2), resulting in performances of 68.6% and 70.1%, respectively (on the DD dataset using 2-fold cross-validation). On the EDD, F86 and F199 datasets the performances computed by 5-fold cross-validation for the combination of feature sets 1) and 2) reach 87.6%, 80.9% and 77.2%, respectively.

Kernel matrices encode the similarity between data objects within a given input space. Using kernel-based learning methods, the problem of integrating heterogeneous data sources can be transformed into the problem of learning the most appropriate combination of their kernel matrices. The approach proposed in [24] utilizes four state-of-the-art string kernels built for proteins and combines them into an overall composite kernel on which a multinomial probit kernel machine operates. The approach is based on the ability to embed each object description, via the kernel trick [54], into a kernel space. This produces a similarity measure between proteins in every feature space; having a common measure, these similarities are then informatively combined into a composite kernel space on which a single multi-class kernel machine can operate effectively. The performance obtained using this method on the DD dataset is 68.1%.

Chen and Kurgan [12] proposed the fold recognition method PFRES, which uses features generated from PSI-BLAST and PSI-PRED profiles (as described in the previous section) and a voting-based ensemble of three different classifiers: SVM, Random Forest [8] and Kstar [20]. Using an entropy-based feature selection algorithm [63] that results in a compact representation (36 features), they obtained 68.4% accuracy on the DD dataset. In [40] a two-level classification strategy called the hierarchical learning architecture (HLA), using neural networks for protein fold recognition, was proposed. It relies on two indirect coding features (based on the bigram and spaced-bigram features described in the previous section) as well as a combinatorial fusion technique to facilitate feature selection and combination. The resulting accuracy is 69.6% for the 27 folding classes of the DD dataset. One of the novelties is the notion of a diversity score function between a pair of features; this score is used to select appropriate and diverse features for combination. It may be possible to achieve better results with combinations of more than two features.
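A voting-based ensemble of heterogeneous classifiers, in the spirit of PFRES, can be sketched as follows; the data are random placeholders, and k-nearest neighbors stands in for the Kstar classifier, which has no scikit-learn implementation.

```python
# A hard-voting ensemble of three different classifiers over the same features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 36))        # placeholder vectors (e.g. 36 selected features)
y = rng.integers(0, 27, size=200)     # placeholder labels for 27 fold classes

ensemble = VotingClassifier(
    estimators=[("svm", SVC()),
                ("rf", RandomForestClassifier(n_estimators=100)),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    voting="hard")                    # majority vote over the three classifiers
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))        # fold predictions for three query proteins
```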
Shamim et al. [53] developed a method for protein fold recognition based on an SVM classifier (with three different multi-class methods: one versus all, one versus one and the Crammer/Singer method [21]) that uses the secondary structural state and solvent accessibility state frequencies of amino acids and amino acid pairs as feature vectors (as described in the previous section). Among the feature combinations, the combination of the secondary structural state and solvent accessibility state frequencies of amino acids and of first-order amino acid pairs gave the highest accuracy, 70.5% (measured using 5-fold cross-validation), on the EDD dataset. Shen and Chou developed a new method, PFP-FunD, for protein fold pattern recognition, which uses the functional domain (FunD) composition vector and features extracted from the PsePSSM matrix (as described in the previous section) with the previously designed OET-KNN ensemble classifier, obtaining an accuracy of 70.5%.

Recently, Yang and Chen [60] developed the fold recognition method TAXFOLD, which extensively exploits the sequence evolution information from PSI-BLAST profiles and the secondary structure information from PSI-PRED profiles. A comprehensive set of 137 features is constructed, as described in the previous section, which captures both global and local characteristics. They tested different combinations of the extracted features. It follows that PSI-BLAST and PSI-PRED features make complementary contributions to each other, and it is important to use both kinds of features for enhanced protein fold recognition. The consensus sequences contain much more evolutionary information than the amino acid sequences themselves, thereby leading to more accurate protein fold recognition. The best accuracy of TAXFOLD is 71.5% on the DD dataset (83.2%, 86.29%, 76.5% and 72.6% on the RDD, EDD, F95 and F194 datasets, respectively).

In the kernel-based learning method of [62], a novel information-theoretic approach is proposed to learn a linear combination of kernel matrices through the use of the Kullback-Leibler divergence between the output kernel matrix and the input one. Depending on the position of the input and output kernel matrices, there are two formulations of the resulting optimization problem, solved respectively by a difference-of-convex programming method and by a projected gradient descent algorithm. The method improves the fold discrimination accuracy to 73.3% (on the DD dataset). Another group of methods used for protein fold recognition, different from those described in section two, are those based on Hidden Markov Models (HMM) [32]. The most promising result here was achieved by Deschavanne and Tuffery [26], who employed a hidden Markov structure alphabet as an additional feature and obtained an accuracy of 78% (on the EDD dataset).

5. Discussion, conclusions and future work

This paper is not a comparison of the existing protein fold classification methods; it is only a review. To make such a comparison between the performance of different classifiers, one should implement the described methods and use dedicated statistical tests to make sure that the differences are statistically significant. Nevertheless, some conclusions of a general nature can be drawn. First, it should be noted that the reported accuracies of the classifiers were estimated using different methods: on an independent test dataset or by 2-fold (or 5-fold) cross-validation on the training dataset, and sometimes on different datasets. Accuracies obtained using different estimation methods have different bias and variance.
Moreover, these are only the values of point estimators; confidence intervals, for example, would give valuable information about the standard deviation of the prediction error. Considering the nature of the protein fold prediction problem, where the fold type of a protein can depend on a large number of protein characteristics, and noting that the number of fold types approaches 1000, it is straightforward to see the need for a methodological framework that can cope with a large number of classes and can incorporate as many feature spaces as are available. As mentioned above, the existing protein fold classification methods have produced several fold discriminatory data sources (i.e. groups of attributes such as amino acid composition, predicted secondary structure, and selected structural and physicochemical properties of the constituent amino acids). One of the open questions is how to integrate many fold discriminatory data sources systematically and efficiently, without resorting to ad hoc ensemble learning. One solution is the kernel methods, which have been successfully used for data fusion in many biological applications; nevertheless, the integration of heterogeneous data sources remains a challenging problem (problem 1).

It is known that the performance of many classifiers (such as the widely used SVM) depends on the size of the dataset used; a look at the DD dataset reveals that many folds are sparsely represented. Training in such a case becomes skewed towards the populous folds labeled as positive rather than the less populated folds labeled as negative (problem 2). An alternative class structure should be developed, for example by re-evaluating the current class structure to determine classes that should be aggregated or discarded, or by incorporating larger sets of folding classes.

Another source of error in the described methods is inappropriate features of the protein sequence, i.e. features with low discriminatory power. Moreover, incorrectly predicted features, such as the predicted secondary structure or solvent accessibility, can also decrease classification performance. Extracting a set of highly discriminative features from amino acid sequences remains a challenging problem (problem 3), though the newest work [60] shows that even a single classifier, given carefully designed (most discriminative) features, can achieve very good classification performance, even greater than 80%. This is a very acceptable accuracy for a 27-class classification problem: a random classifier would achieve only about 3.7% (1/27 x 100). The main source of the achieved improvement is attributed to the application of features extracted from the PSI-BLAST profile, which carries evolutionary information, together with suitable transformations. Finally, employing an appropriate classifier, or a method for combining classifiers into an ensemble, also has a critical effect on the problem. Using weak classifiers on separate feature spaces with a carefully designed combining strategy (learning algorithm) could give results much higher than 70% (problem 4).
