Extensive feature detection of N-terminal protein sorting signals

BIOINFORMATICS, Vol. 18 no. 2, 2002, Pages 298–305
© Oxford University Press 2002

Hideo Bannai (1), Yoshinori Tamada (2), Osamu Maruyama (3), Kenta Nakai (1) and Satoru Miyano (1)

(1) Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
(2) Department of Mathematical Sciences, Tokai University, 1117 Kitakaname, Hiratuka-shi, Kanagawa, 259-1292, Japan
(3) Faculty of Mathematics, Kyushu University, Kyushu University 36, Fukuoka, 812-8581, Japan

Received on July 13, 2001; revised on October 5, 2001; accepted on October 18, 2001

ABSTRACT
Motivation: The prediction of localization sites of various proteins is an important and challenging problem in the field of molecular biology. TargetP, by Emanuelsson et al. (J. Mol. Biol., 300, 1005–1016, 2000) is a neural network based system which is currently the best predictor in the literature for N-terminal sorting signals. One drawback of neural networks, however, is that it is generally difficult to understand and interpret how and why they make such predictions. In this paper, we aim to generate simple and interpretable rules as predictors, and still achieve a practical prediction accuracy. We adopt an approach which consists of an extensive search for simple rules and various attributes which is partially guided by human intuition.
Results: We have succeeded in finding rules whose prediction accuracies come close to that of TargetP, while still retaining a very simple and interpretable form. We also discuss and interpret the discovered rules.
Availability: An (experimental) web service using rules obtained by our method is provided at http://hypothesiscreator.net/iPSORT/.
Contact: bannai@ims.u-tokyo.ac.jp

INTRODUCTION
Most proteins are first synthesized in the cytosol, and carried to specified locations, such as mitochondria or chloroplasts. In most cases, the information determining the subcellular localization site is represented as a short amino acid sequence segment called a protein sorting signal (Nakai, 2000). If we could somehow detect the amino acid sequence encoding this information, we would be able to predict the localization sites.

Prediction of localization sites is useful in various ways. Because cellular functions are often localized in specific compartments, the prediction of localization sites of unknown or unannotated proteins may be used to gain some indication of its function. For example, the information may be used to screen candidate genes for drug discovery. Further, if the rules for prediction were biologically interpretable, this knowledge could help in designing artificial proteins with desired properties.

TargetP (Emanuelsson et al., 2000), a neural network based predictor, is known to be the best predictor in the literature for N-terminal sorting signals. However, although neural networks are ‘readily available’ and ‘often successful in practice,’ they are also infamous for the difficulty involved in trying to understand and interpret their meaning (Chou, 2001). PSORT (Nakai and Kanehisa, 1992; Nakai and Horton, 1999) and MitoProt (Claros and Vincens, 1996), unlike TargetP, are systems which incorporate existing knowledge about sorting signals, but they use various real numbers as ‘weights’ in their prediction rules which also may not be trivially interpretable. Also, they are somewhat obsolete and their performance is unsatisfactory compared to TargetP.

The aim of this work is to derive simple and interpretable rules which can be used to predict subcellular localization sites, while still achieving a practical prediction accuracy. Through our discovery oriented approach to the problem, we managed to find very simple and interpretable rules with prediction accuracies which come fairly close to TargetP.

We will first review the existing knowledge about the N-terminal signals, and then describe the general idea of our approach.

N-terminal sorting signals
The signals we consider are signals known to be on the N-terminal of the protein. Mitochondrial Targeting Peptides (mTP), Chloroplast Transit Peptides (cTP), and Signal Peptides (SPs) are the typical N-terminal sorting signals.

[Figure 1, diagram content (image not reproduced). Alphabet indexing + pattern rule: the residues in the cut interval [1, 10] of each sequence are recoded with the alphabet indexing 1 = {R}, 2 = {A, C, F, G, I, L, S, T, W}, 3 = {D, E, H, K, M, N, P, Q, V, Y}, so that, for example, MARSLARTTTVRQP... becomes 3212221222; a sequence is labelled TRUE if the pattern ‘212221’ occurs in the indexed sequence and FALSE otherwise. Amino acid index rule: each residue is replaced by its value in an amino acid index (here ZIMJ680104, isoelectric point), the values are summed, and the sum is compared with a threshold (63.3 in the diagram); positive examples (68.1, 68.4) are labelled TRUE and negative examples (57.1, 59.6) FALSE.]

Mitochondrial Targeting Peptides are known to be rich in Arginine (R), Alanine (A), and Serine (S), while negatively charged amino acid residues (Aspartic Acid (D) and Glutamic Acid (E)) are rare (von Heijne et al., 1989). Only weak consensus sequences have been found. Further, they are believed to form an amphiphilic α-helix important for import into the mitochondrion.

Chloroplast Transit Peptides are known to be rare in acidic residues, and also believed to form an amphiphilic α-helix (Bruce, 2000).

It has been established that a concrete consensus sequence does not occur in SPs.
Rather, a 3-region structure is conserved: a positively charged n-region, a hydrophobic h-region, and a polar c-region (von Heijne, 1990).

Fig. 1. Concept diagram of amino acid index rule and alphabet indexing + approximate pattern rule.

Overview of our approach
Several very important aspects in the process of scientific knowledge discovery are: (1) the generation or discovery of good attributes, and ways of looking at the data, which are then used to explain the data; (2) the incorporation of and reflection on existing knowledge; and (3) the trial and error interaction between the expert and the problem. We have been developing a computer software library focusing on these points to speed up this process, and have been applying it to various problems in the field of bioinformatics (Maruyama et al., 1998, 1999; Bannai et al., 2001).

The overall idea of this approach is to create massive amounts of very simple attributes and their trivial combinations, based on various known attributes. This way, if such rules exist for the data, we can expect to overcome the poor descriptive strength of simple rules, while at the same time controlling the complexity and structure of the rule to be generated.

Our search for the final hypothesis consists of two main aspects: amino acid index, and alphabet indexing + approximate pattern. The details of each aspect will be presented in the following sections, but we describe them briefly here (see Figure 1).

Amino acid index. A large amount of experimental and theoretical research has been performed to characterize different kinds of properties of individual amino acids and to represent them in terms of a numerical index. The AAindex database (Kawashima and Kanehisa, 2000) is a compilation of 434 such indices. As noted in Section N-terminal sorting signals, protein sorting signals have been characterized by the biochemical properties of the amino acids composing them, and it seems reasonable to assume that some kind of characteristic which is important for protein sorting is already contained in the AAindex database. Also, although an amino acid index generally assigns a real number to each amino acid, it should be easy for us to interpret the biological meaning of the rule when we find a ‘good’ amino acid index contained in AAindex, which helps greatly in explaining the data. Therefore, using AAindex as a knowledge base, we generate amino acid index rules.

Alphabet indexing + approximate pattern. Again from previous studies, we know that there is no clear-cut consensus sequence concerning each of the sorting signals. However, since there does seem to be a common structure for the same signals, we wish to somehow capture this knowledge. Our approach here is to consider motifs which allow more ambiguity by using alphabet indexing (Shimozono, 1999) and approximate patterns (Wu and Manber, 1992) over the indexed sequence, similar to the BONSAI system (Shimozono et al., 1994), which was successful in discovering meaningful knowledge from amino acid sequences.

An alphabet indexing is a classification of characters of an alphabet into a smaller set of characters, and can be viewed as a discrete, unordered version of an amino acid index. For example, we may divide the amino acids into the two classes of hydrophobic amino acids and hydrophilic amino acids. Using this alphabet indexing, we can view the amino acid sequence as a sequence of ‘0’s (hydrophobic) and ‘1’s (hydrophilic), and search for patterns (e.g. ‘001001100’) contained in the sequences.

The outline of this paper is as follows: in the next section, we define the basic concepts used in our methods. We then show the results we have obtained from applying our methods to the data. Finally, we discuss how the rules we discovered may be interpreted.

SYSTEM AND METHODS
Amino acid index rule
An amino acid index is a mapping from one amino acid to a numerical value, representing various physicochemical and biochemical properties of amino acids.

DEFINITION 1 (amino acid index). Let $A$ denote the set of amino acids, and $\mathbb{R}$ the set of real numbers. For a given amino acid index $I : A \to \mathbb{R}$ and amino acid sequence $s = s_1 s_2 \cdots s_n$ ($s_i \in A$ for $i = 1, \ldots, n$), let $I(s)$ denote the homomorphism $[|\, I(s_1); I(s_2); \ldots; I(s_n) \,|]$, where $[|\,;\,|]$ denotes a sequence of values.

The AAindex Database (Kawashima and Kanehisa, 2000) is a compilation of 434 types of amino acid indices, which have appeared in various reports.

DEFINITION 2 (amino acid index rule). An amino acid index rule (R1-rule) is defined by: an amino acid index $I$, a specified region of the amino acid sequence $s$ denoted by $s[u,v]$, a function $f \in \{\mathrm{avg}, \mathrm{max\,avg}_w, \mathrm{min\,avg}_w\}$, and a threshold $\tau$, where: $\mathrm{avg}(I(s))$ is defined as the average of the values of sequence $I(s)$, $\mathrm{max\,avg}_w(I(s))$ is the average of a substring of size $w$ in $I(s)$ which gives the maximum value (i.e. $\max\{\mathrm{avg}(I(s')) \mid s = x s' y,\ |s'| = w\}$), and $\mathrm{min\,avg}_w(I(s))$ is similarly the average of a substring of size $w$ in $I(s)$ which gives the minimum value (i.e. $\min\{\mathrm{avg}(I(s')) \mid s = x s' y,\ |s'| = w\}$). Here $s[u,v] = s_u \cdots s_v$ ($1 \le u \le v \le n$) for $s = s_1 s_2 \cdots s_n$.
A sequence is predicted ‘positive’ by the R1-rule if $f(I(s[u,v])) > \tau$ and ‘negative’ otherwise.

The parameters to be chosen are $I, u, v, f, w, \tau$, and the task is to look for the best combination of the parameters which can distinguish between sequences of different signals.

With R1-rules, we expect to capture the overall properties of N-terminal sorting signals.

Alphabet indexing + approximate pattern rule
An alphabet indexing (Shimozono, 1999) is a classification of characters of an alphabet. It is formally defined as follows:

DEFINITION 3 (alphabet indexing). An alphabet indexing $\psi$ is a mapping from one alphabet $\Sigma$ to another alphabet $\Gamma$, where $|\Gamma| \le |\Sigma|$. For $x = x_1 x_2 \cdots x_l \in \Sigma^l$, let $\psi(x)$ denote the homomorphism $\psi(x_1)\psi(x_2)\cdots\psi(x_l) \in \Gamma^l$. We will call $\psi(x)$ the indexed sequence.

DEFINITION 4 (approximate pattern). An approximate pattern (Wu and Manber, 1992) is a string which can match another string, allowing up to $k$ errors (mismatches). The mismatches can be of up to three types: Insertion (ins), Deletion (del), and Substitution (sub). We will call the parameter $k$, together with the types of mismatches allowed, the mismatch allowance of the approximate pattern.

DEFINITION 5. An alphabet indexing + approximate pattern rule (R2-rule) is defined by: an alphabet indexing $\psi$, a specified region of the amino acid sequence $s$ denoted by $s[u,v]$, a pattern $p$, and mismatch allowance $M$. A sequence is predicted ‘positive’ if $p$ matches somewhere in $\psi(s[u,v])$ within mismatch allowance $M$, and ‘negative’ otherwise.

The parameters to be chosen are $\psi, u, v, p, M$, and the task again is to find the best combination of the parameters which can distinguish between sequences of different signals.

With R2-rules, we expect to find locally specific characteristics concerning N-terminal sorting signals.

Data
The data used in our computational experiments were obtained from the TargetP web-site (http://www.cbs.dtu.dk/services/TargetP/). These data consist of two data sets: plant and non-plant sequences. The plant data set of 940 sequences contained 368 mTP, 141 cTP, 269 SP, and 162 ‘Other’ (consisting of 54 nuclear and 108 cytosolic) sequences. The non-plant data set of 2738 sequences contained 371 mTP, 715 SP and 1652 ‘Other’ (consisting of 1214 nuclear and 438 cytosolic) sequences.

We basically follow the work on TargetP, considering different predictors for plant and non-plant proteins. Also, as in the composition of TargetP, we will first consider binary predictors which just predict whether or not a given sequence contains a specific signal. The knowledge obtained from these binary rules is combined into a decision list, to form a final rule. For each binary predictor, we will call the sequences concerning the signal in question positive examples, and the sequences concerning the other signals, negative examples.

Search strategies
We extensively search for various parameters described in the previous section. Since the size of the search space for the combinations of different parameters is huge, an exhaustive search is not feasible even with the powerful computers which were available. We adopted a mixture of heuristics and exhaustive search. Many different combinations of the parameters as well as minute variations in the heuristics were tried.

Combining the rules. To create a single rule predicting the sorting signal for a given sequence, we combine the binary rules generated for each sorting signal into a decision list. The structure of the decision list is shown in Figure 2. The structure was determined greedily, according to the ‘ease’ of discrimination by the R1-rules, which was stable for all training/test combinations.

From preliminary experiments, R1-rules seemed to be sufficient for discriminating SP sequences. As for the other signals, neither type of rule seemed good enough for identifying the signals. Therefore, we consider combining both types of rules (R1-rules and R2-rules) into a single rule. The first node of the decision list discriminates SPs with a single R1-rule, while the second and third nodes consist of both an R1-rule and an R2-rule. The two rules (or perhaps their negations) are combined with a logical ‘and’ where a sequence is judged to have a certain signal if both rules say so.

Fig. 2. The structure of the final rule for plant and non-plant data sets. The last node in parentheses concerning cTP and mTP is omitted for the non-plant data set (classifying to mTP). The various parameters such as the amino acid index, substring intervals, alphabet indexing, and patterns which were chosen in the 5 training runs are summarized in Tables 1 and 2.

Evaluation of prediction accuracy
The whole search space covered in our search was enormous, and involved considerable amounts of human intervention, influenced largely by human intuition. To give a fair estimate for the prediction accuracy of our methods, we choose a modest range of parameters to search for in the cross validation, and show that the knowledge discovered is fairly stable even in that quite large range.

Training and evaluation. We follow the training and evaluation methods used for TargetP. The data were randomly divided into five equal sized data sets by dividing each subset of sequences with specific localization sites into five data sets. Rules were generated by using four of the data sets as training data, and testing was conducted on the remaining data set. This was repeated for the five possible pairs of training and test set, and the overall performance is the sum of the five results. (All rules are generated by using the training data set only.)

Preliminary experiments showed that R1-rules were fairly stable, but R2-rules seemed to somewhat over-fit the training data. To overcome this problem, in the training phase for R2-rules, the training set was again randomly divided into five sets (4 ttrain and 1 ttest sets), and rules are generated from ttrain and tested with ttest. The arithmetic product of the scores from the ttrain and ttest sets was used to select which alphabet indexing and substring interval to use. The pattern was then trained using all sequences of the original training set, using the alphabet indexing and substring interval.

Rules are evaluated by the Matthews Correlation Coefficient (MCC; Matthews, 1975), defined by:

$$\mathrm{MCC} = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fn)(tp + fp)(tn + fp)(tn + fn)}}$$

where tp = true positives, fp = false positives, tn = true negatives, and fn = false negatives. Sensitivity, the fraction of correctly predicted positive examples $tp/(tp + fn)$, and specificity, the fraction of true positives in the examples predicted as positive $tp/(tp + fp)$, were also calculated for reference.

Details of the search are given below:

Amino acid index rule. Since we know that the signals are located somewhere in the N-terminal region, we look (somewhat) exhaustively at the substring intervals in this region: for amino acid index $I$, all 434 entries in the AAindex Database together with 20 more entries, each assigning a value of ‘1’ to one amino acid and ‘0’ to the rest, were considered. For these 454 entries, 72 substring intervals $[u,v] = [5n + 1, 5k]$ (where $n = 0, \ldots, 8$ and $k = 1, \ldots, 8$) were considered. For these 454 ∗ 72 = 32 688 combinations, $f = \mathrm{avg}, \mathrm{max\,avg}_w, \mathrm{min\,avg}_w$ were considered where $w$ was taken to be 6 to 12, resulting in 32 688 ∗ (2 ∗ 7 + 1) = 490 320 combinations. For all these combinations, all possible thresholds are considered for $\tau$: let $f_w^{(1)}, \ldots, f_w^{(n)}$ be all $f_w$ values of the $n$ sequences in sorted order. Then $\tau = (f_w^{(i)} + f_w^{(i+1)})/2$, $i = 1, \ldots, n - 1$. The combination of $I, u, v, f, w, \tau$ which gives the highest MCC score is recorded.

Alphabet indexing + approximate pattern rule. The substring intervals $[u,v]$ are taken to be $[5n + 1, 5n + 5k]$ (where $n = 0, 1, 2$ and $k = 2, 3, 4$). For a given alphabet indexing $\psi$, all patterns of length 8 appearing in the sequences are considered for $p$. The mismatch types were limited to insertion and deletion only (no substitution). The maximum mismatch number was fixed at 2. We started with the alphabet indexing classifying the amino acids into three classes, according to their ‘charges:’

$$\psi(x) = \begin{cases} 0 & \text{if } x \in \{D, E\}, \\ 1 & \text{if } x \in \{K, R\}, \\ 2 & \text{if } x \in A - \{D, E, K, R\} \end{cases}$$

and optimized the indexing by conducting a local search on the alphabet indexing (Shimozono et al., 1994): i.e. consider the alphabet indexings which are obtained by changing the indexing for a single amino acid (40 candidates in this case, since each of the 20 amino acids can move to either of the 2 other classes), and adopt the indexing whose product of the MCC scores for ttrain and ttest is the highest for the best pattern. The process is repeated until a local maximum is reached. A local search strategy was used because an exhaustive search for all possible alphabet indexings would result in $3^{20}$ combinations, which was not feasible. Numerous tries starting from other alphabet indexings, which were chosen randomly, were also conducted, but high scoring indexings seemed to be centered around the initial charge-based indexing $\psi$.

Fig. 3. Histograms (light: SP; dark: mTP, cTP, other) of max avg values of hydropathy index (Kyte and Doolittle, 1982) for the substring [1, 30] of the plant data set (horizontal axis: value, from −1 to 4). The threshold was 2.077 27. We can see that there is a clear difference in the distribution.

Combination of the rules. After determining the parameters for the above rules, the rules and possibly their negations are combined with a logical ‘and,’ but a portion of the parameters are trained again. Namely, the substring intervals, $f$, window sizes, amino acid index, and alphabet indexing are fixed.
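As a rough illustration of the R1-rule search described above, the following Python sketch evaluates avg, max avg and min avg over a region s[u,v], enumerates midpoint thresholds, and scores each threshold by MCC. The names `r1_value`, `mcc`, `best_threshold` and `NEGATIVE_CHARGE` are assumptions made for this example and are not taken from the Hypothesis Creator Library. Only the ‘positive if f > τ’ direction is shown, duplicate scores are collapsed before midpoints are taken, and the window is shrunk when the region is shorter than w; these are simplifications rather than documented behaviour of the original software.

```python
import math

# 'Negative charge' index from the text: 1 for D and E, 0 for every other residue.
NEGATIVE_CHARGE = {aa: (1.0 if aa in "DE" else 0.0) for aa in "ACDEFGHIKLMNPQRSTVWY"}


def r1_value(seq, index, u, v, f="avg", w=None):
    """Evaluate f(I(s[u,v])) for 1-based, inclusive positions u..v.

    f is "avg", "max_avg" or "min_avg"; w is required for the windowed variants.
    """
    values = [index[aa] for aa in seq[u - 1:v]]
    if f == "avg":
        return sum(values) / len(values)
    w = min(w, len(values))  # assumption: shrink the window for short regions
    window_avgs = [sum(values[i:i + w]) / w for i in range(len(values) - w + 1)]
    return max(window_avgs) if f == "max_avg" else min(window_avgs)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(pos_scores, neg_scores):
    """Try every midpoint between consecutive sorted scores; return (best MCC, tau)."""
    scores = sorted(set(pos_scores) | set(neg_scores))
    best_score, best_tau = float("-inf"), None
    for a, b in zip(scores, scores[1:]):
        tau = (a + b) / 2
        tp = sum(x > tau for x in pos_scores)  # positives predicted positive
        fp = sum(x > tau for x in neg_scores)  # negatives predicted positive
        fn, tn = len(pos_scores) - tp, len(neg_scores) - fp
        score = mcc(tp, tn, fp, fn)
        if score > best_score:
            best_score, best_tau = score, tau
    return best_score, best_tau
```

In a full search, `r1_value` would be computed for every training sequence under each candidate (index, interval, f, w) combination, and `best_threshold` would then pick τ for that combination.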
We retrain the mismatch allowance, pattern, and threshold. Their ranges are expanded in the retraining: a maximum mismatch number of 1–3 was allowed, and all patterns of length 5–10 appearing in the data were considered. The top 100 R1-rules are combined with all possible R2-rules, and the combination which gives the best MCC score is chosen to be used against the test set.

IMPLEMENTATION
The software used in our analysis was developed using the Hypothesis Creator Library (http://hypothesiscreator.net/; Bannai et al., 2001). Various shared memory multiprocessor computers were available for calculation: 2 SGI Origin 2000 with (128, 32) × 195 MHz R10000 processors, 1 Sun Ultra Enterprise 4500 and 2 Sun Ultra Enterprise 3500 with (14, 8, 8) × 400 MHz Sun Ultra II processors respectively. Each is equipped with well over 2 GB of memory, which was the limit of the software.

RESULTS AND DISCUSSION
The parameters found for each training set are summarized in Table 1 for the plant data set, and Table 2 for the non-plant data set. The scores of the cross validation are summarized in Table 3, together with the scores for TargetP written in parentheses. (The scores for TargetP were taken directly from Emanuelsson et al., 2000.)

We can see that the MCC scores for our predictor are fairly close to those of TargetP, except for cTP. However, it should be noted that our score for cTP (0.64) would rank second, after TargetP (0.72), better than PSORT (0.51), MitoProt (0.44), and ChloroP (0.50; Emanuelsson et al., 1999), in the comparison of Emanuelsson et al. (2000). Our predictor scores higher for plant SPs, and for the other signals, our scores would again rank second, after TargetP, with respect to the other predictors.

Biological evaluation of the rules
Amino acid index rules
SP versus (mTP + cTP + other) [Node P1, rule R1 in Table 1]. The amino acid index with the highest score was the hydropathy index (Kyte and Doolittle, 1982), and judging from the substring interval [1, 30], and function max avg_w where w is around 11, we can say this rule corresponds to characteristics known for SPs (the hydrophobic h region) (von Heijne, 1990; Figure 3). What is surprising is that such a simple rule could discriminate SPs so well, better than TargetP for plant proteins.

(mTP + cTP) versus other [Node P2, rule R1 in Table 1]. The amino acid index was ‘negative charge,’ which assigns a value of 1 to D and E. This also corresponds to known characteristics: mTP and cTP are rare in negatively charged amino acids (von Heijne et al., 1989).

mTP versus cTP [Node P3, rule R1 in Table 1]. Various amino acid indices were chosen, with substring regions covering a very short region at the N-terminal. However, the amino acid index isoelectric point (Zimmerman et al., 1968) can be considered as a more accurate measure of the net amino acid charges. Atom based hydrophobic moment (Eisenberg and McLachlan, 1986) is also a similar amino acid index, where the values for R and Lysine (K) are higher than the other amino acids. Although values for D and E are also higher for the atom based hydrophobic moment, these amino acids rarely appear in mTP or cTP, and do not affect the average values.

Table 1. The parameters chosen for each training set (threshold τ is omitted) for the plant data set.
The nodes Pn correspond to nodes in Figure 2.

Node | Trial | R1: amino acid index | R1: [u,v] | R1: f, dir | R2: alphabet indexing | R2: [u,v] | R2: pattern | R2: mismatch | Combination
P1 | 1 | Hydropathy index | [1, 30] | max avg_w, ↑ | Not used | Not used | Not used | Not used | R1 → SP
P1 | 2 | Hydropathy index | [6, 30] | max avg_w, ↑ | Not used | Not used | Not used | Not used | R1 → SP
P1 | 3 | Hydropathy index | [1, 30] | max avg_w, ↑ | Not used | Not used | Not used | Not used | R1 → SP
P1 | 4 | Hydropathy index | [6, 30] | max avg_w, ↑ | Not used | Not used | Not used | Not used | R1 → SP
P1 | 5 | Hydropathy index | [6, 30] | max avg_w, ↑ | Not used | Not used | Not used | Not used | R1 → SP
P2 | 1 | Negative charge | [1, 25] | avg, ↓ | DE → 0, AR → 1, other → 2 | [1, 20] | 221200020 | 3 ins/del | ¬R1 ∨ R2 → other
P2 | 2 | Negative charge | [1, 25] | avg, ↓ | DE → 0, CR → 1, other → 2 | [1, 20] | 20002212 | 2 ins/del | ¬R1 ∨ R2 → other
P2 | 3 | Negative charge | [1, 25] | avg, ↓ | DE → 0, R → 1, other → 2 | [1, 15] | 022120 | 1 ins/del | ¬R1 ∨ R2 → other
P2 | 4 | Negative charge | [1, 30] | max avg_w, ↓ | DE → 0, CR → 1, other → 2 | [1, 20] | 2002222222 | 1 ins/del | ¬R1 ∨ R2 → other
P2 | 5 | Negative charge | [1, 30] | avg, ↓ | DE → 0, CRF → 1, other → 2 | [1, 20] | 020222222 | 1 ins/del | ¬R1 ∨ R2 → other
P3 | 1 | Hyd. mom. | [1, 10] | max avg_w, ↑ | E → 0, KR → 1, other → 2 | [1, 10] | 22112221 | 2 ins/del | R1 ∧ R2 → mTP
P3 | 2 | Isoelectric point | [1, 10] | avg, ↑ | E → 0, KRW → 1, other → 2 | [1, 10] | 22110 | 2 ins/del | R1 ∧ R2 → mTP
P3 | 3 | Hyd. mom. | [1, 10] | max avg_w, ↑ | E → 0, ARW → 1, other → 2 | [1, 10] | 22110 | 2 ins/del | R1 ∧ R2 → mTP
P3 | 4 | Net charge | [1, 10] | avg, ↑ | E → 0, KR → 1, other → 2 | [1, 10] | 1212221 | 2 ins/del | R1 ∧ R2 → mTP
P3 | 5 | Isoelectric point | [1, 15] | max avg_w, ↑ | E → 0, DKRW → 1, other → 2 | [1, 10] | 11221 | 2 ins/del | R1 ∧ R2 → mTP

Notes: Hydropathy index: Kyte and Doolittle (1982). Hyd. mom.: atom based hydrophobic moment (Eisenberg and McLachlan, 1986). Net charge, Isoelectric point: Zimmerman et al. (1968). For the combination ¬R1 ∨ R2 → other, the actual rule was R1 ∧ ¬R2 → mTP or cTP. ↑ means that the rule will answer yes if the value of f(I(s[u,v])) is above a certain value τ; ↓ is the opposite.

Table 2. The parameters chosen for each training set (threshold τ is omitted) for the non-plant data set.
The nodes Pn correspond to nodes in Figure 2.

Node | Trial | R1: amino acid index | R1: [u,v] | R1: f, dir | R2: alphabet indexing | R2: [u,v] | R2: pattern | R2: mismatch | Combination
P1 | 1 | Hydropathy index | [1, 30] | max avg_w, ↑ | Not used | Not used | Not used | Not used | R1 → SP
P1 | 2 | Hydropathy index | [1, 30] | max avg_w, ↑ | Not used | Not used | Not used | Not used | R1 → SP
P1 | 3 | Hydropathy index | [1, 30] | max avg_w, ↑ | Not used | Not used | Not used | Not used | R1 → SP
P1 | 4 | Hydropathy index | [1, 30] | max avg_w, ↑ | Not used | Not used | Not used | Not used | R1 → SP
P1 | 5 | Hydropathy index | [1, 30] | max avg_w, ↑ | Not used | Not used | Not used | Not used | R1 → SP
P2 | 1 | Net charge | [1, 25] | min avg_w, ↑ | DE → 0, R → 1, other → 2 | [1, 25] | 202020220 | 3 ins/del | R1 ∧ ¬R2 → mTP
P2 | 2 | Negative charge | [1, 20] | avg, ↓ | DE → 0, R → 1, other → 2 | [1, 25] | 2211221222 | 2 ins/del | R1 ∧ R2 → mTP
P2 | 3 | Negative charge | [1, 20] | max avg_w, ↓ | DE → 0, R → 1, other → 2 | [1, 30] | 2211221222 | 2 ins/del | R1 ∧ R2 → mTP
P2 | 4 | Negative charge | [1, 20] | max avg_w, ↓ | DE → 0, R → 1, other → 2 | [1, 25] | 2212211222 | 2 ins/del | R1 ∧ R2 → mTP
P2 | 5 | Negative charge | [1, 20] | max avg_w, ↓ | DEY → 0, R → 1, other → 2 | [1, 25] | 22122212 | 1 ins/del | R1 ∧ R2 → mTP

Note: ↑ means that the rule will answer yes if the value of f(I(s[u,v])) is above a certain value τ; ↓ is the opposite.

Therefore, together with the interpretation from (mTP + cTP) versus other, we can see that both mTP and cTP lack negatively charged amino acids, but mTP tend to be more positively charged than cTP for the front end of the signal.

Also seen in the alphabet indexing + approximate pattern rule for mTP versus cTP, the region which was best for distinguishing the two signals seemed to be located in the short portion of the sequences, whereas the best regions for distinguishing the other signals tended to be longer.

The plain occurrence count of amino acids did not seem to appear in any of the trials. This is perhaps because the number of certain amino acids is too rough an estimate of the overall biochemical properties of the signals.

Alphabet indexing + approximate pattern rules
(mTP + cTP) versus other [Node P2, rule R2 in Table 1]. The alphabet indexing was stable near the initial charge-based indexing ψ.
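To make the R2-rules in the tables concrete, the sketch below applies an alphabet indexing to a region and tests whether a pattern occurs somewhere in the indexed region with at most k insertions/deletions (no substitutions). It uses the node P3, trial 1 parameters from Table 1 (E → 0, KR → 1, other → 2; region [1, 10]; pattern ‘22112221’; 2 ins/del) on the sequence fragment MARSLARTTTVRQP that appears in Figure 1. The function names and the dynamic-programming formulation are assumptions chosen for illustration; the original software may implement approximate matching differently.

```python
# Alphabet indexing of node P3, trial 1 in Table 1: E -> 0, K and R -> 1, other -> 2.
P3_TRIAL1 = {"E": "0", "K": "1", "R": "1"}


def index_sequence(seq, classes, default="2"):
    """Apply an alphabet indexing: map each residue to its class symbol."""
    return "".join(classes.get(aa, default) for aa in seq)


def indel_distance(pattern, text):
    """Fewest insertions/deletions turning `pattern` into some substring of `text`."""
    prev = [0] * (len(text) + 1)  # an empty pattern matches anywhere at cost 0
    for i, pc in enumerate(pattern, start=1):
        cur = [i] + [0] * len(text)  # matching against an empty substring costs i
        for j, tc in enumerate(text, start=1):
            best = min(prev[j] + 1, cur[j - 1] + 1)  # skip a pattern or a text symbol
            if pc == tc:
                best = min(best, prev[j - 1])  # symbols agree: no extra cost
            cur[j] = best
        prev = cur
    return min(prev)  # the match may end at any position of the text


def r2_rule(seq, classes, u, v, pattern, k):
    """True if `pattern` matches within psi(s[u,v]) using at most k ins/del."""
    indexed = index_sequence(seq[u - 1:v], classes)  # 1-based, inclusive region
    return indel_distance(pattern, indexed) <= k
```

Here the region MARSLARTTT is indexed as 2212221222; dropping one ‘1’ from the pattern gives 2212221, which occurs in the indexed region, so the rule fires for k = 2 but not for k = 0.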
The best in the short portion of the sequences, whereas the best patterns were found to match the ‘other’ sequences, rather regions for distinguishing the other signals tended to be than patterns matching mTP and cTP signals. Although longer. patterns of the latter type would be of more interest, this 303 H.Bannai et al. Table 3. The prediction accuracy of the decision lists (scores of TargetP (Emanuelsson et al., 2000) in parentheses). This represents the sum of the predictions of the five hypotheses of Tables 1 and 2 over the test set True No. of Predicted category Data set category sequences cTP mTP SP Other Sensitivity MCC cTP 141 96 (120) 26 (14) 0 (2) 19 (5) 0.68 (0.85) 0.64 (0.72) mTP 368 25 (41) 309 (300) 4 (9) 30 (18) 0.84 (0.82) 0.75 (0.77) Plant SP 269 6 (2) 9 (7) 244 (245) 10 (15) 0.91 (0.91) 0.92 (0.90) Other 162 8 (10) 17 (13) 2 (2) 135 (137) 0.83 (0.85) 0.71 (0.77) Specificity 0.71 (0.69) 0.86 (0.90) 0.98 (0.96) 0.70 (0.78) mTP 371 – 275 (330) 11 (9) 85 (32) 0.74 (0.82) 0.67 (0.73) Non-plant SP 715 – 8 (13) 660 (683) 47 (19) 0.92 (0.91) 0.90 (0.92) Other 1652 – 119 (152) 44 (49) 1489 (1451) 0.90 (0.85) 0.78 (0.82) Specificity – 0.68 (0.67) 0.92 (0.92) 0.92 (0.97) is natural since mTP and cTP are different signals and The parameters chosen for the rules in each of the training the similarity in their structure may be subtle. Looking at rounds seemed to be fairly stable, suggesting that the the combination of the rules, a signal is rejected for mTP rules are capturing relevant characteristics concerning the or cTP if the sequence contains (nearly) consecutive ‘0’s, N-terminal signals. which is D or E. The occurrence of ‘1’ in each pattern is limited, showing that mTP or cTP signals should contain FUTURE WORK a number of R. K is classified to ‘2’ perhaps showing the For the plant data set, looking at the number of classified asymmetry of R and K in mTP. sequences, the weakness of our predictor seems to lie mainly in the discrimination of mTP and cTP. 
It would mTP versus cTP [Node P2, rule R2 in Table 1]. The be interesting to find another simple but different form of best patterns were found to match mTP sequences. Some rule to discriminate the two types of signals. patterns may be too short to judge, but the patterns In the search we conducted, we defined the regions as ‘22112221’ and ‘1212221’ seem to be capturing the substring intervals, fixed for all the sequences. Although periodic occurrence of R or K (‘1’) in mTP, which is the the N-terminal signals are generally located in a somewhat characteristic of an amphiphilic α-helix (von Heijne et al., fixed area, this may not be true for nuclear sorting 1989). signals, whose position in the sequence looks arbitrary. With the same parameters, we also searched for the The substring interval may be ‘simple’ for humans to best patterns which match cTP and do not match mTP. understand, but may not be simple for the molecules The patterns found were ‘022210’ for trials 1, 3, and 4, detecting the signal. It would be desirable to find a way to ‘2222022222’ for trial 2, and ‘220222110’ for trial 5, all target the actual location of the signal, and then consider with a maximum of 1 insertion/deletion. It is interesting the rules mentioned in this paper. If we are successful, that all the patterns contain a ‘0,’ which is E. there might also be ways to predict cleavage sites by locating candidate areas, and finding some meaningful Non-plant amino acid index or alphabet index rule. A similar interpretation can be done for rules concerning the non-plant data set. The difference from the plant set CONCLUSION being that the alphabet indexing was more stable around ψ . Also looking at the patterns discovered, the first pat- We extensively searched various attributes and their tern ‘202020220’ does not match mTP sequences, mean- simple combinations and were successful in finding a ing that mTP sequences are again rare in D or E. 
For the simple and interpretable rule which could explain the data other patterns ‘2211221222,’‘2212211222,’‘22122212,’ set well. Despite their simplicities, the prediction accuracy we can see again the periodic occurrence of R of an am- of the rules were still competitive with the prediction phiphilic α-helix. The pattern seems to be more stable in scores of TargetP, the best predictor in the literature. the non-plant data set perhaps the data set is much larger An experimental www service for predicting than for the plant set. N-terminal sorting signals using a decision list Overall, the rules discovered can be interpreted in terms trained on the entire data set is provided at: http: of biological knowledge known for the different signals. //hypothesiscreator.net/iPSORT/. The range of parameters 304 Feature detection of N-terminal sorting signals structure of mitochondrial andchloroplast targeting peptides. Eur. searched to make the rules for the web service is different J. Biochem., 180, 535–545. from that in this paper in that the alphabet indexing was Kawashima,S. and Kanehisa,M. (2000) AAindex: Amino Acid searched in a wider range. Also, only avg was considered Index database. Nucleic Acids Res., 28, 374. for f . Other parameters were adjusted to give best cross Kyte,J. and Doolittle,R. (1982) A simple method for displaying the validation scores. hydropathic character of a protein. J. Mol. Biol., 157, 105–132. Maruyama,O., Uchida,T., Shoudai,T. and Miyano,S. (1998) Toward ACKNOWLEDGEMENTS genomic hypothesis creator: view designer for discovery. In Dis- covery Science,(Lecture Notes in Artificial Intelligence 1532), This research was supported in part by Grant-in-Aid pp. 105–116. for Encouragement of Young Scientists and Grant-in-Aid Maruyama,O., Uchida,T., Sim,K.L. and Miyano,S. 
(1999) Design- for Scientific Research on Priority Areas (C) ‘Genome ing views in HypothesisCreator: system for assisting in discov- Information Science’ from the Ministry of Education, ery. In Discovery Science,(Lecture Notes in Artificial Intelli- Sports, Science and Technology of Japan. gence 1721), pp. 115–127. Matthews,B.W. (1975) Comparison of predicted and observed REFERENCES secondary structure of t4 phage lysozyme. Biochim. Biophys. Bannai,H., Tamada,Y., Maruyama,O., Nakai,K. and Miyano,S. Acta, 405, 442–451. (2001) Views: fundamental building blocks in the process of Nakai,K. (2000) Protein sorting signals and prediction of subcellular knowledge discovery. In Proceedings of the 14th International localization. In Bork,P. (ed.), Analysis of Amino Acid Sequences, FLAIRS Conference. AAAI Press, Menlo Park, CA, pp. 233–238. Advances in Protein Chemistry 54, Academic Press, San Diego, Bruce,B.D. (2000) Chloroplast transit peptides: structure, function pp. 277–344. and evolution. Trends Cell Biol., 10, 440–447. Nakai,K. and Kanehisa,M. (1992) A knowledge base for predicting Chou,K.-C. (2001) Using subsite coupling to predict signal pep- protein localization sites in eukaryotic cells. Genomics, 14, 897– tides. Protein Eng., 14,75–79. Claros,M.G. and Vincens,P. (1996) Computational method to pre- Nakai,K. and Horton,P. (1999) PSORT: a program for detecting dict mitochondrially imported proteins and their targeting se- the sorting signals of proteins and predicting their subcellular quences. Eur. J. Biochem., 241, 779–786. localization. Trends Biochem. Sci., 24,34–35. Eisenberg,D. and Mclachalan,A. (1986) Solvation energy in protein Shimozono,S. (1999) Alphabet indexing for approximating features folding and binding. Nature, 319, 199–203. of symbols. Theor. Comput. Sci., 210, 245–260. Emanuelsson,O., Nielsen,H. and von Heijne,G. (1999) Chlorop, a Shimozono,S., Shinohara,A., Shinohara,T., Miyano,S., Kuhara,S. 
neural network-based method for predicting chloroplast transit and Arikawa,S. (1994) Knowledge acquisition from amino acid peptides and their cleavage sites. Protein Sci., 8, 978–984. sequences by machine learning system BONSAI. J. IPS Japan, Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne,G. (2000) 35, 2009–2017. Predicting subcellular localization of proteins based on their Wu,S. and Manber,U. (1992) Fast text searching allowing errors. N-terminal amino acid sequence. J. Mol. Biol., 300, 1005–1016. Commun. ACM, 35,83–91. von Heijne,G. (1990) The signal peptide. J. Membr. Biol., 115, 195– Zimmerman,J., Eliezer,N. and Simha,R. (1968) The characteriza- 201. tion of amino acid sequences in proteins by statistical methods. von Heijne,G., Steppuhn,J. and Herrmann,R.G. (1989) Domain J. Theor. Biol., 21, 170–201. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press
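
As a rough check on the plant rows of Table 3, the per-class scores can be recomputed from the confusion counts. The sketch below assumes the standard Matthews (1975) correlation formula and takes 'specificity' in the sense used in this paper, tp / (tp + fp). The function and variable names are illustrative and are not taken from the authors' software.

```python
import math

# Plant confusion counts from Table 3 (our predictor, not TargetP).
# Rows: true category; columns: predicted category, in LABELS order.
LABELS = ["cTP", "mTP", "SP", "Other"]
PLANT = [
    [96, 26, 0, 19],
    [25, 309, 4, 30],
    [6, 9, 244, 10],
    [8, 17, 2, 135],
]


def binary_counts(matrix, k):
    """Collapse a multi-class confusion matrix to tp, fp, tn, fn for class k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # true k, predicted as something else
    fp = sum(row[k] for row in matrix) - tp  # predicted k, truly something else
    tn = total - tp - fn - fp
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def summarize(matrix, labels):
    """Return {label: (sensitivity, specificity, mcc)} for every class."""
    scores = {}
    for k, name in enumerate(labels):
        tp, fp, tn, fn = binary_counts(matrix, k)
        scores[name] = (tp / (tp + fn), tp / (tp + fp), mcc(tp, fp, tn, fn))
    return scores


if __name__ == "__main__":
    for name, (sens, spec, m) in summarize(PLANT, LABELS).items():
        print(f"{name:5s} sensitivity={sens:.2f} specificity={spec:.2f} MCC={m:.2f}")
```

Rounded to two decimals, the cTP row gives sensitivity 0.68, specificity 0.71 and MCC 0.64, which agrees with the table.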
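
The pattern discussion above (for example '22112221' with up to 2 insertions/deletions under the indexing E → 0, K and R → 1, other → 2) can be illustrated with a small sketch. It uses a plain dynamic-programming matcher rather than the Wu and Manber (1992) algorithm cited in the paper, and the helper names are hypothetical. The peptide fragment is the first ten residues of the first example sequence shown in Figure 1 and is used only for illustration.

```python
# Alphabet indexing of node P3, trial 1 in Table 1: E -> 0, K and R -> 1, other -> 2.
P3_TRIAL1 = {"E": "0", "K": "1", "R": "1"}


def index_sequence(seq, classes, default="2"):
    """Apply an alphabet indexing: map every residue to its class symbol."""
    return "".join(classes.get(aa, default) for aa in seq)


def min_indel_errors(pattern, text):
    """Fewest insertions/deletions needed for `pattern` to match a substring of `text`.

    Substitutions are not allowed, mirroring the ins/del-only mismatch type.
    """
    inf = float("inf")
    prev = [0] * (len(text) + 1)  # the empty pattern matches anywhere at no cost
    for i, p in enumerate(pattern, start=1):
        cur = [i] + [inf] * len(text)  # pattern prefix against empty text
        for j, t in enumerate(text, start=1):
            exact = prev[j - 1] if p == t else inf
            cur[j] = min(exact, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return min(prev)


def r2_rule(seq, classes, u, v, pattern, max_errors):
    """Predict 'positive' if the pattern matches the indexed region s[u, v] (1-based)."""
    region = index_sequence(seq[u - 1:v], classes)
    return min_indel_errors(pattern, region) <= max_errors


if __name__ == "__main__":
    fragment = "MARSLARTTT"  # illustrative fragment only
    print(index_sequence(fragment, P3_TRIAL1))
    print(r2_rule(fragment, P3_TRIAL1, 1, 10, "22112221", 2))
```

Here the fragment indexes to '2212221222', and deleting one '1' from the pattern yields an exact substring match, so the rule fires within the allowance of 2.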

References (21)

Publisher
Oxford University Press
Copyright
© Oxford University Press 2002
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/18.2.298

Abstract

Vol. 18 no. 2 2002 BIOINFORMATICS Pages 298–305 Extensive feature detection of N-terminal protein sorting signals 1 2 3 Hideo Bannai , Yoshinori Tamada , Osamu Maruyama , 1 1 Kenta Nakai and Satoru Miyano Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan, Department of Mathematical Sciences, Tokai University, 1117 Kitakaname, Hiratuka-shi, Kanagawa, 259-1292, Japan and Faculty of Mathematics, Kyushu University, Kyushu University 36, Fukuoka, 812-8581, Japan Received on July 13, 2001; revised on October 5, 2001; accepted on October 18, 2001 ABSTRACT of unknown or unannotated proteins may be used to Motivation: The prediction of localization sites of various gain some indication of its function. For example, the proteins is an important and challenging problem in the information may be used to screen candidate genes for field of molecular biology. TargetP, by Emanuelsson et al. drug discovery. Further, if the rules for prediction were (J. Mol. Biol., 300, 1005–1016, 2000) is a neural network biologically interpretable, this knowledge could help in based system which is currently the best predictor in the designing artificial proteins with desired properties. literature for N-terminal sorting signals. One drawback of TargetP (Emanuelsson et al., 2000), a neural network neural networks, however, is that it is generally difficult to based predictor, is known to be the best predictor in the lit- erature for N-terminal sorting signals. However, although understand and interpret how and why they make such neural networks are ‘readily available’ and ‘often success- predictions. In this paper, we aim to generate simple ful in practice,’ they are also infamous for the difficulty and interpretable rules as predictors, and still achieve involved in trying to understand and interpret their mean- a practical prediction accuracy. We adopt an approach ing (Chou, 2001). 
PSORT (Nakai and Kanehisa, 1992; which consists of an extensive search for simple rules Nakai and Horton, 1999) and MitoProt (Claros and Vin- and various attributes which is partially guided by human cens, 1996), unlike TargetP, are systems which incorpo- intuition. rate existing knowledge about sorting signals, but they use Results: We have succeeded in finding rules whose various real numbers as ‘weights’ in their prediction rules prediction accuracies come close to that of TargetP, while which also may not be trivially interpretable. Also, they still retaining a very simple and interpretable form. We also are somewhat obsolete and their performance is unsatis- discuss and interpret the discovered rules. factory compared to TargetP. Availability: An (experimental) web service using The aim of this work is to derive simple and inter- rules obtained by our method is provided at http: pretable rules which can be used to predict subcellular //hypothesiscreator.net/iPSORT/. localization sites, while still achieving a practical predic- Contact: bannai@ims.u-tokyo.ac.jp tion accuracy. Through our discovery oriented approach to the problem, we managed to find very simple and INTRODUCTION interpretable rules with prediction accuracies which come Most proteins are first synthesized in the cytosol, and fairly close to TargetP. carried to specified locations, such as mitochondria or We will first review the existing knowledge about the chloroplasts. In most cases, the information determining N-terminal signals, and then describe the general idea of the subcellular localization site is represented as a short our approach. amino acid sequence segment called a protein sorting signal (Nakai, 2000). If we could somehow detect the N-terminal sorting signals amino acid sequence encoding this information, we would The signals we consider are signals known to be on be able to predict the localization sites. the N-terminal of the protein. 
Mitochondrial Targeting Prediction of localization sites is useful in various Peptides (mTP), Chloroplast Transit Peptides (cTP), and ways. Because cellular functions are often localized in Signal Peptides (SPs) are the typical N-terminal sorting specific compartments, the prediction of localization sites signals. 298  c Oxford University Press 2002 Feature detection of N-terminal sorting signals Alphabet Indexing + Pattern Rule Mitochondrial Targeting Peptides are known to be rich Alphabet Indexing in Arginine (R), Alanine (A), and Serine (S), while 1 R Pattern 2 A, C, F, G, I, L, S, T, W Positive Examples Positive Examples 3 D, E, H, K, M, N, P, Q, V, Y 1) 3212221222 ‘‘212221’’ 1) TRUE negatively charged amino acid residues (Aspartic Acid (D) 2) 3222122213 2) TRUE 3) ...... 3) ...... and Glutamic Acid (E)) are rare (von Heijne et al., 1989). Amino Acid Sequences Negative Examples Negative Examples 234) 3222222222 234) FALSE Positive Examples Only weak consensus sequences have been found. Further, 235) 3223321222 235) FALSE 1) MARSLARTTTVRQP... ...... ...... 2) MASIRTGSRVSFKC... they are believed to form an amphiphilic α-helix important 3) ............ Cut interval [1, 10] ..... Negative Examples for import into the mitochondrion. 234) MNSLSLSFTTALN... 235) NSFDVSRVSTKTS... Positive Examples Threshold Positive Examples Chloroplast Transit Peptides are known to be rare in ...... 1) 68.1 63.3 < ? 1) TRUE 2) 68.4 Indexing sum 2) TRUE 3) .... acidic residues, and also believed to form an amphiphilic 3) ...... Negative Examples Negative Examples Amino Acid Indexing (ZIMJ680104: Isoelectric point) 234) 57.1 α-helix (Bruce, 2000). 234) FALSE 235) 59.6 A C D E F G H I K L ..... 235) FALSE 6.0 5.1 2.8 3.2 5.5 6.0 7.6 6.0 9.7 6.0 ...... It has been established that a concrete consensus se- M N P Q R S T V W Y 5.7 5.4 6.3 5.7 5.7 5.7 6.0 5.9 5.7 10.8 quence does not occur in SPs. 
Rather, a 3-region structure Amino Acid Index Rule is conserved: a positively charged n-region, a hydrophobic h-region, and a polar c-region (von Heijne, 1990). Fig. 1. Concept diagram of amino acid index rule and alphabet indexing + approximate pattern rule. Overview of our approach Several very important aspects in the process of scientific of the rule when we find a ‘good’ amino acid index knowledge discovery are: (1) the generation or discovery contained in AAindex, which helps greatly in explaining of good attributes, and ways of looking at the data, which the data. Therefore, using AAindex as a knowledge base, is then used to explain the data; (2) the incorporation of we generate amino acid index rules. and reflection on existing knowledge; and (3) the trial and error interaction between the expert and the problem. Alphabet indexing + approximate pattern. Again from We have been developing a computer software library previous studies, we know that there is no clear-cut con- focusing on these points to speed-up this process, and sensus sequence concerning each of the sorting signals. have been applying it to various problems in the field of However, since there does seem to be a common structure bioinformatics (Maruyama et al., 1998, 1999; Bannai et for the same signals, we wish to somehow capture this al., 2001). knowledge. Our approach here is to consider motifs The overall idea of this approach is to create massive which allow more ambiguity by using alphabet indexing amounts of very simple attributes and their trivial combi- (Shimozono, 1999) and approximate patterns (Wu and nations, based on various known attributes. 
This way, if Manber, 1992) over the indexed sequence, similar to the such rules exist for the data, we can expect to overcome BONSAI system (Shimozono et al., 1994), which was the poor descriptive strength of simple rules, while at the successful in discovering meaningful knowledge from same time control the complexity and structure of the rule amino acid sequences. to be generated. An alphabet indexing is a classification of characters Our search for the final hypothesis consists of two main of an alphabet into a smaller set of characters, and can aspects: amino acid index, and alphabet indexing + ap- be viewed as a discrete, unordered version of an amino proximate pattern. The details of each aspect will be pre- acid index. For example, we may divide the amino acids sented in the following sections, but we describe them into the two classes of hydrophobic amino acids and briefly here (see Figure 1). hydrophilic amino acids. Using this alphabet indexing, we can view the amino acid sequence as a sequence of Amino acid index. A large amount of experimental and ‘0’s (hydrophobic) and ‘1’s (hydrophilic), and search for theoretical research has been performed to characterize patterns (e.g. ‘001001100’) contained in the sequences. different kinds of properties of individual amino acids The outline of this paper is as follows: in the next and to represent them in terms of a numerical index. The section, we define the basic concepts used in our methods. AAindex database (Kawashima and Kanehisa, 2000) is a We then show the results we have obtained from applying compilation of 434 of such indices. As noted in Section our methods to the data. Finally, we discuss how the rules N-terminal sorting signals, protein sorting signals have we discovered may be interpreted. 
been characterized by the biochemical properties of the amino acids composing them, and it seems reasonable SYSTEM AND METHODS to assume that some kind of characteristic which is Amino acid index rule important for protein sorting is already contained in the AAindex database. Also, although an amino acid index An amino acid index is a mapping from one amino acid generally assigns a real number for each amino acid, it to a numerical value, representing various physiochemical should be easy for us to interpret the biological meaning and biochemical properties of amino acids. 299 H.Bannai et al. by s[u,v], a pattern p, and mismatch allowance M . A se- DEFINITION 1 (amino acid index). Let A denote the quence is predicted ‘positive’ if p matches somewhere in set of amino acids, and R the set of real numbers. For ψ(s[u,v]) within mismatch allowance M , and ‘negative’ a given amino acid index I : A → R and amino acid otherwise. sequence s = s s ··· s (for i : 1··· n, s ∈ A), let I (s) 1 2 n i denote the homomorphism [| I (s ); I (s ); ... ; I (s )|], 1 2 n The parameters to be chosen are: ψ, u,v, p, M , and the where [|; |] denotes a sequence of values. task again is to find the best combination of the parameters which can distinguish between sequences of different The AAindex Database (Kawashima and Kanehisa, signals. 2000) is a compilation of 434 types of amino acid indices, With R2-rules, we expect to find locally specific charac- which have appeared in various reports. teristics concerning N-terminal sorting signals. DEFINITION 2 (amino acid index rule). An amino acid Data index rule (R1-rule) is defined by: an amino acid index I , The data used in our computational experiments was a specified region of the amino acid sequence s denoted by obtained from the TargetP web-site (http://www.cbs.dtu. s[u,v], a function f ∈{avg, max avg , min avg }, and w w dk/services/TargetP/). 
These data consist of two data sets: a threshold τ , where: avg( I (s)) is defined as the average plant and non-plant sequences. The plant data set of 940 of the values of sequence I (s), max avg ( I (s)) is the sequences contained 368 mTP, 141 cTP, 269 SP, and average of a substring of size w in I (s), which gives the 162 ‘Other’ (consisting of 54 nuclear and 108 cytosolic) maximum value (i.e. max{ I (s ) | s = xs y, |s |= w}), sequences. The non-plant data set of 2738 sequences and min avg ( I (s)) is similarly the average of a substring contained 371 mTP, 715 SP and 1652 ‘Other’ (consisting of size w in I (s) which gives the minimum value (i.e. of 1214 nuclear and 438 cytosolic) sequences. min{ I (s ) | s = xs y, |s |= w}). s[u,v]= s ··· s u v We basically follow the work on TargetP, considering (1  u  v  n) for s = s s ··· s . 1 2 n different predictors for plant and non-plant proteins. Also, A sequence is predicted ‘positive’ by the R1-rule if as in the composition of TargetP, we will first consider f ( I (s[m, n])) > τ and ‘negative’ otherwise. binary predictors which just predict whether or not a given sequence contains a specific signal. The knowledge The parameters to be chosen are: I, u,v, f,w,τ , and the task is to look for the best combination of the parameters obtained from these binary rules is combined into a which can distinguish between sequences of different decision list, to form a final rule. For each binary predictor, signals. we will call the sequences concerning the signal in With R1-rules, we expect to capture the overall proper- question positive examples, and the sequences concerning ties of N-terminal sorting signals. the other signals, negative examples. Search strategies Alphabet indexing + approximate pattern rule We extensively search for various parameters described An alphabet indexing (Shimozono, 1999) is a classifica- in the previous section. Since the size of the search tion of characters of an alphabet. 
It is formally defined as space for the combinations of different parameters is follows: huge, an exhaustive search is not feasible even with the DEFINITION 3 (alphabet indexing). An alphabet in- powerful computers which were available. We adopted dexing ψ is a mapping from one alphabet to another a mixture of heuristics and exhaustive search. Many alphabet , where | |  | |.For x = x x ··· x ∈ , 1 2 l different combinations of the parameters as well as minute let ψ(x ) denote the homomorphism ψ(x )ψ (x )··· ψ 1 2 variations in the heuristics were tried. (x ) ∈ . We will call ψ(x ), the indexed sequence. Combining the rules. To create a single rule predicting the sorting signal for a given sequence, we combine DEFINITION 4 (approximate pattern). An approximate the binary rules generated for each sorting signal into pattern (Wu and Manber, 1992) is a string which can a decision list. The structure of the decision list is match another string, allowing up to k errors (mismatch). shown in Figure 2. The structure was determined greedily, The mismatch can consist of up to three types: Insertion according to the ‘ease’ of discrimination by the R1-rules, (ins), Deletion (del), and Substitution (sub). We will call which was stable for all training/test combinations. the parameters k and the types of mismatches allowed, the From preliminary experiments, R1-rules seemed to be mismatch allowance of the approximate pattern. sufficient for discriminating SP sequences. As for the DEFINITION 5. An alphabet indexing + approximate other signals, neither type of rule seemed good enough for pattern rule (R2-rule) is defined by: an alphabet indexing identifying the signals. 
Therefore, we consider combining ψ , a specified region of the amino acid sequence s denoted both types of rules (R1-rules and R2-rules) into a single 300 Feature detection of N-terminal sorting signals Preliminary experiments showed that R1-rules were fairly stable, but R2-rules seemed to somewhat over-fit the training data. To overcome this problem, in the training phase for R2-rules, the training set was again randomly divided into five sets (4 ttrain and 1 ttest sets), and rules are generated from ttrain and tested with ttest. The arithmetic product of the scores from the ttrain and ttest sets was used to select which alphabet indexing and substring interval to use. The pattern was then trained using all sequences of the original training set, using the alphabet indexing and substring interval. Rules are evaluated by the Matthews Correlation Coef- ficient (MCC; Matthews, 1975), defined by: tp × tn − fp × fn (tp + fn)(tp + fp)(tn + fp)(tn + fn) where tp = true positives, fp = false positives, tn = true negatives, and fn = false negatives. Sensitivity, the frac- tp tion of correctly predicted positive examples , and tp+fn specificity, the fraction of true positives in the examples Fig. 2. The structure of the final rule for plant and non-plant tp predicted as positive , were also calculated for ref- data sets. The last node in parentheses concerning cTP and mTP tp+fp is omitted for the non-plant data set (classifying to mTP). The erence. various parameters such as the amino acid index, substring intervals, Details of the search is given below: alphabet indexing, and patterns which were chosen in the 5 training Amino acid index rule. Since we know that the signals runs are summarized in Tables 1 and 2. are located somewhere in the N-terminal region, we look (somewhat) exhaustively at the substring intervals in this region: for amino acid index I , all 434 entries rule. 
The first node of the decision list discriminates SPs in the AAindex Database together with 20 more entries, with a single R1-rule, while the second and third nodes assigning a value of ‘1’ to one amino acid and ‘0’ to the consist of both R1-rule and R2-rule. The two rules (or rest were considered. For these 454 entries, 72 substring perhaps their negations) are combined with a logical ‘and’ intervals [u,v]=[5n + 1, 5k] (where n = 0··· 8 and where a sequence is judged to have a certain signal if both k = 1··· 8) were considered. For these 454 ∗ 72 = rules say so. 32 688 combinations, f = avg, max avg ,min avg w w Evaluation of prediction accuracy were considered where w was taken to be 6 to 12, resulting in 32 688∗(2∗7+1) = 490 320 combinations. For all these The whole search space conducted in our search was combinations, all possible thresholds are considered for τ : enormous, and involved considerable amounts of human let f ,..., f be all f values of n sequences in sorted intervention, influenced largely by human intuition. To w w w 1 n order. Then, τ = ( f + f )/2, i = 1,..., n − 1. The give a fair estimate for the prediction accuracy of our w w i i+1 combination of I, u,v, f,w,τ which gives the highest methods, we choose a modest range of parameters to MCC score is recorded. search for in the cross validation, and show that the knowledge discovered is fairly stable even in that quite Alphabet indexing + approximate pattern rule. The large range. substring intervals [u,v] is taken to be [5n + 1, 5n + 5k] (where n = 0, 1, 2 and k = 2, 3, 4). For a given alphabet Training and evaluation. We follow the training and indexing ψ , all patterns of length 8 appearing in the evaluation methods used for TargetP. The data were ran- sequences are considered for p. The mismatch types were domly divided into five equal sized data sets by dividing limited to insertion and deletion only (no substitution). 
each subset of sequences with specific localization sites The maximum mismatch number was fixedat2.We into five data sets. Rules were generated by using four of started with the alphabet indexing classifying the amino the data sets as training data, and testing was conducted acids into three classes, according to their ‘charges:’ on the remaining data set. This was repeated for the five possible pairs of training and test set, and the overall 0if x ∈{ D, E}, performance is the sum of the five results. (All rules are ψ (x ) = 1if x ∈{K , R}, generated by using the training data set only.) 2if x ∈ A −{ D, E , K , R} 301 H.Bannai et al. and optimized the indexing by conducting a local search on the alphabet indexing (Shimozono et al., 1994): i.e. consider the alphabet indexings which are obtained by changing the indexing for a single amino acid (40 candidates in this case), and adopt the indexing whose product of the MCC scores for ttrain and ttest is the highest for the best pattern. The process is repeated until a local maximum is reached. A local search strategy was used because an exhaustive search for all possible alphabet indexings would result in 3 combinations, which was not feasible. Numerous tries starting from other alphabet indexings, which were chosen randomly, were –10 1 2 3 4 Value also conducted, but high scoring indexings seemed to be centered around ψ . Fig. 3. Histograms (light-SP, dark-mTP, cTP, other) of max avg Combination of the rules. After determining the param- values of hydropathy index (Kyte and Doolittle, 1982) for the eters for the above rules, the rules and possibly their nega- substring [1, 30] of the plant data set. The threshold was 2.077 27. tions are combined with a logical ‘and,’ but a portion of We can see that there is a clear difference in the distribution. the parameters are trained again. Namely, the substring in- tervals, f , window sizes, amino acid index, and alphabet indexing are fixed. 
We retrain the mismatch allowance, signals, our scores would again rank second, after TargetP pattern, and threshold. Their ranges are expanded in the with respect to the other predictors. retraining: the maximum mismatch number of 1–3 was al- lowed, and all patterns of length 5–10 appearing in the data Biological evaluation of the rules were considered. The top 100 R1-rules are combined with Amino acid index rules all possible R2-rules, and the combination which gives the best MCC score is chosen to be used against the test set. SP versus (mTP + cTP + other) [Node P1, rule R1 in Table 1]. The amino acid index with the highest score was the hydropathy index (Kyte and Doolittle, IMPLEMENTATION 1982), and judging from the substring interval [1, 30], and The software used in our analysis was developed using function max avg where w is around 11, we can say the Hypothesis Creator Library (http://hypothesiscreator. this rule corresponds to characteristics known for SPs (the net/; Bannai et al., 2001). Various shared memory multi- hydrophobic h region) (von Heijne, 1990; Figure 3). What processor computers were available for calculation: 2 SGI is surprising is that such a simple rule could discriminate Origin 2000 with (128, 32) × 195 MHz R10000 proces- SPs so well—better than TargetP for plant proteins. sors, 1 Sun Ultra Enterprise 4500 and 2 Sun Ultra Enter- prise 3500 with (14, 8, 8) × 400 MHz Sun Ultra II proces- (mTP + cTP) versus other [Node P2, rule R1 in Table 1]. sors respectively. Each is equipped with well over 2 GB of The amino acid index was ‘negative charge,’ which memory, which was the limit of the software. assigns a value of 1 to D and E. This also corresponds to known characteristics: mTP and cTP are rare in negatively RESULTS AND DISCUSSION charged amino acids (von Heijne et al., 1989). The parameters found for each training set is summarized in Table 1 for the plant data set, and Table 2 for the mTP versus cTP [Node P3, rule R1 in Table 1]. 
Various non-plant data set. The scores of the cross validation amino acid indices were chosen, with substring regions is summarized in Table 3, together with the scores for for a very short region at the N-terminal. However, the TargetP written in parentheses. (The scores for TargetP amino acid index: isoelectric point (Zimmerman et al., was taken directly from Emanuelsson et al., 2000.) 1968) can be considered as a more accurate measure of the We can see that the MCC scores for our predictor is net amino acid charges. Atom based hydrophobic moment fairly close to those of TargetP, except for cTP. However, (Eisenberg and Mclachalan, 1986) is also a similar amino it should be noted that our score for cTP (0.64) would rank acid index, where the values for R and Lysine (K) are second, after TargetP (0.72), better than PSORT (0.51), higher than the other amino acids. Although values for D MitoProt (0.44), and ChloroP (0.50; Emanuelsson et al., and E are also higher for the atom based hydrophobic 1999), in the comparison of Emanuelsson et al. (2000). moment, these amino acids rarely appear in mTP or cTP, Our predictor scores higher for plant SPs, and for the other and do not effect the average values. Frequency 0 20 40 60 80 100 120 Feature detection of N-terminal sorting signals Table 1. The parameters chosen for each training set (threshold τ is omitted) for the plant data set. 
The nodes Pn corresponds to nodes in Figure 2 R1 R2 Node trial Amino acid index [u,v] f , dir Alphabet indexing [u,v] Pattern Mismatch Combination 1 Hydropathy index [1, 30] Max avg , ↑ Not used Not used Not used R1 → SP 2 Hydropathy index [6, 30] Max avg , ↑ Not used Not used Not used R1 → SP P1 3 Hydropathy index [1, 30] Max avg , ↑ Not used Not used Not used R1 → SP 4 Hydropathy index [6, 30] Max avg , ↑ Not used Not used Not used R1 → SP 5 Hydropathy index [6, 30] Max avg , ↑ Not used Not used Not used R1 → SP 1 Negative charge [1, 25] Avg, ↓ DE → 0, AR → 1, other → 2 [1, 20] 221200020 3 ins/del ¬R1 ∨ R2 → other 2 Negative charge [1, 25] Avg, ↓ DE → 0, CR → 1, other → 2 [1, 20] 20002212 2 ins/del ¬R1 ∨ R2 → other P2 3 Negative charge [1, 25] Avg, ↓ DE → 0, R → 1, other → 2 [1, 15] 022120 1 ins/del ¬R1 ∨ R2 → other 4 Negative charge [1, 30] Max avg , ↓ DE → 0, CR → 1, other → 2 [1, 20] 2002222222 1 ins/del ¬R1 ∨ R2 → other 5 Negative charge [1, 30] Avg, ↓ DE → 0, CRF → 1, other → 2 [1, 20] 020222222 1 ins/del ¬R1 ∨ R2 → other 1 Hyd. mom. [1, 10] Max avg , ↑ E → 0, KR → 1, other → 2 [1, 10] 22112221 2 ins/del R1 ∧ R2 → mTP 2 Isoelectric point [1, 10] Avg, ↑ E → 0, KRW → 1, other → 2 [1, 10] 22110 2 ins/del R1 ∧ R2 → mTP P3 3 Hyd. mom. [1, 10] Max avg , ↑ E → 0, ARW → 1, other → 2 [1, 10] 22110 2 ins/del R1 ∧ R2 → mTP 4 Net charge [1, 10] Avg, ↑ E → 0, KR → 1, other → 2 [1, 10] 1212221 2 ins/del R1 ∧ R2 → mTP 5 Isoelectric point [1, 15] Max avg , ↑ E → 0, DKRW → 1, other → 2 [1, 10] 11221 2 ins/del R1 ∧ R2 → mTP Hydropathy index (Kyte and Doolittle, 1982). Atom based hydrophobic moment (Eisenberg and Mclachalan, 1986). c,d Net charge, Isoelectric point (Zimmerman et al., 1968). The actual rule was R1 ∧¬ R2 → mTP or cTP. f ↑ means that rule will answer yes if the value of f ( I (s[u,v])) is above a certain value τ , f ↓ is the opposite. w w w Table 2. The parameters chosen for each training set (threshold τ is omitted) for the non-plant data set. 
The nodes Pn correspond to nodes in Figure 2.

Node  Trial  R1: amino acid index  R1: [u,v]  R1: f, dir   R2: alphabet indexing       R2: [u,v]  R2: pattern, mismatch  Combination
P1    1      Hydropathy index      [1, 30]    Max avg, ↑   Not used                    Not used   Not used               R1 → SP
P1    2      Hydropathy index      [1, 30]    Max avg, ↑   Not used                    Not used   Not used               R1 → SP
P1    3      Hydropathy index      [1, 30]    Max avg, ↑   Not used                    Not used   Not used               R1 → SP
P1    4      Hydropathy index      [1, 30]    Max avg, ↑   Not used                    Not used   Not used               R1 → SP
P1    5      Hydropathy index      [1, 30]    Max avg, ↑   Not used                    Not used   Not used               R1 → SP
P2    1      Net charge            [1, 25]    Min avg, ↑   DE → 0, R → 1, other → 2    [1, 25]    202020220, 3 ins/del   R1 ∧ ¬R2 → mTP
P2    2      Negative charge       [1, 20]    Avg, ↓       DE → 0, R → 1, other → 2    [1, 25]    2211221222, 2 ins/del  R1 ∧ R2 → mTP
P2    3      Negative charge       [1, 20]    Max avg, ↓   DE → 0, R → 1, other → 2    [1, 30]    2211221222, 2 ins/del  R1 ∧ R2 → mTP
P2    4      Negative charge       [1, 20]    Max avg, ↓   DE → 0, R → 1, other → 2    [1, 25]    2212211222, 2 ins/del  R1 ∧ R2 → mTP
P2    5      Negative charge       [1, 20]    Max avg, ↓   DEY → 0, R → 1, other → 2   [1, 25]    22122212, 1 ins/del    R1 ∧ R2 → mTP

Note to Table 2: f ↑ means that the rule will answer yes if the value of f(I(s[u,v])) is above a certain value τ; f ↓ is the opposite.

Therefore, together with the interpretation from (mTP + cTP) versus other, we can see that both mTP and cTP lack negatively charged amino acids, but mTP tends to be more positively charged than cTP at the front end of the signal. As also seen in the alphabet indexing + approximate pattern rule for mTP versus cTP, the region that was best for distinguishing the two signals seemed to be located in a short portion of the sequences, whereas the best regions for distinguishing the other signals tended to be longer.

The plain occurrence count of amino acids did not appear in any of the trials. This is perhaps because the number of certain amino acids is too rough an estimate of the overall biochemical properties of the signals.

Alphabet indexing + approximate pattern rules

(mTP + cTP) versus other [Node P2, rule R2 in Table 1]. The alphabet indexing was stable near ψ. The best patterns were found to match the ‘other’ sequences, rather than the mTP and cTP signals. Although patterns of the latter type would be of more interest, this is natural since mTP and cTP are different signals and the similarity in their structure may be subtle. Looking at the combination of the rules, a sequence is rejected as mTP or cTP if it contains (nearly) consecutive ‘0’s, that is, D or E. The occurrence of ‘1’ in each pattern is limited, showing that mTP or cTP signals should contain a number of R. K is classified as ‘2’, perhaps showing the asymmetry of R and K in mTP.

mTP versus cTP [Node P3, rule R2 in Table 1]. The best patterns were found to match mTP sequences. Some patterns may be too short to judge, but the patterns ‘22112221’ and ‘1212221’ seem to capture the periodic occurrence of R or K (‘1’) in mTP, which is characteristic of an amphiphilic α-helix (von Heijne et al., 1989).

With the same parameters, we also searched for the best patterns which match cTP and do not match mTP. The patterns found were ‘022210’ for trials 1, 3, and 4, ‘2222022222’ for trial 2, and ‘220222110’ for trial 5, all with a maximum of 1 insertion/deletion. It is interesting that all the patterns contain a ‘0’, which is E.

Non-plant. A similar interpretation can be made for the rules concerning the non-plant data set. The difference from the plant set is that the alphabet indexing was more stable around ψ. Looking at the patterns discovered, the first pattern ‘202020220’ does not match mTP sequences, meaning that mTP sequences again rarely contain D or E. For the other patterns, ‘2211221222’, ‘2212211222’ and ‘22122212’, we again see the periodic occurrence of R characteristic of an amphiphilic α-helix. The pattern seems to be more stable in the non-plant data set, perhaps because the data set is much larger than the plant set.

Overall, the rules discovered can be interpreted in terms of biological knowledge known for the different signals. The parameters chosen for the rules in each of the training rounds seemed to be fairly stable, suggesting that the rules are capturing relevant characteristics of the N-terminal signals.

Table 3. The prediction accuracy of the decision lists (scores of TargetP (Emanuelsson et al., 2000) in parentheses). This represents the sum of the predictions of the five hypotheses of Tables 1 and 2 over the test set. The columns cTP, mTP, SP and Other give the predicted category.

Data set   True category  No. of sequences  cTP          mTP          SP           Other        Sensitivity  MCC
Plant      cTP            141               96 (120)     26 (14)      0 (2)        19 (5)       0.68 (0.85)  0.64 (0.72)
Plant      mTP            368               25 (41)      309 (300)    4 (9)        30 (18)      0.84 (0.82)  0.75 (0.77)
Plant      SP             269               6 (2)        9 (7)        244 (245)    10 (15)      0.91 (0.91)  0.92 (0.90)
Plant      Other          162               8 (10)       17 (13)      2 (2)        135 (137)    0.83 (0.85)  0.71 (0.77)
Plant      Specificity                      0.71 (0.69)  0.86 (0.90)  0.98 (0.96)  0.70 (0.78)
Non-plant  mTP            371               –            275 (330)    11 (9)       85 (32)      0.74 (0.82)  0.67 (0.73)
Non-plant  SP             715               –            8 (13)       660 (683)    47 (19)      0.92 (0.91)  0.90 (0.92)
Non-plant  Other          1652              –            119 (152)    44 (49)      1489 (1451)  0.90 (0.85)  0.78 (0.82)
Non-plant  Specificity                      –            0.68 (0.67)  0.92 (0.92)  0.92 (0.97)

FUTURE WORK

For the plant data set, looking at the number of classified sequences, the weakness of our predictor seems to lie mainly in the discrimination of mTP and cTP. It would be interesting to find another simple but different form of rule to discriminate the two types of signals.

In the search we conducted, we defined the regions as substring intervals, fixed for all the sequences. Although the N-terminal signals are generally located in a somewhat fixed area, this may not be true for nuclear sorting signals, whose position in the sequence looks arbitrary. The substring interval may be ‘simple’ for humans to understand, but may not be simple for the molecules detecting the signal. It would be desirable to find a way to target the actual location of the signal, and then consider the rules mentioned in this paper. If we are successful, there might also be ways to predict cleavage sites by locating candidate areas and finding some meaningful amino acid index or alphabet index rule.

CONCLUSION

We extensively searched various attributes and their simple combinations and were successful in finding simple and interpretable rules which could explain the data set well. Despite their simplicity, the prediction accuracy of the rules was still competitive with the prediction scores of TargetP, the best predictor in the literature.

An experimental www service for predicting N-terminal sorting signals using a decision list trained on the entire data set is provided at: http://hypothesiscreator.net/iPSORT/. The range of parameters searched to make the rules for the web service differs from that in this paper in that the alphabet indexing was searched in a wider range. Also, only avg was considered for f. Other parameters were adjusted to give the best cross validation scores.

ACKNOWLEDGEMENTS

This research was supported in part by Grant-in-Aid for Encouragement of Young Scientists and Grant-in-Aid for Scientific Research on Priority Areas (C) ‘Genome Information Science’ from the Ministry of Education, Sports, Science and Technology of Japan.

REFERENCES

Bannai,H., Tamada,Y., Maruyama,O., Nakai,K. and Miyano,S. (2001) Views: fundamental building blocks in the process of knowledge discovery. In Proceedings of the 14th International FLAIRS Conference. AAAI Press, Menlo Park, CA, pp. 233–238.

Bruce,B.D. (2000) Chloroplast transit peptides: structure, function and evolution. Trends Cell Biol., 10, 440–447.

Chou,K.-C. (2001) Using subsite coupling to predict signal peptides. Protein Eng., 14, 75–79.

Claros,M.G. and Vincens,P. (1996) Computational method to predict mitochondrially imported proteins and their targeting sequences. Eur. J. Biochem., 241, 779–786.

Eisenberg,D. and Mclachalan,A. (1986) Solvation energy in protein folding and binding. Nature, 319, 199–203.

Emanuelsson,O., Nielsen,H. and von Heijne,G. (1999) Chlorop, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci., 8, 978–984.

Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne,G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005–1016.

von Heijne,G. (1990) The signal peptide. J. Membr. Biol., 115, 195–201.

von Heijne,G., Steppuhn,J. and Herrmann,R.G. (1989) Domain structure of mitochondrial and chloroplast targeting peptides. Eur. J. Biochem., 180, 535–545.

Kawashima,S. and Kanehisa,M. (2000) AAindex: Amino Acid Index database. Nucleic Acids Res., 28, 374.

Kyte,J. and Doolittle,R. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157, 105–132.

Maruyama,O., Uchida,T., Shoudai,T. and Miyano,S. (1998) Toward genomic hypothesis creator: view designer for discovery. In Discovery Science, (Lecture Notes in Artificial Intelligence 1532), pp. 105–116.

Maruyama,O., Uchida,T., Sim,K.L. and Miyano,S. (1999) Designing views in HypothesisCreator: system for assisting in discovery. In Discovery Science, (Lecture Notes in Artificial Intelligence 1721), pp. 115–127.

Matthews,B.W. (1975) Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.

Nakai,K. (2000) Protein sorting signals and prediction of subcellular localization. In Bork,P. (ed.), Analysis of Amino Acid Sequences, Advances in Protein Chemistry 54, Academic Press, San Diego, pp. 277–344.

Nakai,K. and Kanehisa,M. (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14, 897–

Nakai,K. and Horton,P. (1999) PSORT: a program for detecting the sorting signals of proteins and predicting their subcellular localization. Trends Biochem. Sci., 24, 34–35.

Shimozono,S. (1999) Alphabet indexing for approximating features of symbols. Theor. Comput. Sci., 210, 245–260.

Shimozono,S., Shinohara,A., Shinohara,T., Miyano,S., Kuhara,S. and Arikawa,S. (1994) Knowledge acquisition from amino acid sequences by machine learning system BONSAI. J. IPS Japan, 35, 2009–2017.

Wu,S. and Manber,U. (1992) Fast text searching allowing errors. Commun. ACM, 35, 83–91.

Zimmerman,J., Eliezer,N. and Simha,R. (1968) The characterization of amino acid sequences in proteins by statistical methods. J. Theor. Biol., 21, 170–201.
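Illustrative sketch (not part of the original article). The alphabet indexing + approximate pattern rules (R2) listed in Table 1 can be illustrated roughly as follows, using the parameters of node P3, trial 1 (E → 0, KR → 1, other → 2; region [1, 10]; pattern 22112221; at most 2 ins/del). The function names, the two toy sequences, and the exact matching semantics (the pattern must align to some substring of the indexed region using only insertions and deletions) are assumptions made for illustration; the authors' implementation may differ.

```python
def alphabet_index(seq, groups, default="2"):
    """Map each residue to an index symbol; residues not listed get `default`."""
    table = {aa: sym for sym, residues in groups.items() for aa in residues}
    return "".join(table.get(aa, default) for aa in seq)


def min_indel_distance(pattern, text):
    """Smallest number of insertions/deletions needed to align `pattern`
    with some substring of `text` (substitutions are not allowed)."""
    n = len(text)
    prev = [0] * (n + 1)  # an empty pattern matches anywhere at cost 0
    for i in range(1, len(pattern) + 1):
        cur = [i] + [0] * n  # pattern prefix against empty text: i deletions
        for j in range(1, n + 1):
            best = min(prev[j] + 1, cur[j - 1] + 1)  # delete / insert
            if pattern[i - 1] == text[j - 1]:
                best = min(best, prev[j - 1])  # exact symbol match
            cur[j] = best
        prev = cur
    return min(prev)  # the match may end at any position in the region


def rule_r2(seq, groups, region, pattern, max_indels):
    """Answer yes if the indexed region [u, v] (1-based, inclusive)
    contains the pattern within `max_indels` insertions/deletions."""
    u, v = region
    indexed = alphabet_index(seq[u - 1:v], groups)
    return min_indel_distance(pattern, indexed) <= max_indels


# Parameters taken from Table 1, node P3, trial 1.
P3_GROUPS = {"0": "E", "1": "KR"}
P3_REGION = (1, 10)
P3_PATTERN = "22112221"
P3_MAX_INDELS = 2

# Toy sequences (hypothetical, not from the paper's data sets).
basic_like = "MALRRLASKLSTE"
acidic_like = "MEELSEAEELK"

print(alphabet_index(basic_like[:10], P3_GROUPS))
print(rule_r2(basic_like, P3_GROUPS, P3_REGION, P3_PATTERN, P3_MAX_INDELS))
print(rule_r2(acidic_like, P3_GROUPS, P3_REGION, P3_PATTERN, P3_MAX_INDELS))
```

In the decision list, R2 is only one half of the test at this node: it is combined with the amino acid index rule R1 (R1 ∧ R2 → mTP), whose threshold τ is omitted from the tables and is therefore not reproduced here.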

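Illustrative sketch (not part of the original article). The per-class scores in Table 3 can be recomputed from the confusion counts, shown here for the plant rows of the decision lists. Two points are assumptions inferred from the numbers rather than stated in this section: ‘specificity’ is taken to be TP / (TP + FP), which is the reading consistent with the column totals, and MCC is the one-versus-rest Matthews correlation coefficient (Matthews, 1975). The function and variable names are hypothetical.

```python
import math

LABELS = ["cTP", "mTP", "SP", "Other"]

# Plant rows of Table 3 (decision-list counts, not the TargetP counts).
# Rows are true categories, columns are predicted categories.
PLANT = [
    [96, 26, 0, 19],
    [25, 309, 4, 30],
    [6, 9, 244, 10],
    [8, 17, 2, 135],
]


def per_class_scores(matrix, k):
    """Return (sensitivity, specificity, MCC) for class k, one-versus-rest."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # true k, predicted otherwise
    fp = sum(row[k] for row in matrix) - tp  # predicted k, truly otherwise
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)             # assumed definition, see above
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


for k, label in enumerate(LABELS):
    sens, spec, mcc = per_class_scores(PLANT, k)
    print(f"{label:5s} sensitivity={sens:.2f} specificity={spec:.2f} MCC={mcc:.2f}")
```

Under these assumptions the recomputed values agree with the rounded plant entries of Table 3, for example sensitivity 0.68, specificity 0.71 and MCC 0.64 for cTP.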
Journal

Bioinformatics, Oxford University Press

Published: Feb 1, 2002