Inter-kingdom prediction certainty evaluation of protein subcellular localization tools: microbial pathogenesis approach for deciphering host microbe interaction

Abstract

Microbial pathogenesis involves several aspects of host–pathogen interactions, including the targeting of host subcellular compartments by microbial proteins and the subsequent effects on host physiology. Such studies are supported by experimental data, but it has recently also become common practice to predict the localization of bacterial proteins with computational tools designed for eukaryotic subcellular protein targeting. We evaluated the inter-kingdom prediction certainty of these tools. Bacterial proteins experimentally known to target host subcellular compartments were submitted to eukaryotic subcellular targeting prediction tools, and prediction certainty was assessed. The results indicate that these tools alone are not sufficient for inter-kingdom protein targeting prediction. Correct prediction of the subcellular targeting of a pathogen’s protein depends on several factors, including the presence of a localization signal, transmembrane domains and molecular weight, in addition to the approach used for prediction. Detection of protein targeting to the endomembrane system is comparatively difficult, as proteins in this location are channeled to different compartments. In addition, high specificity of the training data set lowers inter-kingdom prediction accuracy. The current data can help to suggest a strategy for correct prediction of the subcellular localization of bacterial proteins in the host cell.

Keywords: protein targeting, microbial pathogenesis, in silico, nuclear proteins, mitochondrial proteins

Introduction

Microbial pathogenesis involves a highly coordinated response of pathogens to the host for their survival, growth and reproduction. This coordination is multifaceted and involves microbial attachment to the host and subsequent signaling with the host cell machinery.
These events are managed through multiple processes, including the targeting of the host cell by pathogen proteins. The targeted proteins localize to several host subcellular compartments [1]. The most important among these are the nucleus and the mitochondria, which carry genetic material and control host cell survival and death. Bacterial proteins that migrate to the host nucleus are also known as nucleomodulins [2]. The nucleus is the core of the entire eukaryotic cellular machinery and controls gene expression, which governs the physiology of the whole cell. The mitochondrion is also a critically important organelle of the eukaryotic cell and controls the energy requirement of the cell. It is also involved in regulating the intrinsic pathway of apoptosis, thereby controlling cellular senescence and death. These two organelles have in common that they carry their own genetic material, which is susceptible to several bacterial genetic modulator proteins. In addition, several microbial proteins are known to target the host cell endomembrane system and cytoplasm. The endomembrane system comprises the membrane-bound compartments of the eukaryotic cell, including the nuclear membrane, rough and smooth endoplasmic reticulum, Golgi and cytoplasmic vesicles, which are connected to each other either directly or by vesicle transport. During microbial pathogenesis, these membrane-bound compartments communicate with each other, and pathogen proteins are targeted among the components of the endomembrane system [3–5]. Targeting of bacterial proteins to the host cell cytoplasm is a common event that affects the host cell machinery. For example, anthrax lethal toxin, produced by the bacterium Bacillus anthracis, migrates to the host cell cytoplasm, influences several host proteins including mitogen-activated protein kinase, and kills macrophages and macrophage-like cell lines [6].
Several studies have tried to detect pathogen proteins that target the host cell in order to decipher their role in microbial pathogenesis and in the regulation of host cell physiology, including cell death and proliferation [7–9]. As experimental analysis of a whole microbial proteome is labor-intensive and expensive, and not every laboratory can afford it, computational prediction of microbial proteins targeting the host cell is now routine practice [10–14]. Several computational tools are available for predicting the subcellular targeting of proteins. However, these tools are trained on data sets derived from the same type of organism for which they are designed to predict subcellular targeting, and their capability for inter-organism prediction needs to be investigated. The tools work on a variety of principles, including detection of localization signals, evolutionary information, amino acid composition, dipeptide composition, sequence similarity and transmembrane segments (Figure 1). Each method has its own limitations and advantages, and each tool claims a certain prediction ability (Table 1). Although the prediction reliability of these tools has been assessed for the types of organisms included in their training data sets, their prediction reliability for microbial proteins still needs to be evaluated.

Table 1. Different prediction tools used during the study and their prediction approach, training data set and reliability as mentioned in literature

| Sr. No. | Prediction tool | Database size and validation process | Reliability/prediction performance (as per literature) |
| 1 | cNLS mapper [15, 16] | Predicts NLS in the query protein. NLS activity is measured instead of using a conventional sequence similarity or machine-learning strategy. Every amino acid residue at a given position contributes to the NLS activity score. These predictions were validated by analyzing the effect on NLS activity of replacing each individual amino acid for a certain class in budding yeast. It was found that each amino acid within an NLS contributes to the entire activity independently. Training data and limitations: NLS profiles were prepared from budding yeast data, considering the conserved nature of the importin α/β pathway in eukaryotes, but prediction for other distant organisms may be less efficient. It cannot predict proteins that bind directly to importin β or work with α-independent NLSs. | Sensitivity/specificity/accuracy: Class 1/2: 99/94/98; Class 3: 100/100/100; Class 4: 87/97/92; Bipartite: 87/82/85. Values are based on test peptide sequences from synthetic NLS mutants |
| 2 | PSORT II [17, 18] | Detects sorting signal sequences plus transmembrane segments and membrane topology. Training data: 1531 yeast sequences from Swiss-Prot | 57% for yeast sequences and 86% for Escherichia coli sequences |
| 3 | WOLF PSORT [19] | Uses amino acid composition in addition to PSORT features. Training data set: fungi: 2113; plant: 2333; animal: 12 771 proteins | 70% sensitivity and specificity for mitochondria, nucleus, cytosol, PM, EC and chloroplast. Low sensitivity for other sites |
| 4 | TargetP [20] | Uses N-terminal sequence information only. Plant: chloroplast transit peptide (cTP): 141; mitochondrial targeting peptide (mTP): 368; secretory: 269; nuclear: 102; cytosolic: 195. Non-plant: cytosolic: 438; mTP: 371; secretory: 715; nuclear: 1214 | Plants: 85%; non-plants: 90% (on redundancy-reduced test sets) |
| 5 | MitoProt [21] | Evaluation of 47 parameters of a large set of mitochondrial proteins present in Swiss-Prot. Training data set: 12 432 non-mitochondrial and 607 mitochondrial proteins | Considering only the amino acid sequence: 75–97%; with mitochondrial targeting sequence (MTS): 76–94% |
| 6 | BaCelLo [22] | Evaluates residue sequence and alignment profiles. It evaluates N- and C-terminal sequences as well as the whole protein sequence. The results are balanced across categories to avoid the effect of a biased training data set. The similarity of the data set was reduced so that no protein has >30% identity. Training data set: 2597 animal, 1198 fungal and 491 plant proteins | Animal: 74%; fungi: 76%; plants: 67% |
| 7 | HSLPred [23] | Uses SVM to evaluate amino acid composition, dipeptide composition and PSI-BLAST, and a hybrid method including all of these approaches. Training data set: 3532 human proteins (cytoplasmic: 840; mitochondrial: 315; nuclear: 858; PM: 1519; endoplasmic reticulum: 63; EC: 48; peroxisome: 25; lysosome: 51; Golgi: 32; centrosome: 8; microsome: 21) | Amino acid composition: 76.6%; dipeptide composition: 77.8%; similarity based: 73.3%; hybrid approach: 84.9% |
| 8 | ESLPred [24] | Uses multiple approaches, including amino acid composition-based SVM, physicochemical properties-based SVM, dipeptide composition-based SVM and PSI-BLAST-based SVM, and a hybrid approach involving all of these methods. Training data set: 2427 eukaryotic proteins (cytosol: 684; mitochondrial: 321; nuclear: 1097; EC: 325) | Amino acid composition: 78.1%; physicochemical properties: 77.8%; dipeptide based: 82.4%; hybrid module: 88.0% |
| 9 | SubLoc v 1.0 [25] | Analyzes sequence composition using SVM. Training data set: prokaryotic (cytosol: 688; periplasmic: 202; EC: 107); eukaryotic (nuclear: 1097; cytosol: 684; mitochondrial: 321; EC: 325) | Three locations of prokaryotes: 91.4%; four locations of eukaryotes: 79.4% |
| 10 | EffectiveDB [26] | A combination of tools to predict secretion of bacterial proteins and their subsequent localization in subcellular compartments. We used the following: EffectiveT3 (predicts signal peptide for type 3 secretion system; training data set: 504 T3SS secreted proteins); T4SEPre (predicts type 4 secretion system; training data set: 1913 T4SS effectors from 10 genera); Predotar (predicts N-terminal targeting sequence for host subcellular targeting; training data set: 13 668 proteins with known subcellular location in Swiss-Prot) | ET3: specificity: 93%; sensitivity: 73%; accuracy: 86%; Matthews correlation coefficient (MCC) = 0.66. T4SEPre: sensitivity: 89%; specificity: 97%. Predotar: plant: 91.62%; non-plant: 94.00% [27] |
| 11 | TMPred [28] | Predicts membrane-spanning regions of a protein with their orientation | Average prediction reliability for photosynthetic reaction centre, bacteriorhodopsin and cytochrome c oxidase: 84.5% [29] |

EC = extracellular; PM = plasma membrane.

Figure 1. Graphical outline of the prediction methods used by the different tools and their training data sets.

Estimating the reliability and accuracy of these inter-organism predictions is a challenging task. Therefore, we designed this study to evaluate the ability of eukaryotic subcellular localization prediction tools to predict the targeting of prokaryotic proteins submitted as queries. This calibration is important for maintaining the prediction accuracy of these tools when they are used in studies of microbial pathogenesis.

Materials and methods

Protein sequences

A total of 119 bacterial proteins experimentally known to target host subcellular compartments were selected for the study. These comprised 44 nuclear, 29 mitochondrial, 32 endomembrane system and 14 cytosolic proteins, each known either to target or to interact with the respective subcellular location in the host cell.
Care was taken to avoid including similar sequences that have multiple accession numbers but are derived from similar bacterial strains. In some cases, proteins from two different organisms were included, as their origin from different bacteria made them suitable candidates for the study. The protein sequences were retrieved from UniProt; sequences not found in UniProt were retrieved from the NCBI protein database (details are available in the Supplementary tables). Both plant and animal pathogens (including human pathogens) were selected for prediction.

Selection of tools

The targeting of a pathogen’s proteins in the host cell is governed by multiple host and pathogen factors. Under certain conditions, pathogen proteins can passively localize to host subcellular compartments, a property governed by their molecular weight [30, 31]; we therefore determined the molecular weight of each protein to assess possible passive subcellular targeting. The targeting of host subcellular compartments by pathogen proteins is also regulated by the presence of certain localization signals, so tools predicting these localization signals were included in the study. Prediction tools based on a single prediction approach cannot account for the influence of other factors on subcellular targeting; therefore, tools that detect the bacterial protein secretion mechanism, and tools that predict host subcellular targeting by multiple approaches, including detection of transmembrane helices, evolutionary information and sequence similarity, were also used. As the aim of this study was to determine the prediction certainty for prokaryotic protein targeting in eukaryotic host cells, tools working on diverse principles and training data sets were selected (Figure 1). A total of 11 tools working on different prediction approaches and training data sets were used to predict pathogen protein targeting in the host cell (Table 1; Figure 1).
Among these, the classical nuclear localization signal (cNLS) mapper detects nuclear targeting and was therefore used only for nuclear proteins, and MitoProt, which detects mitochondrial targeting, was used only for proteins targeted to host mitochondria. The remaining seven prediction tools are known to predict both nuclear and mitochondrial subcellular targeting and were therefore used for all proteins irrespective of their type. TargetP detects only mitochondrial, chloroplast and secretory pathway localization signals, but its data set also includes nuclear proteins, so it was likewise used for all types of proteins to understand its effect on protein localization prediction (Table 1). TMPred was used to detect transmembrane helices in the query proteins.

Host subcellular targeting prediction

The bacterial proteins known to target host subcellular compartments were submitted as queries to the above tools. Default parameters were used for prediction, as these are the most frequently used. For cNLS mapper, prediction was performed over the entire protein with an NLS cutoff value of 2.0. Plant and animal/human pathogen proteins were searched against their respective databases wherever required. With some tools, such as ESLPred and HSLPred, subcellular localization can be predicted from individual properties of the query sequence under separate prediction approaches, whereas the hybrid method combines all of these approaches. We used the hybrid method for predicting subcellular targeting, as it has the highest reported prediction accuracy among the approaches (Table 1). TargetP has a related tool, SignalP, which predicts subcellular targeting of bacterial proteins. Nevertheless, we used only TargetP, as we wanted to predict the targeting of bacterial proteins in a eukaryotic system [20].
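The molecular weight check mentioned under Selection of tools can be illustrated with a short sketch. This is not the procedure used in the study, which does not state how molecular weight was computed; the function names molecular_weight_kda and may_enter_nucleus_passively are hypothetical. The sketch sums standard average residue masses, adds one water molecule for the free termini, and compares the result with the 40 kDa limit that the study applies for passive nuclear entry [31].

```python
# Illustrative sketch only: names below are hypothetical, not from the study.

# Standard average masses (Da) of amino acid residues within a peptide chain.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528             # one water for the free N- and C-termini
PASSIVE_NUCLEAR_LIMIT_KDA = 40.0  # threshold quoted in the text [31]


def molecular_weight_kda(sequence: str) -> float:
    """Return the average molecular weight of a protein sequence in kDa."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    total_da = sum(AVERAGE_RESIDUE_MASS[res] for res in seq) + WATER_MASS
    return total_da / 1000.0


def may_enter_nucleus_passively(sequence: str) -> bool:
    """Flag proteins below the 40 kDa limit as candidates for passive entry."""
    return molecular_weight_kda(sequence) < PASSIVE_NUCLEAR_LIMIT_KDA
```

A flag of this kind only indicates that passive entry is possible on size grounds; it says nothing about whether the protein actually reaches the nucleus.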
Detection of bacterial secretion of proteins

Some tools are able to detect whether a protein is released through a bacterial secretion system. This can also give an indication of the targeting of certain proteins to host subcellular compartments, since a prediction of subcellular targeting is more meaningful in practice when the protein is also predicted to be secreted. Therefore, we predicted the bacterial secretion system with EffectiveDB.

Results

Nuclear targeting prediction

The results indicate that detection of an NLS is insufficient to guarantee nuclear localization of a protein. Taking an NLS cutoff value of 5 as a strong nuclear targeting signal, we found a monopartite or bipartite NLS in only 27% of nuclear proteins. Of these NLS-containing proteins, 83.3% had a molecular weight >40 kDa. In contrast, 56.25% of the proteins without an NLS had a molecular weight <40 kDa. The distribution of transmembrane helices was almost equal in proteins with and without an NLS. Figure 2 shows the prediction performance of the different subcellular localization prediction tools for nuclear proteins, using bacterial proteins as queries. BaCelLo and ESLPred gave better inter-kingdom prediction reliability with nuclear proteins >40 kDa. For proteins with a molecular weight <40 kDa that contained transmembrane helices, BaCelLo was not able to predict nuclear targeting accurately. The prediction reliability of PSORT II and WOLF PSORT was poor unless the second prediction choice was also counted as significant. These two tools report, for each candidate location of the query protein, a number of hits (WOLF PSORT) or a percent chance (PSORT II). Including the second predicted location markedly increased the prediction reliability (Figure 2).
However, these tools were also unable to give 100% prediction certainty, and some false-negative predictions occurred depending on factors associated with the query protein; among the tools analyzed, ESLPred maintained almost uniform prediction reliability. Supplementary Table S1 gives details of the overall predictions for bacterial proteins experimentally known to target the host nucleus. Analysis of the distribution of NLSs among proteins of different molecular weight showed that the majority of nuclear-targeted proteins lie between 0 and 150 kDa. Among these, the highest molecular weight protein, with accession number Q2GGH1, had good mono- and bipartite NLS cutoff values. With few exceptions, the monopartite NLS value was lower than the bipartite NLS value, but for proteins of increasing molecular weight the monopartite NLS cutoff value was higher than the bipartite NLS cutoff value (Figure 3).

Figure 2. Prediction of host nuclear targeting of bacterial proteins experimentally known to localize to the host nucleus, and its relation to associated factors.

Figure 3. Prediction of NLS in bacterial proteins experimentally known to target the host nucleus, and its relation to the molecular weight of the proteins.

Mitochondrial targeting prediction

In the analysis of bacterial proteins known to target host mitochondria, the MitoProt P value greatly influenced the prediction ability of the subcellular localization prediction tools.
Bacterial proteins known to target host mitochondria that had a MitoProt P value >0.5 were predicted more reliably by tools that detect sorting signals, such as TargetP, PSORT II, WOLF PSORT and EffectiveDB. Prediction reliability also increased with tools working on multiple approaches, such as BaCelLo. The presence of transmembrane segments reduced prediction reliability with TargetP, PSORT II, WOLF PSORT and EffectiveDB, and with BaCelLo. In contrast, HSLPred, ESLPred and SubLoc were not able to predict proteins without transmembrane helices (Figure 4). The prediction accuracy of WOLF PSORT and PSORT II increased when the second choice was counted as significant. In some situations, the first prediction choice of these tools missed almost all mitochondrial proteins. This suggests that the second prediction choice should not be neglected in such predictions, as it can also give valuable information. Analysis of the MitoProt P value in relation to the molecular weight of bacterial proteins known to target host mitochondria showed no consistent relationship, except that proteins with a molecular weight >250 kDa gave low MitoProt P values (Figure 5). Supplementary Table S2 provides details of the overall predictions for bacterial proteins experimentally known to target host cell mitochondria.

Figure 4. Prediction of host mitochondrial targeting of bacterial proteins experimentally known to localize to host mitochondria, and its relation to associated factors.

Figure 5. Prediction of MitoProt P value in bacterial proteins experimentally known to target host mitochondria, and its relation to the molecular weight of the proteins.
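The effect of counting the second predicted location, reported for WOLF PSORT and PSORT II, amounts to comparing top-1 with top-2 accuracy. The following is a minimal sketch, assuming that each tool's output has already been parsed into a ranked list of location labels. The function name top_k_accuracy and the example rankings are hypothetical and are not data from this study.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    true_location: str,
    k: int,
) -> float:
    """Percentage of proteins whose true location is among the first k choices.

    ranked_predictions holds one ranked list of locations per query protein,
    best choice first. All proteins are assumed to share the same
    experimentally known location (e.g. all are mitochondrial).
    """
    if not ranked_predictions:
        raise ValueError("no predictions supplied")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for ranking in ranked_predictions if true_location in ranking[:k])
    return 100.0 * hits / len(ranked_predictions)


# Hypothetical rankings for four mitochondrial proteins (not study data).
example_rankings = [
    ["cytosol", "mitochondria", "nucleus"],
    ["mitochondria", "cytosol"],
    ["nucleus", "mitochondria"],
    ["cytosol", "nucleus"],
]

first_choice = top_k_accuracy(example_rankings, "mitochondria", k=1)   # 25.0
first_or_second = top_k_accuracy(example_rankings, "mitochondria", k=2)  # 75.0
```

In this toy example the first choice alone recovers one of four proteins, while accepting the second choice recovers three, which mirrors the kind of gain described in the text without reproducing its figures.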
Endomembrane system and cytoplasmic targeting prediction

Detection of bacterial protein targeting to components of the endomembrane system was highly influenced by the presence of transmembrane helices in the query proteins. Prediction performance for these targeting locations was poor overall, and for proteins without transmembrane helices the detection rate was 0%. Correct prediction of cytoplasmic proteins was highest with SubLoc v 1.0, followed by BaCelLo and HSLPred (Figure 6). Supplementary Tables S3 and S4 give details of the targeting predictions made by the different subcellular localization prediction tools for proteins targeting the host endomembrane system and cytosol, respectively.

Figure 6. Prediction of host endomembrane system targeting of bacterial proteins experimentally known to localize to the host endomembrane system, and its relation to associated factors.

In the detection of secretion systems in the query proteins, comparison of the secretion systems listed from the literature with those predicted by EffectiveDB showed correct prediction for 90.9% (nuclear), 83.33% (mitochondrial), 91.66% (endomembrane system) and 66.6% (cytosol) of proteins. The overall prediction performance of the subcellular localization prediction tools is presented as a heat plot (Figure 7).

Figure 7. Overall prediction performance of in silico tools for analysis of bacterial protein localization in host subcellular compartments.
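The per-tool, per-compartment percentages summarized in the figures, and their breakdown by factors such as the presence of transmembrane helices, can be tabulated along the following lines. This is a hedged sketch under the assumption that each prediction has been recorded as one row; the Prediction record, the percent_correct helper, the tool names and the example rows are all hypothetical and are not the study's data or code.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, NamedTuple


class Prediction(NamedTuple):
    tool: str                # prediction tool (hypothetical labels below)
    true_location: str       # experimentally known host compartment
    predicted_location: str  # location reported by the tool
    has_tm_helix: bool       # whether a transmembrane helix was detected


def percent_correct(
    records: Iterable[Prediction],
    key: Callable[[Prediction], Hashable],
) -> Dict[Hashable, float]:
    """Percentage of correct predictions within each group defined by key."""
    totals: Dict[Hashable, int] = defaultdict(int)
    hits: Dict[Hashable, int] = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        if rec.predicted_location == rec.true_location:
            hits[group] += 1
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


# Hypothetical rows (not study data).
records = [
    Prediction("ToolA", "mitochondria", "mitochondria", False),
    Prediction("ToolA", "mitochondria", "cytosol", True),
    Prediction("ToolA", "nucleus", "nucleus", False),
    Prediction("ToolB", "mitochondria", "nucleus", False),
    Prediction("ToolB", "nucleus", "nucleus", True),
]

# One cell per (tool, compartment), as in a heat plot.
by_tool_location = percent_correct(records, key=lambda r: (r.tool, r.true_location))
# Stratified by presence of transmembrane helices.
by_tm = percent_correct(records, key=lambda r: r.has_tm_helix)
```

Passing a different key function, for example one that bins proteins above and below 40 kDa, would give the corresponding molecular weight breakdown with the same helper.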
Discussion

Targeting of the host cell by pathogen proteins is an important part of microbial pathogenesis. Among the host subcellular compartments, the nucleus and mitochondria are important components that form the core of cell survival [14, 32, 33]. The pathogen tries to hijack the host cellular machinery so that host cell survival and death are coordinated according to the pathogen’s requirements. In addition, targeting of the host cell cytoplasm and other membrane-bound subcellular organelles by pathogen proteins has several implications for microbial pathogenesis [1]. Several subcellular protein targeting prediction tools are available, and their number and prediction performance are gradually improving. We selected tools for this study on the basis of diverse prediction approaches and varied training data sets (Figure 1; Table 1). Protein targeting can be passive, through diffusion, in which the protein travels through the available space and stops wherever it finds its proper restricting target [30]. In most cases, in contrast, protein targeting depends on localization signals present in the protein itself [34]. We tried to cover both mechanisms in our study. We determined the molecular weight of nuclear proteins, as proteins <40 kDa are known to be able to localize passively to the host cell nucleus [31]. Prediction tools based on detection of localization signals were included to cover protein targeting by this mechanism (Figure 1). However, such localization signals are not always present in a given category of proteins. For example, only a fraction of nuclear proteins carry an NLS, and additional factors are therefore involved in the subcellular localization of some proteins [35].
Therefore, we additionally included tools that are based on support vector machines (SVMs) and consider multiple factors for predicting protein subcellular targeting. SVMs are supervised computational models that classify data on the basis of a learning algorithm [36]. These SVM-based tools use different prediction approaches and training data sets to maximize the accuracy of the available protein subcellular localization tools (Table 1). The subcellular localization of a query protein is also influenced by the presence of transmembrane domains, which were therefore also included in the study. Although protein localization depends on the multiple factors mentioned above, the targeting of pathogen proteins in the host cell requires additional consideration. Pathogens use specialized secretion systems to export their proteins into the host cell [37]. Prediction of the subcellular targeting of pathogen proteins in the host cell is incomplete without prediction of the secretion system through which the pathogen exports them. Therefore, we included EffectiveDB, which combines tools that detect the secretion system as well as the subcellular targeting of the query protein. Our study predicted the presence of an NLS in only 27% of proteins. This observation is consistent with the finding that only a fraction of nuclear proteins have an NLS, and that such proteins can use alternative strategies for targeting the host nucleus [38]. Low molecular weight proteins can be translocated to the host nucleus, consistent with the fact that proteins <40 kDa can enter the nucleus passively, whereas high molecular weight proteins required a higher NLS cutoff value (Figure 3), which supports the results. The detection of NLSs has multiple advantages and disadvantages.
An NLS is simple in nature, consisting of either one (monopartite) or two (bipartite) stretches of basic amino acids; however, in proteins with multiple predicted NLSs it is not possible to identify the actual functional NLS, which can contribute to inaccurate predictions. Moreover, NLS activity is calculated for an isolated peptide rather than in the context of the native protein structure. The cNLS mapper uses a yeast data set to predict nuclear localization [39]. Nuclear proteins are sometimes known to target multiple locations [40], which creates complex situations for prediction tools. This problem is far more serious when a distantly related protein is used as the query. In our study, we used bacterial proteins as queries to detect their targeting in the eukaryotic host cell. Disparity in targeting prediction is therefore to be expected, and additional parameters should be considered to obtain more precise predictions, as shown in Figure 2. According to the results, ESLPred gave comparatively consistent inter-kingdom prediction accuracy for nuclear proteins. This may be for a number of reasons, including the multiple approaches used by ESLPred. HSLPred and ESLPred are almost identical in their prediction approaches, but differ in their training data sets. HSLPred is trained on a specific human protein data set, whereas ESLPred is trained on a broad group of eukaryotic proteins (Table 1). As shown in Figure 2, the inter-kingdom subcellular targeting prediction performance of ESLPred was always higher than that of HSLPred for nuclear proteins. This indicates that, although a highly specific training data set can provide high prediction accuracy for query proteins from that particular organism [41, 42], it lowers inter-kingdom prediction accuracy when the targeting of prokaryotic pathogen proteins is predicted in a eukaryotic host cell.
Therefore, organism-specific protein subcellular targeting prediction tools cannot solve the problem of in silico detection of one organism’s protein targeting in another distant organism. During detection of pathogen protein targeting in host cell mitochondria, the tools detecting mitochondrial targeting signals (e.g. TargetP, PSORT II, WoLF PSORT and EffectiveDB) gave good inter-kingdom prediction performance. It has already been suggested that bacteria use mitochondrial targeting signals to target their proteins to host mitochondria [1]. This evidence supports the result that proteins with a high MitoProt P value showed good inter-kingdom mitochondrial targeting prediction accuracy (Figure 4), whereas mitochondrial proteins with a low MitoProt P value were poorly predicted. The prediction performance for proteins with transmembrane helices was comparatively poor (Figure 4). Perhaps the transmembrane helices detected by the prediction tools add complexity to the query proteins and reduce the ability of the tools to predict mitochondrial targeting. For example, the Legionella pneumophila protein LncP, which is experimentally known to target host mitochondria, has four strong transmembrane segments (Supplementary Table S2). It has been found that this protein targets mitochondria and forms a specific channel for the transfer of metabolites. It is involved in the evacuation of adenosine triphosphate molecules from the mitochondrial matrix during infection [43]. It is now evident that the targeting of pathogen proteins with a strong mitochondrial sorting signal and without a transmembrane domain is comparatively easier to detect than that of proteins lacking these properties. The influence of the mitochondrial targeting signal on prediction ability is further supported by the fact that the performance of BaCelLo was higher than that of other tools in the same category for bacterial proteins with a MitoProt P value >0.5. 
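The stratification discussed above, by MitoProt P value (cutoff 0.5) and by presence of transmembrane helices, can be sketched as a small grouping routine. This is only an illustration under stated assumptions: the record fields (`mitoprot_p`, `has_tm`, `correct`) and the function name are hypothetical, and the records below are invented toy values, not data from the study.

```python
from collections import defaultdict


def certainty_by_stratum(records, p_cutoff=0.5):
    """Fraction of correct predictions per (high MitoProt P, has TM) stratum."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [correct, total]
    for rec in records:
        stratum = (rec["mitoprot_p"] > p_cutoff, rec["has_tm"])
        counts[stratum][0] += int(rec["correct"])
        counts[stratum][1] += 1
    return {stratum: c / t for stratum, (c, t) in counts.items()}


# Invented toy records for illustration only.
toy_records = [
    {"mitoprot_p": 0.9, "has_tm": False, "correct": True},
    {"mitoprot_p": 0.8, "has_tm": False, "correct": True},
    {"mitoprot_p": 0.7, "has_tm": True, "correct": False},
    {"mitoprot_p": 0.6, "has_tm": True, "correct": True},
    {"mitoprot_p": 0.2, "has_tm": False, "correct": False},
]
toy_result = certainty_by_stratum(toy_records)
```

Keeping the two properties as a joint key, rather than averaging over all proteins, is what lets the effect of transmembrane helices be separated from the effect of the targeting signal.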
BaCelLo considers N- and C-terminal sequences in addition to evolutionary information for its SVM, while ESLPred and HSLPred use different prediction approaches (Table 1). The MitoProt P value was lower for mitochondrial proteins with a molecular weight >250 kDa (Figure 5). This indicates that alternative mechanisms for mitochondrial targeting are possible and should be covered by prediction tools detecting pathogen protein targeting in host cell mitochondria. Analysis of these proteins by TMPred showed that all of them contain a transmembrane domain and may use a mechanism like that of LncP of L. pneumophila. It can be concluded that, for pathogen protein targeting in host cell mitochondria, detection of transmembrane helices and mitochondrial targeting signals should be used as additional parameters to customize the predictions. In addition, host–pathogen protein targeting prediction tools should incorporate these parameters to improve prediction accuracy for microbial pathogenesis-related studies. The prediction performance for endomembrane system proteins was the poorest among all subcellular targeting locations analyzed, especially for proteins without transmembrane helices. None of the prediction tools was able to predict the correct subcellular targeting of endomembrane system proteins without transmembrane helices (Figure 6). The prediction performance for proteins with transmembrane helices was comparatively higher, but still not good. There may be several reasons behind this poor inter-kingdom prediction reliability. The endomembrane system involves protein trafficking through vesicles among multiple compartments [44], and it is already known that pathogen proteins are trafficked through the endomembrane system during infection [45, 46]. For this reason, we treated the endomembrane system as a whole with the intention of obtaining better prediction reliability. 
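Treating the endomembrane system as a whole amounts to pooling its component labels into one class before predictions are compared with the experimentally known location. A minimal sketch of such pooling follows; the label strings, the component set and the function names are assumptions made for illustration, and the labels actually emitted by each tool differ and would need their own mapping.

```python
# Hypothetical component labels pooled into a single class.
ENDOMEMBRANE_COMPONENTS = {
    "nuclear membrane",
    "endoplasmic reticulum",
    "golgi",
    "vesicle",
    "endomembrane system",
}


def pool_location(label):
    """Normalize a label and collapse endomembrane components into one class."""
    normalized = label.strip().lower()
    if normalized in ENDOMEMBRANE_COMPONENTS:
        return "endomembrane system"
    return normalized


def prediction_certainty(predicted, known):
    """Fraction of proteins whose pooled prediction matches the pooled known site."""
    if len(predicted) != len(known):
        raise ValueError("predicted and known must have the same length")
    if not known:
        raise ValueError("no proteins to score")
    hits = sum(pool_location(p) == pool_location(k)
               for p, k in zip(predicted, known))
    return hits / len(known)
```

With this rule a Golgi prediction for a protein known to reach the endoplasmic reticulum is scored as a hit, which is the most lenient reading of correctness for this location.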
A prediction of targeting to any endomembrane system component was counted as correct, but the prediction performance was still poor. The majority of the tools used in the study do not detect endomembrane system targeting. Only PSORT II and WoLF PSORT detected this targeting location on the basis of sorting signals, and HSLPred detected host cell plasma membrane targeting only. This may be the reason for the low inter-kingdom prediction certainty for this location. A tool that includes endomembrane compartment sorting signals, transmembrane domains and other parameters is required for efficient prediction of pathogen protein targeting in the host cell. The poor prediction performance for such proteins deserves an independent study on their properties and on their inclusion in the algorithms of prediction tools to increase prediction certainty. During detection of pathogen protein targeting in the host cell cytoplasm, SubLoc v 1.0 was found to have comparatively higher inter-kingdom prediction certainty. SubLoc is also based on SVM to predict the subcellular targeting of query proteins. Although it has two variants to predict prokaryotic and eukaryotic proteins separately, it does not require a particular system to be specified, so a higher chance of good accuracy with prokaryotic proteins in a eukaryotic system is logical [25]. It analyzes the query protein without asking for its source (animal, plant, bacteria, etc.). This may be the reason behind the comparatively good prediction performance of SubLoc for cytoplasmic proteins. The prediction of the secretion system by EffectiveDB was comparatively good. This tool is primarily designed for detecting secretion of the query protein by bacteria, but it also detects host subcellular targeting [26]. This utility makes it an ideal candidate for microbial pathogenesis-related studies. 
Bacteria use several secretion systems to transport their effectors into the host cell, and information about subcellular targeting can be better inferred together with secretion prediction, specifically for extracellular pathogens. Although the prediction of the secretion system adds valuable input to microbial pathogenesis studies, the detection of subcellular targeting through N-terminal targeting sequences only may be the reason behind the limited subcellular targeting prediction certainty of EffectiveDB [27]. A similar limitation is reflected in another study analyzing the prediction accuracy of the k-nearest neighbors classifier (the PSORT II method), which reported 60% prediction accuracy for 10 yeast classes; this may be the reason behind certain false predictions of PSORT II [47]. The variations in host subcellular targeting prediction among these tools indicate that in silico prediction tools can miss many nuclear and mitochondrial proteins by predicting their subcellular targeting location elsewhere. This does not summarily nullify previous studies predicting bacterial protein targeting in the host cell, but it raises skepticism and suggests that such predictions should be validated further to evaluate actual protein localization and the subsequent impact on the host cell through protein–protein interactions (PPIs). Certainly, there are several factors behind the low inter-kingdom prediction accuracy of these tools. For example, proteins are sometimes not exclusively localized to one location (especially multi-pass membrane proteins), which makes prediction uncertain. Therefore, additional measures are required to increase prediction certainty for such proteins. In addition, the data set used for prediction differs from the query sequence in its origin, and this can be another reason behind variable inter-kingdom prediction accuracy. This study also revealed that several measures can be taken to improve the prediction accuracy of in silico tools. 
Each type of subcellular targeting location can be analyzed by tools based on different approaches. ESLPred performed consistently for nuclear proteins with or without transmembrane (TM) domains or NLSs and across variation in molecular weight. The detection of pathogen proteins targeting host cell mitochondria should be coupled with the additional parameters of MitoProt P value and presence of TM domains; here, tools working on detection of sorting signals performed well in comparison with SVM-based tools using other approaches. The detection of pathogen protein targeting in the endomembrane system is still a challenging task. Tools working on sorting signal detection give a slightly better performance, but methods that incorporate additional parameters are still needed to predict pathogen protein targeting in the host cell accurately (Figure 7). Moreover, it has been experimentally verified that, for proteins with multiple subcellular targeting locations, the in silico approach should be customized [48]. The data generated in this study can add valuable input for customizing the prediction of prokaryotic protein subcellular targeting in the eukaryotic host cell. They provide details about the factors that have a positive or negative impact on the reliability values of these tools. They will also be helpful for the development of prediction tools for such complex situations. In addition, researchers trying to predict the targeting of bacterial proteins with existing tools can apply these recommendations in their studies to obtain better prediction outcomes. Another major addition can be made by predicting host–pathogen PPIs of query proteins by homology modeling, hidden Markov models or gold standard PPI data. Among these PPI methods, gold standard PPI, which depends on experimentally validated PPIs, is the most reliable. Several databases of gold standard PPIs are available and should be incorporated to increase the validity of predictions. 
The high specificity of the training data set and a small number of prediction approaches in a tool also lower inter-kingdom prediction certainty, and therefore tools with these characteristics should be avoided in certain cases. The standard in silico approach involves evaluation of host–pathogen PPIs by multiple methods and detection of host subcellular targeting for the significantly interacting proteins. Therefore, evaluating the influence of microbial proteins on host physiology using only in silico protein subcellular targeting data should be discouraged. In conclusion, this article is not intended to criticize the actual function of the tools analyzed, as their intended function was not tested by us and was presumably already tested by the developers before the tools were made publicly available. Nevertheless, the results indicate that these tools should be used carefully with bacterial proteins as queries and that their predictions should be further validated by PPI data. In addition, the results provide a glimpse of how parameters can be customized for inter-kingdom protein subcellular targeting prediction in microbial pathogenesis-related studies.

Key Points
Pathogen protein targeting in host subcellular compartments is an important part of microbial pathogenesis.
Computational prediction of this targeting is now a common practice.
Detection of prokaryotic protein targeting in the eukaryotic host cell should be done carefully.
Host subcellular compartment targeting can be analyzed by tools based on different approaches, considering several other host–pathogen factors.

Supplementary data
Supplementary data are available online at http://bib.oxfordjournals.org/.

Abdul Arif Khan is working as an Assistant Professor in the Department of Pharmaceutics, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia. He has a strong research interest in the field of cancer-associated infections, including the study of host–pathogen interactions using systems biology approaches. 
He is involved in using computational approaches to decipher the role of microbes in cancer etiology and diagnosis.
Zakir Khan is a Scientist at the Department of Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, USA. He has a major research interest in understanding molecular mechanisms for identifying novel targets/strategies in cancer treatment. He is also involved in using computational approaches to understand molecular mechanisms behind cancer etiology.
Mohd Abul Kalam is working as an Assistant Professor at the Department of Pharmaceutics, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia. His research area is to conduct rigorous translational nanomedicine for promising improvements of potential therapeutics. His expertise is in nanotechnology, including the role of computational approaches in nanotechnology research.
Azmat Ali Khan is working as an Assistant Professor in the Pharmaceutical Biotechnology Laboratory, Department of Pharmaceutical Chemistry, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia. His research focuses on drug delivery via lipid nanoparticles. He is also working on the study of host–pathogen interactions using computational and wet lab tools.

Acknowledgments
The authors are grateful to Deanship of Scientific Research and Research Centre, College of Pharmacy, King Saud University.

References
1 Escoll P, Mondino S, Rolando M, et al. Targeting of host organelles by pathogenic bacteria: a sophisticated subversion strategy. Nat Rev Microbiol 2016; 14: 5–19.
2 Bierne H, Cossart P. When bacteria target the nucleus: the emerging family of nucleomodulins. Cell Microbiol 2012; 14: 622–33.
3 Herweg JA, Hansmeier N, Otto A, et al. Purification and proteomics of pathogen-modified vacuoles and membranes. Front Cell Infect Microbiol 2015; 5: 48.
4 Yu X, Decker KB, Barker K, et al. Host-pathogen interaction profiling using self-assembling human protein arrays. J Proteome Res 2015; 14: 1920–36.
5 Caillaud MC, Piquerez SJ, Fabro G, et al. Subcellular localization of the Hpa RxLR effector repertoire identifies a tonoplast-associated protein HaRxL17 that confers enhanced plant susceptibility. Plant J 2012; 69: 252–65.
6 Tang G, Leppla SH. Proteasome activity is required for anthrax lethal toxin to kill macrophages. Infect Immun 1999; 67: 3055–60.
7 Hou M, Chen R, Yang D, et al. Identification and functional characterization of EseH, a new effector of the type III secretion system of Edwardsiella piscicida. Cell Microbiol 2016, doi: 10.1111/cmi.12638.
8 Zupan JR, Citovsky V, Zambryski P. Agrobacterium VirE2 protein mediates nuclear uptake of single-stranded DNA in plant cells. Proc Natl Acad Sci USA 1996; 93: 2392–7.
9 Pennini ME, Perrinet S, Dautry-Varsat A, et al. Histone methylation by NUE, a novel nuclear effector of the intracellular pathogen Chlamydia trachomatis. PLoS Pathog 2010; 6: e1000995.
10 Khan S, Zakariah M, Rolfo C, et al. Prediction of Mycoplasma hominis proteins targeting in mitochondria and cytoplasm of host cells and their implication in prostate cancer etiology. Oncotarget 2016, doi: 10.18632/oncotarget.8306.
11 Khan S, Zakariah M, Palaniappan S. Computational prediction of Mycoplasma hominis proteins targeting in nucleus of host cell and their implication in prostate cancer etiology. Tumour Biol 2016; 37: 10805–13.
12 Khan S, Imran A, Khan AA, et al. Systems biology approaches for the prediction of possible role of Chlamydia pneumoniae proteins in the etiology of lung cancer. PLoS One 2016; 11: e0148530.
13 Xie LP, Gao Y, Tian SW, et al. Bioinformatics analysis on homology of CagM protein in Helicobacter pylori Cag Pathogenicity Island. Adv Mat Res 2014; 926–30: 1081–4.
14 Moreno-Altamirano MM, Paredes-Gonzalez IS, Espitia C, et al. Bioinformatic identification of Mycobacterium tuberculosis proteins likely to target host cell mitochondria: virulence factors? Microb Inform Exp 2012; 2: 9.
15 Kosugi S, Hasebe M, Tomita M, et al. Systematic identification of cell cycle-dependent yeast nucleocytoplasmic shuttling proteins by prediction of composite motifs. Proc Natl Acad Sci USA 2009; 106: 10171–6.
16 Kosugi S, Hasebe M, Matsumura N, et al. Six classes of nuclear localization signals specific to different binding grooves of importin alpha. J Biol Chem 2009; 284: 478–85.
17 Nakao MC, Nakai K. Improvement of PSORT II protein sorting prediction for mammalian proteins. Genome Inform 2002; 13: 441–2.
18 Nakai K, Horton P. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem Sci 1999; 24: 34–6.
19 Horton P, Park KJ, Obayashi T, et al. WoLF PSORT: protein localization predictor. Nucleic Acids Res 2007; 35: W585–7.
20 Emanuelsson O, Nielsen H, Brunak S, et al. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 2000; 300: 1005–16.
21 Claros MG, Vincens P. Computational method to predict mitochondrially imported proteins and their targeting sequences. Eur J Biochem 1996; 241: 779–86.
22 Pierleoni A, Martelli PL, Fariselli P, et al. BaCelLo: a balanced subcellular localization predictor. Bioinformatics 2006; 22: e408–16.
23 Garg A, Bhasin M, Raghava GP. Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J Biol Chem 2005; 280: 14427–32.
24 Bhasin M, Raghava GP. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res 2004; 32: W414–9.
25 Hua S, Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics 2001; 17: 721–8.
26 Eichinger V, Nussbaumer T, Platzer A, et al. EffectiveDB-updates and novel features for a better annotation of bacterial secreted proteins and Type III, IV, VI secretion systems. Nucleic Acids Res 2016; 44: D669–74.
27 Small I, Peeters N, Legeai F, et al. Predotar: a tool for rapidly screening proteomes for N-terminal targeting sequences. Proteomics 2004; 4: 1581–90.
28 Hofmann K, Stoffel W. TMbase - a database of membrane spanning proteins segments. Biol Chem Hoppe-Seyler 1993; 374: 166.
29 Gromiha MM. A simple method for predicting transmembrane alpha helices with better accuracy. Protein Eng 1999; 12: 557–61.
30 Rudner DZ, Losick R. Protein subcellular localization in bacteria. Cold Spring Harb Perspect Biol 2010; 2: a000307.
31 Tran EJ, Wente SR. Dynamic nuclear pore complexes: life on the edge. Cell 2006; 125: 1041–53.
32 Canonne J, Rivas S. Bacterial effectors target the plant cell nucleus to subvert host transcription. Plant Signal Behav 2012; 7: 217–21.
33 Jiang JH, Tong J, Gabriel K. Hijacking mitochondria: bacterial toxins that modulate mitochondrial function. IUBMB Life 2012; 64: 397–401.
34 Rusch SL, Kendall DA. Protein transport via amino-terminal targeting sequences: common themes in diverse systems. Mol Membr Biol 1995; 12: 295–307.
35 Freitas N, Cunha C. Mechanisms and signals for the nuclear import of proteins. Curr Genomics 2009; 10: 550–7.
36 Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995; 20: 273–97.
37 Costa TR, Felisberto-Rodrigues C, Meir A, et al. Secretion systems in Gram-negative bacteria: structural and mechanistic insights. Nat Rev Microbiol 2015; 13: 343–59.
38 Macara IG. Transport into and out of the nucleus. Microbiol Mol Biol Rev 2001; 65: 570–94.
39 Khan AA. In silico prediction of Escherichia coli proteins targeting the host cell nucleus, with special reference to their role in colon cancer etiology. J Comput Biol 2014; 21: 466–75.
40 Rivas S. Nuclear dynamics during plant innate immunity. Plant Physiol 2012; 158: 87–94.
41 Kaundal R, Raghava GP. RSLpred: an integrative system for predicting subcellular localization of rice proteins combining compositional and evolutionary information. Proteomics 2009; 9: 2324–42.
42 Kaundal R, Saini R, Zhao PX. Combining machine learning and homology-based approaches to accurately predict subcellular localization in Arabidopsis. Plant Physiol 2010; 154: 36–54.
43 Dolezal P, Aili M, Tong J, et al. Legionella pneumophila secretes a mitochondrial carrier protein during infection. PLoS Pathog 2012; 8: e1002459.
44 Hsu VW, Lee SY, Yang JS. The evolving understanding of COPI vesicle formation. Nat Rev Mol Cell Biol 2009; 10: 360–4.
45 Alexander MM, Cilia M. A molecular tug-of-war: global plant proteome changes during viral infection. Current Plant Biol 2016; 5: 13–24.
46 Lu YJ, Schornack S, Spallek T, et al. Patterns of plant subcellular responses to successful oomycete infections reveal differences in host cell reprogramming and endocytic trafficking. Cell Microbiol 2012; 14: 682–97.
47 Horton P, Nakai K. Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proc Int Conf Intell Syst Mol Biol 1997; 5: 147–52.
48 Fuss J, Liegmann O, Krause K, et al. Green targeting predictor and ambiguous targeting predictor 2: the pitfalls of plant protein targeting prediction and of transient protein expression in heterologous systems. New Phytol 2013; 200: 1022–33.

© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Briefings in Bioinformatics, Oxford University Press

Publisher
Oxford University Press
ISSN
1467-5463
eISSN
1477-4054
D.O.I.
10.1093/bib/bbw093

The similarity of data set was reduced to make sure that no protein has >30% identity, and prediction are balanced Training data set: 2597 animals, 1198 fungi and 491 plants proteins  Animal: 74% Fungi: 76% Plants: 67%  7  HSLPred [23]  Uses SVM to evaluate amino acid composition, dipeptide composition, PSI-BLAST and hybrid method including all above approaches Training data set: 3532 human proteins (cytoplasmic: 840; mitochondrial: 315; nuclear: 858; PM: 1519; endoplasmic reticulum: 63; EC: 48; peroxisome: 25; lysosome: 51; Golgi: 32; centrosome: 8; microsome: 21)  Amino acid composition: 76.6% Dipeptide composition: 77.8% Similarity based: 73.3% Hybrid approach: 84.9%  8  ESLPred [24]  Uses multiple approaches including amino acid composition-based SVM, physicochemical properties-based SVM, dipeptide composition-based SVM and PSI-BLAST-based SVM and a hybrid approach involving all above methods Training data set: 2427 eukaryotic proteins (cytosol: 684; mitochondrial: 321; nuclear: 1097; and EC: 325)  Amino acid composition: 78.1% Physicochemical properties: 77.8% Dipeptide based: 82.4% Hybrid module: 88.0%  9  SubLoc v 1.0 [25]  Analyzes sequences composition using SVM Training data set: Prokaryotic (cytosol: 688; periplasmic: 202; EC: 107) Eukaryotic (nuclear: 1097; cytosol: 684; mitochondrial: 321; EC: 325)  Three locations of prokaryotes: 91.4% Four locations of eukaryotes: 79.4%  10  EffectiveDB [26]  It is a combination of tools to predict secretion of bacterial proteins and their subsequent localization in subcellular compartments. 
We used the following: EffectiveT3 (predict signal peptide for type 3 secretion system) Training data set: 504 T3ss secreted proteins T4SEPre (predict type 4 secretion system) Training data set: 1913 T4SS effectors from 10 genera Predotar (predict N-terminal targeting sequence for host subcellular targeting) Training data set: 13 668 proteins with known subcellular location in Swiss-Prot  ET3: specificity: 93%; sensitivity: 73%, accuracy: 86%, Matthews correlation coefficient (MCC) = 0.66 T4SEPre: sensitivity: 89%, specificity: 97% Predotar: plant: 91.62% Non-plant: 94.00% [27]    11  TMPred [28]  It predicts membrane-spanning regions of certain protein with their orientation  Average prediction reliability for photosynthetic reaction centre, bacteriorhodopsin, and cytochrome c oxidase: 84.5% [29]  EC = endothelial cell; PM = plasma membrane. Figure 1 View largeDownload slide Graphical outline for different prediction methods used by different tools and their training data sets. Figure 1 View largeDownload slide Graphical outline for different prediction methods used by different tools and their training data sets. An estimation of the reliability and accuracy of these inter-organism predictions is always a challenging task. Therefore, we designed this study for evaluating the ability of eukaryotic subcellular localization prediction tools to predict prokaryotic proteins as a query. This calibration is highly important in maintaining prediction accuracy of these tools for their use in microbial pathogenesis-related studies. Materials and methods Protein sequences The 119 bacterial proteins experimentally known to target host subcellular compartments were selected for the study. These proteins included 44 (nuclear), 29 (mitochondrial), 32 (endomembrane system), 14 (cytosolic) proteins either known to target or interact with respective subcellular targeting location in host cell. 
Care was taken to avoid similar sequences that carry multiple accession numbers but derive from a similar bacterial strain. In some cases, proteins from two different organisms were included, as their origin from different bacteria made them suitable candidates for the study. The protein sequences were retrieved from UniProt; sequences not found in UniProt were retrieved from the NCBI protein database (details available in Supplementary tables). Both plant and animal pathogens (including human pathogens) were selected for prediction.

Selection of tools

Pathogen protein targeting in the host cell is governed by multiple host and pathogen factors. Under certain conditions, pathogen proteins can localize passively to host subcellular compartments, a property governed by their molecular weight [30, 31]; we therefore determined the molecular weight of each protein to assess passive subcellular targeting. Targeting of pathogen proteins to host subcellular compartments is also regulated by the presence of localization signals, so tools that predict these signals were included in the study. Tools based on a single prediction approach cannot account for the influence of other factors on subcellular targeting; therefore, we also used tools that detect the bacterial protein secretion mechanism and tools that predict host subcellular targeting by multiple approaches, including transmembrane helix detection, evolutionary information and sequence similarity. As the aim of this study was to assess the prediction certainty of prokaryotic protein targeting in eukaryotic host cells, tools working on diverse principles and training data sets were selected (Figure 1). In total, 11 tools based on different prediction approaches and training data sets were used to predict pathogen protein targeting in the host cell (Table 1; Figure 1).
Among these, the classical nuclear localization signal (cNLS) mapper detects nuclear targeting and was therefore used only for nuclear proteins, whereas MitoProt, which detects mitochondrial targeting, was used only for proteins targeted to host mitochondria. The remaining seven prediction tools predict both nuclear and mitochondrial targeting and were therefore used for all proteins irrespective of type. TargetP detects only mitochondrial, chloroplast and secretory pathway localization signals, but its data set also includes nuclear proteins, so it was likewise used for all protein types to understand its effect on localization prediction (Table 1). TMPred was used to detect transmembrane helices in query proteins.

Host subcellular targeting prediction

The bacterial proteins known to target host subcellular compartments were submitted as queries to the above tools. Default parameters were used, as these are the most frequently used. For cNLS mapper, the prediction was performed over the entire protein with an NLS cutoff value of 2.0. Plant and animal/human pathogen proteins were searched against their respective databases wherever required. With some tools, such as ESLPred and HSLPred, subcellular localization can be predicted from individual properties of the query sequence, each under its own prediction approach, whereas the hybrid method combines all of these approaches. We used the hybrid method, as it is reported to have the highest prediction accuracy among the approaches (Table 1). TargetP has a variant, SignalP, which predicts subcellular targeting of bacterial proteins. Nevertheless, we used only TargetP, as we wanted to predict targeting of bacterial proteins in a eukaryotic system [20].
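The assignment of tools to protein classes described above can be summarized as a small lookup. The following Python sketch is illustrative only: the names `GENERAL_TOOLS`, `CLASS_SPECIFIC_TOOLS`, `AUXILIARY_TOOLS` and `tools_for` are hypothetical, the grouping of TargetP with the seven general tools and the use of TMPred for every class are read from the text rather than stated as a list by the authors, and nothing here calls the actual prediction servers.

```python
# Protein classes evaluated in the study.
CATEGORIES = ("nuclear", "mitochondrial", "endomembrane", "cytosolic")

# Tools run on every protein class: the seven general predictors plus TargetP.
GENERAL_TOOLS = (
    "PSORT II", "WOLF PSORT", "TargetP", "BaCeILo",
    "HSLPred", "ESLPred", "SubLoc v 1.0", "EffectiveDB",
)

# Tools restricted to a single protein class.
CLASS_SPECIFIC_TOOLS = {
    "nuclear": ("cNLS mapper",),     # entire protein, NLS cutoff value 2.0
    "mitochondrial": ("MitoProt",),
}

# TMPred reports transmembrane helices rather than a location.
AUXILIARY_TOOLS = ("TMPred",)


def tools_for(category: str) -> list[str]:
    """Return the tools applied to proteins of the given class."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown protein class: {category!r}")
    specific = CLASS_SPECIFIC_TOOLS.get(category, ())
    return [*specific, *GENERAL_TOOLS, *AUXILIARY_TOOLS]
```

Under these assumptions, nuclear and mitochondrial proteins are each examined with 10 tools and the other two classes with 9, and the union over all classes gives the 11 tools listed in Table 1.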
Detection of bacterial secretion of proteins

Some tools can detect whether a protein is released through a bacterial secretion system. This can also indicate targeting of the protein to host subcellular compartments, because secretion combined with subcellular targeting is more meaningful for actual protein targeting in a practical scenario. Therefore, we predicted bacterial secretion systems with EffectiveDB.

Results

Nuclear targeting prediction

The results indicate that detection of an NLS is insufficient to guarantee nuclear localization of a protein. Taking an NLS cutoff value of 5 as a strong nuclear targeting signal, we found that only 27% of the nuclear proteins had monopartite and bipartite NLS. Of these NLS-containing proteins, 83.3% had a molecular weight >40 kDa. In contrast, 56.25% of the proteins without an NLS had a molecular weight <40 kDa. Transmembrane helices were distributed almost equally between proteins with and without an NLS. Figure 2 shows the prediction performance of the different subcellular localization prediction tools for nuclear proteins, using bacterial proteins as queries. BaCeILo and ESLPred gave better inter-kingdom prediction reliability for nuclear proteins >40 kDa. For proteins <40 kDa that contained transmembrane helices, BaCeILo was not able to predict nuclear targeting accurately. The prediction reliability of PSORT II and WOLF PSORT was poor unless the second prediction choice was also counted as significant. These two tools report a number of hits (WOLF PSORT) or a percent chance (PSORT II) for each candidate location of the query protein. Including the second predicted location markedly increased prediction reliability (Figure 2).
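The difference between counting only the first prediction choice and counting the first or second choice can be made concrete with a short Python sketch. It is a minimal illustration, not the scoring procedure used in the study: the function names and the three example records are invented, and each tool output is assumed to have been reduced to a list of locations ordered from most to least likely.

```python
def hit_within_top_k(ranked_locations, true_location, k=1):
    """True if the known location is among the first k predicted locations."""
    return true_location in ranked_locations[:k]


def reliability(predictions, k=1):
    """Fraction of proteins whose known location is within the top k choices.

    predictions: list of (ranked_locations, true_location) pairs.
    """
    if not predictions:
        raise ValueError("no predictions supplied")
    hits = sum(hit_within_top_k(ranked, true, k) for ranked, true in predictions)
    return hits / len(predictions)


# Invented ranked outputs for three hypothetical nuclear proteins.
example = [
    (["nuclear", "cytosolic"], "nuclear"),        # correct as first choice
    (["cytosolic", "nuclear"], "nuclear"),        # correct only as second choice
    (["mitochondrial", "cytosolic"], "nuclear"),  # missed by both choices
]

first_choice = reliability(example, k=1)       # one hit out of three
first_or_second = reliability(example, k=2)    # two hits out of three
```

In this toy case the second choice rescues one protein that the first choice misses, which is the kind of gain the text reports for PSORT II and WOLF PSORT.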
However, these tools also did not give 100% prediction certainty, and some false-negative predictions occurred depending on factors associated with the query protein; among the tools analyzed, ESLPred maintained almost uniform prediction reliability. Supplementary Table S1 gives details of the overall prediction for bacterial proteins experimentally known to target the host nucleus. Analysis of the NLS distribution across molecular weights showed that the majority of nuclear-targeted proteins lie between 0 and 150 kDa. Among these, the highest molecular weight protein (accession number Q2GGH1) had good mono- and bipartite NLS cutoff values. With few exceptions, monopartite NLS values were lower than bipartite NLS values; for proteins of increasing molecular weight, however, the monopartite NLS cutoff value was higher than the bipartite NLS cutoff value (Figure 3).

Figure 2. Prediction of host nuclear targeting for bacterial proteins experimentally known to localize to the host nucleus, and its relation to associated factors.

Figure 3. Prediction of NLS in bacterial proteins experimentally known to target the host nucleus, and its relation to the molecular weight of the proteins.

Mitochondrial targeting prediction

In the analysis of bacterial proteins known to target host mitochondria, the MitoProt P value greatly influenced the prediction ability of the subcellular localization prediction tools.
Bacterial proteins known to target host mitochondria that had a MitoProt P value >0.5 showed increased prediction reliability with tools that detect sorting signals, such as TargetP, PSORT II, WOLF PSORT and EffectiveDB. Prediction reliability also increased with tools working on multiple approaches, such as BaCeILo. The presence of transmembrane segments reduced prediction reliability with TargetP, PSORT II, WOLF PSORT and EffectiveDB, and also with the multiple-approach tool BaCeILo. In contrast, HSLPred, ESLPred and SubLoc were not able to predict proteins without transmembrane helices (Figure 4). WOLF PSORT and PSORT II showed higher prediction accuracy when the second choice was counted as significant. In some situations, the first prediction choice of these tools missed almost all mitochondrial proteins. This suggests that the second prediction choice should not be neglected in such predictions, as it can provide valuable information. Analysis of the MitoProt P value in relation to the molecular weight of bacterial proteins known to target host mitochondria revealed no consistent relationship, except that proteins with a molecular weight >250 kDa gave low MitoProt P values (Figure 5). Supplementary Table S2 provides details of the overall prediction for bacterial proteins experimentally known to target host cell mitochondria.

Figure 4. Prediction of host mitochondrial targeting for bacterial proteins experimentally known to localize to host mitochondria, and its relation to associated factors.

Figure 5. Prediction of MitoProt P value in bacterial proteins experimentally known to target host mitochondria, and its relation to the molecular weight of the proteins.
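The stratified comparison described for the mitochondrial proteins (MitoProt P value above or below 0.5, with or without transmembrane helices) can be sketched as follows. This is a hedged illustration rather than the authors' analysis: the record fields `mitoprot_p`, `tm_helices` and `correct`, the function name `stratified_rates` and the four example records are all invented for the purpose.

```python
from collections import defaultdict


def stratified_rates(records, p_threshold=0.5):
    """Fraction of correct predictions per (MitoProt P stratum, TM stratum)."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [correct, total]
    for rec in records:
        stratum = (
            "P>0.5" if rec["mitoprot_p"] > p_threshold else "P<=0.5",
            "TM" if rec["tm_helices"] > 0 else "no TM",
        )
        counts[stratum][0] += int(rec["correct"])
        counts[stratum][1] += 1
    return {stratum: hits / total for stratum, (hits, total) in counts.items()}


# Invented results of one tool on four hypothetical mitochondrial proteins.
records = [
    {"mitoprot_p": 0.9, "tm_helices": 0, "correct": True},
    {"mitoprot_p": 0.8, "tm_helices": 0, "correct": True},
    {"mitoprot_p": 0.7, "tm_helices": 4, "correct": False},
    {"mitoprot_p": 0.2, "tm_helices": 0, "correct": False},
]
rates = stratified_rates(records)
```

Running the same function once per tool gives the per-stratum comparison that Figure 4 summarizes graphically; the labels on the strata are fixed for the default threshold of 0.5.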
Endomembrane system and cytoplasmic targeting prediction

Detection of bacterial protein targeting to endomembrane system components was strongly influenced by the presence of transmembrane helices in the query proteins. Prediction performance for these locations was poor overall, and for proteins without transmembrane helices the detection rate was 0%. Correct prediction of cytoplasmic proteins was highest with SubLoc v 1.0, followed by BaCeILo and HSLPred (Figure 6). Supplementary Tables S3 and S4 give details of the predictions made by the different subcellular localization prediction tools for proteins targeting the host endomembrane system and cytosol, respectively.

Figure 6. Prediction of host endomembrane system targeting for bacterial proteins experimentally known to localize to the host endomembrane system, and its relation to associated factors.

For secretion system detection, comparison of the secretion systems we listed from the literature with those predicted by EffectiveDB showed correct prediction for 90.9% (nuclear), 83.33% (mitochondrial), 91.66% (endomembrane system) and 66.6% (cytosol) of proteins. The overall prediction performance of the subcellular localization prediction tools is presented as a heat plot (Figure 7).

Figure 7. Overall prediction performance of in silico tools for analysis of bacterial protein localization in host subcellular compartments.
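The kind of summary shown as a heat plot can be thought of as a tool-by-compartment table of percent correct predictions. The Python sketch below shows one way such a table could be tallied; it is an assumption about the bookkeeping, not a description of how Figure 7 was produced, and the function name `performance_matrix`, the tool labels and the outcomes are invented.

```python
def performance_matrix(results):
    """Percent of correct predictions for each (tool, compartment) pair.

    results: iterable of (tool, compartment, correct) triples, one per
    protein and tool; `correct` is True when the predicted location
    matched the experimentally known compartment.
    """
    tally = {}
    for tool, compartment, correct in results:
        hits, total = tally.get((tool, compartment), (0, 0))
        tally[(tool, compartment)] = (hits + int(correct), total + 1)

    matrix = {}
    for (tool, compartment), (hits, total) in tally.items():
        matrix.setdefault(tool, {})[compartment] = 100.0 * hits / total
    return matrix


# Invented outcomes, for illustration only.
demo = [
    ("Tool A", "nuclear", True),
    ("Tool A", "nuclear", False),
    ("Tool A", "cytosolic", True),
    ("Tool B", "nuclear", True),
]
matrix = performance_matrix(demo)
```

Each inner dictionary corresponds to one row of a heat plot; pairs that were never evaluated are simply absent rather than reported as 0%.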
Discussion

Pathogen protein targeting of the host cell is an important part of microbial pathogenesis. Among the host subcellular compartments, the nucleus and mitochondria are important components that form the core of cell survival [14, 32, 33]. The pathogen tries to hijack the host cellular machinery so that host cell survival and death are coordinated according to the pathogen’s requirements. In addition, pathogen protein targeting of the host cell cytoplasm and other membrane-bound organelles has several implications in microbial pathogenesis [1]. Several subcellular protein targeting prediction tools are available, and both their number and their prediction performance are gradually improving. Our selection of tools for this study was based on diversity of prediction approach and of training data set (Figure 1; Table 1). Protein targeting can be passive, through diffusion, in which the protein travels through the available space and stops wherever it finds its proper restricting target [30]. In most cases, however, protein targeting depends on localization signals present in the protein itself [34]. We tried to cover both mechanisms in our study. We recorded the molecular weight of nuclear proteins, as proteins <40 kDa are known to localize passively to the host cell nucleus [31]. Prediction tools based on detection of localization signals were included to cover protein targeting by that mechanism (Figure 1). However, these localization signals are not always present in a given category of proteins. For example, only a fraction of nuclear proteins carry an NLS, so additional factors are involved in the subcellular localization of some proteins [35].
Therefore, we additionally included tools that work on support vector machines (SVMs) and consider multiple factors for subcellular targeting prediction. SVMs are supervised computational models that classify data on the basis of a learning algorithm [36]. The SVM-based tools differ in prediction approach and in training data set, with the aim of maximizing the accuracy of the available subcellular localization tools (Table 1). Subcellular localization of a query protein is also influenced by the presence of transmembrane domains, so transmembrane domain detection was also included in the study. Although protein localization depends on the multiple factors mentioned above, targeting of pathogen proteins to the host cell requires additional measures. Pathogens use specialized secretion systems to export their proteins into the host cell [37]. Prediction of subcellular targeting of pathogen proteins in the host cell is therefore incomplete without prediction of the secretion system that exports them. For this reason, we included EffectiveDB, a combination of tools that detect the secretion system as well as the subcellular targeting of the query protein. Our study predicted the presence of an NLS in only 27% of proteins. This observation is consistent with the finding that only a fraction of nuclear proteins have an NLS and that such proteins can use alternative strategies to target the host nucleus [38]. Low molecular weight proteins can be translocated to the host nucleus, given that proteins <40 kDa can enter the nucleus passively, whereas high molecular weight proteins required a higher NLS cutoff value (Figure 3), which supports the results. Detection of NLS has several advantages and disadvantages.
An NLS is simple in nature, consisting of either one (monopartite) or two (bipartite) stretches of basic amino acids; however, in proteins with multiple predicted NLSs, the actual functional NLS cannot be identified, which can contribute to inaccurate predictions. Moreover, NLS activity is calculated for the isolated peptide rather than in the structural context of the native protein. The cNLS mapper uses a yeast data set to predict nuclear localization [39]. Nuclear proteins are sometimes known to target multiple locations [40], which creates complex situations for prediction tools. The problem is far graver when a distant protein is used as the query. In our study, we used bacterial proteins as queries to detect their targeting in the eukaryotic host cell. Disparity in targeting prediction is therefore to be expected, and additional parameters should be assessed to obtain more precise predictions, as shown in Figure 2. According to the results, ESLPred gave comparatively consistent inter-kingdom prediction accuracy for nuclear proteins. This may have several reasons, including the multiple approaches used by ESLPred. HSLPred and ESLPred are almost identical in prediction approach but differ in training data set: HSLPred is trained on a human-specific protein data set, whereas ESLPred is trained on a broad group of eukaryotic proteins (Table 1). As shown in Figure 2, the inter-kingdom prediction performance of ESLPred for nuclear proteins was always higher than that of HSLPred. Thus, although a highly specific training data set can provide high prediction accuracy for query proteins from that particular organism [41, 42], our results for prokaryotic pathogen proteins targeting the eukaryotic host cell indicate that a highly specific training data set lowers inter-kingdom prediction accuracy.
Therefore, organism-specific subcellular targeting prediction tools cannot solve the problem of in silico detection of one organism’s protein targeting in another, distant organism. For pathogen protein targeting to host cell mitochondria, the tools that detect mitochondrial targeting signals (e.g. TargetP, PSORT II, WOLF PSORT and EffectiveDB) gave good inter-kingdom prediction performance. It has already been suggested that bacteria use mitochondrial targeting signals to direct their proteins to host mitochondria [1]. This supports our result that proteins with a high MitoProt P value show good inter-kingdom mitochondrial targeting prediction accuracy (Figure 4), whereas mitochondrial proteins with a low MitoProt P value are predicted poorly. Prediction performance for proteins with transmembrane helices was comparatively poor (Figure 4). Perhaps the transmembrane helices add complexity to the query proteins and so reduce the ability of the tools to predict mitochondrial targeting. For example, the Legionella pneumophila protein LncP, which is experimentally known to target host mitochondria, has four strong transmembrane segments (Supplementary Table S2). This protein has been found to target mitochondria and form a specific channel for the transfer of metabolites; it is involved in the evacuation of adenosine triphosphate molecules from the mitochondrial matrix during infection [43]. It is now evident that targeting of a pathogen protein with a strong mitochondrial sorting signal and without a transmembrane domain is comparatively easier to detect than in the opposite case. The influence of the mitochondrial targeting signal on prediction ability is further supported by the fact that BaCeILo performed better than other tools of its category in detecting targeting of bacterial proteins with a MitoProt P value >0.5.
BaCeILo considers N- and C-terminal sequences in addition to evolutionary information in its SVM, whereas ESLPred and HSLPred use different prediction approaches (Table 1). The MitoProt P value was low for mitochondrial proteins >250 kDa (Figure 5). This indicates that alternative mechanisms of mitochondrial targeting are possible and should be covered by prediction tools that detect pathogen protein targeting to host cell mitochondria. Analysis of these proteins with TMPred showed that they all contain transmembrane domains and may use a mechanism like that of LncP of L. pneumophila. We conclude that, for detecting pathogen protein targeting to host cell mitochondria, detection of transmembrane helices and of mitochondrial targeting signals should be used as additional parameters to customize the predictions. In addition, host–pathogen protein targeting prediction tools should incorporate these parameters to improve prediction accuracy for microbial pathogenesis-related studies. Prediction performance for endomembrane system proteins was the poorest among all targeting locations analyzed, especially for proteins without transmembrane helices. None of the prediction tools was able to predict the correct subcellular targeting of endomembrane system proteins without transmembrane helices (Figure 6). Prediction performance for proteins with transmembrane helices was comparatively higher, but still not good. There may be several reasons for this poor inter-kingdom prediction reliability. The endomembrane system involves protein trafficking through vesicles across multiple compartments [44], and pathogen proteins are known to be trafficked through the endomembrane system during infection [45, 46]. For this reason, we treated the endomembrane system as a whole, with the intention of obtaining better prediction reliability.
A prediction of targeting to any endomembrane system component was counted as correct, yet prediction performance was still poor. The majority of the tools used in the study do not detect endomembrane system targeting. Only PSORT II and WOLF PSORT detect this location, on the basis of sorting signals, and HSLPred detects host cell plasma membrane targeting only. This may be the reason for the low inter-kingdom prediction certainty for this location. A tool that incorporates endomembrane compartment sorting signals, transmembrane domains and other parameters is needed for efficient prediction of pathogen protein targeting in the host cell. The poor prediction performance for such proteins deserves an independent study of their properties and of how these could be included in prediction algorithms to increase prediction certainty. For pathogen protein targeting to the host cell cytoplasm, SubLoc v 1.0 had comparatively higher inter-kingdom prediction certainty. SubLoc is also SVM-based. It has two variants, which predict prokaryotic and eukaryotic proteins separately; however, because it does not ask for a particular system to predict, its chances of giving good accuracy for prokaryotic proteins in a eukaryotic system are logically higher [25]. It analyzes the query protein without asking for its source (animal, plant, bacteria, etc.). This may explain the comparatively good prediction performance of SubLoc for cytoplasmic proteins. Prediction of the secretion system by EffectiveDB was comparatively good. This tool is primarily designed to detect secretion of the query protein by bacteria, but it also detects host subcellular targeting [26]. This utility makes it an ideal candidate for microbial pathogenesis-related studies.
Bacteria use several secretion systems to transport their effectors into the host cell, and information about subcellular targeting can be better inferred alongside secretion prediction, specifically for extracellular pathogens. Although prediction of the secretion system adds valuable input to microbial pathogenesis studies, detection of subcellular targeting through the N-terminal targeting sequence alone may be the reason for the limited subcellular targeting prediction certainty of EffectiveDB [27]. Similarly, another study analyzing the prediction accuracy of the k-nearest neighbors classifier (the PSORT II method) reported 60% prediction accuracy for 10 yeast classes, which may explain certain false predictions of PSORT II [47]. The variation in host subcellular targeting predictions among these tools indicates that in silico prediction tools can miss many nuclear and mitochondrial proteins, predicting their subcellular location elsewhere. This does not summarily nullify previous studies that predicted bacterial protein targeting in the host cell, but it does suggest that such predictions should be validated further to evaluate actual protein localization and its subsequent impact on the host cell through protein–protein interactions (PPIs). There are certainly several factors behind the low inter-kingdom prediction accuracy of these tools. For example, proteins are sometimes not localized exclusively to one location (especially multi-pass membrane proteins), which makes prediction uncertain. Additional measures are therefore required to increase prediction certainty for such proteins. In addition, the database used for prediction differs in origin from the query sequences, which can be another reason for variable inter-kingdom prediction accuracy. This study also revealed several measures that can be taken to improve the prediction accuracy of in silico tools.
Each type of subcellular targeting location can be analyzed with tools based on different approaches. ESLPred performed consistently for nuclear proteins, with or without transmembrane (TM) helices or NLS and across molecular weights. Detection of pathogen proteins targeting host cell mitochondria should be coupled with the additional parameters of MitoProt P value and presence of TM helices; here, tools that detect sorting signals performed well compared with SVM-based tools using other approaches. Detection of pathogen protein targeting in the endomembrane system remains challenging: tools that detect sorting signals perform slightly better, but methods incorporating additional parameters are still needed to predict pathogen protein targeting in the host cell accurately (Figure 7). It has been experimentally verified that, for proteins with multiple subcellular targeting locations, the in silico approach should be customized [48]. The data from this study can add valuable input for customizing the prediction of prokaryotic protein subcellular targeting in the eukaryotic host cell. They detail the factors that can have a positive or negative impact on the reliability values of these tools. They will also be helpful for developing prediction tools for such complex situations. In addition, researchers predicting bacterial protein localization with existing tools can apply these recommendations to obtain better prediction outcomes. A further major improvement is to predict host–pathogen PPIs of the query proteins by homology modeling, hidden Markov models or gold standard PPI data. Among these, gold standard PPI data, which depend on experimentally validated PPIs, are the most reliable. Several databases of gold standard PPIs are available and should be incorporated to increase the validity of predictions.
The high specificity of the training data set and a small number of prediction approaches within a tool also lead to low inter-kingdom prediction certainty; tools with these characteristics should therefore be avoided in certain cases. The standard in silico approach involves evaluation of host–pathogen PPIs by multiple methods and detection of host subcellular targeting through significant interacting proteins. Therefore, evaluating the influence of microbial proteins on host physiology using only in silico protein subcellular targeting data should be discouraged. In conclusion, this article is not intended to criticize the actual function of the tools analyzed, as that was not tested by us and has presumably already been tested by the developers before the tools were made publicly available. Nevertheless, the results indicate that prediction with these tools using bacterial proteins as queries should be done carefully and further validated by PPI data. In addition, the results provide a glimpse of how parameters can be customized for inter-kingdom protein subcellular targeting prediction in microbial pathogenesis-related studies.

Key Points
Pathogen protein targeting of host subcellular compartments is an important part of microbial pathogenesis.
Computational prediction of this targeting is now common practice.
Detection of prokaryotic protein targeting in the eukaryotic host cell should be done carefully.
Host subcellular compartment targeting can be analyzed by tools based on different approaches, considering several other host–pathogen factors.

Supplementary data
Supplementary data are available online at http://bib.oxfordjournals.org/.

Abdul Arif Khan is an Assistant Professor in the Department of Pharmaceutics, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia. He has a strong research interest in the field of cancer-associated infections, including the study of host–pathogen interactions using systems biology approaches.
He is involved in using computational approaches to decipher the role of microbes in cancer etiology and diagnosis.

Zakir Khan is a Scientist at the Department of Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, USA. He has a major research interest in understanding molecular mechanisms to identify novel targets and strategies in cancer treatment. He is also involved in using computational approaches to understand the molecular mechanisms behind cancer etiology.

Mohd Abul Kalam is an Assistant Professor at the Department of Pharmaceutics, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia. His research focuses on rigorous translational nanomedicine for improving potential therapeutics. His expertise is in nanotechnology, including the role of computational approaches in nanotechnology research.

Azmat Ali Khan is an Assistant Professor in the Pharmaceutical Biotechnology Laboratory, Department of Pharmaceutical Chemistry, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia. His research interest focuses on drug delivery via lipid nanoparticles. He is also working on the study of host–pathogen interactions using computational and wet lab tools.

Acknowledgments
The authors are grateful to the Deanship of Scientific Research and Research Centre, College of Pharmacy, King Saud University.

References
1 Escoll P, Mondino S, Rolando M, et al. Targeting of host organelles by pathogenic bacteria: a sophisticated subversion strategy. Nat Rev Microbiol 2016; 14: 5–19.
2 Bierne H, Cossart P. When bacteria target the nucleus: the emerging family of nucleomodulins. Cell Microbiol 2012; 14: 622–33.
3 Herweg JA, Hansmeier N, Otto A, et al. Purification and proteomics of pathogen-modified vacuoles and membranes. Front Cell Infect Microbiol 2015; 5: 48.
4 Yu X, Decker KB, Barker K, et al. Host-pathogen interaction profiling using self-assembling human protein arrays. J Proteome Res 2015; 14: 1920–36.
5 Caillaud MC, Piquerez SJ, Fabro G, et al. Subcellular localization of the Hpa RxLR effector repertoire identifies a tonoplast-associated protein HaRxL17 that confers enhanced plant susceptibility. Plant J 2012; 69: 252–65.
6 Tang G, Leppla SH. Proteasome activity is required for anthrax lethal toxin to kill macrophages. Infect Immun 1999; 67: 3055–60.
7 Hou M, Chen R, Yang D, et al. Identification and functional characterization of EseH, a new effector of the type III secretion system of Edwardsiella piscicida. Cell Microbiol 2016, doi: 10.1111/cmi.12638.
8 Zupan JR, Citovsky V, Zambryski P. Agrobacterium VirE2 protein mediates nuclear uptake of single-stranded DNA in plant cells. Proc Natl Acad Sci USA 1996; 93: 2392–7.
9 Pennini ME, Perrinet S, Dautry-Varsat A, et al. Histone methylation by NUE, a novel nuclear effector of the intracellular pathogen Chlamydia trachomatis. PLoS Pathog 2010; 6: e1000995.
10 Khan S, Zakariah M, Rolfo C, et al. Prediction of Mycoplasma hominis proteins targeting in mitochondria and cytoplasm of host cells and their implication in prostate cancer etiology. Oncotarget 2016, doi: 10.18632/oncotarget.8306.
11 Khan S, Zakariah M, Palaniappan S. Computational prediction of Mycoplasma hominis proteins targeting in nucleus of host cell and their implication in prostate cancer etiology. Tumour Biol 2016; 37: 10805–13.
12 Khan S, Imran A, Khan AA, et al. Systems biology approaches for the prediction of possible role of Chlamydia pneumoniae proteins in the etiology of lung cancer.
PLoS One 2016; 11: e0148530.
13 Xie LP, Gao Y, Tian SW, et al. Bioinformatics analysis on homology of CagM protein in Helicobacter pylori Cag Pathogenicity Island. Adv Mat Res 2014; 926–30: 1081–4.
14 Moreno-Altamirano MM, Paredes-Gonzalez IS, Espitia C, et al. Bioinformatic identification of Mycobacterium tuberculosis proteins likely to target host cell mitochondria: virulence factors? Microb Inform Exp 2012; 2: 9.
15 Kosugi S, Hasebe M, Tomita M, et al. Systematic identification of cell cycle-dependent yeast nucleocytoplasmic shuttling proteins by prediction of composite motifs. Proc Natl Acad Sci USA 2009; 106: 10171–6.
16 Kosugi S, Hasebe M, Matsumura N, et al. Six classes of nuclear localization signals specific to different binding grooves of importin alpha. J Biol Chem 2009; 284: 478–85.
17 Nakao MC, Nakai K. Improvement of PSORT II protein sorting prediction for mammalian proteins. Genome Inform 2002; 13: 441–2.
18 Nakai K, Horton P. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem Sci 1999; 24: 34–6.
19 Horton P, Park KJ, Obayashi T, et al. WoLF PSORT: protein localization predictor. Nucleic Acids Res 2007; 35: W585–7.
20 Emanuelsson O, Nielsen H, Brunak S, et al. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 2000; 300: 1005–16.
21 Claros MG, Vincens P. Computational method to predict mitochondrially imported proteins and their targeting sequences. Eur J Biochem 1996; 241: 779–86.
22 Pierleoni A, Martelli PL, Fariselli P, et al.
BaCelLo: a balanced subcellular localization predictor. Bioinformatics 2006; 22: e408–16.
23 Garg A, Bhasin M, Raghava GP. Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J Biol Chem 2005; 280: 14427–32.
24 Bhasin M, Raghava GP. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res 2004; 32: W414–9.
25 Hua S, Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics 2001; 17: 721–8.
26 Eichinger V, Nussbaumer T, Platzer A, et al. EffectiveDB-updates and novel features for a better annotation of bacterial secreted proteins and Type III, IV, VI secretion systems. Nucleic Acids Res 2016; 44: D669–74.
27 Small I, Peeters N, Legeai F, et al. Predotar: a tool for rapidly screening proteomes for N-terminal targeting sequences. Proteomics 2004; 4: 1581–90.
28 Hofmann K, Stoffel W. TMbase - a database of membrane spanning proteins segments. Biol Chem Hoppe-Seyler 1993; 374: 166.
29 Gromiha MM. A simple method for predicting transmembrane alpha helices with better accuracy. Protein Eng 1999; 12: 557–61.
30 Rudner DZ, Losick R. Protein subcellular localization in bacteria. Cold Spring Harb Perspect Biol 2010; 2: a000307.
31 Tran EJ, Wente SR. Dynamic nuclear pore complexes: life on the edge. Cell 2006; 125: 1041–53.
32 Canonne J, Rivas S. Bacterial effectors target the plant cell nucleus to subvert host transcription.
Plant Signal Behav 2012; 7: 217–21.
33 Jiang JH, Tong J, Gabriel K. Hijacking mitochondria: bacterial toxins that modulate mitochondrial function. IUBMB Life 2012; 64: 397–401.
34 Rusch SL, Kendall DA. Protein transport via amino-terminal targeting sequences: common themes in diverse systems. Mol Membr Biol 1995; 12: 295–307.
35 Freitas N, Cunha C. Mechanisms and signals for the nuclear import of proteins. Curr Genomics 2009; 10: 550–7.
36 Cortes C, Vapnik V. Support vector network. Mach Learn 1995; 20: 273–97.
37 Costa TR, Felisberto-Rodrigues C, Meir A, et al. Secretion systems in Gram-negative bacteria: structural and mechanistic insights. Nat Rev Microbiol 2015; 13: 343–59.
38 Macara IG. Transport into and out of the nucleus. Microbiol Mol Biol Rev 2001; 65: 570–94.
39 Khan AA. In silico prediction of Escherichia coli proteins targeting the host cell nucleus, with special reference to their role in colon cancer etiology. J Comput Biol 2014; 21: 466–75.
40 Rivas S. Nuclear dynamics during plant innate immunity. Plant Physiol 2012; 158: 87–94.
41 Kaundal R, Raghava GP. RSLpred: an integrative system for predicting subcellular localization of rice proteins combining compositional and evolutionary information. Proteomics 2009; 9: 2324–42.
42 Kaundal R, Saini R, Zhao PX. Combining machine learning and homology-based approaches to accurately predict subcellular localization in Arabidopsis. Plant Physiol 2010; 154: 36–54.
43 Dolezal P, Aili M, Tong J, et al.
Legionella pneumophila secretes a mitochondrial carrier protein during infection. PLoS Pathog 2012; 8: e1002459.
44 Hsu VW, Lee SY, Yang JS. The evolving understanding of COPI vesicle formation. Nat Rev Mol Cell Biol 2009; 10: 360–4.
45 Alexander MM, Cilia M. A molecular tug-of-war: global plant proteome changes during viral infection. Current Plant Biol 2016; 5: 13–24.
46 Lu YJ, Schornack S, Spallek T, et al. Patterns of plant subcellular responses to successful oomycete infections reveal differences in host cell reprogramming and endocytic trafficking. Cell Microbiol 2012; 14: 682–97.
47 Horton P, Nakai K. Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proc Int Conf Intell Syst Mol Biol 1997; 5: 147–52.
48 Fuss J, Liegmann O, Krause K, et al. Green targeting predictor and ambiguous targeting predictor 2: the pitfalls of plant protein targeting prediction and of transient protein expression in heterologous systems. New Phytol 2013; 200: 1022–33.

© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Journal

Briefings in Bioinformatics, Oxford University Press

Published: Jan 1, 2018
