TY - JOUR
AU - Guo, Fei
AB - Abstract

Predicting the subcellular localization of human proteins is valuable for studying biological processes, elucidating protein functions and identifying drug targets. Over the past decade, a number of protein subcellular localization prediction tools have been designed and made freely available online. The purpose of this paper is to summarize recent progress in research on the subcellular localization of human proteins, including the commonly used data sets proposed in previous studies and the performance of all selected prediction tools against the same benchmark data set. We carry out a systematic evaluation of several publicly available subcellular localization prediction methods on various benchmark data sets. Among them, we find that mLASSO-Hum and pLoc-mHum provide a statistically significant improvement in performance, as measured by accuracy, relative to the other methods. Meanwhile, we build a new data set using the latest version of the UniProt database and construct a new GO-based prediction method, HumLoc-LBCI. Then, we test all selected prediction tools on the new data set. Finally, we discuss possible directions for the development of human protein subcellular localization prediction. Availability: The codes and data are available from http://www.lbci.cn/syn/.

Keywords: human proteins, subcellular localization, multi-label classification, sequence information, Gene Ontology terms, web server

Introduction

Protein subcellular localization plays an important role in biological processes. Conventional high-quality localization data are obtained via wet-lab experiments such as electron microscopy, cell fractionation and fluorescent microscopy imaging. However, identifying protein subcellular localization with these traditional experiments is often time-consuming, costly and laborious.
With the development of large-scale sequencing technology, the vast number of newly discovered proteins with unknown or uncertain locations requires reliable and efficient prediction methods [1]. Therefore, an increasing number of computational tools have been developed for fast prediction of protein subcellular localization. Such prediction also plays an important part in drug design, therapeutic target discovery and other biological research [2–5].

Subcellular localization prediction now covers various species. Euk-mPLoc 2.0 [6], iLoc-Euk [7] and pLoc-mEuk [8] focus on eukaryotic proteins; Virus-mPLoc [9], iLoc-Virus [10] and pLoc-mVirus [11] are prediction methods for virus proteins; pLoc-mAnimal [12] and iLoc-Animal [13] target animal proteins; Plant-mPLoc [14], iLoc-Plant [15] and pLoc-mPlant [16] focus on plant proteins; pLoc-mGneg [17] and Gneg-mPLoc [18] address Gram-negative bacterial proteins; and Gpos-mPLoc [19], pLoc-mGpos [20] and iLoc-Gpos [21] address Gram-positive bacterial proteins. Human protein subcellular localization has become an important part of this research, and many prediction tools have been proposed [22], such as iLoc-Hum [23] and mGOF-loc [24]. Predicting human protein subcellular localization helps to elucidate protein function, identify drug targets and understand the mechanisms of human diseases caused by protein mislocalization.

The problem of human protein subcellular localization is to determine the specific locations of a protein in a cell, which involves both single-label and multi-label proteins. Recent studies focus on predicting multi-location proteins, which play key roles in various life activities in more than one cellular compartment. We carry out a detailed statistical analysis of the subcellular locations of a large number of human proteins; the common locations are listed in Table 1.
Table 1 Common subcellular locations in human proteins

No.  Location                        No.  Location
1    Cytoplasm                       21   Extracellular space
2    Nucleus                         22   Perinuclear region
3    Cell membrane                   23   Mitochondrion inner membrane
4    Membrane                        24   Extracellular matrix
5    Secreted                        25   Cilium
6    Cytoskeleton                    26   Cilium
7    Cell projection                 27   Postsynaptic cell membrane
8    Endoplasmic reticulum membrane  28   Nucleus speckle
9    Cell junction                   29   Nucleoplasm
10   Mitochondrion                   30   Mitochondrion matrix
11   Golgi apparatus                 31   Secretory vesicle
12   Cytoplasmic vesicle             32   Endosome membrane
13   Synapse                         33   Spindle
14   Golgi apparatus membrane        34   Centromere
15   Microtubule organizing center   35   Endosome
16   Nucleolus                       36   Nucleus membrane
17   Centrosome                      37   Mitochondrion outer membrane
18   Cytosol                         38   Cytoplasmic vesicle membrane
19   Chromosome                      39   Lysosome
20   Endoplasmic reticulum           40   Melanosome

Figure 1 Summary of different protein subcellular localization prediction tools.

In general, existing computational methods for identifying protein subcellular localization can be grouped into two categories: homology search-based models and machine learning-based models. The homology search-based approach can be considered a nearest neighbor prediction model, where the distance between two proteins is measured by their sequence identity. It searches the query protein against a large number of annotated proteins, selects the top K closest proteins and analyzes their annotations [25, 26] to predict the localization of the query protein.
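The nearest-neighbor view of homology search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` stands in as a crude similarity score for real alignment-based sequence identity, and the function names, the toy sequences, the value of K and the voting threshold `min_votes` are hypothetical choices rather than details of the methods cited in [25, 26].

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(seq_a, seq_b):
    """Crude similarity in [0, 1]; an assumed stand-in for alignment-based identity."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def predict_locations(query, annotated, k=3, min_votes=2):
    """Return the locations supported by at least min_votes of the k closest proteins."""
    # Rank annotated proteins from most to least similar to the query.
    ranked = sorted(annotated, key=lambda item: similarity(query, item[0]), reverse=True)
    votes = Counter()
    for _, locations in ranked[:k]:
        votes.update(locations)  # each neighbor votes once per location it carries
    return {loc for loc, count in votes.items() if count >= min_votes}


# Toy annotated proteins (hypothetical sequences and labels).
annotated = [
    ("MKTAYIAKQR", {"Nucleus"}),
    ("MKTAYIAKQK", {"Nucleus", "Cytoplasm"}),
    ("GGGGGGGGGG", {"Secreted"}),
    ("MKTAYIAKQQ", {"Cytoplasm", "Nucleus"}),
]
predicted = predict_locations("MKTAYIAKQR", annotated)
print(sorted(predicted))
```

Because votes are counted per location, a query can receive several locations at once, which is how a nearest-neighbor scheme of this kind can handle multi-location proteins.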
The machine learning-based approach depends on extracting discriminative features and incorporating associated prior knowledge. These methods can be divided into three categories according to whether they use sequence-based, annotation-based or combination-based features. We list some related prediction tools in Figure 1.

Figure 2 The pipeline of protein subcellular localization prediction.

The 1st category uses only sequence information [27–33]. The prediction tool proposed by Wei et al. [24], mGOF-Loc, explores sequence evolutionary information and generates a comprehensive feature set with 828 dimensions from three aspects: physicochemical properties, the position-specific scoring matrix (PSSM) and the k-skip-n-gram model. Garg et al. proposed a systematic approach for predicting the subcellular localization (cytoplasm, mitochondrial, nuclear and plasma membrane) of human proteins. It makes full use of the traditional amino acid composition (AAC), dipeptide composition and similarity information obtained by PSI-BLAST. In addition, there are many other ways to extract sequence information, such as PsePSSM, the discrete wavelet transform and average block (AvBlock).

The 2nd category is based on annotation information [34–37], such as Gene Ontology (GO) terms [38], Swiss-Prot keywords and PubMed abstracts. Because GO terms carry high-level abstracted knowledge and provide ample annotations, they often yield high accuracy, and they are the most commonly used annotation information in many studies. GO-based methods have been widely used in many bioinformatics fields. However, the large volume of annotation data introduces some redundant and erroneous information into the annotation database. Therefore, selecting the essential GO terms is necessary; it reduces unnecessary features and produces more interpretable results [39, 40].
Moreover, the depth-dependent GO hierarchical information is more useful than the GO occurrence frequency. All GO terms in the GOA database form a directed acyclic graph (DAG) [34], which encodes the hierarchical relationships between GO terms. The prediction tool mLASSO-Hum [34] yields sparse and interpretable solutions for large-scale prediction of human protein subcellular localization. It improved the extraction method of GO terms [41] and made full use of the hierarchical relationships between GO terms. By using one-versus-rest LASSO-based classifiers, it can decide not only where a protein resides within a cell, but also why it is located there.

The 3rd category combines the above two and uses both sequence information and GO terms. In the past decades, many state-of-the-art machine learning-based methods for predicting human protein subcellular localization [42–44] have been proposed. The prediction tool iLoc-Hum [23] used pseudo-AAC, PSSM and GO features to train an accumulation-label K-nearest neighbor (KNN) classifier, which determines the number of labels for a query protein from the labels of its K nearest neighbor proteins. The prediction tool pLoc-mHum [45] incorporated the crucial GO information into the general pseudo-AAC, and its authors reported that pLoc-mHum is remarkably superior to iLoc-Hum. The prediction tool Hum-mPLoc 3.0 [46] trained support vector machine (SVM) classifiers to locate human proteins using both sequence and annotation information. It further optimized the use of GO and functional domain information on the basis of Hum-mPLoc [47] and Hum-mPLoc 2.0 [41].

The above three types of methods predict the subcellular locations of human proteins based on different feature representations. By summarizing the existing prediction tools for human protein localization, we find that the prediction tools using both sequence information and GO terms perform best.
Among the sequence-based methods, prediction tools based on the PSSM perform well. When GO terms [38] are used to predict human protein subcellular localization, the performance of prediction tools improves significantly. Prediction tools with combined features offer both a rich feature representation and the best performance. In addition, a small number of prediction tools use other function-based features [48–52], such as sorting signals, functional domains and sequence motifs.

In this paper, like other existing studies [53–56], we focus on human protein subcellular localization prediction. We summarize the existing data sets proposed in previous studies and the commonly used databases on human protein subcellular localization, and we compare the multi-label prediction web servers. To achieve a fair comparison, we test the performance of all selected prediction tools against the same benchmark data set. We also propose a new data set to verify some well-performing prediction tools and construct a new GO-based prediction tool, HumLoc-LBCI.
Table 2 Commonly used databases for human protein subcellular localization

Database   Web server                             Reference
UniProtKB  https://www.uniprot.org/               [59]
LOCATE     http://locate.imb.uq.edu.au/           [60]
eSLDB      http://gpcr.biocomp.unibo.it/esldb/    [61]
LocDB      http://www.rostlab.org/services/locDB  [62]

Materials and methods

Human subcellular localization is an important branch of protein localization research, and great progress has been made so far. Here, we summarize some databases commonly used in computational biology and other databases dedicated to protein subcellular localization. We also discuss the relevant data sets proposed in previous research on human protein subcellular localization. Almost all of them are derived from the UniProtKB (Swiss-Prot) [57, 58] database. For comparison, we select several prediction tools for human protein localization that provide web servers. All of them are designed for human protein subcellular localization prediction and validated on human protein data sets. Figure 2 shows the overall process of protein subcellular localization prediction.
Table 3 The multi-label data sets for human protein subcellular localization

Data set       Number of labels  Number of proteins
HSLPred        4                 3532
Hum-mPLoc      14                2750
Hum-mPLoc 2.0  14                3106
Hum-mPLoc 3.0  12                HumT: 3219; HumB: 379
mGOF-loc       37                4802
HPSLPred       10                11689
REALoc         6                 Training: 5939; Testing: 868
HumLoc-LBCI    16                4887

Table 4 Distribution of the new benchmark set

Label  Subcellular localization  Number of sequences
1      Cell membrane             779
2      Chromosome                200
3      Cytoplasm                 1987
4      Cytosol                   235
5      Endomembrane system       1349
6      Endosome membrane         144
7      Endosome                  227
8      Golgi apparatus membrane  184
9      Golgi apparatus           301
10     Mitochondrion membrane    176
11     Mitochondrion             534
12     Nucleolus                 239
13     Nucleus                   1942
14     Centrosome                235
15     Cytoskeleton              481
16     Endoplasmic reticulum     428
Total                            4887

Data set

With the development of protein subcellular localization research, subcellular localization databases, including ones for human proteins, have emerged in large numbers in addition to the commonly used protein databases such as UniProt. We summarize the commonly used databases for human protein subcellular localization and list the relevant information in Table 2.

Then, we survey the existing data sets for human protein subcellular localization. Among them, the data set proposed in HSLPred [44] is the earliest one, including 3532 proteins for four locations. Chou et al. proposed three data sets in different papers [41, 46, 47], which are used by Hum-mPLoc, Hum-mPLoc 2.0 and Hum-mPLoc 3.0, respectively. All of the above data sets are obtained from the UniProt (Swiss-Prot) database. In addition, Wei et al. [24] proposed a data set built from two databases, UniProt and LOCATE. Here, we use the names of the corresponding prediction tools to denote the data sets, and the details of all data sets are shown in Table 3. Furthermore, HumLoc-LBCI is a new prediction tool proposed in this paper, and the corresponding novel benchmark data set includes 4887 human proteins with 16 labels.
We construct the new benchmark data set by collecting all human proteins from the Swiss-Prot [1] release of August 2018. To ensure data quality, we use CD-HIT [63, 64] to remove invalid or redundant sequences. First, we exclude proteins having no subcellular locations or uncertain annotations and filter the remaining proteins with an identity cutoff of 25%. Then, we focus on 16 major compartments in human cells: cell membrane, chromosome, cytoplasm, cytosol, endomembrane system, endosome membrane, endosome, Golgi apparatus membrane, Golgi apparatus, mitochondrion membrane, mitochondrion, nucleolus, nucleus, centrosome, cytoskeleton and endoplasmic reticulum. Finally, the benchmark data set includes 4887 human proteins, of which 1966 have a single subcellular location, 1873 have two, 675 have three and the remaining 373 have more. Here, each location can be regarded as a class label, and a protein with more than one location is a multi-labeled sample. The new data set contains a large number of human proteins with a reasonable number of subcellular locations. The distribution of the new benchmark set is shown in Table 4.

By examining which data sets the respective tools reference or compare against, we find that the data set including 3106 proteins [41] is the most widely used, so it is also used as the evaluation data set in this paper to verify the performance of all selected prediction tools. We also test the validity of the new data set with some well-performing tools.

Feature representation

We divide the human protein subcellular localization prediction tools into three categories based on their feature representations. The 1st category is the sequence-based method, which uses only protein sequence information, such as AAC, sequence homology and evolutionary information. The 2nd category is the annotation-based method, which uses only annotation information.
GO is the most commonly used annotation information, and the annotation-based methods described in this paper are all based on GO terms. The last category is the combination-based method, which uses both sequence information and GO terms. To achieve a fair comparison, we test the performance of all selected prediction tools against the data set proposed in Hum-mPLoc 2.0. Results are taken from the corresponding papers or calculated by running the web servers provided by those papers. Details are listed in Table 5.

Table 5 Summary of prediction tools for human protein subcellular localization

Prediction tool  Model                                              Accuracy  Year   Reference
Sequence-based methods
HPSLPred         Ensemble multi-label classifier                    0.3150    2017   [65]
GO-based methods
mGOASVM          Multi-label SVM classifier                         0.8210    2015*  [35]
mLASSO-Hum       Multi-label LASSO classifier                       0.8330    2015   [34]
Mem-mEN          Multi-label interpretable elastic nets classifier  0.8270    2016   [66]
HumLoc-LBCI      XGBoost classifier                                 0.7778    2019   -
Combination-based methods
Hum-mPLoc        Ensemble classifier based on KNN rule              0.3810^a  2009*  [47]
Hum-mPLoc 2.0    Ensemble classifier based on OET-KNN rule          0.6270^a  2009   [41]
iLoc-Hum         Accumulation-label KNN classifier                  0.7630^a  2012   [23]
WegoLoc          Multi-label SVM classifier                         0.7946    2012   [67]
pLoc-mHum        Multi-label Gaussian kernel regression             0.8439    2017   [45]

* The year in which the result was obtained.
^a ACC results are not available for these prediction tools, so the overall ACC value is used instead.

A typical prediction tool belonging to the 1st category, HPSLPred, is an ensemble multi-label classifier for human protein subcellular location prediction that uses only protein sequence information. It makes full use of the physicochemical properties of proteins and PseAAC. The HPSLPred paper proposes a new data set that contains 11689 proteins with 10 labels. We obtain the results on the test data set through its web server and select the same four labels to calculate the accuracy. The result shows that HPSLPred does not perform well, which suggests that using only sequence information may not be enough.

Several prediction tools belong to the 2nd category. mGOASVM, mLASSO-Hum and Mem-mEN use only GO terms and were all proposed by Wan et al. [34, 35, 66]. The 1st one, mGOASVM, was proposed to address the subcellular localization of plant and virus proteins. The 2nd one, mLASSO-Hum, is used to predict the locations of human proteins and yields sparse and interpretable solutions for large-scale prediction of human protein subcellular localization. It improves the extraction method of GO terms [41]. By using one-versus-rest LASSO-based classifiers, 87 out of more than 8000 GO terms are found to play more significant roles in determining subcellular localization. Based on these 87 essential GO terms, it can decide not only where a protein resides within a cell, but also why it is located there. To further exploit information from the remaining GO terms, the 3rd one, Mem-mEN, proposes a method based on the GO hierarchical information derived from the depth distance of GO terms. It also uses sparse regressions to exploit GO information for both predicting and interpreting the subcellular localization of single-location and multi-location proteins. By using the one-versus-rest strategy, Mem-mEN identifies 429 out of more than 8000 GO terms that play essential roles in determining subcellular localization.
More interestingly, many of the GO terms selected by Mem-mEN are from the biological process and molecular function categories, suggesting that GO terms of these categories also play vital roles in the prediction. All these tools have been tested on the provided data set [41]. The results indicate that LASSO performs better as a classifier than the others.

Figure 3 Feature dimension of prediction tools for human protein subcellular localization.

The remaining prediction tools belong to the 3rd category, which uses sequence information as well as GO terms. Among them, Hum-mPLoc 2.0 [41] is an upgraded version of Hum-mPLoc [47], improved in three aspects: (i) To take advantage of the GO approach, Hum-mPLoc 2.0 proposes a homology-based GO extraction method that takes into account the homologous information of the protein. (ii) Since the current GO database is still far from complete, many proteins cannot be meaningfully formulated in GO space even if their accession numbers are available, so Hum-mPLoc 2.0 also uses other features besides the GO information. (iii) In addition to PseAAC, the functional domain and sequential evolution information are fused into the prediction tool by an ensemble classifier. As a consequence, the prediction power has been significantly enhanced.

Then, iLoc-Hum [23] further improves the feature extraction method for GO terms used in Hum-mPLoc 2.0 and proposes another strategy to determine protein locations. As a demonstration, jackknife cross-validation is performed with iLoc-Hum on a benchmark data set of human proteins [41]. For such a complicated and stringent system, the overall accuracy achieved by iLoc-Hum is remarkably higher than that of any existing prediction tool able to deal with this kind of system.
For iLoc-Hum, we report the overall accuracy in place of accuracy in this paper. WegoLoc [67] is a highly accurate and fast protein subcellular localization prediction tool based on sequence similarity and weighted GO information. A term weighting method from text categorization is applied to GO terms for an SVM classifier. WegoLoc provides wide coverage of eukaryotic species and data sets and supports multiple localizations for the query sequences. We calculate its results by running the web server. Another prediction tool, pLoc-mHum, obtains features by incorporating the crucial GO information into the general PseAAC and produces its result with a multi-label Gaussian kernel regression classifier. Rigorous cross-validations on the same stringent benchmark data set have indicated that pLoc-mHum is remarkably superior to the other prediction tools. We also carry out a statistical analysis of the feature dimensions of human protein subcellular localization prediction tools, as shown in Figure 3.

Multi-classification model

Existing prediction tools use two different strategies to solve the multi-label classification problem. One strategy integrates a set of binary classifiers into one classifier. An example is the one-versus-rest strategy: for a classification problem with $n$ labels, it treats one label as the 1st class and all remaining labels as the 2nd class, so the multi-label classification problem is transformed into binary classification problems. For an $n$-label classification problem, we need to construct $n$ base classifiers, and any binary classifier can be used as the base classifier. As described above, mLASSO-Hum, Mem-mEN and WegoLoc use this model to enable a base classifier to conduct multi-label classification: WegoLoc [67] uses the SVM as the base classifier, which is a commonly used binary classifier with high prediction accuracy; mLASSO-Hum [34] uses LASSO regression to predict human protein locations.
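The one-versus-rest decomposition can be illustrated with a minimal Python sketch. It is not the implementation of any tool discussed here: the base classifier is a toy nearest-centroid rule chosen only to keep the example dependency-free (the tools above use SVM, LASSO or elastic nets instead), and the class name, function names and two-dimensional features are hypothetical.

```python
def binary_targets(label_sets, label):
    """Relabel a multi-label data set for one label: 1 if present, 0 for the rest."""
    return [1 if label in labels else 0 for labels in label_sets]


class CentroidClassifier:
    """Toy binary base classifier: predict the class whose centroid is nearer."""

    def fit(self, X, y):
        self.centroids = {}
        for cls in (0, 1):
            rows = [x for x, target in zip(X, y) if target == cls]
            self.centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        def dist2(center):
            return sum((a - b) ** 2 for a, b in zip(x, center))
        return 1 if dist2(self.centroids[1]) < dist2(self.centroids[0]) else 0


def fit_one_vs_rest(X, label_sets, labels):
    """Train one binary classifier per label (n labels give n base classifiers)."""
    return {label: CentroidClassifier().fit(X, binary_targets(label_sets, label))
            for label in labels}


def predict_one_vs_rest(models, x):
    """A sample receives every label whose binary classifier votes positive."""
    return {label for label, model in models.items() if model.predict(x) == 1}


# Hypothetical 2-D features and multi-label annotations.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.9, 0.9], [1.0, 1.0]]
label_sets = [{"Nucleus"}, {"Nucleus"}, {"Cytoplasm"}, {"Cytoplasm"},
              {"Nucleus", "Cytoplasm"}, {"Nucleus", "Cytoplasm"}]
models = fit_one_vs_rest(X, label_sets, ["Nucleus", "Cytoplasm"])
print(sorted(predict_one_vs_rest(models, [0.95, 0.95])))
print(sorted(predict_one_vs_rest(models, [1.0, 0.05])))
```

Because each label is decided independently, a sample can receive zero, one or several labels, which is what makes the decomposition suitable for multi-location proteins.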
The other strategy is the multi-label classifier, which obtains the prediction results for all labels of a sample at once [68]. For instance, pLoc-mHum [45] uses multi-label Gaussian kernel regression to calculate a score for each label and determines whether the sample belongs to each class by applying a threshold. ML-KNN [68] is a classic multi-label learning approach derived from the traditional $k$-nearest neighbor (kNN) algorithm. In detail, for each unseen instance, its $k$ nearest neighbors in the training set are first identified. Then, based on statistical information from the label sets of these neighboring instances, i.e. the number of neighboring instances belonging to each possible class, the maximum a posteriori (MAP) principle is used to determine the label set for the unseen instance.

In addition, many prediction tools use the ensemble method. Instead of using only one classifier, they use multiple classifiers simultaneously. By voting on, averaging or accumulating the results of multiple base classifiers, they obtain the final prediction of the ensemble prediction tool. This method can combine the above two strategies and turn many weak classifiers into a strong one. Some prediction tools selected in this paper also use the ensemble strategy, such as Hum-mPLoc 2.0 [41] and HPSLPred [65].

A novel annotation-based prediction tool

GO hierarchy

HumLoc-LBCI is a new annotation-based method proposed in this paper. It improves on the GO feature extraction of mLASSO-Hum, which makes full use of the hierarchical relationships of GO terms. GO terms in the cellular component taxonomy are organized within a DAG [34], so the depth of two GO terms can be used to measure their similarity.
Because the essential GO terms are structurally related, they are not independent of each other, so we take the hierarchical relationships of GO terms into consideration. First, as in mLASSO-Hum, we perform feature selection on the |$U$|-dimensional feature vector. After selection by Random Forest (RF) [69], the original |$U$|-dimensional feature vector becomes a |$V$|-dimensional feature vector that keeps the GO terms most useful for prediction; the remaining GO terms are removed. Then, we improve the method of calculating the feature vector for proteins. For a protein |$i$|, we build a |$1 \times V$| feature vector |$\left \{ q_{i,1}, q_{i,2}, \cdots , q_{i,V} \right \}$| to represent the hierarchical relationship between GO terms in this protein. Moreover, for a data set with |$N$| samples, we construct an |$N \times V$| feature matrix |$F_{hie}$| to represent the GO hierarchy, as follows:
$$\begin{equation} F_{hie} = \left( \begin{array}{@{}cccccc@{}} q_{1,1} & q_{1,2} & \cdots & q_{1,j} & \cdots & q_{1,V} \\ q_{2,1} & q_{2,2} & \cdots & q_{2,j} & \cdots & q_{2,V} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ q_{i,1} & q_{i,2} & \cdots & q_{i,j} & \cdots & q_{i,V} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ q_{N,1} & q_{N,2} & \cdots & q_{N,j} & \cdots & q_{N,V} \end{array} \right) \end{equation} $$ (1)
where |$q_{i,j}$| is the hierarchical information of the |$j$|-th essential GO term in the |$i$|-th protein. The hierarchical information based (HIB) feature vector is constructed as follows:
$$\begin{equation} q_{i,j} = \mathop{\mathrm{avg}}_{k \in K_i} \frac{p_{i,k}}{2^{d_{j,k}}} \end{equation} $$ (2)
where |$j$| indexes the essential GO terms, |$K_i$| is the set of distinct GO terms associated with protein |$i$|, |$p_{i,k}$| denotes the number of occurrences of the |$k$|-th GO term in the set |$K_i$|, and |$d_{j,k}$| is calculated in the same way as in mLASSO-Hum [34].
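As a worked illustration of Eq. 2, the following Python sketch computes the HIB vector of a single protein. It is only a sketch under stated assumptions: the distances |$d_{j,k}$| are taken from a precomputed lookup table (the paper defers their calculation to [34]), the GO identifiers `GO:A`, `GO:B` and `GO:C` are placeholders, and returning an all-zero vector for a protein without GO terms is our own convention.

```python
from typing import Dict, List, Tuple


def hib_feature_vector(term_counts: Dict[str, int],
                       essential_terms: List[str],
                       distance: Dict[Tuple[str, str], int]) -> List[float]:
    """Compute q_{i,1..V} of Eq. 2 for one protein.

    term_counts     -- K_i with p_{i,k}: GO term -> number of occurrences
    essential_terms -- the V essential GO terms kept by feature selection
    distance        -- hypothetical lookup of d_{j,k}, keyed (essential, term)
    """
    if not term_counts:
        # Assumption: a protein without GO terms gets an all-zero vector.
        return [0.0] * len(essential_terms)
    features = []
    for j in essential_terms:
        # One similarity value p_{i,k} / 2^{d_{j,k}} per GO term of the protein.
        contributions = [p / 2 ** distance[(j, k)]
                         for k, p in term_counts.items()]
        # Average pooling over all GO terms in K_i.
        features.append(sum(contributions) / len(contributions))
    return features


# Placeholder protein: GO:A occurs twice, GO:B once.
term_counts = {"GO:A": 2, "GO:B": 1}
essential_terms = ["GO:A", "GO:C"]
distance = {("GO:A", "GO:A"): 0, ("GO:A", "GO:B"): 1,
            ("GO:C", "GO:A"): 2, ("GO:C", "GO:B"): 1}
q = hib_feature_vector(term_counts, essential_terms, distance)
```

For the essential term `GO:A` the two contributions are 2/2^0 = 2 and 1/2^1 = 0.5, whose average is 1.25; for `GO:C` they are 2/2^2 = 0.5 and 1/2^1 = 0.5, whose average is 0.5.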
For a query protein, each term |$\frac {p_{i,k}}{2^{d_{j,k}}}$| measures the similarity between two GO terms: the higher the value, the more relevant the GO term is to the protein. The relationship between a protein and an essential GO term is then measured by the similarity between all GO terms in |$K_i$| and that essential GO term. Here, we aggregate these similarities by average pooling, to reduce the cases where the similarities between GO terms differ but the calculated relationship between the protein and the essential GO term is the same.
One-vs-rest classification
The classification system has |$16$| class labels corresponding to |$16$| subcellular locations of human proteins. We use the ensemble classifier XGBoost [70] as the base classifier, and adopt the one-vs-rest strategy to construct |$16$| binary classifiers. The output for each test sample is a |$16$|-dimensional label vector, in which each dimension represents the confidence of being in a certain subcellular location. A drawback of this method is that the training set of each binary classifier is imbalanced under the one-vs-rest strategy. To reduce the influence of biased samples, we extract part of the negative samples from the whole negative set as the training negative set. In particular, we use the Self-Representation Subspace Clustering (SRSC) method [71] to balance the positive and negative data sets by sampling subsets from the whole negative set. The process of constructing the training set is shown in Figure 4. For the negative samples of each label, we first use SRSC to cluster them into |$m$| classes, and then select the |$s$| samples closest to the center of mass of each class to form the new negative sample set. This removes points that are far from the center of mass, which may be erroneous points for the class. Finally, we obtain a relatively balanced data set for each label and train the one-vs-rest XGBoost classifiers on these sets.
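The negative-sample selection step can be sketched as follows. The sketch assumes that cluster assignments are already available (from SRSC or any other clustering method) and that closeness to the center of mass is measured with the Euclidean distance; the paper does not state the distance measure, so that choice, like the function name and the toy points, is an assumption.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def select_central_negatives(samples: Sequence[Sequence[float]],
                             cluster_ids: Sequence[int],
                             s: int) -> List[int]:
    """Return indices of the s samples nearest to each cluster's centroid."""
    members: Dict[int, List[int]] = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    selected: List[int] = []
    for cid in sorted(members):
        idxs = members[cid]
        dim = len(samples[idxs[0]])
        # Center of mass = coordinate-wise mean of the cluster members.
        centroid = [sum(samples[i][d] for i in idxs) / len(idxs)
                    for d in range(dim)]
        # Euclidean distance to the centroid (an assumption, see lead-in).
        nearest = sorted(idxs, key=lambda i: math.dist(samples[i], centroid))
        selected.extend(nearest[:s])
    return selected


# Toy negative set: two clusters of three 2-D points each.
samples = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0],
           [10.0, 10.0], [10.0, 12.0], [10.0, 20.0]]
cluster_ids = [0, 0, 0, 1, 1, 1]
kept = select_central_negatives(samples, cluster_ids, s=2)
```

In the toy data, the points at indices 2 and 5 lie farthest from their centroids and are dropped, which mirrors the removal of possibly erroneous points described above.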
Figure 4. The flow chart of constructing the training set.
Suppose |$\mathbf {X} = \{ \mathbf {x}_{1},\mathbf {x}_{2},...,\mathbf {x}_{n} \} \in \mathfrak {R}^{d \times n}$| is the feature matrix, where |$\mathbf {x}_{i}$| denotes a sample vector and |$d$| denotes the dimensionality of the feature vector. We use the data itself as a dictionary and treat the problem as finding an optimal subspace representation under a constraint. Once the representation is solved, the similarity matrix is constructed from it, and the final clustering result is obtained by applying spectral clustering to the similarity matrix.
Figure 5. The performance of various prediction tools on different data sets.
The self-representation model is defined as follows:
$$\begin{equation} \mathbf{X} = \mathbf{X}\mathbf{Z} + \mathbf{E} \end{equation} $$ (3)
where |$\mathbf {Z} = \{ \mathbf {z}_{1},\mathbf {z}_{2},...,\mathbf {z}_{n} \} \in \mathfrak {R}^{n \times n}$| denotes the coefficient matrix, |$\mathbf {z}_{i}$| is the new representation of sample |$\mathbf {x}_{i}$| in terms of the other samples, and |$\mathbf {E}$| denotes the error matrix. The similarity matrix |$\mathbf {S} \in \mathfrak {R}^{n \times n}$| is constructed as follows:
$$\begin{equation} \mathbf{S} = |\mathbf{Z}| + |\mathbf{Z}^{T}| \end{equation} $$ (4)
where |$|\cdot |$| denotes the absolute value operator, and |$\mathbf {S}$| is the input of spectral clustering.
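Eq. 4 can be checked on a small example. The sketch below assumes that the absolute value is applied element-wise and that a coefficient matrix is already available; the matrix `Z` used here is made up for illustration and is not the solution of any real problem.

```python
from typing import List

Matrix = List[List[float]]


def similarity_from_coefficients(Z: Matrix) -> Matrix:
    """Return S = |Z| + |Z^T| with element-wise absolute values (Eq. 4)."""
    n = len(Z)
    if any(len(row) != n for row in Z):
        raise ValueError("Z must be a square n x n matrix")
    return [[abs(Z[i][j]) + abs(Z[j][i]) for j in range(n)]
            for i in range(n)]


# Made-up coefficient matrix: it is neither symmetric nor non-negative.
Z = [[0.0, 0.8, -0.1],
     [0.5, 0.0, 0.0],
     [0.2, -0.3, 0.0]]
S = similarity_from_coefficients(Z)
```

Whatever the signs and asymmetry of the coefficients, the resulting matrix is symmetric and non-negative, which is the form of affinity matrix that spectral clustering expects.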
SRSC is formulated as the minimization of the following objective:
$$\begin{equation} J(\mathbf{Z}) = \Vert \mathbf{X}\mathbf{Z} - \mathbf{X}\Vert_{F}^{2} + \alpha \, tr(\mathbf{Z}\mathbf{L}\mathbf{Z}^{T}) \end{equation} $$ (5)
where |$\alpha $| is the trade-off factor, and |$tr(\mathbf {Z}\mathbf {L}\mathbf {Z}^{T})$| is the smoothness regularization term defined as follows:
$$\begin{equation} tr(\mathbf{Z}\mathbf{L}\mathbf{Z}^{T}) = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} \Vert \mathbf{z}_{i} - \mathbf{z}_{j} \Vert_{2}^{2} \end{equation} $$ (6)
where |$tr$| denotes the trace of a matrix, |$\mathbf {W}=\{w_{ij}\}$| denotes the weight matrix measuring the spatial closeness of samples, |$\mathbf {L} = \mathbf {D} - \mathbf {W}$| denotes the Laplacian matrix, and |$\mathbf {D} \in \mathfrak {R}^{n \times n}$| is a diagonal matrix with |$\mathbf {D}_{ii} = \sum _{j=1}^{n}w_{ij}$|. In this paper, we employ the Radial Basis Function (RBF) to build |$\mathbf {W} \in \mathfrak {R}^{n \times n}$|. The solution is obtained by setting the derivative of |$J$| with respect to |$\mathbf {Z}$| to zero:
$$\begin{align} \partial J(\mathbf{Z})/\partial \mathbf{Z} &= 0 \end{align} $$ (7a)
$$\begin{align} 2\mathbf{X}^{T}(\mathbf{X}\mathbf{Z} - \mathbf{X}) + 2 \alpha \mathbf{Z}\mathbf{L} &= 0 \end{align} $$ (7b)
$$\begin{align} \mathbf{X}^{T}\mathbf{X}\mathbf{Z} + \alpha \mathbf{Z}\mathbf{L} &= \mathbf{X}^{T}\mathbf{X} \end{align} $$ (7c)
where Eq. 7c is a Sylvester equation, a form widely used in control theory. Finally, we carry out spectral clustering on the similarity matrix |$\mathbf {S}$| to construct a relatively balanced data set for each label and train the one-vs-rest XGBoost classifiers on these sets.
Results
There are two commonly used evaluation criteria in protein subcellular localization research. Therefore, we first illustrate the two evaluation criteria by summarizing the performance of various prediction tools on different data sets; the results on three data sets are shown in Figure 5.
First, Hum(|$37$|), from mGOF-loc [24], is the data set proposed by Wei et al., which contains |$37$| different locations. Prediction tools are tested on this data set with average precision as the evaluation criterion; here LBCI is a sequence-based method [22] that is different from the HumLoc-LBCI proposed in this paper. Second, Hum(|$12$|), from Hum-mPLoc |$3.0$| [46], is the data set proposed with Hum-mPLoc |$3.0$|, which is divided into a training set and a testing set. Third, Hum(|$14$|), from Hum-mPLoc |$2.0$| [41], is the validation data set used in this paper. Accuracy is used as the evaluation criterion for the latter two data sets.
Evaluation criteria
Five measures [72], namely Accuracy (Acc), Precision (Pre), Recall, F1 score and Hamming Loss (HL), are used to evaluate the different methods. They are defined as follows:
$$\begin{equation} \begin{split} Acc &= \frac{1}{N}\sum_{i=1}^N \frac{\left|Y_i\cap \bar{Y_i}\right|}{\left|Y_i\cup \bar{Y_i}\right|} \\ Pre &= \frac{1}{N}\sum_{i=1}^N \frac{\left|Y_i\cap \bar{Y_i}\right|}{\left|\bar{Y_i}\right|} \\ Recall &= \frac{1}{N}\sum_{i=1}^N \frac{\left|Y_i\cap \bar{Y_i}\right|}{\left|Y_i\right|} \\ F1 &= \frac{2\times Pre\times Recall}{Pre+Recall} \\ HL &= \frac{1}{N}\sum_{i=1}^N \left|\bar{Y_i}\Delta Y_i\right| \end{split} \end{equation} $$ (8)
where |$N$| is the number of samples, |$Y_i$| represents the true labels of sample |$i$|, |$\bar {Y_i}$| denotes the predicted labels of the |$i$|-th protein, and |$\Delta $| stands for the symmetric difference between two sets. For Hamming loss, the smaller the value, the better the method's performance, with a best value of |$0$|. For the other measures, the larger the value, the better the method's performance, with a best value of |$1$|.
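The five measures of Eq. 8 translate directly into set operations. The following Python sketch follows the formulas as printed, with three conventions that are our own assumptions rather than part of the paper: a sample with empty true and predicted sets counts as fully accurate, an empty denominator in precision or recall contributes 0, and an optional `n_labels` argument divides the Hamming loss by the number of labels, as the commonly used normalized form does.

```python
from typing import Dict, Optional, Sequence, Set


def multilabel_metrics(true_sets: Sequence[Set[str]],
                       pred_sets: Sequence[Set[str]],
                       n_labels: Optional[int] = None) -> Dict[str, float]:
    """Sample-averaged Acc, Pre, Recall, F1 and HL as written in Eq. 8."""
    n = len(true_sets)
    if n == 0 or n != len(pred_sets):
        raise ValueError("need two non-empty sequences of equal length")
    acc = pre = rec = hl = 0.0
    for y, y_hat in zip(true_sets, pred_sets):
        inter = len(y & y_hat)
        union = len(y | y_hat)
        # Assumption: two empty sets are treated as a perfect match.
        acc += inter / union if union else 1.0
        # Assumption: an empty denominator contributes 0.
        pre += inter / len(y_hat) if y_hat else 0.0
        rec += inter / len(y) if y else 0.0
        diff = len(y ^ y_hat)  # size of the symmetric difference
        # Optional normalization by the label count (not part of Eq. 8).
        hl += diff / n_labels if n_labels else diff
    acc, pre, rec, hl = acc / n, pre / n, rec / n, hl / n
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return {"Acc": acc, "Pre": pre, "Recall": rec, "F1": f1, "HL": hl}


# Toy example with two hypothetical proteins.
y_true = [{"Nucleus"}, {"Nucleus", "Cytoplasm"}]
y_pred = [{"Nucleus"}, {"Cytoplasm"}]
scores = multilabel_metrics(y_true, y_pred)
scores_norm = multilabel_metrics(y_true, y_pred, n_labels=16)
```

In the toy example the second protein has one of its two labels predicted, so Acc and Recall average to 0.75 while Pre stays at 1.0; the single missed label gives HL = 0.5 as printed in Eq. 8, or 0.03125 when normalized over 16 labels.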
Table 6 Performance of prediction tools for human protein subcellular localization on the new data set
Prediction tool   Accuracy  Year  Reference
WegoLoc           0.3633    2012  [67]
mLASSO-Hum        0.6114    2015  [34]
pLoc-mHum         0.2831    2017  [45]
Hum-mPLoc 3.0     0.5606    2017  [46]
HumLoc-LBCI       0.7458    2019  -
Table 7 Comparison of different human protein subcellular location prediction tools on the new data set
Location                  WegoLoc                 mLASSO-Hum              Hum-mPLoc 3.0           HumLoc-LBCI
                          pre     rec     F1      pre     rec     F1      pre     rec     F1      pre     rec     F1
Cell membrane             -       -       -       -       -       -       -       -       -       0.7636  0.7778  0.7706
Chromosome                -       -       -       -       -       -       -       -       -       0.6000  0.4286  0.5000
Cytoplasm                 0.4871  0.9113  0.6348  0.8925  0.6694  0.7650  0.7016  0.7016  0.7016  0.7484  0.9597  0.8410
Cytosol                   -       -       -       -       -       -       -       -       -       0.4000  0.1667  0.2353
Endomembrane system       -       -       -       -       -       -       -       -       -       0.8191  0.8652  0.8415
Endosome membrane         -       -       -       -       -       -       -       -       -       0.7500  0.8571  0.8000
Endosome                  0.5000  0.2727  0.3529  0.6667  0.1818  0.2857  1.0000  0.3636  0.5333  1.0000  0.8182  0.9000
Golgi apparatus membrane  -       -       -       -       -       -       -       -       -       0.6250  0.5556  0.5882
Golgi apparatus           0.3171  0.6500  0.4262  0.6250  0.5000  0.5556  0.5833  0.3500  0.4375  0.7059  0.6000  0.6486
Mitochondrion membrane    -       -       -       -       -       -       -       -       -       0.6875  0.9167  0.7857
Mitochondrion             0.7500  0.6774  0.7119  0.8387  0.8387  0.8387  0.8929  0.8065  0.8475  0.8750  0.9032  0.8889
Nucleolus                 -       -       -       -       -       -       -       -       -       0.6316  0.6316  0.6316
Nucleus                   0.6437  0.9106  0.7542  0.8684  0.8049  0.8354  0.7402  0.7642  0.7520  0.7887  0.9350  0.8550
Centrosome                0.5000  0.4667  0.4828  0.8750  0.4667  0.6087  0.6667  0.1333  0.2222  0.7143  0.6667  0.6897
Cytoskeleton              0.1262  0.8966  0.2213  1.0000  0.1379  0.2424  0.8824  0.5172  0.6522  0.7037  0.6552  0.6786
Endoplasmic reticulum     -       -       -       0.8947  0.6800  0.7727  0.7647  0.5200  0.6190  0.9048  0.7600  0.8261
ACC                       0.3633                  0.6114                  0.5606                  0.7458
Precision                 0.3858                  0.7069                  0.6611                  0.8166
Recall                    0.6917                  0.6275                  0.6056                  0.8780
F1                        0.4667                  0.6451                  0.6076                  0.8172
Hamming Loss              0.1105                  0.0405                  0.0514                  0.0595
Table 8 Comparison of locative recall of mLASSO-Hum and HumLoc-LBCI on two data sets
Location         New data set              Chou’s data set
                 mLASSO-Hum  HumLoc-LBCI   mLASSO-Hum  HumLoc-LBCI
Endosome         0.1818      0.8182        0.125       0.7686
Nucleus          0.8049      0.9350        0.910       0.9207
Centrosome       0.4667      0.6667        0.727       0.6364
Cytoplasm        0.6694      0.9597        0.861       0.8494
Cytoskeleton     0.1379      0.6552        0.392       0.3165
Golgi apparatus  0.5000      0.6000        0.826       0.6522
Mitochondrion    0.8387      0.9032        0.942       0.8874
Comparison of selected prediction tools on Chou’s data set
In this section, we analyze the prediction accuracy of the selected prediction tools from the last ten years, as shown in Table 5. The best prediction tool, pLoc-mHum, which combines sequence information and GO terms, achieves outstanding performance. HPSLPred, which uses only sequence information, obtains the worst result, indicating either that it does not exploit sequence information thoroughly or that protein sequences alone contain insufficient information. Hum-mPLoc and Hum-mPLoc |$2.0$| only use the number of occurrences of GO terms, and their results are not good enough. In contrast, mGOASVM, mLASSO-Hum and Mem-mEN all perform well, exceeding the performance of the prediction tools that appeared before them. They make full use of the relationships between GO terms and improve the GO feature extraction method. Overall, mLASSO-Hum and pLoc-mHum achieve the best results among all prediction tools selected in this paper.
Comparison of selected prediction tools on the new data set
We also test some well-performing prediction tools on the new data set, as shown in Table 6.
The results show that our proposed method has an advantage in human protein subcellular location prediction on the new data set. However, our proposed method is trained and tested on the new data set, whereas the results of the other existing prediction tools are produced by their previously trained models. Because those models were trained on older data sets, their predictions on the new data set may contain errors. Therefore, it is understandable that our method obtains better results on the new data set than the other prediction tools.
Figure 6. Comparison of different feature selection methods on the new data set.
Table 9 Comparison of different classification methods on the new data set
Classifier  Model                                               Accuracy
RF          binary classifier with one-versus-rest strategy     0.6901
SVM-l       SVM classifier on linear kernel                     0.6919
SVM-r       SVM classifier on RBF kernel                        0.6959
ML-KNN      multi-label classifier                              0.6849
DNN         ensemble classifier with one-versus-rest strategy   0.6550
GBDT        ensemble classifier with one-versus-rest strategy   0.7225
XGBoost     ensemble classifier with one-versus-rest strategy   0.7458
Among the selected methods, Hum-mPLoc |$3.0$| [46] is an upgraded version of Hum-mPLoc |$2.0$| [41]. It is a new amino acid sequence-based human protein subcellular location prediction approach, which covers |$12$| human subcellular localizations. The sequences are represented by multi-view complementary features, i.e. context vocabulary annotation-based Gene Ontology (GO) terms, peptide-based functional domains, and residue-based statistical features. It proposes a novel feature representation protocol denoted as Hidden Correlation Modeling (HCM), which creates more compact and discriminative feature vectors by modeling the hidden correlations between annotation terms. A detailed comparison of the different human protein subcellular location prediction tools on the new data set is shown in Table 7.
We compare the per-class recall of mLASSO-Hum and HumLoc-LBCI on two data sets. We find that some classes that mLASSO-Hum predicts well are not included in the new data set, while HumLoc-LBCI performs better on the other classes. We list several classes common to the two data sets and compare the performance of mLASSO-Hum and HumLoc-LBCI in Table 8. HumLoc-LBCI achieves better performance than mLASSO-Hum on some classes, such as Endosome and Nucleus.
However, the model used in mLASSO-Hum was trained on the old data set, which leads to opposite rankings of mLASSO-Hum and HumLoc-LBCI across the two data sets for some classes, such as Centrosome, Cytoplasm, Cytoskeleton, Golgi apparatus and Mitochondrion. Therefore, the prediction results of these methods on the two data sets are quite different.
Comparison of different feature selection methods
To build the essential GO set, we compare the performance of three feature selection methods on the new data set: Random Forest (RF), LASSO and Ridge Regression. As seen in Figure 6, RF achieves the best results among the three feature selection methods. Finally, a |$484$|-dimensional essential GO term vector is selected by RF to construct the GO hierarchy.
Figure 7. The performance of GO terms without (|$F$|) or with (|$F^+$|) homologous proteins for all taxonomy (|$F_{All}$|) or cellular component type (|$F_{CC}$|) on the new data set.
Figure 8. The performance of GO frequency (|$F_{fre}$|) and GO hierarchy (|$F_{hie}$|) on the new data set.
Comparison of different classification methods
We compare various classifiers with different strategies for solving the multi-label classification problem. As seen in Table 9, the ensemble classifier XGBoost [70] achieves the best result (ACC: |$0.7458$|) among the seven classification methods. Therefore, we apply XGBoost as the base classifier and adopt the one-versus-rest strategy to construct |$16$| binary classifiers.
All parameters are optimized via |$5$|-fold cross-validation on the new data set.
Analysis of homologous GO terms on GO taxonomy
The GO project provides a hierarchical controlled vocabulary split into three categories, and GO terms of the cellular component type have a significant effect on protein subcellular localization. As seen in Figure 7, the performance of GO terms of the cellular component type (ACC: |$0.6648$|, HL: |$0.0636$|) is better than that of GO terms from all taxonomy (ACC: |$0.6502$|, HL: |$0.0705$|) on the new data set. This shows that GO terms of the cellular component type are more closely related to protein subcellular localization, and that GO terms from all taxonomy contain more redundancy, which results in worse prediction performance.
Figure 9. Comparison of different GO semantic similarity models on the new data set.
We retrieve GO terms and construct GO vectors to analyze protein GO information. The GO terms of homologous proteins can improve the accuracy of protein subcellular localization prediction. We set the E-value to |$0.001$| when selecting homologous proteins in our experiments. As seen in Figure 7, the performance of GO terms in the cellular component taxonomy with homologous proteins (ACC: |$0.6648$|, HL: |$0.0636$|) is better than that without homologous proteins (ACC: |$0.6540$|, HL: |$0.0643$|). However, the performance of GO terms in all taxonomy declines when homologous protein GO terms are added. This illustrates that homologous protein GO terms may add redundancy to the original GO terms, and that multiple validation is required when using homologous information.
Analysis of GO frequency and GO hierarchy
The number of distinct GO terms can be used to construct GO frequency, and the structural relationships among the essential GO terms can be used to build GO hierarchy.
As seen in Figure 8, the performance of GO hierarchy is better than that of GO frequency. The results indicate that the GO hierarchical relationship carries more information about proteins and describes them better.
Comparison of different GO semantic similarity models
GO semantic similarity is a commonly used method to extract GO features for proteins. In this paper, we also test several common methods on the new data set. We compare the proposed model based on hierarchical relationships of GO terms with four GO semantic similarity models: Resnik’s measure [73], Lin’s measure [74], Jiang’s measure [75] and relevance similarity (RS) [76]. As seen in Figure 9, our model on GO hierarchy (ACC: |$0.6919$|, HL: |$0.0636$|) has a clear advantage for subcellular location prediction on the new data set over Resnik (ACC: |$0.5877$|, HL: |$0.0869$|), Lin (ACC: |$0.5705$|, HL: |$0.0893$|), Jiang (ACC: |$0.5300$|, HL: |$0.1164$|) and RS (ACC: |$0.5724$|, HL: |$0.0869$|), exceeding the best of them by about |$0.10$| in accuracy. Simply applying the existing GO semantic similarity-based methods therefore performs poorly on the new data set, and our proposed model based on hierarchical relationships of GO terms is an efficient way to represent GO information.
Web server
A web server is built for the method proposed in this paper; the URL is http://www.lbci.cn/syn/. It supports two input modes: a single sequence entered online, or an uploaded file containing multiple sequences. The sequence format must be |$.fasta$|.
It returns the probability of each label for human protein subcellular localization, and also gives the suggested labels as the final prediction result.
Conclusion
Computational methods have indeed accelerated prediction, and the use of GO terms greatly improves prediction accuracy. However, as the number of proteins increases, the number of proteins with unknown locations increases as well. The lack of GO terms is a major problem for GO-based methods, so extracting as many valid features as possible from limited GO information is a big challenge. In addition to using the GO terms of homologous proteins as a substitute, we also need to find other ways to use GO terms, such as the GO semantic similarity discussed above. Besides GO information, there are many other features, such as sequence information and domain functions, which contain a lot of protein information that needs to be fully used. Much work remains to be done on feature extraction, and an effective classifier is always necessary. More research is still needed to explore the applications of protein subcellular localization.
In this paper, we focus on human protein subcellular localization prediction, summarize the commonly used human protein subcellular localization data sets, and list some web-based prediction tools that use sequence information or GO terms as features. Based on our comparison, we currently recommend mLASSO-Hum and pLoc-mHum for predicting human protein subcellular localization. Finally, we introduce a newly built benchmark data set and a novel web-based prediction tool built on previous methods.
Key Points
We summarize the progress of research on the subcellular localization of multi-label human proteins in recent years, including commonly used data sets proposed by predecessors and the performance of all selected prediction tools against the same benchmark data set.
We carry out a systematic evaluation of several publicly available subcellular localization prediction methods on various benchmark data sets. Among them, we find that mLASSO-Hum and pLoc-mHum provide a statistically significant improvement in performance.
We propose a new data set to verify some well-performing prediction tools and construct a new GO-based prediction method, which outperforms the existing tools on this data set.
We compare the proposed model based on hierarchical relationships of GO terms with four GO semantic similarity models. The proposed GO hierarchy model has a clear advantage for subcellular location prediction on the new data set.
We establish a user-friendly web server implementing our proposed approach, which is a useful tool for the research community.
Funding
National Science Foundation of China (NSFC 61772362, 61771331, 61972280, 61902271).
Yinan Shen is currently a master's degree candidate at Tianjin University. Her research interests include protein subcellular localization and machine learning.
Yijie Ding is an Assistant Professor at Suzhou University of Science and Technology. His research interests include bioinformatics and machine learning.
Jijun Tang is a Professor at the University of South Carolina. His main research interests include computational biology and algorithms.
Quan Zou is a Professor at the University of Electronic Science and Technology of China. His main research interests include bioinformatics, machine learning and parallel computing.
Fei Guo is an Associate Professor at Tianjin University. Her research interests include bioinformatics and computational biology.
References
1. Apweiler R. Functional information in Swiss-Prot: the basis for large-scale characterisation of protein sequences. Brief Bioinform 2001;2(1):9–18.
WorldCat 2 Eisenhaber F , Bork P . Wanted: subcellular localization of proteins based on sequence . Trends Cell Biol 1998 ; 8 ( 4 ): 169 – 70 . WorldCat 3 Chou KC , Cai YD . Prediction of protein subcellular locations by GO-Fund-PseAA predictor . Biochem Biophys Res Commun 2004 ; 320 ( 4 ): 1236 – 9 . WorldCat 4 Chou KC , Cai YD . Using GO-PseAA predictor to predict enzyme sub-class . Biochem Biophys Res Commun 2004 ; 325 ( 2 ): 506 – 9 . WorldCat 5 Chou KC . Impacts of bioinformatics to medicinal chemistry . Med Chem 2015 ; 11 ( 3 ): 218 – 34 . WorldCat 6 Chou KC , Shen HB . A new method for predicting the subcellular localization of eukaryotic proteins with both single and multiple sites: Euk-mPLoc 2.0 . PLoS One 2010 ; 5 ( 4 ): e9931 . WorldCat 7 Chou KC , Wu ZC , Xiao X . iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins . PLoS One 2011 ; 6 ( 3 ): e18258 . WorldCat 8 Cheng X , Xiao X , Chou KC . pLoc-mEuk: predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC . Genomics 2018 ; 110 ( 1 ): 50 – 8 . WorldCat 9 Shen HB , Chou KC . Virus-mPLoc: a fusion classifier for viral protein subcellular location prediction by incorporating multiple sites . J Biomol Struct Dyn 2010 ; 28 ( 2 ): 175 – 86 . WorldCat 10 Xiao X , Wu ZC , Chou KC . iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites . J Theor Biol 2011 ; 284 ( 1 ): 42 – 51 . WorldCat 11 Cheng X , Xiao X . pLoc-mVirus: predict subcellular localization of multi-location virus proteins via incorporating the optimal GO information into general PseAAC . Gene 2017 ; 628 : 315 – 21 . WorldCat 12 Cheng X , Zhao SG , Lin WZ , et al. pLoc-mAnimal: predict subcellular localization of animal proteins with both single and multiple sites . Bioinformatics 2017 ; 33 ( 22 ): 3524 – 31 . 
WorldCat 13 Lin WZ , Fang JA , Xiao X , et al. iLoc-Animal: a multi-label learning classifier for predicting subcellular localization of animal proteins . Mol Biosyst 2013 ; 9 ( 4 ): 634 – 44 . WorldCat 14 Chou KC , Shen HB . Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization . PLoS One 2010 ; 5 ( 6 ): e11335 . WorldCat 15 Wu ZC , Xiao X , Chou KC . iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites . Mol Biosyst 2011 ; 7 ( 12 ): 3287 – 97 . WorldCat 16 Cheng X , Xiao X , Chou KC . pLoc-mPlant: predict subcellular localization of multi-location plant proteins via incorporating the optimal GO information into general PseAAC . Mol Biosyst 2017 ; 13 : 1722 – 7 . WorldCat 17 Cheng X , Xiao X , Chou KC . pLoc-mGneg: predict subcellular localization of Gram-negative bacterial proteins by deep Gene Ontology learning via general PseAAC . Genomics 2018 ; 110 ( 4 ): 231 – 9 . WorldCat 18 Shen HB , Chou KC . Gneg-mPLoc: a top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins . J Theor Biol 2010 ; 264 ( 2 ): 326 – 33 . WorldCat 19 Shen HB , Chou KC . Gpos-mPLoc: a top-down approach to improve the quality of predicting subcellular localization of Gram-positive bacterial proteins . Protein Pept Lett 2009 ; 16 ( 12 ): 1478 – 84 . WorldCat 20 Xiao X , Cheng X , Su SC , et al. pLoc-mGpos: incorporate key Gene Ontology information into general PseAAC for predicting subcellular localization of Gram-positive bacterial proteins . Nat Sci 2017 ; 9 ( 9 ): 331 – 49 . WorldCat 21 Wu ZC , Xiao X , Chou KC . iLoc-Gpos: a multi-layer classifier for predicting the subcellular localization of singleplex and multiplex Gram-positive bacterial proteins . Protein Pept Lett 2012 ; 19 ( 1 ): 4 – 14 . WorldCat 22 Shen Y , Tang J , Guo F . 
Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou’s general PseAAC . J Theor Biol 2019 ; 462 : 230 – 9 . WorldCat 23 Chou KC , Wu ZC , Xiao X . iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites . Mol Biosyst 2012 ; 8 ( 2 ): 629 – 41 . WorldCat 24 Wei L , Liao M , Gao X , et al. mGOF-loc: a novel ensemble learning method for human protein subcellular localization prediction . Neurocomputing 2016 ; 217 : 73 – 82 . WorldCat 25 Rajesh N , Burkhard R . Sequence conserved for subcellular localization . Protein Sci 2002 ; 11 : 2836 – 47 . WorldCat 26 Wan S , Mak MW , Kung SY . GOASVM: a subcellular location predictor by incorporating term-frequency Gene Ontology into the general form of Chou’s pseudo-amino acid composition . J Theor Biol 2013 ; 323 : 40 – 8 . WorldCat 27 Cedano J , Aloy P , Pérez-Pons JA , et al. Relation between amino acid composition and cellular location of proteins . J Mol Biol 1997 ; 266 ( 3 ): 594 – 600 . WorldCat 28 Park KJ , Kanehisa M . Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs . Bioinformatics 2003 ; 19 ( 13 ): 1656 – 63 . WorldCat 29 Chou KC , Shen HB . Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization . Biochem Biophys Res Commun 2006 ; 347 ( 1 ): 150 – 7 . WorldCat 30 Shen HB , Chou KC . PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition . Anal Biochem 2008 ; 373 ( 2 ): 386 – 8 . WorldCat 31 Chou KC , Shen HB . MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM . Biochem Biophys Res Commun 2007 ; 360 ( 2 ): 339 – 45 . WorldCat 32 Uddin MR , Sharma A , Farid DM , et al. 
EvoStruct-Sub: an accurate Gram-positive protein subcellular localization predictor using evolutionary and structural features . J Theor Biol 2018 ; 443 : 138 – 46 . WorldCat 33 Wei L , Ding Y , Su R , et al. Prediction of human protein subcellular localization using deep learning . J Parallel Distrib Comput 2017 ; 117 : 212 – 7 . WorldCat 34 Wan S , Mak MW , Kung SY . mLASSO-Hum: a lasso-based interpretable human-protein subcellular localization predictor . J Theor Biol 2015 ; 382 : 223 – 34 . WorldCat 35 Wan S , Mak MW , Kung SY . mGOASVM: multi-label protein subcellular localization based on Gene Ontology and support vector machines . BMC Bioinformatics 2012 ; 13 ( 1 ): 290 – 0 . WorldCat 36 Wan S , Mak MW , Kung SY . R3P-Loc: a compact multi-label predictor using ridge regression and random projection for protein subcellular localization . J Theor Biol 2014 ; 360 ( 25 ): 34 – 45 . WorldCat 37 Wan S , Mak MW , Kung SY . mPLR-Loc: an adaptive decision multi-label classifier based on penalized logistic regression for protein subcellular localization prediction . Anal Biochem 2015 ; 473 : 14 – 27 . WorldCat 38 Camon E , Magrane M , Barrell D , et al. The Gene Ontology Annotation (GOA) project: implementation of GO in Swiss-Prot, TrEMBL, and InterPro . Genome Res 2003 ; 13 ( 4 ): 662 – 72 . WorldCat 39 Li M , Li W , Wu FX , et al. Identifying essential proteins based on sub-network partition and prioritization by integrating subcellular localization information . J Theor Biol 2018 ; 447 : 65 – 73 . WorldCat 40 Wan S , Mak MW , Kung SY . HybridGO-Loc: mining hybrid features on Gene Ontology for predicting subcellular localization of multi-location proteins . PLoS One 2014 ; 9 ( 3 ): e89545 . WorldCat 41 Shen HB , Chou KC . A top-down approach to enhance the power of predicting human protein subcellular localization: Hum-mPLoc 2.0 . Anal Biochem 2009 ; 394 ( 2 ): 269 – 74 . WorldCat 42 Paul H , Keun-Joon P , Takeshi O , et al. 
Wolf psort: protein localization predictor . Nucleic Acids Res 2007 ; 35 : W585 – 7 . WorldCat 43 Chou KC , Shen HB . Cell-PLoc: a package of web servers for predicting subcellular localization of proteins in various organisms . Nat Protoc 2008 ; 3 ( 2 ): 153 – 62 . WorldCat 44 Garg A , Manoj B , Raghava GPS . Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search . J Biol Chem 2005 ; 280 ( 15 ): 14427 – 32 . WorldCat 45 Cheng X , Xiao X , Chou KC . pLoc-mHum: predict subcellular localization of multi-location human proteins via general PseAAC to winnow out the crucial GO information . Bioinformatics 2018 ; 34 ( 9 ): 1448 – 56 . WorldCat 46 Zhou H , Yang Y , Shen HB . Hum-mPLoc 3.0: prediction enhancement of human protein subcellular localization through modeling the hidden correlations of Gene Ontology and functional domain features . Bioinformatics 2017 ; 33 ( 6 ): 843 – 53 . WorldCat 47 Shen HB , Chou KC . Hum-mPLoc: an ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites . Biochem Biophys Res Commun 2007 ; 355 ( 4 ): 1006 – 11 . WorldCat 48 Emanuelsson O , Nielsen HS , Von HG . Predicting subcellular localization of proteins based on their N-terminal amino acid sequence . J Mol Biol 2000 ; 300 ( 4 ): 1005 – 16 . WorldCat 49 Ian S , Nemo P , Fabrice L , et al. Predotar: a tool for rapidly screening proteomes for N-terminal targeting sequences . Proteomics 2004 ; 4 ( 6 ): 1581 – 90 . WorldCat 50 Chou KC , Cai YD . Using functional domain composition and support vector machines for prediction of protein subcellular location . J Biol Chem 2002 ; 277 ( 48 ): 45765 – 9 . WorldCat 51 Scott MS , Thomas DY , Hallett MT . Predicting subcellular localization via protein motif co-occurrence . Genome Res 2004 ; 14 ( 10A ): 1957 – 66 . WorldCat 52 Hu Y , Li T , Sun J , et al. 
Predicting Gram-positive bacterial protein subcellular localization based on localization motifs . J Theor Biol 2012 ; 308 : 135 – 40 . WorldCat 53 Abdul AK , Zakir K , Mohd AK , et al. Inter-kingdom prediction certainty evaluation of protein subcellular localization tools: microbial pathogenesis approach for deciphering host microbe interaction . Brief Bioinform 2018 ; 19 ( 1 ): 12 – 22 . WorldCat 54 Wu X , Zhang Q , Wu Z , et al. Subcellular locations of potential cell wall proteins in plants: predictors, databases and cross-referencing . Brief Bioinform 2018 ; 19 ( 6 ): 1130 – 40 . WorldCat 55 Emanuelsson O . Predicting protein subcellular localisation from amino acid sequence information . Brief Bioinform 2002 ; 3 ( 4 ): 361 – 76 . WorldCat 56 Bin L. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches . Brief. Bioinformatics 2017 ; bbx165 . 57 Claire O , Maria JM , Alexandre G , et al. High-quality protein knowledge resource: Swiss-Prot and TrEMBL . Brief Bioinform 2002 ; 3 ( 3 ): 275 – 84 . WorldCat 58 Nicola JM , Rolf A , Teresa KA , et al. InterPro: an integrated documentation resource for protein families, domains and functional sites . Brief Bioinform 2002 ; 3 ( 3 ): 225 – 35 . WorldCat 59 Bairoch A , Boeckmann B , Wu CH , et al. UniProt: the universal protein knowledgebase . Nucleic Acids Res 2004 ; 32 ( Suppl ): D115 – 9 . WorldCat 60 Josefine S , Fink JL , Seetha K , et al. LOCATE: a mammalian protein subcellular localization database . Nucleic Acids Res 2008 ; 36 ( Database issue ): D230 – 3 . WorldCat 61 Andea P , Pier Luigi M , Piero F , et al. eSLDB: eukaryotic subcellular localization database . Nucleic Acids Res 2007 ; 35 ( Database issue ): D208 – 12 . WorldCat 62 Shruti R , Burkhard R . LocDB: experimental annotations of localization for homo sapiens and arabidopsis thaliana . Nucleic Acids Res 2011 ; 39 ( Database issue ): D230 – 4 . WorldCat 63 Li W , Jaroszewski L , Godzik A . 
Clustering of highly homologous sequences to reduce the size of large protein databases . Bioinformatics 2001 ; 17 ( 3 ): 282 – 3 . WorldCat 64 Huang Y , Niu B , Gao Y , et al. CD-HIT suite: a web server for clustering and comparing biological sequences . Bioinformatics 2010 ; 26 ( 5 ): 680 – 2 . WorldCat 65 Wan S , Duan Y , Zou Q . HPSLPred: an ensemble multi-label classifier for human protein subcellular location prediction with imbalanced source . Proteomics 2017 ; 17 ( 17-18 ): 1700262 . WorldCat 66 Wan S , Mak MW , Kung SY . Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins . BMC Bioinformatics 2016 ; 17 ( 1 ): 97 . WorldCat 67 Chi SM , Nam D . WegoLoc: accurate prediction of protein subcellular localization using weighted Gene Ontology terms . Bioinformatics 2012 ; 28 ( 7 ): 1028 – 30 . WorldCat 68 Zhang ML , Zhou ZH . ML-KNN: a lazy learning approach to multi-label learning . Pattern Recognit 2007 ; 40 ( 7 ): 2038 – 48 . WorldCat 69 Breiman L . Random forests . Mach Learn 2001 ; 45 ( 1 ): 5 – 32 . WorldCat 70 Chen T , Guestrin C . XGBoost: A scalable tree boosting system . In: Acm Sigkdd International Conference on Knowledge Discovery and Data Mining , 2016 , 785 – 94 . ACM New York, NY, USA , San Francisco, California, USA . 71 Cao X , Zhang C , Fu H , Si L , Hua Z . Diversity-induced Multi-view Subspace Clustering . In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . 2015 ; 1 : 586 – 94 . IEEE , Boston, MA, USA . 72 Zhang ML , Zhou ZH . A review on multi-label learning algorithms . IEEE Trans Knowl Data Eng 2014 ; 26 ( 8 ): 1819 – 37 . WorldCat 73 Resnik P . Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language . J Artif Intell Res 1999 ; 11 ( 1 ): 95 – 130 . WorldCat 74 Lin D . An information-theoretic definition of similarity . 
In: International Conference On Machine Learning , ISBN: 1-55860-556-8 , 1998 , 296 – 304 . Morgan Kaufmann Publishers Inc. San Francisco, CA, USA . 75 Jiang J , Conrath D . Semantic similarity based on corpus statistics and lexical taxonomy . In: International Conference Research On Computational Linguistics (ROCLING X) , Taiwan , 1997 , 19 – 33 . 76 Schlicker A , Domingues FS , Rahnenführer J , et al. A new measure for functional similarity of gene products based on Gene Ontology . BMC Bioinformatics 2006 ; 7 ( 1 ): 302 – 2 . WorldCat © The Author(s) 2019. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) TI - Critical evaluation of web-based prediction tools for human protein subcellular localization JF - Briefings in Bioinformatics DO - 10.1093/bib/bbz106 DA - 2003-02-01 UR - https://www.deepdyve.com/lp/oxford-university-press/critical-evaluation-of-web-based-prediction-tools-for-human-protein-oqZpJC7YC0 SP - 1 VL - Advance Article IS - DP - DeepDyve ER -