TY - JOUR AU - Butler, Gregory AB - Introduction Transmembrane proteins are gates that organize a variety of vital cellular functions including cell signaling, trafficking, metabolism, and energy production. It is estimated that, on average, one in every three proteins in a cell is a membrane protein [1, 2]. Any defective or misregulated membrane protein can disturb the organism’s functioning, giving rise to disease [3]. About one-half of the drug targets today are membrane proteins, such as transporters or related receptors [4]. While the amino acid sequence of many membrane proteins is available, they are among the least characterized proteins in terms of their structure and function. For example, only 3% of the structures in the Protein Data Bank [5] are transmembrane proteins. Numerous genome projects have produced an abundance of protein sequences, many of which remain uncharacterized. Transmembrane proteins are among the least characterized proteins, because experimental characterization of their structure and function is exceptionally difficult, owing to their hydrophobic surfaces and their lack of conformational stability. Consequently, there is an urgent need for computational approaches in which the available data is used to distinguish and characterize transmembrane proteins. Yet, this area of research is still in its early stages, and the researchers are far from finding a definitive solution. Existing tools for the annotation of transporters that predict the substrate of the transport reaction lag behind tools for other kinds of proteins such as enzymes for metabolic reactions. Most tools predict the type of substrates [6–10], chosen from a small subset of substrate types, without attempting to predict the specific substrate, or predict the family or subfamily [9, 11–13] of the protein within the transporter classification (TC) [14–16], and again without attempting to predict the specific substrate. For network modeling in systems biology [17, 18], we require tools to process the complete proteome and predict each transport reaction; this means identifying the transport protein and the specific substrate. Many tools rely simply on homology or orthology to predict transporters. This includes metabolic network tools merlin [19], and Pantograph [20], as well as our system TransATH [21]. Amongst the tools for de novo prediction of substrate class, the Transporter Substrate Specificity Prediction (TrSSP) [10] claims to be the state-of-the-art. Our previous efforts [22] for de novo prediction of specific substrates for sugar transporters in fungi were not successful. However, from it we learned how much depends on very few residues of the transporter; often three or so residues, and often internal to different helix transmembrane segments (TMSs) of the transporter [23]. These residues are far apart in the linear protein sequence, but are close to each other in the three-dimensional structure of the protein when integrated in the membrane. In looking forward to how we might improve on approaches that rely on the amino acid composition of the protein, we developed a roadmap, whereby the composition information would be combined with evolutionary information as captured by an MSA, and by positional information [24] about the residues responsible for determining the specificity of the transporter. This roadmap is a schema for a large number of possible algorithms due to the many choices for encoding of amino acid composition, MSA algorithms, and algorithms for specificity determining sites [25]. We also realized the importance of the alignment preserving the TMS positions, since the important residue positions seem to be located there. There are a number of such MSA algorithms [26–29]. We therefore conducted a preliminary study, which indicated that the combination of information about protein composition, protein evolution, and the specificity determining positions had a significant impact on our ability to predict the transported substrates. We chose the methodology and datasets of TrSSP [10] as our baseline, and varied this to illustrate the impact of each of the factors: compositional, evolutionary, and positional information. Our best approach, which defines the predictor we call TranCEP (an abbreviation of transporter substrate prediction using compositional, evolutionary, and positional information), involves utilizing the pair amino acid composition (PAAC) encoding scheme, the TM-Coffee [27] MSA algorithm, and the transitive consistency score (TCS) [30] algorithm for determining informative positions in the MSA, to build a suite of support vector machine (SVM) classifiers, one for distinguishing between each pair of classes of substrates. Background For most of the work done on the prediction of transport proteins [3], there is no available software, so it is difficult to reproduce the work and to compare the results of different articles. Pathway tools includes the Transport Inference Parser (TIP) [31], which analyses keywords in a gene annotation to assign gene-protein-reaction associations to transport reactions in MetaCyc [32]. The Genome-Blast (G-Blast) [33] screens proteins against all the entries in Transporter Classification Database (TCDB) [34] using Blast to retrieve the top hit, and HMMTOP [35] to determine the TMS for the query and the hit sequence. It is an integral part of a manual protocol of Saier’s lab to predict the transport proteins for a genome [36]. The Zhao Lab has developed three methods: a nearest neighbour approach [37]; TransportTP [12]; and TrSSP [10]. The nearest neighbour approach achieved a balanced accuracy of 67.0%. The TransportTP [12] is a two-phase algorithm that combines homology and machine learning to predict the TC family of one or more proteins. For training and cross-validation testing, TransportTP used the yeast proteome. When testing, it used 10 genomes from the TransportDB database [38] of annotated prokaryote transporters. As an independent test, TransportTP was trained on the proteome of A. thaliana and then used to predict the transporters in four other plant proteomes. TransportTP achieved a balanced accuracy of 81.8%. Compared with the SVM-Prot classifier [39], on the five TC superfamilies and three families used by SVM-Prot, TransportTP achieved better recall and precision. The TrSSP [10] is a web server utilized to predict membrane transport proteins and their substrate category. The substrate categories are: (1) oligopeptide (amino acid), (2) anion, (3) cation, (4) electron, (5) protein/mRNA, (6) sugar, and (7) other. TrSSP makes a top-level prediction of whether the protein is a transporter, or not. An SVM is applied, with the highest accuracy being reported when using amino acid index (AAindex) and position-specific scoring matrix (PSSM) features. The disc_function system [11] uses amino acid composition and neural networks to discriminate channels/pores, electrochemical and active transporters, with an accuracy of 68%. When augmented with PSSM profiles and amino acid physicochemical properties it gained 5–10% in discrimination accuracy [13]. They also considered six major families in TCDB [13] with an average accuracy of 69%. The TTRBF [7] considers four major classes of substrates: (1) electron, (2) protein/mRNA, (3) ion, and (4) other. It is an ensemble system combining amino acid composition, dipeptide composition, physicochemical properties, PSSM profiles and radial basis function (RBF) networks. Schaadt et al. [6] used amino acid composition, characteristics of amino acid residues and conservation to detect transporters based on four classes of substrate: (1) amino acid, (2) oligopeptide, (3) phosphate, and (4) hexose. The number of characterized transporters in A. thaliana for the four substrates was from 13 to 17. They constructed a vector for each protein using various types of amino acid composition, and used Euclidean distance from the query protein’s vector to the known vectors to rank the substrate categories. They found that the PAAC performed as well as the more complicated features, yielding an accuracy of over 90%. Schaadt and Helms [8] compared the similarity of transporters in the TCDB and annotated transporters in A. thaliana using amino acid composition and classified the proteins into three families. By distinguishing the amino acid composition of TMS and non-TMS regions, they could classify four different families with an accuracy of 80%. Barghash and Helms [9] performed a comparison of three different approaches (homology, HMMER, MEME) to predict the substrate category and predicting TC family. They used four substrate categories, namely metal ions, phosphate, sugar, and amino acid; and the 29 TC families with the most examples. The datasets were from E. coli, S. cerevisiae, and A. thaliana, consisting of the 155, 158 and 177 proteins, respectively, that had both a substrate annotation and TC family annotation that were experimentally determined. The merlin [19] system for the reconstruction of metabolic networks handles eukaryote genomes, and includes the determination of transport gene-protein-reaction associations, as well as the localization of reactions across a number of cellular compartments: mitochondrion, endoplasmic reticulum (ER), and Golgi apparatus. In merlin, transport proteins are predicted based on the existence of TMS as predicted by TMHMM, and by similarity to entries in the TCDB using the Smith-Waterman algorithm [40]. The association of transport reactions and specific substrates for the predicted transport proteins were originally taken from a manually curated database of some 4,000 TCDB entries originally. It now incorporates Transport Reactions Annotation and Generation (TRIAGE) [41], with information for 5,495 TCDB entries. The merlin software is available as open source Java code (http://www.merlin-sysbio.org). The Pantograph [20] is designed for the metabolic pathway reconstruction of yeasts such as Yarrowia lipolytica [42], which accumulates lipids in the peroxisome of the cell. It specifically models the cellular components and the transport across the membranes in a reference template, called the scaffold model. The Pantograph method relies on a database of profile HMMs for fungal protein families and their annotations, which is maintained at Génolevures in Bordeaux. The Pantograph algorithm first assigns gene-protein-reaction associations, and then decides which compartments and reactions to include in the draft model based on these associations. The scaffold model, which is the reference template for Pantograph, was manually curated to include 421 transport reactions. The associated transport protein families of orthologs were manually identified for Génolevures collection. The Pantograph software, written in Python, is available at http://pathtastic.gforge.inria.fr/. The distribution includes the scaffold model in Systems Biology Markup Language (SBML). The Transporters via Annotation Transfer by Homology (TransATH) [21] is a system that automates Saier’s protocol based on sequence similarity. The TransATH includes the computation of subcellular localization and improves the computation of TMSs. The parameters of TransATH are chosen for optimal performance on a gold standard set of transporters and non-transporters from S. cerevisiae. A website http://transath.umt.edu.my for TransATH is available for use. The SCMMTP [43] has a novel scoring card method (SCM) that utilizes the dipeptide composition to identify putative membrane transport proteins. The SCMMTP method first builds an initial matrix of 400 dipeptides and uses the difference between compositions of positives and negatives as an initial dipeptide scoring matrix. This matrix is then optimized using a genetic algorithm. SCMMP achieved an overall accuracy of 76.11% and Matthews correlation coefficient (MCC) of 0.47 on the independent dataset. Li et al [44] tool uses an SVM model to predict substrate classes of transmembrane transport proteins by integrating features from PSSM, amino acid composition, biochemical properties and gene ontology (GO) [45] terms. The tool achieved an overall accuracy of 80.45% on the independent dataset. Background For most of the work done on the prediction of transport proteins [3], there is no available software, so it is difficult to reproduce the work and to compare the results of different articles. Pathway tools includes the Transport Inference Parser (TIP) [31], which analyses keywords in a gene annotation to assign gene-protein-reaction associations to transport reactions in MetaCyc [32]. The Genome-Blast (G-Blast) [33] screens proteins against all the entries in Transporter Classification Database (TCDB) [34] using Blast to retrieve the top hit, and HMMTOP [35] to determine the TMS for the query and the hit sequence. It is an integral part of a manual protocol of Saier’s lab to predict the transport proteins for a genome [36]. The Zhao Lab has developed three methods: a nearest neighbour approach [37]; TransportTP [12]; and TrSSP [10]. The nearest neighbour approach achieved a balanced accuracy of 67.0%. The TransportTP [12] is a two-phase algorithm that combines homology and machine learning to predict the TC family of one or more proteins. For training and cross-validation testing, TransportTP used the yeast proteome. When testing, it used 10 genomes from the TransportDB database [38] of annotated prokaryote transporters. As an independent test, TransportTP was trained on the proteome of A. thaliana and then used to predict the transporters in four other plant proteomes. TransportTP achieved a balanced accuracy of 81.8%. Compared with the SVM-Prot classifier [39], on the five TC superfamilies and three families used by SVM-Prot, TransportTP achieved better recall and precision. The TrSSP [10] is a web server utilized to predict membrane transport proteins and their substrate category. The substrate categories are: (1) oligopeptide (amino acid), (2) anion, (3) cation, (4) electron, (5) protein/mRNA, (6) sugar, and (7) other. TrSSP makes a top-level prediction of whether the protein is a transporter, or not. An SVM is applied, with the highest accuracy being reported when using amino acid index (AAindex) and position-specific scoring matrix (PSSM) features. The disc_function system [11] uses amino acid composition and neural networks to discriminate channels/pores, electrochemical and active transporters, with an accuracy of 68%. When augmented with PSSM profiles and amino acid physicochemical properties it gained 5–10% in discrimination accuracy [13]. They also considered six major families in TCDB [13] with an average accuracy of 69%. The TTRBF [7] considers four major classes of substrates: (1) electron, (2) protein/mRNA, (3) ion, and (4) other. It is an ensemble system combining amino acid composition, dipeptide composition, physicochemical properties, PSSM profiles and radial basis function (RBF) networks. Schaadt et al. [6] used amino acid composition, characteristics of amino acid residues and conservation to detect transporters based on four classes of substrate: (1) amino acid, (2) oligopeptide, (3) phosphate, and (4) hexose. The number of characterized transporters in A. thaliana for the four substrates was from 13 to 17. They constructed a vector for each protein using various types of amino acid composition, and used Euclidean distance from the query protein’s vector to the known vectors to rank the substrate categories. They found that the PAAC performed as well as the more complicated features, yielding an accuracy of over 90%. Schaadt and Helms [8] compared the similarity of transporters in the TCDB and annotated transporters in A. thaliana using amino acid composition and classified the proteins into three families. By distinguishing the amino acid composition of TMS and non-TMS regions, they could classify four different families with an accuracy of 80%. Barghash and Helms [9] performed a comparison of three different approaches (homology, HMMER, MEME) to predict the substrate category and predicting TC family. They used four substrate categories, namely metal ions, phosphate, sugar, and amino acid; and the 29 TC families with the most examples. The datasets were from E. coli, S. cerevisiae, and A. thaliana, consisting of the 155, 158 and 177 proteins, respectively, that had both a substrate annotation and TC family annotation that were experimentally determined. The merlin [19] system for the reconstruction of metabolic networks handles eukaryote genomes, and includes the determination of transport gene-protein-reaction associations, as well as the localization of reactions across a number of cellular compartments: mitochondrion, endoplasmic reticulum (ER), and Golgi apparatus. In merlin, transport proteins are predicted based on the existence of TMS as predicted by TMHMM, and by similarity to entries in the TCDB using the Smith-Waterman algorithm [40]. The association of transport reactions and specific substrates for the predicted transport proteins were originally taken from a manually curated database of some 4,000 TCDB entries originally. It now incorporates Transport Reactions Annotation and Generation (TRIAGE) [41], with information for 5,495 TCDB entries. The merlin software is available as open source Java code (http://www.merlin-sysbio.org). The Pantograph [20] is designed for the metabolic pathway reconstruction of yeasts such as Yarrowia lipolytica [42], which accumulates lipids in the peroxisome of the cell. It specifically models the cellular components and the transport across the membranes in a reference template, called the scaffold model. The Pantograph method relies on a database of profile HMMs for fungal protein families and their annotations, which is maintained at Génolevures in Bordeaux. The Pantograph algorithm first assigns gene-protein-reaction associations, and then decides which compartments and reactions to include in the draft model based on these associations. The scaffold model, which is the reference template for Pantograph, was manually curated to include 421 transport reactions. The associated transport protein families of orthologs were manually identified for Génolevures collection. The Pantograph software, written in Python, is available at http://pathtastic.gforge.inria.fr/. The distribution includes the scaffold model in Systems Biology Markup Language (SBML). The Transporters via Annotation Transfer by Homology (TransATH) [21] is a system that automates Saier’s protocol based on sequence similarity. The TransATH includes the computation of subcellular localization and improves the computation of TMSs. The parameters of TransATH are chosen for optimal performance on a gold standard set of transporters and non-transporters from S. cerevisiae. A website http://transath.umt.edu.my for TransATH is available for use. The SCMMTP [43] has a novel scoring card method (SCM) that utilizes the dipeptide composition to identify putative membrane transport proteins. The SCMMTP method first builds an initial matrix of 400 dipeptides and uses the difference between compositions of positives and negatives as an initial dipeptide scoring matrix. This matrix is then optimized using a genetic algorithm. SCMMP achieved an overall accuracy of 76.11% and Matthews correlation coefficient (MCC) of 0.47 on the independent dataset. Li et al [44] tool uses an SVM model to predict substrate classes of transmembrane transport proteins by integrating features from PSSM, amino acid composition, biochemical properties and gene ontology (GO) [45] terms. The tool achieved an overall accuracy of 80.45% on the independent dataset. Materials and methods Datasets We used the same training dataset and test dataset as TrSSP by Mishra et al. [10] which are available at http://bioinfo.noble.org/TrSSP. This benchmark dataset was collected from the Swiss-Prot database (release 2013-03). The dataset initially contained 10,780 transporter, carrier, and channel proteins that were well characterized at the protein level and had clear substrate annotations. Then, Mishra et al. removed the transporters with more than two substrate specificities, sequences with biological function annotations based solely on sequence similarity, and sequences with greater than 70% similarity. Mishra et al.’s final transporter dataset contains 900 transporters divided into seven substrate classes: amino acid, anion, cation, electron, protein/mRNA, sugar, and other. The latter refers to transporters that do not belong to any of the previous six classes. Table 1 shows the dataset details with the size of each class. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 1. The dataset. https://doi.org/10.1371/journal.pone.0227683.t001 Databases We used the Swiss-Prot database when searching for similar sequences. When constructing MSAs, we used TM-Coffee [27] with the UniRef50-TM database, which consists of the entries in UniRef50 that have keyword transmembrane. The TrSSP dataset was derived from the Swiss-Prot database, so we made sure to remove all entries in dataset from the two databases that we used. Algorithm Fig 1 illustrates the steps of TranCEP. The sequence (a) in this case, according to TM-Coffee, has four TMSs, as shown by the gray shading. The example focuses on the first TMS, and abbreviates the middle section of the sequence. Part (b) shows an MSA conserving the TMS structure constructed by TM-Coffee, where the gray shading indicates the TMS location. Part (c) shows the colour coding of the reliability index of each column as determined by TCS, and shows how gaps replace unreliable columns in the filtered MSA. Part (d) shows a part of the 400-dimensional vector of dipeptide frequencies (PAAC) from the filtered MSA. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 1. Example of steps of TranCEP. The figure illustrates the steps of TranCEP. Note that we abbreviated the middle section of the sequence. Part (a) shows the sequence the four TMSs in gray shading. Part (b) shows an MSA constructed by TM-Coffee. Gray shading indicates a TMS. Part (c) shows the color coding of the reliability index of each column as determined by TCS, and shows gaps in unreliable columns in the filtered MSA. Part (d) shows a 400-dimensional PAAC vector from the filtered MSA. https://doi.org/10.1371/journal.pone.0227683.g001 The template for combining evolutionary, positional, and compositional information is presented in Algorithm 1. In this work, we used TM-Coffee to compute the MSA that conserves the TMSs, and TCS to determine a reliability index for each position (column) in the MSA. We experimented with three composition schemes: amino acid composition (AAC), pairwise amino acid composition (PAAC), and pseudo-amino acid composition (PseAAC), as well as the optional use of TM-Coffee (TMC) and the transitive consistency score (TCS). Algorithm 1. Template for constructing the composition vector. function comp_vec(seq s) // Evolutionary (E) Step, optional Construct an MSA from s // Positional (P) Step, optional Determine the informative positions (columns) in the MSA Filter the uninformative positions from the MSA // Composition (C) Step, mandatory return vector encoding composition of the filtered MSA end function A template for the algorithm showing the role of evolutionary (E), positional (P), and composition (C) information. Note that the use of evolutionary (E) and positional (P) is optional; and that if positional (P) information is used then it requires evolutionary (E) information in the form of Multiple Sequence Alignment (MSA). Note that if Step (E) is not done, then Step (C) encodes the sequence s. Note that if Step (E) is done but Step (P) MSA is not done, then Step (C) encodes the MSA. Algorithm 2 shows the composition vectors being used to build a set of classifiers; SVM classifiers in this case. Algorithm 3 shows the prediction algorithm. Algorithm 2. Build SVM classifiers. Require: a training set T of sequences labelled with classes C1, …, Cn Ensure: a set of SVM’s svm(i, j) distinguishing class Ci and Cj procedure Build_SVMs(T: set of seq; svm: set of SVM) for all seq s in T do v(s) ← COMP_VEC(s) end for for all pair (Ci, Cj) of classes do svm(i, j) ← SVM.build({v(s): s ∈ T ∩ (Ci ∪ Cj)}) end for end procedure The algorithm to build a set of SVM classifiers to distinguish between each pair of classes in the training set. The actual construction of each SVM was done by an SVM package’s build function. Algorithm 3. Prediction. Require: test sequence s Require: a set of SVM’s svm(i, j) distinguishing classes Ci and Cj Ensure: result is predicted class Cp function predict_class(seq s) v ← COMP_VEC(s) for all pair (Ci, Cj) of classes do c(i, j) ← svm(i, j) applied to v end for p ← mode of c(i, j) for all pairs (i, j) return Cp end function The prediction algorithm that applies each of the SVMs, and takes the class that is predicted most often by the set of SVMs. Encoding amino acid composition The properties of the amino acids at each position in the protein sequence can be encoded into vectors that summarize the overall composition of the protein. Three approaches to encoding amino acid composition were implemented in this study: AAC, PAAC, and PseAAC. The amino acid composition (AAC) is the normalized occurrence frequency of each amino acid. The fractions of all 20 natural amino acids are calculated as: (1) where Fi is the frequency of the ith amino acid in sequence P, and L is the length of the sequence. The AAC is represented as a vector of size 20: (2) where ci is the composition of ith amino acid. The pair amino acid composition (PAAC) is the normalized frequency of each pair of amino acids. The PAAC is calculated as (3) where Fi,j is the frequency of the ith and jth amino acids as a pair (dipeptide) adjacent to each other in the sequence P, and L is the length of the sequence. The PAAC is represented as a vector of size 400 as follows: (4) where di,j is the dipeptide composition of the ith and jth amino acid. The pseudo-amino acid composition (PseAAC) [46] combines the 20 components of AAC with a set of sequence order correlation factors that incorporates some biochemical properties. Given a protein sequence P = R1R2R3R4…RL of length L, a set of descriptors called sequence order-correlated factors are defined as: (5) The parameter λ is chosen such that (λ < L). A correlation function is given by: (6) where H1(Ri) is the hydrophobicity value, H2(Ri) is the hydrophilicity value, and M(Ri) is the side chain mass of the amino acid Ri. Those quantities are converted from their original values. For example, for hydrophobicity, H1(R) is derived from the average hydrophobicity value , as follows: (7) The original hydrophobicity value is taken from Tanford [47]. The original hydrophilicity value for the amino acid Ri is taken from Hopp and Woods [48]. PseAAC is represented as a vector of size (20 + λ) as follows: (8) where si is the pseudo-amino acid composition (9) where fi is the normalized occurrence frequency of the ith amino acid in the protein sequence, θj is the jth sequence order-correlated factor calculated from Eq 5, and ω is a weight factor for the sequence order effect. The weight factor ω puts weight on the additional PseAAC components with respect to the conventional AAC components. Any value from 0.05 to 0.7 can be selected for the weight factor. The default value given by Chou [46] is 0.05, which we selected in this study. Multiple sequence alignment We adopted the MSA-AAC approach [6] that combines amino acid composition with the evolutionary information available from the MSA. This is done by first retrieving homologous sequences of each protein sequence in the dataset, then building an MSA for the corresponding protein, and then taking the counts for computing the composition information using all the residues in the MSA. In [6], the researchers only utilized the AAC encoding, whereas we also applied the approach to the PAAC and PseAAC encodings. Another difference was that we made use of TM-Coffee [27] (Version-11.00.8cbe486) to compute alignments, rather than ClustalW [49] as done in [6], because we felt it was important to align the TMSs. Other differences included searching the Swiss-Prot [50] database and retrieving a maximum of 120 homologous sequences, instead of searching the non-redundant database nr and retrieving 1,000 sequences. This was done to make the computational time more manageable, because the TM-Coffee algorithm requires more memory usage and execution time. Furthermore, all exact hits of the test sequences were removed from the Swiss-Prot and UniRef50-TM databases to maintain independence between the MSA and the test data. Our alignment command was the following: t_coffee mysequences. fasta −mode psicoffee \ −protein_db uniref50 −TM \ −template_file PSITM where mysequences.fasta is the file that contains the 120 similar sequences retrieved by Blast search on the Swiss-Prot database. Positional information In order to focus on those positions in the protein that determine specificity, we needed a method to determine those positions, and then to filter our MSA. The MSA was filtered by setting the entries for all other positions to null, that is, the symbol ‘-’ so that it was ignored when gathering counts for the amino acid composition. We applied the transitive consistency score (TCS) algorithm [30] to the alignment to determine the informative positions. The TCS is a scoring scheme that uses a consistency transformation to assign a reliability index to every pair of aligned residues, to each individual residue in the alignment, to each column, and to the overall alignment. This scoring scheme has been shown to be highly informative with respect to structural predictions based on benchmarking databases. The reliability index ranges from 0 to 9, where 0 is extremely uncertain and 9 is extremely reliable. Columns with a reliability index of below 4 were removed using the following command: t_coffee −infile myMSA. aln −evaluate \ −output tcs_column_filter4. fasta where myMSA.aln is the MSA file, tcs_column_filter4.fasta is the filtered file in FASTA format. Training Following TrSSP [10], we used a multi-class SVM with RBF kernel, as implemented in the R e1071 library version 1.6-8, using a one-against-one approach in which 21 = (7 × 6)/2 binary classifiers were trained. The predicted class was determined through a voting scheme in which all the binary classifiers were applied and the class that got highest number of votes was the result. Both the cost and gamma parameters of the RBF kernel were optimized by performing a grid search using the tune function in the library (range of cost: 2 … 16, range of gamma: 2e-5 … 1). Methods We adopted three approaches to encode the amino acid composition: AAC, PAAC, as done by TrSSP [10], and PseAAC. This was followed by training using SVM to form the prediction methods AAC, PAAC, and PseAAC respectively. By combining the amino acid composition and the evolutionary information obtained using TM-Coffee, followed by SVM, we implemented the prediction methods: TMC-AAC, TMC-PAAC, and TMC-PseAAC respectively. Filtering was incorporated by applying TCS after TM-Coffee, then computing the amino acid composition vectors, and applying SVM to implement the prediction methods: TMC-TCS-AAC, TMC-TCS-PAAC, and TMC-TCS-PseAAC respectively. The method of TranCEP is TMC-TCS-PAAC, the method with the best performance during cross-validation. Performance measurement Four statistical measures were considered to measure the performance: sensitivity which is the proportion of positives that are correctly identified: (10) specificity which is the proportion of negatives that are correctly identified: (11) accuracy which is proportion of correct predictions made amongst all the predictions: (12) Matthews correlation coefficient (MCC) which is a single measure taking into account true and false positives and negatives: (13) where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. We used the MCC because it is less influenced by imbalanced data and is arguably the best single assessment metric in this case [51–53]. The MCC value ranges from 1 to −1, where 1 indicates a perfect prediction; 0 represent no better than random; and −1 implies total disagreement between prediction and observation. A high MCC value means the predictor has high accuracy on both positive and negative classes, and also low misprediction on both. When dealing with multiclass classification, it is often desirable to compute a single aggregate measure that reflects the overall performance. There are two methods to compute the overall performance, namely micro-averaging and macro-averaging [54]. Macro-averaging computes a simple average performance of individual classes’ performances. Micro-averaging computes an overall performance by globally counting the total true positives, false negatives and false positives. Depending on the class distribution the difference between the two methods can be large. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each individual classification decision [54]. The overall accuracy of the tool is often calculated as the fraction of the correct predictions by the total number of predictions as follows: (14) where TPk is the number of true positives in class k, K is the number of different classes, and N is the total number of predictions. Another way to compute the accuracy is to take the macro-average accuracy of the individual classes: (15) where Accuracyk is the accuracy of class k, and K is the number of different classes. Similarly, the overall MCC is calculated in terms of the confusion matrix C of dimension K × K [55]: (16) Or as a macro-average MCC: (17) where MCCk is the accuracy of class k, and K is the number of different classes. Because the number of samples in each class of the dataset is imbalanced, we used the overall accuracy as in Eq 14 and overall MCC as in Eq 16 to evaluate and compare different methods. It is explicitly stated when the macro-average was used. Experiments The performance of each method was determined using five-fold cross-validation whereby the training dataset was randomly partitioned into five equal sized sets. A single set was kept as the validation data and the remaining four sets were used to train the SVM model. This model was then tested using the validation set. The cross-validation process was repeated four times; where each of the sets was used once as the validation data. The performance of each model was averaged to produce a single estimation and the variation in performances was captured by computing the standard deviation. In addition, we ran leave-one-out cross-validation (LOOCV) on different methods to evaluate their robustness and compared the results with those obtained by five-fold cross-validation. Furthermore, the independent dataset was used to test the final models. The data in the independent dataset was not used, nor considered, during the training process and completely unknown to the different models. The performance of TranCEP on the test dataset was compared to the performance of TrSSP, as reported in their paper [10]. Statistical analysis In this analysis, the average number of informative residues, as determined by TCS scores, in different segments of a protein sequence was computed. For each substrate class, pairwise comparisons between means of important positions in different segments were performed. Since the sample size of each substrate class is >30, based on the the central limit theorem [56], the Student t-test (two tailed, paired) was applied. The difference was considered statistically significant when the Student t-test significance level P (P-value) was less than 0.0001. Datasets We used the same training dataset and test dataset as TrSSP by Mishra et al. [10] which are available at http://bioinfo.noble.org/TrSSP. This benchmark dataset was collected from the Swiss-Prot database (release 2013-03). The dataset initially contained 10,780 transporter, carrier, and channel proteins that were well characterized at the protein level and had clear substrate annotations. Then, Mishra et al. removed the transporters with more than two substrate specificities, sequences with biological function annotations based solely on sequence similarity, and sequences with greater than 70% similarity. Mishra et al.’s final transporter dataset contains 900 transporters divided into seven substrate classes: amino acid, anion, cation, electron, protein/mRNA, sugar, and other. The latter refers to transporters that do not belong to any of the previous six classes. Table 1 shows the dataset details with the size of each class. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 1. The dataset. https://doi.org/10.1371/journal.pone.0227683.t001 Databases We used the Swiss-Prot database when searching for similar sequences. When constructing MSAs, we used TM-Coffee [27] with the UniRef50-TM database, which consists of the entries in UniRef50 that have keyword transmembrane. The TrSSP dataset was derived from the Swiss-Prot database, so we made sure to remove all entries in dataset from the two databases that we used. Algorithm Fig 1 illustrates the steps of TranCEP. The sequence (a) in this case, according to TM-Coffee, has four TMSs, as shown by the gray shading. The example focuses on the first TMS, and abbreviates the middle section of the sequence. Part (b) shows an MSA conserving the TMS structure constructed by TM-Coffee, where the gray shading indicates the TMS location. Part (c) shows the colour coding of the reliability index of each column as determined by TCS, and shows how gaps replace unreliable columns in the filtered MSA. Part (d) shows a part of the 400-dimensional vector of dipeptide frequencies (PAAC) from the filtered MSA. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 1. Example of steps of TranCEP. The figure illustrates the steps of TranCEP. Note that we abbreviated the middle section of the sequence. Part (a) shows the sequence the four TMSs in gray shading. Part (b) shows an MSA constructed by TM-Coffee. Gray shading indicates a TMS. Part (c) shows the color coding of the reliability index of each column as determined by TCS, and shows gaps in unreliable columns in the filtered MSA. Part (d) shows a 400-dimensional PAAC vector from the filtered MSA. https://doi.org/10.1371/journal.pone.0227683.g001 The template for combining evolutionary, positional, and compositional information is presented in Algorithm 1. In this work, we used TM-Coffee to compute the MSA that conserves the TMSs, and TCS to determine a reliability index for each position (column) in the MSA. We experimented with three composition schemes: amino acid composition (AAC), pairwise amino acid composition (PAAC), and pseudo-amino acid composition (PseAAC), as well as the optional use of TM-Coffee (TMC) and the transitive consistency score (TCS). Algorithm 1. Template for constructing the composition vector. function comp_vec(seq s) // Evolutionary (E) Step, optional Construct an MSA from s // Positional (P) Step, optional Determine the informative positions (columns) in the MSA Filter the uninformative positions from the MSA // Composition (C) Step, mandatory return vector encoding composition of the filtered MSA end function A template for the algorithm showing the role of evolutionary (E), positional (P), and composition (C) information. Note that the use of evolutionary (E) and positional (P) is optional; and that if positional (P) information is used then it requires evolutionary (E) information in the form of Multiple Sequence Alignment (MSA). Note that if Step (E) is not done, then Step (C) encodes the sequence s. Note that if Step (E) is done but Step (P) MSA is not done, then Step (C) encodes the MSA. Algorithm 2 shows the composition vectors being used to build a set of classifiers; SVM classifiers in this case. Algorithm 3 shows the prediction algorithm. Algorithm 2. Build SVM classifiers. Require: a training set T of sequences labelled with classes C1, …, Cn Ensure: a set of SVM’s svm(i, j) distinguishing class Ci and Cj procedure Build_SVMs(T: set of seq; svm: set of SVM) for all seq s in T do v(s) ← COMP_VEC(s) end for for all pair (Ci, Cj) of classes do svm(i, j) ← SVM.build({v(s): s ∈ T ∩ (Ci ∪ Cj)}) end for end procedure The algorithm to build a set of SVM classifiers to distinguish between each pair of classes in the training set. The actual construction of each SVM was done by an SVM package’s build function. Algorithm 3. Prediction. Require: test sequence s Require: a set of SVM’s svm(i, j) distinguishing classes Ci and Cj Ensure: result is predicted class Cp function predict_class(seq s) v ← COMP_VEC(s) for all pair (Ci, Cj) of classes do c(i, j) ← svm(i, j) applied to v end for p ← mode of c(i, j) for all pairs (i, j) return Cp end function The prediction algorithm that applies each of the SVMs, and takes the class that is predicted most often by the set of SVMs. Encoding amino acid composition The properties of the amino acids at each position in the protein sequence can be encoded into vectors that summarize the overall composition of the protein. Three approaches to encoding amino acid composition were implemented in this study: AAC, PAAC, and PseAAC. The amino acid composition (AAC) is the normalized occurrence frequency of each amino acid. The fractions of all 20 natural amino acids are calculated as: (1) where Fi is the frequency of the ith amino acid in sequence P, and L is the length of the sequence. The AAC is represented as a vector of size 20: (2) where ci is the composition of ith amino acid. The pair amino acid composition (PAAC) is the normalized frequency of each pair of amino acids. The PAAC is calculated as (3) where Fi,j is the frequency of the ith and jth amino acids as a pair (dipeptide) adjacent to each other in the sequence P, and L is the length of the sequence. The PAAC is represented as a vector of size 400 as follows: (4) where di,j is the dipeptide composition of the ith and jth amino acid. The pseudo-amino acid composition (PseAAC) [46] combines the 20 components of AAC with a set of sequence order correlation factors that incorporates some biochemical properties. Given a protein sequence P = R1R2R3R4…RL of length L, a set of descriptors called sequence order-correlated factors are defined as: (5) The parameter λ is chosen such that (λ < L). A correlation function is given by: (6) where H1(Ri) is the hydrophobicity value, H2(Ri) is the hydrophilicity value, and M(Ri) is the side chain mass of the amino acid Ri. Those quantities are converted from their original values. For example, for hydrophobicity, H1(R) is derived from the average hydrophobicity value , as follows: (7) The original hydrophobicity value is taken from Tanford [47]. The original hydrophilicity value for the amino acid Ri is taken from Hopp and Woods [48]. PseAAC is represented as a vector of size (20 + λ) as follows: (8) where si is the pseudo-amino acid composition (9) where fi is the normalized occurrence frequency of the ith amino acid in the protein sequence, θj is the jth sequence order-correlated factor calculated from Eq 5, and ω is a weight factor for the sequence order effect. The weight factor ω puts weight on the additional PseAAC components with respect to the conventional AAC components. Any value from 0.05 to 0.7 can be selected for the weight factor. The default value given by Chou [46] is 0.05, which we selected in this study. Multiple sequence alignment We adopted the MSA-AAC approach [6] that combines amino acid composition with the evolutionary information available from the MSA. This is done by first retrieving homologous sequences of each protein sequence in the dataset, then building an MSA for the corresponding protein, and then taking the counts for computing the composition information using all the residues in the MSA. In [6], the researchers only utilized the AAC encoding, whereas we also applied the approach to the PAAC and PseAAC encodings. Another difference was that we made use of TM-Coffee [27] (Version-11.00.8cbe486) to compute alignments, rather than ClustalW [49] as done in [6], because we felt it was important to align the TMSs. Other differences included searching the Swiss-Prot [50] database and retrieving a maximum of 120 homologous sequences, instead of searching the non-redundant database nr and retrieving 1,000 sequences. This was done to make the computational time more manageable, because the TM-Coffee algorithm requires more memory usage and execution time. Furthermore, all exact hits of the test sequences were removed from the Swiss-Prot and UniRef50-TM databases to maintain independence between the MSA and the test data. Our alignment command was the following: t_coffee mysequences. fasta −mode psicoffee \ −protein_db uniref50 −TM \ −template_file PSITM where mysequences.fasta is the file that contains the 120 similar sequences retrieved by Blast search on the Swiss-Prot database. Positional information In order to focus on those positions in the protein that determine specificity, we needed a method to determine those positions, and then to filter our MSA. The MSA was filtered by setting the entries for all other positions to null, that is, the symbol ‘-’ so that it was ignored when gathering counts for the amino acid composition. We applied the transitive consistency score (TCS) algorithm [30] to the alignment to determine the informative positions. The TCS is a scoring scheme that uses a consistency transformation to assign a reliability index to every pair of aligned residues, to each individual residue in the alignment, to each column, and to the overall alignment. This scoring scheme has been shown to be highly informative with respect to structural predictions based on benchmarking databases. The reliability index ranges from 0 to 9, where 0 is extremely uncertain and 9 is extremely reliable. Columns with a reliability index of below 4 were removed using the following command: t_coffee −infile myMSA. aln −evaluate \ −output tcs_column_filter4. fasta where myMSA.aln is the MSA file, tcs_column_filter4.fasta is the filtered file in FASTA format. Training Following TrSSP [10], we used a multi-class SVM with RBF kernel, as implemented in the R e1071 library version 1.6-8, using a one-against-one approach in which 21 = (7 × 6)/2 binary classifiers were trained. The predicted class was determined through a voting scheme in which all the binary classifiers were applied and the class that got highest number of votes was the result. Both the cost and gamma parameters of the RBF kernel were optimized by performing a grid search using the tune function in the library (range of cost: 2 … 16, range of gamma: 2e-5 … 1). Methods We adopted three approaches to encode the amino acid composition: AAC, PAAC, as done by TrSSP [10], and PseAAC. This was followed by training using SVM to form the prediction methods AAC, PAAC, and PseAAC respectively. By combining the amino acid composition and the evolutionary information obtained using TM-Coffee, followed by SVM, we implemented the prediction methods: TMC-AAC, TMC-PAAC, and TMC-PseAAC respectively. Filtering was incorporated by applying TCS after TM-Coffee, then computing the amino acid composition vectors, and applying SVM to implement the prediction methods: TMC-TCS-AAC, TMC-TCS-PAAC, and TMC-TCS-PseAAC respectively. The method of TranCEP is TMC-TCS-PAAC, the method with the best performance during cross-validation. Performance measurement Four statistical measures were considered to measure the performance: sensitivity which is the proportion of positives that are correctly identified: (10) specificity which is the proportion of negatives that are correctly identified: (11) accuracy which is proportion of correct predictions made amongst all the predictions: (12) Matthews correlation coefficient (MCC) which is a single measure taking into account true and false positives and negatives: (13) where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. We used the MCC because it is less influenced by imbalanced data and is arguably the best single assessment metric in this case [51–53]. The MCC value ranges from 1 to −1, where 1 indicates a perfect prediction; 0 represent no better than random; and −1 implies total disagreement between prediction and observation. A high MCC value means the predictor has high accuracy on both positive and negative classes, and also low misprediction on both. When dealing with multiclass classification, it is often desirable to compute a single aggregate measure that reflects the overall performance. There are two methods to compute the overall performance, namely micro-averaging and macro-averaging [54]. Macro-averaging computes a simple average performance of individual classes’ performances. Micro-averaging computes an overall performance by globally counting the total true positives, false negatives and false positives. Depending on the class distribution the difference between the two methods can be large. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each individual classification decision [54]. The overall accuracy of the tool is often calculated as the fraction of the correct predictions by the total number of predictions as follows: (14) where TPk is the number of true positives in class k, K is the number of different classes, and N is the total number of predictions. Another way to compute the accuracy is to take the macro-average accuracy of the individual classes: (15) where Accuracyk is the accuracy of class k, and K is the number of different classes. Similarly, the overall MCC is calculated in terms of the confusion matrix C of dimension K × K [55]: (16) Or as a macro-average MCC: (17) where MCCk is the accuracy of class k, and K is the number of different classes. Because the number of samples in each class of the dataset is imbalanced, we used the overall accuracy as in Eq 14 and overall MCC as in Eq 16 to evaluate and compare different methods. It is explicitly stated when the macro-average was used. Experiments The performance of each method was determined using five-fold cross-validation whereby the training dataset was randomly partitioned into five equal sized sets. A single set was kept as the validation data and the remaining four sets were used to train the SVM model. This model was then tested using the validation set. The cross-validation process was repeated four times; where each of the sets was used once as the validation data. The performance of each model was averaged to produce a single estimation and the variation in performances was captured by computing the standard deviation. In addition, we ran leave-one-out cross-validation (LOOCV) on different methods to evaluate their robustness and compared the results with those obtained by five-fold cross-validation. Furthermore, the independent dataset was used to test the final models. The data in the independent dataset was not used, nor considered, during the training process and completely unknown to the different models. The performance of TranCEP on the test dataset was compared to the performance of TrSSP, as reported in their paper [10]. Statistical analysis In this analysis, the average number of informative residues, as determined by TCS scores, in different segments of a protein sequence was computed. For each substrate class, pairwise comparisons between means of important positions in different segments were performed. Since the sample size of each substrate class is >30, based on the the central limit theorem [56], the Student t-test (two tailed, paired) was applied. The difference was considered statistically significant when the Student t-test significance level P (P-value) was less than 0.0001. Results and discussion Methods evaluation The performance of the nine methods was evaluated using five-fold cross-validation. Table 2 presents the overall accuracy and MCC of the SVM models for the nine methods, sorted from the best to the worst MCC. Details of the performance for each method is available in S1 File; the comparison between the different methods among the seven classes in MCC, accuracy, sensitivity, and specificity is presented in Fig 2. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 2. Performance of different methods. (a) MCC, (b) accuracy, (c) sensitivity, and (d) specificity. The dotted line represent the performance of TranCEP, which is TCS-TMC-PAAC. https://doi.org/10.1371/journal.pone.0227683.g002 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 2. Overall cross-validation performance of the methods. https://doi.org/10.1371/journal.pone.0227683.t002 All the SVM models that utilized evolutionary data notably performed better than the SVM models that did not. In addition, the top two models, TMC-TCS-PAAC and TMC-TCS-AAC, integrated positional and evolutionary information. The highest MCC was obtained by TMC-TCS-PAAC, which is the method chosen for our predictor TranCEP. This method incorporates the use of PAAC with evolutionary data in the form of MSA with positional information, in which columns that have a reliability below 4 are filtered out. We found that the performance peaked using this threshold and started to decline when filtering columns with a higher reliability index. The TMC-TCS-PAAC method yielded an overall MCC of 0.63. The detailed performance of TMC-TCS-PAAC on five-fold cross-validation is presented in Table 3, The MCC was 0.82, 0.61, 0.62, 0.58, 0.46, 0.52, and 0.69 for amino acid, anion, cation, electron, protein/mRNA, sugar, and other, respectively. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 3. Cross validation performance for TMC-TCS-PAAC. https://doi.org/10.1371/journal.pone.0227683.t003 The performance evaluation using LOOCV and independent testing prediction yielded similar patterns (available in S2 and S3 Files, respectively). A comparison of the results obtained by five-fold cross-validation, LOOCV, and the independent testing is presented in Fig 3. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 3. Methods overall performance. (a) overall accuracy, and (b) overall MCC. The asterisk (*) symbol refers to the performance of TranCEP, which is TCS-TMC-PAAC. https://doi.org/10.1371/journal.pone.0227683.g003 Table 4 shows the confusion matrix for TranCEP when run on the independent test set. Most of the confusion occurred between a substrate class and the class other. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 4. Confusion matrix for TranCEP. https://doi.org/10.1371/journal.pone.0227683.t004 Comparison Table 5 compares the performance of TranCEP to TrSSP [10] on the independent test set. TranCEP scored higher for all the substrate classes in terms of accuracy, specificity, and MCC. For the true positive rate as measured by sensitivity, TranCEP performed better on the cation and other classes, and TrSSP either matched or outperformed TranCEP on the other five classes. In TranCEP, the overall accuracy was computed as in Eq 14, and the MCC as in Eq 16. The performance of TrSSP was computed as macro-average accuracy and macro-average MCC. When we adopted the same method as TrSSP, the macro-average accuracy of TranCEP is 91.23% and the macro-average MCC is 0.69. Interestingly, the MCC score was not effected when using the overall and the macro-average, which further highlights its robustness when dealing with imbalanced data. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 5. Comparing TranCEP and TrSSP. https://doi.org/10.1371/journal.pone.0227683.t005 Overall, TranCEP obtained an MCC of 0.69 which is 68% higher than that of TrSSP. Impact of factors Table 6 shows the impact of evolutionary information and positional information on the overall MCC from the five-fold cross-validation. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 6. Impact of factors on performance. https://doi.org/10.1371/journal.pone.0227683.t006 The use of evolutionary information in form of MSA on the composition encodings AAC, PAAC, and PseAAC improved the MCC by an average of 104% with the highest improvement being on AAC by 127%. The further use of positional information by filtering out the unreliable columns from the MSA showed an average improvement of 132% to the baseline compositions. The impact of positional information over that already achieved by evolutionary information was an average of 13%. The highest impact is on TMC-PAAC, where the MCC improved by 24% in TMC-TCS-PAAC. It is difficult to isolate the exact residues that are key to inferring the substrate class; the results suggest that evolutionary information, obtained by MSA, is the main source of achieving high prediction performance. In addition, the TCS informative positions (with TCS score ≥ 4) can help to filter out the unnecessary noise and get a clearer signal to further improve the prediction. Using TCS informative positions filtered out an average of (35% ± 7%) of the sequence. However, when we tried to filter out more positions (by taking a stricter the TCS score cut-off), the performance started to deteriorate. To visualize the informative positions relative to the hydropathy scale of amino acids, the hydropathy scale proposed by [57] was utilized and the average hydropathy of each column in the MSA was computed. Higher positive scores indicate that amino acids in that region have hydrophobic properties and are likely to be located in a transmembrane alpha-helix segment. The TCS score of each column in the alignment is noted on the hydropathy plot through color coding. Fig 4 shows diverse examples. The red shades correspond to the informative columns (TCS score ≥ 4), while the gray and white shades correspond to non-informative columns that are filtered out by TranCEP. In Fig 4(a) and 4(b), the regions with high positive average hydropathy values appear to be more informative compared to the ones with lower values. However, in Fig 4(c) and 4(d), the difference between the informative positions with high and low hydropathy values is not as clear. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 4. Average Kyte-Doolittle hydropathy of the MSA with TCS scores. The figure indicates that the columns highlighted in red are informative and used by TranCEP. The TranCEP considers a column to be informative if it has a TCS score of at least 4 (shades of red), and filters out the other columns (gray and white). In (a) P39515 contains 158 residues, and the alignment of P39515 with other homologous sequences has 1203 columns, only 128 of them are informative (highlighted in shades of red). In (b) P10870 contains 884 residues, and the alignment of P10870 with other homologous sequences has 1312 columns, only 402 of them are informative. In (c) P11551 contains 438 residues, and the alignment of P11551 with other homologous sequences has 1753 columns, only 343 of them are informative. In (d) P0AAT4 contains 415 residues, and the alignment of P0AAT4 with other homologous sequences has 1653 columns, only 274 of them are informative. https://doi.org/10.1371/journal.pone.0227683.g004 To be able to measure the informative positions relative to different segments of the protein sequence, we divided the protein sequence positions into those in the TMS and those not in the TMS. Those in the TMS were divided into the interior one-third positions, and the remaining exterior positions in the TMS. The non-TMS positions were divided into those near a TMS, that is, within 10 positions, and the remaining positions far from a TMS. The location of the TMS was predicted using TOPCONS2 [58] to predict the α-helical TMS and PRED-TMBB2 [59] for β-barrel TMS. Table 7 shows a breakdown of where the informative positions, as determined by TCS, are located with respect TMS regions. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 7. Positional information. https://doi.org/10.1371/journal.pone.0227683.t007 For instance, in Fig 4(b), 45.47% of the residues of the sequence with UniProt-ID P10970 are informative (i.e., correspond to informative columns in the alignment); thus 54.53% of this sequence is filtered out. In this case, the residues in the TMSs of this protein are indeed more informative, where 96.42% of them are informative. On the other hand, only 25% of the residues on non-TMSs are informative. However, this does not always hold true as in the sequence with UniProt-ID P0AAT4 in Fig 4(d), for example. Details of the sequences in the figure are presented in Table 8. Details of all the sequences in all of the examined segments are found in the open source repository. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 8. Examples of the informative residues distribution with respect to TMS and non-TMS. https://doi.org/10.1371/journal.pone.0227683.t008 Table 9 presents a pairwise comparison between informative positions in TMS and non-TMS regions. The sequences in amino acid, anion, cation, sugar, and other, substrate classes have significantly more informative positions in the TMS regions compared with non-TMS regions. However, the difference is not significant in the sequences that belong to protein/mRNA and electron substrates. This could be because these two classes have fewer TMSs per sequence, and TCS did not capture their conservation. The same difference in significance positions close to TMSs and positions far from TMSs is found, as shown in Table 10. In contrast, there is no difference between classes for interior and exterior segments of the TMSs, as presented in Table 11. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 9. Statistical analysis of informative position rates in TMS and non-TMS regions. https://doi.org/10.1371/journal.pone.0227683.t009 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 10. Statistical analysis of informative position rates close to TMS regions and far from TMS regions. https://doi.org/10.1371/journal.pone.0227683.t010 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 11. Statistical analysis of informative position rates in the interior TMS regions and exterior TMS regions. https://doi.org/10.1371/journal.pone.0227683.t011 Methods evaluation The performance of the nine methods was evaluated using five-fold cross-validation. Table 2 presents the overall accuracy and MCC of the SVM models for the nine methods, sorted from the best to the worst MCC. Details of the performance for each method is available in S1 File; the comparison between the different methods among the seven classes in MCC, accuracy, sensitivity, and specificity is presented in Fig 2. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 2. Performance of different methods. (a) MCC, (b) accuracy, (c) sensitivity, and (d) specificity. The dotted line represent the performance of TranCEP, which is TCS-TMC-PAAC. https://doi.org/10.1371/journal.pone.0227683.g002 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 2. Overall cross-validation performance of the methods. https://doi.org/10.1371/journal.pone.0227683.t002 All the SVM models that utilized evolutionary data notably performed better than the SVM models that did not. In addition, the top two models, TMC-TCS-PAAC and TMC-TCS-AAC, integrated positional and evolutionary information. The highest MCC was obtained by TMC-TCS-PAAC, which is the method chosen for our predictor TranCEP. This method incorporates the use of PAAC with evolutionary data in the form of MSA with positional information, in which columns that have a reliability below 4 are filtered out. We found that the performance peaked using this threshold and started to decline when filtering columns with a higher reliability index. The TMC-TCS-PAAC method yielded an overall MCC of 0.63. The detailed performance of TMC-TCS-PAAC on five-fold cross-validation is presented in Table 3, The MCC was 0.82, 0.61, 0.62, 0.58, 0.46, 0.52, and 0.69 for amino acid, anion, cation, electron, protein/mRNA, sugar, and other, respectively. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 3. Cross validation performance for TMC-TCS-PAAC. https://doi.org/10.1371/journal.pone.0227683.t003 The performance evaluation using LOOCV and independent testing prediction yielded similar patterns (available in S2 and S3 Files, respectively). A comparison of the results obtained by five-fold cross-validation, LOOCV, and the independent testing is presented in Fig 3. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 3. Methods overall performance. (a) overall accuracy, and (b) overall MCC. The asterisk (*) symbol refers to the performance of TranCEP, which is TCS-TMC-PAAC. https://doi.org/10.1371/journal.pone.0227683.g003 Table 4 shows the confusion matrix for TranCEP when run on the independent test set. Most of the confusion occurred between a substrate class and the class other. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 4. Confusion matrix for TranCEP. https://doi.org/10.1371/journal.pone.0227683.t004 Comparison Table 5 compares the performance of TranCEP to TrSSP [10] on the independent test set. TranCEP scored higher for all the substrate classes in terms of accuracy, specificity, and MCC. For the true positive rate as measured by sensitivity, TranCEP performed better on the cation and other classes, and TrSSP either matched or outperformed TranCEP on the other five classes. In TranCEP, the overall accuracy was computed as in Eq 14, and the MCC as in Eq 16. The performance of TrSSP was computed as macro-average accuracy and macro-average MCC. When we adopted the same method as TrSSP, the macro-average accuracy of TranCEP is 91.23% and the macro-average MCC is 0.69. Interestingly, the MCC score was not effected when using the overall and the macro-average, which further highlights its robustness when dealing with imbalanced data. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 5. Comparing TranCEP and TrSSP. https://doi.org/10.1371/journal.pone.0227683.t005 Overall, TranCEP obtained an MCC of 0.69 which is 68% higher than that of TrSSP. Impact of factors Table 6 shows the impact of evolutionary information and positional information on the overall MCC from the five-fold cross-validation. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 6. Impact of factors on performance. https://doi.org/10.1371/journal.pone.0227683.t006 The use of evolutionary information in form of MSA on the composition encodings AAC, PAAC, and PseAAC improved the MCC by an average of 104% with the highest improvement being on AAC by 127%. The further use of positional information by filtering out the unreliable columns from the MSA showed an average improvement of 132% to the baseline compositions. The impact of positional information over that already achieved by evolutionary information was an average of 13%. The highest impact is on TMC-PAAC, where the MCC improved by 24% in TMC-TCS-PAAC. It is difficult to isolate the exact residues that are key to inferring the substrate class; the results suggest that evolutionary information, obtained by MSA, is the main source of achieving high prediction performance. In addition, the TCS informative positions (with TCS score ≥ 4) can help to filter out the unnecessary noise and get a clearer signal to further improve the prediction. Using TCS informative positions filtered out an average of (35% ± 7%) of the sequence. However, when we tried to filter out more positions (by taking a stricter the TCS score cut-off), the performance started to deteriorate. To visualize the informative positions relative to the hydropathy scale of amino acids, the hydropathy scale proposed by [57] was utilized and the average hydropathy of each column in the MSA was computed. Higher positive scores indicate that amino acids in that region have hydrophobic properties and are likely to be located in a transmembrane alpha-helix segment. The TCS score of each column in the alignment is noted on the hydropathy plot through color coding. Fig 4 shows diverse examples. The red shades correspond to the informative columns (TCS score ≥ 4), while the gray and white shades correspond to non-informative columns that are filtered out by TranCEP. In Fig 4(a) and 4(b), the regions with high positive average hydropathy values appear to be more informative compared to the ones with lower values. However, in Fig 4(c) and 4(d), the difference between the informative positions with high and low hydropathy values is not as clear. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 4. Average Kyte-Doolittle hydropathy of the MSA with TCS scores. The figure indicates that the columns highlighted in red are informative and used by TranCEP. The TranCEP considers a column to be informative if it has a TCS score of at least 4 (shades of red), and filters out the other columns (gray and white). In (a) P39515 contains 158 residues, and the alignment of P39515 with other homologous sequences has 1203 columns, only 128 of them are informative (highlighted in shades of red). In (b) P10870 contains 884 residues, and the alignment of P10870 with other homologous sequences has 1312 columns, only 402 of them are informative. In (c) P11551 contains 438 residues, and the alignment of P11551 with other homologous sequences has 1753 columns, only 343 of them are informative. In (d) P0AAT4 contains 415 residues, and the alignment of P0AAT4 with other homologous sequences has 1653 columns, only 274 of them are informative. https://doi.org/10.1371/journal.pone.0227683.g004 To be able to measure the informative positions relative to different segments of the protein sequence, we divided the protein sequence positions into those in the TMS and those not in the TMS. Those in the TMS were divided into the interior one-third positions, and the remaining exterior positions in the TMS. The non-TMS positions were divided into those near a TMS, that is, within 10 positions, and the remaining positions far from a TMS. The location of the TMS was predicted using TOPCONS2 [58] to predict the α-helical TMS and PRED-TMBB2 [59] for β-barrel TMS. Table 7 shows a breakdown of where the informative positions, as determined by TCS, are located with respect TMS regions. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 7. Positional information. https://doi.org/10.1371/journal.pone.0227683.t007 For instance, in Fig 4(b), 45.47% of the residues of the sequence with UniProt-ID P10970 are informative (i.e., correspond to informative columns in the alignment); thus 54.53% of this sequence is filtered out. In this case, the residues in the TMSs of this protein are indeed more informative, where 96.42% of them are informative. On the other hand, only 25% of the residues on non-TMSs are informative. However, this does not always hold true as in the sequence with UniProt-ID P0AAT4 in Fig 4(d), for example. Details of the sequences in the figure are presented in Table 8. Details of all the sequences in all of the examined segments are found in the open source repository. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 8. Examples of the informative residues distribution with respect to TMS and non-TMS. https://doi.org/10.1371/journal.pone.0227683.t008 Table 9 presents a pairwise comparison between informative positions in TMS and non-TMS regions. The sequences in amino acid, anion, cation, sugar, and other, substrate classes have significantly more informative positions in the TMS regions compared with non-TMS regions. However, the difference is not significant in the sequences that belong to protein/mRNA and electron substrates. This could be because these two classes have fewer TMSs per sequence, and TCS did not capture their conservation. The same difference in significance positions close to TMSs and positions far from TMSs is found, as shown in Table 10. In contrast, there is no difference between classes for interior and exterior segments of the TMSs, as presented in Table 11. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 9. Statistical analysis of informative position rates in TMS and non-TMS regions. https://doi.org/10.1371/journal.pone.0227683.t009 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 10. Statistical analysis of informative position rates close to TMS regions and far from TMS regions. https://doi.org/10.1371/journal.pone.0227683.t010 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 11. Statistical analysis of informative position rates in the interior TMS regions and exterior TMS regions. https://doi.org/10.1371/journal.pone.0227683.t011 Conclusion We have developed a novel method TranCEP for de novo prediction of substrates for membrane transporter proteins that combines information based on amino acid composition, evolutionary information, and positional information. TranCEP incorporates first, the use of evolutionary information taking 120 similar sequences and constructing a multiple sequence alignment using TM-Coffee, second, the use of positional information by filtering to reliable positions as determined by TCS, and third, the use of pair amino acid composition. TranCEP achieved an overall MCC of 0.69, which is a 68% improvement over the state-of-the-art method TrSSP that uses the primary sequence alone. In addition, we evaluated the impact on performance of each factor: incorporating evolutionary information and filtering the unreliable positions. We observed that using amino acid composition alone does not obtain strong performance. The enhanced performance came mainly from incorporating evolutionary and positional information. We learned that certain positions in the alignment have greater significance, and identifying them helped to boost the performance. Supporting information S1 File. Five-fold cross-validation performance details. https://doi.org/10.1371/journal.pone.0227683.s001 (PDF) S2 File. Leave-one-out cross-validation performance details. https://doi.org/10.1371/journal.pone.0227683.s002 (PDF) S3 File. Independent testing performance details. https://doi.org/10.1371/journal.pone.0227683.s003 (PDF) S1 Table. Manuscript tables (.xlsx). https://doi.org/10.1371/journal.pone.0227683.s004 (XLSX) S2 Table. Training data (.xlsx). https://doi.org/10.1371/journal.pone.0227683.s005 (XLSX) S3 Table. Testing data (.xlsx). https://doi.org/10.1371/journal.pone.0227683.s006 (XLSX) TI - TranCEP: Predicting the substrate class of transmembrane transport proteins using compositional, evolutionary, and positional information JF - PLoS ONE DO - 10.1371/journal.pone.0227683 DA - 2020-01-14 UR - https://www.deepdyve.com/lp/public-library-of-science-plos-journal/trancep-predicting-the-substrate-class-of-transmembrane-transport-ExdeLcSWIg SP - e0227683 VL - 15 IS - 1 DP - DeepDyve ER -