Abstract
In the establishment and maintenance of the interaction between pathogenic or symbiotic bacteria and a eukaryotic organism, effectors (protein substrates of specialized bacterial secretion systems) play a critical role once translocated into the host cell. Proteins are also secreted to the extracellular medium by free-living bacteria, or directly injected into competing organisms to hinder or kill them. In this work, we explore an approach based on the evolutionary dependence that most effectors maintain with their specific secretion system: it analyzes the co-occurrence of any orthologous protein group and its corresponding secretion system across multiple genomes. We compared and complemented our methodology with sequence-based machine learning prediction tools for the type III, IV and VI secretion systems. Finally, we provide the predictive results for the three secretion systems in 1606 complete genomes at http://www.iib.unsam.edu.ar/orgsissec/.
Keywords: phylogenetic profile, protein secretion prediction, T3SS, T4SS, T6SS, machine learning
Introduction
Bacteria have several secretion systems (SSs) for delivering proteins into membranes, the periplasmic space or the extracellular medium. Some of these SSs translocate proteins into interacting cells. Such is the case in interactions of symbiotic or pathogenic bacteria with eukaryotic hosts, and in bacteria–bacteria interactions. Translocated proteins that affect the metabolism of the host cell and have a role in the interaction process are called ‘effectors’. Translocation of effector proteins has been demonstrated for the type III, IV and VI secretion systems [1–5]. The identification of effector proteins is crucial to understanding the molecular bases of the interaction. A number of experimental and bioinformatic approaches have been taken to identify effectors. One experimental approach used to identify putative effector proteins is culture supernatant screening [6, 7].
At present, this proteomic analysis cannot identify effectors secreted in low abundance, those that are only translocated into the host cell, or those that were not expressed. Translocated proteins can also be experimentally identified by a generalized translocation analysis with tagged proteins [8, 9]. The screening of mutant organisms with reduced virulence phenotypes as a tool for the identification of effectors may be hampered by redundancy among the translocated proteins that target the host cells. Bioinformatic approaches principally comprise two methods. One is based on amino acid sequence composition analysis [10–15]. This method was successfully applied to proteins translocated by the type III secretion system (T3SS), which translocates proteins from the bacterial cytoplasm into the host cell. An analysis of the amino acid sequences of experimentally identified proteins that are translocated through the T3SS of Pseudomonas revealed shared characteristics in the N-terminal sequence of the proteins: greater than 10% serine within the first 50 amino acids, an aliphatic residue or proline at position 3 or 4, and a lack of acidic amino acids within the first 12 residues [6, 10, 11]. The method could also be applied to proteins secreted by the T3SS of other bacterial species. However, not all the proteins secreted by the T3SS exhibit these characteristics. While the proportion is high for Pseudomonas, it declines for other bacterial species. Prediction programs based on machine learning techniques, such as EffectiveT3, TEREE or the modlab’s T3SS effector prediction, were developed for sets of T3SS effectors [12, 14, 15]. Similarly, TEREE could be efficiently applied to Pseudomonas and Ralstonia but not to Salmonella.
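The three N-terminal criteria reported for Pseudomonas T3SS substrates can be written as a simple rule-based check. The following is only a minimal sketch: the function name is hypothetical, the set of residues treated as aliphatic (A, I, L, V) is an assumption, and positions are counted from 1 with the initial methionine included, which the text does not specify.

```python
# Assumption: Ala, Ile, Leu and Val are treated as the aliphatic residues.
ALIPHATIC_OR_PROLINE = set("AILVP")
ACIDIC = set("DE")


def has_t3ss_nterm_pattern(seq: str) -> bool:
    """Return True if a protein sequence matches the three N-terminal criteria."""
    seq = seq.upper()
    if len(seq) < 50:
        return False  # too short to evaluate the first 50 residues
    # Criterion 1: more than 10% serine within the first 50 amino acids.
    serine_fraction = seq[:50].count("S") / 50
    # Criterion 2: aliphatic residue or proline at position 3 or 4 (1-based).
    pos3_or_4 = seq[2] in ALIPHATIC_OR_PROLINE or seq[3] in ALIPHATIC_OR_PROLINE
    # Criterion 3: no acidic residue within the first 12 residues.
    no_acidic = not any(res in ACIDIC for res in seq[:12])
    return serine_fraction > 0.10 and pos3_or_4 and no_acidic
```

As the text notes, many validated T3SS substrates outside Pseudomonas fail these criteria, so a rule like this is a coarse filter and not a classifier.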
For the type IV secretion system (T4SS), a translocation signal was proposed to reside in clusters of hydrophobic or positively charged amino acids of the C-terminal, while substrate recognition was mediated by a combination of a C-terminal signal together with additional intrinsic motifs and other cellular factors. Recent efforts have shown that the state-of-the-art performance could be surpassed by majority voting ensembles of classifiers based on a high number of feature extracting strategies [4, 17–19]. T4SS is a highly diversified and distributed complex, both in function and composition. It can be found in >70% of all complete genomes, being virB4 the only ubiquitous protein harboring 116 families of non-core proteins conforming eight T4SS subclasses . The T4SS can be dedicated to DNA transfer, protein translocation or both. Even when dedicated to secretion, effectors secreted by the T4SS-A and T4SS-B are detected by different signal sequences. In sharp contrast, no characteristic amino acid pattern has been identified for the type VI secretion system (T6SS), which translocates proteins into bacterial and eukaryotic host cells. Another bioinformatic method is based on a promoter sequence evaluation [11, 20, 21]. The method assumes that there is a concerted activation of the secretion system components and effectors. This methodology can be applied if specific transcription binding sites are known and are detectable in the genome sequence. Prediction based on co-regulation allows the identification of several effectors in the case of the T3SS . However, effectors that do not share these characteristics in the sequence box of their promoter regions were also identified for the T3SS of Pseudomonas [10, 11]. Other approaches involve searching the entire genomes or symbiotic or pathogenic islands for genes that code for proteins with eukaryotic-like domains and identification of homologs to experimentally identified effectors in other bacterial species . 
In the case of the T3SS and T6SS, analysis of genes adjacent to those coding for chaperones or toxin–toxin inhibitor pairs, respectively, provides good candidates. No method has proved to be absolute, and the convergence of several strategies is required to make an accurate assessment of the set of effectors translocated or secreted by a particular secretion system. Therefore, with more diverse methods available, a better identification of secreted proteins is possible. SSs and secreted proteins play different roles toward a common goal, the interaction with other organisms. The pool of secreted proteins is highly dependent on the bacterial lifestyle and is under high evolutionary pressure from either their hosts or other competing organisms, which translates into high mutation rates and higher mobility across genomes [22, 23]. In sharp contrast, SSs are coordinated multiprotein complexes that interact with regulatory proteins and act as a hub for multiple secreted proteins, which constrains their evolutionary rates. The analysis of patterns of coevolution between pairs of proteins in completely sequenced genomes has been successfully applied to infer protein interactions and networking [25–27]. In the specific case of translocated proteins, a coevolutionary analysis has been applied to pairs of effectors in Legionella. Evidence of the dependence between two or more proteins is present at many levels. While finding conserved patterns of co-occurrence between proteins in different organisms provides the strongest evidence of dependence, similarities between phylogenetic trees and correlated mutations across residues in or between multiple sequence alignments are also valuable strategies [29–31]. Early studies encoded the presence and absence of a set of homologous proteins into binary vectors and measured the Hamming distance between multiple phylogenetic profiles (PPs).
More complex methods have been developed, incorporating alignment characteristics into the profiles, such as a score based on the BLAST Expected Value, -1/log(E-value). The comparison is done by binning the continuous variables that form the profiles into discrete variables and calculating the mutual information (MI) between them [25, 29]. The aim of the present work is to validate the phylogenetic profiling approach as a tool that can identify effectors based on the coevolutionary dependence between a secreted protein and its respective SS, specifically in the T3SS, T4SS and T6SS, and to compare and complement it with established sequence-based prediction methods. The results of applying this methodology to a variety of relevant organisms are available at http://www.iib.unsam.edu.ar/orgsissec/ in the form of a database, together with a standalone version of the executable code that allows integration with external sequence-based classification tools.
Methodology
Summary workflow
We analyzed 1606 completely sequenced bacterial genomes present in the OMA database, which provides orthologous groups for every gene within its constituent organisms, and generated a binary PP for each gene. Each PP represents the presence or absence of the gene's orthologs across the 1606 organisms. The presence or absence of a given SS was determined in every bacterial genome by combining a specialized SS detection tool and Hidden Markov Models (HMMs). We generated binary SS profiles spanning the 1606 organisms and performed additional optimization on validated intracellular and secreted proteins. We calculated the MI between all the PPs and an SS profile to quantify the dependence of any given protein on an SS. A higher MI content is associated with evolutionary dependence. Open reading frames are sorted in descending order of MI content. This is the simplest classification tool and does not involve machine learning or training strategies.
Additionally, we evaluated the number of elements in the PP and the ratio between the MI score and the average MI score found in profiles with a similar number of elements. We assessed the predictive performance of MI, an MI-based classifier, a sequence-based classifier and MI combined with a sequence-based classification tool on a non-validated negative genomic data set. Figure 1 shows a scheme of the workflow.
Figure 1. Schematic representation of the complete computational workflow.
Data set construction
Experimentally validated sequences of proteins secreted by the T3SS, T4SS and T6SS were obtained from SecretEPDB. Redundant proteins that clustered at 40% identity were removed using uclust.
Nonredundant, naive data set
Reviewed intracellular sequences were retrieved from Uniprot by searching: ‘NOT goa:(“extracellular region ”) taxonomy:“Bacteria ” AND reviewed: yes’. In each data set, redundant proteins that clustered at 40% identity were removed using uclust. It has already been reported that the frequency of gene families with a given number of orthologs fits Zipf’s distribution, with a peak at the tail representing essential genes (Supplementary Figure S1). This behavior was reproduced when we computed the frequency table of the number of orthologs in the complete OMA database; in contrast, genes with few orthologs were underrepresented in the negative Uniprot subset. The linear regression of the frequencies was used to fit the negative subset to the observed OMA distribution by recursively removing elements from the negative subset and maximizing the coefficient of determination of the linear regression (R2).
The adjustment of the distribution was performed to increase the similarities between the negative data set and the working conditions of the classifier; this data set was used to perform the evaluation and adjustment of the SS profile, and to train a T6SS sequence-based classifier. Nonredundant genomic data set For each SS, the sequences found in genomes that contained experimentally validated secreted proteins were assumed to be non-secreted, while validated secreted proteins were removed. The data set was reduced in redundancy by clustering to a 40% identity using usearch. This data set represents the working conditions of the classifier, where the complete proteome with an active SS is screened for putative secreted proteins. The three secreted subsets (corresponding to T3SS, T4SS and T6SS) and two not secreted subsets were combined into six highly skewed data sets in favor of the not secreted proteins. Reference genome database and PP construction The protein sequences and pairwise orthologs tables of the OMA database corresponding to the 2016 release were downloaded from ‘http://omabrowser.org/oma/current/’. The pairwise ortholog table was used to generate a binary matrix of 1606 bacterial organisms and 4.8 million genes, where each column contains the presence or absence pattern of orthologs to a specific gene, reconstructing its PP among every organism. Redundancy among the genomes of reference organisms is a common problem that leads to suboptimal performance when applying phylogenetic profiling techniques. However, a previous study showed no indication of this behavior when using OMA as a source of orthologous genes . Detection of SS in reference genomes, construction and optimization of the SS profile To determine the dependence between any PP and a SS we constructed a binary vector that codified the existence of the SS in every reference genome. Two methods were combined to define if an organism was considered as a positive or negative carrier of a SS. 
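The construction of binary PPs from a pairwise ortholog table can be sketched as follows. This is an illustrative sketch only: the function name and input structures are hypothetical (the actual OMA file layout is not reproduced here), and marking a gene as present in its own genome is an assumption.

```python
from collections import defaultdict


def build_profiles(ortholog_pairs, gene_to_genome, genomes):
    """Return {gene: [0/1 per genome]} from pairwise ortholog relations.

    ortholog_pairs: iterable of (gene_a, gene_b) tuples (hypothetical layout).
    gene_to_genome: dict mapping every gene to the genome that encodes it.
    genomes: ordered list of genome identifiers (the profile positions).
    """
    position = {genome: i for i, genome in enumerate(genomes)}
    genomes_with_ortholog = defaultdict(set)
    for gene_a, gene_b in ortholog_pairs:
        # Orthology is symmetric: record the partner's genome for both genes.
        genomes_with_ortholog[gene_a].add(gene_to_genome[gene_b])
        genomes_with_ortholog[gene_b].add(gene_to_genome[gene_a])

    profiles = {}
    for gene, own_genome in gene_to_genome.items():
        profile = [0] * len(genomes)
        profile[position[own_genome]] = 1  # assumption: a gene counts in its own genome
        for genome in genomes_with_ortholog[gene]:
            profile[position[genome]] = 1
        profiles[gene] = profile
    return profiles
```

For the scale described in the text (1606 genomes and 4.8 million genes), a sparse or bit-packed representation would likely be needed; plain lists are used here only for readability.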
We used the MacSyFinder software with a specific set of HMMs and rules designed to detect active SSs in complete bacterial genomes [37, 38]. Preliminary observations indicated that secreted proteins could still be found in organisms with an incomplete SS, which led us to integrate the MacSyFinder results with a more flexible method. The number of SS elements in each reference bacterial genome was determined by using a pool of HMMs from TIGRFAM, Pfam and the literature [38–40], with each HMM representing specific elements of the SS complex (Supplementary Table S1). HMMER 3.1b1 (May 2013; http://hmmer.org/) was used to map the models onto the genomic database. Hits with scores above the trusted cutoffs included in the curated models and with an E-value lower than 1e-20 were considered positive. Because some organisms harbor multiple copies of the same SS, each positive hit was counted once per HMM regardless of the number of matching elements in the genome, and each gene was allowed no more than one HMM hit to account for redundancy among HMMs; the complete analysis can be found in Supplementary Table S2. We determined the minimum number of SS HMMs by iteratively increasing the threshold that separated SS-positive from SS-negative organisms. In each iteration we generated an SS profile and calculated the predictive performance on the nonredundant naive data set, using the MI score as a classification tool. The selected optimal threshold was the one that maximized the area under the receiver operating characteristic curve (AROC). Finally, the existence of an active or incomplete T3SS, T4SS and T6SS was determined for every genome of the OMA database. An organism was considered SS positive if it had been detected by MacSyFinder or if it had exceeded a threshold number of SS-associated HMM hits. To reduce the complexity of the analysis we did not explore the SS subtypes.
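The counting and decision rules described above can be sketched in a few lines. The names and the hit tuple layout are hypothetical, and two details are assumptions because the text leaves them open: when a gene matches several HMMs the highest-scoring hit is kept, and the threshold is treated as a minimum (greater than or equal), following the wording of Table 1.

```python
def count_ss_elements(hits, e_value_cutoff=1e-20):
    """Count distinct SS HMMs with at least one positive hit in a genome.

    hits: iterable of (gene_id, hmm_name, score, trusted_cutoff, e_value).
    """
    best_hit_per_gene = {}
    for gene, hmm, score, trusted_cutoff, e_value in hits:
        # A hit is positive only above the trusted cutoff and below the E-value cutoff.
        if score <= trusted_cutoff or e_value >= e_value_cutoff:
            continue
        # Each gene is allowed a single HMM hit (assumption: keep the best score).
        if gene not in best_hit_per_gene or score > best_hit_per_gene[gene][1]:
            best_hit_per_gene[gene] = (hmm, score)
    # Each HMM is counted once, however many genes match it.
    return len({hmm for hmm, _ in best_hit_per_gene.values()})


def is_ss_positive(macsyfinder_detected, n_elements, min_elements):
    """Combine the MacSyFinder call with the HMM-count threshold."""
    return macsyfinder_detected or n_elements >= min_elements
```

Applying `is_ss_positive` to every reference genome yields the binary SS profile against which each PP is compared.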
Score calculation and metrics
Each column of the binary matrix can be compared with an SS profile, and for each comparison we generated three sequence-independent metrics.
Mutual information
Measures the dependence between the vector representing all the orthologs of a gene and the presence of the SS in every genome, quantifying the evolutionary pressure that secreted proteins and SSs maintain.
$$\mathrm{MI} = \log_2 \frac{p(a, b)}{p(a) \cdot p(b)}$$
Here p(a, b) represents the joint probability of co-occurrence for the orthologs of ORF a and ORF b in the same genome, p(a) represents the occurrence probability of the orthologs of ORF a and p(b) represents the occurrence probability of the orthologs of ORF b.
Total number of orthologs in the PPs
PPs with a low number of elements are biased toward high MI scores, because they are present in evolutionarily close clades where the SS profiles are less likely to change. The total number of orthologs in a PP represents the number of observations that support the MI value.
MI score to mean MI ratio
We binned PPs containing a ±10% number of orthologs and calculated the average MI of each bin. Then, the ratio between the MI score of a profile and the average MI score of the corresponding bin was calculated. While a combination of MI score and a given number of orthologs in a profile would be considered relevant in an organism with few closely related sequenced genomes, this might not remain valid when evaluating organisms with a higher number of closely related sequenced genomes. This metric infers the relevance of a profile by comparison with profiles with a similar number of observations.
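The two MI-derived metrics can be sketched as below. This is a hedged illustration with hypothetical function names: the MI function sums the log-ratio term over the four presence/absence states weighted by the joint probability (the standard definition for two binary variables), which is an assumption about how the single-term expression in the text is applied; the bin for the ratio is taken as all profiles whose ortholog count lies within ±10% of the target.

```python
from math import log2


def mutual_information(profile, ss_profile):
    """MI in bits between a binary PP and a binary SS profile of equal length."""
    n = len(profile)
    mi = 0.0
    for a in (0, 1):
        p_a = sum(1 for x in profile if x == a) / n
        for b in (0, 1):
            p_b = sum(1 for y in ss_profile if y == b) / n
            p_ab = sum(1 for x, y in zip(profile, ss_profile) if x == a and y == b) / n
            if p_ab > 0:  # empty states contribute nothing
                mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def mi_to_mean_ratio(n_orthologs, mi, all_profiles):
    """Ratio of a profile's MI to the mean MI of profiles of similar size.

    all_profiles: list of (n_orthologs, mi) pairs for every PP.
    """
    peers = [m for k, m in all_profiles if 0.9 * n_orthologs <= k <= 1.1 * n_orthologs]
    if not peers:
        return float("nan")
    mean_mi = sum(peers) / len(peers)
    return mi / mean_mi if mean_mi > 0 else float("nan")
```

A profile identical to a balanced SS profile reaches 1 bit, while a statistically independent profile scores 0.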
True positive rates (TPRs) and false positive rates (FPRs) of secreted proteins are data set dependent and were calculated as:
$$\mathrm{TPR} = \frac{\text{True Positives}}{\text{Total Positives}}, \qquad \mathrm{FPR} = \frac{\text{False Positives}}{\text{Total Negatives}}$$
Performance measure and AROC curve
The receiver operating characteristic (ROC) curve is generated by plotting the TPR on the Y axis and the FPR on the X axis at different predictive thresholds. The AROC, i.e. the ‘area under the ROC curve’ or simply the ‘area under the curve’ (AUC), was selected to compare the predictive performance. This indicator has a value of 1 for a perfect prediction and a value of 0.5 for a random classification, and is not affected by skewed or unbalanced data sets. The steepness of the ROC curve can be interpreted as the ability of a classifier to separate positives from negatives at a lower FPR.
Sequence-based analysis
In the case of the T3SS and T4SS, extensive effort has been dedicated to the development of sequence-based classification tools. We applied BPBAac and T4SEpre_bpbAac to the complete OMA database to compare and complement the PP method with the sequence-based one [41, 42]. To our knowledge, no method for the prediction of proteins secreted by the T6SS is currently available. However, it has been noticed that the proteins secreted by the T6SS have a distinctive amino acid composition when compared with other proteins. Based on this observation, we trained a sequence-based classification tool. We used the nonredundant naive data set as negative samples; T6SS secreted proteins were further reduced to 30% redundancy and encoded in a 20² (400)-element vector, where each position represented the number of occurrences of a given bi-gram in the protein sequence. Every bi-gram feature was standardized by subtracting its mean and scaling to unit variance. A support vector machine (SVM) was trained using the nonredundant naive data set as negatives and the nonredundant T6SE data set as positives.
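The bi-gram encoding used as input for the T6SS classifier can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code: it assumes the 20 standard amino acids (so 20 × 20 = 400 bi-grams), counts overlapping bi-grams, skips bi-grams containing non-standard residues, and uses the population variance when standardizing; all function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard residues only
BIGRAMS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
BIGRAM_INDEX = {bigram: i for i, bigram in enumerate(BIGRAMS)}


def bigram_counts(seq):
    """Count overlapping bi-grams of a protein sequence into a 400-element vector."""
    seq = seq.upper()
    vector = [0] * len(BIGRAMS)
    for i in range(len(seq) - 1):
        index = BIGRAM_INDEX.get(seq[i:i + 2])
        if index is not None:  # bi-grams with non-standard residues are skipped
            vector[index] += 1
    return vector


def standardize(rows):
    """Scale each column to zero mean and unit variance (constant columns become 0)."""
    n = len(rows)
    scaled_columns = []
    for column in zip(*rows):
        mean = sum(column) / n
        std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
        scaled_columns.append([(x - mean) / std if std > 0 else 0.0 for x in column])
    return [list(row) for row in zip(*scaled_columns)]
```

The standardized rows would then be passed to an SVM; in scikit-learn the same scaling is available as `sklearn.preprocessing.StandardScaler`.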
The classifier's performance was evaluated by stratified 5-fold cross-validation and resulted in a mean AUC of 0.87. This classifier was implemented with the Python module sklearn.svm.SVC, using a radial basis function kernel. The data set was balanced by artificially replicating the positive class to account for its skewed composition.
Models, comparison and complementation
Four prediction models were compared and their performance was evaluated for each SS, using the corresponding nonredundant genomic data set as true negatives and the secreted effector data set as true positives.
MI classification
Each protein sequence was scored using the MI score.
Sequence-based ML classifier
For the T3SS and T4SS we applied the available classification tools directly to the data set [41, 42]. In the case of the T6SS, if any sequence in the evaluation set had been used for training the classification tool, we replaced its score with the cross-validation score to provide an accurate performance measure.
Phylogenetic profiling and machine learning
To integrate the generated metrics we used the MI score, the number of orthologs in the profile and the MI score to mean MI ratio as inputs, and trained a five wide by five deep Multi-Layer Perceptron model to predict secreted proteins.
Phylogenetic profiling and sequence-based machine learning
This approach is similar to phylogenetic profiling and machine learning (PP + ML), but with the addition of the output of a sequence-based classifier as a fourth input value. Multi-Layer Perceptron models were implemented with the Python module sklearn.neural_network.MLPClassifier (scikit-learn 0.18.1), using the default settings. All the input values were preprocessed by subtracting the mean and scaling to unit variance. Performance was evaluated by stratified 5-fold cross-validation. The nonredundant genomic data set was shuffled and split into five subsets with a similar number of positives and negatives.
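The evaluation scheme (five stratified subsets, each held out once, scored by AUC) can be sketched without external libraries. This is a minimal sketch under stated assumptions: function names are hypothetical, labels are 1 for secreted and 0 for non-secreted, and the AUC is computed as the probability that a positive outscores a negative, with ties counted as one half.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with a similar class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

In each cross-validation step, four folds would be concatenated for training and the remaining fold scored with `roc_auc`. The pairwise AUC is quadratic in data set size; scikit-learn's `StratifiedKFold` and `roc_auc_score` are the practical equivalents for large data sets.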
For every cross-validation step, four-fifths of the data set were used for training and the remaining one-fifth was used for validation.
Results
Determination of the optimum number of HMMs for each SS
We optimized the minimum number of HMM hits needed to classify an organism as a carrier of an SS. This was achieved by maximizing the MI predictive performance using the resulting SS profile. As shown in Figure 2, for the T3SS and T4SS, prediction performance increased with the threshold until the threshold could no longer be met by actual SS carriers, at which point performance diminished. The T6SS was only marginally affected by increasing the SS threshold. The last value of every series in the scatter plot represents the plateau of the curve. Further increments in the threshold did not modify the performance, as no organism harbors that number of HMMs and the SS profile is then entirely composed of MacSyFinder-positive genomes. The number of HMM elements that ensures the maximal performance for each SS is indicated in Table 1.
Table 1. Minimal number of SS-HMMs that produced the maximal performance for each SS
Secretion system    Minimum number of SS elements    MI-based performance (AROC)
T3SS                7                                0.953
T4SS                6                                0.95
T6SS                13                               0.993
Figure 2. Variation of MI AUC performance as a function of an iterative increment in the minimal number of SS-HMMs that an organism requires to be considered SS positive.
Development of a sequence prediction tool for T6SS
An SVM was trained on the amino acid sequences of T6SS effectors (positives) and non-secreted proteins (negatives). The bi-gram SVM classifier showed an average AUC of 0.87 in a 5-fold cross-validation analysis (Supplementary Figure S1). The results of the fully trained model were used as input for the phylogenetic profiling and sequence-based machine learning (PP + Seq-ML) T6SS MLP model. When a sequence was common to both classifier data sets, we replaced the prediction value of the fully trained sequence-based classifier with the corresponding result of the cross-validation step, to avoid training the second classifier on over-fitted data.
Prediction performance results for type III, IV and VI secretion systems
Figures 3–5 show the AUC corresponding to each of the four prediction models applied to the T3SS, T4SS and T6SS [MI, sequence-based ML classifier (Seq-ML), PP + ML and PP + Seq-ML].
Figure 3. T3SS predictive performance.
Figure 4. T4SS predictive performance.
Figure 5. T6SS predictive performance.
Using MI scores to classify secreted proteins produced acceptable AUC values ranging from 0.87 to 0.89 and a low ROC slope when compared with other models. We observed that approximately 10% of the PPs present in the validation data set achieved the maximum MI score and contained 80% of the positive profiles. This can be observed in the initial linear phase of the MI ROC curve, as maximum scores cannot be sorted among themselves.
The majority of the PPs of nonsecreted proteins with high MI scores had fewer elements than those of secreted proteins. Additional profile features incorporated in PP + ML increased the steepness of the ROC curve and the overall AUC value in T3SS and T6SS. Finally, the complementation between PP and sequence-based machine learning (PP + Seq-ML) outperformed the other three classification techniques. The comparison of the predictive performance reached for each SS with each applied prediction model is summarized in Table 2.
Table 2. Predictive performance summary
Model         Metric    T3SS     T4SS     T6SS
PP + SeqML    AUC       0.94     0.97     0.94
              TPR       0.8      0.92     0.76
              FPR       0.05     0.03     0.04
MI            AUC       0.89     0.87     0.89
              TPR       0.78     0.82     0.85
              FPR       0.085    0.11     0.13
PP ML         AUC       0.89     0.89     0.91
              TPR       0.7      0.81     0.7
              FPR       0.05     0.1      0.08
Seq ML        AUC       0.85a    0.96a    0.85
              TPR       0.47a    0.89a    0.59
              FPR       0.01a    0.03a    0.09
a These results are not cross-validated.
For each of the analyzed genomes we independently sorted our final results based on the Seq-ML, PP-ML and PP + Seq ML scores.
We analyzed the top 2.5% scoring elements for positive secreted proteins. These results are presented in the form of Venn diagrams in Supplementary Figures S3–S5. We compared the AUC performance of the ‘PP + SeqML’ and ‘PP ML’ methods by running six training iterations for each secretion system. AUC results were analyzed by applying a one-tailed t-test, and the increment in performance obtained with ‘PP + SeqML’ was found to be statistically significant in all the evaluated cases (Supplementary Table S3).
Independent data set
We applied our methodology to two independent curated data sets for the T3SS and two for the T4SS (Tbooster, GenSet, T4EffPred) [43–45]. It should be noted that a fraction of the entries of the independent data sets contained identifiers that could not be linked to the OMA-DB, and as a result we were limited in the number of data sets that could be used. The results can be found in Supplementary Table S4 and Supplementary Figures S6–S9.
Case study
Additionally, we analyzed the top scoring proteins of Yersinia pestis (Supplementary Table S5) and Mesorhizobium loti MAFF303099 (Supplementary Table S6).
Case study 1, T3SS in Y. pestis
As a particular example, we show the top 15 predictions for proteins secreted by the T3SS in the proteome of Y. pestis, a notorious human pathogen. The complete proteome was sorted by two different keys: the sequence-based ‘Seq-ML’ score and the combined sequence and PP ‘PP + Seq-ML’ score. We observed that while the exclusively sequence-based analysis was able to retrieve only two T3SS secreted effectors (T3SE), the combined approach was able to retrieve seven of the 10 characterized T3SE in Y. pestis.
Case study 2, T3SS in Rhizobium loti (M. loti MAFF303099)
Mesorhizobium loti is a nodulating legume symbiont that uses its T3SE to modulate the plant–bacterial interaction [21, 46].
We have previously characterized and experimentally validated some of the eight proteins putatively secreted by M. loti T3SS [21, 46]. Three of the validated proteins lack the characteristic signal sequence in the annotated N-terminal end (Mlr6358, Mlr6316 and Mlr6331 proteins) and were not recognized by the sequence-based classifier; however, the complemented method inferred the correct classification for Mlr6358 and Mlr6331. As in the first case, we compared the sequence-based analysis with the sequence and PP complementation. The sequence exclusive method was able to identify two secreted proteins in the top 30 positions, while the sequence and PP complementation located four secreted proteins in the top 6 candidates and five (including Mlr6331 protein RHILO04946) in the top 30 candidates. Discussion Bacterial SS components are dynamic and highly widespread among bacterial organisms. Such adaptations have made them diverse both in sequence and molecular makeup. The use of groups of HMMs provides sequence sensitivity and tolerance to change in SS macromolecular complex composition when compared with profiles generated using sequence alignment tools. While optimizing the minimum number of HMMs that were needed to detect a SS we observed AUC performance values ranging from 0.95 to 0.99 based on the MI score on validated secreted and intracellular proteins. This behavior was also observed in the independent T4EffPred and GenSet data sets. These performance values could not be reproduced on the nonredundant genomic data set, which is a more accurate representation of the working condition of the classifier where all the PPs have at least one event of co-occurrence with the SS. PPs with small number of elements in genomes with an active SS are likely to be present in a clade of similar organisms that also harbor the SS. This generates a bias toward high MI scores. 
In contrast, profiles with small number of elements in genomes without a SS are likely to be present in a clade of similar organisms that also lack the SS and thus generate a bias toward low MI scores. Validated intracellular proteins belong to organisms that may or may not have a SS, while proteins in the genomic data set are by definition present in an organism with a SS. The main advantage of MI is its predictive capability in the evaluated SS and its independence from signal sequences in the substrate that could be either non-conserved or unknown. Additionally, MI between PP and SS profiles can be used as a filter to reduce the size of the search space 10-fold while retaining 80% of the positive proteins. It could also be applied to newly discovered SSs where the few secreted proteins cannot train sequence-based classifiers. In contrast, sequence-based classifiers are bound to the properties that can be extracted from sequences of known secreted proteins. The complementation of MI with additional features of the PP, identified through a small neural network, provided additional features and allowed to both differentiate the most relevant profiles and enhance the AUC performance in all the evaluated SS. Furthermore, the addition of a sequence-based score provides an independent classification method that continues to increase the predictive performance. The complementation of T4SEpre_bpbAac with PPs showed the lowest AUC improvement; the main reason is that a large proportion of the T4SE has low number of orthologs and the high performance of T4SEpre_bpbAac in our evaluation set; however, the initial steepness of the curve increased, allowing for a better discrimination when the high-scoring elements are analyzed. 
It should be noted that the performance values obtained for BPBAac and T4SEpre_bpbAac cannot be interpreted as absolute scores, as we were unable to cross-validate the classifiers and our testing data set contains sequences on which the models had been trained.
MI and PP have intrinsic limitations, such as not being able to differentiate between secreted proteins and other accessory elements related to the functionality of the secretion system itself. However, these false positives are consistent enough to be discarded based on their homology to previously characterized SS constitutive elements. Additionally, the method cannot detect less common effectors that are specific to a single strain and share no homology with others, if they have not diverged significantly from their paralogs. On the other hand, novel effectors could arise by converting a common non-secreted protein into a secreted one through a mechanism called terminal reassortment or through N-terminal elongation with a signal peptide. These instances could not be detected by the PP method.
PP provides a straightforward genome-wide approach for the detection of protein–protein interactions. Our study demonstrates that the methodology can be extended to most of the secreted proteins and their corresponding SS. This study adds a much-needed new dimension to the protein secretion classification problem: taxonomically unbiased and based on the concept of genome evolution, PP can be applied to the prediction of proteins secreted by different SSs, even in less-characterized organisms. It is well established that bioinformatic predictions are more reliable when multiple approaches with different fundamentals are combined. Here we add a novel and universal technique to the current repertoire that can improve our understanding of protein secretion.
Key Points
Phylogenetic profiling can be used to infer protein secretion.
Neural networks can be used to interpret PPs.
Phylogenetic profiles and sequence-based approaches can be combined for enhancing protein secretion prediction.
Funding
The Agencia Nacional de Promoción Científica y Tecnológica of Argentina (ANPCyT) (PICT Raices 2011-1212).
Zalguizuri Andrés is a doctoral fellow at the Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, Argentina. He is undertaking a bioinformatics project focused on computational and experimental analysis of proteins secreted by bacterial secretion systems.
Caetano-Anollés Gustavo is a Researcher and Professor in the Department of Crop Sciences, University of Illinois, USA, whose current research focuses on evolutionary genomics, systems biology and synthetic biology.
Lepek Viviana Claudia is a Researcher and Professor in the Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, whose current research focuses on bacterial secretion systems.
References
1 Fauvart M, Michiels J. Rhizobial secreted proteins as determinants of host specificity in the rhizobium-legume symbiosis. FEMS Microbiol Lett 2008;285(1):1–9.
2 Records AR. The type VI secretion system: a multipurpose delivery system with a phage-like machinery. Mol Plant Microbe Interact 2011;24(7):751–7.
3 Hayes CS, Aoki SK, Low DA. Bacterial contact-dependent delivery systems. Annu Rev Genet 2010;44:71–90.
4 Alvarez-Martinez CE, Christie PJ. Biological diversity of prokaryotic type IV secretion systems. Microbiol Mol Biol Rev 2009;73(4):775–808.
5 Arnold R, Jehl A, Rattei T. Targeting effectors: the molecular recognition of Type III secreted proteins. Microbes Infect 2010;12(5):346–58.
6 Guttman DS, Vinatzer BA, Sarkar SF, et al. A functional screen for the type III (Hrp) secretome of the plant pathogen Pseudomonas syringae. Science 2002;295(5560):1722–6.
7 Hempel J, Zehner S, Göttfert M, et al. Analysis of the secretome of the soybean symbiont Bradyrhizobium japonicum. J Biotechnol 2009;140(1–2):51–8.
8 Chang JH, Urbach JM, Law TF, et al. A high-throughput near-saturating screen for type III effector genes from Pseudomonas syringae. Proc Natl Acad Sci USA 2005;102(7):2549–54.
9 Mukaihara T, Tamura N, Iwabuchi M. Genome-wide identification of a large repertoire of Ralstonia solanacearum type III effector proteins by a new functional screen. Mol Plant Microbe Interact 2010;23(3):251–62.
10 Petnicki-Ocwieja T, Schneider DJ, Tam VC, et al. Genomewide identification of proteins secreted by the Hrp type III protein secretion system of Pseudomonas syringae pv. tomato DC3000. Proc Natl Acad Sci USA 2002;99(11):7652–7.
11 Schechter LM, Vencato M, Jordan KL, et al. Multiple approaches to a complete inventory of Pseudomonas syringae pv. tomato DC3000 type III secretion system effector proteins. Mol Plant Microbe Interact 2006;19(11):1180–92.
12 Arnold R, Brandmaier S, Kleine F, et al. Correction: sequence-based prediction of type III secreted proteins. PLoS Pathog 2009;5(4):e1000376.
13 Yang Y, Zhao J, Morgan RL, et al. Computational prediction of type III secreted proteins from gram-negative bacteria. BMC Bioinformatics 2010;11(Suppl 1):S47.
14 Schechter LM, Valenta JC, Schneider DJ, et al. Functional and computational analysis of amino acid patterns predictive of type III secretion system substrates in Pseudomonas syringae. PLoS One 2012;7(4):e36038.
15 Jehl M-A, Arnold R, Rattei T. Effective–a database of predicted secreted bacterial proteins. Nucleic Acids Res 2011;39:D591–5.
16 Samudrala R, Heffron F, McDermott JE, et al. Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems. PLoS Pathog 2009;5(4):e1000375.
17 Burstein D, Zusman T, Degtyar E, et al. Genome-scale identification of Legionella pneumophila effectors using a machine learning approach. PLoS Pathog 2009;5(7):e1000508. doi: 10.1371/journal.ppat.1000508
18 Lifshitz Z, Burstein D, Peeri M, et al. Computational modeling and experimental validation of the Legionella and Coxiella virulence-related type-IVB secretion signal. Proc Natl Acad Sci USA 2013;110(8):E707–15. doi: 10.1073/pnas.1215278110
19 Wang J, Yang B, An Y, et al. Systematic analysis and prediction of type IV secreted effector proteins by machine learning approaches. Brief Bioinform 2017, doi: 10.1093/bib/bbx164.
20 Kimbrel JA, Thomas WJ, Jiang Y, et al. Mutualistic co-evolution of type III effector genes in Sinorhizobium fredii and Bradyrhizobium japonicum. PLoS Pathog 2013;9(2):e1003204.
21 Sánchez C, Iannino F, Deakin WJ, et al. Characterization of the Mesorhizobium loti MAFF303099 type-three protein secretion system. Mol Plant Microbe Interact 2009;22(5):519–28.
22 Nogueira T, Touchon M, Rocha EPC. Rapid evolution of the sequences and gene repertoires of secreted proteins in bacteria. PLoS One 2012;7(11):e49403.
23 Nogueira T, Rankin DJ, Touchon M, et al. Horizontal gene transfer of the secretome drives the evolution of bacterial cooperation and virulence. Curr Biol 2009;19(20):1683–91.
24 Batada NN, Hurst LD, Tyers M. Evolutionary and physiological importance of hub proteins. PLoS Comput Biol 2006;2(7):e88.
25 Date SV, Marcotte EM. Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotechnol 2003;21(9):1055–62.
26 Peregrin-Alvarez JM. The phylogenetic extent of metabolic enzymes and pathways. Genome Res 2003;13(3):422–7.
27 von Mering C, Huynen M, Jaeggi D, et al. STRING: a database of predicted functional associations between proteins. Nucleic Acids Res 2003;31(1):258–61.
28 Burstein D, Amaro F, Zusman T, et al. Genomic analysis of 38 Legionella species identifies large and diverse effector repertoires. Nat Genet 2016;48(2):167–75. doi: 10.1038/ng.3481
29 Pellegrini M, Marcotte EM, Thompson MJ, et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999;96(8):4285–8.
30 Pazos F, Valencia A. Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng 2001;14(9):609–14.
31 Pazos F, Helmer-Citterich M, Ausiello G, et al. Correlated mutations contain information about protein-protein interaction. J Mol Biol 1997;271(4):511–23.
32 Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.
33 An Y, Wang J, Li C, et al. SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems. Sci Rep 2017;7:41031.
34 Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010;26(19):2460–1.
35 Qian J, Luscombe NM, Gerstein M. Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J Mol Biol 2001;313(4):673–81.
36 Škunca N, Dessimoz C. Phylogenetic profiling: how much input data is enough? PLoS One 2015;10:e0114701.
37 Abby SS, Néron B, Ménager H, et al. MacSyFinder: a program to mine genomes for molecular systems with an application to CRISPR-Cas systems. PLoS One 2014;9(10):e110726.
38 Abby SS, Cury J, Guglielmini J, et al. Identification of protein secretion systems in bacterial genomes. Sci Rep 2016;6(1):23080.
39 Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res 2003;31(1):371–3.
40 Bateman A, Birney E, Durbin R, et al. The Pfam protein families database. Nucleic Acids Res 2000;28(1):263–6.
41 Wang Y, Zhang Q, Sun M-A, et al. High-accuracy prediction of bacterial type III secreted effectors based on position-specific amino acid composition profiles. Bioinformatics 2011;27(6):777–84.
42 Wang Y, Wei X, Bao H, et al. Prediction of bacterial type IV secreted effectors by C-terminal features. BMC Genomics 2014;15(1):50.
43 An Y, Wang J, Li C. Comprehensive assessment and performance improvement of effector protein predictors for bacterial secretion systems III, IV and VI. Brief Bioinform 2016;19(1):148–61.
44 Zou L, Nan C, Hu F. Accurate prediction of bacterial type IV secreted effectors using amino acid composition and PSSM profiles. Bioinformatics 2013;29(24):3135–42.
45 Hobbs CK, Porter VL, Stow ML, et al. Computational approach to predict species-specific type III secretion system (T3SS) effectors using single and multiple genomes. BMC Genomics 2016;17:1048.
46 Sánchez C, Mercante V, Babuin MF, et al. Dual effect of Mesorhizobium loti T3SS functionality on the symbiotic process. FEMS Microbiol Lett 2012;330(2):148–56.
47 Stavrinides J, Ma W, Guttman DS. Terminal reassortment drives the quantum evolution of type III effectors in bacterial pathogens. PLoS Pathog 2006;2(10):e104.
48 Zeng C, Zou L. An account of in silico identification tools of secreted effector proteins in bacteria and future challenges. Brief Bioinform 2017, doi: 10.1093/bib/bbx078.
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please email: email@example.com
Briefings in Bioinformatics – Oxford University Press
Published: Jan 31, 2018