References

Cheng, Y., Oldfield, C., Meng, J., Romero, P., Uversky, V. and Dunker, A. (2007) Mining α-helix-forming molecular recognition features with cross species sequence alignments. Biochemistry, 46, 13468.
Disfani, F., Hsu, W., Mizianty, M., Oldfield, C., Xue, B., Dunker, A., Uversky, V. and Kurgan, L. (2012) MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics, 28, i75.
Dosztányi, Z., Mészáros, B. and Simon, I. (2009) ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics, 25, 2745.
Dyson, J. and Wright, P. (2005) Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol., 6, 197.
Edwards, R., Davey, N. and Shields, D. (2007) SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins. PLoS ONE, 2, e967.
Hamelryck, T. (2005) An amino acid has two sides: a new 2D measure provides a different view of solvent exposure. Proteins Struct. Funct. Bioinf., 59, 38.
Heffernan, R., Paliwal, K., Lyons, J., Dehzangi, A., Sharma, A., Wang, J., Sattar, A., Yang, Y. and Zhou, Y. (2015) Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci. Rep., 5, 11476.
Heffernan, R., Dehzangi, A., Lyons, J., Paliwal, K., Sharma, A., Wang, J., Sattar, A., Zhou, Y. and Yang, Y. (2016) Highly accurate sequence-based prediction of half-sphere exposures of amino acid residues in proteins. Bioinformatics, 32, 843.
Hierlemann, A., Ricco, A., Bodenhoefer, K. and Goepel, W. (1998) Characterization of molecular recognition in gas sensors.
Kavianpour, H. and Vasighi, M. (2017) Structural classification of proteins using texture descriptors extracted from the cellular automata image. Amino Acids, 49, 261.
Lee, R., Buljan, M., Lang, B., Weatheritt, R., Daughdrill, G., Dunker, A., Fuxreiter, M., Gough, J., Gsponer, J., Jones, D., Kim, P., Kriwacki, R., Oldfield, C., Pappu, R., Tompa, P., Uversky, V., Wright, P. and Babu, M. (2014) Classification of intrinsically disordered regions and proteins. Chem. Rev., 114, 6589.
Li, W. and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658.
Liu, J., Perumal, N., Oldfield, C., Su, E., Uversky, V. and Dunker, A. (2006) Intrinsic disorder in transcription factors. Biochemistry, 45, 6873.
Lyons, J., Dehzangi, A., Heffernan, R., Yang, Y., Zhou, Y., Sharma, A. and Paliwal, K. (2015) Advancing the accuracy of protein fold recognition by utilizing profiles from hidden Markov models. IEEE Trans. Nanobiosci., 14, 761.
Malhis, N. et al. (2015) Computational identification of MoRFs in protein sequences using hierarchical application of Bayes rule. PLoS ONE, 10, e0141603.
Malhis, N. and Gsponer, J. (2015) Computational identification of MoRFs in protein sequences. Bioinformatics, 31, 1738.
Malhis, N., Jacobson, M. and Gsponer, J. (2016) MoRFchibi SYSTEM: software tools for the identification of MoRFs in protein sequences. Nucleic Acids Res., 44, W488.
Mohan, A., Oldfield, C., Radivojac, P., Vacic, V., Cortese, M., Dunker, A. and Uversky, V. (2006) Analysis of molecular recognition features (MoRFs). J. Mol. Biol., 362, 1043.
Mousavian, Z., Khakabimamaghani, S., Kavousi, K. and Masoudi-Nejad, A. (2016) Drug-target interaction prediction from PSSM based evolutionary information. J. Pharmacol. Toxicol. Methods, 78, 42.
Oldfield, C., Cheng, Y., Cortese, M., Romero, P., Uversky, V. and Dunker, A. (2005) Coupled folding and binding with α-helix-forming molecular recognition elements. Biochemistry, 44, 12454.
Peng, L., Zhu, W., Liao, B., Duan, Y., Chen, M., Chen, Y. and Yang, J. (2017) Screening drug-target interactions with positive-unlabeled learning. Sci. Rep., 7, 8087.
Sharma, A., Lyons, J., Dehzangi, A. and Paliwal, K. (2013a) A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J. Theor. Biol., 320, 41.
Sharma, A., Paliwal, K., Dehzangi, A., Lyons, J., Imoto, S. and Miyano, S. (2013b) A strategy to select suitable physicochemical attributes of amino acids for protein fold recognition. BMC Bioinformatics, 14, 233.
Sharma, R., Dehzangi, A., Lyons, J., Paliwal, K., Tsunoda, T. and Sharma, A. (2015) Predict Gram-positive and Gram-negative subcellular localization via incorporating evolutionary information and physicochemical features into Chou's general PseAAC. IEEE Trans. Nanobiosci., 14, 915.
Sharma, R., Kumar, S., Tsunoda, T., Patil, A. and Sharma, A. (2016) Predicting MoRFs in protein sequences using HMM profiles. BMC Bioinformatics, 17, S14.
Tompa, P. (2011) Unstructural biology coming of age. Curr. Opin. Struct. Biol., 21, 419.
Uversky, V. (2014) Introduction to intrinsically disordered proteins (IDPs). Chem. Rev., 114, 6557.
Vacic, V., Oldfield, C., Mohan, A., Radivojac, P., Cortese, M., Uversky, V. and Dunker, A. (2007) Characterization of molecular recognition features, MoRFs, and their binding partners. J. Proteome Res., 6, 2351.
Wright, P. and Dyson, H. (2015) Intrinsically disordered proteins in cellular signalling and regulation. Nat. Rev. Mol. Cell Biol., 16, 18.
Xia, J., Peng, Z., Qi, D., Mu, H. and Yang, J. (2017) An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier. Bioinformatics, 33, 863.
Yang, Y., Heffernan, R., Paliwal, K., Lyons, J., Dehzangi, A., Sharma, A., Wang, J., Sattar, A. and Zhou, Y. (2017) SPIDER2: a package to predict secondary structure, accessible surface area, and main-chain torsional angles by deep neural networks. Methods Mol. Biol., 1484, 55.
Motivation: Intrinsically disordered proteins lack stable 3-dimensional structure and play a crucial role in performing various biological functions. Key to their biological function are the molecular recognition features (MoRFs) located within long disordered regions. Computationally identifying these MoRFs from disordered protein sequences is a challenging task. In this study, we present a new MoRF predictor, OPAL, to identify MoRFs in disordered protein sequences. OPAL utilizes two independent sources of information computed using different component predictors. The scores are processed and combined using common averaging method. The first score is computed using a component MoRF predictor which utilizes composition and sequence similarity of MoRF and non-MoRF regions to detect MoRFs. The second score is calculated using half-sphere exposure (HSE), solvent accessible surface area (ASA) and backbone angle information of the disordered protein sequence, using information from the amino acid properties of flanks surrounding the MoRFs to distinguish MoRF and non-MoRF residues. Results: OPAL is evaluated using test sets that were previously used to evaluate MoRF predictors, MoRFpred, MoRFchibi and MoRFchibi-web. The results demonstrate that OPAL outperforms all the available MoRF predictors and is the most accurate predictor available for MoRF prediction. It is available at http://www.alok-ai-lab.com/tools/opal/. Contact: ashwini@hgc.jp or alok.sharma@griffith.edu.au Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction (Lee et al., 2014; Uversky, 2014). Proteins with such regions are Recent progress in computational and experimental methods have known as intrinsically disordered proteins (IDPs) (Dyson and revealed many protein regions lacking stable 3-dimensional struc- Wright, 2005; Tompa, 2011). 
IDPs often execute their function with ture (Dyson and Wright, 2005; Lee et al., 2014; Uversky, 2014; loosely structured short protein regions that bind to a structured Wright and Dyson, 2015). These protein regions perform various partner and undergo a disorder-to-order transition to adopt a biological functions such as cell regulation and signal transduction well-defined conformation (Lee et al., 2014; Mohan et al., 2006; V The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com 1850 OPAL 1851 Vacic et al., 2007). These short regions are known as short linear sequences in TEST464 have sequence identity 30% or more with motifs (SLiMs) and molecular recognition features (MoRFs). SLiMs one other sequence in the set. To address this, we used cd-hit (Li and are short linear sequence motifs that vary in size from 3 to 10 amino Godzik, 2006) to remove the sequences from TEST464 which share acids and are enriched in intrinsically disordered regions (IDRs) 30% or more sequence identity. This resulted in 266 sequences and we called this filtered set TEST266. We used an additional test set, (Edwards et al., 2007). On the other hand, MoRFs are long disor- dered regions that fold upon binding to their partner protein and are EXP53, which was collected and assembled by Malhis et al. (2015). up to 70 amino acids in length (Mohan et al., 2006). MoRFs were There are 53 non-redundant protein sequences in this set containing first introduced as Molecular Recognition Elements (MoREs) MoRF regions that are experimentally verified to be disordered in (Oldfield et al., 2005) and their role in protein–protein interactions isolation. EXP53 set was filtered to have sequences with less than was elucidated (Liu et al., 2006; Mohan et al., 2006; Vacic et al., 30% sequence identity to those in the training set. EXP53 sequences also share less than 30% sequence identity with each other (Malhis 2007). 
The functional importance of MoRFs has led to the development et al., 2015). EXP53 set contains 25 186 residues, of which 2432 of several computational methods and predictors including the are MoRF residues. Since protein sequences in EXP53 set contain very early ones (Cheng et al., 2007; Oldfield et al., 2005), and MoRFs with length greater than 30 residues, MoRFs are further div- more recent efforts such as ANCHOR (Doszta ´ nyi et al., 2009), ided into short MoRFs (up to 30 residues) and long MoRFs (longer MoRFpred (Disfani et al., 2012), MoRFchibi (Malhis and Gsponer, than 30 residues). For the rest of the paper, we refer to short MoRFs 2015), MoRFchibi-light (Malhis et al., 2016) and MoRFchibi-web as EXP53short, long MoRFs as EXP53long and all MoRFs as (Malhis et al., 2015, 2016) and our previous work (Sharma et al., EXP53all. We used TEST266 and EXP53 sets to compare and vali- 2016). date the proposed predictor. In this study, we present OPAL, a predictor for MoRFs of sizes 5–25 residues located within long disordered protein sequences. 2.2 The PROMIS model OPAL is an ensemble of two predictors: MoRFchibi and Prediction In order to distinguish between MoRF and non-MoRF residues, the of MoRFs Incorporating Structure (PROMIS), which is also proposed PROMIS model uses structural information of disordered described in this work. OPAL combines MoRF scores at multiple regions to compute amino acid properties of flanking regions sur- stages using common averaging method. The first score is calculated rounding the MoRFs. The structural information includes attributes using a component MoRF predictor, MoRFchibi. The score is proc- such as HSE (Hamelryck, 2005), ASA and backbone angles of essed and is combined with the score of PROMIS. 
PROMIS is con- amino acids in disordered regions predicted via Spider2 (Heffernan structed using half-sphere exposure (HSE) (Hamelryck, 2005), et al., 2015; Yang et al., 2017), a sequence predictor of local and solvent accessible surface area (ASA) and backbone angle informa- non-local structural features of protein sequences. Two different tion of disordered protein sequences to predict MoRFs. The devel- methods of feature extraction are employed to retrieve meaningful opment of PROMIS offered a significant improvement in prediction features from structural attributes. The first method is based on pro- accuracies when compared with MoRFchibi, MoRFpred and file bigram (Sharma et al., 2013a, where the feature vector is ANCHOR predictors. Overall, the integration of PROMIS with obtained by counting the bigram frequencies from the position spe- MoRFchibi provided better prediction quality for OPAL compared cific scoring matrix (PSSM) representing a protein region. However, with predictors having a similar approach e.g. MoRFchibi-light and in this paper we do not apply PSSM to compute profile bigram, MoRFchibi-web. OPAL is available as an online server at http:// instead we used structural attributes to evaluate bigram features. www.alok-ai-lab.com/tools/opal. The second method is based on the properties of flanks surrounding the MoRF residue. In this method, feature vector is extracted from structural attributes to encode the properties of flanks sur- 2 Materials and methods rounding the query residue. More details on the above two methods 2.1 Benchmark dataset are given later. For the rest of the paper, we refer to the above We used training and test sets that were previously introduced by two methods as BigramMoRF and StructMoRF, respectively (please Disfani et al. (2012) to develop MoRF predictors. These sets were see Supplementary Text S1). 
The feature vectors generated using recently used to train and benchmark predictors MoRFchibi, the above two methods are sent to an SVM model for prediction. MoRFchibi-light and MoRFchibi-web. The training set contains The architecture of the proposed PROMIS predictor is shown in 421 protein sequences with 245 984 residues, of which 5396 are Figure 1. The prediction scores of the SVM model obtained from MoRF residues. The test set contains 419 protein sequences with each of the methods described above are combined using the com- 258 829 residues, of which 5153 are MoRF residues. A second test mon averaging strategy to produce propensity scores of PROMIS. In set, named NEW in Malhis et al.(Malhis and Gsponer, 2015) con- common averaging, the score of all SVM models is added and fur- tains 45 sequences with 37 533 residues, of which 626 are MoRF ther divided by the number of models used. residues. These sets were collected from Protein Data Bank (PDB) To construct PROMIS, we took a similar approach as we took in (Disfani et al., 2012) and peptide regions of 5–25 residues were our previous study (Sharma et al., 2016) to divide each training identified as MoRF regions. All the test sequences share less than sequence into two segments. Using the first segment, we extract pos- 30% sequence identity to sequences in the training set (Disfani itive samples representing MoRFs and using the second segment, we et al., 2012). We use the training set to train OPAL and the first test extract negative samples representing non-MoRFs. Our previous set to evaluate it. We further combine first and the second test sets study used a fixed flank length of 12 amino acids surrounding the into single set, TEST464 as in Malhis et al. (2016) and use it to com- MoRF region to create segments. However, in this study we varied pare the proposed predictor with the state-of-the-art MoRF predic- the flank length from 12 to 25 amino acids to identify the length tors. 
Although TEST464 contains sequences that were used to test that best discriminates MoRF residues from non-MoRF residues. previous MoRF predictors (Disfani et al., 2012; Malhis and Using AUC performance measure on the test data (please see Gsponer, 2015; Malhis et al., 2016), we found that 42% of the Supplementary Table S1), we selected the optimal flank length to be 1852 R.Sharma et al. n n. This matrix B can be represented as a vector form F by reshaping the n n matrix into a vector of length n . F is a bigram of attributes. The use of bigram features has shown promising results in protein fold recognition, protein subcellu- lar localization, structural class prediction, functional analy- sis, drug-interaction and other related problems (Kavianpour and Vasighi, 2017; Lyons et al.,2015; Mousavian et al.,2016; Peng et al., 2017; Sharma et al., 2013a, 2015; Xia et al., 2017). StructMoRF: to represent a protein region, in this method the attribute values are treated as features. i.e. the feature vector for a sample can be interpreted as F ¼[M , s 1;1 M ,.. ., M ; .. . ; M ], where M is an element of structural 2;1 i;j L;n i;j matrix M of size L by n.Asbefore L is the length of a protein region and n is the number of attributes. F is a tensor sum of attributes. Fig. 1. Architecture of the PROMIS predictor. PROMIS is constructed using 2.2.3 SVM model structural attributes of disordered regions. The scores are processed using An SVM classifier with radial basis function (RBF) is used to evalu- common averaging. ate the features generated. Performing a grid search, SVM kernel parameters C and gamma were selected as 1000 and 0.0038, respec- tively (please see Supplementary Text S2). 20. The following subsections outline the structural attributes, fea- ture extraction methods, and training and test of the SVM model. 
2.2.4 Training For training, since there are more non-MoRF residues compared to 2.2.1 Structural attributes MoRF residues in training data, balanced sampling is done by Spider2 (Yang et al., 2017) output is used as a source of feature extracting equal number of positive and negative samples. This ratio extraction. It predicts structural information about the protein sequences which includes: is further increased to 1:2, i.e. two non-MoRF samples for each MoRF sample, to obtain higher AUCs in detecting MoRF residues Secondary structure (SS): contains structural description of each (please see Supplementary Table S2 for details). For BigramMoRF amino acid residue in a number of discrete states, such as helix, method, samples are chosen to represent a region of MoRF residues sheet and coil. with a flank of 20 amino acids upstream and downstream of the Accessible surface area (ASA): measures the exposure level of selected region. On the other hand, for StructMoRF method, sam- amino acid residue to solvent in a protein region and is a one– ples are chosen to represent a MoRF residue with a flank of 20 dimensional structural property. amino acids upstream and downstream of the selected residue. The Backbone angles: includes backbone dihedral angles of feature vector is computed from the sample and is used for training amino acids in protein region. We consider the Phi, Psi, Theta (h) the model. The detailed procedure of extracting positive and nega- and Tau (s) angles. h is the angle between Ca atoms tive samples from training data is illustrated in Supplementary (Ca – Ca – Ca Þ and s is the dihedral angle rotated about i1 i iþ1 Material (please see Supplementary Text S1). the Ca Ca bond. i iþ1 Half-sphere exposure (HSE): is an alternative measure of the sol- 2.2.5 Testing vent exposure of a residue and has been shown to perform better To score a query sequence, we take a query sample from the query than ASA (Heffernan et al., 2016). 
It gives the number of C alpha sequence to represent each residue. The feature vector is extracted atoms in the upper and lower spheres defined for each residue. from the sample and is used for scoring. The detailed explanation We use two measures specifying the HSE alpha and beta (HSEu and illustration of scoring a query sequence is demonstrated in and HSEd) along with the contact number for each residue. Supplementary Material (please see Supplementary Text S1). 2.2.2 Feature extraction 2.3 Score calculation To extract feature vectors from structural attributes, the following In order to validate that the predicted residue is a part of a binding feature extraction methods are considered: region, we process each predicted score using its neighboring residue BigramMoRF: in this method, we computed bigram features scores. Taking the score of query residue at ith location and its by utilizing structural attributes. The bigram features from neighboring residue scores of size z on both sides, we compute the kth attribute to lth attribute (in a protein sequence) is computed processed propensity score for each residue as follows: as follows: Processed score ¼ðmaxðÞ scores þ medianðscores ÞÞ=2 (2) i x x L1 B ¼ M MðÞ 1 k n and 1 l n (1) k;l i;k iþ1;l where i ¼ 1, 2,.. ., L, L is the length of query protein sequence and x i¼1 varies from i z to i þ z. where M is the element of structural matrix M of size L by n, L i;k 2.4 Combined model (OPAL) is the length of a protein region and n is the number of structural attributes. Computing the bigram frequencies B for k ¼ 1; 2; We applied common averaging technique to combine the proposed k;l .. . n and l ¼ 1; 2; .. . ; n would give a bigram matrix B of size PROMIS predictor with component MoRF predictor, MoRFchibi OPAL 1853 Fig. 3. OPAL online server homepage, available at http://www.alok-ai-lab.com/ Fig. 2. Overview of OPAL predictor. 
OPAL is constructed using processed tools/opal/ MoRFchibi component predictor and PROMIS predictor proposed in this study. The scores are processed using common averaging. To use MoRFchibi as a component predictor, we downloaded MoRFchibi and interfaced it with PROMIS job to OPAL online server. Once the job is processed, the result can be downloaded using the job ID assigned to the submission. It takes an average processing time of 15 to 20 min to process a job. If into a single model called OPAL. To predict MoRFs, MoRFchibi the user provides an email address with the job submission, then targets similarity, composition and contrast information of MoRF notification is sent to the email once the job is processed. A screen- and non-MoRF regions using physicochemical properties of amino acids. MoRFchibi utilizes two SVM models with two different ker- shot of the output is shown in Supplementary Figure (please see nel functions (sigmoid and RBF). The first kernel (sigmoid) is used Supplementary Fig. S1). to distinguish sequence similarity of query regions to that of training regions and the second kernel (RBF) is used to extract composition 3 Results and contrast information between the MoRF region and its sur- rounding regions. In this framework, MoRFchibi is used as one of OPAL is trained and tested using the same data that was used to the components of OPAL and its propensity scores are processed train and evaluate the predictors, MoRFpred, MoRFchibi and and combined with PROMIS. The details of score processing is MoRFchibi-web. These datasets are described in detail in Disfani described in subsection 2.3. The overview of the combined predic- et al. (2012) and Malhis et al. (2015). We use test sets to evaluate tor, OPAL is shown in Figure 2. the proposed predictor and the set EXP53 to validate that the per- formance improvement is not the result of over fitting. 
In addition, since all the mentioned and the proposed predictors are trained to 2.5 Performance measure predict MoRFs of sizes 5–25 residues while EXP53 set contains To evaluate OPAL, we used AUC performance measure. AUC is the sequences with MoRFs of length greater than 30 residues, we show area under the receiver operating characteristics (ROC) curve and is performance of EXP53 as EXP53all, EXP53long and EXP53short, commonly used to evaluate a predictor to see how well it separates where EXP53all contains all the MoRFs from 53 sequences, two classes of information, i.e. MoRF and non-MoRF residues. We EXP53long contains MoRFs that are greater than 30 residues in size also report precision, F-measure and false positive rate (FPR) for dif- ferent values of true positive rate (TPR), since we are interested in and EXP53short contains MoRFs that are up to 30 residues in size. predicting MoRFs at a high threshold probability which is near the lower left corner of the AUC curve. TPR is defined as TP/N and 3.1 Attribute and model selection MoRF FPR is defined as FP/N , where TP is the number of cor- nonMoRF Evaluating the test set, we select important structural attributes and rectly classified MoRF residues, FP is the number of incorrectly pre- models to identify MoRFs in disordered protein sequence. We use dicted MoRF residues, N is the total number of MoRF residues MoRF successive feature selection scheme in the forward direction (Sharma and N is the total number of non-MoRF residues. To et al., 2013b) to rank each structural attribute according to its con- nonMoRF report the processing speed of the predictor, we noted the processing tribution towards successfully predicting MoRFs. For BigramMoRF time of the predictor to score a protein sequence and used it to com- method, HSEa attributes were ranked highest amongst structural pute the number of residues it predicts in one minute i.e. 
residues/ attributes, h attribute was ranked highest amongst the dihedral minute (r/m). angles and was ranked second overall. Moreover, ASA attribute performed average and s angle attribute was ranked lowest. For 2.6 OPAL online server StructMoRF method, HSEu attribute from HSEa group was ranked OPAL is available as an online server at http://www.alok-ai-lab. highest and gave good prediction accuracies in first stage of com/tools/opal/. The details of using OPAL online server are as fol- selection, however, in the second stage its combination with other lows: Opal accepts input as a single protein sequence of length attributes deteriorated the accuracies. Thus, to obtain average per- formance, we construct three SVM models, MoRFbi-1, MoRFbi-2 greater than 26 amino acids. A screenshot of the top page of OPAL online server is shown in Figure 3. To use OPAL, users need to enter and MoRFwin as shown in Figure 1. MoRFbi-1 and MoRFbi-2 are a protein sequence and email address (optional) before submitting a constructed using BigramMoRF method and MoRFwin is 1854 R.Sharma et al. constructed using StructMoRF method. For feature extraction sampling ratio 1:1 (for more details on model selection please see MoRFbi-1 uses attributes HSEa and ASA; MoRFbi-2 uses attribute Supplementary Table S2). Thus, we select best performing models h from dihedral angles; and, MoRFwin uses attribute HSEu. and combine their scores using common averaging method. The Furthermore, for scoring a query sequence, the window size for selected and combined models are shown in Table 2. extracting a sample was set as 70 and 41 for BigramMoRF and To validate that the predicted scores of MoRFs form part of the StructMoRF methods, respectively. We selected these sizes, because binding region, we use equation (2) to process each predicted score AUCs computed were highest with these sizes compared with other by varying parameter z (parameter z refers to the size of neighboring window sizes for each method. 
residue scores). Figure 4 shows the AUCs for different values of z for Table 1 shows the AUCs for models trained with training sam- each of the model. It is observed that models MoRFbi-1, and pling ratio 1:1 and 1:2. First model performed well with sampling MoRFbi-2 obtain optimal results at z¼20, whereas MoRFwin ratio of 1:2, whereas second and third models gave good AUCs with obtains optimal results at z ¼12. Furthermore, since MoRFchibi is used in our proposed combined model OPAL, we also processed MoRFchibi scores and found that it obtains optimal results at z¼ 4 Table 1. AUCs for models with sampling ratio 1:1 and 1:2 using as observed in Figure 4. Thus, to develop our final predictors test set PROMIS and OPAL, we process the mentioned model scores at Sampling ratio 1:1 Sampling ratio 1:2 specified z parameters giving the best results. Models AUC AUC 3.2 Comparison with state-of-the-art predictors 1 MoRFbi-1 0.734 0.760 To compare the performance of the proposed predictor with avail- 2 MoRFbi-2 0.689 0.652 3 MoRFwin 0.769 0.738 able state-of-the-art MoRF predictors, we use datasets TEST464, TEST266 and EXP53 to report the AUCs. We show performance Note: Bold numbers indicate best performance measure. of PROMIS and OPAL, in Table 3 and Table 4, respectively. In Table 3 PROMIS is compared with predictors ANCHOR, Table 2. AUCs for selected and combined models using test set MoRFpred and MoRFchibi. These predictors are developed using similar approaches, whereas in Table 4 we compare MoRF predic- Models Sampling ratio Test AUC tors which are constructed using many other component predictors 1 MoRFbi-1 1:2 0.760 and their scores are combined at several stages to produce the final 2 MoRFbi-2 1:1 0.689 MoRF propensity scores. These predictors are MoRFchibi-light, 3 MoRFwin 1:1 0.769 MoRFchibi-web and OPAL. Combined PROMIS 0.791 PROMIS achieves significant improvements in predicting MoRFs. 
Compared with MoRFchibi, it provided 4.7% increase in Note: Combined PROMIS model with significant improvement in AUCs compared to individual selected models. AUCs for TEST464 dataset, 10.6% increase in AUCs for EXP53all, Fig. 4. Processed AUCs for each model. The size of parameter z in equation (2) is varied from 1 to 20 and suitable size is selected for each model. a) All models AUCs are shown. b) MoRFbi-1 AUCs. c) MoRFbi-2 AUCs. d) MoRFwin AUCs. e) MoRFchibi AUCs. MoRFbi-1 and MoRFbi-2 performed well at z¼20, MoRFwin per- formed well at z¼12 and MoRFchibi performed well at z¼4 OPAL 1855 13.6% increase in AUCs for EXP53long and 3.3% increase in AUCs given true positive rate (TPR) (please see Supplementary Table S3). for EXP53short datasets, respectively. All the mentioned MoRF predictors in this study were designed Incorporating a number of component predictors is thought to to predict MoRFs up to size of 25 residues, therefore, it was increase the performance; this is observed in Table 4. All the com- important to test their performance in scoring longer MoRFs. bined predictors perform well in comparison with individual predic- Thus, PROMIS and OPAL have shown significant increase in tors observed in Table 3. Compared with MoRFchibi-light and accuracies for predicting these MoRFs. For comparison, we also MoRFchibi-web, OPAL obtained 3.9 and 1.1% increase in AUCs computed precision and F-measure for different values of TPR as for TEST464 dataset, 3.9 and 3.7% increase in AUCs for EXP53all, observed in Table 5. 6.5 and 5.3% increase in AUCs for EXP53long and performed very similar to MoRFchibi-light and MoRFchibi-web for EXP53short datasets, respectively. The AUC curves generated using each of the 3.3 Processing speed dataset is shown in Figure 5. Moreover, it is observed that the pro- MoRF predictors are used to score large sets of proteins; therefore, posed predictor achieves much lower false positive rate (FPR) at any it is necessary to test its efficiency. 
We compare and report the prediction speed of each predictor. MoRFchibi-light, MoRFchibi and ANCHOR do not require multiple sequence alignments (MSA), so we tested these predictors on the entire TEST set using an i5, 3.5 GHz computer. On the other hand, MoRFchibi-web and OPAL require MSA; therefore, we tested both of these predictors using a single sequence from the test set (UniProt: Q38087) with 903 residues. MoRFpred is not downloadable, so it was tested on its prediction server with the same single sequence (UniProt: Q38087). The processing speed of each predictor and its AUCs are summarized in Table 5. ANCHOR, MoRFchibi and MoRFchibi-light do not require the generation of evolutionary profiles and were therefore the fastest, with speeds of 3.9 × 10e+6 residues per minute (r/m), 10.5 × 10e+3 r/m and 9.9 × 10e+3 r/m, respectively. Among the remaining predictors, OPAL reached 215 r/m, whereas MoRFchibi-web reached 80 r/m and MoRFpred was the slowest with 48 r/m. Additionally, processing a single sequence using MoRFchibi-web on its prediction server gave a speed of 588 r/m; however, the server hardware configuration is unknown. The comparison might not be entirely fair, since the processors of the prediction servers are unknown for some predictors and some predictors require the generation of evolutionary profiles.

Table 3. AUCs for predictors of similar approach

Predictor/methods   TEST464   TEST266   EXP53all   EXP53long   EXP53short
ANCHOR              0.605     0.599     0.615      0.586       0.683
MoRFpred            0.675     0.651     0.620      0.598       0.673
MoRFchibi           0.743     0.709     0.712      0.679       0.790
PROMIS              0.790     0.770     0.818      0.815       0.823

Note: PROMIS (in bold in the original) shows significant improvement in prediction accuracies compared to MoRFchibi, MoRFpred and ANCHOR.

Table 4. AUCs for combined component MoRF predictors

Predictor/methods   TEST464   TEST266   EXP53all   EXP53long   EXP53short
MoRFchibi-light     0.777     0.762     0.799      0.770       0.869
MoRFchibi-web       0.805     0.785     0.797      0.758       0.886
OPAL                0.816     0.795     0.836      0.823       0.870

Note: OPAL (in bold in the original) shows overall improvement in prediction accuracies compared to MoRFchibi-light and MoRFchibi-web.

Fig. 5. AUC curves generated for each of the datasets EXP53all, EXP53long, EXP53short and TEST464. Curves a, b, c and d show OPAL and PROMIS compared with MoRFchibi, MoRFpred and ANCHOR, respectively. Curves e, f, g and h show OPAL and PROMIS compared with MoRFchibi-web and MoRFchibi-light, respectively

Table 5. Overall comparison of results

Predictors        Precision      F-measure      AUC            Residues per minute   Residues per minute   Multiple sequence   Combined component
                                                               (workstation)         (webserver)           alignments          predictors
ANCHOR            0.156, 0.134   0.201, 0.212   0.605, 0.615   3.9 × 10e+6           –
MoRFchibi         0.334, 0.210   0.316, 0.296   0.743, 0.712   10.5 × 10e+3          –
MoRFpred          0.181, 0.147   0.226, 0.228   0.675, 0.620   –                     48
PROMIS            0.363, 0.332   0.329, 0.400   0.790, 0.818   220                   –
MoRFchibi-light   0.431, 0.324   0.354, 0.392   0.777, 0.799   9.9 × 10e+3           –
MoRFchibi-web     0.495, 0.332   0.373, 0.399   0.805, 0.797   80                    588
OPAL              0.530, 0.386   0.384, 0.436   0.816, 0.836   215                   –

Note: Precision and F-measure are given for TPR values of 0.3 and 0.5, respectively, for the EXP53all set, and AUC is given for the TEST464 and EXP53all sets, respectively.

Table 6. FPR as a function of TPR for validating OPAL using EXP53all set

TPR             0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9
MoRFchibi-web   0.016   0.033   0.061   0.107   0.166   0.261   0.382   0.539
OPAL            0.015   0.029   0.056   0.085   0.128   0.193   0.286   0.437

Note: FPR for varying TPR values of OPAL compared with MoRFchibi-web. OPAL outperforms MoRFchibi-web at any given TPR.

4 Discussion
We present OPAL, a new sequence-based predictor for MoRFs in IDRs. OPAL is developed using the processed scores of the component MoRFchibi predictor and the scores of the proposed PROMIS predictor. We compared its performance with the predictors ANCHOR, MoRFpred, MoRFchibi, MoRFchibi-light and MoRFchibi-web. MoRFchibi-light and MoRFchibi-web are recently published predictors that are constructed by combining several other disorder and MoRF component predictors. On the other hand, ANCHOR, MoRFpred and MoRFchibi are similar to PROMIS. Therefore, we first compared PROMIS with ANCHOR, MoRFpred and MoRFchibi, and then compared OPAL with MoRFchibi-light and MoRFchibi-web. Using the test sets TEST464, TEST266 and EXP53, the results demonstrate that PROMIS outperforms ANCHOR, MoRFpred and MoRFchibi, with significant improvement in AUCs. Furthermore, by combining component MoRF predictors, as MoRFchibi-light and MoRFchibi-web do, OPAL demonstrated improvement in performance and outperformed the benchmarked MoRFchibi-web predictor.

The higher prediction accuracy of OPAL is the result of combining the predictors PROMIS and MoRFchibi. PROMIS uses structural information of disordered regions for prediction. In the results, PROMIS alone provided an AUC of 0.790 for the TEST464 dataset, whereas MoRFchibi provided an AUC of only 0.743. Using structural features for predicting MoRFs provided enough discriminative information to differentiate MoRFs from their surrounding regions. Compared with the physicochemical features of MoRFchibi, they perform well along with the solvent exposure level of amino acids contained in disordered regions. Furthermore, combining PROMIS with MoRFchibi outperformed all the predictors across the test sets. These predictors use different sources of features with different learning algorithms; thus, combining them utilizes the complementary information provided by each, which results in performance improvement. Using the validation dataset EXP53all with 53 protein sequences, where MoRF regions are experimentally verified to be disordered in isolation, OPAL showed a 3.9% performance improvement over MoRFchibi-web and provided a lower FPR at any given TPR, as shown in Table 6.

The additional improvement for the proposed predictor is the outcome of processing the propensity scores at each stage. Varying and selecting the best size of parameter z in Equation (2) showed an improvement of 0.3% in the MoRFbi-1 model, 1.8% in the MoRFbi-2 model, 2.6% in the MoRFwin model and 0.4% in the MoRFchibi model.

To predict residues in protein sequences as MoRFs or non-MoRFs, predictors are supposed to be consistent over the entire length of the protein sequence. However, if the query samples taken for the regions are very similar to the training samples, the predictor will over-score and produce biased predictions. To overcome this bias, OPAL implements and combines several approaches, such as using two independent sources of information, using two different feature extraction methods, selecting the best sampling ratios between MoRF and non-MoRF samples during training, and excluding non-MoRF residues neighboring MoRF regions from the negative samples. Moreover, using common averaging to combine different models and component predictors built with different machine learning approaches makes OPAL less likely to produce biased scores. To show the importance of combining PROMIS with MoRFchibi, we plotted the propensity scores of protein P42768 from the EXP53 set. This protein contains two verified MoRF regions. Figure 6 shows the propensity scores for the models OPAL, PROMIS and MoRFchibi. MoRFchibi obtains high scores at the verified MoRF regions; however, it also gives high scores where MoRFs do not exist, i.e. between residues 75 and 140. On the other hand, PROMIS keeps the scores low between residues 75 and 140 and provides above-average scores at the verified MoRF regions; therefore, combining PROMIS with MoRFchibi corrects the scores to produce higher scores where MoRFs exist.

In summary, we have proposed a new MoRF predictor named OPAL using structural information of disordered regions and physicochemical properties of amino acids. Overall, OPAL is the most accurate MoRF predictor available today and it has outclassed the state-of-the-art predictors ANCHOR, MoRFpred, MoRFchibi and MoRFchibi-web.

Fig. 6. Propensity scores for the human Wiskott-Aldrich syndrome protein (P42768) as given by OPAL, PROMIS and MoRFchibi. The two experimentally verified MoRFs are marked on the X axis

Acknowledgements
We would like to acknowledge the authors of the MoRFpred predictor (Disfani et al., 2012) and the MoRFchibi predictor (Malhis et al., 2015) for making the predictors, and the train and test sequence data for MoRF prediction, publicly available.

Funding
This work was supported by CREST, JST, Yokohama 230–0045, Japan; RIKEN, Center for Integrative Medical Sciences, Japan (T. T.) and Japan Agency for Medical Research and Development (Grant number: 16cm0106320h0001) (A. P.).

Conflict of Interest: none declared.

References
Cheng,Y. et al. (2007) Mining alpha-helix-forming molecular recognition features with cross species sequence alignments. Biochemistry, 46, 13468–13477.
Disfani,F.M. et al. (2012) MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics, 28, i75–i83.
Dosztányi,Z. et al. (2009) ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics, 25, 2745–2746.
Dyson,H.J. and Wright,P.E. (2005) Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol., 6, 197–208.
Edwards,R.J. et al. (2007) SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins. PLoS ONE, 2, e967.
Hamelryck,T. (2005) An amino acid has two sides: a new 2D measure provides a different view of solvent exposure. Proteins Struct. Funct. Bioinf., 59, 38–48.
Heffernan,R. et al. (2016) Highly accurate sequence-based prediction of half-sphere exposures of amino acid residues in proteins. Bioinformatics, 32, 843–849.
Heffernan,R. et al. (2015) Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci. Rep., 5, 11476.
Kavianpour,H. and Vasighi,M. (2017) Structural classification of proteins using texture descriptors extracted from the cellular automata image. Amino Acids, 49, 261–271.
Lee,R.V.D. et al. (2014) Classification of intrinsically disordered regions and proteins. Chem. Rev., 114, 6589–6631.
Li,W. and Godzik,A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
Liu,J. et al. (2006) Intrinsic disorder in transcription factors. Biochemistry, 45, 6873–6888.
Lyons,J. et al. (2015) Advancing the accuracy of protein fold recognition by utilizing profiles from Hidden Markov Models. IEEE Trans. Nanobiosci., 14, 761–772.
Malhis,N. and Gsponer,J. (2015) Computational identification of MoRFs in protein sequences. Bioinformatics, 31, 1738–1744.
Malhis,N. et al. (2016) MoRFchibi SYSTEM: software tools for the identification of MoRFs in protein sequences. Nucleic Acids Res., 44, W488–W493.
Malhis,N. et al. (2015) Computational identification of MoRFs in protein sequences using hierarchical application of Bayes Rule. PLoS ONE, 10, e0141603.
Mohan,A. et al. (2006) Analysis of Molecular Recognition Features (MoRFs). J. Mol. Biol., 362, 1043–1059.
Mousavian,Z. et al. (2016) Drug–target interaction prediction from PSSM based evolutionary information. J. Pharmacol. Toxicol. Methods, 78, 42–51.
Oldfield,C.J. et al. (2005) Coupled folding and binding with α-helix-forming molecular recognition elements. Biochemistry, 44, 12454–12470.
Peng,L. et al. (2017) Screening drug-target interactions with positive-unlabeled learning. Sci. Rep., 7, 8087.
Sharma,A. et al. (2013a) A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J. Theor. Biol., 320, 41–46.
Sharma,A. et al. (2013b) A strategy to select suitable physicochemical attributes of amino acids for protein fold recognition. BMC Bioinformatics, 14, 233–211.
Sharma,R. et al. (2015) Predict Gram-positive and Gram-negative subcellular localization via incorporating evolutionary information and physicochemical features into Chou’s general PseAAC. IEEE Trans. Nanobiosci., 14, 915–926.
Sharma,R. et al. (2016) Predicting MoRFs in protein sequences using HMM profiles. BMC Bioinformatics, 17, S14.
Tompa,P. (2011) Unstructural biology coming of age. Curr. Opin. Struct. Biol., 21, 419–425.
Uversky,V. (2014) Introduction to Intrinsically Disordered Proteins (IDPs). Chem. Rev., 114, 6557–6560.
Vacic,V. et al. (2007) Characterization of molecular recognition features, MoRFs, and their binding partners. J. Proteome Res., 6, 2351–2366.
Wright,P.E. and Dyson,H.J. (2015) Intrinsically disordered proteins in cellular signalling and regulation. Nat. Rev. Mol. Cell Biol., 16, 18–29.
Xia,J. et al. (2017) An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier. Bioinformatics, 33, 863–870.
Yang,Y. et al. (2017) SPIDER2: a package to predict secondary structure, accessible surface area and main-chain torsional angles by deep neural networks. Methods Mol. Biol., 1484, 55–63.
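The threshold-based measures reported in Tables 5 and 6 (precision and F-measure at a fixed TPR, FPR as a function of TPR, and AUC) can be illustrated with a minimal sketch. It assumes per-residue binary labels (1 = MoRF residue) and per-residue propensity scores. The function names, the toy data and the rule of taking the first threshold whose TPR reaches the target are assumptions made for illustration, not the procedure used to produce the reported values.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR, precision) at each distinct score threshold,
    from the highest threshold to the lowest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp, i = [], 0, 0, 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all residues sharing this score so ties move together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos, tp / (tp + fp)))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve, starting from (0, 0)."""
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for fpr, tpr, _ in points:
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    return area


def at_tpr(points, target_tpr):
    """FPR, precision and F-measure at the first threshold whose TPR
    reaches target_tpr (target_tpr must be greater than 0)."""
    for fpr, tpr, precision in points:
        if tpr >= target_tpr:
            f_measure = 2 * precision * tpr / (precision + tpr)
            return fpr, precision, f_measure
    raise ValueError("target TPR not reachable")


# Hypothetical toy data: 4 MoRF residues and 6 non-MoRF residues.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1]

points = roc_points(labels, scores)
fpr, precision, f_measure = at_tpr(points, 0.5)
print(f"AUC = {auc(points):.3f}")
print(f"TPR >= 0.5: FPR = {fpr:.3f}, precision = {precision:.3f}, F = {f_measure:.3f}")
```

Here the F-measure is taken as the harmonic mean of precision and TPR. As a check of that definition against Table 5, a precision of 0.530 at TPR 0.3 gives 2 × 0.530 × 0.3 / (0.530 + 0.3) ≈ 0.383, in line with the 0.384 listed for OPAL.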
Bioinformatics – Oxford University Press
Published: Jan 18, 2018